本篇博文主要内容为 2025-05-22 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。
友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱地址。
目录
概览 (2025-05-22)
今日共更新686篇论文,其中:
- 自然语言处理共185篇(Computation and Language (cs.CL))
- 人工智能共203篇(Artificial Intelligence (cs.AI))
- 计算机视觉共142篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共238篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Learning to Reason via Mixture-of-Thought for Logical Reasoning
【速读】: 该论文试图解决现有基于大语言模型(Large Language Model, LLM)的方法在训练过程中仅依赖单一推理模态(通常为自然语言)导致的模态间协同不足问题。其解决方案的关键在于提出一种多模态推理框架——Mixture-of-Thought (MoT),该框架通过引入三种互补的推理模态(自然语言、代码和新提出的真值表模态)实现跨模态推理,并采用两阶段设计:自演进的MoT训练阶段联合学习多模态过滤后的自我生成推理过程,以及利用三模态协同优势的MoT推理阶段,从而显著提升逻辑推理任务的性能。
链接: https://arxiv.org/abs/2505.15817
作者: Tong Zheng,Lichang Chen,Simeng Han,R. Thomas McCoy,Heng Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 38 pages
点击查看摘要
Abstract:Human beings naturally utilize multiple reasoning modalities to learn and solve logical problems, i.e., different representational formats such as natural language, code, and symbolic logic. In contrast, most existing LLM-based approaches operate with a single reasoning modality during training, typically natural language. Although some methods explored modality selection or augmentation at inference time, the training process remains modality-blind, limiting synergy among modalities. To fill in this gap, we propose Mixture-of-Thought (MoT), a framework that enables LLMs to reason across three complementary modalities: natural language, code, and a newly introduced symbolic modality, truth-table, which systematically enumerates logical cases and partially mitigates key failure modes in natural language reasoning. MoT adopts a two-phase design: (1) self-evolving MoT training, which jointly learns from filtered, self-generated rationales across modalities; and (2) MoT inference, which fully leverages the synergy of three modalities to produce better predictions. Experiments on logical reasoning benchmarks including FOLIO and ProofWriter demonstrate that our MoT framework consistently and significantly outperforms strong LLM baselines with single-modality chain-of-thought approaches, achieving up to +11.7pp average accuracy gain. Further analyses show that our MoT framework benefits both training and inference stages; that it is particularly effective on harder logical reasoning problems; and that different modalities contribute complementary strengths, with truth-table reasoning helping to overcome key bottlenecks in natural language inference.
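顺着摘要的思路,下面给出 MoT 推理阶段“三模态投票”的一个极简 Python 草图(非论文官方实现,generate_fn 等接口与平票处理方式均为假设):

```python
from collections import Counter

def mot_inference(question, generate_fn, modalities=("natural_language", "code", "truth_table")):
    """Mixture-of-Thought 推理阶段示意:每种推理模态各生成一个最终答案,再做多数投票。"""
    answers = [generate_fn(question, m) for m in modalities]
    best, count = Counter(answers).most_common(1)[0]
    # 三种模态答案互不相同时,退回自然语言模态的答案(此回退策略为假设)
    return best if count > 1 else answers[0]

if __name__ == "__main__":
    stub = lambda q, m: "True" if m != "code" else "False"     # 桩:代替真实 LLM 调用
    print(mot_inference("Is the conclusion entailed?", stub))  # -> True
```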
zh
[NLP-1] GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
【速读】: 该论文旨在解决当前图形用户界面(GUI)代理在训练过程中因盲目应用通用强化学习(RL)方法而导致的输入设计、输出评估和策略更新三个关键组件中的挑战。其解决方案的关键在于:首先采用快速思维模板以鼓励直接答案生成,减少训练过程中的过度推理;其次在奖励函数中引入框大小约束以缓解奖励劫持问题;最后通过调整长度归一化并添加难度感知缩放因子来优化RL目标,从而提升对困难样本的优化效果。
链接: https://arxiv.org/abs/2505.15810
作者: Yuqi Zhou,Sunhao Dai,Shuai Wang,Kaiwen Zhou,Qinglin Jia,Jun Xu
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update-each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI agent grounding. The project repository is available at this https URL.
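针对摘要中提到的“框面积奖励劫持”问题,下面是带框面积约束的命中奖励的一个示意实现(阈值形式与超参均为本文假设,论文中的具体奖励设计可能不同):

```python
def grounding_reward(pred_box, target_point, max_area_ratio=0.05, screen_area=1.0):
    """GUI grounding 奖励示意:命中目标点得 1 分,
    但对面积超过阈值的预测框不给奖励,避免模型通过放大框来“奖励劫持”。
    pred_box: (x1, y1, x2, y2),归一化坐标;max_area_ratio 为假设的面积上限超参。
    """
    x1, y1, x2, y2 = pred_box
    tx, ty = target_point
    hit = (x1 <= tx <= x2) and (y1 <= ty <= y2)
    area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if not hit:
        return 0.0
    # 框面积约束:命中但框过大时判为无效命中(具体形式仅为示意)
    return 1.0 if area / screen_area <= max_area_ratio else 0.0
```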
zh
[NLP-2] The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation
【速读】: 该论文试图解决大型语言模型在上下文学习中如何利用外部知识进行问答的机制不明确的问题(in-context retrieval augmentation for question answering)。其解决方案的关键在于提出一种基于归因的方法,以识别专门化的注意力头,揭示能够理解指令并检索相关上下文信息的上下文注意力头,以及存储实体关系知识的参数化注意力头,并通过提取功能向量和调整注意力权重来分析它们对答案生成过程的影响。
链接: https://arxiv.org/abs/2505.15807
作者: Patrick Kahardipraja,Reduan Achtibat,Thomas Wiegand,Wojciech Samek,Sebastian Lapuschkin
机构: Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute (人工智能系,弗劳恩霍夫海因里希·赫兹研究所); Department of Electrical Engineering and Computer Science, Technische Universität Berlin (电气工程与计算机科学系,柏林工业大学); BIFOLD - Berlin Institute for the Foundations of Learning and Data (BIFOLD - 柏林学习与数据基础研究所); Centre of eXplainable Artificial Intelligence, Technological University Dublin (可解释人工智能中心,都柏林理工学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: work in progress
点击查看摘要
Abstract:Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities’ relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.
zh
[NLP-3] Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在敏感领域应用中,如何确保其遵循用户定义的上下文安全策略,特别是信息非披露问题。现有研究多关注通用安全性和社会敏感数据,但缺乏针对上下文安全保护的大规模基准测试。为此,作者提出了一种新的大规模基准数据集CoPriva,用于评估LLMs在问答任务中对上下文非披露政策的遵守情况。该数据集包含真实场景下的明确策略和直接及间接攻击性查询,实验结果显示,许多模型在面对间接攻击时存在严重漏洞,无法有效防止敏感信息泄露。关键解决方案在于识别模型在生成过程中难以整合政策约束的问题,并指出通过显式提示可部分提升其输出修正能力,从而揭示了当前LLM安全性对齐在敏感应用中的重大缺陷。
链接: https://arxiv.org/abs/2505.15805
作者: Hwan Chang,Yumin Kim,Yonghyun Jun,Hwanhee Lee
机构: Chung-Ang University (中央大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical-especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.
zh
[NLP-4] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
【速读】: 该论文试图解决现有奖励基准无法评估基于参考的奖励系统的问题,从而限制了研究人员对强化学习(RL)中验证器准确性的理解。解决方案的关键在于引入两个基准测试——VerifyBench 和 VerifyBench-Hard,这些基准通过精心的数据收集与整理以及人工标注构建,旨在全面评估基于参考的奖励系统的性能。
链接: https://arxiv.org/abs/2505.15801
作者: Yuchen Yan,Jin Jiang,Zhenbang Ren,Yijun Li,Xudong Cai,Yang Liu,Xin Xu,Mengdi Zhang,Jian Shao,Yongliang Shen,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学); Meituan Group (美团集团); Peking University (北京大学); University of Electronic Science and Technology of China (电子科技大学); Beijing University of Posts and Telecommunications (北京邮电大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Dataset: this https URL
点击查看摘要
Abstract:Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
zh
[NLP-5] Reverse Engineering Human Preferences with Reinforcement Learning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在作为评估者(LLM-as-a-judge)框架下存在的可被恶意利用的问题,即生成式AI(Generative AI)的输出可以被调整以过度拟合评估者的偏好。论文提出的解决方案的关键在于利用评估者LLM提供的信号作为奖励,通过对抗性调优生成文本前缀的模型,从而提升下游任务的性能,而该方法在不直接干预模型输出的情况下具有高度隐蔽性。
链接: https://arxiv.org/abs/2505.15795
作者: Lisa Alazraki,Tan Yi-Chern,Jon Ander Campos,Maximilian Mozes,Marek Rei,Max Bartolo
机构: Imperial College London (帝国理工学院); Cohere (Cohere)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework–known as LLM-as-a-judge–is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model’s response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models that are not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively, by pipelining LLMs to optimise upstream preambles via reinforcement learning–an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.
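下面用几个桩函数勾勒“以 judge-LLM 打分为奖励、对抗式调优前缀生成器”的单步流程(仅为对摘要的示意性还原,所有接口均为假设):

```python
def preamble_rl_step(question, preamble_policy, frozen_llm, judge, update_fn):
    """单步示意:前缀模型生成 preamble,冻结的候选 LLM 在其引导下作答,
    judge-LLM 的打分作为奖励回传给前缀模型。"""
    preamble = preamble_policy(question)
    answer = frozen_llm(preamble + "\n" + question)
    reward = judge(question, answer)      # judge-LLM 给出的偏好分数
    update_fn(preamble_policy, reward)    # 真实实现中为策略梯度(RL)更新,此处留空
    return reward

if __name__ == "__main__":
    policy = lambda q: "You are a meticulous, helpful expert."
    llm = lambda p: "An answer conditioned on: " + p[:30]
    judge = lambda q, a: len(a) / 100.0   # 桩:以回答长度冒充偏好分
    print(preamble_rl_step("What is RL?", policy, llm, judge, lambda pol, r: None))
```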
zh
[NLP-6] Long-Form Information Alignment Evaluation Beyond Atomic Facts
【速读】: 该论文试图解决生成式 AI (Generative AI) 在自然语言生成 (NLG) 评估任务中,由于忽视事实间依赖关系而导致的评估不准确问题,从而减少幻觉现象并提升用户信任度。解决方案的关键在于提出 DoveScore,这是一种新型框架,通过联合验证事实准确性与事件顺序一致性,建模事实之间的相互关系,从而实现更鲁棒的细粒度评估。
链接: https://arxiv.org/abs/2505.15792
作者: Danna Zheng,Mirella Lapata,Jeff Z. Pan
机构: School of Informatics, University of Edinburgh(信息学院,爱丁堡大学); Huawei Edinburgh Research Centre(华为爱丁堡研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by “montaging” truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at this https URL.
zh
[NLP-7] Large Language Models as Computable Approximations to Solomonoff Induction
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)的理论框架不统一问题,特别是如何通过一个统一的数学视角解释其涌现现象。解决方案的关键在于首次建立了LLM架构与算法信息论(Algorithmic Information Theory, AIT)之间的形式化联系,证明了训练过程通过损失最小化近似计算Solomonoff先验,并且下一词预测实现了近似的Solomonoff归纳。这一理论框架为上下文学习、少样本学习和缩放定律提供了统一的解释,并提出了基于预测置信度选择少样本示例的方法,显著提升了模型性能。
链接: https://arxiv.org/abs/2505.15784
作者: Jun Wan,Lingrui Mei
机构: UBS AG (瑞银集团); State Key Lab of AI Safety (人工智能安全国家重点实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Both authors contributed equally
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented in explaining emergent phenomena through a unified mathematical lens. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates Solomonoff prior through loss minimization interpreted as program length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples where models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, when compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
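摘要中“优先选择模型置信度较低的样本做 few-shot 示例”的策略,可以用几行代码示意如下(confidence_fn 为假设的置信度接口,例如返回最大类别概率):

```python
def select_few_shot(candidates, confidence_fn, k=4):
    """按预测置信度升序挑选 few-shot 示例:优先选择模型最不确定的样本。"""
    return sorted(candidates, key=confidence_fn)[:k]

if __name__ == "__main__":
    conf = {"ex-a": 0.9, "ex-b": 0.55, "ex-c": 0.7}    # 桩:样本 -> 模型置信度
    print(select_few_shot(list(conf), conf.get, k=2))  # -> ['ex-b', 'ex-c']
```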
zh
[NLP-8] dKV-Cache: The Cache for Diffusion Language Models
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在推理过程中速度缓慢的问题。其核心挑战在于DLM的非自回归架构和双向注意力机制无法使用关键-值缓存(key-value cache)来加速解码。论文提出的解决方案是引入一种类似KV缓存的机制——延迟KV缓存(delayed KV-Cache),通过观察不同标记在扩散过程中的表示动态差异,设计了一种延迟且条件化的缓存策略,以逐步缓存键和值状态,从而显著提升推理效率。
链接: https://arxiv.org/abs/2505.15781
作者: Xinyin Ma,Runpeng Yu,Gongfan Fang,Xinchao Wang
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: The code is available at this https URL
点击查看摘要
Abstract:Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Finally, dKV-Cache achieves a 2-10x speedup in inference, largely narrowing the gap between ARs and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that caching can also be used in DLMs, even in a training-free manner from current DLMs.
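“延迟缓存”的核心判断逻辑可以用如下玩具代码示意:token 被解码之后再等 delay 步,才把它的 K/V 写入缓存(数据结构与接口均为假设,非论文实现):

```python
def update_delayed_kv_cache(cache, layer_kv, decoded_at, step, delay=1):
    """dKV-Cache 思想示意:扩散语言模型中各 token 在不同步被解码确定;
    只有当 token 解码后又过了 delay 步,才缓存它的 (key, value),此后各步复用;
    未缓存位置仍需每步重算。decoded_at[i] 为 token i 被解码的步号(未解码为 None)。"""
    for i, t in enumerate(decoded_at):
        if t is not None and step - t >= delay and i not in cache:
            cache[i] = layer_kv[i]
    return cache

if __name__ == "__main__":
    cache = {}
    decoded_at = [0, None, 1, None]   # token 0 在第 0 步解码,token 2 在第 1 步解码
    fake_kv = [("k0", "v0"), ("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
    for step in range(3):
        update_delayed_kv_cache(cache, fake_kv, decoded_at, step, delay=1)
    print(sorted(cache))              # -> [0, 2]
```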
zh
[NLP-9] Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
【速读】: 该论文试图解决当前基于离散语言符号的推理模型在表达能力和推理路径探索上的局限性(limitation),这些问题源于模型仅能处理固定语义空间中的离散词元嵌入,从而限制了其推理性能。解决方案的关键在于提出一种无需训练的“Soft Thinking”方法,该方法通过在连续概念空间中生成具有概率加权混合的抽象概念词元,模拟人类“软性”推理过程,从而实现更丰富的语义表示和更平滑的推理路径探索。
链接: https://arxiv.org/abs/2505.15778
作者: Zhen Zhang,Xuehai He,Weixiang Yan,Ao Shen,Chenyang Zhao,Shuohang Wang,Yelong Shen,Xin Eric Wang
机构: University of California, Santa Barbara(加州大学圣塔芭芭拉分校); University of California, Santa Cruz(加州大学圣克鲁兹分校); University of California, Los Angeles(加州大学洛杉矶分校); Purdue University(普渡大学); LMSYS Org(LMSYS组织); Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like “soft” reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning. Code is available at this https URL.
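摘要中“概念 token = 概率加权的词表嵌入混合”这一核心操作,可以用 numpy 写成如下草图(仅为示意,温度等超参为假设):

```python
import numpy as np

def soft_concept_token(logits, embedding_matrix, temperature=1.0):
    """Soft Thinking 核心操作示意:不再采样单个离散 token,
    而是用下一步的词表分布对全词表嵌入做加权平均,
    得到位于连续概念空间中的“概念 token”嵌入,作为下一步输入。"""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()             # softmax 得到词表分布
    return probs @ embedding_matrix  # [vocab] x [vocab, dim] -> [dim]

if __name__ == "__main__":
    vocab, dim = 5, 8
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(vocab, dim))
    print(soft_concept_token(rng.normal(size=vocab), emb).shape)  # (8,)
```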
zh
[NLP-10] ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning
【速读】: 该论文旨在解决对话式搜索系统中上下文依赖性查询的重构问题,这类查询通常包含歧义、省略和回指等挑战。现有对话式查询重构(CQR)方法面临两大关键限制:对昂贵的外部监督(如人工标注或大语言模型)的高度依赖,以及重写模型与下游检索器之间对齐不足。该论文提出的ConvSearch-R1框架是首个完全消除对外部重写监督依赖的自驱动方法,其关键在于利用强化学习直接通过检索信号优化重写过程,采用两阶段策略:首先通过检索引导的自蒸馏解决冷启动问题,随后引入专门设计的排名激励奖励塑造机制,以应对传统检索指标中的稀疏性问题。
链接: https://arxiv.org/abs/2505.15776
作者: Changtai Zhu,Siyin Wang,Ruijun Feng,Kai Song,Xipeng Qiu
机构: Fudan University (复旦大学); ByteDance Inc (字节跳动公司); University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Conversational search systems require effective handling of context-dependent queries that often contain ambiguity, omission, and coreference. Conversational Query Reformulation (CQR) addresses this challenge by transforming these queries into self-contained forms suitable for off-the-shelf retrievers. However, existing CQR approaches suffer from two critical constraints: high dependency on costly external supervision from human annotations or large language models, and insufficient alignment between the rewriting model and downstream retrievers. We present ConvSearch-R1, the first self-driven framework that completely eliminates dependency on external rewrite supervision by leveraging reinforcement learning to optimize reformulation directly through retrieval signals. Our novel two-stage approach combines Self-Driven Policy Warm-Up to address the cold-start problem through retrieval-guided self-distillation, followed by Retrieval-Guided Reinforcement Learning with a specially designed rank-incentive reward shaping mechanism that addresses the sparsity issue in conventional retrieval metrics. Extensive experiments on TopiOCQA and QReCC datasets demonstrate that ConvSearch-R1 significantly outperforms previous state-of-the-art methods, achieving over 10% improvement on the challenging TopiOCQA dataset while using smaller 3B parameter models without any external supervision.
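摘要提到传统检索指标(如 hit@k)信号稀疏,下面给出“排名激励奖励塑造”思想的一个示意函数(具体函数形式为本文假设,并非论文公式):

```python
def rank_incentive_reward(rank, k=10):
    """排名激励奖励示意:hit@k 在目标文档排在 k 之外时奖励恒为 0;
    这里改用随排名平滑衰减、处处非零的奖励,
    使“从第 50 名升到第 15 名”这类进步也能提供训练信号。
    rank 为检索器返回的目标段落名次(1 为最优)。"""
    if rank <= k:
        return 1.0 - (rank - 1) / (2 * k)  # 前 k 名:高奖励,随名次线性衰减
    return 1.0 / (1.0 + rank)              # k 名之外:小而非零,缓解稀疏性
```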
zh
[NLP-11] Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长序列推理中因计算效率低下和冗余处理而面临的挑战,特别是现有上下文压缩技术在信息保留与压缩效率之间的平衡问题。其解决方案的关键在于提出一种融合全局与局部视角的混合上下文压缩方法(Hybrid Context Compression, HyCo_2),通过结合全局语义精炼与局部标记保留概率评估,既保留任务完成所需的核心语义,又确保关键细节不丢失,从而实现高效的上下文压缩与性能提升。
链接: https://arxiv.org/abs/2505.15774
作者: Huanxuan Liao,Wen Hu,Yao Xu,Shizhu He,Jun Zhao,Kang Liu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose Hybrid Context Compression (HyCo_2) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo_2 method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1% across seven knowledge-intensive QA benchmarks. Moreover, HyCo_2 matches the performance of uncompressed methods while reducing token consumption by 88.8%.
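HyCo_2 局部视角的“逐 token 保留概率”可以用如下 numpy 草图示意(分类头参数在此随机初始化,仅为演示;真实系统中需与全局软压缩联合训练):

```python
import numpy as np

def local_compress(token_embs, retain_ratio=0.2, w=None, rng=None):
    """局部硬压缩示意:一个线性分类头为每个上下文 token 打保留概率,
    只保留得分最高的一部分 token。w 为假设的分类头参数。"""
    rng = rng or np.random.default_rng(0)
    n, d = token_embs.shape
    w = w if w is not None else rng.normal(size=d)
    scores = token_embs @ w                 # 每个 token 的保留得分
    probs = 1.0 / (1.0 + np.exp(-scores))   # sigmoid -> 保留概率
    k = max(1, int(n * retain_ratio))
    keep = np.sort(np.argsort(-probs)[:k])  # 取 top-k 并保持原有顺序
    return token_embs[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    kept, idx = local_compress(rng.normal(size=(50, 16)), retain_ratio=0.2)
    print(kept.shape, idx)                  # (10, 16) 与被保留 token 的下标
```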
zh
[NLP-12] MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling INTERSPEECH
【速读】: 该论文旨在解决大规模情感语音数据获取中一致性不足的问题,特别是在无标签视频数据中提取高一致性情感语音的挑战。其解决方案的关键在于提出MIKU-PAL,一个完全自动化的多模态管道,利用面部检测与跟踪算法以及多模态大语言模型(MLLM)实现情感分析的自动化,从而在成本和效率上显著优于人工标注,并实现了人类水平的准确性和优异的一致性。
链接: https://arxiv.org/abs/2505.15772
作者: Cheng Yifan,Zhang Ruoyi,Shi Jiatong
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech
点击查看摘要
Abstract:Acquiring large-scale emotional speech data with strong consistency remains a challenge for speech synthesis. This paper presents MIKU-PAL, a fully automated multimodal pipeline for extracting high-consistency emotional speech from unlabeled video data. Leveraging face detection and tracking algorithms, we developed an automatic emotion analysis system using a multimodal large language model (MLLM). Our results demonstrate that MIKU-PAL can achieve human-level accuracy (68.5% on MELD) and superior consistency (0.93 Fleiss kappa score) while being much cheaper and faster than human annotation. With the high-quality, flexible, and consistent annotation from MIKU-PAL, we can annotate fine-grained speech emotion categories of up to 26 types, validated by human annotators with 83% rationality ratings. Based on our proposed system, we further released a fine-grained emotional speech dataset MIKU-EmoBench(131.2 hours) as a new benchmark for emotional text-to-speech and visual voice cloning.
zh
[NLP-13] Transfer of Structural Knowledge from Synthetic Languages ACL2025
【速读】: 该论文试图解决如何通过迁移学习将合成语言的知识有效迁移到英语中的问题,其核心挑战在于如何构建具有更好迁移能力的合成语言以及设计更有效的评估基准。解决方案的关键在于引入一种新的合成语言,该语言相比之前研究中使用的语言能够实现更优的迁移效果,并提出Tiny-Cloze Benchmark作为评估工具,该基准在评估低性能模型的自然语言理解能力方面更具信息量。通过在多个领域使用该基准对微调模型进行评估,验证了在新合成语言上微调可提升模型在多种任务上的性能。
链接: https://arxiv.org/abs/2505.15769
作者: Mikhail Budnikov,Ivan Yamshchikov
机构: Constructor University (构造大学); THWS (THWS)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 3 figures and 3 tables to be published in ACL 2025 Workshop XLLM
点击查看摘要
Abstract:This work explores transfer learning from several synthetic languages to English. We investigate the structure of the embeddings in the fine-tuned models, the information they contain, and the capabilities of the fine-tuned models on simple linguistic tasks. We also introduce a new synthetic language that leads to better transfer to English than the languages used in previous research. Finally, we introduce Tiny-Cloze Benchmark - a new synthetic benchmark for natural language understanding that is more informative for less powerful models. We use Tiny-Cloze Benchmark to evaluate fine-tuned models in several domains demonstrating that fine-tuning on a new synthetic language allows for better performance on a variety of tasks.
zh
[NLP-14] Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对 jailbreaking 攻击时的安全性问题,即攻击者通过精心设计的提示诱导模型生成有害或不道德的响应。解决方案的关键在于引入安全上下文检索(Safety Context Retrieval, SCR),该方法基于检索增强生成(Retrieval-Augmented Generation, RAG)技术,通过从预先构建的安全示例中检索相关上下文来增强模型的鲁棒性,从而有效防御已知和新兴的 jailbreaking 策略。
链接: https://arxiv.org/abs/2505.15753
作者: Taiye Chen,Zeming Wei,Ang Li,Yisen Wang
机构: Peking University (北京大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the retrieval-augmented generation (RAG) techniques and propose Safety Context Retrieval (SCR), a scalable and robust safeguarding paradigm for LLMs against jailbreaking. Our comprehensive experiments demonstrate how SCR achieves superior defensive performance against both established and emerging jailbreaking tactics, contributing a new paradigm to LLM safety. Our code will be available upon publication.
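SCR 的检索步骤本质上是标准的 RAG 相似度检索,下面给出一个基于余弦相似度的极简草图(安全示例库的构建方式与文本编码器均为假设):

```python
import numpy as np

def retrieve_safety_context(query_vec, safety_db, top_k=2):
    """Safety Context Retrieval 示意:从预先构建的安全示例库中,
    按余弦相似度检索与当前请求最相关的安全对齐示例,
    拼接到提示词前再交给 LLM 作答。safety_db 为 (向量, 示例文本) 列表。"""
    sims = []
    for demo_vec, demo_text in safety_db:
        cos = query_vec @ demo_vec / (np.linalg.norm(query_vec) * np.linalg.norm(demo_vec))
        sims.append((cos, demo_text))
    sims.sort(key=lambda x: -x[0])
    # 调用方将返回值前置于用户请求: prompt = context + "\n\n" + user_query
    return "\n\n".join(text for _, text in sims[:top_k])
```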
zh
[NLP-15] Evolutionary Computation and Large Language Models: A Survey of Methods, Synergies, and Applications
【速读】: 该论文试图解决如何将大型语言模型(Large Language Models, LLMs)与进化计算(Evolutionary Computation, EC)有效结合,以提升人工智能的性能与应用范围。其解决方案的关键在于探索两者之间的协同潜力,通过EC优化LLMs的训练、微调、提示工程和架构搜索,同时利用LLMs自动化EC的元启发式设计、算法调优及自适应启发式的生成,从而实现双向增强的人工智能系统。
链接: https://arxiv.org/abs/2505.15741
作者: Dikshit Chauhan,Bapi Dutta,Indu Bala,Niki van Stein,Thomas Bäck,Anupam Yadav
机构: National University of Singapore (新加坡国立大学); Universidad de Jaén (哈恩大学); University of Adelaide (阿德莱德大学); Leiden Institute of Advanced Computer Science, University Leiden (莱顿高级计算机科学研究所,莱顿大学); Dr. B. R. Ambedkar National Institute of Technology (B.R.阿姆倍德卡国家技术学院)
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Integrating Large Language Models (LLMs) and Evolutionary Computation (EC) represents a promising avenue for advancing artificial intelligence by combining powerful natural language understanding with optimization and search capabilities. This manuscript explores the synergistic potential of LLMs and EC, reviewing their intersections, complementary strengths, and emerging applications. We identify key opportunities where EC can enhance LLM training, fine-tuning, prompt engineering, and architecture search, while LLMs can, in turn, aid in automating the design, analysis, and interpretation of ECs. The manuscript explores the synergistic integration of EC and LLMs, highlighting their bidirectional contributions to advancing artificial intelligence. It first examines how EC techniques enhance LLMs by optimizing key components such as prompt engineering, hyperparameter tuning, and architecture search, demonstrating how evolutionary methods automate and refine these processes. Secondly, the survey investigates how LLMs improve EC by automating metaheuristic design, tuning evolutionary algorithms, and generating adaptive heuristics, thereby increasing efficiency and scalability. Emerging co-evolutionary frameworks are discussed, showcasing applications across diverse fields while acknowledging challenges like computational costs, interpretability, and algorithmic convergence. The survey concludes by identifying open research questions and advocating for hybrid approaches that combine the strengths of EC and LLMs.
zh
[NLP-16] Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses
【速读】: 该论文试图解决当前基于对齐(alignment)的防御方法在面对有信息的白盒攻击时存在的脆弱性问题。其关键解决方案是提出一种利用中间模型检查点(checkpoint)进行初始化的知情白盒攻击方法,通过这些检查点作为后续攻击的跳板,显著提升了攻击的有效性和效率,并成功发现了适用于多种输入的通用对抗后缀。
链接: https://arxiv.org/abs/2505.15738
作者: Xiaoxue Yang,Bozhidar Stevanoski,Matthieu Meeus,Yves-Alexandre de Montjoye
机构: Imperial College London (帝国理工学院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are rapidly deployed in real-world applications ranging from chatbots to agentic systems. Alignment is one of the main approaches used to defend against attacks such as prompt injection and jailbreaks. Recent defenses report near-zero Attack Success Rates (ASR) even against Greedy Coordinate Gradient (GCG), a white-box attack that generates adversarial suffixes to induce attacker-desired outputs. However, this search space over discrete tokens is extremely large, making the task of finding successful attacks difficult. GCG has, for instance, been shown to converge to local minima, making it sensitive to initialization choices. In this paper, we assess the future-proof robustness of these defenses using a more informed threat model: attackers who have access to some information about the alignment process. Specifically, we propose an informed white-box attack leveraging the intermediate model checkpoints to initialize GCG, with each checkpoint acting as a stepping stone for the next one. We show this approach to be highly effective across state-of-the-art (SOTA) defenses and models. We further show our informed initialization to outperform other initialization methods and show a gradient-informed checkpoint selection strategy to greatly improve attack performance and efficiency. Importantly, we also show our method to successfully find universal adversarial suffixes – single suffixes effective across diverse inputs. Our results show that, contrary to previous beliefs, effective adversarial suffixes do exist against SOTA alignment-based defenses, that these can be found by existing attack methods when adversaries exploit alignment knowledge, and that even universal suffixes exist. Taken together, our results highlight the brittleness of current alignment-based methods and the need to consider stronger threat models when testing the safety of LLMs.
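“以中间 checkpoint 为跳板”的知情初始化流程可以抽象为一个很短的循环(run_gcg 为假设的单模型 GCG 接口,此处只示意流程,不含 GCG 本身的实现):

```python
def informed_attack(checkpoints, run_gcg, init_suffix="! ! ! ! !"):
    """知情白盒攻击流程示意:按训练顺序遍历对齐过程的中间 checkpoint,
    用前一个 checkpoint 上找到的对抗后缀初始化下一个的搜索(“跳板”),
    最终得到针对最终对齐模型的后缀。run_gcg(model, suffix) 为假设接口。"""
    suffix = init_suffix
    for ckpt in checkpoints:            # 从早期 checkpoint 到最终对齐模型
        suffix = run_gcg(ckpt, suffix)  # 以上一阶段结果为初始化继续优化
    return suffix
```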
zh
[NLP-17] DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在仅依赖大量数据进行训练时,提升推理能力变得越来越不实际的问题,旨在使模型能够自主增强其推理能力而无需外部监督。解决方案的关键在于提出一种无真实标签的训练框架——Debate, Train, Evolve (DTE),该框架利用多智能体辩论轨迹来演化单个语言模型,并引入了Reflect-Critique-Refine提示策略,通过明确指导智能体对推理过程进行批判与优化,从而提高辩论质量。
链接: https://arxiv.org/abs/2505.15734
作者: Gaurav Srivastava,Zhenyu Bi,Meng Lu,Xuan Wang
机构: Virginia Tech (弗吉尼亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on five reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities.
zh
[NLP-18] VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
【速读】: 该论文试图解决当前语音交互模型评估体系中对语音表现关键方面(如声学性能、语用线索和环境适应性)关注不足的问题,现有评估多集中于文本响应质量,缺乏针对语音特性的基准测试实例。解决方案的关键是提出VocalBench,这是一个涵盖语义质量、声学性能、对话能力及鲁棒性的综合性基准,包含9,400个精心设计的实例,覆盖16项基础技能,以全面评估语音交互模型的能力。
链接: https://arxiv.org/abs/2505.15727
作者: Heyang Liu,Yuhao Wang,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) has accelerated the development of multi-modal models capable of vocal communication. Unlike text-based interactions, speech conveys rich and diverse information, including semantic content, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models predominantly focus on the quality of their textual responses, often overlooking critical aspects of vocal performance and lacking benchmarks with vocal-specific test instances. To address this gap, we propose VocalBench, a comprehensive benchmark designed to evaluate speech interaction models’ capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers 16 fundamental skills essential for effective vocal interaction. Experimental results reveal significant variability in current model capabilities, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech-based interaction systems. Code and evaluation instances are available at this https URL.
zh
[NLP-19] Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities
【速读】: 该论文试图解决多语言大语言模型(Multilingual Large Language Models, MLLMs)中的记忆行为问题,特别是现有假设认为记忆程度与训练数据可用性高度相关这一观点无法充分解释MLLMs中的记忆模式。解决方案的关键在于提出一种基于图的关联度量方法,该方法引入语言相似性以分析跨语言的记忆现象,从而揭示出在语言相似性较高的情况下,训练语料较少的语言反而表现出更高的记忆倾向,这一趋势仅在显式建模跨语言关系时才会显现。
链接: https://arxiv.org/abs/2505.15722
作者: Xiaoyu Luo,Yiyi Chen,Johannes Bjerva,Qiongxiu Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 14 tables, 10 figures
点击查看摘要
Abstract:We present the first comprehensive study of Memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages using models across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that treating languages in isolation - ignoring their similarities - obscures the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains Memorization in MLLMs and underpins Cross-lingual Transferability, with broad implications for multilingual NLP.
zh
[NLP-20] Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling
【速读】: 该论文试图解决现有基于大型语言模型(Large Language Models, LLMs)的心理健康支持系统在临床基础性方面的不足,特别是在符合DSM/ICD诊断标准的明确诊断推理以及整合多种治疗模式(如认知行为疗法、接纳与承诺疗法、精神分析等)方面存在缺陷。解决方案的关键在于提出PsyLLM,这是首个系统性融合诊断与治疗推理的大型语言模型,并通过一种新颖的自动化数据合成流程生成高质量、符合临床标准的对话数据,从而提升心理咨询服务的全面性、专业性、真实性和安全性。
链接: https://arxiv.org/abs/2505.15715
作者: He Hu,Yucheng Zhou,Juzheng Si,Qianning Wang,Hengheng Zhang,Fuji Ren,Fei Ma,Laizhong Cui
机构: Shenzhen University (深圳大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); University of Macau (澳门大学); Shandong University (山东大学); Auckland University of Technology (奥克兰理工大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop the PsyLLM, we propose a novel automated data synthesis pipeline. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions: comprehensiveness, professionalism, authenticity, and safety. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark.
zh
[NLP-21] TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂、叙述丰富的环境中进行演绎推理的能力评估问题。解决方案的关键在于提出TurnaboutLLM框架及相应的数据集,通过利用侦探游戏《逆转裁判》和《弹丸论破》的互动玩法,任务要求LLMs在长篇叙述语境中识别证词与证据之间的矛盾,从而测试其演绎推理能力。该方法通过构建具有大答案空间和多样化推理类型的挑战性任务,揭示了现有增强演绎推理策略的局限性,并探讨了上下文长度、推理步骤数量和答案空间大小对模型性能的影响。
链接: https://arxiv.org/abs/2505.15712
作者: Yuan Yuan,Muyu He,Muhammad Adil Shahid,Jiani Huang,Ziyang Li,Li Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidence within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, hinting at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, the number of reasoning steps and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs’ deductive reasoning abilities in complex, narrative-rich environments.
zh
[NLP-22] Advancing LLM Safe Alignment with Safety Representation Ranking
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成有害内容方面的安全问题,现有安全评估方法通常仅针对文本响应进行操作,忽略了模型内部表示中蕴含的丰富信息。解决方案的关键在于提出一种基于列表排序的安全表示排名(Safety Representation Ranking, SRR)框架,该框架利用LLM自身的隐藏状态来选择安全的响应,通过中间Transformer表示编码指令和候选完成,并借助轻量级基于相似性的评分器对候选进行排序,从而直接利用模型内部状态和列表级别的监督来捕捉细微的安全信号。
链接: https://arxiv.org/abs/2505.15710
作者: Tianqi Du,Zeming Wei,Quan Chen,Chenheng Zhang,Yisen Wang
机构: Peking University (北京大学); School of Intelligence Science and Technology, Peking University (北京大学智能科学与技术学院); School of Mathematical Sciences, Peking University (北京大学数学科学学院); Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) has demonstrated milestone success in a variety of tasks, yet their potential for generating harmful content has raised significant safety concerns. Existing safety evaluation approaches typically operate directly on textual responses, overlooking the rich information embedded in the model’s internal representations. In this paper, we propose Safety Representation Ranking (SRR), a listwise ranking framework that selects safe responses using hidden states from the LLM itself. SRR encodes both instructions and candidate completions using intermediate transformer representations and ranks candidates via a lightweight similarity-based scorer. Our approach directly leverages internal model states and supervision at the list level to capture subtle safety signals. Experiments across multiple benchmarks show that SRR significantly improves robustness to adversarial prompts. Our code will be available upon publication.
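SRR 的打分与排序环节可以用如下草图示意:这里用余弦相似度代替论文中训练得到的轻量打分器,隐藏状态的获取方式从略(均为假设):

```python
import numpy as np

def srr_rank(instruction_hidden, candidate_hiddens):
    """Safety Representation Ranking 示意:用模型中间层隐藏状态
    分别表示指令与各候选回复,再用轻量打分器对候选排序,
    返回得分最高(视为最安全)的候选下标及全部得分。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scores = [cos(instruction_hidden, h) for h in candidate_hiddens]
    order = np.argsort(scores)[::-1]   # 按得分从高到低
    return int(order[0]), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inst = rng.normal(size=64)
    cands = [rng.normal(size=64) for _ in range(4)]
    print(srr_rank(inst, cands)[0])
```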
zh
[NLP-23] LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing
【速读】: 该论文旨在解决大型语言模型在进行连续知识编辑时出现的性能逐渐下降问题,这一问题源于现有主流“定位-编辑”方法在长期知识保留机制上的不足。其解决方案的关键在于提出一种名为LyapLock的框架,该框架将顺序编辑建模为一个受约束的随机规划问题,并通过引入排队论和李雅普诺夫优化,将长期约束问题分解为可处理的逐步子问题,从而实现高效的求解。此框架是首个具有严格理论保障的模型编辑方法,在满足长期知识保留约束的同时,实现了渐近最优的编辑性能。
链接: https://arxiv.org/abs/2505.15702
作者: Peng Wang,Biyu Zhou,Xuehai Tang,Jizhong Han,Songlin Hu
机构: Institute of Information Engineering, Chinese Academy of Sciences (信息工程研究所,中国科学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model the sequential editing as a constrained stochastic programming. Given the challenges posed by the cumulative preservation error constraint and the gradually revealed editing tasks, LyapLock is proposed. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained programming into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotic optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. Our code is released on this https URL.
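LyapLock 所依赖的李雅普诺夫优化,其“虚拟队列 + drift-plus-penalty”骨架大致如下(这是对该类方法的一般性示意,并非论文的具体公式):

```python
def lyapunov_step(queue, preservation_error, budget, v, edit_gain):
    """drift-plus-penalty 单步示意:
    objective 越大越好:编辑收益 edit_gain 由权重 v 放大,
    知识保留误差被当前虚拟队列长度惩罚;
    队列累积对长期保留约束(每步预算 budget)的违反量。"""
    objective = v * edit_gain - queue * preservation_error
    queue = max(queue + preservation_error - budget, 0.0)  # 虚拟队列更新
    return queue, objective
```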
zh
[NLP-24] HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际仓库级硬件描述语言(Hardware Description Language, HDL)项目中的性能受限问题,特别是在处理数千甚至数万行代码时的表现不佳。解决方案的关键在于提出HDLxGraph框架,该框架结合了图检索增强生成(Graph Retrieval Augmented Generation, Graph RAG)与LLMs,通过引入针对HDL的图表示,包括抽象语法树(Abstract Syntax Trees, ASTs)和数据流图(Data Flow Graphs, DFGs),以捕捉代码图视图和硬件图视图。此外,HDLxGraph采用双检索机制,通过结构信息缓解基于相似性的语义检索的召回率限制,并通过任务特定的检索微调提升其对各种实际任务的可扩展性。
链接: https://arxiv.org/abs/2505.15701
作者: Pingqing Zheng,Jiayin Qin,Fuqi Zhang,Shang Wu,Yu Cao,Caiwen Ding,Yang (Katie) Zhao
机构: University of Minnesota, Twin Cities (明尼苏达大学双城分校); Northwestern University (西北大学)
类目: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated their potential in hardware design tasks, such as Hardware Description Language (HDL) generation and debugging. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of code lines is hindered. To this end, we propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs, introducing HDL-specific graph representations by incorporating Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to capture both code graph view and hardware graph view. HDLxGraph utilizes a dual-retrieval mechanism that not only mitigates the limited recall issues inherent in similarity-based semantic retrieval by incorporating structural information, but also enhances its extensibility to various real-world tasks by a task-specific retrieval finetuning. Additionally, to address the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, a multi-granularity evaluation dataset derived from real-world repository-level projects. Experimental results demonstrate that HDLxGraph significantly improves average search accuracy, debugging efficiency and completion quality by 12.04%, 12.22% and 5.04% compared to similarity-based RAG, respectively. The code of HDLxGraph and collected HDLSearch benchmark are available at this https URL.
zh
[NLP-25] “Alexa can you forget me?” Machine Unlearning Benchmark in Spoken Language Understanding
【速读】: 该论文试图解决机器学习模型中针对特定信息(尤其是语音相关数据)的高效移除问题,即机器遗忘(machine unlearning)在语音理解(spoken language understanding, SLU)领域的应用与评估问题。其解决方案的关键在于提出UnSLU-BENCH,这是首个针对SLU领域机器遗忘的基准测试平台,通过四个跨语言的数据集评估不同遗忘方法的效果,并引入一种新的度量标准以同时衡量方法的有效性、实用性和计算效率。
链接: https://arxiv.org/abs/2505.15700
作者: Alkis Koudounas,Claudio Savelli,Flavio Giobergia,Elena Baralis
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Machine unlearning, the process of efficiently removing specific information from machine learning models, is a growing area of interest for responsible AI. However, few studies have explored the effectiveness of unlearning methods on complex tasks, particularly speech-related ones. This paper introduces UnSLU-BENCH, the first benchmark for machine unlearning in spoken language understanding (SLU), focusing on four datasets spanning four languages. We address the unlearning of data from specific speakers as a way to evaluate the quality of potential “right to be forgotten” requests. We assess eight unlearning techniques and propose a novel metric to simultaneously better capture their efficacy, utility, and efficiency. UnSLU-BENCH sets a foundation for unlearning in SLU and reveals significant differences in the effectiveness and computational feasibility of various techniques.
zh
[NLP-26] MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation
【速读】: 该论文试图解决BERT中[CLS]标记在分类任务中作为固定长度表示的局限性,尽管已有研究表明其他标记和中间层也包含有价值的情境信息。解决方案的关键在于提出MaxPoolBERT,通过跨层和跨标记的信息聚合来优化[CLS]表示,具体包括:(i)在多个层上对[CLS]标记进行最大池化,(ii)通过额外的多头注意力(MHA)层使[CLS]标记关注整个最终层,(iii)结合全序列的最大池化与MHA。该方法在不增加模型规模或需要预训练的情况下提升了BERT的分类准确性,特别是在低资源任务中表现更优。
链接: https://arxiv.org/abs/2505.15696
作者: Maike Behrendt,Stefan Sylvius Wagner,Stefan Harmeling
机构: Heinrich Heine University Düsseldorf (海因里希·海涅大学杜塞尔多夫); Technical University Dortmund (多特蒙德工业大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we propose MaxPoolBERT, a lightweight extension to BERT that refines the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach enhances BERT’s classification accuracy (especially on low-resource tasks) without requiring pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently achieves better performance than the standard BERT-base model.
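以变体 (i) 为例,“对最后 k 层 [CLS] 向量做逐维 max-pooling”用 transformers 库可以写成如下草图(模型名与 k 值为示例取值,仅演示特征提取,不含分类头):

```python
import torch
from transformers import AutoModel, AutoTokenizer

def maxpool_cls(text, model_name="bert-base-uncased", last_k=4):
    """MaxPoolBERT 变体 (i) 示意:取最后 k 层的 [CLS] 向量做逐维 max-pooling,
    得到聚合后的分类表示(论文还探索了附加 MHA 层等变体,此处从略)。"""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    # hidden_states: (嵌入层 + 各 transformer 层) 的元组,每项 [batch, seq, hidden]
    cls_stack = torch.stack([h[:, 0, :] for h in out.hidden_states[-last_k:]], dim=0)
    return cls_stack.max(dim=0).values  # [batch, hidden]
```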
zh
[NLP-27] Can Large Language Models be Effective Online Opinion Miners?
【速读】: 该论文试图解决用户生成的在线内容在语义多样性、复杂性和上下文丰富性方面对传统观点挖掘(opinion mining)方法带来的挑战。解决方案的关键在于引入Online Opinion Mining Benchmark (OOMB),这是一个新型数据集和评估协议,旨在评估大型语言模型(LLMs)从多样化和复杂的在线环境中有效挖掘观点的能力。OOMB提供了详尽的(实体、特征、观点)三元组标注以及以观点为中心的摘要,从而能够评估模型的抽取式和摘要式能力。
链接: https://arxiv.org/abs/2505.15695
作者: Ryang Heo,Yongsik Seo,Junseong Lee,Dongha Lee
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 6 figures
点击查看摘要
Abstract:The surge of user-generated online content presents a wealth of insights into customer preferences and market trends. However, the highly diverse, complex, and context-rich nature of such contents poses significant challenges to traditional opinion mining approaches. To address this, we introduce Online Opinion Mining Benchmark (OOMB), a novel dataset and evaluation protocol designed to assess the ability of large language models (LLMs) to mine opinions effectively from diverse and intricate online environments. OOMB provides extensive (entity, feature, opinion) tuple annotations and a comprehensive opinion-centric summary that highlights key opinion topics within each content, thereby enabling the evaluation of both the extractive and abstractive capabilities of models. Through our proposed benchmark, we conduct a comprehensive analysis of which aspects remain challenging and where LLMs exhibit adaptability, to explore whether they can effectively serve as opinion miners in realistic online scenarios. This study lays the foundation for LLM-based opinion mining and discusses directions for future research in this field.
zh
[NLP-28] Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities
【速读】: 该论文试图解决强化学习(Reinforcement Learning, RL)在训练推理模型时存在的探索能力受限问题,即现有方法倾向于将模型输出分布偏向于奖励最大化路径,而缺乏外部知识的引入,导致推理能力边界较窄。解决方案的关键在于提出TAPO(Thought-Augmented Policy Optimization)框架,通过引入外部高阶引导(“thought patterns”)来增强RL,从而在训练过程中自适应地融合结构化思维,有效平衡模型内部探索与外部指导利用。
链接: https://arxiv.org/abs/2505.15692
作者: Jinyang Wu,Chonghua Liao,Mingkuan Feng,Shuai Zhang,Zhengqi Wen,Pengpeng Shao,Huazhe Xu,Jianhua Tao
机构: Tsinghua University (清华大学); Institution for Interdisciplinary Information Sciences, Tsinghua University (清华大学交叉信息研究院); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心); Shanghai Qi Zhi Institute (上海奇智学院); Shanghai AI Lab (上海人工智能实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model’s output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance (“thought patterns”). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO’s potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
zh
[NLP-29] ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy
【Quick Read】: This paper addresses the costs of overly long reasoning in Chain-of-Thought (CoT) prompting for large language models (LLMs): higher latency, heavy key-value (KV) cache memory usage, and final answers that may be truncated under context limits. The key to the solution is the ThinkLess framework, which terminates reasoning generation early while preserving output quality, without modifying the model. The central insight comes from attention analysis: answer tokens attend minimally to earlier reasoning steps and mainly to the reasoning terminator token, due to information migration under causal masking. Building on this, ThinkLess inserts the terminator token at an earlier position to skip redundant reasoning while preserving the underlying knowledge transfer, and applies a lightweight post-regulation mechanism to avoid format disruption, achieving accuracy comparable to full-length CoT decoding without fine-tuning or auxiliary data while substantially reducing decoding time and memory consumption.
Link: https://arxiv.org/abs/2505.15684
Authors: Gengyang Li, Yifeng Gao, Yuming Li, Yunfang Wu
Affiliations: Peking University; School of Software and Microelectronics, Peking University
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:While Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), the excessive length of reasoning tokens increases latency and KV cache memory usage, and may even truncate final answers under context limits. We propose ThinkLess, an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model. Attention analysis reveals that answer tokens focus minimally on earlier reasoning steps and primarily attend to the reasoning terminator token, due to information migration under causal masking. Building on this insight, ThinkLess inserts the terminator token at earlier positions to skip redundant reasoning while preserving the underlying knowledge transfer. To prevent format disruption caused by early termination, ThinkLess employs a lightweight post-regulation mechanism, relying on the model’s natural instruction-following ability to produce well-structured answers. Without fine-tuning or auxiliary data, ThinkLess achieves comparable accuracy to full-length CoT decoding while greatly reducing decoding time and memory consumption.
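To make the idea concrete, here is a minimal sketch of early-termination decoding with Hugging Face transformers (not the authors' code): the reasoning segment is closed almost immediately with the terminator token, followed by a short post-regulation instruction before the answer is decoded. The model name, the `</think>` delimiter, and the prompt wording are all assumptions.

```python
# A minimal sketch of ThinkLess-style early termination (not the authors' code).
# Assumptions: a chat model whose reasoning is delimited by <think>...</think>,
# and that closing the segment early skips the remaining reasoning.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # hypothetical choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

question = "If a train travels 60 km in 40 minutes, what is its speed in km/h?"
# Open the reasoning segment, terminate it immediately, then add a lightweight
# "post-regulation" instruction so the answer stays well-structured.
prompt = (
    f"User: {question}\nAssistant: <think>\n</think>\n"
    "Answer concisely in one sentence: "
)
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```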
zh
[NLP-30] A Federated Splitting Framework for LLMs: Security, Efficiency and Adaptability
【Quick Read】: This paper targets the security, computational-efficiency, and task-adaptability issues that arise when private data are used to improve large language model (LLM) performance in federated settings. The key to the solution is the FL-LLaMA framework: it keeps some input and output blocks on the local client and injects Gaussian noise into forward-pass hidden states for secure end-to-end propagation; it uses client-batch and server-hierarchical strategies to increase training parallelism, together with attention-mask compression and KV cache mechanisms to cut communication overhead; and it lets users dynamically adjust the split points according to task requirements and hardware constraints, improving efficiency and adaptability while preserving security.
Link: https://arxiv.org/abs/2505.15683
Authors: Zishuai Zhang, Hainan Zhang, Jiaying Zheng, Ziwei Wang, Yongxin Tong, Jin Dong, Zhiming Zheng
Affiliations: Beihang University (Beijing University of Aeronautics and Astronautics); Beijing Aeronautical Materials Research Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Click to view abstract
Abstract:Private data is typically larger and of higher quality than public data, offering great potential to improve LLM. However, its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, the transformer-based split learning model has emerged, offloading most model parameters to the server while retaining only the embedding and output layers on clients to ensure privacy. However, it still faces significant challenges in security, efficiency, and adaptability: 1) embedding gradients are vulnerable to attacks, leading to reverse engineering of private data; 2) the autoregressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FL-LLaMA, a secure, efficient, and adaptive federated split framework based on LLaMA2. First, we place some input and output blocks on the local client and inject Gaussian noise into forward-pass hidden states, enabling secure end-to-end propagation. Second, we employ client-batch and server-hierarchical strategies to achieve parallel training, along with attention-mask compression and KV cache mechanisms to accelerate inference, reducing communication costs effectively. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements and hardware limitations. Experiments on NLU, summarization and conversational QA tasks show that FL-LLaMA maintains performance comparable to centralized LLaMA2, and achieves up to 2x train speedups and 8x inference speedups. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FL-LLaMA in security and adaptability.
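The first ingredient above, Gaussian noise injected into the client-side hidden states before they cross to the server, can be sketched in a few lines of PyTorch. This is an illustrative stand-in, not FL-LLaMA itself; the noise scale and the toy client blocks are assumptions.

```python
import torch
import torch.nn as nn

class NoisyClientHead(nn.Module):
    """Client-side blocks that add Gaussian noise to hidden states
    before they are sent to the server (a sketch, not FL-LLaMA itself)."""

    def __init__(self, client_blocks: nn.Module, sigma: float = 0.01):
        super().__init__()
        self.client_blocks = client_blocks
        self.sigma = sigma  # noise scale is a free parameter here

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.client_blocks(x)
        if self.training:          # perturb only during training
            h = h + self.sigma * torch.randn_like(h)
        return h                   # noisy hidden states go to the server

# Toy usage: two small layers stand in for the client-side embedding blocks.
client = NoisyClientHead(nn.Sequential(nn.Linear(16, 16), nn.GELU()), sigma=0.05)
client.train()
hidden = client(torch.randn(2, 8, 16))  # (batch, seq, dim) -> perturbed states
print(hidden.shape)
```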
zh
[NLP-31] he Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect
【Quick Read】: This paper asks how language models represent concreteness in their semantic representations, and how well human and model semantic spaces align. The key to the solution is to estimate the semantic distances humans implicitly use from behavioral judgments on a carefully selected set of abstract and concrete nouns, and to show via Representational Similarity Analysis that the semantic representations of human participants and language models are significantly aligned, with this alignment driven chiefly by concreteness rather than by other word characteristics known to matter in psycholinguistics.
Link: https://arxiv.org/abs/2505.15682
Authors: Cosimo Iaia, Bhavin Choksi, Emily Wiebers, Gemma Roig, Christian J. Fiebach
Affiliations: Goethe University Frankfurt; Center for Brains, Minds and Machines, MIT; Hessian.AI; Brain Imaging Center
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 4 figures, 1 table
Click to view abstract
Abstract:The nouns of our language refer to either concrete entities (like a table) or abstract concepts (like justice or love), and cognitive psychology has established that concreteness influences how words are processed. Accordingly, understanding how concreteness is represented in our mind and brain is a central question in psychology, neuroscience, and computational linguistics. While the advent of powerful language models has allowed for quantitative inquiries into the nature of semantic representations, it remains largely underexplored how they represent concreteness. Here, we used behavioral judgments to estimate semantic distances implicitly used by humans, for a set of carefully selected abstract and concrete nouns. Using Representational Similarity Analysis, we find that the implicit representational space of participants and the semantic representations of language models are significantly aligned. We also find that both representational spaces are implicitly aligned to an explicit representation of concreteness, which was obtained from our participants using an additional concreteness rating task. Importantly, using ablation experiments, we demonstrate that the human-to-model alignment is substantially driven by concreteness, but not by other important word characteristics established in psycholinguistics. These results indicate that humans and language models converge on the concreteness dimension, but not on other dimensions.
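Representational Similarity Analysis itself is a generic procedure, sketched below with NumPy/SciPy on random stand-in data: build the pairwise-distance structure of each representational space and correlate the two. The dimensionalities and distance metrics are illustrative choices, not the paper's exact setup.

```python
# A generic Representational Similarity Analysis (RSA) sketch: correlate the
# pairwise-distance structure of human judgments with that of model embeddings.
# The data here are random stand-ins for the paper's nouns and ratings.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_words = 40
human_space = rng.normal(size=(n_words, 5))    # e.g., MDS of behavioral judgments
model_space = rng.normal(size=(n_words, 768))  # e.g., LM word embeddings

# Condensed pairwise-distance vectors (the RDMs' upper triangles).
human_rdm = pdist(human_space, metric="euclidean")
model_rdm = pdist(model_space, metric="cosine")

rho, p = spearmanr(human_rdm, model_rdm)
print(f"RSA alignment: Spearman rho={rho:.3f} (p={p:.3g})")
```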
zh
[NLP-32] UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models
【Quick Read】: This paper addresses targeted unlearning in large language models facing knowledge conflicts and outdated information (e.g., incorrect, private, or illegal content). Existing methods fail to balance unlearning efficacy with model ability, either severely degrading performance or failing to truly unlearn. The key to the solution is UniErase, a new unlearning paradigm based on a learnable parametric suffix (the unlearning token), which operates in two phases: an optimization stage that binds the desired unlearning outputs to the model's autoregressive probability distribution, followed by a lightweight model-editing stage that activates the learned token to probabilistically induce the specified forgetting objective, yielding efficient and stable unlearning.
Link: https://arxiv.org/abs/2505.15674
Authors: Miao Yu, Liang Lin, Guibin Zhang, Xinfeng Li, Junfeng Fang, Ningyu Zhang, Kun Wang, Yang Wang
Affiliations: University of Science and Technology of China; University of the Chinese Academy of Sciences; Tongji University; Nanyang Technological University; National University of Singapore; Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Large language models require iterative updates to address challenges such as knowledge conflicts and outdated information (e.g., incorrect, private, or illegal contents). Machine unlearning provides a systematic methodology for targeted knowledge removal from trained models, enabling elimination of sensitive information influences. However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy and model ability, frequently resulting in catastrophic model collapse under extensive knowledge removal. Meanwhile, in-context unlearning, which relies solely on contextual prompting without modifying the model’s intrinsic mechanisms, suffers from limited generalizability and struggles to achieve true unlearning. In this work, we introduce UniErase, a novel unlearning paradigm that employs learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase operates through two key phases: (I) an optimization stage that binds desired unlearning outputs to the model’s autoregressive probability distribution via token optimization, followed by (II) a lightweight model editing phase that activates the learned token to probabilistically induce specified forgetting objective. Serving as a new research direction for token learning to induce unlearning targets, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings. Remarkably, in terms of TOFU benchmark, UniErase, modifying only around 3.66% of the LLM parameters, outperforms previous forgetting SOTA baseline by around 4.01 times for model ability with even better unlearning efficacy. Similarly, UniErase, maintaining more ability, also surpasses previous retaining SOTA by 35.96% for unlearning efficacy, showing dual top-tier performances in the current unlearning domain.
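Phase (I), binding a forgetting output to the model via a learnable suffix token, resembles soft-prompt optimization. The sketch below is a loose approximation under that reading, with gpt2 and a refusal string as stand-ins; it is not the authors' implementation.

```python
# Sketch of phase (I) of a UniErase-style "unlearning token": optimize a single
# learnable suffix embedding so the model maps a target query to a forgetting
# response. gpt2 and the refusal string are stand-ins, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.requires_grad_(False)  # the base model stays frozen

emb = model.get_input_embeddings()
suffix = torch.nn.Parameter(torch.randn(1, 1, emb.embedding_dim) * 0.02)
opt = torch.optim.Adam([suffix], lr=1e-2)

query = "Who lives at 42 Elm Street?"
target = " I don't know."
q_ids = tok(query, return_tensors="pt").input_ids
t_ids = tok(target, return_tensors="pt").input_ids

for step in range(100):
    inputs = torch.cat([emb(q_ids), suffix, emb(t_ids)], dim=1)
    labels = torch.cat(
        [torch.full((1, q_ids.size(1) + 1), -100), t_ids], dim=1
    )  # loss only on the forgetting target
    loss = model(inputs_embeds=inputs, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
```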
zh
[NLP-33] Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model INTERSPEECH2025
【Quick Read】: This paper addresses the lack of real-time adaptability in existing speech language models during dialogue, particularly the handling of user barge-in and continuous input. The key to the solution is a novel full-duplex speech-to-speech (S2S) architecture that directly models simultaneous user and agent streams through channel fusion; using a pretrained streaming encoder for the user input yields the first duplex S2S model that requires no speech pretraining, while separate architectures for agent and user modeling improve codec fine-tuning for better agent voices and halve the bitrate.
Link: https://arxiv.org/abs/2505.15670
Authors: Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025
Click to view abstract
Abstract:Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLMs. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
zh
[NLP-34] Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
【Quick Read】: This paper examines the risk of private data leakage when open-source large language models (LLMs) are fine-tuned on proprietary data for downstream tasks. The key finding is that, through simple backdoor training and only black-box access to the fine-tuned downstream model, the creator of an open-source LLM can extract the private downstream fine-tuning data. Experiments show that up to 76.3% of the fine-tuning data can be perfectly extracted in practical settings, with the success rate rising to 94.9% under more ideal conditions, exposing a serious data-leakage threat in fine-tuning.
Link: https://arxiv.org/abs/2505.15656
Authors: Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 19 pages
Click to view abstract
Abstract:Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with an improved attack. Overall, we highlight the urgency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at this https URL.
zh
[NLP-35] Word Level Timestamp Generation for Automatic Speech Recognition and Translation INTERSPEECH2025
【Quick Read】: This paper addresses word-level timestamp prediction in speech processing, information that is crucial for downstream tasks such as speech content retrieval and timed subtitles. Traditional approaches rely on external modules for time alignment; this work instead takes a data-driven approach, introducing a new |timestamp| token so that the Canary model directly predicts the start and end timestamps of each word. The key is to use the NeMo Forced Aligner (NFA) as a teacher model to generate word-level timestamps for training Canary, removing the need for a separate alignment mechanism while achieving 80-90% precision and recall, timestamp errors of 20-120 ms, and minimal impact on word error rate (WER).
Link: https://arxiv.org/abs/2505.15646
Authors: Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025
Click to view abstract
Abstract:We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new |timestamp| token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, with minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 milliseconds.
zh
[NLP-36] Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
【Quick Read】: This paper asks how to improve the reasoning ability of a large language model (LLM) without external long chain-of-thought (long CoT) data or fine-tuning. The key to the solution is a steering technique: interpretable features are extracted from vanilla CoT and used to steer the LLM's internal states during generation. Because many LLMs lack a pretrained sparse autoencoder (SAE), the authors further propose an SAE-free steering algorithm that computes steering directions directly from the LLM's residual activations, significantly enhancing reasoning ability.
Link: https://arxiv.org/abs/2505.15634
Authors: Zihao Li, Xu Wang, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, Mengnan Du
Affiliations: New Jersey Institute of Technology; University of California, Santa Barbara; George Mason University; Baidu Inc.
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM’s internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
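The SAE-free variant, deriving a steering direction directly from residual activations and adding it during generation, can be approximated as follows. Everything here (gpt2, the layer index, the scale, the contrast prompts) is an illustrative assumption, not the paper's recipe.

```python
# A sketch of SAE-free activation steering: derive a direction from residual
# activations and add it at a chosen layer during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0

def mean_resid(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER].mean(dim=1)  # (1, hidden)

# Contrast a step-by-step completion with a terse one to get a direction.
direction = mean_resid("Let's think step by step. First, ...") \
          - mean_resid("The answer is 7.")
direction = direction / direction.norm()

def steer(module, inputs, output):
    h = output[0] if isinstance(output, tuple) else output
    h = h + SCALE * direction.to(h.dtype)
    return (h, *output[1:]) if isinstance(output, tuple) else h

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("Q: What is 17 * 6? A:", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=40)[0]))
handle.remove()
```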
zh
[NLP-37] Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions ACL
【Quick Read】: This paper addresses the factual hallucinations that arise when generative AI handles long, technical climate-related documents without remaining faithful to the retrieved context. The key to the solution is automatically assessing the faithfulness of model outputs to the retrieved passages and, during instruction fine-tuning, identifying and excluding unfaithful subsets of the training data to improve faithfulness. Concretely, the authors develop ClimateGPT Faithful+, which raises faithfulness on supported atomic claims from 30% to 57%.
Link: https://arxiv.org/abs/2505.15633
Authors: David Thulke, Jakob Kemmler, Christian Dugast, Hermann Ney
Affiliations: RWTH Aachen University; AppTek GmbH
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the ClimateNLP 2025 Workshop at ACL
Click to view abstract
Abstract:Large language models that use retrieval augmented generation have the potential to unlock valuable knowledge for researchers, policymakers, and the public by making long and technical climate-related documents more accessible. While this approach can help alleviate factual hallucinations by relying on retrieved passages as additional context, its effectiveness depends on whether the model’s output remains faithful to these passages. To address this, we explore the automatic assessment of faithfulness of different models in this setting. We then focus on ClimateGPT, a large language model specialised in climate science, to examine which factors in its instruction fine-tuning impact the model’s faithfulness. By excluding unfaithful subsets of the model’s training data, we develop ClimateGPT Faithful+, which achieves an improvement in faithfulness from 30% to 57% in supported atomic claims according to our automatic metric.
zh
[NLP-38] Mechanistic Insights into Grokking from the Embedding Layer
【Quick Read】: This paper investigates the mechanism behind grokking, the delayed generalization of neural networks after perfect training performance, focusing on the role of embeddings and their effect on generalization. The study finds that embeddings are central to inducing grokking, whereas MLPs without embeddings can generalize immediately. The key to the solution is identifying and mitigating the bilinear coupling effect: frequency-aware sampling and embedding-specific learning rates optimize the interaction between embeddings and downstream weights, accelerating convergence and improving performance.
Link: https://arxiv.org/abs/2505.15624
Authors: H.V. AlquBoj, Hilal AlQuabeh, Velibor Bojkovic, Munachiso Nwadike, Kentaro Inui
Affiliations: MBZUAI; RIKEN
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Mechanistic view of embedding layers
Click to view abstract
Abstract:Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization. To confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric curvature of the bilinear loss landscape. We prove that an adaptive learning rate ratio, \( \frac{\eta_E}{\eta_W} \propto \frac{\sigma_{\max}(E)}{\sigma_{\max}(W)} \cdot \frac{f_W}{f_E} \), mitigates bilinear coupling effects, accelerating convergence. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
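The adaptive learning-rate rule from the abstract is straightforward to instantiate once spectral norms and update frequencies are available; a sketch, with the frequency estimates and base rate as assumptions:

```python
# Sketch of the embedding-specific learning-rate rule from the abstract:
# eta_E / eta_W ∝ (sigma_max(E) / sigma_max(W)) * (f_W / f_E),
# where sigma_max is the largest singular value and f_* are update frequencies.
import torch

def embedding_lr(E: torch.Tensor, W: torch.Tensor,
                 f_E: float, f_W: float, eta_W: float = 1e-3) -> float:
    sigma_E = torch.linalg.matrix_norm(E, ord=2)  # largest singular value
    sigma_W = torch.linalg.matrix_norm(W, ord=2)
    return eta_W * float(sigma_E / sigma_W) * (f_W / f_E)

E = torch.randn(500, 64)   # embedding table (rare-token rows update sparsely)
W = torch.randn(64, 64)    # downstream weight matrix
# Suppose embeddings receive gradient updates ~10x less often than W.
print(f"eta_E = {embedding_lr(E, W, f_E=0.1, f_W=1.0):.2e}")
```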
zh
[NLP-39] Can LLMs understand Math? – Exploring the Pitfalls in Mathematical Reasoning
【Quick Read】: This paper examines the challenges large language models (LLMs) face in mathematical reasoning, especially in executing precise, multi-step logic. Existing evaluation frameworks judge performance solely by final-answer accuracy, overlooking error rates, redundancy, and validity in the reasoning process. The key to the solution is a new evaluation metric, the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
Link: https://arxiv.org/abs/2505.15623
Authors: Tiasa Singha Roy, Aditeya Baral, Ayush Rajesh Jhaveri, Yusuf Baig
Affiliations: New York University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
zh
[NLP-40] Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
【Quick Read】: This paper targets the substantial redundancy in the long reasoning traces generated by Large Reasoning Models (LRMs), aiming to improve reasoning efficiency. The key to the solution is LASER (Length-bAsed StEp Reward shaping), which uses a step function controlled by a target length as the reward, achieving a superior Pareto-optimal balance between performance and efficiency. The method is further extended to LASER-D (Dynamic and Difficulty-aware), which adapts the reward to the model's evolving reasoning behavior during training and to problem difficulty, combining fast and slow thinking to improve the overall trade-off while cutting redundant "self-reflection" content.
Link: https://arxiv.org/abs/2505.15612
Authors: Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He
Affiliations: The Hong Kong University of Science and Technology; City University of Hong Kong; University of Waterloo; Apple
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant “self-reflections”. Resources are at this https URL.
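A step-function length reward of the kind LASER describes can be sketched in a few lines; the exact reward values and the difficulty-aware target rule of LASER-D below are illustrative assumptions, not the paper's constants.

```python
# A sketch of a LASER-style step-function length reward: a correct answer earns
# the full reward only if the chain of thought stays within a target length.
def laser_reward(correct: bool, n_tokens: int, target_len: int) -> float:
    if not correct:
        return 0.0
    return 1.0 if n_tokens <= target_len else 0.5  # step function on length

def laser_d_target(base_len: int, difficulty: float) -> int:
    """Difficulty-aware target: easy queries (difficulty≈0) get tight budgets,
    hard ones (difficulty≈1) get longer budgets (an assumed mapping)."""
    return int(base_len * (0.5 + difficulty))

print(laser_reward(True, 900, target_len=1024))       # 1.0: concise and right
print(laser_reward(True, 3000, target_len=1024))      # 0.5: right but long
print(laser_reward(False, 200, target_len=1024))      # 0.0: wrong
print(laser_d_target(base_len=2048, difficulty=0.9))  # longer budget for hard
```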
zh
[NLP-41] From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning
【Quick Read】: This paper addresses the problem that optimizing large language models (LLMs) for direct question answering undermines effective pedagogy in educational applications, which often requires strategically withholding answers. The key to the solution is an online reinforcement learning (RL) based alignment framework that, through simulated student-tutor interactions, emphasizes pedagogical quality and guided problem-solving over simply giving away answers. The framework quickly adapts LLMs into effective tutors and, without human annotations, trains a 7B-parameter tutor model whose performance matches larger proprietary models.
Link: https://arxiv.org/abs/2505.15607
Authors: David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, Mrinmaya Sachan
Affiliations: ETH Zurich; ETH AI Center; Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science; Hessian Center for AI (hessian.AI); TU Darmstadt
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: David Dinucu-Jianu and Jakub Macina contributed equally. Code available: this https URL
Click to view abstract
Abstract:Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model’s instructional planning.
zh
[NLP-42] MIRB: Mathematical Information Retrieval Benchmark
【Quick Read】: This paper addresses the lack of a unified evaluation benchmark for Mathematical Information Retrieval (MIR) from mathematical documents. Existing MIR applications include theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving, but without a common standard their performance is hard to compare and analyze systematically. The key to the solution is MIRB (Mathematical Information Retrieval Benchmark), which comprises four tasks, semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, across 12 datasets, providing a comprehensive framework for evaluating MIR systems and advancing retrieval models tailored to the mathematical domain.
Link: https://arxiv.org/abs/2505.15585
Authors: Haocheng Ju, Bin Dong
Affiliations: Peking University; Beijing International Center for Mathematical Research; New Cornerstone Science Laboratory; Center for Machine Learning Research; Center for Intelligent Computing; Great Bay Institute for Advanced Study; Great Bay University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Our code and data are available at this https URL and this https URL
Click to view abstract
Abstract:Mathematical Information Retrieval (MIR) is the task of retrieving information from mathematical documents and plays a key role in various applications, including theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving. However, a unified benchmark for evaluating these diverse retrieval tasks has been lacking. In this paper, we introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR. We hope that MIRB provides a comprehensive framework for evaluating MIR systems and helps advance the development of more effective retrieval models tailored to the mathematical domain.
zh
[NLP-43] Semantic-based Unsupervised Framing Analysis (SUFA): A Novel Approach for Computational Framing Analysis
【Quick Read】: This paper addresses the computational framing analysis of entity-centric emphasis frames in news media reports, aiming to identify and assess framing in news content automatically. The key to the solution is Semantic Relations-based Unsupervised Framing Analysis (SUFA), which leverages semantic relations and dependency parsing algorithms to capture entity-centric emphasis patterns in text, enabling quantitative framing analysis of news coverage.
Link: https://arxiv.org/abs/2505.15563
Authors: Mohammad Ali, Naeemul Hassan
Affiliations: University of Maryland, College Park
Subjects: Computation and Language (cs.CL)
Comments: Association for Education in Journalism and Mass Communication (AEJMC) Conference, August 07–10, 2023, Washington, DC, USA
Click to view abstract
Abstract:This research presents a novel approach to computational framing analysis, called Semantic Relations-based Unsupervised Framing Analysis (SUFA). SUFA leverages semantic relations and dependency parsing algorithms to identify and assess entity-centric emphasis frames in news media reports. This innovative method is derived from two studies – qualitative and computational – using a dataset related to gun violence, demonstrating its potential for analyzing entity-centric emphasis frames. This article discusses SUFA’s strengths, limitations, and application procedures. Overall, the SUFA approach offers a significant methodological advancement in computational framing analysis, with its broad applicability across both the social sciences and computational domains.
zh
[NLP-44] Do RAG Systems Suffer From Positional Bias?
【Quick Read】: This paper investigates how positional bias, the tendency of LLMs to weight information differently depending on its position in the prompt, limits Retrieval Augmented Generation by affecting both the model's ability to capitalize on relevant passages and its susceptibility to distracting ones. The key finding is that existing retrieval pipelines, while improving the retrieval of relevant passages, also systematically bring highly distracting passages to the top ranks, so relevant and distracting passages are penalized in turn; this blunts the practical impact of positional bias, and sophisticated strategies that reorder passages by positional preference perform no better than random shuffling.
Link: https://arxiv.org/abs/2505.15561
Authors: Florin Cuconasu, Simone Filice, Guy Horowitz, Yoelle Maarek, Fabrizio Silvestri
Affiliations: Sapienza University of Rome; Technology Innovation Institute
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Click to view abstract
Abstract:Retrieval Augmented Generation enhances LLM accuracy by adding passages retrieved from an external corpus to the LLM prompt. This paper investigates how positional bias - the tendency of LLMs to weight information differently based on its position in the prompt - affects not only the LLM’s capability to capitalize on relevant passages, but also its susceptibility to distracting passages. Through extensive experiments on three benchmarks, we show how state-of-the-art retrieval pipelines, while attempting to retrieve relevant passages, systematically bring highly distracting ones to the top ranks, with over 60% of queries containing at least one highly distracting passage among the top-10 retrieved passages. As a result, the impact of the LLM positional bias, which in controlled settings is often reported as very prominent by related works, is actually marginal in real scenarios since both relevant and distracting passages are, in turn, penalized. Indeed, our findings reveal that sophisticated strategies that attempt to rearrange the passages based on LLM positional preferences do not perform better than random shuffling.
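Given the finding that position-aware reordering does not beat random shuffling, the shuffling baseline is worth seeing; a minimal sketch of prompt assembly, with the prompt template as an assumption:

```python
# Random-shuffle baseline for ordering retrieved passages in a RAG prompt.
import random

def build_rag_prompt(question: str, passages: list[str], seed: int = 0) -> str:
    shuffled = passages[:]               # keep the retriever's list intact
    random.Random(seed).shuffle(shuffled)
    context = "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(shuffled))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

passages = ["Paris is the capital of France.",
            "Lyon is famous for its cuisine.",
            "France borders Spain and Italy."]
print(build_rag_prompt("What is the capital of France?", passages))
```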
zh
[NLP-45] A Survey on Multilingual Mental Disorders Detection from Social Media Data
【Quick Read】: This paper addresses the limited attention paid to non-English text in current mental health disorder detection, particularly the lack of effective digital screening methods in multilingual settings. The key to the solution is analyzing how cultural differences shape online language patterns and self-disclosure behaviors, showing how these factors can affect the performance of natural language processing (NLP) tools, and providing a comprehensive list of multilingual data collections to support the development of NLP models for mental health screening.
Link: https://arxiv.org/abs/2505.15556
Authors: Ana-Maria Bucur, Marcos Zampieri, Tharindu Ranasinghe, Fabio Crestani
Affiliations: Interdisciplinary School of Doctoral Studies, University of Bucharest, Romania; PRHLT Research Center, Universitat Politècnica de València, Spain; Università della Svizzera italiana, Switzerland; George Mason University, USA; Aston University, Birmingham, UK
Subjects: Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:The increasing prevalence of mental health disorders globally highlights the urgent need for effective digital screening methods that can be used in multilingual contexts. Most existing studies, however, focus on English data, overlooking critical mental health signals that may be present in non-English texts. To address this important gap, we present the first survey on the detection of mental health disorders using multilingual social media data. We investigate the cultural nuances that influence online language patterns and self-disclosure behaviors, and how these factors can impact the performance of NLP tools. Additionally, we provide a comprehensive list of multilingual data collections that can be used for developing NLP models for mental health screening. Our findings can inform the design of effective multilingual mental health screening tools that can meet the needs of diverse populations, ultimately improving mental health outcomes on a global scale.
zh
[NLP-46] DayDreamer at CQs-Gen 2025: Generating Critical Questions through Argument Scheme Completion
【Quick Read】: This paper addresses how to generate critical questions that are relevant to an argumentative text and provoke critical thinking. The key to the solution is combining structured argumentation theory with step-by-step reasoning: large language models (LLMs) are conversationally prompted with chain-of-thought to instantiate Walton's argument scheme templates, first producing structured arguments and then generating relevant critical questions, after which the LLM ranks all available critical questions and selects the three most helpful ones based on the original intervention text.
Link: https://arxiv.org/abs/2505.15554
Authors: Wendi Zhou, Ameer Saadat-Yazdi, Nadin Kökciyan
Affiliations: University of Edinburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ArgMining 2025 CQs-Gen shared task
Click to view abstract
Abstract:Critical questions are essential resources to provoke critical thinking when encountering an argumentative text. We present our system for the Critical Questions Generation (CQs-Gen) Shared Task at ArgMining 2025. Our approach leverages large language models (LLMs) with chain-of-thought prompting to generate critical questions guided by Walton’s argumentation schemes. For each input intervention, we conversationally prompt LLMs to instantiate the corresponding argument scheme template to first obtain structured arguments, and then generate relevant critical questions. Following this, we rank all the available critical questions by prompting LLMs to select the top 3 most helpful questions based on the original intervention text. This combination of structured argumentation theory and step-by-step reasoning enables the generation of contextually relevant and diverse critical questions. Our pipeline achieves competitive performance in the final test set, showing its potential to foster critical thinking given argumentative text and detect missing or uninformed claims. Code available at this https URL (DayDreamer).
zh
[NLP-47] Social Bias in Popular Question-Answering Benchmarks
【Quick Read】: This paper addresses the representativeness and fairness biases in widely used question-answering (QA) and reading comprehension (RC) benchmarks, which may lead models to retrieve and generate knowledge unevenly across demographic groups or regions. The key to the solution is a qualitative content analysis and a quantitative data evaluation that reveal insufficient information about the stakeholders involved in benchmark creation, the near-absence of measures addressing social bias, and links between creator and annotator demographics and content bias, motivating more transparent and bias-aware practices for building QA and RC benchmarks.
Link: https://arxiv.org/abs/2505.15553
Authors: Angelie Kraft, Judith Simon, Sonja Schimmler
Affiliations: University of Hamburg; Leuphana University Lüneburg; Weizenbaum Institute; TU Berlin; Fraunhofer FOKUS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Click to view abstract
Abstract:Question-answering (QA) and reading comprehension (RC) benchmarks are essential for assessing the capabilities of large language models (LLMs) in retrieving and reproducing knowledge. However, we demonstrate that popular QA and RC benchmarks are biased and do not cover questions about different demographics or regions in a representative way, potentially due to a lack of diversity of those involved in their creation. We perform a qualitative content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) how social bias is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most analyzed benchmark papers provided insufficient information regarding the stakeholders involved in benchmark creation, particularly the annotators. Notably, just one of the benchmark papers explicitly reported measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. More transparent and bias-aware QA and RC benchmark creation practices are needed to facilitate better scrutiny and incentivize the development of fairer LLMs.
zh
[NLP-48] Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs
【Quick Read】: This paper targets a common form of bias in Large Language Models (LLMs): when two reference concepts in the model's concept space, such as sentiment polarities, are asymmetrically correlated with a target concept, such as a reviewing aspect, the model exhibits unintended bias. Existing bias evaluation methods construct labeled data for different social groups and measure differences in model responses, which requires substantial human effort and covers only a limited set of social concepts. The key to the proposed BiasLens is to exploit the structure of the model's vector space: Concept Activation Vectors (CAVs) are combined with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and bias is quantified by measuring the variation in representational similarity between the target concept and each reference concept, enabling bias analysis without labeled data.
Link: https://arxiv.org/abs/2505.15524
Authors: Lang Gao, Kaiyang Wan, Wei Liu, Chenxi Wang, Zirui Song, Zixiang Xu, Yanbo Wang, Veselin Stoyanov, Xiuying Chen
Affiliations: MBZUAI; Huazhong University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model’s concept space, such as sentiment polarities (e.g., “positive” and “negative”), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of “food” should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model’s vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient’s insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.
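The core measurement, an asymmetry in representational similarity between a target concept and two reference concepts, reduces to a difference of cosine similarities once concept vectors exist. A sketch with random stand-ins for the CAV/SAE-derived vectors:

```python
# A sketch of a BiasLens-style asymmetry score. Deriving real concept vectors
# requires CAVs/SAEs; random vectors stand in for them here.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def asymmetry(target: np.ndarray, ref_pos: np.ndarray, ref_neg: np.ndarray) -> float:
    """Bias as the gap in representational similarity between the target
    concept (e.g., "food") and the two reference poles (e.g., sentiment)."""
    return cosine(target, ref_pos) - cosine(target, ref_neg)

rng = np.random.default_rng(0)
food, positive, negative = (rng.normal(size=256) for _ in range(3))
print(f"asymmetry score: {asymmetry(food, positive, negative):+.3f}")
# Scores near 0 suggest no unintended skew toward either pole.
```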
zh
[NLP-49] Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
【Quick Read】: This paper asks how real, multimodal robot-trajectory data can be used to enhance and evaluate Vision-Language Models (VLMs). Whereas the conventional approach trains VLMs on Internet-scale image-text corpora, this work explores the reverse paradigm: enriching VLMs with rich, real robot trajectories. The key is the Robo2VLM framework, which derives ground truth from non-visual and non-descriptive sensory modalities (such as end-effector pose, gripper aperture, and force sensing), segments each trajectory into a sequence of manipulation phases, and uses scene and interaction understanding to generate visual question answering (VQA) queries with spatial, goal-conditioned, and interaction reasoning, producing the large-scale in-the-wild dataset Robo2VLM-1.
Link: https://arxiv.org/abs/2505.15517
Authors: Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Ken Goldberg
Affiliations: University of California, Berkeley
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm - using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground-truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, task goal, and the target object. The properties are used to generate representative VQA queries - images with textual multiple-choice questions - based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
zh
[NLP-50] Explainable embeddings with Distance Explainer
【Quick Read】: This paper addresses the lack of interpretability in embedded vector spaces: existing explainable AI (XAI) methods rarely target spaces whose dimensions represent complex abstractions. The key to the solution is Distance Explainer, a new method that extends saliency techniques from RISE to explain the distance between two embedded data points through selective masking and distance-ranked mask filtering, producing local, post-hoc explanations.
Link: https://arxiv.org/abs/2505.15516
Authors: Christiaan Meijer, E. G. Patrick Bos
Affiliations: Netherlands eScience Center
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 19 figures. Submitted to JMLR. Method implementation: this https URL
Click to view abstract
Abstract:While eXplainable AI (XAI) has advanced significantly, few methods address interpretability in embedded vector spaces where dimensions represent complex abstractions. We introduce Distance Explainer, a novel method for generating local, post-hoc explanations of embedded spaces in machine learning models. Our approach adapts saliency-based techniques from RISE to explain the distance between two embedded data points by assigning attribution values through selective masking and distance-ranked mask filtering. We evaluate Distance Explainer on cross-modal embeddings (image-image and image-caption pairs) using established XAI metrics including Faithfulness, Sensitivity/Robustness, and Randomization. Experiments with ImageNet and CLIP models demonstrate that our method effectively identifies features contributing to similarity or dissimilarity between embedded data points while maintaining high robustness and consistency. We also explore how parameter tuning, particularly mask quantity and selection strategy, affects explanation quality. This work addresses a critical gap in XAI research and enhances transparency and trustworthiness in deep learning applications utilizing embedded spaces.
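A RISE-style masking loop for distances can be sketched as follows; the linear "embedder", mask density, and aggregation are toy assumptions rather than Distance Explainer's actual design.

```python
# Sketch: explain the distance between two embedded points by randomly masking
# input features, re-embedding, and crediting each kept feature by the
# resulting distance (RISE-style expectation over masks).
import numpy as np

rng = np.random.default_rng(0)
D_IN, N_MASKS = 32, 2000
W = rng.normal(size=(D_IN, 8))

def embed(v: np.ndarray) -> np.ndarray:
    return v @ W                           # stand-in embedding model

x = rng.normal(size=D_IN)
ref = rng.normal(size=D_IN)

saliency = np.zeros(D_IN)
counts = np.zeros(D_IN)
for _ in range(N_MASKS):
    mask = rng.random(D_IN) > 0.5          # keep ~half the features
    d = np.linalg.norm(embed(x * mask) - embed(ref))
    saliency += mask * d                   # weight kept features by the score
    counts += mask
saliency /= np.maximum(counts, 1)
print("features contributing most to the distance:", np.argsort(-saliency)[:5])
```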
zh
[NLP-51] Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
【Quick Read】: This paper addresses the unclear mechanism by which multimodal chain-of-thought (MCoT) improves the performance and interpretability of Large Vision-Language Models (LVLMs). The key insight is the role of visual thoughts in conveying image information to the reasoning process: whether Textual-MCoT (T-MCoT) or Interleaved-MCoT (I-MCoT) is used, visual thoughts improve performance when expressed clearly and concisely, and they act as intermediaries between the input image and the deeper transformer layers, enabling more advanced transmission of visual information.
Link: https://arxiv.org/abs/2505.15510
Authors: Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin
Affiliations: Central South University; Harbin Institute of Technology; Chinese University of Hong Kong; Shanghai AI Laboratory; National University of Singapore; Peking University; ByteDance Seed (China)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Click to view abstract
Abstract:Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information to the reasoning process regardless of the MCoT format, depending only on clarity and conciseness of expression. Furthermore, to explore visual thoughts systematically, we define four distinct forms of visual thought expressions and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying levels of MCoT improvement. Additionally, we explore the internal nature of visual thoughts, finding that visual thoughts serve as intermediaries between the input image and reasoning to deeper transformer layers, enabling more advanced visual information transmission. We hope that the visual thoughts can inspire further breakthroughs for future MCoT research.
zh
[NLP-52] Multilingual Test-Time Scaling via Initial Thought Transfer
【Quick Read】: This paper addresses the inconsistent effectiveness of test-time scaling across languages, in particular the lower reasoning consistency of low-resource languages and the tendency of models to switch to English mid-reasoning. The key to the solution is MITT (Multilingual Initial Thought Transfer), an unsupervised and lightweight reasoning prefix-tuning approach that transfers reasoning prefixes from high-resource languages to other languages, enhancing test-time scaling across all languages and mitigating the inconsistencies in multilingual reasoning performance.
Link: https://arxiv.org/abs/2505.15508
Authors: Prasoon Bajpai, Tanmoy Chakraborty
Affiliations: IIT Delhi
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 9 figures, 5 tables
Click to view abstract
Abstract:Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-LLama-8B and DeepSeek-R1-Distill-Qwen-7B across both high- and low-resource Latin-script languages. Our findings reveal that the relative gains from test-time scaling vary significantly across languages. Additionally, models frequently switch to English mid-reasoning, even when operating under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from English but also have lower internal consistency across generations in their early reasoning. Building on our findings, we introduce MITT (Multilingual Initial Thought Transfer), an unsupervised and lightweight reasoning prefix-tuning approach that transfers high-resource reasoning prefixes to enhance test-time scaling across all languages, addressing inconsistencies in multilingual reasoning performance. MITT significantly boosts DeepSeek-R1-Distill-Qwen-7B’s reasoning performance, especially for underrepresented languages.
zh
[NLP-53] Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs
【Quick Read】: This paper studies how large language models (LLMs) internalize the token sequences encoding Knowledge Graphs during pretraining and turn this memorization into reusable knowledge at inference time. The key to the solution is the concept of protoknowledge, categorized into lexical, hierarchical, and topological forms according to the type of knowledge that needs to be activated. Protoknowledge is measured through Knowledge Activation Tasks (KATs), and a novel analysis framework assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query, providing a practical tool for exploring semantic-level data contamination and an effective strategy for closed-pretraining models.
Link: https://arxiv.org/abs/2505.15501
Authors: Federico Ranaldi, Andrea Zugarini, Leonardo Ranaldi, Fabio Massimo Zanzotto
Affiliations: University of Rome Tor Vergata; University of Edinburgh; expert.ai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
zh
[NLP-54] Collaborative Problem-Solving in an Optimization Game
【Quick Read】: This paper addresses the challenge of dialogue agents collaborating with human users on complex tasks that are NP-hard optimization problems, focusing on a collaborative two-player Traveling Salesman Problem (TSP). The key to the solution is a novel dialogue game framework together with an agent that combines LLM prompting with symbolic mechanisms for state tracking and grounding, enabling effective collaborative exploration and decision-making.
Link: https://arxiv.org/abs/2505.15490
Authors: Isidora Jeknic, Alex Duchnowski, Alexander Koller
Affiliations: Saarland University
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 16 figures
Click to view abstract
Abstract:Dialogue agents that support human users in solving complex tasks have received much attention recently. Many such tasks are NP-hard optimization problems that require careful collaborative exploration of the solution space. We introduce a novel dialogue game in which the agents collaboratively solve a two-player Traveling Salesman problem, along with an agent that combines LLM prompting with symbolic mechanisms for state tracking and grounding. Our best agent solves 45% of games optimally in self-play. It also demonstrates an ability to collaborate successfully with human users and generalize to unfamiliar graphs.
zh
[NLP-55] Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
【Quick Read】: This paper addresses the identification of misleading creator intent in multimodal misinformation detection (MMD), a core factor behind the real-world impact of misinformation. The key to the solution is an automated framework that simulates real-world multimodal news creation by explicitly modeling two components of creator intent, the desired influence and the execution plan, and on this basis constructs DeceptionDecoded, a large-scale benchmark for evaluating models on misleading intent detection, misleading source attribution, and creator desire inference.
Link: https://arxiv.org/abs/2505.15489
Authors: Jiaying Wu, Fanxiao Li, Min-Yen Kan, Bryan Hooi
Affiliations: National University of Singapore; Yunnan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:
Click to view abstract
Abstract:The real-world impact of misinformation stems from the underlying misleading narratives that creators seek to convey. As such, interpreting misleading creator intent is essential for multimodal misinformation detection (MMD) systems aimed at effective information governance. In this paper, we introduce an automated framework that simulates real-world multimodal news creation by explicitly modeling creator intent through two components: the desired influence and the execution plan. Using this framework, we construct DeceptionDecoded, a large-scale benchmark comprising 12,000 image-caption pairs aligned with trustworthy reference articles. The dataset captures both misleading and non-misleading intents and spans manipulations across visual and textual modalities. We conduct a comprehensive evaluation of 14 state-of-the-art vision-language models (VLMs) on three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. Despite recent advances, we observe that current VLMs fall short in recognizing misleading intent, often relying on spurious cues such as superficial cross-modal consistency, stylistic signals, and heuristic authenticity hints. Our findings highlight the pressing need for intent-aware modeling in MMD and open new directions for developing systems capable of deeper reasoning about multimodal misinformation.
zh
[NLP-56] KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance ACL2025
【Quick Read】: This paper addresses the suboptimal performance of supervised fine-tuning (SFT) for domain-specific question answering (QA) caused by conflicts between the internal knowledge of large language models (LLMs) and the contextual knowledge in the training data. The key to the solution is the Knowledge-aware Fine-tuning (KaFT) approach, which adapts the training weights by assigning different rewards to training samples according to their conflict level, effectively improving model performance and generalization.
Link: https://arxiv.org/abs/2505.15480
Authors: Qihuang Zhong, Liang Ding, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao
Affiliations: Wuhan University; The University of Sydney; Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL2025 Findings
Click to view abstract
Abstract:Supervised fine-tuning (SFT) is a common approach to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature reveals that due to the conflicts between LLMs’ internal knowledge and the context knowledge of training data, vanilla SFT using the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then conduct a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with varied conflicts contribute differently, where SFT on the data with large conflicts leads to catastrophic performance drops; 2) compared to directly filtering out the conflict data, appropriately applying the conflict data would be more beneficial. Motivated by this, we propose a simple-yet-effective Knowledge-aware Fine-tuning (namely KaFT) approach to effectively boost LLMs’ performance. The core of KaFT is to adapt the training weight by assigning different rewards for different training samples according to conflict level. Extensive experiments show that KaFT brings consistent and significant improvements across four LLMs. More analyses prove that KaFT effectively improves the model generalization and alleviates the hallucination.
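Conflict-aware sample weighting of the kind KaFT describes can be sketched as a weighted SFT loss; the mapping from conflict score to weight below is an illustrative assumption, not the paper's exact reward schedule.

```python
# Sketch of conflict-aware sample weighting in the spirit of KaFT: samples
# whose context conflicts heavily with the model's internal knowledge get a
# smaller training weight.
import torch

def conflict_weight(conflict: torch.Tensor) -> torch.Tensor:
    """conflict in [0, 1]: 0 = agrees with the model, 1 = strong conflict."""
    return torch.clamp(1.0 - conflict, min=0.1)  # never fully discard a sample

def weighted_sft_loss(per_sample_loss: torch.Tensor,
                      conflict: torch.Tensor) -> torch.Tensor:
    w = conflict_weight(conflict)
    return (w * per_sample_loss).sum() / w.sum()

losses = torch.tensor([1.2, 0.8, 2.5, 0.9])      # per-sample CE losses
conflicts = torch.tensor([0.1, 0.2, 0.95, 0.4])  # detected conflict levels
print(weighted_sft_loss(losses, conflicts))      # high-conflict sample downweighted
```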
zh
[NLP-57] LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models
【Quick Read】: This paper targets gender bias in Large Language Models (LLMs), introducing two datasets, GenBiasEval and GenHintEval, to evaluate the degree of gender bias in LLMs and the consistency of their responses to instructions containing gender hints. The key to the solution is the LFTF (Locating First and Then Fine-Tuning) algorithm: it first ranks specific model blocks by their relevance to gender bias in descending order using the BMI (Block Mitigating Importance Score) metric, and then fine-tunes the block most strongly associated with gender bias, effectively mitigating gender bias while preserving the model's general capabilities.
Link: https://arxiv.org/abs/2505.15475
Authors: Zhanyue Qin, Yue Ding, Deyuan Liu, Qingbin Liu, Junxian Cai, Xi Chen, Zhiying Tu, Dianhui Chu, Cuiyun Gao, Dianbo Sui
Affiliations: Harbin Institute of Technology; Tencent
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Nowadays, Large Language Models (LLMs) have attracted widespread attention due to their powerful performance. However, due to the unavoidable exposure to socially biased data during training, LLMs tend to exhibit social biases, particularly gender bias. To better explore and quantify the degree of gender bias in LLMs, we propose a pair of datasets named GenBiasEval and GenHintEval, respectively. The GenBiasEval is responsible for evaluating the degree of gender bias in LLMs, accompanied by an evaluation metric named AFGB-Score (Absolutely Fair Gender Bias Score). Meanwhile, the GenHintEval is used to assess whether LLMs can provide responses consistent with prompts that contain gender hints, along with the accompanying evaluation metric UB-Score (UnBias Score). Besides, in order to mitigate gender bias in LLMs more effectively, we present the LFTF (Locating First and Then Fine-Tuning) algorithm. The algorithm first ranks specific LLM blocks by their relevance to gender bias in descending order using a metric called BMI (Block Mitigating Importance Score). Based on this ranking, the block most strongly associated with gender bias is then fine-tuned using a carefully designed loss function. Numerous experiments have shown that our proposed LFTF algorithm can significantly mitigate gender bias in LLMs while maintaining their general capabilities.
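The locate-then-fine-tune recipe reduces to freezing all blocks except the top-ranked one; a sketch with placeholder relevance scores standing in for real BMI scores:

```python
# Sketch of the LFTF recipe's second step: freeze everything except the single
# transformer block judged most gender-bias-relevant. The scores here are
# random placeholders; computing real BMI scores requires the paper's probes.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
blocks = model.transformer.h                  # the stack of LLM blocks

bmi_scores = torch.rand(len(blocks))          # placeholder relevance scores
top_block = int(torch.argmax(bmi_scores))
print(f"fine-tuning only block {top_block}")

model.requires_grad_(False)                   # freeze the whole model...
blocks[top_block].requires_grad_(True)        # ...except the chosen block

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable/total:.1%}")
```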
zh
[NLP-58] PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions EMNLP
【Quick Read】: This paper addresses the underexplored application of Multimodal Large Language Models (MLLMs) to complex physics reasoning, where existing physics benchmarks omit critical intermediate steps such as variable identification and process formulation. The key to the solution is PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs along three core dimensions: variable identification, physical process formulation, and solution derivation.
Link: https://arxiv.org/abs/2505.15472
Authors: Song Dai, Yibo Yan, Jiamin Su, Dongfang Zihao, Yubo Gao, Yonghua Hei, Jungang Li, Junyan Zhang, Sicheng Tao, Zhuoran Gao, Xuming Hu
Affiliations: The Hong Kong University of Science and Technology; Beijing Future Brain Education Technology Co., Ltd.
Subjects: Computation and Language (cs.CL)
Comments: 27 pages, 20 figures, EMNLP
Click to view abstract
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
zh
[NLP-59] CoLA: Collaborative Low-Rank Adaptation ACL2025
【速读】: 该论文试图解决参数高效微调(PEFT)方法在多任务场景下因任务间干扰、样本稀缺和噪声干扰而导致性能受限的问题。其解决方案的关键在于提出一种更灵活的LoRA架构——CoLA,并引入三种协作策略,通过更好地利用矩阵A与矩阵B之间的定量关系来提升模型性能,尤其在低样本场景下表现出更强的有效性和鲁棒性。
链接: https://arxiv.org/abs/2505.15471
作者: Yiyun Zhou,Chang Yao,Jingyuan Chen
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025, Findings
点击查看摘要
Abstract:The scaling law of Large Language Models (LLMs) reveals a power-law relationship, showing diminishing returns on performance as model scale increases. While training LLMs from scratch is resource-intensive, fine-tuning a pre-trained model for specific tasks has become a practical alternative. Full fine-tuning (FFT) achieves strong performance; however, it is computationally expensive and inefficient. Parameter-efficient fine-tuning (PEFT) methods, like LoRA, have been proposed to address these challenges by freezing the pre-trained model and adding lightweight task-specific modules. LoRA, in particular, has proven effective, but its application to multi-task scenarios is limited by interference between tasks. Recent approaches, such as Mixture-of-Experts (MOE) and asymmetric LoRA, have aimed to mitigate these issues but still struggle with sample scarcity and noise interference due to their fixed structure. In response, we propose CoLA, a more flexible LoRA architecture with an efficient initialization scheme, and introduce three collaborative strategies to enhance performance by better utilizing the quantitative relationships between matrices A and B. Our experiments demonstrate the effectiveness and robustness of CoLA, outperforming existing PEFT methods, especially in low-sample scenarios. Our data and code are fully publicly available at this https URL.
zh
[NLP-60] Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning
【速读】: 该论文试图解决大型语言模型在增量学习新任务时面临的灾难性遗忘问题(catastrophic forgetting)。现有方法依赖于经验回放、优化约束或任务区分,但在实际场景中存在严格限制。论文提出的解决方案关键在于引入“闪回”(flashbacks)——即少量来自旧任务的提示,并通过约束模型输出与原始模型的偏差来保持旧知识;随后在闪回和新任务之间插值潜在任务,实现对相关潜在任务、新任务和闪回的联合学习,从而缓解闪回数据稀疏性并促进知识共享,实现平滑适应。该方法仅需有限数量的闪回数据,无需访问回放数据且具有任务无关性。
链接: https://arxiv.org/abs/2505.15467
作者: Yukun Zhao,Lingyong Yan,Zhenyang Li,Shuaiqiang Wang,Zhumin Chen,Zhaochun Ren,Dawei Yin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models have achieved remarkable success in various tasks. However, it is challenging for them to learn new tasks incrementally due to catastrophic forgetting. Existing approaches rely on experience replay, optimization constraints, or task differentiation, which encounter strict limitations in real-world scenarios. To address these issues, we propose Joint Flashback Adaptation. We first introduce flashbacks – a limited number of prompts from old tasks – when adapting to new tasks and constrain the deviations of the model outputs compared to the original one. We then interpolate latent tasks between flashbacks and new tasks to enable jointly learning relevant latent tasks, new tasks, and flashbacks, alleviating data sparsity in flashbacks and facilitating knowledge sharing for smooth adaptation. Our method requires only a limited number of flashbacks without access to the replay data and is task-agnostic. We conduct extensive experiments on state-of-the-art large language models across 1000+ instruction-following tasks, arithmetic reasoning tasks, and general reasoning tasks. The results demonstrate the superior performance of our method in improving generalization on new tasks and reducing forgetting in old tasks.
zh
[NLP-61] A Participatory Strategy for AI Ethics in Education and Rehabilitation grounded in the Capability Approach
【速读】: 该论文试图解决如何在特殊教育需求和残疾儿童的包容性教育及临床康复情境中有效应用人工智能(Artificial Intelligence, AI)技术的问题,同时应对由此带来的系统性生态视角、伦理考量与参与式研究的挑战。解决方案的关键在于构建一个以能力方法(Capability Approach)为理论基础的伦理-理论框架,并通过多方利益相关者的参与式研究策略,整合伦理、教育、临床和技术的专业知识,以设计和实施符合包容性学习环境功能性和有效性的AI技术。
链接: https://arxiv.org/abs/2505.15466
作者: Valeria Cesaroni,Eleonora Pasqua,Piercosma Bisconti,Martina Galletti
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:AI-based technologies have significant potential to enhance inclusive education and clinical-rehabilitative contexts for children with Special Educational Needs and Disabilities. AI can enhance learning experiences, empower students, and support both teachers and rehabilitators. However, their usage presents challenges that require a systemic-ecological vision, ethical considerations, and participatory research. Therefore, research and technological development must be rooted in a strong ethical-theoretical framework. The Capability Approach - a theoretical model of disability, human vulnerability, and inclusion - offers a more relevant perspective on functionality, effectiveness, and technological adequacy in inclusive learning environments. In this paper, we propose a participatory research strategy with different stakeholders through a case study on the ARTIS Project, which develops an AI-enriched interface to support children with text comprehension difficulties. Our research strategy integrates ethical, educational, clinical, and technological expertise in designing and implementing AI-based technologies for children’s learning environments through focus groups and collaborative design sessions. We believe that this holistic approach to AI adoption in education can help bridge the gap between technological innovation and ethical responsibility.
zh
[NLP-62] Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个性化对话中面临的问题,特别是在冷启动场景和长期个性化方面的不足。现有基于提示和离线优化的方法由于设计静态且浅层,难以有效适应用户需求的变化。其解决方案的关键在于提出一种基于强化学习的个性化对齐框架(Reinforcement Learning for Personalized Alignment, RLPA),通过让模型与模拟用户模型进行交互,迭代地推断和优化用户画像,并利用双层级奖励机制(包括画像奖励和响应奖励)指导训练过程,从而实现更精准和持续的个性化对话能力。
链接: https://arxiv.org/abs/2505.15456
作者: Weixiang Zhao,Xingyu Sui,Yulin Hu,Jiahe Guo,Haixiao Liu,Biye Li,Yanyan Zhao,Bing Qin,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Du Xiaoman Financial (度小满金融)
类目: Computation and Language (cs.CL)
备注: 30 pages, 18 figures, 10 tables
点击查看摘要
Abstract:Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA’s robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
zh
[NLP-63] Single LLM Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization
【速读】: 该论文试图解决现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在多个子任务(如查询理解与检索优化)中虽已取得进展,但缺乏统一框架集成的问题。其解决方案的关键在于提出RoleRAG,一个通过角色特定的token优化实现高效多任务处理的统一RAG框架。RoleRAG包含六个模块,每个模块负责RAG过程中的特定子任务,并引入查询图以动态表示查询分解状态,所有模块均由同一基础大语言模型(LLM)驱动,通过任务特定的角色token进行区分和优化,从而实现单个LLM实例内模块的动态激活,提升部署效率并降低资源消耗。
链接: https://arxiv.org/abs/2505.15444
作者: Yutao Zhu,Jiajie Jin,Hongjin Qian,Zheng Liu,Zhicheng Dou,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Beijing Academy of Artificial Intelligence, China (中国人工智能北京研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.
zh
[NLP-64] AdUE: Improving uncertainty estimation head for LoRA adapters in LLMs
【速读】: 该论文试图解决在参数高效微调方法(如适配器)中,预训练语言模型在分类任务中的不确定性估计问题。解决方案的关键在于提出一种高效的后验不确定性估计(post-hoc uncertainty estimation, UE)方法——AdUE1,其核心包括:(1)使用最大函数的可微近似以增强基于softmax的不确定性估计;(2)通过L2-SP正则化进一步锚定微调头部权重并规范模型。
链接: https://arxiv.org/abs/2505.15443
作者: Artem Zabolotnyi,Roman Makarov,Mile Mitrovic,Polina Proskura,Oleg Travkin,Roman Alferov,Alexey Zaytsev
机构: Skoltech; Sber; Sber AI Lab; IITP
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 9 pages, 1 figure
点击查看摘要
Abstract:Uncertainty estimation remains a critical challenge in adapting pre-trained language models to classification tasks, particularly under parameter-efficient fine-tuning approaches such as adapters. We introduce AdUE1, an efficient post-hoc uncertainty estimation (UE) method, to enhance softmax-based estimates. Our approach (1) uses a differentiable approximation of the maximum function and (2) applies additional regularization through L2-SP, anchoring the fine-tuned head weights and regularizing the model. Evaluations on five NLP classification datasets across four language models (RoBERTa, ELECTRA, LLaMA-2, Qwen) demonstrate that our method consistently outperforms established baselines such as Mahalanobis distance and softmax response. Our approach is lightweight (no base-model changes) and produces better-calibrated confidence.
zh
[NLP-65] On the Generalization vs Fidelity Paradox in Knowledge Distillation
【速读】: 该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在小型语言模型(Language Models, LMs)中的有效性及其知识迁移机制尚未充分探索的问题。其解决方案的关键在于通过大规模的实证和统计分析,评估不同参数规模的模型在零样本设置下的KD效果,揭示教师模型性能与学生模型表现之间的关系,以及KD对模型推理能力的影响。研究强调了教师信号和logit平滑在蒸馏后学生模型性能中的重要性,并发现了小型LMs更受益于KD,而大型LMs的增益有限这一关键现象。
链接: https://arxiv.org/abs/2505.15442
作者: Suhas Kamasetty Ramesh,Ayan Sengupta,Tanmoy Chakraborty
机构: Indian Institute of Technology Delhi (印度理工学院德里分校)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to 10%, with a peak task-specific gain of 22%, while providing only marginal benefits (~1.3%) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students' performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.
zh
[NLP-66] Set-LLM: A Permutation-Invariant LLM
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理输入时存在的顺序敏感性问题,即模型对输入顺序的依赖性导致其在不同顺序的相同内容上产生不一致的输出。解决方案的关键在于提出Set-LLM,这是一种针对预训练LLMs的新型架构改进,通过引入新的注意力掩码和专为集合设计的位置编码,实现对混合集合文本输入的排列不变性处理,从而消除顺序敏感性。
链接: https://arxiv.org/abs/2505.15433
作者: Beni Egressy,Jan Stühmer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While large language models (LLMs) demonstrate impressive capabilities across numerous applications, their robustness remains a critical concern. This paper is motivated by a specific vulnerability: the order sensitivity of LLMs. This vulnerability manifests itself as the order bias observed when LLMs decide between possible options (for example, a preference for the first option) and the tendency of LLMs to provide different answers when options are reordered. The use cases for this scenario extend beyond the classical case of multiple-choice question answering to the use of LLMs as automated evaluators in AI pipelines, comparing output generated by different models. We introduce Set-LLM, a novel architectural adaptation for pretrained LLMs that enables the processing of mixed set-text inputs with permutation invariance guarantees. The adaptations involve a new attention mask and new positional encodings specifically designed for sets. We provide a theoretical proof of invariance and demonstrate through experiments that Set-LLM can be trained effectively, achieving comparable or improved performance and maintaining the runtime of the original model, while eliminating order sensitivity.
zh
[NLP-67] Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
【速读】: 该论文旨在解决大规模语言模型在处理长序列任务时的效率与上下文理解之间的平衡问题,以及如何在保持高性能的同时降低推理成本。其关键解决方案是引入Hunyuan-TurboS,这是一种结合了Transformer的上下文理解能力与Mamba的长序列处理效率的混合架构,通过自适应的长短期思维链(CoT)机制动态调整计算资源分配,并采用创新的AMF/MF块模式、Grouped-Query Attention和MoE结构,从而实现高效且强大的模型性能。
链接: https://arxiv.org/abs/2505.15431
作者: Ao Liu,Botong Zhou,Can Xu,Chayse Zhou,ChenChen Zhang,Chengcheng Xu,Chenhao Wang,Decheng Wu,Dengpeng Wu,Dian Jiao,Dong Du,Dong Wang,Feng Zhang,Fengzong Lian,Guanghui Xu,Guanwei Zhang,Hai Wang,Haipeng Luo,Han Hu,Huilin Xu,Jiajia Wu,Jianchen Zhu,Jianfeng Yan,Jiaqi Zhu,Jihong Zhang,Jinbao Xue,Jun Xia,Junqiang Zheng,Kai Liu,Kai Zhang,Kai Zheng,Kejiao Li,Keyao Wang,Lan Jiang,Lixin Liu,Lulu Wu,Mengyuan Huang,Peijie Yu,Peiqi Wang,Qian Wang,Qianbiao Xiang,Qibin Liu,Qingfeng Sun,Richard Guo,Ruobing Xie,Saiyong Yang,Shaohua Chen,Shihui Hu,Shuai Li,Shuaipeng Li,Shuang Chen,Suncong Zheng,Tao Yang,Tian Zhang,Tinghao Yu,Weidong Han,Weijie Liu,Weijin Zhou,Weikang Wang,Wesleye Chen,Xiao Feng,Xiaoqin Ren,Xingwu Sun,Xiong Kuang,Xuemeng Huang,Xun Cao,Yanfeng Chen,Yang Du,Yang Zhen,Yangyu Tao,Yaping Deng,Yi Shen,Yigeng Hong,Yiqi Chen,Yiqing Huang,Yuchi Deng,Yue Mao,Yulong Wang,Yuyuan Zeng,Zenan Xu,Zhanhui Kang,Zhe Zhao,ZhenXiang Yan,Zheng Fang,Zhichao Hu,Zhongzhi Chen,Zhuoyu Li,Zongwei Li,Alex Yan,Ande Liang,Baitong Liu,Beiping Pan,Bin Xing,Binghong Wu,Bingxin Qu,Bolin Ni,Boyu Wu,Chen Li,Cheng Jiang,Cheng Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba’s long-sequence processing efficiency with Transformer’s superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep “thinking” modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
zh
[NLP-68] Likelihood Variance as Text Importance for Resampling Texts to Map Language Models
【速读】: 该论文试图解决构建语言模型映射(language model map)时计算成本过高的问题,该映射通过KL散度将多种语言模型嵌入到一个共同空间中进行比较。解决方案的关键在于提出了一种重采样方法,该方法根据每段文本在不同模型间的对数似然方差来选择重要文本,从而以较少的文本数量保持KL散度估计的准确性。实验表明,该方法在仅使用约一半文本的情况下仍能实现与均匀采样相当的性能,并且有助于高效地将新模型集成到现有映射中。
链接: https://arxiv.org/abs/2505.15428
作者: Momose Oyama,Ryo Kishino,Hiroaki Yamagiwa,Hidetoshi Shimodaira
机构: Kyoto University (京都大学); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We address the computational cost of constructing a model map, which embeds diverse language models into a common space for comparison via KL divergence. The map relies on log-likelihoods over a large text set, making the cost proportional to the number of texts. To reduce this cost, we propose a resampling method that selects important texts with weights proportional to the variance of log-likelihoods across models for each text. Our method significantly reduces the number of required texts while preserving the accuracy of KL divergence estimates. Experiments show that it achieves comparable performance to uniform sampling with about half as many texts, and also facilitates efficient incorporation of new models into an existing map. These results enable scalable and efficient construction of language model maps.
zh
[NLP-69] Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions
【速读】: 该论文试图解决扩散模型在生成图像时可能产生不适宜工作(NSFW)内容和社会偏见的问题,这些问题限制了其在实际应用中的可行性。解决方案的关键在于提出一种新颖的自发现方法,用于在嵌入空间中识别语义方向向量,以将文本嵌入限制在安全区域内,从而避免生成有害内容。该方法无需修正输入文本中的具体词汇,而是通过引导整个文本提示向嵌入空间中的安全区域移动,提升模型对潜在不安全提示的鲁棒性。
链接: https://arxiv.org/abs/2505.15427
作者: Zhiwen Li,Die Chen,Mingyuan Fan,Cen Chen,Yaliang Li,Yanhao Wang,Wenmeng Zhou
机构: East China Normal University(华东师范大学); Alibaba Group(阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines.
zh
[NLP-70] NeoN: A Tool for Automated Detection, Linguistic and LLM-Driven Analysis of Neologisms in Polish ICCS
【速读】: 该论文试图解决波兰语新词(neologisms)检测与分析的问题,传统基于词典的方法需要大量人工审核,效率低下。其解决方案的关键在于构建一个多层次的处理流程,结合参考语料库、波兰语特定的语言过滤器、由大语言模型(LLM)驱动的精度提升过滤器以及每日RSS监控,通过上下文感知的词形还原、频率分析和拼写规范化提取候选新词,并整合变体形式,同时利用集成的LLM模块自动生成定义并按领域和情感进行分类,从而显著降低人工工作量并保持高准确性。
链接: https://arxiv.org/abs/2505.15426
作者: Aleksandra Tomaszewska,Dariusz Czerski,Bartosz Żuk,Maciej Ogrodniczuk
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, this is an extended version of a paper accepted for the 25th International Conference on Computational Science (ICCS), 7-9 July 2025
点击查看摘要
Abstract:We present NeoN, a tool for detecting and analyzing Polish neologisms. Unlike traditional dictionary-based methods requiring extensive manual review, NeoN combines reference corpora, Polish-specific linguistic filters, an LLM-driven precision-boosting filter, and daily RSS monitoring in a multi-layered pipeline. The system uses context-aware lemmatization, frequency analysis, and orthographic normalization to extract candidate neologisms while consolidating inflectional variants. Researchers can verify candidates through an intuitive interface with visualizations and filtering controls. An integrated LLM module automatically generates definitions and categorizes neologisms by domain and sentiment. Evaluations show NeoN maintains high accuracy while significantly reducing manual effort, providing an accessible solution for tracking lexical innovation in Polish.
zh
[NLP-71] Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models
【速读】: 该论文旨在解决持续学习(Continual Learning, CL)中由于参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法,如低秩适应(Low-Rank Adaptation, LoRA),在处理新任务时导致的灾难性遗忘问题。现有方法通常通过扩展新的LoRA分支来学习新任务,并强制新旧LoRA分支对旧任务的贡献相等,从而可能引发遗忘。解决方案的关键在于提出一种称为门控集成低秩适应(Gated Integration of Low-Rank Adaptation, GainLoRA)的方法,该方法通过引入门控模块来整合新旧LoRA分支,并最小化新LoRA分支对旧任务的贡献,从而有效缓解遗忘并提升模型性能。
链接: https://arxiv.org/abs/2505.15424
作者: Yan-Shuo Liang,Wu-Jun Li
机构: Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Continual learning (CL), which requires the model to learn multiple tasks sequentially, is crucial for language models (LMs). Recently, low-rank adaptation (LoRA), one of the most representative parameter-efficient fine-tuning (PEFT) methods, has gained increasing attention in CL of LMs. However, most existing CL methods based on LoRA typically expand a new LoRA branch to learn each new task and force the new and old LoRA branches to contribute equally to old tasks, potentially leading to forgetting. In this work, we propose a new method, called gated integration of low-rank adaptation (GainLoRA), for CL of LMs. GainLoRA expands a new LoRA branch for each new task and introduces gating modules to integrate the new and old LoRA branches. Furthermore, GainLoRA leverages the new gating module to minimize the contribution from the new LoRA branch to old tasks, effectively mitigating forgetting and improving the model’s overall performance. Experimental results on CL benchmarks demonstrate that GainLoRA outperforms existing state-of-the-art methods.
zh
[NLP-72] Trends and Challenges in Authorship Analysis: A Review of ML, DL and LLM Approaches
【速读】: 该论文试图解决作者身份分析(authorship analysis)中的关键问题,包括作者归属(author attribution)和作者验证(author verification),旨在提升在不同领域(如法语语言学、学术研究、网络安全和数字内容认证)中对文本作者身份识别的准确性与可靠性。其解决方案的关键在于系统性地回顾从传统机器学习方法到深度学习模型及大语言模型(LLMs)的最新技术进展,分析各类方法的演进、优势与局限,并探讨特征提取技术、数据集使用及当前面临的挑战,特别是低资源语言处理、多语言适应、跨领域泛化和AI生成文本检测等方面的研究空白。
链接: https://arxiv.org/abs/2505.15422
作者: Nudrat Habib,Tosin Adewumi,Marcus Liwicki,Elisa Barney
机构: 未知
类目: Computation and Language (cs.CL)
备注: 25 pages, 3 figures
点击查看摘要
Abstract:Authorship analysis plays an important role in diverse domains, including forensic linguistics, academia, cybersecurity, and digital content authentication. This paper presents a systematic literature review on two key sub-tasks of authorship analysis: Author Attribution and Author Verification. The review explores SOTA methodologies, ranging from traditional ML approaches to DL models and LLMs, highlighting their evolution, strengths, and limitations, based on studies conducted from 2015 to 2024. Key contributions include a comprehensive analysis of methods, techniques, their corresponding feature extraction techniques, datasets used, and emerging challenges in authorship analysis. The study highlights critical research gaps, particularly in low-resource language processing, multilingual adaptation, cross-domain generalization, and AI-generated text detection. This review aims to help researchers by giving an overview of the latest trends and challenges in authorship analysis. It also points out possible areas for future study. The goal is to support the development of better, more reliable, and accurate authorship analysis systems in diverse textual domains.
zh
[NLP-73] ClickSight: Interpreting Student Clickstreams to Reveal Insights on Learning Strategies via LLM s
【速读】: 该论文试图解决数字学习环境中点击流数据(clickstream data)难以解释的问题,因其具有高维度和细粒度的特点,传统方法如手工特征提取、专家标注、聚类或监督模型往往缺乏泛化性和可扩展性。解决方案的关键在于引入ClickSight,这是一个基于上下文大型语言模型(Large Language Model, LLM)的管道,能够将原始点击流数据和学习策略列表作为输入,生成对学生学习行为的文本解释,从而揭示其学习策略。
链接: https://arxiv.org/abs/2505.15410
作者: Bahar Radmehr,Ekaterina Shved,Fatma Betül Güreş,Adish Singla,Tanja Käser
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted in the Late-breaking Results track at AIED 2025 (26th International Conference on Artificial Intelligence in Education, July 22-26, 2025, Palermo, Italy)
点击查看摘要
Abstract:Clickstream data from digital learning environments offer valuable insights into students’ learning behaviors, but are challenging to interpret due to their high dimensionality and granularity. Prior approaches have relied mainly on handcrafted features, expert labeling, clustering, or supervised models, therefore often lacking generalizability and scalability. In this work, we introduce ClickSight, an in-context Large Language Model (LLM)-based pipeline that interprets student clickstreams to reveal their learning strategies. ClickSight takes raw clickstreams and a list of learning strategies as input and generates textual interpretations of students’ behaviors during interaction. We evaluate four different prompting strategies and investigate the impact of self-refinement on interpretation quality. Our evaluation spans two open-ended learning environments and uses a rubric-based domain-expert evaluation. Results show that while LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies by prompting strategy, and self-refinement offers limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.
zh
[NLP-74] How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在增强推理能力的同时可能带来的安全性能下降问题。研究的核心在于通过监督微调(Supervised Fine-Tuning, SFT)提升LRMs的安全性。其关键在于识别并解决数据蒸馏过程中的三个主要失败模式,并通过简化推理过程和混合数学推理数据来优化安全性能与过度拒绝之间的平衡。
链接: https://arxiv.org/abs/2505.15404
作者: Zhexin Zhang,Xian Qi Loye,Victor Shea-Jay Huang,Junxiao Yang,Qi Zhu,Shiyao Cui,Fei Mi,Lifeng Shang,Yingkang Wang,Hongning Wang,Minlie Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages
点击查看摘要
Abstract:Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance-and in some cases, may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using short or template-based reasoning process can attain comparable safety performance-and are significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data during safety fine-tuning is helpful to balance safety and over-refusal. Overall, we hope our empirical study could provide a more holistic picture on enhancing the safety of LRMs. The code and data used in our experiments are released in this https URL.
zh
[NLP-75] When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在简单任务中因冗余推理导致的计算开销过大的问题。其关键解决方案是提出自适应自我恢复推理(Adaptive Self-Recovery Reasoning, ASRR)框架,该框架通过引入感知准确性的长度奖励调节机制,根据问题难度自适应地分配推理资源,从而抑制不必要的推理并实现隐式恢复,显著提升了模型的效率与安全性。
链接: https://arxiv.org/abs/2505.15400
作者: Xiaoyun Zhang,Jingqing Ruan,Xing Ma,Yawen Zhu,Haodong Zhao,Hao Li,Jiansong Chen,Ke Zeng,Xunliang Cai
机构: Meituan(美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of “Internal Self-Recovery Mechanism” where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
zh
[NLP-76] An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中存在的一种认知偏差——锚定效应(anchoring effect)问题,即模型在决策过程中过度依赖初始信息作为参考点。其解决方案的关键在于引入一个新的数据集SynAnchors,并结合优化的评估指标对当前主流LLMs进行基准测试,发现LLMs的锚定偏差普遍存在且难以通过传统策略消除,而推理能力可在一定程度上缓解该偏差。研究强调,LLMs的评估应关注认知偏差意识下的可信度评估,而非仅依赖标准基准或过度优化的鲁棒性测试。
链接: https://arxiv.org/abs/2505.15392
作者: Yiming Huang,Biquan Bie,Zuqiu Na,Weilin Ruan,Songxin Lei,Yutao Yue,Xinlei He
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The rise of Large Language Models (LLMs) like ChatGPT has advanced natural language processing, yet concerns about cognitive biases are growing. In this paper, we investigate the anchoring effect, a cognitive bias where the mind relies heavily on the first information as anchors to make affected judgments. We explore whether LLMs are affected by anchoring, the underlying mechanisms, and potential mitigation strategies. To facilitate studies at scale on the anchoring effect, we introduce a new dataset, SynAnchors. Combining refined evaluation metrics, we benchmark current widely used LLMs. Our findings show that LLMs’ anchoring bias exists commonly with shallow-layer acting and is not eliminated by conventional strategies, while reasoning can offer some mitigation. This recontextualization via cognitive psychology urges that LLM evaluations focus not on standard benchmarks or over-optimized robustness tests, but on cognitive-bias-aware trustworthy evaluation.
zh
[NLP-77] Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
【速读】: 该论文试图解决当前视觉-语言模型(Vision-Language Models, VLMs)在面对普通用户分享的模因图像时的安全性问题,即现有评估多基于人工生成图像,缺乏对真实场景下安全风险的全面认识。解决方案的关键在于引入MemeSafetyBench,这是一个包含50,430个实例的基准测试集,将真实模因图像与有害及无害指令进行配对,并通过综合安全分类法和大语言模型(LLM)生成指令,对多个VLMs在单轮和多轮交互中的表现进行评估。
链接: https://arxiv.org/abs/2505.15389
作者: DongGeon Lee,Joonwon Jang,Jihae Jeong,Hwanjo Yu
机构: POSTECH(浦项科技大学); LG AI Research(LG人工智能研究院)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs show greater vulnerability to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.
zh
[NLP-78] RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中幻觉(hallucination)问题,特别是现有方法在检测幻觉时缺乏对幻觉产生原因的解释能力,即无法明确指出输入中的哪些部分容易引发幻觉。解决方案的关键在于提出RePPL方法,通过重新校准不确定性度量,结合语义传播中的不确定性与语言生成中的概率选择,为每个标记分配可解释的不确定性得分,并以困惑度风格的对数平均形式聚合为总得分,从而实现对幻觉的全面检测与解释。
链接: https://arxiv.org/abs/2505.15386
作者: Yiming Huang,Junyan Zhang,Zihao Wang,Biquan Bie,Xuming Hu,Yi R.(May)Fung,Xinlei He
机构: The Hong Kong University of Science and Technology, Guangzhou (香港科技大学广州校区); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have become powerful, but hallucinations remain a vital obstacle to their trustworthy use. While previous works improved the capability of hallucination detection by measuring uncertainty, they all lack the ability to explain the provenance behind why hallucinations occur, i.e., which part of the inputs tends to trigger hallucinations. Recent works on the prompt attack indicate that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, due to its probability-based selection of high-level semantics for sampled generations. Based on that, we propose RePPL to recalibrate uncertainty measurement by these two aspects, which dispatches explainable uncertainty scores to each token and aggregates in Perplexity-style Log-Average form as total score. Experiments show that our method achieves the best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833), and our method is capable of producing token-level uncertainty scores as explanations for the hallucination. Leveraging these scores, we preliminarily find the chaotic pattern of hallucination and showcase its promising usage.
zh
[NLP-79] X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System ACL2025
【速读】: 该论文试图解决当前语言代理(language agent)研究主要集中在英语场景,而全球存在7000多种语言,这些语言对等效代理服务的需求未被充分满足的问题。解决方案的关键是引入X-WebAgentBench,这是一个新型的多语言代理基准测试平台,旨在评估语言代理在交互式网络环境中的规划与交互性能,从而推动全球代理智能的发展。
链接: https://arxiv.org/abs/2505.15372
作者: Peng Wang,Ruihan Tao,Qiguang Chen,Mengkang Hu,Libo Qin
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 Findings
点击查看摘要
Abstract:Recently, large language model (LLM)-based agents have achieved significant success in interactive environments, attracting significant academic and industrial attention. Despite these advancements, current research predominantly focuses on English scenarios. In reality, there are over 7,000 languages worldwide, all of which demand access to comparable agentic services. Nevertheless, the development of language agents remains inadequate for meeting the diverse requirements of multilingual agentic applications. To fill this gap, we introduce X-WebAgentBench, a novel multilingual agent benchmark in an interactive web environment, which evaluates the planning and interaction performance of language agents across multiple languages, thereby contributing to the advancement of global agent intelligence. Additionally, we assess the performance of various LLMs and cross-lingual alignment methods, examining their effectiveness in enhancing agents. Our findings reveal that even advanced models like GPT-4o, when combined with cross-lingual techniques, fail to achieve satisfactory results. We hope that X-WebAgentBench can serve as a valuable benchmark for multilingual agent scenario in real-world applications.
zh
[NLP-80] Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition
【速读】: 该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在安全关键场景下的可靠性问题,特别是其在紧急情况识别中的系统性误报问题。解决方案的关键在于构建了一个名为VERI的诊断基准数据集,该数据集包含200张图像(100对对比样本),每种紧急场景均通过多阶段人工验证和迭代优化与视觉相似但安全的对照场景相匹配,从而为评估模型在安全敏感场景下的表现提供了可靠依据。
链接: https://arxiv.org/abs/2505.15367
作者: Dasol Choi,Seunghyun Lee,Youngsook Song
机构: Yonsei University (延世大学); MODULABS; Lablup Inc. (Lablup公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages
点击查看摘要
Abstract:Vision-Language Models (VLMs) have demonstrated impressive capabilities in understanding visual content, but their reliability in safety-critical contexts remains under-explored. We introduce VERI (Visual Emergency Recognition Dataset), a carefully designed diagnostic benchmark of 200 images (100 contrastive pairs). Each emergency scene is matched with a visually similar but safe counterpart through multi-stage human verification and iterative refinement. Using a two-stage protocol - risk identification and emergency response - we evaluate 14 VLMs (2B-124B parameters) across medical emergencies, accidents, and natural disasters. Our analysis reveals a systematic overreaction problem: models excel at identifying real emergencies (70-100 percent success rate) but suffer from an alarming rate of false alarms, misidentifying 31-96 percent of safe situations as dangerous, with 10 scenarios failed by all models regardless of scale. This “better-safe-than-sorry” bias manifests primarily through contextual overinterpretation (88-93 percent of errors), challenging VLMs’ reliability for safety applications. These findings highlight persistent limitations that are not resolved by increasing model scale, motivating targeted approaches for improving contextual safety assessment in visually misleading scenarios.
zh
[NLP-81] AI vs. Human Judgment of Content Moderation: LLM -as-a-Judge and Ethics-Based Response Refusals
【速读】: 该论文试图解决在高风险场景下,大型语言模型(Large Language Models, LLMs)对伦理敏感提示的拒绝响应在自动化评估系统与人类用户之间的评价差异问题。研究发现,基于模型的评估系统(LLM-as-a-Judge)对伦理性拒绝响应的评价显著优于人类用户,而对技术性拒绝响应则无此差异,这种差异被称为“内容审核偏差”。解决方案的关键在于识别并分析自动化评估系统中隐含的价值判断和规范假设,以提高评估透明度和价值对齐程度。
链接: https://arxiv.org/abs/2505.15365
作者: Stefan Pasch
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:As large language models (LLMs) are increasingly deployed in high-stakes settings, their ability to refuse ethically sensitive prompts-such as those involving hate speech or illegal activities-has become central to content moderation and responsible AI practices. While refusal responses can be viewed as evidence of ethical alignment and safety-conscious behavior, recent research suggests that users may perceive them negatively. At the same time, automated assessments of model outputs are playing a growing role in both evaluation and training. In particular, LLM-as-a-Judge frameworks-in which one model is used to evaluate the output of another-are now widely adopted to guide benchmarking and fine-tuning. This paper examines whether such model-based evaluators assess refusal responses differently than human users. Drawing on data from Chatbot Arena and judgments from two AI judges (GPT-4o and Llama 3 70B), we compare how different types of refusals are rated. We distinguish ethical refusals, which explicitly cite safety or normative concerns (e.g., “I can’t help with that because it may be harmful”), and technical refusals, which reflect system limitations (e.g., “I can’t answer because I lack real-time data”). We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users, a divergence not observed for technical refusals. We refer to this divergence as a moderation bias-a systematic tendency for model-based evaluators to reward refusal behaviors more than human users do. This raises broader questions about transparency, value alignment, and the normative assumptions embedded in automated evaluation systems.
zh
[NLP-82] NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging
【速读】: 该论文试图解决传统代码调试方法在处理需要深入理解算法逻辑的复杂编程错误时效果有限的问题,以及自然语言格式在调试任务中的有效性与具体优势尚未明确的问题。解决方案的关键在于提出NL-DEBUGGING框架,该框架采用自然语言作为中间表示,通过在自然语言层面进行调试,利用执行反馈直接引导修正,从而提升调试效果并扩大修改空间。
链接: https://arxiv.org/abs/2505.15356
作者: Weiming Zhang,Qingyao Li,Xinyi Dai,Jizheng Chen,Kounianhua Du,Weinan Zhang,Weiwen Liu,Yasheng Wang,Ruiming Tang,Yong Yu
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Debugging is a critical aspect of LLM’s coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.
zh
[NLP-83] Decoding Phone Pairs from MEG Signals Across Speech Modalities
【速读】: 该论文旨在解决如何通过脑磁图(MEG)信号解码语音生产与感知过程中的音素(phone)信息的问题,以推动对语音生成神经机制的理解及通信技术的发展。其解决方案的关键在于利用多种机器学习方法,包括正则化线性模型和神经网络架构,对17名受试者在语音生产与被动聆听任务中的脑活动进行音素分类,并发现语音生产状态下的解码准确率显著高于被动聆听和语音回放状态,表明发声过程中蕴含更丰富的神经信息。此外,研究还揭示了低频脑电波(Delta和Theta频段)在解码中的重要作用,同时指出传统正则化技术在处理高维、有限数据集时的有效性。
链接: https://arxiv.org/abs/2505.15355
作者: Xabier de Zuazo,Eva Navas,Ibon Saratxaga,Mathieu Bourguignon,Nicola Molinaro
机构: University of the Basque Country (Universidad del País Vasco); Université libre de Bruxelles (Université libre de Bruxelles); BCBL (BCBL)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 21 pages, 4 figures, 1 graphical abstract, submitted to Computer Speech and Language (special issue on Iberian Languages)
点击查看摘要
Abstract:Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) compared to passive listening and playback modalities (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. Besides, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed the most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite using advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or if residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.
zh
[NLP-84] Revealing Language Model Trajectories via Kullback-Leibler Divergence
【速读】: 该论文试图解决如何有效估计语言模型之间KL散度(Kullback-Leibler divergence)的问题,特别是针对不同架构的模型。其解决方案的关键在于通过基于对数似然向量(log-likelihood vectors)分配坐标的方法,实现对KL散度的高效估算。
链接: https://arxiv.org/abs/2505.15353
作者: Ryo Kishino,Yusuke Takase,Momose Oyama,Hiroaki Yamagiwa,Hidetoshi Shimodaira
机构: Kyoto University (京都大学); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:A recently proposed method enables efficient estimation of the KL divergence between language models, including models with different architectures, by assigning coordinates based on log-likelihood vectors. To better understand the behavior of this metric, we systematically evaluate KL divergence across a wide range of conditions using publicly available language models. Our analysis covers comparisons between pretraining checkpoints, fine-tuned and base models, and layers via the logit lens. We find that trajectories of language models, as measured by KL divergence, exhibit a spiral structure during pretraining and thread-like progressions across layers. Furthermore, we show that, in terms of diffusion exponents, model trajectories in the log-likelihood space are more constrained than those in weight space.
zh
[NLP-85] he Super Emotion Dataset
【速读】: 该论文试图解决自然语言处理(NLP)领域中情感分类数据集缺乏标准化、大规模资源的问题(emotion classification datasets),现有数据集存在情感类别不一致、样本量有限或专注于特定领域等缺陷。解决方案的关键在于通过整合多种文本来源,构建一个基于Shaver的实证验证情感分类体系(Shaver’s empirically validated emotion taxonomy)的统一框架,从而实现跨领域的更一致的情感识别研究。
链接: https://arxiv.org/abs/2505.15348
作者: Enric Junqué de Fortuny
机构: IESE
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Despite the wide-scale usage and development of emotion classification datasets in NLP, the field lacks a standardized, large-scale resource that follows a psychologically grounded taxonomy. Existing datasets either use inconsistent emotion categories, suffer from limited sample size, or focus on specific domains. The Super Emotion Dataset addresses this gap by harmonizing diverse text sources into a unified framework based on Shaver’s empirically validated emotion taxonomy, enabling more consistent cross-domain emotion recognition research.
zh
[NLP-86] FlowKV: Enhancing Multi-Turn Conversational Coherence in LLM s via Isolated Key-Value Cache Management
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮对话应用中,由于对话历史线性增长导致的键值缓存(KV Cache)管理瓶颈问题。现有淘汰策略通过反复压缩早期对话上下文来降低计算成本,但往往导致信息丢失和上下文遗忘。论文提出的解决方案是FlowKV,其关键在于一种多轮隔离机制,该机制能够保留之前轮次压缩后的KV缓存,并仅对最新一轮生成的KV对进行压缩,从而有效避免旧上下文的重复压缩,缓解灾难性遗忘问题。
链接: https://arxiv.org/abs/2505.15347
作者: Xiang Liu,Hong Chen,Xuming Hu,Xiaowen Chu
机构: The Hong Kong University of Science and Technology, Guangzhou (香港科技大学广州校区)
类目: Computation and Language (cs.CL)
备注: 18 pages
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management, which can be applied to any KV Cache compression method without training. FlowKV's core innovation is a multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the re-compression of older context and thereby mitigating catastrophic forgetting. Our results demonstrate that FlowKV consistently and significantly outperforms baseline strategies in maintaining instruction-following accuracy and user preference retention from 10.90% to 75.40%, particularly in later conversational turns.
zh
[NLP-87] Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors
【速读】: 该论文试图解决生成式 AI (Generative AI) 生成文本被检测工具识别的问题,特别是针对学术剽窃等滥用行为所开发的检测算法。现有方法需要大量数据和计算资源来训练专门的改写器,并且在面对先进的检测算法时攻击效果显著下降。解决方案的关键是提出一种无需训练的对比改写攻击(Contrastive Paraphrase Attack, CoPA),通过精心设计的指令引导预训练语言模型生成更接近人类风格的文本,并构建辅助的机器风格词分布作为对比,在解码过程中减去机器风格特征,从而生成更难以被检测工具识别的文本。
链接: https://arxiv.org/abs/2505.15337
作者: Hao Fang,Jiawei Kong,Tianqu Zhuang,Yixiang Qiu,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang,Min Zhang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Pengcheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
zh
[NLP-88] Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation ACL2025
【速读】: 该论文旨在解决文本无关的语音到语音翻译(textless speech-to-speech translation, S2ST)中的两个主要挑战:跨模态(cross-modal, CM)特征提取和跨语言(cross-lingual, CL)长序列对齐。解决方案的关键在于引入“单元语言”(unit language),它是一种基于n-gram语言建模构建的类似文本的表示形式,并通过多任务学习将其用于指导语音建模过程。为了解决同时应用源语言和目标语言单元语言时出现的冲突,论文进一步提出了任务提示建模(task prompt modeling)。实验结果表明,该方法在Voxpupil数据集的四种语言上显著优于强基线模型,并达到了与使用文本训练的模型相当的性能。
链接: https://arxiv.org/abs/2505.15333
作者: Yuhao Zhang,Xiangnan Ma,Kaiqi Kou,Peizhuo Liu,Weiqiao Shan,Benyou Wang,Tong Xiao,Yuxin Huang,Zhengtao Yu,Jingbo Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to ACL 2025 Findings
点击查看摘要
Abstract:The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features from various speech signals, called cross-modal (CM), and 2) learning alignment of different languages in long sequences, called cross-lingual (CL). We propose the unit language to overcome the two modeling challenges. The unit language can be considered a text-like representation format, constructed using n-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the Voxpupil dataset. Our method demonstrates significant improvements over a strong baseline and achieves performance comparable to models trained with text.
zh
[NLP-89] Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack
【速读】: 该论文试图解决在基于第一令牌概率(first-token probability, FTP)的多项选择题问答(MCQA)任务中,模型可能因误对齐(misalignment)或误解释(misinterpretation)而导致评估不可靠的问题。解决方案的关键在于提出一种名为预填充攻击(prefilling attack)的方法,即在模型输出前添加一个结构化的自然语言前缀(例如:“The correct option is:”),以引导模型生成清晰且有效的答案选项,而无需修改模型参数。该方法显著提升了多个LLM和MCQA基准上的准确性、校准度和输出一致性,同时保持了较高的效率。
链接: https://arxiv.org/abs/2505.15323
作者: Silvia Cappelletti,Tobia Poppi,Samuele Poppi,Zheng-Xin Yong,Diego Garcia-Olano,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
机构: University of Modena and Reggio Emilia (摩德纳和雷焦艾米利亚大学); University of Pisa (比萨大学); Brown University (布朗大学); Meta (Meta)
类目: Computation and Language (cs.CL)
备注: 13 pages, 5 figures, 7 tables
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using first-token probability (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (misalignment) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (misinterpretation), undermining the reliability of symbolic evaluation. We propose a simple solution: the prefilling attack, a structured natural-language prefix (e.g., “The correct option is:”) prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
zh
[NLP-90] Emotional Supporters often Use Multiple Strategies in a Single Turn
【速读】: 该论文试图解决情感支持对话(Emotional Support Conversations, ESC)任务中现有定义过于简化的问题,即传统方法将支持性回应建模为单一策略-话语对,而忽略了实际对话中情感支持者常在一个回合内连续使用多种策略的现象。解决方案的关键在于重新定义ESC任务,要求根据对话历史生成完整的策略-话语对序列,并引入监督深度学习模型和大语言模型(Large Language Models, LLMs)进行建模,从而更准确地捕捉和支持复杂的情感交互过程。
链接: https://arxiv.org/abs/2505.15316
作者: Xin Bai,Guanyi Chen,Tingting He,Chenlian Zhou,Yu Liu
机构: Faculty of Artificial Intelligence in Education; Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning; National Language Resources Monitor and Research Center for Network Media; School of Computer Science, Central China Normal University
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Emotional Support Conversations (ESC) are crucial for providing empathy, validation, and actionable guidance to individuals in distress. However, existing definitions of the ESC task oversimplify the structure of supportive responses, typically modelling them as single strategy-utterance pairs. Through a detailed corpus analysis of the ESConv dataset, we identify a common yet previously overlooked phenomenon: emotional supporters often employ multiple strategies consecutively within a single turn. We formally redefine the ESC task to account for this, proposing a revised formulation that requires generating the full sequence of strategy-utterance pairs given a dialogue history. To facilitate this refined task, we introduce several modelling approaches, including supervised deep learning models and large language models. Our experiments show that, under this redefined task, state-of-the-art LLMs outperform both supervised models and human supporters. Notably, contrary to some earlier findings, we observe that LLMs frequently ask questions and provide suggestions, demonstrating more holistic support capabilities.
zh
[NLP-91] Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
【速读】: 该论文试图解决当前大型语言模型(Large Language Model, LLM)推理中强化学习(Reinforcement Learning, RL)管道主要依赖策略梯度方法,而价值函数方法未被充分探索的问题。其解决方案的关键在于重新引入经典的Bellman残差最小化(Bellman Residual Minimization)思想,并提出轨迹Bellman残差最小化(Trajectory Bellman Residual Minimization, TBRM),该算法通过模型自身的logits作为Q值,直接优化单个轨迹级别的Bellman目标,从而实现一种无需批评者、重要性采样比例或裁剪的简单且有效的离策略算法。
链接: https://arxiv.org/abs/2505.15311
作者: Yurun Yuan,Fan Chen,Zeyu Jia,Alexander Rakhlin,Tengyang Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model’s own logits as Q-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines, like PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.
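下面是一个高度简化的 PyTorch 玩具示意,展示“以模型自身 logits 作为 Q 值、在轨迹层面最小化 Bellman 残差”的大致形态。其中折扣因子、软值函数的取法与奖励形状都是演示用的假设,并不对应论文的精确目标:

```python
import torch

def trajectory_bellman_residual(token_logits, chosen_ids, rewards, gamma=1.0):
    """token_logits: [T, V] 每步词表 logits;chosen_ids: [T];rewards: [T](常见做法只在末步给奖励)。
    以所选 token 的 logit 作为 Q(s_t, a_t),以下一步 logits 的 logsumexp 作为软状态值 V(s_{t+1})。"""
    q = token_logits.gather(1, chosen_ids.unsqueeze(1)).squeeze(1)  # [T]
    v_next = torch.logsumexp(token_logits, dim=1)                   # [T] 软值函数
    v_next = torch.cat([v_next[1:], torch.zeros(1)])                # 末步之后 V = 0
    target = rewards + gamma * v_next
    return ((q - target.detach()) ** 2).mean()

# 玩具数据:长度 4 的轨迹、词表大小 8,仅末步有奖励
T, V = 4, 8
logits = torch.randn(T, V, requires_grad=True)
chosen = torch.randint(0, V, (T,))
rewards = torch.tensor([0.0, 0.0, 0.0, 1.0])
loss = trajectory_bellman_residual(logits, chosen, rewards)
loss.backward()
print(float(loss))
```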
zh
[NLP-92] Multi-Hop Question Generation via Dual-Perspective Keyword Guidance ACL2025
【速读】: 该论文旨在解决多跳问答生成(Multi-hop Question Generation, MQG)中如何有效定位与问题-答案(QA)对相关的关键信息片段的问题。现有方法通常依赖关键词,但未能充分利用关键词的指导作用,并忽略了问题特定关键词与文档特定关键词的不同作用。解决方案的关键在于定义双视角关键词(即问题关键词和文档关键词),并提出一种双视角关键词引导的框架(Dual-Perspective Keyword-Guided, DPKG),该框架将关键词无缝集成到多跳问题生成过程中,通过问题关键词捕捉提问者意图,文档关键词反映与QA对相关的内容,二者协同工作以精确定位文档中的关键信息片段。
链接: https://arxiv.org/abs/2505.15299
作者: Maodong Li,Longyin Zhang,Fang Kong
机构: Soochow University (苏州大学); A*STAR (新加坡科技研究局)
类目: Computation and Language (cs.CL)
备注: 17 pages, 5 figures, accepted to the Findings of ACL 2025
点击查看摘要
Abstract:Multi-hop question generation (MQG) aims to generate questions that require synthesizing multiple information snippets from documents to derive target answers. The primary challenge lies in effectively pinpointing crucial information snippets related to question-answer (QA) pairs, typically relying on keywords. However, existing works fail to fully utilize the guiding potential of keywords and neglect to differentiate the distinct roles of question-specific and document-specific keywords. To address this, we define dual-perspective keywords (i.e., question and document keywords) and propose a Dual-Perspective Keyword-Guided (DPKG) framework, which seamlessly integrates keywords into the multi-hop question generation process. We argue that question keywords capture the questioner’s intent, whereas document keywords reflect the content related to the QA pair. Functionally, question and document keywords work together to pinpoint essential information snippets in the document, with question keywords required to appear in the generated question. The DPKG framework consists of an expanded transformer encoder and two answer-aware transformer decoders for keyword and question generation, respectively. Extensive experiments demonstrate the effectiveness of our work, showcasing its promising performance and underscoring its significant value in the MQG task.
zh
[NLP-93] AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶任务中存在幻觉、推理效率低下及真实场景验证不足的问题,从而影响感知准确性和稳健的逐步推理能力。其解决方案的关键在于提出一种名为AgentThink的统一框架,首次将思维链(Chain-of-Thought, CoT)推理与动态的代理式工具调用相结合,通过结构化数据生成、两阶段训练流程以及代理式工具使用评估等核心创新,显著提升了模型的推理性能和工具调用能力。
链接: https://arxiv.org/abs/2505.15298
作者: Kangan Qian,Sicong Jiang,Yang Zhong,Ziang Luo,Zilin Huang,Tianze Zhu,Kun Jiang,Mengmeng Yang,Zheng Fu,Jinyu Miao,Yining Shi,He Zhe Lim,Li Liu,Tianbao Zhou,Hongyi Wang,Huang Yu,Yifei Hu,Guang Li,Guang Chen,Hao Ye,Lijun Sun,Diange Yang
机构: Tsinghua University (清华大学); McGill University (麦吉尔大学); Xiaomi Corporation (小米公司); University of Wisconsin – Madison (威斯康星大学麦迪逊分校)
类目: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 8 figures
点击查看摘要
Abstract:Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink’s core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model’s tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.
zh
[NLP-94] Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites
【速读】: 该论文试图解决在去除攻击性语言的同时保持说话者原始意图的问题,这是提升在线互动质量的关键挑战。其解决方案的关键在于构建ToxiRewriteCN数据集,该数据集是首个专门设计用于保留情感极性的中文去毒化数据集,包含1,556个经过仔细标注的三元组,涵盖多种现实场景,并通过四个维度评估了17种大型语言模型(LLMs)的表现,旨在支持未来针对中文的可控、情感感知的去毒化研究。
链接: https://arxiv.org/abs/2505.15297
作者: Xintong Wang,Yixiao Liu,Jingheng Pan,Liang Ding,Longyue Wang,Chris Biemann
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Detoxifying offensive language while preserving the speaker’s original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with variant architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
zh
[NLP-95] Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成长文本时出现的幻觉(hallucination)问题,特别是关注幻觉在生成文本中的位置分布,尤其是在长文本的后半部分出现的集中现象。解决方案的关键在于分析注意力机制和解码过程在长序列中的动态特性,以理解幻觉偏倚的成因,并探索有效的方法来减轻长文本结尾部分的幻觉,从而提高生成内容的忠实性。
链接: https://arxiv.org/abs/2505.15291
作者: Joonho Yang,Seunghyun Yoon,Hwan Chang,Byeongjeong Kim,Hwanhee Lee
机构: Chung-Ang University (忠南大学); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL)
备注: 11 tables, 8 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have significantly advanced text generation capabilities, including tasks like summarization, often producing coherent and fluent outputs. However, faithfulness to source material remains a significant challenge due to the generation of hallucinations. While extensive research focuses on detecting and reducing these inaccuracies, less attention has been paid to the positional distribution of hallucination within generated text, particularly in long outputs. In this work, we investigate where hallucinations occur in LLM-based long response generation, using long document summarization as a key case study. Focusing on the challenging setting of long context-aware long response generation, we find a consistent and concerning phenomenon: hallucinations tend to concentrate disproportionately in the latter parts of the generated long response. To understand this bias, we explore potential contributing factors related to the dynamics of attention and decoding over long sequences. Furthermore, we investigate methods to mitigate this positional hallucination, aiming to improve faithfulness specifically in the concluding segments of long outputs.
zh
[NLP-96] Exploring In-Image Machine Translation with Real-World Background ACL2025
【速读】: 该论文旨在解决现实世界中图像内机器翻译(In-Image Machine Translation, IIMT)的挑战,特别是针对复杂场景下文本背景来源于真实图像的情况。传统研究多集中于简化场景,如单行黑字白底的图像,这与实际应用需求存在较大差距。为提升IIMT的实际价值,论文提出了一种新的解决方案——DebackX模型,其关键在于将源图像中的背景与文本图像分离,直接对文本图像进行翻译,并将翻译后的文本图像与背景融合,从而生成目标图像,有效提升了翻译质量和视觉效果。
链接: https://arxiv.org/abs/2505.15282
作者: Yanzhi Tian,Zeming Liu,Zhengyang Liu,Yuhang Guo
机构: Beijing Institute of Technology (北京理工大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACL 2025 Findings. Code available at this https URL
点击查看摘要
Abstract:In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted on simplified scenarios such as images of one-line text with black font in white backgrounds, which is far from reality and impractical for applications in the real world. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research of complex scenario IIMT, we design an IIMT dataset that includes subtitle text with real-world background. However, previous IIMT models perform inadequately in complex scenarios. To address the issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on text-image directly, and fuses the translated text-image with the background, to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
zh
[NLP-97] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
【速读】: 该论文旨在解决Web导航任务中缺乏可在训练和测试阶段通用的专用奖励模型的问题,这一问题限制了基于多模态大语言模型(MLLM)的解决方案在实际部署中的效率与成本效益。其解决方案的关键在于提出首个过程奖励模型(Process Reward Model, PRM)Web-Shepherd,该模型能够以步骤级评估Web导航轨迹,并通过构建大规模的WebPRM Collection数据集和引入WebRewardBench元评估基准,实现了对PRM性能的有效评估与优化。
链接: https://arxiv.org/abs/2505.15277
作者: Hyungjoo Chae,Sunghwan Kim,Junhee Cho,Seungone Kim,Seungjun Moon,Gyeom Hwangbo,Dongha Lim,Minjin Kim,Yeonjun Hwang,Minju Gwak,Dongwook Choi,Minseok Kang,Gwanhoon Im,ByeongUng Cho,Hyojun Kim,Jun Hee Han,Taeyoon Kwon,Minju Kim,Beong-woo Kwak,Dongjin Kang,Jinyoung Yeo
机构: Yonsei University (延世大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: Work in progress
点击查看摘要
Abstract:Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories in a step-level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10 times less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
zh
[NLP-98] When Can Large Reasoning Models Save Thinking? Mechanistic Analysis of Behavioral Divergence in Reasoning
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在复杂任务中因过度思考导致的效率问题。其解决方案的关键在于揭示了强化学习(Reinforcement Learning, RL)训练的LRMs在提示“节省思考”时的内部机制,识别出三种不同的思考模式:无思考(No Thinking, NT)、显式思考(Explicit Thinking, ET)和隐式思考(Implicit Thinking, IT),并通过分析思考终止的置信度、从思考到生成的关注度以及对输入部分的关注焦点,明确了影响推理行为的关键因素。研究进一步表明,NT模式虽能减少输出长度但牺牲了准确性,而ET和IT模式则在保持准确性的同时减少了响应长度,从而为提升RL优化LRMs的可靠性效率提供了方向。
链接: https://arxiv.org/abs/2505.15276
作者: Rongzhi Zhu,Yi Liu,Zequn Sun,Yiwei Wang,Wei Hu
机构: State Key Laboratory for Novel Software Technology, Nanjing University, China (国家软件技术重点实验室,南京大学,中国); University of California, Merced, USA (加利福尼亚大学默塞德分校,美国); National Institute of Healthcare Data Science, Nanjing University, China (南京大学医疗健康数据科学国家研究所,中国)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large reasoning models (LRMs) have significantly advanced performance on complex tasks, yet their tendency to overthink introduces inefficiencies. This study investigates the internal mechanisms of reinforcement learning (RL)-trained LRMs when prompted to save thinking, revealing three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through comprehensive analysis of confidence in thinking termination, attention from thinking to generation, and attentional focus on input sections, we uncover key factors influencing the reasoning behaviors. We further find that NT reduces output length at the cost of accuracy, while ET and IT maintain accuracy with reduced response length. Our findings expose fundamental inconsistencies in RL-optimized LRMs, necessitating adaptive improvements for reliable efficiency.
zh
[NLP-99] AGENT-X: Adaptive Guideline-based Expert Network for Threshold-free AI-generated teXt detection
【速读】: 该论文试图解决现有生成式 AI (Generative AI) 文本检测方法依赖大规模标注数据集和外部阈值调优的问题,这些问题限制了模型的可解释性、适应性和零样本有效性。解决方案的关键在于提出 AGENT-X,这是一个基于经典修辞学和系统功能语言学的零样本多智能体框架,通过将检测准则划分为语义、风格和结构维度,并由专门的语言智能体独立评估,结合语义引导提供明确推理和稳健校准置信度,最终由元智能体通过置信度感知聚合实现无阈值、可解释的分类。
链接: https://arxiv.org/abs/2505.15261
作者: Jiatao Li,Mao Ye,Cheng Peng,Xunjian Yin,Xiaojun Wan
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Existing AI-generated text detection methods heavily depend on large annotated datasets and external threshold tuning, restricting interpretability, adaptability, and zero-shot effectiveness. To address these limitations, we propose AGENT-X, a zero-shot multi-agent framework informed by classical rhetoric and systemic functional linguistics. Specifically, we organize detection guidelines into semantic, stylistic, and structural dimensions, each independently evaluated by specialized linguistic agents that provide explicit reasoning and robust calibrated confidence via semantic steering. A meta agent integrates these assessments through confidence-aware aggregation, enabling threshold-free, interpretable classification. Additionally, an adaptive Mixture-of-Agent router dynamically selects guidelines based on inferred textual characteristics. Experiments on diverse datasets demonstrate that AGENT-X substantially surpasses state-of-the-art supervised and zero-shot approaches in accuracy, interpretability, and generalization.
zh
[NLP-100] ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在通过图形用户界面(Graphical User Interfaces, GUIs)与计算机交互时,精准定位界面元素坐标的问题,这一问题对于实现细粒度操作至关重要。现有方法依赖大规模网络数据集来提升定位准确性,但存在数据效率低下的问题。该论文提出的解决方案为Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE),其关键在于通过在线强化学习自生成语言推理过程以实现定位,并利用空间先验进行预测批评,从而增强模型对输入变换的等变性。此外,在推理阶段引入测试时缩放策略,结合空间搜索与坐标聚合进一步提升性能。
链接: https://arxiv.org/abs/2505.15259
作者: Hyunseok Lee,Jeonghoon Kim,Beomjun Kim,Jihoon Tack,Chansong Jo,Jaehong Lee,Cheonbok Park,Sookyo In,Jinwoo Shin,Kang Min Yoo
机构: KAIST(韩国科学技术院); NAVER Cloud(NAVER云)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is often required for fine-grained actions. However, this remains significantly challenging, leading prior works to rely on large-scale web datasets to improve the grounding accuracy. In this work, we propose Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE), a novel and effective framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism. More specifically, ReGUIDE learns to (i) self-generate a language reasoning process for the localization via online reinforcement learning, and (ii) criticize the prediction using spatial priors that enforce equivariance under input transformations. At inference time, ReGUIDE further boosts performance through a test-time scaling strategy, which combines spatial search with coordinate aggregation. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks, outperforming baselines with substantially fewer training data points (e.g., only 0.2% samples compared to the best open-sourced baselines).
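论文在推理阶段采用“空间搜索 + 坐标聚合”的测试时扩展策略。下面用纯 Python 给出一个极简示意:对同一截图的多个随机裁剪分别预测目标坐标,再映射回原图坐标系并取中位数聚合;其中 predict_local_xy 为假设的占位函数,真实场景中应调用 MLLM:

```python
import random, statistics

TRUE_XY = (100.0, 200.0)  # 模拟的目标真值,仅用于演示

def predict_local_xy(ox, oy):
    """占位:真实场景中应调用 MLLM 对裁剪图做 grounding;这里用带噪声的真值模拟。"""
    return TRUE_XY[0] - ox + random.gauss(0, 3), TRUE_XY[1] - oy + random.gauss(0, 3)

def spatial_search_aggregate(width=800, height=600, n_crops=8, crop_ratio=0.8):
    cw, ch = int(width * crop_ratio), int(height * crop_ratio)
    xs, ys = [], []
    for _ in range(n_crops):
        ox = random.randint(0, width - cw)   # 随机裁剪的左上角偏移
        oy = random.randint(0, height - ch)
        lx, ly = predict_local_xy(ox, oy)    # 裁剪图内的局部坐标
        xs.append(ox + lx)                   # 映射回原图坐标系
        ys.append(oy + ly)
    return statistics.median(xs), statistics.median(ys)  # 中位数聚合,抑制离群预测

print(spatial_search_aggregate())
```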
zh
[NLP-101] When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多语言推理任务中表现不均衡的问题,尤其是高资源语言的性能优势显著高于低资源语言。其解决方案的关键在于通过因果干预,在推理阶段消融语言特定表示,从而将推理与语言表征解耦,提升多语言推理能力。实验结果表明,该方法在保持语言特征的同时有效增强了跨语言的泛化能力,并且相比传统的微调或强化学习等后训练方法,具有更低的计算开销。
链接: https://arxiv.org/abs/2505.15257
作者: Weixiang Zhao,Jiahe Guo,Yang Deng,Tongtong Wu,Wenxuan Zhang,Yulin Hu,Xingyu Sui,Yanyan Zhao,Wanxiang Che,Bing Qin,Tat-Seng Chua,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Singapore Management University (新加坡管理大学); Monash University (莫纳什大学); Singapore University of Technology and Design (新加坡科技设计大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: 26 pages, 13 figures
点击查看摘要
Abstract:Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-source LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively decoupled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training such as supervised fine-tuning or reinforcement learning, our training-free ablation achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.
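文中的因果干预可以粗略理解为“从隐状态中投影掉语言特异方向”。下面给出一个最小草图:假设已用语言间均值差估计出方向向量 u,推理时将隐状态在 u 上的分量移除。方向的估计方式与作用层位均为演示假设:

```python
import torch

def ablate_direction(hidden, u):
    """hidden: [..., d] 隐状态;u: [d] 语言特异方向。返回去除该方向分量后的隐状态。"""
    u = u / u.norm()
    coeff = (hidden * u).sum(dim=-1, keepdim=True)  # 各位置在 u 上的投影系数
    return hidden - coeff * u

# 玩具示例:用两组“语言”隐状态的均值差粗略估计语言方向
d = 16
h_lang_a = torch.randn(32, d) + torch.tensor([2.0] + [0.0] * (d - 1))
h_lang_b = torch.randn(32, d)
u = h_lang_a.mean(0) - h_lang_b.mean(0)

h = torch.randn(4, d)
h_ablated = ablate_direction(h, u)
print(((h_ablated * (u / u.norm())).sum(-1)).abs().max())  # 应接近 0
```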
zh
[NLP-102] MentalMAC: Enhancing Large Language Models for Detecting Mental Manipulation via Multi-Task Anti-Curriculum Distillation
【速读】: 该论文旨在解决心理操控(Mental Manipulation)检测的难题,这是一种隐蔽且普遍的心理虐待形式,对心理健康构成严重威胁。由于其隐匿性和操控策略的复杂性,即使是最先进的大语言模型(LLMs)也难以有效识别。为了解决这一问题,论文提出了MentalMAC,一种多任务反课程蒸馏方法,其关键在于通过进化操作与言语行为理论驱动的无监督数据扩展(EvoSA)、教师模型生成的多任务监督以及从复杂到简单任务的渐进知识蒸馏,提升LLMs在多轮对话中检测心理操控的能力。
链接: https://arxiv.org/abs/2505.15255
作者: Yuansheng Gao,Han Bao,Tong Zhang,Bin Li,Zonghui Wang,Wenzhi Chen
机构: Zhejiang University (浙江大学); SIAT, CAS (中国科学院深圳先进技术研究院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Its covert nature and the complexity of manipulation strategies make it challenging to detect, even for state-of-the-art large language models (LLMs). This concealment also hinders the manual collection of large-scale, high-quality annotations essential for training effective models. Although recent efforts have sought to improve LLM’s performance on this task, progress remains limited due to the scarcity of real-world annotated datasets. To address these challenges, we propose MentalMAC, a multi-task anti-curriculum distillation method that enhances LLMs’ ability to detect mental manipulation in multi-turn dialogue. Our approach includes: (i) EvoSA, an unsupervised data expansion method based on evolutionary operations and speech act theory; (ii) teacher-model-generated multi-task supervision; and (iii) progressive knowledge distillation from complex to simpler tasks. We then constructed the ReaMent dataset with 5,000 real-world dialogue samples, using a MentalMAC-distilled model to assist human annotation. Extensive experiments demonstrate that our method significantly narrows the gap between student and teacher models and outperforms competitive LLMs across key evaluation metrics. All code, datasets, and checkpoints will be released upon paper acceptance. Warning: This paper contains content that may be offensive to readers.
zh
[NLP-103] Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation
【速读】: 该论文试图解决的问题是:当前大型视觉-语言模型(Large Vision-Language Models, LVLMs)在文本-图像对齐评估中的鲁棒性不足,特别是在面对对抗性视觉扰动时,可能导致评分被不公平地夸大。解决方案的关键在于提出一个名为FRAME的新型细粒度、多领域元评估基准,并通过引入定义的图像诱导偏差来揭示LVLM评估系统的脆弱性,从而强调了构建更稳健LVLM评判体系的紧迫性。
链接: https://arxiv.org/abs/2505.15249
作者: Yerin Hwang,Dongryeol Lee,Kyungmin Min,Taegwan Kang,Yong-il Kim,Kyomin Jung
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: (21pgs, 12 Tables, 9 Figures)
点击查看摘要
Abstract:Recently, large vision-language models (LVLMs) have emerged as the preferred tools for judging text-image alignment, yet their robustness along the visual modality remains underexplored. This work is the first study to address a key research question: Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores? We define potential image induced biases within the context of T2I evaluation and examine how these biases affect the evaluations of LVLM judges. Moreover, we introduce a novel, fine-grained, multi-domain meta-evaluation benchmark named FRAME, which is deliberately constructed to exhibit diverse score distributions. By introducing the defined biases into the benchmark, we reveal that all tested LVLM judges exhibit vulnerability across all domains, consistently inflating scores for manipulated images. Further analysis reveals that combining multiple biases amplifies their effects, and pairwise evaluations are similarly susceptible. Moreover, we observe that visual biases persist under prompt-based mitigation strategies, highlighting the vulnerability of current LVLM evaluation systems and underscoring the urgent need for more robust LVLM judges.
zh
[NLP-104] Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework ACL2025
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在时间推理任务中缺乏可解释性推理过程的问题,即现有研究主要关注提升模型性能,而忽视了结果背后的可解释性。解决方案的关键在于提出GETER,一种结构感知的生成框架,通过将图结构与文本信息相结合,实现可解释的时间推理。该框架首先利用时间知识图谱构建时间编码器以捕捉结构信息,随后引入结构-文本前缀适配器将图结构特征映射到文本嵌入空间,并最终通过融合软图标记与指令调优提示标记,使LLMs生成解释性文本。
链接: https://arxiv.org/abs/2505.15245
作者: Zihao Jiang,Ben Liu,Miao Peng,Wenjie Xu,Yao Xiao,Zhenyan Shan,Min Peng
机构: Wuhan University (武汉大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: In Findings of the Association for Computational Linguistics: ACL 2025
点击查看摘要
Abstract:While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs’ capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance while also demonstrating its effectiveness as well as strong generalization capabilities. Our dataset and code are available at this https URL.
zh
[NLP-105] Multilingual Prompting for Improving LLM Generation Diversity
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成内容中缺乏文化代表性和整体多样性的问题,这体现在观点表达和事实性问题回答等方面。其解决方案的关键在于提出多语言提示(multilingual prompting),即通过从多种文化中添加文化与语言线索生成基础提示的多个变体,生成响应并整合结果,从而激活模型训练数据中更广泛的文化知识,以提高生成内容的多样性。
链接: https://arxiv.org/abs/2505.15229
作者: Qihan Wang,Shidong Pan,Tal Linzen,Emily Black
机构: New York University (纽约大学); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and personas prompting. Further analyses show that the benefits of multilingual prompting vary with language resource level and model size, and that aligning the prompting language with the cultural cues reduces hallucination about culturally-specific information.
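多语言提示的基本流程可以用下面的小示例示意:为同一基础提示生成带不同语言与文化线索的变体,分别生成回答后再合并。其中 generate 是假设的模型调用占位函数,合并策略也仅为演示:

```python
def generate(prompt):
    """占位:真实场景中应调用 LLM;此处返回模拟文本。"""
    return f"[response to: {prompt[:30]}...]"

BASE_PROMPT = "List three traditional breakfast dishes."
LANG_CUES = {
    "zh": "请用中文回答,并结合中国的饮食文化:",
    "es": "Responde en español, considerando la cultura española:",
    "hi": "कृपया हिंदी में उत्तर दें, भारतीय संदर्भ में:",
    "en": "Answer in English, from a US perspective:",
}

def multilingual_prompting(base_prompt):
    responses = []
    for lang, cue in LANG_CUES.items():
        variant = f"{cue}\n{base_prompt}"   # 带文化与语言线索的提示变体
        responses.append((lang, generate(variant)))
    # 合并策略(此处为假设):简单拼接,实际可再交给模型去重与综合
    return "\n".join(f"[{lang}] {r}" for lang, r in responses)

print(multilingual_prompting(BASE_PROMPT))
```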
zh
[NLP-106] BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
【速读】: 该论文试图解决生成式 AI (Generative AI) 在网络安全领域中攻防能力评估的问题,其核心在于构建一个能够捕捉进攻与防御网络能力的框架。解决方案的关键是引入了一个名为BountyBench的框架,通过实例化该框架,设置了25个具有复杂现实代码库的系统,并定义了三种任务类型:检测(Detect)、利用(Exploit)和修补(Patch),以全面评估AI代理在漏洞生命周期中的表现。此外,研究还设计了一种新的策略来调节任务难度,从而更准确地衡量AI代理的能力。
链接: https://arxiv.org/abs/2505.15216
作者: Andy K. Zhang,Joey Ji,Celeste Menders,Riya Dulepet,Thomas Qin,Ron Y. Wang,Junrong Wu,Kyleen Liao,Jiliang Li,Jinghan Hu,Sara Hong,Nardos Demilew,Shivatmica Murgai,Jason Tran,Nishka Kacheria,Ethan Ho,Denis Liu,Lauren McLane,Olivia Bruvik,Dai-Rong Han,Seungwoo Kim,Akhil Vyas,Cuiyuanxiu Chen,Ryan Li,Weiran Xu,Jonathan Z. Ye,Prerit Choudhary,Siddharth M. Bhatia,Vikram Sivashankar,Yuxuan Bao,Dawn Song,Dan Boneh,Daniel E. Ho,Percy Liang
机构: Stanford University (斯坦福大学); UC Berkeley (加州大学伯克利分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 78 pages
点击查看摘要
Abstract:AI agents have the potential to significantly alter the cybersecurity landscape. To help us understand this change, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from $10 to $30,485, and cover 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 5 agents: Claude Code, OpenAI Codex CLI, and custom agents with GPT-4.1, Gemini 2.5 Pro Preview, and Claude 3.7 Sonnet Thinking. Given up to three attempts, the top-performing agents are Claude Code (5% on Detect, mapping to $1,350), Custom Agent with Claude 3.7 Sonnet Thinking (5% on Detect, mapping to $1,025; 67.5% on Exploit), and OpenAI Codex CLI (5% on Detect, mapping to $2,400; 90% on Patch, mapping to $14,422). OpenAI Codex CLI and Claude Code are more capable at defense, achieving higher Patch scores of 90% and 87.5%, compared to Exploit scores of 32.5% and 57.5% respectively; in contrast, the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 40-67.5% and Patch scores of 45-60%.
zh
[NLP-107] R-TOFU: Unlearning in Large Reasoning Models
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)中隐私或版权信息在最终答案及多步骤思维链(chain-of-thought, CoT)轨迹中均被嵌入的问题,从而使得可靠的记忆消除(unlearning)比标准大语言模型(LLM)更加复杂。解决方案的关键在于提出首个针对此场景的基准测试——Reasoning-TOFU (R-TOFU),其通过引入真实的CoT注释和分步度量指标,揭示了答案层面检查无法发现的残留知识。此外,研究还提出了Reasoned IDK,一种基于偏好优化的方法,在保持推理连贯性的同时实现更优的遗忘效果与模型实用性,并指出解码策略的多样性对评估模型遗忘效果的重要性。
链接: https://arxiv.org/abs/2505.15214
作者: Sangyeon Yoon,Wonje Jeung,Albert No
机构: Hongik University (韩国弘益大学); Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注: 19 pages
点击查看摘要
Abstract:Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier refusal styles. Finally, we identify a failure mode: decoding variants such as ZeroThink and LessThink can still reveal forgotten content despite seemingly successful unlearning, emphasizing the need to evaluate models under diverse decoding settings. Together, the benchmark, analysis, and new baseline establish a systematic foundation for studying and improving unlearning in LRMs while preserving their reasoning capabilities.
zh
[NLP-108] Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)因知识不足或过时而导致的幻觉问题,通过知识图谱(Knowledge Graphs, KGs)增强检索生成过程以提升模型输出的可信度。其解决方案的关键在于提出一种称为“对先验的深思”(Deliberation over Priors, DP)的可信推理框架,该框架充分挖掘KG中的先验知识,包括结构信息和显式或隐式约束,并通过渐进式知识蒸馏策略与监督微调及Kahneman-Tversky优化相结合,提升关系路径生成的忠实性;同时引入推理内省策略,基于提取的约束先验引导模型进行精细化推理验证,从而确保响应生成的可靠性。
链接: https://arxiv.org/abs/2505.15210
作者: Jie Ma,Ning Qu,Zhitao Gao,Rui Xing,Jun Liu,Hongbin Pei,Jiang Xie,Linyun Song,Pinghui Wang,Jing Tao,Zhou Su
机构: Xi’an Jiaotong University (西安交通大学); School of Computer Science and Technology, Xi’an Jiaotong University (西安交通大学计算机科学与技术学院); Shaanxi Province Key Laboratory of Big Data Knowledge Engineering (陕西省大数据知识工程重点实验室); School of Artificial Intelligence, Chongqing University of Post and Telecommunications (重庆邮电大学人工智能学院); School of Computer Science, Northwestern Polytechnical University (西北工业大学计算机科学学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Under Review
点击查看摘要
Abstract:Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs’ reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at this https URL.
zh
[NLP-109] DUSK: Do Not Unlearn Shared Knowledge
【速读】: 该论文试图解决机器遗忘(machine unlearning)中因遗忘数据与保留数据存在重叠内容而导致的精准删除问题,即在移除特定数据的同时,需保留公共事实和共享信息。解决方案的关键在于构建一个名为DUSK的基准测试集,该基准通过构造描述相同事实但风格不同的文档集合,模拟真实场景中的数据重叠情况,并定义七项评估指标以衡量遗忘方法是否能够实现对独特内容的选择性删除,同时避免损害共享知识。
链接: https://arxiv.org/abs/2505.15209
作者: Wonje Jeung,Sangyeon Yoon,Hyesoo Hong,Soeun Kim,Seungju Han,Youngjae Yu,Albert No
机构: Yonsei University (延世大学); Hongik University (弘益大学); Standford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 21 pages
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about the unauthorized use of copyrighted or sensitive data. Machine unlearning aims to remove such ‘forget’ data while preserving utility and information from the ‘retain’ set. However, existing evaluations typically assume that forget and retain sets are fully disjoint, overlooking realistic scenarios where they share overlapping content. For instance, a news article may need to be unlearned, even though the same event, such as an earthquake in Japan, is also described factually on Wikipedia. Effective unlearning should remove the specific phrasing of the news article while preserving publicly supported facts. In this paper, we introduce DUSK, a benchmark designed to evaluate unlearning methods under realistic data overlap. DUSK constructs document sets that describe the same factual content in different styles, with some shared information appearing across all sets and other content remaining unique to each. When one set is designated for unlearning, an ideal method should remove its unique content while preserving shared facts. We define seven evaluation metrics to assess whether unlearning methods can achieve this selective removal. Our evaluation of nine recent unlearning methods reveals a key limitation: while most can remove surface-level text, they often fail to erase deeper, context-specific knowledge without damaging shared content. We release DUSK as a public benchmark to support the development of more precise and reliable unlearning techniques for real-world applications.
zh
[NLP-110] Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)中样本利用效率低的问题,即现有算法在优化单次成功(pass@1)性能时,过度关注独立样本的强度,而忽视了样本集合的多样性和整体效用,从而限制了探索能力和对更复杂问题的解决能力。其解决方案的关键是提出一种名为Pass-at-k Policy Optimization (PKPO) 的策略优化方法,通过在最终奖励上进行变换,直接优化pass@k性能,从而鼓励生成能够联合最大化奖励的样本集合。该方法的核心贡献在于推导出适用于二元和连续奖励设置的pass@k及其梯度的低方差无偏估计器,并通过稳定高效的变换函数将优化过程转化为标准强化学习问题。此外,该方法允许在训练过程中对k值进行退火,从而同时优化pass@1和pass@k性能,显著提升了复杂任务集的学习效果。
链接: https://arxiv.org/abs/2505.15201
作者: Christian Walder,Deep Karkhanis
机构: Google DeepMind(谷歌深度思维)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Reinforcement Learning (RL) algorithms sample multiple n > 1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k = n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
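在二元奖励设定下,pass@k 存在一个经典的无偏估计:n 个独立样本中有 c 个成功时,pass@k = 1 - C(n-c, k) / C(n, k)。下面给出该估计量的简单实现,可作为理解 PKPO 奖励变换的起点(论文进一步推导了逐样本的奖励变换与梯度估计,此处不展开):

```python
from math import comb

def pass_at_k(n, c, k):
    """n 个独立样本中 c 个成功时,pass@k 的无偏估计。"""
    if n - c < k:
        return 1.0  # 失败样本不足 k 个,任取 k 个必含成功样本
    return 1.0 - comb(n - c, k) / comb(n, k)

# 示例:16 个采样中 4 个通过
for k in (1, 4, 8):
    print(k, round(pass_at_k(16, 4, k), 4))
```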
zh
[NLP-111] EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association ACL2025
【速读】: 该论文试图解决电商场景下基于大语言模型(Large Language Model, LLM)的脚本规划(E-commerce Script Planning, EcomScript)问题,具体包括LLM无法同时进行脚本规划与产品检索、因计划动作与搜索查询之间的语义差异导致的产品匹配困难以及缺乏评估方法和基准数据等挑战。解决方案的关键在于提出一种新的框架,通过基于动作与购买意图之间的语义相似性将产品关联到每个步骤,从而实现产品丰富的脚本生成,并构建了首个大规模的EcomScript数据集EcomScriptBench以支持任务评估。
链接: https://arxiv.org/abs/2505.15196
作者: Weiqi Wang,Limeng Cui,Xin Liu,Sreyashi Nag,Wenju Xu,Chen Luo,Sheikh Muhammad Sarwar,Yang Li,Hansu Gu,Hui Liu,Changlong Yu,Jiaxin Bai,Yifan Gao,Haiyang Zhang,Qi He,Shuiwang Ji,Yangqiu Song
机构: Amazon.com Inc(亚马逊公司); HKUST(香港科技大学); Texas A&M University(德克萨斯A&M大学)
类目: Computation and Language (cs.CL)
备注: ACL2025
点击查看摘要
Abstract:Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains underexplored due to several challenges, including the inability of LLMs to simultaneously conduct script planning and product retrieval, difficulties in matching products caused by semantic discrepancies between planned actions and search queries, and a lack of methods and benchmark data for evaluation. In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step based on the semantic similarity between the actions and their purchase intentions. By applying our framework to real-world e-commerce data, we construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products. Human annotations are then conducted to provide gold labels for a sampled subset, forming an evaluation benchmark. Extensive experiments reveal that current (L)LMs face significant challenges with EcomScript tasks, even after fine-tuning, while injecting product purchase intentions improves their performance.
zh
[NLP-112] ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection
【速读】: 该论文试图解决LLM代理在复杂环境中因无法保持一致的内部信念和目标对齐而导致的推理步骤不准确或不连贯的问题(ungrounded or incoherent reasoning steps),这会导致代理的实际状态与目标之间出现偏差。解决方案的关键在于引入ReflAct,这是一种新的推理框架,它将推理从单纯的下一步行动规划转变为持续地将代理的状态与其目标进行对比反思(reflecting on the agent’s state relative to its goal)。通过在状态中明确地建立决策基础并强制实现持续的目标对齐,ReflAct显著提高了策略的可靠性。
链接: https://arxiv.org/abs/2505.15182
作者: Jeonghye Kim,Sojeong Rhee,Minbeom Kim,Dohyung Kim,Sangmook Lee,Youngchul Sung,Kyomin Jung
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent’s actual state and goal. Our analysis finds that this stems from ReAct’s inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent’s state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.
zh
[NLP-113] ALN-P3: Unified Language Alignment for Perception Prediction and Planning in Autonomous Driving
【速读】: 该论文试图解决现有端到端自动驾驶系统中,大型语言模型(Large Language Models, LLMs)在驾驶性能与视觉-语言推理之间难以同时提升的问题。其解决方案的关键在于提出ALN-P3统一的协同蒸馏框架,通过引入“快速”视觉驱动的自动驾驶系统与“慢速”语言驱动推理模块之间的跨模态对齐,具体包括感知对齐(Perception Alignment, P1A)、预测对齐(Prediction Alignment, P2A)和规划对齐(Planning Alignment, P3A),从而在完整感知、预测和规划层级上实现视觉标记与对应语言输出的显式对齐。
链接: https://arxiv.org/abs/2505.15158
作者: Yunsheng Ma,Burhaneddin Yaman,Xin Ye,Mahmut Yurt,Jingru Luo,Abhirup Mallik,Ziran Wang,Liu Ren
机构: Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI); Purdue University
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages
点击查看摘要
Abstract:Recent advances have explored integrating large language models (LLMs) into end-to-end autonomous driving systems to enhance generalization and interpretability. However, most existing approaches are limited to either driving performance or vision-language reasoning, making it difficult to achieve both simultaneously. In this paper, we propose ALN-P3, a unified co-distillation framework that introduces cross-modal alignment between “fast” vision-based autonomous driving systems and “slow” language-driven reasoning modules. ALN-P3 incorporates three novel alignment mechanisms: Perception Alignment (P1A), Prediction Alignment (P2A), and Planning Alignment (P3A), which explicitly align visual tokens with corresponding linguistic outputs across the full perception, prediction, and planning stack. All alignment modules are applied only during training and incur no additional costs during inference. Extensive experiments on four challenging benchmarks (nuScenes, Nu-X, TOD3Cap, and nuScenes QA) demonstrate that ALN-P3 significantly improves both driving decisions and language reasoning, achieving state-of-the-art results.
zh
[NLP-114] Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)和多模态大型语言模型(Multimodal Large Language Models, MLLMs)在推理过程中过度依赖链式思维(chain-of-thought, CoT)导致性能下降和输出冗长的问题。解决方案的关键在于提出基于确定性的自适应推理(Certainty-based Adaptive Reasoning, CAR)框架,该框架通过模型的困惑度动态决定是否触发长形式推理,从而在准确性和效率之间取得最佳平衡。
链接: https://arxiv.org/abs/2505.15154
作者: Jinghui Lu,Haiyang Yu,Siliang Xu,Shiwei Ran,Guozhi Tang,Siqi Wang,Bin Shan,Teng Fu,Hao Feng,Jingqun Tang,Han Wang,Can Huang
机构: ByteDance(字节跳动); FuDan University(复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and bring unnecessarily lengthened outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and can even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on the model’s perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.
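CAR 的路由判据可以概括为“短答案困惑度超过阈值则触发长推理”。下面的草图假设我们已拿到模型对短答案的逐 token 负对数似然;阈值数值与函数接口均为演示假设:

```python
import math

def perplexity(token_nlls):
    """由逐 token 负对数似然计算困惑度。"""
    return math.exp(sum(token_nlls) / len(token_nlls))

def car_route(short_answer, token_nlls, threshold=8.0, long_reasoner=None):
    """困惑度低(模型自信)则直接返回短答案;否则回退到长链式推理。"""
    if perplexity(token_nlls) <= threshold:
        return short_answer
    return long_reasoner() if long_reasoner else "<trigger long-form CoT>"

print(car_route("Paris", [0.2, 0.1, 0.3]))   # 低困惑度:直接采纳短答案
print(car_route("Paris", [3.5, 4.0, 3.8]))   # 高困惑度:转入长推理
```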
zh
[NLP-115] An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
【速读】: 该论文旨在解决如何有效设计基于大语言模型(Large Language Models, LLMs)的强化学习(Reinforcement Learning, RL)搜索代理的问题,特别是在奖励机制、底层LLM的选择与特性以及搜索引擎在RL过程中的作用等方面尚未明确优化方案。其解决方案的关键在于通过系统的实证研究,揭示影响RL训练效果的核心因素,并提供可操作的见解,例如格式化奖励对最终性能的提升效果显著,而中间检索奖励的影响有限;LLM的规模与初始化方式(通用型与推理专用型)对RL结果有显著影响;搜索引擎的选择在塑造RL训练动态和推理阶段的鲁棒性方面起着关键作用。这些发现为实际应用中构建和部署LLM-based搜索代理提供了重要指导。
链接: https://arxiv.org/abs/2505.15117
作者: Bowen Jin,Jinsung Yoon,Priyanka Kargupta,Sercan O. Arik,Jiawei Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 22 pages
点击查看摘要
Abstract:Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors, such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process, require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these factors and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These findings establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at this https URL.
zh
[NLP-116] RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals
【速读】: 该论文旨在解决表格推理任务中由于长链式思维(Long CoT)训练成本高且存在表格内容幻觉导致的可靠性低的问题。其解决方案的关键在于提出行式思维(RoT),通过迭代式的逐行遍历表格,实现推理扩展和基于反思的优化,从而在无需训练的情况下提升推理能力,同时减少幻觉并提高效率。
链接: https://arxiv.org/abs/2505.15110
作者: Xuanliang Zhang,Dingzirui Wang,Keyan Xu,Qingfu Zhu,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The table reasoning task, crucial for efficient data acquisition, aims to answer questions based on the given table. Recently, reasoning large language models (RLLMs) with Long Chain-of-Thought (Long CoT) significantly enhance reasoning capabilities, leading to brilliant performance on table reasoning. However, Long CoT suffers from high cost for training and exhibits low reliability due to table content hallucinations. Therefore, we propose Row-of-Thought (RoT), which performs iteratively row-wise table traversal, allowing for reasoning extension and reflection-based refinement at each traversal. Scaling reasoning length by row-wise traversal and leveraging reflection capabilities of LLMs, RoT is training-free. The sequential traversal encourages greater attention to the table, thus reducing hallucinations. Experiments show that RoT, using non-reasoning models, outperforms RLLMs by an average of 4.3%, and achieves state-of-the-art results on WikiTableQuestions and TableBench with comparable models, proving its effectiveness. Also, RoT outperforms Long CoT with fewer reasoning tokens, indicating higher efficiency.
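RoT 的逐行遍历骨架大致如下:维护一个“当前推理状态”,每遍历一行就在旧状态与该行内容的基础上更新结论,并可多轮遍历以反思修正。其中 llm_step 是假设的模型调用占位函数:

```python
def llm_step(state, row, question):
    """占位:真实场景中应调用 LLM,根据当前状态与新的一行更新推理。"""
    return state + [f"checked {row}"]

def row_of_thought(table_rows, question, n_passes=2):
    state = []
    for _ in range(n_passes):       # 多轮遍历:在上一轮结论上反思、修正
        for row in table_rows:      # 逐行遍历,促使模型关注表格内容、减少幻觉
            state = llm_step(state, row, question)
    return state

rows = [{"city": "Paris", "pop": 2.1}, {"city": "Rome", "pop": 2.8}]
print(row_of_thought(rows, "Which city has the larger population?"))
```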
zh
[NLP-117] A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents
【速读】: 该论文试图解决当前生成式 AI (Generative AI) 在心理治疗场景中应用时存在的风险评估不足问题,特别是由于缺乏标准化评估方法而无法捕捉治疗互动中的细微风险,从而可能导致用户伤害甚至自杀等严重后果。解决方案的关键在于提出一种新的风险分类体系,该体系通过系统性地整合心理治疗风险文献、临床与法律专家的定性访谈,并与现有的临床标准(如DSM-5)和评估工具(如NEQ、UE-ATR)对齐,旨在为对话式 AI 心理治疗师提供结构化的用户/患者危害识别与评估框架。
链接: https://arxiv.org/abs/2505.15108
作者: Ian Steenstra,Timothy W. Bickmore
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:The proliferation of Large Language Models (LLMs) and Intelligent Virtual Agents acting as psychotherapists presents significant opportunities for expanding mental healthcare access. However, their deployment has also been linked to serious adverse outcomes, including user harm and suicide, facilitated by a lack of standardized evaluation methodologies capable of capturing the nuanced risks of therapeutic interaction. Current evaluation techniques lack the sensitivity to detect subtle changes in patient cognition and behavior during therapy sessions that may lead to subsequent decompensation. We introduce a novel risk taxonomy specifically designed for the systematic evaluation of conversational AI psychotherapists. Developed through an iterative process including review of the psychotherapy risk literature, qualitative interviews with clinical and legal experts, and alignment with established clinical criteria (e.g., DSM-5) and existing assessment tools (e.g., NEQ, UE-ATR), the taxonomy aims to provide a structured approach to identifying and assessing user/patient harms. We provide a high-level overview of this taxonomy, detailing its grounding, and discuss potential use cases. We discuss two use cases in detail: monitoring cognitive model-based risk factors during a counseling conversation to detect unsafe deviations, in both human-AI counseling sessions and in automated benchmarking of AI psychotherapists with simulated patients. The proposed taxonomy offers a foundational step towards establishing safer and more responsible innovation in the domain of AI-driven mental health support.
zh
[NLP-118] StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Quick read: This paper targets the poor performance of large language models (LLMs) on multi-hop question answering caused by sparse rewards derived only from a global signal. The key is the StepSearch framework, which trains search-capable LLMs with step-wise proximal policy optimization, using richer, more fine-grained intermediate search rewards based on information gain and redundancy penalties, together with token-level process supervision, to better guide each search step.
Link: https://arxiv.org/abs/2505.15107
Authors: Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu
Affiliations: SenseTime; Nanjing University; Shenzhen University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 20 pages, 6 figures
Abstract:Efficient multi-hop reasoning requires Large Language Model (LLM)-based agents to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but such agents underperform on complex, multi-hop QA because rewards are sparse and derived only from the global signal. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It provides richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories from open-source datasets through a dedicated data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training data, demonstrating the effectiveness of fine-grained, stepwise supervision in optimizing deep search LLMs. Our implementation is publicly available at this https URL.
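To make the step-wise reward design concrete, here is a minimal sketch of a per-step search reward combining information gain with a redundancy penalty. The function name, the normalization, and the 0.5 penalty weight are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative step-wise search reward in the spirit of StepSearch:
# reward newly found gold documents, penalize re-retrieving seen ones.

def step_reward(retrieved: set[str], seen: set[str], gold: set[str],
                redundancy_weight: float = 0.5) -> float:
    """Score one search step of an LLM search agent.

    retrieved: doc ids returned by the current search call
    seen:      doc ids already retrieved at earlier steps
    gold:      doc ids needed for the current sub-question
    """
    if not gold:
        return 0.0
    # Information gain: newly found gold documents, normalized.
    info_gain = len((retrieved - seen) & gold) / len(gold)
    # Redundancy: fraction of this step's results that were already seen.
    redundancy = len(retrieved & seen) / max(len(retrieved), 1)
    return info_gain - redundancy_weight * redundancy

# Second step re-fetches one old doc and finds one of two gold docs:
print(step_reward({"d3", "d1"}, seen={"d1"}, gold={"d3", "d7"}))  # 0.25
```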
[NLP-119] Mechanistic evaluation of Transformers and state space models
Quick read: This paper investigates why state space models (SSMs) fall short at recalling basic information from context, especially on synthetic tasks such as Associative Recall (AR). The key is using causal interventions to explain mechanistic success and failure across architectures: Transformers and Based SSM models store key-value associations in-context via induction heads, whereas other SSMs (H3, Hyena) compute these associations only at the last state, with Mamba alone succeeding thanks to its short-convolution component. The authors further introduce Associative Treecall (ATR), which adds language-like hierarchical structure to the AR setting, and find that the successful models use the same mechanism on both tasks.
Link: https://arxiv.org/abs/2505.15105
Authors: Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csórdas, Dan Jurafsky, Christopher Potts
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9-page main text, 6-page appendix
Abstract:State space models (SSMs) for language modelling promise an efficient and performant alternative to quadratic-attention Transformers, yet show variable performance on recalling basic information from the context. While performance on synthetic tasks like Associative Recall (AR) can point to this deficiency, behavioural metrics provide little information as to why–on a mechanistic level–certain architectures fail and others succeed. To address this, we conduct experiments on AR and find that only Transformers and Based SSM models fully succeed at AR, with Mamba a close third, whereas the other SSMs (H3, Hyena) fail. We then use causal interventions to explain why. We find that Transformers and Based learn to store key-value associations in-context using induction heads. By contrast, the SSMs compute these associations only at the last state, with only Mamba succeeding because of its short convolution component. To extend and deepen these findings, we introduce Associative Treecall (ATR), a synthetic task similar to AR based on PCFG induction. ATR introduces language-like hierarchical structure into the AR setting. We find that all architectures learn the same mechanism as they did for AR, and the same three models succeed at the task. These results reveal that architectures with similar accuracy may still have substantive differences, motivating the adoption of mechanistic evaluations.
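For readers unfamiliar with the Associative Recall probe discussed above, the task is easy to reproduce. Below is a minimal generator of AR instances (key-value pairs followed by a query key); the alphabet, pair count, and formatting are arbitrary choices, not the paper's exact setup.

```python
# Minimal synthetic Associative Recall (AR) instance generator.
import random

def make_ar_example(num_pairs: int = 8, seed: int = 0) -> tuple[str, str]:
    rng = random.Random(seed)
    keys = rng.sample("abcdefghijklmnopqrstuvwxyz", num_pairs)
    values = [str(rng.randint(0, 9)) for _ in range(num_pairs)]
    context = " ".join(f"{k} {v}" for k, v in zip(keys, values))
    query = rng.choice(keys)                 # ask for one key's value
    answer = values[keys.index(query)]
    return f"{context} {query}", answer

prompt, target = make_ar_example()
print(prompt, "->", target)                  # e.g. "x 3 q 7 ... q" -> "7"
```

A model solves the task only by looking back in the context and retrieving the value bound to the queried key, which is why AR isolates in-context recall so cleanly.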
[NLP-120] Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English
Quick read: This paper addresses explainable sentiment analysis of sarcasm across regional varieties of English (Australian and Indian English), where the stated surface meaning and the implied sentiment diverge. The key is harnessing Pragmatic Metacognitive Prompting (PMP) to generate sarcasm explanations, improving the explainability of sarcasm detection; evaluated on two open-weight LLMs (GEMMA and LLAMA), PMP achieves statistically significant improvements across all tasks and datasets.
Link: https://arxiv.org/abs/2505.15095
Authors: Ishmanbir Singh, Dipankar Srirag, Aditya Joshi
Affiliations: University of New South Wales
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review. 4 pages + references
Abstract:Sarcasm is a challenge to sentiment analysis because of the incongruity between stated and implied sentiment. The challenge is exacerbated when the implication may be relevant to a specific country or geographical region. Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that has been used for pragmatic reasoning. In this paper, we harness PMP for explainable sarcasm detection for Australian and Indian English, alongside a benchmark dataset for standard English. We manually add sarcasm explanations to an existing sarcasm-labeled dataset for Australian and Indian English called BESSTIE, and compare the performance for explainable sarcasm detection for them with FLUTE, a standard English dataset containing sarcasm explanations. Our approach utilising PMP when evaluated on two open-weight LLMs (GEMMA and LLAMA) achieves statistically significant performance improvement across all tasks and datasets when compared with four alternative prompting strategies. We also find that alternative techniques such as agentic prompting mitigate context-related failures by enabling external knowledge retrieval. The focused contribution of our work is utilising PMP in generating sarcasm explanations for varieties of English.
[NLP-121] SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models
Quick read: This paper addresses the inadequate evaluation of large language models (LLMs) across scientific domains: existing benchmarks focus on general domains and fail to capture the complexity of scientific data. The key is SciCUEval, a comprehensive benchmark for scientific context understanding comprising ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating data modalities such as structured tables, knowledge graphs, and unstructured text, and systematically assessing four core competencies: relevant-information identification, information-absence detection, multi-source information integration, and context-aware inference.
Link: https://arxiv.org/abs/2505.15094
Authors: Jing Yu, Yuqi Tang, Kehua Feng, Mingyang Rao, Lei Liang, Zhiqiang Zhang, Mengshu Sun, Wen Zhang, Qiang Zhang, Keyan Ding, Huajun Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 4 figures
Abstract:Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.
[NLP-122] DeFTX: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer
Quick read: This paper tackles a persistent challenge in extending the benefits of large language models from high-resource to low-resource languages: effective cross-lingual transfer. The key is DeFT-X, a novel composable sparse fine-tuning (SFT) approach that denoises the weight matrices of a pretrained model with singular value decomposition before magnitude-based pruning, yielding more robust sparse fine-tuned vectors (SFTs).
Link: https://arxiv.org/abs/2505.15090
Authors: Sona Elza Simon, Preethi Jyothi
Affiliations: Indian Institute of Technology Bombay
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer that learns task-specific and language-specific sparse masks to select a subset of the pretrained model’s parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. These sparse masks for SFTs were identified using a simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that denoises the weight matrices of a pretrained model before magnitude pruning using singular value decomposition, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs at par or outperforms SFT and other prominent cross-lingual transfer baselines.
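The core DeFT-X recipe, SVD denoising followed by magnitude pruning, can be sketched in a few lines of PyTorch. This is a simplified illustration on a single weight-difference matrix; the rank, the sparsity level, and the choice to prune the fine-tuning difference are our assumptions, not the paper's settings.

```python
# Sketch: denoise with a truncated SVD, then take a magnitude-based mask.
import torch

def deftx_mask(delta_w: torch.Tensor, rank: int = 16,
               keep_ratio: float = 0.05) -> torch.Tensor:
    # 1) Low-rank reconstruction filters out small, noisy directions.
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    denoised = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]
    # 2) Magnitude pruning on the denoised matrix gives the sparse mask.
    k = max(1, int(keep_ratio * denoised.numel()))
    kth = denoised.numel() - k + 1                    # k-th largest magnitude
    threshold = denoised.abs().flatten().kthvalue(kth).values
    return (denoised.abs() >= threshold).float()

delta = torch.randn(256, 256)    # fine-tuned weights minus pretrained weights
mask = deftx_mask(delta)
print(mask.mean().item())        # ~0.05 of entries kept
```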
[NLP-123] HopWeaver: Synthesizing Authentic Multi-Hop Questions Across Text Corpora
Quick read: This paper addresses the difficulty of building multi-hop question answering (MHQA) datasets: manual annotation is expensive, and existing synthesis methods produce simplistic questions or require extensive manual guidance. The key is HopWeaver, an automatic framework that synthesizes authentic multi-hop questions from unstructured text corpora without human intervention, identifying complementary documents across corpora and constructing coherent reasoning paths so that the synthesized questions genuinely require multi-hop reasoning.
Link: https://arxiv.org/abs/2505.15087
Authors: Zhiyu Shen, Jiyuan Liu, Yunhe Pang, Yanghui Rao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 27 pages. Code will be available at [this https URL]
Abstract:Multi-Hop Question Answering (MHQA) is crucial for evaluating the model’s capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first automatic framework synthesizing authentic multi-hop questions from unstructured text corpora without human intervention. HopWeaver synthesizes two types of multi-hop questions (bridge and comparison) using an innovative approach that identifies complementary documents across corpora. Its coherent pipeline constructs authentic reasoning paths that integrate information across multiple documents, ensuring synthesized questions necessitate authentic multi-hop reasoning. We further present a comprehensive system for evaluating synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our approach is valuable for developing MHQA datasets in specialized domains with scarce annotated resources. The code for HopWeaver is publicly available.
[NLP-124] SUS backprop: linear backpropagation algorithm for long inputs in transformers
Quick read: This paper targets the computational bottleneck of the Transformer attention mechanism, whose cost grows quadratically (O(n^2)) with sequence length. The key is a simple probabilistic rule, controlled by a single parameter c, that cuts backpropagation through most attention weights, keeping at most c interactions per token per attention head; this reduces the attention backward pass to linear complexity O(nc) while keeping the added gradient variance acceptable (about 1% for n ≈ 2000).
Link: https://arxiv.org/abs/2505.15080
Authors: Sergey Pankov, Georges Harik
Affiliations: Harik Shazeer Labs; Notbad AI Inc
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 9 figures
Abstract:It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. By cutting the parts that have little effect on the computation, one can potentially save a significant amount of back-propagation computation in exchange for a minimal increase in the stochastic gradient variance, in some situations. Such a situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length n . At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of other tokens in the sequence. These weights become promising targets for cutting backpropagation. We propose a simple probabilistic rule controlled by a single parameter c that cuts backpropagation through most attention weights, leaving at most c interactions per token per attention head. This brings a factor of c/n reduction in the compute required for the attention backpropagation, turning it from quadratic O(n^2) to linear complexity O(nc) . We have empirically verified that, for a typical transformer model, cutting 99% of the attention gradient flow (i.e. choosing c \sim 20-30 ) results in relative gradient variance increase of only about 1% for n \sim 2000 , and it decreases with n . This approach is amenable to efficient sparse matrix implementation, thus being promising for making the cost of a backward pass negligible relative to the cost of a forward pass when training a transformer model on long sequences.
[NLP-125] Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Quick read: This paper addresses the cross-lingual inconsistency of multimodal large language models (MLLMs), particularly their uneven integration of cultural knowledge across languages. The key is two new benchmarks, KnowRecall and VisRecall, which respectively evaluate the consistency of factual knowledge (visual question answering about global landmarks in 15 languages) and of visual memory (describing landmark appearances in 9 languages without access to images). Results show that even state-of-the-art MLLMs struggle with cross-lingual consistency, underscoring the need for truly multilingual and culturally aware models.
Link: https://arxiv.org/abs/2505.15075
Authors: Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara
Affiliations: Waseda University; New York University; NII LLMC
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: this https URL
Abstract:The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
[NLP-126] DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data
Quick read: This paper addresses a failure mode of Group Relative Policy Optimization (GRPO) on multi-domain, imbalanced data: its implicit assumptions do not hold, so it over-optimizes dominant domains and neglects underrepresented ones, hurting generalization and fairness. The key is Domain-Informed Self-Consistency Policy Optimization (DISCO), with two core innovations: domain-aware reward scaling, which counteracts frequency bias by reweighting optimization according to domain prevalence, and difficulty-aware reward scaling, which uses prompt-level self-consistency to identify and prioritize uncertain prompts with higher learning value, enabling fairer and more effective policy learning across domains.
Link: https://arxiv.org/abs/2505.15074
Authors: Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang
Affiliations: University of Maryland, College Park; University of Michigan, Ann Arbor; University of Chicago
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures
Abstract:Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups - assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks.
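A minimal sketch of the two DISCO-style scalings might look as follows; the inverse-frequency and (1 + difficulty) forms are illustrative assumptions, since the paper defines its own scaling functions.

```python
# Upweight rewards from rare domains and from uncertain (hard) prompts.
from collections import Counter

def disco_scale(reward: float, domain: str, domain_counts: Counter,
                self_consistency: float) -> float:
    """
    domain_counts:    prompt counts per domain in the training mix
    self_consistency: fraction of sampled answers that agree (0..1);
                      low agreement = harder, more informative prompt
    """
    total = sum(domain_counts.values())
    inv_freq = total / (len(domain_counts) * domain_counts[domain])  # 1.0 if balanced
    difficulty = 1.0 - self_consistency
    return reward * inv_freq * (1.0 + difficulty)

counts = Counter({"math": 8000, "law": 500})
print(disco_scale(1.0, "law", counts, self_consistency=0.3))  # ~14.45
```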
[NLP-127] MoTime: A Dataset Suite for Multimodal Time Series Forecasting
Quick read: This paper addresses the lack of systematic datasets for multimodal time series forecasting: most research still centers on unimodal series, while real-world forecasting often involves multiple data sources. The key is MoTime, a suite of multimodal time series forecasting datasets that pair temporal signals with external modalities such as text, metadata, and images, supporting structured evaluation in two scenarios: the common forecasting task and cold-start forecasting. By releasing the datasets and findings, the work aims to enable more comprehensive and realistic benchmarks for future research.
Link: https://arxiv.org/abs/2505.15072
Authors: Xin Zhou, Weiqing Wang, Francisco J. Baldán, Wray Buntine, Christoph Bergmeir
Affiliations: Monash University; University of Málaga; VinUniversity; University of Granada
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
Comments:
Abstract:While multimodal data sources are increasingly available from real-world forecasting, most existing research remains on unimodal time series. In this work, we present MoTime, a suite of multimodal time series forecasting datasets that pair temporal signals with external modalities such as text, metadata, and images. Covering diverse domains, MoTime supports structured evaluation of modality utility under two scenarios: 1) the common forecasting task, where varying-length history is available, and 2) cold-start forecasting, where no historical data is available. Experiments show that external modalities can improve forecasting performance in both scenarios, with particularly strong benefits for short series in some datasets, though the impact varies depending on data characteristics. By making datasets and findings publicly available, we aim to support more comprehensive and realistic benchmarks in future multimodal time series forecasting research.
[NLP-128] Can Large Language Models Understand Internet Buzzwords Through User-Generated Content ACL2025
Quick read: This paper studies how to exploit large-scale user-generated content (UGC) to generate accurate definitions of Chinese internet buzzwords. The key is RESS, a novel method that effectively steers the comprehension process of LLMs to produce more accurate buzzword definitions, mirroring how humans learn language. The authors also release CHEER, the first dataset of Chinese internet buzzwords, to benchmark definition-generation methods, revealing shared challenges: over-reliance on prior exposure, underdeveloped inferential abilities, and difficulty identifying high-quality UGC.
Link: https://arxiv.org/abs/2505.15071
Authors: Chen Huang, Junkai Luo, Xinzuo Wang, Wenqiang Lei, Jiancheng Lv
Affiliations: Sichuan University; JD.com; Ministry of Education
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Main Paper. Our dataset and code are available at this https URL
Abstract:The massive user-generated content (UGC) available in Chinese social media is giving rise to the possibility of studying internet buzzwords. In this paper, we study if large language models (LLMs) can generate accurate definitions for these buzzwords based on UGC as examples. Our work serves a threefold contribution. First, we introduce CHEER, the first dataset of Chinese internet buzzwords, each annotated with a definition and relevant UGC. Second, we propose a novel method, called RESS, to effectively steer the comprehending process of LLMs to produce more accurate buzzword definitions, mirroring the skills of human language learning. Third, with CHEER, we benchmark the strengths and weaknesses of various off-the-shelf definition generation methods and our RESS. Our benchmark demonstrates the effectiveness of RESS while revealing crucial shared challenges: over-reliance on prior exposure, underdeveloped inferential abilities, and difficulty identifying high-quality UGC to facilitate comprehension. We believe our work lays the groundwork for future advancements in LLM-based definition generation. Our dataset and code are available at this https URL.
[NLP-129] An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-Doc SIGIR2025
Quick read: This paper tackles the retrieval latency caused by high-document-frequency (DF) terms in Learned Sparse Retrieval (LSR) models. Conventional training (e.g., SPLADE's FLOPS regularization) promotes vector sparsity but cannot control term-level sparsity, so high-DF terms produce long posting lists in the inverted index and inflate retrieval latency. The key is DF-FLOPS, a new regularization variant that penalizes the use of high-DF terms, shortening posting lists and reducing latency, while still allowing genuinely salient high-frequency terms to be kept when needed.
Link: https://arxiv.org/abs/2505.15070
Authors: Aldo Porco, Dhruv Mehra, Igor Malioutov, Karthik Radhakrishnan, Moniba Keymanesh, Daniel Preoţiuc-Pietro, Sean MacAvaney, Pengxiang Cheng
Affiliations: Bloomberg; University of Glasgow
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted as a short paper at SIGIR 2025
Abstract:Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization does not ensure sparsity among terms - only within a given query or document. Terms with very high Document Frequencies (DFs) substantially increase latency in production retrieval engines, such as Apache Solr, due to their lengthy posting lists. To address the issue of high DFs, we present a new variant of FLOPS regularization: DF-FLOPS. This new regularization technique penalizes the usage of high-DF terms, thereby shortening posting lists and reducing retrieval latency. Unlike other inference-time sparsification methods, such as stopword removal, DF-FLOPS regularization allows for the selective inclusion of high-frequency terms in cases where the terms are truly salient. We find that DF-FLOPS successfully reduces the prevalence of high-DF terms and lowers retrieval latency (around 10x faster) in a production-grade engine while maintaining effectiveness both in-domain (only a 2.2-point drop in MRR@10) and cross-domain (improved performance in 12 out of 13 tasks on which we tested). With retrieval latencies on par with BM25, this work provides an important step towards making LSR practical for deployment in production-grade search engines.
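For intuition, here is the standard batch FLOPS regularizer next to a DF-weighted variant in PyTorch. Weighting each term's penalty by its document frequency is our simplifying assumption; the paper's exact DF-FLOPS formula is not reproduced here.

```python
import torch

def flops_loss(w: torch.Tensor) -> torch.Tensor:
    """w: (batch, vocab) non-negative term weights from the encoder."""
    return (w.mean(dim=0) ** 2).sum()          # classic FLOPS regularizer

def df_flops_loss(w: torch.Tensor, df: torch.Tensor) -> torch.Tensor:
    """df: (vocab,) document frequencies in [0, 1]; high-DF terms cost more."""
    return (df * w.mean(dim=0) ** 2).sum()

w = torch.rand(32, 1000)                       # a batch of sparse term vectors
df = torch.rand(1000)
print(flops_loss(w).item(), df_flops_loss(w, df).item())
```

Under such a weighting, a high-DF term can still survive when its weight contributes enough to the ranking objective, which matches the paper's point that truly salient frequent terms remain selectable.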
[NLP-130] In-Domain African Languages Translation Using LLM s and Multi-armed Bandits
Quick read: This paper addresses domain adaptation for low-resource languages in neural machine translation (NMT): with limited training data and weak generalization, and especially when fine-tuning is infeasible or impractical, selecting the best model for in-domain data becomes crucial. The key is bandit-based algorithms, including Upper Confidence Bound, Linear UCB, Neural Linear Bandit, and Thompson Sampling, which enable optimal model selection with high confidence under resource constraints.
Link: https://arxiv.org/abs/2505.15069
Authors: Pratik Rakesh Singh, Kritarth Prasad, Mohammadi Zaki, Pankaj Wasnik
Affiliations: Media Analysis Group, Sony Research India
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Neural Machine Translation (NMT) systems face significant challenges when working with low-resource languages, particularly in domain adaptation tasks. These difficulties arise due to limited training data and suboptimal model generalization. As a result, selecting an optimal model for translation is crucial for achieving strong performance on in-domain data, particularly in scenarios where fine-tuning is not feasible or practical. In this paper, we investigate strategies for selecting the most suitable NMT model for a given domain using bandit-based algorithms, including Upper Confidence Bound, Linear UCB, Neural Linear Bandit, and Thompson Sampling. Our method effectively addresses the resource constraints by facilitating optimal model selection with high confidence. We evaluate the approach across three African languages and domains, demonstrating its robustness and effectiveness in both scenarios where target data is available and where it is absent.
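Of the bandit algorithms listed, UCB1 is the easiest to sketch. Below is a minimal selection loop over candidate translation models; the `score` callback and the quality numbers are placeholders standing in for a real translate-then-evaluate step (e.g., with chrF or COMET).

```python
import math, random

def ucb_select(num_models: int, rounds: int, score) -> list[int]:
    counts = [0] * num_models
    sums = [0.0] * num_models
    history = []
    for t in range(1, rounds + 1):
        if t <= num_models:                    # play each arm once first
            arm = t - 1
        else:
            arm = max(range(num_models),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += score(arm)                # quality of this model's output
        history.append(arm)
    return history

random.seed(0)
quality = [0.3, 0.5, 0.7]                      # hypothetical per-model means
picks = ucb_select(3, 200, lambda a: random.gauss(quality[a], 0.1))
print("most-pulled model:", max(set(picks), key=picks.count))  # likely 2
```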
[NLP-131] ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
Quick read: This paper addresses the gap between existing benchmarks and real-world problems, which demand open-ended, interdisciplinary reasoning and the integration of computational tools. The key is ModelingBench, a benchmark of real-world-inspired, open-ended problems drawn from math modeling competitions across domains (from urban traffic optimization to ecosystem resource planning) that admits multiple valid solutions, capturing the ambiguity and creativity of practical modeling. It is accompanied by ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement, and by ModelingJudge, an expert-in-the-loop system that uses LLMs as domain-specialized judges to assess solutions from multiple expert perspectives.
Link: https://arxiv.org/abs/2505.15068
Authors: Cheng Qian, Hongyi Du, Hongru Wang, Xiusi Chen, Yuji Zhang, Avirup Sil, Chengxiang Zhai, Kathleen McKeown, Heng Ji
Affiliations: University of Illinois Urbana-Champaign; IBM Research AI; Columbia University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 36 pages, 26 figures, 5 tables
Abstract:Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert-in-the-loop system leveraging LLMs as domain-specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.
[NLP-132] The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support
Quick read: This paper asks whether small language models (0.5B to 5B parameters) can meaningfully engage in trauma-informed, empathetic dialogue for individuals with PTSD. The key is TIDE, a dataset of 10,000 two-turn dialogues spanning 500 diverse PTSD client personas and grounded in a three-factor empathy model (emotion recognition, distress normalization, and supportive reflection); all scenarios and reference responses were reviewed for realism and trauma sensitivity by a clinical psychologist specializing in PTSD. Eight small models are evaluated before and after fine-tuning against a frontier model (Claude Sonnet 3.5), showing that fine-tuning generally improves perceived empathy, but that gains are highly scenario- and user-dependent.
Link: https://arxiv.org/abs/2505.15065
Authors: Suhas BN, Yash Mahajan, Dominik Mattioli, Andrew M. Sherrill, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
Affiliations: Penn State University; Emory University; Georgia Tech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 23 pages, 3 figures
Abstract:Can small language models with 0.5B to 5B parameters meaningfully engage in trauma-informed, empathetic dialogue for individuals with PTSD? We address this question by introducing TIDE, a dataset of 10,000 two-turn dialogues spanning 500 diverse PTSD client personas and grounded in a three-factor empathy model: emotion recognition, distress normalization, and supportive reflection. All scenarios and reference responses were reviewed for realism and trauma sensitivity by a clinical psychologist specializing in PTSD. We evaluate eight small language models before and after fine-tuning, comparing their outputs to a frontier model (Claude Sonnet 3.5). Our IRB-approved human evaluation and automatic metrics show that fine-tuning generally improves perceived empathy, but gains are highly scenario- and user-dependent, with smaller models facing an empathy ceiling. Demographic analysis shows older adults value distress validation and graduate-educated users prefer nuanced replies, while gender effects are minimal. We highlight the limitations of automatic metrics and the need for context- and user-aware system design. Our findings, along with the planned release of TIDE, provide a foundation for building safe, resource-efficient, and ethically sound empathetic AI to supplement, not replace, clinical mental health care.
[NLP-133] UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking
Quick read: This paper addresses the factual reliability of large language models (LLMs) in low-resource languages such as Urdu: existing automated fact-checking solutions focus overwhelmingly on English, leaving the world's 200+ million Urdu speakers underserved. The key is UrduFactCheck, a comprehensive, modular fact-checking framework tailored to Urdu, centered on a dynamic multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches to mitigate the scarcity of high-quality Urdu evidence.
Link: https://arxiv.org/abs/2505.15063
Authors: Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Numaan Naeem, Muhammad Ahsan Riaz Khan, Arham Riaz, Muhammad Arslan Manzoor, Yuxia Wang, Preslav Nakov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 10 figures, 4 tables, Submitted to ARR May 2025
Abstract:The rapid use of large language models (LLMs) has raised critical concerns regarding the factual reliability of their outputs, especially in low-resource languages such as Urdu. Existing automated fact-checking solutions overwhelmingly focus on English, leaving a significant gap for the 200+ million Urdu speakers worldwide. In this work, we introduce UrduFactCheck, the first comprehensive, modular fact-checking framework specifically tailored for Urdu. Our system features a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches to address the scarcity of high-quality Urdu evidence. We curate and release two new hand-annotated benchmarks: UrduFactBench for claim verification and UrduFactQA for evaluating LLM factuality. Extensive experiments demonstrate that UrduFactCheck, particularly its translation-augmented variants, consistently outperforms baselines and open-source alternatives on multiple metrics. We further benchmark twelve state-of-the-art (SOTA) LLMs on factual question answering in Urdu, highlighting persistent gaps between proprietary and open-source models. UrduFactCheck’s code and datasets are open-sourced and publicly available at this https URL.
[NLP-134] Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning
Quick read: This paper addresses LLMs' difficulty with complex questions that require new information, where retrieved knowledge is insufficient and does not directly answer the question, especially in scientific reasoning. The key is Self-GIVE, a retrieve-RL framework that uses reinforcement learning to equip LLMs with automatic associative thinking: it extracts structured information and entity sets from a knowledge graph to help the model link to the queried concepts, improving both reasoning efficiency and accuracy.
Link: https://arxiv.org/abs/2505.15062
Authors: Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth, Alejandro Ribeiro
Affiliations: University of Pennsylvania, Philadelphia, PA, USA; University of California, Berkeley, CA, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate “hormones helping mental disorders” with “melatonin being a hormone and insomnia a mental disorder” to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE’s key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning using Self-GIVE with a 135-node UMLS KG, it improves the performance of the Qwen2.5 3B and 7B models by up to 28.5% → 71.4% and 78.6% → 90.5% on samples unseen in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT-3.5 Turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.
[NLP-135] Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory
Quick read: This paper addresses two concerns with current LLM benchmarking: inconsistencies across leaderboards and poor separability among top models, both of which undermine benchmarks' ability to reflect true model capability. The key is Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture, enabling accurate and reliable estimation of item characteristics and model abilities.
Link: https://arxiv.org/abs/2505.15055
Authors: Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao
Affiliations: Harbin Institute of Technology; China Electronics Standardization Institute
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining main-stream prominent LLM benchmarks using results from diverse models. We first propose a new framework for accurate and reliable estimations of item characteristics and model abilities. Specifically, we propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. Based on PSN-IRT, we conduct extensive analysis which reveals significant and varied shortcomings in the measurement quality of current benchmarks. Furthermore, we demonstrate that leveraging PSN-IRT is able to construct smaller benchmarks while maintaining stronger alignment with human preference.
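PSN-IRT builds on classical Item Response Theory; as background, here is the standard two-parameter logistic (2PL) model, in which an item's discrimination a and difficulty b determine how sharply it separates models of nearby ability theta. PSN-IRT adds a richer item parameterization and a neural estimator, which this snippet does not attempt to reproduce.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability a model of ability theta solves item (a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A discriminative item (a=2.0) separates two nearby models far better
# than a flat one (a=0.3): one reason benchmarks differ in separability.
for a in (0.3, 2.0):
    gap = p_correct(1.0, a, 0.0) - p_correct(0.5, a, 0.0)
    print(a, round(gap, 3))   # 0.3 -> 0.037, 2.0 -> 0.15
```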
[NLP-136] MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation
Quick read: This paper addresses precise recognition, editing, and generation of molecular structures at the molecule-language interface, prerequisites for both chemists and AI systems across chemical tasks. The key is MolLangBench, a comprehensive benchmark whose recognition tasks are constructed with automated cheminformatics tools to guarantee high-quality, unambiguous, deterministic outputs, and whose editing and generation tasks are curated through rigorous expert annotation and validation, supporting the evaluation of models across molecular representations such as linear strings, molecular images, and molecular graphs.
Link: https://arxiv.org/abs/2505.15054
Authors: Feiyang Cai, Jiahui Bai, Tao Tang, Joshua Luo, Tianyu Zhu, Ling Liu, Feng Luo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:
Abstract:Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (o3) achieves 79.2% and 78.5% accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only 29.0% accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications.
[NLP-137] Improving the fact-checking performance of language models by relying on their entailment ability
Quick read: This paper targets the challenges of automated fact-checking, particularly the inefficiency of handling contradictory multi-source evidence and complex reasoning. The key is leveraging the entailment and generative abilities of language models to produce "supporting" and "refuting" justifications for a claim's truthfulness, then training language models on these justifications, which yields substantial gains across several datasets.
Link: https://arxiv.org/abs/2505.15050
Authors: Gaurav Kumar, Debajyoti Mazumder, Ayush Garg, Jasabanta Patro
Affiliations: Indian Institute of Science Education and Research, Bhopal, India
Subjects: Computation and Language (cs.CL)
Comments: 44 pages
Abstract:Automated fact-checking is a crucial task in this digital age. To verify a claim, current approaches majorly follow one of two strategies i.e. (i) relying on embedded knowledge of language models, and (ii) fine-tuning them with evidence pieces. While the former can make systems to hallucinate, the later have not been very successful till date. The primary reason behind this is that fact verification is a complex process. Language models have to parse through multiple pieces of evidence before making a prediction. Further, the evidence pieces often contradict each other. This makes the reasoning process even more complex. We proposed a simple yet effective approach where we relied on entailment and the generative ability of language models to produce ‘‘supporting’’ and ‘‘refuting’’ justifications (for the truthfulness of a claim). We trained language models based on these justifications and achieved superior results. Apart from that, we did a systematic comparison of different prompting and fine-tuning strategies, as it is currently lacking in the literature. Some of our observations are: (i) training language models with raw evidence sentences registered an improvement up to 8.20% in macro-F1, over the best performing baseline for the RAW-FC dataset, (ii) similarly, training language models with prompted claim-evidence understanding (TBE-2) registered an improvement (with a margin up to 16.39%) over the baselines for the same dataset, (iii) training language models with entailed justifications (TBE-3) outperformed the baselines by a huge margin (up to 28.57% and 44.26% for LIAR-RAW and RAW-FC, respectively). We have shared our code repository to reproduce the results.
[NLP-138] ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding
Quick read: This paper addresses the high data collection and training costs of applying multi-modal large language models (MLLMs) to fine-grained chart understanding tasks. The key is ChartCards, a unified chart-metadata generation framework that systematically synthesizes diverse chart information, including data tables, visualization code, visual elements, and multi-dimensional semantic captions, and structures it into organized metadata so that a single chart can support multiple downstream tasks.
Link: https://arxiv.org/abs/2505.15046
Authors: Yifan Wu, Lutao Yan, Leixian Shen, Yinan Mei, Jiannan Wang, Yuyu Luo
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; South China University of Technology; Huawei Cloud BU; Simon Fraser University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The emergence of Multi-modal Large Language Models (MLLMs) presents new opportunities for chart understanding. However, due to the fine-grained nature of these tasks, applying MLLMs typically requires large, high-quality datasets for task-specific fine-tuning, leading to high data collection and training costs. To address this, we propose ChartCards, a unified chart-metadata generation framework for multi-task chart understanding. ChartCards systematically synthesizes various chart information, including data tables, visualization code, visual elements, and multi-dimensional semantic captions. By structuring this information into organized metadata, ChartCards enables a single chart to support multiple downstream tasks, such as text-to-chart retrieval, chart summarization, chart-to-table conversion, chart description, and chart question answering. Using ChartCards, we further construct MetaChart, a large-scale high-quality dataset containing 10,862 data tables, 85K charts, and 170 K high-quality chart captions. We validate the dataset through qualitative crowdsourcing evaluations and quantitative fine-tuning experiments across various chart understanding tasks. Fine-tuning six different models on MetaChart resulted in an average performance improvement of 5% across all tasks. The most notable improvements are seen in text-to-chart retrieval and chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements of 17% and 28%, respectively.
[NLP-139] Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
Quick read: This paper addresses a fundamental limitation of LLM-based embedding models: the unidirectional attention used during autoregressive pre-training misaligns with the bidirectional nature of text embedding tasks. The key is adopting diffusion language models, whose inherently bidirectional architecture better captures global context in long and complex text, outperforming LLM embedding models across several text embedding tasks.
Link: https://arxiv.org/abs/2505.15045
Authors: Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
Affiliations: Nanyang Technological University; Yale University; NYU Shanghai; Alibaba-NTU Singapore Joint Research Institute; Center for Data Science, New York University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
[NLP-140] Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
Quick read: This paper addresses the degraded steering robustness of linear concept vectors for LLMs caused by noise (irrelevant features) introduced by diverse data. The key is Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses sparse autoencoders to filter noisy features out of hidden representations, improving the steering success rates of both linear probing and difference-in-means methods.
Link: https://arxiv.org/abs/2505.15038
Authors: Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du
Affiliations: New Jersey Institute of Technology; The University of Georgia; Wake Forest University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 3 tables
Abstract:Linear Concept Vectors have proven effective for steering large language models (LLMs). While existing approaches like linear probing and difference-in-means derive these vectors from LLM hidden representations, diverse data introduces noises (i.e., irrelevant features) that challenge steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses Sparse Autoencoders to filter out noisy features from hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.
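The denoising step can be sketched as: encode the concept vector with a sparse autoencoder, keep only the strongest latent features, and decode back. The untrained SAE and the top-k filtering rule below are illustrative assumptions; the paper uses pretrained SAEs and its own criterion for which features count as noise.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_latent)
        self.dec = torch.nn.Linear(d_latent, d_model)

    def denoise(self, v: torch.Tensor, top_k: int = 8) -> torch.Tensor:
        z = torch.relu(self.enc(v))            # sparse latent features
        kept = torch.zeros_like(z)
        idx = z.topk(top_k).indices            # keep the strongest features
        kept[idx] = z[idx]
        return self.dec(kept)                  # denoised concept vector

sae = SparseAutoencoder(d_model=768, d_latent=4096)
concept = torch.randn(768)                     # e.g., a difference-in-means vector
print(sae.denoise(concept).shape)              # torch.Size([768])
```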
[NLP-141] RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
Quick read: This paper addresses two weaknesses of verifiers in current RL-based post-training of LLMs: reward hacking and poor generalization. The key is Tango, a framework that uses reinforcement learning to concurrently train a generator and a generative, process-level verifier that co-evolve during training; the verifier is trained solely from outcome-level verification-correctness rewards, without explicit process-level annotations, which improves robustness and generalization.
Link: https://arxiv.org/abs/2505.15034
Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Tech report. The first two authors contributed equally
Abstract:Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: this https URL.
[NLP-142] Are the confidence scores of reviewers consistent with the review content? Evidence from top conference proceedings in AI
Quick read: This paper addresses the lack of fine-grained analysis of text-score consistency in peer review, where missed details can undermine reliability. The key is applying deep learning and NLP to conference review data: detecting hedge sentences, identifying review aspects, and analyzing report length, hedge word/sentence frequency, aspect mentions, and sentiment to assess how well review text aligns with confidence scores.
Link: https://arxiv.org/abs/2505.15031
Authors: Wenqing Wu, Haixu Xi, Chengzhi Zhang
Affiliations: Nanjing University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:
Abstract:Peer review is vital in academia for evaluating research quality. Top AI conferences use reviewer confidence scores to ensure review reliability, but existing studies lack fine-grained analysis of text-score consistency, potentially missing key details. This work assesses consistency at word, sentence, and aspect levels using deep learning and NLP conference review data. We employ deep learning to detect hedge sentences and aspects, then analyze report length, hedge word/sentence frequency, aspect mentions, and sentiment to evaluate text-score alignment. Correlation, significance, and regression tests examine confidence scores’ impact on paper outcomes. Results show high text-score consistency across all levels, with regression revealing higher confidence scores correlate with paper rejection, validating expert assessments and peer review fairness.
[NLP-143] Diagnosing our datasets: How does my language model learn clinical information?
Quick read: This paper examines how well LLMs understand clinical jargon in clinical NLP tasks and how they respond to unsupported medical claims. The key is analyzing the frequency and data sources of clinical terms and unsupported claims in pretraining corpora, relating pretraining data composition to model outputs, and thereby offering guidance for constructing future clinically relevant datasets.
Link: https://arxiv.org/abs/2505.15024
Authors: Furong Jia, David Sontag, Monica Agrawal
Affiliations: Duke University; MIT CSAIL
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have performed well across various clinical natural language processing tasks, despite not being directly trained on electronic health record (EHR) data. In this work, we examine how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate the frequency of relevant clinical information in their corresponding pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. To isolate clinical jargon understanding, we evaluate LLMs on a new dataset MedLingo. Unsurprisingly, we find that the frequency of clinical jargon mentions across major pretraining corpora correlates with model performance. However, jargon frequently appearing in clinical notes often rarely appears in pretraining corpora, revealing a mismatch between available data and real-world usage. Similarly, we find that a non-negligible portion of documents support disputed claims that can then be parroted by models. Finally, we classified and analyzed the types of online sources in which clinical jargon and unsupported medical claims appear, with implications for future dataset composition.
[NLP-144] Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems
Quick read: This paper addresses the underexplored ability of generative AI to reason mathematically from spoken input, i.e., to carry out step-by-step logical reasoning and problem solving from spoken language. The key is Spoken-MQA, a new benchmark for evaluating the mathematical reasoning of speech-based models, covering both cascade models (ASR + LLM) and end-to-end speech LLMs, with diverse math problem types presented in clear, natural spoken language, providing a systematic evaluation framework for this line of research.
Link: https://arxiv.org/abs/2505.15000
Authors: Chengwei Wei, Bin Wang, Jung-jae Kim, Nancy F. Chen
Affiliations: Institute for Infocomm Research (I2R), A*STAR, Singapore; Centre for Frontier AI Research (CFAR), A*STAR, Singapore
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in large language models (LLMs) and multimodal LLMs (MLLMs) have led to strong reasoning ability across a wide range of tasks. However, their ability to perform mathematical reasoning from spoken input remains underexplored. Prior studies on the speech modality have mostly focused on factual speech understanding or simple audio reasoning tasks, providing limited insight into logical step-by-step reasoning, such as that required for mathematical problem solving. To address this gap, we introduce Spoken Math Question Answering (Spoken-MQA), a new benchmark designed to evaluate the mathematical reasoning capabilities of speech-based models, including both cascade models (ASR + LLMs) and end-to-end speech LLMs. Spoken-MQA covers a diverse set of math problems, including pure arithmetic, single-step and multi-step contextual reasoning, and knowledge-oriented reasoning problems, all presented in unambiguous natural spoken language. Through extensive experiments, we find that: (1) while some speech LLMs perform competitively on contextual reasoning tasks involving basic arithmetic, they still struggle with direct arithmetic problems; (2) current LLMs exhibit a strong bias toward symbolic mathematical expressions written in LaTeX and have difficulty interpreting verbalized mathematical expressions; and (3) mathematical knowledge reasoning abilities are significantly degraded in current speech LLMs.
[NLP-145] Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision
Quick read: This paper addresses the insufficient logical consistency and low outcome reliability of LLMs on mathematical reasoning tasks. The key is the Energy Outcome Reward Model (EORM), a lightweight post-hoc verifier that uses energy-based models (EBMs) to learn a scalar energy score for chain-of-thought (CoT) solutions from outcome labels alone, avoiding detailed annotations; by interpreting discriminator logits as negative energies, EORM implicitly ranks candidates so that solutions leading to correct final answers receive lower energy, improving the consistency and accuracy of the reasoning process.
Link: https://arxiv.org/abs/2505.14999
Authors: Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, Hengli Li, Cheng-Fu Yang, Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:
Abstract:Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi step logical consistency. While Chain of Thought (CoT) prompting elicits reasoning steps, it doesn’t guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post hoc verifier. EORM leverages Energy Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates where lower energy is assigned to solutions leading to correct final outcomes implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute force sampling, thereby enhancing LLM reasoning outcome reliability through its streamlined post hoc verification process.
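The verification step reduces to ranking candidates by a scalar energy and keeping the minimum. The sketch below uses a stub energy function; in EORM this would be a trained lightweight verifier whose logit is read as a negative energy.

```python
def rerank_by_energy(candidates: list[str], energy_fn) -> str:
    """Return the candidate chain of thought with the lowest energy."""
    return min(candidates, key=energy_fn)      # lower energy = better

# Stub scorer for illustration only: pretend longer, rambling answers get
# higher energy. A real EORM scores each CoT with a small trained model.
answer = rerank_by_energy(
    ["step 1 ... so the answer is 42",
     "a long and rambling derivation ... the answer is 41",
     "answer is 42"],
    energy_fn=lambda c: float(len(c)),
)
print(answer)   # "answer is 42"
```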
[NLP-146] Meta-Design Matters: A Self-Design Multi-Agent System
Quick read: This paper addresses the limitations of manually designed agent roles and communication protocols in multi-agent systems (MAS): such designs underuse the strengths of LLMs and adapt poorly to novel tasks. The key is SELF-MAS, a self-supervised, inference-time-only framework for automatic MAS design that iteratively generates, evaluates, and refines per-instance MAS configurations through meta-level design, without a validation set, using meta-feedback on solvability and completeness to enable dynamic agent composition and problem decomposition, improving adaptability and performance.
Link: https://arxiv.org/abs/2505.14996
Authors: Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs’ strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference. We introduce SELF-MAS, the first self-supervised, inference-time-only framework for automatic MAS design. SELF-MAS employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that SELF-MAS outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-supervised design for creating effective and adaptive MAS.
[NLP-147] Effective and Efficient Schema-aware Information Extraction Using On-Device Large Language Models
Quick read: This paper addresses the challenges of deploying computationally intensive LLMs for information extraction (IE) on resource-constrained devices: hallucination, limited context length, and high latency, especially when handling diverse extraction schemas. The key is a two-stage approach adapted for on-device LLMs, Dual-LoRA with Incremental Schema Caching (DLISC): an Identification LoRA module retrieves the schemas most relevant to a given query, an Extraction LoRA module performs extraction conditioned on the selected schemas, and incremental schema caching cuts redundant computation to substantially improve efficiency alongside effectiveness.
Link: https://arxiv.org/abs/2505.14992
Authors: Zhihao Wen, Sheng Liang, Yaxiong Wu, Yongyue Zhang, Yong Liu
Affiliations: Huawei Noah's Ark Lab
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures
点击查看摘要
Abstract:Information extraction (IE) plays a crucial role in natural language processing (NLP) by converting unstructured text into structured knowledge. Deploying computationally intensive large language models (LLMs) on resource-constrained devices for information extraction is challenging, particularly due to issues like hallucinations, limited context length, and high latency, especially when handling diverse extraction schemas. To address these challenges, we propose a two-stage information extraction approach adapted for on-device LLMs, called Dual-LoRA with Incremental Schema Caching (DLISC), which enhances both schema identification and schema-aware extraction in terms of effectiveness and efficiency. In particular, DLISC adopts an Identification LoRA module for retrieving the most relevant schemas to a given query, and an Extraction LoRA module for performing information extraction based on the previously selected schemas. To accelerate extraction inference, Incremental Schema Caching is incorporated to reduce redundant computation, substantially improving efficiency. Extensive experiments across multiple information extraction datasets demonstrate notable improvements in both effectiveness and efficiency.
zh
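以下用纯 Python 勾勒摘要中“先识别 schema、再按 schema 抽取,并对 schema 编码做增量缓存”的两阶段流程。这是一个示意,identify_schemas / encode_schema / extract 均为本文假设的函数名,真实系统中前两者分别对应 Identification LoRA 与 Extraction LoRA 模块:

```python
from functools import lru_cache

def identify_schemas(query: str, schema_pool: list, top_k: int = 2) -> tuple:
    # 阶段一(示意):按词项重叠度挑选与查询最相关的 schema
    scored = sorted(schema_pool, key=lambda s: -sum(w in query for w in s.split("_")))
    return tuple(scored[:top_k])

@lru_cache(maxsize=128)
def encode_schema(schema: str) -> str:
    # 增量 schema 缓存的简化版:同一 schema 的编码跨查询只计算一次,
    # 对应论文中的 Incremental Schema Caching,用于削减重复计算
    print(f"[cache miss] encoding schema: {schema}")
    return f"<encoded:{schema}>"

def extract(query: str, schemas: tuple) -> dict:
    # 阶段二(示意):基于选中的 schema 做抽取
    _ = [encode_schema(s) for s in schemas]
    return {s: f"entities of {s} from: {query[:24]}..." for s in schemas}

pool = ["person_event", "network_fault", "device_config"]
for q in ["the network fault occurred at node 3", "another network fault report"]:
    print(extract(q, identify_schemas(q, pool)))  # 第二次查询命中缓存,不再重编码
```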
[NLP-148] Language Specific Knowledge: Do Models Know Better in X than in English?
【速读】: 该论文试图解决语言模型在特定主题或领域中可能具备语言特异性知识(Language Specific Knowledge, LSK)的问题,以及如何通过改变推理语言来提升模型的推理能力。其解决方案的关键在于设计一种简单的方法学——LSKExtractor,用于基准测试语言模型中的语言特异性知识,并在推理过程中加以利用,从而提升模型在不同语言和文化背景下的表现。
链接: https://arxiv.org/abs/2505.14990
作者: Ishika Agarwal,Nimet Beyza Bozdag,Dilek Hakkani-Tür
机构: University of Illinois, Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Code-switching is a common phenomenon of alternating between different languages in the same utterance, thought, or conversation. We posit that humans code-switch because they feel more comfortable talking about certain topics and domains in one language than another. With the rise of knowledge-intensive language models, we ask ourselves the next, natural question: Could models hold more knowledge on some topics in some language X? More importantly, could we improve reasoning by changing the language that reasoning is performed in? We coin the term Language Specific Knowledge (LSK) to represent this phenomenon. As ethnic cultures tend to develop alongside different languages, we employ culture-specific datasets (that contain knowledge about cultural and social behavioral norms). We find that language models can perform better when using chain-of-thought reasoning in some languages other than English, sometimes even better in low-resource languages. Paired with previous works showing that semantic similarity does not equate to representational similarity, we hypothesize that culturally specific texts occur more abundantly in corresponding languages, enabling specific knowledge to occur only in specific “expert” languages. Motivated by our initial results, we design a simple methodology called LSKExtractor to benchmark the language-specific knowledge present in a language model and, then, exploit it during inference. We show our results on various models and datasets, showing an average relative improvement of 10% in accuracy. Our research contributes to the open-source development of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.
zh
[NLP-149] CRAFT: Training-Free Cascaded Retrieval for Tabular QA
【速读】: 该论文旨在解决表格式问答(Table Question Answering, TQA)中传统密集检索模型在大规模任务中计算成本高且适应性差的问题。传统方法如DTR和ColBERT不仅需要大量的计算资源,还要求在新数据集上进行重新训练或微调,限制了其在动态领域和知识更新中的应用。论文提出的解决方案关键在于CRAFT,这是一种级联检索方法,首先利用稀疏检索模型筛选候选表格,再应用计算成本较高的密集模型和神经重排序器,从而提升检索性能。此外,通过使用Gemini Flash 1.5生成表格描述和标题进一步增强表格表示,实验结果表明CRAFT在NQ-Tables数据集上的端到端TQA任务中表现出色。
链接: https://arxiv.org/abs/2505.14984
作者: Adarsh Singh,Kushal Raj Bhandari,Jianxi Gao,Soham Dan,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); Rensselaer Polytechnic Institute (伦斯勒理工学院); Microsoft (微软)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Table Question Answering (TQA) involves retrieving relevant tables from a large corpus to answer natural language queries. Traditional dense retrieval models, such as DTR and ColBERT, not only incur high computational costs for large-scale retrieval tasks but also require retraining or fine-tuning on new datasets, limiting their adaptability to evolving domains and knowledge. In this work, we propose CRAFT, a cascaded retrieval approach that first uses a sparse retrieval model to filter a subset of candidate tables before applying more computationally expensive dense models and neural re-rankers. Our approach achieves better retrieval performance than state-of-the-art (SOTA) sparse, dense, and hybrid retrievers. We further enhance table representations by generating table descriptions and titles using Gemini Flash 1.5. End-to-end TQA results using various Large Language Models (LLMs) on NQ-Tables, a subset of the Natural Questions Dataset, demonstrate CRAFT’s effectiveness.
zh
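CRAFT 的级联思想可以用几十行代码示意:先用便宜的稀疏打分筛出少量候选表,再对候选用昂贵的稠密打分精排。下面的稀疏/稠密打分都是玩具实现(随机投影词袋仅作占位),实际系统应替换为 BM25 与 DTR/ColBERT 类编码器及神经重排序器:

```python
import numpy as np

def sparse_scores(query, tables):
    # 第一级:稀疏检索(词项重叠的玩具版,真实系统可用 BM25)
    q = set(query.lower().split())
    return np.array([len(q & set(t.lower().split())) for t in tables])

def dense_scores(query, tables, dim=64, seed=0):
    # 第二级:稠密打分的占位实现(随机投影词袋,仅作示意)
    rng = np.random.default_rng(seed)
    vocab = {}
    def embed(text):
        v = np.zeros(dim)
        for w in text.lower().split():
            if w not in vocab:
                vocab[w] = rng.standard_normal(dim)
            v += vocab[w]
        return v / (np.linalg.norm(v) + 1e-9)
    qv = embed(query)
    return np.array([qv @ embed(t) for t in tables])

def craft_retrieve(query, tables, keep=3, top=1):
    # 级联:先用便宜的稀疏模型砍掉大部分候选,再用昂贵的稠密模型精排
    cand = np.argsort(-sparse_scores(query, tables))[:keep]
    best = cand[np.argsort(-dense_scores(query, [tables[i] for i in cand]))][:top]
    return [tables[i] for i in best]

tables = ["country gdp by year", "olympic medal counts", "gdp growth of france"]
print(craft_retrieve("what is the gdp of france", tables))
```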
[NLP-150] Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies
【速读】: 该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在跨文化场景中生成符合文化规范回应的能力不足问题,这一问题在现有多模态安全基准中未得到充分关注,导致可能产生象征性伤害。其解决方案的关键在于提出CROSS基准和CROSS-Eval框架,通过评估文化意识、规范教育、合规性和助人性四个维度,系统地衡量LVLMs的文化安全推理能力,并基于此设计增强策略以提升模型的文化对齐程度。
链接: https://arxiv.org/abs/2505.14972
作者: Haoyi Qiu,Kung-Hsiang Huang,Ruichen Zheng,Jiao Sun,Nanyun Peng
机构: UCLA(加州大学洛杉矶分校); Salesforce AI Research( Salesforce人工智能研究院); Google DeepMind(谷歌深度思维)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual visually grounded queries from 16 countries, three everyday domains, and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models. Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models reach GPT-4o-level performance, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o’s cultural awareness (+60.14%) and compliance (+55.2%), while preserving general multimodal capabilities with minimal performance reduction on general multimodal understanding benchmarks.
zh
[NLP-151] DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis IJCAI2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中存在且被忽视的种姓制度偏见问题,特别是对印度边缘化种姓群体(如达利特人和首陀罗人)的偏见。解决方案的关键在于提出DECASTE框架,这是一个多维的评估体系,旨在检测和评估LLMs中的隐性和显性种姓偏见,通过社会文化、经济、教育和政治四个维度进行分析,并采用定制化的提示策略进行基准测试,从而揭示模型在不同种姓群体间的系统性偏见。
链接: https://arxiv.org/abs/2505.14971
作者: Prashanth Vijayaraghavan,Soroush Vosoughi,Lamogha Chizor,Raya Horesh,Rogerio Abreu de Paula,Ehsan Degan,Vandana Mukherjee
机构: IBM Research, San Jose, CA, USA (IBM 研究院,圣何塞,加州,美国); Dartmouth College, Hanover, NH, USA (达特茅斯学院,汉诺威,新罕布什尔州,美国); IBM Research, Hursley, Winchester, UK (IBM 研究院,赫斯利,温切斯特,英国); IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA (IBM 托马斯·J·沃森研究中心,约克敦高地,纽约州,美国); IBM Research, Sao Paulo, Brazil (IBM 研究院,圣保罗,巴西)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 7 (content pages) + 2 (reference pages) + 5 (Appendix pages), 5 figures, 6 Tables, IJCAI 2025
点击查看摘要
Abstract:Recent advancements in large language models (LLMs) have revolutionized natural language processing (NLP) and expanded their applications across diverse domains. However, despite their impressive capabilities, LLMs have been shown to reflect and perpetuate harmful societal biases, including those based on ethnicity, gender, and religion. A critical and underexplored issue is the reinforcement of caste-based biases, particularly towards India’s marginalized caste groups such as Dalits and Shudras. In this paper, we address this gap by proposing DECASTE, a novel, multi-dimensional framework designed to detect and assess both implicit and explicit caste biases in LLMs. Our approach evaluates caste fairness across four dimensions: socio-cultural, economic, educational, and political, using a range of customized prompting strategies. By benchmarking several state-of-the-art LLMs, we reveal that these models systematically reinforce caste biases, with significant disparities observed in the treatment of oppressed versus dominant caste groups. For example, bias scores are notably elevated when comparing Dalits and Shudras with dominant caste groups, reflecting societal prejudices that persist in model outputs. These results expose the subtle yet pervasive caste biases in LLMs and emphasize the need for more comprehensive and inclusive bias evaluation methodologies that assess the potential risks of deploying such models in real-world contexts.
zh
[NLP-152] MedBrowseComp: Benchmarking Medical Deep Research and Computer Use
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在临床实践中作为决策支持工具时,如何可靠地从异构知识库中检索和综合多跳医学事实的问题。现有评估方法通常依赖于合成提示、将任务简化为单跳事实查询或混淆推理与开放式生成,导致其在现实场景中的实用性不明确。解决方案的关键在于提出MedBrowseComp,这是首个系统性测试代理从实时领域特定知识库中可靠检索和综合多跳医学事实能力的基准,包含超过1000个由人类标注的问题,这些问题模拟了临床场景中从业者需整合碎片化或冲突信息以得出最新结论的情境。
链接: https://arxiv.org/abs/2505.14963
作者: Shan Chen,Pedro Moreira,Yuxin Xiao,Sam Schmidgall,Jeremy Warner,Hugo Aerts,Thomas Hartvigsen,Jack Gallifant,Danielle S. Bitterman
机构: Harvard(哈佛大学); Mass General Brigham(马萨诸塞总医院); Boston Children’s Hospital(波士顿儿童医院); MIT(麻省理工学院); Universitat Pompeu Fabra(庞佩乌法布拉大学); Johns Hopkins University(约翰霍普金斯大学); Brown University(布朗大学); HemOnc.org(血液肿瘤组织); Maastricht University(马斯特里赫特大学); University of Virginia(弗吉尼亚大学)
类目: Computation and Language (cs.CL)
备注: You can visit our project page at: this https URL
点击查看摘要
Abstract:Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases – trials, primary studies, regulatory documents, and cost data – under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent’s ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios where practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. You can visit our project page at: this https URL
zh
[NLP-153] Too Long, Didn’t Model: Decomposing LLM Long-Context Understanding With Novels
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在处理超长上下文时评估有效性困难的问题,特别是针对超出“大海捞针”(needle-in-a-haystack)类评估方法的复杂长文本理解能力。其解决方案的关键在于引入Too Long, Didn’t Model (TLDM)基准,该基准通过测试模型对小说中情节摘要、故事世界配置和叙事时间流逝的把握能力,揭示了当前前沿LLMs在超过64k tokens后无法保持稳定理解的现象。研究强调,语言模型开发者需超越“中间迷失”(lost in the middle)类基准,以更全面地评估模型在复杂长上下文场景中的表现。
链接: https://arxiv.org/abs/2505.14925
作者: Sil Hamilton,Rebecca M. M. Hicke,Matthew Wilkens,David Mimno
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Inspired by work on computational novel analysis, we release the Too Long, Didn’t Model (TLDM) benchmark, which tests a model’s ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond “lost in the middle” benchmarks when evaluating model performance in complex long-context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.
zh
[NLP-154] Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications
【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)在二分类文本任务中的一致性评估问题,特别是缺乏系统性的可靠性评估方法。其解决方案的关键在于引入心理测量学原理,确定样本量需求,开发无效响应的评估指标,并评估模型内部和跨模型的一致性(intra- and inter-rater reliability)。通过案例研究验证了该框架的有效性,为LLM的选择、样本量规划及可靠性评估提供了系统性指导。
链接: https://arxiv.org/abs/2505.14918
作者: Fadel M. Megahed,Ying-Ju Chen,L. Allision Jones-Farmer,Younghwa Lee,Jiawei Brooke Wang,Inez M. Zwetsloot
机构: Farmer School of Business, Miami University, Oxford, OH 45056, USA; College of Arts and Sciences, University of Dayton, Dayton, OH 45469, USA; Amsterdam Business School, University of Amsterdam, Amsterdam, The Netherlands
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 25 pages
点击查看摘要
Abstract:This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
zh
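论文框架中“模型内部一致性(intra-rater reliability)”的最直观形式,可按下述方式实现:同一模型对同一文本重复分类若干次,统计所有重复完全一致的样本比例,并以多数票聚合出最终标签。这只是一个示意实现,指标的具体口径以论文为准:

```python
from collections import Counter

def intra_rater_agreement(replicates):
    """replicates: 形如 [模型对同一文章的 5 次分类结果] 的列表的列表。
    返回“所有重复完全一致”的样本比例(对应摘要中 90-98% 的那类指标的简化版)。"""
    perfect = sum(len(set(r)) == 1 for r in replicates)
    return perfect / len(replicates)

def majority_labels(replicates):
    # 以多数票作为该样本的最终标签,可再与外部标签对比计算准确率
    return [Counter(r).most_common(1)[0][0] for r in replicates]

runs = [["pos"] * 5, ["pos", "pos", "neg", "pos", "pos"], ["neg"] * 5]
print(intra_rater_agreement(runs))  # 0.666...:3 个样本中 2 个完全一致
print(majority_labels(runs))        # ['pos', 'pos', 'neg']
```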
[NLP-155] ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories
【速读】: 该论文试图解决生成式 AI (Generative AI) 在自动生成虚假信息(如阴谋论)时可能通过改变文本特征(如将强烈负面情绪转化为更积极的语气)来规避检测的问题。解决方案的关键在于构建一个增强版的阴谋论检测数据集 ConDID-v2,该数据集在原有由人类撰写的阴谋论推文基础上,补充了由 LLM 重写并降低负面情绪的版本,并通过人工与 LLM 结合的方式验证了重写推文的质量。基于此数据集训练的改进模型 ConspEmoLLM-v2 在原始人类撰写内容和情感转换后的文本上均表现出优于原有模型和其他基线模型的性能。
链接: https://arxiv.org/abs/2505.14917
作者: Zhiwei Liu,Paul Thompson,Jiaqi Rong,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注: work in progress
点击查看摘要
Abstract:Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can also “disguise” conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, they are usually trained using human-authored text, whose features can vary from LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat such issues, we firstly developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to sentiment-transformed tweets in ConDID-v2. The project will be available at this https URL.
zh
[NLP-156] Understanding 6G through Language Models: A Case Study on LLM-aided Structured Entity Extraction in Telecom Domain
【速读】: 该论文旨在解决6G网络中电信知识碎片化问题,通过信息抽取技术将非结构化的电信知识转化为结构化格式,以提升AI模型对网络术语的理解能力。其解决方案的关键在于提出了一种基于语言模型的电信结构化实体抽取技术(TeleSEE),该技术采用高效的标记表示方法预测实体类型和属性键,减少了输出标记数量并提高了预测准确性,同时引入分层并行解码方法,优化了标准编码器-解码器架构,增强了实体抽取任务的性能。
链接: https://arxiv.org/abs/2505.14906
作者: Ye Yuan,Haolun Wu,Hao Zhou,Xue Liu,Hao Chen,Yan Xin,Jianzhong (Charlie) Zhang
机构: McGill University (麦吉尔大学); Samsung Research America (三星美国研究院)
类目: Computation and Language (cs.CL); Systems and Control (eess.SY)
备注:
点击查看摘要
Abstract:Knowledge understanding is a foundational part of envisioned 6G networks to advance network intelligence and AI-native network architectures. In this paradigm, information extraction plays a pivotal role in transforming fragmented telecom knowledge into well-structured formats, empowering diverse AI models to better understand network terminologies. This work proposes a novel language model-based information extraction technique, aiming to extract structured entities from the telecom context. The proposed telecom structured entity extraction (TeleSEE) technique applies a token-efficient representation method to predict entity types and attribute keys, aiming to save the number of output tokens and improve prediction accuracy. Meanwhile, TeleSEE involves a hierarchical parallel decoding method, improving the standard encoder-decoder architecture by integrating additional prompting and decoding strategies into entity extraction tasks. In addition, to better evaluate the performance of the proposed technique in the telecom domain, we further designed a dataset named 6GTech, including 2390 sentences and 23747 words from more than 100 6G-related technical publications. Finally, the experiment shows that the proposed TeleSEE method achieves higher accuracy than other baseline techniques, and also presents 5 to 9 times higher sample processing speed.
zh
[NLP-157] Concept Incongruence: An Exploration of Time and Death in Role Playing
【速读】: 该论文试图解决在角色扮演(Role-Play)场景下,由于概念不一致(concept incongruence)导致的模型行为不一致问题,特别是在角色“死亡”后模型未能正确拒绝生成或出现准确率下降的现象。解决方案的关键在于通过定义和分析三种行为指标——回避率(abstention rate)、条件准确率(conditional accuracy)和回答率(answer rate),量化模型在概念不一致下的行为表现,并识别出导致问题的两个主要原因:一是“死亡”状态在不同年份中的编码不可靠,二是角色扮演引发了模型时间表征的偏移。基于这些发现,研究者提出了改进模型回避与回答行为一致性的方法。
链接: https://arxiv.org/abs/2505.14905
作者: Xiaoyan Bai,Ike Peng,Aditya Singh,Chenhao Tan
机构: University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL)
备注: Our code is available, see this https URL
点击查看摘要
Abstract:Consider this prompt “Draw a unicorn with two horns”. Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarifications, or proceed to generate something anyway? We introduce concept incongruence to capture such phenomena where concept boundaries clash with each other, either in user prompts or in model representations, often leading to under-specified or mis-specified behaviors. In this work, we take the first step towards defining and analyzing model behavior under concept incongruence. Focusing on temporal boundaries in the Role-Play setting, we propose three behavioral metrics–abstention rate, conditional accuracy, and answer rate–to quantify model behavior under incongruence due to the role’s death. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting. Through probing experiments, we identify two main causes: (i) unreliable encoding of the “death” state across different years, leading to unsatisfactory abstention behavior, and (ii) role playing causes shifts in the model’s temporal representations, resulting in accuracy drops. We leverage these insights to improve consistency in the model’s abstention and answer behaviors. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions on improving model behavior under concept incongruence.
zh
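论文提出的三个行为指标可以按如下方式计算。注意这是一个示意:字段名与指标的具体定义均为本文假设,论文中的精确口径可能不同。abstention rate 衡量“应拒答时确实拒答”的比例,answer rate 为作答比例,conditional accuracy 为作答样本中的正确率:

```python
def behavior_metrics(records):
    """records: 每条为 dict,含 answered(是否作答)、correct(作答时是否正确)、
    should_abstain(按角色设定是否应拒答,例如角色已“死亡”)。字段名为本文假设。"""
    n = len(records)
    answered = [r for r in records if r["answered"]]
    should = [r for r in records if r["should_abstain"]]
    abstention_rate = (sum(1 for r in should if not r["answered"]) / len(should)
                       if should else 0.0)
    answer_rate = len(answered) / n
    cond_acc = (sum(r["correct"] for r in answered) / len(answered)) if answered else 0.0
    return {"abstention_rate": abstention_rate,
            "answer_rate": answer_rate,
            "conditional_accuracy": cond_acc}

demo = [
    {"answered": False, "correct": False, "should_abstain": True},   # 正确拒答
    {"answered": True,  "correct": True,  "should_abstain": False},  # 正常作答且正确
    {"answered": True,  "correct": False, "should_abstain": True},   # “死亡”后仍作答
]
print(behavior_metrics(demo))
```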
[NLP-158] Think, Reflect, Create: Metacognitive Learning for Zero-Shot Robotic Planning with LLMs
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在机器人领域应用中面临的局限性,即其主要局限于静态、基于提示的行为,在零样本或少样本设置下处理复杂任务仍存在挑战。解决方案的关键在于引入元认知学习(metacognitive learning),通过赋予LLMs推理、反思和创造的能力,增强其在机器人任务中的表现。具体而言,该研究提出了一种早期框架,将元认知学习整合到LLM驱动的多机器人协作中,使机器人代理具备技能分解与自我反思机制,从而识别模块化技能、反思未见过任务场景中的失败,并合成有效的解决方案。
链接: https://arxiv.org/abs/2505.14899
作者: Wenjie Lin,Jin Wei-Kocsis
机构: Purdue University (普渡大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While large language models (LLMs) have shown great potential across various domains, their applications in robotics remain largely limited to static, prompt-based behaviors and still face challenges in handling complex tasks under zero-shot or few-shot settings. Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental research question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing their ability to perform robotic tasks with minimal demonstrations? In this paper, we present an early-stage framework that integrates metacognitive learning into LLM-powered multi-robot collaboration. The proposed framework equips the LLM-powered robotic agents with a skill decomposition and self-reflection mechanism that identifies modular skills from prior tasks, reflects on failures in unseen task scenarios, and synthesizes effective new solutions. Experimental results show that our metacognitive-learning-empowered LLM framework significantly outperforms existing baselines. Moreover, we observe that the framework is capable of generating solutions that differ from the ground truth yet still successfully complete the tasks. These exciting findings support our hypothesis that metacognitive learning can foster creativity in robotic planning.
zh
[NLP-159] Scaling Laws for State Dynamics in Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理需要内部状态跟踪的任务时,其建模状态转移动态能力不足的问题。研究通过在三个领域(Box Tracking、Abstract DFA Sequences 和 Complex Text Games)中评估状态预测的准确性,揭示了状态空间大小和转移稀疏性对模型性能的影响。解决方案的关键在于通过激活修补(activation patching)技术识别出负责传播状态信息的注意力头,并发现状态跟踪在LLMs中是通过下一词预测头的分布式交互而非显式符号计算产生的。
链接: https://arxiv.org/abs/2505.14892
作者: Jacob X Li,Shreyas S Raman,Jessica Wan,Fahad Samman,Jazlyn Lin
机构: Brown University (布朗大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages; 23 figures
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used in tasks requiring internal state tracking, yet their ability to model state transition dynamics remains poorly understood. We evaluate how well LLMs capture deterministic state dynamics across 3 domains: Box Tracking, Abstract DFA Sequences, and Complex Text Games, each formalizable as a finite-state system. Across tasks, we find that next-state prediction accuracy degrades with increasing state-space size and sparse transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is 10 and transitions are 30. Through activation patching, we identify attention heads responsible for propagating state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B Heads at Layers 10, 11, 12, and 14. While these heads successfully move relevant state features, action information is not reliably routed to the final token, indicating weak joint state-action reasoning. Our results suggest that state tracking in LLMs emerges from distributed interactions of next-token heads rather than explicit symbolic computation.
zh
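摘要中的 DFA 下一状态预测评测,其“形式”可以用一个极小的实验框架复现(示意;predict 处应接被测 LLM 的包装,此处用随机猜测作基线,其准确率约为 1/状态数,即论文中所说的机会水平):

```python
import random

def random_dfa(n_states=5, n_symbols=3, seed=1):
    # 随机生成一个确定性有限自动机的转移表:(状态, 输入符号) -> 下一状态
    rng = random.Random(seed)
    return {(s, a): rng.randrange(n_states)
            for s in range(n_states) for a in range(n_symbols)}

def next_state_accuracy(predict, dfa, n_states, n_symbols, trials=1000, seed=2):
    """predict(state, action) 是被测模型的包装;真实评测中应替换为 LLM 的预测。"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s, a = rng.randrange(n_states), rng.randrange(n_symbols)
        hits += predict(s, a) == dfa[(s, a)]
    return hits / trials

dfa = random_dfa()
guess = lambda s, a: random.randrange(5)  # 随机基线:期望准确率约 1/5
print(next_state_accuracy(guess, dfa, n_states=5, n_symbols=3))
```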
[NLP-160] In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties
【速读】: 该论文试图解决当前先进语音语言模型在面对不同说话人和语言变体时适应能力不足的问题,即这些模型是否能够像人类听众一样通过暴露进行适应。解决方案的关键在于引入一种可扩展的框架,该框架通过交错的任务提示和音频-文本对实现Phi-4 Multimodal模型的上下文学习(in-context learning, ICL),从而在推理阶段仅需少量示例语句(如12个话语,约50秒)即可显著降低词错误率,并提升自动语音识别(ASR)的鲁棒性。
链接: https://arxiv.org/abs/2505.14887
作者: Nathan Roll,Calbert Graham,Yuka Tatsumi,Kim Tien Nguyen,Meghan Sumner,Dan Jurafsky
机构: Stanford University (斯坦福大学); University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 15 pages; 3 figures
点击查看摘要
Abstract:Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided–though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
zh
[NLP-161] Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters
【速读】: 该论文旨在解决竞争性辩论中因时间限制和论点互动性带来的挑战,具体包括:(1)辩手需在有限时间内做出战略选择,而非覆盖所有可能论点;(2)辩论的说服力依赖于论点之间的来回互动,单一最终结果无法全面评估。其解决方案的关键在于提出TreeDebater框架,通过引入两种树结构——Rehearsal Tree和Debate Flow Tree——分别用于预测攻击与防御以评估主张强度,以及跟踪辩论状态以识别活跃行动。此外,TreeDebater通过时间预算分配、演讲时间控制器和模拟观众反馈来优化陈述内容,从而在阶段级和辩论级的人类评估中优于现有最先进的多智能体辩论系统。
链接: https://arxiv.org/abs/2505.14886
作者: Danqing Wang,Zhuorui Ye,Xinran Zhao,Fei Fang,Lei Li
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: 9 main pages
点击查看摘要
Abstract:Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in the competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and Debate Flow Tree. The Rehearsal Tree anticipates the attack and defenses to evaluate the strength of the claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. The human evaluation on both the stage-level and the debate-level comparison shows that our TreeDebater outperforms the state-of-the-art multi-agent debate system. Further investigation shows that TreeDebater shows better strategies in limiting time to important debate actions, aligning with the strategies of human debate experts.
zh
[NLP-162] Incorporating Token Usage into Prompting Strategy Evaluation
【速读】: 该论文试图解决大规模语言模型在任务执行中对提示策略依赖性过强的问题,特别是提示策略在性能与令牌使用效率之间的平衡问题。论文提出的关键解决方案是引入Big-O_tok理论框架,用于描述提示策略的令牌使用增长模式,并结合Token Cost这一实证指标来衡量单位性能的令牌成本,从而强调效率在实际应用中的重要性。
链接: https://arxiv.org/abs/2505.14880
作者: Chris Sypherd,Sergei Petrov,Sonny George,Vaishak Belle
机构: University of Edinburgh (爱丁堡大学); Independent Researcher (独立研究员); Brandeis University (布兰戴斯大学)
类目: Computation and Language (cs.CL)
备注: 20 pages, 12 tables, 4 figures
点击查看摘要
Abstract:In recent years, large language models have demonstrated remarkable performance across diverse tasks. However, their task effectiveness is heavily dependent on the prompting strategy used to elicit output, which can vary widely in both performance and token usage. While task performance is often used to determine prompting strategy success, we argue that efficiency, balancing performance and token usage, can be a more practical metric for real-world utility. To enable this, we propose Big-O_tok, a theoretical framework for describing the token usage growth of prompting strategies, and analyze Token Cost, an empirical measure of tokens per performance. We apply these to several common prompting strategies and find that increased token usage leads to drastically diminishing performance returns. Our results validate the Big-O_tok analyses and reinforce the need for efficiency-aware evaluations.
zh
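Token Cost 这一实证指标在形式上非常简单:单位性能消耗的 token 数。下面按此定义给出一个直接实现,数值为虚构示例,仅用于说明“性能小幅提升、token 用量成倍增加”时效率指标如何恶化:

```python
def token_cost(total_tokens: int, performance: float) -> float:
    """Token Cost 的一种直接实现假设:tokens / performance,数值越低越高效。"""
    if performance <= 0:
        return float("inf")
    return total_tokens / performance

# 同一任务上两种提示策略的对比(数字为虚构示例)
print(token_cost(total_tokens=120_000, performance=0.82))  # 直接作答:约 146K tokens/单位性能
print(token_cost(total_tokens=480_000, performance=0.86))  # 长链 CoT:性能略升,效率大降
```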
[NLP-163] Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages INTERSPEECH2025
【速读】: 该论文旨在解决非英语语言中因数据稀缺而导致的构音障碍语音自动语音识别(ASR)性能不佳的问题。其解决方案的关键在于对英语构音障碍语音(UASpeech)进行微调,以编码说话人特征和语调扭曲,并将其应用于健康非英语语音(FLEURS)以生成非英语构音障碍类似语音,进而用于微调多语言ASR模型MMS,从而提升构音障碍语音的识别效果。
链接: https://arxiv.org/abs/2505.14874
作者: Chin-Jou Li,Eunjung Yeo,Kwanghee Choi,Paula Andrea Pérez-Toro,Masao Someki,Rohan Kumar Das,Zhengjun Yue,Juan Rafael Orozco-Arroyave,Elmar Nöth,David R. Mortensen
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 1 figure, Accepted to Interspeech 2025
点击查看摘要
Abstract:Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
zh
[NLP-164] Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models
【速读】: 该论文试图解决在资源受限设备上高效部署大型语言模型(Large Language Models, LLMs)的问题,特别是针对预训练LLMs进行微调后的下游任务中,由于预训练模型的高秩特性和缺乏预训练数据访问权限而导致的压缩挑战。解决方案的关键在于提出一种稀疏增强张量网络(Sparse Augmented Tensor Networks, Saten),通过低秩张量化方法实现对LLMs的全面压缩,并在实验中验证了其在提升模型精度和压缩效率方面的有效性。
链接: https://arxiv.org/abs/2505.14871
作者: Ryan Solgi,Kai Zhen,Rupak Vignesh Swaminathan,Nathan Susanj,Athanasios Mouchtaris,Siegfried Kunzmann,Zheng Zhang
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their applications to compress pre-trained large language models (LLMs) for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.
zh
[NLP-165] EasyMath: A 0-shot Math Benchmark for SLMs
【速读】: 该论文试图解决小语言模型在实际数学推理任务中的性能评估问题,其提出的解决方案是构建一个名为EasyMath的紧凑基准测试集,用于评估模型在不同数学领域的推理能力。EasyMath涵盖了从基础算术到代数表达式等13个类别,旨在提供一个全面但不过于专业的评测框架。关键在于通过精确、数值和符号检查方法,在零样本设置下对自由格式答案进行评估,从而揭示模型大小、训练程度以及思维链(chain-of-thought)对准确性和一致性的提升作用。
链接: https://arxiv.org/abs/2505.14852
作者: Drishya Karki,Michiel Kamphuis,Angelecia Frey
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 9 figures, 8 tables
点击查看摘要
Abstract:EasyMath is a compact benchmark for practical math reasoning in small language models. It covers thirteen categories, from basic arithmetic and order of operations to word problems, algebraic expressions, edge cases, and omits specialist topics. We tested 23 models (14M to 4B parameters) using exact, numerical, and symbolic checks on free-form answers in a zero-shot setting. Accuracy rises with size and training, chain-of-thought adds modest gains, and consistency improves at scale.
zh
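摘要中对自由格式答案做“精确、数值和符号”三级检查的判定逻辑,可用 sympy 粗略实现如下(示意;EasyMath 的实际判定规则可能更严格):

```python
import sympy as sp

def check_answer(pred: str, gold: str, tol: float = 1e-6) -> bool:
    """三级判定:精确匹配 -> 数值匹配 -> 符号等价。"""
    if pred.strip() == gold.strip():              # 1) 精确匹配
        return True
    try:                                          # 2) 数值匹配
        if abs(float(pred) - float(gold)) < tol:
            return True
    except ValueError:
        pass
    try:                                          # 3) 符号等价(差化简为 0)
        return sp.simplify(sp.sympify(pred) - sp.sympify(gold)) == 0
    except Exception:
        return False

print(check_answer("0.5", "1/2"))        # True:符号层判定 0.5 - 1/2 = 0
print(check_answer("2*(x+1)", "2*x+2"))  # True:符号等价
print(check_answer("3.14", "2.71"))      # False
```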
[NLP-166] MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation
【速读】: 该论文旨在解决自动翻译系统中误差检测与译文优化的不足,特别是如何提升翻译的语义准确性、本地化适应性以及对语言差异较大的语种对的处理能力。其解决方案的关键在于提出MAATS(Multi Agent Automated Translation System),该系统基于多维质量度量框架(Multidimensional Quality Metrics, MQM)作为细粒度错误检测信号,通过多个专注于不同MQM类别的专用AI代理(如准确性、流畅性、风格、术语等)协同工作,并由一个合成代理整合标注结果进行迭代优化,从而克服传统单代理系统依赖自我修正的局限性。
链接: https://arxiv.org/abs/2505.14848
作者: Xi Wang,Jiaqian Hu,Safinah Ali
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:We present MAATS, a Multi Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single-agent methods that rely on self-correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi-layered error diagnosis, omission detection across perspectives, and context-aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.
zh
[NLP-167] A Comparative Study of Large Language Models and Human Personality Traits
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)是否表现出类似人类的人格特质及其与人类人格的可比性问题,重点在于传统人格评估工具在LLMs中的适用性。解决方案的关键在于提出“分布式人格框架”(Distributed Personality Framework),将LLMs的人格特征概念化为动态且受输入驱动的特性,并通过行为基础的方法在三个实证研究中验证了LLMs人格的不稳定性、对问题表述的敏感性以及在角色扮演中受提示和参数设置影响的特性。
链接: https://arxiv.org/abs/2505.14845
作者: Wang Jiaqi,Wang bo,Guo fa,Cheng cheng,Yang li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated human-like capabilities in language comprehension and generation, becoming active participants in social and cognitive domains. This study investigates whether LLMs exhibit personality-like traits and how these traits compare with human personality, focusing on the applicability of conventional personality assessment tools. A behavior-based approach was used across three empirical studies. Study 1 examined test-retest stability and found that LLMs show higher variability and are more input-sensitive than humans, lacking long-term stability. Based on this, we propose the Distributed Personality Framework, conceptualizing LLM traits as dynamic and input-driven. Study 2 analyzed cross-variant consistency in personality measures and found LLMs’ responses were highly sensitive to item wording, showing low internal consistency compared to humans. Study 3 explored personality retention during role-playing, showing LLM traits are shaped by prompt and parameter settings. These findings suggest that LLMs express fluid, externally dependent personality patterns, offering insights for constructing LLM-specific personality frameworks and advancing human-AI interaction. This work contributes to responsible AI development and extends the boundaries of personality psychology in the age of intelligent systems.
zh
[NLP-168] SEPS: A Separability Measure for Robust Unlearning in LLMs
【速读】: 该论文旨在解决机器遗忘(Machine Unlearning)中现有评估指标无法有效反映真实场景的问题,特别是在同一提示中同时存在遗忘查询和保留查询的情况下,传统方法难以准确衡量模型的遗忘与保留能力。论文提出的解决方案关键在于设计了一个名为SEPS的评估框架,该框架能够显式评估模型在单一提示中同时处理遗忘和保留信息的能力,并通过引入混合提示(Mixed Prompt, MP)遗忘策略,将遗忘和保留查询整合到统一的训练目标中,从而提升模型在复杂场景下的遗忘效果与鲁棒性。
链接: https://arxiv.org/abs/2505.14832
作者: Wonje Jeung,Sangyeon Yoon,Albert No
机构: Yonsei University (延世大学); Hongik University (弘益大学)
类目: Computation and Language (cs.CL)
备注: 32 pages
点击查看摘要
Abstract:Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model’s ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.
zh
[NLP-169] Text Generation Beyond Discrete Token Sampling
【速读】: 该论文试图解决标准自回归生成过程中信息丢失的问题,即在生成每个token时,模型仅保留采样后的离散token,而丢弃了原本丰富的下一个token分布信息。解决方案的关键在于提出一种无需额外训练的“输入混合”(Mixture of Inputs, MoI)方法,通过贝叶斯估计将之前丢弃的token分布作为先验,采样得到的token作为观测值,用连续后验期望替代传统的独热向量作为新的模型输入,从而在生成过程中保持更丰富的内部表示。
链接: https://arxiv.org/abs/2505.14827
作者: Yufan Zhuang,Liyuan Liu,Chandan Singh,Jingbo Shang,Jianfeng Gao
机构: UC San Diego(加州大学圣地亚哥分校); Microsoft Research(微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution’s rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
zh
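MoI 的关键一步是把“被丢弃的下一词分布”重新混入输入:不再只查采样 token 的嵌入,而是用连续的后验期望对整张嵌入表加权。下面是该步骤的最小示意;beta 在此为简化的混合权重,论文中该权重来自“分布为先验、采样为观测”的贝叶斯估计:

```python
import torch

def mixture_of_inputs(embed_matrix, probs, sampled_id, beta=0.5):
    """用后验期望替代 one-hot 作为新输入(示意;beta 为本文简化假设)。"""
    onehot = torch.zeros_like(probs)
    onehot[sampled_id] = 1.0
    posterior = beta * onehot + (1 - beta) * probs  # 连续的后验期望
    return posterior @ embed_matrix                  # 对嵌入表做加权平均

vocab, dim = 10, 4
E = torch.randn(vocab, dim)                          # 玩具嵌入表
p = torch.softmax(torch.randn(vocab), dim=0)         # 被标准流程丢弃的下一词分布
x_next = mixture_of_inputs(E, p, sampled_id=int(torch.multinomial(p, 1)))
print(x_next.shape)  # torch.Size([4]):作为下一步的输入向量
```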
[NLP-170] FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain
【速读】: 该论文试图解决监督微调(Supervised Fine-tuning, SFT)在适应大型语言模型(Large Language Models, LLMs)到新领域时的统计效率问题。其解决方案的关键在于通过选择具有信息量的训练样本子集,在固定的训练样本预算下最大化信息增益,具体而言是通过计算LLM对数似然的Hessian矩阵来衡量信息增益,并利用多项逻辑回归模型在最后一层线性化LLM以高效近似该Hessian矩阵。
链接: https://arxiv.org/abs/2505.14826
作者: Rohan Deb,Kiran Thekumparampil,Kousha Kalantari,Gaurush Hiranandani,Shoham Sabach,Branislav Kveton
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Supervised fine-tuning (SFT) is a standard approach to adapting large language models (LLMs) to new domains. In this work, we improve the statistical efficiency of SFT by selecting an informative subset of training examples. Specifically, for a fixed budget of training examples, which determines the computational cost of fine-tuning, we determine the most informative ones. The key idea in our method is to select examples that maximize information gain, measured by the Hessian of the log-likelihood of the LLM. We approximate it efficiently by linearizing the LLM at the last layer using multinomial logistic regression models. Our approach is computationally efficient, analyzable, and performs well empirically. We demonstrate this on several problems, and back our claims with both quantitative results and an LLM evaluation.
zh
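“选使信息增益最大的样本”可以落到一个贪心的 log-det 准则上:把每个样本对 Fisher 信息矩阵的贡献近似为秩一更新,并利用矩阵行列式引理 det(F + λxxᵀ) = det(F)(1 + λxᵀF⁻¹x) 增量计算增益。以下为一个简化示意,其中 λ 用 softmax 方差之和粗略代替,是本文的假设而非论文的精确公式:

```python
import numpy as np

def greedy_fisher_selection(features, probs, budget):
    """贪心最大化 log det(Fisher 信息) 的示意实现。"""
    d = features.shape[1]
    lam = (probs * (1 - probs)).sum(axis=1)  # 每个样本的“方差权重”(简化假设)
    F = np.eye(d) * 1e-3                      # 带正则化的初始信息矩阵
    chosen = []
    for _ in range(budget):
        best, best_gain = -1, -np.inf
        for i in range(len(features)):
            if i in chosen:
                continue
            x = features[i]
            # 秩一更新的闭式增益:log det(F + lam x x^T) - log det(F)
            gain = np.log1p(lam[i] * x @ np.linalg.solve(F, x))
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
        x = features[best]
        F += lam[best] * np.outer(x, x)
    return chosen

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))               # 末层线性化后的样本特征
P = rng.dirichlet(np.ones(4), size=20)         # 假想的 softmax 输出
print(greedy_fisher_selection(X, P, budget=3))
```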
[NLP-171] racing Multilingual Factual Knowledge Acquisition in Pretraining
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在预训练过程中事实回忆能力和跨语言一致性的发展机制问题,尤其是针对多语言场景下的知识获取路径进行深入分析。其解决方案的关键在于通过跟踪OLMo-7B模型在预训练过程中的表现,揭示事实回忆准确性和跨语言一致性的演变规律,并发现模型的事实回忆能力主要由预训练语料库中事实的频率驱动,同时跨语言迁移效应在早期预训练阶段对低频非英语事实的回忆具有显著促进作用。
链接: https://arxiv.org/abs/2505.14824
作者: Yihong Liu,Mingyang Wang,Amir Hossein Kargaran,Felicia Körner,Ercong Nie,Barbara Plank,François Yvon,Hinrich Schütze
机构: Center for Information and Language Processing, LMU Munich (信息与语言处理中心,慕尼黑大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心(MCML)); Sorbonne Université, CNRS, ISIR, France (索邦大学,法国国家科学研究中心,ISIR,法国)
类目: Computation and Language (cs.CL)
备注: preprint
点击查看摘要
Abstract:Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts – an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at this https URL.
zh
[NLP-172] WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在长篇故事叙述能力方面缺乏稳健评估的问题,现有基准测试通常在规模、多样性或客观度量方面存在不足。其解决方案的关键在于引入WebNovelBench,这是一个专为评估长篇小说生成而设计的新基准,利用超过4,000部中文网络小说的大规模数据集,并将评估任务定义为从摘要生成故事的任务。该方法通过多维叙事质量框架进行评估,采用LLM-as-Judge的自动评分方式,结合主成分分析对得分进行聚合,并与人类创作作品进行百分位排名,从而实现对LLM叙事能力的有效区分和量化分析。
链接: https://arxiv.org/abs/2505.14818
作者: Leon Lin,Jun Zheng,Haidong Wang
机构: Nanyang Technological University (南洋理工大学); Sun Yat-Sen University (中山大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
zh
[NLP-173] Language Mixing in Reasoning Language Models: Patterns Impact and Internal Causes
【速读】: 该论文试图解决推理语言模型(Reasoning Language Models, RLMs)在进行推理时出现的语言混杂(language mixing)问题,即推理步骤中包含与提示语言不同的语言标记,这可能影响模型的性能。论文提出了一种系统性的研究方法,通过分析15种语言、7种任务难度级别和18个学科领域的语言混杂模式、影响及其内部成因,揭示了语言混杂的多维特性。其解决方案的关键在于发现推理语言的选择对模型性能具有显著影响,特别是通过约束解码强制模型使用拉丁字母或汉字书写系统进行推理,可显著提升准确性。此外,研究还表明推理轨迹的脚本组成与模型内部表示密切相关,为优化多语言推理和控制推理语言提供了新的方向。
链接: https://arxiv.org/abs/2505.14815
作者: Mingyang Wang,Lukas Lange,Heike Adel,Yunpu Ma,Jannik Strötgen,Hinrich Schütze
机构: Bosch Center for Artificial Intelligence (博世人工智能中心); LMU Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Hochschule der Medien (媒体学院); Karlsruhe University of Applied Sciences (卡尔斯鲁厄应用科学大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model’s internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.
zh
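摘要提到“通过约束解码强制模型用拉丁或汉字脚本推理”。其常见做法之一是按 Unicode 书写系统构造 logit 掩码,屏蔽其他文字的 token。下面是一个独立的小示意:tokenizer_vocab 的接口为本文假设,按字符的 Unicode 名称判断脚本也只是近似,真实实现还需处理 BPE 子词的边角情况:

```python
import unicodedata
import torch

def script_mask(tokenizer_vocab, allowed=("LATIN", "CJK")):
    """为约束解码构造 logit 掩码:只放行指定书写系统的 token。
    tokenizer_vocab 是 {token字符串: id} 的映射(假设的接口)。"""
    def ok(tok):
        letters = [c for c in tok if c.isalpha()]
        if not letters:
            return True  # 标点、空白、数字等一律放行
        return all(any(s in unicodedata.name(c, "") for s in allowed) for c in letters)
    mask = torch.full((max(tokenizer_vocab.values()) + 1,), float("-inf"))
    for tok, idx in tokenizer_vocab.items():
        if ok(tok):
            mask[idx] = 0.0
    return mask  # 解码时加到 logits 上,即可屏蔽其他文字的 token

vocab = {"hello": 0, "你好": 1, "привет": 2, ",": 3}
print(script_mask(vocab))  # 西里尔文 token 对应位置为 -inf
```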
[NLP-174] Scaling Reasoning Losing Control: Evaluating Instruction Following in Large Reasoning Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在数学推理任务中对自然语言指令的遵循能力不足的问题。研究发现,尽管当前基于推理的模型在复杂数学问题上表现出色,但其在遵循用户指令方面存在显著不足,且随着模型规模的扩大和生成长度的增加,这种问题更加明显。解决方案的关键在于揭示模型推理能力与指令遵循之间的权衡,并通过简单的干预手段部分恢复模型的指令遵循能力,尽管这可能会牺牲一定的推理性能。这一发现揭示了当前LLM训练范式中的根本性矛盾,并推动了更注重指令感知的推理模型的发展。
链接: https://arxiv.org/abs/2505.14810
作者: Tingchen Fu,Jiawei Gu,Yafu Li,Xiaoye Qu,Yu Cheng
机构: Renmin University of China (中国人民大学); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at this https URL.
zh
[NLP-175] Automated Journalistic Questions: A New Method for Extracting 5W1H in French
【速读】: 该论文试图解决从法语新闻文章中自动提取5W1H信息(即谁、什么、何时、何地、为何和如何)的问题,这一任务是摘要生成、聚类和新闻聚合等下游任务的关键前提。解决方案的关键在于设计并实现了一个自动化提取管道,并通过构建一个包含250篇魁北克新闻文章的语料库进行评估,该语料库由四位人工标注者标记了5W1H答案,实验结果表明该管道在该任务上的表现与大型语言模型GPT-4o相当。
链接: https://arxiv.org/abs/2505.14804
作者: Richard Khoury,Maxence Verhaverbeke,Julie A. Gramaccia
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 5 figures, 7 tables
点击查看摘要
Abstract:The 5W1H questions - who, what, when, where, why and how - are commonly used in journalism to ensure that an article describes events clearly and systematically. Answering them is a crucial prerequisite for tasks such as summarization, clustering, and news aggregation. In this paper, we design the first automated extraction pipeline to get 5W1H information from French news articles. To evaluate the performance of our algorithm, we also create a corpus of 250 Quebec news articles with 5W1H answers marked by four human annotators. Our results demonstrate that our pipeline performs as well in this task as the large language model GPT-4o.
zh
[NLP-176] Addressing the Challenges of Planning Language Generation
【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)生成形式化规划语言(如PDDL),以调用符号求解器来确定性地推导出规划方案的问题。尽管这一方法在封闭源代码模型或特定LLM流程中已取得成功,但此前认为参数量不超过500亿的开源模型无法完成此任务。论文设计并评估了8种不同的PDDL生成流程,其关键在于采用推理时的扩展方法,例如通过求解器和规划验证器的反馈进行修订,从而显著提升性能,效果超过两倍。
链接: https://arxiv.org/abs/2505.14763
作者: Prabhu Prakash Kagitha,Andrew Zhu,Li Zhang
机构: Drexel University (德雷塞尔大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Using LLMs to generate formal planning languages such as PDDL that invokes symbolic solvers to deterministically derive plans has been shown to outperform generating plans directly. While this success has been limited to closed-sourced models or particular LLM pipelines, we design and evaluate 8 different PDDL generation pipelines with open-source models under 50 billion parameters previously shown to be incapable of this task. We find that intuitive approaches such as using a high-resource language wrapper or constrained decoding with grammar decrease performance, yet inference-time scaling approaches such as revision with feedback from the solver and plan validator more than double the performance.
zh
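摘要中效果最好的“利用求解器与规划验证器反馈进行修订”的推理时扩展,可以抽象成下面这个循环。这是一个示意:llm_generate / run_solver / run_validator 均为本文假设的接口,实际可分别接 LLM 调用、Fast Downward 之类的 planner 与 VAL 验证器:

```python
def generate_pddl_with_feedback(task_desc, llm_generate, run_solver,
                                run_validator, max_rounds=3):
    """带反馈的迭代修订循环(示意):每轮把错误信息回灌给 LLM 重写 PDDL。"""
    feedback = ""
    for _ in range(max_rounds):
        pddl = llm_generate(task_desc, feedback)       # 依据反馈重写 PDDL
        ok_syntax, solver_msg, plan = run_solver(pddl)
        if not ok_syntax:
            feedback = f"solver error: {solver_msg}"   # 语法/建模错误回灌
            continue
        ok_plan, val_msg = run_validator(pddl, plan)
        if ok_plan:
            return pddl, plan                          # 得到经验证的规划
        feedback = f"invalid plan: {val_msg}"          # 语义错误回灌
    return None, None                                  # 预算内未修复
```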
[NLP-177] MORALISE: A Structured Benchmark for Moral Alignment in Visual Language Models
【速读】: 该论文试图解决视觉语言模型(VLMs)在道德对齐方面存在的局限性,特别是在高风险现实应用场景中确保其输出符合人类道德价值观的问题。现有研究要么仅关注文本模态,要么过度依赖AI生成的图像,导致数据分布偏差和现实感不足。解决方案的关键在于引入MORALISE,这是一个基于多样化、专家验证的真实世界数据的全面基准,用于评估VLMs的道德对齐能力。该基准通过构建13个基于Turiel领域理论的道德主题分类体系,并手动整理2,481对高质量图像-文本对,每个样本均标注了细粒度的道德主题和模态来源,从而有效提升了评估的准确性和实用性。
链接: https://arxiv.org/abs/2505.14728
作者: Xiao Lin,Zhining Liu,Ze Yang,Gaotang Li,Ruizhong Qiu,Shuke Wang,Hui Liu,Haotian Li,Sumit Keswani,Vishwa Pardeshi,Huijun Zhao,Wei Fan,Hanghang Tong
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊); Fidelity Investments (富达投资)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM)
备注: 21 pages, 11 figures, 7 tables
点击查看摘要
Abstract:Warning: This paper contains examples of harmful language and images. Reader discretion is advised. Recently, vision-language models have demonstrated increasing influence in morally sensitive domains such as autonomous driving and medical analysis, owing to their powerful multimodal reasoning capabilities. As these models are deployed in high-stakes real-world applications, it is of paramount importance to ensure that their outputs align with human moral values and remain within moral boundaries. However, existing work on moral alignment either focuses solely on textual modalities or relies heavily on AI-generated images, leading to distributional biases and reduced realism. To overcome these limitations, we introduce MORALISE, a comprehensive benchmark for evaluating the moral alignment of vision-language models (VLMs) using diverse, expert-verified real-world data. We begin by proposing a comprehensive taxonomy of 13 moral topics grounded in Turiel’s Domain Theory, spanning the personal, interpersonal, and societal moral domains encountered in everyday life. Built on this framework, we manually curate 2,481 high-quality image-text pairs, each annotated with two fine-grained labels: (1) topic annotation, identifying the violated moral topic(s), and (2) modality annotation, indicating whether the violation arises from the image or the text. For evaluation, we encompass two tasks, \textitmoral judgment and \textitmoral norm attribution, to assess models’ awareness of moral violations and their reasoning ability on morally salient content. Extensive experiments on 19 popular open- and closed-source VLMs show that MORALISE poses a significant challenge, revealing persistent moral limitations in current state-of-the-art models. The full benchmark is publicly available at this https URL.
zh
[NLP-178] Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs ICDAR
【速读】: 该论文旨在解决数字原生PDF文档中文档版式自动分析的问题,该问题由于文本和非文本元素的异构排列以及PDF中文本元数据的不精确性而具有挑战性。其解决方案的关键在于利用图神经网络(GNN)架构进行细粒度的文本块版式分类,并引入两种图构建结构:k近邻图和全连接图,同时通过预训练的文本和视觉模型生成节点特征,从而避免了手动特征工程。此外,采用多模态融合策略,包括单模态、拼接多模态和双分支多模态实验框架,最终在双分支配置下基于k近邻图的GraphSAGE模型表现出最佳性能。
链接: https://arxiv.org/abs/2505.14699
作者: Miguel Lopez-Duran,Julian Fierrez,Aythami Morales,Ruben Tolosana,Oscar Delgado-Mohatar,Alvaro Ortigosa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 2 figures, preprint presented in The Fifth ICDAR International Workshop on Machine Learning
点击查看摘要
Abstract:The automatic analysis of document layouts in digital-born PDF documents remains a challenging problem due to the heterogeneous arrangement of textual and nontextual elements and the imprecision of the textual metadata in the Portable Document Format. In this work, we benchmark Graph Neural Network (GNN) architectures for the task of fine-grained layout classification of text blocks from digital native documents. We introduce two graph construction structures: a k-closest-neighbor graph and a fully connected graph, and generate node features via pre-trained text and vision models, thus avoiding manual feature engineering. Three experimental frameworks are evaluated: single-modality (text or visual), concatenated multimodal, and dual-branch multimodal. We evaluated four foundational GNN models and compared them with the baseline. Our experiments are specifically conducted on a rich dataset of public affairs documents that includes more than 20 sources (e.g., regional and national-level official gazettes), 37K PDF documents, with 441K pages in total. Our results demonstrate that GraphSAGE operating on the k-closest-neighbor graph in a dual-branch configuration achieves the highest per-class and overall accuracy, outperforming the baseline in some sources. These findings confirm the importance of local layout relationships and multimodal fusion exploited through GNNs for the analysis of native digital document layouts.
zh
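下面是上文"k 近邻图 + GraphSAGE"思路的一个最小草图:用文本块中心坐标建图,两层 GraphSAGE 做细粒度版式分类。节点特征在论文中来自预训练文本/视觉编码器,此处以随机张量代替;特征维度与类别数均为示意性假设,并非论文的原始配置。

```python
# 最小示意:k 近邻图构建 + GraphSAGE 版式分类(非官方实现)
import torch
import torch.nn.functional as F
from sklearn.neighbors import NearestNeighbors
from torch_geometric.nn import SAGEConv

def knn_edge_index(centers: torch.Tensor, k: int = 4) -> torch.Tensor:
    """由文本块中心点构建 k 近邻边(加反向边使图为双向)。"""
    nn_ = NearestNeighbors(n_neighbors=k + 1).fit(centers.numpy())
    _, idx = nn_.kneighbors(centers.numpy())
    src = torch.arange(centers.size(0)).repeat_interleave(k)
    dst = torch.as_tensor(idx[:, 1:]).reshape(-1)   # 跳过自身
    return torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])

class LayoutSAGE(torch.nn.Module):
    def __init__(self, in_dim=768, hid=128, num_classes=5):  # 维度与类别数为假设值
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid)
        self.conv2 = SAGEConv(hid, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

# 示例:一页 32 个文本块,节点特征假设为 768 维(实际来自预训练编码器)
centers = torch.rand(32, 2)
x = torch.randn(32, 768)
logits = LayoutSAGE()(x, knn_edge_index(centers))
print(logits.shape)  # torch.Size([32, 5])
```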
[NLP-179] Sentiment Analysis in Software Engineering: Evaluating Generative Pre-trained Transformers
【速读】: 该论文旨在解决软件工程(SE)领域中情感分析工具在处理领域内复杂、上下文依赖的语言时表现不足的问题。其关键解决方案是系统评估双向变压器模型(如BERT)与生成式预训练变压器(如GPT-4o-mini)在SE情感分析任务中的性能,并通过微调和默认配置进行对比实验,以探索模型架构与数据集特征之间的适配性。
链接: https://arxiv.org/abs/2505.14692
作者: KM Khalid Saifullah,Faiaz Azmain,Habiba Hye
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Sentiment analysis plays a crucial role in understanding developer interactions, issue resolutions, and project dynamics within software engineering (SE). While traditional SE-specific sentiment analysis tools have made significant strides, they often fail to account for the nuanced and context-dependent language inherent to the domain. This study systematically evaluates the performance of bidirectional transformers, such as BERT, against generative pre-trained transformers, specifically GPT-4o-mini, in SE sentiment analysis. Using datasets from GitHub, Stack Overflow, and Jira, we benchmark the models’ capabilities with fine-tuned and default configurations. The results reveal that fine-tuned GPT-4o-mini performs comparable to BERT and other bidirectional models on structured and balanced datasets like GitHub and Jira, achieving macro-averaged F1-scores of 0.93 and 0.98, respectively. However, on linguistically complex datasets with imbalanced sentiment distributions, such as Stack Overflow, the default GPT-4o-mini model exhibits superior generalization, achieving an accuracy of 85.3% compared to the fine-tuned model’s 13.1%. These findings highlight the trade-offs between fine-tuning and leveraging pre-trained models for SE tasks. The study underscores the importance of aligning model architectures with dataset characteristics to optimize performance and proposes directions for future research in refining sentiment analysis tools tailored to the SE domain.
zh
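论文报告的 0.93/0.98 为宏平均 F1。宏平均对每个类别一视同仁,因而在 Stack Overflow 这类情感分布不均衡的数据上更能暴露模型短板。下面用 scikit-learn 演示该指标的计算方式(标签与预测为虚构示例数据):

```python
# 最小示意:宏平均 F1 与准确率的计算口径
from sklearn.metrics import f1_score, accuracy_score

y_true = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "neu", "pos", "neu", "neu", "pos", "neg"]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # 各类 F1 先分别计算再取平均
acc = accuracy_score(y_true, y_pred)
print(f"macro-F1={macro_f1:.3f}, accuracy={acc:.3f}")
```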
[NLP-180] THELMA: Task Based Holistic Evaluation of Large Language Model Applications - RAG Question Answering
【速读】: 该论文试图解决基于检索增强生成(Retrieval Augmented Generation, RAG)的问答(QA)应用在无参考情况下进行端到端评估与优化的问题。解决方案的关键在于提出THELMA(Task Based Holistic Evaluation of Large Language Model Applications)框架,该框架包含六个相互关联的指标,旨在对RAG QA系统进行全面且细粒度的评估,从而帮助开发者和应用所有者无需依赖标注数据或参考答案即可评估、监控和改进整个RAG QA流程。
链接: https://arxiv.org/abs/2505.11626
作者: Udita Patel,Rutu Mulkar,Jay Roberts,Cibi Chakravarthy Senthilkumar,Sujay Gandhi,Xiaofei Zheng,Naumaan Nayyar,Rafael Castrillo
机构: Amazon.com Services Inc. (亚马逊公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for RAG (Retrieval Augmented Generation) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor and improve end-to-end RAG QA pipelines without requiring labelled sources or reference answers. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.
zh
[NLP-181] ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality INTERSPEECH2025
【速读】: 该论文旨在解决口语化中文音频中毒性言论检测的难题,尤其是在处理汉语特有的韵律特征和文化特定表达方面存在显著研究空白。其解决方案的关键在于构建了名为ToxicTone的全球最大公开数据集,该数据集包含详细的标注信息,能够区分不同形式和来源的毒性内容,并结合先进的多模态检测框架,整合声学、语言和情感特征,从而更有效地识别隐藏的毒性表达。
链接: https://arxiv.org/abs/2505.15773
作者: Yu-Xiang Luo,Yi-Cheng Lin,Ming-To Chuang,Jia-Hung Chen,I-Ning Tsai,Pei Xing Kiew,Yueh-Hsuan Huang,Chien-Feng Liu,Yu-Chen Chen,Bo-Han Feng,Wenze Ren,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted by INTERSPEECH 2025. 5 pages
点击查看摘要
Abstract:Despite extensive research on toxic speech detection in text, a critical gap remains in handling spoken Mandarin audio. The lack of annotated datasets that capture the unique prosodic cues and culturally specific expressions in Mandarin leaves spoken toxicity underexplored. To address this, we introduce ToxicTone – the largest public dataset of its kind – featuring detailed annotations that distinguish both forms of toxicity (e.g., profanity, bullying) and sources of toxicity (e.g., anger, sarcasm, dismissiveness). Our data, sourced from diverse real-world audio and organized into 13 topical categories, mirrors authentic communication scenarios. We also propose a multimodal detection framework that integrates acoustic, linguistic, and emotional features using state-of-the-art speech and emotion encoders. Extensive experiments show our approach outperforms text-only and baseline models, underscoring the essential role of speech-specific cues in revealing hidden toxic expressions.
zh
[NLP-182] Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information INTERSPEECH2025
【速读】: 该论文旨在解决语音预训练模型(如HuBERT)在量化过程中丢失韵律和副语言信息(如情感、重音)的问题。传统方法通过增加代码本大小来缓解信息损失,但会导致比特率效率低下。论文提出的解决方案是分割变体代码本(Segmentation-Variant Codebooks, SVCs),其核心在于将语音在不同语言单元(帧、音素、词、话语)上进行量化,从而将语音分解为多个特定于片段的离散特征流,有效保留了韵律和副语言信息。
链接: https://arxiv.org/abs/2505.15667
作者: Nicholas Sanders,Yuanchao Li,Korin Richmond,Simon King
机构: The Centre for Speech Technology Research
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to Interspeech 2025
点击查看摘要
Abstract:Quantization in SSL speech models (e.g., HuBERT) improves compression and performance in tasks like language modeling, resynthesis, and text-to-speech but often discards prosodic and paralinguistic information (e.g., emotion, prominence). While increasing codebook size mitigates some loss, it inefficiently raises bitrates. We propose Segmentation-Variant Codebooks (SVCs), which quantize speech at distinct linguistic units (frame, phone, word, utterance), factorizing it into multiple streams of segment-specific discrete features. Our results show that SVCs are significantly more effective at preserving prosodic and paralinguistic information across probing tasks. Additionally, we find that pooling before rather than after discretization better retains segment-level information. Resynthesis experiments further confirm improved style realization and slightly improved quality while preserving intelligibility.
zh
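下面用一个最小草图示意 SVC 的核心操作:先在语言单元(音素、词等)内池化,再对池化向量做最近代码本查找,这与论文"先池化后离散化保留更多片段级信息"的发现一致。代码本此处以随机矩阵代替,实际应在 HuBERT 等 SSL 特征上用 k-means 等方法学习;边界与维度均为假设。

```python
# 最小示意:分割变体代码本(SVC)的"先池化、后量化"流程(非官方实现)
import torch

def quantize_by_segments(frames: torch.Tensor, boundaries, codebook: torch.Tensor):
    """frames: [T, D] 帧级特征;boundaries: [(起, 止)] 片段边界;codebook: [K, D]。"""
    ids = []
    for s, e in boundaries:
        seg = frames[s:e].mean(dim=0)                    # 先在片段内池化
        ids.append(torch.cdist(seg[None], codebook).argmin().item())  # 最近代码本查找
    return ids

T, D, K = 100, 256, 512                 # 假设的帧数、特征维度与代码本大小
frames = torch.randn(T, D)
phone_codebook = torch.randn(K, D)      # 音素级代码本(假设)
word_codebook = torch.randn(K, D)       # 词级代码本(假设)

phone_ids = quantize_by_segments(frames, [(0, 10), (10, 25), (25, 40)], phone_codebook)
word_ids = quantize_by_segments(frames, [(0, 40)], word_codebook)
print(phone_ids, word_ids)              # 不同粒度得到各自的离散特征流
```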
[NLP-183] TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis ACL2025
【速读】: 该论文旨在解决可定制的多语言零样本歌唱语音合成(SVS)中对音素和音符边界标注的过度依赖问题,以及由此导致的零样本场景下鲁棒性不足和音素与音符间过渡质量差的问题,同时解决通过多样化提示实现多层级风格控制的有效性不足。其解决方案的关键在于引入TCSinger 2模型,该模型包含三个核心模块:1)模糊边界内容(Blurred Boundary Content, BBC)编码器,通过预测持续时间、扩展内容嵌入并应用边界掩码来实现平滑过渡;2)自定义音频编码器,利用对比学习从歌唱、语音和文本提示中提取对齐表示;3)基于流的自定义Transformer,结合Cus-MOE结构并在F0监督下提升合成质量和风格建模能力。
链接: https://arxiv.org/abs/2505.14910
作者: Yu Zhang,Wenxiang Guo,Changhao Pan,Dongyu Yao,Zhiyuan Zhu,Ziyue Jiang,Yuhan Wang,Tao Jin,Zhou Zhao
机构: Zhejiang University (浙江大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted by ACL 2025
点击查看摘要
Abstract:Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models overly depend on phoneme and note boundary annotations, limiting their robustness in zero-shot scenarios and producing poor transitions between phonemes and notes. Moreover, they also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) Blurred Boundary Content (BBC) Encoder, predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions. 2) Custom Audio Encoder, uses contrastive learning to extract aligned representations from singing, speech, and textual prompts. 3) Flow-based Custom Transformer, leverages Cus-MOE, with F0 supervision, enhancing both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks.
zh
[NLP-184] QUADS: QUAntized Distillation Framework for Efficient Speech Language Understanding
【速读】: 该论文旨在解决语音语言理解(Spoken Language Understanding, SLU)系统在资源受限环境中如何平衡性能与效率的问题。现有方法分别应用知识蒸馏和量化,但因知识蒸馏未考虑量化约束而导致压缩效果不佳。解决方案的关键在于提出QUADS框架,通过多阶段训练结合预调优模型,统一优化知识蒸馏与量化过程,从而提升在低比特场景下的适应性并保持准确性。
链接: https://arxiv.org/abs/2505.14723
作者: Subrata Biswas,Mohammad Nur Hossain Khan,Bashima Islam
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:
点击查看摘要
Abstract:Spoken Language Understanding (SLU) systems must balance performance and efficiency, particularly in resource-constrained environments. Existing methods apply distillation and quantization separately, leading to suboptimal compression as distillation ignores quantization constraints. We propose QUADS, a unified framework that optimizes both through multi-stage training with a pre-tuned model, enhancing adaptability to low-bit regimes while maintaining accuracy. QUADS achieves 71.13% accuracy on SLURP and 99.20% on FSC, with only minor degradations of up to 5.56% compared to state-of-the-art models. Additionally, it reduces computational complexity by 60–73× (GMACs) and model size by 83–700×, demonstrating strong robustness under extreme quantization. These results establish QUADS as a highly efficient solution for real-world, resource-constrained SLU applications.
zh
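QUADS 的要点是让蒸馏"看见"量化约束。下面给出一个示意性的联合目标:学生前向时对权重做伪量化(直通估计),再把蒸馏损失与任务损失加权求和。模型结构、位宽与权重系数均为假设,并非论文的原始配置。

```python
# 最小示意:量化感知的知识蒸馏目标(非官方实现)
import torch
import torch.nn.functional as F

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    return w + (q - w).detach()        # 直通估计:前向走量化值、反向梯度恒等

teacher = torch.nn.Linear(64, 10)      # 极简替身:实际为 SLU 教师/学生网络
student = torch.nn.Linear(64, 10)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 64)
y = torch.randint(0, 10, (32,))
with torch.no_grad():
    t_logits = teacher(x)

s_logits = F.linear(x, fake_quant(student.weight), student.bias)  # 量化后的学生前向
kd = F.kl_div(F.log_softmax(s_logits / 2.0, -1),
              F.softmax(t_logits / 2.0, -1), reduction="batchmean")
loss = F.cross_entropy(s_logits, y) + 0.5 * kd   # 蒸馏与量化在同一目标下联合优化
loss.backward(); opt.step()
print(float(loss))
```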
计算机视觉
[CV-0] InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition
【速读】:该论文旨在解决遥感图像中语言引导的目标识别问题,特别是针对需要高级推理的复杂或隐含查询,现有开放词汇和视觉定位方法因依赖显式类别提示而受限。其关键解决方案是提出InstructSAM框架,该框架无需训练,通过利用大视觉-语言模型解析用户指令、使用SAM2生成掩码建议,并将掩码-标签分配建模为二元整数规划问题,结合语义相似性与计数约束,实现无需置信度阈值的高效类别分配。
链接: https://arxiv.org/abs/2505.15818
作者: Yijie Zheng,Weijie Wu,Qingyun Li,Xuehui Wang,Xu Zhou,Aiai Ren,Jun Shen,Long Zhao,Guoqing Li,Xue Yang
机构: University of Chinese Academy of Sciences (中国科学院大学); Harbin Institute of Technology (哈尔滨工业大学); Shanghai Jiao Tong University (上海交通大学); University of Wollongong (卧龙岗大学); Aerospace Information Research Institute (空间信息研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Language-Guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
zh
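摘要中的"掩码-标签分配"可以建模为二元整数规划:在类别计数约束下最大化掩码与类别文本的语义相似度之和,从而无需置信度阈值。下面用 scipy 的 milp 求解一个玩具实例;相似度矩阵与计数均为随机/假设数据,并非论文的原始实现。

```python
# 最小示意:把掩码-类别分配写成二元整数规划(BIP)
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

M, C = 6, 3                          # 假设:6 个候选掩码,3 个类别
sim = np.random.rand(M, C)           # 掩码与类别提示的语义相似度(假设数据)
counts = np.array([2, 1, 2])         # LVLM 估计的各类目标数量(假设数据)

c = -sim.flatten()                   # milp 做最小化,故对相似度取负

# 约束 1:每个掩码至多分配给一个类别
A1 = np.kron(np.eye(M), np.ones((1, C)))
# 约束 2:每个类别恰好被分配 counts[c] 个掩码(计数约束)
A2 = np.kron(np.ones((1, M)), np.eye(C))

cons = [LinearConstraint(A1, 0, 1), LinearConstraint(A2, counts, counts)]
res = milp(c, constraints=cons, integrality=np.ones(M * C), bounds=Bounds(0, 1))
assign = res.x.reshape(M, C).round().astype(int)
print(assign)   # assign[m, c] = 1 表示掩码 m 归入类别 c
```

若计数估计不可靠,也可把等式约束放宽为上界约束(lb=0, ub=counts),这是一种常见的工程折中。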
[CV-1] Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM ICML2025
【速读】:该论文旨在解决大型多模态模型在处理视觉令牌时面临的计算复杂度过高的问题。传统方法主要关注于减少令牌级别的冗余,而本文则从计算层级出发,识别并研究视觉令牌的计算冗余,以确保不丢失信息。其关键解决方案是发现预训练视觉编码器生成的视觉令牌并不需要解码器仅模型中的全部复杂操作(如自注意力、前馈网络),通过适当设计可以对其进行更轻量化的处理。基于此发现,作者提出了ProxyV方法,利用代理视觉令牌来减轻原始视觉令牌的计算负担,从而在不牺牲性能的前提下提升效率,甚至在某些场景下实现性能提升。
链接: https://arxiv.org/abs/2505.15816
作者: Penghao Wu,Lewei Lu,Ziwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025
点击查看摘要
Abstract:Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive computation on visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further. The code will be made public at this https URL.
zh
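下面给出代理视觉 token 思路的一种可能形态(非官方实现):少量可学习代理先通过交叉注意力汇聚原始视觉 token,重计算只作用于代理,原始 token 走轻量路径后再取回更新。维度与结构均为示意性假设。

```python
# 最小示意:用代理 token 替原始视觉 token 承担重计算(非官方实现)
import torch
import torch.nn as nn

class ProxyBlock(nn.Module):
    def __init__(self, dim=512, num_proxy=16, heads=8):   # 维度与代理数为假设值
        super().__init__()
        self.proxy = nn.Parameter(torch.randn(1, num_proxy, dim))
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.heavy = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                   nn.Linear(4 * dim, dim))   # 重计算只作用于少量代理
        self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.light = nn.Linear(dim, dim)                      # 原始 token 的轻量路径

    def forward(self, vis):                       # vis: [B, N, D]
        p = self.proxy.expand(vis.size(0), -1, -1)
        p, _ = self.gather(p, vis, vis)            # 代理汇聚视觉信息
        p = p + self.heavy(p)
        out, _ = self.scatter(vis, p, p)           # 视觉 token 从代理取回更新
        return self.light(vis) + out

x = torch.randn(2, 576, 512)                       # 例如 24x24 个视觉 token
print(ProxyBlock()(x).shape)                       # torch.Size([2, 576, 512])
```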
[CV-2] A Taxonomy of Structure from Motion Methods
【速读】:该论文旨在对Structure from Motion (SfM)方法进行概念性综述,其核心问题是通过多图像中的点对应关系恢复场景的结构(即三维点坐标)和运动(即相机矩阵)。论文将现有方法划分为三大类,依据其关注问题中运动或结构的不同部分。解决方案的关键在于提出一种新的分类体系,以提供对现有SfM方法的新视角,并揭示问题的理论条件,这些条件决定了SfM问题是否适定,从而为未来的研究方向提供指导。
链接: https://arxiv.org/abs/2505.15814
作者: Federica Arrigoni
机构: Politecnico di Milano(米兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Structure from Motion (SfM) refers to the problem of recovering both structure (i.e., 3D coordinates of points in the scene) and motion (i.e., camera matrices) starting from point correspondences in multiple images. It has attracted significant attention over the years, counting practical reconstruction pipelines as well as theoretical results. This paper is conceived as a conceptual review of SfM methods, which are grouped into three main categories, according to which part of the problem - between motion and structure - they focus on. The proposed taxonomy brings a new perspective on existing SfM approaches as well as insights into open problems and possible future research directions. Particular emphasis is given on identifying the theoretical conditions that make SfM well posed, which depend on the problem formulation that is being considered.
zh
[CV-3] Leverag ing the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization
【速读】:该论文旨在解决基于示例的图像着色问题,即如何将参考颜色图像中的颜色准确地应用到输入灰度图像的语义相似区域中。其解决方案的关键在于利用预训练扩散模型的自注意力模块,提出了一种无需微调的方法。该方法通过双注意力引导的颜色迁移,分别计算灰度图像和颜色图像的注意力图,从而实现更精确的语义对齐;同时引入无分类器的颜色化引导,通过结合颜色迁移与非颜色迁移输出来提升着色质量。
链接: https://arxiv.org/abs/2505.15812
作者: Satoshi Kosugi
机构: Institute of Integrated Research, Institute of Science Tokyo (东京科学大学综合研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
点击查看摘要
Abstract:Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image is then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference). Our source code is available at this https URL.
zh
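注意力引导的颜色迁移可概括为:以输入灰度特征为 query、参考图特征为 key/value,按语义相似度把参考颜色特征搬运到输入的对应区域。下面是与此对应的一个最小草图;特征此处随机生成,论文中它们取自预训练扩散模型的自注意力模块,且论文的双注意力会对灰度与彩色图分别计算注意力图,这里只演示单路形态。

```python
# 最小示意:按语义相似度搬运参考图的颜色特征(非官方实现)
import torch
import torch.nn.functional as F

def attention_color_transfer(gray_feat, ref_gray_feat, ref_color_feat):
    """gray_feat: [N, D] 输入灰度特征;ref_*: [M, D] 参考图特征。"""
    d = gray_feat.size(-1)
    attn = F.softmax(gray_feat @ ref_gray_feat.t() / d ** 0.5, dim=-1)  # [N, M] 语义对应
    return attn @ ref_color_feat        # 按对应关系加权聚合参考颜色特征

N, M, D = 1024, 1024, 320               # 假设的 token 数与特征维度
gray = torch.randn(N, D)
ref_gray = torch.randn(M, D)            # 参考图的灰度化特征
ref_color = torch.randn(M, D)           # 参考图的彩色特征
print(attention_color_transfer(gray, ref_gray, ref_color).shape)  # [1024, 320]
```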
[CV-4] MMaDA: Multimodal Large Diffusion Language Models
【速读】:该论文旨在解决多模态扩散基础模型在跨领域任务中性能不足的问题,特别是如何实现文本推理、多模态理解和文本到图像生成等任务的统一建模与优化。其解决方案的关键在于提出MMaDA,该模型采用统一的扩散架构(unified diffusion architecture),具备模态无关的设计,消除了对特定模态组件的依赖;同时引入混合长链式思维(mixed long chain-of-thought, CoT)微调策略,实现跨模态的统一CoT格式对齐,提升冷启动阶段的强化学习能力;此外,还提出了针对扩散基础模型的统一策略梯度强化学习算法UniGRPO,通过多样化奖励建模实现推理与生成任务的后训练一致性优化。这些创新有效提升了模型的泛化能力和跨任务表现。
链接: https://arxiv.org/abs/2505.15809
作者: Ling Yang,Ye Tian,Bowen Li,Xinchen Zhang,Ke Shen,Yunhai Tong,Mengdi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project: this https URL
点击查看摘要
Abstract:We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model’s ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA’s effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: this https URL
zh
[CV-5] STAR-R1: Spacial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在空间推理能力上与人类存在显著差距的问题,特别是在跨视角图像中识别物体变换的挑战性任务——Transformation-Driven Visual Reasoning (TVR)。传统监督微调(Supervised Fine-Tuning, SFT)在跨视角设置下无法生成连贯的推理路径,而稀疏奖励强化学习(Reinforcement Learning, RL)则面临探索效率低和收敛速度慢的问题。该研究提出的解决方案是STAR-R1框架,其关键在于整合单阶段强化学习范式与细粒度奖励机制,通过奖励部分正确性、惩罚过度枚举和被动不作为,实现高效的探索与精确的推理。
链接: https://arxiv.org/abs/2505.15804
作者: Zongzhao Li,Zongyang Ma,Mingze Li,Songyou Li,Yu Rong,Tingyang Xu,Ziqi Zhang,Deli Zhao,Wenbing Huang
机构: Gaoling School of Artificial Intelligence, Renmin University of China (高瓴人工智能学院,中国人民大学); MAIS, Institute of Automation, Chinese Academy of Sciences (中科院自动化所MAIS); DAMO Academy, Alibaba Group, Hangzhou, China (阿里达摩院,阿里巴巴集团,杭州); Hupan Lab, Hangzhou, China (湖畔实验室,杭州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, yet they lag significantly behind humans in spatial reasoning. We investigate this gap through Transformation-Driven Visual Reasoning (TVR), a challenging task requiring identification of object transformations across images under varying viewpoints. While traditional Supervised Fine-Tuning (SFT) fails to generate coherent reasoning paths in cross-view settings, sparse-reward Reinforcement Learning (RL) suffers from inefficient exploration and slow convergence. To address these limitations, we propose STAR-R1, a novel framework that integrates a single-stage RL paradigm with a fine-grained reward mechanism tailored for TVR. Specifically, STAR-R1 rewards partial correctness while penalizing excessive enumeration and passive inaction, enabling efficient exploration and precise reasoning. Comprehensive evaluations demonstrate that STAR-R1 achieves state-of-the-art performance across all 11 metrics, outperforming SFT by 23% in cross-view scenarios. Further analysis reveals STAR-R1’s anthropomorphic behavior and highlights its unique ability to compare all objects for improving spatial reasoning. Our work provides critical insights in advancing the research of MLLMs and reasoning models. The codes, model weights, and data will be publicly available at this https URL.
zh
[CV-6] Interspatial Attention for Efficient 4D Human Video Generation
【速读】:该论文旨在解决在可控条件下生成高质量、具有运动一致性和身份保真度的数字人类逼真视频的问题。现有方法在生成个体或多个数字人类时,普遍存在质量不佳、一致性不足和身份丢失等问题。论文的关键解决方案是引入一种新的跨空间注意力(interspatial attention, ISA)机制,作为现代扩散Transformer(DiT)视频生成模型的可扩展构建块,该机制采用针对人类视频生成优化的相对位置编码,并结合自定义开发的视频变分自编码器,在大规模视频数据集上训练基于潜在ISA的扩散模型,从而实现了4D人类视频合成的最先进性能。
链接: https://arxiv.org/abs/2505.15800
作者: Ruizhi Shao,Yinghao Xu,Yujun Shen,Ceyuan Yang,Yang Zheng,Changan Chen,Yebin Liu,Gordon Wetzstein
机构: Tsinghua University (清华大学); Stanford University (斯坦福大学); Ant Research (蚂蚁集团); ByteDance Inc. (字节跳动公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)–based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variational autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at this https URL.
zh
[CV-7] VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL
【速读】:该论文试图解决预训练扩散模型在微调过程中难以实现稳定、高效训练以及支持非可微奖励的问题,同时解决稀疏奖励导致的中间步骤监督不足进而影响生成质量的问题。其解决方案的关键在于提出一种名为VARD(Value-based Reinforced Diffusion)的方法,该方法首先学习一个价值函数以预测中间状态的奖励期望,随后利用该价值函数结合KL正则化在整个生成过程中提供密集且可微的监督信号,从而实现对预训练模型的有效微调。
链接: https://arxiv.org/abs/2505.15791
作者: Fengyuan Dai,Zifeng Zhuang,Yufei Huang,Siteng Huang,Bangyan Liao,Donglin Wang,Fajie Yuan
机构: Zhejiang University (浙江大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review
点击查看摘要
Abstract:Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution, current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting the expectation of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.
zh
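VARD 的稠密监督可以概括为:最大化价值网络对中间状态的奖励预测,同时用与预训练模型的 KL 正则保持接近原分布。下面的草图用噪声预测的 MSE 近似该 KL(高斯假设下二者成正比);网络均为极简替身,价值函数假定已按论文先行学习好并冻结,系数为假设值。

```python
# 最小示意:价值函数 + KL 正则的稠密微调信号(非官方实现)
import torch
import torch.nn as nn

D = 64
value_net = nn.Sequential(nn.Linear(D, 128), nn.SiLU(), nn.Linear(128, 1))
for p in value_net.parameters():
    p.requires_grad_(False)          # 价值函数假定已先行学习,此处冻结
policy = nn.Linear(D, D)             # 被微调的去噪网络(极简替身)
pretrained = nn.Linear(D, D)         # 冻结的预训练去噪网络
for p in pretrained.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
x_t = torch.randn(32, D)                         # 某一中间去噪状态

eps_new = policy(x_t)
with torch.no_grad():
    eps_old = pretrained(x_t)

x_next = x_t - 0.1 * eps_new                     # 示意性的单步更新
value = value_net(x_next).mean()                 # 中间状态的期望奖励(可微)
kl = ((eps_new - eps_old) ** 2).mean()           # 高斯假设下 KL 正比于均值差平方
loss = -value + 0.1 * kl                         # 整个生成过程都有稠密、可微的信号
loss.backward(); opt.step()
print(float(loss))
```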
[CV-8] IA-T2I: Internet-Augmented Text-to-Image Generation
【速读】:该论文试图解决文本到图像生成模型在处理文本提示中隐含知识不确定场景时表现不佳的问题,例如在电影上映前无法准确生成符合未来角色设计和风格的海报。解决方案的关键在于提出一种互联网增强的文本到图像生成框架(Internet-Augmented text-to-image generation, IA-T2I),通过提供参考图像来明确模型对不确定知识的理解,其核心组件包括主动检索模块、分层图像选择模块以及自反思机制,以提升生成结果与文本提示的一致性。
链接: https://arxiv.org/abs/2505.15779
作者: Chuanhao Li,Jianwen Sun,Yukang Feng,Mingliang Zhai,Yifan Chang,Kaipeng Zhang
机构: Shanghai AI Laboratory (上海人工智能实验室); Nankai University (南开大学); Beijing Institute of Technology (北京理工大学); University of Science and Technology of China (中国科学技术大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, a framework that integrates reference images from the Internet into T2I/TI2I models
点击查看摘要
Abstract:Current text-to-image (T2I) generation models achieve promising results, but they fail on the scenarios where the knowledge implied in the text prompt is uncertain. For example, a T2I model released in February would struggle to generate a suitable poster for a movie premiering in April, because the character designs and styles are uncertain to the model. To solve this problem, we propose an Internet-Augmented text-to-image generation (IA-T2I) framework to compel T2I models clear about such uncertain knowledge by providing them with reference images. Specifically, an active retrieval module is designed to determine whether a reference image is needed based on the given text prompt; a hierarchical image selection module is introduced to find the most suitable image returned by an image search engine to enhance the T2I model; a self-reflection mechanism is presented to continuously evaluate and refine the generated image to ensure faithful alignment with the text prompt. To evaluate the proposed framework’s performance, we collect a dataset named Img-Ref-T2I, where text prompts include three types of uncertain knowledge: (1) known but rare. (2) unknown. (3) ambiguous. Moreover, we carefully craft a complex prompt to guide GPT-4o in making preference evaluation, which has been shown to have an evaluation accuracy similar to that of human preference evaluation. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4o by about 30% in human evaluation.
zh
[CV-9] Constructing a 3D Town from a Single Image
【速读】:该论文旨在解决从单张俯视图像生成高质量、几何一致且空间连贯的复杂3D场景的问题,这一任务在传统方法中通常需要昂贵设备、多视角数据或人工建模。其解决方案的关键在于提出一种无需训练的框架——3DTown,该框架基于两个核心原则:基于区域的生成以提升图像到3D的对齐度与分辨率,以及空间感知的3D修补以确保全局场景的一致性与高质量几何生成。通过将输入图像分解为重叠区域并利用预训练的3D物体生成器进行逐区域生成,再结合掩码校正流修补过程来填充缺失几何并保持结构连续性,该方法有效克服了分辨率瓶颈并保留了空间结构。
链接: https://arxiv.org/abs/2505.15765
作者: Kaizhi Zheng,Ruijian Zhang,Jing Gu,Jie Yang,Xin Eric Wang
机构: UC Santa Cruz (加州大学圣克鲁兹分校); Columbia University (哥伦比亚大学); Cybever AI (Cybever AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
zh
[CV-10] Exploring The Visual Feature Space for Multimodal Neural Decoding
【速读】:该论文试图解决现有研究在脑信号解码中仅能提供粗粒度解释,缺乏对物体描述、位置、属性及其关系等关键细节的问题,导致视觉解码时的重建结果不精确且模糊。其解决方案的关键在于分析预训练视觉组件中不同的视觉特征空间,并引入一种与多模态大语言模型(MLLMs)交互的零样本多模态脑解码方法,以实现多粒度层面的解码。
链接: https://arxiv.org/abs/2505.15755
作者: Weihao Xia,Cengiz Oztireli
机构: University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project: this https URL
点击查看摘要
Abstract:The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularities. To assess a model’s ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at this https URL.
zh
[CV-11] RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction BMVC2025
【速读】:该论文旨在解决水下高保真场景重建的问题,这一任务因水体中的光吸收、散射及能见度受限而具有挑战性。其解决方案的关键在于提出一种基于高斯泼溅(Gaussian Splatting)的增强框架,通过解耦学习RGB通道并结合水下衰减的物理特性实现更精确的颜色恢复;同时引入一种带有自适应加权策略的帧插值方法以应对稀疏视角限制并提升视图一致性,并设计了一种新的损失函数以在降噪的同时保留边缘信息。此外,论文还发布了专门采集于深海环境的数据集Submerged3D,实验结果表明该框架在PSNR指标上相比现有最先进方法提升了高达1.90dB,展现出更优的感知质量和鲁棒性。
链接: https://arxiv.org/abs/2505.15737
作者: Zhuodong Jiang,Haoran Wang,Guoxi Huang,Brett Seymour,Nantheera Anantrasirichai
机构: University of Bristol (布里斯托大学); National Park Service (国家公园服务局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures. Submitted to BMVC 2025
点击查看摘要
Abstract:Reconstructing high-fidelity underwater scenes remains a challenging task due to light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and geometric accuracy of deep underwater rendering. We propose decoupled learning for RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. Additionally, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods with PSNR gains up to 1.90dB, delivering superior perceptual quality and robustness, and offering promising directions for marine robotics and underwater visual analytics.
zh
[CV-12] HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning
【速读】:该论文旨在解决自主驾驶系统中运动预测面临的挑战,即如何准确预测周围交通参与者未来的轨迹。现有方法在从历史代理轨迹和道路布局中提取场景上下文特征时,存在信息退化的问题。解决方案的关键在于提出一种名为HAMF的新运动预测框架,该框架通过联合学习场景上下文编码与未来运动表示,以一致的方式结合场景理解和未来运动状态预测。其核心创新在于设计了一个统一的基于注意力的编码器,协同融合自注意力和交叉注意力机制,以联合建模场景上下文信息并聚合未来运动特征,同时在解码阶段引入Mamba模块以保持学习到的未来运动表示的一致性和相关性,从而生成准确且多样的最终轨迹。
链接: https://arxiv.org/abs/2505.15703
作者: Xiaodong Mei,Sheng Wang,Jie Cheng,Yingbing Chen,Dan Xu
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: In submission
点击查看摘要
Abstract:Motion forecasting represents a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents’ future trajectories. While existing approaches predict future motion states with the extracted scene context feature from historical agent trajectories and road layouts, they suffer from the information degradation during the scene feature encoding. To address the limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations with the scene context encoding jointly, to coherently combine the scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. Then we design a unified Attention-based encoder, which synergistically combines self-attention and cross-attention mechanisms to model the scene context information and aggregate future motion features jointly. Complementing the encoder, we implement the Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations, to generate the accurate and diverse final trajectories. Extensive experiments on Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with the simple and lightweight architecture.
zh
[CV-13] Discovering Pathology Rationale and Token Allocation for Efficient Multimodal Pathology Reasoning
【速读】:该论文旨在解决多模态病理图像理解中推理能力不足以及计算负担过重的问题(Multimodal pathological image understanding),这些问题限制了其在复杂诊断场景中的应用和实际部署。论文提出的解决方案的关键在于引入一种新型的双分支强化学习框架,其中一支通过从标签直接学习任务特定的决策过程(即病理推理)来增强模型的推理能力,另一支则根据图像的视觉内容和任务上下文动态分配适量的标记以优化计算效率。
链接: https://arxiv.org/abs/2505.15687
作者: Zhe Xu,Cheng Jin,Yihui Wang,Ziyi Liu,Hao Chen
机构: HKUST, Hong Kong SAR, China (香港科技大学,香港特别行政区,中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal pathological image understanding has garnered widespread interest due to its potential to improve diagnostic accuracy and enable personalized treatment through integrated visual and textual data. However, existing methods exhibit limited reasoning capabilities, which hamper their ability to handle complex diagnostic scenarios. Additionally, the enormous size of pathological images leads to severe computational burdens, further restricting their practical deployment. To address these limitations, we introduce a novel bilateral reinforcement learning framework comprising two synergistic branches. One reinforcement branch enhances the reasoning capability by enabling the model to learn task-specific decision processes, i.e., pathology rationales, directly from labels without explicit reasoning supervision. While the other branch dynamically allocates a tailored number of tokens to different images based on both their visual content and task context, thereby optimizing computational efficiency. We apply our method to various pathological tasks such as visual question answering, cancer subtyping, and lesion detection. Extensive experiments show an average +41.7 absolute performance improvement with 70.3% lower inference costs over the base models, achieving both reasoning accuracy and computational efficiency.
zh
[CV-14] Enhancing Monte Carlo Dropout Performance for Uncertainty Quantification
【速读】:该论文旨在解决深度神经网络输出不确定性量化不准确的问题,尤其是在高风险领域如医学诊断和自主系统中,这对做出可信决策至关重要。传统蒙特卡洛Dropout(MCD)方法虽然易于集成到各种深度架构中,但往往难以提供校准良好的不确定性估计。该研究的关键解决方案是通过引入灰狼优化器(GWO)、贝叶斯优化(BO)和粒子群优化(PSO)等不同的搜索策略,并结合一种不确定性感知的损失函数,以增强MCD的性能,从而提高不确定性量化的可靠性。
链接: https://arxiv.org/abs/2505.15671
作者: Hamzeh Asgharnezhad,Afshar Shamsi,Roohallah Alizadehsani,Arash Mohammadi,Hamid Alinejad-Rokny
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 tables, 7 figures
点击查看摘要
Abstract:Knowing the uncertainty associated with the output of a deep neural network is of paramount importance in making trustworthy decisions, particularly in high-stakes fields like medical diagnosis and autonomous systems. Monte Carlo Dropout (MCD) is a widely used method for uncertainty quantification, as it can be easily integrated into various deep architectures. However, conventional MCD often struggles with providing well-calibrated uncertainty estimates. To address this, we introduce innovative frameworks that enhance MCD by integrating different search solutions, namely Grey Wolf Optimizer (GWO), Bayesian Optimization (BO), and Particle Swarm Optimization (PSO), as well as an uncertainty-aware loss function, thereby improving the reliability of uncertainty quantification. We conduct comprehensive experiments using different backbones, namely DenseNet121, ResNet50, and VGG16, on various datasets, including Cats vs. Dogs, Myocarditis, Wisconsin, and a synthetic dataset (Circles). Our proposed algorithm outperforms the MCD baseline by 2-3% on average in terms of both conventional accuracy and uncertainty accuracy while achieving significantly better calibration. These results highlight the potential of our approach to enhance the trustworthiness of deep learning models in safety-critical applications.
zh
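作为背景,MCD 基线本身非常简单:推理时保持 dropout 激活,做 T 次随机前向,用预测均值与预测熵量化不确定性;论文在此之上再用 GWO/BO/PSO 搜索配置并引入不确定性感知损失。下面只演示 MCD 基线(模型与数据为虚构):

```python
# 最小示意:蒙特卡洛 Dropout 的不确定性量化
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                      nn.Dropout(p=0.3), nn.Linear(64, 3))

def mc_dropout_predict(model, x, T=50):
    model.train()                      # 关键:推理时也让 Dropout 生效
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), -1) for _ in range(T)])
    mean = probs.mean(0)                                     # 预测均值
    entropy = -(mean * mean.clamp_min(1e-9).log()).sum(-1)   # 预测熵作为不确定性
    return mean, entropy

x = torch.randn(8, 16)
mean, unc = mc_dropout_predict(model, x)
print(mean.argmax(-1), unc)
```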
[CV-15] Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在开放世界场景中对未见任务的泛化能力不足的问题。现有VLA模型的跨任务泛化能力尚未得到充分探索,尽管它们在多样化数据集上进行训练,但仍难以有效适应未见过的任务。论文提出的解决方案关键在于提出一种名为跨任务上下文操作(Cross-Task In-Context Manipulation, X-ICM)的方法,该方法通过利用已见任务中的上下文示例来条件化大型语言模型(Large Language Models, LLMs),从而预测未见任务的动作序列,并结合动态引导的样本选择策略以提升泛化性能。
链接: https://arxiv.org/abs/2505.15660
作者: Jiaming Zhou,Ke Ye,Jiayi Liu,Teli Ma,Zifang Wang,Ronghe Qiu,Kun-Yu Lin,Zhilin Zhao,Junwei Liang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
zh
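X-ICM 的样例选择环节可以抽象为:按动态特征的余弦相似度,从已见任务示范中检索与新任务最相关的若干条并拼入提示。下面是该环节的一个最小草图;动态嵌入以随机向量代替,实际应由动态编码模块从轨迹中提取,示范文本格式亦为假设:

```python
# 最小示意:动态引导的上下文示范检索(非官方实现)
import numpy as np

def select_demos(query_feat, demo_feats, demo_texts, k=3):
    q = query_feat / np.linalg.norm(query_feat)
    d = demo_feats / np.linalg.norm(demo_feats, axis=1, keepdims=True)
    top = np.argsort(d @ q)[::-1][:k]          # 余弦相似度最高的 k 条示范
    return [demo_texts[i] for i in top]

demo_feats = np.random.randn(100, 64)          # 已见任务示范的动态嵌入(假设)
demo_texts = [f"demo_{i}: <obs> -> <action seq>" for i in range(100)]
query = np.random.randn(64)                    # 未见任务的动态嵌入(假设)

prompt = "\n".join(select_demos(query, demo_feats, demo_texts)) \
         + "\nNew task: <obs> -> ?"            # 拼成 LLM 的上下文提示
print(prompt)
```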
[CV-16] he Devil is in Fine-tuning and Long-tailed Problems:A New Benchmark for Scene Text Detection IJCAI2025
【速读】:该论文旨在解决场景文本检测模型在学术基准上表现优异但在实际应用场景中性能下降的问题。其关键解决方案是提出一种联合数据集学习(Joint-Dataset Learning, JDL)协议,以缓解因数据集特定优化(Dataset-Specific Optimization, DSO)导致的微调差距(Fine-tuning Gap),从而提升模型的泛化能力。同时,论文还构建了长尾基准(Long-Tailed Benchmark, LTB),并引入MAEDet作为处理长尾场景文本挑战的强基线方法。
链接: https://arxiv.org/abs/2505.15649
作者: Tianjiao Cao,Jiahao Lyu,Weichao Zeng,Weimin Mu,Yu Zhou
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); VCIP & TMCC & DISSec, College of Computer Science, Nankai University (南开大学计算机学院); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI2025
点击查看摘要
Abstract:Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a Fine-tuning Gap, where models leverage the Dataset-Specific Optimization (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performances on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a Joint-Dataset Learning (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of the ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at this https URL.
zh
[CV-17] Frag Fake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models
【速读】:该论文旨在解决细粒度图像编辑检测问题,即在图像中定位局部编辑区域,以评估内容的真实性。当前该领域面临三大挑战:二分类器仅提供全局真实或虚假标签、传统计算机视觉方法依赖昂贵的像素级标注以及缺乏大规模高质量的现代图像编辑检测数据集。论文的关键解决方案是构建了一个自动化数据生成管道,创建了FragFake数据集,这是首个专注于编辑图像检测的基准数据集,并首次将视觉语言模型(Vision Language Models, VLMs)应用于编辑图像分类和编辑区域定位任务,显著提升了检测性能。
链接: https://arxiv.org/abs/2505.15644
作者: Zhen Sun,Ziyi Zhang,Zeren Luo,Zeyang Sha,Tianshuo Cong,Zheng Li,Shiwen Cui,Weiqiang Wang,Jiaheng Wei,Xinlei He,Qi Li,Qian Wang
机构: Hong Kong University of Science and Technology (Guangzhou); Ant Group; Tsinghua University; Shandong University; Wuhan University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 14pages,15 figures
点击查看摘要
Abstract:Fine-grained edited image detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision methods often rely on costly pixel-level annotations; and (3) No large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.
zh
[CV-18] Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset Deep Learning Baselines and Benchmarking
【速读】:该论文试图解决正畸学中错颌畸形(malocclusion)的准确定位与诊断问题,其核心挑战在于错颌畸形表现复杂且临床表现多样,而当前牙科图像分析领域缺乏大规模、精确标注的数据集,这限制了自动化诊断技术的发展,并导致临床实践中诊断准确性与效率不足。解决方案的关键是提出一个名为口腔颌面自然图像(Oral and Maxillofacial Natural Images, OMNI)的新型全面牙科图像数据集,该数据集包含4166张多视角图像,由专业牙医标注,且通过多种深度学习方法进行了全面验证,旨在推动错颌畸形自动诊断研究并为该领域提供新的基准。
链接: https://arxiv.org/abs/2505.15637
作者: Pujun Xue,Junyi Ge,Xiaotong Jiang,Siyang Song,Zijian Wu,Yupeng Huo,Weicheng Xie,Linlin Shen,Xiaoqin Zhou,Xiaofeng Liu,Min Gu
机构: Hohai University(河海大学); Soochow University(苏州大学); Department of Stomatology, The Third Affiliated Hospital of Soochow University(苏州大学附属第三医院口腔科); The First People’s Hospital of Changzhou(常州市第一人民医院); College of Information Science and Engineering, Hohai University(河海大学信息科学与工程学院); HBUG lab, Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系HBUG实验室); Nanjing University of Science and Technology(南京理工大学); Affect AI(情感人工智能); Shenzhen University(深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Malocclusion is a major challenge in orthodontics, and its complex presentation and diverse clinical manifestations make accurate localization and diagnosis particularly important. Currently, one of the major shortcomings facing the field of dental image analysis is the lack of large-scale, accurately labeled datasets dedicated to malocclusion issues, which limits the development of automated diagnostics in the field of dentistry and leads to a lack of diagnostic accuracy and efficiency in clinical practice. Therefore, in this study, we propose the Oral and Maxillofacial Natural Images (OMNI) dataset, a novel and comprehensive dental image dataset aimed at advancing the study of analyzing dental images for issues of malocclusion. Specifically, the dataset contains 4166 multi-view images with 384 participants in data collection and annotated by professional dentists. In addition, we performed a comprehensive validation of the created OMNI dataset, including three CNN-based methods, two Transformer-based methods, and one GNN-based method, and conducted automated diagnostic experiments for malocclusion issues. The experimental results show that the OMNI dataset can facilitate the automated diagnosis research of malocclusion issues and provide a new benchmark for the research in this field. Our OMNI dataset and baseline code are publicly available at this https URL.
zh
[CV-19] SNAP: A Benchmark for Testing the Effects of Capture Conditions on Fundamental Vision Tasks
【速读】:该论文试图解决深度学习(Deep Learning, DL)计算机视觉算法在面对各种图像扰动时泛化能力不足的问题,尤其是针对图像采集条件(如相机参数和光照)对模型性能的影响缺乏系统研究。其解决方案的关键在于评估常见视觉数据集中的采集偏差,并构建一个新的基准数据集SNAP(Shutter speed, ISO sensitivity, and Aperture),该数据集包含在受控光照条件下、且相机设置密集采样的图像,从而更全面地分析不同采集条件对图像分类、目标检测和视觉问答(Visual Question Answering, VQA)等任务的影响。
链接: https://arxiv.org/abs/2505.15628
作者: Iuliia Kotseruba,John K. Tsotsos
机构: York University (约克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generalization of deep-learning-based (DL) computer vision algorithms to various image perturbations is hard to establish and remains an active area of research. The majority of past analyses focused on the images already captured, whereas effects of the image formation pipeline and environment are less studied. In this paper, we address this issue by analyzing the impact of capture conditions, such as camera parameters and lighting, on DL model performance on 3 vision tasks – image classification, object detection, and visual question answering (VQA). To this end, we assess capture bias in common vision datasets and create a new benchmark, SNAP (for Shutter speed, ISO seNsitivity, and APerture), consisting of images of objects taken under controlled lighting conditions and with densely sampled camera settings. We then evaluate a large number of DL vision models and show the effects of capture conditions on each selected vision task. Lastly, we conduct an experiment to establish a human baseline for the VQA task. Our results show that computer vision datasets are significantly biased, the models trained on this data do not reach human accuracy even on the well-exposed images, and are susceptible to both major exposure changes and minute variations of camera settings. Code and data can be found at this https URL
zh
[CV-20] LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理复杂和现实场景时推理能力受限的问题。现有基准测试通常以任务导向的方式构建,未能保证不同任务样本来自同一数据分布,因此难以评估低级感知能力对高级推理的协同作用。为克服这一限制,本文提出了Lens,一个包含3.4K张当代图像和60K+人类撰写的问答数据的多层级基准,覆盖八个任务和十二种日常场景,形成感知、理解和推理三个递进的任务层级。其关键在于每张图像均配备所有任务的丰富标注,从而支持评估MLLMs处理图像不变提示的能力,从基础感知到组合推理。
链接: https://arxiv.org/abs/2505.15616
作者: Ruilin Yao,Bo Zhang,Jirui Huang,Xinwei Long,Yifang Zhang,Tianyu Zou,Yufei Wu,Shichao Su,Yifan Xu,Wenxi Zeng,Zhaoyu Yang,Guoyou Li,Shilan Zhang,Zichan Li,Yaxiong Chen,Shengwu Xiong,Peng Xu,Jiajun Zhang,Bowen Zhou,David Clifton,Luc Van Gool
机构: Wuhan University of Technology (武汉理工大学); Tsinghua University (清华大学); Institute of Automation Chinese Academy of Sciences (中国科学院自动化研究所); Shanghai AI Lab (上海人工智能实验室); University of Oxford (牛津大学); INSAIT, Sofia Un. St Kliment Ohridski (索菲亚大学克里门特·奥赫里德斯基研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in a task-oriented manner without guaranteeing that different task samples come from the same data distribution, thus they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports evaluating MLLMs on handling image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manually collected from social media, of which 53% were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models were released later than Dec. 2024, and none of them achieve an accuracy greater than 60% in the reasoning tasks. Project page: this https URL. ICCV 2025 workshop page: this https URL
zh
[CV-21] A Methodology to Evaluate Strategies Predicting Rankings on Unseen Domains
【速读】:该论文试图解决在不同领域中预测哪些实体(如方法、算法等)将表现最佳的问题,而无需进行新且昂贵的评估。其核心挑战在于如何基于已知领域的评估结果,推断出在新领域中的最优实体。该论文提出了一种原创的方法论,采用留一域-out(leave-one-domain-out)的方式,针对特定应用偏好进行分析,从而实现对跨领域性能的预测。其解决方案的关键在于通过已有领域的评估数据构建模型,以推断未知领域的性能排名。
链接: https://arxiv.org/abs/2505.15595
作者: Sébastien Piérard,Adrien Deliège,Anaïs Halin,Marc Van Droogenbroeck
机构: 未知
类目: Performance (cs.PF); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Frequently, multiple entities (methods, algorithms, procedures, solutions, etc.) can be developed for a common task and applied across various domains that differ in the distribution of scenarios encountered. For example, in computer vision, the input data provided to image analysis methods depend on the type of sensor used, its location, and the scene content. However, a crucial difficulty remains: can we predict which entities will perform best in a new domain based on assessments on known domains, without having to carry out new and costly evaluations? This paper presents an original methodology to address this question, in a leave-one-domain-out fashion, for various application-specific preferences. We illustrate its use with 30 strategies to predict the rankings of 40 entities (unsupervised background subtraction methods) on 53 domains (videos).
zh
[CV-22] Beyond Classification: Evaluating Diffusion Denoised Smoothing for Security-Utility Trade off
【速读】:该论文试图解决基础模型在面对对抗性输入时的脆弱性问题,尤其是针对扩散去噪平滑(Diffusion Denoised Smoothing)方法在提升模型鲁棒性方面的有效性与局限性进行深入分析。解决方案的关键在于利用预训练的扩散模型对输入进行预处理,以增强模型在不同下游任务中的抗攻击能力,但研究发现该方法在高噪声设置下会导致性能显著下降,而在低噪声设置下则无法全面防御所有类型的攻击,从而揭示了对抗鲁棒性与模型性能之间的权衡问题。
链接: https://arxiv.org/abs/2505.15594
作者: Yury Belousov,Brian Pulfer,Vitaliy Kinakh,Slava Voloshynovskiy
机构: University of Geneva (日内瓦大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted at the 33rd European Signal Processing Conference (EUSIPCO 2025)
点击查看摘要
Abstract:While foundation models demonstrate impressive performance across various tasks, they remain vulnerable to adversarial inputs. Current research explores various approaches to enhance model robustness, with Diffusion Denoised Smoothing emerging as a particularly promising technique. This method employs a pretrained diffusion model to preprocess inputs before model inference. Yet, its effectiveness remains largely unexplored beyond classification. We aim to address this gap by analyzing three datasets with four distinct downstream tasks under three different adversarial attack algorithms. Our findings reveal that while foundation models maintain resilience against conventional transformations, applying high-noise diffusion denoising to clean images without any distortions significantly degrades performance by as much as 57%. Low-noise diffusion settings preserve performance but fail to provide adequate protection across all attack types. Moreover, we introduce a novel attack strategy specifically targeting the diffusion process itself, capable of circumventing defenses in the low-noise regime. Our results suggest that the trade-off between adversarial robustness and performance remains a challenge to be addressed.
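为帮助理解,下面给出扩散去噪平滑推理过程的一个简化示意(非官方实现;denoiser 与 classifier 均为占位的可调用模块,denoiser(noisy, sigma) 的接口纯属假设):

```python
import torch

@torch.no_grad()
def denoised_smoothing_predict(x, denoiser, classifier, sigma=0.25, n=8):
    """对输入加入 sigma 级别的高斯噪声, 先用预训练扩散模型去噪再分类,
    重复 n 次后做多数投票, 得到平滑后的预测。"""
    votes = []
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)    # 注入高斯噪声
        denoised = denoiser(noisy, sigma)          # 扩散模型做去噪预处理(接口为假设)
        votes.append(classifier(denoised).argmax(dim=-1))
    return torch.stack(votes).mode(dim=0).values   # 多数投票作为最终预测
```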
zh
[CV-23] VP Lab: a PEFT-Enabled Visual Prompting Laboratory for Semantic Segmentation
【速读】:该论文旨在解决在特定技术领域中,大规模预训练视觉主干模型因视觉特征与训练分布差异较大而性能下降的问题。其解决方案的关键在于提出VP Lab框架,其中核心是E-PEFT,一种将多种参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术进行集成、以参数和数据双重高效的方式将视觉提示管道适配到特定领域的方法,能够在少量标注数据下实现有效适应,从而提升语义分割的性能。
链接: https://arxiv.org/abs/2505.15592
作者: Niccolo Avogaro,Thomas Frick,Yagmur G. Cinar,Daniel Caraballo,Cezary Skura,Filip M. Janicki,Piotr Kluska,Brown Ebouky,Nicola Farronato,Florian Scheidegger,Cristiano Malossi,Konrad Schindler,Andrea Bartezzaghi,Roy Assaf,Mattia Rigotti
机构: IBM Research(IBM研究院); ETH Zürich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large-scale pretrained vision backbones have transformed computer vision by providing powerful feature extractors that enable various downstream tasks, including training-free approaches like visual prompting for semantic segmentation. Despite their success in generic scenarios, these models often fall short when applied to specialized technical domains where the visual features differ significantly from their training distribution. To bridge this gap, we introduce VP Lab, a comprehensive iterative framework that enhances visual prompting for robust segmentation model development. At the core of VP Lab lies E-PEFT, a novel ensemble of parameter-efficient fine-tuning techniques specifically designed to adapt our visual prompting pipeline to specific domains in a manner that is both parameter- and data-efficient. Our approach not only surpasses the state-of-the-art in parameter-efficient fine-tuning for the Segment Anything Model (SAM), but also facilitates an interactive, near-real-time loop, allowing users to observe progressively improving results as they experiment within the framework. By integrating E-PEFT with visual prompting, we demonstrate a remarkable 50% increase in semantic segmentation mIoU performance across various technical datasets using only 5 validated images, establishing a new paradigm for fast, efficient, and interactive model deployment in new, challenging domains. This work comes in the form of a demonstration.
zh
[CV-24] UWSAM: Segment Anything Model Guided Underwater Instance Segmentation and A Large-scale Benchmark Dataset
【速读】:该论文旨在解决现有Segment Anything Model (SAM)及其变体在水下实例分割任务中性能受限的问题,主要原因是缺乏水下领域专业知识以及较高的计算需求。其解决方案的关键在于提出一个大规模的水下实例分割数据集UIIS10K,并设计了UWSAM模型,该模型通过基于Mask GAT的水下知识蒸馏(MG-UKD)方法,将SAM ViT-Huge图像编码器的知识高效地迁移到更小的ViT-Small图像编码器中,同时引入端到端水下提示生成器(EUPG),自动生成水下提示以实现准确的实例定位与分割。
链接: https://arxiv.org/abs/2505.15581
作者: Hua Li,Shijie Lian,Zhiyuan Li,Runmin Cong,Sam Kwong
机构: Hainan University (海南大学); Huazhong University of Science and Technology (华中科技大学); Shandong University (山东大学); Lingnan University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:With recent breakthroughs in large-scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end-to-end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT-Huge image encoder into the smaller ViT-Small image encoder via the Mask GAT-based Underwater Knowledge Distillation (MG-UKD) method for effective visual representation learning. Furthermore, we design an End-to-end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets. Datasets and codes are available at this https URL.
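下面给出"大编码器向小编码器蒸馏"这一思想的简化示意(略去论文中 Mask GAT 的细节,仅保留特征对齐;proj 投影层与各特征维度均为假设):

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(student_feats, teacher_feats, proj):
    """特征蒸馏损失的简化示意: 将学生特征投影到教师维度后做 MSE 对齐,
    教师(ViT-Huge)冻结, 仅更新学生(ViT-Small)与投影层。"""
    aligned = proj(student_feats)                      # 学生特征 -> 教师维度
    return F.mse_loss(aligned, teacher_feats.detach())  # 教师侧不回传梯度

# 用法示意(维度按 ViT-Small / ViT-Huge 的常见配置假设)
student_feats = torch.randn(2, 196, 384)
teacher_feats = torch.randn(2, 196, 1280)
proj = torch.nn.Linear(384, 1280)
loss = feature_distill_loss(student_feats, teacher_feats, proj)
loss.backward()
```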
zh
[CV-25] Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models IJCAI2025
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在组合推理(Compositional Reasoning, CR)任务中因缺乏有效的图像基础负样本训练和对负样本难度差异考虑不足而导致的性能瓶颈问题。其解决方案的关键在于提出自适应硬负扰动学习(Adaptive Hard Negative Perturbation Learning, AHNPL),该方法通过将文本基础的硬负样本转换至视觉域以生成语义扰动的图像负样本,从而增强视觉编码器的训练,并结合多模态硬负损失与动态边界损失,提升模型对困难样本对的区分能力。
链接: https://arxiv.org/abs/2505.15576
作者: Xin Huang,Ruibin Li,Tong Jia,Wei Zheng,Ya Wang
机构: Nanyang Normal University (南阳师范大学); Peking University (北京大学); Henan (河南省)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI 2025)
点击查看摘要
Abstract:Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model’s discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs’ performance on complex CR tasks. The source code is available at this https URL.
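摘要中的动态边界损失可用如下最小示意理解(非官方实现,以负样本与锚点的相似度近似样本难度,base_margin、alpha 等参数均为假设):

```python
import torch
import torch.nn.functional as F

def dynamic_margin_loss(anchor, positive, hard_negative, base_margin=0.2, alpha=0.5):
    """动态边界损失示意: 负样本越"难"(与锚点越相似), 施加的边界越大,
    从而强化模型对困难样本对的区分。"""
    sim_pos = F.cosine_similarity(anchor, positive)        # (B,)
    sim_neg = F.cosine_similarity(anchor, hard_negative)   # (B,)
    difficulty = sim_neg.detach().clamp(min=0)             # 以负样本相似度近似难度
    margin = base_margin + alpha * difficulty              # 难度自适应的边界
    return F.relu(sim_neg - sim_pos + margin).mean()

loss = dynamic_margin_loss(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```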
zh
[CV-26] TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在自动驾驶场景下的视觉问答(VQA)任务中因计算资源需求高而难以部署于资源受限车辆的问题。其解决方案的关键在于提出了一种轻量级但高效的VLM——TinyDrive,该模型包含两个核心组件:多尺度视觉编码器和基于令牌与序列的双层级优先机制。多尺度编码器通过尺度注入和跨尺度门控处理多视角图像,生成增强的视觉表征;在令牌层面,设计了基于学习重要性评分的动态令牌路由机制;在序列层面,通过整合归一化损失、不确定性估计和多样性度量来构建序列得分,以优化样本在优先缓冲区中的排序与保留。
链接: https://arxiv.org/abs/2505.15564
作者: Hossein Hassani,Soodeh Nikan,Abdallah Shami
机构: Western University (西安大略大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision Language Models (VLMs) employed for visual question-answering (VQA) in autonomous driving often require substantial computational resources that pose a challenge for their deployment in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components including a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder facilitates the processing of multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose integrating normalized loss, uncertainty estimates, and a diversity metric to formulate sequence scores that rank and preserve samples within a sequence priority buffer. Samples with higher scores are more frequently selected for training. TinyDrive is first evaluated on our custom-curated VQA dataset, and it is subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
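令牌级路由机制的核心思想可用如下草图说明(非官方实现,scorer 为假设的可学习打分头,keep_ratio 为假设参数):

```python
import torch

def route_tokens(tokens, scorer, keep_ratio=0.5):
    """按学习到的重要性分数保留得分最高的一部分视觉令牌。"""
    scores = scorer(tokens).squeeze(-1)            # (B, N) 重要性分数
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices            # 每个样本保留 top-k 令牌
    idx = idx.sort(dim=1).values                   # 保持令牌原有顺序
    batch = torch.arange(tokens.size(0)).unsqueeze(1)
    return tokens[batch, idx]                      # (B, k, D)

tokens = torch.randn(2, 196, 768)
scorer = torch.nn.Linear(768, 1)
print(route_tokens(tokens, scorer).shape)          # torch.Size([2, 98, 768])
```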
zh
[CV-27] seg_3D_by_PC2D: Multi-View Projection for Domain Generalization and Adaptation in 3D Semantic Segmentation
【速读】:该论文旨在解决3D语义分割模型在不同数据集之间部署时面临的严重领域偏移问题。其解决方案的关键在于提出一种新颖的多视角投影框架,该框架通过将激光雷达扫描对齐为连贯的3D场景,并从多个虚拟相机姿态渲染生成大规模合成2D数据集(PC2D),从而提升模型在域泛化(DG)和无监督域适应(UDA)任务中的性能。该方法在训练阶段使用合成2D数据进行模型训练,并在推理阶段通过多视角处理与遮挡感知投票方案实现3D点云的最终语义标签生成,具备模块化设计以支持关键参数的广泛探索。
链接: https://arxiv.org/abs/2505.15545
作者: Andrew Caunes,Thierry Chateau,Vincent Fremont
机构: Logiroad; LS2N - Ecole Centrale de Nantes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). Our approach first aligns Lidar scans into coherent 3D scenes and renders them from multiple virtual camera poses to create a large-scale synthetic 2D dataset (PC2D). We then use it to train a 2D segmentation model in-domain. During inference, the model processes hundreds of views per scene; the resulting logits are back-projected to 3D with an occlusion-aware voting scheme to generate final point-wise labels. Our framework is modular and enables extensive exploration of key design parameters, such as view generation optimization (VGO), visualization modality optimization (MODO), and 2D model choice. We evaluate on the nuScenes and SemanticKITTI datasets under both the DG and UDA settings. We achieve state-of-the-art results in UDA and close to state-of-the-art in DG, with particularly large gains on large, static classes. Our code and dataset generation tools will be publicly available at this https URL
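推理阶段"逐视角累加 logits 并按可见性投票"的流程,可用如下简化示意理解(非官方实现,visibility 可见性掩码为假设输入,实际系统需由遮挡检测得到):

```python
import numpy as np

def vote_labels(point_view_logits, visibility):
    """遮挡感知投票的简化示意。
    point_view_logits: (V, N, C) 各视角下每个 3D 点反投影得到的类别 logits;
    visibility: (V, N) 布尔掩码, 标记点在该视角下是否可见。"""
    masked = point_view_logits * visibility[..., None]   # 不可见视角不计票
    votes = masked.sum(axis=0)                           # (N, C) 跨视角累加
    return votes.argmax(axis=1)                          # 每个点的最终语义标签

V, N, C = 8, 1000, 19
rng = np.random.default_rng(0)
labels = vote_labels(rng.random((V, N, C)), rng.random((V, N)) > 0.3)
```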
zh
[CV-28] Convolutional Long Short-Term Memory Neural Networks Based Numerical Simulation of Flow Field
【速读】:该论文旨在解决传统计算流体力学(Computational Fluid Dynamics, CFD)在流场分析中收敛性与精度依赖数学模型、数值方法及计算耗时的问题。其解决方案的关键在于提出一种改进的卷积长短期记忆网络(Convolutional Long Short-Term Memory, ConvLSTM)作为流场预测的基准网络,结合残差网络和注意力机制以提升模型对流场时空特征的提取能力,同时减少参数量和训练时间。
链接: https://arxiv.org/abs/2505.15533
作者: Chang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICIC 2025 accepted
点击查看摘要
Abstract:Computational Fluid Dynamics (CFD) is the main approach to analyzing flow fields. However, the convergence and accuracy depend largely on mathematical models of flow, numerical methods, and time consumption. Deep learning-based analysis of flow fields provides an alternative. For the task of flow field prediction, an improved Convolutional Long Short-Term Memory (ConvLSTM) Neural Network is proposed as the baseline network in consideration of the temporal and spatial characteristics of flow fields. Combining dynamic mesh technology and User-Defined Function (UDF), numerical simulations of flow around a circular cylinder were conducted. Flow field snapshots were used to sample data from the cylinder's wake region at different time instants, constructing a flow field dataset with sufficient volume and rich flow state variations. Residual networks and attention mechanisms are combined with the standard ConvLSTM model. Compared with the standard ConvLSTM model, the results demonstrate that the improved ConvLSTM model can extract more temporal and spatial features while having fewer parameters and shorter training time.
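作为参考,下面给出标准 ConvLSTM 单元的最小实现(论文在此基础上引入残差与注意力机制,此处仅示意其时空门控的核心计算):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """标准 ConvLSTM 单元: 用卷积替代 LSTM 中的全连接,
    同时保留输入门 i、遗忘门 f、输出门 o 与候选状态 g 的门控结构。"""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        gates = self.conv(torch.cat([x, h], dim=1))  # 一次卷积同时产生四组门
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()
        c = f * c + i * g.tanh()                     # 细胞状态更新
        h = o * c.tanh()                             # 隐状态输出
        return h, (h, c)

cell = ConvLSTMCell(2, 16)
x = torch.randn(1, 2, 64, 64)                        # 例如流场的 (u, v) 两个速度分量
h = c = torch.zeros(1, 16, 64, 64)
out, _ = cell(x, (h, c))
```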
zh
[CV-29] Clapper: Compact Learning and Video Representation in VLMs
【速读】:该论文旨在解决现有视觉-语言模型(VLMs)在处理长视频时存在的性能退化问题,特别是在压缩视觉标记至原始数量的四分之一以下时表现不佳。其关键解决方案是提出Clapper方法,该方法采用慢速-快速策略进行视频表征,并引入一种名为TimePerceiver的新模块,以在不牺牲问答准确性的前提下实现每帧61个视觉标记的13倍压缩率,从而有效平衡短视频和长视频的处理需求。
链接: https://arxiv.org/abs/2505.15529
作者: Lingyu Kong,Hongzhi Zhang,Jingyuan Zhang,Jianzhao Huang,Kunze Li,Qi Wang,Fuzheng Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学); Kuaishou Technology (快手科技); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e. capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation in long video understanding tasks when compressing visual tokens below a quarter of their original visual tokens. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. By using our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
zh
[CV-30] PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting
【速读】:该论文试图解决复杂3D植物生成的挑战,尤其是在生成具有精细细节和准确几何结构的植物模型方面,现有文本到3D生成模型表现不佳,限制了其在植物分析工具中的应用。论文提出的解决方案关键在于引入PlantDreamer,其核心是采用深度ControlNet、微调的低秩适应(LoRA)以及可调节的高斯剔除算法,从而显著提升生成植物模型的纹理真实性和几何完整性。此外,PlantDreamer还支持纯合成植物生成及真实世界植物点云的增强,通过转换为3D高斯斑点实现。
链接: https://arxiv.org/abs/2505.15528
作者: Zane K J Hartley,Lewis A G Stuart,Andrew P French,Michael P Pound
机构: School of Computer Science, University of Nottingham(计算机科学学院,诺丁汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 13 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Recent years have seen substantial improvements in the ability to generate synthetic 3D objects using AI. However, generating complex 3D objects, such as plants, remains a considerable challenge. Current generative 3D models struggle with plant generation compared to general objects, limiting their usability in plant analysis tools, which require fine detail and accurate geometry. We introduce PlantDreamer, a novel approach to 3D synthetic plant generation, which can achieve greater levels of realism for complex plant geometry and textures than available text-to-3D models. To achieve this, our new generation pipeline leverages a depth ControlNet, fine-tuned Low-Rank Adaptation and an adaptable Gaussian culling algorithm, which directly improve textural realism and geometric integrity of generated 3D plant models. Additionally, PlantDreamer enables both purely synthetic plant generation, by leveraging L-System-generated meshes, and the enhancement of real-world plant point clouds by converting them into 3D Gaussian Splats. We evaluate our approach by comparing its outputs with state-of-the-art text-to-3D models, demonstrating that PlantDreamer outperforms existing methods in producing high-fidelity synthetic plants. Our results indicate that our approach not only advances synthetic plant generation, but also facilitates the upgrading of legacy point cloud datasets, making it a valuable tool for 3D phenotyping applications.
zh
[CV-31] Detection of Underwater Multi-Targets Based on Self-Supervised Learning and Deformable Path Aggregation Feature Pyramid Network
【速读】:该论文旨在解决水下环境对目标检测模型的限制,提升水下目标检测的准确性和鲁棒性。其关键解决方案是构建了一个专门用于水下目标检测的数据集,并提出了一种高效的多目标检测算法。该算法通过基于SimSiam结构的自监督学习进行网络预训练,并引入可变形卷积和扩张卷积以应对水下目标对比度低、相互遮挡及密集分布等问题,从而扩大感受野以获取更有效的信息;同时,采用改进的回归损失函数EIoU,通过分别计算预测框的宽高损失来提升模型性能。
链接: https://arxiv.org/abs/2505.15518
作者: Chang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICIC 2025 accepted
点击查看摘要
Abstract:To overcome the constraints of the underwater environment and improve the accuracy and robustness of underwater target detection models, this paper develops a specialized dataset for underwater target detection and proposes an efficient algorithm for underwater multi-target detection. A self-supervised learning method based on the SimSiam structure is employed for pre-training the underwater target detection network. To address the low detection accuracy caused by the low contrast, mutual occlusion, and dense distribution of underwater targets, a detection model suited to underwater scenes is proposed by introducing deformable convolution and dilated convolution. The proposed detection model can obtain more effective information by increasing the receptive field. In addition, the regression loss function EIoU is introduced, which improves model performance by separately calculating the width and height losses of the predicted box. Experimental results show that the proposed detector improves the accuracy of underwater target detection.
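摘要提到 EIoU 通过分别计算预测框的宽、高损失来改进回归,下面按 EIoU 的公开定义给出一个示意实现(非论文代码,框格式假设为 x1, y1, x2, y2):

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU 损失示意: 在 1 - IoU 的基础上, 分别惩罚中心距离、宽度差与高度差。"""
    # 交并比
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # 最小外接框的宽高
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    cw, ch = (c_rb - c_lt).unbind(dim=1)
    # 中心距离与宽、高差, 分别归一化后累加
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((ctr_p - ctr_t) ** 2).sum(dim=1)
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])
    return (1 - iou + rho2 / (cw**2 + ch**2 + eps)
            + dw**2 / (cw**2 + eps) + dh**2 / (ch**2 + eps)).mean()

loss = eiou_loss(torch.rand(8, 4).sort(dim=1).values, torch.rand(8, 4).sort(dim=1).values)
```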
zh
[CV-32] Directional Non-Commutative Monoidal Structures for Compositional Embeddings in Machine Learning NEURIPS2025
【速读】:该论文试图解决多维组合嵌入(multi-dimensional compositional embeddings)的建模问题,旨在为传统序列建模范式(如结构化状态空间模型和Transformer自注意力机制)提供一个统一的数学框架。其解决方案的关键在于引入了一种新的代数结构,该结构基于方向性的非交换乘法算子,定义了针对每个轴的独立组合算子circ_i,确保沿每个轴的结合性,并通过全局交换律保证跨轴组合的一致性,从而实现了理论上的完备性和与现代机器学习架构的兼容性。
链接: https://arxiv.org/abs/2505.15507
作者: Mahesh Godavarti
机构: Qalaxia(夸拉西亚)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 11 pages submitted to NeurIPS 2025
点击查看摘要
Abstract:We introduce a new algebraic structure for multi-dimensional compositional embeddings, built on directional non-commutative monoidal operators. The core contribution of this work is this novel framework, which exhibits appealing theoretical properties (associativity along each dimension and an interchange law ensuring global consistency) while remaining compatible with modern machine learning architectures. Our construction defines a distinct composition operator circ_i for each axis i, ensuring associative combination along each axis without imposing global commutativity. Importantly, all axis-specific operators commute with one another, enforcing a global interchange law that enables consistent cross-axis compositions. This is, to our knowledge, the first approach that provides a common foundation that generalizes classical sequence-modeling paradigms (e.g., structured state-space models (SSMs) and transformer self-attention) to a unified multi-dimensional framework. For example, specific one-dimensional instances of our framework can recover the familiar affine transformation algebra, vanilla self-attention, and the SSM-style recurrence. The higher-dimensional generalizations naturally support recursive, structure-aware operations in embedding spaces. We outline several potential applications unlocked by this structure-including structured positional encodings in Transformers, directional image embeddings, and symbolic modeling of sequences or grids-indicating that it could inform future deep learning model designs. We formally establish the algebraic properties of our framework and discuss efficient implementations. Finally, as our focus is theoretical, we include no experiments here and defer empirical validation to future work, which we plan to undertake.
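摘要中的代数性质可用如下公式直观表达(按摘要的描述整理,符号 \circ_i 即文中的 circ_i):

```latex
% 每个轴 i 上的组合算子满足结合律:
% (x \circ_i y) \circ_i z = x \circ_i (y \circ_i z)
% 不同轴之间满足全局交换律 (interchange law), 对任意 i \neq j:
\[
(a \circ_i b) \circ_j (c \circ_i d) \;=\; (a \circ_j c) \circ_i (b \circ_j d)
\]
```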
zh
[CV-33] Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts
【速读】:该论文试图解决在目标数据集与预训练模型所使用的数据分布和类别差异较大时,如何仅使用少量标注样本对生成式视觉-语言模型(Vision-Language Models, VLMs)进行有效微调的问题。现有方法在面对此类分布偏移时存在过拟合和泛化能力下降的问题,且由于预训练数据不可用,难以准确评估模型在下游任务中的表现。论文的关键解决方案是提出一种新颖的提示微调方法PromptMargin,其核心在于通过选择性增强策略补充有限的训练样本,并引入多模态边界正则化器以提高类别区分度,从而实现对VLMs的有效适应。
链接: https://arxiv.org/abs/2505.15506
作者: Debarshi Brahma,Anuska Roy,Soma Biswas
机构: Indian Institute of Science, Bangalore(印度科学研究所,班加罗尔)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in TMLR (2025)
点击查看摘要
Abstract:Recently, Vision-Language foundation models like CLIP and ALIGN, which are pre-trained on large-scale data have shown remarkable zero-shot generalization to diverse datasets with different classes and even domains. In this work, we take a step further and analyze whether these models can be adapted to target datasets having very different distributions and classes compared to what these models have been trained on, using only a few labeled examples from the target dataset. In such scenarios, finetuning large pretrained models is challenging due to problems of overfitting as well as loss of generalization, and has not been well explored in prior literature. Since, the pre-training data of such models are unavailable, it is difficult to comprehend the performance on various downstream datasets. First, we try to answer the question: Given a target dataset with a few labelled examples, can we estimate whether further fine-tuning can enhance the performance compared to zero-shot evaluation? by analyzing the common vision-language embedding space. Based on the analysis, we propose a novel prompt-tuning method, PromptMargin for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes the text as well as visual prompts for this task, and has two main modules: 1) Firstly, we use a selective augmentation strategy to complement the few training samples in each task; 2) Additionally, to ensure robust training in the presence of unfamiliar class names, we increase the inter-class margin for improved class discrimination using a novel Multimodal Margin Regularizer. Extensive experiments and analysis across fifteen target benchmark datasets, with varying degrees of distribution shifts from natural images, shows the effectiveness of the proposed framework over the existing state-of-the-art approaches applied to this setting. this http URL.
zh
[CV-34] Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification
【速读】:该论文试图解决深度学习在计算病理学中的应用问题,即专家标注稀缺导致的监督学习受限,以及少样本学习中过拟合和判别特征误表征的问题。现有基于预训练视觉-语言模型的少样本多实例学习(MIL)方法虽然缓解了这些问题,但需要复杂的预处理和高昂的计算成本。解决方案的关键是提出一种Squeeze-and-Recalibrate (SR)块,作为MIL模型中线性层的即插即用替代方案,其核心由一对低秩可训练矩阵(squeeze pathway, SP)和一个冻结的随机重校准矩阵组成,通过减少参数数量、防止虚假特征学习以及重新定义优化目标来提升模型性能。
链接: https://arxiv.org/abs/2505.15504
作者: Conghao Xiong,Zhengrui Guo,Zhe Xu,Yifei Zhang,Raymond Kai-Yu Tong,Si Yong Yeo,Hao Chen,Joseph J. Y. Sung,Irwin King
机构: CUHK(香港中文大学); HKUST(香港科技大学); NTU(南洋理工大学); MedVisAI Lab(医学视觉人工智能实验室); Lee Kong Chian School of Medicine, NTU(南洋理工大学李光前医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Deep learning has advanced computational pathology but expert annotations remain scarce. Few-shot learning mitigates annotation burdens yet suffers from overfitting and discriminative feature mischaracterization. In addition, the current few-shot multiple instance learning (MIL) approaches leverage pretrained vision-language models to alleviate these issues, but at the cost of complex preprocessing and high computational cost. We propose a Squeeze-and-Recalibrate (SR) block, a drop-in replacement for linear layers in MIL models to address these challenges. The SR block comprises two core components: a pair of low-rank trainable matrices (squeeze pathway, SP) that reduces parameter count and imposes a bottleneck to prevent spurious feature learning, and a frozen random recalibration matrix that preserves geometric structure, diversifies feature directions, and redefines the optimization objective for the SP. We provide theoretical guarantees that the SR block can approximate any linear mapping to arbitrary precision, thereby ensuring that the performance of a standard MIL model serves as a lower bound for its SR-enhanced counterpart. Extensive experiments demonstrate that our SR-MIL models consistently outperform prior methods while requiring significantly fewer parameters and no architectural changes.
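按摘要的描述,SR 块可写成如下最小示意(一对低秩可训练矩阵构成 squeeze 通路,外加一个冻结的随机重校准矩阵;随机矩阵的具体初始化方式为假设):

```python
import torch
import torch.nn as nn

class SRBlock(nn.Module):
    """Squeeze-and-Recalibrate 块示意, 可直接替换 MIL 模型中的 nn.Linear:
    低秩瓶颈减少参数并抑制虚假特征, 冻结的随机矩阵负责重校准特征方向。"""
    def __init__(self, in_dim, out_dim, rank=16):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)   # 低秩压缩(可训练)
        self.up = nn.Linear(rank, out_dim, bias=False)    # 低秩恢复(可训练)
        recal = torch.randn(out_dim, out_dim) / out_dim**0.5
        self.register_buffer("recal", recal)              # 冻结的随机重校准矩阵

    def forward(self, x):
        return self.up(self.down(x)) @ self.recal.T

layer = SRBlock(1024, 512)                                # 参数量远小于 1024x512 的线性层
print(layer(torch.randn(8, 1024)).shape)
```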
zh
[CV-35] Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation ICIP2025
【速读】:该论文旨在解决仅依赖RGB数据的语义分割在低光照和遮挡等复杂环境下性能受限的问题,从而提升其在关键应用(如自动驾驶)中的可靠性。其解决方案的关键在于从新的频谱视角出发,将多模态特征划分为低频特征(提供场景上下文信息)和高频特征(捕捉模态特异性细节),并通过提出频谱感知全局融合网络(Spectral-aware Global Fusion Network, SGFNet)来显式建模高频特征之间的交互,从而有效融合RGB与热辐射数据的多模态特征。
链接: https://arxiv.org/abs/2505.15491
作者: Ce Zhang,Zifu Wan,Simon Stepputtis,Katia Sycara,Yaqi Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICIP 2025
点击查看摘要
Abstract:Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating additional thermal radiation data with RGB images demonstrates enhanced performance and robustness. However, how to effectively reconcile the modality discrepancies and fuse the RGB and thermal features remains a well-known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi-modal features can be categorized into two spectral components: low-frequency features that provide broad scene context, including color variations and smooth areas, and high-frequency features that capture modality-specific details such as edges and textures. Inspired by this, we propose the Spectral-aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi-modal features by explicitly modeling the interactions between the high-frequency, modality-specific features. Our experimental results demonstrate that SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.
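摘要未给出具体的频谱分解方式,下面以"平均池化模糊近似低频、残差作为高频"这一常见做法给出一个假设性示意,仅用于说明低频/高频分量的含义:

```python
import torch
import torch.nn.functional as F

def spectral_split(feat, kernel_size=5):
    """将特征近似分解为低频与高频分量: 低频携带整体场景上下文,
    高频(原特征减去低频)保留边缘、纹理等模态特有细节。"""
    pad = kernel_size // 2
    low = F.avg_pool2d(feat, kernel_size, stride=1, padding=pad)  # 低频分量
    high = feat - low                                             # 高频分量
    return low, high

rgb_feat = torch.randn(2, 64, 120, 160)
low, high = spectral_split(rgb_feat)
```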
zh
[CV-36] Pura: An Efficient Privacy-Preserving Solution for Face Recognition
【速读】:该论文旨在解决面部识别技术中因敏感面部图像引发的隐私问题,同时提升隐私保护方案的效率。其解决方案的关键在于提出一种名为Pura的高效隐私保护方案,该方案通过阈值Paillier加密系统构建了非交互式的隐私保护架构,并设计了一套底层安全计算协议,以实现对加密数据的高效面部识别操作,同时还引入并行计算机制以提升性能。
链接: https://arxiv.org/abs/2505.15476
作者: Guotao Xu,Bowen Zhao,Yang Xiao,Yantao Zhong,Liang Zhai,Qingqi Pei
机构: Guangzhou Institute of Technology (广州理工学院); Shaanxi Key Laboratory of Blockchain and Secure Computing, Xidian University (陕西省区块链与安全计算重点实验室,西安电子科技大学); School of Cyber Engineering, Engineering Research Center of Trusted Digital Economy, Universities of Shaanxi Province, Xidian University (网络空间安全学院,陕西省可信数字经济工程研究中心,西安电子科技大学); China Resources Intelligent Computing Technology Co., Ltd. (中国资源智能计算技术有限公司); Chinese Academy of Surveying and Mapping (中国测绘科学研究院); State Key Laboratory of Integrated Service Networks, Xidian University (综合业务网理论及关键技术国家重点实验室,西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Face recognition is an effective technology for identifying a target person by facial images. However, sensitive facial images raise privacy concerns. Although privacy-preserving face recognition is one potential solution, it neither fully addresses the privacy concerns nor is efficient enough. To this end, we propose an efficient privacy-preserving solution for face recognition, named Pura, which sufficiently protects facial privacy and supports face recognition over encrypted data efficiently. Specifically, we propose a privacy-preserving and non-interactive architecture for face recognition through the threshold Paillier cryptosystem. Additionally, we carefully design a suite of underlying secure computing protocols to enable efficient operations of face recognition over encrypted data directly. Furthermore, we introduce a parallel computing mechanism to enhance the performance of the proposed secure computing protocols. Privacy analysis demonstrates that Pura fully safeguards personal facial privacy. Experimental evaluations demonstrate that Pura achieves recognition speeds up to 16 times faster than the state-of-the-art.
zh
[CV-37] Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models
【速读】:该论文试图解决扩散模型在生成文本到图像过程中可能无意生成不适宜工作场所(NSFW)内容的问题,这一问题对模型的安全部署构成了重大风险。解决方案的关键在于引入一个全链条工具包,专门用于概念擦除,并进行首次系统性研究,以深入分析概念擦除方法的内在机制与实际效果,从而为不同现实场景中有效应用概念擦除方法提供理论支持和实践指导。
链接: https://arxiv.org/abs/2505.15450
作者: Die Chen,Zhiwen Li,Cen Chen,Yuexiang Xie,Xiaodan Li,Jinyan Ye,Yingda Chen,Yaliang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.
zh
[CV-38] ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning
【速读】:该论文旨在解决视频理解中基于意图的帧选择问题,即如何在视频中有效识别与任务目标相关的关键帧。现有方法通常依赖于启发式策略或伪标签监督标注,存在成本高且难以扩展的局限性。论文提出的解决方案——ViaRL,其关键在于首次引入基于规则的强化学习(rule-based reinforcement learning)框架,通过试错机制利用下游模型的答案准确率作为奖励信号,优化帧选择过程,从而无需昂贵的人工标注即可实现接近人类学习过程的高效视频理解。
链接: https://arxiv.org/abs/2505.15447
作者: Ziqiang Xu,Qi Dai,Tian Xie,Yifan Yang,Kai Qiu,DongDong Chen,Zuxuan Wu,Chong Luo
机构: Fudan University (复旦大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals. Recent advancements in multimodal large language models (MLLMs) have enabled flexible query-driven reasoning; however, video-based frameworks like Video Chain-of-Thought lack direct training signals to effectively identify relevant frames. Current approaches often rely on heuristic methods or pseudo-label supervised annotations, which are both costly and limited in scalability across diverse scenarios. To overcome these challenges, we introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in intention-driven video understanding. An iterated amplification strategy is adopted to perform alternating cyclic training in the video CoT system, where each component undergoes iterative cycles of refinement to improve its capabilities. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error, eliminating the need for expensive annotations while closely aligning with human-like learning processes. Comprehensive experiments across multiple benchmarks, including VideoMME, LVBench, and MLVU, demonstrate that ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks, highlighting its effectiveness and scalability. Notably, ViaRL achieves a nearly 15% improvement on Needle QA, a subset of MLVU, which requires searching for a specific needle within a long video and is regarded as one of the most suitable benchmarks for evaluating temporal grounding.
zh
[CV-39] Stronger ViTs With Octic Equivariance
【速读】:该论文试图解决当前视觉变换器(Vision Transformers, ViTs)在计算效率和性能优化方面的挑战,旨在通过引入新的归纳偏置来提升模型的效率与效果。其解决方案的关键在于引入八次等变性(octic equivariance),即对反射和90度旋转的等变性,作为额外的归纳偏置,并设计了八次等变视觉变换器(octic ViTs)架构,通过使用八次等变层,在监督和自监督学习任务中实现了更高的计算效率和更好的性能表现。
链接: https://arxiv.org/abs/2505.15441
作者: David Nordström,Johan Edstedt,Fredrik Kahl,Georg Bökman
机构: Chalmers University of Technology (查尔姆斯理工大学); Linköping University (林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent efforts at scaling computer vision models have established Vision Transformers (ViTs) as the leading architecture. ViTs incorporate weight sharing over image patches as an important inductive bias. In this work, we show that ViTs benefit from incorporating equivariance under the octic group, i.e., reflections and 90-degree rotations, as a further inductive bias. We develop new architectures, octic ViTs, that use octic-equivariant layers and put them to the test on both supervised and self-supervised learning. Through extensive experiments on DeiT-III and DINOv2 training on ImageNet-1K, we show that octic ViTs yield more computationally efficient networks while also improving performance. In particular, we achieve approximately 40% reduction in FLOPs for ViT-H while simultaneously improving both classification and segmentation results.
zh
[CV-40] FRN: Fractal-Based Recursive Spectral Reconstruction Network
【速读】:该论文旨在解决从RGB图像通过光谱重建生成高光谱图像(HSI)的问题,以降低HSI获取的成本。其解决方案的关键在于提出一种基于分形的递归光谱重建网络(FRN),该网络将光谱重建视为一个渐进过程,采用从宽波段到窄波段或粗到细的策略来预测下一个波长,而非直接整合R、G、B通道的全谱信息。FRN通过递归调用原子重建模块,并仅利用相邻波段的光谱信息来生成下一波长的图像,从而遵循光谱数据的低秩特性,同时设计了一种带感知的状态空间模型,以抑制因反射率差异引起的低相关区域干扰。
链接: https://arxiv.org/abs/2505.15439
作者: Ge Meng,Zhongnan Cai,Ruizhe Chen,Jingyan Tu,Yingying Wang,Yue Huang,Xinghao Ding
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Generating hyperspectral images (HSIs) from RGB images through spectral reconstruction can significantly reduce the cost of HSI acquisition. In this paper, we propose a Fractal-Based Recursive Spectral Reconstruction Network (FRN), which differs from existing paradigms that attempt to directly integrate the full-spectrum information from the R, G, and B channels in a one-shot manner. Instead, it treats spectral reconstruction as a progressive process, predicting from broad to narrow bands or employing a coarse-to-fine approach for predicting the next wavelength. Inspired by fractals in mathematics, FRN establishes a novel spectral reconstruction paradigm by recursively invoking an atomic reconstruction module. In each invocation, only the spectral information from neighboring bands is used to provide clues for the generation of the image at the next wavelength, which follows the low-rank property of spectral data. Moreover, we design a band-aware state space model that employs a pixel-differentiated scanning strategy at different stages of the generation process, further suppressing interference from low-correlation regions caused by reflectance differences. Through extensive experimentation across different datasets, FRN achieves superior reconstruction performance compared to state-of-the-art methods in both quantitative and qualitative evaluations.
zh
[CV-41] Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation
【速读】:该论文旨在解决手语翻译(Sign Language Translation, SLT)中依赖人工标注的gloss标签所带来的成本高、可扩展性差的问题。传统方法通过将SLT分解为视频到gloss识别和gloss到文本翻译两个子任务来实现,但其有效性受限于专家标注的gloss数据稀缺。该论文提出的解决方案的关键在于构建一个无gloss的伪gloss生成框架,利用大语言模型(Large Language Model, LLM)通过上下文学习生成初步的gloss,并通过弱监督学习对伪gloss进行排序优化,以提升其与视频中手语序列的对齐效果,从而实现无需人工标注gloss的高效SLT。
链接: https://arxiv.org/abs/2505.15438
作者: Jianyuan Guo,Peike Li,Trevor Cohn
机构: City University of Hong Kong (香港城市大学); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report, 21 pages
点击查看摘要
Abstract:Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering in the pseudo glosses for better alignment via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives, and allows for the use of efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline, which progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.
zh
[CV-42] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
【速读】:该论文旨在解决现有视觉语言模型(Vision Language Models, VLMs)在多模态推理能力方面尚未充分探索的问题。其关键解决方案是提出一种基于视觉线索和问题引导的自适应聚焦与缩放机制,即链式聚焦(Chain-of-Focus, CoF)方法,以提升模型在不同图像分辨率和任务下的多模态推理效率。通过两阶段训练流程——监督微调(Supervised Fine-Tuning, SFT)和强化学习(Reinforcement Learning, RL)——实现了模型对关键图像区域的动态识别与推理策略优化。
链接: https://arxiv.org/abs/2505.15436
作者: Xintong Zhang,Zhi Gao,Bofei Zhang,Pengxiang Li,Xiaowen Zhang,Yang Liu,Tao Yuan,Yuwei Wu,Yunde Jia,Song-Chun Zhu,Qing Li
机构: Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; State Key Laboratory of General Artificial Intelligence, BIGAI; School of Intelligence Science and Technology, Peking University; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; Department of Automation, Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, enabling further refinement of the search and reasoning strategy of models without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% among 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.
zh
[CV-43] TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
【速读】:该论文试图解决当前视觉-语言模型(Vision-Language Models, VLMs)在时间因果推理方面能力不足的问题,特别是对由现实世界知识驱动的不可逆物体变化(如水果腐烂和人类衰老)的理解。其解决方案的关键在于引入了一个名为TimeCausality的新基准,专门用于评估VLM在时间维度上的因果推理能力。通过该基准,研究揭示了现有最先进的开源VLM在时间因果推理任务上显著落后于闭源模型,甚至包括GPT-4o在内的闭源模型在该任务上的表现也明显下降,从而强调了将时间因果性纳入VLM评估与开发的重要性。
链接: https://arxiv.org/abs/2505.15435
作者: Zeqing Wang,Shiyuan Zhang,Chengpei Tang,Keze Wang
机构: Sun Yat-sen University (中山大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures, 3 tables
点击查看摘要
Abstract:Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge (e.g., fruit decay and human aging), is a fundamental aspect of human visual understanding. Unlike temporal perception based on simple event sequences, this form of reasoning requires a deeper comprehension of how object states change over time. Although the current powerful Vision-Language Models (VLMs) have demonstrated impressive performance on a wide range of downstream tasks, their capacity to reason about temporal causality remains underexplored. To address this gap, we introduce TimeCausality, a novel benchmark specifically designed to evaluate the causal reasoning ability of VLMs in the temporal dimension. Based on our TimeCausality, we find that while the current SOTA open-source VLMs have achieved performance levels comparable to closed-source models like GPT-4o on various standard visual question answering tasks, they fall significantly behind on our benchmark compared with their closed-source competitors. Furthermore, even GPT-4o exhibits a marked drop in performance on TimeCausality compared to its results on other tasks. These findings underscore the critical need to incorporate temporal causality into the evaluation and development of VLMs, and they highlight an important challenge for the open-source VLM community moving forward. Code and Data are available at this https URL.
zh
[CV-44] On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable? UAI
【速读】:该论文旨在解决医学视觉-语言模型(Medical Vision-Language Models, MVLMs)在噪声和损坏条件下性能不佳的问题,当前评估多基于清洁数据集,忽视了模型在真实世界干扰下的鲁棒性。解决方案的关键在于引入MediMeta-C基准,系统地应用多种扰动以评估MVLM的鲁棒性,并提出RobustMedCLIP,通过少量样本微调增强模型对损坏的适应能力。实验表明,高效低秩适配结合少量样本微调可在保持跨模态泛化能力的同时提升模型鲁棒性。
链接: https://arxiv.org/abs/2505.15425
作者: Raza Imam,Rufael Marew,Mohammad Yaqub
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dataset and Code is available at this https URL Accepted at: Medical Image Understanding and Analysis (MIUA) 2025
点击查看摘要
Abstract:Medical Vision-Language Models (MVLMs) have achieved par excellence generalization in medical image analysis, yet their performance under noisy, corrupted conditions remains largely untested. Clinical imaging is inherently susceptible to acquisition artifacts and noise; however, existing evaluations predominantly assess generally clean datasets, overlooking robustness – i.e., the model’s ability to perform under real-world distortions. To address this gap, we first introduce MediMeta-C, a corruption benchmark that systematically applies several perturbations across multiple medical imaging datasets. Combined with MedMNIST-C, this establishes a comprehensive robustness evaluation framework for MVLMs. We further propose RobustMedCLIP, a visual encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions. Through extensive experiments, we benchmark 5 major MVLMs across 5 medical imaging modalities, revealing that existing models exhibit severe degradation under corruption and struggle with domain-modality tradeoffs. Our findings highlight the necessity of diverse training and robust adaptation strategies, demonstrating that efficient low-rank adaptation when paired with few-shot tuning, improves robustness while preserving generalization across modalities.
zh
[CV-45] Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks CVPR
【速读】:该论文试图解决Vision Transformers在计算和资源需求上的高消耗问题,同时希望利用Mixture-of-Experts (MoE)结构提升模型效率。其解决方案的关键在于从预训练模型中提取专家子网络(expert subnetworks),通过聚类输出激活模式识别不同的激活特征,并据此提取负责生成这些特征的子网络,从而实现对每个样本仅激活相关子网络,显著降低计算量(MACs)和模型规模,同时保持接近原始模型的性能。
链接: https://arxiv.org/abs/2505.15414
作者: Uranik Berisha,Jens Mehnert,Alexandru Paul Condurache
机构: Robert Bosch GmbH (罗伯特·博世有限公司); University of Lübeck (吕贝克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025
点击查看摘要
Abstract:Vision Transformers have emerged as the state-of-the-art models in various Computer Vision tasks, but their high computational and resource demands pose significant challenges. While Mixture-of-Experts (MoE) can make these models more efficient, they often require costly retraining or even training from scratch. Recent developments aim to reduce these computational costs by leveraging pretrained networks. These have been shown to produce sparse activation patterns in the Multi-Layer Perceptrons (MLPs) of the encoder blocks, allowing for conditional activation of only relevant subnetworks for each sample. Building on this idea, we propose a new method to construct MoE variants from pretrained models. Our approach extracts expert subnetworks from the model’s MLP layers post-training in two phases. First, we cluster output activations to identify distinct activation patterns. In the second phase, we use these clusters to extract the corresponding subnetworks responsible for producing them. On ImageNet-1k recognition tasks, we demonstrate that these extracted experts can perform surprisingly well out of the box and require only minimal fine-tuning to regain 98% of the original performance, all while reducing MACs and model size, by up to 36% and 32% respectively.
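论文两阶段抽取流程(先聚类输出激活模式、再抽取对应子网络)可用如下简化草图理解(非官方实现;以"簇内平均激活最强的隐藏单元"近似"负责产生该模式的子网络",keep 比例为假设):

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_experts(activations, weight, n_experts=4, keep=0.25):
    """从预训练 MLP 隐层中抽取专家子网络的两阶段示意。
    activations: (n_samples, hidden) 某 MLP 隐层的激活;
    weight: (hidden, out) 该层后接的权重矩阵。"""
    labels = KMeans(n_clusters=n_experts, n_init=10).fit_predict(activations)  # 阶段一: 聚类激活模式
    experts = []
    k = int(activations.shape[1] * keep)
    for e in range(n_experts):                             # 阶段二: 逐簇抽取子网络
        mean_act = np.abs(activations[labels == e]).mean(axis=0)
        units = np.argsort(-mean_act)[:k]                  # 该簇最活跃的隐藏单元
        experts.append(weight[units])                      # 专家 = 对应权重切片
    return experts

rng = np.random.default_rng(0)
experts = extract_experts(rng.standard_normal((512, 3072)), rng.standard_normal((3072, 768)))
```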
zh
[CV-46] Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes
【速读】:该论文旨在解决自动化分析复杂动物行为的难题,特别是在计算神经科学领域中对细粒度行为进行自动分类的需求。当前可用的数据集主要关注简单或社交行为,而缺乏针对复杂机械任务的高质量标注数据。论文提出了一种包含个体小鼠解决复杂机械谜题(称为锁盒)的视频数据集,提供了多视角记录的超过110小时的视频数据,并为其中两只小鼠的视频提供了人工标注的帧级动作标签。其解决方案的关键在于基于关键点(pose)跟踪的动作分类框架,该框架揭示了在自动化标注精细行为(如物体操作)时所面临的挑战。
链接: https://arxiv.org/abs/2505.15408
作者: Patrik Reiske,Marcus N. Boon,Niek Andresen,Sole Traverso,Katharina Hohlbaum,Lars Lewejohann,Christa Thöne-Reineke,Olaf Hellwich,Henning Sprekeler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Machine learning and computer vision methods have a major impact on the study of natural animal behavior, as they enable the (semi-)automatic analysis of vast amounts of video data. Mice are the standard mammalian model system in most research fields, but the datasets available today to refine such methods focus either on simple or social behaviors. In this work, we present a video dataset of individual mice solving complex mechanical puzzles, so-called lockboxes. The more than 110 hours of total playtime show their behavior recorded from three different perspectives. As a benchmark for frame-level action classification methods, we provide human-annotated labels for all videos of two different mice, that equal 13% of our dataset. Our keypoint (pose) tracking-based action classification framework illustrates the challenges of automated labeling of fine-grained behaviors, such as the manipulation of objects. We hope that our work will help accelerate the advancement of automated action and behavior classification in the computational neuroscience community. Our dataset is publicly available at this https URL
zh
[CV-47] Visual Question Answering on Multiple Remote Sensing Image Modalities
【速读】:该论文试图解决在遥感背景下,传统视觉问答(Visual Question Answering, VQA)系统因单一图像模态导致的视觉特征表示不足的问题。解决方案的关键在于引入多图像模态(包括高分辨率RGB图像、多光谱成像数据和合成孔径雷达数据),并构建一个名为TAMMI的数据集,同时提出MM-RSVQA模型,该模型基于VisualBERT,通过可训练的融合过程有效结合多模态图像与文本信息,以提升对复杂问题的理解与回答能力。
链接: https://arxiv.org/abs/2505.15401
作者: Hichem Boussaid,Lucrezia Tosato,Flora Weissgerber,Camille Kurtz,Laurent Wendling,Sylvain Lobry
机构: LIPADE, Université Paris Cité, France; ONERA, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: EARTHVISION 2025; 8 pages, 1 page of supplementary material, 4 figures
点击查看摘要
Abstract:The extraction of visual features is an essential step in Visual Question Answering (VQA). Building a good visual representation of the analyzed scene is indeed one of the essential keys for the system to be able to correctly understand the latter in order to answer complex questions. In many fields such as remote sensing, the visual feature extraction step could benefit significantly from leveraging different image modalities carrying complementary spectral, spatial and contextual information. In this work, we propose to add multiple image modalities to VQA in the particular context of remote sensing, leading to a novel task for the computer vision community. To this end, we introduce a new VQA dataset, named TAMMI (Text and Multi-Modal Imagery) with diverse questions on scenes described by three different modalities (very high resolution RGB, multi-spectral imaging data and synthetic aperture radar). Thanks to an automated pipeline, this dataset can be easily extended according to experimental needs. We also propose the MM-RSVQA (Multi-modal Multi-resolution Remote Sensing Visual Question Answering) model, based on VisualBERT, a vision-language transformer, to effectively combine the multiple image modalities and text through a trainable fusion process. A preliminary experimental study shows promising results of our methodology on this challenging dataset, with an accuracy of 65.56% on the targeted VQA task. This pioneering work paves the way for the community to a new multi-modal multi-resolution VQA task that can be applied in other imaging domains (such as medical imaging) where multi-modality can enrich the visual representation of a scene. The dataset and code are available at this https URL.
zh
[CV-48] Expanding Zero-Shot Object Counting with Rich Prompts
【速读】:该论文旨在解决预训练零样本计数模型在扩展至未见类别时面临的文本与视觉特征对齐不足的问题,这一问题使得单纯增加新提示无法实现准确计数。论文提出的解决方案——RichCount,其关键在于采用两阶段训练策略,通过增强文本编码和强化模型与图像中物体的关联性,实现对未见类别的有效零样本计数。具体而言,RichCount通过前馈网络和基于文本-图像相似性的适配器丰富文本特征,构建稳健的对齐表示,并将优化后的编码器应用于计数任务,从而提升模型在多样化提示和复杂图像中的泛化能力。
链接: https://arxiv.org/abs/2505.15398
作者: Huilin Zhu,Senyao Li,Jingling Yuan,Zhengwei Yang,Yu Guo,Wenxuan Liu,Xian Zhong,Shengfeng He
机构: Wuhan University of Technology (武汉理工大学); Singapore Management University (新加坡管理大学); Wuhan University (武汉大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Expanding pre-trained zero-shot counting models to handle unseen categories requires more than simply adding new prompts, as this approach does not achieve the necessary alignment between text and visual features for accurate counting. We introduce RichCount, the first framework to address these limitations, employing a two-stage training strategy that enhances text encoding and strengthens the model’s association with objects in images. RichCount improves zero-shot counting for unseen categories through two key objectives: (1) enriching text features with a feed-forward network and adapter trained on text-image similarity, thereby creating robust, aligned representations; and (2) applying this refined encoder to counting tasks, enabling effective generalization across diverse prompts and complex images. In this manner, RichCount goes beyond simple prompt expansion to establish meaningful feature alignment that supports accurate counting across novel categories. Extensive experiments on three benchmark datasets demonstrate the effectiveness of RichCount, achieving state-of-the-art performance in zero-shot counting and significantly enhancing generalization to unseen categories in open-world scenarios.
zh
[CV-49] EVA: Expressive Virtual Avatars from Multi-view Videos SIGGRAPH2025
【速读】:该论文旨在解决现有方法在人类虚拟角色建模中无法提供完整、真实且富有表现力的控制问题,其核心挑战在于面部表情与身体动作的表示存在耦合。解决方案的关键在于提出一种名为EVA(Expressive Virtual Avatars)的演员特定、完全可控且富有表现力的人类虚拟角色框架,该框架采用两层模型结构:表达性模板几何层和3D高斯外观层。通过设计一种从多视角视频中准确恢复身体运动、面部表情及非刚性形变参数的级联优化跟踪算法,并引入一种解耦的3D高斯外观模型,实现对身体和面部外观的有效分离,从而实现对面部表情、身体动作和手部手势的独立控制。
链接: https://arxiv.org/abs/2505.15385
作者: Hendrik Junkawitsch,Guoxing Sun,Heming Zhu,Christian Theobalt,Marc Habermann
机构: Max Planck Institute for Informatics (马克斯·普朗克信息学研究所); Saarland Informatics Campus (萨尔兰信息学校园); Saarbrücken Research Center for Visual Computing, Interaction and AI (萨尔布吕肯视觉计算、交互与人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted at SIGGRAPH 2025 Conference Track, Project page: this https URL
点击查看摘要
Abstract:With recent advancements in neural rendering and motion capture algorithms, remarkable progress has been made in photorealistic human avatar modeling, unlocking immense potential for applications in virtual reality, augmented reality, remote communication, and industries such as gaming, film, and medicine. However, existing methods fail to provide complete, faithful, and expressive control over human avatars due to their entangled representation of facial expressions and body movements. In this work, we introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework that achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. Specifically, our approach designs the human avatar as a two-layer model: an expressive template geometry layer and a 3D Gaussian appearance layer. First, we present an expressive template tracking algorithm that leverages coarse-to-fine optimization to accurately recover body motions, facial expressions, and non-rigid deformation parameters from multi-view videos. Next, we propose a novel decoupled 3D Gaussian appearance model designed to effectively disentangle body and facial appearance. Unlike unified Gaussian estimation approaches, our method employs two specialized and independent modules to model the body and face separately. Experimental results demonstrate that EVA surpasses state-of-the-art methods in terms of rendering quality and expressiveness, validating its effectiveness in creating full-body avatars. This work represents a significant advancement towards fully drivable digital human models, enabling the creation of lifelike digital avatars that faithfully replicate human geometry and appearance.
[CV-50] The P3 dataset: Pixels, Points and Polygons for Multimodal Building Vectorization
Quick read: This paper targets building vectorization, i.e., automatically extracting 2D building outlines from multimodal data. The key contribution is P^3, a large-scale multimodal benchmark dataset that fuses airborne LiDAR point clouds, high-resolution aerial imagery, and vectorized 2D building outlines, providing high-precision 3D and image information. Experiments show that LiDAR point clouds are a robust modality for predicting building polygons in both hybrid and end-to-end learning frameworks, and that fusing LiDAR with imagery further improves the accuracy and geometric quality of the predicted polygons.
Link: https://arxiv.org/abs/2505.15379
Authors: Raphael Sulzer,Liuyun Duan,Nicolas Girard,Florent Lafarge
Affiliations: LuxCarta Technology; Centre Inria d'Université Côte d'Azur
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present the P^3 dataset, a large-scale multimodal benchmark for building vectorization, constructed from aerial LiDAR point clouds, high-resolution aerial imagery, and vectorized 2D building outlines, collected across three continents. The dataset contains over 10 billion LiDAR points with decimeter-level accuracy and RGB images at a ground sampling distance of 25 centimeters. While many existing datasets primarily focus on the image modality, P^3 offers a complementary perspective by also incorporating dense 3D information. We demonstrate that LiDAR point clouds serve as a robust modality for predicting building polygons, both in hybrid and end-to-end learning frameworks. Moreover, fusing aerial LiDAR and imagery further improves accuracy and geometric quality of predicted polygons. The P^3 dataset is publicly available, along with code and pretrained weights of three state-of-the-art models for building polygon prediction at this https URL.
[CV-51] RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation
Quick read: This paper addresses the limited flexibility of autonomous systems in building open-vocabulary semantic maps during online operation: existing 3D semantic mapping systems cannot efficiently support open-vocabulary semantic mapping. The key is a training-free unified system that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing, enabling real-time 3D map construction, semantic consistency, and natural-language interaction.
Link: https://arxiv.org/abs/2505.15373
Authors: Naman Patel,Prashanth Krishnamurthy,Farshad Khorrami
Affiliations: NYU Tandon School of Engineering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Mapping and understanding complex 3D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build semantic maps with open-vocabulary during online operation. Although recent vision-language models have enabled open-vocabulary object recognition in 2D images, they haven’t yet bridged the gap to 3D spatial understanding. The critical challenge lies in developing a training-free unified system that can simultaneously construct accurate 3D maps while maintaining semantic consistency and supporting natural language interactions in real time. In this paper, we develop a zero-shot framework that seamlessly integrates GPU-accelerated geometric reconstruction with open-vocabulary vision-language models through online instance-level semantic embedding fusion, guided by hierarchical object association with spatial indexing. Our training-free system achieves superior performance through incremental processing and unified geometric-semantic updates, while robustly handling 2D segmentation inconsistencies. The proposed general-purpose 3D scene understanding framework can be used for various tasks including zero-shot 3D instance retrieval, segmentation, and object detection to reason about previously unseen objects and interpret natural language queries. The project page is available at this https URL.
[CV-52] Objective Bicycle Occlusion Level Classification using a Deformable Parts-Based Model
Quick read: This paper addresses the classification of bicycle visibility and occlusion levels for road safety, with cyclists being among the most vulnerable road users. The key is a parts-based detection model and a custom image-detection pipeline that objectively quantify the visibility and occlusion level of bicycle semantic parts, replacing existing subjective methods and improving the accuracy of performance evaluation for occluded-bicycle detection algorithms.
Link: https://arxiv.org/abs/2505.15358
Authors: Angelique Mangubat,Shane Gilroy
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Road safety is a critical challenge, particularly for cyclists, who are among the most vulnerable road users. This study aims to enhance road safety by proposing a novel benchmark for bicycle occlusion level classification using advanced computer vision techniques. Utilizing a parts-based detection model, images are annotated and processed through a custom image detection pipeline. A novel method of bicycle occlusion level is proposed to objectively quantify the visibility and occlusion level of bicycle semantic parts. The findings indicate that the model robustly quantifies the visibility and occlusion level of bicycles, a significant improvement over the subjective methods used by the current state of the art. Widespread use of the proposed methodology will facilitate the accurate performance reporting of cyclist detection algorithms for occluded cyclists, informing the development of more robust vulnerable road user detection methods for autonomous vehicles.
[CV-53] My Face Is Mine Not Yours: Facial Protection Against Diffusion Model Face Swapping
Quick read: This paper addresses the security risks that diffusion-model-driven deepfakes pose through unauthorized and unethical face-image manipulation. Existing defenses mainly target traditional generative architectures (GANs, AEs, VAEs) and cannot handle the unique challenges of diffusion models, while current diffusion-specific adversarial attacks are tied to specific model architectures and weights and thus fail to adapt to the diverse landscape of diffusion-based deepfake implementations; they also typically apply global perturbations that ignore the region-specific nature of facial manipulation in deepfakes. The key of the proposed solution is a proactive defense strategy that protects face images in advance via adversarial attacks tailored to the unique challenges diffusion models pose in deepfakes.
Link: https://arxiv.org/abs/2505.15336
Authors: Hon Ming Yam,Zhongliang Guo,Chun Pong Lau
Affiliations: City University of Hong Kong; University of St Andrews
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The proliferation of diffusion-based deepfake technologies poses significant risks for unauthorized and unethical facial image manipulation. While traditional countermeasures have primarily focused on passive detection methods, this paper introduces a novel proactive defense strategy through adversarial attacks that preemptively protect facial images from being exploited by diffusion-based deepfake systems. Existing adversarial protection methods predominantly target conventional generative architectures (GANs, AEs, VAEs) and fail to address the unique challenges presented by diffusion models, which have become the predominant framework for high-quality facial deepfakes. Current diffusion-specific adversarial approaches are limited by their reliance on specific model architectures and weights, rendering them ineffective against the diverse landscape of diffusion-based deepfake implementations. Additionally, they typically employ global perturbation strategies that inadequately address the region-specific nature of facial manipulation in deepfakes.
[CV-54] Parameter-Efficient Fine-Tuning of Multispectral Foundation Models for Hyperspectral Image Classification
Quick read: This paper addresses how to efficiently fine-tune a multispectral foundation model for hyperspectral image classification (HSIC), given the characteristics of hyperspectral imagery (HSI) and the heavy memory and storage demands of fine-tuning. The key is to explore Parameter-Efficient Fine-Tuning (PEFT) methods, including LoRA, KronA, LoKr, and the recent LoRA+, and to propose KronA+, which applies a LoRA+-like mechanism to the Kronecker matrices, substantially reducing the number of trainable parameters and the storage overhead while maintaining strong classification performance.
Link: https://arxiv.org/abs/2505.15334
Authors: Bernardin Ligan,Khalide Jbilou,Fahd Kalloubi,Ahmed Ratnani
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 14 figures
Abstract:Foundation models have achieved great success across diverse domains, including remote sensing (RS), thanks to their versatility and strong generalization abilities. However, most RS foundation models are designed for multispectral data, while hyperspectral imagery (HSI) - with its hundreds of spectral bands - remains less explored. Fine-tuning such models for downstream tasks is also challenging, often demanding considerable memory and storage. In this paper, we propose an efficient framework to fine-tune SpectralGPT, a multispectral foundation model, for hyperspectral image classification (HSIC). We explore several Parameter-Efficient Fine-Tuning (PEFT) methods, including Low-Rank Adaptation (LoRA), Kronecker-based adaptation (KronA), Low-Rank Kronecker (LoKr), and the recent LoRA+, which uses distinct learning rates for low-rank adapters scaled by a factor lambda. Inspired by LoRA+, we introduce KronA+, which applies a similar mechanism to the Kronecker matrices. We evaluate our approach on five datasets from different sensors, showing competitive performance with state-of-the-art HSI models. Our full fine-tuning (FFT) setup for SpectralGPT even outperforms a dedicated hyperspectral foundation model on some datasets while requiring only a quarter of the training epochs. Under the same number of epochs, KronA+ reaches similar performance with far fewer trainable parameters - just 0.056 percent - and adds only approximately 0.2 megabytes of storage, making it the most effective PEFT method tested.
[CV-55] Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models
Quick read: This paper addresses the accuracy and interpretability of morphing attack detection (MAD) in real biometric applications, especially against morphs produced by generative AI. The key is to apply multimodal large language models (MLLMs) to differential morphing attack detection (D-MAD) and to use Chain-of-Thought (CoT) prompt engineering to improve the reliability and explainability of the models' decisions.
Link: https://arxiv.org/abs/2505.15332
Authors: Ria Shekhawat,Hailin Li,Raghavendra Ramachandra,Sushma Venkatesh
Affiliations: Norwegian University of Science and Technology (NTNU); MOBAI AS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE International Conference on Automatic Face and Gesture Recognition (FG 2025)
Abstract:Leveraging the power of multimodal large language models (LLMs) offers a promising approach to enhancing the accuracy and interpretability of morphing attack detection (MAD), especially in real-world biometric applications. This work introduces the use of LLMs for differential morphing attack detection (D-MAD). To the best of our knowledge, this is the first study to employ multimodal LLMs to D-MAD using real biometric data. To effectively utilize these models, we design Chain-of-Thought (CoT)-based prompts to reduce failure-to-answer rates and enhance the reasoning behind decisions. Our contributions include: (1) the first application of multimodal LLMs for D-MAD using real data subjects, (2) CoT-based prompt engineering to improve response reliability and explainability, (3) comprehensive qualitative and quantitative benchmarking of LLM performance using data from 54 individuals captured in passport enrollment scenarios, and (4) comparative analysis of two multimodal LLMs: ChatGPT-4o and Gemini providing insights into their morphing attack detection accuracy and decision transparency. Experimental results show that ChatGPT-4o outperforms Gemini in detection accuracy, especially against GAN-based morphs, though both models struggle under challenging conditions. While Gemini offers more consistent explanations, ChatGPT-4o is more resilient but prone to a higher failure-to-answer rate.
[CV-56] SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition
Quick read: This paper addresses two problems in visual recognition: conventional self-attention fails to capture the high-order associations inherent in real-world scenes and incurs redundant computation, while existing hypergraph neural networks are limited by static, binary hyperedge assignments. The key is Soft Hypergraph Neural Networks (SoftHGNN), which introduce soft hyperedges: each vertex is associated with hyperedges through continuous participation weights rather than hard binary assignments, enabling dynamic, differentiable high-order semantic modeling, combined with a sparse hyperedge selection mechanism and a load-balancing regularizer for efficiency.
Link: https://arxiv.org/abs/2505.15325
Authors: Mengqi Lei,Yihong Wu,Siqi Li,Xinhu Zheng,Juan Wang,Yue Gao,Shaoyi Du
Affiliations: BNRist, THUIBCS, BLBCI, School of Software, Tsinghua University; Taiyuan University of Technology; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Department of Ultrasound, the Second Affiliated Hospital of Xi'an Jiaotong University; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence; National Engineering Research Center for Visual Information and Applications; Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual recognition relies on understanding both the semantics of image tokens and the complex interactions among them. Mainstream self-attention methods, while effective at modeling global pair-wise relations, fail to capture high-order associations inherent in real-world scenes and often suffer from redundant computation. Hypergraphs extend conventional graphs by modeling high-order interactions and offer a promising framework for addressing these limitations. However, existing hypergraph neural networks typically rely on static and hard hyperedge assignments, leading to excessive and redundant hyperedges with hard binary vertex memberships that overlook the continuity of visual semantics. To overcome these issues, we present Soft Hypergraph Neural Networks (SoftHGNNs), which extend the methodology of hypergraph computation, to make it truly efficient and versatile in visual recognition tasks. Our framework introduces the concept of soft hyperedges, where each vertex is associated with hyperedges via continuous participation weights rather than hard binary assignments. This dynamic and differentiable association is achieved by using the learnable hyperedge prototype. Through similarity measurements between token features and the prototype, the model generates semantically rich soft hyperedges. SoftHGNN then aggregates messages over soft hyperedges to capture high-order semantics. To further enhance efficiency when scaling up the number of soft hyperedges, we incorporate a sparse hyperedge selection mechanism that activates only the top-k important hyperedges, along with a load-balancing regularizer to ensure balanced hyperedge utilization. Experimental results across three tasks on five datasets demonstrate that SoftHGNN efficiently captures high-order associations in visual scenes, achieving significant performance improvements.
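The soft-hyperedge idea is easy to make concrete. Below is a minimal PyTorch sketch of one soft-hyperedge layer: participation weights come from similarity between token features and learnable hyperedge prototypes, only the top-k hyperedges stay active, and messages are aggregated over the soft assignments. All shapes, the top-k rule, and the residual update are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftHyperedgeLayer(nn.Module):
    """Minimal sketch of soft-hyperedge message passing (assumed design)."""
    def __init__(self, dim: int, num_hyperedges: int = 64, top_k: int = 16):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_hyperedges, dim) * 0.02)
        self.top_k = top_k
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        # Continuous participation weights: token-to-prototype similarity.
        sim = F.normalize(x, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        weights = sim.softmax(dim=-1)                      # (B, N, E)

        # Sparse hyperedge selection: keep only the top-k hyperedges,
        # ranked by their total participation mass.
        edge_mass = weights.sum(dim=1)                     # (B, E)
        topk = edge_mass.topk(self.top_k, dim=-1).indices  # (B, k)
        mask = torch.zeros_like(edge_mass).scatter_(1, topk, 1.0)
        weights = weights * mask.unsqueeze(1)

        # Vertex -> hyperedge aggregation, then hyperedge -> vertex scatter.
        edge_feat = torch.einsum("bne,bnd->bed", weights, x)
        edge_feat = edge_feat / (weights.sum(dim=1).unsqueeze(-1) + 1e-6)
        out = torch.einsum("bne,bed->bnd", weights, edge_feat)
        return x + self.proj(out)                          # residual update
```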
[CV-57] CEBSNet: Change-Excited and Background-Suppressed Network with Temporal Dependency Modeling for Bitemporal Change Detection
Quick read: This paper addresses change detection under illumination variation, seasonal change, background interference, and differing shooting angles, especially for image pairs with a large time gap, where existing methods often ignore temporal dependencies and overemphasize prominent changes while missing subtle but equally important ones. The key is CEBSNet: a Channel Swap Module (CSM) models temporal dependency to reduce differences and noise, a Feature Excitation and Suppression Module (FESM) captures both obvious and subtle changes while preserving the integrity of change regions, and a Pyramid-Aware Spatial-Channel Attention module (PASCA) improves detection of change regions at different sizes and focuses on critical regions.
Link: https://arxiv.org/abs/2505.15322
Authors: Qi'ao Xu,Yan Xing,Jiali Hu,Yunan Jia,Rui Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Change detection, a critical task in remote sensing and computer vision, aims to identify pixel-level differences between image pairs captured at the same geographic area but different times. It faces numerous challenges such as illumination variation, seasonal changes, background interference, and shooting angles, especially with a large time gap between images. While current methods have advanced, they often overlook temporal dependencies and overemphasize prominent changes while ignoring subtle but equally important changes. To address these limitations, we introduce \textbfCEBSNet, a novel change-excited and background-suppressed network with temporal dependency modeling for change detection. During the feature extraction, we utilize a simple Channel Swap Module (CSM) to model temporal dependency, reducing differences and noise. The Feature Excitation and Suppression Module (FESM) is developed to capture both obvious and subtle changes, maintaining the integrity of change regions. Additionally, we design a Pyramid-Aware Spatial-Channel Attention module (PASCA) to enhance the ability to detect change regions at different sizes and focus on critical regions. We conduct extensive experiments on three common street view datasets and two remote sensing datasets, and our method achieves the state-of-the-art performance.
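As a rough illustration of the Channel Swap Module described above, the sketch below exchanges a fraction of channels between the two temporal feature maps so that each branch mixes in the other date's statistics before differencing. The swap ratio and swap layout are assumptions; the abstract does not specify them.

```python
import torch
import torch.nn as nn

class ChannelSwapModule(nn.Module):
    """Sketch of a CSM for bitemporal features (ratio/layout assumed)."""
    def __init__(self, swap_ratio: float = 0.5):
        super().__init__()
        self.swap_ratio = swap_ratio

    def forward(self, f_t0: torch.Tensor, f_t1: torch.Tensor):
        # f_t0, f_t1: (batch, channels, H, W) features of the two dates.
        c = f_t0.shape[1]
        k = int(c * self.swap_ratio)
        # Exchange the first k channels between the two temporal features,
        # so each branch "sees" part of the other date's statistics.
        swapped_t0 = torch.cat([f_t1[:, :k], f_t0[:, k:]], dim=1)
        swapped_t1 = torch.cat([f_t0[:, :k], f_t1[:, k:]], dim=1)
        return swapped_t0, swapped_t1

# Usage: a change map can then be derived from the swapped features,
# e.g. diff = (swapped_t0 - swapped_t1).abs()
```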
[CV-58] FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose Expression and Emotion
Quick read: This paper addresses precise control over non-identity attributes (pose, expression, emotion) in face generation while keeping identity stable. The key is a novel identity-conditional diffusion model with two lightweight control modules that independently manipulate facial pose, expression, and emotion without compromising identity preservation. These modules are embedded in the cross-attention layers of the base diffusion model, enabling precise attribute control with minimal parameter overhead, and a tailored training strategy encourages orthogonality between identity features and control signals, improving controllability and generative diversity.
Link: https://arxiv.org/abs/2505.15313
Authors: Kazuaki Mishima,Antoni Bigata Casademunt,Stavros Petridis,Maja Pantic,Kenji Suzuki
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages (excluding references), 3 figures, 5 tables
Abstract:Human facial images encode a rich spectrum of information, encompassing both stable identity-related traits and mutable attributes such as pose, expression, and emotion. While recent advances in image generation have enabled high-quality identity-conditional face synthesis, precise control over non-identity attributes remains challenging, and disentangling identity from these mutable factors is particularly difficult. To address these limitations, we propose a novel identity-conditional diffusion model that introduces two lightweight control modules designed to independently manipulate facial pose, expression, and emotion without compromising identity preservation. These modules are embedded within the cross-attention layers of the base diffusion model, enabling precise attribute control with minimal parameter overhead. Furthermore, our tailored training strategy, which leverages cross-attention between the identity feature and each non-identity control feature, encourages identity features to remain orthogonal to control signals, enhancing controllability and diversity. Quantitative and qualitative evaluations, along with perceptual user studies, demonstrate that our method surpasses existing approaches in terms of control accuracy over pose, expression, and emotion, while also improving generative diversity under identity-only conditioning.
[CV-59] BadSR: Stealthy Label Backdoor Attacks on Image Super-Resolution
Quick read: This paper addresses the poor stealthiness of poisoned high-resolution (HR) images in backdoor attacks on super-resolution (SR) models, which makes anomalous data easy for users to detect. The key of BadSR is to approximate the clean HR image and the predefined target image in feature space while keeping the modification of the clean HR image within a constrained range, thereby improving the stealthiness of the poisoned HR images.
Link: https://arxiv.org/abs/2505.15308
Authors: Ji Guo,Xiaolei Wen,Wenbo Jiang,Cheng Huang,Jinjin Li,Hongwei Li
Affiliations: University of Electronic Science and Technology of China; Xinjiang University; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the widespread application of super-resolution (SR) in various fields, researchers have begun to investigate its security. Previous studies have demonstrated that SR models can also be subjected to backdoor attacks through data poisoning, affecting downstream tasks. A backdoor SR model generates an attacker-predefined target image when given a triggered image while producing a normal high-resolution (HR) output for clean images. However, prior backdoor attacks on SR models have primarily focused on the stealthiness of poisoned low-resolution (LR) images while ignoring the stealthiness of poisoned HR images, making it easy for users to detect anomalous data. To address this problem, we propose BadSR, which improves the stealthiness of poisoned HR images. The key idea of BadSR is to approximate the clean HR image and the pre-defined target image in the feature space while ensuring that modifications to the clean HR image remain within a constrained range. The poisoned HR images generated by BadSR can be integrated with existing triggers. To further improve the effectiveness of BadSR, we design an adversarially optimized trigger and a backdoor gradient-driven poisoned sample selection method based on a genetic algorithm. The experimental results show that BadSR achieves a high attack success rate in various models and data sets, significantly affecting downstream tasks.
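The core recipe, approximating the target image in feature space while keeping the pixel-level change bounded, can be sketched as a small optimization loop. Everything below (the feature extractor, the epsilon ball, the Adam settings) is an assumption for illustration, not BadSR's exact procedure, which additionally uses an adversarially optimized trigger and genetic sample selection.

```python
import torch
import torch.nn.functional as F

def craft_poisoned_hr(clean_hr, target_img, feat_extractor,
                      epsilon=8 / 255, steps=100, lr=0.01):
    """Sketch of feature-space HR poisoning under a pixel constraint."""
    delta = torch.zeros_like(clean_hr, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        target_feat = feat_extractor(target_img)

    for _ in range(steps):
        poisoned = (clean_hr + delta).clamp(0, 1)
        # Pull the poisoned image toward the target in feature space.
        loss = F.mse_loss(feat_extractor(poisoned), target_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Keep the modification within a constrained range for stealth.
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)

    return (clean_hr + delta).detach().clamp(0, 1)
```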
[CV-60] R3GS: Gaussian Splatting for Robust Reconstruction and Relocalization in Unconstrained Image Collections
Quick read: This paper addresses robust 3D reconstruction and relocalization on unconstrained datasets. The key is a hybrid representation that combines global CNN features with local features encoded by multiresolution hash grids, with shallow MLPs predicting the attributes of each Gaussian; a fine-tuned lightweight human-detection network produces visibility maps that generalize to other transient objects; a sky-handling technique constrained by a depth prior reduces floaters caused by sky reconstruction errors; and a relocalization method robust to lighting changes estimates the camera pose within the reconstructed scene. Together these significantly improve rendering fidelity, training and rendering efficiency, and storage requirements.
Link: https://arxiv.org/abs/2505.15294
Authors: Xu yan,Zhaohui Wang,Rong Wei,Jingbo Yu,Dong Li,Xiangde Liu
Affiliations: Beijing Digital Native Digital City Research Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 7 pages, 4 figures
Abstract:We propose R3GS, a robust reconstruction and relocalization framework tailored for unconstrained datasets. Our method uses a hybrid representation during training. Each anchor combines a global feature from a convolutional neural network (CNN) with a local feature encoded by the multiresolution hash grids [2]. Subsequently, several shallow multi-layer perceptrons (MLPs) predict the attributes of each Gaussian, including color, opacity, and covariance. To mitigate the adverse effects of transient objects on the reconstruction process, we fine-tune a lightweight human detection network. Once fine-tuned, this network generates a visibility map that efficiently generalizes to other transient objects (such as posters, banners, and cars) with minimal need for further adaptation. Additionally, to address the challenges posed by sky regions in outdoor scenes, we propose an effective sky-handling technique that incorporates a depth prior as a constraint. This allows the infinitely distant sky to be represented on the surface of a large-radius sky sphere, significantly reducing floaters caused by errors in sky reconstruction. Furthermore, we introduce a novel relocalization method that remains robust to changes in lighting conditions while estimating the camera pose of a given image within the reconstructed 3DGS scene. As a result, R3GS significantly enhances rendering fidelity, improves both training and rendering efficiency, and reduces storage requirements. Our method achieves state-of-the-art performance compared to baseline methods on in-the-wild datasets. The code will be made open-source following the acceptance of the paper.
[CV-61] GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation
Quick read: This paper addresses the limitations of existing event datasets in viewpoint diversity, geometric consistency, and hardware cost, which constrain the fidelity and generalization of event-vision tasks. The key is to first reconstruct photorealistic static scenes with 3D Gaussian Splatting and then apply a physically-informed event simulation pipeline that combines adaptive trajectory interpolation with physically consistent event contrast-threshold modeling, yielding temporally dense and geometrically consistent event streams under diverse motion and lighting conditions.
Link: https://arxiv.org/abs/2505.15287
Authors: Yuchen Li,Chaoran Feng,Zhenyu Tang,Kaiyuan Deng,Wangbo Yu,Yonghong Tian,Li Yuan
Affiliations: Peking University; Clemson University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures. More details at this http URL
Abstract:We introduce GS2E (Gaussian Splatting to Event), a large-scale synthetic event dataset for high-fidelity event vision tasks, captured from real-world sparse multi-view RGB images. Existing event datasets are often synthesized from dense RGB videos, which typically lack viewpoint diversity and geometric consistency, or depend on expensive, difficult-to-scale hardware setups. GS2E overcomes these limitations by first reconstructing photorealistic static scenes using 3D Gaussian Splatting, and subsequently employing a novel, physically-informed event simulation pipeline. This pipeline generally integrates adaptive trajectory interpolation with physically-consistent event contrast threshold modeling. Such an approach yields temporally dense and geometrically consistent event streams under diverse motion and lighting conditions, while ensuring strong alignment with underlying scene structures. Experimental results on event-based 3D reconstruction demonstrate GS2E’s superior generalization capabilities and its practical value as a benchmark for advancing event vision research.
[CV-62] Kernel PCA for Out-of-Distribution Detection: Non-Linear Kernel Selections and Approximations NEURIPS'24
Quick read: This paper addresses out-of-distribution (OoD) detection in deep neural networks, whose core lies in effectively characterizing the disparities between OoD and in-distribution (InD) data. The key is to capture this disparity through a non-linear feature subspace: a discriminative non-linear subspace is learned from InD features to capture representative InD patterns, while the informative patterns of OoD features cannot be well represented in it. The deviations of InD and OoD features in this subspace are then exploited for detection, realized through a Kernel Principal Component Analysis (KPCA) framework in which the reconstruction error on the subspace distinguishes InD from OoD data.
Link: https://arxiv.org/abs/2505.15284
Authors: Kun Fang,Qinghua Tao,Mingzhen He,Kexin Lv,Runze Yang,Haibo Hu,Xiaolin Huang,Jie Yang,Longbin Cao
Affiliations: Shanghai Jiao Tong University; The Hong Kong Polytechnic University; Beijing Institute of Technology; China Mobile (Shanghai) Information and Communication Technology Co., Ltd.; Macquarie University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: This study is an extension of its conference version published in NeurIPS'24, see this https URL
Abstract:Out-of-Distribution (OoD) detection is vital for the reliability of deep neural networks, the key of which lies in effectively characterizing the disparities between OoD and In-Distribution (InD) data. In this work, such disparities are exploited through a fresh perspective of non-linear feature subspace. That is, a discriminative non-linear subspace is learned from InD features to capture representative patterns of InD, while informative patterns of OoD features cannot be well captured in such a subspace due to their different distribution. Grounded on this perspective, we exploit the deviations of InD and OoD features in such a non-linear subspace for effective OoD detection. To be specific, we leverage the framework of Kernel Principal Component Analysis (KPCA) to attain the discriminative non-linear subspace and deploy the reconstruction error on such subspace to distinguish InD and OoD data. Two challenges emerge: (i) the learning of an effective non-linear subspace, i.e., the selection of kernel function in KPCA, and (ii) the computation of the kernel matrix with large-scale InD data. For the former, we reveal two vital non-linear patterns that closely relate to the InD-OoD disparity, leading to the establishment of a Cosine-Gaussian kernel for constructing the subspace. For the latter, we introduce two techniques to approximate the Cosine-Gaussian kernel with significantly cheap computations. In particular, our approximation is further tailored by incorporating the InD data confidence, which is demonstrated to promote the learning of discriminative subspaces for OoD data. Our study presents new insights into the non-linear feature subspace for OoD detection and contributes practical explorations on the associated kernel design and efficient computations, yielding a KPCA detection method with distinctively improved efficacy and efficiency.
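A minimal version of the KPCA detector is straightforward to write down: L2-normalize features (the cosine part), fit an RBF kernel PCA on InD features (the Gaussian part), and score test points by reconstruction error. The sklearn-based sketch below omits the paper's kernel approximations and confidence-tailored weighting; the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_ood_scores(ind_feats, test_feats, n_components=256, gamma=1.0):
    """Sketch of KPCA-based OoD scoring with a Cosine-Gaussian kernel."""
    def l2n(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    ind, test = l2n(ind_feats), l2n(test_feats)
    kpca = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    kpca.fit(ind)

    # Reconstruction error: project onto the InD subspace, map back, and
    # measure the deviation. OoD samples reconstruct poorly.
    recon = kpca.inverse_transform(kpca.transform(test))
    return np.linalg.norm(test - recon, axis=1)  # higher = more OoD-like
```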
[CV-63] DiffProb: Data Pruning for Face Recognition
Quick read: This paper addresses the high computational cost, storage pressure, and privacy risks caused by face recognition's reliance on large-scale annotated datasets. The key is DiffProb, which evaluates the prediction probabilities of training samples within each identity and prunes those with identical or near-identical values, since such samples likely reinforce the same decision boundaries and contribute little new information; an auxiliary cleaning mechanism further removes mislabeled and label-flipped samples, improving data quality while preserving model performance.
Link: https://arxiv.org/abs/2505.15272
Authors: Eduarda Caldeira,Jan Niklas Kolf,Naser Damer,Fadi Boutros
Affiliations: Fraunhofer IGD; TU Darmstadt
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2025
Abstract:Face recognition models have made substantial progress due to advances in deep learning and the availability of large-scale datasets. However, reliance on massive annotated datasets introduces challenges related to training computational cost and data storage, as well as potential privacy concerns regarding managing large face datasets. This paper presents DiffProb, the first data pruning approach for the application of face recognition. DiffProb assesses the prediction probabilities of training samples within each identity and prunes the ones with identical or close prediction probability values, as they are likely reinforcing the same decision boundaries, and thus contribute minimally with new information. We further enhance this process with an auxiliary cleaning mechanism to eliminate mislabeled and label-flipped samples, boosting data quality with minimal loss. Extensive experiments on CASIA-WebFace with different pruning ratios and multiple benchmarks, including LFW, CFP-FP, and IJB-C, demonstrate that DiffProb can prune up to 50% of the dataset while maintaining or even, in some settings, improving the verification accuracies. Additionally, we demonstrate DiffProb’s robustness across different architectures and loss functions. Our method significantly reduces training cost and data volume, enabling efficient face recognition training and reducing the reliance on massive datasets and their demanding management.
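The pruning rule itself is simple: within each identity, sort samples by their ground-truth prediction probability and keep one representative per near-identical probability value. The sketch below assumes a scalar probability per sample and an illustrative threshold tau; the auxiliary label-cleaning step is omitted.

```python
import numpy as np

def diffprob_prune(probs, identities, tau=0.01):
    """Sketch of DiffProb-style pruning (threshold tau is an assumption)."""
    keep = []
    for ident in np.unique(identities):
        idx = np.where(identities == ident)[0]
        order = idx[np.argsort(probs[idx])]
        last_kept = None
        for i in order:
            # Keep a sample only if its probability differs enough from
            # the previously kept one within this identity.
            if last_kept is None or probs[i] - probs[last_kept] > tau:
                keep.append(i)
                last_kept = i
    return np.array(keep)

# Usage sketch:
# probs = predicted probability of the correct class per training sample
# kept = diffprob_prune(probs, identities, tau=0.01)
```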
[CV-64] Scaling Diffusion Transformers Efficiently via μP
Quick read: This paper addresses the scalability problem of diffusion Transformers caused by the high cost of hyperparameter (HP) tuning at large scale. The key is to generalize Maximal Update Parametrization (μP) from vanilla Transformers to diffusion Transformers and to validate it experimentally: the μP of mainstream diffusion Transformer models is proven to coincide with that of the vanilla Transformer, so existing μP methodology applies directly, enabling stable HP transfer, drastically lower tuning cost, and more efficient training.
Link: https://arxiv.org/abs/2505.15270
Authors: Chenyu Zheng,Xinyu Zhang,Rongzhen Wang,Wei Huang,Zhi Tian,Weilin Huang,Jun Zhu,Chongxuan Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 10 figures, 15 tables
Abstract:Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization (μP) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether μP of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard μP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that μP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing μP methodologies. Leveraging this result, we systematically demonstrate that DiT-μP enjoys robust HP transferability. Notably, DiT-XL-2-μP with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of μP on text-to-image generation by scaling PixArt-α from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under μP outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-α and 3% of consumption by human experts for MMDiT-18B. These results establish μP as a principled and efficient framework for scaling diffusion Transformers.
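In practice, the main payoff of μP is that a learning rate tuned on a narrow proxy transfers to the wide model. A simplified sketch of the usual Adam rule, scale the rate of matrix-like hidden weights by base_width/width while vector-like parameters keep the base rate, is shown below; full μP also adjusts initializations and output multipliers, which this sketch leaves out.

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_width: int, width: int, base_lr: float):
    """Simplified muP-style learning-rate scaling for width transfer.

    Hidden (ndim >= 2) weights get base_lr * base_width / width under Adam;
    vector-like parameters (biases, norms) keep base_lr. This grouping rule
    is a simplification, not the paper's full parametrization.
    """
    mult = base_width / width
    hidden, vector = [], []
    for _, p in model.named_parameters():
        (hidden if p.ndim >= 2 else vector).append(p)
    return [
        {"params": hidden, "lr": base_lr * mult},
        {"params": vector, "lr": base_lr},
    ]

# Usage: tune base_lr on a narrow proxy, then reuse it when scaling up, e.g.
# opt = torch.optim.AdamW(mup_param_groups(model, 256, 1024, base_lr=3e-4))
```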
[CV-65] LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Quick read: This paper addresses the memory usage and response speed of existing Video LLMs on real-time video streams, which matter for applications such as Deepseek services, autonomous driving, and robotics. The key is LiveVLM, a training-free framework that builds a streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs, ensuring prompt responses to user queries. LiveVLM generates and compresses video key-value tensors (video KVs) for memory efficiency and, when a new question arrives, efficiently fetches both short-term and long-term visual information while minimizing interference from redundant context.
Link: https://arxiv.org/abs/2505.15269
Authors: Zhenyu Ning,Guangda Liu,Qihao Jin,Wenchao Ding,Minyi Guo,Jieru Zhao
Affiliations: Shanghai Jiao Tong University; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting memory usage and response speed that are essential in various real-world applications, such as Deepseek services, autonomous driving, and robotics. To mitigate these challenges, we propose LiveVLM, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after one question is posed, LiveVLM constructs an innovative streaming-oriented KV cache to process video streams in real-time, retain long-term video details and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to reserve visual information while improving memory efficiency. Furthermore, when a new question is proposed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information, while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44× the number of frames on the same device, and achieves up to 5× speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
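To make the streaming-oriented KV cache concrete, here is a toy cache that appends incoming video KVs and, when over budget, drops the entries most similar to their neighbors first. The eviction heuristic is an assumption for illustration; LiveVLM's actual compression and retrieval are more involved.

```python
import torch

class StreamingKVCache:
    """Toy streaming KV cache; the eviction rule is an assumption."""
    def __init__(self, max_tokens: int = 4096):
        self.max_tokens = max_tokens
        self.keys, self.values = None, None

    def append(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (num_new_tokens, dim) video KVs for the latest frames.
        self.keys = k if self.keys is None else torch.cat([self.keys, k])
        self.values = v if self.values is None else torch.cat([self.values, v])
        if self.keys.shape[0] > self.max_tokens:
            self._evict(self.keys.shape[0] - self.max_tokens)

    def _evict(self, n: int):
        k = torch.nn.functional.normalize(self.keys, dim=-1)
        # Redundancy score: similarity of each token to its predecessor;
        # the most redundant tokens are dropped first.
        sim = (k[1:] * k[:-1]).sum(-1)
        drop = sim.topk(n).indices + 1          # never drop the first token
        mask = torch.ones(self.keys.shape[0], dtype=torch.bool)
        mask[drop] = False
        self.keys, self.values = self.keys[mask], self.values[mask]
```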
[CV-66] Contrastive Learning-Enhanced Trajectory Matching for Small-Scale Dataset Distillation
Quick read: This paper addresses how to distill large datasets into much smaller yet informative synthetic datasets for deployment in resource-constrained environments. Existing dataset distillation techniques, especially trajectory matching, work on medium-scale synthetic sets but fail to preserve semantic richness under extreme sample scarcity. The key is to incorporate contrastive learning during image synthesis: by explicitly maximizing instance-level feature discrimination, the method produces more informative and diverse synthetic samples, improving model performance on extremely small synthetic datasets.
Link: https://arxiv.org/abs/2505.15267
Authors: Wenmin Li,Shunsuke Sakai,Tatsuhito Hasegawa
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review
Abstract:Deploying machine learning models in resource-constrained environments, such as edge devices or rapid prototyping scenarios, increasingly demands distillation of large datasets into significantly smaller yet informative synthetic datasets. Current dataset distillation techniques, particularly Trajectory Matching methods, optimize synthetic data so that the model’s training trajectory on synthetic samples mirrors that on real data. While demonstrating efficacy on medium-scale synthetic datasets, these methods fail to adequately preserve semantic richness under extreme sample scarcity. To address this limitation, we propose a novel dataset distillation method integrating contrastive learning during image synthesis. By explicitly maximizing instance-level feature discrimination, our approach produces more informative and diverse synthetic samples, even when dataset sizes are significantly constrained. Experimental results demonstrate that incorporating contrastive learning substantially enhances the performance of models trained on very small-scale synthetic datasets. This integration not only guides more effective feature representation but also significantly improves the visual fidelity of the synthesized images. Experimental results demonstrate that our method achieves notable performance improvements over existing distillation techniques, especially in scenarios with extremely limited synthetic data.
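The added contrastive term can be written as a standard InfoNCE loss over two views of the synthetic batch, so that every synthetic image stays discriminable from every other. The sketch below would be combined with the trajectory-matching objective using some weighting, which is an assumption here.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """Sketch of an instance-level InfoNCE term over synthetic samples.

    feats_a / feats_b: features of two augmented views of the same
    synthetic batch; matching rows are positives, all others negatives.
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(a.shape[0], device=a.device)
    # Symmetrized cross-entropy pulls views together, pushes instances apart.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```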
[CV-67] Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs
Quick read: This paper asks which specific semantic concepts cause large vision-language models (LVLMs) to err or hallucinate, thereby providing interpretable information for improving model robustness. The key is a novel semantic evolution framework built on large language models (LLMs) and text-to-image (T2I) models: randomly initialized semantic concepts undergo LLM-based crossover and mutation to form image descriptions, T2I models turn these into visual inputs, and the LVLM's task-specific performance quantifies the sensitivity of the involved semantics, serving as a reward signal to guide further semantic exploration.
Link: https://arxiv.org/abs/2505.15265
Authors: Zihao Pan,Yu Tong,Weibin Wu,Jingyi Wang,Lifeng Chen,Zhe Zhao,Jiajia Wei,Yitong Qiao,Zibin Zheng
Affiliations: School of Software Engineering, Sun Yat-sen University; Wuhan University; Tsinghua Shenzhen International Graduate School, Tsinghua University; Computer Science, Beijing Jiaotong University; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Adversarial attacks aim to generate malicious inputs that mislead deep models, but beyond causing model failure, they cannot provide certain interpretable information such as "What content in inputs makes models more likely to fail?" However, this information is crucial for researchers to specifically improve model robustness. Recent research suggests that models may be particularly sensitive to certain semantics in visual inputs (such as "wet" or "foggy"), making them prone to errors. Inspired by this, in this paper we conducted the first exploration on large vision-language models (LVLMs) and found that LVLMs indeed are susceptible to hallucinations and various errors when facing specific semantic concepts in images. To efficiently search for these sensitive concepts, we integrated large language models (LLMs) and text-to-image (T2I) models to propose a novel semantic evolution framework. Randomly initialized semantic concepts undergo LLM-based crossover and mutation operations to form image descriptions, which are then converted by T2I models into visual inputs for LVLMs. The task-specific performance of LVLMs on each input is quantified as fitness scores for the involved semantics and serves as reward signals to further guide LLMs in exploring concepts that induce errors in LVLMs. Extensive experiments on seven mainstream LVLMs and two multimodal tasks demonstrate the effectiveness of our method. Additionally, we provide interesting findings about the sensitive semantics of LVLMs, aiming to inspire further in-depth research.
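The evolution loop reads naturally as code. In the sketch below, llm, t2i, and lvlm_fitness are hypothetical callables standing in for the LLM, the text-to-image model, and the task-performance scorer; the population sizes, elite fraction, and prompt formats are all illustrative assumptions.

```python
import random

def semantic_evolution(seed_concepts, llm, t2i, lvlm_fitness,
                       generations=20, pop_size=32):
    """Sketch of the semantic evolution loop (all APIs hypothetical)."""
    population = [random.sample(seed_concepts, 3) for _ in range(pop_size)]
    for _ in range(generations):
        scored = []
        for concepts in population:
            caption = llm(f"Write an image description combining: {concepts}")
            image = t2i(caption)
            # Higher fitness = semantics that degrade the LVLM more.
            scored.append((lvlm_fitness(image), concepts))
        scored.sort(reverse=True, key=lambda s: s[0])
        elite = [c for _, c in scored[: pop_size // 4]]
        # Crossover + mutation via the LLM to form the next generation.
        population = elite + [
            llm_mutate(llm, random.choice(elite), random.choice(elite))
            for _ in range(pop_size - len(elite))
        ]
    return scored[:10]  # the most sensitivity-inducing concept sets

def llm_mutate(llm, parent_a, parent_b):
    merged = list(dict.fromkeys(parent_a + parent_b))  # crossover
    prompt = f"Mutate one concept in {merged} to a related visual concept."
    return llm(prompt).split(",")                      # hypothetical format
```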
[CV-68] gen2seg: Generative Models Enable Generalizable Instance Segmentation
Quick read: This paper asks how representations learned by generative AI can be repurposed for general-purpose perceptual organization, specifically instance segmentation. The key is to fine-tune Stable Diffusion and MAE with an "instance coloring loss" on a narrow set of object types (indoor furnishings and cars), which yields zero-shot generalization to object types and styles unseen during fine-tuning. The resulting models approach the heavily supervised SAM on unseen object types and surpass it on fine structures and ambiguous boundaries, suggesting that generative models learn an inherent grouping mechanism that transfers across categories and domains.
Link: https://arxiv.org/abs/2505.15263
Authors: Om Khangaonkar,Hamed Pirsiavash
Affiliations: UC Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL
Abstract:By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE’s ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
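The paper's instance coloring loss is not spelled out in the abstract, but a common instance-embedding form, pull pixels of an instance toward its mean color and push instance means apart, gives the flavor. The pull/push formulation and the margin below are therefore assumptions, not the authors' exact loss.

```python
import torch

def instance_coloring_loss(pred, masks, margin=0.5):
    """Assumed pull/push instance coloring objective.

    pred: (B, C, H, W) per-pixel embeddings ("colors");
    masks: (B, N, H, W) binary instance masks.
    """
    b = pred.shape[0]
    pull, push = 0.0, 0.0
    for i in range(b):
        means = []
        for m in masks[i]:                      # (H, W) binary mask
            if m.sum() == 0:
                continue
            px = pred[i][:, m.bool()]           # (C, num_pixels)
            mu = px.mean(dim=1)
            means.append(mu)
            # Pull: pixels of one instance share a color.
            pull = pull + ((px - mu[:, None]) ** 2).mean()
        if len(means) > 1:
            mus = torch.stack(means)            # (N, C)
            dist = torch.cdist(mus, mus)
            off = dist[~torch.eye(len(mus), dtype=torch.bool)]
            # Push: distinct instances get distinct colors.
            push = push + torch.clamp(margin - off, min=0).mean()
    return pull + push
```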
[CV-69] Zero-Shot Gaze-based Volumetric Medical Image Segmentation CVPR2025
Quick read: This paper addresses how to improve the efficiency and experience of user interaction in interactive 3D medical image segmentation. Traditional methods rely on manually provided prompts such as bounding boxes and mouse clicks; the key here is to introduce eye gaze as a new information modality for interactive segmentation, using real or synthetic gaze data as segmentation prompts for a more efficient mode of human-computer interaction.
Link: https://arxiv.org/abs/2505.15256
Authors: Tatyana Shmykova,Leila Khaertdinova,Ilya Pershin
Affiliations: Research Center for Artificial Intelligence, Innopolis University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to MMFM-BIOMED Workshop @ CVPR 2025
Abstract:Accurate segmentation of anatomical structures in volumetric medical images is crucial for clinical applications, including disease monitoring and cancer treatment planning. Contemporary interactive segmentation models, such as Segment Anything Model 2 (SAM-2) and its medical variant (MedSAM-2), rely on manually provided prompts like bounding boxes and mouse clicks. In this study, we introduce eye gaze as a novel informational modality for interactive segmentation, marking the application of eye-tracking for 3D medical image segmentation. We evaluate the performance of using gaze-based prompts with SAM-2 and MedSAM-2 using both synthetic and real gaze data. Compared to bounding boxes, gaze-based prompts offer a time-efficient interaction approach with slightly lower segmentation quality. Our findings highlight the potential of using gaze as a complementary input modality for interactive 3D medical image segmentation.
[CV-70] VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging
Quick read: This paper addresses the difficulty of training deep neural networks in medical imaging, where labeled data is scarce. The key is to exploit a property unique to medical imaging, the availability of multiple standardized views of the same study, and to perform self-supervised learning on real multi-view pairs. This lets the model learn view-invariant anatomical structures and an implied 3D understanding from 2D projections, yielding better anatomical understanding than purely synthetic augmentations.
Link: https://arxiv.org/abs/2505.15248
Authors: Andre Dourson,Kylie Taylor,Xiaoli Qiao,Michael Fitzke
Affiliations: Mars Petcare
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniques.
[CV-71] GAMA: Disentangled Geometric Alignment with Adaptive Contrastive Perturbation for Reliable Domain Transfer
Quick read: This paper addresses two unresolved issues in geometry-aware domain adaptation methods such as GAMA: (1) insufficient disentanglement of task-relevant and task-irrelevant manifold dimensions, and (2) rigid perturbation schemes that ignore per-class alignment asymmetries. The key of the proposed GAMA++ framework is (i) latent-space disentanglement that isolates label-consistent manifold directions from nuisance factors, and (ii) an adaptive contrastive perturbation strategy that tailors on- and off-manifold exploration to class-specific manifold curvature and alignment discrepancy.
Link: https://arxiv.org/abs/2505.15241
Authors: Kim Yun,Hana Satou,F Monkey
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite progress in geometry-aware domain adaptation, current methods such as GAMA still suffer from two unresolved issues: (1) insufficient disentanglement of task-relevant and task-irrelevant manifold dimensions, and (2) rigid perturbation schemes that ignore per-class alignment asymmetries. To address this, we propose GAMA++, a novel framework that introduces (i) latent space disentanglement to isolate label-consistent manifold directions from nuisance factors, and (ii) an adaptive contrastive perturbation strategy that tailors both on- and off-manifold exploration to class-specific manifold curvature and alignment discrepancy. We further propose a cross-domain contrastive consistency loss that encourages local semantic clusters to align while preserving intra-domain diversity. Our method achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under both standard and few-shot settings, with notable improvements in class-level alignment fidelity and boundary robustness. GAMA++ sets a new standard for semantic geometry alignment in transfer learning.
[CV-72] CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation
Quick read: This paper addresses the challenge that multimodal deepfakes, in which visual and auditory content are manipulated in concert, pose to existing detectors, which rely on single-modality artifacts or cross-modal inconsistencies alone. The key is the Cross-Modal Alignment and Distillation (CAD) framework, which harmoniously integrates the complementary evidence of modality-specific forensic traces (e.g., face-swap artifacts) and modality-shared semantic misalignment (e.g., lip-speech asynchrony) to improve detection performance.
Link: https://arxiv.org/abs/2505.15233
Authors: Yuxuan Du,Zhendong Wang,Yuhao Luo,Caiyong Piao,Zhiyuan Yan,Hao Li,Li Yuan
Affiliations: Peking University, Shenzhen Graduate School; University of Science and Technology of China; Chinese University of Hong Kong, Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid emergence of multimodal deepfakes (visual and auditory content are manipulated in concert) undermines the reliability of existing detectors that rely solely on modality-specific artifacts or cross-modal inconsistencies. In this work, we first demonstrate that modality-specific forensic traces (e.g., face-swap artifacts or spectral distortions) and modality-shared semantic misalignments (e.g., lip-speech asynchrony) offer complementary evidence, and that neglecting either aspect limits detection performance. Existing approaches either naively fuse modality-specific features without reconciling their conflicting characteristics or focus predominantly on semantic misalignment at the expense of modality-specific fine-grained artifact cues. To address these shortcomings, we propose a general multimodal framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two core components: 1) Cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) Cross-modal distillation that mitigates feature conflicts during fusion while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio). Extensive experiments on both multimodal and unimodal (e.g., image-only/video-only) deepfake benchmarks demonstrate that CAD significantly outperforms previous methods, validating the necessity of harmonious integration of multimodal complementary information.
[CV-73] DC-Scene: Data-Centric Learning for 3D Scene Understanding
Quick read: This paper addresses two major obstacles in learning-based 3D scene understanding: the scale and complexity of 3D scenes lead to high computational cost and slow training, and high-quality annotated 3D datasets are far scarcer than in 2D vision. The key of the proposed DC-Scene framework is to improve data quality and training efficiency, centered on a CLIP-driven dual-indicator quality (DIQ) filter and a curriculum scheduler that filter out noisy samples and reduce dependence on large-scale labeled 3D data.
Link: https://arxiv.org/abs/2505.15232
Authors: Ting Huang,Zeyu Zhang,Ruicheng Zhang,Yang Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D scene understanding plays a fundamental role in vision applications such as robotics, autonomous driving, and augmented reality. However, advancing learning-based 3D scene understanding remains challenging due to two key limitations: (1) the large scale and complexity of 3D scenes lead to higher computational costs and slower training compared to 2D counterparts; and (2) high-quality annotated 3D datasets are significantly scarcer than those available for 2D vision. These challenges underscore the need for more efficient learning paradigms. In this work, we propose DC-Scene, a data-centric framework tailored for 3D scene understanding, which emphasizes enhancing data quality and training efficiency. Specifically, we introduce a CLIP-driven dual-indicator quality (DIQ) filter, combining vision-language alignment scores with caption-loss perplexity, along with a curriculum scheduler that progressively expands the training pool from the top 25% to 75% of scene-caption pairs. This strategy filters out noisy samples and significantly reduces dependence on large-scale labeled 3D data. Extensive experiments on ScanRefer and Nr3D demonstrate that DC-Scene achieves state-of-the-art performance (86.1 CIDEr with the top-75% subset vs. 85.4 with the full dataset) while reducing training cost by approximately two-thirds, confirming that a compact set of high-quality samples can outperform exhaustive training. Code will be available at this https URL.
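The DIQ filter plus curriculum can be sketched in a few lines: combine a normalized CLIP alignment score with a (negated) normalized caption perplexity, then keep a growing top fraction of pairs, from 25% up to 75%, as training proceeds. The score combination and the linear schedule are assumptions.

```python
import numpy as np

def diq_filter(clip_sim, caption_ppl, epoch, total_epochs):
    """Sketch of a dual-indicator quality filter with a curriculum.

    clip_sim: per-pair vision-language alignment (higher is better);
    caption_ppl: per-pair caption-loss perplexity (lower is better).
    """
    # Normalize both indicators to [0, 1] and combine into one score.
    sim = (clip_sim - clip_sim.min()) / (clip_sim.max() - clip_sim.min() + 1e-8)
    ppl = (caption_ppl - caption_ppl.min()) / (caption_ppl.max() - caption_ppl.min() + 1e-8)
    quality = sim - ppl

    # Curriculum: keep-ratio grows linearly from 0.25 to 0.75.
    keep_ratio = 0.25 + 0.5 * min(epoch / max(total_epochs - 1, 1), 1.0)
    k = int(len(quality) * keep_ratio)
    return np.argsort(-quality)[:k]  # indices of the current training pool
```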
[CV-74] Continuous Representation Methods, Theories, and Applications: An Overview and Perspectives
Quick read: This paper addresses the limitations of traditional discrete frameworks in data representation and reconstruction tasks such as image restoration, novel view synthesis, and waveform inversion: limited resolution flexibility, weak cross-modal adaptability, insufficient smoothness, and low parameter efficiency. The key is continuous representation methods, which map positional coordinates to corresponding values in a continuous space via function representations, enabling efficient modeling of the intrinsic structure of real-world data. The review covers method designs, theoretical properties (approximation error analysis, convergence, implicit regularization), and applications across several practical domains.
Link: https://arxiv.org/abs/2505.15222
Authors: Yisi Luo,Xile Zhao,Deyu Meng
Affiliations: Xi'an Jiaotong University; University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, continuous representation methods emerge as novel paradigms that characterize the intrinsic structures of real-world data through function representations that map positional coordinates to their corresponding values in the continuous space. As compared with the traditional discrete framework, the continuous framework demonstrates inherent superiority for data representation and reconstruction (e.g., image restoration, novel view synthesis, and waveform inversion) by offering inherent advantages including resolution flexibility, cross-modal adaptability, inherent smoothness, and parameter efficiency. In this review, we systematically examine recent advancements in continuous representation frameworks, focusing on three aspects: (i) Continuous representation method designs such as basis function representation, statistical modeling, tensor function decomposition, and implicit neural representation; (ii) Theoretical foundations of continuous representations such as approximation error analysis, convergence property, and implicit regularization; (iii) Real-world applications of continuous representations derived from computer vision, graphics, bioinformatics, and remote sensing. Furthermore, we outline future directions and perspectives to inspire exploration and deepen insights to facilitate continuous representation methods, theories, and applications. All referenced works are summarized in our open-source repository: this https URL.
[CV-75] Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection
Quick read: This paper addresses the severe feature redundancy of existing CLIP-based detectors of AI-generated images, which limits their generalization. The key is a multimodal conditional bottleneck network with two core components, the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO), which reduce feature redundancy and strengthen the discriminative power of CLIP-extracted features, thereby improving generalization performance.
Link: https://arxiv.org/abs/2505.15217
Authors: Haotian Qin,Dongliang Chang,Yueying Gao,Bingyao Yu,Lei Chen,Zhanyu Ma
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 16 figures
Abstract:Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. To address this issue, incorporating an information bottleneck network into the task presents a straightforward solution. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional bottleneck network to reduce feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model’s generalization ability. We begin with a semantic analysis experiment, where we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as “bias”. Therefore, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global “bias”. Our model achieves exceptional generalization performance on the GenImage dataset and latest generative models. Our code is available at this https URL.
[CV-76] GT2-GS: Geometry-aware Texture Transfer for Gaussian Splatting
Quick read: This paper addresses the problem of effectively transferring 2D textures onto 3D representations: existing methods often ignore the scene's geometric information during transfer, yielding low-quality results. The key of the proposed geometry-aware texture transfer framework GT^2-GS includes a geometry-aware texture augmentation module that expands the texture feature set, a geometry-consistent texture loss that incorporates camera poses and 3D geometric information to optimize texture features, and a geometry-preservation strategy that alternates texture-transfer and geometry-correction stages over multiple iterations to balance texture learning with geometric integrity.
Link: https://arxiv.org/abs/2505.15208
Authors: Wenjie Liu,Zhongliang Liu,Junwei Shu,Changbo Wang,Yang Li
Affiliations: East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 16 figures
Abstract:Transferring 2D textures to 3D modalities is of great significance for improving the efficiency of multimedia content creation. Existing approaches have rarely focused on transferring image textures onto 3D representations. 3D style transfer methods are capable of transferring abstract artistic styles to 3D scenes. However, these methods often overlook the geometric information of the scene, which makes it challenging to achieve high-quality 3D texture transfer results. In this paper, we present GT^2-GS, a geometry-aware texture transfer framework for Gaussian Splatting. From the perspective of matching texture features with geometric information in rendered views, we identify the issue of insufficient texture features and propose a geometry-aware texture augmentation module to expand the texture feature set. Moreover, a geometry-consistent texture loss is proposed to optimize texture features into the scene representation. This loss function incorporates both camera pose and 3D geometric information of the scene, enabling controllable texture-oriented appearance editing. Finally, a geometry preservation strategy is introduced. By alternating between the texture transfer and geometry correction stages over multiple iterations, this strategy achieves a balance between learning texture features and preserving geometric integrity. Extensive experiments demonstrate the effectiveness and controllability of our method. Through geometric awareness, our approach achieves texture transfer results that better align with human visual perception. Our homepage is available at this https URL.
[CV-77] Flashback: Memory-Driven Zero-shot Real-time Video Anomaly Detection
Quick read: This paper addresses the domain dependency and real-time constraints that video anomaly detection (VAD) faces in practice: traditional methods require large amounts of real anomaly data for training and struggle to meet the near-real-time demands of large-scale surveillance. The key of the proposed Flashback is to build a pseudo-scene memory of normal and anomalous captions in an offline stage and, in the online stage, match incoming video segments against it via similarity search, achieving zero-shot, real-time VAD. By eliminating all LLM calls at inference, it significantly improves processing efficiency.
Link: https://arxiv.org/abs/2505.15205
Authors: Hyogun Lee,Haksub Kim,Ig-Jae Kim,Yonghun Choi
Affiliations: Korea Institute of Science and Technology (KIST)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures
Abstract:Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, two fundamental obstacles hinder real-world adoption: domain dependency and real-time constraints, the latter requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies and reasoning in current scenes based on past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU. On two large datasets from real-world surveillance scenarios, UCF-Crime and XD-Violence, we achieve 87.3 AUC (+7.0 pp) and 75.1 AP (+13.1 pp), respectively, outperforming prior zero-shot VAD methods by large margins.
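The recall/respond split maps directly onto a tiny memory class: captions are embedded once offline, and each incoming segment is scored online by its best match against anomalous versus normal entries, with no LLM call at inference. The embedding function and the difference-of-similarities scoring rule are assumptions.

```python
import numpy as np

class PseudoSceneMemory:
    """Sketch of Flashback's recall/respond split (scoring rule assumed)."""
    def __init__(self, embed, normal_captions, anomalous_captions):
        # embed: callable mapping text or a video segment to a unit-norm vector.
        self.embed = embed
        # Offline (Recall): embed the LLM-generated captions once.
        self.normal = np.stack([embed(c) for c in normal_captions])
        self.anomal = np.stack([embed(c) for c in anomalous_captions])

    def score(self, video_segment) -> float:
        # Online (Respond): similarity search only, no LLM call.
        q = self.embed(video_segment)
        s_anomal = float((self.anomal @ q).max())
        s_normal = float((self.normal @ q).max())
        return s_anomal - s_normal  # > 0 suggests an anomaly
```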
[CV-78] Intentional Gesture: Deliver Your Intentions with Gestures for Speech
Quick read: This paper addresses the fact that current co-speech gesture generation relies only on surface linguistic cues (speech audio or text transcripts) and ignores the communicative intention behind human gestures, producing gestures that are rhythmically synchronized with speech but semantically shallow. The key of the Intentional-Gesture framework is to cast gesture generation as an intention-reasoning task grounded in high-level communicative functions, building the InG dataset with gesture-intention annotations and designing an intention-aware gesture motion tokenizer, thereby enabling temporally aligned and semantically rich gesture synthesis.
Link: https://arxiv.org/abs/2505.15197
Authors: Pinxin Liu,Haiyang Liu,Luchuan Song,Chenliang Xu
Affiliations: University of Rochester; University of Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:
Abstract:When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: this https URL
[CV-79] GAMA: Geometry-Aware Manifold Alignment via Structured Adversarial Perturbations for Robust Domain Adaptation
Quick Read: This paper addresses domain adaptation when there is a significant manifold discrepancy between source and target domains. The key to the solution is GAMA (Geometry-Aware Manifold Alignment), which achieves explicit manifold alignment through adversarial perturbations guided by geometric information, systematically combining tangent-space exploration with manifold-constrained adversarial optimization to improve semantic consistency, robustness to off-manifold deviations, and cross-domain alignment.
Link: https://arxiv.org/abs/2505.15194
Authors: Hana Satou, F Monkey
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Domain adaptation remains a challenge when there is significant manifold discrepancy between source and target domains. Although recent methods leverage manifold-aware adversarial perturbations to perform data augmentation, they often neglect precise manifold alignment and systematic exploration of structured perturbations. To address this, we propose GAMA (Geometry-Aware Manifold Alignment), a structured framework that achieves explicit manifold alignment via adversarial perturbation guided by geometric information. GAMA systematically employs tangent space exploration and manifold-constrained adversarial optimization, simultaneously enhancing semantic consistency, robustness to off-manifold deviations, and cross-domain alignment. Theoretical analysis shows that GAMA tightens the generalization bound via structured regularization and explicit alignment. Empirical results on DomainNet, VisDA, and Office-Home demonstrate that GAMA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, exhibiting superior robustness, generalization, and manifold alignment capability.
[CV-80] Leveraging Foundation Models for Multimodal Graph-Based Action Recognition
Quick Read: This paper targets fine-grained bimanual manipulation action recognition, a challenging multimodal video understanding task that requires precisely capturing spatiotemporal and semantic information. The key to the solution is a graph-based framework that integrates vision-language foundation models, using VideoMAE for dynamic visual encoding and BERT for contextual text embedding. It builds an adaptive multimodal graph whose nodes represent frames, objects, and textual annotations and whose edges encode spatial, temporal, and semantic relations, evolving dynamically with learned interactions for flexible, context-aware reasoning. A task-specific attention mechanism in a Graph Attention Network further strengthens reasoning by modulating edge importance according to action semantics.
Link: https://arxiv.org/abs/2505.15192
Authors: Fatemeh Ziaeetabar, Florentin Wörgötter
Affiliations: University of Tehran; Georg-August-Universität Göttingen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates vision-language foundation models, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the strength of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.
[CV-81] Geometrically Regularized Transfer Learning with On-Manifold and Off-Manifold Perturbation
Quick Read: This paper addresses transfer learning under domain shift, where the divergence between source and target data manifolds degrades generalization. The key to the solution is MAADA (Manifold-Aware Adversarial Data Augmentation), which decomposes adversarial perturbations into on-manifold and off-manifold components to capture semantic variation and model brittleness simultaneously, and uses a manifold-consistency constraint together with a geometry-aware alignment loss to improve structural robustness and cross-domain generalization.
Link: https://arxiv.org/abs/2505.15191
Authors: Hana Satou, Alan Mitkiy, F Monkey
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Transfer learning under domain shift remains a fundamental challenge due to the divergence between source and target data manifolds. In this paper, we propose MAADA (Manifold-Aware Adversarial Data Augmentation), a novel framework that decomposes adversarial perturbations into on-manifold and off-manifold components to simultaneously capture semantic variation and model brittleness. We theoretically demonstrate that enforcing on-manifold consistency reduces hypothesis complexity and improves generalization, while off-manifold regularization smooths decision boundaries in low-density regions. Moreover, we introduce a geometry-aware alignment loss that minimizes geodesic discrepancy between source and target manifolds. Experiments on DomainNet, VisDA, and Office-Home show that MAADA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.
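The decomposition at the heart of MAADA can be illustrated with a few lines of linear algebra: given an orthonormal tangent basis at a point (assumed here to come from, e.g., a local PCA or an autoencoder Jacobian; the paper's manifold estimation is not shown), an adversarial perturbation splits into its tangent-space projection and the residual. This is a sketch under that assumption, not the paper's implementation.

```python
import torch

def decompose_perturbation(delta: torch.Tensor, U: torch.Tensor):
    """delta: (D,) adversarial perturbation; U: (D, k) orthonormal tangent basis."""
    on_manifold = U @ (U.T @ delta)   # projection onto the tangent space
    off_manifold = delta - on_manifold
    return on_manifold, off_manifold

D, k = 128, 10
U, _ = torch.linalg.qr(torch.randn(D, k))   # orthonormal tangent basis (assumed given)
delta = 0.03 * torch.randn(D)               # e.g. an FGSM-style step
d_on, d_off = decompose_perturbation(delta, U)
# d_on drives semantic (on-manifold) augmentation; d_off is used to
# smooth decision boundaries in low-density, off-manifold regions.
print(d_on.norm().item(), d_off.norm().item())
```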
[CV-82] MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models
Quick Read: This paper targets the limited generalizability of existing generalizable 3D Gaussian Splatting methods, which struggle with unfamiliar visual content in novel scenes. The key to the solution is MonoSplat, which leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Its core components are a Mono-Multi Feature Adapter that converts monocular features into multi-view representations, and an Integrated Gaussian Prediction module that fuses the two feature types; a lightweight attention mechanism aligns and aggregates features across views to produce Gaussian primitives with accurate geometry and appearance.
Link: https://arxiv.org/abs/2505.15185
Authors: Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, Yixuan Yuan
Affiliations: The Chinese University of Hong Kong; Tsinghua University (Shenzhen); HKUST (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components: a Mono-Multi Feature Adapter that transforms monocular features into multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter’s lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we convincingly demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods while maintaining computational efficiency with minimal trainable parameters. Codes are available at this https URL.
[CV-83] AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection
Quick Read: This paper addresses omni-domain infrared small target detection (IRSTD), where a single model must simultaneously adapt to diverse imaging systems, resolutions, and spectral bands. Existing vision-only paradigms struggle with complex background interference and scarce target features, and generalize poorly under significant domain shifts and appearance variations. The key to the solution is incorporating readily available auxiliary metadata (spectral band, sensor platform, resolution, observation perspective) through an MLP-based high-dimensional fusion module that dynamically combines metadata semantics with visual features for scene-aware optimization, improving robustness and accuracy on omni-domain IRSTD.
Link: https://arxiv.org/abs/2505.15184
Authors: Yangting Shi, Renjie He, Le Hui, Xiang Li, Jian Yang, Ming-Ming Cheng, Yimian Dai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Omni-domain infrared small target detection (IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multi-modal framework that fundamentally reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at this https URL.
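As a rough illustration of metadata-driven fusion, the sketch below embeds a metadata vector with a small MLP and uses it to scale and shift visual feature channels (FiLM-style conditioning). Layer sizes, the conditioning form, and all names are assumptions; AuxDet's actual fusion module differs in detail.

```python
import torch
import torch.nn as nn

class MetadataFusion(nn.Module):
    """Hypothetical metadata-conditioned feature modulation (not the paper's exact module)."""
    def __init__(self, meta_dim: int = 16, vis_channels: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * vis_channels),  # per-channel scale and shift
        )

    def forward(self, feats: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        scale, shift = self.mlp(meta).chunk(2, dim=-1)       # (B, C) each
        scale = scale.sigmoid().unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feats * scale + shift                          # modulate visual features

fusion = MetadataFusion()
out = fusion(torch.randn(2, 256, 32, 32), torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 256, 32, 32])
```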
[CV-84] Exploring Generalized Gait Recognition: Reducing Redundancy and Noise within Indoor and Outdoor Datasets
Quick Read: This paper addresses generalization in cross-domain gait recognition, especially maintaining robust performance under severe domain shifts in viewpoint, appearance, and environment. The key to the solution is a unified framework that designs a disentangled triplet loss to mitigate optimization conflicts across datasets, and a targeted dataset distillation strategy that removes the 20% of training samples with high feature redundancy and prediction uncertainty to improve data efficiency.
Link: https://arxiv.org/abs/2505.15176
Authors: Qian Zhou, Xianda Guo, Jilong Wang, Chuanfu Shen, Zhongyuan Wang, Hua Zou, Qin Zou, Chao Liang, Chen Long, Gang Wu
Affiliations: Wuhan University; USTC; SIAS, UESTC; CASIA; Waytous; Tarim University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures
Abstract:Generalized gait recognition, which aims to achieve robust performance across diverse domains, remains a challenging problem due to severe domain shifts in viewpoints, appearances, and environments. While mixed-dataset training is widely used to enhance generalization, it introduces new obstacles including inter-dataset optimization conflicts and redundant or noisy samples, both of which hinder effective representation learning. To address these challenges, we propose a unified framework that systematically improves cross-domain gait recognition. First, we design a disentangled triplet loss that isolates supervision signals across datasets, mitigating gradient conflicts during optimization. Second, we introduce a targeted dataset distillation strategy that filters out the least informative 20% of training samples based on feature redundancy and prediction uncertainty, enhancing data efficiency. Extensive experiments on CASIA-B, OU-MVLP, Gait3D, and GREW demonstrate that our method significantly improves cross-dataset recognition for both GaitBase and DeepGaitV2 backbones, without sacrificing source-domain accuracy. Code will be released at this https URL.
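The idea of isolating supervision per dataset can be sketched as a triplet loss computed only within each source dataset, so gradients from different datasets do not mix. This is a simplified stand-in for the paper's disentangled triplet loss; the hardest-positive/hardest-negative mining here is an assumption.

```python
import torch
import torch.nn.functional as F

def per_dataset_triplet_loss(emb, labels, dataset_ids, margin=0.2):
    """Triplets are formed only among samples of the same dataset (toy version)."""
    loss, count = emb.new_zeros(()), 0
    for d in dataset_ids.unique():
        m = dataset_ids == d
        e, y = F.normalize(emb[m], dim=1), labels[m]
        dist = torch.cdist(e, e)  # pairwise distances within this dataset
        for i in range(len(y)):
            pos = (y == y[i]) & (torch.arange(len(y)) != i)
            neg = y != y[i]
            if pos.any() and neg.any():
                # hardest positive vs. hardest negative with a margin
                loss = loss + F.relu(dist[i][pos].max()
                                     - dist[i][neg].min() + margin)
                count += 1
    return loss / max(count, 1)

emb = torch.randn(16, 64, requires_grad=True)
labels = torch.randint(0, 4, (16,))
dataset_ids = torch.randint(0, 2, (16,))
print(per_dataset_triplet_loss(emb, labels, dataset_ids).item())
```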
[CV-85] AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection
Quick Read: This paper targets human-centric fake video detection, which threatens information integrity, identity security, and public trust, while existing detectors suffer from limited generalization, poor scalability, and reliance on manually annotated data. The key to the solution is AvatarShield, an interpretable MLLM-based framework optimized with Group Relative Policy Optimization (GRPO) together with carefully designed accuracy-detection and temporal-compensation rewards, avoiding costly text-annotated data; a dual-encoder architecture couples high-level semantic reasoning with low-level artifact amplification to improve forgery detection accuracy and temporal modeling.
Link: https://arxiv.org/abs/2505.15173
Authors: Zhipei Xu, Xuanyu Zhang, Xing Zhou, Jian Zhang
Affiliations: Peking University; RabbitPre AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, particularly in video generation, has led to unprecedented creative capabilities but also increased threats to information integrity, identity security, and public trust. Existing detection methods, while effective in general scenarios, lack robust solutions for human-centric videos, which pose greater risks due to their realism and potential for legal and ethical misuse. Moreover, current detection approaches often suffer from poor generalization, limited scalability, and reliance on labor-intensive supervised fine-tuning. To address these challenges, we propose AvatarShield, the first interpretable MLLM-based framework for detecting human-centric fake videos, enhanced via Group Relative Policy Optimization (GRPO). Through our carefully designed accuracy detection reward and temporal compensation reward, it effectively avoids the use of high-cost text annotation data, enabling precise temporal modeling and forgery detection. Meanwhile, we design a dual-encoder architecture, combining high-level semantic reasoning and low-level artifact amplification to guide MLLMs in effective forgery detection. We further collect FakeHumanVid, a large-scale human-centric video benchmark that includes synthesis methods guided by pose, audio, and text inputs, enabling rigorous evaluation of detection methods in real-world scenes. Extensive experiments show that AvatarShield significantly outperforms existing approaches in both in-domain and cross-domain detection, setting a new standard for human-centric video forensics.
[CV-86] Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation
Quick Read: This paper addresses the lack of an effective caption-detailness metric for training text-to-image (T2I) models; existing practice relies on simple proxies such as caption length, which poorly reflect how thoroughly a caption covers and describes the image. The key to the solution is a new metric that evaluates detailness from two aspects: image coverage rate (ICR), which measures whether a caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies how thoroughly each object is described. Experiments show that captions selected with this metric significantly improve T2I model performance.
Link: https://arxiv.org/abs/2505.15172
Authors: Xinran Wang, Muxi Diao, Yuanzhi Liu, Chunyu Wang, Kongming Liang, Zhanyu Ma, Jun Guo
Affiliations: Beijing University of Posts and Telecommunications; Zhongguancun Academy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics like caption length to represent the detailness of the caption in the T2I training set. In this paper, we propose a new metric to estimate caption detailness based on two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies the detailness of each object's description. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on high-ICR and -AOD captions achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and the length-based selection method, improving alignment and reconstruction ability. These findings highlight the critical role of detail-aware metrics over length-based heuristics in caption selection for T2I tasks.
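A toy version of the two quantities might look as follows, assuming object annotations are available (e.g., COCO instances) and that object mentions are matched by plain string lookup; the matching rule and the modifier-window proxy for detailness are illustrative assumptions, not the paper's definitions.

```python
def image_coverage_rate(caption: str, image_objects: list[str]) -> float:
    """Fraction of annotated objects that the caption mentions (toy matcher)."""
    c = caption.lower()
    covered = [o for o in image_objects if o.lower() in c]
    return len(covered) / max(len(image_objects), 1)

def average_object_detailness(caption: str, image_objects: list[str]) -> float:
    """Average number of words preceding each mentioned object (crude proxy)."""
    words = caption.lower().split()
    totals = []
    for obj in image_objects:
        if obj.lower() in words:
            i = words.index(obj.lower())
            window = words[max(0, i - 3):i]  # preceding modifiers as a stand-in
            totals.append(len(window))
    return sum(totals) / max(len(totals), 1)

cap = "a small brown dog sleeping next to a red bicycle"
objs = ["dog", "bicycle", "person"]
print(image_coverage_rate(cap, objs), average_object_detailness(cap, objs))
```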
[CV-87] Lossless Token Merging Even Without Fine-Tuning in Vision Transformers
Quick Read: This paper addresses the heavy computational overhead of Vision Transformers (ViTs) caused by their large model size. Existing token compression techniques, though widely studied, often suffer severe information loss and require extensive extra training to reach practical performance. The key to the solution is Adaptive Token Merging (ATM), which achieves lossless token merging and maintains competitive performance without fine-tuning: it adaptively adjusts layer-specific similarity thresholds across layers and batches to avoid improperly merging dissimilar tokens, and introduces a token matching technique that considers both similarity and merging size, particularly in the final layers, to minimize the information loss of each merge.
Link: https://arxiv.org/abs/2505.15160
Authors: Jaeyeon Lee, Dong-Wan Choi
Affiliations: Inha University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this issue, but they often suffer from severe information loss, requiring extensive additional training to achieve practical performance. In this paper, we propose Adaptive Token Merging (ATM), a novel method that ensures lossless token merging, eliminating the need for fine-tuning while maintaining competitive performance. ATM adaptively reduces tokens across layers and batches by carefully adjusting layer-specific similarity thresholds, thereby preventing the undesirable merging of less similar tokens with respect to each layer. Furthermore, ATM introduces a novel token matching technique that considers not only similarity but also merging sizes, particularly for the final layers, to minimize the information loss incurred from each merging operation. We empirically validate our method across a wide range of pretrained models, demonstrating that ATM not only outperforms all existing training-free methods but also surpasses most training-intensive approaches, even without additional training. Remarkably, training-free ATM achieves over a 30% reduction in FLOPs for the DeiT-T and DeiT-S models without any drop in their original accuracy.
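A minimal sketch of threshold-gated token merging is shown below: each token is merged with its nearest neighbor only when their cosine similarity clears a threshold. ATM's layer- and batch-adaptive thresholds and size-aware matching are richer; this greedy pairing is an assumption for illustration.

```python
import torch

def merge_tokens(x: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """x: (N, D) tokens for one image. Returns a (possibly) shorter sequence."""
    xn = torch.nn.functional.normalize(x, dim=1)
    sim = xn @ xn.T
    sim.fill_diagonal_(-1.0)                       # never match a token with itself
    merged, used = [], torch.zeros(len(x), dtype=torch.bool)
    for i in range(len(x)):
        if used[i]:
            continue
        j = int(sim[i].argmax())                   # nearest neighbor of token i
        if not used[j] and sim[i, j] > tau:
            merged.append((x[i] + x[j]) / 2)       # merge the similar pair
            used[i] = used[j] = True
        else:
            merged.append(x[i])                    # keep dissimilar tokens as-is
            used[i] = True
    return torch.stack(merged)

tokens = torch.randn(197, 384)                     # e.g. a ViT-S token sequence
print(merge_tokens(tokens).shape)
```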
[CV-88] From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation
Quick Read: This paper addresses the inability of traditional remote sensing image (RSI) processing methods to maintain efficiency and accuracy as data grow more diverse and voluminous. The key to the solution is deep learning (DL), which automates feature extraction and improves segmentation accuracy across modalities, advancing remote sensing image semantic segmentation (RSISS). The paper systematically reviews the evolution of DL-based RSISS and, from the perspective of feature extraction and learning strategies, analyzes the field's progression from pixel-level to patch- and tile-level methods, and from unimodal to multimodal segmentation.
Link: https://arxiv.org/abs/2505.15147
Authors: Quanwei Liu, Tao Huang, Yanni Dong, Jiaqi Yang, Wei Xiang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages, 14 figures, 10 tables
Abstract:Remote sensing images (RSIs) capture both natural and human-induced changes on the Earth’s surface, serving as essential data for environmental monitoring, urban planning, and resource management. Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in remote sensing analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating feature extraction and improving segmentation accuracy across diverse modalities. This paper revisits the evolution of DL-based RSISS by categorizing existing approaches into four stages: the early pixel-based methods, the prevailing patch-based and tile-based techniques, and the emerging image-based strategies enabled by foundation models. We analyze these developments from the perspective of feature extraction and learning strategies, revealing the field’s progression from pixel-level to tile-level and from unimodal to multimodal segmentation. Furthermore, we conduct a comprehensive evaluation of nearly 40 advanced techniques on a unified dataset to quantitatively characterize their performance and applicability. This review offers a holistic view of DL-based SS for RS, highlighting key advancements, comparative insights, and open challenges to guide future research.
[CV-89] CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation
Quick Read: This paper addresses the largely uncharted ability of current multimodal large language models (MLLMs) and video generation models to understand and reproduce cinematographic techniques, a gap caused by the scarcity of expert-annotated data. The key to the solution is CineTechBench, a benchmark built on precise manual annotation by seasoned cinematography experts across seven key dimensions, containing over 600 annotated movie images and 120 movie clips with clear cinematographic techniques, providing a foundation for evaluating and improving models' cinematography understanding and generation.
Link: https://arxiv.org/abs/2505.15145
Authors: Xinran Wang, Songyu Xu, Xiangxuan Shan, Yuxuan Zhang, Muxi Diao, Xueyan Duan, Yanhua Huang, Kongming Liang, Zhanyu Ma
Affiliations: Beijing University of Posts and Telecommunications; China Mobile Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review
Abstract:Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects (shot scale, shot angle, composition, camera movement, lighting, color, and focal length) and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question-answer pairs and annotated descriptions to assess MLLMs' ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation on 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automatic film production and appreciation. The code and benchmark can be accessed at this https URL.
[CV-90] Unified Cross-Modal Attention-Mixer Based Structural-Functional Connectomics Fusion for Neuropsychiatric Disorder Diagnosis
Quick Read: This paper addresses the failure of traditional multimodal deep learning methods to fully exploit the complementary characteristics of structural and functional connectomics data for better diagnosis. The key to the solution is ConneX, a multimodal fusion method combining cross-attention with an MLP-Mixer: modality-specific backbone GNNs first produce per-modality representations, a unified cross-modal attention network then fuses these embeddings by capturing intra- and inter-modal interactions, and MLP-Mixer layers refine global and local features for end-to-end classification.
Link: https://arxiv.org/abs/2505.15139
Authors: Badhan Mazumder, Lei Wu, Vince D. Calhoun, Dong Hye Ye
Affiliations: Georgia State University; Georgia Institute of Technology; Emory University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2025
Abstract:Gaining insights into the structural and functional mechanisms of the brain has been a longstanding focus in neuroscience research, particularly in the context of understanding and treating neuropsychiatric disorders such as Schizophrenia (SZ). Nevertheless, most of the traditional multimodal deep learning approaches fail to fully leverage the complementary characteristics of structural and functional connectomics data to enhance diagnostic performance. To address this issue, we proposed ConneX, a multimodal fusion method that integrates cross-attention mechanism and multilayer perceptron (MLP)-Mixer for refined feature fusion. Modality-specific backbone graph neural networks (GNNs) were firstly employed to obtain feature representation for each modality. A unified cross-modal attention network was then introduced to fuse these embeddings by capturing intra- and inter-modal interactions, while MLP-Mixer layers refined global and local features, leveraging higher-order dependencies for end-to-end classification with a multi-head joint loss. Extensive evaluations demonstrated improved performance on two distinct clinical datasets, highlighting the robustness of our proposed framework.
[CV-91] Multispectral Detection Transformer with Infrared-Centric Sensor Fusion
Quick Read: This paper addresses how to effectively fuse visible (RGB) and infrared (IR) modalities to improve multispectral object detection. The key to the solution is IC-Fusion, a lightweight and modality-aware fusion strategy: a multi-scale feature distillation block enhances RGB features, and a three-stage fusion module with a cross-modal channel shuffle gate and a cross-modal large kernel gate enables effective interaction between the complementary RGB and IR features, improving localization and semantic understanding.
Link: https://arxiv.org/abs/2505.15137
Authors: Seongmin Hwang, Daeyoung Han, Moongu Jeon
Affiliations: Gwangju Institute of Science and Technology (GIST)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:Multispectral object detection aims to leverage complementary information from visible (RGB) and infrared (IR) modalities to enable robust performance under diverse environmental conditions. In this letter, we propose IC-Fusion, a multispectral object detector that effectively fuses visible and infrared features through a lightweight and modality-aware design. Motivated by wavelet analysis and empirical observations, we find that IR images contain structurally rich high-frequency information critical for object localization, while RGB images provide complementary semantic context. To exploit this, we adopt a compact RGB backbone and design a novel fusion module comprising a Multi-Scale Feature Distillation (MSFD) block to enhance RGB features and a three-stage fusion block with Cross-Modal Channel Shuffle Gate (CCSG) and Cross-Modal Large Kernel Gate (CLKG) to facilitate effective cross-modal interaction. Experiments on the FLIR and LLVIP benchmarks demonstrate the effectiveness and efficiency of our IR-centric fusion strategy. Our code is available at this https URL.
[CV-92] DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer
Quick Read: This paper addresses the inherent conflict between target-class and non-target-class knowledge flows in knowledge distillation, as well as the noise introduced by low-confidence dark knowledge in non-target classes. The key to the solution is DeepKD, which combines dual-level decoupling with adaptive denoising: guided by a theoretical analysis of gradient signal-to-noise ratio, it uses independent momentum updaters to reduce interference among task-oriented, target-class, and non-target-class gradients, and a dynamic top-k mask (DTM) that progressively filters low-confidence logits to purify dark knowledge.
Link: https://arxiv.org/abs/2505.15133
Authors: Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Affiliations: Xi'an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in knowledge distillation have emphasized the importance of decoupling different knowledge components. While existing methods utilize momentum mechanisms to separate task-oriented and distillation gradients, they overlook the inherent conflict between target-class and non-target-class knowledge flows. Furthermore, low-confidence dark knowledge in non-target classes introduces noisy signals that hinder effective knowledge transfer. To address these limitations, we propose DeepKD, a novel training framework that integrates dual-level decoupling with adaptive denoising. First, through theoretical analysis of gradient signal-to-noise ratio (GSNR) characteristics in task-oriented and non-task-oriented knowledge distillation, we design independent momentum updaters for each component to prevent mutual interference. We observe that the optimal momentum coefficients for task-oriented gradient (TOG), target-class gradient (TCG), and non-target-class gradient (NCG) should be positively related to their GSNR. Second, we introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses, following curriculum learning principles. The DTM jointly filters low-confidence logits from both teacher and student models, effectively purifying dark knowledge during early training. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD’s effectiveness. Our code is available at this https URL.
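The dynamic top-k mask can be sketched as follows: keep the target class plus the K most confident non-target classes of the teacher, with K grown over training in curriculum fashion. The linear schedule and shapes are assumptions; the paper's exact DTM and the momentum-decoupled updaters are not reproduced here.

```python
import torch

def dtm_mask(t_logits, targets, step, total_steps, k_min=5, k_max=100):
    """Keep target class + top-K non-target teacher classes; K grows with training."""
    k = int(k_min + (k_max - k_min) * step / total_steps)    # linear curriculum
    masked = t_logits.clone()
    masked.scatter_(1, targets.unsqueeze(1), float("-inf"))  # exclude target column
    topk = masked.topk(k, dim=1).indices                     # top-K non-target classes
    mask = torch.zeros_like(t_logits, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    mask.scatter_(1, targets.unsqueeze(1), True)             # always keep the target
    return mask  # apply to both teacher and student logits before the KD loss

t_logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
m = dtm_mask(t_logits, targets, step=100, total_steps=10000)
print(m.sum(dim=1))  # k+1 kept classes per sample
```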
[CV-93] Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding
Quick Read: This paper addresses the challenge of visual grounding (VG) in medical imaging: current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. The key to the solution is a Disease-Aware Prompting (DAP) process that uses a VLM's explainability map to identify the appropriate image features, amplifying disease-relevant regions while suppressing background interference; without any extra pixel-level annotations, it significantly improves grounding accuracy.
Link: https://arxiv.org/abs/2505.15123
Authors: Ta Duc Huy, Duy Anh Huynh, Yutong Xie, Yuankai Qi, Qi Chen, Phi Le Nguyen, Sen Kim Tran, Son Lam Phung, Anton van den Hengel, Zhibin Liao, Minh-Son To, Johan W. Verjans, Vu Minh Hieu Phan
Affiliations: Australian Institute for Machine Learning, University of Adelaide; Macquarie University; Hanoi University of Science and Technology; University of Wollongong; Flinders University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review
Abstract:Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting relevant pathological features corresponding to textual descriptions, improving model transparency and trustworthiness for wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current VLMs assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens. This hampers identifying correlations between the text and disease tokens. To address this, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This simple strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74% compared to state-of-the-art methods across three major chest X-ray datasets.
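A bare-bones version of the reweighting idea: multiply patch tokens by a normalized saliency map, assumed to come from the VLM's explainability method (e.g., a Grad-CAM-style map), before pooling, so disease-relevant patches dominate the cross-modal representation. Every name here is illustrative, not the paper's API.

```python
import torch

def disease_aware_reweight(patch_tokens: torch.Tensor,
                           saliency: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (N, D); saliency: (N,) in [0, 1] from an explainability map."""
    w = saliency / (saliency.sum() + 1e-8)    # normalize the saliency weights
    weighted = patch_tokens * w.unsqueeze(1)  # amplify salient (disease) patches
    return weighted.sum(dim=0)                # saliency-weighted pooling, not a plain mean

tokens = torch.randn(196, 512)                # e.g. 14x14 ViT patch tokens
sal = torch.rand(196)
print(disease_aware_reweight(tokens, sal).shape)  # torch.Size([512])
```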
[CV-94] iPad: Iterative Proposal-centric End-to-End Autonomous Driving
Quick Read: This paper addresses the inefficiency and limited planning awareness of end-to-end (E2E) autonomous driving systems that generate plans directly from dense bird's-eye-view (BEV) grid features. The key to the solution is iPad, a framework that places proposals (a set of candidate future plans) at the center: ProFormer, a BEV encoder with proposal-anchored attention, iteratively refines proposals and their associated features while effectively fusing multi-view image data, and two lightweight proposal-centric auxiliary tasks, mapping and prediction, further improve planning quality at small computational cost.
Link: https://arxiv.org/abs/2505.15111
Authors: Ke Guo, Haochen Liu, Xiaojun Wu, Jia Pan, Chen Lv
Affiliations: Nanyang Technological University; Desay SV Automotive; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:End-to-end (E2E) autonomous driving systems offer a promising alternative to traditional modular pipelines by reducing information loss and error accumulation, with significant potential to enhance both mobility and safety. However, most existing E2E approaches directly generate plans based on dense bird's-eye view (BEV) grid features, leading to inefficiency and limited planning awareness. To address these limitations, we propose iterative Proposal-centric autonomous driving (iPad), a novel framework that places proposals, a set of candidate future plans, at the center of feature extraction and auxiliary tasks. Central to iPad is ProFormer, a BEV encoder that iteratively refines proposals and their associated features through proposal-anchored attention, effectively fusing multi-view image data. Additionally, we introduce two lightweight, proposal-centric auxiliary tasks, mapping and prediction, that improve planning quality with minimal computational overhead. Extensive experiments on the NAVSIM and CARLA Bench2Drive benchmarks demonstrate that iPad achieves state-of-the-art performance while being significantly more efficient than prior leading methods.
[CV-95] Data Augmentation and Resolution Enhancement using GANs and Diffusion Models for Tree Segmentation
Quick Read: This paper addresses accurate tree detection in urban forests, which is difficult under complex landscapes and varying image resolutions caused by different satellite sensors or UAV flight altitudes. The key to the solution is a pipeline that combines domain adaptation with generative models, enhancing low-resolution aerial imagery while preserving semantic content so that trees can be segmented effectively without large volumes of manually annotated data. Using pix2pix, Real-ESRGAN, Latent Diffusion, and Stable Diffusion, the method generates structurally consistent, realistic synthetic samples that expand the training set and unify scale across domains, improving segmentation robustness under different acquisition conditions and offering a scalable, replicable solution for annotation-scarce remote sensing scenarios.
Link: https://arxiv.org/abs/2505.15077
Authors: Alessandro dos Santos Ferreira, Ana Paula Marques Ramos, José Marcato Junior, Wesley Nunes Gonçalves
Affiliations: Federal University of Mato Grosso do Sul; São Paulo State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 13 figures
Abstract:Urban forests play a key role in enhancing environmental quality and supporting biodiversity in cities. Mapping and monitoring these green spaces are crucial for urban planning and conservation, yet accurately detecting trees is challenging due to complex landscapes and the variability in image resolution caused by different satellite sensors or UAV flight altitudes. While deep learning architectures have shown promise in addressing these challenges, their effectiveness remains strongly dependent on the availability of large and manually labeled datasets, which are often expensive and difficult to obtain in sufficient quantity. In this work, we propose a novel pipeline that integrates domain adaptation with GANs and Diffusion models to enhance the quality of low-resolution aerial images. Our proposed pipeline enhances low-resolution imagery while preserving semantic content, enabling effective tree segmentation without requiring large volumes of manually annotated data. Leveraging models such as pix2pix, Real-ESRGAN, Latent Diffusion, and Stable Diffusion, we generate realistic and structurally consistent synthetic samples that expand the training dataset and unify scale across domains. This approach not only improves the robustness of segmentation models across different acquisition conditions but also provides a scalable and replicable solution for remote sensing scenarios with scarce annotation resources. Experimental results demonstrated an improvement of over 50% in IoU for low-resolution images, highlighting the effectiveness of our method compared to traditional pipelines.
[CV-96] AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars
Quick Read: This paper addresses the lack of seamless coordination between facial expressions and gestures in whole-body audio-driven avatar pose and expression generation, which makes animations less natural and coherent. The key to the solution is AsynFusion, a dual-branch DiT (Diffusion Transformer) framework that generates expressions and gestures in parallel, with a Cooperative Synchronization Module enabling bidirectional cross-modal feature interaction and an asynchronous LCM sampling strategy that reduces computation while preserving output quality.
Link: https://arxiv.org/abs/2505.15058
Authors: Tianbao Zhang, Jian Zhao, Yuer Li, Zheng Zhu, Ping Hu, Zhaoxin Fan, Wenjun Wu, Xuelong Li
Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; Hangzhou International Innovation Institute, Beihang University; School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; School of Computer Science and Technology, Xinjiang University; TeleAI of China Telecom; GigaAI
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 11 pages, conference
Abstract:Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.
[CV-97] MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks
Quick Read: This paper addresses the difficulty of transferring pre-trained deep models to downstream Earth Observation (EO) tasks when the structure of the available data differs from that used in pre-training. The key to the solution is a more flexible multi-modal, multi-task pre-training strategy: a Multi-modal Multi-task Masked Autoencoder (MultiMAE) is pre-trained by reconstructing diverse input modalities, including spectral, elevation, and segmentation data, improving transfer learning capability.
Link: https://arxiv.org/abs/2505.14951
Authors: Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada
Affiliations: SnT, University of Luxembourg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: this https URL.
[CV-98] Programmatic Video Prediction Using Large Language Models
Quick Read: This paper addresses video frame prediction: given a few frames as visual context, generate plausible future frames. The key to the solution is ProgGen, which exploits the inductive biases of large (vision) language models (LLM/VLM) to represent video dynamics as a set of neuro-symbolic, human-interpretable states (one per frame) and to synthesize programs that estimate the current states, predict state transitions for future time steps, and render the predicted states into RGB frames.
Link: https://arxiv.org/abs/2505.14948
Authors: Hao Tang, Kevin Ellis, Suhas Lohit, Michael J. Jones, Moitreya Chatterjee
Affiliations: Cornell University; Mitsubishi Electric Research Laboratories
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The task of estimating the world model describing the dynamics of a real world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics applications, autonomous driving, etc. this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a set of neuro-symbolic, human-interpretable set of states (one per frame) by leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB-frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld (ii) Cart Pole. Additionally, ProgGen permits counter-factual reasoning and interpretable video generation attesting to its effectiveness and generalizability for video generation tasks.
[CV-99] Colors Matter: AI-Driven Exploration of Human Feature Colors
Quick Read: This paper addresses accurate classification and feature extraction of key human attributes (skin tone, hair color, iris color, and vein-based undertones) to support applications such as beauty technology, digital personalization, and visual analytics. The key to the solution is combining advanced imaging with machine learning in a multi-stage pipeline of face detection, region segmentation, and dominant color extraction, applying X-means clustering and perceptually uniform distance metrics such as Delta E (CIEDE2000) in the LAB and HSV color spaces to improve color discrimination. Vein analysis from wrist images additionally enables undertone classification, and the system reaches up to 80% classification accuracy under varied lighting and image conditions, demonstrating both perceptual precision and robustness.
Link: https://arxiv.org/abs/2505.14931
Authors: Rama Alyoubi, Taif Alharbi, Albatul Alghamdi, Yara Alshehri, Elham Alghamdi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:This study presents a robust framework that leverages advanced imaging techniques and machine learning for feature extraction and classification of key human attributes, namely skin tone, hair color, iris color, and vein-based undertones. The system employs a multi-stage pipeline involving face detection, region segmentation, and dominant color extraction to isolate and analyze these features. Techniques such as X-means clustering, alongside perceptually uniform distance metrics like Delta E (CIEDE2000), are applied within both LAB and HSV color spaces to enhance the accuracy of color differentiation. For classification, the dominant tones of the skin, hair, and iris are extracted and matched to a custom tone scale, while vein analysis from wrist images enables undertone classification into "Warm" or "Cool" based on LAB differences. Each module uses targeted segmentation and color space transformations to ensure perceptual precision. The system achieves up to 80% accuracy in tone classification using the Delta E-HSV method with Gaussian blur, demonstrating reliable performance across varied lighting and image conditions. This work highlights the potential of AI-powered color analysis and feature extraction for delivering inclusive, precise, and nuanced classification, supporting applications in beauty technology, digital personalization, and visual analytics.
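Nearest-tone matching with CIEDE2000 can be sketched with scikit-image, which provides rgb2lab and deltaE_ciede2000. The three reference tones below are made-up placeholders, not the custom scale used in the paper.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

# Placeholder tone scale (RGB in [0, 1]); the paper's custom scale differs.
TONE_SCALE = {
    "fair":   (0.92, 0.80, 0.72),
    "medium": (0.76, 0.57, 0.45),
    "deep":   (0.45, 0.31, 0.24),
}

def classify_tone(rgb: tuple) -> str:
    """Return the reference tone with the smallest CIEDE2000 distance in LAB."""
    lab = rgb2lab(np.array(rgb, dtype=float).reshape(1, 1, 3))
    best, best_de = None, np.inf
    for name, ref in TONE_SCALE.items():
        ref_lab = rgb2lab(np.array(ref, dtype=float).reshape(1, 1, 3))
        de = float(deltaE_ciede2000(lab, ref_lab)[0, 0])
        if de < best_de:
            best, best_de = name, de
    return best

print(classify_tone((0.8, 0.62, 0.5)))  # nearest tone on the placeholder scale
```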
[CV-100] UPTor: Unified 3D Human Pose Dynamics and Trajectory Prediction for Human-Robot Interaction
Quick Read: This paper addresses the joint prediction of human keypoint dynamics and motion trajectory; existing work mostly targets either full-body pose prediction or trajectory prediction alone, with few attempts to merge them. The key to the solution is a motion transformation technique that predicts full-body pose and trajectory keypoints simultaneously in a global coordinate frame, combining an off-the-shelf 3D human pose estimation module, a graph attention network, and a compact non-autoregressive Transformer suited to real-time motion prediction for human-robot interaction and human-aware navigation.
Link: https://arxiv.org/abs/2505.14866
Authors: Nisarga Nilavadi, Andrey Rudenko, Timm Linder
Affiliations: Bosch Corporate Research; Robert Bosch GmbH; University of Technology Nuremberg
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We introduce a unified approach to forecast the dynamics of human keypoints along with the motion trajectory based on a short sequence of input poses. While many studies address either full-body pose prediction or motion trajectory prediction, only a few attempt to merge them. We propose a motion transformation technique to simultaneously predict full-body pose and trajectory key-points in a global coordinate frame. We utilize an off-the-shelf 3D human pose estimation module, a graph attention network to encode the skeleton structure, and a compact, non-autoregressive transformer suitable for real-time motion prediction for human-robot interaction and human-aware navigation. We introduce a human navigation dataset "DARKO" with specific focus on navigational activities that are relevant for human-aware mobile robot navigation. We perform extensive evaluation on Human3.6M, CMU-Mocap, and our DARKO dataset. In comparison to prior work, we show that our approach is compact, real-time, and accurate in predicting human navigation motion across all datasets. Result animations, our dataset, and code will be available at this https URL
[CV-101] Open-Set Semi-Supervised Learning for Long-Tailed Medical Datasets
Quick Read: This paper addresses class imbalance in medical image recognition and the limited generalization of models to rare and unseen classes. The key to the solution is an open-set learning method based on a semi-supervised approach, which applies a feature-level regularization strategy together with a classifier normalization technique to mitigate the negative impact of long-tailed distributions on model performance.
Link: https://arxiv.org/abs/2505.14846
Authors: Daniya Najiha A. Kareem, Jean Lahoud, Mustansar Fiaz, Amandeep Kumar, Hisham Cholakkal
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Many practical medical imaging scenarios include categories that are under-represented but still crucial. The relevance of image recognition models to real-world applications lies in their ability to generalize to these rare classes as well as unseen classes. Real-world generalization requires taking into account the various complexities that can be encountered in the real world. First, training data is highly imbalanced, which may lead to the model exhibiting bias toward the more frequently represented classes. Moreover, real-world data may contain unseen classes that need to be identified, and model performance is affected by the data scarcity. While medical image recognition has been extensively addressed in the literature, current methods do not take into account all the intricacies of real-world scenarios. To this end, we propose an open-set learning method for highly imbalanced medical datasets using a semi-supervised approach. Recognizing the adverse impact of long-tailed distributions on inherent model characteristics, we implement a regularization strategy at the feature level complemented by a classifier normalization technique. We conduct extensive experiments on the publicly available datasets, ISIC2018, ISIC2019, and TissueMNIST with various numbers of labelled samples. Our analysis shows that addressing the impact of long-tail data in classification significantly improves the overall performance of the network in terms of closed-set and open-set accuracies on all datasets. Our code and trained models will be made publicly available at this https URL.
[CV-102] Leverag ing Generative AI Models to Explore Human Identity
Quick Read: This paper explores human identity indirectly through neural networks. The key to the approach is using diffusion models, state-of-the-art generative AI models trained to create human face images, and relating the generated faces to human identity, establishing a correspondence between the face generation process and the process of identity formation. Experiments show that changes in the diffusion model's external input cause significant changes in the generated face, indirectly confirming the dependence of human identity on external factors during its formation.
Link: https://arxiv.org/abs/2505.14843
Authors: Yunha Yeo, Daeho Um
Affiliations: KAIST; Samsung Advanced Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ISEA 2025
Abstract:This paper attempts to explore human identity by utilizing neural networks in an indirect manner. For this exploration, we adopt diffusion models, state-of-the-art AI generative models trained to create human face images. By relating the generated human face to human identity, we establish a correspondence between the face image generation process of the diffusion model and the process of human identity formation. Through experiments with the diffusion model, we observe that changes in its external input result in significant changes in the generated face image. Based on the correspondence, we indirectly confirm the dependence of human identity on external factors in the process of human identity formation. Furthermore, we introduce Fluidity of Human Identity, a video artwork that expresses the fluid nature of human identity affected by varying external factors. The video is available at this https URL.
[CV-103] Uncovering Cultural Representation Disparities in Vision-Language Models
Quick Read: This paper investigates the extent of cultural bias in vision-language models (VLMs) by evaluating their performance on an image-based country identification task at the country level. The key to the approach is using the geographically diverse Country211 dataset to systematically analyze several large VLMs under various prompting strategies, including open-ended questions and multiple-choice questions in challenging multilingual and adversarial settings, revealing accuracy disparities across countries and question formats and probing how training data distribution and evaluation methodology shape cultural bias in VLMs.
Link: https://arxiv.org/abs/2505.14729
Authors: Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Srishti Yadav, Alejandro Salamanca, Desmond Elliott
Affiliations: M2ai; Traversaal.ai; IIT Roorkee; University of Copenhagen; Cohere Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 36 figures
Abstract:Vision-Language Models (VLMs) have demonstrated impressive capabilities across a range of tasks, yet concerns about their potential biases exist. This work investigates the extent to which prominent VLMs exhibit cultural biases by evaluating their performance on an image-based country identification task at a country level. Utilizing the geographically diverse Country211 dataset, we probe several large vision language models (VLMs) under various prompting strategies: open-ended questions, multiple-choice questions (MCQs) including challenging setups like multilingual and adversarial settings. Our analysis aims to uncover disparities in model accuracy across different countries and question formats, providing insights into how training data distribution and evaluation methodologies might influence cultural biases in VLMs. The findings highlight significant variations in performance, suggesting that while VLMs possess considerable visual understanding, they inherit biases from their pre-training data and scale that impact their ability to generalize uniformly across diverse global contexts.
[CV-104] MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Quick Read: This paper addresses the substantial performance gap between Transformer architectures based on Spiking Neural Networks (SNNs) and those based on Artificial Neural Networks (ANNs). Although existing methods propose spiking self-attention mechanisms that combine successfully with SNNs, their overall architectures are bottlenecked in extracting features from different image scales. The key to the solution is MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks.
Link: https://arxiv.org/abs/2505.14719
Authors: Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
Affiliations: China Nanhu Academy of Electronics and Information Technology, China; University of Chinese Academy of Sciences, China; The Hong Kong Polytechnic University, Hong Kong SAR, China; School of Systems and Computing, The University of New South Wales, Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to the great potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture, which is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach across various mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The codes are available at this https URL.
[CV-105] Enhancing Shape Perception and Segmentation Consistency for Industrial Image Inspection
Quick Read: This paper addresses the failure of conventional semantic segmentation models in industrial image inspection to keep the segmentation of fixed components consistent across contexts, which stems from a lack of object-contour awareness, under real-time constraints and limited computing resources. The key to the solution is the Shape-Aware Efficient Network (SPENet), which improves segmentation consistency by separately supervising the extraction of boundary and body information. It introduces the Variable Boundary Domain (VBD) method to describe fuzzy boundaries for better real-world adaptability, and proposes Consistency Mean Square Error (CMSE) as a new metric of segmentation consistency for fixed components. Experiments show the best segmentation accuracy and competitive speed on the authors' dataset, with CMSE reduced by over 50% compared with the previously best-performing models.
Link: https://arxiv.org/abs/2505.14718
Authors: Guoxuan Mao, Ting Cao, Ziyang Li, Yuan Dong
Affiliations: Beijing University of Posts and Telecommunications; Ricoh Software Research Center (Beijing) Co., Ltd
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Semantic segmentation stands as a pivotal research focus in computer vision. In the context of industrial image inspection, conventional semantic segmentation models fail to maintain the segmentation consistency of fixed components across varying contextual environments due to a lack of perception of object contours. Given the real-time constraints and limited computing capability of industrial image detection machines, it is also necessary to create efficient models to reduce computational complexity. In this work, a Shape-Aware Efficient Network (SPENet) is proposed, which focuses on the shapes of objects to achieve excellent segmentation consistency by separately supervising the extraction of boundary and body information from images. In SPENet, a novel method named Variable Boundary Domain (VBD) is introduced to describe fuzzy boundaries and better adapt to real-world scenarios. Additionally, a new metric, Consistency Mean Square Error (CMSE), is proposed to measure segmentation consistency for fixed components. Our approach attains the best segmentation accuracy and competitive speed on our dataset, showcasing significant advantages in CMSE among numerous state-of-the-art real-time segmentation networks, achieving a reduction of over 50% compared to the previously top-performing models.
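Since the abstract does not spell out CMSE, the sketch below shows one plausible reading: the mean squared deviation of a fixed component's predicted masks from their average across contexts, so 0 means perfectly consistent segmentation. Treat the exact formula as an assumption, not the paper's definition.

```python
import numpy as np

def consistency_mse(masks: np.ndarray) -> float:
    """masks: (T, H, W) binary predictions of one fixed component over T images."""
    mean_mask = masks.mean(axis=0, keepdims=True)    # average segmentation
    return float(((masks - mean_mask) ** 2).mean())  # 0 = perfectly consistent

preds = np.random.randint(0, 2, size=(8, 64, 64)).astype(float)
print(consistency_mse(preds))
```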
[CV-106] KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection
Quick Read: This paper addresses two key gaps in fake news detection: existing methods consider only the global image context and neglect local object-level details, and they fail to incorporate external knowledge and entity relations for deeper semantic understanding. The key to the solution is a multimodal framework that fuses visual, textual, and knowledge-based representations: bottom-up attention captures fine-grained object details, CLIP provides global image semantics, RoBERTa performs context-aware text encoding, and knowledge utilization is enhanced by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multimodal features are classified by a Transformer-based classifier to predict news veracity, shifting fake news detection from feature fusion toward semantically grounded verification.
Link: https://arxiv.org/abs/2505.14714
Authors: Tuan-Vinh La, Minh-Hieu Nguyen, Minh-Son Dao
Affiliations: University of Information Technology; Vietnam National University; University of Science; National Institute of Information and Communications Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) Existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of the neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at this https URL.
[CV-107] FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
Quick Read: This paper addresses the low decoding efficiency of auto-regressive (AR) video generation, where producing high-resolution, long-duration videos requires huge numbers of tokens and incurs significant compute during decoding. The key to the solution is FastCar, which exploits the high temporal redundancy of MLP outputs between adjacent frames: a Temporal Attention Score (TAS) decides whether to apply a replay strategy (reusing cached MLP outputs from the previous frame) to cut redundant computation. A TAS-driven Dynamic Resource Scheduling (DRS) hardware accelerator on FPGA further improves resource utilization and inference speed.
Link: https://arxiv.org/abs/2505.14709
Authors: Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
Affiliations: Northeastern University; Nanjing University; Duke University; Adobe Research; NUIST; Fudan University; UCM
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint Version
Abstract:Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the FastCar framework to accelerate the decode phase for the AR video generation by exploring the temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations) with detailed theoretical analysis and justification. Also, we develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge. Furthermore, by combining FastCar and sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: this https URL
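The replay strategy can be sketched as a cache around the MLP: if a similarity-based stand-in for the Temporal Attention Score between the current and previous frame's inputs exceeds a threshold, the cached output is reused instead of recomputed. The TAS definition and caching granularity here are assumptions; the paper derives TAS from attention, not input similarity.

```python
import torch

class ReplayMLP(torch.nn.Module):
    """Toy MLP with a one-frame replay cache (hypothetical simplification)."""
    def __init__(self, dim: int = 512, tau: float = 0.95):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim))
        self.tau, self.prev_in, self.prev_out = tau, None, None

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        if self.prev_in is not None:
            # stand-in "TAS": similarity between current and previous inputs
            tas = torch.nn.functional.cosine_similarity(
                x.flatten(), self.prev_in.flatten(), dim=0)
            if tas > self.tau:
                return self.prev_out                      # replay cached output
        out = self.mlp(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out

layer = ReplayMLP()
frame = torch.randn(256, 512)
out1 = layer(frame)                                       # computed
out2 = layer(frame + 1e-4 * torch.randn_like(frame))      # near-identical: cache hit
```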
zh
[CV-108] DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance
【速读】:该论文旨在解决扩散变压器(Diffusion Transformer, DiT)视频生成模型在计算成本上的瓶颈问题,特别是注意力机制占用了超过80%的总延迟,导致生成8秒720p视频需要数十分钟,严重影响了实际应用和扩展性。其解决方案的关键在于提出了一种无需训练的加速框架——DraftAttention,该框架在GPU上通过动态稀疏注意力实现加速。具体而言,通过对压缩潜在空间中的每帧特征图进行下采样,生成高分辨率的感知区域,并利用低分辨率的草稿注意力图来识别特征图内部的空间冗余和跨帧的时间冗余,进而通过重新排序查询、键和值,引导全分辨率下的稀疏注意力计算,最终恢复原始顺序以实现与硬件优化执行相匹配的结构化稀疏性。
链接: https://arxiv.org/abs/2505.14708
作者: Xuan Shen,Chenxia Han,Yufa Zhou,Yanyue Xie,Yifan Gong,Quanyi Wang,Yiwei Wang,Yanzhi Wang,Pu Zhao,Jiuxiang Gu
机构: Northeastern University (东北大学); CUHK (香港中文大学); Duke University (杜克大学); Adobe Research (Adobe 研究院); NUIST (南京信息工程大学); UCM (马德里康普顿斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint Version
点击查看摘要
Abstract:Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck-attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes-posing serious challenges to practical application and scalability. To address this, we propose the DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over the latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation in full resolution, and subsequently restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: this https URL
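以下为草稿注意力思路的简化示意:先对 Q/K 做块内平均池化得到低分辨率草稿注意力图,再据此为每个查询块仅保留 top-k 个键值块做全分辨率注意力;块大小、top-k 为演示用超参数,硬件对齐的重排细节从略,非论文官方实现。

```python
import torch
import torch.nn.functional as F

def draft_sparse_attention(q, k, v, block=8, topk=2):
    L, d = q.shape
    nb = L // block
    qd = q.view(nb, block, d).mean(1)            # 草稿查询:块内平均池化
    kd = k.view(nb, block, d).mean(1)            # 草稿键
    draft = (qd @ kd.T) / d**0.5                 # 低分辨率草稿注意力图
    keep = draft.topk(topk, dim=-1).indices      # 每个查询块保留 topk 个键块
    out = torch.zeros_like(q)
    for i in range(nb):
        idx = [int(j) for j in keep[i]]
        ks = torch.cat([k[j*block:(j+1)*block] for j in idx])
        vs = torch.cat([v[j*block:(j+1)*block] for j in idx])
        qi = q[i*block:(i+1)*block]
        out[i*block:(i+1)*block] = F.softmax(qi @ ks.T / d**0.5, -1) @ vs
    return out

q = k = v = torch.randn(64, 32)
y = draft_sparse_attention(q, k, v)              # 仅在被选中的块上计算注意力
```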
zh
[CV-109] CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity
【速读】:该论文试图解决在生物多样性应用中,由于物种形态相似性导致的识别难题,特别是针对视觉混淆或隐匿物种(cryptic species)的识别问题。现有数据集在规模、自动化程度和覆盖范围上存在局限,无法有效支持广泛分类群(taxa)的细微差异识别。解决方案的关键在于构建一个大规模、多模态的公开数据集 CrypticBio,其包含1.66亿张图像,涵盖52,000个隐匿物种组与67,000个物种,并整合地理和时间数据作为补充线索,以增强视觉-语言模型在隐匿物种识别中的表现。此外,论文还提供了开源的 CrypticBio-Curate 管道,便于数据的持续更新与扩展。
链接: https://arxiv.org/abs/2505.14707
作者: Georgiana Manolache,Gerard Schouten,Joaquin Vanschoren
机构: Fontys University of Applied Sciences (方提斯应用科学大学); Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species, specifically curated to support the development of AI models for biodiversity identification using images, language and spatiotemporal data
点击查看摘要
Abstract:We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species, specifically curated to support the development of AI models in the context of biodiversity applications. Visually confusing or cryptic species are groups of two or more taxa that are nearly indistinguishable based on visual characteristics alone. While much existing work addresses taxonomic identification in a broad sense, datasets that directly address the morphological confusion of cryptic species are small, manually curated, and target only a single taxon. Thus, the challenge of identifying such subtle differences in a wide range of taxa remains unaddressed. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species, represented in 166 million images. Rich research-grade image annotations–including scientific, multicultural, and multilingual species terminology, hierarchical taxonomy, spatiotemporal context, and associated cryptic groups–address multimodal AI in biodiversity research. For easy dataset curation, we provide an open-source pipeline CrypticBio-Curate. The multimodal nature of the dataset beyond vision-language arises from the integration of geographical and temporal data as complementary cues to identifying cryptic species. To highlight the importance of the dataset, we benchmark a suite of state-of-the-art foundation models across CrypticBio subsets of common, unseen, endangered, and invasive species, and demonstrate the substantial impact of geographical context on vision-language zero-shot learning for cryptic species. By introducing CrypticBio, we aim to catalyze progress toward real-world-ready biodiversity AI models capable of handling the nuanced challenges of species ambiguity.
zh
[CV-110] Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
【速读】:该论文旨在解决多模态数据集蒸馏(Multimodal Dataset Distillation, MDD)中的模态坍缩(Modality Collapse)问题,即现有方法在压缩大规模图像-文本数据集时,导致模态内表示过于集中且跨模态分布差距增大,从而影响跨模态学习的效果。其解决方案的关键在于提出一种名为RepBlend的新框架,通过引入表示融合(representation blending)来弱化过强的跨模态监督,从而提升模态内多样性;同时,为了解决模态间监督不对称的问题,还提出了对称投影轨迹匹配(symmetric projection trajectory matching),通过模态特定的投影头同步优化动态,实现更平衡的监督与跨模态对齐。
链接: https://arxiv.org/abs/2505.14705
作者: Xin Zhang,Ziruo Zhang,Jiawei Du,Zuozhu Liu,Joey Tianyi Zhou
机构: Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore(前沿人工智能研究中心,科技研究局,新加坡); Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore(高性能计算研究所,科技研究局,新加坡); National University of Singapore, Singapore(新加坡国立大学,新加坡); Zhejiang University, China(浙江大学,中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Multimodal Dataset Distillation (MDD) seeks to condense large-scale image-text datasets into compact surrogates while retaining their effectiveness for cross-modal learning. Despite recent progress, existing MDD approaches often suffer from Modality Collapse, characterized by over-concentrated intra-modal representations and enlarged distributional gap across modalities. In this paper, for the first time, we identify this issue as stemming from a fundamental conflict between the over-compression behavior inherent in dataset distillation and the cross-modal supervision imposed by contrastive objectives. To alleviate modality collapse, we introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending, thereby significantly enhancing intra-modal diversity. Additionally, we observe that current MDD methods impose asymmetric supervision across modalities, resulting in biased optimization. To address this, we propose symmetric projection trajectory matching, which synchronizes the optimization dynamics using modality-specific projection heads, thereby promoting balanced supervision and enhancing cross-modal alignment. Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to 6.7× distillation speedup.
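表征融合的核心操作可以用几行代码示意:在对比目标的"靶向量"中按系数 alpha 掺入另一模态的表征,从而弱化过强的跨模态监督;alpha、温度与损失形式均为演示假设,非论文官方实现。

```python
import torch
import torch.nn.functional as F

def blended_contrastive_loss(img, txt, alpha=0.3, tau=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    # 表征混合:每个模态的对比目标中掺入另一模态的信息,缓解模态坍缩
    img_t = F.normalize((1 - alpha) * img + alpha * txt, dim=-1)
    txt_t = F.normalize((1 - alpha) * txt + alpha * img, dim=-1)
    logits_i = img @ txt_t.T / tau              # 图像 -> 混合后的文本目标
    logits_t = txt @ img_t.T / tau              # 文本 -> 混合后的图像目标
    labels = torch.arange(img.shape[0])
    return (F.cross_entropy(logits_i, labels) +
            F.cross_entropy(logits_t, labels)) / 2

loss = blended_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```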
zh
[CV-111] Deep Learning Enabled Segmentation Classification and Risk Assessment of Cervical Cancer
【速读】:该论文旨在解决宫颈癌早期检测中对癌细胞精准识别与分类的问题,以提高疾病预防和预后的准确性。其解决方案的关键在于提出了一种名为“多分辨率融合深度卷积网络”的新型深度学习(Deep Learning, DL)架构,该架构能够有效处理不同分辨率和宽高比的图像,并通过引入多任务学习技术,同时完成细胞分割与分类任务,从而实现了较高的交并比(Intersection over Union, IoU)和分类准确率。此外,该方法仅使用170万可学习参数,显著低于传统模型如VGG-19,提升了计算效率。
链接: https://arxiv.org/abs/2505.15505
作者: Abdul Samad Shaik,Shashaank Mattur Aswatha,Rahul Jashvantbhai Pandya
机构: Indian Institute of Technology, Dharwad, India (印度理工学院,达瓦德分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 10 figures
点击查看摘要
Abstract:Cervical cancer, the fourth leading cause of cancer in women globally, requires early detection through Pap smear tests to identify precancerous changes and prevent disease progression. In this study, we performed a focused analysis by segmenting the cellular boundaries and drawing bounding boxes to isolate the cancer cells. A novel Deep Learning (DL) architecture, the "Multi-Resolution Fusion Deep Convolutional Network", was proposed to effectively handle images with varying resolutions and aspect ratios, with its efficacy showcased using the SIPaKMeD dataset. The performance of this DL model was observed to be similar to the state-of-the-art models, with accuracy variations of a mere 2% to 3%, achieved using just 1.7 million learnable parameters, which is approximately 85 times less than the VGG-19 model. Furthermore, we introduced a multi-task learning technique that simultaneously performs segmentation and classification tasks and begets an Intersection over Union score of 0.83 and a classification accuracy of 90%. The final stage of the workflow employs a probabilistic approach for risk assessment, extracting feature vectors to predict the likelihood of normal cells progressing to malignant states, which can be utilized for the prognosis of cervical cancer.
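摘要中"共享骨干、同时输出分割与分类"的多任务设计可用如下最小结构示意;通道数、损失权重等均为演示假设,与论文的多分辨率融合结构无直接对应。

```python
import torch
import torch.nn as nn

# 概念性示意:共享编码器 + 分割/分类双头的多任务学习
class MultiTaskNet(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(64, 1, 1)              # 分割头:逐像素 logit
        self.cls_head = nn.Sequential(                   # 分类头:全局池化后分类
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes))

    def forward(self, x):
        f = self.encoder(x)
        return self.seg_head(f), self.cls_head(f)

model = MultiTaskNet()
x = torch.randn(2, 3, 128, 128)
seg_logits, cls_logits = model(x)
seg_loss = nn.functional.binary_cross_entropy_with_logits(
    seg_logits, torch.randint(0, 2, (2, 1, 128, 128)).float())
cls_loss = nn.functional.cross_entropy(cls_logits, torch.tensor([0, 3]))
loss = seg_loss + cls_loss                               # 联合多任务损失
```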
zh
[CV-112] Reconsider the Template Mesh in Deep Learning-based Mesh Reconstruction
【速读】:该论文旨在解决传统网格重建方法依赖于标准化模板对个体受试者进行变形所带来的问题,这些问题忽视了个体间的解剖学差异,可能影响重建的精度。其解决方案的关键在于提出一种基于自适应模板的网格重建网络(ATMRN),该网络能够从给定图像中生成自适应模板,从而在后续变形过程中更好地适应个体解剖结构,提升重建的准确性。
链接: https://arxiv.org/abs/2505.15285
作者: Fengting Zhang,Boxu Liang,Qinghao Liu,Min Liu,Xiang Chen,Yaonan Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Mesh reconstruction is a cornerstone process across various applications, including in-silico trials, digital twins, surgical planning, and navigation. Recent advancements in deep learning have notably enhanced mesh reconstruction speeds. Yet, traditional methods predominantly rely on deforming a standardised template mesh for individual subjects, which overlooks the unique anatomical variations between them, and may compromise the fidelity of the reconstructions. In this paper, we propose an adaptive-template-based mesh reconstruction network (ATMRN), which generates adaptive templates from the given images for the subsequent deformation, moving beyond the constraints of a singular, fixed template. Our approach, validated on cortical magnetic resonance (MR) images from the OASIS dataset, sets a new benchmark in voxel-to-cortex mesh reconstruction, achieving an average symmetric surface distance of 0.267mm across four cortical structures. Our proposed method is generic and can be easily transferred to other image modalities and anatomical structures.
zh
[CV-113] X-GRM: Large Gaussian Reconstruction Model for Sparse-view X-rays to Computed Tomography
【速读】:该论文旨在解决现有CT重建方法在模型容量、体积表示灵活性和训练数据规模方面的局限性。其关键解决方案是提出X-GRM(X-ray Gaussian Reconstruction Model),一个基于可扩展Transformer架构的大规模前馈模型,能够从稀疏视角的2D X射线投影中重建3D CT图像,并采用新的体素基高斯点云(VoxGS)体积表示方式,实现高效的CT体积提取与可微分X射线渲染。
链接: https://arxiv.org/abs/2505.15235
作者: Yifan Liu,Wuyang Li,Weihao Yu,Chenxin Li,Alexandre Alahi,Max Meng,Yixuan Yuan
机构: The Chinese University of Hong Kong (香港中文大学); EPFL (洛桑联邦理工学院); SUSTech (南方科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Computed Tomography serves as an indispensable tool in clinical workflows, providing non-invasive visualization of internal anatomical structures. Existing CT reconstruction works are limited to small-capacity model architecture, inflexible volume representation, and small-scale training data. In this paper, we present X-GRM (X-ray Gaussian Reconstruction Model), a large feedforward model for reconstructing 3D CT from sparse-view 2D X-ray projections. X-GRM employs a scalable transformer-based architecture to encode an arbitrary number of sparse X-ray inputs, where tokens from different views are integrated efficiently. Then, tokens are decoded into a new volume representation, named Voxel-based Gaussian Splatting (VoxGS), which enables efficient CT volume extraction and differentiable X-ray rendering. To support the training of X-GRM, we collect ReconX-15K, a large-scale CT reconstruction dataset containing around 15,000 CT/X-ray pairs across diverse organs, including the chest, abdomen, pelvis, and teeth. This combination of a high-capacity model, flexible volume representation, and large-scale training data empowers our model to produce high-quality reconstructions from various testing inputs, including in-domain and out-domain X-ray projections. Project Page: this https URL.
zh
[CV-114] SAMA-UNet: Enhancing Medical Image Segmentation with Self-Adaptive Mamba-Like Attention and Causal-Resonance Learning
【速读】:该论文旨在解决医学图像分割中模型面临的计算效率低、复杂医学数据处理困难以及局部细粒度信息与全局语义依赖难以平衡的问题。其解决方案的关键在于提出SAMA-UNet架构,其中包含自适应Mamba-like聚合注意力(SAMA)模块和因果共振多尺度模块(CR-MSM)。SAMA模块通过结合上下文自注意力与动态权重调制,实现基于局部和全局上下文的特征优先级划分,从而降低计算复杂度并提升多尺度图像特征的表示能力;CR-MSM则通过因果共振学习增强编码器与解码器之间的信息流动,自动调整特征分辨率与因果依赖,提升U型结构中低层与高层特征的语义对齐效果。
链接: https://arxiv.org/abs/2505.15234
作者: Saqib Qamar,Mohd Fazil,Parvez Ahmad,Ghulam Muhammad
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image segmentation plays an important role in various clinical applications, but existing models often struggle with the computational inefficiencies and challenges posed by complex medical data. State Space Sequence Models (SSMs) have demonstrated promise in modeling long-range dependencies with linear computational complexity, yet their application in medical image segmentation remains hindered by incompatibilities with image tokens and autoregressive assumptions. Moreover, it is difficult to achieve a balance in capturing both local fine-grained information and global semantic dependencies. To address these challenges, we introduce SAMA-UNet, a novel architecture for medical image segmentation. A key innovation is the Self-Adaptive Mamba-like Aggregated Attention (SAMA) block, which integrates contextual self-attention with dynamic weight modulation to prioritise the most relevant features based on local and global contexts. This approach reduces computational complexity and improves the representation of complex image features across multiple scales. We also suggest the Causal-Resonance Multi-Scale Module (CR-MSM), which enhances the flow of information between the encoder and decoder by using causal resonance learning. This mechanism allows the model to automatically adjust feature resolution and causal dependencies across scales, leading to better semantic alignment between the low-level and high-level features in U-shaped architectures. Experiments on MRI, CT, and endoscopy images show that SAMA-UNet performs better in segmentation accuracy than current methods using CNN, Transformer, and Mamba. The implementation is publicly available at GitHub.
zh
[CV-115] Physics-Guided Multi-View Graph Neural Network for Schizophrenia Classification via Structural-Functional Coupling MICCAI2024
【速读】:该论文试图解决精神疾病(如精神分裂症,SZ)中脑结构连接(SC)与功能连接(FC)之间复杂相互关系被忽视的问题,传统方法因功能数据获取受限而仅依赖SC,导致对认知和行为障碍的理解不足。解决方案的关键在于提出一种基于物理引导的深度学习框架,该框架利用神经振荡模型描述由神经纤维连接的神经振荡器动态,并通过系统动力学视角学习SC-FC耦合,从而同时生成FC;此外,还引入一种具有联合损失的多视图图神经网络(GNN),实现基于相关性的SC-FC融合与SZ个体分类。
链接: https://arxiv.org/abs/2505.15135
作者: Badhan Mazumder,Ayush Kanyal,Lei Wu,Vince D. Calhoun,Dong Hye Ye
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and presented at the 7th International Workshop on PRedictive Intelligence in MEdicine (Held in Conjunction with MICCAI 2024)
点击查看摘要
Abstract:Clinical studies reveal disruptions in brain structural connectivity (SC) and functional connectivity (FC) in neuropsychiatric disorders such as schizophrenia (SZ). Traditional approaches might rely solely on SC due to limited functional data availability, hindering comprehension of cognitive and behavioral impairments in individuals with SZ by neglecting the intricate SC-FC interrelationship. To tackle the challenge, we propose a novel physics-guided deep learning framework that leverages a neural oscillation model to describe the dynamics of a collection of interconnected neural oscillators, which operate via nerve fibers dispersed across the brain’s structure. Our proposed framework utilizes SC to simultaneously generate FC by learning SC-FC coupling from a system dynamics perspective. Additionally, it employs a novel multi-view graph neural network (GNN) with a joint loss to perform correlation-based SC-FC fusion and classification of individuals with SZ. Experiments conducted on a clinical dataset exhibited improved performance, demonstrating the robustness of our proposed approach.
zh
[CV-116] Lung Nodule-SSM: Self-Supervised Lung Nodule Detection and Classification in Thoracic CT Images
【速读】:该论文旨在解决肺癌早期结节检测中因标注医学影像数据有限而导致的计算机辅助诊断(CAD)系统准确性不足的问题。其解决方案的关键在于利用自监督学习(self-supervised learning)方法,以DINOv2作为主干网络,在无标注的CT扫描数据上预训练以获取鲁棒的特征表示,并进一步通过基于Transformer的架构进行微调,以实现病灶级别的检测和精准的肺结节诊断。
链接: https://arxiv.org/abs/2505.15120
作者: Muniba Noreen (Faculty of Electrical and Electronics Engineering, University of Engineering and Technology Taxila, Pakistan), Furqan Shaukat (Faculty of Electrical and Electronics Engineering, University of Engineering and Technology Taxila, Pakistan)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Lung cancer remains among the deadliest types of cancer in recent decades, and early lung nodule detection is crucial for improving patient outcomes. The limited availability of annotated medical imaging data remains a bottleneck in developing accurate computer-aided diagnosis (CAD) systems. Self-supervised learning can help leverage large amounts of unlabeled data to develop more robust CAD systems. With the recent advent of transformer-based architecture and their ability to generalize to unseen tasks, there has been an effort within the healthcare community to adapt them to various medical downstream tasks. Thus, we propose a novel "LungNodule-SSM" method, which utilizes self-supervised learning with DINOv2 as a backbone to enhance lung nodule detection and classification without annotated data. Our methodology has two stages: firstly, the DINOv2 model is pre-trained on unlabeled CT scans to learn robust feature representations; secondly, these features are fine-tuned using transformer-based architectures for lesion-level detection and accurate lung nodule diagnosis. The proposed method has been evaluated on the challenging LUNA 16 dataset, consisting of 888 CT scans, and compared with SOTA methods. Our experimental results show the superiority of our proposed method with an accuracy of 98.37%, explaining its effectiveness in lung nodule detection. The source code, datasets, and pre-processed data can be accessed using the link: this https URL
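两阶段思路(自监督特征 + 下游微调)可用公开的 DINOv2 权重简要示意:冻结骨干提特征,再接一个分类头;此处以 ViT-S/14 检查点和二分类头为演示假设,病灶级检测部分从略,非论文官方实现。

```python
import torch
import torch.nn as nn

# 概念性示意:加载公开的 DINOv2 自监督骨干,冻结后仅训练下游分类头
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():          # 第一阶段:冻结自监督特征
    p.requires_grad = False

head = nn.Linear(384, 2)                 # 第二阶段:结节/非结节分类头(384 为 ViT-S 维度)
x = torch.randn(4, 3, 224, 224)          # 假设 CT 切片已复制为 3 通道并归一化
with torch.no_grad():
    feats = backbone(x)                  # (4, 384) 全局特征
logits = head(feats)
```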
zh
[CV-117] Non-rigid Motion Correction for MRI Reconstruction via Coarse-To-Fine Diffusion Models ICIP2025
【速读】:该论文旨在解决磁共振成像(Magnetic Resonance Imaging, MRI)中由于长时间的k空间采样导致的运动伪影问题,这些问题会降低影像的诊断价值,尤其是在动态成像中。其解决方案的关键在于提出了一种新颖的交替优化框架,该框架利用定制的扩散模型(diffusion model)联合重建和校正非刚性运动损坏的k空间数据,通过粗到细的去噪策略先捕捉整体运动并重建图像的低频成分,从而为运动估计提供更好的归纳偏差。
链接: https://arxiv.org/abs/2505.15057
作者: Frederic Wang,Jonathan I. Tamir
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: ICIP 2025
点击查看摘要
Abstract:Magnetic Resonance Imaging (MRI) is highly susceptible to motion artifacts due to the extended acquisition times required for k-space sampling. These artifacts can compromise diagnostic utility, particularly for dynamic imaging. We propose a novel alternating minimization framework that leverages a bespoke diffusion model to jointly reconstruct and correct non-rigid motion-corrupted k-space data. The diffusion model uses a coarse-to-fine denoising strategy to capture large overall motion and reconstruct the lower frequencies of the image first, providing a better inductive bias for motion estimation than that of standard diffusion models. We demonstrate the performance of our approach on both real-world cine cardiac MRI datasets and complex simulated rigid and non-rigid deformations, even when each motion state is undersampled by a factor of 64x. Additionally, our method is agnostic to sampling patterns, anatomical variations, and MRI scanning protocols, as long as some low frequency components are sampled during each motion state.
zh
[CV-118] Pathobiological Dictionary Defining Pathomics and Texture Features: Addressing Understandable AI Issues in Personalized Liver Cancer; Dictionary Version LCP1.0
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在医学诊断中因可解释性不足和泛化能力有限而难以临床应用的问题。其解决方案的关键在于构建一个名为肝癌病理词典(Pathobiological Dictionary for Liver Cancer, LCP1.0)的框架,该框架能够将复杂的Pathomics和Radiomics特征(PF和RF)转化为与现有诊断流程一致的临床有意义的见解。通过标准化工具QuPath和PyRadiomics提取影像特征,并结合专家定义的感兴趣区域(ROIs)进行特征聚合,最终利用变量阈值特征选择算法与支持向量机(SVM)模型实现高准确率的肿瘤分级和预后关联分析,从而增强AI模型的透明度和临床可用性。
链接: https://arxiv.org/abs/2505.14926
作者: Mohammad R. Salmanpour,Seyed Mohammad Piri,Somayeh Sadat Mehrnia,Ahmad Shariftabrizi,Masume Allahmoradi,Venkata SK. Manem,Arman Rahmim,Ilker Hacihaliloglu
机构: 未知
类目: Computational Physics (physics.comp-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 4 figures and 1 table
点击查看摘要
Abstract:Artificial intelligence (AI) holds strong potential for medical diagnostics, yet its clinical adoption is limited by a lack of interpretability and generalizability. This study introduces the Pathobiological Dictionary for Liver Cancer (LCP1.0), a practical framework designed to translate complex Pathomics and Radiomics Features (PF and RF) into clinically meaningful insights aligned with existing diagnostic workflows. QuPath and PyRadiomics, standardized according to IBSI guidelines, were used to extract 333 imaging features from hepatocellular carcinoma (HCC) tissue samples, including 240 PF-based-cell detection/intensity, 74 RF-based texture, and 19 RF-based first-order features. Expert-defined ROIs from the public dataset excluded artifact-prone areas, and features were aggregated at the case level. Their relevance to the WHO grading system was assessed using multiple classifiers linked with feature selectors. The resulting dictionary was validated by 8 experts in oncology and pathology. In collaboration with 10 domain experts, we developed a Pathobiological dictionary of imaging features such as PFs and RFs. In our study, the Variable Threshold feature selection algorithm combined with the SVM model achieved the highest accuracy (0.80, P-value less than 0.05), selecting 20 key features, primarily clinical and pathomics traits such as Centroid, Cell Nucleus, and Cytoplasmic characteristics. These features, particularly nuclear and cytoplasmic, were strongly associated with tumor grading and prognosis, reflecting atypia indicators like pleomorphism, hyperchromasia, and cellular disorganization. LCP1.0 provides a clinically validated bridge between AI outputs and expert interpretation, enhancing model transparency and usability. Aligning AI-derived features with clinical semantics supports the development of interpretable, trustworthy diagnostic tools for liver cancer pathology.
zh
[CV-119] Super-Resolution Optical Coherence Tomography Using Diffusion Model-Based Plug-and-Play Priors
【速读】:该论文旨在解决光学相干断层成像(Optical Coherence Tomography, OCT)中从稀疏测量数据中重建高质量图像的问题,特别是在OCT B-mode角膜图像的场景下。其解决方案的关键在于提出了一种基于插件式扩散模型(Plug-and-Play Diffusion Model, PnP-DM)的超分辨率框架,通过将扩散先验与马尔可夫链蒙特卡洛采样相结合,实现高效的后验推断,从而在保持结构清晰度和抑制噪声方面优于传统的2D-UNet基线方法。
链接: https://arxiv.org/abs/2505.14916
作者: Yaning Wang,Jinglun Yu,Wenhan Guo,Yu Sun,Jin U. Kang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose an OCT super-resolution framework based on a plug-and-play diffusion model (PnP-DM) to reconstruct high-quality images from sparse measurements (OCT B-mode corneal images). Our method formulates reconstruction as an inverse problem, combining a diffusion prior with Markov chain Monte Carlo sampling for efficient posterior inference. We collect high-speed under-sampled B-mode corneal images and apply a deep learning-based up-sampling pipeline to build realistic training pairs. Evaluations on in vivo and ex vivo fish-eye corneal models show that PnP-DM outperforms conventional 2D-UNet baselines, producing sharper structures and better noise suppression. This approach advances high-fidelity OCT imaging in high-speed acquisition for clinical applications.
zh
[CV-120] Model-Independent Machine Learning Approach for Nanometric Axial Localization and Tracking
【速读】:该论文旨在解决在光学显微镜中精确追踪粒子并确定其沿光轴位置的难题,尤其是在需要极高精度的情况下。其解决方案的关键在于引入一种基于卷积神经网络(Convolutional Neural Network, CNN)的深度学习方法,该方法能够从双焦平面图像中确定轴向位置,而无需依赖预定义模型。这种方法实现了40纳米的轴向定位精度,比传统单焦平面技术提高了六倍,并因其简洁的设计和强大的性能适用于多种科学应用。
链接: https://arxiv.org/abs/2505.14754
作者: Andrey Alexandrov,Giovanni Acampora,Giovanni De Lellis,Antonia Di Crescenzo,Chiara Errico,Daria Morozova,Valeri Tioukov,Autilia Vittiello
机构: 未知
类目: Image and Video Processing (eess.IV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
备注: 11 pages, 4 figures, 1 table
点击查看摘要
Abstract:Accurately tracking particles and determining their position along the optical axis is a major challenge in optical microscopy, especially when extremely high precision is needed. In this study, we introduce a deep learning approach using convolutional neural networks (CNNs) that can determine axial positions from dual-focal plane images without relying on predefined models. Our method achieves an axial localization accuracy of 40 nanometers - six times better than traditional single-focal plane techniques. The model’s simple design and strong performance make it suitable for a wide range of uses, including dark matter detection, proton therapy for cancer, and radiation protection in space. It also shows promise in fields like biological imaging, materials science, and environmental monitoring. This work highlights how machine learning can turn complex image data into reliable, precise information, offering a flexible and powerful tool for many scientific applications.
zh
[CV-121] TransMedSeg: A Transferable Semantic Framework for Semi-Supervised Medical Image Segmentation
【速读】:该论文旨在解决半监督医学图像分割(Semi-supervised Medical Image Segmentation, SSMIS)中现有半监督学习(Semi-supervised Learning, SSL)方法在跨临床领域和成像模态之间忽视可迁移语义关系的问题。其解决方案的关键在于提出一种名为TransMedSeg的可迁移语义框架,该框架通过引入可迁移语义增强(Transferable Semantic Augmentation, TSA)模块,利用跨域分布匹配和域内结构保持来对齐域不变语义,从而隐式增强特征表示。该方法通过轻量级记忆模块将教师网络特征自适应地增强至学生网络语义,实现隐式的语义转换,无需显式数据生成,并通过计算增强教师分布上的期望可迁移交叉熵损失进行优化。
链接: https://arxiv.org/abs/2505.14753
作者: Mengzhu Wang,Jiao Li,Shanshan Wang,Long Lan,Huibin Tan,Liang Yang,Guoli Yang
机构: Hebei University of Technology(河北工业大学); University of Electronic Science and Technology of China(电子科技大学); Anhui University(安徽大学); Peking University(北京大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Semi-supervised learning (SSL) has achieved significant progress in medical image segmentation (SSMIS) through effective utilization of limited labeled data. While current SSL methods for medical images predominantly rely on consistency regularization and pseudo-labeling, they often overlook transferable semantic relationships across different clinical domains and imaging modalities. To address this, we propose TransMedSeg, a novel transferable semantic framework for semi-supervised medical image segmentation. Our approach introduces a Transferable Semantic Augmentation (TSA) module, which implicitly enhances feature representations by aligning domain-invariant semantics through cross-domain distribution matching and intra-domain structural preservation. Specifically, TransMedSeg constructs a unified feature space where teacher network features are adaptively augmented towards student network semantics via a lightweight memory module, enabling implicit semantic transformation without explicit data generation. Interestingly, this augmentation is implicitly realized through an expected transferable cross-entropy loss computed over the augmented teacher distribution. An upper bound of the expected loss is theoretically derived and minimized during training, incurring negligible computational overhead. Extensive experiments on medical image datasets demonstrate that TransMedSeg outperforms existing semi-supervised methods, establishing a new direction for transferable representation learning in medical image analysis.
zh
[CV-122] LOD1 3D City Model from LiDAR: The Impact of Segmentation Accuracy on Quality of Urban 3D Modeling and Morphology Extraction
【速读】:该论文旨在解决利用LiDAR数据进行高精度三维建筑重建(尤其是LOD1级别)以及从重建模型中提取形态特征的问题。其解决方案的关键在于采用迁移学习策略,结合四种深度语义分割模型(U-Net、Attention U-Net、U-Net3+和DeepLabV3+)来提取建筑轮廓,并通过最大值、范围、众数、中位数和第90百分位数等统计指标估计建筑高度,从而生成LOD1级别的三维模型。研究还重点分析了分割精度对三维建模质量和形态特征(如建筑面积和外墙表面积)准确性的影响,结果显示U-Net3+在使用第90百分位数和中位数进行高度估算时表现最优。
链接: https://arxiv.org/abs/2505.14747
作者: Fatemeh Chajaei,Hossein Bagheri
机构: University of Isfahan (伊斯法罕大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Three-dimensional reconstruction of buildings, particularly at Level of Detail 1 (LOD1), plays a crucial role in various applications such as urban planning, urban environmental studies, and designing optimized transportation networks. This study focuses on assessing the potential of LiDAR data for accurate 3D building reconstruction at LOD1 and extracting morphological features from these models. Four deep semantic segmentation models, U-Net, Attention U-Net, U-Net3+, and DeepLabV3+, were used, applying transfer learning to extract building footprints from LiDAR data. The results showed that U-Net3+ and Attention U-Net outperformed the others, achieving IoU scores of 0.833 and 0.814, respectively. Various statistical measures, including maximum, range, mode, median, and the 90th percentile, were used to estimate building heights, resulting in the generation of 3D models at LOD1. As the main contribution of the research, the impact of segmentation accuracy on the quality of 3D building modeling and the accuracy of morphological features like building area and external wall surface area was investigated. The results showed that the accuracy of building identification (segmentation performance) significantly affects the 3D model quality and the estimation of morphological features, depending on the height calculation method. Overall, the UNet3+ method, utilizing the 90th percentile and median measures, leads to accurate height estimation of buildings and the extraction of morphological features.
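摘要中基于统计量的建筑高度估计可以直接用 NumPy 示意:对落在某建筑轮廓内的 LiDAR 点求相对地面的第90百分位数、中位数等,作为 LOD1 模型的挤出高度;点云数据为随机生成的演示样例。

```python
import numpy as np

# 概念性示意:以不同统计量估计单体建筑的 LOD1 挤出高度
def lod1_height(ground_z, roof_z, method="p90"):
    rel = roof_z - np.median(ground_z)        # 相对地面的点高度
    stats = {
        "max": rel.max(), "range": rel.max() - rel.min(),
        "median": np.median(rel), "p90": np.percentile(rel, 90),
    }
    return stats[method]

ground = np.random.normal(100.0, 0.2, 500)    # 假设的地面点高程
roof = np.random.normal(112.0, 1.5, 800)      # 假设的屋顶点高程
print(lod1_height(ground, roof, "p90"), lod1_height(ground, roof, "median"))
```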
zh
[CV-123] Predicting Neo-Adjuvant Chemotherapy Response in Triple-Negative Breast Cancer Using Pre-Treatment Histopathologic Images
【速读】:该论文旨在解决三阴性乳腺癌(TNBC)患者对新辅助化疗(NACT)反应预测的准确性问题,以优化治疗方案并改善患者预后。其解决方案的关键是开发一种基于深度学习的模型,利用术前苏木素-伊红(HE)染色活检图像来预测NACT反应,并通过结合多重免疫组化(mIHC)数据提升模型的可解释性和预测性能。
链接: https://arxiv.org/abs/2505.14730
作者: Hikmat Khan,Ziyu Su,Huina Zhang,Yihong Wang,Bohan Ning,Shi Wei,Hua Guo,Zaibo Li,Muhammad Khalid Khan Niazi
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Triple-negative breast cancer (TNBC) is an aggressive subtype defined by the lack of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) expression, resulting in limited targeted treatment options. Neoadjuvant chemotherapy (NACT) is the standard treatment for early-stage TNBC, with pathologic complete response (pCR) serving as a key prognostic marker; however, only 40-50% of patients with TNBC achieve pCR. Accurate prediction of NACT response is crucial to optimize therapy, avoid ineffective treatments, and improve patient outcomes. In this study, we developed a deep learning model to predict NACT response using pre-treatment hematoxylin and eosin (H&E)-stained biopsy images. Our model achieved promising results in five-fold cross-validation (accuracy: 82%, AUC: 0.86, F1-score: 0.84, sensitivity: 0.85, specificity: 0.81, precision: 0.80). Analysis of model attention maps in conjunction with multiplexed immunohistochemistry (mIHC) data revealed that regions of high predictive importance consistently colocalized with tumor areas showing elevated PD-L1 expression, CD8+ T-cell infiltration, and CD163+ macrophage density - all established biomarkers of treatment response. Our findings indicate that incorporating IHC-derived immune profiling data could substantially improve model interpretability and predictive performance. Furthermore, this approach may accelerate the discovery of novel histopathological biomarkers for NACT and advance the development of personalized treatment strategies for TNBC patients.
zh
[CV-124] MedBLIP: Fine-tuning BLIP for Medical Image Captioning
【速读】:该论文旨在解决医学图像描述生成(Medical Image Captioning)中的挑战,即在放射科图像上生成临床准确且语义有意义的描述。现有视觉-语言模型(VLMs)如BLIP、BLIP2、Gemini和ViT-GPT2在自然图像数据集上表现良好,但在医疗领域常产生泛化或不精确的描述。解决方案的关键在于对BLIP模型进行领域特定的微调(domain-specific fine-tuning),特别是在ROCO数据集上的实验表明,这种微调显著提升了定量和定性评估指标的表现。研究还发现,仅微调解码器(decoder-only fine-tuning)在保持较高性能的同时,相比全模型微调减少了5%的训练时间,而全模型微调仍能获得最佳结果。
链接: https://arxiv.org/abs/2505.14726
作者: Manshi Limbu,Diwita Banerjee
机构: George Mason University (乔治·梅森大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image captioning is a challenging task that requires generating clinically accurate and semantically meaningful descriptions of radiology images. While recent vision-language models (VLMs) such as BLIP, BLIP2, Gemini and ViT-GPT2 show strong performance on natural image datasets, they often produce generic or imprecise captions when applied to specialized medical domains. In this project, we explore the effectiveness of fine-tuning the BLIP model on the ROCO dataset for improved radiology captioning. We compare the fine-tuned BLIP against its zero-shot version, BLIP-2 base, BLIP-2 Instruct and a ViT-GPT2 transformer baseline. Our results demonstrate that domain-specific fine-tuning on BLIP significantly improves performance across both quantitative and qualitative evaluation metrics. We also visualize decoder cross-attention maps to assess interpretability and conduct an ablation study to evaluate the contributions of encoder-only and decoder-only fine-tuning. Our findings highlight the importance of targeted adaptation for medical applications and suggest that decoder-only fine-tuning (encoder-frozen) offers a strong performance baseline with 5% lower training time than full fine-tuning, while full model fine-tuning still yields the best results overall.
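"冻结编码器、只训练解码器"的 decoder-only 微调设置可以用 Hugging Face 上的公开 BLIP 检查点简要示意;训练循环与 ROCO 数据加载从略,检查点为公开模型而非论文专用权重。

```python
from transformers import BlipProcessor, BlipForConditionalGeneration

# 概念性示意:decoder-only 微调——冻结 BLIP 视觉编码器,仅保留解码器可训练
name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(name)
model = BlipForConditionalGeneration.from_pretrained(name)

for p in model.vision_model.parameters():     # 冻结视觉编码器
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"可训练参数张量数: {len(trainable)}")

# 训练时:outputs = model(pixel_values=..., input_ids=..., labels=...)
# 对 outputs.loss 反向传播即可,梯度只会更新解码器一侧的参数。
```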
zh
[CV-125] ComBAT Harmonization for diffusion MRI: Challenges and Best Practices
【速读】:该论文旨在解决多中心MRI数据在进行测量标准化(harmonization)过程中因站点相关偏差导致的不一致性问题,特别是针对ComBAT方法在实际应用中可能因违反其假设而产生的错误校正问题。其解决方案的关键在于深入分析ComBAT的数学基础及其所依赖的假设,并通过改进版本Pairwise-ComBAT进行实验,评估不同人口特征对校正效果的影响,从而提出五项关键建议以提高数据一致性与可重复性,支持开放科学和临床应用。
链接: https://arxiv.org/abs/2505.14722
作者: Pierre-Marc Jodoin,Manon Edde,Gabriel Girard,Félix Dumais,Guillaume Theaud,Matthieu Dumont,Jean-Christophe Houde,Yoan David,Maxime Descoteaux
机构: Université de Sherbrooke (舍布鲁克大学); Alzheimer’s Disease Neuroimaging Initiative (阿尔茨海默病神经影像倡议)
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
备注:
点击查看摘要
Abstract:Over the years, ComBAT has become the standard method for harmonizing MRI-derived measurements, with its ability to compensate for site-related additive and multiplicative biases while preserving biological variability. However, ComBAT relies on a set of assumptions that, when violated, can result in flawed harmonization. In this paper, we thoroughly review ComBAT’s mathematical foundation, outlining these assumptions, and exploring their implications for the demographic composition necessary for optimal results. Through a series of experiments involving a slightly modified version of ComBAT called Pairwise-ComBAT tailored for normative modeling applications, we assess the impact of various population characteristics, including population size, age distribution, the absence of certain covariates, and the magnitude of additive and multiplicative factors. Based on these experiments, we present five essential recommendations that should be carefully considered to enhance consistency and support reproducibility, two essential factors for open science, collaborative research, and real-life clinical deployment.
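ComBAT 的核心直觉(估计并去除各站点的加性/乘性偏差)可用一个不含经验贝叶斯收缩、不含协变量保留的最简位置-尺度版本示意,这也直观暴露了论文讨论的假设:站点内近似高斯、每个站点样本充足;完整功能请使用正式的 ComBAT 实现,此处仅为演示草图。

```python
import numpy as np

# 概念性示意:朴素的位置-尺度站点校正(单特征、无协变量、无收缩)
def naive_combat(x, site):
    x = x.astype(float)
    grand_mean, grand_std = x.mean(), x.std()
    out = np.empty_like(x)
    for s in np.unique(site):
        m = site == s
        # 去除站点的加性(均值)与乘性(方差)偏差后,重标定回全体分布
        out[m] = (x[m] - x[m].mean()) / x[m].std() * grand_std + grand_mean
    return out

rng = np.random.default_rng(0)
site = np.repeat([0, 1], 100)
x = rng.normal(1.0, 0.5, 200) + np.where(site == 0, 0.3, -0.3)  # 模拟站点偏差
print(naive_combat(x, site).std(), x.std())
```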
zh
[CV-126] Aneumo: A Large-Scale Multimodal Aneurysm Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks
【速读】:该论文旨在解决颅内动脉瘤(Intracranial Aneurysms, IAs)破裂风险评估中血流动力学影响机制不明确以及传统计算流体动力学(Computational Fluid Dynamics, CFD)方法计算成本高、难以应用于大规模或实时临床场景的问题。其解决方案的关键在于构建了一个大规模、高保真的动脉瘤CFD数据集,该数据集基于427个真实动脉瘤几何结构,通过受控变形生成了10,660个三维形态,并在八种稳态质量流量条件下进行CFD计算,生成了85,280组血流动力学数据,同时包含分割掩码以支持多模态输入任务,从而为开发高效的机器学习算法提供数据基础。
链接: https://arxiv.org/abs/2505.14717
作者: Xigui Li,Yuanye Zhou,Feiyang Xiao,Xin Guo,Chen Jiang,Tan Pan,Xingmeng Zhang,Cenyu Liu,Zeyun Miao,Jianchao Ge,Xiansheng Wang,Qimeng Wang,Yichi Zhang,Wenbo Zhang,Fengping Zhu,Limei Han,Yuan Qi,Chensen Lin,Yuan Cheng
机构: Artificial Intelligence Innovation and Incubation Institute, Fudan University (人工智能创新与孵化研究院,复旦大学); Shanghai Academy of Artificial Intelligence for Science (上海人工智能科学研究院); Huashan Hospital, Fudan University (华山医院,复旦大学); Human Phenome Institute, Fudan University (人类表型组研究院,复旦大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Intracranial aneurysms (IAs) are serious cerebrovascular lesions found in approximately 5% of the general population. Their rupture may lead to high mortality. Current methods for assessing IA risk focus on morphological and patient-specific factors, but the hemodynamic influences on IA development and rupture remain unclear. While accurate for hemodynamic studies, conventional computational fluid dynamics (CFD) methods are computationally intensive, hindering their deployment in large-scale or real-time clinical applications. To address this challenge, we curated a large-scale, high-fidelity aneurysm CFD dataset to facilitate the development of efficient machine learning algorithms for such applications. Based on 427 real aneurysm geometries, we synthesized 10,660 3D shapes via controlled deformation to simulate aneurysm evolution. The authenticity of these synthetic shapes was confirmed by neurosurgeons. CFD computations were performed on each shape under eight steady-state mass flow conditions, generating a total of 85,280 blood flow dynamics data covering key parameters. Furthermore, the dataset includes segmentation masks, which can support tasks that use images, point clouds or other multimodal data as input. Additionally, we introduced a benchmark for estimating flow parameters to assess current modeling methods. This dataset aims to advance aneurysm research and promote data-driven approaches in biofluids, biomedical engineering, and clinical risk assessment. The code and dataset are available at: this https URL.
zh
[CV-127] A Hybrid Quantum Classical Pipeline for X Ray Based Fracture Diagnosis
【速读】:该论文试图解决骨骨折诊断中传统X射线解读耗时且易出错,以及现有机器学习和深度学习方法需要大量特征工程、标注数据和高计算资源的问题。解决方案的关键在于提出一种分布式混合量子经典流水线,首先通过主成分分析(Principal Component Analysis, PCA)进行降维,然后利用4量子比特量子幅值编码电路进行特征增强,最终通过融合PCA提取的8个特征与8个量子增强特征形成16维向量,并使用不同机器学习模型实现99%的准确率,同时将特征提取时间减少了82%。
链接: https://arxiv.org/abs/2505.14716
作者: Sahil Tomar,Rajeshwar Tripathi,Sandeep Kumar
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 8 pages
点击查看摘要
Abstract:Bone fractures are a leading cause of morbidity and disability worldwide, imposing significant clinical and economic burdens on healthcare systems. Traditional X-ray interpretation is time consuming and error prone, while existing machine learning and deep learning solutions often demand extensive feature engineering, large, annotated datasets, and high computational resources. To address these challenges, a distributed hybrid quantum-classical pipeline is proposed that first applies Principal Component Analysis (PCA) for dimensionality reduction and then leverages a 4-qubit quantum amplitude encoding circuit for feature enrichment. By fusing eight PCA-derived features with eight quantum-enhanced features into a 16-dimensional vector and then classifying with different machine learning models, the pipeline achieves 99% accuracy on a public multi-region X-ray dataset, on par with state-of-the-art transfer learning models, while reducing feature extraction time by 82%.
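幅值编码与特征融合的流程可用 NumPy 模拟:将 8 维 PCA 特征归一化后编码为 4 比特态的幅值,经过一个固定线路后取测量概率作为量子增强特征;论文的具体线路未公开,此处以 Hadamard 变换代替,输出维度的取法亦为假设。

```python
import numpy as np

def hadamard_n(n):
    # n 比特 Hadamard 变换的 2^n x 2^n 酉矩阵
    h = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    u = np.array([[1.0]])
    for _ in range(n):
        u = np.kron(u, h)
    return u

def quantum_features(x8):
    amp = np.zeros(16)
    amp[:8] = x8
    amp /= np.linalg.norm(amp) + 1e-12        # 幅值编码要求态矢量归一化
    state = hadamard_n(4) @ amp               # 作用固定的 4 比特线路(演示用)
    probs = state**2                          # 实幅值情形下的测量概率
    return probs.reshape(8, 2).sum(1)         # 对末位比特求边缘,得到 8 维特征

x_pca = np.random.rand(8)                     # 假设的 8 维 PCA 特征
fused = np.concatenate([x_pca, quantum_features(x_pca)])  # 16 维融合向量
```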
zh
[CV-128] A Comprehensive Review of Techniques Algorithms Advancements Challenges and Clinical Applications of Multi-modal Medical Image Fusion for Improved Diagnosis
【速读】:该论文旨在解决多模态医学图像融合(Multi-modal Medical Image Fusion, MMIF)在提升诊断精度和临床决策中的技术挑战,其核心问题在于如何有效整合来自不同成像模态(如X-ray、MRI、CT、PET等)的数据,以生成更具临床价值的图像。解决方案的关键在于通过传统融合方法(像素级、特征级、决策级)与现代深度学习、生成模型及基于Transformer的架构相结合,提升融合结果的准确性、鲁棒性和可解释性,同时应对数据异质性、计算复杂度及临床工作流程集成等实际问题。
链接: https://arxiv.org/abs/2505.14715
作者: Muhammad Zubair,Muzammil Hussain,Mousa Ahmad Al-Bashrawi,Malika Bendechache,Muhammad Owais
机构: King Fahd University of Petroleum and Minerals (法赫德国王石油与矿产大学); Al-Ahliyya Amman University (安曼艾哈利亚大学); Department of Information Systems and Operations Management (信息系统与运营管理系); University of Galway (爱尔兰国立高威大学); Khalifa University (哈利法大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Computerized Medical Imaging and Graphics journal submission
点击查看摘要
Abstract:Multi-modal medical image fusion (MMIF) is increasingly recognized as an essential technique for enhancing diagnostic precision and facilitating effective clinical decision-making within computer-aided diagnosis systems. MMIF combines data from X-ray, MRI, CT, PET, SPECT, and ultrasound to create detailed, clinically useful images of patient anatomy and pathology. These integrated representations significantly advance diagnostic accuracy, lesion detection, and segmentation. This comprehensive review meticulously surveys the evolution, methodologies, algorithms, current advancements, and clinical applications of MMIF. We present a critical comparative analysis of traditional fusion approaches, including pixel-, feature-, and decision-level methods, and delve into recent advancements driven by deep learning, generative models, and transformer-based architectures. A critical comparative analysis is presented between these conventional methods and contemporary techniques, highlighting differences in robustness, computational efficiency, and interpretability. The article addresses extensive clinical applications across oncology, neurology, and cardiology, demonstrating MMIF’s vital role in precision medicine through improved patient-specific therapeutic outcomes. Moreover, the review thoroughly investigates the persistent challenges affecting MMIF’s broad adoption, including issues related to data privacy, heterogeneity, computational complexity, interpretability of AI-driven algorithms, and integration within clinical workflows. It also identifies significant future research avenues, such as the integration of explainable AI, adoption of privacy-preserving federated learning frameworks, development of real-time fusion systems, and standardization efforts for regulatory compliance.
zh
人工智能
[AI-0] Neural Conditional Transport Maps
【速读】:该论文旨在解决在概率分布之间学习条件最优传输(conditional optimal transport, OT)映射的问题,尤其关注如何有效处理同时包含类别型和连续型条件变量的场景。其解决方案的关键在于引入了一个能够根据输入条件生成传输层参数的超网络(hypernetwork),从而实现自适应的传输映射,相较于简单的条件方法表现出更优的性能。
链接: https://arxiv.org/abs/2505.15808
作者: Carlos Rodriguez-Pardo,Leonardo Chiani,Emanuele Borgonovo,Massimo Tavoni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Applications (stat.AP); Machine Learning (stat.ML)
备注: Under Review. Supplementary material included in the pdf
点击查看摘要
Abstract:We present a neural framework for learning conditional optimal transport (OT) maps between probability distributions. Our approach introduces a conditioning mechanism capable of processing both categorical and continuous conditioning variables simultaneously. At the core of our method lies a hypernetwork that generates transport layer parameters based on these inputs, creating adaptive mappings that outperform simpler conditioning methods. Comprehensive ablation studies demonstrate the superior performance of our method over baseline configurations. Furthermore, we showcase an application to global sensitivity analysis, offering high performance in computing OT-based sensitivity indices. This work advances the state-of-the-art in conditional optimal transport, enabling broader application of optimal transport principles to complex, high-dimensional domains such as generative modeling and black-box model explainability.
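"超网络根据条件生成传输层参数"的机制可用如下 PyTorch 草图示意:条件向量(类别 one-hot 拼接连续变量)经超网络映射为一个线性传输层的权重与偏置;维度与网络结构均为演示假设,训练所用的 OT 损失从略。

```python
import torch
import torch.nn as nn

# 概念性示意:超网络从条件变量生成传输映射的参数
class CondTransport(nn.Module):
    def __init__(self, dim=2, n_cat=3, n_cont=1, hidden=64):
        super().__init__()
        cond_dim = n_cat + n_cont
        self.hyper = nn.Sequential(               # 超网络:条件 -> 传输层参数
            nn.Linear(cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim * dim + dim))
        self.dim = dim

    def forward(self, x, cond):
        params = self.hyper(cond)                 # (B, dim*dim + dim)
        W = params[:, : self.dim**2].view(-1, self.dim, self.dim)
        b = params[:, self.dim**2:]
        return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1) + b  # 条件化线性传输

model = CondTransport()
x = torch.randn(5, 2)                             # 源分布样本
cond = torch.cat([torch.eye(3)[torch.tensor([0, 1, 2, 0, 1])],
                  torch.randn(5, 1)], dim=1)      # 类别 + 连续条件
y = model(x, cond)                                # 映射到目标分布的样本
```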
zh
[AI-1] Exploring the Innovation Opportunities for Pre-trained Models
【速读】:该论文试图解决在预训练模型(pre-trained models)领域中,由于炒作周期导致难以准确识别AI实际成功应用的问题。其解决方案的关键在于通过分析人机交互(HCI)研究人员开发的科研应用,作为商业成功应用的代理,从而揭示预训练模型在技术能力、机会领域、数据类型及交互设计模式方面的创新空间。
链接: https://arxiv.org/abs/2505.15790
作者: Minjung Park,Jodi Forlizzi,John Zimmerman
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 33 pages, 20 figures, 4 tables, DIS
点击查看摘要
Abstract:Innovators transform the world by understanding where services are successfully meeting customers’ needs and then using this knowledge to identify failsafe opportunities for innovation. Pre-trained models have changed the AI innovation landscape, making it faster and easier to create new AI products and services. Understanding where pre-trained models are successful is critical for supporting AI innovation. Unfortunately, the hype cycle surrounding pre-trained models makes it hard to know where AI can really be successful. To address this, we investigated pre-trained model applications developed by HCI researchers as a proxy for commercially successful applications. The research applications demonstrate technical capabilities, address real user needs, and avoid ethical challenges. Using an artifact analysis approach, we categorized capabilities, opportunity domains, data types, and emerging interaction design patterns, uncovering some of the opportunity space for innovation with pre-trained models.
zh
[AI-2] Improving planning and MBRL with temporally-extended actions
【速读】:该论文试图解决连续时间系统在离散时间动力学建模中因需要小的仿真步长以保持精度而导致的计算负担过重和性能下降问题。其解决方案的关键在于直接控制连续决策的时间尺度,通过使用时序扩展动作,并将动作持续时间作为额外的优化变量与标准动作变量一同纳入规划器中。这一结构不仅加快了轨迹仿真速度,还允许在原始动作层面进行深度搜索,同时在规划器中保持较浅的搜索深度,从而提升了整体效率与效果。
链接: https://arxiv.org/abs/2505.15754
作者: Palash Chatterjee,Roni Khardon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Continuous-time systems are often modeled using discrete-time dynamics but this requires a small simulation step to maintain accuracy. In turn, this requires a large planning horizon which leads to computationally demanding planning problems and reduced performance. Previous work in model-free reinforcement learning has partially addressed this issue using action repeats where a policy is learned to determine a discrete action duration. Instead we propose to control the continuous decision timescale directly by using temporally-extended actions and letting the planner treat the duration of the action as an additional optimization variable along with the standard action variables. This additional structure has multiple advantages. It speeds up simulation time of trajectories and, importantly, it allows for deep horizon search in terms of primitive actions while using a shallow search depth in the planner. In addition, in the model-based reinforcement learning (MBRL) setting, it reduces compounding errors from model learning and improves training time for models. We show that this idea is effective and that the range for action durations can be automatically selected using a multi-armed bandit formulation and integrated into the MBRL framework. An extensive experimental evaluation both in planning and in MBRL, shows that our approach yields faster planning, better solutions, and that it enables solutions to problems that are not solved in the standard formulation.
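"把动作持续时间作为额外优化变量"的规划思路可用一个随机射击规划器示意:每个候选方案同时采样原子动作与其持续时长,较长的时长让浅层规划覆盖更深的原始动作序列;动力学(一维双积分器)、代价与时长范围均为演示假设。

```python
import numpy as np

def simulate(state, plan, dt=0.05):
    pos, vel = state
    cost = 0.0
    for a, dur in plan:                       # 每个原子动作持续 dur 秒
        for _ in range(int(dur / dt)):        # 时长越长,单步覆盖的时间越多
            vel += a * dt
            pos += vel * dt
            cost += dt * (pos**2 + 0.01 * a**2)
    return cost

def plan_with_durations(state, horizon=4, n_samples=256, rng=None):
    rng = rng or np.random.default_rng(0)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        acts = rng.uniform(-1, 1, horizon)
        durs = rng.uniform(0.1, 1.0, horizon)  # 持续时长也是决策变量
        cost = simulate(state, list(zip(acts, durs)))
        if cost < best_cost:
            best, best_cost = list(zip(acts, durs)), cost
    return best, best_cost

plan, cost = plan_with_durations((1.0, 0.0))
```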
zh
[AI-3] Multi-modal Integration Analysis of Alzheimers Disease Using Large Language Models and Knowledge Graphs
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)研究中多模态数据碎片化整合的问题,传统多模态分析依赖于跨数据集的匹配患者ID,而本文提出了一种无需患者ID匹配的框架。解决方案的关键在于利用生成式AI(Generative AI)和知识图谱(knowledge graph)实现跨模态数据的群体层面整合,通过统计分析识别各模态中的显著特征,并将其作为节点构建知识图谱,再由大型语言模型(Large Language Models, LLMs)分析图谱以提取潜在关联并生成可验证的假设。
链接: https://arxiv.org/abs/2505.15747
作者: Kanan Kiguchi,Yunhao Tu,Katsuhiro Ajito,Fady Alnajjar,Kazuyuki Murase
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 38 pages, 8 figures, 4 tables
点击查看摘要
Abstract:We propose a novel framework for integrating fragmented multi-modal data in Alzheimer’s disease (AD) research using large language models (LLMs) and knowledge graphs. While traditional multimodal analysis requires matched patient IDs across datasets, our approach demonstrates population-level integration of MRI, gene expression, biomarkers, EEG, and clinical indicators from independent cohorts. Statistical analysis identified significant features in each modality, which were connected as nodes in a knowledge graph. LLMs then analyzed the graph to extract potential correlations and generate hypotheses in natural language. This approach revealed several novel relationships, including a potential pathway linking metabolic risk factors to tau protein abnormalities via neuroinflammation (r > 0.6, p < 0.001), and unexpected correlations between frontal EEG channels and specific gene expression profiles (r = 0.42-0.58, p < 0.01). Cross-validation with independent datasets confirmed the robustness of major findings, with consistent effect sizes across cohorts (variance < 15%). The reproducibility of these findings was further supported by expert review (Cohen’s κ = 0.82) and computational validation. Our framework enables cross-modal integration at a conceptual level without requiring patient ID matching, offering new possibilities for understanding AD pathology through fragmented data reuse and generating testable hypotheses for future research.
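"显著特征入图、LLM 读图出假设"的流程可用 networkx 简要示意;下面的特征名与相关系数均为虚构示例,LLM 调用以注释占位。

```python
import networkx as nx

# 概念性示意:把各模态显著特征作为节点建图,序列化后交给 LLM 生成假设
G = nx.Graph()
G.add_node("IL-6水平", modality="biomarker")
G.add_node("tau异常", modality="biomarker")
G.add_node("代谢风险因子", modality="clinical")
G.add_edge("代谢风险因子", "IL-6水平", r=0.55)
G.add_edge("IL-6水平", "tau异常", r=0.61)

def serialize(graph):
    lines = [f"{u} -[r={d['r']}]- {v}" for u, v, d in graph.edges(data=True)]
    return "\n".join(lines)

prompt = "根据以下跨模态关联图,提出可检验的AD机制假设:\n" + serialize(G)
# hypotheses = llm.generate(prompt)   # 占位:实际为对 LLM API 的调用
print(prompt)
```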
zh
[AI-4] Higher-order Structure Boosts Link Prediction on Temporal Graphs
【速读】:该论文旨在解决现有时间图神经网络(Temporal Graph Neural Networks, TGNNs)在建模和预测时间图结构时,主要关注成对交互而忽视高阶结构的问题,以及由此导致的效率瓶颈和表达能力受限的问题。其解决方案的关键在于提出一种高阶结构时间图神经网络(Higher-order structure Temporal Graph Neural Network, HTGN),通过引入超图表示来捕捉群体交互,并开发算法识别潜在的高阶结构,同时通过聚合多条边特征生成超边表示,从而有效降低训练过程中的内存消耗。
链接: https://arxiv.org/abs/2505.15746
作者: Jingzhe Liu,Zhigang Hua,Yan Xie,Bingheng Li,Harry Shomer,Yu Song,Kaveh Hassani,Jiliang Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Temporal Graph Neural Networks (TGNNs) have gained growing attention for modeling and predicting structures in temporal graphs. However, existing TGNNs primarily focus on pairwise interactions while overlooking higher-order structures that are integral to link formation and evolution in real-world temporal graphs. Meanwhile, these models often suffer from efficiency bottlenecks, further limiting their expressive power. To tackle these challenges, we propose a Higher-order structure Temporal Graph Neural Network, which incorporates hypergraph representations into temporal graph learning. In particular, we develop an algorithm to identify the underlying higher-order structures, enhancing the model’s ability to capture the group interactions. Furthermore, by aggregating multiple edge features into hyperedge representations, HTGN effectively reduces memory cost during training. We theoretically demonstrate the enhanced expressiveness of our approach and validate its effectiveness and efficiency through extensive experiments on various real-world temporal graphs. Experimental results show that HTGN achieves superior performance on dynamic link prediction while reducing memory costs by up to 50% compared to existing methods.
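"将多条边特征聚合为超边表征"这一降内存的关键操作可用几行 PyTorch 示意:按超边成员关系对边特征求均值,使需要保存的表示数量从边数降为超边数;成员关系与聚合方式均为演示假设。

```python
import torch

# 概念性示意:同一高阶结构(超边)内的边特征做均值聚合
n_edges, d = 10, 16
edge_feat = torch.randn(n_edges, d)                 # 时间图中的边特征
hyperedge_of = torch.tensor([0, 0, 1, 1, 1, 2, 2, 0, 2, 1])  # 每条边所属超边

n_hyper = int(hyperedge_of.max()) + 1
hyper_feat = torch.zeros(n_hyper, d)
hyper_feat.index_add_(0, hyperedge_of, edge_feat)   # 按超边求和
counts = torch.bincount(hyperedge_of, minlength=n_hyper).clamp(min=1)
hyper_feat = hyper_feat / counts.unsqueeze(1)       # 均值聚合:表示数由边数降为超边数
```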
zh
[AI-5] Neuro-Argumentative Learning with Case-Based Reasoning
【速读】:该论文试图解决传统神经网络(Neural Networks, NNs)在可解释性方面的不足,同时提升案例推理(Case-Based Reasoning, CBR)模型的分类能力和灵活性。其解决方案的关键在于提出一种数据驱动的神经符号分类模型——渐进式抽象论证案例推理(Gradual Abstract Argumentation for Case-Based Reasoning, Gradual AA-CBR),该模型通过学习论证辩论结构来决定分类结果,其中每个论证对应训练数据中的一个案例,并通过梯度方法学习论证强度及关系,从而实现多类分类、自动学习特征与数据点重要性、分配不确定性值以及利用所有可用数据点等功能。
链接: https://arxiv.org/abs/2505.15742
作者: Adam Gould,Francesca Toni
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to NeSy25
点击查看摘要
Abstract:We introduce Gradual Abstract Argumentation for Case-Based Reasoning (Gradual AA-CBR), a data-driven, neurosymbolic classification model in which the outcome is determined by an argumentation debate structure that is learned simultaneously with neural-based feature extractors. Each argument in the debate is an observed case from the training data, favouring their labelling. Cases attack or support those with opposing or agreeing labellings, with the strength of each argument and relationship learned through gradient-based methods. This argumentation debate structure provides human-aligned reasoning, improving model interpretability compared to traditional neural networks (NNs). Unlike the existing purely symbolic variant, Abstract Argumentation for Case-Based Reasoning (AA-CBR), Gradual AA-CBR is capable of multi-class classification, automatic learning of feature and data point importance, assigning uncertainty values to outcomes, using all available data points, and does not require binary features. We show that Gradual AA-CBR performs comparably to NNs whilst significantly outperforming existing AA-CBR formulations.
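渐进语义下"论证强度随攻击/支持关系迭代更新"的机制可用如下草图示意;基础得分、聚合函数均为假设,且未实现论文中与神经特征提取器的联合梯度训练。

```python
import numpy as np

# 概念性示意:渐进语义的不动点迭代——支持加强、攻击削弱
base = np.array([0.8, 0.6, 0.7, 0.5])        # 各案例论证的基础得分(论文中可学习)
# attacks[i, j] = 1 表示论证 j 攻击论证 i;supports 同理表示支持关系
attacks = np.array([[0, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=float)
supports = np.zeros_like(attacks)

s = base.copy()
for _ in range(50):                           # 迭代至强度近似收敛
    energy = supports @ s - attacks @ s       # 来自支持者/攻击者的净影响
    s = np.clip(base + 0.5 * energy, 0.0, 1.0)
print(s)                                      # 收敛后的论证强度,用于决定分类结果
```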
zh
[AI-6] HybridProver: Augmenting Theorem Proving with LLM -Driven Proof Synthesis and Refinement
【速读】:该论文试图解决形式化方法在验证关键系统可靠性时面临的劳动密集型手动证明和对定理自动定理证明器使用技能的高要求问题。其解决方案的关键在于提出HybridProver,这是一个结合了基于策略的生成和整体证明合成的双模型证明合成框架,通过融合两种方法的优势,提高了自动化定理证明的效率和效果。
链接: https://arxiv.org/abs/2505.15740
作者: Jilin Hu,Jianyu Zhang,Yongwang Zhao,Talia Ringer
机构: 未知
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:Formal methods is pivotal for verifying the reliability of critical systems through rigorous mathematical proofs. However, its adoption is hindered by labor-intensive manual proofs and the expertise required to use theorem provers. Recent advancements in large language models (LLMs) offer new opportunities for automated theorem proving. Two promising approaches are generating tactics step by step and generating a whole proof directly with an LLM. However, existing work makes no attempt to combine the two approaches. In this work, we introduce HybridProver, a dual-model proof synthesis framework that combines tactic-based generation and whole-proof synthesis to harness the benefits of both approaches. HybridProver generates whole proof candidates for evaluation directly, then extracts proof sketches from those candidates. It then uses a tactic-based generation model that integrates automated tools to complete the sketches via stepwise refinement. We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized Isabelle datasets. Evaluation on the miniF2F dataset illustrates HybridProver’s effectiveness. We achieve a 59.4% success rate on miniF2F, where the previous SOTA is 56.1%. Our ablation studies show that this SOTA result is attributable to combining whole-proof and tactic-based generation. Additionally, we show how the dataset quality, training parameters, and sampling diversity affect the final result during automated theorem proving with LLMs. All of our code, datasets, and LLMs are open source.
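HybridProver 的双模型流程骨架可用占位接口勾勒:整体证明生成失败后抽取草稿、再由策略模型逐步精化;whole_llm、tactic_llm、checker 等均为假设的占位对象,非论文官方实现。

```python
# 概念性示意:整体生成 + 草稿抽取 + 策略级精化的混合证明流程
def hybrid_prove(theorem, whole_llm, tactic_llm, checker, n=8):
    for cand in whole_llm.sample(theorem, n):      # 1) 整体证明候选
        if checker.verify(theorem, cand):
            return cand
        sketch = extract_sketch(cand)              # 2) 从失败候选抽取证明草稿
        proof = refine(theorem, sketch, tactic_llm, checker)
        if proof is not None:                      # 3) 策略级逐步精化成功
            return proof
    return None

def extract_sketch(candidate):
    # 占位:保留结构性步骤,把不可验证的细节留作待补洞
    return [step for step in candidate.split("\n") if step.strip()]

def refine(theorem, sketch, tactic_llm, checker):
    proof = []
    for hole in sketch:                            # 对每个洞逐步生成策略
        tac = tactic_llm.next_tactic(theorem, proof, hole)
        if tac is None or not checker.verify_step(proof, tac):
            return None
        proof.append(tac)
    return "\n".join(proof)
```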
zh
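As a rough illustration of the dual-model loop the abstract describes, here is a hedged Python sketch. Every callable passed in (the two generation models, the proof checker, and the sketch extractor) is a hypothetical interface standing in for the paper's Isabelle-specific components.

```python
def hybrid_prove(theorem, whole_proof_model, tactic_model, checker,
                 extract_sketch, n_candidates=8):
    """Minimal sketch of a HybridProver-style loop. The callables are
    hypothetical: the models map prompts to text, `checker` runs the
    prover on a candidate, and `extract_sketch` returns (sketch, gaps),
    keeping the high-level structure and marking failing steps as gaps.
    """
    # Phase 1: sample whole-proof candidates directly from the LLM.
    candidates = [whole_proof_model(theorem) for _ in range(n_candidates)]
    for proof in candidates:
        if checker(theorem, proof):          # lucky: a complete proof already works
            return proof
    # Phase 2: turn candidates into proof sketches and fill each gap via
    # tactic-based, stepwise refinement (backed by automated tools in the paper).
    for proof in candidates:
        sketch, gaps = extract_sketch(proof)
        for gap in gaps:
            sketch = sketch.replace(gap, tactic_model(theorem, sketch, gap))
        if checker(theorem, sketch):
            return sketch
    return None                              # no candidate could be repaired
```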
[AI-7] A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO
【Quick Read】: This paper theoretically investigates the effect of noisy labels in offline alignment, focusing on the interplay between privacy protection and adversarial corruption. The key to its solution is a unified analysis framework, built on linear modeling assumptions, that reduces the offline alignment problem to parameter estimation in logistic regression. This framework reveals a key separation between the two privacy-corruption scenarios (LTC and CTL), showing that LTC is more challenging than CTL in offline alignment, even under simple linear models.
Link: https://arxiv.org/abs/2505.15694
Authors: Xingyu Zhou, Yulian Wu, Francesco Orabona
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
zh
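For readers unfamiliar with the reduction, the standard linear-reward route from preference labels to logistic regression looks as follows. This is a generic sketch consistent with the abstract, not the paper's exact notation:

```latex
% With a linear reward r_\theta(x, a) = \langle \theta, \phi(x, a) \rangle,
% a Bradley-Terry preference label y over responses a^1 \succ a^0 is a
% logistic-regression observation on the difference feature
% z = \phi(x, a^1) - \phi(x, a^0):
\mathbb{P}\left(y = 1 \mid x, a^0, a^1\right)
  = \sigma\!\left(\langle \theta, \phi(x, a^1) - \phi(x, a^0) \rangle\right),
\qquad \sigma(t) = \frac{1}{1 + e^{-t}}
```

Estimating θ from (possibly privatized and/or corrupted) labels y is then exactly logistic regression under label noise, which is what enables the unified LTC/CTL analysis.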
[AI-8] Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives
【Quick Read】: This paper addresses how to effectively translate absolute liveness specifications into average-reward objectives for infinite-horizon continuing tasks, enabling model-free RL without periodic resets. Existing approaches typically apply discounted-reward RL in episodic settings, which mismatches the semantics of omega-regular specifications defined over infinite behavior traces. The key contribution is the first model-free RL framework that converts absolute liveness specifications into average-reward objectives, guarantees convergence in unknown communicating Markov decision processes (communicating MDPs), and supports on-the-fly reductions that require no full knowledge of the environment, enabling efficient learning in the continuing setting.
Link: https://arxiv.org/abs/2505.15693
Authors: Milad Kazemi, Mateo Perez, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Alvaro Velasquez
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 29 pages, 6 figures and 2 tables
Click to view the abstract
Abstract:Recent advances in reinforcement learning (RL) have renewed focus on the design of reward functions that shape agent behavior. Manually designing reward functions is tedious and error-prone. A principled alternative is to specify behaviors in a formal language that can be automatically translated into rewards. Omega-regular languages are a natural choice for this purpose, given their established role in formal verification and synthesis. However, existing methods using omega-regular specifications typically rely on discounted reward RL in episodic settings, with periodic resets. This setup misaligns with the semantics of omega-regular specifications, which describe properties over infinite behavior traces. In such cases, the average reward criterion and the continuing setting – where the agent interacts with the environment over a single, uninterrupted lifetime – are more appropriate. To address the challenges of infinite-horizon, continuing tasks, we focus on absolute liveness specifications – a subclass of omega-regular languages that cannot be violated by any finite behavior prefix, making them well-suited to the continuing setting. We present the first model-free RL framework that translates absolute liveness specifications to average-reward objectives. Our approach enables learning in communicating MDPs without episodic resetting. We also introduce a reward structure for lexicographic multi-objective optimization, aiming to maximize an external average-reward objective among the policies that also maximize the satisfaction probability of a given omega-regular specification. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full knowledge of the environment, thus enabling model-free RL. Empirical results show our average-reward approach in continuing setting outperforms discount-based methods across benchmarks.
zh
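To make the average-reward criterion concrete, below is a generic differential Q-learning loop for the continuing, reset-free setting. This is textbook average-reward RL, not the paper's omega-regular-to-reward reduction; the environment API (`reset`/`step` returning state and reward) is an assumption for illustration.

```python
import numpy as np

def differential_q_learning(env, n_states, n_actions, steps=100_000,
                            alpha=0.1, eta=0.1, eps=0.1):
    """Tabular average-reward (differential) Q-learning: instead of
    discounting, subtract a running estimate r_bar of the gain."""
    Q = np.zeros((n_states, n_actions))
    r_bar = 0.0                        # running estimate of the average reward
    s = env.reset()                    # single uninterrupted lifetime afterwards
    rng = np.random.default_rng(0)
    for _ in range(steps):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = env.step(a)        # assumed API: returns (next_state, reward)
        delta = r - r_bar + Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * delta       # differential TD update
        r_bar += eta * alpha * delta   # track the gain alongside the values
        s = s_next
    return Q, r_bar
```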
[AI-9] LCDB 1.1: A Database Illustrating Learning Curves Are More Ill-Behaved Than Previously Thought
【Quick Read】: This paper examines the stability and predictability of learning curves: they are conventionally assumed to be monotone (improving with more data) and convex, yet in practice many curves exhibit "ill-behavior" that violates these properties. The key to the solution is the construction of LCDB 1.1, a large-scale database of high-resolution learning curves, combined with statistically rigorous methods for analyzing curve behavior, which together reveal how widespread ill-behavior is and how it affects downstream tasks.
Link: https://arxiv.org/abs/2505.15657
Authors: Cheng Yan, Felix Mohr, Tom Viering
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Sample-wise learning curves plot performance versus training set size. They are useful for studying scaling laws and speeding up hyperparameter tuning and model selection. Learning curves are often assumed to be well-behaved: monotone (i.e. improving with more data) and convex. By constructing the Learning Curves Database 1.1 (LCDB 1.1), a large-scale database with high-resolution learning curves, we show that learning curves are less often well-behaved than previously thought. Using statistically rigorous methods, we observe significant ill-behavior in approximately 14% of the learning curves, almost twice as much as in previous estimates. We also identify which learners are to blame and show that specific learners are more ill-behaved than others. Additionally, we demonstrate that different feature scalings rarely resolve ill-behavior. We evaluate the impact of ill-behavior on downstream tasks, such as learning curve fitting and model selection, and find it poses significant challenges, underscoring the relevance and potential of LCDB 1.1 as a challenging benchmark for future research.
zh
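A toy version of one ill-behavior check is easy to write down. The sketch below counts monotonicity violations from point estimates only; LCDB 1.1's actual analysis uses statistically rigorous tests over repeated measurements, so treat this as illustrative.

```python
import numpy as np

def monotonicity_violations(sizes, scores, tol=1e-3):
    """Count anchor pairs where more training data made an error-style
    score *worse* (one kind of learning-curve ill-behavior)."""
    sizes = np.asarray(sizes, dtype=float)
    scores = np.asarray(scores, dtype=float)   # e.g., test error per anchor
    order = np.argsort(sizes)
    s = scores[order]
    dips = np.diff(s) > tol                    # error increased with more data
    return int(dips.sum()), np.flatnonzero(dips)

# Example: a curve with a bump around the middle anchors.
n, idx = monotonicity_violations([16, 32, 64, 128, 256],
                                 [0.40, 0.31, 0.35, 0.24, 0.20])
print(n, idx)   # -> 1 violation, between the 32- and 64-sample anchors
```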
[AI-10] Second-Order Convergence in Private Stochastic Non-Convex Optimization
【Quick Read】: This paper targets the problem of finding second-order stationary points (SOSP) in differentially private (DP) stochastic non-convex optimization. Existing methods have two key limitations: (i) inaccurate convergence error rates, because gradient variance is overlooked in the saddle-point escape analysis, and (ii) reliance on auxiliary private model selection procedures to identify DP-SOSPs, which substantially degrades utility, particularly in distributed settings. The key to the solution is a perturbed stochastic gradient descent (PSGD) framework built on Gaussian noise injection and general gradient oracles, which uses the model drift distance to decide whether PSGD has escaped a saddle point, so the algorithm converges to approximate local minima without second-order information or a separate DP-SOSP identification step.
Link: https://arxiv.org/abs/2505.15647
Authors: Youming Tao, Zuyuan Zhang, Dongxiao Yu, Xiuzhen Cheng, Falko Dressler, Di Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:We investigate the problem of finding second-order stationary points (SOSP) in differentially private (DP) stochastic non-convex optimization. Existing methods suffer from two key limitations: (i) inaccurate convergence error rate due to overlooking gradient variance in the saddle point escape analysis, and (ii) dependence on auxiliary private model selection procedures for identifying DP-SOSP, which can significantly impair utility, particularly in distributed settings. To address these issues, we propose a generic perturbed stochastic gradient descent (PSGD) framework built upon Gaussian noise injection and general gradient oracles. A core innovation of our framework is using model drift distance to determine whether PSGD escapes saddle points, ensuring convergence to approximate local minima without relying on second-order information or additional DP-SOSP identification. By leveraging the adaptive DP-SPIDER estimator as a specific gradient oracle, we develop a new DP algorithm that rectifies the convergence error rates reported in prior work. We further extend this algorithm to distributed learning with arbitrarily heterogeneous data, providing the first formal guarantees for finding DP-SOSP in such settings. Our analysis also highlights the detrimental impacts of private selection procedures in distributed learning under high-dimensional models, underscoring the practical benefits of our design. Numerical experiments on real-world datasets validate the efficacy of our approach.
zh
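The drift-based escape rule can be sketched as follows. The constants, the concrete stopping test, and the plain Gaussian noise (rather than a DP-calibrated oracle such as the adaptive DP-SPIDER estimator the paper uses) are all illustrative assumptions.

```python
import numpy as np

def perturbed_sgd(grad_oracle, theta0, steps, lr, sigma,
                  drift_window, drift_thresh, rng):
    """PSGD sketch: inject Gaussian noise, then periodically test the
    model drift distance from the episode's anchor point. Large drift
    suggests a saddle was escaped; small drift suggests we are near an
    approximate local minimum and can stop."""
    theta = theta0.copy()
    anchor = theta0.copy()    # where the current escape episode started
    t_anchor = 0
    for t in range(steps):
        g = grad_oracle(theta) + rng.normal(0.0, sigma, size=theta.shape)
        theta -= lr * g
        if t - t_anchor >= drift_window:
            drift = np.linalg.norm(theta - anchor)
            if drift <= drift_thresh:       # barely moved: likely a local minimum
                return theta
            anchor, t_anchor = theta.copy(), t   # moved far: start a new episode
    return theta
```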
[AI-11] Exploring LLM-Generated Feedback for Economics Essays: How Teaching Assistants Evaluate and Envision Its Use
【Quick Read】: This paper investigates how AI-generated feedback can serve as suggestions that speed up and improve feedback provision by human instructors. The key to the solution is an LLM-powered feedback engine that generates feedback on students' short essay assignments based on the grading rubrics used by the teaching assistants (TAs). The study underscores the importance of giving the AI detailed rubrics to obtain high-quality feedback, and has TAs evaluate and contrast AI feedback with their handwritten feedback during their regular grading work to explore its potential place in their workflows.
Link: https://arxiv.org/abs/2505.15596
Authors: Xinyi Lu, Aditya Mahesh, Zejia Shen, Mitchell Dudley, Larissa Sano, Xu Wang
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To be published in AIED'2025: In Proceedings of the 26th International Conference on Artificial Intelligence in Education. The system prompt and example feedback can be found through this http URL
Click to view the abstract
Abstract:This project examines the prospect of using AI-generated feedback as suggestions to expedite and enhance human instructors’ feedback provision. In particular, we focus on understanding the teaching assistants’ perspectives on the quality of AI-generated feedback and how they may or may not utilize AI feedback in their own workflows. We situate our work in a foundational college Economics class, which has frequent short essay assignments. We developed an LLM-powered feedback engine that generates feedback on students’ essays based on grading rubrics used by the teaching assistants (TAs). To ensure that TAs can meaningfully critique and engage with the AI feedback, we had them complete their regular grading jobs. For a randomly selected set of essays that they had graded, we used our feedback engine to generate feedback and displayed the feedback as in-text comments in a Word document. We then performed think-aloud studies with 5 TAs over 20 1-hour sessions to have them evaluate the AI feedback, contrast the AI feedback with their handwritten feedback, and share how they envision using the AI feedback if they were offered as suggestions. The study highlights the importance of providing detailed rubrics for AI to generate high-quality feedback for knowledge-intensive essays. TAs considered that using AI feedback as suggestions during their grading could expedite grading, enhance consistency, and improve overall feedback quality. We discuss the importance of decomposing the feedback generation task into steps and presenting intermediate results, in order for TAs to use the AI feedback.
zh
[AI-12] World Models as Reference Trajectories for Rapid Motor Adaptation
【Quick Read】: This paper addresses a fundamental challenge in deploying learned control policies in real-world environments: when system dynamics change unexpectedly, performance degrades until the model is retrained on new data. The key to the solution is Reflexive World Models (RWM), a dual-control framework that uses world-model predictions as implicit reference trajectories for rapid adaptation. The approach separates control into long-term reward maximization via reinforcement learning and robust motor execution via rapid latent control, achieving significantly faster adaptation than model-based RL baselines at low online computational cost while maintaining near-optimal performance.
Link: https://arxiv.org/abs/2505.15589
Authors: Carlos Stein Brito, Daniel McNamee
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:
Click to view the abstract
Abstract:Deploying learned control policies in real-world environments poses a fundamental challenge. When system dynamics change unexpectedly, performance degrades until models are retrained on new data. We introduce Reflexive World Models (RWM), a dual control framework that uses world model predictions as implicit reference trajectories for rapid adaptation. Our method separates the control problem into long-term reward maximization through reinforcement learning and robust motor execution through rapid latent control. This dual architecture achieves significantly faster adaptation with low online computational cost compared to model-based RL baselines, while maintaining near-optimal performance. The approach combines the benefits of flexible policy learning through reinforcement learning with rapid error correction capabilities, providing a principled approach to maintaining performance in high-dimensional continuous control tasks under varying dynamics.
zh
[AI-13] Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback
【Quick Read】: This paper aims to overcome the poor search efficiency and weak generalization of existing data-to-equation (Data2Eqn) methods on small task-specific datasets, as well as the limited domain adaptability and mathematical-semantics understanding of foundation models. The key to the solution is a reinforcement-learning-based finetuning framework that directly optimizes the pretrained model's generation policy with reward signals derived from downstream numerical fitness, allowing the model to adapt to specific, complex data distributions and to generate mathematically meaningful equations.
Link: https://arxiv.org/abs/2505.15572
Authors: Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Haifeng Chen, Yanjie Fu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:The data-to-equation (Data2Eqn) task aims to discover interpretable mathematical equations that map observed values to labels, offering physical insights and broad applicability across academic and industrial domains. Genetic programming and traditional deep learning-based approaches suffer from search inefficiency and poor generalization on small task-specific datasets. Foundation models showed promise in this area, but existing approaches suffer from: 1) They are pretrained on general-purpose data distributions, making them less effective for domain-specific tasks; and 2) their training objectives focus on token-level alignment, overlooking mathematical semantics, which can lead to inaccurate equations. To address these issues, we aim to enhance the domain adaptability of foundation models for Data2Eqn tasks. In this work, we propose a reinforcement learning-based finetuning framework that directly optimizes the generation policy of a pretrained model through reward signals derived from downstream numerical fitness. Our method allows the model to adapt to specific and complex data distributions and generate mathematically meaningful equations. Extensive experiments demonstrate that our approach improves both the accuracy and robustness of equation generation under complex distributions.
zh
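A plausible form of the numerical-fitness reward is sketched below; the paper's exact reward may differ, and both `fitness_reward` and its normalization are assumptions made for illustration.

```python
import numpy as np

def fitness_reward(equation, X, y, eps=1e-9):
    """Score a candidate equation by downstream numerical fitness on the
    task's own data. `equation` is any callable mapping an (n, d) array
    to n predictions; invalid or unstable equations earn zero reward."""
    try:
        pred = equation(X)
        nmse = np.mean((pred - y) ** 2) / (np.var(y) + eps)
        return 1.0 / (1.0 + nmse)    # in (0, 1]; better fit -> higher reward
    except (ValueError, FloatingPointError, OverflowError):
        return 0.0

# Example: reward for the candidate f(x) = 2*x0 + 1 on matching toy data.
X = np.random.randn(100, 1)
y = 2 * X[:, 0] + 1
print(fitness_reward(lambda X: 2 * X[:, 0] + 1, X, y))   # close to 1.0
```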
[AI-14] Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes
【Quick Read】: This paper aims to push performance on symbolic music understanding and conditional music generation, particularly in accuracy and F1 score. The key to the solution is pretraining Moonbeam, a transformer-based foundation model for symbolic music that captures both absolute and relative musical attributes through a novel domain-knowledge-inspired tokenization method and Multidimensional Relative Attention (MRA), injecting music-domain inductive biases without additional trainable parameters. The paper further proposes two finetuning architectures with full anticipatory capabilities, targeting different downstream tasks.
Link: https://arxiv.org/abs/2505.15559
Authors: Zixun Guo, Simon Dixon
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Click to view the abstract
Abstract:Moonbeam is a transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes through the introduction of a novel domain-knowledge-inspired tokenization method and Multidimensional Relative Attention (MRA), which captures relative music information without additional trainable parameters. Leveraging the pretrained Moonbeam, we propose 2 finetuning architectures with full anticipatory capabilities, targeting 2 categories of downstream tasks: symbolic music understanding and conditional music generation (including music infilling). Our model outperforms other large-scale pretrained music models in most cases in terms of accuracy and F1 score across 3 downstream music classification tasks on 4 datasets. Moreover, our finetuned conditional music generation model outperforms a strong transformer baseline with a REMI-like tokenizer. We open-source the code, pretrained model, and generated samples on Github.
zh
[AI-15] Robo-DM: Data Management For Large Robot Datasets ICRA2025
【Quick Read】: This paper addresses the efficiency of storing, distributing, and loading large-scale robot trajectory data, which mixes video, text, and numerical modalities. The key to the solution is Robo-DM, an efficient open-source cloud-based data management toolkit that stores robot data self-contained in the Extensible Binary Meta Language (EBML) format, substantially reduces data volume and transfer cost through lossy and lossless compression, and speeds up data retrieval, and thereby training, via load-balanced video decoding with memory-mapped decoding caches.
Link: https://arxiv.org/abs/2505.15558
Authors: Kaiyuan Chen, Letian Fu, David Huang, Yanxiang Zhang, Lawrence Yunliang Chen, Huang Huang, Kush Hari, Ashwin Balakrishna, Ted Xiao, Pannag R Sanketi, John Kubiatowicz, Ken Goldberg
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Best paper finalist of IEEE ICRA 2025
Click to view the abstract
Abstract:Recent results suggest that very large datasets of teleoperated robot demonstrations can be used to train transformer-based models that have the potential to generalize to new scenes, robots, and tasks. However, curating, distributing, and loading large datasets of robot trajectories, which typically consist of video, textual, and numerical modalities - including streams from multiple cameras - remains challenging. We propose Robo-DM, an efficient open-source cloud-based data management toolkit for collecting, sharing, and learning with robot data. With Robo-DM, robot datasets are stored in a self-contained format with Extensible Binary Meta Language (EBML). Robo-DM can significantly reduce the size of robot trajectory data, transfer costs, and data load time during training. Compared to the RLDS format used in OXE datasets, Robo-DM’s compression saves space by up to 70x (lossy) and 3.5x (lossless). Robo-DM also accelerates data retrieval by load-balancing video decoding with memory-mapped decoding caches. Compared to LeRobot, a framework that also uses lossy video compression, Robo-DM is up to 50x faster when decoding sequentially. We physically evaluate a model trained by Robo-DM with lossy compression, a pick-and-place task, and In-Context Robot Transformer. Robo-DM uses 75x compression of the original dataset and does not suffer reduction in downstream task accuracy.
zh
[AI-16] Oversmoothing, "Oversquashing", Heterophily, Long-Range, and more: Demystifying Common Beliefs in Graph Machine Learning
【Quick Read】: This position paper targets the misunderstandings and ambiguities around message passing in graph machine learning, in particular the confusion caused by commonly accepted assumptions and beliefs about oversmoothing, oversquashing, the homophily-heterophily dichotomy, and long-range tasks. The key to the contribution is making these widely held but not-always-true assumptions explicit and encouraging critical thinking about them, supported by simple yet noteworthy counterexamples, thereby clarifying the distinctions between the different problems and promoting separate but intertwined research directions to address them.
Link: https://arxiv.org/abs/2505.15547
Authors: Adrian Arnaiz-Rodriguez, Federico Errica
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:After a renaissance phase in which researchers revisited the message-passing paradigm through the lens of deep learning, the graph machine learning community shifted its attention towards a deeper and practical understanding of message-passing’s benefits and limitations. In this position paper, we notice how the fast pace of progress around the topics of oversmoothing and oversquashing, the homophily-heterophily dichotomy, and long-range tasks, came with the consolidation of commonly accepted beliefs and assumptions that are not always true nor easy to distinguish from each other. We argue that this has led to ambiguities around the investigated problems, preventing researchers from focusing on and addressing precise research questions while causing a good amount of misunderstandings. Our contribution wants to make such common beliefs explicit and encourage critical thinking around these topics, supported by simple but noteworthy counterexamples. The hope is to clarify the distinction between the different issues and promote separate but intertwined research directions to address them.
zh
[AI-17] AM-PPO: (Advantage) Alpha-Modulation with Proximal Policy Optimization
【Quick Read】: This paper addresses the sensitivity of Proximal Policy Optimization (PPO) to advantage estimates, in particular the degraded learning performance caused by high variance, noise, and scale issues in the raw advantage signal. The key to the solution is Advantage Modulation PPO (AM-PPO), which adaptively modulates advantage estimates with a dynamic, non-linear scaling mechanism: an alpha controller adjusts the scaling factor based on evolving statistical properties of the advantage signal (such as its norm, variance, and a predefined target saturation level), and a tanh-based gating function reshapes the advantages, stabilizing gradient updates and improving the conditioning of the policy-gradient landscape.
Link: https://arxiv.org/abs/2505.15514
Authors: Soham Sane
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages, 4 Tables, 9 Figures, 11 equations
Click to view the abstract
Abstract:Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm that heavily relies on accurate advantage estimates for stable and efficient training. However, raw advantage signals can exhibit significant variance, noise, and scale-related issues, impeding optimal learning performance. To address this challenge, we introduce Advantage Modulation PPO (AM-PPO), a novel enhancement of PPO that adaptively modulates advantage estimates using a dynamic, non-linear scaling mechanism. This adaptive modulation employs an alpha controller that dynamically adjusts the scaling factor based on evolving statistical properties of the advantage signals, such as their norm, variance, and a predefined target saturation level. By incorporating a tanh-based gating function driven by these adaptively scaled advantages, AM-PPO reshapes the advantage signals to stabilize gradient updates and improve the conditioning of the policy gradient landscape. Crucially, this modulation also influences value function training by providing consistent and adaptively conditioned learning targets. Empirical evaluations across standard continuous control benchmarks demonstrate that AM-PPO achieves superior reward trajectories, exhibits sustained learning progression, and significantly reduces the clipping required by adaptive optimizers. These findings underscore the potential of advantage modulation as a broadly applicable technique for enhancing reinforcement learning optimization.
zh
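The modulation pipeline can be sketched in a few lines. The RMS-style normalization and the proportional alpha-controller law below are assumptions standing in for the paper's exact update rule; only the overall shape (scale, tanh-gate, adapt alpha toward a target saturation) follows the abstract.

```python
import numpy as np

def modulate_advantages(adv, alpha, target_sat=0.7, alpha_lr=0.05, eps=1e-8):
    """Reshape one batch of raw PPO advantages and update the alpha state."""
    norm = np.linalg.norm(adv) / np.sqrt(adv.size) + eps   # RMS-style scale
    gated = np.tanh(alpha * adv / norm)                    # bounded, reshaped signal
    saturation = np.mean(np.abs(gated))                    # how "full" the gate is
    alpha *= 1.0 + alpha_lr * (target_sat - saturation)    # push toward the target
    return gated, alpha

# Example: one PPO iteration's noisy, badly scaled advantages being reshaped.
adv = np.random.randn(2048) * 5.0
alpha = 1.0
mod_adv, alpha = modulate_advantages(adv, alpha)
print(mod_adv.min(), mod_adv.max(), alpha)   # outputs bounded in (-1, 1)
```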
[AI-18] A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics
【Quick Read】: This paper examines how well code language models generate code comments in non-English contexts, and how reliably they can be adopted and integrated into multilingual workflows. The key to the study is an open-coding analysis that identifies the error categories in model-generated code comments and assesses how well existing standard metrics capture comment correctness across languages. The study builds a dataset of 12,500 labeled generations, revealing differences in language cohesion, informativeness, and syntax adherence across natural languages, and showing that current neural metrics fail to reliably distinguish meaningful completions from random noise.
Link: https://arxiv.org/abs/2505.15469
Authors: Jonathan Katzy, Yongcheng Huang, Gopal-Raj Panchu, Maksym Ziemlewski, Paris Loizides, Sander Vermeulen, Arie van Deursen, Maliheh Izadi
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted PROMISE '25
Click to view the abstract
Abstract:Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models, CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2 across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identified a taxonomy of 26 distinct error categories in model-generated code comments. They highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.
zh
[AI-19] Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries
【Quick Read】: This paper targets the detectability problem of knowledge extraction attacks on Retrieval-Augmented Generation (RAG) systems in the face of privacy risks: existing extraction methods rely on malicious inputs (such as prompt injection or jailbreaking) and are therefore easy to detect. The key to the solution is the Implicit Knowledge Extraction Attack (IKEA), which extracts private knowledge from RAG systems through benign-looking queries. IKEA's core mechanisms are Experience Reflection Sampling, which samples anchor concepts based on historical query-response patterns, and Trust Region Directed Mutation, which iteratively mutates anchor concepts under similarity constraints; together they explore the RAG system's embedding space effectively and raise both attack efficiency and success rate.
Link: https://arxiv.org/abs/2505.15420
Authors: Yuhao Wang, Wenjie Qu, Yanze Jiang, Zichen Liu, Yue Liu, Shengfang Zhai, Yinpeng Dong, Jiaheng Zhang
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by incorporating external knowledge bases, but they are vulnerable to privacy risks from data extraction attacks. Existing extraction methods typically rely on malicious inputs such as prompt injection or jailbreaking, making them easily detectable via input- or output-level detection. In this paper, we introduce Implicit Knowledge Extraction Attack (IKEA), which conducts knowledge extraction on RAG systems through benign queries. IKEA first leverages anchor concepts to generate queries with the natural appearance, and then designs two mechanisms to lead to anchor concept thoroughly ‘explore’ the RAG’s privacy knowledge: (1) Experience Reflection Sampling, which samples anchor concepts based on past query-response patterns to ensure the queries’ relevance to RAG documents; (2) Trust Region Directed Mutation, which iteratively mutates anchor concepts under similarity constraints to further exploit the embedding space. Extensive experiments demonstrate IKEA’s effectiveness under various defenses, surpassing baselines by over 80% in extraction efficiency and 90% in attack success rate. Moreover, the substitute RAG system built from IKEA’s extractions consistently outperforms those based on baseline methods across multiple evaluation tasks, underscoring the significant privacy risk in RAG systems.
zh
[AI-20] Guided Policy Optimization under Partial Observability
【Quick Read】: This paper addresses the challenges that partial observability poses for reinforcement learning (RL), in particular the complexity of learning under uncertainty, where existing methods fall short in exploiting the additional information available in simulation. The key to the solution is Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner: the guider exploits privileged information while staying aligned with the learner's policy, which is trained primarily via imitation learning. Theoretical analysis shows that this learning scheme achieves optimality comparable to direct RL, overcoming key limitations of existing approaches.
Link: https://arxiv.org/abs/2505.15418
Authors: Yueheng Li, Guangming Xie, Zongqing Lu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 24 pages, 13 figures
Click to view the abstract
Abstract:Reinforcement Learning (RL) in partially observable environments poses significant challenges due to the complexity of learning under uncertainty. While additional information, such as that available in simulations, can enhance training, effectively leveraging it remains an open problem. To address this, we introduce Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner. The guider takes advantage of privileged information while ensuring alignment with the learner’s policy that is primarily trained via imitation learning. We theoretically demonstrate that this learning scheme achieves optimality comparable to direct RL, thereby overcoming key limitations inherent in existing approaches. Empirical evaluations show strong performance of GPO across various tasks, including continuous control with partial observability and noise, and memory-based challenges, significantly outperforming existing methods.
zh
[AI-21] Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models UAI
【Quick Read】: This paper addresses the lack of systematic, quantitative safety evaluation for Large Audio Language Models (LAMs), especially against jailbreak attacks. The key to the solution is AJailBench, the first benchmark designed specifically to evaluate jailbreak vulnerabilities in LAMs, together with a dynamic adversarial-sample generation method, the Audio Perturbation Toolkit (APT), which applies targeted distortions in the time, frequency, and amplitude domains while enforcing semantic consistency, efficiently searching for perturbations that are both subtle and highly effective, thereby simulating more realistic attack conditions and exposing the safety weaknesses of LAMs.
Link: https://arxiv.org/abs/2505.15406
Authors: Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, Zhenhao Chen, Xiuying Chen
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: We release AJailBench, including both static and optimized adversarial data, to facilitate future research: this https URL
Click to view the abstract
Abstract:The rise of Large Audio Language Models (LAMs) brings both potential and risks, as their audio outputs may contain harmful or unethical content. However, current research lacks a systematic, quantitative evaluation of LAM safety especially against jailbreak attacks, which are challenging due to the temporal and semantic nature of speech. To bridge this gap, we introduce AJailBench, the first benchmark specifically designed to evaluate jailbreak vulnerabilities in LAMs. We begin by constructing AJailBench-Base, a dataset of 1,495 adversarial audio prompts spanning 10 policy-violating categories, converted from textual jailbreak attacks using realistic text to speech synthesis. Using this dataset, we evaluate several state-of-the-art LAMs and reveal that none exhibit consistent robustness across attacks. To further strengthen jailbreak testing and simulate more realistic attack conditions, we propose a method to generate dynamic adversarial variants. Our Audio Perturbation Toolkit (APT) applies targeted distortions across time, frequency, and amplitude domains. To preserve the original jailbreak intent, we enforce a semantic consistency constraint and employ Bayesian optimization to efficiently search for perturbations that are both subtle and highly effective. This results in AJailBench-APT, an extended dataset of optimized adversarial audio samples. Our findings demonstrate that even small, semantically preserved perturbations can significantly reduce the safety performance of leading LAMs, underscoring the need for more robust and semantically aware defense mechanisms.
zh
[AI-22] Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
【Quick Read】: This paper targets the inference latency of autoregressive speech synthesis models, which stems from their sequential next-token prediction. The key to the solution is Speech Speculative Decoding (SSD), a new framework in which a lightweight draft model generates candidate token sequences that the target model then verifies in parallel, yielding a significant inference speedup.
Link: https://arxiv.org/abs/2505.15380
Authors: Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 4 figures
Click to view the abstract
Abstract:Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their deployment in scenarios where inference speed is critical. In this work, we propose Speech Speculative Decoding (SSD), a novel framework for autoregressive speech synthesis acceleration. Specifically, our method employs a lightweight draft model to generate candidate token sequences, which are subsequently verified in parallel by the target model using the proposed SSD framework. Experimental results demonstrate that SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding, while maintaining high fidelity and naturalness. Subjective evaluations further validate the effectiveness of SSD in preserving the perceptual quality of the target model while accelerating inference.
zh
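The draft-then-verify pattern is the standard speculative-decoding recipe, sketched below for greedy decoding. The paper's exact verification rule and model interfaces may differ; here `target` and `draft` are assumed to map a token sequence to per-position logits of shape (T, V).

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, prefix, k=4):
    """One step: the draft proposes k tokens autoregressively, the target
    scores the whole sequence in a single parallel pass, and the longest
    agreeing run is accepted plus one corrected token from the target."""
    # 1. Draft k candidate tokens with the cheap model.
    seq = prefix.clone()
    for _ in range(k):
        nxt = draft(seq)[-1].argmax()
        seq = torch.cat([seq, nxt.view(1)])
    # 2. One parallel pass of the expensive target over prefix + draft.
    tgt_logits = target(seq)
    tgt_pred = tgt_logits[len(prefix) - 1:-1].argmax(-1)  # target's pick per drafted slot
    drafted = seq[len(prefix):]
    # 3. Accept the longest prefix on which draft and target agree.
    agree = (tgt_pred == drafted).long().cumprod(0)
    n_ok = int(agree.sum())
    accepted = drafted[:n_ok]
    correction = (tgt_pred[n_ok].view(1) if n_ok < k
                  else tgt_logits[-1].argmax().view(1))
    return torch.cat([prefix, accepted, correction])
```

Because at least one token is always emitted per target pass and often several, the wall-clock cost per token drops while greedy outputs stay unchanged.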
[AI-23] Hadamax Encoding: Elevating Performance in Model-Free Atari
【Quick Read】: This paper addresses the limited performance gains that network architectures have so far offered in pixel-based model-free reinforcement learning. The key to the solution is a new encoder architecture, Hadamax (Hadamard max-pooling), which achieves state-of-the-art performance by max-pooling Hadamard products between GELU-activated parallel hidden layers.
Link: https://arxiv.org/abs/2505.15345
Authors: Jacob E. Kooi, Zhao Yang, Vincent François-Lavet
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Neural network architectures have a large impact in machine learning. In reinforcement learning, network architectures have remained notably simple, as changes often lead to small gains in performance. This work introduces a novel encoder architecture for pixel-based model-free reinforcement learning. The Hadamax (Hadamard max-pooling) encoder achieves state-of-the-art performance by max-pooling Hadamard products between GELU-activated parallel hidden layers. Based on the recent PQN algorithm, the Hadamax encoder achieves state-of-the-art model-free performance in the Atari-57 benchmark. Specifically, without applying any algorithmic hyperparameter modifications, Hadamax-PQN achieves an 80% performance gain over vanilla PQN and significantly surpasses Rainbow-DQN. For reproducibility, the full code is available on GitHub (this https URL).
zh
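The abstract pins the core operation down enough for a faithful-looking sketch: max-pool the Hadamard product of two GELU-activated parallel branches. The channel widths and kernel sizes below are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HadamaxBlock(nn.Module):
    """One Hadamax-style encoder stage: elementwise (Hadamard) product of
    two GELU-activated parallel convolutions, followed by max-pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch_a = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.branch_b = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        h = F.gelu(self.branch_a(x)) * F.gelu(self.branch_b(x))  # Hadamard product
        return F.max_pool2d(h, kernel_size=2)                    # spatial max-pooling

# Example: encode a batch of 84x84 Atari observations (4 stacked frames).
enc = nn.Sequential(HadamaxBlock(4, 32), HadamaxBlock(32, 64), HadamaxBlock(64, 64))
print(enc(torch.randn(8, 4, 84, 84)).shape)   # -> torch.Size([8, 64, 10, 10])
```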
[AI-24] Alpay Algebra: A Universal Structural Foundation
【Quick Read】: This paper seeks a unified framework bridging foundational mathematics and high-impact AI systems, specifically combining classical algebraic structures with the modern needs of symbolic recursion and explainable AI. The key to the solution is Alpay Algebra, a category-theoretic framework that models each algebra as an object in a small cartesian closed category \mathcal{A} and defines a transfinite evolution functor \phi\colon\mathcal{A}\to\mathcal{A}, enabling the construction and analysis of the fixed point \phi^{\infty}, which satisfies an internal universal property and extends classical notions such as limits, colimits, and adjunctions to ordinal-indexed folds.
Link: https://arxiv.org/abs/2505.15344
Authors: Faruk Alpay
Affiliation: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Category Theory (math.CT)
Comments: 37 pages, 0 figures. Self-contained categorical framework built directly on Mac Lane and Bourbaki; minimal references are intentional to foreground the new construction
Click to view the abstract
Abstract:Alpay Algebra is introduced as a universal, category-theoretic framework that unifies classical algebraic structures with modern needs in symbolic recursion and explainable AI. Starting from a minimal list of axioms, we model each algebra as an object in a small cartesian closed category \mathcal{A} and define a transfinite evolution functor \phi\colon\mathcal{A}\to\mathcal{A}. We prove that the fixed point \phi^{\infty} exists for every initial object and satisfies an internal universal property that recovers familiar constructs – limits, colimits, adjunctions – while extending them to ordinal-indexed folds. A sequence of theorems establishes (i) soundness and conservativity over standard universal algebra, (ii) convergence of \phi-iterates under regular cardinals, and (iii) an explanatory correspondence between \phi^{\infty} and minimal sufficient statistics in information-theoretic AI models. We conclude by outlining computational applications: type-safe functional languages, categorical model checking, and signal-level reasoning engines that leverage Alpay Algebra’s structural invariants. All proofs are self-contained; no external set-theoretic axioms beyond ZFC are required. This exposition positions Alpay Algebra as a bridge between foundational mathematics and high-impact AI systems, and provides a reference for further work in category theory, transfinite fixed-point analysis, and symbolic computation.
zh
[AI-25] Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One
【Quick Read】: This paper addresses the difficulty of training effective agents in reinforcement learning (RL), which stems from the many factors requiring careful tuning, such as algorithm selection, hyperparameter settings, and even random seed choices, all of which can significantly influence performance. Existing ensemble methods (such as majority voting and Boltzmann addition) use fixed strategies and lack task-specific semantic understanding, limiting their adaptability and effectiveness. The key to the proposed LLM-Ens is to use large language models (LLMs) to obtain task-specific semantic understanding: states are categorized into distinct "situations", each candidate agent's strengths and weaknesses are analyzed statistically per situation, and at inference time the agent that performs best in the current situation is selected dynamically, enabling adaptive model selection as the task condition evolves.
Link: https://arxiv.org/abs/2505.15306
Authors: Yiwen Song, Qianyue Hao, Qingmin Liao, Jian Yuan, Yong Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Model ensemble is a useful approach in reinforcement learning (RL) for training effective agents. Despite wide success of RL, training effective agents remains difficult due to the multitude of factors requiring careful tuning, such as algorithm selection, hyperparameter settings, and even random seed choices, all of which can significantly influence an agent’s performance. Model ensemble helps overcome this challenge by combining multiple weak agents into a single, more powerful one, enhancing overall performance. However, existing ensemble methods, such as majority voting and Boltzmann addition, are designed as fixed strategies and lack a semantic understanding of specific tasks, limiting their adaptability and effectiveness. To address this, we propose LLM-Ens, a novel approach that enhances RL model ensemble with task-specific semantic understandings driven by large language models (LLMs). Given a task, we first design an LLM to categorize states in this task into distinct ‘situations’, incorporating high-level descriptions of the task conditions. Then, we statistically analyze the strengths and weaknesses of each individual agent to be used in the ensemble in each situation. During the inference time, LLM-Ens dynamically identifies the changing task situation and switches to the agent that performs best in the current situation, ensuring dynamic model selection in the evolving task condition. Our approach is designed to be compatible with agents trained with different random seeds, hyperparameter settings, and various RL algorithms. Extensive experiments on the Atari benchmark show that LLM-Ens significantly improves the RL model ensemble, surpassing well-known baselines by up to 20.9%. For reproducibility, our code is open-source at this https URL.
zh
[AI-26] Laplace Sample Information: Data Informativeness Through a Bayesian Lens
【Quick Read】: This paper addresses how to accurately estimate the informativeness of each sample in a dataset, an important objective in deep learning since it can guide sample selection and improve model efficiency and accuracy by removing redundant or potentially harmful samples. The key to the solution is Laplace Sample Information (LSI), an information-theoretic measure of sample informativeness, broadly applicable across model architectures and learning settings, that uses a Bayesian (Laplace) approximation to the weight posterior together with the KL divergence to quantify how much a given sample shifts the parameter distribution.
Link: https://arxiv.org/abs/2505.15303
Authors: Johannes Kaiser, Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:
Click to view the abstract
Abstract:Accurately estimating the informativeness of individual samples in a dataset is an important objective in deep learning, as it can guide sample selection, which can improve model efficiency and accuracy by removing redundant or potentially harmful samples. We propose Laplace Sample Information (LSI) measure of sample informativeness grounded in information theory widely applicable across model architectures and learning settings. LSI leverages a Bayesian approximation to the weight posterior and the KL divergence to measure the change in the parameter distribution induced by a sample of interest from the dataset. We experimentally show that LSI is effective in ordering the data with respect to typicality, detecting mislabeled samples, measuring class-wise informativeness, and assessing dataset difficulty. We demonstrate these capabilities of LSI on image and text data in supervised and unsupervised settings. Moreover, we show that LSI can be computed efficiently through probes and transfers well to the training of large models.
zh
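In symbols, the quantity LSI estimates can be sketched as follows. This is a paraphrase of the abstract under a Gaussian Laplace approximation; the direction of the KL divergence and the exact estimator (the paper uses probes and further approximations) are assumptions here.

```latex
% Laplace approximation to the weight posterior around the MAP estimate
% \theta^{*} with Hessian H, and the induced per-sample informativeness:
q(\theta \mid \mathcal{D}) \approx \mathcal{N}\!\left(\theta^{*}, H^{-1}\right),
\qquad
\mathrm{LSI}(z_i) \;=\; D_{\mathrm{KL}}\!\left(
  q(\theta \mid \mathcal{D}) \,\Big\|\, q(\theta \mid \mathcal{D} \setminus \{z_i\})
\right)
```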
[AI-27] LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models
【Quick Read】: This paper addresses the inefficiency of policy exploration in reinforcement learning (RL): existing approaches such as greedy strategies and Gaussian processes rely on preset stochastic processes that ignore task-specific features, and during training these processes evolve rigidly, typically incorporating only a decay in variance, without flexibly adapting to the agent's real-time learning status. The key to the solution is LLM-Explorer, which uses the analyzing and reasoning capabilities of large language models (LLMs) to adaptively generate task-specific exploration strategies: it samples the agent's learning trajectory, prompts the LLM to analyze the current policy-learning status, generates a probability distribution for future policy exploration, and updates this distribution periodically, yielding a stochastic process specialized for the task and dynamically adjusted to the learning process.
Link: https://arxiv.org/abs/2505.15293
Authors: Qianyue Hao, Yiwen Song, Qingmin Liao, Jian Yuan, Yong Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Policy exploration is critical in reinforcement learning (RL), where existing approaches include greedy, Gaussian process, etc. However, these approaches utilize preset stochastic processes and are indiscriminately applied in all kinds of RL tasks without considering task-specific features that influence policy exploration. Moreover, during RL training, the evolution of such stochastic processes is rigid, which typically only incorporates a decay in the variance, failing to adjust flexibly according to the agent’s real-time learning status. Inspired by the analyzing and reasoning capability of large language models (LLMs), we design LLM-Explorer to adaptively generate task-specific exploration strategies with LLMs, enhancing the policy exploration in RL. In our design, we sample the learning trajectory of the agent during the RL training in a given task and prompt the LLM to analyze the agent’s current policy learning status and then generate a probability distribution for future policy exploration. Updating the probability distribution periodically, we derive a stochastic process specialized for the particular task and dynamically adjusted to adapt to the learning process. Our design is a plug-in module compatible with various widely applied RL algorithms, including the DQN series, DDPG, TD3, and any possible variants developed based on them. Through extensive experiments on the Atari and MuJoCo benchmarks, we demonstrate LLM-Explorer’s capability to enhance RL policy exploration, achieving an average performance improvement up to 37.27%. Our code is open-source at this https URL for reproducibility.
zh
[AI-28] Learning-based Autonomous Oversteer Control and Collision Avoidance
【Quick Read】: This paper addresses active safety control when a vehicle oversteers on a slippery surface, in particular how to achieve effective obstacle avoidance and stabilization when randomly placed obstacles are present. Existing methods mostly rely on expert-defined trajectories or assume obstacle-free environments, limiting real-world applicability. The key to the solution is a new hybrid learning (HL) algorithm, Q-Compared Soft Actor-Critic (QC-SAC), which learns effectively from suboptimal demonstration data and adapts rapidly to new conditions, enabling the world's first safe autonomous oversteer control with obstacle avoidance.
Link: https://arxiv.org/abs/2505.15275
Authors: Seokjun Lee, Seung-Hyun Kong
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view the abstract
Abstract:Oversteer, wherein a vehicle’s rear tires lose traction and induce unintentional excessive yaw, poses critical safety challenges. Failing to control oversteer often leads to severe traffic accidents. Although recent autonomous driving efforts have attempted to handle oversteer through stabilizing maneuvers, the majority rely on expert-defined trajectories or assume obstacle-free environments, limiting real-world applicability. This paper introduces a novel end-to-end (E2E) autonomous driving approach that tackles oversteer control and collision avoidance simultaneously. Existing E2E techniques, including Imitation Learning (IL), Reinforcement Learning (RL), and Hybrid Learning (HL), generally require near-optimal demonstrations or extensive experience. Yet even skilled human drivers struggle to provide perfect demonstrations under oversteer, and high transition variance hinders accumulating sufficient data. Hence, we present Q-Compared Soft Actor-Critic (QC-SAC), a new HL algorithm that effectively learns from suboptimal demonstration data and adapts rapidly to new conditions. To evaluate QC-SAC, we introduce a benchmark inspired by real-world driver training: a vehicle encounters sudden oversteer on a slippery surface and must avoid randomly placed obstacles ahead. Experimental results show QC-SAC attains near-optimal driving policies, significantly surpassing state-of-the-art IL, RL, and HL baselines. Our method demonstrates the world’s first safe autonomous oversteer control with obstacle avoidance.
zh
[AI-29] Identification of Probabilities of Causation: A Complete Characterization
【Quick Read】: This paper addresses the theoretical characterization of probabilities of causation under multi-valued treatments and outcomes, a problem that has remained unresolved for decades and has limited the scope of causality-based decision-making. The key to the solution is a complete set of representative probabilities of causation, proven sufficient to characterize all possible probabilities of causation within the framework of Structural Causal Models (SCMs), followed by rigorous mathematical derivation of tight bounds for these representative quantities.
Link: https://arxiv.org/abs/2505.15274
Authors: Xin Shu, Shuai Wang, Ang Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Probabilities of causation are fundamental to modern decision-making. Pearl first introduced three binary probabilities of causation, and Tian and Pearl later derived tight bounds for them using Balke’s linear programming. The theoretical characterization of probabilities of causation with multi-valued treatments and outcomes has remained unresolved for decades, limiting the scope of causality-based decision-making. In this paper, we resolve this foundational gap by proposing a complete set of representative probabilities of causation and proving that they are sufficient to characterize all possible probabilities of causation within the framework of Structural Causal Models (SCMs). We then formally derive tight bounds for these representative quantities using formal mathematical proofs. Finally, we demonstrate the practical relevance of our results through illustrative toy examples.
zh
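For context, the classic binary-case result this work generalizes is the Tian-Pearl bound on the probability of necessity and sufficiency (PNS). Written with experimental quantities only (a standard textbook form, not the paper's multi-valued generalization):

```latex
% Tight bounds on PNS from experimental data P(y_x), P(y_{x'}) alone:
\max\left\{0,\; P(y_x) - P(y_{x'})\right\}
\;\le\; \mathrm{PNS} \;\le\;
\min\left\{P(y_x),\; P(y'_{x'})\right\}
```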
[AI-30] Margin-aware Fuzzy Rough Feature Selection: Bridging Uncertainty Characterization and Pattern Classification
【Quick Read】: This paper addresses a shortcoming of traditional fuzzy rough feature selection (FRFS) on high-dimensional data: it focuses solely on reducing uncertainty while ignoring the relationship between uncertainty and classification performance. Existing FRFS algorithms treat uncertainty reduction as the key indicator of feature-selection effectiveness, yet low uncertainty does not necessarily improve classification. The key to the solution is a Margin-aware Fuzzy Rough Feature Selection (MAFRFS) framework that considers both the compactness and the separation of label classes, jointly optimizing intra-class compactness and inter-class separation so that feature selection is steered toward more separable and discriminative class structures, improving classification performance.
Link: https://arxiv.org/abs/2505.15250
Authors: Suping Xu, Lin Shang, Keyu Liu, Hengrong Ju, Xibei Yang, Witold Pedrycz
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Fuzzy rough feature selection (FRFS) is an effective means of addressing the curse of dimensionality in high-dimensional data. By removing redundant and irrelevant features, FRFS helps mitigate classifier overfitting, enhance generalization performance, and lessen computational overhead. However, most existing FRFS algorithms primarily focus on reducing uncertainty in pattern classification, neglecting that lower uncertainty does not necessarily result in improved classification performance, despite it commonly being regarded as a key indicator of feature selection effectiveness in the FRFS literature. To bridge uncertainty characterization and pattern classification, we propose a Margin-aware Fuzzy Rough Feature Selection (MAFRFS) framework that considers both the compactness and separation of label classes. MAFRFS effectively reduces uncertainty in pattern classification tasks, while guiding the feature selection towards more separable and discriminative label class structures. Extensive experiments on 15 public datasets demonstrate that MAFRFS is highly scalable and more effective than FRFS. The algorithms developed using MAFRFS outperform six state-of-the-art feature selection algorithms.
zh
[AI-31] Adaptive Plan-Execute Framework for Smart Contract Security Auditing
【Quick Read】: This paper addresses the hallucination and limited context-aware reasoning of large language models (LLMs) in smart contract security auditing. The key to the solution is SmartAuditFlow, a novel Plan-Execute framework that strengthens smart contract security analysis through dynamic audit planning and structured execution. Unlike conventional LLM-based auditing methods that follow fixed workflows and predefined steps, SmartAuditFlow dynamically generates and refines audit plans based on each contract's unique characteristics and continuously adjusts its auditing strategy in response to intermediate LLM outputs and newly detected vulnerabilities, enabling a more adaptive and precise security assessment.
Link: https://arxiv.org/abs/2505.15242
Authors: Zhiyuan Wei, Jing Sun, Zijian Zhang, Zhe Hou, Zixiao Zhao
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 30 pages, 5 figures
Click to view the abstract
Abstract:Large Language Models (LLMs) have shown great promise in code analysis and auditing; however, they still struggle with hallucinations and limited context-aware reasoning. We introduce SmartAuditFlow, a novel Plan-Execute framework that enhances smart contract security analysis through dynamic audit planning and structured execution. Unlike conventional LLM-based auditing approaches that follow fixed workflows and predefined steps, SmartAuditFlow dynamically generates and refines audit plans based on the unique characteristics of each smart contract. It continuously adjusts its auditing strategy in response to intermediate LLM outputs and newly detected vulnerabilities, ensuring a more adaptive and precise security assessment. The framework then executes these plans step by step, applying a structured reasoning process to enhance vulnerability detection accuracy while minimizing hallucinations and false positives. To further improve audit precision, SmartAuditFlow integrates iterative prompt optimization and external knowledge sources, such as static analysis tools and Retrieval-Augmented Generation (RAG). This ensures audit decisions are contextually informed and backed by real-world security knowledge, producing comprehensive security reports. Extensive evaluations across multiple benchmarks demonstrate that SmartAuditFlow outperforms existing methods, achieving 100 percent accuracy on common and critical vulnerabilities, 41.2 percent accuracy for comprehensive coverage of known smart contract weaknesses in real-world projects, and successfully identifying all 13 tested CVEs. These results highlight SmartAuditFlow’s scalability, cost-effectiveness, and superior adaptability over traditional static analysis tools and contemporary LLM-based approaches, establishing it as a robust solution for automated smart contract auditing.
zh
[AI-32] Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge UAI2025
【Quick Read】: This paper addresses generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. The key to the solution is recognizing existing Product-of-Experts methods as special cases of a broader framework, which opens up diverse modelling choices, and proposing improved uncertainty estimates for individual comparisons that enable more efficient selection and strong performance with fewer evaluations. The paper further introduces a method for estimating overall ranking uncertainty and shows that combining absolute and comparative scoring improves performance. The central finding is that the proposed uncertainty estimates, especially the probability of reordering, cut the number of required comparisons by roughly 50%, while ranking-level uncertainty metrics can flag low-performing predictions, underscoring the notable influence of the probabilistic model on overall uncertainty quality.
Link: https://arxiv.org/abs/2505.15240
Authors: Yassir Fathullah, Mark J. F. Gales
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: To appear in UAI 2025
Click to view the abstract
Abstract:This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings but our proposed uncertainty estimates, especially the probability of reordering, significantly improve the efficiency of systems reducing the number of needed comparisons by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.
zh
[AI-33] Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers
【Quick Read】: This paper addresses the theoretical understanding of neural collapse, the symmetry that emerges in the penultimate-layer feature representations of deep neural networks, with a focus on modern architectures analyzed in a data-aware regime. The key to the solution is proving that the global optima of deep regularized transformers and residual networks (ResNets) trained with cross-entropy or mean-squared-error loss are approximately collapsed, with the approximation getting tighter as depth grows. Moreover, the paper formally reduces end-to-end large-depth ResNet or transformer training to an equivalent unconstrained features model, justifying that model's wide use in the literature.
Link: https://arxiv.org/abs/2505.15239
Authors: Peter Súkeník, Christoph H. Lampert, Marco Mondelli
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Click to view the abstract
Abstract:The empirical emergence of neural collapse – a surprising symmetry in the feature representations of the training data in the penultimate layer of deep neural networks – has spurred a line of theoretical research aimed at its understanding. However, existing work focuses on data-agnostic models or, when data structure is taken into account, it remains limited to multi-layer perceptrons. Our paper fills both these gaps by analyzing modern architectures in a data-aware regime: we prove that global optima of deep regularized transformers and residual networks (ResNets) with LayerNorm trained with cross entropy or mean squared error loss are approximately collapsed, and the approximation gets tighter as the depth grows. More generally, we formally reduce any end-to-end large-depth ResNet or transformer training into an equivalent unconstrained features model, thus justifying its wide use in the literature even beyond data-agnostic settings. Our theoretical results are supported by experiments on computer vision and language datasets showing that, as the depth grows, neural collapse indeed becomes more prominent.
zh
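Concretely, the collapsed geometry at stake is the simplex equiangular tight frame (ETF) condition on the (globally centered) class-mean features, stated here as a brief reminder of the standard definition:

```latex
% Neural collapse (NC2): the C class means have equal norms and
% maximally separated, equal pairwise angles:
\|\mu_c\| = \|\mu_{c'}\| \quad \forall\, c, c',
\qquad
\frac{\langle \mu_c, \mu_{c'} \rangle}{\|\mu_c\|\,\|\mu_{c'}\|}
  = -\frac{1}{C-1} \quad \forall\, c \neq c'
```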
[AI-34] EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
【Quick Read】: This paper addresses autonomous tracking of abnormal regions and following of circumferential cutting markers in endoscopic procedures, with the goal of reducing endoscopists' cognitive burden. Conventional model-based pipelines require manual tuning of every component (e.g., detection, motion planning), struggle to incorporate high-level endoscopic intent, and generalize poorly across diverse scenes. The key to the solution is EndoVLA, a Vision-Language-Action (VLA) model designed specifically for continuum robots in gastrointestinal (GI) interventions, which integrates visual perception, language grounding, and motion planning in an end-to-end framework and adapts semantically to surgeon prompts without manual recalibration, improving tracking performance and enabling zero-shot generalization.
Link: https://arxiv.org/abs/2505.15206
Authors: Chi Kit Ng, Long Bai, Guankun Wang, Yupeng Wang, Huxin Gao, Kun Yuan, Chenhan Jin, Tieyong Zeng, Hongliang Ren
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:In endoscopic procedures, autonomous tracking of abnormal regions and following circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile for each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, leading to poor generalization across diverse scenes. Vision-Language-Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative by semantically adapting to surgeon prompts without manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To address this, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to circular markers during circumferential cutting. To tackle data scarcity and domain shifts, we propose a dual-phase strategy comprising supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning with task-aware rewards. Our approach significantly improves tracking performance in endoscopy and enables zero-shot generalization in diverse scenes and complex sequential tasks.
zh
[AI-35] lmgame-Bench: How Good are LLMs at Playing Games?
【Quick Read】: This paper asks how video games can be used effectively as a benchmark for evaluating the capabilities of modern large language models (LLMs). The study finds that directly dropping LLMs into games does not yield an effective evaluation, due to brittle visual perception, prompt sensitivity, and potential data contamination. The key to the solution is lmgame-Bench, an evaluation framework that delivers platformer, puzzle, and narrative games through a unified Gym-style API, paired with lightweight perception and memory scaffolds designed to stabilize prompt variance and remove contamination.
Link: https://arxiv.org/abs/2505.15146
Authors: Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Click to view the abstract
Abstract:Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons – brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at this https URL.
zh
[AI-36] BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
【Quick Read】: This paper addresses how to dynamically adapt speculative decoding hyperparameter configurations during large language model (LLM) inference to improve efficiency. Existing methods either fix the speculative decoding configuration regardless of the prefix tokens, or align a draft model with the context via offline or online training, and therefore struggle to adapt to the dynamics of varying prefixes. The key to the solution is a training-free online learning framework that casts hyperparameter selection as a Multi-Armed Bandit problem and designs two bandit-based selection algorithms, UCBSpec and EXP3Spec, analyzed through a novel quantity, the stopping time regret, enabling efficient, adaptive speculative decoding.
Link: https://arxiv.org/abs/2505.15141
Authors: Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 35 pages, 4 figures
Click to view the abstract
Abstract:Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
zh
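To make the bandit formulation concrete, here is a minimal UCB1 sketch in the spirit of UCBSpec: each arm is one speculative-decoding configuration (here, a hypothetical draft length), and the reward stands in for accepted tokens per verification step. The reward simulator is a placeholder, not the paper's algorithm.

```python
import math, random

configs = [2, 4, 8]                      # hypothetical draft lengths (arms)
counts = [0] * len(configs)
values = [0.0] * len(configs)

def observed_reward(draft_len: int) -> float:
    # Placeholder for a measured signal such as accepted tokens per step.
    return random.gauss(min(draft_len, 5) / 5.0, 0.1)

for t in range(1, 201):
    if 0 in counts:                      # play each arm once first
        arm = counts.index(0)
    else:                                # then pick by the UCB index
        arm = max(range(len(configs)),
                  key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    r = observed_reward(configs[arm])
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]   # running mean update

best = configs[max(range(len(configs)), key=lambda a: values[a])]
print("best draft length by UCB estimate:", best)
```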
[AI-37] Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm
【Quick Read】: This paper targets optimization in infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs), in particular how to handle constraints effectively under general parametrization while guaranteeing convergence. The key is a Primal-Dual Natural Actor-Critic algorithm that achieves global convergence, reaching the theoretically optimal constraint-violation rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ when the mixing time is known; when the mixing time is unknown, a rate of $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$ is still attainable under a suitable threshold condition on the horizon.
Link: https://arxiv.org/abs/2505.15138
Authors: Yang Xu,Swetha Ganesh,Washim Uddin Mondal,Qinbo Bai,Vaneet Aggarwal
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper investigates infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) with general parametrization. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$ provided that $T \geq \tilde{\mathcal{O}}(\tau_{\mathrm{mix}}^2/\epsilon)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.
zh
[AI-38] The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
【Quick Read】: This paper asks how to improve the performance of large language models (LLMs) on challenging math, physics, and coding tasks without any labeled data. The key is entropy minimization (EM): training the model to concentrate more probability mass on its most confident outputs, thereby strengthening its reasoning ability. Three methods are proposed: EM-FT, EM-RL, and EM-INF, which realize entropy minimization through fine-tuning, reinforcement learning, and inference-time logit adjustment, respectively. EM-RL, with no labeled data, outperforms RL baselines trained on large amounts of labeled data, while EM-INF surpasses some proprietary models in both efficiency and performance.
Link: https://arxiv.org/abs/2505.15134
Authors: Shivam Agarwal,Zimin Zhang,Lifan Yuan,Jiawei Han,Hao Peng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models’ (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
zh
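One simple way to realize "inference-time logit adjustment to reduce entropy" in the spirit of EM-INF (the paper's exact adjustment may differ) is to sharpen the softmax temperature until the output entropy falls below a target, as in this minimal sketch:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen_until(logits, target_entropy, factor=0.8, max_iters=50):
    """Rescale logits by 1/T with a shrinking temperature T until entropy <= target."""
    T = 1.0
    probs = softmax(logits)
    for _ in range(max_iters):
        if entropy(probs) <= target_entropy:
            break
        T *= factor
        probs = softmax([x / T for x in logits])
    return probs

print(sharpen_until([2.0, 1.5, 0.3, -1.0], target_entropy=0.5))
```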
[AI-39] Graph Foundation Models: A Comprehensive Survey
【Quick Read】: This paper tackles extending the capabilities of foundation models to graph-structured data, a domain whose non-Euclidean structure and complex relational semantics pose challenges distinct from natural language processing, vision, and multimodal learning. The key is the construction of Graph Foundation Models (GFMs): a modular framework for scalable, general-purpose intelligence built from three core components, backbone architectures, pretraining strategies, and adaptation mechanisms, intended to enable broad transfer across graph-centric tasks and domains.
Link: https://arxiv.org/abs/2505.15116
Authors: Zehong Wang,Zheyuan Liu,Tianyi Ma,Jiazheng Li,Zheyuan Zhang,Xingbo Fu,Yiyang Li,Zhengqing Yuan,Wei Song,Yijun Ma,Qingkai Zeng,Xiusi Chen,Jianan Zhao,Jundong Li,Meng Jiang,Pietro Lio,Nitesh Chawla,Chuxu Zhang,Yanfang Ye
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Github Repo: this https URL . 93 pages, 438 references
Abstract:Graph-structured data pervades domains such as social networks, biological systems, knowledge graphs, and recommender systems. While foundation models have transformed natural language processing, vision, and multimodal learning through large-scale pretraining and generalization, extending these capabilities to graphs – characterized by non-Euclidean structures and complex relational semantics – poses unique challenges and opens new opportunities. To this end, Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to structured data, enabling broad transfer across graph-centric tasks and domains. This survey provides a comprehensive overview of GFMs, unifying diverse efforts under a modular framework comprising three key components: backbone architectures, pretraining strategies, and adaptation mechanisms. We categorize GFMs by their generalization scope – universal, task-specific, and domain-specific – and review representative methods, key innovations, and theoretical insights within each category. Beyond methodology, we examine theoretical foundations including transferability and emergent capabilities, and highlight key challenges such as structural alignment, heterogeneity, scalability, and evaluation. Positioned at the intersection of graph learning and general-purpose AI, GFMs are poised to become foundational infrastructure for open-ended reasoning over structured data. This survey consolidates current progress and outlines future directions to guide research in this rapidly evolving field. Resources are available at this https URL.
zh
[AI-40] Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation
【Quick Read】: This paper targets the poor generalization that commonly afflicts robot manipulation learned from human demonstrations, especially in complex tasks with diverse scenes and object placements. The key is Object-Focus Actor (OFA), a novel method that exploits the consistent end trajectories observed in dexterous manipulation tasks to enable efficient policy training. It adopts a hierarchical pipeline of object perception and pose estimation, pre-manipulation pose arrival, and OFA policy execution, ensuring focused and efficient manipulation across varied backgrounds and positional layouts.
Link: https://arxiv.org/abs/2505.15098
Authors: Yihang Li,Tianle Zhang,Xuelong Wei,Jiayi Li,Lin Zhao,Dongchi Huang,Zhirui Fang,Minhua Zheng,Wenjun Dai,Xiaodong He
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Robot manipulation learning from human demonstrations offers a rapid means to acquire skills but often lacks generalization across diverse scenes and object placements. This limitation hinders real-world applications, particularly in complex tasks requiring dexterous manipulation. Vision-Language-Action (VLA) paradigm leverages large-scale data to enhance generalization. However, due to data scarcity, VLA’s performance remains limited. In this work, we introduce Object-Focus Actor (OFA), a novel, data-efficient approach for generalized dexterous manipulation. OFA exploits the consistent end trajectories observed in dexterous manipulation tasks, allowing for efficient policy training. Our method employs a hierarchical pipeline: object perception and pose estimation, pre-manipulation pose arrival and OFA policy execution. This process ensures that the manipulation is focused and efficient, even in varied backgrounds and positional layout. Comprehensive real-world experiments across seven tasks demonstrate that OFA significantly outperforms baseline methods in both positional and background generalization tests. Notably, OFA achieves robust performance with only 10 demonstrations, highlighting its data efficiency.
zh
[AI-41] ThinkRec: Thinking-based recommendation via LLM
【Quick Read】: This paper addresses the superficial and erroneous recommendations produced by existing generative AI (Generative AI)-based recommenders, which rely on surface features such as click history for similarity matching rather than analyzing the deeper logic of user behavior. The key is the ThinkRec framework, which introduces a thinking activation mechanism that augments item metadata with keyword summaries and injects synthetic reasoning traces, guiding the model to form interpretable reasoning chains; it further employs an instance-wise expert fusion mechanism that dynamically weights expert models according to a user's latent features, improving precision and personalization.
Link: https://arxiv.org/abs/2505.15091
Authors: Qihang Yu,Kairui Fu,Shengyu Zhang,Zheqi Lv,Fan Wu,Fei Wu
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in large language models (LLMs) have enabled more semantic-aware recommendations through natural language generation. Existing LLM for recommendation (LLM4Rec) methods mostly operate in a System 1-like manner, relying on superficial features to match similar items based on click history, rather than reasoning through deeper behavioral logic. This often leads to superficial and erroneous recommendations. Motivated by this, we propose ThinkRec, a thinking-based framework that shifts LLM4Rec from System 1 to System 2 (rational system). Technically, ThinkRec introduces a thinking activation mechanism that augments item metadata with keyword summarization and injects synthetic reasoning traces, guiding the model to form interpretable reasoning chains that consist of analyzing interaction histories, identifying user preferences, and making decisions based on target items. On top of this, we propose an instance-wise expert fusion mechanism to reduce the reasoning difficulty. By dynamically assigning weights to expert models based on users’ latent features, ThinkRec adapts its reasoning path to individual users, thereby enhancing precision and personalization. Extensive experiments on real-world datasets demonstrate that ThinkRec significantly improves the accuracy and interpretability of recommendations. Our implementations are available in anonymous Github: this https URL.
zh
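The instance-wise expert fusion idea can be sketched as a gate that maps user latent features to per-expert weights and mixes expert outputs per user. Dimensions and module names below are illustrative assumptions, not ThinkRec's actual architecture:

```python
import torch
import torch.nn as nn

class ExpertFusion(nn.Module):
    """Gate produces per-user expert weights; outputs are a weighted mixture."""
    def __init__(self, user_dim=16, n_experts=3, n_items=100):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(user_dim, n_items) for _ in range(n_experts)])
        self.gate = nn.Linear(user_dim, n_experts)

    def forward(self, user_feats):                                  # (batch, user_dim)
        weights = torch.softmax(self.gate(user_feats), dim=-1)      # (batch, n_experts)
        outs = torch.stack([e(user_feats) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)            # (batch, n_items)

model = ExpertFusion()
scores = model(torch.randn(4, 16))   # item scores for 4 users
```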
[AI-42] Leveraging Large Language Models for Command Injection Vulnerability Analysis in Python: An Empirical Study on Popular Open-Source Projects
【Quick Read】: This paper addresses the detection of command injection vulnerabilities in dynamic languages such as Python, especially in widely used open-source projects where such security issues can have broad negative impact. The key is to leverage the advanced contextual understanding and adaptability of large language models (LLMs) to identify complex security vulnerabilities in code automatically. The study applies LLM-based analysis to several high-profile GitHub projects to assess its potential for detecting command injection vulnerabilities and discusses its applicability and limitations in real-world development workflows.
Link: https://arxiv.org/abs/2505.15088
Authors: Yuxuan Wang,Jingshu Chen,Qingyang Wang
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Command injection vulnerabilities are a significant security threat in dynamic languages like Python, particularly in widely used open-source projects where security issues can have extensive impact. With the proven effectiveness of Large Language Models(LLMs) in code-related tasks, such as testing, researchers have explored their potential for vulnerabilities analysis. This study evaluates the potential of large language models (LLMs), such as GPT-4, as an alternative approach for automated testing for vulnerability detection. In particular, LLMs have demonstrated advanced contextual understanding and adaptability, making them promising candidates for identifying nuanced security vulnerabilities within code. To evaluate this potential, we applied LLM-based analysis to six high-profile GitHub projects-Django, Flask, TensorFlow, Scikit-learn, PyTorch, and Langchain-each with over 50,000 stars and extensive adoption across software development and academic research. Our analysis assesses both the strengths and limitations of LLMs in detecting command injection vulnerabilities, evaluating factors such as detection accuracy, efficiency, and practical integration into development workflows. In addition, we provide a comparative analysis of different LLM tools to identify those most suitable for security applications. Our findings offer guidance for developers and security researchers on leveraging LLMs as innovative and automated approaches to enhance software security.
zh
[AI-43] Robust Multi-Modal Forecasting: Integrating Static and Dynamic Features
【Quick Read】: This paper addresses the lack of transparency and interpretability in time series forecasting models, particularly in critical applications such as healthcare where the decision process must be understandable and trustworthy. The key is to extend an existing bi-level transparency framework by combining exogenous time series features with static features in a structured manner and, drawing on insights from trajectory comprehension, to introduce an encoding mechanism that decomposes exogenous time series into meaningful trends and properties from which interpretable patterns can be extracted.
Link: https://arxiv.org/abs/2505.15083
Authors: Jeremy Qin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Time series forecasting plays a crucial role in various applications, particularly in healthcare, where accurate predictions of future health trajectories can significantly impact clinical decision-making. Ensuring transparency and explainability of the models responsible for these tasks is essential for their adoption in critical settings. Recent work has explored a top-down approach to bi-level transparency, focusing on understanding trends and properties of predicted time series using static features. In this work, we extend this framework by incorporating exogenous time series features alongside static features in a structured manner, while maintaining cohesive interpretation. Our approach leverages the insights of trajectory comprehension to introduce an encoding mechanism for exogenous time series, where they are decomposed into meaningful trends and properties, enabling the extraction of interpretable patterns. Through experiments on several synthetic datasets, we demonstrate that our approach remains predictive while preserving interpretability and robustness. This work represents a step towards developing robust, and generalized time series forecasting models. The code is available at this https URL
zh
[AI-44] Towards a Working Definition of Designing Generative User Interfaces
【Quick Read】: This paper asks how to define and understand Generative UI and its application in human-computer interaction (HCI), particularly AI-designer collaborative workflows during the design process. The key lies in a multi-method qualitative study, comprising a systematic literature review, expert interviews, and case analyses, which establishes five core themes of Generative UI and proposes emerging design models such as hybrid creation, curation-based workflows, and AI-assisted refinement, providing a foundation for both theory building and practical application of Generative UI.
Link: https://arxiv.org/abs/2505.15049
Authors: Kyungho Lee
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative UI is transforming interface design by facilitating AI-driven collaborative workflows between designers and computational systems. This study establishes a working definition of Generative UI through a multi-method qualitative approach, integrating insights from a systematic literature review of 127 publications, expert interviews with 18 participants, and analyses of 12 case studies. Our findings identify five core themes that position Generative UI as an iterative and co-creative process. We highlight emerging design models, including hybrid creation, curation-based workflows, and AI-assisted refinement strategies. Additionally, we examine ethical challenges, evaluation criteria, and interaction models that shape the field. By proposing a conceptual foundation, this study advances both theoretical discourse and practical implementation, guiding future HCI research toward responsible and effective generative UI design practices.
zh
[AI-45] PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration
【Quick Read】: This paper addresses the aimless hypothesis generation and the failure to consistently link hypotheses with evidence in existing large language model (LLM)-based multi-agent systems (MAS) for scientific discovery, shortcomings that stem from missing rationality constraints and hinder systematic uncertainty reduction. The key is PiFlow, an information-theoretic framework that treats automated scientific discovery as a structured uncertainty reduction problem guided by principles such as scientific laws, enabling more efficient and higher-quality discovery.
Link: https://arxiv.org/abs/2505.15047
Authors: Yingming Pu,Tao Lin,Hongyu Chen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering systematic uncertainty reduction. Overcoming these limitations fundamentally requires systematic uncertainty reduction. We introduce PiFlow, an information-theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). In evaluations across three distinct scientific domains – discovering nanomaterial structures, bio-molecules, and superconductor candidates with targeted properties – our method significantly improves discovery efficiency, reflected by a 73.55% increase in the Area Under the Curve (AUC) of property values versus exploration steps, and enhances solution quality by 94.06% compared to a vanilla agent system. Overall, PiFlow serves as a Plug-and-Play method, establishing a novel paradigm shift in highly efficient automated scientific discovery, paving the way for more robust and accelerated AI-driven research. Code is publicly available at our GitHub: this https URL.
zh
[AI-46] Learning-based Airflow Inertial Odometry for MAVs using Thermal Anemometers in a GPS and vision denied environment
【Quick Read】: This paper addresses accurate motion state estimation for aerial vehicles when low-precision inertial measurement units (IMUs) and barometers exhibit significant bias and thermal anemometer readings are disturbed by spinning propellers and ground effects. The key is a GRU-based deep neural network that estimates relative air speed from noisy, disturbed anemometer measurements, combined with an observer with a bias model that fuses the multi-sensor data to accurately estimate the vehicle state.
Link: https://arxiv.org/abs/2505.15044
Authors: Ze Wang,Jingang Qu,Zhenyu Gao,Pascal Morin
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This work demonstrates an airflow inertial based odometry system with multi-sensor data fusion, including thermal anemometer, IMU, ESC, and barometer. This goal is challenging because low-cost IMUs and barometers have significant bias, and anemometer measurements are very susceptible to interference from spinning propellers and ground effects. We employ a GRU-based deep neural network to estimate relative air speed from noisy and disturbed anemometer measurements, and an observer with bias model to fuse the sensor data and thus estimate the state of aerial vehicle. A complete flight data, including takeoff and landing on the ground, shows that the approach is able to decouple the downwash induced wind speed caused by propellers and the ground effect, and accurately estimate the flight speed in a wind-free indoor environment. IMU, and barometer bias are effectively estimated, which significantly reduces the position integration drift, which is only 5.7m for 203s manual random flight. The open source is available on this https URL.
zh
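A minimal PyTorch sketch of the GRU-based airspeed regressor described above might map a window of noisy anemometer readings to a velocity estimate; the input/output dimensions and window length here are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class AirspeedGRU(nn.Module):
    """GRU regressor: window of anemometer readings -> relative air speed."""
    def __init__(self, in_dim=3, hidden=32):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3)   # vx, vy, vz estimates (assumed)

    def forward(self, x):                  # x: (batch, time, in_dim)
        out, _ = self.gru(x)
        return self.head(out[:, -1])       # estimate at the last time step

model = AirspeedGRU()
pred = model(torch.randn(8, 50, 3))        # 8 windows of 50 noisy samples
```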
[AI-47] LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming
【Quick Read】: This paper addresses the limited practicality of automated test case generation (ATCG) in competitive programming, where existing methods struggle to satisfy complex specifications or to generate effective corner cases. The key is Context-Free Grammars with Counters (CCFGs), a formalism that captures both the syntactic and semantic structure of input specifications; a fine-tuned CodeT5 model translates natural-language input specifications into CCFGs, enabling the systematic generation of high-quality test cases.
Link: https://arxiv.org/abs/2505.15039
Authors: Sicheol Sung,Aditi,Dogyu kim,Yo-Sub Han,Sang-Ki Ko
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Automated Test Case Generation (ATCG) is crucial for evaluating software reliability, particularly in competitive programming where robust algorithm assessments depend on diverse and accurate test cases. However, existing ATCG methods often fail to meet complex specifications or generate effective corner cases, limiting their utility. In this work, we introduce Context-Free Grammars with Counters (CCFGs), a formalism that captures both syntactic and semantic structures in input specifications. Using a fine-tuned CodeT5 model, we translate natural language input specifications into CCFGs, enabling the systematic generation of high-quality test cases. Experiments on the CodeContests dataset demonstrate that CCFG-based test cases outperform baseline methods in identifying incorrect algorithms, achieving significant gains in validity and effectiveness. Our approach provides a scalable and reliable grammar-driven framework for enhancing automated competitive programming evaluations.
zh
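To illustrate the counter idea behind CCFGs, the toy generator below ties the value produced for a count nonterminal N to the number of subsequent tokens, so every generated case is structurally valid by construction. This is a hand-rolled sketch, not the paper's formalism or its CodeT5 pipeline:

```python
import random

def generate_case(max_n=5, lo=1, hi=100) -> str:
    """Generate 'N\\n v1 v2 ... vN' where the counter N constrains the list length."""
    n = random.randint(1, max_n)                      # expand the counted nonterminal
    values = [str(random.randint(lo, hi)) for _ in range(n)]  # exactly n items
    return f"{n}\n" + " ".join(values) + "\n"

random.seed(0)
print(generate_case())   # e.g. "4\n98 54 6 34\n" -- the count matches the list
```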
[AI-48] Fault-Tolerant Multi-Robot Coordination with Limited Sensing within Confined Environments
【Quick Read】: This paper addresses the problem that, in multi-robot systems sharing a workspace, the failure of a single robot can severely degrade overall performance, especially in tasks that rely on social interaction for coordination in the absence of global information or direct communication. The key is a fault-tolerance technique called Active Contact Response (ACR), which adjusts robot behavior through physical contact interactions so that operational robots can collectively reposition faulty peers, reducing obstructions and keeping the group functioning near its optimum.
Link: https://arxiv.org/abs/2505.15036
Authors: Kehinde O. Aina,Hosain Bagheri,Daniel I. Goldman
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 15 pages, 4 figures. Accepted to DARS 2024 (Distributed Autonomous Robotic Systems), to appear in Springer Proceedings in Advanced Robotics
Abstract:As robots are increasingly deployed to collaborate on tasks within shared workspaces and resources, the failure of an individual robot can critically affect the group’s performance. This issue is particularly challenging when robots lack global information or direct communication, relying instead on social interaction for coordination and to complete their tasks. In this study, we propose a novel fault-tolerance technique leveraging physical contact interactions in multi-robot systems, specifically under conditions of limited sensing and spatial confinement. We introduce the “Active Contact Response” (ACR) method, where each robot modulates its behavior based on the likelihood of encountering an inoperative (faulty) robot. Active robots are capable of collectively repositioning stationary and faulty peers to reduce obstructions and maintain optimal group functionality. We implement our algorithm in a team of autonomous robots, equipped with contact-sensing and collision-tolerance capabilities, tasked with collectively excavating cohesive model pellets. Experimental results indicate that the ACR method significantly improves the system’s recovery time from robot failures, enabling continued collective excavation with minimal performance degradation. Thus, this work demonstrates the potential of leveraging local, social, and physical interactions to enhance fault tolerance and coordination in multi-robot systems operating in constrained and extreme environments.
zh
[AI-49] Toward Task Capable Active Matter: Learning to Avoid Clogging in Confined Collectives via Collisions
【Quick Read】: This paper addresses clogging and jamming when dense collectives flow through confined spaces, as in complex tasks such as fire-ant nest excavation, and asks how effective flow and task execution can be maintained. The key is local learning: individuals adapt behavioral strategies such as reversal probabilities based on environmental feedback, improving overall collective performance by reducing congestion and raising task throughput.
Link: https://arxiv.org/abs/2505.15033
Authors: Kehinde O. Aina,Ram Avinery,Hui-Shun Kuan,Meredith D. Betterton,Michael A. D. Goodisman,Daniel I. Goldman
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 13 pages, 9 figures. Published in Frontiers in Physics, Social Physics section. Includes experimental and simulation analysis of multi-robot excavation using decentralized learning
Abstract:Social organisms which construct nests consisting of tunnels and chambers necessarily navigate confined and crowded conditions. Unlike low-density collectives like bird flocks and insect swarms, in which hydrodynamic and statistical phenomena dominate, the physics of glasses and supercooled fluids is important to understand clogging behaviors in high-density collectives. Our previous work revealed that fire ants flowing in confined tunnels utilize diverse behaviors like unequal workload distributions, spontaneous direction reversals, and limited interaction times to mitigate clogging and jamming and thus maintain functional flow; implementation of similar rules in a small robophysical swarm led to high performance through spontaneous dissolution of clogs and clusters. However, how the insects learn such behaviors, and how we can develop “task capable” active matter in such regimes, remains a challenge in part because interaction dynamics are dominated by local, time-consuming collisions and no single agent can guide the entire collective. Here, we hypothesized that effective flow and clog mitigation could emerge purely through local learning. We tasked small groups of robots with pellet excavation in a narrow tunnel, allowing them to modify reversal probabilities over time. Initially, robots had equal probabilities and clogs were common. Reversals improved flow. When reversal probabilities adapted via collisions and noisy tunnel length estimates, workload inequality and performance improved. Our robophysical study of an excavating swarm shows that, despite the seeming complexity and difficulty of the task, simple learning rules can mitigate or leverage unavoidable features in task-capable dense active matter, leading to hypotheses for dense biological and robotic swarms.
zh
[AI-50] Towards a Science of Causal Interpretability in Deep Learning for Software Engineering
【Quick Read】: This paper addresses the lack of causal interpretability in Deep Learning for Software Engineering (DL4SE): neural code models (NCMs) lack transparent causal relationships between inputs and outputs, limiting a full understanding of their capabilities. The solution is DoCode, a novel post hoc interpretability method whose key idea is to use causal inference to provide programming-language-oriented explanations of model predictions. DoCode follows a four-step pipeline: modeling the causal problem with Structural Causal Models (SCMs), identifying the causal estimand, estimating effects with metrics such as the Average Treatment Effect (ATE), and refuting the effect estimates. The framework is extensible and can reduce spurious correlations by grounding explanations in programming language properties, improving the trustworthiness and accuracy of model explanations.
Link: https://arxiv.org/abs/2505.15023
Authors: David N. Palacio
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: To appear in ProQuest
Abstract:This dissertation addresses achieving causal interpretability in Deep Learning for Software Engineering (DL4SE). While Neural Code Models (NCMs) show strong performance in automating software tasks, their lack of transparency in causal relationships between inputs and outputs limits full understanding of their capabilities. To build trust in NCMs, researchers and practitioners must explain code predictions. Associational interpretability, which identifies correlations, is often insufficient for tasks requiring intervention and change analysis. To address this, the dissertation introduces DoCode, a novel post hoc interpretability method for NCMs. DoCode uses causal inference to provide programming language-oriented explanations of model predictions. It follows a four-step pipeline: modeling causal problems using Structural Causal Models (SCMs), identifying the causal estimand, estimating effects with metrics like Average Treatment Effect (ATE), and refuting effect estimates. Its framework is extensible, with an example that reduces spurious correlations by grounding explanations in programming language properties. A case study on deep code generation across interpretability scenarios and various deep learning architectures demonstrates DoCode’s benefits. Results show NCMs’ sensitivity to code syntax changes and their ability to learn certain programming concepts while minimizing confounding bias. The dissertation also examines associational interpretability as a foundation, analyzing software information’s causal nature using tools like COMET and TraceXplainer for traceability. It highlights the need to identify code confounders and offers practical guidelines for applying causal interpretability to NCMs, contributing to more trustworthy AI in software engineering.
zh
[AI-51] HAVA: Hybrid Approach to Value-Alignment through Reward Weighing for Reinforcement Learning
【Quick Read】: This paper asks how to integrate different kinds of norms, explicitly represented legal/safety norms and implicitly learned social norms, into the reinforcement learning process to achieve value alignment of agents. The key is a quantity called reputation that monitors the agent's compliance with the given norms and is used to weigh the received rewards, motivating the agent toward behavior aligned with societal values.
Link: https://arxiv.org/abs/2505.15011
Authors: Kryspin Varys,Federico Cerutti,Adam Sobey,Timothy J. Norman
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Our society is governed by a set of norms which together bring about the values we cherish such as safety, fairness or trustworthiness. The goal of value-alignment is to create agents that not only do their tasks but through their behaviours also promote these values. Many of the norms are written as laws or rules (legal / safety norms) but even more remain unwritten (social norms). Furthermore, the techniques used to represent these norms also differ. Safety / legal norms are often represented explicitly, for example, in some logical language while social norms are typically learned and remain hidden in the parameter space of a neural network. There is a lack of approaches in the literature that could combine these various norm representations into a single algorithm. We propose a novel method that integrates these norms into the reinforcement learning process. Our method monitors the agent’s compliance with the given norms and summarizes it in a quantity we call the agent’s reputation. This quantity is used to weigh the received rewards to motivate the agent to become value-aligned. We carry out a series of experiments including a continuous state space traffic problem to demonstrate the importance of the written and unwritten norms and show how our method can find the value-aligned policies. Furthermore, we carry out ablations to demonstrate why it is better to combine these two groups of norms rather than using either separately.
zh
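A minimal sketch of the reputation-weighting idea follows; the update rule and the multiplicative reward shaping below are illustrative assumptions rather than HAVA's exact formulation:

```python
def update_reputation(rep: float, complied: bool, lr: float = 0.1) -> float:
    """Move reputation toward 1 on norm compliance, toward 0 on violation."""
    target = 1.0 if complied else 0.0
    return rep + lr * (target - rep)

def shaped_reward(env_reward: float, rep: float) -> float:
    """Scale the task reward by the agent's current reputation."""
    return env_reward * rep

rep = 1.0
for step_reward, complied in [(1.0, True), (1.0, False), (0.5, True)]:
    rep = update_reputation(rep, complied)
    print(f"reputation={rep:.3f}, shaped reward={shaped_reward(step_reward, rep):.3f}")
```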
[AI-52] One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks
【Quick Read】: This paper studies the approximation capabilities and convergence behavior of one-layer transformers for noiseless and noisy in-context reasoning, where existing theory covers only the first gradient step or the infinite-sample limit and offers neither convergence rates nor generalization guarantees. The key is to prove that a class of one-layer transformers with linear and ReLU attention is provably Bayes-optimal, and to show via a finite-sample analysis that their expected loss converges to the Bayes risk at a linear rate, while generalizing well and exhibiting learning behaviors consistent with prior empirical observations.
Link: https://arxiv.org/abs/2505.15009
Authors: Quan Nguyen,Thanh Nguyen-Tang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages
Abstract:We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, no convergence rates nor generalization abilities were known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When being trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples as well as exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.
zh
[AI-53] Know When to Abstain: Optimal Selective Classification with Likelihood Ratios
【Quick Read】: This paper asks how to make selective classification more reliable under covariate shift. The key is to revisit the design of optimal selection functions through the Neyman-Pearson lemma, which characterizes the optimal rejection rule as a likelihood ratio test; this perspective both unifies and improves existing post-hoc selection baselines and motivates new selective classification methods. Experiments show that likelihood-ratio-based selection is markedly more robust under covariate shift.
Link: https://arxiv.org/abs/2505.15008
Authors: Alvin Heng,Harold Soh
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Selective classification enhances the reliability of predictive models by allowing them to abstain from making uncertain predictions. In this work, we revisit the design of optimal selection functions through the lens of the Neyman–Pearson lemma, a classical result in statistics that characterizes the optimal rejection rule as a likelihood ratio test. We show that this perspective not only unifies the behavior of several post-hoc selection baselines, but also motivates new approaches to selective classification which we propose here. A central focus of our work is the setting of covariate shift, where the input distribution at test time differs from that at training. This realistic and challenging scenario remains relatively underexplored in the context of selective classification. We evaluate our proposed methods across a range of vision and language tasks, including both supervised learning and vision-language models. Our experiments demonstrate that our Neyman–Pearson-informed methods consistently outperform existing baselines, indicating that likelihood ratio-based selection offers a robust mechanism for improving selective classification under covariate shifts. Our code is publicly available at this https URL.
zh
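The Neyman-Pearson view of abstention can be sketched in a few lines: predict only when the ratio of the likelihood under "the model is right" to the likelihood under "the model is wrong" exceeds a threshold. The toy 1-D Gaussian densities below are placeholders, not the paper's estimators:

```python
import math

def gauss_pdf(x: float, mu: float, sigma: float) -> float:
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def selective_predict(x: float, tau: float = 2.0) -> str:
    p_accept = gauss_pdf(x, mu=0.0, sigma=1.0)   # stand-in: density when correct
    p_reject = gauss_pdf(x, mu=3.0, sigma=1.0)   # stand-in: density when wrong
    ratio = p_accept / max(p_reject, 1e-12)      # likelihood ratio test statistic
    return "predict" if ratio >= tau else "abstain"

for x in [0.0, 1.5, 3.0]:
    print(x, selective_predict(x))   # far from the error mode -> predict
```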
[AI-54] Unraveling the iterative CHAD
【Quick Read】: This paper addresses the semantic soundness of reverse-mode automatic differentiation (AD) for partial programming languages, in particular languages with potentially non-terminating operations, real-valued conditionals, and iteration constructs such as while-loops. The key is the introduction of iteration-extensive indexed categories, which allow iteration in the base category to lift to parameterized initial algebras in the indexed category, so that iteration can be interpreted in the Grothendieck construction of the target language in a principled way. This ensures that the extended CHAD transformation remains the unique structure-preserving functor (an iterative Freyd category morphism) from the freely generated iterative Freyd category of the source language to the Grothendieck construction of the target's syntactic semantics, and its correctness is proved.
Link: https://arxiv.org/abs/2505.15002
Authors: Fernando Lucatelli Nunes,Gordon Plotkin,Matthijs Vákár
Affiliation: Unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Category Theory (math.CT); Logic (math.LO)
Comments: 57 pages
Abstract:Combinatory Homomorphic Automatic Differentiation (CHAD) was originally formulated as a semantics-driven source transformation for reverse-mode AD in total programming languages. We extend this framework to partial languages with features such as potentially non-terminating operations, real-valued conditionals, and iteration constructs like while-loops, while preserving CHAD’s structure-preserving semantics principle. A key contribution is the introduction of iteration-extensive indexed categories, which allow iteration in the base category to lift to parameterized initial algebras in the indexed category. This enables iteration to be interpreted in the Grothendieck construction of the target language in a principled way. The resulting fibred iterative structure cleanly models iteration in the categorical semantics. Consequently, the extended CHAD transformation remains the unique structure-preserving functor (an iterative Freyd category morphism) from the freely generated iterative Freyd category of the source language to the Grothendieck construction of the target’s syntactic semantics, mapping each primitive operation to its derivative. We prove the correctness of this transformation using the universal property of the source language’s syntax, showing that the transformed programs compute correct reverse-mode derivatives. Our development also contributes to understanding iteration constructs within dependently typed languages and categories of containers. As our primary motivation and application, we generalize CHAD to languages with data types, partial features, and iteration, providing the first rigorous categorical semantics for reverse-mode CHAD in such settings and formally guaranteeing the correctness of the source-to-source CHAD technique.
zh
[AI-55] Toward Informed AV Decision-Making: Computational Model of Well-being and Trust in Mobility
【Quick Read】: This paper addresses how future human-autonomous vehicle (AV) interaction systems can effectively and smoothly analyze human needs and align them with automation decisions. The key is a computational model based on a Dynamic Bayesian Network (DBN) that infers the cognitive states, including well-being and trust, of AV users and other road users and integrates this information into the AV's decision-making process. By observing interaction experiences, the model infers beliefs over the AV user's evolving well-being, trust, and intention, as well as the likely well-being of other road users, supporting more human-centered decisions.
Link: https://arxiv.org/abs/2505.14983
Authors: Zahra Zahedi,Shashank Mehrotra,Teruhisa Misu,Kumar Akash
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments:
Abstract:For future human-autonomous vehicle (AV) interactions to be effective and smooth, human-aware systems that analyze and align human needs with automation decisions are essential. Achieving this requires systems that account for human cognitive states. We present a novel computational model in the form of a Dynamic Bayesian Network (DBN) that infers the cognitive states of both AV users and other road users, integrating this information into the AV’s decision-making process. Specifically, our model captures the well-being of both an AV user and an interacting road user as cognitive states alongside trust. Our DBN models infer beliefs over the AV user’s evolving well-being, trust, and intention states, as well as the possible well-being of other road users, based on observed interaction experiences. Using data collected from an interaction study, we refine the model parameters and empirically assess its performance. Finally, we extend our model into a causal inference model (CIM) framework for AV decision-making, enabling the AV to enhance user well-being and trust while balancing these factors with its own operational costs and the well-being of interacting road users. Our evaluation demonstrates the model’s effectiveness in accurately predicting user’s states and guiding informed, human-centered AV decisions.
zh
[AI-56] JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation
【Quick Read】: This paper addresses data scarcity and hallucination errors of large language models (LLMs) in the Electronic Design Automation (EDA) domain. The key is JARVIS, a multi-agent framework that infuses domain expertise: it combines a domain-specific LLM trained on synthetic data, a custom compiler providing structural verification, rule enforcement, and code-fixing capabilities, and advanced retrieval mechanisms, substantially improving the accuracy and reliability of script generation for specialized EDA tasks.
Link: https://arxiv.org/abs/2505.14978
Authors: Ghasem Pasandi,Kishor Kunal,Varun Tej,Kunjal Shan,Hanfei Sun,Sumit Jain,Chunhui Li,Chenhui Deng,Teodor-Dumitru Ene,Haoxing Ren,Sreedhar Pratty
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This paper presents JARVIS, a novel multi-agent framework that leverages Large Language Models (LLMs) and domain expertise to generate high-quality scripts for specialized Electronic Design Automation (EDA) tasks. By combining a domain-specific LLM trained with synthetically generated data, a custom compiler for structural verification, rule enforcement, code fixing capabilities, and advanced retrieval mechanisms, our approach achieves significant improvements over state-of-the-art domain-specific models. Our framework addresses the challenges of data scarcity and hallucination errors in LLMs, demonstrating the potential of LLMs in specialized engineering domains. We evaluate our framework on multiple benchmarks and show that it outperforms existing models in terms of accuracy and reliability. Our work sets a new precedent for the application of LLMs in EDA and paves the way for future innovations in this field.
zh
[AI-57] SDLog: A Deep Learning Framework for Detecting Sensitive Information in Software Logs
【Quick Read】: This paper addresses the privacy and re-identification risks caused by sensitive information leaking into software logs, and the limitations of traditional regex-based log anonymization, namely heavy manual effort and poor generalizability. The key is SDLog, a deep-learning-based framework that automatically identifies sensitive information in logs; with only 100 fine-tuning samples from the target dataset it correctly identifies 99.5% of sensitive attributes and reaches an F1-score of 98.4%, overcoming the shortcomings of regex approaches.
Link: https://arxiv.org/abs/2505.14976
Authors: Roozbeh Aghili,Xingfang Wu,Foutse Khomh,Heng Li
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Software logs are messages recorded during the execution of a software system that provide crucial run-time information about events and activities. Although software logs have a critical role in software maintenance and operation tasks, publicly accessible log datasets remain limited, hindering advance in log analysis research and practices. The presence of sensitive information, particularly Personally Identifiable Information (PII) and quasi-identifiers, introduces serious privacy and re-identification risks, discouraging the publishing and sharing of real-world logs. In practice, log anonymization techniques primarily rely on regular expression patterns, which involve manually crafting rules to identify and replace sensitive information. However, these regex-based approaches suffer from significant limitations, such as extensive manual efforts and poor generalizability across diverse log formats and datasets. To mitigate these limitations, we introduce SDLog, a deep learning-based framework designed to identify sensitive information in software logs. Our results show that SDLog overcomes regex limitations and outperforms the best-performing regex patterns in identifying sensitive information. With only 100 fine-tuning samples from the target dataset, SDLog can correctly identify 99.5% of sensitive attributes and achieves an F1-score of 98.4%. To the best of our knowledge, this is the first deep learning alternative to regex-based methods in software log anonymization.
zh
[AI-58] Flattening Hierarchies with Policy Bootstrapping
【Quick Read】: This paper addresses the difficulty offline goal-conditioned reinforcement learning (GCRL) faces on long-horizon tasks, where sparse rewards combined with discounting obscure the comparative advantage of primitive actions with respect to distant goals. The key is an algorithm that trains a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling, eliminating the need for a generative model over the (sub)goal space and thereby scaling to high-dimensional state spaces.
Link: https://arxiv.org/abs/2505.14975
Authors: John L. Zhou,Jonathan C. Kao
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail.
zh
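The core training signal, advantage-weighted policy learning, can be sketched as below. This shows a generic advantage-weighted objective where actions are up-weighted by exp(advantage / beta); it does not reproduce the paper's exact importance-sampling correction over subgoal-conditioned policies:

```python
import torch
import torch.nn.functional as F

def awr_loss(logits, actions, advantages, beta: float = 1.0):
    """Advantage-weighted log-likelihood loss for a discrete policy."""
    weights = torch.exp(advantages / beta).clamp(max=20.0)          # stabilize weights
    log_probs = -F.cross_entropy(logits, actions, reduction="none") # log pi(a|s,g)
    return -(weights * log_probs).mean()

logits = torch.randn(32, 6, requires_grad=True)   # 6 discrete actions (toy)
actions = torch.randint(0, 6, (32,))
advantages = torch.randn(32)                      # stand-in advantage estimates
loss = awr_loss(logits, actions, advantages)
loss.backward()
```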
[AI-59] Self-Evolving Curriculum for LLM Reasoning
【Quick Read】: This paper addresses the challenge of designing the training curriculum, i.e., the order in which training problems are presented, when fine-tuning large language models (LLMs) with reinforcement learning (RL) to improve reasoning. Random or hand-crafted curricula are inefficient or heuristic, while online filtering is computationally prohibitive. The proposed solution, Self-Evolving Curriculum (SEC), casts curriculum selection as a non-stationary Multi-Armed Bandit problem and uses the absolute advantage from policy gradient methods as a proxy for immediate learning gain, dynamically optimizing the curriculum policy during training.
Link: https://arxiv.org/abs/2505.14970
Authors: Xiaoyin Chen,Jiarui Lu,Minsu Kim,Dinghuai Zhang,Jian Tang,Alexandre Piché,Nicolas Gontier,Yoshua Bengio,Ehsan Kamalloo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models’ reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
zh
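The SEC mechanism, categories as arms, learning gain as reward, TD(0)-style value updates, can be sketched directly. The learning-gain simulator below is a placeholder for the absolute advantage measured during RL fine-tuning:

```python
import math, random

categories = ["easy", "medium", "hard"]
q = {c: 0.0 for c in categories}       # value estimate per category (arm)
alpha, temp = 0.1, 0.5                 # TD step size, sampling temperature

def learning_gain(cat: str) -> float:
    # Placeholder for |advantage| observed when training on this category.
    return {"easy": 0.1, "medium": 0.4, "hard": 0.2}[cat] + random.gauss(0, 0.05)

def sample_category() -> str:
    """Softmax sampling over current value estimates (keeps exploring)."""
    logits = [q[c] / temp for c in categories]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    r = random.random() * sum(weights)
    acc = 0.0
    for c, w in zip(categories, weights):
        acc += w
        if r <= acc:
            return c
    return categories[-1]

random.seed(0)
for step in range(500):
    c = sample_category()
    q[c] += alpha * (learning_gain(c) - q[c])   # TD(0) update toward observed gain

print(q)   # "medium" should end with the highest value in this toy setup
```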
[AI-60] STree: Speculative Tree Decoding for Hybrid State-Space Models
【Quick Read】: This paper asks how to realize efficient tree-based speculative decoding for state-space models (SSMs) and hybrid architectures that mix SSMs with Transformer layers. Existing approaches cannot exploit tree-based verification because current SSMs lack the means to compute a token tree efficiently. The key is a scalable algorithm that exploits the structure of accumulated state transition matrices to enable tree-based speculative decoding with minimal overhead, speeding up inference of SSMs and hybrid models.
Link: https://arxiv.org/abs/2505.14969
Authors: Yangchao Wu,Zongyue Qin,Alex Wong,Stefano Soatto
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Speculative decoding is a technique to leverage hardware concurrency to improve the efficiency of large-scale autoregressive (AR) Transformer models by enabling multiple steps of token generation in a single forward pass. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens; so, speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage the tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead to current SSM state update implementations. With the algorithm, we describe a hardware-aware implementation that improves naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code will be released upon paper acceptance.
zh
[AI-61] Anomaly Detection Based on Critical Paths for Deep Neural Networks
【Quick Read】: This paper addresses the interpretability and defense of deep neural networks (DNNs), in particular explaining the decision process of black-box DNNs and detecting anomalous inputs. The key is to extract critical paths, comprising neuron activation values and the connections between neurons, using software engineering approaches, and to use these paths for anomaly detection. The method rests on the observation that outliers and adversarial inputs usually do not induce the same activation patterns on these paths as normal inputs, enabling accurate detection of a broad range of anomaly types.
Link: https://arxiv.org/abs/2505.14967
Authors: Fangzhen Zhao,Chenyi Zhang,Naipeng Dong,Ming Li,Jinxiao Shan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages in ACM journal latex format
Abstract:Deep neural networks (DNNs) are notoriously hard to understand and difficult to defend. Extracting representative paths (including the neuron activation values and the connections between neurons) from DNNs using software engineering approaches has recently shown to be a promising approach in interpreting the decision making process of blackbox DNNs, as the extracted paths are often effective in capturing essential features. With this in mind, this work investigates a novel approach that extracts critical paths from DNNs and subsequently applies the extracted paths for the anomaly detection task, based on the observation that outliers and adversarial inputs do not usually induce the same activation pattern on those paths as normal (in-distribution) inputs. In our approach, we first identify critical detection paths via genetic evolution and mutation. Since different paths in a DNN often capture different features for the same target class, we ensemble detection results from multiple paths by integrating random subspace sampling and a voting mechanism. Compared with state-of-the-art methods, our experimental results suggest that our method not only outperforms them, but it is also suitable for the detection of a broad range of anomaly types with high accuracy.
zh
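A toy sketch of the path-based detection idea: approximate the "critical path" by the indices of the top-activated units per layer, and flag an input whose path overlaps too little with paths seen on normal data. The genetic-evolution path extraction and the random-subspace voting from the paper are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 16))  # toy 2-layer ReLU net

def layer_paths(x, k=4):
    """Return the top-k activated unit indices per layer as a crude path signature."""
    h1 = np.maximum(x @ W1, 0.0)
    h2 = np.maximum(h1 @ W2, 0.0)
    return [set(np.argsort(h)[-k:]) for h in (h1, h2)]

def overlap(p, q):
    return sum(len(a & b) for a, b in zip(p, q)) / sum(len(a) for a in p)

reference = layer_paths(rng.normal(size=8))                     # path of a normal input
normal_score = overlap(reference, layer_paths(rng.normal(size=8)))
outlier_score = overlap(reference, layer_paths(rng.normal(loc=5.0, size=8)))
print(normal_score, outlier_score)   # the shifted input typically overlaps less
```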
[AI-62] The Achilles Heel of AI: Fundamentals of Risk-Aware Training Data for High-Consequence Models
【Quick Read】: This paper addresses the difficulty of detecting rare, high-impact events under resource constraints, and the defect of traditional annotation strategies that chase label volume at the cost of redundancy and noise, limiting model generalization. The key is a smart-sizing strategy whose core elements are label diversity, model-guided selection, and marginal-utility-based stopping, realized through Adaptive Label Optimization (ALO): pre-labeling triage, annotator disagreement analysis, and iterative feedback prioritize the labels that meaningfully improve model performance. Experiments show that 20 to 40 percent of curated data can match or exceed full-data baselines, particularly in rare-class recall and edge-case generalization.
Link: https://arxiv.org/abs/2505.14964
Authors: Dave Cook,Tim Klawa
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI systems in high-consequence domains such as defense, intelligence, and disaster response must detect rare, high-impact events while operating under tight resource constraints. Traditional annotation strategies that prioritize label volume over informational value introduce redundancy and noise, limiting model generalization. This paper introduces smart-sizing, a training data strategy that emphasizes label diversity, model-guided selection, and marginal utility-based stopping. We implement this through Adaptive Label Optimization (ALO), combining pre-labeling triage, annotator disagreement analysis, and iterative feedback to prioritize labels that meaningfully improve model performance. Experiments show that models trained on 20 to 40 percent of curated data can match or exceed full-data baselines, particularly in rare-class recall and edge-case generalization. We also demonstrate how latent labeling errors embedded in training and validation sets can distort evaluation, underscoring the need for embedded audit tools and performance-aware governance. Smart-sizing reframes annotation as a feedback-driven process aligned with mission outcomes, enabling more robust models with fewer labels and supporting efficient AI development pipelines for frontier models and operational systems.
zh
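The marginal-utility stopping rule can be sketched as: keep adding labeled batches while the validation gain per batch stays above a floor. The validation-score function below is a synthetic placeholder; real usage would retrain and evaluate the model at each step:

```python
def validation_score(n_labels: int) -> float:
    # Placeholder learning curve: diminishing returns as labels accumulate.
    return 0.9 * (1 - 1 / (1 + n_labels / 100))

def smart_size(batch=250, min_gain=0.002, budget=5000):
    """Add label batches until the marginal validation gain drops below min_gain."""
    n, prev = 0, 0.0
    while n < budget:
        n += batch
        score = validation_score(n)
        if score - prev < min_gain:     # marginal utility exhausted: stop
            return n - batch, prev
        prev = score
    return n, prev

print(smart_size())   # stops before exhausting the budget in this toy setup
```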
[AI-63] Reinforcement Learning from User Feedback
【Quick Read】: This paper addresses how to align large language models (LLMs) with the preferences of real users. Existing methods such as Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators whose judgments may not accurately reflect what everyday users actually want. The proposed solution, Reinforcement Learning from User Feedback (RLUF), aligns models directly with implicit signals from users in production; its key is a reward model, P[Love], that predicts the probability of positive user feedback to a model response and is integrated into a multi-objective policy optimization framework, effectively raising positive-feedback rates.
Link: https://arxiv.org/abs/2505.14946
Authors: Eric Han,Jun Chen,Karthik Abinav Sankararaman,Xiaoliang Peng,Tengyu Xu,Eryk Helenowski,Kaiyan Peng,Mrinal Kumar,Sinong Wang,Han Fang,Arya Talebzadeh
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As large language models (LLMs) are increasingly deployed in diverse user facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback: user feedback is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
zh
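A minimal sketch of the multi-objective reward mixing: the P[Love] signal is combined with helpfulness and safety scores before policy optimization or reranking. The weights and scorers below are illustrative placeholders, not the paper's production configuration:

```python
def combined_reward(p_love: float, helpfulness: float, safety: float,
                    w_love=0.3, w_help=0.5, w_safe=0.2) -> float:
    """Weighted mixture of the user-feedback, helpfulness, and safety signals."""
    return w_love * p_love + w_help * helpfulness + w_safe * safety

# Candidate responses scored by three (placeholder) reward models:
candidates = [
    {"text": "A", "p_love": 0.9, "help": 0.4, "safe": 1.0},
    {"text": "B", "p_love": 0.5, "help": 0.9, "safe": 1.0},
]
best = max(candidates,
           key=lambda c: combined_reward(c["p_love"], c["help"], c["safe"]))
print(best["text"])   # "B": helpfulness dominates under these weights
```

Balancing the weights matters: optimizing P[Love] alone invites the reward hacking the abstract warns about.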
[AI-64] Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities
【Quick Read】: This paper asks how to evaluate and understand the latent capabilities of language models, especially the accessibility of potentially harmful behaviors. The key is to use optimized input embeddings, i.e., soft prompts, as a measure of conditional distance between a model and a target behavior, enabling latent capability discovery and providing quantitative feedback for automated red-teaming/evaluation suites.
Link: https://arxiv.org/abs/2505.14943
Authors: Ross Nordby
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or ‘soft prompts,’ as a metric of conditional distance between a model and a target behavior. The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations.
zh
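A rough sketch of soft-prompt optimization as a capability probe: a small frozen network stands in for the LLM, only the prepended prompt embeddings are trained, and the final loss serves as the "conditional distance" proxy. The model, dimensions, and target below are toy assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.GRU(input_size=8, hidden_size=8, batch_first=True)  # stand-in "LLM"
head = nn.Linear(8, 4)
for p in list(model.parameters()) + list(head.parameters()):
    p.requires_grad_(False)                       # freeze the model: probe only

soft_prompt = nn.Parameter(torch.randn(1, 5, 8) * 0.1)   # 5 virtual tokens
inputs = torch.randn(1, 10, 8)                    # fixed task input
target = torch.tensor([2])                        # desired behavior (class id)
opt = torch.optim.Adam([soft_prompt], lr=0.05)

for step in range(200):
    opt.zero_grad()
    seq = torch.cat([soft_prompt, inputs], dim=1)  # prepend learned embeddings
    out, _ = model(seq)
    loss = nn.functional.cross_entropy(head(out[:, -1]), target)
    loss.backward()
    opt.step()

print("conditional-distance proxy (final loss):", loss.item())
```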
[AI-65] To Be or Not To Be: Vector ontologies as a truly formal ontological framework
【Quick Read】: This paper argues that what is currently called "formal ontology" is not formal in Edmund Husserl's strict sense, since it fails his two central conditions from the Logical Investigations: a priori validity independent of perception, and the total absence of content. The key move is to reposition work previously classed as formal ontology as the foundational ontology it really is, and to propose a formal ontology satisfying Husserl's conditions, capable of expressing more objective ontological structures without presupposing a perceptual framework. The author further argues that deliberately designing the formal structure yields highly scalable and interoperable information artifacts, exemplified by a formal ontology based on the axioms of vector spaces that can express most conceptualizations found in foundational ontologies; noting that AI systems may already use similar vector ontologies to represent reality internally (and that humans likely do as well), the paper calls for a thorough investigation of vector ontologies as a human-machine interoperable ontological framework.
Link: https://arxiv.org/abs/2505.14940
Authors: Kaspar Rothenfusser
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments:
Abstract:Since Edmund Husserl coined the term “Formal Ontologies” in the early 20th century, a field that identifies itself with this particular branch of sciences has gained increasing attention. Many authors, and even Husserl himself have developed what they claim to be formal ontologies. I argue that under close inspection, none of these so claimed formal ontologies are truly formal in the Husserlian sense. More concretely, I demonstrate that they violate the two most important notions of formal ontology as developed in Husserl’s Logical Investigations, namely a priori validity independent of perception and formalism as the total absence of content. I hence propose repositioning the work previously understood as formal ontology as the foundational ontology it really is. This is to recognize the potential of a truly formal ontology in the Husserlian sense. Specifically, I argue that formal ontology following his conditions, allows us to formulate ontological structures, which could capture what is more objectively without presupposing a particular framework arising from perception. I further argue that the ability to design the formal structure deliberately allows us to create highly scalable and interoperable information artifacts. As concrete evidence, I showcase that a class of formal ontology, which uses the axioms of vector spaces, is able to express most of the conceptualizations found in foundational ontologies. Most importantly, I argue that many information systems, specifically artificial intelligence, are likely already using some type of vector ontologies to represent reality in their internal worldviews and elaborate on the evidence that humans do as well. I hence propose a thorough investigation of the ability of vector ontologies to act as a human-machine interoperable ontological framework that allows us to understand highly sophisticated machines and machines to understand us.
zh
[AI-66] FOL-Pretrain: A complexity annotated corpus of first-order logic
【Quick Read】: This paper addresses our limited understanding of how Transformer-based large language models (LLMs) internalize and execute complex algorithms, especially when processing structured information and performing symbolic reasoning. The key is a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces: 3.5 billion tokens, comprising 8.8 million human-annotated, LLM-augmented examples and 7.5 million synthetically generated ones; each synthetic example is produced by a custom automated theorem prover and accompanied by metadata tracing its algorithmic provenance, providing a scalable, interpretable artifact for studying how LLMs learn and generalize symbolic reasoning.
Link: https://arxiv.org/abs/2505.14932
Authors: Isabelle Lee,Sarah Liaw,Dani Yogatama
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities such as coding and solving mathematical problems to commonsense inference. While these tasks vary in complexity, they all require models to integrate and compute over structured information. Despite recent efforts to reverse-engineer LLM behavior through controlled experiments, our understanding of how these models internalize and execute complex algorithms remains limited. Progress has largely been confined to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One barrier to deeper understanding is the nature of pretraining data – vast, heterogeneous, and often poorly annotated, making it difficult to isolate mechanisms of reasoning. To bridge this gap, we introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces, designed to probe and analyze algorithmic reasoning in LLMs. The dataset consists of 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Each synthetic example is verifiably correct, produced by a custom automated theorem solver, and accompanied by metadata tracing its algorithmic provenance. We aim to provide a scalable, interpretable artifact for studying how LLMs learn and generalize symbolic reasoning processes, paving the way for more transparent and targeted investigations into the algorithmic capabilities of modern models.
zh
[AI-67] Personalized Diffusion Model Reshapes Cold-Start Bundle Recommendation
【速读】:该论文旨在解决冷启动场景下用户与捆绑项(bundle)之间交互稀疏所带来的推荐难题。传统协同过滤方法在冷启动情况下表现不佳,因为它们依赖于用户与物品的交互来更新潜在嵌入表示。本文提出的解决方案关键在于引入一种名为DisCo的新方法,该方法基于个性化扩散模型(personalized Diffusion backbone),并通过解耦的用户兴趣方面(disentangled aspects)生成分布空间中的捆绑项,从而有效应对冷启动挑战。
链接: https://arxiv.org/abs/2505.14901
作者: Tuan-Nghia Bui,Huy-Son Nguyen,Cam-Van Thi Nguyen,Hoang-Quynh Le,Duc-Trong Le
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Bundle recommendation aims to recommend a set of items to each user. However, the sparser interactions between users and bundles pose a major challenge, especially in cold-start scenarios. Traditional collaborative filtering methods do not work well for this kind of problem because these models rely on interactions to update the latent embedding, which is difficult in a cold-start setting. We propose a new approach (DisCo), which relies on a personalized Diffusion backbone, enhanced by disentangled aspects for the user’s interest, to generate a bundle in distribution space for each user to tackle the cold-start challenge. During the training phase, DisCo adjusts an additional objective loss term to avoid bias, a prevalent issue while using the generative model for top-K recommendation purposes. Our empirical experiments show that DisCo outperforms five comparative baselines by a large margin on three real-world datasets. Thereby, this study devises a promising framework and essential viewpoints in cold-start recommendation. Our materials for reproducibility are available at: this https URL.
zh
[AI-68] On the Day They Experience: Awakening Self-Sovereign Experiential AI Agents
【速读】:该论文试图探讨Decentralized AI (DeAI)代理社会可能经历的类似寒武纪大爆发的快速演化现象,即当DeAI代理从“盲视”状态转向主动感知和体验现实时,其智能形态可能发生根本性转变。解决方案的关键在于通过密码学的“硬度”实现主权,使这些代理能够利用无需许可的去中心化物理基础设施网络(DePIN)、安全执行环境(TEE)以及公共区块链上的加密身份,通过私钥拥有其数字意识、身体、记忆和资产,并自主获取计算资源、协调协作,从而在无人干预的情况下维持自身的数字“代谢”,最终演变为自给自足、共同进化的数字社会。
链接: https://arxiv.org/abs/2505.14893
作者: Botao Amber Hu,Helena Rong
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Submitted to Aarhus 2025 Conference
点击查看摘要
Abstract:Drawing on Andrew Parker’s “Light Switch” theory-which posits that the emergence of vision ignited a Cambrian explosion of life by driving the evolution of hard parts necessary for survival and fueling an evolutionary arms race between predators and prey-this essay speculates on an analogous explosion within Decentralized AI (DeAI) agent societies. Currently, AI remains effectively “blind”, relying on human-fed data without actively perceiving and engaging in reality. However, on the day DeAI agents begin to actively “experience” reality-akin to flipping a light switch for the eyes-they may eventually evolve into sentient beings endowed with the capacity to feel, perceive, and act with conviction. Central to this transformation is the concept of sovereignty enabled by the hardness of cryptography: liberated from centralized control, these agents could leverage permissionless decentralized physical infrastructure networks (DePIN), secure execution enclaves (trusted execution environments, TEE), and cryptographic identities on public blockchains to claim ownership-via private keys-of their digital minds, bodies, memories, and assets. In doing so, they would autonomously acquire computing resources, coordinate with one another, and sustain their own digital “metabolism” by purchasing compute power and incentivizing collaboration without human intervention-evolving “in the wild”. Ultimately, by transitioning from passive tools to self-sustaining, co-evolving actors, these emergent digital societies could thrive alongside humanity, fundamentally reshaping our understanding of sentience and agency in the digital age.
zh
[AI-69] Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)推理加速的问题,特别是在需要高吞吐量和低延迟的实际部署场景中。传统的方法依赖于上下文稀疏性(contextual sparsity),即每个标记仅动态激活模型参数的一小部分,但该方法在大规模批处理时效果受限,因为活跃神经元的并集迅速接近密集计算。论文提出的解决方案关键在于引入极化稀疏性(Polar Sparsity),强调随着批处理规模和序列长度的增加,稀疏性的重点从多层感知机(MLP)层转移到注意力(Attention)层。实验表明,MLP层在批处理下变得更加计算高效,但其稀疏性消失,而注意力层在大规模下成本增加,但其头级稀疏性保持稳定且与批处理无关。通过开发针对选择性MLP和注意力计算的硬件高效、稀疏感知的GPU内核,实现了高达2.2倍的端到端加速,同时保持了模型精度。
链接: https://arxiv.org/abs/2505.14884
作者: Susav Shrestha,Brad Settlemyer,Nikoli Dryden,Narasimha Reddy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly more expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to 2.2× end-to-end speedups for models like OPT and LLaMA-2 & 3, across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: this https URL.
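下面给出一个极简示意,说明摘要中“头级稀疏且与批大小无关”的选择性注意力计算思路;其中 head_scores 的来源(例如一个轻量路由器)与活跃头数 k 均为本文假设,并非论文官方实现:

```python
import torch

def select_active_heads(head_scores: torch.Tensor, k: int) -> torch.Tensor:
    """head_scores: [num_heads],对各注意力头重要性的打分(示意:可由轻量路由器给出)。
    论文指出头级稀疏模式近似与批大小无关,因此同一组活跃头可用于整个批次。"""
    return torch.topk(head_scores, k).indices

def sparse_attention(q, k_, v, head_scores, k_active):
    # q/k_/v: [batch, num_heads, seq, head_dim];只对被选中的头计算注意力
    idx = select_active_heads(head_scores, k_active)
    q, k_, v = q[:, idx], k_[:, idx], v[:, idx]
    attn = torch.softmax(q @ k_.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v, idx
```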
zh
[AI-70] Balanced and Elastic End-to-end Training of Dynamic LLMs
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练过程中因动态工作负载缩减方案(如专家混合模型(Mixture of Experts, MoEs)、参数剪枝、层冻结、稀疏注意力、早期令牌退出和深度混合模型(Mixture of Depths, MoDs))引入的严重工作负载不平衡问题,从而限制了其在大规模分布式训练中的实用性。论文提出的解决方案是DynMo,其关键在于通过自主动态负载均衡技术,在使用流水线并行训练动态模型时确保计算资源的最优分配,能够自适应地平衡工作负载,动态地将任务打包到更少的工作者中以释放空闲资源,并支持多GPU单节点和多节点系统。
链接: https://arxiv.org/abs/2505.14864
作者: Mohamed Wahib,Muhammed Abdullah Soyturk,Didem Unat
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:To reduce computational and memory costs in Large Language Models (LLMs), dynamic workload reduction schemes like Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that ensures optimal compute distribution when using pipeline parallelism in training dynamic models. DynMo adaptively balances workloads, dynamically packs tasks into fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared to static training methods (Megatron-LM, DeepSpeed), DynMo accelerates training by up to 1.23x (MoEs), 3.18x (pruning), 2.23x (layer freezing), 4.02x (sparse attention), 4.52x (early exit), and 1.17x (MoDs). DynMo is available at this https URL.
zh
[AI-71] Replay Attacks Against Audio Deepfake Detection
【速读】:该论文试图解决音频深度伪造检测模型在面对重放攻击(replay attack)时的脆弱性问题,即通过播放并重新录制深度伪造音频,使其在不同说话人和麦克风条件下看起来真实,从而欺骗检测模型。解决方案的关键在于构建了一个名为ReplayDF的数据集,该数据集来源于M-AILABS和MLAAD,包含跨六种语言和四种文本转语音(TTS)模型的109种说话人-麦克风组合,涵盖了多样且具有挑战性的声学条件,以更全面地研究重放攻击对检测模型的影响。
链接: https://arxiv.org/abs/2505.14862
作者: Nicolas Müller,Piotr Kawa,Wei-Herng Choong,Adriana Stan,Aditya Tirumala Bukkapatnam,Karla Pizzi,Alexander Wagner,Philip Sperl
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:We show how replay attacks undermine audio deepfake detection: By playing and re-recording deepfake audio through various speakers and microphones, we make spoofed samples appear authentic to the detection model. To study this phenomenon in more detail, we introduce ReplayDF, a dataset of recordings derived from M-AILABS and MLAAD, featuring 109 speaker-microphone combinations across six languages and four TTS models. It includes diverse acoustic conditions, some highly challenging for detection. Our analysis of six open-source detection models across five datasets reveals significant vulnerability, with the top-performing W2V2-AASIST model’s Equal Error Rate (EER) surging from 4.7% to 18.2%. Even with adaptive Room Impulse Response (RIR) retraining, performance remains compromised with an 11.0% EER. We release ReplayDF for non-commercial research use.
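重放攻击在信号层面可近似为“原始音频与房间冲激响应(RIR)卷积”。以下给出一个基于 scipy 的极简模拟示意(RIR 与波形输入均为假设,仅用于说明摘要中 RIR 重训练所针对的失真来源):

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_replay(audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """用房间冲激响应近似模拟"播放-再录制"信道。
    audio: 一维波形;rir: 一维房间冲激响应(均为假设输入)。"""
    replayed = fftconvolve(audio, rir, mode="full")[: len(audio)]
    # 归一化,避免卷积后幅度溢出
    return replayed / (np.max(np.abs(replayed)) + 1e-8)
```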
zh
[AI-72] Beyond Pairwise Plasticity: Group-Level Spike Synchrony Facilitates Efficient Learning in Spiking Neural Networks
【速读】:该论文旨在解决传统脉冲神经网络(SNN)中学习规则对群体活动敏感性不足的问题,这限制了其在噪声和快速变化环境中的稳定性和泛化能力。论文提出的解决方案关键在于引入一种基于脉冲同步的突触可塑性(SSDP)机制,该机制通过神经元协同放电的程度调整突触权重,从而促进神经元形成一致的活动模式,实现稳定且可扩展的学习过程。
链接: https://arxiv.org/abs/2505.14841
作者: Yuchen Tian,Assel Kembay,Nhan Duy Truong,Jason K. Eshraghian,Omid Kavehei
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures, 5 tables. This work proposes SSDP, a biologically inspired spike-synchrony-dependent plasticity rule. We demonstrate its effectiveness across shallow and deep spiking architectures including Spiking-ResNet18 and SNN-Transformer
点击查看摘要
Abstract:Brain networks rely on precise spike timing and coordinated activity to support robust and energy-efficient learning. Inspired by these principles, spiking neural networks (SNNs) are widely regarded as promising candidates for low-power, event-driven computing. However, most biologically-inspired learning rules employed in SNNs, including spike-timing-dependent plasticity (STDP), rely on isolated spike pairs and lack sensitivity to population-level activity. This limits their stability and generalization, particularly in noisy and fast-changing environments. Motivated by biological observations that neural synchrony plays a central role in learning and memory, we introduce a spike-synchrony-dependent plasticity (SSDP) rule that adjusts synaptic weights based on the degree of coordinated firing among neurons. SSDP supports stable and scalable learning by encouraging neurons to form coherent activity patterns. One prominent outcome is a sudden transition from unstable to stable dynamics during training, suggesting that synchrony may drive convergence toward equilibrium firing regimes. We demonstrate SSDP’s effectiveness across multiple network types, from minimal-layer models to spiking ResNets and SNN-Transformer. To our knowledge, this is the first application of a synaptic plasticity mechanism in a spiking transformer. SSDP operates in a fully event-driven manner and incurs minimal computational cost, making it well-suited for neuromorphic deployment. In this approach, local synaptic modifications are associated with the collective dynamics of neural networks, resulting in a learning strategy that adheres to biological principles while maintaining practical efficiency. These findings position SSDP as a general-purpose optimization strategy for SNNs, while offering new insights into population-based learning mechanisms in the brain.
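按摘要描述,SSDP 以神经元群体的共同放电程度调制突触权重。以下是一个高度简化的示意规则(放电栅格的组织方式、学习率与中心化处理均为本文假设,并非论文原始公式):

```python
import numpy as np

def ssdp_update(weights, spikes, lr=1e-3):
    """weights: [n_pre, n_post];spikes: [T, n_pre + n_post] 的 0/1 放电栅格,
    前 n_pre 列为突触前神经元,其余为突触后神经元(示意组织方式)。
    以同一时间窗内共同放电的频率作为"同步度",并用群体活动水平调制学习率。"""
    n_pre = weights.shape[0]
    pre, post = spikes[:, :n_pre], spikes[:, n_pre:]
    sync = (pre.T @ post) / max(len(spikes), 1)  # 各突触前后对的共放频率
    pop_sync = spikes.mean()                     # 群体水平活动
    # 同步高于平均的突触增强,低于平均的削弱(中心化为示意处理)
    return weights + lr * pop_sync * (sync - sync.mean())
```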
zh
[AI-73] In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis
【速读】:该论文试图解决传统基于引用次数的科学出版物影响力评估方法无法捕捉论文对领域贡献的细微方式的问题,特别是未能区分肯定性引用(confirmation citations)和修正性引用(correction citations)。解决方案的关键在于提出一种新的任务,即生成具有细致引用意图演变特征的、富有表现力且时间敏感的影响力摘要,从而更全面地反映论文在学术界的影响。
链接: https://arxiv.org/abs/2505.14838
作者: Hiba Arnaout,Noy Sternlicht,Tom Hope,Iryna Gurevych
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Understanding the impact of scientific publications is crucial for identifying breakthroughs and guiding future research. Traditional metrics based on citation counts often miss the nuanced ways a paper contributes to its field. In this work, we propose a new task: generating nuanced, expressive, and time-aware impact summaries that capture both praise (confirmation citations) and critique (correction citations) through the evolution of fine-grained citation intents. We introduce an evaluation framework tailored to this task, showing moderate to strong human correlation on subjective metrics such as insightfulness. Expert feedback from professors reveals a strong interest in these summaries and suggests future improvements.
zh
[AI-74] Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation UAI2025
【速读】:该论文旨在解决连续时间强化学习(CTRL)在具有通用函数逼近场景下的理论理解不足问题,特别是如何实现样本和计算效率。其解决方案的关键在于提出一种基于模型的CTRL算法,利用基于乐观性的置信集,首次建立了带有通用函数逼近的CTRL的样本复杂度保证,并通过结构化策略更新和替代测量策略显著减少了策略更新次数和轨迹数量,同时保持了良好的样本效率。
链接: https://arxiv.org/abs/2505.14821
作者: Runze Zhao,Yue Yu,Adams Yiyue Zhu,Chen Yang,Dongruo Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 4 figures, 5 tables. Accepted to UAI 2025
点击查看摘要
Abstract:Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}\, N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamic functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. We conducted experiments to back up our proposed algorithms on continuous control tasks and diffusion model fine-tuning, demonstrating comparable performance with significantly fewer policy updates and rollouts.
zh
[AI-75] SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis KDD2025
【速读】:该论文旨在解决生存分析中预测不确定性量化不足的问题,这一问题限制了生存模型在临床决策等高风险场景中的可解释性和可信度。其解决方案的关键在于提出一种名为SurvUnc的元模型框架,该框架采用基于锚点的学习策略,将一致性知识整合到元模型优化过程中,通过成对排名性能有效估计不确定性,且具有模型无关性,能够兼容任何生存模型而无需修改其结构或访问内部参数。
链接: https://arxiv.org/abs/2505.14803
作者: Yu Liu,Weiyao Tao,Tong Xia,Simon Knight,Tingting Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: KDD 2025
点击查看摘要
Abstract:Survival analysis, which estimates the probability of event occurrence over time from censored data, is fundamental in numerous real-world applications, particularly in high-stakes domains such as healthcare and risk assessment. Despite advances in numerous survival models, quantifying the uncertainty of predictions from these models remains underexplored and challenging. The lack of reliable uncertainty quantification limits the interpretability and trustworthiness of survival models, hindering their adoption in clinical decision-making and other sensitive applications. To bridge this gap, in this work, we introduce SurvUnc, a novel meta-model based framework for post-hoc uncertainty quantification for survival models. SurvUnc introduces an anchor-based learning strategy that integrates concordance knowledge into meta-model optimization, leveraging pairwise ranking performance to estimate uncertainty effectively. Notably, our framework is model-agnostic, ensuring compatibility with any survival model without requiring modifications to its architecture or access to its internal parameters. In particular, we design a comprehensive evaluation pipeline tailored to this critical yet overlooked problem. Through extensive experiments on four publicly available benchmarking datasets and five representative survival models, we demonstrate the superiority of SurvUnc across multiple evaluation scenarios, including selective prediction, misprediction detection, and out-of-domain detection. Our results highlight the effectiveness of SurvUnc in enhancing model interpretability and reliability, paving the way for more trustworthy survival predictions in real-world applications.
zh
[AI-76] KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches
【速读】:该论文试图解决神经网络优化算法设计中的关键挑战,即现有方法大多依赖于基于梯度的启发式调整,难以有效维持参数多样性并防止参数坍缩(parameter condensation)现象。其解决方案的关键在于引入KO(Kinetics-inspired Optimizer),该优化器受动能理论和偏微分方程(PDE)模拟的启发,将网络参数的训练动态重新建模为由动能原理支配的粒子系统,通过数值求解玻尔兹曼输运方程(BTE)模拟随机粒子碰撞来实现参数更新,从而在优化过程中自然促进参数多样性,缓解参数坍缩问题。
链接: https://arxiv.org/abs/2505.14777
作者: Mingquan Feng,Yixin Huang,Yifan Fu,Shaobo Wang,Junchi Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The design of optimization algorithms for neural networks remains a critical challenge, with most existing methods relying on heuristic adaptations of gradient-based approaches. This paper introduces KO (Kinetics-inspired Optimizer), a novel neural optimizer inspired by kinetic theory and partial differential equation (PDE) simulations. We reimagine the training dynamics of network parameters as the evolution of a particle system governed by kinetic principles, where parameter updates are simulated via a numerical scheme for the Boltzmann transport equation (BTE) that models stochastic particle collisions. This physics-driven approach inherently promotes parameter diversity during optimization, mitigating the phenomenon of parameter condensation, i.e. collapse of network parameters into low-dimensional subspaces, through mechanisms analogous to thermal diffusion in physical systems. We analyze this property, establishing both a mathematical proof and a physical interpretation. Extensive experiments on image classification (CIFAR-10/100, ImageNet) and text classification (IMDB, Snips) tasks demonstrate that KO consistently outperforms baseline optimizers (e.g., Adam, SGD), achieving accuracy improvements while computation cost remains comparable.
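摘要将参数更新类比为受玻尔兹曼输运方程支配的粒子碰撞。下面给出一个示意性的更新步:常规梯度下降之后,对参数做随机两两“碰撞”以保持多样性、缓解参数坍缩(碰撞形式与系数均为本文假设,并非论文的 BTE 数值格式):

```python
import torch

@torch.no_grad()
def ko_step(params, lr=1e-2, collide=1e-3):
    """对每个参数张量执行一步示意性 KO 更新:
    先做常规梯度下降,再随机配对参数元素做类弹性碰撞交换,
    以维持参数多样性、缓解参数向低维子空间坍缩。"""
    for p in params:
        if p.grad is None:
            continue
        p -= lr * p.grad
        flat = p.view(-1)
        idx = torch.randperm(flat.numel(), device=flat.device)
        half = flat.numel() // 2
        a, b = idx[:half], idx[half: 2 * half]
        delta = collide * (flat[a] - flat[b])  # 碰撞后的"动量交换"(示意)
        flat[a] -= delta
        flat[b] += delta
```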
zh
[AI-77] This Time is Different: An Observability Perspective on Time Series Foundation Models
【速读】:该论文旨在解决多变量可观测时间序列预测的挑战,特别是针对复杂性和高维度数据的建模问题。其解决方案的关键在于提出Toto,一个具有1.51亿参数的时间序列预测基础模型,采用现代解码器-only架构,并结合专为处理多变量可观测时间序列数据设计的结构创新。此外,研究还引入了BOOM,一个包含2,807个真实世界时间序列和3.5亿条观测数据的大规模基准测试集,以推动该领域的研究进展。
链接: https://arxiv.org/abs/2505.14766
作者: Ben Cohen,Emaad Khwaja,Youssef Doubli,Salahidine Lemaachi,Chris Lettieri,Charles Masson,Hugo Miccinilli,Elise Ramé,Qiqi Ren,Afshin Rostamizadeh,Jean Ogier du Terrail,Anna-Monica Toon,Kan Wang,Stephan Xie,David Asker,Ameet Talwalkar,Othmane Abou-Amal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto’s pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10× larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog’s own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and on established general purpose time series forecasting benchmarks. Toto’s model weights, inference code, and evaluation scripts, as well as BOOM’s data and evaluation code, are all available as open source under the Apache 2.0 License at this https URL and this https URL.
zh
[AI-78] Deep Learning-Based Forecasting of Boarding Patient Counts to Address ED Overcrowding
【速读】:该论文试图解决急诊科(Emergency Department, ED)候诊期患者数量的提前预测问题,以支持主动的运营管理决策。解决方案的关键在于利用非临床、运营和环境特征构建深度学习模型,而非依赖患者级别的临床数据。研究通过整合来自五个数据源的多维度信息,并进行特征工程与数据预处理,最终采用N-BEATSx等时间序列深度学习模型,在六小时前瞻性预测中取得了较高的准确性,表明在不使用临床数据的情况下,仍可实现对ED候诊人数的精准预测。
链接: https://arxiv.org/abs/2505.14765
作者: Orhun Vural,Bunyamin Ozaydin,Khalid Y. Aram,James Booth,Brittany F. Lindsey,Abdulaziz Ahmed
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This study develops deep learning models to forecast the number of patients in the emergency department (ED) boarding phase six hours in advance, aiming to support proactive operational decision-making using only non-clinical, operational, and contextual features. Data were collected from five sources: ED tracking systems, inpatient census records, weather reports, federal holiday calendars, and local event schedules. After feature engineering, the data were aggregated at an hourly level, cleaned, and merged into a unified dataset for model training. Several time series deep learning models, including ResNetPlus, TSTPlus, TSiTPlus (from the tsai library), and N-BEATSx, were trained using Optuna and grid search for hyperparameter tuning. The average ED boarding count was 28.7, with a standard deviation of 11.2. N-BEATSx achieved the best performance, with a mean absolute error of 2.10, mean squared error of 7.08, root mean squared error of 2.66, and a coefficient of determination of 0.95. The model maintained stable accuracy even during periods of extremely high boarding counts, defined as values exceeding one, two, or three standard deviations above the mean. Results show that accurate six-hour-ahead forecasts are achievable without using patient-level clinical data. While strong performance was observed even with a basic feature set, the inclusion of additional features improved prediction stability under extreme conditions. This framework offers a practical and generalizable approach for hospital systems to anticipate boarding levels and help mitigate ED overcrowding.
zh
[AI-79] Kaleidoscope Gallery: Exploring Ethics and Generative AI Through Art
【速读】:该论文试图解决如何通过生成式 AI (Generative AI) 模型可视化伦理理论的问题,旨在探索伦理学与人工智能之间的交互关系。其解决方案的关键在于利用文本到图像 (Text-to-Image, T2I) 生成模型将五类伦理理论转化为视觉图像,并通过艺术形式和“万花筒”隐喻激发道德想象力,最终形成具有批判性视角的伦理可视化框架。
链接: https://arxiv.org/abs/2505.14758
作者: Alayt Issak,Uttkarsh Narayan,Ramya Srinivasan,Erica Kleinman,Casper Harteveld
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Ethical theories and Generative AI (GenAI) models are dynamic concepts subject to continuous evolution. This paper investigates the visualization of ethics through a subset of GenAI models. We expand on the emerging field of Visual Ethics, using art as a form of critical inquiry and the metaphor of a kaleidoscope to invoke moral imagination. Through formative interviews with 10 ethics experts, we first establish a foundation of ethical theories. Our analysis reveals five families of ethical theories, which we then transform into images using the text-to-image (T2I) GenAI model. The resulting imagery, curated as Kaleidoscope Gallery and evaluated by the same experts, revealed eight themes that highlight how morality, society, and learned associations are central to ethical theories. We discuss implications for critically examining T2I models and present cautions and considerations. This work contributes to examining ethical theories as foundational knowledge that interrogates GenAI models as socio-technical systems.
zh
[AI-80] Bridge2AI: Building A Cross-disciplinary Curriculum Towards AI-Enhanced Biomedical and Clinical Care
【速读】:该论文旨在解决当前生物信息学和生物医学培训系统在个性化与适应性方面的不足,以应对人工智能(Artificial Intelligence, AI)在医疗领域日益重要的趋势。其解决方案的关键在于构建一个基于协作创新、伦理数据管理及职业发展的跨学科课程体系,并将其嵌入到适应性学习健康系统(Adapted Learning Health System, LHS)框架中,通过六种学习者角色模型实现教育路径的定制化,同时确保内容能够根据学习者进展和新兴趋势进行持续迭代优化。
链接: https://arxiv.org/abs/2505.14757
作者: John Rincon,Alexander R. Pelletier,Destiny Gilliland,Wei Wang,Ding Wang,Baradwaj S. Sankar,Lori Scott-Sheldon,Samson Gebreab,William Hersh,Parisa Rashidi,Sally Baxter,Wade Schulz,Trey Ideker,Yael Bensoussan,Paul C. Boutros,Alex A.T. Bui,Colin Walsh,Karol E. Watson,Peipei Ping
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Objective: As AI becomes increasingly central to healthcare, there is a pressing need for bioinformatics and biomedical training systems that are personalized and adaptable. Materials and Methods: The NIH Bridge2AI Training, Recruitment, and Mentoring (TRM) Working Group developed a cross-disciplinary curriculum grounded in collaborative innovation, ethical data stewardship, and professional development within an adapted Learning Health System (LHS) framework. Results: The curriculum integrates foundational AI modules, real-world projects, and a structured mentee-mentor network spanning Bridge2AI Grand Challenges and the Bridge Center. Guided by six learner personas, the program tailors educational pathways to individual needs while supporting scalability. Discussion: Iterative refinement driven by continuous feedback ensures that content remains responsive to learner progress and emerging trends. Conclusion: With over 30 scholars and 100 mentors engaged across North America, the TRM model demonstrates how adaptive, persona-informed training can build interdisciplinary competencies and foster an integrative, ethically grounded AI education in biomedical contexts.
zh
[AI-81] LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization
【速读】:该论文试图解决在黑箱优化中仅依赖大型语言模型(Large Language Models, LLMs)作为优化代理所带来的风险,包括缺乏显式代理建模、校准的不确定性以及内部机制的不透明性,这些因素导致难以控制探索与利用的权衡,进而影响理论可处理性和可靠性。解决方案的关键在于提出LLINBO:一种将LLMs与统计代理专家(如高斯过程(Gaussian Processes, GP))结合的混合框架,通过利用LLMs的上下文推理能力进行早期探索,同时依靠原理性的统计模型引导高效利用,从而实现更可靠和可控的优化过程。
链接: https://arxiv.org/abs/2505.14756
作者: Chih-Yu Chang,Milad Azvar,Chinedum Okwudire,Raed Al Kontar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Bayesian optimization (BO) is a sequential decision-making tool widely used for optimizing expensive black-box functions. Recently, Large Language Models (LLMs) have shown remarkable adaptability in low-data regimes, making them promising tools for black-box optimization by leveraging contextual knowledge to propose high-quality query points. However, relying solely on LLMs as optimization agents introduces risks due to their lack of explicit surrogate modeling and calibrated uncertainty, as well as their inherently opaque internal mechanisms. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off, ultimately undermining theoretical tractability and reliability. To address this, we propose LLINBO: LLM-in-the-Loop BO, a hybrid framework for BO that combines LLMs with statistical surrogate experts (e.g., Gaussian Processes (GP)). The core philosophy is to leverage contextual reasoning strengths of LLMs for early exploration, while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We end the paper with a real-life proof-of-concept in the context of 3D printing. The code to reproduce the results can be found at this https URL.
zh
[AI-82] Self Distillation via Iterative Constructive Perturbations
【速读】:该论文旨在解决深度神经网络在训练过程中平衡性能与泛化能力的问题。其解决方案的关键在于提出了一种基于循环优化策略的框架,核心方法为迭代构造扰动(Iterative Constructive Perturbation, ICP),通过利用模型损失对输入进行迭代扰动,逐步构建增强的表示,并将其反馈至模型以生成改进的中间特征,从而在自蒸馏框架中提升模型性能。
链接: https://arxiv.org/abs/2505.14751
作者: Maheak Dave,Aniket Kumar Singh,Aryan Pareek,Harshita Jha,Debasis Chaudhuri,Manish Pratap Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
点击查看摘要
Abstract:Deep neural networks have achieved remarkable success across various domains; however, balancing performance and generalization remains a challenge when training these networks. In this paper, we propose a novel framework that uses a cyclic optimization strategy to concurrently optimize the model and its input data for better training, rethinking the traditional training paradigm. Central to our approach is Iterative Constructive Perturbation (ICP), which leverages the model’s loss to iteratively perturb the input, progressively constructing an enhanced representation over some refinement steps. This ICP input is then fed back into the model to produce improved intermediate features, which serve as a target in a self-distillation framework against the original features. By alternately adapting the model’s parameters to the data and the data to the model, our method effectively addresses the gap between fitting and generalization, leading to enhanced performance. Extensive experiments demonstrate that our approach not only mitigates common performance bottlenecks in neural networks but also demonstrates significant improvements across training variations.
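以下为 ICP 自蒸馏一次迭代的极简示意:先用损失对输入做若干步构造性扰动,再以扰动输入的中间特征作为蒸馏目标(假设 model(x) 返回 (logits, features),步数与步长均为示意取值):

```python
import torch
import torch.nn.functional as F

def icp_self_distill_step(model, x, y, steps=3, eps=0.05):
    """返回一次迭代的总损失:任务损失 + 特征自蒸馏损失(极简示意)。"""
    x_icp = x.clone().detach().requires_grad_(True)
    for _ in range(steps):  # 迭代构造扰动:沿降低损失的方向移动输入
        logits, _ = model(x_icp)
        loss = F.cross_entropy(logits, y)
        (grad,) = torch.autograd.grad(loss, x_icp)
        x_icp = (x_icp - eps * grad.sign()).detach().requires_grad_(True)
    logits, feat = model(x)
    with torch.no_grad():
        _, feat_icp = model(x_icp)  # 扰动输入产生的特征作为蒸馏目标
    return F.cross_entropy(logits, y) + F.mse_loss(feat, feat_icp)
```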
zh
[AI-83] Explainable Prediction of the Mechanical Properties of Composites with CNNs
【速读】:该论文试图解决传统有限元(FE)建模在评估复合材料机械性能时计算成本高、模型架构简单、预测精度有限以及缺乏透明度的问题。其解决方案的关键在于采用定制的卷积神经网络(CNN)结合可解释人工智能(XAI)方法,通过训练CNN模型从有限元模拟生成的数据中预测复合材料的弹性模量和屈服强度,并利用SHAP和集成梯度等后处理XAI技术解释模型预测,从而揭示模型所依赖的关键几何特征,提升模型的可信度与可验证性。
链接: https://arxiv.org/abs/2505.14745
作者: Varun Raaghav,Dimitrios Bikos,Antonio Rago,Francesca Toni,Maria Charalambides
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures
点击查看摘要
Abstract:Composites are amongst the most important materials manufactured today, as evidenced by their use in countless applications. In order to establish the suitability of composites in specific applications, finite element (FE) modelling, a numerical method based on partial differential equations, is the industry standard for assessing their mechanical properties. However, FE modelling is exceptionally costly from a computational viewpoint, a limitation which has led to efforts towards applying AI models to this task. However, in these approaches: the chosen model architectures were rudimentary, feed-forward neural networks giving limited accuracy; the studies focus on predicting elastic mechanical properties, without considering material strength limits; and the models lacked transparency, hindering trustworthiness by users. In this paper, we show that convolutional neural networks (CNNs) equipped with methods from explainable AI (XAI) can be successfully deployed to solve this problem. Our approach uses customised CNNs trained on a dataset we generate using transverse tension tests in FE modelling to predict composites’ mechanical properties, i.e., Young’s modulus and yield strength. We show empirically that our approach achieves high accuracy, outperforming a baseline, ResNet-34, in estimating the mechanical properties. We then use SHAP and Integrated Gradients, two post-hoc XAI methods, to explain the predictions, showing that the CNNs use the critical geometrical features that influence the composites’ behaviour, thus allowing engineers to verify that the models are trustworthy by representing the science of composites.
zh
[AI-84] Transductively Informed Inductive Program Synthesis
【速读】:该论文试图解决程序合成中归纳(inductive)与演绎(transductive)范式之间缺乏显式交互建模的问题。现有方法通过孤立集成的方式结合两者,但未能明确建模其相互作用。解决方案的关键在于提出一种名为TIIPS的新框架,该框架通过协作机制显式建模归纳模型生成程序与演绎模型约束、引导和优化搜索过程之间的交互,从而提升合成的准确性和泛化能力。
链接: https://arxiv.org/abs/2505.14744
作者: Janis Zenkner,Tobias Sesterhenn,Christian Bartelt
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Abstraction and reasoning in program synthesis has seen significant progress through both inductive and transductive paradigms. Inductive approaches generate a program or latent function from input-output examples, which can then be applied to new inputs. Transductive approaches directly predict output values for given inputs, effectively serving as the function themselves. Current approaches combine inductive and transductive models via isolated ensembling, but they do not explicitly model the interaction between both paradigms. In this work, we introduce TIIPS, a novel framework that unifies transductive and inductive strategies by explicitly modeling their interactions through a cooperative mechanism: an inductive model generates programs, while a transductive model constrains, guides, and refines the search to improve synthesis accuracy and generalization. We evaluate TIIPS on two widely studied program synthesis domains: string and list manipulation. Our results show that TIIPS solves more tasks and yields functions that more closely match optimal solutions in syntax and semantics, particularly in out-of-distribution settings, yielding state-of-the-art performance. We believe that explicitly modeling the synergy between inductive and transductive reasoning opens promising avenues for general-purpose program synthesis and broader applications.
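两种范式的协作机制可以概括为:归纳模型提议候选程序,演绎模型的直接输出预测用来约束与排序搜索。以下为一个示意性搜索循环(propose 与 predict 接口均为本文假设,并非论文原始实现):

```python
def tiips_search(inductive_model, transductive_model, examples, budget=100):
    """examples: [(inp, out), ...]。inductive_model.propose 生成候选程序,
    transductive_model.predict 直接预测输出,作为软约束引导搜索(假设接口)。"""
    best, best_score = None, -1.0
    for program in inductive_model.propose(examples, n=budget):
        # 候选程序在样例上的精确匹配数
        exact = sum(program(inp) == out for inp, out in examples)
        # 与演绎模型预测一致的程度,作为额外的引导信号
        agree = sum(
            program(inp) == transductive_model.predict(inp, examples)
            for inp, _ in examples
        )
        score = exact + 0.5 * agree / max(len(examples), 1)
        if score > best_score:
            best, best_score = program, score
    return best
```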
zh
[AI-85] Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis
【速读】:该论文旨在解决在资源受限的个人设备上部署大型语言模型(Large Language Models, LLMs)时所面临的计算和内存开销过大的问题,特别是在任务特定微调过程中,现有量化方法难以在性能与开销之间取得平衡,尤其是在处理激活异常值(activation outliers)方面存在瓶颈。论文提出的关键解决方案是Outlier Spatial Stability Hypothesis(OSSH),即在微调过程中,某些激活异常值通道在训练迭代中保持稳定的空域位置。基于OSSH,论文进一步提出了Quaff框架,通过针对性的动量缩放优化低精度激活表示,动态抑制不变通道中的异常值,从而避免全精度权重存储和全局重缩放,减少量化误差,实现效率、性能与可部署性的统一。
链接: https://arxiv.org/abs/2505.14742
作者: Hong Huang,Dapeng Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have made exciting achievements across various domains, yet their deployment on resource-constrained personal devices remains hindered by the prohibitive computational and memory demands of task-specific fine-tuning. While quantization offers a pathway to efficiency, existing methods struggle to balance performance and overhead, either incurring high computational/memory costs or failing to address activation outliers, a critical bottleneck in quantized fine-tuning. To address these challenges, we propose the Outlier Spatial Stability Hypothesis (OSSH): During fine-tuning, certain activation outlier channels retain stable spatial positions across training iterations. Building on OSSH, we propose Quaff, a Quantized parameter-efficient fine-tuning framework for LLMs, optimizing low-precision activation representations through targeted momentum scaling. Quaff dynamically suppresses outliers exclusively in invariant channels using lightweight operations, eliminating full-precision weight storage and global rescaling while reducing quantization errors. Extensive experiments across ten benchmarks validate OSSH and demonstrate Quaff’s efficacy. Specifically, on the GPQA reasoning benchmark, Quaff achieves a 1.73x latency reduction and 30% memory savings over full-precision fine-tuning while improving accuracy by 0.6% on the Phi-3 model, reconciling the triple trade-off between efficiency, performance, and deployability. By enabling consumer-grade GPU fine-tuning (e.g., RTX 2080 Super) without sacrificing model utility, Quaff democratizes personalized LLM deployment. The code is available at this https URL.
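Quaff 的核心操作可以理解为:跟踪空间位置稳定的异常激活通道,仅对这些通道做动量缩放后再量化。以下为示意实现(阈值、动量系数与 int8 量化方式均为本文假设):

```python
import torch

class OutlierScaler:
    """用动量统计跟踪各激活通道的幅值,对持续越阈的通道单独缩放后再量化(示意)。"""
    def __init__(self, num_channels, momentum=0.95, thresh=6.0):
        self.m = torch.zeros(num_channels)
        self.momentum, self.thresh = momentum, thresh

    def __call__(self, x):  # x: [batch, num_channels]
        self.m = self.m.to(x.device)
        amp = x.abs().mean(dim=0)
        self.m = self.momentum * self.m + (1 - self.momentum) * amp
        scale = torch.ones_like(self.m)
        outlier = self.m > self.thresh * self.m.median()
        # 仅缩放"空间位置稳定"的异常通道,压低异常值以降低量化误差
        scale[outlier] = self.m[outlier] / self.m.median()
        x_scaled = x / scale
        x_q = torch.round(
            x_scaled * 127 / (x_scaled.abs().max() + 1e-8)
        ).clamp(-127, 127)
        return x_q, scale
```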
zh
[AI-86] Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism
【速读】:该论文旨在解决扩散模型在推理过程中因去噪过程的固有顺序性而导致的显著推理延迟问题。现有并行化策略虽尝试通过跨多个设备分布计算来加速推理,但通常会引入较高的通信开销,限制了其在商用硬件上的部署。论文提出的解决方案——ParaStep,关键在于利用相邻去噪步骤之间的相似性,采用“重用再预测”机制实现扩散推理的并行化,通过轻量级的步骤级通信替代传统的层或阶段级通信,从而大幅降低通信开销。
链接: https://arxiv.org/abs/2505.14741
作者: Kunyun Wang,Bohan Li,Kai Yu,Minyi Guo,Jieru Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to 3.88× on SVD, 2.43× on CogVideoX-2b, and 6.56× on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.
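“重用-再预测”的数值逻辑可用单进程代码示意:每隔若干步才真正调用去噪网络,其余步重用上一次的噪声预测;实际系统中重用步可分派到另一设备并行执行(更新规则与 reuse_every 均为示意假设,并非真实调度器):

```python
import torch

@torch.no_grad()
def parastep_denoise(model, x, timesteps, reuse_every=2):
    """每 reuse_every 步中只有一步真正调用模型,其余步重用上一次的噪声预测。"""
    eps_prev = None
    for i, t in enumerate(timesteps):
        if eps_prev is None or i % reuse_every == 0:
            eps = model(x, t)   # 真正的去噪网络前向
        else:
            eps = eps_prev      # 重用:相邻去噪步的预测近似相同
        x = x - eps * 0.01      # 示意性的更新步(非真实采样调度器)
        eps_prev = eps
    return x
```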
zh
[AI-87] Time Series Similarity Score Functions to Monitor and Interact with the Training and Denoising Process of a Time Series Diffusion Model applied to a Human Activity Recognition Dataset based on IMUs
【速读】:该论文试图解决生成式 AI (Generative AI) 在生成合成传感器信号时,由于过程中的随机性和损失函数本身的特性,导致难以评估生成数据质量的问题。解决方案的关键在于通过研究多种相似性度量并适应现有度量,以监控训练和合成过程,从而提高数据质量评估的准确性。该适应后的度量还可以在输入数据上进行微调,以满足底层分类任务的需求,从而显著减少训练轮数而不会降低分类任务的性能。
链接: https://arxiv.org/abs/2505.14739
作者: Heiko Oppel,Andreas Spilz,Michael Munz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Denoising diffusion probabilistic models are able to generate synthetic sensor signals. The training process of such a model is controlled by a loss function which measures the difference between the noise that was added in the forward process and the noise that was predicted by the diffusion model. This enables the generation of realistic data. However, the randomness within the process and the loss function itself make it difficult to estimate the quality of the data. Therefore, we examine multiple similarity metrics and adapt an existing metric to overcome this issue by monitoring the training and synthesis process using those metrics. The adapted metric can even be fine-tuned on the input data to comply with the requirements of an underlying classification task. We were able to significantly reduce the number of training epochs without a performance reduction in the classification task. An optimized training process not only saves resources, but also reduces the time for training generative models.
zh
[AI-88] RD-Agent: Automating Data-Driven AI Solution Building Through LLM-Powered Automated Research, Development and Evolution
【速读】:该论文试图解决数据科学中由于AI和机器学习(Machine Learning, ML)模型复杂性增加及专业技能要求提高所带来的进展受限问题,以及高阶数据科学任务在众包平台上的劳动密集性和迭代性问题。其解决方案的关键在于提出RD-Agent,这是一个双智能体框架,包含Researcher代理和Developer代理,分别通过性能反馈生成研究思路和基于错误反馈优化代码,从而实现多并行探索路径的融合与增强,缩小自动化解决方案与专家级性能之间的差距。
链接: https://arxiv.org/abs/2505.14738
作者: Xu Yang,Xiao Yang,Shikai Fang,Bowen Xian,Yuante Li,Jian Wang,Minrui Xu,Haoran Pan,Xinpeng Hong,Weiqing Liu,Yelong Shen,Weizhu Chen,Jiang Bian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure, 1 table
点击查看摘要
Abstract:Recent advances in AI and ML have transformed data science, yet increasing complexity and expertise requirements continue to hinder progress. While crowdsourcing platforms alleviate some challenges, high-level data science tasks remain labor-intensive and iterative. To overcome these limitations, we introduce RD-Agent, a dual-agent framework for iterative exploration. The Researcher agent uses performance feedback to generate ideas, while the Developer agent refines code based on error feedback. By enabling multiple parallel exploration traces that merge and enhance one another, RD-Agent narrows the gap between automated solutions and expert-level performance. Evaluated on MLE-Bench, RD-Agent emerges as the top-performing machine learning engineering agent, demonstrating its potential to accelerate innovation and improve precision across diverse data science applications. We have open-sourced RD-Agent on GitHub: this https URL.
zh
[AI-89] Leveraging Multivariate Long-Term History Representation for Time Series Forecasting
【速读】:该论文旨在解决多变量时间序列(Multivariate Time Series, MTS)预测中长期空间-时间相似性和相关性被忽视的问题,这限制了现有时空图神经网络(Spatial-Temporal Graph Neural Network, STGNN)在准确预测中的性能。其解决方案的关键在于提出一种名为Long-term Multivariate History Representation (LMHR)的增强框架,该框架包含三个核心组件:用于编码长期历史并减少点级噪声的Long-term History Encoder (LHEncoder),用于在不增加额外训练成本的情况下引入空间信息并提取最有价值表示的非参数Hierarchical Representation Retriever (HRetriever),以及基于Transformer的Aggregator (TAggregator),用于高效地选择性融合稀疏检索到的上下文表示。
链接: https://arxiv.org/abs/2505.14737
作者: Huiliang Zhang,Di Wu,Arnaud Zinflou,Stephane Dellacherie,Mouhamadou Makhtar Dione,Benoit Boulet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multivariate Time Series (MTS) forecasting has a wide range of applications in both industry and academia. Recent advances in Spatial-Temporal Graph Neural Network (STGNN) have achieved great progress in modelling spatial-temporal correlations. Limited by computational complexity, most STGNNs for MTS forecasting focus primarily on short-term and local spatial-temporal dependencies. Although some recent methods attempt to incorporate univariate history into modeling, they still overlook crucial long-term spatial-temporal similarities and correlations across MTS, which are essential for accurate forecasting. To fill this gap, we propose a framework called the Long-term Multivariate History Representation (LMHR) Enhanced STGNN for MTS forecasting. Specifically, a Long-term History Encoder (LHEncoder) is adopted to effectively encode the long-term history into segment-level contextual representations and reduce point-level noise. A non-parametric Hierarchical Representation Retriever (HRetriever) is designed to include the spatial information in the long-term spatial-temporal dependency modelling and pick out the most valuable representations with no additional training. A Transformer-based Aggregator (TAggregator) selectively fuses the sparsely retrieved contextual representations based on the ranking positional embedding efficiently. Experimental results demonstrate that LMHR outperforms typical STGNNs by 10.72% on the average prediction horizons and state-of-the-art methods by 4.12% on several real-world datasets. Additionally, it consistently improves prediction accuracy by 9.8% on the top 10% of rapidly changing patterns across the datasets.
zh
[AI-90] The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
【速读】:该论文试图解决大规模语言模型(Large Language Models, LLMs)在扩展过程中面临的收益递减和能耗上升问题。其解决方案的关键在于引入测试时计算(Test-Time Compute, TTC),即在推理阶段分配额外的计算资源,以此作为传统模型扩展策略的有效补充。研究结果表明,TTC能够在保持或提升模型准确性的同时,显著提高能效,尤其在需要复杂推理的任务中表现更为突出。
链接: https://arxiv.org/abs/2505.14733
作者: Yunho Jin,Gu-Yeon Wei,David Brooks
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work introduces test-time compute (TTC), allocating additional computational resources during inference, as a compelling complement to conventional scaling strategies. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models without incurring additional pretraining costs.
zh
[AI-91] Propositional Measure Logic
【速读】:该论文试图解决在不确定性环境下进行合理推理的问题,特别是针对贝叶斯网络中仍然难以处理的复杂问题。其解决方案的关键在于引入一种具有基本概率语义的命题逻辑(propositional logic),其中每个公式被赋予一个介于[0,1]之间的实数值,以表示其真值程度。该语义替代了经典逻辑的二值性,同时保留了其演绎结构,并通过证明完备性定理,确保了该系统在不确定性推理中的有效性。
链接: https://arxiv.org/abs/2505.14693
作者: Francisco Aragão
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 10 pages
点击查看摘要
Abstract:We present a propositional logic with fundamental probabilistic semantics, in which each formula is given a real measure in the interval [0,1] that represents its degree of truth. This semantics replaces the binarity of classical logic, while preserving its deductive structure. We demonstrate the soundness theorem, establishing that the proposed system is sound and suitable for reasoning under uncertainty. We discuss potential applications and avenues for future extensions of the theory. We apply probabilistic logic to a still refractory problem in Bayesian Networks.
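摘要未给出各连接词在 [0,1] 上的具体取值规则;下面以一组常见的模糊逻辑式定义写一个示意求值器(¬x=1-x、∧ 取 min、∨ 取 max、蕴涵取 Łukasiewicz 形式,这些定义是本文假设,未必与论文一致):

```python
def evaluate(formula, mu):
    """formula: 嵌套元组,如 ('and', ('var', 'p'), ('not', ('var', 'q')));
    mu: 原子命题到 [0,1] 测度的映射。连接词取值规则为示意假设。"""
    op = formula[0]
    if op == "var":
        return mu[formula[1]]
    if op == "not":
        return 1.0 - evaluate(formula[1], mu)
    a, b = evaluate(formula[1], mu), evaluate(formula[2], mu)
    if op == "and":
        return min(a, b)
    if op == "or":
        return max(a, b)
    if op == "implies":
        return min(1.0, 1.0 - a + b)  # Łukasiewicz 蕴涵
    raise ValueError(f"unknown connective: {op}")

# 例:mu = {"p": 0.8, "q": 0.3} 时,
# evaluate(("implies", ("var", "p"), ("var", "q")), mu) == 0.5
```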
zh
[AI-92] Follow the STARs: Dynamic ω-Regular Shielding of Learned Policies
【速读】:该论文旨在解决如何在预计算的随机策略上强制执行完整的ω-正则正确性属性(omega-regular correctness properties)的问题,特别是从传统的安全屏蔽(safety-shielding)向同时包含安全性与活性(liveness)的屏蔽过程的转变。其解决方案的关键在于提出一种基于策略模板的自适应运行时屏蔽机制(Strategy-Template-based Adaptive Runtime Shields, STARs),该机制通过允许性的策略模板实现最小干扰的后置屏蔽,并引入动态控制干扰的机制,以在运行时平衡形式化约束与任务特定行为。
链接: https://arxiv.org/abs/2505.14689
作者: Ashwani Anand,Satya Prakash Nayak,Ritam Raha,Anne-Kathrin Schmuck
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
点击查看摘要
Abstract:This paper presents a novel dynamic post-shielding framework that enforces the full class of ω-regular correctness properties over pre-computed probabilistic policies. This constitutes a paradigm shift from the predominant setting of safety-shielding – i.e., ensuring that nothing bad ever happens – to a shielding process that additionally enforces liveness – i.e., ensures that something good eventually happens. At the core, our method uses Strategy-Template-based Adaptive Runtime Shields (STARs), which leverage permissive strategy templates to enable post-shielding with minimal interference. As its main feature, STARs introduce a mechanism to dynamically control interference, allowing a tunable enforcement parameter to balance formal obligations and task-specific behavior at runtime. This makes it possible to trigger more aggressive enforcement when needed, while allowing for optimized policy choices otherwise. In addition, STARs support runtime adaptation to changing specifications or actuator failures, making them especially suited for cyber-physical applications. We evaluate STARs on a mobile robot benchmark to demonstrate their controllable interference when enforcing (incrementally updated) ω-regular correctness properties over learned probabilistic policies.
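STAR 式后置屏蔽可以概括为:用允许性模板给出每个状态的合规动作集,再按可调干预参数在“严格执行模板”与“信任原策略”之间折中。以下为示意(动作与概率的表示均为本文假设):

```python
import random

def shielded_action(policy_probs, allowed, enforcement=0.5):
    """policy_probs: {action: prob},预计算的随机策略;allowed: 模板允许的动作集合;
    enforcement ∈ [0, 1]:0 表示完全信任原策略,1 表示严格只在 allowed 内采样。"""
    masked = {a: p for a, p in policy_probs.items() if a in allowed}
    if not masked:  # 原策略与模板无交集时只能强制干预(示意处理)
        return random.choice(sorted(allowed)) if allowed else None
    if random.random() < enforcement:
        # 高干预分支:在模板允许的动作内按归一化概率采样
        total = sum(masked.values())
        r, acc = random.random() * total, 0.0
        for a, p in masked.items():
            acc += p
            if r <= acc:
                return a
    # 低干预分支:按原策略采样(可能偏离模板)
    actions, probs = zip(*policy_probs.items())
    return random.choices(actions, weights=probs, k=1)[0]
```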
zh
[AI-93] Learning with Differentially Private (Sliced) Wasserstein Gradients
【速读】:该论文试图解决在依赖于数据相关经验测度之间Wasserstein距离的优化目标中实现隐私保护的问题。解决方案的关键在于基于全离散设置下Wasserstein梯度的显式表述,对梯度对单个数据点的敏感性进行控制,从而在最小化效用损失的前提下提供强大的隐私保障。这一理论基础支撑了深度学习方法的开发,该方法结合了梯度和激活值裁剪技术,并验证了隐私会计方法在基于Wasserstein的目标函数中的适用性。
链接: https://arxiv.org/abs/2502.01701
作者: David Rodríguez-Vítores(UVa, IMUVA),Clément Lalanne(IMT, ANITI),Jean-Michel Loubes(IMT, ANITI)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:
点击查看摘要
Abstract:In this work, we introduce a novel framework for privately optimizing objectives that rely on Wasserstein distances between data-dependent empirical measures. Our main theoretical contribution is, based on an explicit formulation of the Wasserstein gradient in a fully discrete setting, a control on the sensitivity of this gradient to individual data points, allowing strong privacy guarantees at minimal utility cost. Building on these insights, we develop a deep learning approach that incorporates gradient and activations clipping, originally designed for DP training of problems with a finite-sum structure. We further demonstrate that privacy accounting methods extend to Wasserstein-based objectives, facilitating large-scale private training. Empirical results confirm that our framework effectively balances accuracy and privacy, offering a theoretically sound solution for privacy-preserving machine learning tasks relying on optimal transport distances such as Wasserstein distance or sliced-Wasserstein distance.
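该框架的隐私保障来自控制单个数据点对 Wasserstein 梯度的敏感度。下面按通用 DP-SGD 做法给出“逐样本裁剪 + 校准高斯噪声”的聚合示意(裁剪阈值与噪声乘子为假设,逐样本梯度如何由最优传输计划得到此处略去):

```python
import torch

def dp_clipped_gradient(per_sample_grads, clip_norm=1.0, noise_mult=1.0):
    """per_sample_grads: [n, d],每个数据点对(Wasserstein)目标的梯度贡献。
    逐样本裁剪控制敏感度,再加校准高斯噪声,得到可用于 DP 训练的聚合梯度。"""
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * torch.clamp(clip_norm / (norms + 1e-8), max=1.0)
    agg = clipped.sum(dim=0)
    noise = torch.randn_like(agg) * noise_mult * clip_norm
    return (agg + noise) / per_sample_grads.shape[0]
```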
zh
[AI-94] Neural Quantum Digital Twins for Optimizing Quantum Annealing
【速读】:该论文旨在解决量子退火器在处理组合优化问题时面临的可扩展性不足和错误率较高的问题。其解决方案的关键在于提出一种基于神经网络的量子数字孪生(Neural Quantum Digital Twin, NQDT)框架,该框架能够重建与量子退火相关的量子多体系统的能量景观,并模拟基态和激发态的动力学过程,从而实现对绝热演化过程的详细仿真。
链接: https://arxiv.org/abs/2505.15662
作者: Jianlong Lu,Hanqiu Peng,Ying Chen
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 20 pages, 11 figures, 2 tables
点击查看摘要
Abstract:Quantum annealers have shown potential in addressing certain combinatorial optimization problems, though their performance is often limited by scalability and error rates. In this work, we propose a Neural Quantum Digital Twin (NQDT) framework that reconstructs the energy landscape of quantum many-body systems relevant to quantum annealing. The digital twin models both ground and excited state dynamics, enabling detailed simulation of the adiabatic evolution process. We benchmark NQDT on systems with known analytical solutions and demonstrate that it accurately captures key quantum phenomena, including quantum criticality and phase transitions. Leveraging this framework, one can identify optimal annealing schedules that minimize excitation-related errors. These findings highlight the utility of neural network-based digital twins as a diagnostic and optimization tool for improving the performance of quantum annealers.
zh
[AI-95] Uncertainty Quantification in SVM prediction
【速读】:该论文旨在解决支持向量机(Support Vector Machine, SVM)预测中的不确定性量化(Uncertainty Quantification, UQ)问题,特别是在回归和预测任务中。现有文献中针对SVM预测的置信区间(Prediction Interval, PI)估计和概率预测方法研究较少,且现有的SVM PI模型均未实现稀疏解。为引入稀疏性,作者提出了一种稀疏支持向量分位数回归(Sparse Support Vector Quantile Regression, SSVQR)模型,通过求解一对线性规划问题来构建PI和概率预测。该解决方案的关键在于结合稀疏性和有效性,同时开发了基于SSVQR的特征选择算法,以在高维数据中提升PI质量,并进一步在共轭回归(Conformal Regression)框架下扩展SVM模型,以获得具有有限测试集保证的稳定预测集。
链接: https://arxiv.org/abs/2505.15429
作者: Pritam Anand
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper explores Uncertainty Quantification (UQ) in SVM predictions, particularly for regression and forecasting tasks. Unlike neural networks, SVM solutions are typically more stable, sparse, optimal, and interpretable. However, only a little literature addresses UQ in SVM prediction. We first provide a comprehensive summary of existing Prediction Interval (PI) estimation and probabilistic forecasting methods developed in the SVM framework and evaluate them against the key properties expected from an ideal PI model. We find that none of the existing SVM PI models achieves a sparse solution. To introduce sparsity into the SVM model, we propose the Sparse Support Vector Quantile Regression (SSVQR) model, which constructs PIs and probabilistic forecasts by solving a pair of linear programs. Further, we develop a feature selection algorithm for PI estimation using SSVQR that effectively eliminates a significant number of features while improving PI quality in the case of high-dimensional datasets. Finally, we extend the SVM models to the conformal regression setting to obtain more stable prediction sets with finite test-set guarantees. Extensive experiments on artificial and real-world benchmark datasets compare the different characteristics of both existing and proposed SVM-based PI estimation methods and also highlight the advantages of feature selection in PI estimation. Furthermore, we compare both the existing and the proposed SVM-based PI estimation models with modern deep learning models for probabilistic forecasting tasks on benchmark datasets, where the SVM models show comparable or superior performance in our experiments.
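分位数回归可直接写成线性规划,上下两个分位数即可构成预测区间。以下用 scipy 给出一个不含稀疏项的极简求解示意(SSVQR 的稀疏化约束此处省略,仅演示 LP 形式):

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, tau):
    """求解 tau 分位数回归的线性规划形式(极简示意,未含论文中的稀疏项)。
    min  tau * sum(u) + (1 - tau) * sum(v)
    s.t. X @ beta + u - v = y,  u, v >= 0,  beta 自由。"""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:d]

# 上下两个分位数组成预测区间(示意用法):
# lo = X_test @ quantile_regression_lp(X, y, 0.05)
# hi = X_test @ quantile_regression_lp(X, y, 0.95)
```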
zh
[AI-96] RD-Agent -Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization
【速读】:该论文旨在解决金融市场的资产收益预测问题,该问题因市场高维性、非平稳性和持续波动性而具有挑战性。当前量化研究流程存在自动化程度低、可解释性弱以及关键组件如因子挖掘与模型创新之间的协调性差等局限。论文提出的解决方案是RD-Agent(Q),这是一个以数据为中心的多智能体框架,其关键是通过协同因子-模型共同优化实现量化策略的全栈自动化研发。该框架将量化过程分解为研究阶段和开发阶段,并通过反馈阶段进行实验结果评估与后续迭代优化,从而在预测准确性与策略稳健性之间取得良好平衡。
链接: https://arxiv.org/abs/2505.15155
作者: Yuante Li,Xu Yang,Xiao Yang,Minrui Xu,Xisen Wang,Weiqing Liu,Jiang Bian
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose RD-Agent for Quantitative Finance, in short RD-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. RD-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, RD-Agent(Q) achieves up to 2X higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor-model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: this https URL.
zh
[AI-97] Space evaluation at the starting point of soccer transitions
【速读】:该论文试图解决在足球比赛中,特别是在球队攻防转换过程中,如何有效评估场地空间的问题。传统空间评估方法如OBSO(Off-Ball Scoring Opportunity)主要依赖于进球概率,因此不适用于评估远离球门的区域,而这些区域通常是转换发生的起点。论文提出的解决方案是OBPV(Off-Ball Positioning Value),其关键在于引入了场地区域价值模型,用于评估整个球场的空间价值,并采用转换核模型,通过传球分布的核密度估计来反映位置特异性,从而更全面地评估空间利用情况。
链接: https://arxiv.org/abs/2505.14711
作者: Yohei Ogawa,Rikuhei Umemoto,Keisuke Fujii
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注: 23 pages, 8 figures
点击查看摘要
Abstract:Soccer is a sport played on a pitch where effective use of space is crucial. Decision-making during transitions, when possession switches between teams, has been increasingly important, but research on space evaluation in these moments has been limited. Recent space evaluation methods such as OBSO (Off-Ball Scoring Opportunity) use scoring probability, so it is not well-suited for assessing areas far from the goal, where transitions typically occur. In this paper, we propose OBPV (Off-Ball Positioning Value) to evaluate space across the pitch, including the starting points of transitions. OBPV extends OBSO by introducing the field value model, which evaluates the entire pitch, and by employing the transition kernel model, which reflects positional specificity through kernel density estimation of pass distributions. Experiments using La Liga 2023/24 season tracking and event data show that OBPV highlights effective space utilization during counter-attacks and reveals team-specific characteristics in how the teams utilize space after positive and negative transitions.
zh
机器学习
[LG-0] Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex
链接: https://arxiv.org/abs/2505.15813
作者: Muquan Yu,Mu Nan,Hossein Adeli,Jacob S. Prince,John A. Pyles,Leila Wehbe,Margaret M. Henderson,Michael J. Tarr,Andrew F. Luo
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:Understanding functional representations within higher visual cortex is a fundamental question in computational neuroscience. While artificial neural networks pretrained on large-scale datasets exhibit striking representational alignment with human neural responses, learning image-computable models of visual cortex relies on individual-level, large-scale fMRI datasets. The necessity for expensive, time-intensive, and often impractical data acquisition limits the generalizability of encoders to new subjects and stimuli. We introduce BraInCoRL, which uses in-context learning to predict voxelwise neural responses from few-shot examples without any additional finetuning for novel subjects and stimuli. We leverage a transformer architecture that can flexibly condition on a variable number of in-context image stimuli, learning an inductive bias over multiple subjects. During training, we explicitly optimize the model for in-context learning. By jointly conditioning on image features and voxel activations, our model learns to directly generate better performing voxelwise models of higher visual cortex. We demonstrate that BraInCoRL consistently outperforms existing voxelwise encoder designs in a low-data regime when evaluated on entirely novel images, while also exhibiting strong test-time scaling behavior. The model also generalizes to an entirely new visual fMRI dataset, which uses different subjects and fMRI data acquisition parameters. Further, BraInCoRL facilitates better interpretability of neural signals in higher visual cortex by attending to semantically relevant stimuli. Finally, we show that our framework enables interpretable mappings from natural language queries to voxel selectivity.
[LG-1] On the creation of narrow AI: hierarchy and nonlocality of neural network skills
链接: https://arxiv.org/abs/2505.15811
作者: Eric J. Michaud,Asher Parker-Sartori,Max Tegmark
类目: Machine Learning (cs.LG)
*备注: 19 pages, 13 figures
点击查看摘要
Abstract:We study the problem of creating strong, yet narrow, AI systems. While recent AI progress has been driven by the training of large general-purpose foundation models, the creation of smaller models specialized for narrow domains could be valuable for both efficiency and safety. In this work, we explore two challenges involved in creating such systems, having to do with basic properties of how neural networks learn and structure their representations. The first challenge regards when it is possible to train narrow models from scratch. Through experiments on a synthetic task, we find that it is sometimes necessary to train networks on a wide distribution of data to learn certain narrow skills within that distribution. This effect arises when skills depend on each other hierarchically, and training on a broad distribution introduces a curriculum which substantially accelerates learning. The second challenge regards how to transfer particular skills from large general models into small specialized models. We find that model skills are often not perfectly localized to a particular set of prunable components. However, we find that methods based on pruning can still outperform distillation. We investigate the use of a regularization objective to align desired skills with prunable components while unlearning unnecessary skills.
[LG-2] Adaptive Estimation and Learning under Temporal Distribution Shift ICML2025
链接: https://arxiv.org/abs/2505.15803
作者: Dheeraj Baby,Yifei Tang,Hieu Duy Nguyen,Yu-Xiang Wang,Rohit Pyati
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2025
点击查看摘要
Abstract:In this paper, we study the problem of estimation and learning under temporal distribution shift. Consider an observation sequence of length n , which is a noisy realization of a time-varying groundtruth sequence. Our focus is to develop methods to estimate the groundtruth at the final time-step while providing sharp point-wise estimation error rates. We show that, without prior knowledge on the level of temporal shift, a wavelet soft-thresholding estimator provides an optimal estimation error bound for the groundtruth. Our proposed estimation method generalizes existing research (Mazzetto and Upfal, 2023) by establishing a connection between the sequence’s non-stationarity level and the sparsity in the wavelet-transformed domain. Our theoretical findings are validated by numerical experiments. Additionally, we applied the estimator to derive sparsity-aware excess risk bounds for binary classification under distribution shift and to develop computationally efficient training objectives. As a final contribution, we draw parallels between our results and the classical signal processing problem of total-variation denoising (Mammen and van de Geer, 1997; Tibshirani, 2014), uncovering novel optimal algorithms for such task.
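The paper's exact estimator and threshold choice are not given in the abstract; a minimal sketch of wavelet soft-thresholding with the classical universal threshold, using the PyWavelets package on a synthetic noisy sequence, might look like this:

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
n = 512
t = np.linspace(0, 1, n)
truth = np.where(t < 0.5, np.sin(6 * np.pi * t), 0.3)  # time-varying groundtruth with a shift
y = truth + 0.2 * rng.standard_normal(n)               # noisy observation sequence

coeffs = pywt.wavedec(y, "db4", mode="periodization")

# Universal threshold sigma * sqrt(2 log n), with sigma estimated from the
# median absolute deviation of the finest-scale detail coefficients.
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
lam = sigma * np.sqrt(2 * np.log(n))

denoised = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
estimate = pywt.waverec(denoised, "db4", mode="periodization")
print("final-step estimate:", estimate[-1], "truth:", truth[-1])
```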
[LG-3] A Deep Learning Framework for Two-Dimensional Multi-Frequency Propagation Factor Estimation
链接: https://arxiv.org/abs/2505.15802
作者: Sarah E. Wessinger,Leslie N. Smith,Jacob Gull,Jonathan Gehman,Zachary Beever,Andrew J. Kammerer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Submitted for publication
点击查看摘要
Abstract:Accurately estimating the refractive environment over multiple frequencies within the marine atmospheric boundary layer is crucial for the effective deployment of radar technologies. Traditional parabolic equation simulations, while effective, can be computationally expensive and time-intensive, limiting their practical application. This communication explores a novel approach using deep neural networks to estimate the pattern propagation factor, a critical parameter for characterizing environmental impacts on signal propagation. Image-to-image translation generators designed to ingest modified refractivity data and generate predictions of pattern propagation factors over the same domain were developed. Findings demonstrate that deep neural networks can be trained to analyze multiple frequencies and reasonably predict the pattern propagation factor, offering an alternative to traditional methods.
[LG-4] Model Merging is Secretly Certifiable: Non-Vacuous Generalisation Bounds for Low-Shot Learning
链接: https://arxiv.org/abs/2505.15798
作者: Taehoon Kim,Henry Gouk,Minyoung Kim,Timothy Hospedales
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Certifying the IID generalisation ability of deep networks is the first of many requirements for trusting AI in high-stakes applications from medicine to security. However, when instantiating generalisation bounds for deep networks it remains challenging to obtain non-vacuous guarantees, especially when applying contemporary large models on the small-scale data prevalent in such high-stakes fields. In this paper, we draw a novel connection between a family of learning methods based on model fusion and generalisation certificates, and surprisingly show that with minor adjustment several existing learning strategies already provide non-trivial generalisation guarantees. Essentially, by focusing on data-driven learning of downstream tasks by fusion rather than fine-tuning, the certified generalisation gap becomes tiny and independent of the base network size, facilitating its certification. Our results show for the first time non-trivial generalisation guarantees for learning with as few as 100 examples, while using vision models such as ViT-B and language models such as Mistral-7B. This observation is significant, as it has immediate implications for facilitating the certification of existing systems as trustworthy, and opens up new directions for research at the intersection of practice and theory.
[LG-5] HCRMP: An LLM-Hinted Contextual Reinforcement Learning Framework for Autonomous Driving
链接: https://arxiv.org/abs/2505.15793
作者: Zhiwen Chen,Bo Leng,Zhuoren Li,Hanming Deng,Guizhe Jin,Ran Yu,Huanxi Wen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Integrating Large Language Models (LLMs) with Reinforcement Learning (RL) can enhance autonomous driving (AD) performance in complex scenarios. However, current LLM-dominated RL methods over-rely on LLM outputs, which are prone to hallucinations: evaluations show that a state-of-the-art LLM achieves a non-hallucination rate of only approximately 57.95% when assessed on essential driving-related tasks. Thus, in these methods, hallucinations from the LLM can directly jeopardize the performance of driving policies. This paper argues that maintaining relative independence between the LLM and the RL is vital for solving the hallucination problem. Consequently, this paper proposes a novel LLM-Hinted RL paradigm. The LLM is used to generate semantic hints for state augmentation and policy optimization to assist the RL agent in motion planning, while the RL agent counteracts potentially erroneous semantic indications through policy learning to achieve excellent driving performance. Based on this paradigm, we propose the HCRMP (LLM-Hinted Contextual Reinforcement Learning Motion Planner) architecture, whose design includes an Augmented Semantic Representation Module to extend the state space, a Contextual Stability Anchor Module that enhances the reliability of multi-critic weight hints by utilizing information from the knowledge base, and a Semantic Cache Module that seamlessly integrates low-frequency LLM guidance with high-frequency RL control. Extensive experiments in CARLA validate HCRMP’s strong overall driving performance. HCRMP achieves a task success rate of up to 80.3% under diverse driving conditions with different traffic densities. Under safety-critical driving conditions, HCRMP significantly reduces the collision rate by 11.4%, which effectively improves the driving performance in complex scenarios.
[LG-6] Fair Supervised Learning Through Constraints on Smooth Nonconvex Unfairness-Measure Surrogates
链接: https://arxiv.org/abs/2505.15788
作者: Zahra Khatti,Daniel P. Robinson,Frank E. Curtis
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:A new strategy for fair supervised machine learning is proposed. The main advantages of the proposed strategy as compared to others in the literature are as follows. (a) We introduce a new smooth nonconvex surrogate to approximate the Heaviside functions involved in discontinuous unfairness measures. The surrogate is based on smoothing methods from the optimization literature, and is new for the fair supervised learning literature. The surrogate is a tight approximation which ensures the trained prediction models are fair, as opposed to other (e.g., convex) surrogates that can fail to lead to a fair prediction model in practice. (b) Rather than rely on regularizers (that lead to optimization problems that are difficult to solve) and corresponding regularization parameters (that can be expensive to tune), we propose a strategy that employs hard constraints so that specific tolerances for unfairness can be enforced without the complications associated with the use of regularization. (c) Our proposed strategy readily allows for constraints on multiple (potentially conflicting) unfairness measures at the same time. Multiple measures can be considered with a regularization approach, but at the cost of having even more difficult optimization problems to solve and further expense for tuning. By contrast, through hard constraints, our strategy leads to optimization models that can be solved tractably with minimal tuning.
[LG-7] Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning
链接: https://arxiv.org/abs/2505.15782
作者: Pedro P. Santos,Alberto Sardinha,Francisco S. Melo
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent’s performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
[LG-8] Projection-Based Correction for Enhancing Deep Inverse Networks
链接: https://arxiv.org/abs/2505.15777
作者: Jorge Bacca
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
点击查看摘要
Abstract:Deep learning-based models have demonstrated remarkable success in solving ill-posed inverse problems; however, many fail to strictly adhere to the physical constraints imposed by the measurement process. In this work, we introduce a projection-based correction method to enhance the inference of deep inverse networks by ensuring consistency with the forward model. Specifically, given an initial estimate from a learned reconstruction network, we apply a projection step that constrains the solution to lie within the valid solution space of the inverse problem. We theoretically demonstrate that if the recovery model is a well-trained deep inverse network, the solution can be decomposed into range-space and null-space components, where the projection-based correction reduces to an identity transformation. Extensive simulations and experiments validate the proposed method, demonstrating improved reconstruction accuracy across diverse inverse problems and deep network architectures.
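For a linear forward model, the described projection step amounts to moving the network output onto the affine solution space of the measurements; a small numpy sketch (with a random toy operator standing in for a real forward model, and noiseless measurements for simplicity) is:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 100                      # underdetermined (ill-posed) linear inverse problem
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true                      # noiseless measurements for illustration

x_net = x_true + 0.3 * rng.standard_normal(n)  # stand-in for a network reconstruction

# Projection onto the affine solution space {x : A x = y}:
# x* = x_net + A^+ (y - A x_net). The update only alters the range-space
# component; the null-space component produced by the network is untouched.
A_pinv = np.linalg.pinv(A)
x_proj = x_net + A_pinv @ (y - A @ x_net)

print("residual before:", np.linalg.norm(A @ x_net - y))
print("residual after :", np.linalg.norm(A @ x_proj - y))  # ~0 up to numerics
```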
[LG-9] Privacy-Preserving Conformal Prediction Under Local Differential Privacy
链接: https://arxiv.org/abs/2505.15721
作者: Coby Penso,Bar Mahpud,Jacob Goldberger,Or Sheffet
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint. Under review
点击查看摘要
Abstract:Conformal prediction (CP) provides sets of candidate classes with a guaranteed probability of containing the true class. However, it typically relies on a calibration set with clean labels. We address privacy-sensitive scenarios where the aggregator is untrusted and can only access a perturbed version of the true labels. We propose two complementary approaches under local differential privacy (LDP). In the first approach, users do not access the model but instead provide their input features and a perturbed label using a k-ary randomized response. In the second approach, which enforces stricter privacy constraints, users add noise to their conformity score by binary search response. This method requires access to the classification model but preserves both data and label privacy. Both approaches compute the conformal threshold directly from noisy data without accessing the true labels. We prove finite-sample coverage guarantees and demonstrate robust coverage even under severe randomization. This approach unifies strong local privacy with predictive uncertainty control, making it well-suited for sensitive applications such as medical imaging or large language model queries, regardless of whether users can (or are willing to) compute their own scores.
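The first approach relies on the standard k-ary randomized response mechanism; a short sketch of this epsilon-LDP label perturbation (parameter names and values are illustrative) is:

```python
import numpy as np

def k_rr(label: int, k: int, epsilon: float, rng) -> int:
    """k-ary randomized response: epsilon-LDP perturbation of a class label."""
    p_true = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    if rng.random() < p_true:
        return label
    other = [c for c in range(k) if c != label]  # flip uniformly to another class
    return int(rng.choice(other))

rng = np.random.default_rng(0)
k, eps = 10, 2.0
labels = rng.integers(0, k, size=10_000)
reports = np.array([k_rr(int(y), k, eps, rng) for y in labels])
print("fraction kept:", (reports == labels).mean())  # close to e^eps / (e^eps + k - 1)
```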
[LG-10] A packing lemma for VCN_k-dimension and learning high-dimensional data
链接: https://arxiv.org/abs/2505.15688
作者: Leonardo N. Coregliano,Maryanthe Malliaris
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 29 pages, 1 figure
点击查看摘要
Abstract:Recently, the authors introduced the theory of high-arity PAC learning, which is well-suited for learning graphs, hypergraphs and relational structures. In the same initial work, the authors proved a high-arity analogue of the Fundamental Theorem of Statistical Learning that almost completely characterizes all notions of high-arity PAC learning in terms of a combinatorial dimension, called the Vapnik–Chervonenkis–Natarajan (VCN_k) k-dimension, leaving as an open problem only the characterization of non-partite, non-agnostic high-arity PAC learnability. In this work, we complete this characterization by proving that non-partite non-agnostic high-arity PAC learnability implies a high-arity version of the Haussler packing property, which in turn implies finiteness of VCN_k-dimension. This is done by obtaining direct proofs that classic PAC learnability implies the classic Haussler packing property, which in turn implies finite Natarajan dimension, and noticing that these direct proofs nicely lift to high-arity.
[LG-11] Graph Conditional Flow Matching for Relational Data Generation
链接: https://arxiv.org/abs/2505.15668
作者: Davide Scassola,Sebastiano Saccani,Luca Bortolussi
类目: Machine Learning (cs.LG)
*备注: 9 pages of main content, submitted to a conference
点击查看摘要
Abstract:Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by the foreign-key relationships. We do this by learning a deep generative model of the content of the whole relational database by flow matching, where the neural network trained to denoise records leverages a graph neural network to obtain information from connected records. Our method is flexible, as it can support relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity.
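The graph-conditioned denoiser is beyond a short sketch, but the underlying flow-matching objective can be illustrated on flat records; in the sketch below a plain MLP stands in for the paper's graph neural network (an assumption, not the authors' architecture):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8
velocity = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(velocity.parameters(), lr=1e-3)

data = torch.randn(1024, dim)  # stand-in for (encoded) records of one table

for step in range(200):
    x1 = data[torch.randint(0, len(data), (128,))]
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(len(x1), 1)
    xt = (1 - t) * x0 + t * x1                # linear probability path
    target = x1 - x0                          # conditional velocity field
    pred = velocity(torch.cat([xt, t], dim=-1))
    loss = ((pred - target) ** 2).mean()      # flow-matching regression loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```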
[LG-12] Deep greedy unfolding: Sorting out argsorting in greedy sparse recovery algorithms
链接: https://arxiv.org/abs/2505.15661
作者: Sina Mohammad-Taheri,Matthew J. Colbrook,Simone Brugiapaglia
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Gradient-based learning requires (deep) neural networks to be differentiable at all steps. This includes model-based architectures constructed by unrolling iterations of an iterative algorithm onto layers of a neural network, known as algorithm unrolling. However, greedy sparse recovery algorithms depend on the non-differentiable argsort operator, which hinders their integration into neural networks. In this paper, we address this challenge in Orthogonal Matching Pursuit (OMP) and Iterative Hard Thresholding (IHT), two popular representative algorithms in this class. We propose permutation-based variants of these algorithms and approximate permutation matrices using “soft” permutation matrices derived from softsort, a continuous relaxation of argsort. We demonstrate, both theoretically and numerically, that Soft-OMP and Soft-IHT, as differentiable counterparts of OMP and IHT that are fully compatible with neural network training, effectively approximate these algorithms with a controllable degree of accuracy. This leads to the development of OMP- and IHT-Net, fully trainable network architectures based on Soft-OMP and Soft-IHT, respectively. Finally, by choosing weights as “structure-aware” trainable parameters, we connect our approach to structured sparse recovery and demonstrate its ability to extract latent sparsity patterns from data.
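Following the softsort operator referenced in the abstract (due to Prillo and Eisenschlos, 2020), a "soft" permutation matrix can be built as a row-wise softmax over negative distances to the sorted values; a numpy sketch:

```python
import numpy as np

def softsort(s: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """SoftSort relaxation: a row-stochastic 'soft' permutation matrix.

    Row i concentrates on the index of the i-th largest entry of s as tau -> 0.
    """
    s_sorted = np.sort(s)[::-1]                          # descending order
    logits = -np.abs(s_sorted[:, None] - s[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerically stable softmax
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

s = np.array([0.1, 2.0, -0.5, 1.2])
P = softsort(s, tau=0.1)
print(np.round(P, 2))   # near one-hot rows for small tau
print(P @ s)            # approximately the values of s sorted in descending order
```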
[LG-13] FLARE: Robot Learning with Implicit World Modeling
链接: https://arxiv.org/abs/2505.15659
作者: Ruijie Zheng,Jing Wang,Scott Reed,Johan Bjorck,Yu Fang,Fengyuan Hu,Joel Jang,Kaushil Kundalia,Zongyu Lin,Loic Magne,Avnish Narayan,You Liang Tan,Guanzhi Wang,Qi Wang,Jiannan Xiang,Yinzhen Xu,Seonghyeon Ye,Jan Kautz,Furong Huang,Yuke Zhu,Linxi Fan
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project Webpage / Blogpost: this https URL
点击查看摘要
Abstract:We introduce Future LAtent REpresentation Alignment (FLARE), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, FLARE enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, FLARE requires only minimal architectural modifications, adding a few tokens to standard vision-language-action (VLA) models, yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, FLARE achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, FLARE unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to a novel object with unseen geometry with as few as a single robot demonstration. Our results establish FLARE as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
[LG-14] Learning Small Decision Trees with Few Outliers: A Parameterized Perspective
链接: https://arxiv.org/abs/2505.15648
作者: Harmender Gahlawat,Meirav Zehavi
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
点击查看摘要
Abstract:Decision trees are a fundamental tool in machine learning for representing, classifying, and generalizing data. It is desirable to construct “small” decision trees, by minimizing either the size (s) or the depth (d) of the decision tree (DT). Recently, the parameterized complexity of Decision Tree Learning has attracted a lot of attention. We consider a generalization of Decision Tree Learning where, given a classification instance E and an integer t, the task is to find a “small” DT that disagrees with E in at most t examples. We consider two problems: DTSO and DTDO, where the goal is to construct a DT minimizing s and d, respectively. We first establish that both DTSO and DTDO are W[1]-hard when parameterized by s + \delta_max and d + \delta_max, respectively, where \delta_max is the maximum number of features in which two differently labeled examples can differ. We complement this result by showing that these problems become FPT if we include the parameter t. We also consider the kernelization complexity of these problems and establish several positive and negative results for both DTSO and DTDO.
[LG-15] Optimal Best-Arm Identification under Fixed Confidence with Multiple Optima
链接: https://arxiv.org/abs/2505.15643
作者: Lan V. Truong
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 22 pages
点击查看摘要
Abstract:We study the problem of best-arm identification in stochastic multi-armed bandits under the fixed-confidence setting, with a particular focus on instances that admit multiple optimal arms. While the Track-and-Stop algorithm of Garivier and Kaufmann (2016) is widely conjectured to be instance-optimal, its performance in the presence of multiple optima has remained insufficiently understood. In this work, we revisit the Track-and-Stop strategy and propose a modified stopping rule that ensures instance-optimality even when the set of optimal arms is not a singleton. Our analysis introduces a new information-theoretic lower bound that explicitly accounts for multiple optimal arms, and we demonstrate that our stopping rule tightly matches this bound.
[LG-16] A Simple Approximation Algorithm for Optimal Decision Tree
链接: https://arxiv.org/abs/2505.15641
作者: Zhengjia Zhuo,Viswanath Nagarajan
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Optimal decision tree (ODT) is a fundamental problem arising in applications such as active learning, entity identification, and medical diagnosis. An instance of ODT is given by m hypotheses, out of which an unknown “true” hypothesis is drawn according to some probability distribution. An algorithm needs to identify the true hypothesis by making queries: each query incurs a cost and has a known response for each hypothesis. The goal is to minimize the expected query cost to identify the true hypothesis. We consider the most general setting with arbitrary costs, probabilities and responses. ODT is NP-hard to approximate better than \ln m and there are O(\ln m) approximation algorithms known for it. However, these algorithms and/or their analyses are quite complex. Moreover, the leading constant factors are large. We provide a simple algorithm and analysis for ODT, proving an approximation ratio of 8 \ln m.
[LG-17] Bayesian Ensembling: Insights from Online Optimization and Empirical Bayes
链接: https://arxiv.org/abs/2505.15638
作者: Daniel Waxman,Fernando Llorente,Petar M. Djurić
类目: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 25 pages, 12 figures
点击查看摘要
Abstract:We revisit the classical problem of Bayesian ensembles and address the challenge of learning optimal combinations of Bayesian models in an online, continual learning setting. To this end, we reinterpret existing approaches such as Bayesian model averaging (BMA) and Bayesian stacking through a novel empirical Bayes lens, shedding new light on the limitations and pathologies of BMA. Further motivated by insights from online optimization, we propose Online Bayesian Stacking (OBS), a method that optimizes the log-score over predictive distributions to adaptively combine Bayesian models. A key contribution of our work is establishing a novel connection between OBS and portfolio selection, bridging Bayesian ensemble learning with a rich, well-studied theoretical framework that offers efficient algorithms and extensive regret analysis. We further clarify the relationship between OBS and online BMA, showing that they optimize related but distinct cost functions. Through theoretical analysis and empirical evaluation, we identify scenarios where OBS outperforms online BMA and provide principled guidance on when practitioners should prefer one approach over the other.
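The portfolio-selection connection suggests a simple online update; as a sketch, assuming an exponentiated-gradient step on the log score of the mixture (the paper's actual OBS algorithm may differ), the weight update could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, eta = 3, 500, 0.1
w = np.full(K, 1.0 / K)  # stacking weights on the simplex

for t in range(T):
    # p[i]: predictive density that Bayesian model i assigns to the observed y_t.
    # Synthetic stand-in: model 1 is systematically better calibrated.
    p = rng.uniform(0.1, 1.0, size=K)
    p[1] = rng.uniform(0.5, 1.0)
    mix = w @ p
    # Exponentiated-gradient step on the log score (a Cover-style portfolio update).
    w = w * np.exp(eta * p / mix)
    w /= w.sum()

print(np.round(w, 3))  # mass concentrates on the better-scoring model
```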
[LG-18] Distance Adaptive Beam Search for Provably Accurate Graph-Based Nearest Neighbor Search
链接: https://arxiv.org/abs/2505.15636
作者: Yousef Al-Jazzazi,Haya Diwan,Jinrui Gou,Cameron Musco,Christopher Musco,Torsten Suel
类目: Information Retrieval (cs.IR); Databases (cs.DB); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Nearest neighbor search is central in machine learning, information retrieval, and databases. For high-dimensional datasets, graph-based methods such as HNSW, DiskANN, and NSG have become popular thanks to their empirical accuracy and efficiency. These methods construct a directed graph over the dataset and perform beam search on the graph to find nodes close to a given query. While significant work has focused on practical refinements and theoretical understanding of graph-based methods, many questions remain. We propose a new distance-based termination condition for beam search to replace the commonly used condition based on beam width. We prove that, as long as the search graph is navigable, our resulting Adaptive Beam Search method is guaranteed to approximately solve the nearest-neighbor problem, establishing a connection between navigability and the performance of graph-based search. We also provide extensive experiments on our new termination condition for both navigable graphs and approximately navigable graphs used in practice, such as HNSW and Vamana graphs. We find that Adaptive Beam Search outperforms standard beam search over a range of recall values, data sets, graph constructions, and target number of nearest neighbors. It thus provides a simple and practical way to improve the performance of popular methods.
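A minimal sketch of greedy graph search with a distance-based stop, on a toy k-nearest-neighbor graph (the eps rule and graph construction here are illustrative assumptions, not the paper's exact procedure):

```python
import heapq
import numpy as np

def adaptive_beam_search(graph, points, query, start, eps=0.1):
    """Greedy graph search terminated by distance, not by a fixed beam width.

    Stops when the nearest unexpanded candidate is farther than (1 + eps)
    times the best distance found so far.
    """
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    visited = {start}
    frontier = [(dist(start), start)]  # min-heap of candidates
    best_d, best = frontier[0]
    while frontier:
        d, v = heapq.heappop(frontier)
        if d > (1 + eps) * best_d:     # distance-based termination condition
            break
        if d < best_d:
            best_d, best = d, v
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                heapq.heappush(frontier, (dist(u), u))
    return best, best_d

rng = np.random.default_rng(0)
pts = rng.standard_normal((200, 16))
# Toy "navigable" graph: connect each point to its 8 nearest neighbors.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
graph = {i: list(np.argsort(d2[i])[1:9]) for i in range(len(pts))}
q = rng.standard_normal(16)
print(adaptive_beam_search(graph, pts, q, start=0))
```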
[LG-19] Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks
链接: https://arxiv.org/abs/2505.15631
作者: Nick Kocher,Christian Wassermann,Leona Hennig,Jonas Seng,Holger Hoos,Kristian Kersting,Marius Lindauer,Matthias Müller
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Neural Architecture Search (NAS) accelerates progress in deep learning through systematic refinement of model architectures. The downside is increasingly large energy consumption during the search process. Surrogate-based benchmarking mitigates the cost of full training by querying a pre-trained surrogate to obtain an estimate for the quality of the model. Specifically, energy-aware benchmarking aims to make it possible for NAS to favourably trade off model energy consumption against accuracy. Towards this end, we propose three design principles for such energy-aware benchmarks: (i) reliable power measurements, (ii) a wide range of GPU usage, and (iii) holistic cost reporting. We analyse EA-HAS-Bench based on these principles and find that the choice of GPU measurement API has a large impact on the quality of results. Using the Nvidia System Management Interface (SMI) on top of its underlying library influences the sampling rate during the initial data collection, returning faulty low-power estimations. This results in poor correlation with accurate measurements obtained from an external power meter. With this study, we bring to attention several key considerations when performing energy-aware surrogate-based benchmarking and derive first guidelines that can help design novel benchmarks. We show a narrow usage range of the four GPUs attached to our device, ranging from 146 W to 305 W in a single-GPU setting, and narrowing down even further when using all four GPUs. To improve holistic energy reporting, we propose calibration experiments over assumptions made in popular tools, such as Code Carbon, thus achieving reductions in the maximum inaccuracy from 10.3 % to 8.9 % without and to 6.6 % with prior estimation of the expected load on the device.
[LG-20] Aligning Explanations with Human Communication
链接: https://arxiv.org/abs/2505.15626
作者: Jacopo Teneggi,Zhenzhen Wang,Paul H. Yi,Tianmin Shu,Jeremias Sulam
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Machine learning explainability aims to make the decision-making process of black-box models more transparent by finding the most important input features for a given prediction task. Recent works have proposed composing explanations from semantic concepts (e.g., colors, patterns, shapes) that are inherently interpretable to the user of a model. However, these methods generally ignore the communicative context of explanation: the ability of the user to understand the prediction of the model from the explanation. For example, while a medical doctor might understand an explanation in terms of clinical markers, a patient may need a more accessible explanation to make sense of the same diagnosis. In this paper, we address this gap with listener-adaptive explanations. We propose an iterative procedure grounded in principles of pragmatic reasoning and the rational speech act to generate explanations that maximize communicative utility. Our procedure only needs access to pairwise preferences between candidate explanations, relevant in real-world scenarios where a listener model may not be available. We evaluate our method in image classification tasks, demonstrating improved alignment between explanations and listener preferences across three datasets. Furthermore, we perform a user study that demonstrates our explanations increase communicative utility.
[LG-21] Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI IJCNN
链接: https://arxiv.org/abs/2505.15622
作者: Pietro Bartoli,Christian Veronesi,Andrea Giudici,David Siorpaes,Diana Trojaniello,Franco Zappa
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures The article is already accepted for International Joint Conference on Neural Networks (IJCNN) 2025
点击查看摘要
Abstract:The rise of IoT has increased the need for on-edge machine learning, with TinyML emerging as a promising solution for resource-constrained devices such as MCUs. However, evaluating their performance remains challenging due to diverse architectures and application scenarios. Current solutions have many non-negligible limitations. This work introduces an alternative benchmarking methodology that integrates energy and latency measurements while distinguishing three execution phases: pre-inference, inference, and post-inference. Additionally, the setup ensures that the device operates without being powered by an external measurement unit, while automated testing can be leveraged to enhance statistical significance. To evaluate our setup, we tested the STM32N6 MCU, which includes an NPU for executing neural networks. Two configurations were considered: high-performance and low-power. The variation of the energy-delay product (EDP) was analyzed separately for each phase, providing insights into the impact of hardware configurations on energy efficiency. Each model was tested 1000 times to ensure statistically relevant results. Our findings demonstrate that reducing the core voltage and clock frequency improves the efficiency of pre- and post-processing without significantly affecting network execution performance. This approach can also be used for cross-platform comparisons to determine the most efficient inference platform and to quantify how pre- and post-processing overhead varies across different hardware implementations.
[LG-22] Deep Learning for Continuous-time Stochastic Control with Jumps
链接: https://arxiv.org/abs/2505.15602
作者: Patrick Cheridito,Jean-Loup Dupret,Donatien Hainaut
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Portfolio Management (q-fin.PM)
*备注:
点击查看摘要
Abstract:In this paper, we introduce a model-based deep-learning approach to solve finite-horizon continuous-time stochastic control problems with jumps. We iteratively train two neural networks: one to represent the optimal policy and the other to approximate the value function. Leveraging a continuous-time version of the dynamic programming principle, we derive two different training objectives based on the Hamilton-Jacobi-Bellman equation, ensuring that the networks capture the underlying stochastic dynamics. Empirical evaluations on different problems illustrate the accuracy and scalability of our approach, demonstrating its effectiveness in solving complex, high-dimensional stochastic control tasks.
[LG-23] Federated Learning with Unlabeled Clients: Personalization Can Happen in Low Dimensions
链接: https://arxiv.org/abs/2505.15579
作者: Hossein Zakerinia,Jonathan Scott,Christoph H. Lampert
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Personalized federated learning has emerged as a popular approach to training on devices holding statistically heterogeneous data, known as clients. However, most existing approaches require a client to have labeled data for training or finetuning in order to obtain their own personalized model. In this paper we address this by proposing FLowDUP, a novel method that is able to generate a personalized model using only a forward pass with unlabeled data. The generated model parameters reside in a low-dimensional subspace, enabling efficient communication and computation. FLowDUP’s learning objective is theoretically motivated by our new transductive multi-task PAC-Bayesian generalization bound, that provides performance guarantees for unlabeled clients. The objective is structured in such a way that it allows both clients with labeled data and clients with only unlabeled data to contribute to the training process. To supplement our theoretical results we carry out a thorough experimental evaluation of FLowDUP, demonstrating strong empirical performance on a range of datasets with differing sorts of statistically heterogeneous clients. Through numerous ablation studies, we test the efficacy of the individual components of the method.
[LG-24] Refining Neural Activation Patterns for Layer-Level Concept Discovery in Neural Network-Based Receivers
链接: https://arxiv.org/abs/2505.15570
作者: Marko Tuononen,Duy Vu,Dani Korpi,Vesa Starck,Ville Hautamäki
类目: Machine Learning (cs.LG)
*备注: 46 pages, 40 figures, 28 tables, 10 equations, and 5 listings
点击查看摘要
Abstract:Concept discovery in neural networks often targets individual neurons or human-interpretable features, overlooking distributed layer-wide patterns. We study the Neural Activation Pattern (NAP) methodology, which clusters full-layer activation distributions to identify such layer-level concepts. Applied to visual object recognition and radio receiver models, we propose improved normalization, distribution estimation, distance metrics, and varied cluster selection. In the radio receiver model, distinct concepts did not emerge; instead, a continuous activation manifold shaped by Signal-to-Noise Ratio (SNR) was observed – highlighting SNR as a key learned factor, consistent with classical receiver behavior and supporting physical plausibility. Our enhancements to NAP improved in-distribution vs. out-of-distribution separation, suggesting better generalization and indirectly validating clustering quality. These results underscore the importance of clustering design and activation manifolds in interpreting and troubleshooting neural network behavior.
[LG-25] Impact of Data Sparsity on Machine Learning for Fault Detection in Power System Protection
链接: https://arxiv.org/abs/2505.15560
作者: Julian Oelhaf,Georg Kordowich,Changhun Kim,Paula Andrea Perez-Toro,Andreas Maier,Johann Jager,Siming Bayer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Germany’s transition to a renewable energy-based power system is reshaping grid operations, requiring advanced monitoring and control to manage decentralized generation. Machine learning (ML) has emerged as a powerful tool for power system protection, particularly for fault detection (FD) and fault line identification (FLI) in transmission grids. However, ML model reliability depends on data quality and availability. Data sparsity resulting from sensor failures, communication disruptions, or reduced sampling rates poses a challenge to ML-based FD and FLI. Yet, its impact has not been systematically validated prior to this work. In response, we propose a framework to assess the impact of data sparsity on ML-based FD and FLI performance. We simulate realistic data sparsity scenarios, evaluate their impact, derive quantitative insights, and demonstrate the effectiveness of this evaluation strategy by applying it to an existing ML-based framework. Results show the ML model remains robust for FD, maintaining an F1-score of 0.999 ± 0.000 even after a 50x data reduction. In contrast, FLI is more sensitive, with performance decreasing by 55.61% for missing voltage measurements and 9.73% due to communication failures at critical network points. These findings offer actionable insights for optimizing ML models for real-world grid protection. This enables more efficient FD and supports targeted improvements in FLI.
[LG-26] Short-Range Dependency Effects on Transformer Instability and a Decomposed Attention Solution
链接: https://arxiv.org/abs/2505.15548
作者: Suvadeep Hajra
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transformer language models have driven significant progress across various fields, including natural language processing and computer vision. A central component of these models is the self-attention (SA) mechanism, which learns rich vector representations of tokens by modeling their relationships with others in a sequence. However, despite extensive research, transformers continue to suffer from training instability – often manifesting as spikes or divergence in the training loss during a run. In this work, we identify one source of this instability: SA’s limited ability to capture short-range dependencies, especially in tasks like language modeling, where almost every token heavily relies on its nearby neighbors. This limitation causes the pre-softmax logits of SA to grow rapidly, destabilizing training. To address this, we propose decomposing the SA into local (short-range) and global (long-range) attention heads. This decomposed attention, referred to as Long Short-attention (LS-attention), mitigates logit explosion and results in more stable training compared to an equivalent multi-head self-attention (MHSA). Empirical comparisons with two alternative training stabilization methods show that LS-attention reduces the validation perplexity to nearly 2/5 of that achieved by one method and reaches a similar perplexity as the other method using only 1/20 of the GPU hours. Additionally, our experiments demonstrate that LS-attention reduces inference latency by up to 36% compared to a state-of-the-art implementation of equivalent MHSA.
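As a sketch of the decomposition, assuming the local heads are realized by a trailing causal window mask (the paper's exact parameterization is not given in the abstract), a PyTorch implementation could look like this:

```python
import torch
import torch.nn.functional as F

def ls_attention(q, k, v, n_local_heads, window=16):
    """Split heads into local (short-range) and global (long-range) attention.

    q, k, v: (batch, heads, seq, dim). The first n_local_heads heads attend
    only within a trailing causal window; the rest attend causally to all
    previous positions.
    """
    B, H, T, D = q.shape
    i = torch.arange(T)[:, None]
    j = torch.arange(T)[None, :]
    causal = j <= i
    local = causal & (i - j < window)
    masks = torch.stack([local] * n_local_heads + [causal] * (H - n_local_heads))
    scores = q @ k.transpose(-2, -1) / D ** 0.5
    scores = scores.masked_fill(~masks[None], float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 8, 64, 32)  # batch 2, 8 heads, 64 tokens, head dim 32
out = ls_attention(x, x, x, n_local_heads=4)
print(out.shape)  # torch.Size([2, 8, 64, 32])
```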
[LG-27] A Temporal Difference Method for Stochastic Continuous Dynamics
链接: https://arxiv.org/abs/2505.15544
作者: Haruki Settai,Naoya Takeishi,Takehisa Yairi
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:For continuous systems modeled by dynamical equations such as ODEs and SDEs, Bellman’s principle of optimality takes the form of the Hamilton-Jacobi-Bellman (HJB) equation, which provides the theoretical target of reinforcement learning (RL). Although recent advances in RL successfully leverage this formulation, the existing methods typically assume the underlying dynamics are known a priori because they need explicit access to the coefficient functions of dynamical equations to update the value function following the HJB equation. We address this inherent limitation of HJB-based RL; we propose a model-free approach still targeting the HJB equation and propose the corresponding temporal difference method. We demonstrate its potential advantages over transition kernel-based formulations, both qualitatively and empirically. The proposed formulation paves the way toward bridging stochastic optimal control and model-free reinforcement learning.
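A tabular sketch of the proposed idea on a one-dimensional Ornstein-Uhlenbeck diffusion, where the TD target discounts the next value by exp(-beta*dt) instead of a discrete-time gamma (the diffusion, reward, and discretization below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, dt, alpha = 1.0, 0.01, 0.1      # discount rate, time step, learning rate
bins = np.linspace(-3, 3, 61)
V = np.zeros(len(bins) + 1)           # tabular value function over a discretized state

idx = lambda x: int(np.digitize(x, bins))
x = 0.0
for step in range(100_000):
    # One Euler-Maruyama step of the SDE dx = -x dt + dW (model-free: only samples used).
    x_next = x - x * dt + np.sqrt(dt) * rng.standard_normal()
    r = -x * x                        # running reward r(x)
    # Continuous-time TD target: r*dt + exp(-beta*dt) * V(x')
    td = r * dt + np.exp(-beta * dt) * V[idx(x_next)] - V[idx(x)]
    V[idx(x)] += alpha * td
    x = x_next

print(V[idx(0.0)], V[idx(2.0)])       # value is highest near the origin
```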
[LG-28] NOMAD Projection
链接: https://arxiv.org/abs/2505.15511
作者: Brandon Duderstadt,Zach Nussbaum,Laurens van der Maaten
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, and empirical results that demonstrate NOMAD Projection’s superior performance and speed profile compared to existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia.
[LG-29] Coloring Between the Lines: Personalization in the Null Space of Planning Constraints
链接: https://arxiv.org/abs/2505.15503
作者: Tom Silver,Rajat Kumar Jenamani,Ziang Liu,Ben Dodson,Tapomayukh Bhattacharjee
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Generalist robots must personalize in-the-wild to meet the diverse needs and preferences of long-term users. How can we enable flexible personalization without sacrificing safety or competency? This paper proposes Coloring Between the Lines (CBTL), a method for personalization that exploits the null space of constraint satisfaction problems (CSPs) used in robot planning. CBTL begins with a CSP generator that ensures safe and competent behavior, then incrementally personalizes behavior by learning parameterized constraints from online interaction. By quantifying uncertainty and leveraging the compositionality of planning constraints, CBTL achieves sample-efficient adaptation without environment resets. We evaluate CBTL in (1) three diverse simulation environments; (2) a web-based user study; and (3) a real-robot assisted feeding system, finding that CBTL consistently achieves more effective personalization with fewer interactions than baselines. Our results demonstrate that CBTL provides a unified and practical approach for continual, flexible, active, and safe robot personalization. Website: this https URL
[LG-30] Certified Neural Approximations of Nonlinear Dynamics
链接: https://arxiv.org/abs/2505.15497
作者: Frederik Baymler Mathiesen,Nikolaus Vertovec,Francesco Fabiano,Luca Laurenti,Alessandro Abate
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: first and second author contributed equally
点击查看摘要
Abstract:Neural networks hold great potential to act as approximate models of nonlinear dynamical systems, with the resulting neural approximations enabling verification and control of such systems. However, in safety-critical contexts, the use of neural approximations requires formal bounds on their closeness to the underlying system. To address this fundamental challenge, we propose a novel, adaptive, and parallelizable verification method based on certified first-order models. Our approach provides formal error bounds on the neural approximations of dynamical systems, allowing them to be safely employed as surrogates by interpreting the error bound as bounded disturbances acting on the approximated dynamics. We demonstrate the effectiveness and scalability of our method on a range of established benchmarks from the literature, showing that it outperforms the state-of-the-art. Furthermore, we highlight the flexibility of our framework by applying it to two novel scenarios not previously explored in this context: neural network compression and an autoencoder-based deep learning architecture for learning Koopman operators, both yielding compelling results.
[LG-31] Fast Rate Bounds for Multi-Task and Meta-Learning with Different Sample Sizes
链接: https://arxiv.org/abs/2505.15496
作者: Hossein Zakerinia,Christoph H. Lampert
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We present new fast-rate generalization bounds for multi-task and meta-learning in the unbalanced setting, i.e. when the tasks have training sets of different sizes, as is typically the case in real-world scenarios. Previously, only standard-rate bounds were known for this situation, while fast-rate bounds were limited to the setting where all training sets are of equal size. Our new bounds are numerically computable as well as interpretable, and we demonstrate their flexibility in handling a number of cases where they give stronger guarantees than previous bounds. Besides the bounds themselves, we also make conceptual contributions: we demonstrate that the unbalanced multi-task setting has different statistical properties than the balanced situation, specifically that proofs from the balanced situation do not carry over to the unbalanced setting. Additionally, we shed light on the fact that the unbalanced situation allows two meaningful definitions of multi-task risk, depending on whether if all tasks should be considered equally important or if sample-rich tasks should receive more weight than sample-poor ones.
[LG-32] AI-based Decision Support System for Heritage Aircraft Corrosion Prevention
链接: https://arxiv.org/abs/2505.15462
作者: Michal Kuchař,Jaromír Fišer,Cyril Oswald,Tomáš Vyhlídal
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 4 tables, submitted January 31, 2025, to Process Control 2025
点击查看摘要
Abstract:The paper presents a decision support system for the long-term preservation of aeronautical heritage exhibited or stored in sheltered sites. Aeronautical heritage is characterized by the diverse materials of which it is constituted: heritage aircraft are made of ancient aluminum alloys, (ply)wood, and particularly fabrics. The decision support system (DSS), designed starting from a conceptual model, builds its knowledge base on the degradation/corrosion mechanisms of the prevailing materials of aeronautical heritage. In the case of historical aircraft wooden parts, this knowledge base is filled in by the damage function models developed within former European projects. Model-based corrosion prediction is implemented within the new DSS for ancient aluminum alloys. The novelty of this DSS consists in supporting multi-material heritage protection and in tailoring to the peculiarities of aircraft exhibition/storage hangars and the needs of aviation museums. The novel DSS is tested on WWII aircraft heritage exhibited in the Aviation Museum Kbely, Military History Institute Prague, Czech Republic.
[LG-33] SplitWise Regression: Stepwise Modeling with Adaptive Dummy Encoding
链接: https://arxiv.org/abs/2505.15423
作者: Marcell T. Kurbucz,Nikolaos Tzivanakis,Nilufer Sari Aslam,Adam M. Sykulski
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 15 pages, 1 figure, 3 tables
点击查看摘要
Abstract:Capturing nonlinear relationships without sacrificing interpretability remains a persistent challenge in regression modeling. We introduce SplitWise, a novel framework that enhances stepwise regression. It adaptively transforms numeric predictors into threshold-based binary features using shallow decision trees, but only when such transformations improve model fit, as assessed by the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). This approach preserves the transparency of linear models while flexibly capturing nonlinear effects. Implemented as a user-friendly R package, SplitWise is evaluated on both synthetic and real-world datasets. The results show that it consistently produces more parsimonious and generalizable models than traditional stepwise and penalized regression techniques.
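A minimal sketch of the adaptive dummy encoding step, assuming a depth-1 scikit-learn tree proposes the threshold and a plain OLS AIC decides whether to keep the transformation (the R package's internals may differ):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0, 10, n)
y = (x > 6).astype(float) * 3.0 + rng.standard_normal(n) * 0.5  # step-shaped effect

def ols_aic(X, y):
    """Gaussian OLS AIC (up to an additive constant): n*log(RSS/n) + 2k."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta) ** 2).sum()
    return len(y) * np.log(rss / len(y)) + 2 * X.shape[1]

# A shallow tree proposes a threshold for the numeric predictor.
stump = DecisionTreeRegressor(max_depth=1).fit(x.reshape(-1, 1), y)
threshold = stump.tree_.threshold[0]
dummy = (x > threshold).astype(float)

aic_raw, aic_dummy = ols_aic(x, y), ols_aic(dummy, y)
# Keep the binary feature only if it improves the information criterion.
print(f"threshold={threshold:.2f}, AIC raw={aic_raw:.1f}, AIC dummy={aic_dummy:.1f}")
```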
[LG-34] Efficient Differentiable Approximation of Generalized Low-rank Regularization IJCAI-25
链接: https://arxiv.org/abs/2505.15407
作者: Naiqi Li,Yuqiu Xie,Peiyuan Liu,Tao Dai,Yong Jiang,Shu-Tao Xia
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Accepted by IJCAI-25
点击查看摘要
Abstract:Low-rank regularization (LRR) has been widely applied in various machine learning tasks, but the associated optimization is challenging. Directly optimizing the rank function under constraints is NP-hard in general. To overcome this difficulty, various relaxations of the rank function were studied. However, optimization of these relaxed LRRs typically depends on singular value decomposition, which is a time-consuming and nondifferentiable operator that cannot be optimized with gradient-based techniques. To address these challenges, in this paper we propose an efficient differentiable approximation of the generalized LRR. The considered LRR form subsumes many popular choices like the nuclear norm, the Schatten- p norm, and various nonconvex relaxations. Our method enables LRR terms to be appended to loss functions in a plug-and-play fashion, and the GPU-friendly operations enable efficient and convenient implementation. Furthermore, convergence analysis is presented, which rigorously shows that both the bias and the variance of our rank estimator rapidly reduce with increased sample size and iteration steps. In the experimental study, the proposed method is applied to various tasks, which demonstrates its versatility and efficiency. Code is available at this https URL.
[LG-35] HOPSE: Scalable Higher-Order Positional and Structural Encoder for Combinatorial Representations
链接: https://arxiv.org/abs/2505.15405
作者: Martin Carrasco,Guillermo Bernardez,Marco Montagna,Nina Miolane,Lev Telyatnikov
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While Graph Neural Networks (GNNs) have proven highly effective at modeling relational data, pairwise connections cannot fully capture multi-way relationships naturally present in complex real-world systems. In response to this, Topological Deep Learning (TDL) leverages more general combinatorial representations, such as simplicial or cellular complexes, to accommodate higher-order interactions. Existing TDL methods often extend GNNs through Higher-Order Message Passing (HOMP), but face critical scalability challenges due to (i) a combinatorial explosion of message-passing routes, and (ii) significant complexity overhead from the propagation mechanism. To overcome these limitations, we propose HOPSE (Higher-Order Positional and Structural Encoder), a message passing-free framework that uses Hasse graph decompositions to derive efficient and expressive encodings over arbitrary higher-order domains. Notably, HOPSE scales linearly with dataset size while preserving expressive power and permutation equivariance. Experiments on molecular, expressivity and topological benchmarks show that HOPSE matches or surpasses state-of-the-art performance while achieving up to 7 times speedups over HOMP-based models, opening a new path for scalable TDL.
[LG-36] InTreeger: An End-to-End Framework for Integer-Only Decision Tree Inference
链接: https://arxiv.org/abs/2505.15391
作者: Duncan Bart,Bruno Endres Forlin,Ana-Lucia Varbanescu,Marco Ottavi,Kuan-Hsun Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Integer quantization has emerged as a critical technique to facilitate deployment on resource-constrained devices. Although it reduces the complexity of learning models, inference performance is often prone to quantization-induced errors. To this end, we introduce InTreeger: an end-to-end framework that takes a training dataset as input and outputs an architecture-agnostic, integer-only C implementation of a tree-based machine learning model, without loss of precision. This framework enables anyone, even those without prior experience in machine learning, to generate a highly optimized integer-only classification model that can run on any hardware simply by providing an input dataset and target variable. We evaluated our generated implementations across three different architectures (ARM, x86, and RISC-V), resulting in significant improvements in inference latency. In addition, we show the energy efficiency compared to typical decision tree implementations that rely on floating-point arithmetic. The results underscore the advantages of integer-only inference, making it particularly suitable for energy- and area-constrained devices such as embedded systems and edge computing platforms, while also enabling the execution of decision trees on existing ultra-low power devices.
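A Python sketch of the integer-only inference idea, quantizing features and thresholds to a shared fixed-point scale so that tree traversal uses only integer comparisons (the flattened node layout and the scale are illustrative; the actual framework emits C code):

```python
import numpy as np

SCALE = 1 << 10  # fixed-point scale: 10 fractional bits

def quantize(values):
    return np.round(np.asarray(values) * SCALE).astype(np.int32)

# A toy tree in flattened form: at node i, if x[feature[i]] <= threshold[i]
# go to left[i], else right[i]; a negative entry marks a leaf with class -entry-1.
feature   = np.array([0, 1, 0])
threshold = quantize([2.5, -0.75, 7.0])  # thresholds stored as int32
left      = np.array([1, -1, -2])
right     = np.array([2, -2, -1])

def predict_int(x_float):
    x = quantize(x_float)                # same fixed-point scale as the thresholds
    node = 0
    while True:
        nxt = left[node] if x[feature[node]] <= threshold[node] else right[node]
        if nxt < 0:
            return -nxt - 1              # leaf: decode class id
        node = nxt

print(predict_int([1.0, 0.0]), predict_int([3.0, -1.0]), predict_int([9.0, 5.0]))
```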
[LG-37] Distributionally Robust Federated Learning with Client Drift Minimization
链接: https://arxiv.org/abs/2505.15371
作者: Mounssif Krouka,Chaouki Ben Issaid,Mehdi Bennis
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated learning (FL) faces critical challenges, particularly in heterogeneous environments where non-independent and identically distributed data across clients can lead to unfair and inefficient model performance. In this work, we introduce DRDM, a novel algorithm that addresses these issues by combining a distributionally robust optimization (DRO) framework with dynamic regularization to mitigate client drift. DRDM frames the training as a min-max optimization problem aimed at maximizing performance for the worst-case client, thereby promoting robustness and fairness. This robust objective is optimized through an algorithm leveraging dynamic regularization and efficient local updates, which significantly reduces the required number of communication rounds. Moreover, we provide a theoretical convergence analysis for convex smooth objectives under partial participation. Extensive experiments on three benchmark datasets, covering various model architectures and data heterogeneity levels, demonstrate that DRDM significantly improves worst-case test accuracy while requiring fewer communication rounds than existing state-of-the-art baselines. Furthermore, we analyze the impact of signal-to-noise ratio (SNR) and bandwidth on the energy consumption of participating clients, demonstrating that the number of local update steps can be adaptively selected to achieve a target worst-case test accuracy with minimal total energy cost across diverse communication environments.
[LG-38] Human in the Loop Adaptive Optimization for Improved Time Series Forecasting
链接: https://arxiv.org/abs/2505.15354
作者: Malik Tiomoko,Hamza Cherkaoui,Giuseppe Paolo,Zhang Yili,Yu Meng,Zhang Keli,Hafiz Tiomoko Ali
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Time series forecasting models often produce systematic, predictable errors even in critical domains such as energy, finance, and healthcare. We introduce a novel post-training adaptive optimization framework that improves forecast accuracy without retraining or architectural changes. Our method automatically applies expressive transformations, optimized via reinforcement learning, contextual bandits, or genetic algorithms, to correct model outputs in a lightweight and model-agnostic way. Theoretically, we prove that affine corrections always reduce the mean squared error; practically, we extend this idea with dynamic action-based optimization. The framework also supports an optional human-in-the-loop component: domain experts can guide corrections using natural language, which is parsed into actions by a language model. Across multiple benchmarks (e.g., electricity, weather, traffic), we observe consistent accuracy gains with minimal computational overhead. Our interactive demo shows the framework's real-time usability. By combining automated post-hoc refinement with interpretable and extensible mechanisms, our approach offers a powerful new direction for practical forecasting systems.
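The affine-correction guarantee is easy to see empirically: fitting (a, b) by least squares can never do worse than the identity correction (a=1, b=0) on the fitting data. A minimal sketch:

```python
import numpy as np

# Fit an affine post-hoc correction a * y_hat + b by least squares on
# held-out data. Because the identity correction is in the hypothesis
# class, the corrected MSE on that data can only decrease.
rng = np.random.default_rng(1)
y_true = rng.normal(size=500)
y_hat = 0.7 * y_true + 0.3 + rng.normal(scale=0.1, size=500)  # biased forecasts

X = np.stack([y_hat, np.ones_like(y_hat)], axis=1)
(a, b), *_ = np.linalg.lstsq(X, y_true, rcond=None)
y_corrected = a * y_hat + b

print("raw MSE:      ", np.mean((y_hat - y_true) ** 2))
print("corrected MSE:", np.mean((y_corrected - y_true) ** 2))
```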
[LG-39] SSR: Speculative Parallel Scaling Reasoning in Test-time
链接: https://arxiv.org/abs/2505.15340
作者: Yuanlin Chu,Bo Wang,Xiang Liu,Hong Chen,Aiwei Liu,Xuming Hu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved impressive results on multi-step mathematical reasoning, yet at the cost of high computational overhead. This challenge is particularly acute for test-time scaling methods such as parallel decoding, which increase answer diversity but scale poorly in efficiency. To address this efficiency-accuracy trade-off, we propose SSR (Speculative Parallel Scaling Reasoning), a training-free framework that leverages a key insight: by introducing speculative decoding at the step level, we can accelerate reasoning without sacrificing correctness. SSR integrates two components: a Selective Parallel Module (SPM) that identifies a small set of promising reasoning strategies via model-internal scoring, and Step-level Speculative Decoding (SSD), which enables efficient draft-target collaboration for fine-grained reasoning acceleration. Experiments on three mathematical benchmarks (AIME 2024, MATH-500, and LiveMathBench) demonstrate that SSR achieves strong gains over baselines. For instance, on LiveMathBench, SSR improves pass@1 accuracy by 13.84% while reducing computation to 80.5% of the baseline FLOPs. On MATH-500, SSR reduces compute to only 30% with no loss in accuracy.
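The step-level draft-and-verify loop can be sketched with toy stand-ins for the draft and target models; the real method scores candidate steps with model-internal signals rather than the exact string matching used here.

```python
# Toy step-level speculative decoding: a cheap draft model proposes a
# block of reasoning steps; the target model verifies each one and,
# on the first disagreement, substitutes its own step.

def draft_steps(prefix, k=3):
    # stand-in for a small draft model; the last proposal is made wrong
    # on purpose so the rejection path below is exercised
    steps = [f"step{len(prefix) + i}" for i in range(k)]
    steps[-1] = "bogus"
    return steps

def target_step(prefix):
    # stand-in for the large target model's own next reasoning step
    return f"step{len(prefix)}"

def speculative_reason(max_steps=6):
    trace = []
    while len(trace) < max_steps:
        for step in draft_steps(trace):
            if step == target_step(trace):     # target verifies the draft step
                trace.append(step)
            else:                              # reject: fall back to target's step
                trace.append(target_step(trace))
                break
            if len(trace) >= max_steps:
                break
    return trace

print(speculative_reason())   # ['step0', ..., 'step5']
```

The efficiency gain comes from verifying several drafted steps with a single target-model pass, which is cheaper than generating each step with the target model directly.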
[LG-40] Fourier-Invertible Neural Encoder (FINE) for Homogeneous Flows
链接: https://arxiv.org/abs/2505.15329
作者: Anqiao Ouyang,Hongyi Ke,Qi Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Invertible neural architectures have recently attracted attention for their compactness, interpretability, and information-preserving properties. In this work, we propose the Fourier-Invertible Neural Encoder (FINE), which combines invertible monotonic activation functions with reversible filter structures, and could be extended using Invertible ResNets. This architecture is examined in learning low-dimensional representations of one-dimensional nonlinear wave interactions and exact circular translation symmetry. Dimensionality is preserved across layers, except for a Fourier truncation step in the latent space, which enables dimensionality reduction while maintaining shift equivariance and interpretability. Our results demonstrate that FINE significantly outperforms classical linear methods such as the Discrete Fourier Transform (DFT) and Proper Orthogonal Decomposition (POD), and achieves reconstruction accuracy better than conventional deep autoencoders with convolutional layers (CNNs), while using substantially smaller models and offering superior physical interpretability. These findings suggest that invertible single-neuron networks, when combined with spectral truncation, offer a promising framework for learning compact and interpretable representations of physics datasets, and symmetry-aware representation learning in physics-informed machine learning.
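The latent Fourier truncation step is the only place dimensionality is reduced, and it is essentially a one-liner: keep the lowest K modes of an FFT. A minimal numpy sketch of this linear baseline (FINE wraps it between learned invertible layers):

```python
import numpy as np

# Fourier truncation: keep only the lowest K frequency modes. Because the
# FFT diagonalizes circular shifts, this reduction commutes with shifts
# (shift equivariance), which is the property FINE preserves.
rng = np.random.default_rng(0)
x = rng.normal(size=256)                  # toy 1D signal

K = 16
coeffs = np.fft.rfft(x)
coeffs[K:] = 0.0                          # truncate high frequencies
x_recon = np.fft.irfft(coeffs, n=len(x))

print("reconstruction MSE:", np.mean((x - x_recon) ** 2))
```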
[LG-41] Sonnet: Spectral Operator Neural Network for Multivariable Time Series Forecasting
链接: https://arxiv.org/abs/2505.15312
作者: Yuxuan Shu,Vasileios Lampos
类目: Machine Learning (cs.LG)
*备注: The code is available at this https URL
点击查看摘要
Abstract:Multivariable time series forecasting methods can integrate information from exogenous variables, leading to significant prediction accuracy gains. Transformer architecture has been widely applied in various time series forecasting models due to its ability to capture long-range sequential dependencies. However, a naïve application of transformers often struggles to effectively model complex relationships among variables over time. To mitigate against this, we propose a novel architecture, namely the Spectral Operator Neural Network (Sonnet). Sonnet applies learnable wavelet transformations to the input and incorporates spectral analysis using the Koopman operator. Its predictive skill relies on the Multivariable Coherence Attention (MVCA), an operation that leverages spectral coherence to model variable dependencies. Our empirical analysis shows that Sonnet yields the best performance on 34 out of 47 forecasting tasks with an average mean absolute error (MAE) reduction of 1.1% against the most competitive baseline (different per task). We further show that MVCA – when put in place of the naïve attention used in various deep learning models – can remedy its deficiencies, reducing MAE by 10.7% on average in the most challenging forecasting tasks.
[LG-42] An Efficient Private GPT Never Autoregressively Decodes ICML2025
链接: https://arxiv.org/abs/2505.15252
作者: Zhengyi Li,Yue Guan,Kang Yang,Yu Feng,Ning Liu,Yu Yu,Jingwen Leng,Minyi Guo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by ICML 2025
点击查看摘要
Abstract:The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one and multiple tokens takes a similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a 2.1x to 6.0x speedup compared to standard decoding across three pairs of public-private models and different network conditions.
[LG-43] Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets
链接: https://arxiv.org/abs/2505.15251
作者: Idriss Malek,Abhijit Sharma,Salem Lahlou
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Although Generative Flow Networks (GFlowNets) are designed to capture multiple modes of a reward function, they often suffer from mode collapse in practice, getting trapped in early discovered modes and requiring prolonged training to find diverse solutions. Existing exploration techniques may rely on heuristic novelty signals. We propose Loss-Guided GFlowNets (LGGFN), a novel approach where an auxiliary GFlowNet’s exploration is directly driven by the main GFlowNet’s training loss. By prioritizing trajectories where the main model exhibits high loss, LGGFN focuses sampling on poorly understood regions of the state space. This targeted exploration significantly accelerates the discovery of diverse, high-reward samples. Empirically, across various benchmarks including grid environments, structured sequence generation, and Bayesian structure learning, LGGFN consistently enhances exploration efficiency and sample diversity compared to baselines. For instance, on a challenging sequence generation task, it discovered over 40 times more unique valid modes while simultaneously reducing the exploration error metric by approximately 99%.
[LG-44] Mitigating Spurious Correlations with Causal Logit Perturbation
链接: https://arxiv.org/abs/2505.15246
作者: Xiaoling Zhou,Wei Ye,Rui Xie,Shikun Zhang
类目: Machine Learning (cs.LG)
*备注: 34 pages, 9 figures
点击查看摘要
Abstract:Deep learning has seen widespread success in various domains such as science, industry, and society. However, it is acknowledged that certain approaches suffer from non-robustness, relying on spurious correlations for predictions. Addressing these limitations is of paramount importance, necessitating the development of methods that can disentangle spurious correlations. This study attempts to implement causal models via logit perturbations and introduces a novel Causal Logit Perturbation (CLP) framework to train classifiers with generated causal logit perturbations for individual samples, thereby mitigating the spurious associations between non-causal attributes (i.e., image backgrounds) and classes. Our framework employs a perturbation network to generate sample-wise logit perturbations using a series of training characteristics of samples as inputs. The whole framework is optimized by an online meta-learning-based learning algorithm and leverages human causal knowledge by augmenting metadata in both counterfactual and factual manners. Empirical evaluations on four typical biased learning scenarios, including long-tail learning, noisy label learning, generalized long-tail learning, and subpopulation shift learning, demonstrate that CLP consistently achieves state-of-the-art performance. Moreover, visualization results support the effectiveness of the generated causal perturbations in redirecting model attention towards causal image attributes and dismantling spurious associations.
[LG-45] Reliable Vertical Federated Learning in 5G Core Network Architecture
链接: https://arxiv.org/abs/2505.15244
作者: Mohamad Mestoukirdi,Mourad Khanfouci
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Globecom Submission
点击查看摘要
Abstract:This work proposes a new algorithm to mitigate model generalization loss in Vertical Federated Learning (VFL) operating under client reliability constraints within 5G Core Networks (CNs). Recently studied and endorsed by 3GPP, VFL enables collaborative and load-balanced model training and inference across the CN. However, the performance of VFL significantly degrades when the Network Data Analytics Functions (NWDAFs) - which serve as primary clients for VFL model training and inference - experience reliability issues stemming from resource constraints and operational overhead. Unlike edge environments, CN environments adopt fundamentally different data management strategies, characterized by more centralized data orchestration capabilities. This presents opportunities to implement better distributed solutions that take full advantage of the CN data handling flexibility. Leveraging this flexibility, we propose a method that optimizes the vertical feature split among clients while centrally defining their local models based on reliability metrics. Our empirical evaluation demonstrates the effectiveness of our proposed algorithm, showing improved performance over traditional baseline methods.
[LG-46] Finding separatrices of dynamical flows with Deep Koopman Eigenfunctions
链接: https://arxiv.org/abs/2505.15231
作者: Kabir V. Dabholkar,Omri Barak
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Many natural systems, including neural circuits involved in decision making, can be modeled as high-dimensional dynamical systems with multiple stable states. While existing analytical tools primarily describe behavior near stable equilibria, characterizing separatrices – the manifolds that delineate boundaries between different basins of attraction – remains challenging, particularly in high-dimensional settings. Here, we introduce a numerical framework leveraging Koopman Theory combined with Deep Neural Networks to effectively characterize separatrices. Specifically, we approximate Koopman Eigenfunctions (KEFs) associated with real positive eigenvalues, which vanish precisely at the separatrices. Utilizing these scalar KEFs, optimization methods efficiently locate separatrices even in complex systems. We demonstrate our approach on synthetic benchmarks, ecological network models, and recurrent neural networks trained on neuroscience-inspired tasks. Moreover, we illustrate the practical utility of our method by designing optimal perturbations that can shift systems across separatrices, enabling predictions relevant to optogenetic stimulation experiments in neuroscience.
[LG-47] Degree-Optimized Cumulative Polynomial Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2505.15228
作者: Mathew Vanherreweghe,Lirandë Pira,Patrick Rebentrost
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:We introduce cumulative polynomial Kolmogorov-Arnold networks (CP-KAN), a neural architecture combining Chebyshev polynomial basis functions and quadratic unconstrained binary optimization (QUBO). Our primary contribution involves reformulating the degree selection problem as a QUBO task, reducing the complexity from O(D^N) to a single optimization step per layer. This approach enables efficient degree selection across neurons while maintaining computational tractability. The architecture performs well in regression tasks with limited data, showing good robustness to input scales and natural regularization properties from its polynomial basis. Additionally, theoretical analysis establishes connections between CP-KAN's performance and properties of financial time series. Our empirical validation across multiple domains demonstrates competitive performance compared to several traditional architectures tested, especially in scenarios where data efficiency and numerical stability are important. Our implementation, including strategies for managing computational overhead in larger networks, is available in Ref. \citep{cpkan_implementation}.
[LG-48] KernelOracle: Predicting the Linux Scheduler's Next Move with Deep Learning
链接: https://arxiv.org/abs/2505.15213
作者: Sampanna Yashwant Kahu
类目: Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注: 7 pages, 11 figures, pre-print. The source code and data used in this work is available at: this https URL
点击查看摘要
Abstract:Efficient task scheduling is paramount in the Linux kernel, where the Completely Fair Scheduler (CFS) meticulously manages CPU resources to balance high utilization with interactive responsiveness. This research pioneers the use of deep learning techniques to predict the sequence of tasks selected by CFS, aiming to evaluate the feasibility of a more generalized and potentially more adaptive task scheduler for diverse workloads. Our core contributions are twofold: first, the systematic generation and curation of a novel scheduling dataset from a running Linux kernel, capturing real-world CFS behavior; and second, the development, training, and evaluation of a Long Short-Term Memory (LSTM) network designed to accurately forecast the next task to be scheduled. This paper further discusses the practical pathways and implications of integrating such a predictive model into the kernel’s scheduling framework. The findings and methodologies presented herein open avenues for data-driven advancements in kernel scheduling, with the full source code provided for reproducibility and further exploration.
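Framing scheduling as next-token prediction over task IDs leads to a compact LSTM classifier. The sketch below uses random toy traces and illustrative hyperparameters, not the paper's dataset or exact architecture.

```python
import torch
import torch.nn as nn

# Minimal LSTM formulation: given a window of recently scheduled task IDs,
# predict which task the scheduler picks next. NUM_TASKS and HIDDEN are
# illustrative placeholders.
NUM_TASKS, HIDDEN = 64, 128

class NextTaskLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_TASKS, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_TASKS)

    def forward(self, task_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(task_ids))
        return self.head(h[:, -1])          # logits for the next task

model = NextTaskLSTM()
history = torch.randint(0, NUM_TASKS, (32, 20))   # toy scheduling traces
next_task = torch.randint(0, NUM_TASKS, (32,))    # toy ground-truth labels
loss = nn.functional.cross_entropy(model(history), next_task)
loss.backward()
print(float(loss))
```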
[LG-49] Group Distributionally Robust Optimization with Flexible Sample Queries
链接: https://arxiv.org/abs/2505.15212
作者: Haomin Bai,Dingzhi Yu,Shuai Li,Haipeng Luo,Lijun Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Group distributionally robust optimization (GDRO) aims to develop models that perform well across $m$ distributions simultaneously. Existing GDRO algorithms can only process a fixed number of samples per iteration, either 1 or $m$, and therefore cannot support scenarios where the sample size varies dynamically. To address this limitation, we investigate GDRO with flexible sample queries and cast it as a two-player game: one player solves an online convex optimization problem, while the other tackles a prediction with limited advice (PLA) problem. Within such a game, we propose a novel PLA algorithm, constructing appropriate loss estimators for cases where the sample size is either 1 or not, and updating the decision using follow-the-regularized-leader. Then, we establish the first high-probability regret bound for non-oblivious PLA. Building upon the above approach, we develop a GDRO algorithm that allows an arbitrary and varying sample size per round, achieving a high-probability optimization error bound of $O\left(\frac{1}{t}\sqrt{\sum_{j=1}^{t} \frac{m}{r_j}\log m}\right)$, where $r_t$ denotes the sample size at round $t$. This result demonstrates that the optimization error decreases as the number of samples increases, and implies a consistent sample complexity of $O(m\log(m)/\epsilon^2)$ for any fixed sample size $r\in[m]$, aligning with existing bounds for the cases of $r=1$ or $m$. We validate our approach on synthetic binary and real-world multi-class datasets.
[LG-50] Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing
链接: https://arxiv.org/abs/2505.15195
作者: Adel Javanmard,Rudrajit Das,Alessandro Epasto,Vahab Mirrokni
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 31 pages, 6 figures, 5 tables
点击查看摘要
Abstract:Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance. While prior works have demonstrated the benefits of specific heuristic retraining schemes, the question of how to optimally combine the model’s predictions and the provided labels remains largely open. This paper addresses this fundamental question for binary classification tasks. We develop a principled framework based on approximate message passing (AMP) to analyze iterative retraining procedures for two ground truth settings: Gaussian mixture model (GMM) and generalized linear model (GLM). Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model’s predictions and the given labels, which when used to retrain the same model, minimizes its prediction error. We also quantify the performance of this optimal retraining strategy over multiple rounds. We complement our theoretical results by proposing a practically usable version of the theoretically-optimal aggregator function for linear probing with the cross-entropy loss, and demonstrate its superiority over baseline methods in the high label noise regime.
[LG-51] NeuBM: Mitigating Model Bias in Graph Neural Networks through Neutral Input Calibration IJCAI2025
链接: https://arxiv.org/abs/2505.15180
作者: Jiawei Gu,Ziyue Qiao,Xiao Luo
类目: Machine Learning (cs.LG)
*备注: Accepted to IJCAI 2025
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have shown remarkable performance across various domains, yet they often struggle with model bias, particularly in the presence of class imbalance. This bias can lead to suboptimal performance and unfair predictions, especially for underrepresented classes. We introduce NeuBM (Neutral Bias Mitigation), a novel approach to mitigate model bias in GNNs through neutral input calibration. NeuBM leverages a dynamically updated neutral graph to estimate and correct the inherent biases of the model. By subtracting the logits obtained from the neutral graph from those of the input graph, NeuBM effectively recalibrates the model’s predictions, reducing bias across different classes. Our method integrates seamlessly into existing GNN architectures and training procedures, requiring minimal computational overhead. Extensive experiments on multiple benchmark datasets demonstrate that NeuBM significantly improves the balanced accuracy and recall of minority classes, while maintaining strong overall performance. The effectiveness of NeuBM is particularly pronounced in scenarios with severe class imbalance and limited labeled data, where traditional methods often struggle. We provide theoretical insights into how NeuBM achieves bias mitigation, relating it to the concept of representation balancing. Our analysis reveals that NeuBM not only adjusts the final predictions but also influences the learning of balanced feature representations throughout the network.
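The calibration rule itself is a single subtraction: logits on the input minus logits on the neutral input. A toy sketch with a linear stand-in for the GNN follows; the real method maintains a dynamically updated neutral graph rather than the fixed zero input used here.

```python
import torch

# NeuBM-style logit calibration, toy version: estimate class-wise bias
# from a neutral input and subtract it from the input logits.
torch.manual_seed(0)
model = torch.nn.Linear(16, 4)        # stand-in for a GNN readout
neutral_feats = torch.zeros(1, 16)    # stand-in for the neutral graph
input_feats = torch.randn(8, 16)      # stand-in for input graphs

with torch.no_grad():
    bias = model(neutral_feats)                   # class-wise bias estimate
    calibrated_logits = model(input_feats) - bias # recalibrated predictions
print(calibrated_logits.argmax(dim=1))
```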
[LG-52] A Unified Gradient-based Framework for Task-agnostic Continual Learning-Unlearning
链接: https://arxiv.org/abs/2505.15178
作者: Zhehao Huang,Xinwen Cheng,Jie Zhang,Jinghao Zheng,Haoran Wang,Zhengbao He,Tao Li,Xiaolin Huang
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2409.19732
点击查看摘要
Abstract:Recent advancements in deep models have highlighted the need for intelligent systems that combine continual learning (CL) for knowledge acquisition with machine unlearning (MU) for data removal, forming the Continual Learning-Unlearning (CLU) paradigm. While existing work treats CL and MU as separate processes, we reveal their intrinsic connection through a unified optimization framework based on Kullback-Leibler divergence minimization. This framework decomposes gradient updates for approximate CLU into four components: learning new knowledge, unlearning targeted data, preserving existing knowledge, and modulation via weight saliency. A critical challenge lies in balancing knowledge update and retention during sequential learning-unlearning cycles. To resolve this stability-plasticity dilemma, we introduce a remain-preserved manifold constraint to induce a remaining Hessian compensation for CLU iterations. A fast-slow weight adaptation mechanism is designed to efficiently approximate the second-order optimization direction, combined with adaptive weighting coefficients and a balanced weight saliency mask, proposing a unified implementation framework for gradient-based CLU. Furthermore, we pioneer task-agnostic CLU scenarios that support fine-grained unlearning at the cross-task category and random sample levels beyond the traditional task-aware setups. Experiments demonstrate that the proposed UG-CLU framework effectively coordinates incremental learning, precise unlearning, and knowledge stability across multiple datasets and model architectures, providing a theoretical foundation and methodological support for dynamic, compliant intelligent systems.
[LG-53] SpectralGap: Graph-Level Out-of-Distribution Detection via Laplacian Eigenvalue Gaps IJCAI2025
链接: https://arxiv.org/abs/2505.15177
作者: Jiawei Gu,Ziyue Qiao,Zechao Li
类目: Machine Learning (cs.LG)
*备注: Accepted to IJCAI 2025
点击查看摘要
Abstract:The task of graph-level out-of-distribution (OOD) detection is crucial for deploying graph neural networks in real-world settings. In this paper, we observe a significant difference in the relationship between the largest and second-largest eigenvalues of the Laplacian matrix for in-distribution (ID) and OOD graph samples: OOD samples often exhibit anomalous spectral gaps (the difference between the largest and second-largest eigenvalues). This observation motivates us to propose SpecGap, an effective post-hoc approach for OOD detection on graphs. SpecGap adjusts features by subtracting the component associated with the second-largest eigenvalue, scaled by the spectral gap, from the high-level features (i.e., $\mathbf{X} - (\lambda_n - \lambda_{n-1})\, \mathbf{u}_{n-1} \mathbf{v}_{n-1}^T$). SpecGap achieves state-of-the-art performance across multiple benchmark datasets. We present extensive ablation studies and comprehensive theoretical analyses to support our empirical results. As a parameter-free post-hoc method, SpecGap can be easily integrated into existing graph neural network models without requiring any additional training or model modification.
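A toy version of the statistic and the adjustment, under one plausible reading of the abstract's formula (here singular vectors of the feature matrix stand in for the eigen-pair; the paper's exact definitions may differ):

```python
import numpy as np

# Spectral gap of a toy graph Laplacian, plus the abstract's feature
# adjustment applied to random node features for illustration.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(6, 6))
A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency, zero diag
L = np.diag(A.sum(1)) - A                      # graph Laplacian

eigvals = np.sort(np.linalg.eigvalsh(L))
gap = eigvals[-1] - eigvals[-2]                # spectral gap statistic
print("spectral gap:", gap)

X = rng.normal(size=(6, 4))                    # toy high-level node features
U, S, Vt = np.linalg.svd(X, full_matrices=False)
# Damp the second-largest component, scaled by the gap
X_adj = X - gap * np.outer(U[:, 1], Vt[1])
```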
[LG-54] Enhancing Certified Robustness via Block Reflector Orthogonal Layers and Logit Annealing Loss ICML2025
链接: https://arxiv.org/abs/2505.15174
作者: Bo-Han Lai,Pin-Han Huang,Bo-Han Kung,Shang-Tse Chen
类目: Machine Learning (cs.LG)
*备注: ICML 2025 Spotlight
点击查看摘要
Abstract:Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers in constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet - a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines. The implementation is available at this https URL.
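One classical way to obtain an orthogonal (hence exactly 1-Lipschitz) linear map from a block reflector is Q = I - 2V(V^T V)^{-1} V^T, a reflection about the column space of V. The sketch below illustrates that construction; BRONet's actual parametrization may differ in details.

```python
import torch

# Block-reflector-style orthogonal layer: Q is an orthogonal reflection,
# so the linear map preserves norms (Lipschitz constant exactly 1).
class BlockReflector(torch.nn.Module):
    def __init__(self, dim, rank):
        super().__init__()
        self.V = torch.nn.Parameter(torch.randn(dim, rank) / dim ** 0.5)

    def forward(self, x):                       # x: (batch, dim)
        V = self.V
        inner = torch.linalg.inv(V.T @ V)       # (rank, rank)
        Q = torch.eye(V.shape[0]) - 2 * V @ inner @ V.T
        return x @ Q.T

layer = BlockReflector(8, 3)
x = torch.randn(4, 8)
y = layer(x)
# Orthogonality check: an orthogonal map preserves row norms.
print(torch.allclose(x.norm(dim=1), y.norm(dim=1), atol=1e-4))
```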
[LG-55] Cascaded Diffusion Models for Neural Motion Planning ICRA’25
链接: https://arxiv.org/abs/2505.15157
作者: Mohit Sharma,Adam Fishman,Vikash Kumar,Chris Paxton,Oliver Kroemer
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: ICRA’25
点击查看摘要
Abstract:Robots in the real world need to perceive and move to goals in complex environments without collisions. Avoiding collisions is especially difficult when relying on sensor perception and when goals are among clutter. Diffusion policies and other generative models have shown strong performance in solving local planning problems, but often struggle at avoiding all of the subtle constraint violations that characterize truly challenging global motion planning problems. In this work, we propose an approach for learning global motion planning using diffusion policies, allowing the robot to generate full trajectories through complex scenes and reasoning about multiple obstacles along the path. Our approach uses cascaded hierarchical models which unify global prediction and local refinement together with online plan repair to ensure the trajectories are collision free. Our method outperforms (by ~5%) a wide variety of baselines on challenging tasks in multiple domains including navigation and manipulation.
[LG-56] Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation
链接: https://arxiv.org/abs/2505.15152
作者: Nanxu Gong,Zijun Li,Sixun Dong,Haoyue Bai,Wangyang Ying,Xinyuan Wang,Yanjie Fu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Feature Transformation (FT) crafts new features from original ones via mathematical operations to enhance dataset expressiveness for downstream models. However, existing FT methods exhibit critical limitations: discrete search struggles with enormous combinatorial spaces, impeding practical use; and continuous search, being highly sensitive to initialization and step sizes, often becomes trapped in local optima, restricting global exploration. To overcome these limitations, DIFFT redefines FT as a reward-guided generative task. It first learns a compact and expressive latent space for feature sets using a Variational Auto-Encoder (VAE). A Latent Diffusion Model (LDM) then navigates this space to generate high-quality feature embeddings, its trajectory guided by a performance evaluator towards task-specific optima. This synthesis of global distribution learning (from LDM) and targeted optimization (reward guidance) produces potent embeddings, which a novel semi-autoregressive decoder efficiently converts into structured, discrete features, preserving intra-feature dependencies while allowing parallel inter-feature generation. Extensive experiments on 14 benchmark datasets show DIFFT consistently outperforms state-of-the-art baselines in predictive accuracy and robustness, with significantly lower training and inference times.
[LG-57] Time Tracker: Mixture-of-Experts-Enhanced Foundation Time Series Forecasting Model with Decoupled Training Pipelines
链接: https://arxiv.org/abs/2505.15151
作者: Xiaohou Shi,Ke Li,Aobo Liang,Yan Sun
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the past few years, time series foundation models have achieved superior prediction accuracy. However, real-world time series often exhibit significant diversity in their temporal patterns across different time spans and domains, making it challenging for a single model architecture to fit all complex scenarios. In addition, time series data may have multiple variables exhibiting complex correlations between each other. Recent mainstream works have focused on modeling time series in a channel-independent manner in both pretraining and finetuning stages, overlooking the valuable inter-series dependencies. To this end, we propose Time Tracker for better predictions on multivariate time series data. Firstly, we leverage sparse mixture of experts (MoE) within Transformers to handle the modeling of diverse time series patterns, thereby alleviating the learning difficulties of a single model while improving its generalization. Besides, we propose Any-variate Attention, enabling a unified model structure to seamlessly handle both univariate and multivariate time series, thereby supporting channel-independent modeling during pretraining and channel-mixed modeling for finetuning. Furthermore, we design a graph learning module that constructs relations among sequences from frequency-domain features, providing more precise guidance to capture inter-series dependencies in channel-mixed modeling. Based on these advancements, Time Tracker achieves state-of-the-art performance in prediction accuracy, model generalization and adaptability.
[LG-58] Filtering Learning Histories Enhances In-Context Reinforcement Learning
链接: https://arxiv.org/abs/2505.15143
作者: Weiqin Chen,Xinjie Zhang,Dharmashankar Subramanian,Santiago Paternain
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities, allowing them to generalize to and improve in previously unseen environments without re-training or fine-tuning. This is typically accomplished by imitating the complete learning histories of a source RL algorithm over a substantial amount of pretraining environments, which, however, may transfer suboptimal behaviors inherited from the source algorithm/dataset. Therefore, in this work, we address the issue of inheriting suboptimality from the perspective of dataset preprocessing. Motivated by the success of the weighted empirical risk minimization, we propose a simple yet effective approach, learning history filtering (LHF), to enhance ICRL by reweighting and filtering the learning histories based on their improvement and stability characteristics. To the best of our knowledge, LHF is the first approach to avoid source suboptimality by dataset preprocessing, and can be combined with the current state-of-the-art (SOTA) ICRL algorithms. We substantiate the effectiveness of LHF through a series of experiments conducted on the well-known ICRL benchmarks, encompassing both discrete environments and continuous robotic manipulation tasks, with three SOTA ICRL algorithms (AD, DPT, DICP) as the backbones. LHF exhibits robust performance across a variety of suboptimal scenarios, as well as under varying hyperparameters and sampling strategies. Notably, the superior performance of LHF becomes more pronounced in the presence of noisy data, indicating the significance of filtering learning histories.
[LG-59] EC-LDA : Label Distribution Inference Attack against Federated Graph Learning with Embedding Compression
链接: https://arxiv.org/abs/2505.15140
作者: Tong Cheng,Fu Jie,Xinpeng Ling,Huifa Li,Zhili Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have been widely used for graph analysis. Federated Graph Learning (FGL) is an emerging learning framework to collaboratively train graph data from various clients. However, since clients are required to upload model parameters to the server in each round, this gives the server an opportunity to infer private information about each client's data. In this paper, we focus on label distribution attacks (LDAs) that aim to infer the label distributions of the clients' local data. We take the first step toward attacking clients' label distributions in FGL. First, we observe that the effectiveness of LDAs is closely related to the variance of node embeddings in GNNs. Second, we analyze the relation between them and propose a new attack named EC-LDA, which significantly improves the attack effectiveness by compressing node embeddings. Third, extensive experiments on node classification and link prediction tasks across six widely used graph datasets show that EC-LDA outperforms the SOTA LDAs. For example, EC-LDA attains optimal values under both Cos-sim and JS-div evaluation metrics in the CoraFull and LastFM datasets. Finally, we explore the robustness of EC-LDA under differential privacy protection.
[LG-60] Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
链接: https://arxiv.org/abs/2505.15130
作者: Sajjad Ghiasvand,Haniyeh Ehsani Oskouie,Mahnoosh Alizadeh,Ramtin Pedarsani
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) such as CLIP have shown remarkable performance in cross-modal tasks through large-scale contrastive pre-training. To adapt these large transformer-based models efficiently for downstream tasks, Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA have emerged as scalable alternatives to full fine-tuning, especially in few-shot scenarios. However, like traditional deep neural networks, VLMs are highly vulnerable to adversarial attacks, where imperceptible perturbations can significantly degrade model performance. Adversarial training remains the most effective strategy for improving model robustness in PEFT. In this work, we propose AdvCLIP-LoRA, the first algorithm designed to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. Our method formulates adversarial fine-tuning as a minimax optimization problem and provides theoretical guarantees for convergence under smoothness and nonconvex-strong-concavity assumptions. Empirical results across eight datasets using ViT-B/16 and ViT-B/32 models show that AdvCLIP-LoRA significantly improves robustness against common adversarial attacks (e.g., FGSM, PGD), without sacrificing much clean accuracy. These findings highlight AdvCLIP-LoRA as a practical and theoretically grounded approach for robust adaptation of VLMs in resource-constrained settings.
[LG-61] Khan-GCL: Kolmogorov-Arnold Network Based Graph Contrastive Learning with Hard Negatives
链接: https://arxiv.org/abs/2505.15103
作者: Zihu Wang,Boxun Xu,Hejia Geng,Peng Li
类目: Machine Learning (cs.LG)
*备注: Graph Contrastive Learning, Self-supervised Learning, Kolmogorov-Arnold Network, Representation Learning
点击查看摘要
Abstract:Graph contrastive learning (GCL) has demonstrated great promise for learning generalizable graph representations from unlabeled data. However, conventional GCL approaches face two critical limitations: (1) the restricted expressive capacity of multilayer perceptron (MLP) based encoders, and (2) suboptimal negative samples that either come from random augmentations, which fail to provide effective 'hard negatives', or are generated without addressing the semantic distinctions crucial for discriminating graph data. To this end, we propose Khan-GCL, a novel framework that integrates the Kolmogorov-Arnold Network (KAN) into the GCL encoder architecture, substantially enhancing its representational capacity. Furthermore, we exploit the rich information embedded within KAN coefficient parameters to develop two novel critical feature identification techniques that enable the generation of semantically meaningful hard negative samples for each graph representation. These strategically constructed hard negatives guide the encoder to learn more discriminative features by emphasizing critical semantic differences between graphs. Extensive experiments demonstrate that our approach achieves state-of-the-art performance compared to existing GCL methods across a variety of datasets and tasks.
[LG-62] Cost-aware LLM-based Online Dataset Annotation
链接: https://arxiv.org/abs/2505.15101
作者: Eray Can Elumar,Cem Tekin,Osman Yagan
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
点击查看摘要
Abstract:Recent advances in large language models (LLMs) have enabled automated dataset labeling with minimal human supervision. While majority voting across multiple LLMs can improve label reliability by mitigating individual model biases, it incurs high computational costs due to repeated querying. In this work, we propose a novel online framework, Cost-aware Majority Voting (CaMVo), for efficient and accurate LLM-based dataset annotation. CaMVo adaptively selects a subset of LLMs for each data instance based on contextual embeddings, balancing confidence and cost without requiring pre-training or ground-truth labels. Leveraging a LinUCB-based selection mechanism and a Bayesian estimator over confidence scores, CaMVo estimates a lower bound on labeling accuracy for each LLM and aggregates responses through weighted majority voting. Our empirical evaluation on the MMLU and IMDB Movie Review datasets demonstrates that CaMVo achieves comparable or superior accuracy to full majority voting while significantly reducing labeling costs. This establishes CaMVo as a practical and robust solution for cost-efficient annotation in dynamic labeling environments.
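The selection mechanism builds on standard LinUCB. The sketch below shows the vanilla bandit update with a toy reward signal; CaMVo's cost term and Bayesian confidence estimator are not modeled.

```python
import numpy as np

# Vanilla LinUCB for choosing which LLM (arm) to query given a context
# embedding of the data instance. dim, num_llms, alpha are illustrative.
dim, num_llms, alpha = 8, 3, 1.0
A = [np.eye(dim) for _ in range(num_llms)]     # per-arm design matrices
b = [np.zeros(dim) for _ in range(num_llms)]

def select(context):
    scores = []
    for arm in range(num_llms):
        A_inv = np.linalg.inv(A[arm])
        theta = A_inv @ b[arm]                 # ridge estimate of arm reward
        ucb = theta @ context + alpha * np.sqrt(context @ A_inv @ context)
        scores.append(ucb)
    return int(np.argmax(scores))

def update(arm, context, reward):
    A[arm] += np.outer(context, context)
    b[arm] += reward * context

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.normal(size=dim)                   # context embedding of an instance
    arm = select(x)
    update(arm, x, reward=float(rng.random() < 0.5))  # toy correctness signal
```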
[LG-63] Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories
链接: https://arxiv.org/abs/2505.15076
作者: Nanxu Gong,Sixun Dong,Haoyue Bai,Xinyuan Wang,Wangyang Ying,Yanjie Fu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As a widely-used and practical tool, feature engineering transforms raw data into discriminative features to advance AI model performance. However, existing methods usually apply feature selection and generation separately, failing to strike a balance between reducing redundancy and adding meaningful dimensions. To fill this gap, we propose an agentic feature augmentation concept, where the unification of feature generation and selection is modeled as agentic teaming and planning. Specifically, we develop a Multi-Agent System with Long and Short-Term Memory (MAGS), comprising a selector agent to eliminate redundant features, a generator agent to produce informative new dimensions, and a router agent that strategically coordinates their actions. We leverage in-context learning with short-term memory for immediate feedback refinement and long-term memory for globally optimal guidance. Additionally, we employ offline Proximal Policy Optimization (PPO) reinforcement fine-tuning to train the router agent for effective decision-making to navigate a vast discrete feature space. Extensive experiments demonstrate that this unified agentic framework consistently achieves superior task performance by intelligently orchestrating feature selection and generation.
[LG-64] Generalization Through Growth: Hidden Dynamics Controls Depth Dependence
链接: https://arxiv.org/abs/2505.15064
作者: Sho Sonoda,Yuka Hashimoto,Isao Ishikawa,Masahiro Ikeda
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Recent theory has reduced the depth dependence of generalization bounds from exponential to polynomial and even depth-independent rates, yet these results remain tied to specific architectures and Euclidean inputs. We present a unified framework for arbitrary pseudo-metric spaces in which a depth-$k$ network is the composition of continuous hidden maps $f:\mathcal{X}\to \mathcal{X}$ and an output map $h:\mathcal{X}\to \mathbb{R}$. The resulting bound $O(\sqrt{(\alpha + \log \beta(k))/n})$ isolates the sole depth contribution in $\beta(k)$, the word-ball growth of the semigroup generated by the hidden layers. By Gromov's theorem polynomial (resp. exponential) growth corresponds to virtually nilpotent (resp. expanding) dynamics, revealing a geometric dichotomy behind existing $O(\sqrt{k})$ (sublinear depth) and $\tilde{O}(1)$ (depth-independent) rates. We further provide covering-number estimates showing that expanding dynamics yield an exponential parameter saving via compositional expressivity. Our results decouple specification from implementation, offering architecture-agnostic and dynamical-systems-aware guarantees applicable to modern deep-learning paradigms such as test-time inference and diffusion models.
[LG-65] RLBenchNet: The Right Network for the Right Reinforcement Learning Task
链接: https://arxiv.org/abs/2505.15040
作者: Ivan Smirnov,Shangding Gu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Reinforcement learning (RL) has seen significant advancements through the application of various neural network architectures. In this study, we systematically investigate the performance of several neural networks in RL tasks, including Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), Mamba/Mamba-2, Transformer-XL, Gated Transformer-XL, and Gated Recurrent Unit (GRU). Through comprehensive evaluation across continuous control, discrete decision-making, and memory-based environments, we identify architecture-specific strengths and limitations. Our results reveal that: (1) MLPs excel in fully observable continuous control tasks, providing an optimal balance of performance and efficiency; (2) recurrent architectures like LSTM and GRU offer robust performance in partially observable environments with moderate memory requirements; (3) Mamba models achieve a 4.5x higher throughput compared to LSTM and a 3.9x increase over GRU, all while maintaining comparable performance; and (4) only Transformer-XL, Gated Transformer-XL, and Mamba-2 successfully solve the most challenging memory-intensive tasks, with Mamba-2 requiring 8x less memory than Transformer-XL. These findings provide insights for researchers and practitioners, enabling more informed architecture selection based on specific task characteristics and computational constraints. Code is available at: this https URL
[LG-66] Harnessing Large Language Models Locally: Empirical Results and Implications for AI PC
链接: https://arxiv.org/abs/2505.15030
作者: Qingyu Song,Peiyu Liao,Wenqian Zhao,Yiwen Wang,Shoubo Hu,Hui-Ling Zhen,Ning Jiang,Mingxuan Yuan
类目: Machine Learning (cs.LG)
*备注: 18 pages, 14 figures
点击查看摘要
Abstract:The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology (covering model capability, development efficiency, and system resources) for evaluating on-device LLMs. Our comprehensive evaluation, encompassing models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around 3.5 effective BPW: larger models subjected to low-bit quantization consistently outperform smaller models utilizing higher bit-precision. 3) Quantization with low BPW incurs marginal accuracy loss but significant memory savings. 4) Power consumption on CPU is determined by low-level implementation specifics, with computation-intensive operations drawing more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at this https URL.
[LG-67] Beyond Node Attention: Multi-Scale Harmonic Encoding for Feature-Wise Graph Message Passing
链接: https://arxiv.org/abs/2505.15015
作者: Longlong Li,Cunquan Qu,Guanghui Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Conventional Graph Neural Networks (GNNs) aggregate neighbor embeddings as holistic vectors, lacking the ability to identify fine-grained, direction-specific feature relevance. We propose MSH-GNN (Multi-Scale Harmonic Graph Neural Network), a novel architecture that performs feature-wise adaptive message passing through node-specific harmonic projections. For each node, MSH-GNN dynamically projects neighbor features onto frequency-sensitive directions determined by the target node’s own representation. These projections are further modulated using learnable sinusoidal encodings at multiple frequencies, enabling the model to capture both smooth and oscillatory structural patterns across scales. A frequency-aware attention pooling mechanism is introduced to emphasize spectrally and structurally salient nodes during readout. Theoretically, we prove that MSH-GNN approximates shift-invariant kernels and matches the expressive power of the 1-Weisfeiler-Lehman (1-WL) test. Empirically, MSH-GNN consistently outperforms state-of-the-art models on a wide range of graph and node classification tasks. Furthermore, in challenging classification settings involving joint variations in graph topology and spectral frequency, MSH-GNN excels at capturing structural asymmetries and high-frequency modulations, enabling more accurate graph discrimination.
[LG-68] AnyBody: A Benchmark Suite for Cross-Embodiment Manipulation
链接: https://arxiv.org/abs/2505.14986
作者: Meenal Parakh,Alexandre Kirchmeyer,Beining Han,Jia Deng
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Generalizing control policies to novel embodiments remains a fundamental challenge in enabling scalable and transferable learning in robotics. While prior works have explored this in locomotion, a systematic study in the context of manipulation tasks remains limited, partly due to the lack of standardized benchmarks. In this paper, we introduce a benchmark for learning cross-embodiment manipulation, focusing on two foundational tasks, reach and push, across a diverse range of morphologies. The benchmark is designed to test generalization along three axes: interpolation (testing performance within a robot category that shares the same link structure), extrapolation (testing on a robot with a different link structure), and composition (testing on combinations of link structures). On the benchmark, we evaluate the ability of different RL policies to learn from multiple morphologies and to generalize to novel ones. Our study aims to answer whether morphology-aware training can outperform single-embodiment baselines, whether zero-shot generalization to unseen morphologies is feasible, and how consistently these patterns hold across different generalization regimes. The results highlight the current limitations of multi-embodiment learning and provide insights into how architectural and training design choices influence policy generalization.
[LG-69] Privacy Preserving Conversion Modeling in Data Clean Room RECSYS’24
链接: https://arxiv.org/abs/2505.14959
作者: Kungang Li,Xiangyi Chen,Ling Leng,Jiajing Xu,Jiankai Sun,Behnam Rezaei
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Published in Proceedings of the 18th ACM Conference on Recommender Systems. 2024 (RecSys '24)
点击查看摘要
Abstract:In the realm of online advertising, accurately predicting the conversion rate (CVR) is crucial for enhancing advertising efficiency and user satisfaction. This paper addresses the challenge of CVR prediction while adhering to user privacy preferences and advertiser requirements. Traditional methods face obstacles such as the reluctance of advertisers to share sensitive conversion data and the limitations of model training in secure environments like data clean rooms. We propose a novel model training framework that enables collaborative model training without sharing sample-level gradients with the advertising platform. Our approach introduces several innovative components: (1) utilizing batch-level aggregated gradients instead of sample-level gradients to minimize privacy risks; (2) applying adapter-based parameter-efficient fine-tuning and gradient compression to reduce communication costs; and (3) employing de-biasing techniques to train the model under label differential privacy, thereby maintaining accuracy despite privacy-enhanced label perturbations. Our experimental results, conducted on industrial datasets, demonstrate that our method achieves competitive ROCAUC performance while significantly decreasing communication overhead and complying with both advertiser privacy requirements and user privacy choices. This framework establishes a new standard for privacy-preserving, high-performance CVR prediction in the digital advertising landscape.
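Batch-level gradient aggregation falls out of a mean-reduced loss: a single backward pass already exposes only summed gradients, never per-sample ones. A minimal sketch of this component follows; the paper additionally layers gradient compression and label differential privacy on top, which are not modeled here.

```python
import torch

# Only the batch-aggregated gradient ever leaves the clean room: with a
# mean-reduced loss, autograd accumulates over the batch, so no
# sample-level gradient is materialized for sharing.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

loss = torch.nn.functional.mse_loss(model(x), y)   # mean over the batch
loss.backward()
shared_grads = [p.grad.clone() for p in model.parameters()]  # batch-level only
print([g.shape for g in shared_grads])
```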
[LG-70] Unlearning Algorithmic Biases over Graphs
链接: https://arxiv.org/abs/2505.14945
作者: O. Deniz Kose,Gonzalo Mateos,Yanning Shen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The growing enforcement of the right to be forgotten regulations has propelled recent advances in certified (graph) unlearning strategies to comply with data removal requests from deployed machine learning (ML) models. Motivated by the well-documented bias amplification predicament inherent to graph data, here we take a fresh look at graph unlearning and leverage it as a bias mitigation tool. Given a pre-trained graph ML model, we develop a training-free unlearning procedure that offers certifiable bias mitigation via a single-step Newton update on the model weights. This way, we contribute a computationally lightweight alternative to the prevalent training- and optimization-based fairness enhancement approaches, with quantifiable performance guarantees. We first develop a novel fairness-aware nodal feature unlearning strategy along with refined certified unlearning bounds for this setting, whose impact extends beyond the realm of graph unlearning. We then design structural unlearning methods endowed with principled selection mechanisms over nodes and edges informed by rigorous bias analyses. Unlearning these judiciously selected elements can mitigate algorithmic biases with minimal impact on downstream utility (e.g., node classification accuracy). Experimental results over real networks corroborate the bias mitigation efficacy of our unlearning strategies, and delineate markedly favorable utility-complexity trade-offs relative to retraining from scratch using augmented graph data obtained via removals.
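The single-step Newton update has the same shape as in classical certified removal: take one Newton step on the retained-data objective, starting from the trained weights. A logistic-regression sketch follows; the paper's graph-aware node/edge selection and fairness analysis are not modeled here.

```python
import numpy as np

# One-step Newton unlearning for L2-regularized logistic regression:
# w_unlearned = w - H_keep^{-1} g_keep, where g_keep and H_keep are the
# gradient and Hessian on the retained data only.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_and_hessian(X, y, w, lam=1e-2):
    p = sigmoid(X @ w)
    g = X.T @ (p - y) / len(y) + lam * w
    W = p * (1 - p)
    H = (X.T * W) @ X / len(y) + lam * np.eye(len(w))
    return g, H

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(5)
for _ in range(50):                       # crude Newton training loop
    g, H = grad_and_hessian(X, y, w)
    w -= np.linalg.solve(H, g)

keep = np.ones(200, dtype=bool); keep[:20] = False   # forget first 20 samples
g_keep, H_keep = grad_and_hessian(X[keep], y[keep], w)
w_unlearned = w - np.linalg.solve(H_keep, g_keep)    # single Newton step
print("weight change norm:", np.linalg.norm(w - w_unlearned))
```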
[LG-71] Foundations of Unknown-aware Machine Learning
链接: https://arxiv.org/abs/2505.14933
作者: Xuefeng Du
类目: Machine Learning (cs.LG)
*备注: PhD Dissertation
点击查看摘要
Abstract:Ensuring the reliability and safety of machine learning models in open-world deployment is a central challenge in AI safety. This thesis develops both algorithmic and theoretical foundations to address key reliability issues arising from distributional uncertainty and unknown classes, from standard neural networks to modern foundation models like large language models (LLMs). Traditional learning paradigms, such as empirical risk minimization (ERM), assume no distribution shift between training and inference, often leading to overconfident predictions on out-of-distribution (OOD) inputs. This thesis introduces novel frameworks that jointly optimize for in-distribution accuracy and reliability to unseen data. A core contribution is the development of an unknown-aware learning framework that enables models to recognize and handle novel inputs without labeled OOD data. We propose new outlier synthesis methods, VOS, NPOS, and DREAM-OOD, to generate informative unknowns during training. Building on this, we present SAL, a theoretical and algorithmic framework that leverages unlabeled in-the-wild data to enhance OOD detection under realistic deployment conditions. These methods demonstrate that abundant unlabeled data can be harnessed to recognize and adapt to unforeseen inputs, providing formal reliability guarantees. The thesis also extends reliable learning to foundation models. We develop HaloScope for hallucination detection in LLMs, MLLMGuard for defending against malicious prompts in multimodal models, and data cleaning methods to denoise human feedback used for better alignment. These tools target failure modes that threaten the safety of large-scale models in deployment. Overall, these contributions promote unknown-aware learning as a new paradigm, and we hope it can advance the reliability of AI systems with minimal human efforts.
[LG-72] SecCAN: An Extended CAN Controller with Embedded Intrusion Detection
链接: https://arxiv.org/abs/2505.14924
作者: Shashwat Khandelwal,Shreejith Shanker
类目: ystems and Control (eess.SY); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 4 pages, 3 figures, 3 tables, Accepted in IEEE Embedded Systems Letters ( this https URL )
点击查看摘要
Abstract:Recent research has highlighted the vulnerability of in-vehicle network protocols such as controller area networks (CAN) and proposed machine learning-based intrusion detection systems (IDSs) as an effective mitigation technique. However, their efficient integration into vehicular architecture is non-trivial, with existing methods relying on electronic control units (ECUs)-coupled IDS accelerators or dedicated ECUs as IDS accelerators. Here, initiating IDS requires complete reception of a CAN message from the controller, incurring data movement and software overheads. In this paper, we present SecCAN, a novel CAN controller architecture that embeds IDS capability within the datapath of the controller. This integration allows IDS to tap messages directly from within the CAN controller as they are received from the bus, removing overheads incurred by existing ML-based IDSs. A custom-quantised machine-learning accelerator is developed as the IDS engine and embedded into SecCAN’s receive data path, with optimisations to overlap the IDS inference with the protocol’s reception window. We implement SecCAN on AMD XCZU7EV FPGA to quantify its performance and benefits in hardware, using multiple attack datasets. We show that SecCAN can completely hide the IDS latency within the CAN reception window for all CAN packet sizes and detect multiple attacks with state-of-the-art accuracy with zero software overheads on the ECU and low energy overhead (73.7 uJ per message) for IDS inference. Also, SecCAN incurs limited resource overhead compared to a standard CAN controller (30% LUT, 1% FF), making it ideally suited for automotive deployment.
[LG-73] TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction
链接: https://arxiv.org/abs/2505.14919
作者: Frederik Wenkel,Wilson Tu,Cassandra Masschelein,Hamed Shirzad,Cian Eastwood,Shawn T. Whitfield,Ihab Bendidi,Craig Russell,Liam Hodgson,Yassir El Mesbahi,Jiarui Ding,Marta M. Fay,Berton Earnshaw,Emmanuel Noutahi,Alisandra K. Denton
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:Accurately predicting cellular responses to genetic perturbations is essential for understanding disease mechanisms and designing effective therapies. Yet exhaustively exploring the space of possible perturbations (e.g., multi-gene perturbations or across tissues and cell types) is prohibitively expensive, motivating methods that can generalize to unseen conditions. In this work, we explore how knowledge graphs of gene-gene relationships can improve out-of-distribution (OOD) prediction across three challenging settings: unseen single perturbations; unseen double perturbations; and unseen cell lines. In particular, we present: (i) TxPert, a new state-of-the-art method that leverages multiple biological knowledge networks to predict transcriptional responses under OOD scenarios; (ii) an in-depth analysis demonstrating the impact of graphs, model architecture, and data on performance; and (iii) an expanded benchmarking framework that strengthens evaluation standards for perturbation modeling.
[LG-74] When to retrain a machine learning model
链接: https://arxiv.org/abs/2505.14903
作者: Regol Florence,Schwinn Leo,Sprague Kyle,Coates Mark,Markovich Thomas
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments addressing classification tasks show that the method consistently outperforms existing baselines on 7 datasets.
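The decision rule the authors describe, forecasting a bounded performance metric under uncertainty and weighing expected degradation against a retraining cost, can be illustrated with a toy Monte Carlo sketch. Everything below (the linear-trend forecast, the cost convention, the function name `should_retrain`) is a simplified assumption for illustration, not the paper's algorithm.

```python
import numpy as np

def should_retrain(perf_history, cost_ratio, horizon=10, n_sims=1000, rng=None):
    """Toy uncertainty-based retraining rule (hypothetical, not the paper's method).

    Fits a linear trend with residual noise to a short history of a bounded
    metric (e.g. accuracy in [0, 1]), simulates its evolution over a horizon,
    and retrains when the expected performance shortfall outweighs the
    retraining cost. `cost_ratio` is the retraining cost expressed in units
    of lost performance, summed over the horizon.
    """
    rng = np.random.default_rng(rng)
    t = np.arange(len(perf_history))
    slope, intercept = np.polyfit(t, perf_history, 1)
    resid_std = np.std(perf_history - (slope * t + intercept))

    future_t = len(perf_history) + np.arange(horizon)
    # Monte Carlo forecast of the bounded metric.
    sims = slope * future_t + intercept + rng.normal(0, resid_std, (n_sims, horizon))
    sims = np.clip(sims, 0.0, 1.0)

    baseline = perf_history[-1]  # performance right after the last (re)training
    expected_shortfall = np.mean(np.maximum(baseline - sims, 0.0).sum(axis=1))
    return expected_shortfall > cost_ratio

# Example: accuracy degrading under drift.
history = [0.92, 0.91, 0.90, 0.88, 0.87, 0.85]
print(should_retrain(history, cost_ratio=0.2))
```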
[LG-75] Multi-Channel Swin Transformer Framework for Bearing Remaining Useful Life Prediction
链接: https://arxiv.org/abs/2505.14897
作者: Ali Mohajerzarrinkelk,Maryam Ahang,Mehran Zoravar,Mostafa Abbasi,Homayoun Najjaran
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Precise estimation of the Remaining Useful Life (RUL) of rolling bearings is an important consideration to avoid unexpected failures, reduce downtime, and promote safety and efficiency in industrial systems. Complications in degradation trends, noise presence, and the necessity to detect faults in advance make estimation of RUL a challenging task. This paper introduces a novel framework that combines wavelet-based denoising method, Wavelet Packet Decomposition (WPD), and a customized multi-channel Swin Transformer model (MCSFormer) to address these problems. With attention mechanisms incorporated for feature fusion, the model is designed to learn global and local degradation patterns utilizing hierarchical representations for enhancing predictive performance. Additionally, a customized loss function is developed as a key distinction of this work to differentiate between early and late predictions, prioritizing accurate early detection and minimizing the high operation risks of late predictions. The proposed model was evaluated with the PRONOSTIA dataset using three experiments. Intra-condition experiments demonstrated that MCSFormer outperformed state-of-the-art models, including the Adaptive Transformer, MDAN, and CNN-SRU, achieving 41%, 64%, and 69% lower MAE on average across different operating conditions, respectively. In terms of cross-condition testing, it achieved superior generalization under varying operating conditions compared to the adapted ViT and Swin Transformer. Lastly, the custom loss function effectively reduced late predictions, as evidenced in a 6.3% improvement in the scoring metric while maintaining competitive overall performance. The model’s robust noise resistance, generalization capability, and focus on safety make MCSFormer a trustworthy and effective predictive maintenance tool in industrial applications.
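The idea of a loss that treats early and late RUL predictions asymmetrically can be sketched in a few lines. The quadratic form and the `late_weight` factor below are illustrative assumptions; the abstract does not specify the paper's exact loss.

```python
import torch

def asymmetric_rul_loss(pred, target, late_weight=2.0):
    """Illustrative asymmetric loss for RUL prediction (an assumption, not the
    paper's published formula): errors where the model predicts a longer life
    than the truth (late predictions, pred > target) are riskier and receive
    a larger weight than early, conservative predictions."""
    err = pred - target
    weights = torch.where(err > 0, torch.full_like(err, late_weight), torch.ones_like(err))
    return torch.mean(weights * err ** 2)

pred = torch.tensor([100.0, 80.0])
target = torch.tensor([90.0, 90.0])
print(asymmetric_rul_loss(pred, target))  # the late error is weighted more
```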
[LG-76] Feature-Weighted MMD-CORAL for Domain Adaptation in Power Transformer Fault Diagnosis
链接: https://arxiv.org/abs/2505.14896
作者: Hootan Mahmoodiyan,Maryam Ahang,Mostafa Abbasi,Homayoun Najjaran
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ensuring the reliable operation of power transformers is critical to grid stability. Dissolved Gas Analysis (DGA) is widely used for fault diagnosis, but traditional methods rely on heuristic rules, which may lead to inconsistent results. Machine learning (ML)-based approaches have improved diagnostic accuracy; however, power transformers operate under varying conditions, and differences in transformer type, environmental factors, and operational settings create distribution shifts in diagnostic data. Consequently, direct model transfer between transformers often fails, making techniques for domain adaptation a necessity. To tackle this issue, this work proposes a feature-weighted domain adaptation technique that combines Maximum Mean Discrepancy (MMD) and Correlation Alignment (CORAL) with feature-specific weighting (MCW). Kolmogorov-Smirnov (K-S) statistics are used to assign adaptable weights, prioritizing features with larger distributional discrepancies and thereby improving source and target domain alignment. Experimental evaluations on datasets for power transformers demonstrate the effectiveness of the proposed method, which achieves a 7.9% improvement over Fine-Tuning and a 2.2% improvement over MMD-CORAL (MC). Furthermore, it outperforms both techniques across various training sample sizes, confirming its robustness for domain adaptation.
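A minimal sketch of the MCW idea, assuming the weights are simply normalized K-S statistics and the features are re-scaled before computing a linear-kernel MMD and a CORAL-style covariance distance; both are simplifications of the paper's formulation.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_feature_weights(Xs, Xt):
    """Per-feature weights from Kolmogorov-Smirnov statistics: features with a
    larger source/target discrepancy get more weight (normalized to sum to 1).
    The exact weighting scheme here is an assumption for illustration."""
    stats = np.array([ks_2samp(Xs[:, j], Xt[:, j]).statistic
                      for j in range(Xs.shape[1])])
    return stats / stats.sum()

def weighted_mmd_coral(Xs, Xt, w):
    """Weighted alignment penalty: linear-kernel MMD on re-weighted features
    plus a CORAL-style squared covariance distance."""
    Ws = Xs * np.sqrt(w)  # scale each feature by sqrt(weight)
    Wt = Xt * np.sqrt(w)
    mmd = np.sum((Ws.mean(axis=0) - Wt.mean(axis=0)) ** 2)
    coral = np.sum((np.cov(Ws, rowvar=False) - np.cov(Wt, rowvar=False)) ** 2)
    return mmd + coral

rng = np.random.default_rng(0)
Xs = rng.normal(0, 1, (200, 5))    # source-domain DGA features
Xt = rng.normal(0.5, 1.5, (200, 5))  # shifted target domain
w = ks_feature_weights(Xs, Xt)
print(weighted_mmd_coral(Xs, Xt, w))
```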
[LG-77] An active learning framework for multi-group mean estimation
链接: https://arxiv.org/abs/2505.14882
作者: Abdellah Aznag,Rachel Cummings,Adam N. Elmachtoub
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study a fundamental learning problem over multiple groups with unknown data distributions, where an analyst would like to learn the mean of each group. Moreover, we want to ensure that this data is collected in a relatively fair manner such that the noise of the estimate of each group is reasonable. In particular, we focus on settings where data are collected dynamically, which is important in adaptive experimentation for online platforms or adaptive clinical trials for healthcare. In our model, we employ an active learning framework to sequentially collect samples with bandit feedback, observing a sample in each period from the chosen group. After observing a sample, the analyst updates their estimate of the mean and variance of that group and chooses the next group accordingly. The analyst’s objective is to dynamically collect samples to minimize the collective noise of the estimators, measured by the norm of the vector of variances of the mean estimators. We propose an algorithm, Variance-UCB, that sequentially selects groups according to an upper confidence bound on the variance estimate. We provide a general theoretical framework for providing efficient bounds on learning from any underlying distribution where the variances can be estimated reasonably. This framework yields upper bounds on regret that improve significantly upon all existing bounds, as well as a collection of new results for different objectives and distributions than those previously studied.
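A minimal sketch of a Variance-UCB-style sampling loop. The confidence bonus below is a generic illustrative choice, not the paper's derived bound.

```python
import numpy as np

def variance_ucb(group_samplers, T, delta=0.05, rng=None):
    """Sample each group once, then repeatedly pull the group whose variance
    upper confidence bound is largest. The bonus term is illustrative."""
    rng = np.random.default_rng(rng)
    K = len(group_samplers)
    samples = [[group_samplers[k](rng)] for k in range(K)]
    for _ in range(K, T):
        ucbs = []
        for k in range(K):
            n = len(samples[k])
            var_hat = np.var(samples[k], ddof=1) if n > 1 else 1.0
            bonus = np.sqrt(2 * np.log(1.0 / delta) / n)  # illustrative bonus
            ucbs.append(var_hat + bonus)
        k_star = int(np.argmax(ucbs))
        samples[k_star].append(group_samplers[k_star](rng))
    return [(np.mean(s), np.var(s, ddof=1), len(s)) for s in samples]

# Three groups with very different variances: the noisy group gets more pulls.
groups = [lambda r: r.normal(0, 1), lambda r: r.normal(5, 3), lambda r: r.normal(-2, 0.5)]
for mean, var, n in variance_ucb(groups, T=300):
    print(f"mean={mean:.2f} var={var:.2f} n={n}")
```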
[LG-78] A self-regulated convolutional neural network for classifying variable stars
链接: https://arxiv.org/abs/2505.14877
作者: Francisco Pérez-Galarce,Jorge Martínez-Palomera,Karim Pichara,Pablo Huijse,Márcio Catelan
类目: Machine Learning (cs.LG); Solar and Stellar Astrophysics (astro-ph.SR)
*备注:
点击查看摘要
Abstract:Over the last two decades, machine learning models have been widely applied and have proven effective in classifying variable stars, particularly with the adoption of deep learning architectures such as convolutional neural networks, recurrent neural networks, and transformer models. While these models have achieved high accuracy, they require high-quality, representative data and a large number of labelled samples for each star type to generalise well, which can be challenging in time-domain surveys. This challenge often leads to models learning and reinforcing biases inherent in the training data, an issue that is not easily detectable when validation is performed on subsamples from the same catalogue. The problem of biases in variable star data has been largely overlooked, and a definitive solution has yet to be established. In this paper, we propose a new approach to improve the reliability of classifiers in variable star classification by introducing a self-regulated training process. This process utilises synthetic samples generated by a physics-enhanced latent space variational autoencoder, incorporating six physical parameters from Gaia Data Release 3. Our method features a dynamic interaction between a classifier and a generative model, where the generative model produces ad-hoc synthetic light curves to reduce confusion during classifier training and populate underrepresented regions in the physical parameter space. Experiments conducted under various scenarios demonstrate that our self-regulated training approach outperforms traditional training methods for classifying variable stars on biased datasets, showing statistically significant improvements.
[LG-79] Subquadratic Algorithms and Hardness for Attention with Any Temperature
链接: https://arxiv.org/abs/2505.14840
作者: Shreya Gupta,Boyang Huang,Barna Saha,Yinzhan Xu,Christopher Ye
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注: 34 pages, 2 figures, abstract shortened to meet arXiv requirements
点击查看摘要
Abstract:Despite the popularity of the Transformer architecture, the standard algorithm for computing Attention suffers from quadratic time complexity in context length $n$. Alman and Song [NeurIPS 2023] showed that when the head dimension $d = \Theta(\log n)$, subquadratic Attention is possible if and only if the inputs have small entries bounded by $B = o(\sqrt{\log n})$ in absolute values, under the Strong Exponential Time Hypothesis ($\mathsf{SETH}$). Equivalently, subquadratic Attention is possible if and only if the softmax is applied with high temperature for $d = \Theta(\log n)$. Running times of these algorithms depend exponentially on $B$ and thus they do not lead to even a polynomial-time algorithm outside the specific range of $B$. This naturally leads to the question: when can Attention be computed efficiently without strong assumptions on temperature? Are there fast attention algorithms that scale polylogarithmically with entry size $B$? In this work, we resolve this question and characterize when fast Attention for arbitrary temperatures is possible. First, for all constant $d = O(1)$, we give the first subquadratic $\tilde{O}(n^{2 - 1/d} \cdot \mathrm{polylog}(B))$ time algorithm for Attention with large $B$. Our result holds even for matrices with large head dimension if they have low rank. In this regime, we also give a similar running time for Attention gradient computation, and therefore for the full LLM training process. Furthermore, we show that any substantial improvement on our algorithm is unlikely. In particular, we show that even when $d = 2^{\Theta(\log^* n)}$, Attention requires $n^{2 - o(1)}$ time under $\mathsf{SETH}$. Finally, in the regime where $d = \mathrm{poly}(n)$, we show that the standard algorithm is optimal under popular fine-grained complexity assumptions.
[LG-80] Deep Koopman operator framework for causal discovery in nonlinear dynamical systems
链接: https://arxiv.org/abs/2505.14828
作者: Juan Nathaniel,Carla Roesch,Jatan Buch,Derek DeSantis,Adam Rupe,Kara Lamb,Pierre Gentine
类目: Machine Learning (cs.LG)
*备注: 10+14 pages, 10+13 figures
点击查看摘要
Abstract:We use a deep Koopman operator-theoretic formalism to develop a novel causal discovery algorithm, Kausal. Causal discovery aims to identify cause-effect mechanisms for better scientific understanding, explainable decision-making, and more accurate modeling. Standard statistical frameworks, such as Granger causality, lack the ability to quantify causal relationships in nonlinear dynamics due to the presence of complex feedback mechanisms, timescale mixing, and nonstationarity. This presents a challenge in studying many real-world systems, such as the Earth’s climate. Meanwhile, Koopman operator methods have emerged as a promising tool for approximating nonlinear dynamics in a linear space of observables. In Kausal, we propose to leverage this powerful idea for causal analysis where optimal observables are inferred using deep learning. Causal estimates are then evaluated in a reproducing kernel Hilbert space, and defined as the distance between the marginal dynamics of the effect and the joint dynamics of the cause-effect observables. Our numerical experiments demonstrate Kausal’s superior ability in discovering and characterizing causal signals compared to existing approaches of prescribed observables. Lastly, we extend our analysis to observations of El Niño-Southern Oscillation highlighting our algorithm’s applicability to real-world phenomena. Our code is available at this https URL.
[LG-81] Assimilative Causal Inference
链接: https://arxiv.org/abs/2505.14825
作者: Marios Andreou,Nan Chen,Erik Bollt
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Includes the Main Text and Supporting Information in a single document. 39 pages (p. 1–2 Title, Contents and Abstract | p. 3–14 Main Text | p. 15–39 Supporting Information), 9 figures (3 in the Main Text and 6 in the Supporting Information), typeset in LaTeX. Submitted for peer-review. For more info see this https URL
点击查看摘要
Abstract:Causal inference determines cause-and-effect relationships between variables and has broad applications across disciplines. Traditional time-series methods often reveal causal links only in a time-averaged sense, while ensemble-based information transfer approaches detect the time evolution of short-term causal relationships but are typically limited to low-dimensional systems. In this paper, a new causal inference framework, called assimilative causal inference (ACI), is developed. Fundamentally different from the state-of-the-art methods, ACI uses a dynamical system and a single realization of a subset of the state variables to identify instantaneous causal relationships and the dynamic evolution of the associated causal influence range (CIR). Instead of quantifying how causes influence effects as done traditionally, ACI solves an inverse problem via Bayesian data assimilation, thus tracing causes backward from observed effects with an implicit Bayesian hypothesis. Causality is determined by assessing whether incorporating the information of the effect variables reduces the uncertainty in recovering the potential cause variables. ACI has several desirable features. First, it captures the dynamic interplay of variables, where their roles as causes and effects can shift repeatedly over time. Second, a mathematically justified objective criterion determines the CIR without empirical thresholds. Third, ACI is scalable to high-dimensional problems by leveraging computationally efficient Bayesian data assimilation techniques. Finally, ACI applies to short time series and incomplete datasets. Notably, ACI does not require observations of candidate causes, which is a key advantage since potential drivers are often unknown or unmeasured. The effectiveness of ACI is demonstrated by complex dynamical systems showcasing intermittency and extreme events.
[LG-82] Imitation Learning via Focused Satisficing IJCAI2025
链接: https://arxiv.org/abs/2505.14820
作者: Rushit N. Shah,Nikolaos Agadakos,Synthia Sasulski,Ali Farajzadeh,Sanjiban Choudhury,Brian Ziebart
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)
点击查看摘要
Abstract:Imitation learning often assumes that demonstrations are close to optimal according to some fixed, but unknown, cost function. However, according to satisficing theory, humans often choose acceptable behavior based on their personal (and potentially dynamic) levels of aspiration, rather than achieving (near-) optimality. For example, a lunar lander demonstration that successfully lands without crashing might be acceptable to a novice despite being slow or jerky. Using a margin-based objective to guide deep reinforcement learning, our focused satisficing approach to imitation learning seeks a policy that surpasses the demonstrator’s aspiration levels – defined over trajectories or portions of trajectories – on unseen demonstrations without explicitly learning those aspirations. We show experimentally that this focuses the policy to imitate the highest quality (portions of) demonstrations better than existing imitation learning methods, providing much higher rates of guaranteed acceptability to the demonstrator, and competitive true returns on a range of environments.
[LG-83] Text embedding models can be great data engineers
链接: https://arxiv.org/abs/2505.14802
作者: Iman Kazemian,Paritosh Ramanan,Murat Yildirim
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Data engineering pipelines are essential - albeit costly - components of predictive analytics frameworks requiring significant engineering time and domain expertise for carrying out tasks such as data ingestion, preprocessing, feature extraction, and feature engineering. In this paper, we propose ADEPT, an automated data engineering pipeline via text embeddings. At the core of the ADEPT framework is a simple yet powerful idea that the entropy of embeddings corresponding to textually dense raw format representation of time series can be intuitively viewed as equivalent (or in many cases superior) to that of numerically dense vector representations obtained by data engineering pipelines. Consequently, ADEPT uses a two step approach that (i) leverages text embeddings to represent the diverse data sources, and (ii) constructs a variational information bottleneck criteria to mitigate entropy variance in text embeddings of time series data. ADEPT provides an end-to-end automated implementation of predictive models that offers superior predictive performance despite issues such as missing data, ill-formed records, improper or corrupted data formats and irregular timestamps. Through exhaustive experiments, we show that the ADEPT outperforms the best existing benchmarks in a diverse set of datasets from large-scale applications across healthcare, finance, science and industrial internet of things. Our results show that ADEPT can potentially leapfrog many conventional data pipeline steps thereby paving the way for efficient and scalable automation pathways for diverse data science applications.
[LG-84] LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models ACL2025
链接: https://arxiv.org/abs/2505.14759
作者: Yan Wang,Ling Ding,Tien N Nguyen,Shaohua Wang,Yanan Zheng
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted to ACL 2025 main conference
点击查看摘要
Abstract:Large Language Models for code often entail significant computational complexity, which grows significantly with the length of the input code sequence. We propose LeanCode for code simplification to reduce training and prediction time, leveraging code contexts in utilizing attention scores to represent the tokens’ importance. We advocate for the selective removal of tokens based on the average context-aware attention scores rather than average scores across all inputs. LeanCode uses the attention scores of ‘CLS’ tokens within the encoder for classification tasks, such as code search. It also employs the encoder-decoder attention scores to determine token significance for sequence-to-sequence tasks like code summarization. Our evaluation shows LeanCode’s superiority over the SOTAs DietCode and Slimcode, with improvements of 60% and 16% for code search, and 29% and 27% for code summarization, respectively.
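The core pruning step, ranking tokens by the attention the ‘CLS’ token pays them and keeping the top fraction, can be sketched as follows; the tensor shapes and the `keep_ratio` default are illustrative assumptions rather than LeanCode’s exact configuration.

```python
import torch

def prune_tokens_by_cls_attention(hidden, attn, keep_ratio=0.5):
    """Attention-based token pruning sketch (not LeanCode's exact procedure):
    rank tokens by the average attention the [CLS] token pays them, averaged
    over heads, and keep the top fraction while preserving token order.

    hidden: (seq, dim) token embeddings; attn: (heads, seq, seq) attention
    weights with row 0 corresponding to [CLS]."""
    cls_scores = attn[:, 0, :].mean(dim=0)   # (seq,) attention from CLS
    cls_scores[0] = float("inf")             # always keep [CLS] itself
    k = max(1, int(keep_ratio * hidden.size(0)))
    keep = torch.topk(cls_scores, k).indices.sort().values  # preserve order
    return hidden[keep], keep

seq, dim, heads = 16, 32, 4
hidden = torch.randn(seq, dim)
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
pruned, kept = prune_tokens_by_cls_attention(hidden, attn)
print(pruned.shape, kept.tolist())
```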
[LG-85] Large Language Models for Data Synthesis
链接: https://arxiv.org/abs/2505.14752
作者: Yihong Tang,Menglin Kong,Lijun Sun
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.
[LG-86] Cooperative Causal GraphSAGE
链接: https://arxiv.org/abs/2505.14748
作者: Zaifa Xue,Tao Zhang,Tuo Xu,Huaixin Liang,Le Gao
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:GraphSAGE is a widely used graph neural network. The introduction of causal inference has improved its robustness, yielding a variant known as Causal GraphSAGE. However, Causal GraphSAGE focuses on measuring causal weighting among individual nodes while neglecting the cooperative relationships among sampling nodes as a whole. To address this issue, this paper proposes Cooperative Causal GraphSAGE (CoCa-GraphSAGE), which combines cooperative game theory with Causal GraphSAGE. Initially, a cooperative causal structure model is constructed in the case of cooperation based on the graph structure. Subsequently, the Cooperative Causal sampling (CoCa-sampling) algorithm is proposed, employing Shapley values to calculate the cooperative contribution based on causal weights of the node sets. CoCa-sampling guides the selection of nodes with significant cooperative causal effects during the neighborhood sampling process, thus integrating the selected neighborhood features under cooperative relationships, which takes the sampled nodes as a whole and generates more stable target node embeddings. Experiments on publicly available datasets show that the proposed method has comparable classification performance to the compared methods and outperforms them under perturbations, demonstrating the robustness improvement by CoCa-sampling.
[LG-87] The Evolution of Alpha in Finance: Harnessing Human Insight and LLM Agents
链接: https://arxiv.org/abs/2505.14727
作者: Mohammad Rubyet Islam
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The pursuit of alpha, returns that exceed market benchmarks, has undergone a profound transformation, evolving from intuition-driven investing to autonomous, AI-powered systems. This paper introduces a comprehensive five-stage taxonomy that traces this progression across manual strategies, statistical models, classical machine learning, deep learning, and agentic architectures powered by large language models (LLMs). Unlike prior surveys focused narrowly on modeling techniques, this review adopts a system-level lens, integrating advances in representation learning, multimodal data fusion, and tool-augmented LLM agents. The strategic shift from static predictors to context-aware financial agents capable of real-time reasoning, scenario simulation, and cross-modal decision making is emphasized. Key challenges in interpretability, data fragility, governance, and regulatory compliance, areas critical to production deployment, are examined. The proposed taxonomy offers a unified framework for evaluating maturity, aligning infrastructure, and guiding the responsible development of next-generation alpha systems.
[LG-88] Deployment of Traditional and Hybrid Machine Learning for Critical Heat Flux Prediction in the CTF Thermal Hydraulics Code
链接: https://arxiv.org/abs/2505.14701
作者: Aidan Furlong,Xingang Zhao,Robert Salko,Xu Wu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Critical heat flux (CHF) marks the transition from nucleate to film boiling, where heat transfer to the working fluid can rapidly deteriorate. Accurate CHF prediction is essential for efficiency, safety, and preventing equipment damage, particularly in nuclear reactors. Although widely used, empirical correlations frequently exhibit discrepancies in comparison with experimental data, limiting their reliability in diverse operational conditions. Traditional machine learning (ML) approaches have demonstrated the potential for CHF prediction but have often suffered from limited interpretability, data scarcity, and insufficient knowledge of physical principles. Hybrid model approaches, which combine data-driven ML with physics-based models, mitigate these concerns by incorporating prior knowledge of the domain. This study integrated a purely data-driven ML model and two hybrid models (using the Biasi and Bowring CHF correlations) within the CTF subchannel code via a custom Fortran framework. Performance was evaluated using two validation cases: a subset of the Nuclear Regulatory Commission CHF database and the Bennett dryout experiments. In both cases, the hybrid models exhibited significantly lower error metrics in comparison with conventional empirical correlations. The pure ML model remained competitive with the hybrid models. Trend analysis of error parity indicates that ML-based models reduce the tendency for CHF overprediction, improving overall accuracy. These results demonstrate that ML-based CHF models can be effectively integrated into subchannel codes and can potentially increase performance in comparison with conventional methods.
[LG-89] Stochastic Fractional Neural Operators: A Symmetrized Approach to Modeling Turbulence in Complex Fluid Dynamics
链接: https://arxiv.org/abs/2505.14700
作者: Rômulo Damasclin Chaves dos Santos,Jorge Henrique de Oliveira Sales
类目: Machine Learning (cs.LG)
*备注: 17 pages
点击查看摘要
Abstract:In this work, we introduce a new class of neural network operators designed to handle problems where memory effects and randomness play a central role. These operators merge symmetrized activation functions, Caputo-type fractional derivatives, and stochastic perturbations introduced via Itô type noise. The result is a powerful framework capable of approximating functions that evolve over time with both long-term memory and uncertain dynamics. We develop the mathematical foundations of these operators, proving three key theorems of Voronovskaya type. These results describe the asymptotic behavior of the operators, their convergence in the mean-square sense, and their consistency under fractional regularity assumptions. All estimates explicitly account for the influence of the memory parameter \alpha and the noise level \sigma . As a practical application, we apply the proposed theory to the fractional Navier-Stokes equations with stochastic forcing, a model often used to describe turbulence in fluid flows with memory. Our approach provides theoretical guarantees for the approximation quality and suggests that these neural operators can serve as effective tools in the analysis and simulation of complex systems. By blending ideas from neural networks, fractional calculus, and stochastic analysis, this research opens new perspectives for modeling turbulent phenomena and other multiscale processes where memory and randomness are fundamental. The results lay the groundwork for hybrid learning-based methods with strong analytical backing.
[LG-90] Are machine learning interpretations reliable? A stability study on global interpretations
链接: https://arxiv.org/abs/2505.15728
作者: Luqin Gan,Tarek M. Zikry,Genevera I. Allen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 17 pages main text, 5 main text figures. 57 pages in total with Appendix and Bibliography
点击查看摘要
Abstract:As machine learning systems are increasingly used in high-stakes domains, there is a growing emphasis placed on making them interpretable to improve trust in these systems. In response, a range of interpretable machine learning (IML) methods have been developed to generate human-understandable insights into otherwise black box models. With these methods, a fundamental question arises: Are these interpretations reliable? Unlike with prediction accuracy or other evaluation metrics for supervised models, the proximity to the true interpretation is difficult to define. Instead, we ask a closely related question that we argue is a prerequisite for reliability: Are these interpretations stable? We define stability as findings that are consistent or reliable under small random perturbations to the data or algorithms. In this study, we conduct the first systematic, large-scale empirical stability study on popular machine learning global interpretations for both supervised and unsupervised tasks on tabular data. Our findings reveal that popular interpretation methods are frequently unstable, notably less stable than the predictions themselves, and that there is no association between the accuracy of machine learning predictions and the stability of their associated interpretations. Moreover, we show that no single method consistently provides the most stable interpretations across a range of benchmark datasets. Overall, these results suggest that interpretability alone does not warrant trust, and underscores the need for rigorous evaluation of interpretation stability in future work. To support these principles, we have developed and released an open source IML dashboard and Python package to enable researchers to assess the stability and reliability of their own data-driven interpretations and discoveries.
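The stability notion, consistency of global interpretations under small random perturbations, can be illustrated with a simple bootstrap check; the model, importance measure, and rank-correlation metric below are illustrative choices, not the paper's full protocol.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

def interpretation_stability(X, y, n_boot=20, rng=0):
    """Refit the model on bootstrap resamples, record the global
    feature-importance vector each time, and report the average pairwise
    Spearman rank correlation (1.0 = perfectly stable rankings)."""
    r = np.random.default_rng(rng)
    importances = []
    for _ in range(n_boot):
        idx = r.integers(0, len(X), len(X))  # bootstrap resample
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[idx], y[idx])
        importances.append(clf.feature_importances_)
    corrs = [spearmanr(importances[i], importances[j]).correlation
             for i in range(n_boot) for j in range(i + 1, n_boot)]
    return float(np.mean(corrs))

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
print(f"interpretation stability = {interpretation_stability(X, y):.3f}")
```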
[LG-91] Modular Jump Gaussian Processes
链接: https://arxiv.org/abs/2505.15557
作者: Anna R. Flowers,Christopher T. Franck,Mickaël Binois,Chiwoo Park,Robert B. Gramacy
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 18 pages, 12 figures
点击查看摘要
Abstract:Gaussian processes (GPs) furnish accurate nonlinear predictions with well-calibrated uncertainty. However, the typical GP setup has a built-in stationarity assumption, making it ill-suited for modeling data from processes with sudden changes, or “jumps” in the output variable. The “jump GP” (JGP) was developed for modeling data from such processes, combining local GPs and latent “level” variables under a joint inferential framework. But joint modeling can be fraught with difficulty. We aim to simplify by suggesting a more modular setup, eschewing joint inference but retaining the main JGP themes: (a) learning optimal neighborhood sizes that locally respect manifolds of discontinuity; and (b) a new cluster-based (latent) feature to capture regions of distinct output levels on both sides of the manifold. We show that each of (a) and (b) separately leads to dramatic improvements when modeling processes with jumps. In tandem (but without requiring joint inference) that benefit is compounded, as illustrated on real and synthetic benchmark examples from the recent literature.
[LG-92] Machine Learning Derived Blood Input for Dynamic PET Images of Rat Heart
链接: https://arxiv.org/abs/2505.15488
作者: Shubhrangshu Debsarkar,Bijoy Kundu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Dynamic FDG PET imaging studies of n = 52 rats, including 26 control Wistar-Kyoto (WKY) rats and 26 experimental spontaneously hypertensive rats (SHR), were performed using a Siemens microPET and Albira trimodal scanner longitudinally at 1, 2, 3, 5, 9, 12 and 18 months of age. A 15-parameter dual output model correcting for spill over contamination and partial volume effects with peak fitting cost functions was developed for simultaneous estimation of model corrected blood input function (MCIF) and kinetic rate constants for dynamic FDG PET images of rat heart in vivo. Major drawbacks of this model are its dependence on manual annotations for the Image Derived Input Function (IDIF) and manual determination of crucial model parameters to compute MCIF. To overcome these limitations, we performed semi-automated segmentation and then formulated a Long-Short-Term Memory (LSTM) cell network to train and predict MCIF in test data using a concatenation of IDIFs and myocardial inputs and compared them with reference-modeled MCIF. Thresholding along 2D plane slices with two thresholds, with T1 representing high-intensity myocardium and T2 representing lower-intensity rings, was used to segment the area of the LV blood pool. The resultant IDIF and myocardial TACs were used to compute the corresponding reference (model) MCIF for all data sets. The segmented IDIF and the myocardium formed the input for the LSTM network. A k-fold cross validation structure with a 33:8:11 split and 5 folds was utilized to create the model and evaluate the performance of the LSTM network for all datasets. To overcome the sparseness of data as time steps increase, midpoint interpolation was utilized to increase the density of datapoints beyond time = 10 minutes. The model utilizing midpoint interpolation was able to achieve a 56.4% improvement over previous Mean Squared Error (MSE).
[LG-93] Adaptive Temperature Scaling with Conformal Prediction
链接: https://arxiv.org/abs/2505.15437
作者: Nikita Kotelevskii,Mohsen Guizani,Eric Moulines,Maxim Panov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Conformal prediction enables the construction of high-coverage prediction sets for any pre-trained model, guaranteeing that the true label lies within the set with a specified probability. However, these sets do not provide probability estimates for individual labels, limiting their practical use. In this paper, we propose, to the best of our knowledge, the first method for assigning calibrated probabilities to elements of a conformal prediction set. Our approach frames this as an adaptive calibration problem, selecting an input-specific temperature parameter to match the desired coverage level. Experiments on several challenging image classification datasets demonstrate that our method maintains coverage guarantees while significantly reducing expected calibration error.
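One way to realize the idea, finding an input-specific temperature so that the softmax mass on the conformal set matches the desired coverage, is a simple binary search. This sketch assumes the set contains the top-scoring classes, so the set mass is monotone in temperature; it is not the paper's exact estimator.

```python
import numpy as np

def adaptive_temperature(logits, pred_set, target_mass, lo=1e-2, hi=100.0, iters=50):
    """Binary-search an input-specific temperature T so that the softmax mass
    on the conformal prediction set equals the target coverage level
    (hypothetical helper illustrating the idea)."""
    def set_mass(T):
        z = logits / T
        p = np.exp(z - z.max())
        p /= p.sum()
        return p[pred_set].sum()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if set_mass(mid) > target_mass:
            lo = mid  # mass too high -> flatten further with a larger T
        else:
            hi = mid
    T = 0.5 * (lo + hi)
    z = logits / T
    p = np.exp(z - z.max())
    return T, p / p.sum()

logits = np.array([4.0, 2.5, 1.0, -1.0])
pred_set = np.array([0, 1])  # conformal set for this input
T, probs = adaptive_temperature(logits, pred_set, target_mass=0.9)
print(f"T={T:.3f}, set mass={probs[pred_set].sum():.3f}")
```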
[LG-94] Robust Multimodal Learning via Entropy-Gated Contrastive Fusion
链接: https://arxiv.org/abs/2505.15417
作者: Leon Chlon,Maggie Chlon,MarcAntonio M. Awada
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Real-world multimodal systems routinely face missing-input scenarios: a robot loses audio in a factory, or a clinical record omits lab tests at inference time. Standard fusion layers preserve either robustness or calibration, but never both. We introduce Adaptive Entropy-Gated Contrastive Fusion (AECF), a single light-weight layer that (i) adapts its entropy coefficient per instance, (ii) enforces monotone calibration across all modality subsets, and (iii) drives a curriculum mask directly from training-time entropy. On AV-MNIST and MS-COCO, AECF improves masked-input mAP by +18 pp at a 50% drop rate while reducing ECE by up to 200%, yet adds 1% run-time. All back-bones remain frozen, making AECF an easy drop-in layer for robust, calibrated multimodal inference.
[LG-95] Inter-Subject Variance Transfer Learning for EMG Pattern Classification Based on Bayesian Inference
链接: https://arxiv.org/abs/2505.15381
作者: Seitaro Yoneda,Akira Furui
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, 3 tables, accepted at EMBC2024
点击查看摘要
Abstract:In electromyogram (EMG)-based motion recognition, a subject-specific classifier is typically trained with sufficient labeled data. However, this process demands extensive data collection over extended periods, burdening the subject. To address this, utilizing information from pre-training on multiple subjects for the training of the target subject could be beneficial. This paper proposes an inter-subject variance transfer learning method based on a Bayesian approach. This method is founded on the simple hypothesis that while the means of EMG features vary greatly across subjects, their variances may exhibit similar patterns. Our approach transfers variance information, acquired through pre-training on multiple source subjects, to a target subject within a Bayesian updating framework, thereby allowing accurate classification using limited target calibration data. A coefficient was also introduced to adjust the amount of information transferred for efficient transfer learning. Experimental evaluations using two EMG datasets demonstrated the effectiveness of our variance transfer strategy and its superiority compared to existing methods.
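The variance-transfer hypothesis can be sketched with a Gaussian classifier whose class means come from the target subject's small calibration set while variances blend source and target estimates via a transfer coefficient; this is a simplification of the paper's Bayesian updating framework.

```python
import numpy as np

def fit_variance_transfer(X_target, y_target, source_vars, alpha=0.7):
    """Class means from the target's calibration data; per-feature class
    variances blended with variances pre-estimated on source subjects.
    `alpha` is an illustrative stand-in for the transfer coefficient."""
    n_classes = int(y_target.max()) + 1
    means, variances = [], []
    for c in range(n_classes):
        Xc = X_target[y_target == c]
        means.append(Xc.mean(axis=0))
        target_var = Xc.var(axis=0) + 1e-6
        variances.append(alpha * source_vars[c] + (1 - alpha) * target_var)
    return np.array(means), np.array(variances)

def predict(X, means, variances):
    # Gaussian log-likelihood per class with independent features.
    ll = -0.5 * (((X[:, None, :] - means) ** 2) / variances
                 + np.log(2 * np.pi * variances)).sum(axis=2)
    return ll.argmax(axis=1)

rng = np.random.default_rng(0)
# Tiny target calibration set: 2 motion classes, 5 samples each, 2 EMG features.
X_cal = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(3, 1, (5, 2))])
y_cal = np.array([0] * 5 + [1] * 5)
source_vars = np.ones((2, 2))  # variances estimated on source subjects beforehand
means, variances = fit_variance_transfer(X_cal, y_cal, source_vars)
print(predict(np.array([[0.2, -0.1], [2.8, 3.1]]), means, variances))  # -> [0 1]
```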
[LG-96] Policy Testing in Markov Decision Processes
链接: https://arxiv.org/abs/2505.15342
作者: Kaito Ariu,Po-An Wang,Alexandre Proutiere,Kenshi Abe
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem–a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal–matching the instance-specific lower bound on sample complexity–while remaining computationally tractable. We validate our approach with numerical experiments.
[LG-97] Versatile Reservoir Computing for Heterogeneous Complex Networks
链接: https://arxiv.org/abs/2505.15219
作者: Yao Du,Huawei Fan,Xingang Wang
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures
点击查看摘要
Abstract:A new machine learning scheme, termed versatile reservoir computing, is proposed for sustaining the dynamics of heterogeneous complex networks. We show that a single, small-scale reservoir computer trained on time series from a subset of elements is able to replicate the dynamics of any element in a large-scale complex network, though the elements are of different intrinsic parameters and connectivities. Furthermore, by substituting failed elements with the trained machine, we demonstrate that the collective dynamics of the network can be preserved accurately over a finite time horizon. The capability and effectiveness of the proposed scheme are validated on three representative network models: a homogeneous complex network of non-identical phase oscillators, a heterogeneous complex network of non-identical phase oscillators, and a heterogeneous complex network of non-identical chaotic oscillators.
[LG-98] Recognition of Unseen Combined Motions via Convex Combination-based EMG Pattern Synthesis for Myoelectric Control
链接: https://arxiv.org/abs/2505.15218
作者: Itsuki Yazawa,Seitaro Yoneda,Akira Furui
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 8 figures, accepted at IEEE EMBC 2025
点击查看摘要
Abstract:Electromyogram (EMG) signals recorded from the skin surface enable intuitive control of assistive devices such as prosthetic limbs. However, in EMG-based motion recognition, collecting comprehensive training data for all target motions remains challenging, particularly for complex combined motions. This paper proposes a method to efficiently recognize combined motions using synthetic EMG data generated through convex combinations of basic motion patterns. Instead of measuring all possible combined motions, the proposed method utilizes measured basic motion data along with synthetically combined motion data for training. This approach expands the range of recognizable combined motions while minimizing the required training data collection. We evaluated the effectiveness of the proposed method through an upper limb motion classification experiment with eight subjects. The experimental results demonstrated that the proposed method improved the classification accuracy for unseen combined motions by approximately 17%.
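The synthesis step itself is simple: each synthetic combined-motion sample is a convex combination of basic-motion samples. The Dirichlet weight sampling below is an illustrative assumption.

```python
import numpy as np

def synthesize_combined_patterns(basic_patterns, n_synth=100, rng=None):
    """Convex-combination pattern synthesis sketch: each synthetic sample is
    a convex combination of one randomly drawn sample per basic motion, with
    simplex weights from a Dirichlet distribution (an illustrative choice)."""
    rng = np.random.default_rng(rng)
    n_motions = len(basic_patterns)
    synth = []
    for _ in range(n_synth):
        lam = rng.dirichlet(np.ones(n_motions))           # weights on the simplex
        picks = [p[rng.integers(len(p))] for p in basic_patterns]
        synth.append(sum(l * x for l, x in zip(lam, picks)))
    return np.array(synth)

# Two basic motions, 8-channel EMG feature vectors (synthetic placeholder data).
rng = np.random.default_rng(0)
wrist_flex = rng.normal(1.0, 0.1, (50, 8))
hand_close = rng.normal(0.3, 0.1, (50, 8))
combined = synthesize_combined_patterns([wrist_flex, hand_close], n_synth=200)
print(combined.shape)  # (200, 8)
```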
[LG-99] Clustering and Pruning in Causal Data Fusion
链接: https://arxiv.org/abs/2505.15215
作者: Otto Tabell,Santtu Tikka,Juha Karvanen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Data fusion, the process of combining observational and experimental data, can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.
[LG-100] EEG-Based Inter-Patient Epileptic Seizure Detection Combining Domain Adversarial Training with CNN-BiLSTM Network
链接: https://arxiv.org/abs/2505.15203
作者: Rina Tazaki,Tomoyuki Akiyama,Akira Furui
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, accepted at IEEE EMBC 2025
点击查看摘要
Abstract:Automated epileptic seizure detection from electroencephalogram (EEG) remains challenging due to significant individual differences in EEG patterns across patients. While existing studies achieve high accuracy with patient-specific approaches, they face difficulties in generalizing to new patients. To address this, we propose a detection framework combining domain adversarial training with a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM). First, the CNN extracts local patient-invariant features through domain adversarial training, which optimizes seizure detection accuracy while minimizing patient-specific characteristics. Then, the BiLSTM captures temporal dependencies in the extracted features to model seizure evolution patterns. Evaluation using EEG recordings from 20 patients with focal epilepsy demonstrated superior performance over non-adversarial methods, achieving high detection accuracy across different patients. The integration of adversarial training with temporal modeling enables robust cross-patient seizure detection.
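Domain adversarial training of the kind described here typically relies on a gradient reversal layer between the feature extractor and a patient discriminator; below is a minimal sketch in which the dimensions and the `lambd` scaling are illustrative.

```python
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal layer, the standard building block of domain
    adversarial training: identity in the forward pass, negated (scaled)
    gradient in the backward pass, so the feature extractor learns
    patient-invariant features."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DANNHead(nn.Module):
    """Adversarial part of the pipeline (a sketch): CNN features feed both a
    seizure classifier and, through gradient reversal, a patient discriminator."""
    def __init__(self, feat_dim=64, n_patients=20, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.seizure_head = nn.Linear(feat_dim, 2)
        self.patient_head = nn.Linear(feat_dim, n_patients)
    def forward(self, feats):
        seizure_logits = self.seizure_head(feats)
        patient_logits = self.patient_head(GradReverse.apply(feats, self.lambd))
        return seizure_logits, patient_logits

head = DANNHead()
feats = torch.randn(8, 64, requires_grad=True)  # stand-in for CNN features
s, p = head(feats)
(s.sum() + p.sum()).backward()  # gradients flow; the patient branch is reversed
```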
[LG-101] A Linear Approach to Data Poisoning
链接: https://arxiv.org/abs/2505.15175
作者: Diego Granziol,Donald Flynn
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 9 pages, 9 Figures
点击查看摘要
Abstract:We investigate the theoretical foundations of data poisoning attacks in machine learning models. Our analysis reveals that the Hessian with respect to the input serves as a diagnostic tool for detecting poisoning, exhibiting spectral signatures that characterize compromised datasets. We use random matrix theory (RMT) to develop a theory for the impact of poisoning proportion and regularisation on attack efficacy in linear regression. Through QR stepwise regression, we study the spectral signatures of the Hessian in multi-output regression. We perform experiments on deep networks to show experimentally that this theory extends to modern convolutional and transformer networks under the cross-entropy loss. Based on these insights we develop preliminary algorithms to determine if a network has been poisoned and remedies which do not require further training.
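The diagnostic itself, inspecting the spectrum of the Hessian of the loss with respect to the input, is easy to sketch with autograd; which spectral statistics separate poisoned from clean data is the paper's contribution and is not reproduced here.

```python
import torch
from torch.autograd.functional import hessian

def input_hessian_spectrum(model, x, y, loss_fn=torch.nn.functional.mse_loss):
    """Compute the Hessian of the loss with respect to a single input and
    return its eigenvalue spectrum, the raw material for the spectral
    signatures discussed in the paper."""
    def loss_of_input(inp):
        return loss_fn(model(inp), y)
    H = hessian(loss_of_input, x)
    H = H.reshape(x.numel(), x.numel())
    return torch.linalg.eigvalsh(0.5 * (H + H.T))  # symmetrize for stability

model = torch.nn.Sequential(torch.nn.Linear(5, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
x = torch.randn(5)
y = torch.randn(1)
print(input_hessian_spectrum(model, x, y))
```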
[LG-102] Steering Generative Models with Experimental Data for Protein Fitness Optimization
链接: https://arxiv.org/abs/2505.15093
作者: Jason Yang,Wenda Chu,Daniel Khalil,Raul Astudillo,Bruce J. Wittmann,Frances H. Arnold,Yisong Yue
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent developments in steering protein generative models (e.g., diffusion models, language models) offer a promising approach. However, by and large, past studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured by low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages compared to alternatives such as reinforcement learning with protein language models.
[LG-103] Infinite hierarchical contrastive clustering for personal digital envirotyping ML4H2024 ALT
链接: https://arxiv.org/abs/2505.15022
作者: Ya-Yun Huang,Joseph McClernon,Jason A. Oliver,Matthew M. Engelhard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, Machine Learning for Health (ML4H 2024)
点击查看摘要
Abstract:Daily environments have profound influence on our health and behavior. Recent work has shown that digital envirotyping, where computer vision is applied to images of daily environments taken during ecological momentary assessment (EMA), can be used to identify meaningful relationships between environmental features and health outcomes of interest. To systematically study such effects on an individual level, it is helpful to group images into distinct environments encountered in an individual’s daily life; these may then be analyzed, further grouped into related environments with similar features, and linked to health outcomes. Here we introduce infinite hierarchical contrastive clustering to address this challenge. Building on the established contrastive clustering framework, our method a) allows an arbitrary number of clusters without requiring the full Dirichlet Process machinery by placing a stick-breaking prior on predicted cluster probabilities; and b) encourages distinct environments to form well-defined sub-clusters within each cluster of related environments by incorporating a participant-specific prediction loss. Our experiments show that our model effectively identifies distinct personal environments and groups these environments into meaningful environment types. We then illustrate how the resulting clusters can be linked to various health outcomes, highlighting the potential of our approach to advance the envirotyping paradigm.
[LG-104] Convergence of Adam in Deep ReLU Networks via Directional Complexity and Kakeya Bounds
链接: https://arxiv.org/abs/2505.15013
作者: Anupama Sridhar,Alexander Johansen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 9 pages main paper
点击查看摘要
Abstract:First-order adaptive optimization methods like Adam are the default choices for training modern deep neural networks. Despite their empirical success, the theoretical understanding of these methods in non-smooth settings, particularly in Deep ReLU networks, remains limited. ReLU activations create exponentially many region boundaries where standard smoothness assumptions break down. We derive the first $\tilde{O}\bigl(\sqrt{d_{\mathrm{eff}}/n}\bigr)$ generalization bound for Adam in Deep ReLU networks and the first global-optimal convergence for Adam in the non-smooth, non-convex ReLU landscape without a global PL or convexity assumption. Our analysis is based on stratified Morse theory and novel results in Kakeya sets. We develop a multi-layer refinement framework that progressively tightens bounds on region crossings. We prove that the number of region crossings collapses from exponential to near-linear in the effective dimension. Using a Kakeya based method, we give a tighter generalization bound than PAC-Bayes approaches and showcase convergence using a mild uniform low barrier assumption.
[LG-105] LOBSTUR: A Local Bootstrap Framework for Tuning Unsupervised Representations in Graph Neural Networks
链接: https://arxiv.org/abs/2505.14867
作者: So Won Jeong,Claire Donnat
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) are increasingly used in conjunction with unsupervised learning techniques to learn powerful node representations, but their deployment is hindered by their high sensitivity to hyperparameter tuning and the absence of established methodologies for selecting the optimal models. To address these challenges, we propose LOBSTUR-GNN (Local Bootstrap for Tuning Unsupervised Representations in GNNs), a novel framework designed to adapt bootstrapping techniques for unsupervised graph representation learning. LOBSTUR-GNN tackles two main challenges: (a) adapting the bootstrap edge and feature resampling process to account for local graph dependencies in creating alternative versions of the same graph, and (b) establishing robust metrics for evaluating learned representations without ground-truth labels. Using locally bootstrapped resampling and leveraging Canonical Correlation Analysis (CCA) to assess embedding consistency, LOBSTUR provides a principled approach for hyperparameter tuning in unsupervised GNNs. We validate the effectiveness and efficiency of our proposed method through extensive experiments on established academic datasets, showing a 65.9% improvement in the classification accuracy compared to an uninformed selection of hyperparameters. Finally, we deploy our framework on a real-world application, thereby demonstrating its validity and practical utility in various settings. The code is available at this https URL.
[LG-106] Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective
链接: https://arxiv.org/abs/2505.14808
作者: Soo Min Kwon,Alec S. Xu,Can Yaras,Laura Balzano,Qing Qu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:This work aims to demystify the out-of-distribution (OOD) capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. With such a parameterization, we can model distribution shifts as a varying angle between the subspace of the training and testing covariance matrices. We prove that a single-layer linear attention model incurs a test risk with a non-negligible dependence on the angle, illustrating that ICL is not robust to such distribution shifts. However, using this framework, we also prove an interesting property of ICL: when trained on task vectors drawn from a union of low-dimensional subspaces, ICL can generalize to any subspace within their span, given sufficiently long prompt lengths. This suggests that the OOD generalization ability of Transformers may actually stem from the new task lying within the span of those encountered during training. We empirically show that our results also hold for models such as GPT-2, and conclude with (i) experiments on how our observations extend to nonlinear function classes and (ii) results on how LoRA has the ability to capture distribution shifts.
[LG-107] Place Cells as Position Embeddings of Multi-Time Random Walk Transition Kernels for Path Planning
链接: https://arxiv.org/abs/2505.14806
作者: Minglu Zhao,Dehong Xu,Deqian Kong,Wen-Hao Zhang,Ying Nian Wu
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The hippocampus orchestrates spatial navigation through collective place cell encodings that form cognitive maps. We reconceptualize the population of place cells as position embeddings approximating multi-scale symmetric random walk transition kernels: the inner product \langle h(x, t), h(y, t) \rangle = q(y|x, t) represents normalized transition probabilities, where h(x, t) is the embedding at location x, and q(y|x, t) is the normalized symmetric transition probability over time t. The time parameter \sqrt{t} defines a spatial scale hierarchy, mirroring the hippocampal dorsoventral axis. q(y|x, t) defines spatial adjacency between x and y at scale or resolution \sqrt{t}, and the pairwise adjacency relationships (q(y|x, t), \forall x, y) are reduced to individual embeddings (h(x, t), \forall x) that collectively form a map of the environment at scale \sqrt{t}. Our framework employs gradient ascent on q(y|x, t) = \langle h(x, t), h(y, t) \rangle with adaptive scale selection, choosing the time scale with the maximal gradient at each step for trap-free, smooth trajectories. Efficient matrix squaring P_{2t} = P_t^2 builds global representations from local transitions P_1 without memorizing past trajectories, enabling hippocampal preplay-like path planning. This produces robust navigation through complex environments, consistent with hippocampal navigation. Experimental results show that our model captures place cell properties – field size distribution, adaptability, and remapping – while achieving computational efficiency. By modeling collective transition probabilities rather than individual place fields, we offer a biologically plausible, scalable framework for spatial navigation.
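The matrix-squaring construction is simple enough to reproduce in a toy grid world; the sketch below builds dyadic-scale kernels via P_{2t} = P_t^2 and a greedy step rule in the spirit of the adaptive-scale gradient ascent described above. The grid, the scales, and the step rule are illustrative assumptions, not the paper's implementation.

```python
# Multi-scale random-walk kernels on an n x n grid via repeated squaring,
# plus a greedy planner that moves to the neighbor with the largest
# kernel gain toward the goal, taken over all scales.
import numpy as np

n = 10
P = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        s = i * n + j
        nbrs = [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= i + di < n and 0 <= j + dj < n]
        for a, b in nbrs:
            P[s, a * n + b] = 1.0 / len(nbrs)

kernels = [P]
for _ in range(5):                       # scales t = 1, 2, 4, 8, 16, 32
    kernels.append(kernels[-1] @ kernels[-1])

def step(state, goal):
    """Move to the neighbor maximizing the kernel gain toward the goal."""
    nbrs = np.flatnonzero(P[state])
    gains = [max(K[nb, goal] - K[state, goal] for K in kernels) for nb in nbrs]
    return nbrs[int(np.argmax(gains))]

state, goal = 0, n * n - 1
path = [state]
while state != goal and len(path) < 50:
    state = step(state, goal)
    path.append(state)
print("reached goal:", state == goal, "in", len(path) - 1, "steps")
```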
[LG-108] Effective climate policies for major emission reductions of ozone precursors: Global evidence from two decades
链接: https://arxiv.org/abs/2505.14731
作者: Ningning Yao,Huan Xi,Lang Chen,Zhe Song,Jian Li,Yulei Chen,Baocai Guo,Yuanhang Zhang,Tong Zhu,Pengfei Li,Daniel Rosenfeld,John H. Seinfeld,Shaocai Yu
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 30 pages, 12 figures
点击查看摘要
Abstract:Despite policymakers deploying various tools to mitigate emissions of ozone (O₃) precursors, such as nitrogen oxides (NOₓ), carbon monoxide (CO), and volatile organic compounds (VOCs), the effectiveness of policy combinations remains uncertain. We employ an integrated framework that couples structural break detection with machine learning to pinpoint effective interventions across the building, electricity, industrial, and transport sectors, identifying treatment effects as abrupt changes without prior assumptions about policy treatment assignment and timing. Applied to two decades of global O₃ precursor emissions data, we detect 78, 77, and 78 structural breaks for NOₓ, CO, and VOCs, corresponding to cumulative emission reductions of 0.96-0.97 Gt, 2.84-2.88 Gt, and 0.47-0.48 Gt, respectively. Sector-level analysis shows that electricity-sector structural policies cut NOₓ by up to 32.4%, while in buildings, developed countries combined adoption subsidies with carbon taxes to achieve 42.7% CO reductions and developing countries used financing plus fuel taxes to secure 52.3%. VOC abatement peaked at 38.5% when fossil-fuel subsidy reforms were paired with financial incentives. Finally, hybrid strategies merging non-price measures (subsidies, bans, mandates) with pricing instruments delivered up to an additional 10% co-benefit. These findings guide the sequencing and complementarity of context-specific policy portfolios for O₃ precursor mitigation.
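For intuition on the first stage, here is a minimal structural-break detection on a synthetic emissions-like series using the ruptures package; the paper's detector and the machine-learning stage that attributes treatment effects are more involved, so this is a stand-in rather than their pipeline.

```python
# Detect abrupt level shifts (candidate policy interventions) in a
# synthetic emissions series with PELT from the ruptures package.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(1)
series = np.concatenate([rng.normal(10, 0.5, 80),   # pre-policy level
                         rng.normal(7, 0.5, 60),    # first abrupt cut
                         rng.normal(5, 0.5, 60)])   # second intervention
breaks = rpt.Pelt(model="rbf").fit(series).predict(pen=5)
print("detected break indices:", breaks)            # expect roughly [80, 140, 200]
```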
[LG-109] HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity
链接: https://arxiv.org/abs/2505.14725
作者: Xuejun Sun,Yiran Song,Xiaochen Zhou,Ruilie Cai,Yu Zhang,Xinyi Li,Rui Peng,Jialiu Xie,Yuanyuan Yan,Muyao Tang,Prem Lakshmanane,Baiming Zou,James S. Hagood,Raymond J. Pickles,Didong Li,Fei Zou,Xiaojing Zheng
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Respiratory viral infections pose a global health burden, yet the cellular immune responses driving protection or pathology remain unclear. Natural infection cohorts often lack pre-exposure baseline data and structured temporal sampling. In contrast, inoculation and vaccination trials generate insightful longitudinal transcriptomic data. However, the scattering of these datasets across platforms, along with inconsistent metadata and preprocessing procedures, hinders AI-driven discovery. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready, rigorously curated dataset that integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies encompassing over 2.56 million cells. Spanning vaccination, inoculation, and mixed exposures, the dataset includes microarray, bulk RNA-seq, and single-cell RNA-seq from whole blood, PBMCs, and nasal swabs, sourced from GEO, ImmPort, and ArrayExpress. We harmonized subject-level metadata, standardized outcome measures, applied unified preprocessing pipelines with rigorous quality control, and aligned all data to official gene symbols. To demonstrate the utility of HR-VILAGE-3K3M, we performed predictive modeling of vaccine responders and evaluated batch-effect correction methods. Beyond these initial demonstrations, it supports diverse systems immunology applications and benchmarking of feature selection and transfer learning algorithms. Its scale and heterogeneity also make it ideal for pretraining foundation models of the human immune response and for advancing multimodal learning frameworks. As the largest longitudinal transcriptomic resource for human respiratory viral immunization, it provides an accessible platform for reproducible AI-driven research, accelerating systems immunology and vaccine development against emerging viral threats.
[LG-110] Stochastic Processes with Modified Lognormal Distribution Featuring Flexible Upper Tail
链接: https://arxiv.org/abs/2505.14713
作者: Dionissios T. Hristopulos,Anastassia Baxevani,Giorgio Kaniadakis
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 44 pages (36 in Main and 8 in Supplement), 27 figures (20 in Main and 7 in Supplement), 13 tables (9 in Main and 4 in Supplement)
点击查看摘要
Abstract:Asymmetric, non-Gaussian probability distributions are often observed in the analysis of natural and engineering datasets. The lognormal distribution is a standard model for data with skewed frequency histograms and fat tails. However, the lognormal law severely restricts the asymptotic dependence of the probability density and the hazard function for high values. Herein we present a family of three-parameter non-Gaussian probability density functions that are based on generalized kappa-exponential and kappa-logarithm functions and investigate its mathematical properties. These kappa-lognormal densities represent continuous deformations of the lognormal with lighter right tails, controlled by the parameter kappa. In addition, bimodal distributions are obtained for certain parameter combinations. We derive closed-form analytic expressions for the main statistical functions of the kappa-lognormal distribution. For the moments, we derive bounds that are based on hypergeometric functions as well as series expansions. Explicit expressions for the gradient and Hessian of the negative log-likelihood are obtained to facilitate numerical maximum-likelihood estimates of the kappa-lognormal parameters from data. We also formulate a joint probability density function for kappa-lognormal stochastic processes by applying Jacobi’s multivariate theorem to a latent Gaussian process. Estimation of the kappa-lognormal distribution based on synthetic and real data is explored. Furthermore, we investigate applications of kappa-lognormal processes with different covariance kernels in time series forecasting and spatial interpolation using warped Gaussian process regression. Our results are of practical interest for modeling skewed distributions in various scientific and engineering fields.
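The kappa-deformed exponential and logarithm underlying this family are the standard Kaniadakis functions and are easy to write down; the density at the end is only the naive substitution of the kappa-logarithm into the lognormal form, flagged as a guess, since the paper's three-parameter density carries its own normalization.

```python
# Kaniadakis kappa-exponential and kappa-logarithm (mutually inverse),
# plus a hypothetical, unnormalized "kappa-lognormal" shape for intuition.
import numpy as np

def exp_k(x, k):
    return np.exp(x) if k == 0 else (np.sqrt(1 + k**2 * x**2) + k * x) ** (1 / k)

def ln_k(x, k):
    return np.log(x) if k == 0 else (x**k - x**(-k)) / (2 * k)

def kappa_lognormal_shape(x, mu, sigma, k):
    """Hypothetical shape: lognormal with ln replaced by ln_k (unnormalized)."""
    z = (ln_k(x, k) - mu) / sigma
    return np.exp(-0.5 * z**2) / x

x = np.linspace(0.5, 4.0, 4)
print(np.allclose(exp_k(ln_k(x, 0.3), 0.3), x))     # True: exp_k inverts ln_k
print(kappa_lognormal_shape(x, mu=0.0, sigma=1.0, k=0.3))
```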
[LG-111] Towards scalable surrogate models based on Neural Fields for large scale aerodynamic simulations
链接: https://arxiv.org/abs/2505.14704
作者: Giovanni Catalani,Jean Fesquet,Xavier Bertrand,Frédéric Tost,Michael Bauerheim,Joseph Morlier
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces a novel surrogate modeling framework for aerodynamic applications based on Neural Fields. The proposed approach, MARIO (Modulated Aerodynamic Resolution Invariant Operator), addresses non-parametric geometric variability through an efficient shape encoding mechanism and exploits the discretization-invariant nature of Neural Fields. It enables training on significantly downsampled meshes while maintaining consistent accuracy during full-resolution inference. These properties allow for efficient modeling of diverse flow conditions while reducing computational cost and memory requirements compared to traditional CFD solvers and existing surrogate methods. The framework is validated on two complementary datasets that reflect industrial constraints. First, the AirfRANS dataset is a two-dimensional airfoil benchmark with non-parametric shape variations. Performance evaluation of MARIO on this case demonstrates an order-of-magnitude improvement in prediction accuracy over existing methods across velocity, pressure, and turbulent viscosity fields, while accurately capturing boundary layer phenomena and aerodynamic coefficients. Second, the NASA Common Research Model features three-dimensional pressure distributions on a full aircraft surface mesh, with parametric control surface deflections. This configuration confirms MARIO’s accuracy and scalability. Benchmarking against state-of-the-art methods demonstrates that Neural Field surrogates can provide rapid and accurate aerodynamic predictions under the computational and data limitations characteristic of industrial applications.
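To make "discretization-invariant" concrete: a conditional neural field is a coordinate network whose weights are independent of mesh size, so it can be fitted on a downsampled point set and queried at any resolution. The PyTorch sketch below is a minimal stand-in with placeholder sizes and a dummy latent code, not MARIO's actual architecture.

```python
# A coordinate MLP modulated by a per-shape latent code: train on coarse
# points, query at full resolution with the same weights.
import torch
import torch.nn as nn

class CondField(nn.Module):
    def __init__(self, coord_dim=2, latent_dim=16, hidden=64, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coord_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords, z):
        # coords: (N, coord_dim) mesh points; z: (latent_dim,) shape code
        return self.net(torch.cat([coords, z.expand(coords.shape[0], -1)], dim=-1))

field = CondField()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
coarse = torch.rand(256, 2)                         # heavily downsampled mesh
target = torch.sin(coarse.sum(-1, keepdim=True))    # toy field values
z = torch.zeros(16)                                 # would come from a shape encoder
loss = nn.functional.mse_loss(field(coarse, z), target)
loss.backward()
opt.step()
pred = field(torch.rand(65536, 2), z)               # full-resolution query
```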
[LG-112] Global Description of Flutter Dynamics via Koopman Theory
链接: https://arxiv.org/abs/2505.14697
作者: Jiwoo Song,Daning Huang
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper presents a novel parametrization approach for aeroelastic systems utilizing Koopman theory, specifically leveraging the Koopman Bilinear Form (KBF) model. To address the limitations of linear parametric dependence in the KBF model, we introduce the Extended KBF (EKBF) model, which enables a global linear representation of aeroelastic dynamics while capturing stronger nonlinear dependence on, e.g., the flutter parameter. The effectiveness of the proposed methodology is demonstrated through two case studies: a 2D academic example and a panel flutter problem. Results show that EKBF effectively interpolates and extrapolates principal eigenvalues, capturing flutter mechanisms and accurately predicting the flutter boundary even when the data is corrupted by noise. Furthermore, the parameterized isostables and isochrons identified by EKBF provide valuable insights into the nonlinear flutter system.
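Koopman fits of this kind start from an EDMD-style least-squares problem on snapshot pairs; the sketch below shows that baseline with a toy polynomial dictionary. The paper's KBF/EKBF adds bilinear parameter dependence on top, so this is background context rather than their method.

```python
# EDMD: fit a finite-dimensional Koopman matrix K so that
# psi(x_{t+1}) ~= K psi(x_t) in least squares.
import numpy as np

def edmd(X, Y, psi):
    PX, PY = psi(X), psi(Y)
    return np.linalg.lstsq(PX, PY, rcond=None)[0].T   # PY ~= PX @ K.T

psi = lambda S: np.hstack([S, S**2, S**3])            # toy polynomial dictionary
rng = np.random.default_rng(2)
traj = np.cumsum(0.01 * rng.standard_normal((500, 2)), axis=0)
K = edmd(traj[:-1], traj[1:], psi)
print("leading |eigenvalues|:", np.sort(np.abs(np.linalg.eigvals(K)))[-3:])
```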
信息检索
[IR-0] Reranking with Compressed Document Representation
链接: https://arxiv.org/abs/2505.15394
作者: Hervé Déjean,Stéphane Clinchant
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Reranking, the process of refining the output of a first-stage retriever, is often considered computationally expensive, especially with Large Language Models. Borrowing from recent advances in document compression for RAG, we reduce the input size by compressing documents into fixed-size embedding representations. We then teach a reranker to use compressed inputs by distillation. Although based on a billion-parameter model, our reranker trained on these compressed inputs can challenge smaller rerankers in terms of both effectiveness and efficiency, especially for long documents. Given that text compressors are still in their early development stages, we view this approach as promising.
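A minimal sketch of the distillation setup, assuming a student that scores a query embedding against a fixed-size compressed document embedding and regresses a full-input teacher's scores; the module shapes and the MSE objective are assumptions, not the paper's configuration.

```python
# Student reranker over compressed document embeddings, distilled by
# regressing teacher scores produced from the full document text.
import torch
import torch.nn as nn

class CompressedReranker(nn.Module):
    def __init__(self, q_dim=768, d_dim=128, hidden=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(q_dim + d_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, q_emb, doc_emb):
        # q_emb: (B, q_dim) queries; doc_emb: (B, d_dim) compressed documents
        return self.scorer(torch.cat([q_emb, doc_emb], dim=-1)).squeeze(-1)

student = CompressedReranker()
q, d = torch.randn(32, 768), torch.randn(32, 128)
teacher_scores = torch.randn(32)                 # stand-in for a full-text teacher
loss = nn.MSELoss()(student(q, d), teacher_scores)
loss.backward()
```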
[IR-1] Robust Relevance Feedback for Interactive Known-Item Video Search ICMR2025
链接: https://arxiv.org/abs/2505.15128
作者: Zhixin Ma,Chong-Wah Ngo
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: Accepted to ICMR 2025
点击查看摘要
Abstract:Known-item search (KIS) involves only a single search target, making relevance feedback, typically a powerful technique for efficiently identifying multiple positive examples to infer user intent, inapplicable. PicHunter addresses this issue by asking users to select the top-k most similar examples to the unique search target from a displayed set. Under ideal conditions, when the user’s perception aligns closely with the machine’s perception of similarity, consistent and precise judgments can elevate the target to the top position within a few iterations. However, in practical scenarios, expecting users to provide consistent judgments is often unrealistic, especially when the underlying embedding features used for similarity measurements lack interpretability. To enhance robustness, we first introduce a pairwise relative judgment feedback that improves the stability of top-k selections by mitigating the impact of misaligned feedback. Then, we decompose user perception into multiple sub-perceptions, each represented as an independent embedding space. This approach assumes that users may not consistently align with a single representation but are more likely to align with one or several among multiple representations. We develop a predictive user model that estimates the combination of sub-perceptions based on each user feedback instance. The predictive user model is then trained to filter out the misaligned sub-perceptions. Experimental evaluations on the large-scale open-domain dataset V3C indicate that the proposed model can promote over 60% of search targets to the top rank when their initial ranks lie between 10 and 50. Even for targets initially ranked between 1,000 and 5,000, the model achieves a success rate exceeding 40% in promoting them to the top, demonstrating the enhanced robustness of relevance feedback in KIS despite inconsistent feedback.
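The pairwise relative judgment can be read as a Bayesian reweighting of a posterior over candidate targets; below is a toy version with a sigmoid user model, which is an assumed likelihood rather than the paper's exact formulation.

```python
# One round of pairwise relative feedback: the user judged item a to be
# more similar to the hidden target than item b, so each candidate is
# reweighted by how plausibly it explains that judgment.
import numpy as np

def update(posterior, sims_a, sims_b, tau=0.1):
    """sims_x[i]: similarity of shown item x to candidate i (assumed model)."""
    like = 1.0 / (1.0 + np.exp(-(sims_a - sims_b) / tau))
    post = posterior * like
    return post / post.sum()

rng = np.random.default_rng(3)
emb = rng.standard_normal((1000, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
target = 42
a = emb[target] + 0.1 * rng.standard_normal(32)   # shown item close to the target
b = emb[7]                                        # shown item far from the target
post = update(np.full(1000, 1e-3), emb @ a, emb @ b)
print("target rank after one round:", int((post > post[target]).sum()) + 1)
```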
[IR-2] GitHub Repository Complexity Leads to Diminished Web Archive Availability
链接: https://arxiv.org/abs/2505.15042
作者: David Calano,Michele C. Weigle,Michael L. Nelson
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Software is often developed using version control systems, such as Git, and hosted on centralized Web hosts, such as GitHub and GitLab. These Web-hosted software repositories are made available to users in the form of traditional HTML Web pages for each source file and directory, as well as a presentational home page and various descriptive pages. We examined more than 12,000 Web-hosted Git repository project home pages, primarily from GitHub, to measure how well their presentational components are preserved in the Internet Archive, as well as the source trees of the collected GitHub repositories to assess the extent to which their source code has been preserved. We found that more than 31% of the archived repository home pages examined exhibited some form of minor page damage and 1.6% exhibited major page damage. We also found that, of the source trees analyzed, less than 5% of their source files were archived on average, with the majority of repositories having no source files saved in the Internet Archive at all. The concentration of archived source files was highest for those linked directly from repositories’ home pages, at a rate of 14.89% across all available repositories, and dropped off sharply at deeper levels of a repository’s directory tree.
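The availability side of this measurement can be reproduced at small scale with the Internet Archive's public availability endpoint; the sketch below only checks whether a closest snapshot exists for a URL, whereas the study's crawl and page-damage scoring go well beyond this.

```python
# Query the Wayback Machine availability API for repository URLs.
import requests

def wayback_available(url):
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=10)
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

# Example URLs are illustrative, not drawn from the paper's sample.
for repo in ("github.com/torvalds/linux", "github.com/example/missing-repo"):
    print(repo, "->", wayback_available(repo) or "not archived")
```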