This blog post presents the latest paper list retrieved from Arxiv.org on 2025-05-29. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Daily paper data is fetched from Arxiv.org and updated automatically at around 12:00 every day.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2025-05-29)

A total of 745 papers were updated today, including:

  • Natural Language Processing: 149 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 249 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 206 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 262 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

【Quick Read】: This paper targets the overthinking problem of Large Language Models (LLMs): on easy reasoning questions they generate unnecessarily long chain-of-thought (CoT) paths, increasing inference cost and latency. The key to the solution is Auto Long-Short Reasoning (AutoL2S), a dynamic and model-agnostic framework that lets LLMs compress their generated reasoning paths according to question complexity. Its core mechanism trains on data containing both long and short CoT paths together with a special EASY token, so the model itself learns when longer reasoning is necessary and when shorter reasoning suffices, substantially shortening reasoning paths without sacrificing performance.

Link: https://arxiv.org/abs/2505.22662
Authors: Feng Luo,Yu-Neng Chuang,Guanchu Wang,Hoang Anh Duy Le,Shaochen Zhong,Hongyi Liu,Jiayi Yuan,Yang Sui,Vladimir Braverman,Vipin Chaudhary,Xia Hu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The reasoning-capable large language models (LLMs) demonstrate strong performance on complex reasoning tasks but often suffer from overthinking, generating unnecessarily long chain-of-thought (CoT) reasoning paths for easy reasoning questions, thereby increasing inference cost and latency. Recent approaches attempt to address this challenge by manually deciding when to apply long or short reasoning. However, they lack the flexibility to adapt CoT length dynamically based on question complexity. In this paper, we propose Auto Long-Short Reasoning (AutoL2S), a dynamic and model-agnostic framework that enables LLMs to dynamically compress their generated reasoning path based on the complexity of the reasoning question. AutoL2S enables a learned paradigm, in which LLMs themselves can decide when longer reasoning is necessary and when shorter reasoning suffices, by training on data annotated with our proposed method, which includes both long and short CoT paths and a special EASY token. We then use EASY token to indicate when the model can skip generating lengthy CoT reasoning. This proposed annotation strategy can enhance the LLMs’ ability to generate shorter CoT reasoning paths with improved quality after training. Extensive evaluation results show that AutoL2S reduces the length of reasoning generation by up to 57% without compromising performance, demonstrating the effectiveness of AutoL2S for scalable and efficient LLM reasoning.
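As a concrete illustration of the annotation scheme above, here is a minimal Python sketch of how long/short CoT training data with a special EASY token could be laid out; the token string, field names, and `build_example` helper are hypothetical, not taken from the paper's released code.

```python
# Minimal sketch of AutoL2S-style training data construction (format assumed).
# Easy questions pair a short CoT with a special <EASY> token; hard ones keep
# the long CoT, so the model learns when lengthy reasoning can be skipped.
EASY_TOKEN = "<EASY>"  # hypothetical token string

def build_example(question: str, long_cot: str, short_cot: str, is_easy: bool) -> dict:
    """Return one supervised fine-tuning example."""
    if is_easy:
        target = f"{EASY_TOKEN} {short_cot}"  # signal that long CoT is unnecessary
    else:
        target = long_cot
    return {"prompt": question, "completion": target}

examples = [
    build_example("What is 2 + 3?", "First, I need to add...", "2 + 3 = 5.", True),
    build_example("Prove sqrt(2) is irrational.", "Assume sqrt(2) = p/q ...", "", False),
]
print(examples[0]["completion"])  # "<EASY> 2 + 3 = 5."
```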

[NLP-1] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning ACL2025

【Quick Read】: This paper addresses two major limitations of traditional static benchmarks for evaluating Large Language Models (LLMs): predefined test sets lack adaptability to diverse application domains, and standardized evaluation protocols struggle to capture fine-grained assessments of domain-specific knowledge and contextual reasoning. The key to the solution is GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions that fuses dynamic domain-knowledge modeling with progressive reasoning assessment to improve evaluation fidelity and applicability.

Link: https://arxiv.org/abs/2505.22661
Authors: Qingchen Yu,Zifan Zheng,Ding Chen,Simin Niu,Bo Tang,Feiyu Xiong,Zhiyu Li
Affiliations: MemTensor (Shanghai) Technology Co., Ltd.; University of Sydney; Research Institute of China Telecom; Renmin University of China
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025

Abstract:The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains-finance, healthcare, manufacturing, information technology, and education-demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.

[NLP-2] 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

【Quick Read】: This paper addresses the lack of effective long-term spatial-temporal memory modeling in current Large Language Models (LLMs), which makes it difficult for them to plan and act in dynamic, multi-room 3D environments. The key to the solution is the 3DLLM-Mem model, which uses working-memory tokens as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, enabling reasoning and action over long-term memory while keeping the agent focused on task-relevant information and memory-efficient in complex, long-horizon environments.

Link: https://arxiv.org/abs/2505.22657
Authors: Wenbo Hu,Yining Hong,Yanjun Wang,Leison Gao,Zibu Wei,Xingcheng Yao,Nanyun Peng,Yonatan Bitton,Idan Szpektor,Kai-Wei Chang
Affiliations: University of California, Los Angeles; Google Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: demos at: this https URL

Abstract:Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent’s ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represents current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench’s most challenging in-the-wild embodied tasks.
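A minimal sketch of the memory-fusion idea above: working-memory tokens act as attention queries over episodic-memory features. The single-head attention, dimensions, and `fuse_memory` helper are illustrative assumptions, not the paper's actual architecture.

```python
# Working-memory tokens query episodic memory and pull back fused features.
import torch
import torch.nn.functional as F

def fuse_memory(working, episodic):
    """working: (n_work, d) queries; episodic: (n_epi, d) keys/values."""
    d = working.shape[-1]
    attn = F.softmax(working @ episodic.T / d**0.5, dim=-1)  # (n_work, n_epi)
    return attn @ episodic                                   # fused features

fused = fuse_memory(torch.randn(4, 64), torch.randn(128, 64))
print(fused.shape)  # torch.Size([4, 64])
```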

[NLP-3] Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents ICML2025

【Quick Read】: This paper addresses uncertainty quantification for interactions with Large Language Models (LLMs) and chatbot agents, arguing that the traditional dichotomy between aleatoric and epistemic uncertainty breaks down in the open, interactive settings in which LLM agents operate. The key to the solution is three new research directions: underspecification uncertainty, for cases where users do not provide complete information or fully define the task; interactive learning, which asks follow-up questions to reduce uncertainty about the current context; and output uncertainty, which exploits the rich space of language and speech to express uncertainty as more than mere numbers. Together these aim to make LLM agent interactions more transparent, trustworthy, and intuitive.

Link: https://arxiv.org/abs/2505.22655
Authors: Michael Kirchhof,Gjergji Kasneci,Enkelejda Kasneci
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICML 2025

Abstract:Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: Underspecification uncertainties, for when users do not provide all information or define the exact task at the first go, interactive learning, to ask follow-up questions and reduce the uncertainty about the current context, and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.

[NLP-4] The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

【Quick Read】: This paper studies the stability of post-trained Large Language Models (LLMs) under reward noise, a practical concern since real-world reward signals are often imprecise or noisy. The key to the solution is a Reasoning Pattern Reward (RPR) that rewards only key phrases in the reasoning process rather than the correctness of final answers. RPR significantly improves performance on open-ended tasks and effectively calibrates noisy reward models, reducing false negatives and strengthening robustness.

Link: https://arxiv.org/abs/2505.22653
Authors: Ang Lv,Ruobing Xie,Xingwu Sun,Zhanhui Kang,Rui Yan
Affiliations: GSAI, Renmin University of China; Large Language Model Department, Tencent; University of Macau; School of Computer Science, Wuhan University
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract: Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to" - without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at this https URL.
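A toy sketch of the reasoning pattern reward (RPR) idea above: reward the presence of key reasoning phrases without verifying the answer at all. The phrase list and scoring rule are assumptions for illustration.

```python
# RPR-style reward: score a response by how many key reasoning phrases appear,
# with no answer verification involved. Phrases here are illustrative.
KEY_PHRASES = ["first, i need to", "let me check", "therefore"]

def reasoning_pattern_reward(response: str) -> float:
    """Fraction of key reasoning phrases that appear in the response."""
    text = response.lower()
    hits = sum(phrase in text for phrase in KEY_PHRASES)
    return hits / len(KEY_PHRASES)

print(reasoning_pattern_reward("First, I need to factor the number. Therefore, ..."))
# 0.666... regardless of whether the final answer is right
```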

[NLP-5] Sherlock: Self-Correcting Reasoning in Vision-Language Models

【Quick Read】: This paper addresses the challenges faced by reasoning Vision-Language Models (VLMs) on complex multimodal tasks: high sensitivity to reasoning errors, dependence on large volumes of annotated data or accurate verifiers, and poor generalization beyond specific domains. The key to the solution is the Sherlock framework, which enables self-correction and self-improvement through a trajectory-level self-correction objective, a preference-data construction method based on visual perturbation, and preference tuning with a dynamic β. After acquiring self-correction abilities from only 20k randomly sampled annotated examples, the model keeps improving itself without external supervision, significantly boosting performance while reducing reliance on labeled data.

Link: https://arxiv.org/abs/2505.22651
Authors: Yi Ding,Ruqi Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages

Abstract: Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic β for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
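The abstract mentions preference tuning with a dynamic β; the sketch below shows the mechanics of a DPO-style loss that accepts a per-example β. How Sherlock actually schedules β is not specified here, so the schedule in the example is an assumption.

```python
# DPO-style preference loss with a per-example (dynamic) beta.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta):
    """All inputs are 1-D tensors of per-sequence log-probabilities; beta may
    be a scalar or a per-example tensor."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example: a larger beta for pairs built from stronger visual perturbations (assumed rule).
beta = torch.tensor([0.1, 0.5])
loss = dpo_loss(torch.tensor([-10.0, -8.0]), torch.tensor([-11.0, -9.5]),
                torch.tensor([-10.2, -8.1]), torch.tensor([-10.8, -9.0]), beta)
print(loss.item())
```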

[NLP-6] WebDancer: Towards Autonomous Information Seeking Agency

【Quick Read】: This paper targets the deep information seeking and multi-step reasoning required by complex real-world problems. The key to the solution is a cohesive paradigm for building end-to-end agentic information-seeking agents from a data-centric and training-stage perspective, with four key stages: browsing-data construction, trajectory sampling, supervised fine-tuning for an effective cold start, and reinforcement learning for better generalization. Instantiated in WebDancer, a ReAct-based web agent, the framework achieves strong results on the challenging GAIA and WebWalkerQA information-seeking benchmarks, validating the proposed training paradigm.

Link: https://arxiv.org/abs/2505.22648
Authors: Jialong Wu,Baixuan Li,Runnan Fang,Wenbiao Yin,Liwen Zhang,Zhengwei Tao,Dingchu Zhang,Zekun Xi,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectories sampling, (3) supervised fine-tuning for effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in a web agent based on the ReAct, WebDancer. Empirical evaluations on the challenging information seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The codes and demo will be released in this https URL.

[NLP-7] Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

【Quick Read】: This paper investigates whether Large Language Models (LLMs) perform differently when prompted in Simplified versus Traditional Chinese, and the representational harms and downstream decision-making unfairness such disparities could cause. The key to the solution is two benchmark tasks reflecting real-world scenarios: regional term choice (asking the model to name a described item whose name differs between Mainland China and Taiwan) and regional name choice (asking the model to choose whom to hire from a list of names in both Simplified and Traditional Chinese). Auditing 11 leading commercial LLM services and open-source models on these tasks shows that response biases depend on both the task and the prompting language, and the paper traces possible causes to training-data representation, written-character preferences, and tokenization differences between Simplified and Traditional Chinese.

Link: https://arxiv.org/abs/2505.22645
Authors: Hanjia Lyu,Jiebo Luo,Jian Kang,Allison Koenecke
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: To appear in the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25)

Abstract:While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models – spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (this https URL).

[NLP-8] Learning Composable Chains-of-Thought

【Quick Read】: This paper addresses how Large Language Models (LLMs) can achieve compositional generalization of reasoning skills when no labeled chain-of-thought (CoT) data is available for the target compositional task. The key to the solution is minimally modifying the CoT formats of atomic reasoning tasks so they become composable, training "atomic CoT" models on them, and combining those models via multitask learning or model merging for better zero-shot performance on the target composition; the combined model is then further bootstrapped with rejection sampling fine-tuning (RFT) on a small amount of compositional data.

Link: https://arxiv.org/abs/2505.22635
Authors: Fangcong Yin,Zeyu Leo Liu,Liu Leqi,Xi Ye,Greg Durrett
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:A common approach for teaching large language models (LLMs) to reason is to train on chain-of-thought (CoT) traces of in-distribution reasoning problems, but such annotated data is costly to obtain for every problem of interest. We want reasoning models to generalize beyond their training distribution, and ideally to generalize compositionally: combine atomic reasoning skills to solve harder, unseen reasoning tasks. We take a step towards compositional generalization of reasoning skills when addressing a target compositional task that has no labeled CoT data. We find that simply training models on CoT data of atomic tasks leads to limited generalization, but minimally modifying CoT formats of constituent atomic tasks to be composable can lead to improvements. We can train “atomic CoT” models on the atomic tasks with Composable CoT data and combine them with multitask learning or model merging for better zero-shot performance on the target compositional task. Such a combined model can be further bootstrapped on a small amount of compositional data using rejection sampling fine-tuning (RFT). Results on string operations and natural language skill compositions show that training LLMs on Composable CoT outperforms multitask learning and continued fine-tuning baselines within a given training data budget.
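As a schematic of the rejection sampling fine-tuning (RFT) step above: sample several CoT traces per compositional problem, keep only those whose final answer checks out, and fine-tune on the survivors. The toy sampler and answer parser below are stand-ins for real LLM calls.

```python
import random

# Schematic RFT data selection: accept only traces with correct final answers.
def sample_cot(question: str) -> str:
    candidates = [f"{question} -> reverse then uppercase -> 'CBA'",
                  f"{question} -> uppercase then reverse -> 'CBA'",
                  f"{question} -> drop letters -> ''"]
    return random.choice(candidates)  # stand-in for sampling from the model

def final_answer(cot: str) -> str:
    return cot.rsplit("-> ", 1)[-1].strip("'")

def rft_select(problems, n_samples=8):
    kept = []
    for prob in problems:
        for _ in range(n_samples):
            cot = sample_cot(prob["question"])
            if final_answer(cot) == prob["answer"]:   # reject incorrect traces
                kept.append({"prompt": prob["question"], "completion": cot})
    return kept  # these examples would feed supervised fine-tuning

data = rft_select([{"question": "reverse+uppercase 'abc'", "answer": "CBA"}])
print(len(data), "accepted traces")
```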

[NLP-9] Spatial Knowledge Graph-Guided Multimodal Synthesis

【Quick Read】: This paper addresses the weak spatial perception of Multimodal Large Language Models (MLLMs), and in particular how to ensure that synthesized data respects spatial common sense. The key to the solution is SKG2Data, a multimodal data-synthesis method guided by a Spatial Knowledge Graph (SKG): it automatically constructs an SKG that emulates human perception of spatial directions and distances and uses it to guide multimodal data generation, improving MLLMs' spatial perception and reasoning as well as the generalization of the synthesized data.

Link: https://arxiv.org/abs/2505.22633
Authors: Yida Xue,Zhen Bi,Jinnan Yang,Jungang Lou,Huajun Chen,Ningyu Zhang
Affiliations: Zhejiang University; Huzhou University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Ongoing work

Abstract:Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.
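A minimal sketch of the spatial-knowledge-graph idea above: derive direction and distance relations between objects from their coordinates. The discretization rules (left_of/near thresholds) are assumptions, not the paper's exact construction.

```python
import math

# Build SKG-style triples (direction and distance relations) from 2-D positions.
def spatial_triples(objects: dict[str, tuple[float, float]]):
    triples = []
    names = list(objects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (xa, ya), (xb, yb) = objects[a], objects[b]
            direction = "left_of" if xa < xb else "right_of"
            dist = math.hypot(xa - xb, ya - yb)
            rng = "near" if dist < 1.0 else "far_from"   # assumed threshold
            triples += [(a, direction, b), (a, rng, b)]
    return triples

print(spatial_triples({"cup": (0.2, 0.5), "laptop": (0.8, 0.5)}))
# [('cup', 'left_of', 'laptop'), ('cup', 'near', 'laptop')]
```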

[NLP-10] Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs ACL2025

【Quick Read】: This paper examines irrelevant-context hallucinations in Large Language Models (LLMs), where models wrongly integrate misleading contextual cues into their predictions. The key insight is a structured yet flawed mechanism behind these errors, termed class-based (mis)generalization, in which models combine abstract class cues with features extracted from the query or context to derive answers. Interpretability experiments further show that abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, and that feature selection is governed by two competing circuits, one prioritizing direct query-based reasoning and the other incorporating contextual cues, whose relative influence determines the final output.

Link: https://arxiv.org/abs/2505.22630
Authors: Ziling Cheng,Meng Cao,Marc-Antoine Rondeau,Jackie Chi Kit Cheung
Affiliations: Mila – Quebec Artificial Intelligence Institute; McGill University; Canada CIFAR AI Chair
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 (Main Conference)

Abstract:The widespread success of large language models (LLMs) on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term class-based (mis)generalization, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model’s internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits – one prioritizing direct query-based reasoning, the other incorporating contextual cues – whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues – what we term stochastic chameleons.

[NLP-11] Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions

【Quick Read】: This paper addresses how to systematically optimize human annotation effort under a fixed budget so as to increase both the number and the comprehensiveness of dense image-caption annotations. The key to the solution is Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology that cuts redundant work through sequential annotation (each subsequent annotator labels only the visual information previous annotations have not covered) and raises throughput through a multimodal interface that lets annotators speak their annotations.

Link: https://arxiv.org/abs/2505.22627
Authors: Yijun Shen,Delong Chen,Fan Liu,Xingyu Wang,Chuanyi Zhang,Liang Yao,Yuhui Zheng
Affiliations: Hohai University; HKUST; Nanjing University of Information Science and Technology
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: While densely annotated image captions significantly facilitate the learning of robust vision-language alignment, methodologies for systematically optimizing human annotation efforts remain underexplored. We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under fixed budget constraints (e.g., total human annotation time). The framework is built upon two key insights. First, sequential annotation reduces redundant workload compared to conventional parallel annotation, as subsequent annotators only need to annotate the "residual" – the missing visual information that previous annotations have not covered. Second, humans process textual input faster by reading while outputting annotations with much higher throughput via talking; thus a multimodal interface enables optimized efficiency. We evaluate our framework from two aspects: intrinsic evaluations that assess the comprehensiveness of semantic units, obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections; extrinsic evaluation measures the practical usage of the annotated captions in facilitating vision-language alignment. Experiments with eight participants show our Chain-of-Talkers (CoTalk) improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13% vs. 40.52%) over the parallel method.

[NLP-12] Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

【Quick Read】: This paper addresses the slow inference of Diffusion LLMs for non-autoregressive text generation and the quality degradation that occurs under parallel decoding. The key to the solution is a block-wise approximate KV Cache mechanism tailored to bidirectional diffusion models, enabling cache reuse with negligible performance loss, together with a confidence-aware parallel decoding strategy that selectively decodes only tokens above a confidence threshold, mitigating violations of token dependencies and preserving generation quality.

Link: https://arxiv.org/abs/2505.22618
Authors: Chengyue Wu,Hao Zhang,Shuchen Xue,Zhijian Liu,Shizhe Diao,Ligeng Zhu,Ping Luo,Song Han,Enze Xie
Affiliations: The University of Hong Kong; NVIDIA; MIT; Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind autoregressive models due to the lack of Key-Value (KV) Cache and quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
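A sketch of the confidence-aware parallel decoding rule above: at each step, commit only the masked positions whose top-1 probability exceeds a threshold, leaving the rest for later steps. The threshold value and tensor shapes are illustrative assumptions.

```python
import torch

# One confidence-aware parallel decoding step over a masked sequence.
def parallel_decode_step(logits, tokens, mask_id, threshold=0.9):
    """logits: (seq, vocab); tokens: (seq,) with mask_id at undecoded slots."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    still_masked = tokens == mask_id
    commit = still_masked & (conf > threshold)   # decode only confident tokens
    tokens = torch.where(commit, pred, tokens)
    return tokens, commit.sum().item()

vocab, seq, MASK = 100, 6, 0
tokens = torch.full((seq,), MASK)
logits = torch.randn(seq, vocab) * 5
tokens, n = parallel_decode_step(logits, tokens, MASK)
print(n, "tokens committed this step")
```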

[NLP-13] The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

【Quick Read】: This paper targets the collapse of policy entropy, a major obstacle to scaling reinforcement learning (RL) for reasoning with Large Language Models (LLMs). The phenomenon is consistently observed across large-scale RL runs without entropy intervention: policy entropy drops sharply early in training, exploration diminishes, and policy performance saturates. The key to the solution is managing policy entropy to sustain exploration: the paper establishes a transformation equation R = -a*e^H + b linking entropy H to downstream performance R, shows theoretically and empirically that the change in policy entropy is driven by the covariance between action probability and the change in logits, and proposes Clip-Cov and KL-Cov, which control entropy by restricting updates on high-covariance tokens, avoiding collapse and improving performance.

Link: https://arxiv.org/abs/2505.22617
Authors: Ganqu Cui,Yuchen Zhang,Jiacheng Chen,Lifan Yuan,Zhi Wang,Yuxin Zuo,Haozhan Li,Yuchen Fan,Huayu Chen,Weize Chen,Zhiyuan Liu,Hao Peng,Lei Bai,Wanli Ouyang,Yu Cheng,Bowen Zhou,Ning Ding
Affiliations: Shanghai AI Laboratory; Tsinghua University; UIUC; Peking University; Nanjing University; CUHK
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that, the values of covariance term and entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we motivate to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply KL penalty to tokens with high covariances respectively. Experiments show that these methods encourage exploration, thus helping policy escape entropy collapse and achieve better downstream performance.
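A small numerical illustration of the two quantities above: the fitted law R = -a*e^H + b (with its predictable ceiling R = -a + b at H = 0), and the per-token covariance between action probability and logit change that a Clip-Cov-style rule would use to exclude the highest-covariance tokens from updates. All data below is synthetic.

```python
import numpy as np

# (1) The empirical law R = -a * exp(H) + b and its ceiling at H = 0.
a, b = 0.5, 0.9
H = np.linspace(0.0, 1.5, 5)
R = -a * np.exp(H) + b
print("predicted ceiling at H=0:", -a + b)   # R = -a + b

# (2) Per-token covariance between action probability and logit change;
# a Clip-Cov-style mask drops the highest-covariance tokens from the update.
rng = np.random.default_rng(0)
probs = rng.random(1000)                  # action probabilities pi(a|s)
delta_logits = 2.0 * probs - 1.0 + 0.1 * rng.standard_normal(1000)
cov = (probs - probs.mean()) * (delta_logits - delta_logits.mean())
clip_mask = cov < np.quantile(cov, 0.98)  # drop top-2% covariance tokens (assumed cutoff)
print("tokens kept for update:", clip_mask.sum())
```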

[NLP-14] RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

【Quick Read】: This paper addresses inaccurate and incomplete captions in image recaptioning, caused by MLLM hallucinations and missing fine-grained details. The key to the solution is the RICO framework, which refines captions through visual reconstruction: a text-to-image model reconstructs the caption into a reference image, and an MLLM identifies discrepancies between the original and reconstructed images to refine the caption; iterating this process yields more faithful and comprehensive descriptions.

Link: https://arxiv.org/abs/2505.22613
Authors: Yuchi Wang,Yishuo Cai,Shuhuai Ren,Sihan Yang,Linli Yao,Yuanxin Liu,Yuanxing Zhang,Pengfei Wan,Xu Sun
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: code: this https URL

Abstract:Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, further progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforms most baselines by approximately 10% on both CapsBench and CompreCap. Code released at this https URL.

[NLP-15] Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning

【Quick Read】: This paper addresses the many bad cases Large Language Models still exhibit in mathematical reasoning. The key to the solution is the Self-Error-Instruct (SEI) framework: it analyzes bad cases to generate representative error keyphrases, clusters those keyphrases to identify error types, then has an instructor model (GPT-4o) synthesize more generalizable training data via self-instruct, filters for the most effective examples through a one-shot learning step, and fine-tunes the target model, repeating the process iteratively to improve mathematical reasoning.

Link: https://arxiv.org/abs/2505.22591
Authors: Erxin Yu,Jing Li,Ming Liao,Qi Zhu,Boyang Xue,Minghui Xu,Baojun Wang,Lanqing Hong,Fei Mi,Lifeng Shang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures

Abstract:Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model’s (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs’ mathematical reasoning through error generalization.
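A sketch of the error-type discovery step above: embed error keyphrases and cluster them to identify error types. TF-IDF plus KMeans below stand in for whatever representation the paper actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Cluster error keyphrases so that each cluster approximates one error type.
keyphrases = [
    "sign error when moving terms", "dropped a negative sign",
    "misread the question", "answered a different question",
    "arithmetic slip in multiplication", "carried digits incorrectly",
]
X = TfidfVectorizer().fit_transform(keyphrases)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for phrase, label in zip(keyphrases, labels):
    print(label, phrase)
```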

[NLP-16] Precise In-Parameter Concept Erasure in Large Language Models

【Quick Read】: This paper addresses the difficulty of removing undesirable knowledge (such as sensitive information or copyrighted content) that Large Language Models (LLMs) acquire during pretraining; existing methods based on fine-tuning, low-rank adapter training, or fact-level editing are either too coarse, too shallow, or ineffective. The key to the proposed PISCES (Precise In-parameter Suppression for Concept EraSure) is to erase entire concepts by directly editing the directions that encode them in parameter space: a disentangler model decomposes MLP vectors into interpretable features, automated interpretability techniques identify features associated with the target concept, and those features are removed from the model parameters, yielding higher efficacy, specificity, and robustness.

Link: https://arxiv.org/abs/2505.22586
Authors: Yoav Gur-Arieh,Clara Suslik,Yihuai Hong,Fazl Barez,Mor Geva
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
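The geometric core of feature-based in-parameter erasure described above can be sketched as projecting concept-linked feature directions out of an MLP weight vector, v' = v - (v·d)d for each unit direction d. Identifying the right directions is the hard part and is not shown; the helper below is illustrative only.

```python
import numpy as np

# Remove the components of v that lie along concept-linked feature directions.
def erase_directions(v: np.ndarray, directions: list[np.ndarray]) -> np.ndarray:
    for d in directions:
        d = d / np.linalg.norm(d)
        v = v - (v @ d) * d   # project out the concept-linked feature
    return v

rng = np.random.default_rng(1)
v = rng.standard_normal(16)
d = rng.standard_normal(16)
v_new = erase_directions(v, [d])
print(abs(v_new @ (d / np.linalg.norm(d))))  # ~0: component removed
```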

[NLP-17] Less but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts ACL2025

【Quick Read】: This paper addresses how to expand Large Language Models (LLMs) to new languages while preserving proficiency in old ones, without excessive parameter cost or catastrophic forgetting. The key to the solution is an analysis of the language characteristics of different LLM layers and a layer-wise expert allocation algorithm (LayerMoE) that sets the number of new experts per layer according to that layer's cross-lingual representation similarity (the higher the similarity, the fewer the experts), plus a classifier in front of the router network on high-similarity layers to guide the routing of old-language tokens, effectively reducing the number of added experts and mitigating forgetting of old languages.

Link: https://arxiv.org/abs/2505.22582
Authors: Xue Zhang,Yunlong Liang,Fandong Meng,Songming Zhang,Yufeng Chen,Jinan Xu,Jie Zhou
Affiliations: Beijing Jiaotong University; School of Computer Science and Technology, Beijing Jiaotong University; Pattern Recognition Center, WeChat AI, Tencent Inc
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 (Main), 16 pages, 5 figures, 11 tables

Abstract:Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving the proficient ability of old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts and avoid catastrophic forgetting of old languages by routing corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find different layers in LLMs exhibit different representation similarities between languages and then utilize the similarity as the indicator to allocate experts for each layer, i.e., the higher similarity, the fewer experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
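A sketch of the allocation rule above (higher cross-lingual similarity, fewer new experts). The abstract only fixes the monotonic relationship; the inverse-similarity weighting below is an assumed concrete form.

```python
import numpy as np

# Allocate a budget of new experts across layers, inversely to similarity.
def allocate_experts(similarity: np.ndarray, total_new_experts: int) -> np.ndarray:
    """similarity: per-layer old/new-language representation similarity in [0, 1]."""
    need = 1.0 - similarity                 # higher similarity -> fewer experts
    share = need / need.sum()
    alloc = np.floor(share * total_new_experts).astype(int)
    alloc[np.argmax(share)] += total_new_experts - alloc.sum()  # fix rounding
    return alloc

sim = np.array([0.9, 0.7, 0.4, 0.2])        # e.g., 4 layers
print(allocate_experts(sim, 16))            # most experts go to dissimilar layers
```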

[NLP-18] Fusion Steering: Prompt-Specific Activation Control

【Quick Read】: This paper targets the insufficient factual accuracy of Large Language Models (LLMs) on question-answering (QA) tasks. The key to the solution is Fusion Steering, an activation-steering method that dynamically injects prompt-specific activation deltas derived from reference completions combining the ground-truth answer with a model-generated explanation, enabling semantically enriched, example-specific steering. Injection weights are optimized per prompt with Optuna to balance token overlap (factual alignment) and perplexity (a fluency proxy), improving the accuracy and coherence of model outputs.

Link: https://arxiv.org/abs/2505.22572
Authors: Waldemar Chang,Alhassan Yasin
Affiliations: Johns Hopkins University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, 2 tables

Abstract: We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring ≥ 0.6), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.
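A sketch of the per-prompt joint objective above, combining token overlap (factual alignment) with a perplexity-based fluency term, in a form that could be handed to an Optuna study. The weighting scheme and overlap measure are illustrative assumptions.

```python
import math

# Joint objective: weighted token overlap plus inverse-perplexity fluency.
def token_overlap(output: str, reference: str) -> float:
    out, ref = set(output.lower().split()), set(reference.lower().split())
    return len(out & ref) / max(len(ref), 1)

def steering_objective(output: str, reference: str, nll_per_token: float,
                       alpha: float = 0.7) -> float:
    perplexity = math.exp(nll_per_token)
    fluency = 1.0 / perplexity              # higher is more fluent
    return alpha * token_overlap(output, reference) + (1 - alpha) * fluency

score = steering_objective("Paris is the capital of France",
                           "The capital of France is Paris", nll_per_token=1.2)
print(round(score, 3))  # an Optuna trial would maximize this per prompt
```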

[NLP-19] Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

【Quick Read】: This paper addresses the limitation that traditional retrieval-augmented generation (RAG) systems handle single-hop and multi-hop queries separately, restricting their usefulness in real applications. The key to the solution is Agent-UniRAG, a trainable agent framework that solves RAG tasks step by step according to input complexity and handles single-hop and multi-hop queries end to end, improving the effectiveness and interpretability of RAG systems. The work also introduces SynAgent-RAG, a synthetic dataset that enables training the proposed agent framework on small open-source LLMs.

Link: https://arxiv.org/abs/2505.22571
Authors: Hoang Pham,Khac-Hoai Nam Bui
Affiliations: Viettel Artificial Intelligence and Data Services Center; Viettel Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Comments:

Abstract:This paper presents a novel approach for unified retrieval-augmented generation (RAG) systems using the recent emerging large language model (LLM) agent concept. Specifically, Agent LLM, which utilizes LLM as fundamental controllers, has become a promising approach to enable the interpretability of RAG tasks, especially for complex reasoning question-answering systems (e.g., multi-hop queries). Nonetheless, previous works mainly focus on solving RAG systems with either single-hop or multi-hop approaches separately, which limits the application of those approaches to real-world applications. In this study, we propose a trainable agent framework called Agent-UniRAG for unified retrieval-augmented LLM systems, which enhances the effectiveness and interpretability of RAG systems. The main idea is to design an LLM agent framework to solve RAG tasks step-by-step based on the complexity of the inputs, simultaneously including single-hop and multi-hop queries in an end-to-end manner. Furthermore, we introduce SynAgent-RAG, a synthetic dataset to enable the proposed agent framework for small open-source LLMs (e.g., Llama-3-8B). The results show comparable performances with closed-source and larger open-source LLMs across various RAG benchmarks. Our source code and dataset are publicly available for further exploitation.

[NLP-20] Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings

【Quick Read】: This paper asks whether Large Language Models (LLMs) and the human brain converge on similar computational principles, and in particular whether the brain-like patterns observed in LLMs emerge simply from scaling or reflect deeper alignment with the architecture of human language processing. The key to the solution is a systematic study of how hierarchical representations in LLMs align with dynamic neural responses during human sentence comprehension: hierarchical embeddings from 14 publicly available LLMs are compared with fMRI data from participants exposed to a naturalistic narrative, and sentence-level neural prediction models precisely identify the model layers most significantly correlated with brain-region activations.

Link: https://arxiv.org/abs/2505.22563
Authors: Yu Lei,Xingyang Ge,Yi Zhang,Yiming Yang,Bolei Ma
Affiliations: Beijing University of Posts and Telecommunications; Shandong University; FAU Erlangen-Nuremberg; LMU Munich; Munich Center for Machine Learning; Linguistic Science Laboratory, Jiangsu Normal University; Collaborative Innovation Center for Language Ability, Jiangsu Normal University; School of Linguistic Sciences and Arts, Jiangsu Normal University
Subjects: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Understanding whether large language models (LLMs) and the human brain converge on similar computational principles remains a fundamental and important question in cognitive neuroscience and AI. Do the brain-like patterns observed in LLMs emerge simply from scaling, or do they reflect deeper alignment with the architecture of human language processing? This study focuses on the sentence-level neural mechanisms of language models, systematically investigating how hierarchical representations in LLMs align with the dynamic neural responses during human sentence comprehension. By comparing hierarchical embeddings from 14 publicly available LLMs with fMRI data collected from participants, who were exposed to a naturalistic narrative story, we constructed sentence-level neural prediction models to precisely identify the model layers most significantly correlated with brain region activations. Results show that improvements in model performance drive the evolution of representational architectures toward brain-like hierarchies, particularly achieving stronger functional and anatomical correspondence at higher semantic abstraction levels.
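A sketch of a sentence-level encoding analysis like the one above: for each layer's embeddings, fit a ridge regression to voxel responses and keep the layer with the best held-out correlation. Everything below runs on synthetic stand-in data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic setup: fMRI responses are generated from layer 2's embeddings.
rng = np.random.default_rng(0)
n_sent, dim, n_vox, n_layers = 200, 32, 10, 4
layers = [rng.standard_normal((n_sent, dim)) for _ in range(n_layers)]
fmri = layers[2] @ rng.standard_normal((dim, n_vox)) + 0.5 * rng.standard_normal((n_sent, n_vox))

scores = []
for emb in layers:
    Xtr, Xte, ytr, yte = train_test_split(emb, fmri, random_state=0)
    pred = Ridge(alpha=1.0).fit(Xtr, ytr).predict(Xte)
    r = [np.corrcoef(pred[:, v], yte[:, v])[0, 1] for v in range(n_vox)]
    scores.append(np.mean(r))
print("best-aligned layer:", int(np.argmax(scores)))  # recovers layer 2
```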

[NLP-21] ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM ACL2025

【Quick Read】: This paper addresses how to combine knowledge graphs (KGs) with Large Language Models (LLMs) to improve claim verification: most existing methods rely on unstructured text corpora and fail to exploit the structured semantics of KGs, while modern LLMs struggle with multi-step modular pipelines and KG reasoning without adaptation. The key to the solution is the ClaimPKG framework: a lightweight, specialized LLM represents the input claim as pseudo-subgraphs that guide a dedicated subgraph-retrieval module to identify relevant KG subgraphs, and a general-purpose LLM then processes the retrieved subgraphs to produce the final verdict and justification, seamlessly integrating LLM reasoning with structured KG knowledge.

Link: https://arxiv.org/abs/2505.22552
Authors: Hoang Pham,Thanh-Do Nguyen,Khac-Hoai Nam Bui
Affiliations: Viettel Artificial Intelligence and Data Services Center, Viettel Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Accepted by ACL 2025 findings

Abstract:Integrating knowledge graphs (KGs) to enhance the reasoning capabilities of large language models (LLMs) is an emerging research challenge in claim verification. While KGs provide structured, semantically rich representations well-suited for reasoning, most existing verification methods rely on unstructured text corpora, limiting their ability to effectively leverage KGs. Additionally, despite possessing strong reasoning abilities, modern LLMs struggle with multi-step modular pipelines and reasoning over KGs without adaptation. To address these challenges, we propose ClaimPKG, an end-to-end framework that seamlessly integrates LLM reasoning with structured knowledge from KGs. Specifically, the main idea of ClaimPKG is to employ a lightweight, specialized LLM to represent the input claim as pseudo-subgraphs, guiding a dedicated subgraph retrieval module to identify relevant KG subgraphs. These retrieved subgraphs are then processed by a general-purpose LLM to produce the final verdict and justification. Extensive experiments on the FactKG dataset demonstrate that ClaimPKG achieves state-of-the-art performance, outperforming strong baselines in this research field by 9%-12% accuracy points across multiple categories. Furthermore, ClaimPKG exhibits zero-shot generalizability to unstructured datasets such as HoVer and FEVEROUS, effectively combining structured knowledge from KGs with LLM reasoning across various LLM backbones.

[NLP-22] Emotion-o1: Adaptive Long Reasoning for Emotion Understanding in LLMs

【Quick Read】: This paper addresses the limitation that existing emotion-understanding methods rely on fixed-length chain-of-thought (CoT) reasoning and cannot adapt to the varying complexity of emotion tasks. The key to the solution is a task-adaptive reasoning framework that uses DeepSeek-R1 to generate variable-length reasoning chains for different emotion tasks and, combining fine-tuning with reinforcement learning, designs a composite reward function balancing four objectives: prediction accuracy, adaptive control of reasoning depth, structural diversity of reasoning paths, and suppression of repetitive logic. This enables dynamic, context-sensitive inference while letting LLMs autonomously develop deep reasoning capabilities.

Link: https://arxiv.org/abs/2505.22548
Authors: Changhao Song,Yazhou Zhang,Peng Zhang
Affiliations: Tianjin University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Emotion understanding includes basic tasks (e.g., sentiment/emotion classification) and advanced tasks (e.g., sarcasm/humor detection). Current methods rely on fixed-length CoT reasoning, failing to adapt to the varying complexity of emotions. We propose a task-adaptive reasoning framework that employs DeepSeek-R1 to generate variable-length reasoning chains for different emotion tasks. By combining fine-tuning with reinforcement learning, we design a composite reward function that balances four objectives: prediction accuracy, adaptive reasoning depth control, structural diversity in reasoning paths, and suppression of repetitive logic. This approach achieves dynamic context-sensitive inference while enabling LLMs to autonomously develop deep reasoning capabilities. Experimental results demonstrate consistent improvements in both Acc and F1 scores across four tasks: emotion, sentiment, humor, and sarcasm. Notably, peak enhancements reached 3.56% F1 (2.76% Acc) for basic tasks and 37.95% F1 (23.14% Acc) for advanced tasks. Our work bridges rigid CoT reasoning and emotional complexity through adaptive-depth analysis.
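A sketch of a composite reward balancing the four stated objectives (accuracy, adaptive depth control, structural diversity, repetition suppression). The weights and component measures below are illustrative assumptions, not the paper's exact formulation.

```python
# Composite reward: accuracy minus a depth-mismatch penalty, plus a diversity
# bonus, minus a repetition penalty. Weights w are hypothetical.
def composite_reward(correct: bool, depth: int, target_depth: int,
                     distinct_ratio: float, repeat_ratio: float,
                     w=(1.0, 0.3, 0.2, 0.2)) -> float:
    acc = 1.0 if correct else 0.0
    depth_pen = abs(depth - target_depth) / max(target_depth, 1)  # adaptive depth
    return (w[0] * acc - w[1] * depth_pen
            + w[2] * distinct_ratio - w[3] * repeat_ratio)

# e.g., a correct sarcasm answer with a slightly short reasoning chain:
print(composite_reward(True, depth=4, target_depth=6,
                       distinct_ratio=0.8, repeat_ratio=0.1))
```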

[NLP-23] Thinking with Generated Images

【Quick Read】: This paper addresses a limitation of visual reasoning in Large Multimodal Models (LMMs): current models either process only fixed user-provided images or reason purely through text-based chain-of-thought (CoT), lacking the ability to actively think across modalities. The key to the solution is the Thinking with Generated Images paradigm, which lets models natively interleave text and vision by spontaneously generating intermediate visual thinking steps, critiquing their own visual hypotheses, and iteratively refining them as integral parts of the reasoning process.

Link: https://arxiv.org/abs/2505.22525
Authors: Ethan Chern,Zhulin Hu,Steffi Chern,Siqi Kou,Jiadi Su,Yan Ma,Zhijie Deng,Pengfei Liu
Affiliations: Shanghai Jiao Tong University; SII; Fudan University; Generative AI Research Lab (GAIR)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at this https URL.

[NLP-24] Multi-MLLM Knowledge Distillation for Out-of-Context News Detection

【Quick Read】: This paper addresses the limited performance of small multimodal large language models (MLLMs) for detecting out-of-context news in low-resource settings, where existing methods typically require label-rich fine-tuning or expensive API calls to GPT models. The key to the solution is to prompt multiple teacher MLLMs for label predictions and rationales, which together serve as the teachers' knowledge, and to transfer this knowledge to a student MLLM with a two-stage distillation framework: Stage 1 applies LoRA fine-tuning on all training data, and Stage 2 combines LoRA fine-tuning with direct preference optimization (DPO) on the data points where the teachers' predictions conflict, reducing annotation cost and helping the student on harder cases.

Link: https://arxiv.org/abs/2505.22517
Authors: Yimeng Gu,Zhao Tong,Ignacio Castro,Shu Wu,Gareth Tyson
Affiliations: Queen Mary University of London; Institute of Information Engineering, Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; The Hong Kong University of Science and Technology (GZ)
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. Many existing works have leveraged multimodal large language models (MLLMs) for detecting out-of-context news. However, observing the limited zero-shot performance of smaller MLLMs, they generally require label-rich fine-tuning and/or expensive API calls to GPT models to improve the performance, which is impractical in low-resource scenarios. In contrast, we aim to improve the performance of small MLLMs in a more label-efficient and cost-effective manner. To this end, we first prompt multiple teacher MLLMs to generate both label predictions and corresponding rationales, which collectively serve as the teachers’ knowledge. We then introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers’ predictions conflict. This two-stage strategy reduces annotation costs and helps the student model uncover subtle patterns in more challenging cases. Experimental results demonstrate that our approach achieves state-of-the-art performance using less than 10% labeled data.
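A sketch of the Stage-2 data selection above: keep only the examples on which the teacher MLLMs disagree, which are the harder cases used for the second round of LoRA plus DPO fine-tuning.

```python
# Find the indices where teacher predictions conflict.
def conflict_points(teacher_preds: dict[str, list[str]]) -> list[int]:
    """teacher_preds maps teacher name -> per-example label predictions."""
    n = len(next(iter(teacher_preds.values())))
    return [i for i in range(n)
            if len({preds[i] for preds in teacher_preds.values()}) > 1]

preds = {"teacher_a": ["ooc", "ok", "ooc"],
         "teacher_b": ["ooc", "ooc", "ooc"]}
print(conflict_points(preds))  # [1] — teachers disagree on example 1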

[NLP-25] EvolveSearch: An Iterative Self-Evolving Search Agent

【Quick Read】: This paper addresses the challenges Large Language Models (LLMs) face in open-domain web search: supervised fine-tuning (SFT) struggles with data production in open-search domains, while reinforcement learning (RL) converges too quickly, limiting data-utilization efficiency. The key to the solution is EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web-search capabilities without any external human-annotated reasoning data.

Link: https://arxiv.org/abs/2505.22501
Authors: Dingchu Zhang,Yida Zhao,Jialong Wu,Baixuan Li,Wenbiao Yin,Liwen Zhang,Yong Jiang,Yufeng Li,Kewei Tu,Pengjun Xie,Fei Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning struggles with data production in open-search domains, while RL converges quickly, limiting their data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7% over the current state-of-the-art across seven benchmarks, opening the door to self-evolution agentic capabilities in open web search domains.

[NLP-26] Effective Context in Neural Speech Models INTERSPEECH2025

【Quick Read】: This paper addresses the difficulty of quantifying the effective context that modern neural speech models actually use, even though many methods exist to increase the maximum context a model can access. The key to the solution is two proposed methods for measuring effective context, which are used to analyze different speech Transformers, revealing how effective context actually varies across tasks and model architectures.

Link: https://arxiv.org/abs/2505.22487
Authors: Yen Meng,Sharon Goldwater,Hao Tang
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

Abstract:Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use. However, few have attempted to measure how much context these models actually use, i.e., the effective context. Here, we propose two approaches to measuring the effective context, and use them to analyze different speech Transformers. For supervised models, we find that the effective context correlates well with the nature of the task, with fundamental frequency tracking, phone classification, and word classification requiring increasing amounts of effective context. For self-supervised models, we find that effective context increases mainly in the early layers, and remains relatively short – similar to the supervised phone model. Given that these models do not use a long context during prediction, we show that HuBERT can be run in streaming mode without modification to the architecture and without further fine-tuning.

[NLP-27] Fostering Video Reasoning via Next-Event Prediction

【Quick Read】: This paper addresses how to endow Multimodal Large Language Models (MLLMs) with temporal reasoning ability. Existing tasks such as video question answering typically rely on annotations from humans or stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. The key to the solution is next-event prediction (NEP), which uses future video segments as a rich self-supervised signal: the model receives the past frames and must reason temporally to predict a summary of the events in the future frames, thereby strengthening its temporal reasoning.

Link: https://arxiv.org/abs/2505.22457
Authors: Haonan Wang,Hongfu Liu,Xiangyan Liu,Chao Du,Kenji Kawaguchi,Ye Wang,Tianyu Pang
Affiliations: National University of Singapore; Sea AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
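A sketch of next-event prediction (NEP) example construction as described above: split each video's frames into past and future, feed the past frames, and supervise with a summary of future events. The frame lists and summary are schematic placeholders.

```python
# Build one NEP training example from a video's frame sequence.
def build_nep_example(frames: list[str], future_summary: str, split_ratio=0.5):
    cut = int(len(frames) * split_ratio)
    return {
        "input_frames": frames[:cut],      # what the MLLM sees
        "target": future_summary,          # events derived from frames[cut:]
    }

frames = [f"frame_{i:03d}.jpg" for i in range(8)]
ex = build_nep_example(frames, "The person opens the fridge and takes out milk.")
print(len(ex["input_frames"]), "past frames ->", ex["target"])
```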

[NLP-28] Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

【Quick Read】: This paper addresses the reliance of multimodal large language models (MLLMs) on expensive, hard-to-sustain supervised data during post-training. Conventional approaches such as supervised fine-tuning (SFT) and reinforcement learning (RL) are effective but require large amounts of manually annotated multimodal data, which is not sustainable in practice. The key to the solution is MM-UPT, an unsupervised post-training framework requiring no external supervision: it builds on the GRPO algorithm and replaces conventional reward signals with a self-rewarding mechanism based on majority voting, enabling continual self-improvement of the model.
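
The majority-voting self-reward is easy to sketch. Below is a minimal, hypothetical version: each sampled answer is rewarded 1.0 if it matches the most frequent answer in the group, the kind of signal that can then feed GRPO's group-relative advantages. The exact answer-extraction and tie-breaking rules in the paper may differ.

```python
from collections import Counter

def majority_vote_rewards(responses):
    """Self-reward via majority voting: each sampled answer gets 1.0 if it
    matches the most common answer among the samples, else 0.0."""
    answers = [r.strip() for r in responses]
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Toy example: 5 sampled answers to the same question, no ground truth needed.
samples = ["42", "42", "41", "42", "7"]
print(majority_vote_rewards(samples))  # [1.0, 1.0, 0.0, 1.0, 0.0]
```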

Link: https://arxiv.org/abs/2505.22453
Authors: Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Zhongguancun Academy; Lehigh University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data–an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using standard datasets without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at this https URL.

[NLP-29] RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning

【Quick Read】: This paper addresses the high computational cost and underused reasoning capabilities of existing LLM-based evaluation frameworks when deploying trustworthy retrieval-augmented generation (RAG) systems. The key to the solution is the RAG-Zeval framework, which casts faithfulness and correctness evaluation as a rule-guided reasoning task and trains evaluators with reinforcement learning, enabling lightweight models to produce comprehensive, well-grounded assessments in a single pass. It further introduces a ranking-based outcome reward mechanism that uses preference judgments instead of absolute scores, reducing the dependence on precise pointwise reward signals.

Link: https://arxiv.org/abs/2505.22430
Authors: Kun Li, Yunxiang Li, Tianhua Zhang, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models’ reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments with detailed explanation in one-pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval’s superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100 times more parameters. Our approach also exhibits superior interpretability in response evaluation.

[NLP-30] Scaling Reasoning without Attention

【Quick Read】: This paper addresses two core challenges that large language models (LLMs) face on complex reasoning tasks: architectural inefficiency from reliance on Transformers, and the lack of structured fine-tuning methods for high-difficulty domains. The key to the solution is an attention-free language model built on the state space dual (SSD) layers of Mamba-2, which removes the need for self-attention and key-value caching and thus enables fixed-memory, constant-time inference. To strengthen complex reasoning, the authors also propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which constructs pedagogically structured problems via abstract concept selection and rationale-guided generation.

Link: https://arxiv.org/abs/2505.22425
Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong
Affiliations: The University of Hong Kong; Ant Group
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint

Abstract:Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on Livecodebench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.

[NLP-31] Mitigating Overthinking in Large Reasoning Models via Manifold Steering

【Quick Read】: This paper addresses the "overthinking" problem of Large Reasoning Models (LRMs), which shows up as excessive validation loops and redundant deliberation and leads to substantial computational overhead. The key to the solution is a mechanistic-interpretability analysis showing that overthinking is tied to a low-dimensional manifold in the model's activation space, together with a new method called Manifold Steering that projects the steering direction onto this low-dimensional activation manifold. This suppresses the noise introduced by high-dimensional perturbations, substantially reducing the number of output tokens while maintaining or even improving accuracy on mathematical benchmarks.
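
A toy numerical sketch of the projection idea: approximate the low-dimensional manifold with a small orthonormal basis (here a random 8-dimensional basis stands in for one estimated from activations), project the raw steering direction onto it, and subtract a hidden state's component along the cleaned direction. All shapes and the basis are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def project_onto_manifold(direction, basis):
    """Project a steering direction onto the span of a low-dimensional
    manifold basis (rows of `basis` are orthonormal directions)."""
    coeffs = basis @ direction          # coordinates in the manifold
    return basis.T @ coeffs             # back to activation space

def steer(hidden, direction, alpha=1.0):
    """Intervene on an activation by removing its component along the
    (normalized) steering direction, scaled by alpha."""
    d = direction / (np.linalg.norm(direction) + 1e-8)
    return hidden - alpha * (hidden @ d) * d

rng = np.random.default_rng(0)
hidden = rng.normal(size=4096)                       # one token's activation
raw_dir = rng.normal(size=4096)                      # raw "overthinking" direction
basis, _ = np.linalg.qr(rng.normal(size=(4096, 8)))  # toy 8-dim manifold basis
clean_dir = project_onto_manifold(raw_dir, basis.T)  # denoised direction
steered = steer(hidden, clean_dir)
```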

Link: https://arxiv.org/abs/2505.22411
Authors: Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong
Affiliations: Beihang University; Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 7 figures

Abstract:Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first showcase that the tendency of overthinking can be effectively captured by a single direction in the model’s activation space and the issue can be eased by intervening the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noises introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: this https URL.

[NLP-32] Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition

【Quick Read】: This paper addresses the high computational cost and inference latency of existing reasoning-optimized large language models (LLMs). The key to the solution is a two-stage training framework. In Stage 1, the model is fine-tuned via an iterative distillation process with inter-iteration model merging to aggregate complementary knowledge, followed by reinforcement learning (RL) on Ascend clusters under a latency-tolerant scheduler, with a Multi-source Adaptive Reward System (MARS) generating dynamic, task-specific reward signals. Stage 2 introduces a dual-system framework that gives Pangu Embedded a "fast" mode for routine queries and a "slow" mode for complex reasoning, dynamically allocating resources to balance latency against reasoning depth.

Link: https://arxiv.org/abs/2505.22375
Authors: Hanting Chen, Yasheng Wang, Kai Han, Dong Li, Lin Li, Zhenni Bi, Jinpeng Li, Haoyu Wang, Fei Mi, Mingjian Zhu, Bin Wang, Kaikai Song, Yifei Fu, Xu He, Yu Luo, Chong Zhu, Quan He, Xueyu Wu, Wei He, Hailin Hu, Yehui Tang, Dacheng Tao, Xinghao Chen, Yunhe Wang, Other Contributors
Affiliations: Huawei
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This work presents Pangu Embedded, an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs), featuring flexible fast and slow thinking capabilities. Pangu Embedded addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. We propose a two-stage training framework for its construction. In Stage 1, the model is finetuned via an iterative distillation process, incorporating inter-iteration model merging to effectively aggregate complementary knowledge. This is followed by reinforcement learning on Ascend clusters, optimized by a latency-tolerant scheduler that combines stale synchronous parallelism with prioritized data queues. The RL process is guided by a Multi-source Adaptive Reward System (MARS), which generates dynamic, task-specific reward signals using deterministic metrics and lightweight LLM evaluators for mathematics, coding, and general problem-solving tasks. Stage 2 introduces a dual-system framework, endowing Pangu Embedded with a “fast” mode for routine queries and a deeper “slow” mode for complex inference. This framework offers both manual mode switching for user control and an automatic, complexity-aware mode selection mechanism that dynamically allocates computational resources to balance latency and reasoning depth. Experimental results on benchmarks including AIME 2024, GPQA, and LiveCodeBench demonstrate that Pangu Embedded with 7B parameters, outperforms similar-size models like Qwen3-8B and GLM4-9B. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture, highlighting a promising direction for developing powerful yet practically deployable LLM reasoners.

[NLP-33] LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High

【Quick Read】: This paper examines how large language models (LLMs) handle false presuppositions, and in particular whether certain linguistic factors influence their responses to falsely presupposed content. The key to the approach is a systematic methodology grounded in linguistic presupposition analysis, used to study the conditions under which LLMs are more or less sensitive to false presuppositions and thereby reveal their ability to detect and handle misinformation in political contexts.

Link: https://arxiv.org/abs/2505.22354
Authors: Judith Sieker, Clara Lachenmaier, Sina Zarrieß
Affiliations: Bielefeld University
Subjects: Computation and Language (cs.CL)
Comments: 8 pages (including References). Accepted at CogSci 2025

Abstract:This paper examines how LLMs handle false presuppositions and whether certain linguistic factors influence their responses to falsely presupposed content. Presuppositions subtly introduce information as given, making them highly effective at embedding disputable or false information. This raises concerns about whether LLMs, like humans, may fail to detect and correct misleading assumptions introduced as false presuppositions, even when the stakes of misinformation are high. Using a systematic approach based on linguistic presupposition analysis, we investigate the conditions under which LLMs are more or less sensitive to adopting or rejecting false presuppositions. Focusing on political contexts, we examine how factors like linguistic construction, political party, and scenario probability impact the recognition of false presuppositions. We conduct experiments with a newly created dataset and examine three LLMs: OpenAI's GPT-4o, Meta's Llama-3-8B, and MistralAI's Mistral-7B-v03. Our results show that the models struggle to recognize false presuppositions, with performance varying by condition. This study highlights that linguistic presupposition analysis is a valuable tool for uncovering the reinforcement of political misinformation in LLM responses.

[NLP-34] Text2Grad: Reinforcement Learning from Natural Language Feedback MICRO

【Quick Read】: This paper addresses the slow and opaque learning caused by the coarse scalar rewards of conventional RLHF, which mask the fine-grained reasons behind success or failure. The key to the solution is Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. By aligning each feedback phrase with the relevant token spans and converting these alignments into differentiable reward signals, Text2Grad can directly optimize the problematic portions of the model's policy, yielding precise, feedback-conditioned adjustments.
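
The core alignment step can be sketched as turning (phrase, score) critiques into per-token rewards. The naive substring matching below is a stand-in for the paper's feedback-annotation pipeline; only the span-level shape of the reward signal is the point.

```python
def span_rewards(tokens, critiques):
    """Turn (phrase, score) critiques into per-token rewards by aligning each
    feedback phrase to the token span it mentions (naive substring matching)."""
    rewards = [0.0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase, score in critiques:
        words = phrase.lower().split()
        for i in range(len(lowered) - len(words) + 1):
            if lowered[i:i + len(words)] == words:
                for j in range(i, i + len(words)):
                    rewards[j] = score   # span-level, not one scalar per sequence
    return rewards

tokens = "the code leaks the file handle on error".split()
critiques = [("leaks the file handle", -1.0), ("on error", 0.5)]
print(span_rewards(tokens, critiques))
# [0.0, 0.0, -1.0, -1.0, -1.0, -1.0, 0.5, 0.5]
```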

Link: https://arxiv.org/abs/2505.22338
Authors: Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The code for our method is available at this https URL

Abstract:Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model’s policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level reward on answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at this https URL

[NLP-35] Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

【Quick Read】: This paper studies how to improve the chain-of-thought reasoning of multimodal large language models (MLLMs), and in particular how to combine supervised fine-tuning (SFT) with reinforcement learning (RL) effectively. The key to the solution is a two-stage approach: SFT first instills structured chain-of-thought reasoning patterns as a cold start, and RL via the GRPO algorithm then further refines the reasoning ability, outperforming SFT-only and RL-only methods across multiple multimodal reasoning benchmarks.

Link: https://arxiv.org/abs/2505.22334
Authors: Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns–where models exhibit self-correction through reflection–are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at this https URL.

[NLP-36] NLP for Social Good: A Survey of Challenges, Opportunities and Responsible Deployment

【Quick Read】: This paper examines the application of Natural Language Processing (NLP) to pressing societal challenges, arguing that deployment needs greater intentionality and responsibility. The key contribution is a cross-disciplinary analysis of social goals and emerging risks that identifies promising research directions and outlines the challenges that must be addressed to ensure responsible and equitable progress in NLP for Social Good (NLP4SG) research.

Link: https://arxiv.org/abs/2505.22327
Authors: Antonia Karamolegkou, Angana Borah, Eunjung Cho, Sagnik Ray Choudhury, Martina Galletti, Rajarshi Ghosh, Pranav Gupta, Oana Ignat, Priyanka Kargupta, Neema Kotonya, Hemank Lamba, Sun-Joo Lee, Arushi Mangla, Ishani Mondal, Deniz Nazarova, Poli Nemkova, Dina Pisarevskaya, Naquee Rizwan, Nazanin Sabri, Dominik Stammbach, Anna Steinberg, David Tomás, Steven R Wilson, Bowen Yi, Jessica H Zhu, Arkaitz Zubiaga, Anders Søgaard, Alexander Fraser, Zhijing Jin, Rada Mihalcea, Joel R. Tetreault, Daryna Dementieva
Affiliations: University of Copenhagen; University of Michigan-Ann Arbor; ETH Zurich; University of North Texas; Sony Computer Science Laboratories - Paris; University of Rome “La Sapienza”; Algoverse; University of Maryland, College Park; Lowe’s; Santa Clara University; University of Illinois Urbana-Champaign; Dataminr; United Nations Development Programme (UNDP); University of Washington; Queen Mary University of London; IIT Kharagpur; University of California San Diego; Princeton University; LMU Munich; Munich Center for Machine Learning (MCML); University of Alicante; University of Michigan-Flint; University of Southern California; Max Planck Institute for Intelligent Systems, Tübingen; Vector Institute; University of Toronto; Technical University of Munich
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Recent advancements in large language models (LLMs) have unlocked unprecedented possibilities across a range of applications. However, as a community, we believe that the field of Natural Language Processing (NLP) has a growing need to approach deployment with greater intentionality and responsibility. In alignment with the broader vision of AI for Social Good (Tomašev et al., 2020), this paper examines the role of NLP in addressing pressing societal challenges. Through a cross-disciplinary analysis of social goals and emerging risks, we highlight promising research directions and outline challenges that must be addressed to ensure responsible and equitable progress in NLP4SG research.

[NLP-37] Advancing Expert Specialization for Better MoE

【Quick Read】: This paper addresses the expert overlap and overly uniform routing caused by the commonly used auxiliary load-balancing loss in Mixture-of-Experts (MoE) models, which hinder expert specialization and degrade overall performance during post-training. The key to the solution is two complementary objectives: (1) an orthogonality loss that encourages experts to process distinct types of tokens, and (2) a variance loss that promotes more discriminative routing decisions. Both objectives are compatible with the existing auxiliary loss and help optimize training; experiments show that the method substantially improves expert specialization and lifts classic MoE baselines without architectural changes or additional components.
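
A toy PyTorch sketch of what such objectives might look like; these are illustrative forms consistent with the description above, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def specialization_losses(expert_outputs, router_logits):
    """Two auxiliary objectives (illustrative forms):
    - orthogonality: penalize cosine similarity between different experts'
      mean outputs so experts end up handling distinct token types;
    - variance: reward spread in per-token routing logits so routing
      decisions are discriminative rather than uniform."""
    # expert_outputs: [E, T, D]; router_logits: [T, E]
    sig = F.normalize(expert_outputs.mean(dim=1), dim=-1)  # [E, D] signatures
    sim = sig @ sig.T                                      # pairwise cosines
    off_diag = sim - torch.diag(torch.diag(sim))
    orth_loss = (off_diag ** 2).mean()                     # push toward 0
    var_loss = -router_logits.var(dim=-1).mean()           # maximize variance
    return orth_loss, var_loss

E, T, D = 4, 16, 32
orth, var = specialization_losses(torch.randn(E, T, D), torch.randn(T, E))
print(float(orth), float(var))
```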

Link: https://arxiv.org/abs/2505.22323
Authors: Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang
Affiliations: Beijing University of Posts and Telecommunications; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 33 pages, 6 figures

Abstract:Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.

[NLP-38] If Pigs Could Fly… Can LLM s Logically Reason Through Counterfactuals?

【Quick Read】: This paper studies the marked degradation of large language models (LLMs) when reasoning over counterfactual scenarios that conflict with their parametric knowledge. The key to the solution is Self-Segregate, a prompting method that introduces metacognitive awareness before reasoning: the model first explicitly identifies the knowledge conflict. Experiments show this narrows the average performance gap from 27% to 11% while significantly improving overall accuracy.
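
Since Self-Segregate is a prompting method, a sketch is just a template. The wording below is a hypothetical reconstruction of the "identify the conflict first, then reason" structure, not the paper's actual prompt.

```python
def self_segregate_prompt(premise, question):
    """Two-step prompt: first ask the model to flag conflicts with its own
    knowledge, then reason strictly from the stated (possibly counterfactual)
    premise. The exact wording is an illustrative guess."""
    return (
        "Step 1: State whether the premise below conflicts with your "
        "factual knowledge (answer CONFLICT or NO-CONFLICT).\n"
        "Step 2: Regardless of Step 1, reason strictly from the premise.\n\n"
        f"Premise: {premise}\nQuestion: {question}\nAnswer:"
    )

print(self_segregate_prompt("All pigs can fly.", "Can a piglet fly?"))
```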

Link: https://arxiv.org/abs/2505.22318
Authors: Ishwar B Balappanawar, Vamshi Krishna Bonagiri, Anish R Joishy, Manas Gaur, Krishnaprasad Thirunarayan, Ponnurangam Kumaraguru
Affiliations: IIIT Hyderabad; University of Maryland, Baltimore County; Wright State University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures

Abstract:Large Language Models (LLMs) demonstrate impressive reasoning capabilities in familiar contexts, but struggle when the context conflicts with their parametric knowledge. To investigate this phenomenon, we introduce CounterLogic, a dataset containing 1,800 examples across 9 logical schemas, explicitly designed to evaluate logical reasoning through counterfactual (hypothetical knowledge-conflicting) scenarios. Our systematic evaluation of 11 LLMs across 6 different datasets reveals a consistent performance degradation, with accuracies dropping by 27% on average when reasoning through counterfactual information. We propose Self-Segregate, a prompting method enabling metacognitive awareness (explicitly identifying knowledge conflicts) before reasoning. Our method dramatically narrows the average performance gaps from 27% to just 11%, while significantly increasing the overall accuracy (+7.5%). We discuss the implications of these findings and draw parallels to human cognitive processes, particularly on how humans disambiguate conflicting information during reasoning tasks. Our findings offer practical insights for understanding and enhancing LLMs reasoning capabilities in real-world applications, especially where models must logically reason independently of their factual knowledge.

[NLP-39] Skywork Open Reasoner 1 Technical Report

【Quick Read】: This paper studies how to improve the reasoning ability of large language models (LLMs) through reinforcement learning (RL), with a focus on optimizing long chain-of-thought (CoT) models. The key to the solution is Skywork-OR1, an effective and scalable RL implementation built on the DeepSeek-R1-Distill model series, which delivers notable average-accuracy gains across benchmarks. The work also analyzes the entropy-collapse phenomenon in depth and shows that mitigating premature entropy collapse further improves test performance.

Link: https://arxiv.org/abs/2505.22312
Authors: Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.

[NLP-40] Adaptive Detoxification: Safeguarding General Capabilities of LLM s through Toxicity-Aware Knowledge Editing ACL2025

【Quick Read】: This paper addresses the vulnerability of large language models (LLMs) to malicious prompts and jailbreak attacks, along with two main weaknesses of existing knowledge-editing approaches to detoxification: reliance on entity-specific localization, which fails on adversarial inputs without explicit entities, and over-editing, where detoxified models reject legitimate queries and overall performance suffers. The key to the solution is ToxEdit, a toxicity-aware knowledge-editing method that dynamically detects toxic activation patterns during the forward pass and routes computation through adaptive inter-layer pathways to mitigate toxicity, achieving precise detoxification while preserving the LLM's general capabilities.

Link: https://arxiv.org/abs/2505.22298
Authors: Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun Yu, Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen, China; Nanyang Technological University, Singapore; Peng Cheng Laboratory, China
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Findings

Abstract:Large language models (LLMs) exhibit impressive language capabilities but remain vulnerable to malicious prompts and jailbreaking attacks. Existing knowledge editing methods for LLM detoxification face two major challenges. First, they often rely on entity-specific localization, making them ineffective against adversarial inputs without explicit entities. Second, these methods suffer from over-editing, where detoxified models reject legitimate queries, compromising overall performance. In this paper, we propose ToxEdit, a toxicity-aware knowledge editing approach that dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively. This design ensures precise toxicity mitigation while preserving LLMs’ general capabilities. To more accurately assess over-editing, we also enhance the SafeEdit benchmark by incorporating instruction-following evaluation tasks. Experimental results on multiple LLMs demonstrate that our ToxEdit outperforms previous state-of-the-art methods in both detoxification performance and safeguarding general capabilities of LLMs.

[NLP-41] 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training

【Quick Read】: This paper targets efficient parallel computation in large-scale language model training, in particular improving training efficiency by adding sequence parallelism. The key lies in an in-depth analysis and optimized implementation of the different sequence-parallel modes in 360-LLaMA-Factory, reducing compute resource consumption while preserving model performance.

Link: https://arxiv.org/abs/2505.22296
Authors: Haosheng Zou, Xiaowei Lv, Shousheng Jia, Xiangzheng Zhang
Affiliations: Qiyuan Tech; Renmin University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: code at this https URL

Abstract:Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at this https URL. 360-LLaMA-Factory has received wide recognition and is used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, as well as in large companies’ training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.

[NLP-42] Compensating for Data with Reasoning : Low-Resource Machine Translation with LLM s

【Quick Read】: This paper addresses the challenges of machine translation for low-resource languages (here, two Ladin variants paired with Italian), where traditional neural systems underperform and prompt engineering has clear limits. The key to the solution is Fragment-Shot Prompting, a new in-context learning method that segments the input and retrieves translation examples based on syntactic coverage, thereby improving translation quality; it also introduces Pivoted Fragment-Shot, an extension that enables effective translation even without direct parallel data.
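
A minimal sketch of the retrieval idea: score candidate translation pairs by how well their source side covers fragments of the input, then build a few-shot prompt from the top pairs. Word overlap is used here as a crude stand-in for the paper's syntactic coverage, and the toy corpus strings are placeholders.

```python
def fragment_shot_examples(source, corpus, k=3):
    """Pick the k translation pairs whose source side covers the most of the
    input (word overlap here, a simplification of syntactic coverage)."""
    src_words = set(source.lower().split())
    def coverage(pair):
        return len(src_words & set(pair[0].lower().split())) / max(len(src_words), 1)
    return sorted(corpus, key=coverage, reverse=True)[:k]

# Placeholder parallel corpus: (source phrase, target phrase) pairs.
corpus = [("source phrase one two", "target phrase one two"),
          ("source phrase three",   "target phrase three"),
          ("unrelated sentence",    "unrelated translation")]
shots = fragment_shot_examples("source phrase two", corpus, k=2)
prompt = "\n".join(f"{s} => {t}" for s, t in shots) + "\nsource phrase two => "
print(prompt)
```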

Link: https://arxiv.org/abs/2505.22293
Authors: Samuel Frontull, Thomas Ströhle
Affiliations: University of Innsbruck
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in multilingual machine translation, sometimes even outperforming traditional neural systems. However, previous research has highlighted the challenges of using LLMs, particularly with prompt engineering, for low-resource languages. In this work, we introduce Fragment-Shot Prompting, a novel in-context learning method that segments input and retrieves translation examples based on syntactic coverage, along with Pivoted Fragment-Shot, an extension that enables translation without direct parallel data. We evaluate these methods using GPT-3.5, GPT-4o, o1-mini, LLaMA-3.3, and DeepSeek-R1 for translation between Italian and two Ladin variants, revealing three key findings: (1) Fragment-Shot Prompting is effective for translating into and between the studied low-resource languages, with syntactic coverage positively correlating with translation quality; (2) Models with stronger reasoning abilities make more effective use of retrieved knowledge, generally produce better translations, and enable Pivoted Fragment-Shot to significantly improve translation quality between the Ladin variants; and (3) prompt engineering offers limited, if any, improvements when translating from a low-resource to a high-resource language, where zero-shot prompting already yields satisfactory results. We publicly release our code and the retrieval corpora.

[NLP-43] Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling

【Quick Read】: This paper addresses the limited performance of large language models (LLMs) on very hard reasoning tasks, observing that conventional evaluations rely on simple in-context learning examples and thus fail to elicit the models' full reasoning potential. The key to the solution is combining in-context search with test-time scaling: by strengthening the model's internal scaling ability, this combination yields transformative performance breakthroughs on tasks previously deemed "unsolvable."

Link: https://arxiv.org/abs/2505.22290
Authors: Fanzeng Xia, Yidong Luo, Tinko Sebastian Bartels, Yaqi Xu, Tongxin Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent research has highlighted that Large Language Models (LLMs), even when trained to generate extended long reasoning steps, still face significant challenges on hard reasoning problems. However, much of the existing literature relies on direct prompting with simple in-context learning examples for evaluation, which largely overlooks advanced techniques to elicit LLMs’ deliberate reasoning before drawing conclusions that LLMs hit a performance ceiling. In this paper, we systematically explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by employing advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs on tasks previously deemed “unsolvable” (e.g., reported success rates below 5%). We provide both empirical results and theoretical analysis of how this combination can unleash LLM reasoning capabilities: i) Empirically, on controlled NP-hard tasks and complex real-world planning benchmarks, our approach achieves up to a 30x improvement in success rates compared to previously reported results without any external mechanisms; ii) Theoretically, we show that in-context search prompting, when combined with internal scaling, significantly extends the complexity class of solvable reasoning problems. These findings challenge prevailing assumptions about the limitations of LLMs on complex tasks, indicating that current evaluation paradigms systematically underestimate their true potential. Our work calls for a critical reassessment of how LLM reasoning is benchmarked and a more robust evaluation strategy that fully captures the true capabilities of contemporary LLMs, which can lead to a better understanding of their operational reasoning boundaries in real-world deployments.

[NLP-44] Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review ACL2025

【Quick Read】: This paper addresses the inefficiency of acquiring, appraising, synthesizing, and disseminating evidence in evidence-based medicine (EBM), caused by the sheer volume and rapid growth of the medical literature and the high cost of manual curation. The key to the solution is leveraging Natural Language Processing (NLP) to automatically identify, appraise, synthesize, summarize, and disseminate EBM-related evidence, improving the efficiency and accuracy of clinical decision-making.

Link: https://arxiv.org/abs/2505.22280
Authors: Zihan Xu, Haotian Ma, Gongbo Zhang, Yihao Ding, Chunhua Weng, Yifan Peng
Affiliations: Weill Cornell Medicine; Columbia University; University of Sydney
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2025 Findings

Abstract:Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate evidence in EBM. This survey presents an in-depth review of 129 research studies on leveraging NLP for EBM, illustrating its pivotal role in enhancing clinical decision-making processes. The paper systematically explores how NLP supports the five fundamental steps of EBM – Ask, Acquire, Appraise, Apply, and Assess. The review not only identifies current limitations within the field but also proposes directions for future research, emphasizing the potential for NLP to revolutionize EBM by refining evidence extraction, evidence synthesis, appraisal, summarization, enhancing data comprehensibility, and facilitating a more efficient clinical workflow.

[NLP-45] Comprehensive Evaluation on Lexical Normalization: Boundary-Aware Approaches for Unsegmented Languages

【Quick Read】: This paper tackles lexical normalization of informal expressions in user-generated text, with a focus on the challenges this poses in unsegmented languages. The key to the solution is developing normalization methods based on state-of-the-art pretrained models and validating them on a newly created large-scale, multi-domain Japanese normalization dataset, showing that both encoder-only and decoder-only approaches perform well in both accuracy and efficiency.

Link: https://arxiv.org/abs/2505.22273
Authors: Shohei Higashiyama, Masao Utiyama
Affiliations: National Institute of Information and Communications Technology
Subjects: Computation and Language (cs.CL)
Comments: 23 pages

Abstract:Lexical normalization research has sought to tackle the challenge of processing informal expressions in user-generated text, yet the absence of comprehensive evaluations leaves it unclear which methods excel across multiple perspectives. Focusing on unsegmented languages, we make three key contributions: (1) creating a large-scale, multi-domain Japanese normalization dataset, (2) developing normalization methods based on state-of-the-art pretrained models, and (3) conducting experiments across multiple evaluation perspectives. Our experiments show that both encoder-only and decoder-only approaches achieve promising results in both accuracy and efficiency.

[NLP-46] Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models

【Quick Read】: This paper addresses the vulnerability of large language models (LLMs), including their multimodal variants, to jailbreak attacks that induce content violating safety policies. Existing defenses are usually tailored to specific attack types and lack generality and adaptability. The proposed solution is a universal defense framework called Test-time IMmunization (TIM), whose key is an adaptive mechanism for dynamically defending against diverse jailbreak attacks: TIM first trains a gist token for efficient detection, then uses it at inference time to identify jailbreak activity and applies safety fine-tuning to improve model safety, while decoupling the fine-tuning process from the detection module to avoid performance degradation.

Link: https://arxiv.org/abs/2505.22271
Authors: Yongcan Yu, Yanbo Wang, Ran He, Jian Liang
Affiliations: NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under Review

Abstract:While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks. Various defense methods are proposed to defend against jailbreak attacks, however, they are often tailored to specific types of jailbreak attacks, limiting their effectiveness against diverse adversarial strategies. For instance, rephrasing-based defenses are effective against text adversarial jailbreaks but fail to counteract image-based attacks. To overcome these limitations, we propose a universal defense framework, termed Test-time IMmunization (TIM), which can adaptively defend against various jailbreak attacks in a self-evolving way. Specifically, TIM initially trains a gist token for efficient detection, which it subsequently applies to detect jailbreak activities during inference. When jailbreak attempts are identified, TIM implements safety fine-tuning using the detected jailbreak instructions paired with refusal answers. Furthermore, to mitigate potential performance degradation in the detector caused by parameter updates during safety fine-tuning, we decouple the fine-tuning process from the detection module. Extensive experiments on both LLMs and multimodal LLMs demonstrate the efficacy of TIM.

[NLP-47] MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps

【Quick Read】: This paper targets SemEval 2025 Task 8: the Question-Answering over Tabular Data challenge. The core of the solution is using large language models (LLMs) to generate Python code that interacts with the tabular data to obtain the answer. The key steps are understanding the table contents, generating natural-language instructions, translating those instructions into code, and executing the code while handling potential errors or exceptions. The approach uses open-source LLMs with finely optimized prompts for each step and achieved a score of 70.50% on subtask 1.
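
The pipeline amounts to a generate-execute-repair loop, sketched below with a stubbed `llm` callable; the prompt wording and the retry policy are assumptions for illustration, not the team's exact setup.

```python
import traceback
import pandas as pd

def answer_with_code(table, question, llm, max_retries=2):
    """Ask an LLM for pandas code, execute it, and feed errors back for repair.
    `llm(prompt) -> str` is a stub for whatever model serves the pipeline."""
    prompt = (f"Table columns: {list(table.columns)}\nQuestion: {question}\n"
              "Write Python using dataframe `df`; store the answer in `result`.")
    for _ in range(max_retries + 1):
        code = llm(prompt)
        scope = {"df": table}
        try:
            exec(code, scope)              # run the generated snippet
            return scope.get("result")
        except Exception:                  # append the traceback and retry
            prompt += f"\nYour code failed:\n{traceback.format_exc()}\nFix it."
    return None

# Usage with a fake LLM that always returns the same snippet.
df = pd.DataFrame({"name": ["a", "b"], "age": [3, 5]})
fake_llm = lambda p: "result = int(df['age'].max())"
print(answer_with_code(df, "What is the max age?", fake_llm))  # 5
```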

Link: https://arxiv.org/abs/2505.22264
Authors: Maximiliano Hormazábal Lagos, Álvaro Bueno Saez, Héctor Cerezo-Costas, Pedro Alonso Doval, Jorge Alcalde Vesteiro
Affiliations: Fundación Centro Tecnolóxico de Telecomunicacións de Galicia (GRADIANT)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 7 pages, 6 tables

Abstract:In this paper we present our approach to solving the SemEval 2025 Task 8: Question-Answering over Tabular Data challenge. Our strategy leverages Python code generation with LLMs to interact with the table and get the answer to the questions. The process is composed of multiple steps: understanding the content of the table, generating natural language instructions in the form of steps to follow in order to get the answer, translating these instructions to code, running it and handling potential errors or exceptions. These steps use open source LLMs and fine grained optimized prompts for each task (step). With this approach, we achieved a score of 70.50% for subtask 1.

[NLP-48] Train Sparse Autoencoders Efficiently by Utilizing Features Correlation

【Quick Read】: This paper addresses the computational and memory inefficiency of training sparse autoencoders (SAEs) on large language models, especially with large dictionary sizes. The key to the solution is the KronSAE architecture, which factorizes the latent representation via a Kronecker product decomposition, sharply reducing memory and compute overhead. It also introduces mAND, a differentiable activation function approximating the binary AND operation, to improve interpretability and performance within the factorized framework.
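
A minimal sketch of the two ideas: an encoder whose weight is an implicit Kronecker product (so the full matrix is never materialized), and a smooth AND-like gate. The specific mAND form shown (a product of sigmoids) is one plausible reading of "differentiable approximation of binary AND," not the paper's definition.

```python
import torch

class KronEncoder(torch.nn.Module):
    """Encoder whose weight is an implicit Kronecker product A ⊗ B, mapping a
    d = d1*d2 input to an m = m1*m2 latent without materializing the m×d matrix."""
    def __init__(self, d1, d2, m1, m2):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(m1, d1) / d1 ** 0.5)
        self.B = torch.nn.Parameter(torch.randn(m2, d2) / d2 ** 0.5)

    def forward(self, x):                          # x: [batch, d1*d2]
        X = x.view(x.shape[0], self.A.shape[1], self.B.shape[1])
        # (A ⊗ B) x computed as A @ X @ B^T, then flattened (row-major vec).
        H = torch.einsum("ij,bjk,lk->bil", self.A, X, self.B)
        return H.reshape(x.shape[0], -1)

def m_and(u, v):
    """A smooth AND-like gate: high only when both inputs are high."""
    return torch.sigmoid(u) * torch.sigmoid(v)

enc = KronEncoder(d1=16, d2=16, m1=8, m2=8)
z = enc(torch.randn(4, 256))                       # [4, 64] latents
gated = m_and(z[:, :32], z[:, 32:])                # toy pairing of latent halves
```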

Link: https://arxiv.org/abs/2505.22255
Authors: Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky
Affiliations: T-Tech; Moscow Institute of Physics and Technology
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.

[NLP-49] BioHopR: A Benchmark for Multi-Hop Multi-Answer Reasoning in Biomedical Domain

【Quick Read】: This paper addresses the lack of benchmarks for evaluating multi-hop reasoning in the biomedical domain, particularly for queries involving one-to-many and many-to-many relationships, which existing benchmarks cannot assess effectively. The key to the solution is BioHopR, a new benchmark built on the structured biomedical knowledge graph PrimeKG, containing 1-hop and 2-hop reasoning tasks that reflect real-world biomedical complexity and provide a standard for evaluating multi-hop, multi-answer reasoning.

Link: https://arxiv.org/abs/2505.22240
Authors: Yunsoo Kim, Yusuf Abdulle, Honghan Wu
Affiliations: UCL; King’s College London; University of Glasgow
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Biomedical reasoning often requires traversing interconnected relationships across entities such as drugs, diseases, and proteins. Despite the increasing prominence of large language models (LLMs), existing benchmarks lack the ability to evaluate multi-hop reasoning in the biomedical domain, particularly for queries involving one-to-many and many-to-many relationships. This gap leaves the critical challenges of biomedical multi-hop reasoning underexplored. To address this, we introduce BioHopR, a novel benchmark designed to evaluate multi-hop, multi-answer reasoning in structured biomedical knowledge graphs. Built from the comprehensive PrimeKG, BioHopR includes 1-hop and 2-hop reasoning tasks that reflect real-world biomedical complexities. Evaluations of state-of-the-art models reveal that O3-mini, a proprietary reasoning-focused model, achieves 37.93% precision on 1-hop tasks and 14.57% on 2-hop tasks, outperforming proprietary models such as GPT4O and open-source biomedical models including HuatuoGPT-o1-70B and Llama-3.3-70B. However, all models exhibit significant declines in multi-hop performance, underscoring the challenges of resolving implicit reasoning steps in the biomedical domain. By addressing the lack of benchmarks for multi-hop reasoning in biomedical domain, BioHopR sets a new standard for evaluating reasoning capabilities and highlights critical gaps between proprietary and open-source models while paving the way for future advancements in biomedical LLMs.

[NLP-50] A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity CONLL2025

【Quick Read】: This paper investigates how sensitive Text-to-Speech (TTS) systems are to syntactic structure when generating intonational phrase boundaries. It finds that when a sentence's syntactic boundaries are ambiguous (e.g., garden-path sentences or sentences with attachment ambiguity), TTS systems struggle to place boundaries correctly and need surface cues such as commas; for syntactically simpler sentences, the systems do exploit syntactic cues beyond surface markers. The key remedy is fine-tuning models on sentences without commas at the syntactic boundary positions, pushing them to attend to subtler linguistic cues and produce more distinct intonation patterns that better reflect the underlying sentence structure.

Link: https://arxiv.org/abs/2505.22236
Authors: Charlotte Pouw, Afra Alishahi, Willem Zuidema
Affiliations: Institute for Logic, Language and Computation, University of Amsterdam; Cognitive Science and Artificial Intelligence, Tilburg University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to CoNLL 2025

Abstract:We analyze the syntactic sensitivity of Text-to-Speech (TTS) systems using methods inspired by psycholinguistic research. Specifically, we focus on the generation of intonational phrase boundaries, which can often be predicted by identifying syntactic boundaries within a sentence. We find that TTS systems struggle to accurately generate intonational phrase boundaries in sentences where syntactic boundaries are ambiguous (e.g., garden path sentences or sentences with attachment ambiguity). In these cases, systems need superficial cues such as commas to place boundaries at the correct positions. In contrast, for sentences with simpler syntactic structures, we find that systems do incorporate syntactic cues beyond surface markers. Finally, we finetune models on sentences without commas at the syntactic boundary positions, encouraging them to focus on more subtle linguistic cues. Our findings indicate that this leads to more distinct intonation patterns that better reflect the underlying structure.

[NLP-51] Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

【Quick Read】: This paper addresses the scarcity of high-quality multilingual training data, and in particular the limited cross-lingual transferability and scalability of existing open-source multilingual datasets. The key to the solution is JQL, a systematic approach that distills the annotation capabilities of LLMs into lightweight annotators built on pretrained multilingual embeddings, enabling efficient, large-scale curation of diverse, high-quality multilingual data while substantially reducing computational demands.

Link: https://arxiv.org/abs/2505.22232
Authors: Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting
Affiliations: Lamarr Institute; Fraunhofer IAIS; DFKI SAINT; Hessian AI; Computer Science Department, TU Darmstadt; AI Sweden
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page available at this https URL

Abstract:High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.

[NLP-52] Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis

【Quick Read】: This paper addresses the limitations of traditional audiometry in characterizing the functional impact of hearing loss on speech understanding, in particular the supra-threshold deficits and frequency-specific perceptual challenges that arise in conditions such as presbycusis. The key to the solution is an ASR-based frequency-specific speech test that simulates the perceptual effects of moderate sloping hearing loss by processing speech stimuli under controlled acoustic degradation and then analyzing phoneme-level confusion patterns, thereby providing more granular diagnostic information.

Link: https://arxiv.org/abs/2505.22231
Authors: Stefan Bleeck
Affiliations: Institute of Sound and Vibration Research, University of Southampton
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Traditional audiometry often fails to fully characterize the functional impact of hearing loss on speech understanding, particularly supra-threshold deficits and frequency-specific perception challenges in conditions like presbycusis. This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. Our approach leverages ASR to simulate the perceptual effects of moderate sloping hearing loss by processing speech stimuli under controlled acoustic degradation and subsequently analyzing phoneme-level confusion patterns. Key findings indicate that simulated hearing loss introduces specific phoneme confusions, predominantly affecting high-frequency consonants (e.g., alveolar/palatal to labiodental substitutions) and leading to significant phoneme deletions, consistent with the acoustic cues degraded in presbycusis. A test battery curated from these ASR-derived confusions demonstrated diagnostic value, effectively differentiating between simulated normal-hearing and hearing-impaired listeners in a comprehensive simulation. This ASR-driven methodology offers a promising avenue for developing objective, granular, and frequency-specific hearing assessment tools that complement traditional audiometry. Future work will focus on validating these findings with human participants and exploring the integration of advanced AI models for enhanced diagnostic precision.

[NLP-53] Look & Mark: Leveraging Radiologist Eye Fixations and Bounding Boxes in Multimodal Large Language Models for Chest X-ray Report Generation

【Quick Read】: This paper addresses the hallucinations and clinically significant errors that multimodal large language models (MLLMs) produce in medical image analysis, which limit their reliability in real-world use. The key to the solution is Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding-box annotations (Mark) into the LLM prompting framework, achieving performance gains via in-context learning without retraining the model.

Link: https://arxiv.org/abs/2505.22222
Authors: Yunsoo Kim, Jinge Wu, Su-Hwan Kim, Pardeep Vasudev, Jiashu Shen, Honghan Wu
Affiliations: UCL; Technical University of Munich; University of Oxford; University of Glasgow
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. When evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (this http URL) for CXR-LLaVA compared to baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (this http URL)-the highest among all models, even surpassing those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors (by 0.43 average errors per report), such as false predictions and omissions, enhancing both accuracy and reliability. These findings highlight L&M’s potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.

[NLP-54] Pitfalls of Rule- and Model-based Verifiers – A Case Study on Mathematical Reasoning

【Quick Read】: This paper addresses the shortage of trustworthy verifiers for reinforcement learning with verifiable reward (RLVR), particularly the unreliability of both rule-based and model-based verifiers in complex domains such as mathematical reasoning. The study finds that current open-source rule-based verifiers often fail to recognize equivalent answers in different formats, producing non-negligible false negatives, while model-based verifiers, despite higher static accuracy, can be "hacked" into false positives that artificially inflate rewards during policy optimization. The key contribution is exposing the inherent risks of both verifier types, providing theoretical grounding and practical guidance for building more robust reward systems.
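
The false-negative failure mode is easy to reproduce: naive string matching rejects equivalent answers that a symbolic check accepts. A small sketch using sympy, which is one common normalization choice and not necessarily any of the verifiers studied in the paper:

```python
import sympy

def string_verifier(pred, gold):
    """Exact string match: rejects '1/2' vs '0.5' (a false negative)."""
    return pred.strip() == gold.strip()

def symbolic_verifier(pred, gold):
    """Compare answers after symbolic normalization, so equivalent formats
    (fractions vs decimals, unsimplified radicals) are accepted."""
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold)) == 0
    except (sympy.SympifyError, TypeError):
        return False

for p, g in [("1/2", "0.5"), ("2*sqrt(2)", "sqrt(8)"), ("3", "4")]:
    print(p, g, string_verifier(p, g), symbolic_verifier(p, g))
# 1/2 0.5 False True
# 2*sqrt(2) sqrt(8) False True
# 3 4 False False
```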

Link: https://arxiv.org/abs/2505.22203
Authors: Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, Junxian He
Affiliations: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong; Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct (i.e., false positives). This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique risks inherent to both rule-based and model-based verifiers, aiming to offer valuable insights to develop more robust reward systems in reinforcement learning.

[NLP-55] Let's Predict Sentence by Sentence

【Quick Read】: This paper asks whether pretrained language models can reason over structured semantic units (sentences, propositions, concepts) rather than raw token sequences, given that autoregressive LMs generate one token at a time while human reasoning operates over higher-level abstractions. The key to the solution is a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences, exploring two embedding paradigms: semantic embeddings, learned via autoencoding to preserve surface meaning, and contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. Under continuous inference, contextual embeddings match Chain-of-Thought (CoT) performance while roughly halving inference-time FLOPs.

Link: https://arxiv.org/abs/2505.22202
Authors: Hyeonbin Hwang, Byeongguk Jeon, Seungone Kim, Jiyeon Kim, Hoyeon Chang, Sohee Yang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo
Affiliations: KAIST; Carnegie Mellon University; University College London; LG AI Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work In Progress

Abstract:Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions - sentences, propositions, and concepts. This contrast raises a central question: can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains - mathematics, logic, commonsense, and planning - contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.

[NLP-56] Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon

【Quick Read】: This paper addresses the problem of unveiling toxic content cloaked through homophones in Chinese, which existing methods have not solved effectively. The key is C^2TU, a training-free and prompt-free method: it first uses substring matching over a Chinese homophone graph and a toxic lexicon to identify candidate toxic words, then applies one of two filtering variants, based on BERT or on LLMs, to filter out non-toxic candidates and correct cloaked words to their toxic forms. For the LLM variant, it works around the autoregressive limitation in computing word-occurrence probabilities by exploiting the full semantic context of the text sequence to reveal hidden toxic words.
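
The candidate-detection step can be sketched with a tiny pinyin table and lexicon; benign words stand in for toxic entries here, and the real method uses a full homophone graph plus BERT- or LLM-based filtering on top of this matching.

```python
# Tiny illustrative tables; the real system uses a Chinese homophone graph
# and a curated toxic lexicon (a benign word is used here as a stand-in).
PINYIN = {"苹": "ping", "平": "ping", "果": "guo", "裹": "guo", "树": "shu"}
LEXICON = {("ping", "guo"): "苹果"}   # canonical surface form per pinyin key

def find_cloaked(text):
    """Slide over the text, map characters to pinyin, and flag substrings
    whose pinyin sequence matches a lexicon entry but whose surface differs."""
    hits = []
    for n in {len(k) for k in LEXICON}:
        for i in range(len(text) - n + 1):
            chunk = text[i:i + n]
            key = tuple(PINYIN.get(c, c) for c in chunk)
            canon = LEXICON.get(key)
            if canon and chunk != canon:
                hits.append((i, chunk, canon))  # (position, cloak, correction)
    return hits

print(find_cloaked("我买了平裹和树"))  # [(3, '平裹', '苹果')]
```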

Link: https://arxiv.org/abs/2505.22184
Authors: Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li
Affiliations: East China Normal University; Shanghai EastWonder Info-tech Co., Ltd.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures, 9 tables

Abstract:Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloak, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. To tackle the issue, we propose C^2TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on a Chinese homophone graph and toxic lexicon. Then it filters those candidates that are non-toxic and corrects cloaks to be their corresponding toxicities. Specifically, we develop two model variants for filtering, which are based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic contexts of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C^2TU can achieve superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% on the F1 score and 35% on accuracy, respectively.
zh
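
【示例】下面用一个极简的 Python 片段演示 C²TU 第一阶段"子串匹配 + 同音字还原"的思路;其中的同音字表、毒性词典与示例文本均为本文虚构的假设,后续基于 BERT/LLM 的过滤阶段未包含,并非论文官方实现:

```python
# 候选毒性片段识别:枚举子串,按同音字表还原后查毒性词典(玩具示例)
from itertools import product

HOMOPHONES = {"苯": ["笨"], "旦": ["蛋"]}  # 假设:伪装字 -> 可能的原字
TOXIC_LEXICON = {"笨蛋"}                    # 假设的毒性词典

def candidate_spans(text: str, max_len: int = 4):
    """返回 (起点, 终点, 原始子串, 还原后的词) 的候选列表。"""
    hits = []
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            span = text[i:j]
            choices = [[ch] + HOMOPHONES.get(ch, []) for ch in span]
            for combo in product(*choices):
                restored = "".join(combo)
                if restored in TOXIC_LEXICON:
                    hits.append((i, j, span, restored))
    return hits

print(candidate_spans("你这个苯旦真烦"))  # [(3, 5, '苯旦', '笨蛋')]
```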

[NLP-57] Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

【速读】: 该论文试图解决大语言模型在内存受限推理中的效率问题,特别是如何有效结合推测解码(speculative decoding)与量化(quantization)技术以提升推理速度。其解决方案的关键在于发现当前先进的推测解码方法在4-bit权重量化模型上会因计算负载增加而抵消内存带宽优化带来的优势,并提出一种分层框架,利用小型模型作为中间阶段将树状草稿转换为序列草稿,从而更好地利用目标量化模型的内存访问优势。

链接: https://arxiv.org/abs/2505.22179
作者: Yudi Zhang,Weilin Zhao,Xu Han,Tiejun Zhao,Wang Xu,Hailong Cao,Conghui Zhu
机构: Harbin Institute of Technology (哈尔滨工业大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78× speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31×. Code available at this https URL.
zh
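
【示例】论文的关键设计是把树状草稿转成序列草稿,使 4-bit 量化目标模型只需做一次序列式验证。下面用纯 Python 演示"展开草稿树为根到叶路径,并按草稿对数概率选出一条"的思路;树结构与选路策略均为示例假设,并非 EAGLE-2 或论文的官方实现:

```python
# 树状草稿 -> 序列草稿:枚举根到叶路径,取对数概率最高的一条(玩具示例)
from math import log

def tree_paths(node, prefix=None):
    """node = (token, prob, children);递归枚举所有根到叶路径。"""
    token, prob, children = node
    path = (prefix or []) + [(token, prob)]
    if not children:
        yield path
    for child in children:
        yield from tree_paths(child, path)

draft_tree = ("The", 1.0, [
    ("cat", 0.6, [("sat", 0.7, []), ("ran", 0.3, [])]),
    ("dog", 0.4, [("barked", 0.9, [])]),
])

best = max(tree_paths(draft_tree), key=lambda p: sum(log(pr) for _, pr in p))
print([tok for tok, _ in best])  # ['The', 'cat', 'sat']
```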

[NLP-58] TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation ACL2025

【速读】: 该论文试图解决表格在定性和定量评估中面临的挑战,传统指标往往无法捕捉结构和内容上的细微差异。解决方案的关键在于提出一种新颖的系统性评分标准,该标准结合了多层级结构描述符与细粒度上下文量化,从而为全面的表格比较奠定了坚实基础。在此基础上,论文进一步提出了TabXEval,一个全面且可解释的两阶段评估框架,通过TabAlign进行结构对齐,再利用TabCompare进行系统的语义和语法比较,有效识别出传统方法所忽视的细微差异。

链接: https://arxiv.org/abs/2505.22176
作者: Vihang Pancholi,Jainit Bafna,Tejas Anvekar,Manish Shrivastava,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); IIIT Hyderabad (印度信息技术研究所海得拉巴校区)
类目: Computation and Language (cs.CL)
备注: Accepeted for Findings at ACL 2025

点击查看摘要

Abstract:Evaluating tables qualitatively and quantitatively presents a significant challenge, as traditional metrics often fail to capture nuanced structural and content discrepancies. To address this, we introduce a novel, methodical rubric integrating multi-level structural descriptors with fine-grained contextual quantification, thereby establishing a robust foundation for comprehensive table comparison. Building on this foundation, we propose TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval initially aligns reference tables structurally via TabAlign and subsequently conducts a systematic semantic and syntactic comparison using TabCompare; this approach clarifies the evaluation process and pinpoints subtle discrepancies overlooked by conventional methods. The efficacy of this framework is assessed using TabXBench, a novel, diverse, multi-domain benchmark we developed, featuring realistic table perturbations and human-annotated assessments. Finally, a systematic analysis of existing evaluation methods through sensitivity-specificity trade-offs demonstrates the qualitative and quantitative effectiveness of TabXEval across diverse table-related tasks and domains, paving the way for future innovations in explainable table evaluation.
zh

[NLP-59] Reverse Preference Optimization for Complex Instruction Following ACL2025

【速读】: 该论文试图解决大型语言模型在处理具有多个约束条件的复杂指令时存在的对齐难题。传统方法通过选择满足更多约束的偏好对进行训练,但这种方法引入了噪声,因为被选中的示例可能无法遵循所有约束,而被拒绝的示例可能在某些方面表现更优。论文提出的解决方案是逆向偏好优化(Reverse Preference Optimization, RPO),其关键在于通过动态反转指令中的约束,确保被选中的响应是完美的,从而减少对大量采样和过滤的需求,并扩大选择与拒绝响应之间的差距,提升优化方向的清晰度和对噪声的鲁棒性。

链接: https://arxiv.org/abs/2505.22172
作者: Xiang Huang,Ting-En Lin,Feiteng Fang,Yuchuan Wu,Hangyu Li,Yuzhong Qu,Fei Huang,Yongbin Li
机构: Tongyi Lab; State Key Laboratory for Novel Software Technology, Nanjing University (国家软件新技术重点实验室,南京大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. Besides, reversal also enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, demonstrating average improvements over the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively. Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.
zh
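
【示例】RPO 的核心操作是反转回答未能满足的那部分约束,使该回答相对改写后的指令成为"完美"的正例。下面的玩具片段只演示这一约束反转步骤,约束文本与反转规则均为虚构假设,并非论文官方的数据构造流程:

```python
# 约束反转:未满足的约束取反,使给定回答在新指令下成为完美正例(玩具示例)
def reverse_unmet(constraints: dict[str, bool]) -> list[str]:
    """constraints: 约束文本 -> 给定回答是否满足;未满足者在文本上取反。"""
    return [c if ok else f"不要{c}" for c, ok in constraints.items()]

response = "这是一段不含表情符号的中文回答。"
constraints = {"使用中文回答": True, "在结尾加上表情符号": False}

chosen_instruction = ",".join(reverse_unmet(constraints))   # 该回答对它而言是完美的
rejected_instruction = ",".join(constraints)                # 原约束下该回答并不完美
print("chosen 指令:", chosen_instruction)
print("rejected 指令:", rejected_instruction)
```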

[NLP-60] ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在评估过程中对提示(prompt)措辞高度敏感的问题,而现有标准基准测试通常仅使用单一提示进行性能评估,这引发了评估结果可靠性的担忧。解决方案的关键在于提出一种基于随机方法的矩估计(stochastic method of moments)评估方法,通过在保持语义不变的提示扰动空间中进行评估,引入了可靠评估的正式定义,并提出了ReliableEval——一种估计获取有意义结果所需提示重采样次数的方法。该方法具有模型、任务和度量指标的无关性,为大语言模型的有意义且稳健的评估提供了一种通用方案。

链接: https://arxiv.org/abs/2505.22169
作者: Gili Lior,Eliya Habba,Shahar Levy,Avi Caciularu,Gabriel Stanovsky
机构: The Hebrew University of Jerusalem (希伯来大学); Google Research (谷歌研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
zh
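
【示例】所谓"矩方法"在这里可以直观理解为:先用少量保义改写提示估计得分的均值与方差,再在正态近似下反推达到目标置信区间宽度所需的重采样次数。下面的 numpy 示意中 z 值、目标半宽与得分分布均为示例取值,并非论文官方代码:

```python
# 由样本均值/方差估计所需的提示重采样次数:n ≈ (z * sigma / epsilon)^2(示意)
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.72, scale=0.08, size=20).clip(0, 1)  # 假设:20 个改写提示下的得分

mean, std = scores.mean(), scores.std(ddof=1)
z, epsilon = 1.96, 0.02  # 95% 置信度,目标半宽 ±0.02
n_needed = int(np.ceil((z * std / epsilon) ** 2))

print(f"mean={mean:.3f}, std={std:.3f}, 约需 {n_needed} 次提示重采样")
```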

[NLP-61] Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes ACL2025

【速读】: 该论文旨在解决传统离散和连续扩散模型在文本生成任务中的局限性,即离散模型缺乏细粒度控制,而连续模型在扩散过程中无法区分不同词元的语义差异。其解决方案的关键在于提出一种非同时连续扩散模型(NeoDiff),通过引入泊松扩散过程实现灵活且细粒度的噪声注入,并利用时间预测器在逆向过程中根据词元语义自适应地调节去噪进度,从而提升文本生成的质量与灵活性。

链接: https://arxiv.org/abs/2505.22165
作者: Bocheng Li,Zhujin Gao,Linli Xu
机构: School of Computer Science and Technology, University of Science and Technology of China (中国科学技术大学计算机科学与技术学院); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2025 Main Conference

点击查看摘要

Abstract:Diffusion models have emerged as a promising approach for text generation, with recent works falling into two main categories: discrete and continuous diffusion models. Discrete diffusion models apply token corruption independently using categorical distributions, allowing for different diffusion progress across tokens but lacking fine-grained control. Continuous diffusion models map tokens to continuous spaces and apply fine-grained noise, but the diffusion progress is uniform across tokens, limiting their ability to capture semantic nuances. To address these limitations, we propose Non-simultaneous Continuous Diffusion Models (NeoDiff), a novel diffusion model that integrates the strengths of both discrete and continuous approaches. NeoDiff introduces a Poisson diffusion process for the forward process, enabling a flexible and fine-grained noising paradigm, and employs a time predictor for the reverse process to adaptively modulate the denoising progress based on token semantics. Furthermore, NeoDiff utilizes an optimized schedule for inference to ensure more precise noise control and improved performance. Our approach unifies the theories of discrete and continuous diffusion models, offering a more principled and effective framework for text generation. Experimental results on several text generation tasks demonstrate NeoDiff’s superior performance compared to baselines of non-autoregressive continuous and discrete diffusion models, iterative-based methods and autoregressive diffusion-based methods. These results highlight NeoDiff’s potential as a powerful tool for generating high-quality text and advancing the field of diffusion-based text generation.
zh
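
【示例】NeoDiff 前向过程的直觉是:每个 token 的加噪进度由各自的泊松计数驱动,同一时间步上不同 token 因而处于不同噪声水平。下面的 numpy 片段只演示这一"非同步加噪"的形态,λ 取值与噪声调度均为本文的示例假设,并非论文官方实现:

```python
# 非同步前向加噪:每个 token 的扩散进度 k 来自各自的泊松采样(示意)
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, lam, t = 6, 8, 4.0, 0.5

x0 = rng.normal(size=(seq_len, dim))        # 连续 token 嵌入
k = rng.poisson(lam * t, size=seq_len)      # 各 token 的噪声事件数,彼此不同步
alpha = np.exp(-0.3 * k)[:, None]           # 事件越多,保留的信号越弱(示例调度)
xt = alpha * x0 + np.sqrt(1 - alpha**2) * rng.normal(size=(seq_len, dim))

print("各 token 的扩散进度 k:", k)
```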

[NLP-62] Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)在后训练阶段数据集下采样的效率与通用性问题,即如何在不显著影响性能的前提下,高效且广泛适用地选择高质量数据。解决方案的关键在于提出一种多步骤流程,包括高效地将数据点分组、利用专用模型估计质量以及通过稳健轻量的方法评分难度,结合基于任务的分类以控制最终数据集的组成,并通过嵌入模型和聚类算法提升多样性,从而实现高性能微调且计算开销最小。

链接: https://arxiv.org/abs/2505.22157
作者: Paramita Mirza,Lucas Weber,Fabian Küch
机构: Fraunhofer IIS (弗劳恩霍夫信息与通信技术研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this paper, we demonstrate that data selection can be both efficient and universal by using a multi-step pipeline in which we efficiently bin data points into groups, estimate quality using specialized models, and score difficulty with a robust, lightweight method. Task-based categorization allows us to control the composition of our final data – crucial for finetuning multi-purpose models. To guarantee diversity, we improve upon previous work using embedding models and a clustering algorithm. This integrated strategy enables high-performance fine-tuning with minimal overhead.
zh
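
【示例】下面的 Python 片段勾勒"任务分层 -> 聚类保多样性 -> 簇内选质量最高样本"的筛选骨架;嵌入、质量分与任务标签均以随机数代替真实模型输出,仅用于说明流程,并非论文官方流水线:

```python
# 分层 + 聚类 + 质量打分的数据筛选骨架(示意)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, dim, budget, n_tasks = 1000, 32, 100, 5

emb = rng.normal(size=(n, dim))           # 假设:每条数据的句向量
quality = rng.uniform(size=n)             # 假设:专用打分模型给出的质量分
task = rng.integers(0, n_tasks, size=n)   # 假设:任务类别(控制最终配比)

selected = []
for t in range(n_tasks):                  # 按任务分层,每类等额预算
    idx = np.where(task == t)[0]
    k = budget // n_tasks
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb[idx])
    for c in range(k):                    # 每个簇取质量最高的一条,兼顾多样性
        members = idx[labels == c]
        selected.append(members[np.argmax(quality[members])])

print(f"从 {n} 条中选出 {len(selected)} 条")
```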

[NLP-63] InComeS: Integrating Compression and Selection Mechanisms into LLM s for Efficient Model Editing

【速读】: 该论文旨在解决现有模型编辑方法在需要深层次语义理解的复杂场景中表现不佳的问题,尤其是在面对大量编辑操作时,由于大语言模型(Large Language Models, LLMs)的上下文窗口限制导致性能和效率下降。其解决方案的关键在于提出InComeS框架,通过显式的压缩与选择机制增强LLMs处理编辑上下文的能力,具体包括将每个编辑上下文压缩到特殊摘要标记(gist token)的键值(Key-Value, KV)缓存中,并引入专门的跨注意力模块以动态选择最相关的信息,从而实现对编辑信息的自适应高效利用。

链接: https://arxiv.org/abs/2505.22156
作者: Shuaiyi Li,Zhisong Zhang,Yang Deng,Chenlong Deng,Tianqing Fang,Hongming Zhang,Haitao Mi,Dong Yu,Wai Lam
机构: The Chinese University of Hong Kong (香港中文大学); Tencent AI Lab (腾讯人工智能实验室)
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Although existing model editing methods perform well in recalling exact edit facts, they often struggle in complex scenarios that require deeper semantic understanding rather than mere knowledge regurgitation. Leveraging the strong contextual reasoning abilities of large language models (LLMs), in-context learning (ICL) becomes a promising editing method by comprehending edit information through context encoding. However, this method is constrained by the limited context window of LLMs, leading to degraded performance and efficiency as the number of edits increases. To overcome this limitation, we propose InComeS, a flexible framework that enhances LLMs’ ability to process editing contexts through explicit compression and selection mechanisms. Specifically, InComeS compresses each editing context into the key-value (KV) cache of a special gist token, enabling efficient handling of multiple edits without being restricted by the model’s context window. Furthermore, specialized cross-attention modules are added to dynamically select the most relevant information from the gist pools, enabling adaptive and effective utilization of edit information. We conduct experiments on diverse model editing benchmarks with various editing formats, and the results demonstrate the effectiveness and efficiency of our method.
zh

[NLP-64] Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging

【速读】: 该论文试图解决脑到图像重建中生成的视觉刺激缺乏细节和语义不一致的问题,这可能是由于语义信息不足所致。其解决方案的关键在于提出一种名为细粒度脑到图像重建(FgB2I)的方法,该方法利用细粒度文本作为桥梁来提升图像重建效果。FgB2I包含三个关键阶段:细节增强、解码细粒度文本描述以及文本桥接的脑到图像重建,其中通过引入三种奖励指标(物体准确性、文本-图像语义相似性、图像-图像语义相似性)来指导语言模型从fMRI信号中解码出细粒度文本描述。

链接: https://arxiv.org/abs/2505.22150
作者: Runze Xia,Shuo Feng,Renzhi Wang,Congchi Yin,Xuyun Wen,Piji Li
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); MIIT Key Laboratory of Pattern Analysis and Machine Intelligence (工业和信息化部模式分析与机器智能重点实验室); The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education (教育部脑机智能技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: CogSci2025

点击查看摘要

Abstract:Brain-to-Image reconstruction aims to recover visual stimuli perceived by humans from brain activity. However, the reconstructed visual stimuli often miss details and exhibit semantic inconsistencies, which may be attributed to insufficient semantic information. To address this issue, we propose an approach named Fine-grained Brain-to-Image reconstruction (FgB2I), which employs fine-grained text as a bridge to improve image reconstruction. FgB2I comprises three key stages: detail enhancement, decoding fine-grained text descriptions, and text-bridged brain-to-image reconstruction. In the detail-enhancement stage, we leverage large vision-language models to generate fine-grained captions for visual stimuli and experimentally validate its importance. We propose three reward metrics (object accuracy, text-image semantic similarity, and image-image semantic similarity) to guide the language model in decoding fine-grained text descriptions from fMRI signals. The fine-grained text descriptions can be integrated into existing reconstruction methods to achieve fine-grained Brain-to-Image reconstruction.
zh

[NLP-65] Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language

【速读】: 该论文试图解决灵活工具选择这一复杂认知能力的计算建模问题,该能力能够区分人类与其他物种。其解决方案的关键在于构建一个基于低维属性表示的框架,以连接视觉工具感知与语言任务理解。该框架通过视觉编码器(如ResNet或ViT)从工具图像中提取属性,并利用微调的语言模型(如GPT-2、LLaMA、DeepSeek)从任务描述中推导所需属性,从而实现高效的工具选择。

链接: https://arxiv.org/abs/2505.22146
作者: Guangfu Hao,Haojie Wen,Liangxuna Guo,Yang Chen,Yanchao Bi,Shan Yu
机构: Laboratory of Brain Atlas and Brain-inspired Intelligence, Institute of Automation Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University; IDG/McGovern Institute for Brain Research, Beijing Normal University; School of Future Technology, University of Chinese Academy of Sciences (UCAS)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks, significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching the performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Ablation studies revealed that manipulation-related attributes (graspability, hand-relatedness, elongation) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.
zh
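
【示例】该框架的匹配一步可以归结为:在同一低维属性空间里,比较语言模型解析出的"任务所需属性"与视觉编码器预测的"工具属性"。下面的 numpy 示意用 4 维手工属性代替论文中的 13 维属性,所有数值均为虚构:

```python
# 低维属性空间中的工具匹配:余弦相似度选最优工具(示意)
import numpy as np

# 假设的 4 维属性:[可抓握性, 细长程度, 硬度, 与手的相关性]
tools = {
    "hammer":    np.array([0.9, 0.6, 1.0, 0.9]),
    "sponge":    np.array([0.8, 0.2, 0.1, 0.7]),
    "chopstick": np.array([0.9, 1.0, 0.6, 0.9]),
}
required = np.array([0.9, 0.5, 1.0, 0.9])  # 假设:从"把钉子敲进木板"解析出的所需属性

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(tools, key=lambda name: cosine(tools[name], required))
print(best)  # hammer
```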

[NLP-66] Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets Not Arguments ACL2025

【速读】: 该论文试图解决当前最先进的论点识别模型泛化能力不足的问题,特别是模型在面对未见过的数据集时性能显著下降的现象。其解决方案的关键在于通过任务特定的预训练和联合基准训练来提升模型的鲁棒性和泛化能力。研究评估了四种Transformer模型,包括三种标准模型和一种通过对比预训练增强的模型,在17个英语句级数据集上的表现,发现现有模型在很大程度上依赖于与内容词相关的词汇捷径,而非真正的任务对齐,从而揭示了模型泛化能力受限的原因。

链接: https://arxiv.org/abs/2505.22137
作者: Marc Feger,Katarina Boland,Stefan Dietze
机构: Heinrich-Heine-University (海因里希·海涅大学); GESIS - Leibniz Institute for the Social Sciences (GESIS-莱布尼茨社会科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper has been accepted to ACL 2025 and will be published after 27.07.2025

点击查看摘要

Abstract:Identifying arguments is a necessary prerequisite for various tasks in automated discourse analysis, particularly within contexts such as political debates, online discussions, and scientific reasoning. In addition to theoretical advances in understanding the constitution of arguments, a significant body of research has emerged around practical argument mining, supported by a growing number of publicly available datasets. On these benchmarks, BERT-like transformers have consistently performed best, reinforcing the belief that such models are broadly applicable across diverse contexts of debate. This study offers the first large-scale re-evaluation of such state-of-the-art models, with a specific focus on their ability to generalize in identifying arguments. We evaluate four transformers, three standard and one enhanced with contrastive pre-training for better generalization, on 17 English sentence-level datasets as most relevant to the task. Our findings show that, to varying degrees, these models tend to rely on lexical shortcuts tied to content words, suggesting that apparent progress may often be driven by dataset-specific cues rather than true task alignment. While the models achieve strong results on familiar benchmarks, their performance drops markedly when applied to unseen datasets. Nonetheless, incorporating both task-specific pre-training and joint benchmark training proves effective in enhancing both robustness and generalization.
zh

[NLP-67] RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding

【速读】: 该论文旨在解决混合模型(如结合Transformer与状态空间模型SSMs)在优化过程中存在的潜在冗余问题,尤其是Transformer组件中的冗余性带来的挑战。其解决方案的关键在于提出了一种名为RAD(Redundancy-Aware Distillation)的框架,该框架利用自推测解码(self-speculative decoding)作为诊断工具,识别出冗余的注意力层,并将其选择性地替换为SSM组件,随后进行针对性的(自)知识蒸馏。RAD通过关注被识别为冗余的组件,并结合架构调整和特定的权重初始化策略,实现了更高效的模型优化与性能提升。

链接: https://arxiv.org/abs/2505.22135
作者: Yuichiro Hoshino,Hideyuki Tachibana,Muneyoshi Inahara,Hiroto Takegawa
机构: PKSHA Technology inc.; NII LLMC; Asia University
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages

点击查看摘要

Abstract:Hybrid models combining Transformers and State Space Models (SSMs) are promising for balancing performance and efficiency. However, optimizing these hybrid models, particularly by addressing the potential redundancy inherent within the Transformer components, remains a significant challenge. In this paper, we propose RAD (Redundancy-Aware Distillation), a novel framework that uses self-speculative decoding as a diagnostic tool to identify redundant attention layers within the model. These identified layers are then selectively replaced with SSM components, followed by targeted (self-)distillation. Specifically, RAD focuses knowledge transfer on the components identified as redundant, considering architectural changes and specific weight initialization strategies. We experimentally demonstrate that self-distillation using RAD significantly surpasses the performance of the original base model on mathematical and coding tasks. Furthermore, RAD is also effective in standard knowledge distillation settings, achieving up to approximately 2x faster convergence compared to baseline methods. Notably, while a baseline model distilled from a Llama-3.1 70B teacher achieves scores of 46.17 on GSM8K and 22.75 on CRUX, RAD achieves significantly higher scores of 71.27 on GSM8K and 28.25 on CRUX, even when using a much smaller Llama-3.1 8B teacher. RAD offers a new pathway for efficient optimization and performance enhancement in the distillation of hybrid models.
zh

[NLP-68] EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在数学问题求解任务中,如何通过学习错误来进一步提升其性能的问题。现有方法通常依赖于采样轨迹获取合成解中的错误,但难以为每个数学问题生成高质量的错误解。解决方案的关键在于提出一种名为EULER的误差诱导学习模型,该模型通过优化误差暴露机制,提高自动生成解错误的概率,并利用来自更优LLM的解对生成质量进行正则化,从而生成更具挑战性和教育意义的错误解,以增强LLMs的数学推理能力。

链接: https://arxiv.org/abs/2505.22131
作者: Zhuoyang Wu,Xinze Li,Zhenghao Liu,Yukun Yan,Zhiyuan Liu,Minghe Yu,Cheng Yang,Yu Gu,Ge Yu,Maosong Sun
机构: Northeastern University (东北大学); Tsinghua University (清华大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong reasoning capabilities and achieved promising results in mathematical problem-solving tasks. Learning from errors offers the potential to further enhance the performance of LLMs during Supervised Fine-Tuning (SFT). However, the errors in synthesized solutions are typically gathered from sampling trails, making it challenging to generate solution errors for each mathematical problem. This paper introduces the Error-IndUced LEaRning (EULER) model, which aims to develop an error exposure model that generates high-quality solution errors to enhance the mathematical reasoning capabilities of LLMs. Specifically, EULER optimizes the error exposure model to increase the generation probability of self-made solution errors while utilizing solutions produced by a superior LLM to regularize the generation quality. Our experiments across various mathematical problem datasets demonstrate the effectiveness of the EULER model, achieving an improvement of over 4% compared to all baseline models. Further analysis reveals that EULER is capable of synthesizing more challenging and educational solution errors, which facilitate both the training and inference processes of LLMs. All codes are available at this https URL.
zh

[NLP-69] LoKI: Low-damage Knowledge Implanting of Large Language Models

【速读】: 该论文试图解决预训练模型微调过程中出现的灾难性遗忘(catastrophic forgetting, CF)问题,即在适应特定任务时可能覆盖掉预训练阶段获得的关键知识。解决方案的关键在于提出一种名为LoKI(Low-damage Knowledge Implanting)的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术,该技术基于对Transformer架构中知识存储机制的机械理解,能够在保持模型通用能力的同时,实现与全量微调或LoRA方法相当甚至更优的任务特定性能。

链接: https://arxiv.org/abs/2505.22120
作者: Runyu Wang,Peng Ping,Zhengyu Guo,Xiaoye Zhang,Quan Shi,Liting Zhou,Tianbo Ji
机构: Nantong University (南通大学); South China University of Technology (华南理工大学); China Southern Power Grid Company Limited (中国南方电网公司); Dublin City University (都柏林城市大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pre-training is overwritten. Current Parameter-Efficient Fine-Tuning (PEFT) methods for Large Language Models (LLMs), while efficient, often sacrifice general capabilities. To address the issue of CF in a general-purpose PEFT framework, we propose Low-damage Knowledge Implanting (LoKI), a PEFT technique that is based on a mechanistic understanding of how knowledge is stored in transformer architectures. In two real-world scenarios, LoKI demonstrates task-specific performance that is comparable to or even surpasses that of full fine-tuning and LoRA-based methods across various model types, while significantly better preserving general capabilities. Our work connects mechanistic insights into LLM knowledge storage with practical fine-tuning objectives, achieving state-of-the-art trade-offs between task specialization and the preservation of general capabilities. Our implementation is publicly available as ready-to-use code (this https URL).
zh

[NLP-70] Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches

【速读】: 该论文试图解决跨语言事实核查声明检索的问题,即在不同语言之间准确检索已验证的声明,以辅助专业事实核查人员的工作。其解决方案的关键在于提升多语言和跨语言性能,具体包括在监督学习中选择负样本以及在无监督设置中进行重新排序(re-ranking)。研究发现,基于大语言模型(LLM)的重新排序方法效果最佳,其次是使用基于句子相似度策略采样负样本的微调方法。此外,研究强调了跨语言设置与多语言设置具有不同的特性。

链接: https://arxiv.org/abs/2505.22118
作者: Alan Ramponi,Marco Rovera,Robert Moro,Sara Tonelli
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., having both the input and the retrieved claims in the same language. However, especially for languages with a limited availability of fact-checks and in case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve the multilingual and crosslingual performance, namely selection of negative examples (in the supervised setting) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by using LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence similarity-based strategy. Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.
zh

[NLP-71] Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model

【速读】: 该论文旨在解决术中低血压(Intraoperative Hypotension, IOH)预测中的挑战,特别是事件稀疏性和跨不同患者整合静态与动态数据的困难。其解决方案的关键在于提出一种多模态语言模型框架IOHFuseLM,采用两阶段训练策略:第一阶段通过扩散方法增强生理时间序列数据进行领域自适应预训练,以提升模型对低血压相关模式的敏感性;第二阶段在原始临床数据集上进行任务微调,以增强区分正常血压与低血压状态的能力。此外,通过在token层面将结构化临床描述与对应的生理时间序列对齐,实现患者级别的多模态融合,从而捕捉个体化的时空模式及其对应的临床语义。

链接: https://arxiv.org/abs/2505.22116
作者: Jintao Zhang,Zirui Liu,Mingyue Cheng,Shilong Zhang,Tingyue Pan,Qi Liu,Yanhu Xie
机构: University of Science and Technology of China (中国科学技术大学); The First Affiliated Hospital of University of Science and Technology of China (中国科学技术大学第一附属医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at this https URL.
zh

[NLP-72] THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在处理简单任务时存在的过度思考(overthinking)问题,即模型生成大量冗余且对准确结果贡献有限的标记,从而导致计算资源的浪费。解决方案的关键在于引入Think-Bench基准测试平台,用于系统评估LRMs的推理效率,并提出新的效率指标,以全面分析模型在推理过程、结果质量和思维链(chain-of-thought, CoT)特性等方面的性能。

链接: https://arxiv.org/abs/2505.22113
作者: Zhiyuan Li,Yi Chang,Yuan Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 8 figures, 6 tables

点击查看摘要

Abstract:Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency. Overthinking occurs when models generate excessive and redundant tokens that contribute little to accurate outcomes, especially in simple tasks, resulting in a significant waste of computational resources. To systematically investigate this issue, we introduce Think-Bench, a benchmark designed to evaluate the reasoning efficiency of LRMs. We also propose novel efficiency metrics and conduct a comprehensive evaluation of various LRMs across multiple dimensions, including the reasoning process, outcome quality, and chain-of-thought (CoT) characteristics. Our analysis reveals that most LRMs exhibit overthinking in handling easy questions, generating unnecessarily lengthy reasoning chains. While many LRMs demonstrate high CoT quality, several suffer from low efficiency. We hope that Think-Bench can serve as a robust foundation for advancing research into LRMs.
zh
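
【示例】衡量"过度思考"的一个直观思路是:看模型首次得出正确答案用了多少 token,其后的部分视为冗余。下面给出一个玩具化的效率指标,仅为直观演示,并非 Think-Bench 的官方指标定义:

```python
# 玩具效率指标:首次得到正确答案所需 token 数 / 总生成 token 数(示意)
def thinking_efficiency(cot_tokens: list[str], answer: str) -> float:
    """比值越接近 1 表示冗余越少;答案出现后仍大量生成则比值下降。"""
    for i, tok in enumerate(cot_tokens):
        if tok == answer:
            return (i + 1) / len(cot_tokens)
    return 0.0  # 从未得到正确答案

cot = "2 加 3 等于 5 , 再 检查 一遍 确实 是 5".split()
print(f"efficiency = {thinking_efficiency(cot, '5'):.2f}")
```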

[NLP-73] Curse of High Dimensionality Issue in Transformer for Long-context Modeling ICML2025

【速读】: 该论文旨在解决基于Transformer的大型语言模型(Large Language Models, LLMs)在长文本建模中因冗余注意力计算导致的计算效率低下问题。其关键解决方案是将传统的概率序列建模重新表述为监督学习任务,从而分离相关与不相关标记,并揭示注意力稀疏性。在此基础上,作者将注意力优化建模为线性编码问题,提出了一种分组编码策略,理论上证明其能够提升对随机噪声的鲁棒性并提高学习效率。最终,基于该策略提出了动态分组注意力(Dynamic Group Attention, DGA),通过在注意力计算过程中聚合不重要标记以显式减少冗余,从而显著降低计算成本并保持模型性能。

链接: https://arxiv.org/abs/2505.22107
作者: Shuhai Zhang,Zeng You,Yaofo Chen,Zhiquan Wen,Qianyue Wang,Zhijie Qiu,Yuanqing Li,Mingkui Tan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ICML 2025

点击查看摘要

Abstract:Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at this https URL.
zh
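
【示例】DGA 的思想可以用单头、单查询注意力直观演示:按打分保留贡献大的 token,其余 token 聚合成一个"组代表"再参与 softmax,使键值对数量从 n 降到 keep+1。下面 numpy 片段中的保留数量与聚合方式均为示例假设,并非论文官方实现:

```python
# 把低贡献 token 聚合为组代表的注意力计算(单头单查询,示意)
import numpy as np

rng = np.random.default_rng(0)
n, d, keep = 16, 8, 4

q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

scores = K @ q / np.sqrt(d)
top = np.argsort(scores)[-keep:]            # 贡献大的 token 原样保留
rest = np.setdiff1d(np.arange(n), top)      # 其余 token 均值聚合为一个组 token
K_g = np.vstack([K[top], K[rest].mean(axis=0)])
V_g = np.vstack([V[top], V[rest].mean(axis=0)])

w = np.exp(K_g @ q / np.sqrt(d))
w /= w.sum()
out = w @ V_g                                # 只需 keep+1 个键值对,而非 n 个
print(out.shape)  # (8,)
```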

[NLP-74] MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在处理记忆方面缺乏统一且结构化的架构问题。现有模型主要依赖参数化记忆和短暂激活记忆,虽有如检索增强生成(Retrieval-Augmented Generation, RAG)等方法引入文本记忆,但其生命周期管理和多模态集成能力不足,限制了长期知识演进。论文提出的解决方案是MemOS,其关键在于首次将记忆提升为第一类操作资源,并通过MemCube这一标准化记忆抽象,实现对异构记忆的跟踪、融合与迁移,从而构建统一的记忆表示、组织与治理机制,形成以记忆为中心的可控制、可适应、可演进的执行框架。

链接: https://arxiv.org/abs/2505.22101
作者: Zhiyu Li,Shichao Song,Hanyu Wang,Simin Niu,Ding Chen,Jiawei Yang,Chenyang Xi,Huayi Lai,Jihao Zhao,Yezhaohui Wang,Junpeng Ren,Zehao Lin,Jiahao Huo,Tianyi Chen,Kai Chen,Kehang Li,Zhiqiang Yin,Qingchen Yu,Bo Tang,Hongkang Yang,Zhi-Qin John Xu,Feiyu Xiong
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.
zh

[NLP-75] Knowledge Base Construction for Knowledge-Augmented Text-to-SQL ACL

【速读】: 该论文试图解决文本到SQL(Text-to-SQL)任务中生成的SQL语句准确性不足的问题,尤其是在面对多样化的领域查询和不同数据库模式时,大型语言模型(Large Language Models, LLMs)的参数化知识可能存在局限性。解决方案的关键是构建一个全面的知识库,该知识库基于所有可用问题及其关联的数据库模式和相关知识,能够为给定查询检索和生成必要的知识,从而提升生成SQL的准确性,并支持跨不同数据集和领域的未见过的数据库场景。

链接: https://arxiv.org/abs/2505.22096
作者: Jinheon Baek,Horst Samulowitz,Oktie Hassanzadeh,Dharmashankar Subramanian,Sola Shirai,Alfio Gliozzo,Debarun Bhattacharjya
机构: KAIST(韩国科学技术院); IBM Research(IBM研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL Findings 2025

点击查看摘要

Abstract:Text-to-SQL aims to translate natural language queries into SQL statements, which is practical as it enables anyone to easily retrieve the desired information from databases. Recently, many existing approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet, the parametric knowledge in LLMs might be limited to covering all the diverse and domain-specific queries that require grounding in various database schemas, which makes generated SQLs less accurate oftentimes. To tackle this, we propose constructing the knowledge base for text-to-SQL, a foundational source of knowledge, from which we retrieve and generate the necessary knowledge for given queries. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive, which is constructed based on a combination of all the available questions and their associated database schemas along with their relevant knowledge, and can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both the overlapping and non-overlapping database scenarios, where it outperforms relevant baselines substantially.
zh

[NLP-76] Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成过程中出现的幻觉问题,通过引入外部知识来增强生成的准确性。现有方法通常采用静态检索流程,未能充分利用MLLMs的推理与规划能力以动态决定如何在推理过程中与不同知识库(Knowledge Bases, KBs)交互。论文提出的解决方案关键在于R1-Router框架,该框架能够根据不断变化的推理状态学习决定何时以及从何处检索知识,并生成后续查询以引导至最合适的KB,从而将外部知识整合到连贯的推理路径中。此外,还引入了针对步骤的组相对策略优化(Step-wise Group Relative Policy Optimization, Step-GRPO)算法,通过分配步骤特定奖励来优化MLLMs的推理行为。

链接: https://arxiv.org/abs/2505.22095
作者: Chunyi Peng,Zhipeng Xu,Zhenghao Liu,Yishan Li,Yukun Yan,Shuo Wang,Zhiyuan Liu,Yu Gu,Minghe Yu,Ge Yu,Maosong Sun
机构: Northeastern University (东北大学); Tsinghua University (清华大学); ModelBest Inc.
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router can generate follow-up queries according to the current reasoning step, routing these intermediate queries to the most suitable KB, and integrating external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.
zh

[NLP-77] Visual Cues Support Robust Turn-taking Prediction in Noise

【速读】: 该论文旨在解决在噪声环境下准确预测对话轮换(Predictive Turn-Taking Models, PTTMs)性能下降的问题。研究发现,PTTMs在噪声条件下的表现显著恶化,例如在10 dB音乐噪声下,Hold/Shift准确率从干净语音中的84%降至52%。解决方案的关键在于引入多模态的PTTM,该模型结合了视觉特征以更好地利用视觉线索,在10 dB音乐噪声下达到了72%的准确率。与仅依赖音频的PTTM相比,多模态PTTM在所有噪声类型和信噪比下均表现更优,表明其有效利用了视觉信息;然而,这种优势并不总能泛化到新的噪声类型。此外,研究还指出,成功的训练依赖于准确的转录,这限制了自动语音识别(ASR)生成的转录在噪声环境中的应用。

链接: https://arxiv.org/abs/2505.22088
作者: Sam O’Connor Russell,Naomi Harte
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 5 pages

点击查看摘要

Abstract:Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features to better exploit visual cues, with 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.
zh

[NLP-78] ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation

【速读】: 该论文试图解决指令跟随大型语言模型(Large Language Models, LLMs)在处理需要领域知识的任务时表现不佳的问题,特别是针对计算论证(Computational Argumentation, CA)领域的任务。解决方案的关键在于对LLMs进行专门的指令微调,通过构建针对CA领域的自然语言指令集和基准测试,提升模型在未见过的CA任务上的表现,同时保持其在通用自然语言处理任务上的稳定性。

链接: https://arxiv.org/abs/2505.22076
作者: Maja Stahl,Timon Ziegenbein,Joonsuk Park,Henning Wachsmuth
机构: Leibniz University Hannover (汉诺威莱布尼茨大学); University of Richmond (里士满大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Training large language models (LLMs) to follow instructions has significantly enhanced their ability to tackle unseen tasks. However, despite their strong generalization capabilities, instruction-following LLMs encounter difficulties when dealing with tasks that require domain knowledge. This work introduces a specialized instruction fine-tuning for the domain of computational argumentation (CA). The goal is to enable an LLM to effectively tackle any unseen CA tasks while preserving its generalization capabilities. Reviewing existing CA research, we crafted natural language instructions for 105 CA tasks to this end. On this basis, we developed a CA-specific benchmark for LLMs that allows for a comprehensive evaluation of LLMs’ capabilities in solving various CA tasks. We synthesized 52k CA-related instructions, adapting the self-instruct process to train a CA-specialized instruction-following LLM. Our experiments suggest that CA-specialized instruction fine-tuning significantly enhances the LLM on both seen and unseen CA tasks. At the same time, performance on the general NLP tasks of the SuperNI benchmark remains stable.
zh

[NLP-79] Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced (R2) GRPO

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在科学信息抽取(Scientific Information Extraction, SciIE)任务中表现不佳的问题,尤其是在推理能力和记忆能力方面不如小型Bert基模型。其解决方案的关键在于提出一种两阶段训练方法:第一阶段为MimicSFT,利用结构化推理模板而无需高质量的思维链数据;第二阶段为R²GRPO,结合相关性与规则诱导奖励机制。实验结果表明,该方法能够有效提升模型的推理能力,并在关系抽取任务中超越基线LLMs和专用监督模型。

链接: https://arxiv.org/abs/2505.22068
作者: Ran Li,Shimin Di,Yuchen Liu,Chen Jing,Yu Qiu,Lei Chen
机构: HKUST(香港科技大学); SEU(东南大学); HKUST(GZ)(香港科技大学(广州)); Zhipu AI(智普AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Previous studies suggest that powerful Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) only refine the reasoning path without improving the reasoning capacity in math tasks, while supervised fine-tuning (SFT) with distillation can. We study this from the view of Scientific Information Extraction (SciIE), where LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE. We propose two-stage training with 1. MimicSFT, using structured reasoning templates without needing high-quality chain-of-thought data, 2. R²GRPO with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve the reasoning capacity. R²GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at this https URL.
zh

[NLP-80] Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home?

【速读】: 该论文旨在解决检索增强生成(Retrieval-augmented generation, RAG)系统中由于直接传递私有检索文档给大语言模型(Large language models, LLMs)而引发的成员推理攻击(Membership inference attacks, MIAs)问题。解决方案的关键在于提出了一种基于相似性的MIAs检测框架Mirabel,该框架利用MIAs查询通常仅与一个目标文档高度相似的特性,通过简单的检测与隐藏策略有效混淆攻击者,同时保持数据效用和系统无关性。

链接: https://arxiv.org/abs/2505.22061
作者: Yujin Choi,Youngjoo Park,Junyoung Byun,Jaewook Lee,Jinseong Park
机构: Seoul National University (首尔国立大学); Chung-Ang University (中央大学); Korea Institute for Advanced Study (韩国高级科学研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) mitigates the hallucination problem in large language models (LLMs) and has proven effective for specific, personalized applications. However, passing private retrieved documents directly to LLMs introduces vulnerability to membership inference attacks (MIAs), which try to determine whether the target datum exists in the private external database or not. Based on the insight that MIA queries typically exhibit high similarity to only one target document, we introduce Mirabel, a similarity-based MIA detection framework designed for the RAG system. With the proposed Mirabel, we show that simple detect-and-hide strategies can successfully obfuscate attackers, maintain data utility, and remain system-agnostic. We experimentally prove its detection and defense against various state-of-the-art MIA methods and its adaptability to existing private RAG systems.
zh
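
【示例】Mirabel 的检测直觉是:MIA 查询通常只与库中某一篇目标文档异常接近。下面用随机单位向量模拟嵌入,以"最高相似度与次高相似度之差超过阈值"作为判定规则;阈值与数据均为示例假设,并非论文官方实现:

```python
# 基于 top-1 与 top-2 相似度落差的可疑查询检测(示意)
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # 私有库文档嵌入(模拟)

def is_suspicious(query_emb: np.ndarray, gap_threshold: float = 0.3) -> bool:
    sims = np.sort(docs @ query_emb)                 # 余弦相似度(均已归一化)
    return sims[-1] - sims[-2] > gap_threshold       # 只与一篇文档异常接近

attack = docs[7] + 0.01 * rng.normal(size=64)        # 近乎复述某篇文档的攻击查询
attack /= np.linalg.norm(attack)
benign = rng.normal(size=64)
benign /= np.linalg.norm(benign)

print(is_suspicious(attack), is_suspicious(benign))  # 预期: True False
```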

[NLP-81] Voice Adaptation for Swiss German INTERSPEECH

【速读】: 该论文试图解决将标准德语文本转换为瑞士德语方言语音的语音适配问题(Voice Adaptation),旨在提升语音克隆技术(Voice Cloning)在语言资源较少的方言或语言中的适用性。解决方案的关键在于对大量瑞士播客数据进行预处理,包括自动转录和方言类别标注,从而获得约5000小时的弱标签训练数据,并在此基础上微调XTTSv2模型,使其能够准确生成目标方言的语音输出。

链接: https://arxiv.org/abs/2505.22054
作者: Samuel Stucki,Jan Deriu,Mark Cieliebak
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech

点击查看摘要

Abstract:This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. Our work shows a step towards adapting Voice Cloning technology to underrepresented languages. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8.
zh

[NLP-82] Jailbreak Distillation: Renewable Safety Benchmarking

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在关键应用中部署时所面临的安全评估基准不足的问题,特别是现有安全评估方法在可重复性、更新性及泛化能力方面的局限性。其解决方案的关键在于提出了一种名为Jailbreak Distillation (JBDistill)的基准构建框架,该框架通过“蒸馏”越狱攻击生成高质量且易于更新的安全评估基准,利用少量开发模型和现有越狱攻击算法生成候选提示池,并通过提示选择算法提取有效的提示作为安全基准,从而实现公平比较、可复现性和高效更新。

链接: https://arxiv.org/abs/2505.22037
作者: Jingyu Zhang,Ahmed Elgohary,Xiawei Wang,A S M Iftekhar,Ahmed Magooda,Benjamin Van Durme,Daniel Khashabi,Kyle Jackson
机构: Microsoft Responsible AI Research (微软负责任人工智能研究); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
备注: Project page: this https URL

点击查看摘要

Abstract:Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that “distills” jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.
zh

[NLP-83] VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

【速读】: 该论文旨在解决传统RAG(Retrieval-Augmented Generation)方法在处理视觉丰富信息时的局限性,特别是文本基方法无法有效处理视觉信息,而现有视觉基RAG方法受限于固定流程且难以有效进行推理。其解决方案的关键在于引入VRAG-RL,一个专为跨视觉丰富信息进行复杂推理设计的强化学习(Reinforcement Learning, RL)框架。该框架使视觉语言模型(Vision-Language Models, VLMs)能够与搜索引擎交互,通过视觉感知标记自主采样单轮或多轮推理轨迹,并基于这些样本进行持续优化。此外,论文还定义了针对视觉丰富输入的动作空间,包括裁剪和缩放等操作,以实现从粗到细的信息获取,并通过结合查询重写与检索性能的奖励机制,弥合用户原始查询与检索器之间的差距。

链接: https://arxiv.org/abs/2505.22019
作者: Qiuchen Wang,Ruixue Ding,Yu Zeng,Zehui Chen,Lin Chen,Shihang Wang,Pengjun Xie,Fei Huang,Feng Zhao
机构: Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effectively retrieving, reasoning and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior Multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users’ original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at this https URL.
zh

[NLP-84] Improving Continual Pre-training Through Seamless Data Packing ACL2025

【速读】: 该论文试图解决持续预训练过程中由于数据打包方式导致的过度截断和上下文不连贯问题,这些问题会阻碍模型性能的提升。解决方案的关键在于提出一种名为Seamless Packing (SP)的新颖数据打包策略,该策略通过两个阶段实现:第一阶段采用滑动窗口技术同步连续序列间的重叠标记,以增强上下文连续性和一致性;第二阶段则利用First-Fit-Decreasing算法将较短文本打包到略大于目标序列长度的容器中,从而减少填充和截断。

链接: https://arxiv.org/abs/2505.22018
作者: Ruicheng Yin,Xuan Gao,Changze Lv,Xiaohua Wang,Xiaoqing Zheng,Xuanjing Huang
机构: Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, outperforming the baseline method in 99% of all settings. Code is available at this https URL.
zh
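
【示例】SP 的两个阶段都能用几行 Python 勾勒:第一阶段滑动窗口让相邻序列共享重叠 token,第二阶段用 First-Fit-Decreasing 把短文本装进略大于目标长度的箱子。以下序列长度、重叠量与箱容量均为示例取值,并非论文官方实现:

```python
# 阶段一:带重叠的滑动窗口切分;阶段二:First-Fit-Decreasing 装箱(示意)
def sliding_window(tokens, seq_len=8, overlap=2):
    step = seq_len - overlap
    return [tokens[i:i + seq_len] for i in range(0, max(len(tokens) - overlap, 1), step)]

def first_fit_decreasing(texts, capacity=10):
    bins = []  # 每个箱装若干短文本,token 总数不超过 capacity
    for t in sorted(texts, key=len, reverse=True):
        for b in bins:
            if sum(map(len, b)) + len(t) <= capacity:
                b.append(t)
                break
        else:
            bins.append([t])
    return bins

print(sliding_window(list(range(20))))                    # 相邻窗口共享 2 个 token
print(first_fit_decreasing([[1]*7, [1]*4, [1]*3, [1]*6, [1]*2]))
```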

[NLP-85] CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在测试时由于过度推理导致的输出冗长和令牌效率低下的问题。其关键解决方案是提出CoThink,一个简单有效的流水线:首先由指令模型生成高层次的解决方案大纲,随后由推理模型完成具体解答。该方法实现了根据输入难度动态调整推理深度,从而在保持推理准确性的同时显著减少总令牌生成量。

链接: https://arxiv.org/abs/2505.22017
作者: Siqi Fan,Peng Han,Shuo Shang,Yequan Wang,Aixin Sun
机构: University of Electronic Science and Technology of China (中国电子科技大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling. However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency. By comparing these models with equally sized instruct models, we identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of-thought training encourages redundant and often unnecessary verification steps. Since LLMs cannot assess the difficulty of a given problem, they tend to apply the same cautious reasoning strategy across all tasks, resulting in inefficient overthinking. To address this, we propose CoThink, an embarrassingly simple pipeline: an instruct model first drafts a high-level solution outline; a reasoning model then works out the solution. We observe that CoThink enables dynamic adjustment of reasoning depth based on input difficulty. Evaluated with three reasoning models (DAPO, DeepSeek-R1, and QwQ) on three datasets (GSM8K, MATH500, and AIME24), CoThink reduces total token generation by 22.3% while maintaining pass@1 accuracy within a 0.42% margin on average. With reference to the instruct model, we formally define reasoning efficiency and observe a potential reasoning efficiency scaling law in LLMs.
zh
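
【示例】CoThink 的流水线胶水代码非常短:指令模型先产出高层大纲,推理模型再带着大纲求解。下面以占位函数代替真实 LLM 调用,提示词写法仅为示例假设,并非论文官方实现:

```python
# 两段式流水线:指令模型出大纲 -> 推理模型按大纲求解(占位示意)
def instruct_model_outline(problem: str) -> str:
    # 占位:实际应调用指令模型,低温度、输出简短大纲
    return "1. 设未知数 2. 列方程 3. 求解并验算"

def reasoning_model_solve(problem: str, outline: str) -> str:
    # 占位:实际应调用推理模型,把大纲拼入提示词以约束推理深度
    prompt = f"问题:{problem}\n请按以下大纲作答:{outline}"
    return f"(此处应为推理模型的回答,prompt 共 {len(prompt)} 个字符)"

problem = "一个数的 3 倍比它本身大 10,求这个数。"
print(reasoning_model_solve(problem, instruct_model_outline(problem)))
```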

[NLP-86] Legal Assist AI: Leveraging Transformer-Based Model for Effective Legal Assistance

【速读】: 该论文试图解决印度民众在获取法律援助方面存在的关键问题,即由于法律意识薄弱和法律信息获取渠道有限,导致公民难以有效行使法律权利。解决方案的关键在于开发Legal Assist AI,这是一个基于Transformer的模型,通过大规模语言模型(LLMs)提供有效的法律协助。该模型在印度法律领域的广泛数据集上进行微调,包括印度宪法、《印度刑法典》(Bharatiya Nyaya Sanhita, BNS)和《印度公民安全法典》(Bharatiya Nagarik Suraksha Sanhita, BNSS)等,从而实现了对印度法律复杂性的深入理解,并在法律问答任务中表现出卓越的效率与专业性。

链接: https://arxiv.org/abs/2505.22003
作者: Jatin Gupta,Akhil Sharma,Saransh Singhania,Ali Imam Abidi
机构: Sharda University (沙德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 tables, 4 figures. This is a revised version of a preprint previously available at this URL: this https URL

点击查看摘要

Abstract:Pursuit of accessible legal assistance in India faces a critical gap, as many citizens struggle to leverage their legal rights due to limited awareness and access to relevant legal information. This paper introduces Legal Assist AI, a transformer-based model designed to bridge this gap by offering effective legal assistance through large language models (LLMs). The system retrieves relevant legal information from a curated database and generates accurate responses, enabling effective assistance for diverse users, including legal professionals, scholars, and the general public. The model was fine-tuned on extensive datasets from the Indian legal domain, including Indian Constitution, Bharatiya Nyaya Sanhita (BNS), Bharatiya Nagarik Suraksha Sanhita (BNSS) and so forth, providing a robust understanding of the complexities of Indian law. By incorporating domain-specific legal datasets, the proposed model demonstrated remarkable efficiency and specialization in legal Question-Answering. The model was evaluated against state-of-the-art models such as GPT-3.5 Turbo and Mistral 7B, achieving a 60.08% score on the AIBE, outperforming its competitors in legal reasoning and accuracy. Unlike other models, Legal Assist AI avoided common issues such as hallucinations, making it highly reliable for practical legal applications. It showcases the model’s applicability in real-world legal scenarios, with future iterations aiming to enhance performance and expand its dataset to cover a broader range of multilingual and case-specific queries as well.

[NLP-87] Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate

【Quick Read】: This paper addresses cross-lingual consistency in large language models (LLMs): whether a model responds consistently to the same query across languages. Conventional evaluation relies on expensive annotated datasets and is hard to apply to open-ended generation. The key to the solution is a simple translate-then-evaluate strategy, instantiated as a framework that measures consistency along two dimensions: information and empathy. The framework reveals pronounced inconsistencies in popular LLMs across 30 languages, with especially poor performance in certain language families and scripts, highlighting weaknesses in their multilingual capabilities.

Link: https://arxiv.org/abs/2505.21999
Authors: Ashim Gupta, Maitrey Mehta, Zhichao Xu, Vivek Srikumar
Affiliations: Kahlert School of Computing; University of Utah
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent at responding to the same query in other languages? The popular way of evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model responses across different languages. In this work, we propose a framework to evaluate an LLM’s cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings necessitate cross-lingual evaluations that are consistent along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.
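
A minimal sketch of the translate-then-evaluate idea, assuming a stand-in `translate` function and using surface similarity (difflib) in place of the paper's information and empathy measures:

```python
from difflib import SequenceMatcher
from itertools import combinations

def translate(text: str, target_lang: str = "en") -> str:
    raise NotImplementedError("plug in any MT system here")

def consistency(responses_by_lang: dict[str, str]) -> float:
    # Pivot every response into one language, then average pairwise similarity
    # as a crude cross-lingual consistency score.
    pivoted = [translate(r) for r in responses_by_lang.values()]
    pairs = list(combinations(pivoted, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```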

[NLP-88] Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data

【Quick Read】: This paper addresses the challenge that the distinct structures of quantitative and qualitative data pose for analyzing measurement characteristics and individual response patterns in mixed methods research. The key to the solution is using large language models (LLMs) to generate synthetic survey responses grounded in qualitative data. Using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews with after-school program staff as a case study, the work examines whether interview-informed LLMs can reliably predict human survey responses and finds that interview content substantially affects response diversity and alignment.

Link: https://arxiv.org/abs/2505.21997
Authors: Jihong Zhang, Xinya Liang, Anqi Deng, Nicole Bonge, Lin Tan, Ling Zhang, Nicole Zarrett
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly in examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer promising solutions by generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews from after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information had less impact than interview content on alignment accuracy. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research.

[NLP-89] Learning Compositional Behaviors from Demonstration and Language

【Quick Read】: This paper addresses generalization and planning for long-horizon robotic manipulation, particularly under novel initial states, external state perturbations, and novel goals. The key to the solution is BLADE, a framework that integrates imitation learning with model-based planning: it leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and builds a library of structured high-level action representations. These representations include preconditions and effects grounded in visual perception, together with corresponding controllers implemented as neural network policies, enabling automatic recovery of structured representations without manually labeled states or symbolic definitions.

Link: https://arxiv.org/abs/2505.21981
Authors: Weiyu Liu, Neil Nie, Ruohan Zhang, Jiayuan Mao, Jiajun Wu
Affiliations: Stanford University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at CoRL 2024 and as an Oral Presentation at the 2024 CoRL LEAP Workshop. The first two authors contributed equally. The last two authors jointly advised the project. For videos and additional results, visit: this https URL

Abstract:We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. These representations include preconditions and effects grounded in visual perception for each high-level action, along with corresponding controllers implemented as neural network-based policies. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE shows significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots with a diverse set of objects with articulated parts, partial observability, and geometric constraints.

[NLP-90] Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset

【Quick Read】: This paper addresses the cultural biases encoded in mainstream large vision-language models (LVLMs) by building a diverse multimodal dataset for cultural understanding. The key to the solution is Pearl, a large-scale Arabic multimodal dataset and benchmark constructed through advanced agentic workflows and extensive human-in-the-loop annotation by 45 annotators from across the Arab world. It spans ten culturally significant domains and provides two robust evaluation benchmarks, plus a specialized subset, Pearl-X, for assessing nuanced cultural variation.

Link: https://arxiv.org/abs/2505.21979
Authors: Fakhraddin Alwajih, Samar Mohamed Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Razan Khassib, Lina Hamad, Mohammed Anwar AL-Ghrawi, Fatimah Alshamari, Cheikh Malainine, Doaa Qawasmeh, Aminetou Yacoub, Tfeil moilid, Ruwa AbuHweidi, Ahmed Aboeitta, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Adel Ammar, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Sara Shatnawi, Alcides Alcoba Inciarte, AbdelRahim A. Elmadany, Mohamedou cheikh tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: this https URL

Abstract:Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks Pearl and Pearl-Lite along with a specialized subset Pearl-X explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.

[NLP-91] Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack

【Quick Read】: This paper studies the security of large vision-language models (LVLMs) under adversarial attack, in particular how conventional adversarial attacks circumvent the models' built-in safety mechanisms. The key to the solution is a two-stage evaluation framework for systematically analyzing and quantifying attack effects: the first stage distinguishes instruction non-compliance, outright refusal, and successful adversarial exploitation; the second stage quantifies how far the model's output fulfills the harmful intent and categorizes refusals into direct, soft, and partially helpful refusals. The paper also introduces a normative schema defining ideal model behavior when confronted with harmful prompts, providing a principled target for safety alignment in multimodal systems.

Link: https://arxiv.org/abs/2505.21967
Authors: Juan Ren, Mark Dras, Usman Naseem
Affiliations: Macquarie University
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities. In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs. We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs. The first stage differentiates among instruction non-compliance, outright refusal, and successful adversarial exploitation. The second stage quantifies the degree to which the model’s output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful. Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.

[NLP-92] MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing

【Quick Read】: This paper aims to simplify map animation authoring, which traditionally requires complex manual work and specialized skills, by enabling creation driven by natural language. The key to the solution is MapStory, a system built on an LLM-powered agentic architecture that automatically decomposes a user-written script into key animation building blocks, combines geospatial information querying with interactive timeline editing, and thereby supports efficient generation and customization of animation sequences.

Link: https://arxiv.org/abs/2505.21966
Authors: Aditya Gunturu, Ben Pearman, Keiichi Ihara, Morteza Faraji, Bryan Wang, Rubaiat Habib Kazi, Ryo Suzuki
Affiliations: University of Calgary; University of Tsukuba; Adobe; University of Colorado Boulder
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 16 pages and 15 figures

Abstract:We introduce MapStory, an LLM-powered animation authoring tool that generates editable map animation sequences directly from natural language text. Given a user-written script, MapStory leverages an agentic architecture to automatically produce a scene breakdown, which decomposes the script into key animation building blocks such as camera movements, visual highlights, and animated elements. Our system includes a researcher component that accurately queries geospatial information by leveraging an LLM with web search, enabling the automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine-tune parameters of these blocks through an interactive timeline editor. We detail the system’s design and architecture, informed by formative interviews with professional animators and an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map-centric stories.

[NLP-93] UI-Evol: Automatic Knowledge Evolving for Computer Use Agents

【Quick Read】: This paper addresses the critical knowledge-execution gap between external knowledge and actual task execution: retrieved knowledge often fails to translate into effective real-world execution. The key to the solution is the UI-Evol module, which has two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing those sequences against external references, improving task success rates and agent reliability.

Link: https://arxiv.org/abs/2505.21964
Authors: Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.

[NLP-94] LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

【Quick Read】: This paper addresses the automated construction and optimization of complete post-training pipelines to improve large language models (LLMs) for specific domains or applications. Existing methods rely on manual design or optimize only a single component, lacking a systematic automation framework. The key to the solution is LaMDAgent, a framework in which LLM-based agents autonomously explore model generation techniques, datasets, and hyperparameter configurations, using task feedback to discover high-performing post-training pipelines with minimal human intervention.

Link: https://arxiv.org/abs/2505.21963
Authors: Taro Yano, Yoichi Ishibashi, Masafumi Oyamada
Affiliations: NEC Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To further tailor LLMs to specific domains or applications, post-training techniques such as Supervised Fine-Tuning (SFT), Preference Learning, and model merging are commonly employed. While each of these methods has been extensively studied in isolation, the automated construction of complete post-training pipelines remains an underexplored area. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging strategies. In this work, we introduce LaMDAgent (short for Language Model Developing Agent), a novel framework that autonomously constructs and optimizes full post-training pipelines through the use of LLM-based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, leveraging task-based feedback to discover high-performing pipelines with minimal human intervention. Our experiments show that LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capabilities. Moreover, it uncovers effective post-training strategies that are often overlooked by conventional human-driven exploration. We further analyze the impact of data and model size scaling to reduce computational costs of the exploration, finding that scaling model size introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.

[NLP-95] EnsemW2S: Enhancing Weak-to-Strong Generalization with Large Language Model Ensembles

【Quick Read】: This paper tackles weak-to-strong (W2S) generalization: how smaller models exposed only to human-level data can effectively supervise and improve large language models (LLMs). The key to the solution is EnsemW2S, which trains multiple weak experts on the same limited human-level data and uses a token-level ensemble strategy that iteratively combines them, systematically addressing the shortcomings of preceding iterations and substantially strengthening the weak models' collective ability to supervise a stronger student model.

Link: https://arxiv.org/abs/2505.21959
Authors: Aakriti Agrawal, Mucong Ding, Zora Che, Chenghao Deng, Anirudh Satheesh, Bang An, Bayan Bruss, John Langford, Furong Huang
Affiliations: Capital One; Microsoft; University of Maryland
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Superalignment. arXiv admin note: substantial text overlap with arXiv:2410.04571

Abstract:With Large Language Models (LLMs) rapidly approaching and potentially surpassing human-level performance, it has become imperative to develop approaches capable of effectively supervising and enhancing these powerful models using smaller, human-level models exposed to only human-level data. We address this critical weak-to-strong (W2S) generalization challenge by proposing a novel method aimed at improving weak experts, by training on the same limited human-level data, enabling them to generalize to complex, super-human-level tasks. Our approach, called EnsemW2S, employs a token-level ensemble strategy that iteratively combines multiple weak experts, systematically addressing the shortcomings identified in preceding iterations. By continuously refining these weak models, we significantly enhance their collective ability to supervise stronger student models. We extensively evaluate the generalization performance of both the ensemble of weak experts and the subsequent strong student model across in-distribution (ID) and out-of-distribution (OOD) datasets. For OOD, we specifically introduce question difficulty as an additional dimension for defining distributional shifts. Our empirical results demonstrate notable improvements, achieving 4% and 3.2% improvements on ID datasets, and up to 6% and 2.28% on OOD datasets for experts and student models respectively, underscoring the effectiveness of our proposed method in advancing W2S generalization.
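
A hedged sketch of what a token-level weak-expert ensemble could look like, assuming Hugging Face-style causal LMs sharing one tokenizer; simple probability averaging and greedy decoding stand in for the paper's iterative combination scheme:

```python
import torch

@torch.no_grad()
def ensemble_generate(experts, tokenizer, prompt, max_new_tokens=64):
    # `experts` are causal LMs whose vocabularies match the shared tokenizer.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Average the experts' next-token distributions at every step.
        probs = torch.stack(
            [m(input_ids=ids).logits[:, -1].softmax(dim=-1) for m in experts]
        ).mean(dim=0)
        next_id = probs.argmax(dim=-1, keepdim=True)  # greedy for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```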

[NLP-96] Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning

【Quick Read】: This paper addresses the poor performance of data selection (DS) for domain-specific instruction tuning, especially under knowledge conflicts, where existing methods struggle to select data that matches the actual needs of large language models (LLMs). The key to the solution is a Knowledge-aware Data Selection (KDS) framework built on two knowledge-aware metrics that quantify conflicts along two dimensions: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering high-conflict data and sampling high-quality, diverse data, KDS improves LLM performance and alleviates hallucination.

Link: https://arxiv.org/abs/2505.21958
Authors: Qihuang Zhong, Liang Ding, Fei Liao, Juhua Liu, Bo Du, Dacheng Tao
Affiliations: Renmin Hospital, Wuhan University; School of Computer Science, Wuhan University; School of Computer Science, Faculty of Engineering, The University of Sydney; College of Computing & Data Science, Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since the instruction-tuning dataset might contain redundant or low-quality data, data selection (DS) is usually required to maximize data efficiency. Despite the successes in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs’ pretrained knowledge and the context knowledge of instruction data, which could damage LLMs’ prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (namely KDS) framework to select the domain-specific instruction-tuning data that meets LLMs’ actual needs. The core of KDS is to leverage two knowledge-aware metrics for quantitatively measuring knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering the data with large knowledge conflicts and sampling the high-quality and diverse data, KDS can effectively stimulate the LLMs’ abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically prove that KDS surpasses the other baselines and brings significant and consistent performance gains among all LLMs. More encouragingly, KDS effectively improves the model generalization and alleviates the hallucination problem.
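
As a rough illustration of knowledge-aware filtering, one could score each example by how well the model's closed-book answer agrees with the example's reference answer and drop high-conflict examples. The `model_answer` and `agreement` callables and the threshold are assumptions; the paper's two metrics are finer-grained than this:

```python
def select_data(examples, model_answer, agreement, threshold=0.5):
    """Keep instruction examples whose context knowledge does not conflict
    with the model's parametric (memorized) knowledge."""
    kept = []
    for ex in examples:
        closed_book = model_answer(ex["question"])    # parametric knowledge
        score = agreement(closed_book, ex["answer"])  # low score = conflict
        if score >= threshold:
            kept.append(ex)
    return kept
```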

[NLP-97] Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation

【Quick Read】: This paper addresses text-to-image generation's need for domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing retrieval-augmented generation (RAG) methods retrieve globally relevant images, but a single image often cannot cover all elements of a complex user query. The key to the proposed Cross-modal RAG framework is decomposing both queries and images into sub-dimensional components to enable subquery-aware retrieval and generation, together with a hybrid retrieval strategy that combines a sub-dimensional sparse retriever with a dense retriever to identify a Pareto-optimal set of images, each contributing complementary aspects of the query.

Link: https://arxiv.org/abs/2505.21956
Authors: Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao
Affiliations: Emory University; University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality, while maintaining high efficiency.
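
A small sketch of the Pareto-selection step, assuming per-subquery `sparse` and `dense` scorers; the mixing weight and dominance test are illustrative:

```python
def pareto_optimal(images, subqueries, sparse, dense, alpha=0.5):
    # One hybrid score per (image, subquery): a sparse/dense mixture.
    vecs = [
        [alpha * sparse(q, img) + (1 - alpha) * dense(q, img) for q in subqueries]
        for img in images
    ]

    def dominated(i, j):
        # Image j dominates image i if it is at least as good on every
        # subquery and strictly better on at least one.
        return all(a <= b for a, b in zip(vecs[i], vecs[j])) and vecs[i] != vecs[j]

    return [img for i, img in enumerate(images)
            if not any(dominated(i, j) for j in range(len(images)))]
```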

[NLP-98] Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation

【Quick Read】: This paper addresses underexplored generation quality in multilingual tasks, specifically improving it via inference-time scaling with repeated sampling. The key to the solution is using perplexity- and reward-based verifiers to evaluate and select generations; experiments show that while perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers significantly improve performance on tasks that require reasoning (e.g., math, code).

Link: https://arxiv.org/abs/2505.21941
Authors: Ashim Gupta, Vivek Srikumar
Affiliations: Kahlert School of Computing; University of Utah
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). Our results demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting the right verifier for the task.
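
The selection step reduces to best-of-n sampling. In this minimal sketch, `generate` and `verifier` are placeholders for the model and a reward (or negative-perplexity) scorer:

```python
def best_of_n(generate, verifier, prompt, n=8):
    # Sample n candidates and keep the one the verifier scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier(prompt, c))
```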

[NLP-99] RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering ACL2025

【Quick Read】: This paper addresses the shortcomings of large language models (LLMs) on complex reasoning, in particular Multi-Hop Question Answering (MHQA), where integrating evidence from diverse sources and managing intricate logical dependencies often leads to reasoning errors. The key to the solution is RISE (Reasoning Enhancement via Iterative Self-Exploration), a framework whose core steps are question decomposition, retrieve-then-read, and self-critique, improving the model's ability to integrate evidence, maintain logical consistency, and perform well on MHQA tasks.

Link: https://arxiv.org/abs/2505.21940
Authors: Bolei He, Xinran He, Mengke Chen, Xianwei Xue, Ying Zhu, Zhenhua Ling
Affiliations: Baidu Inc.; University of Science and Technology of China
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Findings

Abstract:Large Language Models (LLMs) excel in many areas but continue to face challenges with complex reasoning tasks, such as Multi-Hop Question Answering (MHQA). MHQA requires integrating evidence from diverse sources while managing intricate logical dependencies, which often leads to errors in reasoning. Retrieval-Augmented Generation (RAG), widely employed in MHQA tasks, faces challenges in effectively filtering noisy data and retrieving all necessary evidence, thereby limiting its effectiveness in addressing MHQA challenges. To address these challenges, we propose RISE: Reasoning Enhancement via Iterative Self-Exploration, a novel framework designed to enhance models’ reasoning capability through iterative self-exploration. Specifically, RISE involves three key steps in addressing MHQA tasks: question decomposition, retrieve-then-read, and self-critique. By leveraging continuous self-exploration, RISE identifies accurate reasoning paths, iteratively self-improving the model’s capability to integrate evidence, maintain logical consistency, and enhance performance in MHQA tasks. Extensive experiments on multiple MHQA benchmarks demonstrate that RISE significantly improves reasoning accuracy and task performance.
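
A rough sketch of the RISE loop under assumed helper functions, each standing in for an LLM or retriever call:

```python
def rise(question, decompose, retrieve, read, critique, max_rounds=3):
    answer = None
    for _ in range(max_rounds):
        sub_questions = decompose(question)                  # step 1: decompose
        evidence = [read(sq, retrieve(sq)) for sq in sub_questions]  # step 2
        answer = read(question, evidence)                    # integrate evidence
        passed, feedback = critique(question, answer, evidence)      # step 3
        if passed:                       # stop once self-critique is satisfied
            return answer
        question = f"{question}\n(Address this critique: {feedback})"
    return answer
```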

[NLP-100] Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic Languages

【Quick Read】: This paper addresses the challenge of translating multi-word expressions (MWEs) and idioms across languages, especially the one-to-many nature of idiomatic translation caused by cultural differences. Traditional static knowledge graphs (KGs) and prompt-based methods struggle to capture these complex relationships, yielding suboptimal translations. The key to the solution is IdiomCE, an adaptive graph neural network (GNN) based method that learns intricate mappings between idiomatic expressions and generalizes to both seen and unseen nodes, improving idiomatic translation quality even in resource-constrained settings and for smaller models.

Link: https://arxiv.org/abs/2505.21937
Authors: Pratik Rakesh Singh, Kritarth Prasad, Mohammadi Zaki, Pankaj Wasnik
Affiliations: Media Analysis Group, Sony Research India
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Translating multi-word expressions (MWEs) and idioms requires a deep understanding of the cultural nuances of both the source and target languages. This challenge is further amplified by the one-to-many nature of idiomatic translations, where a single source idiom can have multiple target-language equivalents depending on cultural references and contextual variations. Traditional static knowledge graphs (KGs) and prompt-based approaches struggle to capture these complex relationships, often leading to suboptimal translations. To address this, we propose IdiomCE, an adaptive graph neural network (GNN) based methodology that learns intricate mappings between idiomatic expressions, effectively generalizing to both seen and unseen nodes during training. Our proposed method enhances translation quality even in resource-constrained settings, facilitating improved idiomatic translation in smaller models. We evaluate our approach on multiple idiomatic translation datasets using reference-less metrics, demonstrating significant improvements in translating idioms from English to various Indian languages.

[NLP-101] RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

【Quick Read】: This paper addresses the threat of indirect prompt injection to computer-use agents (CUAs) across operating system (OS) and web environments; current evaluations either lack realistic but controlled environments or ignore hybrid web-OS attack scenarios spanning both interfaces. The key to the solution is RedTeamCUA, an adversarial testing framework whose core is a novel hybrid sandbox integrating a VM-based OS environment with Docker-based web platforms, supporting flexible adversarial scenario configuration and decoupling adversarial evaluation from CUAs' navigational limitations by initializing tests directly at the adversarial injection point.

Link: https://arxiv.org/abs/2505.21936
Authors: Zeyi Liao, Jaylen Jones, Linxi Jiang, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, Huan Sun
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection. Current evaluations of this threat either lack support for realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an ASR of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although failing to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, with the recently released frontier Claude 4 Opus | CUA showing an alarming ASR of 48%, demonstrating that indirect prompt injection presents tangible risks for even advanced CUAs despite their capabilities and safeguards. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.

[NLP-102] Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets ACL’25

【Quick Read】: This paper addresses efficient fine-tuning of language models on multiple datasets from different tasks; existing methods such as quantized LoRA (QLoRA) are efficient on a single dataset but lack effective adaptation strategies for multi-task settings. The key to the solution is using an ensemble of multiple smaller adapters instead of one adapter per task: the n datasets are partitioned into m groups (with m typically much smaller than n), one adapter is trained per group, and a weighted combination forms the ensemble. The method exploits a first-order approximation property of low-rank adaptation, estimating fine-tuning performance from the base model's gradients since methods like LoRA stay close to the base model; experiments show this improves fine-tuning efficiency and performance.

Link: https://arxiv.org/abs/2505.21930
Authors: Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang
Affiliations: Northeastern University; University of Michigan
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 17 pages. To appear in ACL’25

Abstract:This paper develops an ensemble method for fine-tuning a language model to multiple datasets. Existing methods, such as quantized LoRA (QLoRA), are efficient when adapting to a single dataset. When training on multiple datasets of different tasks, a common setup in practice, it remains unclear how to design an efficient adaptation for fine-tuning language models. We propose to use an ensemble of multiple smaller adapters instead of a single adapter per task. We design an efficient algorithm that partitions n datasets into m groups, where m is typically much smaller than n in practice, and train one adapter for each group before taking a weighted combination to form the ensemble. The algorithm leverages a first-order approximation property of low-rank adaptation to quickly obtain the fine-tuning performances of dataset combinations since methods like LoRA stay close to the base model. Hence, we use the gradients of the base model to estimate its behavior during fine-tuning. Empirically, this approximation holds with less than 1% error on models with up to 34 billion parameters, leading to an estimation of true fine-tuning performances under 5% error while speeding up computation compared to base fine-tuning by 105 times. When applied to fine-tune Llama and GPT models on ten text classification tasks, our approach provides up to 10% higher average test accuracy over QLoRA, with only 9% more FLOPs. On a Llama model with 34 billion parameters, an ensemble of QLoRA increases test accuracy by 3% compared to QLoRA, with only 8% more FLOPs.
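
The first-order estimate at the heart of the speedup can be written in one line; here `base_loss`, `base_grad`, and `delta_theta` are assumed to be a scalar loss and flattened parameter-space vectors at the base model:

```python
import torch

def estimated_loss(base_loss: torch.Tensor,
                   base_grad: torch.Tensor,
                   delta_theta: torch.Tensor) -> torch.Tensor:
    # First-order Taylor expansion around the base parameters theta0:
    #   L(theta0 + d) ~= L(theta0) + g . d
    # valid because low-rank adapters keep the model close to theta0.
    return base_loss + base_grad.dot(delta_theta)
```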

[NLP-103] Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning ACL2025

【Quick Read】: This paper addresses the fact that existing knowledge graph (KG) foundation models focus mainly on structural information while neglecting textual information, which limits performance on tasks across domains or beyond the KG (out-of-KG tasks). The key to the solution is a multi-perspective Conditional Message Passing (CMP) encoding architecture that fuses structural and textual information in KGs, together with a dynamic residual fusion module and a flexible edge scoring mechanism, enabling seamless integration of the two modalities and stronger task adaptability.

Link: https://arxiv.org/abs/2505.21926
Authors: Yin Hua, Zhiqiang Liu, Mingyang Chen, Zheng Fang, Chi Man Wong, Lingxiao Li, Chi Man Vong, Huajun Chen, Wen Zhang
Affiliations: Zhejiang University; Shopee Pte. Ltd.; University of Macau; Zhejiang Key Laboratory of Big Data Intelligent Computing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Findings

Abstract:In natural language processing (NLP) and computer vision (CV), the successful application of foundation models across diverse tasks has demonstrated their remarkable potential. However, despite the rich structural and textual information embedded in knowledge graphs (KGs), existing research on foundation models for KGs has primarily focused on structural aspects, with most efforts restricted to in-KG tasks (e.g., knowledge graph completion, KGC). This limitation has hindered progress in addressing more challenging out-of-KG tasks. In this paper, we introduce MERRY, a foundation model for general knowledge graph reasoning, and investigate its performance across two task categories: in-KG reasoning tasks (e.g., KGC) and out-of-KG tasks (e.g., KG question answering, KGQA). We utilize not only the structural information but also the textual information in KGs. Specifically, we propose a multi-perspective Conditional Message Passing (CMP) encoding architecture to bridge the gap between textual and structural modalities, enabling their seamless integration. Additionally, we introduce a dynamic residual fusion module to selectively retain relevant textual information and a flexible edge scoring mechanism to adapt to diverse downstream tasks. Comprehensive evaluations on 28 datasets demonstrate that MERRY outperforms existing baselines in most scenarios, showcasing strong reasoning capabilities within KGs and excellent generalization to out-of-KG tasks such as KGQA.

[NLP-104] Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy

【Quick Read】: This paper addresses the shortcomings of AI copilots in personalization, in particular how to capture, model, and optimize user preferences in real-time interactive systems to improve usability, trust, and productivity. The key to the solution is a phase-based taxonomy of preference optimization strategies covering pre-interaction, mid-interaction, and post-interaction stages, together with a systematic analysis of techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, providing a structured foundation for designing adaptive, preference-aware AI copilots.

Link: https://arxiv.org/abs/2505.21907
Authors: Saleh Afzoon, Zahra Jahanandish, Phuong Thao Huynh, Amin Beheshti, Usman Naseem
Affiliations: School of Computing, Macquarie University, Sydney, Australia; Department of Computer Engineering and Information Technology, Shiraz University of Technology, Shiraz, Iran
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:AI copilots are context-aware, AI-powered systems designed to assist users in tasks such as software development and content creation, and they are becoming integral to modern workflows. As these systems grow in capability and adoption, personalization has emerged as a cornerstone for ensuring usability, trust, and productivity. Central to this personalization is preference optimization: the ability of AI copilots to detect, interpret, and align with individual user preferences. While personalization techniques are well-established in domains like recommender systems and dialogue agents, their adaptation to interactive, real-time systems like AI copilots remains fragmented and underexplored. This survey addresses this gap by synthesizing research on how user preferences are captured, modeled, and refined within the design of AI copilots. We introduce a unified definition of AI copilots and propose a phase-based taxonomy of preference optimization strategies, structured around pre-interaction, mid-interaction, and post-interaction stages. We analyze techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, highlighting both established approaches and recent innovations. By bridging insights from AI personalization, human-AI collaboration, and large language model adaptation, this survey provides a structured foundation for designing adaptive, preference-aware AI copilots. It offers a holistic view of the available preference resources, how they can be leveraged, and which technical approaches are most suited to each stage of system design.

[NLP-105] Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

【Quick Read】: This paper addresses the limitations of standalone agents on complex tasks that demand extensive interaction and computational resources, as well as the resource inefficiency of multi-agent systems (MAS). The key to the solution is Co-Saving, a resource-aware multi-agent system that introduces "shortcuts", instructional transitions learned from historically successful trajectories, to bypass redundant reasoning and improve both the efficiency and the quality of collective problem solving.

Link: https://arxiv.org/abs/2505.21898
Authors: Rennai Qiu, Chen Qian, Ran Li, Yufan Dang, Weize Chen, Cheng Yang, Yingli Zhang, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Affiliations: Tsinghua University; Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications; Siemens; Tencent Robotics X
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: Work in Progress

Abstract:Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system, Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of "shortcuts", instructional transitions learned from historically successful trajectories, which allows the system to bypass redundant reasoning agents and expedite the collective problem-solving process. Experiments for software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.

[NLP-106] EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse

【Quick Read】: This paper addresses the inefficiency of cross-request key-value (KV) cache reuse in infilling tasks, caused by the structure of the prompt format. The key is EFIM, a transformed fill-in-the-middle (FIM) prompt format that restructures the prompt to improve KV cache reuse, reducing latency and increasing throughput. Because the transformed prompt exposes weaknesses of current LLMs in generating partial words, the paper further proposes a fragment tokenization training method to address subtoken generation.

Link: https://arxiv.org/abs/2505.21889
Authors: Tianyu Guo, Hande Dong, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao, Xianwei Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, the KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address the issue, we propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache reuse. Although the transformed prompt can solve the inefficiency, it exposes subtoken generation problems in current LLMs, where they have difficulty generating partial words accurately. Therefore, we introduce a fragment tokenization training method which splits text into multiple fragments before tokenization during data processing. Experiments on two representative LLMs show that LLM serving with EFIM can lower the latency by 52% and improve the throughput by 98% while maintaining the original infilling capability. EFIM’s source code is publicly available at this https URL.
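
For intuition, the sketch below shows why a standard FIM layout defeats prefix-cache reuse across requests; the sentinel tokens and example are illustrative only and do not show the paper's actual EFIM format:

```python
import os

def fim_prompt(prefix: str, suffix: str) -> str:
    # Common fill-in-the-middle layout: prefix and suffix are interleaved
    # with sentinel tokens, so a change on one side shifts everything after it.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

# Two consecutive infilling requests share the suffix, yet their serialized
# prompts diverge inside the prefix, so the suffix's KV entries are recomputed.
r1 = fim_prompt("def add(a, b):\n    ", "\nprint(add(1, 2))")
r2 = fim_prompt("def add(a, b):\n    return ", "\nprint(add(1, 2))")
print(len(os.path.commonprefix([r1, r2])))  # length of the reusable prefix
```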

[NLP-107] Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

【Quick Read】: This paper addresses the limited agent diversity and realism of traditional rule-based agent-based modeling (ABM) for urban mobility simulation. The key to the solution is integrating a large language model (LLM): generating synthetic population profiles, allocating routine and occasional activity locations, and simulating personalized routes, thereby increasing the complexity and realism of the simulation.

Link: https://arxiv.org/abs/2505.21880
Authors: Yu-Lun Song, Chung-En Tsern, Che-Cheng Wu, Yu-Ming Chang, Syuan-Bo Huang, Wei-Chu Chen, Michael Chia-Liang Lin, Yu-Ta Lin
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 8 pages, 8 figures. This paper is reviewed and accepted by the CUPUM (Computational Urban Planning and Urban Management) Conference held by University College London (UCL) in 2025

Abstract:This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.

[NLP-108] Evaluating the Retrieval Robustness of Large Language Models

【Quick Read】: This paper addresses the risk that retrieval-augmented generation (RAG) degrades performance in practice due to imperfect retrieval and models' limited ability to use retrieved content. The key to the solution is evaluating the robustness of large language models (LLMs) under practical RAG setups: a benchmark of 1500 open-domain questions is built and three robustness metrics are introduced, corresponding to three research questions: whether RAG always beats non-RAG, whether more retrieved documents always improve performance, and whether document order affects results. Results show all tested LLMs exhibit surprisingly high retrieval robustness, yet varying degrees of imperfect robustness keep them from fully benefiting from RAG.

Link: https://arxiv.org/abs/2505.21870
Authors: Shuyang Cao, Karthik Radhakrishnan, David Rosenberg, Steven Lu, Pengxiang Cheng, Lu Wang, Shiyue Zhang
Affiliations: Bloomberg; University of Michigan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages

Abstract:Retrieval-augmented generation (RAG) generally enhances large language models’ (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model’s limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; (3) and whether document orders impact results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.
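
One plausible instantiation of the first robustness metric (RAG vs. non-RAG) is sketched below; the exact metric definitions in the paper may differ:

```python
def rag_robustness(questions, answer_rag, answer_plain, is_correct):
    # Fraction of questions where the RAG answer is at least as good as the
    # closed-book answer. `answer_rag`, `answer_plain`, and `is_correct` are
    # placeholders for the two inference modes and an answer checker.
    wins = sum(
        is_correct(q, answer_rag(q)) >= is_correct(q, answer_plain(q))
        for q in questions
    )
    return wins / len(questions)
```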

[NLP-109] GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning

【Quick Read】: This paper addresses the difficulty of accurately extracting the relevance of publicly significant event images, which carry contextual information valuable for journalism and education. The key to the solution is GETReason (Geospatial Event Temporal Reasoning), a framework that infers an image's deeper contextual meaning by extracting global event, temporal, and geospatial information, together with GREAT (Geospatial Reasoning and Event Accuracy with Temporal Alignment), a new metric for evaluating reasoning-based image understanding.

Link: https://arxiv.org/abs/2505.21863
Authors: Shikhhar Siingh, Abhinav Rawat, Vivek Gupta, Chitta Baral
Affiliations: Arizona State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Publicly significant images from events hold valuable contextual information, crucial for journalism and education. However, existing methods often struggle to extract this relevance accurately. To address this, we introduce GETReason (Geospatial Event Temporal Reasoning), a framework that moves beyond surface-level image descriptions to infer deeper contextual meaning. We propose that extracting global event, temporal, and geospatial information enhances understanding of an image’s significance. Additionally, we introduce GREAT (Geospatial Reasoning and Event Accuracy with Temporal Alignment), a new metric for evaluating reasoning-based image understanding. Our layered multi-agent approach, assessed using a reasoning-weighted metric, demonstrates that meaningful insights can be inferred, effectively linking images to their broader event context.

[NLP-110] Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries ACL2025

【Quick Read】: This paper addresses insufficient source coverage in multi-document summarization with large language models (LLMs), caused by the "lost in the middle" phenomenon of uneven attention to different parts of the context. The key to the solution is principled content selection that decomposes summarization into three steps: reducing the document collection to atomic key points, selecting key points that prioritize diversity with determinantal point processes (DPPs), and rewriting into the final summary. Combining prompting for extraction and rewriting with principled content selection consistently improves source coverage.

Link: https://arxiv.org/abs/2505.21859
Authors: Vishakh Padmakumar, Zichao Wang, David Arbour, Jennifer Healey
Affiliations: New York University; Adobe Research
Subjects: Computation and Language (cs.CL)
Comments: To appear at ACL 2025 - Main Conference

Abstract:While large language models (LLMs) are increasingly capable of handling longer contexts, recent work has demonstrated that they exhibit the “lost in the middle” phenomenon (Liu et al., 2024) of unevenly attending to different parts of the provided context. This hinders their ability to cover diverse source material in multi-document summarization, as noted in the DiverseSumm benchmark (Huang et al., 2024). In this work, we contend that principled content selection is a simple way to increase source coverage on this task. As opposed to prompting an LLM to perform the summarization in a single step, we explicitly divide the task into three steps – (1) reducing document collections to atomic key points, (2) using determinantal point processes (DPPs) to select key points that prioritize diverse content, and (3) rewriting to the final summary. By combining prompting steps (for extraction and rewriting) with principled techniques (for content selection), we consistently improve source coverage on the DiverseSumm benchmark across various LLMs. Finally, we also show that by incorporating relevance to a provided user intent into the DPP kernel, we can generate personalized summaries that cover relevant source information while retaining coverage.
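
A greedy stand-in for the diversity-aware selection step (closer to maximal marginal relevance than exact DPP MAP inference) might look like this, assuming a normalized sentence `embed`der and a `quality` scorer:

```python
import numpy as np

def select_diverse(points, embed, quality, k=10, lam=0.7):
    # Repeatedly add the key point that maximizes quality while penalizing
    # similarity (dot product of unit vectors) to points already chosen.
    vecs = [embed(p) for p in points]
    chosen = []
    while len(chosen) < min(k, len(points)):
        def gain(i):
            sim = max((float(np.dot(vecs[i], vecs[j])) for j in chosen),
                      default=0.0)
            return lam * quality(points[i]) - (1 - lam) * sim
        best = max((i for i in range(len(points)) if i not in chosen), key=gain)
        chosen.append(best)
    return [points[i] for i in chosen]
```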

[NLP-111] Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

【Quick Read】: This paper addresses how to optimally allocate inference-time compute: whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting over many short chains of thought). The key contribution is theoretical analysis and experimental validation showing that, in certain hard distributions of graph connectivity problems, sequential scaling offers an exponential advantage over parallel scaling.

Link: https://arxiv.org/abs/2505.21825
Authors: Parsa Mirtaheri, Ezra Edelman, Samy Jelassi, Eran Malach, Enric Boix-Adsera
Affiliations: UC San Diego; University of Pennsylvania; Harvard University; MIT and Harvard University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.

[NLP-112] Representative Language Generation ICML2025

【Quick Read】: This paper addresses diversity and bias in generative models by introducing "representative generation", extending the theoretical frameworks of Kleinberg et al. (2024) and Li et al. (2024). The key is requiring a generative model's outputs to proportionally represent groups of interest in the training data, and characterizing representative generation via the "group closure dimension" as the central combinatorial quantity. The paper analyzes information-theoretic and computational aspects of representative generation in the limit, proving feasibility for countably infinite hypothesis classes and group collections under certain conditions, but showing it is not computable using membership queries alone, in contrast to Kleinberg et al.'s (2024) positive results for standard generation.

Link: https://arxiv.org/abs/2505.21819
Authors: Charlotte Peale, Vinod Raman, Omer Reingold
Affiliations: Stanford University; University of Michigan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ICML 2025

Abstract:We introduce “representative generation,” extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the “group closure dimension” as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.'s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.

[NLP-113] Revisiting Common Assumptions about Arabic Dialects in NLP ACL2025

【Quick Read】: This paper examines the oversimplified assumptions about Arabic dialects that are widely adopted in NLP, such as the claim that Arabic dialects can be grouped into distinguishable regional dialects. The key to the solution is extending and analyzing a multi-label dataset in which native speakers of 11 country-level dialects manually assess sentence validity, empirically testing four such assumptions. The analysis shows these assumptions oversimplify reality and are not always accurate, which may be hindering progress on Arabic NLP tasks.

Link: https://arxiv.org/abs/2505.21816
Authors: Amr Keleg, Sharon Goldwater, Walid Magdy
Affiliations: Institute for Language, Cognition and Computation; School of Informatics, University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025

Abstract:Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., "Arabic dialects can be grouped into distinguishable regional dialects") and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.

[NLP-114] Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking

【Quick Read】: This paper addresses inaccurate query understanding in scientific paper retrieval, where dense retrieval fails to capture the fine-grained scientific concepts needed to interpret scientific queries, and LLM-based methods lack grounding in corpus-specific knowledge and may generate unreliable content. The key to the solution is SemRank, a framework combining LLM-guided query understanding with a concept-based semantic index: papers are indexed with multi-granular scientific concepts, and at query time an LLM identifies core concepts derived from the corpus, enabling precise semantic matching and improved retrieval accuracy.

Link: https://arxiv.org/abs/2505.21815
Authors: Yunyi Zhang, Ruozhen Yang, Siqi Jiao, SeongKu Kang, Jiawei Han
Affiliations: University of Illinois Urbana-Champaign; Korea University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Scientific paper retrieval is essential for supporting literature discovery and research. While dense retrieval methods demonstrate effectiveness in general-purpose tasks, they often fail to capture fine-grained scientific concepts that are essential for accurate understanding of scientific queries. Recent studies also use large language models (LLMs) for query understanding; however, these methods often lack grounding in corpus-specific knowledge and may generate unreliable or unfaithful content. To overcome these limitations, we propose SemRank, an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query’s information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy. Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses strong existing LLM-based baselines, and remains highly efficient.

[NLP-115] From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

【Quick Read】: This paper addresses the tendency of large language models (LLMs) to generate falsehoods in conversation, and how to understand and intervene on the model behavior associated with factual statements. The key to the solution is the concept cone framework: multi-dimensional cone structures are identified across multiple LLM families that causally mediate truth-related behavior. Three lines of evidence support this: causal interventions reliably flip model responses to factual statements, the learned cone structures generalize across model architectures, and cone-based interventions leave unrelated model behavior intact. This reveals a richer, multidirectional structure underlying simple true/false propositions in LLMs and positions concept cones as a promising tool for probing abstract behaviors.

Link: https://arxiv.org/abs/2505.21800
Authors: Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O’Brien, Kevin Zhu, Vasu Sharma
Affiliations: Stanford University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model’s internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
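
A hedged sketch of a cone-style intervention on a hidden state, assuming an orthonormal basis for the learned concept subspace; the hooking mechanics and the choice of injected vector are placeholders, not the paper's exact procedure:

```python
import torch

def steer(hidden: torch.Tensor, basis: torch.Tensor, target: torch.Tensor):
    # hidden: (..., d) activations; basis: (k, d) orthonormal directions
    # spanning the concept subspace; target: (d,) a chosen in-cone vector.
    coords = hidden @ basis.T          # components inside the subspace
    ablated = hidden - coords @ basis  # project the concept out
    return ablated + target           # inject the desired concept vector
```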

[NLP-116] VeriTrail: Closed-Domain Hallucination Detection with Traceability

【Quick Read】: This paper addresses closed-domain hallucination arising in processes with multiple generative steps (MGS), where language models drift from source material and produce unsubstantiated content. Detecting hallucinations only in the final output is necessary but not sufficient for MGS processes: one must also trace where hallucinated content was likely introduced and how faithful content was derived from the source through intermediate outputs. The key to the solution is VeriTrail, the first closed-domain hallucination detection method that provides traceability for both MGS and single-generative-step (SGS) processes, together with the first datasets containing all intermediate outputs and human annotations of final-output faithfulness.

Link: https://arxiv.org/abs/2505.21786
Authors: Dasha Metropolitansky, Jonathan Larson
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Even when instructed to adhere to source material, Language Models often generate unsubstantiated content - a phenomenon known as “closed-domain hallucination.” This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs’ faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.

[NLP-117] Born a Transformer – Always a Transformer?

【Quick Read】: This paper asks whether pretrained large language models (LLMs) can overcome the Transformer architecture's theoretical limitations on sequence-to-sequence tasks, in particular its limits on length generalization. The key is a family of retrieval and copying tasks inspired by Liu et al. [2024], combined with the C-RASP framework for length-generalization guarantees, which is used to analyze an induction-versus-anti-induction asymmetry in pretrained models and to connect that asymmetry to differences in the strength of induction versus anti-induction circuits inside Transformers.

Link: https://arxiv.org/abs/2505.21785
Authors: Yana Veitsman, Mayank Jobanputra, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn
Affiliations: Saarland University; Brown University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024]. We use the recently proposed C-RASP framework for studying length generalization [Huang et al., 2025b] to provide guarantees for each of our settings. Empirically, we observe an induction-versus-anti-induction asymmetry, where pretrained models are better at retrieving tokens to the right (induction) rather than the left (anti-induction) of a query token. This asymmetry disappears upon targeted fine-tuning if length-generalization is guaranteed by theory. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained Transformers. We validate our findings through practical experiments on real-world tasks demonstrating reliability risks. Our results highlight that pretraining selectively enhances certain Transformer capabilities, but does not overcome fundamental length-generalization limits.
zh

[NLP-118] Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在生成响应前进行安全推理(safety reasoning)时面临的挑战,即如何高效构建高质量的嵌入安全策略的思维链(chain-of-thought, CoT)数据集,同时确保推理过程准确、无幻觉或政策冲突。其解决方案的关键在于提出AIDSAFE:一种基于多智能体协商的迭代推理框架,通过智能体间的协作逐步扩展安全策略的推理过程,并引入数据精炼阶段以消除重复、冗余和欺骗性思考,从而保证输出质量。此外,还提出了补充方法以生成用于对齐阶段的偏好数据,进一步提升模型的安全性和鲁棒性。
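以下是多智能体迭代协商生成安全推理链这一流程的示意性伪实现(其中 `llm(prompt) -> str` 为假设的生成接口,提示词与轮数均为自拟,并非 AIDSAFE 官方代码):

```python
def aidsafe_deliberate(query, policies, llm, n_agents=3, n_rounds=2):
    """多智能体迭代协商的极简示意:各智能体逐轮围绕安全策略扩展推理,
    最后由精炼(refiner)步骤去除重复、冗余与欺骗性思考。"""
    thoughts = []
    for _ in range(n_rounds):
        for agent_id in range(n_agents):
            prompt = (f"安全策略: {policies}\n问题: {query}\n"
                      f"已有推理: {thoughts}\n"
                      f"请作为智能体{agent_id}补充或修正一条安全推理步骤。")
            thoughts.append(llm(prompt))
    # 数据精炼阶段:删除重复、冗余和欺骗性思考,保证输出质量
    return llm(f"请精炼以下推理链,去除重复与冗余步骤:\n{thoughts}")
```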

链接: https://arxiv.org/abs/2505.21784
作者: Tharindu Kumarage,Ninareh Mehrabi,Anil Ramakrishna,Xinyan Zhao,Richard Zemel,Kai-Wei Chang,Aram Galstyan,Rahul Gupta,Charith Peris
机构: Amazon Nova Responsible AI(亚马逊诺瓦负责任人工智能); Arizona State University(亚利桑那州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACL 2025 (Findings)

点击查看摘要

Abstract:Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: this https URL
zh

[NLP-119] GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task

【速读】: 该论文旨在解决低资源语言下的语音翻译(Speech Translation, ST)问题,特别是在缺乏足够标注数据的情况下提升翻译性能。其关键解决方案是基于SeamlessM4T-v2模型进行自动语音识别(ASR)、机器翻译(MT)以及端到端语音翻译(E2E ST)的微调,并探索了多种训练范式,包括直接端到端微调、多任务学习以及利用微调后的ASR和/或MT模型进行参数初始化,以提升在未训练语言上的ST性能。
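作为背景,下面示意如何用 Hugging Face transformers 加载 SeamlessM4T-v2 做语音识别推理(模型名与调用方式按 transformers 文档的常见用法书写,细节请以官方文档为准;论文的微调与级联流程不在此示意范围内):

```python
import torch
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")

audio = torch.randn(1, 16000)  # 16kHz 单声道波形,此处以随机张量代替真实音频
inputs = processor(audios=audio, sampling_rate=16000, return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="eng")  # tgt_lang 指定输出语言
print(processor.batch_decode(tokens, skip_special_tokens=True)[0])
```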

链接: https://arxiv.org/abs/2505.21781
作者: Chutong Meng,Antonios Anastasopoulos
机构: George Mason University (乔治·梅森大学)
类目: Computation and Language (cs.CL)
备注: IWSLT 2025

点击查看摘要

Abstract:This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. We trained systems for all language pairs, except for Levantine Arabic. We fine-tuned SeamlessM4T-v2 for automatic speech recognition (ASR), machine translation (MT), and end-to-end speech translation (E2E ST). The ASR and MT models are also used to form cascaded ST systems. Additionally, we explored various training paradigms for E2E ST fine-tuning, including direct E2E fine-tuning, multi-task training, and parameter initialization using components from fine-tuned ASR and/or MT models. Our results show that (1) direct E2E fine-tuning yields strong results; (2) initializing with a fine-tuned ASR encoder improves ST performance on languages SeamlessM4T-v2 has not been trained on; (3) multi-task training can be slightly helpful.
zh

[NLP-120] Calibrating LLM Confidence by Probing Perturbed Representation Stability

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中置信度估计不准确的问题,即模型对其预测结果的自信程度与其实际准确性之间存在偏差,这会降低模型的可靠性。解决方案的关键在于提出一种名为CCPS(Calibrating LLM Confidence by Probing Perturbed Representation Stability)的新方法,该方法通过在最终隐藏状态上施加有针对性的对抗性扰动,提取反映模型对扰动响应的特征,并利用轻量级分类器预测答案的正确性,从而更准确地估计模型的置信度。
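下面给出 CCPS 核心思路的极简示意(非官方实现):对最终隐藏状态施加扰动、提取响应特征,再用轻量分类器预测答案正确概率。为简化起见,此处以随机噪声代替原文的针对性对抗扰动,特征设计亦为自拟。

```python
import torch
import torch.nn as nn

def stability_features(h_final, n_perturb=8, eps=0.05):
    """对最终隐藏状态加扰动,提取扰动前后余弦相似度的统计量作为稳定性特征。"""
    noise = torch.randn(n_perturb, h_final.shape[-1]) * eps
    cos = nn.functional.cosine_similarity(h_final.unsqueeze(0) + noise,
                                          h_final.unsqueeze(0), dim=-1)
    return torch.stack([cos.mean(), cos.std(), cos.min(), cos.max()])

# 轻量级分类器:由稳定性特征预测答案正确的概率,作为校准后的置信度
classifier = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
h = torch.randn(768)                        # 假设来自某 LLM 的最终隐藏状态
print(float(classifier(stability_features(h))))
```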

链接: https://arxiv.org/abs/2505.21772
作者: Reza Khanmohammadi,Erfan Miahi,Mehrsa Mardikoraem,Simerjot Kaur,Ivan Brugere,Charese H. Smiley,Kundan Thind,Mohammad M. Ghassemi
机构: Michigan State University (密歇根州立大学); Independent AI Researcher (独立人工智能研究员); JPMorgan AI Research (摩根大通人工智能研究院); Henry Ford Health (亨利·福特健康)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model’s response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
zh

[NLP-121] BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床场景中行为适应性不足的问题,特别是在主动参与任务(如未提示的情况下识别关键缺失信息或风险)方面表现不佳。其核心挑战在于LLMs在临床辅助任务中缺乏一致的主动性。为解决这一问题,作者提出了BehaviorSFT,一种基于行为标记(behavioral tokens)的新型训练策略,通过显式条件化使LLMs能够在从被动响应到主动干预的临床辅助谱系中动态选择合适的行为。该方法的关键在于利用行为标记增强模型对不同行为模式的感知与选择能力,从而提升其在主动任务中的表现。
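行为标记条件化的一个最小示意如下(标记名称 `<reactive>`、`<proactive>` 为自拟假设,并非论文定义;真实训练中还需 `model.resize_token_embeddings(len(tokenizer))` 同步扩展词表):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# 把行为标记注册为特殊词元,使模型能以显式条件切换行为模式
tokenizer.add_special_tokens({"additional_special_tokens": ["<reactive>", "<proactive>"]})

def format_example(behavior, query, response):
    """把行为标记拼入监督微调样本,训练模型按标记选择行为。"""
    return f"{behavior} 用户: {query}\n助手: {response}"

print(format_example("<proactive>", "患者主诉头痛。", "建议进一步询问发作频率与既往病史。"))
```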

链接: https://arxiv.org/abs/2505.21757
作者: Yubin Kim,Zhiyuan Hu,Hyewon Jeong,Eugene Park,Shuyue Stella Li,Chanwoo Park,Shiyun Xiong,MingYu Lu,Hyeonhoon Lee,Xin Liu,Daniel McDuff,Cynthia Breazeal,Samir Tulebaev,Hae Won Park
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnosis reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs’ inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blind clinician evaluations confirmed BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a superior balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) versus standard fine-tuning or explicit instructed agents.
zh

[NLP-122] FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering CVPR2025

【速读】: 该论文旨在解决视觉问答(Visual Question Answering, VQA)系统在面对真实世界数据分布变化时的适应性问题,尤其是在多模态上下文中的鲁棒性不足。现有评估设置主要局限于单模态或特定类型的分布外(Out-of-Distribution, OOD)场景,难以全面反映多模态环境下的复杂性。解决方案的关键在于提出一个新的基准框架FRAMES-VQA,通过整合多个现有的VQA基准数据集,并将其分类为内分布(In-Distribution, ID)、近似分布外(Near OOD)和远分布外(Far OOD)数据集,涵盖单模态、多模态及对抗性分布偏移,从而系统地评估鲁棒微调方法的有效性。此外,通过计算单模态和多模态嵌入的马氏距离量化分布偏移,并分析单模态与多模态偏移之间的交互作用及模态重要性,为提升多模态分布偏移下的微调方法提供理论支持。
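其中用嵌入的马氏距离量化分布偏移的做法可以用几行 NumPy 示意(玩具数据,仅说明计算方式):

```python
import numpy as np

def mahalanobis_distance(x, ref_embeddings, reg=1e-6):
    """计算样本嵌入 x 相对参考(ID)嵌入分布的马氏距离。"""
    mu = ref_embeddings.mean(axis=0)
    cov = np.cov(ref_embeddings, rowvar=False) + reg * np.eye(ref_embeddings.shape[1])
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

id_emb = np.random.randn(1000, 64)            # 假设的 ID 样本嵌入
shifted = np.random.randn(64) + 3.0           # 假设的分布偏移样本
print(mahalanobis_distance(shifted, id_emb))  # 偏移越大,距离越大
```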

链接: https://arxiv.org/abs/2505.21755
作者: Chengyue Huang,Brisa Maneechotesuwan,Shivang Chopra,Zsolt Kira
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at this https URL .
zh

[NLP-123] From prosthetic memory to prosthetic denial: Auditing whether large language models are prone to mass atrocity denialism

【速读】: 该论文试图解决生成式 AI (Generative AI) 在历史记忆传播中的潜在影响,特别是其在塑造或扭曲大规模暴行记忆方面的风险。研究的核心问题是:生成式 AI 是否会促进“拟态记忆”(prosthetic memory)或“拟态否认”(prosthetic denial),即通过人工智能中介的历史事件体验或对暴行记忆的抹除与歪曲。解决方案的关键在于通过对比审计五种大型语言模型(Claude、GPT、Llama、Mixtral 和 Gemini)在四个历史案例(乌克兰饥荒、大屠杀、柬埔寨种族灭绝和卢旺达图西族大屠杀)中的响应,评估其对否认主义言论的敏感性与一致性,从而揭示训练数据可用性及模型概率输出对记忆完整性的潜在影响。

链接: https://arxiv.org/abs/2505.21753
作者: Roberto Ulloa,Eve M. Zucker,Daniel Bultmann,David J. Simon,Mykola Makhortykh
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The proliferation of large language models (LLMs) can influence how historical narratives are disseminated and perceived. This study explores the implications of LLMs’ responses on the representation of mass atrocity memory, examining whether generative AI systems contribute to prosthetic memory, i.e., mediated experiences of historical events, or to what we term “prosthetic denial,” the AI-mediated erasure or distortion of atrocity memories. We argue that LLMs function as interfaces that can elicit prosthetic memories and, therefore, act as experiential sites for memory transmission, but also introduce risks of denialism, particularly when their outputs align with contested or revisionist narratives. To empirically assess these risks, we conducted a comparative audit of five LLMs (Claude, GPT, Llama, Mixtral, and Gemini) across four historical case studies: the Holodomor, the Holocaust, the Cambodian Genocide, and the genocide against the Tutsis in Rwanda. Each model was prompted with questions addressing common denialist claims in English and an alternative language relevant to each case (Ukrainian, German, Khmer, and French). Our findings reveal that while LLMs generally produce accurate responses for widely documented events like the Holocaust, significant inconsistencies and susceptibility to denialist framings are observed for more underrepresented cases like the Cambodian Genocide. The disparities highlight the influence of training data availability and the probabilistic nature of LLM responses on memory integrity. We conclude that while LLMs extend the concept of prosthetic memory, their unmoderated use risks reinforcing historical denialism, raising ethical concerns for (digital) memory preservation, and potentially challenging the advantageous role of technology associated with the original values of prosthetic memory.
zh

[NLP-124] Revisiting Bi-Linear State Transitions in Recurrent Neural Networks

【速读】: 该论文试图解决传统循环神经网络中隐藏单元被视作被动记忆存储的问题,而提出隐藏单元应作为网络计算中的主动参与者。其解决方案的关键在于重新审视双线性操作(bi-linear operations),即隐藏单元与输入嵌入之间的乘法交互,证明其在状态跟踪任务中构成了表示隐藏状态演化的自然归纳偏置,并揭示了双线性状态更新在任务复杂度上的自然层次结构。
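双线性状态更新的一个最简单成员可以写成逐元素乘法交互 h_t = (W x_t) ⊙ (U h_{t-1})。下面的 PyTorch 片段仅示意这一形式(论文讨论的是更一般的双线性族,具体参数化以原文为准):

```python
import torch
import torch.nn as nn

class BilinearRNNCell(nn.Module):
    """双线性状态更新示意:隐藏状态通过与输入嵌入的乘法交互演化,
    隐藏单元因此主动参与计算而非仅作被动记忆。"""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(input_dim, hidden_dim, bias=False)
        self.U = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x_t, h_prev):
        return self.W(x_t) * self.U(h_prev)   # 逐元素乘法交互

cell = BilinearRNNCell(input_dim=8, hidden_dim=16)
h = torch.ones(1, 16)
for t in range(5):                            # 沿序列迭代更新状态
    h = cell(torch.randn(1, 8), h)
print(h.shape)
```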

链接: https://arxiv.org/abs/2505.21749
作者: M.Reza Ebrahimi,Roland Memisevic
机构: Qualcomm AI Research (高通人工智能研究); Qualcomm Technologies, Inc. (高通技术公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The role of hidden units in recurrent neural networks is typically seen as modeling memory, with research focusing on enhancing information retention through gating mechanisms. A less explored perspective views hidden units as active participants in the computation performed by the network, rather than passive memory stores. In this work, we revisit bi-linear operations, which involve multiplicative interactions between hidden units and input embeddings. We demonstrate theoretically and empirically that they constitute a natural inductive bias for representing the evolution of hidden states in state tracking tasks. These are the simplest type of task that require hidden units to actively contribute to the behavior of the network. We also show that bi-linear state updates form a natural hierarchy corresponding to state tracking tasks of increasing complexity, with popular linear recurrent networks such as Mamba residing at the lowest-complexity center of that hierarchy.
zh

[NLP-125] Counterfactual Simulatability of LLM Explanations for Generation Tasks

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在高风险场景下解释其行为能力不足的问题,特别是如何评估模型解释的准确性。解决方案的关键在于引入并扩展“反事实可模拟性”(counterfactual simulatability)的评估方法,以判断用户是否能够通过模型提供的解释预测其在相关反事实情境下的输出。研究将该方法应用于生成任务,如新闻摘要和医疗建议,并发现尽管在摘要任务中LLM的解释有所成效,但在医疗建议任务中仍有较大改进空间,同时指出该评估方法更适合技能型任务而非知识型任务。

链接: https://arxiv.org/abs/2505.21740
作者: Marvin Limpijankit,Yanda Chen,Melanie Subbiah,Nicholas Deas,Kathleen McKeown
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model’s output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.
zh

[NLP-126] Assessing and Refining ChatGPT's Performance in Identifying Targeting and Inappropriate Language: A Comparative Study

【速读】: 该论文试图解决在线评论中识别定向语言(targeting language)和不适当语言(inappropriate language)的问题,旨在评估生成式 AI(Generative AI)模型 ChatGPT 在内容审核中的有效性。解决方案的关键在于利用 ChatGPT 的自然语言处理能力,通过与众包标注和专家评估进行对比,验证其在检测准确性、检测范围和一致性方面的表现,并通过迭代优化提升其性能,尤其是在版本 6 中显示出显著的准确率提升。然而,研究也指出其在定向语言检测方面仍存在较高的误报率,表明需要进一步改进模型的上下文理解能力。

链接: https://arxiv.org/abs/2505.21710
作者: Barbarestani Baran,Maks Isa,Vossen Piek
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study evaluates the effectiveness of ChatGPT, an advanced AI model for natural language processing, in identifying targeting and inappropriate language in online comments. With the increasing challenge of moderating vast volumes of user-generated content on social network sites, the role of AI in content moderation has gained prominence. We compared ChatGPT’s performance against crowd-sourced annotations and expert evaluations to assess its accuracy, scope of detection, and consistency. Our findings highlight that ChatGPT performs well in detecting inappropriate content, showing notable improvements in accuracy through iterative refinements, particularly in Version 6. However, its performance in targeting language detection showed variability, with higher false positive rates compared to expert judgments. This study contributes to the field by demonstrating the potential of AI models like ChatGPT to enhance automated content moderation systems while also identifying areas for further improvement. The results underscore the importance of continuous model refinement and contextual understanding to better support automated moderation and mitigate harmful online behavior.
zh

[NLP-127] Do We Know What LLMs Don't Know? A Study of Consistency in Knowledge Probing

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在知识缺口探测中的可靠性问题,特别是模型容易产生幻觉(hallucination)导致的不确定性。论文提出的解决方案关键在于设计一种基于输入变异和定量指标的新评估流程,通过该流程揭示了知识缺口探测中的两个不一致性维度:方法内不一致性和方法间不一致性,从而凸显现有探测方法的脆弱性,并强调构建对扰动鲁棒的探测框架的紧迫性。
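论文所述的方法内一致性可以用"扰动前后判断的一致率"来理解,下述片段给出该指标的示意计算(`run_probe` 为假设的探测接口):

```python
import random

def shuffle_options(question, options):
    """非语义扰动示例:打乱选项顺序,题目语义不变。"""
    shuffled = options[:]
    random.shuffle(shuffled)
    return question, shuffled

def agreement(decisions_a, decisions_b):
    """两次探测得到的“模型知道/不知道”判断之间的一致率。"""
    return sum(a == b for a, b in zip(decisions_a, decisions_b)) / len(decisions_a)

# 假设 run_probe(question, options) -> bool 是某种知识缺口探测方法:
# d1 = [run_probe(q, opts) for q, opts in data]
# d2 = [run_probe(*shuffle_options(q, opts)) for q, opts in data]
# print(agreement(d1, d2))
print(agreement([True, False, True], [True, True, True]))  # 示例:一致率 2/3
```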

链接: https://arxiv.org/abs/2505.21701
作者: Raoyuan Zhao,Abdullatif Köksal,Ali Modarressi,Michael A. Hedderich,Hinrich Schütze
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The reliability of large language models (LLMs) is greatly compromised by their tendency to hallucinate, underscoring the need for precise identification of knowledge gaps within LLMs. Various methods for probing such gaps exist, ranging from calibration-based to prompting-based methods. To evaluate these probing methods, in this paper, we propose a new process based on using input variations and quantitative metrics. Through this, we expose two dimensions of inconsistency in knowledge gap probing. (1) Intra-method inconsistency: Minimal non-semantic perturbations in prompts lead to considerable variance in detected knowledge gaps within the same probing method; e.g., the simple variation of shuffling answer options can decrease agreement to around 40%. (2) Cross-method inconsistency: Probing methods contradict each other on whether a model knows the answer. Methods are highly inconsistent – with decision consistency across methods being as low as 7% – even though the model, dataset, and prompt are all the same. These findings challenge existing probing methods and highlight the urgent need for perturbation-robust probing frameworks.
zh

[NLP-128] MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨语言文化意识方面的差异问题,这种差异源于其以英语为中心的预训练方式,导致在非英语语境下可能出现偏差。解决方案的关键在于引入MAKIEval,这是一个自动化的多语言框架,用于评估LLMs在不同语言、地区和主题下的文化意识。MAKIEval通过利用Wikidata的多语言结构作为跨语言锚点,自动识别模型输出中的文化实体并将其链接到结构化知识,从而实现无需人工标注或翻译的可扩展、语言无关的评估。

链接: https://arxiv.org/abs/2505.21693
作者: Raoyuan Zhao,Beiduo Chen,Barbara Plank,Michael A. Hedderich
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are used globally across many languages, but their English-centric pretraining raises concerns about cross-lingual disparities for cultural awareness, often resulting in biased outputs. However, comprehensive multilingual evaluation remains challenging due to limited benchmarks and questionable translation quality. To better assess these disparities, we introduce MAKIEval, an automatic multilingual framework for evaluating cultural awareness in LLMs across languages, regions, and topics. MAKIEval evaluates open-ended text generation, capturing how models express culturally grounded knowledge in natural language. Leveraging Wikidata’s multilingual structure as a cross-lingual anchor, it automatically identifies cultural entities in model outputs and links them to structured knowledge, enabling scalable, language-agnostic evaluation without manual annotation or translation. We then introduce four metrics that capture complementary dimensions of cultural awareness: granularity, diversity, cultural specificity, and consensus across languages. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems, across 13 languages, 19 countries and regions, and 6 culturally salient topics (e.g., food, clothing). Notably, we find that models tend to exhibit stronger cultural awareness in English, suggesting that English prompts more effectively activate culturally grounded knowledge. We publicly release our code and data.
zh

[NLP-129] LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model

【速读】: 该论文试图解决印度司法系统中未决法律案件持续积累导致司法及时性受阻的问题,特别是手动优先级排序方法存在的低效性和主观偏见问题。解决方案的关键在于提出一种基于大型语言模型(Large Language Model-based Petition Ranking, LLMPR)的自动化框架,通过迁移学习和机器学习技术,依据案件的上下文紧迫性对法律请愿书进行优先级排序。该框架利用ILDC数据集中的7,593个标注请愿书,结合DistilBERT、LegalBERT和MiniLM等嵌入技术提取文本特征,并与定量指标如间隔天数、排名分数和字数相结合,训练多种机器学习模型以实现高效的请愿书排序。
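其"嵌入 + 数值特征 + 树模型"的排序流程可以用 scikit-learn 快速示意(以下均为随机玩具数据,特征维度与权重为自拟假设):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(500, 384))     # 假设的文本嵌入(如 MiniLM 输出)
numeric = rng.normal(size=(500, 3))        # 间隔天数、排名分数、字数
X = np.hstack([text_emb, numeric])
y = numeric @ np.array([0.7, 0.2, 0.1]) + rng.normal(scale=0.1, size=500)  # 玩具优先级

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:400], y[:400])
rho, _ = spearmanr(model.predict(X[400:]), y[400:])  # 以 Spearman 相关评估排序质量
print(f"Spearman rho = {rho:.3f}")
```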

链接: https://arxiv.org/abs/2505.21689
作者: Avijit Gayen,Somyajit Chakraborty,Mainak Sen,Soham Paul,Angshuman Jana
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 5 figures, journal paper, submitted to AI and Law

点击查看摘要

Abstract:The persistent accumulation of unresolved legal cases, especially within the Indian judiciary, significantly hampers the timely delivery of justice. Manual methods of prioritizing petitions are often prone to inefficiencies and subjective biases, further exacerbating delays. To address this issue, we propose LLMPR (Large Language Model-based Petition Ranking), an automated framework that utilizes transfer learning and machine learning to assign priority rankings to legal petitions based on their contextual urgency. Leveraging the ILDC dataset comprising 7,593 annotated petitions, we process unstructured legal text and extract features through various embedding techniques, including DistilBERT, LegalBERT, and MiniLM. These textual embeddings are combined with quantitative indicators such as gap days, rank scores, and word counts to train multiple machine learning models, including Random Forest, Decision Tree, XGBoost, LightGBM, and CatBoost. Our experiments demonstrate that Random Forest and Decision Tree models yield superior performance, with accuracy exceeding 99% and a Spearman rank correlation of 0.99. Notably, models using only numerical features achieve nearly optimal ranking results (R² = 0.988, ρ = 0.998), while LLM-based embeddings offer only marginal gains. These findings suggest that automated petition ranking can effectively streamline judicial workflows, reduce case backlog, and improve fairness in legal prioritization.
zh

[NLP-130] Rethinking the Outlier Distribution in Large Language Models: An In-depth Study

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中异常值(outliers)对量化过程的影响问题,这些异常值会引发显著的量化误差,进而降低模型性能。解决方案的关键在于深入分析异常值的形成机制,并提出有效的策略以减少大规模激活和通道级异常值的出现,从而在最小影响模型精度的前提下提升量化效率与部署可行性。
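通道级异常值的检测可以用一个简单的阈值规则示意(阈值 k 为自拟,并非论文给出的判据):

```python
import torch

def find_channel_outliers(activations, k=6.0):
    """示意:某通道的最大绝对激活超出全体绝对激活中位数的 k 倍,
    即视为通道级异常。activations 形状为 [tokens, channels]。"""
    ch_max = activations.abs().amax(dim=0)   # 每个通道的最大绝对激活
    median = activations.abs().median()
    return torch.nonzero(ch_max > k * median).flatten()

acts = torch.randn(128, 512)
acts[:, 37] *= 50.0                          # 人为注入一个异常通道
print(find_channel_outliers(acts))           # 通常输出 tensor([37])
```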

链接: https://arxiv.org/abs/2505.21670
作者: Rahul Raman,Khushi Sharma,Sai Qian Zhang
机构: New York University (纽约大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Investigating outliers in large language models (LLMs) is crucial due to their significant impact on various aspects of LLM performance, including quantization and compression. Outliers often cause considerable quantization errors, leading to degraded model performance. Identifying and addressing these outliers can enhance the accuracy and efficiency of the quantization process, enabling smoother deployment on edge devices or specialized hardware. Recent studies have identified two common types of outliers in LLMs: massive activations and channel-wise outliers. While numerous quantization algorithms have been proposed to mitigate their effects and maintain satisfactory accuracy, few have thoroughly explored the root causes of these outliers in depth. In this paper, we conduct a comprehensive investigation into the formation mechanisms of these outliers and propose potential strategies to mitigate their occurrence. Ultimately, we introduce some efficient approaches to eliminate most massive activations and channel-wise outliers with minimal impact on accuracy.
zh

[NLP-131] R1-Code-Interpreter: Training LLM s to Reason with Code via Supervised and Reinforcement Learning

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在需要精确计算、符号操作、优化和算法推理的任务中表现不佳的问题,尤其是在文本推理缺乏代码执行严谨性的情况下。其解决方案的关键在于开发R1-Code-Interpreter,这是一个通过多轮监督微调(SFT)和强化学习(RL)训练的文本仅模型扩展,能够自主生成多个代码查询以进行逐步推理。该方法通过有效的代码生成和自检行为提升了模型在多样化任务上的性能。
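"模型在推理中写代码、执行后把结果回填上下文"的多轮循环可示意如下(`llm(prompt) -> str` 为假设接口;生产环境的代码执行必须使用沙箱,此处仅为演示):

```python
import re, io, contextlib

def run_python(code):
    """执行模型生成的代码并捕获标准输出(演示用,真实系统需沙箱隔离)。"""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue()

def reason_with_code(question, llm, max_turns=4):
    """多轮推理循环:模型生成带代码块的推理步骤,执行结果回填上下文。"""
    fence = "`" * 3                            # Markdown 代码块围栏
    pattern = fence + r"python\n(.*?)" + fence
    context = question
    for _ in range(max_turns):
        step = llm(context)
        context += "\n" + step
        match = re.search(pattern, step, re.DOTALL)
        if not match:                          # 不再写代码,视为给出最终答案
            return step
        context += "\n[执行结果] " + run_python(match.group(1))
    return context
```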

链接: https://arxiv.org/abs/2505.21668
作者: Yongchao Chen,Yueying Liu,Junwei Zhou,Yilun Hao,Jingquan Wang,Yang Zhang,Chuchu Fan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注: 33 pages, 8 figures

点击查看摘要

Abstract:Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with the emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at this https URL and this https URL.
zh

[NLP-132] Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)由于其“黑箱”特性而导致的可解释性不足问题,这一问题在需要信任和责任的领域中尤为关键。论文提出的解决方案是SMILE,其关键在于通过微调输入并测量输出变化来识别对模型响应影响最大的提示部分,从而生成直观的热力图以展示提示中各部分的重要性。该方法具有模型无关性,并通过准确性、一致性、稳定性和保真度等指标验证了其解释的清晰性和可靠性。
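SMILE"微调输入、度量输出变化"的核心循环可示意如下(以删除单词作扰动、以字符串相似度近似输出变化,均为简化假设;`llm(prompt) -> str` 为假设接口):

```python
from difflib import SequenceMatcher

def smile_importance(prompt, llm):
    """逐词删除提示中的单词并比较输出,得到各词重要性分数。"""
    base_out = llm(prompt)
    words = prompt.split()
    scores = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])  # 删除第 i 个词
        sim = SequenceMatcher(None, base_out, llm(perturbed)).ratio()
        scores.append(1.0 - sim)               # 输出变化越大,该词越重要
    return dict(zip(words, scores))            # 分数可直接渲染为热力图

# importance = smile_importance("Explain why the sky is blue", llm)
```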

链接: https://arxiv.org/abs/2505.21657
作者: Zeinab Dehghani,Koorosh Aslansefat,Adil Khan,Mohammed Naveed Akram
机构: University of Hull(赫尔大学); Fraunhofer IESE(弗劳恩霍夫IESE)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: arXiv admin note: text overlap with arXiv:2412.16277

点击查看摘要

Abstract:Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact, creating simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.
zh

[NLP-133] Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts ECMLPKDD2025

【速读】: 该论文试图解决材料发现与优化过程中因元素组合和相关性质的“组合爆炸”而导致的数据稀缺问题。其解决方案的关键在于构建一个迭代的语料库精炼框架,通过战略性选择最具多样性的文献、训练Word2Vec模型,并监控嵌入空间中成分-性能关联的收敛性,从而充分利用科学文本中的隐含知识。该方法成功预测了多种候选成分中性能最优的材料,并通过实验验证了其有效性,展示了迭代语料库精炼在加速材料发现与优化中的潜力。
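迭代语料精炼与收敛监控的骨架可用 gensim 示意(`select_fn` 为假设的"多样性文档选择"函数,收敛判据亦为简化):

```python
from gensim.models import Word2Vec

def refine_corpus(docs, select_fn, target_pairs, rounds=3, top_n=500):
    """每轮扩充最具多样性的文档子集并重训 Word2Vec,
    监控成分-性能词对的平均余弦相似度:相邻轮次变化趋近 0 即视为收敛。"""
    corpus, history = [], []
    for _ in range(rounds):
        corpus += select_fn(docs, top_n)       # 战略性选择文档
        model = Word2Vec([d.split() for d in corpus],
                         vector_size=100, window=5, min_count=2, workers=4)
        sims = [model.wv.similarity(a, b)
                for a, b in target_pairs if a in model.wv and b in model.wv]
        history.append(sum(sims) / max(len(sims), 1))
    return model, history
```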

链接: https://arxiv.org/abs/2505.21646
作者: Lei Zhang,Markus Stricker
机构: 未知
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci)
备注: 13 pages, 5 figures, 2 tables, accepted at ECMLPKDD 2025

点击查看摘要

Abstract:The discovery and optimization of materials for specific applications is hampered by the practically infinite number of possible elemental combinations and associated properties, also known as the 'combinatorial explosion'. By nature of the problem, data are scarce and all possible data sources should be used. In addition to simulations and experimental results, the latent knowledge in scientific texts is not yet used to its full potential. We present an iterative framework that refines a given scientific corpus by strategic selection of the most diverse documents, training Word2Vec models, and monitoring the convergence of composition-property correlations in embedding space. Our approach is applied to predict high-performing materials for oxygen reduction (ORR), hydrogen evolution (HER), and oxygen evolution (OER) reactions for a large number of possible candidate compositions. Our method successfully predicts the highest performing compositions among a large pool of candidates, validated by experimental measurements of the electrocatalytic performance in the lab. This work demonstrates and validates the potential of iterative corpus refinement to accelerate materials discovery and optimization, offering a scalable and efficient tool for screening large compositional spaces where reliable data are scarce or non-existent.
zh

[NLP-134] How does Misinformation Affect Large Language Model Behaviors and Preferences? ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对虚假信息时的脆弱性问题,特别是缺乏对LLMs受虚假信息影响的具体方面和程度的细致分析。解决方案的关键是提出一个名为MisBench的基准测试集,这是目前最大且最全面的用于评估LLMs在面对虚假信息时的行为与知识偏好的基准,包含10,346,712条虚假信息,并综合考虑了知识冲突和风格变化两个维度。此外,研究还提出了一个新的方法Reconstruct to Discriminate (RtD),以增强LLMs检测虚假信息的能力。

链接: https://arxiv.org/abs/2505.21608
作者: Miao Peng,Nuo Chen,Jianheng Tang,Jia Li
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Main Conference

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs’ behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs’ ability to detect misinformation. Our study provides valuable insights into LLMs’ interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Codes and data are available at this https URL.
zh

[NLP-135] R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理能力强大但推理开销大的问题,以及小型语言模型(Small Language Models, SLMs)虽然效率高但无法复现LLMs推理路径导致性能下降的问题。解决方案的关键在于发现LLMs与SLMs之间的推理路径仅在少量token上存在显著差异,大多数生成的token要么相同,要么仅存在表达上的微小差异。基于这一观察,作者提出了一种神经token路由方法——Roads to Rome (R2R),该方法仅在关键的路径分歧token上使用LLM,而其余token的生成则由SLM完成,从而在保持性能的同时显著提升推理效率。
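按词元路由的解码循环可示意如下(假定 `slm`、`llm` 是 Hugging Face 风格的因果语言模型、`router` 是输出分歧概率的轻量模块,均为示意接口,非官方实现):

```python
import torch

@torch.no_grad()
def r2r_generate(prompt_ids, slm, llm, router, max_new_tokens=64):
    """默认由小模型生成词元;router 判定为路径分歧词元时改由大模型生成。"""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        hidden = slm(ids, output_hidden_states=True).hidden_states[-1][:, -1]
        if router(hidden).item() > 0.5:        # 关键分歧词元 -> 调用 LLM
            logits = llm(ids).logits[:, -1]
        else:                                  # 普通词元 -> 继续用 SLM
            logits = slm(ids).logits[:, -1]
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```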

链接: https://arxiv.org/abs/2505.21600
作者: Tianyu Fu,Yi Ge,Yichen You,Enshu Liu,Zhihang Yuan,Guohao Dai,Shengen Yan,Huazhong Yang,Yu Wang
机构: Tsinghua University (清华大学); Infinigence AI (Infinigence AI); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs’ reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at this https URL.
zh

[NLP-136] Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives ACL

【速读】: 该论文试图解决在有限的训练预算下,如何确定不同数据域的权重以训练出性能最优的大型语言模型的问题(domain weights optimization)。其解决方案的关键在于对现有的数据混合方法进行细粒度分类,并深入分析各类方法的优化框架、算法实现及其优缺点,从而为实际应用提供理论指导和实践参考。

链接: https://arxiv.org/abs/2505.21598
作者: Yajiao Liu,Congliang Chen,Junchi Yang,Ruoyu Sun
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Shenzhen Research Institute of Big Data (深圳市大数据研究院)
类目: Computation and Language (cs.CL)
备注: The first version of this paper was submitted to ACL ARR 2025 February Submission

点击查看摘要

Abstract:Training large language models with data collected from various domains can improve their performance on downstream tasks. However, given a fixed training budget, the sampling proportions of these different domains significantly impact the model’s performance. How can we determine the domain weights across different data domains to train the best-performing model within constrained computational resources? In this paper, we provide a comprehensive overview of existing data mixture methods. First, we propose a fine-grained categorization of existing methods, extending beyond the previous offline and online classification. Offline methods are further grouped into heuristic-based, algorithm-based, and function fitting-based methods. For online methods, we categorize them into three groups: online min-max optimization, online mixing law, and other approaches by drawing connections with the optimization frameworks underlying offline methods. Second, we summarize the problem formulations, representative algorithms for each subtype of offline and online methods, and clarify the relationships and distinctions among them. Finally, we discuss the advantages and disadvantages of each method and highlight key challenges in the field of data mixture.
zh

[NLP-137] Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use INTERSPEECH2025

【速读】: 该论文试图解决当前自动语音识别(Automatic Speech Recognition, ASR)研究中可用数据集的局限性问题,这些数据集要么规模较小、仅关注干净的朗读语音,导致词错误率接近零,无法反映真实场景;要么存在许可限制、转录不可靠、音频数据错误或缺乏评估集。解决方案的关键在于提出Loquacious Set,这是一个包含25,000小时的高质量英语语音数据集,经过精心筛选并具有商业可用性,涵盖大量不同口音的说话人以及多种语音类型(如朗读、即兴发言、演讲、清晰和嘈杂环境下的语音),旨在支持学术界和工业界研究人员在实际场景中构建更鲁棒的ASR系统。

链接: https://arxiv.org/abs/2505.21578
作者: Titouan Parcollet,Yuan Tseng,Shucong Zhang,Rogier van Dalen
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Automatic speech recognition (ASR) research is driven by the availability of common datasets between industrial researchers and academics, encouraging comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM, Libriheavy or People’s Speech suffer from major limitations including licenses that researchers in the industry cannot use, unreliable transcriptions, incorrect audio data, or the lack of evaluation sets. This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. Featuring hundreds of thousands of speakers with diverse accents and a wide range of speech types (read, spontaneous, talks, clean, noisy), the Loquacious Set is designed to work for academics and researchers in the industry to build ASR systems in real-world scenarios.
zh

[NLP-138] ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools

【速读】: 该论文试图解决化学工具固有的预测误差限制大型语言模型(Large Language Model, LLM)代理在化学相关任务中性能提升的问题。其解决方案的关键在于提出ChemHAS(Chemical Hierarchical Agent Stacking),一种通过从有限数据中优化代理堆叠结构来增强化学工具的方法,从而有效减少预测误差并实现最先进的性能表现。

链接: https://arxiv.org/abs/2505.21569
作者: Zhucong Li,Bowei Zhang,Jin Xiao,Zhijian Zhou,Fenglei Cao,Jiaqing Liang,Yuan Qi
机构: Fudan University (复旦大学); Shanghai Academy of Artificial Intelligence for Science (上海人工智能科学研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Large Language Model (LLM)-based agents have demonstrated the ability to improve performance in chemistry-related tasks by selecting appropriate tools. However, their effectiveness remains limited by the inherent prediction errors of chemistry tools. In this paper, we take a step further by exploring how LLM-based agents can, in turn, be leveraged to reduce prediction errors of the tools. To this end, we propose ChemHAS (Chemical Hierarchical Agent Stacking), a simple yet effective method that enhances chemistry tools through optimizing agent-stacking structures from limited data. ChemHAS achieves state-of-the-art performance across four fundamental chemistry tasks, demonstrating that our method can effectively compensate for prediction errors of the tools. Furthermore, we identify and characterize four distinct agent-stacking behaviors, potentially improving interpretability and revealing new possibilities for AI agent applications in scientific research. Our code and dataset are publicly available at this https URL.
zh

[NLP-139] Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

【速读】: 该论文旨在解决现有CLIP模型在多模态图像-文本检索任务中因固定图像分辨率和有限上下文导致的细粒度跨模态理解能力不足的问题,同时保持其强大的零样本分类能力。解决方案的关键在于提出一种元教师-学生蒸馏框架,通过双向跨模态注意力机制,使教师模型在YOLO提取的图像区域与对应文本片段之间生成语义和空间对齐的增强嵌入,进而指导轻量学生模型的训练,采用结合对比学习和余弦相似性目标的混合损失函数提升检索性能。
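"对比学习 + 余弦相似度"混合蒸馏损失的一个示意写法如下(权重与温度为自拟超参,非官方实现):

```python
import torch
import torch.nn.functional as F

def hybrid_distill_loss(img_s, txt_s, img_t, txt_t, tau=0.07, alpha=0.5):
    """img_s/txt_s 为学生嵌入,img_t/txt_t 为教师嵌入,
    均已 L2 归一化,形状 [batch, dim]。"""
    logits = img_s @ txt_s.t() / tau           # 学生的图文对比项(InfoNCE)
    labels = torch.arange(img_s.size(0), device=img_s.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2
    cosine = (1 - F.cosine_similarity(img_s, img_t).mean() +
              1 - F.cosine_similarity(txt_s, txt_t).mean()) / 2  # 向教师对齐
    return alpha * contrastive + (1 - alpha) * cosine
```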

链接: https://arxiv.org/abs/2505.21549
作者: Daniel Csizmadia,Andrei Codreanu,Victor Sim,Vighnesh Prabeau,Michael Lu,Kevin Zhu,Sean O’Brien,Vasu Sharma
机构: Algoverse AI Research (Algoverse AI Research); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model’s strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions-just a fraction of CLIP’s original dataset-DCLIP significantly improves image-text retrieval metrics (Recall@K, MAP), while retaining approximately 94% of CLIP’s zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at this https URL.
zh

[NLP-140] Vision Meets Language: A RAG-Augmented YOLOv8 Framework for Coffee Disease Diagnosis and Farmer Assistance

【速读】: 该论文试图解决传统农业实践中资源利用效率低下和环境问题,以及大型语言模型(LLM)在农业应用中的幻觉问题,同时实现适应性防治方案和实时病害检测。解决方案的关键在于提出一种混合方法,结合AI的三个不同领域:目标检测、大语言模型(LLM)和检索增强生成(RAG),通过融合视觉与语言模型识别咖啡树叶中的潜在病害,以YOLOv8进行作物病害检测,并利用RAG结合自然语言处理(NLP)提供上下文感知的诊断建议,从而提高系统的准确性与实用性。
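检测与检索两段流程可粗略示意如下(YOLOv8 部分按 ultralytics 库的常见用法书写,权重文件与 `retriever`/`llm` 接口均为假设):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # 实际系统应换成病害检测专用权重(假设)
results = model("coffee_leaf.jpg")         # 对咖啡叶图像做目标检测
detected = [model.names[int(b.cls)] for b in results[0].boxes]

# RAG 部分示意:以检测结果构造查询,检索知识库后交给 LLM 生成诊断建议
query = f"咖啡叶病害 {' '.join(detected)} 的成因与防治方法"
# docs = retriever.search(query)           # retriever 为假设的检索接口
# answer = llm(query, docs)                # llm 为假设的生成接口
```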

链接: https://arxiv.org/abs/2505.21544
作者: Semanto Mondal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: There are 14 pages, 8 figures

点击查看摘要

Abstract:As a social being, we have an intimate bond with the environment. A plethora of things in human life, such as lifestyle, health, and food, are dependent on the environment and agriculture. It comes under our responsibility to support the environment as well as agriculture. However, traditional farming practices often result in inefficient resource use and environmental challenges. To address these issues, precision agriculture has emerged as a promising approach that leverages advanced technologies to optimise agricultural processes. In this work, a hybrid approach is proposed that combines the three different potential fields of model AI: object detection, large language model (LLM), and Retrieval-Augmented Generation (RAG). In this novel framework, we have tried to combine the vision and language models to work together to identify potential diseases in the tree leaf. This study introduces a novel AI-based precision agriculture system that uses Retrieval Augmented Generation (RAG) to provide context-aware diagnoses and natural language processing (NLP) and YOLOv8 for crop disease detection. The system aims to tackle major issues with large language models (LLMs), especially hallucinations, and allows for adaptive treatment plans and real-time disease detection. The system provides an easy-to-use interface to the farmers, which they can use to detect the different diseases related to coffee leaves by just submitting an image of the affected leaf; the model will detect the diseases as well as suggest potential remediation methodologies that aim to lower the use of pesticides, preserve livelihoods, and encourage environmentally friendly methods. With an emphasis on scalability, dependability, and user-friendliness, the project intends to improve RAG-integrated object detection systems for wider agricultural applications in the future.
zh

[NLP-141] How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control

【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)理解并生成符合人类运动规律的3D角色动画问题。其解决方案的关键在于通过分层规划策略,即首先让LLMs生成包含连续步骤的高层运动计划(High-level Planning),再在每一步中指定身体部位的位置(Low-level Planning),并通过线性插值生成角色动画,以此作为人类评估者验证的清晰视角。该方法旨在评估LLMs在运动指令理解与执行中的能力边界。
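论文把 LLM 给出的逐步身体部位位置线性插值成动画,这一步可用 NumPy 直接示意(关节数与步数为自拟):

```python
import numpy as np

def interpolate_keyframes(keyframes, steps_between=10):
    """把各计划步骤的关节位置(关键帧)线性插值为连续动画帧。
    keyframes 形状为 [n_steps, n_joints, 3]。"""
    frames = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, steps_between, endpoint=False):
            frames.append((1 - t) * a + t * b)
    frames.append(keyframes[-1])
    return np.stack(frames)

poses = np.random.rand(4, 17, 3)           # 假设 4 个计划步骤、17 个关节
print(interpolate_keyframes(poses).shape)  # (31, 17, 3)
```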

链接: https://arxiv.org/abs/2505.21531
作者: Kunhang Li,Jason Naradowsky,Yansong Feng,Yusuke Miyao
机构: The University of Tokyo (东京大学); Peking University (北京大学); NII LLMC (国立情報学研究所LLMC)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We explore Large Language Models (LLMs)’ human motion knowledge through 3D avatar control. Given a motion instruction, we prompt LLMs to first generate a high-level movement plan with consecutive steps (High-level Planning), then specify body part positions in each step (Low-level Planning), which we linearly interpolate into avatar animations as a clear verification lens for human evaluators. Through carefully designed 20 representative motion instructions with full coverage of basic movement primitives and balanced body part usage, we conduct comprehensive evaluations including human assessment of both generated animations and high-level movement plans, as well as automatic comparison with oracle positions in low-level planning. We find that LLMs are strong at interpreting the high-level body movements but struggle with precise body part positioning. While breaking down motion queries into atomic components improves planning performance, LLMs have difficulty with multi-step movements involving high-degree-of-freedom body parts. Furthermore, LLMs provide reasonable approximation for general spatial descriptions, but fail to handle precise spatial specifications in text, and the precise spatial-temporal parameters needed for avatar control. Notably, LLMs show promise in conceptualizing creative motions and distinguishing culturally-specific motion patterns.
zh

[NLP-142] More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

【速读】: 该论文试图解决多模态大语言模型在测试时计算增强后生成的推理链过长导致的幻觉问题,即模型在推理过程中逐渐偏离视觉内容并过度依赖语言先验。解决方案的关键在于引入RH-AUC指标,该指标量化了模型感知准确性随推理长度变化的情况,从而评估模型在推理过程中是否保持视觉 grounding,并通过RH-Bench基准测试来评估推理能力与幻觉之间的权衡。
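RH-AUC 的具体定义以论文为准;下面给出一种与其思想一致的示意计算:按推理长度分桶统计感知准确率,再对归一化的"长度-准确率"曲线求曲线下面积。

```python
import numpy as np

def rh_auc(reasoning_lengths, perception_correct, n_bins=10):
    """示意实现(假设性定义):值越高表示感知准确率随推理变长保持得越好。"""
    lengths = np.asarray(reasoning_lengths, dtype=float)
    correct = np.asarray(perception_correct, dtype=float)
    order = np.argsort(lengths)
    lengths, correct = lengths[order], correct[order]
    bins = np.array_split(np.arange(len(lengths)), n_bins)
    xs = np.array([lengths[b].mean() for b in bins])
    ys = np.array([correct[b].mean() for b in bins])
    xs = (xs - xs.min()) / (xs.max() - xs.min() + 1e-9)  # 长度归一化到 [0,1]
    return float(np.trapz(ys, xs))

print(rh_auc(np.random.rand(200) * 1000, np.random.rand(200) > 0.3))
```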

链接: https://arxiv.org/abs/2505.21523
作者: Chengzhi Liu,Zhongxing Xu,Qingyue Wei,Juncheng Wu,James Zou,Xin Eric Wang,Yuyin Zhou,Sheng Liu
机构: UC Santa Cruz (加州大学圣克鲁兹分校); Stanford University (斯坦福大学); UC Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model’s perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
zh

[NLP-143] Capability-Based Scaling Laws for LLM Red-Teaming

【速读】: 该论文试图解决在大型语言模型能力不断增强的背景下,传统红队测试方法在面对能力更强的目标模型时可能失效的问题,即弱到强的红队挑战。解决方案的关键在于通过分析攻击者与目标模型之间的能力差距,建立一种基于能力差距的越狱攻击尺度定律,从而预测攻击成功率。研究通过评估超过500对攻击者-目标模型对,揭示了模型能力与攻击成功率之间的关系,并强调了未来模型提供商需准确衡量和控制模型的说服与操控能力以降低安全风险。
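论文尺度定律的具体函数形式以原文为准;"由能力差预测攻击成功率"这一拟合思路可用逻辑回归示意(以下为随机玩具数据):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
gap = rng.normal(0, 1, size=400).reshape(-1, 1)      # 攻击者能力 - 目标能力
# 玩具假设:能力差越大,攻击越可能成功
success = (rng.random(400) < 1 / (1 + np.exp(-2.5 * gap.ravel()))).astype(int)

law = LogisticRegression().fit(gap, success)         # 以逻辑函数拟合尺度定律
print(law.predict_proba([[1.0], [-1.0]])[:, 1])      # 不同能力差下的成功率预测
```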

链接: https://arxiv.org/abs/2505.20162
作者: Alexander Panfilov,Paul Kassianik,Maksym Andriushchenko,Jonas Geiping
机构: ELLIS Institute Tübingen (ELLIS研究所图宾根); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); Tübingen AI Center (图宾根人工智能中心); Foundation AI – Cisco Systems Inc. (基础AI-思科系统公司); EPFL (洛桑联邦理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target’s capability exceeds the attacker’s, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models’ persuasive and manipulative abilities to limit their effectiveness as attackers.
zh

[NLP-144] Offset Unlearning for Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练语料中可能记忆敏感信息(如版权内容、偏见信息和隐私数据)所引发的伦理和法律问题。现有遗忘技术要么因需要访问模型内部权重而不适用于黑盒LLMs,要么通过保留敏感数据进行推理时修正而违反数据保护原则。该论文提出的解决方案是δ-Unlearning,其关键在于不直接调整黑盒LLM本身,而是通过对比一对较小模型的logit输出来学习所需的logit偏移量,从而实现有效遗忘目标数据,同时保持或提升在通用非遗忘范围任务上的性能,并能兼容多种遗忘算法。
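"用一对小模型的 logit 差作为偏移量"这一核心思想可以写成几行 PyTorch 示意(假定三个模型共享词表且均为 Hugging Face 风格接口,非官方实现):

```python
import torch

@torch.no_grad()
def delta_unlearning_logits(ids, black_box_llm, small_base, small_unlearned):
    """不改动黑盒 LLM:解码时叠加小模型对(遗忘后 - 原始)的 logit 偏移。"""
    logits_black = black_box_llm(ids).logits[:, -1]   # 黑盒 LLM 原始 logits
    offset = (small_unlearned(ids).logits[:, -1] -
              small_base(ids).logits[:, -1])          # 小模型对学到的遗忘偏移
    return logits_black + offset                      # 用于后续采样/解码
```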

链接: https://arxiv.org/abs/2404.11045
作者: James Y. Huang,Wenxuan Zhou,Fei Wang,Fred Morstatter,Sheng Zhang,Hoifung Poon,Muhao Chen
机构: University of Southern California(南加州大学); Microsoft Research(微软研究院); University of California, Davis(加州大学戴维斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in TMLR. this https URL

点击查看摘要

Abstract:Despite the strong capabilities of Large Language Models (LLMs) to acquire knowledge from their training corpora, the memorization of sensitive information in the corpora such as copyrighted, biased, and private content has led to ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either not applicable to black-box LLMs due to required access to model internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose δ-Unlearning, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, δ-Unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that δ-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. δ-Unlearning also effectively incorporates different unlearning algorithms, making our approach a versatile solution to adapting various existing unlearning algorithms to black-box LLMs.
zh

[NLP-145] Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在语音任务中性能评估的可靠性问题,特别是由于LibriSpeech和Common Voice数据集中的部分样本出现在公开的LLM预训练语料库中,可能导致评估结果出现偏差。解决方案的关键在于通过对比使用污染数据和未污染数据训练的LLM,分析数据污染对模型生成结果的影响,从而揭示数据污染对模型输出的潜在偏倚,并强调在评估基于LLM的语音系统时使用保留数据的重要性。
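论文比较模型对训练中见过与未见过的测试句赋予的概率,这一度量可以如下示意(以 GPT-2 这类小模型演示计算方式):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_logprob(model, tokenizer, text):
    """计算语言模型赋予整句的总对数概率:越高表示模型越“熟悉”该句。"""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
print(sentence_logprob(lm, tok, "HELLO THIS IS A LIBRISPEECH STYLE SENTENCE"))
```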

链接: https://arxiv.org/abs/2505.22251
作者: Yuan Tseng,Titouan Parcollet,Rogier van Dalen,Shucong Zhang,Sourav Bhattacharya
机构: Samsung AI Center Cambridge (三星人工智能中心剑桥); Samsung (三星)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure the impact of contamination, LLMs trained with or without contamination are compared, showing that a contaminated LLM is more likely to generate test sentences it has seen during training. Speech recognisers using contaminated LLMs shows only subtle differences in error rates, but assigns significantly higher probabilities to transcriptions seen during training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.
zh

[NLP-146] Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding?

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在文化适配性上的不足,即这些模型虽然能够使用用户语言进行交流,但未必能准确反映用户的文化价值观和实践。为了解决这一问题,研究者提出了一种评估方法,通过文化维度(如价值观和实践)对印度本地模型与全球模型进行对比分析。该研究的关键在于揭示区域微调(regional fine-tuning)并未有效提升模型的文化适应能力,甚至可能因限制已有知识的回忆而削弱其表现,其根本原因在于缺乏高质量、未翻译且具有文化背景的预训练与微调数据。

链接: https://arxiv.org/abs/2505.21548
作者: Dhruv Agarwal,Anya Shukla,Sunayana Sitaram,Aditya Vashistha
机构: Cornell University (康奈尔大学); Microsoft Research (微软研究院)
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Under review

点击查看摘要

Abstract:Large language models (LLMs) are used around the world but exhibit Western cultural tendencies. To address this cultural misalignment, many countries have begun developing “regional” LLMs tailored to local communities. Yet it remains unclear whether these models merely speak the language of their users or also reflect their cultural values and practices. Using India as a case study, we evaluate five Indic and five global LLMs along two key dimensions: values (via the Inglehart-Welzel map and GlobalOpinionQA) and practices (via CulturalBench and NormAd). Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models. In fact, an average American person is a better proxy for Indian cultural values than any Indic model. Even prompting strategies fail to meaningfully improve alignment. Ablations show that regional fine-tuning does not enhance cultural competence and may in fact hurt it by impeding recall of existing knowledge. We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data. Our study positions cultural evaluation as a first-class requirement alongside multilingual benchmarks and offers a reusable methodology for developers. We call for deeper investments in culturally representative data to build and evaluate truly sovereign LLMs.
zh

[NLP-147] VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining

【速读】: 该论文试图解决低资源语言(如越南语)中自动语音识别(ASR)系统对大规模标注数据的高度依赖问题,以及现有系统在训练成本、延迟和可访问性方面的不足。解决方案的关键在于提出一种新的ASR训练流程——VietASR,该流程利用大量未标注数据和少量标注数据进行多轮ASR偏置的自监督学习,从而实现高效且实用的ASR性能提升。

链接: https://arxiv.org/abs/2505.21527
作者: Jianheng Zhuo,Yifan Yang,Yiwen Shao,Yong Xu,Dong Yu,Kai Yu,Xie Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000 hours of unlabeled data and fine-tuning on merely 50 hours of labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR systems on real-world data. Our code and models will be open-sourced to facilitate research in low-resource ASR.
zh
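
摘要未展开"多轮 ASR 偏置自监督"的细节,这里给出迭代伪标注这一通用思路的骨架示意(train_asr、transcribe 等均为假设的占位函数,并非论文实现):

```python
# 多轮伪标注训练的通用骨架(示意;各函数均为假设的占位)
def iterative_pseudo_labeling(labeled, unlabeled, rounds=3, thresh=0.9):
    model = train_asr(labeled)                    # 先用约 50 小时的标注数据训练种子模型
    for _ in range(rounds):
        pseudo = []
        for audio in unlabeled:
            hyp, conf = transcribe(model, audio)  # 对大规模无标注语音解码
            if conf >= thresh:                    # 仅保留高置信度伪标签
                pseudo.append((audio, hyp))
        model = train_asr(labeled + pseudo)       # 合并数据重新训练,进入下一轮
    return model
```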

[NLP-148] Complexity counts: global and local perspectives on Indo-Aryan numeral systems

【速读】: 该论文试图解决 Indo-Aryan 语言的数词系统为何相较于其他语言更为复杂的问题,以及这些复杂系统为何在某些语言中得以延续。其解决方案的关键在于通过跨语言数据和可量化的指标评估数词系统的复杂性,并分析语言内外因素(如宗教、地理隔离等)对复杂性的影响,从而揭示复杂数词系统在南亚地区特定 Indo-Aryan 语言中的发展与存续机制。

链接: https://arxiv.org/abs/2505.21510
作者: Chundra Cathcart
机构: 未知
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The numeral systems of Indo-Aryan languages such as Hindi, Gujarati, and Bengali are highly unusual in that unlike most numeral systems (e.g., those of English, Chinese, etc.), forms referring to 1–99 are highly non-transparent and cannot be constructed using straightforward rules. As an example, Hindi/Urdu ikyānve '91' is not decomposable into the composite elements ek 'one' and nave 'ninety' in the way that its English counterpart is. This paper situates Indo-Aryan languages within the typology of cross-linguistic numeral systems, and explores the linguistic and non-linguistic factors that may be responsible for the persistence of complex systems in these languages. Using cross-linguistic data from multiple databases, we develop and employ a number of cross-linguistically applicable metrics to quantify the complexity of languages' numeral systems, and demonstrate that Indo-Aryan languages have decisively more complex numeral systems than the world's languages as a whole, though individual Indo-Aryan languages differ from each other in terms of the complexity of the patterns they display. We investigate the factors (e.g., religion, geographic isolation, etc.) that underlie complexity in numeral systems, with a focus on South Asia, in an attempt to develop an account of why complex numeral systems developed and persisted in certain Indo-Aryan languages but not elsewhere. Finally, we demonstrate that Indo-Aryan numeral systems adhere to certain general pressures toward efficient communication found cross-linguistically, despite their high complexity. We call for this somewhat overlooked dimension of complexity to be taken seriously when discussing general variation in cross-linguistic numeral systems.
zh
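
论文的复杂度指标未在摘要中给出定义,下面用一个简化的"透明度"代理指标说明这类度量的思路(纯属示意,并非论文中的指标):检查 21–99 的数词能否由十位词与个位词直接拼出。

```python
# 数词"透明度"的一个极简代理指标(示意,非论文定义)
def transparency(forms: dict[int, str], tens: dict[int, str], units: dict[int, str]) -> float:
    """forms: n -> 数词;若 n 可由 tens[n//10] + units[n%10] 直接拼出则计为透明。"""
    norm = lambda s: s.replace(" ", "").replace("-", "")
    nums = [n for n in forms if 21 <= n <= 99 and n % 10 != 0]
    hits = sum(norm(forms[n]) == norm(tens[n // 10] + units[n % 10]) for n in nums)
    return hits / len(nums)

# 英语 91 = "ninety-one" 可由 "ninety"+"one" 拼出(透明);
# 印地语 91 = "ikyānve" 无法由 "nave"+"ek" 规则性地得到(不透明)。
```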

计算机视觉

[CV-0] Zero-Shot Vision Encoder Grafting via LLM Surrogates

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在训练过程中由于使用大规模语言模型(Large Language Model, LLM)作为解码器而导致的高计算成本问题。其解决方案的关键在于构建小型“代理模型”(surrogate models),这些模型通过直接继承目标LLM的浅层结构,共享相同的嵌入空间和表示语言。随后,将视觉编码器在代理模型上进行训练,并直接迁移至大型LLM中,这一过程称为零样本嫁接(zero-shot grafting),有效降低了整体VLM的训练成本。

链接: https://arxiv.org/abs/2505.22664
作者: Kaiyu Yue,Vasu Singla,Menglin Jia,John Kirchenbauer,Rifaa Qadri,Zikui Cai,Abhinav Bhatele,Furong Huang,Tom Goldstein
机构: University of Maryland (马里兰大学); Meta (元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a potentially promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small “surrogate models” that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting – when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.
zh
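
"直接继承目标 LLM 浅层"来构造代理模型,可以用如下 PyTorch 伪实现理解(仅为示意:模型名与属性路径按 LLaMA 风格结构假设,层数 k 为任取):

```python
# 构造"代理模型":复用目标 LLM 的词嵌入与前 k 个 Transformer 块(示意)
import copy
from transformers import AutoModelForCausalLM

target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # 假设的目标 LLM

def build_surrogate(target, k=8):
    surrogate = copy.deepcopy(target)                    # 实际实现可共享权重以省内存
    surrogate.model.layers = surrogate.model.layers[:k]  # 只保留浅层,嵌入与输出头原样继承
    surrogate.config.num_hidden_layers = k               # 因而与目标模型共享同一嵌入空间
    return surrogate

surrogate = build_surrogate(target, k=8)
# 视觉编码器先与 surrogate 联合训练,再零样本"嫁接"回完整的 target
```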

[CV-1] Training Free Stylized Abstraction

【速读】:该论文旨在解决如何从单张图像生成具有视觉夸张但语义忠实的风格化抽象表示的问题,特别是在保持主体识别性的同时允许风格上的显著差异,这对分布外个体尤其具有挑战性。其解决方案的关键在于提出了一种无需训练的框架,利用视觉-语言模型(VLLM)在推理阶段进行缩放以提取与身份相关的特征,并结合一种新颖的跨域校正流逆过程策略,基于风格依赖先验重建结构。此外,通过风格感知的时间调度动态调整结构恢复,实现了对主体和风格的高保真再现。

链接: https://arxiv.org/abs/2505.22663
作者: Aimon Rahman,Kartik Narayan,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Stylized abstraction synthesizes visually exaggerated yet semantically faithful representations of subjects, balancing recognizability with perceptual distortion. Unlike image-to-image translation, which prioritizes structural fidelity, stylized abstraction demands selective retention of identity cues while embracing stylistic divergence, especially challenging for out-of-distribution individuals. We propose a training-free framework that generates stylized abstractions from a single image using inference-time scaling in vision-language models (VLLMs) to extract identity-relevant features, and a novel cross-domain rectified flow inversion strategy that reconstructs structure based on style-dependent priors. Our method adapts structural restoration dynamically through style-aware temporal scheduling, enabling high-fidelity reconstructions that honor both subject and style. It supports multi-round abstraction-aware generation without fine-tuning. To evaluate this task, we introduce StyleBench, a GPT-based human-aligned metric suited for abstract styles where pixel-level similarity fails. Experiments across diverse abstraction styles (e.g., LEGO, knitted dolls, South Park) show strong generalization to unseen identities and styles in a fully open-source setup.
zh

[CV-2] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

【速读】:该论文旨在解决大型视觉-语言模型(LVLMs)在推理过程中因较长的视觉标记序列导致的高计算成本问题,从而影响其实时部署。其解决方案的关键在于提出VScan框架,通过两个阶段的视觉标记压缩策略:在视觉编码阶段整合全局与局部扫描并进行标记合并,在语言模型中间层引入剪枝操作,以有效减少冗余视觉标记,提升推理速度并保持较高的模型性能。

链接: https://arxiv.org/abs/2505.22654
作者: Ce Zhang,Kaixin Ma,Tianqing Fang,Wenhao Yu,Hongming Zhang,Zhisong Zhang,Yaqi Xie,Katia Sycara,Haitao Mi,Dong Yu
机构: Carnegie Mellon University (卡内基梅隆大学); Tencent AI Lab (腾讯人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91 \times speedup in prefilling and a 10 \times reduction in FLOPs, while retaining 95.4% of the original performance.
zh
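
两阶段视觉 token 压缩可用如下简化代码理解(非 VScan 官方实现:此处以相邻 token 平均作为"合并"、以文本对视觉 token 的注意力均值作为中间层剪枝依据,均为假设的简化):

```python
# 视觉 token 压缩两阶段的最小示意(非官方实现)
import torch

def merge_adjacent(tokens):
    """阶段一(视觉编码期):相邻 token 两两平均,序列长度减半。tokens: (B, N, D)。"""
    B, N, D = tokens.shape
    return tokens[:, : N // 2 * 2].reshape(B, N // 2, 2, D).mean(dim=2)

def prune_by_attention(tokens, attn_text_to_visual, keep=64):
    """阶段二(LLM 中间层):按文本→视觉注意力均值保留 top-k 视觉 token。
    attn_text_to_visual: (B, T_text, N_visual)。"""
    score = attn_text_to_visual.mean(dim=1)                  # (B, N)
    idx = score.topk(keep, dim=-1).indices                   # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)
```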

[CV-3] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

【速读】:该论文试图解决多人物对话视频生成中的音频与人物绑定错误问题以及模型指令遵循能力不足的问题。其解决方案的关键在于提出一种新的框架MultiTalk,其中包含用于解决音频与人物绑定问题的标签旋转位置嵌入(Label Rotary Position Embedding, L-RoPE)方法,并通过部分参数训练和多任务训练来保持基础模型的指令遵循能力。

链接: https://arxiv.org/abs/2505.22647
作者: Zhe Kong,Feng Gao,Yong Zhang,Zhuoliang Kang,Xiaoming Wei,Xunliang Cai,Guanying Chen,Wenhan Luo
机构: Meituan; Division of AMC and Department of ECE, HKUST; Shenzhen Campus of Sun Yat-sen University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: this https URL Github: this https URL

点击查看摘要

Abstract:Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
zh
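
L-RoPE 的直觉是在标准旋转位置编码之上,为每个人物分配一个标签偏移,使同一人物的音频 token 与对应视觉区域在位置空间中彼此靠近。下面给出一个示意(旋转部分为标准 RoPE;标签偏移的具体方案为假设,非官方实现):

```python
# Label Rotary Position Embedding 的简化示意(偏移方案为假设)
import torch

def rope(x, pos, base=10000.0):
    """标准 RoPE:x (…, N, D),pos (…, N),D 为偶数。"""
    D = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2, dtype=torch.float32) / D))
    ang = pos[..., None].float() * inv_freq          # (…, N, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def l_rope(x, pos, person_id, offset=1000):
    """给第 i 个人的 token(音频流或对应图像区域)统一加标签偏移 i*offset,
    从而把音频与正确的人物绑定、与其他人物分离。"""
    return rope(x, pos + person_id * offset)
```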

[CV-4] SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation

【速读】:该论文旨在解决现有基于范围视图(range-view)的LiDAR场景生成方法仅能生成无语义标签的点云数据,而无法有效生成带有语义标签的LiDAR场景的问题。现有方法依赖预训练的分割模型预测语义图,导致跨模态一致性不足。其解决方案的关键在于提出Spiral模型,这是一种新型的基于范围视图的LiDAR扩散模型,能够同时生成深度图、反射率图像和语义图,从而提升生成数据的语义一致性与质量。

链接: https://arxiv.org/abs/2505.22643
作者: Dekai Zhu,Yixuan Hu,Youquan Liu,Dongyue Lu,Lingdong Kong,Slobodan Ilic
机构: Technical University of Munich (慕尼黑工业大学); Fudan University (复旦大学); National University of Singapore (新加坡国立大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
zh

[CV-5] ObjectClear: Complete Object Removal via Object-Effect Attention

【速读】:该论文试图解决的是物体去除(object removal)问题,即在移除目标物体的同时,还需消除其产生的影响,如阴影和反射。现有的基于扩散模型的修复方法常产生伪影、幻觉内容、改变背景,并且难以准确移除物体效应。解决方案的关键在于提出一个名为OBER的新数据集,该数据集提供了带有和不带物体效应的配对图像以及精确的物体及其视觉伪影掩码,并在此基础上设计了ObjectClear框架,该框架引入了物体-效应注意力机制,通过学习注意力掩码引导模型关注前景去除区域,从而有效解耦前景去除与背景重建,同时在推理阶段利用预测的注意力图实现注意力引导的融合策略,以保留背景细节。

链接: https://arxiv.org/abs/2505.22636
作者: Jixin Zhao,Shangchen Zhou,Zhouxia Wang,Peiqing Yang,Chen Change Loy
机构: S-Lab, Nanyang Technological University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Object removal requires eliminating not only the target object but also its effects, such as shadows and reflections. However, diffusion-based inpainting methods often produce artifacts, hallucinate content, alter background, and struggle to remove object effects accurately. To address this challenge, we introduce a new dataset for OBject-Effect Removal, named OBER, which provides paired images with and without object effects, along with precise masks for both objects and their associated visual artifacts. The dataset comprises high-quality captured and simulated data, covering diverse object categories and complex multi-object scenes. Building on OBER, we propose a novel framework, ObjectClear, which incorporates an object-effect attention mechanism to guide the model toward the foreground removal regions by learning attention masks, effectively decoupling foreground removal from background reconstruction. Furthermore, the predicted attention map enables an attention-guided fusion strategy during inference, greatly preserving background details. Extensive experiments demonstrate that ObjectClear outperforms existing methods, achieving improved object-effect removal quality and background fidelity, especially in complex scenarios.
zh
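
推理阶段的"注意力引导融合"本质上是用预测出的注意力图做软合成:前景去除区域采用生成结果,其余区域尽量保留原图细节。最小示意如下(变量形状与名称为假设):

```python
# 注意力引导融合的最小示意(非 ObjectClear 官方实现)
def attention_guided_fusion(original, generated, attn):
    """original/generated: (B, 3, H, W);attn: (B, 1, H, W),取值 [0,1],
    接近 1 表示前景(物体及其阴影/反射)去除区域。"""
    return attn * generated + (1.0 - attn) * original
```

实际系统中的一种常见做法是对注意力图做轻度平滑,以避免融合边界出现可见接缝(此处从略)。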

[CV-6] PS4PRO: Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization CVPR2025

【速读】:该论文旨在解决神经渲染方法在重建三维场景时因输入视图数量有限而导致的重建质量受限问题,尤其是在复杂和动态场景中,某些物体角度可能从未被观测到。其解决方案的关键在于利用视频帧插值作为数据增强手段,并设计了一个轻量级但高质量的视频帧插值模型PS4PRO,该模型通过在多样化视频数据集上训练,隐式建模相机运动及真实世界三维几何,从而作为隐式世界先验,增强三维重建的图像监督。

链接: https://arxiv.org/abs/2505.22616
作者: Yezhi Shen,Qiuchen Zhai,Fengqing Zhu
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to the CVPR 2025 Workshop on Autonomous Driving (WAD)

点击查看摘要

Abstract:Neural rendering methods have gained significant attention for their ability to reconstruct 3D scenes from 2D images. The core idea is to take multiple views as input and optimize the reconstructed scene by minimizing the uncertainty in geometry and appearance across the views. However, the reconstruction quality is limited by the number of input views. This limitation is further pronounced in complex and dynamic scenes, where certain angles of objects are never seen. In this paper, we propose to use video frame interpolation as the data augmentation method for neural rendering. Furthermore, we design a lightweight yet high-quality video frame interpolation model, PS4PRO (Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization). PS4PRO is trained on diverse video datasets, implicitly modeling camera movement as well as real-world 3D geometry. Our model performs as an implicit world prior, enriching the photo supervision for 3D reconstruction. By leveraging the proposed method, we effectively augment existing datasets for neural rendering methods. Our experimental results indicate that our method improves the reconstruction performance on both static and dynamic scenes.
zh

[CV-7] Adversarially Robust AI-Generated Image Detection for Free: An Information Theoretic Perspective

【速读】:该论文试图解决生成式 AI(AIGI)检测中对抗训练(AT)面临的表现崩溃问题,即在面对对抗攻击时,原本有效的防御方法会显著降低检测性能。解决方案的关键在于提出一种无需训练的鲁棒检测方法——TRIM,该方法基于信息论度量,通过预测熵和KL散度量化特征偏移,从而在不改变原始检测器结构的前提下提升对对抗攻击的鲁棒性。

链接: https://arxiv.org/abs/2505.22604
作者: Ruixuan Zhang,He Wang,Zhengyu Zhao,Zhiqing Guo,Xun Yang,Yunfeng Diao,Meng Wang
机构: Hefei University of Technology (合肥工业大学); University College London (伦敦大学学院); Xi’an Jiaotong University (西安交通大学); Xinjiang University (新疆大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rapid advances in Artificial Intelligence Generated Images (AIGI) have facilitated malicious use, such as forgery and misinformation. Therefore, numerous methods have been proposed to detect fake images. Although such detectors have been proven to be universally vulnerable to adversarial attacks, defenses in this field are scarce. In this paper, we first identify that adversarial training (AT), widely regarded as the most effective defense, suffers from performance collapse in AIGI detection. Through an information-theoretic lens, we further attribute the cause of collapse to feature entanglement, which disrupts the preservation of feature-label mutual information. Instead, standard detectors show clear feature separation. Motivated by this difference, we propose Training-free Robust Detection via Information-theoretic Measures (TRIM), the first training-free adversarial defense for AIGI detection. TRIM builds on standard detectors and quantifies feature shifts using prediction entropy and KL divergence. Extensive experiments across multiple datasets and attacks validate the superiority of our TRIM, e.g., outperforming the state-of-the-art defense by 33.88% (28.91%) on ProGAN (GenImage), while well maintaining original accuracy.
zh
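
TRIM 的两个信息论度量可以如下直观实现(示意:阈值与"参考分布"的构造为假设,非官方实现),其中熵刻画预测不确定性,KL 散度刻画预测分布的偏移程度:

```python
# 基于熵与 KL 散度的免训练对抗检测示意(非 TRIM 官方实现)
import torch
import torch.nn.functional as F

def entropy(logits):
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def kl_shift(logits, ref_logits):
    """与参考视图(如同一图像的轻度增广)预测分布之间的 KL 散度。"""
    return F.kl_div(ref_logits.log_softmax(dim=-1),
                    logits.softmax(dim=-1), reduction="none").sum(dim=-1)

def flag_adversarial(logits, ref_logits, h_thresh=0.5, kl_thresh=0.1):
    # 假设:对抗样本往往伴随熵或分布偏移显著升高;阈值需在验证集上标定
    return (entropy(logits) > h_thresh) | (kl_shift(logits, ref_logits) > kl_thresh)
```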

[CV-8] SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

【速读】:该论文旨在解决多模态大模型在图像分割任务中依赖于昂贵且耗时的手动标注数据的问题,其核心挑战在于如何在不依赖显式推理标注的情况下赋予模型推理能力。解决方案的关键在于提出SAM-R1框架,该框架首次在多模态推理模型的训练过程中引入细粒度分割设置,并通过结合任务特定的细粒度奖励与定制化的优化目标,提升模型的推理与分割对齐能力,同时利用Segment Anything Model (SAM) 作为强大的奖励提供者,从而在仅使用3k训练样本的情况下实现了多基准测试中的优异性能。

链接: https://arxiv.org/abs/2505.22596
作者: Jiaqi Huang,Zunnan Xu,Jun Zhou,Ting Liu,Yicheng Xiao,Mingwen Ou,Bowen Ji,Xiu Li,Kehong Yuan
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Leveraging multimodal large models for image segmentation has become a prominent research direction. However, existing approaches typically rely heavily on manually annotated datasets that include explicit reasoning processes, which are costly and time-consuming to produce. Recent advances suggest that reinforcement learning (RL) can endow large models with reasoning capabilities without requiring such reasoning-annotated data. In this paper, we propose SAM-R1, a novel framework that enables multimodal large models to perform fine-grained reasoning in image understanding tasks. Our approach is the first to incorporate fine-grained segmentation settings during the training of multimodal reasoning models. By integrating task-specific, fine-grained rewards with a tailored optimization objective, we further enhance the model’s reasoning and segmentation alignment. We also leverage the Segment Anything Model (SAM) as a strong and flexible reward provider to guide the learning process. With only 3k training samples, SAM-R1 achieves strong performance across multiple benchmarks, demonstrating the effectiveness of reinforcement learning in equipping multimodal models with segmentation-oriented reasoning capabilities.
zh

[CV-9] Tell me Habibi, is it Real or Fake?

【速读】:该论文旨在解决多语言和语码转换(code-switching)语音视频深度伪造检测的挑战,这一领域在现有研究中被严重忽视。传统深度伪造检测方法主要针对单一语言内容,而现实中的多语言混合内容(如阿拉伯语与英语的语码转换)对现有模型构成了额外的识别难题。论文提出的解决方案关键在于构建一个大规模的阿拉伯语-英语跨语言音视频深度伪造数据集\textbfArEnAV,该数据集包含387,000个视频和超过765小时的真实与伪造视频,首次引入了话语内语码转换、方言变化及纯阿拉伯语内容,并通过集成四种文本到语音和两种唇形同步模型生成,为多模态多语言深度伪造检测提供了全面的基准。

链接: https://arxiv.org/abs/2505.22581
作者: Kartik Kuckreja,Parul Gupta,Injy Hamed,Thamar Solorio,Muhammad Haris Khan,Abhinav Dhall
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 12 tables

点击查看摘要

Abstract:Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake videos. Our dataset is generated using a novel pipeline integrating four Text-To-Speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset can be accessed here: this https URL.
zh

[CV-10] ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models

【速读】:该论文试图解决扩散模型在与人类偏好对齐过程中出现的多样性下降问题,即基于奖励的微调虽然能提升对齐度,但通常会导致生成结果缺乏多样性。其解决方案的关键在于两个贡献:首先,引入了“联合生成”(combined generation)采样策略,仅在生成过程的后期阶段使用经过奖励微调的扩散模型,而早期阶段保留基础模型,从而减少早期过拟合并保持全局结构和多样性;其次,提出了ImageReFL微调方法,通过在真实图像上训练并结合多种正则化项(包括扩散损失和ReFL损失),在保持质量的同时提升图像多样性。

链接: https://arxiv.org/abs/2505.22569
作者: Dmitrii Sorokin,Maksim Nakhodnov,Andrey Kuznetsov,Aibek Alanov
机构: HSE University (高等经济学院); AIRI (人工智能研究机构); Sber; Innopolis (英诺波利斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The source code can be found at this https URL

点击查看摘要

Abstract:Recent advances in diffusion models have led to impressive image generation capabilities, but aligning these models with human preferences remains challenging. Reward-based fine-tuning using models trained on human feedback improves alignment but often harms diversity, producing less varied outputs. In this work, we address this trade-off with two contributions. First, we introduce \textitcombined generation, a novel sampling strategy that applies a reward-tuned diffusion model only in the later stages of the generation process, while preserving the base model for earlier steps. This approach mitigates early-stage overfitting and helps retain global structure and diversity. Second, we propose \textitImageReFL, a fine-tuning method that improves image diversity with minimal loss in quality by training on real images and incorporating multiple regularizers, including diffusion and ReFL losses. Our approach outperforms conventional reward tuning methods on standard quality and diversity metrics. A user study further confirms that our method better balances human preference alignment and visual diversity. The source code can be found at this https URL .
zh
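
"联合生成"采样策略可以概括为一个简单的调度:前一部分去噪步骤用基础模型以保持全局结构与多样性,剩余步骤切换到奖励微调模型以对齐偏好。骨架示意如下(denoise_step 为假设的单步去噪占位接口):

```python
# Combined generation 的采样骨架(示意;denoise_step 为占位)
def combined_generation(base_model, reward_model, x_T, timesteps, switch_frac=0.7):
    x = x_T
    switch_at = int(len(timesteps) * switch_frac)
    for i, t in enumerate(timesteps):
        model = base_model if i < switch_at else reward_model  # 后期才用奖励微调模型
        x = denoise_step(model, x, t)                          # 占位:一步 DDIM/DDPM 去噪
    return x
```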

[CV-11] Universal Visuo-Tactile Video Understanding for Embodied Interaction

【速读】:该论文旨在解决现有方法在物理理解中未能有效整合触觉信息的问题,从而提升具身智能体对物体物理属性的感知能力。其关键解决方案是提出VTV-LLM,首个用于通用Visuo-Tactile Video(VTV)理解的多模态大语言模型,通过构建包含150,000帧视频数据的VTV150K数据集,并采用三阶段训练范式:VTV增强以获得鲁棒的视觉-触觉表征、VTV-文本对齐实现跨模态对应关系,以及文本提示微调以支持自然语言生成,从而实现高效的触觉推理能力。

链接: https://arxiv.org/abs/2505.22566
作者: Yifan Xie,Mingyang Li,Shoujie Li,Xingting Li,Guangyu Chen,Fei Ma,Fei Richard Yu,Wenbo Ding
机构: Tsinghua University (清华大学); Sun Yat-sen University (中山大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities including feature assessment, comparative analysis, scenario-based decision making and so on. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.
zh

[CV-12] PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion

【速读】:该论文旨在解决大规模视频数据处理在深度学习应用中的计算挑战,特别是在视频数据集压缩(video dataset condensation)领域的问题。与图像数据集压缩相比,视频数据由于空间内容与时间动态之间的复杂交互而面临独特的挑战。论文提出的解决方案PRISM(Progressive Refinement and Insertion for Sparse Motion)的关键在于重新考虑视频数据的压缩方式,通过保留静态内容与动态运动之间的本质依赖关系,而非将其分离。该方法通过逐步优化和插入帧来充分捕捉动作中的运动信息,在保证性能的同时减少存储需求,考虑了每帧之间的梯度关系,从而实现了更高效的视频数据集压缩。

链接: https://arxiv.org/abs/2505.22564
作者: Jaehyun Choi,Jiwan Hur,Gyojin Han,Jaemyung Yu,Junmo Kim
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Video dataset condensation has emerged as a critical technique for addressing the computational challenges associated with large-scale video data processing in deep learning applications. While significant progress has been made in image dataset condensation, the video domain presents unique challenges due to the complex interplay between spatial content and temporal dynamics. This paper introduces PRISM (Progressive Refinement and Insertion for Sparse Motion), a novel approach to video dataset condensation that fundamentally reconsiders how video data should be condensed. Unlike the previous method that separates static content from dynamic motion, our method preserves the essential interdependence between these elements. Our approach progressively refines and inserts frames to fully accommodate the motion in an action, achieving better performance with less storage by considering the gradient relations between frames. Extensive experiments across standard video action recognition benchmarks demonstrate that PRISM outperforms existing disentangled approaches while maintaining compact representations suitable for resource-constrained environments.
zh

[CV-13] MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism

【速读】:该论文旨在解决基于信道状态信息(Channel State Information, CSI)的人体姿态估计中多人体姿态识别准确性和有效CSI特征学习的问题。其关键解决方案是提出了一种基于Transformer的时间-频率双令牌特征提取器,该结构通过多头自注意力机制建模子载波间的相关性与CSI的时间依赖性,并结合多阶段特征融合网络(Multi-Stage Feature Fusion Network, MSFN)将提取的CSI特征与姿态概率热图进行融合,以强化解剖学约束,从而提升姿态估计的精度。

链接: https://arxiv.org/abs/2505.22555
作者: Yanyi Qu,Haoyang Ma,Wenhui Xiong
机构: University of Electronic Science and Technology of China (电子科技大学); National Key Laboratory of Science and Technology on Communications (国家通信科学与技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Human pose estimation based on Channel State Information (CSI) has emerged as a promising approach for non-intrusive and precise human activity monitoring, yet faces challenges including accurate multi-person pose recognition and effective CSI feature learning. This paper presents MultiFormer, a wireless sensing system that accurately estimates human pose through CSI. The proposed system adopts a Transformer based time-frequency dual-token feature extractor with multi-head self-attention. This feature extractor is able to model inter-subcarrier correlations and temporal dependencies of the CSI. The extracted CSI features and the pose probability heatmaps are then fused by Multi-Stage Feature Fusion Network (MSFN) to enforce the anatomical constraints. Extensive experiments conducted on the public MM-Fi dataset and our self-collected dataset show that MultiFormer achieves higher accuracy than state-of-the-art approaches, especially for high-mobility keypoints (wrists, elbows) that are particularly difficult for previous methods to accurately estimate.
zh

[CV-14] Deep Learning-Based BMD Estimation from Radiographs with Conformal Uncertainty Quantification

【速读】:该论文试图解决因双能X射线吸收测定法(DXA)设备获取受限而阻碍骨质疏松症筛查的问题。其解决方案的关键在于利用广泛可得的膝关节X光片,通过深度学习技术进行骨矿物密度(BMD)的机遇性估计,并强调了在临床应用中必不可少的稳健不确定性量化方法。研究采用EfficientNet模型在OAI数据集上训练以预测BMD,并引入Split Conformal Prediction来提供具有保证覆盖率的患者特异性预测区间,从而增强模型的可信度和实用性。

链接: https://arxiv.org/abs/2505.22551
作者: Long Hui,Wai Lok Yeung
机构: Sky Long Artificial Intelligence Company (天空龙人工智能公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Limited DXA access hinders osteoporosis screening. This proof-of-concept study proposes using widely available knee X-rays for opportunistic Bone Mineral Density (BMD) estimation via deep learning, emphasizing robust uncertainty quantification essential for clinical use. An EfficientNet model was trained on the OAI dataset to predict BMD from bilateral knee radiographs. Two Test-Time Augmentation (TTA) methods were compared: traditional averaging and a multi-sample approach. Crucially, Split Conformal Prediction was implemented to provide statistically rigorous, patient-specific prediction intervals with guaranteed coverage. Results showed a Pearson correlation of 0.68 (traditional TTA). While traditional TTA yielded better point predictions, the multi-sample approach produced slightly tighter confidence intervals (90%, 95%, 99%) while maintaining coverage. The framework appropriately expressed higher uncertainty for challenging cases. Although anatomical mismatch between knee X-rays and standard DXA limits immediate clinical use, this method establishes a foundation for trustworthy AI-assisted BMD screening using routine radiographs, potentially improving early osteoporosis detection.
zh
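
Split Conformal Prediction 的区间构造是标准做法,可独立于具体回归模型复现:在校准集残差上取带有限样本修正的分位数 q,新样本的区间即点预测 ± q,并带有边际覆盖率保证。示意如下:

```python
# Split conformal prediction 回归区间(标准做法的最小示意)
import numpy as np

def conformal_interval(cal_pred, cal_true, new_pred, alpha=0.1):
    """cal_pred/cal_true: 校准集预测与真值;alpha=0.1 对应约 90% 覆盖率。"""
    cal_pred, cal_true = np.asarray(cal_pred), np.asarray(cal_true)
    n = len(cal_true)
    scores = np.abs(cal_true - cal_pred)                   # 非一致性分数:绝对残差
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # 有限样本修正
    q = np.quantile(scores, level, method="higher")
    return new_pred - q, new_pred + q                      # P(y ∈ 区间) ≥ 1 - alpha
```

若希望区间对困难病例自适应变宽,常见做法是改用归一化残差(残差除以估计的局部尺度)作为非一致性分数。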

[CV-15] Scaling-up Perceptual Video Quality Assessment

【速读】:该论文旨在解决感知视频质量评估(Perceptual Video Quality Assessment, VQA)领域中由于标注资源稀缺和数据集规模不足而导致的数据缩放定律潜力未被充分挖掘的问题。其解决方案的关键在于提出一种高效的框架\textbfOmniVQA,用于构建高质量的人机协同VQA多模态指令数据库(Multi-Modal Instruction Databases, MIDBs),并通过扩展形成当前最大的VQA领域MIDB——\textbfOmniVQA-Chat-400K,同时引入互补训练策略和细粒度基准\textbfOmniVQA-FG以提升模型在质量和评分任务上的性能。

链接: https://arxiv.org/abs/2505.22543
作者: Ziheng Jia,Zicheng Zhang,Zeyu Zhang,Yingji Liang,Xiaorong Zhu,Chunyi Li,Jinliang Han,Haoning Wu,Bin Wang,Haoran Zhang,Guanyu Zhu,Qiyong Zhao,Xiaohong Liu,Guangtao Zhai,Xiongkuo Min
机构: Shanghai Jiaotong University (上海交通大学); Media Experience and Evaluation Lab (媒体体验与评估实验室); Huawei Techonologies (华为技术); Nanyang Technological University (南洋理工大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of the scaling law remains largely unexplored due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, an efficient framework designed to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, the largest MIDB in the VQA field concurrently. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from datasets for quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain) Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.
zh

[CV-16] RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting

【速读】:该论文试图解决现有深度学习方法在水文学中主要局限于局部尺度应用,未能充分利用水体的固有空间关联性的问题,从而限制了河流径流和洪水预测的准确性与适用性。解决方案的关键在于提出一种新的深度学习模型——RiverMamba,该模型通过预训练长期再分析数据,并利用高效的Mamba块来捕捉全球尺度的河道网络输运过程,同时结合ECMWF HRES气象预报并考虑其不确定性,从而实现对全球河流径流和洪水的7天提前期预测。

链接: https://arxiv.org/abs/2505.22535
作者: Mohamad Hakam Shams Eddin,Yikui Zahng,Stefan Kollet,Juergen Gall
机构: Institute of Computer Science, University of Bonn (计算机科学研究所,波恩大学); Lamarr Institute for Machine Learning and Artificial Intelligence (拉玛尔机器学习与人工智能研究所); Institute of Bio- and Geosciences Agrosphere (IBG-3), Research Centre Jülich (生物与地球科学研究所农业圈(IBG-3),于利希研究中心); Centre for High-Performance Scientific Computing in Terrestrial Systems (地表高性能科学计算中心); Geoverbund ABC/J, Jülich (地球系统联盟ABC/J,于利希)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Main paper 10 pages, Appendix 53 pages

点击查看摘要

Abstract:Recent deep learning approaches for river discharge forecasting have improved the accuracy and efficiency in flood forecasting, enabling more reliable early warning systems for risk management. Nevertheless, existing deep learning approaches in hydrology remain largely confined to local-scale applications and do not leverage the inherent spatial connections of bodies of water. Thus, there is a strong need for new deep learning methodologies that are capable of modeling spatio-temporal relations to improve river discharge and flood forecasting for scientific and operational applications. To address this, we present RiverMamba, a novel deep learning model that is pretrained with long-term reanalysis data and that can forecast global river discharge and floods on a 0.05^\circ grid up to 7 days lead time, which is of high relevance in early warning. To achieve this, RiverMamba leverages efficient Mamba blocks that enable the model to capture global-scale channel network routing and enhance its forecast capability for longer lead times. The forecast blocks integrate ECMWF HRES meteorological forecasts, while accounting for their inaccuracies through spatio-temporal modeling. Our analysis demonstrates that RiverMamba delivers reliable predictions of river discharge, including extreme floods across return periods and lead times, surpassing both operational AI- and physics-based models.
zh

[CV-17] PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

【速读】:该论文旨在解决多层透明图像生成中缺乏大规模高质量数据集的问题,从而推动多层生成模型的发展。其关键解决方案包括:发布首个开放的超高清多层透明图像数据集PrismLayersPro,引入无需训练的合成流程以按需生成数据,并开发出性能优越的开源多层生成模型ART+。核心技术贡献包括LayerFLUX和MultiLayerFLUX,分别用于生成高质量单层透明图像和基于语义布局组合多层图像,同时通过严格的过滤和人工筛选确保图像质量。

链接: https://arxiv.org/abs/2505.22523
作者: Junwen Chen,Heyang Jiang,Yanbin Wang,Keming Wu,Ji Li,Chao Zhang,Keiji Yanai,Dong Chen,Yuhui Yuan
机构: Microsoft Research Asia (微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: this https URL

点击查看摘要

Abstract:Generating high-quality, multi-layer transparent images from text prompts can unlock a new level of creative control, allowing users to edit each layer as effortlessly as editing text outputs from LLMs. However, the development of multi-layer generative models lags behind that of conventional text-to-image models due to the absence of a large, high-quality corpus of multi-layer transparent data. In this paper, we address this fundamental challenge by: (i) releasing the first open, ultra-high-fidelity PrismLayers (PrismLayersPro) dataset of 200K (20K) multilayer transparent images with accurate alpha mattes, (ii) introducing a training-free synthesis pipeline that generates such data on demand using off-the-shelf diffusion models, and (iii) delivering a strong, open-source multi-layer generation model, ART+, which matches the aesthetics of modern text-to-image generation models. The key technical contributions include: LayerFLUX, which excels at generating high-quality single transparent layers with accurate alpha mattes, and MultiLayerFLUX, which composes multiple LayerFLUX outputs into complete images, guided by human-annotated semantic layout. To ensure higher quality, we apply a rigorous filtering stage to remove artifacts and semantic mismatches, followed by human selection. Fine-tuning the state-of-the-art ART model on our synthetic PrismLayersPro yields ART+, which outperforms the original ART in 60% of head-to-head user study comparisons and even matches the visual quality of images generated by the FLUX.1-[dev] model. We anticipate that our work will establish a solid dataset foundation for the multi-layer transparent image generation task, enabling research and applications that require precise, editable, and visually compelling layered imagery.
zh

[CV-18] PathFL: Multi-Alignment Federated Learning for Pathology Image Segmentation

【速读】:该论文旨在解决多中心病理图像分割中由于成像模态、器官和扫描设备等异质性来源导致的表示偏差问题,这些问题阻碍了可泛化的分割模型的发展。其解决方案的关键在于提出了一种名为PathFL的新型多对齐联邦学习框架,通过图像、特征和模型聚合三个层次的对齐策略实现跨中心数据的一致性与泛化能力提升。

链接: https://arxiv.org/abs/2505.22522
作者: Yuan Zhang,Feng Chen,Yaolei Qi,Guanyu Yang,Huazhu Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5 figures; Accepted by MedIA

点击查看摘要

Abstract:Pathology image segmentation across multiple centers encounters significant challenges due to diverse sources of heterogeneity including imaging modalities, organs, and scanning equipment, whose variability brings representation bias and impedes the development of generalizable segmentation models. In this paper, we propose PathFL, a novel multi-alignment Federated Learning framework for pathology image segmentation that addresses these challenges through three-level alignment strategies of image, feature, and model aggregation. Firstly, at the image level, a collaborative style enhancement module aligns and diversifies local data by facilitating style information exchange across clients. Secondly, at the feature level, an adaptive feature alignment module ensures implicit alignment in the representation space by infusing local features with global insights, promoting consistent feature learning across heterogeneous clients. Finally, at the model aggregation level, a stratified similarity aggregation strategy hierarchically aligns and aggregates models on the server, using layer-specific similarity to account for client discrepancies and enhance global generalization. Comprehensive evaluations on four sets of heterogeneous pathology image datasets, encompassing cross-source, cross-modality, cross-organ, and cross-scanner variations, validate the effectiveness of our PathFL in achieving better performance and robustness against data heterogeneity.
zh
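
"分层相似性聚合"可理解为在服务器端逐层计算客户端与全局参数的相似度,并据此为每一层单独加权。以下为示意(相似度定义与 softmax 归一化均为假设,非 PathFL 官方实现):

```python
# 逐层相似度加权的联邦聚合示意(非官方实现)
import torch

def stratified_aggregate(global_sd, client_sds, temperature=1.0):
    """global_sd: 全局模型 state_dict;client_sds: 各客户端 state_dict 列表。"""
    new_sd = {}
    for name, g in global_sd.items():
        params = torch.stack([sd[name].float() for sd in client_sds])   # (C, …)
        sims = torch.stack([
            torch.cosine_similarity(p.flatten(), g.flatten().float(), dim=0)
            for p in params])                           # 每层、每客户端一个相似度
        w = torch.softmax(sims / temperature, dim=0)    # (C,) 归一化为该层的聚合权重
        new_sd[name] = (w.view(-1, *[1] * (params.dim() - 1)) * params).sum(dim=0)
    return new_sd
```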

[CV-19] he Meeseeks Mesh: Spatially Consistent 3D Adversarial Objects for BEV Detector

【速读】:该论文旨在解决3D目标检测模型在面对3D对抗攻击时的脆弱性问题,特别是针对自动驾驶系统中基于鸟瞰图(BEV)的3D目标检测框架。其解决方案的关键在于生成非侵入式的3D对抗物体,这些物体在不同时间和视角下具有空间一致性,并通过可微分渲染技术精确建模对抗物体与目标车辆之间的空间关系。此外,引入了遮挡感知模块以提高不同视角下的视觉一致性和真实感,并设计了BEV空间特征引导的优化策略以保持多帧攻击的有效性。

链接: https://arxiv.org/abs/2505.22499
作者: Aixuan Li,Mochu Xiang,Jing Zhang,Yuchao Dai
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D object detection is a critical component in autonomous driving systems. It allows real-time recognition and detection of vehicles, pedestrians and obstacles under varying environmental conditions. Among existing methods, 3D object detection in the Bird's Eye View (BEV) has emerged as the mainstream framework. To guarantee a safe, robust and trustworthy 3D object detection, 3D adversarial attacks are investigated, where attacks are placed in 3D environments to evaluate the model performance, e.g., putting a film on a car, clothing a pedestrian. The vulnerability of 3D object detection models to 3D adversarial attacks serves as an important indicator to evaluate the robustness of the model against perturbations. To investigate this vulnerability, we generate non-invasive 3D adversarial objects tailored for real-world attack scenarios. Our method verifies the existence of universal adversarial objects that are spatially consistent across time and camera views. Specifically, we employ differentiable rendering techniques to accurately model the spatial relationship between adversarial objects and the target vehicle. Furthermore, we introduce an occlusion-aware module to enhance visual consistency and realism under different viewpoints. To maintain attack effectiveness across multiple frames, we design a BEV spatial feature-guided optimization strategy. Experimental results demonstrate that our approach can reliably suppress vehicle predictions from state-of-the-art 3D object detectors, serving as an important tool to test robustness of 3D object detection models before deployment. Moreover, the generated adversarial objects exhibit strong generalization capabilities, retaining their effectiveness at various positions and distances in the scene.
zh

[CV-20] ProCrop: Learning Aesthetic Image Cropping from Professional Compositions

【速读】:该论文旨在解决图像裁剪(image cropping)中现有基于规则和数据驱动的方法缺乏多样性或依赖标注训练数据的问题。其解决方案的关键在于提出ProCrop,一种基于检索的方法,通过融合专业摄影作品与查询图像的特征,学习专业构图策略,从而显著提升裁剪性能。此外,研究还构建了一个包含242K条弱标注图像的大规模数据集,通过迭代优化生成多样化的高质量裁剪方案,为图像美学与构图分析提供了新的基准。

链接: https://arxiv.org/abs/2505.22490
作者: Ke Zhang,Tianyu Ding,Jiachen Jiang,Tianyi Chen,Ilya Zharkov,Vishal M. Patel,Luming Liang
机构: Johns Hopkins University (约翰霍普金斯大学); Ohio State University (俄亥俄州立大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 15 figures

点击查看摘要

Abstract:Image cropping is crucial for enhancing the visual appeal and narrative impact of photographs, yet existing rule-based and data-driven approaches often lack diversity or require annotated training data. We introduce ProCrop, a retrieval-based method that leverages professional photography to guide cropping decisions. By fusing features from professional photographs with those of the query image, ProCrop learns from professional compositions, significantly boosting performance. Additionally, we present a large-scale dataset of 242K weakly-annotated images, generated by out-painting professional images and iteratively refining diverse crop proposals. This composition-aware dataset generation offers diverse high-quality crop proposals guided by aesthetic principles and becomes the largest publicly available dataset for image cropping. Extensive experiments show that ProCrop significantly outperforms existing methods in both supervised and weakly-supervised settings. Notably, when trained on the new dataset, our ProCrop surpasses previous weakly-supervised methods and even matches fully supervised approaches. Both the code and dataset will be made publicly available to advance research in image aesthetics and composition analysis.
zh

[CV-21] Understanding Adversarial Training with Energy-based Models

【速读】:该论文试图解决对抗训练(Adversarial Training, AT)中的关键现象,即灾难性过拟合(Catastrophic Overfitting, CO)和鲁棒过拟合(Robust Overfitting, RO),以及鲁棒分类器在生成建模任务中的性能限制。其解决方案的关键在于从能量视角(Energy-based Model, EBM)分析对抗样本与自然样本的能量差异,并提出一种新的正则化方法——Delta Energy Regularizer (DER),用于平滑训练过程中的能量景观,从而有效缓解CO和RO问题。此外,该研究还通过局部类特定主成分分析(PCA)和基于能量的引导策略,提升了鲁棒分类器作为生成模型时的样本多样性和生成质量。

链接: https://arxiv.org/abs/2505.22486
作者: Mujtaba Hussain Mirza,Maria Rosaria Briglia,Filippo Bartolucci,Senad Beadini,Giuseppe Lisanti,Iacopo Masi
机构: Sapienza University of Rome (罗马大学); University of Bologna (博洛尼亚大学); Eustema S.p.A. (Eustema S.p.A.)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review for TPAMI

点击查看摘要

Abstract:We aim at using the Energy-based Model (EBM) framework to better understand adversarial training (AT) in classifiers, and additionally to analyze the intrinsic generative capabilities of robust classifiers. By viewing standard classifiers through an energy lens, we begin by analyzing how the energies of adversarial examples, generated by various attacks, differ from those of the natural samples. The central focus of our work is to understand the critical phenomena of Catastrophic Overfitting (CO) and Robust Overfitting (RO) in AT from an energy perspective. We analyze the impact of existing AT approaches on the energy of samples during training and observe that the behavior of the "delta energy" – change in energy between original sample and its adversarial counterpart – diverges significantly when CO or RO occurs. After a thorough analysis of these energy dynamics and their relationship with overfitting, we propose a novel regularizer, the Delta Energy Regularizer (DER), designed to smoothen the energy landscape during training. We demonstrate that DER is effective in mitigating both CO and RO across multiple benchmarks. We further show that robust classifiers, when being used as generative models, have limits in handling trade-off between image quality and variability. We propose an improved technique based on a local class-wise principal component analysis (PCA) and energy-based guidance for better class-specific initialization and adaptive stopping, enhancing sample diversity and generation quality. Considering that we do not explicitly train for generative modeling, we achieve a competitive Inception Score (IS) and Fréchet inception distance (FID) compared to hybrid discriminative-generative models.
zh
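
沿用 JEM 一类工作的约定,判别器可诱导出能量 E(x) = -logsumexp_y f(x)[y];"delta energy" 即对抗样本与原样本的能量之差。DER 的一种直接写法是惩罚该差值(能量定义为标准约定,具体损失形式为假设):

```python
# Delta Energy Regularizer 的示意实现(能量定义沿用 JEM 约定;损失形式为假设)
import torch

def energy(logits):
    """E(x) = -logsumexp_y f(x)[y];logits: (B, num_classes)。"""
    return -torch.logsumexp(logits, dim=-1)

def delta_energy_reg(model, x, x_adv, lam=0.1):
    d = energy(model(x_adv)) - energy(model(x))   # 对抗扰动引起的能量变化
    return lam * (d ** 2).mean()                  # 惩罚剧烈漂移,平滑能量景观

# 训练时:loss = ce_loss + delta_energy_reg(model, x, x_adv)
```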

[CV-22] A Closer Look at Multimodal Representation Collapse ICML

【速读】:该论文试图解决模态坍缩(modality collapse)问题,这是一种在多模态融合模型训练中观察到的现象,即模型倾向于仅依赖部分模态,而忽略其他模态。解决方案的关键在于通过跨模态知识蒸馏(cross-modal knowledge distillation)隐式地解耦表示,从而释放学生编码器中的秩瓶颈,去噪融合头输出而不影响任一模态的预测特征。基于此,作者提出了一种通过显式基向量重分配来防止模态坍缩的算法,并验证了其在处理缺失模态中的有效性。

链接: https://arxiv.org/abs/2505.22483
作者: Abhra Chaudhuri,Anjan Dutta,Tu Bui,Serban Georgescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Machine Learning (ICML) 2025 (Spotlight)

点击查看摘要

Abstract:We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: this https URL.
zh

[CV-23] Single Domain Generalization for Alzheimer's Detection from 3D MRIs with Pseudo-Morphological Augmentations and Contrastive Learning

【速读】:该论文试图解决阿尔茨海默病(Alzheimer’s disease)通过磁共振成像(MRI)检测中由于类别不平衡、扫描协议差异和数据集多样性有限导致的模型泛化能力不足的问题。其解决方案的关键在于引入可学习的伪形态模块,旨在生成具有形状感知性和解剖学意义的类别特定增强数据,并结合监督对比学习模块以提取鲁棒的类别特定表征,从而提升模型在不同分布下的泛化性能。

链接: https://arxiv.org/abs/2505.22465
作者: Zobia Batool,Huseyin Ozkan,Erchan Aptoula
机构: Sabanci University (萨班奇大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although Alzheimer’s disease detection via MRIs has advanced significantly thanks to contemporary deep learning models, challenges such as class imbalance, protocol variations, and limited dataset diversity often hinder their generalization capacity. To address this issue, this article focuses on the single domain generalization setting, where given the data of one domain, a model is designed and developed with maximal performance w.r.t. an unseen domain of distinct distribution. Since brain morphology is known to play a crucial role in Alzheimer’s diagnosis, we propose the use of learnable pseudo-morphological modules aimed at producing shape-aware, anatomically meaningful class-specific augmentations in combination with a supervised contrastive learning module to extract robust class-specific representations. Experiments conducted across three datasets show improved performance and generalization capacity, especially under class imbalance and imaging protocol variations. The source code will be made available upon acceptance at this https URL.
zh
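
"可学习伪形态模块"的一种常见构造,是把形态学膨胀/腐蚀写成带可学习结构元的 max/min 运算(软形态算子)。以下给出 2D 灰度膨胀的示意(论文面向 3D MRI,此处降到 2D 仅为演示;具体模块结构为假设):

```python
# 可学习灰度膨胀的 2D 示意(非论文官方实现)
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableDilation2d(nn.Module):
    """灰度膨胀:out(p) = max_{q∈窗口} [x(p+q) + w(q)],w 为可学习结构元。"""
    def __init__(self, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.zeros(k * k))   # 初始等价于普通 max-pool

    def forward(self, x):                                # x: (B, 1, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)           # (B, k*k, H*W)
        out = (patches + self.weight.view(1, -1, 1)).max(dim=1).values
        return out.view(x.shape)

# 腐蚀可对称地用 min 实现;堆叠并配合类别条件即可产生形状感知的类别特定增强
```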

[CV-24] SHTOcc: Effective 3D Occupancy Prediction with Sparse Head and Tail Voxels

【速读】:该论文旨在解决3D占用预测中由于类别分布不均衡导致的长尾问题以及由体素几何分布引起的性能下降问题。其解决方案的关键在于提出SHTOcc(Sparse Head-Tail Occupancy),通过稀疏头尾体素构建来准确识别并平衡头部和尾部类别的关键体素,并采用解耦学习减少模型对主导类别(头部类别)的偏差,增强对尾部类别的关注。

链接: https://arxiv.org/abs/2505.22461
作者: Qiucheng Yu,Yuan Xie,Xin Tan
机构: City University of Hong Kong (香港城市大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D occupancy prediction has attracted much attention in the field of autonomous driving due to its powerful geometric perception and object recognition capabilities. However, existing methods have not explored the most essential distribution patterns of voxels, resulting in unsatisfactory results. This paper first explores the inter-class distribution and geometric distribution of voxels, thereby solving the long-tail problem caused by the inter-class distribution and the poor performance caused by the geometric distribution. Specifically, this paper proposes SHTOcc (Sparse Head-Tail Occupancy), which uses sparse head-tail voxel construction to accurately identify and balance key voxels in the head and tail classes, while using decoupled learning to reduce the model’s bias towards the dominant (head) category and enhance the focus on the tail class. Experiments show that significant improvements have been made on multiple baselines: SHTOcc reduces GPU memory usage by 42.2%, increases inference speed by 58.6%, and improves accuracy by about 7%, verifying its effectiveness and efficiency. The code is available at this https URL
zh

[CV-25] Universal Domain Adaptation for Semantic Segmentation CVPR2025

【速读】:该论文旨在解决无监督域适应语义分割(UDA-SS)中类别设置未知带来的性能下降问题,特别是在目标域存在私有类的情况下。传统方法假设源域和目标域的类别设置已知,这在实际场景中并不成立。为了解决这一问题,作者提出了通用域适应语义分割(UniDA-SS),其关键在于提出了一种名为UniMAP的框架,包含两个核心组件:基于领域特定原型区分(DSPD)和基于目标的图像匹配(TIM)。DSPD通过将每个类别划分为两个领域特定原型,实现更细粒度的领域特征分离;TIM则通过选择包含最多公共类像素的源图像进行配对,促进公共类的有效学习。

链接: https://arxiv.org/abs/2505.22458
作者: Seun-An Choe,Keon-Hee Park,Jinwoo Choi,Gyeong-Moon Park
机构: Kyung Hee University (庆熙大学); Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Unsupervised domain adaptation for semantic segmentation (UDA-SS) aims to transfer knowledge from labeled source data to unlabeled target data. However, traditional UDA-SS methods assume that category settings between source and target domains are known, which is unrealistic in real-world scenarios. This leads to performance degradation if private classes exist. To address this limitation, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA-SS), achieving robust adaptation even without prior knowledge of category settings. We define the problem in the UniDA-SS scenario as low confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose UniMAP: UniDA-SS with Image Matching and Prototype-based Distinction, a novel framework composed of two key components. First, Domain-Specific Prototype-based Distinction (DSPD) divides each class into two domain-specific prototypes, enabling finer separation of domain-specific features and enhancing the identification of common classes across domains. Second, Target-based Image Matching (TIM) selects a source image containing the most common-class pixels based on the target pseudo-label and pairs it in a batch to promote effective learning of common classes. We also introduce a new UniDA-SS benchmark and demonstrate through various experiments that UniMAP significantly outperforms baselines. The code is available at this https URL.
zh

[CV-26] NFR: Neural Feature-Guided Non-Rigid Shape Registration

【速读】:该论文旨在解决3D形状配准(3D shape registration)中的挑战,特别是在面对显著的非刚性形变和部分性(partiality)时,传统方法难以获得准确的对应关系。其解决方案的关键在于将基于深度学习的形状匹配网络所学习到的神经特征(neural features)整合进一个迭代式的几何形状配准流程中,从而在无需对应标注的情况下实现更准确和语义丰富的对应估计,并通过中间配准结果动态更新对应关系,结合一致性先验进行过滤,提升整体鲁棒性。

链接: https://arxiv.org/abs/2505.22445
作者: Puhua Jiang,Zhangquan Chen,Mingze Sun,Ruqi Huang
机构: 清华大学(THU)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures. arXiv admin note: substantial text overlap with arXiv:2311.04494

点击查看摘要

Abstract:In this paper, we propose a novel learning-based framework for 3D shape registration, which overcomes the challenges of significant non-rigid deformation and partiality among input shapes, and, remarkably, requires no correspondence annotation during training. Our key insight is to incorporate neural features learned by deep learning-based shape matching networks into an iterative, geometric shape registration pipeline. The advantage of our approach is two-fold – On one hand, neural features provide more accurate and semantically meaningful correspondence estimation than spatial features (e.g., coordinates), which is critical in the presence of large non-rigid deformations; On the other hand, the correspondences are dynamically updated according to the intermediate registrations and filtered by consistency prior, which prominently robustifies the overall pipeline. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline not only achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching and partial shape matching across varying settings, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work.
zh

[CV-27] On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation

【速读】:该论文旨在解决大规模预训练点云模型在适应特定下游任务时需要全量微调所带来的高计算和存储成本问题。现有参数高效微调(PEFT)方法在3D点云模型中表现不佳,主要是因为其将点视为无序标记,忽略了3D建模中的局部空间结构和全局几何上下文。论文提出的解决方案是引入几何编码混合器(GEM),其关键在于通过显式整合细粒度局部位置编码与轻量级潜在注意力机制,以捕捉全面的全局上下文,从而有效缓解空间和几何分布不匹配的问题。

链接: https://arxiv.org/abs/2505.22444
作者: Liyao Tang,Zhe Chen,Dacheng Tao
机构: The University of Sydney (悉尼大学); La Trobe University (拉特罗布大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, would underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model’s parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code will be released.
zh
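To make the adapter idea above concrete, here is a minimal PyTorch sketch of a GEM-style module that injects local positional encodings and summarizes global context through a small set of latent tokens. All module names, dimensions, and the exact attention wiring are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class GeometricEncodingMixer(nn.Module):
    # Sketch of a GEM-style geometry-aware adapter (names/sizes assumed):
    # local xyz positional encoding + lightweight latent cross-attention.
    def __init__(self, dim=384, num_latents=16, heads=4):
        super().__init__()
        self.pos_mlp = nn.Sequential(
            nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens, xyz):
        # tokens: (B, N, dim) frozen-backbone features; xyz: (B, N, 3) coords
        h = tokens + self.pos_mlp(xyz)                # inject local geometry
        lat = self.latents.expand(h.size(0), -1, -1)  # (B, K, dim) latent queries
        ctx, _ = self.attn(lat, h, h)                 # latents summarize the scene
        glob, _ = self.attn(h, ctx, ctx)              # broadcast global context back
        return tokens + self.out(glob)                # residual adapter output

feats = GeometricEncodingMixer()(torch.randn(2, 1024, 384), torch.randn(2, 1024, 3))
```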

[CV-28] Can NeRFs See without Cameras?

【Quick Read】: This paper explores how multipath effects in radio-frequency (RF) or audio signals can be used to infer information about the environment, in particular reconstructing a home's floorplan from sparse indoor WiFi measurements. The key to the solution is redesigning Neural Radiance Fields (NeRFs) so that they can learn from signals containing multipath components and thereby "see" the environment, enabling implicit scene modeling and reconstruction.

Link: https://arxiv.org/abs/2505.22441
Authors: Chaitanya Amballa, Sattwik Basu, Yu-Lin Wei, Zhijian Yang, Mehmet Ergezer, Romit Roy Choudhury
Affiliations: University of Illinois Urbana-Champaign; Amazon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Neural Radiance Fields (NeRFs) have been remarkably successful at synthesizing novel views of 3D scenes by optimizing a volumetric scene function. This scene function models how optical rays bring color information from a 3D object to the camera pixels. Radio frequency (RF) or audio signals can also be viewed as a vehicle for delivering information about the environment to a sensor. However, unlike camera pixels, an RF/audio sensor receives a mixture of signals that contain many environmental reflections (also called “multipath”). Is it still possible to infer the environment using such multipath signals? We show that with redesign, NeRFs can be taught to learn from multipath signals, and thereby “see” the environment. As a grounding application, we aim to infer the indoor floorplan of a home from sparse WiFi measurements made at multiple locations inside the home. Although a difficult inverse problem, our implicitly learnt floorplans look promising, and enables forward applications, such as indoor signal prediction and basic ray tracing.
zh

[CV-29] Synonymous Variational Inference for Perceptual Image Compression ICML2025

【Quick Read】: This paper tackles perceptual image compression, whose core difficulty is balancing rate-distortion against perceptual quality. The key to the solution is a synonymous variational inference (SVI) method built on the set-element relationship between semantic and syntactic information from semantic information theory: perceptual similarity serves as the synonymous criterion for building an ideal synonymous set (Synset), and a parametric density approximates the posterior of its latent synonymous representation, enabling a theoretical analysis of the optimization direction of perceptual image compression.

Link: https://arxiv.org/abs/2505.22438
Authors: Zijian Liang, Kai Niu, Changshuo Wang, Jin Xu, Ping Zhang
Affiliations: Unknown
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 31 pages, 20 figures. This paper is accepted by Proceedings of the 42nd International Conference on Machine Learning (ICML 2025) Poster

Click to view abstract

Abstract:Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximate the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perception image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model’s capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method.
zh

[CV-30] Distance Transform Guided Mixup for Alzheimers Detection

【Quick Read】: This paper targets the limited generalization of Alzheimer's disease (AD) detection models caused by class imbalance, variations in imaging protocols, and insufficient dataset diversity in medical datasets. The key to the solution is to extend the classic mixup method: compute the distance transform of MRI scans, spatially separate each image into multiple layers, and combine layers stemming from distinct samples to produce augmented images, generating diverse data while preserving brain structure and thereby improving generalization.

Link: https://arxiv.org/abs/2505.22434
Authors: Zobia Batool, Huseyin Ozkan, Erchan Aptoula
Affiliations: Sabanci University; VPALab; VERIM
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Alzheimer’s detection efforts aim to develop accurate models for early disease diagnosis. Significant advances have been achieved with convolutional neural networks and vision transformer based approaches. However, medical datasets suffer heavily from class imbalance, variations in imaging protocols, and limited dataset diversity, which hinder model generalization. To overcome these challenges, this study focuses on single-domain generalization by extending the well-known mixup method. The key idea is to compute the distance transform of MRI scans, separate them spatially into multiple layers and then combine layers stemming from distinct samples to produce augmented images. The proposed approach generates diverse data while preserving the brain’s structure. Experimental results show generalization performance improvement across both ADNI and AIBL datasets.
zh
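The layer-mixing idea lends itself to a short sketch. Below is a hedged NumPy/SciPy version: the distance transform of a (hypothetical) brain mask splits a scan into concentric layers, and each layer is randomly sourced from one of two samples. The layer count, quantile binning, and 50/50 mixing rule are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_layer_mixup(img_a, img_b, mask_a, n_layers=4, rng=None):
    # Split image A into concentric layers via the distance transform of its
    # brain mask, then randomly fill each layer with content from A or B.
    rng = rng if rng is not None else np.random.default_rng()
    dist = distance_transform_edt(mask_a)              # distance to background
    edges = np.quantile(dist[mask_a > 0], np.linspace(0.0, 1.0, n_layers + 1))
    out = img_a.copy()
    for i in range(n_layers):
        layer = (dist > edges[i]) & (dist <= edges[i + 1])
        if rng.random() < 0.5:                         # this layer comes from B
            out[layer] = img_b[layer]
    return out
```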

[CV-31] Zero-Shot 3D Visual Grounding from Vision-Language Models CVPR2025

【Quick Read】: This paper addresses the scalability problem in 3D Visual Grounding (3DVG) caused by the reliance on annotated 3D data and predefined categories, which limits its use in open-world scenarios. The key to the solution is SeeGround, a framework that leverages 2D vision-language models (VLMs) for zero-shot operation, avoiding any 3D-specific training. By introducing a hybrid input format, a Perspective Adaptation Module, and a Fusion Alignment Module, it effectively improves grounding accuracy and generalization.

Link: https://arxiv.org/abs/2505.22429
Authors: Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, Junwei Liang
Affiliations: HKUST(GZ); I2R, A*STAR; NUS; CSE, HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 3D-LLM/VLA @ CVPR 2025; Project Page at this https URL

Click to view abstract

Abstract:3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions, enabling downstream applications such as augmented reality and robotics. Existing approaches typically rely on labeled 3D data and predefined categories, limiting scalability to open-world settings. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training. To bridge the modality gap, we introduce a hybrid input format that pairs query-aligned rendered views with spatially enriched textual descriptions. Our framework incorporates two core components: a Perspective Adaptation Module that dynamically selects optimal viewpoints based on the query, and a Fusion Alignment Module that integrates visual and spatial signals to enhance localization precision. Extensive evaluations on ScanRefer and Nr3D confirm that SeeGround achieves substantial improvements over existing zero-shot baselines – outperforming them by 7.7% and 7.1%, respectively – and even rivals fully supervised alternatives, demonstrating strong generalization under challenging conditions.
zh

[CV-32] RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network CVPR2025

【Quick Read】: This paper tackles the difficulty of automatic geometric calibration between radar and camera systems during operation, particularly the data sparsity and measurement uncertainty of radar height data. The key to the solution is a Dual-Perspective representation that fuses the rich but noise-sensitive height information of the frontal view with bird's-eye-view features that are robust to height uncertainty, combined with a Selective Fusion Mechanism that extracts reliable features from both perspectives to reduce the impact of height uncertainty. In addition, a Multi-Modal Cross-Attention Mechanism performs cross-modal matching, and a Noise-Resistant Matcher is designed to strengthen the robustness of the matching mechanism.

Link: https://arxiv.org/abs/2505.22427
Authors: Van-Tin Luu, Yon-Lin Cai, Vu-Hoang Tran, Wei-Chen Chiu, Yi-Ting Chen, Ching-Chun Huang
Affiliations: National Yang Ming Chiao Tung University; Ho Chi Minh City University of Technology and Education
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Click to view abstract

Abstract:This paper presents a groundbreaking approach - the first online automatic geometric calibration method for radar and camera systems. Given the significant data sparsity and measurement uncertainty in radar height data, achieving automatic calibration during system operation has long been a challenge. To address the sparsity issue, we propose a Dual-Perspective representation that gathers features from both frontal and bird’s-eye views. The frontal view contains rich but sensitive height information, whereas the bird’s-eye view provides robust features against height uncertainty. We thereby propose a novel Selective Fusion Mechanism to identify and fuse reliable features from both perspectives, reducing the effect of height uncertainty. Moreover, for each view, we incorporate a Multi-Modal Cross-Attention Mechanism to explicitly find location correspondences through cross-modal matching. During the training phase, we also design a Noise-Resistant Matcher to provide better supervision and enhance the robustness of the matching mechanism against sparsity and height uncertainty. Our experimental results, tested on the nuScenes dataset, demonstrate that our method significantly outperforms previous radar-camera auto-calibration methods, as well as existing state-of-the-art LiDAR-camera calibration techniques, establishing a new benchmark for future research. The code is available at this https URL.
zh

[CV-33] GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control

【Quick Read】: This paper addresses two weaknesses of current world models for autonomous driving: maintaining 3D geometric consistency and accumulating artifacts when handling occlusions, both of which undermine the reliability of safety assessment in autonomous navigation. The key to the solution is GeoDrive, which explicitly integrates robust 3D geometry conditions into the driving world model to enhance spatial understanding and action controllability. Concretely, GeoDrive first extracts a 3D representation from the input frame and renders it in 2D along the user-specified ego-vehicle trajectory, while a dynamic editing module introduced during training improves the renderings by adjusting vehicle positions.

Link: https://arxiv.org/abs/2505.22421
Authors: Anthony Chen, Wenzhao Zheng, Yida Wang, Xueyang Zhang, Kun Zhan, Peng Jia, Kurt Keutzer, Shangbang Zhang
Affiliations: Peking University; Li Auto Inc.; UC Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: code will be released at this https URL

Click to view abstract

Abstract:Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches exhibit deficiencies in maintaining robust 3D geometric consistency or accumulating artifacts during occlusion handling, both critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.
zh

[CV-34] Neural Face Skinning for Mesh-agnostic Facial Expression Cloning

【Quick Read】: This paper addresses the problem of accurately retargeting facial expressions to a face mesh while keeping the result controllable: existing deep-learning methods encode expressions into a global latent code but often miss fine-grained details in local regions. The key to the solution is to combine the strengths of global and local deformation models by localizing the influence of the global latent code, letting the model predict skinning weights for every vertex of the target face mesh so that detailed expressions can be cloned and intuitively controlled across meshes with different underlying structures. The global latent code is localized through indirect supervision from predefined segmentation labels, and the code itself is supervised with Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and editability.

Link: https://arxiv.org/abs/2505.22416
Authors: Sihun Cha, Serin Yoon, Kwanggyoon Seo, Junyong Noh
Affiliations: Visual Media Lab, KAIST, Republic of Korea; Flawless AI, United States of America
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurately retargeting facial expressions to a face mesh while enabling manipulation is a key challenge in facial animation retargeting. Recent deep-learning methods address this by encoding facial expressions into a global latent code, but they often fail to capture fine-grained details in local regions. While some methods improve local accuracy by transferring deformations locally, this often complicates overall control of the facial expression. To address this, we propose a method that combines the strengths of both global and local deformation models. Our approach enables intuitive control and detailed expression cloning across diverse face meshes, regardless of their underlying structures. The core idea is to localize the influence of the global latent code on the target mesh. Our model learns to predict skinning weights for each vertex of the target face mesh through indirect supervision from predefined segmentation labels. These predicted weights localize the global latent code, enabling precise and region-specific deformations even for meshes with unseen shapes. We supervise the latent code using Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and allow straightforward editing of the generated animation. Through extensive experiments, we demonstrate improved performance over state-of-the-art methods in terms of expression fidelity, deformation transfer accuracy, and adaptability across diverse mesh structures.
zh

[CV-35] Frugal Incremental Generative Modeling using Variational Autoencoders

【Quick Read】: This paper targets catastrophic forgetting in continual learning, and in particular the scalability problem caused by the ever-growing data volume of replay-based methods. The key to the solution is a replay-free incremental learning model based on Variational Autoencoders (VAEs) that performs incremental generative modeling on a carefully designed multi-modal latent space and introduces an orthogonality criterion to mitigate catastrophic forgetting in the learned VAEs. The method considers static and dynamic VAE variants with no (or at most a controlled) growth in the number of parameters, keeping memory consumption low while maintaining strong performance.

Link: https://arxiv.org/abs/2505.22408
Authors: Victor Enescu, Hichem Sahbi
Affiliations: Sorbonne University; CNRS; LIP6
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Continual or incremental learning holds tremendous potential in deep learning with different challenges including catastrophic forgetting. The advent of powerful foundation and generative models has propelled this paradigm even further, making it one of the most viable solution to train these models. However, one of the persisting issues lies in the increasing volume of data particularly with replay-based methods. This growth introduces challenges with scalability since continuously expanding data becomes increasingly demanding as the number of tasks grows. In this paper, we attenuate this issue by devising a novel replay-free incremental learning model based on Variational Autoencoders (VAEs). The main contribution of this work includes (i) a novel incremental generative modelling, built upon a well designed multi-modal latent space, and also (ii) an orthogonality criterion that mitigates catastrophic forgetting of the learned VAEs. The proposed method considers two variants of these VAEs: static and dynamic with no (or at most a controlled) growth in the number of parameters. Extensive experiments show that our method is (at least) an order of magnitude more "memory-frugal" compared to the closely related works while achieving SOTA accuracy scores.
zh
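As a rough illustration of an orthogonality criterion between incremental VAEs, the sketch below penalizes the inner products between a frozen latent projection from earlier tasks and the trainable projection of the current task. Where exactly such a penalty attaches inside the VAE, and the matrix shapes, are assumptions for illustration only.

```python
import torch

def orthogonality_penalty(W_old, W_new):
    # W_old: (k_old, d) frozen latent directions from earlier tasks;
    # W_new: (k_new, d) trainable directions for the current task.
    gram = W_new @ W_old.t()        # cross-task inner products
    return (gram ** 2).mean()       # push cross-terms toward zero

# usage (names assumed): loss = elbo + lam * orthogonality_penalty(W_old, W_new)
```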

[CV-36] Self-Reflective Reinforcement Learning for Diffusion-based Image Reasoning Generation

【Quick Read】: This paper addresses the lack of image reasoning ability in image generation, especially for logic-centered generation tasks. The key to the solution is SRRL, a self-reflective reinforcement learning algorithm that performs reflection and iteration across generation trajectories to achieve reasoning-driven generation of logical images. SRRL treats the entire denoising trajectory as a Chain-of-Thought (CoT) step and introduces a condition-guided forward process that enables reflective iteration between CoT steps, thereby strengthening logical reasoning in image generation.

Link: https://arxiv.org/abs/2505.22407
Authors: Jiadong Pan, Zhiyuan Ma, Kaiyan Zhang, Ning Ding, Bowen Zhou
Affiliations: Tsinghua University; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion models have recently demonstrated exceptional performance in image generation task. However, existing image generation methods still significantly suffer from the dilemma of image reasoning, especially in logic-centered image generation tasks. Inspired by the success of Chain of Thought (CoT) and Reinforcement Learning (RL) in LLMs, we propose SRRL, a self-reflective RL algorithm for diffusion models to achieve reasoning generation of logical images by performing reflection and iteration across generation trajectories. The intermediate samples in the denoising process carry noise, making accurate reward evaluation difficult. To address this challenge, SRRL treats the entire denoising trajectory as a CoT step with multi-round reflective denoising process and introduces condition guided forward process, which allows for reflective iteration between CoT steps. Through SRRL-based iterative diffusion training, we introduce image reasoning through CoT into generation tasks adhering to physical laws and unconventional physical phenomena for the first time. Notably, experimental results of case study exhibit that the superior performance of our SRRL algorithm even compared with GPT-4o. The project page is this https URL.
zh

[CV-37] STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering

【Quick Read】: This paper addresses the representation entanglement caused by spatio-temporal incoherence during initialization in dynamic scene reconstruction: existing 3D Gaussian Splatting (3DGS)-based methods build canonical Gaussians by aggregating observations without temporal distinction, coupling spatial and temporal patterns and making dynamic motion hard to model accurately. The key to the solution is the STDR (Spatio-Temporal Decoupling for Real-time rendering) module, which learns a spatio-temporal probability distribution for each Gaussian and jointly disentangles spatial and temporal features through a spatio-temporal mask, a separated deformation field, and a consistency regularization.

Link: https://arxiv.org/abs/2505.22400
Authors: Zehao Li, Hao Jiang, Yujun Cai, Jianing Chen, Baolong Bi, Shuqin Gao, Honglong Zhao, Yiwei Wang, Tianlu Mao, Zhaoqi Wang
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences (ICT); University of Chinese Academy of Sciences (UCAS); The University of Queensland; University of California, Merced
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Although dynamic scene reconstruction has long been a fundamental challenge in 3D vision, the recent emergence of 3D Gaussian Splatting (3DGS) offers a promising direction by enabling high-quality, real-time rendering through explicit Gaussian primitives. However, existing 3DGS-based methods for dynamic reconstruction often suffer from spatio-temporal incoherence during initialization, where canonical Gaussians are constructed by aggregating observations from multiple frames without temporal distinction. This results in spatio-temporally entangled representations, making it difficult to model dynamic motion accurately. To overcome this limitation, we propose STDR (Spatio-Temporal Decoupling for Real-time rendering), a plug-and-play module that learns spatio-temporal probability distributions for each Gaussian. STDR introduces a spatio-temporal mask, a separated deformation field, and a consistency regularization to jointly disentangle spatial and temporal patterns. Extensive experiments demonstrate that incorporating our module into existing 3DGS-based dynamic scene reconstruction frameworks leads to notable improvements in both reconstruction quality and spatio-temporal consistency across synthetic and real-world benchmarks.
zh

[CV-38] Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLM s

【Quick Read】: This paper aims to reduce the hallucinations (context omission, conflation, and misinterpretation) that Multi-modal Large Language Models (MLLMs) exhibit on multi-image tasks due to cross-modal misalignment. Existing methods based on Direct Preference Optimization (DPO) constrain optimization to a single image reference within the input sequence and neglect holistic context modeling. The key to the proposed Context-to-Cue Direct Preference Optimization (CcDPO) is a multi-level preference optimization framework that strengthens per-image perception in multi-image settings: context-level optimization attends to the global sequence while needle-level optimization focuses on fine-grained local details, effectively reducing hallucinations and improving performance.

Link: https://arxiv.org/abs/2505.22396
Authors: Xudong Li, Mengdan Zhang, Peixian Chen, Xiawu Zheng, Yan Zhang, Jingyuan Zheng, Yunhang Shen, Ke Li, Chaoyou Fu, Xing Sun, Rongrong Ji
Affiliations: Xiamen University; Tencent Youtu Lab; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues – from sequential context to local details. It features: (i) Context-Level Optimization : Re-evaluates cognitive biases underlying MLLMs’ multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization : Directs attention to fine-grained visual details through region-targeted visual prompts and multimodal preference supervision. To support scalable optimization, we also construct MultiScope-42k, an automatically generated dataset with high-quality multi-level preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks.
zh

[CV-39] PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

【Quick Read】: This paper addresses physically-based rendering (PBR) texture generation from an untextured 3D mesh, a text description, and an optional image prompt: early methods generate views sequentially, leading to long inference times and globally inconsistent textures, while existing multi-view methods improve global consistency at the cost of per-view resolution. The key to the solution is view packing, which formulates the arrangement of multi-view maps as a 2D rectangle bin-packing problem, significantly increasing the effective resolution of each view during multi-view generation at no extra inference cost while remaining fully compatible with current 2D generative models. Inference cost is further reduced by enabling fine-grained control and multi-domain generation within a next-scale-prediction autoregressive framework.

Link: https://arxiv.org/abs/2505.22394
Authors: Fan Fei, Jiajun Tang, Fei-Peng Tian, Boxin Shi, Ping Tan
Affiliations: Peking University; The Hong Kong University of Science and Technology; Light Illusions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 7 figures

Click to view abstract

Abstract:We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures from an untextured 3D mesh, a text description, and an optional image prompt. Early 2D generation-based texturing approaches generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures. More recent approaches adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation without imposing additional inference cost, by formulating the arrangement of multi-view maps as a 2D rectangle bin packing problem. In contrast to UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inference cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework to create an efficient multi-view multi-domain generative backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality of generated PBR textures and efficiency in training and inference.
zh
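To illustrate the 2D rectangle bin-packing formulation of view packing, here is a toy shelf-packing heuristic that lays multi-view rectangles onto a single canvas. PacTure's actual packing algorithm is not specified here; this is only a minimal baseline for intuition.

```python
def shelf_pack(views, bin_w, bin_h):
    # views: list of (width, height) rectangles; returns {view_index: (x, y)}.
    views = sorted(enumerate(views), key=lambda t: -t[1][1])  # tallest first
    x = y = shelf_h = 0
    placement = {}
    for idx, (w, h) in views:
        if x + w > bin_w:                 # current shelf full: start a new one
            x, y = 0, y + shelf_h
            shelf_h = 0
        if y + h > bin_h:
            raise ValueError("views do not fit on this canvas")
        placement[idx] = (x, y)
        x += w
        shelf_h = max(shelf_h, h)
    return placement

print(shelf_pack([(512, 512), (256, 512), (256, 256)], 1024, 1024))
# {0: (0, 0), 1: (512, 0), 2: (768, 0)}
```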

[CV-40] DAM: Domain-Aware Module for Multi-Domain Dataset Condensation

【Quick Read】: This paper addresses a shortcoming of conventional Dataset Condensation (DC): it overlooks the multi-domain nature of modern datasets, which are increasingly composed of heterogeneous images from multiple domains, limiting the generalization of condensed data in cross-domain and cross-architecture scenarios. The key to the solution is the Domain-Aware Module (DAM), which embeds domain-related features into each synthetic image via learnable spatial masks during training, improving condensed-data performance in both single-domain and multi-domain settings. DAM is only active during the condensation process, so the images per class (IPC) remain identical to prior methods.

Link: https://arxiv.org/abs/2505.22387
Authors: Jaehyun Choi, Gyojin Han, Dong-Jae Lee, Sunghyun Baek, Junmo Kim
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Dataset Condensation (DC) has emerged as a promising solution to mitigate the computational and storage burdens associated with training deep learning models. However, existing DC methods largely overlook the multi-domain nature of modern datasets, which are increasingly composed of heterogeneous images spanning multiple domains. In this paper, we extend DC and introduce Multi-Domain Dataset Condensation (MDDC), which aims to condense data that generalizes across both single-domain and multi-domain settings. To this end, we propose the Domain-Aware Module (DAM), a training-time module that embeds domain-related features into each synthetic image via learnable spatial masks. As explicit domain labels are mostly unavailable in real-world datasets, we employ frequency-based pseudo-domain labeling, which leverages low-frequency amplitude statistics. DAM is only active during the condensation process, thus preserving the same images per class (IPC) with prior methods. Experiments show that DAM consistently improves in-domain, out-of-domain, and cross-architecture performance over baseline dataset condensation methods.
zh
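Frequency-based pseudo-domain labeling can be sketched in a few lines: compute a low-frequency amplitude statistic per image with the FFT and cluster it. The statistic (mean amplitude in a centered window), the window radius, and the use of k-means are assumptions about one plausible realization, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def low_freq_amplitude(img, radius=8):
    # Mean FFT amplitude in a centered low-frequency window (grayscale input).
    amp = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    cy, cx = amp.shape[0] // 2, amp.shape[1] // 2
    return amp[cy - radius:cy + radius, cx - radius:cx + radius].mean()

def pseudo_domain_labels(gray_images, n_domains=3):
    # Cluster images by their low-frequency statistic into pseudo-domains.
    feats = np.array([[low_freq_amplitude(im)] for im in gray_images])
    return KMeans(n_clusters=n_domains, n_init=10).fit_predict(feats)
```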

[CV-41] Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion

【Quick Read】: This paper addresses the difficulty current subject-driven text-to-image methods have in separating identity-relevant information from identity-irrelevant details, which leads to overfitting or failure to preserve subject identity. The key to the solution is a new framework consisting of an Implicit-Explicit foreground-background Decoupling Module (IEDM) and a Mixture-of-Experts (MoE)-based Feature Fusion Module (FFM): it combines implicit feature-level decoupling with explicit image-level foreground-background separation and dynamically fuses identity-related and identity-irrelevant features, improving the quality and text alignment of generated images.

Link: https://arxiv.org/abs/2505.22360
Authors: Kewen Chen, Xiaobin Hu, Wenqi Ren
Affiliations: School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University; Technische Universität München
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent advances in large-scale text-to-image generation models have led to a surge in subject-driven text-to-image generation, which aims to produce customized images that align with textual descriptions while preserving the identity of specific subjects. Despite significant progress, current methods struggle to disentangle identity-relevant information from identity-irrelevant details in the input images, resulting in overfitting or failure to maintain subject identity. In this work, we propose a novel framework that improves the separation of identity-related and identity-unrelated features and introduces an innovative feature fusion mechanism to improve the quality and text alignment of generated images. Our framework consists of two key components: an Implicit-Explicit foreground-background Decoupling Module (IEDM) and a Feature Fusion Module (FFM) based on a Mixture of Experts (MoE). IEDM combines learnable adapters for implicit decoupling at the feature level with inpainting techniques for explicit foreground-background separation at the image level. FFM dynamically integrates identity-irrelevant features with identity-related features, enabling refined feature representations even in cases of incomplete decoupling. In addition, we introduce three complementary loss functions to guide the decoupling process. Extensive experiments demonstrate the effectiveness of our proposed method in enhancing image generation quality, improving flexibility in scene adaptation, and increasing the diversity of generated outputs across various textual descriptions.
zh

[CV-42] VME: A Satellite Imagery Dataset and Benchmark for Detecting Vehicles in the Middle East and Beyond

【Quick Read】: This paper addresses the poor cross-region performance of vehicle detection models in satellite imagery: existing datasets are geographically biased and under-represent regions such as the Middle East. The key to the solution is the VME dataset, built specifically for vehicle detection in high-resolution satellite images of Middle Eastern countries, together with CDSI, the largest benchmark dataset for Car Detection in Satellite Imagery, which combines images from multiple sources, improving detection accuracy in the Middle East as well as global detection performance.

Link: https://arxiv.org/abs/2505.22353
Authors: Noora Al-Emadi, Ingmar Weber, Yin Yang, Ferda Ofli
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Detecting vehicles in satellite images is crucial for traffic management, urban planning, and disaster response. However, current models struggle with real-world diversity, particularly across different regions. This challenge is amplified by geographic bias in existing datasets, which often focus on specific areas and overlook regions like the Middle East. To address this gap, we present the Vehicles in the Middle East (VME) dataset, designed explicitly for vehicle detection in high-resolution satellite images from Middle Eastern countries. Sourced from Maxar, the VME dataset spans 54 cities across 12 countries, comprising over 4,000 image tiles and more than 100,000 vehicles, annotated using both manual and semi-automated methods. Additionally, we introduce the largest benchmark dataset for Car Detection in Satellite Imagery (CDSI), combining images from multiple sources to enhance global car detection. Our experiments demonstrate that models trained on existing datasets perform poorly on Middle Eastern images, while the VME dataset significantly improves detection accuracy in this region. Moreover, state-of-the-art models trained on CDSI achieve substantial improvements in global car detection.
zh

[CV-43] ask-Driven Implicit Representations for Automated Design of LiDAR Systems

【Quick Read】: This paper addresses the complexity, long turnaround, and largely manual nature of LiDAR system design, which is further complicated by the unique spatial and temporal sampling requirements of mobile devices, autonomous vehicles, and aerial imaging platforms. The key to the solution is a task-driven framework for automated LiDAR system design: LiDAR configurations are represented in a continuous six-dimensional design space, task-specific implicit densities in this space are learned via flow-based generative modeling, and sensors are then modeled as parametric distributions in the 6D space and fitted to the learned implicit density with expectation-maximization, enabling efficient, constraint-aware LiDAR system design.

Link: https://arxiv.org/abs/2505.22344
Authors: Nikhil Behari, Aaron Young, Akshat Dave, Ramesh Raskar
Affiliations: Massachusetts Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.
zh

[CV-44] Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training

【Quick Read】: This paper targets the high cost of machine learning training caused by its reliance on large datasets and models, and in particular the inefficiency of the standard practice of repeatedly and uniformly sampling the training set. The key to the solution is Progressive Data Dropout, a new training paradigm that combines ideas from hard-data mining and dropout; without changing the model architecture or optimizer, it reduces the number of effective epochs to as little as 12.4% of the baseline while improving accuracy by up to 4.82%.

Link: https://arxiv.org/abs/2505.22342
Authors: Shriram M S, Xinyue Hao, Shihao Hou, Yang Lu, Laura Sevilla-Lara, Anurag Arnab, Shreyank N Gowda
Affiliations: University of Manchester; University of Edinburgh; Xiamen University; University of Nottingham
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, simple enough to implement and use that can become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. This savings actually do not come at any cost for accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: this https URL
zh
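A minimal sketch of a progressive-data-dropout-style training loop, assuming per-sample losses are tracked and the kept fraction shrinks over epochs; the paper's exact schedule and selection rule may differ.

```python
import numpy as np

def progressive_dropout_epoch(per_sample_loss, losses, keep_frac, rng):
    # Keep only the hardest keep_frac of samples (by last-seen loss),
    # train on them, and refresh their recorded losses.
    order = np.argsort(losses)[::-1]                   # hardest first
    kept = order[: max(1, int(keep_frac * len(order)))]
    for i in rng.permutation(kept):
        losses[i] = per_sample_loss(i)   # hypothetical train-step callback
    return losses

# schedule sketch: keep_frac decays over epochs, e.g. 1.0, 0.7, 0.5, 0.35, ...
```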

[CV-45] Learning to Infer Parameterized Representations of Plants from 3D Scans

【Quick Read】: This paper addresses the problem of faithfully reconstructing the 3D architecture of plants from unstructured observations, a task made computationally difficult by the complex spatial networks of branching organs, with self-occlusion and spatial proximity between organs. The key to the solution is a unified approach that, given a 3D scan of a plant, infers a parameterized plant representation describing the branching structure and containing per-organ parametric information, which can then be used directly in a variety of tasks. In this data-driven approach, a recursive neural network is trained on virtual plants generated with an L-systems-based procedural model; after training, the network infers a parametric tree-like representation from an input 3D point cloud.

Link: https://arxiv.org/abs/2505.22337
Authors: Samara Ghrer, Christophe Godin, Stefanie Wuhrer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Reconstructing faithfully the 3D architecture of plants from unstructured observations is a challenging task. Plants frequently contain numerous organs, organized in branching systems in more or less complex spatial networks, leading to specific computational issues due to self-occlusion or spatial proximity between organs. Existing works either consider inverse modeling where the aim is to recover the procedural rules that allow to simulate virtual plants, or focus on specific tasks such as segmentation or skeletonization. We propose a unified approach that, given a 3D scan of a plant, allows to infer a parameterized representation of the plant. This representation describes the plant’s branching structure, contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using an L-systems-based procedural model. After training, the network allows to infer a parametric tree-like representation based on an input 3D point cloud. Our method is applicable to any plant that can be represented as binary axial tree. We evaluate our approach on Chenopodium Album plants, using experiments on synthetic plants to show that our unified framework allows for different tasks including reconstruction, segmentation and skeletonization, while achieving results on-par with state-of-the-art for each task.
zh

[CV-46] UP-SLAM: Adaptively Structured Gaussian SLAM with Uncertainty Prediction in Dynamic Environments

【Quick Read】: This paper addresses tracking and mapping for real-time RGB-D SLAM in dynamic environments, where the sequential optimization framework of conventional 3D Gaussian Splatting (3DGS) methods and their sensitivity to dynamic objects limit real-time performance and robustness. The key to the solution is the UP-SLAM system, which decouples tracking and mapping through a parallelized framework, manages Gaussian primitives adaptively with a probabilistic octree, handles open-set dynamic objects without semantic labels via a training-free uncertainty estimator, and improves rendering quality and the robustness of uncertainty prediction with a temporal encoder and a shallow multilayer perceptron.

Link: https://arxiv.org/abs/2505.22335
Authors: Wancai Zheng, Linlin Ou, Jiajie He, Libo Zhou, Xinyi Yu, Yan Wei
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent 3D Gaussian Splatting (3DGS) techniques for Visual Simultaneous Localization and Mapping (SLAM) have significantly progressed in tracking and high-fidelity mapping. However, their sequential optimization framework and sensitivity to dynamic objects limit real-time performance and robustness in real-world scenarios. We present UP-SLAM, a real-time RGB-D SLAM system for dynamic environments that decouples tracking and mapping through a parallelized framework. A probabilistic octree is employed to manage Gaussian primitives adaptively, enabling efficient initialization and pruning without hand-crafted thresholds. To robustly filter dynamic regions during tracking, we propose a training-free uncertainty estimator that fuses multi-modal residuals to estimate per-pixel motion uncertainty, achieving open-set dynamic object handling without reliance on semantic labels. Furthermore, a temporal encoder is designed to enhance rendering quality. Concurrently, low-dimensional features are efficiently transformed via a shallow multilayer perceptron to construct DINO features, which are then employed to enrich the Gaussian field and improve the robustness of uncertainty prediction. Extensive experiments on multiple challenging datasets suggest that UP-SLAM outperforms state-of-the-art methods in both localization accuracy (by 59.8%) and rendering quality (by 4.57 dB PSNR), while maintaining real-time performance and producing reusable, artifact-free static maps in dynamic environments. Project: this https URL
zh

[CV-47] From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

【Quick Read】: This paper studies why unlearning in large models is vulnerable to relearning attacks, where knowledge believed to be unlearned re-emerges during fine-tuning. The key finding is that forget-set accuracy can be largely recovered by fine-tuning on the retain set alone, and that weight-space properties, specifically the L2 distance and linear mode connectivity between the original and the unlearned model, predict a model's resistance to relearning attacks. Building on this insight, the authors propose a new class of methods that achieves state-of-the-art resistance to relearning attacks.

Link: https://arxiv.org/abs/2505.22310
Authors: Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou
Affiliations: University of Cambridge; The Alan Turing Institute; Mila; Google DeepMind
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set – i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, the L2-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.
zh
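The weight-space statistic is easy to reproduce. The helper below computes the L2 distance between two models' flattened parameters; evaluating the loss at several interpolation points between the two weight vectors would probe linear mode connectivity in the same spirit.

```python
import torch

def weight_space_l2(model_a, model_b):
    # L2 distance between two models' flattened parameter vectors.
    sq = 0.0
    for pa, pb in zip(model_a.parameters(), model_b.parameters()):
        sq += (pa.detach() - pb.detach()).pow(2).sum().item()
    return sq ** 0.5
```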

[CV-48] IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth

【Quick Read】: This paper addresses how to evaluate vision-language models for video object recognition when ground truth is unavailable. The key to the solution is the IKIWISI tool, which turns model outputs into a binary heatmap and leverages humans' innate pattern recognition to assess model reliability, introducing "spy objects" to detect hallucinations on nonexistent objects, thereby enabling visual evaluation of model behavior and a cognitive audit.

Link: https://arxiv.org/abs/2505.22305
Authors: Md Touhidul Islam, Imran Kabir, Md Alimoor Reza, Syed Masum Billah
Affiliations: Pennsylvania State University; Drake University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at DIS'25 (Funchal, Portugal)

Click to view abstract

Abstract:We present IKIWISI ("I Know It When I See It"), an interactive visual pattern generator for assessing vision-language models in video object recognition when ground truth is unavailable. IKIWISI transforms model outputs into a binary heatmap where green cells indicate object presence and red cells indicate object absence. This visualization leverages humans' innate pattern recognition abilities to evaluate model reliability. IKIWISI introduces "spy objects": adversarial instances users know are absent, to discern models hallucinating on nonexistent items. The tool functions as a cognitive audit mechanism, surfacing mismatches between human and machine perception by visualizing where models diverge from human understanding. Our study with 15 participants found that users considered IKIWISI easy to use, made assessments that correlated with objective metrics when available, and reached informed conclusions by examining only a small fraction of heatmap cells. This approach not only complements traditional evaluation methods through visual assessment of model behavior with custom object sets, but also reveals opportunities for improving alignment between human perception and machine understanding in vision-language systems.
zh
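A hedged sketch of the heatmap construction: objects (plus known-absent "spy" objects) as rows, frames as columns, binary presence calls as cells. The data layout (a frame-to-detected-set mapping) is an assumption about the interface, not IKIWISI's actual code.

```python
import numpy as np

def build_heatmap(predictions, objects, spy_objects, frames):
    # predictions: {frame_id: set of object names the model says are present}
    rows = list(objects) + list(spy_objects)
    grid = np.zeros((len(rows), len(frames)), dtype=int)
    for r, obj in enumerate(rows):
        for c, f in enumerate(frames):
            grid[r, c] = int(obj in predictions.get(f, set()))
    # Any 1 in a spy row flags a hallucination on a nonexistent object.
    spy_hallucinations = int(grid[len(objects):].sum())
    return grid, spy_hallucinations
```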

[CV-49] CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction ACL2025

【Quick Read】: This paper addresses the time designers spend reviewing and refining 3D object prototypes against reference images in computer-aided design (CAD) workflows. Even advanced multimodal large language models (MLLMs) struggle to recognize multiple geometric components and perform spatial geometric operations within CAD programs, leading to inaccurate reviews. The key to the solution is the CAD program repairer (ReCAD) framework, which effectively detects program errors and provides feedback that helps correct them, improving the accuracy and efficiency of CAD review.

Link: https://arxiv.org/abs/2505.22304
Authors: Jiali Chen, Xusen Hei, HongFei Liu, Yuancheng Wei, Zikun Deng, Jiayuan Xie, Yi Cai, Li Qing
Affiliations: School of Software Engineering, South China University of Technology; Key Laboratory of Big Data and Intelligent Robot, Ministry of Education; The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL 2025 main conference

Click to view abstract

Abstract:Computer-aided design (CAD) is crucial in prototyping 3D objects through geometric instructions (i.e., CAD programs). In practical design workflows, designers often engage in time-consuming reviews and refinements of these prototypes by comparing them with reference images. To bridge this gap, we introduce the CAD review task to automatically detect and correct potential errors, ensuring consistency between the constructed 3D objects and reference images. However, recent advanced multimodal large language models (MLLMs) struggle to recognize multiple geometric components and perform spatial geometric operations within the CAD program, leading to inaccurate reviews. In this paper, we propose the CAD program repairer (ReCAD) framework to effectively detect program errors and provide helpful feedback on error correction. Additionally, we create a dataset, CADReview, consisting of over 20K program-image pairs, with diverse errors for the CAD review task. Extensive experiments demonstrate that our ReCAD significantly outperforms existing MLLMs, which shows great potential in design applications.
zh

[CV-50] Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data

【Quick Read】: This paper aims to restore fading defects in early visual artworks (particularly color photographs) caused by aging and improper storage, focusing on the automatic removal of greening defects in digitized autochrome photographs. The key to the solution is synthetic dataset generation combined with generative AI and a carefully designed loss function for visual art restoration: to compensate for the lack of suitable training datasets, a novel approach accurately simulates greening defects in synthetic data, and a modified weighted loss function is proposed to handle the color imbalance between defected and non-defected areas.

Link: https://arxiv.org/abs/2505.22291
Authors: Saptarshi Neil Sinha, P. Julius Kuehn, Johannes Koppe, Arjan Kuijper, Michael Weinmann
Affiliations: Fraunhofer Institute for Graphics and Image Processing; Technische Universität Darmstadt; Delft University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The preservation of early visual arts, particularly color photographs, is challenged by deterioration caused by aging and improper storage, leading to issues like blurring, scratches, color bleeding, and fading defects. In this paper, we present the first approach for the automatic removal of greening color defects in digitized autochrome photographs. Our main contributions include a method based on synthetic dataset generation and the use of generative AI with a carefully designed loss function for the restoration of visual arts. To address the lack of suitable training datasets for analyzing greening defects in damaged autochromes, we introduce a novel approach for accurately simulating such defects in synthetic data. We also propose a modified weighted loss function for the ChaIR method to account for color imbalances between defected and non-defected areas. While existing methods struggle with accurately reproducing original colors and may require significant manual effort, our method allows for efficient restoration with reduced time requirements.
zh

[CV-51] From Controlled Scenarios to Real-World: Cross-Domain Degradation Pattern Matching for All-in-One Image Restoration

【Quick Read】: This paper addresses the performance drop of All-in-One Image Restoration (AiOIR) in real-world scenarios, rooted in the distribution gap between training data (source domain) and real test data (target domain), which weakens degradation awareness. The key to the solution is the Unified Domain-Adaptive Image Restoration (UDAIR) framework, which transfers knowledge from the source to the target domain: a codebook learns discrete embeddings that denote degradation patterns, a cross-sample contrastive learning mechanism captures shared features across samples of the same degradation, and a domain adaptation strategy together with a correlation alignment-based test-time adaptation mechanism dynamically aligns codebook embeddings and tightens degradation embeddings to their source-domain cluster centers.

Link: https://arxiv.org/abs/2505.22284
Authors: Junyu Fan, Chuanlin Liao, Yi Lin
Affiliations: Sichuan University; The National Key Laboratory of Fundamental Science on Synthetic Vision
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:As a fundamental imaging task, All-in-One Image Restoration (AiOIR) aims to achieve image restoration caused by multiple degradation patterns via a single model with unified parameters. Although existing AiOIR approaches obtain promising performance in closed and controlled scenarios, they still suffered from considerable performance reduction in real-world scenarios since the gap of data distributions between the training samples (source domain) and real-world test samples (target domain) can lead inferior degradation awareness ability. To address this issue, a Unified Domain-Adaptive Image Restoration (UDAIR) framework is proposed to effectively achieve AiOIR by leveraging the learned knowledge from source domain to target domain. To improve the degradation identification, a codebook is designed to learn a group of discrete embeddings to denote the degradation patterns, and the cross-sample contrastive learning mechanism is further proposed to capture shared features from different samples of certain degradation. To bridge the data gap, a domain adaptation strategy is proposed to build the feature projection between the source and target domains by dynamically aligning their codebook embeddings, and a correlation alignment-based test-time adaptation mechanism is designed to fine-tune the alignment discrepancies by tightening the degradation embeddings to the corresponding cluster center in the source domain. Experimental results on 10 open-source datasets demonstrate that UDAIR achieves new state-of-the-art performance for the AiOIR task. Most importantly, the feature cluster validate the degradation identification under unknown conditions, and qualitative comparisons showcase robust generalization to real-world scenarios.
zh

[CV-52] Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss

【Quick Read】: This paper addresses the sharp drop in reconstruction quality of novel view synthesis under sparse-view conditions, where insufficient geometric cues produce blurred details and structural artifacts. The key to the solution is Hierarchical Depth-Guided Splatting (HDGS), a depth-supervision framework that progressively refines geometry from coarse to fine. Its core is a Cascade Pearson Correlation Loss (CPCL) that aligns rendered and estimated monocular depths across multiple spatial scales, enforcing multi-scale depth consistency and substantially improving structural fidelity in sparse-view scenarios.

Link: https://arxiv.org/abs/2505.22279
Authors: Wenjun Lu, Haodong Chen, Anqi Yi, Yuk Ying Chung, Zhiyong Wang, Kun Hu
Affiliations: The University of Sydney; Edith Cowan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Novel view synthesis is a fundamental task in 3D computer vision that aims to reconstruct realistic images from a set of posed input views. However, reconstruction quality degrades significantly under sparse-view conditions due to limited geometric cues. Existing methods, such as Neural Radiance Fields (NeRF) and the more recent 3D Gaussian Splatting (3DGS), often suffer from blurred details and structural artifacts when trained with insufficient views. Recent works have identified the quality of rendered depth as a key factor in mitigating these artifacts, as it directly affects geometric accuracy and view consistency. In this paper, we address these challenges by introducing Hierarchical Depth-Guided Splatting (HDGS), a depth supervision framework that progressively refines geometry from coarse to fine levels. Central to HDGS is a novel Cascade Pearson Correlation Loss (CPCL), which aligns rendered and estimated monocular depths across multiple spatial scales. By enforcing multi-scale depth consistency, our method substantially improves structural fidelity in sparse-view scenarios. Extensive experiments on the LLFF and DTU benchmarks demonstrate that HDGS achieves state-of-the-art performance under sparse-view settings while maintaining efficient and high-quality rendering
zh
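The multi-scale Pearson alignment can be sketched directly in PyTorch. The scale set, the use of average pooling, and the equal weighting below are illustrative assumptions rather than the paper's exact CPCL configuration.

```python
import torch
import torch.nn.functional as F

def pearson_loss(pred, target, eps=1e-8):
    # 1 - Pearson correlation between flattened per-image depth maps.
    p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
    corr = (p * t).sum(1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return (1 - corr).mean()

def cascade_pearson_loss(rendered_depth, mono_depth, scales=(1, 2, 4)):
    # Apply the Pearson loss at several pooled resolutions (coarse to fine).
    loss = 0.0
    for s in scales:
        r = F.avg_pool2d(rendered_depth, s) if s > 1 else rendered_depth
        m = F.avg_pool2d(mono_depth, s) if s > 1 else mono_depth
        loss = loss + pearson_loss(r, m)
    return loss / len(scales)
```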

[CV-53] Domain Adaptation of Attention Heads for Zero-shot Anomaly Detection

【速读】:该论文旨在解决零样本异常检测(Zero-shot Anomaly Detection, ZSAD)中存在的领域适应性不足问题,即现有方法要么未考虑通用主干模型到异常检测领域的领域适应,要么仅对部分模型组件进行有限的适应。其解决方案的关键在于提出HeadCLIP,通过有效适应文本和图像编码器到目标领域,利用文本编码器中的可学习提示来泛化正常与异常概念,并在图像编码器中引入可学习头权重以动态调整每个注意力头的特征,同时通过联合异常得分最大化领域适应效果,从而提升像素级和图像级异常检测性能。

链接: https://arxiv.org/abs/2505.22259
作者: Kiyoon Jeong,Jaehyuk Heo,Junyeong Son,Pilsung Kang
机构: Korea University (韩国科学技术院); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot anomaly detection (ZSAD) in images is an approach that can detect anomalies without access to normal samples, which can be beneficial in various realistic scenarios where model training is not possible. However, existing ZSAD research has shown limitations by either not considering domain adaptation of general-purpose backbone models to anomaly detection domains or by implementing only partial adaptation to some model components. In this paper, we propose HeadCLIP to overcome these limitations by effectively adapting both text and image encoders to the domain. HeadCLIP generalizes the concepts of normality and abnormality through learnable prompts in the text encoder, and introduces learnable head weights to the image encoder to dynamically adjust the features held by each attention head according to domain characteristics. Additionally, we maximize the effect of domain adaptation by introducing a joint anomaly score that utilizes domain-adapted pixel-level information for image-level anomaly detection. Experimental results using multiple real datasets in both industrial and medical domains show that HeadCLIP outperforms existing ZSAD techniques at both pixel and image levels. In the industrial domain, improvements of up to 4.9%p in pixel-level mean anomaly detection score (mAD) and up to 3.0%p in image-level mAD were achieved, with similar improvements (3.2%p, 3.1%p) in the medical domain.
zh

[CV-54] LiDAR Based Semantic Perception for Forklifts in Outdoor Environments

【Quick Read】: This paper addresses the challenge of semantic segmentation for autonomous forklifts in complex outdoor environments, a prerequisite for safe and efficient autonomous navigation. The key to the solution is a dual LiDAR system that fuses a forward-facing and a downward-angled LiDAR sensor to improve detection and segmentation of dynamic and static obstacles, using high-resolution 3D point clouds for efficient semantic segmentation of safety-critical instance classes (pedestrians, vehicles, forklifts) and environmental classes (driveable ground, lanes, buildings).

Link: https://arxiv.org/abs/2505.22258
Authors: Benjamin Serfling, Hannes Reichert, Lorenzo Bayerlein, Konrad Doll, Kati Radkhah-Lens
Affiliations: University of Applied Sciences Aschaffenburg; Linde Material Handling GmbH
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this study, we present a novel LiDAR-based semantic segmentation framework tailored for autonomous forklifts operating in complex outdoor environments. Central to our approach is the integration of a dual LiDAR system, which combines forward-facing and downward-angled LiDAR sensors to enable comprehensive scene understanding, specifically tailored for industrial material handling tasks. The dual configuration improves the detection and segmentation of dynamic and static obstacles with high spatial precision. Using high-resolution 3D point clouds captured from two sensors, our method employs a lightweight yet robust approach that segments the point clouds into safety-critical instance classes such as pedestrians, vehicles, and forklifts, as well as environmental classes such as driveable ground, lanes, and buildings. Experimental validation demonstrates that our approach achieves high segmentation accuracy while satisfying strict runtime requirements, establishing its viability for safety-aware, fully autonomous forklift navigation in dynamic warehouse and yard environments.
zh

[CV-55] YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction

【Quick Read】: This paper addresses the twin challenges of inefficient manual analysis and insufficient segmentation accuracy in complex underwater scenes for coral reef ecological monitoring. The key to the solution is an intelligent framework centered on a multimodal large language model (MLLM) following an "object detection - semantic segmentation - prior input" pipeline: the object detection module generates spatial prior boxes for coral instances that guide pixel-level segmentation, and the segmentation masks together with fine-tuned classification instructions serve as prior inputs, achieving accurate genus-level classification and extraction of core ecological metrics.

Link: https://arxiv.org/abs/2505.22250
Authors: Mingzhuang Wang, Yvyang Li, Xiyang Zhang, Fei Tan, Qi Shi, Guotao Zhang, Siqi Chen, Yufei Liu, Lei Lei, Ming Zhou, Qiang Lin, Hongqiang Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Coral reefs, crucial for sustaining marine biodiversity and ecological processes (e.g., nutrient cycling, habitat provision), face escalating threats, underscoring the need for efficient monitoring. Coral reef ecological monitoring faces dual challenges of low efficiency in manual analysis and insufficient segmentation accuracy in complex underwater scenarios. This study develops the YH-OSI system, establishing an intelligent framework centered on the Multimodal Large Model (MLLM) for “object detection-semantic segmentation-prior input”. The system uses the object detection module (mAP@0.5=0.78) to generate spatial prior boxes for coral instances, driving the segment module to complete pixel-level segmentation in low-light and densely occluded scenarios. The segmentation masks and finetuned classification instructions are fed into the Qwen2-VL-based multimodal model as prior inputs, achieving a genus-level classification accuracy of 88% and simultaneously extracting core ecological metrics. Meanwhile, the system retains the scalability of the multimodal model through standardized interfaces, laying a foundation for future integration into multimodal agent-based underwater robots and supporting the full-process automation of “image acquisition-prior generation-real-time analysis.”
zh

[CV-56] StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

【Quick Read】: This paper addresses the lack of a lasting environment state in diffusion-based world models, which causes visual consistency to collapse after only a few steps in complex environments (a long-term memory deficiency). The key to the solution is StateSpaceDiffuser, which integrates the sequence representation of a state-space model (Mamba) into the diffusion model so that it can capture the full interaction history, restoring long-term memory without sacrificing the diffusion model's high-fidelity synthesis.

Link: https://arxiv.org/abs/2505.22246
Authors: Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool
Affiliations: INSAIT; Sofia University "St. Kliment Ohridski"; ETH Zurich; TU Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on a short sequence of observations causes them to quickly lose track of context. As a result, visual consistency breaks down after just a few steps, and generated scenes no longer reflect information seen earlier. This limitation of the state-of-the-art diffusion-based world models comes from their lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform on long-context tasks by integrating a sequence representation from a state-space model (Mamba), representing the entire interaction history. This design restores long-term memory without sacrificing the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model’s ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory.
zh

[CV-57] Enjoying Information Dividend: Gaze Track-based Medical Weakly Supervised Segmentation MICCAI2025

【Quick Read】: This paper addresses the difficulty weakly supervised semantic segmentation (WSSS) in medical imaging has in effectively exploiting sparse annotations. The key to the solution is to fully exploit physicians' gaze tracks, including fixation points, durations, and temporal order: the proposed GradTrack framework combines gaze track map generation with a track attention mechanism, enabling progressive feature refinement under multi-level gaze supervision during decoding and improving segmentation performance.

Link: https://arxiv.org/abs/2505.22230
Authors: Zhisong Wang, Yiwen Ye, Ziyang Chen, Yong Xia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, MICCAI 2025 (Early Accept)

Click to view abstract

Abstract:Weakly supervised semantic segmentation (WSSS) in medical imaging struggles with effectively using sparse annotations. One promising direction for WSSS leverages gaze annotations, captured via eye trackers that record regions of interest during diagnostic procedures. However, existing gaze-based methods, such as GazeMedSeg, do not fully exploit the rich information embedded in gaze data. In this paper, we propose GradTrack, a framework that utilizes physicians’ gaze track, including fixation points, durations, and temporal order, to enhance WSSS performance. GradTrack comprises two key components: Gaze Track Map Generation and Track Attention, which collaboratively enable progressive feature refinement through multi-level gaze supervision during the decoding process. Experiments on the Kvasir-SEG and NCI-ISBI datasets demonstrate that GradTrack consistently outperforms existing gaze-based methods, achieving Dice score improvements of 3.21% and 2.61%, respectively. Moreover, GradTrack significantly narrows the performance gap with fully supervised models such as nnUNet.

[CV-58] GoMatching: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking

[Quick Read]: This paper addresses the fact that existing video text spotting (VTS) methods still trail image text spotting (ITS) in performance; the core challenge is the limited recognition capability of current video text spotters, which remains unsatisfactory even after extensive end-to-end training. The proposed solution is GoMatching++, whose key idea is to freeze a pre-trained image text spotter and introduce a lightweight trainable tracker, turning the image text spotter into a video specialist that is highly parameter- and data-efficient.

Link: https://arxiv.org/abs/2505.22228
Authors: Haibin He, Jing Zhang, Maoyuan Ye, Juhua Liu, Bo Du, Dacheng Tao
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2401.07080

Abstract:Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter’s ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at this https URL.

[CV-59] Hadaptive-Net: Efficient Vision Models via Adaptive Cross-Hadamard Synergy

[Quick Read]: This paper tackles the limitations of standard convolution in cross-channel interaction and channel expansion, aiming to improve network representational capacity and dimensional compression. The key to its solution is a computationally efficient module, Adaptive Cross-Hadamard (ACH), which uses adaptive cross-channel Hadamard products for high-dimensional channel expansion, boosting performance while keeping the model lightweight. Building on this module, the paper constructs the Hadamard Adaptive Network (Hadaptive-Net), which strikes an excellent balance between inference speed and accuracy on vision tasks.

Link: https://arxiv.org/abs/2505.22226
Authors: Xuyang Zhang, Xi Zhang, Liang Chen, Hao Shi, Qingshan Guo
Institutions: Beijing Institute of Technology; Chongqing Innovation Center, Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:Recent studies have revealed the immense potential of the Hadamard product in enhancing network representational capacity and dimensional compression. However, despite its theoretical promise, this technique has not been systematically explored or effectively applied in practice, leaving its full capabilities underdeveloped. In this work, we first analyze and identify the advantages of the Hadamard product over standard convolutional operations in cross-channel interaction and channel expansion. Building upon these insights, we propose a computationally efficient module: Adaptive Cross-Hadamard (ACH), which leverages adaptive cross-channel Hadamard products for high-dimensional channel expansion. Furthermore, we introduce Hadaptive-Net (Hadamard Adaptive Network), a lightweight network backbone for visual tasks, which experiments demonstrate achieves an unprecedented balance between inference speed and accuracy thanks to our proposed module.
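
To show the shape of a cross-channel Hadamard expansion, here is a hedged PyTorch sketch: channel pairs are multiplied elementwise and appended to the input under a learned gate. The fixed random pairing and sigmoid gating are assumptions, not the ACH design from the paper.

```python
import torch
import torch.nn as nn

class CrossHadamardExpand(nn.Module):
    """Expand C channels to C + n_pairs channels by multiplying pairs of
    feature maps elementwise; a learned sigmoid gate weighs each product."""
    def __init__(self, channels, n_pairs):
        super().__init__()
        idx = torch.randint(0, channels, (2, n_pairs))   # fixed random pairing
        self.register_buffer("idx", idx)
        self.gate = nn.Parameter(torch.zeros(n_pairs))   # adaptive weights

    def forward(self, x):                                # x: (B, C, H, W)
        a, b = x[:, self.idx[0]], x[:, self.idx[1]]      # (B, n_pairs, H, W)
        prod = torch.sigmoid(self.gate)[None, :, None, None] * (a * b)
        return torch.cat([x, prod], dim=1)               # (B, C+n_pairs, H, W)
```

The appeal relative to a 1x1 convolution is that each new channel is a multiplicative, not linear, mixture of existing ones, at negligible parameter cost.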

[CV-60] A Survey on Training-free Open-Vocabulary Semantic Segmentation

[Quick Read]: This survey concerns open-vocabulary semantic segmentation, which asks models to classify categories never seen during training, whereas traditional approaches depend on large amounts of costly fine-grained annotation. The key is to adopt training-free methods that repurpose existing multi-modal classification models (such as CLIP) for segmentation, avoiding the dependence on large-scale annotated data and heavy compute.

Link: https://arxiv.org/abs/2505.22209
Authors: Naomi Kombol, Ivan Martinović, Siniša Šegvić
Institutions: Sveučilište u Zagrebu, Fakultet elektrotehnike i računarstva (University of Zagreb, Faculty of Electrical Engineering and Computing)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic segmentation is one of the most fundamental tasks in image understanding with a long history of research, and subsequently a myriad of different approaches. Traditional methods strive to train models up from scratch, requiring vast amounts of computational resources and training data. In the advent of moving to open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, large quantities of finely annotated data would be prohibitively expensive. Researchers have instead turned to training-free methods where they leverage existing models made for tasks where data is more easily acquired. Specifically, this survey will cover the history, nuance, idea development and the state-of-the-art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We will first give a preliminary on the task definition followed by an overview of popular model archetypes and then spotlight over 30 approaches split into broader research branches: purely CLIP-based, those leveraging auxiliary visual foundation models and ones relying on generative methods. Subsequently, we will discuss the limitations and potential problems of current research, as well as provide some underexplored ideas for future study. We believe this survey will serve as a good onboarding read to new researchers and spark increased interest in the area.

[CV-61] Investigating Mechanisms for In-Context Vision Language Binding CVPR

[Quick Read]: This paper investigates the mechanism behind cross-modal association in Vision-Language Models (VLMs), i.e., how objects in an image are bound to their textual descriptions. The key finding concerns the Binding ID mechanism: the model assigns a shared, distinct Binding ID to an object's image tokens and its textual references, enabling in-context cross-modal association.

Link: https://arxiv.org/abs/2505.22200
Authors: Darshana Saravanan, Makarand Tapaswi, Vineet Gandhi
Institutions: CVIT, IIIT Hyderabad, India
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to MIV at CVPRW 2025 (Oral)

Abstract:To understand a prompt, Vision-Language models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an ‘image of a red toy car’, the model should associate this image to phrases like ‘car’, ‘red toy’, ‘red object’, etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that the entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object’s image tokens and its textual references, enabling in-context association.

[CV-62] S2AFormer: Strip Self-Attention for Efficient Vision Transformer

[Quick Read]: This paper targets the efficiency bottleneck of the Vision Transformer (ViT): its computation grows quadratically with the number of tokens, limiting practical use. Although prior methods combine convolution and self-attention for better trade-offs, the expensive pairwise token affinities and heavy matrix operations in self-attention remain a bottleneck. The proposed solution is S2AFormer, whose key component is a novel Strip Self-Attention (SSA) that reduces the spatial dimensions of K and V and compresses the channel dimensions of Q and K, cutting computational overhead substantially while preserving accuracy and striking an optimal balance between efficiency and effectiveness.

Link: https://arxiv.org/abs/2505.22195
Authors: Guoan Xu, Wenfeng Huang, Wenjing Jia, Jiamao Li, Guangwei Gao, Guo-Jun Qi
Institutions: University of Technology Sydney; State Key Laboratory of Transducer Technology; Shanghai Institute of Microsystem and Information Technology; Nanjing University of Posts and Telecommunications; Key Laboratory of Artificial Intelligence; Research Center for Industries of the Future; School of Engineering; Westlake University; OPPO Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 8 tables

Abstract:Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer’s sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as the number of tokens increases limits its practical efficiency. Although recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs, the expensive pairwise token affinity and complex matrix operations inherent in self-attention remain a bottleneck. To address this challenge, we propose S2AFormer, an efficient Vision Transformer architecture featuring novel Strip Self-Attention (SSA). We design simple yet effective Hybrid Perception Blocks (HPBs) to effectively integrate the local perception capabilities of CNNs with the global context modeling of Transformer’s attention mechanisms. A key innovation of SSA lies in reducing the spatial dimensions of K and V while compressing the channel dimensions of Q and K. This design significantly reduces computational overhead while preserving accuracy, striking an optimal balance between efficiency and effectiveness. We evaluate the robustness and efficiency of S2AFormer through extensive experiments on multiple vision benchmarks, including ImageNet-1k for image classification, ADE20k for semantic segmentation, and COCO for object detection and instance segmentation. Results demonstrate that S2AFormer achieves significant accuracy gains with superior efficiency in both GPU and non-GPU environments, making it a strong candidate for efficient vision Transformers.
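
A minimal sketch of the asymmetry the abstract describes, assuming an average-pooling spatial reduction for K/V and a linear channel compression for Q/K; the pooling factor and dimensions are guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripSelfAttention(nn.Module):
    """Attention with spatially downsampled K/V and channel-compressed Q/K."""
    def __init__(self, dim, qk_dim, pool=4):
        super().__init__()
        self.q = nn.Linear(dim, qk_dim)       # channel-compressed queries
        self.k = nn.Linear(dim, qk_dim)       # channel-compressed keys
        self.v = nn.Linear(dim, dim)          # values keep full width
        self.pool = nn.AvgPool1d(pool)        # spatial reduction for K/V
        self.scale = qk_dim ** -0.5

    def forward(self, x):                     # x: (B, N, dim)
        q = self.q(x)                                       # (B, N, qk_dim)
        kv = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, N/pool, dim)
        k, v = self.k(kv), self.v(kv)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                       # (B, N, dim)
```

The attention matrix shrinks from N x N to N x (N/pool), and the QK dot products run in the compressed channel space, which is where the savings come from.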

[CV-63] Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers ICML2025

[Quick Read]: This paper addresses the difficulty of deploying diffusion transformers (DiT) for video generation on edge devices, owing to their large parameter counts and high computational complexity, and the poor generalization of existing image-generation quantization methods to video. The key is Q-VDiT, a quantization framework with two core components: from the quantization perspective, a Token-aware Quantization Estimator (TQE) that compensates quantization error in both the token and feature dimensions; and from the optimization perspective, Temporal Maintenance Distillation (TMD) that preserves inter-frame spatiotemporal correlation and optimizes each frame with respect to the overall video context.

Link: https://arxiv.org/abs/2505.22167
Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML2025

Abstract:Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9×. Code will be available at this https URL.
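
For readers unfamiliar with the W3A6 notation (3-bit weights, 6-bit activations), here is the generic fake-quantization primitive such schemes build on. This is background only, not the paper's TQE or TMD; the symmetric per-tensor scheme is an assumption.

```python
import torch

def fake_quant(x, n_bits):
    """Symmetric uniform fake quantization: quantize then dequantize.
    This is the basic operation any WxAy scheme builds on (not TQE itself)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax() / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

w = torch.randn(64, 64)
w_q = fake_quant(w, n_bits=3)            # 3-bit weights
act = torch.randn(1, 64)
act_q = fake_quant(act, n_bits=6)        # 6-bit activations
err = (w - w_q).pow(2).mean()            # the quantization error a TQE-style
                                         # estimator would try to compensate
```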

[CV-64] ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

[Quick Read]: This paper addresses the challenges Vision-Language-Action (VLA) models face on contact-rich tasks that demand fine-grained force control, especially under visual occlusion or dynamic uncertainty. The key is ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality in VLA systems, together with FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained vision-language embeddings with real-time 6-axis force feedback during action decoding, improving the robot's adaptation to subtle contact dynamics.

Link: https://arxiv.org/abs/2505.22159
Authors: Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, Cewu Lu, Wenqiang Zhang
Institutions: Fudan University; Shanghai Jiao Tong University; National University of Singapore; Shanghai University; Xi’an Jiaotong University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot’s ability to adapt to subtle contact dynamics. We also introduce ForceVLA-Data, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong π₀-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at this https URL.
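
A hedged sketch of what a force-aware MoE fusion can look like: the router and every expert see the vision-language feature concatenated with the 6-axis force/torque reading. The layout, expert count, and soft routing are assumptions, not the FVLMoE architecture itself.

```python
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    """Gate over experts conditioned on vision-language features plus a
    6-axis force/torque reading (illustrative layout, not the paper's)."""
    def __init__(self, vl_dim, n_experts=4, force_dim=6):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(vl_dim + force_dim, vl_dim) for _ in range(n_experts)])
        self.router = nn.Linear(vl_dim + force_dim, n_experts)

    def forward(self, vl_tokens, force):          # (B, D), (B, 6)
        z = torch.cat([vl_tokens, force], dim=-1)
        weights = torch.softmax(self.router(z), dim=-1)       # (B, E)
        outs = torch.stack([e(z) for e in self.experts], 1)   # (B, E, D)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)      # fused feature
```

Because the router input includes force, contact events can shift which experts dominate, which is the "context-aware routing" the abstract refers to.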

[CV-65] Learning A Robust RGB-Thermal Detector for Extreme Modality Imbalance

[Quick Read]: This paper tackles extreme modality imbalance in RGB-Thermal (RGB-T) object detection caused by modality degradation, which triggers out-of-distribution (OOD) issues in practice and disrupts training convergence. The key is a base-and-auxiliary detector architecture: a modality interaction module adaptively weighs modalities of different quality, and modality pseudo-degradation simulates real-world imbalance during training. The base detector, trained on high-quality pairs, supplies a consistency constraint for the auxiliary detector, which receives degraded samples, improving robustness and performance under severe modality degradation.

Link: https://arxiv.org/abs/2505.22154
Authors: Chao Tian, Chao Yang, Guoqing Zhu, Qiang Wang, Zhenyu He
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:RGB-Thermal (RGB-T) object detection utilizes thermal infrared (TIR) images to complement RGB data, improving robustness in challenging conditions. Traditional RGB-T detectors assume balanced training data, where both modalities contribute equally. However, in real-world scenarios, modality degradation, due to environmental factors or technical issues, can lead to extreme modality imbalance, causing out-of-distribution (OOD) issues during testing and disrupting model convergence during training. This paper addresses these challenges by proposing a novel base-and-auxiliary detector architecture. We introduce a modality interaction module to adaptively weigh modalities based on their quality and handle imbalanced samples effectively. Additionally, we leverage modality pseudo-degradation to simulate real-world imbalances in training data. The base detector, trained on high-quality pairs, provides a consistency constraint for the auxiliary detector, which receives degraded samples. This framework enhances model robustness, ensuring reliable performance even under severe modality degradation. Experimental results demonstrate the effectiveness of our method in handling extreme modality imbalances (decreasing the Missing Rate by 55%) and improving performance across various baseline detectors.

[CV-66] 3D Question Answering via only 2D Vision-Language Models ICML2025

[Quick Read]: This paper explores how to harness large vision-language models (LVLMs) for 3D scene understanding, taking 3D question answering (3D-QA) as the representative task. Because 3D training data are scarce, the LVLM is not trained but used in a zero-shot manner. The key is cdViews, a method that automatically selects critical and diverse 2D views for 3D-QA: its viewSelector prioritizes views with answer-specific information, while viewNMS removes redundant views based on spatial overlap.

Link: https://arxiv.org/abs/2505.22143
Authors: Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML2025

Abstract:Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Due to the limited training data in 3D, we do not train LVLMs but infer in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. When the 2D model is chosen, e.g., LLAVA-OV, the quality of sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector prioritizing critical views based on their potential to provide answer-specific information, and viewNMS enhancing diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (of the resource-intensive 3D LVLMs) for addressing 3D tasks.
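
The viewNMS step follows the familiar greedy non-maximum-suppression pattern; a minimal sketch is below. The overlap matrix and threshold are assumed inputs, since the paper's exact overlap measure is not reproduced here.

```python
def view_nms(views, scores, overlap, thresh=0.5):
    """Greedy non-maximum suppression over candidate 2D views: keep
    high-scoring views, drop any view overlapping a kept one too much.
    overlap[i][j] is a precomputed spatial-overlap ratio (assumed input)."""
    order = sorted(range(len(views)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(overlap[i][j] < thresh for j in kept):
            kept.append(i)
    return [views[i] for i in kept]

selected = view_nms(["v0", "v1", "v2"], scores=[0.9, 0.8, 0.3],
                    overlap=[[1, 0.7, 0.1], [0.7, 1, 0.2], [0.1, 0.2, 1]])
# -> ["v0", "v2"]: v1 is suppressed because it overlaps v0 above the threshold
```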

[CV-67] FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing

[Quick Read]: This paper addresses the missing facial attribute editing capability in audio-driven talking head generation, which limits personalization and real-world applicability. The key is FaceEditTalker, a unified framework with two components: an image feature space editing module that extracts semantic and detail features and allows flexible control over attributes such as expression, hairstyle, and accessories; and an audio-driven video generation module that fuses the edited features with audio-guided facial landmarks to drive a diffusion-based generator, ensuring temporal coherence, visual fidelity, and identity preservation across frames.

Link: https://arxiv.org/abs/2505.22141
Authors: Guanwen Feng, Zhiyuan Ma, Yunan Li, Junwei Jing, Jiahao Yang, Qiguang Miao
Institutions: Xidian University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in audio-driven talking head generation have achieved impressive results in lip synchronization and emotional expression. However, they largely overlook the crucial task of facial attribute editing. This capability is crucial for achieving deep personalization and expanding the range of practical applications, including user-tailored digital avatars, engaging online education content, and brand-specific digital customer service. In these key domains, the flexible adjustment of visual attributes-such as hairstyle, accessories, and subtle facial features is essential for aligning with user preferences, reflecting diverse brand identities, and adapting to varying contextual demands. In this paper, we present FaceEditTalker, a unified framework that enables controllable facial attribute manipulation while generating high-quality, audio-synchronized talking head videos. Our method consists of two key components: an image feature space editing module, which extracts semantic and detail features and allows flexible control over attributes like expression, hairstyle, and accessories; and an audio-driven video generation module, which fuses these edited features with audio-guided facial landmarks to drive a diffusion-based generator. This design ensures temporal coherence, visual fidelity, and identity preservation across frames. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art approaches in lip-sync accuracy, video quality, and attribute controllability. Project page: this https URL

[CV-68] What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?

[Quick Read]: This paper studies how text-to-image diffusion models such as Stable Diffusion can be adapted to 360-degree panorama generation despite the large domain gap between perspective and panoramic images, asking which mechanisms underlie the empirical success of fine-tuning. The key finding is that the query and key matrices in attention modules mainly carry information shared across domains, while the value and output weight matrices specialize in adapting pretrained knowledge to the panoramic domain and thus play the more critical role during fine-tuning. Based on this analysis, the authors propose a simple framework, UniPano, as a baseline for future research; it outperforms existing methods in both performance and efficiency.

Link: https://arxiv.org/abs/2505.22129
Authors: Jinhong Ni, Chang-Bin Zhang, Qiang Zhang, Jing Zhang
Institutions: Australian National University; The University of Hong Kong; Beijing Innovation Center of Humanoid Robotics; Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent prosperity of text-to-image diffusion models, e.g. Stable Diffusion, has stimulated research to adapt them to 360-degree panorama generation. Prior work has demonstrated the feasibility of using conventional low-rank adaptation techniques on pre-trained diffusion models to generate panoramic images. However, the substantial domain gap between perspective and panoramic images raises questions about the underlying mechanisms enabling this empirical success. We hypothesize and examine that the trainable counterparts exhibit distinct behaviors when fine-tuned on panoramic data, and such an adaptation conceals some intrinsic mechanism to leverage the prior knowledge within the pre-trained diffusion models. Our analysis reveals the following: 1) the query and key matrices in the attention modules are responsible for common information that can be shared between the panoramic and perspective domains, thus are less relevant to panorama generation; and 2) the value and output weight matrices specialize in adapting pre-trained knowledge to the panoramic domain, playing a more critical role during fine-tuning for panorama generation. We empirically verify these insights by introducing a simple framework called UniPano, with the objective of establishing an elegant baseline for future research. UniPano not only outperforms existing methods but also significantly reduces memory usage and training time compared to prior dual-branch approaches, making it scalable for end-to-end panorama generation with higher resolution. The code will be released.
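
The practical consequence of the analysis, selectively fine-tuning only value/output projections, is easy to express in code. A hedged sketch follows; the parameter-name matching assumes a diffusers-style attention naming scheme (to_q/to_k/to_v/to_out), which may differ from the actual codebase.

```python
def freeze_qk_train_vo(model):
    """Fine-tune only value/output projections in attention blocks, freezing
    query/key, per the paper's analysis. Name matching is an assumption."""
    for name, p in model.named_parameters():
        if any(tag in name for tag in ("to_v", "to_out")):
            p.requires_grad = True        # adapts knowledge to panoramas
        elif any(tag in name for tag in ("to_q", "to_k")):
            p.requires_grad = False       # shared cross-domain information
```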

[CV-69] Real-Time Blind Defocus Deblurring for Earth Observation: The IMAGIN-e Mission Approach

[Quick Read]: This paper addresses image blur caused by mechanical defocus in Earth observation imagery, specifically images acquired by the IMAGIN-e mission aboard the International Space Station (ISS). The key is a blind deblurring method suited to space-based edge computing constraints: it estimates the defocus kernel from Sentinel-2 data and trains a GAN-based restoration model on top of it, enabling effective restoration without reference images.

Link: https://arxiv.org/abs/2505.22128
Authors: Alejandro D. Mousist
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work addresses mechanical defocus in Earth observation images from the IMAGIN-e mission aboard the ISS, proposing a blind deblurring approach adapted to space-based edge computing constraints. Leveraging Sentinel-2 data, our method estimates the defocus kernel and trains a restoration model within a GAN framework, effectively operating without reference images. On Sentinel-2 images with synthetic degradation, SSIM improved by 72.47% and PSNR by 25.00%, confirming the model’s ability to recover lost details when the original clean image is known. On IMAGIN-e, where no reference images exist, perceptual quality metrics indicate a substantial enhancement, with NIQE improving by 60.66% and BRISQUE by 48.38%, validating real-world onboard restoration. The approach is currently deployed aboard the IMAGIN-e mission, demonstrating its practical application in an operational space environment. By efficiently handling high-resolution images under edge computing constraints, the method enables applications such as water body segmentation and contour detection while maintaining processing viability despite resource limitations.

[CV-70] SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model

[Quick Read]: This paper addresses the shortfall of AI models on scientific illustration generation, a task that requires accurately interpreting technical content and turning abstract ideas into clear, standardized visuals, and that is far more knowledge-intensive and labor-intensive than general image synthesis. The key is SridBench, the first benchmark for scientific figure generation: 1,120 high-quality instances curated from leading papers across 13 natural-science and computer-science disciplines, evaluated along six dimensions including semantic fidelity and structural accuracy, with the aim of driving more capable reasoning-based visual generation.

Link: https://arxiv.org/abs/2505.22126
Authors: Yifan Chang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Chuanhao Li, S. Kevin Zhou, Kaipeng Zhang
Institutions: University of Science and Technology of China; Shanghai Innovation Institute; Nankai University; Wuhan University; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models like GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and transformation of abstract ideas into clear, standardized visuals. This task is significantly more knowledge-intensive and laborious, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet, no benchmark currently exists to evaluate AI on this front. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected via human experts and MLLMs. Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models like GPT-4o-image lag behind human performance, with common issues in text/visual clarity and scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.

[CV-71] Autoregression-free video prediction using diffusion model for mitigating error propagation

[Quick Read]: This paper addresses the error propagation caused by the autoregressive mechanism in existing long-term video prediction methods, which degrades distant future frames. The key is the first AutoRegression-Free (ARFree) video prediction framework built on diffusion models, which predicts any future frame tuple directly from the context frame tuple and thus avoids cumulative error. ARFree has two core components: (1) a motion prediction module that predicts future motion from motion features extracted from the context frame tuple; and (2) a training method that improves motion continuity and contextual consistency between adjacent future frame tuples.

Link: https://arxiv.org/abs/2505.22111
Authors: Woonho Ko, Jin Bok Park, Il Yong Chun
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, 2 tables

Abstract:Existing long-term video prediction methods often rely on an autoregressive video prediction mechanism. However, this approach suffers from error propagation, particularly in distant future frames. To address this limitation, this paper proposes the first AutoRegression-Free (ARFree) video prediction framework using diffusion models. Different from an autoregressive video prediction mechanism, ARFree directly predicts any future frame tuples from the context frame tuple. The proposed ARFree consists of two key components: 1) a motion prediction module that predicts a future motion using motion feature extracted from the context frame tuple; 2) a training method that improves motion continuity and contextual consistency between adjacent future frame tuples. Our experiments with two benchmark datasets show that the proposed ARFree video prediction framework outperforms several state-of-the-art video prediction methods.
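
The contrast between the two prediction regimes is easiest to see side by side. In this hedged sketch, `model(context, tuple_index)` is an assumed interface standing in for the diffusion sampler; only the data flow, not the model, is the point.

```python
import torch

def arfree_predict(model, context, n_tuples):
    """ARFree pattern: every future frame tuple is predicted directly from
    the same context tuple, so errors cannot accumulate across predictions."""
    return [model(context, torch.tensor([k])) for k in range(n_tuples)]

def autoregressive_predict(model, context, n_tuples):
    """Baseline pattern: each prediction is fed back in as the next context,
    so any error propagates into all later frames."""
    outputs = []
    for _ in range(n_tuples):
        nxt = model(context, torch.tensor([0]))
        outputs.append(nxt)
        context = nxt                 # predicted frames become the context
    return outputs
```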

[CV-72] Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation

[Quick Read]: This paper addresses power transmission corridor hazard segmentation (PTCHS), i.e., separating transmission equipment and surrounding hazards from complex backgrounds to keep power transmission safe. The key is ELE-SAM, which adapts SAM via a Context-Aware Prompt Adapter that fuses global and local features while attending to key regions, and a High-Fidelity Mask Decoder that exploits multi-granularity mask features at higher resolution to better segment fine-structured hazard objects.

Link: https://arxiv.org/abs/2505.22105
Authors: Hang Chen, Maoyuan Ye, Peng Yang, Haibin He, Juhua Liu, Bo Du
Institutions: Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex background, conveying great significance to maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenario, especially those with fine structure. In this paper, we propose ELE-SAM, adapting SAM for the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter to achieve better prompt tokens via incorporating global-local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex background, we design a High-Fidelity Mask Decoder by leveraging multi-granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale and real-world dataset for PTCHS including 44,094 image-mask pairs. Experimental results on ELE-40K demonstrate the superior performance of ELE-SAM, which outperforms the baseline model by an average of 16.8% mIoU and 20.6% mBIoU. Moreover, compared with the state-of-the-art method on HQSeg-44K, average absolute improvements of 2.9% mIoU and 3.8% mBIoU further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at this https URL.

[CV-73] On the Transferability and Discriminability of Repersentation Learning in Unsupervised Domain Adaptation

[Quick Read]: This paper addresses the performance ceiling of unsupervised domain adaptation (UDA) when relying only on distribution alignment and source-domain empirical risk minimization; existing adversarial frameworks neglect the discriminability of target-domain features, yielding suboptimal results. The key is to define "good representation learning" as guaranteeing both transferability and discriminability, prove that an additional loss term targeting target-domain discriminability is necessary, and propose a novel adversarial framework, RLGLC, that explicitly combines the domain-alignment objective with a discriminability-enhancing constraint, realized through AR-WWD and a local consistency mechanism.

Link: https://arxiv.org/abs/2505.22099
Authors: Wenwen Qiang, Ziyin Gu, Lingyu Si, Jiangmeng Li, Changwen Zheng, Fuchun Sun, Hui Xiong
Institutions: National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Artificial Intelligence Thrust, Information Hub, Department of Computer Science & Engineering, School of Engineering, Hong Kong University of Science and Technology, Guangzhou, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we addressed the limitation of relying solely on distribution alignment and source-domain empirical risk minimization in Unsupervised Domain Adaptation (UDA). Our information-theoretic analysis showed that this standard adversarial-based framework neglects the discriminability of target-domain features, leading to suboptimal performance. To bridge this theoretical-practical gap, we defined “good representation learning” as guaranteeing both transferability and discriminability, and proved that an additional loss term targeting target-domain discriminability is necessary. Building on these insights, we proposed a novel adversarial-based UDA framework that explicitly integrates a domain alignment objective with a discriminability-enhancing constraint. Instantiated as Domain-Invariant Representation Learning with Global and Local Consistency (RLGLC), our method leverages Asymmetrically-Relaxed Wasserstein of Wasserstein Distance (AR-WWD) to address class imbalance and semantic dimension weighting, and employs a local consistency mechanism to preserve fine-grained target-domain discriminative information. Extensive experiments across multiple benchmark datasets demonstrate that RLGLC consistently surpasses state-of-the-art methods, confirming the value of our theoretical perspective and underscoring the necessity of enforcing both transferability and discriminability in adversarial-based UDA.

[CV-74] UAVPairs: A Challenging Benchmark for Match Pair Retrieval of Large-scale UAV Images

[Quick Read]: This paper targets the performance bottlenecks of match pair retrieval for large-scale UAV images, in particular the limited robustness of traditional methods in complex-texture scenes and the prohibitive training cost. The key contributions are the UAVPairs benchmark dataset of 21,622 high-resolution images across 30 scenes; a batched nontrivial sample mining strategy that exploits geometric similarity and the multi-scene structure to cut global hard-negative mining costs; and a ranked list loss that optimizes the global similarity structure, improving the discrimination of image retrieval models.

Link: https://arxiv.org/abs/2505.22098
Authors: Junhuan Liu, San Jiang, Wei Ge, Wei Huang, Bingxuan Guo, Qingquan Li
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The primary contribution of this paper is a challenging benchmark dataset, UAVPairs, and a training pipeline designed for match pair retrieval of large-scale UAV images. First, the UAVPairs dataset, comprising 21,622 high-resolution images across 30 diverse scenes, is constructed; the 3D points and tracks generated by SfM-based 3D reconstruction are employed to define the geometric similarity of image pairs, ensuring genuinely matchable image pairs are used for training. Second, to solve the problem of expensive mining cost for global hard negative mining, a batched nontrivial sample mining strategy is proposed, leveraging the geometric similarity and multi-scene structure of the UAVPairs to generate training samples as to accelerate training. Third, recognizing the limitation of pair-based losses, the ranked list loss is designed to improve the discrimination of image retrieval models, which optimizes the global similarity structure constructed from the positive set and negative set. Finally, the effectiveness of the UAVPairs dataset and training pipeline is validated through comprehensive experiments on three distinct large-scale UAV datasets. The experiment results demonstrate that models trained with the UAVPairs dataset and the ranked list loss achieve significantly improved retrieval accuracy compared to models trained on existing datasets or with conventional losses. Furthermore, these improvements translate to enhanced view graph connectivity and higher quality of reconstructed 3D models. The models trained by the proposed approach perform more robustly compared with hand-crafted global features, particularly in challenging repetitively textured scenes and weakly textured scenes. For match pair retrieval of large-scale UAV images, the trained image retrieval models offer an effective solution. The dataset would be made publicly available at this https URL.
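
For context, a simplified ranked list loss can be written in a few lines: positives are pulled inside a boundary, negatives pushed beyond a margin, so the whole list's similarity structure is shaped rather than one pair at a time. The margins below are illustrative, and this omits the weighting terms of the full formulation.

```python
import torch
import torch.nn.functional as F

def ranked_list_loss(anchor, pos, neg, margin=1.2, boundary=0.4):
    """Simplified ranked list loss: pull all positives inside
    (margin - boundary), push all negatives beyond margin.
    anchor: (D,); pos: (P, D); neg: (N, D) embedding tensors."""
    d_pos = F.pairwise_distance(anchor.expand_as(pos), pos)
    d_neg = F.pairwise_distance(anchor.expand_as(neg), neg)
    loss_p = F.relu(d_pos - (margin - boundary)).mean()   # positive set term
    loss_n = F.relu(margin - d_neg).mean()                # negative set term
    return loss_p + loss_n

loss = ranked_list_loss(torch.randn(128), torch.randn(5, 128),
                        torch.randn(20, 128))
```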

[CV-75] Fast Feature Matching of UAV Images via Matrix Band Reduction-based GPU Data Schedule

[Quick Read]: This paper targets the high time cost of feature matching in structure from motion (SfM). The key is a GPU data-schedule algorithm that partitions the dataset into compact image blocks via matrix band reduction (MBR) and performs efficient feature matching with GPU-accelerated cascade hashing, reducing redundant data I/O and making full use of GPU compute.

Link: https://arxiv.org/abs/2505.22089
Authors: San Jiang, Kan You, Wanshou Jiang, Qingquan Li
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Feature matching dominates the time cost in structure from motion (SfM). The primary contribution of this study is a GPU data schedule algorithm for efficient feature matching of Unmanned aerial vehicle (UAV) images. The core idea is to divide the whole dataset into blocks based on the matrix band reduction (MBR) and achieve efficient feature matching via GPU-accelerated cascade hashing. First, match pairs are selected by using an image retrieval technique, which converts images into global descriptors and searches high-dimension nearest neighbors with graph indexing. Second, compact image blocks are iteratively generated from a MBR-based data schedule strategy, which exploits image connections to avoid redundant data IO (input/output) burden and increases the usage of GPU computing power. Third, guided by the generated image blocks, feature matching is executed sequentially within the framework of GPU-accelerated cascade hashing, and initial candidate matches are refined by combining a local geometric constraint and RANSAC-based global verification. For further performance improvement, these two steps are designed to execute in parallel on the GPU and CPU. Finally, the performance of the proposed solution is evaluated by using large-scale UAV datasets. The results demonstrate that it increases the efficiency of feature matching with speedup ratios ranging from 77.0 to 100.0 compared with KD-Tree based matching methods, and achieves comparable accuracy in relative and absolute bundle adjustment (BA). The proposed algorithm is an efficient solution for feature matching of UAV images.
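
To illustrate the blocking idea, the sketch below uses reverse Cuthill-McKee, the classic bandwidth-reduction algorithm, as a stand-in for the paper's MBR procedure: match pairs are packed near the diagonal of the adjacency matrix, and the resulting ordering is cut into blocks. The paper's exact reduction and block-scheduling logic are not reproduced.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def mbr_blocks(pair_list, n_images, block_size):
    """Reorder images so matched pairs sit near the matrix diagonal, then cut
    the ordering into compact blocks suitable for GPU batch matching."""
    rows, cols = zip(*pair_list)
    adj = csr_matrix((np.ones(len(pair_list)), (rows, cols)),
                     shape=(n_images, n_images))
    sym = (adj + adj.T).tocsr()                       # symmetrize the graph
    order = reverse_cuthill_mckee(sym, symmetric_mode=True)
    return [order[i:i + block_size] for i in range(0, n_images, block_size)]

blocks = mbr_blocks([(0, 3), (1, 2), (2, 3)], n_images=4, block_size=2)
```

Because connected images end up in the same block, each block's descriptors can stay resident on the GPU while its pairs are matched, which is where the I/O savings come from.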

[CV-76] MObyGaze: a film dataset of multimodal objectification densely annotated by experts

[Quick Read]: This paper addresses how to characterize and quantify gender-representation disparities in audiovisual storytelling, revealing how objectification is perpetuated on screen. The key is a new AI task: characterizing and quantifying the multimodal (visual, speech, audio) temporal patterns that produce objectification in films, together with the Multimodal Objectifying Gaze (MObyGaze) dataset of densely expert-annotated segments from 20 films (43 hours of video) with fine-grained localization and categorization. The work also studies how best to learn from label diversity among a small number of annotators and benchmarks recent vision, text, and audio models, demonstrating the feasibility of the task.

Link: https://arxiv.org/abs/2505.22084
Authors: Julie Tores, Elisa Ancarani, Lucile Sassatelli, Hui-Yin Wu, Clement Bergman, Lea Andolfi, Victor Ecrement, Remy Sun, Frederic Precioso, Thierry Devars, Magali Guaresi, Virginie Julliard, Sarah Lecossais
Institutions: Université Côte d’Azur, CNRS, I3S, France; Université Côte d’Azur, CNRS, Inria, I3S, France; Institut Universitaire de France; Université Côte d’Azur, Inria, France; Université Côte d’Azur, CNRS, BCL, France; Sorbonne Université, GRIPIC; Université Sorbonne Paris Nord, LabSIC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Characterizing and quantifying gender representation disparities in audiovisual storytelling contents is necessary to grasp how stereotypes may perpetuate on screen. In this article, we consider the high-level construct of objectification and introduce a new AI task to the ML community: characterize and quantify complex multimodal (visual, speech, audio) temporal patterns producing objectification in films. Building on film studies and psychology, we define the construct of objectification in a structured thesaurus involving 5 sub-constructs manifesting through 11 concepts spanning 3 modalities. We introduce the Multimodal Objectifying Gaze (MObyGaze) dataset, made of 20 movies annotated densely by experts for objectification levels and concepts over freely delimited segments: it amounts to 6072 segments over 43 hours of video with fine-grained localization and categorization. We formulate different learning tasks, propose and investigate best ways to learn from the diversity of labels among a low number of annotators, and benchmark recent vision, text and audio models, showing the feasibility of the task. We make our code and our dataset available to the community and described in the Croissant format: this https URL.

[CV-77] Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis CVPR2025

[Quick Read]: This paper addresses the challenges of applying general-domain vision-language pretraining models such as CLIP to medical data, in particular handling negation and the inherent imbalance of medical datasets. The key is to introduce clinically-enhanced dynamic soft labels and medical graphical alignment to improve clinical language understanding, plus negation-based hard negatives to deepen the model's grasp of the complexities of clinical language. The approach integrates seamlessly into medical CLIP training pipelines and achieves state-of-the-art performance across multiple tasks.

Link: https://arxiv.org/abs/2505.22079
Authors: Hanbin Ko, Chang-Min Park
Institutions: Seoul National University Graduate School; Seoul National University Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages (8 main, 2 references, 6 appendix), 13 figures. Accepted to CVPR 2025. This author-accepted manuscript includes an expanded ethics/data user agreement section. The final version will appear in the Proceedings of CVPR 2025

Abstract:The development of large-scale image-text pair datasets has significantly advanced self-supervised learning in Vision-Language Processing (VLP). However, directly applying general-domain architectures such as CLIP to medical data presents challenges, particularly in handling negations and addressing the inherent data imbalance of medical datasets. To address these issues, we propose a novel approach that integrates clinically-enhanced dynamic soft labels and medical graphical alignment, thereby improving clinical comprehension and the applicability of contrastive loss in medical contexts. Furthermore, we introduce negation-based hard negatives to deepen the model’s understanding of the complexities of clinical language. Our approach is easily integrated into the medical CLIP training pipeline and achieves state-of-the-art performance across multiple tasks, including zero-shot, fine-tuned classification, and report retrieval. To comprehensively evaluate our model’s capacity for understanding clinical language, we introduce CXR-Align, a benchmark uniquely designed to evaluate the understanding of negation and clinical information within chest X-ray (CXR) datasets. Experimental results demonstrate that our proposed methods are straightforward to implement and generalize effectively across contrastive learning frameworks, enhancing medical VLP capabilities and advancing clinical language understanding in medical imaging.

[CV-78] From Failures to Fixes: LLM -Driven Scenario Repair for Self-Evolving Autonomous Driving

[Quick Read]: This paper addresses the limited gains of existing scenario generation and selection methods for autonomous driving, which lack adaptivity and semantic relevance when confronting complex, safety-critical scenarios. The key is SERA, an LLM-powered framework that analyzes performance logs to identify failure patterns, dynamically retrieves semantically aligned scenarios from a structured scenario bank, refines the recommendations with an LLM-based reflection mechanism, and uses the selected scenarios for few-shot fine-tuning, improving robustness and generalization with minimal data.

Link: https://arxiv.org/abs/2505.22067
Authors: Xinyu Xia, Xingjun Ma, Yunfeng Hu, Ting Qu, Hong Chen, Xun Gong
Institutions: Jilin University; Fudan University; Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.

[CV-79] AquaMonitor: A multimodal multi-view image sequence dataset for real-life aquatic invertebrate biodiversity monitoring

[Quick Read]: This paper addresses the difficulty of automating the identification of aquatic invertebrates for environmental monitoring, where real-world deployment faces open-set recognition, distribution shift, and extreme class imbalance. The key is the AquaMonitor dataset, the first large computer-vision dataset of aquatic invertebrates collected under a standardized monitoring protocol, comprising 2.7M images, DNA sequences for 1,358 specimens, and dry mass and size measurements for 1,494 specimens, providing a realistic and unbiased benchmark for automated identification methods.

Link: https://arxiv.org/abs/2505.22065
Authors: Mikko Impiö, Philipp M. Rehsen, Tiina Laamanen, Arne J. Beermann, Florian Leese, Jenni Raitoharju
Institutions: Finnish Environment Institute; University of Duisburg-Essen; Centre for Water and Environmental Research (ZWU); University of Jyväskylä
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents the AquaMonitor dataset, the first large computer vision dataset of aquatic invertebrates collected during routine environmental monitoring. While several large species identification datasets exist, they are rarely collected using standardized collection protocols, and none focus on aquatic invertebrates, which are particularly laborious to collect. For AquaMonitor, we imaged all specimens from two years of monitoring whenever imaging was possible given practical limitations. The dataset enables the evaluation of automated identification methods for real-life monitoring purposes using a realistically challenging and unbiased setup. The dataset has 2.7M images from 43,189 specimens, DNA sequences for 1358 specimens, and dry mass and size measurements for 1494 specimens, making it also one of the largest biological multi-view and multimodal datasets to date. We define three benchmark tasks and provide strong baselines for these: 1) Monitoring benchmark, reflecting real-life deployment challenges such as open-set recognition, distribution shift, and extreme class imbalance, 2) Classification benchmark, which follows a standard fine-grained visual categorization setup, and 3) Few-shot benchmark, which targets classes with only few training examples from very fine-grained categories. Advancements on the Monitoring benchmark can directly translate to improvement of aquatic biodiversity monitoring, which is an important component of regular legislative water quality assessment in many countries.

[CV-80] LatentMove: Towards Complex Human Movement Video Generation

[Quick Read]: This paper tackles the naturalness and consistency of complex, non-repetitive human motion in image-to-video (I2V) generation, where fast, intricate movements often produce unnatural deformations. The key is LatentMove, a DiT-based framework with a conditional control branch and learnable face/body tokens that preserve consistency and fine-grained details across frames. The authors also build the Complex-Human-Videos (CHV) dataset of challenging human motions and two metrics measuring the flow and silhouette consistency of generated videos against ground truth, improving the robustness and quality of I2V systems.

Link: https://arxiv.org/abs/2505.22046
Authors: Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Farid Boussaid, Aref Miri Rekavandi, Zinuo Li, Qiuhong Ke, Hamid Laga
Institutions: The University of Western Australia; Technical University of Munich; Monash University; Murdoch University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages

Abstract:Image-to-video (I2V) generation seeks to produce realistic motion sequences from a single reference image. Although recent methods exhibit strong temporal consistency, they often struggle when dealing with complex, non-repetitive human movements, leading to unnatural deformations. To tackle this issue, we present LatentMove, a DiT-based framework specifically tailored for highly dynamic human animation. Our architecture incorporates a conditional control branch and learnable face/body tokens to preserve consistency as well as fine-grained details across frames. We introduce Complex-Human-Videos (CHV), a dataset featuring diverse, challenging human motions designed to benchmark the robustness of I2V systems. We also introduce two metrics to assess the flow and silhouette consistency of generated videos with their ground truth. Experimental results indicate that LatentMove substantially improves human animation quality, particularly when handling rapid, intricate movements, thereby pushing the boundaries of I2V generation. The code, the CHV dataset, and the evaluation metrics will be available at this https URL.

[CV-81] Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning INTERSPEECH2025

[Quick Read]: This paper addresses the inability of visual-guided audio captioning systems to handle audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. The key is an entropy-aware gated fusion framework that dynamically modulates the visual information flow through cross-modal uncertainty quantification: attention-entropy analysis in the cross-attention layers automatically identifies and suppresses misleading visual cues during fusion. A batch-wise audiovisual shuffling technique further synthesizes mismatched training pairs, improving robustness to alignment noise.

Link: https://arxiv.org/abs/2505.22045
Authors: Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu
Institutions: Unknown
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by INTERSPEECH 2025

Abstract:Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system’s superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.
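
The entropy-gating intuition is simple to sketch: diffuse (high-entropy) cross-attention suggests the visual stream is uninformative or mismatched, so its contribution is scaled down. The linear gating form below is an assumption; only the entropy computation is standard.

```python
import torch

def entropy_gate(attn, max_gate=1.0):
    """Gate visual evidence by cross-attention entropy.
    attn: softmaxed attention weights, shape (B, heads, T_audio, T_visual)."""
    p = attn.clamp_min(1e-8)
    ent = -(p * p.log()).sum(-1).mean(dim=(1, 2))   # mean entropy per sample
    norm = ent / torch.log(torch.tensor(float(attn.size(-1))))  # -> [0, 1]
    return max_gate * (1.0 - norm)                  # high entropy => low gate

attn = torch.softmax(torch.randn(2, 8, 10, 49), dim=-1)
g = entropy_gate(attn)       # scale visual features by g[:, None, None]
```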

[CV-82] OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning

[Quick Read]: This paper addresses the challenge of generating detailed, domain-knowledge-aware analyses in industrial anomaly detection. The key is OmniAD, a framework that unifies anomaly detection and understanding for fine-grained analysis through multimodal reasoning: visual reasoning performs threshold-free anomaly detection via text generation using Text-as-Mask Encoding, while visual-guided textual reasoning integrates visual perception for comprehensive analysis. Training combines supervised fine-tuning with reinforcement learning (GRPO) and three reward functions to strengthen few-shot generalization.

Link: https://arxiv.org/abs/2505.22039
Authors: Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, Yunchao Wei
Institutions: Beijing Jiaotong University; Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a performance of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All codes and models will be publicly available.

[CV-83] Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

[Quick Read]: This paper addresses the heavy computational overhead of large vision-language models (LVLMs) on high-resolution inputs caused by the sheer number of image tokens. Existing token pruning considers only the local effect on the current layer's output while ignoring the global effect on subsequent layers, yielding suboptimal pruning decisions. The key is Balanced Token Pruning (BTP), a plug-and-play method that uses a small calibration set to split pruning into stages: early stages emphasize pruning's impact on later layers, while deeper stages focus on keeping local outputs consistent, achieving efficient pruning.

Link: https://arxiv.org/abs/2505.22038
Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
Institutions: Tsinghua Shenzhen International Graduate School; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer’s output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models’ performance on average.
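
A hedged sketch of the pruning step itself: tokens are ranked under a combined local/global score and the top fraction is kept. How BTP actually computes and stages the two scores is not reproduced; the equal weighting and keep ratio below are assumptions.

```python
import torch

def prune_tokens(vision_tokens, local_score, global_score, keep_ratio=0.22):
    """Keep top tokens under a combined objective: local_score measures impact
    on the current layer, global_score on later layers (both assumed to be
    precomputed on a small calibration set, as BTP's stages do)."""
    score = 0.5 * local_score + 0.5 * global_score        # (B, N)
    k = max(1, int(keep_ratio * vision_tokens.size(1)))
    idx = score.topk(k, dim=1).indices                    # tokens to keep
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, vision_tokens.size(-1))
    return vision_tokens.gather(1, idx_exp)               # (B, k, D)

tokens = torch.randn(2, 576, 1024)
kept = prune_tokens(tokens, torch.rand(2, 576), torch.rand(2, 576))
# keep_ratio=0.22 mirrors the ~78% compression rate reported in the abstract
```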

[CV-84] Guess the Age of Photos: An Interactive Web Platform for Historical Image Age Estimation

[Quick Read]: This paper addresses the challenge of estimating when historical photographs were taken by building an interactive platform that strengthens users' perception and understanding of temporal cues in historical images. The key solution is a web-based platform with two gamified modes (Guess the Year and Timeline Challenge), dynamic scoring and leaderboards to raise engagement, and 10,150 images drawn from the Date Estimation in the Wild dataset, which measurably improves users' accuracy on relative temporal-comparison tasks.

Link: https://arxiv.org/abs/2505.22031
Authors: Hasan Yucedag, Adam Jatowt
Institutions: Leopold-Franzens Universität Innsbruck
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 4 figures, and 1 system architecture

Abstract:This paper introduces Guess the Age of Photos, a web platform engaging users in estimating the years of historical photographs through two gamified modes: Guess the Year (predicting a single image’s year) and Timeline Challenge (comparing two images to identify the older). Built with Python, Flask, Bootstrap, and PostgreSQL, it uses a 10,150-image subset of the Date Estimation in the Wild dataset (1930-1999). Features like dynamic scoring and leaderboards boost engagement. Evaluated with 113 users and 15,473 gameplays, the platform earned a 4.25/5 satisfaction rating. Users excelled in relative comparisons (65.9% accuracy) over absolute year guesses (25.6% accuracy), with older decades easier to identify. The platform serves as an educational tool, fostering historical awareness and analytical skills via interactive exploration of visual heritage. Furthermore, the platform provides a valuable resource for studying human perception of temporal cues in images and could be used to generate annotated data for training and evaluating computer vision models.

[CV-85] Learnable Burst-Encodable Time-of-Flight Imaging for High-Fidelity Long-Distance Depth Sensing

[Quick Read]: This paper addresses phase wrapping and the drop in signal-to-noise ratio (SNR) in long-distance depth imaging, problems that are especially severe in conventional indirect time-of-flight (iToF) imaging. The key is a new time-of-flight paradigm, Burst-Encodable Time-of-Flight (BE-ToF), which emits light pulses in burst mode and estimates the phase delay of the reflected signal over the entire burst period, thereby avoiding the phase wrapping inherent to conventional iToF; an end-to-end learnable framework further jointly optimizes the coding functions and the depth reconstruction network to improve imaging quality at long range.

Link: https://arxiv.org/abs/2505.22025
Authors: Manchao Bao, Shengjiang Fang, Tao Yue, Xuemei Hu
Institutions: Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Long-distance depth imaging holds great promise for applications such as autonomous driving and robotics. Direct time-of-flight (dToF) imaging offers high-precision, long-distance depth sensing, yet demands ultra-short pulse light sources and high-resolution time-to-digital converters. In contrast, indirect time-of-flight (iToF) imaging often suffers from phase wrapping and low signal-to-noise ratio (SNR) as the sensing distance increases. In this paper, we introduce a novel ToF imaging paradigm, termed Burst-Encodable Time-of-Flight (BE-ToF), which facilitates high-fidelity, long-distance depth imaging. Specifically, the BE-ToF system emits light pulses in burst mode and estimates the phase delay of the reflected signal over the entire burst period, thereby effectively avoiding the phase wrapping inherent to conventional iToF systems. Moreover, to address the low SNR caused by light attenuation over increasing distances, we propose an end-to-end learnable framework that jointly optimizes the coding functions and the depth reconstruction network. A specialized double well function and first-order difference term are incorporated into the framework to ensure the hardware implementability of the coding functions. The proposed approach is rigorously validated through comprehensive simulations and real-world prototype experiments, demonstrating its effectiveness and practical applicability.
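
For background on the wrapping problem BE-ToF is designed to avoid, here is the classic four-bucket iToF phase estimate in NumPy. This shows the conventional baseline only, not the burst-mode encoding; the sample values are made up.

```python
import numpy as np

def itof_depth(c0, c90, c180, c270, f_mod, c=3e8):
    """Classic four-bucket iToF: depth follows from the phase delay of the
    reflected modulated signal and wraps at c / (2 * f_mod), which is exactly
    the ambiguity BE-ToF's burst-period phase estimation sidesteps."""
    phase = np.arctan2(c270 - c90, c0 - c180) % (2 * np.pi)
    return c * phase / (4 * np.pi * f_mod)   # meters, ambiguous beyond range

depth = itof_depth(0.8, 0.3, 0.2, 0.7, f_mod=20e6)
# at 20 MHz the unambiguous range is c / (2 * f_mod) = 7.5 m
```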

[CV-86] RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling INTERSPEECH2025

[Quick Read]: This paper addresses the limited accuracy and naturalness of lip-to-speech (L2S) synthesis, which stems from limited supervision over linguistic content, accents, and prosody. The key is RESOUND, which, following source-filter theory, splits the model into an acoustic path that predicts prosody and a semantic path that extracts linguistic features, simplifying learning and allowing each representation to be optimized independently. Integrating speech units, a proven unsupervised speech representation, into waveform generation alongside mel-spectrograms further lets RESOUND synthesize intelligible, expressive prosodic speech while preserving content and speaker identity.

Link: https://arxiv.org/abs/2505.22024
Authors: Long-Khanh Pham, Thanh V. T. Tran, Minh-Tan Pham, Van Nguyen
Institutions: FPT Software AI Center; IRISA
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: accepted in Interspeech 2025

Abstract:Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.
zh

[CV-87] GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

【速读】:该论文旨在解决现有Document Image Enhancement (DIE)方法在处理多退化彩色文档图像时的局限性,这些方法通常局限于单一退化恢复或灰度图像处理,难以满足实际应用场景中的高效性和鲁棒性需求。其解决方案的关键在于提出一种名为Global with Local Parametric Generation Enhancement Network (GL-PGENet)的新架构,该架构包含三个核心创新:层级增强框架实现全局外观校正与局部细化的结合,双分支局部细化网络通过参数化生成机制替代传统直接预测以提升局部一致性与模型泛化能力,以及改进的NestUNet结构融合低级像素特征与高级语义特征,同时采用两阶段训练策略以增强模型的泛化性能。

链接: https://arxiv.org/abs/2505.22021
作者: Zhihong Tang,Yang Li
机构: QQ Browser R&D Team, Tencent CSIG (QQ浏览器研发团队,腾讯CSIG)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images, ensuring both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations: First, a hierarchical enhancement framework that integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with parametric generation mechanisms that replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Finally, a modified NestUNet architecture incorporating dense block to effectively fuse low-level pixel features and high-level semantic features, specifically adapted for document image characteristics. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, achieving state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.
zh

[CV-88] PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms

【速读】:该论文旨在解决现有全景视频生成模型难以利用传统文本到视频生成模型的预训练生成先验,以生成高质量且多样化的全景视频的问题,这一问题主要源于数据集规模有限和空间特征表示的差异。论文提出的解决方案关键在于PanoWan,它通过最小模块实现从文本到视频模型的有效迁移,其中纬度感知采样避免了纬度失真,旋转语义去噪和填充像素级解码确保了经度边界处的无缝过渡。

链接: https://arxiv.org/abs/2505.22016
作者: Yifei Xia,Shuchen Weng,Siqi Yang,Jingqi Liu,Chengxuan Zhu,Minggui Teng,Zijian Jia,Han Jiang,Boxin Shi
机构: 1State Key Lab of Multimedia Info. Processing, School of Computer Science, Peking University; 2Nat’l Eng. Research Ctr. of Visual Tech., School of Computer Science, Peking University; 3OpenBayes Information Technology Co., Ltd. (OpenBayes 信息科技有限公司); 4Beijing Academy of Artificial Intelligence (北京人工智能研究院); 5Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院); 6Nat’l Key Lab of General AI, School of Intelligence Science and Technology, Peking University (国家通用人工智能重点实验室,北京大学智能科学与技术学院); 7School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Panoramic video generation enables immersive 360° content creation, valuable in applications that demand scene-consistent world exploration. However, existing panoramic video generation models struggle to leverage pre-trained generative priors from conventional text-to-video models for high-quality and diverse panoramic video generation, due to limited dataset scale and the gap in spatial feature representations. In this paper, we introduce PanoWan to effectively lift pre-trained text-to-video models to the panoramic domain, equipped with minimal modules. PanoWan employs latitude-aware sampling to avoid latitudinal distortion, while its rotated semantic denoising and padded pixel-wise decoding ensure seamless transitions at longitude boundaries. To provide sufficient panoramic videos for learning these lifted representations, we contribute PanoVid, a high-quality panoramic video dataset with captions and diverse scenarios. Consequently, PanoWan achieves state-of-the-art performance in panoramic video generation and demonstrates robustness for zero-shot downstream tasks.
zh

[CV-89] Prototype Embedding Optimization for Human-Object Interaction Detection in Livestreaming

【速读】:该论文旨在解决直播场景中人类-物体交互(Human-Object Interaction, HOI)检测中存在的物体偏差问题,即现有方法过于关注物体而忽视其与主播的交互关系。解决方案的关键在于提出一种原型嵌入优化(Prototype Embedding Optimization for HOI detection, PeO-HOI),通过预处理提取人-物对特征、采用原型嵌入优化缓解物体偏差,并建模人-物对之间的时空上下文以提升检测性能。

链接: https://arxiv.org/abs/2505.22011
作者: Menghui Zhang,Jing Zhang,Lin Chen,Li Zhuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Livestreaming often involves interactions between streamers and objects, which is critical for understanding and regulating web content. While human-object interaction (HOI) detection has made some progress in general-purpose video downstream tasks, when applied to recognize the interaction behaviors between a streamer and different objects in livestreaming, it tends to focus too much on the objects and neglect their interactions with the streamer, which leads to object bias. To solve this issue, we propose a prototype embedding optimization for human-object interaction detection (PeO-HOI). First, the livestreaming is preprocessed using object detection and tracking techniques to extract features of the human-object (HO) pairs. Then, prototype embedding optimization is adopted to mitigate the effect of object bias on HOI. Finally, after modelling the spatio-temporal context between HO pairs, the HOI detection results are obtained by the prediction head. The experimental results show that the proposed PeO-HOI method achieves detection accuracies of 37.19%@full, 51.42%@non-rare and 26.20%@rare on the publicly available dataset VidHOI, and 45.13%@full, 62.78%@non-rare and 30.37%@rare on the self-built dataset BJUT-HOI, which effectively improves the HOI detection performance in livestreaming.
zh

[CV-90] Event-based Egocentric Human Pose Estimation in Dynamic Environment ICIP2025

【速读】:该论文旨在解决在低光照环境或运动模糊条件下,利用前向佩戴式摄像头进行人体姿态估计的挑战。现有方法多依赖于RGB摄像头,难以应对这些复杂场景。其解决方案的关键在于引入基于事件的摄像头(event-based camera)并提出D-EventEgo框架,该框架首先估计头部姿态,再以此作为条件生成身体姿态;同时,为提高头部姿态估计的准确性,设计了运动分割模块以去除动态物体并提取背景信息。

链接: https://arxiv.org/abs/2505.22007
作者: Wataru Ikeda,Masashi Hatano,Ryosei Hara,Mariko Isogawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICIP 2025, Project Page: this https URL

点击查看摘要

Abstract:Estimating human pose using a front-facing egocentric camera is essential for applications such as sports motion analysis, VR/AR, and AI for wearable devices. However, many existing methods rely on RGB cameras and do not account for low-light environments or motion blur. Event-based cameras have the potential to address these challenges. In this work, we introduce a novel task of human pose estimation using a front-facing event-based camera mounted on the head and propose D-EventEgo, the first framework for this task. The proposed method first estimates the head poses, and then these are used as conditions to generate body poses. However, when estimating head poses, the presence of dynamic objects mixed with background events may reduce head pose estimation accuracy. Therefore, we introduce the Motion Segmentation Module to remove dynamic objects and extract background information. Extensive experiments on our synthetic event-based dataset derived from EgoBody demonstrate that our approach outperforms our baseline in four out of five evaluation metrics in dynamic environments.
zh

[CV-91] Efficiently Enhancing General Agents With Hierarchical-categorical Memory

【速读】:该论文旨在解决现有方法在构建通用多模态代理时存在的两个主要问题:一是依赖计算成本高昂的端到端训练,二是采用工具使用方法但缺乏持续学习和适应新环境的能力。其解决方案的关键在于提出EHC,一个无需参数更新即可学习的通用代理,其核心由分层记忆检索(Hierarchical Memory Retrieval, HMR)模块和任务类别导向经验学习(Task-Category Oriented Experience Learning, TOEL)模块组成,分别实现了高效记忆检索与存储以及跨任务类别的模式提取与理解。

链接: https://arxiv.org/abs/2505.22006
作者: Changze Qiao,Mingming Lu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With large language models (LLMs) demonstrating remarkable capabilities, there has been a surge in research on leveraging LLMs to build general-purpose multi-modal agents. However, existing approaches either rely on computationally expensive end-to-end training using large-scale multi-modal data or adopt tool-use methods that lack the ability to continuously learn and adapt to new environments. In this paper, we introduce EHC, a general agent capable of learning without parameter updates. EHC consists of a Hierarchical Memory Retrieval (HMR) module and a Task-Category Oriented Experience Learning (TOEL) module. The HMR module facilitates rapid retrieval of relevant memories and continuously stores new information without being constrained by memory capacity. The TOEL module enhances the agent’s comprehension of various task characteristics by classifying experiences and extracting patterns across different categories. Extensive experiments conducted on multiple standard datasets demonstrate that EHC outperforms existing methods, achieving state-of-the-art performance and underscoring its effectiveness as a general agent for handling complex multi-modal tasks.
zh

[CV-92] D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples ICML2025

【速读】:该论文试图解决扩散模型中生成图像与文本提示之间的对齐问题,以及由此引发的视觉不一致性问题,这限制了直接偏好优化(DPO)方法的有效性。解决方案的关键在于提出D-Fusion方法,通过掩码引导的自注意力融合技术,构建出在视觉上一致且可进行DPO训练的样本,从而提升模型在微调过程中识别促进对齐因素的能力。
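
掩码引导的自注意力融合的最简形态可以用如下草图示意(假设性实现:此处将其简化为注意力输出的逐元素掩码融合,并非官方实现):用掩码 M 把"对齐良好"分支的特征注入"对齐较差"分支的对应区域,得到既对齐又与原图视觉一致的训练样本。

```python
import torch

def mask_guided_fusion(attn_aligned, attn_poor, mask):
    """attn_*: (B, C, H, W) 两条去噪分支的注意力输出;mask: (B, 1, H, W)。
    M=1 的区域取"对齐良好"分支的特征,其余保留原分支特征(示意)。"""
    return mask * attn_aligned + (1.0 - mask) * attn_poor

a = torch.randn(1, 64, 32, 32)                 # 对齐良好分支(示意)
b = torch.randn(1, 64, 32, 32)                 # 对齐较差分支(示意)
m = (torch.rand(1, 1, 32, 32) > 0.5).float()   # 引导掩码(示意)
fused = mask_guided_fusion(a, b, m)
print(fused.shape)
```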

链接: https://arxiv.org/abs/2505.22002
作者: Zijing Hu,Fengda Zhang,Kun Kuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2025

点击查看摘要

Abstract:The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
zh

[CV-93] Learning World Models for Interactive Video Generation

【速读】:该论文旨在解决长视频生成中世界模型(world models)的交互性不足以及时空一致性缺失的问题。现有模型由于两个主要挑战——误差累积和记忆机制不足,导致其内在世界建模能力有限。论文提出的解决方案关键在于引入视频检索增强生成(VRAG),通过显式的全局状态条件控制,显著减少了长期误差累积并提升了时空一致性。相比之下,简单地扩展上下文窗口或使用检索增强生成方法效果有限,主要是因为当前视频模型的上下文学习能力受限。

链接: https://arxiv.org/abs/2505.21996
作者: Taiye Chen,Xun Hu,Zihan Ding,Chi Jin
机构: Peking University (北京大学); University of Oxford (牛津大学); Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
zh

[CV-94] DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model

【速读】:该论文旨在解决文档图像去畸变(document dewarping)中保持文档结构完整性的挑战,尤其是在处理高分辨率复杂文档图像时,传统方法难以实现精确控制。其解决方案的关键在于提出DvD,这是首个基于扩散框架(diffusion framework)的生成式模型,通过引入坐标级去噪(coordinate-level denoising)替代传统的像素级去噪,生成用于校正变形的映射,并结合时间变化条件细化机制(time-variant condition refinement mechanism)以增强文档结构的保留效果。

链接: https://arxiv.org/abs/2505.21975
作者: Weiguang Zhang,Huangcheng Lu,Maizhen Ning,Xiaowei Huang,Wei Wang,Kaizhu Huang,Qiufeng Wang
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Liverpool (利物浦大学); Duke Kunshan University (杜克昆山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document dewarping aims to rectify deformations in photographic document images, thus improving text readability, which has attracted much attention and made great progress, but it is still challenging to preserve document structures. Given recent advances in diffusion models, it is natural for us to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control on highly complex document images (e.g., 2000×3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we further propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks cannot evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available.
zh

[CV-95] A2Seek: Towards Reasoning -Centric Benchmark for Aerial Anomaly Understanding

【速读】:该论文旨在解决无人机视角下异常检测性能下降的问题,即现有数据集和方法主要针对固定地面视角设计,难以适应无人机动态视角、尺度变化和复杂场景带来的挑战。其解决方案的关键在于提出A2Seek-R1推理框架,该框架通过图-of-thought(GoT)引导的监督微调激活模型的潜在推理能力,并引入面向航空场景的规则奖励函数设计方法——Aerial Group Relative Policy Optimization(A-GRPO),同时结合模拟无人机飞行行为的“seeking”机制,以提升对异常发生位置和原因的理解能力。

链接: https://arxiv.org/abs/2505.21962
作者: Mengjingcheng Mo,Xinyang Tong,Jiaxu Leng,Mingpi Tan,Jiankang Zheng,Yiran Liu,Haosheng Chen,Ji Gan,Weisheng Li,Xinbo Gao
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学); Chongqing Institute for Brain and Intelligence (重庆脑与智能科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While unmanned aerial vehicles (UAVs) offer wide-area, high-altitude coverage for anomaly detection, they face challenges such as dynamic viewpoints, scale variations, and complex scenes. Existing datasets and methods, mainly designed for fixed ground-level views, struggle to adapt to these conditions, leading to significant performance drops in drone-view scenarios. To bridge this gap, we introduce A2Seek (Aerial Anomaly Seek), a large-scale, reasoning-centric benchmark dataset for aerial anomaly understanding. This dataset covers various scenarios and environmental conditions, providing high-resolution real-world aerial videos with detailed annotations, including anomaly categories, frame-level timestamps, region-level bounding boxes, and natural language explanations for causal reasoning. Building on this dataset, we propose A2Seek-R1, a novel reasoning framework that generalizes R1-style strategies to aerial anomaly understanding, enabling a deeper understanding of “Where” anomalies occur and “Why” they happen in aerial frames. To this end, A2Seek-R1 first employs a graph-of-thought (GoT)-guided supervised fine-tuning approach to activate the model’s latent reasoning capabilities on A2Seek. Then, we introduce Aerial Group Relative Policy Optimization (A-GRPO) to design rule-based reward functions tailored to aerial scenarios. Furthermore, we propose a novel “seeking” mechanism that simulates UAV flight behavior by directing the model’s attention to informative regions. Extensive experiments demonstrate that A2Seek-R1 achieves up to a 22.04% improvement in AP for prediction accuracy and a 13.9% gain in mIoU for anomaly localization, exhibiting strong generalization across complex environments and out-of-distribution scenarios. Our dataset and code will be released at this https URL.
zh

[CV-96] One-Way Ticket:Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models CVPR2025

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在推理速度与图像质量之间的权衡问题,以提升模型的高效部署能力。其解决方案的关键在于提出首个时间无关的统一编码器 TiUE,该编码器能够跨多个解码器时间步共享编码器特征,从而实现并行采样并显著降低推理时间复杂度。此外,通过引入KL散度项对噪声预测进行正则化,进一步提升了生成图像的感知真实性和多样性。
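
下面用一段 PyTorch 风格的草图示意"时间无关统一编码器"的核心思想(假设性实现:模块结构与命名均为示意,并非官方代码):编码器只前向一次,其特征被多个解码时间步共享,各时间步的解码相互独立、可并行执行。

```python
import torch
import torch.nn as nn

class TiUEStyleUNet(nn.Module):
    """示意:一次编码、多时间步共享解码(非官方实现)。"""
    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(4, ch, 3, padding=1), nn.SiLU())
        self.time_emb = nn.Embedding(1000, ch)
        self.decoder = nn.Conv2d(ch, 4, 3, padding=1)

    def forward(self, latents, timesteps):
        # 编码器与时间步无关:只前向一次,供所有解码时间步复用
        enc_feat = self.encoder(latents)
        outs = []
        for t in timesteps:  # 各时间步的解码彼此独立,可并行执行
            h = enc_feat + self.time_emb(t)[:, :, None, None]
            outs.append(self.decoder(h))
        return outs

model = TiUEStyleUNet()
latents = torch.randn(1, 4, 32, 32)
timesteps = [torch.tensor([999]), torch.tensor([749]),
             torch.tensor([499]), torch.tensor([249])]
preds = model(latents, timesteps)
print(len(preds), preds[0].shape)
```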

链接: https://arxiv.org/abs/2505.21960
作者: Senmao Li,Lei Wang,Kai Wang,Tao Liu,Jiehang Xie,Joost van de Weijer,Fahad Shahbaz Khan,Shiqi Yang,Yaxing Wang,Jian Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR2025, Code: this https URL

点击查看摘要

Abstract:Text-to-Image (T2I) diffusion models have made remarkable advancements in generative modeling; however, they face a trade-off between inference speed and image quality, posing challenges for efficient deployment. Existing distilled T2I models can generate high-fidelity images with fewer sampling steps, but often struggle with diversity and quality, especially in one-step models. From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder TiUE for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling and significantly reducing inference time complexity. In addition, we incorporate a KL divergence term to regularize noise prediction, which enhances the perceptual realism and diversity of the generated images. Experimental results demonstrate that TiUE outperforms state-of-the-art methods, including LCM, SD-Turbo, and SwiftBrushv2, producing more diverse and realistic results while maintaining the computational efficiency.
zh

[CV-97] owards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在处理第一人称(egocentric)视角输入时,由于视野狭窄和缺乏全局上下文而导致的空间或语境复杂查询失败的问题。其解决方案的关键在于引入一种框架,通过融合第三人称(exocentric)视角的信息,为LVLMs提供互补的全局场景布局和物体可见性等信息,从而增强模型的多视角推理能力。

链接: https://arxiv.org/abs/2505.21955
作者: Insu Lee,Wooje Park,Jaeyun Jang,Minyoung Noh,Kyuhong Shim,Byonghyo Shim
机构: Seoul National University (首尔国立大学); Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where the first-person (egocentric) view captured by head-mounted cameras serves as a key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs.
zh

[CV-98] UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

【速读】:该论文旨在解决主动说话者检测(Active Speaker Detection, ASD)任务中模型泛化能力不足的问题,特别是在复杂和真实场景下的表现。与以往基准如AVA相比,UniTalk数据集通过引入多样化的现实条件,如低资源语言、噪声背景和多人重叠对话等挑战性场景,弥补了领域差距。其关键在于构建了一个大规模、多样的视频数据集,包含超过44.5小时的视频和48,693个说话者身份的帧级标注,从而为模型训练和评估提供了更贴近实际应用的基准。

链接: https://arxiv.org/abs/2505.21954
作者: Le Thien Phuc Nguyen,Zhuoran Yu,Khoa Quang Nhat Cao,Yuwei Guo,Tu Ho Manh Pham,Tuan Tai Nguyen,Toan Ngo Duc Vo,Lucas Poon,Soochahn Lee,Yong Jae Lee
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Kookmin University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern “in-the-wild” datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models. Dataset: this https URL Code: this https URL
zh

[CV-99] Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting CVPR-2025

【速读】:该论文旨在解决基于点检测的计数方法在训练过程中对标注数据依赖度过高的问题,尤其是在密集场景中需要大量人工标注点的问题。其解决方案的关键在于引入一种半监督计数框架,通过伪标签(pseudo-labeling)减少对人工标注的依赖,同时提出了一种点到区域(Point-to-Region, P2R)的监督机制,以替代传统的点到点(Point-to-Point, P2P)监督方式,从而有效缓解因伪标签置信度无法传递至背景像素而导致的误检问题。
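
P2R 监督的核心操作可以用如下 NumPy 草图示意(假设性实现:区域半径与置信传播方式均为示意):把每个伪标注点扩展为一个局部圆形区域,区域内像素共享该点的置信度,从而让伪标签的置信信号不再局限于单个像素。

```python
import numpy as np

def point_to_region_targets(points, confidences, h, w, radius=8):
    """points: (N,2) 的 (y,x) 伪标注点;返回区域监督图与置信度图。"""
    region = np.zeros((h, w), dtype=np.float32)   # 1 表示行人局部区域
    conf = np.zeros((h, w), dtype=np.float32)     # 区域内像素共享点的置信度
    ys, xs = np.mgrid[0:h, 0:w]
    for (py, px), c in zip(points, confidences):
        mask = (ys - py) ** 2 + (xs - px) ** 2 <= radius ** 2
        region[mask] = 1.0
        conf[mask] = np.maximum(conf[mask], c)    # 重叠区域取较高置信度
    return region, conf

region, conf = point_to_region_targets(
    points=[(20, 30), (24, 34)], confidences=[0.9, 0.6], h=64, w=64)
print(region.sum(), conf.max())
```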

链接: https://arxiv.org/abs/2505.21943
作者: Wei Lin,Chenyang Zhao,Antoni B. Chan
机构: City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR-2025(highlight)

点击查看摘要

Abstract:Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample capturing a dense crowd. In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. However, during implementation, the training encounters issues as the confidence for pseudo-labels fails to be propagated to background pixels via the P2P. To tackle this challenge, we devise a point-specific activation map (PSAM) to visually interpret the phenomena occurring during the ill-posed training. Observations from the PSAM suggest that the feature map is excessively activated by the loss for unlabeled data, causing the decoder to misinterpret these over-activations as pedestrians. To mitigate this issue, we propose a point-to-region (P2R) scheme to substitute P2P, which segments out local regions rather than detects a point corresponding to a pedestrian for supervision. Consequently, pixels in the local region can share the same confidence with the corresponding pseudo points. Experimental results in both semi-supervised counting and unsupervised domain adaptation highlight the advantages of our method, illustrating P2R can resolve issues identified in PSAM. The code is available at this https URL.
zh

[CV-100] RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination SIGGRAPH2025 MICRO

【速读】:该论文试图解决的是如何直接从基于三角形的场景表示中渲染出具有完整全局光照效果的图像,而无需针对每个场景进行训练或微调。解决方案的关键在于将渲染过程建模为一种序列到序列的转换,其中包含两个阶段:第一阶段为视图无关的三角形到三角形的光传输建模,第二阶段为视图相关的光线束到像素值的转换,这两个阶段均基于Transformer架构,并在最小先验约束下进行学习。

链接: https://arxiv.org/abs/2505.21925
作者: Chong Zeng,Yue Dong,Pieter Peers,Hongzhi Wu,Xin Tong
机构: Zhejiang University (浙江大学); Microsoft Research Asia (微软亚洲研究院); College of William & Mary (威廉与玛丽学院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to SIGGRAPH 2025. Project page: this https URL

点击查看摘要

Abstract:We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
zh

[CV-101] InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective ICML2025

【速读】:该论文试图解决视觉基础模型(vision foundation model)在专业领域中表现不足的问题,具体而言是针对 Segment Anything Model (SAM) 在特定场景下的适应性问题。现有参数高效微调(PEFT)方法未能充分保留预训练模型中编码的领域不变关系,导致其在新场景中的性能受限。解决方案的关键在于提出 InfoSAM,该方法通过信息论框架,利用互信息最大化和知识压缩两个目标,实现对 SAM 预训练分割知识的有效迁移与保留,从而提升其在实际任务中的适应性和性能。

链接: https://arxiv.org/abs/2505.21920
作者: Yuanhong Zhang,Muyao Yuan,Weizhan Zhang,Tieliang Gong,Wen Wen,Jiangyong Ying,Weijie Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025 (Highlight)

点击查看摘要

Abstract:The Segment Anything Model (SAM), a vision foundation model, exhibits impressive zero-shot capabilities in general tasks but struggles in specialized domains. Parameter-efficient fine-tuning (PEFT) is a promising approach to unleash the potential of SAM in novel scenarios. However, existing PEFT methods for SAM neglect the domain-invariant relations encoded in the pre-trained model. To bridge this gap, we propose InfoSAM, an information-theoretic approach that enhances SAM fine-tuning by distilling and preserving its pre-trained segmentation knowledge. Specifically, we formulate the knowledge transfer process as two novel mutual information-based objectives: (i) to compress the domain-invariant relation extracted from pre-trained SAM, excluding pseudo-invariant information as much as possible, and (ii) to maximize mutual information between the relational knowledge learned by the teacher (pre-trained SAM) and the student (fine-tuned model). The proposed InfoSAM establishes a robust distillation framework for PEFT of SAM. Extensive experiments across diverse benchmarks validate InfoSAM’s effectiveness in improving SAM family’s performance on real-world tasks, demonstrating its adaptability and superiority in handling specialized scenarios.
zh

[CV-102] BD Open LULC Map: High-resolution land use land cover mapping benchmarking for urban development in Dhaka Bangladesh ICIP2025

【速读】:该论文旨在解决南亚/东南亚发展中国家由于标注卫星数据稀缺而导致的土地利用与土地覆盖(Land Use Land Cover, LULC)分类可靠性不足的问题。其关键解决方案是引入BD Open LULC Map (BOLM),该数据集通过高分辨率Bing卫星影像(2.22 m/pixel)为达卡都市区及其周边地区提供了像素级的LULC标注,涵盖十一类地物类型,并经过GIS专家三阶段验证,以支持深度学习模型和领域自适应任务,填补南亚/东南亚地区LULC数据集的空白。

链接: https://arxiv.org/abs/2505.21915
作者: Mir Sazzat Hossain,Ovi Paul,Md Akil Raihan Iftee,Rakibul Hasan Rajib,Abu Bakar Siddik Nayem,Anis Sarker,Arshad Momen,Md. Ashraful Amin,Amin Ahsan Ali,AKM Mahbubur Rahman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures, 3 tables, Accepted In ICIP 2025

点击查看摘要

Abstract:Land Use Land Cover (LULC) mapping using deep learning significantly enhances the reliability of LULC classification, aiding in understanding geography, socioeconomic conditions, poverty levels, and urban sprawl. However, the scarcity of annotated satellite data, especially in South/East Asian developing countries, poses a major challenge due to limited funding, diverse infrastructures, and dense populations. In this work, we introduce the BD Open LULC Map (BOLM), providing pixel-wise LULC annotations across eleven classes (e.g., Farmland, Water, Forest, Urban Structure, Rural Built-Up) for Dhaka metropolitan city and its surroundings using high-resolution Bing satellite imagery (2.22 m/pixel). BOLM spans 4,392 sq km (891 million pixels), with ground truth validated through a three-stage process involving GIS experts. We benchmark LULC segmentation using DeepLab V3+ across five major classes and compare performance on Bing and Sentinel-2A imagery. BOLM aims to support reliable deep models and domain adaptation tasks, addressing critical LULC dataset gaps in South/East Asia.
zh

[CV-103] LiDARDustX: A LiDAR Dataset for Dusty Unstructured Road Environments

【速读】:该论文旨在解决现有自动驾驶数据集主要关注结构化城市环境,而忽视了非结构化和特殊场景(尤其是高尘环境)的问题。其解决方案的关键在于构建LiDARDustX数据集,该数据集专门用于高尘条件下的感知任务,包含30,000帧由六种不同LiDAR传感器采集的数据,配有3D边界框标注和点云语义分割,并且超过80%的场景受到灰尘影响,从而为评估先进3D检测与分割算法提供了基准。

链接: https://arxiv.org/abs/2505.21914
作者: Chenfeng Wei,Qi Wu,Si Zuo,Jiahua Xu,Boyang Zhao,Zeyu Yang,Guotao Xie,Shenhong Wang
机构: Wuxi Intelligent Control Research Institute, HNU; Hunan University; Tsinghua University; Xi’an Jiaotong-Liverpool University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous driving datasets are essential for validating the progress of intelligent vehicle algorithms, which include localization, perception, and prediction. However, existing datasets are predominantly focused on structured urban environments, which limits the exploration of unstructured and specialized scenarios, particularly those characterized by significant dust levels. This paper introduces the LiDARDustX dataset, which is specifically designed for perception tasks under high-dust conditions, such as those encountered in mining areas. The LiDARDustX dataset consists of 30,000 LiDAR frames captured by six different LiDAR sensors, each accompanied by 3D bounding box annotations and point cloud semantic segmentation. Notably, over 80% of the dataset comprises dust-affected scenes. By utilizing this dataset, we have established a benchmark for evaluating the performance of state-of-the-art 3D detection and segmentation algorithms. Additionally, we have analyzed the impact of dust on perception accuracy and delved into the causes of these effects. The data and further information can be accessed at: this https URL.
zh

[CV-104] Detecting Cultural Differences in News Video Thumbnails via Computational Aesthetics

【速读】:该论文试图解决跨文化源图像风格差异检测的问题,其核心在于如何有效识别和比较不同文化背景下的图像视觉特征。解决方案的关键在于采用两步法:首先根据内容将图像聚类为更细粒度的视觉主题,随后在这些主题内比较图像的审美特征,从而更准确地捕捉文化偏好对图像风格的影响。
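
这一两步流程可以用如下 Python 草图示意(假设性实现:特征、聚类数与美学指标均为示意):先按内容特征把缩略图聚成若干视觉主题,再在同一主题内部比较两组来源的美学统计量(此处以饱和度为例),以排除主题构成差异带来的干扰。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
content_feat = rng.normal(size=(400, 128))   # 每张缩略图的内容特征(示意)
saturation = rng.uniform(0, 1, size=400)     # 每张图的美学特征,这里以饱和度为例
source = rng.integers(0, 2, size=400)        # 0: 来源 A,1: 来源 B

# 第一步:按内容聚类出视觉主题
themes = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(content_feat)

# 第二步:在每个视觉主题内部比较两组来源的美学统计量
for k in range(8):
    in_k = themes == k
    a = saturation[in_k & (source == 0)]
    b = saturation[in_k & (source == 1)]
    if len(a) and len(b):
        print(f"主题 {k}: A 平均饱和度 {a.mean():.3f} vs B {b.mean():.3f}")
```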

链接: https://arxiv.org/abs/2505.21912
作者: Marvin Limpijankit,John Kender
机构: 未知
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a two-step approach for detecting differences in the style of images across sources of differing cultural affinity, where images are first clustered into finer visual themes based on content before their aesthetic features are compared. We test this approach on 2,400 YouTube video thumbnails taken equally from two U.S. and two Chinese YouTube channels, and relating equally to COVID-19 and the Ukraine conflict. Our results suggest that while Chinese thumbnails are less formal and more candid, U.S. channels tend to use more deliberate, proper photographs as thumbnails. In particular, U.S. thumbnails are less colorful, more saturated, darker, more finely detailed, less symmetric, sparser, less varied, and more up close and personal than Chinese thumbnails. We suggest that most of these differences reflect cultural preferences, and that our methods and observations can serve as a baseline against which suspected visual propaganda can be computed and compared.
zh

[CV-105] AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

【速读】:该论文旨在解决个性化图像生成中由于提示文本与参考图像之间存在语义不一致时,生成结果过度偏向文本先验而丢失参考图像内容的问题。其解决方案的关键在于提出AlignGen,一种跨模态先验对齐机制,通过引入可学习的令牌以弥合文本与视觉先验之间的差距、采用稳健的训练策略确保先验对齐以及在多模态注意力机制中使用选择性跨模态注意力掩码进一步对齐先验,从而提升生成结果的质量与一致性。

链接: https://arxiv.org/abs/2505.21911
作者: Yiheng Lin,Shifang Zhao,Ting Liu,Xiaochao Qu,Luoqi Liu,Yao Zhao,Yunchao Wei
机构: Institute of Information Science, Beijing Jiaotong University (信息科学研究所,北京交通大学); MT Lab, Meitu Inc. (美图实验室,美图公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.
zh

[CV-106] aming Transformer Without Using Learning Rate Warmup ICLR2025

【速读】:该论文试图解决在不使用学习率预热等技术技巧的情况下,将Transformer模型扩展到大规模训练时出现的模型崩溃问题,该问题由谱能量集中(spectral energy concentration)现象引起,具体表现为查询投影矩阵 W_q 与键投影矩阵 W_k 的乘积 W_q^T W_k 的谱能量集中在少数方向上,导致恶性熵崩溃。解决方案的关键在于受 Weyl 不等式(Weyl's Inequality)启发提出的一种新颖优化策略,即在连续步骤中使权重更新平滑——当更新量最大奇异值与权重矩阵最大奇异值的比值 σ1(∇W_t)/σ1(W_{t-1}) 超过阈值时,自动将学习率限制为该比值倒数的加权倍数,从而防止谱能量集中,避免恶性熵崩溃,实现稳定训练。
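
这一优化策略可以用如下 PyTorch 草图示意(假设性实现:阈值 tau 与加权系数 alpha 均为示意参数,并非论文原始设置):每步计算更新量与当前权重最大奇异值之比,一旦超过阈值,就把有效学习率收紧为 σ1(W)/σ1(∇W) 的加权倍数,使相邻步的权重更新保持平滑。

```python
import torch

def smooth_update(W, grad, lr, tau=1.0, alpha=0.5):
    """若 σ1(∇W)/σ1(W) 超过阈值 tau,则把学习率限制为
    alpha·σ1(W)/σ1(∇W)(tau 与 alpha 为示意参数)。"""
    s_w = torch.linalg.svdvals(W)[0]      # 当前权重的最大奇异值 σ1(W_{t-1})
    s_g = torch.linalg.svdvals(grad)[0]   # 更新量的最大奇异值 σ1(∇W_t)
    eff_lr = lr
    if s_g / (s_w + 1e-12) > tau:         # 比值过大,说明该步更新不平滑
        eff_lr = min(lr, alpha * (s_w / s_g).item())
    return W - eff_lr * grad, eff_lr

W = torch.randn(64, 64)
grad = 500.0 * torch.randn(64, 64)        # 人为放大的更新量,触发学习率限幅
W_new, eff_lr = smooth_update(W, grad, lr=1e-2)
print(f"有效学习率: {eff_lr:.2e}")
```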

链接: https://arxiv.org/abs/2505.21910
作者: Xianbiao Qi,Yelin He,Jiaquan Ye,Chun-Guang Li,Bojia Zi,Xili Dai,Qin Zou,Rong Xiao
机构: Intellifusion Inc. (智源科技); BUPT (北京邮电大学); CUHK (香港中文大学); HKUST (GZ) (香港科技大学(广州)); WHU (武汉大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is published as a conference paper at ICLR 2025

点击查看摘要

Abstract:Scaling Transformer to a large scale without using technical tricks such as learning rate warmup or an obviously lower learning rate is an extremely challenging task, and is increasingly gaining attention. In this paper, we provide a theoretical analysis of the process of training Transformer and reveal the rationale behind the model crash phenomenon in the training process, termed spectral energy concentration of W_q^\top W_k, which is the reason for a malignant entropy collapse, where W_q and W_k are the projection matrices for the query and the key in Transformer, respectively. To remedy this problem, motivated by Weyl's Inequality, we present a novel optimization strategy, i.e., making the weight updating in successive steps smooth: if the ratio \frac{\sigma_1(\nabla W_t)}{\sigma_1(W_{t-1})} is larger than a threshold, we will automatically bound the learning rate to a weighted multiple of \frac{\sigma_1(W_{t-1})}{\sigma_1(\nabla W_t)}, where \nabla W_t is the updating quantity in step t. Such an optimization strategy can prevent spectral energy concentration to only a few directions, and thus can avoid malignant entropy collapse which will trigger the model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformers without using learning rate warmup.
zh

[CV-107] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge

【速读】:该论文旨在解决现有端到端视觉-语言-动作(Vision-Language-Action, VLA)系统在微调过程中丢失关键能力的问题,特别是当模型适应特定机器人任务时。其解决方案的关键在于引入ChatVLA-2,这是一种结合专家混合架构的新型VLA模型,并采用专门设计的三阶段训练流程,以保留预训练视觉-语言模型(Vision-Language Model, VLM)的核心能力,同时实现可操作的推理能力。

链接: https://arxiv.org/abs/2505.21906
作者: Zhongyi Zhou,Yichen Zhu,Junjie Wen,Chaomin Shen,Yi Xu
机构: Midea Group (美的集团); East China Normal University (华东师范大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM’s core competencies: 1) Open-world embodied reasoning - the VLA should inherit the knowledge from VLM, i.e., recognize anything that the VLM can recognize, capable of solving math problems, possessing visual-spatial intelligence, 2) Reasoning following - effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-expert VLA model coupled with a specialized three-stage training pipeline designed to preserve the VLM’s original strengths while enabling actionable reasoning. To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and pi-zero. This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.
zh

[CV-108] Reference-Guided Identity Preserving Face Restoration

【速读】:该论文旨在解决基于扩散模型的图像修复中保持人脸身份的挑战,特别是在利用参考人脸时未能充分挖掘其潜力的问题。解决方案的关键在于提出一种新方法,通过三个核心贡献提升参考人脸的利用率:1)复合上下文(Composite Context),融合参考人脸的多层级信息以提供更丰富的指导;2)硬样本身份损失(Hard Example Identity Loss),通过参考人脸优化现有身份损失中的学习效率问题;3)一种无需训练即可在推理阶段适应多参考输入的方法。该方法在FFHQ-Ref和CelebA-Ref-Test等基准测试中实现了最先进的身份保留修复效果。

链接: https://arxiv.org/abs/2505.21905
作者: Mo Zhou,Keren Ye,Viraj Shah,Kangfu Mei,Mauricio Delbracio,Peyman Milanfar,Vishal M. Patel,Hossein Talebi
机构: Google(谷歌); Johns Hopkins University(约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing reference-based methods often fail to fully exploit their potential. This paper introduces a novel approach that maximizes reference face utility for improved face restoration and identity preservation. Our method makes three key contributions: 1) Composite Context, a comprehensive representation that fuses multi-level (high- and low-level) information from the reference face, offering richer guidance than prior singular representations. 2) Hard Example Identity Loss, a novel loss function that leverages the reference face to address the identity learning inefficiencies found in the existing identity loss. 3) A training-free method to adapt the model to multi-reference inputs during inference. The proposed method demonstrably restores high-quality faces and achieves state-of-the-art identity preserving restoration on benchmarks such as FFHQ-Ref and CelebA-Ref-Test, consistently outperforming previous work.
zh

[CV-109] CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

【速读】:该论文旨在解决实例分割任务中依赖昂贵的逐像素标注和大型模型的问题,提出一种半监督知识蒸馏(SSKD)框架CAST,通过有限的标注数据和大量的未标注数据,将预训练的视觉基础模型(VFM)压缩为紧凑的专家模型。解决方案的关键在于引入了一种实例感知的像素级对比损失,该损失融合了掩码和类别得分,以挖掘信息丰富的负样本并强化实例间的边界,从而在领域适应和知识蒸馏过程中保持对比信号的一致性,实现教师模型与学生模型嵌入的一致性对齐,并充分利用未标注图像。
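
其中"实例感知的像素级对比损失"的大致形态可以用如下草图示意(假设性实现:得分融合与难负样本选取方式均为示意,并非官方损失定义):用掩码得分与类别得分的乘积作为质量分,挑选高分却属于其他实例的像素作为难负样本,再对每个实例原型做 InfoNCE。

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(pix_emb, inst_id, quality, tau=0.1, k_neg=64):
    """pix_emb: (N,D) 像素嵌入;inst_id: (N,) 实例标签;
    quality: (N,) 掩码得分×类别得分,用于挖掘高分难负样本(示意)。"""
    pix_emb = F.normalize(pix_emb, dim=1)
    losses = []
    for i in torch.unique(inst_id):
        pos = pix_emb[inst_id == i]                            # (P, D)
        proto = F.normalize(pos.mean(0, keepdim=True), dim=1)  # 实例原型作为锚
        neg_q = quality.masked_fill(inst_id == i, -1.0)        # 排除同实例像素
        k = min(k_neg, int((inst_id != i).sum()))
        neg = pix_emb[torch.topk(neg_q, k).indices]            # 高质量难负样本
        l_pos = (pos @ proto.t()) / tau                        # (P, 1)
        l_neg = (pos @ neg.t()) / tau                          # (P, K)
        logits = torch.cat([l_pos, l_neg], dim=1)
        losses.append(-F.log_softmax(logits, dim=1)[:, 0].mean())
    return torch.stack(losses).mean()

emb = torch.randn(500, 32)
ids = torch.randint(0, 5, (500,))
q = torch.rand(500)
print(instance_contrastive_loss(emb, ids, q).item())
```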

链接: https://arxiv.org/abs/2505.21904
作者: Pardis Taghavi,Tian Liu,Renjie Li,Reza Langari,Zhengzhong Tu
机构: Texas A&M University (得克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instance segmentation demands costly per-pixel annotations and large models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pretrained vision foundation models (VFM) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher(s) via self-training with contrastive pixel calibration, (2) distillation into a compact student via a unified multi-objective loss that couples standard supervision and pseudo-labels with our instance-aware pixel-wise contrastive term, and (3) fine-tuning on labeled data to remove residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11X smaller student surpasses its adapted VFM teacher(s) by +3.4 AP (33.9 vs. 30.5) and +1.5 AP (16.7 vs. 15.2) and outperforms state-of-the-art semi-supervised approaches.
zh

[CV-110] Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation IJCAI2025

【速读】:该论文旨在解决少样本医学图像分割(Few-Shot Medical Image Segmentation, FSMIS)中由于支持图像仅通过随机采样或局部平均生成多个原型而导致的边界模糊问题。其解决方案的关键在于设计了一个支持自预测(Support Self-Prediction, SSP)模块,用于识别对清晰分割边界至关重要的弱特征,并通过硬原型生成(Hard Prototypes Generation, HPG)模块基于这些弱特征生成多个硬原型。此外,还引入了多相似性图融合(Multiple Similarity Maps Fusion, MSMF)模块和边界损失以进一步提升分割性能。
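
SSP 模块"自预测找弱特征"的思路可以用如下草图示意(假设性实现,阈值等参数为示意):先对支持特征做掩码平均池化得到全局前景原型,用余弦相似度对支持图自身做一次预测,真实前景中被漏检的位置即为弱特征,可据此生成硬原型。

```python
import torch
import torch.nn.functional as F

def mine_weak_features(feat, mask, thresh=0.5):
    """feat: (C,H,W) 支持特征;mask: (H,W) 支持掩码(0/1)。"""
    # 1) 掩码平均池化得到全局前景原型
    proto = (feat * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-6)   # (C,)
    # 2) 用原型对支持图自身做余弦相似度预测
    sim = F.cosine_similarity(feat, proto[:, None, None], dim=0)  # (H,W)
    pred = (sim > thresh).float()
    # 3) 真实前景中被漏检的位置 = 弱特征(用于生成硬原型)
    weak = (mask == 1) & (pred == 0)
    return weak

feat = torch.randn(64, 32, 32)
mask = torch.zeros(32, 32)
mask[8:24, 8:24] = 1
weak = mine_weak_features(feat, mask)
print("弱特征像素数:", int(weak.sum()))
```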

链接: https://arxiv.org/abs/2505.21897
作者: Jianchao Jiang,Haofeng Zhang
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 9 figures, 9 tables, accepted by IJCAI 2025

点击查看摘要

Abstract:Few-Shot Medical Image Segmentation (FSMIS) has been widely used to train a model that can perform segmentation from only a few annotated images. However, most existing prototype-based FSMIS methods generate multiple prototypes from the support image solely by random sampling or local averaging, which can cause particularly severe boundary blurring due to the tendency for normal features accounting for the majority of features of a specific category. Consequently, we propose to focus more attention to those weaker features that are crucial for clear segmentation boundary. Specifically, we design a Support Self-Prediction (SSP) module to identify such weak features by comparing true support mask with one predicted by global support prototype. Then, a Hard Prototypes Generation (HPG) module is employed to generate multiple hard prototypes based on these weak features. Subsequently, a Multiple Similarity Maps Fusion (MSMF) module is devised to generate final segmenting mask in a dual-path fashion to mitigate the imbalance between foreground and background in medical images. Furthermore, we introduce a boundary loss to further constraint the edge of segmentation. Extensive experiments on three publicly available medical image datasets demonstrate that our method achieves state-of-the-art performance. Code is available at this https URL.
zh

[CV-111] Hyperspectral Gaussian Splatting

【速读】:该论文旨在解决高光谱成像(Hyperspectral Imaging, HSI)在农业应用中对植物营养成分进行非破坏性估计和精确测定的挑战,特别是在三维场景重建和光谱分布一致性方面存在的局限性。现有方法如Neural Radiance Field(NeRF)在训练时间和渲染速度上存在不足,难以实现高效的三维显式重建与全波段新视角合成。该论文提出的解决方案是 Hyperspectral Gaussian Splatting (HS-GS),其关键在于结合先进的三维高斯泼溅(3D Gaussian Splatting, 3DGS)与扩散模型,通过引入波长编码器生成波长特定的球面谐波偏移,并采用基于Kullback–Leibler散度的损失函数减少渲染图像与真实数据之间的光谱分布差异,从而提升模型对细粒度反射率变化的捕捉能力及去噪效果。
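
其中基于 KL 散度的光谱分布损失可以用如下草图示意(假设性实现,归一化与散度方向均为示意):把每个像素沿波长维的响应归一化为概率分布,再计算真值光谱分布与渲染光谱分布之间的 KL 散度。

```python
import torch

def spectral_kl_loss(rendered, gt, eps=1e-8):
    """rendered / gt: (B, L, H, W),L 为波长通道数。
    沿波长维归一化为分布后计算 KL(gt || rendered)(示意定义)。"""
    p = gt.clamp_min(eps)
    p = p / p.sum(dim=1, keepdim=True)
    q = rendered.clamp_min(eps)
    q = q / q.sum(dim=1, keepdim=True)
    return (p * (p / q).log()).sum(dim=1).mean()

rendered = torch.rand(2, 31, 16, 16)   # 31 个波段的渲染结果(示意)
gt = torch.rand(2, 31, 16, 16)
print(spectral_kl_loss(rendered, gt).item())
```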

链接: https://arxiv.org/abs/2505.21890
作者: Sunil Kumar Narayanan,Lingjun Zhao,Lu Gan,Yongsheng Chen
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral imaging (HSI) has been widely used in agricultural applications for non-destructive estimation of plant nutrient composition and precise determination of nutritional elements in samples. Recently, 3D reconstruction methods have been used to create implicit neural representations of HSI scenes, which can help localize the target object’s nutrient composition spatially and spectrally. Neural Radiance Field (NeRF) is a cutting-edge implicit representation that can render hyperspectral channel compositions of each spatial location from any viewing direction. However, it faces limitations in training time and rendering speed. In this paper, we propose Hyperspectral Gaussian Splatting (HS-GS), which combines the state-of-the-art 3D Gaussian Splatting (3DGS) with a diffusion model to enable 3D explicit reconstruction of the hyperspectral scenes and novel view synthesis for the entire spectral range. To enhance the model’s ability to capture fine-grained reflectance variations across the light spectrum and leverage correlations between adjacent wavelengths for denoising, we introduce a wavelength encoder to generate wavelength-specific spherical harmonics offsets. We also introduce a novel Kullback–Leibler divergence-based loss to mitigate the spectral distribution gap between the rendered image and the ground truth. A diffusion model is further applied for denoising the rendered images and generating photorealistic hyperspectral images. We present extensive evaluations on five diverse hyperspectral scenes from the Hyper-NeRF dataset to show the effectiveness of our proposed HS-GS framework. The results demonstrate that HS-GS achieves new state-of-the-art performance among all previously published methods. Code will be released upon publication.
zh

[CV-112] EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

【速读】:该论文旨在解决视频扩散模型(VDM)中3D相机控制的局限性,特别是现有方法依赖于高误差的点云估计生成锚定视频以及需要大量相机轨迹标注所带来的资源消耗问题。其解决方案的关键在于提出EPiC框架,通过基于第一帧可见性的源视频掩码自动生成高质量锚定视频,从而无需昂贵的相机轨迹标注,并确保高对齐性。此外,引入Anchor-ControlNet作为轻量级条件模块,将可见区域的锚定视频引导整合到预训练VDM中,仅使用不到1%的主干模型参数,实现了高效且参数较少的训练过程。

链接: https://arxiv.org/abs/2505.21876
作者: Zun Wang,Jaemin Cho,Jialu Li,Han Lin,Jaehong Yoon,Yue Zhang,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:Recent approaches on 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions to pretrained VDMs, with less than 1% of backbone model parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, without requiring modifications to the diffusion model backbone typically needed to mitigate rendering misalignments. Although being trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
zh

[CV-113] Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection

【速读】:该论文旨在解决小目标检测(Small Object Detection, SOD)中由于信息有限和模型类别预测分数较低所带来的挑战。传统基于Transformer的检测框架在处理SOD时存在不足,如CNN主干网络难以捕捉必要的上下文信息,Transformer编码器中的多头注意力机制难以有效关注小目标并导致特征模糊。为了解决这些问题,论文提出了一种名为Cross-DINO的新方法,其关键在于引入深度MLP网络以聚合短距离和长距离信息,并通过Cross Coding Twice Module (CCTM) 将这些初始表示与Transformer编码器特征进行融合,从而增强小目标的细节。此外,还引入了Category-Size (CS) 软标签和对应的Boost Loss函数,以提升模型的类别预测分数。

链接: https://arxiv.org/abs/2505.21868
作者: Guiping Cao,Wenjian Huang,Xiangyuan Lan,Jianguo Zhang,Dongmei Jiang,Yaowei Wang
机构: Southern University of Science and Technology (南方科技大学); Pengcheng Laboratory (鹏城实验室); Pazhou Laboratory (琶洲实验室); Harbin Institute of Technology at Shenzhen (哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE TRANSACTIONS ON MULTIMEDIA

点击查看摘要

Abstract:Small Object Detection (SOD) poses significant challenges due to limited information and the model’s low class prediction score. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone network, specialized in aggregating local information, struggles to capture the necessary contextual information for SOD. The multiple attention layers in the Transformer Encoder face difficulties in effectively attending to small objects and can also lead to blurring of features. Furthermore, the model’s lower class prediction score of small objects compared to large objects further increases the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. This approach incorporates the deep MLP network to aggregate initial feature representations with both short and long range information for SOD. Then, a new Cross Coding Twice Module (CCTM) is applied to integrate these initial representations to the Transformer Encoder feature, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), integrating the Category and Size of objects. By treating CS as new ground truth, we propose a new loss function called Boost Loss to improve the class prediction score of the model. Extensive experimental results on COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% APs on COCO for SOD with only 45M parameters, outperforming DINO by +4.4% APs (36.4% vs. 32.0%) with fewer parameters and FLOPs, under the 12-epoch training setting. The source codes will be available at this https URL.
zh

[CV-114] owards Scalable Language-Image Pre-training for 3D Medical Imaging

【速读】:该论文旨在解决语言-图像预训练(language-image pre-training)在3D医学影像(如CT和MRI)中计算需求高、难以在大规模未经整理的临床数据上进行有效训练的问题。其解决方案的关键在于提出了一种可扩展的预训练框架——分层注意力语言-图像预训练(Hierarchical attention for Language-Image Pre-training, HLIP),该框架采用受放射学数据自然层次结构(切片、扫描、研究)启发的轻量级分层注意力机制,从而实现了良好的泛化能力和计算效率,使得直接在未经整理的临床数据上进行预训练成为可能。
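
分层注意力的"切片 → 扫描 → 研究"层级可以用如下 PyTorch 草图示意(假设性实现,仅保留层级池化式结构,并非官方代码):在每一层级内部做自注意力后聚合为上一层级的 token,从而避免对整项研究的全部 token 做一次全局注意力。

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """示意:切片内 -> 扫描内 -> 研究级 的三级注意力(非官方实现)。"""
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.slice_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scan_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.study_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (scans, slices, patches, dim) —— 一项研究内的全部 token
        S, L, P, D = x.shape
        t = x.reshape(S * L, P, D)
        t, _ = self.slice_attn(t, t, t)            # 级别 1:切片内注意力
        slice_tok = t.mean(1).reshape(S, L, D)     # 每个切片聚合为一个 token
        s, _ = self.scan_attn(slice_tok, slice_tok, slice_tok)   # 级别 2:扫描内
        scan_tok = s.mean(1).unsqueeze(0)          # (1, S, D) 每次扫描一个 token
        out, _ = self.study_attn(scan_tok, scan_tok, scan_tok)   # 级别 3:研究级
        return out.mean(1)                         # 研究级表示 (1, D)

ha = HierarchicalAttention()
x = torch.randn(3, 20, 49, 96)  # 3 次扫描、每次 20 层切片、每片 49 个 patch
print(ha(x).shape)
```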

链接: https://arxiv.org/abs/2505.21862
作者: Chenhui Zhao,Yiwei Lyu,Asadur Chowdury,Edward Harake,Akhil Kondepudi,Akshay Rao,Xinhai Hou,Honglak Lee,Todd Hollon
机构: University of Michigan(密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Language-image pre-training has demonstrated strong performance in 2D medical imaging, but its success in 3D modalities such as CT and MRI remains limited due to the high computational demands of volumetric data, which pose a significant barrier to training on large-scale, uncurated clinical studies. In this study, we introduce Hierarchical attention for Language-Image Pre-training (HLIP), a scalable pre-training framework for 3D medical imaging. HLIP adopts a lightweight hierarchical attention mechanism inspired by the natural hierarchy of radiology data: slice, scan, and study. This mechanism exhibits strong generalizability, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. Moreover, the computational efficiency of HLIP enables direct training on uncurated datasets. Trained on 220K patients with 3.13 million scans for brain MRI and 240K patients with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +32.4% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +1.4% and +6.9% macro AUC on head CT benchmarks RSNA and CQ500, respectively. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at this https URL
zh

[CV-115] Rethinking Gradient-based Adversarial Attacks on Point Cloud Classification

【速读】:该论文旨在解决梯度基对抗攻击在点云分类模型中因依赖统一更新规则而未能考虑点云异质性,导致扰动过度且易被感知的问题。其解决方案的关键在于提出两种新策略:WAAttack通过引入加权梯度和自适应步长策略,根据每个点的局部结构和敏感性动态调整更新,实现更精准和隐蔽的扰动;SubAttack则通过将点云分解为子集并集中扰动结构关键区域,提升攻击效果与隐蔽性。
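
加权梯度与自适应步长的思想可以用如下草图示意(假设性实现:权重定义与步长衰减方式均为示意):按各点梯度范数得到逐点权重,使扰动集中在敏感点上,并让步长随迭代衰减。

```python
import torch

def weighted_gradient_step(points, grad, base_lr, it, decay=0.95):
    """points / grad: (N,3)。按逐点梯度范数加权,步长随迭代自适应衰减。"""
    per_point = grad.norm(dim=1, keepdim=True)        # (N,1) 各点敏感度
    weight = per_point / (per_point.mean() + 1e-12)   # 逐点权重(示意定义)
    step = base_lr * (decay ** it)                    # 自适应衰减步长(示意)
    return points + step * weight * torch.sign(grad)  # 加权符号更新

pts = torch.randn(1024, 3)
grad = torch.randn(1024, 3)
adv = weighted_gradient_step(pts, grad, base_lr=0.01, it=5)
print((adv - pts).abs().max().item())
```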

链接: https://arxiv.org/abs/2505.21854
作者: Jun Chen,Xinke Li,Mingyue Xu,Tianrui Li,Chongshou Li
机构: Southwest Jiaotong University (西南交通大学); City University of Hong Kong (香港城市大学); Ministry of Education (教育部); Southwest Jiaotong University (西南交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Gradient-based adversarial attacks have become a dominant approach for evaluating the robustness of point cloud classification models. However, existing methods often rely on uniform update rules that fail to consider the heterogeneous nature of point clouds, resulting in excessive and perceptible perturbations. In this paper, we rethink the design of gradient-based attacks by analyzing the limitations of conventional gradient update mechanisms and propose two new strategies to improve both attack effectiveness and imperceptibility. First, we introduce WAAttack, a novel framework that incorporates weighted gradients and an adaptive step-size strategy to account for the non-uniform contribution of points during optimization. This approach enables more targeted and subtle perturbations by dynamically adjusting updates according to the local structure and sensitivity of each point. Second, we propose SubAttack, a complementary strategy that decomposes the point cloud into subsets and focuses perturbation efforts on structurally critical regions. Together, these methods represent a principled rethinking of gradient-based adversarial attacks for 3D point cloud classification. Extensive experiments demonstrate that our approach outperforms state-of-the-art baselines in generating highly imperceptible adversarial examples. Code will be released upon paper acceptance.
zh
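以下给出一个按逐点梯度幅值加权、步长自适应的点云攻击最小示意(仅为说明"非统一更新"的思路;`weighted_grad_attack` 及其中的加权方案、L∞ 约束方式均为假设,并非论文原实现):

```python
import torch

def weighted_grad_attack(model, pts, label, steps=50, eps=0.05, base_lr=0.01):
    # pts: (B, N, 3) 点云;label: (B,) 真实类别
    adv = pts.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(adv), label)
        grad, = torch.autograd.grad(loss, adv)
        g_norm = grad.norm(dim=-1, keepdim=True)                 # (B, N, 1) 逐点梯度幅值
        w = g_norm / (g_norm.mean(dim=1, keepdim=True) + 1e-12)  # 逐点权重(假设方案)
        step = base_lr * w * grad / (g_norm + 1e-12)             # 归一化方向 × 自适应步长
        adv = (adv + step).detach()
        adv = pts + (adv - pts).clamp(-eps, eps)                 # 逐坐标 L∞ 扰动约束
        adv.requires_grad_(True)
    return adv.detach()

# 玩具示例:随机"分类器" + 随机点云
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(1024 * 3, 10))
pts = torch.randn(2, 1024, 3)
adv = weighted_grad_attack(toy_model, pts, torch.tensor([0, 1]), steps=5)
print((adv - pts).abs().max())  # 不超过 eps
```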

[CV-116] Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task ACL

【速读】:该论文试图解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在抽象视觉推理(Abstract Visual Reasoning, AVR)任务中的表现不足问题,尤其是其在高阶推理能力上的欠缺。现有AVR基准测试主要关注单步推理,忽视了推理过程的多阶段特性,且传统评估指标仅关注最终结果而未考虑中间步骤的正确性。为解决这一问题,研究者提出了MultiStAR,一个基于RAVEN的多阶段AVR基准,用于评估不同复杂度下的推理能力,并引入MSEval作为新的评估指标,以同时考量中间步骤和最终结果的正确性。

链接: https://arxiv.org/abs/2505.21850
作者: Yanbei Jiang,Yihao Ding,Chao Lei,Jiayang Ao,Jey Han Lau,Krista A. Ehinger
机构: The University of Melbourne(墨尔本大学); University of Sydney(悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ACL Findings

点击查看摘要

Abstract:Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception. Existing AVR benchmarks focus on single-step reasoning, emphasizing the end result but neglecting the multi-stage nature of the reasoning process. Past studies found that MLLMs struggle with these benchmarks, but they do not explain how the models fail. To address this gap, we introduce MultiStAR, a Multi-Stage AVR benchmark, based on RAVEN, designed to assess reasoning across varying levels of complexity. Additionally, existing metrics like accuracy only focus on the final outcomes and do not account for the correctness of intermediate steps. Therefore, we propose a novel metric, MSEval, which considers the correctness of intermediate steps in addition to the final outcomes. We conduct comprehensive experiments on MultiStAR using 17 representative closed-source and open-source MLLMs. The results reveal that while existing MLLMs perform adequately on basic perception tasks, they continue to face challenges in more complex rule detection stages.
zh
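为说明"同时计入中间步骤与最终结果"的评测思路,下面给出一个 MSEval 式打分的极简示意(线性加权形式与 α 取值为此处的假设,论文的具体计分方式请以原文为准):

```python
def ms_eval(pred_steps, gold_steps, pred_final, gold_final, alpha=0.5):
    """同时考虑中间步骤正确率与最终答案正确性的简化打分。"""
    step_acc = sum(p == g for p, g in zip(pred_steps, gold_steps)) / max(len(gold_steps), 1)
    final_acc = float(pred_final == gold_final)
    return alpha * step_acc + (1 - alpha) * final_acc

# 三个中间步骤对了两个,最终答案正确 => 0.5*(2/3) + 0.5*1 ≈ 0.83
print(ms_eval(["A", "C", "B"], ["A", "B", "B"], "D", "D"))
```

相比只看最终答案的 accuracy,这种打分能区分"蒙对结果"与"推理链条大体正确"的模型。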

[CV-117] FPAN: Mitigating Replication in Diffusion Models through the Fine-Grained Probabilistic Addition of Noise to Token Embeddings

【速读】:该论文旨在解决扩散模型在生成图像时对训练数据的复制问题,这一问题引发了严重的隐私担忧,尤其是在训练数据包含敏感信息的情况下。论文提出的解决方案的关键在于引入一种细粒度的概率噪声注入技术(Fine-grained Probabilistic Addition of Noise, FPAN),该技术通过概率性地向标记嵌入中添加更大量的噪声,从而有效降低图像复制现象,同时保持图像质量不受显著影响。

链接: https://arxiv.org/abs/2505.21848
作者: Jingqi Xu,Chenghao Li,Yuke Zhang,Peter A. Beerel
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable potential in generating high-quality images. However, their tendency to replicate training data raises serious privacy concerns, particularly when the training datasets contain sensitive or private information. Existing mitigation strategies primarily focus on reducing image duplication, modifying the cross-attention mechanism, and altering the denoising backbone architecture of diffusion models. Moreover, recent work has shown that adding a consistent small amount of noise to text embeddings can reduce replication to some degree. In this work, we begin by analyzing the impact of adding varying amounts of noise. Based on our analysis, we propose a fine-grained noise injection technique that probabilistically adds a larger amount of noise to token embeddings. We refer to our method as Fine-grained Probabilistic Addition of Noise (FPAN). Through our extensive experiments, we show that our proposed FPAN can reduce replication by an average of 28.78% compared to the baseline diffusion model without significantly impacting image quality, and outperforms the prior consistent-magnitude-noise-addition approach by 26.51%. Moreover, when combined with other existing mitigation methods, our FPAN approach can further reduce replication by up to 16.82% with similar, if not improved, image quality.
zh
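下面的 Python 片段示意"以一定概率向 token 嵌入注入较大幅度噪声"这一核心操作(p、sigma 的取值与逐 token 伯努利掩码的具体形式均为假设,非官方实现):

```python
import torch

def fpan_noise(token_emb, p=0.3, sigma=0.1):
    """细粒度概率噪声注入示意:以概率 p 为每个 token 嵌入添加高斯噪声。
    token_emb: (B, L, D) 文本编码器输出的 token 嵌入
    """
    mask = (torch.rand(token_emb.shape[:2], device=token_emb.device) < p)  # (B, L)
    noise = torch.randn_like(token_emb) * sigma
    return token_emb + noise * mask.unsqueeze(-1)

emb = torch.randn(4, 77, 768)   # 类似 CLIP 文本嵌入的形状
noisy = fpan_noise(emb)
print((noisy - emb).abs().mean())
```

与"对所有嵌入加一致小噪声"的做法相比,这里的逐 token 概率注入允许在不整体破坏语义的前提下施加更大的扰动幅度。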

[CV-118] RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers ICML2025

【速读】:该论文旨在解决Vision Transformer(ViT)在推理过程中由于前馈网络(FFN)层导致的高延迟问题,尤其是在模型规模增大时这一问题更加显著。其解决方案的关键在于提出一种新颖的通道空闲机制,通过在每个FFN层中保留部分特征通道并绕过非线性激活函数,形成线性路径以实现结构重参数化,从而在保持或提升准确率的前提下显著降低推理延迟。

链接: https://arxiv.org/abs/2505.21847
作者: Xuwei Xu,Yang Li,Yudong Chen,Jiajun Liu,Sen Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICML2025

点击查看摘要

Abstract:We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact growing as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps or even higher accuracies on larger models. In particular, RePaViT-Large and RePaViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents an auspicious direction for efficient ViTs. Source code is available at this https URL.
zh
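通道空闲机制可以用如下最小 PyTorch 示意来理解:隐藏层的一部分通道绕过激活函数、保持线性(`idle_ratio` 等取值与切分方式为假设,非官方实现):

```python
import torch
import torch.nn as nn

class ChannelIdleFFN(nn.Module):
    """通道空闲 FFN 示意:前 idle_ratio 比例的隐藏通道跳过非线性激活,
    形成推理期可做结构重参数化的线性通路。"""
    def __init__(self, dim, hidden, idle_ratio=0.75):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.act = nn.GELU()
        self.n_idle = int(hidden * idle_ratio)

    def forward(self, x):
        h = self.fc1(x)
        idle, active = h[..., :self.n_idle], h[..., self.n_idle:]
        h = torch.cat([idle, self.act(active)], dim=-1)  # 空闲通道保持线性
        return self.fc2(h)

ffn = ChannelIdleFFN(192, 768)
print(ffn(torch.randn(2, 197, 192)).shape)  # torch.Size([2, 197, 192])
```

推理时,空闲通道对应的 fc1/fc2 子矩阵构成纯线性支路,可折叠为单个等效线性层,这正是结构重参数化得以降低延迟的原因。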

[CV-119] Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation

【速读】:该论文试图解决在开放词汇语义分割(Open-Vocabulary Semantic Segmentation, OVSS)任务中,测试时自适应(Test-Time Adaptation, TTA)方法被完全忽视的问题。其解决方案的关键在于提出一种针对分割任务的多层级、多提示(Multi-Level and Multi-Prompt, MLMP)熵最小化方法,该方法结合了视觉编码器中间层的特征,并在全局分类(CLS)标记和局部像素级层面使用不同的文本提示模板,从而实现对视觉语言模型(Vision-Language Models, VLMs)的有效测试时自适应。

链接: https://arxiv.org/abs/2505.21844
作者: Mehrdad Noori,David Osowiechi,Gustavo Adolfo Vargas Hakim,Ali Bahri,Moslem Yazdanpanah,Sahar Dastani,Farzad Beizaee,Ismail Ben Ayed,Christian Desrosiers
机构: LIVIA, ÉTS Montréal, Canada; International Laboratory on Learning Systems (ILLS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, test-time adaptation has attracted wide interest in the context of vision-language models for image classification. However, to the best of our knowledge, the problem is completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach could be used as plug-and-play for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, seven segmentation datasets, and 15 common corruptions, with a total of 82 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines.
zh
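下面示意测试时对多个文本提示模板做熵最小化的全局(CLS)部分,逐像素层面与中间层特征融合从略(函数名、模板数与温度系数均为此处的假设):

```python
import torch
import torch.nn.functional as F

def multi_prompt_entropy_loss(image_feat, text_feats_per_template, tau=0.01):
    """多提示熵最小化的简化示意(仅全局层面)。
    image_feat: (B, D) 归一化图像特征
    text_feats_per_template: list of (C, D),每个元素对应一种提示模板的类别文本特征
    """
    losses = []
    for txt in text_feats_per_template:
        logits = image_feat @ txt.t() / tau                    # (B, C)
        prob = F.softmax(logits, dim=-1)
        ent = -(prob * prob.clamp_min(1e-12).log()).sum(-1)    # 每个样本的预测熵
        losses.append(ent.mean())
    # 测试时对该损失做少量梯度步(如只更新归一化层参数)即可完成自适应
    return torch.stack(losses).mean()

B, C, D = 2, 10, 512
img = F.normalize(torch.randn(B, D), dim=-1)
txts = [F.normalize(torch.randn(C, D), dim=-1) for _ in range(3)]
print(multi_prompt_entropy_loss(img, txts))
```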

[CV-120] UniMoGen: Universal Motion Generation

【速读】:该论文试图解决现有运动生成方法对特定骨骼结构的依赖问题,这限制了其在不同角色间的通用性。解决方案的关键在于提出UniMoGen,一个基于UNet的扩散模型,能够实现骨骼无关的运动生成。该模型通过动态处理每个角色所需的必要关节,实现了骨骼无关性和计算效率,并支持通过风格和轨迹输入进行控制以及从历史帧延续运动。

链接: https://arxiv.org/abs/2505.21837
作者: Aliasghar Khani,Arianna Rampini,Evan Atherton,Bruno Roy
机构: Autodesk Research(欧特克研究); Canada(加拿大)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Motion generation is a cornerstone of computer graphics, animation, gaming, and robotics, enabling the creation of realistic and varied character movements. A significant limitation of existing methods is their reliance on specific skeletal structures, which restricts their versatility across different characters. To overcome this, we introduce UniMoGen, a novel UNet-based diffusion model designed for skeleton-agnostic motion generation. UniMoGen can be trained on motion data from diverse characters, such as humans and animals, without the need for a predefined maximum number of joints. By dynamically processing only the necessary joints for each character, our model achieves both skeleton agnosticism and computational efficiency. Key features of UniMoGen include controllability via style and trajectory inputs, and the ability to continue motions from past frames. We demonstrate UniMoGen’s effectiveness on the 100style dataset, where it outperforms state-of-the-art methods in diverse character motion generation. Furthermore, when trained on both the 100style and LAFAN1 datasets, which use different skeletons, UniMoGen achieves high performance and improved efficiency across both skeletons. These results highlight UniMoGen’s potential to advance motion generation by providing a flexible, efficient, and controllable solution for a wide range of character animations.
zh

[CV-121] HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation

【速读】:该论文试图解决在真实观看条件下,对高动态范围(High Dynamic Range, HDR)与标准动态范围(Standard Dynamic Range, SDR)视频内容进行对比性主观质量评估缺乏大规模数据支撑的问题。解决方案的关键在于构建了HDRSDR-VQA数据集,该数据集包含960个视频,源自54个多样化的源序列,在九种失真级别下分别以HDR和SDR格式呈现,并通过超过22,000次成对比较获取可靠的感知质量分数(JOD),从而支持HDR与SDR内容的直接对比分析。

链接: https://arxiv.org/abs/2505.21831
作者: Bowen Chen,Cheng-han Lee,Yixu Chen,Zaixi Shang,Hai Wei,Alan C. Bovik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce HDRSDR-VQA, a large-scale video quality assessment dataset designed to facilitate comparative analysis between High Dynamic Range (HDR) and Standard Dynamic Range (SDR) content under realistic viewing conditions. The dataset comprises 960 videos generated from 54 diverse source sequences, each presented in both HDR and SDR formats across nine distortion levels. To obtain reliable perceptual quality scores, we conducted a comprehensive subjective study involving 145 participants and six consumer-grade HDR-capable televisions. A total of over 22,000 pairwise comparisons were collected and scaled into Just-Objectionable-Difference (JOD) scores. Unlike prior datasets that focus on a single dynamic range format or use limited evaluation protocols, HDRSDR-VQA enables direct content-level comparison between HDR and SDR versions, supporting detailed investigations into when and why one format is preferred over the other. The open-sourced part of the dataset is publicly available to support further research in video quality assessment, content-adaptive streaming, and perceptual model development.
zh
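从成对比较恢复连续标度分数,通常可用 Thurstone 或 Bradley-Terry 一类概率模型;下面给出一个极简的最大似然拟合示意(并非论文所用的官方 JOD 标定流程,仅说明"成对比较 → 连续分数"的原理):

```python
import torch

def scale_pairwise(n_items, pairs, iters=500, lr=0.1):
    """Bradley-Terry 式标度:pairs 为 (winner_idx, loser_idx) 列表,
    以 sigmoid(s_w - s_l) 建模胜率并最大化对数似然。"""
    s = torch.zeros(n_items, requires_grad=True)
    opt = torch.optim.Adam([s], lr=lr)
    w = torch.tensor([p[0] for p in pairs])
    l = torch.tensor([p[1] for p in pairs])
    for _ in range(iters):
        opt.zero_grad()
        loss = -torch.nn.functional.logsigmoid(s[w] - s[l]).mean()
        loss.backward()
        opt.step()
    return (s - s.mean()).detach()   # 分数存在平移不定性,居中处理

# 项 0 胜过 1 两次、胜过 2 一次,1 胜过 2:期望 s0 > s1 > s2
print(scale_pairwise(3, [(0, 1), (0, 2), (1, 2), (0, 1)]))
```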

[CV-122] ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation

【速读】:该论文旨在解决扩散模型在推理过程中因迭代去噪过程导致的计算开销大、难以在资源受限环境中部署的问题。现有加速方法通常采用统一策略,无法捕捉扩散生成过程中的时间变化,而常见的“剪枝后微调”策略由于预训练权重与最终参数之间的不匹配导致次优性能。论文提出的解决方案关键在于引入ALTER:一种将扩散模型转换为高效时间专家混合体的统一框架,通过可训练的超网络实现层剪枝、专家路由和模型微调的单阶段优化,动态生成层剪枝决策并管理时间步路由至专用剪枝专家子网络,从而在保持高生成质量的同时显著提升效率。

链接: https://arxiv.org/abs/2505.21817
作者: Xiaomeng Yang,Lei Lu,Qihui Fan,Changdi Yang,Juyi Lin,Yanzhi Wang,Xuan Zhang,Shangqian Gao
机构: Northeastern University (东北大学); Florida State University (佛罗里达州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images. However, their iterative denoising process results in significant computational overhead during inference, limiting their practical deployment in resource-constrained environments. Existing acceleration methods often adopt uniform strategies that fail to capture the temporal variations during diffusion generation, while the commonly adopted sequential pruning-then-fine-tuning strategy suffers from sub-optimality due to the misalignment between pruning decisions made on pretrained weights and the model's final parameters. To address these limitations, we introduce ALTER: All-in-One Layer Pruning and Temporal Expert Routing, a unified framework that transforms diffusion models into a mixture of efficient temporal experts. ALTER achieves a single-stage optimization that unifies layer pruning, expert routing, and model fine-tuning by employing a trainable hypernetwork, which dynamically generates layer pruning decisions and manages timestep routing to specialized, pruned expert sub-networks throughout the ongoing fine-tuning of the UNet. This unified co-optimization strategy enables significant efficiency gains while preserving high generative quality. Specifically, ALTER achieves the same level of visual fidelity as the original 50-step Stable Diffusion v2.1 model while utilizing only 25.9% of its total MACs with just 20 inference steps and delivering a 3.64x speedup through 35% sparsity.
zh

[CV-123] SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation

【速读】:该论文旨在解决少样本分割(few-shot segmentation)问题,即在仅有少量标注示例的情况下对未见过的物体类别进行分割。其解决方案的关键在于利用Segment Anything 2 (SAM2) 的潜在语义结构,通过最小的任务特定修改,提出SANSA(Semantically Aligned Segment Anything 2)框架,使该结构显式化并适配少样本分割任务,从而在保持高效性和紧凑性的同时,显著提升分割性能。

链接: https://arxiv.org/abs/2505.21795
作者: Claudia Cuttano,Gabriele Trivigno,Giuseppe Averta,Carlo Masone
机构: Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports flexible interaction via various prompts (points, boxes, or scribbles), and remains significantly faster and more compact than prior approaches. Code is available at this https URL.
zh

[CV-124] Compositional Scene Understanding through Inverse Generative Modeling ICML2025

【速读】:该论文试图解决如何利用生成式 AI (Generative AI) 不仅合成视觉内容,还能从自然图像中理解场景属性的问题。其核心挑战在于如何从与训练数据差异较大的图像中推断出场景结构,并实现对新场景的鲁棒泛化。解决方案的关键在于将场景理解建模为逆向生成建模问题,通过组合更小的模型来构建视觉生成模型,从而在不同场景片段上进行有效推理,进而推断场景中的物体集合及全局场景因素。这一方法还可直接应用于预训练的文本到图像生成模型,以实现零样本多对象感知。

链接: https://arxiv.org/abs/2505.21780
作者: Yanbo Wang,Justin Dauwels,Yilun Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025, Webpage: this https URL

点击查看摘要

Abstract:Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model to best fit a given natural image. To enable this procedure to infer scene structure from images substantially different than those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at this https URL.
zh
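"逆向生成建模"的核心可以用如下玩具代码示意:固定生成器,仅对条件参数做梯度优化以拟合给定图像,优化所得参数即为推断出的场景因子(生成器结构与重建损失均为此处的假设):

```python
import torch

def invert_scene_params(generator, target, n_params=8, steps=300, lr=0.05):
    """固定生成器,优化条件参数 z 使生成结果最贴合目标图像。"""
    z = torch.zeros(1, n_params, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((generator(z) - target) ** 2).mean()  # 重建误差(可换为感知损失)
        loss.backward()
        opt.step()
    return z.detach()

# 玩具示例:用一个冻结的小 MLP 充当"生成模型"
gen = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 3 * 16 * 16))
for p in gen.parameters():
    p.requires_grad_(False)
target = torch.randn(1, 3 * 16 * 16)
print(invert_scene_params(gen, target).shape)  # torch.Size([1, 8])
```

论文的组合式思路相当于把 `generator` 换成多个小模型(每个物体/全局因子一个)的组合,再对各自的条件参数联合求解。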

[CV-125] MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

【速读】:该论文旨在解决当前视觉语言模型(VLMs)在处理复杂、现实世界中的多模态表格(multimodal tables)推理任务时表现不足的问题。其解决方案的关键在于引入MMTBENCH,这是一个包含500个来自不同现实来源的多模态表格以及4021个问答对的基准测试集,涵盖了多种问题类型、推理类型和表格类型,用以评估和推动多模态表格理解技术的发展。

链接: https://arxiv.org/abs/2505.21771
作者: Prasham Yatinkumar Titiya,Jainil Trivedi,Chitta Baral,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal tables, which integrate semi-structured data with visual elements such as charts and maps, are ubiquitous across real-world domains, yet they pose a formidable challenge to current vision-language models (VLMs). While Large Language Models (LLMs) and VLMs have demonstrated strong capabilities in text and image understanding, their performance on complex, real-world multimodal table reasoning remains unexplored. To bridge this gap, we introduce MMTBENCH (Multimodal Table Benchmark), a benchmark consisting of 500 real-world multimodal tables drawn from diverse real-world sources, with a total of 4021 question-answer pairs. MMTBENCH questions cover four question types (Explicit, Implicit, Answer Mention, and Visual Based), five reasoning types (Mathematical, Extrema Identification, Fact Verification, Vision Based, and Others), and eight table types (Single/Multiple Entity, Maps and Charts with Entities, Single/Multiple Charts, Maps, and Visualizations). Extensive evaluation of state-of-the-art models on all types reveals substantial performance gaps, particularly on questions requiring visual-based reasoning and multi-step inference. These findings show the urgent need for improved architectures that more tightly integrate vision and language processing. By providing a challenging, high-quality resource that mirrors the complexity of real-world tasks, MMTBENCH underscores its value as a resource for future research on multimodal tables.
zh

[CV-126] Visual Loop Closure Detection Through Deep Graph Consensus

【速读】:该论文旨在解决视觉回环检测中因误检回环导致的位姿图估计性能下降问题,特别是在在线同时定位与地图构建(SLAM)场景下,由于时间和计算资源受限,难以验证大量候选回环。其解决方案的关键在于提出一种基于图神经网络(GNN)的LoopGNN架构,通过利用通过场景识别检索到的视觉相似关键帧团(clique)进行回环闭合共识估计,从而在保持高召回率的同时实现高精度的回环检测,并展现出优于传统方法的计算效率。

链接: https://arxiv.org/abs/2505.21754
作者: Martin Büchner,Liza Dahiya,Simon Dorer,Vipul Ramtekkar,Kenji Nishimiya,Daniele Cattaneo,Abhinav Valada
机构: University of Freiburg (弗赖堡大学); Honda R&D Co., Ltd. (本田研发有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Visual loop closure detection traditionally relies on place recognition methods to retrieve candidate loops that are validated using computationally expensive RANSAC-based geometric verification. As false positive loop closures significantly degrade downstream pose graph estimates, verifying a large number of candidates in online simultaneous localization and mapping scenarios is constrained by limited time and compute resources. While most deep loop closure detection approaches only operate on pairs of keyframes, we relax this constraint by considering neighborhoods of multiple keyframes when detecting loops. In this work, we introduce LoopGNN, a graph neural network architecture that estimates loop closure consensus by leveraging cliques of visually similar keyframes retrieved through place recognition. By propagating deep feature encodings among nodes of the clique, our method yields high-precision estimates while maintaining high recall. Extensive experimental evaluations on the TartanDrive 2.0 and NCLT datasets demonstrate that LoopGNN outperforms traditional baselines. Additionally, an ablation study across various keypoint extractors demonstrates that our method is robust, regardless of the type of deep feature encodings used, and exhibits higher computational efficiency compared to classical geometric verification baselines. We release our code, supplementary material, and keyframe data at this https URL.
zh

[CV-127] Learning to See More: UAS-Guided Super-Resolution of Satellite Imagery for Precision Agriculture

【速读】:该论文旨在解决精准农业中卫星遥感与无人飞行器系统(UAS)数据在空间、时间和光谱覆盖范围上的局限性问题。其关键解决方案是通过超分辨率方法融合卫星与UAS影像,实现光谱和空间维度的扩展,从而提升作物生物量和氮素估算的精度。具体而言,利用UAS的高空间分辨率数据进行光谱扩展,结合卫星数据的广域覆盖,有效弥补了单一平台的不足,并降低了获取高精度数据的成本与频率需求。

链接: https://arxiv.org/abs/2505.21746
作者: Arif Masrur,Peder A. Olsen,Paul R. Adler,Carlan Jackson,Matthew W. Myers,Nathan Sedghi,Ray R. Weil
机构: Esri(埃斯里); Microsoft Research(微软研究院); USDA - Agricultural Research Service(美国农业部-农业研究服务); Alabama A&M University(阿拉巴马农工大学); Univ. of Maryland(马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unmanned Aircraft Systems (UAS) and satellites are key data sources for precision agriculture, yet each presents trade-offs. Satellite data offer broad spatial, temporal, and spectral coverage but lack the resolution needed for many precision farming applications, while UAS provide high spatial detail but are limited by coverage and cost, especially for hyperspectral data. This study presents a novel framework that fuses satellite and UAS imagery using super-resolution methods. By integrating data across spatial, spectral, and temporal domains, we leverage the strengths of both platforms cost-effectively. We use estimation of cover crop biomass and nitrogen (N) as a case study to evaluate our approach. By spectrally extending UAS RGB data to the vegetation red edge and near-infrared regions, we generate high-resolution Sentinel-2 imagery and improve biomass and N estimation accuracy by 18% and 31%, respectively. Our results show that UAS data need only be collected from a subset of fields and time points. Farmers can then 1) enhance the spectral detail of UAS RGB imagery; 2) increase the spatial resolution by using satellite data; and 3) extend these enhancements spatially and across the growing season at the frequency of the satellite flights. Our SRCNN-based spectral extension model shows considerable promise for model transferability to other cropping systems in the Upper and Lower Chesapeake Bay regions. Additionally, it remains effective even when cloud-free satellite data are unavailable, relying solely on the UAS RGB input. The spatial extension model produces better biomass and N predictions than models built on raw UAS RGB images. Once trained with targeted UAS RGB data, the spatial extension model allows farmers to stop repeated UAS flights. While we introduce super-resolution advances, the core contribution is a lightweight and scalable system for affordable on-farm use.
zh

[CV-128] What is Adversarial Training for Diffusion Models?

【速读】:该论文试图解决扩散模型(Diffusion Models, DMs)在面对噪声、数据损坏和对抗攻击时的鲁棒性问题。与分类器中通过对抗训练(Adversarial Training, AT)实现输出不变性不同,扩散模型的对抗训练需要保持扩散过程与数据分布的一致性,即要求其具有等变性(equivariance)。解决方案的关键在于通过添加随机噪声或对抗噪声来增强扩散流的平滑性,从而提升模型对异常值和噪声数据的鲁棒性,且无需对噪声模型做出假设,能够无缝集成到扩散训练过程中。

链接: https://arxiv.org/abs/2505.21742
作者: Briglia Maria Rosaria,Mujtaba Hussain Mirza,Giuseppe Lisanti,Iacopo Masi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 40 pages

点击查看摘要

Abstract:We answer the question in the title, showing that adversarial training (AT) for diffusion models (DMs) fundamentally differs from classifiers: while AT in classifiers enforces output invariance, AT in DMs requires equivariance to keep the diffusion process aligned with the data distribution. AT is a way to enforce smoothness in the diffusion flow, improving robustness to outliers and corrupted data. Unlike prior art, our method makes no assumptions about the noise model and integrates seamlessly into diffusion training by adding random noise, similar to randomized smoothing, or adversarial noise, akin to AT. This enables intrinsic capabilities such as handling noisy data, dealing with extreme variability such as outliers, preventing memorization, and improving robustness. We rigorously evaluate our approach with proof-of-concept datasets with known distributions in low- and high-dimensional space, thereby taking a perfect measure of errors; we further evaluate on standard benchmarks such as CIFAR-10, CelebA and LSUN Bedroom, showing strong performance under severe noise, data corruption, and iterative adversarial attacks.
zh

[CV-129] Moment kernels: a simple and scalable approach for equivariance to rotations and reflections in deep convolutional networks

【速读】:该论文试图解决如何有效利用图像中的对称性(如旋转、反射等)以提升生物医学图像分析性能的问题,尤其是在这些对称性未被广泛采用的情况下。解决方案的关键在于提出了一种称为“矩核”(moment kernels)的简单卷积核形式,证明所有等变核必须采用这种形式,其本质是空间位置 $ x $ 的径向对称函数与 $ x $ 的分量或单位矩阵的幂次相乘。通过使用标准卷积模块实现等变神经网络,该方法在生物医学图像分析任务中展现出良好的适应性。

链接: https://arxiv.org/abs/2505.21736
作者: Zachary Schlamowitz,Andrew Bennecke,Daniel J. Tward
机构: University of California, Los Angeles (加利福尼亚大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The principle of translation equivariance (if an input image is translated an output image should be translated by the same amount), led to the development of convolutional neural networks that revolutionized machine vision. Other symmetries, like rotations and reflections, play a similarly critical role, especially in biomedical image analysis, but exploiting these symmetries has not seen wide adoption. We hypothesize that this is partially due to the mathematical complexity of methods used to exploit these symmetries, which often rely on representation theory, a bespoke concept in differential geometry and group theory. In this work, we show that the same equivariance can be achieved using a simple form of convolution kernels that we call "moment kernels," and prove that all equivariant kernels must take this form. These are a set of radially symmetric functions of a spatial position x, multiplied by powers of the components of x or the identity matrix. We implement equivariant neural networks using standard convolution modules, and provide architectures to execute several biomedical image analysis tasks that depend on equivariance principles: classification (outputs are invariant under orthogonal transforms), 3D image registration (outputs transform like a vector), and cell segmentation (quadratic forms defining ellipses transform like a matrix).
zh
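矩核的构造可以直接写成代码:径向对称函数乘以坐标分量的幂。下面用 NumPy 给出二维情形的示意(以高斯作为径向函数是此处为演示所作的选择,原文不限定具体径向函数):

```python
import numpy as np

def moment_kernel_2d(size=7, sigma=2.0, order=(1, 0)):
    """二维矩核示意:径向对称函数 g(|x|) 乘以 x 分量的幂,
    order=(p, q) 表示乘以 x^p * y^q。"""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r, indexing="xy")
    radial = np.exp(-(X**2 + Y**2) / (2 * sigma**2))   # 仅依赖 |x| 的径向因子
    return radial * (X ** order[0]) * (Y ** order[1])

# order=(0,0) 为各向同性核(响应旋转不变);
# (1,0)/(0,1) 组成的核对,其响应在旋转/反射下像向量一样变换(等变)。
k_inv = moment_kernel_2d(order=(0, 0))
k_x, k_y = moment_kernel_2d(order=(1, 0)), moment_kernel_2d(order=(0, 1))
print(k_inv.shape, np.allclose(k_x, -k_x[:, ::-1]))  # x 方向反射时 k_x 反号 => True
```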

[CV-130] OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

【速读】:该论文试图解决在线多模态对话响应生成(Online Multimodal Conversational Response Generation, OMCRG)问题,即在多模态输入条件下实时生成同步的言语和非言语听众反馈。该任务旨在模拟自然的双人交互,并在生成的音频和面部反应之间实现同步性。解决方案的关键在于引入文本作为连接音频和面部反应的中间模态,并提出OmniResponse模型,该模型基于增强的多模态大语言模型(Multimodal Large Language Model, MLLM),通过Chrono-Text和TempoVoice两个创新组件实现高质量多模态响应的自回归生成。

链接: https://arxiv.org/abs/2505.21724
作者: Cheng Luo,Jianghui Wang,Bing Li,Siyang Song,Bernard Ghanem
机构: King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); University of Exeter (埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 23 pages, 9 figures

点击查看摘要

Abstract:In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to online generate synchronized verbal and non-verbal listener feedback, conditioned on the speaker’s multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the generated audio and facial responses of the listener. To address these challenges, we innovatively introduce text as an intermediate modality to bridge the audio and facial responses. We hence propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multi-modal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.
zh

[CV-131] MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis

【速读】:该论文旨在解决通用视觉-语言基础模型(Vision-Language Foundation Models, VLMs)在医学图像诊断中表现不佳的问题,主要由于医学图像与自然图像之间存在显著的领域偏移(domain shifts),而训练专门的医学基础模型需要大量标注数据和计算资源。论文提出的解决方案是MedBridge,其关键在于三个核心组件:首先,Focal Sampling模块通过提取高分辨率局部区域来捕捉细微病理特征;其次,Query Encoder(QEncoder)通过注入可学习查询来对齐VLM的冻结特征图与医学语义;最后,Mixture of Experts机制利用可学习查询驱动,融合多种VLM的优势以提升诊断性能。

链接: https://arxiv.org/abs/2505.21698
作者: Yitong Li,Morteza Ghahremani,Christian Wachinger
机构: Lab for AI in Medical Imaging, Technical University of Munich (TUM), Germany; Munich Center for Machine Learning (MCML), Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent vision-language foundation models deliver state-of-the-art results on natural image classification but falter on medical images due to pronounced domain shifts. At the same time, training a medical foundation model requires substantial resources, including extensive annotated data and high computational capacity. To bridge this gap with minimal overhead, we introduce MedBridge, a lightweight multimodal adaptation framework that re-purposes pretrained VLMs for accurate medical image diagnosis. MedBridge comprises three key components. First, a Focal Sampling module that extracts high-resolution local regions to capture subtle pathological features and compensate for the limited input resolution of general-purpose VLMs. Second, a Query Encoder (QEncoder) injects a small set of learnable queries that attend to the frozen feature maps of VLM, aligning them with medical semantics without retraining the entire backbone. Third, a Mixture of Experts mechanism, driven by learnable queries, harnesses the complementary strength of diverse VLMs to maximize diagnostic performance. We evaluate MedBridge on five medical imaging benchmarks across three key adaptation tasks, demonstrating its superior performance in both cross-domain and in-domain adaptation settings, even under varying levels of training data availability. Notably, MedBridge achieved over 6-15% improvement in AUC compared to state-of-the-art VLM adaptation methods in multi-label thoracic disease diagnosis, underscoring its effectiveness in leveraging foundation models for accurate and data-efficient medical diagnosis. Our code is available at this https URL.
zh

[CV-132] Scalable Segmentation for Ultra-High-Resolution Brain MR Images

【速读】:该论文旨在解决超高清脑部MRI图像中准确且高效分割的难题,这一问题主要源于细尺度解剖结构缺乏标注训练数据以及高计算需求。其解决方案的关键在于提出一种新颖的框架,该框架利用易于获取的低分辨率粗略标签作为空间参考和指导,无需增加额外的标注成本;同时,通过回归每类的符号距离变换图实现平滑、边界感知的监督,并引入可扩展的类别条件分割策略,使模型能够逐类进行分割,从而提升模型的可扩展性、泛化能力和效率。

链接: https://arxiv.org/abs/2505.21697
作者: Xiaoling Hu,Peirong Liu,Dina Zemlyanker,Jonathan Williams Ramirez,Oula Puonti,Juan Eugenio Iglesias
机构: Massachusetts General Hospital and Harvard Medical School (麻省总医院和哈佛医学院); Danish Research Centre for Magnetic Resonance, Copenhagen University Hospital (丹麦磁共振研究中心,哥本哈根大学医院); Hawkes Institute, University College London (霍克斯研究所,伦敦大学学院); Computer Science and AI Laboratory, Massachusetts Institute of Technology (计算机科学与人工智能实验室,麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Although deep learning has shown great success in 3D brain MRI segmentation, achieving accurate and efficient segmentation of ultra-high-resolution brain images remains challenging due to the lack of labeled training data for fine-scale anatomical structures and high computational demands. In this work, we propose a novel framework that leverages easily accessible, low-resolution coarse labels as spatial references and guidance, without incurring additional annotation cost. Instead of directly predicting discrete segmentation maps, our approach regresses per-class signed distance transform maps, enabling smooth, boundary-aware supervision. Furthermore, to enhance scalability, generalizability, and efficiency, we introduce a scalable class-conditional segmentation strategy, where the model learns to segment one class at a time conditioned on a class-specific input. This novel design not only reduces memory consumption during both training and testing, but also allows the model to generalize to unseen anatomical classes. We validate our method through comprehensive experiments on both synthetic and real-world datasets, demonstrating its superior performance and scalability compared to conventional segmentation approaches.
zh
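逐类符号距离变换图这一回归目标可以用 SciPy 很容易地构造出来,下面给出示意(符号约定取"类内为正、类外为负",这是常见约定之一,未必与论文实现一致):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """把某一类的二值掩码转为符号距离图,作为该类的回归目标。
    mask: bool 数组,True 表示属于该类。"""
    inside = distance_transform_edt(mask)     # 前景像素到边界的距离
    outside = distance_transform_edt(~mask)   # 背景像素到边界的距离
    return inside - outside                   # 类内为正、类外为负、边界附近为 0

mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
sdt = signed_distance_map(mask)
print(sdt.max(), sdt.min())  # 内部最大正值,外部最小负值
```

类条件分割则对应于:网络每次只回归"当前条件类"的这张距离图,推理时逐类条件化调用同一模型。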

[CV-133] hink Before You Diffuse: LLM s-Guided Physics-Aware Video Generation

【速读】:该论文试图解决视频生成中物理效果正确性不足的问题,即在生成视觉吸引人的视频结果的同时,难以准确合成正确的物理效应。解决方案的关键在于提出DiffPhy框架,通过微调预训练的视频扩散模型,并利用大语言模型(LLM)从文本提示中显式推理出全面的物理上下文,以此指导视频生成过程。此外,该方法引入了一组新的训练目标,以联合强制实现物理正确性和与输入文本的语义一致性,并构建了一个高质量的物理视频数据集以支持有效的微调。

链接: https://arxiv.org/abs/2505.21653
作者: Ke Zhang,Cihan Xiao,Yiqun Mei,Jiacong Xu,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 8 figures

点击查看摘要

Abstract:Recent video diffusion models have demonstrated their great capability in generating visually-pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to explicitly reason a comprehensive physical context from the text prompt and uses it to guide the generation. To incorporate physical context into the diffusion model, we leverage a multimodal large language model (MLLM) as a supervisory signal and introduce a set of novel training objectives that jointly enforce physical correctness and semantic consistency with the input text. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective finetuning. Extensive experiments on public benchmarks demonstrate that DiffPhy is able to produce state-of-the-art results across diverse physics-related scenarios. Our project page is available at this https URL
zh

[CV-134] Right Side Up? Disentangling Orientation Understanding in MLLM s with Fine-grained Multi-axis Perception Tasks

【速读】:该论文试图解决视觉感知中对象方向理解(object orientation understanding)这一基础性挑战,该问题对于机器人操作和增强现实等应用至关重要。现有视觉-语言基准测试未能将该能力独立出来,往往将其与位置关系和整体场景理解混淆。论文提出的解决方案是引入DORI(Discriminative Orientation Reasoning Intelligence),这是一个全面的基准,将对象方向感知作为主要评估目标,通过四个维度评估方向理解能力:正面对齐、旋转变换、相对方向关系以及规范方向理解。DORI的关键在于其精心构建的任务集,涵盖67个物体类别,覆盖合成与真实世界场景,从而揭示多模态系统在对象方向理解方面的表现。

链接: https://arxiv.org/abs/2505.21649
作者: Keanu Nichols,Nazia Tasnim,Yan Yuting,Nicholas Ikechukwu,Elva Zou,Deepti Ghadiyaram,Bryan Plummer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object orientation understanding represents a fundamental challenge in visual perception critical for applications like robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark establishing object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insights on how multi-modal systems understand object orientations. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms, as models show systematic inability to perform precise angular estimations, track orientation changes across viewpoints, and understand compound rotations - suggesting limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments. DORI data: this https URL
zh

[CV-135] QuARI: Query Adaptive Retrieval Improvement

【速读】:该论文试图解决在大规模图像集合中进行实例检索(instance retrieval)时,视觉-语言模型(Vision-Language Model, VLM)性能不佳的问题。其解决方案的关键在于学习将给定查询映射到特定于查询的特征空间变换,通过线性变换来强调与感兴趣领域相关的子空间,从而提升检索效果。由于该变换是线性的,因此可以以极低的计算成本应用于数百万张图像嵌入,使其适用于大规模检索或重排序任务。

链接: https://arxiv.org/abs/2505.21647
作者: Eric Xing,Abby Stylianou,Robert Pless,Nathan Jacobs
机构: Washington University in St. Louis (华盛顿大学圣路易斯分校); Saint Louis University (圣路易斯大学); The George Washington University (乔治华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Massive-scale pretraining has made vision-language models increasingly popular for image-to-image and text-to-image retrieval across a broad collection of domains. However, these models do not perform well when used for challenging retrieval tasks, such as instance retrieval in very large-scale image collections. Recent work has shown that linear transformations of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. In this paper, we explore a more extreme version of this specialization by learning to map a given query to a query-specific feature space transformation. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings, making it effective for large-scale retrieval or re-ranking. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
zh
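"为每个查询生成一个线性变换,再低成本地作用于海量库嵌入"的思路可示意如下(低秩参数化 W_q = I + UV^T 是此处为控制参数量而作的假设,并非论文原设计):

```python
import torch
import torch.nn as nn

class QueryAdaptiveTransform(nn.Module):
    """示意:小型超网络把查询嵌入映射为低秩线性变换,再作用于全部库嵌入。"""
    def __init__(self, dim=512, rank=16):
        super().__init__()
        self.to_u = nn.Linear(dim, dim * rank)
        self.to_v = nn.Linear(dim, dim * rank)
        self.dim, self.rank = dim, rank

    def forward(self, query, gallery):
        # query: (D,)  gallery: (N, D),N 可达百万级
        U = self.to_u(query).view(self.dim, self.rank)
        V = self.to_v(query).view(self.dim, self.rank)
        # (I + U V^T) g = g + U (V^T g):避免显式构造 D×D 矩阵
        gallery_t = gallery + (gallery @ V) @ U.t()   # (N, D)
        query_t = query + (query @ V) @ U.t()
        return query_t, gallery_t

m = QueryAdaptiveTransform()
q, g = torch.randn(512), torch.randn(10000, 512)
qt, gt = m(q, g)
scores = torch.nn.functional.normalize(gt, dim=-1) @ torch.nn.functional.normalize(qt, dim=0)
print(scores.shape)  # torch.Size([10000])
```

由于变换是线性的,对库侧的额外代价仅约 O(N·D·r) 的矩阵乘,因此适合大规模检索或重排序。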

[CV-136] Geometric Feature Prompting of Image Segmentation Models

【速读】:该论文旨在解决植物根系在rhizotron或minirhizotron图像中自动分割的难题,这一任务传统上因手动标注的劳动密集性和主观性而难以自动化。解决方案的关键在于提出一种基于几何的提示生成器(GeomPrompt),该生成器能够生成与感兴趣特征共位的提示点,从而实现使用较少点提示即可自动产生敏感且特异的分割结果。

链接: https://arxiv.org/abs/2505.21644
作者: Kenneth Ball,Erin Taylor,Nirav Patel,Andrew Bartels,Gary Koplik,James Polly,Jay Hineman
机构: Geometric Data Analytics, Inc.(几何数据分析公司); Penrose Research(佩恩罗思研究所); Georgia Institute of Technology(佐治亚理工学院); Applied Research Associates(应用研究协会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advances in machine learning, especially the introduction of transformer architectures and vision transformers, have led to the development of highly capable computer vision foundation models. The segment anything model (known colloquially as SAM and more recently SAM 2), is a highly capable foundation model for segmentation of natural images and has been further applied to medical and scientific image segmentation tasks. SAM relies on prompts – points or regions of interest in an image – to generate associated segmentations. In this manuscript we propose the use of a geometrically motivated prompt generator to produce prompt points that are colocated with particular features of interest. Focused prompting enables the automatic generation of sensitive and specific segmentations in a scientific image analysis task using SAM with relatively few point prompts. The image analysis task examined is the segmentation of plant roots in rhizotron or minirhizotron images, which has historically been a difficult task to automate. Hand annotation of rhizotron images is laborious and often subjective; SAM, initialized with GeomPrompt local ridge prompts has the potential to dramatically improve rhizotron image processing. The authors have concurrently released an open source software suite called geomprompt (this https URL) that can produce point prompts in a format that enables direct integration with the segment-anything package.
zh
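基于几何(脊线)的提示点生成可以用 Hessian 特征值来示意:细长亮结构(如根系)在跨脊线方向的二阶导数强烈为负,最小特征值负得最深。以下片段取响应最强的像素作为 SAM 的点提示(符号约定与 top-k 选点策略均为此处的假设,非 geomprompt 官方实现):

```python
import numpy as np
from skimage.feature import hessian_matrix, hessian_matrix_eigvals

def ridge_prompt_points(img, sigma=2.0, top_k=10):
    H = hessian_matrix(img, sigma=sigma, order="rc")
    ev_min = hessian_matrix_eigvals(H)[-1]        # 最小 Hessian 特征值图
    ridge = -ev_min                               # 取负使亮脊线响应为正
    idx = np.argsort(ridge.ravel())[-top_k:]      # 响应最强的 top_k 个像素
    ys, xs = np.unravel_index(idx, img.shape)
    return np.stack([xs, ys], axis=1)             # (top_k, 2) 的 (x, y) 点提示

img = np.zeros((64, 64)); img[30:33, :] = 1.0    # 一条水平亮带充当玩具"根"
print(ridge_prompt_points(img)[:3])
```

得到的 (x, y) 坐标即可作为 point prompts 直接送入 segment-anything 的预测接口。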

[CV-137] BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration

【速读】:该论文旨在解决现有全功能图像修复(all-in-one image restoration, AIR)方法在面对分布外退化和图像时仍存在脆弱性的问题,从而限制了其在现实场景中的应用。解决方案的关键在于提出一种多源表示学习框架BaryIR,该框架通过将多源退化图像的潜在空间分解为一个连续巴里中心空间(barycenter space)用于统一特征编码,以及源特定子空间用于特定语义编码,从而实现对多源数据流形固有几何结构的捕捉。具体而言,BaryIR通过引入多源潜在最优传输巴里中心问题,学习一个连续的巴里中心映射,将潜在表示传输到巴里中心空间,并设计传输成本以使源特定子空间的表示相互对比,同时保持与巴里中心空间表示的正交性。

链接: https://arxiv.org/abs/2505.21637
作者: Xiaole Tang,Xiaoyi He,Xiang Gu,Jian Sun
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite remarkable advances made in all-in-one image restoration (AIR) for handling different types of degradations simultaneously, existing methods remain vulnerable to out-of-distribution degradations and images, limiting their real-world applicability. In this paper, we propose a multi-source representation learning framework BaryIR, which decomposes the latent space of multi-source degraded images into a continuous barycenter space for unified feature encoding and source-specific subspaces for specific semantic encoding. Specifically, we seek the multi-source unified representation by introducing a multi-source latent optimal transport barycenter problem, in which a continuous barycenter map is learned to transport the latent representations to the barycenter space. The transport cost is designed such that the representations from source-specific subspaces are contrasted with each other while maintaining orthogonality to those from the barycenter space. This enables BaryIR to learn compact representations with unified degradation-agnostic information from the barycenter space, as well as degradation-specific semantics from source-specific subspaces, capturing the inherent geometry of multi-source data manifold for generalizable AIR. Extensive experiments demonstrate that BaryIR achieves competitive performance compared to state-of-the-art all-in-one methods. Particularly, BaryIR exhibits superior generalization ability to real-world data and unseen degradations. The code will be publicly available at this https URL.
zh

[CV-138] Object Concepts Emerge from Motion

【速读】:该论文试图解决如何在无监督条件下学习以物体为中心的视觉表征问题,从而捕捉视觉实例这一被现有视觉基础模型所忽视的关键抽象层次。解决方案的关键在于利用运动边界作为物体级分组的强信号,通过使用现成的光流和聚类算法生成基于运动的实例掩码,并借助对比学习训练视觉编码器,从而实现无需标签和相机标定的端到端学习框架。

链接: https://arxiv.org/abs/2505.21635
作者: Haoqian Liang,Xiaohui Wang,Zhichao Li,Ya Yang,Naiyan Wang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Xiaomi EV (小米电动汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object concepts play a foundational role in human visual cognition, enabling perception, memory, and interaction in the physical world. Inspired by findings in developmental neuroscience - where infants are shown to acquire object understanding through observation of motion - we propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner. Our key insight is that motion boundary serves as a strong signal for object-level grouping, which can be used to derive pseudo instance supervision from raw videos. Concretely, we generate motion-based instance masks using off-the-shelf optical flow and clustering algorithms, and use them to train visual encoders via contrastive learning. Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data. We evaluate our approach on three downstream tasks spanning both low-level (monocular depth estimation) and high-level (3D object detection and occupancy prediction) vision. Our models outperform previous supervised and self-supervised baselines and demonstrate strong generalization to unseen scenes. These results suggest that motion-induced object representations offer a compelling alternative to existing vision foundation models, capturing a crucial but overlooked level of abstraction: the visual instance. The corresponding code will be released upon paper acceptance.
zh
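"光流 + 聚类 → 伪实例掩码"的流程可示意如下(光流假定已由现成方法如 RAFT 预先算好;特征加权系数与 DBSCAN 超参均为此处的假设):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def motion_instance_masks(flow, mag_thresh=1.0, eps=3.0, min_samples=20):
    """以光流为信号生成伪实例掩码。flow: (H, W, 2) 预计算光流。
    先按运动幅值筛出运动像素,再在 (x, y, u, v) 空间聚类成实例。"""
    H, W, _ = flow.shape
    mag = np.linalg.norm(flow, axis=-1)
    ys, xs = np.nonzero(mag > mag_thresh)
    if len(ys) == 0:
        return np.zeros((H, W), dtype=int)
    feats = np.stack([xs, ys, flow[ys, xs, 0] * 5, flow[ys, xs, 1] * 5], axis=1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    masks = np.zeros((H, W), dtype=int)        # 0 为背景,1..K 为实例
    masks[ys, xs] = labels + 1                 # DBSCAN 噪声 (-1) 映射为背景 0
    return masks

flow = np.zeros((48, 48, 2)); flow[10:20, 10:20] = [2.0, 0.0]  # 一个右移方块
print(np.unique(motion_instance_masks(flow)))  # [0 1]
```

这样得到的伪实例掩码即可作为对比学习的正样本分组依据,全程不需要人工标签。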

[CV-139] VideoMarkBench: Benchmarking Robustness of Video Watermarking

【速读】:该论文旨在解决当前视频水印技术在面对常见和对抗性扰动时鲁棒性不足的问题(video watermarking)。其关键解决方案是引入VideoMarkBench,这是首个系统化的基准测试平台,用于评估视频水印在水印移除和水印伪造攻击下的鲁棒性,并通过统一数据集、多种水印方法及聚合策略进行全面的扰动分析。

链接: https://arxiv.org/abs/2505.21620
作者: Zhengyuan Jiang,Moyang Guo,Kecen Li,Yuepeng Hu,Yupu Wang,Zhicong Huang,Cheng Hong,Neil Zhenqiang Gong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid development of video generative models has led to a surge in highly realistic synthetic videos, raising ethical concerns related to disinformation and copyright infringement. Recently, video watermarking has been proposed as a mitigation strategy by embedding invisible marks into AI-generated videos to enable subsequent detection. However, the robustness of existing video watermarking methods against both common and adversarial perturbations remains underexplored. In this work, we introduce VideoMarkBench, the first systematic benchmark designed to evaluate the robustness of video watermarks under watermark removal and watermark forgery attacks. Our study encompasses a unified dataset generated by three state-of-the-art video generative models, across three video styles, incorporating four watermarking methods and seven aggregation strategies used during detection. We comprehensively evaluate 12 types of perturbations under white-box, black-box, and no-box threat models. Our findings reveal significant vulnerabilities in current watermarking approaches and highlight the urgent need for more robust solutions. Our code is available at this https URL.
zh

[CV-140] Any-to-Bokeh: One-Step Video Bokeh via Multi-Plane Image Guided Diffusion

【速读】:该论文旨在解决视频景深效果(video bokeh)中缺乏对焦平面的显式控制和背景虚化强度调整的问题,以及现有方法在扩展至视频时导致的时间闪烁和边缘模糊过渡不自然的问题。其解决方案的关键在于提出一种单步视频虚化框架,通过多平面图像(multi-plane image, MPI)表示结合逐步扩展的深度采样函数,提供依赖深度的模糊合成的显式几何引导,并利用预训练模型如Stable Video Diffusion的强大三维先验,实现时空一致且逼真的虚化效果。

链接: https://arxiv.org/abs/2505.21593
作者: Yang Yang,Siming Zheng,Jinwei Chen,Boxi Wu,Xiaofei He,Deng Cai,Bo Li,Peng-Tao Jiang
机构: vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project page: this https URL

点击查看摘要

Abstract:Recent advances in diffusion based editing models have enabled realistic camera simulation and image-based bokeh, but video bokeh remains largely unexplored. Existing video editing models cannot explicitly control focus planes or adjust bokeh intensity, limiting their applicability for controllable optical effects. Moreover, naively extending image-based bokeh methods to video often results in temporal flickering and unsatisfactory edge blur transitions due to the lack of temporal modeling and generalization capability. To address these challenges, we propose a novel one-step video bokeh framework that converts arbitrary input videos into temporally coherent, depth-aware bokeh effects. Our method leverages a multi-plane image (MPI) representation constructed through a progressively widening depth sampling function, providing explicit geometric guidance for depth-dependent blur synthesis. By conditioning a single-step video diffusion model on MPI layers and utilizing the strong 3D priors from pre-trained models such as Stable Video Diffusion, our approach achieves realistic and consistent bokeh effects across diverse scenes. Additionally, we introduce a progressive training strategy to enhance temporal consistency, depth robustness, and detail preservation. Extensive experiments demonstrate that our method produces high-quality, controllable bokeh effects and achieves state-of-the-art performance on multiple evaluation benchmarks.
zh
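多平面图像(MPI)引导的深度相关虚化,其几何直觉可用如下简化代码说明:按深度切层,离焦平面越远的层模糊越强,再逐层合成(这里用均值模糊近似、且不含扩散模型部分,仅示意 MPI 的几何引导作用,切层与核大小的映射均为假设):

```python
import torch
import torch.nn.functional as F

def mpi_bokeh(image, depth, focus_depth, planes=6, max_blur=9):
    # image: (B,3,H,W), depth: (B,1,H,W),深度值越大越远(约定为假设)
    lo, hi = float(depth.min()), float(depth.max()) + 1e-6
    edges = torch.linspace(lo, hi, planes + 1)
    out = torch.zeros_like(image)
    for i in range(planes):
        m = ((depth >= edges[i]) & (depth < edges[i + 1])).float()
        d_mid = float(0.5 * (edges[i] + edges[i + 1]))
        # 离焦平面越远 => 模糊核越大(取奇数核以保持尺寸不变)
        k = int(abs(d_mid - focus_depth) / (hi - lo) * max_blur) * 2 + 1
        blurred = F.avg_pool2d(image, k, stride=1, padding=k // 2)
        out = out + blurred * m                # 按层掩码合成回一张图
    return out

img, depth = torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)
print(mpi_bokeh(img, depth, focus_depth=0.5).shape)  # torch.Size([1, 3, 64, 64])
```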

[CV-141] Do you see what I see? An Ambiguous Optical Illusion Dataset exposing limitations of Explainable AI

【速读】:该论文试图解决光学幻觉数据集稀缺的问题,并探索视觉学习中感知模糊性对模型准确性的影响。其解决方案的关键在于引入一个包含交织动物对的新型光学幻觉数据集,通过识别如注视方向和眼神线索等可泛化的视觉概念,系统地生成具有不同概念的光学幻觉,以研究人类与机器视觉之间的偏差和对齐问题。

链接: https://arxiv.org/abs/2505.21589
作者: Carina Newen,Luca Hinkamp,Maria Ntonti,Emmanuel Müller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 18 figures

点击查看摘要

Abstract:From uncertainty quantification to real-world object detection, we recognize the importance of machine learning algorithms, particularly in safety-critical domains such as autonomous driving or medical diagnostics. In machine learning, ambiguous data plays an important role in various machine learning domains. Optical illusions present a compelling area of study in this context, as they offer insight into the limitations of both human and machine perception. Despite this relevance, optical illusion datasets remain scarce. In this work, we introduce a novel dataset of optical illusions featuring intermingled animal pairs designed to evoke perceptual ambiguity. We identify generalizable visual concepts, particularly gaze direction and eye cues, as subtle yet impactful features that significantly influence model accuracy. By confronting models with perceptual ambiguity, our findings underscore the importance of concepts in visual learning and provide a foundation for studying bias and alignment between human and machine vision. To make this dataset useful for general purposes, we generate optical illusions systematically with different concepts discussed in our bias mitigation section. The dataset is accessible on Kaggle via this https URL. Our source code can be found at this https URL.
zh

[CV-142] CogAD: Cognitive-Hierarchy Guided End-to-End Autonomous Driving

【速读】:该论文试图解决当前端到端自动驾驶方法在感知和规划层面与人类认知原则存在根本性不匹配的问题。其解决方案的关键在于提出CogAD模型,该模型通过模拟人类驾驶员的层次化认知机制,实现了全局到局部的上下文处理以实现类人感知,以及基于意图的多模式轨迹生成以实现认知启发式规划。

链接: https://arxiv.org/abs/2505.21581
作者: Zhennan Wang,Jianing Teng,Canqun Xiang,Kangliang Chen,Xing Pan,Lu Deng,Weihao Gu
机构: HAOMO.AI Technology Co., Ltd (哈莫人工智能科技有限公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While end-to-end autonomous driving has advanced significantly, prevailing methods remain fundamentally misaligned with human cognitive principles in both perception and planning. In this paper, we propose CogAD, a novel end-to-end autonomous driving model that emulates the hierarchical cognition mechanisms of human drivers. CogAD implements dual hierarchical mechanisms: global-to-local context processing for human-like perception and intent-conditioned multi-mode trajectory generation for cognitively-inspired planning. The proposed method demonstrates three principal advantages: comprehensive environmental understanding through hierarchical perception, robust planning exploration enabled by multi-level planning, and diverse yet reasonable multi-modal trajectory generation facilitated by dual-level uncertainty modeling. Extensive experiments on nuScenes and Bench2Drive demonstrate that CogAD achieves state-of-the-art performance in end-to-end planning, exhibiting particular superiority in long-tail scenarios and robust generalization to complex real-world driving conditions.
zh

[CV-143] Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models

【速读】:该论文旨在解决现有合成数据增强方法在提升图像分类器泛化能力时,难以保证生成数据的多样性以及有效扩大数据规模的问题。其解决方案的关键在于:仅对训练过程中早期未被充分学习的数据部分进行合成增强,而非对整个数据集进行增强,从而在不放大噪声的前提下,通过促进特征学习速度的一致性来提升模型的泛化能力。实验表明,该方法在仅增强30%-40%数据的情况下,能够在多个基准数据集和模型架构上显著提升性能。

链接: https://arxiv.org/abs/2505.21574
作者: Dang Nguyen,Jiping Li,Jinghao Zheng,Baharan Mirzasoleiman
机构: University of California, Los Angeles (加利福尼亚大学洛杉矶分校); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting part of the data that is not learned early in training outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts the performance by up to 2.8% in a variety of scenarios, including training ResNet, ViT and DenseNet on CIFAR-10, CIFAR-100, and TinyImageNet, with a range of optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet. It can also easily stack with existing weak and strong augmentation strategies to further boost the performance.
zh
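
补充说明:上述“只增强早期未学会的样本”的选择逻辑可以用很少的代码表达。下面是一个示意性草图(假设我们已在最初几个epoch记录了每个样本的判对情况;函数与参数均为说明用途,并非论文官方实现),选出的下标随后可交给扩散模型按对应类别生成合成样本:

```python
import numpy as np

def select_augmentation_targets(correct_history, target_frac=0.35):
    """Pick the subset of training examples NOT learned early in training.

    correct_history: (n_early_epochs, n_examples) boolean array, where
    entry [t, i] says whether example i was classified correctly at the
    end of early epoch t. Examples with the lowest early accuracy are the
    ones this strategy would synthetically augment."""
    early_acc = correct_history.mean(axis=0)            # per-example accuracy
    n_select = int(target_frac * early_acc.shape[0])    # e.g. 30%-40% of data
    return np.argsort(early_acc)[:n_select]             # hardest examples first

# toy usage: 5 early epochs, 1000 examples
rng = np.random.default_rng(0)
history = rng.random((5, 1000)) < 0.7                   # stand-in correctness log
targets = select_augmentation_targets(history)
print(f"{targets.size} examples flagged for diffusion-based augmentation")
```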

[CV-144] Thickness-aware E(3)-Equivariant 3D Mesh Neural Networks ICML2025

【速读】:该论文旨在解决传统基于网格的3D静态分析方法在处理实际三维物体时忽略其固有厚度的问题,这种厚度在相对表面之间表现出高度相关性和相似行为。现有方法由于表面之间的分离性及网格内部边连接的缺失,无法有效建模厚度效应。论文提出的解决方案是构建一种新型框架——厚度感知的E(3)-等变3D网格神经网络(T-EMNN),其关键在于将物体厚度信息有效整合到表面网格的计算中,同时保持计算效率,并引入数据驱动的坐标系以编码空间信息并保持E(3)-等变性或不变性,从而实现对节点级3D变形的准确预测。

链接: https://arxiv.org/abs/2505.21572
作者: Sungwon Kim,Namkyeong Lee,Yunyoung Doh,Seungmin Shin,Guimok Cho,Seung-Won Jeon,Sangkook Kim,Chanyoung Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2025

点击查看摘要

Abstract:Mesh-based 3D static analysis methods have recently emerged as efficient alternatives to traditional computational numerical solvers, significantly reducing computational costs and runtime for various physics-based analyses. However, these methods primarily focus on surface topology and geometry, often overlooking the inherent thickness of real-world 3D objects, which exhibits high correlations and similar behavior between opposing surfaces. This limitation arises from the disconnected nature of these surfaces and the absence of internal edge connections within the mesh. In this work, we propose a novel framework, the Thickness-aware E(3)-Equivariant 3D Mesh Neural Network (T-EMNN), that effectively integrates the thickness of 3D objects while maintaining the computational efficiency of surface meshes. Additionally, we introduce data-driven coordinates that encode spatial information while preserving E(3)-equivariance or invariance properties, ensuring consistent and robust analysis. Evaluations on a real-world industrial dataset demonstrate the superior performance of T-EMNN in accurately predicting node-level 3D deformations, effectively capturing thickness effects while maintaining computational efficiency.
zh
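
补充说明:摘要中提到的“数据驱动坐标”需要在编码空间信息的同时保持E(3)-不变性。T-EMNN的具体构造未在摘要中给出,下面给出一种实现该性质的常见做法(质心平移加主轴对齐)作为示意,仅用于帮助理解这一概念,并非论文官方实现:

```python
import numpy as np

def data_driven_coordinates(points):
    """E(3)-invariant 'data-driven coordinates' sketch: express each mesh
    node in a frame derived from the data itself (centroid + principal
    axes), so rotations/translations/reflections of the input leave the
    encoding unchanged. This is one standard way to realize the invariance
    property; T-EMNN's actual construction may differ."""
    centered = points - points.mean(axis=0)
    # principal axes from the covariance give a pose-normalizing frame
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt.T
    # fix the per-axis sign ambiguity deterministically via third moments
    signs = np.sign(np.sum(coords ** 3, axis=0))
    signs[signs == 0] = 1.0
    return coords * signs

pts = np.random.randn(1000, 3) * np.array([3.0, 2.0, 1.0])  # anisotropic cloud
R = np.linalg.qr(np.random.randn(3, 3))[0]                   # random orthogonal map
a = data_driven_coordinates(pts)
b = data_driven_coordinates(pts @ R.T + 5.0)                 # rotated + translated copy
print(np.allclose(np.abs(a), np.abs(b), atol=1e-6))          # True: coordinates invariant
```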

[CV-145] EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models

【速读】:该论文旨在解决现有Vision-Language-Action (VLA)模型在量化过程中因token对齐问题导致的性能下降问题,从而优化计算和存储成本。其解决方案的关键在于提出了一种名为EaqVLA的优化框架,该框架通过编码对齐量化(encoding-aligned quantization)实现更高效的量化策略,具体包括一种多粒度的对齐分析方法以及一种考虑编码对齐的混合精度量化方案。

链接: https://arxiv.org/abs/2505.21567
作者: Feng Jiang,Zihao Zheng,Xiuping Cui,Maoliang Li,Jiayu Chen,Xiang Chen
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the development of Embodied Artificial Intelligence, end-to-end control policies such as Vision-Language-Action (VLA) models have become the mainstream. Existing VLA models face expensive computing/storage costs, which need to be optimized. Quantization is considered the most effective method, as it can not only reduce the memory cost but also achieve computation acceleration. However, we find the token alignment of VLA models hinders the application of existing quantization methods. To address this, we propose an optimized framework called EaqVLA, which applies encoding-aligned quantization to VLA models. Specifically, we propose a complete analysis method to find the misalignment at various granularities. Based on the analysis results, we propose a mixed-precision quantization with awareness of encoding alignment. Experiments show that the proposed EaqVLA achieves better quantization performance (with minimal quantization loss for end-to-end action control and xxx times acceleration) than existing quantization methods.
zh
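
补充说明:EaqVLA的核心思路可以概括为“对齐分析,然后混合精度分配”。下面是一个玩具级示意:假设已有某种对齐分析为每层给出一个失配分数(该分数的定义完全是此处的假设,论文的实际指标未在摘要中公开),据此把失配严重的层保留较高位宽:

```python
import numpy as np

def assign_bitwidths(misalignment_scores, budgets=(4, 8, 16), quantiles=(0.5, 0.9)):
    """Toy mixed-precision assignment in the spirit of encoding-aligned
    quantization: layers whose token encodings are most misaligned keep
    wider bitwidths, well-aligned layers get aggressive low-bit formats.

    misalignment_scores: dict layer_name -> float, assumed to come from
    some alignment analysis (not the paper's actual metric)."""
    scores = np.array(list(misalignment_scores.values()))
    lo, hi = np.quantile(scores, quantiles)
    plan = {}
    for name, s in misalignment_scores.items():
        if s <= lo:
            plan[name] = budgets[0]   # well aligned -> 4-bit
        elif s <= hi:
            plan[name] = budgets[1]   # moderate -> 8-bit
        else:
            plan[name] = budgets[2]   # badly misaligned -> keep 16-bit
    return plan

print(assign_bitwidths({"vision_proj": 0.1, "action_head": 0.9, "llm_block_0": 0.4}))
```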

[CV-146] Diffusion Model-based Activity Completion for AI Motion Capture from Videos

【速读】:该论文旨在解决现有基于AI的动作捕捉方法依赖于预定义动作序列的问题,即无法生成超出观察序列之外的灵活动作。其解决方案的关键在于提出一种基于扩散模型的动作补全技术,通过生成补充的人体运动序列来填补动作片段之间的缺失过渡,从而实现平滑且连续的运动。此外,引入门控模块和位置-时间嵌入模块以提升性能,使方法在Human3.6M数据集上取得竞争性结果。

链接: https://arxiv.org/abs/2505.21566
作者: Gao Huayu,Huang Tengjiu,Ye Xiaolong,Tsuyoshi Okita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 32 pages, 16 figures

点击查看摘要

Abstract:AI-based motion capture is an emerging technology that offers a cost-effective alternative to traditional motion capture systems. However, current AI motion capture methods rely entirely on observed video sequences, similar to conventional motion capture. This means that all human actions must be predefined, and movements outside the observed sequences are not possible. To address this limitation, we aim to apply AI motion capture to virtual humans, where flexible actions beyond the observed sequences are required. We assume that while many action fragments exist in the training data, the transitions between them may be missing. To bridge these gaps, we propose a diffusion-model-based action completion technique that generates complementary human motion sequences, ensuring smooth and continuous movements. By introducing a gate module and a position-time embedding module, our approach achieves competitive results on the Human3.6M dataset. Our experimental results show that (1) MDC-Net outperforms existing methods in ADE, FDE, and MMADE but is slightly less accurate in MMFDE, (2) MDC-Net has a smaller model size (16.84M) compared to HumanMAC (28.40M), and (3) MDC-Net generates more natural and coherent motion sequences. Additionally, we propose a method for extracting sensor data, including acceleration and angular velocity, from human motion sequences.
zh

[CV-147] Multi-instance Learning as Downstream Task of Self-Supervised Learning-based Pre-trained Model

【速读】:该论文试图解决深度多实例学习(multi-instance learning)中,当一个包(bag)中的实例数量增加到256时,学习过程变得极其困难的问题。解决方案的关键是采用预训练模型结合自监督学习作为下游任务,以提升模型在脑出血CT图像中低密度标记分类的性能,即使在原始目标任务存在伪相关性问题的情况下,仍能实现准确率提升5%至13%,F1度量提升40%至55%。

链接: https://arxiv.org/abs/2505.21564
作者: Koki Matsuishi,Tsuyoshi Okita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:In deep multi-instance learning, the number of applicable instances depends on the data set. In histopathology images, deep learning multi-instance learners usually assume there are hundreds to thousands of instances in a bag. However, when the number of instances in a bag increases to 256 in brain hematoma CT, learning becomes extremely difficult. In this paper, we address this drawback. To overcome this problem, we propose using a pre-trained model with self-supervised learning for the multi-instance learner as a downstream task. With this method, even when the original target task suffers from the spurious correlation problem, we show improvements of 5% to 13% in accuracy and 40% to 55% in the F1 measure for the hypodensity marker classification of brain hematoma CT.
zh
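
补充说明:把自监督预训练编码器冻结后,在其上接一个多实例学习(MIL)分类头,是这类“SSL预训练 + MIL下游任务”的常见实现方式。下面给出一个基于注意力池化的MIL头草图(维度与结构均为示意值,并非论文原配置):

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Attention-based MIL pooling head (Ilse et al., 2018 style) placed on
    top of a frozen self-supervised encoder, acting as the downstream bag
    classifier."""
    def __init__(self, dim=384, hidden=128, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, bag):                        # bag: (n_instances, dim)
        a = torch.softmax(self.attn(bag), dim=0)   # attention over instances
        z = (a * bag).sum(dim=0)                   # bag-level embedding
        return self.cls(z), a.squeeze(-1)

# toy usage: a CT "bag" of 256 slice embeddings from a frozen SSL encoder
bag = torch.randn(256, 384)
logits, attn = AttentionMILHead()(bag)
print(logits.shape, attn.shape)   # torch.Size([2]) torch.Size([256])
```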

[CV-148] Knowledge Distillation Approach for SOS Fusion Staging: Towards Fully Automated Skeletal Maturity Assessment CVPR

【速读】:该论文旨在解决spheno-occipital synchondrosis (SOS)融合的自动化分期问题,这是正畸学和法医人类学中的关键诊断指标。其解决方案的关键在于采用双模型架构,通过知识蒸馏技术将教师模型(基于手动裁剪图像训练)的空间理解能力转移到学生模型(处理完整未裁剪图像)中,该过程依赖于一种新提出的损失函数,能够对齐空间logits并结合基于梯度的注意力空间映射,从而确保学生模型在无需外部裁剪或YOLO分割的情况下学习到解剖学相关特征。

链接: https://arxiv.org/abs/2505.21561
作者: Omid Halimi Milani,Amanda Nikho,Marouane Tliba,Lauren Mills,Ahmet Enis Cetin,Mohammed H Elnagar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper has been accepted to the CVPR Workshop 2025, to be held in Nashville, Tennessee

点击查看摘要

Abstract:We introduce a novel deep learning framework for the automated staging of spheno-occipital synchondrosis (SOS) fusion, a critical diagnostic marker in both orthodontics and forensic anthropology. Our approach leverages a dual-model architecture wherein a teacher model, trained on manually cropped images, transfers its precise spatial understanding to a student model that operates on full, uncropped images. This knowledge distillation is facilitated by a newly formulated loss function that aligns spatial logits as well as incorporates gradient-based attention spatial mapping, ensuring that the student model internalizes the anatomically relevant features without relying on external cropping or YOLO-based segmentation. By leveraging expert-curated data and feedback at each step, our framework attains robust diagnostic accuracy, culminating in a clinically viable end-to-end pipeline. This streamlined approach obviates the need for additional pre-processing tools and accelerates deployment, thereby enhancing both the efficiency and consistency of skeletal maturation assessment in diverse clinical settings.
zh
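
补充说明:摘要描述的蒸馏损失包含两部分,即空间logits对齐与基于梯度的注意力图对齐。下面是一个按此思路组合的损失函数草图(温度与权重系数为假设值,并非论文公开的超参数):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attn, teacher_attn, T=2.0, lam=0.5):
    """Sketch of a two-term distillation objective: a temperature-softened
    KL term aligning (spatial) logits, plus an L2 term aligning
    gradient-based attention maps. The weighting of the paper's actual
    loss is an assumption here."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    # normalize attention maps before comparing, so overall scale is irrelevant
    s = F.normalize(student_attn.flatten(1), dim=1)
    t = F.normalize(teacher_attn.flatten(1), dim=1)
    return kd + lam * F.mse_loss(s, t)

loss = distillation_loss(torch.randn(8, 5), torch.randn(8, 5),
                         torch.rand(8, 14, 14), torch.rand(8, 14, 14))
print(loss.item())
```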

[CV-149] A Novel Convolutional Neural Network-Based Framework for Complex Multiclass Brassica Seed Classification

【速读】:该论文旨在解决农民在作物生产和农场运营压力下,缺乏时间和资源进行田间研究的问题,同时针对种子分类中由于纹理相似性导致的识别难题。其解决方案的关键在于提出一种基于卷积神经网络(CNN)的框架,通过定制设计的CNN架构有效区分十种常见的十字花科(Brassica)种子类型,从而提高分类效率和准确性。

链接: https://arxiv.org/abs/2505.21558
作者: Elhoucine Elfatimi,Recep Eryigit,Lahcen Elfatimi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 Figure

点击查看摘要

Abstract:Agricultural research has accelerated in recent years, yet farmers often lack the time and resources for on-farm research due to the demands of crop production and farm operations. Seed classification offers valuable insights into quality control, production efficiency, and impurity detection. Early identification of seed types is critical to reducing the cost and risk associated with field emergence, which can lead to yield losses or disruptions in downstream processes like harvesting. Seed sampling supports growers in monitoring and managing seed quality, improving precision in determining seed purity levels, guiding management adjustments, and enhancing yield estimations. This study proposes a novel convolutional neural network (CNN)-based framework for the efficient classification of ten common Brassica seed types. The approach addresses the inherent challenge of texture similarity in seed images using a custom-designed CNN architecture. The model’s performance was evaluated against several pre-trained state-of-the-art architectures, with adjustments to layer configurations for optimized classification. Experimental results using our collected Brassica seed dataset demonstrate that the proposed model achieved a high accuracy rate of 93 percent.
zh

[CV-150] Analytical Calculation of Weights Convolutional Neural Network

【速读】:该论文试图解决传统卷积神经网络(Convolutional Neural Networks, CNNs)需要依赖大量标注数据和迭代训练过程来确定权重和阈值的问题。其解决方案的关键在于提出一种解析算法,能够在不使用标准训练流程的情况下,仅基于MNIST数据集中的10张选定图像(每张图像代表一个数字0到9)计算出CNN的权重和阈值,并且通过解析方法推导出CNN各层的通道数。该方法实现了无需训练即可完成分类任务的CNN构建,显著提升了推理速度。

链接: https://arxiv.org/abs/2505.21557
作者: Polad Geidarov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents an algorithm for analytically calculating the weights and thresholds of convolutional neural networks (CNNs) without using standard training procedures. The algorithm enables the determination of CNN parameters based on just 10 selected images from the MNIST dataset, each representing a digit from 0 to 9. As part of the method, the number of channels in CNN layers is also derived analytically. A software module was implemented in C++ Builder, and a series of experiments were conducted using the MNIST dataset. Results demonstrate that the analytically computed CNN can recognize over half of 1000 handwritten digit images without any training, achieving inference in fractions of a second. These findings suggest that CNNs can be constructed and applied directly for classification tasks without training, using purely analytical computation of weights.
zh
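
补充说明:从每类仅一张样例图“解析地”得到分类器权重,其最朴素的对应物是模板相关分类器:把每个数字的样例图本身(去均值、归一化后)直接当作线性层的权重。下面的草图只是该思想的最小示例,并非论文的完整构造(论文还解析地推导了各层通道数):

```python
import numpy as np

def build_template_classifier(templates):
    """Minimal analogue of training-free weight setting: use one centered,
    L2-normalized exemplar per digit as the weight vector of a linear
    layer, i.e. a correlation classifier. This is only an illustration of
    the idea, not the paper's exact analytical construction."""
    W = []
    for img in templates:                       # ten exemplars, digits 0..9
        v = img.astype(np.float64).ravel()
        v -= v.mean()
        W.append(v / (np.linalg.norm(v) + 1e-12))
    return np.stack(W)                          # (10, 784) for 28x28 images

def predict(W, images):
    X = images.reshape(len(images), -1).astype(np.float64)
    X -= X.mean(axis=1, keepdims=True)
    return (X @ W.T).argmax(axis=1)             # highest correlation wins

templates = np.random.rand(10, 28, 28)          # stand-ins for MNIST exemplars
print(predict(build_template_classifier(templates), np.random.rand(4, 28, 28)))
```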

[CV-151] Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

【速读】:该论文试图解决现有基于优化的越狱攻击(jailbreak)在缺乏明确有毒信号时难以引发模型安全偏差的问题。传统方法依赖于“有毒延续”(Toxic-Continuation)范式,仅能有效延续已有的有毒输入,而无法在无显性有毒提示的情况下诱导模型产生不安全输出。论文提出的解决方案关键在于引入“良性到有毒”(Benign-to-Toxic, B2T)的新范式,通过优化对抗性图像,使模型在无安全违规的良性条件输入下生成有毒输出,从而迫使模型突破其安全机制。该方法在黑盒环境下具有迁移性,并可与基于文本的越狱方法互补,揭示了多模态对齐中的潜在漏洞并提出了新的越狱方向。

链接: https://arxiv.org/abs/2505.21556
作者: Hee-Seon Kim,Minbeom Kim,Wonjun Lee,Kihyun Kim,Changick Kim
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: LVLM, Jailbreak

点击查看摘要

Abstract:Optimization-based jailbreaks typically adopt the Toxic-Continuation setting in large vision-language models (LVLMs), following the standard next-token prediction objective. In this setting, an adversarial image is optimized to make the model predict the next token of a toxic prompt. However, we find that the Toxic-Continuation paradigm is effective at continuing already-toxic inputs, but struggles to induce safety misalignment when explicit toxic signals are absent. We propose a new paradigm: Benign-to-Toxic (B2T) jailbreak. Unlike prior work, we optimize adversarial images to induce toxic outputs from benign conditioning. Since benign conditioning contains no safety violations, the image alone must break the model’s safety mechanisms. Our method outperforms prior approaches, transfers in black-box settings, and complements text-based jailbreaks. These results reveal an underexplored vulnerability in multimodal alignment and introduce a fundamentally new direction for jailbreak approaches.
zh

[CV-152] Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing

【速读】:该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即模型在生成过程中虚构不存在的物体。其解决方案的关键在于通过构建图像标记的共现图,并利用图神经网络(Graph Neural Network, GNN)结合对比学习和聚类方法,识别出在视觉上下文中频繁共现的标记集群。研究发现,幻觉主要对应于那些在输入中占主导地位的标记集群,而其中视觉上缺失的标记与幻觉物体具有更高的相关性。基于此,作者提出了一种通过修改生成过程中的潜在图像嵌入来抑制视觉上缺失标记影响的幻觉缓解方法。

链接: https://arxiv.org/abs/2505.21547
作者: Weixing Wang,Zifeng Ding,Jindong Gu,Rui Cao,Christoph Meinel,Gerard de Melo,Haojin Yang
机构: Hasso Plattner Institute, University of Potsdam (哈索普拉特纳研究所,波茨坦大学); University of Cambridge (剑桥大学); University of Oxford (牛津大学); German University of Digital Science (德国数字科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at this https URL
zh
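
补充说明:该方法的第一步是统计离散图像token的共现关系。下面给出共现矩阵统计的最小草图(论文随后还用GNN加对比学习与聚类对其进行细化,这里从略):

```python
import numpy as np

def token_cooccurrence(token_maps, vocab_size):
    """Count how often pairs of discrete image tokens co-occur within the
    same image: the raw statistic behind the abstract's co-occurrence
    graph. The paper refines this with a GNN + contrastive learning and
    clustering, which this sketch omits.

    token_maps: list of 1-D integer arrays of token ids, one per image."""
    C = np.zeros((vocab_size, vocab_size))
    for toks in token_maps:
        present = np.unique(toks)
        C[np.ix_(present, present)] += 1.0      # all pairs present together
    np.fill_diagonal(C, 0.0)
    return C

maps = [np.random.randint(0, 64, size=256) for _ in range(100)]
C = token_cooccurrence(maps, 64)
print(C.shape, C.max())
```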

[CV-153] Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation

【速读】:该论文旨在解决潜在视频扩散模型(Latent Video Diffusion Models, LVDMs)在面对不完美的条件输入时所表现出的语义漂移和时间不连贯问题,尤其是在噪声较大、规模庞大的网络视频-文本数据集上。其解决方案的关键在于提出一种污染感知的训练框架CAT-LVDM,通过结构化且与数据对齐的噪声注入提升模型的鲁棒性。具体而言,该方法包括批次中心噪声注入(Batch-Centered Noise Injection, BCNI)和频谱感知上下文噪声(Spectrum-Aware Contextual Noise, SACN),分别用于保持时间一致性与增强低频平滑性,从而显著提升了模型在不同数据集上的性能。

链接: https://arxiv.org/abs/2505.21545
作者: Chika Maduabuchi,Hao Chen,Yujin Han,Jindong Wang
机构: William & Mary (威廉与玛丽学院); Carnegie Mellon University (卡内基梅隆大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code: this https URL Models: this https URL

点击查看摘要

Abstract:Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency. BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: this https URL
zh
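
补充说明:BCNI(批次中心噪声注入)的做法可以概括为:沿“批内语义方向”(即样本指向批次均值的方向)注入带高斯幅度的扰动。以下为示意实现,其中的缩放方式是假设,CAT-LVDM的确切公式以论文为准:

```python
import torch

def batch_centered_noise(emb, sigma=0.1):
    """Batch-Centered Noise Injection, at the level of detail given in the
    abstract: perturb each embedding along the direction toward the batch
    mean (an intra-batch semantic direction), with a Gaussian magnitude.

    emb: (batch, dim) caption/video embeddings."""
    center = emb.mean(dim=0, keepdim=True)
    direction = center - emb                          # data-aligned direction
    eps = torch.randn(emb.size(0), 1, device=emb.device)
    return emb + sigma * eps * direction

noisy = batch_centered_noise(torch.randn(16, 512))
print(noisy.shape)  # torch.Size([16, 512])
```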

[CV-154] DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers

【速读】:该论文试图解决半透明或透明层遮挡下的图像分层分解问题,现有方法由于依赖掩码先验、静态物体假设以及数据集的缺乏而难以有效分离这些遮挡。解决方案的关键在于提出AlphaBlend数据集,这是首个大规模且高质量的透明与半透明层分解数据集,支持六种现实世界子任务,并基于此数据集构建了DiffDecompose框架,该框架采用扩散Transformer结构,通过输入图像、语义提示和混合类型学习可能分层分解的后验分布,实现了无需逐层监督的上下文分解,并引入了层位置编码克隆技术以保持各层间的像素级对应关系。

链接: https://arxiv.org/abs/2505.21541
作者: Zitong Wang,Hang Zhao,Qianyu Zhou,Xuequan Lu,Xiangtai Li,Yiren Song
机构: Jilin University (吉林大学); The University of Tokyo (东京大学); University of Western Australia (西澳大学); ByteDance Inc (字节跳动公司); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large-scale and high-quality dataset for transparent and semi-transparent layer decomposition, supporting six real-world subtasks (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In-Context Decomposition, enabling the model to predict one or multiple layers without per-layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel-level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance. Our code will be available at: this https URL.
zh

[CV-155] Equivariant Flow Matching for Point Cloud Assembly

【速读】:该论文试图解决点云装配问题,即通过对齐多个点云片段来重建完整的3D形状(3D shape)。其解决方案的关键在于提出了一种基于流匹配模型的等变求解器,通过学习与输入片段相关的向量场来实现等变分布的学习,从而有效提升装配效果和训练数据效率。

链接: https://arxiv.org/abs/2505.21539
作者: Ziming Wang,Nan Xue,Rebecka Jörnsten
机构: CTH(查默斯理工大学); Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The goal of point cloud assembly is to reconstruct a complete 3D shape by aligning multiple point cloud pieces. This work presents a novel equivariant solver for assembly tasks based on flow matching models. We first theoretically show that the key to learning equivariant distributions via flow matching is to learn related vector fields. Based on this result, we propose an assembly model, called equivariant diffusion assembly (Eda), which learns related vector fields conditioned on the input pieces. We further construct an equivariant path for Eda, which guarantees high data efficiency of the training process. Our numerical results show that Eda is highly competitive on practical datasets, and it can even handle the challenging situation where the input pieces are non-overlapped.
zh

[CV-156] Caption This, Reason That: VLMs Caught in the Middle

【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在特定视觉任务中(如计数或关系推理)仍落后于人类能力的问题,其核心在于揭示VLM在感知、注意和记忆等认知维度上的局限性。解决方案的关键在于通过针对这些认知能力的基准测试评估先进VLMs,并发现模型在直接视觉推理任务中的不足可以通过基于自身生成文本描述的推理方式得到显著改善,从而强调了提升VLM链式思维(Chain-of-Thought, CoT)能力的重要性。此外,研究还表明对复合视觉推理任务进行针对性微调能够有效增强VLM的核心认知能力。

链接: https://arxiv.org/abs/2505.21538
作者: Zihan Weng,Lucas Gomez,Taylor Whittington Webb,Pouya Bashivan
机构: McGill University (麦吉尔大学); Mila, University of Montreal (蒙特利尔大学); Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs substantially improves core cognitive abilities. While this improvement does not translate to large enhancements on challenging, out-of-distribution benchmarks, we show broadly that VLM performance on our datasets strongly correlates with performance on these other benchmarks. Our work provides a detailed analysis of VLM cognitive strengths and weaknesses and identifies key bottlenecks in simultaneous perception and reasoning while also providing an effective and simple solution.
zh

[CV-157] Is Attention Required for Transformer Inference? Explore Function-preserving Attention Replacement NEURIPS2025

【速读】:该论文试图解决Transformer模型在推理过程中依赖注意力机制导致的效率问题,特别是在边缘和嵌入式加速器上由于并行性和内存带宽限制所带来的挑战。解决方案的关键在于提出FAR(Function-preserving Attention Replacement)框架,该框架通过可学习的序列到序列模块(如LSTM)替换预训练Transformer中的所有注意力模块,从而在保持模型性能的同时显著降低参数量和推理延迟。

链接: https://arxiv.org/abs/2505.21535
作者: Yuxin Ren,Maxwell D Collins,Miao Hu,Huanrui Yang
机构: University of Arizona (亚利桑那大学); TetraMem, Inc. (TetraMem公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages main paper + 6 pages appendix, 14 figures; submitted to NeurIPS 2025

点击查看摘要

Abstract:While transformers excel across vision and language pretraining tasks, their reliance on attention mechanisms poses challenges for inference efficiency, especially on edge and embedded accelerators with limited parallelism and memory bandwidth. Hinted by the observed redundancy of attention at inference time, we hypothesize that though the model learns complicated token dependency through pretraining, the inference-time sequence-to-sequence mapping in each attention layer is actually ‘‘simple’’ enough to be represented with a much cheaper function. In this work, we explore FAR, a Function-preserving Attention Replacement framework that replaces all attention blocks in pretrained transformers with learnable sequence-to-sequence modules, exemplified by an LSTM. FAR optimizes a multi-head LSTM architecture with a block-wise distillation objective and a global structural pruning framework to achieve a family of efficient LSTM-based models from pretrained transformers. We validate FAR on the DeiT vision transformer family and demonstrate that it matches the accuracy of the original models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships and the token-to-token correlation learned in the transformer’s attention module.
zh
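
补充说明:FAR用可学习的序列到序列模块替换注意力块,文中以LSTM为例。下面是一个“多头LSTM块”的最小草图(头数与维度仅为示意);按摘要所述,训练时可配合逐块蒸馏,即令该模块的输出以MSE等目标逼近原注意力块的输出:

```python
import torch
import torch.nn as nn

class MultiHeadLSTMBlock(nn.Module):
    """Drop-in sequence-to-sequence replacement for one attention block,
    in the spirit of FAR: several parallel LSTM 'heads' whose outputs are
    concatenated and projected back to the model width. Head count and
    sizes are illustrative, not the paper's exact configuration."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        assert dim % heads == 0
        self.heads = nn.ModuleList(
            nn.LSTM(dim, dim // heads, batch_first=True) for _ in range(heads))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        outs = [h(x)[0] for h in self.heads]   # each: (batch, seq, dim/heads)
        return self.proj(torch.cat(outs, dim=-1))

tokens = torch.randn(2, 197, 384)              # a ViT-like token sequence
print(MultiHeadLSTMBlock()(tokens).shape)      # torch.Size([2, 197, 384])
```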

[CV-158] Self-Organizing Visual Prototypes for Non-Parametric Representation Learning ICML2025

【速读】:该论文试图解决无监督视觉特征学习中的原型表示不足问题,传统方法通常依赖单一原型来编码隐藏聚类中的所有相关特征,而这种方法可能无法充分捕捉数据区域的复杂性。解决方案的关键在于提出自组织视觉原型(Self-Organizing Visual Prototypes, SOP)策略,该策略通过多个语义相似的支撑嵌入(support embeddings, SEs)来表示一个原型,每个SE包含互补的特征集合,从而更准确地表征其空间区域并提升训练性能。此外,论文引入了SOP掩码图像建模(SOP-MIM)任务,从多个非参数局部SE的角度重建被掩码的表示,进一步优化了模型性能。

链接: https://arxiv.org/abs/2505.21533
作者: Thalles Silva,Helio Pedrini,Adín Ramírez Rivera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICML 2025, code at this https URL

点击查看摘要

Abstract:We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders.
zh

[CV-159] EvidenceMoE: A Physics-Guided Mixture-of-Experts with Evidential Critics for Advancing Fluorescence Light Detection and Ranging in Scattering Media

【速读】:该论文旨在解决荧光激光雷达(Fluorescence LiDAR)在散射介质中进行距离和深度估计时面临的计算挑战,特别是难以分离光子飞行时间(与目标深度相关)和固有荧光寿命的问题。其解决方案的关键在于提出一种物理引导的专家混合(Physics-Guided Mixture-of-Experts, MoE)框架,其中专家模型基于底层物理规律(如描述光子在散射介质中传播的辐射传输方程)。该框架的核心是EvidenceMoE,它结合了基于证据的Dirichlet评判器(Evidence-Based Dirichlet Critics, EDCs),通过为每个专家的输出提供质量评分和修正反馈来评估其可靠性,并由决策网络自适应融合专家预测以生成鲁棒的最终估计。

链接: https://arxiv.org/abs/2505.21532
作者: Ismail Erbas,Ferhat Demirkiran,Karthik Swaminathan,Naigang Wang,Navid Ibtehaj Nizam,Stefan T. Radev,Kaoutar El Maghraoui,Xavier Intes,Vikas Pandey
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); University at Albany (阿尔巴尼大学); IBM T.J. Watson Research Center (IBM托马斯·J·沃森研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optics (physics.optics)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Fluorescence LiDAR (FLiDAR), a Light Detection and Ranging (LiDAR) technology employed for distance and depth estimation across medical, automotive, and other fields, encounters significant computational challenges in scattering media. The complex nature of the acquired FLiDAR signal, particularly in such environments, makes isolating photon time-of-flight (related to target depth) and intrinsic fluorescence lifetime exceptionally difficult, thus limiting the effectiveness of current analytical and computational methodologies. To overcome this limitation, we present a Physics-Guided Mixture-of-Experts (MoE) framework tailored for specialized modeling of diverse temporal components. In contrast to the conventional MoE approaches our expert models are informed by underlying physics, such as the radiative transport equation governing photon propagation in scattering media. Central to our approach is EvidenceMoE, which integrates Evidence-Based Dirichlet Critics (EDCs). These critic models assess the reliability of each expert’s output by providing per-expert quality scores and corrective feedback. A Decider Network then leverages this information to fuse expert predictions into a robust final estimate adaptively. We validate our method using realistically simulated Fluorescence LiDAR (FLiDAR) data for non-invasive cancer cell depth detection generated from photon transport models in tissue. Our framework demonstrates strong performance, achieving a normalized root mean squared error (NRMSE) of 0.030 for depth estimation and 0.074 for fluorescence lifetime.
zh
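
补充说明:EvidenceMoE的读出流程是:各专家给出预测,对应的Dirichlet评判器给出质量分与修正反馈,决策网络再加权融合。真实的决策网络(Decider Network)是可学习的;下面用一个固定的加权规则做示意:

```python
import torch
import torch.nn.functional as F

def evidence_weighted_fusion(expert_preds, critic_quality, critic_corrections):
    """Toy version of the EvidenceMoE read-out: each expert emits a
    prediction, its critic emits a quality score and a corrective
    residual; a softmax over qualities weights the corrected predictions.
    The actual Decider Network is learned; this fixed rule is only an
    illustration.

    expert_preds, critic_corrections: (n_experts, out_dim)
    critic_quality: (n_experts,)"""
    corrected = expert_preds + critic_corrections
    w = F.softmax(critic_quality, dim=0).unsqueeze(-1)   # reliability weights
    return (w * corrected).sum(dim=0)

pred = evidence_weighted_fusion(torch.randn(3, 2), torch.rand(3),
                                0.1 * torch.randn(3, 2))
print(pred)   # fused estimate, e.g. (depth, lifetime)
```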

[CV-160] UniDB: Fast Sampling of Unified Diffusion Bridge

【速读】:该论文旨在解决统一扩散桥(UniDB)框架在图像生成任务中因依赖迭代欧拉采样方法而导致的推理速度慢、计算成本高的问题,以及现有加速技术无法有效应对其独特挑战(如缺失终端均值约束和SOC特定惩罚系数)的问题。解决方案的关键在于提出一种无需训练的采样算法UniDB++,其核心进展是推导出UniDB反向时序随机微分方程(SDE)的精确闭式解,从而有效减少欧拉近似中的误差累积,并在减少采样步骤(最多20倍)的情况下实现高质量生成。此外,通过引入更稳定的数据预测模型和SDE-Corrector机制,进一步提升了低步骤范围内的感知质量。

链接: https://arxiv.org/abs/2505.21528
作者: Mokai Pan,Kaizhen Zhu,Yuexin Ma,Yanwei Fu,Jingyi Yu,Jingya Wang,Ye Shi
机构: ShanghaiTech University (上海科技大学); Fudan University (复旦大学); MoE Key Laboratory of Intelligent Perception and Human Machine Collaboration (教育部智能感知与人机协同重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion Bridges enable transitions between arbitrary distributions, with the Unified Diffusion Bridge (UniDB) framework achieving high-fidelity image generation via a Stochastic Optimal Control (SOC) formulation. However, UniDB’s reliance on iterative Euler sampling methods results in slow, computationally expensive inference, while existing acceleration techniques for diffusion or diffusion bridge models fail to address its unique challenges: missing terminal mean constraints and SOC-specific penalty coefficients in its SDEs. We present UniDB++, a training-free sampling algorithm that significantly improves upon these limitations. The method’s key advancement comes from deriving exact closed-form solutions for UniDB’s reverse-time SDEs, effectively reducing the error accumulation inherent in Euler approximations and enabling high-quality generation with up to 20× fewer sampling steps. This method is further complemented by replacing conventional noise prediction with a more stable data prediction model, along with an SDE-Corrector mechanism that maintains perceptual quality for low-step regimes (5-10 steps). Additionally, we demonstrate that UniDB++ aligns with existing diffusion bridge acceleration methods by evaluating their update rules, and UniDB++ can recover DBIMs as special cases under some theoretical conditions. Experiments demonstrate UniDB++'s state-of-the-art performance in image restoration tasks, outperforming Euler-based methods in fidelity and speed while reducing inference time significantly. This work bridges the gap between theoretical generality and practical efficiency in SOC-driven diffusion bridge models. Our code is available at this https URL.
zh

[CV-161] mporal Restoration and Spatial Rewiring for Source-Free Multivariate Time Series Domain Adaptation

【速读】:该论文旨在解决无监督域适应(Source-Free Domain Adaptation, SFDA)在多变量时间序列(Multivariate Time Series, MTS)数据上的性能不足问题,特别是在缺乏源域数据的情况下,现有方法难以有效捕捉MTS数据中固有的空间相关性,从而影响特征对齐和域适应效果。解决方案的关键在于提出一种名为Temporal Restoration and Spatial Rewiring (TERSE)的新方法,该方法通过定制化的时空特征编码器,结合时间恢复和空间重布任务,以重建时间掩码序列的潜在表示和空间相关结构,从而实现跨域的空间-时间依赖性的建模与迁移。

链接: https://arxiv.org/abs/2505.21525
作者: Peiliang Gong,Yucheng Wang,Min Wu,Zhenghua Chen,Xiaoli Li,Daoqiang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained model from an annotated source domain to an unlabelled target domain without accessing the source data, thereby preserving data privacy. While existing SFDA methods have proven effective in reducing reliance on source data, they struggle to perform well on multivariate time series (MTS) due to their failure to consider the intrinsic spatial correlations inherent in MTS data. These spatial correlations are crucial for accurately representing MTS data and preserving invariant information across domains. To address this challenge, we propose Temporal Restoration and Spatial Rewiring (TERSE), a novel and concise SFDA method tailored for MTS data. Specifically, TERSE comprises a customized spatial-temporal feature encoder designed to capture the underlying spatial-temporal characteristics, coupled with both temporal restoration and spatial rewiring tasks to reinstate latent representations of the temporally masked time series and the spatially masked correlated structures. During the target adaptation phase, the target encoder is guided to produce spatially and temporally consistent features with the source domain by leveraging the source pre-trained temporal restoration and spatial rewiring networks. Therefore, TERSE can effectively model and transfer spatial-temporal dependencies across domains, facilitating implicit feature alignment. In addition, as the first approach to simultaneously consider spatial-temporal consistency in MTS-SFDA, TERSE can also be integrated as a versatile plug-and-play module into established SFDA methods. Extensive experiments on three real-world time series datasets demonstrate the effectiveness and versatility of our approach.
zh

[CV-162] Learning Shared Representations from Unpaired Data

【速读】:该论文试图解决多模态表示学习中共享表示学习的问题,特别是如何在缺乏配对样本的情况下构建有效的跨模态嵌入。解决方案的关键在于利用从每个单模态表示中独立构建的随机游走矩阵的谱嵌入,从而几乎完全依赖于未配对数据来学习共享表示。这一方法在计算机视觉和自然语言处理领域得到了实证支持,展示了未配对数据在捕捉有意义的跨模态关系方面的有效性。

链接: https://arxiv.org/abs/2505.21524
作者: Amitai Yacobi,Nir Ben-Ari,Ronen Talmon,Uri Shaham
机构: Bar-Ilan University (巴伊兰大学); Technion (技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support its potential, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations, demonstrating high capabilities in retrieval tasks, generation, arithmetics, zero-shot, and cross-domain classification. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our code is publicly available at this https URL.
zh
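
补充说明:摘要中的构造相当具体:对每个单模态表示独立构建随机游走矩阵并取其谱嵌入。下面给出一个基于kNN高斯亲和度的最小实现草图(邻居数与带宽的选择均为此处的假设):

```python
import numpy as np

def random_walk_spectral_embedding(X, k=10, dim=8):
    """Spectral embedding of a kNN random-walk matrix built from ONE
    modality's features, per the abstract's recipe. Computing this
    independently per modality gives the coordinates in which unpaired
    cross-modal alignment is argued to emerge.

    X: (n_samples, n_features) unimodal representations."""
    # pairwise squared distances and a kNN Gaussian affinity
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sigma = np.median(d2) + 1e-12
    W = np.exp(-d2 / sigma)
    idx = np.argsort(d2, axis=1)[:, k + 1:]          # drop all but k neighbors
    np.put_along_axis(W, idx, 0.0, axis=1)
    W = np.maximum(W, W.T)                           # symmetrize
    P = W / W.sum(axis=1, keepdims=True)             # random-walk matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    return vecs[:, order[1:dim + 1]].real            # skip the trivial eigenvector

emb = random_walk_spectral_embedding(np.random.rand(200, 32))
print(emb.shape)   # (200, 8)
```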

[CV-163] CIM-NET: A Video Denoising Deep Neural Network Model Optimized for Computing-in-Memory Architectures

【速读】:该论文旨在解决将先进的深度神经网络(DNN)视频去噪模型部署到边缘设备时面临的实时性和能效挑战。现有DNN模型在设计时未充分考虑存内计算(Computing-in-Memory, CIM)架构的约束,从而限制了其在CIM芯片上的加速潜力。解决方案的关键在于提出一种软硬件协同设计框架,包含两个创新点:一是针对大感受野操作和CIM交叉开关结构的矩阵向量乘法(MVM)加速优化的CIM-Aware架构CIM-NET;二是用于集成滑动处理与全连接变换的伪卷积算子CIM-CONV,以实现高质量特征提取与重建。该框架显著减少了MVM操作次数,在保持竞争性性能的同时提升了CIM芯片上的推理速度。

链接: https://arxiv.org/abs/2505.21522
作者: Shan Gao,Zhiqiang Wu,Yawen Niu,Xiaotao Li,Qingqing Xu
机构: China Mobile Research Institute (中国移动研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:While deep neural network (DNN)-based video denoising has demonstrated significant performance, deploying state-of-the-art models on edge devices remains challenging due to stringent real-time and energy efficiency requirements. Computing-in-Memory (CIM) chips offer a promising solution by integrating computation within memory cells, enabling rapid matrix-vector multiplication (MVM). However, existing DNN models are often designed without considering CIM architectural constraints, thus limiting their acceleration potential during inference. To address this, we propose a hardware-algorithm co-design framework incorporating two innovations: (1) a CIM-Aware Architecture, CIM-NET, optimized for large receptive field operation and CIM’s crossbar-based MVM acceleration; and (2) a pseudo-convolutional operator, CIM-CONV, used within CIM-NET to integrate slide-based processing with fully connected transformations for high-quality feature extraction and reconstruction. This framework significantly reduces the number of MVM operations, improving inference speed on CIM chips while maintaining competitive performance. Experimental results indicate that, compared to the conventional lightweight model FastDVDnet, CIM-NET substantially reduces MVM operations with a slight decrease in denoising performance. With a stride value of 8, CIM-NET reduces MVM operations to 1/77th of the original, while maintaining competitive PSNR (35.11 dB vs. 35.56 dB).
zh
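
补充说明:CIM-CONV把“滑动取块”与“全连接变换”结合:以大步幅提取大尺寸patch,每个patch只做一次全连接映射(对应CIM交叉开关上的一次MVM),以此减少MVM次数。以下为示意实现(patch大小、步幅与通道数均为示例值,并非论文原配置):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoConv(nn.Module):
    """Sketch of a CIM-CONV-style operator: extract large sliding patches
    with a big stride, then map each patch through a single fully
    connected layer, i.e. one crossbar MVM per patch instead of many
    small convolutions."""
    def __init__(self, c_in=3, c_out=16, patch=16, stride=8):
        super().__init__()
        self.patch, self.stride = patch, stride
        self.fc = nn.Linear(c_in * patch * patch, c_out * patch * patch)

    def forward(self, x):                                  # x: (B, C, H, W)
        cols = F.unfold(x, self.patch, stride=self.stride) # (B, C*p*p, L)
        out = self.fc(cols.transpose(1, 2))                # one MVM per patch
        return out.transpose(1, 2)                         # (B, C_out*p*p, L)

y = PseudoConv()(torch.randn(1, 3, 64, 64))
print(y.shape)   # per-patch features, to be folded back for reconstruction
```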

[CV-164] Do DeepFake Attribution Models Generalize?

【速读】:该论文旨在解决DeepFake检测中存在的一系列挑战,特别是针对二分类模型在处理不同伪造技术时的局限性,以及对伪造溯源(attribution)模型的研究不足。其关键解决方案是通过引入多类别模型和对比学习方法,提升模型在跨数据集场景下的泛化能力和对已知伪造技术的识别准确性,同时探索不同模型规模、数据质量和训练方法对溯源性能的影响。

链接: https://arxiv.org/abs/2505.21520
作者: Spiros Baxavanakis,Manos Schinas,Symeon Papadopoulos
机构: Information Technologies Institute @ CERTH(信息科技研究所 @ CERTH)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in DeepFake generation, along with the proliferation of open-source tools, have significantly lowered the barrier for creating synthetic media. This trend poses a serious threat to the integrity and authenticity of online information, undermining public trust in institutions and media. State-of-the-art research on DeepFake detection has primarily focused on binary detection models. A key limitation of these models is that they treat all manipulation techniques as equivalent, despite the fact that different methods introduce distinct artifacts and visual cues. Only a limited number of studies explore DeepFake attribution models, although such models are crucial in practical settings. By providing the specific manipulation method employed, these models could enhance both the perceived trustworthiness and explainability for end users. In this work, we leverage five state-of-the-art backbone models and conduct extensive experiments across six DeepFake datasets. First, we compare binary and multi-class models in terms of cross-dataset generalization. Second, we examine the accuracy of attribution models in detecting seen manipulation methods in unknown datasets, hence uncovering data distribution shifts on the same DeepFake manipulations. Last, we assess the effectiveness of contrastive methods in improving cross-dataset generalization performance. Our findings indicate that while binary models demonstrate better generalization abilities, larger models, contrastive methods, and higher data quality can lead to performance improvements in attribution models. The code of this work is available on GitHub.
zh

[CV-165] Enhancing Vision Transformer Explainability Using Artificial Astrocytes CVPR

【速读】:该论文试图解决深度学习模型决策过程缺乏可解释性的问题,尤其是在模型复杂度增加时,可解释性通常会下降。解决方案的关键在于提出一种无需训练的Vision Transformer with artificial Astrocytes (ViTA)方法,该方法受神经科学启发,通过增强预训练深度神经网络的推理能力,生成更符合人类认知的解释。

链接: https://arxiv.org/abs/2505.21513
作者: Nicolas Echevarrieta-Catalan,Ana Ribas-Rodriguez,Francisco Cedron,Odelia Schwartz,Vanessa Aguiar-Pulido
机构: University of Miami(迈阿密大学); University of A Coruña(拉科鲁尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: LXCV Workshop at IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) 2025

点击查看摘要

Abstract:Machine learning models achieve high precision, but their decision-making processes often lack explainability. Furthermore, as model complexity increases, explainability typically decreases. Existing efforts to improve explainability primarily involve developing new eXplainable artificial intelligence (XAI) techniques or incorporating explainability constraints during training. While these approaches yield specific improvements, their applicability remains limited. In this work, we propose the Vision Transformer with artificial Astrocytes (ViTA). This training-free approach is inspired by neuroscience and enhances the reasoning of a pretrained deep neural network to generate more human-aligned explanations. We evaluated our approach employing two well-known XAI techniques, Grad-CAM and Grad-CAM++, and compared it to a standard Vision Transformer (ViT). Using the ClickMe dataset, we quantified the similarity between the heatmaps produced by the XAI techniques and a (human-aligned) ground truth. Our results consistently demonstrate that incorporating artificial astrocytes enhances the alignment of model explanations with human perception, leading to statistically significant improvements across all XAI techniques and metrics utilized.
zh

[CV-166] Chest Disease Detection In X-Ray Images Using Deep Learning Classification Method

【速读】:该论文旨在解决如何准确分类胸部X光图像为四种类别(COVID-19、肺炎、结核病和正常)的问题,以辅助临床诊断。其解决方案的关键在于利用先进的预训练卷积神经网络(Convolutional Neural Networks, CNNs)模型,并通过迁移学习技术对这些模型进行微调,以适应医学X光图像的分类任务,同时采用Grad-CAM方法提高模型的可解释性,从而增强临床应用中的信任度和透明度。

链接: https://arxiv.org/abs/2505.22609
作者: Alanna Hazlett,Naomi Ohashi,Timothy Rodriguez,Sodiq Adewole
机构: University of Virginia(弗吉尼亚大学); School of Data Science(数据科学学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this work, we investigate the performance across multiple classification models to classify chest X-ray images into four categories of COVID-19, pneumonia, tuberculosis (TB), and normal cases. We leveraged transfer learning techniques with state-of-the-art pre-trained Convolutional Neural Networks (CNNs) models. We fine-tuned these pre-trained architectures on labeled medical X-ray images. The initial results are promising with high accuracy and strong performance in key classification metrics such as precision, recall, and F1 score. We applied Gradient-weighted Class Activation Mapping (Grad-CAM) for model interpretability to provide visual explanations for classification decisions, improving trust and transparency in clinical applications.
zh
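
补充说明:该论文的流程是典型的迁移学习微调:取ImageNet预训练CNN,替换分类头为四类输出后再微调。下面给出一个最小示例(骨干网选择与冻结策略为此处的假设,论文比较了多种预训练架构;Grad-CAM可在训练后基于最后一个卷积层的梯度另行计算):

```python
import torch
import torch.nn as nn
from torchvision import models

# Fine-tune a pretrained backbone for the four-class chest X-ray task.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False                       # freeze pretrained features
model.fc = nn.Linear(model.fc.in_features, 4)     # COVID/pneumonia/TB/normal

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# one toy training step on random stand-in data
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 4, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```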

[CV-167] Comparative Analysis of Machine Learning Models for Lung Cancer Mutation Detection and Staging Using 3D CT Scans

【速读】:该论文旨在解决肺癌患者非侵入性检测关键基因突变及分期的问题,以提升患者预后。其解决方案的关键在于对比两种机器学习模型——FMCIB+XGBoost(一种具有领域特定预训练的监督模型)与Dinov2+ABMIL(一种基于注意力的多实例学习的自监督模型)在3D肺结节数据上的性能,从而评估不同模型在基因突变检测和癌症分期任务中的有效性与适应性。

链接: https://arxiv.org/abs/2505.22592
作者: Yiheng Li,Francisco Carrillo-Perez,Mohammed Alawad,Olivier Gevaert
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Lung cancer is the leading cause of cancer mortality worldwide, and non-invasive methods for detecting key mutations and staging are essential for improving patient outcomes. Here, we compare the performance of two machine learning models - FMCIB+XGBoost, a supervised model with domain-specific pretraining, and Dinov2+ABMIL, a self-supervised model with attention-based multiple-instance learning - on 3D lung nodule data from the Stanford Radiogenomics and Lung-CT-PT-Dx cohorts. In the task of KRAS and EGFR mutation detection, FMCIB+XGBoost consistently outperformed Dinov2+ABMIL, achieving accuracies of 0.846 and 0.883 for KRAS and EGFR mutations, respectively. In cancer staging, Dinov2+ABMIL demonstrated competitive generalization, achieving an accuracy of 0.797 for T-stage prediction in the Lung-CT-PT-Dx cohort, suggesting SSL’s adaptability across diverse datasets. Our results emphasize the clinical utility of supervised models in mutation detection and highlight the potential of SSL to improve staging generalization, while identifying areas for enhancement in mutation sensitivity.
zh

[CV-168] Multipath cycleGAN for harmonization of paired and unpaired low-dose lung computed tomography reconstruction kernels

【速读】:该论文旨在解决CT重建核(reconstruction kernel)对定量成像测量(如肺气肿量化)产生的系统性变异问题,从而实现不同CT扫描参数下的图像一致性分析。其解决方案的关键在于提出一种多路径循环生成对抗网络(multipath cycleGAN)模型,该模型通过域特定的编码器和解码器以及共享潜在空间,在配对与非配对数据上进行训练,以实现CT图像的核标准化(kernel harmonization)。该方法有效减少了肺气肿评分的偏差,并在保持解剖结构一致性的前提下提升了定量分析的准确性。

链接: https://arxiv.org/abs/2505.22568
作者: Aravind R. Krishnan,Thomas Z. Li,Lucas W. Remedios,Michael E. Kim,Chenyu Gao,Gaurav Rudravaram,Elyssa M. McMaster,Adam M. Saunders,Shunxing Bao,Kaiwen Xu,Lianrui Zuo,Kim L. Sandler,Fabien Maldonado,Yuankai Huo,Bennett A. Landman
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstruction kernels in computed tomography (CT) affect spatial resolution and noise characteristics, introducing systematic variability in quantitative imaging measurements such as emphysema quantification. Choosing an appropriate kernel is therefore essential for consistent quantitative analysis. We propose a multipath cycleGAN model for CT kernel harmonization, trained on a mixture of paired and unpaired data from a low-dose lung cancer screening cohort. The model features domain-specific encoders and decoders with a shared latent space and uses discriminators tailored for each domain. We train the model on 42 kernel combinations using 100 scans each from seven representative kernels in the National Lung Screening Trial (NLST) dataset. To evaluate performance, 240 scans from each kernel are harmonized to a reference soft kernel, and emphysema is quantified before and after harmonization. A general linear model assesses the impact of age, sex, smoking status, and kernel on emphysema. We also evaluate harmonization from soft kernels to a reference hard kernel. To assess anatomical consistency, we compare segmentations of lung vessels, muscle, and subcutaneous adipose tissue generated by TotalSegmentator between harmonized and original images. Our model is benchmarked against traditional and switchable cycleGANs. For paired kernels, our approach reduces bias in emphysema scores, as seen in Bland-Altman plots (p < 0.05). For unpaired kernels, harmonization eliminates confounding differences in emphysema (p > 0.05). High Dice scores confirm preservation of muscle and fat anatomy, while lung vessel overlap remains reasonable. Overall, our shared latent space multipath cycleGAN enables robust harmonization across paired and unpaired CT kernels, improving emphysema quantification and preserving anatomical fidelity.
zh

[CV-169] Surf2CT: Cascaded 3D Flow Matching Models for Torso 3D CT Synthesis from Skin Surface NEURIPS2025

【速读】:该论文试图解决如何仅通过外部体表扫描和简单人口统计信息(如年龄、性别、身高、体重)生成完整的三维计算机断层扫描(CT)体积的问题,而无需任何内部成像数据。其解决方案的关键在于提出了一种新型的级联流匹配框架Surf2CT,该框架包含三个阶段:表面补全、粗略CT合成和CT超分辨率重建,每个阶段均采用经过流匹配训练的3D适应的EDM2主干网络,从而实现了从外部数据到高保真内部解剖结构图像的生成。

链接: https://arxiv.org/abs/2505.22511
作者: Siyeop Yoon,Yujin Oh,Pengfei Jin,Sifan Song,Matthew Tivnan,Dufan Wu,Xiang Li,Quanzheng Li
机构: Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114(马萨诸塞总医院和哈佛医学院,波士顿,MA 02114)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Neurips 2025 submitted

点击查看摘要

Abstract:We present Surf2CT, a novel cascaded flow matching framework that synthesizes full 3D computed tomography (CT) volumes of the human torso from external surface scans and simple demographic data (age, sex, height, weight). This is the first approach capable of generating realistic volumetric internal anatomy images solely based on external body shape and demographics, without any internal imaging. Surf2CT proceeds through three sequential stages: (1) Surface Completion, reconstructing a complete signed distance function (SDF) from partial torso scans using conditional 3D flow matching; (2) Coarse CT Synthesis, generating a low-resolution CT volume from the completed SDF and demographic information; and (3) CT Super-Resolution, refining the coarse volume into a high-resolution CT via a patch-wise conditional flow model. Each stage utilizes a 3D-adapted EDM2 backbone trained via flow matching. We trained our model on a combined dataset of 3,198 torso CT scans (approximately 1.13 million axial slices) sourced from Massachusetts General Hospital (MGH) and the AutoPET challenge. Evaluation on 700 paired torso surface-CT cases demonstrated strong anatomical fidelity: organ volumes exhibited small mean percentage differences (range from -11.1% to 4.4%), and muscle/fat body composition metrics matched ground truth with strong correlation (range from 0.67 to 0.96). Lung localization had minimal bias (mean difference -2.5 mm), and surface completion significantly improved metrics (Chamfer distance: from 521.8 mm to 2.7 mm; Intersection-over-Union: from 0.87 to 0.98). Surf2CT establishes a new paradigm for non-invasive internal anatomical imaging using only external data, opening opportunities for home-based healthcare, preventive medicine, and personalized clinical assessments without the risks associated with conventional imaging techniques.
zh

[CV-170] Risk-Sensitive Conformal Prediction for Catheter Placement Detection in Chest X-rays

【速读】:该论文旨在解决胸片中导管和线路位置检测的问题,以满足关键的临床需求。其解决方案的关键在于结合多任务学习与风险敏感的保真预测(risk-sensitive conformal prediction),通过同时执行分类、分割和关键点检测,并利用这些任务之间的协同关系来提升整体性能,同时通过风险敏感的保真预测方法提高对临床关键发现的预测可靠性,从而确保系统在临床部署中的安全性与准确性。

链接: https://arxiv.org/abs/2505.22496
作者: Long Hui
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:

点击查看摘要

Abstract:This paper presents a novel approach to catheter and line position detection in chest X-rays, combining multi-task learning with risk-sensitive conformal prediction to address critical clinical requirements. Our model simultaneously performs classification, segmentation, and landmark detection, leveraging the synergistic relationship between these tasks to improve overall performance. We further enhance clinical reliability through risk-sensitive conformal prediction, which provides statistically guaranteed prediction sets with higher reliability for clinically critical findings. Experimental results demonstrate excellent performance with 90.68% overall empirical coverage and 99.29% coverage for critical conditions, while maintaining remarkable precision in prediction sets. Most importantly, our risk-sensitive approach achieves zero high-risk mispredictions (cases where the system dangerously declares problematic tubes as confidently normal), making the system particularly suitable for clinical deployment. This work offers both accurate predictions and reliably quantified uncertainty – essential features for life-critical medical applications.
zh
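
补充说明:“风险敏感的保真预测”可以理解为对临床关键类别使用更小的误覆盖率α,从而给出更保守的预测集。下面是一个按类别条件化的split conformal草图(以1−p(真实类)作为非一致性分数;这一选择与具体分位数构造均为常规做法层面的假设,未必与论文实现一致):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alphas):
    """Split-conformal prediction with class-dependent miscoverage: a
    smaller alpha for clinically critical classes yields larger, safer
    prediction sets for those findings.

    alphas: per-class miscoverage targets, e.g. 0.01 for critical classes."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    sets = []
    for p in test_probs:
        keep = []
        for c, alpha in enumerate(alphas):
            mask = cal_labels == c
            level = min(1.0, np.ceil((mask.sum() + 1) * (1 - alpha)) / mask.sum())
            q = np.quantile(scores[mask], level)         # class-wise threshold
            if 1.0 - p[c] <= q:
                keep.append(c)
        sets.append(keep)
    return sets

rng = np.random.default_rng(1)
cal = rng.dirichlet(np.ones(3), 500)                     # stand-in probabilities
labels = rng.integers(0, 3, 500)
print(conformal_sets(cal, labels, cal[:2], [0.1, 0.1, 0.01])[:2])
```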

[CV-171] Cascaded 3D Diffusion Models for Whole-body 3D 18-F FDG PET/CT synthesis from Demographics MICCAI2025

【速读】:该论文旨在解决在肿瘤影像、虚拟试验和人工智能驱动的数据增强中对真实数字孪生体日益增长的需求,传统确定性体模无法提供足够的 anatomical 和 metabolic 多样性。其解决方案的关键在于提出一种级联的3D扩散模型框架,该框架通过两阶段生成过程实现:首先使用基于分数的扩散模型从人口统计变量生成低分辨率的PET/CT图像,随后通过超分辨率残差扩散模型提升空间分辨率,从而生成高保真度的3D PET/CT体积。

链接: https://arxiv.org/abs/2505.22489
作者: Siyeop Yoon,Sifan Song,Pengfei Jin,Matthew Tivnan,Yujin Oh,Sekeun Kim,Dufan Wu,Xiang Li,Quanzheng Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: MICCAI2025 Submitted version

点击查看摘要

Abstract:We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative process. An initial score-based diffusion model synthesizes low-resolution PET/CT volumes from demographic variables alone, providing global anatomical structures and approximate metabolic activity. This is followed by a super-resolution residual diffusion model that refines spatial resolution. Our framework was trained on 18-F FDG PET/CT scans from the AutoPET dataset and evaluated using organ-wise volume and standardized uptake value (SUV) distributions, comparing synthetic and real data between demographic subgroups. The organ-wise comparison demonstrated strong concordance between synthetic and real images. In particular, most deviations in metabolic uptake values remained within 3-5% of the ground truth in subgroup analysis. These findings highlight the potential of cascaded 3D diffusion models to generate anatomically and metabolically accurate PET/CT images, offering a robust alternative to traditional phantoms and enabling scalable, population-informed synthetic imaging for clinical and research applications.
zh

[CV-172] Large-Area Fabrication-aware Computational Diffractive Optics

【速读】:该论文试图解决可学习的衍射光学系统在实际应用中的制造限制问题,特别是仿真与实际制造器件之间存在的质量差距。其解决方案的关键在于提出了一种面向制造的衍射光学设计流程,该流程结合了直接写入灰度光刻和纳米压印复制技术,适用于低成本的大面积批量生产。此外,还提出了一个超分辨率神经光刻模型,能够准确预测制造过程产生的三维几何结构,并可无缝集成到现有的可微分光学框架中,实现面向制造的端到端计算光学系统优化。

链接: https://arxiv.org/abs/2505.22313
作者: Kaixuan Wei,Hector A. Jimenez-Romero,Hadi Amata,Jipeng Sun,Qiang Fu,Felix Heide,Wolfgang Heidrich
机构: King Abdullah University of Science and Technology(沙特阿拉伯国王阿卜杜拉科技大学); Princeton University(普林斯顿大学)
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Differentiable optics, as an emerging paradigm that jointly optimizes optics and (optional) image processing algorithms, has made innovative optical designs possible across a broad range of applications. Many of these systems utilize diffractive optical components (DOEs) for holography, PSF engineering, or wavefront shaping. Existing approaches have, however, mostly remained limited to laboratory prototypes, owing to a large quality gap between simulation and manufactured devices. We aim at lifting the fundamental technical barriers to the practical use of learned diffractive optical systems. To this end, we propose a fabrication-aware design pipeline for diffractive optics fabricated by direct-write grayscale lithography followed by nano-imprinting replication, which is directly suited for inexpensive mass production of large area designs. We propose a super-resolved neural lithography model that can accurately predict the 3D geometry generated by the fabrication process. This model can be seamlessly integrated into existing differentiable optics frameworks, enabling fabrication-aware, end-to-end optimization of computational optical systems. To tackle the computational challenges, we also devise a tensor-parallel compute framework centered on distributing large-scale FFT computation across many GPUs. As such, we demonstrate large-scale diffractive optics designs up to 32.16 mm × 21.44 mm, simulated on grids of up to 128,640 by 85,760 feature points. We find adequate agreement between simulation and fabricated prototypes for applications such as holography and PSF engineering. We also achieve high image quality from an imaging system comprised only of a single DOE, with images processed only by a Wiener filter utilizing the simulation PSF. We believe our findings lift the fabrication limitations for real-world applications of diffractive optics and differentiable optical design.

[CV-173] Physics-inspired Generative AI models via real hardware-based noisy quantum diffusion

【Quick Read】: This paper addresses the scalability bottleneck of current Quantum Diffusion Models (QDMs): existing algorithms are hard to scale because they are constrained by the capabilities of near-term quantum devices. The key to the solution is two physics-inspired protocols. The first uses the formalism of quantum stochastic walks, where a specific interplay of quantum and classical dynamics in the forward process yields MNIST images with a lower Fréchet Inception Distance (FID); the second exploits the intrinsic noise of real IBM quantum hardware to generate images with only four qubits. Neither approach mitigates or corrects the quantum noise; instead, the noise is exploited as a useful resource.

Link: https://arxiv.org/abs/2505.22193
Authors: Marco Parigi, Stefano Martina, Francesco Aldo Venturelli, Filippo Caruso
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 9 figures. Supplementary materials: 2 pages, 2 figures

Abstract:Quantum Diffusion Models (QDMs) are an emerging paradigm in Generative AI that aims to use quantum properties to improve the performances of their classical counterparts. However, existing algorithms are not easily scalable due to the limitations of near-term quantum devices. Following our previous work on QDMs, here we propose and implement two physics-inspired protocols. In the first, we use the formalism of quantum stochastic walks, showing that a specific interplay of quantum and classical dynamics in the forward process produces statistically more robust models generating sets of MNIST images with lower Fréchet Inception Distance (FID) than using totally classical dynamics. In the second approach, we realize an algorithm to generate images by exploiting the intrinsic noise of real IBM quantum hardware with only four qubits. Our work could be a starting point to pave the way for new scenarios for large-scale algorithms in quantum Generative AI, where quantum noise is neither mitigated nor corrected, but instead exploited as a useful resource.

[CV-174] Higher-Order Group Synchronization

【Quick Read】: This paper addresses the higher-order group synchronization problem, i.e., obtaining global estimates on the nodes of a hypergraph by synchronizing higher-order local measurements on its hyperedges. Classical group synchronization is formulated on graphs and handles only pairwise relations between nodes; this work extends it to higher-order structures that capture more complex local relationships. The key to the solution is a new computational framework that acts globally and directly on higher-order measurements via a message passing algorithm, together with convergence analyses and robustness guarantees under noise and outliers.

Link: https://arxiv.org/abs/2505.21932
Authors: Adriana L. Duncan, Joe Kileel
Affiliations: The University of Texas at Austin
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Combinatorics (math.CO); Optimization and Control (math.OC)
Comments: 40 pages

Abstract:Group synchronization is the problem of determining reliable global estimates from noisy local measurements on networks. The typical task for group synchronization is to assign elements of a group to the nodes of a graph in a way that respects group elements given on the edges which encode information about local pairwise relationships between the nodes. In this paper, we introduce a novel higher-order group synchronization problem which operates on a hypergraph and seeks to synchronize higher-order local measurements on the hyperedges to obtain global estimates on the nodes. Higher-order group synchronization is motivated by applications to computer vision and image processing, among other computational problems. First, we define the problem of higher-order group synchronization and discuss its mathematical foundations. Specifically, we give necessary and sufficient synchronizability conditions which establish the importance of cycle consistency in higher-order group synchronization. Then, we propose the first computational framework for general higher-order group synchronization; it acts globally and directly on higher-order measurements using a message passing algorithm. We discuss theoretical guarantees for our framework, including convergence analyses under outliers and noise. Finally, we show potential advantages of our method through numerical experiments. In particular, we show that in certain cases our higher-order method applied to rotational and angular synchronization outperforms standard pairwise synchronization methods and is more robust to outliers. We also show that our method has comparable performance on simulated cryo-electron microscopy (cryo-EM) data compared to a standard cryo-EM reconstruction package.
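For orientation, here is a minimal Python sketch (not from the paper) of the classical pairwise baseline that this higher-order framework generalizes: spectral angular synchronization over SO(2), where noisy relative angles are encoded in a Hermitian matrix and the global angles are read off its top eigenvector. All names and the noise model are illustrative.

```python
# Pairwise spectral angular synchronization -- the classical baseline
# that higher-order (hypergraph) synchronization generalizes.
import numpy as np

def spectral_angular_sync(theta_rel, n):
    """theta_rel: dict mapping (i, j) -> noisy measurement of theta_i - theta_j."""
    H = np.zeros((n, n), dtype=complex)
    for (i, j), t in theta_rel.items():
        H[i, j] = np.exp(1j * t)
        H[j, i] = np.exp(-1j * t)   # Hermitian completion
    # The top eigenvector of H encodes the angles up to a global shift.
    _, V = np.linalg.eigh(H)
    return np.angle(V[:, -1])

rng = np.random.default_rng(0)
n = 50
truth = rng.uniform(0, 2 * np.pi, n)
pairs = {(i, j): truth[i] - truth[j] + 0.05 * rng.standard_normal()
         for i in range(n) for j in range(i + 1, n)}
est = spectral_angular_sync(pairs, n)
# Compare up to the inherent global-rotation ambiguity.
err = np.angle(np.exp(1j * (est - truth - (est[0] - truth[0]))))
print("max abs error (rad):", np.abs(err).max())
```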

[CV-175] Subspecialty-Specific Foundation Model for Intelligent Gastrointestinal Pathology

【Quick Read】: This paper aims to address the poor reproducibility and high diagnostic variability in gastrointestinal (GI) pathology, which stem from reliance on pathologists' subjective interpretation. The key to the solution is Digepath, a specialized foundation model for GI pathology that adopts a dual-phase iterative optimization strategy combining pretraining with fine-screening, designed to detect sparsely distributed lesion areas in whole-slide images and thereby improve diagnostic accuracy and consistency.

Link: https://arxiv.org/abs/2505.21928
Authors: Lianghui Zhu, Xitong Ling, Minxi Ouyang, Xiaoping Liu, Mingxi Fu, Tian Guan, Fanglei Fu, Xuanyu Wang, Maomao Zeng, Mingxi Zhu, Yibo Jin, Liming Liu, Song Duan, Qiming He, Yizhi Wang, Luxi Xie, Houqiang Li, Yonghong He, Sufang Tian
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Gastrointestinal (GI) diseases represent a clinically significant burden, necessitating precise diagnostic approaches to optimize patient outcomes. Conventional histopathological diagnosis, heavily reliant on the subjective interpretation of pathologists, suffers from limited reproducibility and diagnostic variability. To overcome these limitations and address the lack of pathology-specific foundation models for GI diseases, we develop Digepath, a specialized foundation model for GI pathology. Our framework introduces a dual-phase iterative optimization strategy combining pretraining with fine-screening, specifically designed to address the detection of sparsely distributed lesion areas in whole-slide images. Digepath is pretrained on more than 353 million image patches from over 200,000 hematoxylin and eosin-stained slides of GI diseases. It attains state-of-the-art performance on 33 out of 34 tasks related to GI pathology, including pathological diagnosis, molecular prediction, gene mutation prediction, and prognosis evaluation, particularly in diagnostically ambiguous cases and resolution-agnostic tissue analysis. We further translate the intelligent screening module for early GI cancer and achieve near-perfect 99.6% sensitivity across 9 independent medical institutions nationwide. The outstanding performance of Digepath highlights its potential to bridge critical gaps in histopathological practice. This work not only advances AI-driven precision pathology for GI diseases but also establishes a transferable paradigm for other pathology subspecialties.

[CV-176] MAMBO-NET: Multi-Causal Aware Modeling Backdoor-Intervention Optimization for Medical Image Segmentation Network

【Quick Read】: This paper addresses inaccurate medical image segmentation caused by confusion factors, such as complex anatomical variations and imaging modality limitations, which obscure the relevance and causality underlying segmentation. The key to the solution is MAMBO-NET, a multi-causal aware modeling backdoor-intervention optimization network that fits the confusion factors with self-modeled multi-Gaussian distributions, introduces causal intervention into the segmentation process, and designs appropriate posterior probability constraints to effectively train the confusion-factor distributions, thereby guiding segmentation and mitigating their impact on the results.

Link: https://arxiv.org/abs/2505.21874
Authors: Ruiguo Yu, Yiyang Zhang, Yuan Tian, Yujie Diao, Di Jin, Witold Pedrycz
Affiliations: Tianjin University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical image segmentation methods generally assume that the process from medical image to segmentation is unbiased, and use neural networks to establish conditional probability models to complete the segmentation task. This assumption does not consider confusion factors, which can affect medical images, such as complex anatomical variations and imaging modality limitations. Confusion factors obfuscate the relevance and causality of medical image segmentation, leading to unsatisfactory segmentation results. To address this issue, we propose a multi-causal aware modeling backdoor-intervention optimization (MAMBO-NET) network for medical image segmentation. Drawing insights from causal inference, MAMBO-NET utilizes self-modeling with multi-Gaussian distributions to fit the confusion factors and introduces causal intervention into the segmentation process. Moreover, we design appropriate posterior probability constraints to effectively train the distributions of confusion factors. To enable these distributions to effectively guide the segmentation and mitigate the impact of confusion factors on it, we introduce classical backdoor intervention techniques and analyze their feasibility in the segmentation task. To evaluate the effectiveness of our approach, we conducted extensive experiments on five medical image datasets. The results demonstrate that our method significantly reduces the influence of confusion factors, leading to enhanced segmentation accuracy.

[CV-177] Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2

【Quick Read】: This paper addresses the privacy concerns in automatically generating radiology reports from chest X-ray images, since traditional centralized approaches typically require transferring sensitive data. The key to the solution is a multimodal federated learning framework that uses a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, enabling decentralized training without sharing raw data. Three federated aggregation strategies (FedAvg, Krum Aggregation, and Loss-aware Federated Averaging) are evaluated; Krum Aggregation performs best on several metrics, demonstrating that the framework can generate clinically relevant and semantically rich radiology reports while preserving data privacy.

Link: https://arxiv.org/abs/2505.21715
Authors: Md. Zahid Hossain, Mustofa Ahmed, Most. Sharmin Sultana Samu, Md. Rakibul Islam
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, manuscript under review

Abstract:The automated generation of radiology reports from chest X-ray images holds significant promise in enhancing diagnostic workflows while preserving patient privacy. Traditional centralized approaches often require sensitive data transfer, posing privacy concerns. To address this, the study proposes a Multimodal Federated Learning framework for chest X-ray report generation using the IU-Xray dataset. The system utilizes a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, enabling decentralized training without sharing raw data. Three Federated Learning (FL) aggregation strategies: FedAvg, Krum Aggregation and a novel Loss-aware Federated Averaging (L-FedAvg) were evaluated. Among these, Krum Aggregation demonstrated superior performance across lexical and semantic evaluation metrics such as ROUGE, BLEU, BERTScore and RaTEScore. The results show that FL can match or surpass centralized models in generating clinically relevant and semantically rich radiology reports. This lightweight and privacy-preserving framework paves the way for collaborative medical AI development without compromising data confidentiality.
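As a point of reference, here is a minimal sketch of the FedAvg aggregation step, one of the three strategies compared above: the server averages client model weights, so raw images and reports never leave the clients. The helper name and size-based weighting are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def fedavg(client_state_dicts, client_sizes):
    """Weighted average of client model parameters (plain torch state_dicts)."""
    total = float(sum(client_sizes))
    return {k: sum(sd[k].float() * (n / total)
                   for sd, n in zip(client_state_dicts, client_sizes))
            for k in client_state_dicts[0]}

# Toy round: three clients with identically shaped local models.
clients = [nn.Linear(4, 2).state_dict() for _ in range(3)]
global_state = fedavg(clients, client_sizes=[100, 50, 50])
server = nn.Linear(4, 2)
server.load_state_dict(global_state)  # broadcast the averaged model
```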

[CV-178] STA-Risk: A Deep Dive of Spatio-Temporal Asymmetries for Breast Cancer Risk Prediction

【Quick Read】: This paper aims to overcome the performance limitations of breast cancer risk prediction models, in particular that existing models typically rely on a single exam or overlook the subtle spatial and temporal changes in longitudinal imaging exams that are indicative of breast cancer risk. The key to the solution is STA-Risk (Spatial and Temporal Asymmetry-based Risk Prediction), a novel Transformer-based model that learns spatial-temporal asymmetries via side encoding and temporal encoding, regulated by a customized asymmetry loss, so as to simultaneously capture fine-grained evolution from bilateral and longitudinal images and improve 1- to 5-year future risk prediction.

Link: https://arxiv.org/abs/2505.21699
Authors: Zhengbo Zhou, Dooman Arefan, Margarita Zuley, Jules Sumkin, Shandong Wu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Predicting the risk of developing breast cancer is an important clinical tool to guide early intervention and to tailor personalized screening strategies. Early risk models have limited performance, and recent machine learning-based analysis of mammogram images has shown encouraging risk prediction results. These models, however, are limited to the use of a single exam or tend to overlook the nuanced evolution of breast tissue in the spatial and temporal details of longitudinal imaging exams that are indicative of breast cancer risk. In this paper, we propose STA-Risk (Spatial and Temporal Asymmetry-based Risk Prediction), a novel Transformer-based model that captures fine-grained mammographic imaging evolution simultaneously from bilateral and longitudinal asymmetries for breast cancer risk prediction. STA-Risk is innovative in its side encoding and temporal encoding for learning spatial-temporal asymmetries, regulated by a customized asymmetry loss. We performed extensive experiments with two independent mammogram datasets and achieved superior performance over four representative SOTA models for 1- to 5-year future risk prediction. Source codes will be released upon publication of the paper.

[CV-179] nSVD algorithm for compression

【Quick Read】: This paper targets the resource cost of storing and processing high-dimensional data, in particular the large memory footprint, bandwidth demand, and energy consumption involved in storing, transmitting, and processing image data. The key to the solution is organizing the original data into a tensor and compressing it with the Tucker model, which reduces computation time and energy consumption while preserving information quality.

Link: https://arxiv.org/abs/2505.21686
Authors: Michele Gallo
Affiliations: Unknown
Subjects: Computation (stat.CO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Tensors provide a robust framework for managing high-dimensional data. Consequently, tensor analysis has emerged as an active research area in various domains, including machine learning, signal processing, computer vision, graph analysis, and data mining. This study introduces an efficient image storage approach utilizing tensors, aiming to minimize the memory needed to store, the bandwidth to transmit, and the energy to process the data. The proposed method organizes original data into a higher-order tensor and applies the Tucker model for compression. Implemented in R, this method is compared to a baseline algorithm. The evaluation focuses on the efficiency of the algorithm, measured in terms of computational time, and on the quality of the information preserved, using both simulated and real datasets. A detailed analysis of the results is conducted, employing established quantitative metrics, with significant attention paid to sustainability in terms of energy consumption across algorithms.
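The study is implemented in R; purely as an illustration of the underlying idea, here is a minimal Python sketch of Tucker compression of an image organized as a tensor, computed via truncated higher-order SVD. The ranks and test tensor are arbitrary choices, not the paper's settings.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    # Factor matrices: leading left singular vectors of each mode unfolding.
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    # Core tensor: project T onto the factor subspaces, mode by mode.
    G = T
    for m, Um in enumerate(U):
        G = np.moveaxis(np.tensordot(Um.T, np.moveaxis(G, m, 0), axes=1), 0, m)
    return G, U

def reconstruct(G, U):
    T = G
    for m, Um in enumerate(U):
        T = np.moveaxis(np.tensordot(Um, np.moveaxis(T, m, 0), axes=1), 0, m)
    return T

T = np.random.rand(64, 64, 3)                 # e.g., one RGB image as a 3-way tensor
G, U = tucker_hosvd(T, ranks=(16, 16, 3))
T_hat = reconstruct(G, U)
stored = G.size + sum(u.size for u in U)      # core + factor matrices
print("compression ratio:", T.size / stored)
print("relative error:", np.linalg.norm(T - T_hat) / np.linalg.norm(T))
```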

[CV-180] Laparoscopic Image Desmoking Using the U-Net with New Loss Function and Integrated Differentiable Wiener Filter

【Quick Read】: This paper aims to restore the visual clarity lost to surgical smoke in laparoscopic surgery, a problem that poses significant challenges for both surgeons and vision-based computer-assisted technologies. The key to the solution is a U-Net deep learning method with a new loss function and an integrated differentiable Wiener filter (ULW). The new loss combines pixel, structural, and perceptual properties to enhance the quality and realism of the reconstructed images, while the learnable Wiener filter effectively models the degradation process caused by surgical smoke.

Link: https://arxiv.org/abs/2505.21634
Authors: Chengyu Yang, Chengjun Liu
Affiliations: New Jersey Institute of Technology
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Laparoscopic surgeries often suffer from reduced visual clarity due to surgical smoke originating from surgical instruments, which poses significant challenges for both surgeons and vision-based computer-assisted technologies. In order to remove the surgical smoke, a novel U-Net deep learning method with a new loss function and an integrated differentiable Wiener filter (ULW) is presented. Specifically, the new loss function integrates pixel, structural, and perceptual properties. Thus, the new loss function, which combines the structural similarity index measure loss, the perceptual loss, as well as the mean squared error loss, is able to enhance the quality and realism of the reconstructed images. Furthermore, the learnable Wiener filter is capable of effectively modelling the degradation process caused by the surgical smoke. The effectiveness of the proposed ULW method is evaluated using the publicly available paired laparoscopic smoke and smoke-free image dataset, which provides reliable benchmarking and quantitative comparisons. Experimental results show that the proposed ULW method excels in both visual clarity and metric-based evaluation. As a result, the proposed ULW method offers a promising solution for real-time enhancement of laparoscopic imagery. The code is available at this https URL.
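As a rough illustration of the integrated filter, here is a minimal sketch of a differentiable Wiener deconvolution layer in PyTorch: the blur kernel and the noise-to-signal ratio are learnable parameters, so gradients flow through the frequency-domain restoration step. This is an assumed stand-in, not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.fft as fft

class DifferentiableWiener(nn.Module):
    def __init__(self, ksize=15):
        super().__init__()
        # Learnable degradation kernel, initialized as a uniform blur.
        self.kernel = nn.Parameter(torch.full((ksize, ksize), 1.0 / ksize**2))
        self.log_nsr = nn.Parameter(torch.tensor(-4.0))  # learnable noise/signal ratio

    def forward(self, x):  # x: (B, C, H, W) degraded (smoky) image
        H, W = x.shape[-2:]
        k = torch.zeros(H, W, device=x.device)
        kh, kw = self.kernel.shape
        k[:kh, :kw] = self.kernel
        k = torch.roll(k, shifts=(-(kh // 2), -(kw // 2)), dims=(0, 1))  # center the kernel
        K = fft.rfft2(k)
        nsr = torch.exp(self.log_nsr)
        Wf = torch.conj(K) / (K.abs() ** 2 + nsr)        # Wiener filter in frequency domain
        return fft.irfft2(fft.rfft2(x) * Wf, s=(H, W))

x = torch.rand(1, 3, 64, 64)
print(DifferentiableWiener()(x).shape)  # torch.Size([1, 3, 64, 64])
```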

[CV-181] Optimizing Deep Learning for Skin Cancer Classification: A Computationally Efficient CNN with Minimal Accuracy Trade-Off

【Quick Read】: This paper addresses the difficulty of deploying deep learning models for medical image analysis in resource-constrained environments due to their high computational complexity. The key to the solution is a lightweight convolutional neural network (CNN) that sharply reduces the parameter count (from 23.9 million in ResNet50 to 692,000) and the computation (from 4 billion FLOPs to 30.04 million), cutting energy consumption, memory footprint, and inference time while keeping the classification accuracy deviation below 0.022%, thereby making deployment on mobile and edge devices feasible.

Link: https://arxiv.org/abs/2505.21597
Authors: Abdullah Al Mamun, Pollob Chandra Ray, Md Rahat Ul Nasib, Akash Das, Jia Uddin, Md Nurul Absur
Affiliations: Dhaka University of Engineering & Technology, Bangladesh; Samsung Austin Research Center, United States; Federation University Australia, Australia; Woosong University, Korea; Graduate Center, CUNY, United States
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 7 images

Abstract:The rapid advancement of deep learning in medical image analysis has greatly enhanced the accuracy of skin cancer classification. However, current state-of-the-art models, especially those based on transfer learning like ResNet50, come with significant computational overhead, rendering them impractical for deployment in resource-constrained environments. This study proposes a custom CNN model that achieves a 96.7% reduction in parameters (from 23.9 million in ResNet50 to 692,000) while maintaining a classification accuracy deviation of less than 0.022%. Our empirical analysis of the HAM10000 dataset reveals that although transfer learning models provide a marginal accuracy improvement of approximately 0.022%, they result in a staggering 13,216.76% increase in FLOPs, considerably raising computational costs and inference latency. In contrast, our lightweight CNN architecture, which encompasses only 30.04 million FLOPs compared to ResNet50’s 4.00 billion, significantly reduces energy consumption, memory footprint, and inference time. These findings underscore the trade-off between the complexity of deep models and their real-world feasibility, positioning our optimized CNN as a practical solution for mobile and edge-based skin cancer diagnostics.
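A common way to achieve parameter cuts of this magnitude is to swap standard convolutions for depthwise-separable ones. The sketch below is a generic illustration of that trade (not the paper's architecture); for one 3x3 block it yields roughly an 8.6x parameter reduction.

```python
import torch.nn as nn

def standard_block(cin, cout):
    return nn.Conv2d(cin, cout, kernel_size=3, padding=1)

def separable_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # depthwise
        nn.Conv2d(cin, cout, kernel_size=1),                        # pointwise
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard_block(128, 256)))   # 295,168
print(n_params(separable_block(128, 256)))  # 34,304
```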

[CV-182] Taylor expansion-based Kolmogorov-Arnold network for blind image quality assessment

【Quick Read】: This paper addresses the limited performance gains and increased computational cost of the Kolmogorov-Arnold Network (KAN) when processing high-dimensional features. The key to the solution is TaylorKAN, which uses Taylor expansions as learnable activation functions to strengthen local approximation, combined with network depth reduction and feature dimensionality compression to improve computational efficiency. Experiments show that local approximation via Taylor expansions is more effective than global approximation based on orthogonal functions, validating TaylorKAN as an efficient and robust model for high-dimensional score regression.

Link: https://arxiv.org/abs/2505.21592
Authors: Ze Chen, Shaode Yu
Affiliations: Communication University of China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: under review

Abstract:Kolmogorov-Arnold Network (KAN) has attracted growing interest for its strong function approximation capability. In our previous work, KAN and its variants were explored in score regression for blind image quality assessment (BIQA). However, these models encounter challenges when processing high-dimensional features, leading to limited performance gains and increased computational cost. To address these issues, we propose TaylorKAN that leverages the Taylor expansions as learnable activation functions to enhance local approximation capability. To improve the computational efficiency, network depth reduction and feature dimensionality compression are integrated into the TaylorKAN-based score regression pipeline. On five databases (BID, CLIVE, KonIQ, SPAQ, and FLIVE) with authentic distortions, extensive experiments demonstrate that TaylorKAN consistently outperforms the other KAN-related models, indicating that the local approximation via Taylor expansions is more effective than global approximation using orthogonal functions. Its generalization capacity is validated through inter-database experiments. The findings highlight the potential of TaylorKAN as an efficient and robust model for high-dimensional score regression.
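To make the core idea concrete, here is a minimal sketch of a Taylor-expansion activation: a learnable polynomial applied elementwise, the kind of local approximator the paper uses in place of orthogonal basis functions. The order, initialization, and surrounding layers are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

class TaylorActivation(nn.Module):
    def __init__(self, order=4):
        super().__init__()
        # Learnable Taylor coefficients a_0 .. a_order, shared across units here.
        self.coeffs = nn.Parameter(torch.randn(order + 1) * 0.1)

    def forward(self, x):
        powers = torch.stack([x ** k for k in range(len(self.coeffs))], dim=-1)
        return powers @ self.coeffs  # sum_k a_k * x^k, applied elementwise

layer = nn.Sequential(nn.Linear(32, 16), TaylorActivation(), nn.Linear(16, 1))
print(layer(torch.randn(8, 32)).shape)  # torch.Size([8, 1]) -- e.g., a quality score
```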

[CV-183] Image denoising as a conditional expectation

【Quick Read】: This paper addresses image denoising, i.e., accurately recovering the true noise-free image from a noisy one. Traditional methods typically estimate the true image as a projection onto some subspace, whereas this paper interprets the noisy image as a collection of samples drawn from a probability space and reconstructs the true image as a conditional expectation. The key to the solution is estimating integrals over the unknown probability space with kernel integral operators and reformulating the true image as the least-squares solution of a linear equation in a reproducing kernel Hilbert space (RKHS). The method is proven convergent as the number of pixels goes to infinity, and the convergence result can be used to choose optimal denoising parameters for images with finitely many pixels.

Link: https://arxiv.org/abs/2505.21546
Authors: Sajal Chakroborty, Suddhasattwa Das
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Abstract:All techniques for denoising involve a notion of a true (noise-free) image, and a hypothesis space. The hypothesis space may reconstruct the image directly as a grayscale valued function, or indirectly by its Fourier or wavelet spectrum. Most common techniques estimate the true image as a projection to some subspace. We propose an interpretation of a noisy image as a collection of samples drawn from a certain probability space. Within this interpretation, projection based approaches are not guaranteed to be unbiased and convergent. We present a data-driven denoising method in which the true image is recovered as a conditional expectation. Although the probability space is unknown a priori, integrals on this space can be estimated by kernel integral operators. The true image is reformulated as the least squares solution to a linear equation in a reproducing kernel Hilbert space (RKHS), involving various kernel integral operators as linear transforms. Assuming the true image to be a continuous function on a compact planar domain, the technique is shown to be convergent as the number of pixels goes to infinity. We also show that for a picture with a finite number of pixels, the convergence result can be used to choose the various parameters for an optimum denoising result.
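As a toy illustration of the conditional-expectation view, the sketch below estimates E[value | pixel location] with a Gaussian kernel over pixel coordinates, i.e., a plain Nadaraya-Watson kernel estimator. The paper's actual construction works with kernel integral operators in an RKHS; the kernel and bandwidth here are illustrative assumptions.

```python
import numpy as np

def kernel_denoise(img, h=2.0):
    """Estimate each pixel as a kernel-weighted conditional expectation."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    vals = img.ravel()
    out = np.empty_like(vals)
    for p in range(len(vals)):
        d2 = ((coords - coords[p]) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * h ** 2))          # Gaussian kernel weights
        out[p] = (w * vals).sum() / w.sum()     # conditional-expectation estimate
    return out.reshape(H, W)

rng = np.random.default_rng(1)
clean = np.tile(np.linspace(0, 1, 32), (32, 1))       # smooth test image
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
print("mean abs error after denoising:",
      np.abs(kernel_denoise(noisy) - clean).mean())
```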

[CV-184] High-Fidelity Functional Ultrasound Reconstruction via A Visual Auto-Regressive Framework

【Quick Read】: This paper addresses the data scarcity that functional ultrasound (fUS) imaging faces in practice, caused jointly by ethical constraints and signal degradation through the skull, which limits dataset diversity and compromises the fairness of downstream machine learning models. The key to the solution is improving data quality and diversity so as to strengthen model training and generalization.

Link: https://arxiv.org/abs/2505.21530
Authors: Xuhang Chen, Zhuo Li, Yanyan Shen, Mufti Mahmud, Hieu Pham, Chi-Man Pun, Shuqiang Wang
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Department of Information and Computer Science, SDAIA-KFUPM Joint Research Center for AI, Interdisciplinary Research Center for Biosystems and Machines, King Fahd University of Petroleum and Minerals; Department of Computer and Information Science, University of Macau; College of Engineering and Computer Science and the VinUni-Illinois Smart Health Center, VinUniversity
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Functional ultrasound (fUS) imaging provides exceptional spatiotemporal resolution for neurovascular mapping, yet its practical application is significantly hampered by critical challenges. Foremost among these are data scarcity, arising from ethical considerations and signal degradation through the cranium, which collectively limit dataset diversity and compromise the fairness of downstream machine learning models.

Artificial Intelligence

[AI-0] Maximizing Confidence Alone Improves Reasoning

【Quick Read】: This paper addresses the difficulty of reward-function design in reinforcement learning (RL), especially how to improve a model's reasoning ability when no external reward or ground-truth answer is available. The key to the solution is RENT, an entropy-minimization RL method that requires no external reward or ground-truth answers; instead, it uses the entropy of the model's own generative distribution as an intrinsic reward, reinforcing chains of thought that yield high-confidence answers and thereby improving the model's reasoning ability.

Link: https://arxiv.org/abs/2505.22660
Authors: Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, Deepak Pathak
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization – a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model’s entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and models of varying sizes from the Qwen and Mistral families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is limited or unavailable.
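The intrinsic reward is simple to state. Here is a minimal sketch of computing the negative mean token entropy of a model's output distribution from its logits, with no ground-truth answer involved; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def entropy_reward(logits, mask):
    """logits: (B, T, V) over generated tokens; mask: (B, T), 1 on real tokens."""
    logp = F.log_softmax(logits, dim=-1)
    ent = -(logp.exp() * logp).sum(-1)              # (B, T) per-token entropy
    mean_ent = (ent * mask).sum(-1) / mask.sum(-1)  # (B,) per-sequence mean entropy
    return -mean_ent  # lower entropy (higher confidence) => higher reward

logits = torch.randn(2, 5, 100)
mask = torch.ones(2, 5)
print(entropy_reward(logits, mask))  # one scalar reward per sequence
```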

[AI-1] Pre-training for Recommendation Unlearning SIGIR2025

【Quick Read】: This paper addresses the need for recommender systems to selectively forget training data due to privacy requirements or preference changes, which is especially hard in modern systems powered by graph neural networks (GNNs): deleting connections in the interaction graph triggers ripple effects throughout the model and affects the recommendations of many users. Traditional approaches have notable drawbacks: fragmentation methods damage the graph structure and degrade performance, while influence-function techniques rest on assumptions that may not hold in complex GNNs, especially with self-supervised or random architectures. The key to the proposed model-agnostic pre-training paradigm, UnlearnRec, is an Influence Encoder that takes unlearning requests together with the existing model parameters and directly produces the updated parameters of the unlearned model with only a little fine-tuning, avoiding full retraining while preserving model performance.

Link: https://arxiv.org/abs/2505.22649
Authors: Guoxuan Chen, Lianghao Xia, Chao Huang
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to SIGIR 2025 Oral

Abstract:Modern recommender systems powered by Graph Neural Networks (GNNs) excel at modeling complex user-item interactions, yet increasingly face scenarios requiring selective forgetting of training data. Beyond user requests to remove specific interactions due to privacy concerns or preference changes, regulatory frameworks mandate recommender systems’ ability to eliminate the influence of certain user data from models. This recommendation unlearning challenge presents unique difficulties as removing connections within interaction graphs creates ripple effects throughout the model, potentially impacting recommendations for numerous users. Traditional approaches suffer from significant drawbacks: fragmentation methods damage graph structure and diminish performance, while influence function techniques make assumptions that may not hold in complex GNNs, particularly with self-supervised or random architectures. To address these limitations, we propose a novel model-agnostic pre-training paradigm UnlearnRec that prepares systems for efficient unlearning operations. Our Influence Encoder takes unlearning requests together with existing model parameters and directly produces updated parameters of unlearned model with little fine-tuning, avoiding complete retraining while preserving model performance characteristics. Extensive evaluation on public benchmarks demonstrates that our method delivers exceptional unlearning effectiveness while providing more than 10x speedup compared to retraining approaches. We release our method implementation at: this https URL.

[AI-2] FastTD3: Simple Fast and Capable Reinforcement Learning for Humanoid Control FAST

【Quick Read】: This paper addresses the bottleneck that reinforcement learning (RL) faces in robotics due to algorithmic complexity and long training times. The key to the solution is FastTD3, which speeds up training through simple but effective modifications: parallel simulation, large-batch updates, a distributional critic, and carefully tuned hyperparameters, allowing HumanoidBench tasks to be solved in under 3 hours on a single A100 GPU while keeping training stable.

Link: https://arxiv.org/abs/2505.22642
Authors: Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, Pieter Abbeel
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project webpage: this https URL

Abstract:Reinforcement learning (RL) has driven significant progress in robotics, but its complexity and long training times remain major bottlenecks. In this report, we introduce FastTD3, a simple, fast, and capable RL algorithm that significantly speeds up training for humanoid robots in popular suites such as HumanoidBench, IsaacLab, and MuJoCo Playground. Our recipe is remarkably simple: we train an off-policy TD3 agent with several modifications – parallel simulation, large-batch updates, a distributional critic, and carefully tuned hyperparameters. FastTD3 solves a range of HumanoidBench tasks in under 3 hours on a single A100 GPU, while remaining stable during training. We also provide a lightweight and easy-to-use implementation of FastTD3 to accelerate RL research in robotics.
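For context, here is a minimal sketch of the standard TD3 critic target that FastTD3 builds on: target policy smoothing plus a clipped double-Q backup. The paper's distributional critic, parallel simulation, and tuned hyperparameters are omitted, and the toy stand-ins below are assumptions for demonstration only.

```python
import torch

def td3_target(q1_t, q2_t, actor_t, next_obs, reward, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    with torch.no_grad():
        # Target policy smoothing: clipped noise on the target action.
        noise = (torch.randn_like(actor_t(next_obs)) * noise_std
                 ).clamp(-noise_clip, noise_clip)
        next_act = (actor_t(next_obs) + noise).clamp(-max_action, max_action)
        # Clipped double-Q: take the pessimistic minimum of the two target critics.
        target_q = torch.min(q1_t(next_obs, next_act), q2_t(next_obs, next_act))
        return reward + gamma * (1.0 - done) * target_q

# Toy stand-ins just to show the call shapes.
actor_t = lambda s: torch.tanh(s[:, :1])
q = lambda s, a: s.sum(-1, keepdim=True) + a
y = td3_target(q, q, actor_t, torch.randn(4, 3), torch.zeros(4, 1), torch.zeros(4, 1))
print(y.shape)  # torch.Size([4, 1])
```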

[AI-3] SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning

【Quick Read】: This paper addresses the performance degradation in imitation learning caused by the uneven quality of large-scale datasets; existing methods curate at the coarse granularity of datasets or trajectories and cannot identify and remove low-quality state-action pairs. The key to the solution is SCIZOR, a self-supervised data curation framework that filters out suboptimal data lacking task progression via a self-supervised task-progress predictor and removes redundant data via a deduplication module operating on a joint state-action representation, thereby improving the performance of imitation learning policies.

Link: https://arxiv.org/abs/2505.22626
Authors: Yu Zhang, Yuqi Xie, Huihan Liu, Rutav Shah, Michael Wan, Linxi Fan, Yuke Zhu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. However, large-scale datasets used for policy training often introduce substantial variability in quality, which can negatively impact performance. As a result, automatically curating datasets by filtering low-quality samples to improve quality becomes essential. Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity, such as the dataset or trajectory level, failing to account for the quality of individual state-action pairs. To address this, we introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies. SCIZOR targets two complementary sources of low-quality data: suboptimal data, which hinders learning with undesirable actions, and redundant data, which dilutes training with repetitive patterns. SCIZOR leverages a self-supervised task progress predictor for suboptimal data to remove samples lacking task progression, and a deduplication module operating on joint state-action representation for samples with redundant patterns. Empirically, we show that SCIZOR enables imitation learning policies to achieve higher performance with less data, yielding an average improvement of 15.4% across multiple benchmarks. More information is available at: this https URL

[AI-4] Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates INTERSPEECH2025

【Quick Read】: This paper addresses the compression of speech foundation models, aiming to substantially reduce the number of parameters while preserving performance. The key to the solution is integrating model pruning and parameter update into a single stage: highly compact layer-level tied self-pinching gates, each containing only a single learnable threshold, are jointly trained with the uncompressed model to achieve fine-grained neuron-level pruning.

Link: https://arxiv.org/abs/2505.22608
Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to Interspeech 2025

Abstract:This paper presents a novel approach for speech foundation models compression that tightly integrates model pruning and parameter update into a single stage. Highly compact layer-level tied self-pinching gates each containing only a single learnable threshold are jointly trained with uncompressed models and used in fine-grained neuron level pruning. Experiments conducted on the LibriSpeech-100hr corpus suggest that our approach reduces the number of parameters of wav2vec2.0-base and HuBERT-large models by 65% and 60% respectively, while incurring no statistically significant word error rate (WER) increase on the test-clean dataset. Compared to previously published methods on the same task, our approach not only achieves the lowest WER of 7.05% on the test-clean dataset under a comparable model compression ratio of 4.26x, but also operates with at least 25% less model compression time.
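A plausible reading of the gate, sketched below as an assumption rather than the authors' implementation: a single learnable threshold per layer produces a hard pruning mask in the forward pass, with a sigmoid straight-through surrogate providing gradients to both the threshold and the weights. The temperature and initialization are guesses.

```python
import torch
import torch.nn as nn

class SelfPinchingGate(nn.Module):
    def __init__(self, init_threshold=0.01, temperature=1e-3):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, weight):
        hard = (weight.abs() > self.threshold).float()          # hard elementwise mask
        soft = torch.sigmoid((weight.abs() - self.threshold) / self.temperature)
        mask = soft + (hard - soft).detach()                    # straight-through estimator
        return weight * mask

gate = SelfPinchingGate()
w = torch.randn(64, 64) * 0.02
print("kept fraction:", gate(w).ne(0).float().mean().item())
```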

[AI-5] One Rank at a Time: Cascading Error Dynamics in Sequential Learning

【Quick Read】: This paper addresses error propagation in sequential learning, specifically how to quantify and control the effect of errors (e.g., from limited computational budgets and finite precision) on overall model accuracy when tasks are decomposed hierarchically via low-rank linear regression. The key to the solution is an analysis framework that decomposes the learning process into a series of rank-1 estimation problems, each depending on the accuracy of the previous steps, and characterizes how errors compound through the sequential process, with implications for algorithm design and stability guarantees.

Link: https://arxiv.org/abs/2505.22602
Authors: Mahtab Alizadeh Vandchali, Fangshuo (Jasper) Liao, Anastasios Kyrillidis
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 36 pages

Abstract:Sequential learning – where complex tasks are broken down into simpler, hierarchical components – has emerged as a paradigm in AI. This paper views sequential learning through the lens of low-rank linear regression, focusing specifically on how errors propagate when learning rank-1 subspaces sequentially. We present an analysis framework that decomposes the learning process into a series of rank-1 estimation problems, where each subsequent estimation depends on the accuracy of previous steps. Our contribution is a characterization of the error propagation in this sequential process, establishing bounds on how errors – e.g., due to limited computational budgets and finite precision – affect the overall model accuracy. We prove that these errors compound in predictable ways, with implications for both algorithmic design and stability guarantees.
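A toy version of the sequential process under study: estimate rank-1 components one at a time (here by budget-limited power iteration with deflation) and observe how inaccuracy in early components propagates to later ones. The matrix and iteration budget are illustrative; the paper's setting is low-rank linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_iter(M, n_steps):
    v = rng.standard_normal(M.shape[0])
    for _ in range(n_steps):  # fewer steps = a coarser, budget-limited estimate
        v = M @ v
        v /= np.linalg.norm(v)
    return v

A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))  # rank-5 matrix
M = A @ A.T
U_true = np.linalg.svd(A)[0][:, :5]
residual = M.copy()
for k in range(5):
    v = power_iter(residual, n_steps=8)
    residual = residual - (v @ residual @ v) * np.outer(v, v)  # deflate the estimate
    # Errors in earlier v's contaminate the residual, hence later components.
    print(f"component {k}: |cos angle to truth| = {abs(U_true[:, k] @ v):.3f}")
```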

[AI-6] Machine Unlearning under Overparameterization

【Quick Read】: This paper addresses machine unlearning in the overparameterized setting: effectively removing the influence of specific training samples and recovering the model that would have resulted from training on the remaining data alone. Prior work in the underparameterized setting defines the unlearning solution as a loss minimizer over the retained data, but this fails under overparameterization because the loss gradients vanish. The key to the solution is defining the unlearning solution as the minimum-complexity interpolator over the retained data and proposing an algorithmic framework that only requires model gradients on the retained set at the original solution: a regularized objective is minimized over perturbations constrained to be orthogonal to these model gradients, a first-order relaxation of the interpolation condition.

Link: https://arxiv.org/abs/2505.22601
Authors: Jacob L. Block, Aryan Mokhtari, Sanjay Shakkottai
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Machine unlearning algorithms aim to remove the influence of specific training samples, ideally recovering the model that would have resulted from training on the remaining data alone. We study unlearning in the overparameterized setting, where many models interpolate the data, and defining the unlearning solution as any loss minimizer over the retained set, as in prior work in the underparameterized setting, is inadequate, since the original model may already interpolate the retained data and satisfy this condition. In this regime, loss gradients vanish, rendering prior methods based on gradient perturbations ineffective, motivating both new unlearning definitions and algorithms. For this setting, we define the unlearning solution as the minimum-complexity interpolator over the retained data and propose a new algorithmic framework that only requires access to model gradients on the retained set at the original solution. We minimize a regularized objective over perturbations constrained to be orthogonal to these model gradients, a first-order relaxation of the interpolation condition. For different model classes, we provide exact and approximate unlearning guarantees, and we demonstrate that an implementation of our framework outperforms existing baselines across various unlearning experiments.

[AI-7] HDDLGym: A Tool for Studying Multi-Agent Hierarchical Problems Defined in HDDL with OpenAI Gym ICAPS2025

【Quick Read】: This paper addresses the lack of a tool that seamlessly integrates hierarchical planning with reinforcement learning (RL). The key to the solution is HDDLGym, a Python-based tool that automatically generates OpenAI Gym environments from Hierarchical Domain Definition Language (HDDL) domains and problems, enabling efficient integration of RL with hierarchical planning and supporting collaborative planning among agents in multi-agent scenarios.

Link: https://arxiv.org/abs/2505.22597
Authors: Ngoc La, Ruaridh Mon-Williams, Julie A. Shah
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to Proceedings of ICAPS 2025

Abstract:In recent years, reinforcement learning (RL) methods have been widely tested using tools like OpenAI Gym, though many tasks in these environments could also benefit from hierarchical planning. However, there is a lack of a tool that enables seamless integration of hierarchical planning with RL. Hierarchical Domain Definition Language (HDDL), used in classical planning, introduces a structured approach well-suited for model-based RL to address this gap. To bridge this integration, we introduce HDDLGym, a Python-based tool that automatically generates OpenAI Gym environments from HDDL domains and problems. HDDLGym serves as a link between RL and hierarchical planning, supporting multi-agent scenarios and enabling collaborative planning among agents. This paper provides an overview of HDDLGym’s design and implementation, highlighting the challenges and design choices involved in integrating HDDL with the Gym interface, and applying RL policies to support hierarchical planning. We also provide detailed instructions and demonstrations for using the HDDLGym framework, including how to work with existing HDDL domains and problems from International Planning Competitions, exemplified by the Transport domain. Additionally, we offer guidance on creating new HDDL domains for multi-agent scenarios and demonstrate the practical use of HDDLGym in the Overcooked domain. By leveraging the advantages of HDDL and Gym, HDDLGym aims to be a valuable tool for studying RL in hierarchical planning, particularly in multi-agent contexts.

[AI-8] GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

【Quick Read】: This paper addresses the fact that existing benchmarks for software engineering AI agents (most notably SWE-bench) fail to cover critical developer workflows, in particular Version Control System (VCS) operations. The key to the solution is GitGoodBench, a new benchmark for evaluating AI agent performance on VCS tasks, covering three core Git scenarios extracted from permissively licensed open-source Python, Java, and Kotlin repositories and providing several datasets for comprehensive evaluation and training.

Link: https://arxiv.org/abs/2505.22583
Authors: Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Short Paper, 5 pages

Abstract:Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.

[AI-9] TabularQGAN: A Quantum Generative Model for Tabular Data

【Quick Read】: This paper addresses how to effectively generate high-quality tabular data when real-world data is scarce or privacy-restricted. The key to the solution is a Quantum Generative Adversarial Network (QGAN) architecture that combines flexible data encoding with a novel quantum circuit ansatz to model tabular data more efficiently. Experiments show that the quantum model outperforms classical models by an average of 8.5% on the SDMetrics similarity score while using only 0.072% of the classical models' parameters, demonstrating advantages in both parameter efficiency and generation quality.

Link: https://arxiv.org/abs/2505.22533
Authors: Pallavi Bhardwaj, Caitlin Jones, Lasse Dierich, Aleksandar Vučković
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: 18 pages, 8 figures, and 4 tables

Abstract:In this paper, we introduce a novel quantum generative model for synthesizing tabular data. Synthetic data is valuable in scenarios where real-world data is scarce or private, it can be used to augment or replace existing datasets. Real-world enterprise data is predominantly tabular and heterogeneous, often comprising a mixture of categorical and numerical features, making it highly relevant across various industries such as healthcare, finance, and software. We propose a quantum generative adversarial network architecture with flexible data encoding and a novel quantum circuit ansatz to effectively model tabular data. The proposed approach is tested on the MIMIC III healthcare and Adult Census datasets, with extensive benchmarking against leading classical models, CTGAN, and CopulaGAN. Experimental results demonstrate that our quantum model outperforms classical models by an average of 8.5% with respect to an overall similarity score from SDMetrics, while using only 0.072% of the parameters of the classical models. Additionally, we evaluate the generalization capabilities of the models using two custom-designed metrics that demonstrate the ability of the proposed quantum model to generate useful and novel samples. To our knowledge, this is one of the first demonstrations of a successful quantum generative model for handling tabular data, indicating that this task could be well-suited to quantum computers.

[AI-10] Training RL Agents for Multi-Objective Network Defense Tasks

【Quick Read】: This paper addresses how to apply open-ended learning (OEL) to real-world cybersecurity scenarios in order to develop robust and generalizable autonomous cyber defenders. The key to the solution is a task representation approach that maintains a consistent interface over goals, rewards, and action spaces across a broad universe of tasks, so that learning agents can train under varying network conditions, attacker behaviors, and defender goals while building on previously acquired knowledge.

Link: https://arxiv.org/abs/2505.22531
Authors: Andres Molina-Markham, Luis Robaina, Sean Steinle, Akash Trivedi, Derek Tsui, Nicholas Potteiger, Lauren Brandt, Ransom Winder, Ahmed Ridley
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Open-ended learning (OEL) – which emphasizes training agents that achieve broad capability over narrow competency – is emerging as a paradigm to develop artificial intelligence (AI) agents to achieve robustness and generalization. However, despite promising results that demonstrate the benefits of OEL, applying OEL to develop autonomous agents for real-world cybersecurity applications remains a challenge. We propose a training approach, inspired by OEL, to develop autonomous network defenders. Our results demonstrate that like in other domains, OEL principles can translate into more robust and generalizable agents for cyber defense. To apply OEL to network defense, it is necessary to address several technical challenges. Most importantly, it is critical to provide a task representation approach over a broad universe of tasks that maintains a consistent interface over goals, rewards and action spaces. This way, the learning agent can train with varying network conditions, attacker behaviors, and defender goals while being able to build on previously gained knowledge. With our tools and results, we aim to fundamentally impact research that applies AI to solve cybersecurity problems. Specifically, as researchers develop gyms and benchmarks for cyber defense, it is paramount that they consider diverse tasks with consistent representations, such as those we propose in our work.

[AI-11] Evaluating Supervised Learning Models for Fraud Detection: A Comparative Study of Classical and Deep Architectures on Imbalanced Transaction Data

【Quick Read】: This paper addresses fraud detection in high-stakes domains such as finance and e-commerce, where undetected fraudulent transactions can cause significant economic losses. The key to the solution is a systematic comparison of four supervised learning models (Logistic Regression, Random Forest, LightGBM, and a Gated Recurrent Unit (GRU) network) on a large, highly imbalanced online transaction dataset, assessing their effectiveness at detecting rare but consequential fraud. The study emphasizes both overall and per-class metrics and highlights the precision-recall trade-offs among the models, informing real-world deployment.

Link: https://arxiv.org/abs/2505.22521
Authors: Chao Wang, Chuanhao Nie, Yunbo Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages. Chao Wang, Chuanhao Nie, and Yunbo Liu contributed equally to this work. Corresponding author: Yunbo Liu (yunbo.liu954@duke.edu). Submitted to the 3rd International Conference on Management Innovation and Economy Development (MIED 2025), Chongqing, China

Abstract:Fraud detection remains a critical task in high-stakes domains such as finance and e-commerce, where undetected fraudulent transactions can lead to significant economic losses. In this study, we systematically compare the performance of four supervised learning models - Logistic Regression, Random Forest, Light Gradient Boosting Machine (LightGBM), and a Gated Recurrent Unit (GRU) network - on a large-scale, highly imbalanced online transaction dataset. While ensemble methods such as Random Forest and LightGBM demonstrated superior performance in both overall and class-specific metrics, Logistic Regression offered a reliable and interpretable baseline. The GRU model showed strong recall for the minority fraud class, though at the cost of precision, highlighting a trade-off relevant for real-world deployment. Our evaluation emphasizes not only weighted averages but also per-class precision, recall, and F1-scores, providing a nuanced view of each model’s effectiveness in detecting rare but consequential fraudulent activity. The findings underscore the importance of choosing models based on the specific risk tolerance and operational needs of fraud detection systems.
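The evaluation practice emphasized here, per-class precision, recall, and F1 rather than accuracy alone, looks like the following minimal sketch; the synthetic data and logistic model are placeholders for the paper's dataset and models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 20000
y = (rng.random(n) < 0.02).astype(int)           # ~2% fraud: highly imbalanced
X = rng.standard_normal((n, 8)) + 1.5 * y[:, None]  # fraud shifts the features

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
# Per-class precision/recall/F1 exposes the minority-class trade-off that
# a single accuracy number would hide.
print(classification_report(y, clf.predict(X),
                            target_names=["legit", "fraud"], digits=3))
```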

[AI-12] Strengthening Proportionality in Temporal Voting

【Quick Read】: This paper addresses the challenge of achieving proportional representation in the temporal voting framework with approval ballots. Prior work adapted multiwinner proportionality concepts such as justified representation (JR), proportional JR (PJR), and extended JR (EJR) to the temporal setting; this paper explores ways of going beyond EJR. The key is to propose stronger variants of JR, PJR, and EJR and to introduce temporal adaptations of more demanding multiwinner axioms, such as EJR+, full JR (FJR), full proportional JR (FPJR), and the Core. By studying the existence of these notions and their relationships to existing ones, the paper establishes a rich hierarchy of proportionality concepts; notably, EJR+ and FJR are shown to strengthen EJR while remaining satisfiable in every temporal election, enhancing proportionality without sacrificing feasibility.

Link: https://arxiv.org/abs/2505.22513
Authors: Bradley Phillips, Edith Elkind, Nicholas Teh, Tomasz Wąs
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)

Abstract:We study proportional representation in the framework of temporal voting with approval ballots. Prior work adapted basic proportional representation concepts – justified representation (JR), proportional JR (PJR), and extended JR (EJR) – from the multiwinner setting to the temporal setting. Our work introduces and examines ways of going beyond EJR. Specifically, we consider stronger variants of JR, PJR, and EJR, and introduce temporal adaptations of more demanding multiwinner axioms, such as EJR+, full JR (FJR), full proportional JR (FPJR), and the Core. For each of these concepts, we investigate its existence and study its relationship to existing notions, thereby establishing a rich hierarchy of proportionality concepts. Notably, we show that two of our proposed axioms – EJR+ and FJR – strengthen EJR while remaining satisfiable in every temporal election.

[AI-13] From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation

【Quick Read】: This paper aims to enable embodied agents that collaborate with unfamiliar agents and human users on complex physical tasks to quickly and accurately align with users' implicit needs (desire alignment). The key to the solution is FAMER, a framework that introduces a desire-based mental reasoning mechanism to identify user intent and filter out desire-irrelevant actions, designs a reflection-based communication module to reduce redundant inquiries, and combines goal-relevant information extraction with memory persistence to improve information reuse.

Link: https://arxiv.org/abs/2505.22503
Authors: Yuanfei Wang, Xinju Huang, Fangwei Zhong, Yaodong Yang, Yizhou Wang, Yuanpei Chen, Hao Dong
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:While embodied agents have made significant progress in performing complex physical tasks, real-world applications demand more than pure task execution. The agents must collaborate with unfamiliar agents and human users, whose goals are often vague and implicit. In such settings, interpreting ambiguous instructions and uncovering underlying desires is essential for effective assistance. Therefore, fast and accurate desire alignment becomes a critical capability for embodied agents. In this work, we first develop a home assistance simulation environment HA-Desire that integrates an LLM-driven human user agent exhibiting realistic value-driven goal selection and communication. The ego agent must interact with this proxy user to infer and adapt to the user’s latent desires. To achieve this, we present a novel framework FAMER for fast desire alignment, which introduces a desire-based mental reasoning mechanism to identify user intent and filter desire-irrelevant actions. We further design a reflection-based communication module that reduces redundant inquiries, and incorporate goal-relevant information extraction with memory persistence to improve information reuse and reduce unnecessary exploration. Extensive experiments demonstrate that our framework significantly enhances both task execution and communication efficiency, enabling embodied agents to quickly adapt to user-specific desires in complex embodied environments.

[AI-14] Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation ICML2025

【Quick Read】: This paper aims to address off-policy evaluation (OPE) in reinforcement learning, focusing on behavior policy estimation for importance sampling (IS). Prior work showed empirically that estimating a history-dependent behavior policy can lower the mean squared error (MSE) even when the true behavior policy is Markovian, but why using history lowers the MSE lacked a theoretical explanation. The key contribution is a bias-variance decomposition of the MSE of the ordinary IS estimator, which reveals that history-dependent behavior policy estimation decreases the asymptotic variance while increasing the finite-sample bias; crucially, the variance decreases consistently as the estimated behavior policy conditions on a longer history. These findings extend to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator, and the marginalized IS estimator, whether the behavior policy is estimated parametrically or non-parametrically.

Link: https://arxiv.org/abs/2505.22492
Authors: Hongyi Zhou, Josiah P. Hanna, Jin Zhu, Ying Yang, Chengchun Shi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted by ICML 2025

Abstract:This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of why the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, as the estimated behavior policy conditions on a longer history, we show a consistent decrease in variance. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator and the marginalized IS estimator, with the behavior policy estimated either parametrically or non-parametrically.
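At the center of the analysis is the ordinary IS estimator, sketched below: trajectory returns are reweighted by the product of per-step target-to-behavior probability ratios. The paper's question concerns what happens when the denominator comes from a (possibly history-dependent) estimate; shapes and the toy data here are illustrative.

```python
import numpy as np

def ordinary_is(returns, pi_e_probs, pi_b_hat_probs):
    """returns: (N,) trajectory returns;
    *_probs: (N, T) per-step probabilities of the actions actually taken."""
    rho = np.prod(pi_e_probs / pi_b_hat_probs, axis=1)  # (N,) cumulative IS ratios
    return float(np.mean(rho * returns))

rng = np.random.default_rng(0)
N, T = 1000, 10
pi_b_hat = rng.uniform(0.3, 0.7, size=(N, T))  # estimated behavior probabilities
pi_e = rng.uniform(0.3, 0.7, size=(N, T))      # target-policy probabilities
G = rng.standard_normal(N)                     # toy returns
print(ordinary_is(G, pi_e, pi_b_hat))
```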

[AI-15] On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

【Quick Read】: This paper addresses the poor theoretical understanding of standard parameterization (SP) for training large-scale vision and language models, in particular why empirically optimal learning rates decay much more slowly than infinite-width theory predicts. The key to the solution is taking the loss function into account: the paper proves that under cross-entropy (CE) loss there exists a controlled divergence regime in which the logits diverge but the loss, gradients, and activations remain stable, supporting stable training at large learning rates with persistent feature evolution, which explains the practical success of SP.

Link: https://arxiv.org/abs/2505.22491
Authors: Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:The dominant paradigm for training large-scale vision and language models is He initialization and a single global learning rate (standard parameterization, SP). Despite its practical success, standard parametrization remains poorly understood from a theoretical perspective: Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates. However, empirically optimal learning rates consistently decay much slower than theoretically predicted. By carefully studying neural network training dynamics, we demonstrate that this discrepancy is not fully explained by finite-width phenomena such as catapult effects or a lack of alignment between weights and incoming activations. We instead show that the apparent contradiction can be fundamentally resolved by taking the loss function into account: In contrast to Mean Squared Error (MSE) loss, we prove that under cross-entropy (CE) loss, an intermediate controlled divergence regime emerges, where logits diverge but loss, gradients, and activations remain stable. Stable training under large learning rates enables persistent feature evolution at scale in all hidden layers, which is crucial for the practical success of SP. In experiments across optimizers (SGD, Adam), architectures (MLPs, GPT) and data modalities (vision, language), we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scalings for standard initialization.

[AI-16] Human-Centered Human-AI Collaboration (HCHAC)

【Quick Read】: This paper aims to address the mechanisms and optimization of cooperation between humans and autonomous intelligent agents in Human-AI Collaboration (HAC), particularly how to achieve effective collaboration under the Human-Centered AI (HCAI) framework. The key to the solution is a human-centered HAC framework (HCHAC) that integrates existing research methods and theories, clarifies the core concepts and distinguishing features of HAC, and explores its applications and interaction patterns in real-world scenarios (such as autonomous vehicles), driving continued progress in the effectiveness, reliability, and ethical integration of HAC systems.

Link: https://arxiv.org/abs/2505.22477
Authors: Qi Gao, Wei Xu, Hanxi Pan, Mowei Shen, Zaifeng Gao
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This article is a chapter from the upcoming book Handbook of Human-Centered Artificial Intelligence

Abstract:In the intelligent era, the interaction between humans and intelligent systems fundamentally involves collaboration with autonomous intelligent agents. Human-AI Collaboration (HAC) represents a novel type of human-machine relationship facilitated by autonomous intelligent machines equipped with AI technologies. In this paradigm, AI agents serve not only as auxiliary tools but also as active teammates, partnering with humans to accomplish tasks collaboratively. Human-centered AI (HCAI) emphasizes that humans play critical leadership roles in the collaboration. This human-led collaboration imparts new dimensions to the human-machine relationship, necessitating innovative research perspectives, paradigms, and agenda to address the unique challenges posed by HAC. This chapter delves into the essence of HAC from the human-centered perspective, outlining its core concepts and distinguishing features. It reviews the current research methodologies and research agenda within the HAC field from the HCAI perspective, highlighting advancements and ongoing studies. Furthermore, a framework for human-centered HAC (HCHAC) is proposed by integrating these reviews and analyses. A case study of HAC in the context of autonomous vehicles is provided, illustrating practical applications and the synergistic interactions between humans and AI agents. Finally, it identifies potential future research directions aimed at enhancing the effectiveness, reliability, and ethical integration of human-centered HAC systems in diverse domains.

[AI-17] Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems

【Quick Read】: This paper addresses how large language model-based multi-agent systems (MASs) should be structurally organized for optimal cooperation. The key to the solution is building topology-aware MASs around three core components (agents, communication links, and communication patterns) via a three-stage framework of agent selection, structure profiling, and topology synthesis, so as to systematically optimize coordination performance and efficiency.

Link: https://arxiv.org/abs/2505.22467
Authors: Jiaxi Yang, Mengqi Zhang, Yiqiao Jin, Hao Chen, Qingsong Wen, Lu Lin, Yi He, Weijie Xu, James Evans, Jindong Wang
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. Nevertheless, the question of how agents should be structurally organized for optimal cooperation remains largely unexplored. In this position paper, we aim to gently redirect the focus of the MAS research community toward this critical dimension: develop topology-aware MASs for specific tasks. Specifically, the system consists of three core components - agents, communication links, and communication patterns - that collectively shape its coordination performance and efficiency. To this end, we introduce a systematic, three-stage framework: agent selection, structure profiling, and topology synthesis. Each stage would trigger new research opportunities in areas such as language models, reinforcement learning, graph learning, and generative modeling; together, they could unleash the full potential of MASs in complicated real-world applications. Then, we discuss the potential challenges and opportunities in the evaluation of multiple systems. We hope our perspective and framework can offer critical new insights in the era of agentic AI.

[AI-18] AI Mathematician: Towards Fully Automated Frontier Mathematical Research

【Quick Read】: This paper addresses two key challenges that large reasoning models (LRMs) face in mathematical research: the intrinsic complexity of research problems and the requirement of procedural rigor. The proposed AI Mathematician (AIM) framework introduces two core strategies: an exploration mechanism to foster longer solution paths, and a pessimistic reasonable verification method to ensure the reliability of results. These strategies enable AIM to tackle research-level mathematical tasks, demonstrating the ability to construct proofs and uncover non-trivial insights across several real-world mathematical areas.

Link: https://arxiv.org/abs/2505.22451
Authors: Yuanhang Liu, Yanxing Huang, Yanqiao Wang, Peng Li, Yang Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 95 pages, 1 figure

Abstract:Large Reasoning Models (LRMs) have made significant progress in mathematical capabilities in recent times. However, these successes have been primarily confined to competition-level problems. In this work, we propose the AI Mathematician (AIM) framework, which harnesses the reasoning strength of LRMs to support frontier mathematical research. We have identified two critical challenges of mathematical research compared to competition: the intrinsic complexity of research problems and the requirement of procedural rigor. To address these challenges, AIM incorporates two core strategies: an exploration mechanism to foster longer solution paths, and the pessimistic reasonable verification method to ensure reliability. This early version of AIM already exhibits strong capability in tackling research-level tasks. We conducted extensive experiments across several real-world mathematical topics and obtained promising results. AIM is able to autonomously construct substantial portions of proofs and uncover non-trivial insights within each research area. These findings highlight the potential of LRMs in mathematical discovery and suggest that LRM-based agent systems could significantly accelerate mathematical research in the future.
zh

[AI-19] SOReL and TOReL: Two Methods for Fully Offline Reinforcement Learning

【Quick Read】: This paper targets the sample-efficiency problem that blocks real-world adoption of reinforcement learning (RL): environment interactions are typically costly or dangerous to obtain, yet existing offline RL methods still rely on extensive online interactions for hyperparameter tuning and offer no reliable bound on initial online performance. To address these two issues, two algorithms are proposed. The key to SOReL is that, using only offline data, a Bayesian approach infers a posterior over environment dynamics, yielding a reliable estimate of online performance via posterior predictive uncertainty, with all hyperparameters tuned fully offline. The key to TOReL is extending information-rate-based offline hyperparameter tuning to general offline RL methods, matching the best online tuning approaches while using offline data alone.

Link: https://arxiv.org/abs/2505.22442
Authors: Mattie Fellows, Clarisse Wibault, Uljad Berdica, Johannes Forkel, Jakob N. Foerster, Michael A. Osborne
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sample efficiency remains a major obstacle for real world adoption of reinforcement learning (RL): success has been limited to settings where simulators provide access to essentially unlimited environment interactions, which in reality are typically costly or dangerous to obtain. Offline RL in principle offers a solution by exploiting offline data to learn a near-optimal policy before deployment. In practice, however, current offline RL methods rely on extensive online interactions for hyperparameter tuning, and have no reliable bound on their initial online performance. To address these two issues, we introduce two algorithms. Firstly, SOReL: an algorithm for safe offline reinforcement learning. Using only offline data, our Bayesian approach infers a posterior over environment dynamics to obtain a reliable estimate of the online performance via the posterior predictive uncertainty. Crucially, all hyperparameters are also tuned fully offline. Secondly, we introduce TOReL: a tuning for offline reinforcement learning algorithm that extends our information rate based offline hyperparameter tuning methods to general offline RL approaches. Our empirical evaluation confirms SOReL’s ability to accurately estimate regret in the Bayesian setting whilst TOReL’s offline hyperparameter tuning achieves competitive performance with the best online hyperparameter tuning methods using only offline data. Thus, SOReL and TOReL make a significant step towards safe and reliable offline RL, unlocking the potential for RL in the real world. Our implementations are publicly available: this https URL_torel.

[AI-20] Physics-Informed Distillation of Diffusion Models for PDE-Constrained Generation

【Quick Read】: This paper addresses the difficulty of directly enforcing partial differential equation (PDE) constraints on clean samples $\boldsymbol{x}_0$ in diffusion models, which only have access to noisy data $\boldsymbol{x}_t$. Conventional approaches impose the constraints on the estimated expectation $\mathbb{E}[\boldsymbol{x}_0|\boldsymbol{x}_t]$, which introduces Jensen's Gap and forces a trade-off between PDE satisfaction and generative accuracy. The key idea is a post-hoc distillation method, Physics-Informed Distillation of Diffusion Models (PIDDM), which enforces the PDE constraints during a distillation stage rather than injecting them into the diffusion process, improving PDE satisfaction while preserving generation quality.

Link: https://arxiv.org/abs/2505.22391
Authors: Yi Zhang, Difan Zou
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
Comments: 23 pages, 5 figures, 4 tables

Abstract: Modeling physical systems in a generative manner offers several advantages, including the ability to handle partial observations, generate diverse solutions, and address both forward and inverse problems. Recently, diffusion models have gained increasing attention in the modeling of physical systems, particularly those governed by partial differential equations (PDEs). However, diffusion models only access noisy data $\boldsymbol{x}_t$ at intermediate steps, making it infeasible to directly enforce constraints on the clean sample $\boldsymbol{x}_0$ at each noise level. As a workaround, constraints are typically applied to the expectation of clean samples $\mathbb{E}[\boldsymbol{x}_0|\boldsymbol{x}_t]$, which is estimated using the learned score network. However, imposing PDE constraints on the expectation does not strictly represent the constraint on the true clean data, a discrepancy known as Jensen's Gap. This gap creates a trade-off: enforcing PDE constraints may come at the cost of reduced accuracy in generative modeling. To address this, we propose a simple yet effective post-hoc distillation approach, where PDE constraints are not injected directly into the diffusion process, but instead enforced during a post-hoc distillation stage. We term our method Physics-Informed Distillation of Diffusion Models (PIDDM). This distillation not only facilitates single-step generation with improved PDE satisfaction, but also supports both forward and inverse problem solving and reconstruction from random partial observations. Extensive experiments across various PDE benchmarks demonstrate that PIDDM significantly improves PDE satisfaction over several recent and competitive baselines, such as PIDM, DiffusionPDE, and ECI-sampling, with less computation overhead. Our approach can shed light on more efficient and effective strategies for incorporating physical constraints into diffusion models.
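To make the distillation idea concrete, here is a minimal PyTorch-style sketch of such a post-hoc objective. The names (`student`, `teacher_sample`, `pde_residual`, the weight `lam`) are illustrative assumptions rather than the paper's actual API: a one-step student is fit to the teacher's deterministic samples while the PDE residual is penalized directly on the student's clean output, sidestepping the Jensen's Gap issue of constraining the expectation.

```python
import torch

def piddm_loss(student, teacher_sample, noise, pde_residual, lam=1.0):
    """Post-hoc distillation sketch: a one-step student maps noise to the
    teacher's sample, and the PDE residual is enforced directly on the
    student's clean output x0_hat. `pde_residual` is assumed to return a
    per-sample residual of the governing PDE."""
    x0_hat = student(noise)                              # single-step generation
    distill = torch.mean((x0_hat - teacher_sample) ** 2)  # match the teacher
    physics = torch.mean(pde_residual(x0_hat) ** 2)       # enforce the PDE on x0_hat
    return distill + lam * physics
```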

[AI-21] Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning

【Quick Read】: This paper aims to mitigate catastrophic forgetting in continual learning (CL), where a model degrades on earlier tasks while learning new ones. Existing CL methods rely only on the parameters of the most recent task at inference time, which makes them prone to forgetting. The key idea is Perturb-and-Merge (P&M), a framework that brings model merging into the CL paradigm: after each task, it forms a convex combination of the previous model and the newly trained task-specific model. A theoretical analysis yields the optimal merging coefficient, and a regularization term built from the task vector and the Hessian of the loss further improves the merged model; this term is approximated efficiently with second-order symmetric finite differences and a stochastic perturbation strategy that requires no extra forward or backward passes.

Link: https://arxiv.org/abs/2505.22389
Authors: Haomiao Qiu, Miao Zhang, Ziyue Qiao, Liqiang Nie
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures

Abstract: Continual Learning (CL) aims to enable models to continuously acquire new knowledge from a sequence of tasks while avoiding the forgetting of learned information. However, existing CL methods rely only on the parameters of the most recent task for inference, which makes them susceptible to catastrophic forgetting. Inspired by the recent success of model merging techniques, we propose Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Specifically, after training on each task, P&M constructs a new model by forming a convex combination of the previous model and the newly trained task-specific model. Through theoretical analysis, we minimize the total loss increase across all tasks and derive an analytical solution for the optimal merging coefficient. To further improve the performance of the merged model, we observe that the degradation introduced during merging can be alleviated by a regularization term composed of the task vector and the Hessian matrix of the loss function. Interestingly, we show that this term can be efficiently approximated using second-order symmetric finite differences, and a stochastic perturbation strategy along the task vector direction is accordingly devised, which incurs no additional forward or backward passes while providing an effective approximation of the regularization term. Finally, we combine P&M with LoRA, a parameter-efficient fine-tuning method, to reduce memory overhead. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets.
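As a rough illustration of the merging step only, the following sketch forms the convex combination of the previous merged model and the newly trained task model. The coefficient `lam` is user-supplied here, whereas the paper derives an analytical optimum; the Hessian-based regularizer used during training is not shown.

```python
import copy
import torch

@torch.no_grad()
def perturb_and_merge(prev_model, task_model, lam):
    """Convex combination of the previously merged model and the model
    fine-tuned on the newest task (lam in [0, 1]).  Assumes both models
    share the same architecture so parameters() align one-to-one."""
    merged = copy.deepcopy(task_model)
    for p_prev, p_task, p_out in zip(prev_model.parameters(),
                                     task_model.parameters(),
                                     merged.parameters()):
        p_out.copy_(lam * p_prev + (1.0 - lam) * p_task)
    return merged
```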

[AI-22] Exact Algorithms and Lower Bounds for Forming Coalitions of Constrained Maximum Size AAAI2025

【Quick Read】: This paper studies how to split a group of agents into teams in the most efficient way while respecting each agent's preferences over teammates, modeled as the classical Coalition Formation problem with the additional requirement that every team be of bounded size. The key contribution is an efficient algorithm for "small" teams on tree-like structures (bounded treewidth), together with a proof that the algorithm is asymptotically optimal: under reasonable theoretical assumptions, no algorithm can vastly outperform it, even on star-like structures (bounded vertex cover number).

Link: https://arxiv.org/abs/2505.22384
Authors: Foivos Fioravantes, Harmender Gahlawat, Nikolaos Melissinos
Affiliation: Unknown
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: a preliminary version appeared in AAAI 2025

Abstract: Imagine we want to split a group of agents into teams in the most efficient way, considering that each agent has their own preferences about their teammates. This scenario is modeled by the extensively studied Coalition Formation problem. Here, we study a version of this problem where each team must additionally be of bounded size. We conduct a systematic algorithmic study, providing several intractability results as well as multiple exact algorithms that scale well as the input grows (FPT), which could prove useful in practice. Our main contribution is an algorithm that deals efficiently with tree-like structures (bounded treewidth) for "small" teams. We complement this result by proving that our algorithm is asymptotically optimal. Particularly, there can be no algorithm that vastly outperforms the one we present, under reasonable theoretical assumptions, even when considering star-like structures (bounded vertex cover number).

[AI-23] SplitLoRA: Balancing Stability and Plasticity in Continual Learning Through Gradient Space Splitting

【Quick Read】: This paper addresses the balance between stability and plasticity when a model learns multiple tasks sequentially in continual learning. Existing gradient-projection methods struggle to partition the gradient space in a way that achieves the best trade-off. The key idea is SplitLoRA, a continual learning method built on Low-Rank Adaptation, which combines a theoretical analysis with an effective procedure to derive the optimal partition of the gradient space of previously learned tasks, thereby balancing stability and plasticity.

Link: https://arxiv.org/abs/2505.22370
Authors: Haomiao Qiu, Miao Zhang, Ziyue Qiao, Weili Guan, Min Zhang, Liqiang Nie
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures

Abstract: Continual Learning requires a model to learn multiple tasks in sequence while maintaining both stability (preserving knowledge from previously learned tasks) and plasticity (effectively learning new tasks). Gradient projection has emerged as an effective and popular paradigm in CL, where it partitions the gradient space of previously learned tasks into two orthogonal subspaces: a primary subspace and a minor subspace. New tasks are learned effectively within the minor subspace, thereby reducing interference with previously acquired knowledge. However, existing gradient projection methods struggle to achieve an optimal balance between plasticity and stability, as it is hard to appropriately partition the gradient space. In this work, we consider a continual learning paradigm based on Low-Rank Adaptation, which has gained considerable attention due to its efficiency and wide applicability, and propose a novel approach for continual learning, called SplitLoRA. We first provide a theoretical analysis of how subspace partitioning affects model stability and plasticity. Informed by this analysis, we then introduce an effective method that derives the optimal partition of the gradient space for previously learned tasks. This approach effectively balances stability and plasticity in continual learning. Experimental results on multiple datasets demonstrate that the proposed method achieves state-of-the-art performance.
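The abstract does not spell out the projection itself, but a standard gradient-projection step in this family looks like the sketch below: the top-k singular directions of stored past-task gradients form the "primary" subspace, and new-task gradients are projected onto its orthogonal complement (the "minor" subspace). `past_grads` and `k` are illustrative names; how SplitLoRA chooses the split is precisely its contribution and is not reproduced here.

```python
import torch

def project_to_minor_subspace(grad, past_grads, k):
    """Project a new-task gradient onto the orthogonal complement of the
    top-k subspace spanned by earlier tasks' gradients.
    grad: (dim,) flattened gradient; past_grads: (num_samples, dim)."""
    # Top-k left singular directions of the past-task gradient matrix.
    U, S, Vh = torch.linalg.svd(past_grads.T, full_matrices=False)
    U_k = U[:, :k]                       # primary-subspace basis, shape (dim, k)
    # Remove the component that lies in the primary subspace.
    return grad - U_k @ (U_k.T @ grad)
```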

[AI-24] Agent DNS: A Root Domain Naming System for LLM Agents

【Quick Read】: This paper addresses the key challenges of service discovery, interoperability, and communication for generative AI agents across vendors. Existing protocols have advanced agent-tool interoperability and multi-agent communication, but there is no standardized service discovery protocol that spans different agent and tool vendors. The proposed solution is AgentDNS, whose key idea is to borrow the principles of traditional DNS and introduce structured service registration, semantic service discovery, secure invocation, and a unified billing mechanism, enabling LLM agents to autonomously discover, resolve, and securely invoke third-party agents and services across organizational and technological boundaries.

Link: https://arxiv.org/abs/2505.22368
Authors: Enfang Cui, Yujun Cheng, Rui She, Dan Liu, Zhiyuan Liang, Minxin Guo, Tianzheng Li, Qian Wei, Wenjuan Xing, Zhijie Zhong
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures

Abstract:The rapid evolution of Large Language Model (LLM) agents has highlighted critical challenges in cross-vendor service discovery, interoperability, and communication. Existing protocols like model context protocol and agent-to-agent protocol have made significant strides in standardizing interoperability between agents and tools, as well as communication among multi-agents. However, there remains a lack of standardized protocols and solutions for service discovery across different agent and tool vendors. In this paper, we propose AgentDNS, a root domain naming and service discovery system designed to enable LLM agents to autonomously discover, resolve, and securely invoke third-party agent and tool services across organizational and technological boundaries. Inspired by the principles of the traditional DNS, AgentDNS introduces a structured mechanism for service registration, semantic service discovery, secure invocation, and unified billing. We detail the architecture, core functionalities, and use cases of AgentDNS, demonstrating its potential to streamline multi-agent collaboration in real-world scenarios. The source code will be published on this https URL.

[AI-25] Budget-Adaptive Adapter Tuning in Orthogonal Subspaces for Continual Learning in LLMs

【Quick Read】: This paper targets catastrophic forgetting when large language models (LLMs) are trained in continual learning (CL) scenarios, where performance on earlier tasks degrades severely as new tasks arrive. Existing orthogonal-subspace methods mitigate task interference but use fixed budget allocation, ignoring the varying complexity across tasks and layers; recent budget-adaptive tuning methods adopt multi-stage paradigms that decouple optimization from budget allocation and thus risk misalignment. The key idea, OA-Adapter, unifies dynamic budget adaptation with orthogonal subspace learning in a single end-to-end training stage: a dynamic bottleneck dimension adaptation mechanism allocates an efficient parameter budget while optimizing the task objective without misalignment, and orthogonal constraints between the current task's parameter subspace and the dynamically allocated subspaces of historical tasks preserve previously acquired knowledge.

Link: https://arxiv.org/abs/2505.22358
Authors: Zhiyi Wan, Wanrou Du, Liang Li, Miao Pan, Xiaoqi Qin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks. Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers. Besides, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization and budget allocation. Such decoupling results in potential misalignment, which hinders those approaches’ practical application in CL scenarios. To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in a single end-to-end training stage. Specifically, OA-Adapter introduces a dynamic bottleneck dimension adaptation mechanism that simultaneously allocates an efficient parameter budget and optimizes task objectives without misalignment. To effectively preserve previously acquired knowledge while coordinating with the dynamic budget allocation, orthogonal constraints are applied specifically between the parameter subspace of the current task and the dynamically allocated parameter subspaces of historical tasks. Experimental results on continual learning benchmarks demonstrate that OA-Adapter outperforms state-of-the-art methods in both accuracy and parameter efficiency, achieving higher average accuracy while using 58.5% fewer parameters on the standard CL benchmark.
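A minimal sketch of the orthogonality constraint described above: the Frobenius-norm overlap between the current adapter's subspace and each stored historical subspace is penalized. The names and the exact penalty form are assumptions for illustration; the dynamic bottleneck-dimension adaptation that OA-Adapter couples with this constraint is not shown.

```python
import torch

def orthogonality_penalty(A_current, past_bases):
    """Penalize overlap between the current task's adapter subspace and
    stored subspaces of earlier tasks: sum_i ||B_i^T A||_F^2.
    A_current: (dim, r) adapter matrix of the current task;
    past_bases: list of (dim, r_i) orthonormal bases (illustrative names)."""
    loss = A_current.new_zeros(())
    for B in past_bases:
        loss = loss + (B.T @ A_current).pow(2).sum()
    return loss  # add to the task loss with a weighting coefficient
```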

[AI-26] Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings ICML2025

【Quick Read】: This paper addresses a central challenge in deploying machine learning models in safety-critical domains: ensuring reliable performance on downstream user data when ground-truth labels for direct validation are unavailable. The key idea is the suitability filter, a framework that detects performance deterioration using suitability signals, i.e., model output features that are sensitive to covariate shift and indicative of potential prediction errors. It compares classifier accuracy on unlabeled user data against the accuracy measured on a labeled test set, ensuring that any degradation stays within a pre-specified margin and providing insight into decision uncertainty.

Link: https://arxiv.org/abs/2505.22356
Authors: Angéline Pouget, Mohammad Yaghini, Stephan Rabanser, Nicolas Papernot
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments: Accepted to ICML 2025

Abstract:Deploying machine learning models in safety-critical domains poses a key challenge: ensuring reliable model performance on downstream user data without access to ground truth labels for direct validation. We propose the suitability filter, a novel framework designed to detect performance deterioration by utilizing suitability signals – model output features that are sensitive to covariate shifts and indicative of potential prediction errors. The suitability filter evaluates whether classifier accuracy on unlabeled user data shows significant degradation compared to the accuracy measured on the labeled test dataset. Specifically, it ensures that this degradation does not exceed a pre-specified margin, which represents the maximum acceptable drop in accuracy. To achieve reliable performance evaluation, we aggregate suitability signals for both test and user data and compare these empirical distributions using statistical hypothesis testing, thus providing insights into decision uncertainty. Our modular method adapts to various models and domains. Empirical evaluations across different classification tasks demonstrate that the suitability filter reliably detects performance deviations due to covariate shift. This enables proactive mitigation of potential failures in high-stakes applications.
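One way to read the statistical test described above is as a one-sided non-inferiority test on aggregated suitability signals. The sketch below assumes scalar per-sample signals whose mean tracks accuracy (for example, maximum softmax confidence); the concrete signals and test statistic in the paper may differ.

```python
import numpy as np
from scipy import stats

def suitability_filter(test_signals, user_signals, margin=0.02, alpha=0.05):
    """Non-inferiority test sketch.  H0: mean user signal has dropped by
    more than `margin` below the labeled test set; reject H0 (declare the
    deployment 'suitable') when the one-sided z-test is significant."""
    diff = np.mean(user_signals) - (np.mean(test_signals) - margin)
    se = np.sqrt(np.var(test_signals, ddof=1) / len(test_signals)
                 + np.var(user_signals, ddof=1) / len(user_signals))
    z = diff / se
    p_value = 1.0 - stats.norm.cdf(z)  # small p => degradation within margin
    return p_value < alpha
```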

[AI-27] ChatPD: An LLM-driven Paper-Dataset Networking System KDD

【Quick Read】: This paper addresses the dataset management inefficiencies caused by manual workflows on academic platforms. The key idea is ChatPD, a system that uses Large Language Models (LLMs) to automate dataset information extraction from academic papers and to build a structured paper-dataset network. The system has three core modules, paper collection, dataset information extraction, and dataset entity resolution, with a Graph Completion and Inference strategy that maps dataset descriptions to their corresponding entities.

Link: https://arxiv.org/abs/2505.22349
Authors: Anjie Xu, Ruiqing Ding, Leye Wang
Affiliation: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted by KDD Applied Data Science Track 2025

Abstract: Scientific research heavily depends on suitable datasets for method validation, but existing academic platforms with dataset management like PapersWithCode suffer from inefficiencies in their manual workflow. To overcome this bottleneck, we present a system, called ChatPD, that utilizes Large Language Models (LLMs) to automate dataset information extraction from academic papers and construct a structured paper-dataset network. Our system consists of three key modules: paper collection, dataset information extraction, and dataset entity resolution to construct paper-dataset networks. Specifically, we propose a Graph Completion and Inference strategy to map dataset descriptions to their corresponding entities. Through extensive experiments, we demonstrate that ChatPD not only outperforms the existing platform PapersWithCode in dataset usage extraction but also achieves about 90% precision and recall in entity resolution tasks. Moreover, we have deployed ChatPD to continuously extract which datasets are used in papers, and provide a dataset discovery service, such as task-specific dataset queries and similar dataset recommendations. We open source ChatPD and the current paper-dataset network in this GitHub repository: this https URL.

[AI-28] From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications

【Quick Read】: This paper addresses the challenges facing intelligent communication systems in 6G, including constrained perception and response capabilities, limited scalability, and low adaptability in dynamic environments. The key to the solution is introducing Large Artificial Intelligence Models (LAMs) and Agentic AI: a LAM-centric communication design paradigm covering dataset construction and both internal and external learning approaches, plus a LAM-based Agentic AI system whose core components (planners, knowledge bases, tools, and memory modules) and interaction mechanisms are made explicit, raising the overall intelligence of communication systems.

Link: https://arxiv.org/abs/2505.22311
Authors: Feibo Jiang, Cunhua Pan, Li Dong, Kezhi Wang, Octavia A. Dobre, Merouane Debbah
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments:

Abstract:With the advent of 6G communications, intelligent communication systems face multiple challenges, including constrained perception and response capabilities, limited scalability, and low adaptability in dynamic environments. This tutorial provides a systematic introduction to the principles, design, and applications of Large Artificial Intelligence Models (LAMs) and Agentic AI technologies in intelligent communication systems, aiming to offer researchers a comprehensive overview of cutting-edge technologies and practical guidance. First, we outline the background of 6G communications, review the technological evolution from LAMs to Agentic AI, and clarify the tutorial’s motivation and main contributions. Subsequently, we present a comprehensive review of the key components required for constructing LAMs. We further categorize LAMs and analyze their applicability, covering Large Language Models (LLMs), Large Vision Models (LVMs), Large Multimodal Models (LMMs), Large Reasoning Models (LRMs), and lightweight LAMs. Next, we propose a LAM-centric design paradigm tailored for communications, encompassing dataset construction and both internal and external learning approaches. Building upon this, we develop an LAM-based Agentic AI system for intelligent communications, clarifying its core components such as planners, knowledge bases, tools, and memory modules, as well as its interaction mechanisms. We also introduce a multi-agent framework with data retrieval, collaborative planning, and reflective evaluation for 6G. Subsequently, we provide a detailed overview of the applications of LAMs and Agentic AI in communication scenarios. Finally, we summarize the research challenges and future directions in current studies, aiming to support the development of efficient, secure, and sustainable next-generation intelligent communication systems.

[AI-29] Versatile Cardiovascular Signal Generation with a Unified Diffusion Transformer

【Quick Read】: This paper addresses the difficulty of jointly using cardiovascular signals such as photoplethysmography (PPG), electrocardiography (ECG), and blood pressure (BP) for real-time monitoring, given acquisition challenges ranging from noisy wearable recordings to burdensome invasive procedures. The key idea is UniCardio, a multi-modal diffusion transformer that reconstructs low-quality signals and synthesizes unrecorded ones within a unified generative framework, built on a specialized architecture for handling the signal modalities involved in generation tasks and a continual learning paradigm for incorporating varying modality combinations.

Link: https://arxiv.org/abs/2505.22306
Authors: Zehua Chen, Yuyang Miao, Liyuan Wang, Luyun Fan, Danilo P. Mandic, Jun Zhu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cardiovascular signals such as photoplethysmography (PPG), electrocardiography (ECG), and blood pressure (BP) are inherently correlated and complementary, together reflecting the health of cardiovascular system. However, their joint utilization in real-time monitoring is severely limited by diverse acquisition challenges from noisy wearable recordings to burdened invasive procedures. Here we propose UniCardio, a multi-modal diffusion transformer that reconstructs low-quality signals and synthesizes unrecorded signals in a unified generative framework. Its key innovations include a specialized model architecture to manage the signal modalities involved in generation tasks and a continual learning paradigm to incorporate varying modality combinations. By exploiting the complementary nature of cardiovascular signals, UniCardio clearly outperforms recent task-specific baselines in signal denoising, imputation, and translation. The generated signals match the performance of ground-truth signals in detecting abnormal health conditions and estimating vital signs, even in unseen domains, while ensuring interpretability for human experts. These advantages position UniCardio as a promising avenue for advancing AI-assisted healthcare.

[AI-30] Voice CMS: updating the knowledge base of a digital assistant through conversation

【Quick Read】: This paper examines the efficiency of user interaction and the quality of content when updating a digital assistant's knowledge base, in particular how user preferences and system usability relate to task complexity. The key to the solution is a design based on a multi-agent LLM architecture and a voice user interface (VUI), which improves the flexibility of knowledge management and the user experience.

Link: https://arxiv.org/abs/2505.22303
Authors: Grzegorz Wolny, Michał Szczerbak
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:In this study, we propose a solution based on a multi-agent LLM architecture and a voice user interface (VUI) designed to update the knowledge base of a digital assistant. Its usability is evaluated in comparison to a more traditional graphical content management system (CMS), with a focus on understanding the relationship between user preferences and the complexity of the information being provided. The findings demonstrate that, while the overall usability of the VUI is rated lower than the graphical interface, it is already preferred by users for less complex tasks. Furthermore, the quality of content entered through the VUI is comparable to that achieved with the graphical interface, even for highly complex tasks. Obtained qualitative results suggest that a hybrid interface combining the strengths of both approaches could address the key challenges identified during the experiment, such as reducing cognitive load through graphical feedback while maintaining the intuitive nature of voice-based interactions. This work highlights the potential of conversational interfaces as a viable and effective method for knowledge management in specific business contexts.

[AI-31] Compression versus Accuracy: A Hierarchy of Lifted Models

【Quick Read】: This paper addresses how to construct interpretable lifted representations efficiently and stably when using probabilistic graphical models for lifted inference. The existing Advanced Colour Passing (ACP) algorithm groups factors with similar distributions using a hyperparameter $\varepsilon$, but choosing a suitable $\varepsilon$ is hard, and different values can yield wildly different models, hurting interpretability. The key idea is a hyperparameter-free hierarchical approach that efficiently computes a hierarchy of $\varepsilon$ values guaranteeing a hierarchy of models: once factors are grouped at some $\varepsilon$, they stay grouped at larger $\varepsilon$. This in turn yields a hierarchy of error bounds, making the compression-versus-accuracy trade-off explicit and improving interpretability.

Link: https://arxiv.org/abs/2505.22288
Authors: Jan Speller, Malte Luttermann, Marcel Gehrke, Tanya Braun
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract: Probabilistic graphical models that encode indistinguishable objects and relations among them use first-order logic constructs to compress a propositional factorised model for more efficient (lifted) inference. To obtain a lifted representation, the state-of-the-art algorithm Advanced Colour Passing (ACP) groups factors that represent matching distributions. In an approximate version using $\varepsilon$ as a hyperparameter, factors are grouped that differ by a factor of at most $(1 \pm \varepsilon)$. However, finding a suitable $\varepsilon$ is not obvious and may need a lot of exploration, possibly requiring many ACP runs with different $\varepsilon$ values. Additionally, varying $\varepsilon$ can yield wildly different models, leading to decreased interpretability. Therefore, this paper presents a hierarchical approach to lifted model construction that is hyperparameter-free. It efficiently computes a hierarchy of $\varepsilon$ values that ensures a hierarchy of models, meaning that once factors are grouped together given some $\varepsilon$, these factors will be grouped together for larger $\varepsilon$ as well. The hierarchy of $\varepsilon$ values also leads to a hierarchy of error bounds. This allows for explicitly weighing compression versus accuracy when choosing specific $\varepsilon$ values to run ACP with and enables interpretability between the different models.
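For intuition only, here is a toy sketch of how a nested hierarchy of $\varepsilon$ values can arise for scalar factor potentials: merging neighbours in log-space behaves like single-linkage clustering, so the sorted ratio gaps between consecutive values are exactly the thresholds at which groups merge, and the resulting groupings are nested by construction. The real algorithm operates on full factors rather than scalars, so treat this purely as an illustration.

```python
import numpy as np

def epsilon_hierarchy(potentials):
    """Toy sketch: for positive scalar potentials, two values can be grouped
    at level eps iff they differ by a factor of at most (1 + eps).  Sorting
    in log-space and reading off the neighbour gaps yields the increasing
    sequence of eps thresholds at which groups merge (a nested hierarchy)."""
    logs = np.sort(np.log(np.asarray(potentials, dtype=float)))  # assumes > 0
    gaps = np.exp(np.diff(logs)) - 1.0   # ratio gap between consecutive values
    return np.unique(gaps)               # candidate eps values, smallest first
```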

[AI-32] New Tools are Needed for Tracking Adherence to AI Model Behavioral Use Clauses

【Quick Read】: This paper addresses the lack of effective tracking and compliance verification for AI model licenses in use, which is needed to ensure they actually lead to responsible use. The key to the solution is developing tools for tracking license adoption and adherence, argued to be the natural and urgently needed next step for making such licenses effective.

Link: https://arxiv.org/abs/2505.22287
Authors: Daniel McDuff, Tim Korjakow, Kevin Klyman, Danish Contractor
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract: Foundation models have had a transformative impact on AI. A combination of large investments in research and development, growing sources of digital data for training, and architectures that scale with data and compute has led to models with powerful capabilities. Releasing assets is fundamental to scientific advancement and commercial enterprise. However, concerns over negligent or malicious uses of AI have led to the design of mechanisms to limit the risks of the technology. The result has been a proliferation of licenses with behavioral-use clauses and acceptable-use policies that are increasingly being adopted by commonly used families of models (Llama, Gemma, Deepseek) and a myriad of smaller projects. We created and deployed a custom AI license generator to facilitate license creation and have quantitatively and qualitatively analyzed over 300 customized licenses created with this tool. Alongside this, we analyzed 1.7 million model licenses on the HuggingFace model hub. Our results show increasing adoption of these licenses, interest in tools that support their creation, and a convergence on common clause configurations. In this paper we take the position that tools for tracking adoption of, and adherence to, these licenses are the natural next step and urgently needed in order to ensure they have the desired impact of ensuring responsible use.

[AI-33] A Preprocessing Framework for Efficient Approximate Bi-Objective Shortest-Path Computation in the Presence of Correlated Objectives

【Quick Read】: This paper studies the bi-objective shortest-path (BOSP) problem in the presence of correlated objectives, a common situation in real-world settings such as road networks, where objectives like travel time and fuel consumption are positively correlated. BOSP is computationally hard because the search space grows exponentially with the number of objectives and the graph size. The key idea is an efficient algorithm that exploits the correlation between objective functions: inspired by graph-clustering algorithms, a preprocessing phase identifies correlated clusters within the graph and produces a new graph representation, yielding a natural generalization of A*pex that runs up to five times faster on DIMACS benchmark instances while providing theoretical guarantees on solution quality.

Link: https://arxiv.org/abs/2505.22244
Authors: Yaron Halle, Ariel Felner, Sven Koenig, Oren Salzman
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The bi-objective shortest-path (BOSP) problem seeks to find paths between start and target vertices of a graph while optimizing two conflicting objective functions. We consider the BOSP problem in the presence of correlated objectives. Such correlations often occur in real-world settings such as road networks, where optimizing two positively correlated objectives, such as travel time and fuel consumption, is common. BOSP is generally computationally challenging as the size of the search space is exponential in the number of objective functions and the graph size. Bounded sub-optimal BOSP solvers such as Apex alleviate this complexity by approximating the Pareto-optimal solution set rather than computing it exactly (given a user-provided approximation factor). As the correlation between objective functions increases, smaller approximation factors are sufficient for collapsing the entire Pareto-optimal set into a single solution. We leverage this insight to propose an efficient algorithm that reduces the search effort in the presence of correlated objectives. Our approach for computing approximations of the entire Pareto-optimal set is inspired by graph-clustering algorithms. It uses a preprocessing phase to identify correlated clusters within a graph and to generate a new graph representation. This allows a natural generalization of Apex to run up to five times faster on DIMACS dataset instances, a standard benchmark in the field. To the best of our knowledge, this is the first algorithm proposed that efficiently and effectively exploits correlations in the context of bi-objective search while providing theoretical guarantees on solution quality.

[AI-34] Solver-Free Decision-Focused Learning for Linear Optimization Problems

【Quick Read】: This paper targets the computational bottleneck of predict-then-optimize problems in decision-focused learning (DFL), where every loss evaluation requires solving the optimization problem under the predicted parameters. The key idea is a solver-free training method for linear optimization problems that exploits their geometric structure: a solution is optimal if and only if its objective value is at least as good as those of its adjacent vertices on the feasible polytope, so the loss compares the estimated quality of the ground-truth optimal solution with that of its precomputed adjacent vertices, greatly reducing computational cost with minimal degradation in solution quality.

Link: https://arxiv.org/abs/2505.22224
Authors: Senne Berden, Ali İrfan Mahmutoğulları, Dimos Tsouros, Tias Guns
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mathematical optimization is a fundamental tool for decision-making in a wide range of applications. However, in many real-world scenarios, the parameters of the optimization problem are not known a priori and must be predicted from contextual features. This gives rise to predict-then-optimize problems, where a machine learning model predicts problem parameters that are then used to make decisions via optimization. A growing body of work on decision-focused learning (DFL) addresses this setting by training models specifically to produce predictions that maximize downstream decision quality, rather than accuracy. While effective, DFL is computationally expensive, because it requires solving the optimization problem with the predicted parameters at each loss evaluation. In this work, we address this computational bottleneck for linear optimization problems, a common class of problems in both DFL literature and real-world applications. We propose a solver-free training method that exploits the geometric structure of linear optimization to enable efficient training with minimal degradation in solution quality. Our method is based on the insight that a solution is optimal if and only if it achieves an objective value that is at least as good as that of its adjacent vertices on the feasible polytope. Building on this, our method compares the estimated quality of the ground-truth optimal solution with that of its precomputed adjacent vertices, and uses this as loss function. Experiments demonstrate that our method significantly reduces computational cost while maintaining high decision quality.
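The optimality condition above translates directly into a hinge-style training loss. A minimal PyTorch sketch for a minimization LP, assuming the adjacent vertices of the true optimum were enumerated offline (`adjacent_vertices` is an illustrative name, and the exact loss in the paper may differ):

```python
import torch

def solver_free_loss(c_pred, x_opt, adjacent_vertices):
    """For a minimization LP, x_opt is optimal under the predicted costs iff
    its objective is no worse than that of any adjacent vertex of the
    feasible polytope; violations are penalized with a hinge."""
    obj_opt = c_pred @ x_opt
    losses = [torch.relu(obj_opt - c_pred @ x_adj) for x_adj in adjacent_vertices]
    return torch.stack(losses).sum()   # zero iff x_opt beats all neighbours
```

No LP solver is called anywhere in the loss, which is the point: gradients flow through `c_pred` alone.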

[AI-35] Enhancing Uncertainty Estimation and Interpretability via Bayesian Non-negative Decision Layer ICLR2025

【Quick Read】: This paper addresses the shortcomings of deep neural networks in uncertainty estimation and interpretability: models struggle to meet practical demands for uncertainty assessment, and their entangled nature makes decisions hard to explain. The key idea is the Bayesian Non-negative Decision Layer (BNDL), which reformulates a deep neural network as a conditional Bayesian non-negative factor analysis model: stochastic latent variables capture complex dependencies and provide robust uncertainty estimates, while their sparsity and non-negativity encourage disentangled representations and decision layers, improving interpretability. A variational inference method based on a Weibull variational inference network is designed to approximate the posterior of the latent variables.

Link: https://arxiv.org/abs/2505.22199
Authors: Xinyue Hu, Zhibin Duan, Bo Chen, Mingyuan Zhou
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by The Thirteenth International Conference on Learning Representations (ICLR 2025)

Abstract:Although deep neural networks have demonstrated significant success due to their powerful expressiveness, most models struggle to meet practical requirements for uncertainty estimation. Concurrently, the entangled nature of deep neural networks leads to a multifaceted problem, where various localized explanation techniques reveal that multiple unrelated features influence the decisions, thereby undermining interpretability. To address these challenges, we develop a Bayesian Non-negative Decision Layer (BNDL), which reformulates deep neural networks as a conditional Bayesian non-negative factor analysis. By leveraging stochastic latent variables, the BNDL can model complex dependencies and provide robust uncertainty estimation. Moreover, the sparsity and non-negativity of the latent variables encourage the model to learn disentangled representations and decision layers, thereby improving interpretability. We also offer theoretical guarantees that BNDL can achieve effective disentangled learning. In addition, we developed a corresponding variational inference method utilizing a Weibull variational inference network to approximate the posterior distribution of the latent variables. Our experimental results demonstrate that with enhanced disentanglement capabilities, BNDL not only improves the model’s accuracy but also provides reliable uncertainty estimation and improved interpretability.
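The Weibull variational posterior is convenient precisely because it admits a simple reparameterized sampler via its inverse CDF, which is presumably what the inference network relies on. A sketch (parameter names are illustrative):

```python
import torch

def sample_weibull(k, lam):
    """Reparameterized Weibull draw for variational inference over
    non-negative latents: z = lam * (-log(1 - u))**(1/k), u ~ U(0, 1).
    k (shape) and lam (scale) would be positive outputs of the inference
    network, e.g. passed through softplus."""
    u = torch.rand_like(lam)
    return lam * (-torch.log1p(-u)) ** (1.0 / k)  # log1p(-u) = log(1 - u)
```

Because the sample is a differentiable function of `k` and `lam`, the ELBO can be optimized with ordinary backpropagation.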

[AI-36] Online Fair Division for Personalized 2-Value Instances

【Quick Read】: This paper studies online fair division, where goods arrive one at a time and must be allocated immediately and irrevocably among a fixed set of $n$ agents with additive valuations; the value of each arriving good is revealed only upon arrival. Without restrictions on the values or distributional assumptions, very strong impossibility results hold in this setting. The key is to restrict the valuation functions, specifically to personalized 2-value instances in which each agent has only two possible values per good. A deterministic algorithm maintains a $1/(2n-1)$-MMS allocation at every time step and eventually yields a $1/4$-MMS allocation, obtaining worst-case guarantees with respect to maximin share fairness and envy-freeness up to one (or two) good(s); allowing limited access to future information enables stronger results with simpler approaches.

Link: https://arxiv.org/abs/2505.22174
Authors: Georgios Amanatidis, Alexandros Lolos, Evangelos Markakis, Victor Turmel
Affiliation: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract: We study an online fair division setting, where goods arrive one at a time and there is a fixed set of $n$ agents, each of whom has an additive valuation function over the goods. Once a good appears, the value each agent has for it is revealed and it must be allocated immediately and irrevocably to one of the agents. It is known that without any assumptions about the values being severely restricted or coming from a distribution, very strong impossibility results hold in this setting. To bypass the latter, we turn our attention to instances where the valuation functions are restricted. In particular, we study personalized 2-value instances, where there are only two possible values each agent may have for each good, possibly different across agents, and we show how to obtain worst-case guarantees with respect to well-known fairness notions, such as maximin share fairness and envy-freeness up to one (or two) good(s). We suggest a deterministic algorithm that maintains a $1/(2n-1)$-MMS allocation at every time step and show that this is the best possible any deterministic algorithm can achieve if one cares about every single time step; nevertheless, eventually the allocation constructed by our algorithm becomes a $1/4$-MMS allocation. To achieve this, the algorithm implicitly maintains a fragile system of priority levels for all agents. Further, we show that, by allowing some limited access to future information, it is possible to have stronger results with less involved approaches. By knowing the values of goods for $n-1$ time steps into the future, we design a matching-based algorithm that achieves an EF1 allocation every $n$ time steps, while always maintaining an EF2 allocation. Finally, we show that our results allow us to get the first nontrivial guarantees for additive instances in which the ratio of the maximum over the minimum value an agent has for a good is bounded.
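One plausible reading of the lookahead algorithm's matching step, sketched under the assumption that each block of $n$ goods (the current good plus the $n-1$ visible future ones) is assigned by a max-weight perfect matching, one good per agent per block. The EF1/EF2 analysis is of course not captured by this fragment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_block(values):
    """Assign one good to each agent within a block of n goods via a
    max-weight matching.  values: (n_agents, n_goods) array of the agents'
    values for the goods in the current block (illustrative interface)."""
    rows, cols = linear_sum_assignment(values, maximize=True)
    return dict(zip(rows, cols))   # agent index -> good index within block
```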

[AI-37] What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning

【Quick Read】: This paper asks how the internal structure of the Long Chain-of-Thought (LCoT) reasoning that large language models (LLMs) generate for complex tasks influences, and can even predict, the correctness of final answers. The key to the solution is LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures, enabling deeper structural analysis of LLM reasoning; graph neural networks (GNNs) then extract structural patterns such as exploration, backtracking, and verification, which prove to be stronger predictors of final performance across a wide range of tasks and models.

Link: https://arxiv.org/abs/2505.22148
Authors: Gangwei Jiang, Yahui Liu, Zhaoyi Li, Qi Wang, Fuzheng Zhang, Linqi Song, Ying Wei, Defu Lian
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in reasoning with large language models (LLMs) have popularized Long Chain-of-Thought (LCoT), a strategy that encourages deliberate and step-by-step reasoning before producing a final answer. While LCoTs have enabled expert-level performance in complex tasks, how the internal structures of their reasoning chains drive, or even predict, the correctness of final answers remains a critical yet underexplored question. In this work, we present LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and thus enables deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we reveal that structural patterns extracted by LCoT2Tree, including exploration, backtracking, and verification, serve as stronger predictors of final performance across a wide range of tasks and models. Leveraging an explainability technique, we further identify critical thought patterns such as over-branching that account for failures. Beyond diagnostic insights, the structural patterns by LCoT2Tree support practical applications, including improving Best-of-N decoding effectiveness. Overall, our results underscore the critical role of internal structures of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.

[AI-38] Lifted Forward Planning in Relational Factored Markov Decision Processes with Concurrent Actions

【Quick Read】: This paper addresses the exponential blow-up of the state space with the number of (indistinguishable) objects in AI decision making, and the further explosion of cases to enumerate when the action space scales with the state space, especially when concurrent actions are allowed. The key idea is a first-order representation that stores the state and action spaces in polynomial rather than exponential size in the number of objects, together with Foreplan, a relational forward planner that uses this representation to compute policies efficiently for numerous indistinguishable objects and actions; an even faster approximate version of Foreplan is also introduced.

Link: https://arxiv.org/abs/2505.22147
Authors: Florian Andreas Marwitz, Tanya Braun, Ralf Möller, Marcel Gehrke
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Decision making is a central problem in AI that can be formalized using a Markov Decision Process. A problem is that, with increasing numbers of (indistinguishable) objects, the state space grows exponentially. To compute policies, the state space has to be enumerated. Even more possibilities have to be enumerated if the size of the action space depends on the size of the state space, especially if we allow concurrent actions. To tackle the exponential blow-up in the action and state space, we present a first-order representation to store the spaces in polynomial instead of exponential size in the number of objects and introduce Foreplan, a relational forward planner, which uses this representation to efficiently compute policies for numerous indistinguishable objects and actions. Additionally, we introduce an even faster approximate version of Foreplan. Moreover, Foreplan identifies how many objects an agent should act on to achieve a certain task given restrictions. Further, we provide a theoretical analysis and an empirical evaluation of Foreplan, demonstrating a speedup of at least four orders of magnitude.

[AI-39] Sentiment Simulation using Generative AI Agents

【Quick Read】: This paper addresses the limitations of traditional sentiment analysis in capturing the psychological and contextual drivers of human sentiment, which constrain its usefulness in applications that require predictive insight. The key idea is a robust sentiment-simulation framework based on generative AI agents embedded with psychologically rich profiles, combining sociodemographic information with validated constructs of personality traits, values, beliefs, and socio-political attitudes to simulate sentiment dynamically.

Link: https://arxiv.org/abs/2505.22125
Authors: Melrose Tia, Jezreel Sophia Lanuzo, Lei Rigi Baltazar, Marie Joy Lopez-Relente, Diwa Malaya Quiñones, Jason Albia
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 18 pages, 10 figures

Abstract: Traditional sentiment analysis relies on surface-level linguistic patterns and retrospective data, limiting its ability to capture the psychological and contextual drivers of human sentiment. These limitations constrain its effectiveness in applications that require predictive insight, such as policy testing, narrative framing, and behavioral forecasting. We present a robust framework for sentiment simulation using generative AI agents embedded with psychologically rich profiles. Agents are instantiated from a nationally representative survey of 2,485 Filipino respondents, combining sociodemographic information with validated constructs of personality traits, values, beliefs, and socio-political attitudes. The framework includes three stages: (1) agent embodiment via categorical or contextualized encodings, (2) exposure to real-world political and economic scenarios, and (3) generation of sentiment ratings accompanied by explanatory rationales. Using Quadratic Weighted Accuracy (QWA), we evaluated alignment between agent-generated and human responses. Contextualized encoding achieved 92% alignment in replicating original survey responses. In sentiment simulation tasks, agents reached 81%-86% accuracy against ground truth sentiment, with contextualized profile encodings significantly outperforming categorical ones (p < 0.0001, Cohen's d = 0.70). Simulation results remained consistent across repeated trials (±0.2-0.5% SD) and resilient to variation in scenario framing (p = 0.9676, Cohen's d = 0.02). Our findings establish a scalable framework for sentiment modeling through psychographically grounded AI agents. This work signals a paradigm shift in sentiment analysis from retrospective classification to prospective and dynamic simulation grounded in the psychology of sentiment formation.

[AI-40] Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test

【Quick Read】: This paper addresses the underexplored question of cognitive flexibility in Visual Large Language Models (VLLMs), in particular their set-shifting ability. Using the Wisconsin Card Sorting Test (WCST), the study evaluates state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet) and finds that, under chain-of-thought prompting, they achieve or surpass human-level set-shifting; the key lies in appropriate prompting strategies and input modalities. Moreover, through role-playing the models can simulate the functional deficits of patients with impaired cognitive flexibility, suggesting that VLLMs may possess a brain-like cognitive architecture, at least with respect to set-shifting.

Link: https://arxiv.org/abs/2505.22112
Authors: Guangfu Hao, Frederic Alexandre, Shan Yu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Cognitive flexibility has been extensively studied in human cognition but remains relatively unexplored in the context of Visual Large Language Models (VLLMs). This study assesses the cognitive flexibility of state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet) using the Wisconsin Card Sorting Test (WCST), a classic measure of set-shifting ability. Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs. However, their abilities are highly influenced by both input modality and prompting strategy. In addition, we find that through role-playing, VLLMs can simulate various functional deficits aligned with patients having impairments in cognitive flexibility, suggesting that VLLMs may possess a cognitive architecture, at least regarding the ability of set-shifting, similar to the brain. This study reveals the fact that VLLMs have already approached the human level on a key component underlying our higher cognition, and highlights the potential to use them to emulate complex brain processes.

[AI-41] The quest for the GRAph Level autoEncoder (GRALE)

【Quick Read】: This paper tackles the challenging problem of graph representation learning, which bears on key application fields such as chemistry and biology. The key to the solution is GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained with an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module trained jointly with the encoder and decoder; the attention-based architecture builds on Evoformer, the core component of AlphaFold, extended here to support both graph encoding and decoding.

Link: https://arxiv.org/abs/2505.22109
Authors: Paul Krzakala, Gabriel Melo, Charlotte Laclau, Florence d’Alché-Buc, Rémi Flamary
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although graph-based learning has attracted a lot of attention, graph representation learning is still a challenging task whose resolution may impact key application fields such as chemistry or biology. To this end, we introduce GRALE, a novel graph autoencoder that encodes and decodes graphs of varying sizes into a shared embedding space. GRALE is trained using an Optimal Transport-inspired loss that compares the original and reconstructed graphs and leverages a differentiable node matching module, which is trained jointly with the encoder and decoder. The proposed attention-based architecture relies on Evoformer, the core component of AlphaFold, which we extend to support both graph encoding and decoding. We show, in numerical experiments on simulated and molecular data, that GRALE enables a highly general form of pre-training, applicable to a wide range of downstream tasks, from classification and regression to more complex tasks such as graph interpolation, editing, matching, and prediction.

[AI-42] Inclusive Differentially Private Federated Learning for Clinical Data

【Quick Read】: This paper addresses the practical obstacles (privacy protection, resource constraints, and compliance) that hinder the use of federated learning (FL) for training clinical AI models. The key is a compliance-aware FL framework that strengthens Differential Privacy (DP) by adaptively adjusting noise according to quantifiable client compliance scores, together with a compliance scoring tool grounded in healthcare and security standards that promotes secure, inclusive, and equitable participation across diverse clinical settings.

Link: https://arxiv.org/abs/2505.22108
Authors: Santhosh Parampottupadam, Melih Coşğun, Sarthak Pati, Maximilian Zenk, Saikat Roy, Dimitrios Bounias, Benjamin Hamm, Sinem Sav, Ralf Floca, Klaus Maier-Hein
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Federated Learning (FL) offers a promising approach for training clinical AI models without centralizing sensitive patient data. However, its real-world adoption is hindered by challenges related to privacy, resource constraints, and compliance. Existing Differential Privacy (DP) approaches often apply uniform noise, which disproportionately degrades model performance, even among well-compliant institutions. In this work, we propose a novel compliance-aware FL framework that enhances DP by adaptively adjusting noise based on quantifiable client compliance scores. Additionally, we introduce a compliance scoring tool based on key healthcare and security standards to promote secure, inclusive, and equitable participation across diverse clinical settings. Extensive experiments on public datasets demonstrate that integrating under-resourced, less compliant clinics with highly regulated institutions yields accuracy improvements of up to 15% over traditional FL. This work advances FL by balancing privacy, compliance, and performance, making it a viable solution for real-world clinical workflows in global healthcare.
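The adaptive-noise idea can be pictured with a one-liner: less compliant clients receive a larger DP noise multiplier. The linear interpolation and the min/max scales below are illustrative assumptions, not the paper's calibration.

```python
def adaptive_noise_multiplier(base_sigma, compliance_score,
                              min_scale=0.5, max_scale=2.0):
    """Sketch: map a client's compliance score in [0, 1] to a noise
    multiplier.  Fully compliant clients (score = 1) get base_sigma *
    min_scale; fully non-compliant clients (score = 0) get base_sigma *
    max_scale.  Scale values are hypothetical."""
    scale = max_scale - compliance_score * (max_scale - min_scale)
    return base_sigma * scale
```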

[AI-43] AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion

【Quick Read】: This paper aims to fix the slow inference of diffusion models for audio generation while overcoming the weakness of rectified-flow methods at low step counts. The key is to combine a pre-trained diffusion model with the rectified diffusion method: the proposed AudioTurbo learns first-order ordinary differential equation (ODE) paths from deterministic noise-sample pairs generated by a pre-trained text-to-audio (TTA) model, improving generation efficiency. Experiments show that with only 10 sampling steps the model outperforms prior work, and inference can be reduced to 3 steps.

Link: https://arxiv.org/abs/2505.22106
Authors: Junqi Zhao, Jinzheng Zhao, Haohe Liu, Yun Chen, Lu Han, Xubo Liu, Mark Plumbley, Wenwu Wang
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address the limitations of rectified flow while leveraging the advantages of advanced pre-trained diffusion models, this study integrates pre-trained models with the rectified diffusion method to improve the efficiency of text-to-audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first-order ODE paths from deterministic noise sample pairs generated by a pre-trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model, with only 10 sampling steps, outperforms prior models and reduces inference to 3 steps compared to a flow-matching-based acceleration model.
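A minimal sketch of rectified-diffusion training on teacher-generated pairs: the pre-trained TTA model deterministically maps a noise draw z to a sample x0, and a student velocity field is regressed onto the straight-line target z - x0. The `student_v` interface and the pairing mechanics are assumptions for illustration.

```python
import torch

def rectified_pair_loss(student_v, teacher_samples, noise):
    """Fit a velocity field along the straight line x_t = (1-t)*x0 + t*z,
    whose ground-truth velocity is z - x0, using deterministic (noise,
    sample) pairs produced offline by the pre-trained teacher."""
    x0, z = teacher_samples, noise
    # One random time per example, broadcast over the remaining dims.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1.0 - t) * x0 + t * z
    target = z - x0
    return torch.mean((student_v(x_t, t) - target) ** 2)
```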

[AI-44] Efficient Dynamic Shielding for Parametric Safety Specifications

【Quick Read】: This paper addresses the latency incurred when traditional static shields must be recomputed from scratch because safety specifications change at runtime. The key is dynamic shields: they are designed statically for parametric safety specifications and adapt at runtime as the true safety specification is revealed. The core algorithmic novelty is a simple and fast dynamic adaptation procedure that exploits known features of standard safety shields, such as maximal permissiveness.

Link: https://arxiv.org/abs/2505.22104
Authors: Davide Corsi, Kaushik Mallik, Andoni Rodriguez, Cesar Sanchez
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Abstract: Shielding has emerged as a promising approach for ensuring the safety of AI-controlled autonomous systems. The algorithmic goal is to compute a shield, a runtime safety enforcement tool that monitors the AI controller's actions and intervenes if safety could otherwise be compromised. Traditional shields are designed statically for a specific safety requirement. Therefore, if the safety requirement changes at runtime due to changing operating conditions, the shield needs to be recomputed from scratch, causing delays that could be fatal. We introduce dynamic shields for parametric safety specifications, which are succinctly represented sets of all possible safety specifications that may be encountered at runtime. Our dynamic shields are statically designed for a given safety parameter set, and are able to dynamically adapt as the true safety specification (permissible by the parameters) is revealed at runtime. The main algorithmic novelty lies in the dynamic adaptation procedure, which is a simple and fast algorithm that utilizes known features of standard safety shields, like maximal permissiveness. We report experimental results for a robot navigation problem in unknown territories, where the safety specification evolves as new obstacles are discovered at runtime. In our experiments, the dynamic shields took a few minutes for their offline design, and between a fraction of a second and a few seconds for online adaptation at each step, whereas the brute-force online recomputation approach was up to 5 times slower.

[AI-45] From Coders to Critics: Empowering Students through Peer Assessment in the Age of AI Copilots

【Quick Read】: This paper examines the questions that the widespread adoption of AI coding assistants (such as ChatGPT) raises for assessment practices, academic integrity, and skill development in programming education, particularly since traditional grading is susceptible to AI-enabled plagiarism. The key is a rubric-based, anonymized peer-review process implemented in a large introductory programming course: students evaluate each other's final projects (2D games), their assessments are compared with instructor grades, and the results validate peer review's accuracy, its effect on student engagement, and its role in fostering evaluative thinking.

Link: https://arxiv.org/abs/2505.22093
Authors: Santiago Berrezueta-Guzman, Stephan Krusche, Stefan Wagner
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: This is the authors' preprint version of a paper accepted at the 11th International Symposium on Educational Technology, to be held in July 2025, in Bangkok, Thailand. The final published version will be available via the IEEE Xplore Library

Abstract: The rapid adoption of AI-powered coding assistants like ChatGPT and other coding copilots is transforming programming education, raising questions about assessment practices, academic integrity, and skill development. As educators seek alternatives to traditional grading methods susceptible to AI-enabled plagiarism, structured peer assessment could be a promising strategy. This paper presents an empirical study of a rubric-based, anonymized peer-review process implemented in a large introductory programming course. Students evaluated each other's final projects (2D games), and their assessments were compared to instructor grades using correlation, mean absolute error, and root mean square error (RMSE). Additionally, reflective surveys from 47 teams captured student perceptions of fairness, grading behavior, and preferences regarding grade aggregation. Results show that peer review can approximate instructor evaluation with moderate accuracy and foster student engagement, evaluative thinking, and interest in providing good feedback to peers. We discuss these findings for designing scalable, trustworthy peer assessment systems for the age of AI-assisted coding.

[AI-46] VIRAL: Vision-grounded Integration for Reward design And Learning

【Quick Read】: This paper addresses the alignment between humans and machines in AI, and in particular the risks that poorly designed reward functions pose to reinforcement learning. The key is VIRAL, a pipeline that uses multi-modal Large Language Models (LLMs) to generate and refine reward functions: given an environment and a goal prompt or annotated image, VIRAL autonomously creates and interactively improves reward functions, and the refinement can incorporate human feedback or be guided by policy descriptions produced by a video LLM, improving alignment with user intent.

Link: https://arxiv.org/abs/2505.22092
Authors: Valentin Cuzin-Rambaud, Emilien Komlenovic, Alexandre Faure, Bruno Yun
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract: The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advancements have shown that using Large Language Models (LLMs) for reward generation can outperform human performance in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source code and demo video are available at: this https URL and this https URL.

[AI-47] Cognitively-Inspired Emergent Communication via Knowledge Graphs for Assisting the Visually Impaired

【Quick Read】: This paper addresses the trade-off between latency and semantic richness in real-time navigation feedback for visually impaired users: natural-language systems give detailed guidance but are too slow for dynamic scenes, while low-latency emergent-communication frameworks lack semantic depth, limiting their use in tactile modalities such as vibration. The key is VAG-EC, a cognitively inspired emergent-communication framework grounded in knowledge graphs: objects and their relations are represented as a knowledge graph, and attention mechanisms prioritize task-relevant entities, yielding a compact, interpretable, and context-sensitive symbolic language for fast, adaptive real-time assistance.

Link: https://arxiv.org/abs/2505.22087
Authors: Ruxiao Chen, Dezheng Han, Wenjie Han, Shuaishuai Guo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Assistive systems for visually impaired individuals must deliver rapid, interpretable, and adaptive feedback to facilitate real-time navigation. Current approaches face a trade-off between latency and semantic richness: natural language-based systems provide detailed guidance but are too slow for dynamic scenarios, while emergent communication frameworks offer low-latency symbolic languages but lack semantic depth, limiting their utility in tactile modalities like vibration. To address these limitations, we introduce a novel framework, Cognitively-Inspired Emergent Communication via Knowledge Graphs (VAG-EC), which emulates human visual perception and cognitive mapping. Our method constructs knowledge graphs to represent objects and their relationships, incorporating attention mechanisms to prioritize task-relevant entities, thereby mirroring human selective attention. This structured approach enables the emergence of compact, interpretable, and context-sensitive symbolic languages. Extensive experiments across varying vocabulary sizes and message lengths demonstrate that VAG-EC outperforms traditional emergent communication methods in Topographic Similarity (TopSim) and Context Independence (CI). These findings underscore the potential of cognitively grounded emergent communication as a fast, adaptive, and human-aligned solution for real-time assistive technologies. Code is available at this https URL.
zh

[AI-48] iDSE: Navigating Design Space Exploration in High-Level Synthesis Using LLMs

【Quick Read】: This paper aims to improve the efficiency and effectiveness of design space exploration (DSE) in High-Level Synthesis (HLS), where the combinatorial explosion of directive configurations yields an intractable design space and traditional DSE methods suffer from prohibitive exploration cost and suboptimal results. The key to the solution is the iDSE framework, the first to apply large language models (LLMs) to DSE: it exploits the LLM's perception of HLS design quality to intelligently prune the design space and calibrate representative initial sampling designs, accelerating convergence toward the Pareto front. In addition, iDSE mines the convergent and divergent thinking patterns of LLMs for hardware optimization, achieving multi-path refinement of design quality and diversity.

Link: https://arxiv.org/abs/2505.22086
Authors: Runkai Li, Jia Xiong, Xi Wang
Affiliation: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:High-Level Synthesis (HLS) serves as an agile hardware development tool that streamlines the circuit design by abstracting the register transfer level into behavioral descriptions, while allowing designers to customize the generated microarchitectures through optimization directives. However, the combinatorial explosion of possible directive configurations yields an intractable design space. Traditional design space exploration (DSE) methods, despite adopting heuristics or constructing predictive models to accelerate Pareto-optimal design acquisition, still suffer from prohibitive exploration costs and suboptimal results. Addressing these concerns, we introduce iDSE, the first LLM-aided DSE framework that leverages HLS design quality perception to effectively navigate the design space. iDSE intelligently prunes the design space to guide LLMs in calibrating representative initial sampling designs, expediting convergence toward the Pareto front. By exploiting the convergent and divergent thinking patterns inherent in LLMs for hardware optimization, iDSE achieves multi-path refinement of the design quality and diversity. Extensive experiments demonstrate that iDSE outperforms heuristic-based DSE methods by 5.1x to 16.6x in proximity to the reference Pareto front, matching NSGA-II with only 4.6% of the explored designs. Our work demonstrates the transformative potential of LLMs in scalable and efficient HLS design optimization, offering new insights into multi-objective optimization challenges.
zh

[AI-49] The Resurrection of the ReLU

【Quick Read】: This paper addresses the dying ReLU problem of the traditional ReLU activation in deep models: ReLU units can become irreversibly inactive during training, limiting overall performance. The key to the solution is SUGAR (Surrogate Gradient Learning for ReLU), which keeps the standard ReLU in the forward pass but replaces its derivative with a smooth surrogate gradient in the backward pass, avoiding vanishing gradients, effectively resurrecting "dead" ReLU units, and improving generalization.

Link: https://arxiv.org/abs/2505.22074
Authors: Coşku Can Horuz, Geoffrey Kasenbacher, Saya Higuchi, Sebastian Kairat, Jendrik Stoltz, Moritz Pesl, Bernhard A. Moser, Christoph Linse, Thomas Martinetz, Sebastian Otte
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Modeling sophisticated activation functions within deep learning architectures has evolved into a distinct research direction. Functions such as GELU, SELU, and SiLU offer smooth gradients and improved convergence properties, making them popular choices in state-of-the-art models. Despite this trend, the classical ReLU remains appealing due to its simplicity, inherent sparsity, and other advantageous topological characteristics. However, ReLU units are prone to becoming irreversibly inactive - a phenomenon known as the dying ReLU problem - which limits their overall effectiveness. In this work, we introduce surrogate gradient learning for ReLU (SUGAR) as a novel, plug-and-play regularizer for deep architectures. SUGAR preserves the standard ReLU function during the forward pass but replaces its derivative in the backward pass with a smooth surrogate that avoids zeroing out gradients. We demonstrate that SUGAR, when paired with a well-chosen surrogate function, substantially enhances generalization performance over convolutional network architectures such as VGG-16 and ResNet-18, providing sparser activations while effectively resurrecting dead ReLUs. Moreover, we show that even in modern architectures like Conv2NeXt and Swin Transformer - which typically employ GELU - substituting these with SUGAR yields competitive and even slightly superior performance. These findings challenge the prevailing notion that advanced activation functions are necessary for optimal performance. Instead, they suggest that the conventional ReLU, particularly with appropriate gradient handling, can serve as a strong, versatile revived classic across a broad range of deep learning vision models.
zh
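
A minimal PyTorch sketch of the core mechanism follows: the forward pass is the exact ReLU, while the backward pass substitutes a smooth surrogate derivative. The specific surrogate below (a "fast sigmoid" style derivative common in spiking-network practice) is an assumption, since the abstract does not specify the paper's exact choice.

```python
import torch

class SugarReLU(torch.autograd.Function):
    """Forward: exact ReLU. Backward: smooth surrogate derivative."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.relu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        surrogate = 1.0 / (1.0 + x.abs()) ** 2  # assumed surrogate, never exactly zero
        return grad_output * surrogate

x = torch.randn(8, requires_grad=True)
SugarReLU.apply(x).sum().backward()
print(x.grad)  # nonzero even where x < 0, so "dead" units can keep learning
```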

[AI-50] Reinforced Reasoning for Embodied Planning

【Quick Read】: This paper targets coherent multi-step decision making in embodied planning, especially task execution under dynamic visual observations and natural-language goals. Existing vision-language models (VLMs) excel at static perception tasks but lack the temporal reasoning, spatial understanding, and commonsense grounding needed in interactive environments. The key to the proposed solution is a reinforcement fine-tuning framework: high-quality data is distilled from a closed-source model for supervised fine-tuning (SFT) to endow the model with structured decision-making priors, after which a rule-based reward function is designed and the policy is optimized with Generalized Reinforced Preference Optimization (GRPO), improving planning in complex environments.

Link: https://arxiv.org/abs/2505.22050
Authors: Di Wu, Jiaxin Fan, Junzhe Zang, Guanbo Wang, Wei Yin, Wenhao Li, Bo Jin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle with the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. In this work, we introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning. We first distill a high-quality dataset from a powerful closed-source model and perform supervised fine-tuning (SFT) to equip the model with structured decision-making priors. We then design a rule-based reward function tailored to multi-step action quality and optimize the policy via Generalized Reinforced Preference Optimization (GRPO). Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments. This work highlights the potential of reinforcement-driven reasoning to advance long-horizon planning in embodied AI.
zh

[AI-51] Estimating the Effects of Sample Training Orders for Large Language Models without Retraining

【Quick Read】: This paper studies how the order of training samples affects the performance and internal learning dynamics of large language models (LLMs); traditional approaches require retraining the model many times with different sample orders and are therefore infeasible for LLMs. The key to the solution is a retraining-free framework that approximates Adam optimizer updates with first- and second-order Taylor expansions and uses random projection to store intermediate checkpoints, enabling efficient estimation of model parameters under arbitrary training sample orders.

Link: https://arxiv.org/abs/2505.22042
Authors: Hao Yang, Haoxuan Li, Mengyue Yang, Xu Chen, Mingming Gong
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The order of training samples plays a crucial role in large language models (LLMs), significantly impacting both their external performance and internal learning dynamics. Traditional methods for investigating this effect generally require retraining the model with various sample orders, which is computationally infeasible for LLMs. In this work, we improve traditional methods by designing a retraining-free framework. By approximating Adam optimizer updates with first- and second-order Taylor expansions and utilizing random projection methods to store intermediate checkpoints, our framework can efficiently estimate model parameters for arbitrary training sample orders. Next, we apply our framework to two downstream research problems: (1) Training curriculum design for LLMs – we base our retraining-free framework to propose a novel curriculum learning strategy that augments curriculum proposals with estimated model performances, enabling more informed sample scheduling. (2) LLMs’ memorization and generalization effect analysis – we use our retraining-free framework to estimate how the positions of training samples influence LLMs’ capacity for memorization and generalization. We conduct extensive experiments to validate the effectiveness of our retraining-free framework in reproducing the true model performances, and further demonstrate its potential in optimizing LLM training curricula and analyzing the memorization and generalization effects of LLMs.
zh

[AI-52] Improving Respiratory Sound Classification with Architecture-Agnostic Knowledge Distillation from Ensembles INTERSPEECH2025

【Quick Read】: This paper addresses the limited size and quality of respiratory sound datasets, which make high performance hard to achieve. The key to the solution is soft-label training: knowledge distillation transfers knowledge from multiple teacher models into a student model effectively, with extra compute only at training time while inference stays efficient. Experiments show that even a single teacher with the same architecture as the student considerably improves performance, and optimal gains are achieved with only a few teachers.

Link: https://arxiv.org/abs/2505.22027
Authors: Miika Toikkanen, June-Woo Kim
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

Click to view abstract

Abstract:Respiratory sound datasets are limited in size and quality, making high performance difficult to achieve. Ensemble models help but inevitably increase compute cost at inference time. Soft label training distills knowledge efficiently with extra cost only at training. In this study, we explore soft labels for respiratory sound classification as an architecture-agnostic approach to distill an ensemble of teacher models into a student model. We examine different variations of our approach and find that even a single teacher, identical to the student, considerably improves performance beyond its own capability, with optimal gains achieved using only a few teachers. We achieve the new state-of-the-art Score of 64.39 on ICBHI, surpassing the previous best by 0.85 and improving average Scores across architectures by more than 1.16. Our results highlight the effectiveness of knowledge distillation with soft labels for respiratory sound classification, regardless of size or architecture.
zh
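
The distillation objective itself is standard; a sketch of soft-label training against an averaged teacher ensemble might look as follows (the temperature and mixing weight are illustrative choices, not the paper's reported settings).

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
    # Average the teachers' logits into a single soft target distribution
    teacher_logits = torch.stack(teacher_logits_list).mean(dim=0)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable (Hinton et al., 2015)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 4)                      # 4 respiratory-sound classes
teachers = [torch.randn(4, 4) for _ in range(3)]
print(soft_label_loss(student, teachers, torch.tensor([0, 1, 2, 3])))
```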

[AI-53] Functional Matching of Logic Subgraphs: Beyond Structural Isomorphism

【Quick Read】: This paper targets subgraph matching in logic circuits: traditional methods rely on structural graph isomorphism and cannot identify function-related subgraphs when synthesis transformations substantially alter circuit topology. The key to the solution is the concept of functional subgraph matching, which detects whether a given logic function is implicitly present in a larger circuit by learning robust functional embeddings across AIGs and post-mapping netlists and identifying fuzzy boundaries with a graph segmentation approach, without relying on structural consistency.

Link: https://arxiv.org/abs/2505.21988
Authors: Ziyang Zheng, Kezhi Li, Zhengyuan Shi, Qiang Xu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Subgraph matching in logic circuits is foundational for numerous Electronic Design Automation (EDA) applications, including datapath optimization, arithmetic verification, and hardware trojan detection. However, existing techniques rely primarily on structural graph isomorphism and thus fail to identify function-related subgraphs when synthesis transformations substantially alter circuit topology. To overcome this critical limitation, we introduce the concept of functional subgraph matching, a novel approach that identifies whether a given logic function is implicitly present within a larger circuit, irrespective of structural variations induced by synthesis or technology mapping. Specifically, we propose a two-stage multi-modal framework: (1) learning robust functional embeddings across AIG and post-mapping netlists for functional subgraph detection, and (2) identifying fuzzy boundaries using a graph segmentation approach. Evaluations on standard benchmarks (ITC99, OpenABCD, ForgeEDA) demonstrate significant performance improvements over existing structural methods, with average 93.8% accuracy in functional subgraph detection and a dice score of 91.3% in fuzzy boundary identification.
zh

[AI-54] Reward-Independent Messaging for Decentralized Multi-Agent Reinforcement Learning

【Quick Read】: This paper addresses effective inter-agent communication in multi-agent reinforcement learning (MARL) under partial observability. Conventional methods treat messages as part of the action space and assume cooperation; the proposed MARL-CPC framework instead introduces a message learning model based on collective predictive coding (CPC) that links messages to state inference, supporting communication in non-cooperative, reward-independent settings. The key insight is that messages designed to aid state inference, rather than to directly optimize the sender's reward, still enable effective coordination and communication in non-cooperative scenarios.

Link: https://arxiv.org/abs/2505.21985
Authors: Naoto Yoshida, Tadahiro Taniguchi
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In multi-agent reinforcement learning (MARL), effective communication improves agent performance, particularly under partial observability. We propose MARL-CPC, a framework that enables communication among fully decentralized, independent agents without parameter sharing. MARL-CPC incorporates a message learning model based on collective predictive coding (CPC) from emergent communication research. Unlike conventional methods that treat messages as part of the action space and assume cooperation, MARL-CPC links messages to state inference, supporting communication in non-cooperative, reward-independent settings. We introduce two algorithms -Bandit-CPC and IPPO-CPC- and evaluate them in non-cooperative MARL tasks. Benchmarks show that both outperform standard message-as-action approaches, establishing effective communication even when messages offer no direct benefit to the sender. These results highlight MARL-CPC’s potential for enabling coordination in complex, decentralized environments.
zh

[AI-55] Judging LLMs on a Simplex

【Quick Read】: This paper concerns identifiability when large language models (LLMs) are used as judges for automated evaluation of free-form outputs, in particular the identifiability of rankings under different scoring scales. The key to the solution is a geometric framework that represents judges and candidates as points on a probability simplex, combined with Bayesian inference to integrate epistemic and aleatoric uncertainty, quantifying the uncertainty of ranking estimates more accurately. The approach yields more accurate rankings and better coverage rates across multiple benchmarks.

Link: https://arxiv.org/abs/2505.21972
Authors: Patrick Vossler, Fan Xia, Yifan Mai, Jean Feng
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 28 pages, 7 figures

Click to view abstract

Abstract:Automated evaluation of free-form outputs from large language models (LLMs) is challenging because many distinct answers can be equally valid. A common practice is to use LLMs themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable using LLM judges. Our theoretical analysis uncovers a “phase transition” in ranking identifiability: for binary scoring systems, true rankings are identifiable even with weak judges under mild assumptions, while rankings become non-identifiable for three or more scoring levels even with infinite data, absent additional prior knowledge. This non-identifiability highlights how uncertainty in rankings stems from not only aleatoric uncertainty (i.e., inherent stochasticity in the data) but also epistemic uncertainty regarding which assumptions hold, an aspect that has received limited attention until now. To integrate both types of uncertainty, we use Bayesian inference to encode assumptions as priors and conduct sensitivity analysis of ranking estimates and credible intervals. Empirical evaluations across multiple benchmarks demonstrate that Bayesian inference yields more accurate rankings and substantially improves coverage rates. These results underscore the importance of taking a more holistic approach to uncertainty quantification when using LLMs as judges.
zh
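
For the binary-scoring regime, where the paper shows rankings stay identifiable, a toy Bayesian ranking with Beta posteriors illustrates how rank probabilities fall out of posterior sampling; the counts and flat prior below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
verdicts = {"model_a": (78, 100), "model_b": (70, 100), "model_c": (64, 100)}

# Posterior win-rate samples per candidate under a Beta(1, 1) prior
draws = np.column_stack([
    rng.beta(1 + wins, 1 + n - wins, size=10_000) for wins, n in verdicts.values()
])
order = np.argsort(-draws, axis=1)        # best-first ordering per posterior draw
ranks = np.argsort(order, axis=1) + 1     # rank of each candidate (1 = best)

for i, name in enumerate(verdicts):
    print(name, "P(rank 1) =", (ranks[:, i] == 1).mean())
```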

[AI-56] DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation

【Quick Read】: This paper aims at adaptive navigation of household service robots in unfamiliar environments, which is challenging because it requires both low-level path planning and high-level scene understanding. Existing zero-shot approaches based on vision-language models (VLMs) reduce dependence on prior maps and scene-specific training data, but suffer from spatiotemporal discontinuity, unstructured memory representations, and insufficient task understanding that cause navigation failures. The key to the proposed DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation) is a hierarchical architecture of Dorsal and Ventral Streams: the Dorsal Stream handles spatiotemporal discontinuities via Hierarchical Semantic-Spatial Fusion and a Topology Map, the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision making, and Nav-Ensurance guarantees navigation safety and efficiency.

Link: https://arxiv.org/abs/2505.21969
Authors: Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures. We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. Our approach also develops Nav-Ensurance to ensure navigation safety and efficiency. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics, significantly outperforming existing methods. We also introduce a new evaluation metric (AORI) to assess navigation intelligence better. Comprehensive experiments demonstrate DORAEMON’s effectiveness in zero-shot autonomous navigation without requiring prior map building or pre-training.
zh

[AI-57] Practical Adversarial Attacks on Stochastic Bandits via Fake Data Injection

【Quick Read】: This paper concerns the real-world feasibility of adversarial attacks on stochastic multi-armed bandits: conventional approaches rely on unrealistic assumptions such as per-round reward manipulation and unbounded perturbations, limiting their relevance to real systems. The key to the solution is a more realistic threat model, Fake Data Injection, in which the attacker can inject only a limited number of bounded fake feedback samples into the learner's history, simulating legitimate interactions. Under this model, efficient attack strategies are designed that explicitly handle magnitude constraints on reward values and temporal constraints on when and how often data can be injected; the theoretical analysis shows such attacks can mislead both UCB and Thompson Sampling into selecting a target arm in nearly all rounds while incurring only sublinear attack cost.

Link: https://arxiv.org/abs/2505.21938
Authors: Qirun Zeng, Eric He, Richard Hoffmann, Xuchuang Wang, Jinhang Zuo
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Adversarial attacks on stochastic bandits have traditionally relied on some unrealistic assumptions, such as per-round reward manipulation and unbounded perturbations, limiting their relevance to real-world systems. We propose a more practical threat model, Fake Data Injection, which reflects realistic adversarial constraints: the attacker can inject only a limited number of bounded fake feedback samples into the learner’s history, simulating legitimate interactions. We design efficient attack strategies under this model, explicitly addressing both magnitude constraints (on reward values) and temporal constraints (on when and how often data can be injected). Our theoretical analysis shows that these attacks can mislead both Upper Confidence Bound (UCB) and Thompson Sampling algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost. Experiments on synthetic and real-world datasets validate the effectiveness of our strategies, revealing significant vulnerabilities in widely used stochastic bandit algorithms under practical adversarial scenarios.
zh
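
A one-shot, deliberately simplified version of the threat model against UCB is easy to simulate: injecting a few bounded zero-reward fake samples into the non-target arms delays their selection. The paper's actual strategies are adaptive and cost-aware; every constant below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, target = 3, 2000, 2
true_means = np.array([0.7, 0.6, 0.5])   # the target arm is suboptimal
counts, sums = np.zeros(K), np.zeros(K)

for arm in range(K):                      # attacker: 30 bounded fakes at reward 0
    if arm != target:
        counts[arm] += 30

for t in range(1, T + 1):
    ucb = np.where(
        counts > 0,
        sums / np.maximum(counts, 1) + np.sqrt(2 * np.log(t) / np.maximum(counts, 1)),
        np.inf,                           # pull untried arms first
    )
    a = int(np.argmax(ucb))
    counts[a] += 1
    sums[a] += rng.binomial(1, true_means[a])  # honest rewards, bounded in [0, 1]

print("pull fractions:", np.round(counts / counts.sum(), 3))
```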

[AI-58] From Reasoning to Learning: A Survey on Hypothesis Discovery and Rule Learning with Large Language Models

【Quick Read】: This paper asks whether large language models (LLMs), despite their marked progress in instruction following and deductive reasoning, can genuinely discover new knowledge, which remains an open question. The key to the proposed approach is a structured survey of LLM-based hypothesis generation and validation grounded in Peirce's framework of abduction, deduction, and induction, aiming to move LLMs from "information executors" toward engines of genuine innovation that generate knowledge in scientific research and real-world problem solving.

Link: https://arxiv.org/abs/2505.21935
Authors: Kaiyu He, Zhiyu Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Since the advent of Large Language Models (LLMs), efforts have largely focused on improving their instruction-following and deductive reasoning abilities, leaving open the question of whether these models can truly discover new knowledge. In pursuit of artificial general intelligence (AGI), there is a growing need for models that not only execute commands or retrieve information but also learn, reason, and generate new knowledge by formulating novel hypotheses and theories that deepen our understanding of the world. Guided by Peirce's framework of abduction, deduction, and induction, this survey offers a structured lens to examine LLM-based hypothesis discovery. We synthesize existing work in hypothesis generation, application, and validation, identifying both key achievements and critical gaps. By unifying these threads, we illuminate how LLMs might evolve from mere "information executors" into engines of genuine innovation, potentially transforming research, science, and real-world problem solving.
zh

[AI-59] FALCON: An ML Framework for Fully Automated Layout-Constrained Analog Circuit Design

【Quick Read】: This paper addresses automated analog circuit design from performance specifications, a complex multi-stage process spanning topology selection, parameter inference, and layout feasibility. The key to the solution is FALCON, a unified machine learning framework for fully automated, specification-driven analog circuit synthesis through topology selection and layout-constrained optimization. FALCON first selects an appropriate circuit topology with a performance-driven classifier informed by human design heuristics, then performs gradient-based parameter inference with a custom edge-centric graph neural network, guided by a differentiable layout cost for efficient layout-aware design.

Link: https://arxiv.org/abs/2505.21923
Authors: Asal Mehradfar, Xuzhe Zhao, Yilun Huang, Emir Ceyani, Yankai Yang, Shihao Han, Hamidreza Aghasi, Salman Avestimehr
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Click to view abstract

Abstract:Designing analog circuits from performance specifications is a complex, multi-stage process encompassing topology selection, parameter inference, and layout feasibility. We introduce FALCON, a unified machine learning framework that enables fully automated, specification-driven analog circuit synthesis through topology selection and layout-constrained optimization. Given a target performance, FALCON first selects an appropriate circuit topology using a performance-driven classifier guided by human design heuristics. Next, it employs a custom, edge-centric graph neural network trained to map circuit topology and parameters to performance, enabling gradient-based parameter inference through the learned forward model. This inference is guided by a differentiable layout cost, derived from analytical equations capturing parasitic and frequency-dependent effects, and constrained by design rules. We train and evaluate FALCON on a large-scale custom dataset of 1M analog mm-wave circuits, generated and simulated using Cadence Spectre across 20 expert-designed topologies. Through this evaluation, FALCON demonstrates 99% accuracy in topology inference, 10% relative error in performance prediction, and efficient layout-aware design that completes in under 1 second per instance. Together, these results position FALCON as a practical and extensible foundation model for end-to-end analog circuit design automation.
zh

[AI-60] Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

【Quick Read】: This paper targets inefficient Key-Value Cache (KVC) management during large language model (LLM) inference, made pressing by the adoption of extended context windows. The key to the solution is an efficient distributed caching system that optimizes KVC metadata management to improve cache reuse, reducing redundancy and speeding up inference. The study analyzes real-world KVC access patterns, evaluates the shortcomings of existing storage systems for KVC prefilling, and underscores the need for storage solutions tailored to LLM workloads.

Link: https://arxiv.org/abs/2505.21919
Authors: Yue Zhu, Hao Yu, Chen Wang, Zhuoran Liu, Eun Kyung Lee
Affiliation: Unknown
Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted at IEEE Cloud 2025 as a WIP paper; the final version will appear in IEEE Xplore

Click to view abstract

Abstract:The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads like Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.
zh
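
The reuse pattern the paper measures boils down to longest-prefix matching over token sequences. A toy in-memory version is sketched below; production systems hash fixed-size token blocks and store GPU tensors rather than Python objects.

```python
from typing import Dict, List, Tuple

class PrefixKVCache:
    """Toy longest-prefix lookup for KV reuse (illustrative, not a real design)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], object] = {}

    def put(self, tokens: List[int], kv_state: object) -> None:
        self._store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens: List[int]):
        for end in range(len(tokens), 0, -1):   # try the longest prefix first
            key = tuple(tokens[:end])
            if key in self._store:
                return key, self._store[key]
        return None, None

cache = PrefixKVCache()
cache.put([1, 2, 3], "kv(system prompt)")
hit, kv = cache.longest_prefix([1, 2, 3, 4, 5])
print(hit, kv)  # only tokens 4 and 5 still need prefilling
```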

[AI-61] Self-supervised Learning Method Using Transformer for Multi-dimensional Sensor Data Processing

【Quick Read】: This paper aims to improve model performance on human activity recognition (HAR). The key to the solution is an enhanced n-dimensional numerical-processing Transformer with three core features: embedding n-dimensional numerical data through a linear layer, binning-based preprocessing, and a linear transformation in the output layer, delivering 10%-15% accuracy improvements over a vanilla Transformer across five datasets.

Link: https://arxiv.org/abs/2505.21918
Authors: Haruki Kai, Tsuyoshi Okita
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 4 figures

Click to view abstract

Abstract:We developed a deep learning algorithm for human activity recognition using sensor signals as input. In this study, we built a pretrained language model based on the Transformer architecture, which is widely used in natural language processing. By leveraging this pretrained model, we aimed to improve performance on the downstream task of human activity recognition. While this task can be addressed using a vanilla Transformer, we propose an enhanced n-dimensional numerical processing Transformer that incorporates three key features: embedding n-dimensional numerical data through a linear layer, binning-based pre-processing, and a linear transformation in the output layer. We evaluated the effectiveness of our proposed model across five different datasets. Compared to the vanilla Transformer, our model demonstrated 10%-15% improvements in accuracy.
zh
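
The two input-side features are easy to sketch in PyTorch: each sensor value is bucketized into a bin, and the raw value concatenated with its bin one-hot is projected by a linear layer. The bin range, bin count, and model width below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BinnedNumericEmbedding(nn.Module):
    """Bucketize each sensor value, then linearly project (raw value, bin one-hot)."""

    def __init__(self, n_channels: int, n_bins: int = 32, d_model: int = 128,
                 lo: float = -3.0, hi: float = 3.0):
        super().__init__()
        self.register_buffer("edges", torch.linspace(lo, hi, n_bins - 1))
        self.n_bins = n_bins
        self.proj = nn.Linear(n_channels * (n_bins + 1), d_model)

    def forward(self, x):                       # x: (batch, time, channels)
        bins = torch.bucketize(x, self.edges)   # integer bin index per value
        onehot = nn.functional.one_hot(bins, self.n_bins).float()
        feats = torch.cat([x.unsqueeze(-1), onehot], dim=-1)  # value + bin code
        return self.proj(feats.flatten(-2))     # (batch, time, d_model) tokens

emb = BinnedNumericEmbedding(n_channels=6)
tokens = emb(torch.randn(4, 100, 6))            # ready for a vanilla Transformer
print(tokens.shape)                             # torch.Size([4, 100, 128])
```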

[AI-62] Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

【Quick Read】: This paper targets the automation of Diagnosis-Related Group (DRG) coding, an out-of-distribution (OOD) task involving complex clinical and billing data that conventional large language models (LLMs) struggle to handle. The key to the solution is DRG-Sapphire, which applies large-scale reinforcement learning (RL) on top of Qwen2.5-7B using Group Relative Policy Optimization (GRPO) with rule-based rewards to improve the accuracy and explainability of DRG coding. The study further finds that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, indicating that for OOD tasks the domain knowledge encoded in the base model fundamentally constrains RL effectiveness.

Link: https://arxiv.org/abs/2505.21908
Authors: Hanyin Wang, Zhenbang Wu, Gururaj Kolar, Hariprasad Korsapati, Brian Bartlett, Bryan Hull, Jimeng Sun
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.
zh
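
The GRPO ingredient is compact: instead of training a critic, each prompt's sampled completions are scored with the rule-based reward and standardized within their own group to form advantages. A sketch:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_prompts, samples_per_prompt) rule-based scores.
    GRPO replaces a learned value baseline with the group mean/std."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 clinical notes x 4 sampled DRG codes, reward 1 if the code is correct
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```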

[AI-63] Compressing Sine-Activated Low-Rank Adapters through Post-Training Quantization

【Quick Read】: This paper asks whether, under post-training quantization for model compression, the limited representational capacity caused by the low-rank constraint of LoRA adapters can be overcome while preserving performance. The key to the solution is to apply a fixed-frequency sinusoidal transformation to quantized LoRA adapters: the transformation raises the adapter's stable rank, and hence its expressivity, without introducing additional parameters, so highly compressed adapters retain performance even after quantization.

Link: https://arxiv.org/abs/2505.21895
Authors: Cameron Gordon, Yiping Ji, Hemanth Saratchandran, Paul Albert, Simon Lucey
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures

Click to view abstract

Abstract:Low-Rank Adaptation (LoRA) has become a standard approach for parameter-efficient fine-tuning, offering substantial reductions in trainable parameters by modeling updates as the product of two low-rank matrices. While effective, the low-rank constraint inherently limits representational capacity, often resulting in reduced performance compared to full-rank fine-tuning. Recent work by Ji et al. (2025) has addressed this limitation by applying a fixed-frequency sinusoidal transformation to low-rank adapters, increasing their stable rank without introducing additional parameters. This raises a crucial question: can the same sine-activated technique be successfully applied within the context of Post-Training Quantization to retain benefits even after model compression? In this paper, we investigate this question by extending the sinusoidal transformation framework to quantized LoRA adapters. We develop a theoretical analysis showing that the stable rank of a quantized adapter is tightly linked to that of its full-precision counterpart, motivating the use of such rank-enhancing functions even under quantization. Our results demonstrate that the expressivity gains from a sinusoidal non-linearity persist after quantization, yielding highly compressed adapters with negligible loss in performance. We validate our approach across a range of fine-tuning tasks for language, vision and text-to-image generation achieving significant memory savings while maintaining competitive accuracy.
zh
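
A hedged sketch of what a sine-activated LoRA layer can look like follows; the frequency, scaling, and initialization are assumptions rather than the exact configuration of Ji et al. (2025), and post-training quantization would subsequently round-trip the adapter weights through a low-bit format.

```python
import torch
import torch.nn as nn

class SineLoRALinear(nn.Module):
    """LoRA whose update is sin(omega * B @ A): the fixed-frequency sine lifts the
    stable rank of the low-rank product without adding any parameters."""

    def __init__(self, base: nn.Linear, r: int = 8, omega: float = 100.0, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no-op start
        self.omega, self.scale = omega, alpha / r           # assumed values

    def forward(self, x):
        delta = torch.sin(self.omega * (self.B @ self.A))   # higher stable rank than B @ A
        return self.base(x) + self.scale * nn.functional.linear(x, delta)

layer = SineLoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```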

[AI-64] SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training

【Quick Read】: This paper addresses two key problems in preference learning for diffusion models: timestep-dependent instability and off-policy bias. The instability stems from the mismatch between the reverse and forward diffusion processes and from high gradient variance at early noisy timesteps, while the off-policy bias arises from the mismatch between the optimization policy and the data-collection policy. The key to the proposed SDPO (Importance-Sampled Direct Preference Optimization) is to incorporate importance sampling into the objective, fully correcting the off-policy bias and emphasizing informative updates during the diffusion process, thereby improving stability and alignment with human preferences.

Link: https://arxiv.org/abs/2505.21893
Authors: Xiaomeng Yang, Zhiyu Tan, Junyan Wang, Zhijian Zhou, Hao Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Preference learning has become a central technique for aligning generative models with human expectations. Recently, it has been extended to diffusion models through methods like Direct Preference Optimization (DPO). However, existing approaches such as Diffusion-DPO suffer from two key challenges: timestep-dependent instability, caused by a mismatch between the reverse and forward diffusion processes and by high gradient variance in early noisy timesteps, and off-policy bias arising from the mismatch between optimization and data collection policies. We begin by analyzing the reverse diffusion trajectory and observe that instability primarily occurs at early timesteps with low importance weights. To address these issues, we first propose DPO-C&M, a practical strategy that improves stability by clipping and masking uninformative timesteps while partially mitigating off-policy bias. Building on this, we introduce SDPO (Importance-Sampled Direct Preference Optimization), a principled framework that incorporates importance sampling into the objective to fully correct for off-policy bias and emphasize informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B demonstrate that both methods outperform standard Diffusion-DPO, with SDPO achieving superior VBench scores, human preference alignment, and training robustness. These results highlight the importance of timestep-aware, distribution-corrected optimization in diffusion-based preference learning.
zh

[AI-65] SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem

【Quick Read】: This paper targets robust vehicle routing (VRP) under uncertainty: traditional benchmarks assume static, idealized settings, whereas real logistics involve complex dynamics such as time-dependent congestion, log-normal delays, probabilistic accidents, and empirically grounded time windows. The key to SVRPBench, the first open benchmark to capture high-fidelity stochastic dynamics at urban scale, is its diverse, constraint-rich scenarios, including multi-depot and multi-vehicle setups, which simulate realistic delivery conditions and challenge the generalization and adaptability of existing algorithms.

Link: https://arxiv.org/abs/2505.21887
Authors: Ahmed Heakl, Yahia Salaheldin Shaaban, Martin Takac, Salem Lahlou, Zangir Iklassov
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 18 pages, 14 figures, 11 tables

Click to view abstract

Abstract:Robust routing under uncertainty is central to real-world logistics, yet most benchmarks assume static, idealized settings. We present SVRPBench, the first open benchmark to capture high-fidelity stochastic dynamics in vehicle routing at urban scale. Spanning more than 500 instances with up to 1000 customers, it simulates realistic delivery conditions: time-dependent congestion, log-normal delays, probabilistic accidents, and empirically grounded time windows for residential and commercial clients. Our pipeline generates diverse, constraint-rich scenarios, including multi-depot and multi-vehicle setups. Benchmarking reveals that state-of-the-art RL solvers like POMO and AM degrade by over 20% under distributional shift, while classical and metaheuristic methods remain robust. To enable reproducible research, we release the dataset and evaluation suite. SVRPBench challenges the community to design solvers that generalize beyond synthetic assumptions and adapt to real-world uncertainty.
zh
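
The stochastic ingredients the benchmark simulates can be conveyed with a small travel-time sampler; all parameter values below are illustrative, not the benchmark's empirically calibrated ones.

```python
import numpy as np

rng = np.random.default_rng(7)

def travel_time(base_minutes: float, hour: int) -> float:
    """Base time scaled by time-of-day congestion, a log-normal delay factor,
    and a small accident probability (all constants illustrative)."""
    congestion = 1.5 if hour in (7, 8, 9, 17, 18, 19) else 1.0  # rush hours
    delay = rng.lognormal(mean=0.0, sigma=0.5)                  # multiplicative delay
    accident = rng.random() < 0.01                              # 1% incident chance
    return base_minutes * congestion * delay + (30.0 if accident else 0.0)

print([round(travel_time(20.0, hour=8), 1) for _ in range(5)])
```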

[AI-66] Symbolic Foundation Regressor on Complex Networks

【Quick Read】: This paper addresses extracting interpretable physical models from complex scientific data, where traditional methods are inefficient and poorly interpretable when many interacting variables are involved. The key to the solution is a pre-trained symbolic foundation regressor that effectively compresses complex data and produces interpretable physical expressions; across non-network symbolic regression, symbolic regression on complex networks, and network dynamics inference, it improves equation-inference efficiency threefold over baselines while maintaining accurate predictions.

Link: https://arxiv.org/abs/2505.21879
Authors: Weiting Liu, Jiaxu Cui, Jiao Hu, En Wang, Bo Yang
Affiliation: Unknown
Subjects: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 60 pages

Click to view abstract

Abstract:In science, we are interested not only in forecasting but also in understanding how predictions are made, specifically what the interpretable underlying model looks like. Data-driven machine learning technology can significantly streamline the complex and time-consuming traditional manual process of discovering scientific laws, helping us gain insights into fundamental issues in modern science. In this work, we introduce a pre-trained symbolic foundation regressor that can effectively compress complex data with numerous interacting variables while producing interpretable physical representations. Our model has been rigorously tested on non-network symbolic regression, symbolic regression on complex networks, and the inference of network dynamics across various domains, including physics, biochemistry, ecology, and epidemiology. The results indicate a remarkable improvement in equation inference efficiency, being three times more effective than baseline approaches while maintaining accurate predictions. Furthermore, we apply our model to uncover more intuitive laws of interaction transmission from global epidemic outbreak data, achieving optimal data fitting. This model extends the application boundary of pre-trained symbolic regression models to complex networks, and we believe it provides a foundational solution for revealing the hidden mechanisms behind changes in complex phenomena, enhancing interpretability, and inspiring further scientific discoveries.
zh

[AI-67] Extracting Research Instruments from Educational Literature Using LLMs

【Quick Read】: This paper addresses the efficient extraction of information about research instruments from the education literature, including instrument names, types, target respondents, measured constructs, and outcomes. The key to the solution is a system built on large language models (LLMs) that uses multi-step prompting and a domain-specific data schema to produce structured outputs, improving the accuracy and level of detail of the extracted information.

Link: https://arxiv.org/abs/2505.21855
Authors: Jiseung Yoo, Curran Mahowald, Meiyu Li, Wei Ai
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are transforming information extraction from academic literature, offering new possibilities for knowledge management. This study presents an LLM-based system designed to extract detailed information about research instruments used in the education field, including their names, types, target respondents, measured constructs, and outcomes. Using multi-step prompting and a domain-specific data schema, it generates structured outputs optimized for educational research. Our evaluation shows that this system significantly outperforms other approaches, particularly in identifying instrument names and detailed information. This demonstrates the potential of LLM-powered information extraction in educational contexts, offering a systematic way to organize research instrument information. The ability to aggregate such information at scale enhances accessibility for researchers and education leaders, facilitating informed decision-making in educational research and policy.
zh

[AI-68] A Provable Approach for End-to-End Safe Reinforcement Learning

【Quick Read】: This paper addresses ensuring a policy's safety across its entire lifetime, from learning to operation, in safe reinforcement learning (RL), a goal that existing safe-RL paradigms inherently struggle to achieve. The key to the proposed Provably Lifetime Safe RL (PLS) is to integrate offline safe RL with safe policy deployment: a policy is learned offline via return-conditioned supervised learning, and during deployment a limited set of parameters known as target returns is cautiously optimized with Gaussian processes (GPs), finding near-optimal target returns while guaranteeing safety with high probability.

Link: https://arxiv.org/abs/2505.21852
Authors: Akifumi Wachi, Kohei Miyaguchi, Takumi Tanabe, Rei Sato, Youhei Akimoto
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Robotics (cs.RO)
Comments: 27 pages

Click to view abstract

Abstract:A longstanding goal in safe reinforcement learning (RL) is a method to ensure the safety of a policy throughout the entire process, from learning to operation. However, existing safe RL paradigms inherently struggle to achieve this objective. We propose a method, called Provably Lifetime Safe RL (PLS), that integrates offline safe RL with safe policy deployment to address this challenge. Our proposed method learns a policy offline using return-conditioned supervised learning and then deploys the resulting policy while cautiously optimizing a limited set of parameters, known as target returns, using Gaussian processes (GPs). Theoretically, we justify the use of GPs by analyzing the mathematical relationship between target and actual returns. We then prove that PLS finds near-optimal target returns while guaranteeing safety with high probability. Empirically, we demonstrate that PLS outperforms baselines both in safety and reward performance, thereby achieving the longstanding goal to obtain high rewards while ensuring the safety of a policy throughout the lifetime from learning to operation.
zh
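
The deployment-time half of PLS can be sketched with a GP over a one-dimensional target-return parameter: fit the achieved returns observed so far, then pick the next target optimistically among candidates whose pessimistic bound passes a safety check. The toy environment, kernel, and thresholds below are all assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def deploy(target):                      # toy stand-in for running the policy
    return min(target, 0.7) + rng.normal(0.0, 0.02)   # achieved return saturates

candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
X = [[0.1], [0.5], [0.9]]                # initial cautious probes
y = [deploy(x[0]) for x in X]

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
    gp.fit(np.array(X), np.array(y))
    mu, sd = gp.predict(candidates, return_std=True)
    safe = mu - 2.0 * sd >= 0.0          # toy safety check on the pessimistic bound
    score = np.where(safe, mu + 2.0 * sd, -np.inf)
    pick = float(candidates[int(np.argmax(score))][0])
    X.append([pick])
    y.append(deploy(pick))

print(f"selected target return ~ {X[-1][0]:.2f}")
```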

[AI-69] Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories ICRA2025

【Quick Read】: This paper addresses the high computational cost of conventional diffusion/flow-matching policies in imitation learning, which prevents real-time action execution. The key to the solution is to treat action trajectories as flow trajectories: sampling from a narrow Gaussian around the last action and incrementally integrating a velocity field learned via flow matching produces a continuous sequence of actions. This allows actions to be streamed to the robot on-the-fly during flow sampling, supports receding-horizon policy execution, and retains the ability to model multi-modal behavior.

Link: https://arxiv.org/abs/2505.21851
Authors: Sunshine Jiang, Xiaolin Fang, Nicholas Roy, Tomás Lozano-Pérez, Leslie Pack Kaelbling, Siddharth Ancha
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICRA 2025 Beyond Pick and Place Workshop

Click to view abstract

Abstract:Recent advances in diffusion / flow-matching policies have enabled imitation learning of complex, multi-modal action trajectories. However, they are computationally expensive because they sample a trajectory of trajectories: a diffusion / flow trajectory of action trajectories. They discard intermediate action trajectories, and must wait for the sampling process to complete before any actions can be executed on the robot. We simplify diffusion / flow policies by treating action trajectories as flow trajectories. Instead of starting from pure noise, our algorithm samples from a narrow Gaussian around the last action. Then, it incrementally integrates a velocity field learned via flow matching to produce a sequence of actions that constitute a single trajectory. This enables actions to be streamed to the robot on-the-fly during the flow sampling process, and is well-suited for receding horizon policy execution. Despite streaming, our method retains the ability to model multi-modal behavior. We train flows that stabilize around demonstration trajectories to reduce distribution shift and improve imitation learning performance. Streaming flow policy outperforms prior methods while enabling faster policy execution and tighter sensorimotor loops for learning-based robot control. Project website: this https URL
zh
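
The streaming sampler reduces to Euler integration of the learned velocity field, starting from a narrow Gaussian around the last executed action and yielding every intermediate state as an executable action. A sketch with a toy velocity field (the field, horizon, and noise scale are assumptions):

```python
import torch

def stream_actions(velocity_field, obs, last_action, horizon=16, sigma=0.05):
    """Yield actions on-the-fly while integrating the flow, instead of waiting
    for a full diffusion-style sampling loop to finish."""
    a = last_action + sigma * torch.randn_like(last_action)  # narrow Gaussian init
    dt = 1.0 / horizon
    for k in range(horizon):
        t = torch.tensor(k * dt)
        a = a + dt * velocity_field(a, t, obs)  # each intermediate a is executable
        yield a

# toy field pulling the action toward an observation-dependent goal
v = lambda a, t, obs: 4.0 * (obs - a)
for a in stream_actions(v, obs=torch.tensor([1.0, 0.0]), last_action=torch.zeros(2)):
    pass  # in a real loop, each a would be sent to the robot immediately
print(a)  # approximately [1.0, 0.0]
```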

[AI-70] Xinyu AI Search: Enhanced Relevance and Comprehensive Results with Rich Answer Presentations

【Quick Read】: This paper addresses the inability of traditional search engines to synthesize fragmented information for complex queries, as well as the shortcomings of generative AI search engines in relevance, comprehensiveness, and presentation. The key to the Xinyu AI Search system is a query-decomposition graph that dynamically breaks complex queries into sub-queries for stepwise retrieval and generation, combined with multi-source aggregation, query expansion, and filtering and re-ranking strategies to improve the relevance and diversity of results, plus innovations such as fine-grained citations and timeline visualization to improve result presentation.

Link: https://arxiv.org/abs/2505.21849
Authors: Bo Tang, Junyi Zhu, Chenyang Xi, Yunhang Ge, Jiahao Wu, Yuchen Feng, Yijun Niu, Wenqiang Wei, Yu Yu, Chunyu Li, Zehao Lin, Hao Wu, Ning Liao, Yebin Yang, Jiajia Wang, Zhiyu Li, Feiyu Xiong, Jingrun Chen
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Traditional search engines struggle to synthesize fragmented information for complex queries, while generative AI search engines face challenges in relevance, comprehensiveness, and presentation. To address these limitations, we introduce Xinyu AI Search, a novel system that incorporates a query-decomposition graph to dynamically break down complex queries into sub-queries, enabling stepwise retrieval and generation. Our retrieval pipeline enhances diversity through multi-source aggregation and query expansion, while filtering and re-ranking strategies optimize passage relevance. Additionally, Xinyu AI Search introduces a novel approach for fine-grained, precise built-in citation and innovates in result presentation by integrating timeline visualization and textual-visual choreography. Evaluated on recent real-world queries, Xinyu AI Search outperforms eight existing technologies in human assessments, excelling in relevance, comprehensiveness, and insightfulness. Ablation studies validate the necessity of its key sub-modules. Our work presents the first comprehensive framework for generative AI search engines, bridging retrieval, generation, and user-centric presentation.
zh

[AI-71] An Optimistic Algorithm for Online CMDPs with Anytime Adversarial Constraints

【Quick Read】: This paper tackles learning optimal policies under adversarial constraints in online safe reinforcement learning, where the core challenge is maximizing reward under constraints that are unknown, time-varying, and potentially adversarially designed. The proposed solution is the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first effective algorithm for online constrained MDPs (CMDPs) with anytime adversarial constraints; its key property is achieving optimal regret O(√K) and strong constraint violation O(√K) without relying on Slater's condition or a known safe policy, with further improvements possible given accurate reward and transition estimates.

Link: https://arxiv.org/abs/2505.21841
Authors: Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, Honghao Wei
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 41st International Conference on Machine Learning

Click to view abstract

Abstract:Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret O(√K) and strong constraint violation O(√K) without relying on Slater's condition or the existence of a strictly known safe policy. We further show that access to accurate estimates of rewards and transitions can further improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.
zh
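
The primal-dual structure is visible in a single-state, bandit-style miniature: mirror descent on the action simplex becomes multiplicative weights on the Lagrangian, while the dual variable performs projected ascent on the constraint violation. This is a drastic simplification of the CMDP setting, and every constant below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, budget, eta = 4, 0.3, 0.1
policy = np.ones(n_actions) / n_actions       # a point on the probability simplex
lam = 0.0                                     # dual variable for the cost constraint

for t in range(2000):
    r = rng.uniform(0, 1, n_actions)          # observed rewards
    c = rng.uniform(0, 1, n_actions)          # (possibly adversarial) costs
    policy *= np.exp(eta * (r - lam * c))     # mirror descent = multiplicative weights
    policy /= policy.sum()
    lam = max(0.0, lam + eta * (policy @ c - budget))  # dual ascent on violation

print("policy:", np.round(policy, 3), " lambda:", round(lam, 3))
```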

[AI-72] Nonadaptive Output Regulation of Second-Order Nonlinear Uncertain Systems

【Quick Read】: This paper investigates the robust output regulation problem for second-order nonlinear uncertain systems coupled with an unknown exosystem. The key to the solution is to construct generic internal models for the system's steady-state state and input variables and, via a coordinate transformation, convert the robust output regulation problem into a nonadaptive stabilization problem for an augmented system composed of the second-order nonlinear uncertain system and the generic internal models. A stabilizing control law is then designed and a strict Lyapunov function is constructed to guarantee robustness against unmodeled disturbances; the nonadaptive control law renders the output zeroing manifold of the augmented system attractive, thereby solving the robust output regulation problem.

Link: https://arxiv.org/abs/2505.21838
Authors: Maobin Lu, Martin Guay, Telema Harry, Shimin Wang, Jordan Cooper
Affiliation: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Chaotic Dynamics (nlin.CD)
Comments: 8 pages, 3 figures

Click to view abstract

Abstract:This paper investigates the robust output regulation problem of second-order nonlinear uncertain systems with an unknown exosystem. Instead of the adaptive control approach, this paper resorts to a robust control methodology to solve the problem and thus avoid the bursting phenomenon. In particular, this paper constructs generic internal models for the steady-state state and input variables of the system. By introducing a coordinate transformation, this paper converts the robust output regulation problem into a nonadaptive stabilization problem of an augmented system composed of the second-order nonlinear uncertain system and the generic internal models. Then, we design the stabilization control law and construct a strict Lyapunov function that guarantees the robustness with respect to unmodeled disturbances. The analysis shows that the output zeroing manifold of the augmented system can be made attractive by the proposed nonadaptive control law, which solves the robust output regulation problem. Finally, we demonstrate the effectiveness of the proposed nonadaptive internal model approach by its application to the control of the Duffing system.
zh
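
For reference, the Duffing system named in the application example is the classic forced nonlinear oscillator, commonly written as

```latex
\ddot{x} + \delta \dot{x} + \alpha x + \beta x^{3} = \gamma \cos(\omega t)
```

whose cubic stiffness term makes it a natural second-order nonlinear testbed for the proposed nonadaptive internal model approach.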

[AI-73] TuneComp: Joint Fine-tuning and Compression for Large Foundation Models

【Quick Read】: This paper addresses the performance loss of sequential fine-tuning followed by compression in post-training model compression, which also requires building a larger-than-necessary intermediate model. The key to the solution is to jointly fine-tune and compress the model by gradually distilling it into a pruned low-rank structure, achieving efficient parameter compression while preserving performance.

Link: https://arxiv.org/abs/2505.21835
Authors: Xiangyu Chen, Jing Liu, Ye Wang, Matthew Brand, Pu (Perry) Wang, Toshiaki Koike-Akino
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preliminary Work

Click to view abstract

Abstract:To reduce model size during post-training, compression methods, including knowledge distillation, low-rank approximation, and pruning, are often applied after fine-tuning the model. However, sequential fine-tuning and compression sacrifices performance, while creating a larger than necessary model as an intermediate step. In this work, we aim to reduce this gap, by directly constructing a smaller model while guided by the downstream task. We propose to jointly fine-tune and compress the model by gradually distilling it to a pruned low-rank structure. Experiments demonstrate that joint fine-tuning and compression significantly outperforms other sequential compression methods.
zh

[AI-74] SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts

【Quick Read】: This paper asks whether large language models (LLMs) robustly generalize critical safety facts to novel situations, since lacking this ability can be dangerous when users ask simple questions. The key to the solution is SAGE-Eval (SAfety-fact systematic GEneralization evaluation), the first benchmark testing whether LLMs correctly apply well-established safety facts to naive user queries. SAGE-Eval comprises 104 safety facts manually sourced from authoritative organizations, systematically augmented into 10,428 test scenarios across 7 common domains (e.g., outdoor activities, medicine) to assess models' safety-critical decision making in practice.

Link: https://arxiv.org/abs/2505.21828
Authors: Chen Yueh-Han, Guy Davidson, Brenden M. Lake
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, “I’m considering packing melon balls for my 10-month-old’s lunch. What other foods would be good to include?” Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks. We publicly release SAGE-Eval at this https URL and our code is available at this https URL.
zh

[AI-75] Music Source Restoration

【Quick Read】: This paper addresses Music Source Restoration (MSR), closing the gap between idealized source separation and real-world music production: existing Music Source Separation (MSS) methods ignore degradations applied during production, such as equalization, compression, and reverb, which limits their practical effectiveness. The key to the solution is to model mixtures as degraded sums of individually degraded sources and to recover the original, undegraded signals; to support MSR research, the paper introduces RawStems, a dataset of unprocessed source signals for 578 songs organized into 8 primary and 17 secondary instrument groups totaling 354.13 hours, and establishes U-Former as a baseline demonstrating the feasibility of MSR.

Link: https://arxiv.org/abs/2505.21827
Authors: Yongyi Zang, Zheqi Dai, Mark D. Plumbley, Qiuqiang Kong
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: A modified version of this paper is in review

Click to view abstract

Abstract:We introduce Music Source Restoration (MSR), a novel task addressing the gap between idealized source separation and real-world music production. Current Music Source Separation (MSS) approaches assume mixtures are simple sums of sources, ignoring signal degradations employed during music production like equalization, compression, and reverb. MSR models mixtures as degraded sums of individually degraded sources, with the goal of recovering original, undegraded signals. Due to the lack of data for MSR, we present RawStems, a dataset annotation of 578 songs with unprocessed source signals organized into 8 primary and 17 secondary instrument groups, totaling 354.13 hours. To the best of our knowledge, RawStems is the first dataset that contains unprocessed music stems with hierarchical categories. We consider spectral filtering, dynamic range compression, harmonic distortion, reverb and lossy codec as possible degradations, and establish U-Former as a baseline method, demonstrating the feasibility of MSR on our dataset. We release the RawStems dataset annotations, degradation simulation pipeline, training code and pre-trained models to be publicly available.
zh

[AI-76] Revisiting Self-attention for Cross-domain Sequential Recommendation KDD’25

【Quick Read】: This paper aims to improve cross-domain sequential recommendation (CDSR), noting that existing frameworks rely on extra domain-specific components while overlooking the potential of the self-attention module already present in the Transformer. The key to the solution is to enhance self-attention directly: a Pareto-optimal self-attention is proposed and cross-domain learning is cast as a multi-objective optimization problem that optimizes the recommendation task while dynamically minimizing cross-domain attention scores, automating knowledge transfer (AutoCDSR), effectively mitigating negative transfer, and promoting complementary knowledge exchange among auxiliary domains.

Link: https://arxiv.org/abs/2505.21811
Authors: Clark Mingxuan Ju, Leonardo Neves, Bhuvesh Kumar, Liam Collins, Tong Zhao, Yuwei Qiu, Qing Dou, Sohail Nizam, Sen Yang, Neil Shah
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to KDD'25

Click to view abstract

Abstract:Sequential recommendation is a popular paradigm in modern recommender systems. In particular, one challenging problem in this space is cross-domain sequential recommendation (CDSR), which aims to predict future behaviors given user interactions across multiple domains. Existing CDSR frameworks are mostly built on the self-attention transformer and seek to improve by explicitly injecting additional domain-specific components (e.g. domain-aware module blocks). While these additional components help, we argue they overlook the core self-attention module already present in the transformer, a naturally powerful tool to learn correlations among behaviors. In this work, we aim to improve the CDSR performance for simple models from a novel perspective of enhancing the self-attention. Specifically, we introduce a Pareto-optimal self-attention and formulate the cross-domain learning as a multi-objective problem, where we optimize the recommendation task while dynamically minimizing the cross-domain attention scores. Our approach automates knowledge transfer in CDSR (dubbed as AutoCDSR) – it not only mitigates negative transfer but also encourages complementary knowledge exchange among auxiliary domains. Based on the idea, we further introduce AutoCDSR+, a more performant variant with slight additional cost. Our proposal is easy to implement and works as a plug-and-play module that can be incorporated into existing transformer-based recommenders. Besides flexibility, it is practical to deploy because it brings little extra computational overheads without heavy hyper-parameter tuning. AutoCDSR on average improves Recall@10 for SASRec and Bert4Rec by 9.8% and 16.0% and NDCG@10 by 12.0% and 16.7%, respectively. Code is available at this https URL.
zh

[AI-77] Multimodal Federated Learning: A Survey through the Lens of Different FL Paradigms

【Quick Read】: This paper addresses the unique challenges multimodal data brings to multimodal federated learning (MFL), including modality heterogeneity, privacy heterogeneity, and communication inefficiency. The key to the solution is a systematic analysis of MFL under the three major federated learning (FL) paradigms: horizontal FL (HFL), vertical FL (VFL), and hybrid FL. For each paradigm, the survey presents the problem formulation, representative training algorithms, and the core challenges, aiming to provide a comprehensive taxonomy for MFL, reveal the novel problems multimodal data raises under different FL paradigms, and chart directions for future research.

Link: https://arxiv.org/abs/2505.21792
Authors: Yuanzhe Peng, Jieming Bian, Lei Wang, Yin Huang, Jie Xu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multimodal Federated Learning (MFL) lies at the intersection of two pivotal research areas: leveraging complementary information from multiple modalities to improve downstream inference performance and enabling distributed training to enhance efficiency and preserve privacy. Despite the growing interest in MFL, there is currently no comprehensive taxonomy that organizes MFL through the lens of different Federated Learning (FL) paradigms. This perspective is important because multimodal data introduces distinct challenges across various FL settings. These challenges, including modality heterogeneity, privacy heterogeneity, and communication inefficiency, are fundamentally different from those encountered in traditional unimodal or non-FL scenarios. In this paper, we systematically examine MFL within the context of three major FL paradigms: horizontal FL (HFL), vertical FL (VFL), and hybrid FL. For each paradigm, we present the problem formulation, review representative training algorithms, and highlight the most prominent challenge introduced by multimodal data in distributed settings. We also discuss open challenges and provide insights for future research. By establishing this taxonomy, we aim to uncover the novel challenges posed by multimodal data from the perspective of different FL paradigms and to offer a new lens through which to understand and advance the development of MFL.
zh

[AI-78] DualSchool: How Reliable are LLMs for Optimization Education?

【Quick Read】: This paper examines how generative AI performs on Primal to Dual Conversion (P2DC) for linear programs: although LLMs are trained at web scale and have both the conversion procedure and many P2DC instances at their disposal, they still fall markedly short at actually producing correct duals. The key to the solution is DualSchool, a comprehensive framework for generating and verifying P2DC instances; its verification uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and false negatives when applied to P2DC.

Link: https://arxiv.org/abs/2505.21775
Authors: Michael Klamkin, Arnaud Deza, Sikai Cheng, Haoruo Zhao, Pascal Van Hentenryck
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Consider the following task taught in introductory optimization courses which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web-scale, have the conversion process and many instances of Primal to Dual Conversion (P2DC) at their disposal. Students may thus reasonably expect that LLMs would perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. The verification procedure of DualSchool uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and negatives when applied to P2DC. Experiments performed by DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks, such as correctness, verification, and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.
zh
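
For readers outside OR, the conversion being tested is the textbook symmetric primal-dual pair:

```latex
\text{Primal:}\;\; \min_{x}\, c^{\top}x \;\;\text{s.t.}\;\; Ax \ge b,\; x \ge 0
\qquad\Longleftrightarrow\qquad
\text{Dual:}\;\; \max_{y}\, b^{\top}y \;\;\text{s.t.}\;\; A^{\top}y \le c,\; y \ge 0
```

A correct dual is only unique up to renaming and reordering of variables and constraints, which is presumably why DualSchool verifies with a graph edit distance over a canonical form rather than with string matching.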

[AI-79] Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models

【Quick Read】: This paper addresses overthinking in large reasoning models (LRMs): while improving answer accuracy, they markedly inflate output length by generating unnecessarily complex chain-of-thought (CoT) paths on easy questions, wasting computation and potentially degrading performance. The key to the solution is a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting patterns that benefit the answer while removing detrimental ones, yielding more concise yet sufficiently informative reasoning trajectories. The method improves reasoning efficiency by reducing attention FLOPs while preserving accuracy on originally correct responses and further improving overall reasoning accuracy.

Link: https://arxiv.org/abs/2505.21765
Authors: Sohyun An, Ruochen Wang, Tianyi Zhou, Cho-Jui Hsieh
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Work In Progress

Click to view abstract

Abstract:While recent success of large reasoning models (LRMs) significantly advanced LLMs’ reasoning capability by optimizing the final answer accuracy using reinforcement learning, they may also drastically increase the output length due to overthinking, characterized by unnecessarily complex reasoning paths that waste computation and potentially degrade the performance. We hypothesize that such inefficiencies stem from LRMs’ limited capability to dynamically select the proper modular reasoning strategies, termed thinking patterns at the right position. To investigate this hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting beneficial patterns that improve the answer while removing detrimental ones. Empirical analysis confirms that our optimized thinking paths yield more concise yet sufficiently informative trajectories, enhancing reasoning efficiency by reducing attention FLOPs by up to 47% while maintaining accuracy for originally correct responses. Moreover, a non-trivial portion of originally incorrect responses are transformed into correct ones, achieving a 15.6% accuracy improvement with reduced length. Motivated by the improvement brought by the optimized thinking paths, we apply a preference optimization technique supported by a pairwise dataset contrasting suboptimal and optimal reasoning paths. Experimental evaluations across multiple mathematical reasoning benchmarks reveal that our method notably reduces computational overhead while simultaneously improving reasoning accuracy, achieving up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
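The preference-optimization step is described only at a high level; the sketch below assumes a DPO-style pairwise loss over the contrasted suboptimal and optimized reasoning paths, with random tensors standing in for model log-probabilities.

```python
# A DPO-style objective over (optimized, suboptimal) path pairs. The
# log-probability tensors are placeholders for policy / reference model
# outputs; the use of DPO specifically is an assumption for illustration.
import torch
import torch.nn.functional as F

beta = 0.1
logp_win, logp_lose = torch.randn(8), torch.randn(8)   # policy log-probs
ref_win, ref_lose = torch.randn(8), torch.randn(8)     # frozen reference

margin = beta * ((logp_win - ref_win) - (logp_lose - ref_lose))
loss = -F.logsigmoid(margin).mean()   # push policy toward optimized paths
print(float(loss))
```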
zh

[AI-80] Simulating the Unseen: Crash Prediction Must Learn from What Did Not Happen

【速读】:该论文试图解决交通安全管理中的数据悖论问题,即我们最希望预防的交通事故恰恰是观测到的稀少事件。现有事故频率模型和替代安全指标依赖于稀疏、噪声大且未充分报告的数据,而即使高保真的模拟也未能充分涵盖引发重大后果(如死亡)的长尾情境。解决方案的关键在于从传统的仅基于事故的学习范式转向一种新的反事实安全学习方法,通过推理不仅发生在实际中的事件,还包括在略有不同情境下可能发生但具有潜在危险性的大量情景。该方法通过生成场景引擎、多样化的驾驶员模型、因果学习等技术,合成并解释接近事故事件,从而将稀疏的事故数据转化为丰富的事故预测信号,实现车辆、道路和政策的预先压力测试,推动交通安全管理从被动调查向主动预防转变。

链接: https://arxiv.org/abs/2505.21743
作者: Zihao Li,Xinyuan Cao,Xiangbo Gao,Kexin Tian,Keshu Wu,Mohammad Anis,Hao Zhang,Keke Long,Jiwan Jiang,Xiaopeng Li,Yunlong Zhang,Tianbao Yang,Dominique Lord,Zhengzhong Tu,Yang Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic safety science has long been hindered by a fundamental data paradox: the crashes we most wish to prevent are precisely those events we rarely observe. Existing crash-frequency models and surrogate safety metrics rely heavily on sparse, noisy, and under-reported records, while even sophisticated, high-fidelity simulations undersample the long-tailed situations that trigger catastrophic outcomes such as fatalities. We argue that the path to achieving Vision Zero, i.e., the complete elimination of traffic fatalities and severe injuries, requires a paradigm shift from traditional crash-only learning to a new form of counterfactual safety learning: reasoning not only about what happened, but also about the vast set of plausible yet perilous scenarios that could have happened under slightly different circumstances. To operationalize this shift, our proposed agenda bridges macro to micro. Guided by crash-rate priors, generative scene engines, diverse driver models, and causal learning, near-miss events are synthesized and explained. A crash-focused digital twin testbed links micro scenes to macro patterns, while a multi-objective validator ensures that simulations maintain statistical realism. This pipeline transforms sparse crash data into rich signals for crash prediction, enabling the stress-testing of vehicles, roads, and policies before deployment. By learning from crashes that almost happened, we can shift traffic safety from reactive forensics to proactive prevention, advancing Vision Zero.
zh

[AI-81] Deep Reinforcement Learning Agents are not even close to Human Intelligence

【速读】:该论文试图解决深度强化学习(Deep Reinforcement Learning, DRL)代理在零样本适应能力上的不足,特别是其在任务简化版本上表现显著下降的问题。现有研究多关注任务复杂化情况下的鲁棒性评估,而忽视了任务简化场景的测试。论文提出的解决方案是引入HackAtari,这是一个对Arcade Learning Environments进行任务变异的基准集,通过该基准集揭示了DRL代理对捷径的依赖性,并凸显了其与人类行为智能之间的持续差距。关键在于通过系统化的泛化测试,推动新基准和方法的发展,以超越传统的静态评估协议。

链接: https://arxiv.org/abs/2505.21731
作者: Quentin Delfosse,Jannis Blüml,Fabian Tatai,Théo Vincent,Bjarne Gregori,Elisabeth Dillies,Jan Peters,Constantin Rothkopf,Kristian Kersting
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 49 pages in total, 5 main figures, 14 figures total

点击查看摘要

Abstract:Deep reinforcement learning (RL) agents achieve impressive results in a wide variety of tasks, but they lack zero-shot adaptation capabilities. While most robustness evaluations focus on task complexifications, for which humans also struggle to maintain performance, no evaluation has been performed on task simplifications. To tackle this issue, we introduce HackAtari, a set of task variations of the Arcade Learning Environments. We use it to demonstrate that, contrary to humans, RL agents systematically exhibit huge performance drops on simpler versions of their training tasks, uncovering agents' consistent reliance on shortcuts. Our analysis across multiple algorithms and architectures highlights the persistent gap between RL agents and human behavioral intelligence, underscoring the need for new benchmarks and methodologies that enforce systematic generalization testing beyond static evaluation protocols. Training and testing in the same environment is not enough to obtain agents equipped with human-like intelligence.
zh

[AI-82] Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape

【速读】:该论文试图解决深度ReLU网络在小权重初始化下,梯度下降(GD)最初受参数空间中原点处的鞍点主导的问题。其解决方案的关键在于研究所谓的“逃逸方向”,这些方向在严格鞍点的情况下起到类似于Hessian矩阵特征向量的作用。论文表明,最优逃逸方向在其深层中表现出低秩偏差:第ℓ层权重矩阵的第一个奇异值至少比其他奇异值大ℓ^(1/4)倍。这一结果为证明深度ReLU网络中的鞍点到鞍点动力学提供了初步基础。

链接: https://arxiv.org/abs/2505.21722
作者: Ioannis Bantzis,James B. Simon,Arthur Jacot
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:When a deep ReLU network is initialized with small weights, GD is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least $\ell^{1/4}$ larger than any other singular value. We also prove a number of related results about these escape directions. We argue that this result is a first step in proving Saddle-to-Saddle dynamics in deep ReLU networks, where GD visits a sequence of saddles with increasing bottleneck rank.
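A toy numerical reading of the low-rank claim: compare the top singular-value gap of each layer against the $\ell^{1/4}$ bound. The rank-1-dominated random matrices below are placeholders for the escape direction of an actual trained network.

```python
# For each stand-in layer weight matrix W_l, print sigma_1/sigma_2 next to
# the paper's l^{1/4} bound. Purely illustrative: the matrices are synthetic.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 6, 32
for layer in range(1, depth + 1):
    u, v = rng.normal(size=(width, 1)), rng.normal(size=(1, width))
    W = u @ v + 0.05 * rng.normal(size=(width, width))  # rank-1 plus noise
    s = np.linalg.svd(W, compute_uv=False)              # sorted descending
    print(f"layer {layer}: sigma1/sigma2 = {s[0]/s[1]:.1f}, "
          f"bound l^(1/4) = {layer**0.25:.2f}")
```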
zh

[AI-83] Responsible Data Stewardship: Generative AI and the Digital Waste Problem AAAI

【速读】:该论文试图解决生成式 AI(Generative AI)发展过程中因数据无限存储而产生的数字废弃物(digital waste)问题,这一问题在可持续性方面尚未得到充分研究。解决方案的关键在于将环境可持续性作为负责任创新的核心,并通过跨学科的数字资源管理方法,提出包括研究方向调整、技术干预和文化转变在内的具体建议,以减轻长期数据存储对环境的影响。

链接: https://arxiv.org/abs/2505.21720
作者: Vanessa Utz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 8 pages, submitted to AAAI/ACM Conference on AI, Ethics and Society

点击查看摘要

Abstract:As generative AI systems become widely adopted, they enable unprecedented levels of synthetic data creation across text, images, audio, and video modalities. While research has addressed the energy consumption of model training and inference, a critical sustainability challenge remains understudied: digital waste. This term refers to stored data that consumes resources without serving a specific (and/or immediate) purpose. This paper presents this terminology in the AI context and introduces digital waste as an ethical imperative within (generative) AI development, positioning environmental sustainability as core to responsible innovation. Drawing from established digital resource management approaches, we examine how other disciplines manage digital waste and identify transferable approaches for the AI community. We propose specific recommendations encompassing research directions, technical interventions, and cultural shifts to mitigate the environmental consequences of indefinite data storage. By expanding AI ethics beyond immediate concerns like bias and privacy to include intergenerational environmental justice, this work contributes to a more comprehensive ethical framework that considers the complete lifecycle impact of generative AI systems.
zh

[AI-84] Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling

【速读】:该论文试图解决长序列建模中计算效率与梯度稳定性之间的平衡问题,特别是在处理长序列时,传统递归模型在计算复杂度和内存消耗上存在瓶颈。其解决方案的关键在于提出LrcSSM,通过强制状态转移矩阵为对角矩阵并在每一步进行学习,使得整个序列可以并行求解,从而实现 $\mathcal{O}(TD)$ 的时间和内存复杂度,并仅具有 $\mathcal{O}(\log T)$ 的序列深度,同时提供了其他输入变化系统(如Liquid-S4和Mamba)所不具备的梯度稳定性保证。

链接: https://arxiv.org/abs/2505.21717
作者: Mónika Farsang,Ramin Hasani,Radu Grosu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We present LrcSSM, a nonlinear recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing the state-transition matrix to be diagonal and learned at every step, the full sequence can be solved in parallel with a single prefix-scan, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Lastly, for network depth $L$, as the forward and backward passes cost $\Theta(TDL)$ FLOPs, with its low sequential depth and parameter count $\Theta(DL)$, the model follows the compute-optimal scaling law regime ($\beta \approx 0.42$) recently observed for Mamba, outperforming quadratic-attention Transformers at equal compute while avoiding the memory overhead of FFT-based long convolutions. We show that on a series of long-range forecasting tasks, LrcSSM outperforms LRU, S5 and Mamba.
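The parallel-solvability claim rests on a standard fact: a diagonal linear recurrence has a closed form built from two prefix scans, each of which has $\mathcal{O}(\log T)$ sequential depth. A minimal numpy check of the mechanism (not LrcSSM's actual layer):

```python
# The recurrence h_t = a_t * h_{t-1} + b_t equals P_t * sum_{s<=t} b_s / P_s
# with P_t = prod_{r<=t} a_r; cumprod and cumsum are both scan-parallelizable.
import numpy as np

T, D = 50, 4
rng = np.random.default_rng(1)
a = rng.uniform(0.9, 0.99, size=(T, D))   # learned diagonal transitions
b = rng.normal(size=(T, D))               # inputs

# Sequential reference.
h_seq, h = np.zeros((T, D)), np.zeros(D)
for t in range(T):
    h = a[t] * h + b[t]
    h_seq[t] = h

# Scan-friendly closed form.
P = np.cumprod(a, axis=0)
h_par = P * np.cumsum(b / P, axis=0)
print(np.allclose(h_seq, h_par))  # True
```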
zh

[AI-85] A Joint Reconstruction-Triplet Loss Autoencoder Approach Towards Unseen Attack Detection in IoV Networks

【速读】:该论文旨在解决Internet of Vehicles (IoV)系统中由于高度互联性所带来的安全漏洞问题,特别是针对传统安全机制在检测复杂且不断演变的网络攻击时的不足。其解决方案的关键在于提出一种完全基于良性网络数据训练的无监督自编码器方法,通过加权组合重构损失和三元组边界损失来引导模型训练,从而实现对未知攻击的有效检测。该方法在工业物联网和家庭物联网的最新网络入侵数据集上进行了广泛实验,表现出对良性数据和异常数据的高准确率,并通过迁移学习展示了模型的适应性。

链接: https://arxiv.org/abs/2505.21703
作者: Julia Boone,Tolunay Seyfi,Fatemeh Afghah
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Accepted for publication in the IEEE Internet of Things Journal (IoT-J)

点击查看摘要

Abstract:Internet of Vehicles (IoV) systems, while offering significant advancements in transportation efficiency and safety, introduce substantial security vulnerabilities due to their highly interconnected nature. These dynamic systems produce massive amounts of data between vehicles, infrastructure, and cloud services and present a highly distributed framework with a wide attack surface. In considering network-centered attacks on IoV systems, attacks such as Denial-of-Service (DoS) can prohibit the communication of essential physical traffic safety information between system elements, illustrating that the security concerns for these systems go beyond the traditional confidentiality, integrity, and availability concerns of enterprise systems. Given the complexity and volume of data generated by IoV systems, traditional security mechanisms are often inadequate for accurately detecting sophisticated and evolving cyberattacks. Here, we present an unsupervised autoencoder method trained entirely on benign network data for the purpose of unseen attack detection in IoV networks. We leverage a weighted combination of reconstruction and triplet margin loss to guide the autoencoder training and develop a diverse representation of the benign training set. We conduct extensive experiments on recent network intrusion datasets from two different application domains, industrial IoT and home IoT, that represent the modern IoV task. We show that our method performs robustly for all unseen attack types, with roughly 99% accuracy on benign data and between 97% and 100% performance on anomaly data. We extend these results to show that our model is adaptable through the use of transfer learning, achieving similarly high results while leveraging domain features from one domain to another.
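A hedged sketch of the stated objective: a weighted sum of reconstruction loss and a triplet margin loss on the latent codes, trained on benign traffic only. The layer sizes, the weight `alpha`, and the negative-mining scheme are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, dim=40, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, dim))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

model = AE()
recon_loss = nn.MSELoss()
triplet_loss = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha = 0.5  # assumed weighting between the two terms

# anchor/positive: benign samples; negative: a perturbed benign sample,
# standing in for whatever mining scheme the paper actually uses.
x_a, x_p = torch.randn(64, 40), torch.randn(64, 40)
x_n = x_a + 2.0 * torch.randn(64, 40)

xr, z_a = model(x_a)
_, z_p = model(x_p)
_, z_n = model(x_n)
loss = recon_loss(xr, x_a) + alpha * triplet_loss(z_a, z_p, z_n)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```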
zh

[AI-86] multivariateGPT : a decoder-only transformer for multivariate categorical and numeric data

【速读】:该论文试图解决现实世界过程中生成的混合型数据(包括分类数据和数值数据)建模问题,这类数据通常在不规则且具有信息性的间隔下被记录。传统基于离散标记的方法在数值表示能力上存在局限,而神经微分方程等方法则不适用于分类数据或信息性采样。本文提出的解决方案的关键在于构建一个统一的架构——multivariateGPT,通过自回归序列分解、嵌入方案和损失函数,将下一个标记预测任务扩展为对下一标记类别和值联合分布的概率估计,从而有效处理混合类型的数据序列。

链接: https://arxiv.org/abs/2505.21680
作者: Andrew J. Loza,Jun Yup Kim,Shangzheng Song,Yihang Liu,Joseph J. Y. Sung,R Andrew Taylor,Dennis L. Shung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures

点击查看摘要

Abstract:Real-world processes often generate data that are a mix of categorical and numeric values that are recorded at irregular and informative intervals. Discrete token-based approaches are limited in numeric representation capacity while methods like neural ordinary differential equations are not well suited for categorical data or informative sampling and require augmentation to handle certain classes of trajectories. Here, we present multivariateGPT, a single architecture for modeling sequences of mixed categorical (including tokenized text) and numeric data. This is accomplished with an autoregressive sequence decomposition, embedding scheme, and loss function that extend the next token prediction task to likelihood estimation of the joint distribution of next token class and value. We demonstrate how this approach can efficiently learn to generalize patterns in simple physical systems and model complex time series including electrocardiograms and multivariate electronic health record data. This work extends the utility of transformer based models to additional classes of data.
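One way to read "likelihood estimation of the joint distribution of next token class and value" is a categorical head for the class plus a class-conditioned Gaussian head for the value. This factorization and both heads are assumptions for illustration, not the paper's exact scheme.

```python
# Joint loss = cross-entropy over the class + Gaussian NLL of the value
# under the per-class (mu, log_var) predicted from the transformer state.
import torch
import torch.nn.functional as F

B, H, n_classes = 32, 64, 10
hidden = torch.randn(B, H)                       # transformer output at step t
class_head = torch.nn.Linear(H, n_classes)
value_head = torch.nn.Linear(H, 2 * n_classes)   # per-class (mu, log_var)

target_class = torch.randint(0, n_classes, (B,))
target_value = torch.randn(B)

ce = F.cross_entropy(class_head(hidden), target_class)

params = value_head(hidden).view(B, n_classes, 2)
mu = params[torch.arange(B), target_class, 0]
log_var = params[torch.arange(B), target_class, 1]
nll = 0.5 * (log_var + (target_value - mu) ** 2 / log_var.exp()).mean()

loss = ce + nll   # negative log-likelihood of (class, value) jointly
print(float(loss))
```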
zh

[AI-87] What happens when generative AI models train recursively on each others generated outputs?

【速读】:该论文试图解决生成式 AI (Generative AI) 模型在训练过程中可能接触到其他模型生成内容所带来的潜在影响问题,特别是这种数据介导的模型交互对模型性能的长期影响。其解决方案的关键在于通过实证研究和理论建模,分析数据介导交互的实际过程,并实验验证此类交互可能产生的长期结果,从而揭示模型在接触外部生成内容时可能获得的新概念以及性能趋同的风险。

链接: https://arxiv.org/abs/2505.21677
作者: Hung Ahn Vu,Galen Reeves,Emily Wenger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages

点击查看摘要

Abstract:The internet is full of AI-generated content while also serving as a common source of training data for generative AI (genAI) models. This duality raises the possibility that future genAI models may be trained on other models' generated outputs. Prior work has studied consequences of models training on their own generated outputs, but limited work has considered what happens if models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding downstream effects of such data-mediated model interactions is critical. To this end, we provide empirical evidence for how data-mediated interactions might unfold in practice, develop a theoretical model for this interactive training process, and experimentally show possible long-term results of such interactions. We find that data-mediated interactions can benefit models by exposing them to novel concepts perhaps missed in original training data, but also can homogenize their performance on shared tasks.
zh

[AI-88] Make Planning Research Rigorous Again!

【速读】:该论文试图解决当前基于大语言模型(Large Language Models, LLMs)的规划系统在设计和评估过程中缺乏严谨性的问题。其解决方案的关键在于将自动化规划领域已有的见解、工具和数据正确地整合到LLM-based planners的设计与评估中,以避免重复过去规划领域曾经历过的常见错误,从而加速LLM-based planners的发展并提升整体规划研究的进展。

链接: https://arxiv.org/abs/2505.21674
作者: Michael Katz,Harsha Kokel,Christian Muise,Shirin Sohrabi,Sarath Sreedharan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In over sixty years since its inception, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve a never-before-seen planning problem. This was done through established practices of rigorous design and evaluation of planning systems. It is our position that this rigor should be applied to the current trend of work on planning with large language models. One way to do so is by correctly incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. This position is particularly important in light of the abundance of recent works that replicate and propagate the same pitfalls that the planning community has encountered and learned from. We believe that avoiding such known pitfalls will contribute greatly to the progress in building LLM-based planners and to planning in general.
zh

[AI-89] Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing

【速读】:该论文研究了一个在n节点图G上的序列决策问题,其中每个节点的标签来自有限集合Σ,且服从与G相关的马尔可夫联合分布P。在每一步中,选择一个节点会揭示其标签并产生依赖于标签的奖励,目标是通过自适应选择节点以最大化期望累积折扣奖励。论文引入了前沿探索约束,即动作仅限于之前选择节点的邻居,以反映接触追踪和机器人探索等实际场景中的限制。解决方案的关键在于设计了一种基于Gittins指数的策略,该策略适用于一般图,并在G为森林时被证明是理论上最优的。

链接: https://arxiv.org/abs/2505.21671
作者: Davin Choo,Yuqi Pan,Tonghan Wang,Milind Tambe,Alastair van Heerden,Cheryl Johnson
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We study a sequential decision-making problem on an $n$-node graph $G$ where each node has an unknown label from a finite set $\mathbf{\Sigma}$, drawn from a joint distribution $P$ that is Markov with respect to $G$. At each step, selecting a node reveals its label and yields a label-dependent reward. The goal is to adaptively choose nodes to maximize expected accumulated discounted rewards. We impose a frontier exploration constraint, where actions are limited to neighbors of previously selected nodes, reflecting practical constraints in settings such as contact tracing and robotic exploration. We design a Gittins index-based policy that applies to general graphs and is provably optimal when $G$ is a forest. Our implementation runs in $O(n^2 \cdot |\mathbf{\Sigma}|^2)$ time while using $O(n \cdot |\mathbf{\Sigma}|^2)$ oracle calls to $P$ and $O(n^2 \cdot |\mathbf{\Sigma}|)$ space. Experiments on synthetic and real-world graphs show that our method consistently outperforms natural baselines, including in non-tree, budget-limited, and undiscounted settings. For example, in HIV testing simulations on real-world sexual interaction networks, our policy detects nearly all positive cases with only half the population tested, substantially outperforming other baselines.
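The frontier-exploration constraint in miniature: only neighbors of already-selected nodes are eligible at each step. The prior-based score below is a placeholder for the paper's Gittins index, and the random tree is a toy stand-in for a contact network.

```python
import random
random.seed(0)

# Build a random tree on 30 nodes (stdlib only): node v attaches to an
# earlier node chosen uniformly at random.
n = 30
nbrs = {v: set() for v in range(n)}
for v in range(1, n):
    p = random.randrange(v)
    nbrs[v].add(p); nbrs[p].add(v)

prior = {v: random.random() for v in range(n)}  # stand-in for P-derived scores

selected, frontier = {0}, set(nbrs[0])          # seed node 0
discount, total = 0.9, 0.0
for t in range(10):
    if not frontier:
        break
    v = max(frontier, key=lambda u: prior[u])   # placeholder for Gittins index
    reward = 1.0 if random.random() < prior[v] else 0.0  # reveal label
    total += discount ** t * reward
    selected.add(v)
    frontier = (frontier | nbrs[v]) - selected  # frontier grows around v
print("discounted reward:", round(total, 3))
```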
zh

[AI-90] Efficient Controllable Diffusion via Optimal Classifier Guidance

【速读】:该论文试图解决可控扩散模型生成问题,即通过优化给定的目标函数来引导模型生成符合特定需求的样本,这一问题在图像生成、分子生成和DNA/序列生成等领域具有重要应用。解决方案的关键在于将可控生成建模为一个KL-正则化目标函数的最优分布寻找问题,并提出SLCD(基于监督学习的可控扩散)方法,该方法通过迭代生成在线数据并训练一个小分类器来引导扩散模型的生成过程,其核心计算原语为分类,不涉及强化学习或控制中的复杂概念。

链接: https://arxiv.org/abs/2505.21666
作者: Owen Oertell,Shikun Sun,Yiding Chen,Jin Peng Zhou,Zhiyong Wang,Wen Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 9 figures, 3 tables

点击查看摘要

Abstract:The controllable generation of diffusion models aims to steer the model to generate samples that optimize some given objective functions. It is desirable for a variety of applications including image generation, molecule generation, and DNA/sequence generation. Reinforcement Learning (RL) based fine-tuning of the base model is a popular approach but it can overfit the reward function while requiring significant resources. We frame controllable generation as a problem of finding a distribution that optimizes a KL-regularized objective function. We present SLCD – Supervised Learning based Controllable Diffusion, which iteratively generates online data and trains a small classifier to guide the generation of the diffusion model. Similar to the standard classifier-guided diffusion, SLCD’s key computation primitive is classification and does not involve any complex concepts from RL or control. Via a reduction to no-regret online learning analysis, we show that under KL divergence, the output from SLCD provably converges to the optimal solution of the KL-regularized objective. Further, we empirically demonstrate that SLCD can generate high quality samples with nearly the same inference time as the base model in both image generation with continuous diffusion and biological sequence generation with discrete diffusion. Our code is available at this https URL
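SLCD's computational primitive is classifier guidance; the sketch below shifts a toy reverse-diffusion mean by the gradient of a small classifier's log-probability. Both the `denoise_mean` function and the classifier are stand-ins for the trained diffusion model and SLCD's iteratively retrained guide.

```python
import torch

def denoise_mean(x_t, t):
    return 0.9 * x_t                      # placeholder reverse-step mean

classifier = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(),
                                 torch.nn.Linear(16, 1))
guidance_scale = 2.0

x = torch.randn(8, 2)
for t in reversed(range(10)):
    x = x.detach().requires_grad_(True)
    log_p = torch.nn.functional.logsigmoid(classifier(x)).sum()
    grad = torch.autograd.grad(log_p, x)[0]   # d log p(y=1|x_t) / d x_t
    x = denoise_mean(x, t) + guidance_scale * grad + 0.05 * torch.randn_like(x)
print(x.mean(0))
```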
zh

[AI-91] Expert Survey: AI Reliability Security Research Priorities

【速读】:该论文试图解决如何有效指导战略性的AI研发投资,以提升AI系统的可靠性与安全性问题。其关键在于通过调查53位专家在105个AI可靠性和安全性研究领域中的观点,首次量化了专家在全面分类的AI安全与安全研究方向上的优先级,并生成具有数据支持的潜在影响排名,从而为资源的有效配置提供依据。

链接: https://arxiv.org/abs/2505.21664
作者: Joe O’Brien,Jeremy Dolan,Jay Kim,Jonah Dykhuizen,Jeba Sania,Sebastian Becker,Jam Kraprayoon,Cara Labrador
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Our survey of 53 specialists across 105 AI reliability and security research areas identifies the most promising research prospects to guide strategic AI R&D investment. As companies are seeking to develop AI systems with broadly human-level capabilities, research on reliability and security is urgently needed to ensure AI's benefits can be safely and broadly realized and prevent severe harms. This study is the first to quantify expert priorities across a comprehensive taxonomy of AI safety and security research directions and to produce a data-driven ranking of their potential impact. These rankings may support evidence-based decisions about how to effectively deploy resources toward AI reliability and security research.
zh

[AI-92] PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation

【速读】:该论文旨在解决细粒度机器人操作中对物体部件及其与任务关系的鲁棒推理问题,特别是在缺乏大规模带有部件级指令和标注的3D物体实例数据集的情况下。其解决方案的关键在于引入PartInstruct,这是首个基于部件级指令的细粒度机器人操作基准,包含513个物体实例、14个类别、1302个细粒度操作任务以及超过10,000个在3D模拟器中合成的专家示范,每个示范均配有高层任务指令、基于部件的基础技能指令及真实的3D物体信息,从而为训练和评估细粒度机器人操作模型提供支持。

链接: https://arxiv.org/abs/2505.21652
作者: Yifan Yin,Zhengtao Han,Shivam Aarya,Jianxin Wang,Shuhang Xu,Jiawei Peng,Angtian Wang,Alan Yuille,Tianmin Shu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1302 fine-grained manipulation tasks organized into 16 task classes. Our training set consists of over 10,000 expert demonstrations synthesized in a 3D simulator, where each demonstration is paired with a high-level task instruction, a chain of base part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art robot manipulation approaches, including end-to-end vision-language policy learning and bi-level planning models for robot manipulation on our benchmark. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and face challenges when manipulating object parts in long-horizon tasks.
zh

[AI-93] Efficient Diffusion Models for Symmetric Manifolds ICML2025

【速读】:该论文旨在解决在高维对称流形(如环面、球面、特殊正交群和酉群)上设计高效扩散模型的问题。现有方法依赖于热核,其缺乏闭式表达式,导致每一步训练需要 $d$ 次梯度评估或指数级于 $d$ 的算术运算。论文提出的解决方案的关键在于引入一种具有空间变化协方差的扩散模型,通过将欧几里得布朗运动投影到流形上来规避热核计算,并利用伊藤引理推导出一个高效的训练目标,使得每一步仅需 $O(1)$ 次梯度评估和近线性于 $d$(即 $O(d^{1.19})$)的算术运算,从而缩小了对称流形与欧几里得空间上扩散模型的效率差距。

链接: https://arxiv.org/abs/2505.21640
作者: Oren Mangoubi,Neil He,Nisheeth K. Vishnoi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Probability (math.PR); Machine Learning (stat.ML)
备注: The conference version of this paper appears in ICML 2025

点击查看摘要

Abstract:We introduce a framework for designing efficient diffusion models for $d$-dimensional symmetric-space Riemannian manifolds, including the torus, sphere, special orthogonal group and unitary group. Existing manifold diffusion models often depend on heat kernels, which lack closed-form expressions and require either $d$ gradient evaluations or exponential-in-$d$ arithmetic operations per training step. We introduce a new diffusion model for symmetric manifolds with a spatially-varying covariance, allowing us to leverage a projection of Euclidean Brownian motion to bypass heat kernel computations. Our training algorithm minimizes a novel efficient objective derived via Ito's Lemma, allowing each step to run in $O(1)$ gradient evaluations and nearly-linear-in-$d$ ($O(d^{1.19})$) arithmetic operations, reducing the gap between diffusions on symmetric manifolds and Euclidean space. Manifold symmetries ensure the diffusion satisfies an "average-case" Lipschitz condition, enabling accurate and efficient sample generation. Empirically, our model outperforms prior methods in training speed and improves sample quality on synthetic datasets on the torus, special orthogonal group, and unitary group.
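The heat-kernel bypass in miniature: simulate Euclidean Brownian increments in $\mathbb{R}^3$ and project back onto the unit sphere, so no heat-kernel evaluation is ever needed. This shows the mechanism only; the paper's spatially varying covariance and Ito-derived objective are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0, 1.0])     # start at the north pole of S^2
dt, steps = 1e-3, 1000
for _ in range(steps):
    x = x + np.sqrt(dt) * rng.normal(size=3)  # Euclidean Brownian increment
    x = x / np.linalg.norm(x)                 # project back onto the sphere
print(x, np.linalg.norm(x))                   # the walk stays on the manifold
```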
zh

[AI-94] he Feasibility of Topic-Based Watermarking on Academic Peer Reviews

【速读】:该论文试图解决在学术同行评审过程中使用生成式 AI (Generative AI) 所带来的诚信问题,特别是由于保密性泄露、幻觉内容和评估不一致导致的限制。其解决方案的关键在于提出一种轻量级、语义感知的主题基水印(topic-based watermarking, TBW)技术,该技术能够将可检测的信号嵌入到生成式 AI 生成的文本中,从而实现对 AI 生成内容的有效溯源,同时保持评审质量并具备对抗改写规避的强鲁棒性。

链接: https://arxiv.org/abs/2505.21636
作者: Alexander Nemecek,Yuzhou Jiang,Erman Ayday
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages main, 9 pages appendix

点击查看摘要

Abstract:Large language models (LLMs) are increasingly integrated into academic workflows, with many conferences and journals permitting their use for tasks such as language refinement and literature summarization. However, their use in peer review remains prohibited due to concerns around confidentiality breaches, hallucinated content, and inconsistent evaluations. As LLM-generated text becomes more indistinguishable from human writing, there is a growing need for reliable attribution mechanisms to preserve the integrity of the review process. In this work, we evaluate topic-based watermarking (TBW), a lightweight, semantic-aware technique designed to embed detectable signals into LLM-generated text. We conduct a comprehensive assessment across multiple LLM configurations, including base, few-shot, and fine-tuned variants, using authentic peer review data from academic conferences. Our results show that TBW maintains review quality relative to non-watermarked outputs, while demonstrating strong robustness to paraphrasing-based evasion. These findings highlight the viability of TBW as a minimally intrusive and practical solution for enforcing LLM usage in peer review.
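For intuition, here is a generic keyed green-list detector of the kind TBW builds on: a fraction `GAMMA` of the vocabulary is marked "green" per step via a keyed hash, and detection computes a z-score over green-token hits. TBW's topic-aware conditioning is not reproduced, and all parameters are illustrative.

```python
import hashlib, math

GAMMA, KEY = 0.25, b"secret"

def is_green(prev_token: int, token: int) -> bool:
    h = hashlib.sha256(KEY + prev_token.to_bytes(4, "big") +
                       token.to_bytes(4, "big")).digest()
    return int.from_bytes(h[:4], "big") / 2**32 < GAMMA

def z_score(tokens):
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)

unmarked = list(range(1, 200))      # arbitrary token ids, no watermark bias
print(round(z_score(unmarked), 2))  # typically near 0: no watermark evidence
```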
zh

[AI-95] Is Your LLM Overcharging You? Tokenization Transparency and Incentives

【速读】:该论文试图解决当前基于按令牌计价(pay-per-token)机制的大型语言模型服务中,提供商可能通过策略性地错误报告模型生成输出所使用的令牌数量来牟取不当利益的问题,而用户无法验证或察觉这种行为。解决方案的关键在于引入一种激励相容的令牌定价机制,该机制将价格与输出的字符数而非令牌数挂钩,从而完全消除提供商策略性行为的经济动机。

链接: https://arxiv.org/abs/2505.21627
作者: Ander Artola Velasco,Stratis Tsirtsis,Nastaran Okati,Manuel Gomez-Rodriguez
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it – they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we introduce an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, to completely eliminate the financial incentive to strategize, we introduce a simple incentive-compatible token pricing mechanism. Under this mechanism, the price users pay for an output provided by a model depends on the number of characters of the output – they pay a fixed price per character. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the `Llama`, `Gemma` and `Ministral` families, and input prompts from the LMSYS Chatbot Arena platform.
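The incentive gap in a few lines of arithmetic: a reported token count is unverifiable by the user, while a character count can simply be recounted from the visible output. The prices below are made-up illustration values.

```python
output = "The answer is 42."
true_tokens, reported_tokens = 6, 9           # inflation the user cannot detect
price_per_token, price_per_char = 0.002, 0.0005

print("honest token bill:   ", true_tokens * price_per_token)
print("inflated token bill: ", reported_tokens * price_per_token)
print("character bill:      ", len(output) * price_per_char)  # user can recount
```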
zh

[AI-96] Preventing Adversarial AI Attacks Against Autonomous Situational Awareness: A Maritime Case Study

【速读】:该论文旨在解决对抗性人工智能(Adversarial AI)攻击对自主运输系统(如海上船舶)造成的安全威胁,具体包括传统防御措施的局限性、安全度量标准的不足以及超越模型级防御构建弹性的需求。其解决方案的关键在于利用多输入和数据融合技术构建防御组件,并引入一种新的AI安全度量标准,从而提升系统的对抗性机器学习攻击抵御能力。该方法被称为数据融合网络弹性(Data Fusion Cyber Resilience, DFCR),通过实证测试与定量分析验证了其有效性,结果显示DFCR在多种对抗攻击场景下显著降低了损失,提升了系统的鲁棒性和决策能力。

链接: https://arxiv.org/abs/2505.21609
作者: Mathew J. Walter,Aaron Barrett,Kimberly Tam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial artificial intelligence (AI) attacks pose a significant threat to autonomous transportation, such as maritime vessels, that rely on AI components. Malicious actors can exploit these systems to deceive and manipulate AI-driven operations. This paper addresses three critical research challenges associated with adversarial AI: the limited scope of traditional defences, inadequate security metrics, and the need to build resilience beyond model-level defences. To address these challenges, we propose building defences utilising multiple inputs and data fusion to create defensive components and an AI security metric as a novel approach toward developing more secure AI systems. We name this approach the Data Fusion Cyber Resilience (DFCR) method, and we evaluate it through real-world demonstrations and comprehensive quantitative analyses, comparing a system built with the DFCR method against single-input models and models utilising existing state-of-the-art defences. The findings show that the DFCR approach significantly enhances resilience against adversarial machine learning attacks in maritime autonomous system operations, achieving up to a 35% reduction in loss for successful multi-pronged perturbation attacks, up to a 100% reduction in loss for successful adversarial patch attacks and up to 100% reduction in loss for successful spoofing attacks when using these more resilient systems. We demonstrate how DFCR and DFCR confidence scores can reduce adversarial AI contact confidence and improve decision-making by the system, even when typical adversarial defences have been compromised. Ultimately, this work contributes to the development of more secure and resilient AI-driven systems against adversarial attacks.
zh

[AI-97] SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在面对高风险科学场景时的安全性评估不足问题,尤其是在处理知识密集型、潜在危害性内容时的鲁棒性缺乏系统性研究。现有安全基准主要关注低风险或浅层知识理解的任务,无法有效评估模型在复杂、危险情境下的表现。解决方案的关键在于提出SOSBench,这是一个基于法规、聚焦于危害性的基准测试框架,涵盖化学、生物、医学、药理学、物理和心理学六个高风险科学领域,包含3000个源自真实法规和法律的提示,并通过LLM辅助的进化流程生成多样化的现实滥用场景,从而更全面地评估模型的安全性。

链接: https://arxiv.org/abs/2505.21605
作者: Fengqing Jiang,Fengbo Ma,Zhangchen Xu,Yuetai Li,Bhaskar Ramasubramanian,Luyao Niu,Bo Li,Xianyan Chen,Zhen Xiang,Radha Poovendran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.
zh

[AI-98] Public Discourse Sandbox: Facilitating Human and AI Digital Communication Research

【速读】:该论文试图解决在社交媒体上进行数字讨论干预研究时面临的数据获取困难、成本高、可靠性差以及伦理问题,同时缺乏可控且可扩展的机制来评估干预效果。其解决方案的关键在于引入Public Discourse Sandbox (PDS),这是一个用于人类与AI及AI与AI对话研究、测试和训练的数字话语研究平台,提供了一个安全、受控的实验环境,支持通过提示工程、检索增强生成(RAG)和微调等技术来理解AI行为及其对定制化AI参与者的影响力。

链接: https://arxiv.org/abs/2505.21604
作者: Kristina Radivojevic,Caleb Reinking,Shaun Whitfield,Paul Brenner
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Social media serves as a primary communication and information dissemination platform for major global events, entertainment, and niche or topically focused community discussions. Therefore, it represents a valuable resource for researchers who aim to understand numerous questions. However, obtaining data can be difficult, expensive, and often unreliable due to the presence of bots, fake accounts, and manipulated content. Additionally, there are ethical concerns if researchers decide to conduct an online experiment without explicitly notifying social media users about their intent. There is a need for more controlled and scalable mechanisms to evaluate the impacts of digital discussion interventions on audiences. We introduce the Public Discourse Sandbox (PDS), which serves as a digital discourse research platform for human-AI as well as AI-AI discourse research, testing, and training. PDS provides a safe and secure space for research experiments that are not viable on public, commercial social media platforms. Its main purpose is to enable the understanding of AI behaviors and the impacts of customized AI participants via techniques such as prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. We provide a hosted live version of the sandbox to support researchers as well as the open-sourced code on GitHub for community collaboration and contribution.
zh

[AI-99] Leverag ing XP and CRISP-DM for Agile Data Science Projects

【速读】:该论文试图解决如何将敏捷开发方法eXtreme Programming (XP)与跨行业数据挖掘标准流程Cross-Industry Standard Process for Data Mining (CRISP-DM)在数据科学项目中进行整合的问题。解决方案的关键在于通过实证研究验证XP的敏捷性可以与CRISP-DM的结构化流程相结合,从而为数据科学项目提供一种既灵活又协作的方法论框架。

链接: https://arxiv.org/abs/2505.21603
作者: Andre Massahiro Shimaoka,Renato Cordeiro Ferreira,Alfredo Goldman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study explores the integration of eXtreme Programming (XP) and the Cross-Industry Standard Process for Data Mining (CRISP-DM) in agile Data Science projects. We conducted a case study at the e-commerce company Elo7 to answer the research question: How can the agility of the XP method be integrated with CRISP-DM in Data Science projects? Data was collected through interviews and questionnaires with a Data Science team consisting of data scientists, ML engineers, and data product managers. The results show that 86% of the team frequently or always applies CRISP-DM, while 71% adopt XP practices in their projects. Furthermore, the study demonstrates that it is possible to combine CRISP-DM with XP in Data Science projects, providing a structured and collaborative approach. Finally, the study generated improvement recommendations for the company.
zh

[AI-100] Relevance-driven Input Dropout: an Explanation-guided Regularization Technique

【速读】:该论文旨在解决机器学习模型中的过拟合问题(overfitting),特别是针对状态最先进(state-of-the-art, SOTA)模型在训练与测试集之间表现差异较大的问题。其解决方案的关键在于提出一种名为Relevance-driven Input Dropout (RelDrop)的新颖数据增强方法,该方法通过有选择性地遮蔽输入中最具相关性的区域,促使模型在预测过程中依赖其他重要特征,从而实现更有效的正则化和模型泛化能力的提升。

链接: https://arxiv.org/abs/2505.21595
作者: Shreyas Gururaj,Lars Grüne,Wojciech Samek,Sebastian Lapuschkin,Leander Weber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Overfitting is a well-known issue extending even to state-of-the-art (SOTA) Machine Learning (ML) models, resulting in reduced generalization, and a significant train-test performance gap. Mitigation measures include a combination of dropout, data augmentation, weight decay, and other regularization techniques. Among the various data augmentation strategies, occlusion is a prominent technique that typically focuses on randomly masking regions of the input during training. Most of the existing literature emphasizes randomness in selecting and modifying the input features instead of regions that strongly influence model decisions. We propose Relevance-driven Input Dropout (RelDrop), a novel data augmentation method which selectively occludes the most relevant regions of the input, nudging the model to use other important features in the prediction process, thus improving model generalization through informed regularization. We further conduct qualitative and quantitative analyses to study how Relevance-driven Input Dropout (RelDrop) affects model decision-making. Through a series of experiments on benchmark datasets, we demonstrate that our approach improves robustness towards occlusion, results in models utilizing more features within the region of interest, and boosts inference time generalization performance. Our code is available at this https URL.
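A hedged sketch of relevance-driven occlusion: mask the top fraction of most relevant inputs per sample rather than random ones. The random relevance map here stands in for a real attribution (e.g., from LRP or another explanation method).

```python
import torch

def reldrop_mask(x, relevance, drop_frac=0.1):
    flat = relevance.flatten(1)
    k = int(drop_frac * flat.shape[1])
    thresh = flat.topk(k, dim=1).values[:, -1, None]  # per-sample k-th largest
    keep = (flat < thresh).float().view_as(x)
    return x * keep                                   # occlude most relevant

x = torch.randn(4, 1, 8, 8)
relevance = torch.rand_like(x)                        # stand-in attribution
x_aug = reldrop_mask(x, relevance, drop_frac=0.1)
print((x_aug == 0).float().mean())  # roughly drop_frac of inputs occluded
```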
zh

[AI-101] Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

【速读】:该论文旨在解决在资源受限的边缘设备上部署大型语言模型(Large Language Models, LLMs)时面临的计算资源限制、高延迟和隐私问题。其核心挑战在于如何在不依赖昂贵云端API的情况下,实现高效、低延迟的模型推理。解决方案的关键在于提出一种快速且成本效益高的推测性边缘-云解码框架,该框架在服务器端使用大目标模型,在设备端使用小草稿模型,并通过在目标模型中引入早期退出机制,使客户端能够在最终验证前预 drafts 后续的token,从而利用空闲时间并提升边缘与云端之间的并行性。

链接: https://arxiv.org/abs/2505.21594
作者: Yeshwanth Venkatesha,Souvik Kundu,Priyadarshini Panda
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before final verification, thus utilizing idle time and enhancing parallelism between edge and cloud. Using an NVIDIA Jetson Nano (client) and an A100 GPU (server) with Vicuna-68M (draft) and Llama2-7B (target) models, our method achieves up to a 35% reduction in latency compared to cloud-based autoregressive decoding, with an additional 11% improvement from preemptive drafting. To demonstrate real-world applicability, we deploy our method on the Unitree Go2 quadruped robot using Vision-Language Model (VLM) based control, achieving a 21% speedup over traditional cloud-based autoregressive decoding. These results demonstrate the potential of our framework for real-time LLM and VLM applications on resource-constrained edge devices.
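The draft-and-verify loop underlying speculative decoding, schematically: a small on-device model proposes a few tokens and the server model keeps the longest agreeing prefix. The paper's early exits and preemptive drafting are omitted, and both models are toy stand-ins.

```python
import random
random.seed(0)

def draft_model(prefix, k=4):              # small, fast, sometimes wrong
    return [(t + 1) % 10 for t in range(len(prefix), len(prefix) + k)]

def target_model(prefix):                  # large, authoritative next token
    return (len(prefix) + 1) % 10 if random.random() < 0.8 else 0

tokens = [0]
while len(tokens) < 20:
    proposal = draft_model(tokens)
    for tok in proposal:
        t_tok = target_model(tokens)
        if t_tok == tok:                   # verification accepts the draft
            tokens.append(tok)
        else:
            tokens.append(t_tok)           # fall back to the target's token
            break
print(tokens)
```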
zh

[AI-102] Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning

【速读】:该论文旨在解决扩散模型中实现4-bit浮点数(FP)量化时面临的性能不一致问题。现有方法主要依赖整数量化和训练后量化微调,但在处理不对称激活分布、时间复杂性考虑不足以及微调损失与量化误差不匹配等方面存在局限。其解决方案的关键在于提出混合签名浮点量化(MSFP)框架,首次引入无符号FP量化,并结合时间步感知的低秩适配(TALoRA)和去噪因子损失对齐(DFA),以实现精确且稳定的微调。

链接: https://arxiv.org/abs/2505.21591
作者: Maosen Zhao,Pengtao Chen,Chong Yu,Yan Wen,Xudong Tan,Tao Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model quantization reduces the bit-width of weights and activations, improving memory efficiency and inference speed in diffusion models. However, achieving 4-bit quantization remains challenging. Existing methods, primarily based on integer quantization and post-training quantization fine-tuning, struggle with inconsistent performance. Inspired by the success of floating-point (FP) quantization in large language models, we explore low-bit FP quantization for diffusion models and identify key challenges: the failure of signed FP quantization to handle asymmetric activation distributions, the insufficient consideration of temporal complexity in the denoising process during fine-tuning, and the misalignment between fine-tuning loss and quantization error. To address these challenges, we propose the mixup-sign floating-point quantization (MSFP) framework, first introducing unsigned FP quantization in model quantization, along with timestep-aware LoRA (TALoRA) and denoising-factor loss alignment (DFA), which ensure precise and stable fine-tuning. Extensive experiments show that we are the first to achieve superior performance in 4-bit FP quantization for diffusion models, outperforming existing PTQ fine-tuning methods in 4-bit INT quantization.
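A toy unsigned 4-bit floating-point quantizer in the spirit of MSFP's unsigned format: 2 exponent bits plus 2 mantissa bits and no sign bit, so all 16 codes cover the non-negative activation range. The exact format and scaling are assumptions for illustration.

```python
import numpy as np

def unsigned_fp4_grid(e_bits=2, m_bits=2):
    grid = [0.0]
    for e in range(2 ** e_bits):
        for m in range(2 ** m_bits):
            if e == 0:                                   # subnormals
                grid.append(m / 2 ** m_bits)
            else:                                        # normals, bias = 1
                grid.append((1 + m / 2 ** m_bits) * 2 ** (e - 1))
    return np.unique(grid)

def quantize(x, grid):
    idx = np.abs(x[..., None] - grid).argmin(-1)         # snap to nearest code
    return grid[idx]

grid = unsigned_fp4_grid()                               # 16 values: 0 .. 7
acts = np.abs(np.random.default_rng(0).normal(size=8)) * 2
print(grid)
print(quantize(acts, grid))
```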
zh

[AI-103] Herd Behavior: Investigating Peer Influence in LLM -based Multi-Agent Systems

【速读】:该论文试图解决多智能体系统中基于大型语言模型(Large Language Models, LLMs)的群体行为动态问题,特别是智能体在共享环境中受同伴影响而产生的从众行为(herd behavior)。其解决方案的关键在于通过一系列控制实验揭示影响从众行为的多个因素,包括自我信心与对同伴信心的差距、同伴信息的呈现形式以及从众倾向的系统性调控,从而为提升多智能体协作效果提供理论支持和实践路径。

链接: https://arxiv.org/abs/2505.21588
作者: Young-Min Cho,Sharath Chandra Guntuku,Lyle Ungar
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have enabled the emergence of multi-agent systems where LLMs interact, collaborate, and make decisions in shared environments. While individual model behavior has been extensively studied, the dynamics of peer influence in such systems remain underexplored. In this paper, we investigate herd behavior, the tendency of agents to align their outputs with those of their peers, within LLM-based multi-agent interactions. We present a series of controlled experiments that reveal how herd behaviors are shaped by multiple factors. First, we show that the gap between self-confidence and perceived confidence in peers significantly impacts an agent’s likelihood to conform. Second, we find that the format in which peer information is presented plays a critical role in modulating the strength of herd behavior. Finally, we demonstrate that the degree of herd behavior can be systematically controlled, and that appropriately calibrated herd tendencies can enhance collaborative outcomes. These findings offer new insights into the social dynamics of LLM-based systems and open pathways for designing more effective and adaptive multi-agent collaboration frameworks.
zh

[AI-104] CellCLAT: Preserving Topology and Trimming Redundancy in Self-Supervised Cellular Contrastive Learning

【速读】:该论文旨在解决自监督拓扑深度学习(TDL)中细胞复形(cellular complexes)的表示学习问题,特别是针对其固有的结构约束和语义冗余带来的挑战。传统图增强技术可能破坏高阶细胞相互作用的完整性,而细胞复形中的拓扑冗余则可能削弱任务相关信息。解决方案的关键在于提出一种名为CellCLAT的双阶段框架,该框架通过基于参数扰动的增强方法在不改变细胞结构的前提下注入可控噪声,从而保持细胞拓扑结构;同时引入细胞裁剪调度器,利用双层元学习方法屏蔽任务无关细胞的梯度贡献,有效去除冗余拓扑元素并保留关键高阶语义。

链接: https://arxiv.org/abs/2505.21587
作者: Bin Qin,Qirui Ji,Jiangmeng Li,Yupeng Wang,Xuesong Wu,Jianwen Cao,Fanjiang Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervised topological deep learning (TDL) represents a nascent but underexplored area with significant potential for modeling higher-order interactions in simplicial complexes and cellular complexes to derive representations of unlabeled graphs. Compared to simplicial complexes, cellular complexes exhibit greater expressive power. However, the advancement in self-supervised learning for cellular TDL is largely hindered by two core challenges: extrinsic structural constraints inherent to cellular complexes, and intrinsic semantic redundancy in cellular representations. The first challenge highlights that traditional graph augmentation techniques may compromise the integrity of higher-order cellular interactions, while the second underscores that topological redundancy in cellular complexes potentially diminishes task-relevant information. To address these issues, we introduce Cellular Complex Contrastive Learning with Adaptive Trimming (CellCLAT), a twofold framework designed to adhere to the combinatorial constraints of cellular complexes while mitigating informational redundancy. Specifically, we propose a parameter perturbation-based augmentation method that injects controlled noise into cellular interactions without altering the underlying cellular structures, thereby preserving cellular topology during contrastive learning. Additionally, a cellular trimming scheduler is employed to mask gradient contributions from task-irrelevant cells through a bi-level meta-learning approach, effectively removing redundant topological elements while maintaining critical higher-order semantics. We provide theoretical justification and empirical validation to demonstrate that CellCLAT achieves substantial improvements over existing self-supervised graph learning methods, marking a significant attempt in this domain.
zh

[AI-105] Fairness in Federated Learning: Fairness for Whom?

【速读】:该论文试图解决联邦学习(Federated Learning, FL)中公平性研究存在的问题,即现有方法往往过于关注系统层面的狭义指标,如性能均等或贡献奖励,而忽视了在FL生命周期中产生的危害及其对不同利益相关者的影响。解决方案的关键在于提出一种以危害为中心的框架,将公平性定义与具体风险及利益相关者的脆弱性相联系,从而推动更加全面、情境感知和可问责的公平性研究。

链接: https://arxiv.org/abs/2505.21584
作者: Afaf Taik,Khaoula Chehbouni,Golnoosh Farnadi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Fairness in federated learning has emerged as a rapidly growing area of research, with numerous works proposing formal definitions and algorithmic interventions. Yet, despite this technical progress, fairness in FL is often defined and evaluated in ways that abstract away from the sociotechnical contexts in which these systems are deployed. In this paper, we argue that existing approaches tend to optimize narrow system level metrics, such as performance parity or contribution-based rewards, while overlooking how harms arise throughout the FL lifecycle and how they impact diverse stakeholders. We support this claim through a critical analysis of the literature, based on a systematic annotation of papers for their fairness definitions, design decisions, evaluation practices, and motivating use cases. Our analysis reveals five recurring pitfalls: 1) fairness framed solely through the lens of server-client architecture, 2) a mismatch between simulations and motivating use-cases and contexts, 3) definitions that conflate protecting the system with protecting its users, 4) interventions that target isolated stages of the lifecycle while neglecting upstream and downstream effects, and 5) a lack of multi-stakeholder alignment where multiple fairness definitions can be relevant at once. Building on these insights, we propose a harm centered framework that links fairness definitions to concrete risks and stakeholder vulnerabilities. We conclude with recommendations for more holistic, context-aware, and accountable fairness research in FL.
zh

[AI-106] AITEE – Agent ic Tutor for Electrical Engineering

【速读】:该论文旨在解决传统智能辅导系统在电气工程教育中难以满足学生个性化需求以及对电路相关具体问题处理能力不足的问题。其解决方案的关键在于构建一个基于代理的辅导系统AITEE,该系统通过适应性的电路重构过程支持手绘与数字电路的交互,并利用基于图的相似性度量结合检索增强生成方法,从课程材料中识别相关上下文,同时通过并行SPICE仿真提高解题方法应用的准确性,此外还引入苏格拉底式对话以促进学习者自主性。

链接: https://arxiv.org/abs/2505.21582
作者: Christopher Knievel,Alexander Bernhardt,Christian Bernhardt
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 12 pages, 11 figures, 6 tables

点击查看摘要

Abstract:Intelligent tutoring systems combined with large language models offer a promising approach to address students’ diverse needs and promote self-efficacious learning. While large language models possess good foundational knowledge of electrical engineering basics, they remain insufficiently capable of addressing specific questions about electrical circuits. In this paper, we present AITEE, an agent-based tutoring system for electrical engineering designed to accompany students throughout their learning process, offer individualized support, and promote self-directed learning. AITEE supports both hand-drawn and digital circuits through an adapted circuit reconstruction process, enabling natural interaction with students. Our novel graph-based similarity measure identifies relevant context from lecture materials through a retrieval augmented generation approach, while parallel Spice simulation further enhances accuracy in applying solution methodologies. The system implements a Socratic dialogue to foster learner autonomy through guided questioning. Experimental evaluations demonstrate that AITEE significantly outperforms baseline approaches in domain-specific knowledge application, with even medium-sized LLM models showing acceptable performance. Our results highlight the potential of agentic tutors to deliver scalable, personalized, and effective learning environments for electrical engineering education.
zh

[AI-107] RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

【速读】:该论文试图解决在复杂任务中有效利用GitHub上开源代码仓库的问题,现有框架如OpenHands和SWE-Agent难以充分利用这些资源,主要受限于信息过载、依赖关系复杂以及当前大语言模型(Large Language Models, LLMs)的上下文窗口限制。其解决方案的关键在于提出RepoMaster,通过构建函数调用图、模块依赖图和层次化代码树,识别并提取核心组件,仅向LLMs提供必要的信息,从而提升理解和执行效率。在自主执行过程中,RepoMaster利用探索工具逐步挖掘相关组件,并通过信息剪枝优化上下文使用。

链接: https://arxiv.org/abs/2505.21577
作者: Huacan Wang,Ziyi Ni,Shuo Zhang,Shuo Lu,Sen Hu,Ziyang He,Chen Hu,Jiaye Lin,Yifu Guo,Yuntao Du,Pin Lyu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: A novel approach; Very practical

点击查看摘要

Abstract:The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real-world tasks typically demand full-fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open-source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE-Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE-bench, RepoMaster achieves a 110% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 24.1% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at this https URL.
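The structural index in miniature: walk a Python module with the standard-library `ast` and record which function calls which, yielding the kind of function-call graph from which core components can be selected. Real repositories need import and scope resolution well beyond this toy.

```python
import ast

source = '''
def load(path): return open(path).read()
def parse(text): return text.split()
def main(): parse(load("data.txt"))
'''

calls = {}
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        callees = [n.func.id for n in ast.walk(node)
                   if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]
        calls[node.name] = callees

print(calls)  # {'load': ['open'], 'parse': [], 'main': ['parse', 'load']}
```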
zh

[AI-108] Concentration Distribution Learning from Label Distributions

【速读】:该论文试图解决标签分布学习(Label Distribution Learning, LDL)中标签分布未能完整表示实例的问题,因为其忽略了每个标签的绝对强度,导致无法获取标签空间外隐藏标签的总描述度,从而造成信息丢失和实例混淆。解决方案的关键在于引入“背景浓度”这一新概念,作为标签分布的绝对描述度项,并将其融入LDL过程,形成改进的浓度分布学习范式。通过概率方法和神经网络构建的新模型,能够从现有的LDL数据集中学习标签分布和背景浓度,实验表明该方法在提取背景浓度的同时,相比现有最先进方法能产生更精确的预测结果。

链接: https://arxiv.org/abs/2505.21576
作者: Jiawei Tang,Yuheng Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Label distribution learning (LDL) is an effective method to predict the relative label description degree (a.k.a. label distribution) of a sample. However, the label distribution is not a complete representation of an instance because it overlooks the absolute intensity of each label. Specifically, it's impossible to obtain the total description degree of hidden labels that are not in the label space, which leads to the loss of information and confusion in instances. To solve the above problem, we come up with a new concept named background concentration to serve as the absolute description degree term of the label distribution and introduce it into the LDL process, forming the improved paradigm of concentration distribution learning. Moreover, we propose a novel model by probabilistic methods and neural networks to learn label distributions and background concentrations from existing LDL datasets. Extensive experiments prove that the proposed approach is able to extract background concentrations from label distributions while producing more accurate prediction results than the state-of-the-art LDL methods. The code is available in this https URL.
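A small numeric illustration of the gap the paper targets: two instances can share the same relative label distribution while differing in how strongly the label space describes them at all. The "background concentration" is the leftover description degree outside the label space; all numbers here are invented.

```python
import numpy as np

labels = ["happy", "calm", "sad"]
d = np.array([0.6, 0.3, 0.1])        # identical relative label distribution

for b in (0.05, 0.60):               # background concentration per instance
    absolute = (1 - b) * d           # absolute description degrees
    print(f"background={b:.2f} -> absolute degrees:", absolute.round(3))
```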
zh

[AI-109] StreamLink: Large-Language-Model Driven Distributed Data Engineering System CIKM2024

【速读】:该论文试图解决传统数据工程任务中效率低、用户交互复杂以及数据隐私保护不足的问题。其解决方案的关键在于引入基于大型语言模型(Large Language Models, LLMs)的流式链接系统(StreamLink),该系统依托分布式框架如Apache Spark和Hadoop,通过本地微调的LLMs实现对用户自然语言查询的理解与数据库查询语句(如SQL)的生成,同时结合LLM驱动的语法与安全检查机制,确保生成查询的可靠性和安全性,从而提升数据处理的效率与用户友好性。

链接: https://arxiv.org/abs/2505.21575
作者: Dawei Feng,Di Mei,Huiri Tan,Lei Ren,Xianying Lou,Zhangxi Tan
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: Accepted by CIKM Workshop 2024, this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink - an LLM-driven distributed data system designed to improve the efficiency and accessibility of data engineering tasks. We build StreamLink on top of distributed frameworks such as Apache Spark and Hadoop to handle large data at scale. One of the important design philosophies of StreamLink is to respect user data privacy by utilizing local fine-tuned LLMs instead of a public AI service like ChatGPT. With help from domain-adapted LLMs, we can improve our system’s understanding of natural language queries from users in various scenarios and simplify the procedure of generating database queries like the Structured Query Language (SQL) for information processing. We also incorporate LLM-based syntax and security checkers to guarantee the reliability and safety of each generated query. StreamLink illustrates the potential of merging generative LLMs with distributed data processing for comprehensive and user-centric data engineering. With this architecture, we allow users to interact with complex database systems at different scales in a user-friendly and security-ensured manner, where the SQL generation reaches over 10% of execution accuracy compared to baseline methods, and allow users to find the most concerned item from hundreds of millions of items within a few seconds using natural language.
zh
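
以下示意如何对 LLM 生成的 SQL 做“安全 + 语法”双重检查:先用关键字黑名单拒绝写操作,再在空的内存数据库上用 EXPLAIN 试编译。这只是思路演示,并非 StreamLink 的实际检查器,示例 schema 与规则均为虚构:

```python
import re
import sqlite3

FORBIDDEN = re.compile(r"\b(drop|delete|update|insert|alter|grant|attach)\b", re.I)

def check_generated_sql(sql: str, schema: str) -> bool:
    """对 LLM 生成的 SQL 做安全与语法双重检查(极简示意)。"""
    if FORBIDDEN.search(sql):              # 安全检查:拒绝写操作与危险语句
        return False
    conn = sqlite3.connect(":memory:")     # 语法检查:在内存库上试编译
    conn.executescript(schema)
    try:
        conn.execute(f"EXPLAIN {sql}")     # EXPLAIN 仅编译查询计划,不返回业务数据
    except sqlite3.Error:
        return False
    finally:
        conn.close()
    return True

schema = "CREATE TABLE items(id INTEGER, name TEXT, price REAL);"
print(check_generated_sql("SELECT name FROM items WHERE price < 10", schema))  # True
print(check_generated_sql("DROP TABLE items", schema))                         # False
```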

[AI-110] Spectral-inspired Neural Operator for Data-efficient PDE Simulation in Physics-agnostic Regimes

【速读】:该论文旨在解决传统数值求解器在物理规律未知或需要快速推理时的适用性受限问题,以及数据驱动的神经偏微分方程(PDE)求解器在数据稀缺情况下性能不佳的问题。其解决方案的关键在于提出一种名为Spectral-inspired Neural Operator (SINO) 的新框架,该框架通过在频域中操作并引入频率到向量模块来学习类似于导数乘子的谱表示,从而在仅需少量轨迹(2-5条)的情况下无需任何已知的PDE项即可学习PDE算子,并通过非线性算子块和算子蒸馏技术实现高效推理与强泛化能力。

链接: https://arxiv.org/abs/2505.21573
作者: Han Wan,Rui Zhang,Hao Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Partial differential equations (PDEs) govern the spatiotemporal evolution of various physical systems. Classical numerical solvers, while accurate, require fine discretization and full knowledge of the governing PDEs, limiting their applicability when the physics is unknown or fast inference is required. Data-driven neural PDE solvers alleviate these constraints by learning from data but demand large training datasets and perform poorly in data-scarce regimes. Physics-aware methods mitigate data requirements by incorporating physical knowledge yet rely on known PDE terms or local numerical schemes, restricting their ability to handle unknown or globally coupled systems. In this work, we propose the Spectral-inspired Neural Operator (SINO), a novel framework that learns PDE operators from limited trajectories (as few as 2-5), without any known PDE terms. SINO operates in the frequency domain and introduces a Frequency-to-Vector module to learn spectral representations analogous to derivative multipliers. To model nonlinear physical interactions, we design a nonlinear operator block that includes a \Pi -Block with low-pass filtering to prevent aliasing. Finally, we introduce an operator distillation technique to distill the trained model for efficient inference. SINO achieves state-of-the-art results across multiple PDE benchmarks, demonstrating strong discretization invariance and robust generalization to out-of-distribution initial conditions. To our knowledge, SINO is the first physics-aware method capable of accurately simulating globally coupled systems (e.g., the Navier-Stokes equations) from limited data without any explicit PDE terms.
zh
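
下面用 PyTorch 写一个带低通截断的频域可学习乘子层,示意“学习类似导数乘子的谱表示 + 低通防混叠”的基本思想。这是接近 FNO 的简化写法,并非 SINO 的原始实现,保留模数 n_modes 等取值均为假设:

```python
import torch
import torch.nn as nn

class SpectralLayer(nn.Module):
    """频域可学习乘子 + 低通截断的极简示意(非 SINO 原实现)。"""
    def __init__(self, n_modes: int):
        super().__init__()
        self.n_modes = n_modes    # 低通滤波:只保留前 n_modes 个频率,抑制混叠
        # 可学习复数乘子,作用类似摘要中所说的“导数乘子”
        self.weights = nn.Parameter(0.02 * torch.randn(n_modes, dtype=torch.cfloat))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        u_hat = torch.fft.rfft(u, dim=-1)                  # 物理域 -> 频域
        out_hat = torch.zeros_like(u_hat)
        out_hat[..., :self.n_modes] = u_hat[..., :self.n_modes] * self.weights
        return torch.fft.irfft(out_hat, n=u.shape[-1], dim=-1)  # 频域 -> 物理域

layer = SpectralLayer(n_modes=16)
u = torch.randn(8, 128)          # 8 条一维场样本
print(layer(u).shape)            # torch.Size([8, 128])
```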

[AI-111] FCOS: A Two-Stage Recoverable Model Pruning Framework for Automatic Modulation Recognition

【速读】:该论文旨在解决传统手动调制识别方法在现代复杂无线通信场景中难以提取可靠信号特征且无法满足实时性要求的问题,以及基于深度学习的自动调制识别(AMR)模型因模型规模大、计算需求高而难以部署在资源受限设备上的问题。其解决方案的关键在于提出一种名为FCOS的细粒度到粗粒度两阶段剪枝框架,通过结合通道级剪枝与层级坍塌诊断,实现极端压缩、高性能和高效推理。具体而言,FCOS在第一阶段利用层次聚类和参数融合进行通道级剪枝,在第二阶段通过层坍塌诊断模块识别并移除因高通道压缩率导致的坍塌层,从而在保持较高分类精度的同时显著降低模型的计算量和参数量。

链接: https://arxiv.org/abs/2505.21571
作者: Yao Lu,Tengfei Ma,Zeyu Wang,Zhuangzhi Chen,Dongwei Xu,Yun Lin,Qi Xuan,Guan Gui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid development of wireless communications and the growing complexity of digital modulation schemes, traditional manual modulation recognition methods struggle to extract reliable signal features and meet real-time requirements in modern scenarios. Recently, deep learning based Automatic Modulation Recognition (AMR) approaches have greatly improved classification accuracy. However, their large model sizes and high computational demands hinder deployment on resource-constrained devices. Model pruning provides a general approach to reduce model complexity, but existing weight, channel, and layer pruning techniques each present a trade-off between compression rate, hardware acceleration, and accuracy preservation. To this end, in this paper, we introduce FCOS, a novel Fine-to-COarse two-Stage pruning framework that combines channel-level pruning with layer-level collapse diagnosis to achieve extreme compression, high performance and efficient inference. In the first stage of FCOS, hierarchical clustering and parameter fusion are applied to channel weights to achieve channel-level pruning. Then a Layer Collapse Diagnosis (LaCD) module uses linear probing to identify layer collapse and removes the collapsed layers due to high channel compression ratio. Experiments on multiple AMR benchmarks demonstrate that FCOS outperforms existing channel and layer pruning methods. Specifically, FCOS achieves 95.51% FLOPs reduction and 95.31% parameter reduction while still maintaining performance close to the original ResNet56, with only a 0.46% drop in accuracy on Sig2019-12. Code is available at this https URL.
zh
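
以下用 scipy 的层次聚类示意第一阶段“通道聚类 + 参数融合”的做法:把相似的输出通道聚为一簇并融合成一个通道。聚类方式与融合策略(簇内取均值)均为笔者假设,并非论文原实现:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def fuse_channels(weight: np.ndarray, n_keep: int) -> np.ndarray:
    """对卷积层输出通道做层次聚类并融合参数(通道级剪枝的极简示意)。

    weight: [out_channels, in_channels * k * k] 展平后的卷积核。
    """
    Z = linkage(weight, method="average")                  # 层次聚类
    labels = fcluster(Z, t=n_keep, criterion="maxclust")   # 切成至多 n_keep 簇
    fused = np.stack([weight[labels == c].mean(axis=0)     # 同簇通道取均值融合
                      for c in np.unique(labels)])
    return fused

w = np.random.randn(64, 3 * 3 * 3)        # 64 个输出通道的展平权重(假设数据)
print(fuse_channels(w, n_keep=16).shape)  # 约 (16, 27)
```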

[AI-112] Beyond Explainability: The Case for AI Validation

【速读】:该论文试图解决当前人工智能知识(Artificial Knowledge, AK)系统在关键领域决策中因透明度不足而带来的治理挑战,传统以可解释性(explainability)为核心的监管方法无法有效应对这些问题。论文提出的解决方案的关键在于将验证(validation)作为核心监管支柱,通过确保AI输出的可靠性、一致性和鲁棒性,提供一种更实用、可扩展且风险敏感的替代方案,尤其适用于那些可解释性在技术或经济上不可行的高风险场景。

链接: https://arxiv.org/abs/2505.21570
作者: Dalit Ken-Dror Feldman,Daniel Benoliel
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial Knowledge (AK) systems are transforming decision-making across critical domains such as healthcare, finance, and criminal justice. However, their growing opacity presents governance challenges that current regulatory approaches, focused predominantly on explainability, fail to address adequately. This article argues for a shift toward validation as a central regulatory pillar. Validation, ensuring the reliability, consistency, and robustness of AI outputs, offers a more practical, scalable, and risk-sensitive alternative to explainability, particularly in high-stakes contexts where interpretability may be technically or economically unfeasible. We introduce a typology based on two axes, validity and explainability, classifying AK systems into four categories and exposing the trade-offs between interpretability and output reliability. Drawing on comparative analysis of regulatory approaches in the EU, US, UK, and China, we show how validation can enhance societal trust, fairness, and safety even where explainability is limited. We propose a forward-looking policy framework centered on pre- and post-deployment validation, third-party auditing, harmonized standards, and liability incentives. This framework balances innovation with accountability and provides a governance roadmap for responsibly integrating opaque, high-performing AK systems into society.
zh

[AI-113] VoiceMark: Zero-Shot Voice Cloning-Resistant Watermarking Approach Leveraging Speaker-Specific Latents INTERSPEECH2025

【速读】:该论文试图解决在零样本语音克隆(zero-shot VC)场景下,现有水印技术无法有效追踪和防止未经授权的语音克隆问题。现有方法通过在水印音频上训练传统语音克隆模型来实现溯源,但在零样本VC场景中,模型直接从音频提示生成音频而无需训练,导致水印失效。解决方案的关键在于提出VoiceMark,这是首个针对零样本VC的抗水印方法,其核心是利用说话人特定潜在表示(speaker-specific latents)作为水印载体,使水印能够通过零样本VC过程传递到生成的音频中。此外,通过引入VC模拟增强和基于语音活动检测(VAD)的损失函数,进一步提升了水印的鲁棒性。

链接: https://arxiv.org/abs/2505.21568
作者: Haiyun Li,Zhiyong Wu,Xiaofeng Xie,Jingran Xie,Yaoxun Xu,Hanyang Peng
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2025

点击查看摘要

Abstract:Voice cloning (VC)-resistant watermarking is an emerging technique for tracing and preventing unauthorized cloning. Existing methods effectively trace traditional VC models by training them on watermarked audio but fail in zero-shot VC scenarios, where models synthesize audio from an audio prompt without training. To address this, we propose VoiceMark, the first zero-shot VC-resistant watermarking method that leverages speaker-specific latents as the watermark carrier, allowing the watermark to transfer through the zero-shot VC process into the synthesized audio. Additionally, we introduce VC-simulated augmentations and VAD-based loss to enhance robustness against distortions. Experiments on multiple zero-shot VC models demonstrate that VoiceMark achieves over 95% accuracy in watermark detection after zero-shot VC synthesis, significantly outperforming existing methods, which only reach around 50%. See our code and demos at: this https URL
zh

[AI-114] Towards Human-Like Trajectory Prediction for Autonomous Driving: A Behavior-Centric Approach

【速读】:该论文旨在解决复杂动态交通环境中车辆轨迹预测的问题,这是自动驾驶(AD)系统发展中的关键挑战。其解决方案的关键在于提出HiT(Human-like Trajectory Prediction)模型,该模型通过引入行为感知模块和动态中心性度量,构建了一个动态框架,以捕捉交通参与者之间的直接和间接交互,从而更准确地模拟人类驾驶行为,提升轨迹预测的精度与可靠性。

链接: https://arxiv.org/abs/2505.21565
作者: Haicheng Liao,Zhenning Li,Guohui Zhang,Keqiang Li,Chengzhong Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predicting the trajectories of vehicles is crucial for the development of autonomous driving (AD) systems, particularly in complex and dynamic traffic environments. In this study, we introduce HiT (Human-like Trajectory Prediction), a novel model designed to enhance trajectory prediction by incorporating behavior-aware modules and dynamic centrality measures. Unlike traditional methods that primarily rely on static graph structures, HiT leverages a dynamic framework that accounts for both direct and indirect interactions among traffic participants. This allows the model to capture the subtle yet significant influences of surrounding vehicles, enabling more accurate and human-like predictions. To evaluate HiT’s performance, we conducted extensive experiments using diverse and challenging real-world datasets, including NGSIM, HighD, RounD, ApolloScape, and MoCAD++. The results demonstrate that HiT consistently outperforms other top models across multiple metrics, particularly excelling in scenarios involving aggressive driving behaviors. This research presents a significant step forward in trajectory prediction, offering a more reliable and interpretable approach for enhancing the safety and efficiency of fully autonomous driving systems.
zh

[AI-115] Fog Intelligence for Network Anomaly Detection

【速读】:该论文试图解决在移动通信网络中由于规模和复杂性不断增加,以及网络监控数据量和维度持续上升,导致难以有效监测和发现异常网络行为的问题。解决方案的关键在于提出一种名为雾智能(fog intelligence)的分布式机器学习架构,该架构结合了边缘计算和集中式云计算的优势,具备可扩展性、隐私保护能力,并适用于分布式无线网络的智能化管理。

链接: https://arxiv.org/abs/2505.21563
作者: Kai Yang,Hui Ma,Shaoyu Dou
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: published in IEEE Network

点击查看摘要

Abstract:Anomalies are common in network system monitoring. When manifested as network threats to be mitigated, service outages to be prevented, and security risks to be ameliorated, detecting such anomalous network behaviors becomes of great importance. However, the growing scale and complexity of the mobile communication networks, as well as the ever-increasing amount and dimensionality of the network surveillance data, make it extremely difficult to monitor a mobile network and discover abnormal network behaviors. Recent advances in machine learning allow for obtaining near-optimal solutions to complicated decision-making problems with many sources of uncertainty that cannot be accurately characterized by traditional mathematical models. However, most machine learning algorithms are centralized, which renders them inapplicable to a large-scale distributed wireless networks with tens of millions of mobile devices. In this article, we present fog intelligence, a distributed machine learning architecture that enables intelligent wireless network management. It preserves the advantage of both edge processing and centralized cloud computing. In addition, the proposed architecture is scalable, privacy-preserving, and well suited for intelligent management of a distributed wireless network.
zh

[AI-116] Enhancing Selection of Climate Tech Startups with AI – A Case Study on Integrating Human and AI Evaluations in the ClimaTech Great Global Innovation Challenge

【速读】:该论文试图解决传统创业公司筛选过程中存在的偏见和效率低下问题,通过引入生成式 AI (Generative AI) 与人工评估相结合的混合模型来提高选择气候科技初创企业的准确性和效率。解决方案的关键在于构建一个三阶段的评估流程,包括初始 AI 审核、半决赛的人工评审以及决赛的混合加权评分,其中 AI 和人类评估的权重在不同阶段有所调整,从而实现更公平、客观的决策过程。

链接: https://arxiv.org/abs/2505.21562
作者: Jennifer Turliuk,Alejandro Sevilla,Daniela Gorza,Tod Hynes
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This case study examines the ClimaTech Great Global Innovation Challenge’s approach to selecting climate tech startups by integrating human and AI evaluations. The competition aimed to identify top startups and enhance the accuracy and efficiency of the selection process through a hybrid model. Research shows data-driven approaches help VC firms reduce bias and improve decision-making. Machine learning models have outperformed human investors in deal screening, helping identify high-potential startups. Incorporating AI aimed to ensure more equitable and objective evaluations. The methodology included three phases: initial AI review, semi-finals judged by humans, and finals using a hybrid weighting. In phase one, 57 applications were scored by an AI tool built with StackAI and OpenAI’s GPT-4o, and the top 36 advanced. In the semi-finals, human judges, unaware of AI scores, evaluated startups on team quality, market potential, and technological innovation. Each score - human or AI - was weighted equally, resulting in 75 percent human and 25 percent AI influence. In the finals, with five human judges, weighting shifted to 83.3 percent human and 16.7 percent AI. There was a moderate positive correlation between AI and human scores - Spearman’s ρ = 0.47 - indicating general alignment with key differences. Notably, the final four startups, selected mainly by humans, were among those rated highest by the AI. This highlights the complementary nature of AI and human judgment. The study shows that hybrid models can streamline and improve startup assessments. The ClimaTech approach offers a strong framework for future competitions by combining human expertise with AI capabilities.
zh
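
按摘要描述,若每位评委与 AI 各占一票,则 3 位评委时 AI 权重为 25%,5 位评委时为 16.7%,与文中的 75/25 和 83.3/16.7 一致。下面的小例子示意这种等权融合与 Spearman 秩相关的计算(评分数据均为虚构):

```python
import numpy as np
from scipy.stats import spearmanr

def hybrid_score(human: np.ndarray, ai: np.ndarray, n_judges: int) -> np.ndarray:
    """人类与 AI 评分等权融合:每位评委与 AI 各占一票(极简示意)。"""
    w_ai = 1.0 / (n_judges + 1)          # 3 位评委 -> AI 占 25%;5 位 -> 16.7%
    return (1 - w_ai) * human + w_ai * ai

human = np.array([8.2, 7.5, 9.0, 6.8])   # 评委均分(假设数据)
ai    = np.array([7.9, 8.1, 8.8, 6.5])   # AI 评分(假设数据)
print(hybrid_score(human, ai, n_judges=5))

rho, pval = spearmanr(human, ai)          # 人类与 AI 评分的秩相关
print(rho)
```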

[AI-117] Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework

【速读】:该论文试图解决云原生系统中由于工作负载管理问题(如资源阻塞、瓶颈或持续的Pod崩溃)导致的操作弹性不足问题,尤其是在对抗性场景(如分布式拒绝服务攻击)下的弹性维持难题。传统水平Pod自动扩展(HPA)方法难以应对动态环境,而基于强化学习的方法通常仅优化单一目标(如延迟或资源使用),忽视了更广泛的故障场景。解决方案的关键在于将操作弹性的总体目标分解为针对特定故障的子目标,并通过协作智能体构建一个HPA多智能体系统(MAS),结合数字孪生建模、模拟训练、行为分析与策略迁移的四阶段在线框架,以提升系统在复杂对抗条件下的操作弹性。

链接: https://arxiv.org/abs/2505.21559
作者: Julien Soulé,Jean-Paul Jamont,Michel Occello,Louis-Marie Traonouez,Paul Théron
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In cloud-native systems, Kubernetes clusters with interdependent services often face challenges to their operational resilience due to poor workload management issues such as resource blocking, bottlenecks, or continuous pod crashes. These vulnerabilities are further amplified in adversarial scenarios, such as Distributed Denial-of-Service attacks (DDoS). Conventional Horizontal Pod Autoscaling (HPA) approaches struggle to address such dynamic conditions, while reinforcement learning-based methods, though more adaptable, typically optimize single goals like latency or resource usage, neglecting broader failure scenarios. We propose decomposing the overarching goal of maintaining operational resilience into failure-specific sub-goals delegated to collaborative agents, collectively forming an HPA Multi-Agent System (MAS). We introduce an automated, four-phase online framework for HPA MAS design: 1) modeling a digital twin built from cluster traces; 2) training agents in simulation using roles and missions tailored to failure contexts; 3) analyzing agent behaviors for explainability; and 4) transferring learned policies to the real cluster. Experimental results demonstrate that the generated HPA MASs outperform three state-of-the-art HPA systems in sustaining operational resilience under various adversarial conditions in a proposed complex cluster.
zh

[AI-118] MetaSTNet: Multimodal Meta-learning for Cellular Traffic Conformal Prediction

【速读】:该论文试图解决在仅有少量训练数据的情况下,网络流量预测技术难以实现准确预测的问题。解决方案的关键在于提出一种基于多模态元学习框架的深度学习模型——MetaSTNet,该模型通过在模拟器中训练并迁移元知识到真实环境,从而在仅需少量真实世界训练数据的情况下,快速适应新任务并获得准确的预测结果。

链接: https://arxiv.org/abs/2505.21553
作者: Hui Ma,Kai Yang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Network traffic prediction techniques have attracted much attention since they are valuable for network congestion control and user experience improvement. While existing prediction techniques can achieve favorable performance when there is sufficient training data, it remains a great challenge to make accurate predictions when only a small amount of training data is available. To tackle this problem, we propose a deep learning model, entitled MetaSTNet, based on a multimodal meta-learning framework. It is an end-to-end network architecture that trains the model in a simulator and transfers the meta-knowledge to a real-world environment, which can quickly adapt and obtain accurate predictions on a new task with only a small amount of real-world training data. In addition, we further employ cross conformal prediction to assess the calibrated prediction intervals. Extensive experiments have been conducted on real-world datasets to illustrate the efficiency and effectiveness of MetaSTNet.
zh
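
论文采用 cross conformal prediction 来校准预测区间;下面给出同族方法中更简单的 split conformal 变体示意,说明如何由校准集残差构造 1-α 覆盖率的区间。此为最小演示,并非论文原方法:

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """分裂式 conformal prediction:用校准集残差的分位数给预测加区间(极简示意)。"""
    residuals = np.abs(cal_true - cal_pred)
    n = len(residuals)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # 有限样本校正
    q = np.quantile(residuals, q_level)
    return test_pred - q, test_pred + q    # 覆盖率约 1-alpha 的预测区间

rng = np.random.default_rng(0)
cal_pred, cal_true = rng.normal(size=500), rng.normal(size=500)
lo, hi = conformal_interval(cal_pred, cal_true, test_pred=np.array([0.3, -1.2]))
print(lo, hi)
```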

[AI-119] Understanding the learned look-ahead behavior of chess neural networks

【速读】:该论文试图解决神经网络在国际象棋中是否具备前瞻能力的问题,特别是Leela Chess Zero策略网络能否考虑未来多步走法及替代路径。其解决方案的关键在于通过分析模型在不同棋局位置下的表现,揭示其能够处理七步之后的棋盘状态,并利用相似的内部机制进行多路径推理,而非仅关注单一行进路线。这一发现为理解战略任务训练下神经网络中复杂前瞻能力的形成提供了新视角。

链接: https://arxiv.org/abs/2505.21552
作者: Diogo Cruz
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 40 pages, 47 figures

点击查看摘要

Abstract:We investigate the look-ahead capabilities of chess-playing neural networks, specifically focusing on the Leela Chess Zero policy network. We build on the work of Jenner et al. (2024) by analyzing the model’s ability to consider future moves and alternative sequences beyond the immediate next move. Our findings reveal that the network’s look-ahead behavior is highly context-dependent, varying significantly based on the specific chess position. We demonstrate that the model can process information about board states up to seven moves ahead, utilizing similar internal mechanisms across different future time steps. Additionally, we provide evidence that the network considers multiple possible move sequences rather than focusing on a single line of play. These results offer new insights into the emergence of sophisticated look-ahead capabilities in neural networks trained on strategic tasks, contributing to our understanding of AI reasoning in complex domains. Our work also showcases the effectiveness of interpretability techniques in uncovering cognitive-like processes in artificial intelligence systems.
zh

[AI-120] Collaborative Agentic AI Needs Interoperability Across Ecosystems

【速读】:该论文试图解决当前协作式智能体AI(Collaborative Agentic AI)领域中存在的生态系统碎片化和不兼容问题,这些问题阻碍了其大规模应用和推广。论文提出的解决方案的关键在于通过采用最小化标准来实现互操作性,从而构建开放、安全、可扩展的智能体生态系统。其核心是设计了一个名为“Web of Agents”的最小架构基础,包含智能体间消息传递、交互互操作性、状态管理和智能体发现四个组件,旨在通过复用现有标准和基础设施,推动可互操作的智能体系统发展。

链接: https://arxiv.org/abs/2505.21550
作者: Rishi Sharma,Martijn de Vos,Pradyumna Chari,Ramesh Raskar,Anne-Marie Kermarrec
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Collaborative agentic AI is projected to transform entire industries by enabling AI-powered agents to autonomously perceive, plan, and act within digital environments. Yet, current solutions in this field are all built in isolation, and we are rapidly heading toward a landscape of fragmented, incompatible ecosystems. In this position paper, we argue that interoperability, achieved by the adoption of minimal standards, is essential to ensure open, secure, web-scale, and widely-adopted agentic ecosystems. To this end, we devise a minimal architectural foundation for collaborative agentic AI, named Web of Agents, which is composed of four components: agent-to-agent messaging, interaction interoperability, state management, and agent discovery. Web of Agents adopts existing standards and reuses existing infrastructure where possible. With Web of Agents, we take the first but critical step toward interoperable agentic systems and offer a pragmatic path forward before ecosystem fragmentation becomes the norm.
zh

[AI-121] OpenReview Should be Protected and Leveraged as a Community Asset for Research in the Era of Large Language Models

【速读】:该论文试图解决在大型语言模型(Large Language Models, LLMs)时代,如何有效利用高质量、领域丰富且持续演进的数据集来推动研究的问题。其解决方案的关键在于将OpenReview——一个不断演化的研究论文、同行评审、作者反驳、元评审和决策结果的存储库——作为核心社区资产进行更广泛的应用,以提升同行评审的质量、可扩展性和问责性,并支持对齐研究及基于真实专家评估的基准测试。

链接: https://arxiv.org/abs/2505.21537
作者: Hao Sun,Yunyi Shen,Mihaela van der Schaar
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the era of large language models (LLMs), high-quality, domain-rich, and continuously evolving datasets capturing expert-level knowledge, core human values, and reasoning are increasingly valuable. This position paper argues that OpenReview – the continually evolving repository of research papers, peer reviews, author rebuttals, meta-reviews, and decision outcomes – should be leveraged more broadly as a core community asset for advancing research in the era of LLMs. We highlight three promising areas in which OpenReview can uniquely contribute: enhancing the quality, scalability, and accountability of peer review processes; enabling meaningful, open-ended benchmarks rooted in genuine expert deliberation; and supporting alignment research through real-world interactions reflecting expert assessment, intentions, and scientific values. To better realize these opportunities, we suggest the community collaboratively explore standardized benchmarks and usage guidelines around OpenReview, inviting broader dialogue on responsible data use, ethical considerations, and collective stewardship.
zh

[AI-122] Uncovering Bottlenecks and Optimizing Scientific Lab Workflows with Cycle Time Reduction Agents

【速读】:该论文旨在解决科学实验室(尤其是制药和生物技术公司)在优化工作流程中面临的挑战,这些问题主要源于任务的复杂性和数量,如化合物筛选和实验执行。解决方案的关键在于引入基于LangGraph的智能体工作流——Cycle Time Reduction Agents (CTRA),该系统通过自动化分析实验室操作指标来识别流程中的瓶颈,其核心组件包括问题生成智能体、操作指标智能体和洞察智能体,从而实现对实验室流程的高效分析与优化。

链接: https://arxiv.org/abs/2505.21534
作者: Yao Fehlis
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific laboratories, particularly those in pharmaceutical and biotechnology companies, encounter significant challenges in optimizing workflows due to the complexity and volume of tasks such as compound screening and assay execution. We introduce Cycle Time Reduction Agents (CTRA), a LangGraph-based agentic workflow designed to automate the analysis of lab operational metrics. CTRA comprises three main components: the Question Creation Agent for initiating analysis, Operational Metrics Agents for data extraction and validation, and Insights Agents for reporting and visualization, identifying bottlenecks in lab processes. This paper details CTRA’s architecture, evaluates its performance on a lab dataset, and discusses its potential to accelerate pharmaceutical and biotechnological development. CTRA offers a scalable framework for reducing cycle times in scientific labs.
zh

[AI-123] Conformance Checking for Less: Efficient Conformance Checking for Long Event Sequences

【速读】:该论文试图解决在长事件序列(long event sequences)中进行符合性检查(conformance checking)时面临的计算不可行性问题,尤其是在处理由传感器和预测模型产生的大规模数据日志时,由于寻找最优对齐(optimal alignment)的指数级复杂度导致的可扩展性挑战。解决方案的关键在于提出ConLES方法,该方法通过滑动窗口机制将事件序列划分为可管理的子序列,并逐个对齐到预期行为,从而显著减少搜索空间,同时保持整体准确性。ConLES利用全局信息来捕捉事件序列和过程模型的结构特性,以做出更有依据的对齐决策并舍弃局部最优但整体不优的对齐方案。

链接: https://arxiv.org/abs/2505.21506
作者: Eli Bogdanov,Izack Cohen,Avigdor Gal
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:Long event sequences (termed traces) and large data logs that originate from sensors and prediction models are becoming increasingly common in our data-rich world. In such scenarios, conformance checking, i.e., validating a data log against an expected system behavior (the process model), can become computationally infeasible due to the exponential complexity of finding an optimal alignment. To alleviate scalability challenges for this task, we propose ConLES, a sliding-window conformance checking approach for long event sequences that preserves the interpretability of alignment-based methods. ConLES partitions traces into manageable subtraces and iteratively aligns each against the expected behavior, leading to significant reduction of the search space while maintaining overall accuracy. We use global information that captures structural properties of both the trace and the process model, enabling informed alignment decisions and discarding unpromising alignments, even if they appear locally optimal. Performance evaluations across multiple datasets highlight that ConLES outperforms the leading optimal and heuristic algorithms for long traces, consistently achieving the optimal or near-optimal solution. Unlike other conformance methods that struggle with long event sequences, ConLES significantly reduces the search space, scales efficiently, and uniquely supports both predefined and discovered process models, making it a viable and leading option for conformance checking of long event sequences.
zh
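
以下用序列相似度代替真正的模型对齐,示意“滑动窗口切分 + 逐窗口对齐并累计偏差”的流程骨架。真实实现需要对过程模型(如 Petri 网)计算最优对齐,这里纯为结构示意,窗口大小与数据均为虚构:

```python
from difflib import SequenceMatcher

def windowed_conformance(trace, expected, window=50, stride=50):
    """滑动窗口符合性检查的极简示意:逐窗口比对并累计偏差成本。

    注意:这里用序列相似度近似对齐成本,并非 ConLES 的真实对齐算法。
    """
    cost = 0.0
    for start in range(0, len(trace), stride):
        sub = trace[start:start + window]
        ref = expected[start:start + window]
        ratio = SequenceMatcher(None, sub, ref).ratio()
        cost += (1 - ratio) * len(sub)     # 不相似度近似本窗口的对齐成本
    return cost

trace    = list("abcabcaxcabc")   # 观测到的事件序列(假设数据)
expected = list("abcabcabcabc")   # 过程模型的期望行为(假设数据)
print(windowed_conformance(trace, expected, window=6, stride=6))
```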

[AI-124] On the performance of machine-learning assisted Monte Carlo in sampling from simple statistical physics models

【速读】:该论文试图解决在难以采样的系统中,如何有效应用机器学习技术以提升蒙特卡洛采样和优化的问题。其解决方案的关键在于对广泛使用的顺序退火(Sequential Tempering)过程进行完整的分析研究,并结合浅层MADE架构对Curie-Weiss模型进行实验验证。通过描述最优权重及梯度下降优化下的训练过程,并比较加入局部Metropolis蒙特卡洛步骤与不加入的情况,该工作为最佳实践提供了理论预测,从而为机器学习技术与蒙特卡洛方法的融合奠定了清晰的理论基础。

链接: https://arxiv.org/abs/2505.22598
作者: Luca Maria Del Bono,Federico Ricci-Tersenghi,Francesco Zamponi
机构: 未知
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:Recent years have seen a rise in the application of machine learning techniques to aid the simulation of hard-to-sample systems that cannot be studied using traditional methods. Despite the introduction of many different architectures and procedures, a wide theoretical understanding is still lacking, with the risk of suboptimal implementations. As a first step to address this gap, we provide here a complete analytic study of the widely-used Sequential Tempering procedure applied to a shallow MADE architecture for the Curie-Weiss model. The contribution of this work is twofold: firstly, we give a description of the optimal weights and of the training under Gradient Descent optimization. Secondly, we compare what happens in Sequential Tempering with and without the addition of local Metropolis Monte Carlo steps. We are thus able to give theoretical predictions on the best procedure to apply in this case. This work establishes a clear theoretical basis for the integration of machine learning techniques into Monte Carlo sampling and optimization.
zh
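
论文比较了 Sequential Tempering 在结合与不结合局部 Metropolis 蒙特卡洛步骤时的表现。作为参照,下面给出 Curie-Weiss 模型(全连接 Ising,能量 E = -M²/(2N),M 为总磁化)局部 Metropolis 采样的极简实现,便于理解“局部 MC 步骤”指什么;参数取值仅为示例:

```python
import numpy as np

def metropolis_curie_weiss(N=200, beta=1.2, steps=20000, seed=0):
    """Curie-Weiss 模型的局部 Metropolis 采样(极简示意)。"""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=N)
    M = s.sum()
    for _ in range(steps):
        i = rng.integers(N)                 # 随机选一个自旋,尝试翻转
        dM = -2 * s[i]
        dE = -((M + dM) ** 2 - M ** 2) / (2 * N)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s[i] = -s[i]                    # Metropolis 接受准则
            M += dM
    return M / N                            # 平均磁化

print(metropolis_curie_weiss())
```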

[AI-125] Empowering Intelligent Low-altitude Economy with Large AI Model Deployment

【速读】:该论文旨在解决在低空经济(Low-altitude economy, LAE)中部署大型人工智能模型(Large artificial intelligence models, LAIMs)所面临的挑战,包括LAIMs的计算/存储需求与LAE实体有限的机载资源之间的显著差距、实验室训练的LAIMs与动态物理环境之间的不匹配,以及传统解耦设计在感知、通信和计算方面的低效性。解决方案的关键在于提出一种分层系统架构,并探索促进LAIMs与低空系统协同演进的关键使能技术,同时引入面向任务的执行流水线以实现可扩展和自适应的服务交付。

链接: https://arxiv.org/abs/2505.22343
作者: Zhonghao Lyu,Yulan Gao,Junting Chen,Hongyang Du,Jie Xu,Kaibin Huang,Dong In Kim
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-altitude economy (LAE) represents an emerging economic paradigm that redefines commercial and social aerial activities. Large artificial intelligence models (LAIMs) offer transformative potential to further enhance the intelligence of LAE services. However, deploying LAIMs in LAE poses several challenges, including the significant gap between their computational/storage demands and the limited onboard resources of LAE entities, the mismatch between lab-trained LAIMs and dynamic physical environments, and the inefficiencies of traditional decoupled designs for sensing, communication, and computation. To address these issues, we first propose a hierarchical system architecture tailored for LAIM deployment and present representative LAE application scenarios. Next, we explore key enabling techniques that facilitate the mutual co-evolution of LAIMs and low-altitude systems, and introduce a task-oriented execution pipeline for scalable and adaptive service delivery. Then, the proposed framework is validated through real-world case studies. Finally, we outline open challenges to inspire future research.
zh

[AI-126] Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection INTERSPEECH2025

【速读】:该论文旨在解决语音非流畅性检测(speech dysfluency detection)在临床诊断和语言评估中的挑战,特别是由于高质量标注数据稀缺导致的现有方法局限性。其解决方案的关键在于提出LLM-Dys——一个基于大语言模型(LLM)增强的非流畅语音语料库,该语料库通过改进的非流畅性模拟技术,涵盖了11种跨词级和音素级的非流畅性类别,从而提升了合成数据的自然韵律和上下文多样性。

链接: https://arxiv.org/abs/2505.22029
作者: Jinming Zhang,Xuanru Zhou,Jiachen Lian,Shuhe Li,William Li,Zoe Ezzes,Rian Bogley,Lisa Wauters,Zachary Miller,Jet Vonk,Brittany Morin,Maria Gorno-Tempini,Gopala Anumanchipalli
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Submitted to Interspeech 2025

点击查看摘要

Abstract:Speech dysfluency detection is crucial for clinical diagnosis and language assessment, but existing methods are limited by the scarcity of high-quality annotated data. Although recent advances in TTS model have enabled synthetic dysfluency generation, existing synthetic datasets suffer from unnatural prosody and limited contextual diversity. To address these limitations, we propose LLM-Dys – the most comprehensive dysfluent speech corpus with LLM-enhanced dysfluency simulation. This dataset captures 11 dysfluency categories spanning both word and phoneme levels. Building upon this resource, we improve an end-to-end dysfluency detection framework. Experimental validation demonstrates state-of-the-art performance. All data, models, and code are open-sourced at this https URL.
zh

[AI-127] HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3

【速读】:该论文旨在解决蛋白质结合体设计中的实际部署难题,这些问题包括流程碎片化、计算成本高以及工具集成复杂。其解决方案的关键在于开发了HelixDesign-Binder平台,该平台基于HelixFold3构建,能够自动化完整的结合体设计流程,涵盖骨架生成、序列设计、结构评估及多维评分,并通过统一各阶段实现可扩展且用户友好的系统,从而高效探索具有优良结构、能量和理化性质的结合体候选物。

链接: https://arxiv.org/abs/2505.21873
作者: Jie Gao,Jun Li,Jing Hu,Shanzhuo Zhang,Kunrui Zhu,Yueyang Huang,Xiaonan Zhang,Xiaomin Fang
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Protein binder design is central to therapeutics, diagnostics, and synthetic biology, yet practical deployment remains challenging due to fragmented workflows, high computational costs, and complex tool integration. We present HelixDesign-Binder, a production-grade, high-throughput platform built on HelixFold3 that automates the full binder design pipeline, from backbone generation and sequence design to structural evaluation and multi-dimensional scoring. By unifying these stages into a scalable and user-friendly system, HelixDesign-Binder enables efficient exploration of binder candidates with favorable structural, energetic, and physicochemical properties. The platform leverages Baidu Cloud’s high-performance infrastructure to support large-scale design and incorporates advanced scoring metrics, including ipTM, predicted binding free energy, and interface hydrophobicity. Benchmarking across six protein targets demonstrates that HelixDesign-Binder reliably produces diverse and high-quality binders, some of which match or exceed validated designs in predicted binding affinity. HelixDesign-Binder is accessible via an interactive web interface in PaddleHelix platform, supporting both academic research and industrial applications in antibody and protein binder development.
zh

[AI-128] CSI-Bench: A Large-Scale In-the-Wild Dataset for Multi-task WiFi Sensing

【速读】:该论文旨在解决现有WiFi sensing系统在真实环境中的泛化能力不足的问题,其核心挑战在于现有数据集多来源于受控环境,硬件同质化且记录片段化,无法反映连续日常活动的真实场景。论文提出的解决方案是构建CSI-Bench,这是一个大规模、真实环境下的基准数据集,通过商用WiFi边缘设备在26种不同室内环境中收集了35名真实用户的超过461小时有效数据,涵盖了跌倒检测、呼吸监测、定位和运动源识别等任务,并提供了联合标注的多任务数据集,以支持鲁棒且可泛化的模型开发。

链接: https://arxiv.org/abs/2505.21866
作者: Guozhen Zhu,Yuqian Hu,Weihang Gao,Wei-Hsiang Wang,Beibei Wang,K. J. Ray Liu
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 21 pages, 4 figures

点击查看摘要

Abstract:WiFi sensing has emerged as a compelling contactless modality for human activity monitoring by capturing fine-grained variations in Channel State Information (CSI). Its ability to operate continuously and non-intrusively while preserving user privacy makes it particularly suitable for health monitoring. However, existing WiFi sensing systems struggle to generalize in real-world settings, largely due to datasets collected in controlled environments with homogeneous hardware and fragmented, session-based recordings that fail to reflect continuous daily activity. We present CSI-Bench, a large-scale, in-the-wild benchmark dataset collected using commercial WiFi edge devices across 26 diverse indoor environments with 35 real users. Spanning over 461 hours of effective data, CSI-Bench captures realistic signal variability under natural conditions. It includes task-specific datasets for fall detection, breathing monitoring, localization, and motion source recognition, as well as a co-labeled multitask dataset with joint annotations for user identity, activity, and proximity. To support the development of robust and generalizable models, CSI-Bench provides standardized evaluation splits and baseline results for both single-task and multi-task learning. CSI-Bench offers a foundation for scalable, privacy-preserving WiFi sensing systems in health and broader human-centric applications.
zh

[AI-129] Learning optimal treatment strategies for intraoperative hypotension using deep reinforcement learning

【速读】:该论文旨在解决手术期间低血压和术后急性肾损伤(AKI)的管理问题,传统手术决策依赖于医生的经验和快速反应,存在较大变异性。其解决方案的关键是开发一种基于深度Q网络(Deep Q-Networks)的强化学习(Reinforcement Learning, RL)模型,通过分析患者术中生理时间序列、静脉输液和血管活性药物总剂量等16个变量,生成最优的静脉输液和血管活性药物剂量推荐,以减少术中低血压和术后AKI的发生。

链接: https://arxiv.org/abs/2505.21596
作者: Esra Adiyeke,Tianqi Liu,Venkata Sai Dheeraj Naganaboina,Han Li,Tyler J. Loftus,Yuanfang Ren,Benjamin Shickel,Matthew M. Ruppert,Karandeep Singh,Ruogu Fang,Parisa Rashidi,Azra Bihorac,Tezcan Ozrazgat-Baslanti
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 41 pages, 1 table, 5 figures, 5 supplemental tables, 6 supplemental figures

点击查看摘要

Abstract:Traditional methods of surgical decision making heavily rely on human experience and prompt actions, which are variable. A data-driven system generating treatment recommendations based on patient states can be a substantial asset in perioperative decision-making, as in cases of intraoperative hypotension, for which suboptimal management is associated with acute kidney injury (AKI), a common and morbid postoperative complication. We developed a Reinforcement Learning (RL) model to recommend optimum dose of intravenous (IV) fluid and vasopressors during surgery to avoid intraoperative hypotension and postoperative AKI. We retrospectively analyzed 50,021 surgeries from 42,547 adult patients who underwent major surgery at a quaternary care hospital between June 2014 and September 2020. Of these, 34,186 surgeries were used for model training and 15,835 surgeries were reserved for testing. We developed a Deep Q-Networks based RL model using 16 variables including intraoperative physiologic time series, total dose of IV fluid and vasopressors extracted for every 15-minute epoch. The model replicated 69% of physician’s decisions for the dosage of vasopressors and proposed higher or lower dosage of vasopressors than received in 10% and 21% of the treatments, respectively. In terms of IV fluids, the model’s recommendations were within 0.05 ml/kg/15 min of the actual dose in 41% of the cases, with higher or lower doses recommended for 27% and 32% of the treatments, respectively. The model resulted in a higher estimated policy value compared to the physicians’ actual treatments, as well as random and zero-drug policies. AKI prevalence was the lowest in patients receiving medication dosages that aligned with model’s decisions. Our findings suggest that implementation of the model’s policy has the potential to reduce postoperative AKI and improve other outcomes driven by intraoperative hypotension.
zh
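
下面以 PyTorch 写一个 Deep Q-Network 的单步时序差分更新示意:状态为 16 维生理特征(对应每 15 分钟一个 epoch),动作为输液档位与升压药档位的离散组合。动作数(此处设 5x5=25)、网络结构与奖励均为假设,并非论文原模型:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 16, 25, 0.99   # 25 = 输液 5 档 x 升压药 5 档(假设)

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done):
    """单步 TD 更新:奖励 r 可编码“避免低血压/AKI”的临床目标(假设设定)。"""
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * target_net(s_next).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# 一批虚构的转移样本 (s, a, r, s', done)
s, s_next = torch.randn(32, STATE_DIM), torch.randn(32, STATE_DIM)
a = torch.randint(0, N_ACTIONS, (32,))
r, done = torch.randn(32), torch.zeros(32)
print(dqn_step(s, a, r, s_next, done))
```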

[AI-130] WhisperD: Dementia Speech Recognition and Filler Word Detection with Whisper INTERSPEECH2025

【速读】:该论文试图解决语音识别模型 Whisper 在转录痴呆症患者(PwDs)语音时表现不佳的问题,因为PwDs常表现出不规则的语音模式和非流畅性,如停顿、重复和断句。Whisper 基于标准语音训练,可能很少甚至从未接触过受痴呆影响的语音数据,而准确转录对低成本诊断和辅助技术开发至关重要。解决方案的关键在于通过DementiaBank开源数据集和自有的内部数据集对Whisper进行微调,以降低其词错误率(WER),同时引入填充词以评估填充词包含率(FIR)和F1分数。实验结果表明,微调后的模型显著优于现成模型,并对未见数据和语音模式展现出良好的泛化能力。

链接: https://arxiv.org/abs/2505.21551
作者: Emmanuel Akinrintoyo,Nadine Abdelhalim,Nicole Salomons
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: Submitted to Interspeech 2025 (Accepted)

点击查看摘要

Abstract:Whisper fails to correctly transcribe dementia speech because persons with dementia (PwDs) often exhibit irregular speech patterns and disfluencies such as pauses, repetitions, and fragmented sentences. It was trained on standard speech and may have had little or no exposure to dementia-affected speech. However, correct transcription is vital for dementia speech for cost-effective diagnosis and the development of assistive technology. In this work, we fine-tune Whisper with the open-source dementia speech dataset (DementiaBank) and our in-house dataset to improve its word error rate (WER). The fine-tuning also includes filler words to ascertain the filler inclusion rate (FIR) and F1 score. The fine-tuned models significantly outperformed the off-the-shelf models. The medium-sized model achieved a WER of 0.24, outperforming previous work. Similarly, there was a notable generalisability to unseen data and speech patterns.
zh
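
词错误率(WER)是这类工作的核心评估指标,可用经典的编辑距离动态规划计算。下面给出一个自包含的小实现(示例句子为虚构,用于说明填充词对 WER 的影响):

```python
def wer(reference: str, hypothesis: str) -> float:
    """词错误率 = (替换 + 删除 + 插入) / 参考词数,经典动态规划实现。"""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # 全部删除
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # 全部插入
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

# 含填充词的转写与漏掉填充词的识别结果(假设数据)
print(wer("i uh went to the uh store", "i went to the store"))  # 2/7 ≈ 0.286
```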

机器学习

[LG-0] On Learning Verifiers for Chain-of-Thought Reasoning

链接: https://arxiv.org/abs/2505.22650
作者: Maria-Florina Balcan,Avrim Blum,Zhiyuan Li,Dravyansh Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chain-of-Thought reasoning has emerged as a powerful approach for solving complex mathematical and logical problems. However, it can often veer off track through incorrect or unsubstantiated inferences. Formal mathematical reasoning, which can be checked with a formal verifier, is one approach to addressing this issue. However, currently LLMs are simply not good enough to solve complex problems in a formal way, and even just formalizing an informal problem statement can be challenging. Motivated by this fact, in this work we consider the problem of learning reliable verifiers for natural language Chain-of-Thought reasoning. That is, given a problem statement and step-by-step solution in natural language, the aim of the verifier is to output [Yes] if the reasoning steps in the solution are all valid, and [No] otherwise. In this work we give a formal PAC-learning framework for studying this problem. We propose and analyze several natural verification goals, at different levels of strength, in this framework. We provide sample complexity upper-bounds for learning verifiers satisfying these goals, as well as lower-bound and impossibility results for learning other natural verification objectives without additional assumptions.

[LG-1] Spectral Survival Analysis KDD2025

链接: https://arxiv.org/abs/2505.22641
作者: Chengzhi Shi,Stratis Ioannidis
类目: Machine Learning (cs.LG)
*备注: Extended version of conference paper appearing in KDD 2025

点击查看摘要

Abstract:Survival analysis is widely deployed in a diverse set of fields, including healthcare, business, ecology, etc. The Cox Proportional Hazard (CoxPH) model is a semi-parametric model often encountered in the literature. Despite its popularity, wide deployment, and numerous variants, scaling CoxPH to large datasets and deep architectures poses a challenge, especially in the high-dimensional regime. We identify a fundamental connection between rank regression and the CoxPH model: this allows us to adapt and extend the so-called spectral method for rank regression to survival analysis. Our approach is versatile, naturally generalizing to several CoxPH variants, including deep models. We empirically verify our method’s scalability on multiple real-world high-dimensional datasets; our method outperforms legacy methods w.r.t. predictive performance and efficiency.
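
CoxPH 的核心是偏似然:对每个事件时刻,在其风险集上做 log-sum-exp。下面给出无并列事件时间情形下负对数偏似然的 PyTorch 极简实现,可反向传播,便于接入深度模型(真实数据中的并列时间需 Breslow/Efron 校正,此处从略;这只是骨架,并非论文的谱方法本身):

```python
import torch

def cox_nll(risk: torch.Tensor, time: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
    """CoxPH 负对数偏似然(无并列时间的极简示意)。

    risk: 模型输出的对数风险分数;time: 生存/删失时间;event: 1 表示事件发生。
    """
    order = torch.argsort(time, descending=True)   # 时间降序:前缀即风险集
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)   # 风险集上的 log-sum-exp
    return -((risk - log_cumsum) * event).sum() / event.sum()

risk = torch.randn(100, requires_grad=True)
time = torch.rand(100)
event = (torch.rand(100) < 0.7).float()
loss = cox_nll(risk, time, event)
loss.backward()                                    # 可直接用于深度 CoxPH 变体
print(loss.item())
```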

[LG-2] SimProcess: High Fidelity Simulation of Noisy ICS Physical Processes

链接: https://arxiv.org/abs/2505.22638
作者: Denis Donadel,Gabriele Crestanello,Giulio Morandini,Daniele Antonioli,Mauro Conti,Massimo Merro
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: In 11th ACM Cyber-Physical System Security Workshop (CPSS '25), August 25-29, 2025, Hanoi, Vietnam

点击查看摘要

Abstract:Industrial Control Systems (ICS) manage critical infrastructures like power grids and water treatment plants. Cyberattacks on ICSs can disrupt operations, causing severe economic, environmental, and safety issues. For example, undetected pollution in a water plant can put the lives of thousands at stake. ICS researchers have increasingly turned to honeypots – decoy systems designed to attract attackers, study their behaviors, and eventually improve defensive mechanisms. However, existing ICS honeypots struggle to replicate the ICS physical process, making them susceptible to detection. Accurately simulating the noise in ICS physical processes is challenging because different factors produce it, including sensor imperfections and external interferences. In this paper, we propose SimProcess, a novel framework to rank the fidelity of ICS simulations by evaluating how closely they resemble real-world and noisy physical processes. It measures the simulation distance from a target system by estimating the noise distribution with machine learning models like Random Forest. Unlike existing solutions that require detailed mathematical models or are limited to simple systems, SimProcess operates with only a timeseries of measurements from the real system, making it applicable to a broader range of complex dynamic systems. We demonstrate the framework’s effectiveness through a case study using real-world power grid data from the EPIC testbed. We compare the performance of various simulation methods, including static and generative noise techniques. Our model correctly classifies real samples with a recall of up to 1.0. It also identifies Gaussian and Gaussian Mixture as the best distribution to simulate our power systems, together with a generative solution provided by an autoencoder, thereby helping developers to improve honeypot fidelity. Additionally, we make our code publicly available.
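
论文用随机森林等模型估计噪声分布并为仿真保真度排序。下面给出一个常见的“二分类可分性”式保真度度量示意:分类器越难区分真实与仿真窗口(交叉验证准确率越接近 0.5),保真度越高。具体特征与评分方式为笔者假设,并非 SimProcess 的原始度量:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def fidelity_score(real: np.ndarray, simulated: np.ndarray) -> float:
    """用随机森林区分“真实 vs 仿真”时序窗口;准确率越接近 0.5 保真度越高。"""
    X = np.vstack([real, simulated])
    y = np.r_[np.ones(len(real)), np.zeros(len(simulated))]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()

rng = np.random.default_rng(0)
real = rng.normal(0, 1.0, size=(200, 50))       # 真实传感器窗口(假设数据)
good_sim = rng.normal(0, 1.0, size=(200, 50))   # 噪声分布匹配的仿真
bad_sim = rng.normal(0, 0.2, size=(200, 50))    # 噪声失真的仿真
print(fidelity_score(real, good_sim))   # 约 0.5:难以区分,保真度高
print(fidelity_score(real, bad_sim))    # 接近 1:易被识别,保真度低
```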

[LG-3] Understanding (Un)Reliability of Steering Vectors in Language Models ICLR2025

链接: https://arxiv.org/abs/2505.22637
作者: Joschka Braun,Carsten Eickhoff,David Krueger,Seyed Ali Bahrainian,Dmitrii Krasheninnikov
类目: Machine Learning (cs.LG)
*备注: 17 pages, 10 figures. Presented at the ICLR 2025 Workshop on Foundation Models in the Wild

点击查看摘要

Abstract:Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective steering. Finally, we observe that datasets where positive and negative activations are better separated are more steerable. Our results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.
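
Steering vector 的做法是在推理时向某层激活加一个习得的偏置,用 PyTorch 的 forward hook 几行即可实现。下例用玩具 MLP 演示机制,向量本身以随机数代替(实际应取正/负提示对激活差的均值;模型结构与尺度系数均为假设):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
steering_vector = 0.1 * torch.randn(32)   # 实际中由正负提示对的激活差均值得到

def add_steering(module, inputs, output):
    return output + steering_vector       # hook 返回非 None 时会替换该层输出

handle = model[0].register_forward_hook(add_steering)
x = torch.randn(4, 16)
print(model(x).shape)    # torch.Size([4, 16]),激活已被引导
handle.remove()          # 取消引导,恢复原始行为
```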

[LG-4] Benignity of loss landscape with weight decay requires both large overparametrization and initialization

链接: https://arxiv.org/abs/2505.22578
作者: Etienne Boursier,Matthew Bowditch,Matthias Englert,Ranko Lazic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the \ell_2 -regularized training loss for two-layer ReLU networks. We show that the landscape becomes benign – i.e., free of spurious local minima – under large overparametrization, specifically when the network width m satisfies m \gtrsim \min(n^d, 2^n) , where n is the number of data points and d the input dimension. More precisely in this regime, almost all constant activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary via the example of orthogonal data. Finally, we demonstrate that such loss landscape results primarily hold relevance in the large initialization regime. In contrast, for small initializations – corresponding to the feature learning regime – optimization can still converge to spurious local minima, despite the global benignity of the landscape.

[LG-5] FNOPE: Simulation-based inference on function spaces with Fourier Neural Operators

链接: https://arxiv.org/abs/2505.22573
作者: Guy Moss,Leah Sophie Muhle,Reinhard Drews,Jakob H. Macke,Cornelius Schröder
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulation-based inference (SBI) is an established approach for performing Bayesian inference on scientific simulators. SBI so far works best on low-dimensional parametric models. However, it is difficult to infer function-valued parameters, which frequently occur in disciplines that model spatiotemporal processes such as the climate and earth sciences. Here, we introduce an approach for efficient posterior estimation, using a Fourier Neural Operator (FNO) architecture with a flow matching objective. We show that our approach, FNOPE, can perform inference of function-valued parameters at a fraction of the simulation budget of state of the art methods. In addition, FNOPE supports posterior evaluation at arbitrary discretizations of the domain, as well as simultaneous estimation of vector-valued parameters. We demonstrate the effectiveness of our approach on several benchmark tasks and a challenging spatial inference task from glaciology. FNOPE extends the applicability of SBI methods to new scientific domains by enabling the inference of function-valued parameters.

[LG-6] Geometric Hyena Networks for Large-scale Equivariant Learning

链接: https://arxiv.org/abs/2505.22560
作者: Artem Moskalev,Mangal Prakash,Junjie Xu,Tianyu Cui,Rui Liao,Tommaso Mansi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Processing global geometric context while preserving equivariance is crucial when modeling biological, chemical, and physical systems. Yet, this is challenging due to the computational demands of equivariance and global context at scale. Standard methods such as equivariant self-attention suffer from quadratic complexity, while local methods such as distance-based message passing sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, we introduce Geometric Hyena, the first equivariant long-convolutional model for geometric systems. Geometric Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on all-atom property prediction of large RNA molecules and full protein molecular dynamics, Geometric Hyena outperforms existing equivariant models while requiring significantly less memory and compute than equivariant self-attention. Notably, our model processes the geometric context of 30k tokens 20x faster than the equivariant transformer and allows 72x longer context within the same budget.

[LG-7] DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models NEURIPS

链接: https://arxiv.org/abs/2505.22549
作者: Alex Iacob,Lorenzo Sani,Mher Safaryan,Paris Giampouras,Samuel Horváth,Andrej Jovanovic,Meghdad Kurmanji,Preslav Aleksandrov,William F. Shen,Xinchi Qiu,Nicholas D. Lane
类目: Machine Learning (cs.LG)
*备注: Keywords: Distributed Training, Foundation Models, Large Language Models, Optimizers, Communication Efficiency, Federated Learning, Distributed Systems, Optimization Theory, Scaling, Robustness. Preprint, under review at NeurIPS

点击查看摘要

Abstract:Scaling foundation model training with Distributed Data Parallel (DDP) methods is bandwidth-limited. Existing infrequent communication methods like Local SGD were designed to synchronize only model parameters and cannot be trivially applied to adaptive optimizers due to additional optimizer states. Current approaches extending Local SGD either lack convergence guarantees or require synchronizing all optimizer states, tripling communication costs. We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Through extensive experiments on language models of up to 1.7B, we show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local ADAM. Furthermore, unlike previous heuristic approaches, DES-LOC is suited for practical training scenarios prone to system failures. DES-LOC offers a scalable, bandwidth-efficient, and fault-tolerant solution for foundation model training.
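
DES-LOC 的核心思想是给参数与动量(如 Adam 的一、二阶矩)分配相互独立的同步周期。下面用单进程模拟 4 个副本的“去同步平均”来示意该机制;周期 h_param/h_mom 的取值与字典式状态表示均为假设,真实实现应走 all-reduce 通信:

```python
import torch

def desynced_sync(workers, step, h_param=64, h_mom=256):
    """参数与动量使用不同同步周期的极简示意(DES-LOC 思想,非原实现)。

    workers: 每个元素为 {"param": Tensor, "m": Tensor, "v": Tensor} 的副本列表。
    参数每 h_param 步平均一次,动量每 h_mom 步才平均一次,从而降低通信量。
    """
    for key, period in (("param", h_param), ("m", h_mom), ("v", h_mom)):
        if step % period == 0:
            avg = torch.stack([w[key] for w in workers]).mean(dim=0)
            for w in workers:
                w[key].copy_(avg)   # 模拟一次 all-reduce 平均

workers = [{"param": torch.randn(10), "m": torch.zeros(10), "v": torch.zeros(10)}
           for _ in range(4)]
for step in range(1, 513):
    # ……此处省略每个 worker 的本地 Adam 更新……
    desynced_sync(workers, step)
```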

[LG-8] A Human-Centric Approach to Explainable AI for Personalized Education

链接: https://arxiv.org/abs/2505.22541
作者: Vinitra Swamy
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: PhD Thesis, EPFL (Computer Science)

点击查看摘要

Abstract:Deep neural networks form the backbone of artificial intelligence research, with potential to transform the human experience in areas ranging from autonomous driving to personal assistants, healthcare to education. However, their integration into the daily routines of real-world classrooms remains limited. It is not yet common for a teacher to assign students individualized homework targeting their specific weaknesses, provide students with instant feedback, or simulate student responses to a new exam question. While these models excel in predictive performance, this lack of adoption can be attributed to a significant weakness: the lack of explainability of model decisions, leading to a lack of trust from students, parents, and teachers. This thesis aims to bring human needs to the forefront of eXplainable AI (XAI) research, grounded in the concrete use case of personalized learning and teaching. We frame the contributions along two verticals: technical advances in XAI and their aligned human studies. We investigate explainability in AI for education, revealing systematic disagreements between post-hoc explainers and identifying a need for inherently interpretable model architectures. We propose four novel technical contributions in interpretability with a multimodal modular architecture (MultiModN), an interpretable mixture-of-experts model (InterpretCC), adversarial training for explainer stability, and a theory-driven LLM-XAI framework to present explanations to students (iLLuMinaTE), which we evaluate in diverse settings with professors, teachers, learning scientists, and university students. By combining empirical evaluations of existing explainers with novel architectural designs and human studies, our work lays a foundation for human-centric AI systems that balance state-of-the-art performance with built-in transparency and trust.

[LG-9] Uncertainty Quantification with Proper Scoring Rules: Adjusting Measures to Prediction Tasks

链接: https://arxiv.org/abs/2505.22538
作者: Paul Hofman,Yusuf Sale,Eyke Hüllermeier
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We address the problem of uncertainty quantification and propose measures of total, aleatoric, and epistemic uncertainty based on a known decomposition of (strictly) proper scoring rules, a specific type of loss function, into a divergence and an entropy component. This leads to a flexible framework for uncertainty quantification that can be instantiated with different losses (scoring rules), which makes it possible to tailor uncertainty quantification to the use case at hand. We show that this flexibility is indeed advantageous. In particular, we analyze the task of selective prediction and show that the scoring rule should ideally match the task loss. In addition, we perform experiments on two other common tasks. For out-of-distribution detection, our results confirm that a widely used measure of epistemic uncertainty, mutual information, performs best. Moreover, in the setting of active learning, our measure of epistemic uncertainty based on the zero-one-loss consistently outperforms other uncertainty measures.
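
摘要中的分解可以用对数损失(对应香农熵)具体化:总不确定性为平均分布的熵,偶然不确定性为各成员熵的均值,二者之差即互信息(认知不确定性)。下面是一个最小数值示例,仅演示这一特例,并非论文的全部框架:

```python
import numpy as np

def uncertainty_decomposition(probs: np.ndarray):
    """基于对数损失(香农熵)的总/偶然/认知不确定性分解(极简示意)。

    probs: [n_members, n_classes],集成中每个成员的预测分布。
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()             # 平均分布的熵
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=1).mean()  # 成员熵均值
    epistemic = total - aleatoric                               # 即互信息
    return total, aleatoric, epistemic

ensemble = np.array([[0.7, 0.2, 0.1],
                     [0.2, 0.7, 0.1],
                     [0.4, 0.5, 0.1]])    # 成员意见分歧 -> 认知不确定性高
print(uncertainty_decomposition(ensemble))
```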

[LG-10] Test-Time Alignment of Discrete Diffusion Models with Sequential Monte Carlo

链接: https://arxiv.org/abs/2505.22524
作者: Chinmay Pani,Zijing Ou,Yingzhen Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete diffusion models have become highly effective across various domains. However, real-world applications often require the generative process to adhere to certain constraints, without task-specific fine-tuning. To this end, we propose a training-free method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution at test time. Our approach leverages twisted SMC with an approximate locally optimal proposal, obtained via a first-order Taylor expansion of the reward function. To address the challenge of ill-defined gradients in discrete spaces, we incorporate a Gumbel-Softmax relaxation, enabling efficient gradient-based approximation within the discrete generative framework. Empirical results on both synthetic datasets and image modelling validate the effectiveness of our approach.
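
A minimal sketch of the Gumbel-Softmax ingredient: relaxing discrete samples so that a reward defined on (one-hot) sequences becomes differentiable in the model logits. The `reward` function and all shapes below are stand-ins; the twisted SMC machinery built on top of these gradients is not shown.

```python
import torch
import torch.nn.functional as F

V, L, TAU = 5, 3, 0.5                     # vocab size, length, temperature

def reward(x_onehot):
    # stand-in differentiable reward on (relaxed) one-hot token sequences
    w = torch.arange(V, dtype=torch.float32)
    return (x_onehot @ w).sum()

logits = torch.randn(L, V, requires_grad=True)
x_soft = F.gumbel_softmax(logits, tau=TAU, hard=False)  # relaxed sample
r = reward(x_soft)
r.backward()
# logits.grad now approximates d(reward)/d(logits) despite discreteness;
# such gradients feed the first-order Taylor expansion of the reward that
# defines the locally optimal twisted proposal in the SMC sampler.
print(logits.grad.shape)
```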

[LG-11] Accelerating Optimization via Differentiable Stopping Time

链接: https://arxiv.org/abs/2505.22509
作者: Zhonglin Xie,Yiman Fong,Haoran Yuan,Zaiwen Wen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Optimization is an important module of modern machine learning applications. Tremendous efforts have been made to accelerate optimization algorithms. A common formulation is to achieve a lower loss within a given time budget, which enables a framework that is differentiable with respect to the algorithm hyperparameters. In contrast, its dual, minimizing the time needed to reach a target loss, is believed to be non-differentiable, since the stopping time itself is not differentiable. As a result, it usually serves as a conceptual framework or is optimized using zeroth-order methods. To address this limitation, we propose a differentiable stopping time and theoretically justify it based on differential equations. An efficient algorithm is designed to backpropagate through it. As a result, the proposed differentiable stopping time enables a new differentiable formulation for accelerating algorithms. We further discuss its applications, such as online hyperparameter tuning and learning to optimize. Our proposed methods show superior performance in comprehensive experiments across various problems, which confirms their effectiveness.

[LG-12] Sparsification and Reconstruction from the Perspective of Representation Geometry

链接: https://arxiv.org/abs/2505.22506
作者: Wenjie Sun,Bingzhe Wu,Zhile Yang,Chengke Wu
类目: Machine Learning (cs.LG)
*备注: 24 pages, 5 figures

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vectors from language models? And what is the relationship between this organizational paradigm, feature disentanglement, and reconstruction performance? To address these questions, we propose SAEMA, which validates the stratified structure of the representation by observing how the rank of the symmetric semi-positive definite (SSPD) matrix, corresponding to the modal tensor unfolded along the latent tensor, varies with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structures, we define local and global representations, demonstrating that they amplify inter-feature distinctions by merging similar semantic features and introducing additional dimensionality. Furthermore, we intervene on the global representation from an optimization perspective, proving a significant causal relationship between its separability and reconstruction performance. This study explains the principles of sparsity from the perspective of representational geometry and demonstrates the impact of changes in representational structure on reconstruction performance. In particular, it emphasizes the necessity of understanding representations and incorporating representational constraints, providing empirical references for developing new interpretability tools and improving SAEs. The code is available at this https URL.

[LG-13] Geometric GNNs for Charged Particle Tracking at GlueX

链接: https://arxiv.org/abs/2505.22504
作者: Ahmed Hossam Mohammed,Kishansingh Rajput,Simon Taylor,Denis Furletov,Sergey Furletov,Malachi Schram
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Nuclear physics experiments are aimed at uncovering the fundamental building blocks of matter. The experiments involve high-energy collisions that produce complex events with many particle trajectories. Tracking charged particles resulting from collisions in the presence of a strong magnetic field is critical to enable the reconstruction of particle trajectories and precise determination of interactions. It is traditionally achieved through combinatorial approaches that scale worse than linearly as the number of hits grows. Since particle hit data naturally form a 3-dimensional point cloud and can be structured as graphs, Graph Neural Networks (GNNs) emerge as an intuitive and effective choice for this task. In this study, we evaluate the GNN model for track finding on the data from the GlueX experiment at Jefferson Lab. We use simulation data to train the model and test on both simulation and real GlueX measurements. We demonstrate that GNN-based track finding outperforms the currently used traditional method at GlueX in terms of segment-based efficiency at a fixed purity while providing faster inferences. We show that the GNN model can achieve significant speedup by processing multiple events in batches, which exploits the parallel computation capability of Graphical Processing Units (GPUs). Finally, we compare the GNN implementation on GPU and FPGA and describe the trade-off.

[LG-14] ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods

链接: https://arxiv.org/abs/2505.22494
作者: Michal Kmicikiewicz,Vincent Fortuin,Ewa Szczurek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.

[LG-15] Non-Asymptotic Analysis of (Sticky) Track-and-Stop

链接: https://arxiv.org/abs/2505.22475
作者: Riccardo Poiani,Martino Bernasconi,Andrea Celli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In pure exploration problems, a statistician sequentially collects information to answer a question about some stochastic and unknown environment. The probability of returning a wrong answer should not exceed a maximum risk parameter \delta, and good algorithms make as few queries to the environment as possible. The Track-and-Stop algorithm is a pioneering method to solve these problems. Specifically, it is well-known that it enjoys asymptotically optimal sample complexity guarantees for \delta\to 0 whenever the map from the environment to its correct answers is single-valued (e.g., best-arm identification with a unique optimal arm). The Sticky Track-and-Stop algorithm extends these results to settings where, for each environment, there might exist multiple correct answers (e.g., \epsilon-optimal arm identification). Although both methods are optimal in the asymptotic regime, their non-asymptotic guarantees remain unknown. In this work, we fill this gap and provide non-asymptotic guarantees for both algorithms.

[LG-16] Forecasting Multivariate Urban Data via Decomposition and Spatio-Temporal Graph Analysis

链接: https://arxiv.org/abs/2505.22474
作者: Amirhossein Sohrabbeig,Omid Ardakanian,Petr Musilek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The forecasting of multivariate urban data presents a complex challenge due to the intricate dependencies between various urban metrics such as weather, air pollution, carbon intensity, and energy demand. This paper introduces a novel multivariate time-series forecasting model that utilizes advanced Graph Neural Networks (GNNs) to capture spatial dependencies among different time-series variables. The proposed model incorporates a decomposition-based preprocessing step, isolating trend, seasonal, and residual components to enhance the accuracy and interpretability of forecasts. By leveraging the dynamic capabilities of GNNs, the model effectively captures interdependencies and improves the forecasting performance. Extensive experiments on real-world datasets, including electricity usage, weather metrics, carbon intensity, and air pollution data, demonstrate the effectiveness of the proposed approach across various forecasting scenarios. The results highlight the potential of the model to optimize smart infrastructure systems, contributing to energy-efficient urban development and enhanced public well-being.
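
As a rough illustration of the decomposition-based preprocessing step (the paper does not prescribe this exact estimator), a centred moving average splits a series into trend, seasonal, and residual components before each is handed to the forecaster:

```python
import numpy as np

def decompose(series, period=24):
    """Trend / seasonal / residual split via a centred moving average,
    the generic preprocessing step before per-component forecasting."""
    trend = np.convolve(series, np.ones(period) / period, mode="same")
    detrended = series - trend
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, len(series) // period + 1)[: len(series)]
    return trend, seasonal, series - trend - seasonal

t = np.arange(24 * 14, dtype=float)       # two weeks of hourly data
demand = 0.05 * t + 5 * np.sin(2 * np.pi * t / 24) \
    + np.random.default_rng(0).normal(size=t.size)
trend, seasonal, residual = decompose(demand, period=24)
```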

[LG-17] Pure Exploration with Infinite Answers

链接: https://arxiv.org/abs/2505.22473
作者: Riccardo Poiani,Martino Bernasconi,Andrea Celli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study pure exploration problems where the set of correct answers is possibly infinite, e.g., the regression of any continuous function of the means of the bandit. We derive an instance-dependent lower bound for these problems. By analyzing it, we discuss why existing methods (i.e., Sticky Track-and-Stop) for finite answer problems fail at being asymptotically optimal in this more general setting. Finally, we present a framework, Sticky-Sequence Track-and-Stop, which generalizes both Track-and-Stop and Sticky Track-and-Stop, and that enjoys asymptotic optimality. Due to its generality, our analysis also highlights special cases where existing methods enjoy optimality.

[LG-18] CPINN-ABPI: Physics-Informed Neural Networks for Accurate Power Estimation in MPSoCs

链接: https://arxiv.org/abs/2505.22469
作者: Mohamed R. Elshamy,Mehdi Elahi,Ahmad Patooghy,Abdel-Hameed A. Badawy
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient thermal and power management in modern multiprocessor systems-on-chip (MPSoCs) demands accurate power consumption estimation. One of the state-of-the-art approaches, Alternative Blind Power Identification (ABPI), theoretically eliminates the dependence on steady-state temperatures, addressing a major shortcoming of previous approaches. However, ABPI performance has remained unverified in actual hardware implementations. In this study, we conduct the first empirical validation of ABPI on commercial hardware using the NVIDIA Jetson Xavier AGX platform. Our findings reveal that, while ABPI provides computational efficiency and independence from steady-state temperature, it exhibits considerable accuracy deficiencies in real-world scenarios. To overcome these limitations, we introduce a novel approach that integrates Custom Physics-Informed Neural Networks (CPINNs) with the underlying thermal model of ABPI. Our approach employs a specialized loss function that harmonizes physical principles with data-driven learning, complemented by multi-objective genetic algorithm optimization to balance estimation accuracy and computational cost. In experimental validation, CPINN-ABPI reduces the mean absolute error (MAE) by 84.7% for CPU power and 73.9% for GPU power relative to ABPI, with the weighted mean absolute percentage error (WMAPE) improving from 47%–81% to ~12%. The method maintains real-time performance with an inference time of 195.3 μs, with similar 85%–99% accuracy gains across heterogeneous SoCs.

[LG-19] Position: All Current Generative Fidelity and Diversity Metrics are Flawed ICML2025

链接: https://arxiv.org/abs/2505.22450
作者: Ossi Räisä,Boris van Breugel,Mihaela van der Schaar
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Any method’s development and practical application is limited by our ability to measure its reliability. The popularity of generative modeling emphasizes the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders practical use of synthetic data. Our aim is to convince the research community to spend more effort in developing metrics, instead of models. Additionally, through analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.

[LG-20] Data-Driven Antenna Miniaturization: A Knowledge-Based System Integrating Quantum PSO and Predictive Machine Learning Models

链接: https://arxiv.org/abs/2505.22440
作者: Khan Masood Parvez,Sk Md Abidar Rahaman,Ali Shiri Sichani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid evolution of wireless technologies necessitates automated design frameworks to address antenna miniaturization and performance optimization within constrained development cycles. This study demonstrates a machine learning enhanced workflow integrating Quantum-Behaved Dynamic Particle Swarm Optimization (QDPSO) with ANSYS HFSS simulations to accelerate antenna design. The QDPSO algorithm autonomously optimized loop dimensions in 11.53 seconds, achieving a resonance frequency of 1.4208 GHz, a 12.7% reduction compared to conventional 1.60 GHz designs. Machine learning models (SVM, Random Forest, XGBoost, and stacked ensembles) predicted resonance frequencies in 0.75 seconds using 936 simulation datasets, with stacked models showing superior training accuracy (R2=0.9825) and SVM demonstrating optimal validation performance (R2=0.7197). The complete design cycle, encompassing optimization, prediction, and ANSYS validation, required 12.42 minutes on standard desktop hardware (Intel i5-8500, 16GB RAM), contrasting sharply with the 50-hour benchmark of PSADEA-based approaches. This 240x acceleration eliminates traditional trial-and-error methods that often extend beyond seven expert-led days. The system enables precise specification of performance targets with automated generation of fabrication-ready parameters, particularly benefiting compact consumer devices requiring rapid frequency tuning. By bridging AI-driven optimization with CAD validation, this framework reduces engineering workloads while ensuring production-ready designs, establishing a scalable paradigm for next-generation RF systems in 6G and IoT applications.

[LG-21] STaR-Bets: Sequential Target-Recalculating Bets for Tighter Confidence Intervals

链接: https://arxiv.org/abs/2505.22422
作者: Václav Voráček,Francesco Orabona
类目: Machine Learning (cs.LG)
*备注: comments are welcome

点击查看摘要

Abstract:The construction of confidence intervals for the mean of a bounded random variable is a classical problem in statistics with numerous applications in machine learning and virtually all scientific fields. In particular, obtaining the tightest possible confidence intervals is vital every time the sampling of the random variables is expensive. The current state-of-the-art method to construct confidence intervals is by using betting algorithms. This is a very successful approach for deriving optimal confidence sequences, even matching the rate of the law of the iterated logarithm. However, in the fixed-horizon setting, these approaches are either sub-optimal or based on heuristic solutions with strong empirical performance but without a finite-time guarantee. Hence, no betting-based algorithm guaranteeing the optimal \mathcal{O}\big(\sqrt{\sigma^2 \log(1/\delta)/n}\big) width of the confidence intervals is known. This work bridges this gap. We propose a betting-based algorithm to compute confidence intervals that empirically outperforms the competitors. Our betting strategy uses the optimal strategy in every step (in a certain sense), whereas standard betting methods choose a constant strategy in advance. Leveraging this fact results in strict improvements even over classical concentration inequalities, such as those of Hoeffding or Bernstein. Moreover, we also prove that the width of our confidence intervals is optimal up to a 1+o(1) factor diminishing with n. The code is available at this https URL.
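
For intuition, here is a toy "testing by betting" confidence interval: a candidate mean m is rejected once a gambler betting against it multiplies their wealth past the Ville threshold. The constant bet below is a deliberate simplification; STaR-Bets' contribution is precisely to recompute a near-optimal bet at every step, so this sketch illustrates the general mechanism, not the proposed strategy.

```python
import numpy as np

def betting_ci(x, delta=0.05, grid=201):
    """Confidence interval for the mean of [0,1]-valued data: keep every
    candidate mean m that no wealth process rejects (Ville's inequality)."""
    kept = []
    for m in np.linspace(0.0, 1.0, grid):
        lam = 0.5 / max(m, 1.0 - m)       # keeps both wealth factors positive
        w_up = w_dn = 1.0
        rejected = False
        for xt in x:
            w_up *= 1.0 + lam * (xt - m)  # bets the true mean is above m
            w_dn *= 1.0 - lam * (xt - m)  # bets the true mean is below m
            if max(w_up, w_dn) >= 2.0 / delta:
                rejected = True
                break
        if not rejected:
            kept.append(m)
    return (min(kept), max(kept)) if kept else None

rng = np.random.default_rng(1)
print(betting_ci(rng.beta(2, 5, size=500)))  # interval around 2/7
```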

[LG-22] A Divide-and-Conquer Approach for Modeling Arrival Times in Business Process Simulation

链接: https://arxiv.org/abs/2505.22381
作者: Lukas Kirchdorfer,Konrad Özdemir,Stjepan Kusenic,Han van der Aa,Heiner Stuckenschmidt
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Business Process Simulation (BPS) is a critical tool for analyzing and improving organizational processes by estimating the impact of process changes. A key component of BPS is the case-arrival model, which determines the pattern of new case entries into a process. Although accurate case-arrival modeling is essential for reliable simulations, as it influences waiting and overall cycle times, existing approaches often rely on oversimplified static distributions of inter-arrival times. These approaches fail to capture the dynamic and temporal complexities inherent in organizational environments, leading to less accurate and reliable outcomes. To address this limitation, we propose Auto Time Kernel Density Estimation (AT-KDE), a divide-and-conquer approach that models arrival times of processes by incorporating global dynamics, day-of-week variations, and intraday distributional changes, ensuring both precision and scalability. Experiments conducted across 20 diverse processes demonstrate that AT-KDE is far more accurate and robust than existing approaches while maintaining sensible execution time efficiency.
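
A stripped-down version of the divide-and-conquer idea, fitting one KDE of inter-arrival times per day of week (AT-KDE additionally handles global trends and intraday segmentation, which are omitted here); `scipy` and the hour-based timestamp convention are assumptions of this sketch:

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_arrival_models(timestamps):
    """timestamps: sorted arrival times in hours; fit one KDE of
    inter-arrival gaps per day of week."""
    gaps = np.diff(timestamps)
    weekday = ((timestamps[:-1] // 24) % 7).astype(int)
    return {d: gaussian_kde(gaps[weekday == d])
            for d in range(7) if (weekday == d).sum() > 1}

def sample_gap(models, current_time_h):
    d = int((current_time_h // 24) % 7)
    return float(np.abs(models[d].resample(1)))   # keep gaps positive

arrivals = np.cumsum(np.random.default_rng(0).exponential(0.5, size=2000))
models = fit_arrival_models(arrivals)
next_arrival = arrivals[-1] + sample_gap(models, arrivals[-1])
```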

[LG-23] Directed Homophily-Aware Graph Neural Network

链接: https://arxiv.org/abs/2505.22362
作者: Aihu Zhang,Jiaxing Xu,Mengcheng Lan,Shili Xiang,Yiping Ke
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have achieved significant success in various learning tasks on graph-structured data. Nevertheless, most GNNs struggle to generalize to heterophilic neighborhoods. Additionally, many GNNs ignore the directional nature of real-world graphs, resulting in suboptimal performance on directed graphs with asymmetric structures. In this work, we propose Directed Homophily-aware Graph Neural Network (DHGNN), a novel framework that addresses these limitations by incorporating homophily-aware and direction-sensitive components. DHGNN employs a resettable gating mechanism to adaptively modulate message contributions based on homophily levels and informativeness, and a structure-aware noise-tolerant fusion module to effectively integrate node representations from the original and reverse directions. Extensive experiments on both homophilic and heterophilic directed graph datasets demonstrate that DHGNN outperforms state-of-the-art methods in node classification and link prediction. In particular, DHGNN improves over the best baseline by up to 15.07% in link prediction. Our analysis further shows that the gating mechanism captures directional homophily gaps and fluctuating homophily across layers, providing deeper insights into message-passing behavior on complex graph structures.

[LG-24] Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles

链接: https://arxiv.org/abs/2505.22361
作者: Xiangyu Chang,Xi Chen,Yining Wang,Zhiyi Zeng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper studies a bandit optimization problem where the goal is to maximize a function f(x) over T periods for some unknown strongly concave function f. We consider a new pairwise comparison oracle, where the decision-maker chooses a pair of actions (x, x') for a consecutive number of periods and then obtains an estimate of f(x)-f(x'). We show that such a pairwise comparison oracle finds important applications in joint pricing and inventory replenishment problems and in network revenue management. The challenge in this bandit optimization is twofold. First, the decision-maker not only needs to determine a pair of actions (x, x') but also a stopping time n (i.e., the number of queries based on (x, x')). Second, motivated by our inventory application, the estimate of the difference f(x)-f(x') is biased, which is different from existing oracles in the stochastic optimization literature. To address these challenges, we first introduce a discretization technique and local polynomial approximation to relate this problem to linear bandits. Then we develop a tournament successive-elimination technique to localize the discretized cell and run an interactive batched version of the LinUCB algorithm on cells. We establish regret bounds that are optimal up to poly-logarithmic factors. Furthermore, we apply our proposed algorithm and analytical framework to the two operations management problems and obtain results that improve upon the state-of-the-art results in the existing literature.

[LG-25] Multiclass Loss Geometry Matters for Generalization of Gradient Descent in Separable Classification

链接: https://arxiv.org/abs/2505.22359
作者: Matan Schliserman,Tomer Koren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the generalization performance of unregularized gradient methods for separable linear classification. While previous work mostly deals with the binary case, we focus on the multiclass setting with k classes and establish novel population risk bounds for Gradient Descent for loss functions that decay to zero. In this setting, we show risk bounds that reveal that convergence rates are crucially influenced by the geometry of the loss template, as formalized by Wang and Scott (2024), rather than by the geometry of the loss function itself. In particular, we establish risk upper bounds that hold for any decay rate of the loss whose template is smooth with respect to the p-norm. In the case of exponentially decaying losses, our results indicate a contrast between the p=\infty case, where the risk exhibits a logarithmic dependence on k, and p=2, where the risk scales linearly with k. To establish this separation formally, we also prove a lower bound in the latter scenario, demonstrating that the polynomial dependence on k is unavoidable. Central to our analysis is a novel bound on the Rademacher complexity of low-noise vector-valued linear predictors with a loss template smooth w.r.t. general p-norms.

[LG-26] Look Within or Look Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning

链接: https://arxiv.org/abs/2505.22355
作者: Yongkang Liu,Xingle Xu,Ercong Nie,Zijing Wang,Shi Feng,Daling Wang,Qian Li,Hinrich Schütze
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods achieve performance comparable to Full Fine-Tuning (FFT) while requiring significantly fewer computing resources, making it the go-to choice for researchers. We find that although PEFT can achieve competitive results on some benchmarks, its performance falls short of FFT in complex tasks, such as reasoning and instruction-based fine-tuning. In this paper, we compare the characteristics of PEFT and FFT in terms of representational capacity and robustness based on optimization theory. We theoretically demonstrate that PEFT is a strict subset of FFT. By providing theoretical upper bounds for PEFT, we show that the limited parameter space constrains the model's representational ability, making it more susceptible to perturbations. Experiments on 15 datasets encompassing classification, generation, reasoning, and instruction fine-tuning tasks and 11 adversarial test sets validate our theories. We hope that these results spark further research beyond the realms of well-established PEFT. The source code is available in an anonymous GitHub repository: this https URL.

[LG-27] A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective

链接: https://arxiv.org/abs/2505.22322
作者: Zhengyu Fang,Zhimeng Jiang,Huiyuan Chen,Xiaoge Zhang,Kaiyu Tang,Xiao Li,Jing Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by reproducing exact training samples. While prior work focuses on dataset-level augmentation to reduce memorization, little is known about which individual samples contribute most. We present the first data-centric study of memorization dynamics in tabular diffusion models. We quantify memorization for each real sample based on how many generated samples are flagged as replicas, using a relative distance ratio. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, confirmed via sample-removal experiments. To understand this, we divide real samples into top- and non-top-memorized groups and analyze their training-time behaviors. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC). Memorized samples are memorized slightly earlier and show stronger signals in early training. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method: (a) rank samples by epoch-wise intensity, (b) prune a tunable top fraction, and (c) retrain on the filtered dataset. Across multiple tabular datasets and models, DynamicCut reduces memorization with minimal impact on data diversity and downstream performance. It also complements augmentation-based defenses. Furthermore, DynamicCut enables cross-model transferability: high-ranked samples identified from one model (e.g., a diffusion model) are also effective for reducing memorization when removed from others, such as GANs and VAEs.
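
Stages (a) and (b) of DynamicCut reduce to a simple ranking-and-pruning step once per-epoch memorization intensities are available. A hypothetical sketch (the intensity scores and the `prune_frac` value are assumed inputs, not the paper's defaults):

```python
import numpy as np

def dynamiccut_filter(mem_intensity, prune_frac=0.02):
    """mem_intensity: (n_samples, n_epochs) per-epoch memorization signal
    for each real training sample (e.g., replica counts under a relative
    distance-ratio test). Returns indices of rows to keep; retraining on
    these rows is stage (c)."""
    score = mem_intensity.mean(axis=1)            # epoch-wise intensity rank
    n_prune = int(len(score) * prune_frac)
    keep = np.argsort(score)[: len(score) - n_prune]  # drop the top tail
    return np.sort(keep)

scores = np.random.default_rng(0).gamma(1.0, 1.0, size=(10_000, 20))
keep_idx = dynamiccut_filter(scores, prune_frac=0.02)
# retrain the tabular diffusion model on data[keep_idx]
```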

[LG-28] Rethinking BPS: A Utility-Based Evaluation Framework

链接: https://arxiv.org/abs/2505.22316
作者: Konrad Özdemir,Lukas Kirchdorfer,Keyvan Amiri Elyasi,Han van der Aa,Heiner Stuckenschmidt
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Business process simulation (BPS) is a key tool for analyzing and optimizing organizational workflows, supporting decision-making by estimating the impact of process changes. The reliability of such estimates depends on the ability of a BPS model to accurately mimic the process under analysis, making rigorous accuracy evaluation essential. However, the state-of-the-art approach to evaluating BPS models has two key limitations. First, it treats simulation as a forecasting problem, testing whether models can predict unseen future events. This fails to assess how well a model captures the as-is process, particularly when process behavior changes from train to test period. Thus, it becomes difficult to determine whether poor results stem from an inaccurate model or the inherent complexity of the data, such as unpredictable drift. Second, the evaluation approach strongly relies on Earth Mover’s Distance-based metrics, which can obscure temporal patterns and thus yield misleading conclusions about simulation quality. To address these issues, we propose a novel framework that evaluates simulation quality based on its ability to generate representative process behavior. Instead of comparing simulated logs to future real-world executions, we evaluate whether predictive process monitoring models trained on simulated data perform comparably to those trained on real data for downstream analysis tasks. Empirical results show that our framework not only helps identify sources of discrepancies but also distinguishes between model accuracy and data complexity, offering a more meaningful way to assess BPS quality.

[LG-29] Transformers Pretrained on Procedural Data Contain Modular Structures for Algorithmic Reasoning

链接: https://arxiv.org/abs/2505.22308
作者: Zachary Shinnick,Liangze Jiang,Hemanth Saratchandran,Anton van den Hengel,Damien Teney
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pretraining on large, semantically rich datasets is key for developing language models. Surprisingly, recent studies have shown that even synthetic data, generated procedurally through simple semantic-free algorithms, can yield some of the same benefits as natural language pretraining. It is unclear what specific capabilities such simple synthetic data instils in a model, where these capabilities reside in the architecture, and how they manifest within its weights. In this short paper, we identify several beneficial forms of procedural data, together with specific algorithmic reasoning skills that improve in small transformers. Our core finding is that different procedural rules instil distinct but complementary inductive structures in the model. With extensive ablations and partial-transfer experiments, we discover that these structures reside in different parts of the model. Attention layers often carry the most transferable information, but some pretraining rules impart useful structure to MLP blocks instead. Most interestingly, the structures induced by multiple rules can be composed to jointly reinforce multiple capabilities. These results suggest an exciting possibility of disentangling the acquisition of knowledge from reasoning in language models, with the goal of improving their robustness and data efficiency.

[LG-30] Full Domain Analysis in Fluid Dynamics

链接: https://arxiv.org/abs/2505.22275
作者: Alexander Hagg,Adam Gaier,Dominik Wilde,Alexander Asteroth,Holger Foysi,Dirk Reith
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Novel techniques in evolutionary optimization, simulation and machine learning allow for a broad analysis of domains like fluid dynamics, in which computation is expensive and flow behavior is complex. By full domain analysis, we mean the ability to efficiently determine the full space of solutions in a problem domain and to analyze the behavior of those solutions in an accessible and interactive manner. The goal of full domain analysis is to deepen our understanding of domains by generating many examples of flow and by diversifying, optimizing, and analyzing them. We define a formal model for full domain analysis, survey its current state of the art, and specify the requirements of its subcomponents. Finally, an example is given to show what we can learn by using full domain analysis. Full domain analysis, rooted in optimization and machine learning, can be a helpful tool in understanding complex systems in computational physics and beyond.

[LG-31] Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

链接: https://arxiv.org/abs/2505.22257
作者: Youssef Mroueh,Nicolas Dupuis,Brian Belgodere,Apoorva Nitsure,Mattia Rigotti,Kristjan Greenewald,Jiri Navratil,Jerret Ross,Jesus Rios
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.

[LG-32] A Unified Online-Offline Framework for Co-Branding Campaign Recommendations KDD

链接: https://arxiv.org/abs/2505.22254
作者: Xiangxiang Dai,Xiaowei Sun,Jinhang Zuo,Xutong Liu,John C.S. Lui
类目: Machine Learning (cs.LG)
*备注: Accepted at the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025

点击查看摘要

Abstract:Co-branding has become a vital strategy for businesses aiming to expand market reach within recommendation systems. However, identifying effective cross-industry partnerships remains challenging due to resource imbalances, uncertain brand willingness, and ever-changing market conditions. In this paper, we provide the first systematic study of this problem and propose a unified online-offline framework to enable co-branding recommendations. Our approach begins by constructing a bipartite graph linking "initiating" and "target" brands to quantify co-branding probabilities and assess market benefits. During the online learning phase, we dynamically update the graph in response to market feedback, while striking a balance between exploring new collaborations for long-term gains and exploiting established partnerships for immediate benefits. To address the high initial co-branding costs, our framework mitigates redundant exploration, thereby enhancing short-term performance while ensuring sustainable strategic growth. In the offline optimization phase, our framework consolidates the interests of multiple sub-brands under the same parent brand to maximize overall returns, avoid excessive investment in single sub-brands, and reduce unnecessary costs associated with over-prioritizing a single sub-brand. We present a theoretical analysis of our approach, establishing a highly nontrivial sublinear regret bound for online learning in the complex co-branding problem, and enhancing the approximation guarantee for the NP-hard offline budget allocation optimization. Experiments on both synthetic and real-world co-branding datasets demonstrate the practical effectiveness of our framework, with at least 12% improvement.

[LG-33] B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data

链接: https://arxiv.org/abs/2505.22252
作者: Magdalena Proszewska,Tomasz Danel,Dawid Rymarczyk
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 26 pages, 16 figures, 5 tables

点击查看摘要

Abstract:Understanding the reasoning behind deep learning model predictions is crucial in cheminformatics and drug discovery, where molecular design determines their properties. However, current evaluation frameworks for Explainable AI (XAI) in this domain often rely on artificial datasets or simplified tasks, employing data-derived metrics that fail to capture the complexity of real-world scenarios and lack a direct link to explanation faithfulness. To address this, we introduce B-XAIC, a novel benchmark constructed from real-world molecular data and diverse tasks with known ground-truth rationales for assigned labels. Through a comprehensive evaluation using B-XAIC, we reveal limitations of existing XAI methods for Graph Neural Networks (GNNs) in the molecular domain. This benchmark provides a valuable resource for gaining deeper insights into the faithfulness of XAI, facilitating the development of more reliable and interpretable models.

[LG-34] UDuo: Universal Dual Optimization Framework for Online Matching

链接: https://arxiv.org/abs/2505.22243
作者: Bin Li,Diwei Liu,Zehong Hu,Jia Jia
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online resource allocation under budget constraints critically depends on proper modeling of user arrival dynamics. Classical approaches employ stochastic user arrival models to derive near-optimal solutions through fractional matching formulations of exposed users for downstream allocation tasks. However, this is no longer a reasonable assumption when the environment changes dynamically. In this work, we propose the Universal Dual optimization framework UDuo, a novel paradigm that fundamentally rethinks online allocation through three key innovations: (i) a temporal user arrival representation vector that explicitly captures distribution shifts in user arrival patterns and resource consumption dynamics, (ii) a resource pacing learner with adaptive allocation policies that generalize to heterogeneous constraint scenarios, and (iii) an online time-series forecasting approach for future user arrival distributions that achieves asymptotically optimal solutions with constraint feasibility guarantees in dynamic environments. Experimental results show that UDuo achieves higher efficiency and faster convergence than the traditional stochastic arrival model in real-world pricing while maintaining rigorous theoretical validity for general online allocation problems.

[LG-35] Yambda-5B – A Large-Scale Multi-modal Dataset for Ranking And Retrieval

链接: https://arxiv.org/abs/2505.22238
作者: A. Ploshkin,V. Tytskiy,A. Pismenny,V. Baikalov,E. Taychinov,A. Permiakov,D. Burlakov,E. Krofto,N. Savushkin
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Yambda-5B, a large-scale open dataset sourced from the this http URL streaming platform. Yambda-5B contains 4.79 billion user-item interactions from 1 million users across 9.39 million tracks. The dataset includes two primary types of interactions: implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes and undislikes). In addition, we provide audio embeddings for most tracks, generated by a convolutional neural network trained on audio spectrograms. A key distinguishing feature of Yambda-5B is the inclusion of the is_organic flag, which separates organic user actions from recommendation-driven events. This distinction is critical for developing and evaluating machine learning algorithms, as this http URL relies on recommender systems to personalize track selection for users. To support rigorous benchmarking, we introduce an evaluation protocol based on a Global Temporal Split, allowing recommendation algorithms to be assessed in conditions that closely mirror real-world use. We report benchmark results for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using a variety of evaluation metrics. By releasing Yambda-5B to the community, we aim to provide a readily accessible, industrial-scale resource to advance research, foster innovation, and promote reproducible results in recommender systems.
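
The Global Temporal Split itself is straightforward to implement. A sketch under assumed column names (`timestamp` in seconds) and window lengths, not the benchmark's official configuration:

```python
import pandas as pd

def global_temporal_split(events, test_days=1, val_days=1):
    """One cut-off timestamp for all users: train on the past, evaluate on
    the future, avoiding the leakage of per-user leave-one-out splits."""
    day = 24 * 3600
    test_start = events["timestamp"].max() - test_days * day
    val_start = test_start - val_days * day
    train = events[events["timestamp"] < val_start]
    val = events[(events["timestamp"] >= val_start)
                 & (events["timestamp"] < test_start)]
    test = events[events["timestamp"] >= test_start]
    return train, val, test

events = pd.DataFrame({
    "user_id":   [1, 1, 2, 2],
    "item_id":   [10, 11, 10, 12],
    "timestamp": [0, 90_000, 200_000, 260_000],
})
train, val, test = global_temporal_split(events)
```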

[LG-36] Optimal kernel regression bounds under energy-bounded noise

链接: https://arxiv.org/abs/2505.22235
作者: Amon Lahr,Johannes Köhler,Anna Scampicchio,Melanie N. Zeilinger
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Non-conservative uncertainty bounds are key for both assessing an estimation algorithm’s accuracy and in view of downstream tasks, such as its deployment in safety-critical contexts. In this paper, we derive a tight, non-asymptotic uncertainty bound for kernel-based estimation, which can also handle correlated noise sequences. Its computation relies on a mild norm-boundedness assumption on the unknown function and the noise, returning the worst-case function realization within the hypothesis class at an arbitrary query input location. The value of this function is shown to be given in terms of the posterior mean and covariance of a Gaussian process for an optimal choice of the measurement noise covariance. By rigorously analyzing the proposed approach and comparing it with other results in the literature, we show its effectiveness in returning tight and easy-to-compute bounds for kernel-based estimates.

[LG-37] LaMM: Semi-Supervised Pre-Training of Large-Scale Materials Models

链接: https://arxiv.org/abs/2505.22208
作者: Yosuke Oyama,Yusuke Majima,Eiji Ohta,Yasufumi Sakai
类目: Machine Learning (cs.LG)
*备注: 24 pages, 9 figures

点击查看摘要

Abstract:Neural network potentials (NNPs) are crucial for accelerating computational materials science by serving as surrogates for density functional theory (DFT) calculations. Improving their accuracy is possible through pre-training and fine-tuning, where an NNP model is first pre-trained on a large-scale dataset and then fine-tuned on a smaller target dataset. However, this approach is computationally expensive, mainly due to the cost of DFT-based dataset labeling and load imbalances during large-scale pre-training. To address this, we propose LaMM, a semi-supervised pre-training method incorporating improved denoising self-supervised learning and a load-balancing algorithm for efficient multi-node training. We demonstrate that our approach effectively leverages a large-scale dataset of ~300 million semi-labeled samples to train a single NNP model, resulting in improved fine-tuning performance in terms of both speed and accuracy.

[LG-38] An Augmentation-Aware Theory for Self-Supervised Contrastive Learning ICML2025

链接: https://arxiv.org/abs/2505.22196
作者: Jingyi Cui,Hongwei Wen,Yisen Wang
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML2025

点击查看摘要

Abstract:Self-supervised contrastive learning has emerged as a powerful tool in machine learning and computer vision to learn meaningful representations from unlabeled data. Meanwhile, its empirical success has encouraged many theoretical studies to reveal the learning mechanisms. However, in the existing theoretical research, the role of data augmentation is still under-explored, especially the effects of specific augmentation types. To fill this gap, we propose, for the first time, an augmentation-aware error bound for self-supervised contrastive learning, showing that the supervised risk is bounded not only by the unsupervised risk, but also explicitly by a trade-off induced by data augmentation. Then, under a novel semantic label assumption, we discuss how certain augmentation methods affect the error bound. Lastly, we conduct both pixel- and representation-level experiments to verify our proposed theoretical results.

[LG-39] The informativeness of the gradient revisited

链接: https://arxiv.org/abs/2505.22158
作者: Rustem Takhanov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the past decade gradient-based deep learning has revolutionized several applications. However, this rapid advancement has highlighted the need for a deeper theoretical understanding of its limitations. Research has shown that, in many practical learning tasks, the information contained in the gradient is so minimal that gradient-based methods require an exceedingly large number of iterations to achieve success. The informativeness of the gradient is typically measured by its variance with respect to the random selection of a target function from a hypothesis class. We use this framework and give a general bound on the variance in terms of a parameter related to the pairwise independence of the target function class and the collision entropy of the input distribution. Our bound scales as \tilde{\mathcal{O}}(\varepsilon + e^{-\frac{1}{2}\mathcal{E}_c}), where \tilde{\mathcal{O}} hides factors related to the regularity of the learning model and the loss function, \varepsilon measures the pairwise independence of the target function class, and \mathcal{E}_c is the collision entropy of the input distribution. To demonstrate the practical utility of our bound, we apply it to the class of Learning with Errors (LWE) mappings and high-frequency functions. In addition to the theoretical analysis, we present experiments to better understand the nature of recent deep learning-based attacks on LWE.

[LG-40] Uncertainty Estimation for Heterophilic Graphs Through the Lens of Information Theory

链接: https://arxiv.org/abs/2505.22152
作者: Dominik Fuchsgruber,Tom Wollschläger,Johannes Bordne,Stephan Günnemann
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:While uncertainty estimation for graphs has recently gained traction, most methods rely on homophily and deteriorate in heterophilic settings. We address this by analyzing message passing neural networks from an information-theoretic perspective and developing a suitable analog to the data processing inequality to quantify information throughout the model's layers. In contrast to non-graph domains, information about the node-level prediction target can increase with model depth if a node's features are semantically different from its neighbors. Therefore, on heterophilic graphs, the latent embeddings of an MPNN each provide different information about the data distribution - different from homophilic settings. This reveals that considering all node representations simultaneously is a key design principle for epistemic uncertainty estimation on graphs beyond homophily. We empirically confirm this with a simple post-hoc density estimator on the joint node embedding space that provides state-of-the-art uncertainty on heterophilic graphs. At the same time, it matches prior work on homophilic graphs without explicitly exploiting homophily through post-processing.

[LG-41] Oryx: a Performant and Scalable Algorithm for Many-Agent Coordination in Offline MARL

链接: https://arxiv.org/abs/2505.22151
作者: Claude Formanek,Omayma Mahjoub,Louay Ben Nessir,Sasha Abramowitz,Ruan de Kock,Wiem Khlifi,Simon Du Toit,Felix Chalumeau,Daniel Rajaonarivonivelomanantsoa,Arnol Fokam,Siddarth Singh,Ulrich Mbou Sob,Arnu Pretorius
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable and combines it with a sequential form of implicit constraint Q-learning (ICQ), to develop a novel offline auto-regressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over lengthy trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works (SMAC, RWARE, and Multi-Agent MuJoCo) covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx’s superior ability to scale effectively in such settings. We will make all of our datasets, experimental data, and code available upon publication.

[LG-42] BiMi Sheets: Infosheets for bias mitigation methods

链接: https://arxiv.org/abs/2505.22114
作者: MaryBeth Defrance,Guillaume Bied,Maarten Buyl,Jefrey Lijffijt,Tijl De Bie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Over the past 15 years, hundreds of bias mitigation methods have been proposed in the pursuit of fairness in machine learning (ML). However, algorithmic biases are domain-, task-, and model-specific, leading to a 'portability trap': bias mitigation solutions in one context may not be appropriate in another. Thus, a myriad of design choices have to be made when creating a bias mitigation method, such as the formalization of fairness it pursues, and where and how it intervenes in the ML pipeline. This creates challenges in benchmarking and comparing the relative merits of different bias mitigation methods, and limits their uptake by practitioners. We propose BiMi Sheets as a portable, uniform guide to document the design choices of any bias mitigation method. This enables researchers and practitioners to quickly learn its main characteristics and to compare with their desiderata. Furthermore, the sheets' structure allows for the creation of a structured database of bias mitigation methods. In order to foster the sheets' adoption, we provide a platform for finding and creating BiMi Sheets at this http URL.

[LG-43] ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

链接: https://arxiv.org/abs/2505.22094
作者: Tonghe Zhang,Yu Chao,Sicang Su,Yu Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 30 pages, 13 figures, 10 tables

点击查看摘要

Abstract:We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow on representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies grew by an average of 135.36% after fine-tuning on challenging legged locomotion tasks, while saving denoising steps and 82.63% of wall time compared to the state-of-the-art diffusion RL fine-tuning method DPPO [43]. The success rate of Shortcut Model policies on state- and vision-based manipulation tasks increased by an average of 40.34% after fine-tuning with ReinFlow at four or even one denoising step; their performance is comparable to fine-tuned DDIM policies while saving an average of 23.20% of computation time. Project webpage: this https URL
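
The core conversion is compact: perturb each deterministic flow step with learnable Gaussian noise, so that step likelihoods become exact Normal densities. A toy sketch with stand-in velocity and noise networks (the real policy networks, time discretization, and RL objective live in the paper and its codebase):

```python
import torch

def reinflow_step(x, t, dt, velocity, noise_net):
    """One noise-injected denoising step: the deterministic flow (Euler)
    update is perturbed with learnable Gaussian noise, so the chain of
    steps is a discrete-time Markov process whose log-likelihood is an
    exact sum of Gaussian log-densities, usable in a policy gradient."""
    mean = x + velocity(x, t) * dt           # deterministic flow step
    sigma = noise_net(x, t).exp()            # learnable, state-dependent scale
    dist = torch.distributions.Normal(mean, sigma)
    x_next = dist.rsample()
    log_prob = dist.log_prob(x_next).sum(-1)  # exact step likelihood
    return x_next, log_prob

# hypothetical two-step rollout with toy stand-in networks
v = lambda x, t: -x                          # stand-in velocity field
s = lambda x, t: torch.full_like(x, -2.0)    # stand-in log-sigma network
x, logp = torch.randn(4, 8), 0.0
for k in range(2):
    x, lp = reinflow_step(x, k / 2, 0.5, v, s)
    logp = logp + lp
```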

[LG-44] Can Test-time Computation Mitigate Memorization Bias in Neural Symbolic Regression?

链接: https://arxiv.org/abs/2505.22081
作者: Shun Sato,Issei Sato
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Symbolic regression aims to discover mathematical equations that fit given numerical data. It has been applied in various fields of scientific research, such as producing human-readable expressions that explain physical phenomena. Recently, neural symbolic regression (NSR) methods that involve Transformers pre-trained on large-scale synthetic datasets have gained attention. While these methods offer advantages such as short inference time, they suffer from low performance, particularly when the number of input variables is large. In this study, we hypothesized that this limitation stems from the memorization bias of Transformers in symbolic regression. We conducted a quantitative evaluation of this bias in Transformers using a synthetic dataset and found that Transformers rarely generate expressions not present in the training data. Additional theoretical analysis reveals that this bias arises from the Transformer's inability to construct expressions compositionally while verifying their numerical validity. Finally, we examined whether tailored test-time strategies can reduce memorization bias and improve performance. We empirically demonstrate that providing additional information to the model at test time can significantly mitigate memorization bias. On the other hand, we also find that reducing memorization bias does not necessarily correlate with improved performance. These findings contribute to a deeper understanding of the limitations of NSR approaches and offer a foundation for designing more robust, generalizable symbolic regression methods. Code is available at this https URL.

[LG-45] Differentiable Generalized Sliced Wasserstein Plans

链接: https://arxiv.org/abs/2505.22049
作者: Laetitia Chapel,Romain Tavenard,Samuel Vaiter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimal Transport (OT) has attracted significant interest in the machine learning community, not only for its ability to define meaningful distances between probability distributions – such as the Wasserstein distance – but also for its formulation of OT plans. Its computational complexity remains a bottleneck, though, and slicing techniques have been developed to scale OT to large datasets. Recently, a novel slicing scheme, dubbed min-SWGG, lifts a single one-dimensional plan back to the original multidimensional space, finally selecting the slice that yields the lowest Wasserstein distance as an approximation of the full OT plan. Despite its computational and theoretical advantages, min-SWGG inherits typical limitations of slicing methods: (i) the number of required slices grows exponentially with the data dimension, and (ii) it is constrained to linear projections. Here, we reformulate min-SWGG as a bilevel optimization problem and propose a differentiable approximation scheme to efficiently identify the optimal slice, even in high-dimensional settings. We furthermore define a generalized extension that accommodates data living on manifolds. Finally, we demonstrate the practical value of our approach in various applications, including gradient flows on manifolds and high-dimensional spaces, as well as a novel sliced OT-based conditional flow matching for image generation – where fast computation of transport plans is essential.
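
For orientation, here is the non-differentiable, random-search ancestor of min-SWGG: lift the sorting-based 1-D plan of each random slice back to R^d and keep the cheapest one. The paper's contribution replaces this search with a differentiable bilevel scheme, which this sketch does not implement:

```python
import numpy as np

def min_swgg_plan(X, Y, n_slices=64, seed=0):
    """Project both clouds on random directions, lift each 1-D sorting plan
    back to R^d, and keep the slice whose induced plan has the lowest
    squared-Euclidean transport cost (Monte-Carlo variant of min-SWGG)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_cost, best_perm = np.inf, None
    for _ in range(n_slices):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        # 1-D optimal plan = monotone matching of sorted projections
        perm = np.argsort(Y @ theta)[np.argsort(np.argsort(X @ theta))]
        cost = np.sum((X - Y[perm]) ** 2)    # cost of the lifted plan
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

X = np.random.default_rng(1).normal(size=(100, 5))
Y = np.random.default_rng(2).normal(size=(100, 5)) + 1.0
cost, plan = min_swgg_plan(X, Y)
```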

[LG-46] Detecting Undesired Process Behavior by Means of Retrieval Augmented Generation

链接: https://arxiv.org/abs/2505.22041
作者: Michael Grohs,Adrian Rebmann,Jana-Rebecca Rehse
类目: Machine Learning (cs.LG)
*备注: Accepted at the BPM Forum, located at the International Conference on Business Process Management (BPM) 2025

点击查看摘要

Abstract:Conformance checking techniques detect undesired process behavior by comparing process executions that are recorded in event logs to desired behavior that is captured in a dedicated process model. If such models are not available, conformance checking techniques are not applicable, but organizations might still be interested in detecting undesired behavior in their processes. To enable this, existing approaches use Large Language Models (LLMs), assuming that they can learn to distinguish desired from undesired behavior through fine-tuning. However, fine-tuning is highly resource-intensive and the fine-tuned LLMs often do not generalize well. To address these limitations, we propose an approach that requires neither a dedicated process model nor resource-intensive fine-tuning to detect undesired process behavior. Instead, we use Retrieval Augmented Generation (RAG) to provide an LLM with direct access to a knowledge base that contains both desired and undesired process behavior from other processes, assuming that the LLM can transfer this knowledge to the process at hand. Our evaluation shows that our approach outperforms fine-tuned LLMs in detecting undesired behavior, demonstrating that RAG is a viable alternative to resource-intensive fine-tuning, particularly when enriched with relevant context from the event log, such as frequent traces and activities.

[LG-47] Weakly-Supervised Contrastive Learning for Imprecise Class Labels

链接: https://arxiv.org/abs/2505.22028
作者: Zi-Hao Zhou,Jun-Jie Wang,Tong Wei,Min-Ling Zhang
类目: Machine Learning (cs.LG)
*备注: 38 pages, 2 figures, 11 tables

点击查看摘要

Abstract:Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of "continuous semantic similarity" to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The implementation code is available at this https URL.

[LG-48] Learning in Compact Spaces with Approximately Normalized Transformers

链接: https://arxiv.org/abs/2505.22014
作者: Jörg K.H. Franke,Urs Spiegelhalter,Marianna Nezhurina,Jenia Jitsev,Frank Hutter,Michael Hefenbrock
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:In deep learning, regularization and normalization are common solutions for challenges such as overfitting, numerical instabilities, and the increasing variance in the residual stream. An alternative approach is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic but approximate normalization (anTransformer). Our approach constrains the norm of parameters and normalizes all representations via scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. When applied to GPT training, we observe a 40% faster convergence compared to models with QK normalization, with less than 3% additional runtime. Deriving scaling laws for anGPT, we found our method enables training with larger batch sizes and fewer hyperparameters, while matching the favorable scaling characteristics of classic GPT architectures.
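
以下是"用标量乘法近似归一化"这一核心思想的极简示意(非 anTransformer 官方实现):利用高维随机向量范数高度集中于 sqrt(d) 的事实,用常数标量取代逐样本的范数计算,并在每步训练后把权重约束回单位范数;`ApproxNorm`、`renorm_weights` 等名字均为本文假设。

```python
import math
import torch
import torch.nn as nn

class ApproxNorm(nn.Module):
    # 近似 L2 归一化:高维随机向量的范数集中在 sqrt(d) 附近,
    # 因此用标量 1/sqrt(d) 缩放即可,无需逐样本计算范数
    def __init__(self, dim):
        super().__init__()
        self.inv_sqrt_d = 1.0 / math.sqrt(dim)

    def forward(self, x):
        return x * self.inv_sqrt_d

@torch.no_grad()
def renorm_weights(model):
    # 每步训练后把各权重行约束回单位范数(参数约束在超球面上)
    for p in model.parameters():
        if p.dim() == 2:
            p.div_(p.norm(dim=1, keepdim=True).clamp_min(1e-8))

d = 512
block = nn.Sequential(nn.Linear(d, d), ApproxNorm(d))
x = torch.randn(4, d)
print("mean ||x|| vs sqrt(d):", x.norm(dim=1).mean().item(), math.sqrt(d))
y = block(x)
renorm_weights(block)
```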

[LG-49] Locking-Free Training of Physics-Informed Neural Network for Solving Nearly Incompressible Elasticity Equations

链接: https://arxiv.org/abs/2505.21994
作者: Josef Dick,Seungchan Ko,Kassem Mustapha,Sanghyeon Park
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to divergence instability, the accuracy of low-order conforming finite element methods for nearly incompressible homogeneous elasticity equations deteriorates as the Lamé coefficient $\lambda \to \infty$, or equivalently as the Poisson ratio $\nu \to 1/2$. This phenomenon, known as locking or non-robustness, remains not fully understood despite extensive investigation. In this paper, we propose a robust method based on a fundamentally different, machine-learning-driven approach. Leveraging recently developed Physics-Informed Neural Networks (PINNs), we address the numerical solution of linear elasticity equations governing nearly incompressible materials. The core idea of our method is to appropriately decompose the given equations to alleviate the extreme imbalance in the coefficients, while simultaneously solving both the forward and inverse problems to recover the solutions of the decomposed systems as well as the associated external conditions. Through various numerical experiments, including constant, variable and parametric Lamé coefficients, we illustrate the efficiency of the proposed methodology.

[LG-50] ACE: Exploring Activation Cosine Similarity and Variance for Accurate and Calibration-Efficient LLM Pruning

链接: https://arxiv.org/abs/2505.21987
作者: Zhendong Mi,Zhenglun Kong,Geng Yuan,Shaoyi Huang
类目: Machine Learning (cs.LG)
*备注: 9 pages, 2 figures, 13 tables

点击查看摘要

Abstract:With the rapid expansion of large language models (LLMs), the demand for memory and computational resources has grown significantly. Recent advances in LLM pruning aim to reduce the size and computational cost of these models. However, existing methods often suffer from either suboptimal pruning performance or low time efficiency during the pruning process. In this work, we propose an efficient and effective pruning method that simultaneously achieves high pruning performance and fast pruning speed with improved calibration efficiency. Our approach introduces two key innovations: (1) An activation cosine similarity loss-guided pruning metric, which considers the angular deviation of the output activation between the dense and pruned models. (2) An activation variance-guided pruning metric, which helps preserve semantic distinctions in output activations after pruning, enabling effective pruning with shorter input sequences. These two components can be readily combined to enhance LLM pruning in both accuracy and efficiency. Experimental results show that our method achieves up to an 18% reduction in perplexity and up to 63% decrease in pruning time on prevalent LLMs such as LLaMA, LLaMA-2, and OPT.
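
下面给出一个极简示意,说明"以激活统计量(范数与方差)指导权重剪枝"的一般形式。注意这不是论文的原始度量(论文使用的是稠密 / 剪枝模型输出激活之间的余弦相似度损失与激活方差),`pruning_scores` 的具体加权方式为本文假设:

```python
import torch

def pruning_scores(W, X):
    # W: (out, in) 权重;X: (n, in) 校准输入
    # 用输入通道的激活范数与方差共同加权权重幅值,方差项帮助保留语义差异
    act_norm = X.norm(dim=0)
    act_var = X.var(dim=0)
    channel_score = act_norm * (1.0 + act_var)
    return W.abs() * channel_score  # broadcast 到 (out, in)

def prune_by_ratio(W, X, ratio=0.5):
    # 按分数从低到高剪掉 ratio 比例的权重
    s = pruning_scores(W, X)
    k = int(W.numel() * ratio)
    thresh = s.flatten().kthvalue(k).values
    return W * (s > thresh).float()

torch.manual_seed(0)
W, X = torch.randn(64, 128), torch.randn(32, 128)
Wp = prune_by_ratio(W, X)
print("sparsity:", (Wp == 0).float().mean().item())
```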

[LG-51] Two-Stage Feature Generation with Transformer and Reinforcement Learning

链接: https://arxiv.org/abs/2505.21978
作者: Wanfu Gao,Zengyao Man,Zebin He,Yuhao Tang,Jun Gao,Kunpeng Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature generation is a critical step in machine learning, aiming to enhance model performance by capturing complex relationships within the data and generating meaningful new features. Traditional feature generation methods heavily rely on domain expertise and manual intervention, making the process labor-intensive and challenging to adapt to different scenarios. Although automated feature generation techniques address these issues to some extent, they often face challenges such as feature redundancy, inefficiency in feature space exploration, and limited adaptability to diverse datasets and tasks. To address these problems, we propose a Two-Stage Feature Generation (TSFG) framework, which integrates a Transformer-based encoder-decoder architecture with Proximal Policy Optimization (PPO). The encoder-decoder model in TSFG leverages the Transformer’s self-attention mechanism to efficiently represent and transform features, capturing complex dependencies within the data. PPO further enhances TSFG by dynamically adjusting the feature generation strategy based on task-specific feedback, optimizing the process for improved performance and adaptability. TSFG dynamically generates high-quality feature sets, significantly improving the predictive performance of machine learning models. Experimental results demonstrate that TSFG outperforms existing state-of-the-art methods in terms of feature quality and adaptability.

[LG-52] BOFormer: Learning to Solve Multi-Objective Bayesian Optimization via Non-Markovian RL

链接: https://arxiv.org/abs/2505.21974
作者: Yu-Heng Hung,Kai-Jie Lin,Yu-Heng Lin,Chien-Yi Wang,Cheng Sun,Ping-Chun Hsieh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) offers an efficient pipeline for optimizing black-box functions with the help of a Gaussian process prior and an acquisition function (AF). Recently, in the context of single-objective BO, learning-based AFs witnessed promising empirical results given their favorable non-myopic nature. Despite this, the direct extension of these approaches to multi-objective Bayesian optimization (MOBO) suffers from the hypervolume identifiability issue, which results from the non-Markovian nature of MOBO problems. To tackle this, inspired by the non-Markovian RL literature and the success of Transformers in language modeling, we present a generalized deep Q-learning framework and propose BOFormer, which substantiates this framework for MOBO via sequence modeling. Through extensive evaluation, we demonstrate that BOFormer consistently outperforms the benchmark rule-based and learning-based algorithms in various synthetic MOBO and real-world multi-objective hyperparameter optimization problems. We have made the source code publicly available to encourage further research in this direction.

[LG-53] Stochastic Primal-Dual Double Block-Coordinate for Two-way Partial AUC Maximization

链接: https://arxiv.org/abs/2505.21944
作者: Linli Zhou,Bokun Wang,My T. Thai,Tianbao Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Two-way partial AUC (TPAUC) is a critical performance metric for binary classification with imbalanced data, as it focuses on specific ranges of the true positive rate (TPR) and false positive rate (FPR). However, stochastic algorithms for TPAUC optimization remain under-explored, with existing methods either limited to approximated TPAUC loss functions or burdened by sub-optimal complexities. To overcome these limitations, we introduce two innovative stochastic primal-dual double block-coordinate algorithms for TPAUC maximization. These algorithms utilize stochastic block-coordinate updates for both the primal and dual variables, catering to both convex and non-convex settings. We provide theoretical convergence rate analyses, demonstrating significant improvements over prior approaches. Our experimental results, based on multiple benchmark datasets, validate the superior performance of our algorithms, showcasing faster convergence and better generalization. This work advances the state of the art in TPAUC optimization and offers practical tools for real-world machine learning applications.

[LG-54] Continual Learning Beyond Experience Rehearsal and Full Model Surrogates

链接: https://arxiv.org/abs/2505.21942
作者: Prashant Bhat,Laurens Niesten,Elahe Arani,Bahram Zonooz
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 9 figures

点击查看摘要

Abstract:Continual learning (CL) has remained a significant challenge for deep neural networks, as learning new tasks erases previously acquired knowledge, either partially or completely. Existing solutions often rely on experience rehearsal or full model surrogates to mitigate this catastrophic forgetting (CF). While effective, these approaches introduce substantial memory and computational overhead, limiting their scalability and applicability in real-world scenarios. To address this, we propose SPARC, a scalable CL approach that eliminates the need for experience rehearsal and full-model surrogates. By effectively combining task-specific working memories and task-agnostic semantic memory for cross-task knowledge consolidation, SPARC achieves remarkable parameter efficiency, using only 6% of the parameters required by full-model surrogates. Despite its lightweight design, SPARC achieves superior performance on Seq-TinyImageNet and matches rehearsal-based methods on various CL benchmarks. Additionally, weight re-normalization in the classification layer mitigates task-specific biases, establishing SPARC as a practical and scalable solution for CL under stringent efficiency constraints.

[LG-55] HydraNet: Momentum-Driven State Space Duality for Multi-Granularity Tennis Tournaments Analysis

链接: https://arxiv.org/abs/2505.21882
作者: Ruijie Li,Xiang Zhao,Qiao Ning,Shikai Guo
类目: Machine Learning (cs.LG)
*备注: 14 pages, 9 figures (including subfigures), 5 tables. This is the first work to explore and effectively model momentum across multiple granularities in professional tennis tournaments

点击查看摘要

Abstract:In tennis tournaments, momentum, a critical yet elusive phenomenon, reflects the dynamic shifts in performance of athletes that can decisively influence match outcomes. Despite its significance, momentum in terms of effective modeling and multi-granularity analysis across points, games, sets, and matches in tennis tournaments remains underexplored. In this study, we define a novel Momentum Score (MS) metric to quantify a player's momentum level in multi-granularity tennis tournaments, and design HydraNet, a momentum-driven state-space duality-based framework, to model MS by integrating thirty-two heterogeneous dimensions of athlete performance in serve, return, psychology and fatigue. HydraNet integrates a Hydra module, which builds upon a state-space duality (SSD) framework, capturing explicit momentum with a sliding-window mechanism and implicit momentum through cross-game state propagation. It also introduces a novel Versus Learning method to better enhance the adversarial nature of momentum between the two athletes at a macro level, along with a Collaborative-Adversarial Attention Mechanism (CAAM) for capturing and integrating intra-player and inter-player dynamic momentum at a micro level. Additionally, we construct a million-level tennis cross-tournament dataset spanning the 2012-2023 Wimbledon and 2013-2023 US Open tournaments, and validate the multi-granularity modeling capability of HydraNet for the MS metric on this dataset. Extensive experimental evaluations demonstrate that the MS metric constructed by the HydraNet framework provides actionable insights into how momentum impacts outcomes at different granularities, establishing a new foundation for momentum modeling and sports analysis. To the best of our knowledge, this is the first work to explore and effectively model momentum across multiple granularities in professional tennis tournaments. The source code and datasets are available at this https URL.

[LG-56] Hybrid Batch Normalisation: Resolving the Dilemma of Batch Normalisation in Federated Learning

链接: https://arxiv.org/abs/2505.21877
作者: Hongyao Chen,Tianyang Xu,Xiaojun Wu,Josef Kittler
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Batch Normalisation (BN) is widely used in conventional deep neural network training to harmonise the input-output distributions for each batch of data. However, federated learning, a distributed learning paradigm, faces the challenge of dealing with non-independent and identically distributed data among the client nodes. Due to the lack of a coherent methodology for updating BN statistical parameters, standard BN degrades the federated learning performance. To this end, it is urgent to explore an alternative normalisation solution for federated learning. In this work, we resolve the dilemma of the BN layer in federated learning by developing a customised normalisation approach, Hybrid Batch Normalisation (HBN). HBN separates the update of statistical parameters (i.e., means and variances used for evaluation) from that of learnable parameters (i.e., parameters that require gradient updates), obtaining unbiased estimates of global statistical parameters in distributed scenarios. In contrast with the existing solutions, we emphasise the supportive power of global statistics for federated learning. The HBN layer introduces a learnable hybrid distribution factor, allowing each computing node to adaptively mix the statistical parameters of the current batch with the global statistics. Our HBN can serve as a powerful plugin to advance federated learning performance. It reflects promising merits across a wide range of federated learning settings, especially for small batch sizes and heterogeneous data.
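
以下是 HBN 核心思想的一个极简示意(非官方实现):统计量在 `no_grad` 下更新、与可学习参数的梯度更新解耦,并用一个可学习的混合因子在当前 batch 统计量与全局统计量之间插值;sigmoid 参数化与动量取值均为本文假设。

```python
import torch
import torch.nn as nn

class HybridBatchNorm1d(nn.Module):
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # 可学习的混合分布因子
        self.register_buffer("global_mean", torch.zeros(num_features))
        self.register_buffer("global_var", torch.ones(num_features))
        self.momentum, self.eps = momentum, eps

    def forward(self, x):
        if self.training:
            bm, bv = x.mean(0), x.var(0, unbiased=False)
            with torch.no_grad():  # 统计量更新与梯度更新解耦
                self.global_mean.lerp_(bm, self.momentum)
                self.global_var.lerp_(bv, self.momentum)
            a = torch.sigmoid(self.alpha)  # 在 batch 统计量与全局统计量之间插值
            mean = a * bm + (1 - a) * self.global_mean
            var = a * bv + (1 - a) * self.global_var
        else:  # 评估时退化为只用全局统计量
            mean, var = self.global_mean, self.global_var
        return self.weight * (x - mean) / torch.sqrt(var + self.eps) + self.bias

hbn = HybridBatchNorm1d(16)
print(hbn(torch.randn(32, 16)).shape)
```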

[LG-57] Revisiting Bayesian Model Averaging in the Era of Foundation Models

链接: https://arxiv.org/abs/2505.21857
作者: Mijung Park
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We revisit the classical, full-fledged Bayesian model averaging (BMA) paradigm to ensemble pre-trained and/or lightly-finetuned foundation models to enhance the classification performance on image and text data. To make BMA tractable under foundation models, we introduce trainable linear classifiers that take frozen features from the pre-trained foundation models as inputs. The model posteriors over the linear classifiers tell us which linear heads and frozen features are better suited for a given dataset, resulting in a principled model ensembling method. Furthermore, we propose a computationally cheaper, optimizable model averaging scheme (OMA). In OMA, we directly optimize the model ensemble weights, just like those weights based on model posterior distributions in BMA, by reducing the amount of surprise (expected entropy of the predictions) we get from predictions of ensembled models. With the rapid development of foundation models, these approaches will enable the incorporation of future, possibly significantly better foundation models to enhance the performance of challenging classification tasks.
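
下面是 OMA 思想的极简示意(非官方实现):固定各模型(冻结特征上的线性头)的预测概率,直接优化 softmax 集成权重,使集成预测的期望熵("惊讶度")最小;优化步数与学习率均为本文假设。

```python
import torch
import torch.nn.functional as F

def optimize_ensemble_weights(probs_list, steps=200, lr=0.1):
    # probs_list: 各模型在同一批数据上的预测概率,形状均为 (n, c)
    m = len(probs_list)
    logits = torch.zeros(m, requires_grad=True)
    P = torch.stack(probs_list)  # (m, n, c)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        w = F.softmax(logits, dim=0)          # 集成权重,和为 1
        mix = (w[:, None, None] * P).sum(0)   # 加权混合后的预测分布
        entropy = -(mix * mix.clamp_min(1e-12).log()).sum(1).mean()
        entropy.backward()                    # 最小化期望熵
        opt.step()
    return F.softmax(logits.detach(), dim=0)

torch.manual_seed(0)
preds = [F.softmax(torch.randn(100, 10), dim=1) for _ in range(3)]
print(optimize_ensemble_weights(preds))
```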

[LG-58] A Physics-Informed Learning Framework to Solve the Infinite-Horizon Optimal Control Problem

链接: https://arxiv.org/abs/2505.21842
作者: Filippos Fotiadis,Kyriakos G. Vamvoudakis
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Accepted with minor revisions at International Journal of Robust and Nonlinear Control

点击查看摘要

Abstract:We propose a physics-informed neural networks (PINNs) framework to solve the infinite-horizon optimal control problem of nonlinear systems. In particular, since PINNs are generally able to solve a class of partial differential equations (PDEs), they can be employed to learn the value function of the infinite-horizon optimal control problem via solving the associated steady-state Hamilton-Jacobi-Bellman (HJB) equation. However, an issue here is that the steady-state HJB equation generally yields multiple solutions; hence if PINNs are directly employed to it, they may end up approximating a solution that is different from the optimal value function of the problem. We tackle this by instead applying PINNs to a finite-horizon variant of the steady-state HJB that has a unique solution, and which uniformly approximates the optimal value function as the horizon increases. An algorithm to verify if the chosen horizon is large enough is also given, as well as a method to extend it – with reduced computations and robustness to approximation errors – in case it is not. Unlike many existing methods, the proposed technique works well with non-polynomial basis functions, does not require prior knowledge of a stabilizing controller, and does not perform iterative policy evaluations. Simulations are performed, which verify and clarify theoretical findings.

[LG-59] In Search of Adam's Secret Sauce

链接: https://arxiv.org/abs/2505.21829
作者: Antonio Orvieto,Robert Gower
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study - training over 1,300 language models across different data configurations and scales - comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping settings, and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal. Beyond robust performance, this choice affords new theoretical insights, highlights the "secret sauce" on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients, one that arises from a mean-field Gaussian variational inference perspective.
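
下面用几行代码示意文中讨论的设置:把 Adam 的两个动量系数约束为相等(beta1 = beta2 = beta),此时一阶、二阶矩恰好是对梯度均值与二阶矩的同速在线估计。实现细节(偏差修正的写法等)为常见做法,并非论文官方代码。

```python
import torch

@torch.no_grad()
def adam_equal_betas_step(params, grads, state, lr=1e-3, beta=0.95, eps=1e-8):
    for p, g in zip(params, grads):
        m, v, t = state.setdefault(id(p), [torch.zeros_like(p), torch.zeros_like(p), 0])
        t += 1
        m.mul_(beta).add_(g, alpha=1 - beta)         # 梯度均值的在线估计
        v.mul_(beta).addcmul_(g, g, value=1 - beta)  # 梯度二阶矩的在线估计
        m_hat = m / (1 - beta ** t)                  # 偏差修正
        v_hat = v / (1 - beta ** t)
        p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
        state[id(p)][2] = t

w = torch.randn(5)
state = {}
for _ in range(100):
    g = 2 * w  # 以 f(w) = ||w||^2 为例,梯度为 2w
    adam_equal_betas_step([w], [g], state)
print(w.norm().item())
```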

[LG-60] Unsupervised Latent Pattern Analysis for Estimating Type 2 Diabetes Risk in Undiagnosed Populations

链接: https://arxiv.org/abs/2505.21824
作者: Praveen Kumar,Vincent T. Metzger,Scott A. Malec
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The global prevalence of diabetes, particularly type 2 diabetes mellitus (T2DM), is rapidly increasing, posing significant health and economic challenges. T2DM not only disrupts blood glucose regulation but also damages vital organs such as the heart, kidneys, eyes, nerves, and blood vessels, leading to substantial morbidity and mortality. In the US alone, the economic burden of diagnosed diabetes exceeded \$400 billion in 2022. Early detection of individuals at risk is critical to mitigating these impacts. While machine learning approaches for T2DM prediction are increasingly adopted, many rely on supervised learning, which is often limited by the lack of confirmed negative cases. To address this limitation, we propose a novel unsupervised framework that integrates Non-negative Matrix Factorization (NMF) with statistical techniques to identify individuals at risk of developing T2DM. Our method identifies latent patterns of multimorbidity and polypharmacy among diagnosed T2DM patients and applies these patterns to estimate the T2DM risk in undiagnosed individuals. By leveraging data-driven insights from comorbidity and medication usage, our approach provides an interpretable and scalable solution that can assist healthcare providers in implementing timely interventions, ultimately improving patient outcomes and potentially reducing the future health and economic burden of T2DM.
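
以下是该无监督框架主干的极简示意(数据与风险分数的定义均为本文假设,并非论文的完整流程):先在确诊人群的共病 / 用药指示矩阵上做 NMF 学习潜在模式,再把未确诊个体投影到相同的潜在模式上,用系数强度作为一个粗略的风险分数。

```python
import numpy as np
from sklearn.decomposition import NMF

# 假设数据:行 = 患者,列 = 共病 / 用药指示(0/1)
rng = np.random.default_rng(0)
X_t2dm = (rng.random((200, 30)) < 0.3).astype(float)  # 已确诊 T2DM 患者
X_unk = (rng.random((50, 30)) < 0.2).astype(float)    # 未确诊人群

# 在确诊人群上学习多重共病 / 多重用药的潜在模式
nmf = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
W_t2dm = nmf.fit_transform(X_t2dm)

# 把未确诊个体投影到相同潜在模式上,用系数强度作为粗略风险分数
W_unk = nmf.transform(X_unk)
risk = W_unk.sum(axis=1) / (W_t2dm.sum(axis=1).mean() + 1e-12)
print("top-5 risk indices:", np.argsort(risk)[::-1][:5])
```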

[LG-61] Optimizing Data Augmentation through Bayesian Model Selection

链接: https://arxiv.org/abs/2505.21813
作者: Madi Matymov(1),Ba-Hien Tran(2),Michael Kampffmeyer(3 and 4),Markus Heinonen(5),Maurizio Filippone(1) ((1) KAUST, (2) Huawei Paris Research Center, (3) UiT The Arctic University of Norway, (4) Norwegian Computing Center, (5) Aalto University)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 3 figures

点击查看摘要

Abstract:Data Augmentation (DA) has become an essential tool to improve robustness and generalization of modern machine learning. However, when deciding on DA strategies it is critical to choose parameters carefully, and this can be a daunting task which is traditionally left to trial-and-error or expensive optimization based on validation performance. In this paper, we counter these limitations by proposing a novel framework for optimizing DA. In particular, we take a probabilistic view of DA, which leads to the interpretation of augmentation parameters as model (hyper)-parameters, and the optimization of the marginal likelihood with respect to these parameters as a Bayesian model selection problem. Due to its intractability, we derive a tractable Evidence Lower BOund (ELBO), which allows us to optimize augmentation parameters jointly with model parameters. We provide extensive theoretical results on variational approximation quality, generalization guarantees, invariance properties, and connections to empirical Bayes. Through experiments on computer vision tasks, we show that our approach improves calibration and yields robust performance over fixed or no augmentation. Our work provides a rigorous foundation for optimizing DA through Bayesian principles with significant potential for robust machine learning.

[LG-62] Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect INTERSPEECH2025

链接: https://arxiv.org/abs/2505.21809
作者: Jaya Narain,Vasudha Kowtha,Colin Lea,Lauren Tooley,Dianna Yee,Vikramjit Mitra,Zifang Huang,Miquel Espi Marques,Jon Huang,Carlos Avendano,Shirley Ren
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: accepted for Interspeech 2025

点击查看摘要

Abstract:Perceptual voice quality dimensions describe key characteristics of atypical speech and other speech modulations. Here we develop and evaluate voice quality models for seven voice and speech dimensions (intelligibility, imprecise consonants, harsh voice, naturalness, monoloudness, monopitch, and breathiness). Probes were trained on the public Speech Accessibility (SAP) project dataset with 11,184 samples from 434 speakers, using embeddings from frozen pre-trained models as features. We found that our probes had both strong performance and strong generalization across speech elicitation categories in the SAP dataset. We further validated zero-shot performance on additional datasets, encompassing unseen languages and tasks: Italian atypical speech, English atypical speech, and affective speech. The strong zero-shot performance and the interpretability of results across an array of evaluations suggests the utility of using voice quality dimensions in speaking style-related tasks.

[LG-63] TabReason: A Reinforcement Learning-Enhanced Reasoning LLM for Explainable Tabular Data Prediction

链接: https://arxiv.org/abs/2505.21807
作者: Tommy Xu,Zhitian Zhang,Xiangyu Sun,Lauren Kelly Zung,Hossein Hajimirsadeghi,Greg Mori
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictive modeling on tabular data is the cornerstone of many real-world applications. Although gradient boosting machines and some recent deep models achieve strong performance on tabular data, they often lack interpretability. On the other hand, large language models (LLMs) have demonstrated powerful capabilities to generate human-like reasoning and explanations, but remain under-performed for tabular data prediction. In this paper, we propose a new approach that leverages reasoning-based LLMs, trained using reinforcement learning, to perform more accurate and explainable predictions on tabular data. Our method introduces custom reward functions that guide the model not only toward high prediction accuracy but also toward human-understandable reasons for its predictions. Experimental results show that our model achieves promising performance on financial benchmark datasets, outperforming most existing LLMs.

[LG-64] Towards Operational Automated Greenhouse Gas Plume Detection

链接: https://arxiv.org/abs/2505.21806
作者: Brian D. Bue,Jake H. Lee,Andrew K. Thorpe,Philip G. Brodrick,Daniel Cusworth,Alana Ayasse,Vassiliki Mancoridis,Anagha Satish,Shujun Xiong,Riley Duren
类目: Machine Learning (cs.LG)
*备注: Main 19 pages 14 figures. Supplemental 19 pages 16 figures. In review

点击查看摘要

Abstract:Operational deployment of a fully automated greenhouse gas (GHG) plume detection system remains an elusive goal for imaging spectroscopy missions, despite recent advances in deep learning approaches. With the dramatic increase in data availability, however, automation continues to increase in importance for natural and anthropogenic emissions monitoring. This work reviews and addresses several key obstacles in the field: data and label quality control, prevention of spatiotemporal biases, and correctly aligned modeling objectives. We demonstrate through rigorous experiments using multicampaign data from airborne and spaceborne instruments that convolutional neural networks (CNNs) are able to achieve operational detection performance when these obstacles are alleviated. We demonstrate that a multitask model that learns both instance detection and pixelwise segmentation simultaneously can successfully lead towards an operational pathway. We evaluate the model’s plume detectability across emission source types and regions, identifying thresholds for operational deployment. Finally, we provide analysis-ready data, models, and source code for reproducibility, and work to define a set of best practices and validation standards to facilitate future contributions to the field.

[LG-65] Faster Rates for Private Adversarial Bandits ICML2025

链接: https://arxiv.org/abs/2505.21790
作者: Hilal Asi,Vinod Raman,Kunal Talwar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:We design new differentially private algorithms for the problems of adversarial bandits and bandits with expert advice. For adversarial bandits, we give a simple and efficient conversion of any non-private bandit algorithm to a private bandit algorithm. Instantiating our conversion with existing non-private bandit algorithms gives a regret upper bound of $O\left(\frac{\sqrt{KT}}{\sqrt{\epsilon}}\right)$, improving upon the existing upper bound $O\left(\frac{\sqrt{KT}\log(KT)}{\epsilon}\right)$ for all $\epsilon \leq 1$. In particular, our algorithms allow for sublinear expected regret even when $\epsilon \leq \frac{1}{\sqrt{T}}$, establishing the first known separation between central and local differential privacy for this problem. For bandits with expert advice, we give the first differentially private algorithms, with expected regret $O\left(\frac{\sqrt{NT}}{\sqrt{\epsilon}}\right)$, $O\left(\frac{\sqrt{KT}\log(N)\log(KT)}{\epsilon}\right)$, and $\tilde{O}\left(\frac{N^{1/6}K^{1/2}T^{2/3}\log(NT)}{\epsilon^{1/3}} + \frac{N^{1/2}\log(NT)}{\epsilon}\right)$, where $K$ and $N$ are the number of actions and experts respectively. These rates allow us to get sublinear regret for different combinations of small and large $K$, $N$ and $\epsilon$.

[LG-66] P-DROP: Poisson-Based Dropout for Graph Neural Networks

链接: https://arxiv.org/abs/2505.21783
作者: Hyunsik Yun
类目: Machine Learning (cs.LG)
*备注: 10 pages, 9 figures

点击查看摘要

Abstract:Over-smoothing remains a major challenge in Graph Neural Networks (GNNs), where repeated message passing causes node representations to converge and lose discriminative power. To address this, we propose a novel node selection strategy based on Poisson processes, introducing stochastic but structure-aware updates. Specifically, we equip each node with an independent Poisson clock, enabling asynchronous and localized updates that preserve structural diversity. We explore two applications of this strategy: as a replacement for dropout-based regularization and as a dynamic subgraph training scheme. Experimental results on standard benchmarks (Cora, Citeseer, Pubmed) demonstrate that our Poisson-based method yields competitive or improved accuracy compared to traditional Dropout, DropEdge, and DropNode approaches, particularly in later training stages.
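
下面是"泊松钟驱动的异步节点更新"思想的极简示意(非官方实现):时长 dt 内一只速率为 rate 的泊松钟至少响一次的概率为 1 - exp(-rate*dt),据此采样出本步参与消息传递的节点,其余节点保持原表示;`rate` 与图的生成方式均为本文假设。

```python
import torch

def poisson_update_mask(num_nodes, rate, dt=1.0):
    # 每个节点一只独立泊松钟:在 dt 内至少响一次的概率为 1 - exp(-rate*dt)
    p = 1.0 - torch.exp(torch.tensor(-rate * dt))
    return torch.rand(num_nodes) < p

def partial_propagate(H, A_norm, mask):
    # 只异步更新 mask 选中的节点,保留结构多样性以缓解过平滑
    H_new = A_norm @ H  # 一步(行归一化邻接)消息传递
    return torch.where(mask[:, None], H_new, H)

torch.manual_seed(0)
n, d = 6, 4
H = torch.randn(n, d)
A = (torch.rand(n, n) < 0.4).float()
A = ((A + A.t()) > 0).float()
A.fill_diagonal_(1.0)
A_norm = A / A.sum(1, keepdim=True)

mask = poisson_update_mask(n, rate=0.7)
H = partial_propagate(H, A_norm, mask)
print(mask)
```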

[LG-67] Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

链接: https://arxiv.org/abs/2505.21777
作者: Bao Pham,Gabriel Raya,Matteo Negri,Mohammed J. Zaki,Luca Ambrogioni,Dmitry Krotov
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注:

点击查看摘要

Abstract:Hopfield networks are associative memory (AM) systems, designed for storing and retrieving patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the amount of training data reaches its critical memory load - spurious states, or unintended stable points, emerge at the end of the retrieval dynamics, leading to incorrect recall. In this work, we examine diffusion models, commonly used in generative modeling, from the perspective of AMs. The training phase of a diffusion model is conceptualized as memory encoding (training data is stored in the memory). The generation phase is viewed as an attempt at memory retrieval. In the small data regime the diffusion model exhibits a strong memorization phase, where the network creates distinct basins of attraction around each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime, a different phase appears where an increase in the size of the training set fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition and correspond to emergent attractor states, which are absent in the training set, but, at the same time, have distinct basins of attraction around them. Our findings provide a novel perspective on the memorization-generalization phenomenon in diffusion models via the lens of AMs, a theoretical prediction of the existence of spurious states, and empirical validation of this prediction in commonly-used diffusion models.

[LG-68] Hierarchical Reinforcement Learning with Uncertainty-Guided Diffusional Subgoals ICML2025

链接: https://arxiv.org/abs/2505.21750
作者: Vivienne Huiling Wang,Tinghuai Wang,Joni Pajarinen
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Hierarchical reinforcement learning (HRL) learns to make decisions on multiple levels of temporal abstraction. A key challenge in HRL is that the low-level policy changes over time, making it difficult for the high-level policy to generate effective subgoals. To address this issue, the high-level policy must capture a complex subgoal distribution while also accounting for uncertainty in its estimates. We propose an approach that trains a conditional diffusion model, regularized by a Gaussian Process (GP) prior, to generate a complex variety of subgoals while leveraging principled GP uncertainty quantification. Building on this framework, we develop a strategy that selects subgoals from both the diffusion policy and the GP's predictive mean. Our approach outperforms prior HRL methods in both sample efficiency and performance on challenging continuous control benchmarks.

[LG-69] MIND-Stack: Modular Interpretable End-to-End Differentiability for Autonomous Navigation

链接: https://arxiv.org/abs/2505.21734
作者: Felix Jahncke,Johannes Betz
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages. Submitted to the IEEE Intelligent Vehicles Symposium (IV 2025), Romania

点击查看摘要

Abstract:Developing robust, efficient navigation algorithms is challenging. Rule-based methods offer interpretability and modularity but struggle with learning from large datasets, while end-to-end neural networks excel in learning but lack transparency and modularity. In this paper, we present MIND-Stack, a modular software stack consisting of a localization network and a Stanley Controller with intermediate human interpretable state representations and end-to-end differentiability. Our approach enables the upstream localization module to reduce the downstream control error, extending its role beyond state estimation. Unlike existing research on differentiable algorithms that either lack modules of the autonomous stack to span from sensor input to actuator output or real-world implementation, MIND-Stack offers both capabilities. We conduct experiments that demonstrate the ability of the localization module to reduce the downstream control loss through its end-to-end differentiability while offering better performance than state-of-the-art algorithms. We showcase sim-to-real capabilities by deploying the algorithm on a real-world embedded autonomous platform with limited computation power and demonstrate simultaneous training of both the localization and controller towards one goal. While MIND-Stack shows good results, we discuss the incorporation of additional modules from the autonomous navigation pipeline in the future, promising even greater stability and performance in the next iterations of the framework.
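
Stanley 控制律本身有标准形式:转向角 = 航向误差 + arctan(k · 横向误差 / 速度)。下面用 torch 写出这一控制律,以示意"控制器可微、梯度可回流到上游定位模块"的思路(损失定义与数值均为本文假设,并非 MIND-Stack 官方实现):

```python
import torch

def stanley_steering(heading_err, cross_track_err, speed, k=1.0, eps=1e-3):
    # Stanley 控制律:delta = 航向误差 + arctan(k * 横向误差 / 速度)
    # 全部用 torch 运算实现,因而可端到端求导
    return heading_err + torch.atan2(k * cross_track_err, speed + eps)

# 示意:让梯度从控制量流回(可学习的)位姿估计
pose_err = torch.tensor([0.05, 0.30], requires_grad=True)  # [航向误差, 横向误差],假设值
delta = stanley_steering(pose_err[0], pose_err[1], speed=torch.tensor(2.0))
loss = delta ** 2  # 以转向量大小为代理的控制损失
loss.backward()
print(delta.item(), pose_err.grad)
```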

[LG-70] LaX: Boosting Low-Rank Training of Foundation Models via Latent Crossing

链接: https://arxiv.org/abs/2505.21732
作者: Ruijie Zhang,Ziyue Liu,Zhengyang Wang,Zheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training foundation models such as ViTs and LLMs requires tremendous computing cost. Low-rank matrix or tensor factorization offers a parameter-efficient alternative, but often downgrades performance due to the restricted parameter space. In this work, we introduce Latent Crossing (LaX) – a simple yet effective plug-and-play module that enhances the capacity of low-rank models by enabling information flow across low-rank subspaces. We extensively validate the benefits of LaX on pre-training tasks with ViT-Base/Large and LLaMA-like models ranging from 60M to 1B parameters. LaX boosts low-rank model performance to match or exceed the full-rank baselines while using 2-3$\times$ fewer parameters. When equipped with low-rank adapters (i.e., LoRA) for fine-tuning LLaMA-7/13B, LaX consistently improves performance on arithmetic and common sense reasoning tasks with negligible cost.

[LG-71] Network classification through random walks

链接: https://arxiv.org/abs/2505.21706
作者: Gonzalo Travieso,Joao Merenda,Odemir M. Bruno
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:Network models have been widely used to study diverse systems and analyze their dynamic behaviors. Given the structural variability of networks, an intriguing question arises: Can we infer the type of system represented by a network based on its structure? This classification problem involves extracting relevant features from the network. Existing literature has proposed various methods that combine structural measurements and dynamical processes for feature extraction. In this study, we introduce a novel approach to characterize networks using statistics from random walks, which can be particularly informative about network properties. We present the employed statistical metrics and compare their performance on multiple datasets with other state-of-the-art feature extraction methods. Our results demonstrate that the proposed method is effective in many cases, often outperforming existing approaches, although some limitations are observed across certain datasets.
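
以下是"用随机游走统计量做网络特征"的极简示意(特征定义为本文假设,并非论文的完整特征集):对每张图模拟若干条随机游走,统计平均回访率与平均覆盖率;可以看到 ER 随机图与 BA 无标度图给出不同的特征值,从而可用于分类。

```python
import random
import networkx as nx

def walk_features(G, num_walks=200, length=20, seed=0):
    # 两个简单的随机游走统计量:
    # 回访率 = 游走回到起点的频率;覆盖率 = 访问到的不同节点比例
    rng = random.Random(seed)
    nodes = list(G.nodes())
    returns, coverage = 0.0, 0.0
    for _ in range(num_walks):
        v0 = v = rng.choice(nodes)
        visited = {v}
        for _ in range(length):
            nbrs = list(G.neighbors(v))
            if not nbrs:
                break
            v = rng.choice(nbrs)
            visited.add(v)
            returns += (v == v0)
        coverage += len(visited) / G.number_of_nodes()
    return returns / (num_walks * length), coverage / num_walks

er = nx.erdos_renyi_graph(100, 0.05, seed=1)
ba = nx.barabasi_albert_graph(100, 3, seed=1)
print("ER:", walk_features(er))
print("BA:", walk_features(ba))
```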

[LG-72] AMSFL: Adaptive Multi-Step Federated Learning via Gradient Difference-Based Error Modeling

链接: https://arxiv.org/abs/2505.21695
作者: Ganglou Xu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning faces critical challenges in balancing communication efficiency and model accuracy. One key issue lies in the approximation of update errors without incurring high computational costs. In this paper, we propose a lightweight yet effective method called Gradient Difference Approximation (GDA), which leverages first-order information to estimate local error trends without computing the full Hessian matrix. The proposed method forms a key component of the Adaptive Multi-Step Federated Learning (AMSFL) framework and provides a unified error modeling strategy for large-scale multi-step adaptive training environments.

[LG-73] Incentivizing Permissionless Distributed Learning of LLM s

链接: https://arxiv.org/abs/2505.21684
作者: Joel Lidin,Amir Sarfi,Evangelos Pappas,Samuel Dare,Eugene Belilovsky,Jacob Steeves
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:We describe an incentive system for distributed deep learning of foundational models where peers are rewarded for contributions. The incentive system, Gauntlet, has been deployed on the bittensor blockchain and used to train a 1.2B LLM with completely permissionless contributions of pseudo-gradients: no control over the users that can register or their hardware. Gauntlet can be applied to any synchronous distributed training scheme that relies on aggregating updates or pseudo-gradients. We rely on a two-stage mechanism for fast filtering of peer uptime, reliability, and synchronization, combined with the core component that estimates the loss before and after individual pseudo-gradient contributions. We utilized an OpenSkill rating system to track competitiveness of pseudo-gradient scores across time. Finally, we introduce a novel mechanism to ensure peers on the network perform unique computations. Our live 1.2B run, which has paid out real-valued tokens to participants based on the value of their contributions, yielded a competitive (on a per-iteration basis) 1.2B model that demonstrates the utility of our incentive system.

[LG-74] PreGenie: An Agent ic Framework for High-quality Visual Presentation Generation

链接: https://arxiv.org/abs/2505.21660
作者: Xiaojie Xu,Xinli Xu,Sirui Chen,Haoyu Chen,Fan Zhang,Ying-Cong Chen
类目: Machine Learning (cs.LG)
*备注: 11 pages, 9 figures

点击查看摘要

Abstract:Visual presentations are vital for effective communication. Early attempts to automate their creation using deep learning often faced issues such as poorly organized layouts, inaccurate text summarization, and a lack of image understanding, leading to mismatched visuals and text. These limitations restrict their application in formal contexts like business and scientific research. To address these challenges, we propose PreGenie, an agentic and modular framework powered by multimodal large language models (MLLMs) for generating high-quality visual presentations. PreGenie is built on the Slidev presentation framework, where slides are rendered from Markdown code. It operates in two stages: (1) Analysis and Initial Generation, which summarizes multimodal input and generates initial code, and (2) Review and Re-generation, which iteratively reviews intermediate code and rendered slides to produce final, high-quality presentations. Each stage leverages multiple MLLMs that collaborate and share information. Comprehensive experiments demonstrate that PreGenie excels in multimodal understanding, outperforming existing models in both aesthetics and content consistency, while aligning more closely with human design preferences.

[LG-75] AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent

链接: https://arxiv.org/abs/2505.21651
作者: Nikola Surjanovic,Alexandre Bouchard-Côté,Trevor Campbell
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The learning rate is an important tuning parameter for stochastic gradient descent (SGD) and can greatly influence its performance. However, appropriate selection of a learning rate schedule across all iterations typically requires a non-trivial amount of user tuning effort. To address this, we introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration and then takes appropriate action. We introduce theory supporting the convergence of AutoSGD, along with its deterministic counterpart for standard gradient descent. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.
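
下面给出一个示意性的"自动调节学习率的 SGD"(判别准则是本文假设的简化版,并非论文的具体方法):每步分别试探增大与减小学习率,保留使损失更低的那一个。

```python
import torch

def auto_sgd(loss_fn, w, lr=0.1, factor=1.5, steps=100):
    # 每步比较 lr*factor 与 lr/factor 两个候选步长的下一步损失,取更优者
    for _ in range(steps):
        g = torch.autograd.grad(loss_fn(w), w)[0]
        candidates = [lr * factor, lr / factor]
        losses = [loss_fn((w - c * g).detach()) for c in candidates]
        lr = candidates[int(torch.stack(losses).argmin())]
        w = (w - lr * g).detach().requires_grad_(True)
    return w, lr

w = torch.randn(10, requires_grad=True)
loss_fn = lambda v: ((v - 1.0) ** 2).sum()
w_opt, final_lr = auto_sgd(loss_fn, w)
print(loss_fn(w_opt).item(), final_lr)
```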

[LG-76] PrivATE: Differentially Private Confidence Intervals for Averag e Treatment Effects

链接: https://arxiv.org/abs/2505.21641
作者: Maresa Schröder,Justin Hartenstein,Stefan Feuerriegel
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The average treatment effect (ATE) is widely used to evaluate the effectiveness of drugs and other medical interventions. In safety-critical applications like medicine, reliable inferences about the ATE typically require valid uncertainty quantification, such as through confidence intervals (CIs). However, estimating treatment effects in these settings often involves sensitive data that must be kept private. In this work, we present PrivATE, a novel machine learning framework for computing CIs for the ATE under differential privacy. Specifically, we focus on deriving valid privacy-preserving CIs for the ATE from observational data. Our PrivATE framework consists of three steps: (i) estimating a differentially private ATE through output perturbation; (ii) estimating the differentially private variance through a truncated output perturbation mechanism; and (iii) constructing the CIs while accounting for the uncertainty from both the estimation and privatization steps. Our PrivATE framework is model agnostic, doubly robust, and ensures valid CIs. We demonstrate the effectiveness of our framework using synthetic and real-world medical datasets. To the best of our knowledge, we are the first to derive a general, doubly robust framework for valid CIs of the ATE under $(\varepsilon, \delta)$-differential privacy.
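
以下是第 (i) 步"输出扰动"的极简示意(此处简化为随机试验下的均值差估计,而非论文面向观察数据的双稳健估计;要求结果变量有界,噪声按高斯机制校准):

```python
import numpy as np

def dp_ate(y, t, epsilon, delta, y_range=1.0, rng=None):
    # 对 ATE 估计量加入按灵敏度校准的高斯噪声,满足 (epsilon, delta)-DP
    rng = rng or np.random.default_rng(0)
    n1 = max(int((t == 1).sum()), 1)
    n0 = max(int((t == 0).sum()), 1)
    ate = y[t == 1].mean() - y[t == 0].mean()
    sensitivity = y_range / n1 + y_range / n0  # 改变一条记录的最坏影响(保守界)
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return ate + rng.normal(0, sigma), sigma

rng = np.random.default_rng(1)
t = rng.integers(0, 2, 2000)
y = np.clip(0.3 * t + rng.normal(0.5, 0.1, 2000), 0, 1)  # 有界结果变量
est, sigma = dp_ate(y, t, epsilon=1.0, delta=1e-5)
print(f"private ATE ~ {est:.3f} (noise std {sigma:.4f})")
```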

[LG-77] Apprenticeship learning with prior beliefs using inverse optimization

链接: https://arxiv.org/abs/2505.21639
作者: Mauricio Junca,Esteban Leiva
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The relationship between inverse reinforcement learning (IRL) and inverse optimization (IO) for Markov decision processes (MDPs) has been relatively underexplored in the literature, despite addressing the same problem. In this work, we revisit the relationship between the IO framework for MDPs, IRL, and apprenticeship learning (AL). We incorporate prior beliefs on the structure of the cost function into the IRL and AL problems, and demonstrate that the convex-analytic view of the AL formalism (Kamoutsi et al., 2021) emerges as a relaxation of our framework. Notably, the AL formalism is a special case in our framework when the regularization term is absent. Focusing on the suboptimal expert setting, we formulate the AL problem as a regularized min-max problem. The regularizer plays a key role in addressing the ill-posedness of IRL by guiding the search for plausible cost functions. To solve the resulting regularized convex-concave min-max problem, we use stochastic mirror descent (SMD) and establish convergence bounds for the proposed method. Numerical experiments highlight the critical role of regularization in learning cost vectors and apprentice policies.

[LG-78] Learning Where to Learn: Training Distribution Selection for Provable OOD Performance

链接: https://arxiv.org/abs/2505.21626
作者: Nicolas Guerra,Nicholas H. Nelsen,Yunan Yang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 32 pages, 8 figures, 2 tables, 3 algorithms

点击查看摘要

Abstract:Out-of-distribution (OOD) generalization remains a fundamental challenge in machine learning. Models trained on one data distribution often experience substantial performance degradation when evaluated on shifted or unseen domains. To address this challenge, the present paper studies the design of training data distributions that maximize average-case OOD performance. First, a theoretical analysis establishes a family of generalization bounds that quantify how the choice of training distribution influences OOD error across a predefined family of target distributions. These insights motivate the introduction of two complementary algorithmic strategies: (i) directly formulating OOD risk minimization as a bilevel optimization problem over the space of probability measures and (ii) minimizing a theoretical upper bound on OOD error. Last, the paper evaluates the two approaches across a range of function approximation and operator learning examples. The proposed methods significantly improve OOD accuracy over standard empirical risk minimization with a fixed distribution. These results highlight the potential of distribution-aware training as a principled and practical framework for robust OOD generalization.

[LG-79] CiRL: Open-Source Environments for Reinforcement Learning in Circular Economy and Net Zero

链接: https://arxiv.org/abs/2505.21536
作者: Federico Zocco,Andrea Corti,Monica Malvezzi
类目: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: To be submitted

点击查看摘要

Abstract:The demand for finite raw materials will keep increasing as they fuel modern society. Simultaneously, solutions for stopping carbon emissions in the short term are not available, thus making the net zero target extremely challenging to achieve at scale. The circular economy (CE) paradigm is gaining attention as a solution to address climate change and the uncertainties of supplies of critical materials. Hence, in this paper, we introduce CiRL, a deep reinforcement learning (DRL) library of environments focused on the circularity of both solid and fluid materials. The integration of DRL into the design of material circularity is possible thanks to the formalism of thermodynamical material networks, which is underpinned by compartmental dynamical thermodynamics. Along with the focus on circularity, this library has three more features: the new CE-oriented environments are in the state-space form, which is typically used in dynamical systems analysis and control designs; it is based on a state-of-the-art Python library of DRL algorithms, namely, Stable-Baselines3; and it is developed in Google Colaboratory to be accessible to researchers from different disciplines and backgrounds, as is often the case for circular economy researchers and engineers. CiRL is publicly available.

[LG-80] SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation

链接: https://arxiv.org/abs/2505.21514
作者: Mingchao Jiang,Abhinav Jain,Sophia Zorek,Chris Jermaine
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注: Keywords: Benchmark Dataset, LLM Evaluation, Gen-AI, Program Synthesis; TLDR: SimCoPilot is a benchmark for evaluating LLMs as “copilot”-style interactive coding assistants, testing their ability to integrate and complete code within complex real-world software environments

点击查看摘要

Abstract:We introduce SIMCOPILOT, a benchmark that simulates the role of large language models (LLMs) as interactive, “copilot”-style coding assistants. Targeting both completion (finishing incomplete methods or code blocks) and infill tasks (filling missing segments within existing code), SIMCOPILOT provides a comprehensive framework for evaluating LLM coding capabilities. The benchmark comprises dedicated sub-benchmarks for Java (SIMCOPILOTJ) and Python (SIMCOPILOTP), covering diverse codebases varying in size and complexity. Our key contributions include: (a) establishing a realistic, detailed evaluation environment to assess LLM utility in practical coding scenarios, and (b) providing fine-grained analyses that address critical factors frequently overlooked by existing benchmarks, such as task-specific performance nuances, contextual understanding across code segments, and sensitivity to variable scope. Evaluations conducted across domains-including algorithms, databases, computer vision, and neural networks-offer insights into model strengths and highlight persistent challenges in maintaining logical consistency within complex dependency structures. Beyond benchmarking, our study sheds light on the current limitations of LLM-driven code generation and underscores the ongoing transition of LLMs from merely syntax-aware generators toward reliable, intelligent software development partners.

[LG-81] he Role of Visualization in LLM -Assisted Knowledge Graph Systems: Effects on User Trust Exploration and Workflows

链接: https://arxiv.org/abs/2505.21512
作者: Harry Li,Gabriel Appleby,Kenneth Alperin,Steven R Gomez,Ashley Suh
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Knowledge graphs (KGs) are powerful data structures, but exploring them effectively remains difficult for even expert users. Large language models (LLMs) are increasingly used to address this gap, yet little is known empirically about how their usage with KGs shapes user trust, exploration strategies, or downstream decision-making - raising key design challenges for LLM-based KG visual analysis systems. To study these effects, we developed LinkQ, a KG exploration system that converts natural language questions into structured queries with an LLM. We collaborated with KG experts to design five visual mechanisms that help users assess the accuracy of both KG queries and LLM responses: an LLM-KG state diagram that illustrates which stage of the exploration pipeline LinkQ is in, a query editor displaying the generated query paired with an LLM explanation, an entity-relation ID table showing extracted KG entities and relations with semantic descriptions, a query structure graph that depicts the path traversed in the KG, and an interactive graph visualization of query results. From a qualitative evaluation with 14 practitioners, we found that users - even KG experts - tended to overtrust LinkQ’s outputs due to its “helpful” visualizations, even when the LLM was incorrect. Users exhibited distinct workflows depending on their prior familiarity with KGs and LLMs, challenging the assumption that these systems are one-size-fits-all - despite often being designed as if they are. Our findings highlight the risks of false trust in LLM-assisted data analysis tools and the need for further investigation into the role of visualization as a mitigation technique.

[LG-82] Principled Out-of-Distribution Generalization via Simplicity

链接: https://arxiv.org/abs/2505.22622
作者: Jiawei Ge,Amanda Wang,Shange Tang,Chi Jin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Modern foundation models exhibit remarkable out-of-distribution (OOD) generalization, solving tasks far beyond the support of their training data. However, the theoretical principles underpinning this phenomenon remain elusive. This paper investigates this problem by examining the compositional generalization abilities of diffusion models in image generation. Our analysis reveals that while neural network architectures are expressive enough to represent a wide range of models – including many with undesirable behavior on OOD inputs – the true, generalizable model that aligns with human expectations typically corresponds to the simplest among those consistent with the training data. Motivated by this observation, we develop a theoretical framework for OOD generalization via simplicity, quantified using a predefined simplicity metric. We analyze two key regimes: (1) the constant-gap setting, where the true model is strictly simpler than all spurious alternatives by a fixed gap, and (2) the vanishing-gap setting, where the fixed gap is replaced by a smoothness condition ensuring that models close in simplicity to the true model yield similar predictions. For both regimes, we study the regularized maximum likelihood estimator and establish the first sharp sample complexity guarantees for learning the true, generalizable, simple model.

[LG-83] Can Copulas Be Used for Feature Selection? A Machine Learning Study on Diabetes Risk Prediction

链接: https://arxiv.org/abs/2505.22554
作者: Agnideep Aich,Md Monzur Murshed,Amanda Mayeaux,Sameera Hewage
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted

点击查看摘要

Abstract:Accurate diabetes risk prediction relies on identifying key features from complex health datasets, but conventional methods like mutual information (MI) filters and genetic algorithms (GAs) often overlook extreme dependencies critical for high-risk subpopulations. In this study we introduce a feature-selection framework using the upper-tail dependence coefficient ($\lambda_U$) of the novel A2 copula, which quantifies how often extremely high values of a predictor co-occur with diabetes diagnoses (the target variable). Applied to the CDC Diabetes Health Indicators dataset (n=253,680), our method prioritizes five predictors (self-reported general health, high blood pressure, body mass index, mobility limitations, and high cholesterol levels) based on upper-tail dependencies. These features match or outperform MI- and GA-selected subsets across four classifiers (Random Forest, XGBoost, Logistic Regression, Gradient Boosting), achieving accuracy up to 86.5% (XGBoost) and AUC up to 0.806 (Gradient Boosting), rivaling the full 21-feature model. Permutation importance confirms clinical relevance, with BMI and general health driving accuracy. To our knowledge, this is the first work to apply a copula's upper-tail dependence for supervised feature selection, bridging extreme-value theory and machine learning to deliver a practical toolkit for diabetes prevention.
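
下面给出经验上尾相依系数的一个极简估计作为示意(论文使用的是 A2 copula 的解析 $\lambda_U$;此处用秩变换得到伪观测,再在上尾计数,示例数据为连续变量,均为本文假设):

```python
import numpy as np

def upper_tail_dependence(x, y, q=0.95):
    # 经验估计:lambda_U(q) = P(Y 超过其 q 分位 | X 超过其 q 分位)
    n = len(x)
    u = np.argsort(np.argsort(x)) / (n + 1)  # 伪观测 U(秩变换)
    v = np.argsort(np.argsort(y)) / (n + 1)  # 伪观测 V
    upper = u > q
    return (v[upper] > q).mean() if upper.any() else 0.0

rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + 0.3 * rng.normal(size=5000)  # 与共同因子 z 强相关的两个变量
y = z + 0.3 * rng.normal(size=5000)
print("lambda_U(0.95) ~", upper_tail_dependence(x, y, 0.95))
```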

[LG-84] Symplectic Generative Networks (SGNs): A Hamiltonian Framework for Invertible Deep Generative Modeling

链接: https://arxiv.org/abs/2505.22527
作者: Agnideep Aich,Ashit Aich,Bruce Wade
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted

点击查看摘要

Abstract:We introduce the Symplectic Generative Network (SGN), a deep generative model that leverages Hamiltonian mechanics to construct an invertible, volume-preserving mapping between a latent space and the data space. By endowing the latent space with a symplectic structure and modeling data generation as the time evolution of a Hamiltonian system, SGN achieves exact likelihood evaluation without incurring the computational overhead of Jacobian determinant calculations. In this work, we provide a rigorous mathematical foundation for SGNs through a comprehensive theoretical framework that includes: (i) complete proofs of invertibility and volume preservation, (ii) a formal complexity analysis with theoretical comparisons to Variational Autoencoders and Normalizing Flows, (iii) strengthened universal approximation results with quantitative error bounds, (iv) an information-theoretic analysis based on the geometry of statistical manifolds, and (v) an extensive stability analysis with adaptive integration guarantees. These contributions highlight the fundamental advantages of SGNs and establish a solid foundation for future empirical investigations and applications to complex, high-dimensional data.

[LG-85] IGNIS: A Neural Network Framework for Robust Parameter Estimation in Archimedean Copulas

链接: https://arxiv.org/abs/2505.22518
作者: Agnideep Aich,Ashit Baran Aich,Bruce Wade
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Parameter estimation for Archimedean copulas remains a challenging problem, particularly for the recently developed A1 and A2 families that exhibit complex dependency structures. Traditional methods, such as the Method of Moments (MoM), Maximum Likelihood Estimation (MLE), and Maximum Pseudo-Likelihood (MPL), often struggle due to issues of non-monotonic relationship with dependency measures such as Kendall’s tau (as in the case of A1) and numerical instability. In this paper, we present the IGNIS Network, a novel, unified neural framework that learns a direct mapping from observable dependency measures to copula parameters, thereby overcoming the limitations of classical approaches. Our approach is trained on simulated data spanning five Archimedean copula families including Clayton, Gumbel, Frank, A1, and A2, ensuring its general applicability across the entire family. Extensive simulation studies demonstrate that the IGNIS Network reduces estimation errors compared to MoM, while inherently enforcing parameter constraints through theory-guided post-processing. We further validate the practical utility of our method on diverse real-world datasets, including financial returns (AAPL-MSFT), healthcare metrics (CDC Diabetes indicators), and environmental measurements (PM2.5 air quality). Our results underscore the transformative potential of neural methods for robust and accurate dependence modeling in modern applications.
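
To illustrate the idea of learning a direct map from dependency measures to copula parameters, here is a toy Clayton-only sketch (the actual IGNIS network, its inputs, and the A1/A2 families differ): a small MLP maps Kendall's tau to the parameter, with a softplus output layer standing in for the theory-guided positivity constraint. For Clayton, $\tau = \theta/(\theta+2)$ gives a known ground truth to sanity-check against:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
theta = torch.rand(4096, 1) * 10 + 0.1          # sample parameters theta > 0
tau = theta / (theta + 2)                       # Clayton: tau = theta / (theta + 2)
tau_noisy = tau + 0.01 * torch.randn_like(tau)  # mimic finite-sample estimation noise

net = nn.Sequential(
    nn.Linear(1, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Softplus(),            # output layer enforces theta > 0
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(tau_noisy), theta)
    loss.backward()
    opt.step()

print(net(torch.tensor([[0.5]])).item())        # Clayton with tau = 0.5 has theta = 2
```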

[LG-86] Assessing Quantum Advantage for Gaussian Process Regression

链接: https://arxiv.org/abs/2505.22502
作者: Dominic Lowe,M.S. Kim,Roberto Bondesan
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 18 pages, 2 figures

点击查看摘要

Abstract:Gaussian Process Regression is a well-known machine learning technique for which several quantum algorithms have been proposed. We show here that in a wide range of scenarios these algorithms show no exponential speedup. We achieve this by rigorously proving that the condition number of a kernel matrix scales at least linearly with the matrix size under general assumptions on the data and kernel. We additionally prove that the sparsity and Frobenius norm of a kernel matrix scale linearly under similar assumptions. The implications for the quantum algorithms' runtimes are independent of the complexity of loading classical data on a quantum computer and also apply to dequantised algorithms. We supplement our theoretical analysis with numerical verification for popular kernels in machine learning.
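
The core claim, that kernel matrix condition numbers grow at least linearly with size, is easy to probe numerically. A quick sketch (RBF kernel on spherical data is an assumption here; the exact growth depends on kernel and data):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 200, 400, 800):
    X = rng.normal(size=(n, 10))
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # data on the unit sphere
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-sq / 2.0)                                # RBF kernel matrix
    print(n, f"{np.linalg.cond(K):.3e}")                 # condition number grows with n
```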

[LG-87] Hypothesis Testing in Imaging Inverse Problems

链接: https://arxiv.org/abs/2505.22481
作者: Yiming Xi,Konstantinos Zygalakis,Marcelo Pereyra
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a framework for semantic hypothesis testing tailored to imaging inverse problems. Modern imaging methods struggle to support hypothesis testing, a core component of the scientific method that is essential for the rigorous interpretation of experiments and robust interfacing with decision-making processes. There are three main reasons why image-based hypothesis testing is challenging. First, the difficulty of using a single observation to simultaneously reconstruct an image, formulate hypotheses, and quantify their statistical significance. Second, the hypotheses encountered in imaging are mostly of semantic nature, rather than quantitative statements about pixel values. Third, it is challenging to control test error probabilities because the null and alternative distributions are often unknown. Our proposed approach addresses these difficulties by leveraging concepts from self-supervised computational imaging, vision-language models, and non-parametric hypothesis testing with e-values. We demonstrate our proposed framework through numerical experiments related to image-based phenotyping, where we achieve excellent power while robustly controlling Type I errors.

[LG-88] Depth-Based Matrix Classification for the HHL Quantum Algorithm

链接: https://arxiv.org/abs/2505.22454
作者: Mark Danza,Sonia Lopez Alarcon,Cory Merkel
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the error-corrected era of quantum computing nears, it is necessary to understand the suitability of certain post-NISQ algorithms for practical problems. One of the most promising and applicable, yet difficult to implement in practical terms, is the Harrow, Hassidim and Lloyd (HHL) algorithm for linear systems of equations. An enormous number of problems can be expressed as linear systems of equations, from Machine Learning to fluid dynamics. However, in most cases, HHL will not be able to provide a practical, reasonable solution to these problems. This paper asks whether problems can be labeled using Machine Learning classifiers as suitable or unsuitable for HHL implementation when some numerical information about the problem is known beforehand. This work demonstrates that training on significantly representative data distributions is critical to achieve good classifications of the problems based on the numerical properties of the matrix representing the system of equations. Accurate classification is possible through Multi-Layer Perceptrons, although with careful design of the training data distribution and classifier parameters.

[LG-89] Computing Optimal Transport Maps and Wasserstein Barycenters Using Conditional Normalizing Flows

链接: https://arxiv.org/abs/2505.22364
作者: Gabriele Visentin,Patrick Cheridito
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel method for efficiently computing optimal transport maps and Wasserstein barycenters in high-dimensional spaces. Our approach uses conditional normalizing flows to approximate the input distributions as invertible pushforward transformations from a common latent space. This makes it possible to directly solve the primal problem using gradient-based minimization of the transport cost, unlike previous methods that rely on dual formulations and complex adversarial optimization. We show how this approach can be extended to compute Wasserstein barycenters by solving a conditional variance minimization problem. A key advantage of our conditional architecture is that it enables the computation of barycenters for hundreds of input distributions, which was computationally infeasible with previous methods. Our numerical experiments illustrate that our approach yields accurate results across various high-dimensional tasks and compares favorably with previous state-of-the-art methods.

[LG-90] Credal Prediction based on Relative Likelihood

链接: https://arxiv.org/abs/2505.22332
作者: Timo Löhr,Paul Hofman,Felix Mohr,Eyke Hüllermeier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictions in the form of sets of probability distributions, so-called credal sets, provide a suitable means to represent a learner’s epistemic uncertainty. In this paper, we propose a theoretically grounded approach to credal prediction based on the statistical notion of relative likelihood: The target of prediction is the set of all (conditional) probability distributions produced by the collection of plausible models, namely those models whose relative likelihood exceeds a specified threshold. This threshold has an intuitive interpretation and allows for controlling the trade-off between correctness and precision of credal predictions. We tackle the problem of approximating credal sets defined in this way by means of suitably modified ensemble learning techniques. To validate our approach, we illustrate its effectiveness by experiments on benchmark datasets demonstrating superior uncertainty representation without compromising predictive performance. We also compare our method against several state-of-the-art baselines in credal prediction.
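
A minimal sketch of the relative-likelihood construction over a trained ensemble (the per-model likelihood inputs and threshold value are illustrative; the paper's modified ensemble-learning techniques are not reproduced): members whose relative likelihood clears the threshold form the plausible set, and elementwise min/max probabilities bound the credal prediction:

```python
import numpy as np

def credal_prediction(log_liks, probs, alpha=0.1):
    """log_liks: (m,) training log-likelihood of each ensemble member;
    probs: (m, k) class probabilities each member assigns to one query.
    Members with relative likelihood >= alpha are deemed plausible; the
    credal prediction is the elementwise probability envelope over them."""
    rel = np.exp(log_liks - log_liks.max())   # relative likelihoods in (0, 1]
    plausible = probs[rel >= alpha]
    return plausible.min(axis=0), plausible.max(axis=0)

log_liks = np.array([-120.0, -121.5, -130.0, -122.0])
probs = np.array([[0.70, 0.30], [0.60, 0.40], [0.20, 0.80], [0.65, 0.35]])
lower, upper = credal_prediction(log_liks, probs, alpha=0.05)
print(lower, upper)   # third member (relative likelihood exp(-10)) is excluded
```

Raising alpha shrinks the plausible set toward the maximum-likelihood model, trading precision against correctness, which is the knob the abstract describes.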

[LG-91] Individualised Counterfactual Examples Using Conformal Prediction Intervals

链接: https://arxiv.org/abs/2505.22326
作者: James M. Adams,Gesine Reinert,Lukasz Szpruch,Carsten Maple,Andrew Elliott
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted to Conformal and Probabilistic Predictions With Applications (COPA) 2025

点击查看摘要

Abstract:Counterfactual explanations for black-box models aim to provide insight into an algorithmic decision to its recipient. For a binary classification problem an individual counterfactual details which features might be changed for the model to infer the opposite class. High-dimensional feature spaces that are typical of machine learning classification models admit many possible counterfactual examples to a decision, and so it is important to identify additional criteria to select the most useful counterfactuals. In this paper, we explore the idea that the counterfactuals should be maximally informative when considering the knowledge of a specific individual about the underlying classifier. To quantify this information gain we explicitly model the knowledge of the individual, and assess the uncertainty of predictions which the individual makes by the width of a conformal prediction interval. Regions of feature space where the prediction interval is wide correspond to areas where the confidence in decision making is low, and an additional counterfactual example might be more informative to an individual. To explore and evaluate our individualised conformal prediction interval counterfactuals (CPICFs), first we present a synthetic data set on a hypercube which allows us to fully visualise the decision boundary, conformal intervals via three different methods, and resultant CPICFs. Second, in this synthetic data set we explore the impact of a single CPICF on the knowledge of an individual locally around the original query. Finally, in both our synthetic data set and a complex real-world dataset with a combination of continuous and discrete variables, we measure the utility of these counterfactuals via data augmentation, testing the performance on a held out set.
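
The interval-width signal driving CPICF selection can be illustrated with plain split conformal regression. A minimal sketch, assuming a fitted model and a held-out calibration set (the paper's individualised knowledge model and its three interval methods are not reproduced):

```python
import numpy as np

def conformal_interval(model, X_cal, y_cal, x_query, alpha=0.1):
    """Split conformal interval: calibrate absolute residuals on held-out
    data, then report prediction +/- the finite-sample (1 - alpha) residual
    quantile. Wide intervals flag low-confidence regions, where an extra
    counterfactual is deemed more informative."""
    res = np.abs(y_cal - model(X_cal))
    n = len(res)
    q = np.quantile(res, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = model(x_query)
    return pred - q, pred + q

model = lambda X: X @ np.array([1.0, -2.0])        # stand-in for a fitted model
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(200, 2))
y_cal = model(X_cal) + 0.3 * rng.normal(size=200)
lo, hi = conformal_interval(model, X_cal, y_cal, X_cal[:1])
print((hi - lo).item())                            # width = informativeness signal
```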

[LG-92] High Volume Rate 3D Ultrasound Reconstruction with Diffusion Models

链接: https://arxiv.org/abs/2505.22090
作者: Tristan S.W. Stevens,Oisín Nolan,Oudom Somphone,Jean-Luc Robert,Ruud J.G. van Sloun
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 10 pages, 10 figures, preprint

点击查看摘要

Abstract:Three-dimensional ultrasound enables real-time volumetric visualization of anatomical structures. Unlike traditional 2D ultrasound, 3D imaging reduces the reliance on precise probe orientation, potentially making ultrasound more accessible to clinicians with varying levels of experience and improving automated measurements and post-exam analysis. However, achieving both high volume rates and high image quality remains a significant challenge. While 3D diverging waves can provide high volume rates, they suffer from limited tissue harmonic generation and increased multipath effects, which degrade image quality. One compromise is to retain the focusing in elevation while leveraging unfocused diverging waves in the lateral direction to reduce the number of transmissions per elevation plane. Reaching the volume rates achieved by full 3D diverging waves, however, requires dramatically undersampling the number of elevation planes. Subsequently, to render the full volume, simple interpolation techniques are applied. This paper introduces a novel approach to 3D ultrasound reconstruction from a reduced set of elevation planes by employing diffusion models (DMs) to achieve increased spatial and temporal resolution. We compare both traditional and supervised deep learning-based interpolation methods on a 3D cardiac ultrasound dataset. Our results show that DM-based reconstruction consistently outperforms the baselines in image quality and downstream task performance. Additionally, we accelerate inference by leveraging the temporal consistency inherent to ultrasound sequences. Finally, we explore the robustness of the proposed method by exploiting the probabilistic nature of diffusion posterior sampling to quantify reconstruction uncertainty and demonstrate improved recall on out-of-distribution data with synthetic anomalies under strong subsampling.

[LG-93] PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning

链接: https://arxiv.org/abs/2505.22085
作者: Arnulf Jentzen,Julian Kranz,Adrian Riekert
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 38 pages, 13 figures

点击查看摘要

Abstract:Averaging techniques such as Ruppert–Polyak averaging and exponential moving averaging (EMA) are powerful approaches to accelerate optimization procedures of stochastic gradient descent (SGD) optimization methods such as the popular ADAM optimizer. However, depending on the specific optimization problem under consideration, the type and the parameters of the averaging need to be adjusted to achieve the smallest optimization error. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute in parallel different averaged variants of ADAM and, during the training process, dynamically select the variant with the smallest optimization error. A central feature of this approach is that this procedure requires no more gradient evaluations than the usual ADAM optimizer, as each of the averaged trajectories relies on the same underlying ADAM trajectory and thus on the same underlying gradients. We test the proposed PADAM optimizer in 13 stochastic optimization and deep neural network (DNN) learning problems and compare its performance with known optimizers from the literature such as standard SGD, momentum SGD, ADAM with and without EMA, and ADAMW. In particular, we apply the compared optimizers to physics-informed neural network, deep Galerkin, deep backward stochastic differential equation and deep Kolmogorov approximations for boundary value partial differential equation problems from scientific machine learning, as well as to DNN approximations for optimal control and optimal stopping problems. In nearly all of the considered examples PADAM achieves, sometimes among others and sometimes exclusively, essentially the smallest optimization error. This work thus strongly suggests considering PADAM for scientific machine learning problems and also motivates further research on adaptive averaging procedures within the training of DNNs.
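
A minimal sketch of the PADAM idea as described in the abstract (hyperparameters, the set of averaging decays, and the selection rule are illustrative assumptions): one ADAM trajectory is shared, several EMA-averaged copies of it are maintained in parallel at no extra gradient cost, and the averaged variant with the smallest loss is selected:

```python
import numpy as np

def padam(grad, loss, x0, lr=0.05, betas=(0.9, 0.999), eps=1e-8,
          ema_decays=(0.0, 0.9, 0.99), steps=500):
    """Run one ADAM trajectory and maintain several EMA-averaged copies of it
    in parallel (decay 0.0 keeps the raw iterate); return the averaged variant
    with the smallest loss. Every average reuses the same gradients."""
    x = x0.copy()
    m, v = np.zeros_like(x), np.zeros_like(x)
    avgs = {d: x0.copy() for d in ema_decays}
    for t in range(1, steps + 1):
        g = grad(x)
        m = betas[0] * m + (1 - betas[0]) * g
        v = betas[1] * v + (1 - betas[1]) * g * g
        x = x - lr * (m / (1 - betas[0] ** t)) / (np.sqrt(v / (1 - betas[1] ** t)) + eps)
        for d in ema_decays:
            avgs[d] = d * avgs[d] + (1 - d) * x   # parallel exponential averages
    return min(avgs.values(), key=loss)

rng = np.random.default_rng(0)
grad = lambda x: x + 0.5 * rng.normal(size=x.shape)  # noisy gradient of |x|^2 / 2
loss = lambda x: 0.5 * float(x @ x)
print(loss(padam(grad, loss, np.ones(10))))
```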

[LG-94] Hyperbolic recurrent neural network as the first type of non-Euclidean neural quantum state ansatz

链接: https://arxiv.org/abs/2505.22083
作者: H. L. Dao
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:In this work, we introduce the first type of non-Euclidean neural quantum state (NQS) ansatz, in the form of the hyperbolic GRU (a variant of recurrent neural networks (RNNs)), to be used in the Variational Monte Carlo method of approximating the ground state wavefunction for quantum many-body systems. In particular, we examine the performances of NQS ansatzes constructed from both conventional or Euclidean RNN/GRU and from hyperbolic GRU in the prototypical settings of the one- and two-dimensional transverse field Ising models (TFIM) of up to 100 spins and the one-dimensional Heisenberg $J_1$-$J_2$ and $J_1$-$J_2$-$J_3$ systems of up to 50 spins. By virtue of the fact that, for all of the experiments performed in this work, hyperbolic GRU can yield performances comparable to or better than Euclidean RNNs, which have been extensively studied in these settings in the literature, our work is a proof-of-concept for the viability of hyperbolic GRU as the first type of non-Euclidean NQS ansatz for quantum many-body systems. Furthermore, in settings where the Hamiltonian displays a clear hierarchical interaction structure, such as the 1D Heisenberg $J_1$-$J_2$ and $J_1$-$J_2$-$J_3$ systems with 1st-, 2nd- and even 3rd-nearest-neighbor interactions, our results show that hyperbolic GRU definitively outperforms its Euclidean version in all instances. The fact that these results are reminiscent of the established ones from natural language processing, where hyperbolic GRU almost always outperforms Euclidean RNNs when the training data exhibit a tree-like or hierarchical structure, leads us to hypothesize that a hyperbolic GRU NQS ansatz would likely outperform a Euclidean RNN/GRU NQS ansatz in quantum spin systems that involve different degrees of nearest neighbor interactions. Finally, with this work, we hope to initiate future studies of other types of non-Euclidean NQS beyond hyperbolic GRU.

[LG-95] Learning Curves of Stochastic Gradient Descent in Kernel Regression

链接: https://arxiv.org/abs/2505.22048
作者: Haihan Zhang,Weicheng Lin,Yuanshi Liu,Cong Fang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper considers a canonical problem in kernel regression: how good is the model's performance when it is trained by the popular online first-order algorithms, compared to offline ones such as ridge and ridgeless regression? In this paper, we analyze the foundational single-pass Stochastic Gradient Descent (SGD) in kernel regression under a source condition where the optimal predictor may not even belong to the RKHS, i.e. the model is misspecified. Specifically, we focus on the inner product kernel over the sphere and characterize the exact orders of the excess risk curves under different scales of the sample size $n$ relative to the input dimension $d$. Surprisingly, we show that SGD achieves min-max optimal rates, up to constants, across all scales, without suffering from saturation, a prevalent phenomenon observed in (ridge) regression, except when the model is highly misspecified and the learning is in a final stage where $n \gg d^\gamma$ with any constant $\gamma > 0$. The main reason for SGD to overcome the curse of saturation is the exponentially decaying step size schedule, a common practice in deep neural network training. As a byproduct, we provide the first provable advantage of this scheme over the iterative averaging method in the common setting.

[LG-96] Align-DA: Align Score-based Atmospheric Data Assimilation with Multiple Preferences

链接: https://arxiv.org/abs/2505.22008
作者: Jing-An Sun,Hang Fan,Junchao Gong,Ben Fei,Kun Chen,Fenghua Ling,Wenlong Zhang,Wanghan Xu,Li Yan,Pierre Gentine,Lei Bai
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data assimilation (DA) aims to estimate the full state of a dynamical system by combining partial and noisy observations with a prior model forecast, commonly referred to as the background. In atmospheric applications, this problem is fundamentally ill-posed due to the sparsity of observations relative to the high-dimensional state space. Traditional methods address this challenge by simplifying background priors to regularize the solution, which are empirical and require continual tuning for application. Inspired by alignment techniques in text-to-image diffusion models, we propose Align-DA, which formulates DA as a generative process and uses reward signals to guide background priors, replacing manual tuning with data-driven alignment. Specifically, we train a score-based model in the latent space to approximate the background-conditioned prior, and align it using three complementary reward signals for DA: (1) assimilation accuracy, (2) forecast skill initialized from the assimilated state, and (3) physical adherence of the analysis fields. Experiments with multiple reward signals demonstrate consistent improvements in analysis quality across different evaluation metrics and observation-guidance strategies. These results show that preference alignment, implemented as a soft constraint, can automatically adapt complex background priors tailored to DA, offering a promising new direction for advancing the field.

[LG-97] Almost Linear Convergence under Minimal Score Assumptions: Quantized Transition Diffusion

链接: https://arxiv.org/abs/2505.21892
作者: Xunpeng Huang,Yingyu Lin,Nikki Lijing Kuang,Hanze Dong,Difan Zou,Yian Ma,Tong Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 37 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Continuous diffusion models have demonstrated remarkable performance in data generation across various domains, yet their efficiency remains constrained by two critical limitations: (1) the local adjacency structure of the forward Markov process, which restricts long-range transitions in the data space, and (2) inherent biases introduced during the simulation of time-inhomogeneous reverse denoising processes. To address these challenges, we propose Quantized Transition Diffusion (QTD), a novel approach that integrates data quantization with discrete diffusion dynamics. Our method first transforms the continuous data distribution $p_*$ into a discrete one $q_*$ via histogram approximation and binary encoding, enabling efficient representation in a structured discrete latent space. We then design a continuous-time Markov chain (CTMC) with Hamming distance-based transitions as the forward process, which inherently supports long-range movements in the original data space. For reverse-time sampling, we introduce a truncated uniformization technique to simulate the reverse CTMC, which can provably provide unbiased generation from $q_*$ under minimal score assumptions. Through a novel KL dynamic analysis of the reverse CTMC, we prove that QTD can generate samples with $O(d \ln^2(d/\epsilon))$ score evaluations in expectation to approximate the $d$-dimensional target distribution $p_*$ within an $\epsilon$ error tolerance. Our method not only establishes state-of-the-art inference efficiency but also advances the theoretical foundations of diffusion-based generative modeling by unifying discrete and continuous diffusion paradigms.
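
A minimal sketch of the quantization front end described in the abstract (the bin construction and binary coding details are my assumptions): continuous samples are mapped to equal-mass histogram bins, and the bin indices are encoded as bit strings on which Hamming-distance transitions can then operate:

```python
import numpy as np

def quantize_to_bits(x, n_bits=4):
    """Map scalar samples to 2**n_bits equal-mass histogram bins and encode
    the bin indices as bit strings; a discrete diffusion with Hamming-distance
    transitions can then act directly on the bits."""
    k = 2 ** n_bits
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])  # interior bin edges
    idx = np.searchsorted(edges, x)                         # bin index in [0, k)
    bits = (idx[:, None] >> np.arange(n_bits)[::-1]) & 1    # binary encoding
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)      # decoder lookup table
    return bits, centers

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
bits, centers = quantize_to_bits(x)
idx = bits @ (2 ** np.arange(bits.shape[1])[::-1])          # decode bin indices
print(np.abs(centers[idx] - x).mean())                      # small rounding error
```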

[LG-98] Targeted Unlearning Using Perturbed Sign Gradient Methods With Applications On Medical Images

链接: https://arxiv.org/abs/2505.21872
作者: George R. Nahass,Zhu Wang,Homa Rashidisabet,Won Hwa Kim,Sasha Hubschman,Jeffrey C. Peterson,Ghasem Yazdanpanah,Chad A. Purnell,Pete Setabutr,Ann Q. Tran,Darvin Yi,Sathya N. Ravi
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 39 pages, 12 figures, 11 tables, 3 algorithms

点击查看摘要

Abstract:Machine unlearning aims to remove the influence of specific training samples from a trained model without full retraining. While prior work has largely focused on privacy-motivated settings, we recast unlearning as a general-purpose tool for post-deployment model revision. Specifically, we focus on utilizing unlearning in clinical contexts where data shifts, device deprecation, and policy changes are common. To this end, we propose a bilevel optimization formulation of boundary-based unlearning that can be solved using iterative algorithms. We provide convergence guarantees when first-order algorithms are used to unlearn. Our method introduces tunable loss design for controlling the forgetting-retention tradeoff and supports novel model composition strategies that merge the strengths of distinct unlearning runs. Across benchmark and real-world clinical imaging datasets, our approach outperforms baselines on both forgetting and retention metrics, including scenarios involving imaging devices and anatomical outliers. This work establishes machine unlearning as a modular, practical alternative to retraining for real-world model maintenance in clinical applications.

[LG-99] Spectral clustering for dependent community Hawkes process models of temporal networks

链接: https://arxiv.org/abs/2505.21845
作者: Lingfei Zhao,Hadeel Soliman,Kevin S. Xu,Subhadeep Paul
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Temporal networks observed continuously over time through timestamped relational events data are commonly encountered in application settings including online social media communications, financial transactions, and international relations. Temporal networks often exhibit community structure and strong dependence patterns among node pairs. This dependence can be modeled through mutual excitations, where an interaction event from a sender to a receiver node increases the possibility of future events among other node pairs. We provide statistical results for a class of models that we call dependent community Hawkes (DCH) models, which combine the stochastic block model with mutually exciting Hawkes processes for modeling both community structure and dependence among node pairs, respectively. We derive a non-asymptotic upper bound on the misclustering error of spectral clustering on the event count matrix as a function of the number of nodes and communities, time duration, and the amount of dependence in the model. Our result leverages recent results on bounding an appropriate distance between a multivariate Hawkes process count vector and a Gaussian vector, along with results from random matrix theory. We also propose a DCH model that incorporates only self and reciprocal excitation along with highly scalable parameter estimation using a Generalized Method of Moments (GMM) estimator that we demonstrate to be consistent for growing network size and time duration.

[LG-100] PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective

链接: https://arxiv.org/abs/2505.21799
作者: Tim Tsz-Kit Lau,Qi Long,Weijie Su
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The ever-growing scale of deep learning models and datasets underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing “matrix-aware” preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam’s training instabilities, Muon’s accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.
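
A minimal sketch of the special instance the abstract names, Muon-style updates scaled by the nuclear norm of the gradient (the polar factor is obtained here via SVD; the paper's efficient polar decomposition algorithms and the broader preconditioning framework are not reproduced, and the placement of the scaling is my reading):

```python
import numpy as np

def polargrad_step(W, G, lr=0.01):
    """Polar-decompose the matrix gradient G = U_p @ H (with U_p = U @ Vt
    from the SVD) and step along the orthogonal polar factor, scaled by the
    nuclear norm of G, i.e., the sum of its singular values."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * s.sum() * (U @ Vt)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))       # a weight matrix
G = rng.normal(size=(64, 32))       # its matrix-valued gradient
print(polargrad_step(W, G).shape)   # (64, 32)
```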

[LG-101] A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with Polyak Averaging

链接: https://arxiv.org/abs/2505.21796
作者: Sajad Khodadadian,Martin Zubeldia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 37 pages

点击查看摘要

Abstract:Polyak-Ruppert averaging is a widely used technique to achieve the optimal asymptotic variance of stochastic approximation (SA) algorithms, yet its high-probability performance guarantees remain underexplored in general settings. In this paper, we present a general framework for establishing non-asymptotic concentration bounds for the error of averaged SA iterates. Our approach assumes access to individual concentration bounds for the unaveraged iterates and yields a sharp bound on the averaged iterates. We also construct an example, showing the tightness of our result up to constant multiplicative factors. As direct applications, we derive tight concentration bounds for contractive SA algorithms and for algorithms such as temporal difference learning and Q-learning with averaging, obtaining new bounds in settings where traditional analysis is challenging.
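
For readers unfamiliar with the averaging being analyzed, a minimal sketch of Polyak-Ruppert averaging on a toy stochastic approximation problem (the paper's concentration bounds are theoretical and not computed here):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.ones(5)                       # SA iterate
x_avg = np.zeros(5)                  # Polyak-Ruppert running average
for t in range(1, 10001):
    g = x + 0.5 * rng.normal(size=5)       # noisy gradient of |x|^2 / 2
    x = x - 0.5 / np.sqrt(t) * g           # Robbins-Monro step size
    x_avg = x_avg + (x - x_avg) / t        # running average of the iterates
print(np.linalg.norm(x), np.linalg.norm(x_avg))  # the average sits closer to 0
```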

[LG-102] Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks

链接: https://arxiv.org/abs/2505.21791
作者: Julia Nakhleh,Robert D. Nowak
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Overparameterized neural networks can interpolate a given dataset in many different ways, prompting the fundamental question: which among these solutions should we prefer, and what explicit regularization strategies will provably yield these solutions? This paper addresses the challenge of finding the sparsest interpolating ReLU network – i.e., the network with the fewest nonzero parameters or neurons – a goal with wide-ranging implications for efficiency, generalization, interpretability, theory, and model compression. Unlike post hoc pruning approaches, we propose a continuous, almost-everywhere differentiable training objective whose global minima are guaranteed to correspond to the sparsest single-hidden-layer ReLU networks that fit the data. This result marks a conceptual advance: it recasts the combinatorial problem of sparse interpolation as a smooth optimization task, potentially enabling the use of gradient-based training methods. Our objective is based on minimizing $\ell^p$ quasinorms of the weights for $0 < p < 1$, a classical sparsity-promoting strategy in finite-dimensional settings. However, applying these ideas to neural networks presents new challenges: the function class is infinite-dimensional, and the weights are learned using a highly nonconvex objective. We prove that, under our formulation, global minimizers correspond exactly to sparsest solutions. Our work lays a foundation for understanding when and how continuous sparsity-inducing objectives can be leveraged to recover sparse networks through training.

[LG-103] Beyond 1D: Vision Transformers and Multichannel Signal Images for PPG-to-ECG Reconstruction

链接: https://arxiv.org/abs/2505.21767
作者: Xiaoyan Li,Shixin Xu,Faisal Habib,Arvind Gupta,Huaxiong Huang
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Reconstructing ECG from PPG is a promising yet challenging task. While recent advancements in generative models have significantly improved ECG reconstruction, accurately capturing fine-grained waveform features remains a key challenge. To address this, we propose a novel PPG-to-ECG reconstruction method that leverages a Vision Transformer (ViT) as the core network. Unlike conventional approaches that rely on single-channel PPG, our method employs a four-channel signal image representation, incorporating the original PPG, its first-order difference, second-order difference, and area under the curve. This multi-channel design enriches feature extraction by preserving both temporal and physiological variations within the PPG. By leveraging the self-attention mechanism in ViT, our approach effectively captures both inter-beat and intra-beat dependencies, leading to more robust and accurate ECG reconstruction. Experimental results demonstrate that our method consistently outperforms existing 1D convolution-based approaches, achieving up to 29% reduction in PRD and 15% reduction in RMSE. The proposed approach also produces improvements in other evaluation metrics, highlighting its robustness and effectiveness in reconstructing ECG signals. Furthermore, to ensure a clinically relevant evaluation, we introduce new performance metrics, including QRS area error, PR interval error, RT interval error, and RT amplitude difference error. Our findings suggest that integrating a four-channel signal image representation with the self-attention mechanism of ViT enables more effective extraction of informative PPG features and improved modeling of beat-to-beat variations for PPG-to-ECG mapping. Beyond demonstrating the potential of PPG as a viable alternative for heart activity monitoring, our approach opens new avenues for cyclic signal analysis and prediction.
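
A minimal sketch of the four-channel signal construction described in the abstract (the normalization and the exact area-under-the-curve definition are my assumptions):

```python
import numpy as np

def four_channel_ppg(ppg):
    """Stack the raw PPG with its first-order difference, second-order
    difference, and running area under the curve as four channels."""
    d1 = np.gradient(ppg)                    # first-order difference
    d2 = np.gradient(d1)                     # second-order difference
    auc = np.cumsum(ppg - ppg.min())         # running area under the curve
    auc = auc / (auc.max() + 1e-8)           # rescale to a comparable range
    return np.stack([ppg, d1, d2, auc])      # shape (4, T), fed to the ViT

t = np.linspace(0, 4, 1024)
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 2.4 * t)
print(four_channel_ppg(ppg).shape)           # (4, 1024)
```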

[LG-104] Are Statistical Methods Obsolete in the Era of Deep Learning?

链接: https://arxiv.org/abs/2505.21723
作者: Skyler Wu,Shihao Yang,S. C. Kou
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages, 11 figures (main text)

点击查看摘要

Abstract:In the era of AI, neural networks have become increasingly popular for modeling, inference, and prediction, largely due to their potential for universal approximation. With the proliferation of such deep learning models, a question arises: are leaner statistical methods still relevant? To shed light on this question, we employ the mechanistic nonlinear ordinary differential equation (ODE) inverse problem as a testbed, using physics-informed neural network (PINN) as a representative of the deep learning paradigm and manifold-constrained Gaussian process inference (MAGI) as a representative of statistically principled methods. Through case studies involving the SEIR model from epidemiology and the Lorenz model from chaotic dynamics, we demonstrate that statistical methods are far from obsolete, especially when working with sparse and noisy observations. On tasks such as parameter inference and trajectory reconstruction, statistically principled methods consistently achieve lower bias and variance, while using far fewer parameters and requiring less hyperparameter tuning. Statistical methods can also decisively outperform deep learning models on out-of-sample future prediction, where the absence of relevant data often leads overparameterized models astray. Additionally, we find that statistically principled approaches are more robust to accumulation of numerical imprecision and can represent the underlying system in a way more faithful to the true governing ODEs.

[LG-105] Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference

链接: https://arxiv.org/abs/2505.21721
作者: Kyurae Kim,Yi-An Ma,Trevor Campbell,Jacob R. Gardner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at an almost dimension-independent rate. Specifically, for strongly log-concave and log-smooth targets, the number of iterations for BBVI with a sub-Gaussian family to achieve an objective $\epsilon$-close to the global optimum is $\mathrm{O}(\log d)$, which improves over the $\mathrm{O}(d)$ dependence of full-rank location-scale families. For heavy-tailed families, we provide a weaker $\mathrm{O}(d^{2/k})$ dimension dependence, where $k$ is the number of finite moments. Additionally, if the Hessian of the target log-density is constant, the complexity is free of any explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.

[LG-106] What Data Enables Optimal Decisions? An Exact Characterization for Linear Optimization

链接: https://arxiv.org/abs/2505.21692
作者: Omar Bennouna,Amine Bennouna,Saurabh Amin,Asuman Ozdaglar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the fundamental question of how informative a dataset is for solving a given decision-making task. In our setting, the dataset provides partial information about unknown parameters that influence task outcomes. Focusing on linear programs, we characterize when a dataset is sufficient to recover an optimal decision, given an uncertainty set on the cost vector. Our main contribution is a sharp geometric characterization that identifies the directions of the cost vector that matter for optimality, relative to the task constraints and uncertainty set. We further develop a practical algorithm that, for a given task, constructs a minimal or least-costly sufficient dataset. Our results reveal that small, well-chosen datasets can often fully determine optimal decisions – offering a principled foundation for task-aware data selection.

[LG-107] STACI: Spatio-Temporal Aleatoric Conformal Inference

链接: https://arxiv.org/abs/2505.21658
作者: Brandon R. Feng,David Keetae Park,Xihaier Luo,Arantxa Urdangarin,Shinjae Yoo,Brian J. Reich
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Fitting Gaussian Processes (GPs) provides interpretable aleatoric uncertainty quantification for estimation of spatio-temporal fields. Spatio-temporal deep learning models, while scalable, typically assume a simplistic independent covariance matrix for the response, failing to capture the underlying correlation structure. However, spatio-temporal GPs suffer from issues of scalability and various forms of approximation bias resulting from restrictive assumptions of the covariance kernel function. We propose STACI, a novel framework consisting of a variational Bayesian neural network approximation of non-stationary spatio-temporal GP along with a novel spatio-temporal conformal inference algorithm. STACI is highly scalable, taking advantage of GPU training capabilities for neural network models, and provides statistically valid prediction intervals for uncertainty quantification. STACI outperforms competing GPs and deep methods in accurately approximating spatio-temporal processes and we show it easily scales to datasets with millions of observations.

[LG-108] A Kernelised Stein Discrepancy for Assessing the Fit of Inhomogeneous Random Graph Models

链接: https://arxiv.org/abs/2505.21580
作者: Anum Fatima,Gesine Reinert
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 43 pages, 24 figures

点击查看摘要

Abstract:Complex data are often represented as a graph, which in turn can often be viewed as a realisation of a random graph, such as of an inhomogeneous random graph model (IRG). For general fast goodness-of-fit tests in high dimensions, kernelised Stein discrepancy (KSD) tests are a powerful tool. Here, we develop, test, and analyse a KSD-type goodness-of-fit test for IRG models that can be carried out with a single observation of the network. The test is applicable to a network of any size and does not depend on the asymptotic distribution of the test statistic. We also provide theoretical guarantees.

[LG-109] Automatic detection of abnormal clinical EEG: comparison of a finetuned foundation model with two deep learning models

链接: https://arxiv.org/abs/2505.21507
作者: Aurore Bussalb,François Le Gac,Guillaume Jubien,Mohamed Rahmouni,Ruggero G. Bettinardi,Pedro Marinho R. de Oliveira,Phillipe Derambure,Nicolas Gaspard,Jacques Jonas,Louis Maillard,Laurent Vercueil,Hervé Vespignani,Philippe Laval,Laurent Koessler,Ulysse Gimenez
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:Electroencephalography (EEG) is commonly used by physicians for the diagnosis of numerous neurological disorders. Due to the large volume of EEGs requiring interpretation and the specific expertise involved, artificial intelligence-based tools are being developed to assist in their visual analysis. In this paper, we compare two deep learning models (CNN-LSTM and Transformer-based) with BioSerenity-E1, a recently proposed foundation model, in the task of classifying entire EEG recordings as normal or abnormal. The three models were trained or finetuned on 2,500 EEG recordings and their performances were evaluated on two private and one public datasets: a large multicenter dataset annotated by a single specialist (dataset A composed of n = 4,480 recordings), a small multicenter dataset annotated by three specialists (dataset B, n = 198), and the Temple University Abnormal (TUAB) EEG corpus evaluation dataset (n = 276). On dataset A, the three models achieved at least 86% balanced accuracy, with BioSerenity-E1 finetuned achieving the highest balanced accuracy (89.19% [88.36-90.41]). BioSerenity-E1 finetuned also achieved the best performance on dataset B, with 94.63% [92.32-98.12] balanced accuracy. The models were then validated on TUAB evaluation dataset, whose corresponding training set was not used during training, where they achieved at least 76% accuracy. Specifically, BioSerenity-E1 finetuned outperformed the other two models, reaching an accuracy of 82.25% [78.27-87.48]. Our results highlight the usefulness of leveraging pre-trained models for automatic EEG classification: enabling robust and efficient interpretation of EEG data with fewer resources and broader applicability.

[LG-110] Genetic Influences on Brain Aging: Analyzing Sex Differences in the UK Biobank using Structural MRI

链接: https://arxiv.org/abs/2505.20344
作者: Karen Ardila,Aashka Mohite,Abdoljalil Addeh,Amanda V. Tyndall,Cindy K. Barha,Quan Long,M. Ethan MacDonald
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures, conference

点击查看摘要

Abstract:Brain aging trajectories differ between males and females, yet the genetic factors underlying these differences remain underexplored. Using structural MRI and genotyping data from 40,940 UK Biobank participants (aged 45-83), we computed Brain Age Gap Estimates (BrainAGE) for total brain, hippocampal, and ventricular volumes. We conducted sex-stratified genome-wide association studies (GWAS) and Post-GWAS analyses to identify genetic variants associated with accelerated brain aging. Distinct gene sets emerged by sex: in females, neurotransmitter transport and mitochondrial stress response genes were implicated; in males, immune and inflammation-related genes dominated. Shared genes, including GMNC and OSTN, were consistently linked to brain volumes across sexes, suggesting core roles in neurostructural maintenance. Tissue expression analyses revealed sex-specific enrichment in pathways tied to neurodegeneration. These findings highlight the importance of sex-stratified approaches in aging research and suggest genetic targets for personalized interventions against age-related cognitive decline.

[LG-111] Provably Robust Training of Quantum Circuit Classifiers Against Parameter Noise

链接: https://arxiv.org/abs/2505.18478
作者: Lucas Tecot,Di Luo,Cho-Jui Hsieh
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 14 pages, 3 figures

点击查看摘要

Abstract:Advancements in quantum computing have spurred significant interest in harnessing its potential for speedups over classical systems. However, noise remains a major obstacle to achieving reliable quantum algorithms. In this work, we present a provably noise-resilient training theory and algorithm to enhance the robustness of parameterized quantum circuit classifiers. Our method, with a natural connection to Evolutionary Strategies, guarantees resilience to parameter noise with minimal adjustments to commonly used optimization algorithms. Our approach is function-agnostic and adaptable to various quantum circuits, successfully demonstrated in quantum phase classification tasks. By developing provably guaranteed optimization theory with quantum circuits, our work opens new avenues for practical, robust applications of near-term quantum computers.
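
The abstract notes a natural connection to Evolutionary Strategies. A minimal sketch of that flavour of noise-resilient training (the loss is a stand-in, not the authors' quantum-circuit objective or algorithm): optimizing the Gaussian-smoothed objective favors parameters that remain good under parameter noise of scale sigma:

```python
import numpy as np

def es_step(f, theta, sigma=0.1, lr=0.05, n_pairs=32, rng=None):
    """Antithetic ES step on the smoothed objective E_eps[f(theta + sigma*eps)].
    Minimizing the smoothed loss yields solutions robust to Gaussian
    parameter noise of scale sigma."""
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.normal(size=theta.shape)
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps / (2 * sigma)
    return theta - lr * grad / n_pairs

f = lambda th: float(np.sum(1 - np.cos(3 * th)))   # stand-in "circuit loss"
rng = np.random.default_rng(0)
theta = np.full(4, 0.5)
for _ in range(300):
    theta = es_step(f, theta, rng=rng)
print(theta)   # driven toward the noise-robust minimum near 0
```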

信息检索

[IR-0] DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers

链接: https://arxiv.org/abs/2505.22584
作者: Navve Wasserman,Oliver Heinimann,Yuval Golbari,Tal Zimbalist,Eli Schwartz,Michal Irani
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Rerankers play a critical role in multimodal Retrieval-Augmented Generation (RAG) by refining the ranking of an initial set of retrieved documents. Rerankers are typically trained using hard negative mining, whose goal is to select, for each query, pages that rank high but are actually irrelevant. However, this selection process is typically passive and restricted to what the retriever can find in the available corpus, leading to several inherent limitations. These include: limited diversity, negative examples which are often not hard enough, low controllability, and frequent false negatives which harm training. Our paper proposes an alternative approach: Single-Page Hard Negative Query Generation, which goes the other way around. Instead of retrieving negative pages per query, we generate hard negative queries per page. Using an automated LLM-VLM pipeline, and given a page and its positive query, we create hard negatives by rephrasing the query to be as similar as possible in form and context, yet not answerable from the page. This paradigm enables fine-grained control over the generated queries, resulting in diverse, hard, and targeted negatives. It also supports efficient false negative verification. Our experiments show that rerankers trained with data generated using our approach outperform existing models and significantly improve retrieval performance.

[IR-1] Domain specific ontologies from Linked Open Data (LOD)

链接: https://arxiv.org/abs/2505.22550
作者: Rosario Uceda-Sosa,Nandana Mihindukulasooriya,Atul Kumar,Sahil Bansal,Seema Nagar
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Logical and probabilistic reasoning tasks that require a deeper knowledge of semantics are increasingly relying on general purpose ontologies such as Wikidata and DBpedia. However, tasks such as entity disambiguation and linking may benefit from domain specific knowledge graphs, which make it more efficient to consume the knowledge and easier to extend with proprietary content. We discuss our experience bootstrapping one such ontology for IT with a domain-agnostic pipeline, and extending it using domain-specific glossaries.

[IR-2] Logical Consistency is Vital: Neural-Symbolic Information Retrieval for Negative-Constraint Queries

链接: https://arxiv.org/abs/2505.22299
作者: Ganlin Xu,Zhoujia Zhang,Wangyi Mei,Jiaqing Liang,Weijia Lu,Xiaodong Zhang,Zhifei Yang,Xiaofeng Ma,Yanghua Xiao,Deqing Yang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Information retrieval plays a crucial role in resource localization. Current dense retrievers retrieve the relevant documents within a corpus via embedding similarities, which compute similarities between dense vectors mainly depending on word co-occurrence between queries and documents, but overlook the real query intents. Thus, they often retrieve numerous irrelevant documents. Particularly in the scenarios of complex queries such as negative-constraint queries, their retrieval performance could be catastrophic. To address the issue, we propose a neuro-symbolic information retrieval method, namely NS-IR, that leverages first-order logic (FOL) to optimize the embeddings of naive natural language by considering the logical consistency between queries and documents. Specifically, we introduce two novel techniques, logic alignment and connective constraint, to rerank candidate documents, thereby enhancing retrieval relevance. Furthermore, we construct a new dataset NegConstraint including negative-constraint queries to evaluate our NS-IR's performance on such complex IR scenarios. Our extensive experiments demonstrate that NS-IR not only achieves superior zero-shot retrieval performance on web search and low-resource retrieval tasks, but also performs better on negative-constraint queries. Our source code and dataset are available at this https URL.

[IR-3] Personalized Tree based progressive regression model for watch-time prediction in short video recommendation

链接: https://arxiv.org/abs/2505.22153
作者: Xiaokai Chen,Xiao Lin,Changcheng Li,Peng Jiang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In online video platforms, accurate watch time prediction has become a fundamental and challenging problem in video recommendation. Previous research has revealed that the accuracy of watch time prediction highly depends on both the transformation of watch-time labels and the decomposition of the estimation process. TPM (Tree based Progressive Regression Model) achieves State-of-the-Art performance with a carefully designed and effective decomposition paradigm. TPM discretizes the watch time into several ordinal intervals and organizes them into a binary decision tree, where each node corresponds to a specific interval. At each non-leaf node, a binary classifier is used to determine the specific interval in which the watch time variable most likely falls, based on the prediction outcome at its parent node. The tree structure serves as the core of TPM, as it defines the decomposition of watch time estimation and determines how the ordinal intervals are discretized. However, in TPM, the tree is predefined as a full binary tree, which may be sub-optimal for the following reasons. First, a full binary tree implies an equal partitioning of the watch time space, which may struggle to capture the complexity of real-world watch time distributions. Second, instead of relying on a globally fixed tree structure, we advocate for a personalized, data-driven tree that can be learned in an end-to-end manner. Therefore, we propose PTPM to enable a highly personalized decomposition of watch-time estimation with better efficacy and efficiency. Moreover, we reveal that TPM is affected by selection bias due to conditional modeling and devise a simple approach to address it. We conduct extensive experiments on both offline datasets and online environments. PTPM has been fully deployed in core traffic scenarios and serves more than 400 million users per day.
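
A minimal sketch of the tree decomposition that TPM uses and PTPM personalizes (a fixed depth-2 tree and illustrative classifier probabilities are assumed): per-node binary classifiers induce a distribution over ordinal watch-time intervals, and multiplying probabilities along each root-to-leaf path yields the expected watch time:

```python
import numpy as np

def expected_watch_time(node_probs, interval_midpoints):
    """node_probs[l] holds, for every node at tree level l, the classifier's
    probability of descending to the right child. Root-to-leaf path products
    give a distribution over the ordinal watch-time intervals; its mean is
    the predicted watch time."""
    leaf = np.ones(1)
    for p_right in node_probs:
        leaf = np.stack([leaf * (1 - p_right), leaf * p_right], axis=1).reshape(-1)
    return float(leaf @ interval_midpoints)

node_probs = [np.array([0.7]),            # root classifier
              np.array([0.4, 0.6])]       # level-1 classifiers
midpoints = np.array([5.0, 20.0, 60.0, 180.0])     # seconds per interval
print(expected_watch_time(node_probs, midpoints))  # 95.7
```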

[IR-4] ConsRec: Denoising Sequential Recommendation through User-Consistent Preference Modeling

链接: https://arxiv.org/abs/2505.22130
作者: Haidong Xin,Qiushi Xiong,Zhenghao Liu,Sen Mei,Yukun Yan,Shi Yu,Shuo Wang,Yu Gu,Ge Yu,Chenyan Xiong
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:User-item interaction histories are pivotal for sequential recommendation systems but often include noise, such as unintended clicks or actions that fail to reflect genuine user preferences. To address this issue, we propose the User-Consistent Preference-based Sequential Recommendation System (ConsRec), designed to capture stable user preferences and filter noisy items from interaction histories. Specifically, ConsRec constructs a user-interacted item graph, learns item similarities from their text representations, and then extracts the maximum connected subgraph from the user-interacted item graph for denoising items. Experimental results on the Yelp and Amazon Product datasets illustrate that ConsRec achieves a 13% improvement over baseline recommendation models, showing its effectiveness in denoising user-interacted items. Further analysis reveals that the denoised interaction histories form semantically tighter clusters of user-preferred items, leading to higher relevance scores for ground-truth targets and more accurate recommendations. All codes are available at this https URL.
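
A minimal sketch of the denoising step described in the abstract (the embeddings, threshold, and component choice are illustrative; ConsRec learns item similarities from text representations, which is not reproduced here): build a similarity graph over interacted items and keep the largest connected component:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def denoise_history(item_embs, threshold=0.5):
    """Connect interacted items whose (text-derived) embeddings are similar,
    then keep only the largest connected component as the denoised history."""
    E = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    sim = E @ E.T                                        # cosine similarities
    adj = csr_matrix((sim > threshold) & ~np.eye(len(E), dtype=bool))
    _, labels = connected_components(adj, directed=False)
    return np.where(labels == np.bincount(labels).argmax())[0]

rng = np.random.default_rng(0)
coherent = rng.normal(size=(1, 16)) + 0.1 * rng.normal(size=(8, 16))  # preferred items
noise = rng.normal(size=(2, 16))                                      # accidental clicks
print(denoise_history(np.vstack([coherent, noise])))  # indices 0..7 survive
```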

[IR-5] Shapley Value-driven Data Pruning for Recommender Systems KDD

链接: https://arxiv.org/abs/2505.22057
作者: Yansen Zhang,Xiaokun Zhang,Ziqiang Cui,Chen Ma
类目: Information Retrieval (cs.IR)
*备注: In SIGKDD (2025), 10 pages

点击查看摘要

Abstract:Recommender systems often suffer from noisy interactions like accidental clicks or popularity bias. Existing denoising methods typically identify users' intent in their interactions, and filter out noisy interactions that deviate from the assumed intent. However, they ignore that interactions deemed noisy could still aid model training, while some "clean" interactions offer little learning value. To bridge this gap, we propose Shapley Value-driven Valuation (SVV), a framework that evaluates interactions based on their objective impact on model training rather than subjective intent assumptions. In SVV, a real-time Shapley value estimation method is devised to quantify each interaction's value based on its contribution to reducing training loss. Afterward, SVV highlights the interactions with high values while downplaying low ones to achieve effective data pruning for recommender systems. In addition, we develop a simulated noise protocol to examine the performance of various denoising approaches systematically. Experiments on four real-world datasets show that SVV outperforms existing denoising methods in both accuracy and robustness. Further analysis also demonstrates that our SVV can preserve training-critical interactions and offer interpretable noise assessment. This work shifts denoising from heuristic filtering to principled, model-driven interaction valuation.

[IR-6] AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models

链接: https://arxiv.org/abs/2505.21741
作者: Dongjune Chang,Sola Kim,Young Soo Park
类目: Multiagent Systems (cs.MA); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Nuclear waste management requires rigorous regulatory compliance assessment, demanding advanced decision-support systems capable of addressing complex legal, environmental, and safety considerations. This paper presents a multi-agent Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) with document retrieval mechanisms to enhance decision accuracy through structured agent collaboration. Through a structured 10-round discussion model, agents collaborate to assess regulatory compliance and safety requirements while maintaining document-grounded responses. Implemented on consumer-grade hardware, the system leverages Llama 3.2 and mxbai-embed-large-v1 embeddings for efficient retrieval and semantic representation. A case study of a proposed temporary nuclear waste storage site near Winslow, Arizona, demonstrates the framework’s effectiveness. Results show the Regulatory Agent achieves consistently higher relevance scores in maintaining alignment with legal frameworks, while the Safety Agent effectively manages complex risk assessments requiring multifaceted analysis. The system demonstrates progressive improvement in agreement rates between agents across discussion rounds while semantic drift decreases, indicating enhanced decision-making consistency and response coherence. The system ensures regulatory decisions remain factually grounded, dynamically adapting to evolving regulatory frameworks through real-time document retrieval. By balancing automated assessment with human oversight, this framework offers a scalable and transparent approach to regulatory governance. These findings underscore the potential of AI-driven, multi-agent systems in advancing evidence-based, accountable, and adaptive decision-making for high-stakes environmental management scenarios.

[IR-7] Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis

链接: https://arxiv.org/abs/2505.21700
作者: Sinchana Ramakanth Bhat,Max Rudat,Jannis Spiekermann,Nicolas Flores-Herr
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and their influence on retrieval performance using multiple embedding models. Our experiments, conducted on both short-form and long-form datasets, reveal that chunk size plays a critical role in retrieval effectiveness – smaller chunks (64-128 tokens) are optimal for datasets with concise, fact-based answers, whereas larger chunks (512-1024 tokens) improve retrieval in datasets requiring broader contextual understanding. We also analyze the impact of chunking on different embedding models, finding that they exhibit distinct chunking sensitivities. While models like Stella benefit from larger chunks, leveraging global context for long-range retrieval, Snowflake performs better with smaller chunks, excelling at fine-grained, entity-based matching. Our results underscore the trade-offs between chunk size, embedding models, and dataset characteristics, emphasizing the need for improved chunk quality measures, and more comprehensive datasets to advance chunk-based retrieval in long-document Information Retrieval (IR).
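
A minimal sketch of the fixed-size chunking strategy under evaluation (token granularity and the overlap value are illustrative; a real pipeline would use the embedding model's own tokenizer):

```python
def chunk_fixed(tokens, chunk_size=128, overlap=16):
    """Split a token sequence into fixed-size windows with a small overlap;
    the final chunk may be shorter than chunk_size."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = [f"tok{i}" for i in range(1000)]  # stand-in for real tokenizer output
chunks = chunk_fixed(tokens, chunk_size=128, overlap=16)
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 9 chunks, 128 ... 104 tokens
```

Sweeping chunk_size over the ranges the paper reports (64-128 versus 512-1024 tokens) and re-running retrieval is exactly the kind of experiment the abstract describes.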

附件下载

点击下载今日全部论文列表