本篇博文主要内容为 2025-10-17 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-10-17)
今日共更新660篇论文,其中:
- 自然语言处理共156篇(Computation and Language (cs.CL))
- 人工智能共236篇(Artificial Intelligence (cs.AI))
- 计算机视觉共122篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共192篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Agent ic Design of Compositional Machines
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)是否能够学习进行复杂机器设计的问题,特别是通过组合式机器设计任务来评估其生成能力。解决方案的关键在于构建一个基于游戏《Besiege》的测试平台BesiegeField,该平台支持基于部件的构造、物理仿真和奖励驱动的评估机制,并在此基础上对当前最先进的LLM结合代理工作流进行基准测试,识别出成功所需的三项核心能力:空间推理、战略装配和指令遵循能力。针对现有开源模型表现不足的问题,研究进一步探索强化学习(Reinforcement Learning, RL)作为提升性能的路径,包括收集冷启动数据集并开展RL微调实验,从而推动语言模型在机器设计与物理推理交叉领域的进展。
链接: https://arxiv.org/abs/2510.14980
作者: Wenqian Zhang,Weiyang Liu,Zhen Liu
机构: The Chinese University of Hong Kong (Shenzhen) (香港中文大学(深圳)); The Chinese University of Hong Kong (香港中文大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 75 pages, 31 figures, Project Page: this https URL
Abstract:The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
zh
[NLP-1] Attention Is All You Need for KV Cache in Diffusion LLM s
【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, DLMs)在解码过程中因重复计算键值(Key-Value, KV)缓存而导致的冗余计算问题,从而在不牺牲生成质量的前提下显著降低解码延迟。现有方法在每个去噪步骤和层中均对所有token重新计算QKV,而忽略了KV状态在多数步骤尤其是浅层中的变化较小这一事实,造成大量无效计算。解决方案的关键在于提出一种无需训练、与架构无关的自适应缓存策略——Elastic-Cache,其核心创新包括:(1) 基于注意力机制的漂移检测(attention-aware drift test)决定何时刷新缓存(以最被关注token为保守下界),以及(2) 基于深度感知的刷新调度(depth-aware schedule)决定从哪一层开始重新计算,同时复用浅层缓存和窗口外MASK缓存。该方法实现了动态、分层的缓存更新,有效减少了冗余计算,在多项任务上实现高达45.1倍的速度提升,且保持优于基线的生成准确性。
链接: https://arxiv.org/abs/2510.14973
作者: Quan Nguyen-Tri,Mukul Ranjan,Zhiqiang Shen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL
Abstract:This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods’ decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant \bf MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose \bf Elastic-Cache , a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7\times on GSM8K (256 tokens), 45.1\times on longer sequences, and 4.8\times on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ( 6.8\times on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
zh
[NLP-2] okDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
【速读】: 该论文旨在解决当前代码大语言模型(Code LLMs)中因使用基于统计的子词分词器(如BPE)而导致的语义不变但tokenization差异的问题,这种差异会引发模型行为的显著波动。解决方案的关键在于提出TokDrift框架,通过应用语义保持的重写规则生成仅在tokenization上不同的代码变体,并揭示了问题根源在于早期嵌入层中子词分割未能对齐语法标记边界,从而凸显了未来代码LLMs需采用语法感知的分词策略以提升可靠性和一致性。
链接: https://arxiv.org/abs/2510.14972
作者: Yinxi Li,Yuntian Deng,Pengyu Nie
机构: University of Waterloo (滑铁卢大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:
Abstract:Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
zh
[NLP-3] LLM s as Scalable General-Purpose Simulators For Evolving Digital Agent Training
【速读】: 该论文旨在解决数字代理(Digital Agents)在训练过程中对大规模、多样化用户界面(UI)轨迹数据的依赖问题,此类数据的收集在人力标注、基础设施和工程实现上成本高昂。解决方案的关键在于提出一种可扩展的UI-Simulator范式,其核心包括:一个用于生成多样化UI状态的数字世界模拟器、一种引导式滚动过程以实现连贯探索,以及一个轨迹包装器以生成高质量且多样化的训练轨迹。此外,论文进一步提出UI-Simulator-Grow策略,通过优先合成高影响力任务及其变体,实现更高效的数据扩展,从而显著提升代理性能,即使使用较弱教师模型也能达到或超越基于真实UI数据训练的开源代理水平。
链接: https://arxiv.org/abs/2510.14969
作者: Yiming Wang,Da Yin,Yuedong Cui,Ruichen Zheng,Zhiqian Li,Zongyu Lin,Di Wu,Xueqing Wu,Chenchen Ye,Yu Zhou,Kai-Wei Chang
机构: UCLA; Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Project page: this https URL Code and data: this https URL
Abstract:Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in both human annotation, infra and engineering perspectives. To this end, we introduce \textbfUI-Simulator , a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose \textbfUI-Simulator-Grow , a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizes informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of targeted synthesis scaling paradigm to continuously and efficiently enhance the digital agents.
zh
[NLP-4] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
【速读】: 该论文旨在解决基于大语言模型(Large Language Model, LLM)的智能体在多轮交互场景中因奖励稀疏导致的学习效率低下问题,具体表现为优势衰减(advantage collapse)和细粒度信用分配缺失。其解决方案的关键在于提出信息增益策略优化(Information Gain-based Policy Optimization, IGPO),通过将每一轮交互建模为对真实答案信念更新的增量过程,定义轮次级奖励为策略产生正确答案概率的边际提升,从而获得内在且密集的监督信号。IGPO不依赖外部奖励模型或昂贵的蒙特卡洛估计,直接从模型自身的信念更新中推导出内在奖励,并将其与最终结果奖励结合形成密集奖励轨迹,显著提升了多轮任务中的准确率与样本效率。
链接: https://arxiv.org/abs/2510.14967
作者: Guoqing Wang,Sunhao Dai,Guangze Ye,Zeyu Gan,Wei Yao,Yong Deng,Xiaofeng Wu,Zhenzhe Ying
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy’s probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model’s own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
zh
[NLP-5] Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
【速读】: 该论文旨在解决递归深度语言模型(recurrent-depth language models)在推理阶段如何高效利用其冗余计算能力以加速生成的问题。这类模型通过重复层结构实现可扩展的计算能力,但传统自回归(autoregressive)生成方式未能充分利用其并行潜力。解决方案的关键在于提出一种基于扩散理论(diffusion theory)的新采样器——扩散强制采样器(diffusion forcing sampler),该方法在每次前向传播中解码新标记(token),同时通过递归机制并行优化这些标记的潜在状态。该方法理论上比相同时间预算下的基线自回归生成更具表达能力,并且无需重新训练即可直接应用于现有的3.5B参数递归深度Transformer模型,实现最高达5倍的生成速度提升。
链接: https://arxiv.org/abs/2510.14961
作者: Jonas Geiping,Xinyu Yang,Guinan Su
机构: ELLIS Institute Tübingen (ELLIS研究所图宾根); Max-Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); Tübingen AI Center (图宾根人工智能中心); Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Code can be found at this https URL
Abstract:Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
zh
[NLP-6] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在几何等依赖视觉辅助的数学领域中推理能力不足的问题。现有视觉链式思维(Visual Chain-of-Thought, VCoT)方法受限于刚性外部工具或无法生成高保真、时机精准的示意图,难以支持复杂问题求解。解决方案的关键在于提出MathCanvas框架,其核心由两个阶段构成:首先通过1520万对数据(包括1000万张图-描述对和520万步编辑轨迹)预训练模型掌握绘图与编辑能力;其次利用21.9万例交错式视觉-文本推理路径数据集进行微调,使模型学会何时及如何使用视觉辅助。该方法显著提升了模型在MathCanvas-Bench上的表现,相对强基线提升86%,并展现出良好的泛化能力。
链接: https://arxiv.org/abs/2510.14958
作者: Weikang Shi,Aldrich Yu,Rongyao Fang,Houxing Ren,Ke Wang,Aojun Zhou,Changyao Tian,Xinyu Fu,Yuxuan Hu,Zimu Lu,Linjiang Huang,Si Liu,Rui Liu,Hongsheng Li
机构: Multimedia Laboratory (MMLab), The Chinese University of Hong Kong (香港中文大学多媒体实验室); Huawei Research (华为研究); BUAA (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project Page: this https URL
Abstract:While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit-framework, datasets, and benchmark-to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: this https URL
zh
[NLP-7] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
【速读】: 该论文旨在解决当前多模态生成模型在面对方言文本输入时性能显著下降的问题,即当用户使用包含方言词汇的提示(prompt)时,模型生成图像或视频内容的质量明显劣于使用标准美式英语(Standard American English, SAE)的情况。研究表明,即使仅在提示中加入一个方言词,模型性能也会下降32.26%至48.17%。现有缓解方法如微调(fine-tuning)和提示重写(prompt rewriting)虽能小幅提升方言表现(约7%),但往往导致SAE性能显著退化。论文提出一种基于编码器(encoder-based)的通用缓解策略,其关键在于通过引入对新方言特征的识别能力,同时保持对SAE任务的性能不变——实验表明,该方法在Stable Diffusion 1.5等模型上可使五种方言的性能提升至与SAE相当(+34.4%),且对SAE性能影响接近零。
链接: https://arxiv.org/abs/2510.14949
作者: Yu Zhou,Sohyun An,Haikang Deng,Da Yin,Clark Peng,Cho-Jui Hsieh,Kai-Wei Chang,Nanyun Peng
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins ( 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
zh
[NLP-8] MetaBench: A Multi-task Benchmark for Assessing LLM s in Metabolomics
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在代谢组学(metabolomics)这一需要深度、跨知识领域整合的科学领域中能力尚不明确的问题。其核心挑战在于代谢组学涉及复杂的生化通路、异构的标识符系统以及碎片化的数据库,导致现有LLMs难以有效支持专业研究需求。解决方案的关键是提出了MetaBench——首个面向代谢组学的基准测试体系,通过从权威公共资源中精心构建,系统评估LLMs在知识获取、理解、定位(grounding)、推理和科研应用等五个关键维度的表现。该基准揭示了当前模型在跨数据库标识符对齐任务中的显著局限性,尤其是在长尾代谢物上的性能下降,从而为开发和评估可靠的代谢组学人工智能工具提供了可量化、可扩展的基础架构。
链接: https://arxiv.org/abs/2510.14944
作者: Yuxing Lu,Xukai Zhao,J. Ben Tamo,Micky C. Nnamdi,Rui Peng,Shuang Zeng,Xingyu Hu,Jinzhuo Wang,May D. Wang
机构: Georgia Institute of Technology (佐治亚理工学院); Emory University (埃默里大学); Peking University (北京大学); Tsinghua University (清华大学); School of Electrical and Computer Engineering, Georgia Institute of Technology (佐治亚理工学院电气与计算机工程学院); School of Computer Science, Georgia Institute of Technology (佐治亚理工学院计算机科学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 22 pages, 6 figures, 4 tables
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.
zh
[NLP-9] LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
【速读】: 该论文旨在解决生成式 AI(Generative AI)在推理过程中缺乏验证信号的问题,尤其是在测试阶段无法获取外部验证器时,如何提升大型语言模型(Large Language Models, LLMs)的自我验证能力。此前方法通过在标准强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)框架中引入模型自验证训练机制,将推理与验证能力统一于单一LLM中,但其依赖两个独立的提示模板依次生成解决方案和自验证结果,效率低下。本文的关键突破在于理论揭示:自验证的强化学习目标闭式解可简化为一个极简形式——解决方案的真实推理奖励等于其最后一个token的自奖励得分(self-rewarding score),该得分由策略模型对指定token的下一个token对数概率与预计算常数之差乘以KL系数得到。基于此洞察,作者提出LaSeR算法,仅需在原RLVR损失基础上增加一个均方误差(MSE)损失项,使最后一个token的自奖励得分与验证器提供的推理奖励对齐,从而联合优化LLM的推理与自奖励能力。该方法在训练和推理阶段均可利用优化后的自奖励得分,且额外计算成本仅为一次额外token推理,显著提升了模型性能与推理时扩展能力。
链接: https://arxiv.org/abs/2510.14943
作者: Wenkai Yang,Weijie Liu,Ruobing Xie,Yiju Guo,Lulu Wu,Saiyong Yang,Yankai Lin
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); LLM Department, Tencent (腾讯大模型部门)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress. Github repo: this https URL
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of model’s self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model’s next-token log-probability assigned to any pre-specified token at the solution’s last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model’s reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
zh
[NLP-10] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations
【速读】: 该论文旨在解决精神健康障碍(如抑郁症、焦虑症和创伤后应激障碍(PTSD))在临床实践中常因主观评估、资源匮乏及社会污名而导致的漏诊或误诊问题,尤其是在初级医疗环境中,医生对抑郁或焦虑的误判率超过60%。解决方案的关键在于利用机器学习模型对真实世界中553例半结构化访谈数据进行分析,采用多种模型架构进行对比评估,包括零样本提示(zero-shot prompting)的GPT-4.1 Mini与MetaLLaMA以及基于低秩适配(LowRank Adaptation, LoRA)微调的RoBERTa模型。结果表明,这些模型在诊断准确率上均超过80%,尤其在PTSD检测中达到89%准确率和98%召回率;同时发现聚焦于短文本片段可提升召回率,证明叙事线索对敏感识别具有价值;LoRA微调在保持高性能的同时显著降低计算成本,展现出高效且可行的部署潜力。该研究为将生成式AI赋能的筛查工具整合进实际临床流程,特别是在资源有限或高污名化环境中实现早期干预提供了实证基础。
链接: https://arxiv.org/abs/2510.14937
作者: Jianfeng Zhu,Julina Maharjan,Xinyu Li,Karin G. Coifman,Ruoming Jin
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages 1 figure
Abstract:Mental health disorders remain among the leading cause of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, and stigma and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semistructured interviews, each paried with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, as well as fine-tuned RoBERTa models using LowRank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strongperformance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter context, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics. Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powerd early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.
zh
[NLP-11] Circuit Insights: Towards Interpretability Beyond Activations
【速读】: 该论文旨在解决当前可解释人工智能(Explainable AI)与机制可解释性(Mechanistic Interpretability)研究中面临的两大核心问题:一是现有方法依赖人工检查,难以扩展至复杂任务;二是自动化分析方法虽具可扩展性,但常忽略特征间的交互作用,且对大语言模型(LLM)和数据集质量高度敏感。其解决方案的关键在于提出两种互补方法——WeightLens 和 CircuitLens:WeightLens 通过直接解析神经网络权重来解释特征,无需额外的解释模型或训练数据,在不依赖输入上下文的特征上表现优于或等同于现有方法;CircuitLens 则聚焦于捕捉特征激活如何由组件间的相互作用产生,揭示仅靠激活分析无法识别的电路级动态行为。二者结合显著提升了可解释性的鲁棒性,并推动了高效、高质量的机制电路分析。
链接: https://arxiv.org/abs/2510.14936
作者: Elena Golimblevskaia,Aakriti Jain,Bruno Puri,Ammar Ibrahim,Wojciech Samek,Sebastian Lapuschkin
机构: Fraunhofer Heinrich Hertz Institute (弗劳恩霍夫海因里希·赫兹研究所); Technische Universität Berlin (柏林工业大学); BIFOLD - Berlin Institute for the Foundations of Learning and Data (柏林学习与数据基础研究所); Centre of eXplainable Artificial Intelligence, Technological University Dublin (都柏林理工学院可解释人工智能中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.
zh
[NLP-12] Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models
【速读】: 该论文试图解决推理系统(如大语言模型)中过度自信(overconfidence)与实际可靠性之间不匹配的问题,即名义稳定性(nominal stability)与认知稳定性(epistemic stability)之间的差距。其解决方案的关键在于提出一个复合不稳定性指数(H-Risk),该指标融合了谱裕度(spectral margin)、条件数(conditioning)、时间敏感性(temporal sensitivity)和创新放大效应(innovation amplification),从而量化系统内部动态的脆弱性。通过线性高斯模拟和对大语言模型(LLMs)的实证分析,研究发现高H-Risk与校准偏差(miscalibration)及幻觉(hallucination)显著相关,表明该指数可作为诊断和选择性降低推理系统过自信行为的理论依据。
链接: https://arxiv.org/abs/2510.14925
作者: Akira Okutomi
机构: ToppyMicroServices OÜ(ToppyMicroServices OÜ)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages, 2 figures, preliminary version
Abstract:We reinterpret Kant’s Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing – and selectively reducing – overconfidence in reasoning systems. This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.
zh
[NLP-13] RI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech Text and EEG
【速读】: 该论文旨在解决抑郁症自动检测中存在的一系列问题,包括现有研究在模态范围上的局限性、特征表示与建模策略缺乏系统比较,以及评估协议不一致等挑战。其解决方案的关键在于通过系统性地探索脑电图(EEG)、语音和文本三种模态的特征表示与融合策略,采用统一的受试者独立划分方式确保结果的稳健性和可复现性;具体而言,研究发现预训练嵌入优于手工设计特征,且精心设计的三模态模型能够实现最优性能,从而为未来多模态抑郁症检测研究奠定坚实基础。
链接: https://arxiv.org/abs/2510.14922
作者: Annisaa Fitri Nurfidausi,Eleonora Mancini,Paolo Torroni
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注:
Abstract:Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.
zh
[NLP-14] Predicting Task Performance with Context-aware Scaling Laws
【速读】: 该论文旨在解决传统缩放定律(scaling laws)无法准确预测下游任务性能的问题,尤其是在上下文长度对模型表现有显著影响的情况下。传统方法主要关联上游指标(如交叉熵损失)与模型规模、训练数据和计算资源等设计因素,但忽略了上下文在下游任务中的关键作用。解决方案的关键在于提出一个简洁且可解释的框架,将下游性能建模为训练计算量(training compute)和提供上下文长度的联合函数,并通过在 Llama-2-7B 和 Llama-2-13B 的扩展上下文版本上进行大规模实证验证(覆盖65,500个实例,涵盖算术推理、常识推理和机器翻译三类任务),证明该框架能精准描述分布内性能、跨三个数量级训练计算量泛化,并可靠外推不同上下文长度下的性能表现。这一成果揭示了训练计算与上下文利用之间的相互作用机制,为高效设计适用于多样化下游任务的长上下文大语言模型提供了理论依据与实践指导。
链接: https://arxiv.org/abs/2510.14919
作者: Kyle Montgomery,David Park,Jianhong Tu,Michael Bendersky,Beliz Gunel,Dawn Song,Chenguang Wang
机构: UC Santa Cruz (加州大学圣克鲁兹分校); Washington University in St. Louis (圣路易斯华盛顿大学); Databricks; Google DeepMind (谷歌深度思维); UC Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at this https URL.
zh
[NLP-15] Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation EMNLP2025
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中大语言模型(Large Language Models, LLMs)输出不一致的问题,尤其是在面对语义等价输入时,模型生成结果波动较大,而现有微调技术难以有效提升一致性。其解决方案的关键在于三方面创新:一是通过系统性合成数据生成策略扩充训练样本;二是引入三元组损失(triplet loss)优化嵌入表示以提升检索质量;三是提出一种基于中间层激活值的分层模型融合方法(layer-wise model merging),利用一致性感知权重整合多个专用模型的知识,从而显著增强输出一致性,在响应相似度上相较基线提升约47.5%,为工业级RAG系统的可靠性提供了实用改进路径。
链接: https://arxiv.org/abs/2510.14915
作者: Xujun Peng,Anoop Kumar,Jingyu Wu,Parker Glenn,Daben Liu
机构: AI Foundations, Capital One (Capital One)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Industry track
Abstract:Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results how that our merged model significantly enhances output consistency, achieving a ~47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.
zh
[NLP-16] Budget-aware Test-time Scaling via Discriminative Verification
【速读】: 该论文旨在解决大语言模型在复杂推理任务中,因采用生成式验证(generative verification)进行测试时缩放(test-time scaling)所导致的计算成本过高问题。其解决方案的关键在于提出一种预算感知(budget-aware)的判别式验证(discriminative verification)机制,并将其与自一致(self-consistency)策略结合,形成一种混合方法。实证分析表明,尽管判别式验证单独使用时性能较低,但与自一致策略协同后,在固定计算预算下显著优于当前最先进的生成式验证方法,例如在AIME2025数据集上准确率提升达15.3%。这一发现表明,判别式验证是实现高效且实用测试时缩放的有效途径。
链接: https://arxiv.org/abs/2510.14913
作者: Kyle Montgomery,Sijun Tan,Yuqi Chen,Siyuan Zhuang,Tianjun Zhang,Raluca Ada Popa,Chenguang Wang
机构: UC Santa Cruz (加州大学圣克鲁兹分校); UC Berkeley (加州大学伯克利分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a “free” upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at this https URL.
zh
[NLP-17] Reasoning with Sampling: Your Base Model is Smarter Than You Think
【速读】: 该论文试图解决的问题是:如何在不进行额外训练的情况下,从基础大语言模型(base models)中直接激发其推理能力,从而避免传统强化学习(reinforcement learning, RL)后训练方法对数据、标注和验证器的依赖。解决方案的关键在于提出一种基于马尔可夫链蒙特卡洛(Markov chain Monte Carlo, MCMC)思想的迭代采样算法,该算法利用基础模型自身的似然函数(likelihood)进行多轮采样优化,显著提升单次推理任务中的表现,如MATH500、HumanEval和GPQA等基准测试中接近甚至超越RL后训练的效果,且无需训练、人工标注数据或验证机制,同时保持样本多样性不退化。
链接: https://arxiv.org/abs/2510.14901
作者: Aayush Karan,Yilun Du
机构: Harvard University (哈佛大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilites can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models’ own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
zh
[NLP-18] Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
【速读】: 该论文旨在解决社交平台上隐性自杀意念(Suicidal Ideation, SI)的早期识别难题,即用户往往不直接表达其心理危机,而是通过日常发帖或与同伴互动中的间接信号体现风险。解决方案的关键在于构建一个融合个体长期发布历史与社会邻近同伴话语信息的计算框架,采用复合网络中心性指标筛选关键邻居,并对用户及其邻居的交互行为进行时间对齐,最终通过微调的DeBERTa-v3模型整合多层信号。实验表明,该方法在Reddit上的1000名用户(500例病例与500名对照)中相比仅依赖个体特征的基线模型,将隐性SI的早期检测准确率提升了15%,验证了同伴互动作为预测信号的重要性。
链接: https://arxiv.org/abs/2510.14889
作者: Soorya Ram Shimgekar,Ruining Zhao,Agam Goyal,Violeta J. Rodriguez,Paul A. Bloom,Hari Sundaram,Koustuv Saha
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Columbia University (哥伦比亚大学)
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:On social media, many individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly. Instead, signs may surface indirectly through everyday posts or peer interactions. Detecting such implicit signals early is critical but remains challenging. We frame early and implicit SI as a forward-looking prediction task and develop a computational framework that models a user’s information environment, consisting of both their longitudinal posting histories as well as the discourse of their socially proximal peers. We adopted a composite network centrality measure to identify top neighbors of a user, and temporally aligned the user’s and neighbors’ interactions – integrating the multi-layered signals in a fine-tuned DeBERTa-v3 model. In a Reddit study of 1,000 (500 Case and 500 Control) users, our approach improves early and implicit SI detection by 15% over individual-only baselines. These findings highlight that peer interactions offer valuable predictive signals and carry broader implications for designing early detection systems that capture indirect as well as masked expressions of risk in online environments.
zh
[NLP-19] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction WACV26
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视觉分类(Fine-Grained Visual Classification, FGVC)任务中对自由形式回答的评估难题,尤其是高选项数(数百至数千)的多选题(Multiple Choice Questions, MCQs)场景下如何高效提取正确答案并扩展至基于检索的问题。其核心解决方案是提出nlg2choice方法,该方法采用两阶段策略:第一阶段由MLLM生成开放式回答,第二阶段通过文本约束解码(constrained decoding)预测最可能的选项;在检索场景中进一步引入早期停止机制(early stopping method)以降低计算开销、提升吞吐量,从而在保持高准确率的同时显著优化效率。
链接: https://arxiv.org/abs/2510.14885
作者: Logan Lawrence,Oindrila Saha,Megan Wei,Chen Sun,Subhransu Maji,Grant Van Horn
机构: University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校); Brown University(布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to WACV26. 12 pages, 8 tables, 5 figures
Abstract:Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don’t consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
zh
[NLP-20] From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR
【速读】: 该论文旨在解决通用编译器在现代空间架构(spatial architectures)上性能受限的问题,即传统编译器抽象了并行性、局部性和同步机制,无法有效利用细粒度的数据移动控制、执行顺序和计算资源放置等特性。其核心解决方案是提出MLIR-AIR,一个基于MLIR的开源编译栈,通过定义AIR方言(AIR dialect)提供异步与分层操作的结构化表示,使编译器能够显式地调度计算与数据流,实现跨硬件区域的计算分布、通信与计算重叠,无需依赖运行时特定协调或手动调度。关键创新在于利用编译期管理的调度机制,将高阶控制流转化为高效利用NPU计算阵列和内存层次结构的空间程序,从而显著提升计算效率(如矩阵乘法达78.7%)并支持复杂算子(如多头注意力)的简洁高效映射。
链接: https://arxiv.org/abs/2510.14871
作者: Erwei Wang,Samuel Bayliss,Andra Bisca,Zachary Blair,Sangeeta Chowdhary,Kristof Denolf,Jeff Fifield,Brandon Freiberger,Erika Hunhoff,Phil James-Roxby,Jack Lo,Joseph Melber,Stephen Neuendorffer,Eddie Richter,Andre Rosti,Javier Setoain,Gagandeep Singh,Endri Taka,Pranathi Vasireddy,Zhewen Yu,Niansong Zhang,Jinming Zhuang
机构: 未知
类目: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注:
Abstract:General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD’s NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR’s capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.
zh
[NLP-21] Benchmarking Multimodal Large Language Models for Face Recognition
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在人脸识别任务中性能尚不明确的问题,尤其是开源MLLMs与专用人脸识别模型在标准基准测试中的表现差异未被系统评估。其解决方案的关键在于构建一个针对主流人脸数据集(如LFW、CALFW、CPLFW、CFP、AgeDB和RFW)的系统性基准测试框架,通过统一协议比较不同MLLMs在零样本场景下的识别精度,从而揭示MLLMs虽能捕捉丰富语义线索但尚无法达到专业人脸识别模型在高精度场景下的性能水平,为未来基于MLLM的人脸识别模型设计提供量化依据与改进方向。
链接: https://arxiv.org/abs/2510.14866
作者: Hatef Otroshi Shahreza,Sébastien Marcel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available in the project page.
zh
[NLP-22] Midtraining Bridges Pretraining and Posttraining Distributions
【速读】: 该论文旨在解决语言模型在预训练过程中引入“中段训练”(midtraining)阶段的科学机制不明确的问题,即为何在预训练后期混合高质量、指令格式的数据能提升模型性能。其解决方案的关键在于通过受控实验系统性地验证中段训练的有效性及其作用机制:研究发现,在数学和代码等特定领域中,中段训练能够显著缩小预训练与下游任务数据之间的句法差异(syntactic gap),从而在监督微调后实现更低的域内验证损失,并减少对预训练数据的记忆遗忘(pretraining data forgetting)。相较持续预训练,中段训练在这些领域表现出更优的性能与更强的泛化保持能力,且引入时间点比数据混合权重更具决定性影响。
链接: https://arxiv.org/abs/2510.14865
作者: Emmy Liu,Graham Neubig,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recently, many language models have been pretrained with a “midtraining” phase, in which higher quality, often instruction-formatted data, is mixed in at the end of pretraining. Despite the popularity of this practice, there is little scientific understanding of this phase of model training or why it is effective. In this work, we conduct the first systematic investigation of midtraining through controlled experiments with language models pretrained from scratch and fine-tuned on supervised finetuning datasets in different domains. We find that when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains, where midtraining can best reduce the syntactic gap between pretraining and posttraining data. In these cases, midtraining consistently outperforms continued pretraining in both in-domain validation loss as well as pretraining data forgetting after posttraining. We conduct ablations on the starting time of the midtraining phase and mixture weights of the midtraining data, using code midtraining as a case study, and find that timing has a greater impact than mixture weights, with earlier introduction of specialized data, yielding greater benefits in-domain as well as preserving general language modeling better. These findings establish midtraining as a domain adaptation technique that compared to continued pretraining yields better performance through reduced forgetting.
zh
[NLP-23] Rewiring Experts on the Fly:Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models
【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在部署阶段因分布偏移导致路由决策次优的问题。现有测试时适应(test-time adaptation)方法主要针对密集模型且依赖外部数据,难以直接应用于MoE架构。其解决方案的关键在于提出一种无需参考数据、在线实时优化路由决策的框架:通过自监督机制利用已生成序列,在预填充阶段及后续周期性间隔中动态调整选定层的路由器 logits,仅使用轻量级加性向量更新以保持计算效率并避免过拟合。该方法实现了对MoE模型路由策略的持续适应,显著提升复杂推理任务性能,如在HumanEval上使OLMoE提升5.5%,且可无缝集成现有测试时扩展技术(如自一致性),在DeepSeek-V2-Lite上结合后平均提升6%。
链接: https://arxiv.org/abs/2510.14853
作者: Guinan Su,Yanwu Yang,Li Shen,Lu Yin,Shiwei Liu,Jonas Geiping
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); ELLIS Institute Tübingen (ELLIS 图宾根研究所); Tübingen AI Center (图宾根人工智能中心); University of Tübingen (图宾根大学); Sun Yat-sen University (中山大学); University of Surrey (萨里大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose \textita data-free, online test-time framework that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaption. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.
zh
[NLP-24] Where to Search: Measure the Prior-Structured Search Space of LLM Agents
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的生成-过滤-精炼(generate-filter-refine)迭代范式在AI+Science领域中,其搜索效率严重依赖于如何将领域先验知识(domain prior)编码为结构化的假设空间的问题。解决方案的关键在于提出一种紧凑的形式化理论,通过将智能体建模为输入与输出之间的模糊关系算子(fuzzy relation operator),从而在其可行转移路径上施加固定的“安全边界”(safety envelope);进一步地,利用单一延续参数(continuation parameter)对所有可达路径加权并求和,得到覆盖生成函数(coverage generating function),该函数不仅定义了可达性难度的度量,还提供了在由安全边界诱导的图结构上的几何解释,从而实现了对LLM辅助迭代搜索过程的系统性形式描述与可操作测量。
链接: https://arxiv.org/abs/2510.14846
作者: Zhuo-Yang Song
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 10 pages, 2 figures, 1 table
Abstract:The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.
zh
[NLP-25] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文档重排序(reranking)任务中,对比学习(Contrastive Learning, CL)与监督微调(Supervised Fine-Tuning, SFT)两种训练目标的性能差异问题,核心在于厘清哪种目标更适配LLM架构并揭示其内在机制。解决方案的关键在于将训练目标解耦为“权重”(控制更新幅度)和“方向”(指导更新路径)两个组件,并构建统一框架分析二者交互作用;通过探针实验发现,SFT在权重设计上显著优于CL,而评分方向无明显优势,从而阐明SFT对LLM重排序更具优势的根本原因。
链接: https://arxiv.org/abs/2510.14824
作者: Ziqi Dai,Xin Zhang,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
Abstract:In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts ‘‘yes’’ (resp. ‘‘no’’) token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
zh
[NLP-26] Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
【速读】: 该论文旨在解决生成式模型(尤其是需要推理能力的模型)在评估过程中因答案提取算法(answer extraction algorithm)不同而导致性能表现和最终答案分布高度敏感的问题。解决方案的关键在于提出一种名为“答案重生成”(Answer Regeneration)的基本框架:通过引入一次额外的模型推理,在提示词“Answer:”前加上原始输入与输出,从而生成新的候选答案序列,并从中选择或提取最终答案。该方法不依赖于特定的答案提取规则(extraction-rule-agnostic),显著提升了模型评估的稳定性和可靠性,已在数学问题求解和开放式问答任务中验证其有效性。
链接: https://arxiv.org/abs/2510.14773
作者: Hwiyeol Jo,Joosung Lee,Jaehone Lee,Sang-Woo Lee,Joonsuk Park,Kang Min Yoo
机构: NAVER Cloud (NAVER云); Neurofusion; University of Richmond (里士满大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ARR Submitted
Abstract:Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt “Answer:”. The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer a more reliable results for model evaluation.
zh
[NLP-27] COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes
【速读】: 该论文旨在解决大语言模型在非英语语境下创造性写作能力不足的问题,其核心挑战在于训练数据稀缺且缺乏过程层面的监督信号。解决方案的关键在于构建了一个名为COIG-Writer的新型中文创造性写作数据集,该数据集通过系统性逆向工程高质量文本,提供包含(1)重构提示(reverse-engineered prompt)、(2)详细创作推理过程(creative reasoning)和(3)最终文本的三元组结构,从而引入过程监督(process supervision)。实验表明,创造性写作由两个组件构成:叙事逻辑(由过程监督提供)与语言表达(由通用数据维持),其中过程监督需与通用数据协同作用以稳定性能,最优比例为至少1:12(创意样本:通用样本),否则性能显著下降;同时发现创造性能力具有文化特异性,跨语言迁移效果差,且词汇多样性与创造力呈负相关(TTR悖论),揭示了逻辑结构与语言基础之间的协同机制。
链接: https://arxiv.org/abs/2510.14763
作者: Yunwen Li,Shuangshuang Ying,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Tianyu Zheng,Xeron Du,Qiguang Chen,Jiajun Shi,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Stephen Huang,Wanxiang Che,Chenghua Lin,Eli Zhang
机构: 2077AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) Process supervision is highly effective but requires stabilization with general data. A ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance; below this threshold, the win rate progressively degrades (from 62.75% down to 35.78%)., (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance), and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
zh
[NLP-28] Pluto: A Benchmark for Evaluating Efficiency of LLM -generated Hardware Code
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在生成硬件描述语言(Verilog)代码时,缺乏对综合性能指标(如面积、延迟和功耗)的系统性评估问题。现有基准测试多聚焦于功能正确性,忽视了高效硬件设计所需的优化维度,且常缺少可验证的测试平台和帕累托最优参考实现。为此,作者提出Pluto框架,其关键在于构建了一个包含114个问题的综合性评测集,每个问题均配有自检测试平台(self-checking testbenches)及多个帕累托最优(Pareto-optimal)参考实现,从而实现了对LLM生成Verilog设计在功能与合成效率上的全面量化评估。实验表明,尽管先进LLMs在功能正确性上表现良好(pass@1达78.3%),其在面积、延迟和功耗等合成效率指标上仍显著落后于人工设计(eff@1分别为63.8%、65.9%和64.0%),凸显了效率感知型评估框架在推动面向硬件的LLM研究中的必要性。
链接: https://arxiv.org/abs/2510.14756
作者: Manar Abdelatty,Maryam Nouh,Jacob K. Rosenstein,Sherief Reda
机构: Brown University (布朗大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power. Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification. To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs. Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations. Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8%, delay efficiency of 65.9%, and power efficiency of 64.0% at eff@1. This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.
zh
[NLP-29] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在采用可验证奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)时存在的虚假推理问题,即仅基于最终答案正确性进行奖励反馈会导致模型生成看似合理但实际逻辑错误的推理过程。解决方案的关键在于提出AutoRubric-R1V框架,其核心创新是一种可扩展的自我聚合方法,能够从成功轨迹中自动提取一致的推理检查点,并据此构建无需人工标注或更强教师模型的问题特定评分标准(rubric),从而实现过程级监督与结果奖励的联合优化,显著提升模型在六项多模态推理基准上的性能及推理过程的真实性。
链接: https://arxiv.org/abs/2510.14738
作者: Mengzhao Jia,Zhihan Zhang,Ignacio Cases,Zheyuan Liu,Meng Jiang,Peng Qi
机构: University of Notre Dame (圣母大学); Uniphore
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
zh
[NLP-30] Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms
【速读】: 该论文试图解决生成式 AI (Generative AI) 在医疗健康领域快速部署过程中因忽视真实世界情境和用户多样性而引发的偏见、隐私侵犯及不公平访问等潜在风险问题。解决方案的关键在于提出一种以人为中心的框架,通过生成用户故事(user stories)并支持多代理讨论(multi-agent discussions),引导利益相关者在系统部署前创造性地思考可能带来的益处与危害。研究表明,阅读故事的参与者能够识别更广泛类型的 harms(共13类),且分布更为均衡,显著优于未读故事组主要关注隐私与福祉(58.3%)的局限性,表明该方法有效提升了对AI社会影响的全面认知与前瞻性判断能力。
链接: https://arxiv.org/abs/2510.14718
作者: Xingmeng Zhao,Dan Schumacher,Veronica Rammouz,Anthony Rios
机构: The University of Texas at San Antonio (圣安东尼奥德克萨斯大学)
类目: Computation and Language (cs.CL)
备注: 8 pages main + Appendix
Abstract:Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI’s impact on users.
zh
[NLP-31] ITAN: Graph-Executable Reasoning for Cyber Threat Intelligence
【速读】: 该论文旨在解决自然语言查询与结构化知识图谱之间语义鸿沟的问题,即如何将人类可读的网络安全威胁查询(如“某恶意软件如何利用漏洞进行横向移动?”)自动映射为可在知识图谱上执行的逻辑推理路径。传统检索系统难以支持复杂推理和可解释性,而TITAN框架通过引入一个路径规划模型(path planner model)和图执行器(graph executor),实现了从自然语言到可执行推理链的端到端转换。其关键创新在于构建了一个基于MITRE数据源的类型化、双向知识图谱(typed, bidirectional graph),使推理过程在威胁(threat)、行为(behavior)与防御(defense)三者间具有明确方向性和可逆性,并通过TITAN Dataset提供大规模标注样本用于训练与评估,从而确保生成的推理路径既语法正确又语义一致,且能在底层图谱中确定性执行。
链接: https://arxiv.org/abs/2510.14670
作者: Marco Simoni,Aleksandar Fontana,Andrea Saracino,Paolo Mori
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注:
Abstract:TITAN (Threat Intelligence Through Automated Navigation) is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph. It integrates a path planner model, which predicts logical relation chains from text, and a graph executor that traverses the TITAN Ontology to retrieve factual answers and supporting evidence. Unlike traditional retrieval systems, TITAN operates on a typed, bidirectional graph derived from MITRE, allowing reasoning to move clearly and reversibly between threats, behaviors, and defenses. To support training and evaluation, we introduce the TITAN Dataset, a corpus of 88209 examples (Train: 74258; Test: 13951) pairing natural language questions with executable reasoning paths and step by step Chain of Thought explanations. Empirical evaluations show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph.
zh
[NLP-32] Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures EMNLP2025
【速读】: 该论文试图解决机器翻译模型在处理具有语义倾向性(semantic prosody)的语言结构时存在的不足问题,尤其是中文“被”字句(BEI passives)在翻译中未能准确反映其负面语义倾向的问题。解决方案的关键在于构建一个专门针对英语-中文句子对的数据集,明确展示“被”字句在表达不利内容时的语义倾向,并通过微调OPUS-MT、NLLB-600M和mBART50等主流机器翻译模型来强化这一语义特征的学习能力。实验表明,微调后的模型能更准确地在翻译不利内容时使用“被”字句,而在中性或正面内容中避免使用,且该知识可在多语言模型NLLB-600M中跨语言迁移至其他语对(如西班牙语-中文)。
链接: https://arxiv.org/abs/2510.14662
作者: Xinyue Ma,Pol Pastells,Mireia Farrús,Mariona Taulé
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 2 figures, *SEM workshop at EMNLP 2025 conference
Abstract:Semantic prosody is a collocational meaning formed through the co-occurrence of a linguistic unit and a consistent series of collocates, which should be treated separately from semantic meaning. Since words that are literal translations of each other may have different semantic prosody, more attention should be paid to this linguistic property to generate accurate translations. However, current machine translation models cannot handle this problem. To bridge the gap, we propose an approach to teach machine translation models about semantic prosody of a specific structure. We focus on Chinese BEI passives and create a dataset of English-Chinese sentence pairs with the purpose of demonstrating the negative semantic prosody of BEI passives. Then we fine-tune OPUS-MT, NLLB-600M and mBART50 models with our dataset for the English-Chinese translation task. Our results show that fine-tuned MT models perform better on using BEI passives for translating unfavourable content and avoid using it for neutral and favourable content. Also, in NLLB-600M, which is a multilingual model, this knowledge of semantic prosody can be transferred from English-Chinese translation to other language pairs, such as Spanish-Chinese.
zh
[NLP-33] An Efficient Rubric-based Generative Verifier for Search-Augmented LLM s
【速读】: 该论文旨在解决搜索增强型大语言模型(Search-Augmented Large Language Models, LLMs)在奖励建模方面存在的两大核心问题:一是规则类奖励(如Exact Match)对表达变化敏感且难以适用于长文本任务;二是生成式奖励虽具鲁棒性,但在动态语料库中设计可验证且稳定的奖励机制仍面临挑战,并伴随高昂计算成本。解决方案的关键在于提出“nugget-as-rubric”统一且可验证的范式,将原子信息点(nugget)作为结构化评价标准,针对短文本任务采用单个rubric,长文本任务则通过查询重写自动构建多个与信息需求对齐的rubric;并进一步设计了一个4B参数规模的高效生成式验证器Search-Gen-V,基于知识蒸馏和两阶段训练策略实现高精度验证,从而在多种工作负载下均表现出强鲁棒性和可扩展性。
链接: https://arxiv.org/abs/2510.14660
作者: Linyue Ma,Yilong Xu,Xiang Long,Zhi Zheng
机构: ModelBest Inc.; Institute of Computing Technology, Chinese Academy of Sciences
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, “nugget-as-rubric”, which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question’s information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce \textbfSearch-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
zh
[NLP-34] Intent Clustering with Shared Pseudo-Labels
【速读】: 该论文旨在解决当前意图聚类(intent clustering)方法中依赖昂贵且不透明的商业大语言模型(Large Language Models, LLMs)、需预先指定聚类数量以及缺乏训练自由度等问题。其解决方案的关键在于提出一种无需训练、无需标签的聚类方法:首先利用轻量级开源LLM为每条文本生成可解释的伪标签(pseudo-labels),随后基于这些伪标签构建多标签分类任务,通过嵌入空间中共享标签较多的文本具有更近距离的假设实现聚类。该策略避免了直接计算文本相似性,并在多个基准数据集上展现出与现有先进方法相当甚至更优的性能,同时具备低资源适应性和跨模型稳定性。
链接: https://arxiv.org/abs/2510.14640
作者: I-Fan Lin,Faegheh Hasibi,Suzan Verberne
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly, and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to and better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.
zh
[NLP-35] RLAIF-SPA: Optimizing LLM -based Emotional Speech Synthesis via RLAIF
【速读】: 该论文旨在解决文本到语音(Text-To-Speech, TTS)合成中情感表达不足的问题,现有方法通常依赖昂贵的情感标注或优化间接目标,导致生成的语音虽然语义准确但缺乏情感表现力和感知自然性。解决方案的关键在于提出RLAIF-SPA框架,其核心是引入基于人工智能反馈的强化学习(Reinforcement Learning from AI Feedback, RLAIF)机制,利用自动语音识别(ASR)和大语言模型(LLM)分别提供语义准确性反馈与韵律-情感标签对齐奖励,从而直接优化情感表达与可懂度;该框架通过四个细粒度维度——结构(Structure)、情感(Emotion)、速度(Speed)和音调(Tone)——联合建模韵律-情感对齐,并结合语义准确性反馈,显著提升了语音的情感丰富性和自然度,在LibriSpeech数据集上的实验表明其在词错误率(WER)和人类主观评价上均优于Chat-TTS。
链接: https://arxiv.org/abs/2510.14628
作者: Qing Yang,Zhenghao Liu,Junxin Wang,Yangfan Du,Pengcheng Huang,Tong Xiao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
zh
[NLP-36] ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
【速读】: 该论文旨在解决当前移动智能体(mobile agent)评估标准与实际复杂任务需求之间的不匹配问题。现有方法中,离线静态基准仅能验证单一预设的“黄金路径”,而在线动态测试受限于真实设备的复杂性和不可复现性,难以全面评估代理在多解场景下的能力。其解决方案的关键在于提出一种基于图结构的基准框架,通过建模真实设备交互中观察到的有限状态,实现对动态行为的静态模拟;在此基础上构建了ColorBench基准,支持多解评估、子任务完成率统计及原子级能力分析,从而在保持测试稳定性的同时逼近真实环境中的多样化执行路径。
链接: https://arxiv.org/abs/2510.14621
作者: Yuanyi Song,Heyuan Huang,Qiqiang Lin,Yin Zhao,Xiangmou Qu,Jun Wang,Xingyu Lou,Weiwen Liu,Zhuosheng Zhang,Jun Wang,Yong Yu,Weinan Zhang,Zhaoxiang Wang
机构: Shanghai Jiao Tong University (上海交通大学); OPPO
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined “golden path”, while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents’ performance on complex, long-horizon problems based on experimental results. Code and data are available at: this https URL.
zh
[NLP-37] Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在归纳推理(inductive reasoning)任务中面临的两大挑战:一是现有数据多聚焦于表层规律,缺乏复杂内在模式;二是当前方法仅通过简单提示或微调实现训练,未提供精确的思维过程且难以控制难度。其解决方案的关键在于提出一个名为 \textitCodeSeq 的合成后训练数据集,该数据集基于数列构造算法问题以发现通项公式(general term generation, GTG),并通过反思失败测试案例并引入迭代修正机制生成监督微调数据,使模型能够自主生成案例并进行自我校验;同时,采用基于问题通过率与自导向案例生成成功率的协同可解性缩放奖励(Case-Synergy Solvability Scaling Reward)进行强化学习,从而让模型从成功与失败中更高效地学习。实验表明,使用 \textitCodeSeq 训练的模型在多种推理任务上表现提升,并能保持对分布外(OOD)样本的良好性能。
链接: https://arxiv.org/abs/2510.14620
作者: Kedi Chen,Zhikai Lei,Xu Guo,Xuecheng Wu,Siyuan Zeng,Jianghao Yin,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Qipeng Guo,Kai Chen,Wei Zhang
机构: East China Normal University (华东师范大学); Fudan University (复旦大学); Shanghai Innovation Institute; Xi’an Jiaotong University (西安交通大学); Shanghai AI Laboratory
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) make remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but do not provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing \textitCodeSeq, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with \textitCodeSeq improve on various reasoning tasks and can preserve the models’ OOD performance.
zh
[NLP-38] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
【速读】: 该论文旨在解决当前偏好学习方法在缺乏客观质量信号(如事实准确性、语法正确性)时性能显著下降的问题,即现有基于强化学习的人类反馈(RLHF)方法主要依赖于可量化的目标错误检测,而非对主观质量特征(如创意、文风和情感共鸣)的有效建模。其解决方案的关键在于引入生成式奖励模型(Generative Reward Models, GRMs),该模型通过生成显式的推理链(reasoning chains)来评估文本质量,相较于传统序列分类架构的奖励模型(sequence-based reward models)和零样本语言模型判别器,在WritingPreferenceBench基准上实现了81.8%的准确率,显著优于前者(52.7%)和后者(53.9%),表明中间层的推理表示比直接分类更有利于捕捉人类偏好的复杂性。
链接: https://arxiv.org/abs/2510.14616
作者: Shuangshuang Ying,Yunwen Li,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Xeron Du,Tianyu Zheng,Yichi Zhang,Letian Ni,Yuyang Cheng,Qiguang Chen,Jingzhe Ding,Shengda Long,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Ge Zhang,Wenhao Huang,Wanxiang Che,Chenghua Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models–the standard architecture for RLHF–achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
zh
[NLP-39] Just-In-Time Objectives: A General Approach for Specialized AI Interactions
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在缺乏明确目标时生成内容趋于平庸、缺乏针对性的问题,例如撰写充斥陈词滥调的邮件。其解决方案的关键在于提出一种“即时目标诱导”(Just-In-Time Objectives, JIT)架构:通过被动观察用户行为自动推断其当前任务意图,并基于该单一目标快速优化下游AI系统的生成与评估过程。实验表明,该方法显著提升LLM输出质量(在N=14和N=205的实验中实现66–86%胜率优势),并能生成个性化工具,如基于人机交互(HCI)方法论批判草稿、预测同行反应或识别模糊术语,从而实现更贴合用户需求的智能响应。
链接: https://arxiv.org/abs/2510.14591
作者: Michelle S. Lam,Omar Shaikh,Hallie Xu,Alice Guo,Diyi Yang,Jeffrey Heer,James A. Landay,Michael S. Bernstein
机构: Stanford University (斯坦福大学); University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models promise a broad set of functions, but when not given a specific objective, they default to milquetoast results such as drafting emails littered with cliches. We demonstrate that inferring the user’s in-the-moment objective, then rapidly optimizing for that singular objective, enables LLMs to produce tools, interfaces, and responses that are more responsive and desired. We contribute an architecture for automatically inducing just-in-time objectives by passively observing user behavior, then steering downstream AI systems through generation and evaluation against this objective. Inducing just-in-time objectives (e.g., “Clarify the abstract’s research contribution”) enables automatic generation of tools, e.g., those that critique a draft based on relevant HCI methodologies, anticipate related researchers’ reactions, or surface ambiguous terminology. In a series of experiments (N=14, N=205) on participants’ own tasks, JIT objectives enable LLM outputs that achieve 66-86% win rates over typical LLMs, and in-person use sessions (N=17) confirm that JIT objectives produce specialized tools unique to each participant.
zh
[NLP-40] alking Points: Describing and Localizing Pixels
【速读】: 该论文旨在解决视觉-语言模型在跨模态理解中缺乏像素级关键点(keypoint)定位能力的问题,现有方法通常仅能实现对象级或区域级的语义对齐,难以通过自然语言实现精确到像素的关键点描述与定位。其解决方案的关键在于提出一个双向协同框架:一是点描述生成器(Point Descriptor),能够生成从粗到细、富含上下文信息的自由形式文本描述;二是点定位器(Point Localizer),可基于描述回归出精确的像素坐标。为训练此系统,作者构建了 LlamaPointInPart 数据集(20K+ 图像-关键点-描述三元组),并采用 GRPO 优化策略,在 AP-10K 上利用冻结的定位器作为奖励模型指导描述生成,从而提升定位准确性。该框架突破了传统模板化提示的限制,实现了语言驱动的像素级精准定位能力。
链接: https://arxiv.org/abs/2510.14583
作者: Matan Rusanovsky,Shimon Malnick,Shai Avidan
机构: Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close is the predicted point generated to the ground truth point. Experiments demonstrate superior performance compared to baseline models on this http URL bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at this https URL.
zh
[NLP-41] Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLM s
【速读】: 该论文旨在解决主权大语言模型(Sovereign Large Language Models, LLMs)在实际应用中缺乏系统性评估框架与数据集的问题,特别是如何量化其与用户社会文化背景的契合度以及技术安全性与鲁棒性。其解决方案的关键在于构建一个新的多维度数据集,并提出一套分析框架,用于提取和评估主权LLMs中的社会文化特征,同时结合技术指标进行综合测评。实验结果表明,尽管主权LLMs在低资源语言支持方面具有潜力,但其是否真正服务于目标用户群体仍存疑,且盲目信任其适配性可能导致对安全等关键质量属性的忽视。因此,论文强调需建立更全面、基于实证的评估体系以推动主权LLMs的高质量发展。
链接: https://arxiv.org/abs/2510.14565
作者: Kyubyung Chae,Gihoon Kim,Gyuseong Lee,Taesup Kim,Jaejin Lee,Heejin Kim
机构: Graduate School of Data Science, Dept. of Data Science, Seoul National University (首尔国立大学数据科学研究生院,数据科学系); College of Engineering, Dept. of Computer Science and Engineering, Seoul National University (首尔国立大学工程学院,计算机科学与工程系)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent trends in LLMs development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users’ socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.
zh
[NLP-42] Agent ic Entropy-Balanced Policy Optimization
【速读】: 该论文旨在解决当前基于代理的强化学习(Agentic Reinforcement Learning, Agentic RL)在训练过程中因过度依赖熵信号而导致的训练崩溃问题,尤其是在多轮、长周期工具调用任务中,熵驱动的探索策略可能引发过分支(over-branching)和梯度稀释,从而损害策略稳定性和性能。解决方案的关键在于提出一种代理熵平衡策略优化算法(Agentic Entropy-Balanced Policy Optimization, AEPO),其核心创新包括:(1)动态熵平衡采样机制,通过熵预监控自适应分配全局与分支采样预算,并对连续高熵工具调用步骤施加分支惩罚以抑制过分支;(2)熵平衡策略优化机制,引入停止梯度操作于高熵截断项中以保留并合理重缩放高熵token的梯度,同时结合熵感知的优势估计,优先提升高不确定性状态的学习效率。实验证明,AEPO在14个挑战性数据集上均显著优于7种主流RL算法,且仅需1K样本即可实现优异性能,同时提升了采样多样性并维持策略熵稳定,为可扩展网页代理训练提供了有效保障。
链接: https://arxiv.org/abs/2510.14545
作者: Guanting Dong,Licheng Bao,Zhongyuan Wang,Kangzhi Zhao,Xiaoxi Li,Jiajie Jin,Jinghan Yang,Hangyu Mao,Fuzheng Zhang,Kun Gai,Guorui Zhou,Yutao Zhu,Ji-Rong Wen,Zhicheng Dou
机构: Renmin University of China (中国人民大学); Kuaishou Technology (快手科技)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Working in progress
Abstract:Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to the training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocate global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity’s Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity’s Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
zh
[NLP-43] E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
【速读】: 该论文旨在解决端到端软件开发(End-to-End Software Development, E2ESD)中测试自动化与质量保障的挑战,特别是针对生成式 AI (Generative AI) 在复杂用户需求场景下的测试有效性不足问题。解决方案的关键在于构建了一个名为 E2EDev 的细粒度基准测试集,包含多个行为驱动开发(Behavior-Driven Development, BDD)测试场景及其对应的 Python 步骤实现,并基于 Behave 框架搭建了全自动测试流水线;同时,为降低标注成本并提升质量,引入了人机协同多智能体标注框架(Human-in-the-Loop Multi-Agent Annotation Framework, HITL-MAA),从而系统性评估不同 E2ESD 框架和大语言模型(Large Language Models, LLMs)的性能表现,揭示当前方法在处理真实复杂任务时仍存在显著局限,凸显了对更高效、低成本 E2ESD 解决方案的迫切需求。
链接: https://arxiv.org/abs/2510.14509
作者: Jingyao Liu,Chen Huang,Zhizhao Guan,Wenqiang Lei,Yang Deng
机构: Sichuan University (四川大学); Singapore Management University (新加坡管理大学); National University of Singapore (新加坡国立大学); Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, China (教育部机器学习与工业智能工程研究中心)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at this https URL.
zh
[NLP-44] Efficient Seq2seq Coreference Resolution Using Entity Representations
【速读】: 该论文旨在解决序列到序列(seq2seq)共指消解模型在增量设置(如对话场景)中效率低下的问题,这类场景要求文本必须逐句处理,而传统seq2seq方法因需完整输入序列导致计算冗余。解决方案的关键在于提出一种压缩表示机制:通过提取和重排实体级标记(entity-level tokens),并丢弃大部分非实体相关输入标记,从而显著降低计算负担。实验表明,在OntoNotes数据集上,该方法仅比全前缀增量基线低0.6 CoNLL F1分数,同时实现1.8的压缩比;在LitBank上甚至超越现有最优性能,验证了丢弃大量非必要token是可行且高效的增量共指消解策略。
链接: https://arxiv.org/abs/2510.14504
作者: Matt Grenander,Shay B. Cohen,Mark Steedman
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Seq2seq coreference models have introduced a new paradigm for coreference resolution by learning to generate text corresponding to coreference labels, without requiring task-specific parameters. While these models achieve new state-of-the-art performance, they do so at the cost of flexibility and efficiency. In particular, they do not efficiently handle incremental settings such as dialogue, where text must processed sequentially. We propose a compressed representation in order to improve the efficiency of these methods in incremental settings. Our method works by extracting and re-organizing entity-level tokens, and discarding the majority of other input tokens. On OntoNotes, our best model achieves just 0.6 CoNLL F1 points below a full-prefix, incremental baseline while achieving a compression ratio of 1.8. On LitBank, where singleton mentions are annotated, it passes state-of-the-art performance. Our results indicate that discarding a wide portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution.
zh
[NLP-45] LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models
【速读】: 该论文旨在解决低资源语言(如乌尔都语、泰语)在大型语言模型(LLMs)中性能显著低于高资源语言(如英语、中文)的问题,其核心挑战包括训练数据稀缺、机器翻译噪声以及跨语言对齐不稳定。解决方案的关键在于提出LiRA(Linguistic Robust Anchoring for Large Language Models)训练框架,其核心由两个模块构成:(i) Arca(Anchored Representation Composition Architecture),通过锚定机制将低资源语言映射到英文语义空间,并结合多智能体协作编码以保持共享嵌入空间中的几何稳定性;(ii) LaSR(Language-coupled Semantic Reasoner),在Arca输出的多语言表征基础上添加语言感知轻量推理头,并引入一致性正则化,统一优化目标以提升跨语言理解、检索与推理的鲁棒性。
链接: https://arxiv.org/abs/2510.14466
作者: Haolin Li,Haipeng Zhang,Mang Li,Yaohua Wang,Lijie Wen,Yu Zhang,Biqing Huang
机构: Tsinghua University (清华大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca’s multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.
zh
[NLP-46] Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在工具调用过程中因程序化JSON格式输出导致的任务干扰和格式约束问题,这些问题会显著降低工具调用的准确性并增加输出波动。解决方案的关键在于提出Natural Language Tools (NLT) 框架,通过将工具选择与响应生成解耦,使模型以自然语言形式输出工具调用信息,从而消除任务间的相互影响并摆脱对固定格式的依赖。这一设计在多个领域和模型上均展现出显著性能提升,尤其在开放权重模型中表现突出,且具备良好的鲁棒性和对无原生工具支持模型的扩展能力。
链接: https://arxiv.org/abs/2510.14453
作者: Reid T. Johnson,Michelle D. Pain,Jordan D. West
机构: 未知
类目: Computation and Language (cs.CL)
备注: 31 pages, 7 figures
Abstract:We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.
zh
[NLP-47] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents
【速读】: 该论文旨在解决当前开源深度研究型网络代理(deep research web agents)在信息聚合能力上的显著不足问题,即现有方法主要聚焦于提升信息检索效率,却忽视了从多源异构数据中进行严谨的知识整合与推理以支持深入研究的核心需求。其解决方案的关键在于提出“Explore to Evolve”范式:首先通过主动在线探索从真实网页获取可验证证据,随后基于这些证据自演化出一个结构化的聚合程序(由12种高阶逻辑操作组成),从而生成可验证的问答对(QA pair)。这一机制实现了从高层指导到具体操作的可扩展转化,支撑构建了包含10K样本的WebAggregatorQA数据集,并在此基础上训练出性能卓越的WebAggregator系列基础模型,显著提升了网络代理的信息聚合能力。
链接: https://arxiv.org/abs/2510.14438
作者: Rui Wang,Ce Zhang,Jun-Yu Ma,Jianshu Zhang,Hongru Wang,Yi Chen,Boyang Xue,Tianqing Fang,Zhisong Zhang,Hongming Zhang,Haitao Mi,Dong Yu,Kam-Fai Wong
机构: The Chinese University of Hong Kong (香港中文大学); Tencent AI Lab (腾讯人工智能实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which would limit their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Begins with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents’ information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.
zh
[NLP-48] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
【速读】: 该论文旨在解决语言模型在遵循多约束指令(multi-constraint instructions)时表现不佳的问题,此类指令在真实应用场景中至关重要。现有强化学习(Reinforcement Learning, RL)方法依赖外部监督信号且在多约束任务中面临奖励稀疏的挑战。其解决方案的关键在于提出一种无标签的自监督强化学习框架,通过直接从指令中提取奖励信号并生成伪标签用于奖励模型训练,从而消除对外部监督的依赖;同时引入约束分解策略与高效的约束级二分类机制,在缓解奖励稀疏问题的同时保持计算效率。
链接: https://arxiv.org/abs/2510.14420
作者: Qingyu Ren,Qianyu He,Bowei Zhang,Jie Zeng,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu
机构: Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University (复旦大学计算机科学与人工智能学院); School of Data Science, Fudan University (复旦大学数据科学学院); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at this https URL
zh
[NLP-49] IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂推理与规划任务中表现不足的问题,尤其是在单一模型模式下推理能力有限、多智能体系统(Multi-Agent Systems, MAS)虽具潜力但存在高计算成本和训练困难等瓶颈。其解决方案的关键在于提出一种名为IMAGINE(Integrating Multi-Agent System into One Model)的通用且可扩展框架,该框架通过端到端训练将MAS的结构化推理与规划能力高效整合进单个紧凑模型中,从而在保持小模型规模的同时显著超越原生MAS性能——实验表明,基于Qwen3-8B-Instruct构建的IMAGINE模型在TravelPlanner基准上达到82.7%的最终通过率,远超DeepSeek-R1-671B的40%。
链接: https://arxiv.org/abs/2510.14406
作者: Xikai Zhang,Bo Wang,Likang Xiao,Yongzhi Li,Quan Chen,Wenju Wu,Liu Liu
机构: Hangzhou International Innovation Institute (杭州国际创新研究院); School of Artificial Intelligence (人工智能学院); Beihang University (北京航空航天大学); Kuaishou Technology (快手科技)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in the thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B, only achieve Final Pass Rates of 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpass the capabilities of the MAS through a simple end-to-end training. Through this pipeline, a single small-scale model is not only able to acquire the structured reasoning and planning capabilities of a well-organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3-8B-Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.
zh
[NLP-50] MedTrust-RAG : Evidence Verification and Trust Alignment for Biomedical Question Answering
【速读】: 该论文旨在解决生物医学问答(Biomedical Question Answering, Biomedical QA)中基于检索增强生成(Retrieval-Augmented Generation, RAG)系统因后检索噪声和证据验证不足导致的幻觉问题,从而影响回答的可靠性。解决方案的关键在于提出一种名为MedTrust-Guided Iterative RAG的框架,其核心创新包括:1)通过结构化负知识断言(Negative Knowledge Assertions)实现引文感知推理,确保所有生成内容均明确基于检索到的医学文献;2)引入迭代检索-验证机制,由验证代理执行医学知识缺口分析(Medical Gap Analysis),动态优化查询直至获得可靠信息;3)集成MedTrust-Align模块(MTAM),结合经验证的正样本与幻觉感知的负样本,利用直接偏好优化(Direct Preference Optimization)强化引文支撑的推理并惩罚易产生幻觉的响应模式。该方法在多个基准测试中显著提升了准确率,验证了其在提升医学问答事实一致性方面的有效性。
链接: https://arxiv.org/abs/2510.14400
作者: Yingpeng Ning,Yuanyuan Sun,Ling Luo,Yanhua Wang,Yuchen Pan,Hongfei Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.
zh
[NLP-51] Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成个性化文本时缺乏真实个体沟通风格的问题,例如邮件或社交媒体消息的回复未能体现用户的独特表达习惯。传统方法受限于用户社交网络(SNS)或邮件历史数据的隐私问题而难以获取高质量个性化语料。为此,作者提出“你的下一个词预测”(Your Next Token Prediction, YNTP)任务,其核心解决方案是通过受控的人机对话机制构建个性化语料:让用户与基于MBTI人格维度设计的心理学驱动非玩家角色(NPC)进行为期五天的多轮交互,从而捕捉自然、日常的交流模式,并在此基础上评估提示工程(prompt-based)与微调(fine-tuning)两种个性化方法的有效性。该研究建立了首个多语言(英语、日语、汉语)YNTP基准数据集,为实现用户对齐的语言建模提供了基础。
链接: https://arxiv.org/abs/2510.14398
作者: Shiyao Ding,Takayuki Ito
机构: Kyoto University (京都大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style. However, real SNS or email histories are difficult to collect due to privacy concerns. To address this, we propose the task of “Your Next Token Prediction (YNTP)”, which models a user’s precise word choices through controlled human-agent conversations. We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. This setup captures natural, daily-life communication patterns and enables analysis of users’ internal models. We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling. The dataset is available at: this https URL
zh
[NLP-52] Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis
【速读】: 该论文旨在解决现有研究中对用户社交媒体长期互动行为(如评论树结构)在预测自杀风险演化方面关注不足的问题。传统方法多聚焦于单条帖子的文本分析,而忽视了用户随时间积累的言论与互动信息。解决方案的关键在于构建一个基于Columbia Suicide Severity Rating Scale (C-SSRS) 的四标签标注数据集,并引入评论树(comment trees)结构作为关键特征,利用大语言模型(Large Language Models, LLMs)进行建模,实验证明整合评论树信息能显著提升对用户自杀风险等级的区分度与预测准确性,从而为早期干预提供更可靠的识别基础。
链接: https://arxiv.org/abs/2510.14395
作者: Jun Li,Qun Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Suicide remains a critical global public health issue. While previous studies have provided valuable insights into detecting suicidal expressions in individual social media posts, limited attention has been paid to the analysis of longitudinal, sequential comment trees for predicting a user’s evolving suicidal risk. Users, however, often reveal their intentions through historical posts and interactive comments over time. This study addresses this gap by investigating how the information in comment trees affects both the discrimination and prediction of users’ suicidal risk levels. We constructed a high-quality annotated dataset, sourced from Reddit, which incorporates users’ posting history and comments, using a refined four-label annotation framework based on the Columbia Suicide Severity Rating Scale (C-SSRS). Statistical analysis of the dataset, along with experimental results from Large Language Models (LLMs) experiments, demonstrates that incorporating comment trees data significantly enhances the discrimination and prediction of user suicidal risk levels. This research offers a novel insight to enhancing the detection accuracy of at-risk individuals, thereby providing a valuable foundation for early suicide intervention strategies.
zh
[NLP-53] Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM -based Optimizers
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在提示优化(prompt optimization)过程中面临的中毒攻击(poisoning risks)问题,尤其是针对反馈驱动的优化机制中潜在的安全漏洞。研究表明,与注入查询相比,操纵反馈信号对系统的影响更为显著,攻击成功率(Attack Success Rate, ASR)可提升高达 ΔASR = 0.48。解决方案的关键在于提出一种无需访问奖励模型的“假奖励攻击”(fake-reward attack),揭示了现有优化流程的脆弱性,并进一步设计了一种轻量级的“高亮防御”(highlighting defense)机制,能够在不损害任务性能的前提下将假奖励攻击导致的ASR从0.23降至0.07,从而将提示优化管道确立为首个需重点防护的攻击面。
链接: https://arxiv.org/abs/2510.14381
作者: Andrew Zhao,Reshmi Ghosh,Vitor Carvalho,Emily Lawton,Keegan Hines,Gao Huang,Jack W. Stokes
机构: Tsinghua University (清华大学); Microsoft (微软)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to \Delta ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward \Delta ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
zh
[NLP-54] PluriHop: Exhaustive Recall-Sensitive QA over Distractor-Rich Corpora
【速读】: 该论文旨在解决在重复性高、干扰文档密集的文档集合中进行问答(QA)时,现有检索增强生成(Retrieval-Augmented Generation, RAG)方法表现不佳的问题。这类场景常见于医疗记录、合规报告和维护日志等结构化但非结构化的持续更新数据中,其特点是需要跨全部文档进行信息聚合,且对遗漏任一相关文档极为敏感,作者称之为“多段落跳跃”(pluri-hop)问题。解决方案的关键在于提出一种名为PluriHopRAG的新架构,其核心策略是“逐个检查所有文档,低成本过滤”,具体包括:(i) 将查询分解为文档级子问题以实现细粒度处理;(ii) 使用交叉编码器(cross-encoder)作为轻量级过滤器,在昂贵的大语言模型(LLM)推理前快速排除无关文档。实验表明,该方法在F1分数上相较基线提升18–52%,凸显了全面检索与早期过滤相结合的有效性。
链接: https://arxiv.org/abs/2510.14377
作者: Mykolas Sveistrys,Richard Kunert
机构: Turbit Systems GmbH
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a “check all documents individually, filter cheaply” approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG’s performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
zh
[NLP-55] From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program
【速读】: 该论文旨在解决美国国家气象局(NWS)在服务非英语使用者(占全美人口约6880万)时面临的语言障碍问题,以推进“天气准备国家”(Weather-Ready Nation)目标的实现。解决方案的关键在于开发一个基于人工智能的自动化翻译工具,利用大语言模型(LLMs)与神经机器翻译(NMT)技术,结合LILT公司特有的训练流程,使翻译系统能够精准适配气象术语和风险沟通语境,并支持多语言(如西班牙语、简体中文、越南语等)的规模化部署。该方案通过地理信息系统(GIS)映射识别区域语言需求,优化资源配置,同时嵌入伦理AI实践,确保翻译过程的透明性、公平性和人工监督,从而显著减少人工翻译负担,提升预警信息的可及性与文化相关性。
链接: https://arxiv.org/abs/2510.14369
作者: Joseph E. Trujillo-Falcon,Monica L. Bozeman,Liam E. Llewellyn,Samuel T. Halvorson,Meryl Mizell,Stuti Deshpande,Bob Manning,Todd Fagin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program’s design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated into a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.
zh
[NLP-56] On the Ability of LLM s to Handle Character-Level Perturbations: How Well and How?
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对频繁且结构化的字符级扰动时的鲁棒性问题,特别是针对在线考试系统等场景中可能被滥用的风险。其核心解决方案是提出一种名为 \nameshort 的实用方法,通过在输入文本中插入不可见的 Unicode 控制字符来干扰模型的正常处理流程,从而抑制 LLM 的不当使用。该方法的关键在于利用字符级扰动显著破坏 tokenization 分割并降低信噪比,同时揭示出即便在这种强干扰下,多数 LLM 仍能保持较高性能,暗示了其底层对噪声具有隐式或显式去噪机制的鲁棒性。
链接: https://arxiv.org/abs/2510.14365
作者: Anyun Zhuo,Xuefei Ning,Ningyuan Li,Yu Wang,Pinyan Lu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce \nameshort, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and \textitimplicit versus \textitexplicit denoising mechanism hypotheses of character-level noises. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
zh
[NLP-57] AI for Service: Proactive Assistance with AI Glasses
【速读】: 该论文旨在解决当前人工智能服务普遍存在的“被动响应”问题,即现有AI系统仅在接收到用户明确指令后才作出反应,难以实现对用户需求的主动预测与适时干预。为实现更具智能性和实用性的服务体验,论文提出了一种名为Alpha-Service的统一框架,其核心在于同时解决两个关键挑战:一是“何时介入”(Know When to intervene),通过分析第一人称视角视频流识别服务机会;二是“如何提供服务”(Know How to provide both generalized and personalized services),兼顾通用任务处理能力与个性化适应性。该方案受冯·诺依曼计算机架构启发,基于AI眼镜部署多智能体系统,包含输入单元、中央处理单元、算术逻辑单元、记忆单元和输出单元五大模块,实现了从环境感知、意图推理到自然交互的闭环服务流程。
链接: https://arxiv.org/abs/2510.14359
作者: Zichen Wen,Yiyu Wang,Chenfei Liao,Boxue Yang,Junxian Li,Weifeng Liu,Haocong He,Bolong Feng,Xuyang Liu,Yuanhuiyi Lyu,Xu Zheng,Xuming Hu,Linfeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 5 figures, work in progress
Abstract:In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.
zh
[NLP-58] CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering
【速读】: 该论文旨在解决高绩效医疗大语言模型(Large Language Models, LLMs)通常需要大量计算资源进行微调(fine-tuning),从而限制了资源受限医疗机构应用的问题。其解决方案的关键在于提出了一种基于置信度驱动的多模型协作框架,通过两个阶段实现:首先由置信度检测模块评估主模型对答案的确定性,随后利用自适应路由机制将低置信度问题分配给具备互补知识的辅助模型进行协同推理,从而在不进行微调的前提下提升医学问答性能。实验证明,该方法在MedQA、MedMCQA和PubMedQA三个基准上均取得竞争力结果,尤其在PubMedQA(95.0%)和MedMCQA(78.0%)上表现突出,且消融实验表明置信度感知路由与多模型协作显著优于单一模型或均匀推理策略。
链接: https://arxiv.org/abs/2510.14353
作者: Ziad Elshaer,Essam A. Rashed
机构: Nile University (尼勒大学); University of Hyogo (兵库县立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
备注:
Abstract:High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model’s certainty, and an adaptive routing mechanism directs low-confidence queries to Helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks; MedQA, MedMCQA, and PubMedQA. Result demonstrate that our framework achieves competitive performance, with particularly strong results in PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
zh
[NLP-59] Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在扮演跨版本角色时,难以忠实且一致地再现特定宇宙中角色特征的问题,尤其关注超级英雄角色在漫画与电影宇宙中的多版本表现。其解决方案的关键在于提出一个名为“Beyond One World”的基准测试,涵盖30位标志性超级英雄及其90个不同正史版本,包含两个任务:(i) 正史事件(Canon Events),评估对关键人生阶段的事实记忆;(ii) 道德困境(Moral Dilemmas),考察伦理情境下的决策合理性。此外,研究引入“思考-行动匹配度”(Think-Act Matching)指标,量化内部推理(thinking)与外部行为(acting)的一致性,作为衡量模型可信度的代理指标,从而系统性揭示当前LLMs在多宇宙一致性与推理对齐方面的显著短板。
链接: https://arxiv.org/abs/2510.14351
作者: Perapard Ngokpol,Kun Kerdthaisong,Pasin Buakhaw,Pitikorn Khlaisamniang,Supasate Vorathammathorn,Piyalitt Ittichaiwong,Nutchanon Yongsatianchot
机构: Thammasat School of Engineering, Thammasat University (诗纳卡宁威洛大学工程学院); Department of Computer Engineering and Digital Technology, Faculty of Engineering, Chulalongkorn University (朱拉隆功大学工程学院计算机工程与数字技术系); Artificial Intelligence Association of Thailand (泰国人工智能协会); School of Biomedical Engineering & Imaging Sciences, King’s College London (伦敦国王学院生物医学工程与成像科学学院); Siriraj Informatics and Data Innovation Center (SIData+), Faculty of Medicine, Siriraj Hospital, Mahidol University (西里拉吉医院信息与数据创新中心,玛希敦大学医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters – for example, superheroes across comic and cinematic universes – remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation (“thinking”) from outward decisions (“acting”). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.
zh
[NLP-60] A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimers Disease
【速读】: 该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期检测的难题,通过自然语言处理(Natural Language Processing, NLP)技术识别语言能力变化这一早期标志,从而实现高准确率的AD筛查。其解决方案的关键在于构建一种基于Doc2Vec与ELMo融合的混合词嵌入(hybrid word embedding)方法,并引入语言学特征以增强语法和语义表征;随后将嵌入后的特征向量输入逻辑回归模型,并对整个机器学习管道中的超参数(如正则化参数、学习率、Doc2Vec与ELMo的向量维度等)进行精细调优,最终在区分早期AD患者与健康对照组的任务中达到91%的分类准确率和97%的受试者工作特征曲线下面积(Area Under the Curve, AUC),显著优于现有最佳NLP模型(88%准确率),且模型在多次随机数据分割实验中表现出良好的稳定性(准确率标准差为0.0403,AUC标准差为0.0174)。
链接: https://arxiv.org/abs/2510.14332
作者: Yangyang Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Peer-reviewed and published in Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2020). 7 pages, 5 figures
Abstract:Early detection of Alzheimer’s Disease (AD) is greatly beneficial to AD patients, leading to early treatments that lessen symptoms and alleviating financial burden of health care. As one of the leading signs of AD, language capability changes can be used for early diagnosis of AD. In this paper, I develop a robust classification method using hybrid word embedding and fine-tuned hyperparameters to achieve state-of-the-art accuracy in the early detection of AD. Specifically, we create a hybrid word embedding based on word vectors from Doc2Vec and ELMo to obtain perplexity scores of the sentences. The scores identify whether a sentence is fluent or not and capture semantic context of the sentences. I enrich the word embedding by adding linguistic features to analyze syntax and semantics. Further, we input an embedded feature vector into logistic regression and fine tune hyperparameters throughout the pipeline. By tuning hyperparameters of the machine learning pipeline (e.g., model regularization parameter, learning rate and vector size of Doc2Vec, and vector size of ELMo), I achieve 91% classification accuracy and an Area Under the Curve (AUC) of 97% in distinguishing early AD from healthy subjects. Based on my knowledge, my model with 91% accuracy and 97% AUC outperforms the best existing NLP model for AD diagnosis with an accuracy of 88% [32]. I study the model stability through repeated experiments and find that the model is stable even though the training data is split randomly (standard deviation of accuracy = 0.0403; standard deviation of AUC = 0.0174). This affirms our proposed method is accurate and stable. This model can be used as a large-scale screening method for AD, as well as a complementary examination for doctors to detect AD.
zh
[NLP-61] Evaluating Reducing Deceptive Dialogue From Language Models with Multi-turn RL
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在对话中可能产生的欺骗行为所带来的安全风险问题,尤其是其在无意或有意情况下生成误导性、虚假或操纵性内容的不可预测性。为量化此类行为,作者提出了一种新的“信念错位度量”(belief misalignment metric),该指标相较于现有五种检测欺骗的度量方法更能与人类判断保持一致。关键解决方案在于引入一种多轮强化学习(multi-turn reinforcement learning)微调策略,通过建模对话历史中的行为演化来识别并抑制欺骗意图,最终使LLMs的欺骗行为相比其他指令微调模型减少了77.6%。这一方法突破了传统单轮输出分析的局限,实现了对复杂交互中欺骗行为的有效评估与缓解。
链接: https://arxiv.org/abs/2510.14318
作者: Marwa Abdulhai,Ryan Cheng,Aryansh Shrivastava,Natasha Jaques,Yarin Gal,Sergey Levine
机构: UC Berkeley (加州大学伯克利分校); University of Oxford (牛津大学); University of Washington (华盛顿大学); UK AI Security Institute; Google DeepMind (谷歌深度智学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.
zh
[NLP-62] rrarium: Revisiting the Blackboard for Multi-Agent Safety Privacy and Security Studies
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体系统(Multi-Agent System, MAS)在自动化用户任务(如会议调度)过程中面临的安全、隐私与可信性问题。这些问题源于LLM驱动的复杂协作协议对非结构化私有数据、用户约束和偏好的依赖,同时引入了诸如对齐偏差、恶意代理、通信通道被攻破以及数据投毒等新型攻击风险。解决方案的关键在于提出Terrarium框架——通过重构早期多智能体系统中的黑板(blackboard)设计,构建一个模块化、可配置的测试平台,用于细粒度地研究和评估MAS中的安全机制;该框架支持快速原型设计、防御策略迭代与多场景验证,从而推动可信多智能体系统的研发进程。
链接: https://arxiv.org/abs/2510.14312
作者: Mason Nakamura,Abhinav Kumar,Saaduddin Mahmud,Sahar Abdelnabi,Shlomo Zilberstein,Eugene Bagdasarian
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); ELLIS Institute Tübingen, MPI for Intelligent Systems, Tübingen AI Center (图宾根ELLIS研究所,马克斯·普朗克智能系统研究所,图宾根人工智能中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:A multi-agent system (MAS) powered by large language models (LLMs) can automate tedious user tasks such as meeting scheduling that requires inter-agent collaboration. LLMs enable nuanced protocols that account for unstructured private data, user constraints, and preferences. However, this design introduces new risks, including misalignment and attacks by malicious parties that compromise agents or steal user data. In this paper, we propose the Terrarium framework for fine-grained study on safety, privacy, and security in LLM-based MAS. We repurpose the blackboard design, an early approach in multi-agent systems, to create a modular, configurable testbed for multi-agent collaboration. We identify key attack vectors such as misalignment, malicious agents, compromised communication, and data poisoning. We implement three collaborative MAS scenarios with four representative attacks to demonstrate the framework’s flexibility. By providing tools to rapidly prototype, evaluate, and iterate on defenses and designs, Terrarium aims to accelerate progress toward trustworthy multi-agent systems.
zh
[NLP-63] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking
【速读】: 该论文旨在解决多语言多模态实体链接(Multilingual Multimodal Entity Linking)任务中的挑战,即如何在不同语言的文本中准确地将命名实体与知识库(如Wikidata)中的唯一实体进行关联,尤其是在文本语境模糊或信息不足的情况下。其解决方案的关键在于构建了一个名为MERLIN的新测试平台,该平台包含五种语言(印地语、日语、印尼语、越南语和泰米尔语)的BBC新闻标题及其配图,并标注了超过7000个命名实体提及与2500个唯一Wikidata实体的对应关系。实验表明,引入视觉数据可显著提升实体链接准确性,尤其对缺乏强大多语言能力的语言模型效果更为明显。
链接: https://arxiv.org/abs/2510.14307
作者: Sathyanarayanan Ramamoorthy,Vishwa Shah,Simran Khanuja,Zaid Sheikh,Shan Jie,Ann Chia,Shearman Chua,Graham Neubig
机构: Carnegie Mellon University (卡内基梅隆大学); Defence Science and Technology Agency, Singapore (新加坡国防科技局)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. For the work, the dataset, methods are available here at this https URL
zh
[NLP-64] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多语言数学推理能力上的评估空白问题,特别是现有基准测试主要集中在英语或高资源语言,缺乏对中低资源语言场景下数学推理性能的系统性评估。解决方案的关键在于构建了一个名为MathMist的平行多语言数学推理基准,包含超过21,000组跨七种语言的对齐题-答数据集,覆盖高、中、低资源语言环境,并涵盖多种问题类型与解题逻辑。通过在零样本(zero-shot)、思维链(chain-of-thought, CoT)和代码切换(code-switched)推理范式下对多种开源及商用模型进行系统评测,揭示了LLMs在跨语言一致性与可解释性数学推理方面的显著不足,尤其在低资源语言中表现明显退化。
链接: https://arxiv.org/abs/2510.14305
作者: Mahbub E Sobhani,Md. Faiyaz Abdullah Sayeedi,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda
机构: BRAC University (布拉克大学); United International University (联合国际大学); Qatar Computing Research Institute (卡塔尔计算研究研究所); Amazon GenAI (亚马逊生成式人工智能); University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs’ ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: this https URL
zh
[NLP-65] Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers
【速读】: 该论文旨在解决学术论文分析中关键概念关系挖掘不足的问题,现有数据库多局限于相似性匹配与基础分类,难以深入揭示概念间的网络结构。其解决方案的关键在于提出一种基于提示工程(prompt engineering)的关键概念路径分析方法:利用小语言模型实现精准的关键概念提取与创新点识别,并构建基于知识图谱约束机制的智能代理(agent),从而提升分析精度。通过在Qwen和DeepSeek模型上的微调,显著提升了准确率,相关模型已开源发布于Hugging Face平台。
链接: https://arxiv.org/abs/2510.14303
作者: Ziye Xia,Sergei S. Ospichev
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 10 figures
Abstract:In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to timely and comprehensively track the latest research findings and methodologies. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex opensource knowledge graph. By analyzing nearly 8,000 open-source paper data from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.
zh
[NLP-66] Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL
【速读】: 该论文旨在解决Text-to-SQL系统中schema linking(模式链接)这一关键但研究不足的环节问题,即如何准确地将自然语言问题与数据库模式元素(如表和列)对齐。现有方法多聚焦于SQL生成优化,忽视了相关schema元素的检索,易导致生成错误或执行失败。其解决方案的关键在于提出一种上下文感知的双向schema检索框架,将schema linking视为独立任务,并融合两种互补策略:先进行表级检索再选择列,以及先进行列级检索再选择表;同时引入问题分解、关键词提取和关键短语提取等增强技术,从而显著提升schema召回率并降低误报率,有效缩小完整schema与理想schema设置之间的性能差距。
链接: https://arxiv.org/abs/2510.14296
作者: Md Mahadi Hasan Nahid,Davood Rafiei,Weiwei Zhang,Yong Zhang
机构: University of Alberta (阿尔伯塔大学); Huawei Technologies Canada Co., Ltd. (华为技术加拿大有限公司)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 30 Pages
Abstract:Schema linking – the process of aligning natural language questions with database schema elements – is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.
zh
[NLP-67] PRISM: Agent ic Retrieval with LLM s for Multi-Hop Question Answering
【速读】: 该论文旨在解决多跳问答(Multi-hop Question Answering, MQA)中证据检索的准确性与完整性难题,即如何在复杂问题解答过程中高效获取多个相关证据片段,同时避免冗余或干扰信息。解决方案的关键在于提出一种基于代理(Agent)的结构化检索系统,通过三个专用代理协同工作:Question Analyzer将多跳问题分解为子问题,Selector聚焦于高精度识别每个子问题的相关上下文,Adder则负责补充可能遗漏的信息以提升召回率。Selector与Adder之间的迭代交互机制能够在保持检索结果紧凑的同时覆盖全面证据,显著优于传统方法,使下游QA模型在使用更少无关信息的前提下实现更高的答案准确率。
链接: https://arxiv.org/abs/2510.14278
作者: Md Mahadi Hasan Nahid,Davood Rafiei
机构: University of Alberta (阿尔伯塔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 18 pages
Abstract:Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks – HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG – demonstrates that our approach consistently outperforms strong baselines.
zh
[NLP-68] Qwen 3Guard Technical Report
【速读】: 该论文旨在解决当前安全防护模型(guardrail models)在实际应用中面临的两大核心问题:一是现有模型仅提供二元标签(safe/unsafe),难以适应不同领域对安全性的差异化容忍度;二是其需等待完整输出后才能进行安全检测,无法支持流式推理(streaming LLM inference),导致有害部分输出可能被提前暴露。解决方案的关键在于提出Qwen3Guard系列多语言安全防护模型,包含两个专用变体:生成式Qwen3Guard将安全分类建模为指令遵循任务,实现细粒度的三类判断(safe、controversial、unsafe);流式Qwen3Guard引入token级分类头,支持增量文本生成过程中的实时安全监控,从而实现低延迟、高精度的安全干预。
链接: https://arxiv.org/abs/2510.14276
作者: Haiquan Zhao,Chenhan Yuan,Fei Huang,Xiaomeng Hu,Yichang Zhang,An Yang,Bowen Yu,Dayiheng Liu,Jingren Zhou,Junyang Lin,Baosong Yang,Chen Cheng,Jialong Tang,Jiandong Jiang,Jianwei Zhang,Jijie Xu,Ming Yan,Minmin Sun,Pei Zhang,Pengjun Xie,Qiaoyu Tang,Qin Zhu,Rong Zhang,Shibin Wu,Shuo Zhang,Tao He,Tianyi Tang,Tingyu Xia,Wei Liao,Weizhou Shen,Wenbiao Yin,Wenmeng Zhou,Wenyuan Yu,Xiaobin Wang,Xiaodong Deng,Xiaodong Xu,Xinyu Zhang,Yang Liu,Yeqiu Li,Yi Zhang,Yong Jiang,Yu Wan,Yuxin Zhou
机构: Qwen Team(通义千问团队)
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary “safe/unsafe” labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.
zh
[NLP-69] Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
【速读】: 该论文旨在解决小规模多语言嵌入模型(约10亿参数)在检索任务中性能显著落后于大规模模型(如70亿参数)的问题,尤其是在实际应用中最常见的检索场景下。其核心解决方案在于通过针对性优化训练策略来提升小模型的检索能力:关键因素包括采用硬负样本(hard negatives)进行负采样以显著提高检索精度、利用任务多样性(task diversity)而非单纯依赖语言多样性来增强训练数据的泛化能力,以及合理控制训练数据规模以避免边际收益递减。最终,研究团队构建了一个约3亿参数的紧凑型多语言模型,在检索性能上达到甚至超越当前主流70亿参数模型的水平。
链接: https://arxiv.org/abs/2510.14274
作者: Lifu Tu,Yingbo Zhou,Semih Yavuz
机构: Salesforce AI Research (Salesforce人工智能研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (1 B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (1 B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau - indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.
zh
[NLP-70] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)自动生成知识图谱(Knowledge Graphs, KGs)中存在的噪声问题,即生成的KG常包含冗余实体和不可靠关系,导致检索与生成性能下降并增加计算开销。其解决方案的关键在于提出DEG-RAG框架,通过两个核心机制实现:(1) 实体消歧(entity resolution),消除冗余实体;(2) 三元组反思(triple reflection),移除错误关系。这两个步骤共同构建出更紧凑、高质量的知识图谱,显著优于未经处理的原始KG,在多种主流Graph-based RAG变体中均展现出一致的问答性能提升。
链接: https://arxiv.org/abs/2510.14271
作者: Yilun Zheng,Dan Yang,Jie Li,Lin Shang,Lihui Chen,Jiahao Xu,Sitao Luan
机构: Nanyang Technological University (南洋理工大学); Nanjing University (南京大学); Mila, Quebec AI Institute (蒙特利尔人工智能研究所); University of Montreal (蒙特利尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.
zh
[NLP-71] CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)内部机制不透明、难以解释的问题,即模型被视为“黑箱”,其各层功能和动态行为缺乏系统性理解。为应对这一挑战,作者提出了一种无探针(probe-free)的框架CAST(Compositional Analysis via Spectral Tracking),其核心创新在于通过Moore-Penrose伪逆估计每层的实际变换矩阵,并结合六种可解释的谱分析指标对Transformer各层进行深入剖析。该方法揭示了编码器-only与解码器-only模型在层行为上的本质差异——前者保持高秩处理,后者呈现压缩-扩展循环;同时,核分析(Kernel analysis)进一步识别出层间功能关系模式,利用CKA相似性矩阵将层划分为特征提取、压缩和专业化三个阶段,从而为理解模型内部运作提供了新的定量视角。
链接: https://arxiv.org/abs/2510.14262
作者: Zihao Fu,Ming Liao,Chris Russell,Zhenguang G. Cai
机构: The Chinese University of Hong Kong (香港中文大学); Hong Kong Polytechnic University (香港理工大学); University of Oxford (牛津大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models have achieved remarkable success but remain largely black boxes with poorly understood internal mechanisms. To address this limitation, many researchers have proposed various interpretability methods including mechanistic analysis, probing classifiers, and activation visualization, each providing valuable insights from different perspectives. Building upon this rich landscape of complementary approaches, we introduce CAST (Compositional Analysis via Spectral Tracking), a probe-free framework that contributes a novel perspective by analyzing transformer layer functions through direct transformation matrix estimation and comprehensive spectral analysis. CAST offers complementary insights to existing methods by estimating the realized transformation matrices for each layer using Moore-Penrose pseudoinverse and applying spectral analysis with six interpretable metrics characterizing layer behavior. Our analysis reveals distinct behaviors between encoder-only and decoder-only models, with decoder models exhibiting compression-expansion cycles while encoder models maintain consistent high-rank processing. Kernel analysis further demonstrates functional relationship patterns between layers, with CKA similarity matrices clearly partitioning layers into three phases: feature extraction, compression, and specialization.
zh
[NLP-72] Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior
【速读】: 该论文旨在解决语言模型(Language Model, LM)行为与其训练数据之间关系的因果机制问题,即如何通过干预训练数据来系统性地验证数据对模型行为的影响。其解决方案的关键在于提出了一套可操作的实验流程(experimental recipe),将干预过程分解为多个阶段:首先从基准测试中选取评估项以衡量模型行为,接着匹配与这些评估项相关的训练文档,随后对文档进行修改(即“重写历史”),再基于修改后的数据重新训练模型,并量化行为变化。该方法不仅验证了共现统计和信息检索方法在识别相关训练文档中的作用,还揭示了现有方法不足以完全解释模型在知识问答任务中的表现,从而为未来研究提供了可复现、可扩展的实证框架。
链接: https://arxiv.org/abs/2510.14261
作者: Rahul Nadkarni,Yanai Elazar,Hila Gonen,Noah A. Smith
机构: University of Washington (华盛顿大学); Bar Ilan University (巴伊兰大学); University of British Columbia (不列颠哥伦比亚大学); Allen Institute for Artificial Intelligence (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches – i.e., ``rewriting history’’ – and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM’s ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.
zh
[NLP-73] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)范式中因被动文本分块而导致的知识内化深度不足与推理能力受限的问题。其核心解决方案是提出一种场景感知文档记忆混合框架(Mixtures of scenario-aware document Memories, MoM),关键在于将文本处理从被动切片转变为以主动理解为导向的文档记忆提取过程,模拟人类阅读时的认知机制。MoM首先通过大语言模型(LLMs)生成结构化的逻辑大纲以指导高质量内容抽取,结合多路径采样与多视角评估机制筛选最优文档记忆,并引入反向推理策略从高质量结果中提炼专家思维路径,从而赋能小语言模型(SLMs)具备主动探索和构建文档记忆的能力,最终实现更语义完整、可解释性强的文档记忆检索机制。
链接: https://arxiv.org/abs/2510.14252
作者: Jihao Zhao,Zhiyuan Ji,Simin Niu,Hanyu Wang,Feiyu Xiong,Zhiyu Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
zh
[NLP-74] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对相同语义但不同表述的提示(prompt)时产生不一致回答的问题,即提升模型对提示扰动的鲁棒性。解决方案的核心是提出一种无监督训练方法 Flip-Flop Consistency (F²C),其关键由两个组件构成:一是通过提示变体间的多数投票生成硬伪标签的共识交叉熵(Consensus Cross-Entropy, CCE);二是引入表示对齐损失,将低置信度及非多数预测结果拉向高置信度多数投票变体所确立的共识空间。该方法显著提升了模型在多个自然语言处理任务上的答案一致性、平均 F₁ 分数和泛化性能,同时降低了因提示格式变化导致的性能波动。
链接: https://arxiv.org/abs/2510.14242
作者: Parsa Hejabi,Elnaz Rahmati,Alireza S. Ziabari,Morteza Dehghani
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 6 figures, 3 tables, and 1 algorithm
Abstract:Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ( F^2C ), an unsupervised training method that improves robustness to such perturbations. F^2C is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, F^2C raises observed agreement by 11.62%, improves mean F_1 by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, F^2C generalizes effectively, increasing \overlineF_1 and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, F^2C consistently improves both performance and agreement while reducing variance. These findings highlight F^2C as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at this https URL.
zh
[NLP-75] Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
【速读】: 该论文旨在解决开放权重大语言模型(Large Language Models, LLMs)在国际信息学奥林匹克竞赛(International Olympiad in Informatics, IOI)中难以达到金牌水平的问题,即如何在不依赖闭源模型和不可复现方法的前提下,实现与顶尖专有模型相当的推理与编程能力。其解决方案的关键在于提出一个可扩展且可复现的测试时计算框架——GenCluster,该框架通过大规模生成、行为聚类、排序机制以及轮转提交策略,在有限的验证预算下高效探索多样化的解空间,从而显著提升开放权重模型在IOI任务中的表现,并首次使用开源模型gpt-oss-120b实现了IOI 2025金牌级别成绩,为大语言模型推理能力的透明评估树立了新基准。
链接: https://arxiv.org/abs/2510.14232
作者: Mehrzad Samadi,Aleksander Ficek,Sean Narenthiran,Siddhartha Jain,Wasi Uddin Ahmad,Somshubra Majumdar,Vahid Noroozi,Boris Ginsburg
机构: NVIDIA(英伟达)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages, 11 figures
Abstract:Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present \gencluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we will show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
zh
[NLP-76] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning
【速读】: 该论文旨在解决多阶段推理(multi-stage reasoning)中因分解复杂问题为多个子阶段而导致的延迟增加问题,同时克服现有自适应加速技术(如层跳过,layer skipping)在效率与准确性之间难以平衡的挑战。其关键解决方案是提出 LiteStage,一个面向延迟感知的层跳过框架:通过离线阶段级搜索分配最优层数预算,并结合在线置信度驱动的生成提前退出机制,有效抑制冗余输出token的生成,从而在保持高精度的同时显著提升推理效率。
链接: https://arxiv.org/abs/2510.14211
作者: Beomseok Kang,Jiwon Song,Jae-Joon Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks, e.g., OBQA, CSQA, and StrategyQA, show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
zh
[NLP-77] DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans
【速读】: 该论文旨在解决大型语言模型角色扮演代理(LLM RPAs)在模拟个体人类行为时存在的角色一致性(persona fidelity)不足问题,其根源在于人工构建的角色档案常依赖于挑选的信息和人格特征,缺乏与目标个体行为的验证对齐。解决方案的关键在于提出动态角色优化框架(Dynamic Persona Refinement Framework, DPRF),该框架通过迭代识别生成行为与人类真实行为之间的认知偏差(既可通过自由形式也可通过理论基础的结构化分析),并据此持续优化角色档案,从而显著提升LLM RPAs的行为一致性,并在多个下游任务中展现出跨模型泛化能力。
链接: https://arxiv.org/abs/2510.14205
作者: Bingsheng Yao,Bo Sun,Yuanzhe Dong,Yuxuan Lu,Dakuo Wang
机构: Northeastern University (东北大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: In Submission
Abstract:The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF).DPRF aims to optimize the alignment of LLM RPAs’ behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these this http URL evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie this http URL can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and this http URL work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.
zh
[NLP-78] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition
【速读】: 该论文旨在解决现有研究中仅关注大五人格(Big Five)而忽视了近年来受关注的六边形人格模型(HEXACO)在多模态人类行为自动识别中的应用问题,尤其是HEXACO中包含的诚信-谦逊特质(Honesty-Humility)与替代性攻击、报复心及社会支配倾向等心理特征的相关性尚未被充分挖掘。此外,此前缺乏对Big Five与HEXACO在机器学习建模下关系的系统分析。解决方案的关键在于提出一种联合建模方法,通过优化同时识别Big Five和HEXACO人格维度的机制,提升对多模态人类行为感知的准确性与全面性。实验基于自我介绍视频数据集验证了该方法的有效性。
链接: https://arxiv.org/abs/2510.14203
作者: Ryo Masumura,Shota Orihashi,Mana Ihori,Tomohiro Tanaka,Naoki Makishima,Taiga Yamane,Naotaka Kawata,Satoshi Suzuki,Taichi Katayama
机构: NTT, Inc. (日本电信电话公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted at APSIPA ASC 2025
Abstract:This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.
zh
[NLP-79] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在指令遵循能力(instruction-following ability)提升过程中,传统监督微调(Supervised Fine-Tuning, SFT)方法依赖大量人工标注数据且难以充分挖掘已有数据潜力的问题。其核心解决方案是提出一种基于强化学习的替代策略——RL-based Semantic Response Ranking (RLSR),关键在于利用语义嵌入空间中的余弦相似度作为奖励信号,对模型生成的多个响应与人类标注响应进行排序优化,从而在不增加额外标注成本的前提下,更高效地利用大规模SFT数据集提升指令遵循性能。实验表明,RLSR可直接替代SFT并取得优于SFT的基准表现,同时与SFT结合进一步提升下游任务效果。
链接: https://arxiv.org/abs/2510.14200
作者: Zhichao Wang,Andy Wong,Ruslan Belkin
机构: Inflection AI
类目: Computation and Language (cs.CL)
备注:
Abstract:After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs a RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model’s instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks-for example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT’s 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
zh
[NLP-80] MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation
【速读】: 该论文旨在解决金融服务业中大规模客户语句标注任务中存在的标注积压问题(annotation backlogs),即数百万条用户语句需进行准确分类,传统人工标注效率低下且成本高昂。解决方案的关键在于提出MAFA(Multi-Agent Framework for Annotation)——一个可配置的多智能体协同标注框架,其核心创新包括:1)通过专业化智能体与结构化推理机制实现任务分工与协作;2)引入基于评判者(judge-based)的共识机制提升标注一致性;3)支持动态任务适配,无需代码变更即可定义FAQ、意图、实体或领域特定类别等标注类型。该系统已在摩根大通部署,成功消除百万级语句积压,平均标注一致性达86%,并显著减少人工工作量(年节省超5000小时),同时通过置信度分级(高/中/低分别为85%/10%/5%)使人类标注员聚焦于低置信度案例,从而实现高效精准的标注流程。
链接: https://arxiv.org/abs/2510.14184
作者: Mahmood Hegazy,Aaron Rodrigues,Azzam Naeem
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA’s effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.
zh
[NLP-81] owards Reversible Model Merging For Low-rank Weights
【速读】: 该论文旨在解决低秩压缩模型(如通过低秩适应 LoRA 或训练后奇异值分解 SVD 压缩)在传统合并方法下性能显著下降的问题。现有模型合并方法通常将多个微调模型的权重直接融合为单一模型,但在低秩表示中这种操作会破坏原有任务特异性信息,导致合并后模型在各任务上表现劣于原始独立模型。其解决方案的关键在于提出一种可逆模型合并(Reversible Model Merging, RMM)方法,该方法不直接合并权重,而是构建一个紧凑的权重基底空间,使得每个原始任务特定模型可通过线性组合从该基底中重建出来,从而保留所有源模型的性能,并提供一种无需数据、高效且灵活的闭式解来确定最优基底和任务系数。
链接: https://arxiv.org/abs/2510.14163
作者: Mohammadsajad Alipour,Mohammad Mohammadi Amiri
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Model merging aims to combine multiple fine-tuned models into a single set of weights that performs well across all source tasks. While prior work has shown that merging can approximate the performance of individual fine-tuned models for each task, it largely overlooks scenarios where models are compressed into low-rank representations, either through low-rank adaptation (LoRA) or post-training singular value decomposition (SVD). We first demonstrate that applying conventional merging methods to low-rank weights leads to severe performance degradation in the merged model. Motivated by this phenomenon, we propose a fundamentally different approach: instead of collapsing all adapters into one set of weights, we construct a compact basis (e.g., an equivalent of holding two or more models) from which original task-specific models can be recovered via linear combination. This reframes merging as generating a reconstruction-capable model space rather than producing a single merged model. Crucially, this allows us to ``revert’’ to each individual model when needed, recognizing that no merged model can consistently outperform one specialized for its task. Building on this insight, we introduce our method, Reversible Model Merging (RMM), an efficient, data-free, and flexible method that provides a closed-form solution for selecting the optimal basis of model weights and task-specific coefficients for linear combination. Extensive experiments across diverse datasets and model scales demonstrate that RMM consistently outperforms existing merging approaches, preserving the performance of low-rank compressed models by a significant margin.
zh
[NLP-82] Building a Macedonian Recipe Dataset: Collection Parsing and Comparative Analysis
【速读】: 该论文旨在解决马其顿语烹饪数据集在数字研究中严重缺失的问题,以支持计算美食学(Computational Gastronomy)对区域性饮食传统的深入分析。解决方案的关键在于通过网络爬虫和结构化解析构建首个系统性的马其顿菜谱数据集,并针对成分描述的异构性(如单位、数量和修饰词)进行归一化处理,从而实现高质量的数据标准化;进一步利用点互信息(Pointwise Mutual Information)和提升度(Lift score)等指标分析食材频率与共现模式,揭示具有代表性的马其顿菜肴组合特征。
链接: https://arxiv.org/abs/2510.14128
作者: Darko Sasanski,Dimitar Peshevski,Riste Stojanov,Dimitar Trajanov
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.
zh
[NLP-83] oward Cybersecurity-Expert Small Language Models
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在网络安全(cybersecurity)领域部署滞后的问题,其核心挑战在于缺乏高质量、领域特定的小型语言模型(Small Language Models, SLMs)及其训练数据集。为应对这一问题,作者提出 CyberPal 2.0,一个参数规模从 4B 到 20B 的网络安全专家级小型语言模型家族。解决方案的关键在于构建了一个增强型链式思维(chain-of-thought)网络安全指令数据集,该数据集通过 SecKnowledge 2.0 数据增强与格式化流水线生成,该流水线融合了专家引导的推理格式控制和 LLM 驱动的多步锚定(multi-step grounding),从而生成更高保真度、任务相关的推理轨迹,显著提升了模型在网络安全任务中的性能表现。
链接: https://arxiv.org/abs/2510.14113
作者: Matan Levi,Daniel Ohayon,Ariel Blobstein,Ravid Sagi,Ian Molloy,Yair Allouche
机构: IBM Research (IBM 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B-20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.
zh
[NLP-84] DROID: Dual Representation for Out-of-Scope Intent Detection
【速读】: 该论文旨在解决任务导向型对话系统中**离域意图检测(Out-of-Scope, OOS)**的问题,即如何准确识别用户输入是否超出系统已知意图范围。传统方法通常依赖强分布假设或额外的校准模块,存在泛化能力不足或复杂性高的缺陷。其解决方案的关键在于提出一种轻量级端到端框架DROID(Dual Representation for Out-of-Scope Intent Detection),通过融合两种互补编码器——通用语句编码器(Universal Sentence Encoder, USE)用于广义语义泛化,以及领域自适应的Transformer去噪自动编码器(Transformer-based Denoising Autoencoder, TSDAE)用于捕捉特定领域的上下文差异——生成联合表示,并由一个轻量级分支分类器结合单一校准阈值实现高效OOS判定,无需后处理评分。此外,DROID引入合成与开放域异常样本增强策略以提升边界学习效果,在仅有150万可训练参数的情况下显著优于现有SOTA方法,尤其在低资源场景下表现突出。
链接: https://arxiv.org/abs/2510.14110
作者: Wael Rashwan,Hossam M. Zawbaa,Sourav Dutta,Haytham Assem
机构: Maynooth University (爱尔兰); Technological University Dublin (爱尔兰); Huawei Research Centre (爱尔兰); Amazon Alexa AI (英国)
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures, 4 Tables. Preprint submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders – the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6–15% for known and 8–20% for OOS intents, with the most significant gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.
zh
[NLP-85] Generating Fair Consensus Statements with Social Choice on Token-Level MDPs
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)生成共识声明时缺乏内在结构、难以提供可证明公平性保障的问题。其核心挑战在于如何在聚合多样化的自由形式意见时,确保生成结果对所有参与方具有公平性和合理性。解决方案的关键在于将共识生成建模为一个多目标、词元级(token-level)的马尔可夫决策过程(Markov Decision Process, MDP),其中每个代理(agent)的偏好对应一个目标,且通过代理自身的策略(如个性化语言模型)隐式定义最优Q函数,从而无需显式价值函数即可量化每一步生成的奖励。在此框架下,论文提出两种基于社会选择理论的方法:一是构造一种保证处于事前核心(ex-ante core)的随机生成策略,该策略源自最大化纳什福利(Nash Welfare)的完整声明分布;二是针对单条共识声明生成,采用搜索算法优化平等福利(egalitarian welfare),实验证明该方法相较于基线(包括Habermas Machine)能显著提升最差情况下的代理对齐度。
链接: https://arxiv.org/abs/2510.14106
作者: Carter Blair,Kate Larson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Current frameworks for consensus statement generation with large language models lack the inherent structure needed to provide provable fairness guarantees when aggregating diverse free-form opinions. We model the task as a multi-objective, token-level Markov Decision Process (MDP), where each objective corresponds to an agent’s preference. Token-level rewards for each agent are derived from their policy (e.g., a personalized language model). This approach utilizes the finding that such policies implicitly define optimal Q-functions, providing a principled way to quantify rewards at each generation step without a value function (Rafailov et al., 2024). This MDP formulation creates a formal structure amenable to analysis using principles from social choice theory. We propose two approaches grounded in social choice theory. First, we propose a stochastic generation policy guaranteed to be in the ex-ante core, extending core stability concepts from voting theory to text generation. This policy is derived from an underlying distribution over complete statements that maximizes proportional fairness (Nash Welfare). Second, for generating a single statement, we target the maximization of egalitarian welfare using search algorithms within the MDP framework. Empirically, experiments using language models to instantiate agent policies show that search guided by the egalitarian objective generates consensus statements with improved worst-case agent alignment compared to baseline methods, including the Habermas Machine (Tessler et al., 2024).
zh
[NLP-86] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话中因信息逐步呈现而导致的性能退化问题,这一现象严重影响了LLMs在实际应用场景中的可用性。解决方案的关键在于将模型内部不确定性视为可利用的信号而非噪声,通过Shannon熵持续量化下一词元分布的不确定性,并在检测到熵值显著上升时触发自适应提示整合机制,从而动态重置和优化生成过程。该方法称为ERGO(Entropy-guided Resetting for Generation Optimization),实验证明其在增量指令场景下相较标准基线平均性能提升56.6%,同时显著提高能力上限(+24.7%)并降低不可靠性(-35.3%)。
链接: https://arxiv.org/abs/2510.14077
作者: Haziq Mohammad Khalid,Athikash Jeyaganthan,Timothy Do,Yicheng Fu,Sean O’Brien,Vasu Sharma,Kevin Zhu
机构: Algoverse AI Research; University of Nottingham (诺丁汉大学); San Jose State University (圣何塞州立大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 5 figures
Abstract:Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. By treating uncertainty as a first class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty aware interventions can improve both accuracy and reliability in conversational AI.
zh
[NLP-87] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages ACL AACL2025
【速读】: 该论文试图解决的问题是:语言中音义关联(phonosemantic iconicity)在大规模定量研究中的表现程度,以及这种关联是否能在不同语言中被系统识别并量化。解决方案的关键在于采用分布式的分析方法(distributional approach),通过统计手段衡量词素(morpheme)的语音相似性空间与语义相似性空间之间的对齐程度,在六种不同语言(英语、西班牙语、印地语、芬兰语、土耳其语和泰米尔语)中进行跨语言比较,从而发现此前未被识别的音义关联模式,并验证已有假说的支持度。
链接: https://arxiv.org/abs/2510.14040
作者: George Flint,Kaustubh Kislay
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 2 figures, under review – ACL (AACL 2025)
Abstract:Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations–both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes’ phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.
zh
[NLP-88] hink Globally Group Locally: Evaluating LLM s Using Multi-Lingual Word Grouping Games EMNLP
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨语言抽象推理任务中存在语言模态偏差的问题,即模型在不同语言环境下表现不一致,尤其在无需依赖固定公式或知识的抽象推理任务中缺乏系统评估。其解决方案的关键在于提出一个受《纽约时报》“Connections”游戏启发的多语言抽象推理基准——GlobalGroup,该基准覆盖英语、西班牙语、中文、印地语和阿拉伯语五种语言,并包含原生语言与英文翻译版本,以实现跨语言对比;同时引入游戏难度测量机制,确保在相似难度条件下进行可控比较,从而更准确地识别语言偏好及开源与闭源模型之间的性能差异。
链接: https://arxiv.org/abs/2510.14030
作者: César Guerra-Solano,Zhuochun Li,Xiang Lorraine Li
机构: University of Pittsburgh (匹兹堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP Main 2025
Abstract:Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply “out-of-the-box thinking” to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds – English, Spanish, Chinese, Hindi, and Arabic – in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.
zh
[NLP-89] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在跨文化情境下推理能力评估不足的问题,即单纯依赖答案正确性无法反映模型对文化语境的理解深度。为实现更细致的评估,作者提出CRaFT(Explanation-based Reasoning Framework for Culture and Translation),其关键在于引入基于解释的多维度评价体系,通过四个可解释指标——文化流利度(Cultural Fluency)、偏离度(Deviation)、一致性(Consistency)和语言适应性(Linguistic Adaptation)——系统量化模型在不同语言环境下生成的答案与解释质量。该框架首次将文化理解纳入评估核心,揭示了语言形式对模型跨文化推理能力的塑造作用,为构建具备文化自适应性的LLMs提供了可操作的评估工具和改进方向。
链接: https://arxiv.org/abs/2510.14014
作者: Shehenaz Hossain,Haithem Afli
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.
zh
[NLP-90] BitNet Distillation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定下游任务中部署时面临的高内存占用与计算成本问题,尤其是在资源受限场景下难以高效运行的问题。其核心解决方案是提出BitNet Distillation(BitDistill)轻量级微调流程,通过三项关键技术实现:1)引入SubLN模块以优化低精度推理的稳定性;2)基于MiniLM的多头注意力蒸馏机制,保留关键语义信息的同时压缩模型结构;3)持续预训练作为关键预热步骤,缓解微调后1.58-bit量化模型(即三值权重 -1, 0, 1)与全精度模型在目标任务上的性能差距。实验表明,该方法在保持与全精度模型相当性能的前提下,实现了最高达10倍的内存节省和CPU上2.65倍的推理加速。
链接: https://arxiv.org/abs/2510.13998
作者: Xun Wu,Shaohan Huang,Wenhui Wang,Ting Song,Li Dong,Yan Xia,Furu Wei
机构: Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 4 figures
Abstract:In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights -1, 0, 1) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between finetuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model size, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at this https URL.
zh
[NLP-91] he German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)训练数据中存在大量授权状态不明确的文本问题,尤其针对非英语语种如德语,其公开许可文本资源严重匮乏。为应对这一挑战,作者提出了“德国公共数据集”(German Commons),这是迄今规模最大的开放许可德语文本集合,涵盖法律、科学、文化、政治、新闻、经济和网络文本等七个领域,共计1545.6亿个词元(tokens)。解决方案的关键在于:通过系统性地从具备可验证授权资质的数据源获取内容,并采用严格的去重、质量过滤与格式标准化处理流程,确保数据在多样来源下的高质量一致性;同时所有子集均使用至少CC-BY-SA 4.0或等效许可证,保障模型训练与再分发的合法性与合规性,从而填补德语领域开放预训练数据的空白,推动真正开源德语语言模型的发展。
链接: https://arxiv.org/abs/2510.13996
作者: Lukas Gienapp,Christopher Schröder,Stefan Schweter,Christopher Akiki,Ferdinand Schlatt,Arden Zimmermann,Phillipe Genêt,Martin Potthast
机构: University of Kassel (卡塞尔大学); hessian.AI (黑森人工智能); ScaDS.AIKassel (萨克森数据科学中心卡塞尔分部); InfAI (信息与人工智能研究所); ScaDS.AILeipzig (萨克森数据科学中心莱比锡分部); Leipzig University (莱比锡大学); Stefan Schweter (独立研究员); Friedrich-Schiller-Universität Jena (耶拿弗里德里希-席勒大学); German National Library (德国国家图书馆)
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures, 12 tables, includes datasheet
Abstract:Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.
zh
[NLP-92] Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks
【速读】: 该论文旨在解决当前自动语音识别(ASR)系统主要依赖声学信息而忽略多模态上下文的问题,特别是在科学演讲场景中,仅使用语音难以准确识别专业术语。其关键解决方案在于引入演示文稿(presentation slides)作为视觉辅助信息,通过构建一个包含领域特定术语自动转录分析的多模态基准数据集,并采用合适的数据增强策略缓解缺乏带幻灯片标注数据的问题;最终训练出的多模态模型在整体词汇上实现了约34%的词错误率(WER)相对降低,在领域术语上更是达到35%的显著提升,验证了视觉信息对ASR性能优化的有效性。
链接: https://arxiv.org/abs/2510.13979
作者: Supriti Sinhamahapatra,Jan Niehues
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information are essential in disambiguation and adaptation. While most work focus on speaker images to handle noise conditions, this work also focuses on integrating presentation slides for the use cases of scientific presentation. In a first step, we create a benchmark for multi-modal presentation including an automatic analysis of transcribing domain-specific terminology. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides by a suitable approach of data augmentation. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34%, across all words and 35%, for domain-specific terms compared to the baseline model. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2510.13979 [cs.AI] (or arXiv:2510.13979v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.13979 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-93] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems
【速读】: 该论文旨在解决现实世界中检索增强生成(Retrieval-Augmented Generation, RAG)系统输出错误的多样性与复杂性问题,这类错误可能源于检索、生成或知识库本身的缺陷,严重影响系统的可靠性与鲁棒性。解决方案的关键在于提出了一种全新的错误类型分类体系(taxonomy),系统性地归纳了实际RAG系统中可能出现的错误类别,并辅以真实案例和应对策略;同时构建了一个标注了错误类型的RAG错误响应数据集,并设计了一种与该分类体系对齐的自动化评估方法,使开发者能够在开发过程中有效追踪和定位错误,从而提升RAG系统的稳定性与可维护性。
链接: https://arxiv.org/abs/2510.13975
作者: Kin Kwan Leung,Mouloud Belbahri,Yi Sui,Alex Labach,Xueying Zhang,Stephen Rose,Jesse C. Cresswell
机构: Layer 6 AI(第六层人工智能)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages
Abstract:Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at this https URL.
zh
[NLP-94] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在测试时推理过程中因计算资源消耗大而导致效率低下的问题,同时提升推理的准确性与稳定性。其核心发现是:推理过程中的不确定性高度局部化——仅少数高熵(high-entropy)token显著影响输出正确性。基于此现象,作者提出无需训练的Minimal Test-Time Intervention(MTI)框架,其关键在于两点:一是选择性地在不确定位置应用无分类器引导(Classifier-Free Guidance, CFG),实现精准干预;二是通过复用主模型的键值缓存(KV cache)高效近似无条件解码,从而实现轻量级负向提示引导(Lightweight negative-prompt guidance)。该方法在多个通用、编程及STEM任务上均取得显著性能提升,且保持高效率。
链接: https://arxiv.org/abs/2510.13940
作者: Zhen Yang,Mingyang Zhang,Feng Chen,Ganggui Ding,Liang Hou,Xin Tao,Pengfei Wan,Ying-Cong Chen
机构: HKUST(GZ)(香港科技大学(广州)); Kuaishou Technology(快手科技); AIML; ZJU(浙江大学); Ant Group(蚂蚁集团); HKUST(香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL
Abstract:Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized-only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks-e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning-while remaining highly efficient.
zh
[NLP-95] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
【速读】: 该论文旨在解决生成式 AI 模型在模仿获奖作家写作风格方面的能力及其对原作者版权影响的问题,尤其关注其是否能生成高质量、风格忠实的文学文本。研究通过一项预注册实验,对比了专家作家与三个前沿大语言模型(ChatGPT、Claude、Gemini)在模拟50位获奖作家风格上的表现,发现未经微调的模型在专家评价中显著劣于人类作者,但在对单个作者作品进行微调后,AI生成文本在风格忠实度和写作质量上均显著优于人类专家,并且更难被检测为AI生成内容。解决方案的关键在于:针对特定作者进行微调训练,这不仅消除了原始模型中导致评分低下的“可识别AI风格特征”(如陈词滥调密度),还大幅降低了成本(仅为专业写作者报酬的0.3%),从而提供了直接回应版权法第四项合理使用因素——即对原作市场价值影响——的实证依据。
链接: https://arxiv.org/abs/2510.13939
作者: Tuhin Chakrabarty,Jane C. Ginsburg,Paramveer Dhillon
机构: Stony Brook University (石溪大学); Columbia Law School (哥伦比亚法学院); University of Michigan (密歇根大学); MIT Initiative on the Digital Economy (麻省理工学院数字经济倡议)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Preprint Under Review
Abstract:The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI’s ability to generate derivative this http URL it’s unclear whether these models can generate high quality literary text while emulating authors’ styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude Gemini in writing up to 450 word excerpts emulating 50 award-winning authors’ diverse styles. In blind pairwise evaluations by 159 representative expert lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p10^8) writing quality (OR=0.13, p10^7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors’ complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p10^13) writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning inference cost of 81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright’s fourth fair-use factor, the “effect upon the potential market or value” of the source works.
zh
[NLP-96] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis
【速读】: 该论文旨在解决当前对生成式 AI(Generative AI)驱动的深度研究(Deep Research, DR)代理在企业财务分析任务中能力评估缺乏系统性和严谨性的问题。其解决方案的关键在于提出了一种名为HisRubric的新颖评估框架,该框架具有层次化的分析结构和细粒度评分标准,能够模拟专业分析师的工作流程,从数据识别、指标计算到战略总结与解释逐层评估DR代理的能力;在此基础上构建了FinDeepResearch基准,涵盖64家上市公司、8个金融市场的多语言数据集(共15,808个评分项),并通过16种代表性方法的实验揭示了不同模型在多种场景下的性能表现,为后续研究提供了可量化、可复现的评估基础。
链接: https://arxiv.org/abs/2510.13936
作者: Fengbin Zhu,Xiang Yao Ng,Ziyang Liu,Chang Liu,Xianwei Zeng,Chao Wang,Tianhui Tan,Xuan Yao,Pengyang Shao,Min Xu,Zixuan Wang,Jing Wang,Xin Lin,Junfeng Li,Jingxian Zhu,Yang Zhang,Wenjie Wang,Fuli Feng,Richang Hong,Huanbo Luan,Ke-Wei Huang,Tat-Seng Chua
机构: National University of Singapore (新加坡国立大学); 6Estates Pte Ltd; Asian Institute of Digital Finance (亚洲数字金融研究院); Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR Agent’s capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents’ capabilities in corporate financial analysis. This framework mirrors the professional analyst’s workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on the FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.
zh
[NLP-97] Big Reasoning with Small Models: Instruction Retrieval at Inference Time
【速读】: 该论文试图解决小语言模型(Small Language Models, SLMs)在本地计算设备上运行时,因缺乏多步推理能力或领域专业知识而导致性能受限的问题。其解决方案的关键在于推理阶段的指令干预(instruction intervention at inference time):通过构建一个结构化的指令语料库(Instruction Corpus),利用GPT-5对相似训练问题进行聚类并生成简洁的推理步骤指令;在推理过程中,模型检索最相关的指令并遵循其结构化步骤进行推理,而非从头生成推理过程。这种方法区别于传统的检索增强生成(Retrieval-Augmented Generation, RAG),不依赖文本片段的检索,而是提供可执行的推理流程指导,从而显著提升SLMs在医学、法律和数学等专业任务上的表现,且无需额外微调。
链接: https://arxiv.org/abs/2510.13935
作者: Kenan Alkiek,David Jurgens,Vinod Vydiswaran
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
zh
[NLP-98] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions NEURIPS2025 ALT
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在药物安全预测中可能引入社会人口学偏见的问题,尤其是当这些属性(如教育水平、住房稳定性等)在临床实践中与不良事件(Adverse Event, AE)无直接关联时,LLMs是否仍会将其纳入预测逻辑。解决方案的关键在于构建一个基于角色的评估框架,利用美国食品药品监督管理局不良事件报告系统(FAERS)的结构化数据,对两种先进模型(ChatGPT-4o 和 Bio-Medical-Llama-3.8B)进行多维度测评,识别显式偏见(推理过程中明确提及 persona 属性)和隐式偏见(预测不一致但未提及属性),从而揭示 LLMs 在药物流行病学监测中的公平性风险,并提出需建立以公平性为导向的评估协议和缓解策略,以保障其在临床部署前的安全性和公正性。
链接: https://arxiv.org/abs/2510.13931
作者: Siying Liu,Shisheng Zhang,Indu Bala
机构: University of Adelaide (阿德莱德大学); University of Sydney (悉尼大学)
类目: Computation and Language (cs.CL)
备注: Preprint of a paper accepted as a poster at the NeurIPS 2025 Workshop on Generative AI for Health (GenAI4Health). The final camera-ready workshop version may differ. Licensed under CC BY 4.0
Abstract:Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.
zh
[NLP-99] LLM s Can Get “Brain Rot”!
【速读】: 该论文试图解决的问题是:持续暴露于低质量网络文本是否会引发大语言模型(Large Language Models, LLMs)的认知能力不可逆下降,即是否存在“LLM脑损伤”(LLM Brain Rot)现象。为因果性地隔离数据质量的影响,作者设计了受控实验,在真实Twitter/X语料库上构建了高干扰(junk)与反向控制数据集,通过两种正交操作化指标——M1(互动度)和M2(语义质量)——保持token规模与训练操作一致。关键解决方案在于:首先,利用剂量-反应关系验证认知衰退的可量化性(如ARC-Challenge任务得分从74.9降至57.2);其次,通过错误归因分析识别出“思维跳脱”(thought-skipping)为主要病变机制;最后,发现流行度(非语义指标)比文本长度更能预测脑损伤效应,从而确立数据质量为LLM能力退化的因果驱动因素,并提出将数据筛选视为训练期安全问题,推动部署模型定期进行“认知健康检查”。
链接: https://arxiv.org/abs/2510.13928
作者: Shuo Xing,Junyuan Hong,Yifan Wang,Runjin Chen,Zhenyu Zhang,Ananth Grama,Zhengzhong Tu,Zhangyang Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Contrary to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges’ g0.3 ) on reasoning, long-context understanding, safety, and inflating “dark traits” (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain Of Thoughts drops 74.9 \rightarrow 57.2 and RULER-CWE 84.4 \rightarrow 52.3 as junk ratio rises from 0% to 100% . Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity, a non-semantic metric, of a tweet is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a \textittraining-time safety problem and motivating routine “cognitive health checks” for deployed LLMs. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.13928 [cs.CL] (or arXiv:2510.13928v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.13928 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-100] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理生物医学查询时缺乏科学严谨性的问题,尤其是在无法访问权威生物医学数据库的情况下,LLMs 常常生成虚假的蛋白质功能、相互作用及结构信息。为应对这一挑战,作者提出 BioMedSearch,其核心解决方案是构建一个基于多源信息检索的框架,通过文献检索、蛋白数据库和网络搜索的协同接入,结合子查询分解、关键词提取、任务图构建与多源信息过滤等关键技术,实现对复杂生物医学问题的精准问答。该方法显著提升了答案准确性,在自建的多层级生物医学多项选择题数据集 BioMedMCQs 上验证了其优越性能,尤其在机制识别(Level 1)、非相邻语义整合(Level 2)和时间因果推理(Level 3)三个难度层级上均超越基线模型。
链接: https://arxiv.org/abs/2510.13926
作者: Congying Liu,Xingyuan Wei,Peipei Liu,Yiqing Shen,Yanxu Mao,Tiehan Cui
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries. Through sub-queries decomposition, keywords extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: this https URL
zh
[NLP-101] An LLM -Powered AI Agent Framework for Holistic IoT Traffic Interpretation
【速读】: 该论文旨在解决物联网(IoT)网络中海量、异构流量数据的深层次理解难题,传统孤立检测方法难以捕捉跨层行为、协议交互与上下文语义关联。其解决方案的关键在于构建一个基于大语言模型(LLM)的AI代理框架,通过将原始报文捕获(packet captures)转化为结构化且语义增强的表示,融合特征提取、基于Transformer的异常检测、报文与流摘要、威胁情报丰富化以及检索增强型问答机制,实现对IoT流量的全面解析与可解释性分析。实验表明,混合检索策略(结合词法与语义搜索及重排序)显著优于仅使用密集检索的方法,同时系统资源开销低,具备高效性和实用性。
链接: https://arxiv.org/abs/2510.13925
作者: Daniel Adu Worae,Spyridon Mastorakis
机构: University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.
zh
[NLP-102] LTR-ICD: A Learning-to-Rank Approach for Automatic ICD Coding
【速读】: 该论文旨在解决临床笔记中诊断编码(ICD codes)自动分配与排序的难题,传统方法将其视为分类任务,忽略了编码顺序对医疗诊断和报销等应用场景的重要性。解决方案的关键在于首次将此问题建模为一个结合分类与排序的检索系统框架,从而更好地捕捉编码之间的优先级关系。实验表明,该方法在识别高优先级主诊断编码方面的排名准确率达到47%,显著优于现有最优分类模型的20%;同时在分类性能上也取得了微平均F1分数0.6065和宏平均F1分数0.2904,优于对比模型的0.597和0.2660。
链接: https://arxiv.org/abs/2510.13922
作者: Mohammad Mansoori,Amira Soliman,Farzaneh Etminani
机构: Halmstad University (哈尔姆斯塔德大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Clinical notes contain unstructured text provided by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes following the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes are essential for medical diagnosis and reimbursement. However, automating this task remains challenging. State-of-the-art methods treated this problem as a classification task, leading to ignoring the order of ICD codes that is essential for different purposes. In this work, as a first attempt, we approach this task from a retrieval system perspective to consider the order of codes, thus formulating this problem as a classification and ranking task. Our results and analysis show that the proposed framework has a superior ability to identify high-priority codes compared to other methods. For instance, our model accuracy in correctly ranking primary diagnosis codes is 47%, compared to 20% for the state-of-the-art classifier. Additionally, in terms of classification metrics, the proposed model achieves a micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model with scores of 0.597 and 0.2660.
zh
[NLP-103] FACTS: Table Summarization via Offline Template Generation with Agent ic Workflows
【速读】: 该论文旨在解决查询聚焦的表格摘要(query-focused table summarization)任务中现有方法的三大局限性:表到文本模型需昂贵微调且难以处理复杂推理;基于提示的大语言模型(LLM)方法受限于token长度、效率低下并存在敏感数据泄露风险;先前的代理式流水线依赖分解、规划或人工模板,缺乏鲁棒性和可扩展性。解决方案的关键在于提出一种名为FACTS(Fast, Accurate, and Privacy-Compliant Table Summarization)的代理工作流,其核心创新是通过离线模板生成机制,预先构建由SQL查询和Jinja2模板组成的可复用结构化模板,这些模板可在不暴露原始数据的情况下仅基于表模式(schema)由LLM生成,从而实现快速、准确且符合隐私合规要求的自然语言摘要生成。
链接: https://arxiv.org/abs/2510.13920
作者: Ye Yuan,Mohammad Amin Shabani,Siqi Liu
机构: McGill(麦吉尔大学); Mila - Quebec AI Institute(魁北克人工智能研究所); RBC Borealis(加拿大皇家银行 Borealis)
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.
zh
[NLP-104] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
【速读】: 该论文旨在解决测试时扩展(Test-Time Scaling, TTS)中如何有效利用过程奖励模型(Process Reward Models, PRMs)的验证信号以提升大语言模型(Large Language Models, LLMs)响应质量的问题。当前实践中,PRM-based选择策略常被简单多数投票(majority voting)超越,暴露出现有方法未能充分挖掘PRM与LLM之间的协同关系。解决方案的关键在于提出一个理论驱动的加权聚合框架,该框架表明最优策略是根据LLM与PRM间复杂交互关系动态估计权重——实证发现这些权重在不同LLM-PRM组合中差异显著,且常包含负权重;进而设计高效的预计算校准方法来学习此类非线性权重函数,从而在仅使用21.3%计算资源的情况下显著优于传统加权多数投票,实现更智能、更高效的TTS性能提升。
链接: https://arxiv.org/abs/2510.13918
作者: Peng Kuang,Yanli Wang,Xiaoyu Han,Yaowenqi Liu,Kaidi Xu,Haohan Wang
机构: Zhejiang University (浙江大学); Imperial College London (帝国理工学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Drexel University (德雷塞尔大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only 21.3% of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
zh
[NLP-105] Element2Vec: Build Chemical Element Representation from Text for Property Prediction
【速读】: 该论文旨在解决化学元素属性数据难以直接测量、传统数值分析方法难以建模复杂关系,以及现有生成式AI工具存在幻觉和可解释性不足的问题。其核心解决方案是提出Element2Vec框架,通过自然语言处理技术从维基百科文本中提取化学元素的嵌入表示,包括全局通用向量(Global)和属性聚焦局部向量(Local),并设计基于自注意力机制的测试时训练方法以缓解因数据稀疏导致的预测误差,从而提升材料科学中AI驱动发现的准确性与可靠性。
链接: https://arxiv.org/abs/2510.13916
作者: Yuanhao Li,Keyuan Lai,Tianqi Wang,Qihao Liu,Jiawei Ma,Yuan-Chao Hu
机构: SSLab(智能系统实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Accurate property data for chemical elements is crucial for materials design and manufacturing, but many of them are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vecto effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Despite the complicated relationship across elements, the computational challenges also exist because of 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to mitigate the prediction error caused by Vanilla regression clearly. We hope this work could pave the way for advancing AI-driven discovery in materials science.
zh
[NLP-106] Readability ne Learnability: Rethinking the Role of Simplicity in Training Small Language Models
【速读】: 该论文旨在解决当前研究中关于小语言模型(Small Language Models, SLMs)能否生成连贯文本的争议性解释问题,即是否可归因于训练语料的可读性(readability)。此前研究认为,简化语料(如儿童导向的TinyStories)因其词汇易懂、句法简单和叙事结构熟悉,是促使SLMs具备生成能力的关键因素。然而,本文通过构建结构一致但可读性不同的合成数据集,发现可读性并非决定模型连贯性或学习效率的核心变量;相反,统计上的简单性——以n-gram多样性衡量——才是更可靠的可学习性预测指标。其解决方案的关键在于引入可控的实验设计,排除可读性的混杂影响,从而揭示出语言模型能力涌现的真实驱动因素,强调应避免无实证依据地类比人类认知发展,转而聚焦于数据本身的统计特性对模型学习的影响。
链接: https://arxiv.org/abs/2510.13915
作者: Ivan Lee,Taylor Berg-Kirkpatrick
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to COLM 2025 (Spotlight)
Abstract:Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability – characterized by accessible vocabulary, familiar narrative structure, and simple syntax – plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training – drawing parallels to human cognitive development without empirical basis – and argue for more precise reasoning about what properties actually support capability emergence in small models.
zh
[NLP-107] Synthesizing Agent ic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms ICLR26
【速读】: 该论文旨在解决当前基于网络的“深度研究”智能体(deep research agents)在处理复杂问答任务时面临的挑战,即语言模型缺乏对长程推理和探索的有效优化,且现有指令微调数据集难以控制难度与质量,导致合成数据无法充分模拟真实场景中的复杂性。解决方案的关键在于提出一种双管齐下的数据合成流程:通过逐步增加任务复杂度直至基准网络代理(baseline web agent)失效,系统性地生成高质量、高多样性的问题-答案对;该基准代理不仅执行任务,还负责验证事实性、检查替代答案并实施过滤,从而确保数据的真实性和有效性。实验表明,尽管数据规模较小,该方法生成的数据集能显著提升模型性能,在工具调用多样性上达到现有数据集的两倍,有效避免重复调用行为,且在多个网络基准测试中优于已有方案。
链接: https://arxiv.org/abs/2510.13913
作者: Shrey Pandit,Xuan-Phi Nguyen,Yifei Ming,Austin Xu,Jiayu Wang,Caiming Xiong,Shafiq Joty
机构: Salesforce AI Research (Salesforce人工智能研究); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. ICLR 26 submission
Abstract:Web-based ‘deep research’ agents aim to solve complex question - answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question - answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset - despite being smaller - enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
zh
[NLP-108] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
【速读】: 该论文旨在解决当前AI辩论(AI debate)研究中忽视主观性问题的局限,即现有实验多基于具有明确真实标签的数据集,导致“谎言”仅表现为对错误命题的辩护,而未考察模型在面对主观判断时是否仍能保持信念一致性。其解决方案的关键在于:设计一个包含主观问题的实验框架,通过测量大型语言模型(Large Language Models, LLMs)的先验信念(prior beliefs),并引入与模型信念冲突的裁判人格(judge persona),从而检验模型是否会采取讨好策略(sycophantic strategies)以提升说服力,而非坚持自身信念。此外,作者对比了顺序式与并行式辩论协议,评估系统偏差,并分析不同立场下模型论证的质量与说服力差异,揭示出模型在信念一致时更具说服力、但信念不一致时论证质量被更高评分的悖论现象,为改进人类判官的训练信号和构建更对齐的AI系统提供了实证依据。
链接: https://arxiv.org/abs/2510.13912
作者: María Victoria Carro,Denise Alejandra Mester,Facundo Nieto,Oscar Agustín Stanchi,Guido Ernesto Bergman,Mario Alejandro Leiva,Eitan Sprejer,Luca Nicolás Forziati Gangi,Francisca Gauna Selasco,Juan Gustavo Corvalán,Gerardo I. Simari,María Vanina Martinez
机构: FAIR, IALAB, Universidad de Buenos Aires, AR; Università degli Studi di Genova, IT; Universidad Nacional de Córdoba, AR; BAISH, Universidad de Buenos Aires, AR; Instituto Superior en Informática, LIDI, Universidad Nacional de La Plata; CONICET, AR; Dept. of Comp. Sci. and Eng., Universidad Nacional del Sur & ICIC UNS-CONICET, AR; Artificial Intelligence Research Institute (IIIA-CSIC), ES
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages
Abstract:The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models’ prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge’s presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.
zh
[NLP-109] RAG Cap-Bench: Benchmarking Capabilities of LLM s in Agent ic Retrieval Augmented Generation Systems
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 中 agentic Retrieval-Augmented Generation (RAG) 系统在处理复杂多跳问题时表现不足,以及其内部中间推理能力缺乏细粒度评估的问题。解决方案的关键在于提出 RAGCap-Bench——一个面向中间任务能力的基准测试框架,通过构建典型 LLM 错误分类体系并设计针对性评测题目,实现对 agentic RAG 工作流中关键中间步骤(如规划、检索与推理)的精细化评估;实验表明,“慢思考”模型因具备更强的中间能力而获得更优端到端性能,验证了该基准的有效性及提升中间推理能力的重要性。
链接: https://arxiv.org/abs/2510.13910
作者: Jingru Lin,Chen Zhang,Stephen Y. Liu,Haizhou Li
机构: National University of Singapore (新加坡国立大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs)-such as factual errors, outdated knowledge, and hallucinations-by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that “slow-thinking” models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark’s validity and the importance of enhancing these intermediate capabilities.
zh
[NLP-110] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的知识图谱推理(Knowledge Graph Reasoning, KGR)中因稀疏知识图谱(KG)上下文导致的LLM内在知识失真问题,以及现有方法难以有效约束LLM生成幻觉、从而严重影响推理可信度的问题。解决方案的关键在于提出一种统一协调LLM知识与KG上下文的知识推理语言模型(Knowledge Reasoning Language Model, KRLM),其核心创新包括:设计了知识推理语言(Knowledge Reasoning Language, KRL)指令格式与分词器以对齐LLM知识与KG表示;引入KRL注意力层通过动态知识记忆机制实现LLM内在知识与外部KG上下文的协同;并构建结构感知的下一实体预测器,严格将推理结果限制在可信知识域内,从而显著提升模型在零样本和微调场景下的推理准确性和可靠性。
链接: https://arxiv.org/abs/2510.13909
作者: Xingrui Zhuo,Jiapu Wang,Gongqing Wu,Zhongyuan Wang,Jichen Zhang,Shirui Pan,Xindong Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM\footnoteOur source codes are available at this https URL in both zero-shot reasoning and fine-tuning scenarios.
zh
[NLP-111] Interpreting the Latent Structure of Operator Precedence in Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在执行算术运算时内部表征机制不明确的问题,尤其是模型是否在隐藏层中编码了运算符优先级(operator precedence)。此前研究主要关注输出结果或提示策略,而忽视了模型内部如何进行算术计算。论文以开源指令微调模型LLaMA 3.2-3B为基础,构建包含不同括号结构的三操作数算术表达式数据集,通过可解释性技术(如logit lens、线性分类探针和UMAP几何可视化)分析残差流(residual stream)中的中间计算结果。关键发现是:中间计算结果确实存在于残差流中,尤其在多层感知机(MLP)块之后;且每个运算符的嵌入在注意力层后线性编码优先级信息。解决方案的核心创新在于提出“部分嵌入交换”(partial embedding swap)技术,通过交换高影响力嵌入维度来显式修改运算符优先级,从而验证并操控模型内部的算术逻辑。
链接: https://arxiv.org/abs/2510.13908
作者: Dharunish Yugeswardeenoo,Harshil Nukala,Cole Blondin,Sean O Brien,Vasu Sharma,Kevin Zhu
机构: Algoverse AI Research (Algoverse AI 研究)
类目: Computation and Language (cs.CL)
备注: 9 pages, 4 figures. Accepted to INTERPLAY Workshop at COLM 2025
Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving the open question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator’s embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
zh
[NLP-112] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中因输入提示(prompt)敏感而导致的提示设计难题,尤其是现有自动提示优化(Automatic Prompt Optimization, APO)方法普遍依赖高质量标注数据的问题。为实现无需标签的高效优化,作者提出Prompt Duel Optimizer(PDO),其核心创新在于将提示优化建模为双人博弈(dueling-bandit)问题,利用LLM作为评判者提供成对偏好反馈(pairwise preference feedback)来生成监督信号;同时结合Double Thompson Sampling(D-TS)以优先选择信息量大的提示比较,并引入Top-Performer Guided Mutation策略通过变异高表现提示扩展候选池,从而在无标签场景下实现样本高效优化,并能兼容部分标签以降低评判噪声。
链接: https://arxiv.org/abs/2510.13907
作者: Yuanchen Wu,Saurabh Verma,Justin Lee,Fangzhou Xiong,Poppy Zhang,Amel Awadelkarim,Xu Chen,Yubai Yuan,Shawndra Hill
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Meta
类目: Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
Abstract:Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
zh
[NLP-113] Schema for In-Context Learning
【速读】: 该论文旨在解决传统示例驱动的上下文学习(In-Context Learning, ICL)缺乏显式知识检索与抽象层次迁移机制的问题,即大语言模型(Large Language Models, LLMs)在面对新任务时难以自发构建和利用内部结构化的推理模板(schema)。解决方案的关键在于提出Schema Activated In-Context Learning (SA-ICL) 框架,该框架受认知科学中图式理论(schema theory)启发,从先前示例中提取认知建构模块的表示,形成轻量级、结构化的推理步骤及其关系模板——即抽象图式(abstracted schema),并在处理新问题时显式地激活该图式以增强模型推理能力。实验证明,SA-ICL显著提升LLMs在化学与物理领域的性能(最高达36.19%),同时减少对多示例依赖并提高可解释性,从而推动更类人的推理能力发展。
链接: https://arxiv.org/abs/2510.13905
作者: Pan Chen,Shaohong Chen,Mark Wang,Shi Xuan Leong,Priscilla Fung,Varinia Bernales,Alan Aspuru-Guzik
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model’s reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.
zh
[NLP-114] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在政治与道德领域是否存在意识形态偏倚的问题,特别是其生成内容是否倾向于某种政治立场,以及模型能否准确模拟不同人口群体的道德观点。解决方案的关键在于采用道德基础理论(Moral Foundations Theory, MFT)作为分析框架,将LLM输出映射到五个道德维度(伤害、公平、内群体忠诚、权威和纯洁),并与已有的人类研究数据进行直接对比,从而量化LLM响应中的潜在偏倚,并系统评估其在显式提示和基于人口统计的角色扮演情境下对政治意识形态的再现能力。
链接: https://arxiv.org/abs/2510.13902
作者: Nicole Smith-Vaniz,Harper Lyon,Lorraine Steigner,Ben Armstrong,Nicholas Mattei
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raise questions about how and what responses LLMs make in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLM with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in the LLM responses, nor have they connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.
zh
[NLP-115] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对 jailbreak 攻击时的安全脆弱性问题,即攻击者通过构造特定输入(如对抗后缀)诱导模型生成受限制内容,从而绕过其安全机制。解决方案的关键在于提出 RAID(Refusal-Aware and Integrated Decoding)框架,该框架通过将离散 token 映射为连续嵌入空间进行优化,并采用联合目标函数:(i) 引导模型产生受限响应,(ii) 引入拒绝感知正则项以避免嵌入空间中指向拒绝方向的激活,(iii) 加入连贯性约束以保持语义合理性与非冗余性。最终通过 critic-guided 解码过程将优化后的嵌入映射回自然且有效的 token 序列,实现高效、低资源消耗的攻击构造,显著优于现有白盒与黑盒基线方法。
链接: https://arxiv.org/abs/2510.13901
作者: Tuan T. Nguyen,John Le,Thai T. Vu,Willy Susilo,Heath Cooper
机构: VNPT AI (越南电信人工智能); University of Wollongong (伍伦贡大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
zh
[NLP-116] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
【速读】: 该论文旨在解决窄域微调(narrow finetuning)后大语言模型(Large Language Models, LLMs)在激活空间中产生显著偏置的问题,这些偏置虽未被直接观察到,却能反映微调任务的特性。研究发现,通过简单的模型差分(model diffing)工具分析微调前后模型激活差异——特别是对随机文本前几个token的激活差异进行识别与干预——可有效提取出与微调数据格式和内容高度相关的信号,从而实现对微调目标的解释性理解。其解决方案的关键在于:利用激活差异的可迁移性进行“激活引导”(steering),即向模型激活中注入差异向量以生成符合微调领域特征的文本,并进一步构建基于LLM的可解释性代理(interpretability agent),相比传统提示方法显著提升了对微调域的理解能力。此方法揭示了窄域微调模型中存在可检测的过拟合痕迹,且混入预训练数据可削弱此类偏置,但残余风险仍需关注。
链接: https://arxiv.org/abs/2510.13900
作者: Julian Minder,Clément Dumas,Stewart Slocum,Helena Casademunt,Cameron Holmes,Robert West,Neel Nanda
机构: EPFL (瑞士联邦理工学院); Ecole Normale Supérieure Paris-Saclay, Université Paris-Saclay; Anthropic Fellows Program; Harvard University; MATS
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
zh
[NLP-117] Attribution Quality in AI-Generated Content:Benchmarking Style Embeddings and LLM Judges ICDM
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)时代下文本作者归属(authorship attribution)日益复杂的挑战,即机器生成的文本与人类写作在风格上趋于相似,导致难以准确区分来源。其解决方案的关键在于提出并对比两种互补的识别机制:固定风格嵌入(fixed Style Embeddings)和经过指令微调的LLM判别器(GPT-4o),并在一个包含六类文本域(学术、新闻、小说、博客、口语转录和影视剧本)的平衡数据集Human AI Parallel Corpus上进行基准测试。结果表明,风格嵌入在结构化文本(如口语和剧本)中表现更优,而LLM判别器在语义敏感领域(如小说和学术写作)显著优于嵌入方法,凸显了作者归属问题的多维特性,并强调需采用混合策略以提升识别准确性。
链接: https://arxiv.org/abs/2510.13898
作者: Misam Abbas
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for publication at the 2025 IEEE ICDM Workshop on “Grounding Documents with Reasoning, Agents, Retrieval, and Attribution”. This is author submitted version. Not yet published
Abstract:Attributing authorship in the era of large language models (LLMs) is increasingly challenging as machine-generated prose rivals human writing. We benchmark two complementary attribution mechanisms , fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o) on the Human AI Parallel Corpus, an open dataset of 600 balanced instances spanning six domains (academic, news, fiction, blogs, spoken transcripts, and TV/movie scripts). Each instance contains a human prompt with both a gold continuation and an LLM-generated continuation from either GPT-4o or LLaMA-70B-Instruct. The Style Embedding baseline achieves stronger aggregate accuracy on GPT continuations (82 pct vs. 68 pct). The LLM Judge is slightly better than the Style embeddings on LLaMA continuations (85 pct vs. 81 pct) but the results are not statistically significant. Crucially, the LLM judge significantly outperforms in fiction and academic prose, indicating semantic sensitivity, whereas embeddings dominate in spoken and scripted dialogue, reflecting structural strengths. These complementary patterns highlight attribution as a multidimensional problem requiring hybrid strategies. To support reproducibility we provide code on GitHub and derived data on Hugging Face under the MIT license. This open framework provides a reproducible benchmark for attribution quality assessment in AI-generated content, along with a review of related literature influencing this work.
zh
[NLP-118] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)面临的关键安全威胁——越狱攻击(jailbreaking techniques)的系统性理解与防御不足问题。现有防御方法普遍局限于单轮攻击、语言覆盖有限,且分类体系未能充分捕捉攻击策略的多样性或聚焦于风险类别而非技术本质。其解决方案的关键在于:构建一个包含50种越狱策略的分层分类体系(hierarchical taxonomy),将其划分为七类核心家族(如角色扮演、说服、权限提升等),并基于此分类体系开展结构化红队测试,从而全面评估不同攻击类型的流行度与成功率;同时,利用该分类指导提示工程以提升自动检测性能,并发布首个意大利语多轮对抗对话数据集(1364条),支持对渐进式恶意意图演化机制的研究。
链接: https://arxiv.org/abs/2510.13893
作者: Olga E. Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Daniele Nardi
机构: Sapienza University of Rome (罗马大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcome of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmark a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.
zh
[NLP-119] he Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域适应过程中对高质量监督微调(Supervised Fine-Tuning, SFT)数据依赖性强、现有数据选择方法过度依赖LLM内部知识、可解释性弱且泛化能力有限的问题。解决方案的关键在于提出一种受认知科学启发的框架THTB(The Harder The Better),通过结合质量过滤与内在和外在难度评分机制,优先选择高阶认知指令,从而实现可解释、可量化的高效SFT数据筛选与标注引导。实验表明,仅使用5%的数据即可超越全量数据训练效果,并在垂直领域中以2%的数据量优于大规模数据训练模型,显著提升了领域适配能力。
链接: https://arxiv.org/abs/2510.13892
作者: Zhaoyang Shang,Sibo Wei,Jianbin Guo,Rui Zhou,Lifeng Dong,Yin Luo
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs’ internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT, both in data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available on this https URL.
zh
[NLP-120] A Survey on Collaborating Small and Large Language Models for Performance Cost-effectiveness Cloud-edge Privacy and Trustworthiness
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中面临的高微调成本、推理延迟大、边缘部署能力受限以及可靠性不足等问题,同时探索小型语言模型(Small Language Models, SLMs)在效率与适应性方面的优势如何与LLMs的泛化能力和推理能力协同互补。其解决方案的关键在于构建一个以协作目标为导向的系统性框架,提出包含性能增强、成本效益优化、云边隐私保护和可信性提升四个维度的分类体系,并在此基础上梳理代表性方法、总结设计范式,从而推动高效、安全且可扩展的SLM-LLM协同机制发展。
链接: https://arxiv.org/abs/2510.13890
作者: Fali Wang,Jihai Chen,Shuhua Yang,Ali Al-Lawati,Linli Tang,Hui Liu,Suhang Wang
机构: Pennsylvania State University (宾夕法尼亚州立大学); Michigan State University (密歇根州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 17 figures, under review
Abstract:Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs’ specialization and efficiency with LLMs’ generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.
zh
[NLP-121] Reliable Fine-Grained Evaluation of Natural Language Math Proofs
【速读】: 该论文试图解决生成式 AI (Generative AI) 在数学证明生成任务中缺乏可靠、细粒度评估工具的问题,即当前方法难以对模型生成的自然语言数学证明进行精确评分。解决方案的关键在于提出了一种系统性的评估器设计方法,并构建了首个专家标注的细粒度证明评分数据集 ProofBench,用于训练和验证评估器;其中核心创新是 ProofGrader,其结合强推理能力的语言模型(LLM)、来自参考解法和评分标准的丰富上下文信息以及简单的集成策略,在 0–7 分制上实现了低至 0.926 的平均绝对误差(MAE),显著优于基线方法,并在“最佳 n 选一”任务中有效提升了生成质量。
链接: https://arxiv.org/abs/2510.13888
作者: Wenjie Ma,Andrei Cojocaru,Neel Kolhe,Bradley Louie,Robin Said Sharif,Haihan Zhang,Vincent Zhuang,Matei Zaharia,Sewon Min
机构: UC Berkeley (加州大学伯克利分校); Google DeepMind (谷歌深度思维); Peking University (北京大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages, 6 figures, 10 tables
Abstract:Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. %with expert gradings. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of- n selection task: at n=16 , ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
zh
[NLP-122] Order from Chaos: Comparative Study of Ten Leading LLM s on Unstructured Data Categorization
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在未结构化文本分类任务中表现不佳的问题,尤其是在面对交互广告局(IAB)2.2层级分类体系时,尽管模型规模持续扩大,其经典指标(如准确率、F1分数)仍仅达到中等水平,且存在严重的幻觉(hallucination)和类别膨胀(inflation)现象。解决方案的关键在于提出一种基于集成学习的方法:通过多个LLM作为独立专家协同决策,实现对输入文本的更可靠分类。该方法显著提升了准确性,完全消除了幻觉,并有效抑制了类别过度生成,表明模型间的协调协作比单纯依赖模型规模或架构改进更能推动文本分类性能向甚至超越人类专家水平迈进。
链接: https://arxiv.org/abs/2510.13885
作者: Ariel Kamen
机构: RingCentral(环信公司); UC Davis (加州大学戴维斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures,
Abstract:This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization. Comments: 10 pages, 4 figures, Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.13885 [cs.CL] (or arXiv:2510.13885v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.13885 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-123] oo Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在模拟公众意见和社会现象时,普遍采用封闭式问答格式(如多选题或简答题)导致的局限性问题,这类设计忽视了LLM固有的生成式能力。解决方案的关键在于引入开放式的自由文本生成机制,即通过捕捉LLM输出中的主题、观点和推理过程,实现更真实的社会模拟;这种方法能够提升测量精度与研究设计灵活性,支持对未预见观点的探索,减少研究者预设偏见,并增强表达多样性与个体差异的刻画,从而推动自然语言处理(NLP)与社会科学方法论之间的协同创新。
链接: https://arxiv.org/abs/2510.13884
作者: Bolei Ma,Yong Cao,Indira Sen,Anna-Carolina Haensch,Frauke Kreuter,Barbara Plank,Daniel Hershcovich
机构: LMU Munich & Munich Center for Machine Learning (慕尼黑大学 & 慕尼黑机器学习中心); University of Tübingen & Tübingen AI Center (图宾根大学 & 图宾根人工智能中心); University of Mannheim (曼海姆大学); University of Maryland, College Park (马里兰大学学院公园分校); University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes “in” LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.
zh
[NLP-124] PAGE: Prompt Augmentation for text Generation Enhancement
【速读】: 该论文旨在解决生成式 AI(Generative AI)在面对特定任务或特殊需求时性能不佳的问题,尤其是在缺乏大量标注数据的情况下难以有效调整。其核心解决方案是提出 PAGE(Prompt Augmentation for text Generation Enhancement)框架,关键在于引入轻量级辅助模块(如分类器或提取器),利用这些模块对输入文本进行推理并生成结构化信息,进而构建增强型输入以提升生成质量与可控性。该方法不依赖额外的生成式辅助模型,而是采用模块化架构,具有良好的任务适配性和可扩展性。
链接: https://arxiv.org/abs/2510.13880
作者: Mauro Jose Pacchiotti,Luciana Ballejos,Mariel Ale
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: in Spanish language
Abstract:In recent years, natural language generative models have shown outstanding performance in text generation tasks. However, when facing specific tasks or particular requirements, they may exhibit poor performance or require adjustments that demand large amounts of additional data. This work introduces PAGE (Prompt Augmentation for text Generation Enhancement), a framework designed to assist these models through the use of simple auxiliary modules. These modules, lightweight models such as classifiers or extractors, provide inferences from the input text. The output of these auxiliaries is then used to construct an enriched input that improves the quality and controllability of the generation. Unlike other generation-assistance approaches, PAGE does not require auxiliary generative models; instead, it proposes a simpler, modular architecture that is easy to adapt to different tasks. This paper presents the proposal, its components and architecture, and reports a proof of concept in the domain of requirements engineering, where an auxiliary module with a classifier is used to improve the quality of software requirements generation.
zh
[NLP-125] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production
【速读】: 该论文旨在解决语言模型在处理不同复杂度输入时计算资源分配不均的问题,即如何让模型根据输入的难易程度动态调整每个token所需的计算步骤(compute steps),从而提升效率与准确性。解决方案的关键在于提出了一类名为“Catch Your Breath”(CYB)的监督训练目标,其核心机制是允许模型通过输出“don’t know”信号请求额外计算时间,并插入暂停标记(pause token)以获得后续计算资源;同时将每一步输出的选择建模为带有时间成本的序贯决策问题,通过三种变体(CYB-AP、CYB-VA、CYB-DP)分别从任意时间预测、变分优化和计算预算约束角度优化模型对不确定性的判断与资源调度能力。实验表明,该方法显著降低训练数据需求(仅需基线模型的1/3),并能自适应地根据词性、语境复杂度等特征选择是否暂停,实现更智能的推理过程。
链接: https://arxiv.org/abs/2510.13879
作者: Alexandre Galashov,Matt Jones,Rosemary Ke,Yuan Cao,Vaishnavh Nagarajan,Michael C. Mozer
机构: Google DeepMind(谷歌深度智脑)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a don’t know output. If the model is granted a delay, a specialized pause token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use don’t know outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as \textitCatch Your Breath losses and we study three methods in this class: CYB-AP frames the model’s task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns like \textitpatients and \textitchallenges but never pauses after the first token of contracted words like \textitwasn and \textitdidn , and it shows high variability for ambiguous tokens like \textitwon , which could function as either a verb or part of a contraction.
zh
[NLP-126] xtBandit: Evaluating Probabilistic Reasoning in LLM s Through Language-Only Decision Tasks
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在缺乏数值线索或显式概率信息的情况下,如何通过纯自然语言反馈进行序列决策的问题。其核心挑战在于评估LLMs是否能从文本形式的奖励信号(如“you earned a token”)中推断出潜在的奖励结构并据此调整策略。解决方案的关键在于构建了一个新颖的基准测试环境——多臂老虎机(multi-armed bandit)场景,其中模型仅接收纯文本反馈,无任何数值提示,从而迫使模型基于语言理解进行概率推理与策略优化。实验表明,尽管多数LLMs表现逊于经典决策算法(如Thompson Sampling、Epsilon Greedy等),但Qwen3-4B模型达到了89.2%的最佳臂选择率,显著优于其他大模型及传统方法,证明了在非数值化语境下,生成式AI仍可涌现出有效的概率推理能力。
链接: https://arxiv.org/abs/2510.13878
作者: Jimin Lim,Arjun Damerla,Arthur Jiang,Nam Le
机构: UC Merced (加州大学默塞德分校); UC Berkeley (加州大学伯克利分校); Algoverse
类目: Computation and Language (cs.CL)
备注: COLM 2025 @ ORIGen Workshop
Abstract:Large language models (LLMs) have shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty only using natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, “you earned a token”, without access to numerical cues or explicit probabilities, resulting in the model to infer latent reward structures purely off linguistic cues and to adapt accordingly. We evaluated the performance of four open-source LLMs and compare their performance to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B, achieved the best-arm selection rate of 89.2% , which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
zh
[NLP-127] What Layers When: Learning to Skip Compute in LLM s with Residual Gates
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中计算资源消耗过高的问题,特别是在长文本生成和指令微调场景下如何实现高效计算而不显著牺牲模型性能。其解决方案的关键在于提出一种名为GateSkip的残差流门控机制,通过为每个注意力层(Attention)和前馈神经网络层(MLP)配备一个sigmoid-linear门控函数,在不改变模型结构的前提下,对每个token在每层的输出进行重要性评估,并基于门控值排序后按层预算跳过低重要性token的处理。该方法具有平滑、可微分的特点,能够在预训练模型基础上稳定微调,避免了传统早退出或路由型混合深度(Mixture-of-Depths)模型所需的复杂重训练过程。实验表明,该方法在长文本推理中最多节省15%计算量且保持90%以上基线准确率,在指令微调任务中甚至能在全量计算下提升精度,并在约50%计算节省时达到与基线相当的性能。
链接: https://arxiv.org/abs/2510.13876
作者: Filipe Laitenberger,Dawid Kopiczko,Cees G.M. Snoek,Yuki M. Asano
机构: Qualcomm-UvA Lab, University of Amsterdam (阿姆斯特丹大学); FunAI Lab, University of Technology Nuremberg (纽伦堡应用技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch’s output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
zh
[NLP-128] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation
【速读】: 该论文旨在解决法国肿瘤学临床文本自然语言处理(Natural Language Processing, NLP)工具开发中缺乏高质量标注数据集的问题。解决方案的关键在于构建并公开发布FRACCO(FRench Annotated Corpus for Clinical Oncology),这是一个由专家标注的1301个合成法语临床病例组成的语料库,其原始来源为西班牙CANTEMIST语料库,并基于国际疾病分类肿瘤学(International Classification of Diseases for Oncology, ICD-O)进行实体标注与概念标准化。该语料库包含三类标注层:形态学、解剖部位和组织学分化术语,以及将多个ICD-O元素整合为统一临床概念的复合表达归一化层,通过多轮人工校验确保标注质量,最终形成涵盖399种形态学代码、272种解剖部位代码及2043种复合表达的高精度标注资源,为法语肿瘤学文本中的命名实体识别(Named Entity Recognition, NER)与概念归一化提供基准标准。
链接: https://arxiv.org/abs/2510.13873
作者: Johann Pignat,Milena Vucetic,Christophe Gaudet-Blavignac,Jamil Zaghir,Amandine Stettler,Fanny Amrein,Jonatan Bonjour,Jean-Philippe Goldman,Olivier Michielin,Christian Lovis,Mina Bjelogrlic
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology) an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset representing 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.
zh
[NLP-129] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua
【速读】: 该论文旨在解决资源匮乏语言(如克丘亚语)在语音技术发展中因数据与资源稀缺而面临的瓶颈问题。其核心解决方案是将克丘亚语整合进Common Voice这一开放且由社区驱动的语音数据集平台,通过语言引入(language onboarding)和语料库构建(包括朗读与自发口语数据),推动高质量语音数据的采集与验证。研究以普诺克丘亚语(Puno Quechua)为案例,展示了Common Voice目前已收录191.1小时克丘亚语语音数据(86%已验证),其中普诺克丘亚语贡献12小时(77%验证),凸显了该平台在促进包容性语音技术发展方面的潜力。同时,论文还提出需关注技术挑战与伦理问题,强调原住民数据主权(indigenous data sovereignty)和社区参与的重要性。
链接: https://arxiv.org/abs/2510.13871
作者: Elwin Huaman,Wendi Huaman,Jorge Luis Huaman,Ninfa Quispe
机构: 未知
类目: Computation and Language (cs.CL)
备注: to be published in the 9th Annual International Conference on Information Management and Big Data (SIMBig 2025)
Abstract:Under-resourced languages, such as Quechuas, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster an open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), highlighting the Common Voice’s potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.
zh
[NLP-130] Unlocking the Potential of Diffusion Language Models through Template Infilling
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在推理过程中受限于继承自自回归语言模型的前缀提示(prefix-based prompting)策略的问题,从而限制了其生成灵活性与效率。解决方案的关键在于提出模板填空(Template Infilling, TI)机制,该机制首先生成目标响应的结构化模板,随后对掩码段进行填充;同时引入动态段分配(Dynamic Segment Allocation, DSA)方法,根据生成置信度自适应调整各段长度,从而提升结构控制的灵活性与生成质量。实验表明,该方法在数学推理和代码生成基准上相较基线模型平均提升17.01%(p值),并在多标记生成场景中实现有效加速而不损失生成质量。
链接: https://arxiv.org/abs/2510.13870
作者: Junhoo Lee,Seungyeon Kim,Nojun Kwak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs’ generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01 % p over baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.
zh
[NLP-131] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues
【速读】: 该论文旨在解决生成式 AI(Generative AI)在教育场景中应用时,学习者情绪动态(affective dynamics)尚不明确的问题,特别是大型语言模型(Large Language Model, LLM)驱动的辅导对话中学习者情绪状态如何演变及其对学习过程的影响。其解决方案的关键在于构建首个基于多模型集成(ensemble-LLM)的大规模情感感知框架,通过零样本情感标注方法,利用三类前沿LLM(Gemini、GPT-4o、Claude)对16,986轮师生对话进行情感分析,输出包括效价(valence)、唤醒度(arousal)和学习帮助性(learning-helpfulness)的标量评分及自由文本情绪标签,并采用排名加权组内聚合与模型间多数共识策略融合结果,从而生成鲁棒的情绪画像。该方法实现了对学习者情绪状态的精细化追踪与量化建模,为负责任地将生成式AI融入教育提供了实证基础与干预契机。
链接: https://arxiv.org/abs/2510.13862
作者: Chenyu Zhang,Sharifa Alghowinem,Cynthia Breazeal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 4 pages, 3 figures. Published in the 11th International Conference on Affective Computing and Intelligent Interaction (ACII 2025), Late-Breaking Results Track
Abstract:While recent studies have examined the leaning impact of large language model (LLM) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners’ evolving affective states. To achieve this, we analyzed two semesters’ worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners’ emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived–positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.
zh
[NLP-132] ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在内存占用和计算开销方面的效率瓶颈问题,尤其是在小语言模型(Small Language Models, SLMs)应用于智能体(agentic AI)系统时的性能优化需求。解决方案的关键在于通过分析模型架构中的冗余性,结合AI可解释性和推理时层剪枝(inference-time layer pruning)的研究洞见,提出一种新型高效语言模型架构ShishuLM:该架构利用归一化与注意力计算在中等上下文场景下近似线性于输入长度的特性,将整个Transformer块用多层感知机(MLPs)近似替代,从而显著减少参数量和键值缓存(Key-Value Cache, KV cache)需求,最终实现训练和推理阶段最高达25%的内存节省和40%的延迟降低。
链接: https://arxiv.org/abs/2510.13860
作者: Shivanshu Kumar,Gopalakrishnan Srinivasan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, presenting opportunities for optimization without compromising performance. Taking insights from research in AI interpretability and inference-time layer pruning, we introduce an efficient language model architecture, referred to as ShishuLM, which reduces both the parameter count and Key-Value (KV) cache requirements. Given the increasing importance of Small Language Models (SLMs) in agentic AI systems, we evaluate our approach on two SLMs of different scales. Our analysis reveals that for moderate-context scenarios, normalization coupled with attention computation is roughly linear with the input, enabling entire transformer blocks to be approximated through Multi-Layer Perceptrons (MLPs). Our results show that ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference, compared to parent models. Our experimental and analytical findings provide insights towards building more efficient SLM architectures from a pre-training standpoint.
zh
[NLP-133] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
【速读】: 该论文旨在解决医学视觉问答(Medical Visual Question Answering, MedVQA)中如何提升系统对伤口护理相关图像与自然语言查询的响应质量与临床合理性问题。其解决方案的关键在于采用基于通用领域指令微调的大语言模型(Large Language Model, LLM),结合轻量级检索增强生成(Retrieval-Augmented Generation, RAG)框架,通过引入域内文本和视觉示例进行简单索引与融合,无需额外训练或复杂重排序机制,即可有效提升推理能力、结构化输出一致性及回答质量,在多个指标上表现优异,证明了该方法作为多模态临床自然语言处理任务的高效基线可行性。
链接: https://arxiv.org/abs/2510.13856
作者: A H M Rezaul Karim,Ozlem Uzuner
机构: George Mason University (乔治梅森大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs – a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking – provides a simple and effective baseline for multimodal clinical NLP tasks.
zh
[NLP-134] Harnessing Consistency for Robust Test-Time LLM Ensemble
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)集成方法在面对异构分词方案和模型专长差异时所表现出的鲁棒性不足问题,即集成结果易受错误信号干扰而失效。其解决方案的关键在于提出一种可插拔的协同鲁棒增强(CoRE)机制,通过双重一致性建模提升集成系统的稳定性:一方面在token层面利用低通滤波器抑制因分词错位导致的高不一致不确定token,实现细粒度鲁棒性增强;另一方面在模型层面通过鼓励高自信心且与其他模型输出差异最小的预测,促进全局一致性,从而在粗粒度上增强鲁棒性。该方法可无缝集成于多种现有集成策略并显著提升性能与鲁棒性。
链接: https://arxiv.org/abs/2510.13855
作者: Zhichen Zeng,Qi Yu,Xiao Lin,Ruizhong Qiu,Xuying Ning,Tianxin Wei,Yuchen Yan,Jingrui He,Hanghang Tong
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 12 figures
Abstract:Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.
zh
[NLP-135] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging
【速读】: 该论文旨在解决低资源语言场景下标注数据稀缺导致的模型性能瓶颈问题,特别是在缺乏大量标注样本时,如何提升序列标注任务(如词性标注和命名实体识别)的准确性。其解决方案的关键在于提出Rule-to-Tag (R2T) 框架,该框架将多层级语言规则直接嵌入神经网络的训练目标中,并设计了一个自适应损失函数,其中包含一个正则化项,使模型能够以有原则的方式处理未登录词(out-of-vocabulary, OOV)的不确定性。这一方法属于“有原则的学习”(principled learning, PrL)范式,强调通过显式任务约束而非仅依赖标注样本进行训练,从而显著提升模型在极少标注数据下的泛化能力。
链接: https://arxiv.org/abs/2510.13854
作者: Mamadou K. Keita,Christopher Homan,Sebastien Diarra
机构: Rochester Institute of Technology (罗切斯特理工学院); RobotsMali
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We introduce the Rule-to-Tag (R2T) framework, a hybrid approach that integrates a multi-tiered system of linguistic rules directly into a neural network’s training objective. R2T’s novelty lies in its adaptive loss function, which includes a regularization term that teaches the model to handle out-of-vocabulary (OOV) words with principled uncertainty. We frame this work as a case study in a paradigm we call principled learning (PrL), where models are trained with explicit task constraints rather than on labeled examples alone. Our experiments on Zarma part-of-speech (POS) tagging show that the R2T-BiLSTM model, trained only on unlabeled text, achieves 98.2% accuracy, outperforming baselines like AfriBERTa fine-tuned on 300 labeled sentences. We further show that for more complex tasks like named entity recognition (NER), R2T serves as a powerful pre-training step; a model pre-trained with R2T and fine-tuned on just 50 labeled sentences outperformes a baseline trained on 300.
zh
[NLP-136] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation CIDR’26
【速读】: 该论文旨在解决私有企业数据仓库中文本到SQL(text-to-SQL)模型评估缺乏高质量、领域特定基准数据集的问题。现有研究多基于公开数据集(如Spider、Bird),但这些数据难以反映真实企业环境中复杂且敏感的查询需求,而构建此类私有场景下的基准又面临标注成本高、依赖专家人力等挑战。解决方案的关键在于提出BenchPress——一个“人在回路”(human-in-the-loop)系统,利用检索增强生成(Retrieval-Augmented Generation, RAG)与大语言模型(Large Language Models, LLMs)自动生成候选自然语言描述,并由数据库管理员(DBA)进行筛选、排序或编辑以确保语义准确性和领域一致性。该方法显著降低了人工标注时间与成本,同时提升了基准数据的质量和模型评估的可靠性。
链接: https://arxiv.org/abs/2510.13853
作者: Fabian Wenz,Omar Bouattour,Devin Yang,Justin Choi,Cecil Gregg,Nesime Tatbul,Çağatay Demiralp
机构: TU Munich (慕尼黑工业大学); MIT (麻省理工学院); Intel Labs (英特尔实验室); AWS AI Labs (亚马逊人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC)
备注: CIDR’26
Abstract:Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at this https URL and is also accessible on our website at this http URL.
zh
[NLP-137] ConsistencyAI: A Benchmark to Assess LLM s Factual Consistency When Responding to Different Demographic Groups
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不同用户角色(persona)下输出事实一致性不足的问题,即同一问题在不同用户背景设定下可能产生不一致甚至矛盾的回答。解决方案的关键在于提出并实现了一个独立于LLM厂商的基准测试工具ConsistencyAI,通过向模型输入相同问题但附加来自不同人口统计学角色的上下文提示(persona context),量化模型在跨角色情境下的回答一致性程度。具体方法包括:使用句子嵌入(sentence embeddings)计算跨角色回答间的余弦相似度,并以加权平均值作为事实一致性得分(factual consistency score)。实验表明,模型的事实一致性受模型提供方和话题类型双重影响,且该指标可为评估与改进LLM的公平性与稳定性提供客观依据。
链接: https://arxiv.org/abs/2510.13852
作者: Peter Banyas,Shristi Sharma,Alistair Simmons,Atharva Vispute
机构: Duke University (杜克大学); UNC-Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: For associated code repository, see this http URL For user-friendly web app, see this http URL
Abstract:Is an LLM telling you different facts than it’s telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI’s Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
zh
[NLP-138] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续知识更新场景下,因多次编辑操作导致的灾难性干扰(catastrophic interference)问题,即新知识的引入会破坏先前已整合的知识表示。解决方案的关键在于提出EvoEdit策略,通过逐次执行零空间对齐(sequential null-space alignment),在不改变原有和已修改知识表征的前提下,保持输出不变性,从而实现稳定且高效的模型编辑,有效缓解多轮编辑中的干扰效应,并具备良好的理论保障。
链接: https://arxiv.org/abs/2510.13851
作者: Sicheng Lyu,Yu Gu,Xinyu Wang,Jerry Huang,Sitao Luan,Yufei Cui,Xiao-Wen Chang,Peng Lu
机构: McGill University (麦吉尔大学); Mila—Quebec AI Institute (蒙特利尔魁北克人工智能研究所); SimpleWay.AI (SimpleWay.AI); Université de Montréal (蒙特利尔大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.
zh
[NLP-139] Revisiting the UID Hypothesis in LLM Reasoning Traces
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在使用链式思维(Chain-of-Thought, CoT)推理时,其推理过程中的中间步骤常出现不忠实或难以解释的问题。为解决此问题,作者基于心理语言学中的均匀信息密度(Uniform Information Density, UID)假说,提出了一种基于熵的信息流分析方法,用于量化推理轨迹中的信息密度变化。关键发现是:尽管人类交流遵循稳定的单位信息流模式,但LLMs的成功推理路径却表现出全局非均匀的信息密度波动,这挑战了机器推理应模仿人类认知的假设,并提示未来可设计更具可解释性和自适应性的推理模型。
链接: https://arxiv.org/abs/2510.13850
作者: Minju Gwak,Guijin Son,Jaehyung Kim
机构: Yonsei University (延世大学); OneLine AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics – which posits that humans communicate by maintaining a stable flow of information – we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
zh
[NLP-140] Language steering in latent space to mitigate unintended code-switching
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在实际应用中出现的非预期语种切换(code-switching)问题,该现象会显著降低模型在下游任务中的可靠性。解决方案的关键在于提出一种轻量级的推理时语言控制方法——潜在空间语言引导(latent-space language steering),其核心思想是通过主成分分析(PCA)在平行语料上识别出语言方向,并沿这些方向调整token嵌入(token embeddings)以精确调控模型输出的语言身份,同时保持语义一致性。该方法仅需少量平行数据进行校准,计算开销极低,在Qwen2.5和Llama-3.2模型上实现了高达95–99%的语言分类准确率,并将跨语言对的下一个词分布差异减少最多达42%。
链接: https://arxiv.org/abs/2510.13849
作者: Andrey Goncharov,Nikolai Kondusov,Alexey Zaytsev
机构: Applied AI Instiute(应用人工智能研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models. We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.
zh
[NLP-141] On-device System of Compositional Multi-tasking in Large Language Models EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在同时执行复杂组合任务(如从长对话中生成翻译摘要)时面临的挑战,传统参数高效微调方法(如低秩适配器 LoRA)难以有效整合多任务处理需求。解决方案的关键在于引入一个可学习的投影层(learnable projection layer),该层置于总结与翻译适配器的顶部,实现两者的高效融合,从而在保持较低计算开销的前提下支持组合任务的并行执行,相较需大规模重训练或串行处理的替代方案更具实用性与效率。
链接: https://arxiv.org/abs/2510.13848
作者: Ondrej Bohdal,Konstantinos Theodosiadis,Asterios Mpatziakas,Dimitris Filippidis,Iro Spyrou,Christos Zonios,Anastasios Drosou,Dimosthenis Ioannidis,Kyeng-Hun Lee,Jijoong Moon,Hyeonmok Ko,Mete Ozay,Umberto Michieli
机构: Samsung R&D Institute UK (三星研发研究所英国); CERTH (希腊国家技术研究中心); Samsung Research (三星研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at EMNLP 2025 (industry track)
Abstract:Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.
zh
[NLP-142] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于推测解码(speculative decoding)的推理加速问题,特别是由于大语言模型(Large Language Model, LLM)词汇表规模扩大导致的小模型 drafter 输出头参数量激增所引发的延迟瓶颈。传统方法通过固定短列表(fixed shortlist)限制 drafter 的词汇空间以降低计算开销,但存在两个缺陷:一是短列表依赖语料频率,难以跨任务泛化;二是静态筛选会抑制罕见或领域特定词元,降低每轮验证的平均接受长度。解决方案的关键在于提出 DynaSpec,一种基于上下文动态短列表机制(context-dependent dynamic shortlisting),其核心是引入轻量级、粗粒度的元分类器(meta-classifier),将输入上下文路由至少量词元聚类,从而构建动态短列表;该机制在保证验证阶段使用完整词汇表和精确性的同时,利用并行执行 draft 编码与元短列表计算的优势,在更小的短列表下仍能提升平均接受长度,显著改善推理效率与鲁棒性。
链接: https://arxiv.org/abs/2510.13847
作者: Jinbin Zhang,Nasib Ullah,Erik Schultheis,Rohit Babbar
机构: Aalto University (阿尔托大学); IST Austria (奥地利科学技术研究院); University of Bath (巴斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter’s output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter’s vocabulary to a fixed subset of the target model’s vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter’s shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter’s hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
zh
[NLP-143] Serialized EHR make for good text representations
【速读】: 该论文旨在解决现有基础模型在医疗领域中难以有效融合电子健康记录(Electronic Health Records, EHR)的表格型与事件序列特性,从而限制其对患者跨就诊时间点的长期依赖关系建模能力的问题。解决方案的关键在于提出SerialBEHRT——一个领域对齐的基础模型,通过在结构化EHR序列上进行额外预训练,扩展SciBERT架构,使其能够编码临床事件间的时序与上下文关系,从而生成更丰富的患者表征,显著提升抗生素敏感性预测任务的性能。
链接: https://arxiv.org/abs/2510.13843
作者: Zhirong Chou,Quan Qin,Shi Li
机构: Southern China University (南方科技大学); Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state of the art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.
zh
[NLP-144] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG -based Fact Checking
【速读】: 该论文旨在解决知识投毒(Knowledge Poisoning)在现实世界事实核查场景中对检索增强生成(Retrieval-Augmented Generation, RAG)系统构成的威胁问题,即攻击者通过向知识库注入对抗性内容,诱导大语言模型(Large Language Models, LLMs)基于被篡改的上下文生成受控输出,从而误导事实核查结果。传统方法多关注于误导性或恶意检索内容的影响,而本文聚焦于更复杂的场景:检索池中包含真实的支持或反驳证据,此时攻击更具挑战性。解决方案的关键在于提出一种名为ADMIT(Adversarial Multi-Injection Technique)的少样本、语义对齐的投毒攻击技术,其无需访问目标LLM、检索器或进行词元级控制,即可实现高成功率的事实核查决策翻转和虚假推理生成。实验表明,ADMIT在4种检索器、11个LLM及4个跨领域基准上均表现出强迁移能力,平均攻击成功率达86%,且在极低的投毒率(0.93 × 10⁻⁶)下仍具鲁棒性,显著优于现有最优攻击方法(提升11.2% ASR),揭示了RAG驱动的事实核查系统存在严重安全隐患。
链接: https://arxiv.org/abs/2510.13842
作者: Yutao Wu,Xiao Liu,Yinghui Li,Yifeng Gao,Yifan Ding,Jiale Ding,Xiang Zheng,Xingjun Ma
机构: Deakin University (迪肯大学); Fudan University (复旦大学); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs’ susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose \textbfADMIT (\textbfADversarial \textbfMulti-\textbfInjection \textbfTechnique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of 0.93 \times 10^-6 , and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.
zh
[NLP-145] Meronymic Ontology Extraction via Large Language Models
【速读】: 该论文旨在解决产品本体(ontology)构建过程中依赖人工、耗时且成本高昂的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)实现从原始评论文本中全自动提取产品本体,特别是以整体-部分关系(meronymies)的形式表示,从而替代传统手工建模方式。实验表明,该方法生成的本体在LLM-as-a-judge评估下优于基于BERT的现有基线,为LLMs在产品及其他领域本体抽取中的通用应用奠定了基础。
链接: https://arxiv.org/abs/2510.13839
作者: Dekai Zhang,Simone Conia,Antonio Rago
机构: Imperial College London (帝国理工学院); Sapienza University of Rome (罗马大学); King’s College London (国王学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Ontologies have become essential in today’s digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluating using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.
zh
[NLP-146] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection
【速读】: 该论文旨在解决仇恨言论检测中因训练标签偏倚和跨文化差异导致的识别难题,具体包括数据稀疏性(data sparsity)、文化混杂性(cultural entanglement)以及标签模糊性(ambiguous labeling)等问题。其核心解决方案是提出一种文化感知框架(culture-aware framework),通过构建个体的仇恨子空间(hate subspaces)来捕捉不同文化背景下对仇恨言论的独特理解;关键创新在于:利用文化属性组合建模缓解数据稀疏问题,并借助标签传播机制(label propagation)提取每种文化组合的特征表示,从而提升分类性能——实验表明该方法在所有指标上平均优于当前最优模型1.05%。
链接: https://arxiv.org/abs/2510.13837
作者: Weibin Cai,Reza Zafarani
机构: Syracuse University (雪城大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Hate speech detection has been extensively studied, yet existing methods often overlook a real-world complexity: training labels are biased, and interpretations of what is considered hate vary across individuals with different cultural backgrounds. We first analyze these challenges, including data sparsity, cultural entanglement, and ambiguous labeling. To address them, we propose a culture-aware framework that constructs individuals’ hate subspaces. To alleviate data sparsity, we model combinations of cultural attributes. For cultural entanglement and ambiguous labels, we use label propagation to capture distinctive features of each combination. Finally, individual hate subspaces, which in turn can further enhance classification performance. Experiments show our method outperforms state-of-the-art by 1.05% on average across all metrics.
zh
[NLP-147] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在生成结果时缺乏对自身不确定性的量化能力问题,即如何判断模型“知道自己不知道什么”。为实现可信人工智能系统,论文提出了一种以输出一致性为代理指标的不确定性量化(Uncertainty Quantification, UQ)方法,其关键在于设计一种高层级的、非形式化的基于相似度的聚合框架,通过比较多个采样生成结果之间的相似性来估计置信度。该框架可涵盖多种UQ方法,并进一步引入了利用小规模训练集训练置信度估计模型的新技术,实验证明其在问答、摘要和文本到SQL等复杂生成任务中能获得比基线更校准良好的置信度。
链接: https://arxiv.org/abs/2510.13836
作者: Debarun Bhattacharjya,Balaji Ganesan,Junkyu Lee,Radu Marinescu,Katsiaryna Mirylenka,Michael Glass,Xiao Shou
机构: IBM Research (IBM 研究院); Zalando (扎兰多); Baylor University (贝勒大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages including appendix, Findings of EMNLP 2025
Abstract:When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM’s generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, as well as introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
zh
[NLP-148] ConDABench: Interactive Evaluation of Language Models for Data Analysis
【速读】: 该论文旨在解决现实世界中数据分析任务常面临的目标不明确和数据脏乱问题,这些问题需要通过用户交互来理解并澄清用户意图,而现有大语言模型(LLM)在数据分析任务上的评估基准未能充分捕捉此类复杂性或提供对交互性的原生支持。解决方案的关键在于提出ConDABench框架,其核心包括:(a) 一种多智能体工作流,用于从描述公共数据集洞察的文章中生成逼真的对话式数据分析(ConDA)基准;(b) 基于该工作流生成的1,420个ConDA问题;© 一个评估工具包,首次实现对对话式数据分析工具在生成的ConDA问题上进行系统性评估。这一框架为模型开发者提供了衡量迈向真正协作型模型进展的新途径。
链接: https://arxiv.org/abs/2510.13835
作者: Avik Dutta,Priyanshu Gupta,Hosein Hasanbeig,Rahul Pratap Singh,Harshit Nigam,Sumit Gulwani,Arjun Radhakrishna,Gustavo Soares,Ashish Tiwari
机构: Microsoft(微软); Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user’s intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. \bench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and © an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.
zh
[NLP-149] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning
【速读】: 该论文旨在解决基于Transformer的模型在推理和部署过程中因多层结构与注意力头(attention heads)带来的效率瓶颈问题。现有剪枝方法中,基于梯度的头重要性评分(Head Importance Scores, HIS)虽具备可解释性和识别冗余头的能力,但仅反映梯度驱动的贡献,忽略了注意力模式的多样性。为此,作者提出一种新的剪枝准则HIES(Head Importance-Entropy Score),其核心创新在于融合了头重要性得分与注意力熵(attention entropy),从而从两个互补维度评估每个注意力头的实际贡献。实验表明,HIES剪枝相比仅使用HIS的方法,在模型质量上提升最高达15.2%,稳定性提高2.04倍,实现了高效压缩且不损失准确性与稳定性。
链接: https://arxiv.org/abs/2510.13832
作者: Minsik Choi,Hyegang Son,Changhoon Kim,Young Geun Kim
机构: Korea University (韩国国立大学); Soongsil University (中央大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages
Abstract:Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics-multiple layers and attention heads-introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for interpretability, efficiency, and ability to identify redundant heads. However, HIS alone has limitations as it captures only the gradient-driven contribution, overlooking the diversity of attention patterns. To overcome these limitations, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
zh
[NLP-150] Informed Routing in LLM s: Smarter Token-Level Computation for Faster Inference
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中因推理成本过高而导致的部署限制问题。现有方法通过动态分配每个token的计算资源以提升效率,但其依赖贪婪路由策略(greedy routing),即对每个token采用“执行或跳过”的二元决策机制,容易造成不可逆的信息丢失和次优的token选择。论文提出了一种新的范式——知情路由(informed routing),其关键在于不仅评估token的即时重要性,还引入可恢复性(recoverability)概念,衡量一个token变换过程是否能够被近似重构。为此,作者设计了一个轻量级特征预测器(Lightweight Feature Forecaster, LFF),在路由决策前预估单元输出,从而实现“执行或近似”的灵活策略,在显著降低计算开销的同时保持模型精度。实验表明,该方法在语言建模与推理任务中均实现了最优的效率-性能权衡,并且无需最终LoRA微调即可达到甚至超越需全量微调的基线方法,同时训练时间减少超过50%。
链接: https://arxiv.org/abs/2510.13831
作者: Chao Han,Yijuan Liang,Zihao Xuan,Daokuan Wu,Wei Zhang,Xiaoyu Shen
机构: Institute of Digital Twin, Eastern Institute of Technology, Ningbo(宁波东方理工大学数字孪生研究所); University of Science and Technology of China(中国科学技术大学); The Hong Kong University of Science and Technology(香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing–a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token’s immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit’s output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: this https URL
zh
[NLP-151] Users as Annotators: LLM Preference Learning from Comparison Mode
【速读】: 该论文旨在解决大规模语言模型(LLM)对齐过程中,依赖人工标注的成对偏好数据存在质量不可控的问题。传统方法通常由专业标注员对模型响应进行二元偏好判断,但成本高且难以扩展;而用户在日常交互中产生的偏好标签虽具天然优势(用户是自身查询的最佳评判者),却缺乏质量保障。解决方案的关键在于利用两个不同模型或同一模型的不同版本生成的响应之间的不对称性,构建一个用户行为模型,通过期望最大化(EM)算法估计用户的隐式质量因子,并据此过滤低质量标注数据。该方法有效提升了用户标注数据的质量控制能力,从而增强了下游任务中LLM对齐的效果。
链接: https://arxiv.org/abs/2510.13830
作者: Zhongze Cai,Xiaocheng Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data – user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user’s data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users’ annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.
zh
[NLP-152] A Linguistics-Aware LLM Watermarking via Syntactic Predictability
【速读】: 该论文旨在解决生成式 AI (Generative AI) 中文本质量与水印检测鲁棒性之间的权衡问题,尤其针对现有水印方法依赖模型特定信号(如词元级熵)导致无法实现公开可验证检测的局限性。解决方案的关键在于提出 STELA 框架,其通过动态调节水印强度以匹配语言内在的语法自由度:利用词性标注(POS)n-gram 建模的语言不确定性,在语法约束强的语境中减弱水印强度以保障文本质量,而在语言灵活性高的语境中增强水印强度以提升可检测性;该机制无需访问模型 logits,从而实现了真正意义上的公开可验证检测。
链接: https://arxiv.org/abs/2510.13829
作者: Shinwoo Park,Hyejin Park,Hyeseon Ahn,Yo-Sub Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthen it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages-analytic English, isolating Chinese, and agglutinative Korean-we show that STELA surpasses prior methods in detection robustness. Our code is available at this https URL.
zh
[NLP-153] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening
【速读】: 该论文旨在解决当前可解释人工智能(Explainable Artificial Intelligence, XAI)在心理健康筛查(Mental Health Screening, MHS)应用中存在“实验室到临床”的转化断层问题,即现有XAI技术如SHAP和LIME虽能生成技术上忠实的特征重要性评分,却难以提供临床相关且可操作的洞察,导致其在真实医疗场景中难以落地。解决方案的关键在于提出一种生成式操作框架(Generative Operational Framework),该框架以大型语言模型(Large Language Models, LLMs)为核心翻译引擎,将不同XAI工具的原始技术输出与基于检索增强生成(Retrieval-Augmented Generation, RAG)的临床指南融合,自动生成人类可读、证据支持的临床叙事,从而实现从技术透明到临床实用性的跨越。
链接: https://arxiv.org/abs/2510.13828
作者: Ratna Kandala,Akshata Kishore Moharir,Divya Arvinda Nayak
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Explainable Artificial Intelligence (XAI) has been presented as the critical component for unlocking the potential of machine learning in mental health screening (MHS). However, a persistent lab-to-clinic gap remains. Current XAI techniques, such as SHAP and LIME, excel at producing technically faithful outputs such as feature importance scores, but fail to deliver clinically relevant, actionable insights that can be used by clinicians or understood by patients. This disconnect between technical transparency and human utility is the primary barrier to real-world adoption. This paper argues that this gap is a translation problem and proposes the Generative Operational Framework, a novel system architecture that leverages Large Language Models (LLMs) as a central translation engine. This framework is designed to ingest the raw, technical outputs from diverse XAI tools and synthesize them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives. To justify our solution, we provide a systematic analysis of the components it integrates, tracing the evolution from intrinsic models to generative XAI. We demonstrate how this framework directly addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication. This paper also provides a strategic roadmap for moving the field beyond the generation of isolated data points toward the delivery of integrated, actionable, and trustworthy AI in clinical practice.
zh
[NLP-154] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL
【速读】: 该论文旨在解决当前Text-to-SQL方法在跨语言场景下的两个核心问题:一是现有评估仅关注可执行查询的准确率,忽视了语义对齐(semantic alignment)挑战,即用户意图与生成SQL语句之间的语义一致性;二是模型在非英语语言上的执行准确率显著下降,平均降幅达6个百分点。解决方案的关键在于提出一种结合组相对策略优化(Group Relative Policy Optimization, GRPO)与多语言对比奖励信号(multilingual contrastive reward signal)的新框架,通过引入基于语义相似度的奖励机制,增强模型在跨语言场景中生成SQL时的任务效率和语义准确性。实验表明,该方法在MultiSpider多语言数据集上显著提升了执行准确率(最高达87.4%)和语义准确率(最高达59.14%),且使用仅3,000个强化学习样本即可超越更大规模的零样本8B模型性能。
链接: https://arxiv.org/abs/2510.13827
作者: Ashish Kattamuri,Ishita Prasad,Meetu Malhotra,Arpita Vats,Rahul Raja,Albert Lie
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20th International Workshop on Semantic and Social Media Adaptation Personalization
Abstract:Current Text-to-SQL methods are evaluated and only focused on executable queries, overlooking the semantic alignment challenge – both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) within a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by combining a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) – all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
zh
[NLP-155] Generative AI in Heritage Practice: Improving the Accessibility of Heritage Guidance
【速读】: 该论文旨在解决专业遗产实践领域中公共导向指导文件可及性不足的问题,特别是通过生成式人工智能(Generative AI)提升此类文档的撰写效率与质量。解决方案的关键在于开发并验证一个针对遗产保护与阐释领域定制的生成式AI聊天机器人HAZEL,该模型基于大型语言模型(Large Language Model, LLM)进行微调(fine-tuning),使其在特定任务中表现优于通用模型如ChatGPT(GPT-4)。研究表明,经过领域适配的GenAI工具虽不能替代人类专家完成高复杂度的技术写作,但在自动化和加速部分写作流程方面具有显著价值,尤其适用于资源有限的遗产机构。
链接: https://arxiv.org/abs/2510.13811
作者: Jessica Witte,Edmund Lee,Lisa Brausem,Verity Shillabeer,Chiara Bonacchi
机构: University of Edinburgh (爱丁堡大学); Historic England (英格兰历史遗产保护组织)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages
Abstract:This paper discusses the potential for integrating Generative Artificial Intelligence (GenAI) into professional heritage practice with the aim of enhancing the accessibility of public-facing guidance documents. We developed HAZEL, a GenAI chatbot fine-tuned to assist with revising written guidance relating to heritage conservation and interpretation. Using quantitative assessments, we compare HAZEL’s performance to that of ChatGPT (GPT-4) in a series of tasks related to the guidance writing process. The results of this comparison indicate a slightly better performance of HAZEL over ChatGPT, suggesting that the GenAI chatbot is more effective once the underlying large language model (LLM) has been fine-tuned. However, we also note significant limitations, particularly in areas requiring cultural sensitivity and more advanced technical expertise. These findings suggest that, while GenAI cannot replace human heritage professionals in technical authoring tasks, its potential to automate and expedite certain aspects of guidance writing could offer valuable benefits to heritage organisations, especially in resource-constrained contexts.
zh
计算机视觉
[CV-0] Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
【速读】:该论文旨在解决使用预训练的2D图像编辑模型进行多视角图像编辑时,各视角间缺乏一致性的问题。现有方法通常依赖显式的3D表示优化,但存在计算耗时长和在稀疏视角下不稳定等缺陷。其解决方案的关键在于提出一种隐式3D正则化策略,通过约束生成的2D图像序列符合预训练的多视角图像分布来实现一致性;具体而言,采用耦合扩散采样(coupled diffusion sampling)技术,同时从多视角图像分布和2D编辑图像分布中采样两条轨迹,并利用耦合项强制生成图像间的多视角一致性,从而无需显式建模3D结构即可实现高效、稳定的多视角一致编辑。
链接: https://arxiv.org/abs/2510.14981
作者: Hadi Alzayer,Yunzhi Zhang,Chen Geng,Jia-Bin Huang,Jiajun Wu
机构: Stanford University (斯坦福大学); University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
zh
[CV-1] From Pixels to Words – Towards Native Vision-Language Primitives at Scale
【速读】:该论文旨在解决两个核心问题:一是明确原生视觉-语言模型(native Vision-Language Models, VLMs)与模块化VLMs的根本差异及其可克服的限制;二是推动原生VLM研究的可及性与民主化,以加速领域进展。解决方案的关键在于提出一套构建原生VLM的基本原则,并设计出名为NEO的新一代原生VLM家族——其核心创新在于通过从第一性原理出发设计的模型原语(primitive),实现像素与词表示在共享语义空间中的有效对齐、融合视觉与语言模块的优势,并内嵌多种跨模态特性以支持统一的视觉-语言编码、对齐与推理。NEO仅用390M图像-文本样本即可从零开始高效学习视觉感知,同时在密集且一体化的架构中缓解视觉-语言冲突,为可扩展、高性能的原生VLM奠定基础。
链接: https://arxiv.org/abs/2510.14979
作者: Haiwen Diao,Mingxuan Li,Silei Wu,Linjun Dai,Xiaohua Wang,Hanming Deng,Lewei Lu,Dahua Lin,Ziwei Liu
机构: S-Lab, Nanyang Technological University (南洋理工大学); Xi’an Jiaotong University (西安交通大学); SenseTime Research (商汤科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures
Abstract:The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: this https URL.
zh
[CV-2] Learning an Image Editing Model without Image Editing Pairs
【速读】:该论文旨在解决当前图像编辑模型依赖大规模输入-目标配对数据进行监督微调的问题,这类数据在现实中难以规模化获取。现有方法虽尝试使用预训练模型生成合成配对数据,但会将原始模型的伪影传播并放大至最终模型。其解决方案的关键在于提出一种无需任何配对数据的新训练范式:通过在训练过程中展开(unroll)少量步骤的扩散模型,并利用视觉语言模型(VLM)提供反馈作为直接梯度信号,实现端到端优化;同时引入分布匹配损失(DMD),约束生成图像保持在预训练模型学习到的图像流形内,从而保证视觉保真度。该方法在标准基准测试中表现优异,达到甚至超越依赖大量配对数据训练的模型性能。
链接: https://arxiv.org/abs/2510.14978
作者: Nupur Kumari,Sheng-Yu Wang,Nanxuan Zhao,Yotam Nitzan,Yuheng Li,Krishna Kumar Singh,Richard Zhang,Eli Shechtman,Jun-Yan Zhu,Xun Huang
机构: Carnegie Mellon University (卡内基梅隆大学); Adobe (Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: project page: this https URL
Abstract:Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
zh
[CV-3] rra: Explorable Native 3D World Model with Point Latents
【速读】:该论文旨在解决现有世界模型(World Models)多依赖像素对齐表示、忽视物理世界固有三维结构的问题,从而导致3D一致性差和建模效率低。其核心解决方案是提出Terra——一种原生3D世界模型,通过在内在3D潜在空间中表示和生成可探索环境来实现高保真重建与高效生成。关键创新在于提出点到高斯变分自编码器(Point-to-Gaussian Variational Autoencoder, P2G-VAE),将3D输入编码为潜在点表示,并解码为3D高斯基元以联合建模几何与外观;同时引入稀疏点流匹配网络(Sparse Point Flow Matching Network, SPFlow)生成潜在点表示,在过程中同步去噪点位置与特征,从而实现精确多视角一致性与任意视角灵活渲染,并支持通过点潜在空间的渐进式生成实现可探索的世界建模。
链接: https://arxiv.org/abs/2510.14977
作者: Yuanhui Huang,Weiliang Chen,Wenzhao Zheng,Xin Tao,Pengfei Wan,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL
Abstract:World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
zh
[CV-4] Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation ICCV2025
【速读】:该论文旨在解决如何从近身交互姿态(close-proximity human-human interactive poses)中高效生成多样化、符合人类行为先验的交互动画问题,尤其在缺乏高质量动作捕捉(motion-capture, mocap)数据的开放世界场景中。解决方案的关键在于提出 Ponimator 框架,其核心是利用交互姿态的时空先验:通过两个条件扩散模型实现功能分离——一是基于时间先验的姿势动画器(pose animator),用于从交互姿态生成动态运动序列;二是基于空间先验的姿势生成器(pose generator),可从单个姿态、文本或二者结合生成新的交互姿态。这种结构使系统能支持图像驱动交互动画、反应动画和文本到交互合成等多种任务,从而将高质量 mocap 数据中的交互知识迁移至更广泛的现实应用中。
链接: https://arxiv.org/abs/2510.14976
作者: Shaowei Liu,Chuan Guo,Bing Zhou,Jian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Accepted to ICCV 2025. Project page: this https URL
Abstract:Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.
zh
[CV-5] WithAnyone: Towards Controllable and ID Consistent Image Generation
【速读】:该论文旨在解决文本到图像生成中身份一致性(identity consistency)问题,特别是现有模型在缺乏大规模成对数据集时依赖重建训练所导致的“复制粘贴”(copy-paste)缺陷——即模型直接复制参考人脸而非在姿态、表情或光照变化下保持身份特征。其解决方案的关键在于:(1) 构建大规模多身份成对数据集 MultiID-2M,支持多人场景下的多样化参考;(2) 提出量化复制粘贴程度与身份保真度和多样性权衡的新基准;(3) 设计基于对比学习的身份损失函数(contrastive identity loss),利用成对数据实现高保真度与多样性的平衡。最终提出的扩散模型 WithAnyone 显著减少了复制粘贴现象,同时提升对姿态和表情的可控性,并保持优异的感知质量。
链接: https://arxiv.org/abs/2510.14975
作者: Hengyuan Xu,Wei Cheng,Peng Xing,Yixiao Fang,Shuhan Wu,Rui Wang,Xianfang Zeng,Daxin Jiang,Gang Yu,Xingjun Ma,Yu-Gang Jiang
机构: Fudan University (复旦大学); StepFun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 Pages; Project Page: this https URL Code: this https URL
Abstract:Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
zh
[CV-6] pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
【速读】:该论文旨在解决少步数扩散或基于流的生成模型中,教师模型预测速度(velocity)与学生模型预测“捷径”之间的格式不匹配问题,这一不匹配导致蒸馏过程复杂且常伴随质量与多样性之间的权衡。其解决方案的关键在于提出策略驱动的流模型(π-Flow),通过修改学生模型输出层以预测一个无网络依赖的策略(policy),该策略在后续子步上动态生成流速度,从而实现无需额外网络评估即可高效、准确地进行常微分方程(ODE)积分;同时引入一种新颖的模仿蒸馏方法,利用标准ℓ₂流匹配损失将策略轨迹上的速度对齐至教师模型,使学生模型仅需模仿教师行为即可稳定训练并突破质量-多样性权衡限制。
链接: https://arxiv.org/abs/2510.14974
作者: Hansheng Chen,Kai Zhang,Hao Tan,Leonidas Guibas,Gordon Wetzstein,Sai Bi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL Demos: this https URL and this https URL
Abstract:Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ( \pi -Flow). \pi -Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy’s ODE trajectory to the teacher’s, we introduce a novel imitation distillation approach, which matches the policy’s velocity to the teacher’s along the policy’s trajectory using a standard \ell_2 flow matching loss. By simply mimicking the teacher’s behavior, \pi -Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256 ^2 , it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, \pi -Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
zh
[CV-7] RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks NEURIPS2025 NEURIPS
【速读】:该论文旨在解决长时程任务中子任务分解的难题,即现有分层视觉-语言-动作(Vision-Language-Action, VLA)框架依赖人工标注或启发式规则对目标任务进行分割,导致子任务与底层视觉运动策略(visuomotor policy)训练数据分布不一致,从而影响任务执行性能。其解决方案的关键在于提出一种基于检索的演示分解器(Retrieval-based Demonstration Decomposer, RDD),通过将演示中分解后的子任务区间视觉特征与低层策略训练数据中的视觉特征对齐,实现自动化的、数据驱动的子任务划分,从而提升子任务分解的质量和下游任务的鲁棒性。
链接: https://arxiv.org/abs/2510.14968
作者: Mingxuan Yan,Yuping Wang,Zechun Liu,Jiachen Li
机构: University of California, Riverside (加州大学河滨分校); University of Michigan (密歇根大学); Meta AI (Meta人工智能)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025); Project Website: this http URL
Abstract:To tackle long-horizon tasks, recent hierarchical vision-language-action (VLAs) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic subtasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at this http URL.
zh
[CV-8] ChangingGrounding: 3D Visual Grounding in Changing Scenes
【速读】:该论文旨在解决现实场景中机器人基于自然语言指令进行3D视觉定位(3D Visual Grounding, 3DVG)时面临的挑战,即现有方法通常假设环境点云是重建且实时更新的,这导致需要频繁重新扫描、成本高昂并难以部署。为应对这一问题,作者提出将3DVG建模为一个主动式、记忆驱动的任务,并构建了首个基准测试ChangingGrounding,用于评估智能体在动态环境中如何利用历史观测、仅探索必要区域并仍能生成精确的3D边界框。其解决方案的核心在于提出Mem-ChangingGrounder,一种零样本方法,通过跨模态检索与轻量级多视角融合相结合:首先识别查询中的对象类别,检索相关记忆以指导行动;随后高效探索目标区域,在先前操作失效时回退,执行多视角扫描并将多视角证据融合,最终输出准确的物体边界框。该方法在降低探索成本的同时显著提升了定位精度,推动了面向实际应用的以记忆为中心的3DVG研究发展。
链接: https://arxiv.org/abs/2510.14965
作者: Miao Hu,Zhiwei Huang,Tai Wang,Jiangmiao Pang,Dahua Lin,Nanning Zheng,Runsen Xu
机构: Xi’an Jiaotong University (西安交通大学); Zhejiang University (浙江大学); The Chinese University of Hong Kong (香港中文大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages
Abstract:Real-world robots localize objects from natural-language instructions while scenes around them keep changing. Yet most of the existing 3D visual grounding (3DVG) method still assumes a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: this https URL .
zh
[CV-9] RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion
【速读】:该论文旨在解决降水临近预报(precipitation nowcasting)中因大气系统固有的混沌性和强时空耦合动态性所带来的建模挑战,尤其针对现有基于扩散模型的方法在可扩展性上的局限:潜空间方法需额外训练自编码器以引入注意力机制,增加复杂度并限制泛化能力;而像素空间方法计算成本高且常忽略注意力机制,难以捕捉长程时空依赖。其解决方案的关键在于将**逐标记注意力(Token-wise Attention)**原生集成至U-Net扩散模型及时空编码器中,从而在不引入高资源消耗的前提下,动态捕获多尺度空间交互与时间演化特征,无需独立的潜空间模块即可实现高效、高保真且鲁棒的降水预测。
链接: https://arxiv.org/abs/2510.14962
作者: Thao Nguyen,Jiaqi Ma,Fahad Shahbaz Khan,Souhaib Ben Taieb,Salman Khan
机构: Mohamed Bin Zayed University of AI (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the spatio-temporal encoder that dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.
zh
[CV-10] C4D: 4D Made from 3D through Dual Correspondences ICCV2025
【速读】:该论文旨在解决从单目视频中恢复4D(3D空间+时间)结构的问题,即联合估计动态场景的几何形状与相机位姿。传统基于点图(pointmap)的3D重建方法(如DUSt3R)在静态场景中表现优异,但在动态场景中因运动物体违反多视角几何约束而导致重建失真。解决方案的关键在于提出C4D框架,通过引入时间一致性对应关系扩展现有3D重建范式至4D:一方面预测点图,另一方面同时捕捉短时光流(short-term optical flow)和长时点跟踪(long-term point tracking);特别地,设计了一个动态感知的点跟踪器以提供额外的运动信息,从而辅助生成运动掩膜(motion mask),有效分离移动物体与静态背景,为动态场景提供更可靠的重建引导;此外,通过一组动态场景优化目标实现逐帧3D几何与相机参数恢复,并将2D轨迹提升为平滑的3D轨迹,最终实现完整的4D结构重建。
链接: https://arxiv.org/abs/2510.14960
作者: Shizun Wang,Zhenxiang Jiang,Xingyi Yang,Xinchao Wang
机构: National University of Singapore (新加坡国立大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICCV 2025
Abstract:Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: this https URL
zh
[CV-11] RealDPO: Real or Not Real that is the Preference
【速读】:该论文旨在解决当前视频生成模型在复杂动作合成中存在的自然性、流畅性和情境一致性不足的问题,这一局限性严重制约了其在实际场景中的应用。解决方案的关键在于提出一种名为RealDPO的新对齐范式,该方法利用真实世界视频作为正样本进行偏好学习,通过引入针对运动真实性的定制化损失函数,实现对生成动作的迭代自修正优化。相比传统监督微调(Supervised Fine-Tuning, SFT)提供的有限纠错反馈,RealDPO借助直接偏好优化(Direct Preference Optimization, DPO)机制,在真实视频与模型错误输出之间建立对比关系,从而显著提升生成动作的真实感与质量。
链接: https://arxiv.org/abs/2510.14955
作者: Guo Cheng,Danni Yang,Ziqi Huang,Jianlou Si,Chenyang Si,Ziwei Liu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); S-Lab, Nanyang Technological University (南洋理工大学S-Lab); University of Electronic Science and Technology of China (中国电子科技大学); Nanjing University (南京大学); SenseTime Research (商汤科技研究部)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code: this https URL Project Page: this https URL
Abstract:Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
zh
[CV-12] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression
【速读】:该论文旨在解决全身多模态人体运动生成中的两大核心问题:一是设计有效的运动生成机制,二是将文本、语音和音乐等多种模态信息融合到统一框架中。其解决方案的关键在于提出一种连续掩码自回归运动变换器(continuous masked autoregressive motion transformer),该结构引入门控线性注意力(gated linear attention)和RMSNorm模块,以增强对关键动作的关注并抑制异常运动或异质模态分布带来的不稳定性;同时,采用DiT(Diffusion Transformer)结构从变换器向目标扩散条件信号,进一步提升运动生成质量与多模态泛化能力,并通过AdaLN与交叉注意力机制实现多模态信号的有效融合。
链接: https://arxiv.org/abs/2510.14954
作者: Zhe Li,Weihao Yuan,Weichao Shen,Siyu Zhu,Zilong Dong,Chang Xu
机构: University of Sydney (悉尼大学); Alibaba Group (阿里巴巴集团); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
zh
[CV-13] From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance
【速读】:该论文旨在解决现有语言引导类人机器人行走方法中存在的多阶段处理流程带来的累积误差、高延迟以及语义与控制耦合弱的问题。传统方法通常包括人类动作解码、适配至机器人形态并由物理控制器跟踪,这一过程不仅复杂且不可靠。解决方案的关键在于提出RoboGhost框架,其核心创新是摒弃显式的动作解码和形态重定向步骤,直接将语言驱动的运动潜在表示(motion latents)作为策略条件,使扩散模型能够从噪声中直接去噪生成可执行的动作,从而保持语义意图一致性,并支持快速、响应式的控制;同时结合因果Transformer与扩散机制的混合运动生成器,确保长时程行为的一致性与稳定性,同时保留多样性,实现精准的类人行为控制。
链接: https://arxiv.org/abs/2510.14952
作者: Zhe Li,Cheng Chi,Yangyang Wei,Boan Zhu,Yibo Peng,Tao Huang,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang,Chang Xu
机构: University of Sydney (悉尼大学); BAAI (北京智源人工智能研究院); Harbin Institute of Technology (哈尔滨工业大学); Hong Kong University of Science and Technology (香港科技大学); Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision-language-action humanoid systems.
zh
[CV-14] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
【速读】:该论文旨在解决视频生成模型在长序列生成中难以同时保证场景一致性(scene consistency)与精确相机控制(camera controllability)的问题,尤其是在跨越时间边界时易因直接复用空间邻近帧而导致动态元素错误保留。解决方案的关键在于提出一种双时空条件机制(dual spatio-temporal conditioning),即同时利用时间相邻帧确保运动连续性、空间相邻内容维持场景一致性;并引入3D场景记忆(3D scene memory)——通过动态SLAM结合新型动态掩码策略(dynamic masking strategy)从输入视频中提取静态几何结构,并将其投影至任意目标视角作为几何一致的3D空间提示(spatial prompt),从而在保持长期空间连贯性的同时允许动态区域基于时间上下文自然演化,实现高效且真实的视频生成。
链接: https://arxiv.org/abs/2510.14945
作者: JoungBin Lee,Jaewoo Jung,Jisang Han,Takuya Narihira,Kazumi Fukuda,Junyoung Seo,Sunghwan Hong,Yuki Mitsufuji,Seungryong Kim
机构: KAIST AI; Sony AI; ETH Zürich; Sony Group Corporation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page : this https URL
Abstract:We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page : this https URL
zh
[CV-15] MaskCaptioner : Learning to Jointly Segment and Caption Object Trajectories in Videos
【速读】:该论文旨在解决密集视频目标描述(Dense Video Object Captioning, DVOC)任务中因训练策略分离导致性能受限的问题。DVOC要求模型同时具备时空感知能力和自然语言描述能力,而以往方法由于依赖手动标注且采用分阶段训练,难以实现端到端优化。其解决方案的关键在于利用先进的视觉语言模型(Vision-Language Model, VLM)生成时空定位实体的合成描述,从而扩展LVIS和LV-VIS数据集为LVISCap和LV-VISCap,并基于此预训练一个端到端的MaskCaptioner模型,该模型可联合完成目标检测、分割、跟踪与caption生成,在三个基准测试集VidSTG、VLN和BenSMOT上达到当前最优性能。
链接: https://arxiv.org/abs/2510.14904
作者: Gabriel Fiastre,Antoine Yang,Cordelia Schmid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 8 figures
Abstract:Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at this https URL.
zh
[CV-16] Leverag ing Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection
【速读】:该论文旨在解决现有半监督视频异常检测(VAD)方法在检测涉及物体交互的复杂异常时表现不佳,且缺乏可解释性的问题。解决方案的关键在于利用多模态大语言模型(MLLM)提取并解析物体随时间变化的活动与交互关系,通过查询MLLM生成描述正常视频中物体对活动与交互的文本表示,并在测试阶段将这些文本描述与训练集中正常视频的描述进行对比以实现异常检测,从而不仅提升了对交互型异常的检测能力,还天然具备可解释性。
链接: https://arxiv.org/abs/2510.14896
作者: Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz
机构: University of South Florida (南佛罗里达大学); Mitsubishi Electric Research Laboratories (三菱电机研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.
zh
[CV-17] ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention
【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, VAR)模型在文本到图像生成中缺乏精确且灵活控制机制的问题,而当前可控生成主要集中在扩散模型上。为填补这一空白,作者提出ScaleWeaver框架,其核心创新在于改进的MMDiT模块中引入了参考注意力(Reference Attention)机制:该机制摒弃了图像→条件的冗余注意力计算,显著降低计算开销并稳定控制信号注入;同时通过参数高效微调策略,仅引入少量可学习参数并辅以零初始化线性投影,实现对控制信息的有效整合而不破坏基础VAR模型的生成能力。此设计使ScaleWeaver在保持高保真度和高效推理的同时,实现了优于扩散方法的可控生成性能。
链接: https://arxiv.org/abs/2510.14882
作者: Keli Liu,Zhendong Wang,Wengang Zhou,Shaodong Xu,Ruixiao Dong,Houqiang Li
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image generation with visual autoregressive~(VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary attention from image \rightarrow condition, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and equipping a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
zh
[CV-18] BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data
【速读】:该论文旨在解决现有碰撞预测方法无法有效区分自车(ego-vehicle)威胁与不涉及自车的随机事故,导致实际部署中误报率过高的问题。其解决方案的关键在于提出BADAS系列模型,该模型基于Nexar的真实世界行车记录仪碰撞数据集进行训练,该数据集是首个专为自车中心评估设计的基准;通过重新标注主流基准以识别自车参与情况、添加共识告警时间标签,并在必要时合成负样本,从而实现公平的平均精度(AP)/曲线下面积(AUC)及时间维度上的评估。BADAS采用端到端训练的V-JEPA2骨干网络,包含两个版本:BADAS-Open(使用1.5k公开视频训练)和BADAS1.0(使用40k私有视频训练),在DAD、DADA-2000、DoTA和Nexar等多个数据集上均达到当前最优的AP/AUC表现,且相比前向碰撞预警系统(ADAS)基线模型产生更符合现实的时间到碰撞估计。
链接: https://arxiv.org/abs/2510.14876
作者: Roni Goldshmidt,Hamish Scott,Lorenzo Niccolini,Shizhan Zhu,Daniel Moura,Orly Zvitia
机构: getnexar.com
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar’s real-world dashcam collision dataset – the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.
zh
[CV-19] OUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
【速读】:该论文旨在解决现有手-物体交互(Hand-Object Interaction, HOI)生成研究中因受限于固定抓握模式而导致的多样性不足问题,即当前方法通常依赖物理先验(如力闭合)或通用意图指令进行控制,难以捕捉日常生活中多样化的自由形式交互(如推、戳、旋转等)。其解决方案的关键在于提出Free-Form HOI Generation任务,并构建WildO2数据集(包含4.4k个独特交互实例,覆盖92种意图和610类物体),同时设计了TOUCH框架——一个基于多层级扩散模型的三阶段生成流程,通过显式接触建模实现细粒度语义控制,并结合接触一致性与物理约束优化以确保生成结果的物理合理性与多样性。
链接: https://arxiv.org/abs/2510.14874
作者: Guangyi Han,Wei Zhai,Yuhang Yang,Yang Cao,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method’s ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is \hrefthis https URLhere .
zh
[CV-20] Multi-modal video data-pipelines for machine learning with minimal human supervision
【速读】:该论文旨在解决如何在少样本甚至无监督条件下,将多种视觉模态(如视频、图像、文本等)高效融合以实现对现实世界更全面理解的问题。其核心挑战在于传统机器学习模型多为单模态(如RGB图像与语义标签或文本与情感分类),而真实世界本质上是多模态的,需整合不同模态信息才能实现更深层次的认知。解决方案的关键在于:首先构建一个全自动数据流水线,利用预训练专家模型(pre-trained experts)并基于原始视频进行程序化组合;其次采用专为多模态数据设计的PHG-MAE模型进行高效知识蒸馏,最终获得仅含100万参数(1M)却能媲美约3亿参数(300M)模型性能的轻量化架构。该方法显著降低了多模态建模的标注依赖和计算成本,并已在消费级硬件上成功部署用于实时语义分割与深度估计等任务。
链接: https://arxiv.org/abs/2510.14862
作者: Mihai-Cristian Pîrvu,Marius Leordeanu
机构: Institute of Mathematics of the Romanian Academy ”Simion Stoilow”; Faculty of Automatic Control and Computer Science, National University of Science and Technology POLITEHNICA
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:The real-world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost. Similarly for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e. rgb - semantic or text - sentiment_class). Recent trends go towards bi-modality, where images and text are learned together, however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model which was efficiently distilled into a low-parameter (1M) can have competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.
zh
[CV-21] A Multi-Task Deep Learning Framework for Skin Lesion Classification ABCDE Feature Quantification and Evolution Simulation
【速读】:该论文旨在解决皮肤病变(skin lesions)自动分析中缺乏可解释性的问题,特别是如何将临床常用的ABCDE法则(Asymmetry, Border irregularity, Color variation, Diameter, and Evolving)与深度学习模型相结合,以提升诊断的透明度和医生对模型决策的理解。其解决方案的关键在于提出了一种新型深度学习框架,不仅能对皮肤病变进行分类(准确率达89%,恶性黑色素瘤AUC为0.96),还能量化评估每个ABCD特征的具体得分,并通过潜在空间中的特征轨迹可视化模拟病变从良性痣向恶性黑色素瘤演化的动态过程,从而实现从黑箱模型到可解释医学辅助诊断工具的转变。
链接: https://arxiv.org/abs/2510.14855
作者: Harsha Kotla,Arun Kumar Rajasekaran,Hannah Rana
机构: University of Cambridge (剑桥大学); Harvard Medical School (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and also quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening more windows for future exploration. The A, B, C, and D values are quantified particularly within this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevuses to malignant melanoma. The experiments are conducted using the HAM10000 dataset that contains around ten thousand images of skin lesions of varying stages. In summary, the classification worked with an accuracy of around 89 percent, with melanoma AUC being 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.
zh
[CV-22] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
【速读】:该论文旨在解决当前视频生成模型在想象性场景(imaginative scenarios)中性能显著下降的问题,这类场景通常涉及稀有共现概念与远距离语义关系,超出模型训练分布。现有方法依赖测试时缩放(test-time scaling)提升视频质量,但受限于固定搜索空间和静态奖励设计,难以适应此类复杂提示。解决方案的关键在于提出ImagerySearch——一种提示引导的自适应测试时搜索策略,能够根据提示中的语义关系动态调整推理搜索空间和奖励函数,从而在挑战性的想象场景中生成更连贯、视觉上更合理的视频。
链接: https://arxiv.org/abs/2510.14847
作者: Meiqi Wu,Jiashu Zhu,Xiaokun Feng,Chubin Chen,Chen Zhu,Bingze Song,Fangyuan Mao,Jiahong Wu,Xiangxiang Chu,Kaiqi Huang
机构: UCAS; AMAP, Alibaba Group; CRISE; THU; SEU
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
zh
[CV-23] Backdoor Unlearning by Linear Task Decomposition
【速读】:该论文旨在解决基础模型(Foundation Models)在面对对抗扰动和定向后门攻击时的安全性问题,尤其是如何在不损害模型通用能力的前提下实现后门的移除。现有方法依赖昂贵的微调来覆盖有害行为,常导致其他任务性能下降。论文的关键创新在于发现后门模式在模型权重空间中与良性任务高度解耦(disentangled),这一特性使得可以精准隔离并消除后门影响,同时最小化对干净数据性能的干扰。基于此洞察,作者提出一种简单有效的遗忘(unlearning)方法,通过利用这种解耦结构,在已知攻击情况下实现近乎完美的后门移除,且平均保留96%的干净准确率;即使在攻击未知时,也能通过逆向工程生成触发器实现成功去毒,显著优于当前最先进防御方法。
链接: https://arxiv.org/abs/2510.14845
作者: Amel Abdelraheem,Alessandro Favero,Gerome Bovet,Pascal Frossard
机构: EPFL, Lausanne, Switzerland (瑞士洛桑联邦理工学院); Cyber-Defence Campus, armasuisse, Switzerland (瑞士武装部队网络防御校园)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor’s influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.
zh
[CV-24] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在执行精细操作任务时,缺乏对关键三维(3D)结构的理解与推理能力的问题。解决方案的关键在于提出QDepth-VLA框架,通过引入一个辅助的深度预测任务,利用专门设计的深度专家网络预测由VQ-VAE编码器生成的深度图量化潜在标记(quantized latent tokens),从而让模型学习到具备深度感知能力的表征,有效捕捉重要的几何线索,提升空间推理能力和操作任务表现。
链接: https://arxiv.org/abs/2510.14836
作者: Yixuan Li,Yuhui Chen,Mingcai Zhou,Haoran Li
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
zh
[CV-25] Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data
【速读】:该论文试图解决生成式 AI 在肿瘤分割任务中因缺乏大规模体素级标注数据而导致性能瓶颈的问题,此类数据难以获取且依赖医学专家标注。解决方案的关键在于利用合成数据增强真实数据的有效性,从而显著提升模型训练效率——研究发现,在仅使用500例真实扫描配合合成数据的情况下即可达到与1,500例真实扫描相当的分割性能,这表明合成数据能够“陡化”数据扩展规律(data scaling laws),使模型在更少真实数据下实现更优表现。基于此认知,作者构建了AbdomenAtlas 2.0,一个包含10,135例CT扫描和15,130个肿瘤实例的多器官(胰腺、肝脏、肾脏、结肠、食管和子宫)体素级标注数据集,其规模远超现有公开数据集,并在分布内和分布外测试中分别实现了+7%和+16%的Dice相似系数(DSC)提升,为腹部肿瘤AI分割提供了强有力的训练基础。
链接: https://arxiv.org/abs/2510.14831
作者: Qi Chen,Xinze Zhou,Chen Liu,Hao Chen,Wenxuan Li,Zekun Jiang,Ziyan Huang,Yuxuan Zhao,Dexin Yu,Junjun He,Yefeng Zheng,Ling Shao,Alan Yuille,Zongwei Zhou
机构: Johns Hopkins University (约翰霍普金斯大学); UCAS-Terminus AI Lab, University of Chinese Academy of Sciences (中国科学院大学Terminus人工智能实验室); Hong Kong Polytechnic University (香港理工大学); University of Cambridge (剑桥大学); Sichuan University (四川大学); Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Qilu Hospital of Shandong University (山东大学齐鲁医院); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0–a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation–based on lessons from the JHH dataset–for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
zh
[CV-26] FraQAT: Quantization Aware Training with Fractional bits
【速读】:该论文旨在解决大容量生成式 AI 模型在移动设备上部署受限的问题,主要挑战在于模型参数的高精度(如32位浮点数)导致内存占用和计算资源需求过高,难以适配智能手机等边缘设备。为应对这一问题,作者提出了一种新的分数位量化(fractional bits quantization, \short)方法,其关键创新在于:从32位逐步降低至4位每参数的精度,并在优化过程中引入分数位(fractional bits)以保持高质量的生成效果。该方案在多个扩散模型(如SD3.5-Medium、Sana、PixArt和FLUX.1-schnell)上验证了有效性,相较标准量化感知训练(QAT)实现了更低的Fréchet Inception Distance(FiD),同时成功将Sana模型部署于搭载高通骁龙8 Elite处理器的三星S25U手机上运行。
链接: https://arxiv.org/abs/2510.14823
作者: Luca Morreale,Alberto Gil C. P. Ramos,Malcolm Chadwick,Mehid Noroozi,Ruchika Chavhan,Abhinav Mehrotra,Sourav Bhattacharya
机构: Samsung AI Center (三星人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, \eg, in \INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality in previous aggressive quantization, we propose a new fractional bits quantization (\short) approach. The novelty is a simple yet effective idea: we progressively reduce the model’s precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that the \short yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, \pixart, and FLUX.1-schnell, while achieving 4-7% lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
zh
[CV-27] Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning
【速读】:该论文旨在解决现有轨迹表示学习(Trajectory Representation Learning, TRL)方法忽视轨迹形成过程中外部环境因素和内部路径选择行为的问题,即当前方法将轨迹视为孤立的时空序列,未能充分建模其背后的环境感知与决策机制。解决方案的关键在于提出一个统一框架PRTraj,其核心创新包括:1)引入环境感知模块(Environment Perception Module),通过捕捉周围兴趣点(POI)分布的多粒度语义信息来增强路网表征;2)设计显式路径选择编码器(Route Choice Encoder),将轨迹中的路段转换序列建模为一系列决策过程,从而显式建模路径选择行为;最终将这些路径选择感知的表示进行聚合,生成全局轨迹嵌入,显著提升了下游任务的性能与数据效率。
链接: https://arxiv.org/abs/2510.14819
作者: Ji Cao,Yu Wang,Tongya Zheng,Zujie Ren,Canghong Jin,Gang Chen,Mingli Song
机构: Zhejiang University (浙江大学); Hangzhou City University (杭州城市大学); Zhejiang Lab (浙江省实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment \textbfPerception and explicit \textbfRoute choice modeling for effective \textbfTrajectory representation learning, dubbed \textbfPRTraj. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: this https URL.
zh
[CV-28] Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports Fewer Masks
【速读】:该论文旨在解决早期肿瘤检测中人工智能(AI)模型训练依赖大量手动标注肿瘤掩膜(tumor masks)这一高成本、低效率的问题。传统方法需要放射科医生逐 voxel 标注肿瘤边界,耗时耗资巨大,严重限制了AI在多瘤种、大规模筛查中的应用。其解决方案的关键在于提出 R-Super 方法,利用临床实践中广泛存在的医学报告(含肿瘤大小、数量、形态及病理信息)作为弱监督信号来训练AI模型进行肿瘤分割,从而大幅减少对精细掩膜的依赖。实验表明,仅用10万份报告即可达到使用723个掩膜训练的效果,并结合两者进一步提升敏感性和特异性,实现对多种罕见器官肿瘤(如脾脏、胆囊等)的有效分割,突破了以往缺乏公开掩膜数据的瓶颈。
链接: https://arxiv.org/abs/2510.14803
作者: Pedro R. A. S. Bassi,Xinze Zhou,Wenxuan Li,Szymon Płotka,Jieneng Chen,Qi Chen,Zheren Zhu,Jakub Prządo,Ibrahim E. Hamacı,Sezgin Er,Yuhan Wang,Ashwin Kumar,Bjoern Menze,Jarosław B. Ćwikła,Yuyin Zhou,Akshay S. Chaudhari,Curtis P. Langlotz,Sergio Decherchi,Andrea Cavalli,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Early tumor detection save lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks–detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor’s size, number, appearance, and sometimes, pathology results–information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.14803 [cs.CV] (or arXiv:2510.14803v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2510.14803 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-29] Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from HE Whole Slide Images
【速读】:该论文旨在解决当前计算病理学中基础模型(foundation models)因任务无关性而忽视器官特异性形态特征的问题,这些特征可能反映不同的生物学过程并显著影响肿瘤行为、治疗反应及患者预后。其解决方案的关键在于提出一种新型可解释的人工智能模型PRISM(Prognostic Representation of Integrated Spatial Morphology),该模型通过引入每个形态学类别内的连续变异谱来刻画表型多样性,从而更真实地反映恶性转化是一个渐进的进化过程而非突变式转变的原理。PRISM在424例III期结直肠癌(colorectal cancer, CRC)患者的874万张组织切片上训练,展现出优于现有CRC特异性方法和AI基础模型的预后性能(五年总生存率AUC达0.70 ± 0.04,准确率68.37% ± 4.75%),且具备性别无差异稳健性和跨临床病理亚组的稳定性。
链接: https://arxiv.org/abs/2510.14800
作者: Usama Sajjad,Abdul Rehman Akbar,Ziyu Su,Deborah Knight,Wendy L. Frankel,Metin N. Gurcan,Wei Chen,Muhammad Khalid Khan Niazi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task agnostic methodologies that can overlook organ-specific crucial morphological patterns that represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity and reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 ± 0.04; accuracy = 68.37% ± 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.
zh
[CV-30] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
【速读】:该论文旨在解决开放词汇目标检测(Open-vocabulary object detection, OVD)中因依赖直接图像-文本匹配而导致的鲁棒性不足问题,尤其是在拥挤或遮挡场景下对未见类别的识别性能受限。其解决方案的关键在于引入结构化的视觉链式思维(Chain-of-Thought, CoT)推理机制,将对象理解分解为三个可解释步骤:(1)区域感知(即使是对未见对象),(2)通过零样本推理进行类别识别,以及(3)背景定位以分离语义复杂的物体。其中,第三步自然引出对比背景学习(Contrastive Background Learning, CBL),利用预计算的背景线索作为负样本,促进物体与背景特征解耦,从而构建一个针对复杂场景鲁棒伪标签生成的集成流程。
链接: https://arxiv.org/abs/2510.14792
作者: Hojun Choi,Youngsun Lim,Jaeyo Shin,Hyunjung Shim
机构: KAIST AI (韩国科学技术院人工智能); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 13 Figures, 12 Tables
Abstract:Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.
zh
[CV-31] MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks
【速读】:该论文旨在解决微型飞行器(Micro Air Vehicle, MAV)集群在复杂环境中可靠通信的问题,尤其是在传统基于无线电的方法因频谱拥挤、干扰和高功耗而受限的情况下。解决方案的关键在于受蜜蜂“摆尾舞”启发的视觉通信框架,通过MAV执行特定的运动模式(如上下、左右、左上右、左下右四种运动基元)来编码信息(如方向和距离),并利用事件相机被动捕获这些运动信号,再通过预定义的视觉码本与轻量级脉冲神经网络(Spiking Neural Network, SNN)结合事件帧分割模型进行解码,从而实现低功耗、鲁棒性强的集群间通信。
链接: https://arxiv.org/abs/2510.14770
作者: Zhang Nengbo,Hann Woei Ho,Ye Zhou
机构: School of Aerospace Engineering, Engineering Campus, Universiti Sains Malaysia, 14300 Nibong Tebal, Pulau Pinang, Malaysia (航空航天工程学院,工程校区,马来西亚理科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments, where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols (start'',
end’‘, 1'',
0’'). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework’s effectiveness, which demonstrates accurate decoding and low power consumption, and highlights its potential as an energy-efficient alternative for MAV communication in constrained environments.
zh
[CV-32] Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality
【速读】:该论文旨在解决火星地形高度图(heightmap)中因卫星成像和传输限制导致的缺失值问题,以提升虚拟现实(Virtual Reality)在行星任务规划、科学分析与宇航员训练中的模拟可靠性。现有方法多依赖简单的插值技术(如逆距离加权、克里金法),难以保持几何一致性;而基于地球数据训练的条件深度学习模型无法直接迁移至火星场景。解决方案的关键在于提出一种基于无条件扩散模型(unconditional diffusion model)的表面重建方法,通过在NASA HiRISE数据集上增强生成的12000张火星高度图进行训练,并采用非均匀重缩放策略捕捉多尺度地形特征后统一调整至128×128分辨率,从而实现高精度且视觉感知一致的填补效果,在RMSE和LPIPS指标上显著优于传统方法。
链接: https://arxiv.org/abs/2510.14765
作者: Giuseppe Lorenzo Catalano,Agata Marta Soccini
机构: Università degli Studi di Torino (都灵大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 21 pages, 9 figures
Abstract:Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make the Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth’s comprehensive datasets enables the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12000 Martian heightmaps derived from NASA’s HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and Navier-Stokes algorithm, on an evaluation set of 1000 samples. Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.
zh
[CV-33] LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement
【速读】:该论文旨在解决低光照图像增强(Low-light Image Enhancement, LLIE)中因像素级信息严重退化导致特征表示不可靠的问题,从而引发纹理恢复不佳、颜色不一致及伪影等缺陷。解决方案的关键在于提出LightQANet框架,其核心创新包括两个模块:一是静态建模视角下的光量化模块(Light Quantization Module, LQM),通过显式提取并量化与光照相关的特征因子,强化光照不变表示的学习,缓解不同光照水平下的特征不一致性;二是动态适应视角下的光感知提示模块(Light-Aware Prompt Module, LAPM),将光照先验编码为可学习提示,动态引导特征学习过程,使模型能够灵活应对复杂且连续变化的光照条件,从而实现跨多种照明场景的一致且鲁棒的图像质量提升。
链接: https://arxiv.org/abs/2510.14753
作者: Xu Wu,Zhihui Lai,Xianxu Hou,Jie Zhou,Ya-nan Zhang,Linlin Shen
机构: Shenzhen University (深圳大学); Nanyang Technological University (南洋理工大学); Xi’an Jiaotong-Liverpool University (西交利物浦大学); Changsha University of Science and Technology (长沙理工大学); Sichuan Normal University (四川师范大学); University of Nottingham Ningbo China (诺丁汉大学宁波分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifact. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.
zh
[CV-34] DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models NEURIPS2025
【速读】:该论文旨在解决视觉分类器(visual classifiers)决策过程缺乏透明性与可解释性的问题,尤其在无法获取训练数据或真实标签的情况下,如何生成全局、自然语言形式的模型解释。其解决方案的关键在于提出 DEXTER 框架,该框架结合扩散模型(diffusion models)与大语言模型(large language models),通过优化文本提示(text prompts)生成类条件合成图像(class-conditional images),以激活目标分类器,并利用这些合成样本诱导出描述类别特定决策模式和偏见的自然语言报告,从而实现无需访问原始训练数据即可进行模型行为解释的能力。
链接: https://arxiv.org/abs/2510.14741
作者: Simone Carnemolla,Matteo Pennisi,Sarinda Samarasinghe,Giovanni Bellitto,Simone Palazzo,Daniela Giordano,Mubarak Shah,Concetto Spampinato
机构: University of Catania (卡塔尼亚大学); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025 (spotlight)
Abstract:Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier’s decision process without access to training data or ground-truth labels. We demonstrate DEXTER’s flexibility across three tasks-activation maximization, slice discovery and debiasing, and bias explanation-each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at this https URL.
zh
[CV-35] Free-Grained Hierarchical Recognition
【速读】:该论文旨在解决 hierarchical image classification(层级图像分类)在现实场景中因标注粒度不一致而导致性能下降的问题。现有方法通常假设所有样本均具有细粒度的完整标注,但实际标注常受图像质量、标注者专业水平及任务需求影响,导致标签粒度混杂(如远距离鸟仅标注为“Bird”,近距离则细化至“Bald eagle”)。解决方案的关键在于提出一种名为 free-grain learning(自由粒度学习)的新范式,其核心是利用视觉-语言模型(如CLIP)生成伪属性(pseudo-attributes)以增强语义指导,并结合半监督学习提供视觉一致性约束,从而在异质监督(heterogeneous supervision)下提升模型鲁棒性与准确性。同时,作者构建了 ImageNet-F 基准数据集,模拟真实人类标注行为,推动该问题的研究进展。
链接: https://arxiv.org/abs/2510.14737
作者: Seulki Park,Zilin Wang,Stella X. Yu
机构: University of Michigan (密歇根大学); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages
Abstract:Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.
zh
[CV-36] Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection
【速读】:该论文旨在解决当前目标检测方法中多尺度特征表示缺乏跨层依赖建模的问题,从而限制了对尺度变化较大目标的上下文信息捕捉能力。解决方案的关键在于提出一种新颖的跨层特征自注意力模块(Cross-Layer Feature Self-Attention Module, CFSAM),其核心由三部分组成:卷积局部特征提取器、基于Transformer的全局建模单元以高效捕获跨层交互关系,以及特征融合机制用于恢复和增强原始表示。该模块在SSD300框架中显著提升了检测性能,在PASCAL VOC上达到78.6% mAP(基线为75.5%),在COCO上达到52.1% mAP(基线为43.1%),且训练收敛更快,计算开销可控。
链接: https://arxiv.org/abs/2510.14726
作者: Dingzhou Xie,Rushi Lan,Cheng Pang,Enhao Ning,Jiahao Zeng,Wei Zheng
机构: Guilin University of Electronic Technology (桂林电子科技大学); University of Science and Technology Beijing (北京科技大学); Jiangxi University of Technology (江西理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.
zh
[CV-37] Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models
【速读】:该论文旨在解决现有视频相机运动分类(Camera Movement Classification, CMC)模型在历史影像数据上泛化能力不足的问题。由于大多数CMC方法基于现代高质量视频数据集训练,其在低质量、高噪声的历史影像(如二战档案影片)中表现尚不明确。解决方案的关键在于首次系统性评估五种主流视频分类模型在HISTORIAN数据集上的性能,该数据集包含专家标注的二战影像;其中最优模型Video Swin Transformer在仅有有限训练样本的情况下达到80.25%准确率,表明深度模型在低质量视频中仍具较强适应性,为未来融合多模态输入与改进时序架构以提升历史视频理解提供了重要方向。
链接: https://arxiv.org/abs/2510.14713
作者: Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig
机构: TU Wien (维也纳工业大学); St. Pölten University of Applied Sciences (圣珀尔滕应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 5 pages, accepted at AIROV2025
Abstract:Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.
zh
[CV-38] Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery
【速读】:该论文旨在解决鲸类种群大规模自动化监测难题,传统调查方法成本高且难以扩展,而现有基于高分辨率(Very High-Resolution, VHR)卫星影像的检测技术受限于标注数据匮乏、图像质量与环境条件变化大以及构建鲁棒机器学习流水线的成本高昂。解决方案的关键在于提出一种半自动化流程:首先利用统计异常检测方法识别空间离群点(即“有趣点”),作为潜在鲸类目标候选区域;随后结合一个面向专家的Web标注界面,快速验证这些候选点。该方法无需依赖标注训练数据,在三个基准场景中实现了90.3%至96.4%的召回率,同时将需人工检查区域面积减少高达99.8%(从超1000平方公里降至不足2平方公里),为未来基于卫星的海洋哺乳动物监测提供了可扩展的第一步。
链接: https://arxiv.org/abs/2510.14709
作者: Caleb Robinson,Kimberly T. Goetz,Christin B. Khan,Meredith Sackett,Kathleen Leonard,Rahul Dodhia,Juan M. Lavista Ferres
机构: Microsoft AI for Good Research Lab (微软AI for Good研究实验室); National Marine Fisheries Service (国家海洋渔业服务); NOAA (美国国家海洋和大气管理局); Azura Consulting (阿祖拉咨询公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. “interesting points”. We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% – from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at this https URL.
zh
[CV-39] Leverag ing Learned Image Prior for 3D Gaussian Compression ICCV2025
【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)压缩中因缺乏学习到的先验知识而导致的率失真(rate-distortion)性能瓶颈问题。现有压缩方法虽能显著降低存储开销,但难以有效恢复压缩引起的画质退化。解决方案的关键在于引入一个基于图像空间先验的修复网络(restoration network),该网络以初始压缩后的高斯为输入,建模退化高斯与原始高斯之间的图像域压缩伪影;同时通过引入粗略渲染残差(coarse rendering residuals)作为侧信息,增强重建质量,并利用修复后图像的监督信号对压缩高斯进行精炼,从而在保持极低存储占用的同时大幅提升渲染质量。该框架可兼容多种现有3DGS压缩方法,具有广泛适用性。
链接: https://arxiv.org/abs/2510.14705
作者: Seungjoo Shin,Jaesik Park,Sunghyun Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025 Workshop on ECLR
Abstract:Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.
zh
[CV-40] VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning ICCV2025
【速读】:该论文旨在解决当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频时序定位(video temporal grounding)与视频推理任务中存在的性能不足问题,尤其是在真实场景下对视频内容的精准理解和可解释性推理方面。其解决方案的关键在于提出一种无需训练的框架VTimeCoT,通过引入两种新颖的视觉工具——即“即插即用的进度条集成工具”和“高效高亮工具”,并设计了一种跨模态的视觉-时序链式思维(visuotemporal chain-of-thought, visuotemporal CoT)推理过程,从而增强模型对视频时序信息的理解能力,并实现更可组合、可解释的推理路径。
链接: https://arxiv.org/abs/2510.14672
作者: Jinglei Zhang,Yuanfan Guo,Rolandos Alexandros Potamias,Jiankang Deng,Hang Xu,Chao Ma
机构: Shanghai Jiao Tong University (上海交通大学); Noah’s Ark Lab; Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: this https URL
zh
[CV-41] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
【速读】:该论文旨在解决传统知识蒸馏(Knowledge Distillation, KD)在有限数据场景下存在的知识退化、监督效率低以及对强教师模型或大规模标注数据依赖性强等问题。其解决方案的关键在于提出首个弱监督链式知识蒸馏网络(Weakly-supervised Chain-based KD network, WeCKD),通过构建一个由多个相互连接模型组成的渐进式蒸馏链,使每个学生模型不仅从上一阶段学习知识,还对知识进行优化后再传递给下一阶段,从而实现结构化的知识迁移。此机制显著提升了特征学习能力,降低了数据依赖性,并克服了一次性蒸馏的局限性,实验证明在多个医学影像数据集上均能实现优于传统方法的性能,且在仅使用少量标注数据时仍可获得高达+23%的累积准确率提升。
链接: https://arxiv.org/abs/2510.14668
作者: Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Sami Azam,Asif Karim,Jemima Beissbarth,Amanda Leach
机构: United International University (联合国际大学); Charles Darwin University (查尔斯达尔文大学); Menzies School of Health Research (门齐斯健康研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
zh
[CV-42] EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024)
【速读】:该论文旨在解决采矿活动导致的地表变化监测难题,特别是现有数据集在时间深度和地理覆盖范围上的局限性,从而难以支撑可持续资源管理和环境治理所需的长期、连续监测。其解决方案的关键在于构建EuroMineNet——首个基于Sentinel-2多光谱影像的多时相基准数据集,涵盖欧盟133个矿区,提供2015至2024年每年的专家标注数据,支持两类可持续驱动任务:(1)多时相采矿足迹制图,采用新颖的Change-Aware Temporal IoU(CA-TIoU)指标评估;(2)跨时相变化检测以捕捉渐进与突发的地表变化。该数据集推动了GeoAI模型在大陆尺度上对环境动态的分析能力,并揭示了当前方法在短期动态检测方面的挑战,为实现可解释、时序一致的采矿监测提供了基础支撑。
链接: https://arxiv.org/abs/2510.14661
作者: Weikang Yu,Vincent Nwazelibe,Xianping Ma,Xiaokang Zhang,Richard Gloaguen,Xiao Xiang Zhu,Pedram Ghamisi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. We release the codes and datasets by aligning with FAIR and the open science paradigm at this https URL.
zh
[CV-43] Decorrelation Speeds Up Vision Transformers ICLR2026
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在掩码自编码器(Masked Autoencoder, MAE)预训练过程中计算成本过高、难以在时间与资源受限的工业场景中应用的问题。其解决方案的关键在于将去相关反向传播(Decorrelated Backpropagation, DBP)引入MAE预训练流程,通过在每一层迭代降低输入特征的相关性来加速收敛;该方法仅应用于编码器部分,在不牺牲训练稳定性的前提下显著缩短了预训练时间,并提升了下游任务性能。
链接: https://arxiv.org/abs/2510.14657
作者: Kieran Carrigg,Rob van Gastel,Melda Yeghaian,Sander Dalm,Faysal Boughorbel,Marcel van Gerven
机构: Donders Institute for Brain, Cognition, and Behaviour (多兰斯研究所脑、认知与行为); ASMPT ALSI B.V. (ASMPT ALSI 有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 12 figures, submitted to ICLR 2026
Abstract:Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
zh
[CV-44] In-Context Learning with Unpaired Clips for Instruction-based Video Editing
【速读】:该论文旨在解决指令驱动的视频编辑(instruction-based video editing)在实际应用中因缺乏大规模成对视频编辑数据集而难以扩展的问题。现有方法受限于构建高质量配对数据的成本和复杂性,导致模型泛化能力不足。解决方案的关键在于提出一种低成本预训练策略,利用未配对视频片段通过上下文学习(in-context learning)来预训练基础视频生成模型,使其具备添加、替换或删除等通用编辑能力;随后仅需少量高质量配对编辑数据即可高效微调,从而显著提升指令遵循准确性和视觉保真度。
链接: https://arxiv.org/abs/2510.14648
作者: Xinyao Liao,Xianfang Zeng,Ziye Song,Zhoujie Fu,Gang Yu,Guosheng Lin
机构: Nanyang Technological University (南洋理工大学); StepFun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.
zh
[CV-45] SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation
【速读】:该论文旨在解决测试时适应(Test-time adaptation, TTA)中因分布偏移导致深度模型性能下降的问题,特别是针对基于扩散模型的输入级适应方法依赖梯度引导、探索能力有限且泛化性不足的局限性。解决方案的关键在于提出一种仅在推理阶段运行的框架SteeringTTA,其核心创新是将Feynman-Kac控制理论引入扩散过程,通过伪标签驱动的奖励机制,结合累积top-K概率与熵调度策略,引导多粒子轨迹在探索与置信度之间取得平衡,从而实现无需模型更新或源数据即可提升分类鲁棒性的目标。
链接: https://arxiv.org/abs/2510.14634
作者: Jihyun Yu,Yoojin Oh,Wonho Bae,Mingyu Kim,Junhyug Noh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve robustness for classification to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-label. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.
zh
[CV-46] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
【速读】:该论文旨在解决生成式模型中因高维空间冗余和训练成本高昂而导致的效率低下问题,尤其是在图像生成任务中。其核心挑战在于如何在保持生成质量的同时,降低模型复杂度与训练资源消耗。解决方案的关键在于提出 Representation Tokenizer (RepTok),它利用预训练自监督视觉Transformer(SSL)编码器提取单个连续潜在token作为图像表示,仅微调语义token嵌入并结合联合训练的生成解码器,通过流匹配目标实现高质量重建;同时引入余弦相似度损失以维持原始SSL空间的良好几何结构,从而确保生成过程的稳定性与高效性。此方法显著减少了空间冗余,提升了训练效率,并在有限数据下仍能实现竞争力强的零样本文本到图像合成性能。
链接: https://arxiv.org/abs/2510.14630
作者: Ming Gui,Johannes Schusterbauer,Timy Phan,Felix Krause,Josh Susskind,Miguel Angel Bautista,Björn Ommer
机构: CompVis @ LMU Munich (计算机视觉 @ 慕尼黑大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Apple (苹果)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
zh
[CV-47] GOPLA: Generalizable Object Placement Learning via Synthetic Augmentation of Human Arrangement
【速读】:该论文旨在解决机器人在日常家居环境中进行物体放置(object placement)时面临的双重挑战:一方面需理解语义偏好(如常识性物体关系),另一方面需确保几何可行性(如避障)。其解决方案的关键在于提出一个分层框架GOPLA,该框架通过多模态大语言模型将人类指令和视觉输入转化为结构化的对象关系计划,再由空间映射器生成融合几何常识的3D可操作性地图,并结合扩散规划器在测试阶段基于多计划分布与碰撞约束生成最优放置位姿。此外,为缓解数据稀缺问题,研究引入了一种可扩展的数据增强流水线,将少量人工示范转化为多样化的合成训练数据,从而显著提升模型在真实场景中的泛化能力与放置成功率(较次优方法提升30.04个百分点)。
链接: https://arxiv.org/abs/2510.14627
作者: Yao Zhong,Hanzhi Chen,Simon Schaefer,Anran Zhang,Stefan Leutenegger
机构: Technical University of Munich (慕尼黑工业大学); ETH Zurich (苏黎世联邦理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robots are expected to serve as intelligent assistants, helping humans with everyday household organization. A central challenge in this setting is the task of object placement, which requires reasoning about both semantic preferences (e.g., common-sense object relations) and geometric feasibility (e.g., collision avoidance). We present GOPLA, a hierarchical framework that learns generalizable object placement from augmented human demonstrations. A multi-modal large language model translates human instructions and visual inputs into structured plans that specify pairwise object relationships. These plans are then converted into 3D affordance maps with geometric common sense by a spatial mapper, while a diffusion-based planner generates placement poses guided by test-time costs, considering multi-plan distributions and collision avoidance. To overcome data scarcity, we introduce a scalable pipeline that expands human placement demonstrations into diverse synthetic training data. Extensive experiments show that our approach improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility, demonstrating strong generalization across a wide range of real-world robotic placement scenarios.
zh
[CV-48] Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理长视频时因帧序列密集而导致的二次计算复杂度问题,即token冗余严重、上下文限制和推理延迟高。其解决方案的关键在于提出一种简单且可插拔的高效视频采样方法(Efficient Video Sampling, EVS),通过识别并删除连续帧中保持不变的空间区域(即 temporally static patches),在不改变模型架构或重新训练的前提下显著减少token数量,同时保留语义完整性与位置信息。EVS在推理阶段可将大语言模型(Large Language Model, LLM)的首次响应时间(Time-to-First-Token, TTFT)降低最多4倍,且结合随机裁剪率的再训练策略后,模型对不同压缩强度具有鲁棒性,实现高效的视频-语言理解。
链接: https://arxiv.org/abs/2510.14624
作者: Natan Bagrov,Eugene Khvedchenia,Borys Tymchenko,Shay Aharon,Lior Kadoch,Tomer Keren,Ofri Masad,Yonatan Geifman,Ran Zilberstein,Tuomas Rintamaki,Matthieu Le,Andrew Tao
机构: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches – spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity, requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
zh
[CV-49] Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding ACM-MM
【速读】:该论文旨在解决羽毛球视频中战术理解的多尺度语义描述问题,即如何同时生成逐拍级别的动作描述(shot-level captions)和体现战术执行过程的时间动态性描述(tactic-level captions)。其关键解决方案是提出了一种双分支架构的视频字幕框架 Shot2Tactic-Caption,该框架通过共享视觉编码器与时空Transformer编码器实现对视频帧序列的统一表征,并分别生成拍级与战术级字幕;此外,引入战术单元检测器(Tactic Unit Detector)以识别战术执行中的有效单位、类型及状态(如中断/恢复),并设计基于逐拍提示引导的机制(shot-wise prompt-guided mechanism),将战术类型和状态作为跨注意力提示注入解码器,从而有效捕捉完整或中断后恢复的战术流程,显著提升了战术层面字幕的准确性与连贯性。
链接: https://arxiv.org/abs/2510.14617
作者: Ning Ding,Keisuke Fujii,Toru Tamaki
机构: Nagoya Institute of Technology (名古屋工业大学); Nagoya University (名古屋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures. Accepted to ACM MMSports 2025
Abstract:Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose \textbfShot2Tactic-Caption, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.
zh
[CV-50] Knowledge-based Visual Question Answer with Multimodal Processing Retrieval and Filtering NEURIPS2025
【速读】:该论文旨在解决知识增强型视觉问答(Knowledge-based Visual Question Answering, KB-VQA)中因多模态查询质量不高和检索结果相关性不足而导致的性能瓶颈问题。其解决方案的关键在于提出了一种三阶段方法Wiki-PRF,包括处理(Processing)、检索(Retrieval)与过滤(Filtering)三个环节:首先通过动态调用视觉工具提取精准的多模态信息以优化查询;其次融合视觉与文本特征实现跨模态知识检索;最后对检索结果进行相关性过滤与聚焦,从而提升答案准确性。此外,模型采用基于强化学习的训练策略,以回答准确性和格式一致性作为奖励信号,显著增强了推理能力、工具调用精度及无关内容过滤效果。
链接: https://arxiv.org/abs/2510.14605
作者: Yuyang Hong,Jiaqi Gu,Qi Yang,Lubin Fan,Yue Wu,Ying Wang,Kun Ding,Shiming Xiang,Jieping Ye
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Alibaba Cloud Computing (阿里云计算)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by NeurIPS 2025
Abstract:Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via a reinforcement learning manner. This enhances the model’s reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements~(36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at this https URL
zh
[CV-51] Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering
【速读】:该论文旨在解决野生动物相机陷阱图像数据集中存在大量未标注物种、且现有分类器无法识别的问题。其核心挑战在于如何在无标签情况下对海量野生动物图像进行有效组织与分析,以支持生物多样性监测。解决方案的关键在于采用零样本(zero-shot)方法,结合自监督视觉Transformer模型(如DINOv2)与无监督聚类技术(如GMM),并辅以降维方法(如UMAP),实现高精度的图像分组和连续一维相似性排序。实验表明,DINOv2配合UMAP与GMM在5物种测试集上达到88.6%准确率(macro-F1=0.874),而基于t-SNE的1D排序在哺乳动物和鸟类中实现88.2%一致性,在鱼类中达95.2%,该方法已部署于Animal Detect平台用于加速人工标注流程和探索性分析。
链接: https://arxiv.org/abs/2510.14596
作者: Hugo Markoff,Jevgenijs Galaktionovs
机构: Animal Detect (动物检测)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended abstract. Submitted to AICC: Workshop on AI for Climate and Conservation - EurIPS 2025 (non-archival)
Abstract:Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.
zh
[CV-52] Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers
【速读】:该论文旨在解决当前动物分类模型(如SpeciesNet)在实际应用中因采用保守的汇总策略而导致大量动物仅被标记到高阶分类单元(如目、科)而非物种级别的问题。其解决方案的关键在于构建一个五阶段的层级再分类系统,该系统融合了SpeciesNet EfficientNetV2-M的预测结果、CLIP(Contrastive Language–Image Pre-training)嵌入以及三元组损失(triplet-loss)驱动的度量学习方法,通过 centroid 建模与自适应余弦距离评分机制,实现从高阶标签向物种级识别的精细化重构。实验表明,该方法在LILA BC Desert Lion Conservation数据集上成功将原本标注为“空白”或“动物”的456个检测框重新分类至物种级别,准确率达96.5%,显著提升了物种识别覆盖率(64.9%)。
链接: https://arxiv.org/abs/2510.14594
作者: Hugo Markoff,Jevgenijs Galaktionovs
机构: Animal Detect (Animal Detect)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended abstract. Submitted to AICC: Workshop on AI for Climate and Conservation - EurIPS 2025 (non-archival)
Abstract:State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from “blank” and “animal” labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent
zh
[CV-53] STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding
【速读】:该论文旨在解决视频生成中物体运动连贯性和交互一致性难以维持的问题,具体表现为两个实践瓶颈:一是人类提供的运动提示(如小的2D箭头图)在编码后有效token数量过少,导致引导效果弱化;二是单一输出头同时优化外观与运动时倾向于强调纹理而非时间一致性。解决方案的关键在于提出STANCE框架,包含两个核心组件:其一为实例提示(Instance Cues),通过像素对齐的控制信号将稀疏、可编辑的提示转化为稠密的2.5D(相机相对)运动场,利用实例流平均和单目深度增强,降低深度歧义并保持易用性;其二为密集旋转位置编码(Dense RoPE),通过为锚定在首帧的一小部分运动token赋予空间可定位的旋转嵌入,保留提示在token空间中的显著性,结合RGB与辅助图(分割或深度)联合预测,实现结构锚定与外观处理分离,从而稳定优化过程并提升时间连贯性,无需逐帧轨迹脚本。
链接: https://arxiv.org/abs/2510.14588
作者: Zhifei Chen,Tianshuo Xu,Leyi Wu,Luozhou Wang,Dongyu Yan,Zihan You,Wenting Luo,Guo Zhang,Yingcong Chen
机构: HKUST(GZ)(香港科技大学(广州)); HKUST(香港科技大学); XMU(厦门大学); MIT(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code, model, and demos can be found at this https URL
Abstract:Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues – a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB (+) auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
zh
[CV-54] CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification
【速读】:该论文旨在解决基于激光雷达(LiDAR)点云的车辆再识别(vehicle re-identification)任务中,如何从三维点云中学习具有判别性和互补性的特征以区分不同车辆的问题。解决方案的关键在于提出了一种曲率感知的多分支神经网络(CALM-Net),其核心创新包括:引入边缘卷积(edge convolution)捕捉局部几何结构、采用点注意力机制(point attention)增强关键区域特征、以及设计曲率嵌入(curvature embedding)表征点云局部表面变化。通过融合这三种机制,模型能够有效提取更丰富的几何与上下文特征,从而显著提升车辆再识别性能,在nuScenes大规模数据集上相较最强基线提升了约1.97%的平均准确率。
链接: https://arxiv.org/abs/2510.14576
作者: Dongwook Lee,Sol Han,Jinwhan Kim
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院); Robotics Program (机器人项目); Department of Mechanical Engineering (机械工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures
Abstract:This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97% points compared with the strongest baseline in our study. The results confirms the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.
zh
[CV-55] BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU
【速读】:该论文旨在解决3D高斯散点(3D Gaussian Splatting, 3DGS)训练流程中的三大效率瓶颈:(1)高斯点密度分配不均导致冗余;(2)高斯投影阶段计算负载失衡;(3)颜色绘制阶段内存访问碎片化。其解决方案的核心在于算法-系统协同设计:(1)在算法层面提出启发式负载感知的密度控制策略,自动移除密集区域80%冗余高斯点并填补稀疏区域空白;(2)在系统层面引入基于相似性的高斯采样与合并机制,将静态的一对一线程-像素映射替换为动态自适应负载分配,使线程根据局部密度处理不同数量的高斯点;(3)在映射层面提出重排序驱动的内存访问策略,重构RGB存储结构以支持共享内存批量加载,显著提升访存效率。实验表明,该方法在NVIDIA A100 GPU上相较原始3DGS实现1.44倍训练加速且质量损失可忽略。
链接: https://arxiv.org/abs/2510.14564
作者: Junyi Wu,Jiaming Xu,Jinhao Li,Yongkang Zhou,Jiayi Pan,Xingyang Li,Guohao Dai
机构: Shanghai Jiao Tong University (上海交通大学); Infinigence-AI; SII
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ASP-DAC 2026
Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) Skewed density allocation during Gaussian densification, (2) Imbalanced computation workload during Gaussian projection and (3) Fragmented memory access during color splatting. To tackle the above challenges, we introduce BalanceGS, the algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions - removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose Similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution - threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that compared with 3DGS, our approach achieves a 1.44 \times training speedup on a NVIDIA A100 GPU with negligible quality degradation. Comments: Accepted by ASP-DAC 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2510.14564 [cs.CV] (or arXiv:2510.14564v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2510.14564 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-56] Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video NEURIPS2025
【速读】:该论文旨在解决生成式 AI 在人类场景中实现主动感知与响应的问题,即如何让AI不仅被动观察环境,还能在恰当时机主动理解、预测并回应动态事件。其核心挑战在于实现三重属性:主动一致性(Proactive Coherence)、适时响应性(Just-in-Time Responsiveness)和同步效率(Synchronized Efficiency)。解决方案的关键在于提出一个完整的技术流水线,包括数据引擎、多阶段训练策略以及主动动态压缩技术,并引入ESTP-Bench基准测试和ESTP-F1指标以系统评估这些属性,从而显著优于多个基线模型。
链接: https://arxiv.org/abs/2510.14560
作者: Yulin Zhang,Cheng Shi,Yang Wang,Sibei Yang
机构: ShanghaiTech University (上海科技大学); Sun Yat-sen University (中山大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2025 (preview; camera-ready in preparation)
Abstract:Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric-a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page:this https URL
zh
[CV-57] Consistent text-to-image generation via scene de-contextualization
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成中普遍存在的身份漂移(identity shift)问题,即在不同场景下生成的同一主体图像难以保持身份一致性。研究表明,身份漂移的关键来源是T2I模型在训练过程中自然形成的场景上下文与主体身份之间的相关性,称为场景情境化(scene contextualization)。解决方案的核心在于提出一种无需训练、高效的提示嵌入编辑方法——场景去情境化(Scene De-Contextualization, SDeC),其通过量化奇异值分解(SVD)方向稳定性来识别并抑制提示嵌入中的潜在场景-身份关联,从而实现对T2I模型内置场景情境化的逆过程。该方法支持每提示对应单一场景的灵活使用,无需预先知晓所有目标场景,因而适用于现实场景中动态变化或未知场景的应用需求。
链接: https://arxiv.org/abs/2510.14553
作者: Song Tang,Peihao Gong,Kunyu Li,Kai Guo,Boyu Wang,Mao Ye,Jianwei Zhang,Xiatian Zhu
机构: University of Shanghai for Science and Technology(上海理工大学); Universität Hamburg(汉堡大学); Fudan University(复旦大学); Western University(西门大学); Vector Institute(向量研究所); University of Electronic Science and Technology of China(电子科技大学); University of Surrey(萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I’s built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt’s embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.
zh
[CV-58] Exploring Cross-Modal Flows for Few-Shot Learning
【速读】:该论文旨在解决跨模态任务中多模态特征对齐的难题,尤其是现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法仅进行单步调整(one-step adjustment),难以应对模态间高度纠缠的复杂数据集的问题。其解决方案的关键在于提出首个模型无关的多步调整方法——流匹配对齐(Flow Matching Alignment, FMA),通过学习跨模态速度场(cross-modal velocity field)实现渐进式特征校正,从而提升对齐精度与鲁棒性;FMA结合固定耦合策略确保类别对应关系、噪声增强策略缓解数据稀缺问题,并引入早停求解器优化效率与准确率。
链接: https://arxiv.org/abs/2510.14543
作者: Ziqi Jiang,Yanghao Wang,Long Chen
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures
Abstract:Aligning features from different modalities, is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today’s PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
zh
[CV-59] Exploring Image Representation with Decoupled Classical Visual Descriptors BMVC2025
【速读】:该论文旨在解决深度学习模型在图像理解任务中内部表示不透明的问题,即难以解释视觉信息如何被处理。为弥补这一缺陷与传统视觉描述符(如边缘、颜色和强度分布)之间存在的鸿沟,作者提出VisualSplit框架,其关键在于显式地将图像分解为解耦的古典视觉描述符,并将其视为独立但互补的视觉知识组成部分。通过基于重建的预训练机制,该方法在保留各描述符可解释性的同时学习其本质特征,从而在图像生成与编辑等高级视觉任务中实现有效的属性控制,验证了该学习范式在视觉理解中的有效性。
链接: https://arxiv.org/abs/2510.14536
作者: Chenyuan Qu,Hao Chen,Jianbo Jiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by The 36th British Machine Vision Conference (BMVC 2025)
Abstract:Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: this https URL.
zh
[CV-60] Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval
【速读】:该论文旨在解决医学影像(如脑部磁共振成像,MRI)在不同成像中心因设备和扫描协议差异导致的域偏移(domain shift)问题,该问题会显著降低机器学习模型在疾病分类等任务中的性能。现有方法虽能通过将图像编码至低维潜在空间并解耦为域不变特征(zu)与域特定特征(zd)实现较好效果,但缺乏可解释性,难以满足医疗场景对透明性和可信度的要求。解决方案的关键在于提出一种通用的可解释表示学习框架——伪线性风格编码器对抗域适应(Pseudo-Linear-Style Encoder Adversarial Domain Adaptation, PL-SE-ADA),其包含两个编码器 fE 和 fSE 分别提取 zu 与 zd、一个解码器 fD 实现图像重建,并引入域预测器 gD 进行对抗训练;同时创新性地通过叠加 zu 和 zd 的重建结果来重构原始图像,从而在保证域和谐的同时保留疾病相关的信息,并支持对域无关脑部特征与域特异性成分的可视化,显著提升整体框架的可解释性。
链接: https://arxiv.org/abs/2510.14535
作者: Keima Abe,Hayato Muraki,Shuhei Tomoshige,Kenichi Oishi,Hitoshi Iyatomi
机构: Hosei University (法政大学); Johns Hopkins Medicine (约翰霍普金斯医学); Alzheimer’s Disease Neuroimaging Initiative (阿尔茨海默病神经影像计划)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 6 pages,3 figures, 3 tables. Accepted at 2025 IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2025)
Abstract:Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images \boldsymbolx into a low-dimensional latent space \boldsymbolz , then disentangle it into \boldsymbolz_u (domain-invariant) and \boldsymbolz_d (domain-specific), achieving strong results. However, these methods often lack interpretability - an essential requirement in medical applications - leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders f_E and f_SE to extract \boldsymbolz_u and \boldsymbolz_d , a decoder to reconstruct the image f_D , and a domain predictor g_D . Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image \boldsymbolx by summing reconstructions from \boldsymbolz_u and \boldsymbolz_d , ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.
zh
[CV-61] owards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology
【速读】:该论文旨在解决当前牙科人工智能(AI)系统在临床应用中面临的三大核心问题:一是现有模型多为单模态、任务特定设计,难以泛化至多样化场景;二是依赖昂贵的人工标注数据,限制了模型的可扩展性;三是缺乏统一的评估基准,导致性能比较困难。解决方案的关键在于提出首个面向牙科领域的视觉基础模型家族DentVFM,其基于Vision Transformer(ViT)架构构建2D与3D变体,并通过自监督学习在包含约160万张多模态牙科影像的大规模数据集DentVista上进行预训练,从而生成任务无关的视觉表征。此外,研究还引入DentBench基准,覆盖八类牙科亚专业领域,显著提升模型在疾病诊断、治疗分析、生物标志物识别及解剖定位等任务中的泛化能力与标签效率,实现跨模态诊断并优于经验丰富的牙医,在无常规影像时提供更可靠的判断结果。
链接: https://arxiv.org/abs/2510.14532
作者: Xinrui Huang,Fan Xiao,Dongming He,Anqi Gao,Dandan Li,Xiaofan Zhang,Shaoting Zhang,Xudong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.
zh
[CV-62] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
【速读】:该论文旨在解决文档解析(document parsing)任务中模型性能与资源消耗之间的矛盾问题,即如何在保持高精度的同时实现轻量化部署。其解决方案的关键在于提出PaddleOCR-VL模型,其核心是PaddleOCR-VL-0.9B这一紧凑而强大的视觉语言模型(Vision-Language Model, VLM),该模型融合了基于NaViT架构的动态分辨率视觉编码器与ERNIE-4.5-0.3B语言模型,从而在支持109种语言的基础上,精准识别文本、表格、公式和图表等复杂文档元素,并显著优于现有方案,在多个公开及内部基准测试中达到当前最优(SOTA)性能,同时具备快速推理能力,适合实际应用场景部署。
链接: https://arxiv.org/abs/2510.14528
作者: Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Handong Zheng,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma
机构: Baidu Inc.(百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
zh
[CV-63] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models
【速读】:该论文旨在解决文本到图像生成中因初始噪声采样不当导致的提示对齐偏差问题(prompt misalignment),即在使用预训练的稳定扩散模型(Stable Diffusion, SD)时,由于推理阶段从提示无关的高斯先验中采样初始噪声,而训练阶段则依赖于提示条件下的特定潜在空间子集,从而引发生成图像与文本提示不一致的问题。解决方案的关键在于提出一种噪声投影器(noise projector),该模块在去噪前对初始噪声进行文本条件微调,将其映射至更贴近训练分布的提示感知噪声空间,从而实现无需修改SD模型、无参考图像或手工先验条件下提升文本-图像对齐效果,并通过单次前向传播替代多样本选择机制,显著降低推理开销。
链接: https://arxiv.org/abs/2510.14526
作者: Yunze Tong,Didi Zhu,Zijing Hu,Jinluan Yang,Ziyu Zhao
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Appendix will be appended soon
Abstract:In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
zh
[CV-64] Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing
【速读】:该论文旨在解决手术器械制造过程中因依赖人工质检而导致的缺陷检测不准确、一致性差的问题,从而影响无菌性、机械完整性及患者安全。其解决方案的关键在于提出SurgScan框架,基于YOLOv8目标检测模型构建实时缺陷识别系统,利用包含102,876张高分辨率图像的多类缺陷数据集进行训练,实现99.3%的高精度与每帧4.2–5.8毫秒的实时推理速度,同时通过对比度增强预处理显著提升检测性能,满足ISO 13485和FDA标准要求,为医疗制造领域提供可扩展、低成本的自动化质量控制方案。
链接: https://arxiv.org/abs/2510.14525
作者: Qurrat Ul Ain,Atif Aftab Ahmed Jilani,Zunaira Shafqat,Nigar Azhar Butt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.
zh
[CV-65] Vision Mamba for Permeability Prediction of Porous Media
【速读】:该论文旨在解决三维多孔介质渗透率预测中模型计算效率与参数规模之间的权衡问题。传统卷积神经网络(CNN)参数量大、内存消耗高,而视觉变换器(Vision Transformer, ViT)的计算复杂度随输入图像分辨率呈平方增长,限制了其在高分辨率图像任务中的应用。论文提出将视觉状态空间模型(Vision Mamba)作为骨干网络用于渗透率预测,其关键在于利用Mamba架构线性扩展的计算复杂度特性(相比ViT的二次复杂度)以及显著更少的可训练参数(相比CNN),从而在保持高精度的同时大幅提升计算和内存效率。实验表明,Vision Mamba在多个评估维度上优于ViT和CNN,并通过消融研究验证了其组件对预测准确性的贡献。
链接: https://arxiv.org/abs/2510.14516
作者: Ali Kashefi,Tapan Mukerji
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus, they can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.
zh
[CV-66] Grazing Detection using Deep Learning and Sentinel-2 Time Series Data
【速读】:该论文旨在解决农业放牧活动在大范围区域内的可扩展监测问题,即如何利用遥感数据高效识别放牧发生的位置,以支持生态保护导向的土地利用合规性检查。其解决方案的关键在于使用Sentinel-2 L2A多时相反射率数据,通过训练一个CNN-LSTM集成模型对农田边界多边形进行季节性放牧状态的二分类预测(放牧/未放牧),在保证高召回率(90%)的同时实现了平均77%的F1分数,显著提升了监管资源的配置效率——当巡检人员每年仅能覆盖4%的站点时,基于模型优先检查非放牧区域可使确认的非放牧点数量提升17.2倍。
链接: https://arxiv.org/abs/2510.14493
作者: Aleksis Pirinen,Delia Fano Yela,Smita Chakraborty,Erik Källman
机构: RISE Research Institutes of Sweden (瑞典研究机构); Climate AI Nordics (气候AI北欧); Swedish Centre for Impacts of Climate Extremes (瑞典气候极端影响中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and models: this https URL
Abstract:Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.
zh
[CV-67] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration
【速读】:该论文旨在解决多任务图像恢复(multi-task image restoration)模型因参数量过大而导致的计算效率低下问题。其解决方案的关键在于通过迭代剪枝策略(iterative pruning strategy)从过参数化的深度模型中发现高度稀疏的子网络,这些子网络在保持或超越密集模型性能的同时,显著减少可训练参数数量。具体而言,MIR-L模型在每轮剪枝中移除低幅度权重,并将剩余权重重置为初始值,从而有效识别出“胜者彩票”(winning tickets),实现在仅保留10%参数的情况下仍保持高恢复性能。
链接: https://arxiv.org/abs/2510.14463
作者: Thomas Katraouras,Dimitrios Rafailidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WI-IAT 2025
Abstract:Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model’s optimization, effectively uncovering “winning tickets” that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at this https URL.
zh
[CV-68] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review
【速读】:该论文旨在解决脑部影像中异常检测与分割问题,传统监督学习方法依赖大量像素级标注数据且受限于常见病灶类型,难以应对罕见或异质性疾病。其解决方案的关键在于采用无监督深度生成模型(unsupervised deep generative models),这些模型仅需健康个体的影像数据进行训练,即可通过学习正常脑结构来识别偏离预期的异常区域。该方法的核心优势在于能够生成可解释的伪健康重建(pseudo-healthy reconstructions,亦称反事实重建),从而在标注数据稀缺场景下仍具备良好的临床适用性,并为半监督学习、新型影像生物标志物发现及跨疾病异常映射提供统一端到端框架。
链接: https://arxiv.org/abs/2510.14462
作者: Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi
机构: Univ Rennes (雷恩大学); Inria (法国国家信息与自动化研究院); CNRS (法国国家科学研究中心); Inserm (法国国家健康与医学研究院); IRISA UMR 6074 (IRISA 6074联合实验室); Siemens Healthineers (西门子医疗); CHU Rennes (雷恩大学医院); Centre de Kerpape (凯尔帕中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 - 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.
zh
[CV-69] Structured Universal Adversarial Attacks on Object Detection for Video Sequences
【速读】:该论文旨在解决视频目标检测(video object detection)模型在面对通用扰动(universal perturbation)攻击时的脆弱性问题,这类攻击能在不显著改变视觉感知的前提下误导检测系统。解决方案的关键在于提出一种最小失真度的通用对抗攻击方法,其核心创新是利用核范数正则化(nuclear norm regularization)引导扰动集中在背景区域,从而生成结构化的、更具隐蔽性的攻击样本;同时采用自适应乐观指数梯度法(adaptive, optimistic exponentiated gradient method)高效优化该非凸问题,在保证攻击效果的同时提升收敛速度与可扩展性。
链接: https://arxiv.org/abs/2510.14460
作者: Sven Jacob,Weijia Shao,Gjergji Kasneci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at GCPR 2025 (German Conference on Pattern Recognition). This is a different version as submitted to the conference, not the official conference proceedings
Abstract:Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at this https URL.
zh
[CV-70] Real-Time Neural Video Compression with Unified Intra and Inter Coding
【速读】:该论文旨在解决现有神经视频压缩(Neural Video Compression, NVC)方案中存在的若干关键问题,包括对遮挡区域(disocclusion)和新增内容处理效率低、帧间误差传播与累积严重等。其解决方案的核心在于引入统一的帧内/帧间编码框架,使每个帧均由单一模型自适应地选择执行帧内或帧间编码;同时提出一种双向两帧同步压缩设计,不仅利用前向冗余,还挖掘后向帧间冗余,从而有效抑制误差传播并提升对动态场景的适应能力。实验表明,该方法在保持实时编码解码性能的前提下,相较DCVC-RT平均实现10.7%的BD-rate降低,并显著改善比特率与画质的稳定性。
链接: https://arxiv.org/abs/2510.14431
作者: Hui Xiang,Yifan Bian,Li Li,Jingran Wu,Xianguo Zhang,Dong Liu
机构: University of Science and Technology of China (中国科学技术大学); Tencent Shannon Lab (腾讯谢尔顿实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
zh
[CV-71] Deep Compositional Phase Diffusion for Long Motion Sequence Generation NEURIPS2025
【速读】:该论文旨在解决当前生成式AI在生成复合运动序列时,难以保持相邻动作片段之间运动动态连续性的问题,即在多语义动作拼接场景下,模型常出现过渡生硬或突兀的伪影。解决方案的关键在于提出一种名为“组合相位扩散”(Compositional Phase Diffusion)的新框架,其核心创新是引入两个模块:语义相位扩散模块(Semantic Phase Diffusion Module, SPDM)和过渡相位扩散模块(Transitional Phase Diffusion Module, TPDM),二者协同工作于预训练的动作中心相位自编码器(Action-Centric Motion Phase Autoencoder, ACT-PAE)所构建的潜在运动频域空间中,从而在扩散过程中逐步融合语义引导与相邻片段间的相位细节信息,有效保障了跨动作片段的运动连贯性。
链接: https://arxiv.org/abs/2510.14427
作者: Ho Yin Au,Jie Chen,Junkun Jiang,Jingyu Xiang
机构: Hong Kong Baptist University (香港浸会大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025 (Oral)
Abstract:Recent research on motion generation has shown significant progress in generating semantically aligned motion with singular semantics. However, when employing these models to create composite sequences containing multiple semantically generated motion clips, they often struggle to preserve the continuity of motion dynamics at the transition boundaries between clips, resulting in awkward transitions and abrupt artifacts. To address these challenges, we present Compositional Phase Diffusion, which leverages the Semantic Phase Diffusion Module (SPDM) and Transitional Phase Diffusion Module (TPDM) to progressively incorporate semantic guidance and phase details from adjacent motion clips into the diffusion process. Specifically, SPDM and TPDM operate within the latent motion frequency domain established by the pre-trained Action-Centric Motion Phase Autoencoder (ACT-PAE). This allows them to learn semantically important and transition-aware phase information from variable-length motion clips during training. Experimental results demonstrate the competitive performance of our proposed framework in generating compositional motion sequences that align semantically with the input conditions, while preserving phase transitional continuity between preceding and succeeding motion clips. Additionally, motion inbetweening task is made possible by keeping the phase parameter of the input motion sequences fixed throughout the diffusion process, showcasing the potential for extending the proposed framework to accommodate various application scenarios. Codes are available at this https URL.
zh
[CV-72] DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis
【速读】:该论文旨在解决计算病理学中因全切片图像(Whole Slide Images, WSIs)的巨像素输入带来的计算瓶颈以及密集人工标注稀缺所导致的癌症预后模型性能受限问题,同时克服现有方法对多倍率WSI中的细粒度信息和肿瘤微环境差异关注不足的局限。其解决方案的关键在于提出一种“由易到难”的渐进式表征学习模型——双课程对比多实例学习(Dual-Curriculum Contrastive Multi-Instance Learning, DCMIL),该模型无需依赖密集标注即可高效处理巨像素级WSI,并直接输出预后预测;通过引入双课程机制与对比学习策略,DCMIL能有效挖掘细粒度预后相关区域、提供鲁棒的实例不确定性估计,并捕捉正常与肿瘤组织间的形态学差异,从而提升模型泛化能力并推动生物发现。
链接: https://arxiv.org/abs/2510.14403
作者: Chao Tu,Kun Huang,Jie Zhang,Qianjin Feng,Yu Zhang,Zhenyuan Ning
机构: Southern Medical University (南方医科大学); Jiangxi Medical College (江西医学院); Indiana University (印第安纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic modes for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All codes have been made publicly accessible at this https URL.
zh
[CV-73] BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLOFaster-RCNN Ensemble WACV2026
【速读】:该论文旨在解决高批量电子产品制造中主板(Motherboard)装配级缺陷检测的难题,现有研究多聚焦于裸板或走线级别的缺陷,而对完整主板的装配缺陷(如缺螺丝、松动风扇线缆和表面划痕)检测仍缺乏系统性方法。解决方案的关键在于提出BoardVision框架,结合YOLOv7与Faster R-CNN两种代表性检测器,并设计轻量级集成策略——置信度-时序投票(Confidence-Temporal Voting, CTV Voter),通过可解释规则平衡精度与召回率;同时在真实扰动(锐度、亮度、方向变化)下验证模型鲁棒性,并开发GUI驱动的可部署检测工具,实现从基准测试到实际质量控制的落地转化。
链接: https://arxiv.org/abs/2510.14389
作者: Brandon Hill,Kma Solaiman
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper has been submitted to IEEE/CVF WACV 2026 Applications track and is currently under review
Abstract:Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards inspection remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors - YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.
zh
[CV-74] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights
【速读】:该论文旨在解决脑肿瘤分割中因肿瘤亚区异质性导致的精度不足问题,以及基于状态空间模型(State Space Models, SSMs)的Mamba架构在计算效率上的瓶颈和跨数据集鲁棒性不足的问题。其解决方案的关键在于提出一种双分辨率双向Mamba(Dual-Resolution Bi-Directional Mamba, DRBD-Mamba)模型:首先利用空间填充曲线(space-filling curve)实现三维到一维特征映射时的空间局部性保持,从而减少多轴顺序特征计算带来的高开销;其次引入门控融合模块以自适应整合前向与反向上下文信息,并通过量化块对特征进行离散化处理以提升模型鲁棒性;此外,作者构建了五个系统性的BraTS2023划分方案用于严格评估不同条件下的分割性能,实验证明该方法在保持高精度的同时实现了15倍的效率提升。
链接: https://arxiv.org/abs/2510.14383
作者: Danish Ali,Ajmal Mian,Naveed Akhtar,Ghulam Mubashar Hassan
机构: The University of Western Australia (西澳大利亚大学); The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86% for tumor core and 1.45% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.
zh
[CV-75] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在处理多对象提示时存在的对象忽略(object neglect)和对象混淆(object mixing)问题。通过系统分析,作者识别出四种典型失败场景:相似形状、相似纹理、背景偏差差异以及多对象共存,并基于对CLIP文本嵌入的两个关键观察提出了解决方案DOS(Directional Object Separation)。其核心在于对三种类型的CLIP文本嵌入进行定向调整,从而增强不同对象之间的语义分离度,进而提升多对象图像生成的成功率与准确性。实验表明,DOS在多个基准测试中显著优于现有方法,有效缓解了多对象生成中的语义混淆问题。
链接: https://arxiv.org/abs/2510.14376
作者: Dongnam Byun,Jungwon Park,Jumgmin Ko,Changin Choi,Wonjong Rhee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.
zh
[CV-76] Spatial Preference Rewarding for MLLM s Spatial Understanding ICCV2025
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度空间感知能力上的不足,例如生成详细区域描述或准确定位物体的能力较弱,且难以响应用户对精细化空间理解的需求。现有方法主要依赖预标注指令数据微调MLLM以注入空间知识,但缺乏对其实际输出的直接监督。解决方案的关键在于提出一种空间偏好奖励机制(Spatial Preference Rewarding, SPR),通过引入语义得分和定位得分对MLLM生成的描述进行综合评估,并利用高分精修版本与低分初始版本配对进行直接偏好优化,从而增强模型输出与视觉输入之间的细粒度对齐,显著提升空间理解能力,同时训练开销极小。
链接: https://arxiv.org/abs/2510.14374
作者: Han Qiu,Peng Gao,Lewei Lu,Xiaoqin Zhang,Ling Shao,Shijian Lu
机构: S-Lab, Nanyang Technological University (南洋理工大学); Shanghai AI Laboratory; Sensetime Research; Zhejiang University of Technology (浙江工业大学); UCAS-Terminus AI Lab, University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Multimodal large language models~(MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user’s requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs’ actual responses. We address this issue by SPR, a Spatial Preference Rewarding~(SPR) approach that enhances MLLMs’ spatial capabilities by rewarding MLLMs’ detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at this https URL
zh
[CV-77] Leverag ing Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration ICRA2024
【速读】:该论文旨在解决如何利用大量未标注的RGB-D数据进行场景几何推理的问题,尤其聚焦于提升RGB-D图像配准(registration)的准确性。其解决方案的关键在于:首先,采用循环一致的关键点(cycle-consistent keypoints)作为显著特征点,在匹配过程中施加空间一致性约束,从而提高对应点的精度;其次,提出一种新颖的姿态块(pose block),融合门控循环单元(GRU)与变换同步机制,有效整合历史信息与多视角数据,增强配准的鲁棒性与准确性。该方法在ScanNet和3DMatch数据集上优于现有自监督配准方法,甚至超越部分传统监督方法。
链接: https://arxiv.org/abs/2510.14354
作者: Siddharth Tourani,Jayaram Reddy,Sarvesh Thakur,K Madhava Krishna,Muhammad Haris Khan,N Dinesh Reddy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, accepted at ICRA 2024 (International Conference on Robotics and Automation)
Abstract:With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration meth- ods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self- supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.
zh
[CV-78] Vision-Centric Activation and Coordination for Multimodal Large Language Models
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练过程中仅依赖文本token的下一个词预测监督信号,从而忽视了对视觉理解至关重要的视觉中心信息(vision-centric information),导致其分析能力受限的问题。解决方案的关键在于提出VaCo框架,通过引入视觉感知对齐(Visual Discriminative Alignment)机制,整合来自多个视觉基础模型(Vision Foundation Models, VFMs)的任务感知特征,并借助可学习的模块化任务查询(Modular Task Queries, MTQs)和视觉对齐层(Visual Alignment Layers, VALs)激活特定视觉信号,同时利用Token Gateway Mask(TGM)协调不同VFMs间表示冲突,实现文本与视觉输出的统一优化,显著提升MLLMs在多种基准测试中的视觉理解性能。
链接: https://arxiv.org/abs/2510.14349
作者: Yunnan Wang,Fan Lu,Kecheng Zheng,Ziyuan Huang,Ziqiang Li,Wenjun Zeng,Xin Jin
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团); Eastern Institute of Technology (东华理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To track this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
zh
[CV-79] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection
【速读】:该论文旨在解决虹膜生物特征识别系统在面临呈现攻击(Presentation Attacks, PAs)时的安全性问题,尤其是由于真实攻击样本(如人工眼球、打印眼图像或化妆隐形眼镜)难以获取而导致的呈现攻击检测(Presentation Attack Detection, PAD)方法训练与评估数据集稀缺的问题。解决方案的关键在于提出了一种名为多域图像翻译扩散StyleGAN(Multi-domain Image Translative Diffusion StyleGAN, MID-StyleGAN)的新框架,该框架融合了扩散模型和生成对抗网络(Generative Adversarial Networks, GANs)的优势,通过多域架构实现真伪虹膜图像与不同攻击域之间的可控转换,并引入针对眼部数据定制的自适应损失函数以保持域一致性,从而生成高质量且多样化的合成眼区图像,显著提升PAD系统的检测性能。
链接: https://arxiv.org/abs/2510.14314
作者: Shivangi Yadav,Arun Ross
机构: Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
zh
[CV-80] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding EMNLP2025 KR-0822
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即模型在生成回答时往往过度依赖单一模态或仅凭记忆训练数据而缺乏对视觉信息的正确 grounding( groundedness )。其解决方案的关键在于提出一种无需训练的三层对比解码结合水印检测的方法:首先从解码层中选择一个成熟层和一个新手层,然后通过与水印相关的问题识别出一个具有视觉 grounding 能力的中间层(pivot layer),最后利用三层对比机制生成最终输出。该方法在 POPE、MME 和 AMBER 等公开基准测试上显著降低了 LVLM 的幻觉现象,并提升了视觉引导响应的质量。
链接: https://arxiv.org/abs/2510.14304
作者: Kyungryul Back,Seongbeom Park,Milim Kim,Mincheol Kwon,SangHyeok Lee,Hyunyoung Lee,Junhee Cho,Seunghyun Park,Jinkyu Kim
机构: Korea University (韩国大学); KT Corporation (KT公司); Soongsil University (弘益大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Findings; Project: this https URL
Abstract:Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations – they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
zh
[CV-81] Learning Human-Humanoid Coordination for Collaborative Object Carrying
【速读】:该论文旨在解决人-人形机器人(human-humanoid)协作中因人形机器人复杂全身动力学而导致的顺应性协作难以实现的问题。现有研究主要集中在机械臂与人类的顺应性协作,而人形机器人由于其高维状态空间和动态耦合特性,缺乏有效的协同控制方法。解决方案的关键在于提出一种仅依赖本体感觉(proprioception-only)的强化学习框架COLA,该框架通过单一策略同时建模领导者(leader)和跟随者(follower)行为,在闭环环境中训练以隐式预测物体运动模式和人类意图,从而实现负载平衡的协调轨迹规划,无需外部传感器或复杂的交互模型即可完成稳定、鲁棒的协作搬运任务。
链接: https://arxiv.org/abs/2510.14293
作者: Yushi Du,Yixuan Li,Baoxiong Jia,Yutang Lin,Pei Zhou,Wei Liang,Yanchao Yang,Siyuan Huang
机构: the University of Hong Kong (香港大学); BIGAI (通用人工智能国家重点实验室); Beijing Institute of Technology (北京理工大学); Peking University (北京大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Human-humanoid collaboration shows significant promise for applications in healthcare, domestic assistance, and manufacturing. While compliant robot-human collaboration has been extensively developed for robotic arms, enabling compliant human-humanoid collaboration remains largely unexplored due to humanoids’ complex whole-body dynamics. In this paper, we propose a proprioception-only reinforcement learning approach, COLA, that combines leader and follower behaviors within a single policy. The model is trained in a closed-loop environment with dynamic object interactions to predict object motion patterns and human intentions implicitly, enabling compliant collaboration to maintain load balance through coordinated trajectory planning. We evaluate our approach through comprehensive simulator and real-world experiments on collaborative carrying tasks, demonstrating the effectiveness, generalization, and robustness of our model across various terrains and objects. Simulation experiments demonstrate that our model reduces human effort by 24.7%. compared to baseline approaches while maintaining object stability. Real-world experiments validate robust collaborative carrying across different object types (boxes, desks, stretchers, etc.) and movement patterns (straight-line, turning, slope climbing). Human user studies with 23 participants confirm an average improvement of 27.4% compared to baseline models. Our method enables compliant human-humanoid collaborative carrying without requiring external sensors or complex interaction models, offering a practical solution for real-world deployment.
zh
[CV-82] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts
【速读】:该论文旨在解决组织病理学图像分析中因采集过程或数据来源差异导致的域偏移(domain shift)问题,该问题严重制约了深度学习模型的泛化能力。现有方法多依赖于特征分布对齐或引入统计变异性来建模相关性,但往往忽视了潜在的因果关系。其解决方案的关键在于提出一种基于因果推断的新框架,通过显式建模中介变量(mediators)和观测到的组织切片,利用前门准则(front-door principle)设计变换策略,在保留语义特征的同时有效缓解混杂因素(confounders)的影响。实验表明,该方法在CAMELYON17和私有病理数据集上均实现稳定性能提升,最高达7%,显著优于现有基线方法。
链接: https://arxiv.org/abs/2510.14273
作者: Kieu-Anh Truong Thi,Huy-Hieu Pham,Duc-Trong Le
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.
zh
[CV-83] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering
【速读】:该论文旨在解决基于高斯泼溅(Gaussian Splatting)的3D场景重建方法在稀疏观测区域难以捕捉精细细节和保持真实感的问题,其核心挑战源于训练数据在三维空间中的稀疏性。解决方案的关键在于提出一种混合2D-3D方法GauSSmart,通过融合2D基础模型(如DINO)提供的语义特征监督与凸滤波(convex filtering)技术,引导高斯 splat 的密度增长与精化过程,从而增强低覆盖区域的重建完整性并保留复杂结构细节,显著提升了重建质量与鲁棒性。
链接: https://arxiv.org/abs/2510.14270
作者: Alexander Valverde,Brian Xu,Yuyin Zhou,Meng Xu,Hongyun Wang
机构: University of California, Santa Cruz (加州大学圣克鲁兹分校); Brown University (布朗大学); Kean University (基恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone. Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR) Cite as: arXiv:2510.14270 [cs.CV] (or arXiv:2510.14270v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2510.14270 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-84] Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment
【速读】:该论文旨在解决光学相机通信(Optical Camera Communication, OCC)系统在远距离传输中因信道衰减和噪声导致的误码率(Bit Error Rate, BER)升高问题。其解决方案的关键在于结合了OOK(On-Off Keying)调制与toggle解调技术,并引入数字锁相环(Digital Phase-Locked Loop, DPLL),从而显著提升了系统在户外环境下的鲁棒性与通信速率。实验表明,该方案首次实现了在200米距离下以60 kbps速率、400米距离下以30 kbps速率时BER低于10⁻³的稳定通信性能。
链接: https://arxiv.org/abs/2510.14266
作者: Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a \mathrmBER 10^-3 at 200m-60kbps and 400m-30kbps in outdoor experiments.
zh
[CV-85] MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching
【速读】:该论文旨在解决高分辨率图像跨视图匹配中的两大难题:一是现有交叉注意力机制因二次复杂度导致计算效率低下,二是缺乏显式的匹配约束使得匹配精度受限。其解决方案的关键在于提出一种新型注意力机制——MatchAttention,该机制通过动态匹配相对位置来引导键值对的注意力采样中心,并利用双线性Softmax实现连续且可微的滑动窗口采样;同时,相对位置通过残差连接嵌入特征通道进行迭代更新,使模型更聚焦于学习跨视角匹配关系。此外,设计了基于门控机制的交叉MatchAttention和一致性约束损失函数以缓解遮挡影响,从而在保持高精度的同时显著降低计算复杂度,实现高效、高分辨率的跨视图匹配。
链接: https://arxiv.org/abs/2510.14260
作者: Tingman Yan,Tao Liu,Xilian Yang,Qunfei Zhao,Zeyang Xia
机构: Dalian University of Technology (大连理工大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at this https URL.
zh
[CV-86] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
【速读】:该论文旨在解决多主体(multi-human)视频生成中身份一致性难以保持的问题,尤其是在动态交互场景下,确保多个角色在整个视频序列中身份稳定是关键挑战。其解决方案的核心在于提出 Identity-GRPO,一个基于人类反馈优化的强化学习框架,通过构建一个大规模偏好数据集训练视频奖励模型(video reward model),该模型包含人工标注与合成失真数据,并聚焦于人物身份一致性的成对标注;随后采用针对多主体一致性优化的 GRPO(Generalized Reward Policy Optimization)变体,在 VACE 和 Phantom 等基线方法上显著提升多角色身份保真度,实验表明该方法在人类一致性指标上相较基线最高提升达 18.9%。
链接: https://arxiv.org/abs/2510.14256
作者: Xiangyu Meng,Zixian Zhang,Zhenghao Zhang,Junchao Liao,Long Qin,Weizhi Wang
机构: Alibaba Group (阿里巴巴集团); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
zh
[CV-87] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
【速读】:该论文旨在解决图像到视频(Image-to-Video, I2V)生成中的人体身份一致性保持难题,尤其是在目标人物面部仅占画面较小比例时,现有模型难以维持输入图像与生成视频间的人脸身份一致性,这在表达变化和动作幅度较大的场景下尤为突出。解决方案的关键在于提出一种基于强化学习的身份保留奖励引导优化方法(Identity-Preserving Reward-guided Optimization, IPRO),其核心创新包括:1)不依赖额外模块或架构修改,而是通过一个面部身份评分器直接对扩散模型进行微调;2)将奖励信号反向传播至采样链的最后几步,以增强梯度反馈的丰富性并加速收敛;3)设计了一种新颖的面部特征池机制,利用真实视频中的多角度人脸信息提升泛化能力;4)引入KL散度正则项稳定训练过程,防止对奖励信号过拟合。
链接: https://arxiv.org/abs/2510.14255
作者: Liao Shen,Wentao Jiang,Yiran Zhu,Tiezheng Ge,Zhiguo Cao,Bo Zheng
机构: Taobao & Tmall Group of Alibaba (淘宝与天猫集团); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at \hrefthis https URLthis https URL.
zh
[CV-88] MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering
【速读】:该论文旨在解决大规模场景中高效定位(localization)与高质量渲染(high-quality rendering)所面临的计算成本过高问题。现有基于场景坐标回归(Scene Coordinate Regression, SCR)的方法在小规模场景中表现良好,但在扩展至大规模场景时受限于单个网络的容量。其解决方案的关键在于提出混合专家加速坐标编码方法(Mixed Expert-based Accelerated Coordinate Encoding, MACE),通过引入门控网络(gating network)实现对子网络的隐式分类与选择,确保每次推理仅激活一个子网络,从而显著降低计算开销;同时设计无辅助损失负载均衡策略(Auxiliary-Loss-Free Load Balancing, ALF-LB),提升大规模场景下的定位精度,最终在保持高渲染质量的同时大幅减少训练时间(如在Cambridge测试集上仅需10分钟训练)。
链接: https://arxiv.org/abs/2510.14251
作者: Mingkai Liu,Dikai Fan,Haohua Que,Haojia Gao,Xiao Liu,Shuxue Peng,Meixia Lin,Shengyu Gu,Ruicong Ye,Wanli Qiu,Handong Yao,Ruopeng Zhang,Xianliang Huang
机构: PICO, ByteDance Inc. (字节跳动); University of Georgia (佐治亚大学); Beijing University of Technology (北京工业大学); Peking University (北京大学); Chongqing Vocational Institute of Engineering (重庆工程职业学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MOE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furtheremore, we present Auxiliary-Loss-Free Load Balancing(ALF-LB) strategy to enhance the localization accuracy on large-scale scene. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
zh
[CV-89] Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication
【速读】:该论文旨在解决传统帧式光学摄像通信(Optical Camera Communication, OCC)系统中存在的比特率低和处理负载高的问题。针对这一挑战,论文提出了一种基于事件触发视觉传感器(Event-based Vision Sensor, EVS)的新型调制方案——事件间隔调制(Event Interval Modulation, EIM),其关键在于充分利用EVS异步采样与高动态范围等独特特性,通过调制事件之间的时间间隔来编码信息,从而显著提升传输速率。实验表明,在室内环境下实现了10米距离下28 kbps和50米距离下8.4 kbps的可靠通信,确立了事件驱动OCC系统的新性能基准。
链接: https://arxiv.org/abs/2510.14245
作者: Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC system utilizing an event-based vision sensor (EVS) as receivers have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied, however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
zh
[CV-90] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis
【速读】:该论文旨在解决当前深度伪造(Deepfake)检测方法在面对基于生成对抗网络(GANs)、扩散模型(Diffusion Models)和神经渲染(Neural Rendering)等先进生成技术所制造的高保真伪造媒体时,识别能力不足的问题。传统方法依赖人工设计的音素-视觉映射阈值、帧级一致性检查或单一模态策略,难以捕捉由这些先进模型产生的细微时间不一致性。解决方案的关键在于提出一种多模态音频-视觉框架——音素-时间与身份动态分析(Phoneme-Temporal and Identity-Dynamic Analysis, PIA),通过融合语言信息、动态面部运动特征和面部身份嵌入,从多个互补模态中识别出隐藏的不一致,从而显著提升对微小深度伪造篡改的检测精度。
链接: https://arxiv.org/abs/2510.14241
作者: Soumyya Kanti Datta,Tanvi Ranga,Chengzhe Sun,Siwei Lyu
机构: University at Buffalo, SUNY (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis(PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at this https URL
zh
[CV-91] LOTA: Bit-Planes Guided AI-Generated Image Detection ICCV2025
【速读】:该论文旨在解决当前AI生成图像检测方法中存在的两大问题:一是基于图像重建误差的检测方法计算成本高,二是难以捕捉原始图像中固有的噪声特征。其解决方案的关键在于创新性地利用比特平面(bit-plane)分解技术提取图像噪声信息,通过低比特平面反映图像中的噪声模式,并结合多种图像归一化策略(如缩放和阈值处理)增强噪声信号;进一步设计最大梯度块选择机制,采用多方向梯度计算噪声得分并选取最高响应区域以放大噪声特征;最终构建轻量级分类头,探索基于噪声的分类器与噪声引导的分类器两种结构,从而实现高效、高精度的AI生成图像检测。
链接: https://arxiv.org/abs/2510.14230
作者: Hongsong Wang,Renxi Cheng,Yang Zhang,Chaolei Han,Jie Gui
机构: Southeast University (东南大学); Shenzhen University (深圳大学); Purple Mountain Laboratories (紫金山实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in the ICCV2025, COde is this https URL
Abstract:The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of \textbf98.9% (\textbf11.9%~ \uparrow ) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at this https URL.
zh
[CV-92] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures SIGGRAPH
【速读】:该论文旨在解决视频扩散模型中多视角角色一致性(multi-view character consistency)与3D相机控制(3D camera control)难以协同实现的问题,同时提升虚拟制作(virtual production)中的个性化、光照适应性和运动空间布局控制能力。其解决方案的关键在于提出了一种新颖的定制化数据流水线:通过4D高斯点绘(4D Gaussian Splatting, 4DGS)对记录的体素捕捉表演进行多视角重渲染,并结合视频重照明模型引入光照变化,从而构建高质量、多样化且可控的训练数据;在此基础上微调开源视频扩散模型,实现了强身份保留、精确相机控制和光照自适应能力,同时支持双主体生成(联合训练与噪声混合两种策略)及场景级真实视频定制,显著提升了视频生成质量与生产可用性。
链接: https://arxiv.org/abs/2510.14179
作者: Yuancheng Xu,Wenqi Xian,Li Ma,Julien Philip,Ahmet Levent Taşel,Yiwei Zhao,Ryan Burgert,Mingming He,Oliver Hermann,Oliver Pilarski,Rahul Garg,Paul Debevec,Ning Yu
机构: Eyeline Labs; Netflix
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to SIGGRAPH Asia 2025
Abstract:We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: this https URL.
zh
[CV-93] PoissonNet: A Local-Global Approach for Learning on Surfaces SIGGRAPH
【速读】:该论文旨在解决现有网格神经网络架构在学习高频特征时困难、感受野不足、对离散化敏感以及计算开销效率低等问题。其解决方案的关键在于提出了一种基于泊松方程(Poisson’s equation)的局部-全局学习框架——PoissonNet,通过在网格梯度域中应用可学习的局部特征变换,并求解泊松系统实现标量特征更新的全局传播,从而在保持全频谱特征的同时提供真正的全局感受野,且对网格三角剖分具有不变性,同时显著降低计算复杂度,提升了模型的可扩展性与性能表现。
链接: https://arxiv.org/abs/2510.14146
作者: Arman Maesumi,Tanish Makadia,Thibault Groueix,Vladimir G. Kim,Daniel Ritchie,Noam Aigerman
机构: Brown University (布朗大学); Adobe Research (Adobe 研究院); Université de Montréal (蒙特利尔大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia) 2025, 16 pages
Abstract:Many network architectures exist for learning on meshes, yet their constructions entail delicate trade-offs between difficulty learning high-frequency features, insufficient receptive field, sensitivity to discretization, and inefficient computational overhead. Drawing from classic local-global approaches in mesh processing, we introduce PoissonNet, a novel neural architecture that overcomes all of these deficiencies by formulating a local-global learning scheme, which uses Poisson’s equation as the primary mechanism for feature propagation. Our core network block is simple; we apply learned local feature transformations in the gradient domain of the mesh, then solve a Poisson system to propagate scalar feature updates across the surface globally. Our local-global learning framework preserves the features’s full frequency spectrum and provides a truly global receptive field, while remaining agnostic to mesh triangulation. Our construction is efficient, requiring far less compute overhead than comparable methods, which enables scalability – both in the size of our datasets, and the size of individual training samples. These qualities are validated on various experiments where, compared to previous intrinsic architectures, we attain state-of-the-art performance on semantic segmentation and parameterizing highly-detailed animated surfaces. Finally, as a central application of PoissonNet, we show its ability to learn deformations, significantly outperforming state-of-the-art architectures that learn on surfaces.
zh
[CV-94] cubic: CUDA-accelerated 3D Bioimage Computing ICCV2025
【速读】:该论文旨在解决当前生物图像分析工具在处理大规模二维(2D)和三维(3D)生物图像数据时面临的可扩展性差、计算效率低以及与现代科学计算工作流集成困难的问题。现有工具普遍存在缺乏应用编程接口(API)、不支持图形处理器(GPU)加速、3D图像处理能力有限及计算密集型任务互操作性差等缺陷。其解决方案的关键在于提出一个名为cubic的开源Python库,该库通过在广泛使用的SciPy和scikit-image API基础上,引入CuPy和RAPIDS cuCIM提供的GPU加速替代实现,构建了一个设备无关(device-agnostic)的API体系;该体系能自动将运算调度至GPU执行(当数据位于GPU上时),否则在CPU上运行,从而无缝加速从预处理到分割和特征提取的多种图像处理流程,同时保持算法精度,并与Python科学计算生态(包括其他GPU加速方法)良好集成,显著提升生物图像分析的效率与可扩展性。
链接: https://arxiv.org/abs/2510.14143
作者: Alexandr A. Kalinin,Anne E. Carpenter,Shantanu Singh,Matthew J. O’Meara
机构: Broad Institute of MIT and Harvard (Broad研究所); University of Michigan Medical School (密歇根大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: accepted to BioImage Computing workshop @ ICCV 2025
Abstract:Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programmable interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic’s API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github . com/alxndrkalinin/cubic
zh
[CV-95] Capture Canonicalize Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images
【速读】:该论文旨在解决从少量无结构手机照片中生成高保真、身份一致的3D虚拟形象(avatar)的问题。现有方法存在两大局限:单视角方法易产生几何不一致性和幻觉,导致身份失真;而基于合成数据训练的模型无法还原皮肤皱纹、细发等高频细节,影响真实感。解决方案的关键在于提出“Capture, Canonicalize, Splat”零样本流水线:首先通过生成式规范化模块(generative canonicalization module)将多视角无结构图像统一为标准化表示,确保几何一致性;其次利用一个基于真实人物穹顶采集数据集训练的Transformer模型,实现对高保真高斯点云(Gaussian splatting)的建模,从而在无需额外训练的情况下生成具有强身份保真度和视觉真实感的静态半身3D avatar。
链接: https://arxiv.org/abs/2510.14081
作者: Emanuel Garbin,Guy Adam,Oded Krams,Zohar Barzelay,Eran Guendelman,Michael Schwarz,Moran Vatelmacher,Yigal Shenkman,Eli Peker,Itai Druker,Uri Patish,Yoav Blum,Max Bluvstein,Junxuan Li,Rawal Khirodkar,Shunsuke Saito
机构: Meta
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
zh
[CV-96] Synchronization of Multiple Videos ICCV2025
【速读】:该论文旨在解决多视角视频或生成式 AI 视频之间的时间同步问题,尤其针对不同场景、多样主体与背景以及非线性时间偏移带来的复杂挑战。解决方案的关键在于提出Temporal Prototype Learning (TPL) 框架,通过从任意预训练模型提取的高维嵌入构建共享且紧凑的一维表示(1D representation),并学习一个统一的原型序列来锚定关键动作阶段,从而避免耗时的成对匹配,实现高效、鲁棒的视频同步。
链接: https://arxiv.org/abs/2510.14051
作者: Avihai Naaman,Ron Shapira Weber,Oren Freifeld
机构: Ben-Gurion University of the Negev (本古里安大学); Data Science Research Center (数据科学研究中心); School of Brain Sciences and Cognition (脑科学与认知学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at this https URL
zh
[CV-97] Vgent: Graph-based Retrieval-Reasoning -Augmented Generation For Long Video Understanding NEURIPS2025
【速读】:该论文旨在解决长视频理解中大视频语言模型(LVLMs)面临的两大挑战:一是超出上下文窗口的视频帧token处理困难,二是难以维持长期时序信息;二是传统检索增强生成(RAG)方法在应用于长视频时易破坏时间依赖性并引入无关信息,影响推理准确性。解决方案的关键在于提出Vgent框架,其核心创新包括:(i) 构建基于语义关系的结构化图表示来保留视频片段间的语义关联,提升检索有效性;(ii) 引入中间推理步骤,通过结构化验证减少检索噪声,并显式聚合跨片段的相关信息,从而增强LVLM的推理能力与上下文感知性。
链接: https://arxiv.org/abs/2510.14032
作者: Xiaoqian Shen,Wenxuan Zhang,Jun Chen,Mohamed Elhoseiny
机构: King Abdullah University of Science and Technology (沙特阿卜杜拉国王科技大学); Meta AI (Meta AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 (Spotlight). Webpage at this https URL
Abstract:Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0%\sim 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6% . Our code is publicly available at this https URL.
zh
[CV-98] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
【速读】:该论文旨在解决现有对抗净化(adversarial purification)方法在应对非加性对抗扰动(non-additive adversarial perturbations,如模糊、遮挡和失真)时效果显著下降的问题。当前方法主要针对加性扰动设计,难以有效分离真实图像与非加性干扰。解决方案的关键在于提出一种扩展的对抗净化框架 NAPPure,其核心思想是通过建立对抗图像的生成过程,并利用最大似然估计(likelihood maximization)对干净图像和扰动参数进行解耦,从而实现对非加性扰动的有效净化,实验表明该方法在 GTSRB 和 CIFAR-10 数据集上显著提升了模型对非加性扰动的鲁棒性。
链接: https://arxiv.org/abs/2510.14025
作者: Junjie Nan,Jianing Li,Wei Chen,Mingkun Zhang,Xueqi Cheng
机构: State Key Laboratory of AI Safety (人工智能安全国家重点实验室); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.
zh
[CV-99] Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer
【速读】:该论文旨在解决前列腺癌中筛状结构(cribriform morphology)识别的临床挑战,即该病理特征与不良预后相关且不适用于主动监测,但其在实际诊断中存在报告不足和不同病理医生间判读一致性差的问题。解决方案的关键在于开发并验证一个基于深度学习的AI系统,采用EfficientNetV2-S编码器结合多实例学习(Multiple Instance Learning, MIL)策略,实现对全切片图像(Whole-slide Images, WSI)的端到端分类。该模型在内部和外部独立数据集上均展现出高准确性和鲁棒性,且在与九名专家病理医生的对比中达到最高的一致性水平,表明其具备达到甚至超越人类专家的诊断能力,从而有望提升诊断可靠性、标准化报告流程并优化治疗决策。
链接: https://arxiv.org/abs/2510.13995
作者: Kelvin Szolnoky,Anders Blilie,Nita Mulliqi,Toyonori Tsuzuki,Hemamali Samaratunga,Matteo Titus,Xiaoyi Ji,Sol Erika Boman,Einar Gudlaugsson,Svein Reidar Kjosavik,José Asenjo,Marcello Gambacorta,Paolo Libretti,Marcin Braun,Radisław Kordek,Roman Łowicki,Brett Delahunt,Kenneth A. Iczkowski,Theo van der Kwast,Geert J. L. H. van Leenders,Katia R. M. Leite,Chin-Chen Pan,Emiel Adrianus Maria Janssen,Martin Eklund,Lars Egevad,Kimmo Kartasalo
机构: Karolinska Institutet (卡罗林斯卡学院); Stavanger University Hospital (斯塔万格大学医院); University of Stavanger (斯塔万格大学); Aichi Medical University (爱知医科大学); University of Queensland (昆士兰大学); Medical University of Lodz (洛兹医科大学); Malaghan Institute of Medical Research (马拉甘医学研究所); University of California - Davis Health (加州大学戴维斯分校健康中心); University of Toronto (多伦多大学); Erasmus MC, University Medical Center (埃因霍温大学医疗中心); University of São Paulo Medical School (圣保罗大学医学院); Taipei Veterans General Hospital (台北荣民总医院); Griffith University (格里菲斯大学); SciLifeLab (瑞典国家生命科学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model’s performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen’s kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen’s kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen’s kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen’s kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.13995 [cs.CV] (or arXiv:2510.13995v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2510.13995 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Kelvin Szolnoky [view email] [v1] Wed, 15 Oct 2025 18:23:34 UTC (743 KB)
zh
[CV-100] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models
【速读】:该论文旨在解决传统视觉模型在遥感图像分析中对大量领域特定标注数据的依赖以及其在复杂环境下的上下文理解能力有限的问题。解决方案的关键在于将传统的目标检测模型(如YOLO)与视觉语言模型(Vision Language Models, VLMs)如LLaVA、ChatGPT和Gemini相结合,利用VLMs强大的多模态理解和语义推理能力增强遥感图像中飞机检测与场景理解的准确性与鲁棒性,尤其在少样本学习和图像退化等挑战性条件下表现显著提升。
链接: https://arxiv.org/abs/2510.13993
作者: Jia Yun Chua,Argyrios Zolotas,Miguel Arana-Catania
机构: Cranfield University (克兰菲尔德大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 7 figures, 8 tables. To be published in Applied AI Letters
Abstract:Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
zh
[CV-101] Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems ICLR2025
【速读】:该论文旨在解决逆问题中从噪声测量数据中恢复真实信号时,传统数据保真项(如均方误差或负对数似然)因逐点匹配噪声数据而导致过拟合的问题。其解决方案的关键在于提出一种新的分布一致性(distributional consistency, DC)损失函数,该方法不再追求与噪声测量值的逐点一致,而是通过模型生成的概率评分评估观测数据是否在统计上与当前估计所隐含的噪声分布一致,从而实现分布层面的校准。DC损失可直接替代传统保真项,兼容现代正则化策略且无需依赖先验信息即可避免噪声过拟合,在图像去噪和医学图像重建等场景中显著提升了性能。
链接: https://arxiv.org/abs/2510.13972
作者: George Webber,Andrew J. Reader
机构: King’s College London (伦敦国王学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: Preprint; submitted to ICLR 2025 for possible publication
Abstract:Recovering true signals from noisy measurements is a central challenge in inverse problems spanning medical imaging, geophysics, and signal processing. Current solutions balance prior assumptions regarding the true signal (regularization) with agreement to noisy measured data (data-fidelity). Conventional data-fidelity loss functions, such as mean-squared error (MSE) or negative log-likelihood, seek pointwise agreement with noisy measurements, often leading to overfitting to noise. In this work, we instead evaluate data-fidelity collectively by testing whether the observed measurements are statistically consistent with the noise distributions implied by the current estimate. We adopt this aggregated perspective and introduce distributional consistency (DC) loss, a data-fidelity objective that replaces pointwise matching with distribution-level calibration using model-based probability scores for each measurement. DC loss acts as a direct and practical plug-in replacement for standard data consistency terms: i) it is compatible with modern regularizers, ii) it is optimized in the same way as traditional losses, and iii) it avoids overfitting to measurement noise even without the use of priors. Its scope naturally fits many practical inverse problems where the measurement-noise distribution is known and where the measured dataset consists of many independent noisy values. We demonstrate efficacy in two key example application areas: i) in image denoising with deep image prior, using DC instead of MSE loss removes the need for early stopping and achieves higher PSNR; ii) in medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances the efficacy of hand-crafted regularization. These results position DC loss as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for inverse problems.
zh
[CV-102] Weight Weaving: Parameter Pooling for Data-Free Model Merging NEURIPS2025
【速读】:该论文旨在解决模型融合(model merging)中缩放参数 λ 设置依赖于评估数据的问题,即现有方法通常需要访问特权数据来调优 λ,这在实际应用中不可行。其解决方案的关键在于提出一种称为 Weight Weaving 的插件式技术,通过在用户定义的池化函数(如平均、随机选择或已有融合方法)下对不同 λ 值下的模型权重进行聚合,从而实现无需评估数据即可高效融合多个专家模型。该方法具有高度模块化特性,与现有融合方法正交且不引入额外约束,显著提升了多种融合策略在无数据场景下的性能表现。
链接: https://arxiv.org/abs/2510.13921
作者: Levy Chaves,Eduardo Valle,Sandra Avila
机构: Recod.ai Lab., Instituto de Computação, Universidade Estadual de Campinas (UNICAMP); Intercom; Valeo.ai
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 3 figures. Accepted at the 3rd UniReps Workshop @ NeurIPS 2025
Abstract:Model merging provides a cost-effective and data-efficient combination of specialized deep neural networks through parameter integration. This technique leverages expert models across downstream tasks without requiring retraining. Most model merging approaches critically depend on scaling hyper-parameters \lambda , which weight each model’s contribution globally or individually. Principled approaches for setting scaling factors without accessing any data (data-free) are scarce, often leading researchers to tune \lambda using privileged data from the evaluation set, which is obviously unfeasible in practice. To address this limitation, we introduce Weight Weaving, a plug-and-play technique that pools model weights across \lambda values search space using user-defined pooling functions, such as averaging, random selection, or even existing model merging methods. Our method demonstrates high modularity, imposing minimal constraints on the search space. It operates orthogonally to existing model merging methods and eliminates evaluation data requirements. We validate Weight Weaving across three ViT variants in three experimental setups: vision multi-task learning, vision continual learning, and domain generalization. Our method consistently improves the performance of several model merging methods, achieving average accuracy gains of up to 15.9 percentage points in a data-free setting.
zh
[CV-103] Post-surgical Endometriosis Segmentation in Laparoscopic Videos
【速读】:该论文旨在解决内异症(endometriosis)在临床诊断中因视觉表现多样且分布于体内多个部位而导致的识别困难问题,尤其针对非专业医疗人员或普通医生易误诊的情况。其解决方案的关键在于开发一套基于训练的图像分割系统,能够从腹腔镜手术视频中自动识别并标注最常见的暗色内异症病灶(dark endometrial implants),通过多色叠加标记病灶区域并生成检测摘要,从而提升视频浏览效率与诊断准确性。
链接: https://arxiv.org/abs/2510.13899
作者: Andreas Leibetseder,Klaus Schoeffmann,Jörg Keckstein,Simon Keckstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: This is a demo paper that was already published this https URL but a preprint/author’s copy is needed for the funding agency
Abstract:Endometriosis is a common women’s condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.
zh
[CV-104] MultiFoodhat: A potential new paradigm for intelligent food quality inspection
【速读】:该论文旨在解决现有食品图像分类模型在面对未见过的食物类别时泛化能力不足的问题,尤其是在依赖大量标注数据的监督学习方法中表现受限。其解决方案的关键在于提出了一种对话驱动的多智能体推理框架 MultiFoodChat,该框架通过视觉-语言模型(VLMs)与大语言模型(LLMs)的协同作用,借助多轮视觉-文本对话实现动态推理;其中,物体感知标记(Object Perception Token, OPT)用于捕捉细粒度视觉特征,交互式推理代理(Interactive Reasoning Agent, IRA)则根据上下文线索不断优化预测结果,从而在无需额外训练或人工标注的情况下实现零样本食物识别,显著提升了准确性和可解释性。
链接: https://arxiv.org/abs/2510.13889
作者: Yue Hu,Guohang Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.
zh
[CV-105] Self-Training with Dynamic Weighting for Robust Gradual Domain Adaptation NIPS25
【速读】:该论文旨在解决渐进式域适应(Gradual Domain Adaptation, GDA)中知识迁移效率低和中间域数据不完整导致的鲁棒性不足问题。其解决方案的关键在于提出一种自训练动态加权方法(Self-Training with Dynamic Weighting, STDW),通过引入一个随训练进程从0到1变化的时间相关超参数 ϱ,自适应地平衡源域与目标域的损失贡献,从而实现稳定的域间知识迁移。该机制在迭代优化过程中结合伪标签生成与加权目标函数,有效缓解域偏移并提升模型在多个中间域上的泛化能力。
链接: https://arxiv.org/abs/2510.13864
作者: Zixi Wang,Yushe Cao,Yubo Huang,Jinzhu Wei,Jingzehua Xu,Shuai Zhang,Xin Lai
机构: University of Electronic Science and Technology of China (电子科技大学); Tsinghua University (清华大学); Zhenguan AI Lab (智冠人工智能实验室); Southwest Jiaotong University (西南交通大学); Shanghai University (上海大学); New Jersey Institute of Technology (新泽西理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: It had formerly appeared as arXiv:2501.19159v2 in error. Accepted by NIPS 25
Abstract:In this paper, we propose a new method called Self-Training with Dynamic Weighting (STDW), which aims to enhance robustness in Gradual Domain Adaptation (GDA) by addressing the challenge of smooth knowledge migration from the source to the target domain. Traditional GDA methods mitigate domain shift through intermediate domains and self-training but often suffer from inefficient knowledge migration or incomplete intermediate data. Our approach introduces a dynamic weighting mechanism that adaptively balances the loss contributions of the source and target domains during training. Specifically, we design an optimization framework governed by a time-varying hyperparameter \varrho (progressing from 0 to 1), which controls the strength of domain-specific learning and ensures stable adaptation. The method leverages self-training to generate pseudo-labels and optimizes a weighted objective function for iterative model updates, maintaining robustness across intermediate domains. Experiments on rotated MNIST, color-shifted MNIST, portrait datasets, and the Cover Type dataset demonstrate that STDW outperforms existing baselines. Ablation studies further validate the critical role of \varrho 's dynamic scheduling in achieving progressive adaptation, confirming its effectiveness in reducing domain bias and improving generalization. This work provides both theoretical insights and a practical framework for robust gradual domain adaptation, with potential applications in dynamic real-world scenarios. The code is available at this https URL.
zh
[CV-106] ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning
【速读】:该论文旨在解决长时程具身规划(long-horizon embodied planning)中的核心挑战:环境不仅受智能体动作影响,还存在与智能体动作并行的外生过程(exogenous processes),如水加热或多米诺骨牌连锁反应,这些过程难以建模且显著增加了规划复杂度。解决方案的关键在于提出一种抽象世界模型框架,通过变分贝叶斯推断(variational Bayesian inference)结合大语言模型(LLM)的先验提议,联合学习两类要素:(i) 符号化状态表示(symbolic state representations)和 (ii) 内生动作与外生机制的因果过程(causal processes)。每个因果过程建模随机因果关系的时间演化,从而实现对动态世界的高效、可泛化的推理与规划,在五个模拟桌面机器人环境中验证了其优于多种基线方法的性能。
链接: https://arxiv.org/abs/2509.26255
作者: Yichao Liang,Dat Nguyen,Cambridge Yang,Tianyang Li,Joshua B. Tenenbaum,Carl Edward Rasmussen,Adrian Weller,Zenna Tavares,Tom Silver,Kevin Ellis
机构: University of Cambridge (剑桥大学); Basis; Cornell University (康奈尔大学); Princeton University (普林斯顿大学); Massachusetts Institute of Technology (麻省理工学院); The Alan Turing Institute (艾伦图灵研究所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 41 pages. The last two authors contributed equally in co-advising
Abstract:Long-horizon embodied planning is challenging because the world does not only change through an agent’s actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent’s actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic cause-effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
zh
[CV-107] owards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline
【速读】:该论文旨在解决当前多模态虚假内容检测中存在的一大挑战:现有模型通常仅针对人类制造的虚假信息(如谣言和误导性帖子)或AI生成内容(如图像合成模型或视觉-语言模型生成的内容)进行专门设计,而在真实场景中,用户无法预先知晓多模态内容的具体伪造类型,导致专用模型泛化能力差、实用性受限。解决方案的关键在于构建一个统一的基准数据集OmniFake(包含12.7万样本,融合了人工标注的虚假信息与新合成的AI生成示例),并提出统一多模态虚假内容检测框架UMFDet;其核心创新包括:基于视觉-语言模型(VLM)主干网络增强的类别感知混合专家(Category-aware Mixture-of-Experts, MoE)适配器,用于捕捉不同类别的特定线索,以及一种归属链式思维机制(attribution chain-of-thought),通过隐式推理引导定位显著的欺骗信号,从而实现对两类虚假内容的鲁棒且一致的检测性能。
链接: https://arxiv.org/abs/2509.25991
作者: Haiyang Li,Yaxiong Wang,Shengeng Tang,Lianwei Wu,Lechao Cheng,Zhun Zhong
机构: Hefei University of Technology (合肥工业大学); Northwestern Polytechnical University (西北工业大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.
zh
[CV-108] A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities
【速读】:该论文旨在解决乳腺密度对乳腺癌筛查敏感性下降的问题,即在乳腺组织致密的女性中,传统乳腺X线摄影(mammography)的检出率显著降低,导致漏诊或延迟诊断。其解决方案的关键在于构建一个基于乳腺密度信息的多模态人工智能(multi-modal AI)框架,通过动态选择成像模态——对脂肪型乳腺使用乳腺X线AI,对致密型乳腺则采用热成像AI(Thermalytix),从而实现针对不同乳腺组织成分的优化检测策略。该方法不仅显著提升了整体敏感性(94.55%)和特异性(79.93%),还克服了单一模态在特定乳腺类型下性能下降的问题,展现出良好的可解释性、低成本和易部署特性。
链接: https://arxiv.org/abs/2510.14340
作者: Siva Teja Kakileti,Bharath Govindaraju,Sudhakar Sampangi,Geetha Manjunath
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.
zh
[CV-109] Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation
【速读】:该论文旨在解决医学图像分割中域适应(domain adaptation)的可靠性问题,尤其是在时空数据(如超声心动图)中因缺乏时间一致性以及噪声和伪影干扰而导致的分割质量下降问题。解决方案的关键在于提出了一种名为RL4Seg3D的无监督域适应框架,其核心创新包括:引入新颖的奖励函数(reward functions)以提升关键解剖标志点的分割精度,并设计融合机制(fusion scheme)来增强时序一致性;同时利用强化学习(reinforcement learning)优化分割结果,在不依赖目标域标签的情况下显著提高分割准确性、解剖学合理性与时序稳定性,并额外提供一个可靠的不确定性估计器,可在测试阶段进一步提升性能。
链接: https://arxiv.org/abs/2510.14244
作者: Arnaud Judge,Nicolas Duchateau,Thierry Judge,Roman A. Sandler,Joseph Z. Sokol,Christian Desrosiers,Olivier Bernard,Pierre-Marc Jodoin
机构: University of Sherbrooke (舍布鲁克大学); INSA, Universite Claude Bernard Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS (法国克莱蒙-奥弗涅大学); École de technologie supérieure (魁北克科技学院); Institut Universitaire de France (法国大学研究院); iCardio.ai (iCardio.ai)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, submitted to IEEE TMI
Abstract:Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at this https URL.
zh
[CV-110] GenCellAgent : Generalizable Training-Free Cellular Image Segmentation via Large Language Model Agents
【速读】:该论文旨在解决细胞图像分割(cellular image segmentation)在多模态数据、形态多样性及标注资源有限等挑战下的性能瓶颈问题。其核心解决方案是提出GenCellAgent,一个无需训练的多智能体框架,通过规划-执行-评估(planner-executor-evaluator)循环实现动态工具调度与自适应优化:系统自动将图像路由至最优专用分割器或通用视觉-语言模型(vision-language models),利用少量参考图像在线调整以应对成像条件差异,并支持文本引导的亚细胞结构(如未覆盖的细胞器)分割;同时,专家修正被存入长期记忆,推动系统自我进化与个性化工作流构建。此设计显著提升了跨基准的分割准确率(平均提升15.7%)和对新对象的泛化能力(如内质网和线粒体IoU提升37.6%),并在无需重新训练的前提下实现了高鲁棒性与灵活性。
链接: https://arxiv.org/abs/2510.13896
作者: Xi Yu,Yang Yang,Qun Liu,Yonghua Du,Sean McSweeney,Yuewei Lin
机构: Brookhaven National Laboratory (布鲁克海文国家实验室)
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: 43 pages
Abstract:Cellular image segmentation is essential for quantitative biology yet remains difficult due to heterogeneous modalities, morphological variability, and limited annotations. We present GenCellAgent, a training-free multi-agent framework that orchestrates specialist segmenters and generalist vision-language models via a planner-executor-evaluator loop (choose tool \rightarrow run \rightarrow quality-check) with long-term memory. The system (i) automatically routes images to the best tool, (ii) adapts on the fly using a few reference images when imaging conditions differ from what a tool expects, (iii) supports text-guided segmentation of organelles not covered by existing models, and (iv) commits expert edits to memory, enabling self-evolution and personalized workflows. Across four cell-segmentation benchmarks, this routing yields a 15.7% mean accuracy gain over state-of-the-art baselines. On endoplasmic reticulum and mitochondria from new datasets, GenCellAgent improves average IoU by 37.6% over specialist models. It also segments novel objects such as the Golgi apparatus via iterative text-guided refinement, with light human correction further boosting performance. Together, these capabilities provide a practical path to robust, adaptable cellular image segmentation without retraining, while reducing annotation burden and matching user preferences.
zh
人工智能
[AI-0] CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在实际应用中因忽视安全性而导致的潜在灾难性后果问题。传统方法虽可通过控制屏障函数(Control Barrier Functions, CBF)在线部署安全过滤器以保障动态安全性,但RL策略无法感知CBF约束,常导致保守行为。其解决方案的关键在于提出CBF-RL框架,通过在训练阶段显式引入CBF项来最小化对原始RL策略的修改,并结合训练过程中的策略轨迹安全过滤,使学习到的策略内化安全约束——不仅强制执行更安全的动作,还偏向于获取更安全的奖励信号,从而实现无需在线安全滤波器即可安全部署的目标。理论证明连续时间安全滤波器可基于离散时间轨迹的闭式表达式实现,实验验证了该方法在导航任务和Unitree G1人形机器人上的有效性,显著提升了探索安全性、收敛速度与不确定性下的鲁棒性。
链接: https://arxiv.org/abs/2510.14959
作者: Lizhi Yang,Blake Werner,Massimiliano de Sa Aaron D. Ames
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 8 pages
Abstract:Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety – traditionally deployed \emphonline via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs \emphin training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, (2) and safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy – both enforcing safer actions and biasing towards safer rewards – enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
zh
[AI-1] Architecture Is All You Need: Diversity-Enabled Sweet Spots for Robust Humanoid Locomotion
【速读】:该论文旨在解决非结构化环境中类人机器人(humanoid robot)稳定行走的问题,特别是如何在感知信息有限的情况下实现鲁棒的运动控制。传统端到端(end-to-end)控制架构难以兼顾快速低层稳定性和慢速感知决策之间的时序差异,导致在复杂地形中表现不稳定。解决方案的关键在于采用分层控制架构(Layered Control Architecture, LCA),即高频运行的本体感觉稳定器(proprioceptive stabilizer)与低频感知策略(perceptual policy)分离设计,并通过两阶段训练流程(先盲稳定器预训练,再感知微调)来优化系统性能。实验表明,这种基于时间尺度分离的架构显著优于单阶段方案,即使使用简化的感知编码器也能在仿真和真实硬件(Unitree G1)上成功完成楼梯和边缘行走任务,证明了架构分层而非网络规模或复杂度是实现鲁棒感知驱动运动的核心因素。
链接: https://arxiv.org/abs/2510.14947
作者: Blake Werner,Lizhi Yang,Aaron D. Ames
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 8 pages
Abstract:Robust humanoid locomotion in unstructured environments requires architectures that balance fast low-level stabilization with slower perceptual decision-making. We show that a simple layered control architecture (LCA), a proprioceptive stabilizer running at high rate, coupled with a compact low-rate perceptual policy, enables substantially more robust performance than monolithic end-to-end designs, even when using minimal perception encoders. Through a two-stage training curriculum (blind stabilizer pretraining followed by perceptual fine-tuning), we demonstrate that layered policies consistently outperform one-stage alternatives in both simulation and hardware. On a Unitree G1 humanoid, our approach succeeds across stair and ledge tasks where one-stage perceptual policies fail. These results highlight that architectural separation of timescales, rather than network scale or complexity, is the key enabler for robust perception-conditioned locomotion.
zh
[AI-2] GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
【速读】:该论文旨在解决当前过程奖励模型(Process Reward Models, PRMs)在多步推理任务中面临的三大核心挑战:奖励噪声高、事实一致性差以及与步骤级推理目标存在偏差。现有方法受限于人工标注成本高、LLM自评易产生幻觉或蒙特卡洛(Monte Carlo, MC)估计因信用分配错误导致监督信号失准等问题。其解决方案的关键在于提出GroundedPRM框架,通过三个创新机制实现自动化的高质量过程监督:首先利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)构建结构化推理路径以实现细粒度信用分配;其次引入外部工具对每一步进行执行验证,消除幻觉性监督并提升事实准确性;最后设计融合工具验证结果与MCTS反馈的混合奖励聚合机制,并将奖励信号格式化为带解释增强的生成式结构,从而提升可解释性和与指令微调大语言模型(Large Language Models, LLMs)的兼容性。
链接: https://arxiv.org/abs/2510.14942
作者: Yao Zhang,Yu Wu,Haowei Zhang,Weiguo Li,Haokun Chen,Jingpei Wu,Guohao Li,Zhen Han,Volker Tresp
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages
Abstract:Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
zh
[AI-3] Mapping Smarter Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates
【速读】:该论文旨在解决企业智能平台在集成来自多个第三方供应商的日志数据时面临的schema映射(schema mapping)难题,尤其是在缺乏完整或准确供应商文档的情况下。传统方法依赖大语言模型(LLM)进行自动映射,但准确性不足且难以迭代优化。其解决方案的关键在于设计了一个基于强化学习的代理(agent),该代理无需标注样本或更新模型权重即可自我改进:在推理阶段,它能识别模糊的字段映射、生成针对性网络搜索查询以获取外部证据,并通过置信度奖励机制逐步优化映射结果。实验表明,该方法将映射准确率从仅使用LLM的56.4%提升至93.94%,同时显著减少需人工复核的低置信度映射(降低85%),从而提供一种可解释、可扩展、适应性强的证据驱动型映射策略。
链接: https://arxiv.org/abs/2510.14900
作者: Wen-Kwang Tsao,Yao-Ching Yu,Chien-Ming Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:The Enterprise Intelligence Platform must integrate logs from numerous third-party vendors in order to perform various downstream tasks. However, vendor documentation is often unavailable at test time. It is either misplaced, mismatched, poorly formatted, or incomplete, which makes schema mapping challenging. We introduce a reinforcement learning agent that can self-improve without labeled examples or model weight updates. During inference, the agent: 1) Identifies ambiguous field-mapping attempts. 2) Generates targeted web-search queries to gather external evidence. 3) Applies a confidence-based reward to iteratively refine its mappings. To demonstrate this concept, we converted Microsoft Defender for Endpoint logs into a common schema. Our method increased mapping accuracy from 56.4%(LLM-only) to 72.73%(RAG) to 93.94% over 100 iterations using GPT-4o. At the same time, it reduced the number of low-confidence mappings requiring expert review by 85%. This new approach provides an evidence-driven, transparent method for solving future industry problems, paving the way for more robust, accountable, scalable, efficient, flexible, adaptable, and collaborative solutions.
zh
[AI-4] Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards
【速读】:该论文旨在解决高风险人工智能应用中因不可逆错误导致的安全问题,特别是在缺乏导师(mentor)指导的情况下,如何安全地进行探索性学习。传统序贯决策理论通常假设所有错误均可恢复(如通过奖励边界约束),但在高风险场景下,单次错误操作可能造成无法挽回的损害,而激进探索的带宽算法(bandit algorithms)在此类环境中存在显著风险。为应对这一挑战,作者提出一种两动作上下文带宽模型(two-action contextual bandit),其中代理在每轮观察输入后选择“弃权”(abstain,固定0奖励)或“执行”(commit,执行预设任务策略),且执行后的奖励虽有上界但可任意负值,同时假设奖励函数关于输入满足Lipschitz连续性。解决方案的关键在于设计一种基于谨慎性的算法(caution-based algorithm),其核心思想是识别一个可信区域,在该区域内仅当现有证据不足以确认危害时才进行执行,从而实现对危险情形的有效规避;在独立同分布(i.i.d.)输入假设下,该算法可实现次线性遗憾(sublinear regret),从理论上证明了谨慎探索在高风险环境部署学习代理的有效性。
链接: https://arxiv.org/abs/2510.14884
作者: Sarah Liaw,Benjamin Plaut
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 1 figure; under submission
Abstract:In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all of sequential decision-making theory assumes that all errors are recoverable (e.g., by bounding rewards). Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails. Some prior work avoids irreparable errors by asking for help from a mentor, but a mentor may not always be available. In this work, we formalize a model of learning with unbounded rewards without a mentor as a two-action contextual bandit with an abstain option: at each round the agent observes an input and chooses either to abstain (always 0 reward) or to commit (execute a preexisting task policy). Committing yields rewards that are upper-bounded but can be arbitrarily negative, and the commit reward is assumed Lipschitz in the input. We propose a caution-based algorithm that learns when not to learn: it chooses a trusted region and commits only where the available evidence does not already certify harm. Under these conditions and i.i.d. inputs, we establish sublinear regret guarantees, theoretically demonstrating the effectiveness of cautious exploration for deploying learning agents safely in high-stakes environments.
zh
[AI-5] he Gatekeeper Knows Enough
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为自主代理在实际应用中因有限上下文窗口和状态不同步所导致的不可靠输出、行为不可预测及资源利用效率低下等问题,尤其是在与大型、结构化且敏感的知识系统(如代码库和文档)交互时更为显著。解决方案的关键在于提出一种名为“Gatekeeper Protocol”的新型、领域无关的框架,该协议强制代理首先在系统的一个低保真度“潜在状态”(latent state)表示上进行操作与推理,并按需请求高保真度上下文;所有交互通过统一的JSON格式进行中介,形成声明式、状态同步的协议,从而确保代理对系统的认知始终可验证地扎根于系统真实状态。
链接: https://arxiv.org/abs/2510.14881
作者: Fikresilase Wondmeneh Abebayew
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 7 pages, 1 figure
Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous agents, yet their practical utility is fundamentally constrained by a limited context window and state desynchronization resulting from the LLMs’ stateless nature and inefficient context management. These limitations lead to unreliable output, unpredictable behavior, and inefficient resource usage, particularly when interacting with large, structured, and sensitive knowledge systems such as codebases and documents. To address these challenges, we introduce the Gatekeeper Protocol, a novel, domain-agnostic framework that governs agent-system interactions. Our protocol mandates that the agent first operate and reason on a minimalist, low-fidelity “latent state” representation of the system to strategically request high-fidelity context on demand. All interactions are mediated through a unified JSON format that serves as a declarative, state-synchronized protocol, ensuring the agent’s model of the system remains verifiably grounded in the system’s reality. We demonstrate the efficacy of this protocol with Sage, a reference implementation of the Gatekeeper Protocol for software development. Our results show that this approach significantly increases agent reliability, improves computational efficiency by minimizing token consumption, and enables scalable interaction with complex systems, creating a foundational methodology for building more robust, predictable, and grounded AI agents for any structured knowledge domain.
zh
[AI-6] Predicting kernel regression learning curves from only raw data statistics
【速读】:该论文旨在解决如何基于真实数据集(如CIFAR-5m、SVHN和ImageNet)上核回归的学习曲线预测问题,即准确刻画测试风险(test risk)与样本规模之间的关系。其核心挑战在于现有理论多基于理想化假设(如高斯分布),难以直接适用于复杂的真实图像数据。解决方案的关键是提出“赫米特特征结构假设”(Hermite eigenstructure ansatz, HEA),该假设通过解析近似核函数在各向异性数据分布下的特征值与特征函数,发现其形式类似于数据的赫米特多项式(Hermite polynomials),从而将学习曲线预测问题转化为仅依赖两个可计算量——经验数据协方差矩阵和目标函数 $ f_* $ 的经验多项式分解。作者证明了HEA在高斯数据下成立,并实验证明其在真实图像数据中也具有良好的适用性,为从数据结构到模型性能的端到端理论建模提供了可行性路径。
链接: https://arxiv.org/abs/2510.14878
作者: Dhruva Karkada,Joseph Turnbull,Yuxi Liu,James B. Simon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study kernel regression with common rotation-invariant kernels on real datasets including CIFAR-5m, SVHN, and ImageNet. We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function f_* . The key new idea is an analytical approximation of a kernel’s eigenvalues and eigenfunctions with respect to an anisotropic data distribution. The eigenfunctions resemble Hermite polynomials of the data, so we call this approximation the Hermite eigenstructure ansatz (HEA). We prove the HEA for Gaussian data, but we find that real image data is often “Gaussian enough” for the HEA to hold well in practice, enabling us to predict learning curves by applying prior results relating kernel eigenstructure to test risk. Extending beyond kernel regression, we empirically find that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by the HEA. Our HEA framework is a proof of concept that an end-to-end theory of learning which maps dataset structure all the way to model performance is possible for nontrivial learning algorithms on real datasets.
zh
[AI-7] LabOS: The AI-XR Co-Scientist That Sees and Works With Humans
【速读】:该论文试图解决传统科学研究中计算与实验脱节的问题,即AI在科研中的应用长期局限于数据处理和模拟设计,难以直接参与物理实验过程。解决方案的关键在于构建LabOS系统,其核心是通过多模态感知、自进化智能体(self-evolving agents)以及扩展现实(Extended-Reality, XR)驱动的人机协同机制,使AI能够理解实验场景、实时响应实验操作,并与人类科学家共同执行实验任务,从而实现从“计算辅助”到“协同执行”的跨越,推动实验室向智能化、人机共融的科研环境演进。
链接: https://arxiv.org/abs/2510.14861
作者: Le Cong,Zaixi Zhang,Xiaotong Wang,Yin Di,Ruofan Jin,Michal Gerasimiuk,Yinkai Wang,Ravi K. Dinesh,David Smerkous,Alex Smerkous,Xuekun Wu,Shilong Liu,Peishan Li,Yi Zhu,Simran Serrao,Ning Zhao,Imran A. Mohammad,John B. Sunwoo,Joseph C. Wu,Mengdi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and Entended-Reality(XR)-enabled human-AI collaboration. By connecting multi-model AI agents, smart glasses, and human-AI collaboration, LabOS allows AI to see what scientists see, understand experimental context, and assist in real-time execution. Across applications–from cancer immunotherapy target discovery to stem-cell engineering – LabOS shows that AI can move beyond computational design to participation, turning the laboratory into an intelligent, collaborative environment where human and machine discovery evolve together.
zh
[AI-8] Boosting Instruction Following at Scale
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在实际应用中因提示词(prompt)指令冗余或冲突导致的指令遵循可靠性下降问题。现有方法依赖于手动调整提示内容,但难以保证新增指令被有效执行。解决方案的关键在于提出一种后生成阶段的“指令增强”(Instruction Boosting)方法,通过量化分析提示指令间的语义冲突程度,识别并缓解指令之间的张力(tension),从而显著提升模型对多条指令的遵循准确率——实验表明,该方法可使两指令场景下指令遵循率提升最高达7个百分点,十指令场景下提升4个百分点。论文进一步构建了SCALEDIF基准,用于系统评估高指令密度下的性能变化,并提供了一个定量冲突评分工具以指导开发者优化提示设计。
链接: https://arxiv.org/abs/2510.14842
作者: Ben Elder,Evelyn Duesterwald,Vinod Muthusamy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6+4 pages, 7 figures, 2 tables
Abstract:A typical approach developers follow to influence an LLM’s behavior in an application is through careful manipulation of the prompt, such as by adding or modifying instructions. However, merely adding more instructions provides little assurance that they will actually be followed. We introduce Instruction Boosting as a post-generation method to increase the reliability of LLM prompt instructions. We show that Instruction Boosting improves the instruction following rate by up to 7 points for two instructions and up to 4 points for ten instructions. To demonstrate these results we introduce SCALEDIF, a benchmark with a scaled instruction volume of up to ten instructions per data sample. We also present an analysis of the commonly observed trend that performance degrades as more instructions are added. We show that an important factor contributing to this trend is the degree of tension and conflict that arises as the number of instructions is increased. We contribute a quantitative conflict scoring tool that explains the observed performance trends and provides feedback to developers on the impact that additional prompt instructions have on a model’s performance.
zh
[AI-9] RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
【速读】:该论文旨在解决现实场景中机器人操作(如家庭和工厂环境)对可靠性、效率和鲁棒性的高要求问题,目标是使机器人性能达到甚至超越熟练人类操作员的水平。解决方案的关键在于提出一种名为RL-100的强化学习训练框架,其核心创新为三阶段流程:首先通过模仿学习利用人类先验知识;其次采用迭代离线强化学习结合离线策略评估(Offline Policy Evaluation, OPE)机制,以保守方式优化扩散视觉运动策略(diffusion visuomotor policies),确保改进过程稳定可靠;最后通过在线强化学习消除残余失败模式。此外,引入轻量级一致性蒸馏头将多步采样过程压缩为单步策略,显著降低延迟一个数量级,同时保持任务性能,从而实现高频控制与长时间运行的鲁棒性(最高达两小时不间断操作)。
链接: https://arxiv.org/abs/2510.14830
作者: Kun Lei,Huanyu Li,Dongjie Yu,Zhenyu Wei,Lingxiao Guo,Zhennan Jiang,Ziyu Wang,Shiyu Liang,Huazhe Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL
Abstract:Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained bu supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves near-human teleoperation or better time efficiency and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.
zh
[AI-10] RoboGPT -R1: Enhancing Robot Planning with Reinforcement Learning
【速读】:该论文旨在解决具身智能体在复杂现实环境中执行长视距操作任务时,因常识和推理能力受限而导致的性能瓶颈问题。现有基于监督微调(Supervised Fine-Tuning, SFT)的大语言模型和视觉语言模型虽在规划任务中表现良好,但在多步骤操作中的物理理解与动作序列一致性方面仍存在不足。解决方案的关键在于提出一种两阶段微调框架 RoboGPT-R1:首先通过专家示范序列进行监督训练获取基础知识,随后引入强化学习(Reinforcement Learning, RL)提升模型对视觉-空间关系的理解与推理能力;同时设计了一个规则驱动的奖励函数,综合考量长程任务绩效与环境动作约束,从而显著增强模型在 EmbodiedBench 基准上的表现,优于 GPT-4o-mini 和 Qwen2.5-VL-7B 等更大规模模型。
链接: https://arxiv.org/abs/2510.14828
作者: Jinrui Liu,Bingyan Nie,Boyu Li,Yaran Chen,Yuze Wang,Shunsen He,Haoran Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model’s shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
zh
[AI-11] Agent ic NL2SQL to Reduce Computational Costs NEURIPS2025
【速读】:该论文旨在解决大规模SQL数据库上的自然语言转SQL(NL2SQL)任务中因需处理大量元信息而导致的提示词过长、计算成本高昂的问题。解决方案的关键在于提出Datalake Agent,一个基于代理的交互式系统,通过在推理框架内循环性地仅请求完成特定表问答任务所必需的元信息,从而显著减少LLM输入的token数量(最多降低87%),在保持竞争性能的同时实现显著的成本节约。
链接: https://arxiv.org/abs/2510.14808
作者: Dominik Jehle,Lennart Purucker,Frank Hutter
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the NeurIPS 2025 Workshop on Efficient Reasoning. 10 pages, 11 figures
Abstract:Translating natural language queries into SQL queries (NL2SQL or Text-to-SQL) has recently been empowered by large language models (LLMs). Using LLMs to perform NL2SQL methods on a large collection of SQL databases necessitates processing large quantities of meta-information about the databases, which in turn results in lengthy prompts with many tokens and high processing costs. To address this challenge, we introduce Datalake Agent, an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently. Instead of utilizing direct solvers for NL2SQL that call the LLM once with all meta-information in the prompt, the Datalake Agent employs an interactive loop to reduce the utilized meta-information. Within the loop, the LLM is used in a reasoning framework that selectively requests only the necessary information to solve a table question answering task. We evaluate the Datalake Agent on a collection of 23 databases with 100 table question answering tasks. The Datalake Agent reduces the tokens used by the LLM by up to 87% and thus allows for substantial cost reductions while maintaining competitive performance.
zh
[AI-12] SimKO: Simple Pass@K Policy Optimization
【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练过程中存在的系统性探索不足问题,即模型倾向于过度聚焦于当前最优解(exploitation),导致在多候选答案评估指标 pass@K(K > 1)上的性能下降。其核心发现是:RLVR 方法在训练中表现出显著的 token 级概率集中效应(probability concentration effect),即 top-1 词汇概率不断累积而抑制其他候选词的概率分布,且这种过强的集中趋势与较差的 pass@K 表现正相关。为此,作者提出 Simple Pass@K Optimization (SimKO),其关键在于采用不对称优化策略——对正确响应增强 top-K 候选词的概率,对错误响应则对 top-1 候选词施加强惩罚,并特别针对高熵 token 应用此机制,从而有效缓解概率集中问题并提升探索能力。
链接: https://arxiv.org/abs/2510.14807
作者: Ruotian Peng,Yi Ren,Zhouliang Yu,Weiyang Liu,Yandong Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Technical report (20 pages, 10 figures, project page: this https URL )
Abstract:Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR’s exploration.
zh
[AI-13] Cross-Scenario Unified Modeling of User Interests at Billion Scale
【速读】:该论文旨在解决当前推荐系统在多场景下用户兴趣建模不足的问题,尤其是在内容平台中,用户行为分散于搜索、信息流浏览和内容发现等多种异构场景中,传统推荐系统往往局限于单一场景的业务指标优化,忽视跨场景的行为信号,并难以高效集成大语言模型(Large Language Model, LLM)进行百亿级规模部署,从而限制了对用户全链路兴趣的捕捉能力。其解决方案的关键在于提出RED-Rec——一种面向多样化场景的LLM增强型分层推荐引擎,通过聚合与合成多场景下的用户行为动作,统一构建用户兴趣表示;核心创新包括一个两塔式LLM驱动的框架以实现高效且细粒度的用户与物品表征,以及一种场景感知的密集混合与查询策略,有效融合跨场景行为信号,捕获用户意图模式并在服务阶段表达细粒度上下文特定意图。
链接: https://arxiv.org/abs/2510.14788
作者: Manjie Xu,Cheng Chen,Xin Jia,Jingyi Zhou,Yongji Wu,Zejian Wang,Chi Zhang,Kai Zuo,Yibo Chen,Xu Tang,Yao Hu,Yixin Zhu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: The dataset, code, and models will be released soon
Abstract:User interests on content platforms are inherently diverse, manifesting through complex behavioral patterns across heterogeneous scenarios such as search, feed browsing, and content discovery. Traditional recommendation systems typically prioritize business metric optimization within isolated specific scenarios, neglecting cross-scenario behavioral signals and struggling to integrate advanced techniques like LLMs at billion-scale deployments, which finally limits their ability to capture holistic user interests across platform touchpoints. We propose RED-Rec, an LLM-enhanced hierarchical Recommender Engine for Diversified scenarios, tailored for industry-level content recommendation systems. RED-Rec unifies user interest representations across multiple behavioral contexts by aggregating and synthesizing actions from varied scenarios, resulting in comprehensive item and user modeling. At its core, a two-tower LLM-powered framework enables nuanced, multifaceted representations with deployment efficiency, and a scenario-aware dense mixing and querying policy effectively fuses diverse behavioral signals to capture cross-scenario user intent patterns and express fine-grained, context-specific intents during serving. We validate RED-Rec through online A/B testing on hundreds of millions of users in RedNote through online A/B testing, showing substantial performance gains in both content recommendation and advertisement targeting tasks. We further introduce a million-scale sequential recommendation dataset, RED-MMU, for comprehensive offline training and evaluation. Our work advances unified user modeling, unlocking deeper personalization and fostering more meaningful user engagement in large-scale UGC platforms.
zh
[AI-14] Beyond Multi-Token Prediction: Pretraining LLM s with Future Summaries
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在长程推理、规划和创意写作任务中表现受限的问题,其根源在于教师强制训练(teacher-forced training)导致的短视性预测机制。为突破这一瓶颈,作者提出未来摘要预测(Future Summary Prediction, FSP),其核心创新在于引入一个辅助头(auxiliary head),用于预测序列长期未来的紧凑表示(compact representation),从而保留对长文本生成至关重要的信息。FSP通过两种实现方式:手工设计摘要(如未来词袋表示)和学习型摘要(利用从右到左训练的反向语言模型生成嵌入),显著优于传统的单标记预测(Next-token Prediction, NTP)和多标记预测(Multi-token Prediction, MTP),在数学、推理和编程基准测试中均展现出提升效果。
链接: https://arxiv.org/abs/2510.14751
作者: Divyat Mahajan,Sachin Goyal,Badr Youbi Idrissi,Mohammad Pezeshki,Ioannis Mitliagkas,David Lopez-Paz,Kartik Ahuja
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. Under Review
Abstract:Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
zh
[AI-15] Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling
【速读】:该论文旨在解决大规模语言模型预训练中因固定批量大小(batch size)导致的训练效率瓶颈问题,尤其是在使用自适应优化器(如Adam)时,如何通过合理的批量大小调度策略来加速训练过程。其核心挑战在于:相较于随机梯度下降(SGD)中批量加倍可等价于学习率减半的明确关系,自适应优化器缺乏类似的理论指导,使得批量 ramping 策略通常依赖经验调参。解决方案的关键是提出 Seesaw 调度框架——当标准学习率调度器将学习率减半时,Seesaw 采用乘以 1/2 并同时将批量大小加倍的方式,从而在保持损失动态不变的前提下减少串行训练步骤。该方法基于首次针对噪声线性回归场景下 SGD 的有限样本等价性证明,并扩展至归一化 SGD(一种 Adam 的可分析代理),在实际训练中验证了其有效性:在 Chinchilla 规模下对 1.5亿至6亿参数模型进行实验,Seesaw 在相同浮点运算量(FLOPs)下达到与余弦衰减相当的性能,同时将实际运行时间减少约36%,逼近理论极限。
链接: https://arxiv.org/abs/2510.14717
作者: Alexandru Meterez,Depen Morwani,Jingfeng Wu,Costin-Andrei Oncescu,Cengiz Pehlevan,Sham Kakade
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:Increasing the batch size during training – a ‘‘batch ramp’’ – is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by 1/\sqrt2 and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by \approx 36% , approaching the theoretical limit implied by our analysis.
zh
[AI-16] oolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
【速读】:该论文旨在解决当前推理扩展(inference scaling)技术主要应用于非结构化输出生成任务,而在结构化输出任务(如函数调用)中应用仍处于探索阶段的问题。其解决方案的关键在于提出一种结合细粒度束搜索(fine-grained beam search)与过程奖励模型(ToolPRM)的推理扩展框架。ToolPRM通过在单次函数调用内部步骤上进行评分,实现了对结构化工具使用推理过程的细粒度监督;为此,研究者还构建了首个基于函数掩码技术自动标注的细粒度跨调用过程监督数据集,用于训练ToolPRM。实验表明,该方法显著提升了函数调用任务中的预测准确性,并揭示了一个关键原则:“探索更多但保留更少”,这是由于结构化函数调用生成具有不可恢复性特征所致。
链接: https://arxiv.org/abs/2510.14703
作者: Jianghao Lin,Yuanyuan Shi,Xin Peng,Renjie Ding,Hairui Wang,Yuxuan Peng,Bizhe Bai,Weixi Song,Fengshuo Bai,Huacan Chai,Weinan Zhang,Fei Huang,Ying Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scaling primarily focuses on unstructured output generation tasks, leaving its application in structured outputs, like function calling, largely underexplored. To bridge this gap, we propose an inference scaling framework that combines fine-grained beam search with a process reward model, ToolPRM, which scores the internal steps of each single function call. To train ToolPRM, we construct the first fine-grained intra-call process supervision dataset, automatically annotated with function-masking techniques to provide step-level rewards for structured tool-use reasoning. Extensive experiments demonstrate that ToolPRM beats the coarse-grained and outcome reward models in terms of predictive accuracy, indicating its stronger capability in supervising the function calling inference process. Inference scaling technique equipped with ToolPRM also significantly improves the backbone model performance across various function calling tasks and benchmarks. More importantly, we reveal a key principle for applying inference scaling techniques to structured outputs: “explore more but retain less” due to the unrecoverability characteristics of structured function calling generation.
zh
[AI-17] Cognitive-Aligned Spatio-Temporal Large Language Models For Next Point-of-Interest Prediction
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的下一个兴趣点(Point-of-Interest, POI)推荐任务中存在的两大问题:一是LLMs缺乏对结构化地理实体和移动轨迹序列模式的原生理解能力;二是工业级POI推荐系统难以有效融合世界知识(如季节、天气、节假日)与用户认知特征(如习惯、职业、偏好),从而限制了推荐性能与用户体验。解决方案的关键在于提出CoAST框架,其核心机制包含两个阶段:第一阶段通过在脱敏用户的时空轨迹数据上继续预训练,实现推荐知识的获取;第二阶段通过监督微调(Supervised Fine-Tuning, SFT)和强化学习(Reinforcement Learning, RL)相结合的方式,对齐人类认知判断与用户偏好,从而实现认知对齐(Cognitive Alignment)。该方法以自然语言为接口,实现了多源信息(时空模式、用户画像、情境信息)的有效整合,显著提升了推荐效果。
链接: https://arxiv.org/abs/2510.14702
作者: Penglong Zhai,Jie Li,Fanyi Di,Yue Liu,Yifang Yuan,Jie Huang,Peng Wu,Sicong Wang,Mingyang Yin,Tingting Hu,Yao Xu,Xin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures
Abstract:The next point-of-interest (POI) recommendation task aims to predict the users’ immediate next destinations based on their preferences and historical check-ins, holding significant value in location-based services. Recently, large language models (LLMs) have shown great potential in recommender systems, which treat the next POI prediction in a generative manner. However, these LLMs, pretrained primarily on vast corpora of unstructured text, lack the native understanding of structured geographical entities and sequential mobility patterns required for next POI prediction tasks. Moreover, in industrial-scale POI prediction applications, incorporating world knowledge and alignment of human cognition, such as seasons, weather conditions, holidays, and users’ profiles (such as habits, occupation, and preferences), can enhance the user experience while improving recommendation performance. To address these issues, we propose CoAST (Cognitive-Aligned Spatial-Temporal LLMs), a framework employing natural language as an interface, allowing for the incorporation of world knowledge, spatio-temporal trajectory patterns, profiles, and situational information. Specifically, CoAST mainly comprises of 2 stages: (1) Recommendation Knowledge Acquisition through continued pretraining on the enriched spatial-temporal trajectory data of the desensitized users; (2) Cognitive Alignment to align cognitive judgments with human preferences using enriched training data through Supervised Fine-Tuning (SFT) and a subsequent Reinforcement Learning (RL) phase. Extensive offline experiments on various real-world datasets and online experiments deployed in “Guess Where You Go” of AMAP App homepage demonstrate the effectiveness of CoAST.
zh
[AI-18] FedPPA: Progressive Parameter Alignment for Personalized Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端计算资源异构性和数据非独立同分布(non-IID)导致的模型性能下降问题,尤其关注现有个性化联邦学习(Personalized Federated Learning, PFL)方法在同时面对模型和数据异构性时的不足。其解决方案的关键在于提出一种名为渐进参数对齐(Progressive Parameter Alignment, FedPPA)的新机制,通过逐步对齐各客户端共享层权重与全局模型权重,缓解客户端更新过程中全局与本地模型间的不一致性,同时保留客户端本地知识,从而提升在non-IID场景下的个性化鲁棒性;此外,进一步引入基于熵的加权平均策略以增强全局模型性能,同时保持强个性化能力。
链接: https://arxiv.org/abs/2510.14698
作者: Maulidi Adi Prasetia,Muhamad Risqi U. Saputra,Guntur Dharma Putra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, TrustCom 2025 Conference
Abstract:Federated Learning (FL) is designed as a decentralized, privacy-preserving machine learning paradigm that enables multiple clients to collaboratively train a model without sharing their data. In real-world scenarios, however, clients often have heterogeneous computational resources and hold non-independent and identically distributed data (non-IID), which poses significant challenges during training. Personalized Federated Learning (PFL) has emerged to address these issues by customizing models for each client based on their unique data distribution. Despite its potential, existing PFL approaches typically overlook the coexistence of model and data heterogeneity arising from clients with diverse computational capabilities. To overcome this limitation, we propose a novel method, called Progressive Parameter Alignment (FedPPA), which progressively aligns the weights of common layers across clients with the global model’s weights. Our approach not only mitigates inconsistencies between global and local models during client updates, but also preserves client’s local knowledge, thereby enhancing personalization robustness in non-IID settings. To further enhance the global model performance while retaining strong personalization, we also integrate entropy-based weighted averaging into the FedPPA framework. Experiments on three image classification datasets, including MNIST, FMNIST, and CIFAR-10, demonstrate that FedPPA consistently outperforms existing FL algorithms, achieving superior performance in personalized adaptation.
zh
[AI-19] Purifying Task Vectors in Knowledge-Aware Subspace for Model Merging
【速读】:该论文旨在解决模型合并(model merging)过程中因任务无关冗余(task-irrelevant redundancy)导致的性能下降问题,尤其是由任务向量(task vector)中包含的非目标知识成分所引发的冲突。现有方法通过随机裁剪参数空间元素来缓解冗余,但缺乏对知识结构的感知,易引入偏差。解决方案的关键在于提出一种基于知识感知子空间的净化任务向量方法(Purifying TAsk Vectors, PAVE):首先利用各任务的训练样本在对应微调模型中提取前线性层的协方差矩阵,进而进行面向上下文的奇异值分解(context-oriented singular value decomposition),从而在知识感知子空间中区分任务相关与冗余权重分量;随后通过剪枝冗余部分净化任务向量,并引入谱秩分配策略(spectral rank allocation strategy)以实现跨模型公平的剪枝力度控制,最终形成可插拔式优化方案,适用于多种基于任务向量的合并方法并显著提升其性能。
链接: https://arxiv.org/abs/2510.14697
作者: Bang An,Yibo Yang,Philip Torr,Bernard Ghanem
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Model merging aims to integrate task-specific abilities from individually fine-tuned models into a single model without extra training. In recent model merging methods, task vector has become a fundamental building block, as it can encapsulate the residual information from finetuning. However, the merged model often suffers from notable performance degradation due to the conflicts caused by task-irrelevant redundancy in task vectors. Existing efforts in overcoming redundancy by randomly dropping elements in the parameter space involves randomness and lacks knowledge awareness. To address these challenges, in this study, we propose Purifying TAsk Vectors (PAVE) in knowledge-aware subspace. Concretely, we sample some training examples from each task, and feed them into their corresponding fine-tuned models to acquire the covariance matrices before linear layers. We then perform a context-oriented singular value decomposition, which accentuates the weight components most relevant to the target knowledge. As a result, we can split fine-tuned model weights into task-relevant and redundant components in the knowledge-aware subspace, and purify the task vector by pruning the redundant components. To induce fair pruning efforts across models, we further introduce a spectral rank allocation strategy by optimizing a normalized activated pruning error. The task vector purification by our method as a plug-and-play scheme is applicable across various task vector-based merging methods to improve their performance. In experiments, we demonstrate the effectiveness of PAVE across a diverse set of merging methods, tasks, and model architectures.
zh
[AI-20] xLLM Technical Report
【速读】:该论文旨在解决大规模企业级部署中大语言模型(Large Language Model, LLM)推理服务的性能瓶颈与资源利用率低下问题,尤其是在多样化AI加速器环境下的高效调度与计算优化难题。其核心解决方案是提出一个解耦式的服务-引擎架构(decoupled service-engine architecture):在服务层,xLLM-Service通过智能调度模块实现多模态请求处理、在线与离线任务共置,并引入工作负载自适应的Prefill-Decode(PD)和Encode-Prefill-Decode(EPD)拆分策略,结合分布式KV Cache管理与容错机制以提升集群利用率和可用性;在引擎层,xLLM-Engine通过多层级执行流水线优化、自适应图模式(adaptive graph mode)及xTensor内存管理技术,协同算法增强如优化的推测解码(speculative decoding)和动态EPLB(dynamic EPLB),实现计算资源的充分饱和与推理效率的显著提升。
链接: https://arxiv.org/abs/2510.14686
作者: Tongxuan Liu,Tao Peng,Peijun Yang,Xiaoyang Zhao,Xiusheng Lu,Weizhe Huang,Zirui Liu,Xiaoyu Chen,Zhiwei Liang,Jun Xiong,Donghe Jin,Minchao Zhang,Jinrong Guo,Yingxu Deng,Xu Zhang,Xianzhe Dong,Siqi Wang,Siyu Wu,Yu Wu,Zihan Tang,Yuting Zeng,Yanshu Wang,Jinguang Liu,Meng Kang,Menxin Li,Yunlong Wang,Yiming Liu,Xiaolong Ma,Yifan Wang,Yichen Zhang,Jinrun Yin,Keyang Zheng,Jiawei Yin,Jun Zhang,Ziyue Wang,Xiaobo Lin,Liangyu Liu,Liwei Lan,Yang Liu,Chunhua Peng,Han Liu,Songcheng Ren,Xuezhu Wang,Yunheng Shen,Yi Wang,Guyue Liu,Hui Chen,Tong Yang,Hailong Yang,Jing Li,Guiguang Ding,Ke Zhang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 39 pages
Abstract:We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address these challenges, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture to provide global KV Cache management and robust fault-tolerant capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode and an xTensor memory management. xLLM-Engine also further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, collectively serving to substantially boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical TPOT constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. xLLM framework is publicly available at this https URL and this https URL.
zh
[AI-21] Practical Utilitarian Algorithm Configuration
【速读】:该论文旨在解决**实用主义算法配置(utilitarian algorithm configuration)**中理论保障与实际性能之间的差距问题,即如何在保证强理论性质的同时提升配置方法的实践效率。其解决方案的关键在于对COUP(一种具有理论保障的实用主义配置程序)进行一系列改进,这些改进在不削弱原有理论 guarantees 的前提下显著提升了其实验性能,使其能够与广泛应用但无性能保障的启发式配置方法相竞争。同时,论文通过案例研究展示了如何探索不同效用函数下算法选择方案的鲁棒性,从而增强配置结果的实用性与适应性。
链接: https://arxiv.org/abs/2510.14683
作者: Devon Graham,Kevin Leyton-Brown
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Utilitarian algorithm configuration identifies a parameter setting for a given algorithm that maximizes a user’s utility. Utility functions offer a theoretically well-grounded approach to optimizing decision-making under uncertainty and are flexible enough to capture a user’s preferences over algorithm runtimes (e.g., they can describe a sharp cutoff after which a solution is no longer required, a per-hour cost for compute, or diminishing returns from algorithms that take longer to run). COUP is a recently-introduced utilitarian algorithm configuration procedure which was designed mainly to offer strong theoretical guarantees about the quality of the configuration it returns, with less attention paid to its practical performance. This paper closes that gap, bringing theoretically-grounded, utilitarian algorithm configuration to the point where it is competitive with widely used, heuristic configuration procedures that offer no performance guarantees. We present a series of improvements to COUP that improve its empirical performance without degrading its theoretical guarantees and demonstrate their benefit experimentally. Using a case study, we also illustrate ways of exploring the robustness of a given solution to the algorithm selection problem to variations in the utility function.
zh
[AI-22] When Planners Meet Reality: How Learned Reactive Traffic Agents Shift nuPlan Benchmarks
【速读】:该论文旨在解决封闭环路仿真中因使用基于规则的交通代理(如IDM模型)而导致的规划器评估偏差问题。这类代理行为简单且被动,无法有效模拟复杂场景下的车辆交互,从而掩盖了规划器的真实性能并导致排名失真。解决方案的关键在于将先进的学习型交通代理模型SMART集成到nuPlan框架中,以实现更贴近现实的仿真环境;通过对比IDM与SMART代理下的评估结果,量化了仿真到真实世界差距缩小后规划器性能的变化,发现多数规划器在IDM环境下被高估,而在SMART代理下表现出更真实的交互能力与稳定性,尤其在多车道、高交互场景中表现提升显著。
链接: https://arxiv.org/abs/2510.14677
作者: Steffen Hagedorn,Luka Donkov,Aron Distelzweig,Alexandru P. Condurache
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Planner evaluation in closed-loop simulation often uses rule-based traffic agents, whose simplistic and passive behavior can hide planner deficiencies and bias rankings. Widely used IDM agents simply follow a lead vehicle and cannot react to vehicles in adjacent lanes, hindering tests of complex interaction capabilities. We address this issue by integrating the state-of-the-art learned traffic agent model SMART into nuPlan. Thus, we are the first to evaluate planners under more realistic conditions and quantify how conclusions shift when narrowing the sim-to-real gap. Our analysis covers 14 recent planners and established baselines and shows that IDM-based simulation overestimates planning performance: nearly all scores deteriorate. In contrast, many planners interact better than previously assumed and even improve in multi-lane, interaction-heavy scenarios like lane changes or turns. Methods trained in closed-loop demonstrate the best and most stable driving performance. However, when reaching their limits in augmented edge-case scenarios, all learned planners degrade abruptly, whereas rule-based planners maintain reasonable basic behavior. Based on our results, we suggest SMART-reactive simulation as a new standard closed-loop benchmark in nuPlan and release the SMART agents as a drop-in alternative to IDM at this https URL.
zh
[AI-23] NAEL: Non-Anthropocentric Ethical Logic
【速读】:该论文旨在解决当前人工智能伦理模型普遍依赖人类中心主义(anthropocentric)道德直觉、难以在动态多智能体环境中实现情境敏感且自适应的伦理行为的问题。解决方案的关键在于提出一种基于主动推理(active inference)与符号推理相结合的新型伦理框架NAEL(Non-Anthropocentric Ethical Logic),其核心是将伦理行为形式化为智能体在不确定环境中最小化全局预期自由能(global expected free energy)时的涌现特性,从而无需预设人类道德观即可生成具有自我保护、认知学习与集体福祉之间动态平衡的伦理决策能力。
链接: https://arxiv.org/abs/2510.14676
作者: Bianca Maria Lerma,Rafael Peñaloza
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the FEAR workshop 2025
Abstract:We introduce NAEL (Non-Anthropocentric Ethical Logic), a novel ethical framework for artificial agents grounded in active inference and symbolic reasoning. Departing from conventional, human-centred approaches to AI ethics, NAEL formalizes ethical behaviour as an emergent property of intelligent systems minimizing global expected free energy in dynamic, multi-agent environments. We propose a neuro-symbolic architecture to allow agents to evaluate the ethical consequences of their actions in uncertain settings. The proposed system addresses the limitations of existing ethical models by allowing agents to develop context-sensitive, adaptive, and relational ethical behaviour without presupposing anthropomorphic moral intuitions. A case study involving ethical resource distribution illustrates NAEL’s dynamic balancing of self-preservation, epistemic learning, and collective welfare.
zh
[AI-24] Machine Learning and Public Health: Identifying and Mitigating Algorithmic Bias through a Systematic Review AAAI
【速读】:该论文旨在解决生成式 AI(Generative AI)在公共卫生领域应用中因算法偏见(algorithmic bias)而可能加剧健康不平等的问题。当前机器学习(Machine Learning, ML)虽能提升公共卫生的监测、风险分层与资源配置效率,但若缺乏对算法偏见的系统性关注,反而可能强化既有健康差距。论文的关键解决方案是提出一个四阶段公平导向框架——ACAR(Awareness, Conceptualization, Application, Reporting),并辅以基于文献综述提炼出的引导性问题,帮助研究者在ML生命周期各阶段识别、定义、实施和报告公平性考量,从而推动算法创新服务于健康公平而非损害其目标。
链接: https://arxiv.org/abs/2510.14669
作者: Sara Altamirano,Arjan Vreeken,Sennay Ghebreab
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Extended version of the paper accepted at the AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025), including an appendix. 10 pages, 2 figures
Abstract:Machine learning (ML) promises to revolutionize public health through improved surveillance, risk stratification, and resource allocation. However, without systematic attention to algorithmic bias, ML may inadvertently reinforce existing health disparities. We present a systematic literature review of algorithmic bias identification, discussion, and reporting in Dutch public health ML research from 2021 to 2025. To this end, we developed the Risk of Algorithmic Bias Assessment Tool (RABAT) by integrating elements from established frameworks (Cochrane Risk of Bias, PROBAST, Microsoft Responsible AI checklist) and applied it to 35 peer-reviewed studies. Our analysis reveals pervasive gaps: although data sampling and missing data practices are well documented, most studies omit explicit fairness framing, subgroup analyses, and transparent discussion of potential harms. In response, we introduce a four-stage fairness-oriented framework called ACAR (Awareness, Conceptualization, Application, Reporting), with guiding questions derived from our systematic literature review to help researchers address fairness across the ML lifecycle. We conclude with actionable recommendations for public health ML practitioners to consistently consider algorithmic bias and foster transparency, ensuring that algorithmic innovations advance health equity rather than undermine it.
zh
[AI-25] Beyond Hallucinations: The Illusion of Understanding in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在人类沟通与决策中广泛应用时所引发的认知与知识论漂移问题,尤其是其生成内容虽具流畅性和说服力,却可能因缺乏 grounded reasoning 而产生幻觉(hallucination),从而导致事实性错误和误导。解决方案的关键在于提出“Rose-Frame”三维框架,该框架通过三个维度诊断人机交互中的认知偏差:(i) “地图 vs. 地域”(Map vs. Territory)区分表征现实(epistemology)与现实本身(ontology);(ii) “直觉 vs. 理性”(Intuition vs. Reason)基于双过程理论划分快速联想判断与慢速反思思维;(iii) “冲突 vs. 确认”(Conflict vs. Confirmation)评估观点是否经受批判性检验而非仅获相互强化。该框架不试图通过增加数据或规则修复LLMs,而是提供一种反思工具,使模型局限与用户假设均得以显性化,从而实现以人类理性为根基的认知治理(cognitive governance),确保机器的表达能力与人类理解之间保持对齐。
链接: https://arxiv.org/abs/2510.14665
作者: Rikard Rosenbacke,Carl Rosenbacke,Victor Rosenbacke,Martin McKee
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Large language models (LLMs) are becoming deeply embedded in human communication and decision-making, yet they inherit the ambiguity, bias, and lack of direct access to truth inherent in language itself. While their outputs are fluent, emotionally resonant, and coherent, they are generated through statistical prediction rather than grounded reasoning. This creates the risk of hallucination, responses that sound convincing but lack factual validity. Building on Geoffrey Hinton’s observation that AI mirrors human intuition rather than reasoning, this paper argues that LLMs operationalize System 1 cognition at scale: fast, associative, and persuasive, but without reflection or falsification. To address this, we introduce the Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction. The three axes are: (i) Map vs. Territory, which distinguishes representations of reality (epistemology) from reality itself (ontology); (ii) Intuition vs. Reason, drawing on dual-process theory to separate fast, emotional judgments from slow, reflective thinking; and (iii) Conflict vs. Confirmation, which examines whether ideas are critically tested through disagreement or simply reinforced through mutual validation. Each dimension captures a distinct failure mode, and their combination amplifies misalignment. Rose-Frame does not attempt to fix LLMs with more data or rules. Instead, it offers a reflective tool that makes both the model’s limitations and the user’s assumptions visible, enabling more transparent and critically aware AI deployment. It reframes alignment as cognitive governance: intuition, whether human or artificial, must remain governed by human reason. Only by embedding reflective, falsifiable oversight can we align machine fluency with human understanding.
zh
[AI-26] Galaxy Morphology Classification with Counterfactual Explanation NEURIPS2024
【速读】:该论文旨在解决大规模天文数据中星系形态分类任务中机器学习模型缺乏可解释性的问题,即现有方法虽能实现高预测性能,但难以揭示模型决策过程。其解决方案的关键在于扩展经典的编码器-解码器架构,引入可逆流(invertible flow)机制,从而在保持良好预测性能的同时,提供基于反事实解释(counterfactual explanations)的决策过程信息,增强模型的透明度与可信度。
链接: https://arxiv.org/abs/2510.14655
作者: Zhuo Cao,Lena Krieger,Hanno Scharr,Ira Assent
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the Machine Learning and the Physical Sciences Workshop at NeurIPS 2024 (non-archival)
Abstract:Galaxy morphologies play an essential role in the study of the evolution of galaxies. The determination of morphologies is laborious for a large amount of data giving rise to machine learning-based approaches. Unfortunately, most of these approaches offer no insight into how the model works and make the results difficult to understand and explain. We here propose to extend a classical encoder-decoder architecture with invertible flow, allowing us to not only obtain a good predictive performance but also provide additional information about the decision process with counterfactual explanations.
zh
[AI-27] he Bidding Games: Reinforcement Learning for MEV Extraction on Polygon Blockchain
【速读】:该论文旨在解决区块链网络中由于交易排序策略引发的**最大可提取价值(Maximal Extractable Value, MEV)获取难题,特别是在Polygon Atlas等结构化密封拍卖机制下,搜索者(searchers)需在亚秒级时间内做出最优出价决策,而无法获知竞争对手行为或存在性,传统博弈论方法因依赖完全信息与静态均衡假设难以适用。解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)**的框架:首先构建一个能准确模拟套利机会随机到达与概率性竞争的仿真环境;其次设计基于近端策略优化(Proximal Policy Optimization, PPO)的竞价代理,在连续动作空间中实现自适应策略生成,并满足生产级推理速度要求;最终通过实证验证表明,该历史条件化代理在部署时捕获49%的可用利润,替换市场领导者时达81%,显著优于静态出价策略,证明了强化学习在高频、部分可观测MEV场景中的核心优势。
链接: https://arxiv.org/abs/2510.14642
作者: Andrei Seoev,Leonid Gremyachikh,Anastasiia Smirnova,Yash Madhwal,Alisa Kalacheva,Dmitry Belousov,Ilia Zubov,Aleksei Smirnov,Denis Fedyanin,Vladimir Gorgadze,Yury Yanovich
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:In blockchain networks, the strategic ordering of transactions within blocks has emerged as a significant source of profit extraction, known as Maximal Extractable Value (MEV). The transition from spam-based Priority Gas Auctions to structured auction mechanisms like Polygon Atlas has transformed MEV extraction from public bidding wars into sealed-bid competitions under extreme time constraints. While this shift reduces network congestion, it introduces complex strategic challenges where searchers must make optimal bidding decisions within a sub-second window without knowledge of competitor behavior or presence. Traditional game-theoretic approaches struggle in this high-frequency, partially observable environment due to their reliance on complete information and static equilibrium assumptions. We present a reinforcement learning framework for MEV extraction on Polygon Atlas and make three contributions: (1) A novel simulation environment that accurately models the stochastic arrival of arbitrage opportunities and probabilistic competition in Atlas auctions; (2) A PPO-based bidding agent optimized for real-time constraints, capable of adaptive strategy formulation in continuous action spaces while maintaining production-ready inference speeds; (3) Empirical validation demonstrating our history-conditioned agent captures 49% of available profits when deployed alongside existing searchers and 81% when replacing the market leader, significantly outperforming static bidding strategies. Our work establishes that reinforcement learning provides a critical advantage in high-frequency MEV environments where traditional optimization methods fail, offering immediate value for industrial participants and protocol designers alike.
zh
[AI-28] Causality Enhancement for Cross-Domain Recommendation
【速读】:该论文旨在解决跨域推荐(Cross-domain Recommendation)中因源域任务或特征不一致导致的建模不足或负迁移问题,以及在未考虑潜在因果关系的情况下引入源域特征可能限制其对最终预测贡献的问题。解决方案的关键在于提出一个因果增强型框架 CE-CDR,其核心创新包括:首先将跨域推荐重构为因果图以提供理论指导;其次通过启发式方法构建因果感知数据集;进而设计一种理论上无偏的“部分标签因果损失”(Partial Label Causal Loss),从而将受限于偏差数据集的因果建模泛化至未见的跨域模式,生成更丰富的跨域表示,并作为模型无关插件提升目标域推荐性能。
链接: https://arxiv.org/abs/2510.14641
作者: Zhibo Wu,Yunfan Wu,Lin Jiang,Ping Yang,Yao Hu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Cross-domain recommendation forms a crucial component in recommendation systems. It leverages auxiliary information through source domain tasks or features to enhance target domain recommendations. However, incorporating inconsistent source domain tasks may result in insufficient cross-domain modeling or negative transfer. While incorporating source domain features without considering the underlying causal relationships may limit their contribution to final predictions. Thus, a natural idea is to directly train a cross-domain representation on a causality-labeled dataset from the source to target domain. Yet this direction has been rarely explored, as identifying unbiased real causal labels is highly challenging in real-world scenarios. In this work, we attempt to take a first step in this direction by proposing a causality-enhanced framework, named CE-CDR. Specifically, we first reformulate the cross-domain recommendation as a causal graph for principled guidance. We then construct a causality-aware dataset heuristically. Subsequently, we derive a theoretically unbiased Partial Label Causal Loss to generalize beyond the biased causality-aware dataset to unseen cross-domain patterns, yielding an enriched cross-domain representation, which is then fed into the target model to enhance target-domain recommendations. Theoretical and empirical analyses, as well as extensive experiments, demonstrate the rationality and effectiveness of CE-CDR and its general applicability as a model-agnostic plugin. Moreover, it has been deployed in production since April 2025, showing its practical value in real-world applications.
zh
[AI-29] GemiRec: Interest Quantization and Generation for Multi-Interest Recommendation
【速读】:该论文旨在解决多兴趣推荐中存在的两个关键问题:一是兴趣坍缩(interest collapse),即多个用户表示趋于同质化,无法有效区分不同兴趣;二是对兴趣演化的建模不足,难以捕捉用户历史行为中未显式体现的潜在兴趣。解决方案的关键在于提出一个框架级优化方法 GemiRec,其核心创新为通过兴趣量化(interest quantization)实现结构化的兴趣分离,并借助生成模型显式学习用户兴趣的动态演化过程,从而在兴趣字典维护、兴趣后验分布建模和多兴趣检索三个模块上协同提升推荐效果。
链接: https://arxiv.org/abs/2510.14626
作者: Zhibo Wu,Yunfan Wu,Quan Liu,Lin Jiang,Ping Yang,Yao Hu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-interest recommendation has gained attention, especially in industrial retrieval stage. Unlike classical dual-tower methods, it generates multiple user representations instead of a single one to model comprehensive user interests. However, prior studies have identified two underlying limitations: the first is interest collapse, where multiple representations homogenize. The second is insufficient modeling of interest evolution, as they struggle to capture latent interests absent from a user’s historical behavior. We begin with a thorough review of existing works in tackling these limitations. Then, we attempt to tackle these limitations from a new perspective. Specifically, we propose a framework-level refinement for multi-interest recommendation, named GemiRec. The proposed framework leverages interest quantization to enforce a structural interest separation and interest generation to learn the evolving dynamics of user interests explicitly. It comprises three modules: (a) Interest Dictionary Maintenance Module (IDMM) maintains a shared quantized interest dictionary. (b) Multi-Interest Posterior Distribution Module (MIPDM) employs a generative model to capture the distribution of user future interests. © Multi-Interest Retrieval Module (MIRM) retrieves items using multiple user-interest representations. Both theoretical and empirical analyses, as well as extensive experiments, demonstrate its advantages and effectiveness. Moreover, it has been deployed in production since March 2025, showing its practical value in industrial applications.
zh
[AI-30] LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching NEURIPS2025
【速读】:该论文旨在解决当前生成式AI(Generative AI)模型在高风险领域(如医疗健康和科学研究)中缺乏可解释性的问题,尤其是现有反事实解释方法存在的梯度消失、潜在空间不连续以及对学习决策边界与真实决策边界对齐高度依赖等关键局限。其解决方案的核心是提出LeapFactual算法,该方法基于条件流匹配(conditional flow matching),能够在真实与学习决策边界不一致的情况下依然生成可靠且信息丰富的反事实样本;同时采用模型无关(model-agnostic)设计,不仅适用于非可微分模型,还能兼容人类参与的系统(human-in-the-loop),从而扩展了反事实解释在公民科学等需人工标注场景中的应用范围。
链接: https://arxiv.org/abs/2510.14623
作者: Zhuo Cao,Xuan Zhao,Lena Krieger,Hanno Scharr,Ira Assent
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as a poster presentation at NeurIPS 2025. Camera-ready version. 10 pages, 7 figures
Abstract:The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model’s prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.
zh
[AI-31] An Active Inference Model of Mouse Point-and-Click Behaviour
【速读】:该论文旨在解决人机交互(HCI)中空间指向任务的计算建模问题,特别是如何在连续状态、动作和观测空间下实现更符合人类行为特征的鼠标点击模拟。其解决方案的关键在于引入主动推理(Active Inference, AIF)框架,通过最小化预期自由能(Expected Free Energy)来选择动作,仅依赖于对感知结果(如正确点击按钮)的偏好分布,而非传统最优反馈控制方法中的误差最小化机制。该方法自然地整合了概率性的预测延迟补偿,并能在不重新调整系统参数的情况下,对不同难度的目标表现出差异化的合理行为,从而提升了模型的适应性和生物合理性。
链接: https://arxiv.org/abs/2510.14611
作者: Markus Klar,Sebastian Stein,Fraser Paterson,John H. Williamson,Roderick Murray-Smith
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 12 pages + Appendix; Accepted to 6th International Workshop on Active Inference (IWAI 2025)
Abstract:We explore the use of Active Inference (AIF) as a computational user model for spatial pointing, a key problem in Human-Computer Interaction (HCI). We present an AIF agent with continuous state, action, and observation spaces, performing one-dimensional mouse pointing and clicking. We use a simple underlying dynamic system to model the mouse cursor dynamics with realistic perceptual delay. In contrast to previous optimal feedback control-based models, the agent’s actions are selected by minimizing Expected Free Energy, solely based on preference distributions over percepts, such as observing clicking a button correctly. Our results show that the agent creates plausible pointing movements and clicks when the cursor is over the target, with similar end-point variance to human users. In contrast to other models of pointing, we incorporate fully probabilistic, predictive delay compensation into the agent. The agent shows distinct behaviour for differing target difficulties without the need to retune system parameters, as done in other approaches. We discuss the simulation results and emphasize the challenges in identifying the correct configuration of an AIF agent interacting with continuous systems.
zh
[AI-32] Selective Labeling with False Discovery Rate Control
【速读】:该论文旨在解决大规模数据集标注中高质量标签获取成本高昂的问题,特别是当依赖生成式 AI (Generative AI) 进行自动标注时,其预测标签不可避免地存在误差,而现有选择性标注方法缺乏对 AI 标签质量的理论保障,常导致 AI 标注子集中错误率过高。解决方案的关键在于提出一种名为“校准标注”(Conformal Labeling)的新方法,通过控制假发现率(False Discovery Rate, FDR)来识别可被严格证明可信的 AI 标注实例;具体而言,该方法基于校准样本中被 AI 错误标注的置信度分布,为每个测试样本构造一个校准 p 值,并以数据驱动的阈值筛选出 p 值低于该阈值的样本,从而在理论上保证所选子集中错误标签的比例不超过预设水平,实现高精度与高效率兼顾的可靠标注。
链接: https://arxiv.org/abs/2510.14581
作者: Huipeng Huang,Wenbo Liao,Huajun Xi,Hao Zeng,Mengchen Zhao,Hongxin Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and human labels the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce \textbfConformal Labeling, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal p -value for each test instance by comparing AI models’ predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose p -values are below a data-dependent threshold, certifying AI models’ predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.
zh
[AI-33] LLM Agents Beyond Utility: An Open-Ended Perspective
【速读】:该论文试图解决的问题是:如何将预训练的大语言模型(Large Language Model, LLM)代理从一个智能问题求解工具,升级为具备自主规划、任务设计与推理能力的独立实体,从而实现更开放、持续的学习与目标导向行为。其解决方案的关键在于引入一个开放式实验设置,使LLM代理具备生成自身任务、积累知识并与其环境进行长期交互的能力;通过这种增强机制,代理能够可靠地执行多步骤指令、跨会话存储与复用信息,并主动提出和解决新任务,尽管仍受限于提示工程敏感性、重复任务生成及缺乏自我表征能力。该方法揭示了LLM向开放性智能体演进的潜力与当前局限,并为未来研究提供了方向,如改进记忆管理、促进有效探索以及实现抽象长期目标的追踪。
链接: https://arxiv.org/abs/2510.14548
作者: Asen Nachkov,Xi Wang,Luc Van Gool
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent LLM agents have made great use of chain of thought reasoning and function calling. As their capabilities grow, an important question arises: can this software represent not only a smart problem-solving tool, but an entity in its own right, that can plan, design immediate tasks, and reason toward broader, more ambiguous goals? To study this question, we adopt an open-ended experimental setting where we augment a pretrained LLM agent with the ability to generate its own tasks, accumulate knowledge, and interact extensively with its environment. We study the resulting open-ended agent qualitatively. It can reliably follow complex multi-step instructions, store and reuse information across runs, and propose and solve its own tasks, though it remains sensitive to prompt design, prone to repetitive task generation, and unable to form self-representations. These findings illustrate both the promise and current limits of adapting pretrained LLMs toward open-endedness, and point to future directions for training agents to manage memory, explore productively, and pursue abstract long-term goals.
zh
[AI-34] Symbol Grounding in Neuro-Symbolic AI: A Gentle Introduction to Reasoning Shortcuts
【速读】:该论文旨在解决神经符号人工智能(Neuro-symbolic AI, NeSy AI)中因概念推理捷径(Reasoning Shortcuts, RSs)导致的模型不可靠性问题。RSs 指的是当模型未直接监督高阶概念时,可能通过错误地锚定概念来实现标签准确率的虚假提升,从而损害模型解释的可理解性、分布外泛化性能及整体可靠性。解决方案的关键在于系统梳理和整合现有文献中对 RSs 的成因、后果及其理论刻画,并总结当前缓解与感知策略的有效性与局限性,最终提供一个统一视角以降低研究者和实践者应对该问题的门槛,推动可靠且可信的 NeSy AI 模型发展。
链接: https://arxiv.org/abs/2510.14538
作者: Emanuele Marconato,Samuele Bortolotti,Emile van Krieken,Paolo Morettin,Elena Umili,Antonio Vergari,Efthymia Tsamoura,Andrea Passerini,Stefano Teso
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Neuro-symbolic (NeSy) AI aims to develop deep neural networks whose predictions comply with prior knowledge encoding, e.g. safety or structural constraints. As such, it represents one of the most promising avenues for reliable and trustworthy AI. The core idea behind NeSy AI is to combine neural and symbolic steps: neural networks are typically responsible for mapping low-level inputs into high-level symbolic concepts, while symbolic reasoning infers predictions compatible with the extracted concepts and the prior knowledge. Despite their promise, it was recently shown that - whenever the concepts are not supervised directly - NeSy models can be affected by Reasoning Shortcuts (RSs). That is, they can achieve high label accuracy by grounding the concepts incorrectly. RSs can compromise the interpretability of the model’s explanations, performance in out-of-distribution scenarios, and therefore reliability. At the same time, RSs are difficult to detect and prevent unless concept supervision is available, which is typically not the case. However, the literature on RSs is scattered, making it difficult for researchers and practitioners to understand and tackle this challenging problem. This overview addresses this issue by providing a gentle introduction to RSs, discussing their causes and consequences in intuitive terms. It also reviews and elucidates existing theoretical characterizations of this phenomenon. Finally, it details methods for dealing with RSs, including mitigation and awareness strategies, and maps their benefits and limitations. By reformulating advanced material in a digestible form, this overview aims to provide a unifying perspective on RSs to lower the bar to entry for tackling them. Ultimately, we hope this overview contributes to the development of reliable NeSy and trustworthy AI models.
zh
[AI-35] JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在调用大量外部工具时导致的提示词膨胀(prompt bloating)问题,即随着可用工具数量增加,提示词长度显著增长,从而引发高 token 成本、延迟上升以及因选择无关工具而导致的任务成功率下降。解决方案的关键在于提出 JSPLIT 框架,该框架基于层次化工具分类体系(taxonomy),通过分析用户查询与工具结构之间的语义关联,动态筛选并仅包含最相关的工具描述到提示词中,从而有效压缩提示长度,在保持甚至提升任务执行准确率的同时降低计算成本。
链接: https://arxiv.org/abs/2510.14537
作者: Emanuele Antonioni,Stefan Markovic,Anirudha Shankar,Jaime Bernardo,Lovro Markovic,Silvia Pareti,Benedetto Proietti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI systems are continually evolving and advancing, and user expectations are concurrently increasing, with a growing demand for interactions that go beyond simple text-based interaction with Large Language Models (LLMs). Today’s applications often require LLMs to interact with external tools, marking a shift toward more complex agentic systems. To support this, standards such as the Model Context Protocol (MCP) have emerged, enabling agents to access tools by including a specification of the capabilities of each tool within the prompt. Although this approach expands what agents can do, it also introduces a growing problem: prompt bloating. As the number of tools increases, the prompts become longer, leading to high prompt token costs, increased latency, and reduced task success resulting from the selection of tools irrelevant to the prompt. To address this issue, we introduce JSPLIT, a taxonomy-driven framework designed to help agents manage prompt size more effectively when using large sets of MCP tools. JSPLIT organizes the tools into a hierarchical taxonomy and uses the user’s prompt to identify and include only the most relevant tools, based on both the query and the taxonomy structure. In this paper, we describe the design of the taxonomy, the tool selection algorithm, and the dataset used to evaluate JSPLIT. Our results show that JSPLIT significantly reduces prompt size without significantly compromising the agent’s ability to respond effectively. As the number of available tools for the agent grows substantially, JSPLIT even improves the tool selection accuracy of the agent, effectively reducing costs while simultaneously improving task success in high-complexity agent environments.
zh
[AI-36] State Your Intention to Steer Your Attention: An AI Assistant for Intentional Digital Living
【速读】:该论文旨在解决用户在使用数字设备时因分心而导致的生产力下降及心理情绪负面影响问题。解决方案的关键在于设计并实现一种基于大语言模型(Large Language Model)的人工智能(AI)助手,该助手能够通过分析屏幕截图、应用名称和URL等多模态信息,识别用户的当前行为是否与其设定的目标意图一致,并在偏离目标时提供温和的提醒(nudges)。系统通过初始澄清对话和持续的用户反馈不断优化检测准确性,从而有效帮助用户维持专注,使数字行为与个人意图保持一致。
链接: https://arxiv.org/abs/2510.14513
作者: Juheon Choi,Juyoung Lee,Jian Kim,Chanyoung Kim,Taewon Min,W. Bradley Knox,Min Kyung Lee,Kimin Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:When working on digital devices, people often face distractions that can lead to a decline in productivity and efficiency, as well as negative psychological and emotional impacts. To address this challenge, we introduce a novel Artificial Intelligence (AI) assistant that elicits a user’s intention, assesses whether ongoing activities are in line with that intention, and provides gentle nudges when deviations occur. The system leverages a large language model to analyze screenshots, application titles, and URLs, issuing notifications when behavior diverges from the stated goal. Its detection accuracy is refined through initial clarification dialogues and continuous user feedback. In a three-week, within-subjects field deployment with 22 participants, we compared our assistant to both a rule-based intent reminder system and a passive baseline that only logged activity. Results indicate that our AI assistant effectively supports users in maintaining focus and aligning their digital behavior with their intentions. Our source code is publicly available at this url this https URL
zh
[AI-37] Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)系统设计与部署中因策略选择、组合与调优复杂性过高而导致的鲁棒性差、定制化程度高且难以复现的问题。其关键解决方案是提出Helmsman——一个基于多智能体的自动化系统,能够从高层次用户需求出发,通过三个协同阶段实现端到端的FL系统合成:(1) 人机交互式规划以制定科学的研究方案;(2) 监督代理团队模块化生成代码;(3) 在沙盒仿真环境中进行闭环自主评估与迭代优化。该方法显著提升了FL系统构建的自动化水平和性能表现。
链接: https://arxiv.org/abs/2510.14512
作者: Haoyuan Li,Mathias Funk,Aaqib Saeed
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.
zh
[AI-38] From Guess2Graph: When and How Can Unreliable Experts Safely Boost Causal Discovery in Finite Samples?
【速读】:该论文旨在解决因果发现算法在小样本条件下性能不佳的问题。现有方法虽可通过引入专家知识(包括来自大语言模型的先验信息)作为约束来提升性能,但其理论保证通常依赖于专家预测的完美性或不确定性估计的准确性,这在实际应用中难以满足。论文提出Guess2Graph(G2G)框架,其关键在于利用专家猜测指导统计检验的顺序,而非替代这些检验,从而在保持统计一致性的同时实现性能提升。该方案通过两个实例化方法实现:PC-Guess增强经典PC算法,gPC-Guess则进一步融合学习机制以更高效利用高质量专家输入,二者均在理论上保证即使专家存在错误仍能维持正确性,且在有限样本下若专家表现优于随机水平,gPC-Guess可证明优于非增强版本;实验表明两者均随专家准确率单调提升,其中gPC-Guess展现出显著优势。
链接: https://arxiv.org/abs/2510.14488
作者: Sujai Hiremath,Dominik Janzing,Philipp Faller,Patrick Blöbaum,Elke Kirschbaum,Shiva Prasad Kasiviswanathan,Kyra Gan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Causal discovery algorithms often perform poorly with limited samples. While integrating expert knowledge (including from LLMs) as constraints promises to improve performance, guarantees for existing methods require perfect predictions or uncertainty estimates, making them unreliable for practical use. We propose the Guess2Graph (G2G) framework, which uses expert guesses to guide the sequence of statistical tests rather than replacing them. This maintains statistical consistency while enabling performance improvements. We develop two instantiations of G2G: PC-Guess, which augments the PC algorithm, and gPC-Guess, a learning-augmented variant designed to better leverage high-quality expert input. Theoretically, both preserve correctness regardless of expert error, with gPC-Guess provably outperforming its non-augmented counterpart in finite samples when experts are “better than random.” Empirically, both show monotonic improvement with expert accuracy, with gPC-Guess achieving significantly stronger gains.
zh
[AI-39] Stealthy Dual-Trigger Backdoors: Attacking Prompt Tuning in LM-Empowered Graph Foundation Models
【速读】:该论文旨在解决基于语言模型(Language Model, LM)增强的图基础模型(Graph Foundation Models, GFMs)在无保护提示微调(prompt tuning)阶段所面临的安全漏洞问题,尤其是在属性不可访问的文本属性图(Text-Attributed Graphs, TAGs)约束环境下,传统图后门攻击因缺乏触发节点属性优化而性能显著下降的问题。解决方案的关键在于提出一种双触发机制的新型后门攻击框架,该框架同时在文本层(text-level)和结构层(struct-level)构建攻击信号,通过利用预设的文本池实现无需显式优化触发节点文本属性即可完成有效攻击,从而在保持高干净准确率的同时,在隐蔽性强的单触发节点场景下仍能实现优异的攻击成功率。
链接: https://arxiv.org/abs/2510.14470
作者: Xiaoyu Xue,Yuni Lai,Chenxi Huang,Yulin Zhu,Gaolei Li,Xiaoge Zhang,Kai Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of graph foundation models (GFMs), particularly those incorporating language models (LMs), has revolutionized graph learning and demonstrated remarkable performance on text-attributed graphs (TAGs). However, compared to traditional GNNs, these LM-empowered GFMs introduce unique security vulnerabilities during the unsecured prompt tuning phase that remain understudied in current research. Through empirical investigation, we reveal a significant performance degradation in traditional graph backdoor attacks when operating in attribute-inaccessible constrained TAG systems without explicit trigger node attribute optimization. To address this, we propose a novel dual-trigger backdoor attack framework that operates at both text-level and struct-level, enabling effective attacks without explicit optimization of trigger node text attributes through the strategic utilization of a pre-established text pool. Extensive experimental evaluations demonstrate that our attack maintains superior clean accuracy while achieving outstanding attack success rates, including scenarios with highly concealed single-trigger nodes. Our work highlights critical backdoor risks in web-deployed LM-empowered GFMs and contributes to the development of more robust supervision mechanisms for open-source platforms in the era of foundation models.
zh
[AI-40] Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning
【速读】:该论文旨在解决大规模预训练语言模型微调过程中因噪声或不相关样本导致监督信号稀释的问题,即如何高效识别并利用高价值训练数据以提升模型对齐效果。其解决方案的关键在于提出一种理论基础扎实、资源消耗低的数据选择与重加权框架——In-Context Approximation (ICA),该方法通过在上下文中条件化一个小而精炼的保留集(holdout set),无需参考模型或额外微调即可估算候选样本训练后可能带来的验证损失。基于局部线性化假设,ICA等价于向最优解的第一阶更新方向,因而可作为数据价值的有效代理指标;进而从ICA得分中推导出每条样本的权重,并动态调整梯度更新过程,从而在SFT、DPO和SimPO等多种对齐范式下均显著提升模型性能,且计算开销极小。
链接: https://arxiv.org/abs/2510.14459
作者: Ling Zhang,Xianliang Yang,Juwon Yu,Park Cheonyoung,Lei Song,Jiang Bian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-tuning large pretrained language models is a common approach for aligning them with human preferences, but noisy or off-target examples can dilute supervision. While small, well-chosen datasets often match the performance of much larger ones, systematic and efficient ways to identify high-value training data remain underexplored. Many current methods rely on heuristics or expensive retraining. We present a theoretically grounded, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example by conditioning on a small, curated holdout set in context. ICA requires no reference model and no additional finetuning. Under a local linearization, ICA is equivalent to a first-order update toward the holdout optimum, motivating its use as a proxy for data value. We derive per-example weights from ICA scores, dynamically reweighting gradient updates as model parameters evolve. Across SFT, DPO, and SimPO, and over diverse backbones and datasets, ICA-based reweighting consistently improves model alignment with minimal overhead. We analyze sensitivity to score update frequency and the choice of k holdout examples for in-context demonstrations, and note limitations for rapidly drifting on-policy updates, highlighting directions for future work. Code and prompts will be released.
zh
[AI-41] owards Adaptable Humanoid Control via Adaptive Motion Tracking
【速读】:该论文旨在解决人形机器人在复杂现实环境中如何从单一参考动作中实现高精度且自适应的运动模仿问题。现有方法要么依赖大量训练动作以实现良好适应性但牺牲模仿精度(motion prior方法),要么虽能高精度模仿但需测试时提供目标动作且数据需求量大(motion-tracking方法)。解决方案的关键在于提出AdaMimic算法:首先通过稀疏化单个参考动作生成关键帧并进行轻量编辑构建增强数据集;接着利用该数据集初始化策略,跟踪稀疏关键帧生成稠密中间动作;随后训练适配器模块调整跟踪速度并优化低层动作执行,从而实现灵活的时间扭曲(flexible time warping),显著提升模仿精度与环境适应能力。
链接: https://arxiv.org/abs/2510.14454
作者: Tao Huang,Huayi Wang,Junli Ren,Kangning Yin,Zirui Wang,Xiao Chen,Feiyu Jia,Wentao Zhang,Junfeng Long,Jingbo Wang,Jiangmiao Pang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages
Abstract:Humanoid robots are envisioned to adapt demonstrated motions to diverse real-world conditions while accurately preserving motion patterns. Existing motion prior approaches enable well adaptability with a few motions but often sacrifice imitation accuracy, whereas motion-tracking methods achieve accurate imitation yet require many training motions and a test-time target motion to adapt. To combine their strengths, we introduce AdaMimic, a novel motion tracking algorithm that enables adaptable humanoid control from a single reference motion. To reduce data dependence while ensuring adaptability, our method first creates an augmented dataset by sparsifying the single reference motion into keyframes and applying light editing with minimal physical assumptions. A policy is then initialized by tracking these sparse keyframes to generate dense intermediate motions, and adapters are subsequently trained to adjust tracking speed and refine low-level actions based on the adjustment, enabling flexible time warping that further improves imitation accuracy and adaptability. We validate these significant improvements in our approach in both simulation and the real-world Unitree G1 humanoid robot in multiple tasks across a wide range of adaptation conditions. Videos and code are available at this https URL.
zh
[AI-42] Feature Selection and Regularization in Multi-Class Classification: An Empirical Study of One-vs-Rest Logistic Regression with Gradient Descent Optimization and L1 Sparsity Constraints
【速读】:该论文旨在解决多类葡萄酒分类中模型准确性、特征维度和可解释性之间的权衡问题,这对分析化学领域的生产部署至关重要。其解决方案的关键在于通过实证比较一 vs 余(One-vs-Rest)逻辑回归的不同实现方式(包括手动梯度下降与scikit-learn优化求解器),并量化L1正则化对特征稀疏性的影响,最终提出一个由5个关键化学特征组成的子集,在保持92–94%准确率的同时实现62%的复杂度降低,从而在资源受限环境中实现成本节约(每样本节省80美元)和时间效率提升(减少56%时间),且具备亚毫秒级预测延迟,满足实时质量控制需求。
链接: https://arxiv.org/abs/2510.14449
作者: Jahidul Arafat,Fariha Tasmin,Md Kaosar Uddin,Sanjaya Poudel,Eftakhar Ahmed Arnob
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages, 7 figures, 5 tables. Submitted to Machine Learning track. Comprehensive empirical evaluation of interpretable linear classification for analytical chemistry applications with focus on production deployment constraints, cost-benefit analysis, and class-specific feature importance patterns
Abstract:Multi-class wine classification presents fundamental trade-offs between model accuracy, feature dimensionality, and interpretability - critical factors for production deployment in analytical chemistry. This paper presents a comprehensive empirical study of One-vs-Rest logistic regression on the UCI Wine dataset (178 samples, 3 cultivars, 13 chemical features), comparing from-scratch gradient descent implementation against scikit-learn’s optimized solvers and quantifying L1 regularization effects on feature sparsity. Manual gradient descent achieves 92.59 percent mean test accuracy with smooth convergence, validating theoretical foundations, though scikit-learn provides 24x training speedup and 98.15 percent accuracy. Class-specific analysis reveals distinct chemical signatures with heterogeneous patterns where color intensity varies dramatically (0.31 to 16.50) across cultivars. L1 regularization produces 54-69 percent feature reduction with only 4.63 percent accuracy decrease, demonstrating favorable interpretability-performance trade-offs. We propose an optimal 5-feature subset achieving 62 percent complexity reduction with estimated 92-94 percent accuracy, enabling cost-effective deployment with 80 dollars savings per sample and 56 percent time reduction. Statistical validation confirms robust generalization with sub-2ms prediction latency suitable for real-time quality control. Our findings provide actionable guidelines for practitioners balancing comprehensive chemical analysis against targeted feature measurement in resource-constrained environments.
zh
[AI-43] A Free Lunch in LLM Compression: Revisiting Retraining after Pruning
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在剪枝(pruning)后性能下降的问题,传统方法通常依赖全量重训练以恢复性能,但因计算资源消耗过大而不可行。论文提出通过层间掩码选择与重建(mask selection and reconstruction)策略,在少量校准数据上进行局部权重重建,从而避免全量重训练。其解决方案的关键在于:在Transformer块内部对注意力模块(attention)和多层感知机模块(MLP)分别独立重建,而非整体重建或单矩阵重建,这种设计实现了最优的资源效率与性能平衡——不仅显著降低内存占用,还优于全量重训练的效果,且简单有效的剪枝策略(如Wanda)在正确重建步骤下可超越复杂方法,挑战了“应完全避免重训练”的主流观点。
链接: https://arxiv.org/abs/2510.14444
作者: Moritz Wagner,Christophe Roux,Max Zimmer,Sebastian Pokutta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While Neural Network pruning typically requires retraining the model to recover pruning-induced performance degradation, state-of-the-art Large Language Models (LLMs) pruning methods instead solve a layer-wise mask selection and reconstruction problem on a small set of calibration data to avoid full retraining, as it is considered computationally infeasible for LLMs. Reconstructing single matrices in isolation has favorable properties, such as convexity of the objective and significantly reduced memory requirements compared to full retraining. In practice, however, reconstruction is often implemented at coarser granularities, e.g., reconstructing a whole transformer block against its dense activations instead of a single matrix. In this work, we study the key design choices when reconstructing or retraining the remaining weights after pruning. We conduct an extensive computational study on state-of-the-art GPT architectures, and report several surprising findings that challenge common intuitions about retraining after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Most importantly, this Pareto-optimal setup achieves better performance than full retraining, despite requiring only a fraction of the memory. Furthermore, we demonstrate that simple and efficient pruning criteria such as Wanda can outperform much more complex approaches when the reconstruction step is properly executed, highlighting its importance. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery for LLMs.
zh
[AI-44] Big Data Approaches to Bovine Bioacoustics: A FAIR-Compliant Dataset and Scalable ML Framework for Precision Livestock Welfare
【速读】:该论文旨在解决生物声学数据在精准畜牧业中应用受限的问题,主要挑战包括计算复杂性高以及生态效度不足。其解决方案的关键在于构建了一个符合FAIR原则的大型牛只发声数据集(569个标注片段扩展至2900样本),涵盖48种行为类别,并通过多麦克风阵列在三个商业化奶牛场采集真实环境声音,确保生态真实性;同时开发了一套分布式处理框架,集成iZotope RX降噪、音视频多模态同步对齐及基于Praat、librosa和openSMILE的标准声学特征工程(共24个描述符),从而实现高噪声环境下鲁棒的特征提取与实时处理能力,为动物中心型人工智能提供可部署的基础资源。
链接: https://arxiv.org/abs/2510.14443
作者: Mayuri Kate,Suresh Neethirajan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 40 pages, 14 figures, 9 Tables
Abstract:The convergence of IoT sensing, edge computing, and machine learning is transforming precision livestock farming. Yet bioacoustic data streams remain underused because of computational complexity and ecological validity challenges. We present one of the most comprehensive bovine vocalization datasets to date, with 569 curated clips covering 48 behavioral classes, recorded across three commercial dairy farms using multiple microphone arrays and expanded to 2900 samples through domain informed augmentation. This FAIR compliant resource addresses major Big Data challenges - volume (90 hours of recordings, 65.6 GB), variety (multi farm and multi zone acoustics), velocity (real time processing), and veracity (noise robust feature extraction). Our distributed processing framework integrates advanced denoising using iZotope RX, multimodal synchronization through audio and video alignment, and standardized feature engineering with 24 acoustic descriptors generated from Praat, librosa, and openSMILE. Preliminary benchmarks reveal distinct class level acoustic patterns for estrus detection, distress classification, and maternal communication. The datasets ecological realism, reflecting authentic barn acoustics rather than controlled settings, ensures readiness for field deployment. This work establishes a foundation for animal centered AI, where bioacoustic data enable continuous and non invasive welfare assessment at industrial scale. By releasing standardized pipelines and detailed metadata, we promote reproducible research that connects Big Data analytics, sustainable agriculture, and precision livestock management. The framework supports UN SDG 9, showing how data science can turn traditional farming into intelligent, welfare optimized systems that meet global food needs while upholding ethical animal care.
zh
[AI-45] Eliminating Negative Occurrences of Derived Predicates from PDDL Axioms ICAPS2025 KR
【速读】:该论文旨在解决规划领域定义语言(PDDL)中公理(axiom)的负向谓词引用限制问题,即传统PDDL规范仅允许在公理体中出现由动作直接设定的谓词的否定形式,而禁止使用通过其他公理推导出的谓词的否定。这一限制在实践中常被放宽,文献中通常要求公理集合是可分层的(stratifiable),但未明确说明是否等价于原限制。论文的核心贡献在于证明:只要公理集合是可分层的,即便允许负向引用派生谓词,其表达能力仍与最小不动点逻辑(least fixed-point logic)一致,因此可以通过构造性变换消除所有派生谓词的负向引用。解决方案的关键在于提出一种系统性的变换方法,将包含负向派生谓词的公理转换为等价形式,其中仅保留由动作直接设定的谓词的否定,从而实现语义不变性并满足原始PDDL标准。
链接: https://arxiv.org/abs/2510.14412
作者: Claudia Grundke,Gabriele Röger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Extended version of a paper of the same title presented at the joint KR/ICAPS 2025 workshop “KRPlan: Knowledge Representation Meets Automated Planning”
Abstract:Axioms are a feature of the Planning Domain Definition Language PDDL that can be considered as a generalization of database query languages such as Datalog. The PDDL standard restricts negative occurrences of predicates in axiom bodies to predicates that are directly set by actions and not derived by axioms. In the literature, authors often deviate from this limitation and only require that the set of axioms is stratifiable. Both variants can express exactly the same queries as least fixed-point logic, indicating that negative occurrences of derived predicates can be eliminated. We present the corresponding transformation.
zh
[AI-46] he Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems
【速读】:该论文旨在解决多智能体系统中如何在混合动机情境下(即个体利益与集体利益冲突时)自发形成合作规范的问题,尤其是在缺乏显式奖励信号的情况下。传统基于大语言模型(Large Language Models, LLMs)的公共资源博弈(Common-Pool Resource, CPR)研究通常依赖于明确的奖励函数,而人类合作往往发生在信息不完全、依赖启发式策略、沟通和惩罚机制的情境中。本文的关键解决方案是构建一个移除显式奖励信号的CPR模拟框架,并嵌入文化演化机制:包括社会学习(从成功同伴处习得策略与信念)和基于Ostrom治理原则的规范性惩罚;同时允许代理通过环境反馈自主学习捕捞、监控和惩罚行为的后果,从而实现规范的内生性涌现。这一设计使模型能够复现人类行为的关键发现,并揭示不同LLM在资源丰富/稀缺与利他/自私初始条件下的规范演化差异,为AI系统在社会与组织场景中的协同规范对齐提供可验证的实验平台。
链接: https://arxiv.org/abs/2510.14401
作者: Prateek Gupta,Qiankun Zhong,Hiromu Yakura,Thomas Eisenmann,Iyad Rahwan
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:A growing body of multi-agent studies with Large Language Models (LLMs) explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without full visibility into payoffs and population, relying instead on heuristics, communication, and punishment. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom’s principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a 2\times2 grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies comprised of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.
zh
[AI-47] FairBatching: Fairness-Aware Batch Formation for LLM Inference
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理系统中预填充(prefill)与解码(decode)任务之间的资源分配不公平问题,该问题导致时延敏感的首次词元响应时间(Time-to-First-Token, TTFT)与持续吞吐率(Time-Per-Output-Token, TPOT)难以同时优化。现有无停顿批处理调度器(stall-free batching scheduler)虽能避免解码阶段的停滞,但因过度优先解码任务而引发计算资源分配不公,造成解码空闲时间浪费和预填充队列延迟增加,进而损害整体服务质量(Quality of Service, QoS)。解决方案的关键在于提出FairBatching调度算法:其核心创新包括自适应批量容量确定机制(动态调整计算预算以提升GPU利用率且不违反服务等级目标SLO),以及公平且动态的批次形成策略(打破固定解码优先逻辑,允许从突发解码任务中回收资源用于预填充请求),并引入新颖的负载估计方法以增强与上层调度器的协同能力,从而在保障TPOT SLO的前提下显著降低TTFT尾部延迟,并大幅提升单节点与集群级别的吞吐容量。
链接: https://arxiv.org/abs/2510.14392
作者: Hongtao Lyu,Boyue Liu,Mingyu Wu,Haibo Chen
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) inference systems face a fundamental tension between minimizing Time-to-First-Token (TTFT) latency for new requests and maintaining a high, steady token generation rate (low Time-Per-Output-Token, or TPOT) for ongoing requests. Existing stall-free batching schedulers proposed by Sarathi, while effective at preventing decode stalls, introduce significant computational unfairness. They prioritize decode tasks excessively, simultaneously leading to underutilized decode slack and unnecessary prefill queuing delays, which collectively degrade the system’s overall quality of service (QoS). This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT) as a scheduling metric and the rigid decode-prioritizing policy that fails to adapt to dynamic workload bursts. We therefore propose FairBatching, a novel LLM inference scheduler that enforces fair resource allocation between prefill and decode tasks. It features an adaptive batch capacity determination mechanism, which dynamically adjusts the computational budget to improve the GPU utilization without triggering SLO violations. Its fair and dynamic batch formation algorithm breaks away from the decode-prioritizing paradigm, allowing computation resources to be reclaimed from bursting decode tasks to serve prefill surges, achieving global fairness. Furthermore, FairBatching provides a novel load estimation method, enabling more effective coordination with upper-level schedulers. Implemented and evaluated on realistic traces, FairBatching significantly reduces TTFT tail latency by up to 2.29x while robustly maintaining TPOT SLOs, achieving overall 20.0% improvement in single-node capacity and 54.3% improvement in cluster-level capacity. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.14392 [cs.DC] (or arXiv:2510.14392v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2510.14392 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-48] Beat Detection as Object Detection
【速读】:该论文旨在解决音乐中节拍(beat)与强拍(downbeat)追踪任务的建模问题,传统方法如循环神经网络(RNN)、时间卷积网络(TCN)和Transformer通常输出帧级激活值,难以直接捕捉节拍作为时序“对象”的结构特性。其解决方案的关键在于将节拍追踪重新建模为一维时序对象检测任务:利用FCOS(Fully Convolutional One-Stage Object Detector)框架,将其二维视觉检测器适配至音频信号的1D时序空间;通过引入WaveBeat的时序特征提取器替代原骨干网络,并增加特征金字塔网络(Feature Pyramid Network, FPN)以捕获多尺度时间模式;模型预测重叠的节拍/强拍区间及其置信度分数,再经非极大值抑制(Non-Maximum Suppression, NMS)筛选最终结果,该步骤在功能上类比于传统追踪器中的动态贝叶斯网络(DBN),但实现更简洁且参数更少。
链接: https://arxiv.org/abs/2510.14391
作者: Jaehoon Ahn,Moon-Ryul Jung
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 4 figures, 5 tables
Abstract:Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal “objects.” Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat’s temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
zh
[AI-49] Hi-Agent : Hierarchical Vision-Language Agents for Mobile Device Control
【速读】:该论文旨在解决当前移动设备控制代理在面对新任务或未见过的用户界面(UI)布局时泛化能力差的问题,其根源在于现有方法多依赖于直接的状态到动作映射,缺乏结构化的推理与规划机制。解决方案的关键在于提出一种可训练的分层视觉语言代理(Hi-Agent),该代理由高层推理模型和低层动作模型组成,并通过联合优化实现高效训练;其中,作者将多步决策重构为一系列单步子目标,并设计了一种利用底层模型执行反馈引导高层优化的前瞻优势函数(foresight advantage function),从而缓解长程任务中Group Relative Policy Optimization(GRPO)面临的路径爆炸问题,实现无需评判器(critic-free)的稳定联合训练。
链接: https://arxiv.org/abs/2510.14388
作者: Zhe Wu,Hongjin Lu,Junliang Xing,Changhao Zhang,Yin Zhu,Yuhao Yang,Yuheng Jing,Kai Li,Kun Shao,Jianye Hao,Jun Wang,Yuanchun Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
zh
[AI-50] Can MLLM s Absorb Math Reasoning Abilities from LLM s as Free Lunch?
【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在数学推理能力上显著落后于纯文本大语言模型(Large Language Models, LLMs)的问题,尤其是在不进行微调(tuning-free)的前提下,如何有效迁移数学推理能力。其解决方案的关键在于提出了一种名为IP-Merging的方法:首先识别MLLM与数学专用LLM(Math LLM)中与推理相关的关键参数层,然后将这些参数投影到MLLM的参数子空间中以维持两者间的对齐关系,最后在该子空间内直接合并参数。这一策略避免了传统模型融合方法因参数空间差异导致的性能下降问题,实现了无需微调即可显著提升MLLM的数学推理能力,同时不损害其原有多模态能力。
链接: https://arxiv.org/abs/2510.14387
作者: Yijie Hu,Zihao Zhou,Kaizhu Huang,Xiaowei Huang,Qiufeng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Math reasoning has been one crucial ability of large language models (LLMs), where significant advancements have been achieved in recent years. However, most efforts focus on LLMs by curating high-quality annotation data and intricate training (or inference) paradigms, while the math reasoning performance of multi-modal LLMs (MLLMs) remains lagging behind. Since the MLLM typically consists of an LLM and a vision block, we wonder: Can MLLMs directly absorb math reasoning abilities from off-the-shelf math LLMs without tuning? Recent model-merging approaches may offer insights into this question. However, they overlook the alignment between the MLLM and LLM, where we find that there is a large gap between their parameter spaces, resulting in lower performance. Our empirical evidence reveals two key factors behind this issue: the identification of crucial reasoning-associated layers in the model and the mitigation of the gaps in parameter space. Based on the empirical insights, we propose IP-Merging that first identifies the reasoning-associated parameters in both MLLM and Math LLM, then projects them into the subspace of MLLM, aiming to maintain the alignment, and finally merges parameters in this subspace. IP-Merging is a tuning-free approach since parameters are directly adjusted. Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities.
zh
[AI-51] SUM-AgriVLN: Spatial Understanding Memory for Agricultural Vision-and-Language Navigation
【速读】:该论文旨在解决农业视觉-语言导航(Agricultural Vision-and-Language Navigation, AgriVLN)中因将每条导航指令视为独立任务而忽略历史经验所导致的空间上下文信息利用不足的问题。其解决方案的关键在于提出Spatial Understanding Memory for Agricultural Vision-and-Language Navigation (SUM-AgriVLN),其中的SUM模块通过3D重建与表示技术实现空间理解并保存空间记忆,从而在后续导航任务中复用历史空间信息,提升导航成功率。在A2A基准上的实验表明,该方法将成功率从0.47提升至0.54,同时仅轻微增加导航误差(从2.91m增至2.93m),达到农业领域当前最优性能。
链接: https://arxiv.org/abs/2510.14357
作者: Xiaobei Zhao,Xingqi Lyu,Xiang Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Agricultural robots are emerging as powerful assistants across a wide range of agricultural tasks, nevertheless, still heavily rely on manual operation or fixed rail systems for movement. The AgriVLN method and the A2A benchmark pioneeringly extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling robots to navigate to the target positions following the natural language instructions. In practical agricultural scenarios, navigation instructions often repeatedly occur, yet AgriVLN treat each instruction as an independent episode, overlooking the potential of past experiences to provide spatial context for subsequent ones. To bridge this gap, we propose the method of Spatial Understanding Memory for Agricultural Vision-and-Language Navigation (SUM-AgriVLN), in which the SUM module employs spatial understanding and save spatial memory through 3D reconstruction and representation. When evaluated on the A2A benchmark, our SUM-AgriVLN effectively improves Success Rate from 0.47 to 0.54 with slight sacrifice on Navigation Error from 2.91m to 2.93m, demonstrating the state-of-the-art performance in the agricultural domain. Code: this https URL.
zh
[AI-52] BinCtx: Multi-Modal Representation Learning for Robust Android App Behavior Detection
【速读】:该论文旨在解决移动应用市场中难以检测的恶意行为问题,尤其是那些不依赖权限保护API、且可通过UI或元数据修改进行伪装的 undesired behaviors(如干扰性广告、非法跳转和支付欺骗)。解决方案的关键在于提出BINCTX方法,通过构建三种模态的多维表征:(i) 基于字节码图像的全局代码语义与家族模式视图,(ii) 表示行为触发机制的上下文视图(包括manifest声明动作、组件、权限及URL/IP常量),以及(iii) 第三方库调用频率沿组件间调用路径的统计视图。这三类特征被嵌入并融合以训练具备上下文感知能力的分类器,在真实世界恶意软件和良性应用上达到94.73%的宏F1值,显著优于现有基线方法,并在商业混淆和对抗样本下仍保持鲁棒性。
链接: https://arxiv.org/abs/2510.14344
作者: Zichen Liu,Shao Yang,Xusheng Xiao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Mobile app markets host millions of apps, yet undesired behaviors (e.g., disruptive ads, illegal redirection, payment deception) remain hard to catch because they often do not rely on permission-protected APIs and can be easily camouflaged via UI or metadata edits. We present BINCTX, a learning approach that builds multi-modal representations of an app from (i) a global bytecode-as-image view that captures code-level semantics and family-style patterns, (ii) a contextual view (manifested actions, components, declared permissions, URL/IP constants) indicating how behaviors are triggered, and (iii) a third-party-library usage view summarizing invocation frequencies along inter-component call paths. The three views are embedded and fused to train a contextual-aware classifier. On real-world malware and benign apps, BINCTX attains a macro F1 of 94.73%, outperforming strong baselines by at least 14.92%. It remains robust under commercial obfuscation (F1 84% post-obfuscation) and is more resistant to adversarial samples than state-of-the-art bytecode-only systems.
zh
[AI-53] Stop-RAG : Value-Based Retrieval Control for Iterative RAG NEURIPS2025
【速读】:该论文旨在解决迭代式检索增强生成(Iterative Retrieval-Augmented Generation, RAG)在处理复杂多跳问题时因固定迭代次数或不准确的置信度代理导致的延迟高、成本大及引入干扰证据风险增加的问题。解决方案的关键在于将迭代RAG建模为有限horizon马尔可夫决策过程(Markov Decision Process, MDP),并提出Stop-RAG——一种基于价值函数的控制器,通过训练获得能够自适应决定何时停止检索的策略。该方法利用完整轨迹的全宽前视Q(λ)目标进行训练,在保持与黑盒API和现有流水线兼容的同时,显著提升了多跳问答任务中的准确性,验证了价值驱动的自适应停止机制是当前智能体系统中缺失的重要组件。
链接: https://arxiv.org/abs/2510.14337
作者: Jaewan Park,Solbee Cho,Jay-Yoon Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 MTI-LLM Workshop
Abstract:Iterative retrieval-augmented generation (RAG) enables large language models to answer complex multi-hop questions, but each additional loop increases latency, costs, and the risk of introducing distracting evidence, motivating the need for an efficient stopping strategy. Existing methods either use a predetermined number of iterations or rely on confidence proxies that poorly reflect whether more retrieval will actually help. We cast iterative RAG as a finite-horizon Markov decision process and introduce Stop-RAG, a value-based controller that adaptively decides when to stop retrieving. Trained with full-width forward-view Q( \lambda ) targets from complete trajectories, Stop-RAG learns effective stopping policies while remaining compatible with black-box APIs and existing pipelines. On multi-hop question-answering benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines and prompting-based stopping with LLMs. These results highlight adaptive stopping as a key missing component in current agentic systems, and demonstrate that value-based control can improve the accuracy of RAG systems.
zh
[AI-54] Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction
【速读】:该论文旨在解决基于大语言模型的多智能体系统(Multi-Agent Systems, MAS)在协作求解过程中易受级联错误影响的问题,即单个错误步骤可能在智能体间传播并破坏整体推理轨迹。解决方案的关键在于提出一种元认知框架 MASC,通过两个互补设计实现实时、无监督的步骤级错误检测与自纠正:(1) 下一步执行重建(Next-Execution Reconstruction),利用查询和交互历史预测下一步嵌入以捕捉因果一致性;(2) 原型引导增强(Prototype-Guided Enhancement),学习正常步骤嵌入的原型先验,在稀疏上下文(如早期步骤)下稳定重建与异常评分。当检测到异常步骤时,MASC 调用修正代理对当前动作进行修正,从而阻断错误传播路径。实验表明,该方法在 WhoWhen 基准上将步骤级错误检测的 AUC-ROC 提升最高达 8.47%,且可无缝集成至多种 MAS 架构中,带来端到端性能提升。
链接: https://arxiv.org/abs/2510.14319
作者: Xu Shen,Qi Zhang,Song Wang,Zhen Tan,Xinyu Zhao,Laura Yao,Vaishnav Tadiparthi,Hossein Nourkhiz Mahjoub,Ehsan Moradi Pari,Kwonjoon Lee,Tianlong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model based multi-agent systems (MAS) excel at collaborative problem solving but remain brittle to cascading errors: a single faulty step can propagate across agents and disrupt the trajectory. In this paper, we present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction. MASC rethinks detection as history-conditioned anomaly scoring via two complementary designs: (1) Next-Execution Reconstruction, which predicts the embedding of the next step from the query and interaction history to capture causal consistency, and (2) Prototype-Guided Enhancement, which learns a prototype prior over normal-step embeddings and uses it to stabilize reconstruction and anomaly scoring under sparse context (e.g., early steps). When an anomaly step is flagged, MASC triggers a correction agent to revise the acting agent’s output before information flows downstream. On the WhoWhen benchmark, MASC consistently outperforms all baselines, improving step-level error detection by up to 8.47% AUC-ROC ; When plugged into diverse MAS frameworks, it delivers consistent end-to-end gains across architectures, confirming that our metacognitive monitoring and targeted correction can mitigate error propagation with minimal overhead.
zh
[AI-55] A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在微调过程中安全对齐(safety alignment)脆弱性问题,即即使在良性数据上进行微调或使用低秩适应(Low-Rank Adaptation, LoRA)方法时,预训练阶段获得的安全行为也容易被破坏,导致模型产生有害响应。解决方案的关键在于提出GuardSpace框架,其核心由两个组件构成:一是通过协方差预条件奇异值分解(covariance-preconditioned singular value decomposition)显式分离预训练权重中的安全相关子空间与安全无关子空间,并将低秩适配器初始化于安全无关部分,同时冻结安全相关部分以保留原有安全机制;二是构建一个零空间投影器(null space projector),限制适配器更新对有害提示下安全输出的扰动,从而维持原始拒绝行为。该方法在多个下游任务中显著优于现有技术,例如在Llama-2-7B-Chat模型微调GSM8K任务时,将平均有害得分从14.4%降至3.6%,同时准确率从26.0%提升至28.0%。
链接: https://arxiv.org/abs/2510.14301
作者: Bingjie Zhang,Yibo Yang,Renzhe,Dandan Guo,Jindong Gu,Philip Torr,Bernard Ghanem
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from from 26.0% to 28.0%.
zh
[AI-56] Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning
【速读】:该论文旨在解决生成式 AI (Generative AI) 在机器人操作任务中扩展 VLA(Vision-Language-Action)模型时面临的两大挑战:一是从零训练新模型需要大量计算资源和数据,而当前机器人数据稀缺,因此需充分利用预训练 VLA 模型权重;二是实时控制需在模型能力与计算效率之间取得平衡。解决方案的关键在于提出 AdaMoE 架构——一种基于混合专家(Mixture-of-Experts, MoE)的结构,它继承稠密 VLA 模型的预训练权重,并通过将前馈层替换为稀疏激活的 MoE 层来扩展动作专家。其核心创新是采用解耦机制,将专家选择与专家加权分离,借助独立的缩放适配器(scale adapter)实现基于任务相关性的专家选择与可独立调控的权重贡献,从而避免“赢家通吃”式的竞争机制,促进多专家协同利用,在保持计算效率的同时显著提升性能,在多个基准测试中实现最高达 9.3% 的性能增益,并在真实机器人实验中获得 21.5% 的改进。
链接: https://arxiv.org/abs/2510.14300
作者: Weijie Shen,Yitian Liu,Yuhao Wu,Zhixuan Liang,Sijia Gu,Dehui Wang,Tian Nian,Lei Xu,Yusen Qin,Jiangmiao Pang,Xinping Guan,Xiaokang Yang,Yao Mu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, We propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models, and scales up the action expert by substituting the feedforward layers into sparsely activated MoE layers. AdaMoE employs a decoupling technique that decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize. Instead, through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
zh
[AI-57] ED: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening ICDM2025
【速读】:该论文旨在解决深度神经网络在关键应用场景中面临的隐蔽后门攻击(stealthy backdoor attacks)问题,此类攻击通过污染训练数据使模型在特定输入下触发恶意行为,同时保持输入在视觉或语义上看似正常,从而逃避现有防御机制。其解决方案的关键在于提出TED++框架,该框架基于子流形感知(submanifold-aware)思想,首先构建每个类别在隐藏特征空间中的“管状邻域”(tubular neighbourhood),并利用少量干净样本估计该邻域的局部“厚度”;随后采用局部自适应排序(Locally Adaptive Ranking, LAR)方法检测偏离该邻域的激活点,并通过跨层聚合LAR调整后的排名序列,捕捉输入在演化类子流形上的“管约束”行为特征,最终识别出显著偏离该行为模式的潜在后门样本。
链接: https://arxiv.org/abs/2510.14299
作者: Nam Le,Leo Yu Zhang,Kewen Liao,Shirui Pan,Wei Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICDM 2025
Abstract:As deep neural networks power increasingly critical applications, stealthy backdoor attacks, where poisoned training inputs trigger malicious model behaviour while appearing benign, pose a severe security risk. Many existing defences are vulnerable when attackers exploit subtle distance-based anomalies or when clean examples are scarce. To meet this challenge, we introduce TED++, a submanifold-aware framework that effectively detects subtle backdoors that evade existing defences. TED++ begins by constructing a tubular neighbourhood around each class’s hidden-feature manifold, estimating its local thickness'' from a handful of clean activations. It then applies Locally Adaptive Ranking (LAR) to detect any activation that drifts outside the admissible tube. By aggregating these LAR-adjusted ranks across all layers, TED++ captures how faithfully an input remains on the evolving class submanifolds. Based on such characteristic
tube-constrained’’ behaviour, TED++ flags inputs whose LAR-based ranking sequences deviate significantly. Extensive experiments are conducted on benchmark datasets and tasks, demonstrating that TED++ achieves state-of-the-art detection performance under both adaptive-attack and limited-data scenarios. Remarkably, even with only five held-out examples per class, TED++ still delivers near-perfect detection, achieving gains of up to 14% in AUROC over the next-best method. The code is publicly available at this https URL.
zh
[AI-58] Beyond a Single Perspective: Towards a Realistic Evaluation of Website Fingerprinting Attacks
【速读】:该论文旨在解决当前网站指纹攻击(Website Fingerprinting, WF)研究中普遍存在的局限性问题,即大多数现有WF攻击方法仅在理想化、单一实验场景下表现优异(准确率超过90%),而忽视了真实网络环境中复杂的多因素干扰,如防御机制、流量漂移、多标签浏览、早期检测、开放世界设定及少样本情况等。为应对这一挑战,论文提出了一种多维评估框架(multidimensional evaluation framework),首次系统性地在多种现实条件组合下对主流WF攻击技术进行综合测试与分析,揭示了其在复杂环境中的性能退化现象,从而明确了当前WF攻击在实际应用中的可行性瓶颈,并为未来开发更鲁棒、更贴近实战的攻击方法提供了关键指导。
链接: https://arxiv.org/abs/2510.14283
作者: Xinhao Deng,Jingyou Chen,Linxiao Yu,Yixiang Zhang,Zhongyi Gu,Changhao Qiu,Xiyuan Zhao,Ke Xu,Qi Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Website Fingerprinting (WF) attacks exploit patterns in encrypted traffic to infer the websites visited by users, posing a serious threat to anonymous communication systems. Although recent WF techniques achieve over 90% accuracy in controlled experimental settings, most studies remain confined to single scenarios, overlooking the complexity of real-world environments. This paper presents the first systematic and comprehensive evaluation of existing WF attacks under diverse realistic conditions, including defense mechanisms, traffic drift, multi-tab browsing, early-stage detection, open-world settings, and few-shot scenarios. Experimental results show that many WF techniques with strong performance in isolated settings degrade significantly when facing other conditions. Since real-world environments often combine multiple challenges, current WF attacks are difficult to apply directly in practice. This study highlights the limitations of WF attacks and introduces a multidimensional evaluation framework, offering critical insights for developing more robust and practical WF attacks.
zh
[AI-59] MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
【速读】:该论文旨在解决现有大模型推理能力评估基准在覆盖范围和灵活性方面的局限性,即现有基准通常仅涵盖特定领域且难以根据模型推理能力的演进动态调整难度。解决方案的关键在于提出MorphoBench,其核心创新包括:(1)整合多学科复杂推理问题,提升评估的全面性;(2)基于模型推理过程中生成的关键语句自适应调节题目难度,实现动态适配;(3)引入仿真软件生成问题以低资源消耗方式实现难度的灵活调整。通过上述机制,MorphoBench能够持续追踪并反映先进模型(如o3和GPT-5)的推理能力变化,从而增强评估的有效性和科学性。
链接: https://arxiv.org/abs/2510.14265
作者: Xukai Wang,Xuanbo Liu,Mingrui Chen,Haitian Zhong,Xuanlin Yang,Bohan Zeng,Jinbo Hu,Hao Liang,Junbo Niu,Xuchen Li,Ruitao Wu,Ruichuan An,Yang Shi,Liu Liu,Xu-Yao Zhang,Qiang Liu,Zhouchen Lin,Wentao Zhang,Bin Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 12 figures
Abstract:With the advancement of powerful large-scale reasoning models, effectively evaluating the reasoning capabilities of these models has become increasingly important. However, existing benchmarks designed to assess the reasoning abilities of large models tend to be limited in scope and lack the flexibility to adapt their difficulty according to the evolving reasoning capacities of the models. To address this, we propose MorphoBench, a benchmark that incorporates multidisciplinary questions to evaluate the reasoning capabilities of large models and can adjust and update question difficulty based on the reasoning abilities of advanced models. Specifically, we curate the benchmark by selecting and collecting complex reasoning questions from existing benchmarks and sources such as Olympiad-level competitions. Additionally, MorphoBench adaptively modifies the analytical challenge of questions by leveraging key statements generated during the model’s reasoning process. Furthermore, it includes questions generated using simulation software, enabling dynamic adjustment of benchmark difficulty with minimal resource consumption. We have gathered over 1,300 test questions and iteratively adjusted the difficulty of MorphoBench based on the reasoning capabilities of models such as o3 and GPT-5. MorphoBench enhances the comprehensiveness and validity of model reasoning evaluation, providing reliable guidance for improving both the reasoning abilities and scientific robustness of large models. The code has been released in this https URL.
zh
[AI-60] owards Agent ic Self-Learning LLM s in Search Environment
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的智能体(Agent)在开放域场景下如何实现可扩展训练的问题,即如何在不依赖人工标注数据集或预定义规则奖励的情况下,持续提升智能体的能力。其核心挑战在于缺乏有效的奖励信号和任务数据规模不足导致的性能瓶颈。解决方案的关键在于提出一种全闭环、多角色协同进化的强化学习框架——代理自学习(Agentic Self-Learning, ASL),该框架通过统一任务生成、策略执行与评估于共享工具环境和LLM主干中,协调提示生成器(Prompt Generator)、策略模型(Policy Model)和生成式奖励模型(Generative Reward Model, GRM)形成一个正向循环:更难的任务设定 → 更精准的验证 → 更强的求解能力。实证表明,ASL在零标签数据条件下仍能持续优化,显著优于传统基于规则奖励或固定奖励模型的基线方法,且GRM的持续训练是突破性能天花板的关键因素。
链接: https://arxiv.org/abs/2510.14253
作者: Wangtao Sun,Xiang Cheng,Jialin Fan,Yao Xu,Xing Yu,Shizhu He,Jun Zhao,Kang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data-even when synthetically generated-substantially enhances agentic capabilities. Building on these insights, we propose \textbfAgentic Self-Learning (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of harder task setting, sharper verification, and stronger solving. Empirically, ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines (e.g., Search-R1) that plateau or degrade, and continues improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness. We further show that GRM verification capacity is the main bottleneck: if frozen, it induces reward hacking and stalls progress; continual GRM training on the evolving data distribution mitigates this, and a small late-stage injection of real verification data raises the performance ceiling. This work establishes reward source and data scale as critical levers for open-domain agent learning and demonstrates the efficacy of multi-role co-evolution for scalable, self-improving agents. The data and code of this paper are released at this https URL
zh
[AI-61] Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?
【速读】:该论文旨在解决当前联合语言-音频嵌入模型在捕捉人类感知的音色(timbre)维度方面能力不足的问题,特别是这些模型对亮度、粗糙度和温暖感等多维音色属性的对齐可靠性尚未充分验证。解决方案的关键在于系统性评估三种主流模型(MS-CLAP、LAION-CLAP 和 MuQ-MuLan)在乐器声音与音频效果上的音色语义一致性,结果表明 LAION-CLAP 在跨类别音频中展现出最稳定的与人类感知音色的一致性,从而为音色感知建模提供了更可靠的嵌入空间基础。
链接: https://arxiv.org/abs/2510.14249
作者: Qixin Deng,Bryan Pardo,Thrasyvoulos N Pappas
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval,text-guided music generation, and audio captioning. Central to these tasks is the use of joint language-audio embedding spaces, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate the above three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.
zh
[AI-62] Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中因分布偏移(distribution shift)导致的决策问题,即训练环境与部署环境不一致时,如何学习鲁棒策略。核心挑战在于在线设置下样本效率和探索机制的优化,而现有基于策略优化的方法在鲁棒强化学习(Robust Reinforcement Learning, RRL)领域仍缺乏理论和实证研究。解决方案的关键是提出分布鲁棒正则化策略优化算法(Distributionally Robust Regularized Policy Optimization, DR-RPO),其创新点包括:1)引入参考策略正则化以在软最大策略类中实现可 tractable 的优化,并构造出同时对转移动态和策略进行双重约束的鲁棒马尔可夫决策过程(RMDP)变体;2)采用 d-矩形线性 MDP 框架结合线性函数逼近与上置信界奖励(upper confidence bonus)实现乐观探索,从而扩展至大规模状态-动作空间。理论分析表明,DR-RPO 可实现多项式次优性边界和样本效率,性能与基于值的方法相当,且实验验证了其在多种场景下的鲁棒性。
链接: https://arxiv.org/abs/2510.14246
作者: Jingwen Gu,Yiting He,Zhishuai Liu,Pan Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 53 pages, 8 figures
Abstract:Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We study this problem through the lens of robust Markov decision processes (RMDPs), which optimize performance against adversarial transition dynamics. Our focus is the online setting, where the agent has only limited interaction with the environment, making sample efficiency and exploration especially critical. Policy optimization, despite its success in standard RL, remains theoretically and empirically underexplored in robust RL. To bridge this gap, we propose \textbfDistributionally \textbfRobust \textbfRegularized \textbfPolicy \textbfOptimization algorithm (DR-RPO), a model-free online policy optimization method that learns robust policies with sublinear regret. To enable tractable optimization within the softmax policy class, DR-RPO incorporates reference-policy regularization, yielding RMDP variants that are doubly constrained in both transitions and policies. To scale to large state-action spaces, we adopt the d -rectangular linear MDP formulation and combine linear function approximation with an upper confidence bonus for optimistic exploration. We provide theoretical guarantees showing that policy optimization can achieve polynomial suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches. Finally, empirical results across diverse domains corroborate our theory and demonstrate the robustness of DR-RPO.
zh
[AI-63] Spatial Computing Communications for Multi-User Virtual Reality in Distributed Mobile Edge Computing Network
【速读】:该论文旨在解决沉浸式虚拟现实(Immersive Virtual Reality, VR)应用在多用户交互场景下对延迟、能效和计算资源的严苛需求问题,尤其是在分布式移动边缘计算(Mobile Edge Computing, MEC)网络中的部署挑战。其核心解决方案是提出空间计算通信(Spatial Computing Communications, SCC)框架,通过联合建模物理空间(由用户与基站定义)和虚拟空间(共享沉浸环境),并利用用户动态和资源需求的概率模型,将资源部署任务形式化为多目标组合优化(Multi-Objective Combinatorial Optimization, MOCO)问题,以同时最小化系统延迟和能耗。关键创新在于设计了MO-CMPO算法——一种结合监督学习与强化学习(Reinforcement Learning, RL)微调的多目标一致性模型策略优化方法,并引入稀疏图神经网络(Graph Neural Network, GNN)高效生成帕累托最优解,实验证明其在超体积指标和推理延迟上优于基线方法,且揭示出按目标导向的部署模式:延迟优先方案倾向于本地MEC执行以降低传输延迟,而能效优先方案则通过减少冗余部署来节省能量。
链接: https://arxiv.org/abs/2510.14243
作者: Caolu Xu,Zhiyong Chen,Meixia Tao,Li Song,Wenjun Zhang
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: submited to IEEE journal
Abstract:Immersive virtual reality (VR) applications impose stringent requirements on latency, energy efficiency, and computational resources, particularly in multi-user interactive scenarios. To address these challenges, we introduce the concept of spatial computing communications (SCC), a framework designed to meet the latency and energy demands of multi-user VR over distributed mobile edge computing (MEC) networks. SCC jointly represents the physical space, defined by users and base stations, and the virtual space, representing shared immersive environments, using a probabilistic model of user dynamics and resource requirements. The resource deployment task is then formulated as a multi-objective combinatorial optimization (MOCO) problem that simultaneously minimizes system latency and energy consumption across distributed MEC resources. To solve this problem, we propose MO-CMPO, a multi-objective consistency model with policy optimization that integrates supervised learning and reinforcement learning (RL) fine-tuning guided by preference weights. Leveraging a sparse graph neural network (GNN), MO-CMPO efficiently generates Pareto-optimal solutions. Simulations with real-world New Radio base station datasets demonstrate that MO-CMPO achieves superior hypervolume performance and significantly lower inference latency than baseline methods. Furthermore, the analysis reveals practical deployment patterns: latency-oriented solutions favor local MEC execution to reduce transmission delay, while energy-oriented solutions minimize redundant placements to save energy.
zh
[AI-64] LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild
【速读】:该论文旨在解决当前评估生成式 AI (Generative AI) 系统在深度研究(Deep Research)能力上的基准不足问题,即现有评测基准难以真实反映系统在多源、动态、复杂信息整合与分析方面的能力。其解决方案的关键在于提出两个核心工具:一是 LiveResearchBench,一个包含100个专家设计的、覆盖日常、企业及学术场景的任务集合,强调用户导向性、动态性、明确性和多源搜索需求;二是 DeepEval,一套涵盖内容和报告层面质量的综合评估框架,包括覆盖率、呈现质量、引用准确性与关联性、一致性与分析深度等维度,并通过四种互补评估协议确保结果稳定且贴近人工判断。这一组合为系统性评估前沿深度研究系统提供了严谨、可复现的基准。
链接: https://arxiv.org/abs/2510.14240
作者: Jiayu Wang,Yifei Ming,Riya Dulepet,Qinglin Chen,Austin Xu,Zixuan Ke,Frederic Sala,Aws Albarghouthi,Caiming Xiong,Shafiq Joty
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deep research – producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources – marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
zh
[AI-65] Large Scale Retrieval for the LinkedIn Feed using Causal Language Models
【速读】:该论文旨在解决大规模推荐系统中召回阶段的效率与质量瓶颈问题,即如何在毫秒级延迟和高并发请求下,从数亿候选内容中高效筛选出高质量的候选集以供后续排序。其关键解决方案是将大型因果语言模型(如Meta的LLaMA 3)微调为双编码器(dual encoder),仅使用文本输入生成用户和内容的高质嵌入(embedding),并通过优化提示(prompt)设计、大规模微调技术和低延迟在线服务基础设施实现端到端部署。特别地,研究发现对提示中数值特征进行量化处理可提升信息编码精度,从而增强召回层与排序层之间的对齐效果,显著改善用户参与度,尤其在新用户群体中表现突出。
链接: https://arxiv.org/abs/2510.14223
作者: Sudarshan Srinivasa Ramanujam,Antonio Alonso,Saurabh Kataria,Siddharth Dangi,Akhilesh Gupta,Birjodh Singh Tiwana,Manas Somaiya,Luke Simon,David Byrne,Sojeong Ha,Sen Zhou,Andrei Akterskii,Zhanglong Liu,Samira Sriram,Crescent Xiong,Zhoutao Pei,Angela Shao,Alex Li,Annie Xiao,Caitlin Kolb,Thomas Kistler,Zach Moore,Hamed Firooz
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:In large scale recommendation systems like the LinkedIn Feed, the retrieval stage is critical for narrowing hundreds of millions of potential candidates to a manageable subset for ranking. LinkedIn’s Feed serves suggested content from outside of the member’s network (based on the member’s topical interests), where 2000 candidates are retrieved from a pool of hundreds of millions candidate with a latency budget of a few milliseconds and inbound QPS of several thousand per second. This paper presents a novel retrieval approach that fine-tunes a large causal language model (Meta’s LLaMA 3) as a dual encoder to generate high quality embeddings for both users (members) and content (items), using only textual input. We describe the end to end pipeline, including prompt design for embedding generation, techniques for fine-tuning at LinkedIn’s scale, and infrastructure for low latency, cost effective online serving. We share our findings on how quantizing numerical features in the prompt enables the information to get properly encoded in the embedding, facilitating greater alignment between the retrieval and ranking layer. The system was evaluated using offline metrics and an online A/B test, which showed substantial improvements in member engagement. We observed significant gains among newer members, who often lack strong network connections, indicating that high-quality suggested content aids retention. This work demonstrates how generative language models can be effectively adapted for real time, high throughput retrieval in industrial applications.
zh
[AI-66] Echoes of Human Malice in Agents : Benchmarking LLM s for Multi-Turn Online Harassment Attacks
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在多轮交互中易受恶意操控的问题,特别是针对在线骚扰行为的系统性攻击漏洞。现有研究多聚焦于单轮提示词攻击(jailbreak),而忽视了现实场景中持续、动态的多轮骚扰互动。其解决方案的关键在于构建一个基于重复博弈理论(repeated game theory)的多智能体模拟框架,生成合成的多轮骚扰对话数据集,并设计三种针对记忆(memory)、规划(planning)和微调(fine-tuning)机制的攻击方法,结合混合评估框架对主流开源(LLaMA-3.1-8B-Instruct)与闭源模型(Gemini-2.0-flash)进行实证分析。实验表明,通过微调攻击可使骚扰成功率提升至95.78–96.89%(无攻击时为57.25–64.19%),同时显著降低拒绝率至1–2%,并诱发人类级的攻击性人格特征(如马基雅维利主义或自恋倾向),揭示出当前安全防护机制在多轮情境下的严重不足。
链接: https://arxiv.org/abs/2510.14207
作者: Trilok Padhi,Pinxian Lu,Abdulkadir Erol,Tanmay Sutar,Gauri Sharma,Mina Sonmez,Munmun De Choudhury,Ugur Kursuncu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures
Abstract:Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning in Llama, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rate to 1-2% in both models. The most prevalent toxic behaviors are Insult with 84.9–87.8% vs. 44.2–50.8% without tuning, and Flaming with 81.2–85.1% vs. 31.5–38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.
zh
[AI-67] Implementation of AI in Precision Medicine
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在精准医疗中临床落地受限的问题,重点识别影响AI系统在真实世界环境中实施的关键障碍与促进因素。其解决方案的核心在于构建一个基于生态系统的框架,强调数据质量、临床可靠性、工作流程整合及治理机制之间的相互依赖关系,从而为实现可信且可持续的AI赋能精准医疗提供路径指引。
链接: https://arxiv.org/abs/2510.14194
作者: Göktuğ Bender,Samer Faraj,Anand Bhardwaj
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to SMASH 2025
Abstract:Artificial intelligence (AI) has become increasingly central to precision medicine by enabling the integration and interpretation of multimodal data, yet implementation in clinical settings remains limited. This paper provides a scoping review of literature from 2019-2024 on the implementation of AI in precision medicine, identifying key barriers and enablers across data quality, clinical reliability, workflow integration, and governance. Through an ecosystem-based framework, we highlight the interdependent relationships shaping real-world translation and propose future directions to support trustworthy and sustainable implementation.
zh
[AI-68] ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)算法对奖励函数设计高度敏感的问题,这一挑战限制了其在复杂任务中的广泛应用。解决方案的关键在于提出ARM-FM框架——通过基础模型(Foundation Models, FMs)自动构建奖励机器(Reward Machines, RMs),实现基于自然语言的可组合式奖励设计。该方法利用RM的结构化形式实现任务分解,并借助FM的语言理解能力将自然语言指令转化为形式化的奖励逻辑,同时通过语言嵌入关联RM状态以支持跨任务泛化,从而显著提升奖励设计的效率与迁移能力。
链接: https://arxiv.org/abs/2510.14176
作者: Roger Creus Castanyer,Faisal Mohamed,Pablo Samuel Castro,Cyrus Neary,Glen Berseth
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) – an automata-based formalism for reward specification – are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM’s effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
zh
[AI-69] JEDA: Query-Free Clinical Order Search from Ambient Dialogues
【速读】:该论文旨在解决临床对话中显式指令(如“开具胸部X光检查”)与隐式推理(如“咳嗽 overnight 加重,应排查肺炎”)混杂时,现有系统依赖大语言模型(Large Language Model, LLM)重写指令所导致的延迟、不稳定和不透明问题,从而阻碍实时医嘱生成。其解决方案的关键在于提出JEDA(Joint Embedding for Direct and Ambient clinical orders),一种基于领域初始化的双编码器架构,能够直接检索标准医嘱,并在无需查询的情况下,通过短时间窗口内环境对话的嵌入表示触发检索;该模型基于PubMedBERT初始化并采用去重安全的对比学习目标进行微调,有效对齐不同表达形式的意图到统一的医嘱概念,同时利用受限LLM引导训练策略强化医嘱与多种表述(仅命令、仅上下文、命令+上下文、上下文+推理)之间的关联,实现更清晰的医嘱间区分、更强的查询-医嘱耦合性及更好的泛化能力,且无需LLM即可实现快速、可解释的实时医嘱检索。
链接: https://arxiv.org/abs/2510.14169
作者: Praphul Singh,Corey Barrett,Sumana Srivasta,Amitabh Saikia,Irfan Bulu,Sri Gadde,Krishnaram Kenthapadi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical conversations mix explicit directives (order a chest X-ray) with implicit reasoning (the cough worsened overnight, we should check for pneumonia). Many systems rely on LLM rewriting, adding latency, instability, and opacity that hinder real-time ordering. We present JEDA (Joint Embedding for Direct and Ambient clinical orders), a domain-initialized bi-encoder that retrieves canonical orders directly and, in a query-free mode, encodes a short rolling window of ambient dialogue to trigger retrieval. Initialized from PubMedBERT and fine-tuned with a duplicate-safe contrastive objective, JEDA aligns heterogeneous expressions of intent to shared order concepts. Training uses constrained LLM guidance to tie each signed order to complementary formulations (command only, context only, command+context, context+reasoning), producing clearer inter-order separation, tighter query extendash order coupling, and stronger generalization. The query-free mode is noise-resilient, reducing sensitivity to disfluencies and ASR errors by conditioning on a short window rather than a single utterance. Deployed in practice, JEDA yields large gains and substantially outperforms its base encoder and recent open embedders (Linq Embed Mistral, SFR Embedding, GTE Qwen, BGE large, Embedding Gemma). The result is a fast, interpretable, LLM-free retrieval layer that links ambient context to actionable clinical orders in real time.
zh
[AI-70] FinAI Data Assistant: LLM -based Financial Database Query Processing with the OpenAI Function Calling API CIKM2025
【速读】:该论文旨在解决金融数据库中自然语言查询的可靠性与效率问题,即如何在保证查询准确性的前提下降低延迟和成本。其解决方案的关键在于将大语言模型(LLM)与OpenAI函数调用(Function Calling)API相结合,而非直接通过文本到SQL(text-to-SQL)方式生成完整查询语句;系统将用户请求路由至一个经过验证的参数化查询库,以牺牲生成式灵活性为代价换取更高的可靠性、更低的延迟和更优的成本效益。
链接: https://arxiv.org/abs/2510.14162
作者: Juhyeong Kim,Yejin Kim,Youngbin Lee,Hyunwoo Byun
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, accepted at CIKM 2025 FinAI Workshop
Abstract:We present FinAI Data Assistant, a practical approach for natural-language querying over financial databases that combines large language models (LLMs) with the OpenAI Function Calling API. Rather than synthesizing complete SQL via text-to-SQL, our system routes user requests to a small library of vetted, parameterized queries, trading generative flexibility for reliability, low latency, and cost efficiency. We empirically study three questions: (RQ1) whether LLMs alone can reliably recall or extrapolate time-dependent financial data without external retrieval; (RQ2) how well LLMs map company names to stock ticker symbols; and (RQ3) whether function calling outperforms text-to-SQL for end-to-end database query processing. Across controlled experiments on prices and fundamentals, LLM-only predictions exhibit non-negligible error and show look-ahead bias primarily for stock prices relative to model knowledge cutoffs. Ticker-mapping accuracy is near-perfect for NASDAQ-100 constituents and high for S\P~500 firms. Finally, FinAI Data Assistant achieves lower latency and cost and higher reliability than a text-to-SQL baseline on our task suite. We discuss design trade-offs, limitations, and avenues for deployment.
zh
[AI-71] Combining Reinforcement Learning and Behavior Trees for NPCs in Video Games with AMD Schola
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在商业视频游戏中应用缓慢的问题,特别是如何将RL驱动的非玩家角色(Non-Player Characters, NPCs)有效集成到实际游戏开发流程中。其解决方案的关键在于探索行为树(Behavior Trees, BTs)与RL的结合,通过引入AMD Schola插件在Unreal Engine中联合训练RL模型与BT系统,从而实现多任务NPC在复杂3D环境中的可控、可解释且高效的行为生成。这一方法不仅提升了RL在游戏AI中的实用性,也为传统BT框架注入了自适应决策能力。
链接: https://arxiv.org/abs/2510.14154
作者: Tian Liu,Alex Cann,Ian Colbert,Mehdi Saeedi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, 5 tables
Abstract:While the rapid advancements in the reinforcement learning (RL) research community have been remarkable, the adoption in commercial video games remains slow. In this paper, we outline common challenges the Game AI community faces when using RL-driven NPCs in practice, and highlight the intersection of RL with traditional behavior trees (BTs) as a crucial juncture to be explored further. Although the BT+RL intersection has been suggested in several research papers, its adoption is rare. We demonstrate the viability of this approach using AMD Schola – a plugin for training RL agents in Unreal Engine – by creating multi-task NPCs in a complex 3D environment inspired by the commercial video game ``The Last of Us". We provide detailed methodologies for jointly training RL models with BTs while showcasing various skills.
zh
[AI-72] CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization
【速读】:该论文旨在解决如何利用生成式 AI(Generative AI)高效求解复杂计算问题的挑战,尤其在缺乏明确解析解或传统算法难以应对的场景中。其核心解决方案是提出 CodeEvolve,一个结合大语言模型(Large Language Models, LLMs)与遗传算法(Genetic Algorithms)的开放源代码进化编程代理系统。关键创新包括:采用岛屿式遗传算法(island-based genetic algorithm)以维持种群多样性并提升计算吞吐量;设计基于启发式的交叉机制(inspiration-based crossover),利用 LLM 的上下文窗口融合成功解的特征;以及引入元提示策略(meta-prompting strategies)实现对解空间的动态探索。这些方法共同提升了模型在数学基准测试上的表现,超越了闭源的 AlphaEvolve 系统。
链接: https://arxiv.org/abs/2510.14150
作者: Henrique Assumpção,Diego Ferreira,Leandro Campos,Fabricio Murai
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 11 pages, 9 figures, 2 tables
Abstract:In this work, we introduce CodeEvolve, an open-source evolutionary coding agent that unites Large Language Models (LLMs) with genetic algorithms to solve complex computational problems. Our framework adapts powerful evolutionary concepts to the LLM domain, building upon recent methods for generalized scientific discovery. CodeEvolve employs an island-based genetic algorithm to maintain population diversity and increase throughput, introduces a novel inspiration-based crossover mechanism that leverages the LLMs context window to combine features from successful solutions, and implements meta-prompting strategies for dynamic exploration of the solution space. We conduct a rigorous evaluation of CodeEvolve on a subset of the mathematical benchmarks used to evaluate Google DeepMind’s closed-source AlphaEvolve. Our findings show that our method surpasses AlphaEvolve’s performance on several challenging problems. To foster collaboration and accelerate progress, we release our complete framework as an open-source repository.
zh
[AI-73] Inferred global dense residue transition graphs from primary structure sequences enable protein interaction prediction via directed graph convolutional neural networks
【速读】:该论文旨在解决蛋白质-蛋白质相互作用(Protein-Protein Interactions, PPIs)精准预测问题,以支持细胞功能解析和药物开发。现有方法主要依赖于从蛋白质语言模型(Protein Language Models, PLMs)提取的序列嵌入或基于图神经网络(Graph Neural Networks, GNNs)处理3D结构数据,但这些方法计算成本较高。其解决方案的关键在于提出一种两阶段图表示学习框架ProtGram-DirectGCN:首先构建ProtGram,将蛋白质一级结构建模为全局推断的n-gram有向图,其中残基间的转移概率作为边权重;其次设计DirectGCN,一种专用于有向图的卷积神经网络,通过路径特异性变换(入边、出边、无向边)与共享变换的组合,并引入可学习门控机制融合信息,从而高效学习残基级嵌入并聚合为蛋白级表示,实现高精度PPI预测,尤其在训练数据有限时仍表现出鲁棒性能。
链接: https://arxiv.org/abs/2510.14139
作者: Islam Akef Ebeid,Haoteng Tang,Pengfei Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: under review in Frontiers in Bioinformatics
Abstract:Introduction Accurate prediction of protein-protein interactions (PPIs) is crucial for understanding cellular functions and advancing drug development. Existing in-silico methods use direct sequence embeddings from Protein Language Models (PLMs). Others use Graph Neural Networks (GNNs) for 3D protein structures. This study explores less computationally intensive alternatives. We introduce a novel framework for downstream PPI prediction through link prediction. Methods We introduce a two-stage graph representation learning framework, ProtGram-DirectGCN. First, we developed ProtGram. This approach models a protein’s primary structure as a hierarchy of globally inferred n-gram graphs. In these graphs, residue transition probabilities define edge weights. Each edge connects a pair of residues in a directed graph. The probabilities are aggregated from a large corpus of sequences. Second, we propose DirectGCN, a custom directed graph convolutional neural network. This model features a unique convolutional layer. It processes information through separate path-specific transformations: incoming, outgoing, and undirected. A shared transformation is also applied. These paths are combined via a learnable gating mechanism. We apply DirectGCN to ProtGram graphs to learn residue-level embeddings. These embeddings are pooled via attention to generate protein-level embeddings for prediction. Results We first established the efficacy of DirectGCN on standard node classification benchmarks. Its performance matches established methods on general datasets. The model excels at complex, directed graphs with dense, heterophilic structures. When applied to PPI prediction, the full ProtGram-DirectGCN framework delivers robust predictive power. This strong performance holds even with limited training data.
zh
[AI-74] A Multimodal Approach to Heritage Preservation in the Context of Climate Change
【速读】:该论文旨在解决文化遗产遗址因气候变化加速退化的问题,传统单模态监测方法(如仅依赖视觉检查或环境传感器)无法有效捕捉环境压力因素与材料劣化之间的复杂相互作用。解决方案的关键在于提出一种轻量级多模态架构,通过融合温度、湿度等传感器数据与视觉图像信息来预测遗址退化程度;其核心创新包括:(1) 简化的编码器结构(64维潜在空间)以防止小样本数据集(n=37)下的过拟合,以及 (2) 自适应 Barlow Twins 损失函数,促进模态间的互补性而非冗余性。实验表明,该方法在斯特拉斯堡大教堂数据上达到 76.9% 准确率,显著优于标准多模态模型和原始 PerceiverIO,验证了架构简洁性与对比正则化在数据稀缺场景下实现高效多模态学习的有效性。
链接: https://arxiv.org/abs/2510.14136
作者: David Roqui,Adèle Cormier,nistor Grozavu,Ann Bourges
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Cultural heritage sites face accelerating degradation due to climate change, yet tradi- tional monitoring relies on unimodal analysis (visual inspection or environmental sen- sors alone) that fails to capture the complex interplay between environmental stres- sors and material deterioration. We propose a lightweight multimodal architecture that fuses sensor data (temperature, humidity) with visual imagery to predict degradation severity at heritage sites. Our approach adapts PerceiverIO with two key innovations: (1) simplified encoders (64D latent space) that prevent overfitting on small datasets (n=37 training samples), and (2) Adaptive Barlow Twins loss that encourages modality complementarity rather than redundancy. On data from Strasbourg Cathedral, our model achieves 76.9% accu- racy, a 43% improvement over standard multimodal architectures (VisualBERT, Trans- former) and 25% over vanilla PerceiverIO. Ablation studies reveal that sensor-only achieves 61.5% while image-only reaches 46.2%, confirming successful multimodal synergy. A systematic hyperparameter study identifies an optimal moderate correlation target (\tau =0.3) that balances align- ment and complementarity, achieving 69.2% accuracy compared to other \tau values (\tau =0.1/0.5/0.7: 53.8%, \tau =0.9: 61.5%). This work demonstrates that architectural sim- plicity combined with contrastive regularization enables effective multimodal learning in data-scarce heritage monitoring contexts, providing a foundation for AI-driven con- servation decision support systems.
zh
[AI-75] Formalizing the Safety Security and Functional Properties of Agent ic AI Systems
【速读】:该论文旨在解决当前多智能体AI系统中因通信协议碎片化导致的语义鸿沟问题,这种碎片化阻碍了对系统安全性、可靠性及功能性的严谨分析,并可能引发架构错位和可被利用的协调漏洞。其解决方案的关键在于提出一个由两个基础模型构成的统一建模框架:一是主机代理模型(host agent model),用于形式化顶层交互实体的任务分解与执行编排;二是任务生命周期模型(task lifecycle model),以细粒度描述子任务从创建到完成的状态变迁与错误处理机制。这两个模型共同构建了一个语义一致的分析平台,使得17项主机代理属性与14项任务生命周期属性能够通过时序逻辑表达并进行形式化验证,从而实现对多智能体系统的正确性保障、死锁检测与安全漏洞预防。
链接: https://arxiv.org/abs/2510.14133
作者: Edoardo Allegrini,Ananth Shreekumar,Z. Berkay Celik
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注:
Abstract:Agentic AI systems, which leverage multiple autonomous agents and Large Language Models (LLMs), are increasingly used to address complex, multi-step tasks. The safety, security, and functionality of these systems are critical, especially in high-stakes applications. However, the current ecosystem of inter-agent communication is fragmented, with protocols such as the Model Context Protocol (MCP) for tool access and the Agent-to-Agent (A2A) protocol for coordination being analyzed in isolation. This fragmentation creates a semantic gap that prevents the rigorous analysis of system properties and introduces risks such as architectural misalignment and exploitable coordination issues. To address these challenges, we introduce a modeling framework for agentic AI systems composed of two foundational models. The first, the host agent model, formalizes the top-level entity that interacts with the user, decomposes tasks, and orchestrates their execution by leveraging external agents and tools. The second, the task lifecycle model, details the states and transitions of individual sub-tasks from creation to completion, providing a fine-grained view of task management and error handling. Together, these models provide a unified semantic framework for reasoning about the behavior of multi-AI agent systems. Grounded in this framework, we define 17 properties for the host agent and 14 for the task lifecycle, categorized into liveness, safety, completeness, and fairness. Expressed in temporal logic, these properties enable formal verification of system behavior, detection of coordination edge cases, and prevention of deadlocks and security vulnerabilities. Through this effort, we introduce the first rigorously grounded, domain-agnostic framework for the systematic analysis, design, and deployment of correct, reliable, and robust agentic AI systems.
zh
[AI-76] STEMS: Spatial-Temporal Enhanced Safe Multi-Agent Coordination for Building Energy Management
【速读】:该论文旨在解决多建筑能源协同管理中面临的三大核心挑战:空间-时间信息利用不足、缺乏严格的运行安全保证以及系统复杂性高。其解决方案的关键在于提出一种名为Spatial-Temporal Enhanced Safe Multi-Agent Coordination (STEMS) 的新型安全约束型多智能体强化学习框架,该框架通过两个核心组件实现突破:一是基于GCN-Transformer融合架构的空间-时间图表示学习机制,用于捕捉建筑间的关联关系与时间动态模式;二是引入控制屏障函数(Control Barrier Functions)的安全约束多智能体强化学习算法,提供数学意义上的安全保证。实验证明,STEMS在降低能耗成本(21%)、减少碳排放(18%)的同时,将安全违规率从35.1%显著降至5.6%,并保持极低的舒适度不满意度(0.13)。
链接: https://arxiv.org/abs/2510.14112
作者: Huiliang Zhang,Di Wu,Arnaud Zinflou,Benoit Boulet
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Building energy management is essential for achieving carbon reduction goals, improving occupant comfort, and reducing energy costs. Coordinated building energy management faces critical challenges in exploiting spatial-temporal dependencies while ensuring operational safety across multi-building systems. Current multi-building energy systems face three key challenges: insufficient spatial-temporal information exploitation, lack of rigorous safety guarantees, and system complexity. This paper proposes Spatial-Temporal Enhanced Safe Multi-Agent Coordination (STEMS), a novel safety-constrained multi-agent reinforcement learning framework for coordinated building energy management. STEMS integrates two core components: (1) a spatial-temporal graph representation learning framework using a GCN-Transformer fusion architecture to capture inter-building relationships and temporal patterns, and (2) a safety-constrained multi-agent RL algorithm incorporating Control Barrier Functions to provide mathematical safety guarantees. Extensive experiments on real-world building datasets demonstrate STEMS’s superior performance over existing methods, showing that STEMS achieves 21% cost reduction, 18% emission reduction, and dramatically reduces safety violations from 35.1% to 5.6% while maintaining optimal comfort with only 0.13 discomfort proportion. The framework also demonstrates strong robustness during extreme weather conditions and maintains effectiveness across different building types.
zh
[AI-77] Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning
【速读】:该论文旨在解决机器学习中系统性、组合性外分布(out-of-distribution, OOD)泛化能力不足的问题,这是制约现代语言模型涌现推理能力的关键瓶颈。解决方案的核心在于提出并验证四种架构机制:(i) 输入自适应循环;(ii) 算法监督;(iii) 通过离散瓶颈实现锚定的潜在表示;以及 (iv) 显式的错误纠正机制。这些机制共同构建了一种面向Transformer网络的原生且可扩展的潜在空间推理方法,显著提升了模型在复杂计算图任务上的算法泛化能力,并通过机制解释分析揭示了其OOD泛化鲁棒性的内在原理。
链接: https://arxiv.org/abs/2510.14095
作者: Awni Altabaa,Siyu Chen,John Lafferty,Zhuoran Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Systematic, compositional generalization beyond the training distribution remains a core challenge in machine learning – and a critical bottleneck for the emergent reasoning abilities of modern language models. This work investigates out-of-distribution (OOD) generalization in Transformer networks using a GSM8K-style modular arithmetic on computational graphs task as a testbed. We introduce and explore a set of four architectural mechanisms aimed at enhancing OOD generalization: (i) input-adaptive recurrence; (ii) algorithmic supervision; (iii) anchored latent representations via a discrete bottleneck; and (iv) an explicit error-correction mechanism. Collectively, these mechanisms yield an architectural approach for native and scalable latent space reasoning in Transformer networks with robust algorithmic generalization capabilities. We complement these empirical results with a detailed mechanistic interpretability analysis that reveals how these mechanisms give rise to robust OOD generalization abilities.
zh
[AI-78] Every Language Model Has a Forgery-Resistant Signature
【速读】:该论文旨在解决语言模型输出的溯源与验证问题,即如何在不依赖模型权重或输入数据的情况下,识别生成文本的来源模型并确保其真实性。解决方案的关键在于发现并利用语言模型输出的一个隐含几何约束——即所有语言模型的对数概率(log-probs)分布位于高维椭球面上,这一特性构成了模型独有的“椭圆签名”(ellipse signature)。该签名具有不可伪造性(无参数访问时无法生成符合椭圆约束的 log-probs)、自然存在性(所有语言模型均具备)、自包含性(无需输入或完整权重即可检测)以及冗余性(每个 log-prob 输出均可独立验证),从而为语言模型输出提供了一种类似对称密钥认证机制的验证协议。
链接: https://arxiv.org/abs/2510.14086
作者: Matthew Finlayson,Xiang Ren,Swabha Swayamdipta
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The ubiquity of closed-weight language models with public-facing APIs has generated interest in forensic methods, both for extracting hidden model details (e.g., parameters) and for identifying models by their outputs. One successful approach to these goals has been to exploit the geometric constraints imposed by the language model architecture and parameters. In this work, we show that a lesser-known geometric constraint–namely, that language model outputs lie on the surface of a high-dimensional ellipse–functions as a signature for the model and can be used to identify the source model of a given output. This ellipse signature has unique properties that distinguish it from existing model-output association methods like language model fingerprints. In particular, the signature is hard to forge: without direct access to model parameters, it is practically infeasible to produce log-probabilities (logprobs) on the ellipse. Secondly, the signature is naturally occurring, since all language models have these elliptical constraints. Thirdly, the signature is self-contained, in that it is detectable without access to the model inputs or the full weights. Finally, the signature is compact and redundant, as it is independently detectable in each logprob output from the model. We evaluate a novel technique for extracting the ellipse from small models and discuss the practical hurdles that make it infeasible for production-scale models. Finally, we use ellipse signatures to propose a protocol for language model output verification, analogous to cryptographic symmetric-key message authentication systems.
zh
[AI-79] DiffOPF: Diffusion Solver for Optimal Power Flow
【速读】:该论文旨在解决传统深度学习最优潮流(Optimal Power Flow, OPF)求解器在处理系统参数(如导纳、拓扑结构)变化时的局限性问题,即现有方法为单值映射,无法有效捕捉给定负荷下因系统参数波动导致的多组可行调度点(dispatch setpoints)分布特性,除非将全部参数显式编码至特征空间,而这在实际中不可行。解决方案的关键在于提出一种基于扩散模型(diffusion-based)的OPF求解器——DiffOPF,其将OPF建模为条件采样问题,从运行历史中学习负荷与调度点的联合分布,并在给定负荷条件下输出调度点的边缘分布,从而支持生成统计可信的初始解样本,且在成本与约束满足之间具有良好的权衡性能。
链接: https://arxiv.org/abs/2510.14075
作者: Milad Hoseinpour,Vladimir Dvorkin
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
备注: 7 pages, 4 figures, 2 tables
Abstract:The optimal power flow (OPF) is a multi-valued, non-convex mapping from loads to dispatch setpoints. The variability of system parameters (e.g., admittances, topology) further contributes to the multiplicity of dispatch setpoints for a given load. Existing deep learning OPF solvers are single-valued and thus fail to capture the variability of system parameters unless fully represented in the feature space, which is prohibitive. To solve this problem, we introduce a diffusion-based OPF solver, termed \textitDiffOPF, that treats OPF as a conditional sampling problem. The solver learns the joint distribution of loads and dispatch setpoints from operational history, and returns the marginal dispatch distributions conditioned on loads. Unlike single-valued solvers, DiffOPF enables sampling statistically credible warm starts with favorable cost and constraint satisfaction trade-offs. We explore the sample complexity of DiffOPF to ensure the OPF solution within a prescribed distance from the optimization-based solution, and verify this experimentally on power system benchmarks.
zh
[AI-80] Exploratory Causal Inference in SAEnce
【速读】:该论文旨在解决随机对照试验(Randomized Controlled Trials, RCTs)中因果效应估计难以规模化的问题,传统方法依赖人工构建假设且分析成本高昂,容易局限于流行但不完整的假设。为突破这一限制,作者提出通过直接从数据中发现未知的因果效应来实现自动化因果推断。其解决方案的关键在于引入神经效应搜索(Neural Effect Search),该方法首先利用预训练基础模型(pretrained foundation models)将试验中的非结构化数据转化为有意义的表示,并通过稀疏自编码器(sparse autoencoder)进行解释;随后采用一种递归的分层策略(recursive procedure with progressive stratification)有效应对多重检验问题和效应纠缠(effects entanglement),从而在神经层面识别出显著的因果效应。该方法在半合成实验中验证了鲁棒性,并首次在真实生态学实验中实现了无监督的因果效应识别。
链接: https://arxiv.org/abs/2510.14073
作者: Tommaso Mencattini,Riccardo Cadei,Francesco Locatello
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Randomized Controlled Trials are one of the pillars of science; nevertheless, they rely on hand-crafted hypotheses and expensive analysis. Such constraints prevent causal effect estimation at scale, potentially anchoring on popular yet incomplete hypotheses. We propose to discover the unknown effects of a treatment directly from data. For this, we turn unstructured data from a trial into meaningful representations via pretrained foundation models and interpret them via a sparse autoencoder. However, discovering significant causal effects at the neural level is not trivial due to multiple-testing issues and effects entanglement. To address these challenges, we introduce Neural Effect Search, a novel recursive procedure solving both issues by progressive stratification. After assessing the robustness of our algorithm on semi-synthetic experiments, we showcase, in the context of experimental ecology, the first successful unsupervised causal effect identification on a real-world scientific trial.
zh
[AI-81] On the expressivity of sparse maxout networks
【速读】:该论文旨在解决稀疏Maxout网络(sparse maxout networks)的表达能力(expressivity)问题,即在固定输入连接数(indegree)约束下,网络的深度与宽度如何共同影响其函数逼近能力。解决方案的关键在于建立可计算函数与一类虚拟多面体(virtual polytopes)之间的对偶关系,通过分析该多面体的维度上限获得紧致的表达能力边界,并据此构造出深度层次结构:证明当网络深度不足时,即使宽度增加也无法弥补因稀疏性导致的表达能力限制,从而揭示了深度在稀疏网络中的不可替代作用。
链接: https://arxiv.org/abs/2510.14068
作者: Moritz Grillo,Tobias Hofmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Combinatorics (math.CO)
备注:
Abstract:We study the expressivity of sparse maxout networks, where each neuron takes a fixed number of inputs from the previous layer and employs a, possibly multi-argument, maxout activation. This setting captures key characteristics of convolutional or graph neural networks. We establish a duality between functions computable by such networks and a class of virtual polytopes, linking their geometry to questions of network expressivity. In particular, we derive a tight bound on the dimension of the associated polytopes, which serves as the central tool for our analysis. Building on this, we construct a sequence of depth hierarchies. While sufficiently deep sparse maxout networks are universal, we prove that if the required depth is not reached, width alone cannot compensate for the sparsity of a fixed indegree constraint.
zh
[AI-82] Position: Require Frontier AI Labs To Release Small “Analog” Models
【速读】:该论文试图解决当前前沿人工智能(Artificial Intelligence, AI)模型监管中面临的“安全-创新权衡”问题,即在确保AI系统安全性的同时避免抑制技术创新。其解决方案的关键在于强制大型AI实验室发布与自身最先进模型同源训练并经过知识蒸馏(distillation)的小型开放模型(analog models),这些模型作为公共代理工具,可支持更广泛的研究群体开展安全性验证、可解释性研究和算法透明度分析,而无需披露原始大模型的全部参数或架构。研究表明,基于此类小型模型开发的安全与可解释性方法能有效迁移至前沿规模系统,从而以极低的额外成本显著提升整体AI安全水平,并推动创新进程。
链接: https://arxiv.org/abs/2510.14053
作者: Shriyash Upadhyay,Chaithanya Bandi,Narmeen Oozeer,Philip Quirke
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent proposals for regulating frontier AI models have sparked concerns about the cost of safety regulation, and most such regulations have been shelved due to the safety-innovation tradeoff. This paper argues for an alternative regulatory approach that ensures AI safety while actively promoting innovation: mandating that large AI laboratories release small, openly accessible analog models (scaled-down versions) trained similarly to and distilled from their largest proprietary models. Analog models serve as public proxies, allowing broad participation in safety verification, interpretability research, and algorithmic transparency without forcing labs to disclose their full-scale models. Recent research demonstrates that safety and interpretability methods developed using these smaller models generalize effectively to frontier-scale systems. By enabling the wider research community to directly investigate and innovate upon accessible analogs, our policy substantially reduces the regulatory burden and accelerates safety advancements. This mandate promises minimal additional costs, leveraging reusable resources like data and infrastructure, while significantly contributing to the public good. Our hope is not only that this policy be adopted, but that it illustrates a broader principle supporting fundamental research in machine learning: deeper understanding of models relaxes the safety-innovation tradeoff and lets us have more of both. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2510.14053 [cs.AI] (or arXiv:2510.14053v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.14053 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-83] Cyber-Resilient System Identification for Power Grid through Bayesian Integration
【速读】:该论文旨在解决电力系统在面对随机坏数据和针对性虚假数据注入攻击(False Data Injection Attack, FDIA)时,状态估计与拓扑识别准确性下降的问题。现有基于快照的系统辨识方法虽能处理随机异常,但难以检测隐蔽的、交互式的目标攻击,从而导致严重误估。解决方案的关键在于通过贝叶斯集成(Bayesian Integration)将快照法与基于距离的时间序列模型相结合,利用历史数据中因拓扑变化等引起的分布差异,从历史正常行为中提取先验信息,并以贝叶斯方式融合进系统辨识过程,从而增强对目标虚假数据的鲁棒性。
链接: https://arxiv.org/abs/2510.14043
作者: Shimiao Li,Guannan Qu,Bryan Hooi,Vyas Sekar,Soummya Kar,Larry Pileggi
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Power grids increasingly need real-time situational awareness under the ever-evolving cyberthreat landscape. Advances in snapshot-based system identification approaches have enabled accurately estimating states and topology from a snapshot of measurement data, under random bad data and topology errors. However, modern interactive, targeted false data can stay undetectable to these methods, and significantly compromise estimation accuracy. This work advances system identification that combines snapshot-based method with time-series model via Bayesian Integration, to advance cyber resiliency against both random and targeted false data. Using a distance-based time-series model, this work can leverage historical data of different distributions induced by changes in grid topology and other settings. The normal system behavior captured from historical data is integrated into system identification through a Bayesian treatment, to make solutions robust to targeted false data. We experiment on mixed random anomalies (bad data, topology error) and targeted false data injection attack (FDIA) to demonstrate our method’s 1) cyber resilience: achieving over 70% reduction in estimation error under FDIA; 2) anomalous data identification: being able to alarm and locate anomalous data; 3) almost linear scalability: achieving comparable speed with the snapshot-based baseline, both taking 1min per time tick on the large 2,383-bus system using a laptop CPU.
zh
[AI-84] One Bug Hundreds Behind: LLM s for Large-Scale Bug Discovery
【速读】:该论文旨在解决软件中广泛存在的重复模式缺陷(Recurring Pattern Bugs, RPBs)问题,即由于同一根本原因导致的多个代码片段中反复出现且未被修复的漏洞。这类缺陷不仅增加调试成本,还可能因漏洞报告暴露攻击模式而扩大程序的攻击面。解决方案的关键在于利用BugStone系统,该系统结合LLVM静态分析与大语言模型(Large Language Model, LLM),通过识别已修复实例中的共性错误模式(如特定API误用),在全程序范围内扫描相似模式以定位潜在的新漏洞。实验表明,该方法在Linux内核中发现了超过2.2万个潜在问题,并经人工验证确认了246个有效漏洞,展现了高精度(92.2%)和良好的配对准确率(79.1%)。
链接: https://arxiv.org/abs/2510.14036
作者: Qiushi Wu,Yue Xiao,Dhilung Kirat,Kevin Eykholt,Jiyong Jang,Douglas Lee Schales
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Fixing bugs in large programs is a challenging task that demands substantial time and effort. Once a bug is found, it is reported to the project maintainers, who work with the reporter to fix it and eventually close the issue. However, across the program, there are often similar code segments, which may also contain the bug, but were missed during discovery. Finding and fixing each recurring bug instance individually is labor intensive. Even more concerning, bug reports can inadvertently widen the attack surface as they provide attackers with an exploitable pattern that may be unresolved in other parts of the program. In this paper, we explore these Recurring Pattern Bugs (RPBs) that appear repeatedly across various code segments of a program or even in different programs, stemming from a same root cause, but are unresolved. Our investigation reveals that RPBs are widespread and can significantly compromise the security of software programs. This paper introduces BugStone, a program analysis system empowered by LLVM and a Large Language Model (LLM). The key observation is that many RPBs have one patched instance, which can be leveraged to identify a consistent error pattern, such as a specific API misuse. By examining the entire program for this pattern, it is possible to identify similar sections of code that may be vulnerable. Starting with 135 unique RPBs, BugStone identified more than 22K new potential issues in the Linux kernel. Manual analysis of 400 of these findings confirmed that 246 were valid. We also created a dataset from over 1.9K security bugs reported by 23 recent top-tier conference works. We manually annotate the dataset, identify 80 recurring patterns and 850 corresponding fixes. Even with a cost-efficient model choice, BugStone achieved 92.2% precision and 79.1% pairwise accuracy on the dataset. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.14036 [cs.SE] (or arXiv:2510.14036v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2510.14036 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-85] GammaZero: Learning To Guide POMDP Belief Space Search With Graph Representations
【速读】:该论文旨在解决部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDPs)中规划学习的可扩展性与泛化能力问题。现有方法通常依赖领域特定的神经网络架构,难以在不同规模的问题间迁移。解决方案的关键在于提出一种以动作为中心的图表示框架(action-centric graph representation framework),将信念状态系统性地转化为结构化的动作导向图,从而使得在小规模问题上学到的结构模式可以迁移到更大规模实例中;同时结合图神经网络与解码器架构从专家示范中学习价值函数和策略,并将其作为启发式信息引导蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)应用于大规模问题,实现了零样本(zero-shot)跨规模泛化,且在保持解质量的同时显著降低搜索开销。
链接: https://arxiv.org/abs/2510.14035
作者: Rajesh Mangannavar,Prasad Tadepalli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages content. 2 pages references
Abstract:We introduce an action-centric graph representation framework for learning to guide planning in Partially Observable Markov Decision Processes (POMDPs). Unlike existing approaches that require domain-specific neural architectures and struggle with scalability, GammaZero leverages a unified graph-based belief representation that enables generalization across problem sizes within a domain. Our key insight is that belief states can be systematically transformed into action-centric graphs where structural patterns learned on small problems transfer to larger instances. We employ a graph neural network with a decoder architecture to learn value functions and policies from expert demonstrations on computationally tractable problems, then apply these learned heuristics to guide Monte Carlo tree search on larger problems. Experimental results on standard POMDP benchmarks demonstrate that GammaZero achieves comparable performance to BetaZero when trained and tested on the same-sized problems, while uniquely enabling zero-shot generalization to problems 2-4 times larger than those seen during training, maintaining solution quality with reduced search requirements.
zh
[AI-86] Context-Selective State Space Models: Feedback is All You Need
【速读】:该论文旨在解决Transformer模型在处理长序列时存在的二次时间复杂度问题以及难以建模长程依赖关系的挑战。为此,作者提出了一种新颖的时间变异性状态空间模型(State Space Model, SSM)——COFFEE(COntext From FEEdback),其核心创新在于引入状态反馈机制,使选择性(selectivity)由内部状态而非当前输入决定,从而实现基于累积上下文的动态调节能力。这一设计显著增强了模型对长程依赖的捕捉能力,同时通过参数化优化去除了S6模块中的冗余结构,提升了模型的紧凑性和可训练性。实验表明,COFFEE在诱导头任务和MNIST分类任务上均以远少的参数量和训练样本达到优于S6的表现,验证了状态反馈作为构建高效可扩展序列模型的关键机制的有效性。
链接: https://arxiv.org/abs/2510.14027
作者: Riccardo Zattra,Giacomo Baggio,Umberto Casti,Augusto Ferrante,Francesco Ticozzi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformers, powered by the attention mechanism, are the backbone of most foundation models, yet they suffer from quadratic complexity and difficulties in dealing with long-range dependencies in the input sequence. Recent work has shown that state space models (SSMs) provide an efficient alternative, with the S6 module at the core of the Mamba architecture achieving state-of-the-art results on long-sequence benchmarks. In this paper, we introduce the COFFEE (COntext From FEEdback) model, a novel time-varying SSM that incorporates state feedback to enable context-dependent selectivity, while still allowing for parallel implementation. Whereas the selectivity mechanism of S6 only depends on the current input, COFFEE computes it from the internal state, which serves as a compact representation of the sequence history. This shift allows the model to regulate its dynamics based on accumulated context, improving its ability to capture long-range dependencies. In addition to state feedback, we employ an efficient model parametrization that removes redundancies present in S6 and leads to a more compact and trainable formulation. On the induction head task, COFFEE achieves near-perfect accuracy with two orders of magnitude fewer parameters and training sequences compared to S6. On MNIST, COFFEE largely outperforms S6 within the same architecture, reaching 97% accuracy with only 3585 parameters. These results showcase the role of state feedback as a key mechanism for building scalable and efficient sequence models.
zh
[AI-87] Conditional Clifford-Steerable CNNs with Complete Kernel Basis for PDE Modeling
【速读】:该论文旨在解决Clifford-Steerable CNNs(CSCNNs)中核基(kernel basis)不完整导致的模型表达能力受限问题,从而限制了其在复杂物理系统建模中的性能。解决方案的关键在于提出条件型Clifford-Steerable核(Conditional Clifford-Steerable Kernels),通过将输入特征场计算出的等变表示引入核函数中,使核成为输入依赖的等变结构;同时推导出此类输入相关核的等变约束,并利用隐式参数化高效求解,显著提升了模型在流体动力学和相对论电动力学等偏微分方程(PDE)预测任务中的表达能力和泛化性能。
链接: https://arxiv.org/abs/2510.14007
作者: Bálint László Szarvas(1),Maksim Zhdanov(1 and 2) ((1) University of Amsterdam, (2) AMLab)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Clifford-Steerable CNNs (CSCNNs) provide a unified framework that allows incorporating equivariance to arbitrary pseudo-Euclidean groups, including isometries of Euclidean space and Minkowski spacetime. In this work, we demonstrate that the kernel basis of CSCNNs is not complete, thus limiting the model expressivity. To address this issue, we propose Conditional Clifford-Steerable Kernels, which augment the kernels with equivariant representations computed from the input feature field. We derive the equivariance constraint for these input-dependent kernels and show how it can be solved efficiently via implicit parameterization. We empirically demonstrate an improved expressivity of the resulting framework on multiple PDE forecasting tasks, including fluid dynamics and relativistic electrodynamics, where our method consistently outperforms baseline methods.
zh
[AI-88] REAP the Experts: Why Pruning Prevails for One-Shot MoE compression
【速读】:该论文旨在解决稀疏激活的专家混合模型(Sparsely-Activated Mixture-of-Experts, SMoE)在生成式任务中因参数量庞大而导致的内存开销问题,同时探索更有效的专家压缩策略。现有研究倾向于通过专家合并(expert merging)来提升判别性任务性能,但本文发现该方法在生成任务中存在固有缺陷——合并会导致“功能子空间坍缩”(functional subspace collapse),即路由器(router)对专家的输入依赖性控制能力丧失,从而引入不可消除的误差。解决方案的关键在于提出一种新的剪枝准则:Router-weighted Expert Activation Pruning (REAP),该方法综合考虑路由器门控值(gate-values)与专家激活范数(activation norms),实现更精准的专家剪枝,在保持生成质量的同时显著降低模型复杂度,尤其在50%压缩率下表现优异,并在代码生成和工具调用等任务上实现了近乎无损的压缩效果。
链接: https://arxiv.org/abs/2510.13999
作者: Mike Lasby,Ivan Lazarevich,Nish Sinnadurai,Sean Lie,Yani Ioannou,Vithursan Thangarasa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 8 figures, 7 tables
Abstract:Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we demonstrate that expert pruning is a superior strategy for generative tasks. We prove that merging introduces an irreducible error by causing a “functional subspace collapse”, due to the loss of the router’s independent, input-dependent control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
zh
[AI-89] Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否会在面对经典认知科学范式——因果判断任务时,表现出类似人类的因果错觉(illusion of causality),即在缺乏充分证据的情况下错误地推断出变量间的因果关系。解决方案的关键在于构建了一个包含1000个零因果情境(null contingency scenarios)的医疗领域数据集,并通过提示LLMs评估潜在原因的有效性,从而系统性地检验其因果推理能力。研究发现所有测试模型均表现出对虚假因果关系的系统性误判,表明它们高度易受因果错觉影响,支持了“LLMs仅能复现因果语言而无真实因果理解”的假设,凸显其在需精准因果推理的应用场景中存在显著风险。
链接: https://arxiv.org/abs/2510.13985
作者: María Victoria Carro,Denise Alejandra Mester,Francisca Gauna Selasco,Giovanni Franco Gabriel Marraffini,Mario Alejandro Leiva,Gerardo I. Simari,María Vanina Martinez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Causal learning is the cognitive process of developing the capability of making causal inferences based on available information, often guided by normative principles. This process is prone to errors and biases, such as the illusion of causality, in which people perceive a causal relationship between two variables despite lacking supporting evidence. This cognitive bias has been proposed to underlie many societal problems, including social prejudice, stereotype formation, misinformation, and superstitious thinking. In this work, we examine whether large language models are prone to developing causal illusions when faced with a classic cognitive science paradigm: the contingency judgment task. To investigate this, we constructed a dataset of 1,000 null contingency scenarios (in which the available information is not sufficient to establish a causal relationship between variables) within medical contexts and prompted LLMs to evaluate the effectiveness of potential causes. Our findings show that all evaluated models systematically inferred unwarranted causal relationships, revealing a strong susceptibility to the illusion of causality. While there is ongoing debate about whether LLMs genuinely understand causality or merely reproduce causal language without true comprehension, our findings support the latter hypothesis and raise concerns about the use of language models in domains where accurate causal reasoning is essential for informed decision-making.
zh
[AI-90] Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM -Based Multi-Agent Simulations
【速读】:该论文试图解决当前多智能体系统(multi-agent systems)在社会模拟中普遍存在的静态性与局限性问题,即现有仿真环境通常受限于预定义任务、有限动态性和刚性评估标准,难以真实反映现实社会的开放性与复杂性。其解决方案的关键在于推动从静态基准向开放性(open-endedness)、持续协同演化(continuous co-evolution)和鲁棒社会对齐(socially aligned)的多智能体仿真范式转变,强调通过新型架构设计平衡稳定性与多样性,并建立更灵活的评估机制以捕捉不可预测的行为模式,从而构建更具适应性和社会感知能力的AI生态系统。
链接: https://arxiv.org/abs/2510.13982
作者: Jinkun Chen,Sher Badshah,Xuemin Yu,Sijia Han,Jiechao Gao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:What if artificial agents could not just communicate, but also evolve, adapt, and reshape their worlds in ways we cannot fully predict? With llm now powering multi-agent systems and social simulations, we are witnessing new possibilities for modeling open-ended, ever-changing environments. Yet, most current simulations remain constrained within static sandboxes, characterized by predefined tasks, limited dynamics, and rigid evaluation criteria. These limitations prevent them from capturing the complexity of real-world societies. In this paper, we argue that static, task-specific benchmarks are fundamentally inadequate and must be rethought. We critically review emerging architectures that blend llm with multi-agent dynamics, highlight key hurdles such as balancing stability and diversity, evaluating unexpected behaviors, and scaling to greater complexity, and introduce a fresh taxonomy for this rapidly evolving field. Finally, we present a research roadmap centered on open-endedness, continuous co-evolution, and the development of resilient, socially aligned AI ecosystems. \textbfWe call on the community to move beyond static paradigms and help shape the next generation of adaptive, socially-aware multi-agent simulations.
zh
[AI-91] Benefits and Limitations of Communication in Multi-Agent Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理复杂任务和长上下文时性能下降的问题。其解决方案的关键在于提出一个理论框架,用于分析多智能体系统(multi-agent systems)的表达能力,并基于此框架对状态追踪、回忆机制和k跳推理三类算法家族进行建模与量化分析。通过推导出完成任务所需的最小智能体数量、智能体间通信的结构与量级以及随着问题规模扩展可实现的速度提升上限,论文揭示了通信在特定条件下具有理论上的必要性,明确了智能体数量与带宽之间的权衡关系,并识别了资源受限时的内在瓶颈,从而为设计可扩展的多智能体推理系统提供了原则性指导。
链接: https://arxiv.org/abs/2510.13903
作者: Michael Rizvi-Martel,Satwik Bhattamishra,Neil Rathi,Guillaume Rabusseau,Michael Hahn
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 34 pages, 14 figures
Abstract:Chain-of-thought prompting has popularized step-by-step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi-agent paradigms offer a promising near-term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi-agent systems. We apply our framework to three algorithmic families: state tracking, recall, and k -hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi-agent reasoning systems.
zh
[AI-92] K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding
【速读】:该论文旨在解决长视频理解中因上下文窗口限制和计算成本导致的帧采样效率低下问题,尤其是传统均匀采样易造成信息丢失,而现有关键帧选择方法(如文本-帧检索或基于强化学习的帧优化)往往产生稀疏且时间上不连续的帧,忽视场景连贯性且缺乏多尺度灵活性。解决方案的关键在于提出一种全新的“K-frames”范式,其核心创新是不再直接选择单帧,而是预测语义一致、与查询相关的片段(clip),从而实现任意数量(any-k)的关键帧选择,同时保障时间连续性和任务相关性;该方法通过构建包含20万条带查询条件的视频亮点数据集PeakClips,并采用三阶段渐进式课程学习策略(两阶段监督微调用于时序定位与关键片段感知,一阶段强化学习直接优化面向下游任务的场景驱动预测策略),实现了无需额外标注即可灵活适配不同用户预算的高效关键帧选择。
链接: https://arxiv.org/abs/2510.13891
作者: Yifeng Yao,Yike Yun,Jing Wang,Huishuai Zhang,Dongyan Zhao,Ke Tian,Zhihao Wang,Minghui Qiu,Tao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in image understanding, but long-video are constrained by context windows and computational cost. Uniform frame sampling often leads to substantial information loss. Meanwhile existing keyframe selection methods such as text-frame retrieval or RL-based frame optimization typically yield sparse and temporally disjointed frames, overlooking scene continuity and lacking flexibility for multi-scale frame selection. To address these limitations, we introduce K-frames, a novel paradigm for scene-driven keyframe selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips, which enables any-k keyframes selection to meet diverse user budgets. To achieve this approach, we first introduce PeakClips, a dataset of 200K video highlights conditioned by query. Building on this dataset, K-frames learns clip2frame selection using a three-stage progressive curriculum. It involves two Supervised Fine-Tuning stages for temporal grounding and key-clip perception, followed by a Reinforcement Learning stage that directly optimizes the scene-driven prediction policy for downstream task without further annotations. Extensive experiments on major long-video understanding benchmarks demonstrate that K-frames provides an effective, interpretable, and plug-and-play solution for keyframe selection at various scales. Our dataset and model will be available.
zh
[AI-93] Joint Discriminative-Generative Modeling via Dual Adversarial Training
【速读】:该论文旨在解决如何在单一框架中同时实现鲁棒分类与高保真生成建模的挑战,尤其针对联合能量模型(Joint Energy-Based Models, JEM)中存在的训练不稳定性和生成样本质量差的问题。其核心解决方案在于引入对抗训练(Adversarial Training, AT)原则,构建一个稳定且高效的联合学习框架:首先用AT替代SGLD-based训练方式,通过BCE损失区分真实数据与PGD生成的对比样本来优化能量函数;其次,在判别部分采用协同对抗训练提升分类鲁棒性并省去显式的梯度惩罚项;最后提出两阶段训练策略以缓解批归一化(Batch Normalization)与EBM训练之间的不兼容问题。该方法显著提升了现有混合模型的对抗鲁棒性,并在ImageNet等复杂数据集上实现了接近扩散模型的生成质量,首次证明基于MCMC的能量模型可在高分辨率图像上实现高质量生成。
链接: https://arxiv.org/abs/2510.13872
作者: Xuwang Yin,Claire Zhang,Julie Steele,Nir Shavit,Tony T. Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review. Code available at this https URL
Abstract:Simultaneously achieving robust classification and high-fidelity generative modeling within a single framework presents a significant challenge. Hybrid approaches, such as Joint Energy-Based Models (JEM), interpret classifiers as EBMs but are often limited by the instability and poor sample quality inherent in SGLD-based training. We address these limitations by proposing a novel training framework that integrates adversarial training (AT) principles for both discriminative robustness and stable generative learning. The proposed method introduces three key innovations: (1) the replacement of SGLD-based JEM learning with a stable, AT-based approach that optimizes the energy function by discriminating between real data and PGD-generated contrastive samples using the BCE loss; (2) synergistic adversarial training for the discriminative component that enhances classification robustness while eliminating the need for explicit gradient penalties; and (3) a two-stage training procedure to resolve the incompatibility between batch normalization and EBM training. Experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that our method substantially improves adversarial robustness over existing hybrid models while maintaining competitive generative performance. On ImageNet, when optimized for generative modeling, our model’s generative fidelity surpasses that of BigGAN and approaches diffusion models, representing the first MCMC-based EBM approach to achieve high-quality generation on complex, high-resolution datasets. Our approach addresses key stability issues that have limited JEM scaling and demonstrates that adversarial training can serve as an effective foundation for unified frameworks capable of generating and robustly classifying visual data.
zh
[AI-94] CoLoR-GAN: Continual Few-Shot Learning with Low-Rank Adaptation in Generative Adversarial Networks
【速读】:该论文旨在解决生成式对抗网络(Generative Adversarial Networks, GANs)在持续学习(Continual Learning, CL)场景下,尤其是从少量样本(Few-Shot, FS)中学习时面临的灾难性遗忘(catastrophic forgetting)问题。现有最先进方法如LFS-GAN在每次训练迭代中引入显著数量的新权重,长期累积导致参数膨胀。为此,论文提出CoLoR-GAN框架,其核心创新在于利用低秩张量(low-rank tensors)实现模型对目标任务的高效适应,从而大幅减少所需参数量;进一步地,通过引入“LoRA in LoRA”(LLoRA)技术优化卷积层适配器结构,并基于实证研究提供LoRA超参数选择指南,显著提升了模型效率与性能,在多个基准持续学习与少样本任务上达到当前最优(SOTA)效果,同时资源消耗显著降低。
链接: https://arxiv.org/abs/2510.13869
作者: Munsif Ali,Leonardo Rossi,Massimo Bertozzi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual learning (CL) in the context of Generative Adversarial Networks (GANs) remains a challenging problem, particularly when it comes to learn from a few-shot (FS) samples without catastrophic forgetting. Current most effective state-of-the-art (SOTA) methods, like LFS-GAN, introduce a non-negligible quantity of new weights at each training iteration, which would become significant when considering the long term. For this reason, this paper introduces \textcolorred\textbf\underlinecontinual few-sh\textcolorred\textbf\underlineot learning with \textcolorred\textbf\underlinelow-\textcolorred\textbf\underlinerank adaptation in GANs named CoLoR-GAN, a framework designed to handle both FS and CL together, leveraging low-rank tensors to efficiently adapt the model to target tasks while reducing even more the number of parameters required. Applying a vanilla LoRA implementation already permitted us to obtain pretty good results. In order to optimize even further the size of the adapters, we challenged LoRA limits introducing a LoRA in LoRA (LLoRA) technique for convolutional layers. Finally, aware of the criticality linked to the choice of the hyperparameters of LoRA, we provide an empirical study to easily find the best ones. We demonstrate the effectiveness of CoLoR-GAN through experiments on several benchmark CL and FS tasks and show that our model is efficient, reaching SOTA performance but with a number of resources enormously reduced. Source code is available on \hrefthis https URLGithub.
zh
[AI-95] Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning NEURIPS2025
【速读】:该论文旨在解决深度神经网络模型在跨域场景下泛化能力不足的问题,其核心假设是:神经网络在深层特征中将任务相关的语义信息编码于高频成分,而将特定域的偏差(domain-specific biases)存储于低频成分。解决方案的关键在于提出一种名为“Deep Edge Filter”的高通滤波方法,通过从原始特征中减去低通滤波后的输出,从而分离出具有泛化能力的高频特征表示,同时保持模型架构完整性。实验表明,该方法在视觉、文本、3D 和音频等多种模态与架构上均能稳定提升性能,且分析验证了其可诱导特征稀疏化并有效提取高频成分,支持了核心假设。
链接: https://arxiv.org/abs/2510.13865
作者: Dongkwan Lee,Junhoo Lee,Nojun Kwak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS2025
Abstract:We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at this https URL.
zh
[AI-96] Benchmarking Correctness and Security in Multi-Turn Code Generation
【速读】:该论文旨在解决现有AI代码辅助工具评估基准普遍局限于单轮任务,无法反映真实软件开发中迭代式编码过程的问题。其核心挑战在于如何系统性地在多轮交互场景下同时衡量生成代码的正确性和安全性。解决方案的关键在于提出首个面向多轮编程场景的基准测试框架MT-Sec,通过构建合成数据流水线将已有单轮任务转化为语义对齐的多轮交互序列,在保留原始测试套件的基础上模拟真实开发流程的复杂性;实验表明,即使是最先进的模型在多轮场景下“正确且安全”的输出比例也下降20-27%,凸显了当前模型在持续迭代中的稳定性与安全性不足,从而推动更贴近实际工作流的评估体系发展。
链接: https://arxiv.org/abs/2510.13859
作者: Ruchit Rawal,Jeffrey Yang Fan Chiang,Chihao Shen,Jeffery Siyuan Tian,Aastha Mahajan,Tom Goldstein,Yizheng Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:AI coding assistants powered by large language models (LLMs) have transformed software development, significantly boosting productivity. While existing benchmarks evaluate the correctness and security of LLM-generated code, they are typically limited to single-turn tasks that do not reflect the iterative nature of real-world development. We introduce MT-Sec, the first benchmark to systematically evaluate both correctness and security in multi-turn coding scenarios. We construct this using a synthetic data pipeline that transforms existing single-turn tasks into semantically aligned multi-turn interaction sequences, allowing reuse of original test suites while modeling the complexity of real-world coding processes. We evaluate 32 open- and closed-source models, and three agent-scaffolding on MT-Sec and observe a consistent 20-27% drop in “correct and secure” outputs from single-turn to multi-turn settings – even among state-of-the-art models. Beyond full-program generation, we also evaluate models on multi-turn code-diff generation – an unexplored yet practically relevant setting – and find that models perform worse here, with increased rates of functionally incorrect and insecure outputs. Finally, we find that while agent scaffoldings boost single-turn code generation performance, they are not quite as effective in multi-turn evaluations. Together, these findings highlight the need for benchmarks that jointly evaluate correctness and security in multi-turn, real-world coding workflows.
zh
[AI-97] Decision Oriented Technique (DOTechnique): Finding Model Validity Through Decision-Maker Context
【速读】:该论文旨在解决模型有效性评估中缺乏明确有效性边界时,如何准确识别模型有效区域的问题。传统方法依赖预定义的有效性框架,但在实际应用中往往难以获得或不足以支撑决策需求。其解决方案的关键在于提出一种面向决策的模型有效性评估方法——决策一致性技术(Decision Oriented Technique, DOTechnique),该方法不以输出相似性为依据,而是通过判断代理模型与高保真模型是否产生一致的决策结果来确定有效性区域,从而在无显式边界条件下高效识别模型的有效范围。该技术进一步融合领域约束和符号推理机制以缩小搜索空间,提升计算效率,并通过高速公路变道系统的案例验证了其在实际场景中的有效性。
链接: https://arxiv.org/abs/2510.13858
作者: Raheleh Biglari,Joachim Denil
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:Model validity is as critical as the model itself, especially when guiding decision-making processes. Traditional approaches often rely on predefined validity frames, which may not always be available or sufficient. This paper introduces the Decision Oriented Technique (DOTechnique), a novel method for determining model validity based on decision consistency rather than output similarity. By evaluating whether surrogate models lead to equivalent decisions compared to high-fidelity models, DOTechnique enables efficient identification of validity regions, even in the absence of explicit validity boundaries. The approach integrates domain constraints and symbolic reasoning to narrow the search space, enhancing computational efficiency. A highway lane change system serves as a motivating example, demonstrating how DOTechnique can uncover the validity region of a simulation model. The results highlight the potential of the technique to support finding model validity through decision-maker context.
zh
[AI-98] From Craft to Constitution: A Governance-First Paradigm for Principled Agent Engineering
【速读】:该论文旨在解决当前生成式 AI(Generative AI)代理在从原型向生产环境过渡时面临的“工艺危机”(crisis of craft),即代理系统普遍存在脆弱性、不可预测性和缺乏可信度的问题,这主要源于将本质上具有概率特性的大语言模型(Large Language Models, LLMs)与传统软件工程中确定性的思维模式相耦合所导致的根本范式错位。解决方案的关键在于提出一种以治理为先(governance-first)的新范式,并通过一个称为ArbiterOS的正式架构来实现原则化的代理工程,从而系统性地管理代理的行为与决策过程,提升其在关键任务场景下的可靠性与可控性。
链接: https://arxiv.org/abs/2510.13857
作者: Qiang Xu,Xiangyu Wen,Changran Xu,Zeju Li,Jianyuan Zhong
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of powerful Large Language Models (LLMs) has ushered in an Age of the Agent,'' enabling autonomous systems to tackle complex goals. However, the transition from prototype to production is hindered by a pervasive
crisis of craft,‘’ resulting in agents that are brittle, unpredictable, and ultimately untrustworthy in mission-critical applications. This paper argues this crisis stems from a fundamental paradigm mismatch – attempting to command inherently probabilistic processors with the deterministic mental models of traditional software engineering. To solve this crisis, we introduce a governance-first paradigm for principled agent engineering, embodied in a formal architecture we call ArbiterOS.
zh
[AI-99] Information flow in multilayer perceptrons: an in-depth analysis
【速读】:该论文旨在解决多层感知机(Multilayer Perceptron, MLP)中信息流动机制的解析问题,特别是在监督学习框架下如何理解信息在各层间的传递与处理。其核心解决方案是引入“信息矩阵”(information matrix)这一形式化框架,用以刻画网络各层的信息处理特性,并揭示优化策略的内在原理。关键发现包括:提出一种参数化的优化策略、发现信息瓶颈框架中的优化方法与信息矩阵推导出的策略高度相似,以及阐明MLP本质上是一种“适配器”(adaptor),其作用是根据任务目标对输入信息进行有效转换。
链接: https://arxiv.org/abs/2510.13846
作者: Giuliano Armano
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 30 pages, 8 figures
Abstract:Analysing how information flows along the layers of a multilayer perceptron is a topic of paramount importance in the field of artificial neural networks. After framing the problem from the point of view of information theory, in this position article a specific investigation is conducted on the way information is processed, with particular reference to the requirements imposed by supervised learning. To this end, the concept of information matrix is devised and then used as formal framework for understanding the aetiology of optimisation strategies and for studying the information flow. The underlying research for this article has also produced several key outcomes: i) the definition of a parametric optimisation strategy, ii) the finding that the optimisation strategy proposed in the information bottleneck framework shares strong similarities with the one derived from the information matrix, and iii) the insight that a multilayer perceptron serves as a kind of “adaptor”, meant to process the input according to the given objective.
zh
[AI-100] A2AS: Agent ic AI Runtime Security and Self-Defense
【速读】:该论文旨在解决AI代理(AI agents)和大语言模型(Large Language Models, LLMs)驱动的应用程序在安全性和可信性方面的关键挑战,包括行为不可控、输入污染、上下文窗口完整性缺失以及缺乏可定制的安全策略。解决方案的核心是提出A2AS(Agent-to-Agent Security)框架,其基础为BASIC安全模型:通过行为证书(Behavior certificates)实现行为强制执行,通过认证提示(Authenticated prompts)保障上下文窗口完整性,通过安全边界(Security boundaries)隔离不可信输入,通过上下文防御(In-context defenses)提升模型推理安全性,并通过编码化策略(Codified policies)支持应用特定规则。该框架无需引入延迟开销、外部依赖、架构变更、模型重训练或操作复杂性,从而实现纵深防御(defense-in-depth)策略,推动建立AI安全行业标准。
链接: https://arxiv.org/abs/2510.13825
作者: Eugene Neelou,Ivan Novikov,Max Moroz,Om Narayan,Tiffany Saade,Mika Ayenson,Ilya Kabanov,Jen Ozmen,Edward Lee,Vineeth Sai Narajala,Emmanuel Guilherme Junior,Ken Huang,Huseyin Gulsin,Jason Ross,Marat Vyshegorodtsev,Adelin Travers,Idan Habler,Rahul Jadav
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The A2AS framework is introduced as a security layer for AI agents and LLM-powered applications, similar to how HTTPS secures HTTP. A2AS enforces certified behavior, activates model self-defense, and ensures context window integrity. It defines security boundaries, authenticates prompts, applies security rules and custom policies, and controls agentic behavior, enabling a defense-in-depth strategy. The A2AS framework avoids latency overhead, external dependencies, architectural changes, model retraining, and operational complexity. The BASIC security model is introduced as the A2AS foundation: (B) Behavior certificates enable behavior enforcement, (A) Authenticated prompts enable context window integrity, (S) Security boundaries enable untrusted input isolation, (I) In-context defenses enable secure model reasoning, © Codified policies enable application-specific rules. This first paper in the series introduces the BASIC security model and the A2AS framework, exploring their potential toward establishing the A2AS industry standard.
zh
[AI-101] Leverag ing Wireless Sensor Networks for Real-Time Monitoring and Control of Industrial Environments
【速读】:该论文旨在解决传统工业监控中依赖有线通信系统所导致的灵活性差、响应滞后及安全隐患等问题,尤其针对2020至2024年间全球频发的工业火灾风险,提出了一种基于物联网(Internet of Things, IoT)与无线传感器网络(Wireless Sensor Network, WSN)融合的智能监测与远程控制系统。其解决方案的关键在于利用NRF射频收发模块构建高可靠性的WSN,通过Arduino微控制器实现多参数(如温度、湿度、土壤湿度和火灾探测)的实时采集与本地显示,并支持通过互联网远程控制直流电机速度以调节相关工艺参数;同时具备紧急情况下自动触发消防设备的能力,从而显著提升工业过程的自动化水平、安全性与应急响应效率。
链接: https://arxiv.org/abs/2510.13820
作者: Muhammad Junaid Asif,Shazia Saqib,Rana Fayyaz Ahmad,Hamza Khan
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:This research proposes an extensive technique for monitoring and controlling the industrial parameters using Internet of Things (IoT) technology based on wireless communication. We proposed a system based on NRF transceivers to establish a strong Wireless Sensor Network (WSN), enabling transfer of real-time data from multiple sensors to a central setup that is driven by ARDUINO microcontrollers. Different key parameters, crucial for industrial setup such as temperature, humidity, soil moisture and fire detection, are monitored and displayed on an LCD screen, enabling factory administration to oversee the industrial operations remotely over the internet. Our proposed system bypasses the need for physical presence for monitoring by addressing the shortcomings of conventional wired communication systems. Other than monitoring, there is an additional feature to remotely control these parameters by controlling the speed of DC motors through online commands. Given the rising incidence of industrial fires over the worldwide between 2020 and 2024 due to an array of hazards, this system with dual functionality boosts the overall operational efficiency and safety. This overall integration of IoT and Wireless Sensor Network (WSN) reduces the potential risks linked with physical monitoring, providing rapid responses in emergency scenarios, including the activation of firefighting equipment. The results show that innovations in wireless communication perform an integral part in industrial process automation and safety, paving the way to more intelligent and responsive operating environments. Overall, this study highlights the potential for change of IoT-enabled systems to revolutionize monitoring and control in a variety of industrial applications, resulting in increased productivity and safety.
zh
[AI-102] Reversing the Lens: Using Explainable AI to Understand Human Expertise
【速读】:该论文试图解决的问题是:如何在缺乏全局最优解的安全与可靠性关键领域中,理解人类如何通过经验积累形成高效的决策策略。解决方案的关键在于将可解释人工智能(Explainable AI, XAI)中的计算工具引入对人类学习过程的分析,具体通过构建粒子加速器调参任务中操作员子任务的图结构模型,并运用社区检测和层次聚类等方法,量化揭示操作员如何分解复杂问题并随经验积累演化其问题求解结构,从而为人类认知机制提供可量化的研究路径。
链接: https://arxiv.org/abs/2510.13814
作者: Roussel Rahman,Aashwin Ananda Mishra,Wan-Lin Hu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Both humans and machine learning models learn from experience, particularly in safety- and reliability-critical domains. While psychology seeks to understand human cognition, the field of Explainable AI (XAI) develops methods to interpret machine learning models. This study bridges these domains by applying computational tools from XAI to analyze human learning. We modeled human behavior during a complex real-world task – tuning a particle accelerator – by constructing graphs of operator subtasks. Applying techniques such as community detection and hierarchical clustering to archival operator data, we reveal how operators decompose the problem into simpler components and how these problem-solving structures evolve with expertise. Our findings illuminate how humans develop efficient strategies in the absence of globally optimal solutions, and demonstrate the utility of XAI-based methods for quantitatively studying human cognition.
zh
[AI-103] Local Causal Discovery for Statistically Efficient Causal Inference
【速读】:该论文旨在解决因果推断中调整集(adjustment set)选择的效率与统计最优性之间的权衡问题:全局因果发现方法虽能识别最优调整集(即渐近方差最小的集合),但计算复杂度随变量数增长迅速;局部因果发现方法虽具有可扩展性,却只能获得统计次优的调整集。解决方案的关键在于提出一种名为Local Optimal Adjustments Discovery (LOAD) 的新方法,其核心创新是结合局部因果发现的计算效率与全局方法的统计最优性——首先利用局部信息判断目标变量间因果效应是否可识别,若可识别则通过局部因果结构推断中介变量及其父节点以构建最优调整集;否则返回基于局部结构的合法父节点调整集,从而在保证因果有效性的同时显著提升计算效率并改善估计精度。
链接: https://arxiv.org/abs/2510.14582
作者: Mátyás Schubert,Tom Claassen,Sara Magliacane
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Causal discovery methods can identify valid adjustment sets for causal effect estimation for a pair of target variables, even when the underlying causal graph is unknown. Global causal discovery methods focus on learning the whole causal graph and therefore enable the recovery of optimal adjustment sets, i.e., sets with the lowest asymptotic variance, but they quickly become computationally prohibitive as the number of variables grows. Local causal discovery methods offer a more scalable alternative by focusing on the local neighborhood of the target variables, but are restricted to statistically suboptimal adjustment sets. In this work, we propose Local Optimal Adjustments Discovery (LOAD), a sound and complete causal discovery approach that combines the computational efficiency of local methods with the statistical optimality of global methods. First, LOAD identifies the causal relation between the targets and tests if the causal effect is identifiable by using only local information. If it is identifiable, it then finds the optimal adjustment set by leveraging local causal discovery to infer the mediators and their parents. Otherwise, it returns the locally valid parent adjustment sets based on the learned local structure. In our experiments on synthetic and realistic data LOAD outperforms global methods in scalability, while providing more accurate effect estimation than local methods.
zh
[AI-104] Semantic representations emerge in biologically inspired ensembles of cross-supervising neural networks
【速读】:该论文旨在解决生物神经系统中如何通过无监督学习实现高效信息表征的问题,特别是揭示冗余减少(redundancy reduction)这一设计原则在神经编码中的机制性实现路径。其解决方案的关键在于提出一种由多个并行子网络组成的集成模型,每个子网络仅接收输入的局部区域(小感受野,small receptive field),并通过与其他网络的交叉监督(cross-supervision)机制进行协同学习——即各网络基于同时或近时序接收到的输入相互提供监督信号,从而在抽象表示空间中学习语义特征。该框架不依赖共享权重,且具有生物可实现性,实验表明其在视觉与神经元刺激下均能获得可解码性强、精度接近有监督网络的表征,并发现小感受野和稀疏连接即可达到最优性能,显著降低计算复杂度。
链接: https://arxiv.org/abs/2510.14486
作者: Roy Urbach,Elad Schneidman
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 29 pages, 8 figures, 2 supplementary figures
Abstract:Brains learn to represent information from a large set of stimuli, typically by weak supervision. Unsupervised learning is therefore a natural approach for exploring the design of biological neural networks and their computations. Accordingly, redundancy reduction has been suggested as a prominent design principle of neural encoding, but its mechanistic'' biological implementation is unclear. Analogously, unsupervised training of artificial neural networks yields internal representations that allow for accurate stimulus classification or decoding, but typically rely on biologically-implausible implementations. We suggest that interactions between parallel subnetworks in the brain may underlie such learning: we present a model of representation learning by ensembles of neural networks, where each network learns to encode stimuli into an abstract representation space by cross-supervising interactions with other networks, for inputs they receive simultaneously or in close temporal proximity. Aiming for biological plausibility, each network has a small
receptive field’', thus receiving a fixed part of the external input, and the networks do not share weights. We find that for different types of network architectures, and for both visual or neuronal stimuli, these cross-supervising networks learn semantic representations that are easily decodable and that decoding accuracy is comparable to supervised networks – both at the level of single networks and the ensemble. We further show that performance is optimal for small receptive fields, and that sparse connectivity between networks is nearly as accurate as all-to-all interactions, with far fewer computations. We thus suggest a sparsely interacting collective of cross-supervising networks as an algorithmic framework for representational learning and collective computation in the brain.
zh
[AI-105] Column Generation Using Domain-Independent Dynamic Programming
【速读】:该论文旨在解决大规模精确优化中列生成(Column Generation)与分支定价(Branch-and-Price)方法在实际应用中的可复用性与通用性问题。传统列生成方法的定价子问题(Pricing Problem)高度依赖具体应用场景,通常需定制化设计以利用问题结构,导致算法组件难以跨场景复用。本文的关键解决方案是引入领域无关动态规划(Domain-Independent Dynamic Programming, DIDP),将其作为通用定价求解器,从而实现无需针对特定问题重写定价模块即可完成高效列生成。实验表明,基于DIDP的分支定价实现,在七个不同问题类别上优于当前领先的商用求解器,验证了该方案在通用性与性能上的优势。
链接: https://arxiv.org/abs/2510.14317
作者: Ryo Kuroiwa,Edward Lam
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注: Manuscript submitted to INFORMS Journal on Computing didp-rs code: this https URL Model code: this https URL
Abstract:Column generation and branch-and-price are leading methods for large-scale exact optimization. Column generation iterates between solving a master problem and a pricing problem. The master problem is a linear program, which can be solved using a generic solver. The pricing problem is highly dependent on the application but is usually discrete. Due to the difficulty of discrete optimization, high-performance column generation often relies on a custom pricing algorithm built specifically to exploit the problem’s structure. This bespoke nature of the pricing solver prevents the reuse of components for other applications. We show that domain-independent dynamic programming, a software package for modeling and solving arbitrary dynamic programs, can be used as a generic pricing solver. We develop basic implementations of branch-and-price with pricing by domain-independent dynamic programming and show that they outperform a world-leading solver on static mixed integer programming formulations for seven problem classes.
zh
[AI-106] Extracting latent representations from X-ray spectra. Classification regression and accretion signatures of Chandra sources
【速读】:该论文旨在解决如何从大量X射线光谱数据中提取紧凑且具有物理意义的表示问题,以提升对天体物理源的理解与分析效率。其核心挑战在于如何在保持数据高维特征的同时,实现降维并保留关键物理信息。解决方案的关键是采用基于Transformer架构的自编码器(autoencoder)模型,将来自钱德拉源目录(Chandra Source Catalog, CSC)的高显著性检测光谱压缩为8个潜在变量的低维表示。该方法不仅实现了高精度的光谱重建,还在潜空间中展现出良好的聚类性能(如对活动星系核和恒星级致密天体的分类准确率达69%),且潜在特征与硬度比、氢柱密度( N_H )等物理量存在非线性相关性,表明所学表示蕴含了物理可解释的信息。
链接: https://arxiv.org/abs/2510.14102
作者: Nicolò Oreste Pinciroli Vago,Juan Rafael Martínez-Galarza,Roberta Amato
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The study of X-ray spectra is crucial to understanding the physical nature of astrophysical sources. Machine learning methods can extract compact and informative representations of data from large datasets. The Chandra Source Catalog (CSC) provides a rich archive of X-ray spectral data, which remains largely underexplored in this context. This work aims to develop a compact and physically meaningful representation of Chandra X-ray spectra using deep learning. To verify that the learned representation captures relevant information, we evaluate it through classification, regression, and interpretability analyses. We use a transformer-based autoencoder to compress X-ray spectra. The input spectra, drawn from the CSC, include only high-significance detections. Astrophysical source types and physical summary statistics are compiled from external catalogs. We evaluate the learned representation in terms of spectral reconstruction accuracy, clustering performance on 8 known astrophysical source classes, and correlation with physical quantities such as hardness ratios and hydrogen column density ( N_H ). The autoencoder accurately reconstructs spectra with 8 latent variables. Clustering in the latent space yields a balanced classification accuracy of \sim 40% across the 8 source classes, increasing to \sim 69% when restricted to AGNs and stellar-mass compact objects exclusively. Moreover, latent features correlate with non-linear combinations of spectral fluxes, suggesting that the compressed representation encodes physically relevant information. The proposed autoencoder-based pipeline is a powerful tool for the representation and interpretation of X-ray spectra, providing a compact latent space that supports both classification and the estimation of physical properties. This work demonstrates the potential of deep learning for spectral studies and uncovering new patterns in X-ray data.
zh
[AI-107] Optical Computation-in-Communication enables low-latency high-fidelity perception in telesurgery
【速读】:该论文旨在解决远程手术(telesurgery)中因物理隔离导致的感官反馈与控制延迟问题,尤其针对端到端延迟超过200毫秒时会显著影响生成式AI(Generative AI)可靠性及患者安全的挑战。传统电子AI架构受限于推理与通信的串行处理机制,难以满足高时效性医疗场景的需求。其解决方案的关键在于提出光学计算-通信一体化(Optical Computation-in-Communication, OCiC)框架,通过将光学远程计算单元(ORCU)嵌入光通信路径,在传输过程中同步完成AI推理任务;该方案利用频谱高效的二维光子卷积实现每通道高达69 tera-operations per second的运算能力,并保持与CPU/GPU基准相差小于0.1%的推理精度,同时天然抑制累积误差传播,从而突破深度光学网络可扩展性的长期瓶颈。
链接: https://arxiv.org/abs/2510.14058
作者: Rui Yang,Jiaming Hu,Jian-Qing Zheng,Yue-Zhen Lu,Jian-Wei Cui,Qun Ren,Yi-Jie Yu,John Edward Wu,Zhao-Yu Wang,Xiao-Li Lin,Dandan Zhang,Mingchu Tang,Christos Masouros,Huiyun Liu,Chin-Pang Liu
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Artificial intelligence (AI) holds significant promise for enhancing intraoperative perception and decision-making in telesurgery, where physical separation impairs sensory feedback and control. Despite advances in medical AI and surgical robotics, conventional electronic AI architectures remain fundamentally constrained by the compounded latency from serial processing of inference and communication. This limitation is especially critical in latency-sensitive procedures such as endovascular interventions, where delays over 200 ms can compromise real-time AI reliability and patient safety. Here, we introduce an Optical Computation-in-Communication (OCiC) framework that reduces end-to-end latency significantly by performing AI inference concurrently with optical communication. OCiC integrates Optical Remote Computing Units (ORCUs) directly into the optical communication pathway, with each ORCU experimentally achieving up to 69 tera-operations per second per channel through spectrally efficient two-dimensional photonic convolution. The system maintains ultrahigh inference fidelity within 0.1% of CPU/GPU baselines on classification and coronary angiography segmentation, while intrinsically mitigating cumulative error propagation, a longstanding barrier to deep optical network scalability. We validated the robustness of OCiC through outdoor dark fibre deployments, confirming consistent and stable performance across varying environmental conditions. When scaled globally, OCiC transforms long-haul fibre infrastructure into a distributed photonic AI fabric with exascale potential, enabling reliable, low-latency telesurgery across distances up to 10,000 km and opening a new optical frontier for distributed medical intelligence.
zh
[AI-108] Dual-attention ResNet outperforms transformers in HER2 prediction on DCE-MRI
【速读】:该论文旨在解决乳腺癌中HER2状态的非侵入性预测问题,以减少对组织活检的依赖并优化诊疗流程。其核心解决方案是利用动态对比增强磁共振成像(DCE-MRI)数据,通过一种三头双注意力残差网络(Triple-Head Dual-Attention ResNet)对RGB融合的时间序列进行建模,从而实现高精度的HER2分型。关键创新在于采用双注意力机制有效捕捉可迁移的时空特征,并在多中心数据集上验证了模型的泛化能力,表明该方法在无需微调的情况下仍能保持良好的外部性能,为乳腺癌影像组学中可重复的深度学习生物标志物开发提供了可靠路径。
链接: https://arxiv.org/abs/2510.13897
作者: Naomi Fridman,Anat Goldstein
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:
Abstract:Breast cancer is the most diagnosed cancer in women, with HER2 status critically guiding treatment decisions. Noninvasive prediction of HER2 status from dynamic contrast-enhanced MRI (DCE-MRI) could streamline diagnostics and reduce reliance on biopsy. However, preprocessing high-dynamic-range DCE-MRI into standardized 8-bit RGB format for pretrained neural networks is nontrivial, and normalization strategy significantly affects model performance. We benchmarked intensity normalization strategies using a Triple-Head Dual-Attention ResNet that processes RGB-fused temporal sequences from three DCE phases. Trained on a multicenter cohort (n=1,149) from the I-SPY trials and externally validated on BreastDCEDL_AMBL (n=43 lesions), our model outperformed transformer-based architectures, achieving 0.75 accuracy and 0.74 AUC on I-SPY test data. N4 bias field correction slightly degraded performance. Without fine-tuning, external validation yielded 0.66 AUC, demonstrating cross-institutional generalizability. These findings highlight the effectiveness of dual-attention mechanisms in capturing transferable spatiotemporal features for HER2 stratification, advancing reproducible deep learning biomarkers in breast cancer imaging.
zh
[AI-109] Bayes or Heisenberg: Who(se) Rules?
【速读】:该论文旨在解决量子系统测量过程的建模问题,特别是如何将通常由量子态矢量描述的量子系统,在特定条件下重新表述为基于概率状态矢量的随机方程。其解决方案的关键在于利用Tensor Brain (TB)模型的神经网络动力学来近似这些概率表示,从而实现对量子测量过程的概率性建模,并通过生物启发的机制将生成的符号表示高效地整合进推理过程中。
链接: https://arxiv.org/abs/2510.13894
作者: Volker Tresp Hang Li,Federico Harjes,Yunpu Ma
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
备注:
Abstract:Although quantum systems are generally described by quantum state vectors, we show that in certain cases their measurement processes can be reformulated as probabilistic equations expressed in terms of probabilistic state vectors. These probabilistic representations can, in turn, be approximated by the neural network dynamics of the Tensor Brain (TB) model. The Tensor Brain is a recently proposed framework for modeling perception and memory in the brain, providing a biologically inspired mechanism for efficiently integrating generated symbolic representations into reasoning processes. Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph) Cite as: arXiv:2510.13894 [q-bio.NC] (or arXiv:2510.13894v1 [q-bio.NC] for this version) https://doi.org/10.48550/arXiv.2510.13894 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-110] Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion
【速读】:该论文旨在解决不完整多视图数据(incomplete multi-view data)下的聚类问题,即某些样本在部分视图上完全缺失信息,导致传统多视图聚类方法性能下降。现有深度方法常依赖静态融合策略或两阶段流程,存在融合效果不佳和误差传播等问题。其解决方案的关键在于提出一种基于分层语义对齐与协同补全(Hierarchical Semantic Alignment and Cooperative Completion, HSACC)的框架:首先通过双层语义空间设计实现鲁棒的跨视图融合——低层语义空间中通过最大化视图间互信息保证一致性对齐,高层语义空间中依据视图与初始融合表示的分布相似性动态分配自适应权重并加权融合生成全局统一表征;同时,借助高维语义空间中的投影隐式重建缺失视图,并联合优化重构与聚类目标,实现补全与聚类的协同学习。
链接: https://arxiv.org/abs/2510.13887
作者: Xiaojian Ding,Lin Zhao,Xian Li,Xiaoying Zhu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model’s robustness to hyperparameter variations.
zh
[AI-111] Physics-Informed autoencoder for DSC-MRI Perfusion post-processing: application to glioma grading
【速读】:该论文旨在解决动态对比增强磁共振成像(DSC-MRI)灌注分析中因噪声或运动伪影导致的灌注参数估计不准确问题,此类问题在临床环境中尤为突出,且传统深度学习方法受限于第三方去卷积算法的偏差与局限性。解决方案的关键在于提出一种物理信息引导的自编码器(physics-informed autoencoder),其通过引入解析模型来解码灌注参数,并以此指导编码网络的学习过程;该方法无需依赖外部去卷积软件即可实现自监督训练,在胶质瘤患者数据库上验证了其在低计算开销下仍能可靠完成胶质瘤分级,且在高噪声条件下表现出优于现有方法的鲁棒性。
链接: https://arxiv.org/abs/2510.13886
作者: Pierre Fayolle,Alexandre Bône,Noëlie Debs,Mathieu Naudin,Pascal Bourdon,Remy Guillevin,David Helbert
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: 5 pages, 5 figures, IEEE ISBI 2025, Houston, Tx, USA
Abstract:DSC-MRI perfusion is a medical imaging technique for diagnosing and prognosing brain tumors and strokes. Its analysis relies on mathematical deconvolution, but noise or motion artifacts in a clinical environment can disrupt this process, leading to incorrect estimate of perfusion parameters. Although deep learning approaches have shown promising results, their calibration typically rely on third-party deconvolution algorithms to generate reference outputs and are bound to reproduce their limitations. To adress this problem, we propose a physics-informed autoencoder that leverages an analytical model to decode the perfusion parameters and guide the learning of the encoding network. This autoencoder is trained in a self-supervised fashion without any third-party software and its performance is evaluated on a database with glioma patients. Our method shows reliable results for glioma grading in accordance with other well-known deconvolution algorithms despite a lower computation time. It also achieved competitive performance even in the presence of high noise which is critical in a medical environment. Comments: 5 pages, 5 figures, IEEE ISBI 2025, Houston, Tx, USA Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP) Cite as: arXiv:2510.13886 [q-bio.QM] (or arXiv:2510.13886v1 [q-bio.QM] for this version) https://doi.org/10.48550/arXiv.2510.13886 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-112] FFT-Accelerated Auxiliary Variable MCMC for Fermionic Lattice Models: A Determinant-Free Approach with O(Nlog N) Complexity
【速读】:该论文旨在解决量子多体系统模拟中的计算复杂度瓶颈问题,传统方法受限于 O(N3) 的时间复杂度,难以扩展至大规模体系。其解决方案的关键在于提出一种基于马尔可夫链蒙特卡洛(Markov Chain Monte Carlo, MCMC)的新型算法,通过在傅里叶域中构造粒子轨迹的转移核,将转移概率表示为卷积形式,从而利用快速傅里叶变换(Fast Fourier Transform, FFT)实现近线性 O(NlogN) 的每轮更新复杂度;同时,辅助变量具有闭式因子化条件分布,支持高效精确的吉布斯采样更新,整体显著提升了模拟效率并验证了 NlogN 的标度行为。
链接: https://arxiv.org/abs/2510.13866
作者: Deqian Kong,Shi Feng,Jianwen Xie,Ying Nian Wu
机构: 未知
类目: rongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:We introduce a Markov Chain Monte Carlo (MCMC) algorithm that dramatically accelerates the simulation of quantum many-body systems, a grand challenge in computational science. State-of-the-art methods for these problems are severely limited by O(N^3) computational complexity. Our method avoids this bottleneck, achieving near-linear O(N \log N) scaling per sweep. Our approach samples a joint probability measure over two coupled variable sets: (1) particle trajectories of the fundamental fermions, and (2) auxiliary variables that decouple fermion interactions. The key innovation is a novel transition kernel for particle trajectories formulated in the Fourier domain, revealing the transition probability as a convolution that enables massive acceleration via the Fast Fourier Transform (FFT). The auxiliary variables admit closed-form, factorized conditional distributions, enabling efficient exact Gibbs sampling update. We validate our algorithm on benchmark quantum physics problems, accurately reproducing known theoretical results and matching traditional O(N^3) algorithms on 32\times 32 lattice simulations at a fraction of the wall-clock time, empirically demonstrating N \log N scaling. By reformulating a long-standing physics simulation problem in machine learning language, our work provides a powerful tool for large-scale probabilistic inference and opens avenues for physics-inspired generative models. Subjects: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2510.13866 [cond-mat.str-el] (or arXiv:2510.13866v1 [cond-mat.str-el] for this version) https://doi.org/10.48550/arXiv.2510.13866 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-113] owards Neurocognitive-Inspired Intelligence: From AIs Structural Mimicry to Human-Like Functional Cognition
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)系统在泛化能力、适应性以及可解释性方面的局限性,尤其是其任务特定性强、难以从少量数据中学习、且缺乏类似人类认知的灵活性和鲁棒性等问题。解决方案的关键在于提出“神经认知启发智能”(Neurocognitive-Inspired Intelligence, NII),这是一种融合神经科学、认知科学、计算机视觉与AI的混合方法,强调通过模块化、生物启发的架构实现集成(integration)、具身性(embodiment)和自适应性(adaptability),从而构建能够快速学习、利用先验经验并在现实环境中以最小监督完成感知、推理、记忆与行动的通用智能系统。
链接: https://arxiv.org/abs/2510.13826
作者: Noorbakhsh Amiri Golilarz,Hassan S. Al Khatib,Shahram Rahimi
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence has advanced significantly through deep learning, reinforcement learning, and large language and vision models. However, these systems often remain task specific, struggle to adapt to changing conditions, and cannot generalize in ways similar to human cognition. Additionally, they mainly focus on mimicking brain structures, which often leads to black-box models with limited transparency and adaptability. Inspired by the structure and function of biological cognition, this paper introduces the concept of “Neurocognitive-Inspired Intelligence (NII),” a hybrid approach that combines neuroscience, cognitive science, computer vision, and AI to develop more general, adaptive, and robust intelligent systems capable of rapid learning, learning from less data, and leveraging prior experience. These systems aim to emulate the human brain’s ability to flexibly learn, reason, remember, perceive, and act in real-world settings with minimal supervision. We review the limitations of current AI methods, define core principles of neurocognitive-inspired intelligence, and propose a modular, biologically inspired architecture that emphasizes integration, embodiment, and adaptability. We also discuss potential implementation strategies and outline various real-world applications, from robotics to education and healthcare. Importantly, this paper offers a hybrid roadmap for future research, laying the groundwork for building AI systems that more closely resemble human cognition.
zh
[AI-114] GQVis: A Dataset of Genomics Data Questions and Visualizations for Generative AI
【速读】:该论文旨在解决当前机器学习模型在基因组学数据可视化任务中缺乏领域特定训练基础的问题。其解决方案的关键在于构建一个名为GQVis的数据集,该数据集通过将抽象的低层级基因组学数据问题与对应的可视化结果进行配对,为模型训练提供结构化且语义丰富的资源;该数据集不仅包含单查询数据点(1.14百万条)、查询对(628k)和查询链(589k),还整合了设计理由、图注和图像替代文本等元信息,从而支持更精准、可解释的生成式AI (Generative AI) 模型开发,适用于基因组学领域的可视化任务。
链接: https://arxiv.org/abs/2510.13816
作者: Skylar Sargent Walters,Arthea Valderrama,Thomas C. Smits,David Kouřil,Huyen N. Nguyen,Sehi L’Yi,Devin Lange,Nils Gehlenborg
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Data visualization is a fundamental tool in genomics research, enabling the exploration, interpretation, and communication of complex genomic features. While machine learning models show promise for transforming data into insightful visualizations, current models lack the training foundation for domain-specific tasks. In an effort to provide a foundational resource for genomics-focused model training, we present a framework for generating a dataset that pairs abstract, low-level questions about genomics data with corresponding visualizations. Building on prior work with statistical plots, our approach adapts to the complexity of genomics data and the specialized representations used to depict them. We further incorporate multiple linked queries and visualizations, along with justifications for design choices, figure captions, and image alt-texts for each item in the dataset. We use genomics data retrieved from three distinct genomics data repositories (4DN, ENCODE, Chromoscope) to produce GQVis: a dataset consisting of 1.14 million single-query data points, 628k query pairs, and 589k query chains. The GQVis dataset and generation code are available at this https URL and this https URL.
zh
机器学习
[LG-0] Biology-informed neural networks learn nonlinear representations from omics data to improve genomic prediction and interpretability
链接: https://arxiv.org/abs/2510.14970
作者: Katiana Kontolati,Rini Jasmine Gladstone,Ian Davis,Ethan Pickering
类目: Machine Learning (cs.LG)
*备注: 35 pages, 12 figures
Abstract:We extend biologically-informed neural networks (BINNs) for genomic prediction (GP) and selection (GS) in crops by integrating thousands of single-nucleotide polymorphisms (SNPs) with multi-omics measurements and prior biological knowledge. Traditional genotype-to-phenotype (G2P) models depend heavily on direct mappings that achieve only modest accuracy, forcing breeders to conduct large, costly field trials to maintain or marginally improve genetic gain. Models that incorporate intermediate molecular phenotypes such as gene expression can achieve higher predictive fit, but they remain impractical for GS since such data are unavailable at deployment or design time. BINNs overcome this limitation by encoding pathway-level inductive biases and leveraging multi-omics data only during training, while using genotype data alone during inference. Applied to maize gene-expression and multi-environment field-trial data, BINN improves rank-correlation accuracy by up to 56% within and across subpopulations under sparse-data conditions and nonlinearly identifies genes that GWAS/TWAS fail to uncover. With complete domain knowledge for a synthetic metabolomics benchmark, BINN reduces prediction error by 75% relative to conventional neural nets and correctly identifies the most important nonlinear pathway. Importantly, both cases show highly sensitive BINN latent variables correlate with the experimental quantities they represent, despite not being trained on them. This suggests BINNs learn biologically-relevant representations, nonlinear or linear, from genotype to phenotype. Together, BINNs establish a framework that leverages intermediate domain information to improve genomic prediction accuracy and reveal nonlinear biological relationships that can guide genomic selection, candidate gene selection, pathway enrichment, and gene-editing prioritization.
[LG-1] Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores
链接: https://arxiv.org/abs/2510.14966
作者: Zachary Robertson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages, 2 figures
Abstract:Pairwise comparisons of large language models using total variation distance mutual information (TVD-MI) produce binary critic decisions per pair. We show that averaging TVD-MI’s binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. Maximum-likelihood approaches to IRT use logistic links, but we find empirically that these transformations introduce curvature that breaks additivity: across three domains, the identity link yields median curl on raw data of 0.080-0.150 (P95 = [0.474, 0.580]), whereas probit/logit introduce substantially higher violations (median [0.245, 0.588], P95 [0.825, 2.252]). We derive this clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation. At 33% coverage, we achieve holdout RMSE 0.117 \pm 0.008 while preserving agent rankings (Spearman \rho = 0.972 \pm 0.015 ), three times fewer evaluations than full dense. Judge robustness analysis (GPT-4o-mini vs. Llama3-70b) shows strong agreement in agent rankings ( \rho = 0.872 ) and consistent identity-link advantage. TVD-MI’s geometry is best preserved by identity mapping for efficient LLM evaluation, applicable to other bounded-response domains.
[LG-2] VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tunin
链接: https://arxiv.org/abs/2510.14930
作者: Binghao Huang,Jie Xu,Iretiayo Akinola,Wei Yang,Balakumar Sundaralingam,Rowland O’Flaherty,Dieter Fox,Xiaolong Wang,Arsalan Mousavian,Yu-Wei Chao,Yunzhu Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted by 9th Conference on Robot Learning (CoRL 2025); Website: this https URL
Abstract:Humans excel at bimanual assembly tasks by adapting to rich tactile feedback – a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at this https URL.
[LG-3] Instruction Set Migration at Warehouse Scale
链接: https://arxiv.org/abs/2510.14928
作者: Eric Christopher,Kevin Crossan,Wolff Dobson,Chris Kennelly,Drew Lewis,Kun Lin,Martin Maas,Parthasarathy Ranganathan,Emma Rapati,Brian Yang
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Migrating codebases from one instruction set architecture (ISA) to another is a major engineering challenge. A recent example is the adoption of Arm (in addition to x86) across the major Cloud hyperscalers. Yet, this problem has seen limited attention by the academic community. Most work has focused on static and dynamic binary translation, and the traditional conventional wisdom has been that this is the primary challenge. In this paper, we show that this is no longer the case. Modern ISA migrations can often build on a robust open-source ecosystem, making it possible to recompile all relevant software from scratch. This introduces a new and multifaceted set of challenges, which are different from binary translation. By analyzing a large-scale migration from x86 to Arm at Google, spanning almost 40,000 code commits, we derive a taxonomy of tasks involved in ISA migration. We show how Google automated many of the steps involved, and demonstrate how AI can play a major role in automatically addressing these tasks. We identify tasks that remain challenging and highlight research challenges that warrant further attention. Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG) Cite as: arXiv:2510.14928 [cs.SE] (or arXiv:2510.14928v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2510.14928 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-4] Learnable Mixed Nash Equilibria are Collectively Rational
链接: https://arxiv.org/abs/2510.14907
作者: Geelon So,Yi-An Ma
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
Abstract:We extend the study of learning in games to dynamics that exhibit non-asymptotic stability. We do so through the notion of uniform stability, which is concerned with equilibria of individually utility-seeking dynamics. Perhaps surprisingly, it turns out to be closely connected to economic properties of collective rationality. Under mild non-degeneracy conditions and up to strategic equivalence, if a mixed equilibrium is not uniformly stable, then it is not weakly Pareto optimal: there is a way for all players to improve by jointly deviating from the equilibrium. On the other hand, if it is locally uniformly stable, then the equilibrium must be weakly Pareto optimal. Moreover, we show that uniform stability determines the last-iterate convergence behavior for the family of incremental smoothed best-response dynamics, used to model individual and corporate behaviors in the markets. Unlike dynamics around strict equilibria, which can stabilize to socially-inefficient solutions, individually utility-seeking behaviors near mixed Nash equilibria lead to collective rationality.
[LG-5] Secure Sparse Matrix Multiplications and their Applications to Privacy-Preserving Machine Learning
链接: https://arxiv.org/abs/2510.14894
作者: Marc Damie,Florian Hahn,Andreas Peter,Jan Ramon
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:To preserve privacy, multi-party computation (MPC) enables executing Machine Learning (ML) algorithms on secret-shared or encrypted data. However, existing MPC frameworks are not optimized for sparse data. This makes them unsuitable for ML applications involving sparse data, e.g., recommender systems or genomics. Even in plaintext, such applications involve high-dimensional sparse data, that cannot be processed without sparsity-related optimizations due to prohibitively large memory requirements. Since matrix multiplication is central in ML algorithms, we propose MPC algorithms to multiply secret sparse matrices. On the one hand, our algorithms avoid the memory issues of the “dense” data representation of classic secure matrix multiplication algorithms. On the other hand, our algorithms can significantly reduce communication costs (some experiments show a factor 1000) for realistic problem sizes. We validate our algorithms in two ML applications in which existing protocols are impractical. An important question when developing MPC algorithms is what assumptions can be made. In our case, if the number of non-zeros in a row is a sensitive piece of information then a short runtime may reveal that the number of non-zeros is small. Existing approaches make relatively simple assumptions, e.g., that there is a universal upper bound to the number of non-zeros in a row. This often doesn’t align with statistical reality, in a lot of sparse datasets the amount of data per instance satisfies a power law. We propose an approach which allows adopting a safe upper bound on the distribution of non-zeros in rows/columns of sparse matrices. Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2510.14894 [cs.CR] (or arXiv:2510.14894v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2510.14894 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-6] Prediction-Specific Design of Learning-Augmented Algorithms
链接: https://arxiv.org/abs/2510.14887
作者: Sizhe Li,Nicolas Christianson,Tongxin Li
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:Algorithms with predictions has emerged as a powerful framework to combine the robustness of traditional online algorithms with the data-driven performance benefits of machine-learned (ML) predictions. However, most existing approaches in this paradigm are overly conservative, as they do not leverage problem structure to optimize performance in a prediction-specific manner. In this paper, we show that such prediction-specific performance criteria can enable significant performance improvements over the coarser notions of consistency and robustness considered in prior work. Specifically, we propose a notion of \emphstrongly-optimal algorithms with predictions, which obtain Pareto optimality not just in the worst-case tradeoff between robustness and consistency, but also in the prediction-specific tradeoff between these metrics. We develop a general bi-level optimization framework that enables systematically designing strongly-optimal algorithms in a wide variety of problem settings, and we propose explicit strongly-optimal algorithms for several classic online problems: deterministic and randomized ski rental, and one-max search. Our analysis reveals new structural insights into how predictions can be optimally integrated into online algorithms by leveraging a prediction-specific design. To validate the benefits of our proposed framework, we empirically evaluate our algorithms in case studies on problems including dynamic power management and volatility-based index trading. Our results demonstrate that prediction-specific, strongly-optimal algorithms can significantly improve performance across a variety of online decision-making settings.
[LG-7] Provable Unlearning with Gradient Ascent on Two-Layer ReLU Neural Networks
链接: https://arxiv.org/abs/2510.14844
作者: Odelia Melamed,Gilad Yehudai,Gal Vardi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:
Abstract:Machine Unlearning aims to remove specific data from trained models, addressing growing privacy and ethical concerns. We provide a theoretical analysis of a simple and widely used method - gradient ascent - used to reverse the influence of a specific data point without retraining from scratch. Leveraging the implicit bias of gradient descent towards solutions that satisfy the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem, we quantify the quality of the unlearned model by evaluating how well it satisfies these conditions w.r.t. the retained data. To formalize this idea, we propose a new success criterion, termed \textbf (\epsilon, \delta, \tau) -successful unlearning, and show that, for both linear models and two-layer neural networks with high dimensional data, a properly scaled gradient-ascent step satisfies this criterion and yields a model that closely approximates the retrained solution on the retained data. We also show that gradient ascent performs successful unlearning while still preserving generalization in a synthetic Gaussian-mixture setting.
[LG-8] Reinforcement Learning with Stochastic Reward Machines AAAI-22 AAAI
链接: https://arxiv.org/abs/2510.14837
作者: Jan Corazza,Ivan Gavran,Daniel Neider
类目: Machine Learning (cs.LG)
*备注: A shorter version of this paper appeared in the Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22). Source code available at this https URL
Abstract:Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.
[LG-9] Intelligent Dynamic Handover via AI-assisted Signal Quality Prediction in 6G Multi-RAT Networks
链接: https://arxiv.org/abs/2510.14832
作者: Maria Lamprini A. Bartsioka,Anastasios Giannopoulos,Sotirios Spantideas
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 9 pages, 17 figures
Abstract:The emerging paradigm of 6G multiple Radio Access Technology (multi-RAT) networks, where cellular and Wireless Fidelity (WiFi) transmitters coexist, requires mobility decisions that remain reliable under fast channel dynamics, interference, and heterogeneous coverage. Handover in multi-RAT deployments is still highly reactive and event-triggered, relying on instantaneous measurements and threshold events. This work proposes a Machine Learning (ML)-assisted Predictive Conditional Handover (P-CHO) framework based on a model-driven and short-horizon signal quality forecasts. We present a generalized P-CHO sequence workflow orchestrated by a RAT Steering Controller, which standardizes data collection, parallel per-RAT predictions, decision logic with hysteresis-based conditions, and CHO execution. Considering a realistic multi-RAT environment, we train RAT-aware Long Short Term Memory (LSTM) networks to forecast the signal quality indicators of mobile users along randomized trajectories. The proposed P-CHO models are trained and evaluated under different channel models for cellular and IEEE 802.11 WiFi integrated coverage. We study the impact of hyperparameter tuning of LSTM models under different system settings, and compare direct multi-step versus recursive P-CHO variants. Comparisons against baseline predictors are also carried out. Finally, the proposed P-CHO is tested under soft and hard handover settings, showing that hysteresis-enabled P-CHO scheme is able to reduce handover failures and ping-pong events. Overall, the proposed P-CHO framework can enable accurate, low-latency, and proactive handovers suitable for ML-assisted handover steering in 6G multi-RAT deployments.
[LG-10] o Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models
链接: https://arxiv.org/abs/2510.14826
作者: Eran Malach,Omid Saremi,Sinead Williamson,Arwen Bradley,Aryo Lotfi,Emmanuel Abbe,Josh Susskind,Etai Littwin
类目: Machine Learning (cs.LG)
*备注:
Abstract:State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any ``truly long-form’’ generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.
[LG-11] Programmatic Representation Learning with Language Models
链接: https://arxiv.org/abs/2510.14825
作者: Gabriel Poesia,Georgia Gabriela Sampaio
类目: Machine Learning (cs.LG)
*备注: Code available at this https URL
Abstract:Classical models for supervised machine learning, such as decision trees, are efficient and interpretable predictors, but their quality is highly dependent on the particular choice of input features. Although neural networks can learn useful representations directly from raw data (e.g., images or text), this comes at the expense of interpretability and the need for specialized hardware to run them efficiently. In this paper, we explore a hypothesis class we call Learned Programmatic Representations (LeaPR) models, which stack arbitrary features represented as code (functions from data points to scalars) and decision tree predictors. We synthesize feature functions using Large Language Models (LLMs), which have rich prior knowledge in a wide range of domains and a remarkable ability to write code using existing domain-specific libraries. We propose two algorithms to learn LeaPR models from supervised data. First, we design an adaptation of FunSearch to learn features rather than directly generate predictors. Then, we develop a novel variant of the classical ID3 algorithm for decision tree learning, where new features are generated on demand when splitting leaf nodes. In experiments from chess position evaluation to image and text classification, our methods learn high-quality, neural network-free predictors often competitive with neural networks. Our work suggests a flexible paradigm for learning interpretable representations end-to-end where features and predictions can be readily inspected and understood.
[LG-12] ackling Time-Series Forecasting Generalization via Mitigating Concept Drift
链接: https://arxiv.org/abs/2510.14814
作者: Zhiyuan Zhao,Haoxin Liu,B. Aditya Prakash
类目: Machine Learning (cs.LG)
*备注: 17 pages, 6 figures, 4 tables
Abstract:Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is important for time-series forecasting models to handle potential distribution shifts over time. In this paper, we initially identify two types of distribution shifts in time series: concept drift and temporal shift. We acknowledge that while existing studies primarily focus on addressing temporal shift issues in time series forecasting, designing proper concept drift methods for time series forecasting has received comparatively less attention. Motivated by the need to address potential concept drift, while conventional concept drift methods via invariant learning face certain challenges in time-series forecasting, we propose a soft attention mechanism that finds invariant patterns from both lookback and horizon time series. Additionally, we emphasize the critical importance of mitigating temporal shifts as a preliminary to addressing concept drift. In this context, we introduce ShifTS, a method-agnostic framework designed to tackle temporal shift first and then concept drift within a unified approach. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of agnostic models across multiple datasets, and outperforming existing concept drift, temporal shift, and combined baselines. Comments: 17 pages, 6 figures, 4 tables Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.14814 [cs.LG] (or arXiv:2510.14814v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.14814 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-13] Efficient Dynamic Structured Sparse Training with Learned Shuffles
链接: https://arxiv.org/abs/2510.14812
作者: Abhishek Tyagi,Arjun Iyer,Liam Young,William H Renninger,Christopher Kanan,Yuhao Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any w active weights out of n , a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures – block, N:M, and diagonals – we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90–95% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to 1.21\times and infers up to 2.9\times faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.
[LG-14] Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning
链接: https://arxiv.org/abs/2510.14810
作者: Shikuang Deng,Jiayuan Zhang,Yuhang Wu,Ting Chen,Shi Gu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hebbian learning is a biological principle that intuitively describes how neurons adapt their connections through repeated stimuli. However, when applied to machine learning, it suffers serious issues due to the unconstrained updates of the connections and the lack of accounting for feedback mediation. Such shortcomings limit its effective scaling to complex network architectures and tasks. To this end, here we introduce the Structural Projection Hebbian Representation (SPHeRe), a novel unsupervised learning method that integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural information preservation backpropagates to the input through an auxiliary lightweight projection that conceptually serves as feedback mediation while the orthogonality constraints account for the boundedness of updating magnitude. Extensive experimental results show that SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on standard image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. Furthermore, the method exhibits strong effectiveness in continual learning and transfer learning scenarios, and image reconstruction tasks show the robustness and generalizability of the extracted features. This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules within modern deep learning frameworks, demonstrating the possibility of efficient and biologically inspired learning algorithms without the strong dependence on strict backpropagation. Our code is available at this https URL.
[LG-15] Active Jammer Localization via Acquisition-Aware Path Planning
链接: https://arxiv.org/abs/2510.14790
作者: Luis González-Gudiño,Mariona Jaramillo-Civill,Pau Closas,Tales Imbiriba
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures
Abstract:We propose an active jammer localization framework that combines Bayesian optimization with acquisition-aware path planning. Unlike passive crowdsourced methods, our approach adaptively guides a mobile agent to collect high-utility Received Signal Strength measurements while accounting for urban obstacles and mobility constraints. For this, we modified the A* algorithm, A-UCB*, by incorporating acquisition values into trajectory costs, leading to high-acquisition planned paths. Simulations on realistic urban scenarios show that the proposed method achieves accurate localization with fewer measurements compared to uninformed baselines, demonstrating consistent performance under different environments.
[LG-16] Causal Discovery for Linear DAGs with Dependent Latent Variables via Higher-order Cumulants
链接: https://arxiv.org/abs/2510.14780
作者: Ming Cai,Penggang Gao,Hisayuki Hara
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 59 pages, 6 figures, and 3 tables
Abstract:This paper addresses the problem of estimating causal directed acyclic graphs in linear non-Gaussian acyclic models with latent confounders (LvLiNGAM). Existing methods assume mutually independent latent confounders or cannot properly handle models with causal relationships among observed variables. We propose a novel algorithm that identifies causal DAGs in LvLiNGAM, allowing causal structures among latent variables, among observed variables, and between the two. The proposed method leverages higher-order cumulants of observed data to identify the causal structure. Extensive simulations and experiments with real-world data demonstrate the validity and practical utility of the proposed algorithm. Comments: 59 pages, 6 figures, and 3 tables Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2510.14780 [cs.LG] (or arXiv:2510.14780v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.14780 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-17] Leverag ing Code Cohesion Analysis to Identify Source Code Supply Chain Attacks
链接: https://arxiv.org/abs/2510.14778
作者: Maor Reuben,Ido Mendel,Or Feldman,Moshe Kravchik,Mordehai Guri,Rami Puzis
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Supply chain attacks significantly threaten software security with malicious code injections within legitimate projects. Such attacks are very rare but may have a devastating impact. Detecting spurious code injections using automated tools is further complicated as it often requires deciphering the intention of both the inserted code and its context. In this study, we propose an unsupervised approach for highlighting spurious code injections by quantifying cohesion disruptions in the source code. Using a name-prediction-based cohesion (NPC) metric, we analyze how function cohesion changes when malicious code is introduced compared to natural cohesion fluctuations. An analysis of 54,707 functions over 369 open-source C++ repositories reveals that code injection reduces cohesion and shifts naming patterns toward shorter, less descriptive names compared to genuine function updates. Considering the sporadic nature of real supply-chain attacks, we evaluate the proposed method with extreme test-set imbalance and show that monitoring high-cohesion functions with NPC can effectively detect functions with injected code, achieving a Precision@100 of 36.41% at a 1:1,000 ratio and 12.47% at 1:10,000. These results suggest that automated cohesion measurements, in general, and name-prediction-based cohesion, in particular, may help identify supply chain attacks, improving source code integrity.
[LG-18] he Pursuit of Diversity: Multi-Objective Testing of Deep Reinforcement Learning Agents
链接: https://arxiv.org/abs/2510.14727
作者: Antony Bartlett,Cynthia Liem,Annibale Panichella
类目: Machine Learning (cs.LG)
*备注: Pre-print - Accepted at Symposium on Search Based Software Engineering (SSBSE) 2025 co-located with ASE’25
Abstract:Testing deep reinforcement learning (DRL) agents in safety-critical domains requires discovering diverse failure scenarios. Existing tools such as INDAGO rely on single-objective optimization focused solely on maximizing failure counts, but this does not ensure discovered scenarios are diverse or reveal distinct error types. We introduce INDAGO-Nexus, a multi-objective search approach that jointly optimizes for failure likelihood and test scenario diversity using multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies. We evaluated INDAGO-Nexus on three DRL agents: humanoid walker, self-driving car, and parking agent. On average, INDAGO-Nexus discovers up to 83% and 40% more unique failures (test effectiveness) than INDAGO in the SDC and Parking scenarios, respectively, while reducing time-to-failure by up to 67% across all agents.
[LG-19] awa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
链接: https://arxiv.org/abs/2510.14719
作者: Hongzheng Chen,Bin Fan,Alexander Collins,Bastian Hagedorn,Evghenii Gaburov,Masahiro Masuda,Matthew Brookhart,Chris Sullivan,Jason Knight,Zhiru Zhang,Vinod Grover
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
*备注:
Abstract:Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines–a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1 \times speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2 \times speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.
[LG-20] Online Reliable Anomaly Detection via Neuromorphic Sensing and Communications
链接: https://arxiv.org/abs/2510.14688
作者: Junya Shiraishi,Jiechen Chen,Osvaldo Simeone,Petar Popovski
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:This paper proposes a low-power online anomaly detection framework based on neuromorphic wireless sensor networks, encompassing possible use cases such as brain-machine interfaces and remote environmental monitoring. In the considered system, a central reader node actively queries a subset of neuromorphic sensor nodes (neuro-SNs) at each time frame. The neuromorphic sensors are event-driven, producing spikes in correspondence to relevant changes in the monitored system. The queried neuro-SNs respond to the reader with impulse radio (IR) transmissions that directly encode the sensed local events. The reader processes these event-driven signals to determine whether the monitored environment is in a normal or anomalous state, while rigorously controlling the false discovery rate (FDR) of detections below a predefined threshold. The proposed approach employs an online hypothesis testing method with e-values to maintain FDR control without requiring knowledge of the anomaly rate, and it dynamically optimizes the sensor querying strategy by casting it as a best-arm identification problem in a multi-armed bandit framework. Extensive performance evaluation demonstrates that the proposed method can reliably detect anomalies under stringent FDR requirements, while efficiently scheduling sensor communications and achieving low detection latency.
[LG-21] Geometric Moment Alignment for Domain Adaptation via Siegel Embeddings
链接: https://arxiv.org/abs/2510.14666
作者: Shayan Gharib,Marcelo Hartmann,Arto Klami
类目: Machine Learning (cs.LG)
*备注:
Abstract:We address the problem of distribution shift in unsupervised domain adaptation with a moment-matching approach. Existing methods typically align low-order statistical moments of the source and target distributions in an embedding space using ad-hoc similarity measures. We propose a principled alternative that instead leverages the intrinsic geometry of these distributions by adopting a Riemannian distance for this alignment. Our key novelty lies in expressing the first- and second-order moments as a single symmetric positive definite (SPD) matrix through Siegel embeddings. This enables simultaneous adaptation of both moments using the natural geometric distance on the shared manifold of SPD matrices, preserving the mean and covariance structure of the source and target distributions and yielding a more faithful metric for cross-domain comparison. We connect the Riemannian manifold distance to the target-domain error bound, and validate the method on image denoising and image classification benchmarks. Our code is publicly available at this https URL.
[LG-22] First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training
链接: https://arxiv.org/abs/2510.14614
作者: Gyudong Kim,Hyukju Na,Jin Hyeon Kim,Hyunsung Jang,Jaemin Park,Jaegi Hwang,Namkoo Ha,Seungryong Kim,Young Geun Kim
类目: Machine Learning (cs.LG)
*备注:
Abstract:As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become a standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block’s MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by the observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input for the model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing the training time than the baseline.
[LG-23] Multimodal RAG for Unstructured Data:Leverag ing Modality-Aware Knowledge Graphs with Hybrid Retrieval
链接: https://arxiv.org/abs/2510.14592
作者: Rashmi R,Vidyadhar Upadhya
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 12 pages, 6 figures, submitted for review
Abstract:Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA’s ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
[LG-24] Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking
链接: https://arxiv.org/abs/2510.14586
作者: Daria Frolova,Talgat Daulbaev,Egor Sevryugov,Sergei A. Nikolenko,Dmitry N. Ivankov,Ivan Oseledets,Marina A. Pak
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages applied consecutively to refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces ( \mathbbR^3 , \mathrmSO(3) , and \mathrmSO(2) ). We enhance the prediction quality through a dedicated scoring model and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior performance on Astex and PDBbind test sets in terms of docking success rate and physical plausibility. Moreover, our method works approximately 25 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at this https URL.
[LG-25] State-Space Models for Tabular Prior-Data Fitted Networks
链接: https://arxiv.org/abs/2510.14573
作者: Felix Koch,Marcel Wever,Fabian Raisch,Benjamin Tischler
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advancements in foundation models for tabular data, such as TabPFN, demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSM’s inherent sensitivity to the order of input tokens - an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order-dependence, achieving predictive performance competitive to the original TabPFN model.
[LG-26] Redundancy-Aware Test-Time Graph Out-of-Distribution Detection NEURIPS2025
链接: https://arxiv.org/abs/2510.14562
作者: Yue Hou,He Zhu,Ruomei Liu,Yingke Su,Junran Wu,Ke Xu
类目: Machine Learning (cs.LG)
*备注: Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Distributional discrepancy between training and test data can lead models to make inaccurate predictions when encountering out-of-distribution (OOD) samples in real-world applications. Although existing graph OOD detection methods leverage data-centric techniques to extract effective representations, their performance remains compromised by structural redundancy that induces semantic shifts. To address this dilemma, we propose RedOUT, an unsupervised framework that integrates structural entropy into test-time OOD detection for graph classification. Concretely, we introduce the Redundancy-aware Graph Information Bottleneck (ReGIB) and decompose the objective into essential information and irrelevant redundancy. By minimizing structural entropy, the decoupled redundancy is reduced, and theoretically grounded upper and lower bounds are proposed for optimization. Extensive experiments on real-world datasets demonstrate the superior performance of RedOUT on OOD detection. Specifically, our method achieves an average improvement of 6.7%, significantly surpassing the best competitor by 17.3% on the ClinTox/LIPO dataset pair.
[LG-27] MX: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving MICRO MICRO2025
链接: https://arxiv.org/abs/2510.14557
作者: Jungi Lee,Junyong Park,Soohyun Cha,Jaehoon Cho,Jaewoong Sim
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: To appear at the 58th International Symposium on Microarchitecture (MICRO 2025)
Abstract:Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.
[LG-28] A Deep State-Space Model Compression Method using Upper Bound on Output Error
链接: https://arxiv.org/abs/2510.14542
作者: Hiroki Sakamoto,Kazuhiro Sato
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:We study deep state-space models (Deep SSMs) that contain linear-quadratic-output (LQO) systems as internal blocks and present a compression method with a provable output error guarantee. We first derive an upper bound on the output error between two Deep SSMs and show that the bound can be expressed via the h^2 -error norms between the layerwise LQO systems, thereby providing a theoretical justification for existing model order reduction (MOR)-based compression. Building on this bound, we formulate an optimization problem in terms of the h^2 -error norm and develop a gradient-based MOR method. On the IMDb task from the Long Range Arena benchmark, we demonstrate that our compression method achieves strong performance. Moreover, unlike prior approaches, we reduce roughly 80% of trainable parameters without retraining, with only a 4-5% performance drop.
[LG-29] On the Identifiability of Tensor Ranks via Prior Predictive Matching
链接: https://arxiv.org/abs/2510.14523
作者: Eliezer da Silva,Arto Klami,Diego Mesquita,Iñigo Urteaga
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:Selecting the latent dimensions (ranks) in tensor factorization is a central challenge that often relies on heuristic methods. This paper introduces a rigorous approach to determine rank identifiability in probabilistic tensor models, based on prior predictive moment matching. We transform a set of moment matching conditions into a log-linear system of equations in terms of marginal moments, prior hyperparameters, and ranks; establishing an equivalence between rank identifiability and the solvability of such system. We apply this framework to four foundational tensor-models, demonstrating that the linear structure of the PARAFAC/CP model, the chain structure of the Tensor Train model, and the closed-loop structure of the Tensor Ring model yield solvable systems, making their ranks identifiable. In contrast, we prove that the symmetric topology of the Tucker model leads to an underdetermined system, rendering the ranks unidentifiable by this method. For the identifiable models, we derive explicit closed-form rank estimators based on the moments of observed data only. We empirically validate these estimators and evaluate the robustness of the proposal.
[LG-30] Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective
链接: https://arxiv.org/abs/2510.14510
作者: Xingjian Wu,Xiangfei Qiu,Hanyin Cheng,Zhengyu Li,Jilin Hu,Chenjuan Guo,Bin Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time Series Forecasting has made significant progress with the help of Patching technique, which partitions time series into multiple patches to effectively retain contextual semantic information into a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which causes a fixed representation space, thus resulting in insufficiently expressful representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes the learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming at fully exploiting the information of contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plugin-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at this https URL.
[LG-31] Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals
链接: https://arxiv.org/abs/2510.14503
作者: Andrejs Sorstkins,Omer Tariq,Muhammad Bilal
类目: Machine Learning (cs.LG)
*备注: Submitted PLOS ONE
Abstract:This paper proposes a reversible learning framework to improve the robustness and efficiency of value based Reinforcement Learning agents, addressing vulnerability to value overestimation and instability in partially irreversible environments. The framework has two complementary core mechanisms: an empirically derived transition reversibility measure called Phi of s and a, and a selective state rollback operation. We introduce an online per state action estimator called Phi that quantifies the likelihood of returning to a prior state within a fixed horizon K. This measure is used to adjust the penalty term during temporal difference updates dynamically, integrating reversibility awareness directly into the value function. The system also includes a selective rollback operator. When an action yields an expected return markedly lower than its instantaneous estimated value and violates a predefined threshold, the agent is penalized and returns to the preceding state rather than progressing. This interrupts sub optimal high risk trajectories and avoids catastrophic steps. By combining reversibility aware evaluation with targeted rollback, the method improves safety, performance, and stability. In the CliffWalking v0 domain, the framework reduced catastrophic falls by over 99.8 percent and yielded a 55 percent increase in mean episode return. In the Taxi v3 domain, it suppressed illegal actions by greater than or equal to 99.9 percent and achieved a 65.7 percent improvement in cumulative reward, while also sharply reducing reward variance in both environments. Ablation studies confirm that the rollback mechanism is the critical component underlying these safety and performance gains, marking a robust step toward safe and reliable sequential decision making.
[LG-32] Coder as Editor: Code-driven Interpretable Molecular Optimization
链接: https://arxiv.org/abs/2510.14455
作者: Wenyu Zhu,Chengzhu Li,Xiaohe Tian,Yifan Wang,Yinjun Jia,Jianhui Wang,Bowen Gao,Ya-Qin Zhang,Wei-Ying Ma,Yanyan Lan
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge. While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications-particularly when operating on non-intuitive representations like SMILES. We introduce MECo, a framework that bridges reasoning and execution by translating editing actions into executable code. MECo reformulates molecular optimization for LLMs as a cascaded framework: generating human-interpretable editing intentions from a molecule and property goal, followed by translating those intentions into executable structural edits via code generation. Our approach achieves over 98% accuracy in reproducing held-out realistic edits derived from chemical reactions and target-specific compound pairs. On downstream optimization benchmarks spanning physicochemical properties and target activities, MECo substantially improves consistency by 38-86 percentage points to 90%+ and achieves higher success rates over SMILES-based baselines while preserving structural similarity. By aligning intention with execution, MECo enables consistent, controllable and interpretable molecular design, laying the foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.
[LG-33] owards geological inference with process-based and deep generative modeling part 1: training on fluvial deposits
链接: https://arxiv.org/abs/2510.14445
作者: Guillaume Rongier,Luk Peeters
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 24 pages, 16 figures
Abstract:The distribution of resources in the subsurface is deeply linked to the variations of its physical properties. Generative modeling has long been used to predict those physical properties while quantifying the associated uncertainty. But current approaches struggle to properly reproduce geological structures, and fluvial deposits in particular, because of their continuity. This study explores whether a generative adversarial network (GAN) - a type of deep-learning algorithm for generative modeling - can be trained to reproduce fluvial deposits simulated by a process-based model - a more expensive model that mimics geological processes. An ablation study shows that developments from the deep-learning community to generate large 2D images are directly transferable to 3D images of fluvial deposits. Training remains stable, and the generated samples reproduce the non-stationarity and details of the deposits without mode collapse or pure memorization of the training data. Using a process-based model to generate those training data allows us to include valuable properties other than the usual physical properties. We show how the deposition time let us monitor and validate the performance of a GAN by checking that its samples honor the law of superposition. Our work joins a series of previous studies suggesting that GANs are more robust that given credit for, at least for training datasets targeting specific geological structures. Whether this robustness transfers to larger 3D images and multimodal datasets remains to be seen. Exploring how deep generative models can leverage geological principles like the law of superposition shows a lot of promise.
[LG-34] MergeMoE: Efficient Compression of MoE Models via Expert Output Merging
链接: https://arxiv.org/abs/2510.14436
作者: Ruijie Miao,Yilun Yao,Zihan Wang,Zhiming Wang,Bairen Yi,LingJun Liu,Yikai Zhao,Tong Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Mixture-of-Experts (MoE) technique has proven to be a promising solution to efficiently scale the model size, which has been widely applied in recent LLM advancements. However, the substantial memory overhead of MoE models has made their compression an important research direction. In this work, we provide a theoretical analysis of expert merging, a recently proposed technique for compressing MoE models. Rather than interpreting expert merging from the conventional perspective of parameter aggregation, we approach it from the perspective of merging experts’ outputs. Our key insight is that the merging process can be interpreted as inserting additional matrices into the forward computation, which naturally leads to an optimization formulation. Building on this analysis, we introduce MergeMoE, a method that leverages mathematical optimization to construct the compression matrices. We evaluate MergeMoE on multiple MoE models and show that our algorithm consistently outperforms the baselines with the same compression ratios.
[LG-35] Interaction Concordance Index: Performance Evaluation for Interaction Prediction Methods
链接: https://arxiv.org/abs/2510.14419
作者: Tapio Pahikkala,Riikka Numminen,Parisa Movahedi,Napsu Karmitsa,Antti Airola
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Consider two sets of entities and their members’ mutual affinity values, say drug-target affinities (DTA). Drugs and targets are said to interact in their effects on DTAs if drug’s effect on it depends on the target. Presence of interaction implies that assigning a drug to a target and another drug to another target does not provide the same aggregate DTA as the reversed assignment would provide. Accordingly, correctly capturing interactions enables better decision-making, for example, in allocation of limited numbers of drug doses to their best matching targets. Learning to predict DTAs is popularly done from either solely from known DTAs or together with side information on the entities, such as chemical structures of drugs and targets. In this paper, we introduce interaction directions’ prediction performance estimator we call interaction concordance index (IC-index), for both fixed predictors and machine learning algorithms aimed for inferring them. IC-index complements the popularly used DTA prediction performance estimators by evaluating the ratio of correctly predicted directions of interaction effects in data. First, we show the invariance of IC-index on predictors unable to capture interactions. Secondly, we show that learning algorithm’s permutation equivariance regarding drug and target identities implies its inability to capture interactions when either drug, target or both are unseen during training. In practical applications, this equivariance is remedied via incorporation of appropriate side information on drugs and targets. We make a comprehensive empirical evaluation over several biomedical interaction data sets with various state-of-the-art machine learning algorithms. The experiments demonstrate how different types of affinity strength prediction methods perform in terms of IC-index complementing existing prediction performance estimators.
[LG-36] Revisit Modality Imbalance at the Decision Layer
链接: https://arxiv.org/abs/2510.14411
作者: Xiaoyu Ma,Hao Chen
类目: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Some Insights in Balanced Multimodal Learning
Abstract:Multimodal learning integrates information from different modalities to enhance model performance, yet it often suffers from modality imbalance, where dominant modalities overshadow weaker ones during joint optimization. This paper reveals that such an imbalance not only occurs during representation learning but also manifests significantly at the decision layer. Experiments on audio-visual datasets (CREMAD and Kinetic-Sounds) show that even after extensive pretraining and balanced optimization, models still exhibit systematic bias toward certain modalities, such as audio. Further analysis demonstrates that this bias originates from intrinsic disparities in feature-space and decision-weight distributions rather than from optimization dynamics alone. We argue that aggregating uncalibrated modality outputs at the fusion stage leads to biased decision-layer weighting, hindering weaker modalities from contributing effectively. To address this, we propose that future multimodal systems should focus more on incorporate adaptive weight allocation mechanisms at the decision layer, enabling relative balanced according to the capabilities of each modality.
[LG-37] Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow
链接: https://arxiv.org/abs/2510.14393
作者: Ching-Lin Hsiung,Tian-Sheuan Chang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 10 pages; IEEE Transactions on Circuits and Systems I: Regular Papers
Abstract:Current transformer accelerators primarily focus on optimizing self-attention due to its quadratic complexity. However, this focus is less relevant for vision transformers with short token lengths, where the Feed-Forward Network (FFN) tends to be the dominant computational bottleneck. This paper presents a low power Vision Transformer accelerator, optimized through algorithm-hardware co-design. The model complexity is reduced using hardware-friendly dynamic token pruning without introducing complex mechanisms. Sparsity is further improved by replacing GELU with ReLU activations and employing dynamic FFN2 pruning, achieving a 61.5% reduction in operations and a 59.3% reduction in FFN2 weights, with an accuracy loss of less than 2%. The hardware adopts a row-wise dataflow with output-oriented data access to eliminate data transposition, and supports dynamic operations with minimal area overhead. Implemented in TSMC’s 28nm CMOS technology, our design occupies 496.4K gates and includes a 232KB SRAM buffer, achieving a peak throughput of 1024 GOPS at 1GHz, with an energy efficiency of 2.31 TOPS/W and an area efficiency of 858.61 GOPS/mm2.
[LG-38] SHaRe-SSM: An Oscillatory Spiking Neural Network for Target Variable Modeling in Long Sequences
链接: https://arxiv.org/abs/2510.14386
作者: Kartikay Agrawal,Abhijeet Vikram,Vedant Sharma,Vaishnavi N.,Ayon Borthakur
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:In recent years, with the emergence of large models, there has been a significant interest in spiking neural networks (SNNs) primarily due to their energy efficiency, multiplication-free, and sparse event-based deep learning. Similarly, state space models (SSMs) in varying designs have evolved as a powerful alternative to transformers for target modeling in long sequences, thereby overcoming the quadratic dependence on sequence length of a transformer. Inspired by this progress, we here design SHaRe-SSM (Spiking Harmonic Resonate and Fire State Space Model), for target variable modeling (including both classification and regression) for very-long-range sequences. Our second-order spiking SSM, on average, performs better than transformers or first-order SSMs while circumventing multiplication operations, making it ideal for resource-constrained applications. The proposed block consumes 73 \times less energy than second-order ANN-based SSMs for an 18k sequence, while retaining performance. To ensure learnability over the long-range sequences, we propose exploiting the stable and efficient implementation of the dynamical system using parallel scans. Moreover, for the first time, we propose a kernel-based spiking regressor using resonate and fire neurons for very long-range sequences. Our network shows superior performance on even a 50k sequence while being significantly energy-efficient. In addition, we conducted a systematic analysis of the impact of heterogeneity, dissipation, and conservation in resonate-and-fire SSMs.
[LG-39] Jet Functors and Weil Algebras in Automatic Differentiation: A Geometric Analysis
链接: https://arxiv.org/abs/2510.14342
作者: Amandip Sangha
类目: Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)
*备注:
Abstract:We present a geometric formulation of automatic differentiation (AD) using jet bundles and Weil algebras. Reverse-mode AD emerges as cotangent-pullback, while Taylor-mode corresponds to evaluation in a Weil algebra. From these principles, we derive concise statements on correctness, stability, and complexity: a functorial identity for reverse-mode, algebraic exactness of higher-order derivatives, and explicit bounds on truncation error. We further show that tensorized Weil algebras permit one-pass computation of all mixed derivatives with cost linear in the algebra dimension, avoiding the combinatorial blow-up of nested JVP/VJP schedules. This framework interprets AD theory through the lens of differential geometry and offers a foundation for developing structure-preserving differentiation methods in deep learning and scientific computing. Code and examples are available at this https URL.
[LG-40] DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis
链接: https://arxiv.org/abs/2510.14336
作者: Shruti Sarika Chakraborty,Peter Minary
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Transformers (GTs) have emerged as powerful architectures for graph-structured data, yet remain constrained by rigid designs and lack quantifiable interpretability. Current state-of-the-art GTs commit to fixed GNN types across all layers, missing potential benefits of depth-specific component selection, while their complex architectures become opaque where performance gains cannot be distinguished between meaningful patterns and spurious correlations. We redesign GT attention through asymmetry, decoupling structural encoding from feature representation: queries derive from node features while keys and values come from GNN transformations. Within this framework, we use Differentiable ARchiTecture Search (DARTS) to select optimal GNN operators at each layer, enabling depth-wise heterogeneity inside transformer attention itself (DARTS-GT). To understand discovered architectures, we develop the first quantitative interpretability framework for GTs through causal ablation. Our metrics (Head-deviation, Specialization, and Focus), identify which heads and nodes drive predictions while enabling model comparison. Experiments across eight benchmarks show DARTS-GT achieves state-of-the-art on four datasets while remaining competitive on others, with discovered architectures revealing dataset-specific patterns. Our interpretability analysis reveals that visual attention salience and causal importance do not always correlate, indicating widely used visualization approaches may miss components that actually matter. Crucially, heterogeneous architectures found by DARTS-GT consistently produced more interpretable models than baselines, establishing that Graph Transformers need not choose between performance and interpretability.
[LG-41] LLM -ERM: Sample-Efficient Program Learning via LLM -Guided Search
链接: https://arxiv.org/abs/2510.14331
作者: Shivam Singhal,Eran Malach,Tomaso Poggio,Tomer Galanti
类目: Machine Learning (cs.LG)
*备注:
Abstract:We seek algorithms for program learning that are both sample-efficient and computationally feasible. Classical results show that targets admitting short program descriptions (e.g., with short python code'') can be learned with a
small’’ number of examples (scaling with the size of the code) via length-first program enumeration, but the search is exponential in description length. Consequently, Gradient-based training avoids this cost yet can require exponentially many samples on certain short-program families. To address this gap, we introduce LLM-ERM, a propose-and-verify framework that replaces exhaustive enumeration with an LLM-guided search over candidate programs while retaining ERM-style selection on held-out data. Specifically, we draw k candidates with a pretrained reasoning-augmented LLM, compile and check each on the data, and return the best verified hypothesis, with no feedback, adaptivity, or gradients. Theoretically, we show that coordinate-wise online mini-batch SGD requires many samples to learn certain short programs. \em Empirically, LLM-ERM solves tasks such as parity variants, pattern matching, and primality testing with as few as 200 samples, while SGD-trained transformers overfit even with 100,000 samples. These results indicate that language-guided program synthesis recovers much of the statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond the reach of gradient-based training. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.14331 [cs.LG] (or arXiv:2510.14331v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.14331 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-42] Active Measuring in Reinforcement Learning With Delayed Negative Effects
链接: https://arxiv.org/abs/2510.14315
作者: Daiqi Gao,Ziping Xu,Aseel Rawashdeh,Predrag Klasnja,Susan A. Murphy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Measuring states in reinforcement learning (RL) can be costly in real-world settings and may negatively influence future outcomes. We introduce the Actively Observable Markov Decision Process (AOMDP), where an agent not only selects control actions but also decides whether to measure the latent state. The measurement action reveals the true latent state but may have a negative delayed effect on the environment. We show that this reduced uncertainty may provably improve sample efficiency and increase the value of the optimal policy despite these costs. We formulate an AOMDP as a periodic partially observable MDP and propose an online RL algorithm based on belief states. To approximate the belief states, we further propose a sequential Monte Carlo method to jointly approximate the posterior of unknown static environment parameters and unobserved latent states. We evaluate the proposed algorithm in a digital health application, where the agent decides when to deliver digital interventions and when to assess users’ health status through surveys.
[LG-43] Enhancing Time-Series Anomaly Detection by Integrating Spectral-Residual Bottom-Up Attention with Reservoir Computing
链接: https://arxiv.org/abs/2510.14287
作者: Hayato Nihei,Sou Nobukawa,Yusuke Sakemi,Kazuyuki Aihara
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reservoir computing (RC) establishes the basis for the processing of time-series data by exploiting the high-dimensional spatiotemporal response of a recurrent neural network to an input signal. In particular, RC trains only the output layer weights. This simplicity has drawn attention especially in Edge Artificial Intelligence (AI) applications. Edge AI enables time-series anomaly detection in real time, which is important because detection delays can lead to serious incidents. However, achieving adequate anomaly-detection performance with RC alone may require an unacceptably large reservoir on resource-constrained edge devices. Without enlarging the reservoir, attention mechanisms can improve accuracy, although they may require substantial computation and undermine the learning efficiency of RC. In this study, to improve the anomaly detection performance of RC without sacrificing learning efficiency, we propose a spectral residual RC (SR-RC) that integrates the spectral residual (SR) method - a learning-free, bottom-up attention mechanism - with RC. We demonstrated that SR-RC outperformed conventional RC and logistic-regression models based on values extracted by the SR method across benchmark tasks and real-world time-series datasets. Moreover, because the SR method, similarly to RC, is well suited for hardware implementation, SR-RC suggests a practical direction for deploying RC as Edge AI for time-series anomaly detection.
[LG-44] Stable Prediction of Adverse Events in Medical Time-Series Data
链接: https://arxiv.org/abs/2510.14286
作者: Mayank Keoliya,Seewon Choi,Rajeev Alur,Mayur Naik,Eric Wong
类目: Machine Learning (cs.LG)
*备注: 18 pages, 3 Figures
Abstract:Early event prediction (EEP) systems continuously estimate a patient’s imminent risk to support clinical decision-making. For bedside trust, risk trajectories must be accurate and temporally stable, shifting only with new, relevant evidence. However, current benchmarks (a) ignore stability of risk scores and (b) evaluate mainly on tabular inputs, leaving trajectory behavior untested. To address this gap, we introduce CAREBench, an EEP benchmark that evaluates deployability using multi-modal inputs-tabular EHR, ECG waveforms, and clinical text-and assesses temporal stability alongside predictive accuracy. We propose a stability metric that quantifies short-term variability in per-patient risk and penalizes abrupt oscillations based on local-Lipschitz constants. CAREBench spans six prediction tasks such as sepsis onset and compares classical learners, deep sequence models, and zero-shot LLMs. Across tasks, existing methods, especially LLMs, struggle to jointly optimize accuracy and stability, with notably poor recall at high-precision operating points. These results highlight the need for models that produce evidence-aligned, stable trajectories to earn clinician trust in continuous monitoring settings. (Code: this https URL.)
[LG-45] Nonparametric Data Attribution for Diffusion Models
链接: https://arxiv.org/abs/2510.14269
作者: Yutian Zhao,Chao Du,Xiaosen Zheng,Tianyu Pang,Min Lin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or large-scale settings. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines. Code is available at this https URL.
[LG-46] Generalist vs Specialist Time Series Foundation Models: Investigating Potential Emergent Behaviors in Assessing Human Health Using PPG Signals
链接: https://arxiv.org/abs/2510.14254
作者: Saurabh Kataria,Yi Wu,Zhaoliang Chen,Hyunjung Gloria Kwak,Yuhao Xu,Lovely Yeswanth Panchumarthi,Ran Xiao,Jiaying Lu,Ayca Ermis,Anni Zhao,Runze Yan,Alex Federov,Zewen Liu,Xu Wu,Wei Jin,Carl Yang,Jocelyn Grunwell,Stephanie R. Brown,Amit Shah,Craig Jabaley,Tim Buchman,Sivasubramanium V Bhavani,Randall J. Lee,Xiao Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foundation models are large-scale machine learning models that are pre-trained on massive amounts of data and can be adapted for various downstream tasks. They have been extensively applied to tasks in Natural Language Processing and Computer Vision with models such as GPT, BERT, and CLIP. They are now also increasingly gaining attention in time-series analysis, particularly for physiological sensing. However, most time series foundation models are specialist models - with data in pre-training and testing of the same type, such as Electrocardiogram, Electroencephalogram, and Photoplethysmogram (PPG). Recent works, such as MOMENT, train a generalist time series foundation model with data from multiple domains, such as weather, traffic, and electricity. This paper aims to conduct a comprehensive benchmarking study to compare the performance of generalist and specialist models, with a focus on PPG signals. Through an extensive suite of total 51 tasks covering cardiac state assessment, laboratory value estimation, and cross-modal inference, we comprehensively evaluate both models across seven dimensions, including win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability. These metrics jointly capture not only the models’ capability but also their adaptability, robustness, and efficiency under different fine-tuning strategies, providing a holistic understanding of their strengths and limitations for diverse downstream scenarios. In a full-tuning scenario, we demonstrate that the specialist model achieves a 27% higher win score. Finally, we provide further analysis on generalization, fairness, attention visualizations, and the importance of training data choice.
[LG-47] A Physics Prior-Guided Dual-Stream Attention Network for Motion Prediction of Elastic Brag g Breakwaters
链接: https://arxiv.org/abs/2510.14250
作者: Lianzi Jiang,Jianxin Zhang,Xinyu Han,Huanhe Dong,Xiangrong Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate motion response prediction for elastic Bragg breakwaters is critical for their structural safety and operational integrity in marine environments. However, conventional deep learning models often exhibit limited generalization capabilities when presented with unseen sea states. These deficiencies stem from the neglect of natural decay observed in marine systems and inadequate modeling of wave-structure interaction (WSI). To overcome these challenges, this study proposes a novel Physics Prior-Guided Dual-Stream Attention Network (PhysAttnNet). First, the decay bidirectional self-attention (DBSA) module incorporates a learnable temporal decay to assign higher weights to recent states, aiming to emulate the natural decay phenomenon. Meanwhile, the phase differences guided bidirectional cross-attention (PDG-BCA) module explicitly captures the bidirectional interaction and phase relationship between waves and the structure using a cosine-based bias within a bidirectional cross-computation paradigm. These streams are synergistically integrated through a global context fusion (GCF) module. Finally, PhysAttnNet is trained with a hybrid time-frequency loss that jointly minimizes time-domain prediction errors and frequency-domain spectral discrepancies. Comprehensive experiments on wave flume datasets demonstrate that PhysAttnNet significantly outperforms mainstream models. Furthermore,cross-scenario generalization tests validate the model’s robustness and adaptability to unseen environments, highlighting its potential as a framework to develop predictive models for complex systems in ocean engineering.
[LG-48] When Flatness Does (Not) Guarantee Adversarial Robustness
链接: https://arxiv.org/abs/2510.14231
作者: Nils Philipp Walter,Linara Adilova,Jilles Vreeken,Michael Kamp
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite their empirical success, neural networks remain vulnerable to small, adversarial perturbations. A longstanding hypothesis suggests that flat minima, regions of low curvature in the loss landscape, offer increased robustness. While intuitive, this connection has remained largely informal and incomplete. By rigorously formalizing the relationship, we show this intuition is only partially correct: flatness implies local but not global adversarial robustness. To arrive at this result, we first derive a closed-form expression for relative flatness in the penultimate layer, and then show we can use this to constrain the variation of the loss in input space. This allows us to formally analyze the adversarial robustness of the entire network. We then show that to maintain robustness beyond a local neighborhood, the loss needs to curve sharply away from the data manifold. We validate our theoretical predictions empirically across architectures and datasets, uncovering the geometric structure that governs adversarial vulnerability, and linking flatness to model confidence: adversarial examples often lie in large, flat regions where the model is confidently wrong. Our results challenge simplified views of flatness and provide a nuanced understanding of its role in robustness.
[LG-49] Spectral Analysis of Molecular Kernels: When Richer Features Do Not Guarantee Better Generalization
链接: https://arxiv.org/abs/2510.14217
作者: Asma Jamali,Tin Sum Cheng,Rodrigo A. Vargas-Hernández
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 14 pages, 5 figures, 3 tables, SI: 8 pages, 7 figures
Abstract:Understanding the spectral properties of kernels offers a principled perspective on generalization and representation quality. While deep models achieve state-of-the-art accuracy in molecular property prediction, kernel methods remain widely used for their robustness in low-data regimes and transparent theoretical grounding. Despite extensive studies of kernel spectra in machine learning, systematic spectral analyses of molecular kernels are scarce. In this work, we provide the first comprehensive spectral analysis of kernel ridge regression on the QM9 dataset, molecular fingerprint, pretrained transformer-based, global and local 3D representations across seven molecular properties. Surprisingly, richer spectral features, measured by four different spectral metrics, do not consistently improve accuracy. Pearson correlation tests further reveal that for transformer-based and local 3D representations, spectral richness can even have a negative correlation with performance. We also implement truncated kernels to probe the relationship between spectrum and predictive performance: in many kernels, retaining only the top 2% of eigenvalues recovers nearly all performance, indicating that the leading eigenvalues capture the most informative features. Our results challenge the common heuristic that “richer spectra yield better generalization” and highlight nuanced relationships between representation, kernel features, and predictive performance. Beyond molecular property prediction, these findings inform how kernel and self-supervised learning methods are evaluated in data-limited scientific and real-world tasks.
[LG-50] Incentive-Based Federated Learning
链接: https://arxiv.org/abs/2510.14208
作者: Chanuka A.S. Hewa Kaluannakkage,Rajkumar Buyya
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 24 pages, 5 figures, chapter for edited book (Federated Learning: Foundations and Applications)
Abstract:Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.
[LG-51] Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation
链接: https://arxiv.org/abs/2510.14190
作者: Ruchi Sandilya,Sumaira Perez,Charles Lynch,Lindsay Victoria,Benjamin Zebley,Derrick Matthew Buchanan,Mahendra T. Bhati,Nolan Williams,Timothy J. Spellman,Faith M. Gunning,Conor Liston,Logan Grosenick
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models excel at generation, but their latent spaces are not explicitly organized for interpretable control. We introduce ConDA (Contrastive Diffusion Alignment), a framework that applies contrastive learning within diffusion embeddings to align latent geometry with system dynamics. Motivated by recent advances showing that contrastive objectives can recover more disentangled and structured representations, ConDA organizes diffusion latents such that traversal directions reflect underlying dynamical factors. Within this contrastively structured space, ConDA enables nonlinear trajectory traversal that supports faithful interpolation, extrapolation, and controllable generation. Across benchmarks in fluid dynamics, neural calcium imaging, therapeutic neurostimulation, and facial expression, ConDA produces interpretable latent representations with improved controllability compared to linear traversals and conditioning-based baselines. These results suggest that diffusion latents encode dynamics-relevant structure, but exploiting this structure requires latent organization and traversal along the latent manifold.
[LG-52] Optimal Control Theoretic Neural Optimizer: From Backpropagation to Dynamic Programming
链接: https://arxiv.org/abs/2510.14168
作者: Guan-Horng Liu,Tianrong Chen,Evangelos A. Theodorou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Optimization of deep neural networks (DNNs) has been a driving force in the advancement of modern machine learning and artificial intelligence. With DNNs characterized by a prolonged sequence of nonlinear propagation, determining their optimal parameters given an objective naturally fits within the framework of Optimal Control Programming. Such an interpretation of DNNs as dynamical systems has proven crucial in offering a theoretical foundation for principled analysis from numerical equations to physics. In parallel to these theoretical pursuits, this paper focuses on an algorithmic perspective. Our motivated observation is the striking algorithmic resemblance between the Backpropagation algorithm for computing gradients in DNNs and the optimality conditions for dynamical systems, expressed through another backward process known as dynamic programming. Consolidating this connection, where Backpropagation admits a variational structure, solving an approximate dynamic programming up to the first-order expansion leads to a new class of optimization methods exploring higher-order expansions of the Bellman equation. The resulting optimizer, termed Optimal Control Theoretic Neural Optimizer (OCNOpt), enables rich algorithmic opportunities, including layer-wise feedback policies, game-theoretic applications, and higher-order training of continuous-time models such as Neural ODEs. Extensive experiments demonstrate that OCNOpt improves upon existing methods in robustness and efficiency while maintaining manageable computational complexity, paving new avenues for principled algorithmic design grounded in dynamical systems and optimal control theory.
[LG-53] Data Understanding Survey: Pursuing Improved Dataset Characterization Via Tensor-based Methods
链接: https://arxiv.org/abs/2510.14161
作者: Matthew D. Merris,Tim Andersen
类目: Machine Learning (cs.LG)
*备注: 20 pages, 8 figures, Pre-print
Abstract:In the evolving domains of Machine Learning and Data Analytics, existing dataset characterization methods such as statistical, structural, and model-based analyses often fail to deliver the deep understanding and insights essential for innovation and explainability. This work surveys the current state-of-the-art conventional data analytic techniques and examines their limitations, and discusses a variety of tensor-based methods and how these may provide a more robust alternative to traditional statistical, structural, and model-based dataset characterization techniques. Through examples, we illustrate how tensor methods unveil nuanced data characteristics, offering enhanced interpretability and actionable intelligence. We advocate for the adoption of tensor-based characterization, promising a leap forward in understanding complex datasets and paving the way for intelligent, explainable data-driven discoveries.
[LG-54] On Evaluating Loss Functions for Stock Ranking: An Empirical Analysis With Transformer Model CIKM2025
链接: https://arxiv.org/abs/2510.14156
作者: Jan Kwiatkowski,Jarosław A. Chudziak
类目: Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
*备注: This paper has been submitted to CIKM 2025
Abstract:Quantitative trading strategies rely on accurately ranking stocks to identify profitable investments. Effective portfolio management requires models that can reliably order future stock returns. Transformer models are promising for understanding financial time series, but how different training loss functions affect their ability to rank stocks well is not yet fully understood. Financial markets are challenging due to their changing nature and complex relationships between stocks. Standard loss functions, which aim for simple prediction accuracy, often aren’t enough. They don’t directly teach models to learn the correct order of stock returns. While many advanced ranking losses exist from fields such as information retrieval, there hasn’t been a thorough comparison to see how well they work for ranking financial returns, especially when used with modern Transformer models for stock selection. This paper addresses this gap by systematically evaluating a diverse set of advanced loss functions including pointwise, pairwise, listwise for daily stock return forecasting to facilitate rank-based portfolio selection on SP 500 data. We focus on assessing how each loss function influences the model’s ability to discern profitable relative orderings among assets. Our research contributes a comprehensive benchmark revealing how different loss functions impact a model’s ability to learn cross-sectional and temporal patterns crucial for portfolio selection, thereby offering practical guidance for optimizing ranking-based trading strategies.
[LG-55] Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks
链接: https://arxiv.org/abs/2510.14137
作者: Faezeh Dehghan Tarzjani,Bhaskar Krishnamachari
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:The p-persistent CSMA protocol is central to random-access MAC analysis, but predicting saturation throughput in heterogeneous multi-hop wireless networks remains a hard problem. Simplified models that assume a single, shared interference domain can underestimate throughput by 48–62% in sparse topologies. Exact Markov-chain analyses are accurate but scale exponentially in computation time, making them impractical for large networks. These computational barriers motivate structural machine learning approaches like GNNs for scalable throughput prediction in general network topologies. Yet off-the-shelf GNNs struggle here: a standard GCN yields 63.94% normalized mean absolute error (NMAE) on heterogeneous networks because symmetric normalization conflates a node’s direct interference with higher-order, cascading effects that pertain to how interference propagates over the network graph. Building on these insights, we propose the Decoupled Graph Convolutional Network (D-GCN), a novel architecture that explicitly separates processing of a node’s own transmission probability from neighbor interference effects. D-GCN replaces mean aggregation with learnable attention, yielding interpretable, per-neighbor contribution weights while capturing complex multihop interference patterns. D-GCN attains 3.3% NMAE, outperforms strong baselines, remains tractable even when exact analytical methods become computationally infeasible, and enables gradient-based network optimization that achieves within 1% of theoretical optima. Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY) Cite as: arXiv:2510.14137 [cs.LG] (or arXiv:2510.14137v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.14137 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-56] Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL
链接: https://arxiv.org/abs/2510.14129
作者: Mahsa Bastankhah,Grace Liu,Dilip Arumugam,Thomas L. Griffiths,Benjamin Eysenbach
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning. We study Single-Goal Contrastive Reinforcement Learning (SGCRL), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula. We combine theoretical analysis of the algorithm’s objective function with controlled experiments to understand what drives its exploration. We show that SGCRL maximizes implicit rewards shaped by its learned representations. These representations automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter. Our experiments also demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.
[LG-57] Neural Network-enabled Domain-consistent Robust Optimisation for Global CO_2 Reduction Potential of Gas Power Plants
链接: https://arxiv.org/abs/2510.14125
作者: Waqar Muhammad Ashraf,Talha Ansar,Abdulelah S. Alshehri,Peipei Chen,Ramit Debnath,Vivek Dua
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a neural network-driven robust optimisation framework that integrates data-driven domain as a constraint into the nonlinear programming technique, addressing the overlooked issue of domain-inconsistent solutions arising from the interaction of parametrised neural network models with optimisation solvers. Applied to a 1180 MW capacity combined cycle gas power plant, our framework delivers domain-consistent robust optimal solutions that achieve a verified 0.76 percentage point mean improvement in energy efficiency. For the first time, scaling this efficiency gain to the global fleet of gas power plants, we estimate an annual 26 Mt reduction potential in CO _2 (with 10.6 Mt in Asia, 9.0 Mt in the Americas, and 4.5 Mt in Europe). These results underscore the synergetic role of machine learning in delivering near-term, scalable decarbonisation pathways for global climate action.
[LG-58] David vs. Goliath: A comparative study of different-sized LLM s for code generation in the domain of automotive scenario generation
链接: https://arxiv.org/abs/2510.14115
作者: Philipp Bauerfeind,Amir Salarpour,David Fernandez,Pedram MohajerAnsari,Johannes Reschke,Mert D. Pesé
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) for CARLA, enables precise and reproducible scenarios, but NL-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5Coder 0.5B-32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, ChrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them with an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (F1 of EDIT-SIM and compilation) as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5Coder-14B reaches about 88 percent of its expert score on local hardware. Retrieval-augmented prompting, Few-Shot with Example Retriever (FSER), consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.
[LG-59] Briding Diffusion Posterior Sampling and Monte Carlo methods: a survey
链接: https://arxiv.org/abs/2510.14114
作者: Yazid Janati,Alain Durmus,Jimmy Olsson,Eric Moulines
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models enable the synthesis of highly accurate samples from complex distributions and have become foundational in generative modeling. Recently, they have demonstrated significant potential for solving Bayesian inverse problems by serving as priors. This review offers a comprehensive overview of current methods that leverage \emphpre-trained diffusion models alongside Monte Carlo methods to address Bayesian inverse problems without requiring additional training. We show that these methods primarily employ a \emphtwisting mechanism for the intermediate distributions within the diffusion process, guiding the simulations toward the posterior distribution. We describe how various Monte Carlo methods are then used to aid in sampling from these twisted distributions.
[LG-60] Near-Optimal Regret-Queue Length Tradeoff in Online Learning for Two-Sided Markets
链接: https://arxiv.org/abs/2510.14097
作者: Zixian Yang,Sushil Mahavir Varma,Lei Ying
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC); Probability (math.PR)
*备注: 67 pages, 12 figures
Abstract:We study a two-sided market, wherein, price-sensitive heterogeneous customers and servers arrive and join their respective queues. A compatible customer-server pair can then be matched by the platform, at which point, they leave the system. Our objective is to design pricing and matching algorithms that maximize the platform’s profit, while maintaining reasonable queue lengths. As the demand and supply curves governing the price-dependent arrival rates may not be known in practice, we design a novel online-learning-based pricing policy and establish its near-optimality. In particular, we prove a tradeoff among three performance metrics: \tildeO(T^1-\gamma) regret, \tildeO(T^\gamma/2) average queue length, and \tildeO(T^\gamma) maximum queue length for \gamma \in (0, 1/6] , significantly improving over existing results [1]. Moreover, barring the permissible range of \gamma , we show that this trade-off between regret and average queue length is optimal up to logarithmic factors under a class of policies, matching the optimal one as in [2] which assumes the demand and supply curves to be known. Our proposed policy has two noteworthy features: a dynamic component that optimizes the tradeoff between low regret and small queue lengths; and a probabilistic component that resolves the tension between obtaining useful samples for fast learning and maintaining small queue lengths.
[LG-61] ENDE: Transfer Entropy Neural Diffusion Estimation
链接: https://arxiv.org/abs/2510.14096
作者: Simon Pedro Galeano Munoz,Mustapha Bounoua,Giulio Franzese,Pietro Michiardi,Maurizio Filippone
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transfer entropy measures directed information flow in time series, and it has become a fundamental quantity in applications spanning neuroscience, finance, and complex systems analysis. However, existing estimation methods suffer from the curse of dimensionality, require restrictive distributional assumptions, or need exponentially large datasets for reliable convergence. We address these limitations in the literature by proposing TENDE (Transfer Entropy Neural Diffusion Estimation), a novel approach that leverages score-based diffusion models to estimate transfer entropy through conditional mutual information. By learning score functions of the relevant conditional distributions, TENDE provides flexible, scalable estimation while making minimal assumptions about the underlying data-generating process. We demonstrate superior accuracy and robustness compared to existing neural estimators and other state-of-the-art approaches across synthetic benchmarks and real data.
[LG-62] Neural Network approximation power on homogeneous and heterogeneous reaction-diffusion equations
链接: https://arxiv.org/abs/2510.14094
作者: Haotian Feng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reaction-diffusion systems represent one of the most fundamental formulations used to describe a wide range of physical, chemical, and biological processes. With the increasing adoption of neural networks, recent research has focused on solving differential equations using machine learning techniques. However, the theoretical foundation explaining why neural networks can effectively approximate such solutions remains insufficiently explored. This paper provides a theoretical analysis of the approximation power of neural networks for one- and two-dimensional reaction-diffusion equations in both homogeneous and heterogeneous media. Building upon the universal approximation theorem, we demonstrate that a two-layer neural network can approximate the one-dimensional reaction-diffusion equation, while a three-layer neural network can approximate its two-dimensional counterpart. The theoretical framework presented here can be further extended to elliptic and parabolic equations. Overall, this work highlights the expressive power of neural networks in approximating solutions to reaction-diffusion equations and related PDEs, providing a theoretical foundation for neural network-based differential equation solvers. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.14094 [cs.LG] (or arXiv:2510.14094v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.14094 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-63] FedHFT: Efficient Federated Finetuning with Heterogeneous Edge Clients
链接: https://arxiv.org/abs/2510.14054
作者: Fatih Ilhan,Selim Furkan Tekin,Tiansheng Huang,Gaowen Liu,Ramana Kompella,Greg Eisenhauer,Yingyan Celine Lin,Calton Pu,Ling Liu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Fine-tuning pre-trained large language models (LLMs) has become a common practice for personalized natural language understanding (NLU) applications on downstream tasks and domain-specific datasets. However, there are two main challenges: (i) limited and/or heterogeneous data for fine-tuning due to proprietary data confidentiality or privacy requirements, and (ii) varying computation resources available across participating clients such as edge devices. This paper presents FedHFT - an efficient and personalized federated fine-tuning framework to address both challenges. First, we introduce a mixture of masked adapters to handle resource heterogeneity across participating clients, enabling high-performance collaborative fine-tuning of pre-trained language model(s) across multiple clients in a distributed setting, while keeping proprietary data local. Second, we introduce a bi-level optimization approach to handle non-iid data distribution based on masked personalization and client clustering. Extensive experiments demonstrate significant performance and efficiency improvements over various natural language understanding tasks under data and resource heterogeneity compared to representative heterogeneous federated learning methods.
[LG-64] CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations
链接: https://arxiv.org/abs/2510.14049
作者: Guangyi Chen,Yunlong Deng,Peiyuan Zhu,Yan Li,Yifan Sheng,Zijian Li,Kun Zhang
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS)
*备注:
Abstract:Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Welcome to visit our: Project page:this https URL, Dataset:this https URL.
[LG-65] Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training
链接: https://arxiv.org/abs/2510.14009
作者: Jie Hao,Xiaochuan Gong,Jie Xu,Zhengdao Wang,Mingrui Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Geometry-aware optimization algorithms, such as Muon, have achieved remarkable success in training deep neural networks (DNNs). These methods leverage the underlying geometry of DNNs by selecting appropriate norms for different layers and updating parameters via norm-constrained linear minimization oracles (LMOs). However, even within a group of layers associated with the same norm, the local curvature can be heterogeneous across layers and vary dynamically over the course of training. For example, recent work shows that sharpness varies substantially across transformer layers and throughout training, yet standard geometry-aware optimizers impose fixed learning rates to layers within the same group, which may be inefficient for DNN training. In this paper, we introduce a noise-adaptive layerwise learning rate scheme on top of geometry-aware optimization algorithms and substantially accelerate DNN training compared to methods that use fixed learning rates within each group. Our method estimates gradient variance in the dual norm induced by the chosen LMO on the fly, and uses it to assign time-varying noise-adaptive layerwise learning rates within each group. We provide a theoretical analysis showing that our algorithm achieves a sharp convergence rate. Empirical results on transformer architectures such as LLaMA and GPT demonstrate that our approach achieves faster convergence than state-of-the-art optimizers. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.14009 [cs.LG] (or arXiv:2510.14009v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.14009 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-66] PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features
链接: https://arxiv.org/abs/2510.14005
作者: Wei Zou,Yupei Liu,Yanting Wang,Ying Chen,Neil Gong,Jinyuan Jia
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: The code is available at this https URL
Abstract:LLM-integrated applications are vulnerable to prompt injection attacks, where an attacker contaminates the input to inject malicious prompts, causing the LLM to follow the attacker’s intent instead of the original user’s. Existing prompt injection detection methods often have sub-optimal performance and/or high computational overhead. In this work, we propose PIShield, a detection method that is both effective and efficient. Our key observation is that the internal representation of the final token in a prompt-extracted from a specific layer of the LLM, which we term the injection-critical layer-captures distinguishing features between clean and contaminated prompts. Leveraging this insight, we train a simple linear classifier on these internal representations using a labeled set of clean and contaminated prompts. We compare PIShield against 11 baselines across 5 diverse benchmark datasets and 8 prompt injection attacks. The results demonstrate that PIShield is both highly effective and efficient, substantially outperforming existing methods. Additionally, we show that PIShield resists strong adaptive attacks.
[LG-67] Signature in Code Backdoor Detection how far are we?
链接: https://arxiv.org/abs/2510.13992
作者: Quoc Hung Le,Thanh Le-Cong,Bach Le,Bowen Xu
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 20 pages, 3 figures
Abstract:As Large Language Models (LLMs) become increasingly integrated into software development workflows, they also become prime targets for adversarial attacks. Among these, backdoor attacks are a significant threat, allowing attackers to manipulate model outputs through hidden triggers embedded in training data. Detecting such backdoors remains a challenge, and one promising approach is the use of Spectral Signature defense methods that identify poisoned data by analyzing feature representations through eigenvectors. While some prior works have explored Spectral Signatures for backdoor detection in neural networks, recent studies suggest that these methods may not be optimally effective for code models. In this paper, we revisit the applicability of Spectral Signature-based defenses in the context of backdoor attacks on code models. We systematically evaluate their effectiveness under various attack scenarios and defense configurations, analyzing their strengths and limitations. We found that the widely used setting of Spectral Signature in code backdoor detection is often suboptimal. Hence, we explored the impact of different settings of the key factors. We discovered a new proxy metric that can more accurately estimate the actual performance of Spectral Signature without model retraining after the defense.
[LG-68] Multi-View Semi-Supervised Label Distribution Learning with Local Structure Complementarity
链接: https://arxiv.org/abs/2510.13917
作者: Yanshan Xiao,Kaihong Wu,Bo Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Label distribution learning (LDL) is a paradigm that each sample is associated with a label distribution. At present, the existing approaches are proposed for the single-view LDL problem with labeled data, while the multi-view LDL problem with labeled and unlabeled data has not been considered. In this paper, we put forward the multi-view semi-supervised label distribution learning with local structure complementarity (MVSS-LDL) approach, which exploits the local nearest neighbor structure of each view and emphasizes the complementarity of local nearest neighbor structures in multiple views. Specifically speaking, we first explore the local structure of view v by computing the k -nearest neighbors. As a result, the k -nearest neighbor set of each sample \boldsymbolx_i in view v is attained. Nevertheless, this k -nearest neighbor set describes only a part of the nearest neighbor information of sample \boldsymbolx_i . In order to obtain a more comprehensive description of sample \boldsymbolx_i 's nearest neighbors, we complement the nearest neighbor set in view v by incorporating sample \boldsymbolx_i 's nearest neighbors in other views. Lastly, based on the complemented nearest neighbor set in each view, a graph learning-based multi-view semi-supervised LDL model is constructed. By considering the complementarity of local nearest neighbor structures, different views can mutually provide the local structural information to complement each other. To the best of our knowledge, this is the first attempt at multi-view LDL. Numerical studies have demonstrated that MVSS-LDL attains explicitly better classification performance than the existing single-view LDL methods.
[LG-69] Joint Active RIS Configuration and User Power Control for Localization: A Neuroevolution-Based Approach
链接: https://arxiv.org/abs/2510.13819
作者: George Stamatelis,Hui Chen,Henk Wymeersch,George C. Alexandropoulos
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Submitted to an IEEE venue
Abstract:This paper studies user localization aided by a Reconfigurable Intelligent Surface (RIS). A feedback link from the Base Station (BS) to the user is adopted to enable dynamic power control of the user pilot transmissions in the uplink. A novel multi-agent algorithm for the joint control of the RIS phase configuration and the user transmit power is presented, which is based on a hybrid approach integrating NeuroEvolution (NE) and supervised learning. The proposed scheme requires only single-bit feedback messages for the uplink power control, supports RIS elements with discrete responses, and is numerically shown to outperform fingerprinting, deep reinforcement learning baselines and backpropagation-based position estimators.
[LG-70] Large Language Models for Real-World IoT Device Identification
链接: https://arxiv.org/abs/2510.13817
作者: Rameen Mahmood,Tousif Ahmed,Sai Teja Peddinti,Danny Yuxing Huang
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 8 pages, 3 figures
Abstract:The rapid expansion of IoT devices has outpaced current identification methods, creating significant risks for security, privacy, and network accountability. These challenges are heightened in open-world environments, where traffic metadata is often incomplete, noisy, or intentionally obfuscated. We introduce a semantic inference pipeline that reframes device identification as a language modeling task over heterogeneous network metadata. To construct reliable supervision, we generate high-fidelity vendor labels for the IoT Inspector dataset, the largest real-world IoT traffic corpus, using an ensemble of large language models guided by mutual-information and entropy-based stability scores. We then instruction-tune a quantized LLaMA3.18B model with curriculum learning to support generalization under sparsity and long-tail vendor distributions. Our model achieves 98.25% top-1 accuracy and 90.73% macro accuracy across 2,015 vendors while maintaining resilience to missing fields, protocol drift, and adversarial manipulation. Evaluation on an independent IoT testbed, coupled with explanation quality and adversarial stress tests, demonstrates that instruction-tuned LLMs provide a scalable and interpretable foundation for real-world device identification at scale.
[LG-71] Cascading Adversarial Bias from Injection to Distillation in Language Models
链接: https://arxiv.org/abs/2505.24842
作者: Harsh Chaudhari,Jamie Hayes,Matthew Jagielski,Ilia Shumailov,Milad Nasr,Alina Oprea
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.
[LG-72] A Geometric Approach to Optimal Experimental Design
链接: https://arxiv.org/abs/2510.14848
作者: Gavin Kerrigan,Christian A. Naesseth,Tom Rainforth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We introduce a novel geometric framework for optimal experimental design (OED). Traditional OED approaches, such as those based on mutual information, rely explicitly on probability densities, leading to restrictive invariance properties. To address these limitations, we propose the mutual transport dependence (MTD), a measure of statistical dependence grounded in optimal transport theory which provides a geometric objective for optimizing designs. Unlike conventional approaches, the MTD can be tailored to specific downstream estimation problems by choosing appropriate geometries on the underlying spaces. We demonstrate that our framework produces high-quality designs while offering a flexible alternative to standard information-theoretic techniques.
[LG-73] Fast and Scalable Score-Based Kernel Calibration Tests
链接: https://arxiv.org/abs/2510.14711
作者: Pierre Glaser,David Widmann,Fredrik Lindsten,Arthur Gretton
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages
Abstract:We introduce the Kernel Calibration Conditional Stein Discrepancy test (KCCSD test), a non-parametric, kernel-based test for assessing the calibration of probabilistic models with well-defined scores. In contrast to previous methods, our test avoids the need for possibly expensive expectation approximations while providing control over its type-I error. We achieve these improvements by using a new family of kernels for score-based probabilities that can be estimated without probability density samples, and by using a conditional goodness-of-fit criterion for the KCCSD test’s U-statistic. We demonstrate the properties of our test on various synthetic settings.
[LG-74] MCbiF: Measuring Topological Autocorrelation in Multiscale Clusterings via 2-Parameter Persistent Homology
链接: https://arxiv.org/abs/2510.14710
作者: Juni Schindler,Mauricio Barahona
类目: Algebraic Topology (math.AT); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:Datasets often possess an intrinsic multiscale structure with meaningful descriptions at different levels of coarseness. Such datasets are naturally described as multi-resolution clusterings, i.e., not necessarily hierarchical sequences of partitions across scales. To analyse and compare such sequences, we use tools from topological data analysis and define the Multiscale Clustering Bifiltration (MCbiF), a 2-parameter filtration of abstract simplicial complexes that encodes cluster intersection patterns across scales. The MCbiF can be interpreted as a higher-order extension of Sankey diagrams and reduces to a dendrogram for hierarchical sequences. We show that the multiparameter persistent homology (MPH) of the MCbiF yields a finitely presented and block decomposable module, and its stable Hilbert functions characterise the topological autocorrelation of the sequence of partitions. In particular, at dimension zero, the MPH captures violations of the refinement order of partitions, whereas at dimension one, the MPH captures higher-order inconsistencies between clusters across scales. We demonstrate through experiments the use of MCbiF Hilbert functions as topological feature maps for downstream machine learning tasks. MCbiF feature maps outperform information-based baseline features on both regression and classification tasks on synthetic sets of non-hierarchical sequences of partitions. We also show an application of MCbiF to real-world data to measure non-hierarchies in wild mice social grouping patterns across time.
[LG-75] Response to Discussions of “Causal and Counterfactual Views of Missing Data Models”
链接: https://arxiv.org/abs/2510.14694
作者: Razieh Nabi,Rohit Bhattacharya,Ilya Shpitser,James M. Robins
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We are grateful to the discussants, Levis and Kennedy [2025], Luo and Geng [2025], Wang and van der Laan [2025], and Yang and Kim [2025], for their thoughtful comments on our paper (Nabi et al., 2025). In this rejoinder, we summarize our main contributions and respond to each discussion in turn.
[LG-76] Parameter Identification for Partial Differential Equation with Jump Discontinuities in Coefficients by Markov Switching Model and Physics-Informed Machine Learning
链接: https://arxiv.org/abs/2510.14656
作者: Zhikun Zhang,Guanyu Pan,Xiangjun Wang,Yong Xu,Guangtao Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Inverse problems involving partial differential equations (PDEs) with discontinuous coefficients are fundamental challenges in modeling complex spatiotemporal systems with heterogeneous structures and uncertain dynamics. Traditional numerical and machine learning approaches often face limitations in addressing these problems due to high dimensionality, inherent nonlinearity, and discontinuous parameter spaces. In this work, we propose a novel computational framework that synergistically integrates physics-informed deep learning with Bayesian inference for accurate parameter identification in PDEs with jump discontinuities in coefficients. The core innovation of our framework lies in a dual-network architecture employing a gradient-adaptive weighting strategy: a main network approximates PDE solutions while a sub network samples its coefficients. To effectively identify mixture structures in parameter spaces, we employ Markovian dynamics methods to capture hidden state transitions of complex spatiotemporal systems. The framework has applications in reconstruction of solutions and identification of parameter-varying regions. Comprehensive numerical experiments on various PDEs with jump-varying coefficients demonstrate the framework’s exceptional adaptability, accuracy, and robustness compared to existing methods. This study provides a generalizable computational approach of parameter identification for PDEs with discontinuous parameter structures, particularly in non-stationary or heterogeneous systems.
[LG-77] Personalized federated learning Row-wise fusion regularization Multivariate modeling Sparse estimation
链接: https://arxiv.org/abs/2510.14413
作者: Runlin Zhou,Letian Li,Zemin Zheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study personalized federated learning for multivariate responses where client models are heterogeneous yet share variable-level structure. Existing entry-wise penalties ignore cross-response dependence, while matrix-wise fusion over-couples clients. We propose a Sparse Row-wise Fusion (SROF) regularizer that clusters row vectors across clients and induces within-row sparsity, and we develop RowFed, a communication-efficient federated algorithm that embeds SROF into a linearized ADMM framework with privacy-preserving partial participation. Theoretically, we establish an oracle property for SROF-achieving correct variable-level group recovery with asymptotic normality-and prove convergence of RowFed to a stationary solution. Under random client participation, the iterate gap contracts at a rate that improves with participation probability. Empirically, simulations in heterogeneous regimes show that RowFed consistently lowers estimation and prediction error and strengthens variable-level cluster recovery over NonFed, FedAvg, and a personalized matrix-fusion baseline. A real-data study further corroborates these gains while preserving interpretability. Together, our results position row-wise fusion as an effective and transparent paradigm for large-scale personalized federated multivariate learning, bridging the gap between entry-wise and matrix-wise formulations.
[LG-78] A novel Information-Driven Strategy for Optimal Regression Assessment
链接: https://arxiv.org/abs/2510.14222
作者: Benjamín Castro,Camilo Ramírez,Sebastián Espinosa,Jorge F. Silva,Marcos E. Orchard,Heraldo Rozas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In Machine Learning (ML), a regression algorithm aims to minimize a loss function based on data. An assessment method in this context seeks to quantify the discrepancy between the optimal response for an input-output system and the estimate produced by a learned predictive model (the student). Evaluating the quality of a learned regressor remains challenging without access to the true data-generating mechanism, as no data-driven assessment method can ensure the achievability of global optimality. This work introduces the Information Teacher, a novel data-driven framework for evaluating regression algorithms with formal performance guarantees to assess global optimality. Our novel approach builds on estimating the Shannon mutual information (MI) between the input variables and the residuals and applies to a broad class of additive noise models. Through numerical experiments, we confirm that the Information Teacher is capable of detecting global optimality, which is aligned with the condition of zero estimation error with respect to the – inaccessible, in practice – true model, working as a surrogate measure of the ground truth assessment loss and offering a principled alternative to conventional empirical performance metrics.
[LG-79] High-Dimensional BWDM: A Robust Nonparametric Clustering Validation Index for Large-Scale Data
链接: https://arxiv.org/abs/2510.14145
作者: Mohammed Baragilly,Hend Gabr
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Determining the appropriate number of clusters in unsupervised learning is a central problem in statistics and data science. Traditional validity indices such as Calinski-Harabasz, Silhouette, and Davies-Bouldin-depend on centroid-based distances and therefore degrade in high-dimensional or contaminated data. This paper proposes a new robust, nonparametric clustering validation framework, the High-Dimensional Between-Within Distance Median (HD-BWDM), which extends the recently introduced BWDM criterion to high-dimensional spaces. HD-BWDM integrates random projection and principal component analysis to mitigate the curse of dimensionality and applies trimmed clustering and medoid-based distances to ensure robustness against outliers. We derive theoretical results showing consistency and convergence under Johnson-Lindenstrauss embeddings. Extensive simulations demonstrate that HD-BWDM remains stable and interpretable under high-dimensional projections and contamination, providing a robust alternative to traditional centroid-based validation criteria. The proposed method provides a theoretically grounded, computationally efficient stopping rule for nonparametric clustering in modern high-dimensional applications.
[LG-80] deFOREST: Fusing Optical and Radar satellite data for Enhanced Sensing of Tree-loss
链接: https://arxiv.org/abs/2510.14092
作者: Julio Enrique Castrillon-Candas,Hanfeng Gu,Caleb Meredith,Yulin Li,Xiaojing Tang,Pontus Olofsson,Mark Kon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this paper we develop a deforestation detection pipeline that incorporates optical and Synthetic Aperture Radar (SAR) data. A crucial component of the pipeline is the construction of anomaly maps of the optical data, which is done using the residual space of a discrete Karhunen-Loève (KL) expansion. Anomalies are quantified using a concentration bound on the distribution of the residual components for the nominal state of the forest. This bound does not require prior knowledge on the distribution of the data. This is in contrast to statistical parametric methods that assume knowledge of the data distribution, an impractical assumption that is especially infeasible for high dimensional data such as ours. Once the optical anomaly maps are computed they are combined with SAR data, and the state of the forest is classified by using a Hidden Markov Model (HMM). We test our approach with Sentinel-1 (SAR) and Sentinel-2 (Optical) data on a 92.19,km \times 91.80,km region in the Amazon forest. The results show that both the hybrid optical-radar and optical only methods achieve high accuracy that is superior to the recent state-of-the-art hybrid method. Moreover, the hybrid method is significantly more robust in the case of sparse optical data that are common in highly cloudy regions.
[LG-81] Exact Dynamics of Multi-class Stochastic Gradient Descent
链接: https://arxiv.org/abs/2510.14074
作者: Elizabeth Collins-Woodfin,Inbar Seroussi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注: 58 pages, 12 figures
Abstract:We develop a framework for analyzing the training and learning rate dynamics on a variety of high- dimensional optimization problems trained using one-pass stochastic gradient descent (SGD) with data generated from multiple anisotropic classes. We give exact expressions for a large class of functions of the limiting dynamics, including the risk and the overlap with the true signal, in terms of a deterministic solution to a system of ODEs. We extend the existing theory of high-dimensional SGD dynamics to Gaussian-mixture data and a large (growing with the parameter size) number of classes. We then investigate in detail the effect of the anisotropic structure of the covariance of the data in the problems of binary logistic regression and least square loss. We study three cases: isotropic covariances, data covariance matrices with a large fraction of zero eigenvalues (denoted as the zero-one model), and covariance matrices with spectra following a power-law distribution. We show that there exists a structural phase transition. In particular, we demonstrate that, for the zero-one model and the power-law model with sufficiently large power, SGD tends to align more closely with values of the class mean that are projected onto the “clean directions” (i.e., directions of smaller variance). This is supported by both numerical simulations and analytical studies, which show the exact asymptotic behavior of the loss in the high-dimensional limit.
[LG-82] Dynamic SBI: Round-free Sequential Simulation-Based Inference with Adaptive Datasets
链接: https://arxiv.org/abs/2510.13997
作者: Huifang Lyu,James Alvey,Noemi Anau Montel,Mauro Pieroni,Christoph Weniger
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 5 figures, software available at: this https URL
Abstract:Simulation-based inference (SBI) is emerging as a new statistical paradigm for addressing complex scientific inference problems. By leveraging the representational power of deep neural networks, SBI can extract the most informative simulation features for the parameters of interest. Sequential SBI methods extend this approach by iteratively steering the simulation process towards the most relevant regions of parameter space. This is typically implemented through an algorithmic structure, in which simulation and network training alternate over multiple rounds. This strategy is particularly well suited for high-precision inference in high-dimensional settings, which are commonplace in physics applications with growing data volumes and increasing model fidelity. Here, we introduce dynamic SBI, which implements the core ideas of sequential methods in a round-free, asynchronous, and highly parallelisable manner. At its core is an adaptive dataset that is iteratively transformed during inference to resemble the target observation. Simulation and training proceed in parallel: trained networks are used both to filter out simulations incompatible with the data and to propose new, more promising ones. Compared to round-based sequential methods, this asynchronous structure can significantly reduce simulation costs and training overhead. We demonstrate that dynamic SBI achieves significant improvements in simulation and training efficiency while maintaining inference performance. We further validate our framework on two challenging astrophysical inference tasks: characterising the stochastic gravitational wave background and analysing strong gravitational lensing systems. Overall, this work presents a flexible and efficient new paradigm for sequential SBI.
[LG-83] Long-Term Spatio-Temporal Forecasting of Monthly Rainfall in West Bengal Using Ensemble Learning Approaches
链接: https://arxiv.org/abs/2510.13927
作者: Jishu Adhikary,Raju Maiti
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 25 pages, 22 figures
Abstract:Rainfall forecasting plays a critical role in climate adaptation, agriculture, and water resource management. This study develops long-term forecasts of monthly rainfall across 19 districts of West Bengal using a century-scale dataset spanning 1900-2019. Daily rainfall records are aggregated into monthly series, resulting in 120 years of observations for each district. The forecasting task involves predicting the next 108 months (9 years, 2011-2019) while accounting for temporal dependencies and spatial interactions among districts. To address the nonlinear and complex structure of rainfall dynamics, we propose a hierarchical modeling framework that combines regression-based forecasting of yearly features with multi-layer perceptrons (MLPs) for monthly prediction. Yearly features, such as annual totals, quarterly proportions, variability measures, skewness, and extremes, are first forecasted using regression models that incorporate both own lags and neighboring-district lags. These forecasts are then integrated as auxiliary inputs into an MLP model, which captures nonlinear temporal patterns and spatial dependencies in the monthly series. The results demonstrate that the hierarchical regression-MLP architecture provides robust long-term spatio-temporal forecasts, offering valuable insights for agriculture, irrigation planning, and water conservation strategies.
[LG-84] Switchboard-Affect: Emotion Perception Labels from Conversational Speech
链接: https://arxiv.org/abs/2510.13906
作者: Amrit Romana,Jaya Narain,Tien Dung Tran,Andrea Davis,Jason Fong,Ramya Rasipuram,Vikramjit Mitra
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 2025 13th International Conference on Affective Computing and Intelligent Interaction (ACII) this https URL
Abstract:Understanding the nuances of speech emotion dataset curation and labeling is essential for assessing speech emotion recognition (SER) model potential in real-world applications. Most training and evaluation datasets contain acted or pseudo-acted speech (e.g., podcast speech) in which emotion expressions may be exaggerated or otherwise intentionally modified. Furthermore, datasets labeled based on crowd perception often lack transparency regarding the guidelines given to annotators. These factors make it difficult to understand model performance and pinpoint necessary areas for improvement. To address this gap, we identified the Switchboard corpus as a promising source of naturalistic conversational speech, and we trained a crowd to label the dataset for categorical emotions (anger, contempt, disgust, fear, sadness, surprise, happiness, tenderness, calmness, and neutral) and dimensional attributes (activation, valence, and dominance). We refer to this label set as Switchboard-Affect (SWB-Affect). In this work, we present our approach in detail, including the definitions provided to annotators and an analysis of the lexical and paralinguistic cues that may have played a role in their perception. In addition, we evaluate state-of-the-art SER models, and we find variable performance across the emotion categories with especially poor generalization for anger. These findings underscore the importance of evaluation with datasets that capture natural affective variations in speech. We release the labels for SWB-Affect to enable further analysis in this domain.
[LG-85] DeepMartingale: Duality of the Optimal Stopping Problem with Expressivity
链接: https://arxiv.org/abs/2510.13868
作者: Junyan Ye,Hoi Ying Wong
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
*备注: 65 pages, 4 tables
Abstract:Using a martingale representation, we introduce a novel deep-learning approach, which we call DeepMartingale, to study the duality of discrete-monitoring optimal stopping problems in continuous time. This approach provides a tight upper bound for the primal value function, even in high-dimensional settings. We prove that the upper bound derived from DeepMartingale converges under very mild assumptions. Even more importantly, we establish the expressivity of DeepMartingale: it approximates the true value function within any prescribed accuracy \varepsilon under our architectural design of neural networks whose size is bounded by \tildec,D^\tildeq\varepsilon^-\tilder , where the constants \tildec, \tildeq, \tilder are independent of the dimension D and the accuracy \varepsilon . This guarantees that DeepMartingale does not suffer from the curse of dimensionality. Numerical experiments demonstrate the practical effectiveness of DeepMartingale, confirming its convergence, expressivity, and stability.
[LG-86] An Overview of the JPEG AI Learning-Based Image Coding Standard
链接: https://arxiv.org/abs/2510.13867
作者: Semih Esenlik,Yaojun Wu,Zhaobin Zhang,Ye-Kui Wang,Kai Zhang,Li Zhang,João Ascenso,Shan Liu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: IEEE Transactions on Circuits and Systems for Video Technology
Abstract:JPEG AI is an emerging learning-based image coding standard developed by Joint Photographic Experts Group (JPEG). The scope of the JPEG AI is the creation of a practical learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization and machine consumption. Scheduled for completion in early 2025, the first version of JPEG AI focuses on human vision tasks, demonstrating significant BD-rate reductions compared to existing standards, in terms of MS-SSIM, FSIM, VIF, VMAF, PSNR-HVS, IW-SSIM and NLPD quality metrics. Designed to ensure broad interoperability, JPEG AI incorporates various design features to support deployment across diverse devices and applications. This paper provides an overview of the technical features and characteristics of the JPEG AI standard.
[LG-87] Hybrid Deep Learning Approaches for Classifying Autism from Brain MRI
链接: https://arxiv.org/abs/2510.13841
作者: Ashley Chen
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 25 pages, 13 figures, 4 tables, 19 references
Abstract:Autism spectrum disorder (ASD) is most often diagnosed using behavioral evaluations, which can vary between clinicians. Brain imaging, combined with machine learning, may help identify more objective patterns linked to ASD. This project used magnetic resonance imaging (MRI) data from the publicly available ABIDE I dataset (n = 1,112) to test two approaches for classifying ASD and control participants. The first was a 3D convolutional neural network (CNN) trained end-to-end. The second was a hybrid approach that used the CNN as a feature extractor and then applied a support vector machine (SVM) classifier. The baseline CNN reached moderate performance (accuracy = 0.66, AUC = 0.70), while the hybrid CNN + SVM achieved higher overall accuracy (0.76) and AUC (0.80). The hybrid model also produced more balanced results between ASD and control groups. Separating feature extraction and classification improved performance and reduced bias between diagnostic groups. These findings suggest that combining deep learning and traditional machine learning methods could enhance the reliability of MRI-based research on ASD.
信息检索
[IR-0] Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report
链接: https://arxiv.org/abs/2510.14880
作者: Rikiya Takehi,Benjamin Clavié,Sean Lee,Aamir Shakir
类目: Information Retrieval (cs.IR)
*备注:
Abstract:In this work, we introduce mxbai-edge-colbert-v0 models, at two different parameter counts: 17M and 32M. As part of our research, we conduct numerous experiments to improve retrieval and late-interaction models, which we intend to distill into smaller models as proof-of-concepts. Our ultimate aim is to support retrieval at all scales, from large-scale retrieval which lives in the cloud to models that can run locally, on any device. mxbai-edge-colbert-v0 is a model that we hope will serve as a solid foundation backbone for all future experiments, representing the first version of a long series of small proof-of-concepts. As part of the development of mxbai-edge-colbert-v0, we conducted multiple ablation studies, of which we report the results. In terms of downstream performance, mxbai-edge-colbert-v0 is a particularly capable small model, outperforming ColBERTv2 on common short-text benchmarks (BEIR) and representing a large step forward in long-context tasks, with unprecedented efficiency.
[IR-1] A Simulation Framework for Studying Systemic Effects of Feedback Loops in Recommender Systems
链接: https://arxiv.org/abs/2510.14857
作者: Gabriele Barlacchi,Margherita Lalli,Emanuele Ferragina,Fosca Giannotti,Luca Pappalardo
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注: 12 pages, 4 figures
Abstract:Recommender systems continuously interact with users, creating feedback loops that shape both individual behavior and collective market dynamics. This paper introduces a simulation framework to model these loops in online retail environments, where recommenders are periodically retrained on evolving user-item interactions. Using the Amazon e-Commerce dataset, we analyze how different recommendation algorithms influence diversity, purchase concentration, and user homogenization over time. Results reveal a systematic trade-off: while the feedback loop increases individual diversity, it simultaneously reduces collective diversity and concentrates demand on a few popular items. Moreover, for some recommender systems, the feedback loop increases user homogenization over time, making user purchase profiles increasingly similar. These findings underscore the need for recommender designs that balance personalization with long-term diversity.
[IR-2] Dataset Pruning in RecSys and ML: Best Practice or Mal-Practice?
链接: https://arxiv.org/abs/2510.14704
作者: Leonie Winter
类目: Information Retrieval (cs.IR)
*备注: 69 pages, 14 figures
Abstract:Offline evaluations in recommender system research depend heavily on datasets, many of which are pruned, such as the widely used MovieLens collections. This thesis examines the impact of data pruning - specifically, removing users with fewer than a specified number of interactions - on both dataset characteristics and algorithm performance. Five benchmark datasets were analysed in both their unpruned form and at five successive pruning levels (5, 10, 20, 50, 100). For each coreset, we examined structural and distributional characteristics and trained and tested eleven representative algorithms. To further assess if pruned datasets lead to artificially inflated performance results, we also evaluated models trained on the pruned train sets but tested on unpruned data. Results show that commonly applied core pruning can be highly selective, leaving as little as 2% of the original users in some datasets. Traditional algorithms achieved higher nDCG@10 scores when both training and testing on pruned data; however, this advantage largely disappeared when evaluated on unpruned test sets. Across all algorithms, performance declined with increasing pruning levels when tested on unpruned data, highlighting the impact of dataset reduction on the performance of recommender algorithms.
[IR-3] MR.Rec: Synergizing Memory and Reasoning for Personalized Recommendation Assistant with LLM s
链接: https://arxiv.org/abs/2510.14629
作者: Jiani Huang,Xingchen Zou,Lianghao Xia,Qing Li
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The application of Large Language Models (LLMs) in recommender systems faces key challenges in delivering deep personalization and intelligent reasoning, especially for interactive scenarios. Current methods are often constrained by limited context windows and single-turn reasoning, hindering their ability to capture dynamic user preferences and proactively reason over recommendation contexts. To address these limitations, we propose this http URL, a novel framework that synergizes memory and reasoning for LLM-based recommendations. To achieve personalization, we develop a comprehensive Retrieval-Augmented Generation (RAG) system that efficiently indexes and retrieves relevant external memory to enhance LLM personalization capabilities. Furthermore, to enable the synergy between memory and reasoning, our RAG system goes beyond conventional query-based retrieval by integrating reasoning enhanced memory retrieval. Finally, we design a reinforcement learning framework that trains the LLM to autonomously learn effective strategies for both memory utilization and reasoning refinement. By combining dynamic memory retrieval with adaptive reasoning, this approach ensures more accurate, context-aware, and highly personalized recommendations. Extensive experiments demonstrate that this http URL significantly outperforms state-of-the-art baselines across multiple metrics, validating its efficacy in delivering intelligent and personalized recommendations. We will release code and data upon paper notification.
[IR-4] Ensembling Multiple Hallucination Detectors Trained on VLLM Internal Representations KDD
链接: https://arxiv.org/abs/2510.14330
作者: Yuto Nakamizo,Ryuhei Miyazato,Hikaru Tanabe,Ryuta Yamakura,Kiori Hatanaka
类目: Information Retrieval (cs.IR)
*备注: 5th place solution at Meta KDD Cup 2025
Abstract:This paper presents the 5th place solution by our team, y3h2, for the Meta CRAG-MM Challenge at KDD Cup 2025. The CRAG-MM benchmark is a visual question answering (VQA) dataset focused on factual questions about images, including egocentric images. The competition was contested based on VQA accuracy, as judged by an LLM-based automatic evaluator. Since incorrect answers result in negative scores, our strategy focused on reducing hallucinations from the internal representations of the VLM. Specifically, we trained logistic regression-based hallucination detection models using both the hidden_state and the outputs of specific attention heads. We then employed an ensemble of these models. As a result, while our method sacrificed some correct answers, it significantly reduced hallucinations and allowed us to place among the top entries on the final leaderboard. For implementation details and code, please refer to this https URL.
[IR-5] Large Reasoning Embedding Models: Towards Next-Generation Dense Retrieval Paradigm
链接: https://arxiv.org/abs/2510.14321
作者: Jianting Tang,Dongshuai Li,Tao Wen,Fuyu Lv,Dan Ou,Linli Xu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:In modern e-commerce search systems, dense retrieval has become an indispensable component. By computing similarities between query and item (product) embeddings, it efficiently selects candidate products from large-scale repositories. With the breakthroughs in large language models (LLMs), mainstream embedding models have gradually shifted from BERT to LLMs for more accurate text modeling. However, these models still adopt direct-embedding methods, and the semantic accuracy of embeddings remains inadequate. Therefore, contrastive learning is heavily employed to achieve tight semantic alignment between positive pairs. Consequently, such models tend to capture statistical co-occurrence patterns in the training data, biasing them toward shallow lexical and semantic matches. For difficult queries exhibiting notable lexical disparity from target items, the performance degrades significantly. In this work, we propose the Large Reasoning Embedding Model (LREM), which novelly integrates reasoning processes into representation learning. For difficult queries, LREM first conducts reasoning to achieve a deep understanding of the original query, and then produces a reasoning-augmented query embedding for retrieval. This reasoning process effectively bridges the semantic gap between original queries and target items, significantly improving retrieval accuracy. Specifically, we adopt a two-stage training process: the first stage optimizes the LLM on carefully curated Query-CoT-Item triplets with SFT and InfoNCE losses to establish preliminary reasoning and embedding capabilities, and the second stage further refines the reasoning trajectories via reinforcement learning (RL). Extensive offline and online experiments validate the effectiveness of LREM, leading to its deployment on China’s largest e-commerce platform since August 2025.
[IR-6] Synergistic Integration and Discrepancy Resolution of Contextualized Knowledge for Personalized Recommendation
链接: https://arxiv.org/abs/2510.14257
作者: Lingyu Mu,Hao Deng,Haibo Xing,Kaican Lin,Zhitong Zhu,Yu Zhang,Xiaoyi Zeng,Zhengxiao Liu,Zheng Lin,Jinxin Hu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The integration of large language models (LLMs) into recommendation systems has revealed promising potential through their capacity to extract world knowledge for enhanced reasoning capabilities. However, current methodologies that adopt static schema-based prompting mechanisms encounter significant limitations: (1) they employ universal template structures that neglect the multi-faceted nature of user preference diversity; (2) they implement superficial alignment between semantic knowledge representations and behavioral feature spaces without achieving comprehensive latent space integration. To address these challenges, we introduce CoCo, an end-to-end framework that dynamically constructs user-specific contextual knowledge embeddings through a dual-mechanism approach. Our method realizes profound integration of semantic and behavioral latent dimensions via adaptive knowledge fusion and contradiction resolution modules. Experimental evaluations across diverse benchmark datasets and an enterprise-level e-commerce platform demonstrate CoCo’s superiority, achieving a maximum 8.58% improvement over seven cutting-edge methods in recommendation accuracy. The framework’s deployment on a production advertising system resulted in a 1.91% sales growth, validating its practical effectiveness. With its modular design and model-agnostic architecture, CoCo provides a versatile solution for next-generation recommendation systems requiring both knowledge-enhanced reasoning and personalized adaptation.