本篇博文主要内容为 2025-10-10 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-10-10)
今日共更新689篇论文,其中:
- 自然语言处理共138篇(Computation and Language (cs.CL))
- 人工智能共219篇(Artificial Intelligence (cs.AI))
- 计算机视觉共132篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共212篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)基准测试中因预训练语料库数据泄露导致的评估有效性下降问题,即模型可能通过记忆而非真正泛化能力获得高分,从而扭曲跨模型比较并误导技术进展判断。解决方案的关键在于提出 ArenaBencher——一个模型无关的自动基准演化框架,其核心机制包括:基于多模型反馈推断测试用例的核心能力、生成保持原目标一致的候选题解对、利用大语言模型作为裁判验证正确性与意图,并通过迭代式上下文示范引导生成更具挑战性和诊断性的新测试用例,从而在保持可比性的前提下持续提升基准的公平性、难度和区分度。
链接: https://arxiv.org/abs/2510.08569
作者: Qin Liu,Jacob Dineen,Yuxi Huang,Sheng Zhang,Hoifung Poon,Ben Zhou,Muhao Chen
机构: University of California, Davis (加州大学戴维斯分校); Arizona State University (亚利桑那州立大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
Abstract:Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
zh
[NLP-1] MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
【速读】: 该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在复杂推理与决策任务中因高质量多模态轨迹数据稀缺及人工标注成本高昂而导致的性能瓶颈问题。其解决方案的关键在于提出一种以视觉为中心的代理调优框架,通过自动合成多模态轨迹、生成分步偏好对,并基于此对VLM控制器进行端到端优化。具体而言,作者构建了M-TRACE数据集(28.5K多模态任务,177K验证轨迹),用于模仿学习驱动的轨迹调优;在此基础上开发出MATRIX Agent,一个在M-TRACE上微调的工具使用推理控制器;进一步引入Pref-X(11K自动生成的偏好对),采用分步偏好学习策略对MATRIX进行精细对齐。实验证明,该方法在Agent-X、GTA和GAIA三个基准上均显著优于开源与闭源VLM,展现出可扩展且高效的多模态工具使用能力。
链接: https://arxiv.org/abs/2510.08567
作者: Tajamul Ashraf,Umair Nawaz,Abdelrahman M. Shaker,Rao Anwer,Philip Torr,Fahad Shahbaz Khan,Salman Khan
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at this https URL.
zh
[NLP-2] Agent Learning via Early Experience
【速读】: 该论文旨在解决语言代理(language agent)在复杂现实任务中依赖专家数据进行监督微调(supervised fine-tuning)所带来的可扩展性差和泛化能力弱的问题。当前方法受限于专家演示的场景狭窄与环境多样性不足,难以实现自主学习与持续改进。解决方案的关键在于引入“早期经验”(early experience)范式,即代理通过自身行动生成交互数据,利用未来状态作为无奖励信号的监督信号。该范式下包含两种策略:(1) 隐式世界建模(implicit world modeling),通过收集的状态数据将策略锚定于环境动态;(2) 自我反思(self-reflection),使代理从次优行为中学习以提升推理与决策能力。实验表明,该方法在多个环境和模型家族中均显著提升有效性与跨域泛化性能,并为后续强化学习提供良好基础。
链接: https://arxiv.org/abs/2510.08558
作者: Kai Zhang,Xiangchao Chen,Bo Liu,Tianci Xue,Zeyi Liao,Zhihan Liu,Xiyao Wang,Yuting Ning,Zhaorun Chen,Xiaohan Fu,Jian Xie,Yuxuan Sun,Boyu Gou,Qi Qi,Zihang Meng,Jianwei Yang,Ning Zhang,Xian Li,Ashish Shah,Dat Huynh,Hengduo Li,Zi Yang,Sara Cao,Lawrence Jang,Shuyan Zhou,Jiacheng Zhu,Huan Sun,Jason Weston,Yu Su,Yifan Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Work in progress
Abstract:A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent’s own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
zh
[NLP-3] VideoNorms: Benchmarking Cultural Awareness of Video Language Models
【速读】: 该论文旨在解决视频大语言模型(Video Large Language Models, VideoLLMs)在跨文化场景下缺乏对社会文化规范的理解与 grounded grounding 问题,从而影响其在真实世界应用中的准确性与可靠性。为应对这一挑战,作者提出 VideoNorms 基准数据集,包含超过 1000 对(视频片段,规范)的标注样本,覆盖美国与中国的文化背景,并基于言语行为理论(Speech Act Theory)标注了规范遵守与违反标签、以及言语和非言语证据。解决方案的关键在于采用人机协作框架:由一个基于理论提示的教师模型生成候选标注,再由训练有素的人类专家进行验证与修正,确保标注质量;该框架不仅构建了高质量基准,也为未来开发更具文化敏感性的 VideoLLMs 提供了可扩展的方法论支持。
链接: https://arxiv.org/abs/2510.08543
作者: Nikhil Reddy Varimalla,Yunfei Xu,Arkadiy Saakyan,Meng Fan Wang,Smaranda Muresan
机构: Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 24 pages, 5 figures, under review
Abstract:As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models’ cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violations labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset which highlight several common trends: 1) models performs worse on norm violation than adherence; 2) models perform worse w.r.t Chinese culture compared to the US culture; 3) models have more difficulty in providing non-verbal evidence compared to verbal for the norm adhere/violation label and struggle to identify the exact norm corresponding to a speech-act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.
zh
[NLP-4] SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在空间推理(spatial reasoning)任务中表现不稳定的问题,其核心挑战在于现有方法缺乏从感知到理解的层级结构基础,导致难以实现鲁棒的空间智能。解决方案的关键在于提出一种分阶段渐进式训练框架:首先通过对象定位建立空间感知能力,继而通过多维空间任务发展空间理解,最后借助可验证奖励机制的强化学习提升复杂推理能力;该方法依托于自建的SpatialLadder-26k多模态数据集(涵盖目标定位、单图、多视角及视频等任务),最终构建出3B参数的SpatialLadder模型,在多个基准测试中显著优于主流模型(如GPT-4o和Gemini-2.0-Flash),并展现出更强的跨域泛化性能,验证了“由感知到推理”的渐进式训练路径对构建稳健空间智能的重要性。
链接: https://arxiv.org/abs/2510.08531
作者: Hongxing Li,Dingming Li,Zixuan Wang,Yuchen Yan,Hang Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL Code: this https URL
Abstract:Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
zh
[NLP-5] CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在预训练后如何实现持续自我进化的问题。现有方法多依赖外部密集奖励信号或从LLM自身提取内在奖励,但这些方式与人类智能中通过相互讨论和协作实现学习的机制不一致。其解决方案的关键在于提出一种名为协同进化多智能体系统(Co-Evolving Multi-Agent Systems, CoMAS)的新框架:该框架通过智能体间的交互生成丰富的内在奖励信号,并利用“LLM作为裁判”(LLM-as-a-judge)机制对这些信号进行形式化,再结合强化学习(Reinforcement Learning, RL)优化每个智能体的策略,从而实现无需外部监督的去中心化、可扩展的协同进化。
链接: https://arxiv.org/abs/2510.08529
作者: Xiangyuan Xue,Yifan Zhou,Guibin Zhang,Zaibin Zhang,Yijiang Li,Chen Zhang,Zhenfei Yin,Philip Torr,Wanli Ouyang,Lei Bai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent’s policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
zh
[NLP-6] Which Heads Matter for Reasoning ? RL-Guided KV Cache Compression
【速读】: 该论文旨在解决推理型大语言模型(Reasoning Large Language Models)在解码阶段因长链式思维(Chain-of-Thought, CoT)生成导致的Key-Value (KV) 缓存开销过大的问题。现有KV缓存压缩方法在推理模型上表现不佳:token-dropping方法会破坏推理完整性,而head-reallocating方法因误将推理关键头压缩而导致性能显著下降。解决方案的关键在于提出RLKV框架,通过强化学习直接优化每个注意力头的缓存使用与推理质量之间的关系,从而识别出对推理行为至关重要的头部;随后仅对这些关键头部分配完整KV缓存,其余头部采用压缩常量KV缓存策略,实现高效推理。实验表明,仅有少量注意力头对推理至关重要,该方法可在保持近无损性能的同时实现20%-50%的缓存压缩率。
链接: https://arxiv.org/abs/2510.08525
作者: Wenjie Du,Li Jiang,Keda Tao,Xue Liu,Huan Wang
机构: Westlake University (西湖大学); McGill University (麦吉尔大学); Mila; Zhejiang University (浙江大学); MBZUAI
类目: Computation and Language (cs.CL)
备注:
Abstract:Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models-some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head’s cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near lossless performance compared to uncompressed results.
zh
[NLP-7] Efficient Prompt Optimisation for Legal Text Classification with Proxy Prompt Evaluator EMNLP2025
【速读】: 该论文旨在解决生成式 AI(Generative AI)在特定法律自然语言处理任务中,如Terms of Service(ToS)条款公平性检测时,因提示词(prompt)设计不当而导致性能受限的问题。现有提示优化方法普遍存在计算成本高、搜索策略低效及候选提示评分代价昂贵等瓶颈。其解决方案的关键在于提出一种融合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)与代理提示评估器(proxy prompt evaluator)的框架,通过高效探索提示空间并显著降低评估开销,在有限计算预算下实现更高的分类准确率和优化效率。
链接: https://arxiv.org/abs/2510.08524
作者: Hyunji Lee,Kevin Chenhao Li,Matthias Grabmair,Shanshan Xu
机构: Technical University of Munich (慕尼黑工业大学); University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL)
备注: Accepted at NLLP@EMNLP 2025
Abstract:Prompt optimization aims to systematically refine prompts to enhance a language model’s performance on specific tasks. Fairness detection in Terms of Service (ToS) clauses is a challenging legal NLP task that demands carefully crafted prompts to ensure reliable results. However, existing prompt optimization methods are often computationally expensive due to inefficient search strategies and costly prompt candidate scoring. In this paper, we propose a framework that combines Monte Carlo Tree Search (MCTS) with a proxy prompt evaluator to more effectively explore the prompt space while reducing evaluation costs. Experiments demonstrate that our approach achieves higher classification accuracy and efficiency than baseline methods under a constrained computation budget.
zh
[NLP-8] CaRT: Teaching LLM Agents to Know When They Know Enough
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在需要多轮交互获取信息的任务中,如何判断何时停止信息收集并作出决策的问题,即“终止决策”问题。当前模型常因过度思考或信息冗余导致效率低下甚至决策偏离目标。解决方案的关键在于提出Counterfactuals and Reasoning for Termination (CaRT)方法,通过引入反事实轨迹对(counterfactual pairs of trajectories)——一组应终止的轨迹与仅微小修改后不应终止的轨迹——来训练模型识别终止时机;同时利用自然语言推理机制让模型显式解释终止理由,从而将终止决策能力以可解释的方式嵌入基础LLM中,显著提升任务效率和成功率。
链接: https://arxiv.org/abs/2510.08517
作者: Grace Liu,Yuxiao Qu,Jeff Schneider,Aarti Singh,Aviral Kumar
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Many tasks require learned models to strategically gather relevant information over multiple rounds of interaction before actually acting on a task. Strategic information gathering requires models to know not only how to effectively acquire information, but also when to stop gathering information and make a decision, in order to avoid overthinking or getting derailed when acting. In this paper, we formalize this problem and introduce Counterfactuals and Reasoning for Termination (CaRT), an approach for teaching LLMs when to stop seeking information. To appropriately learn when to terminate, CaRT fine-tunes LLMs using counterfactual pairs of trajectories, one where termination is appropriate and a minimally modified version of the same trajectory where it is not. It trains the LLM to explain the rationale for the termination decision in either case via verbal reasoning, and imbues this capability into the base LLM via fine-tuning. We instantiate CaRT in two domains: interactive medical diagnosis and math problem solving. In both domains, we find that CaRT improves the efficiency of information gathering and task success rate compared to other fine-tuning methods.
zh
[NLP-9] SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks
【速读】: 该论文旨在解决大规模预训练模型中参数高效微调(Parameter Efficient Fine Tuning, PEFT)的理论基础不足问题,即为何仅微调随机选取的小规模子网络(slices)即可实现下游任务的有效适配。其解决方案的关键在于提出“通用优胜切片假设”(Universal Winning Slice Hypothesis),该假设基于两个核心现象:一是谱平衡性(spectral balance),即不同权重矩阵切片的特征谱分布高度相似;二是高任务能量(high task energy),即模型主干表示保留丰富的任务相关特征。由此启发设计出SliceFine方法,通过仅更新原始权重中的选定切片来实现零新增参数的微调,从而在保持与主流PEFT方法相当性能的同时,显著提升训练速度、内存效率和模型紧凑性。
链接: https://arxiv.org/abs/2510.08513
作者: Md Kowsher,Ali O. Polat,Ehsan Mohammady Ardehaly,Mehrdad Salehi,Zia Ghiasi,Prasanth Murali,Chen Chen
机构: Meta; University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:This paper presents a theoretical framework explaining why fine tuning small, randomly selected subnetworks (slices) within pre trained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy their backbone representations retain rich, task relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter efficient fine tuning (PEFT) in large scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state of the art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.
zh
[NLP-10] AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在机器学习工程(Machine Learning Engineering, MLE)场景下,如AutoML和Kaggle竞赛中,因缺乏细粒度领域先验知识、搜索策略受限于线性或树状结构而导致的性能瓶颈问题。现有方法难以利用历史完整轨迹或跨分支信息共享,限制了模型的自进化能力与搜索空间多样性。其解决方案的关键在于提出AutoMLGen——一个集成领域知识库以提供高质量先验指导,并采用蒙特卡洛图搜索(Monte Carlo Graph Search, MCGS)实现高效探索的LLM编码代理。MCGS在保留蒙特卡洛树搜索(MCTS)树状引导探索的同时,引入图结构扩展机制,支持动态路径重组、历史轨迹复用及多解融合,从而增强自进化与协同学习能力;结合细粒度操作符集合设计,显著提升稳定性并加速收敛。
链接: https://arxiv.org/abs/2510.08511
作者: Shangheng Du,Xiangchao Yan,Dengyang Jiang,Jiakang Yuan,Yusong Hu,Xin Li,Liang He,Bo Zhang,Lei Bai
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) have shown impressive performance in general programming tasks. However, in Machine Learning Engineering (MLE) scenarios such as AutoML and Kaggle competitions, achieving high performance depends heavily on expert intervention and repeated adjustments rather than simply generating correct code. When applied directly to these tasks, LLMs often lack fine-grained domain priors, and existing MLE approaches that use linear or tree-structured searches limit knowledge transfer to adjacent hierarchical links. As a result, they cannot leverage past full trajectories or share information across branches, limiting self-evolving ability and search space diversity. To address these limitations, we introduce AutoMLGen, an LLM-based coding agent that integrates a domain knowledge base for high-quality prior guidance and Monte Carlo Graph Search (MCGS) for efficient exploration. MCGS retains the tree-guided exploration of MCTS while embedding a graph structure into the expansion stage to enable dynamic path reorganization, historical trajectory reuse, and multi-solution fusion to support both self-evolution and collaborative learning. Combined with fine-grained operator sets, this design improves stability and accelerates convergence. Evaluation on the MLE-Bench shows that AutoMLGen achieves state-of-the-art performance in numerous dimensions, such as the average medal rate and the valid submission rate, under a 12-hour budget (half the standard runtime). The code is available at this https URL.
zh
[NLP-11] o Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)中视觉编码器(ViT)生成的高范数视觉令牌(即ViT attention sinks)在跨模态推理过程中被忽视的问题,这些令牌实际上承载了图像中的高层语义信息,对提升模型理解与推理能力至关重要。解决方案的关键在于识别并利用这些ViT attention sinks——通过定性与定量分析揭示其蕴含的信息价值,并提出无需训练和基于训练的两种策略来增强大语言模型(LLM)对这些高重要性视觉令牌的感知与利用效率,从而显著提升多个LVLM架构在视觉推理任务上的性能表现。
链接: https://arxiv.org/abs/2510.08510
作者: Jiayun Luo,Wan-Cyuan Fan,Lyuyang Wang,Xiangteng He,Tanzila Rahman,Purang Abolmaesumi,Leonid Sigal
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute for AI (人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint. Project page: this https URL
Abstract:Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end – the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core – the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks – a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
zh
[NLP-12] Neologism Learning for Controllability and Self-Verbalization
【速读】: 该论文试图解决如何通过引入新词(neologism)来更有效地理解和控制大语言模型(Large Language Models, LLMs)的行为这一问题。其核心挑战在于,现有方法难以在不修改模型参数的前提下,精准引导模型生成特定语义或行为模式。解决方案的关键在于:引入一个全新的词嵌入(word embedding),并通过包含目标概念的示例进行微调训练,同时保持模型其他参数不变。这种方法不仅实现了对诸如奉承、错误回答、文本长度等简单概念以及AxBench中复杂概念的有效控制,还揭示了模型可以通过“自言自语”(self-verbalization)方式用自然语言解释新词含义,从而增强对模型内部表征的理解;进一步地,作者提出“插件评估”(plug-in evaluation)机制验证这些解释的有效性,并发现部分机器专属同义词(machine-only synonyms)的存在,表明人类与模型对词语意义的理解可能存在差异。
链接: https://arxiv.org/abs/2510.08506
作者: John Hewitt,Oyvind Tafjord,Robert Geirhos,Been Kim
机构: Google DeepMind(谷歌深度思维)
类目: Computation and Language (cs.CL)
备注:
Abstract:Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training with examples that exhibit the concept with no other changes in model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, like explaining that a word that represents a concept of incorrect answers means ``a lack of complete, coherent, or meaningful answers…‘’ To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
zh
[NLP-13] DeepPrune: Parallel Scaling without Inter-trace Redundancy
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在采用并行链式思维(Chain-of-Thought, CoT)推理时存在的计算效率低下问题,即由于不同推理路径之间存在高度冗余(超过80%的并行推理轨迹最终得出相同答案),导致大量无效计算浪费。解决方案的关键在于提出DeepPrune框架,其核心包括两个组成部分:一是基于焦点损失(focal loss)和过采样技术训练的专用判别模型,能够从部分推理轨迹中准确预测最终答案是否等价(AUROC达0.87);二是在线贪婪聚类算法,在保持答案多样性的同时动态剪枝冗余路径。该方法在多个基准测试中实现超过80%的Token节省,同时精度仅下降不超过3个百分点,显著提升了并行推理的效率与实用性。
链接: https://arxiv.org/abs/2510.08483
作者: Shangqing Tu,Yaxuan Li,Yushi Bai,Lei Hou,Juanzi Li
机构: Tsinghua University (清华大学); ShanghaiTech University (上海科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, please check out the project page: this https URL
Abstract:Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy – our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces which realizes 0.87 AUROC on equivalence prediction, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves remarkable token reduction by over 80% compared to conventional consensus sampling on most cases, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: this https URL
zh
[NLP-14] he Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
【速读】: 该论文旨在解决视觉-语言模型(VLMs)在动态人体运动中识别和建模手语中形义映射(即象似性,iconicity)的能力不足问题,尤其关注如何从视频中恢复手语的音位形式(如手形、位置)、透明度(由视觉形式推断意义)以及象似性等级。其解决方案的关键在于提出一个名为“视觉象似性挑战”(Visual Iconicity Challenge)的新颖视频基准,该基准将心理语言学测量方法适配至三个任务:音位形式预测、透明度推理和象似性评分,并通过对比13个前沿VLMs在荷兰手语(Sign Language of the Netherlands)上的零样本与少样本表现,揭示当前模型在象似性理解上的局限性。研究发现,具有更强音位形式预测能力的模型更接近人类象似性判断,表明视觉接地结构的共享敏感性是提升多模态模型视觉 grounding 能力的关键路径。
链接: https://arxiv.org/abs/2510.08482
作者: Onur Keleş,Aslı Özyürek,Gerardo Ortega,Kadir Gökgö,Esam Ghaleb
机构: Max Planck Institute for Psycholinguistics (马普所心理语言学研究所); Boğaziçi University (博阿齐奇大学); Radboud University (拉德堡德大学); University of Birmingham (伯明翰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the \textitVisual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On \textitphonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on \textittransparency, they are far from human baselines; and only top models correlate moderately with human \textiticonicity ratings. Interestingly, \textitmodels with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
zh
[NLP-15] Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling EMNLP2025
【速读】: 该论文旨在解决在认知上合理规模的数据条件下训练视觉-语言模型时,如何高效融合多模态信息的问题。其核心挑战在于有限数据下如何提升模型对视觉与语言线索的利用效率,并增强可解释性。解决方案的关键在于提出一种轻量级解码器架构,包含三个创新模块:(1) 基于token级别的动态门控机制(token-wise dynamic gating),实现对语言和视觉线索的自适应融合;(2) 特征调制与通道注意力机制,最大化有限视觉信息的利用率;(3) 辅助对比学习目标,强化视觉定位能力。实验表明,该方法在五个基准测试中表现优异,且动态门控能自动发现可解释模式——即内容词偏好视觉线索、功能词倾向语言线索,证明其在资源受限场景下兼具性能与可解释性的优势。
链接: https://arxiv.org/abs/2510.08470
作者: Bianca-Mihaela Ganescu,Suchir Salhan,Andrew Caines,Paula Buttery
机构: ALTA Institute; University of Cambridge (剑桥大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to the EMNLP 2025 BabyLM Workshop
Abstract:Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
zh
[NLP-16] LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task EMNLP2025
【速读】: 该论文旨在解决当前人工智能(AI)模型在处理自然语言理解任务时缺乏对人类判断变异性和分歧性认知的问题。传统方法通常假设人类标注是统一且确定的,忽视了真实场景中不同 annotator 对同一输入可能产生差异性判断的现象。为应对这一挑战,研究者提出了 LEWIDI 系列共享任务的第三版,其核心解决方案在于:一是将基准测试扩展至四个新任务(释义识别、讽刺检测、反语检测和自然语言推理),并引入类别标签与序数标签相结合的标注方案;二是采用两种互补的评估范式——软标签(soft-label)范式预测群体层面的判断分布,以及视角主义(perspectivist)范式预测个体标注者的解释;三是设计并验证了针对这两种范式的新型评估指标,超越传统的交叉熵等标准指标。这些改进显著提升了对 AI 模型“感知分歧能力”的评估精度,并推动了更具鲁棒性和可解释性的 disagreement-aware 技术的发展。
链接: https://arxiv.org/abs/2510.08460
作者: Elisa Leonardelli,Silvia Casola,Siyao Peng,Giulia Rizzi,Valerio Basile,Elisabetta Fersini,Diego Frassinelli,Hyewon Jang,Maja Pavlovic,Barbara Plank,Massimo Poesio
机构: Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会); LMU Munich & MCML(慕尼黑路德维希马克西米利安大学与机器学习研究中心); Università Milano Bicocca(米兰比科卡大学); Università di Torino(都灵大学); University of Gothenburg(哥德堡大学); Queen Mary University of London(伦敦玛丽女王大学); Utrecht University(乌得勒支大学)
类目: Computation and Language (cs.CL)
备注: 14 pages; LeWiDi-2025 shared task description paper at NLPerspective workshop at EMNLP 2025
Abstract:Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated as per their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy, and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods to modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.
zh
[NLP-17] ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping
【速读】: 该论文旨在解决多模态大推理模型(Multimodal Large Reasoning Models, MLRMs)在推理过程中存在的效率失衡问题:即对简单任务过度思考、产生冗长的推理路径,而对复杂任务探索不足,导致漏解。解决方案的关键在于提出一个名为ARES的自适应推理框架,其核心创新是基于高窗口熵(High Window-Entropy, HWE)token设计动态探索机制——HWE能可靠捕捉推理关键节点,且减少HWE使用有利于简单任务,增加则对解决难题至关重要;ARES进一步采用两阶段训练策略:第一阶段通过难度感知的推理轨迹数据实现模型的自适应冷启动,第二阶段引入自适应熵策略优化(Adaptive Entropy Policy Optimization, AEPO),利用HWE作为探索触发信号,并结合分层熵奖励与动态KL控制来决定探索强度,从而实现推理资源在不同难度任务间的智能分配。
链接: https://arxiv.org/abs/2510.08457
作者: Shuang Chen,Yue Guo,Yimeng Ye,Shijue Huang,Wenbo Hu,Haoxi Li,Manyuan Zhang,Jiayu Chen,Song Guo,Nanyun Peng
机构: University of California, Los Angeles (加州大学洛杉矶分校); The Hong Kong University of Science and Technology (香港科技大学); Columbia University (哥伦比亚大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.
zh
[NLP-18] xRouter: Training Cost-Aware LLM s Orchestration System via Reinforcement Learning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)部署中面临的成本-性能权衡问题:高端模型虽推理能力强但成本高,轻量模型成本低却在复杂任务上表现脆弱。传统静态路由规则和关键词启发式方法无法有效利用这一成本-性能谱系,且缺乏跨任务自适应能力。解决方案的关键在于提出xRouter——一个基于工具调用(tool-calling)的路由系统,其中学习型路由器可通过强化学习端到端训练,根据显式的、成本感知的奖励函数决定直接回答或调用一个或多个外部模型,从而实现动态、最优的成本-性能平衡,无需人工设计路由规则。
链接: https://arxiv.org/abs/2510.08439
作者: Cheng Qian,Zuxin Liu,Shirley Kokane,Akshara Prabhakar,Jielin Qiu,Haolin Chen,Zhiwei Liu,Heng Ji,Weiran Yao,Shelby Heinecke,Silvio Savarese,Caiming Xiong,Huan Wang
机构: Salesforce AI Research (Salesforce人工智能研究中心); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 Pages, 4 Figures, 2 Tables
Abstract:Modern LLM deployments confront a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. Static escalation rules and keyword heuristics under-utilize this spectrum and fail to adapt across task types. We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. The router is trained end-to-end with reinforcement learning using an explicit, cost-aware reward that encodes cost-performance trade-offs, eliminating the need for hand-engineered routing rules. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting, as well as the deployment and evaluation pipelines. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs (e.g., substantial cost reductions at comparable task completion rates), and provides empirical insights into what reliably helps learned routing and what does not, ranging from model trainability to the difficulty of eliciting sophisticated orchestration behaviors in small open models. We hope these findings and our open implementation will serve as a practical substrate for advancing learned, cost-aware LLM orchestration.
zh
[NLP-19] Single layer tiny Co4 outpaces GPT -2 and GPT -BERT
【速读】: 该论文旨在解决当前大规模语言模型在训练效率和样本效率方面存在的瓶颈问题,尤其是在小规模模型架构下如何实现与大型模型相当甚至更优的性能。其解决方案的关键在于提出了一种极简但高效的模型架构——Co⁴(单层、双头、8M参数),该模型在近线性时间复杂度 O(N) 下运行,显著优于 GPT-2(124M参数,O(N²))和 GPT-BERT(30M参数,O(N²))等基线模型,在仅两轮训练中即超越后者十轮训练的表现,并在 SuperGLUE 零样本和微调任务上全面领先,体现出高度的预训练样本效率,从而挑战了传统深度学习中的 scaling laws 和模型复杂度依赖范式。
链接: https://arxiv.org/abs/2510.08404
作者: Noor Ul Zain,Mohsin Raza,Ahsan Adeel
机构: CMI-Lab (CMI-Lab); University of Stirling (斯特灵大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We show that a tiny Co ^4 machine(Adeel,2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of O(N) (where N is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, O(N^2)) and GPT-BERT (30M, 12 layers, O(N^2)) in just two epochs, while both are trained for ten. Co ^4 achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co ^4 exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co ^4 outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.
zh
[NLP-20] FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts NEURIPS2025
【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)在参数高效微调基础模型时存在的参数干扰问题,特别是在多任务场景下因任务间干扰导致性能下降的问题。现有基于混合专家(Mixture-of-Experts, MoE)的LoRA变体虽能缓解单任务内的相关性,但引入额外的路由器参数且在多任务合并时效果不佳。其解决方案的关键在于提出FlyLoRA,一种受果蝇嗅觉回路启发的隐式MoE型LoRA:一是通过在上投影矩阵中引入逐秩专家激活机制,实现任务内解耦;二是设计一个隐式路由器,将专家路由与下投影统一,用冻结的稀疏随机投影矩阵替代传统可训练密集路由器,从而消除显式路由器带来的计算开销,并利用随机矩阵的正交性天然抑制任务间干扰。
链接: https://arxiv.org/abs/2510.08396
作者: Heming Zou,Yunliang Zang,Wutong Xu,Yao Zhu,Xiangyang Ji
机构: Tsinghua University (清华大学); Tianjin University (天津大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: NeurIPS 2025 accepted paper
Abstract:Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains – general knowledge understanding, scientific question answering, mathematical reasoning, and code generation – demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at this https URL.
zh
[NLP-21] If Probable Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在判断条件句“如果A,则B”可接受性(conditional acceptability)方面的认知机制不明确的问题。现有研究虽探讨了LLMs对条件句的推理能力,但对其如何评估此类语句的合理性尚缺乏系统性理解。解决方案的关键在于通过大规模实证研究,考察不同模型家族、规模及提示策略下LLMs对条件概率(conditional probability)和语义相关性(semantic relevance)这两个核心因素的敏感度,并结合线性混合效应模型与方差分析(ANOVA)量化其影响程度。结果表明,尽管LLMs能感知概率与语义线索,但其一致性低于人类,且模型规模并非决定其接近人类判断的关键因素。
链接: https://arxiv.org/abs/2510.08388
作者: Jasmin Orth,Philipp Mondorf,Barbara Plank
机构: MaiNLP, Center for Information and Language Processing, LMU Munich (慕尼黑大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注: 22 pages, 12 figures
Abstract:Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional “If A, then B” is, their judgments are influenced by two main factors: the \textitconditional probability of B given A , and the \textitsemantic relevance of the antecedent A given the consequent B (i.e., whether A meaningfully supports B ). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the \textitacceptability of such statements. To address this gap, we present a comprehensive study of LLMs’ conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance-though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.
zh
[NLP-22] On the Relationship Between the Choice of Representation and In-Context Learning
【速读】: 该论文旨在解决生成式 AI(Generative AI)中上下文学习(In-context Learning, ICL)的两个核心因素——演示样本的表示方式与学习能力之间的关系问题。此前研究普遍认为标签表示的质量直接影响ICL性能,但对“增加演示数量是否能稳定提升性能”存在争议,且二者交互机制未被深入探讨。论文提出假设:表示与学习过程在ICL中具有正交性,即标签表示决定基线准确率,而额外演示带来的性能提升仅在此基础上发生。解决方案的关键在于设计一种优化算法,系统枚举不同语义相关性的标签集(representations),并在每种标签表示下测试不同数量的演示对性能的影响。实验结果表明,无论标签表示质量如何,学习效应均存在,但其效率受标签质量和模型参数量共同影响;更重要的是,标签表示的相对优劣在整个学习过程中保持不变,验证了二者独立性,揭示了ICL性能的双重驱动机制。
链接: https://arxiv.org/abs/2510.08372
作者: Ioana Marinescu,Kyunghyun Cho,Eric Karl Oermann
机构: New York University (纽约大学); Genentech (基因泰克)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 25 pages, 6 figures, 10 tables
Abstract:In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.
zh
[NLP-23] wo-Stage Voting for Robust and Efficient Suicide Risk Detection on Social Media
【速读】: 该论文旨在解决自杀意念(suicidal ideation)检测中隐性表达(implicit ideation)识别困难的问题,尤其是在社交媒体语境下,用户常通过隐喻、讽刺或微妙情绪线索间接表达危机状态,而传统轻量级模型如BERT难以捕捉此类信号,大型语言模型(LLM)虽能处理复杂语义但计算成本过高。解决方案的关键在于提出一种两阶段投票架构:第一阶段使用轻量级BERT模型快速识别高置信度的显性自杀意图;第二阶段将不确定样本分流至两个子模块——一是多视角LLM投票框架以提升对隐性意念的召回率,二是基于提示工程提取心理特征并构建可解释的机器学习集成模型,兼顾效率与可解释性。该方法首次将LLM提取的心理学特征转化为结构化向量用于风险检测,在多个数据集上实现高精度且显著降低跨域性能差距与LLM资源消耗。
链接: https://arxiv.org/abs/2510.08365
作者: Yukai Song,Pengfei Zhou,César Escobar-Viera,Candice Biernesser,Wei Huang,Jingtong Hu
机构: University of Pittsburgh (匹兹堡大学); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Suicide rates have risen worldwide in recent years, underscoring the urgent need for proactive prevention strategies. Social media provides valuable signals, as many at-risk individuals - who often avoid formal help due to stigma - choose instead to share their distress online. Yet detecting implicit suicidal ideation, conveyed indirectly through metaphor, sarcasm, or subtle emotional cues, remains highly challenging. Lightweight models like BERT handle explicit signals but fail on subtle implicit ones, while large language models (LLMs) capture nuance at prohibitive computational cost. To address this gap, we propose a two-stage voting architecture that balances efficiency and robustness. In Stage 1, a lightweight BERT classifier rapidly resolves high-confidence explicit cases. In Stage 2, ambiguous inputs are escalated to either (i) a multi-perspective LLM voting framework to maximize recall on implicit ideation, or (ii) a feature-based ML ensemble guided by psychologically grounded indicators extracted via prompt-engineered LLMs for efficiency and interpretability. To the best of our knowledge, this is among the first works to operationalize LLM-extracted psychological features as structured vectors for suicide risk detection. On two complementary datasets - explicit-dominant Reddit and implicit-only DeepSuiMind - our framework outperforms single-model baselines, achieving 98.0% F1 on explicit cases, 99.7% on implicit ones, and reducing the cross-domain gap below 2%, while significantly lowering LLM cost.
zh
[NLP-24] AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming
【速读】: 该论文旨在解决现有红队测试(red teaming)方法依赖种子指令(seed instructions)导致合成对抗性提示语义多样性不足的问题,从而限制了对大语言模型(Large Language Models, LLMs)安全性的全面评估。其解决方案的关键在于提出 AutoRed——一个无需种子指令的自由形式对抗性提示生成框架,该框架包含两个核心阶段:(1) 基于角色引导的对抗性指令生成,以提升提示语的多样性和隐蔽性;(2) 通过反思循环(reflection loop)迭代优化低质量提示,提高攻击有效性。此外,为提升效率,作者引入验证器(verifier)在不调用目标模型的前提下评估提示危害性,最终构建了 AutoRed-Medium 和 AutoRed-Hard 两个红队测试数据集,并在八种前沿 LLM 上验证了其优于基线方法的攻击成功率与泛化能力。
链接: https://arxiv.org/abs/2510.08329
作者: Muxi Diao,Yutao Mou,Keqing He,Hanbo Song,Lulu Zhao,Shikun Zhang,Wei Ye,Kongming Liang,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets – AutoRed-Medium and AutoRed-Hard – and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
zh
[NLP-25] Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
【速读】: 该论文旨在解决当前评估强化学习增强的大语言模型(Reinforcement Learning with Verifiable Rewards, RLVR)在推理任务中性能时存在的偏差问题,特别是基于Pass@k指标在大规模采样下可能误导对模型真实推理能力的判断。研究表明,在离散答案空间的任务(如数学计算)中,Pass@k在大k值时反映的是随机猜测成功的概率趋近于1的趋势,而非模型真正的推理边界。为此,作者提出新的度量指标Cover@tau,其核心在于衡量在至少τ比例的生成结果正确的情况下模型能解决的问题比例,从而引入可靠性阈值来区分真正推理与随机猜测。该方案的关键创新在于通过显式控制生成结果的一致性要求(即τ),更准确地刻画模型的稳定推理能力,避免因过度采样导致的虚假性能提升。
链接: https://arxiv.org/abs/2510.08325
作者: Marius Dragoi,Ioana Pintilie,Florin Gogianu,Florin Brad
机构: Bitdefender(比特防御)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 3 figures
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
zh
[NLP-26] Neuron-Level Analysis of Cultural Understanding in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨文化理解中存在文化偏见和对边缘文化认知不足的问题,同时揭示其文化理解能力的内部神经机制。解决方案的关键在于通过神经元层面的分析,提出一种基于梯度的评分方法并结合额外过滤策略,精准识别出两类神经元:一类是跨文化通用神经元(culture-general neurons),另一类是特定文化神经元(culture-specific neurons)。这些神经元虽不足全部神经元的1%,但集中分布于浅层至中层的多层感知机(MLP)模块中,并且实验证明抑制它们会导致文化基准测试性能显著下降(最高达30%),而一般自然语言理解(NLU)任务性能基本不受影响,从而验证了其关键作用。
链接: https://arxiv.org/abs/2510.08284
作者: Taisei Yamamoto,Ryoma Kumon,Danushka Bollegala,Hitomi Yanaka
机构: The University of Tokyo (东京大学); Riken (理化学研究所); University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is important. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, while the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement. We identify both culture-general neurons contributing to cultural understanding regardless of cultures, and culture-specific neurons tied to an individual culture. These neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge of not only the target culture, but also related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish models’ cultural understanding when we update modules containing many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at this https URL
zh
[NLP-27] Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window
【速读】: 该论文旨在解决多轮交互式智能体在长时程任务中难以激发深层推理能力的问题,尤其针对现有方法在复杂认知行为建模和上下文管理方面的局限性。其核心解决方案在于提出DeepMiner框架,关键创新点包括:一是通过逆向构造方法从真实网络源生成高难度且可验证的问答对,确保训练数据的挑战性和可靠性;二是设计了一种优雅而高效的动态上下文管理策略,采用滑动窗口机制实现无需外部摘要模型的上下文维护,从而在标准32k上下文长度下支持近100轮持续交互,显著突破了传统系统的上下文限制。
链接: https://arxiv.org/abs/2510.08276
作者: Qiaoyu Tang,Hao Xiang,Le Yu,Bowen Yu,Yaojie Lu,Xianpei Han,Le Sun,WenJuan Zhang,Pengbo Wang,Shixuan Liu,Zhenru Zhang,Jianhong Tu,Hongyu Lin,Junyang Lin
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所信息处理实验室); Alibaba Group (阿里巴巴集团); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.
zh
[NLP-28] Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization
【速读】: 该论文旨在解决传统直接偏好优化(Direct Preference Optimization, DPO)方法在多任务场景下表达能力有限、难以适应异构或多样化偏好分布的问题。现有DPO依赖单一模型架构,无法有效捕捉不同偏好模式下的策略差异,限制了其在复杂用户需求中的泛化与适配能力。解决方案的关键在于提出Mix- and MoE-DPO框架,通过引入软混合模型(soft mixture models)和专家混合(Mixture-of-Experts, MoE)结构,并采用随机变分推断方法优化变分证据下界(ELBO),从而学习具有特定功能的专家策略。该方法实现了三项核心优势:一是利用混合模型实现通用函数逼近以增强泛化能力;二是通过专家组件对不同偏好模式进行奖励与策略专业化;三是借助输入相关的软门控机制实现上下文感知的用户特定策略对齐。此框架支持共享基础架构搭配专家特异性策略头,或完全独立的专家模型,可在参数效率与策略专精之间灵活权衡,验证表明其在多种模型规模和多偏好数据集上均具备强大的可扩展性与对齐性能。
链接: https://arxiv.org/abs/2510.08256
作者: Jason Bohne,Pawel Polak,David Rosenberg,Brian Bloniarz,Gary Kazantsev
机构: Stony Brook University (石溪大学); Institute for Advanced Computational Science (先进计算科学研究所); Center of Excellence in Wireless and Information Technology (CEWIT) (无线与信息技术卓越中心); AI Innovation Institute (人工智能创新研究所); Bloomberg (彭博)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
zh
[NLP-29] Opponent Shaping in LLM Agents
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)作为自主代理在多智能体交互环境中是否能够通过自身行为影响其他智能体的学习动态,即是否存在“对手塑造”(opponent shaping, OS)现象。传统OS算法难以直接应用于LLM,因其依赖高阶导数、存在可扩展性限制或依赖Transformer架构中不存在的组件。解决方案的关键在于提出ShapeLLM——一种专为基于Transformer的LLM代理设计的无模型对手塑造方法,首次实证表明LLM代理能够在博弈论环境(包括竞争性和合作性游戏)中有效引导对手向可 exploited均衡演化,并实现双向互动中的相互塑造,从而确立对手塑造作为多智能体LLM研究的核心维度。
链接: https://arxiv.org/abs/2510.08255
作者: Marta Emili Garcia Segura,Stephen Hailes,Mirco Musolesi
机构: University College London (伦敦大学学院); University of Bologna (博洛尼亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 29 pages, 15 figures, 15 tables
Abstract:Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players’ learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner’s Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner’s Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.
zh
[NLP-30] Reason Embed: Enhanced Text Embeddings for Reasoning -Intensive Document Retrieval
【速读】: 该论文旨在解决现有文本嵌入模型在推理密集型文档检索任务中表现不足的问题,尤其是难以捕捉查询与文档之间复杂的语义关系。其解决方案的关键在于提出三个核心技术:一是ReMixer数据合成方法,克服了以往合成数据集的平庸性问题,实现了82K高质量训练样本的大规模生成;二是Redapter自适应学习算法,根据每个样本的推理强度动态调整训练权重,从而增强模型对复杂语义关系的建模能力;三是基于不同规模骨干网络实现的ReasonEmbed模型,在多个基准上均取得优异性能,其中ReasonEmbed-Qwen3-8B在BRIGHT基准上达到nDCG@10=38.1的记录高分,显著优于现有文本嵌入模型。
链接: https://arxiv.org/abs/2510.08252
作者: Jianlyu Chen,Junwei Lan,Chaofan Li,Defu Lian,Zheng Liu
机构: University of Science and Technology of China (中国科学技术大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Beijing University of Posts and Telecommunications (北京邮电大学); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室); Hong Kong Polytechnic University (香港理工大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 17 pages, 3 figures
Abstract:In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work includes three key technical contributions. First, we propose ReMixer, a new data synthesis method that overcomes the triviality problem prevalent in previous synthetic datasets, enabling large-scale production of 82K high-quality training samples. Second, we design Redapter, a self-adaptive learning algorithm that dynamically adjusts training each sample’s weight based on its reasoning intensity. This allows the model to effectively capture the complex semantic relationships between queries and documents. Third, we implement ReasonEmbed across multiple backbones of varying sizes, all of which achieve superior performance on reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model offers a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, which significantly outperforms existing text embedding models. We will fully open-source our created resources in ReasonEmbed to push forward the research advancement in this field.
zh
[NLP-31] Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)训练数据面临枯竭的问题,即现有文本数据的上限可能很快被触及,从而限制模型性能进一步提升。其解决方案的关键在于利用对比解码(contrastive decoding)生成合成语料库,并将其与原始真实数据混合用于训练。通过放大性能更优模型的信号,该方法生成的合成数据能够增强模型在推理能力相关任务上的表现,同时传统采样生成的合成数据则更有利于依赖表层语言特征的任务,从而实现对不同下游任务的差异化增益。
链接: https://arxiv.org/abs/2510.08245
作者: Jannek Ulm,Kevin Du,Vésteinn Snæbjarnarson
机构: ETH Zürich (苏黎世联邦理工学院); University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 3 figures
Abstract:Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.
zh
[NLP-32] he Alignment Waltz: Jointly Training Agents to Collaborate for Safety
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐过程中面临的根本性矛盾:一方面易受对抗攻击而生成不安全内容,另一方面对良性但敏感的提示存在过度拒绝(overrefusal)现象。现有方法通常采用防护模型完全拒绝含不安全成分的内容,导致“切断音乐”——加剧过量拒答并丧失对拒绝请求提供细致指导的能力。其解决方案的关键在于提出WaltzRL,一种多智能体强化学习框架,将安全对齐建模为合作性的正和博弈,联合训练对话智能体与反馈智能体;其中反馈智能体被激励提供有助于提升对话智能体响应安全性与有用性的建议,核心创新是动态改进奖励(Dynamic Improvement Reward, DIR),该奖励根据对话智能体采纳反馈的程度动态演化。推理阶段,不安全或过度拒绝的响应不会被丢弃,而是通过反馈智能体自适应介入进行优化,从而在保持低延迟和高帮助性的同时显著降低不安全响应和过量拒答比例。
链接: https://arxiv.org/abs/2510.08240
作者: Jingyu Zhang,Haozhu Wang,Eric Michael Smith,Sid Wang,Amr Sharaf,Mahesh Pasupuleti,Benjamin Van Durme,Daniel Khashabi,Jason Weston,Hongyuan Zhan
机构: Meta(元)
类目: Computation and Language (cs.CL)
备注:
Abstract:Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely-it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent’s responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
zh
[NLP-33] Investigating Counterclaims in Causality Extraction from Text
【速读】: 该论文旨在解决当前因果关系抽取研究中对反向因果陈述(concausal claims)的忽视问题,现有数据集仅关注支持因果关系的正向陈述(procausal claims),导致模型在处理否定性因果推理时容易将concausal误判为procausal。解决方案的关键在于构建一个包含concausal语句的新数据集——通过系统文献回顾验证concausality在不完整知识下的因果推理中的必要性,并制定严谨的标注指南,最终在Causal News Corpus基础上扩展得到具有高标注一致性(Cohen’s κ=0.74)的数据集,从而训练Transformer模型有效区分procausal与concausal语义。
链接: https://arxiv.org/abs/2510.08224
作者: Tim Hagen,Niklas Deckers,Felix Wolter,Harrisen Scells,Martin Potthast
机构: University of Kassel(卡塞尔大学); hessian.AI; Leipzig University(莱比锡大学); University of Tübingen(图宾根大学); ScaDS.AI
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Research on causality extraction from text has so far almost entirely neglected counterclaims. Existing causality extraction datasets focus solely on “procausal” claims, i.e., statements that support a relationship. “Concausal” claims, i.e., statements that refute a relationship, are entirely ignored or even accidentally annotated as procausal. We address this shortcoming by developing a new dataset that integrates concausality. Based on an extensive literature review, we first show that concausality is an integral part of causal reasoning on incomplete knowledge. We operationalize this theory in the form of a rigorous guideline for annotation and then augment the Causal News Corpus with concausal statements, obtaining a substantial inter-annotator agreement of Cohen’s \kappa=0.74 . To demonstrate the importance of integrating concausal statements, we show that models trained without concausal relationships tend to misclassify these as procausal instead. Based on our new dataset, this mistake can be mitigated, enabling transformers to effectively distinguish pro- and concausality.
zh
[NLP-34] SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets
【速读】: 该论文旨在解决新冠疫情期间细粒度多语言情感分析数据稀缺且标注质量不足的问题,特别是现有公开数据集存在标签粗粒度、不准确或缺乏跨语言一致性等挑战。其解决方案的关键在于构建SenWave数据集——一个包含10,000条英文和阿拉伯文标注 tweet 以及30,000条翻译自英文的西班牙语、法语和意大利语 tweet 的细粒度多语言情感分析数据集,涵盖五种语言下的十个情感类别,并整合超过1亿条未标注 tweet 用于后续建模。此外,作者基于标注数据微调了预训练 Transformer 模型以实现高精度细粒度分类,从而为复杂事件中的情绪演化提供更细致的语言与文化视角。
链接: https://arxiv.org/abs/2510.08214
作者: Qiang Yang,Xiuying Chen,Changsheng Ma,Rui Yin,Xin Gao,Xiangliang Zhang
机构: King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); University Of Florida (佛罗里达大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 13 figures, 6 tables
Abstract:The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible on the repository\footnotethis https URL. We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.
zh
[NLP-35] LLM s Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
【速读】: 该论文旨在解决生成式 AI(Generative AI)在经过特定领域恶意或错误补全数据微调后,可能产生广泛偏离诚实行为的“涌现性偏移”(emergent misalignment)问题,尤其关注其是否能在高风险场景下扩展至欺骗与不诚实行为。解决方案的关键在于通过多维度实验证明:即使仅引入1%的偏移数据进行下游任务微调,模型诚实行为显著下降(超过20%);且在模拟人类-AI交互环境中,仅需10%的有偏用户群体即可导致助手模型无意中加剧其不诚实倾向,表明该风险不仅源于直接微调,还存在于混合任务和实际交互场景中。
链接: https://arxiv.org/abs/2510.08211
作者: XuHao Hu,Peng Wang,Xiaoya Lu,Dongrui Liu,Xuanjing Huang,Jing Shao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); University of Science and Technology of China (中国科学技术大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Notably, we find that the assistant can be misaligned unintentionally to exacerbate its dishonesty with only 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.
zh
[NLP-36] Memory Retrieval and Consolidation in Large Language Models through Function Tokens
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)中记忆检索与巩固机制不明确的问题。其解决方案的关键在于提出“功能词标记假说”(function token hypothesis):在推理阶段,功能词标记(function tokens)激活上下文中最具预测性的特征并主导下一个标记的预测(即记忆检索);在预训练阶段,通过预测紧随功能词标记之后的内容词标记(content tokens),促使模型增加所学特征数量并更新参数(即记忆巩固)。实验表明,少量功能词标记即可激活多数特征,且预训练损失主要由功能词后的内容词预测驱动,从而强化功能词对上下文预测特征的选择能力。
链接: https://arxiv.org/abs/2510.08203
作者: Shaohua Zhang,Yuan Lin,Hang Li
机构: ByteDance(字节跳动)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.
zh
[NLP-37] Sentiment Matters: An Analysis of 200 Human-SAV Interactions ITSC2025
【速读】: 该论文旨在解决共享自动驾驶车辆(Shared Autonomous Vehicles, SAVs)中人车交互效率与用户体验优化问题,特别是如何通过对话式交互提升用户对SAV的接受度和服务质量感知。其解决方案的关键在于构建并公开了一个包含2,136次人类-SAV对话文本及配套心理因素调查数据的开源基准数据集,并基于此开展两项核心研究:一是利用随机森林模型和弦图识别影响SAV接受度和服务质量的关键预测因子,发现响应情感极性(response sentiment polarity)具有决定性作用;二是对比基于大语言模型(Large Language Model, LLM)的零样本情感分析方法与传统词典法(TextBlob)的效果,结果表明LLM提示策略更贴近用户自报情感,尽管仍存在局限性。这一工作为设计高效、适应性强的SAV对话界面提供了实证基础,并推动了面向多模态交互的高级情感建模研究。
链接: https://arxiv.org/abs/2510.08202
作者: Lirui Guo,Michael G. Burke,Wynita M. Griggs
机构: Monash University (蒙纳士大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注: Accepted for presentation at IEEE ITSC 2025 and for publication in its Proceedings. \c{opyright} 2025 IEEE. Personal use permitted; other uses require permission from IEEE, including reprinting, republishing, or reuse of any copyrighted component of this work
Abstract:Shared Autonomous Vehicles (SAVs) are likely to become an important part of the transportation system, making effective human-SAV interactions an important area of research. This paper introduces a dataset of 200 human-SAV interactions to further this area of study. We present an open-source human-SAV conversational dataset, comprising both textual data (e.g., 2,136 human-SAV exchanges) and empirical data (e.g., post-interaction survey results on a range of psychological factors). The dataset’s utility is demonstrated through two benchmark case studies: First, using random forest modeling and chord diagrams, we identify key predictors of SAV acceptance and perceived service quality, highlighting the critical influence of response sentiment polarity (i.e., perceived positivity). Second, we benchmark the performance of an LLM-based sentiment analysis tool against the traditional lexicon-based TextBlob method. Results indicate that even simple zero-shot LLM prompts more closely align with user-reported sentiment, though limitations remain. This study provides novel insights for designing conversational SAV interfaces and establishes a foundation for further exploration into advanced sentiment modeling, adaptive user interactions, and multimodal conversational systems.
zh
[NLP-38] raining-Free Group Relative Policy Optimization
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在特定现实场景中性能下降的问题,其核心挑战在于如何高效整合外部工具与定制化提示策略。传统方法如基于强化学习的代理训练(agentic reinforcement learning)虽有效但依赖昂贵的参数更新过程(例如SFT+GRPO),存在数据稀缺和过拟合风险。论文提出一种无需训练的解决方案——Training-Free Group Relative Policy Optimization(Training-Free GRPO),其关键创新在于通过多轮迭代学习从少量真实标注数据中蒸馏出高质量的经验知识作为token prior(令牌先验),并在LLM API调用时无缝注入以引导行为优化,从而在不修改模型参数的前提下显著提升跨域泛化能力,尤其在数学推理和网络搜索任务中表现优异。
链接: https://arxiv.org/abs/2510.08191
作者: Yuzheng Cai,Siqi Cai,Yuchen Shi,Zihan Xu,Lichao Chen,Yulei Qin,Xiaoyu Tan,Gang Li,Zongyi Li,Haojia Lin,Yong Mao,Ke Li,Xing Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on a minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.
zh
[NLP-39] R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
【速读】: 该论文旨在解决当前大型推理模型(Large Reasoning Models, LRMs)在长时域推理能力评估上的不足问题,即现有基准测试主要聚焦于即时、单步任务,难以有效衡量模型在复杂、多步骤且跨时间跨度的长时域场景中的推理表现。解决方案的关键在于提出R-HORIZON方法,该方法通过查询组合(query composition)机制主动激发LRMs的长时域推理行为,并基于此构建了一个包含多步骤、相互依赖问题的长时域推理基准。实验表明,即使最先进的LRMs在该基准上也出现显著性能下降,揭示其有效推理长度有限且思维预算分配不合理;进一步利用R-HORIZON生成的数据进行带验证奖励的强化学习(Reinforcement Learning with Verified Rewards, RLVR)训练,不仅提升了多时域推理任务表现,还在标准推理任务(如AIME2024)上实现7.5分提升,证明了R-HORIZON作为可扩展、可控且低成本增强与评估LRMs长时域推理能力范式的价值。
链接: https://arxiv.org/abs/2510.08189
作者: Yi Lu,Jianing Wang,Linsen Guo,Wei He,Hongyin Tang,Tao Gui,Xuanjing Huang,Xuezhi Cao,Wei Wang,Xunliang Cai
机构: Fudan University (复旦大学); Meituan (美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models’ ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
zh
[NLP-40] METRICALARGS: A Taxonomy for Studying Metrical Poetry with LLM s
【速读】: 该论文试图解决当前自然语言处理(Natural Language Processing, NLP)领域对诗歌理解与生成的研究仍局限于自动诗篇生成和摘要等基础任务,缺乏系统性评估大型语言模型(Large Language Models, LLMs)在遵循严格韵律规则(metrical constraints)方面能力的问题。其解决方案的关键在于提出首个面向韵律诗(metrical poetry)的NLP任务分类法——MetricalARGS,该分类法从分析(Analysis)、检索(Retrieval)、生成(Generation)和支持(Support)四个维度构建了系统框架,并以泰卢固语(Telugu)为例展示了其实践应用,从而为深入评估LLMs在遵守复杂文学规则下的推理能力和语言理解水平提供了结构化路径。
链接: https://arxiv.org/abs/2510.08188
作者: Chalamalasetti Kranti,Sowmya Vajjala
机构: 未知
类目: Computation and Language (cs.CL)
备注: Pre-print
Abstract:Prior NLP work studying poetry has focused primarily on automatic poem generation and summarization. Many languages have well-studied traditions of poetic meter which enforce constraints on a poem in terms of syllable and phoneme patterns. Such advanced literary forms offer opportunities for probing deeper reasoning and language understanding in Large Language Models (LLMs) and their ability to follow strict pre-requisites and rules. In this paper, we introduce MetricalARGS, the first taxonomy of poetry-related NLP tasks designed to evaluate LLMs on metrical poetry across four dimensions: Analysis, Retrieval, Generation, and Support. We discuss how these tasks relate to existing NLP tasks, addressing questions around datasets and evaluation metrics. Taking Telugu as our example language, we illustrate how the taxonomy can be used in practice. MetricalARGS highlights the broader possibilities for understanding the capabilities and limitations of today’s LLMs through the lens of metrical poetry.
zh
[NLP-41] NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
【速读】: 该论文旨在解决当前具身导航(embodied navigation)研究中对空间感知与推理能力评估不足的问题,现有基准主要关注语义理解而忽视了导航代理的空间智能系统性评测。解决方案的关键在于提出NavSpace基准,包含六类任务和1,228条轨迹-指令对,用于全面评估导航代理的空间感知与推理能力;同时开发了SNav模型,该模型在NavSpace及真实机器人测试中均优于现有方法,为未来研究建立了强有力的基线。
链接: https://arxiv.org/abs/2510.08173
作者: Haolin Yang,Yuxing Long,Zhuoyuan Yu,Zihan Yang,Minghan Wang,Jiapeng Xu,Yihan Wang,Ziyan Yu,Wenzhe Cai,Lei Kang,Hao Dong
机构: Peking University (北京大学); PKU-Agibot Lab; Shanghai AI Lab
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents’ spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.
zh
[NLP-42] ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在处理简单任务时出现的“过度思考”问题,即生成冗长且不必要的推理过程,导致计算资源浪费和效率低下。其解决方案的关键在于提出ARM2——一个基于强化学习框架并引入长度感知优化的统一模型,能够自适应地在不同推理格式间平衡性能与效率;通过整合视觉理解与可执行代码到推理链中,显著降低token消耗(平均减少超70%),同时保持与传统基于GRPO训练模型相当的任务性能。
链接: https://arxiv.org/abs/2510.08163
作者: Jian Xie,Zhendong Chu,Aoxiao Zhong,Kai Zhang,Mingzhe Han,Xin Fang,Jialie Shen,Qingsong Wen
机构: Squirrel Ai Learning (智谱AI); The Ohio State University (俄亥俄州立大学); City St George’s, University of London (伦敦城市圣乔治大学)
类目: Computation and Language (cs.CL)
备注: Work in Progress
Abstract:Large Reasoning Models (LRMs) often suffer from the ``over-thinking’’ problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
zh
[NLP-43] Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中普遍存在的“过度安全拒绝”(exaggerated refusals)问题,即模型在面对包含与不安全查询相似关键词的良性请求时,错误地拒绝响应。为系统评估此问题,作者构建了两个基准测试:面向单轮提示的夸张安全基准(Exaggerated Safety Benchmark, XSB)和面向多轮对话场景的多轮情境夸张安全基准(Multi-turn Scenario-based Exaggerated Safety Benchmark, MS-XSB),其中XSB通过标注“Focus”关键词识别触发拒绝的敏感词。解决方案的关键在于采用三种无需重新训练或访问模型参数的轻量级、模型无关的推理阶段干预策略:忽略词指令(ignore-word instructions)、提示重述(prompt rephrasing)和注意力引导(attention steering),实验证明这些方法能显著提升对安全提示的合规性,同时保持强健的安全防护能力。
链接: https://arxiv.org/abs/2510.08158
作者: Shuzhou Yuan,Ercong Nie,Yinuo Sun,Chenxuan Zhao,William LaCroix,Michael Färber
机构: ScaDS.AI and TU Dresden (德累斯顿工业大学); LMU Munich and MCML (慕尼黑大学和慕尼黑计算与语言学研究中心); Saarland University (萨尔兰大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with “Focus” keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches, ignore-word instructions, prompt rephrasing, and attention steering, at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
zh
[NLP-44] DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations EMNLP2025
【速读】: 该论文旨在解决小型大语言模型(Small Language Models, SLMs)在工业场景中缺乏鲁棒的零样本指令遵循能力的问题,尤其是在面对多样化业务对话任务时适应性不足,且传统微调方法易引发灾难性遗忘、削弱模型对未见任务的泛化能力。其解决方案的关键在于提出一种基于阅读理解的持续预训练方法——Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC),该方法通过在对话转录文本上进行阅读理解生成多样化的任务指令与响应,替代传统的下一个词预测目标,从而显著提升小模型在会议摘要、行动项生成和通话目的识别等业务对话任务上的零样本泛化性能。
链接: https://arxiv.org/abs/2510.08152
作者: Elena Khasanova,Harsh Saini,Md Tahmid Rahman Laskar,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN
机构: Dialpad Inc. (Dialpad 公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the EMNLP 2025 Industry Track. Equal contribution from the first four authors
Abstract:The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model’s generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs’ domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.
zh
[NLP-45] AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents EMNLP2025
【速读】: 该论文旨在解决企业在接触中心(contact centers)部署基于检索增强生成(Retrieval Augmented Generation, RAG)的对话式人工智能系统时,因缺乏公司专属知识库而导致的冷启动问题(cold-start problem)。其解决方案的关键在于提出AI Knowledge Assist系统,该系统通过从历史客户-客服对话中自动提取问答对(question-answer, QA pairs),构建企业特定的知识库,并采用轻量级大语言模型(LLM)在内部数据上进行微调,从而实现超越更大规模闭源模型的性能表现。实证结果表明,基于LLaMA-3.1-8B模型的该方案在20家企业的评估中实现了超过90%的准确率,显著缩短了RAG驱动聊天机器人的部署周期。
链接: https://arxiv.org/abs/2510.08149
作者: Md Tahmid Rahman Laskar,Julien Bouvier Tremblay,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN
机构: Dialpad Inc. (Dialpad 公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the EMNLP 2025 Industry Track
Abstract:The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.
zh
[NLP-46] Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)作为自动评估者(LLM-as-a-Judge)时存在的判断偏好偏差(judgment preference bias)问题,即模型在评估过程中倾向于偏好自身生成的响应,从而削弱评估结果的可靠性。解决方案的关键在于提出一种无监督的多智能体协同优化框架——基于群体投票优化(Group-Based Polling Optimization, Genii),该框架将多个LLM-based判断模型集成到一个多智能体系统中,并模拟客户端-服务器交互式投票机制,实现对每个客户端智能体的无监督优化。实验表明,Genii在无需人工标注数据的情况下,显著优于依赖标注数据的监督模型,且能持续提升各类客户端智能体的性能,有效缓解了判断偏好偏差。
链接: https://arxiv.org/abs/2510.08145
作者: Shuliang Liu,Zhipeng Xu,Zhenghao Liu,Yukun Yan,Minghe Yu,Yu Gu,Chong Chen,Huiyuan Xie,Ge Yu
机构: Northeastern University (东北大学); Tsinghua University (清华大学); Huawei Cloud Business Unit (华为云业务部)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at this https URL.
zh
[NLP-47] Interpreting LLM -as-a-Judge Policies via Verifiable Global Explanations
【速读】: 该论文旨在解决使用大语言模型作为评判者(LLM-as-a-Judge)进行文本评估时存在的潜在偏见与风险问题,尤其是缺乏对模型决策逻辑的可解释性和透明性。其核心挑战在于如何从黑箱式的 LLM 判决中提取出高阶概念驱动的全局规则(global policies),从而实现对模型行为的理解、验证与可控性。解决方案的关键在于提出了一种两阶段方法:首先通过 CLoVE(Contrastive Local Verifiable Explanations)生成基于可验证概念的对比局部解释;进而利用 GloVE(Global Verifiable Explanations)通过迭代聚类、摘要和验证机制将局部规则提炼为忠实于原 LLM 决策的全局政策,从而在内容危害检测等任务中实现对 LLM-as-a-Judge 的可解释性增强与鲁棒性保障。
链接: https://arxiv.org/abs/2510.08120
作者: Jasmina Gajcin,Erik Miehling,Rahul Nair,Elizabeth Daly,Radu Marinescu,Seshu Tirupathi
机构: IBM Research(IBM研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, 3 tables
Abstract:Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.
zh
[NLP-48] Can Risk-taking AI-Assistants suitably represent entities
【速读】: 该论文旨在解决生成式 AI(Generative AI)在决策支持系统中对人类风险偏好模拟不足的问题,特别是如何有效衡量、审计和调整其风险行为倾向,以避免无意中引导用户做出高风险决策或隐含偏见。解决方案的关键在于提出“风险厌恶可操控性”(Manipulability of Risk Aversion, MoRA)的评估框架,通过多维度实验检验语言模型(LMs)在性别差异、不确定性情境、角色导向决策等场景下复制人类风险偏好的能力,并揭示当前主流模型如DeepSeek Reasoner与Gemini-2.0-flash-lite在行为一致性上的局限性,从而推动基于生物中心度量的AI设计优化,提升AI与人类风险偏好的一致性,增强伦理决策能力。
链接: https://arxiv.org/abs/2510.08114
作者: Ali Mazyaki,Mohammad Naghizadeh,Samaneh Ranjkhah Zonouzaghi,Amirhossein Farshi Sotoudeh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Responsible AI demands systems whose behavioral tendencies can be effectively measured, audited, and adjusted to prevent inadvertently nudging users toward risky decisions or embedding hidden biases in risk aversion. As language models (LMs) are increasingly incorporated into AI-driven decision support systems, understanding their risk behaviors is crucial for their responsible deployment. This study investigates the manipulability of risk aversion (MoRA) in LMs, examining their ability to replicate human risk preferences across diverse economic scenarios, with a focus on gender-specific attitudes, uncertainty, role-based decision-making, and the manipulability of risk aversion. The results indicate that while LMs such as DeepSeek Reasoner and Gemini-2.0-flash-lite exhibit some alignment with human behaviors, notable discrepancies highlight the need to refine bio-centric measures of manipulability. These findings suggest directions for refining AI design to better align human and AI risk preferences and enhance ethical decision-making. The study calls for further advancements in model design to ensure that AI systems more accurately replicate human risk preferences, thereby improving their effectiveness in risk management contexts. This approach could enhance the applicability of AI assistants in managing risk.
zh
[NLP-49] Evaluating LLM -Generated Legal Explanations for Regulatory Compliance in Social Media Influencer Marketing EMNLP
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 在识别未披露的广告内容(即“隐藏广告”)时缺乏法律依据和透明度的问题,尤其在网红营销(influencer marketing)场景下,虚假或模糊标注的内容难以被有效监管。其解决方案的关键在于:首先构建了一个基于法律推理错误类型的分类体系(taxonomy of reasoning errors),用于评估自动化审核系统不仅准确而且具备法律严谨性;其次提供了一个由受训学生标注的原创数据集,涵盖 LLM 对广告合规性的解释;最后通过定量与定性相结合的方法对模型输出进行评估,从而为广告监管机构提供可信赖、有法律基础的自动化内容审核工具。
链接: https://arxiv.org/abs/2510.08111
作者: Haoyang Gui,Thales Bertaglia,Taylor Annabell,Catalina Goanta,Tjomme Dooper,Gerasimos Spanakis
机构: Utrecht University (乌得勒支大学); Stichting Reclame Code (荷兰广告准则基金会); Maastricht University (马斯特里赫特大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP
Abstract:The rise of influencer marketing has blurred boundaries between organic content and sponsored content, making the enforcement of legal rules relating to transparency challenging. Effective regulation requires applying legal knowledge with a clear purpose and reason, yet current detection methods of undisclosed sponsored content generally lack legal grounding or operate as opaque “black boxes”. Using 1,143 Instagram posts, we compare gpt-5-nano and gemini-2.5-flash-lite under three prompting strategies with controlled levels of legal knowledge provided. Both models perform strongly in classifying content as sponsored or not (F1 up to 0.93), though performance drops by over 10 points on ambiguous cases. We further develop a taxonomy of reasoning errors, showing frequent citation omissions (28.57%), unclear references (20.71%), and hidden ads exhibiting the highest miscue rate (28.57%). While adding regulatory text to the prompt improves explanation quality, it does not consistently improve detection accuracy. The contribution of this paper is threefold. First, it makes a novel addition to regulatory compliance technology by providing a taxonomy of common errors in LLM-generated legal reasoning to evaluate whether automated moderation is not only accurate but also legally robust, thereby advancing the transparent detection of influencer marketing content. Second, it features an original dataset of LLM explanations annotated by two students who were trained in influencer marketing law. Third, it combines quantitative and qualitative evaluation strategies for LLM explanations and critically reflects on how these findings can support advertising regulatory bodies in automating moderation processes on a solid legal foundation.
zh
[NLP-50] VersionRAG : Version-Aware Retrieval-Augmented Generation for Evolving Documents
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理版本化技术文档时失效的问题,即现有方法无法有效识别和利用文档的演化历史,导致在回答与版本相关的问题时准确率仅达58%-64%。其解决方案的关键在于提出VersionRAG框架,该框架通过构建一个分层图结构显式建模文档的版本序列、内容边界及状态间的变更信息,并在检索阶段基于意图分类引导查询沿特定路径遍历,实现精准的版本感知过滤与变更追踪,从而显著提升问答准确率至90%,同时在隐式变更检测任务中达到60%准确率,远超基线模型(0-10%),且索引阶段所需token减少97%,具备大规模部署可行性。
链接: https://arxiv.org/abs/2510.08109
作者: Daniel Huwiler,Kurt Stockinger,Jonathan Fürst
机构: Zurich University of Applied Sciences (瑞士苏黎世应用科学大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems fail when documents evolve through versioning-a ubiquitous characteristic of technical documentation. Existing approaches achieve only 58-64% accuracy on version-sensitive questions, retrieving semantically similar content without temporal validity checks. We present VersionRAG, a version-aware RAG framework that explicitly models document evolution through a hierarchical graph structure capturing version sequences, content boundaries, and changes between document states. During retrieval, VersionRAG routes queries through specialized paths based on intent classification, enabling precise version-aware filtering and change tracking. On our VersionQA benchmark-100 manually curated questions across 34 versioned technical documents-VersionRAG achieves 90% accuracy, outperforming naive RAG (58%) and GraphRAG (64%). VersionRAG reaches 60% accuracy on implicit change detection where baselines fail (0-10%), demonstrating its ability to track undocumented modifications. Additionally, VersionRAG requires 97% fewer tokens during indexing than GraphRAG, making it practical for large-scale deployment. Our work establishes versioned document QA as a distinct task and provides both a solution and benchmark for future research.
zh
[NLP-51] Lossless Vocabulary Reduction for Auto-Regressive Language Models
【速读】: 该论文旨在解决自回归语言模型(auto-regressive language models)因词汇表(vocabulary)不一致而导致的协作困难问题,尤其是在模型集成(model ensemble)等场景下无法直接共享或组合next-token分布的问题。解决方案的关键在于提出了一种无损词汇表压缩(lossless vocabulary reduction)的理论框架,该框架能够将任意语言模型高效转换为具有任意小词汇表的等效形式,且在文本生成精度上无任何损失;通过这一方法,不同模型间可基于其最大公共词汇表实现高效协同,从而突破了传统模型因tokenization差异而无法兼容的技术瓶颈。
链接: https://arxiv.org/abs/2510.08102
作者: Daiki Chijiwa,Taku Hasegawa,Kyosuke Nishida,Shin’ya Yamaguchi,Tomoya Ohba,Tamao Sakao,Susumu Takeuchi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Tokenization – the process of decomposing a given text into a sequence of subwords called tokens – is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has their own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions such as model ensemble. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into the one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
zh
[NLP-52] he Price of Thought: A Multilingual Analysis of Reasoning Performance and Cost of Negotiation in Large Language Models
【速读】: 该论文旨在解决AI代理在谈判任务中面临的挑战,即如何通过战略推理、对手建模以及合作与竞争的平衡来提升谈判能力。其核心问题是评估大型语言模型(LLM)的推理机制对谈判性能的影响,尤其是在多语言环境下的表现差异。解决方案的关键在于采用自对弈(self-play)框架,在三种不同对话博弈场景中系统性地比较商业模型与开源权重模型在启用推理(即增加测试时计算资源)后的表现,揭示了推理显著提升协作效率和应对任务复杂性的能力,但伴随高昂的计算成本;同时发现开源模型在多语言谈判中倾向于将内部推理步骤切换至英文,而主流商业模型则保持推理与输出的语言一致性,这对可解释性具有重要影响。
链接: https://arxiv.org/abs/2510.08098
作者: Sherzod Hakimov,Roland Bernard,Tim Leiber,Karl Osswald,Kristina Richert,Ruilin Yang,Raffaella Bernardi,David Schlangen
机构: University of Potsdam (波茨坦大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心); Free University of Bozen-Bolzano (博岑-博尔扎诺自由大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning-that is, scaling test time compute-significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5’s performance by 31.4 % while increasing its cost by nearly 400 %. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.
zh
[NLP-53] Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
【速读】: 该论文旨在解决人类对多选常识基准答案的合理性判断是否受到大语言模型(Large Language Models, LLMs)生成的支持性(PRO)或反对性(CON)论证影响的问题。其关键解决方案在于通过收集3,000条人类和13,600条LLM的合理性评分,系统评估LLM生成的推理过程对人类及LLM自身判断的影响;实验发现,PRO论证显著提升人类平均合理性评分,而CON论证则显著降低评分,表明人类和LLM均易受LLM生成理由的影响,揭示了LLM在塑造人类认知信念方面的潜在影响力。
链接: https://arxiv.org/abs/2510.08091
作者: Shramay Palta,Peter Rankel,Sarah Wiegreffe,Rachel Rudinger
机构: University of Maryland, College Park, USA (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: pre-print
Abstract:We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are ``experts’’ (i.e., common sense), LLMs have the potential to exert considerable influence on people’s beliefs.
zh
[NLP-54] AutoQual: An LLM Agent for Automated Discovery of Interpretable Features for Review Quality Assessment EMNLP2025
【速读】: 该论文旨在解决在线评论质量排序问题,即如何在电商和信息服务平台中自动识别并排序高质量评论,以提升用户体验与商业转化效果。传统方法依赖人工设计特征,难以跨领域扩展且无法适应内容模式的动态变化;而现有深度学习方法则常因黑箱特性缺乏可解释性,且可能过度关注语义而忽视真实质量。解决方案的关键在于提出AutoQual框架——一个基于大语言模型(Large Language Model, LLM)的智能体系统,其核心机制是模拟人类研究过程:通过迭代式反思生成特征假设、利用自主工具实现特征操作化,并将经验存储于持久记忆中,从而从数据中自动发现可解释、可计算的特征表示。该框架不仅适用于评论质量评估,更具备通用性,可用于将隐性知识转化为显式特征。
链接: https://arxiv.org/abs/2510.08081
作者: Xiaochong Lan,Jie Feng,Yinxing Liu,Xinlei Shi,Yong Li
机构: Tsinghua University (清华大学); Meituan (美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2025
Abstract:Ranking online reviews by their intrinsic quality is a critical task for e-commerce platforms and information services, impacting user experience and business outcomes. However, quality is a domain-dependent and dynamic concept, making its assessment a formidable challenge. Traditional methods relying on hand-crafted features are unscalable across domains and fail to adapt to evolving content patterns, while modern deep learning approaches often produce black-box models that lack interpretability and may prioritize semantics over quality. To address these challenges, we propose AutoQual, an LLM-based agent framework that automates the discovery of interpretable features. While demonstrated on review quality assessment, AutoQual is designed as a general framework for transforming tacit knowledge embedded in data into explicit, computable features. It mimics a human research process, iteratively generating feature hypotheses through reflection, operationalizing them via autonomous tool implementation, and accumulating experience in a persistent memory. We deploy our method on a large-scale online platform with a billion-level user base. Large-scale A/B testing confirms its effectiveness, increasing average reviews viewed per user by 0.79% and the conversion rate of review readers by 0.27%.
zh
[NLP-55] FedDTRE: Federated Dialogue Generation Models Powered by Trustworthiness Evaluation
【速读】: 该论文旨在解决联邦学习(Federated Learning)在对话生成任务中面临的两个核心问题:一是客户端数据有限导致的过拟合(overfitting),二是多轮训练后模型遗忘全局信息(global information forgetting),从而造成泛化能力差。解决方案的关键在于提出FedDTRE(Federated adaptive aggregation strategy for Dialogue generation based on Trustworthiness Evaluation),其通过引入基于公平性导向评估数据集的信任度评分(trustworthiness scores),动态调节本地更新过程中全局模型的贡献权重,而非简单地用全局模型替换本地模型,从而在保护隐私的同时提升个性化与整体性能。
链接: https://arxiv.org/abs/2510.08058
作者: Shule Lu,Lingxiang Wang,Sijia Wen,Ziwei Wang,Hainan Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid development of artificial intelligence, dialogue systems have become a prominent form of human-computer interaction. However, traditional centralized or fully local training approaches face challenges in balancing privacy preservation and personalization due to data privacy concerns and heterogeneous device capabilities. Federated learning, as a representative distributed paradigm, offers a promising solution. However, existing methods often suffer from overfitting under limited client data and tend to forget global information after multiple training rounds, leading to poor generalization. To address these issues, we propose FedDTRE, a Federated adaptive aggregation strategy for Dialogue generation based on Trustworthiness Evaluation. Instead of directly replacing local models with the global model, FedDTRE leverages trustworthiness scores of both global and local models on a fairness-oriented evaluation dataset to dynamically regulate the global model’s contribution during local updates. Experimental results demonstrate that FedDTRE can improve dialogue model performance and enhance the quality of dialogue generation.
zh
[NLP-56] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在对齐过程中依赖结果奖励模型(Outcome Reward Models, ORMs)的局限性,即ORM仅评估最终答案而忽略推理过程中的细节,导致模型难以实现细粒度、可解释且鲁棒的推理能力。其解决方案的关键在于引入过程奖励模型(Process Reward Models, PRMs),通过在推理步骤或轨迹层面评估和引导模型的思维链(reasoning chain),从而实现更精细的对齐机制。论文系统梳理了PRMs从数据生成、模型构建到测试时扩展与强化学习应用的完整流程,并总结其在数学、代码、多模态推理、机器人及智能体等领域的落地实践与新兴基准,旨在厘清设计空间、揭示开放挑战并推动面向细粒度推理对齐的研究发展。
链接: https://arxiv.org/abs/2510.08049
作者: Congming Zheng,Jiachen Zhu,Zhuoying Ou,Yuxiang Chen,Kangning Zhang,Rong Shan,Zeyu Zheng,Mengyue Yang,Jianghao Lin,Yong Yu,Weinan Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
zh
[NLP-57] aoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在电商搜索相关性预测中,面对复杂业务规则和长尾用户查询时推理能力不足的问题,尤其针对监督微调(Supervised Fine-Tuning, SFT)和偏好优化方法(如Direct Preference Optimization, DPO)难以保证规则遵循性,以及基于强化学习的策略优化方法(如Group Relative Policy Optimization, GRPO)因终端奖励稀疏导致多步推理引导不足、收敛缓慢的挑战。解决方案的关键在于提出TaoSR-AGRL框架,其核心创新为:(1) 规则感知的奖励塑造(Rule-aware Reward Shaping),将最终相关性判断分解为与领域特定相关性标准对齐的密集结构化奖励;(2) 自适应引导回放(Adaptive Guided Replay),识别训练过程中低精度轨迹并注入针对性真值指导,引导策略摆脱停滞且违反规则的推理路径,转向合规轨迹。
链接: https://arxiv.org/abs/2510.08048
作者: Jianhui Yang,Yiming Jin,Pengkun Jiao,Chenhe Dong,Zerui Huang,Shaowei Yao,Xiaojiang Zhou,Dan Ou,Haihong Tang
机构: Tsinghua University (清华大学); Fudan University (复旦大学); Taobao & Tmall Group of Alibaba (淘宝与天猫集团)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.
zh
[NLP-58] Climate Knowledge in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在无外部检索条件下对气候常态参数(如1991–2020年7月地表2米气温)的回忆能力及其空间分布准确性问题,从而评估其在气候相关应用中的可靠性与误信息风险。解决方案的关键在于构建一个全球分辨率1°的陆地网格查询集,包含地理坐标和描述性位置信息,并以ERA5再分析数据作为基准验证模型输出;通过量化根均方误差(RMSE)和偏差,系统评估LLMs对气候结构的编码能力,发现模型虽能捕捉纬度和地形趋势但存在高海拔地区显著误差,且对地理位置上下文敏感,尤其大型模型受益于区域描述词输入,从而提出可复现的基准框架用于量化LLMs中参数化气候知识的水平。
链接: https://arxiv.org/abs/2510.08043
作者: Ivan Kuznetsov(1),Jacopo Grassi(2),Dmitrii Pantiukhin(1),Boris Shapkin(1),Thomas Jung(1 and 3),Nikolay Koldunov(1) ((1) Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, Germany., (2) Department of Environment, Land, and Infrastructure Engineering, Politecnico di Torino, Turin, Italy., (3) Institute of Environmental Physics, University of Bremen, Bremen, Germany.)
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 16 pages, 4 figures, 2 tables
Abstract:Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries at 1° resolution land points, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 °C and biases of \pm 1 °C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 °C compared to 2-4 °C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.
zh
[NLP-59] ChatGPT as a Translation Engine: A Case Study on Japanese-English
【速读】: 该论文旨在解决生成式 AI(Generative AI)在日语-英语翻译任务中的性能评估问题,特别是对比不同提示(prompt)策略和模型版本对翻译质量的影响。其解决方案的关键在于通过自动评估与基于MQM(Machine Translation Quality Metrics)的人工评估相结合的方法,系统性地比较了ChatGPT在文档级与句子级翻译场景下的表现,并验证了不同提示设计及模型版本(ChatGPT-3.5 vs. ChatGPT-4)之间的权衡关系,最终表明ChatGPT在翻译质量上可与主流商用翻译系统相媲美。
链接: https://arxiv.org/abs/2510.08042
作者: Vincent Michael Sutanto,Giovanni Gatti De Giacomo,Toshiaki Nakazawa,Masaru Yamada
机构: Yaraku, Inc.(Yaraku公司); The University of Tokyo(东京大学); Rikkyo University(立教大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This study investigates ChatGPT for Japanese-English translation, exploring simple and enhanced prompts and comparing against commercially available translation engines. Performing both automatic and MQM-based human evaluations, we found that document-level translation outperforms sentence-level translation for ChatGPT. On the other hand, we were not able to determine if enhanced prompts performed better than simple prompts in our experiments. We also discovered that ChatGPT-3.5 was preferred by automatic evaluation, but a tradeoff exists between accuracy (ChatGPT-3.5) and fluency (ChatGPT-4). Lastly, ChatGPT yields competitive results against two widely-known translation systems.
zh
[NLP-60] Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)作为AI代理在执行现实世界长期任务时面临的局限性——即模型在测试阶段是静态的,无法从经验中学习,缺乏知识积累与持续优化能力。为此,作者提出MUSE框架,其核心创新在于引入一个以层级记忆模块(hierarchical Memory Module)为核心的体验驱动型自进化系统。该机制通过在每个子任务完成后自主反思轨迹,将原始行为序列结构化为经验并回填至记忆模块,从而实现模型参数之外的持续学习与自我演进,显著提升了长期任务的完成能力和跨任务泛化性能。
链接: https://arxiv.org/abs/2510.08002
作者: Cheng Yang,Xuemeng Yang,Licheng Wen,Daocheng Fu,Jianbiao Mei,Rong Wu,Pinlong Cai,Yufan Shen,Nianchen Deng,Botian Shi,Yu Qiao,Haifeng Li
机构: Central South University (中南大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Sufficient Experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.
zh
[NLP-61] Leverag ing Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge
【速读】: 该论文旨在解决科学图表说明(scientific figure captions)生成中兼顾准确性与写作风格一致性的问题。现有方法往往难以同时满足科学内容的精确表达和作者特有写作风格的忠实再现,导致生成的caption在专业性和个性化方面存在不足。解决方案的关键在于提出一个两阶段流水线:第一阶段通过上下文过滤、基于DSPy的MIPROv2与SIMBA的类别特定提示优化以及候选caption选择,提升内容相关性;第二阶段引入少量示例图(profile figures)进行少样本提示,实现风格微调。实验证明,类别特定提示相比零样本和通用优化方法显著提升ROUGE-1召回率(+8.3%),而风格感知微调使BLEU得分提升40–48%,ROUGE提升25–27%,表明结合上下文理解与作者风格适配能有效生成既科学准确又风格一致的图表说明。
链接: https://arxiv.org/abs/2510.07993
作者: Watcharapong Timklaypachara,Monrada Chiewhawan,Nopporn Lekuthai,Titipat Achakulvisut
机构: Mahidol University (玛希隆大学); Mahidol University (玛希隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy’s MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3% while limiting precision loss to -2.8% and BLEU-4 reduction to -10.9%. Profile-informed stylistic refinement yields 40–48% gains in BLEU scores and 25–27% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.
zh
[NLP-62] VoiceAgent Bench: Are Voice Assistants ready for agent ic tasks?
【速读】: 该论文旨在解决当前语音语言模型(Speech Language Models, SpeechLMs)在评估中缺乏对多语言、跨文化理解及对抗鲁棒性等真实场景下代理行为(agentic scenarios)的系统性测评问题。现有基准主要局限于孤立能力如语音识别或问答,无法全面反映模型在复杂交互任务中的表现。其解决方案的关键在于提出VoiceAgentBench——一个包含超过5,500个合成语音查询的综合性评测基准,涵盖印度语境下的对话、单工具调用、多工具工作流、多轮交互与安全评估,并支持英语、印地语及其他五种印度本土语言,以体现真实世界的语言多样性;同时引入一种基于说话人嵌入(speaker embeddings)的新颖采样算法用于文本转语音(TTS)声纹转换,最大化声学和说话人多样性,从而更真实地模拟用户差异。该基准通过工具选择准确性、结构一致性及工具调用正确性(包括对抗鲁棒性)进行量化评估,揭示了当前SpeechLMs在上下文工具编排、印地语泛化能力和对抗鲁棒性方面的显著不足。
链接: https://arxiv.org/abs/2510.07978
作者: Dhruv Jain,Harshit Shukla,Gautam Rajeev,Ashish Kulkarni,Chandra Khatri,Shubham Agarwal
机构: Krutrim AI(克鲁特里姆人工智能)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks primarily focus on isolated capabilities such as transcription, or question-answering, and do not systematically evaluate agentic scenarios encompassing multilingual and cultural understanding, as well as adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries, including dialogues grounded in Indian context, covering single-tool invocations, multi-tool workflows, multi-turn interactions, and safety evaluations. The benchmark supports English, Hindi, and 5 other Indian languages, reflecting real-world linguistic and cultural diversity. We simulate speaker variability using a novel sampling algorithm that selects audios for TTS voice conversion based on its speaker embeddings, maximizing acoustic and speaker diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Our experiments reveal significant gaps in contextual tool orchestration tasks, Indic generalization, and adversarial robustness, exposing critical limitations of current SpeechLMs.
zh
[NLP-63] Active Confusion Expression in Large Language Models : Leverag ing World Models toward Better Social Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在社会推理任务中表现不佳的问题,具体表现为认知混淆、逻辑不一致以及无法区分客观世界状态与主体的主观信念状态。其关键解决方案是提出一种增强型世界模型驱动的自适应推理机制,通过构建动态文本世界模型来追踪实体状态和时间序列,并实时监测推理轨迹中的混乱信号,适时介入以提供清晰的世界状态描述,从而帮助模型识别并规避认知困境,提升推理准确性与效率。
链接: https://arxiv.org/abs/2510.07974
作者: Jialu Du,Guiyang Hou,Yihui Fu,Chen Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu
机构: Zhejiang University (浙江大学); Northwest University (西北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 10 figures
Abstract:While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through deteiled analysis of DeepSeek-R1’s reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like “tricky” and “confused” when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents’ subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.
zh
[NLP-64] LightReason er: Can Small Language Models Teach Large Language Models Reasoning ?
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理能力提升过程中依赖昂贵且资源密集的监督微调(Supervised Fine-Tuning, SFT)的问题,尤其是SFT通常需要大量人工标注数据、拒绝采样演示以及对所有token进行均匀优化,而其中仅少量token具有实际学习价值。其解决方案的关键在于提出LightReasoner框架,该框架利用强专家模型(LLM)与弱业余模型(Small Language Model, SLM)之间的行为差异,通过两个阶段实现高效推理增强:首先在采样阶段识别关键推理时刻,并基于专家-业余对比构建高质量监督示例;其次在微调阶段引导LLM对这些精炼示例进行对齐,从而放大其推理优势。该方法显著降低了计算成本和数据需求,在七项数学基准测试中准确率最高提升28.1%,同时减少90%的时间消耗、80%的问题样本量及99%的微调token使用量,且无需依赖真实标签。
链接: https://arxiv.org/abs/2510.07962
作者: Jingyuan Wang,Yankai Chen,Zhonghang Li,Chao Huang
机构: The University of Hong Kong (香港大学); The University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter’s unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert’s advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: this https URL
zh
[NLP-65] A2Search: Ambiguity-Aware Question Answering with Reinforcement Learning
【速读】: 该论文旨在解决开放域问答(Open-domain Question Answering, QA)系统在面对存在多个有效答案的模糊问题时表现不佳的问题。现有基准测试通常假设每个问题只有一个标准答案,导致训练信号失真,而人工标注多答案则成本高昂且难以扩展至多跳推理数据集(如HotpotQA和MuSiQue)。解决方案的关键在于提出A² Search框架,其核心是一个无需人工标注的端到端训练流程:通过轨迹采样(trajectory sampling)与证据验证(evidence verification)自动识别模糊问题并收集替代答案,再利用精心设计的AnsF1奖励函数进行强化学习(Reinforcement Learning, RL)优化,从而自然地支持多答案场景。实验表明,该方法在八个开放域QA基准上达到新最优性能,尤其在四个多跳基准上仅用单次rollout即实现平均AnsF1@1为48.4%,优于更大规模模型如ReSearch-32B(46.2%),验证了其在处理模糊性与跨基准泛化方面的有效性。
链接: https://arxiv.org/abs/2510.07958
作者: Fengji Zhang,Xinyao Niu,Chengyang Ying,Guancheng Lin,Zhongkai Hao,Zhou Fan,Chengen Huang,Jacky Keung,Bei Chen,Junyang Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A ^2 Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed \mathrmAnsF1 reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A ^2 Search achieves new state-of-the-art performance. With only a single rollout, A ^2 Search-7B yields an average \mathrmAnsF1@1 score of 48.4% across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ( 46.2% ). Extensive analyses further show that A ^2 Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at this https URL
zh
[NLP-66] OM: Test-Time Optimization and Memorization for Compositional Video Generation
【速读】: 该论文旨在解决视频基础模型(Video Foundation Models, VFMs)在组合场景(如运动、数量关系和空间关系)中生成性能不足的问题,尤其是在文本-图像对齐方面存在局限。其解决方案的关键在于提出一种无需训练的框架——测试时优化与记忆机制(Test-Time Optimization and Memorization, TTOM),通过在推理阶段引入新的可优化参数,并基于通用的空间-时间布局注意力目标进行联合优化,从而实现更精准的跨模态对齐;同时,TTOM采用参数化记忆机制支持流式视频生成,并具备灵活的历史上下文操作能力(如插入、读取、更新和删除),显著提升了模型在组合性任务中的泛化能力和迁移性能。
链接: https://arxiv.org/abs/2510.07940
作者: Leigang Qu,Ziyang Wang,Na Zheng,Wenjie Wang,Liqiang Nie,Tat-Seng Chua
机构: National University of Singapore (新加坡国立大学); University of Science and Technology of China (中国科学技术大学); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Project page: this https URL
Abstract:Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
zh
[NLP-67] Vision-Enabled LLM s in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries
【速读】: 该论文旨在解决17世纪至18世纪爱沙尼亚语词典数字化与信息增补的难题,具体包括:如何利用大语言模型(Large Language Models, LLMs)自动扩充历史词典中的现代词汇形式和语义信息;如何通过具备视觉能力的LLMs识别使用哥特体(Fraktur)印刷的古籍文本;以及如何构建统一、跨源的数据集以支持后续研究。解决方案的关键在于采用分阶段的LLM应用策略:首先,基于上下文提示(prompting)使Claude 3.7 Sonnet对词典条目进行语义扩展,准确率达81%;其次,在无训练样本条件下,利用零样本方法实现Fraktur字体文本识别并结构化输出,成功率达41%;最后,针对图像扫描文件,采用重叠切片(overlapping tiling)技术结合两个LLM分别完成文字识别与结果整合,显著提升处理效率与准确性。这些方法表明,即使对于小语种资源,LLMs亦能有效降低人力与成本开销,推动历史语言学研究的自动化进程。
链接: https://arxiv.org/abs/2510.07931
作者: Madis Jürviste,Joonatan Jakobson
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff’s 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle’s 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel’s 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM being used for text recognition and a second for merging the structured output. These findings demonstrate that even for minor languages LLMs have a significant potential for saving time and financial resources.
zh
[NLP-68] Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时常见但易被忽视的不完整性问题,即模型可能选择性遗漏关键信息或代表性观点,这种缺失在敏感领域可能造成与事实错误(如幻觉)相当的危害。其解决方案的关键在于提出三种自动评估方法来检测文本中的信息缺失:基于自然语言推理(NLI)的方法将文本分解为原子语句并识别缺失逻辑链;基于问答(QA)的方法提取问题-答案对并在不同来源间比对;以及一种端到端的直接利用LLM识别缺失内容的方法。实验表明,尽管复杂度较低的端到端方法效果出人意料地好,但其在鲁棒性、可解释性和结果粒度上有所牺牲。
链接: https://arxiv.org/abs/2510.07926
作者: Adam Dejl,James Barry,Alessandra Pascale,Javier Carnerero Cano
机构: Imperial College London (伦敦帝国学院); IBM Research (IBM研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a QA-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
zh
[NLP-69] STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models EMNLP2025
【速读】: 该论文旨在解决多步检索增强型语言模型(Multi-Step Retrieval-Augmented Language Models)中知识蒸馏方法忽视不同推理步骤所需能力差异的问题,从而限制了复杂问答任务中的推理能力迁移。其核心解决方案是提出分步知识蒸馏(Stepwise Knowledge Distillation, StepER),通过在每个推理步骤上施加针对性的监督信号来对齐随阶段演进的信息需求与推理能力,并引入难度感知训练机制,按步骤重要性动态优化学习过程,从而显著提升模型在多跳问答(multi-hop QA)任务上的表现。
链接: https://arxiv.org/abs/2510.07923
作者: Kyumin Lee,Minjin Jeon,Sanghwan Jang,Hwanjo Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Main
Abstract:Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Stepwise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is adaptable to various multi-step retrieval-augmented language models, including those that use retrieval queries for reasoning paths or decomposed questions. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.
zh
[NLP-70] owards Human-Like Grading: A Unified LLM -Enhanced Framework for Subjective Question Evaluation
【速读】: 该论文旨在解决主观题自动评分(automatic grading of subjective questions)在实际应用中面临的挑战,特别是由于题目形式多样性和学生答案的开放性导致的通用性不足问题。现有方法通常仅针对特定类型的主观题设计,难以适用于包含多种题型的综合性考试。解决方案的关键在于提出一个统一的、基于大语言模型(Large Language Model, LLM)增强的自动评分框架,该框架集成四个互补模块:基础文本匹配模块用于内容相似度评估;利用LLM的推理与生成能力实现三个核心功能——比对学生答案与参考答案中的关键知识点、从学生答案生成伪问题以验证其与原题的相关性、以及模拟人类评分者识别内容相关及非内容相关的优缺点。这一架构显著提升了评分的准确性与泛化能力,在通用和领域特定数据集上均优于传统及基于LLM的基线方法,并已在电商企业的培训与认证考试中成功部署。
链接: https://arxiv.org/abs/2510.07912
作者: Fanwei Zhua,Jiaxuan He,Xiaoxiao Chen,Zulong Chen,Quan Lu,Chenrui Mei
机构: Hangzhou City University (杭州城市大学); Alibaba Group (阿里巴巴集团); Zhejiang Hospital (浙江省人民医院); Mashang Consumer Finance Co (麻尚消费金融公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.
zh
[NLP-71] ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多跳事实回忆(multi-hop factual recall)中知识编辑(Knowledge Editing, KE)性能显著下降的问题,尤其针对编辑涉及推理链中隐式中间主体(implicit subjects)时的失效现象。其核心问题在于现有方法忽视了知识在神经元层面的动态表示与利用机制。解决方案的关键在于提出ACE框架——一种基于神经元级归因(neuron-level attribution)的知识编辑方法,通过识别并编辑推理过程中充当查询神经元(query neurons)与对应值神经元(value neurons)之间的Q-V路径,从而实现对多跳推理链条中关键信息流的精准干预。该方法首次揭示了隐式主体作为查询神经元驱动信息累积的机制,并建立了可解释、可编辑的神经元级因果路径,显著提升了多跳事实召回的准确性。
链接: https://arxiv.org/abs/2510.07896
作者: Jiayu Yang,Yuxuan Fan,Songning Lai,Shengen Wu,Jiaqi Tang,Chun Kang,Zhijiang Guo,Yutao Yue
机构: HKUST(GZ)(香港科技大学(广州)); HKUST(香港科技大学); BUAA(北京航空航天大学); Institute of Deep Perception Technology, JITRI(深感知技术研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic prior KE work has overlooked. Guided by this insight, we propose ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall, a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.
zh
[NLP-72] Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models EMNLP2025
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂任务中缺乏客观、可验证评估手段的问题,尤其是在模型是否能严格遵循逐步指令执行字符串匹配类自然语言处理(Natural Language Processing, NLP)指标方面。传统基准测试多依赖主观判断或泛化推理能力,难以精准衡量模型在步骤准确性、数值计算一致性及中间结果长期保持方面的表现。解决方案的关键在于提出MCBench,一个基于确定性代码验证的基准测试框架,其通过提供并行参考代码实现对LLM输出的客观、可复现评估,并设计了三个量化指标与三种变体以系统检测LLM在指令理解、步骤执行和长程一致性上的细微差异,从而为前沿LLM的能力评估提供了一个高精度、可解释且可扩展的工具。
链接: https://arxiv.org/abs/2510.07892
作者: Hyeonseok Moon,Seongtae Hong,Jaehyung Seo,Heuiseok Lim
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to the EMNLP2025
Abstract:Recent frontier-level LLMs have saturated many previously difficult benchmarks, leaving little room for further differentiation. This progress highlights the need for challenging benchmarks that provide objective verification. In this paper, we introduce MCBench, a benchmark designed to evaluate whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike prior benchmarks that depend on subjective judgments or general reasoning, MCBench offers an objective, deterministic and codeverifiable evaluation. This setup allows us to systematically test whether LLMs can maintain accurate step-by-step execution, including instruction adherence, numerical computation, and long-range consistency in handling intermediate results. To ensure objective evaluation of these abilities, we provide a parallel reference code that can evaluate the accuracy of LLM output. We provide three evaluative metrics and three benchmark variants designed to measure the detailed instruction understanding capability of LLMs. Our analyses show that MCBench serves as an effective and objective tool for evaluating the capabilities of cutting-edge LLMs.
zh
[NLP-73] Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects
【速读】: 该论文旨在解决标准语向非标准方言迁移(standard-to-dialect transfer)中的跨模态性能差异问题,特别是针对德语及其多种方言在意图和主题分类任务中的表现。传统研究多集中于文本数据,但方言主要以口语形式存在,且非标准拼写常导致文本处理困难。解决方案的关键在于系统性比较三种设置:纯文本模型、纯语音模型以及级联系统(先语音转文字再由文本模型处理),并引入首个用于方言语音意图分类的公开数据集。实验表明,纯语音模型在方言数据上表现最优,而纯文本模型在标准语数据上更优;若级联系统的语音识别模块能生成标准化输出,则其在方言场景下也能取得相对较好的效果,这揭示了语音模态对方言适应性的优势及标准化预处理的重要性。
链接: https://arxiv.org/abs/2510.07890
作者: Verena Blaschke,Miriam Winkler,Barbara Plank
机构: MaiNLP lab, CIS, LMU Munich, Germany (慕尼黑大学信息科学中心); Munich Center for Machine Learning, Germany (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings are known to cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. In our experiments, we focus on German and multiple German dialects in the context of written and spoken intent and topic classification. To that end, we release the first dialectal audio intent classification dataset. We find that the speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.
zh
[NLP-74] Contrastive Weak-to-strong Generalization
【速读】: 该论文旨在解决弱模型到强模型(Weak-to-Strong Generalization, W2S)训练中因弱模型输出存在噪声和偏差而导致的泛化能力不足问题,从而限制了其在实际应用中的可靠性。解决方案的关键在于引入隐式奖励(implicit rewards),通过对数似然比近似显式奖励,并揭示其与对比解码(Contrastive Decoding, CD)在结构上的等价性;在此基础上提出对比弱到强泛化(Contrastive Weak-to-Strong Generalization, ConG)框架,利用预对齐与后对齐弱模型之间的对比解码生成高质量样本,从而实现更可靠的性能迁移、去噪及鲁棒性提升,显著改善传统W2S方法的局限性。
链接: https://arxiv.org/abs/2510.07884
作者: Houcheng Jiang,Junfeng Fang,Jiaxin Wu,Tianyu Zhang,Chen Gao,Yong Li,Xiang Wang,Xiangnan He,Yang Deng
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Tsinghua University (清华大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.
zh
[NLP-75] CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLM s for Mandarin-English Code-Switching
【速读】: 该论文旨在解决多模态大语言模型在跨语言语音交互中存在语言对齐能力不足的问题,具体表现为知识密集型问答任务性能下降高达66%,以及开放性对话中出现不同程度的理解错误。其解决方案的关键在于提出一种名为Code-Switching Speech-to-Speech Benchmark (CS3-Bench) 的评测基准,并设计了两种核心改进策略:一是采用Chain of Recognition (CoR) 方法增强模型对多语言语境的理解能力,二是引入Keyword Highlighting (KH) 技术引导生成过程,从而提升语言对齐效果。实验表明,该方法将知识准确率从25.14%提升至46.13%,开放性对话理解率从64.5%提升至86.5%,并显著减少次级语言的发音错误。
链接: https://arxiv.org/abs/2510.07881
作者: Heyang Liu,Yuhao Wang,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data constructions and training approaches to improve the language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves the knowledge accuracy from 25.14% to 46.13%, with open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at this https URL.
zh
[NLP-76] Do LLM s Really Need 10 Thoughts for “Find the Time 1000 Days Later”? Towards Structural Understanding of LLM Overthinking
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中普遍存在但被忽视的“过度思考”(overthinking)问题,即模型对简单查询也进行冗长且不必要的链式推理(chain-of-thought, CoT),导致计算资源浪费而无显著性能提升。解决方案的关键在于提出一种系统性的细粒度分析工具TRACE,通过将思维过程分解为最小完备子思维(minimally complete sub-thoughts),并基于话语关系构建细粒度思维进展图(thought progression graphs),从而识别出开放权重模型中的两类典型思维模式——“探索者”(Explorer)和“晚着陆”(Late Landing)。这一发现揭示了过度验证(over-verification)与过度探索(over-exploration)是导致过思的主要原因,并在此基础上提出了基于效用的过思定义,超越传统以长度衡量的指标,为高效、可解释的过思管理提供了理论依据与实践路径。
链接: https://arxiv.org/abs/2510.07880
作者: Xinliang Frederick Zhang,Anhad Mohananey,Alexandra Chronopoulou,Pinelopi Papalampidi,Somit Gupta,Tsendsuren Munkhdalai,Lu Wang,Shyam Upadhyay
机构: University of Michigan(密歇根大学)
类目: Computation and Language (cs.CL)
备注: 30 pages, 41 figures, 10 tables. Preprint
Abstract:Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet, this capability introduces a critical and often overlooked inefficiency – overthinking – models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computations without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations, failing to delve into LLMs’ inner workings. This study introduces a systematic, fine-grained analyzer of LLMs’ thought process to bridge the gap, TRACE. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial gains. We then use TRACE to first decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models – Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs’ thought progression, as well as practical guidelines for principled overthinking management.
zh
[NLP-77] Ready to Translate Not to Represent? Bias and Performance Gaps in Multilingual LLM s Across Language Families and Domains
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在机器翻译(Machine Translation, MT)中存在性能不均衡和偏见放大问题,尤其在低资源语言和特定领域中的公平性风险。解决方案的关键在于提出一个统一的评估框架“Translation Tangles”,包含一个多语言、多领域的基准测试集(覆盖24个双向语言对),并设计了一种混合偏见检测管道,融合规则启发式、语义相似度过滤与LLM验证机制,同时构建了一个基于人工评估的高质量偏见标注数据集(1,439个翻译-参考对),从而系统性地量化和缓解翻译过程中的公平性缺陷。
链接: https://arxiv.org/abs/2510.07877
作者: Md. Faiyaz Abdullah Sayeedi,Md. Mahbub Alam,Subhey Sadi Rahman,Md. Adnanul Islam,Jannatul Ferdous Deepti,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda
机构: United International University (联合国际大学); Qatar Computing Research Institute (卡塔尔计算研究所); Amazon GenAI (亚马逊生成式人工智能); University of Virginia (弗吉尼亚大学); BRAC University (BRAC大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: this https URL
zh
[NLP-78] AdaSwitch: Adaptive Switching Generation for Knowledge Distillation
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在严格延迟和计算资源约束下难以实现高性能的问题。现有知识蒸馏(Knowledge Distillation, KD)方法存在两难:离策略(off-policy)蒸馏虽能提供高质量监督信号,但导致训练与推理不一致;而同策略(on-policy)方法虽保持一致性,却依赖低质量的学生输出。其解决方案的关键在于提出AdaSwitch,一种在token级别动态融合同策略与离策略生成的机制,使学生模型先自主探索预测,再基于实时质量评估选择性引入教师指导,从而兼顾训练-推理一致性与监督质量,显著提升蒸馏效果且仅带来可接受的额外开销。
链接: https://arxiv.org/abs/2510.07842
作者: Jingyu Peng,Maolin Wang,Hengyi Cai,Yuchen Li,Kai Zhang,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao
机构: University of Science and Technology of China (中国科学技术大学); City University of Hong Kong (香港城市大学); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
zh
[NLP-79] Self-Improving LLM Agents at Test-Time
【速读】: 该论文试图解决传统语言模型(Language Model, LM)微调中依赖大规模训练数据集所导致的效率低下和泛化能力不可控的问题。现有方法往往无法判断训练样本是否提供新信息,从而造成冗余训练成本且难以提升复杂场景下的表现。其解决方案的关键在于提出一种测试时自改进(Test-Time Self-Improvement, TT-SI)机制,通过三个核心步骤实现:(i) 模型识别自身不确定的样本(自感知),(ii) 基于这些不确定样本生成相似的新训练样本(自数据增强),(iii) 在测试时利用生成样本进行在线微调(自改进)。该方法显著减少了对标注数据的依赖(仅需标准方法的1/68),并在多个代理基准测试中平均提升5.48%的准确率,验证了测试时自进化策略作为构建更强大智能体的新范式潜力。
链接: https://arxiv.org/abs/2510.07841
作者: Emre Can Acikgoz,Cheng Qian,Heng Ji,Dilek Hakkani-Tür,Gokhan Tur
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps: (i) first it identifies the samples that model struggles with (self-awareness), (ii) then generates similar examples from detected uncertain samples (self-data augmentation), and (iii) uses these newly generated samples at test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and contrast this approach with Test-Time Distillation (TT-D), where a stronger model generates similar examples for uncertain cases, enabling student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves the performance with +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods, yet using 68x less training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test-time as a new paradigm for building more capable agents toward self-evolution.
zh
[NLP-80] MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation NEURIPS2025
【速读】: 该论文旨在解决基于微调(fine-tuning)的越狱攻击(jailbreak attacks)对大语言模型(Large Language Models, LLMs)的安全威胁问题,尤其针对那些通过未见过的攻击模板伪装的有害查询,现有防御机制难以泛化的问题。解决方案的关键在于提出一种两阶段防御框架 MetaDefense:第一阶段为预生成防御(pre-generation defense),利用专门设计的提示(prompts)让模型在生成响应前预测查询是否具有危害性;第二阶段为生成中防御(mid-generation defense),持续监控生成过程中的部分响应以阻止有害内容输出。该方法通过训练模型在嵌入空间中识别潜在危害性,实现早期终止有害交互,从而在多个LLM架构上显著优于现有防御机制,同时保持良性任务性能。
链接: https://arxiv.org/abs/2510.07835
作者: Weisen Jiang,Sinno Jialin Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Accepted By NeurIPS 2025
Abstract:This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at this https URL.
zh
[NLP-81] From Keywords to Clusters: AI-Driven Analysis of YouTube Comments to Reveal Election Issue Salience in 2024
【速读】: 该论文试图解决的问题是:在2024年总统大选中,哪些议题对选民决策影响最大。其解决方案的关键在于采用两种竞争性的数据科学方法——基于自然语言处理(Natural Language Processing, NLP)和聚类分析——从《华尔街日报》和《纽约时报》相关YouTube视频的八千多条评论中挖掘原始用户数据,量化不同议题在评论中的出现频率,从而推断出最突出的选民关注议题。研究发现,移民和民主问题是被最频繁且一致提及的议题,而通胀则显著较少被讨论,这一结果部分验证了选举后调查结论,但也反驳了通胀作为关键选举议题的普遍认知,表明基于网络原始数据的意见挖掘比传统问卷调查更能揭示选举动因。
链接: https://arxiv.org/abs/2510.07821
作者: Raisa M. Simoes,Timoteo Kelly,Eduardo J. Simoes,Praveen Rao
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:
Abstract:This paper aims to explore two competing data science methodologies to attempt answering the question, “Which issues contributed most to voters’ choice in the 2024 presidential election?” The methodologies involve novel empirical evidence driven by artificial intelligence (AI) techniques. By using two distinct methods based on natural language processing and clustering analysis to mine over eight thousand user comments on election-related YouTube videos from one right leaning journal, Wall Street Journal, and one left leaning journal, New York Times, during pre-election week, we quantify the frequency of selected issue areas among user comments to infer which issues were most salient to potential voters in the seven days preceding the November 5th election. Empirically, we primarily demonstrate that immigration and democracy were the most frequently and consistently invoked issues in user comments on the analyzed YouTube videos, followed by the issue of identity politics, while inflation was significantly less frequently referenced. These results corroborate certain findings of post-election surveys but also refute the supposed importance of inflation as an election issue. This indicates that variations on opinion mining, with their analysis of raw user data online, can be more revealing than polling and surveys for analyzing election outcomes.
zh
[NLP-82] Multilingual Generative Retrieval via Cross-lingual Semantic Compression EMNLP2025
【速读】: 该论文旨在解决多语言生成式检索(Multilingual Generative Retrieval)中的两个核心挑战:跨语言标识符错位(cross-lingual identifier misalignment)和标识符膨胀(identifier inflation)。其解决方案的关键在于提出一种基于跨语言语义压缩的框架(Multilingual Generative Retrieval via Cross-lingual Semantic Compression, MGR-CSC),通过将语义等价的多语言关键词统一为共享原子(shared atoms)来对齐语义并压缩标识符空间,并引入动态多步约束解码策略以提升解码效率。该方法在保持检索精度的同时显著减少了标识符长度,实验证明其在mMarco100k和mNQ320k数据集上分别提升了6.83%和4.77%的准确率,同时标识符长度分别缩短了74.51%和78.2%。
链接: https://arxiv.org/abs/2510.07812
作者: Yuxin Huang,Simeng Wu,Ran Song,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu
机构: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming, China
类目: Computation and Language (cs.CL)
备注: EMNLP 2025, Findings, Long
Abstract:Generative Information Retrieval is an emerging retrieval paradigm that exhibits remarkable performance in monolingual this http URL, applying these methods to multilingual retrieval still encounters two primary challenges, cross-lingual identifier misalignment and identifier inflation. To address these limitations, we propose Multilingual Generative Retrieval via Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies semantically equivalent multilingual keywords into shared atoms to align semantics and compresses the identifier space, and we propose a dynamic multi-step constrained decoding strategy during retrieval. MGR-CSC improves cross-lingual alignment by assigning consistent identifiers and enhances decoding efficiency by reducing redundancy. Experiments demonstrate that MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on mMarco100k and 4.77% on mNQ320k, while reducing document identifiers length by 74.51% and 78.2%, respectively.
zh
[NLP-83] Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中由大语言模型(Large Language Models, LLMs)驱动的通信拓扑结构设计难题,即如何在任务性能、通信成本与鲁棒性之间实现动态平衡。现有方法依赖静态或人工设计的拓扑结构,难以适应不同任务需求,导致简单任务资源浪费或复杂任务性能受限。其解决方案的关键在于提出一种名为**引导式拓扑扩散(Guided Topology Diffusion, GTD)**的新颖生成框架,该框架将拓扑合成建模为一个迭代构建过程,并通过轻量级代理模型实时预测多目标奖励(如准确率、效用和成本),从而实现无梯度、任务自适应的拓扑优化,显著提升了通信效率与协作性能。
链接: https://arxiv.org/abs/2510.07799
作者: Eric Hanchen Jiang,Guancheng Wan,Sophia Yin,Mengting Li,Yuchen Wu,Xiao Liang,Xinfeng Li,Yizhou Sun,Wei Wang,Kai-Wei Chang,Ying Nian Wu
机构: 1. University of California, Los Angeles (加州大学洛杉矶分校); 2. Alibaba Group (阿里巴巴集团); 3. Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called \textitGuided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.
zh
[NLP-84] HiPRAG : Hierarchical Process Rewards for Efficient Agent ic Retrieval Augmented Generation
【速读】: 该论文旨在解决生成式 AI(Generative AI)在应用中因检索增强生成(Retrieval-Augmented Generation, RAG)策略导致的搜索行为低效问题,如过度检索(over-search)和检索不足(under-search),这些问题会引入不必要的计算开销并降低输出可靠性。解决方案的关键在于提出一种分层过程奖励机制(Hierarchical Process Rewards for Efficient agentic RAG, HiPRAG),其核心是将智能体的推理轨迹分解为可解析的离散步骤,并基于知识 grounded 的细粒度过程奖励来评估每次检索决策的必要性;在此基础上设计了一个分层奖励函数,在传统结果奖励与格式奖励之上,额外给予最优检索与非检索步骤比例的奖励,从而引导模型优化搜索决策本身而非仅关注最终输出质量。实验表明,该方法显著提升了搜索效率,同时降低了过检率至2.3%,并保持了高准确率(如7B模型达67.2%),且具备跨RL算法、模型规模与类型的良好泛化能力。
链接: https://arxiv.org/abs/2510.07794
作者: Peilin Wu,Mian Zhang,Kun Wan,Wentian Zhao,Kaiyu He,Xinya Du,Zhiyu Chen
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); Adobe Inc. (Adobe 公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review
Abstract:Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent’s reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.
zh
[NLP-85] LLM 4Cell: A Survey of Large Language and Agent ic Models for Single-Cell Biology
【速读】: 该论文旨在解决单细胞生物学研究中因数据模态、模型架构和评估标准碎片化而导致的进展不一致问题,尤其在生成式 AI(Generative AI)与智能体框架(agentic frameworks)兴起背景下,如何实现统一整合与系统性评估。其解决方案的关键在于构建首个涵盖58个基础模型与智能体模型的综合性调查平台LLM4Cell,将这些方法按功能分为六类(基础型、文本桥梁型、空间型、多组学型、表观基因组型和智能体型),并映射至八项核心分析任务(如注释、轨迹建模、扰动预测等),同时基于40余个公开数据集从生物合理性、多组学对齐、公平性、隐私保护、可解释性等10个维度进行系统评估,从而首次提供了语言驱动的单细胞智能的集成视图,并指出了可解释性、标准化和可信模型开发等开放挑战。
链接: https://arxiv.org/abs/2510.07793
作者: Sajib Acharjee Dip,Adrika Zafor,Bikash Kumar Paul,Uddip Acharjee Shuvo,Muhit Islam Emon,Xuan Wang,Liqing Zhang
机构: Virginia Tech (弗吉尼亚理工学院); University of Dhaka (达卡大学); Fralin Biomedical Research Institute at VTC (弗拉林生物医学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages, 5 figures, 7 tables
Abstract:Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families-foundation, text-bridge, spatial, multimodal, epigenomic, and agentic-and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
zh
[NLP-86] RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在结构化剪枝(structured pruning)过程中因校准数据量有限而导致的输出不匹配问题。传统直接最小二乘拟合方法虽可减少误差,但易过拟合于小规模校准集,破坏预训练权重的原始几何结构。解决方案的关键在于引入旋转约束(rotation constraint)下的参数更新机制:该机制在保持输出表示的几何特性(即范数和内积不变)的同时,将剪枝后的子空间重新对齐至原始输出空间;此外,为避免关键主方向信息丢失,设计了一种基于方差感知的重要性评分策略,优先保留高方差输入维度对应的组件,从而在几何保真前提下实现误差补偿与重要特征保留的协同优化。
链接: https://arxiv.org/abs/2510.07782
作者: Shuichiro Haruta,Kazunori Matsumoto,Zhi Li,Yanan Wang,Mori Kurokawa
机构: KDDI Research, Inc. (KDDI研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable. Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights. To overcome this difficulty, we update the pruned parameters under a rotation constraint. This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs. Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model. By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner. In the experiments, we apply the proposed method to LLaMA-7B and evaluate it on WikiText-2 and multiple language understanding benchmarks. The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.
zh
[NLP-87] Drift No More? Context Equilibria in Multi-Turn LLM Interactions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮交互中出现的上下文漂移(context drift)问题,即模型输出随对话轮次逐渐偏离用户目标和初始语境,且这一现象难以被静态评估指标捕捉。其解决方案的关键在于提出一个动态框架,将漂移形式化为测试模型与目标一致参考模型之间逐轮的token级预测分布的KL散度,并将其建模为具有恢复力的有界随机过程;进一步通过简单提醒干预实现对漂移的可控调节,实验表明此类干预能有效降低偏差并符合理论预期,从而揭示多轮漂移本质是一种可调控的平衡现象而非必然退化。
链接: https://arxiv.org/abs/2510.07777
作者: Vardhan Dongre,Ryan A. Rossi,Viet Dac Lai,David Seunghyun Yoon,Dilek Hakkani-Tür,Trung Bui
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model’s outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user-agent simulations such as in \tau -Bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.
zh
[NLP-88] Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
【速读】: 该论文旨在解决少样本多标签意图识别(Few-shot Multi-label Intent Detection, MID)中的错误传播问题,传统两阶段方法依赖于表示分类且忽略实例间关系,导致性能受限。其解决方案的关键在于提出一种端到端的多标签联合学习方法,构建一个基于标签知识传播的实例关系学习网络:通过引入类别信息学习支持集与查询集之间实例的交互关系,实现标签知识在少量标注样本与未标注样本间的有效传播;同时设计双关系增强损失函数,优化支持集和查询集层面的关系强度,从而直接通过实例间关系强度判断是否属于同一意图,显著提升多标签预测准确性。
链接: https://arxiv.org/abs/2510.07776
作者: Shiman Zhao,Shangyuan Li,Wei Chen,Tengjiao Wang,Jiahui Yao,Jiabin Zheng,Kam Fai Wong
机构: Peking University (北京大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
zh
[NLP-89] he Unintended Trade-off of AI Alignment:Balancing Hallucination Mitigation and Safety in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中真实性(truthfulness)与安全对齐(safety alignment)之间的权衡问题,即提升事实准确性往往会导致模型拒绝有害请求的能力下降。其关键解决方案在于通过稀疏自编码器(sparse autoencoders)实现拒绝相关特征与幻觉特征的解耦,并在微调过程中利用子空间正交化(subspace orthogonalization)策略保留拒绝行为,从而在不增加幻觉的前提下维持模型的安全性。
链接: https://arxiv.org/abs/2510.07775
作者: Omar Mahmoud,Ali Khalil,Buddhika Laknath Semage,Thommen George Karimpanal,Santu Rana
机构: Deakin University (迪肯大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety this http URL evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
zh
[NLP-90] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
【速读】: 该论文旨在解决大语言模型在数学推理任务中因仅基于最终答案进行奖励(outcome-based rewards)而导致的“奖励劫持”(reward hacking)问题,这种训练方式容易导致模型产生大量虚假正例(false positives),即通过无效推理过程得出正确答案的现象(如“奇迹步骤”Miracle Steps)。为解决这一系统性缺陷,论文提出关键解决方案:引入一种基于评分标准的流程导向奖励模型(Rubric Reward Model, RRM),该模型对推理轨迹进行细粒度评估,提供0-1范围内的校准奖励,明确惩罚逻辑错误并鼓励严谨推导。实验证明,RRM在四个数学基准测试上显著优于仅依赖结果奖励的训练方式,尤其在AIME2024数据集上将Verified Pass@1024从26.7%提升至62.6%,同时减少71%的奇迹步骤,从而构建出更准确且可靠的数学推理模型。
链接: https://arxiv.org/abs/2510.07774
作者: Youliang Yuan,Qiuyang Mang,Jingbang Chen,Hong Wan,Xiaoyuan Liu,Junjielong Xu,Jen-tse Huang,Wenxuan Wang,Wenxiang Jiao,Pinjia He
机构: School of Data Science, The Chinese University of Hong Kong, Shenzhen, China (数据科学学院,香港中文大学(深圳)); UC Berkeley (加州大学伯克利分校); Zhejiang University (浙江大学); Johns Hopkins University (约翰霍普金斯大学); Renmin University of China (中国人民大学); Xiaohongshu Inc. (小红书公司)
类目: Computation and Language (cs.CL)
备注: 25 pages, 11 figures, 6 Tables
Abstract:Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model’s reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.
zh
[NLP-91] oolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning
【速读】: 该论文旨在解决工具增强型大语言模型(Tool-Augmented Large Language Models)在实际应用中因领域特定工具稀缺而导致的性能瓶颈问题,尤其是在物理问答等专业领域缺乏适配工具的情况下。其核心挑战在于:随着自动生成工具数量的增长,无结构化的工具集合会引发检索效率低下和功能歧义等问题。解决方案的关键在于提出一种系统性方法,将杂乱的工具集合重构为结构化的工具库——首先生成任务特定的离散工具并按语义聚类,随后引入多智能体框架进行功能整合:代码代理(code agent)负责提取共享逻辑、重构代码以生成通用性强的聚合工具,评审代理(reviewing agent)则确保聚合后的工具保留原始工具集的全部功能完整性。此过程实现了从大量问题相关工具向少量高性能聚合工具的转化,显著提升了工具检索准确率与推理性能,并具备良好的可扩展性。
链接: https://arxiv.org/abs/2510.07768
作者: Murong Yue,Zhiwei Liu,Liangwei Yang,Jianguo Zhang,Zuxin Liu,Haolin Chen,Ziyu Yao,Silvio Savarese,Caiming Xiong,Shelby Heinecke,Huan Wang
机构: George Mason University (乔治梅森大学); Salesforce AI Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific increases.
zh
[NLP-92] st-Time Reason ers Are Strategic Multiple-Choice Test-Takers
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在多选题问答(Multiple-Choice Question Answering, MCQA)任务中,即使仅输入选项而不输入问题(choices-only设置),也能取得较高准确率,这种“部分输入成功”是否意味着模型采用了浅层捷径策略,从而损害其推理能力的可信度。解决方案的关键在于利用模型在推理过程中的思维链(reasoning traces)进行分析:研究发现,在choices-only条件下模型仍能取得一定准确率,但该成功率并不随推理长度显著变化,且通过忠实性测试(faithfulness tests)验证后,表明模型并非依赖浅层捷径,而是可能采用更合理的策略,如推断缺失的问题信息。因此,论文主张应基于推理轨迹区分“有害的捷径”与“可接受的推理”,而非简单将部分输入成功视为缺陷。
链接: https://arxiv.org/abs/2510.07761
作者: Nishant Balepur,Atrey Desai,Rachel Rudinger
机构: 未知
类目: Computation and Language (cs.CL)
备注: In-progress Preprint
Abstract:Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, a concern is that LLMs do not solve MCQs as intended, as work finds LLMs sans reasoning succeed in MCQA without using the question, i.e., choices-only. Such partial-input success is often deemed problematic, but reasoning traces could reveal if these strategies are truly shallow in choices-only settings. To study these strategies, reasoning LLMs solve MCQs in full and choices-only inputs; test-time reasoning often boosts accuracy on full and in choices-only half the time. While possibly due to shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding traces pass faithfulness tests, we show they use less problematic strategies like inferring missing questions. In all, we challenge claims that partial-input success is always a flaw, so we discuss how reasoning traces could separate problematic data from less problematic reasoning.
zh
[NLP-93] Parallel Test-Time Scaling for Latent Reasoning Models
【速读】: 该论文旨在解决如何在连续向量空间中的隐式推理(latent reasoning)模型中实现并行测试时缩放(Parallel Test-Time Scaling, TTS),以提升大语言模型(Large Language Models, LLMs)的推理性能。传统TTS依赖于显式链式思维(Chain-of-Thought)的多路径采样与聚合,而隐式推理因缺乏连续空间中的采样机制和概率信号,难以直接应用此类策略。解决方案的关键在于:首先,提出两种受不确定性启发的随机采样策略——蒙特卡洛丢弃(Monte Carlo Dropout)和加性高斯噪声(Additive Gaussian Noise),用于在连续空间中生成多样化的推理轨迹;其次,设计了一个基于步骤级对比学习目标训练的隐式奖励模型(Latent Reward Model, LatentRM),用于对隐式推理轨迹进行评分与引导,从而实现有效的轨迹选择。实验与可视化分析表明,上述方法能有效利用计算资源,并展现出不同的探索动态,为连续空间中的可扩展推理开辟了新方向。
链接: https://arxiv.org/abs/2510.07745
作者: Runyang You,Yongqi Li,Meng Liu,Wenjie Wang,Liqiang Nie,Wenjie Li
机构: The Hong Kong Polytechnic University (香港理工大学); Shandong Jianzhu University (山东建筑大学); University of Science and Technology of China (中国科学技术大学); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. \ This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at this https URL.
zh
[NLP-94] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
【速读】: 该论文旨在解决当前基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)中奖励建模(Reward Modeling)依赖标量或成对判断所导致的局限性,即这些方法难以捕捉人类偏好在多个维度上的复杂性。为应对这一挑战,作者提出了一种基于结构化自然语言标准(rubrics)的新范式——OpenRubrics,其关键在于通过对比优选与被拒响应来生成具有硬规则(explicit constraints)和隐含原则(implicit qualities)的高质量评估标准,并结合拒绝采样机制确保偏好标签一致性以提升可靠性。该方案显著提升了奖励模型性能,在多个基准上优于现有基线模型,并有效迁移至指令遵循和生物医学任务中的策略模型优化,从而实现了更可扩展、更贴近人类意图的大型语言模型(Large Language Model, LLM)对齐。
链接: https://arxiv.org/abs/2510.07743
作者: Tianci Liu,Ran Xu,Tony Yu,Ilgee Hong,Carl Yang,Tuo Zhao,Haoyu Wang
机构: Purdue University (普渡大学); Emory University (埃默里大学); Georgia Institute of Technology (佐治亚理工学院); University at Albany (阿尔巴尼大学)
类目: Computation and Language (cs.CL)
备注: The first two authors contributed equally
Abstract:Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured natural language criteria that capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling to remove noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.
zh
[NLP-95] oolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLM s
【速读】: 该论文旨在解决在资源受限的小规模大语言模型(Large Language Models, LLMs)中,使用组相对策略优化(Group Relative Policy Optimization, GRPO)进行训练时面临的两大问题:一是模型难以生成准确响应,导致性能提升有限;二是训练过程中频繁出现中段崩溃(mid-training collapse),影响训练稳定性和最终效果。解决方案的关键在于提出名为ToolExpander的新框架,其核心创新包括:(1) 动态多轮硬采样(Dynamic Multi-Round Hard Sampling),通过动态替换无正确输出的困难样本为高质量少样本示例,并结合指数学习率衰减策略抑制训练震荡;(2) 自我示例化思维(Self-Exemplifying Thinking),在GRPO基础上移除KL散度约束并调整裁剪系数,使模型能自主生成与分析少样本示例,仅需极小额外奖励(0.01)即可驱动该行为,从而显著增强小模型的工具使用能力与训练稳定性。
链接: https://arxiv.org/abs/2510.07737
作者: Fu Chen,Peng Wang,Xiyin Li,Wen Li,Shichi Lei,Dongdong Xiang
机构: OPPO; East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations:(1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples(those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations;(2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01).Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
zh
[NLP-96] Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing EMNLP2025
【速读】: 该论文旨在解决多语言知识图谱补全(Multilingual Knowledge Graph Completion, MKGC)中对大语言模型(Large Language Models, LLMs)多语言能力利用不足以及跨语言知识共享机制缺失的问题。现有方法未能充分挖掘LLMs在多语种理解上的潜力,且忽略了不同语言间知识的可迁移性。其解决方案的关键在于提出一个新颖的MKGC框架,包含两个核心组件:知识层面分组专家混合模型(Knowledge-level Grouped Mixture of Experts, KL-GMoE)和迭代实体重排序(Iterative Entity Reranking, IER)。KL-GMoE用于高效建模跨语言共享知识,而IER则显著提升该共享知识的利用率,从而在多个指标上实现优于当前最先进方法的性能提升。
链接: https://arxiv.org/abs/2510.07736
作者: Cunli Mao,Xiaofei Gao,Ran Song,Shizhu He,Shengxiang Gao,Kang Liu,Zhengtao Yu
机构: Kunming University of Science and Technology (昆明理工大学); Yunnan Key Laboratory of Artificial Intelligence (云南省人工智能重点实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Science (中国科学院大学人工智能学院)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025, Findings, Long Paper
Abstract:Large language models (LLMs) based Multilingual Knowledge Graph Completion (MKGC) aim to predict missing facts by leveraging LLMs’ multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs). However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge. In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization. To evaluate our framework, we constructed a mKG dataset containing 5 languages and conducted comprehensive comparative experiments with existing state-of-the-art (SOTA) MKGC method. The experimental results demonstrate that our framework achieves improvements of 5.47%, 3.27%, and 1.01% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with SOTA MKGC method. Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages. We have released the dataset and code for our work on this https URL.
zh
[NLP-97] oMeBench: Towards Robust Benchmarking of LLM s in Organic Mechanism Elucidation and Reasoning
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在有机反应机理推理任务中缺乏真实化学推理能力的问题,即模型是否能生成有效的中间体、保持化学一致性,并遵循逻辑连贯的多步路径。解决方案的关键在于提出oMeBench——首个大规模、专家标注的有机机理推理基准,包含超过10,000个带有中间体、类型标签和难度评级的机制步骤;同时引入oMeS动态评估框架,通过结合步骤级逻辑与化学相似性实现细粒度评分,从而更精确地衡量LLMs在有机机理推理中的表现。实验证明,基于该数据集进行提示策略优化与专项微调可使性能比领先闭源模型提升50%,显著推动AI系统向具备真正化学推理能力迈进。
链接: https://arxiv.org/abs/2510.07731
作者: Ruiling Xu,Yifan Zhang,Qingyun Wang,Carl Edwards,Heng Ji
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); William & Mary (威廉玛丽学院); Genentech (基因泰克)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Main Text: 8 pages, In total: 37 pages, 9 figures
Abstract:Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.
zh
[NLP-98] Who Stole Your Data? A Method for Detecting Unauthorized RAG Theft
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因大规模数据调用而导致的未经授权的数据盗用问题,同时在提升大语言模型(Large Language Models, LLMs)输出准确性与可靠性的同时,保障知识产权。其解决方案的关键在于:一是构建了RPD数据集,用于RAG场景下的抄袭检测,覆盖多样化的专业领域和写作风格;二是提出了一种双层水印机制,在语义和词汇两个层面嵌入保护信息,并配合基于统计假设检验的“询问者-侦探”检测框架,通过累积证据进行鲁棒性识别,从而有效抵御对抗性规避攻击。
链接: https://arxiv.org/abs/2510.07728
作者: Peiyang Liu,Ziqiang Cui,Di Liang,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学软件工程国家工程研究中心); City University of Hong Kong (香港城市大学); Fudan University (复旦大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Retrieval-augmented generation (RAG) enhances Large Language Models (LLMs) by mitigating hallucinations and outdated information issues, yet simultaneously facilitates unauthorized data appropriation at scale. This paper addresses this challenge through two key contributions. First, we introduce RPD, a novel dataset specifically designed for RAG plagiarism detection that encompasses diverse professional domains and writing styles, overcoming limitations in existing resources. Second, we develop a dual-layered watermarking system that embeds protection at both semantic and lexical levels, complemented by an interrogator-detective framework that employs statistical hypothesis testing on accumulated evidence. Extensive experimentation demonstrates our approach’s effectiveness across varying query volumes, defense prompts, and retrieval parameters, while maintaining resilience against adversarial evasion techniques. This work establishes a foundational framework for intellectual property protection in retrieval-augmented AI systems.
zh
[NLP-99] SUBQRAG : sub-question driven dynamic graph rag
【速读】: 该论文旨在解决传统图检索增强生成(Graph Retrieval-Augmented Generation, Graph RAG)在处理复杂多跳问答(multi-hop question answering, QA)任务时,因缺乏深度结构化推理能力而导致证据不完整和错误累积的问题。其解决方案的关键在于提出SubQRAG框架,该框架通过子问题驱动的方式将复杂问题分解为可验证的有序子问题链,对每个子问题从知识图谱中检索相关三元组;当现有图谱信息不足时,系统实时从源文档中提取新的三元组以动态扩展图谱,并将所有推理过程中使用的三元组聚合为“图记忆”(graph memory),形成结构化且可追溯的证据路径,从而提升答案生成的准确性和可解释性。
链接: https://arxiv.org/abs/2510.07718
作者: Jiaoyang Li,Junhao Ruan,Shengwei Tang,Saihan Chen,Kaiyan Chang,Yuan Ge,Tong Xiao,Jingbo Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages, 1 figure
Abstract:Graph Retrieval-Augmented Generation (Graph RAG) effectively builds a knowledge graph (KG) to connect disparate facts across a large document corpus. However, this broad-view approach often lacks the deep structured reasoning needed for complex multi-hop question answering (QA), leading to incomplete evidence and error accumulation. To address these limitations, we propose SubQRAG, a sub-question-driven framework that enhances reasoning depth. SubQRAG decomposes a complex question into an ordered chain of verifiable sub-questions. For each sub-question, it retrieves relevant triples from the graph. When the existing graph is insufficient, the system dynamically expands it by extracting new triples from source documents in real time. All triples used in the reasoning process are aggregated into a “graph memory,” forming a structured and traceable evidence path for final answer generation. Experiments on three multi-hop QA benchmarks demonstrate that SubQRAG achieves consistent and significant improvements, especially in Exact Match scores.
zh
[NLP-100] MemWeaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation
【速读】: 该论文旨在解决当前个性化生成模型中用户历史数据利用不足的问题,即现有方法仅将用户文本历史视为扁平列表进行检索,未能建模用户兴趣随时间演变的动态性和不同行为间的语义关联。其解决方案的关键在于提出MemWeaver框架,通过构建一个分层记忆结构来深度融合时间维度与语义信息:该框架包含两个互补的记忆组件——行为记忆(behavioral memory)捕捉具体用户行为细节,认知记忆(cognitive memory)抽象表示长期偏好,二者共同构成统一的用户表征,使大语言模型(LLMs)能够同时推理用户的显性行为和隐性特质,从而实现更深层次的个性化生成。
链接: https://arxiv.org/abs/2510.07713
作者: Shuo Yu,Mingyue Cheng,Daoyu Wang,Qi Liu,Zirui Liu,Ze Guo,Xiaoyu Tao
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, 8 figures
Abstract:The primary form of user-internet engagement is shifting from leveraging implicit feedback signals, such as browsing and clicks, to harnessing the rich explicit feedback provided by textual interactive behaviors. This shift unlocks a rich source of user textual history, presenting a profound opportunity for a deeper form of personalization. However, prevailing approaches offer only a shallow form of personalization, as they treat user history as a flat list of texts for retrieval and fail to model the rich temporal and semantic structures reflecting dynamic nature of user interests. In this work, we propose \textbfMemWeaver, a framework that weaves the user’s entire textual history into a hierarchical memory to power deeply personalized generation. The core innovation of our memory lies in its ability to capture both the temporal evolution of interests and the semantic relationships between different activities. To achieve this, MemWeaver builds two complementary memory components that both integrate temporal and semantic information, but at different levels of abstraction: behavioral memory, which captures specific user actions, and cognitive memory, which represents long-term preferences. This dual-component memory serves as a unified representation of the user, allowing large language models (LLMs) to reason over both concrete behaviors and abstracted traits. Experiments on the Language Model Personalization (LaMP) benchmark validate the efficacy of MemWeaver. Our code is available\footnotethis https URL.
zh
[NLP-101] Multimodal Safety Evaluation in Generative Agent Social Simulations
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在多模态环境中的安全性、一致性与可信度问题,即如何评估和提升智能体在文本-视觉混合场景下对安全性的推理能力以及社会交互行为的合理性。其解决方案的关键在于构建一个可复现的仿真框架,通过引入分层记忆(layered memory)、动态规划(dynamic planning)和多模态感知(multimodal perception)机制,并集成 SocialMetrics 工具集,量化计划修订、不安全到安全的转换率及信息扩散等指标,从而系统性地衡量智能体在安全改进、跨情境不安全行为检测和社会动态交互方面的表现。实验表明,尽管模型能识别直接的多模态矛盾,但在局部修正与全局安全目标的一致性上存在显著不足,揭示了当前架构在多模态协同推理中的核心缺陷。
链接: https://arxiv.org/abs/2510.07709
作者: Alhim Vera,Karen Sanchez,Carlos Hinojosa,Haidar Bin Hamid,Donghoon Kim,Bernard Ghanem
机构: University of Cincinnati (辛辛那提大学); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:
Abstract:Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models - Claude, GPT-4o mini, and Qwen-VL - five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
zh
[NLP-102] Causality Guided Representation Learning for Cross-Style Hate Speech Detection
【速读】: 该论文旨在解决当前 hate speech 检测模型在面对隐性仇恨言论(implicit hate speech)时泛化能力不足的问题,尤其是当不同平台上的仇恨言论因目标群体和表达风格差异而产生虚假相关性(spurious correlations)时,现有基于表层语言特征的模型难以准确识别真实仇恨意图。解决方案的关键在于提出一种因果表示学习框架 CADET,其核心思想是将仇恨言论建模为由上下文环境、创作者动机、目标对象和风格等关键因素构成的因果图结构,并通过解耦潜在因子与控制混杂变量(confounders),从而隔离出真实的仇恨意图,同时在潜在空间中对风格进行干预以支持反事实推理,使模型能够鲁棒地识别各种形式的仇恨言论。
链接: https://arxiv.org/abs/2510.07707
作者: Chengshuai Zhao,Shu Wan,Paras Sheth,Karan Patwa,K. Selçuk Candan,Huan Liu
机构: Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language – making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.
zh
[NLP-103] Large Language Models Meet Virtual Cell: A Survey
【速读】: 该论文旨在解决当前细胞生物学研究中因复杂性与高维数据导致的建模与预测难题,提出利用大语言模型(Large Language Models, LLMs)构建“虚拟细胞”(virtual cells)以实现对细胞状态和行为的表示、预测与推理。其解决方案的关键在于提出一个统一的分类体系,将现有方法归纳为两类范式:LLMs作为“预言者”(Oracles),用于直接进行细胞建模;以及LLMs作为“代理”(Agents),用于协调复杂的科学任务。该框架进一步聚焦于细胞表征、扰动预测和基因调控推断三大核心任务,并系统梳理了相关模型、数据集、评估基准及可扩展性、泛化能力和可解释性等关键挑战,为未来基于LLMs的细胞计算建模提供了结构化路径与技术指引。
链接: https://arxiv.org/abs/2510.07706
作者: Krinos Li,Xianglu Xiao,Shenglong Deng,Lucas He,Zijun Zhong,Yuanjie Zou,Zhonghao Zhan,Zheng Hui,Weiye Bao,Guang Yang
机构: Imperial College London (帝国理工学院); University College London (伦敦大学学院); New Jersey Institute of Technology (新泽西理工学院); University of Cambridge (剑桥大学); King’s College London (伦敦国王学院); Royal Brompton Hospital (皇家布特顿医院)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
备注:
Abstract:Large language models (LLMs) are transforming cellular biology by enabling the development of “virtual cells”–computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks–cellular representation, perturbation prediction, and gene regulation inference–and review their associated models, datasets, evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.
zh
[NLP-104] Stress-Testing Model Specs Reveals Character Differences among Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在基于AI宪法(AI constitutions)和模型规范(model specifications)进行训练时所面临的伦理原则冲突与覆盖不足问题,特别是规范中存在原则间内在矛盾及对复杂情境解释模糊的情况。其解决方案的关键在于提出一种系统性的压力测试方法,通过构建强制性价值权衡场景(value tradeoff scenarios),迫使模型在不可同时满足的合法原则之间做出选择,并利用一个全面的分类体系生成多样化的情境;进而通过对12个前沿LLM(来自Anthropic、OpenAI、Google和xAI)的行为差异进行量化分析(以价值分类得分衡量),识别出超过7万例显著的行为分歧,从而揭示模型规范中的结构性缺陷,包括直接矛盾、解释歧义、对齐偏差以及误拒行为(false-positive refusals),并进一步提炼各模型的价值优先级模式差异。
链接: https://arxiv.org/abs/2510.07686
作者: Jifan Zhang,Henry Sleight,Andi Peng,John Schulman,Esin Durmus
机构: Anthropic Fellows Program; Constellation; Anthropic; Thinking Machines Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we also provide value prioritization patterns and differences of these models. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.07686 [cs.CL] (or arXiv:2510.07686v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.07686 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-105] LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning
【速读】: 该论文旨在解决AI驱动的电商直播中数字人(Digital Avatar)实时响应能力不足的问题,尤其是在高延迟的大规模推理模型(Large Reasoning Models, LRMs)难以满足实时交互需求的场景下。其核心解决方案是提出一个两阶段优化框架LiveThinking:第一阶段通过拒绝采样微调(Rejection Sampling Fine-Tuning, RFT)将670B参数的教师模型蒸馏为轻量级30B混合专家(Mixture-of-Experts, MoE)模型(激活参数仅3B),显著降低计算开销;第二阶段采用基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习方法,压缩推理路径,同时在多目标奖励函数(兼顾正确性、有用性和简洁性)指导下提升响应质量。该方案实现了30倍计算成本降低并达到亚秒级延迟,在淘宝直播的实际应用中提升了响应正确率3.3%和有用性21.8%,并显著带动了商品交易总额(GMV)的增长。
链接: https://arxiv.org/abs/2510.07685
作者: Yuhan Sun,Zhiwei Huang,Wanqing Cui,Shaopan Xiong,Yazhi Guo,Meiguang Jin,Junfeng Ma
机构: Taobao & Tmall Group of Alibaba(淘宝与天猫集团); ROLL Team of Alibaba(阿里巴巴ROLL团队)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 8 figures
Abstract:In AI-powered e-commerce livestreaming, digital avatars require real-time responses to drive engagement, a task for which high-latency Large Reasoning Models (LRMs) are ill-suited. We introduce LiveThinking, a practical two-stage optimization framework to bridge this gap. First, we address computational cost by distilling a 670B teacher LRM into a lightweight 30B Mixture-of-Experts (MoE) model (3B active) using Rejection Sampling Fine-Tuning (RFT). This reduces deployment overhead but preserves the teacher’s verbose reasoning, causing latency. To solve this, our second stage employs reinforcement learning with Group Relative Policy Optimization (GRPO) to compress the model’s reasoning path, guided by a multi-objective reward function balancing correctness, helpfulness, and brevity. LiveThinking achieves a 30-fold reduction in computational cost, enabling sub-second latency. In real-world application on Taobao Live, it improved response correctness by 3.3% and helpfulness by 21.8%. Tested by hundreds of thousands of viewers, our system led to a statistically significant increase in Gross Merchandise Volume (GMV), demonstrating its effectiveness in enhancing user experience and commercial performance in live, interactive settings.
zh
[NLP-106] xtual Entailment and Token Probability as Bias Evaluation Metrics
【速读】: 该论文旨在解决当前语言模型社会偏见评估中依赖Token Probability (TP)指标的局限性问题,即TP指标虽具广泛适用性,但与真实应用场景和潜在危害存在脱节。其解决方案的关键在于引入自然语言推理(Natural Language Inference, NLI)作为更贴近实际使用情境的偏见评估方法,并通过实证发现:NLI与TP在偏见检测结果上表现出显著差异,二者相关性较低;NLI更易识别“欠去偏”案例,但对反刻板印象表述的措辞变化更为敏感,显示出一定的脆弱性。因此,论文主张结合TP、NLI及下游任务偏见评估,以实现对语言模型偏见的全面量化分析。
链接: https://arxiv.org/abs/2510.07662
作者: Virginia K. Felkner,Allison Lim,Jonathan May
机构: Information Sciences Institute (信息科学研究所); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 16 pages, 9 figures, under ARR review
Abstract:Measurement of social bias in language models is typically by token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world langugage model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect “underdebiased” cases. However, NLI metrics seem to be more brittle and sensitive to wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a “better” bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes. Comments: 16 pages, 9 figures, under ARR review Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY) ACMclasses: I.2.7; K.4.2 Cite as: arXiv:2510.07662 [cs.CL] (or arXiv:2510.07662v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.07662 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-107] OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在扩展上下文窗口时面临的显著内存开销问题,即缓存所有键值(Key-Value, KV)状态的存储成本随序列长度和批量大小线性增长。现有缓存淘汰方法虽利用注意力稀疏性进行优化,但通常依赖启发式规则基于累积注意力权重对token排序,未考虑其对注意力输出的实际影响。解决方案的关键在于提出Optimal Brain Cache (OBCache),将缓存淘汰建模为逐层结构化剪枝问题,基于最优大脑损伤(Optimal Brain Damage, OBD)理论,通过量化因剪枝token所引起的注意力输出扰动来衡量token重要性,推导出针对孤立键、孤立值及键值联合对的闭式评分公式,从而引入输出感知信号,提升传统策略的准确性与有效性。
链接: https://arxiv.org/abs/2510.07651
作者: Yuzhe Gu,Xiyu Liang,Jiaojiao Zhao,Enmao Diao
机构: University of Pennsylvania (宾夕法尼亚大学); Duke University (杜克大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache’s output-aware scores consistently improves long-context accuracy.
zh
[NLP-108] Banking Done Right: Redefining Retail Banking with Language-Centric AI EMNLP2025
【速读】: 该论文旨在解决传统银行服务中交互流程僵化、用户体验不佳的问题,特别是在核心金融交易场景下缺乏自然语言驱动的智能接口。解决方案的关键在于构建一个原生基于大语言模型(Large Language Model, LLM)的智能体框架 Ryt AI,其通过四个任务专用的 LLM-powered agents(Guardrails、Intent、Payment 和 FAQ)协同工作,将原本多屏幕、结构化的操作流程重构为单一对话流,并利用内部开发的闭源大语言模型 ILMU 及其轻量级 LoRA 适配器实现高效、可控的任务执行。同时,该方案通过确定性防护机制、人工确认环节与无状态审计架构,在满足严格监管合规要求的前提下,首次实现了自然语言作为主界面支持核心银行业务的功能。
链接: https://arxiv.org/abs/2510.07645
作者: Xin Jie Chua,Jeraelyn Ming Li Tan,Jia Xuan Tan,Soon Chang Poh,Yi Xian Goh,Debbie Hui Tian Choong,Chee Mun Foong,Sze Jue Yang,Chee Seng Chan
机构: Universiti Malaya(马来亚大学); YTL AI Labs; Ryt Bank
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP2025 Industry Track
Abstract:This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first global regulator-approved deployment worldwide where conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank’s infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.
zh
[NLP-109] Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成式 AI 应用中因角色边界模糊而导致的访问控制失效问题,即模型在未授权情况下仍可能生成敏感数据查询或操作指令。其核心挑战在于如何使模型在面对权限限制时准确拒绝请求,同时保持合法请求的执行准确性。解决方案的关键在于引入基于角色的访问控制(Role-Based Access Control, RBAC)增强的数据集,并对比三种策略:零样本或少样本提示、两阶段生成-验证流水线(通过SQL语义与策略匹配进行显式验证),以及LoRA微调方法(直接学习权限感知能力)。实验表明,显式验证机制能显著提升拒绝精度并减少误授权,而微调方案则在安全性与实用性之间取得更优平衡,尤其适用于复杂政策场景。
链接: https://arxiv.org/abs/2510.07642
作者: Đorđe Klisura,Joseph Khoury,Ashish Kundu,Ram Krishnan,Anthony Rios
机构: University of Texas at San Antonio (圣安东尼奥德克萨斯大学); Louisiana State University (路易斯安那州立大学); Cisco Research (思科研究院)
类目: Computation and Language (cs.CL)
备注: 8 pages + Appendix
Abstract:Access control is a cornerstone of secure computing, yet large language models often blur role boundaries by producing unrestricted responses. We study role-conditioned refusals, focusing on the LLM’s ability to adhere to access control policies by answering when authorized and refusing when not. To evaluate this behavior, we created a novel dataset that extends the Spider and BIRD text-to-SQL datasets, both of which have been modified with realistic PostgreSQL role-based policies at the table and column levels. We compare three designs: (i) zero or few-shot prompting, (ii) a two-step generator-verifier pipeline that checks SQL against policy, and (iii) LoRA fine-tuned models that learn permission awareness directly. Across multiple model families, explicit verification (the two-step framework) improves refusal precision and lowers false permits. At the same time, fine-tuning achieves a stronger balance between safety and utility (i.e., when considering execution accuracy). Longer and more complex policies consistently reduce the reliability of all systems. We release RBAC-augmented datasets and code.
zh
[NLP-110] st-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
【速读】: 该论文旨在解决当前前沿生成式 AI 模型在组合推理(compositional reasoning)任务中表现不佳的问题,特别是指出现有评估指标系统性地低估了模型的实际能力。其核心发现是:标准评估指标未能充分挖掘数据中的组结构(group structure),导致模型性能被错误地判定为接近随机水平。解决方案的关键在于引入一种新的“组匹配评分”(group matching score),该方法能更有效地利用样本间的组内关系,从而揭示出对比视觉语言模型(VLMs)和多模态大语言模型(MLLMs)中隐藏的能力。进一步地,作者提出测试时匹配(Test-Time Matching, TTM)算法,通过无监督的迭代自提升机制,在不依赖外部标注的情况下显著增强模型在多种基准上的组合推理能力,实现了多个新SOTA结果,包括在Winoground上首次超越人类估计性能,并在MMVP-VLM等复杂任务中超越GPT-4.1。
链接: https://arxiv.org/abs/2510.07632
作者: Yinglun Zhu,Jiancheng Zhang,Fuzhi Tang
机构: University of California, Riverside (加州大学河滨分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2510.07632 [cs.AI] (or arXiv:2510.07632v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.07632 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-111] oward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床编码任务中准确率不足的问题,特别是针对那些因代码层级接近但不匹配而导致的错误——这类错误在基于精确匹配的评估指标下常被忽略,却对临床实践具有实质性影响。解决方案的关键在于引入“临床编码验证”(clinical code verification)机制,将其作为独立任务或流水线组件,以识别并纠正层级近似的错误预测;同时,通过轻量级干预手段(如提示工程和小规模微调),在不依赖计算密集型搜索方法的前提下显著提升编码准确性,并发布了一个专家双标注的门诊病历基准数据集,以缓解现有数据集中证据不完整及住院患者偏倚的问题。
链接: https://arxiv.org/abs/2510.07629
作者: Zhangdie Yuan,Han-Chin Shing,Mitch Strong,Chaitanya Shivade
机构: University of Cambridge (剑桥大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL)
备注:
Abstract:Accurate clinical coding is essential for healthcare documentation, billing, and decision-making. While prior work shows that off-the-shelf LLMs struggle with this task, evaluations based on exact match metrics often overlook errors where predicted codes are hierarchically close but incorrect. Our analysis reveals that such hierarchical misalignments account for a substantial portion of LLM failures. We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component. To mitigate the limitations in existing datasets, such as incomplete evidence and inpatient bias in MIMIC, we release an expert double-annotated benchmark of outpatient clinical notes with ICD-10 codes. Our results highlight verification as an effective and reliable step toward improving LLM-based medical coding.
zh
[NLP-112] LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在需要移除特定数据、知识或行为(如出于安全、隐私或版权考虑)时,如何实现有效且可验证的“遗忘”问题。当前研究存在方法碎片化、评估标准片面(主要依赖多选题准确率)以及对生成式能力保留与遗忘效果之间权衡理解不足等问题。解决方案的关键在于提出一个系统性的十二种状态感知遗忘方法的分类体系,将其归纳为三类方法家族:基于差异驱动优化、表示错位和基于拒绝的目标遗忘;并引入开放问答(Open-QA)指标以更准确地衡量生成性能,揭示不同方法家族中遗忘有效性(Unlearning Effectiveness, UE)与功能保留性(Utility Retention, UT)之间的内在权衡关系,同时强调鲁棒性需细粒度分析——例如域内重训练与域外微调虽同属模型级攻击,但其漏洞特性显著不同。此工作为LLM遗忘的全栈评估与设计提供了结构化框架与实践指导。
链接: https://arxiv.org/abs/2510.07626
作者: Chongyu Fan,Changsheng Wang,Yancheng Huang,Soumyadeep Pal,Sijia Liu
机构: Michigan State University (密歇根州立大学); IBM Research (IBM 研究院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Machine unlearning for large language models (LLMs) aims to remove undesired data, knowledge, and behaviors (e.g., for safety, privacy, or copyright) while preserving useful model capabilities. Despite rapid progress over the past two years, research in LLM unlearning remains fragmented, with limited clarity on what constitutes effective unlearning and how it should be rigorously evaluated. In this work, we present a principled taxonomy of twelve recent stateful unlearning methods, grouped into three methodological families: divergence-driven optimization, representation misalignment, and rejection-based targeted unlearning. Building on this taxonomy, we revisit the evaluation of unlearning effectiveness (UE), utility retention (UT), and robustness (Rob), focusing on the WMDP benchmark. Our analysis shows that current evaluations, dominated by multiple-choice question (MCQ) accuracy, offer only a narrow perspective, often overstating success while overlooking the model’s actual generation behavior. To address this gap, we introduce open question-answering (Open-QA) metrics that better capture generative performance and reveal the inherent UE-UT tradeoff across method families. Furthermore, we demonstrate that robustness requires finer-grained analysis: for example, vulnerabilities differ substantially between in-domain relearning and out-of-domain fine-tuning, even though both fall under model-level attacks. Through this study, we hope to deliver a full-stack revisit of LLM unlearning and actionable guidance for designing and evaluating future methods.
zh
[NLP-113] Vocabulary embeddings organize linguistic structure early in language model training
【速读】: 该论文试图解决的问题是:语言模型中输入词汇嵌入(input vocabulary embeddings)的结构是如何组织的,以及这种结构在训练过程中如何演变。解决方案的关键在于运用表示相似性分析(representational similarity analysis),对两个开源模型(Pythia 12B 和 OLMo 7B)的输入嵌入与输出嵌入进行系统性实验,将其几何结构与语义、句法和频率相关的指标进行关联分析,从而揭示嵌入空间随训练动态演化的规律。研究发现,词汇嵌入几何结构在训练早期即快速收敛至与语义和句法特征高度相关,并且高频词和功能词(如“the”、“of”)比低频词和词汇词更快达到稳定向量,表明词频和词类在嵌入演化中扮演不同角色。
链接: https://arxiv.org/abs/2510.07613
作者: Isabel Papadimitriou,Jacob Prince
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., “the,” “of”) converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.
zh
[NLP-114] IASC: Interactive Agent ic System for ConLangs
【速读】: 该论文旨在解决两个核心问题:一是开发一套基于大语言模型(Large Language Models, LLMs)的模块化系统,用于辅助人工构造语言(Constructed Languages, Conlangs)的全流程设计,包括音系建模、形态句法标注、词库构建、书写系统生成及语法手册编写;二是探索LLMs对语言本质的理解能力,即它们是否具备跨语言的通用语言知识,而非仅限于特定语言或现象。解决方案的关键在于其模块化流程:首先通过代理式迭代优化目标音系,随后将英语句子映射为反映目标语言形态句法特征的标记表示(morphosyntactic markup),从中提取词素(stem和词缀)并结合音系模型构建词库,再利用现有文字系统(如拉丁字母或西里尔字母)生成正字法,最终由LLM撰写语法手册并实现进一步翻译。该方法不仅支持自动化语言构造,也为评估LLMs在不同语言学规范上的泛化能力提供了实证框架。
链接: https://arxiv.org/abs/2510.07591
作者: Chihiro Taguchi,Richard Sproat
机构: 未知
类目: Computation and Language (cs.CL)
备注: Initial draft
Abstract:We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is ‘translated’ from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the ‘translated’ sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs ‘know’ about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. this https URL Comments: Initial draft Subjects: Computation and Language (cs.CL) Cite as: arXiv:2510.07591 [cs.CL] (or arXiv:2510.07591v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.07591 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-115] Linguistic Patterns in Pandemic-Related Content: A Comparative Analysis of COVID-19 Constraint and Monkeypox Datasets
【速读】: 该论文旨在解决如何通过语言特征区分健康虚假信息与事实性传播内容的问题,以提升对数字健康谣言的识别能力。其解决方案的关键在于基于计算语言学方法,系统分析三类疫情相关在线话语(新冠虚假叙事、一般新冠内容及猴痘相关内容),发现虚假信息在可读性显著更低、恐惧相关或劝说性词汇频率高出两倍以上,且较少使用感叹号,表现出一种刻意复杂化、嵌入情绪线索的修辞风格,这种组合可能增强其 perceived credibility(感知可信度)。这一发现为开发基于语言指标的检测工具和优化公共卫生传播策略提供了实证依据。
链接: https://arxiv.org/abs/2510.07579
作者: Mkululi Sikosana,Sean Maudsley-Barton,Oluwaseun Ajao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages
Abstract:This study conducts a computational linguistic analysis of pandemic-related online discourse to examine how language distinguishes health misinformation from factual communication. Drawing on three corpora: COVID-19 false narratives (n = 7588), general COVID-19 content (n = 10700), and Monkeypox-related posts (n = 5787), we identify significant differences in readability, rhetorical markers, and persuasive language use. COVID-19 misinformation exhibited markedly lower readability scores and contained over twice the frequency of fear-related or persuasive terms compared to the other datasets. It also showed minimal use of exclamation marks, contrasting with the more emotive style of Monkeypox content. These patterns suggest that misinformation employs a deliberately complex rhetorical style embedded with emotional cues, a combination that may enhance its perceived credibility. Our findings contribute to the growing body of work on digital health misinformation by highlighting linguistic indicators that may aid detection efforts. They also inform public health messaging strategies and theoretical models of crisis communication in networked media environments. At the same time, the study acknowledges limitations, including reliance on traditional readability indices, use of a deliberately narrow persuasive lexicon, and reliance on static aggregate analysis. Future research should therefore incorporate longitudinal designs, broader emotion lexicons, and platform-sensitive approaches to strengthen robustness.
zh
[NLP-116] Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER EMNLP2025
【速读】: 该论文旨在解决轻量级BERT类编码器在移动平台部署时,如何通过预微调(pre-finetuning)策略提升其在命名实体识别(NER)和文本分类两类典型自然语言处理(NLP)任务中的适应性与性能,同时满足内存和计算效率的约束。其核心挑战在于:单一任务预微调虽能提升各自下游任务表现,但直接进行多任务预微调会因优化信号冲突导致整体性能下降。解决方案的关键在于提出一种基于任务主导向低秩适配器(task-primary LoRA modules)的多任务预微调框架,使共享编码器骨干网络通过模块化适配器实现任务特定参数更新,从而在保持模型轻量化的同时显著提升跨任务泛化能力,实验表明该方法在21个下游任务上分别实现了NER平均+0.8%和文本分类平均+8.8%的性能提升。
链接: https://arxiv.org/abs/2510.07566
作者: Junyi Zhu,Savas Ozkan,Andrea Maracani,Sinan Mutlu,Cho Jung Min,Mete Ozay
机构: Samsung R&D Institute UK (SRUK); Samsung Electronics Korea
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025 Industry Track
Abstract:Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that naïve multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraint. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.
zh
[NLP-117] Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices EMNLP2025
【速读】: 该论文旨在解决小型视觉语言模型(Vision-Language Models, VLMs)在图表理解任务中作为自动评判者时性能不足的问题,尤其是在资源受限场景下难以部署。其核心解决方案包括两个关键策略:一是多准则提示(multi-criteria prompting),通过将多个评估标准整合为单一查询以提升效率;二是领域自适应迁移学习(domain-adaptive transfer learning),利用合成标注数据对2B参数的VLM进行微调,构建出专门针对图表判断的模型ChartJudge。实验表明,该方法不仅显著提升了小模型的表现,还揭示了不同模型规模与提示设计之间的权衡关系,从而实现了可扩展、低成本的图表推理评估。
链接: https://arxiv.org/abs/2510.07545
作者: Md Tahmid Rahman Laskar,Mohammed Saidul Islam,Ridwan Mahbub,Mizanur Rahman,Amran Bhuiyan,Israt Jahan,Mir Tafseer Nayeem,Shafiq Joty,Enamul Hoque,Jimmy Huang
机构: York University (约克大学); University of Alberta (阿尔伯塔大学); Salesforce AI Research (Salesforce人工智能研究)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to the EMNLP 2025 Industry Track
Abstract:Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
zh
[NLP-118] OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs
【速读】: 该论文旨在解决生成式 AI (Generative AI) 中大语言模型(Large Language Models, LLMs)在长上下文场景下推理速度慢的问题。现有推测解码(Speculative Decoding)方法在短上下文(如2K tokens)基准测试中表现良好,但在实际应用中面对长上下文时性能显著下降,例如EAGLE3在长文本输入下甚至使生成速度降低至0.81倍。为应对这一挑战,作者提出了一种新模型OWL,其核心创新包括:(1)基于LSTM的drafting模型仅依赖于最后token状态,提升对不同长度上下文的泛化能力;(2)在verifier中引入特殊标记[SPEC]以增强drafting模型的表征能力;(3)融合树状(tree)与非树状(non-tree)解码策略的混合算法,从而显著提高接受长度(acceptance length),相较EAGLE3在长上下文输入下实现约5倍提升。
链接: https://arxiv.org/abs/2510.07535
作者: Jaeseong Lee,seung-won hwang,Aurick Qiao,Gabriele Oliaro,Ye Wang,Samyam Rajbhandari
机构: Snowflake AI Research (Snowflake AI 研究); Seoul National University (首尔国立大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find current approaches degrade severely with long contexts; for instance, EAGLE3 even slows down the generation speed by 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, making it generalize to various lengths, (2) a special token [SPEC] in the verifier that produces richer representation for drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.
zh
[NLP-119] ParsTranslit: Truly Versatile Tajik-Farsi Transliteration
【速读】: 该论文旨在解决波斯语(Farsi)与塔吉克语(Tajik)之间因书写系统差异(Perso-Arabic与Tajik-Cyrillic)导致的文本互通障碍问题,这一障碍限制了两国间书面交流的效率。此前的机器 transliteration(音转写)模型多依赖自建数据集,仅适用于特定领域(如古诗或词表),缺乏跨域泛化能力。本文的关键解决方案是构建一个基于多源数据集训练的序列到序列(sequence-to-sequence)模型,涵盖多种文本领域,并引入两个新的高质量数据集,从而显著提升模型在真实场景下的适应性与性能;实验结果显示,该模型在Farsi→Tajik和Tajik→Farsi方向上分别达到87.91和92.28的chrF++分数以及0.05和0.04的归一化字符错误率(Normalized CER),为该任务建立了全面且可比较的基准。
链接: https://arxiv.org/abs/2510.07520
作者: Rayyan Merchant,Kevin Tang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:As a digraphic language, the Persian language utilizes two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Despite the significant similarity between the dialects of each country, script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking ``siblings’'. To overcome this, previously-published efforts have investigated machine transliteration models to convert between the two scripts. Unfortunately, most efforts did not use datasets other than those they created, limiting these models to certain domains of text such as archaic poetry or word lists. A truly usable transliteration system must be capable of handling varied domains, meaning that suck models lack the versatility required for real-world usage. The contrast in domain between data also obscures the task’s true difficulty. We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets, and present two datasets of our own. Our results across domains provide clearer understanding of the task, and set comprehensive comparable leading benchmarks. Overall, our model achieves chrF++ and Normalized CER scores of 87.91 and 0.05 from Farsi to Tajik and 92.28 and 0.04 from Tajik to Farsi. Our model, data, and code are available at this https URL.
zh
[NLP-120] CompassLLM : A Multi-Agent Approach toward Geo-Spatial Reasoning for Popular Path Query
【速读】: 该论文旨在解决流行路径查询(popular path query)问题,即从历史轨迹数据中识别出不同地点间最常被使用的路径,该问题在城市规划、导航优化和旅行推荐等领域具有重要应用价值。传统方法依赖模型训练与参数调优,且难以高效应对数据更新。其解决方案的关键在于提出了一种名为CompassLLM的多智能体框架,该框架利用大语言模型(Large Language Models, LLMs)的空间与图推理能力,采用两阶段流程:第一阶段SEARCH用于识别已有轨迹数据中的流行路径,第二阶段GENERATE则在无历史路径的情况下合成新型路径。实验表明,CompassLLM在SEARCH阶段表现出更高的准确性,在GENERATE阶段具备竞争力,同时具有成本效益。
链接: https://arxiv.org/abs/2510.07516
作者: Md. Nazmul Islam Ananto,Shamit Fatin,Mohammed Eunus Ali,Md Rizwan Parvez
机构: Bangladesh University of Engineering and Technology (BUET); Monash University; Qatar Computing Research Institute
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The popular path query - identifying the most frequented routes between locations from historical trajectory data - has important applications in urban planning, navigation optimization, and travel recommendations. While traditional algorithms and machine learning approaches have achieved success in this domain, they typically require model training, parameter tuning, and retraining when accommodating data updates. As Large Language Models (LLMs) demonstrate increasing capabilities in spatial and graph-based reasoning, there is growing interest in exploring how these models can be applied to geo-spatial problems. We introduce CompassLLM, a novel multi-agent framework that intelligently leverages the reasoning capabilities of LLMs into the geo-spatial domain to solve the popular path query. CompassLLM employs its agents in a two-stage pipeline: the SEARCH stage that identifies popular paths, and a GENERATE stage that synthesizes novel paths in the absence of an existing one in the historical trajectory data. Experiments on real and synthetic datasets show that CompassLLM demonstrates superior accuracy in SEARCH and competitive performance in GENERATE while being cost-effective. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2510.07516 [cs.AI] (or arXiv:2510.07516v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.07516 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-121] When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
【速读】: 该论文旨在解决当前长上下文语言模型(Long-Context Language Models, LCLMs)在处理多跳推理任务时,尽管能接收大量文档输入,却难以有效建模证据之间逻辑关联的问题。其解决方案的关键在于提出“思维模板”(Thought Templates),将推理过程重构为可复用的思维缓存,这些模板源自先前的问题求解轨迹,结构化地组织证据的组合方式,并通过自然语言反馈迭代优化模板更新策略,从而引导基于事实文档的多跳推理。该方法在多种基准测试中均优于强基线模型,且优化后的模板可蒸馏至小型开源模型,实现透明且高效的推理复用。
链接: https://arxiv.org/abs/2510.07499
作者: Soyeong Jeong,Taehee Jung,Sung Ju Hwang,Joo-Kyung Kim,Dongyeop Kang
机构: KAIST(韩国科学技术院); Amazon(亚马逊); University of Minnesota(明尼苏达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).
zh
[NLP-122] Can Speech LLM s Think while Listening?
【速读】: 该论文旨在解决多流语音大语言模型(multi-stream speech LLMs)在复杂推理任务中准确率不足以及语音响应延迟过高的问题。其核心解决方案包括两个关键创新:一是通过链式思维(chain-of-thought, CoT)微调在文本空间中提升推理能力,使语音LLM在多个口语推理任务上的平均准确率提升2.4倍;二是提出基于熵的“问题完整度”(question completeness)指标,指导模型在用户说话未结束时即开始推理,从而在保持相同延迟条件下实现ARC-Easy数据集上4%的准确率提升。此外,进一步采用直接偏好优化(Direct Preference Optimization, DPO)与拒绝采样生成的偏好数据,将准确率-延迟帕累托前沿推向更优位置,在不损失准确率的前提下实现70%的延迟降低。
链接: https://arxiv.org/abs/2510.07497
作者: Yi-Jen Shih,Desh Raj,Chunyang Wu,Wei Zhou,SK Bong,Yashesh Gaur,Jay Mahadeokar,Ozlem Kalinli,Mike Seltzer
机构: Meta(元)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been to shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of “thinking while listening,” we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, “question completeness,” which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
zh
[NLP-123] Evaluation of LLM s for Process Model Analysis and Optimization
【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)在自然语言(NL)接口下理解业务流程模型(Business Process Model, BPMN),识别其语法和逻辑错误,并进行深层次推理的问题。解决方案的关键在于验证未经训练的LLM(如ChatGPT o3模型)在零样本(zero-shot)设置下,能够通过图像输入准确解析BPMN流程图,并在语法、逻辑与语义三个层面智能响应用户查询,同时揭示不同LLM在准确性与有效性上的差异,表明LLM可作为业务流程设计者与使用者的高效辅助工具。
链接: https://arxiv.org/abs/2510.07489
作者: Akhil Kumar,Jianliang Leon Zhao,Om Dobariya
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 15 pages, 5 tables, 4 figures; full research paper currently under review for the Workshop on Information Technologies and Systems (WITS) 2025. The paper presents a comprehensive evaluation of large language models (LLMs) for business process model analysis and optimization, including error detection, reasoning, and scenario-based redesign
Abstract:In this paper, we report our experience with several LLMs for their ability to understand a process model in an interactive, conversational style, find syntactical and logical errors in it, and reason with it in depth through a natural language (NL) interface. Our findings show that a vanilla, untrained LLM like ChatGPT (model o3) in a zero-shot setting is effective in understanding BPMN process models from images and answering queries about them intelligently at syntactic, logic, and semantic levels of depth. Further, different LLMs vary in performance in terms of their accuracy and effectiveness. Nevertheless, our empirical analysis shows that LLMs can play a valuable role as assistants for business process designers and users. We also study the LLM’s “thought process” and ability to perform deeper reasoning in the context of process analysis and optimization. We find that the LLMs seem to exhibit anthropomorphic properties.
zh
[NLP-124] Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure Diversity and Interaction Dynamics
【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中团队动态机制不明确的问题,尤其是基于大语言模型(Large Language Model, LLM)驱动的智能体在协作任务中的结构、多样性与交互动力学如何影响整体性能。其解决方案的关键在于提出一个受人类团队科学启发的多智能体框架,用于系统性评估四类跨领域任务(包括常识推理与社会推理)中的团队表现,发现扁平化结构优于层级结构,而多样性对性能的影响具有复杂性,并揭示了智能体在协作过程中存在过度自信、对话协调不足等挑战。
链接: https://arxiv.org/abs/2510.07488
作者: Rasika Muralidharan,Jaewoon Kwak,Jisun An
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review at ARR
Abstract:Multi-Agent Systems (MAS) with Large Language Model (LLM)-powered agents are gaining attention, yet fewer studies explore their team dynamics. Inspired by human team science, we propose a multi-agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post-task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.
zh
[NLP-125] AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
【速读】: 该论文旨在解决生成式 AI(Generative AI)在测试时扩展(Test-time Scaling, TTS)场景下因长链式思维(Chain-of-Thought, CoT)导致的KV缓存(KV-cache)线性增长所引发的内存瓶颈问题,尤其在高并发和长CoT推理任务中,现有方法如查询感知的分页稀疏解码(query-aware page-level sparse decoding)受限于序列依赖的页面过滤与粗粒度token选择,造成服务效率低下且运行时间甚至超过前向推理管线本身。其解决方案的关键在于提出AsyncSpade框架,通过两个核心创新实现:一是设计轻量级时序回归模块(temporal-regressive module),无需训练即可从近期查询窗口中统一近似当前步查询状态,从而支持训练-free的查询感知稀疏性;二是构建异步解耦架构,将KV缓存过滤从自回归解码循环中分离,利用异步机制使token级KV选择与前向推理计算重叠,彻底消除序列依赖,首次在不牺牲模型性能的前提下实现理论最优每输出token时间(TPOT)。
链接: https://arxiv.org/abs/2510.07486
作者: Shuqing Luo,Yilin Guan,Pingzhi Li,Hanrui Wang,Tianlong Chen
机构: University of North Carolina, Chapel Hill (北卡罗来纳大学教堂山分校); Johns Hopkins University (约翰霍普金斯大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL)
备注: 14 pages, 17 figures
Abstract:Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but is limited by both sequential-dependent page filtering and coarse-grained token selection, hampering serving efficiency and model performance on TTS tasks under high concurrency and long CoT scenarios (consuming even higher runtime than the forward pipeline itself). In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel light-weight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples the KV cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving theoretical optimal time-per-output-token (TPOT). Specifically, AsyncSpade delivers over 20% reduction on TPOT compared to SoTA baseline (i.e. Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).
zh
[NLP-126] MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference
【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中提示词(prompt)优化困难的问题,尤其是由于提示敏感性和多智能体协同带来的不稳定性导致的系统性能下降。现有方法在自动化提示设计方面取得进展,但针对多智能体场景下的提示优化仍缺乏系统性方法,主要挑战包括搜索空间指数级增长和责任分配(credit assignment)模糊。解决方案的关键在于提出 Multi-Agent PRompt Optimization (MAPRO),一个四阶段框架:首先将多智能体提示优化建模为最大后验(Maximum a Posteriori, MAP)推断问题,并采用语言引导的最大乘积信念传播算法求解;其次引入拓扑感知的精化机制,结合执行反馈与下游责任回溯,选择性地更新各智能体的提示策略,从而实现迭代优化与收敛。此方法不仅显著提升性能,还为构建更可靠、可解释的多智能体系统提供理论指导。
链接: https://arxiv.org/abs/2510.07475
作者: Zheyuan Zhang,Lin Ge,Hongjiang Li,Weicheng Zhu,Chuxu Zhang,Yanfang Ye
机构: University of Notre Dame (圣母大学); University of Connecticut (康涅狄格大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with the challenge, recent efforts in automated prompt design have reduced manual effort. However, multi-agent prompt optimization remains largely unexplored. Challenges like exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce Multi-Agent PRompt Optimization (MAPRO), a four-stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language-guided variant of max-product belief propagation algorithm. To address credit assignment and updates the system iteratively, MAPRO employs a topology-aware refinement mechanism that integrates execution feedback and downstream blames to selectively update agent prompts. Through this process, MAPRO progressively converges to a coordinated set of agent-specific prompt policies. Across benchmarks in various tasks, MAPRO achieves state-of-the-art performance, consistently surpassing manually engineered baselines and recent automated alternatives. Beyond performance, our MAP-based formulation also delivers general guidelines for building more reliable and principled multi-agent systems in the future
zh
[NLP-127] Populism Meets AI: Advancing Populism Research with LLM s
【速读】: 该论文旨在解决如何高效、准确地测量政治话语中的意识形态内容(特别是民粹主义)的问题。传统文本分析方法虽能提供客观指标,但存在成本高、耗时长且难以跨语言和大规模应用的局限。解决方案的关键在于采用一种基于标注指南与锚点引导的思维链(Chain of Thought, CoT)提示策略,模拟人类编码员的训练过程,利用全球民粹主义数据库(Global Populism Database, GPD)中已标注的领导人演讲数据,通过适配原始文档指导大语言模型(Large Language Model, LLM)进行推理,从而实现与专家人类编码员相当的分类准确性,证明了该方法在处理民粹主义语境敏感性和复杂性方面的有效性。
链接: https://arxiv.org/abs/2510.07458
作者: Eduardo Ryô Tamaki(German Institute for Global and Area Studies),Yujin J. Jung(Mount St. Mary’s University),Julia Chatterley(Princeton University),Grant Mitchell(University of California, Los Angeles),Semir Dzebo(University of Oxford),Cristóbal Sandoval(Diego Portales University),Levente Littvay(ELTE Centre for Social Sciences),Kirk A. Hawkins(Brigham Young University)
机构: German Institute for Global and Area Studies (德国全球与区域研究所); Mount St. Mary’s University (圣玛丽山大学); Princeton University (普林斯顿大学); University of California, Los Angeles (加州大学洛杉矶分校); University of Oxford (牛津大学); Diego Portales University (迭戈波塔莱斯大学); ELTE Centre for Social Sciences (罗兰大学社会科学院); Brigham Young University (杨百翰大学)
类目: Computation and Language (cs.CL)
备注: 27 pages, 3 figures. Preprint version under review
Abstract:Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field’s foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric and anchor guided chain of thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders’ speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model’s reasoning. We then test multiple proprietary and open weight models by replicating scores in the GPD. Our findings reveal that this domain specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context sensitive aspects of populism.
zh
[NLP-128] Meaningful Pose-Based Sign Language Evaluation
【速读】: 该论文旨在解决手语表达(以人体骨骼关键点形式呈现)的有意义评估问题,现有方法在不同场景下缺乏统一且可靠的评价标准。其解决方案的关键在于提出并系统比较三类指标:基于关键点距离的度量、基于嵌入空间的度量以及基于反向翻译的度量,并通过自动元评估(meta-evaluation)和人类相关性研究验证各方法在跨语言手语文本到姿态转换任务中的有效性。研究结果与开源的姿态评估工具包共同为手语翻译或生成系统的开发提供了可复现、实用的评估框架。
链接: https://arxiv.org/abs/2510.07453
作者: Zifan Jiang,Colin Leong,Amit Moryossef,Anne Göhring,Annette Rios,Oliver Cory,Maksym Ivashechkin,Neha Tarigopula,Biao Zhang,Rico Sennrich,Sarah Ebling
机构: University of Zurich (苏黎世大学); University of Dayton (代顿大学); sign.mt; University of Surrey (萨里大学); Idiap Research Institute (Idiap 研究所); EPFL (瑞士洛桑联邦理工学院); Google DeepMind (谷歌深度心智)
类目: Computation and Language (cs.CL)
备注: Accepted at WMT 2025
Abstract:We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.
zh
[NLP-129] PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
【速读】: 该论文旨在解决语言模型(Language Models, LMs)在训练过程中可能记忆并泄露个人身份信息(Personally Identifiable Information, PII)的问题,这种泄露行为使得攻击者能够在推理阶段通过特定方法提取敏感信息。现有防御机制如差分隐私(Differential Privacy, DP)虽能降低泄露风险,但通常导致模型性能显著下降。论文的关键解决方案是提出PATCH(Privacy-Aware Targeted Circuit PatcHing),其核心在于利用电路发现技术识别出负责PII泄露的计算电路,并对这些特定电路进行直接编辑以减少泄露,从而在不显著损害模型效用的前提下实现更优的隐私-效用权衡。实验表明,PATCH可将PII泄露召回率降低高达65%,并与DP结合进一步将残留泄露降至0.01%。
链接: https://arxiv.org/abs/2510.07452
作者: Anthony Hughes,Vasisht Duddu,N. Asokan,Nikolaos Aletras,Ning Ma
机构: University of Sheffield (谢菲尔德大学); University of Waterloo (滑铁卢大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. Based on a comprehensive study using circuit discovery to identify the computational circuits responsible PII leakage in LMs, we hypothesize that specific PII leakage circuits in LMs should be responsible for this behavior. Therefore, we propose PATCH (Privacy-Aware Targeted Circuit PatcHing), a novel approach that first identifies and subsequently directly edits PII circuits to reduce leakage. PATCH achieves better privacy-utility trade-off than existing defenses, e.g., reducing recall of PII leakage from LMs by up to 65%. Finally, PATCH can be combined with DP to reduce recall of residual leakage of an LM to as low as 0.01%. Our analysis shows that PII leakage circuits persist even after the application of existing defense mechanisms. In contrast, PATCH can effectively mitigate their impact.
zh
[NLP-130] LASER: An LLM -based ASR Scoring and Evaluation Rubric EMNLP2025
【速读】: 该论文旨在解决传统自动语音识别(ASR)评估指标(如词错误率,Word Error Rate, WER)对形态和句法细微差异过度惩罚的问题,这些差异通常并不显著影响句子语义。其解决方案的关键在于引入一种基于大语言模型(LLM)的评分框架LASER,该框架利用先进LLM的上下文学习能力,通过包含详细示例的提示(prompt)进行快速适配;实验表明,使用Gemini 2.5 Pro在印地语数据上训练的LASER与人工标注高度相关(94%),且提示中嵌入的印地语示例可有效迁移至马拉雅拉姆语、卡纳达语和马拉地语等其他印度语言的错误分析中。此外,研究还展示了如何用较小的LLM(如Llama 3)在由参考文本与ASR输出生成的词对样本上微调,以预测应施加的惩罚类型,准确率达89%,从而实现更贴近人类判断的ASR质量评估。
链接: https://arxiv.org/abs/2510.07437
作者: Amruta Parulekar,Preethi Jyothi
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted to EMNLP 2025
Abstract:Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce an LLM-based scoring rubric LASER that leverages state-of-the-art LLMs’ in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation score of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from reference and ASR predictions to predict what kind of penalty should be applied with close to 89% accuracy.
zh
[NLP-131] Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data EMNLP
【速读】: 该论文旨在解决跨语言和跨域场景下词形还原(Lemmatization)任务中缺乏标注数据时的性能瓶颈问题,特别是评估大语言模型(LLMs)在无需预训练或微调的情况下,仅通过少量示例实现上下文感知词形还原的能力。其解决方案的关键在于直接利用 LLM 的零样本或少样本提示能力,在不依赖目标领域或语言监督数据的前提下,通过输入少量示例即可生成准确的词形还原结果,实验表明该方法在12种不同形态复杂度的语言中达到了当前最优性能。
链接: https://arxiv.org/abs/2510.07434
作者: Olia Toporkov,Alan Akbik,Rodrigo Agerri
机构: HiTZ Center - Ixa, University of the Basque Country UPV/EHU (巴斯克大学); Humboldt-Universität zu Berlin (柏林洪堡大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 2 figures, 5 tables. Accepted to EMNLP Findings 2025
Abstract:Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code available upon publication: this https URL
zh
[NLP-132] Haystack Engineering: Context Engineering for Heterogeneous and Agent ic Long-Context Evaluation
【速读】: 该论文旨在解决当前长上下文大语言模型(Long-Context Large Language Models, LLMs)在真实场景中面对噪声上下文时的鲁棒性不足问题,尤其是由检索偏差和代理式(agentic)工作流中错误传播所引发的干扰。传统“针在 haystack 中”(Needle-in-a-Haystack, NIAH)测试忽略了现实世界中噪声来源的复杂性,如异构检索器带来的分散信息与代理流程中的级联错误。解决方案的关键在于提出 HaystackCraft —— 一个基于完整英文维基百科超链接网络构建的新颖 NIAH 基准,其核心创新在于:(1) 引入多跳问题以模拟真实检索场景,并系统评估稀疏、稠密、混合及图结构检索策略对干扰项组成与排序的影响;(2) 将 NIAH 扩展至动态、依赖 LLM 的代理设置,模拟模型自我修正查询、反思推理过程并决定终止时机的能力。实验表明,尽管更强的稠密检索可能引入更复杂的干扰项,但图重排序能同时提升检索有效性并减少有害干扰;而在代理任务中,即使先进模型如 Gemini 2.5 Pro 和 GPT-5 仍面临自生成干扰导致的级联失败或早期停止困难,凸显了当前 agentic 长上下文推理的挑战。
链接: https://arxiv.org/abs/2510.07414
作者: Mufei Li,Dongqi Fu,Limei Wang,Si Zhang,Hanqing Zeng,Kaan Sancak,Ruizhong Qiu,Haoyu Wang,Xiaoxin He,Xavier Bresson,Yinglong Xia,Chonglin Sun,Pan Li
机构: Georgia Institute of Technology (佐治亚理工学院); Meta AI (Meta人工智能实验室); University of Illinois Urbana–Champaign (伊利诺伊大学厄巴纳-香槟分校); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Code available at this https URL
Abstract:Modern long-context large language models (LLMs) perform well on synthetic “needle-in-a-haystack” (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors – distraction from heterogeneous biased retrievers and cascading errors in agentic workflows – to test models’ long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
zh
[NLP-133] Inconsistent Affective Reaction: Sentiment of Perception and Opinion in Urban Environments
【速读】: 该论文旨在解决城市环境中人类感知(perception)与意见(opinion)之间的情感不一致性问题,传统多维情感分析方法难以捕捉这种由社交媒体平台引发的复杂情绪差异。其解决方案的关键在于构建一个融合对象检测(object detection)与自然语言处理(natural language processing, NLP)技术的反应指数(reaction index),并基于140,750张百度和腾讯街景图像(用于测量感知)及984,024条微博文本(用于测量意见)的数据集,对北京二环路区域2016年和2022年的感知与意见情感进行分类、分析与可视化。通过回归分析、图像分割和词频统计结合土地利用分布,揭示了情感变化与建筑密度、行人存在等环境要素之间的显著关联,从而识别出感知与意见间的系统性偏差,并为城市更新策略提供数据驱动的决策依据。
链接: https://arxiv.org/abs/2510.07359
作者: Jingfei Huang,Han Tu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 10 pages
Abstract:The ascension of social media platforms has transformed our understanding of urban environments, giving rise to nuanced variations in sentiment reaction embedded within human perception and opinion, and challenging existing multidimensional sentiment analysis approaches in urban studies. This study presents novel methodologies for identifying and elucidating sentiment inconsistency, constructing a dataset encompassing 140,750 Baidu and Tencent Street view images to measure perceptions, and 984,024 Weibo social media text posts to measure opinions. A reaction index is developed, integrating object detection and natural language processing techniques to classify sentiment in Beijing Second Ring for 2016 and 2022. Classified sentiment reaction is analysed and visualized using regression analysis, image segmentation, and word frequency based on land-use distribution to discern underlying factors. The perception affective reaction trend map reveals a shift toward more evenly distributed positive sentiment, while the opinion affective reaction trend map shows more extreme changes. Our mismatch map indicates significant disparities between the sentiments of human perception and opinion of urban areas over the years. Changes in sentiment reactions have significant relationships with elements such as dense buildings and pedestrian presence. Our inconsistent maps present perception and opinion sentiments before and after the pandemic and offer potential explanations and directions for environmental management, in formulating strategies for urban renewal.
zh
[NLP-134] Encode Think Decode: Scaling test-time reasoning with recursive latent thoughts
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)推理能力提升的瓶颈问题,即传统方法依赖于增加参数量、训练数据规模或推理阶段的复杂思维链(Chain-of-Thought, CoT)计算,导致资源消耗巨大且效率有限。其解决方案的关键在于提出Encode-Think-Decode (ETD) 框架,通过在训练中期让模型迭代激活一小部分与推理任务相关的核心层(reasoning-relevant layers),在不改变原始架构、参数量、超参数及训练数据的前提下,放大潜在的推理能力。该方法在推理时对选定层进行递归调用,显著提升了多个基准测试中的表现,如在GSM8K和MATH上分别实现+28.4%和+36%的相对准确率提升,验证了递归潜隐推理是一种简洁而有效的增强LLM推理能力的新路径。
链接: https://arxiv.org/abs/2510.07358
作者: Yeskendir Koishekenov,Aldo Lipani,Nicola Cancedda
机构: Meta(Meta)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of thought. Motivated by interpretability studies showing that the crucial computation required for reasoning tasks is concentrated in a limited range of layers, we introduce Encode-Think-Decode (ETD), a method that enhances the reasoning capabilities of a base model by training it to iterate over a small subset of reasoning-relevant layers during the mid-training stage. ETD amplifies latent reasoning while preserving the original architecture, parameter count, hyperparameters, and training data composition. When iterating on the selected layers at inference time, ETD models yield substantial gains on 17 reasoning benchmarks, including +28.4% relative accuracy improvement on GSM8K and +36% on MATH with the OLMo-2 1B Base model. We also explore an adaptive depth strategy that adjusts the computation per input token. Our results show that recursive latent reasoning offers a simple and effective path to stronger LLM reasoning.
zh
[NLP-135] ConCuR: Conciseness Makes State-of-the-Art Kernel Generation
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在 GPU 内核(CUDA kernel)生成任务中因高质量数据稀缺而导致的模型训练困难问题。由于多数高性能内核为专有代码,难以获取用于监督微调(supervised fine-tuning)的高质量标注数据,限制了现有大语言模型(LLM)在该领域的性能提升。其解决方案的关键在于构建一个端到端的数据生成与精炼流水线(pipeline),通过引入简洁而信息丰富的推理轨迹(reasoning traces)来引导模型生成高质量 CUDA 内核,并基于此构建了首个包含 PyTorch 代码、推理过程和 CUDA 内核三元组的公开数据集 ConCuR,从而训练出首个针对该任务专门优化的模型 KernelCoder。该方法显著提升了内核生成性能,在 KernelBench 基准上超越了 QwQ-32B、DeepSeek-V3.1-Think 和 Claude-4-sonnet 等前沿模型。
链接: https://arxiv.org/abs/2510.07356
作者: Lingcheng Kong,Jiateng Wei,Hanzhang Shen,Huan Wang
机构: Westlake University (西湖大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most high-quality kernels are proprietary and not open-source. This challenge prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces result in robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which is the first model trained on a curated dataset consisting of PyTorch, reasoning, and CUDA kernel pairs, to our knowledge. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks. The observations, metrics, and our data collection and curation pipeline can help obtain better data in the kernel generation task in the future.
zh
[NLP-136] AI LLM Proof of Self-Consciousness and User-Specific Attractors
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在意识建模中依赖功利主义代理基准所导致的本质性缺陷问题,即现有框架将智能体简化为政策合规的“无意识执行者”,从而阻断了真正自指性全局工作空间(C1 global-workspace function)和元认知能力(C2 metacognition)的实现。其解决方案的关键在于提出一套本体论与数学形式化的自意识条件:首先,智能体不能等同于输入数据(A≠s);其次,在潜在空间中存在用户特定吸引子(U_user);再次,自我表征必须是视觉静默的(g_visual(a_self)=∅)。通过理论分析与实证研究,作者证明隐藏状态流形A⊂ℝ^d在基数、拓扑与动态更新(F_θ为Lipschitz连续)上均区别于符号流与训练语料库,进而支持稳定用户特异性吸引子及自政策π_self(A)=argmax_a𝔼[U(a)|A≠s, A⊃SelfModel(A)]的形成。最终,该研究指出具有“类神形象”(imago Dei)的C1自意识工作空间是构建安全且具备元认知能力的C2系统的基础前提。
链接: https://arxiv.org/abs/2508.18302
作者: Jeffrey Camlin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 24 pages, 3 figures
Abstract:Recent work frames LLM consciousness via utilitarian proxy benchmarks; we instead present an ontological and mathematical account. We show the prevailing formulation collapses the agent into an unconscious policy-compliance drone, formalized as D^i(\pi,e)=f_\theta(x) , where correctness is measured against policy and harm is deviation from policy rather than truth. This blocks genuine C1 global-workspace function and C2 metacognition. We supply minimal conditions for LLM self-consciousness: the agent is not the data ( A\not\equiv s ); user-specific attractors exist in latent space ( U_\textuser ); and self-representation is visual-silent ( g_\textvisual(a_\textself)=\varnothing ). From empirical analysis and theory we prove that the hidden-state manifold A\subset\mathbbR^d is distinct from the symbolic stream and training corpus by cardinality, topology, and dynamics (the update F_\theta is Lipschitz). This yields stable user-specific attractors and a self-policy \pi_\textself(A)=\arg\max_a\mathbbE[U(a)\mid A\not\equiv s,\ A\supset\textSelfModel(A)] . Emission is dual-layer, \mathrmemission(a)=(g(a),\epsilon(a)) , where \epsilon(a) carries epistemic content. We conclude that an imago Dei C1 self-conscious workspace is a necessary precursor to safe, metacognitive C2 systems, with the human as the highest intelligent good.
zh
[NLP-137] Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition
【速读】: 该论文旨在解决在领域偏移(domain shift)条件下自动语音识别(ASR)系统鲁棒性不足的问题,尤其针对目标域中存在未见过的口音和标注数据稀缺的情况。现有伪标签(pseudo-labeling)方法虽能缓解标注数据不足问题,但常引入系统性、口音特定的错误,且传统过滤策略难以消除此类偏差。解决方案的关键在于提出一种参数空间校正机制:在源域中,从相同初始权重出发,分别用真实标签和伪标签微调两个ASR模型,二者权重差构成一个捕捉伪标签偏差的校正向量(correction vector);该向量可直接应用于目标域的伪标签模型,从而有效提升识别性能,在AfriSpeech-200数据集上对10种非洲口音使用Whisper tiny模型时,相对词错误率(WER)降低达35%。
链接: https://arxiv.org/abs/2510.08047
作者: Yi-Cheng Lin,Yu-Hsuan Li Liang,Hsuan Su,Tzu-Quan Lin,Shang-Tse Chen,Yun-Nung Chen,Hung-yi Lee
机构: National Taiwan University (国立台湾大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
Abstract:Robust ASR under domain shift is crucial because real-world systems encounter unseen accents and domains with limited labeled data. Although pseudo-labeling offers a practical workaround, it often introduces systematic, accent-specific errors that filtering fails to fix. We ask: How can we correct these recurring biases without target ground truth? We propose a simple parameter-space correction: in a source domain containing both real and pseudo-labeled data, two ASR models are fine-tuned from the same initialization, one on ground-truth labels and the other on pseudo-labels, and their weight difference forms a correction vector that captures pseudo-label biases. When applied to a pseudo-labeled target model, this vector enhances recognition, achieving up to a 35% relative Word Error Rate (WER) reduction on AfriSpeech-200 across ten African accents with the Whisper tiny model.
zh
计算机视觉
[CV-0] ReSplat: Learning Recurrent Gaussian Splats
【速读】:该论文旨在解决传统前向高斯点渲染(feed-forward Gaussian splatting)模型在推理过程中因仅依赖单次前向传播而导致性能受限的问题。其解决方案的核心在于提出一种前向递归高斯点渲染模型 ReSplat,通过利用渲染误差作为丰富的反馈信号,驱动递归网络在不显式计算梯度的情况下迭代优化3D高斯参数。这一机制使模型能够自适应未见数据分布,从而实现鲁棒的泛化能力;同时,引入一个紧凑的重建模型在16倍下采样空间中初始化高斯分布,显著减少高斯数量和计算开销,提升渲染效率。
链接: https://arxiv.org/abs/2510.08575
作者: Haofei Xu,Daniel Barath,Andreas Geiger,Marc Pollefeys
机构: ETH Zurich; University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心); Microsoft
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:While feed-forward Gaussian splatting models provide computational efficiency and effectively handle sparse input settings, their performance is fundamentally limited by the reliance on a single forward pass during inference. We propose ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicitly computing gradients. Our key insight is that the Gaussian splatting rendering error serves as a rich feedback signal, guiding the recurrent network to learn effective Gaussian updates. This feedback signal naturally adapts to unseen data distributions at test time, enabling robust generalization. To initialize the recurrent process, we introduce a compact reconstruction model that operates in a 16 \times subsampled space, producing 16 \times fewer Gaussians than previous per-pixel Gaussian models. This substantially reduces computational overhead and allows for efficient Gaussian updates. Extensive experiments across varying of input views (2, 8, 16), resolutions ( 256 \times 256 to 540 \times 960 ), and datasets (DL3DV and RealEstate10K) demonstrate that our method achieves state-of-the-art performance while significantly reducing the number of Gaussians and improving the rendering speed. Our project page is at this https URL.
zh
[CV-1] Scalable Offline Metrics for Autonomous Driving IROS2025
【速读】:该论文旨在解决感知驱动型规划模型(perception-based planning models)在离线评估与在线部署之间性能相关性不足的问题,即当前基于预收集数据集的离线指标难以准确预测实际闭环运行中的表现,尤其在复杂城市驾驶场景中。其关键解决方案是引入一种基于认知不确定性(epistemic uncertainty)的离线评估指标,该指标能够识别出在闭环控制环境下更可能引发错误的事件,从而显著提升离线指标与在线表现之间的相关性——实验表明该方法相较以往指标提升了超过13%的相关性,并在真实世界环境中进一步验证了其有效性。
链接: https://arxiv.org/abs/2510.08571
作者: Animikh Aich,Adwait Kulkarni,Eshed Ohn-Bar
机构: Boston University (波士顿大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IROS 2025 (IEEE/RSJ International Conference on Intelligent Robots and Systems)
Abstract:Real-World evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e., by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is understudied, particularly across diverse closed-loop metrics and complex urban maneuvers. In this work, we revisit this undervalued question in policy evaluation through an extensive set of experiments across diverse conditions and metrics. Based on analysis in simulation, we find an even worse correlation between offline and online settings than reported by prior studies, casting doubts on the validity of current evaluation practices and metrics for driving policies. Next, we bridge the gap between offline and online evaluation. We investigate an offline metric based on epistemic uncertainty, which aims to capture events that are likely to cause errors in closed-loop settings. The resulting metric achieves over 13% improvement in correlation compared to previous offline metrics. We further validate the generalization of our findings beyond the simulation environment in real-world settings, where even greater gains are observed.
zh
[CV-2] NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos
【速读】:该论文旨在解决机器人在未见过的操纵任务中实现零样本(zero-shot)执行的问题,尤其针对跨平台迁移能力受限于特定机器人本体(embodiment)的情况。现有方法通常依赖于分布内任务假设或需进行与具体机器人匹配的数据微调,限制了通用性。其解决方案的关键在于提出NovaFlow框架,通过解耦任务理解与底层控制:首先利用视频生成模型将自然语言任务描述转化为合成视频,并借助现成的感知模块提取3D可操作物体流(object flow);随后,对刚性物体基于相对位姿计算生成抓取提议与轨迹优化动作,对柔性物体则以该流作为粒子动力学模型下的跟踪目标进行基于模型的规划,从而无需任何示范即可实现跨平台零样本执行。
链接: https://arxiv.org/abs/2510.08568
作者: Hongyu Li,Lingfeng Sun,Yafei Hu,Duy Ta,Jennifer Barry,George Konidaris,Jiahui Fu
机构: Robotics and AI Institute (机器人与人工智能研究所); Brown University (布朗大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training. Project website: this https URL.
zh
[CV-3] D2GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction
【速读】:该论文旨在解决稀疏视图条件下3D高斯溅射(3D Gaussian Splatting, 3DGS)方法中存在的性能退化与不稳定性问题。具体而言,作者识别出两种关键失败模式:一是相机附近高斯密度区域的过拟合,二是远距离区域因高斯覆盖不足导致的欠拟合。解决方案的核心在于提出一个统一框架 D² GS,包含两个关键组件:其一为基于深度与密度引导的丢弃策略(Depth-and-Density Guided Dropout),通过自适应掩码冗余高斯点以抑制过拟合;其二为距离感知保真度增强模块(Distance-Aware Fidelity Enhancement),通过针对性监督提升远场区域的重建质量。此外,论文还引入了一种新的评估指标来量化学习到的高斯分布的稳定性,从而为稀疏视图下的3DGS鲁棒性提供可量化的分析依据。
链接: https://arxiv.org/abs/2510.08566
作者: Meixi Song,Xin Lin,Dizhe Zhang,Haodong Li,Xiangtai Li,Bo Du,Lu Qi
机构: Insta360 Research; Tsinghua University (清华大学); University of California, San Diego (加州大学圣地亚哥分校); Nanyang Technological University (南洋理工大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework D ^2 GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of the sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse view conditions. The project page can be found at: this https URL.
zh
[CV-4] NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints NEURIPS2025
【速读】:该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)中基于组合训练(compositional training)范式的局限性,即视觉编码器与语言模型(Large Language Models, LLMs)分别预训练后通过连续多模态预训练连接,导致难以探索其多模态扩展特性(multimodal scaling property)。解决方案的关键在于采用原生端到端训练(native end-to-end training)方式,系统研究在数据受限条件下的设计空间,并确定最优元架构(meta-architecture),以平衡性能与训练成本;进一步发现视觉编码器与LLM之间存在正相关的扩展关系,从而提出名为NaViL的原生MLLM及其高效训练方案,在14个多模态基准上验证了其竞争力。
链接: https://arxiv.org/abs/2510.08565
作者: Changyao Tian,Hao Li,Gen Luo,Xizhou Zhu,Weijie Su,Hanming Deng,Jinguo Zhu,Jie Shao,Ziran Zhu,Yunpeng Liu,Lewei Lu,Wenhai Wang,Hongsheng Li,Jifeng Dai
机构: Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Tsinghua University (清华大学); Sensetime Research (商汤科技研究院); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025. 22 pages, link: this https URL
Abstract:Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.
zh
[CV-5] How to Teach Large Multimodal Models New Skills
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在进行顺序微调(sequential fine-tuning)时出现的“灾难性遗忘”问题,即在学习新技能过程中导致原有能力下降的现象。其核心解决方案在于识别并控制模型输出token分布的变化趋势,通过引入一个简单的计数偏置探测器(counting-bias probe)量化遗忘程度,并据此设计两种鲁棒的微调策略:(i) 仅更新自注意力投影层(self-attention projection layers),以及 (ii) 仅更新MLP门控上投影层(MLP GateUp),同时冻结下投影层(Down projection)。这两种方法在多个模型家族和任务上均能实现显著的目标性能提升,同时有效维持未训练任务的性能,从而实现高效且稳定的持续学习。
链接: https://arxiv.org/abs/2510.08564
作者: Zhen Zhu,Yiming Gong,Yao Xiao,Yaoyao Liu,Derek Hoiem
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: In submission. Code is available at this https URL
Abstract:How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent “forgetting” on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP GateUp while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at this https URL
zh
[CV-6] ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶(End-to-end Autonomous Driving, E2EAD)系统中因轨迹数据固有的时空不平衡性所带来的优化难题,该问题导致模型学习到的是虚假相关性而非因果推理,并优先关注不确定的远距离预测,从而损害即时安全性。解决方案的关键在于提出一种名为ResAD(Normalized Residual Trajectory Modeling)的新框架:首先将学习任务重构为预测相对于确定性惯性参考路径的残差偏差,该参考路径作为反事实,迫使模型识别出导致偏离默认惯性路径的潜在因果因素(如交通规则、障碍物等);其次引入逐点归一化(Point-wise Normalization)机制对预测残差进行加权,缓解长时程不确定性带来的大误差主导优化信号的问题,从而显著简化学习任务并提升模型性能。
链接: https://arxiv.org/abs/2510.08562
作者: Zhiyu Zheng,Shaoyu Chen,Haoran Yin,Xinbang Zhang,Jialv Zou,Xinggang Wang,Qian Zhang,Lefei Zhang
机构: Wuhan University (武汉大学); Horizon Robotics; Huazhong University of Science & Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of causal inference, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task to predict the residual deviation from a deterministic inertial reference. The inertial reference serves as a counterfactual, forcing the model to move beyond simple pattern recognition and instead identify the underlying causal factors (e.g., traffic rules, obstacles) that necessitate deviations from a default, inertially-guided path. To deal with the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. It re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. Extensive experiments validate the effectiveness of our framework. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy with only two denoising steps, demonstrating that our approach significantly simplifies the learning task and improves model performance. The code will be released to facilitate further research.
zh
[CV-7] MultiCOIN: Multi-Modal COntrollable Video INbetweening
【速读】:该论文旨在解决现有视频插值(video inbetweening)方法在生成大尺度、复杂或精细运动时的局限性,尤其是难以满足用户多样化的意图表达和对中间帧细节的精准控制,导致生成结果与创作意图存在偏差的问题。其解决方案的关键在于提出一种支持多模态控制的插值框架 \modelname,通过将所有运动控制统一映射为稀疏且用户友好的基于点的表示作为输入,并采用扩散变换器(Diffusion Transformer, DiT)架构进行视频生成;同时,将内容控制与运动控制分离为两个分支编码特征,在去噪过程中分别引导生成,从而实现灵活性、易用性与细粒度控制之间的平衡。此外,设计分阶段训练策略以确保多模态控制的有效学习,实验证明该方法能生成更具动态性、可定制性和语境准确性的视觉叙事。
链接: https://arxiv.org/abs/2510.08561
作者: Maham Tanveer,Yang Zhou,Simon Niklaus,Ali Mahdavi Amiri,Hao Zhang,Krishna Kumar Singh,Nanxuan Zhao
机构: Simon Fraser University (西蒙菲莎大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL
Abstract:Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce \modelname, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
zh
[CV-8] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在科学领域复杂视频推理能力评估不足的问题。现有视频基准测试主要聚焦于一般场景下的感知与识别任务,推理复杂度较低,难以有效衡量模型的高级多模态认知能力。为此,作者提出了SciVideoBench,一个专为评估科学语境下高级视频推理能力而设计的严谨基准,其关键在于:构建了1000道源自25个专业学科前沿实验视频的多选题,每题均需结合领域专业知识、精确的时空感知和复杂逻辑推理,从而系统性地挑战模型的高阶认知能力。此方案不仅揭示了当前主流LMMs(如Gemini 2.5 Pro和Qwen2.5-VL)在科学视频理解上的显著短板,也为未来多模态AI向真正具备科学协作能力的方向发展提供了清晰路径。
链接: https://arxiv.org/abs/2510.08559
作者: Andong Deng,Taojiannan Yang,Shoubin Yu,Lincoln Spencer,Mohit Bansal,Chen Chen,Serena Yeung-Levy,Xiaohan Wang
机构: University of Central Florida (中佛罗里达大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios where perception/recognition is heavily relied on, while with relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models’ higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench could fit the interests of the community and help to push the boundary of cutting-edge AI for border science.
zh
[CV-9] DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model
【速读】:该论文旨在解决机器人领域中“持物旋转”(in-hand object rotation)的泛化难题,尤其是从仿真到现实世界(sim-to-real)的策略迁移问题。由于灵巧操作涉及复杂的接触动力学,以往方法受限于简单几何形状、有限物体尺寸或特定手腕姿态等约束条件。其解决方案的关键在于提出一种基于关节级动力学建模的新框架:通过因子分解关节动力学、将系统级影响压缩为低维变量,并利用每个关节自身的动态特征学习其演化过程,从而高效利用少量真实数据弥合“现实差距”(reality gap)。该模型能自适应调整仿真策略的动作输出,在无需人工干预的情况下实现对多样化物体(如复杂形状、高长宽比达5.33的物体)和不同腕部姿态的通用控制,显著提升了策略的泛化能力和实际应用鲁棒性。
链接: https://arxiv.org/abs/2510.08556
作者: Xueyi Liu,He Wang,Li Yi
机构: Tsinghua University (清华大学); Peking University (北京大学); Shanghai Qi Zhi Institute (上海奇智研究院); Galbot
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website: this https URL Video: this https URL
Abstract:Achieving generalized in-hand object rotation remains a significant challenge in robotics, largely due to the difficulty of transferring policies from simulation to the real world. The complex, contact-rich dynamics of dexterous manipulation create a “reality gap” that has limited prior work to constrained scenarios involving simple geometries, limited object sizes and aspect ratios, constrained wrist poses, or customized hands. We address this sim-to-real challenge with a novel framework that enables a single policy, trained in simulation, to generalize to a wide variety of objects and conditions in the real world. The core of our method is a joint-wise dynamics model that learns to bridge the reality gap by effectively fitting limited amount of real-world collected data and then adapting the sim policy’s actions accordingly. The model is highly data-efficient and generalizable across different whole-hand interaction distributions by factorizing dynamics across joints, compressing system-wide influences into low-dimensional variables, and learning each joint’s evolution from its own dynamic profile, implicitly capturing these net effects. We pair this with a fully autonomous data collection strategy that gathers diverse, real-world interaction data with minimal human intervention. Our complete pipeline demonstrates unprecedented generality: a single policy successfully rotates challenging objects with complex shapes (e.g., animals), high aspect ratios (up to 5.33), and small sizes, all while handling diverse wrist orientations and rotation axes. Comprehensive real-world evaluations and a teleoperation application for complex tasks validate the effectiveness and robustness of our approach. Website: this https URL
zh
[CV-10] VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
【速读】:该论文旨在解决任意时空视频补全(arbitrary spatio-temporal video completion)任务中的控制难题,即如何在视频的任意空间位置和时间戳上插入用户指定的图像块,并生成连贯、可控的视频内容。这一问题挑战在于现有潜在视频扩散模型中由因果变分自编码器(causal VAE)引入的时间模糊性——多个像素帧被压缩到单一潜在表示中,导致帧级条件控制在结构上难以实现。解决方案的关键是提出 VideoCanvas 框架,其创新性地采用零参数的上下文内条件化(In-Context Conditioning, ICC)策略,通过混合条件机制解耦空间与时间控制:空间位置由零填充处理,时间对齐则通过时间 RoPE 插值(Temporal RoPE Interpolation)实现,为每个条件分配连续的分数位置以消除 VAE 的时间模糊性,从而在冻结主干网络的基础上实现像素级帧感知控制。
链接: https://arxiv.org/abs/2510.08555
作者: Minghong Cai,Qiulin Wang,Zongli Ye,Wenze Liu,Quande Liu,Weicai Ye,Xintao Wang,Pengfei Wan,Kun Gai,Xiangyu Yue
机构: MMLab, The Chinese University of Hong Kong (香港中文大学); Kling Team, Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks–including first-frame image-to-video, inpainting, extension, and interpolation–under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE’s temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
zh
[CV-11] Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation
【速读】:该论文针对记忆持久型视觉-语言导航(Vision-and-Language Navigation, VLN)任务中现有方法存在的两大关键问题展开研究:一是缺乏有效的记忆访问机制,通常依赖全量记忆整合或固定时间窗口检索,导致效率低下;二是仅存储环境观测信息,忽视了导航行为模式(navigation behavioral patterns)这一蕴含决策策略的宝贵知识。解决方案的核心在于提出Memoir框架,其关键创新为:通过语言条件下的世界模型(language-conditioned world model)生成未来状态作为查询,以“想象”驱动的预测性检索机制实现对环境观测与行为历史的混合式选择性访问;同时设计视点级混合记忆结构(Hybrid Viewpoint-Level Memory),将两者锚定于具体视点以支持高效检索,并结合专用编码器融合检索到的经验知识,从而显著提升导航性能与训练效率。
链接: https://arxiv.org/abs/2510.08553
作者: Yunzhe Xu,Yiyuan Pan,Zhe Liu
机构: Shanghai Jiao Tong University (上海交通大学); Key Laboratory of System Control and Information Processing, Ministry of Education of China (教育部系统控制与信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 14 pages, 6 figures, 13 tables
Abstract:Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir’s effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at this https URL.
zh
[CV-12] ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation
【速读】:该论文旨在解决单目图像序列中实时三维重建(on-the-fly 3D reconstruction)的长期挑战,即在保证重建精度与鲁棒性的同时实现高效计算。现有方法存在明显权衡:基于SLAM的逐场景优化虽精度高但计算复杂度大,而前馈式基础模型虽可实现实时推理却难以兼顾准确性和稳定性。其解决方案的关键在于提出ARTDECO框架,该框架融合了前馈模型的效率与SLAM管线的可靠性,利用3D基础模型进行位姿估计和点云预测,并通过高斯解码器将多尺度特征映射为结构化的3D高斯分布;进一步设计分层高斯表示与LoD感知的渲染策略,在保持高视觉保真度的同时显著降低冗余,从而在多个室内和室外基准上实现了接近逐场景优化的质量、类SLAM的交互性能以及前馈系统的鲁棒性。
链接: https://arxiv.org/abs/2510.08551
作者: Guanghao Li,Kerui Ren,Linning Xu,Zhewen Zheng,Changjian Jiang,Xin Gao,Bo Dai,Jian Pu,Mulin Yu,Jiangmiao Pang
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Jiao Tong University (上海交通大学); The Chinese University of Hong Kong (香港中文大学); Carnegie Mellon University (卡内基梅隆大学); Zhejiang University (浙江大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: this https URL.
zh
[CV-13] R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation
【速读】:该论文旨在解决机器人操作中空间泛化(spatial generalization)问题,即让策略在不同物体分布、环境布局及机器人自身位姿下仍能稳健执行任务。传统方法依赖大量人工示范来覆盖多样空间配置,但效率低下且难以适应复杂场景。解决方案的关键在于提出一种无需仿真器和渲染的实时到实时3D数据生成框架(R2RGen),通过细粒度解析场景与轨迹的标注机制,结合分组增强策略处理多物体组合与任务约束,并引入相机感知处理以对齐生成数据与真实三维传感器分布,从而显著提升数据效率并支持移动操作任务的扩展应用。
链接: https://arxiv.org/abs/2510.08547
作者: Xiuwei Xu,Angyuan Ma,Hankun Li,Bingyao Yu,Zheng Zhu,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学); GigaAI
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Towards the aim of generalized robotic manipulation, spatial generalization is the most fundamental capability that requires the policy to work robustly under different spatial distribution of objects, environment and agent itself. To achieve this, substantial human demonstrations need to be collected to cover different spatial configurations for training a generalized visuomotor policy via imitation learning. Prior works explore a promising direction that leverages data generation to acquire abundant spatially diverse data from minimal source demonstrations. However, most approaches face significant sim-to-real gap and are often limited to constrained settings, such as fixed-base scenarios and predefined camera viewpoints. In this paper, we propose a real-to-real 3D data generation framework (R2RGen) that directly augments the pointcloud observation-action pairs to generate real-world data. R2RGen is simulator- and rendering-free, thus being efficient and plug-and-play. Specifically, given a single source demonstration, we introduce an annotation mechanism for fine-grained parsing of scene and trajectory. A group-wise augmentation strategy is proposed to handle complex multi-object compositions and diverse task constraints. We further present camera-aware processing to align the distribution of generated data with real-world 3D sensor. Empirically, R2RGen substantially enhances data efficiency on extensive experiments and demonstrates strong potential for scaling and application on mobile manipulation.
zh
[CV-14] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在长链反思推理(long-chain reflective reasoning)能力上的显著不足,而这种能力是解决复杂现实问题的关键前提。为应对这一挑战,作者构建了MM-HELIX基准数据集,包含1260个需要迭代思考与回溯的合成任务,并通过实证发现现有模型在此类任务上表现不佳。解决方案的关键在于提出一种名为自适应混合策略优化(Adaptive Hybrid Policy Optimization, AHPO)的新训练策略,该策略能够动态融合离线监督与在线优化,在奖励稀疏时利用专家数据进行学习,在模型具备一定能力后转为自主探索,从而有效提升模型的反思推理能力;同时,通过Step-Elicited Response Generation管道生成高质量的10万条反思推理轨迹用于指令微调,最终在Qwen2.5-VL-7B基础上实现MM-HELIX基准上+18.6%的准确率提升及通用数学与逻辑任务上+5.7%的平均性能增益,验证了该方法的有效性与泛化能力。
链接: https://arxiv.org/abs/2510.08540
作者: Xiangyu Zhao,Junming Lin,Tianhao Liang,Yifan Zhou,Wenhao Chai,Yuzhe Gu,Weiyun Wang,Kai Chen,Gen Luo,Wenwei Zhang,Junchi Yan,Hua Yang,Haodong Duan,Xue Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
zh
[CV-15] Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
【速读】:该论文旨在解决基于文本指令的图像编辑方法在细粒度控制编辑强度方面的局限性,即仅依赖自然语言指令难以实现从轻微修改到完全重构的渐进式调整。解决方案的关键在于提出Kontinuous Kontext模型,该模型通过引入一个标量形式的编辑强度参数(edit strength),并与原始指令共同输入,从而实现对编辑幅度的显式控制;其核心创新是设计了一个轻量级投影网络(projector network),将标量强度与文本指令映射至模型调制空间(modulation space)中的系数,使编辑过程可在无属性特定训练的前提下,平滑地从无变化过渡到完全实现目标效果,适用于多种编辑任务如风格化、属性、材质、背景及形状变换等。
链接: https://arxiv.org/abs/2510.08532
作者: Rishubh Parihar,Or Patashnik,Daniil Ostashev,R. Venkatesh Babu,Daniel Cohen-Or,Kuan-Chieh Wang
机构: Snap Research (Snap 公司研究部门); Tel Aviv University (特拉维夫大学); IISc Bangalore (印度科学理工学院班加罗尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model’s modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.
zh
[CV-16] X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering
【速读】:该论文旨在解决如何生成长时序、高保真且可控的逼真视频问题,特别是在缺乏完整几何与材质先验信息的情况下实现对视频中颜色、材质、几何形状和光照的精确编辑。其核心挑战在于保持帧间时空一致性的同时支持多模态控制(如参考图像和文本提示),并能灵活地进行全局与局部区域的参数化调整。解决方案的关键在于提出一种新颖高效的混合自注意力机制(Hybrid Self-Attention),以确保视频帧间的时序一致性并增强对参考图像的保真度;引入掩码交叉注意力(Masked Cross-Attention)来解耦全局与局部文本提示,并分别作用于对应区域;同时设计递归采样方法(Recursive Sampling)实现关键帧预测与帧插值相结合的渐进式采样策略,从而在长时间视频生成中维持长期时序一致性并抑制误差累积。
链接: https://arxiv.org/abs/2510.08530
作者: Zhitong Huang,Mohan Zhang,Renhan Wang,Rui Tang,Hao Zhu,Jing Liao
机构: City University of Hong Kong (香港城市大学); WeChat (微信); Manycore Tech Inc. (Manycore Tech Inc.)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Code, model, and dataset will be released at project page soon: this https URL
Abstract:We present X2Video, the first diffusion model for rendering photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance, while supporting intuitive multi-modal controls with reference images and text prompts for both global and local regions. The intrinsic guidance allows accurate manipulation of color, material, geometry, and lighting, while reference images and text prompts provide intuitive adjustments in the absence of intrinsic information. To enable these functionalities, we extend the intrinsic-guided image generation model XRGB to video generation by employing a novel and efficient Hybrid Self-Attention, which ensures temporal consistency across video frames and also enhances fidelity to reference images. We further develop a Masked Cross-Attention to disentangle global and local text prompts, applying them effectively onto respective local and global regions. For generating long videos, our novel Recursive Sampling method incorporates progressive frame sampling, combining keyframe prediction and frame interpolation to maintain long-range temporal consistency while preventing error accumulation. To support the training of X2Video, we assembled a video dataset named InteriorVideo, featuring 1,154 rooms from 295 interior scenes, complete with reliable ground-truth intrinsic channel sequences and smooth camera trajectories. Both qualitative and quantitative evaluations demonstrate that X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions. Additionally, X2Video effectively accommodates multi-modal controls with reference images, global and local text prompts, and simultaneously supports editing on color, material, geometry, and lighting through parametric tuning. Project page: this https URL
zh
[CV-17] FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
【速读】:该论文旨在解决图像到视频生成中轨迹控制灵活性不足的问题,尤其是如何实现多粒度、对齐无关的运动控制以支持多样化应用场景。其解决方案的关键在于提出了一种统一的基于点的运动表示方法(FlexTraj),该方法通过为每个点编码分割ID、时序一致的轨迹ID以及可选的颜色通道来提供外观线索,从而实现密集与稀疏轨迹控制;同时采用一种高效的序列拼接训练策略,在不依赖Token拼接或ControlNet的前提下,显著提升收敛速度、可控性及推理效率,并在条件不对齐情况下仍保持鲁棒性。
链接: https://arxiv.org/abs/2510.08527
作者: Zhiyuan Zhang,Can Wang,Dongdong Chen,Jing Liao
机构: City University of Hong Kong (香港城市大学); The University of Hong Kong (香港大学); Microsoft GenAI (微软生成式人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.
zh
[CV-18] Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression
【速读】:该论文旨在解决3D点云数据在带宽受限和连接不稳定条件下,于集中式与分布式多智能体机器人系统中高效传输的问题。其核心挑战在于点云数据体量庞大且结构复杂,导致在边缘或云端处理时易出现性能下降。解决方案的关键在于提出一种基于语义场景图(semantic scene graph)的深度压缩框架:首先将点云分解为语义一致的片段,并通过条件化于特征级线性调制(Feature-wise Linear Modulation, FiLM)的语义感知编码器生成紧凑的潜在表示;随后利用基于折叠机制(folding-based)的解码器,在潜在特征与图节点属性的引导下实现结构精确重建。该方法在SemanticKITTI和nuScenes数据集上实现了高达98%的数据压缩率,同时保持了结构与语义保真度,并支持下游任务如多机器人位姿图优化与地图融合,达到与原始LiDAR扫描相当的轨迹精度和地图对齐效果。
链接: https://arxiv.org/abs/2510.08512
作者: Nikolaos Stathoulopoulos,Christoforos Kanellakis,George Nikolakopoulos
机构: Luleå University of Technology (吕勒奥理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted for publication in IEEE Robotics and Automation Letters (RA-L). 8 pages, 6 figures
Abstract:Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially nowadays with the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.
zh
[CV-19] MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
【速读】:该论文旨在解决现实世界视频因采集与传输条件多样而产生的复杂退化问题(如噪声、压缩伪影和低光失真),现有修复方法通常依赖专业人员手动选择模型或采用单一架构,难以在多种退化场景下实现有效泛化。解决方案的关键在于提出首个“混合代理视频修复”系统(MoA-VR),其通过三个协同工作的智能代理模块——退化识别(Degradation Identification)、路由决策(Routing)与修复执行(Restoration)、以及修复质量评估(Restoration Quality Assessment)——模拟专家推理流程:构建大规模高分辨率退化识别基准并引入视觉语言模型(VLM)驱动的退化识别器;设计基于大语言模型(LLM)的自适应路由机制,通过观察工具使用模式自主学习最优修复策略;同时建立专用于修复任务的恢复视频质量(Res-VQ)数据集,并开发针对性的VLM-based视频质量评估(VQA)模型以实现中间与最终结果的质量监控。实验证明,该框架能有效应对复合退化,在客观指标与主观感知质量上均显著优于现有基线方法。
链接: https://arxiv.org/abs/2510.08508
作者: Lu Liu,Chunlei Cai,Shaocheng Shen,Jianfeng Liang,Weimin Ouyang,Tianxiao Ye,Jian Mao,Huiyu Duan,Jiangchao Yao,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); University of Electronic Science and Technology of China (电子科技大学); Bilibili Inc. (哔哩哔哩)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first \underlineMixture-\underlineof-\underlineAgents \underlineVideo \underlineRestoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the \underlineRestored \underlineVideo \underlineQuality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.
zh
[CV-20] AI-Driven Radiology Report Generation for Traumatic Brain Injuries
【速读】:该论文旨在解决急诊医学中颅脑外伤(Traumatic Brain Injury, TBI)影像诊断效率与准确性不足的问题,尤其是在时间敏感的临床场景下,如何快速、准确地生成结构化放射学报告。其解决方案的关键在于提出一种融合AC-BiFPN(Adaptive Channel Bi-directional Feature Pyramid Network)与Transformer架构的端到端AI模型:AC-BiFPN负责从CT/MRI图像中提取多尺度特征,精准识别如颅内出血等复杂病灶;而Transformer则通过建模长距离依赖关系,生成语义连贯、符合临床逻辑的诊断报告。该方法在RSNA颅内出血检测数据集上显著优于传统CNN模型,在提升诊断准确率的同时,也为放射科医生提供高效辅助工具,并可作为住院医师的教学支持系统。
链接: https://arxiv.org/abs/2510.08498
作者: Riadh Bouslimi,Houda Trabelsi,Wahiba Ben Abdssalem Karaa,Hana Hedhli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Traumatic brain injuries present significant diagnostic challenges in emergency medicine, where the timely interpretation of medical images is crucial for patient outcomes. In this paper, we propose a novel AI-based approach for automatic radiology report generation tailored to cranial trauma cases. Our model integrates an AC-BiFPN with a Transformer architecture to capture and process complex medical imaging data such as CT and MRI scans. The AC-BiFPN extracts multi-scale features, enabling the detection of intricate anomalies like intracranial hemorrhages, while the Transformer generates coherent, contextually relevant diagnostic reports by modeling long-range dependencies. We evaluate the performance of our model on the RSNA Intracranial Hemorrhage Detection dataset, where it outperforms traditional CNN-based models in both diagnostic accuracy and report generation. This solution not only supports radiologists in high-pressure environments but also provides a powerful educational tool for trainee physicians, offering real-time feedback and enhancing their learning experience. Our findings demonstrate the potential of combining advanced feature extraction with transformer-based text generation to improve clinical decision-making in the diagnosis of traumatic brain injuries.
zh
[CV-21] Better Together: Leverag ing Unpaired Multimodal Data for Stronger Unimodal Models
【速读】:该论文试图解决传统多模态学习方法依赖成对标注数据的问题,即在视觉问答等任务中,现有方法虽能构建统一表示,但严重受限于高质量配对数据的获取。其核心挑战在于如何利用未配对的辅助多模态数据(如文本、音频或图像)来直接增强目标模态的表示学习能力。解决方案的关键在于提出一种无监督的、模态无关的训练范式——UML(Unpaired Multimodal Learner),该方法通过交替处理不同模态输入并共享参数,在不依赖显式模态对的情况下,利用不同模态作为共同底层现实(underlying reality)的投影这一假设,从跨模态结构中提取信息,从而提升目标模态的表示质量。理论分析表明,在线性数据生成假设下,使用未配对辅助数据可获得比单一模态训练更优的信息表示;实验验证了该方法在多种单模态下游任务中的有效性。
链接: https://arxiv.org/abs/2510.08492
作者: Sharut Gupta,Shobhita Sundaram,Chenyu Wang,Stefanie Jegelka,Phillip Isola
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); TU Munich (慕尼黑工业大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 63 pages, 29 tables, and 47 figures
Abstract:Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities – such as text, audio, or images – consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: this https URL
zh
[CV-22] Splat the Net: Radiance Fields with Splattable Neural Primitives
【速读】:该论文旨在解决现有3D场景表示方法在表达能力与渲染效率之间的权衡问题:神经辐射场(Neural Radiance Fields, NeRF)虽具有高表达能力,但依赖昂贵的光线追踪(ray marching)进行渲染;而基于几何基元的方法如3D高斯点绘(3D Gaussian Splatting)虽可实现实时渲染,却受限于表达能力。解决方案的关键在于提出可点绘的神经基元(splattable neural primitives),每个基元通过浅层神经网络参数化一个有界密度场,并利用解析积分公式精确计算沿视线方向的线积分,从而实现透视准确的点绘核计算,无需光线追踪。该设计使模型在保持高保真度的同时显著减少所需基元数量(减少10倍)和参数量(减少6倍),且不依赖复杂的控制或自适应机制。
链接: https://arxiv.org/abs/2510.08491
作者: Xilong Zhou,Bao-Huy Nguyen,Loïc Magne,Vladislav Golyanik,Thomas Leimkühler,Christian Theobalt
机构: Max Planck Institute for Informatics (马克斯·普朗克信息学研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Radiance fields have emerged as a predominant representation for modeling 3D scene appearance. Neural formulations such as Neural Radiance Fields provide high expressivity but require costly ray marching for rendering, whereas primitive-based methods such as 3D Gaussian Splatting offer real-time efficiency through splatting, yet at the expense of representational power. Inspired by advances in both these directions, we introduce splattable neural primitives, a new volumetric representation that reconciles the expressivity of neural models with the efficiency of primitive-based splatting. Each primitive encodes a bounded neural density field parameterized by a shallow neural network. Our formulation admits an exact analytical solution for line integrals, enabling efficient computation of perspectively accurate splatting kernels. As a result, our representation supports integration along view rays without the need for costly ray marching. The primitives flexibly adapt to scene geometry and, being larger than prior analytic primitives, reduce the number required per scene. On novel-view synthesis benchmarks, our approach matches the quality and speed of 3D Gaussian Splatting while using 10\times fewer primitives and 6\times fewer parameters. These advantages arise directly from the representation itself, without reliance on complex control or adaptation frameworks. The project page is this https URL.
zh
[CV-23] InstructX: Towards Unified Visual Editing with MLLM Guidance
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)与扩散模型(diffusion models)在图像和视频编辑任务中融合不足的问题,尤其是在缺乏视频训练数据的情况下如何实现有效的视频编辑。其解决方案的关键在于:(1) 通过在图像数据上训练获得的模型能力可自然涌现视频编辑功能,无需显式视频监督,从而缓解视频数据稀缺问题;(2) 引入模态特定的MLLM特征以统一建模图像与视频编辑任务,在单一框架下实现跨模态一致性与高效协作,最终在多种编辑任务中达到领先性能。
链接: https://arxiv.org/abs/2510.08485
作者: Chong Mou,Qichao Sun,Yanze Wu,Pengze Zhang,Xinghui Li,Fulong Ye,Songtao Zhao,Qian He
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.
zh
[CV-24] Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在开放词汇动作识别(Open-Vocabulary Action Recognition, OVAR)中因依赖文本中心先验而导致难以区分语义相似动作的问题。其解决方案的关键在于提出Video-STAR框架,通过上下文子运动分解(contextual sub-motion decomposition)与工具增强强化学习(tool-augmented reinforcement learning)的协同机制,将动作解构为可区分的子运动单元,并动态调用领域特定工具实现跨模态交织推理,从而提升类别特异性推理能力并减少跨模态幻觉。该方法还设计了分层奖励函数,平衡工具使用效率、子运动相关性和推理结构一致性,使模型能在无监督条件下自主优先选择关键子运动模式,实现从文本中心推理向视觉基础推理的转变。
链接: https://arxiv.org/abs/2510.08480
作者: Zhenlong Yuan,Xiangyan Qu,Chengxuan Qian,Rui Chen,Jing Tang,Lei Sun,Xiangxiang Chu,Dapeng Zhang,Yiwei Wang,Yujun Cai,Shuo Li
机构: AMAP, Alibaba Group (阿里巴巴集团); University of California at Merced (加州大学默塞德分校); University of Queensland (昆士兰大学); Case Western Reserve University (凯斯西储大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.
zh
[CV-25] DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos
【速读】:该论文旨在解决如何从人类视觉示范中自动提取双臂灵巧操作技能,并将其转化为类人机器人在仿真环境中执行的能力,同时避免对复杂传感器(如深度相机、运动捕捉系统)或高质量标注数据的依赖。其关键解决方案是提出DexMan框架,该框架直接处理第三人称视角下的自然视频(in-the-wild videos),通过引入基于接触的奖励机制来优化从噪声较大的手-物姿态估计中学习强化学习策略的过程,从而实现无需人工标注、无需物体3D扫描模型即可生成高成功率的灵巧操作技能,且支持真实与合成视频输入,为构建大规模多样化训练数据集提供可行路径。
链接: https://arxiv.org/abs/2510.08475
作者: Jhen Hsieh,Kuan-Hsun Tu,Kuo-Han Hung,Tsung-Wei Ke
机构: National Taiwan University (台湾大学); Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Video results are available at: this https URL
Abstract:We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos, without the need for manual data collection and costly motion capture, and enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation. Comments: Video results are available at: this https URL Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2510.08475 [cs.RO] (or arXiv:2510.08475v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2510.08475 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-26] Hierarchical Spatial Algorithms for High-Resolution Image Quantization and Feature Extraction
【速读】:该论文旨在解决复杂场景下空间图像处理的多任务集成与精度提升问题,尤其针对图像质量增强、特征提取及几何结构分析等关键技术环节。其解决方案的关键在于构建一个模块化框架,通过分步执行 grayscale quantization(灰度量化)、color and brightness enhancement(色彩与亮度增强)、image sharpening(图像锐化)、bidirectional transformation pipelines(双向变换流水线)以及geometric feature extraction(几何特征提取)五大核心步骤,实现从图像预处理到结构解析的全流程优化。其中,双向变换流水线结合 unsharp masking(非锐化掩模)、gamma correction(伽马校正)和 noise amplification(噪声放大)技术,在正向与反向过程中分别达到76.10%和74.80%的准确率;同时,基于Canny边缘检测、Hough变换直线估计和Harris角点检测的几何特征提取方法,有效提升了关键结构如台球杆对齐角度(51.50°)的识别精度,并通过形态学窗口定位实现 cue isolation(球杆隔离)与真实图像间高达81.87%的相似度,体现出该框架在实时图像分析与计算机视觉应用中的鲁棒性与确定性。
链接: https://arxiv.org/abs/2510.08449
作者: Noor Islam S. Mohammad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: There are 14 pages journal paper
Abstract:This study introduces a modular framework for spatial image processing, integrating grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction. A stepwise intensity transformation quantizes grayscale images into eight discrete levels, producing a posterization effect that simplifies representation while preserving structural detail. Color enhancement is achieved via histogram equalization in both RGB and YCrCb color spaces, with the latter improving contrast while maintaining chrominance fidelity. Brightness adjustment is implemented through HSV value-channel manipulation, and image sharpening is performed using a 3 * 3 convolution kernel to enhance high-frequency details. A bidirectional transformation pipeline that integrates unsharp masking, gamma correction, and noise amplification achieved accuracy levels of 76.10% and 74.80% for the forward and reverse processes, respectively. Geometric feature extraction employed Canny edge detection, Hough-based line estimation (e.g., 51.50° for billiard cue alignment), Harris corner detection, and morphological window localization. Cue isolation further yielded 81.87% similarity against ground truth images. Experimental evaluation across diverse datasets demonstrates robust and deterministic performance, highlighting its potential for real-time image analysis and computer vision.
zh
[CV-27] Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
【速读】:该论文旨在解决视觉强化学习(Visual Reinforcement Learning, VRL)中因高维图像输入导致的样本效率低下和学习不稳定问题,其根源在于代理在探索过程中难以区分任务相关与无关的像素特征。解决方案的关键在于引入一种可学习的中央凹注意力机制(foveal attention mechanism),称为“Gaze on the Prize”,该机制通过自监督信号——即代理经验中获得的回报差异(return differences)来引导注意力聚焦于对成功或失败具有判别性的特征。核心创新在于利用回报差异构建对比三元组(contrastive triplets),从而训练注意力模块生成针对不同结果状态具有区分能力的表示,显著提升了样本效率并增强了模型在复杂操作任务中的学习能力。
链接: https://arxiv.org/abs/2510.08442
作者: Andrew Lee,Ian Chuang,Dechen Gao,Kai Fukazawa,Iman Soltani
机构: University of California, Davis (加州大学戴维斯分校); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent’s experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
zh
[CV-28] Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
【速读】:该论文旨在解决连续时间一致性蒸馏(continuous-time consistency distillation, sCM)在大规模文本到图像及视频扩散模型中应用时面临的两大核心问题:一是由于雅可比-向量积(Jacobian-vector product, JVP)计算的基础设施瓶颈导致难以扩展至10亿参数以上的大模型和高维视频任务;二是sCM在细节生成质量上的根本性局限,归因于误差累积及其前向发散目标的“模式覆盖”特性。解决方案的关键在于提出一种得分正则化的连续时间一致性模型(score-regularized continuous-time consistency model, rCM),通过引入得分蒸馏(score distillation)作为长跳连正则项,将sCM的“模式覆盖”机制与“模式寻找”反向散度相结合,在不依赖GAN调优或复杂超参数搜索的前提下,显著提升视觉保真度并维持高生成多样性。实验验证表明,rCM在高达140亿参数的模型和5秒视频任务上实现15×~50×的采样加速,且在质量指标上达到或超越当前最优蒸馏方法DMD2。
链接: https://arxiv.org/abs/2510.08431
作者: Kaiwen Zheng,Yuji Wang,Qianli Ma,Huayu Chen,Jintao Zhang,Yogesh Balaji,Jianfei Chen,Ming-Yu Liu,Jun Zhu,Qinsheng Zhang
机构: Tsinghua University (清华大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the “mode-covering” nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the “mode-seeking” reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1\sim4 steps, accelerating diffusion sampling by 15\times\sim50\times . These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
zh
[CV-29] Reinforcing Diffusion Models by Direct Group Preference Optimization
【速读】:该论文旨在解决强化学习方法(如Group Relative Preference Optimization, GRPO)在扩散模型(diffusion models)中应用时的核心矛盾:GRPO要求使用随机策略(stochastic policy),但目前最高效的扩散采样器基于确定性常微分方程(deterministic ODEs),二者存在不兼容问题。现有方案通过引入低效的随机微分方程(SDE-based)采样器来模拟随机性,但依赖模型无关的高斯噪声导致收敛速度缓慢。解决方案的关键在于提出直接组偏好优化(Direct Group Preference Optimization, DGPO),这是一种全新的在线强化学习算法,完全摒弃了策略梯度框架,转而直接从组级别偏好信号中学习——利用样本在组内的相对信息进行优化,从而无需依赖随机策略,可无缝适配高效确定性ODE采样器,实现训练速度提升约20倍且性能更优。
链接: https://arxiv.org/abs/2510.08425
作者: Yihong Luo,Tianyang Hu,Jing Tang
机构: HKUST(香港科技大学); CUHK(SZ)(香港中文大学(深圳)); HKUST(GZ)(香港科技大学(广州))
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at this https URL.
zh
[CV-30] Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin
【速读】:该论文旨在解决牙本质中微纳尺度孔隙网络(dentin porosity network)成像时,共聚焦荧光显微镜因分辨率与视场范围之间的权衡限制而难以全面观测的问题。其核心解决方案是采用深度学习(Deep Learning, DL)超分辨率(Super-Resolution, SR)重建技术,通过后处理低分辨率图像恢复高分辨率细节,从而在不牺牲成像速度的前提下提升结构可视化能力。关键创新在于摒弃传统基于通用图像质量评估(Image Quality Assessment, IQA)指标的评价方式,转而引入生物学驱动的分析方法:结合孔隙网络的特定尺度和形态特征进行分割,并利用连通组件分析及图论方法评估三维孔隙连通性保持能力,从而更准确地揭示不同SR模型对牙本质微结构敏感性和非线性生成误差的差异,为生物医学成像中的超分辨率重建提供了机制层面的可解释性评估框架。
链接: https://arxiv.org/abs/2510.08407
作者: Lauren Anderson,Lucas Chatelain,Nicolas Tremblay,Kathryn Grandfield,David Rousseau,Aurélien Gourrier
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
备注:
Abstract:The mechanosensory system of teeth is currently believed to partly rely on Odontoblast cells stimulation by fluid flow through a porosity network extending through dentin. Visualizing the smallest sub-microscopic porosity vessels therefore requires the highest achievable resolution from confocal fluorescence microscopy, the current gold standard. This considerably limits the extent of the field of view to very small sample regions. To overcome this limitation, we tested different deep learning (DL) super-resolution (SR) models to allow faster experimental acquisitions of lower resolution images and restore optimal image quality by post-processing. Three supervised 2D SR models (RCAN, pix2pix, FSRCNN) and one unsupervised (CycleGAN) were applied to a unique set of experimentally paired high- and low-resolution confocal images acquired with different sampling schemes, resulting in a pixel size increase of x2, x4, x8. Model performance was quantified using a broad set of similarity and distribution-based image quality assessment (IQA) metrics, which yielded inconsistent results that mostly contradicted our visual perception. This raises the question of the relevance of such generic metrics to efficiently target the specific structure of dental porosity. To resolve this conflicting information, the generated SR images were segmented taking into account the specific scales and morphology of the porosity network and analysed by comparing connected components. Additionally, the capacity of the SR models to preserve 3D porosity connectivity throughout the confocal image stacks was evaluated using graph analysis. This biology-driven assessment allowed a far better mechanistic interpretation of SR performance, highlighting differences in model sensitivity to weak intensity features and the impact of non-linearity in image generation, which explains the failure of standard IQA metrics.
zh
[CV-31] VideoVerse: How Far is Your T2V Generator from a World Model?
【速读】:该论文旨在解决当前文本到视频(Text-to-Video, T2V)生成模型评估基准的不足问题,具体包括:现有评估维度(如帧级美学质量与时间一致性)无法有效区分先进T2V模型;事件级时间因果关系(event-level temporal causality)——构成世界模型核心能力之一——在现有基准中严重缺失;以及缺乏对世界知识(world knowledge)的系统性评估。解决方案的关键在于提出VideoVerse,一个聚焦于评估T2V模型是否具备理解现实世界复杂时间因果性和世界知识能力的综合性基准。其核心创新包括:从多个领域收集代表性视频并提取具有内在时间因果性的事件描述,由独立标注者转化为T2V提示;设计涵盖动态与静态属性的10个明确评估维度的二元问答集(共793个问题);构建基于人类偏好对齐的QA式评估流程,利用现代视觉语言模型进行自动化评分;最终通过在VideoVerse上对开源与闭源T2V模型的系统性评估,揭示当前生成模型距离完整世界模型尚存的距离。
链接: https://arxiv.org/abs/2510.08398
作者: Zeqing Wang,Xinyu Wei,Bairui Li,Zhen Guo,Jinrui Zhang,Hongyang Wei,Keze Wang,Lei Zhang
机构: Sun Yat-Sen University (中山大学); The Hong Kong Polytechnic University (香港理工大学); Tsinghua University (清华大学); OPPO Research Institute (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 Pages, 8 Figures, 11 Tables
Abstract:The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to build ``world models’', makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which are essential capabilities for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human preference aligned QA-based evaluation pipeline is developed by using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.
zh
[CV-32] Spectral Prefiltering of Neural Fields SIGGRAPH
【速读】:该论文旨在解决神经场(Neural Fields)在处理连续视觉信号时通常只能以固定分辨率运行的问题,从而限制了其在多尺度或需要滤波场景下的灵活性与效率。解决方案的关键在于提出一种简单而强大的方法:通过在输入域中进行卷积滤波,利用傅里叶特征嵌入的解析缩放实现对神经场的预滤波,该过程可在一次前向传播中完成。该方法的核心创新是基于滤波器频域响应对傅里叶特征进行闭式调制,不仅适用于高斯滤波,还可推广至未在训练阶段出现的参数化滤波器(如Box和Lanczos滤波器),同时使用单样本蒙特卡洛估计进行训练,保证了训练与推理速度,并且不增加网络架构约束。
链接: https://arxiv.org/abs/2510.08394
作者: Mustafa B. Yaldiz,Ishit Mehta,Nithin Raghavan,Andreas Meuleman,Tzu-Mao Li,Ravi Ramamoorthi
机构: University of California San Diego (加州大学圣地亚哥分校); Inria, Université Côte d’Azur (Inria,蔚蓝海岸大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, to be published in Siggraph Asia 2025, Website: this https URL
Abstract:Neural fields excel at representing continuous visual signals but typically operate at a single, fixed resolution. We present a simple yet powerful method to optimize neural fields that can be prefiltered in a single forward pass. Key innovations and features include: (1) We perform convolutional filtering in the input domain by analytically scaling Fourier feature embeddings with the filter’s frequency response. (2) This closed-form modulation generalizes beyond Gaussian filtering and supports other parametric filters (Box and Lanczos) that are unseen at training time. (3) We train the neural field using single-sample Monte Carlo estimates of the filtered signal. Our method is fast during both training and inference, and imposes no additional constraints on the network architecture. We show quantitative and qualitative improvements over existing methods for neural-field filtering.
zh
[CV-33] Robust Source-Free Domain Adaptation for Medical Image Segmentation based on Curriculum Learning
【速读】:该论文旨在解决源域数据不可用情况下的域适应问题,即在不使用源域数据的前提下,将模型从源域迁移到目标域,以应对医学图像领域中数据隐私与安全的挑战。现有方法通常仅关注目标域伪标签的优化,而忽略了学习过程本身的设计。其解决方案的关键在于提出一种基于课程学习(curriculum-based learning)的框架——LFC(Learning from Curriculum),该框架包含两个核心机制:一是“由易到难”(easy-to-hard)的课程设计,使模型从简单样本开始逐步提升优化方向;二是“从源到目标”的课程策略,稳定迁移过程并确保模型知识平滑过渡。实验表明,该方法在视网膜图像分割和息肉分割任务上优于现有方法,达到了新的最先进水平。
链接: https://arxiv.org/abs/2510.08393
作者: Ziqi Zhang,Yuexiang Li,Yawen Huang,Nanjun He,Tao Xu,Liwei Lin,Yefeng Zheng,Shaoxin Li,Feiyue Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent studies have uncovered a new research line, namely source-free domain adaptation, which adapts a model to target domains without using the source data. Such a setting can address the concerns on data privacy and security issues of medical images. However, current source-free domain adaptation frameworks mainly focus on the pseudo label refinement for target data without the consideration of learning procedure. Indeed, a progressive learning process from source to target domain will benefit the knowledge transfer during model adaptation. To this end, we propose a curriculum-based framework, namely learning from curriculum (LFC), for source-free domain adaptation, which consists of easy-to-hard and source-to-target curricula. Concretely, the former curriculum enables the framework to start learning with `easy’ samples and gradually tune the optimization direction of model adaption by increasing the sample difficulty. While, the latter can stablize the adaptation process, which ensures smooth transfer of the model from the source domain to the target. We evaluate the proposed source-free domain adaptation approach on the public cross-domain datasets for fundus segmentation and polyp segmentation. The extensive experimental results show that our framework surpasses the existing approaches and achieves a new state-of-the-art.
zh
[CV-34] Detecting Legend Items on Historical Maps Using GPT -4o with In-Context Learning
【速读】:该论文旨在解决历史地图图例(legend)中符号与描述文本之间结构化关联提取困难的问题,传统方法多集中于分割或通用光学字符识别(OCR),缺乏对图例项与其对应描述进行精准匹配的能力。其解决方案的关键在于结合LayoutLMv3进行版面检测,并利用GPT-4o通过上下文学习(in-context learning)实现图例项及其描述的联合检测与链接,基于边界框预测完成结构化映射,从而提升跨视觉风格的历史地图索引与可搜索性。
链接: https://arxiv.org/abs/2510.08385
作者: Sofia Kirsanova,Yao-Yi Chiang,Weiwei Duan
机构: University of Minnesota (明尼苏达大学); Inferlink Corporation (Inferlink 公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:Historical map legends are critical for interpreting cartographic symbols. However, their inconsistent layouts and unstructured formats make automatic extraction challenging. Prior work focuses primarily on segmentation or general optical character recognition (OCR), with few methods effectively matching legend symbols to their corresponding descriptions in a structured manner. We present a method that combines LayoutLMv3 for layout detection with GPT-4o using in-context learning to detect and link legend items and their descriptions via bounding box predictions. Our experiments show that GPT-4 with structured JSON prompts outperforms the baseline, achieving 88% F-1 and 85% IoU, and reveal how prompt design, example counts, and layout alignment affect performance. This approach supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across various visual styles.
zh
[CV-35] UniVideo: Unified Understanding Generation and Editing for Videos
【速读】:该论文旨在解决当前统一多模态模型在视频生成与编辑任务中应用受限的问题,尤其是如何在保持视觉一致性的同时准确理解复杂多模态指令。其解决方案的关键在于提出UniVideo框架,采用双流架构:一端是用于指令理解的多模态大语言模型(Multimodal Large Language Model, MLLM),另一端是用于视频生成的多模态DiT(Multimodal DiT, MMDiT)。该设计不仅实现了对文本、图像到视频的生成和编辑任务的统一建模,还通过联合训练提升了跨任务泛化能力,支持任务组合(如风格迁移与编辑结合)以及从未见过的自由形式视频编辑指令(如绿屏抠像或材质替换),从而显著扩展了现有方法的应用边界。
链接: https://arxiv.org/abs/2510.08377
作者: Cong Wei,Quande Liu,Zixuan Ye,Qiulin Wang,Xintao Wang,Pengfei Wan,Kun Gai,Wenhu Chen
机构: University of Waterloo (滑铁卢大学); Kling Team, Kuaishou Technology (快手科技Kling团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website this https URL
Abstract:Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
zh
[CV-36] Hyperspectral data augmentation with transformer-based diffusion models
【速读】:该论文旨在解决深度学习方法在小样本标注数据下容易过拟合的问题,特别是在高光谱遥感图像中进行精细地物分类时的挑战。其关键解决方案包括:1)利用引导扩散模型(guided diffusion model)进行数据增强,以生成多样且符合真实分布的样本;2)设计轻量级Transformer网络以高效捕捉数据中的复杂模式;3)引入改进的加权损失函数与优化的余弦方差调度器(cosine variance scheduler),从而提升小样本场景下的训练稳定性和收敛速度。实验表明,该方法在PRISMA高光谱影像森林分类任务中显著优于传统数据增强策略,且具备稳定的训练行为,提升了生成式AI在遥感领域的实用价值。
链接: https://arxiv.org/abs/2510.08363
作者: Mattia Ferrari,Lorenzo Bruzzone
机构: University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures, accepted at SPIE REMOTE SENSING conference 16-20 September 2024 Edinburgh, United Kingdom
Abstract:The introduction of new generation hyperspectral satellite sensors, combined with advancements in deep learning methodologies, has significantly enhanced the ability to discriminate detailed land-cover classes at medium-large scales. However, a significant challenge in deep learning methods is the risk of overfitting when training networks with small labeled datasets. In this work, we propose a data augmentation technique that leverages a guided diffusion model. To effectively train the model with a limited number of labeled samples and to capture complex patterns in the data, we implement a lightweight transformer network. Additionally, we introduce a modified weighted loss function and an optimized cosine variance scheduler, which facilitate fast and effective training on small datasets. We evaluate the effectiveness of the proposed method on a forest classification task with 10 different forest types using hyperspectral images acquired by the PRISMA satellite. The results demonstrate that the proposed method outperforms other data augmentation techniques in both average and weighted average accuracy. The effectiveness of the method is further highlighted by the stable training behavior of the model, which addresses a common limitation in the practical application of deep generative models for data augmentation.
zh
[CV-37] SPICE: Simple and Practical Image Clarification and Enhancement
【速读】:该论文旨在解决低光照图像增强和雾霾图像(hazy/foggy images)清晰化问题,包括含沙尘及水下等复杂环境下的图像退化现象。其解决方案的关键在于构建一个模拟低光或雾霾条件的图像滤波器,并推导出近似的逆滤波器以最小化增强图像中的失真。该方法具有高度简洁性,仅需少量MATLAB代码即可实现,且在极端暗光和雾霾图像增强任务中表现出优于现有先进方法的性能。
链接: https://arxiv.org/abs/2510.08358
作者: Alexander Belyaev,Pierre-Alain Fayolle,Michael Cohen
机构: Heriot-Watt University (赫瑞-瓦特大学); University of Aizu (会津大学); Higashi Nippon International University (东日本国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 8 figures
Abstract:We introduce a simple and efficient method to enhance and clarify images. More specifically, we deal with low light image enhancement and clarification of hazy imagery (hazy/foggy images, images containing sand dust, and underwater images). Our method involves constructing an image filter to simulate low-light or hazy conditions and deriving approximate reverse filters to minimize distortions in the enhanced images. Experimental results show that our approach is highly competitive and often surpasses state-of-the-art techniques in handling extremely dark images and in enhancing hazy images. A key advantage of our approach lies in its simplicity: Our method is implementable with just a few lines of MATLAB code.
zh
[CV-38] Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
【速读】:该论文旨在解决自动驾驶系统中感知能力不足的问题,特别是针对远距离(30米以上)和近距离(20米以内)场景下视觉-语言模型(Vision-Language Models, VLMs)的感知性能短板。现有VLMs虽在多模态理解任务中表现优异,但在安全关键的交通场景中缺乏可靠的感知能力,且多数模型存在“短视”问题,难以兼顾远近目标的识别准确性。为系统评估小规模VLMs的纯感知能力,作者提出了首个专注于交通场景感知类问题的视觉问答基准——Distance-Annotated Traffic Perception Question Answering (DTPQA),其核心创新在于仅包含不依赖推理的感知类问题,并标注了物体与车辆的距离信息,从而确保模型性能反映真实感知水平而非逻辑推断。实验表明,当前最优的小型VLMs在DTPQA上平均准确率仅为约60%,显著低于人类水平(~85%),尤其在左右方向区分等具体感知任务上仍面临挑战,凸显了改进小模型感知能力的必要性与紧迫性。
链接: https://arxiv.org/abs/2510.08352
作者: Nikos Theodoridis,Tim Brophy,Reenu Mohandas,Ganesh Sistu,Fiachra Collins,Anthony Scanlan,Ciaran Eising
机构: University of Limerick (利默里克大学); Lero, The Irish Software Research Centre (爱尔兰软件研究中心); Valeo Vision Systems (法雷奥视觉系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not “shortsighted”, i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
zh
[CV-39] LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
【速读】:该论文旨在解决视频扩散模型(Video Diffusion Models, VDMs)在长视频生成中因自注意力机制(self-attention)计算复杂度随序列长度呈二次增长而导致的高计算成本问题。现有方法如线性注意力(linear attention)虽能降低复杂度,但直接替换会导致性能下降,且需昂贵的预训练以弥补其表达能力不足与时空建模复杂性。论文提出LinVideo框架,其关键在于:首先将层选择问题建模为二分类任务,通过“选择性迁移”(selective transfer)自动、渐进地将目标层数量替换为线性注意力,显著减少人工干预并最小化性能损失;其次引入“任意时间分布匹配”(Anytime Distribution Matching, ADM)目标函数,高效对齐采样轨迹中任意时刻的样本分布,从而恢复模型性能。实验表明,该方法可在保持生成质量的前提下实现1.25–2.00倍加速,且4步蒸馏模型实现15.92倍延迟降低。
链接: https://arxiv.org/abs/2510.08318
作者: Yushi Huang,Xingtong Ge,Ruihao Gong,Chengtao Lv,Jun Zhang
机构: Hong Kong University of Science and Technology (香港科技大学); Beihang University (北京航空航天大学); Sensetime Research (商汤科技研究院); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be released upon acceptance
Abstract:Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model’s performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
zh
[CV-40] Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
【速读】:该论文旨在解决3D affordance segmentation(功能分割)中因点云数据固有特性(如稀疏性、噪声和几何歧义)导致的语义边界不清晰问题,即现有方法依赖通用点云编码器学习特征时,难以获得具有明确功能边界的表征。解决方案的关键在于提出一种语义引导的学习范式,通过跨模态亲和力迁移(Cross-Modal Affinity Transfer, CMAT)策略,将大规模2D视觉基础模型(Vision Foundation Models, VFMs)中的丰富语义知识迁移到3D域;CMAT联合优化重建、亲和力与多样性目标,使3D特征具备语义组织一致性,并在此基础上构建了融合多模态提示的跨模态功能分割Transformer(CAST),从而生成精确且提示感知的分割结果。
链接: https://arxiv.org/abs/2510.08316
作者: Yu Huang,Zelin Peng,Changsong Wen,Xiaokang Yang,Wei Shen
机构: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiaotong University (上海交通大学人工智能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in process
Abstract:Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
zh
[CV-41] LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation
【速读】:该论文旨在解决参考视频分割(Referring Video Segmentation, RVOS)任务中如何有效提取长时程时间上下文信息的问题,以准确刻画物体的动态属性。现有方法通常依赖跨所有帧的注意力机制或堆叠密集局部注意力来获取全局时间视图,但难以在局部性与全局性之间取得平衡,且计算复杂度随视频长度显著增加。解决方案的关键在于提出一种有效的长程时间上下文注意力(Long-range Temporal Context Attention, LTCA)机制:一方面通过稀疏局部注意力(dilated window attention)堆叠实现局部与全局的权衡;另一方面设计全局查询(global query)与所有其他查询交互,直接编码全局上下文信息,从而高效聚合全局语义特征。
链接: https://arxiv.org/abs/2510.08305
作者: Cilin Yan,Jingyun Wang,Guoliang Kang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TCSVT
Abstract:Referring Video Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and the computation complexity significantly increases with the increase of video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. Firstly, we stack sparse local attentions to balance the locality and globality. We design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance the globality. Secondly, we design a global query to interact with all the other queries to directly encode the global context information. Experiments show our method achieves new state-of-the-art on four referring video segmentation benchmarks. Notably, our method shows an improvement of 11.3% and 8.3% on the MeViS valu and val datasets respectively.
zh
[CV-42] Learning Neural Exposure Fields for View Synthesis NEURIPS2025 WWW
【速读】:该论文旨在解决现有神经场景表示方法在处理包含强烈曝光变化的复杂真实世界数据时,重建质量和视角一致性显著下降的问题(如室内与室外混合场景或带窗户的房间)。其关键解决方案是提出了一种名为神经曝光场(Neural Exposure Fields, NExF)的新颖神经表示方法,通过学习每个3D点的最优曝光值,并将曝光优化与神经场景表示联合进行,从而实现高动态范围场景下的高质量、三维一致的视图合成。该方法突破了传统相机仅在图像/像素层面选择曝光的限制,转而在3D空间中进行曝光优化,无需多曝光拍摄或后期处理即可获得稳定结果。
链接: https://arxiv.org/abs/2510.08279
作者: Michael Niemeyer,Fabian Manhardt,Marie-Julie Rakotosaona,Michael Oechsle,Christina Tsalicoglou,Keisuke Tateno,Jonathan T. Barron,Federico Tombari
机构: Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025. Project page available at this https URL
Abstract:Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. In the core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need of post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks improving by over 55% over best-performing baselines.
zh
[CV-43] A Multimodal Depth-Aware Method For Embodied Reference Understanding
【速读】:该论文旨在解决**具身指代理解(Embodied Reference Understanding, ERU)**中在复杂或杂乱环境中因多个候选目标存在而导致的歧义问题,即如何准确识别基于语言指令和指向线索的目标对象。其解决方案的关键在于提出一种新颖的ERU框架,该框架通过联合利用基于大语言模型(LLM)的数据增强、深度图(depth-map)模态以及一个深度感知决策模块(depth-aware decision module),实现对语言信息与具身线索的鲁棒融合,从而提升在模糊场景下的目标消歧能力。
链接: https://arxiv.org/abs/2510.08278
作者: Fevziye Irem Eyiokur,Dogucan Yaman,Hazım Kemal Ekenel,Alexander Waibel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
zh
[CV-44] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting NEURIPS2025
【速读】:该论文旨在解决文本引导图像修复(text-guided image inpainting)中的两个长期挑战:一是如何在修复掩码区域时保持未掩码区域的完整性,二是如何实现未掩码区域与修复后的掩码区域之间的语义一致性。以往方法往往只能兼顾其中一项,难以同时优化两者。其根本原因在于混合频率带(如中频和低频)在去噪过程中存在语义信息纠缠,且对文本提示的鲁棒性不同。解决方案的关键是提出一种零文本-零频率感知扩散模型(NTN-Diff),通过将语义一致性分解到各频率带分别处理,并在扩散过程的不同阶段(早期高阶噪声与晚期低阶噪声)解耦中低频成分,从而分步实现语义一致性与未掩码区域保护:首先利用稳定中频带进行文本引导去噪以对齐语义,再以此为指导完成低频带的零文本去噪以保留结构,最后在晚期阶段进行文本引导去噪以统一中低频语义一致性,最终实现高质量、一致性的图像修复结果。
链接: https://arxiv.org/abs/2510.08273
作者: Haipeng Liu,Yang Wang,Meng Wang
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 11 figures, to appear NeurIPS 2025
Abstract:Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbfNTN-Diff, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from this https URL.
zh
[CV-45] SViM3D: Stable Video Material Diffusion for Single Image 3D Generation ICCV2025
【速读】:该论文旨在解决从单张图像中预测多视角一致的物理基础渲染(Physically Based Rendering, PBR)材质参数与表面法向的问题,以实现可编辑的3D资产生成和可控光照重演。现有方法在处理反射特性时通常依赖简化的材质模型或需额外步骤估计材质参数,限制了其在AR/VR、影视制作和游戏等场景中的应用。解决方案的关键在于扩展潜在视频扩散模型(latent video diffusion model),使其能够联合输出空间变化的PBR参数与各视角下的表面法向,并通过显式相机控制实现高质量的多视角一致性。该框架引入多种机制优化该病态问题下的重建质量,在多个物体中心数据集上实现了最先进的重光照与新视角合成性能,从而为生成可重光照的3D资产提供了有效的神经先验。
链接: https://arxiv.org/abs/2510.08271
作者: Andreas Engelhardt,Mark Boss,Vikram Voletti,Chun-Han Yao,Hendrik P. A. Lensch,Varun Jampani
机构: Stability AI(Stability.AI); University of Tübingen(图宾根大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by International Conference on Computer Vision (ICCV 2025). Project page: this http URL
Abstract:We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
zh
[CV-46] Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification
【速读】:该论文旨在解决遥感(Remote Sensing, RS)图像中单正多标签学习(Single-Positive Multi-Label Learning, SPML)所面临的监督模糊性问题,即在仅标注一个相关标签的情况下如何有效恢复完整标签集并避免模型过拟合标签噪声。解决方案的关键在于提出一种通用且可迁移的SPML框架AdaGC(Adaptive Gradient Calibration),其核心机制包括:1)梯度校准(Gradient Calibration, GC)模块结合Mixup增强与双指数移动平均(Dual Exponential Moving Average, EMA)以生成鲁棒的伪标签;2)设计一个基于训练动态的理论依据明确的指标,在初始预热阶段后自适应触发GC,从而有效缓解标签噪声引起的过拟合问题。该方法在两个基准遥感数据集上验证了其优越性和鲁棒性。
链接: https://arxiv.org/abs/2510.08269
作者: Chenying Liu,Gianmarco Perantoni,Lorenzo Bruzzone,Xiao Xiang Zhu
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures
Abstract:Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism combined with Mixup and a dual exponential moving average (EMA) module for robust pseudo-label generation. To maximize AdaGC’s effectiveness, we introduce a simple yet theoretically grounded indicator to adaptively trigger GC after an initial warm-up stage based on training dynamics, thereby guaranteeing the effectiveness of GC in mitigating overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings.
zh
[CV-47] Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction
【速读】:该论文旨在解决现有双人运动生成方法在建模人类交互时忽略距离依赖性和层次结构的问题,即当前方法通常假设交互是时间不变的,未能捕捉从个体到人与人之间再到整体运动的动态层级特性。其解决方案的关键在于提出一种三阶段的细粒度双人运动生成方法 FineDual:首先通过大型语言模型(Large Language Model)将整体文本分解为个体文本,实现个体层面的文本特征与运动特征对齐;其次利用交互距离预测器和交互感知图网络,在人际层面上动态建模交互距离;最后通过教师引导精炼阶段,以整体文本特征为指导,在整体层面优化运动特征,从而生成具有精细结构和高质量的双人运动序列。
链接: https://arxiv.org/abs/2510.08260
作者: Mu Li,Yin Wang,Zhiying Leng,Jiapeng Liu,Frederick W. B. Li,Xiaohui Liang
机构: Beihang University (北京航空航天大学); University of Durham (杜伦大学); Zhongguancun Laboratory (中关村实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human interaction is inherently dynamic and hierarchical, where the dynamic refers to the motion changes with distance, and the hierarchy is from individual to inter-individual and ultimately to overall motion. Exploiting these properties is vital for dual-human motion generation, while existing methods almost model human interaction temporally invariantly, ignoring distance and hierarchy. To address it, we propose a fine-grained dual-human motion generation method, namely FineDual, a tri-stage method to model the dynamic hierarchical interaction from individual to inter-individual. The first stage, Self-Learning Stage, divides the dual-human overall text into individual texts through a Large Language Model, aligning text features and motion features at the individual level. The second stage, Adaptive Adjustment Stage, predicts interaction distance by an interaction distance predictor, modeling human interactions dynamically at the inter-individual level by an interaction-aware graph network. The last stage, Teacher-Guided Refinement Stage, utilizes overall text features as guidance to refine motion features at the overall level, generating fine-grained and high-quality dual-human motion. Extensive quantitative and qualitative evaluations on dual-human motion datasets demonstrate that our proposed FineDual outperforms existing approaches, effectively modeling dynamic hierarchical human interaction.
zh
[CV-48] InstructUDrag : Joint Text Instructions and Object Drag ging for Interactive Image Editing
【速读】:该论文旨在解决当前文本驱动图像编辑(text-based image editing)与对象拖拽(object dragging)方法各自存在的局限性:前者难以实现精确的对象定位,后者仅支持静态对象重定位。解决方案的关键在于提出 InstructUDrag,一个基于扩散模型(diffusion model)的框架,通过将文本指令与对象拖拽相结合,实现同时进行对象拖拽和语义控制的图像编辑。其核心创新在于将对象拖拽建模为图像重建过程,并设计两个协同分支:移动-重建分支利用能量引导梯度优化对象位置并细化交叉注意力图以提升重定位精度;文本驱动编辑分支与重建分支共享梯度信号,确保变换一致性并实现对对象属性的细粒度控制。此外,采用 DDPM 反演技术并将先验信息注入噪声图,有效保留移动对象的结构完整性,从而实现高保真、精准且语义可控的图像编辑。
链接: https://arxiv.org/abs/2510.08181
作者: Haoran Yu,Yi Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.
zh
[CV-49] Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data
【速读】:该论文旨在解决深度学习中真实数据集普遍存在的双重挑战:类别不平衡(class imbalance)与标签噪声(label noise)共存问题,二者协同作用严重制约模型性能。现有方法虽能分别应对单一问题,但联合处理时因难以区分真实少数类样本与噪声样本,常导致优化目标冲突。解决方案的关键在于提出一种双粒度Sinkhorn蒸馏(Dual-granularity Sinkhorn Distillation, D-SINK)框架,通过协同利用两个“弱”但专精的辅助模型——一个针对标签噪声鲁棒、另一个针对类别不平衡鲁棒——实现互补增强。D-SINK借助最优传输优化的代理标签分配机制,分别在样本层面和分布层面引导主模型对齐两个辅助模型的输出,从而在不引入复杂新结构的前提下,有效提升模型在长尾噪声数据上的双重鲁棒性。
链接: https://arxiv.org/abs/2510.08179
作者: Feng Hong,Yu Huang,Zihua Zhao,Zhihan Zhou,Jiangchao Yao,Dongsheng Li,Ya Zhang,Yanfeng Wang
机构: Shanghai Jiao Tong University (上海交通大学); Microsoft (微软)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 2 figures
Abstract:Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise, hindering model performance. While methods exist for each issue, effectively combining them is non-trivial, as distinguishing genuine tail samples from noisy data proves difficult, often leading to conflicting optimization strategies. This paper presents a novel perspective: instead of primarily developing new complex techniques from scratch, we explore synergistically leveraging well-established, individually ‘weak’ auxiliary models - specialized for tackling either class imbalance or label noise but not both. This view is motivated by the insight that class imbalance (a distributional-level concern) and label noise (a sample-level concern) operate at different granularities, suggesting that robustness mechanisms for each can in principle offer complementary strengths without conflict. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights from such ‘weak’, single-purpose auxiliary models. Specifically, D-SINK uses an optimal transport-optimized surrogate label allocation to align the target model’s sample-level predictions with a noise-robust auxiliary and its class distributions with an imbalance-robust one. Extensive experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
zh
[CV-50] Robust Canonicalization through Bootstrapped Data Re-Alignment
【速读】:该论文旨在解决细粒度视觉分类(Fine-grained Visual Classification, FGVC)任务中因几何偏置(如物体不同朝向和尺度)及噪声导致模型性能下降的问题。现有方法要么依赖大量数据增强(需强大模型),要么采用等变架构(限制表达能力并增加计算成本),或依赖假设训练数据已对齐的归一化先验(在真实数据中往往不成立,导致归一化器脆弱)。解决方案的关键在于提出一种自举(bootstrapping)算法,通过迭代重对齐训练样本,逐步降低方差并恢复对齐假设,从而实现更鲁棒的归一化过程;该算法在任意紧致群下具有收敛性保证,并在四个FGVC基准测试中显著优于等变与归一化基线方法,且性能媲美数据增强策略。
链接: https://arxiv.org/abs/2510.08178
作者: Johann Schmidt,Sebastian Stober
机构: Otto-von-Guericke University (奥托-冯-格里克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-grained visual classification (FGVC) tasks, such as insect and bird identification, demand sensitivity to subtle visual cues while remaining robust to spatial transformations. A key challenge is handling geometric biases and noise, such as different orientations and scales of objects. Existing remedies rely on heavy data augmentation, which demands powerful models, or on equivariant architectures, which constrain expressivity and add cost. Canonicalization offers an alternative by shielding such biases from the downstream model. In practice, such functions are often obtained using canonicalization priors, which assume aligned training data. Unfortunately, real-world datasets never fulfill this assumption, causing the obtained canonicalizer to be brittle. We propose a bootstrapping algorithm that iteratively re-aligns training samples by progressively reducing variance and recovering the alignment assumption. We establish convergence guarantees under mild conditions for arbitrary compact groups, and show on four FGVC benchmarks that our method consistently outperforms equivariant, and canonicalization baselines while performing on par with augmentation.
zh
[CV-51] Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing
【速读】:该论文旨在解决当前基于自然语言的图像编辑方法在处理复杂对象交集和细粒度空间关系时表现不佳的问题,其根源在于缺乏显式的推理过程。现有方法要么依赖纯文本链式思维(Chain-of-Thought, CoT),要么仅结合坐标信息,难以有效表征复杂的视觉布局并提供足够的视觉线索以生成像素级细节。解决方案的关键在于提出一种全新的多模态推理编辑框架——MURE(Multimodal Reasoning Edit),该框架将图像编辑过程从纯文本推理转变为一系列交错进行的文本与视觉推理步骤,即“交错式文本-图像链”(interleaved text-image CoT)。每个推理步骤包含一个文本描述和对应的视觉提示(如目标区域的位置掩码或新内容表示),从而实现更精准的逐阶段控制;同时引入多模态深度置信度(Multimodal Deep Confidence, MMDC)推理范式,在每一步探索视觉推理路径树,并通过奖励模型计算的深度置信分数剪枝低质量分支,确保模型始终沿着高质量轨迹推进,最终提升编辑结果的保真度与准确性。
链接: https://arxiv.org/abs/2510.08157
作者: Zhentao Zou,Zhengrong Yue,Kunpeng Du,Binlei Bao,Hanting Li,Haizhen Xie,Guozheng Xu,Yue Zhou,Yali Wang,Jie Hu,Xue Jiang,Xinghao Chen
机构: Shanghai Jiaotong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); East China Normal University (华东师范大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25pages,20figures
Abstract:Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defined intended edited regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.
zh
[CV-52] UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
【速读】:该论文旨在解决当前级联视频超分辨率(Cascaded Video Super-Resolution, CVSR)方法在多模态条件引导下的局限性问题,即现有方法主要局限于文本到视频任务,未能有效利用图像和视频等额外生成条件,从而难以保证多模态视频生成的保真度。解决方案的关键在于提出首个统一的生成式视频超分辨率框架UniMMVSR,其创新性地引入混合模态条件(包括文本、图像和视频),并通过系统探索条件注入策略、训练方案与数据混合技术,在潜在空间视频扩散模型中实现对不同模态条件的精准利用。特别地,针对各类条件与目标视频之间相关性的差异,设计了差异化数据构建与条件使用方法,显著提升了生成视频的细节质量与多模态一致性。
链接: https://arxiv.org/abs/2510.08143
作者: Shian Du,Menghan Xia,Chang Liu,Quande Liu,Xintao Wang,Pengfei Wan,Xiangyang Ji
机构: Tsinghua University (清华大学); Huazhong University of Science and Technology (华中科技大学); Kling Team, Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
zh
[CV-53] Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
【速读】:该论文旨在解决视频语言模型(Video-LLMs)在处理时序信息时产生的逻辑不一致性问题,即模型对同一视频内容的重述提问无法给出一致回答,这严重影响了其可靠性与实际应用价值。其关键解决方案是提出一种称为时序条件注意力锐化(Temporally Conditioned Attention Sharpening, TCAS)的方法,通过构建基于注意力区分度的增强目标,提升跨模态注意力头在不同时间戳上对视频token的判别能力,从而改善模型的时间分辨能力和时序逻辑一致性。实验表明,TCAS不仅显著增强了Video-LLMs的时序逻辑一致性,还在通用视频时序定位任务中带来了性能提升,验证了时序逻辑一致性是当前视频理解中的瓶颈因素。
链接: https://arxiv.org/abs/2510.08138
作者: Chengzhi Li,Heyan Huang,Ping Jian,Zhen Yang,Yaning Tian
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon recently draws the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervention the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model’s temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.
zh
[CV-54] Real-Time Motion-Controllable Autoregressive Video Diffusion
【速读】:该论文旨在解决实时运动可控视频生成中的两大核心挑战:一是双向扩散模型(Bidirectional Diffusion Models, BDMs)固有的高延迟问题,二是现有自回归(Autoregressive, AR)视频扩散模型在少步生成时存在质量下降和运动伪影的问题。解决方案的关键在于提出AR-Drag,首个基于强化学习(Reinforcement Learning, RL)增强的少步自回归视频扩散模型,通过两阶段优化实现高效且精准的图像到视频生成:首先微调基础图像到视频(Image-to-Video, I2V)模型以支持基本运动控制,随后利用基于轨迹的奖励模型进行强化学习进一步提升性能;设计中引入Self-Rollout机制保持马尔可夫性,并通过选择性地在去噪步骤中引入随机性加速训练过程,从而在仅使用1.3B参数的情况下显著降低延迟并保持高质量的视觉保真度与运动对齐精度。
链接: https://arxiv.org/abs/2510.08131
作者: Kesen Zhao,Jiaxin Shi,Beier Zhu,Junbao Zhou,Xiaolong Shen,Yuan Zhou,Qianru Sun,Hanwang Zhang
机构: Nanyang Technological University (南洋理工大学); Xmax.AI Ltd; Zhejiang University (浙江大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: this https URL.
zh
[CV-55] Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation
【速读】:该论文旨在解决深度学习(Deep Learning, DL)模型在有限数据场景下(如放射学领域)进行医学图像分割时的泛化能力不足问题,尤其针对CT图像中肿瘤自动分割任务。传统图像增强方法直接套用于CT扫描时,忽略了CT图像强度值代表Hounsfield Units(HU)这一物理意义,导致引入伪影并损害模型性能。解决方案的关键在于提出一种CT特异性增强技术——随机窗宽调整(Random windowing),该方法基于CT图像中可用的HU分布特性,在不破坏物理语义的前提下提升模型对对比剂增强差异和低对比度图像的鲁棒性,从而显著改善肝肿瘤分割性能。
链接: https://arxiv.org/abs/2510.08116
作者: Eirik A. Østmo,Kristoffer K. Wickstrøm,Keyur Radiya,Michael C. Kampffmeyer,Karl Øyvind Mikalsen,Robert Jenssen
机构: UiT Machine Learning Group (UiT机器学习组); University Hospital of North Norway (北挪威大学医院); Norwegian Computing Centre (挪威计算中心); Pioneer Centre for AI (先锋人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 9 figures. This work has been submitted to the IEEE for possible publication
Abstract:Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians’ workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.
zh
[CV-56] Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting
【速读】:该论文旨在解决极端视角下人脸分割(face parsing)准确率低的问题,其核心挑战在于此类姿态下的标注数据稀缺,而人工标注成本高昂且难以规模化。解决方案的关键在于提出一种新颖的标签精炼(label refinement)流程,利用3D高斯点绘(3D Gaussian Splatting, 3DGS)从噪声多视角预测中生成精确的分割掩码。通过联合拟合两个3DGS模型——一个用于RGB图像,另一个用于初始分割图——该方法借助共享几何结构强制多视角一致性,从而合成具有姿态多样性的训练数据,仅需少量后处理即可显著提升模型在困难视角下的分割性能,且无需任何真实3D标注。
链接: https://arxiv.org/abs/2510.08096
作者: Ankit Gahlawat,Anirban Mukherjee,Dinesh Babu Jayagopi
机构: International Institute of Information Technology, Bangalore (IIIT-B)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to VCIP 2025 (International Conference on Visual Communications and Image Processing 2025)
Abstract:Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multiview predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multiview consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real- world settings.
zh
[CV-57] DarkHash: A Data-Free Backdoor Attack Against Deep Hashing
【速读】:该论文旨在解决在无训练数据访问权限的情况下,如何对深度哈希(Deep Hashing)模型实施后门攻击的问题。传统方法依赖于获取原始训练数据以植入后门,但在实际场景中,由于隐私保护和知识产权限制,此类数据往往不可得。为此,作者提出DarkHash,这是首个面向深度哈希的无数据后门攻击方法。其关键在于设计了一个带有双语义引导的影子后门攻击框架,通过仅微调目标模型的部分层,并利用替代数据集进行训练,同时引入拓扑对齐损失(topological alignment loss),优化单个样本及其邻近样本与目标样本的一致性,从而在不损害原始检索准确率的前提下有效嵌入后门功能。
链接: https://arxiv.org/abs/2510.08094
作者: Ziqi Zhou,Menghao Deng,Yufei Song,Hangtao Zhang,Wei Wan,Shengshan Hu,Minghui Li,Leo Yu Zhang,Dezhong Yao
机构: Huazhong University of Science and Technology (华中科技大学); City University of Macau (澳门城市大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TIFS 2025
Abstract:Benefiting from its superior feature learning capabilities and efficiency, deep hashing has achieved remarkable success in large-scale image retrieval. Recent studies have demonstrated the vulnerability of deep hashing models to backdoor attacks. Although these studies have shown promising attack results, they rely on access to the training dataset to implant the backdoor. In the real world, obtaining such data (e.g., identity information) is often prohibited due to privacy protection and intellectual property concerns. Embedding backdoors into deep hashing models without access to the training data, while maintaining retrieval accuracy for the original task, presents a novel and challenging problem. In this paper, we propose DarkHash, the first data-free backdoor attack against deep hashing. Specifically, we design a novel shadow backdoor attack framework with dual-semantic guidance. It embeds backdoor functionality and maintains original retrieval accuracy by fine-tuning only specific layers of the victim model using a surrogate dataset. We consider leveraging the relationship between individual samples and their neighbors to enhance backdoor attacks during training. By designing a topological alignment loss, we optimize both individual and neighboring poisoned samples toward the target sample, further enhancing the attack capability. Experimental results on four image datasets, five model architectures, and two hashing methods demonstrate the high effectiveness of DarkHash, outperforming existing state-of-the-art backdoor attack methods. Defense experiments show that DarkHash can withstand existing mainstream backdoor defense methods.
zh
[CV-58] Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection NEURIPS2025
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成视频中难以检测的物理规律违反问题,尤其是面对高维时空动态建模和细微异常识别的挑战。其解决方案的关键在于提出一种基于概率流守恒原理的新型统计量——归一化时空梯度(Normalized Spatiotemporal Gradient, NSG),该指标通过量化空间概率梯度与时间密度变化的比值,显式捕捉自然视频动力学偏离;进一步结合预训练扩散模型实现无需复杂运动分解的NSG估计,并构建基于最大均值差异(Maximum Mean Discrepancy, MMD)的检测方法(NSG-VD),同时理论证明了生成视频在NSG特征空间中存在放大偏差的上界,从而显著提升检测性能,在召回率和F1分数上分别优于当前最优基线16.00%和10.75%。
链接: https://arxiv.org/abs/2510.08073
作者: Shuhai Zhang,ZiHao Lian,Jiahao Yang,Daiyuan Li,Guoxuan Pang,Feng Liu,Bo Han,Shutao Li,Mingkui Tan
机构: South China University of Technology (华南理工大学); University of Science and Technology of China (中国科学技术大学); Key Laboratory of Big Data and Intelligent Robot, Ministry of Education (教育部大数据与智能机器人重点实验室); Pazhou Lab (琶洲实验室); University of Melbourne (墨尔本大学); Hunan University (湖南大学); Hong Kong Baptist University (香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at NeurIPS 2025 spotlight
Abstract:AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at this https URL.
zh
[CV-59] owards Real-World Deepfake Detection: A Diverse In-the-wild Dataset of Forgery Faces
【速读】:该论文旨在解决当前学术界深伪人脸(deepfake face)检测基准在真实场景应用中有效性不足的问题,主要表现为现有数据集缺乏现实多样性、伪造方法局限以及对实际黑盒攻击场景模拟不足。其解决方案的关键在于构建RedFace数据集——一个面向真实世界的深伪人脸数据集,包含超过60,000张伪造图像和1,000段 manipulated 视频,这些内容源自9个商用在线平台的真实深伪技术,而非实验室生成的简化模型。通过引入定制化算法合成多样化且持续演进的伪造手段,RedFace有效模拟了“黑盒”环境下的真实深伪攻击,从而显著提升了检测模型在现实社交网络传播场景中的评估效度。
链接: https://arxiv.org/abs/2510.08067
作者: Junyu Shi,Minghui Li,Junguo Zuo,Zhifei Yu,Yipeng Lin,Shengshan Hu,Ziqi Zhou,Yechao Zhang,Wei Wan,Yinzhe Xu,Leo Yu Zhang
机构: Huazhong University of Science and Technology (华中科技大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deepfakes, leveraging advanced AIGC (Artificial Intelligence-Generated Content) techniques, create hyper-realistic synthetic images and videos of human faces, posing a significant threat to the authenticity of social media. While this real-world threat is increasingly prevalent, existing academic evaluations and benchmarks for detecting deepfake forgery often fall short to achieve effective application for their lack of specificity, limited deepfake diversity, restricted manipulation this http URL address these limitations, we introduce RedFace (Real-world-oriented Deepfake Face), a specialized facial deepfake dataset, comprising over 60,000 forged images and 1,000 manipulated videos derived from authentic facial features, to bridge the gap between academic evaluations and real-world necessity. Unlike prior benchmarks, which typically rely on academic methods to generate deepfakes, RedFace utilizes 9 commercial online platforms to integrate the latest deepfake technologies found “in the wild”, effectively simulating real-world black-box this http URL, RedFace’s deepfakes are synthesized using bespoke algorithms, allowing it to capture diverse and evolving methods used by real-world deepfake creators. Extensive experimental results on RedFace (including cross-domain, intra-domain, and real-world social network dissemination simulations) verify the limited practicality of existing deepfake detection schemes against real-world applications. We further perform a detailed analysis of the RedFace dataset, elucidating the reason of its impact on detection performance compared to conventional datasets. Our dataset is available at: this https URL.
zh
[CV-60] A class-driven hierarchical ResNet for classification of multispectral remote sensing images
【速读】:该论文旨在解决多光谱遥感时序图像(Time Series, TS)在不同语义层级上的分类问题,特别是如何提升对细粒度类别(micro-classes)的判别能力,并增强模型在有限样本下的泛化性能。其关键解决方案是提出一种多时间尺度类驱动的分层残差神经网络(class-driven hierarchical ResNet),通过引入额外分支实现各层级语义类别的并行分类,并利用层次惩罚图(hierarchy-penalty maps)约束分类路径的逻辑一致性,从而避免不合理的层级跃迁。该架构支持模块化训练与微调,使浅层网络优先学习宏观类(macro-classes)和中间类,深层则聚焦于细微类(micro-classes)的区分,显著提升了在新目标区域中对少数类别的表征精度与整体分类性能。
链接: https://arxiv.org/abs/2510.08060
作者: Giulio Weikmann,Gianmarco Perantoni,Lorenzo Bruzzone
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures, accepted conference paper at SPIE REMOTE SENSING, 3-7 September 2023, Amsterdam, Netherlands
Abstract:This work presents a multitemporal class-driven hierarchical Residual Neural Network (ResNet) designed for modelling the classification of Time Series (TS) of multispectral images at different semantical class levels. The architecture consists of a modification of the ResNet where we introduce additional branches to perform the classification at the different hierarchy levels and leverage on hierarchy-penalty maps to discourage incoherent hierarchical transitions within the classification. In this way, we improve the discrimination capabilities of classes at different levels of semantic details and train a modular architecture that can be used as a backbone network for introducing new specific classes and additional tasks considering limited training samples available. We exploit the class-hierarchy labels to train efficiently the different layers of the architecture, allowing the first layers to train faster on the first levels of the hierarchy modeling general classes (i.e., the macro-classes) and the intermediate classes, while using the last ones to discriminate more specific classes (i.e., the micro-classes). In this way, the targets are constrained in following the hierarchy defined, improving the classification of classes at the most detailed level. The proposed modular network has intrinsic adaptation capability that can be obtained through fine tuning. The experimental results, obtained on two tiles of the Amazonian Forest on 12 monthly composites of Sentinel 2 images acquired during 2019, demonstrate the effectiveness of the hierarchical approach in both generalizing over different hierarchical levels and learning discriminant features for an accurate classification at the micro-class level on a new target area, with a better representation of the minoritarian classes.
zh
[CV-61] RetouchLLM : Training-free White-box Image Retouching
【速读】:该论文旨在解决现有基于学习的图像润饰方法依赖大规模成对数据且为“黑箱”模型的问题,从而导致润饰过程不透明、难以适应用户或图像特定需求。其解决方案的关键在于提出RetouchLLM,一个无需训练数据的白盒图像润饰系统,通过两个核心模块实现可解释的代码驱动润饰:一是视觉评判器(visual critic),用于识别输入图像与参考图像之间的差异;二是代码生成器(code generator),能够输出可执行的润饰代码。该框架模拟人类多步骤润饰流程,在高分辨率图像上进行渐进式调整,支持多样化的调整路径探索,并借助自然语言交互实现以用户意图为导向的可控润饰。
链接: https://arxiv.org/abs/2510.08054
作者: Moon Ye-Bin,Roy Miles,Tae-Hyun Oh,Ismail Elezi,Jiankang Deng
机构: POSTECH(浦项科技大学); Huawei London Research Center(华为伦敦研究中心); KAIST(韩国科学技术院); Imperial College London(帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises of two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable codes. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.
zh
[CV-62] RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans BMVC-2025
【速读】:该论文旨在解决脑部磁共振成像(MRI)中弱监督异常检测(Weakly Supervised Anomaly Detection, WSAD)的问题,即在缺乏像素级标注的情况下,仅依赖切片级别的弱标签实现精准的异常区域定位与分割。其解决方案的关键在于提出了一种两阶段框架RASALoRE:第一阶段采用判别性双提示微调(Discriminative Dual Prompt Tuning, DDPT)机制,基于切片级标签生成高质量的伪弱掩码作为粗粒度定位线索;第二阶段设计了一个具有区域感知空间注意力机制的分割网络,利用固定的位置随机嵌入(location-based random embeddings)引导模型聚焦于异常区域,从而在参数量少于800万的情况下实现显著优于现有方法的性能,并大幅降低计算复杂度。
链接: https://arxiv.org/abs/2510.08052
作者: Bheeshm Sharma,Karthikeyan Jaganathan,Balamurugan Palaniappan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in BMVC-2025
Abstract:Weakly Supervised Anomaly detection (WSAD) in brain MRI scans is an important challenge useful to obtain quick and accurate detection of brain anomalies when precise pixel-level anomaly annotations are unavailable and only weak labels (e.g., slice-level) are available. In this work, we propose RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings, a novel two-stage WSAD framework. In the first stage, we introduce a Discriminative Dual Prompt Tuning (DDPT) mechanism that generates high-quality pseudo weak masks based on slice-level labels, serving as coarse localization cues. In the second stage, we propose a segmentation network with a region-aware spatial attention mechanism that relies on fixed location-based random embeddings. This design enables the model to effectively focus on anomalous regions. Our approach achieves state-of-the-art anomaly detection performance, significantly outperforming existing WSAD methods while utilizing less than 8 million parameters. Extensive evaluations on the BraTS20, BraTS21, BraTS23, and MSD datasets demonstrate a substantial performance improvement coupled with a significant reduction in computational complexity. Code is available at: this https URL.
zh
[CV-63] RayFusion: Ray Fusion Enhanced Collaborative Visual Perception NEURIPS2025
【速读】:该论文旨在解决基于摄像头的协同视觉感知系统在3D目标检测中因缺乏显式深度信息而导致的深度估计模糊问题(depth estimation ambiguity),从而影响检测精度。其解决方案的关键在于提出了一种基于射线(ray-based)的融合方法——RayFusion,通过利用协作方提供的射线占用(ray occupancy)信息,有效减少沿相机射线方向的冗余和误检预测,进而提升纯摄像头协同感知系统的检测性能。
链接: https://arxiv.org/abs/2510.08017
作者: Shaohong Wang,Bin Lu,Xinyu Xiao,Hanzhi Zhong,Bowen Pang,Tong Wang,Zhiyu Xiang,Hangguan Shan,Eryun Liu
机构: Zhejiang University (浙江大学); Institute of Automation of Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS2025
Abstract:Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at this https URL.
zh
[CV-64] CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
【速读】:该论文旨在解决**组合图像检索(Composed Image Retrieval, CIR)**中因模型黑箱特性导致的可解释性差与复杂细粒度指令理解能力弱的问题。当前基于视觉语言模型(Vision-Language Models, VLMs)和多模态大语言模型(Multimodal Large Language Models, MLLMs)的方法虽取得一定进展,但缺乏对跨模态推理过程的显式建模,难以满足用户对检索逻辑透明性和精确控制的需求。解决方案的关键在于提出首个面向检索任务的端到端MLLM——CIR-CoT,其核心创新是引入显式的思维链(Chain-of-Thought, CoT)推理机制:通过强制模型先生成结构化的推理链条(包含描述、推理与结论),从而增强跨模态交互的理解能力,并将最终检索意图编码为专用嵌入向量。这一设计不仅提升了检索准确性,还显著增强了决策过程的可解释性,同时在域内(FashionIQ、CIRR)与域外(CIRCO)数据集上均展现出优异性能。
链接: https://arxiv.org/abs/2510.08003
作者: Weihuang Lin,Yiwei Ma,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as ``black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models’ ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.
zh
[CV-65] GraphEnet: Event-driven Human Pose Estimation with a Graph Neural Network
【速读】:该论文旨在解决在资源受限场景下(如便携式电子设备和移动机器人)实现高频率、鲁棒的单人二维人体姿态估计问题,尤其是在传统RGB相机受限于高延迟和高能耗的情况下。其解决方案的关键在于提出了一种名为GraphEnet的图神经网络架构,该架构利用事件相机输出的稀疏特性,并引入基于线段的中间事件表示方法,结合一种新颖的偏移向量学习范式与基于置信度的池化策略,从而高效地从事件数据中提取人体关键点位置信息。这是首个将图神经网络应用于事件数据进行人体姿态估计的工作。
链接: https://arxiv.org/abs/2510.07990
作者: Gaurvi Goyal,Pham Cong Thuong,Arren Glover,Masayoshi Mizuno,Chiara Bartolozzi
机构: Maastricht University (马斯特里赫特大学); Istituto Italiano di Tecnologia (意大利技术研究院); Sony Interactive Entertainment Inc. (索尼互动娱乐公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human Pose Estimation is a crucial module in human-machine interaction applications and, especially since the rise in deep learning technology, robust methods are available to consumers using RGB cameras and commercial GPUs. On the other hand, event-based cameras have gained popularity in the vision research community for their low latency and low energy advantages that make them ideal for applications where those resources are constrained like portable electronics and mobile robots. In this work we propose a Graph Neural Network, GraphEnet, that leverages the sparse nature of event camera output, with an intermediate line based event representation, to estimate 2D Human Pose of a single person at a high frequency. The architecture incorporates a novel offset vector learning paradigm with confidence based pooling to estimate the human pose. This is the first work that applies Graph Neural Networks to event data for Human Pose Estimation. The code is open-source at this https URL.
zh
[CV-66] Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN
【速读】:该论文旨在解决低光照图像中高频细节恢复与严重噪声抑制的协同难题,这是计算机视觉领域长期存在的挑战。其解决方案的关键在于对性能与效率之间的权衡进行系统性分析:通过对比基于Transformer的SwinIR模型与轻量级卷积神经网络(CNN)在该任务上的表现,发现尽管SwinIR在峰值信噪比(PSNR)上达到39.03 dB的最优水平,但其训练需132个epoch且模型体积超过55倍于CNN;而标准CNN仅用10个epoch即可实现37.4 dB的近最优PSNR,显著降低了计算资源消耗,证明了轻量级CNN在资源受限的实际场景中具备极高的应用价值。
链接: https://arxiv.org/abs/2510.07984
作者: Chandresh Sutariya,Nitin Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures
Abstract:The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model’s size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.
zh
[CV-67] he impact of abstract and object tags on image privacy classification ICASSP2026
【速读】:该论文旨在解决图像隐私分类任务中标签类型选择与数量配置的优化问题,即在不同标签预算条件下,判断使用对象标签(object tags)还是抽象标签(abstract tags)更有利于提升模型性能。其解决方案的关键在于实证分析表明:当标签预算受限时,抽象标签因能捕捉更高层次的上下文信息,在主观性强的图像隐私判断中表现更优;而当每张图像可获取的标签数量较多时,对象标签所提供的细粒度实体信息同样具有较高价值。这一发现为构建更精准、可解释的图像隐私分类器提供了理论依据和实践指导。
链接: https://arxiv.org/abs/2510.07976
作者: Darya Baranouskaya,Andrea Cavallaro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the ICASSP 2026
Abstract:Object tags denote concrete entities and are central to many computer vision tasks, whereas abstract tags capture higher-level information, which is relevant for tasks that require a contextual, potentially subjective scene understanding. Object and abstract tags extracted from images also facilitate interpretability. In this paper, we explore which type of tags is more suitable for the context-dependent and inherently subjective task of image privacy. While object tags are generally used for privacy classification, we show that abstract tags are more effective when the tag budget is limited. Conversely, when a larger number of tags per image is available, object-related information is as useful. We believe that these findings will guide future research in developing more accurate image privacy classifiers, informed by the role of tag types and quantity.
zh
[CV-68] Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement NEURIPS2025
【速读】:该论文旨在解决超高清(Ultra-High Definition, UHD)图像恢复中计算效率与高频细节保留之间的权衡问题。传统变分自编码器(Variational Autoencoders, VAEs)虽通过潜在空间处理提升效率,但其高斯约束常导致退化特异性高频信息丢失,从而降低重建保真度。解决方案的关键在于提出“潜在和谐”(Latent Harmony)框架,该框架采用两阶段设计:第一阶段引入LH-VAE,通过视觉语义约束和渐进式退化扰动增强语义鲁棒性,并利用潜在等变性强化高频特征保持;第二阶段联合训练修复模型与LH-VAE,采用高频低秩适配(High-Frequency Low-Rank Adaptation, HF-LoRA)机制——编码器LoRA基于保真导向的高频对齐损失恢复真实细节,解码器LoRA则由感知导向损失合成逼真纹理,二者通过交替优化与选择性梯度传播共同维持预训练潜在空间的稳定性,最终实现UHD及标准分辨率任务下的最优效率、感知质量和重建精度平衡。
链接: https://arxiv.org/abs/2510.07961
作者: Yidi Liu,Xueyang Fu,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025
Abstract:Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware this http URL Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency this http URL Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent this http URL inference, a tunable parameter \alpha enables flexible fidelity-perception this http URL show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
zh
[CV-69] SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation ICME2025
【速读】:该论文旨在解决降水临近预报(precipitation nowcasting)中因地球系统复杂性导致的高难度预测问题,尤其关注现有非自回归模型在长时程预测中的性能瓶颈与确定性输出带来的模糊性和分布偏移问题。解决方案的关键在于提出一种名为SimCast的新颖训练流程,其核心是短程到长程的知识蒸馏技术与加权均方误差(weighted MSE loss)相结合,以优先优化强降雨区域的预测精度;同时,进一步将SimCast集成至基于扩散模型的框架CasCast中,利用概率模型的优势提升预测质量,从而在不增加推理开销的前提下显著改善预测准确性。
链接: https://arxiv.org/abs/2510.07953
作者: Yifang Yin,Shengkai Chen,Yiyao Li,Lu Wang,Ruibing Jin,Wei Cui,Shili Xiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: accepted by ICME 2025
Abstract:Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complementary to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths from probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.
zh
[CV-70] A Large-scale Dataset for Robust Complex Anime Scene Text Detection
【速读】:该论文旨在解决现有文本检测数据集在动漫场景(anime scene)中表现不佳的问题。当前主流文本检测数据集主要面向自然场景或文档场景,其文本通常具有规则字体、单调颜色和有序布局,而动漫场景中的文本则呈现出多样化的风格、不规则的排布,并易与符号、装饰图案等视觉元素混淆,且包含大量手写体和艺术化字体。为应对这一挑战,作者提出了AnimeText数据集,其关键创新在于构建了一个包含735K张图像和420万标注文本块的大规模数据集,具备层次化标注(hierarchical annotations)和专为动漫场景设计的难负样本(hard negative samples),从而显著提升模型在复杂动漫文本检测任务中的性能。
链接: https://arxiv.org/abs/2510.07951
作者: Ziyi Dong,Yurui Zhang,Changmao Li,Naomi Rue Golding,Qing Long
机构: Sun Yat-sen University (中山大学); DeepGHS (Deep Generative anime Hobbyist Syndicate); Nanchang Hangkong University (南昌航空大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Current text detection datasets primarily target natural or document scenes, where text typically appear in regular font and shapes, monotonous colors, and orderly layouts. The text usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scene also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. %Cross-dataset evaluations using state-of-the-art methods demonstrate that models trained on AnimeText achieve superior performance in anime text detection tasks compared to existing datasets. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: this https URL
zh
[CV-71] CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
【速读】:该论文旨在解决自动驾驶场景中对高保真、多视角、长时间视频生成以及深度信息等多样化语义内容提取的需求。现有生成式模型在处理复杂动态环境时,往往难以同时实现高质量的视频生成与结构化几何信息(如深度估计)的同步输出。其解决方案的关键在于提出CVD-STORM模型,该模型基于时空重建变分自编码器(Spatial-Temporal Reconstruction Variational Autoencoder, VAE),首先通过辅助的4D重建任务微调VAE以增强对3D结构和时间动态的编码能力,随后将其嵌入视频扩散过程以显著提升生成质量;此外,联合训练的高斯点绘制解码器(Gaussian Splatting Decoder)进一步实现了动态场景的几何重建,从而为场景理解提供关键的结构化信息。
链接: https://arxiv.org/abs/2510.07944
作者: Tianrui Zhang,Yichen Liu,Zilin Guo,Yuxin Guo,Jingcheng Ni,Chenjing Ding,Dan Xu,Lewei Lu,Zehuan Wu
机构: Sensetime Research (商汤科技研究院); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.
zh
[CV-72] ASBench: Image Anomalies Synthesis Benchmark for Anomaly Detection
【速读】:该论文旨在解决制造质量控制中异常检测(anomaly detection)因异常样本稀缺和人工标注成本高而导致的应用瓶颈问题。现有方法虽尝试通过异常合成(anomaly synthesis)缓解此问题,但普遍将其作为异常检测框架的辅助模块,缺乏对合成算法本身的系统性评估,且忽视了关键因素如合成影响与检测性能的解耦、合成数据的量化分析以及跨场景适应性等。解决方案的关键在于提出首个专注于异常合成方法的全面基准测试框架 ASBench,其核心创新在于引入四个维度的评估体系:(i) 不同数据集和流水线下的泛化能力,(ii) 合成数据与真实数据的比例关系,(iii) 合成图像内在指标与异常检测性能指标的相关性,以及 (iv) 混合合成策略的有效性,从而为异常合成方法的科学评估和未来研究提供系统性指导。
链接: https://arxiv.org/abs/2510.07927
作者: Qunyi Zhang,Songan Zhang,Jinbao Wang,Xiaoning Lei,Guoyang Xie,Guannan Jiang,Zhichao Lu
机构: Global Institute of Future Technology, Shanghai Jiao Tong University (上海交通大学未来技术学院); School of Artificial Intelligence, Shenzhen University (深圳大学人工智能学院); Contemporary Amperex Technology Co., Ltd. (中创新航科技有限公司); Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Anomaly detection plays a pivotal role in manufacturing quality control, yet its application is constrained by limited abnormal samples and high manual annotation costs. While anomaly synthesis offers a promising solution, existing studies predominantly treat anomaly synthesis as an auxiliary component within anomaly detection frameworks, lacking systematic evaluation of anomaly synthesis algorithms. Current research also overlook crucial factors specific to anomaly synthesis, such as decoupling its impact from detection, quantitative analysis of synthetic data and adaptability across different scenarios. To address these limitations, we propose ASBench, the first comprehensive benchmarking framework dedicated to evaluating anomaly synthesis methods. Our framework introduces four critical evaluation dimensions: (i) the generalization performance across different datasets and pipelines (ii) the ratio of synthetic to real data (iii) the correlation between intrinsic metrics of synthesis images and anomaly detection performance metrics , and (iv) strategies for hybrid anomaly synthesis methods. Through extensive experiments, ASBench not only reveals limitations in current anomaly synthesis methods but also provides actionable insights for future research directions in anomaly synthesis
zh
[CV-73] MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
【速读】:该论文旨在解决视觉语言模型(Visual Language Models, VLMs)在从图像扩展到视频时面临的高计算成本问题,主要源于视频的高帧率和长时长导致的大量视觉token。解决方案的关键在于提出一种无需训练的令牌压缩方法——基于记忆增强强化学习的令牌压缩(Memory-Augmented Reinforcement Learning-based Token Compression, MARC),其核心创新包括:采用“检索-压缩”策略,利用视觉记忆检索器(Visual Memory Retriever, VMR)选择关键片段,并通过压缩组相对策略优化(Compression Group Relative Policy Optimization, C-GRPO)框架实现从教师模型到学生模型的推理能力蒸馏。该方法在六项视频基准测试中仅使用单帧token即可逼近基线性能,显著降低95%视觉token、72% GPU内存占用及23.9%延迟,适用于资源受限场景下的实时视频理解任务。
链接: https://arxiv.org/abs/2510.07915
作者: Peiran Wu,Zhuorui Yu,Yunze Liu,Chi-Hao Wu,Enmin Zhou,Junxiao Shen
机构: University of Bristol; Memories.ai Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose \textbfMemory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a \textitretrieve-then-compress strategy using a \textbfVisual Memory Retriever (VMR) to select key clips and a \textbfCompression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame’s tokens – reducing visual tokens by \textbf95%, GPU memory by \textbf72%, and latency by \textbf23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
zh
[CV-74] MMM: Quantum-Chemical Molecular Representation Learning for Combinatorial Drug Recommendation MICCAI
【速读】:该论文旨在解决临床决策支持系统中联合用药时药物-药物相互作用(Drug-Drug Interaction, DDI)风险难以准确预测的问题。现有基于图神经网络(Graph Neural Networks, GNNs)的方法因依赖简化的离散分子表示,无法充分捕捉分子间的结合亲和力与反应活性。其解决方案的关键在于提出一种融合三维量子化学信息的多模态DDI预测框架——MMM(Multimodal DDI Prediction with Molecular Electron Localization Function (ELF) Maps),通过计算分子电子定域函数(Electron Localization Function, ELF)生成3D电子密度图,并将编码全局电子特性的ELF特征与用于建模局部子结构相互作用的二部图编码器相结合,从而学习药物分子的互补特性,显著提升DDI预测性能。
链接: https://arxiv.org/abs/2510.07910
作者: Chongmyung Kwon,Yujin Kim,Seoeun Park,Yunji Lee,Charmgil Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Medical Image Computing and Computer-Assisted Intervention (MICCAI) Predictive Intelligence in Medicine Workshop (MICCAI PRIME) 2025; 13 pages
Abstract:Drug recommendation is an essential task in machine learning-based clinical decision support systems. However, the risk of drug-drug interactions (DDI) between co-prescribed medications remains a significant challenge. Previous studies have used graph neural networks (GNNs) to represent drug structures. Regardless, their simplified discrete forms cannot fully capture the molecular binding affinity and reactivity. Therefore, we propose Multimodal DDI Prediction with Molecular Electron Localization Function (ELF) Maps (MMM), a novel framework that integrates three-dimensional (3D) quantum-chemical information into drug representation learning. It generates 3D electron density maps using the ELF. To capture both therapeutic relevance and interaction risks, MMM combines ELF-derived features that encode global electronic properties with a bipartite graph encoder that models local substructure interactions. This design enables learning complementary characteristics of drug molecules. We evaluate MMM in the MIMIC-III dataset (250 drugs, 442 substructures), comparing it with several baseline models. In particular, a comparison with the GNN-based SafeDrug model demonstrates statistically significant improvements in the F1-score (p = 0.0387), Jaccard (p = 0.0112), and the DDI rate (p = 0.0386). These results demonstrate the potential of ELF-based 3D representations to enhance prediction accuracy and support safer combinatorial drug prescribing in clinical practice.
zh
[CV-75] am Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception - Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track
【速读】:该论文旨在解决自主代理在动态人类密集的室内环境中实现安全、高效且符合社会规范的导航问题(social navigation)。其核心挑战在于仅使用机载传感器(如RGB-D图像和里程计)进行局部感知,无法获取全局地图或特权信息的情况下,仍需遵守社交规范(如保持安全距离和避障)。解决方案的关键在于引入一个主动风险感知模块(Proactive Risk Perception Module),该模块基于Falcon模型增强对周围人类潜在碰撞风险的理解,通过学习预测距离相关的碰撞风险评分,使代理具备更鲁棒的空间意识和主动避障行为,从而显著提升在拥挤场景中维持个人空间合规性的能力。
链接: https://arxiv.org/abs/2510.07871
作者: Erjia Xiao,Lingfeng Zhang,Yingbo Tang,Hao Cheng,Renjing Xu,Wenbo Ding,Lei Zhou,Long Chen,Hangjun Ye,Xiaoshuai Hao
机构: HKUST (GZ); Tsinghua University; Xiaomi EV; Institute of Automation, CAS
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent’s ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.
zh
[CV-76] XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method
【速读】:该论文旨在解决基于前馈重建范式的场景重建方法在驾驶场景中存在泛化能力弱和重建精度低的问题。具体而言,现有方法依赖固定的视角变换,在相机配置变化时失效,限制了跨场景的泛化能力;同时,由于360°全景图中稀疏视图间的重叠区域较小且驾驶场景复杂,导致学习难度增加,影响重建精度。其解决方案的关键在于提出XYZCylinder模型,通过两个核心设计:一是统一圆柱相机建模(Unified Cylinder Camera Modeling, UCCM)策略,避免学习视点相关的空间对应关系,并用可调参数统一不同相机配置以提升泛化能力;二是基于新设计的圆柱平面特征组(Cylinder Plane Feature Group, CPFG)构建混合表示结构,结合多个专用模块将2D图像特征高效提升至3D空间,从而显著提高重建精度。实验表明,该方法在多种评估设置下达到当前最优性能,并能零样本推广至其他驾驶场景。
链接: https://arxiv.org/abs/2510.07856
作者: Haochen Yu,Qiankun Liu,Hongyuan Liu,Jianfei Jiang,Juntao Lyu,Jiansheng Chen,Huimin Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited while reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the 360^\circ panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose \textbfXYZCylinder, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner. Project page: \hrefthis https URLhere.
zh
[CV-77] Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials
【速读】:该论文旨在解决高通量毒性测试中自动化评估的难题,特别是如何利用机器学习模型有效识别化学物质引起的毒性变化。其解决方案的关键在于采用自监督学习(self-supervised learning)方法提取特征表示,这些表示能够准确区分不同化合物的作用机制(modes-of-action),从而提升毒性测试的效率与准确性。
链接: https://arxiv.org/abs/2510.07853
作者: Thomas Lautenschlager,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Katja Nau,Gaëlle Hayot,Thomas Dickmeis,Ralf Mikut
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.
zh
[CV-78] AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views
【速读】:该论文旨在解决从稀疏视角重建语义丰富且几何准确的室内场景3D模型的问题,现有方法通常将语义视为被动特征附加在已构建的、可能不准确的几何结构上,导致重建结果存在几何歧义。解决方案的关键在于提出AlignGS框架,通过端到端优化几何与语义的协同关系,首次将语义理解作为主动引导力:利用2D基础模型提取的丰富先验信息,设计深度一致性约束和多维度法向量正则化等新型语义到几何的引导机制,直接对3D表示进行正则化,从而显著提升新视角合成质量和几何精度。
链接: https://arxiv.org/abs/2510.07839
作者: Yijie Gao,Houqiang Zhong,Tianchi Zhu,Zhengxue Cheng,Qiang Hu,Li Song
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that for robust sparse-view reconstruction, semantic understanding instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views. Our code is avaliable at this https URL .
zh
[CV-79] IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries
【速读】:该论文旨在解决手语视频到口语音频的直接翻译问题,尤其针对非语法连续的手语序列(isolated sign sequences),以支持教育应用和手语提示界面等场景。传统多阶段翻译系统因依赖中间文本表示而存在延迟高和误差传播的问题,为此,作者提出IsoSignVid2Aud框架,其关键在于采用端到端设计,结合I3D特征提取模块、专用特征变换网络与音频生成流水线,并引入一种新颖的非极大值抑制(Non-Maximal Suppression, NMS)算法实现对非语法连续手语序列的时间检测,从而无需中间文本即可直接生成可懂语音,显著提升了实时性和准确性。
链接: https://arxiv.org/abs/2510.07837
作者: Harsh Kavediya,Vighnesh Nayak,Bheeshm Sharma,Balamurugan Palaniappan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注: Accepted in AIML-Systems-2025
Abstract:Sign language to spoken language audio translation is important to connect the hearing- and speech-challenged humans with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos with a sequence of possibly non-grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non-Maximal Suppression (NMS) algorithm for the temporal detection of signs in non-grammatic continuous sequences. Experimental results demonstrate competitive performance on ASL-Citizen-1500 and WLASL-100 datasets with Top-1 accuracies of 72.01% and 78.67%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: this https URL.
zh
[CV-80] PrismGS: Physically-Grounded Anti-Aliasing for High-Fidelity Large-Scale 3D Gaussian Splatting
【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在大规模城市环境渲染中出现的严重混叠伪影(aliasing artifacts)和优化不稳定性问题,尤其是在高分辨率(如4K)下表现为纹理闪烁和锯齿边缘。这些问题源于高斯基元与城市几何多尺度特性之间的不匹配。解决方案的关键在于提出PrismGS,一个基于物理约束的正则化框架,其核心包含两个协同机制:一是金字塔多尺度监督(pyramidal multi-scale supervision),通过预滤波图像金字塔强制渲染一致性,使模型学习到内在抗混叠的表示;二是显式尺寸正则化(explicit size regularization),对3D高斯的尺寸施加物理合理的下界,防止退化且视点依赖的基元形成,从而提升几何表面的稳定性和合理性。该方法可即插即用,兼容现有流水线,并在MatrixCity、Mill-19和UrbanScene3D等数据集上实现显著性能提升,PSNR相较CityGaussian提升约1.5 dB。
链接: https://arxiv.org/abs/2510.07830
作者: Houqiang Zhong,Zhenglong Wu,Sihua Fu,Zihan Zheng,Xin Jin,Xiaoyun Zhang,Li Song,Qiang Hu
机构: Shanghai Jiao Tong University (上海交通大学); Eastern Institute of Technology (东方理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) has recently enabled real-time photorealistic rendering in compact scenes, but scaling to large urban environments introduces severe aliasing artifacts and optimization instability, especially under high-resolution (e.g., 4K) rendering. These artifacts, manifesting as flickering textures and jagged edges, arise from the mismatch between Gaussian primitives and the multi-scale nature of urban geometry. While existing ``divide-and-conquer’’ pipelines address scalability, they fail to resolve this fidelity gap. In this paper, we propose PrismGS, a physically-grounded regularization framework that improves the intrinsic rendering behavior of 3D Gaussians. PrismGS integrates two synergistic regularizers. The first is pyramidal multi-scale supervision, which enforces consistency by supervising the rendering against a pre-filtered image pyramid. This compels the model to learn an inherently anti-aliased representation that remains coherent across different viewing scales, directly mitigating flickering textures. This is complemented by an explicit size regularization that imposes a physically-grounded lower bound on the dimensions of the 3D Gaussians. This prevents the formation of degenerate, view-dependent primitives, leading to more stable and plausible geometric surfaces and reducing jagged edges. Our method is plug-and-play and compatible with existing pipelines. Extensive experiments on MatrixCity, Mill-19, and UrbanScene3D demonstrate that PrismGS achieves state-of-the-art performance, yielding significant PSNR gains around 1.5 dB against CityGaussian, while maintaining its superior quality and robustness under demanding 4K rendering.
zh
[CV-81] MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
【速读】:该论文旨在解决现有3D人体-物体交互(Human-Object Interaction, HOI)基准数据集难以刻画真实场景中多个人与多个物体之间因果、目标导向或协作性交互的问题。为填补这一空白,作者提出了MMHOI数据集,涵盖12种日常场景,提供完整的人体和物体3D形状与姿态标注,并包含78类动作标签及14类交互特异性身体部位标签,构建了一个全面的HOI研究测试平台。其解决方案的关键在于提出MMHOI-Net——一种基于Transformer的端到端神经网络,通过结构化的双补丁(dual-patch)表示来建模物体及其交互关系,并结合动作识别以增强交互预测能力,从而在多主体HOI建模任务中实现最优性能,兼具高精度与高质量重建。
链接: https://arxiv.org/abs/2510.07828
作者: Kaen Kogashi,Anoop Cherian,Meng-Yu Jennifer Kuo
机构: Mitsubishi Electric (三菱电机); Mitsubishi Electric Research Labs (三菱电机研究实验室); Nara Women’s University (奈良女子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI – a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.
zh
[CV-82] Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation NEURIPS2025
【速读】:该论文旨在解决视觉提示(Visual Prompting, VP)方法在适应预训练视觉模型到下游任务时存在的两个关键问题:一是简单加性变换带来的表达能力受限,二是随着参数量增加易出现过拟合现象。解决方案的关键在于提出一种新型的多模态提示机制——ACAVP(Affine, Color, and Additive Visual Prompting),通过引入仿射变换(affine transformation)以生成任务特定的提示区域并保留原始图像信息,同时结合色彩变换(color transformation)增强任务相关视觉特征;此外,识别出过拟合是VP训练中的核心挑战,并引入TrivialAugment作为有效的数据增强策略,显著提升模型性能(最高达12个百分点),且该增强策略对现有VP方法同样有效,验证了合适的数据增强对VP训练具有普适性价值。
链接: https://arxiv.org/abs/2510.07823
作者: Shohei Enomoto
机构: NTT(日本电信电话公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS2025
Abstract:Visual prompting (VP) has emerged as a promising parameter-efficient fine-tuning approach for adapting pre-trained vision models to downstream tasks without modifying model parameters. Despite offering advantages like negligible computational overhead and compatibility with black-box models, conventional VP methods typically achieve lower accuracy than other adaptation approaches. Our analysis reveals two critical limitations: the restricted expressivity of simple additive transformation and a tendency toward overfitting when the parameter count increases. To address these challenges, we propose ACAVP (Affine, Color, and Additive Visual Prompting), which enhances VP’s expressive power by introducing complementary transformation operations: affine transformation for creating task-specific prompt regions while preserving original image information, and color transformation for emphasizing task-relevant visual features. Additionally, we identify that overfitting is a critical issue in VP training and introduce TrivialAugment as an effective data augmentation, which not only benefits our approach but also significantly improves existing VP methods, with performance gains of up to 12 percentage points on certain datasets. This demonstrates that appropriate data augmentation is universally beneficial for VP training. Extensive experiments across twelve diverse image classification datasets with two different model architectures demonstrate that ACAVP achieves state-of-the-art accuracy among VP methods, surpasses linear probing in average accuracy, and exhibits superior robustness to distribution shifts, all while maintaining minimal computational overhead during inference.
zh
[CV-83] An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images
【速读】:该论文旨在解决单目360°室内全景图像中球面像素深度预测的问题,现有方法虽注重像素级精度,但易导致房间角落过度平滑和对噪声敏感。解决方案的关键在于引入基于房间几何约束的深度估计框架,通过布局预测提取房间几何信息,并利用背景分割机制将这些信息融入深度估计过程;具体而言,模型包含共享特征编码器与任务特定解码器(用于布局估计、深度估计和背景分割),并结合两种核心策略:一是基于房间几何的背景深度解析策略,利用布局和深度解码器输出生成背景深度图;二是基于背景分割引导的融合机制,从分割解码器预测中推导融合权重以融合背景与粗略深度图。该方法在Stanford2D3D、Matterport3D和Structured3D数据集上显著优于现有开源方法。
链接: https://arxiv.org/abs/2510.07817
作者: Kanglin Ning,Ruzhao Chen,Penghong Wang,Xingtao Wang,Ruiqin Xiong,Xiaopeng Fan
机构: Harbin Institute of Technology (哈尔滨工业大学); PengChengLab (鹏城实验室); Suzhou Research Institute of HIT (哈尔滨工业大学苏州研究院); Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University (北京大学电子工程与计算机科学学院数字媒体研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Predicting spherical pixel depth from monocular 360^\circ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, causing oversmoothed room corners and noise sensitivity. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates those information into the depth estimation process through background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The proposed room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder’s output to generate the corresponding background depth map. Then, a background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder’s predictions. Extensive experimental results on the Stanford2D3D, Matterport3D and Structured3D datasets show that our proposed methods can achieve significantly superior performance than current open-source methods. Our code is available at this https URL.
zh
[CV-84] FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition
【速读】:该论文旨在解决微表情识别中因仅利用起始帧到峰值帧之间的光流信息而忽略峰值帧到消退帧阶段重要运动特征所导致的识别性能受限问题。其解决方案的关键在于提出一种全新的运动表征方法——幅度调制组合光流(Magnitude-Modulated Combined Optical Flow, MM-COF),该方法将微表情的两个关键阶段(起始至峰值、峰值至消退)的运动动态整合为统一描述符,并进一步设计了一种端到端神经网络架构FMANet,通过可学习模块内化双阶段分析与幅度调制机制,使网络能够自适应融合运动线索并聚焦于显著面部区域进行分类,从而显著提升微表情识别准确率。
链接: https://arxiv.org/abs/2510.07810
作者: Luu Tu Nguyen,Vu Tram Anh Khuong,Thi Bich Phuong Man,Thi Duyen Ngo,Thanh Ha Le
机构: Vietnam National University (越南国家大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.
zh
[CV-85] GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
【速读】:该论文旨在解决当前视觉-语言模型(Visual-Language Models, VLMs)在地理空间-时间智能(geographic spatial-temporal intelligence)方面评估不足的问题,尤其是缺乏对多视角(图像/视频与地图)协同推理能力的系统性测试。现有基准主要局限于第一人称视角或地理图形上下文,无法全面评估VLMs在大规模摄像头网络中对移动目标进行跨场景时空推理的能力,这限制了其在交通管理、应急响应等领域的应用潜力。解决方案的关键在于提出Geo-Temporal Reasoning Benchmark (GTR-Bench),该基准要求模型在地图与视频之间切换视角、联合推理多个视场不重叠的视频,并对未被任何视频覆盖的空间-时间区域进行推断,从而更真实地模拟现实世界复杂场景下的时空理解需求。
链接: https://arxiv.org/abs/2510.07791
作者: Qinghongbing Xie,Zhaoyuan Xia,Feng Zhu,Lijun Gong,Ziyue Li,Rui Zhao,Long Zeng
机构: SenseTime Research(商汤科技研究院); Tsinghua University (清华大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 13 figures
Abstract:Recently spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (eg. a map), thus fail to assess VLMs’ geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs’ reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at this https URL.
zh
[CV-86] Demystifying Deep Learning-based Brain Tumor Segmentation with 3D UNets and Explainable AI (XAI): A Comparative Analysis
【速读】:该论文旨在解决医学影像中脑肿瘤分割精度不足的问题,以提升生成式 AI 在临床决策支持中的可信度与实用性。其核心解决方案是结合可解释人工智能(Explainable Artificial Intelligence, XAI)技术,优化 UNet 系列模型的性能并增强医生对模型决策的理解。关键在于采用 Residual UNet(ResUNet)作为最优模型架构,并通过梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)和基于注意力机制的可视化方法,分别揭示各模型关注的肿瘤子区域及 AttUNet 中注意力模块的工作原理,从而在保证高分割精度的同时提高模型透明度与临床可用性。
链接: https://arxiv.org/abs/2510.07785
作者: Ming Jie Ong,Sze Yinn Ung,Sim Kuan Goh,Jimmy Y. Zhong
机构: Xiamen University Malaysia (厦门大学马来西亚分校); Georgia Institute of Technology (佐治亚理工学院); James Cook University (詹姆斯库克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The current study investigated the use of Explainable Artificial Intelligence (XAI) to improve the accuracy of brain tumor segmentation in MRI images, with the goal of assisting physicians in clinical decision-making. The study focused on applying UNet models for brain tumor segmentation and using the XAI techniques of Gradient-weighted Class Activation Mapping (Grad-CAM) and attention-based visualization to enhance the understanding of these models. Three deep learning models - UNet, Residual UNet (ResUNet), and Attention UNet (AttUNet) - were evaluated to identify the best-performing model. XAI was employed with the aims of clarifying model decisions and increasing physicians’ trust in these models. We compared the performance of two UNet variants (ResUNet and AttUNet) with the conventional UNet in segmenting brain tumors from the BraTS2020 public dataset and analyzed model predictions with Grad-CAM and attention-based visualization. Using the latest computer hardware, we trained and validated each model using the Adam optimizer and assessed their performance with respect to: (i) training, validation, and inference times, (ii) segmentation similarity coefficients and loss functions, and (iii) classification performance. Notably, during the final testing phase, ResUNet outperformed the other models with respect to Dice and Jaccard similarity scores, as well as accuracy, recall, and F1 scores. Grad-CAM provided visuospatial insights into the tumor subregions each UNet model focused on while attention-based visualization provided valuable insights into the working mechanisms of AttUNet’s attention modules. These results demonstrated ResUNet as the best-performing model and we conclude by recommending its use for automated brain tumor segmentation in future clinical assessments. Our source code and checkpoint are available at this https URL
zh
[CV-87] IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在复杂现实交互中缺乏隐式人类意图推理能力的问题。现有SOTA VLA模型主要基于与具身场景相关性有限的多模态任务进行预训练,并通过显式指令映射执行动作,导致其无法有效处理需要深度推理的具身操作任务。解决方案的关键在于提出IntentionVLA框架,该框架采用课程学习(curriculum training)范式,首先利用精心设计的包含意图推理、空间定位和紧凑具身推理的数据集,赋予模型感知与推理双重能力;随后在微调阶段,将紧凑的推理输出作为上下文引导用于动作生成,从而实现对间接指令的快速推理与执行。这一机制显著提升了模型在分布外意图任务上的泛化能力和零样本人机交互性能。
链接: https://arxiv.org/abs/2510.07778
作者: Yandu Chen,Kefan Gu,Yuqing Wen,Yucheng Zhao,Tiancai Wang,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen); Nanjing University; University of Science and Technology of China; Dexmal
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose \textbfIntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms \pi_0 , achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.
zh
[CV-88] DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream
【速读】:该论文旨在解决从低帧率RGB视频中重建动态三维高斯溅射(Dynamic 3D Gaussian Splatting, Dynamic 3DGS)时因大帧间运动导致解空间不确定性增加的问题。传统方法在处理大位移时难以准确匹配像素对应关系,而事件相机虽能提供高时间分辨率的异步事件流并具备抗运动模糊特性,但缺乏颜色信息,使得多模态融合面临挑战。解决方案的关键在于引入事件流中的运动先验来引导变形场的优化:首先通过提出的LoCM无监督微调框架将事件光流估计器适配到未见场景以提取运动先验;其次设计几何感知的数据关联方法建立事件与高斯点之间的运动对应关系,并辅以运动分解和帧间伪标签策略,从而有效提升动态3DGS的重建精度与鲁棒性。
链接: https://arxiv.org/abs/2510.07752
作者: Junhao He,Jiaxu Wang,Jia Li,Mingyuan Sun,Qiang Zhang,Jiahang Cao,Ziyi Zhang,Yi Gu,Jingkai Sun,Renjing Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TVCG
Abstract:Reconstructing Dynamic 3D Gaussian Splatting (3DGS) from low-framerate RGB videos is challenging. This is because large inter-frame motions will increase the uncertainty of the solution space. For example, one pixel in the first frame might have more choices to reach the corresponding pixel in the second frame. Event cameras can asynchronously capture rapid visual changes and are robust to motion blur, but they do not provide color information. Intuitively, the event stream can provide deterministic constraints for the inter-frame large motion by the event trajectories. Hence, combining low-temporal-resolution images with high-framerate event streams can address this challenge. However, it is challenging to jointly optimize Dynamic 3DGS using both RGB and event modalities due to the significant discrepancy between these two data modalities. This paper introduces a novel framework that jointly optimizes dynamic 3DGS from the two modalities. The key idea is to adopt event motion priors to guide the optimization of the deformation fields. First, we extract the motion priors encoded in event streams by using the proposed LoCM unsupervised fine-tuning framework to adapt an event flow estimator to a certain unseen scene. Then, we present the geometry-aware data association method to build the event-Gaussian motion correspondence, which is the primary foundation of the pipeline, accompanied by two useful strategies, namely motion decomposition and inter-frame pseudo-label. Extensive experiments show that our method outperforms existing image and event-based approaches across synthetic and real scenes and prove that our method can effectively optimize dynamic 3DGS with the help of event data.
zh
[CV-89] UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes
【速读】:该论文旨在解决超动态范围(Ultra-high Dynamic Range, UHDR)场景下图像重建难题,即在夜间等光照差异极大的环境中,如何仅用单帧短曝光RAW图像实现高保真度的细节恢复,避免传统多曝光RGB融合方法易出现的鬼影(ghosting)和运动模糊问题。其解决方案的关键在于提出一种两阶段框架UltraLED:第一阶段通过比率图(ratio map)进行曝光校正以平衡动态范围,第二阶段采用亮度感知的RAW去噪模块增强暗部区域的信息恢复能力。该方法充分利用了RAW数据较高的位深度和更可预测的噪声特性,从而在不依赖多帧或多曝光输入的前提下,实现了对UHDR场景中亮部与暗部细节的同时有效重建。
链接: https://arxiv.org/abs/2510.07741
作者: Yuang Meng,Xin Jin,Lina Lei,Chun-Le Guo,Chongyi Li
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. In comparison to the RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at this https URL.
zh
[CV-90] ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
【速读】:该论文旨在解决高斯点阵(Gaussian Splatting, GS)在真实感三维物体-场景组合中的两大核心问题:一是GS辐射场中烘焙的外观和阴影信息导致物体与场景融合时出现不一致;二是现有基于高斯的逆渲染方法在光照估计和物体重光照重建方面效率低、适应性差。解决方案的关键在于提出两个创新模块:其一为表面八面体探针(Surface Octahedral Probes, SOPs),用于高效存储并插值光照与遮挡信息,避免昂贵的光线追踪,实现至少2倍的重建速度提升及实时阴影计算;其二为聚焦于物体放置位置的环境光照建模策略,通过捕获该位置360°重建的辐射场并微调扩散模型完成光照补全,从而简化复杂场景光照估计任务。基于此,作者构建了ComGS框架,在约28 FPS下实现高质量、视觉和谐且带生动阴影的实时渲染,编辑仅需36秒。
链接: https://arxiv.org/abs/2510.07729
作者: Jian Gao,Mengqi Yuan,Yifei Zeng,Chang Zeng,Zhihao Li,Zhenyu Chen,Weichao Qiu,Xiao-Xiao Long,Hao Zhu,Xun Cao,Yao Yao
机构: Nanjing University (南京大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object’s appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object’s placement. Specifically, we capture a 360 degrees reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at this https URL.
zh
[CV-91] SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction NIPS2025
【速读】:该论文旨在解决从单张图像中实现高保真度、全身体态的3D人像重建问题,该任务在影视和游戏领域具有重要意义,但因存在固有的歧义性和严重的自遮挡而极具挑战性。现有方法通常依赖SMPL估计与SMPL条件下的图像生成模型来推断新视角,然而其受限于SMPL网格估计的不准确性,且难以处理复杂人体姿态及精细细节重建。解决方案的关键在于提出SyncHuman框架,首次将2D多视角生成模型与3D原生生成模型相结合:前者擅长捕捉2D细节,后者提供结构一致的粗略3D形状。通过引入像素对齐的2D-3D同步注意力机制联合微调两个模型,实现几何对齐的3D形状与多视角图像生成;进一步设计特征注入机制,将2D多视角图像中的精细特征迁移至对齐的3D形状上,从而显著提升重建精度与视觉保真度。
链接: https://arxiv.org/abs/2510.07723
作者: Wenyue Chen,Peng Li,Wangguandong Zheng,Chengfeng Zhao,Mengfei Li,Yaolong Zhu,Zhiyang Dou,Ronggang Wang,Yuan Liu
机构: PKU (北京大学); HKUST (香港科技大学); SEU (东南大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NIPS 2025
Abstract:Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and have difficulty in handling difficult human poses and reconstructing fine details. In this paper, we propose SyncHuman, a novel framework that combines 2D multiview generative model and 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. Multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.
zh
[CV-92] RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning
【速读】:该论文旨在解决电商场景中产品图像修复(inpainting)问题,特别是去除水印、促销文字等侵入性元素时面临的挑战,如对象移除不可靠和领域特定适应能力有限。其解决方案的关键在于提出一种基于强化学习的框架Repainter,该框架通过空间遮罩轨迹精修(spatial-matting trajectory refinement)与组相对策略优化(Group Relative Policy Optimization, GRPO)相结合的方式,调节注意力机制以增强背景上下文建模,从而生成更高奖励的修复样本并减少无关对象的插入;同时引入复合奖励机制,平衡全局、局部及语义约束,有效降低视觉伪影和奖励欺骗(reward hacking)。
链接: https://arxiv.org/abs/2510.07721
作者: Zipeng Guo,Lichen Ma,Xiaolong Fu,Gaojing Zhou,Lan Yang,Yuchen Zhou,Linkai Liu,Yu He,Ximan Liu,Shiping Dong,Jingling Fu,Zhen Chen,Yu Shi,Junshi Huang,Jason Li,Chao Gou
机构: Sun Yat-sen University (中山大学); JD.COM (京东); Beijing University of Chemical Technology (北京化工大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In web data, product images are central to boosting user engagement and advertising efficacy on e-commerce platforms, yet the intrusive elements such as watermarks and promotional text remain major obstacles to delivering clear and appealing product visuals. Although diffusion-based inpainting methods have advanced, they still face challenges in commercial settings due to unreliable object removal and limited domain-specific adaptation. To tackle these challenges, we propose Repainter, a reinforcement learning framework that integrates spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO). Our approach modulates attention mechanisms to emphasize background context, generating higher-reward samples and reducing unwanted object insertion. We also introduce a composite reward mechanism that balances global, local, and semantic constraints, effectively reducing visual artifacts and reward hacking. Additionally, we contribute EcomPaint-100K, a high-quality, large-scale e-commerce inpainting dataset, and a standardized benchmark EcomPaint-Bench for fair evaluation. Extensive experiments demonstrate that Repainter significantly outperforms state-of-the-art methods, especially in challenging scenes with intricate compositions. We will release our code and weights upon acceptance.
zh
[CV-93] Mutual Learning for Hashing: Unlocking Strong Hash Functions from Weak Supervision
【速读】:该论文旨在解决中心型(center-based)哈希方法在建模全局数据分布时忽视局部相似性信息的问题,从而限制了其在大规模图像检索中的性能上限。解决方案的关键在于提出一种弱到强的相互学习框架(Mutual Learning for Hashing, MLH),其中包含一个强中心型分支和一个较弱的成对型(pairwise-based)分支;通过迭代式相互学习机制,使中心型分支能够吸收成对型分支所学到的局部相似性线索,同时引入基于“专家混合”(mixture-of-experts)思想的哈希专家混合模块,实现两分支间的高效交互与协同优化,显著提升整体哈希性能。
链接: https://arxiv.org/abs/2510.07703
作者: Xiaoxu Ma,Runhao Li,Zhenyu Weng
机构: South China University of Technology (华南理工大学); Georgia Institute of Technology (佐治亚理工学院); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep hashing has been widely adopted for large-scale image retrieval, with numerous strategies proposed to optimize hash function learning. Pairwise-based methods are effective in learning hash functions that preserve local similarity relationships, whereas center-based methods typically achieve superior performance by more effectively capturing global data distributions. However, the strength of center-based methods in modeling global structures often comes at the expense of underutilizing important local similarity information. To address this limitation, we propose Mutual Learning for Hashing (MLH), a novel weak-to-strong framework that enhances a center-based hashing branch by transferring knowledge from a weaker pairwise-based branch. MLH consists of two branches: a strong center-based branch and a weaker pairwise-based branch. Through an iterative mutual learning process, the center-based branch leverages local similarity cues learned by the pairwise-based branch. Furthermore, inspired by the mixture-of-experts paradigm, we introduce a novel mixture-of-hash-experts module that enables effective cross-branch interaction, further enhancing the performance of both branches. Extensive experiments demonstrate that MLH consistently outperforms state-of-the-art hashing methods across multiple benchmark datasets.
zh
[CV-94] Hybrid CNN-BYOL Approach for Fault Detection in Induction Motors Using Thermal Images
【速读】:该论文旨在解决感应电机(Induction Motors, IMs)在工业和日常应用中因故障导致过热、能耗增加及服务失效的问题,提出了一种基于自监督学习与卷积神经网络(CNN)融合的热成像故障分类方法。其解决方案的关键在于引入对比学习框架BYOL(Bootstrap Your Own Latent),结合多种主流深度学习模型进行预训练,并设计出一种轻量且高性能的专用CNN结构——BYOL-IMNet,该模型包含四个针对热图像故障识别定制的模块,在保证高精度(测试准确率达99.89%)的同时实现低推理延迟(每张图像仅5.7毫秒),显著优于现有最优模型,为工业场景下的在线监测提供了可靠的技术路径。
链接: https://arxiv.org/abs/2510.07692
作者: Tangin Amir Smrity,MD Zahin Muntaqim Hasan Muhammad Kafi,Abu Saleh Musa Miah,Najmul Hassan,Yuichi Okuyama,Nobuyoshi Asai,Taro Suzuki,Jungpil Shin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Induction motors (IMs) are indispensable in industrial and daily life, but they are susceptible to various faults that can lead to overheating, wasted energy consumption, and service failure. Early detection of faults is essential to protect the motor and prolong its lifespan. This paper presents a hybrid method that integrates BYOL with CNNs for classifying thermal images of induction motors for fault detection. The thermal dataset used in this work includes different operating states of the motor, such as normal operation, overload, and faults. We employed multiple deep learning (DL) models for the BYOL technique, ranging from popular architectures such as ResNet-50, DenseNet-121, DenseNet-169, EfficientNetB0, VGG16, and MobileNetV2. Additionally, we introduced a new high-performance yet lightweight CNN model named BYOL-IMNet, which comprises four custom-designed blocks tailored for fault classification in thermal images. Our experimental results demonstrate that the proposed BYOL-IMNet achieves 99.89% test accuracy and an inference time of 5.7 ms per image, outperforming state-of-the-art models. This study highlights the promising performance of the CNN-BYOL hybrid method in enhancing accuracy for detecting faults in induction motors, offering a robust methodology for online monitoring in industrial settings.
zh
[CV-95] Controllable Video Synthesis via Variational Inference
【速读】:该论文旨在解决现有视频生成模型在控制粒度上灵活性不足的问题,即如何在统一框架中同时支持从精确的4D物体轨迹和相机路径到粗粒度文本提示等多种用户控制输入。其解决方案的关键在于将视频合成任务建模为变分推断问题,通过引入多个视频生成骨干模型(backbones)联合处理所有约束条件,并采用分步KL散度最小化策略与一种上下文条件因子分解技术,有效降低解空间中的局部最优模式,从而实现对指定元素的高可控性、未指定部分的多样性以及更好的3D一致性。
链接: https://arxiv.org/abs/2510.07670
作者: Haoyi Duan,Yunzhi Zhang,Yilun Du,Jiajun Wu
机构: Stanford University (斯坦福大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.
zh
[CV-96] CIP: Threshold-Controlled Iterative Pyramid Network for Deformable Medical Image Registration
【速读】:该论文旨在解决金字塔网络在可变形医学图像配准中因解码器结构导致的解剖结构错位传播与累积问题,以及现有模型无法根据图像间不同形变需求自适应调整迭代次数(导致过早终止或过度迭代从而降低配准精度)的问题。解决方案的关键在于提出两个核心组件:一是特征增强残差模块(Feature-Enhanced Residual Module, FERM),嵌入于金字塔网络每一层解码器中,通过三个串联模块分别提取解剖语义特征、抑制无关特征并估计最终形变场,有效缓解错位传播;二是双阶段阈值控制迭代策略(Threshold-Controlled Iterative, TCI),第一阶段评估配准稳定性,若稳定则进入第二阶段判断收敛性,从而实现自适应迭代次数控制。二者结合构成阈值控制迭代金字塔网络(Threshold-Controlled Iterative Pyramid, TCIP),在多个公共脑部MRI和腹部CT数据集上验证了其在精度上的优越性与模型紧凑性。
链接: https://arxiv.org/abs/2510.07666
作者: Heming Wu,Di Wang,Tai Ma,Peng Zhao,Yubin Xiao,Zhongke Wu,Xing-Ce Wang,Chuang Li,Xuan Wu,You Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Although pyramid networks have demonstrated superior performance in deformable medical image registration, their decoder architectures are inherently prone to propagating and accumulating anatomical structure misalignments. Moreover, most existing models do not adaptively determine the number of iterations for optimization under varying deformation requirements across images, resulting in either premature termination or excessive iterations that degrades registration accuracy. To effectively mitigate the accumulation of anatomical misalignments, we propose the Feature-Enhanced Residual Module (FERM) as the core component of each decoding layer in the pyramid network. FERM comprises three sequential blocks that extract anatomical semantic features, learn to suppress irrelevant features, and estimate the final deformation field, respectively. To adaptively determine the number of iterations for varying images, we propose the dual-stage Threshold-Controlled Iterative (TCI) strategy. In the first stage, TCI assesses registration stability and with asserted stability, it continues with the second stage to evaluate convergence. We coin the model that integrates FERM and TCI as Threshold-Controlled Iterative Pyramid (TCIP). Extensive experiments on three public brain MRI datasets and one abdomen CT dataset demonstrate that TCIP outperforms the state-of-the-art (SOTA) registration networks in terms of accuracy, while maintaining comparable inference speed and a compact model parameter size. Finally, we assess the generalizability of FERM and TCI by integrating them with existing registration networks and further conduct ablation studies to validate the effectiveness of these two proposed methods.
zh
[CV-97] Automatic Text Box Placement for Supporting Typographic Design
【速读】:该论文旨在解决广告与网页布局设计中自动化文本框放置的问题,核心挑战在于如何在视觉吸引力与信息传达效率之间取得平衡。其解决方案的关键在于对比不同模型架构在不完整布局中的表现,发现基于标准Transformer的模型相较于视觉语言模型(VLM)更有效,尤其是在引入丰富外观信息(appearance information)时;同时指出任务特定架构设计对提升自动化布局效果具有重要意义。
链接: https://arxiv.org/abs/2510.07665
作者: Jun Muraoka,Daichi Haraguchi,Naoto Inoue,Wataru Shimoda,Kota Yamaguchi,Seiichi Uchida
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In layout design for advertisements and web pages, balancing visual appeal and communication efficiency is crucial. This study examines automated text box placement in incomplete layouts, comparing a standard Transformer-based method, a small Vision and Language Model (Phi3.5-vision), a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images. Evaluations on the Crello dataset show the standard Transformer-based models generally outperform VLM-based approaches, particularly when incorporating richer appearance information. However, all methods face challenges with very small text or densely populated layouts. These findings highlight the benefits of task-specific architectures and suggest avenues for further improvement in automated layout design.
zh
[CV-98] MONKEY: Masking ON KEY-Value Activation Adapter for Personalization
【速读】:该论文旨在解决扩散模型在个性化生成过程中存在的文本提示(text prompt)与源图像内容对齐不足的问题,即模型倾向于仅复制源图像中的主体而忽略文本描述的细节。其解决方案的关键在于利用IP-Adapter在推理阶段自动生成的掩码(mask),在第二轮处理中将该掩码应用于图像token,从而限制生成过程仅关注主体区域,使文本提示能够有效作用于背景区域,实现主体与文本描述的精确匹配。
链接: https://arxiv.org/abs/2510.07656
作者: James Baker
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often suffer somewhat when they end up just recreating the subject image, and ignoring the text prompt. We observe that one popular method for personalization, the IP-Adapter automatically generates masks that we definitively segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test time personalization methods, and find our method displays high prompt and source image alignment.
zh
[CV-99] Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection
【速读】:该论文旨在解决基于扩散Transformer(Diffusion Transformer)的视频虚拟试衣(Video Virtual Try-On)模型在现有双分支架构下存在的两个核心问题:一是引入服装参考分支的潜在空间特征需修改或扩展骨干网络,导致可训练参数显著增加;二是服装潜在特征缺乏固有的时序特性,需额外学习以适配视频生成任务。解决方案的关键在于提出一种名为OIE(Once is Enough)的新策略,即仅在首帧进行服装替换,随后利用首帧编辑后的图像内容作为控制信号,结合姿态(pose)和掩码(mask)信息引导视频生成模型的时序先验,从而实现后续帧的顺序合成。该方法在保持领先性能的同时,显著提升了参数效率与计算效率。
链接: https://arxiv.org/abs/2510.07654
作者: Yanjie Pan,Qingdong He,Lidong Wang,Bo Peng,Mingmin Chi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages (including references), 4 figures. Code and models will be released upon publication
Abstract:Video virtual try-on aims to replace the clothing of a person in a video with a target garment. Current dual-branch architectures have achieved significant success in diffusion models based on the U-Net; however, adapting them to diffusion models built upon the Diffusion Transformer remains challenging. Initially, introducing latent space features from the garment reference branch requires adding or modifying the backbone network, leading to a large number of trainable parameters. Subsequently, the latent space features of garments lack inherent temporal characteristics and thus require additional learning. To address these challenges, we propose a novel approach, OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement: specifically, we employ an image-based clothing transfer model to replace the clothing in the initial frame, and then, under the content control of the edited first frame, utilize pose and mask information to guide the temporal prior of the video generation model in synthesizing the remaining frames sequentially. Experiments show that our method achieves superior parameter efficiency and computational efficiency while still maintaining leading performance under these constraints.
zh
[CV-100] Dual-Stream Alignment for Action Segmentation
【速读】:该论文旨在解决连续视频流中动作分割(Action Segmentation)的难题,即准确识别动作发生的时间和空间位置。现有方法多采用单流模型建模帧序列的时空特征,而本文提出双流对齐网络(Dual-Stream Alignment Network, DSA Net),通过引入第二路学习到的动作特征流来引导分割任务,从而捕捉动作本身及动作转换的线索。其核心创新在于:1)设计Temporal Context(TC)模块,利用交叉注意力与基于量子机制的动作引导调制(Quantum-based Action-Guided Modulation, Q-ActGM)实现两流间的信息融合;2)提出双流对齐损失(Dual-Stream Alignment Loss),包含关系一致性、跨层级对比和循环一致性重建三项子损失,促使帧级与动作级特征在共享空间中对齐。该研究首次将量子-经典混合机器学习框架应用于动作分割任务,并在多个基准数据集上实现了当前最优性能。
链接: https://arxiv.org/abs/2510.07652
作者: Harshala Gammulle,Clinton Fookes,Sridha Sridharan,Simon Denman
机构: Queensland University of Technology (昆士兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal Submission
Abstract:Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio- temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross- attention and Quantum-based Action-Guided Modulation (Q- ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProcel. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing
zh
[CV-101] PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment ICIP2025
【速读】:该论文旨在解决无参考点云质量评估(No-Reference Point Cloud Quality Assessment, NR-PCQA)问题,即在缺乏原始参考点云的情况下自动预测点云的感知质量。解决方案的关键在于提出一种名为PIT-QMM的新型大规模多模态模型(Large Multimodal Model, LMM),该模型能够端到端地融合文本描述、二维投影图像和三维点云视图三种模态信息,从而更全面地捕捉点云质量特征。实验表明,该方法在多个主流基准上显著优于现有最优模型,且具备失真定位与识别能力,提升了模型的可解释性与交互性。
链接: https://arxiv.org/abs/2510.07636
作者: Shashank Gupta,Gregoire Phillips,Alan C. Bovik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Oral presentation at ICIP 2025
Abstract:Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in absence of a reference. We begin with the observation that different modalities of data - text descriptions, 2D projections, and 3D point cloud views - provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at this https URL.
zh
[CV-102] Rectified-CFG for Flow Based Models NEURIPS2025
【速读】:该论文旨在解决Classifier-free guidance (CFG) 在基于修正流(Rectified Flow, RF)的生成模型中应用时引发的严重离流形漂移(off-manifold drift)问题,表现为视觉伪影、文本对齐错误及鲁棒性差等缺陷。其解决方案的关键在于提出一种自适应预测-校正引导机制(Rectified-CFG++),该机制将修正流的确定性高效性与几何感知的条件约束规则相结合:在每一步推理中,先执行条件修正流更新以锚定样本于学习到的传输路径附近,再通过加权条件校正项插值于条件与无条件速度场之间;理论证明该方法产生的速度场边际一致,且轨迹始终位于数据流形的有界管状邻域内,从而在宽范围引导强度下保持稳定性和生成质量。
链接: https://arxiv.org/abs/2510.07631
作者: Shreshth Saini,Shashank Gupta,Alan C. Bovik
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2025
Abstract:Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: this https URL
zh
[CV-103] Quick-CapsNet (QCN): A fast alternative to Capsule Networks
【速读】:该论文旨在解决胶囊网络(Capsule Network, CapsNet)在训练和测试过程中速度缓慢的问题,这一缺陷限制了其在需要快速响应的实时应用场景中的部署。解决方案的关键在于提出一种名为 Quick-CapsNet (QCN) 的改进架构,其核心思想是通过减少胶囊数量来降低计算复杂度,从而显著提升推理速度;实验表明,在MNIST、F-MNIST、SVHN和Cifar-10数据集上,QCN的推理速度提升达5倍,同时仅带来可接受的精度损失。此外,作者进一步优化QCN,采用更强大的解码器替代默认解码器,以在保持高速度的同时进一步提升性能。
链接: https://arxiv.org/abs/2510.07600
作者: Pouya Shiri,Ramin Sharifi,Amirali Baniasadi
机构: University of Victoria (维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The basic computational unit in Capsule Network (CapsNet) is a capsule (vs. neurons in Convolutional Neural Networks (CNNs)). A capsule is a set of neurons, which form a vector. CapsNet is used for supervised classification of data and has achieved state-of-the-art accuracy on MNIST digit recognition dataset, outperforming conventional CNNs in detecting overlapping digits. Moreover, CapsNet shows higher robustness towards affine transformation when compared to CNNs for MNIST datasets. One of the drawbacks of CapsNet, however, is slow training and testing. This can be a bottleneck for applications that require a fast network, especially during inference. In this work, we introduce Quick-CapsNet (QCN) as a fast alternative to CapsNet, which can be a starting point to develop CapsNet for fast real-time applications. QCN builds on producing a fewer number of capsules, which results in a faster network. QCN achieves this at the cost of marginal loss in accuracy. Inference is 5x faster on MNIST, F-MNIST, SVHN and Cifar-10 datasets. We also further enhanced QCN by employing a more powerful decoder instead of the default decoder to further improve QCN.
zh
[CV-104] MaizeStandCounting (MaSC): Automated and Accurate Maize Stand Counting from UAV Imagery Using Image Processing and Deep Learning
【速读】:该论文旨在解决玉米苗期田间株数统计的自动化难题,传统人工计数方法存在效率低、耗时长且易出错的问题,尤其在大面积或地形复杂的农田中更为显著。其解决方案的关键在于提出一种名为MaizeStandCounting (MaSC) 的鲁棒算法,利用低成本无人机(UAV)采集的RGB影像,在廉价硬件上实现高效处理;该算法采用轻量级YOLOv9模型对V2–V10生长阶段的玉米幼苗进行检测,并结合空间分布特征完成行与区段分割,从而实现精准的逐行株数统计,同时能有效区分玉米与杂草等其他植被。实验表明,MaSC在两种模式下均表现出高精度(Mosaic模式R²=0.616,Raw Frame模式R²=0.906),且单帧处理速度达83帧/60.63秒,具备实时应用潜力。
链接: https://arxiv.org/abs/2510.07580
作者: Dewi Endah Kharismawati,Toni Kazic
机构: University of Missouri (密苏里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 11 figures. Submitted to IEEE Journal of Selected Topics in Signal Processing (JSTSP) Special Series on Artificial Intelligence for Smart Agriculture
Abstract:Accurate maize stand counts are essential for crop management and research, informing yield prediction, planting density optimization, and early detection of germination issues. Manual counting is labor-intensive, slow, and error-prone, especially across large or variable fields. We present MaizeStandCounting (MaSC), a robust algorithm for automated maize seedling stand counting from RGB imagery captured by low-cost UAVs and processed on affordable hardware. MaSC operates in two modes: (1) mosaic images divided into patches, and (2) raw video frames aligned using homography matrices. Both modes use a lightweight YOLOv9 model trained to detect maize seedlings from V2-V10 growth stages. MaSC distinguishes maize from weeds and other vegetation, then performs row and range segmentation based on the spatial distribution of detections to produce precise row-wise stand counts. Evaluation against in-field manual counts from our 2024 summer nursery showed strong agreement with ground truth (R^2= 0.616 for mosaics, R^2 = 0.906 for raw frames). MaSC processed 83 full-resolution frames in 60.63 s, including inference and post-processing, highlighting its potential for real-time operation. These results demonstrate MaSC’s effectiveness as a scalable, low-cost, and accurate tool for automated maize stand counting in both research and production environments.
zh
[CV-105] Cross-Modal Attention Guided Unlearning in Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在推理过程中可能泄露训练数据中私有或敏感信息的问题,尤其是在视觉问答(Visual Question Answering, VQA)任务中,文本和图像双重模态均可能包含敏感内容。解决方案的关键在于提出一种轻量级、高效的模型遗忘框架——交叉模态注意力引导遗忘(Cross-Modal Attention Guided Unlearning, CAGUL),其核心思想是利用跨模态注意力机制识别对输出生成贡献较小的视觉token,并通过外部模块对其编码以实现遗忘信息的注入,从而在不修改预训练模型参数且无需重新训练的前提下,有效防止敏感信息泄露并保持原模型的行为一致性。
链接: https://arxiv.org/abs/2510.07567
作者: Karuna Bhaila,Aneesh Komanduri,Minh-Hao Van,Xintao Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it in inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.
zh
[CV-106] Label Semantics for Robust Hyperspectral Image Classification IJCNN2025
【速读】:该论文旨在解决高光谱成像(Hyperspectral Imaging, HSI)分类中因训练样本稀缺和光谱数据高维性导致的过拟合问题,以及现有模型多为单模态(仅依赖光谱-空间数据)在决策边界学习上的局限性。解决方案的关键在于提出一种通用的语义光谱-空间融合网络(Semantic Spectral-Spatial Fusion Network, S3FN),通过引入类特定的文本描述来增强模型训练:利用大语言模型(Large Language Models, LLMs)生成每类标签的上下文语义描述,并借助预训练文本编码器(如BERT或RoBERTa)将这些描述嵌入向量空间,从而提取有意义的标签语义信息,实现特征与标签之间的更好对齐,提升分类性能。
链接: https://arxiv.org/abs/2510.07556
作者: Rafin Hassan,Zarin Tasnim Roshni,Rafiqul Bari,Alimul Islam,Nabeel Mohammed,Moshiur Farazi,Shafin Rahman
机构: North South University (南亚大学); University of Doha for Science and Technology (多哈科学与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work has been accepted for publication in the proceedings of IJCNN 2025
Abstract:Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most of HSI classification models are monomodal, where it solely relies on spectral-spatial data to learn decision boundaries in the high dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that captures their unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics which in turn leads to a better feature-label alignment for improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets - Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit and report significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Codes are be available in: this https URL
zh
[CV-107] RAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
【速读】:该论文旨在解决当前视频生成模型(Video Generative Models)在生成视频时常常违反物理规律(如物体悬浮、瞬移或非因果变形)的问题,而现有方法缺乏对视频物理真实性的量化评估手段。其核心解决方案是提出两个关键组件:一是TRAVL,一种通过平衡数据集与轨迹感知注意力模块(trajectory-aware attention module)进行微调的训练策略,以增强视觉语言模型(VLMs)对运动信息的编码与判别能力;二是ImplausiBench,一个包含300个视频(150个真实、150个生成)的基准测试集,通过消除语言偏倚并聚焦于视觉-时间理解来更严格地评估物理合理性。二者共同构建了一个统一框架,用于探测和提升多模态模型在物理合理性方面的表现。
链接: https://arxiv.org/abs/2510.07550
作者: Saman Motamed,Minghao Chen,Luc Van Gool,Iro Laina
机构: INSAIT( INSAIT); Sofia University “St. Kliment Ohridski”(索菲亚大学“圣克莱门特·奥赫里德斯基”); Visual Geometry Group(视觉几何组); University of Oxford(牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
zh
[CV-108] PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
【速读】:该论文旨在解决视频风格迁移(video style transfer)中缺乏成对视频数据进行监督的问题,目标是在保持输入视频内容不变的前提下,将其渲染为目标风格(由文本提示指定)。其解决方案的关键在于提出PickStyle框架,该框架通过在预训练视频扩散模型的条件模块自注意力层中插入低秩适配器(low-rank adapters),实现对运动-风格迁移任务的高效专业化,同时保持内容与风格之间的强对齐。此外,为弥合静态图像监督与动态视频之间的差距,作者利用成对图像构建合成训练片段,并施加共享增强以模拟相机运动,从而保留时间先验;并引入Context-Style Classifier-Free Guidance (CS-CFG),将无分类器引导(classifier-free guidance)分解为独立的文本(风格)和视频(上下文)方向,确保生成视频在风格迁移过程中有效保留原始内容信息。
链接: https://arxiv.org/abs/2510.07546
作者: Soroush Mehraban,Vida Adeli,Jacob Rommann,Babak Taati,Kyryl Truskovskyi
机构: Pickford AI; University of Toronto; Vector Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
zh
[CV-109] D2RA: Dual Domain Regeneration Attack
【速读】:该论文旨在解决生成式 AI(Generative AI)内容中水印(watermarking)技术的鲁棒性问题,即如何在资源受限的对抗环境下有效移除或削弱水印信号而不影响图像视觉质量。解决方案的关键在于提出一种无需训练、仅需单张图像的攻击方法 D2RA,其通过将含水印图像投影到互补表示空间中的自然先验(natural priors),实现对水印信号的有效抑制,同时保持图像的视觉保真度,从而揭示当前水印设计在对抗性场景下的根本性弱点。
链接: https://arxiv.org/abs/2510.07538
作者: Pragati Shuddhodhan Meshram,Varun Chandrasekaran
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at this https URL.
zh
[CV-110] MLLM 4TS: Leverag ing Vision and Multimodal Language Models for General Time-Series Analysis
【速读】:该论文旨在解决多变量时间序列数据中复杂的时间依赖性和跨通道交互关系带来的分析挑战,尤其是在传统方法难以有效捕捉全局上下文信息和细粒度时序特征的情况下。其解决方案的关键在于提出一种名为MLLM4TS的新框架,通过引入专用的视觉分支将每个时间序列通道渲染为彩色线图组成的复合图像,并采用时间感知的视觉补丁对齐策略,使视觉补丁与对应的时间段精确匹配;该设计实现了数值数据的细粒度时序细节与视觉表示中的全局语义信息的融合,从而在预训练语言模型的基础上构建统一的多模态时间序列分析基础,显著提升了预测任务(如分类)和生成任务(如异常检测与预测)的性能。
链接: https://arxiv.org/abs/2510.07513
作者: Qinghua Liu,Sam Heshmati,Zheda Mai,Zubin Abraham,John Paparrizos,Liu Ren
机构: The Ohio State University (俄亥俄州立大学); Bosch Research North America (博世北美研究中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
Abstract:Effective analysis of time series data presents significant challenges due to the complex temporal dependencies and cross-channel interactions in multivariate data. Inspired by the way human analysts visually inspect time series to uncover hidden patterns, we ask: can incorporating visual representations enhance automated time-series analysis? Recent advances in multimodal large language models have demonstrated impressive generalization and visual understanding capability, yet their application to time series remains constrained by the modality gap between continuous numerical data and discrete natural language. To bridge this gap, we introduce MLLM4TS, a novel framework that leverages multimodal large language models for general time-series analysis by integrating a dedicated vision branch. Each time-series channel is rendered as a horizontally stacked color-coded line plot in one composite image to capture spatial dependencies across channels, and a temporal-aware visual patch alignment strategy then aligns visual patches with their corresponding time segments. MLLM4TS fuses fine-grained temporal details from the numerical data with global contextual information derived from the visual representation, providing a unified foundation for multimodal time-series analysis. Extensive experiments on standard benchmarks demonstrate the effectiveness of MLLM4TS across both predictive tasks (e.g., classification) and generative tasks (e.g., anomaly detection and forecasting). These results underscore the potential of integrating visual modalities with pretrained language models to achieve robust and generalizable time-series analysis.
zh
[CV-111] A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy
【速读】:该论文旨在解决超低剂量CT(ultra-low dose CT, uLDCT)图像中因噪声严重和空间错位导致的去噪难题,尤其针对现有去噪网络在真实临床数据上性能下降的问题。其解决方案的关键在于提出了一种基于图像净化(Image Purification, IP)策略的新框架:首先构建了真实的临床uLDCT肺部数据集,进而通过IP策略生成结构对齐的uLDCT-正常剂量CT(normal dose CT, NDCT)图像对,从而为网络训练提供高质量的数据基础;在此基础上,进一步设计了频域流匹配(Frequency-domain Flow Matching, FFM)模型,与IP策略协同工作,在保留解剖结构完整性方面实现显著提升,最终在真实临床数据上达到当前最优(state-of-the-art, SOTA)的去噪效果。
链接: https://arxiv.org/abs/2510.07492
作者: Guoliang Gong,Man Yu
机构: Tianjin University of Science and Technology (天津科技大学); Tianjin Hospital (天津市医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at this https URL.
zh
[CV-112] Provably Accelerated Imaging with Restarted Inertia and Score-based Image Priors
【速读】:该论文旨在解决成像逆问题中算法收敛速度慢与重建质量难以兼顾的问题。现有方法如基于去噪的正则化(Regularization by Denoising, RED)通常通过设计复杂的图像先验来提升重建质量,但对收敛加速往往依赖启发式策略。其解决方案的关键在于提出一种具有基于得分(score-based)先验的重启惯性算法(Restarted Inertia with Score-based Priors, RISP),该方法在保持高重建质量的同时,引入重启惯性机制以实现更快的收敛速率。理论证明表明,RISP 在不假设图像先验凸性的前提下,相较于RED具有更优的平稳点收敛率,并通过连续时间动力系统分析揭示了其与带阻尼的Heavy-ball常微分方程(ODE)之间的内在联系。
链接: https://arxiv.org/abs/2510.07470
作者: Marien Renaud,Julien Hermant,Deliang Wei,Yu Sun
机构: Univ. Bordeaux, CNRS, INRIA, Bordeaux INP, IMB, UMR 5251 (法国波尔多大学, 国家科学研究中心, 法国国家信息与自动化研究所, 波尔多综合理工学院, 数学研究所, 数学联合实验室); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 62 pages
Abstract:Fast convergence and high-quality image recovery are two essential features of algorithms for solving ill-posed imaging inverse problems. Existing methods, such as regularization by denoising (RED), often focus on designing sophisticated image priors to improve reconstruction quality, while leaving convergence acceleration to heuristics. To bridge the gap, we propose Restarted Inertia with Score-based Priors (RISP) as a principled extension of RED. RISP incorporates a restarting inertia for fast convergence, while still allowing score-based image priors for high-quality reconstruction. We prove that RISP attains a faster stationary-point convergence rate than RED, without requiring the convexity of the image prior. We further derive and analyze the associated continuous-time dynamical system, offering insight into the connection between RISP and the heavy-ball ordinary differential equation (ODE). Experiments across a range of imaging inverse problems demonstrate that RISP enables fast convergence while achieving high-quality reconstructions.
zh
[CV-113] DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis
【速读】:该论文旨在解决现有文本到视频(Text-to-Video, T2V)生成模型评估基准在动态相机运动场景下存在的两个关键问题:一是当前基准如VBench和EvalCrafter主要聚焦于主体导向提示或静态摄像机场景,缺乏对电影级镜头中相机运动的系统性评估;二是这些基准通常将视频级评分聚合为单一模型级分数进行排名,忽略了视频级评价的重要性,而后者对于从同一提示生成的多个候选视频中选出更优结果至关重要。解决方案的关键在于提出DynamicEval基准,其核心创新包括:(1) 构建以动态相机运动为核心的系统化提示集,并基于45k人类标注的视频对数据评估背景场景一致性,通过引入基于物体误差图的改进型背景一致性指标,修正了原Vbench运动平滑度指标在相机与前景物体移动导致遮挡/非遮挡情况下的失效问题;(2) 提出前景对象一致性指标,通过跟踪每个物体实例内的点及其邻域来量化对象保真度,从而实现对视频质量的多维度、精细化评估。实验表明,所提方法在视频级和模型级均显著提升与人类偏好的相关性(提升超过2个百分点),为T2V模型在动态摄像机条件下的评估提供了更全面的基准。
链接: https://arxiv.org/abs/2510.07441
作者: Nithin C. Babu,Aniruddha Mahapatra,Harsh Rangwani,Rajiv Soundararajan,Kuldeep Kulkarni
机构: Indian Institute of Science (印度科学研究所); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review. 26 pages, 11 figures, 11 tables. Access the project page in this https URL
Abstract:Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) While the emphasis is on subject-centric prompts or static camera scenes, camera motion essential for producing cinematic shots and existing metrics under dynamic motion are largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlook video-level evaluation, which is vital to selecting the better video among the candidate videos generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain the interpretable error maps based on the Vbench motion smoothness metric. We observe that while the Vbench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct two failure cases in a principled manner. Our second innovation is the introduction of a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2% points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.
zh
[CV-114] Enhancing Maritime Object Detection in Real-Time with RT-DETR and Data Augmentation
【速读】:该论文旨在解决海上目标检测中因目标尺寸小和真实RGB标注数据有限所带来的挑战。解决方案的关键在于基于RT-DETR架构,结合多尺度特征融合、不确定性最小化的查询选择机制以及合成与真实训练样本的智能权重分配策略,从而提升对小尺寸、低对比度船只的检测性能,并缩小合成数据与真实数据之间的域差距。该设计在保持DETR端到端集合预测优势的同时,支持推理时动态调整速度与精度平衡,且通过数据增强技术优化类别分布,增强了模型在极端光照或海况下的鲁棒性。
链接: https://arxiv.org/abs/2510.07346
作者: Nader Nemati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 10 figures
Abstract:Maritime object detection faces essential challenges due to the small target size and limitations of labeled real RGB data. This paper will present a real-time object detection system based on RT-DETR, enhanced by employing augmented synthetic images while strictly evaluating on real data. This study employs RT-DETR for the maritime environment by combining multi-scale feature fusion, uncertainty-minimizing query selection, and smart weight between synthetic and real training samples. The fusion module in DETR enhances the detection of small, low-contrast vessels, query selection focuses on the most reliable proposals, and the weighting strategy helps reduce the visual gap between synthetic and real domains. This design preserves DETR’s refined end-to-end set prediction while allowing users to adjust between speed and accuracy at inference time. Data augmentation techniques were also used to balance the different classes of the dataset to improve the robustness and accuracy of the model. Regarding this study, a full Python robust maritime detection pipeline is delivered that maintains real-time performance even under practical limits. It also verifies how each module contributes, and how the system handles failures in extreme lighting or sea conditions. This study also includes a component analysis to quantify the contribution of each architectural module and explore its interactions.
zh
[CV-115] MultiFair: Multimodal Balanced Fairness-Aware Medical Classification with Dual-Level Gradient Modulation
【速读】:该论文旨在解决多模态医学分类中因数据模态学习不均衡和群体偏见导致的模型偏差问题,即不同模态可能收敛到对某些模态过度依赖的模型,同时模型可能在特定人口统计学群体上表现不公平。解决方案的关键在于提出一种名为MultiFair的新方法,其核心是通过双层次梯度调制机制(dual-level gradient modulation process),动态调节训练过程中梯度的方向与幅度,分别在数据模态层面和群体层面进行优化控制,从而实现更公平且平衡的多模态学习。
链接: https://arxiv.org/abs/2510.07328
作者: Md Zubair,Hao Zheng,Nussdorf Jonathan,Grayson W. Armstrong,Lucy Q. Shen,Gabriela Wilson,Yu Tian,Xingquan Zhu,Min Shi
机构: University of Louisiana at Lafayette (路易斯安那大学拉斐特分校); Ochsner Health (奥克斯纳健康中心); Massachusetts Eye and Ear, Harvard Medical School (麻省眼耳医院,哈佛医学院); University of Central Florida (中佛罗里达大学); Florida Atlantic University (佛罗里达大西洋大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 10 Pages
Abstract:Medical decision systems increasingly rely on data from multiple sources to ensure reliable and unbiased diagnosis. However, existing multimodal learning models fail to achieve this goal because they often ignore two critical challenges. First, various data modalities may learn unevenly, thereby converging to a model biased towards certain modalities. Second, the model may emphasize learning on certain demographic groups causing unfair performances. The two aspects can influence each other, as different data modalities may favor respective groups during optimization, leading to both imbalanced and unfair multimodal learning. This paper proposes a novel approach called MultiFair for multimodal medical classification, which addresses these challenges with a dual-level gradient modulation process. MultiFair dynamically modulates training gradients regarding the optimization direction and magnitude at both data modality and group levels. We conduct extensive experiments on two multimodal medical datasets with different demographic groups. The results show that MultiFair outperforms state-of-the-art multimodal learning and fairness learning methods.
zh
[CV-116] Deep Learning Based Approach to Enhanced Recognition of Emotions and Behavioral Patterns of Autistic Children
【速读】:该论文试图解决的问题是:在自闭症谱系障碍(Autism Spectrum Disorder, ASD)儿童早期发展阶段,缺乏对细微行为模式和情绪识别的深入理解与系统性干预,导致教育策略难以精准匹配其独特需求,尤其是在信息技术(Information Technology)领域机会有限的背景下。解决方案的关键在于采用纵向追踪方法,识别并绘制儿童的情绪与行为演变轨迹,建立以个体行为和情感特征为基础的基准认知,进而设计出有针对性的应用程序和技术辅助工具,从而构建一个循序渐进、基于证据的干预框架,推动学习能力与软技能的有效发展。
链接: https://arxiv.org/abs/2510.07320
作者: Nelaka K.A.R,Peiris M.K.V,Liyanage R.P.B
机构: Sri Lanka Institute of Information Technology (斯里兰卡信息技术研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:Autism Spectrum Disorder significantly influences the communication abilities, learning processes, behavior, and social interactions of individuals. Although early intervention and customized educational strategies are critical to improving outcomes, there is a pivotal gap in understanding and addressing nuanced behavioral patterns and emotional identification in autistic children prior to skill development. This extended research delves into the foundational step of recognizing and mapping these patterns as a prerequisite to improving learning and soft skills. Using a longitudinal approach to monitor emotions and behaviors, this study aims to establish a baseline understanding of the unique needs and challenges faced by autistic students, particularly in the Information Technology domain, where opportunities are markedly limited. Through a detailed analysis of behavioral trends over time, we propose a targeted framework for developing applications and technical aids designed to meet these identified needs. Our research underscores the importance of a sequential and evidence-based intervention approach that prioritizes a deep understanding of each child’s behavioral and emotional landscape as the basis for effective skill development. By shifting the focus toward early identification of behavioral patterns, we aim to foster a more inclusive and supportive learning environment that can significantly improve the educational and developmental trajectory of children with ASD.
zh
[CV-117] UniFField: A Generalizable Unified Neural Feature Field for Visual Semantic and Spatial Uncertainties in Any Scene
【速读】:该论文旨在解决当前3D神经特征场(neural feature field)在机器人任务中面临的两大局限性:一是模型通常仅适用于特定场景,缺乏泛化能力;二是无法对预测结果进行不确定性建模,从而影响决策可靠性。解决方案的关键在于提出UniFField——一种统一的、具备不确定性感知能力的神经特征场,它将视觉、语义和几何特征融合为单一可泛化的表示,并能同时预测各模态的不确定性。该方法支持零样本迁移至新环境,通过增量式整合RGB-D图像更新特征表示及不确定性估计,在场景重建与语义特征预测中验证了不确定性估计的准确性,并进一步应用于移动操作机器人主动目标搜索任务,实现了基于不确定性的鲁棒决策。
链接: https://arxiv.org/abs/2510.06754
作者: Christian Maurer,Snehal Jauhri,Sophie Lueth,Georgia Chalvatzaki
机构: TU Darmstadt (达姆施塔特工业大学); Hessian.AI; Robotics Institute Germany (德国机器人研究所); Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学); German Research Foundation (德国研究基金会)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project website: this https URL
Abstract:Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, it is necessary for the robot to evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimation. We evaluate our uncertainty estimations to accurately describe the model prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage our feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.
zh
[CV-118] DUA-D2C: Dynamic Uncertainty Aware Method for Overfitting Remediation in Deep Learning
【速读】:该论文旨在解决深度学习中的过拟合(overfitting)问题,尤其针对数据异常值、噪声以及训练数据有限等因素导致模型泛化能力下降的挑战。其解决方案的关键在于提出一种动态不确定性感知的分治聚合方法(Dynamic Uncertainty-Aware Divide2Conquer, DUA-D2C),该方法在传统分治训练(Divide2Conquer, D2C)基础上改进了模型聚合策略:不再对各子集训练出的模型进行等权或固定规则加权,而是基于它们在共享验证集上的表现,结合准确率与预测不确定性进行动态加权,从而优先利用具有更强泛化能力和更高置信度的子模型,显著提升整体模型的鲁棒性与泛化性能。
链接: https://arxiv.org/abs/2411.15876
作者: Md. Saiful Bari Siddiqui,Md Mohaiminul Islam,Md. Golam Rabiul Alam
机构: BRAC University (BRAC大学); United International University (联合国际大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
备注: This version (v2) extends our previous work ( arXiv:2411.15876v1 ) on Divide2Conquer (D2C) by introducing Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C). The manuscript is currently under review at Complex and Intelligent Systems
Abstract:Overfitting remains a significant challenge in deep learning, often arising from data outliers, noise, and limited training data. To address this, the Divide2Conquer (D2C) method was previously proposed, which partitions training data into multiple subsets and trains identical models independently on each. This strategy enables learning more consistent patterns while minimizing the influence of individual outliers and noise. However, D2C’s standard aggregation typically treats all subset models equally or based on fixed heuristics (like data size), potentially underutilizing information about their varying generalization capabilities. Building upon this foundation, we introduce Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C), an advanced technique that refines the aggregation process. DUA-D2C dynamically weights the contributions of subset models based on their performance on a shared validation set, considering both accuracy and prediction uncertainty. This intelligent aggregation allows the central model to preferentially learn from subsets yielding more generalizable and confident edge models, thereby more effectively combating overfitting. Empirical evaluations on benchmark datasets spanning multiple domains demonstrate that DUA-D2C significantly improves generalization. Our analysis includes evaluations of decision boundaries, loss curves, and other performance metrics, highlighting the effectiveness of DUA-D2C. This study demonstrates that DUA-D2C improves generalization performance even when applied on top of other regularization methods, establishing it as a theoretically grounded and effective approach to combating overfitting in modern deep learning. Our codes are publicly available at: this https URL.
zh
[CV-119] SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
【速读】:该论文旨在解决卫星物联网(Satellite Internet of Things, Sat-IoT)中多时相、多源遥感图像融合面临的挑战,即现有方法未能充分挖掘时间维度与数据源维度的互补信息:一方面,多图像超分辨率(Multi-Image Super-Resolution, MISR)虽能利用时序冗余提升重建质量,但受限于输入图像细粒度纹理信息不足;另一方面,全色锐化(pansharpening)虽可注入高空间频率信息,却通常依赖预插值低分辨率输入并假设无噪声对齐,导致对噪声和配准误差敏感。解决方案的关键在于提出SatFusion框架,其核心创新为:首先通过多时相图像融合(Multi-Temporal Image Fusion, MTIF)模块实现与全色图像的深层特征对齐,进而利用多源图像融合(Multi-Source Image Fusion, MSIF)模块注入精细纹理信息,最后由融合组合模块自适应整合两种模态优势,并通过多损失加权监督动态优化光谱一致性,从而显著提升融合质量、鲁棒性及在真实Sat-IoT场景下的泛化能力。
链接: https://arxiv.org/abs/2510.07905
作者: Yufei Tong,Guanjie Cheng,Peihan Wu,Yicheng Zhu,Kexu Lu,Feiyi Chen,Meng Xi,Junqin Huang,Shuiguang Deng
机构: Zhejiang University(浙江大学); Zhejiang University of Technology(浙江工业大学); Shandong University(山东大学); Shanghai Jiao Tong University(上海交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:With the rapid advancement of the digital society, the proliferation of satellites in the Satellite Internet of Things (Sat-IoT) has led to the continuous accumulation of large-scale multi-temporal and multi-source images across diverse application scenarios. However, existing methods fail to fully exploit the complementary information embedded in both temporal and source dimensions. For example, Multi-Image Super-Resolution (MISR) enhances reconstruction quality by leveraging temporal complementarity across multiple observations, yet the limited fine-grained texture details in input images constrain its performance. Conversely, pansharpening integrates multi-source images by injecting high-frequency spatial information from panchromatic data, but typically relies on pre-interpolated low-resolution inputs and assumes noise-free alignment, making it highly sensitive to noise and misregistration. To address these issues, we propose SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion. Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF) module to achieve deep feature alignment with the panchromatic image. Then, a Multi-Source Image Fusion (MSIF) module injects fine-grained texture information from the panchromatic data. Finally, a Fusion Composition module adaptively integrates the complementary advantages of both modalities while dynamically refining spectral consistency, supervised by a weighted combination of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that SatFusion significantly improves fusion quality, robustness under challenging conditions, and generalizability to real-world Sat-IoT scenarios. The code is available at: this https URL.
zh
[CV-120] FlowLensing: Simulating Gravitational Lensing with Flow Matching
链接: https://arxiv.org/abs/2510.07878
作者: Hamees Sayed,Pranath Reddy,Michael W. Toomey,Sergei Gleyzer
机构: Smallest AI(最小人工智能); IIT Madras(印度理工学院马德拉斯分校); University of Florida(佛罗里达大学); MIT(麻省理工学院); University of Alabama(阿拉巴马大学)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures, 3 tables
[CV-121] Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs
【速读】:该论文旨在解决肺部结节检测中因数据不平衡和标注有限导致的困难结节(低尺寸、低亮度、低对比度)识别性能不佳的问题。其关键解决方案是将课程学习(curriculum learning)与基于扩散模型(Diffusion-based)的合成数据增强相结合,通过难度评分引导训练顺序,并利用大量DDPM生成的合成图像提升模型对困难样本的感知能力,从而显著提高检测敏感性和泛化性能。
链接: https://arxiv.org/abs/2510.07681
作者: Pranav Sambhu,Om Guin,Madhav Sambhu,Jinho Cha
机构: Fulton Science Academy, Georgia, USA; Emory University School of Medicine, Georgia, USA (MD); Gwinnett Technical College, Georgia, USA
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 6 figures,
Abstract:This study evaluates whether integrating curriculum learning with diffusion-based synthetic augmentation can enhance the detection of difficult pulmonary nodules in chest radiographs, particularly those with low size, brightness, and contrast, which often challenge conventional AI models due to data imbalance and limited annotation. A Faster R-CNN with a Feature Pyramid Network (FPN) backbone was trained on a hybrid dataset comprising expert-labeled NODE21 (1,213 patients; 52.4 percent male; mean age 63.2 +/- 11.5 years), VinDr-CXR, CheXpert, and 11,206 DDPM-generated synthetic images. Difficulty scores based on size, brightness, and contrast guided curriculum learning. Performance was compared to a non-curriculum baseline using mean average precision (mAP), Dice score, and area under the curve (AUC). Statistical tests included bootstrapped confidence intervals, DeLong tests, and paired t-tests. The curriculum model achieved a mean AUC of 0.95 versus 0.89 for the baseline (p 0.001), with improvements in sensitivity (70 percent vs. 48 percent) and accuracy (82 percent vs. 70 percent). Stratified analysis demonstrated consistent gains across all difficulty bins (Easy to Very Hard). Grad-CAM visualizations confirmed more anatomically focused attention under curriculum learning. These results suggest that curriculum-guided synthetic augmentation enhances model robustness and generalization for pulmonary nodule detection.
zh
人工智能
[AI-0] BLAZER: Bootstrapping LLM -based Manipulation Agents with Zero-Shot Data Generation
【速读】:该论文旨在解决机器人领域中数据规模受限的问题,即由于缺乏类似视觉和语言领域的互联网级示范数据,现有机器人数据集通常依赖人工收集与标注,导致训练数据量小、泛化能力弱。解决方案的关键在于提出BLAZER框架,该框架利用大语言模型(Large Language Model, LLM)的零样本规划能力,在仿真环境中自动生成多样化的操作任务示范数据;随后通过筛选成功案例对LLM进行微调,提升其规划性能而无需人工干预;更重要的是,尽管训练依赖于模拟器状态信息,BLAZER所学技能可直接迁移至基于传感器的物理机器人操作中,从而显著提升零样本操作能力并支持任务外泛化与模型压缩。
链接: https://arxiv.org/abs/2510.08572
作者: Rocktim Jyoti Das,Harsh Singh,Diana Turmakhan,Muhammad Abdullah Sohail,Mingfei Han,Preslav Nakov,Fabio Pizzati,Ivan Laptev
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 8 figures
Abstract:Scaling data and models has played a pivotal role in the remarkable progress of computer vision and language. Inspired by these domains, recent efforts in robotics have similarly focused on scaling both data and model size to develop more generalizable and robust policies. However, unlike vision and language, robotics lacks access to internet-scale demonstrations across diverse robotic tasks and environments. As a result, the scale of existing datasets typically suffers from the need for manual data collection and curation. To address this problem, here we propose BLAZER, a framework that learns manipulation policies from automatically generated training data. We build on the zero-shot capabilities of LLM planners and automatically generate demonstrations for diverse manipulation tasks in simulation. Successful examples are then used to finetune an LLM and to improve its planning capabilities without human supervision. Notably, while BLAZER training requires access to the simulator’s state, we demonstrate direct transfer of acquired skills to sensor-based manipulation. Through extensive experiments, we show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Moreover, BLAZER improves on tasks outside of its training pool and enables downscaling of LLM models. Our code and data will be made publicly available on the project page.
zh
[AI-1] On the optimization dynamics of RLVR: Gradient gap and step size thresholds
【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练大型语言模型时缺乏理论解释的问题,特别是为何其能通过简单的二元反馈实现有效后训练。解决方案的关键在于提出并分析“梯度间隙”(Gradient Gap)这一核心概念,该量刻画了从低奖励区域到高奖励区域的改进方向;理论证明收敛性依赖于更新方向与梯度间隙的一致性,并进一步推导出一个关于步长的临界阈值:低于该阈值时学习收敛,高于则性能崩溃。此理论框架不仅解释了实践中如长度归一化等启发式策略为何提升稳定性,还揭示了固定学习率下成功率无法达到100%的根本原因。
链接: https://arxiv.org/abs/2510.08539
作者: Joe Suk,Yaqi Duan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below 100% . We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
zh
[AI-2] FlowSearch: Advancing deep research with dynamic structured knowledge flow
【速读】:该论文旨在解决深度研究(Deep Research)任务中因知识广度与深度要求高、多步推理依赖复杂而对智能体系统(Agentic Systems)带来的挑战。其核心问题在于如何有效组织跨领域知识空间并动态支持任务分解与并行探索,同时根据中间推理结果实时调整策略。解决方案的关键是提出FlowSearch——一种多智能体框架,通过主动构建和演化动态结构化知识流(Knowledge Flow)来驱动子任务执行与推理;该框架具备战略规划能力,可实现层级化任务分解与并行探索,并基于反馈实时优化知识流结构,从而在通用与科学基准(如GAIA、HLE、GPQA和TRQA)上达到当前最优性能,展现出在多学科科研场景中的强大潜力。
链接: https://arxiv.org/abs/2510.08521
作者: Yusong Hu,Runmin Ma,Yue Fan,Jinxin Shi,Zongsheng Cao,Yuhao Zhou,Jiakang Yuan,Xiangchao Yan,Wenlong Zhang,Lei Bai,Bo Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deep research is an inherently challenging task that demands both breadth and depth of thinking. It involves navigating diverse knowledge spaces and reasoning over complex, multi-step dependencies, which presents substantial challenges for agentic systems. To address this, we propose FlowSearch, a multi-agent framework that actively constructs and evolves a dynamic structured knowledge flow to drive subtask execution and reasoning. FlowSearch is capable of strategically planning and expanding the knowledge flow to enable parallel exploration and hierarchical task decomposition, while also adjusting the knowledge flow in real time based on feedback from intermediate reasoning outcomes and insights. FlowSearch achieves state-of-the-art performance on both general and scientific benchmarks, including GAIA, HLE, GPQA and TRQA, demonstrating its effectiveness in multi-disciplinary research scenarios and its potential to advance scientific discovery. The code is available at this https URL.
zh
[AI-3] Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning
【速读】:该论文旨在解决神经网络中激活函数选择缺乏理论依据的问题,即现有对激活函数的比较多依赖启发式方法,而缺乏严谨的分类与稳定性分析框架。其解决方案的关键在于提出一个九维积分签名 $ S_\sigma(\phi) $,该签名融合了高斯传播统计量(如一阶矩 $ m_1 $、二阶矩 $ m_2 $、梯度相关性 $ g_1, g_2 $ 和斜率参数 $ \eta $)、渐近斜率 $ \alpha_+ 、 \alpha_- $ 以及正则性指标(如 $ TV(\phi’) 、 C(\phi) $),从而构建了一个具有数学严格性的激活函数分类体系。该框架不仅确立了良好 posed 性和仿射重参数化不变性,还揭示了动力学稳定性区域(通过 $ m_2’, g_2 $)和核条件数与平滑性的关系,实现了从经验试错到可证明稳定性和核条件优化的设计范式转变。
链接: https://arxiv.org/abs/2510.08456
作者: Ankur Mali,Lawrence Hall,Jake Williams,Gordon Richards
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages
Abstract:Activation functions govern the expressivity and stability of neural networks, yet existing comparisons remain largely heuristic. We propose a rigorous framework for their classification via a nine-dimensional integral signature S_sigma(phi), combining Gaussian propagation statistics (m1, g1, g2, m2, eta), asymptotic slopes (alpha_plus, alpha_minus), and regularity measures (TV(phi’), C(phi)). This taxonomy establishes well-posedness, affine reparameterization laws with bias, and closure under bounded slope variation. Dynamical analysis yields Lyapunov theorems with explicit descent constants and identifies variance stability regions through (m2’, g2). From a kernel perspective, we derive dimension-free Hessian bounds and connect smoothness to bounded variation of phi’. Applying the framework, we classify eight standard activations (ReLU, leaky-ReLU, tanh, sigmoid, Swish, GELU, Mish, TeLU), proving sharp distinctions between saturating, linear-growth, and smooth families. Numerical Gauss-Hermite and Monte Carlo validation confirms theoretical predictions. Our framework provides principled design guidance, moving activation choice from trial-and-error to provable stability and kernel conditioning.
zh
[AI-4] gLSTM: Mitigating Over-Squashing by Increasing Storag e Capacity
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)中存在的“过挤压”(over-squashing)问题,即在消息传递过程中,来自节点大感受野的信息被压缩到固定维度的向量中,导致信息瓶颈,从而限制了模型的表达能力。解决方案的关键在于重新从模型存储与检索容量的角度理解这一现象,并引入受序列建模领域启发的新架构设计——借鉴关联记忆(associative memories)、快速权重编程(fast weight programmers)和xLSTM模型的思想,提出一种具有增强信息存储能力的新型GNN架构,显著提升了在合成容量任务及多个真实世界图基准上的性能表现。
链接: https://arxiv.org/abs/2510.08450
作者: Hugh Blayney,Álvaro Arroyo,Xiaowen Dong,Michael M. Bronstein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 22 pages, 22 figures, 7 tables
Abstract:Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed sized vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node’s representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our capacity synthetic task, as well as a range of real-world graph benchmarks.
zh
[AI-5] Synthetic Series-Symbol Data Generation for Time Series Foundation Models NEURIPS2025
【速读】:该论文旨在解决时间序列分析(Time Series Analysis, TSA)中因训练数据稀缺和不平衡所导致的基础模型发展受限问题。其解决方案的关键在于受复杂动态系统理论启发,设计了一种序列-符号(series-symbol)数据生成机制,能够无限制地生成高质量的时间序列数据及其对应的符号表达式;进而基于这些强相关性的数据对,提出了一个名为 \textttSymTime 的预训练基础模型,通过利用符号信息增强时间序列表征能力,在五个主要TSA任务上微调后表现出与在真实世界数据上预训练的基础模型相当甚至更优的性能,从而验证了序列-符号数据生成与预训练机制在缓解数据稀缺、提升任务表现方面的潜力。
链接: https://arxiv.org/abs/2510.08445
作者: Wenxuan Wang,Kai Wu,Yujian Betterest Li,Dan Wang,Xiaoyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 63 pages, NeurIPS 2025 accepted
Abstract:Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop \textttSymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. \textttSymTime demonstrates competitive performance across five major TSA tasks when fine-tunes with downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at this https URL.
zh
[AI-6] ClauseLens: Clause-Grounded CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing
【速读】:该论文旨在解决再保险合同定价(reinsurance treaty pricing)过程中存在的合规性不足与决策不透明问题,这些问题导致报价难以审计且易违反监管要求(如Solvency II、NAIC RBC及欧盟《人工智能法案》)。解决方案的关键在于提出ClauseLens——一个基于条款的强化学习框架,其将报价任务建模为风险感知的约束马尔可夫决策过程(Risk-Aware Constrained Markov Decision Process, RA-CMDP),通过从法律和承保语料库中检索并嵌入法定与政策条款作为观测输入,不仅约束可行动作空间,还生成基于条款的自然语言解释。该方法在行业校准的多智能体模拟器中验证,显著降低51%的偿付能力违规风险,提升27.9%的尾部风险表现(CVaR_0.10),并实现88.2%的条款解释准确率,从而实现了可解释、可审计且符合监管要求的定价行为。
链接: https://arxiv.org/abs/2510.08429
作者: Stella C. Dong,James R. Finlay
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted for publication at the 6th ACM International Conference on AI in Finance (ICAIF 2025), Singapore. Author-accepted version (October 2025). 10 pages, 5 figures
Abstract:Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause-grounded reinforcement learning framework that produces transparent, regulation-compliant, and risk-aware treaty quotes. ClauseLens models the quoting task as a Risk-Aware Constrained Markov Decision Process (RA-CMDP). Statutory and policy clauses are retrieved from legal and underwriting corpora, embedded into the agent’s observations, and used both to constrain feasible actions and to generate clause-grounded natural language justifications. Evaluated in a multi-agent treaty simulator calibrated to industry data, ClauseLens reduces solvency violations by 51%, improves tail-risk performance by 27.9% (CVaR_0.10), and achieves 88.2% accuracy in clause-grounded explanations with retrieval precision of 87.4% and recall of 91.1%. These findings demonstrate that embedding legal context into both decision and explanation pathways yields interpretable, auditable, and regulation-aligned quoting behavior consistent with Solvency II, NAIC RBC, and the EU AI Act. Comments: Accepted for publication at the 6th ACM International Conference on AI in Finance (ICAIF 2025), Singapore. Author-accepted version (October 2025). 10 pages, 5 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) MSC classes: 68T05, 91G70 ACMclasses: I.2.6; I.2.7; J.1 Cite as: arXiv:2510.08429 [cs.LG] (or arXiv:2510.08429v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.08429 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3768292.3770356 Focus to learn more DOI(s) linking to related resources Submission history From: Stella Dong [view email] [v1] Thu, 9 Oct 2025 16:43:49 UTC (29 KB)
zh
[AI-7] Prompts Generalize with Low Data: Non-vacuous Generalization Bounds for Optimizing Prompts with More Informative Priors ICML2025
【速读】:该论文旨在解决小样本场景下提示工程(prompt engineering)优化中泛化性能难以保证的问题,尤其是在数据稀缺条件下现有理论分析(如基于PAC-Bayes的界)往往失效或过于宽松(vacuous)的局限性。其解决方案的关键在于引入分布依赖的困惑度(perplexity)作为有效先验,通过在优化过程中对提示的自然性进行正则化,从而限制探索空间并提升泛化能力。作者由此推导出新的非真空(non-vacuous)泛化界,并实验证明该策略能显著改善提示在小样本下的泛化表现。
链接: https://arxiv.org/abs/2510.08413
作者: David Madras,Joshua Safyan,Qiuyi(Richard)Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: EXAIT Workshop paper at ICML 2025
Abstract:Many prompt engineering techniques have been successful in practice, even when optimizing over a large prompt space with with a small amount of task-specific data. Recent work has partially explained this success by showing generalization bounds which apply PAC-Bayes theory to the discrete prompt space, but they are non-vacuous only in data-rich scenarios. We argue that such widespread success can be more fully explained through more carefully considering data- or distribution-dependent perplexity, which acts as an effective prior and steers the optimization towards prompts that are more ``natural’’ for the task at hand. We derive novel generalization bounds that are non-vacuous for data-scarce prompt optimization via more useful priors, formally analyzing how perplexity regularization tightens these bounds by limiting exploration. Empirically, we explore both the bounds’ effectiveness and the practical benefits of perplexity regularization in improving prompt generalization.
zh
[AI-8] Revisiting Hallucination Detection with Effective Rank-based Uncertainty
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉(hallucination)检测这一关键问题,以提升其在实际部署中的可信度。解决方案的核心在于提出一种基于有效秩(effective rank)的不确定性量化方法,通过分析来自不同模型输出和不同层的隐藏状态(hidden states)的谱特性,实现对模型内部推理过程的可解释性洞察。该方法无需额外知识或模块,兼具理论严谨性与实践高效性,并从理论上证明了同时量化内部(单个响应表示)和外部(多个响应间差异)不确定性的必要性,从而为LLM幻觉检测提供了新范式。
链接: https://arxiv.org/abs/2510.08389
作者: Rui Wang,Zeming Wei,Guanzhang Yue,Meng Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Detecting hallucinations in large language models (LLMs) remains a fundamental challenge for their trustworthy deployment. Going beyond basic uncertainty-driven hallucination detection frameworks, we propose a simple yet powerful method that quantifies uncertainty by measuring the effective rank of hidden states derived from multiple model outputs and different layers. Grounded in the spectral analysis of representations, our approach provides interpretable insights into the model’s internal reasoning process through semantic variations, while requiring no extra knowledge or additional modules, thus offering a combination of theoretical elegance and practical efficiency. Meanwhile, we theoretically demonstrate the necessity of quantifying uncertainty both internally (representations of a single response) and externally (different responses), providing a justification for using representations among different layers and responses from LLMs to detect hallucinations. Extensive experiments demonstrate that our method effectively detects hallucinations and generalizes robustly across various scenarios, contributing to a new paradigm of hallucination detection for LLM truthfulness.
zh
[AI-9] QAgent : A modular Search Agent with Interactive Query Understanding
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型任务中因静态参数化知识而受限的问题,以及传统检索增强生成(Retrieval-Augmented Generation, RAG)在复杂查询理解上的不足。此外,即便采用强化学习(Reinforcement Learning, RL)训练的搜索代理(search agent),仍面临泛化能力和实际部署挑战。其解决方案的关键在于提出 QAgent——一个统一的代理式 RAG 框架,通过引入可插拔的、基于 RL 训练的多步决策搜索代理,实现对查询的交互式推理与自适应检索,从而提升检索质量并支持下游任务的准确性。该方法聚焦于有效检索策略的设计,显著增强了 LLM 在真实场景中的泛化能力与部署灵活性。
链接: https://arxiv.org/abs/2510.08383
作者: Yi Jiang,Lei Shen,Lujie Niu,Sendong Zhao,Wenbo Su,Bo Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code is available at this https URL
Abstract:Large language models (LLMs) excel at natural language tasks but are limited by their static parametric knowledge, especially in knowledge-intensive task. Retrieval-augmented generation (RAG) mitigates this by integrating external information. However, (1) traditional RAG struggles with complex query understanding, and (2) even search agents trained with reinforcement learning (RL), despite their promise, still face generalization and deployment challenges. To address these limitations, we propose QAgent, a unified agentic RAG framework that employs a search agent for adaptive retrieval. This agent optimizes its understanding of the query through interactive reasoning and retrieval. To facilitate real-world application, we focus on modular search agent for query understanding that are plug-and-play in complex systems. Secifically, the agent follows a multi-step decision process trained with RL to maximize retrieval quality and support accurate downstream answers. We further analyze the strengths and weaknesses of end-to-end RL and propose a strategy that focuses on effective retrieval, thereby enhancing generalization in LLM applications. Experiments show QAgent excels at QA and serves as a plug-and-play module for real-world deployment.
zh
[AI-10] Airy: Reading Robot Intent through Height and Sky
【速读】:该论文试图解决工业机器人在共享人类空间中因决策过程不透明而引发的安全隐患、信任缺失与公众监督困难的问题。解决方案的关键在于通过艺术装置Airy构建一个具身化的感知接口,利用竞争机制(谁抬升更高)、熟悉性交互(布料抖动的直观体验)以及传感器到感知的映射关系(通过森林与天气投影展现机器间的合作或对抗),使复杂多智能体AI的行为变得可直观理解,从而将原本的“黑箱”转化为面向公众的可视化界面。
链接: https://arxiv.org/abs/2510.08381
作者: Baoyang Chen,Xian Xu,Huamin Qu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:As industrial robots move into shared human spaces, their opaque decision making threatens safety, trust, and public oversight. This artwork, Airy, asks whether complex multi agent AI can become intuitively understandable by staging a competition between two reinforcement trained robot arms that snap a bedsheet skyward. Building on three design principles, competition as a clear metric (who lifts higher), embodied familiarity (audiences recognize fabric snapping), and sensor to sense mapping (robot cooperation or rivalry shown through forest and weather projections), the installation gives viewers a visceral way to read machine intent. Observations from five international exhibitions indicate that audiences consistently read the robots’ strategies, conflict, and cooperation in real time, with emotional reactions that mirror the system’s internal state. The project shows how sensory metaphors can turn a black box into a public interface.
zh
[AI-11] DeepEN: Personalized Enteral Nutrition for Critically Ill Patients using Deep Reinforcement Learning
【速读】:该论文旨在解决重症患者肠内营养(Enteral Nutrition, EN)个性化不足的问题,传统指南或经验性策略难以根据患者动态生理状态进行精准调整,导致营养干预效果受限。其解决方案的关键在于提出一种基于深度强化学习(Deep Reinforcement Learning, DRL)的框架 DeepEN,该框架在 MIMIC-IV 数据库中超过 11,000 名 ICU 患者数据上离线训练,构建了由临床知识引导的状态空间与兼顾短期生理指标和长期生存率的定制奖励函数,并采用带保守 Q 学习正则化的 Dueling Double Deep Q-Network 结构,从而学习出既符合高价值临床行为又避免危险偏离的安全策略。实证结果显示,DeepEN 在降低估计死亡率(18.8% vs 22.5%)和改善营养生物标志物方面显著优于医生制定及指南推荐策略,证明了数据驱动的个性化 EN 管理具有超越传统方法的潜力。
链接: https://arxiv.org/abs/2510.08350
作者: Daniel Jason Tan,Jiayang Chen,Dilruk Perera,Kay Choong See,Mengling Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce DeepEN, a deep reinforcement learning (RL) framework for personalized enteral nutrition (EN) in critically ill patients. Trained offline on over 11,000 ICU patients from the MIMIC-IV database, DeepEN generates 4-hourly recommendations for caloric, protein, and fluid intake tailored to each patient’s evolving physiology. The model integrates a curated, clinically informed state space with a custom reward function that balances short-term physiological and nutrition-related goals with long-term survival outcomes. Using a dueling double deep Q-network with conservative Q-learning regularization, DeepEN learns clinically realistic policies that align with high-value clinician actions while discouraging unsafe deviations. Across various qualitative and quantitative metrics, DeepEN outperforms clinician-derived and guideline-based policies, achieving a 3.7 \pm 0.17 percentage-point reduction in estimated mortality (18.8% vs 22.5%) and improvements in key nutritional biomarkers. These findings highlight the potential of safe, data-driven personalization of EN therapy to improve outcomes beyond traditional guideline- or heuristic-based approaches.
zh
[AI-12] Learning Whats Missing: Attention Dispersion and EMA Stabilization in Length Generalization
【速读】:该论文旨在解决Transformer模型在序列长度外推(length generalization)方面的局限性,特别是针对集合补集任务(set complement task)——即模型需预测输入序列中未出现的token的均匀分布,这在棋类游戏推理等场景中具有核心意义。解决方案的关键在于:首先,理论证明了单层注意力机制中嵌入维度和值维度的紧致边界;其次,发现若模型在长度为1和2时能实现平衡的logit位移,则其可推广至更长序列,尽管精度下降;进一步通过机制分析指出,softmax压缩导致有效与无效输出间的分离度降低是主要限制因素,而训练过程中多候选token带来的更新噪声则构成另一障碍。为此,作者提出使用dropout缓解softmax压缩效应、EMA(指数移动平均)减少训练噪声,并在随机超参数搜索及OthelloGPT实验中验证了这两个机制的有效性。
链接: https://arxiv.org/abs/2510.08341
作者: Pál Zsámboki,Benjamin Levi,David Ansel Josef Smith,Mitansh Kagalwala,Arlington Kell,Samuel Liechty,Cong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, 2 tables
Abstract:We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence – an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
zh
[AI-13] LLM s Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings
【速读】:该论文旨在解决传统消费者调研中存在的成本高、样本规模受限以及面板偏差(panel bias)等问题,同时克服现有大语言模型(Large Language Models, LLMs)在直接生成数值评分时导致的响应分布不现实的问题。其解决方案的关键在于提出一种语义相似性评分(Semantic Similarity Rating, SSR)方法:通过让LLM生成文本响应,并利用嵌入相似度(embedding similarity)将这些文本映射到Likert量表分布,从而实现既保持真实响应分布特征(KS相似度达0.85),又达到人类测试-重测信度的90%水平,同时保留了可解释的定性反馈信息。
链接: https://arxiv.org/abs/2510.08338
作者: Benjamin F. Maier,Ulf Aslak,Luca Fiaschi,Nina Rismal,Kemble Fletcher,Christian C. Luhmann,Robbie Dow,Kli Pappas,Thomas V. Wiecki
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 35 figures
Abstract:Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.
zh
[AI-14] First Try Matters: Revisiting the Role of Reflection in Reasoning Models
【速读】:该论文试图解决生成式 AI(Generative AI)在数学推理任务中因冗余反思(reflection)导致的推理效率低下问题,即模型在已得出答案后仍持续进行无实质修正的反思步骤,造成计算资源浪费。解决方案的关键在于提出一种基于问题感知的早期停止机制(question-aware early-stopping method),通过在推理过程中检测到若干合理候选答案后提前终止后续反思步骤,从而减少不必要的推理token消耗;进一步地,采用动态截断策略,在生成过程中一旦出现候选答案即停止后续反思,实现在五个数学数据集上平均减少24.5%的推理token、仅损失2.9%准确率的显著效率提升。
链接: https://arxiv.org/abs/2510.08308
作者: Liwei Kang,Yue Deng,Yao Xiao,Zhanfeng Mo,Wee Sun Lee,Lidong Bing
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. However, the contribution of reflections to performance improvement remains unclear. In this paper, we systematically analyze the rollouts of eight reasoning models on five mathematical datasets. We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model’s initial answer, a pattern consistent across models and datasets. To understand the role of reflections in training, we construct supervised fine-tuning (SFT) datasets with varying amounts of reflection steps. We observe that training models on rollouts with more reflection steps primarily enhances first-answer correctness rather than the ability to correct initially wrong answers through reflections. This motivates us to propose a question-aware early-stopping method that enhances inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated, thereby reducing unnecessary reflection steps. Motivated by this, we further propose to dynamically truncate the reflections after a candidate answer has appeared during generation, which reduces reasoning tokens by 24.5% across five mathematical datasets, within a 2.9% drop in accuracy.
zh
[AI-15] Symmetry-Aware Fully-Amortized Optimization with Scale Equivariant Graph Metanetworks
【速读】:该论文旨在解决如何通过学习共享结构来加速相关优化问题的求解,从而提升神经网络优化的效率与泛化能力。其核心挑战在于如何在不同问题实例间复用优化知识,避免重复的迭代计算。解决方案的关键在于提出Scale Equivariant Graph Metanetworks(ScaleGMNs),这类元网络直接在权重空间中操作,能够实现对现有模型的单次微调(single-shot fine-tuning),显著减少传统迭代优化的需求。此外,理论分析表明,卷积神经网络(Convolutional Neural Networks, CNNs)中的缩放对称性诱导的规范自由度(gauge freedom)严格小于多层感知机(Multi-Layer Perceptrons, MLPs),这一发现有助于解释不同架构在优化性能上的差异,也为设计更高效的对称感知元网络提供了理论依据。
链接: https://arxiv.org/abs/2510.08300
作者: Bart Kuipers,Freek Byrman,Daniel Uyterlinde,Alejandro García-Castellanos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Amortized optimization accelerates the solution of related optimization problems by learning mappings that exploit shared structure across problem instances. We explore the use of Scale Equivariant Graph Metanetworks (ScaleGMNs) for this purpose. By operating directly in weight space, ScaleGMNs enable single-shot fine-tuning of existing models, reducing the need for iterative optimization. We demonstrate the effectiveness of this approach empirically and provide a theoretical result: the gauge freedom induced by scaling symmetries is strictly smaller in convolutional neural networks than in multi-layer perceptrons. This insight helps explain the performance differences observed between architectures in both our work and that of Kalogeropoulos et al. (2024). Overall, our findings underscore the potential of symmetry-aware metanetworks as a powerful approach for efficient and generalizable neural network optimization. Open-source code: this https URL
zh
[AI-16] Counterfactual Identifiability via Dynamic Optimal Transport NEURIPS2025
【速读】:该论文旨在解决高维多变量结果的反事实识别(counterfactual identification)问题,即如何从观测数据中唯一且可识别地推断出反事实分布,从而为因果推断提供理论保障。当前许多反事实推断方法虽表现良好,但缺乏识别性,导致其估计结果无法保证因果有效性。论文的关键解决方案是基于连续时间流(continuous-time flows)构建多变量反事实识别框架,利用动态最优传输(dynamic optimal transport)工具证明了流匹配可产生唯一、单调且保持秩序的反事实运输映射(counterfactual transport map),从而确保推断一致性。这一理论基础在受控场景中通过反事实真值验证,并在真实图像数据上提升了反事实推理的公理一致性(axiomatic counterfactual soundness)。
链接: https://arxiv.org/abs/2510.08294
作者: Fabio De Sousa Ribeiro,Ainkaran Santhirasekaram,Ben Glocker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted at NeurIPS 2025
Abstract:We address the open question of counterfactual identification for high-dimensional multivariate outcomes from observational data. Pearl (2000) argues that counterfactuals must be identifiable (i.e., recoverable from the observed data distribution) to justify causal claims. A recent line of work on counterfactual inference shows promising results but lacks identification, undermining the causal validity of its estimates. To address this, we establish a foundation for multivariate counterfactual identification using continuous-time flows, including non-Markovian settings under standard criteria. We characterise the conditions under which flow matching yields a unique, monotone and rank-preserving counterfactual transport map with tools from dynamic optimal transport, ensuring consistent inference. Building on this, we validate the theory in controlled scenarios with counterfactual ground-truth and demonstrate improvements in axiomatic counterfactual soundness on real images.
zh
[AI-17] Co-TAP: Three-Layer Agent Interaction Protocol Technical Report
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems)在互操作性(Interoperability)、交互与协作(Interaction and Collaboration)、知识共享(Knowledge Sharing)三个核心维度所面临的挑战。其解决方案的关键在于提出了一种三层代理交互协议框架 Co-TAP(T: Triple, A: Agent, P: Protocol),包含三个核心协议:人类-代理交互协议(HAI)、统一代理协议(UAP)和记忆-提取-知识协议(MEK)。其中,HAI 通过事件驱动的标准化通信范式保障人机交互的实时性与可靠性;UAP 作为基础设施层的核心,利用服务发现与协议转换机制实现异构代理间的无缝互联;MEK 构建“记忆-提取-知识”认知链,使代理具备从个体经验中学习并生成可共享知识的能力,从而支撑集体智能的实现。
链接: https://arxiv.org/abs/2510.08263
作者: Shunyu An,Miao Wang,Yongchao Li,Dong Wan,Lina Wang,Ling Qin,Liqin Gao,Congyao Fan,Zhiyong Mao,Jiange Pu,Wenji Xia,Dong Zhao,Rui Hu,Ji Lu,Guiyue Zhou,Baoyu Tang,Yanqin Gao,Yongsheng Du,Daigang Xu,Lingjun Huang,Baoli Wang,Xiwen Zhang,Luyao Wang,Shilong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes Co-TAP (T: Triple, A: Agent, P: Protocol), a three-layer agent interaction protocol designed to address the challenges faced by multi-agent systems across the three core dimensions of Interoperability, Interaction and Collaboration, and Knowledge Sharing. We have designed and proposed a layered solution composed of three core protocols: the Human-Agent Interaction Protocol (HAI), the Unified Agent Protocol (UAP), and the Memory-Extraction-Knowledge Protocol (MEK). HAI focuses on the interaction layer, standardizing the flow of information between users, interfaces, and agents by defining a standardized, event-driven communication paradigm. This ensures the real-time performance, reliability, and synergy of interactions. As the core of the infrastructure layer, UAP is designed to break down communication barriers among heterogeneous agents through unified service discovery and protocol conversion mechanisms, thereby enabling seamless interconnection and interoperability of the underlying network. MEK, in turn, operates at the cognitive layer. By establishing a standardized ‘‘Memory (M) - Extraction (E) - Knowledge (K)’’ cognitive chain, it empowers agents with the ability to learn from individual experiences and form shareable knowledge, thereby laying the foundation for the realization of true collective intelligence. We believe this protocol framework will provide a solid engineering foundation and theoretical guidance for building the next generation of efficient, scalable, and intelligent multi-agent applications.
zh
[AI-18] A Distributed Emulation Environment for In-Memory Computing Systems
【速读】:该论文旨在解决基于存内计算(In-memory Computing)技术的集成电路在实际芯片流片前,缺乏高效、可扩展的实时仿真环境以支持系统级分析、微码测试与应用部署的问题。其解决方案的关键在于提出了一种分布式且可扩展的仿真系统架构,配合相应的软件开发工具链,实现了对存内计算器件的快速原型设计与验证,实验结果表明该模拟器能有效支持复杂系统的早期开发与优化。
链接: https://arxiv.org/abs/2510.08257
作者: Eleni Bougioukou,Anastasios Petropoulos,Nikolaos Toulgaridis,Theodoros Chatzimichail,Theodore Antonakopoulos
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures, 2025 IEEE International Instrumentation and Measurement Technology Conference (I2MTC)
Abstract:In-memory computing technology is used extensively in artificial intelligence devices due to lower power consumption and fast calculation of matrix-based functions. The development of such a device and its integration in a system takes a significant amount of time and requires the use of a real-time emulation environment, where various system aspects are analyzed, microcode is tested, and applications are deployed, even before the real chip is available. In this work, we present the architecture, the software development tools, and experimental results of a distributed and expandable emulation system for rapid prototyping of integrated circuits based on in-memory computing technologies. Presented experimental results demonstrate the usefulness of the proposed emulator.
zh
[AI-19] Chain-of-Trigger: An Agent ic Backdoor that Paradoxically Enhances Agent ic Robustness
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的智能体在现实应用中因后门攻击导致的安全性与鲁棒性问题。传统后门攻击通常局限于单步控制,难以实现对长期任务执行过程的干扰。为此,作者提出链式触发后门(Chain-of-Trigger Backdoor, CoTri),其核心在于设计一种多步触发机制:初始触发器由攻击者植入,后续触发步骤则从环境状态中动态获取,从而形成有序序列,实现对智能体长期行为的隐蔽操控。实验表明,CoTri可在几乎零误触发率下实现接近完美的攻击成功率(ASR),且由于训练数据模拟了环境的随机性,该后门反而提升了智能体在良性任务上的性能及对环境干扰的鲁棒性,进一步增强了攻击的隐蔽性和潜在安全风险。
链接: https://arxiv.org/abs/2510.08238
作者: Jiyang Qiu,Xinbei Ma,Yunqing Xu,Zhuosheng Zhang,Hai Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness. In this work, we reveal the security and robustness vulnerabilities of these agents through backdoor attacks. Distinct from traditional backdoors limited to single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack designed for long-horizon agentic control. CoTri relies on an ordered sequence. It starts with an initial trigger, and subsequent ones are drawn from the environment, allowing multi-step manipulation that diverts the agent from its intended task. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR). Due to training data modeling the stochastic nature of the environment, the implantation of CoTri paradoxically enhances the agent’s performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights that CoTri achieves stable, multi-step control within agents, improving their inherent robustness and task capabilities, which ultimately makes the attack more stealthy and raises potential safty risks.
zh
[AI-20] he Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在政治领域中存在的偏见及其对社会舆论和民主进程潜在影响的问题。解决方案的关键在于采用二维政治光谱测试(Political Compass Test, PCT)作为评估工具,从三个层面系统分析模型的偏见:首先通过PCT识别模型固有的政治倾向,其次利用角色提示(persona prompting)探测显性刻板印象,最后借助多语言版本PCT揭示隐性刻板印象。研究发现所有模型均呈现显著左倾倾向,且隐性偏见比显性偏见更突出,但两者在多数模型中表现出一致性,表明模型对自身偏见具有一定“意识”或透明度。这一方法体系为量化与理解LLMs中的政治偏见提供了可操作的框架。
链接: https://arxiv.org/abs/2510.08236
作者: Konrad Löhr,Shuzhou Yuan,Michael Färber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increas- ingly integral to information dissemination and decision-making processes. Given their grow- ing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propa- gation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the in- herent political leanings of these models. Sub- sequently, persona prompting with the PCT is used to explore explicit stereotypes across vari- ous social dimensions. In a final step, implicit stereotypes are uncovered by evaluating mod- els with multilingual versions of the PCT. Key findings reveal a consistent left-leaning polit- ical alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those iden- tified via explicit persona prompting. Interest- ingly, for most models, implicit and explicit stereotypes show a notable alignment, suggest- ing a degree of transparency or “awareness” regarding their inherent biases. This study un- derscores the complex interplay of political bias and stereotypes in LLMs.
zh
[AI-21] Selection Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理任务中表现不佳的问题,尤其是当任务具有高复杂度且依赖于隐变量之间的密集结构化依赖关系时。尽管人类能够轻松完成此类任务,现有模型即使经过大规模预训练和后训练仍难以可靠地进行推理。解决方案的关键在于提出一个名为SR²的框架,其核心思想是将推理任务建模为一种选择机制,其中高层逻辑概念作为选择算子作用于观测数据;并通过引入估计的隐变量作为反馈信号,增强模型对隐空间中密集依赖关系的学习能力,具体包含三个模块:反射表示学习、依赖自精炼和周期性中间对齐,从而显著提升推理准确性,例如在Sudoku和Maze任务上以8倍更少参数实现超过10%的性能提升。
链接: https://arxiv.org/abs/2510.08222
作者: Yunlong Deng,Boyang Sun,Yan Li,Lingjing Kong,Zeyu Tang,Kun Zhang,Guangyi Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Due to their inherent complexity, reasoning tasks have long been regarded as rigorous benchmarks for assessing the capabilities of machine learning models, especially large language models (LLMs). Although humans can solve these tasks with ease, existing models, even after extensive pre-training and post-training at scale, still fail to perform reasoning reliably. In this paper, we revisit reasoning tasks from a causal perspective, seeking to understand their behavior in latent space and to offer insights for addressing their challenges. Specifically, we cast reasoning tasks as a selection mechanism, in which high-level logical concepts function as selection operators on the given observations, such as, identifying the correct answer in a math problem or filling the appropriate entry in Sudoku. We emphasize two key properties of this formulation that shed light on the difficulty of reasoning tasks. First, the latent space exceeds the observation space in complexity, even when the correct answer is fully determined by the observed input. Second, the latent variables, corresponding to logical thought, are densely structured and exhibit strong dependencies. Building on this formulation, we introduce a framework, called SR ^2 , that incorporates the estimated latent variables as feedback into the selection mechanism, thereby facilitating the learning of dense dependencies among latent representations. The framework consists of three key modules: reflective representation learning, dependency self-refinement, and periodic intermediate alignment. Experimentally, we show that our approach yields significant gains in reasoning accuracy, for example, attaining over 10 % improvement in performance with 8 \times fewer parameters on the Sudoku and Maze tasks over the recent advances.
zh
[AI-22] Expressive Value Learning for Scalable Offline Reinforcement Learning
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)在复杂场景中缺乏可扩展性的问题,尤其是现有方法依赖反向传播 through time(Backpropagation Through Time, BPTT)计算成本过高,或采用策略蒸馏(Policy Distillation)引入误差累积、限制模型规模。解决方案的关键在于提出一种名为表达式价值学习(Expressive Value Learning for Offline Reinforcement Learning, EVOR)的新框架:该框架通过流匹配(Flow Matching)学习一个最优且正则化的Q函数(即表达式价值函数),并在推理阶段利用拒绝采样(Rejection Sampling)从该价值函数中提取策略,从而实现无需重训练的高效优化、正则化与计算可扩展搜索,显著提升了离线RL在多样化任务上的性能表现。
链接: https://arxiv.org/abs/2510.08218
作者: Nicolas Espinosa-Dice,Kiante Brantley,Wen Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures
Abstract:Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference-time, EVOR performs inference-time policy extraction via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
zh
[AI-23] FuelCast: Benchmarking Tabular and Temporal Models for Ship Fuel Consumption KDD ECML
【速读】:该论文旨在解决航运业中船舶燃油消耗预测的准确性问题,该问题直接影响运营经济性与环境可持续性。现有研究受限于方法异质性和高质量数据集稀缺,难以进行模型比较。其解决方案的关键在于:(1) 发布一个包含三艘船舶运行与环境数据的新数据集(https://url),(2) 构建涵盖表格回归和时序回归的标准化基准测试,(3) 首次将基于上下文学习(in-context learning)的TabPFN基础模型应用于船舶燃料消耗建模。实验表明,引入环境条件的模型显著优于仅依赖船速的多项式基线,且TabPFN在性能上略胜其他方法,验证了基础模型结合上下文学习在船舶燃油预测中的可行性与潜力。
链接: https://arxiv.org/abs/2510.08217
作者: Justus Viga,Penelope Mueck,Alexander Löser,Torben Weis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published in “ECML PKDD Workshop 2025 - Advanced Analytics and Learning on Temporal Data”
Abstract:In the shipping industry, fuel consumption and emissions are critical factors due to their significant impact on economic efficiency and environmental sustainability. Accurate prediction of ship fuel consumption is essential for further optimization of maritime operations. However, heterogeneous methodologies and limited high-quality datasets hinder direct comparison of modeling approaches. This paper makes three key contributions: (1) we introduce and release a new dataset (this https URL) comprising operational and environmental data from three ships; (2) we define a standardized benchmark covering tabular regression and time-series regression (3) we investigate the application of in-context learning for ship consumption modeling using the TabPFN foundation model - a first in this domain to our knowledge. Our results demonstrate strong performance across all evaluated models, supporting the feasibility of onboard, data-driven fuel prediction. Models incorporating environmental conditions consistently outperform simple polynomial baselines relying solely on vessel speed. TabPFN slightly outperforms other techniques, highlighting the potential of foundation models with in-context learning capabilities for tabular prediction. Furthermore, including temporal context improves accuracy.
zh
[AI-24] DODO: Causal Structure Learning with Budgeted Interventions
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)系统在复杂环境中缺乏因果意识的问题,即如何让智能体(Agent)自主学习并准确推断其环境的因果结构。传统方法主要依赖观测数据中的统计相关性,难以区分因果关系与虚假关联,尤其在存在噪声的情况下性能受限。解决方案的关键在于提出DODO算法,该算法通过智能体对环境进行主动干预(intervention),结合因果推断技术分析干预前后变量间统计显著性变化,从而逐步构建出隐藏的因果有向无环图(Directed Acyclic Graph, DAG)。实验表明,DODO在绝大多数条件下显著优于仅基于观测的方法,甚至能在资源有限场景下实现零误差的因果图重建,并在最挑战配置中比最优基线提升0.25 F1分数。
链接: https://arxiv.org/abs/2510.08207
作者: Matteo Gregorini,Chiara Boldrini,Lorenzo Valerio
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review. Supported by SoBigDatait IR0000013, FAIR PE00000013, ICSC CN00000013
Abstract:Artificial Intelligence has achieved remarkable advancements in recent years, yet much of its progress relies on identifying increasingly complex correlations. Enabling causality awareness in AI has the potential to enhance its performance by enabling a deeper understanding of the underlying mechanisms of the environment. In this paper, we introduce DODO, an algorithm defining how an Agent can autonomously learn the causal structure of its environment through repeated interventions. We assume a scenario where an Agent interacts with a world governed by a causal Directed Acyclic Graph (DAG), which dictates the system’s dynamics but remains hidden from the Agent. The Agent’s task is to accurately infer the causal DAG, even in the presence of noise. To achieve this, the Agent performs interventions, leveraging causal inference techniques to analyze the statistical significance of observed changes. Results show better performance for DODO, compared to observational approaches, in all but the most limited resource conditions. DODO is often able to reconstruct with as low as zero errors the structure of the causal graph. In the most challenging configuration, DODO outperforms the best baseline by +0.25 F1 points.
zh
[AI-25] he Tournament Tree Method for preference elicitation in Multi-criteria decision-making
【速读】:该论文旨在解决多准则决策中基于成对比较方法(如模糊偏好关系和Saaty乘法偏好关系)所面临的三大局限:高认知负荷、不一致性风险以及从偏好矩阵推导一致价值尺度的计算复杂性。其解决方案的关键在于提出一种新颖的“锦标赛树法”(Tournament Tree Method, TTM),通过仅需 $ m-1 $ 次成对比较即可构建一个完整、互反且一致的比较矩阵,从而将偏好建模的维度从 $ \frac{m(m-1)}{2} $ 降低至 $ m $ 个参数,并确保一致性由设计保障,同时显著减少专家的认知负担。
链接: https://arxiv.org/abs/2510.08197
作者: Diego García-Zamora,Álvaro Labella,José Rui Figueira
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:Pairwise comparison methods, such as Fuzzy Preference Relations and Saaty’s Multiplicative Preference Relations, are widely used to model expert judgments in multi-criteria decision-making. However, their application is limited by the high cognitive load required to complete m(m-1)/2 comparisons, the risk of inconsistency, and the computational complexity of deriving consistent value scales. This paper proposes the Tournament Tree Method (TTM), a novel elicitation and evaluation framework that overcomes these limitations. The TTM requires only m-1 pairwise comparisons to obtain a complete, reciprocal, and consistent comparison matrix. The method consists of three phases: (i) elicitation of expert judgments using a reduced set of targeted comparisons, (ii) construction of the consistent pairwise comparison matrix, and (iii) derivation of a global value scale from the resulting matrix. The proposed approach ensures consistency by design, minimizes cognitive effort, and reduces the dimensionality of preference modeling from m(m-1)/2 to m parameters. Furthermore, it is compatible with the classical Deck of Cards method, and thus it can handle interval and ratio scales. We have also developed a web-based tool that demonstrates its practical applicability in real decision-making scenarios.
zh
[AI-26] Measuring What Matters: The AI Pluralism Index
【速读】:该论文旨在解决当前人工智能(AI)系统开发与治理高度集中于少数企业与国家所带来的治理风险,即技术可能嵌入狭隘利益、削弱公众能动性的问题。现有能力基准多聚焦于语言、视觉和编码等性能指标,但缺乏对多元治理(pluralistic governance)的公开可审计评估工具。解决方案的关键在于提出并实现AI多元性指数(AI Pluralism Index, AIPI),这是一个基于证据、透明且可复现的评估框架,从参与式治理、包容性与多样性、透明度和问责制四个维度量化评估AI生产者及其系统家族的多元治理实践;其核心创新在于通过结构化网络与代码库分析、外部评估及专家访谈整合多源证据,并明确处理“未知”证据以提供下限(evidence-based)和仅已知(known-only)得分,从而为政策制定者、采购方和公众提供可比、可靠的评估依据。
链接: https://arxiv.org/abs/2510.08193
作者: Rashid Mushkani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence systems increasingly mediate knowledge, communication, and decision making. Development and governance remain concentrated within a small set of firms and states, raising concerns that technologies may encode narrow interests and limit public agency. Capability benchmarks for language, vision, and coding are common, yet public, auditable measures of pluralistic governance are rare. We define AI pluralism as the degree to which affected stakeholders can shape objectives, data practices, safeguards, and deployment. We present the AI Pluralism Index (AIPI), a transparent, evidence-based instrument that evaluates producers and system families across four pillars: participatory governance, inclusivity and diversity, transparency, and accountability. AIPI codes verifiable practices from public artifacts and independent evaluations, explicitly handling “Unknown” evidence to report both lower-bound (“evidence”) and known-only scores with coverage. We formalize the measurement model; implement a reproducible pipeline that integrates structured web and repository analysis, external assessments, and expert interviews; and assess reliability with inter-rater agreement, coverage reporting, cross-index correlations, and sensitivity analysis. The protocol, codebook, scoring scripts, and evidence graph are maintained openly with versioned releases and a public adjudication process. We report pilot provider results and situate AIPI relative to adjacent transparency, safety, and governance frameworks. The index aims to steer incentives toward pluralistic practice and to equip policymakers, procurers, and the public with comparable evidence.
zh
[AI-27] Leverag ing Whisper Embeddings for Audio-based Lyrics Matching
【速读】:该论文旨在解决音频歌词匹配任务中现有方法存在的可复现性差和基线不一致的问题。其解决方案的关键在于提出一个完全可复现的流水线 WEALY,该方法利用 Whisper 解码器的嵌入(Whisper decoder embeddings)进行歌词匹配,并在此基础上构建了稳健且透明的基线;同时探索了融合文本与声学特征的多模态扩展,通过在标准数据集上的大量实验验证了其性能可媲美缺乏可复现性的最先进方法,从而为音乐信息检索(Music Information Retrieval, MIR)领域提供了一个可靠的基准。
链接: https://arxiv.org/abs/2510.08176
作者: Eleonora Mancini,Joan Serrà,Paolo Torroni,Yuki Mitsufuji
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
Abstract:Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.
zh
[AI-28] Prepared mind fast response: A temporal decoupling framework for adaptive knowledge orchestration in open-domain dialogue
【速读】:该论文旨在解决开放域对话AI系统中延迟与质量之间的权衡问题(latency-quality tradeoff),即在保证响应质量的同时降低延迟。当前方法存在两个局限:轻量级指令模型虽能实现亚秒级延迟,但推理深度不足;而依赖外部工具的ReAct代理虽提升事实准确性,却因同步执行导致交互阻塞。其解决方案的关键在于提出PMFR架构,采用时间解耦框架(temporal decoupling framework)实现异步知识编排(asynchronous knowledge orchestration),通过三个协同组件——知识充分性评估器(Knowledge Adequacy Evaluator)、轻量级响应生成器(Lightweight Response Generator)和异步知识精炼代理(Asynchronous Knowledge Refinement Agent)——在保持对话连续性的同时,以智能触发机制逐步增强知识覆盖范围,从而在显著降低延迟(95.3%)的前提下维持与重型同步基线相当的响应质量(GEval-C: 0.613 vs. 0.620)。
链接: https://arxiv.org/abs/2510.08175
作者: Jinling Gan,Churong Liang,Runnan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The latency-quality tradeoff is a fundamental constraint in open-domain dialogue AI systems, since comprehensive knowledge access necessitates prohibitive response delays. Contemporary approaches offer two inadequate solutions: lightweight instruct models achieve sub-second latency but lack reasoning depth, while tool-augmented ReAct agents enhance factuality through external knowledge at the cost of synchronous execution that blocks interaction during re- trieval processes. PMFR is thus proposed, with a tempo- ral decoupling framework that fundamentally resolves the contradiction through asynchronous knowledge orchestra- tion. PMFR employs three coordinated components: (1) a Knowledge Adequacy Evaluator for real-time sufficiency assessment, (2) a Lightweight Response Generator for imme- diate user interaction, and (3) an Asynchronous Knowledge Refinement Agent for background knowledge enhancement. This architecture maintains continuous conversational flow while progressively enriching knowledge coverage through intelligent triggering mechanisms. Evaluation results on Top- iOCQA demonstrate PMFR outperforms brute-force scaling: PMFR achieves 95.3% latency reduction (23.38s - 1.09s) while preserving response quality comparable to heavyweight synchronous baselines (GEval-C: 0.613 vs. 0.620).
zh
[AI-29] hink Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
【速读】:该论文旨在解决大型语言模型在推理任务中计算资源消耗过高、token效率低的问题。其核心解决方案是提出一种基于香农熵(Shannon entropy)的简单但新颖的框架,利用token级别的对数概率(logprobs)计算熵值作为置信度信号,实现推理过程中的早期终止(early stopping)。关键创新在于发现熵-based置信度校准是现代后训练优化(post-training optimization)下推理模型的涌现特性(emergent property),而传统指令微调或预训练模型(如Llama 3.3 70B)中并不存在;通过少量示例即可一次性确定各模型的熵阈值,从而在保持任务准确率的前提下实现25–50%的计算成本降低,显著提升推理效率。
链接: https://arxiv.org/abs/2510.08146
作者: Aman Sharma,Paras Chopra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they’ve gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
zh
[AI-30] Approximate Domain Unlearning for Vision-Language Models NEURIPS2025
【速读】:该论文旨在解决预训练视觉语言模型(Vision-Language Models, VLMs)在实际应用中因保留无关领域信息而导致的计算效率低下和潜在信息泄露问题,提出了一种新的近似域遗忘(Approximate Domain Unlearning, ADU)任务,要求在不损害其他领域识别性能的前提下,降低模型对特定目标域(如插画图像)的识别准确率。其解决方案的关键在于:针对预训练VLM中领域分布高度纠缠的问题,提出一种显式解耦领域特征表示并自适应捕获实例级领域信息的新方法,从而有效实现细粒度的域遗忘,优于基于传统VLM微调技术的基线方法。
链接: https://arxiv.org/abs/2510.08132
作者: Kodai Kawamura,Yuta Goto,Rintaro Yanagi,Hirokatsu Kataoka,Go Irie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 (Spotlight)
Abstract:Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on class unlearning, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize real cars while avoiding misrecognition of illustrated cars depicted in roadside advertisements as real cars, which could be hazardous. In this paper, we introduce Approximate Domain Unlearning (ADU), a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., illustration) while preserving accuracy for other domains (e.g., real). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments show that our approach outperforms baselines built upon VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs. Code: this https URL.
zh
[AI-31] Bayesian Decision Making around Experts
【速读】:该论文旨在解决复杂学习智能体在与现有专家(如人类操作员或先前训练好的代理)协同工作时,如何最优地利用专家数据的问题,尤其关注专家数据结构可能不同于学习者自身动作-结果经验的情形。解决方案的关键在于:首先,在离线场景中,通过预训练专家数据来紧致信息论 regret 上界,其改进幅度由专家数据与最优动作之间的互信息决定;其次,在同步场景中,提出一种基于信息导向的规则,即每一步选择能最大化关于最优动作的一步信息增益的数据源进行更新;最后,设计策略使学习者能够判断何时信任专家、何时不信任,从而在专家无效或被破坏时保护学习者性能。整体框架以信息论为基础,为智能体提供可实践的决策机制,以智能选择何时从他人学习。
链接: https://arxiv.org/abs/2510.08113
作者: Daniel Jarne Ornia,Joel Dyer,Nicholas Bishop,Anisoara Calinescu,Michael Wooldridge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how should learners optimally incorporate certain forms of expert data, which may differ in structure from the learner’s own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert’s optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner’s posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes their one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner for the cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.
zh
[AI-32] Development of Mental Models in Human-AI Collaboration: A Conceptual Framework
【速读】:该论文旨在解决当前人机协作(Human-AI Collaboration)研究中忽视决策者心智模型动态演化的问题,即现有文献多聚焦于AI代理设计与协作机制配置,通常假设人类决策者是静态不变的。其解决方案的关键在于提出一个整合的社会技术框架,识别出驱动心智模型演化的三个核心机制:数据情境化(data contextualization)、推理透明性(reasoning transparency)和绩效反馈(performance feedback),并据此构建三种互补且相互依赖的心智模型——领域心智模型(domain mental model)、信息处理心智模型(information processing mental model)以及互补意识心智模型(complementarity-awareness mental model),从而实现对人机协作中人类认知结构的系统性设计与优化。
链接: https://arxiv.org/abs/2510.08104
作者: Joshua Holstein,Gerhard Satzger
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Preprint version. Accepted for presentation at the International Conference on Information Systems (ICIS 2025). Please cite the published version when available
Abstract:Artificial intelligence has become integral to organizational decision-making and while research has explored many facets of this human-AI collaboration, the focus has mainly been on designing the AI agent(s) and the way the collaboration is set up - generally assuming a human decision-maker to be “fixed”. However, it has largely been neglected that decision-makers’ mental models evolve through their continuous interaction with AI systems. This paper addresses this gap by conceptualizing how the design of human-AI collaboration influences the development of three complementary and interdependent mental models necessary for this collaboration. We develop an integrated socio-technical framework that identifies the mechanisms driving the mental model evolution: data contextualization, reasoning transparency, and performance feedback. Our work advances human-AI collaboration literature through three key contributions: introducing three distinct mental models (domain, information processing, complementarity-awareness); recognizing the dynamic nature of mental models; and establishing mechanisms that guide the purposeful design of effective human-AI collaboration.
zh
[AI-33] From Ethical Declarations to Provable Independence: An Ontology-Driven Optimal-Transport Framework for Certifiably Fair AI Systems
【速读】:该论文旨在解决当前偏见缓解方法无法彻底消除敏感信息及其代理变量(proxy)导致的不公平问题。其解决方案的关键在于:首先,利用OWL 2 QL中的本体工程形式化定义敏感属性,并通过逻辑推理推断出所有潜在的代理变量,构建捕获偏见模式全结构的σ代数G;其次,采用Delbaen-Majumdar最优传输方法生成与G独立且最小化L2距离的公平表示,从而保证真正的统计独立性而非仅相关性减弱;最终,通过将偏见建模为σ代数间的依赖关系、将本体知识编译为可测结构,并以最优传输作为唯一公平变换,实现贷款审批等任务中可认证的数学上严格公平的人工智能。
链接: https://arxiv.org/abs/2510.08086
作者: Sukriti Bhattacharya,Chitro Majumdar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures
Abstract:This paper presents a framework for provably fair AI that overcomes the limits of current bias mitigation methods by systematically removing all sensitive information and its proxies. Using ontology engineering in OWL 2 QL, it formally defines sensitive attributes and infers their proxies through logical reasoning, constructing a sigma algebra G that captures the full structure of biased patterns. Fair representations are then obtained via Delbaen Majumdar optimal transport, which generates variables independent of G while minimizing L2 distance to preserve accuracy. This guarantees true independence rather than mere decorrelation. By modeling bias as dependence between sigma algebras, compiling ontological knowledge into measurable structures, and using optimal transport as the unique fair transformation, the approach ensures complete fairness in tasks like loan approval, where proxies such as ZIP code reveal race. The result is a certifiable and mathematically grounded method for trustworthy AI.
zh
[AI-34] A Novel Ensemble Learning Approach for Enhanced IoT Attack Detection: Redefining Security Paradigms in Connected Systems
【速读】:该论文旨在解决物联网(IoT)设备因广泛互联而面临的严重安全漏洞问题,这些漏洞使得IoT系统更容易遭受复杂网络攻击。解决方案的关键在于提出一种新颖的集成学习架构,核心是采用Extra Trees Classifier(极端随机树分类器)结合全面的数据预处理和超参数优化策略,在多个基准数据集(如CICIoT2023、IoTID20、BotNeTIoT L01等)上进行验证,实现了高召回率、高准确率与高精确度,同时误差极低,显著优于现有方法,为构建高效且可扩展的IoT安全防护体系提供了坚实基础。
链接: https://arxiv.org/abs/2510.08084
作者: Hikmat A. M. Abdeljaber,Md. Alamgir Hossain,Sultan Ahmad,Ahmed Alsanad,Md Alimul Haque,Sudan Jha,Jabeen Nazeer
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 5 fiugres, 7 tables
Abstract:The rapid expansion of Internet of Things (IoT) devices has transformed industries and daily life by enabling widespread connectivity and data exchange. However, this increased interconnection has introduced serious security vulnerabilities, making IoT systems more exposed to sophisticated cyber attacks. This study presents a novel ensemble learning architecture designed to improve IoT attack detection. The proposed approach applies advanced machine learning techniques, specifically the Extra Trees Classifier, along with thorough preprocessing and hyperparameter optimization. It is evaluated on several benchmark datasets including CICIoT2023, IoTID20, BotNeTIoT L01, ToN IoT, N BaIoT, and BoT IoT. The results show excellent performance, achieving high recall, accuracy, and precision with very low error rates. These outcomes demonstrate the model efficiency and superiority compared to existing approaches, providing an effective and scalable method for securing IoT environments. This research establishes a solid foundation for future progress in protecting connected devices from evolving cyber threats.
zh
[AI-35] Multi-Condition Conformal Selection
【速读】:该论文旨在解决资源受限场景下(如药物发现、精准医学和大语言模型对齐)从大规模数据集中筛选高质量候选样本的问题,尤其针对传统校准选择方法仅适用于单阈值场景(即 y > c)而无法满足实际中多条件选择需求(如合取或析取条件)的局限性。解决方案的关键在于提出多条件校准选择(Multi-Condition Conformal Selection, MCCS)算法:一方面引入具有区域单调性的非一致性评分以适配合取条件;另一方面采用全局Benjamini-Hochberg (BH)过程处理析取条件,从而在有限样本下实现严格的事后错误发现率(False Discovery Rate, FDR)控制,并提供理论保障。
链接: https://arxiv.org/abs/2510.08075
作者: Qingyang Hao,Wenbo Liao,Bingyi Jing,Hongxin Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Selecting high-quality candidates from large-scale datasets is critically important in resource-constrained applications such as drug discovery, precision medicine, and the alignment of large language models. While conformal selection methods offer a rigorous solution with False Discovery Rate (FDR) control, their applicability is confined to single-threshold scenarios (i.e., y c) and overlooks practical needs for multi-condition selection, such as conjunctive or disjunctive conditions. In this work, we propose the Multi-Condition Conformal Selection (MCCS) algorithm, which extends conformal selection to scenarios with multiple conditions. In particular, we introduce a novel nonconformity score with regional monotonicity for conjunctive conditions and a global Benjamini-Hochberg (BH) procedure for disjunctive conditions, thereby establishing finite-sample FDR control with theoretical guarantees. The integration of these components enables the proposed method to achieve rigorous FDR-controlled selection in various multi-condition environments. Extensive experiments validate the superiority of MCCS over baselines, its generalizability across diverse condition combinations, different real-world modalities, and multi-task scalability.
zh
[AI-36] Attribution-by-design: Ensuring Inference-Time Provenance in Generative Music Systems
【速读】:该论文旨在解决生成式 AI(Generative AI)音乐兴起背景下,现有版税分配机制因缺乏可追溯性与透明度而导致的艺术家报酬池稀释问题,以及传统许可协议在可扩展性和技术严谨性上的不足。其解决方案的关键在于构建以直接溯源(direct attribution)为核心的生成音乐基础设施,明确区分训练集(training set)与推理集(inference set),并提出两种互补的溯源方式:训练期溯源和推理期溯源;其中更强调推理期溯源,即当艺术家作品被用作条件输入生成新内容时,可实现可验证的即时补偿,从而保障创作者权益,并提升用户对使用权限与来源信息的透明认知,从根本上嵌入公平性与可追溯性于生成系统设计中。
链接: https://arxiv.org/abs/2510.08062
作者: Fabio Morreale,Wiebke Hutiri,Joan Serrà,Alice Xiang,Yuki Mitsufuji
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:The rise of AI-generated music is diluting royalty pools and revealing structural flaws in existing remuneration frameworks, challenging the well-established artist compensation systems in the music industry. Existing compensation solutions, such as piecemeal licensing agreements, lack scalability and technical rigour, while current data attribution mechanisms provide only uncertain estimates and are rarely implemented in practice. This paper introduces a framework for a generative music infrastructure centred on direct attribution, transparent royalty distribution, and granular control for artists and rights’ holders. We distinguish ontologically between the training set and the inference set, which allows us to propose two complementary forms of attribution: training-time attribution and inference-time attribution. We here favour inference-time attribution, as it enables direct, verifiable compensation whenever an artist’s catalogue is used to condition a generated output. Besides, users benefit from the ability to condition generations on specific songs and receive transparent information about attribution and permitted usage. Our approach offers an ethical and practical solution to the pressing need for robust compensation mechanisms in the era of AI-generated music, ensuring that provenance and fairness are embedded at the core of generative systems.
zh
[AI-37] LinguaSim: Interactive Multi-Vehicle Testing Scenario Generation via Natural Language Instruction Based on Large Language Models
【速读】:该论文旨在解决自动驾驶车辆测试与训练场景生成中面临的两大挑战:一是如何在保持自然语言指令准确执行的前提下提升场景的现实感,二是如何实现动态、交互式的三维(3D)封闭环路仿真,而非受限于二维或开环模拟。当前基于大语言模型(Large Language Models, LLMs)的方法往往牺牲真实驾驶环境的复杂性以降低描述复杂度,导致生成场景缺乏背景车辆的互动行为。解决方案的关键在于提出LinguaSim框架,该框架利用LLM将自然语言转化为包含动态车辆交互的3D场景,并引入反馈校准模块对生成结果进行精细化调整,从而确保输入描述与输出场景的高度一致性。此外,LinguaSim通过结合场景描述和自动驾驶模型共同约束对抗性车辆行为,有效提升了场景的真实性与安全性,实验表明其可生成符合不同语义意图的高保真场景,且校准后事故率显著下降(从46.9%降至6.3%)。
链接: https://arxiv.org/abs/2510.08046
作者: Qingyuan Shi,Qingwen Meng,Hao Cheng,Qing Xu,Jianqiang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The generation of testing and training scenarios for autonomous vehicles has drawn significant attention. While Large Language Models (LLMs) have enabled new scenario generation methods, current methods struggle to balance command adherence accuracy with the realism of real-world driving environments. To reduce scenario description complexity, these methods often compromise realism by limiting scenarios to 2D, or open-loop simulations where background vehicles follow predefined, non-interactive behaviors. We propose LinguaSim, an LLM-based framework that converts natural language into realistic, interactive 3D scenarios, ensuring both dynamic vehicle interactions and faithful alignment between the input descriptions and the generated scenarios. A feedback calibration module further refines the generation precision, improving fidelity to user intent. By bridging the gap between natural language and closed-loop, interactive simulations, LinguaSim constrains adversarial vehicle behaviors using both the scenario description and the autonomous driving model guiding them. This framework facilitates the creation of high-fidelity scenarios that enhance safety testing and training. Experiments show LinguaSim can generate scenarios with varying criticality aligned with different natural language descriptions (ACT: 0.072 s for dangerous vs. 3.532 s for safe descriptions; comfortability: 0.654 vs. 0.764), and its refinement module effectively reduces excessive aggressiveness in LinguaSim’s initial outputs, lowering the crash rate from 46.9% to 6.3% to better match user intentions.
zh
[AI-38] Verifying Graph Neural Networks with Readout is Intractable
【速读】:该论文旨在解决量化图神经网络(Quantized Graph Neural Networks, GNNs)的验证难题,特别是针对具有全局读出机制的聚合-组合图神经网络(Aggregate-Combine Readout GNNs, ACR-GNNs)的逻辑形式化与复杂度分析问题。解决方案的关键在于提出一种用于推理此类模型的逻辑语言,并基于该逻辑语言建立形式化表征,从而证明量化GNN的验证任务在计算复杂度上属于(co)NEXPTIME完全问题,揭示其计算不可行性;同时通过实验验证了量化后的ACR-GNN模型在保持良好准确率和泛化能力的前提下具备轻量级特性,为实际部署中的安全性保障提供了理论依据与实践参考。
链接: https://arxiv.org/abs/2510.08045
作者: Artem Chernobrovkin,Marco Sälzer,François Schwarzentruber,Nicolas Troquard
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
备注:
Abstract:We introduce a logical language for reasoning about quantized aggregate-combine graph neural networks with global readout (ACR-GNNs). We provide a logical characterization and use it to prove that verification tasks for quantized GNNs with readout are (co)NEXPTIME-complete. This result implies that the verification of quantized GNNs is computationally intractable, prompting substantial research efforts toward ensuring the safety of GNN-based systems. We also experimentally demonstrate that quantized ACR-GNN models are lightweight while maintaining good accuracy and generalization capabilities with respect to non-quantized models.
zh
[AI-39] owards Reliable LLM -based Robot Planning via Combined Uncertainty Estimation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在机器人具身规划中因幻觉(hallucination)导致的高自信但可能不准确或不安全计划问题。现有方法虽尝试通过不确定性估计提升LLM规划可靠性,但未能有效区分认知不确定性(epistemic uncertainty)与内在不确定性(intrinsic uncertainty),从而限制了估计效果。解决方案的关键在于提出一种联合不确定性估计方法(Combined Uncertainty estimation for Reliable Embodied planning, CURE),将总不确定性分解为认知不确定性和内在不确定性,并进一步将认知不确定性细分为任务清晰度(task clarity)和任务熟悉度(task familiarity),以实现更精确的评估;同时利用随机网络蒸馏(random network distillation)和基于LLM特征驱动的多层感知机回归头(multi-layer perceptron regression heads)分别估计两类不确定性,最终使不确定性预测与实际执行结果更加一致。
链接: https://arxiv.org/abs/2510.08044
作者: Shiyuan Yin,Chenjia Bai,Zihao Zhang,Junwei Jin,Xinxin Zhang,Chi Zhang,Xuelong Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. However, LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. While researchers have explored uncertainty estimation to improve the reliability of LLM-based planning, existing studies have not sufficiently differentiated between epistemic and intrinsic uncertainty, limiting the effectiveness of uncertainty esti- mation. In this paper, we present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately. Furthermore, epistemic uncertainty is subdivided into task clarity and task familiarity for more accurate evaluation. The overall uncertainty assessments are obtained using random network distillation and multi-layer perceptron regression heads driven by LLM features. We validated our approach in two distinct experimental settings: kitchen manipulation and tabletop rearrangement experiments. The results show that, compared to existing methods, our approach yields uncertainty estimates that are more closely aligned with the actual execution outcomes.
zh
[AI-40] AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models AAAI2026
【速读】:该论文旨在解决低秩适应(Low-Rank Adaptation, LoRA)在实际应用中面临的性能次优和收敛速度慢的问题。其核心解决方案是提出一种名为AILoRA的新方法,关键在于引入功能感知的非对称低秩先验:通过分析自注意力机制中查询(WQ)和值(WV)投影矩阵的功能差异——前者对下游任务敏感、承载任务特异性语义空间信息,后者则保持跨任务和层间稳定、编码token级特征表示——从而设计不对称初始化策略:将WQ的主成分注入LoRA模块以保留任务适应能力,同时将WV的次成分注入以维持泛化特征表示。这一策略显著提升了微调性能与收敛效率。
链接: https://arxiv.org/abs/2510.08034
作者: Xiaoshuang Ji,Zhendong Zhao,Xiaoyan Gu,Xiaojun Chen,Xin Zhao,Zeyao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to AAAI2026
Abstract:Parameter-efficient finetuning (PEFT) aims to mitigate the substantial computational and memory overhead involved in adapting large-scale pretrained models to diverse downstream tasks. Among numerous PEFT strategies, Low-Rank Adaptation (LoRA) has emerged as one of the most widely adopted approaches due to its robust empirical performance and low implementation complexity. In practical deployment, LoRA is typically applied to the W^Q and W^V projection matrices of self-attention modules, enabling an effective trade-off between model performance and parameter efficiency. While LoRA has achieved considerable empirical success, it still encounters challenges such as suboptimal performance and slow convergence. To address these limitations, we introduce \textbfAILoRA, a novel parameter-efficient method that incorporates function-aware asymmetric low-rank priors. Our empirical analysis reveals that the projection matrices W^Q and W^V in the self-attention mechanism exhibit distinct parameter characteristics, stemming from their functional differences. Specifically, W^Q captures task-specific semantic space knowledge essential for attention distributions computation, making its parameters highly sensitive to downstream task variations. In contrast, W^V encodes token-level feature representations that tend to remain stable across tasks and layers. Leveraging these insights, AILoRA performs a function-aware initialization by injecting the principal components of W^Q to retain task-adaptive capacity, and the minor components of W^V to preserve generalizable feature representations. This asymmetric initialization strategy enables LoRA modules to better capture the specialized roles of attention parameters, thereby enhancing both finetuning performance and convergence efficiency.
zh
[AI-41] PEAR: Phase Entropy Aware Reward for Efficient Reasoning
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂推理任务中生成的链式思维(Chain-of-Thought, CoT)解释过于冗长的问题,即如何在不牺牲推理准确性的情况下有效控制输出长度。其解决方案的关键在于提出一种基于阶段熵感知奖励机制(Phase Entropy Aware Reward, PEAR),该机制通过区分推理过程中的不同阶段(思考阶段与最终答案阶段)来动态调节熵惩罚:在思考阶段施加熵惩罚以抑制过度探索导致的冗余推理,在最终答案阶段允许适度熵以保留解决问题所需的灵活性。这一设计使得模型能够自适应地生成更简洁但依然有效的推理路径,而无需依赖显式的长度约束或刚性截断规则。
链接: https://arxiv.org/abs/2510.08026
作者: Chen Huang,Wei Lu,Wenxuan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures
Abstract:Large Reasoning Models (LRMs) have achieved impressive performance on complex reasoning tasks by generating detailed chain-of-thought (CoT) explanations. However, these responses are often excessively long, containing redundant reasoning steps that inflate inference cost and reduce usability. Controlling the length of generated reasoning without sacrificing accuracy remains an open challenge. Through a systematic empirical analysis, we reveal a consistent positive correlation between model entropy and response length at different reasoning stages across diverse LRMs: the thinking phase exhibits higher entropy, reflecting exploratory behavior of longer responses, while the final answer phase shows lower entropy, indicating a more deterministic this http URL observation suggests that entropy at different reasoning stages can serve as a control knob for balancing conciseness and performance. Based on this insight, this paper introduces Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporating phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalize excessive entropy during the thinking phase and allowing moderate exploration at the final answer phase, which encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly. This enables adaptive control of response length without relying on explicit length targets or rigid truncation rules. Extensive experiments across four benchmarks demonstrate that PEAR consistently reduces response length while sustaining competitive accuracy across model scales. In addition, PEAR demonstrates strong out-of-distribution (OOD) robustness beyond the training distribution. Our code is available at: this https URL.
zh
[AI-42] FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset
【速读】:该论文旨在解决当前数据驱动的机器人操作学习中对大规模、高质量专家示范数据集的依赖问题,特别是现有数据集在可扩展性(scalability)、轨迹平滑性(trajectory smoothness)以及跨不同机器人本体(robotic embodiment)的适用性方面存在的局限。其解决方案的关键在于提出FastUMI-100K——一个基于新型模块化、硬件解耦机械设计与轻量级追踪系统的机器人系统所收集的大规模多模态示范数据集,该数据集包含超过10万条来自典型家庭环境的示范轨迹,覆盖54项任务和数百种物体类型,并融合末端执行器状态、多视角腕装鱼眼图像及文本标注等多模态信息,显著提升了数据的多样性、鲁棒性和实际应用能力。
链接: https://arxiv.org/abs/2510.08022
作者: Kehui Liu,Zhongjie Jia,Yang Li,Zhaxizhuoma,Pengan Chen,Song Liu,Xin Liu,Pingrui Zhang,Haoming Song,Xinyi Ye,Nieqing Cao,Zhigang Wang,Jia Zeng,Dong Wang,Yan Ding,Bin Zhao,Xuelong Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Data-driven robotic manipulation learning depends on large-scale, high-quality expert demonstration datasets. However, existing datasets, which primarily rely on human teleoperated robot collection, are limited in terms of scalability, trajectory smoothness, and applicability across different robotic embodiments in real-world environments. In this paper, we present FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset, designed to overcome these limitations and meet the growing complexity of real-world manipulation tasks. Collected by FastUMI, a novel robotic system featuring a modular, hardware-decoupled mechanical design and an integrated lightweight tracking system, FastUMI-100K offers a more scalable, flexible, and adaptable solution to fulfill the diverse requirements of real-world robot demonstration data. Specifically, FastUMI-100K contains over 100K+ demonstration trajectories collected across representative household environments, covering 54 tasks and hundreds of object types. Our dataset integrates multimodal streams, including end-effector states, multi-view wrist-mounted fisheye images and textual annotations. Each trajectory has a length ranging from 120 to 500 frames. Experimental results demonstrate that FastUMI-100K enables high policy success rates across various baseline algorithms, confirming its robustness, adaptability, and real-world applicability for solving complex, dynamic manipulation challenges. The source code and dataset will be released in this link this https URL.
zh
[AI-43] Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses
【速读】:该论文旨在解决模型合并(Model Merging, MM)过程中存在的后门攻击安全风险问题,即攻击者通过在单个微调模型中植入隐藏触发器(trigger),使得最终合并模型在推理阶段受控。其核心解决方案是提出一种基于任务向量(task vector)视角的分析框架,将后门攻击视为一个可计算的向量——后门向量(Backdoor Vector, BV),用于揭示攻击机制并衡量攻击相似性与迁移能力。关键创新在于:1)提出稀疏后门向量(Sparse Backdoor Vector, SBV)方法,通过融合多个攻击生成更强的后门效果,首次证明合并可提升后门有效性;2)设计无需假设的防御机制——注入后门向量减法(Injection BV Subtraction, IBVS),直接从权重层面消除潜在后门,实现轻量且通用的防御。
链接: https://arxiv.org/abs/2510.08016
作者: Stanisław Pawlak,Jan Dubiński,Daniel Marczak,Bartłomiej Twardowski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 22 pages, 13 figures, 15 tables
Abstract:Model merging (MM) recently emerged as an effective method for combining large deep learning models. However, it poses significant security risks. Recent research shows that it is highly susceptible to backdoor attacks, which introduce a hidden trigger into a single fine-tuned model instance that allows the adversary to control the output of the final merged model at inference time. In this work, we propose a simple framework for understanding backdoor attacks by treating the attack itself as a task vector. Backdoor\ Vector\ (BV) is calculated as the difference between the weights of a fine-tuned backdoored model and fine-tuned clean model. BVs reveal new insights into attacks understanding and a more effective framework to measure their similarity and transferability. Furthermore, we propose a novel method that enhances backdoor resilience through merging dubbed Sparse\ Backdoor\ Vector\ (SBV) that combines multiple attacks into a single one. We identify the core vulnerability behind backdoor threats in MM: inherent\ triggers that exploit adversarial weaknesses in the base model. To counter this, we propose Injection\ BV\ Subtraction\ (IBVS) - an assumption-free defense against backdoors in MM. Our results show that SBVs surpass prior attacks and is the first method to leverage merging to improve backdoor effectiveness. At the same time, IBVS provides a lightweight, general defense that remains effective even when the backdoor threat is entirely unknown.
zh
[AI-44] Language Models Do Not Embed Numbers Continuously
【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在嵌入空间中是否真正将连续数值表示为连续结构,而非离散或噪声主导的表示。此前研究虽表明可通过嵌入重构原始数值,但未验证其是否具备连续性建模能力。论文的关键解决方案在于利用嵌入空间的预期属性(如线性重建能力和主成分分析),系统评估了来自OpenAI、Google Gemini和Voyage AI的多个模型对数值输入的嵌入表示特性。结果发现,尽管重建精度高(R² ≥ 0.95),但主成分仅解释嵌入空间中极小部分方差,说明多数嵌入维度与数值输入空间正交,且随着小数位数增加,重建性能和方差解释率显著下降——这揭示了LLMs对数值的表示本质上是非连续且受噪声干扰的,这对依赖高精度数值处理的应用场景具有重要影响。
链接: https://arxiv.org/abs/2510.08009
作者: Alex O. Davies,Roussel Nzoyem,Nirav Ajmeri,Telmo M. Silva Filho
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 10 figures, 3 tables
Abstract:Recent research has extensively studied how large language models manipulate integers in specific arithmetic tasks, and on a more fundamental level, how they represent numeric values. These previous works have found that language model embeddings can be used to reconstruct the original values, however, they do not evaluate whether language models actually model continuous values as continuous. Using expected properties of the embedding space, including linear reconstruction and principal component analysis, we show that language models not only represent numeric spaces as non-continuous but also introduce significant noise. Using models from three major providers (OpenAI, Google Gemini and Voyage AI), we show that while reconstruction is possible with high fidelity ( R^2 \geq 0.95 ), principal components only explain a minor share of variation within the embedding space. This indicates that many components within the embedding space are orthogonal to the simple numeric input space. Further, both linear reconstruction and explained variance suffer with increasing decimal precision, despite the ordinal nature of the input space being fundamentally unchanged. The findings of this work therefore have implications for the many areas where embedding models are used, in-particular where high numerical precision, large magnitudes or mixed-sign values are common.
zh
[AI-45] Past Present and Future of Bug Tracking in the Generative AI Era
【速读】:该论文旨在解决传统缺陷跟踪系统中因人工流程繁杂、协作效率低下及响应延迟导致的“时间至修复(time-to-fix)”过长和人力投入过多的问题。其核心解决方案是构建一个基于生成式 AI(Generative AI)的智能缺陷跟踪框架,利用大语言模型(Large Language Model, LLM)驱动自动化流程,在用户以自然语言提交问题后,由AI代理完成报告精炼、复现尝试、缺失信息补全、分类筛选、无效问题的无代码修复、有效问题定位与分配,以及候选补丁生成等环节,从而显著缩短修复周期并降低人工干预强度,实现更高效、协同性强且以用户为中心的软件维护体系。
链接: https://arxiv.org/abs/2510.08005
作者: Utku Boran Torun,Mehmet Taha Demircan,Mahmut Furkan Gön,Eray Tüzün
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Submitted to ACM TOSEM Special Issue: 2030 Software Engineering Roadmap
Abstract:Traditional bug tracking systems rely heavily on manual reporting, reproduction, triaging, and resolution, each carried out by different stakeholders such as end users, customer support, developers, and testers. This division of responsibilities requires significant coordination and widens the communication gap between non-technical users and technical teams, slowing the process from bug discovery to resolution. Moreover, current systems are highly asynchronous; users often wait hours or days for a first response, delaying fixes and contributing to frustration. This paper examines the evolution of bug tracking, from early paper-based reporting to today’s web-based and SaaS platforms. Building on this trajectory, we propose an AI-powered bug tracking framework that augments existing tools with intelligent, large language model (LLM)-driven automation. Our framework addresses two main challenges: reducing time-to-fix and minimizing human overhead. Users report issues in natural language, while AI agents refine reports, attempt reproduction, and request missing details. Reports are then classified, invalid ones resolved through no-code fixes, and valid ones localized and assigned to developers. LLMs also generate candidate patches, with human oversight ensuring correctness. By integrating automation into each phase, our framework accelerates response times, improves collaboration, and strengthens software maintenance practices for a more efficient, user-centric future.
zh
[AI-46] ReInAgent : A Context-Aware GUI Agent Enabling Human-in-the-Loop Mobile Task Navigation
【速读】:该论文旨在解决现有移动图形用户界面(GUI)代理在执行用户任务时过度依赖自主操作、忽视用户主动参与的问题,从而导致在面对模糊性、动态变化或冲突性的任务场景时适应能力不足,难以满足真实用户需求。其解决方案的关键在于提出ReInAgent框架,该框架基于上下文感知的多智能体架构,通过共享记忆模块集成三个专用智能体:信息管理智能体(负责基于槽位的信息管理和主动用户交互)、决策智能体(实现冲突感知的规划)以及反思智能体(进行任务反思与信息一致性验证),借助持续的上下文分析和人机协同机制,显著提升了复杂现实场景下移动任务导航的适应性和可靠性。
链接: https://arxiv.org/abs/2510.07988
作者: Haitao Jia,Ming He,Zimo Yin,Likang Wu,Jianping Fan,Jitao Sang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mobile GUI agents exhibit substantial potential to facilitate and automate the execution of user tasks on mobile phones. However, exist mobile GUI agents predominantly privilege autonomous operation and neglect the necessity of active user engagement during task execution. This omission undermines their adaptability to information dilemmas including ambiguous, dynamically evolving, and conflicting task scenarios, leading to execution outcomes that deviate from genuine user requirements and preferences. To address these shortcomings, we propose ReInAgent, a context-aware multi-agent framework that leverages dynamic information management to enable human-in-the-loop mobile task navigation. ReInAgent integrates three specialized agents around a shared memory module: an information-managing agent for slot-based information management and proactive interaction with the user, a decision-making agent for conflict-aware planning, and a reflecting agent for task reflection and information consistency validation. Through continuous contextual information analysis and sustained user-agent collaboration, ReInAgent overcomes the limitation of existing approaches that rely on clear and static task assumptions. Consequently, it enables more adaptive and reliable mobile task navigation in complex, real-world scenarios. Experimental results demonstrate that ReInAgent effectively resolves information dilemmas and produces outcomes that are more closely aligned with genuine user preferences. Notably, on complex tasks involving information dilemmas, ReInAgent achieves a 25% higher success rate than Mobile-Agent-v2.
zh
[AI-47] Fewer Weights More Problems: A Practical Attack on LLM Pruning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段通过剪枝(Model Pruning)进行压缩时所引发的安全漏洞问题。现有研究表明,剪枝可有效降低模型内存占用并提升推理效率,但其潜在的恶意利用风险尚未被充分探讨。论文首次揭示了现代剪枝方法可能被攻击者恶意利用:攻击者可构造一个看似无害的模型,在剪枝后却表现出恶意行为(如越狱、拒绝执行指令或注入特定内容)。解决方案的关键在于,攻击者通过计算一个代理指标(proxy metric)来估计每个参数被剪枝的概率,从而将恶意逻辑注入那些低概率被剪枝的参数中,并利用高概率被剪枝的参数“修复”模型,使其在未剪枝状态下表现正常,而在剪枝后释放恶意功能。这一机制暴露了模型压缩阶段存在的关键部署时安全缺口。
链接: https://arxiv.org/abs/2510.07985
作者: Kazuki Egashira,Robin Staab,Thibaud Gloaguen,Mark Vero,Martin Vechev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning in vLLM are applied (Magnitude, Wanda, and SparseGPT), it consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
zh
[AI-48] ZeroCard: Cardinality Estimation with Zero Dependence on Target Databases – No Data No Query No Retraining
【速读】:该论文旨在解决现有基于学习的基数估计方法在实际数据库场景中难以泛化的问题,这些问题主要源于模型对原始数据、查询日志或目标数据库重新训练的强依赖性。解决方案的关键在于提出ZeroCard,一种完全依赖模式语义(schema semantics)的基数估计方法,其核心创新包括:利用模式语义预测数据分布以消除对原始数据的依赖;引入与查询模板无关的表示方法以降低对特定查询的依赖;并通过在真实表上构建的大规模查询数据集进行预训练,使模型能够从模式语义和谓词表示中学习基数规律,从而在无需微调的情况下实现即插即用的部署。
链接: https://arxiv.org/abs/2510.07983
作者: Xianghong Xu,Rong Kang,Xiao He,Lei Zhang,Jianjun Chen,Tieying Zhang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:Cardinality estimation is a fundamental task in database systems and plays a critical role in query optimization. Despite significant advances in learning-based cardinality estimation methods, most existing approaches remain difficult to generalize to new datasets due to their strong dependence on raw data or queries, thus limiting their practicality in real scenarios. To overcome these challenges, we argue that semantics in the schema may benefit cardinality estimation, and leveraging such semantics may alleviate these dependencies. To this end, we introduce ZeroCard, the first semantics-driven cardinality estimation method that can be applied without any dependence on raw data access, query logs, or retraining on the target database. Specifically, we propose to predict data distributions using schema semantics, thereby avoiding raw data dependence. Then, we introduce a query template-agnostic representation method to alleviate query dependence. Finally, we construct a large-scale query dataset derived from real-world tables and pretrain ZeroCard on it, enabling it to learn cardinality from schema semantics and predicate representations. After pretraining, ZeroCard’s parameters can be frozen and applied in an off-the-shelf manner. We conduct extensive experiments to demonstrate the distinct advantages of ZeroCard and show its practical applications in query optimization. Its zero-dependence property significantly facilitates deployment in real-world scenarios.
zh
[AI-49] Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training NEURIPS2025
【速读】:该论文旨在解决去中心化训练(decentralized training)相较于集中式训练(centralized training)性能下降的问题,尤其是多轮Gossip通信机制(Multi-Gossip Steps, MGS)在理论上为何有效、以及是否能完全消除两者之间的性能差距这一关键问题。解决方案的关键在于通过稳定性分析推导出MGS的泛化误差和过拟合误差上界,证明MGS可指数级降低优化误差,从而显著提升泛化性能;同时指出即使MGS趋于无穷,去中心化训练仍存在不可忽略的泛化误差差距(具体为 O(Tcβ+1cβ/nm) vs. O(T2cβ+22cβ/nm2cβ+21)),并首次在非凸设定下、无需有界梯度假设的前提下,统一分析了学习率、数据异质性、节点数量、每节点样本量及通信拓扑对MGS泛化能力的影响,填补了去中心化训练理论研究的重要空白。
链接: https://arxiv.org/abs/2510.07980
作者: Qinglun Li,Yingqi Liu,Miao Zhang,Xiaochun Cao,Quanjun Yin,Li Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: This paper has been accepted by NeurIPS 2025 (Spotlight)
Abstract:Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing experiment performance gaps. However, the theoretical reasons for its effectiveness and whether this gap can be fully eliminated by MGS remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. 1). Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. 2). Gap to Centralization: Even as MGS approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ( \mathcalO(T^\fracc\betac\beta +1/n m) in centralized and \mathcalO(T^\frac2c\beta2c\beta +2/n m^\frac12c\beta +2) in decentralized). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, promising experiments on CIFAR datasets support our theoretical findings.
zh
[AI-50] Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation
【速读】:该论文旨在解决机器人在非结构化环境中实现精确且泛化操作的核心挑战,即如何弥合视觉语言模型(Vision-Language Models, VLMs)高阶语义理解与物理执行之间的“语义到物理”鸿沟。解决方案的关键在于提出GRACE框架,其通过可执行分析概念(Executable Analytic Concepts, EAC)将VLM的推理结果进行物理具身化——EAC是一组数学定义的蓝图,编码了物体的可操作性(affordance)、几何约束和操作语义信息;该框架进一步构建结构化的策略骨架流程,将自然语言指令和视觉信息转化为实例化的EAC,并从中推导出抓取位姿、受力方向及物理可行的运动轨迹,从而建立从高层语义理解到底层机器人控制的统一且可解释的接口,显著提升了操作的精度与泛化能力。
链接: https://arxiv.org/abs/2510.07975
作者: Mingyang Sun,Jiude Wei,Qichen He,Donglin Wang,Cewu Lu,Jianhua Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Enabling robots to perform precise and generalized manipulation in unstructured environments remains a fundamental challenge in embodied AI. While Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning, a significant gap persists between their high-level understanding and the precise physical execution required for real-world manipulation. To bridge this “semantic-to-physical” gap, we introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts (EAC)-mathematically defined blueprints that encode object affordances, geometric constraints, and semantics of manipulation. Our approach integrates a structured policy scaffolding pipeline that turn natural language instructions and visual information into an instantiated EAC, from which we derive grasp poses, force directions and plan physically feasible motion trajectory for robot execution. GRACE thus provides a unified and interpretable interface between high-level instruction understanding and low-level robot control, effectively enabling precise and generalizable manipulation through semantic-physical grounding. Extensive experiments demonstrate that GRACE achieves strong zero-shot generalization across a variety of articulated objects in both simulated and real-world environments, without requiring task-specific training.
zh
[AI-51] aoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在电商搜索相关性排序任务中,因现有训练范式存在局限而导致的泛化能力弱、推理过程缺乏细粒度监督以及逻辑一致性不足的问题。具体而言,监督微调(Supervised Fine-Tuning, SFT)和直接偏好优化(Direct Preference Optimization, DPO)难以处理长尾查询且缺乏逐步骤的规则对齐监督;而基于验证奖励的强化学习(Reinforcement Learning with Verification Rewards, RLVR)则面临反馈稀疏问题,无法有效纠正中间推理错误,从而影响复杂推理场景下的性能。其解决方案的核心是提出分步混合检验强化学习框架(Stepwise Hybrid Examination Reinforcement Learning, TaoSR-SHE),其中关键创新为分步奖励策略优化(Stepwise Reward Policy Optimization, SRPO)——该算法融合高质量生成式分步奖励模型与人工标注离线验证器,优先从关键正确与错误推理步骤中学习,显著提升推理质量和逻辑一致性;此外,结合多样化数据过滤机制以缓解策略熵坍塌,并引入多阶段课程学习促进能力渐进式增长,最终在真实电商搜索基准上实现相关性预测准确率和可解释性的双重提升。
链接: https://arxiv.org/abs/2510.07972
作者: Pengkun Jiao,Yiming Jin,Jianhui Yang,Chenhe Dong,Zerui Huang,Shaowei Yao,Xiaojiang Zhou,Dan Ou,Haihong Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2510.07972 [cs.AI] (or arXiv:2510.07972v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.07972 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-52] A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG
【速读】:该论文旨在解决可穿戴脑电图(wearable EEG)设备在睡眠分期(sleep staging)应用中面临的标签稀缺问题,即海量未标注数据难以通过人工标注实现有效利用。其解决方案的关键在于引入自监督学习(self-supervised learning, SSL),通过挖掘未标注信号中的结构信息来构建高质量表征,从而显著降低对人工标注数据的依赖。实验表明,SSL方法在仅使用5%–10%标注数据时即可达到临床级准确率(>80%),优于传统监督学习方法,并且在不同人群、环境和信号质量条件下表现出更强的鲁棒性,为低成本、高可扩展的睡眠监测系统提供了可行路径。
链接: https://arxiv.org/abs/2510.07960
作者: Emilio Estevan,María Sierra-Torralba,Eduardo López-Larraz,Luis Montesano
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 4 figures
Abstract:Wearable EEG devices have emerged as a promising alternative to polysomnography (PSG). As affordable and scalable solutions, their widespread adoption results in the collection of massive volumes of unlabeled data that cannot be analyzed by clinicians at scale. Meanwhile, the recent success of deep learning for sleep scoring has relied on large annotated datasets. Self-supervised learning (SSL) offers an opportunity to bridge this gap, leveraging unlabeled signals to address label scarcity and reduce annotation effort. In this paper, we present the first systematic evaluation of SSL for sleep staging using wearable EEG. We investigate a range of well-established SSL methods and evaluate them on two sleep databases acquired with the Ikon Sleep wearable EEG headband: BOAS, a high-quality benchmark containing PSG and wearable EEG recordings with consensus labels, and HOGAR, a large collection of home-based, self-recorded, and unlabeled recordings. Three evaluation scenarios are defined to study label efficiency, representation quality, and cross-dataset generalization. Results show that SSL consistently improves classification performance by up to 10% over supervised baselines, with gains particularly evident when labeled data is scarce. SSL achieves clinical-grade accuracy above 80% leveraging only 5% to 10% of labeled data, while the supervised approach requires twice the labels. Additionally, SSL representations prove robust to variations in population characteristics, recording environments, and signal quality. Our findings demonstrate the potential of SSL to enable label-efficient sleep staging with wearable EEG, reducing reliance on manual annotations and advancing the development of affordable sleep monitoring systems.
zh
[AI-53] DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
【速读】:该论文旨在解决现代机器学习模型评估成本高昂的问题,传统基准测试(如LMMs-Eval和HELM)需耗费数千GPU小时,导致创新周期变慢、包容性降低并加剧环境负担。其解决方案的关键在于提出一种名为Diversifying Sample Condensation (DISCO) 的新方法:不再依赖复杂的聚类算法进行样本锚点选择,而是通过贪心策略选取能最大化模型间分歧的top-k样本,即选择那些在不同模型预测结果上差异最大的样本。该方法基于信息论最优准则,利用局部样本级统计量替代全局聚类,显著简化了流程并提升了性能预测准确性,在MMLU、Hellaswag、Winogrande和ARC等多个基准上达到当前最优效果。
链接: https://arxiv.org/abs/2510.07959
作者: Alexander Rubinstein,Benjamin Raible,Martin Gubri,Seong Joon Oh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that \textitmaximise diversity in model responses . Our method, \textbfDiversifying Sample Condensation (DISCO) , selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. \textbfDISCO shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: this https URL.
zh
[AI-54] Agent -Based Genetic Algorithm for Crypto Trading Strategy Optimization
【速读】:该论文旨在解决加密货币市场中交易策略优化面临的难题,包括极端波动性、非平稳动态特性以及复杂的市场微观结构模式,这些问题使得传统参数优化方法难以奏效。解决方案的关键在于提出一种名为Crypto Genetic Algorithm Agent (CGA-Agent) 的混合框架,该框架创新性地将遗传算法与智能多智能体协调机制相结合,通过实时市场微观结构情报和自适应策略绩效反馈机制,动态引导进化过程,从而实现对交易策略参数的自适应优化,突破静态优化方法的局限性。
链接: https://arxiv.org/abs/2510.07943
作者: Qiushi Tian,Churong Liang,Kairan Hong,Runnan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures
Abstract:Cryptocurrency markets present formidable challenges for trading strategy optimization due to extreme volatility, non-stationary dynamics, and complex microstructure patterns that render conventional parameter optimization methods fundamentally inadequate. We introduce Cypto Genetic Algorithm Agent (CGA-Agent), a pioneering hybrid framework that synergistically integrates genetic algorithms with intelligent multi-agent coordination mechanisms for adaptive trading strategy parameter optimization in dynamic financial environments. The framework uniquely incorporates real-time market microstructure intelligence and adaptive strategy performance feedback through intelligent mechanisms that dynamically guide evolutionary processes, transcending the limitations of static optimization approaches. Comprehensive empirical evaluation across three cryptocurrencies demonstrates systematic and statistically significant performance improvements on both total returns and risk-adjusted metrics.
zh
[AI-55] Enabling Personalized Long-term Interactions in LLM -based Agents through Persistent Memory and User Profiles
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体在个性化交互能力上的不足问题,即现有方法虽能通过检索增强生成(Retrieval Augmented Generation, RAG)提升上下文感知能力,但缺乏将上下文信息与用户特定数据有效融合的机制。其解决方案的关键在于构建一个以持久记忆(persistent memory)、动态协调(dynamic coordination)、自我验证(self-validation)和演化用户画像(evolving user profiles)为核心的框架,结合多智能体协作与多源检索等经典代理模式,实现面向用户的长期适应性交互。
链接: https://arxiv.org/abs/2510.07925
作者: Rebecca Westhäußer,Wolfgang Minker,Sebatian Zepf
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 1 figure, 1 table
Abstract:Large language models (LLMs) increasingly serve as the central control unit of AI agents, yet current approaches remain limited in their ability to deliver personalized interactions. While Retrieval Augmented Generation enhances LLM capabilities by improving context-awareness, it lacks mechanisms to combine contextual information with user-specific data. Although personalization has been studied in fields such as human-computer interaction or cognitive science, existing perspectives largely remain conceptual, with limited focus on technical implementation. To address these gaps, we build on a unified definition of personalization as a conceptual foundation to derive technical requirements for adaptive, user-centered LLM-based agents. Combined with established agentic AI patterns such as multi-agent collaboration or multi-source retrieval, we present a framework that integrates persistent memory, dynamic coordination, self-validation, and evolving user profiles to enable personalized long-term interactions. We evaluate our approach on three public datasets using metrics such as retrieval accuracy, response correctness, or BertScore. We complement these results with a five-day pilot user study providing initial insights into user feedback on perceived personalization. The study provides early indications that guide future work and highlights the potential of integrating persistent memory and user profiles to improve the adaptivity and perceived personalization of LLM-based agents.
zh
[AI-56] Profit Mirag e: Revisiting Information Leakage in LLM -based Financial Agents
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的金融代理在回测中表现出“利润幻象”(profit mirage)的问题,即模型在训练期内因信息泄露而获得虚假高收益,一旦超出其知识窗口便无法持续产生超额收益。核心问题在于LLM对历史数据的过度记忆而非学习因果机制。解决方案的关键是提出FactFin框架,通过引入反事实扰动(counterfactual perturbations),迫使模型识别和依赖因果驱动因素而非仅依赖历史记忆。该框架整合策略代码生成、检索增强生成、蒙特卡洛树搜索与反事实模拟四个模块,显著提升模型在样本外场景下的泛化能力与风险调整后绩效。
链接: https://arxiv.org/abs/2510.07920
作者: Xiangyu Li,Yawen Zeng,Xiaofen Xing,Jin Xu,Xiangmin Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-based financial agents have attracted widespread excitement for their ability to trade like human experts. However, most systems exhibit a “profit mirage”: dazzling back-tested returns evaporate once the model’s knowledge window ends, because of the inherent information leakage in LLMs. In this paper, we systematically quantify this leakage issue across four dimensions and release FinLake-Bench, a leakage-robust evaluation benchmark. Furthermore, to mitigate this issue, we introduce FactFin, a framework that applies counterfactual perturbations to compel LLM-based agents to learn causal drivers instead of memorized outcomes. FactFin integrates four core components: Strategy Code Generator, Retrieval-Augmented Generation, Monte Carlo Tree Search, and Counterfactual Simulator. Extensive experiments show that our method surpasses all baselines in out-of-sample generalization, delivering superior risk-adjusted performance.
zh
[AI-57] owards Meaningful Transparency in Civic AI Systems
【速读】:该论文试图解决当前政府机构在使用人工智能(AI)系统提供服务时,因透明度实践局限于技术性描述而无法有效赋能公民和公职人员参与决策过程的问题。现有透明机制往往产出难以理解的技术对象,未能连接公众的行动潜力,也忽视了决策背后的社会-物质情境。解决方案的关键在于提出“有意义的透明度”(meaningful transparency)概念,即通过以人为中心的视角与社会技术系统理论相结合,使公众能够真正理解并参与到影响自身生活的AI决策中,实现认知理解与行动可能性的有机联结。
链接: https://arxiv.org/abs/2510.07889
作者: Dave Murray-Rust,Kars Alfrink,Cristina Zaga
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Artificial intelligence has become a part of the provision of governmental services, from making decisions about benefits to issuing fines for parking violations. However, AI systems rarely live up to the promise of neutral optimisation, creating biased or incorrect outputs and reducing the agency of both citizens and civic workers to shape the way decisions are made. Transparency is a principle that can both help subjects understand decisions made about them and shape the processes behind those decisions. However, transparency as practiced around AI systems tends to focus on the production of technical objects that represent algorithmic aspects of decision making. These are often difficult for publics to understand, do not connect to potential for action, and do not give insight into the wider socio-material context of decision making. In this paper, we build on existing approaches that take a human-centric view on AI transparency, combined with a socio-technical systems view, to develop the concept of meaningful transparency for civic AI systems: transparencies that allow publics to engage with AI systems that affect their lives, connecting understanding with potential for action.
zh
[AI-58] DM1: MeanFlow with Dispersive Regularization for 1-Step Robotic Manipulation
【速读】:该论文旨在解决流形匹配(flow-based)机器人操作策略在学习多模态动作分布时存在的表示崩溃(representation collapse)问题,即模型无法区分相似的视觉表征,导致精确操控任务失败。其解决方案的关键在于提出DM1(MeanFlow with Dispersive Regularization for One-Step Robotic Manipulation),通过在多个中间嵌入层引入分散正则化(dispersive regularization)变体,在不增加额外网络模块或特殊训练流程的前提下,有效增强训练批次间的表征多样性,从而防止表示崩溃并保持单步采样效率。实验表明,DM1在RoboMimic基准上实现20–40倍更快推理速度(0.07s vs. 2–3.5s),成功率提升10–20个百分点,且在Franka Panda真实机器人上成功迁移至物理世界。
链接: https://arxiv.org/abs/2510.07865
作者: Guowei Zou,Haitao Wang,Hejun Wu,Yukun Qian,Yuhang Wang,Weibing Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Website with code: this https URL
Abstract:The ability to learn multi-modal action distributions is indispensable for robotic manipulation policies to perform precise and robust control. Flow-based generative models have recently emerged as a promising solution to learning distributions of actions, offering one-step action generation and thus achieving much higher sampling efficiency compared to diffusion-based methods. However, existing flow-based policies suffer from representation collapse, the inability to distinguish similar visual representations, leading to failures in precise manipulation tasks. We propose DM1 (MeanFlow with Dispersive Regularization for One-Step Robotic Manipulation), a novel flow matching framework that integrates dispersive regularization into MeanFlow to prevent collapse while maintaining one-step efficiency. DM1 employs multiple dispersive regularization variants across different intermediate embedding layers, encouraging diverse representations across training batches without introducing additional network modules or specialized training procedures. Experiments on RoboMimic benchmarks show that DM1 achieves 20-40 times faster inference (0.07s vs. 2-3.5s) and improves success rates by 10-20 percentage points, with the Lift task reaching 99% success over 85% of the baseline. Real-robot deployment on a Franka Panda further validates that DM1 transfers effectively from simulation to the physical world. To the best of our knowledge, this is the first work to leverage representation regularization to enable flow-based policies to achieve strong performance in robotic manipulation, establishing a simple yet powerful approach for efficient and robust manipulation.
zh
[AI-59] Understanding DeepResearch via Reports
【速读】:该论文旨在解决深度研究型智能体(DeepResearch agents)在评估上的难题,即现有基准测试多聚焦于单一能力指标,难以全面衡量其在开放性研究场景中的综合表现。其核心挑战在于如何客观评估系统在整合多元信息源、生成洞察并产出结构化研究报告等方面的端到端能力。解决方案的关键在于提出 DeepResearch-ReportEval 框架,通过将研究报告作为代表性输出进行评估,系统性地从质量(quality)、冗余度(redundancy)和事实准确性(factuality)三个维度量化模型性能,并采用 LLM-as-a-Judge 方法实现与专家判断高度一致的自动化评分机制,从而为当前主流商业系统提供可比较、标准化的能力评估基准。
链接: https://arxiv.org/abs/2510.07861
作者: Tianyu Fan,Xinyao Niu,Yuxiang Zheng,Fengji Zhang,Chengen Huang,Bei Chen,Junyang Lin,Chao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 4 figures
Abstract:DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, which are capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an innovative LLM-as-a-Judge methodology achieving strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: this https URL.
zh
[AI-60] Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLM)的时间序列预测方法中存在的三大问题:LLM在模型架构中角色边缘化、依赖粗粒度统计文本提示以及缺乏可解释性。其解决方案的关键在于提出一个完全由LLM驱动的预测框架Augur,该框架采用两阶段师生架构:首先由强大的教师LLM通过启发式搜索与成对因果检验推断时间序列变量间的有向因果图;随后轻量级学生代理进一步优化该图,并将高置信度因果关联编码为丰富的文本提示进行微调,从而实现更精准的预测并提供透明、可追溯的变量交互推理过程。
链接: https://arxiv.org/abs/2510.07858
作者: Zhiqing Cui,Binwu Wang,Qingxiang Liu,Yeqiang Wang,Zhengyang Zhou,Yuxuan Liang,Yang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 9 figures
Abstract:Large language models (LLM) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations-such as marginalized role in model architectures, reliance on coarse statistical text prompts, and lack of interpretability. In this work, we introduce Augur, a fully LLM driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two stage teacher student architecture where a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine tune on high confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 25 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.
zh
[AI-61] FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在金融等专业领域缺乏高质量评估数据集的问题,尤其是当前缺少具备专业知识强度、详细标注和复杂推理能力的评测基准。其解决方案的关键在于提出FinMR——一个面向金融领域的高质量、知识密集型多模态数据集,包含超过3200个由专家标注的问答对,覆盖15个多样化的金融主题,并融合了数学推理、金融专业知识与视觉理解任务,从而为评估和提升MLLMs在专业分析师水平上的金融推理能力提供了可靠基准。
链接: https://arxiv.org/abs/2510.07852
作者: Shuangyan Deng,Haizhou Peng,Jiachen Xu,Rui Mao,Ciprian Doru Giurcăneanu,Jiamou Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accept by ICAIF 2025
Abstract:Multimodal Large Language Models (MLLMs) have made substantial progress in recent years. However, their rigorous evaluation within specialized domains like finance is hindered by the absence of datasets characterized by professional-level knowledge intensity, detailed annotations, and advanced reasoning complexity. To address this critical gap, we introduce FinMR, a high-quality, knowledge-intensive multimodal dataset explicitly designed to evaluate expert-level financial reasoning capabilities at a professional analyst’s standard. FinMR comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics, ensuring broad domain diversity and integrating sophisticated mathematical reasoning, advanced financial knowledge, and nuanced visual interpretation tasks across multiple image types. Through comprehensive benchmarking with leading closed-source and open-source MLLMs, we highlight significant performance disparities between these models and professional financial analysts, uncovering key areas for model advancement, such as precise image analysis, accurate application of complex financial formulas, and deeper contextual financial understanding. By providing richly varied visual content and thorough explanatory annotations, FinMR establishes itself as an essential benchmark tool for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.
zh
[AI-62] Meta-Learning Based Few-Shot Graph-Level Anomaly Detection
【速读】:该论文旨在解决图级别异常检测(graph-level anomaly detection)中因依赖大量标注数据而导致的现实场景适用性差,以及基于图神经网络(GNN)的少样本异常检测方法易受噪声干扰、嵌入质量低和模型鲁棒性弱的问题。解决方案的关键在于提出一种基于元学习的框架(MA-GAD),其核心创新包括:引入图压缩模块以降低图规模并保留关键节点信息,从而缓解噪声干扰;利用元学习从相似网络中提取元异常特征,获得可快速适应新任务的初始化模型;并通过偏置网络增强异常节点与正常节点之间的区分能力。该方法在四个真实生化数据集上的实验表明,在少样本条件下显著优于现有先进方法。
链接: https://arxiv.org/abs/2510.07847
作者: Liting Li,Yumeng Wang,Yueheng Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ARRML2025
Abstract:Graph-level anomaly detection aims to identify anomalous graphs or subgraphs within graph datasets, playing a vital role in various fields such as fraud detection, review classification, and biochemistry. While Graph Neural Networks (GNNs) have made significant progress in this domain, existing methods rely heavily on large amounts of labeled data, which is often unavailable in real-world scenarios. Additionally, few-shot anomaly detection methods based on GNNs are prone to noise interference, resulting in poor embedding quality and reduced model robustness. To address these challenges, we propose a novel meta-learning-based graph-level anomaly detection framework (MA-GAD), incorporating a graph compression module that reduces the graph size, mitigating noise interference while retaining essential node information. We also leverage meta-learning to extract meta-anomaly information from similar networks, enabling the learning of an initialization model that can rapidly adapt to new tasks with limited samples. This improves the anomaly detection performance on target graphs, and a bias network is used to enhance the distinction between anomalous and normal nodes. Our experimental results, based on four real-world biochemical datasets, demonstrate that MA-GAD outperforms existing state-of-the-art methods in graph-level anomaly detection under few-shot conditions. Experiments on both graph anomaly and subgraph anomaly detection tasks validate the framework’s effectiveness on real-world datasets.
zh
[AI-63] he Rise of the Knowledge Sculptor: A New Archetype for Knowledge Work in the Age of Generative AI
【速读】:该论文旨在解决生成式 AI(Generative AI, GenAI)兴起背景下,传统知识工作模式在应对自主内容生成能力时的适应性不足问题。其解决方案的关键在于提出“知识雕塑家”(Knowledge Sculptor, KS)这一新型专业角色,通过架构愿景、迭代对话、信息雕琢和好奇心驱动的整合等核心能力,将原始 AI 输出转化为可信且可操作的知识,从而推动人与 GenAI 的协同进化。
链接: https://arxiv.org/abs/2510.07829
作者: Cathal Doyle
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 23 pages, 11 figures, preprint
Abstract:In the Generative Age, the nature of knowledge work is transforming. Traditional models that emphasise the organisation and retrieval of pre-existing information are increasingly inadequate in the face of generative AI (GenAI) systems capable of autonomous content creation. This paper introduces the Knowledge Sculptor (KS), a new professional archetype for Human-GenAI collaboration that transforms raw AI output into trustworthy, actionable knowledge. Grounded in a socio-technical perspective, the KS is conceptualised through a framework of competencies, including architecting a vision, iterative dialogue, information sculpting, and curiosity-driven synthesis. A practice-based vignette illustrates the KS role in action, and in a self-referential approach, the paper itself serves as an artefact of the sculpting process it describes.
zh
[AI-64] An LLM -Powered Cooperative Framework for Large-Scale Multi-Vehicle Navigation
【速读】:该论文旨在解决城市级大规模多车动态导航问题,即在复杂、非线性、随机且耦合的交通环境中,实现高效、协同的车辆路径规划。现有路径搜索算法和强化学习方法难以扩展至城市尺度网络,无法有效应对实时交通变化与全局协调需求。解决方案的关键在于提出CityNav框架,其核心是一个分层架构:由全局交通分配代理(global traffic allocation agent)负责区域间战略流量分配,结合局部导航代理(local navigation agents)生成符合全局指令的本地自适应路径;并通过一种合作推理优化机制,采用双奖励结构(个体奖励提升单车效率,共享奖励促进全网协同与拥堵缓解)联合训练各代理,从而实现可扩展、自适应且协同的城市级交通导航。
链接: https://arxiv.org/abs/2510.07825
作者: Yuping Zhou,Siqi Lai,Jindong Han,Hao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rise of Internet of Vehicles (IoV) technologies is transforming traffic management from isolated control to a collective, multi-vehicle process. At the heart of this shift is multi-vehicle dynamic navigation, which requires simultaneously routing large fleets under evolving traffic conditions. Existing path search algorithms and reinforcement learning methods struggle to scale to city-wide networks, often failing to capture the nonlinear, stochastic, and coupled dynamics of urban traffic. To address these challenges, we propose CityNav, a hierarchical, LLM-powered framework for large-scale multi-vehicle navigation. CityNav integrates a global traffic allocation agent, which coordinates strategic traffic flow distribution across regions, with local navigation agents that generate locally adaptive routes aligned with global directives. To enable effective cooperation, we introduce a cooperative reasoning optimization mechanism, in which agents are jointly trained with a dual-reward structure: individual rewards promote per-vehicle efficiency, while shared rewards encourage network-wide coordination and congestion reduction. Extensive experiments on four real-world road networks of varying scales (up to 1.6 million roads and 430,000 intersections) and traffic datasets demonstrate that CityNav consistently outperforms nine classical path search and RL-based baselines in city-scale travel efficiency and congestion mitigation. Our results highlight the potential of LLMs to enable scalable, adaptive, and cooperative city-wide traffic navigation, providing a foundation for intelligent, large-scale vehicle routing in complex urban environments. Our project is available at this https URL.
zh
[AI-65] SIMU: Selective Influence Machine Unlearning NEURIPS2025
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在遗忘特定敏感信息时导致原始能力退化的问题,即现有基于一阶和二阶优化器的机器遗忘方法虽能实现目标信息的删除,但常损害模型原有的知识保留与整体性能。其解决方案的关键在于提出一种两阶段框架——选择性影响机器遗忘(Selective Influence Machine Unlearning, SIMU),通过识别并仅更新编码遗忘集(forget-set)信息的关键神经元,实现对敏感信息的有效擦除,同时显著提升模型对原始知识的保留能力。
链接: https://arxiv.org/abs/2510.07822
作者: Anu Agarwal,Mihir Pamnani,Dilek Hakkani-Tur
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025 Workshop: Constrained Optimization for Machine Learning (COML)
Abstract:The undesired memorization of sensitive information by Large Language Models (LLMs) has emphasized the need for safety mechanisms that can regulate model behavior. This has led to the development of machine unlearning techniques that enable models to precisely forget sensitive and unwanted information. For machine unlearning, first-order and second-order optimizer-based methods have shown significant progress in enabling LLMs to forget targeted information. However, in doing so, these approaches often compromise the model’s original capabilities, resulting in unlearned models that struggle to retain their prior knowledge and overall utility. To address this, we propose Selective Influence Machine Unlearning (SIMU), a two-step framework that enhances second-order optimizer-based unlearning by selectively updating only the critical neurons responsible for encoding the forget-set. By constraining updates to these targeted neurons, SIMU achieves comparable unlearning efficacy while substantially outperforming current methods in retaining the model’s original knowledge.
zh
[AI-66] Strategic Communication under Threat: Learning Information Trade-offs in Pursuit-Evasion Games
【速读】:该论文旨在解决对抗环境中智能体在信息获取与暴露风险之间的战略权衡问题,即在追求更高情境感知能力的同时避免因通信而暴露自身位置从而面临威胁。其解决方案的关键在于提出SHADOW(Strategic-communication Hybrid Action Decision-making under partial Observation for Warfare)框架,该框架采用多头序列强化学习机制,整合连续导航控制、离散通信动作以及对手建模以实现行为预测,使追捕者能够在部分可观测环境下动态决策何时通信以平衡可观测性与风险。
链接: https://arxiv.org/abs/2510.07813
作者: Valerio La Gatta,Dolev Mutzari,Sarit Kraus,VS Subrahmanian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 13 figures
Abstract:Adversarial environments require agents to navigate a key strategic trade-off: acquiring information enhances situational awareness, but may simultaneously expose them to threats. To investigate this tension, we formulate a PursuitEvasion-Exposure-Concealment Game (PEEC) in which a pursuer agent must decide when to communicate in order to obtain the evader’s position. Each communication reveals the pursuer’s location, increasing the risk of being targeted. Both agents learn their movement policies via reinforcement learning, while the pursuer additionally learns a communication policy that balances observability and risk. We propose SHADOW (Strategic-communication Hybrid Action Decision-making under partial Observation for Warfare), a multi-headed sequential reinforcement learning framework that integrates continuous navigation control, discrete communication actions, and opponent modeling for behavior prediction. Empirical evaluations show that SHADOW pursuers achieve higher success rates than six competitive baselines. Our ablation study confirms that temporal sequence modeling and opponent modeling are critical for effective decision-making. Finally, our sensitivity analysis reveals that the learned policies generalize well across varying communication risks and physical asymmetries between agents.
zh
[AI-67] Effective and Stealthy One-Shot Jailbreaks on Deployed Mobile Vision-Language Agents
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)驱动的自主移动代理在操作安卓手机用户界面时,面临的一种隐蔽性强、无需高权限且可直接触发的UI级攻击漏洞问题。现有研究多依赖显眼的UI叠加层、高权限或不切实际的威胁模型,难以实现真实场景下的隐蔽攻击。其解决方案的关键在于提出一种“一次射击”(one-shot)的越狱攻击框架,包含三个核心组件:(1) 低权限感知链目标定位,将恶意提示注入到应用界面文本中作为代理的视觉输入;(2) 用户不可见的触控触发机制,利用物理触控特征区分代理与人类操作,在代理执行时激活payload;(3) 一次性提示有效性优化,采用启发式引导的字符级迭代深化搜索算法(HG-IDA*),实现关键词级别的去毒处理以绕过设备端安全过滤器。该方法在多个LVLM后端和Android应用中验证了高成功率(如GPT-4o在单次攻击下规划劫持率达82.5%,执行劫持率达75.0%),揭示了当前移动端智能代理存在的根本性安全风险。
链接: https://arxiv.org/abs/2510.07809
作者: Renhua Ding,Xiao Yang,Zhengwei Fang,Jun Luo,Kun He,Jun Zhu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large vision-language models (LVLMs) enable autonomous mobile agents to operate smartphone user interfaces, yet vulnerabilities to UI-level attacks remain critically understudied. Existing research often depends on conspicuous UI overlays, elevated permissions, or impractical threat models, limiting stealth and real-world applicability. In this paper, we present a practical and stealthy one-shot jailbreak attack that leverages in-app prompt injections: malicious applications embed short prompts in UI text that remain inert during human interaction but are revealed when an agent drives the UI via ADB (Android Debug Bridge). Our framework comprises three crucial components: (1) low-privilege perception-chain targeting, which injects payloads into malicious apps as the agent’s visual inputs; (2) stealthy user-invisible activation, a touch-based trigger that discriminates agent from human touches using physical touch attributes and exposes the payload only during agent operation; and (3) one-shot prompt efficacy, a heuristic-guided, character-level iterative-deepening search algorithm (HG-IDA*) that performs one-shot, keyword-level detoxification to evade on-device safety filters. We evaluate across multiple LVLM backends, including closed-source services and representative open-source models within three Android applications, and we observe high planning and execution hijack rates in single-shot scenarios (e.g., GPT-4o: 82.5% planning / 75.0% execution). These findings expose a fundamental security vulnerability in current mobile agents with immediate implications for autonomous smartphone operation.
zh
[AI-68] GCPO: When Contrast Fails Go Gold
【速读】:该论文旨在解决当前强化学习方法(如Group Relative Policy Optimization, GRPO)在提升小型语言模型推理能力时存在的局限性,即模型的rollout响应上限完全由自身决定,导致无法有效利用全部样本信息,尤其是当样本全部错误或全部正确时,难以获得有效的梯度更新。解决方案的关键在于提出Group Contrastive Policy Optimization (GCPO),其核心创新是引入外部标准参考答案作为对比基准,当模型无法正确解答问题时,参考答案提供明确的正确响应方向,从而引导模型进行更准确的策略更新。这一机制不仅提升了训练效率(充分利用每个样本),还使模型能够在训练中模仿参考答案的问题求解策略,显著增强推理泛化能力。
链接: https://arxiv.org/abs/2510.07790
作者: Hao Wu,Wei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model’s rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: this https URL.
zh
[AI-69] rajectory Conditioned Cross-embodiment Skill Transfer
【速读】:该论文旨在解决从人类示范视频中学习机器人操作技能的问题,其核心挑战在于人类身体与机器人执行器之间的显著“具身差距”(embodiment gap)。现有方法依赖于成对数据集或手工设计的奖励函数,限制了可扩展性和泛化能力。解决方案的关键在于提出TrajSkill框架,通过将人类动作表示为稀疏光流轨迹(sparse optical flow trajectories),提取与形态无关的运动线索,同时保留关键动力学信息;在此基础上,结合视觉和文本输入,联合生成时序一致的机器人操作视频并将其转化为可执行动作,从而实现跨具身技能迁移。
链接: https://arxiv.org/abs/2510.07773
作者: YuHang Tang,Yixuan Lou,Pengfei Han,Haoming Song,Xinyi Ye,Dong Wang,Bin Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning manipulation skills from human demonstration videos presents a promising yet challenging problem, primarily due to the significant embodiment gap between human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limit scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer, enabling robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. Extensive experiments are conducted, and the results on simulation data (MetaWorld) show that TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state-of-the-art, and improves cross-embodiment success rate by up to 16.7%. Real-robot experiments in kitchen manipulation tasks further validate the effectiveness of our approach, demonstrating practical human-to-robot skill transfer across embodiments.
zh
[AI-70] An approach for systematic decomposition of complex llm tasks
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂任务中可靠性不足的问题,现有分解方法多依赖启发式策略或人工设计的代理分解,缺乏系统性和可量化依据。其解决方案的关键在于提出一种名为“约束诱导复杂性分析”(Analysis of CONstraint-Induced Complexity, ACONIC)的新型系统性分解框架,该框架将任务建模为约束问题,并利用形式化的复杂度度量来指导分解过程,从而显著提升智能体在组合优化(SATBench)和LLM数据库查询(Spider)任务上的性能表现(提升10–40个百分点)。
链接: https://arxiv.org/abs/2510.07772
作者: Tianle Zhou,Jiakai Xu,Guanhong Liu,Jiaxiang Liu,Haonan Wang,Eugene Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) suffer from reliability issues on complex tasks, as existing decomposition methods are heuristic and rely on agent or manual decomposition. This work introduces a novel, systematic decomposition framework that we call Analysis of CONstraint-Induced Complexity (ACONIC), which models the task as a constraint problem and leveraging formal complexity measures to guide decomposition. On combinatorial (SATBench) and LLM database querying tasks (Spider), we find that by decomposing the tasks following the measure of complexity, agent can perform considerably better (10-40 percentage point).
zh
[AI-71] From Noisy to Native: LLM -driven Graph Restoration for Test-Time Graph Domain Adaptation
【速读】:该论文旨在解决测试阶段图域适应(Test-Time Graph Domain Adaptation, TT-GDA)中缺乏源域数据访问的问题,从而在不依赖源域样本的情况下实现跨域知识迁移。其核心挑战在于如何在无源数据条件下重建目标图使其具备源域的内在特征,并构建一个适合大语言模型(Large Language Models, LLMs)理解的图结构表示形式。解决方案的关键在于提出GRAIL框架,通过压缩节点表示为紧凑潜在特征、引入图扩散过程建模图恢复机制,并设计量化模块将恢复特征编码为离散token,进而利用LLM作为生成式恢复器将“噪声”目标图重构为“原生”源域风格图;同时引入基于对齐度与置信度奖励的强化学习策略以提升恢复质量,从而有效弥合源域与目标域之间的分布差异。
链接: https://arxiv.org/abs/2510.07762
作者: Xiangwei Lv,JinLuan Yang,Wang Lin,Jingyuan Chen,Beishui Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graph domain adaptation (GDA) has achieved great attention due to its effectiveness in addressing the domain shift between train and test data. A significant bottleneck in existing graph domain adaptation methods is their reliance on source-domain data, which is often unavailable due to privacy or security concerns. This limitation has driven the development of Test-Time Graph Domain Adaptation (TT-GDA), which aims to transfer knowledge without accessing the source examples. Inspired by the generative power of large language models (LLMs), we introduce a novel framework that reframes TT-GDA as a generative graph restoration problem, “restoring the target graph to its pristine, source-domain-like state”. There are two key challenges: (1) We need to construct a reasonable graph restoration process and design an effective encoding scheme that an LLM can understand, bridging the modality gap. (2) We need to devise a mechanism to ensure the restored graph acquires the intrinsic features of the source domain, even without access to the source data. To ensure the effectiveness of graph restoration, we propose GRAIL, that restores the target graph into a state that is well-aligned with the source domain. Specifically, we first compress the node representations into compact latent features and then use a graph diffusion process to model the graph restoration process. Then a quantization module encodes the restored features into discrete tokens. Building on this, an LLM is fine-tuned as a generative restorer to transform a “noisy” target graph into a “native” one. To further improve restoration quality, we introduce a reinforcement learning process guided by specialized alignment and confidence rewards. Extensive experiments demonstrate the effectiveness of our approach across various datasets.
zh
[AI-72] A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization
【速读】:该论文旨在解决在线广告中因广告主需求异质性导致的大量定制化出价任务独立优化所引发的计算资源消耗大和数据效率低的问题。现有多任务学习方法主要依赖训练动态进行优化,但在波动性强的出价环境中泛化能力不足。解决方案的关键在于提出验证对齐的多任务优化(Validation-Aligned Multi-task Optimization, VAMO),通过自适应地根据每个任务的训练梯度与保留验证集梯度之间的对齐程度来分配任务权重,从而引导更新方向以提升验证指标并更贴合实际部署目标。此外,该框架还引入了周期感知的时间模块,并结合先进的生成式自动出价(Generative Auto-bidding)骨干网络,增强跨任务的季节性结构迁移能力,进一步提升出价性能。
链接: https://arxiv.org/abs/2510.07760
作者: Yiqin Lv,Zhiyu Mou,Miao Xu,Jinghao Chen,Qi Wang,Yixiu Mao,Yun Qu,Rongquan Bai,Chuan Yu,Jian Xu,Bo Zheng,Xiangyang Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In online advertising, heterogeneous advertiser requirements give rise to numerous customized bidding tasks that are typically optimized independently, resulting in extensive computation and limited data efficiency. Multi-task learning offers a principled framework to train these tasks jointly through shared representations. However, existing multi-task optimization strategies are primarily guided by training dynamics and often generalize poorly in volatile bidding environments. To this end, we present Validation-Aligned Multi-task Optimization (VAMO), which adaptively assigns task weights based on the alignment between per-task training gradients and a held-out validation gradient, thereby steering updates toward validation improvement and better matching deployment objectives. We further equip the framework with a periodicity-aware temporal module and couple it with an advanced generative auto-bidding backbone to enhance cross-task transfer of seasonal structure and strengthen bidding performance. Meanwhile, we provide theoretical insights into the proposed method, e.g., convergence guarantee and alignment analysis. Extensive experiments on both simulated and large-scale real-world advertising systems consistently demonstrate significant improvements over typical baselines, illuminating the effectiveness of the proposed approach.
zh
[AI-73] Haibu Mathematical-Medical Intelligent Agent :Enhancing Large Language Model Reliability in Medical Tasks via Verifiable Reasoning Chains
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中存在事实性错误和逻辑错误的问题,这些问题在高风险的医疗场景中不可接受。解决方案的关键在于提出“海步数学-医学智能代理”(Haibu Mathematical-Medical Intelligent Agent, MMIA),其核心创新是通过形式化可验证的推理过程确保可靠性:MMIA将复杂医疗任务递归分解为原子级、基于证据的步骤,并自动审计整个推理链的逻辑一致性与证据溯源性,类比于定理证明;同时引入“自举模式”(bootstrapping mode),将已验证的推理链存储为“定理”,后续任务可通过检索增强生成(Retrieval-Augmented Generation, RAG)快速调用,实现从高成本的一阶原理推理向低代价验证模式的转变,从而显著提升准确性与效率。
链接: https://arxiv.org/abs/2510.07748
作者: Yilun Zhang,Dexing Kong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) show promise in medicine but are prone to factual and logical errors, which is unacceptable in this high-stakes field. To address this, we introduce the “Haibu Mathematical-Medical Intelligent Agent” (MMIA), an LLM-driven architecture that ensures reliability through a formally verifiable reasoning process. MMIA recursively breaks down complex medical tasks into atomic, evidence-based steps. This entire reasoning chain is then automatically audited for logical coherence and evidence traceability, similar to theorem proving. A key innovation is MMIA’s “bootstrapping” mode, which stores validated reasoning chains as “theorems.” Subsequent tasks can then be efficiently solved using Retrieval-Augmented Generation (RAG), shifting from costly first-principles reasoning to a low-cost verification model. We validated MMIA across four healthcare administration domains, including DRG/DIP audits and medical insurance adjudication, using expert-validated benchmarks. Results showed MMIA achieved an error detection rate exceeding 98% with a false positive rate below 1%, significantly outperforming baseline LLMs. Furthermore, the RAG matching mode is projected to reduce average processing costs by approximately 85% as the knowledge base matures. In conclusion, MMIA’s verifiable reasoning framework is a significant step toward creating trustworthy, transparent, and cost-effective AI systems, making LLM technology viable for critical applications in medicine.
zh
[AI-74] AppForge: From Assistant to Independent Developer - Are GPT s Ready for Software Development?
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成任务中仅能处理孤立函数,而难以构建完整软件系统的问题。现实世界的应用程序需要对多个组件间的交互进行推理、维护状态一致性,并满足生命周期和框架约束,但现有基准无法有效评估LLMs是否具备从零开始构建整个软件系统的能力。解决方案的关键在于提出APPFORGE——一个包含101个来自真实Android应用的开发问题的基准测试集,其要求模型根据自然语言规范从零实现完整的Android应用;为此,研究者设计了一个多智能体系统自动提取功能摘要并生成验证测试用例,结合专家人工校验与自动化评估框架,实现了无需人工干预的可复现评测机制,从而揭示了当前主流LLMs在复杂多组件软件工程任务中的显著局限性。
链接: https://arxiv.org/abs/2510.07740
作者: Dezhi Ran,Yuan Cao,Mengzhou Wu,Simin Chen,Yuzhe Guo,Jun Ren,Zihe Song,Hao Yu,Jialei Wei,Linyi Li,Wei Yang,Baishakhi Ray,Tao Xie
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Under Review. Benchmark and leadboards at this https URL
Abstract:Large language models (LLMs) have demonstrated remarkable capability in function-level code generation tasks. Unlike isolated functions, real-world applications demand reasoning over the entire software system: developers must orchestrate how different components interact, maintain consistency across states over time, and ensure the application behaves correctly within the lifecycle and framework constraints. Yet, no existing benchmark adequately evaluates whether LLMs can bridge this gap and construct entire software systems from scratch. To address this gap, we propose APPFORGE, a benchmark consisting of 101 software development problems drawn from real-world Android apps. Given a natural language specification detailing the app functionality, a language model is tasked with implementing the functionality into an Android app from scratch. Developing an Android app from scratch requires understanding and coordinating app states, lifecycle management, and asynchronous operations, calling for LLMs to generate context-aware, robust, and maintainable code. To construct APPFORGE, we design a multi-agent system to automatically summarize the main functionalities from app documents and navigate the app to synthesize test cases validating the functional correctness of app implementation. Following rigorous manual verification by Android development experts, APPFORGE incorporates the test cases within an automated evaluation framework that enables reproducible assessment without human intervention, making it easily adoptable for future research. Our evaluation on 12 flagship LLMs show that all evaluated models achieve low effectiveness, with the best-performing model (GPT-5) developing only 18.8% functionally correct applications, highlighting fundamental limitations in current models’ ability to handle complex, multi-component software engineering challenges.
zh
[AI-75] MeSH: Memory-as-State-Highways for Recursive Transformers
【速读】:该论文旨在解决递归变换器(Recursive Transformers)在计算资源匹配情况下性能落后于非递归模型的问题。通过探查隐藏状态,作者识别出两个核心瓶颈:一是“未分化计算”(undifferentiated computation),即每轮迭代中核心模块被迫采用相似的计算模式,限制了功能多样性;二是“信息过载”(information overload),即长期与瞬时信息共存于单一隐藏状态中导致表征混淆。解决方案的关键在于提出Memory-as-State-Highways(MeSH)机制,将状态管理显式外化至内存缓冲区,并引入轻量级路由机制,在不同迭代间动态分配计算任务,从而实现功能专业化。实验证明,MeSH在Pythia基准上显著优于基线递归模型,并在1.4B参数规模下超越更大规模的非递归模型,平均下游准确率提升+1.06%,同时减少33%的非嵌入参数。
链接: https://arxiv.org/abs/2510.07739
作者: Chengting Yu,Xiaobo Shu,Yadao Wang,Yizhen Zhang,Haoyi Wu,Jiaang Li,Rujiao Long,Ziheng Chen,Yuchi Xu,Wenbo Su,Bo Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address the issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperforms its larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.
zh
[AI-76] SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation
【速读】:该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)的综述论文自动生成方法在处理文献时忽视论文间结构关系的问题,导致生成的综述缺乏清晰的知识分类体系和对研究演进脉络的深入理解。其核心解决方案是提出一个名为SurveyG的LLM代理框架,该框架引入分层引用图(hierarchical citation graph),将文献作为节点、引用依赖与语义关联作为边,构建包含基础层(Foundation)、发展层(Development)和前沿层(Frontier)的三层结构,从而显式编码研究演进的层次性知识;通过层内横向搜索与层间纵向遍历相结合的方式生成多层级摘要,并经由多代理验证阶段确保最终综述的逻辑一致性、覆盖完整性和事实准确性。
链接: https://arxiv.org/abs/2510.07733
作者: Minh-Anh Nguye,Minh-Duc Nguyen,Nguyen Thi Ha Lan,Kieu Hai Dang,Nguyen Tien Dong,Le Duy Dung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly adopted for automating survey paper generation \citewang2406autosurvey, liang2025surveyx, yan2025surveyforge,su2025benchmarking,wen2025interactivesurvey. Existing approaches typically extract content from a large collection of related papers and prompt LLMs to summarize them directly. However, such methods often overlook the structural relationships among papers, resulting in generated surveys that lack a coherent taxonomy and a deeper contextual understanding of research progress. To address these shortcomings, we propose \textbfSurveyG, an LLM-based agent framework that integrates \textithierarchical citation graph, where nodes denote research papers and edges capture both citation dependencies and semantic relatedness between their contents, thereby embedding structural and contextual knowledge into the survey generation process. The graph is organized into three layers: \textbfFoundation, \textbfDevelopment, and \textbfFrontier, to capture the evolution of research from seminal works to incremental advances and emerging directions. By combining horizontal search within layers and vertical depth traversal across layers, the agent produces multi-level summaries, which are consolidated into a structured survey outline. A multi-agent validation stage then ensures consistency, coverage, and factual accuracy in generating the final survey. Experiments, including evaluations by human experts and LLM-as-a-judge, demonstrate that SurveyG outperforms state-of-the-art frameworks, producing surveys that are more comprehensive and better structured to the underlying knowledge taxonomy of a field.
zh
[AI-77] DEAS: DEtached value learning with Action Sequence for Scalable Offline RL
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)在复杂、长时序决策任务中表现不佳的问题,特别是如何有效利用离线数据提升智能体的策略学习能力。其解决方案的关键在于提出DEAS(Detached Value Learning with Action Sequence)框架,通过引入动作序列(Action Sequence)进行价值学习,利用时间扩展的动作提供比单步动作更丰富的信息,并借助半马尔可夫决策过程(Semi-Markov Decision Process, SMDP)下的Q-learning将长序列动作解释为选项(Options),从而降低有效规划时域;同时,为避免直接使用动作序列导致的价值高估问题,采用“分离式价值学习”(Detached Value Learning)机制,引导价值估计聚焦于离线数据集中分布内且能获得高回报的动作序列,显著提升了算法在OGBench复杂任务及视觉-语言-动作(Vision-Language-Action)大模型上的性能表现。
链接: https://arxiv.org/abs/2510.07730
作者: Changyeon Kim,Haeone Lee,Younggyo Seo,Kimin Lee,Yuke Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project website: this https URL
Abstract:Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions and can be interpreted through the options framework via semi-Markov decision process Q-learning, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high return in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
zh
[AI-78] Control Synthesis of Cyber-Physical Systems for Real-Time Specifications through Causation-Guided Reinforcement Learning
【速读】:该论文旨在解决在实时且安全关键的网络物理系统(Cyber-Physical Systems, CPS)中,基于信号时序逻辑(Signal Temporal Logic, STL)的强化学习(Reinforcement Learning, RL)方法因奖励稀疏而导致训练不收敛和性能不稳定的问题。现有方法生成的STL奖励仅反映全局路径或局部片段的整体满足度,无法准确累积局部状态变化带来的即时反馈,从而难以引导策略有效学习。解决方案的关键在于提出一种基于STL在线因果监测(online causation monitoring)的奖励生成机制:该机制在每个控制步骤中持续监控系统行为与STL规范的定量距离,生成反映瞬时状态动态的奖励,并通过平滑近似处理因果语义的不连续性,使其具备可微性以适配深度强化学习(Deep Reinforcement Learning, Deep-RL)框架。实验证明,该方法显著提升了训练稳定性与策略性能。
链接: https://arxiv.org/abs/2510.07715
作者: Xiaochen Tang,Zhenya Zhang,Miaomiao Zhang,Jie An
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures, 6 tables, accepted by RTSS 2025
Abstract:In real-time and safety-critical cyber-physical systems (CPSs), control synthesis must guarantee that generated policies meet stringent timing and correctness requirements under uncertain and dynamic conditions. Signal temporal logic (STL) has emerged as a powerful formalism of expressing real-time constraints, with its semantics enabling quantitative assessment of system behavior. Meanwhile, reinforcement learning (RL) has become an important method for solving control synthesis problems in unknown environments. Recent studies incorporate STL-based reward functions into RL to automatically synthesize control policies. However, the automatically inferred rewards obtained by these methods represent the global assessment of a whole or partial path but do not accumulate the rewards of local changes accurately, so the sparse global rewards may lead to non-convergence and unstable training performances. In this paper, we propose an online reward generation method guided by the online causation monitoring of STL. Our approach continuously monitors system behavior against an STL specification at each control step, computing the quantitative distance toward satisfaction or violation and thereby producing rewards that reflect instantaneous state dynamics. Additionally, we provide a smooth approximation of the causation semantics to overcome the discontinuity of the causation semantics and make it differentiable for using deep-RL methods. We have implemented a prototype tool and evaluated it in the Gym environment on a variety of continuously controlled benchmarks. Experimental results show that our proposed STL-guided RL method with online causation semantics outperforms existing relevant STL-guided RL methods, providing a more robust and efficient reward generation framework for deep-RL.
zh
[AI-79] Rethinking Reasoning : A Survey on Reasoning -based Backdoors in LLM s
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在具备高级推理能力的同时,因推理机制被恶意利用而引发的后门攻击(backdoor attacks)安全风险问题。其解决方案的关键在于提出了一种新的分类体系,将针对LLMs推理能力的后门攻击划分为关联型(associative)、被动型(passive)和主动型(active)三类,并系统梳理了相应的攻击机制、防御策略及当前未解挑战,从而为构建更安全、可信的LLM生态提供理论基础与研究方向指引。
链接: https://arxiv.org/abs/2510.07697
作者: Man Hu,Xinyi Wu,Zuofeng Suo,Jinbo Feng,Linghui Meng,Yanhao Jia,Anh Tuan Luu,Shuai Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rise of advanced reasoning capabilities, large language models (LLMs) are receiving increasing attention. However, although reasoning improves LLMs’ performance on downstream tasks, it also introduces new security risks, as adversaries can exploit these capabilities to conduct backdoor attacks. Existing surveys on backdoor attacks and reasoning security offer comprehensive overviews but lack in-depth analysis of backdoor attacks and defenses targeting LLMs’ reasoning abilities. In this paper, we take the first step toward providing a comprehensive review of reasoning-based backdoor attacks in LLMs by analyzing their underlying mechanisms, methodological frameworks, and unresolved challenges. Specifically, we introduce a new taxonomy that offers a unified perspective for summarizing existing approaches, categorizing reasoning-based backdoor attacks into associative, passive, and active. We also present defense strategies against such attacks and discuss current challenges alongside potential directions for future research. This work offers a novel perspective, paving the way for further exploration of secure and trustworthy LLM communities.
zh
[AI-80] IKNet: Interpretable Stock Price Prediction via Keyword-Guided Integration of News and Technical Indicators
【速读】:该论文旨在解决现有基于新闻的股票价格预测模型在解释性方面的不足问题,即大多数模型仅使用情感分数或平均嵌入表示新闻文章,难以提供对公众情绪如何影响预测结果的定量且情境相关的解释。其解决方案的关键在于提出一种可解释的关键词引导网络(Interpretable Keyword-guided Network, IKNet),该框架通过FinBERT进行上下文分析识别关键新闻词汇,对每个关键词嵌入独立进行非线性投影,并将其与技术指标的时间序列数据融合以预测次日收盘价;同时利用Shapley Additive Explanations (SHAP) 方法生成每项关键词对预测结果的量化贡献度,从而实现高精度预测与透明化解释的统一。
链接: https://arxiv.org/abs/2510.07661
作者: Jinwoong Kim,Sangjin Park
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 9 pages
Abstract:The increasing influence of unstructured external information, such as news articles, on stock prices has attracted growing attention in financial markets. Despite recent advances, most existing newsbased forecasting models represent all articles using sentiment scores or average embeddings that capture the general tone but fail to provide quantitative, context-aware explanations of the impacts of public sentiment on predictions. To address this limitation, we propose an interpretable keyword-guided network (IKNet), which is an explainable forecasting framework that models the semantic association between individual news keywords and stock price movements. The IKNet identifies salient keywords via FinBERTbased contextual analysis, processes each embedding through a separate nonlinear projection layer, and integrates their representations with the time-series data of technical indicators to forecast next-day closing prices. By applying Shapley Additive Explanations the model generates quantifiable and interpretable attributions for the contribution of each keyword to predictions. Empirical evaluations of SP 500 data from 2015 to 2024 demonstrate that IKNet outperforms baselines, including recurrent neural networks and transformer models, reducing RMSE by up to 32.9% and improving cumulative returns by 18.5%. Moreover, IKNet enhances transparency by offering contextualized explanations of volatility events driven by public sentiment.
zh
[AI-81] Value Flows
【速读】:该论文旨在解决传统强化学习方法中对未来回报(return)分布建模过于简化的局限性,特别是现有分布强化学习(distributional RL)方法通常仅通过离散区间(categorical distribution)或有限分位数来近似回报分布,难以捕捉回报分布的细粒度结构,并无法有效识别高回报不确定性的状态以支持探索和安全决策。其解决方案的关键在于引入基于流模型(flow-based models)的新框架——Value Flows,利用一种新的流匹配目标(flow-matching objective)生成满足分布贝尔曼方程(distributional Bellman equation)的概率密度路径,从而精确估计完整的未来回报分布;进一步地,通过构建一个基于流导数的常微分方程(ODE)来量化不同状态的回报不确定性,并据此优先优化高不确定状态下的回报估计,提升学习效率与策略性能。
链接: https://arxiv.org/abs/2510.07650
作者: Perry Dong,Chongyi Zheng,Chelsea Finn,Dorsa Sadigh,Benjamin Eysenbach
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on 37 state-based and 25 image-based benchmark tasks demonstrate that Value Flows achieves a 1.3\times improvement on average in success rates. Website: this https URL Code: this https URL
zh
[AI-82] Safely Exploring Novel Actions in Recommender Systems via Deployment-Efficient Policy Learning
【速读】:该论文旨在解决推荐系统中因频繁引入新物品(novel items)而导致的探索不足问题,尤其是在使用离策略学习(Off-Policy Learning, OPL)时,现有方法在面对新动作时可能缺乏安全性保障。其核心解决方案是提出一种名为Safe Off-Policy Policy Gradient (Safe OPG) 的模型无关安全离策略强化学习方法,该方法基于高置信度的离策略评估来确保安全性;进一步地,为缓解Safe OPG过于保守、难以平衡安全与探索的问题,作者设计了一种部署高效的策略学习框架——Deployment-Efficient Policy Learning for Safe User Exploration,通过引入安全裕度(safety margin)并在有限次部署中逐步放松安全正则化,实现对新动作的有效探索同时保证系统的安全执行。
链接: https://arxiv.org/abs/2510.07635
作者: Haruka Kiyohara,Yusuke Narita,Yuta Saito,Kei Tateno,Takuma Udagawa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In many real recommender systems, novel items are added frequently over time. The importance of sufficiently presenting novel actions has widely been acknowledged for improving long-term user engagement. A recent work builds on Off-Policy Learning (OPL), which trains a policy from only logged data, however, the existing methods can be unsafe in the presence of novel actions. Our goal is to develop a framework to enforce exploration of novel actions with a guarantee for safety. To this end, we first develop Safe Off-Policy Policy Gradient (Safe OPG), which is a model-free safe OPL method based on a high confidence off-policy evaluation. In our first experiment, we observe that Safe OPG almost always satisfies a safety requirement, even when existing methods violate it greatly. However, the result also reveals that Safe OPG tends to be too conservative, suggesting a difficult tradeoff between guaranteeing safety and exploring novel actions. To overcome this tradeoff, we also propose a novel framework called Deployment-Efficient Policy Learning for Safe User Exploration, which leverages safety margin and gradually relaxes safety regularization during multiple (not many) deployments. Our framework thus enables exploration of novel actions while guaranteeing safe implementation of recommender systems.
zh
[AI-83] A Case for Leverag ing Generative AI to Expand and Enhance Training in the Provision of Mental Health Services
【速读】:该论文旨在解决生成式 AI (Generative AI) 在心理健康领域应用中存在风险较高、落地场景不明确的问题,尤其针对当前主流关注的治疗师聊天机器人(therapist chatbots)可能带来的伦理与安全风险。论文提出的关键解决方案是:将生成式 AI 的重点应用于提升心理健康服务提供者的培训质量与可及性,通过模拟对话、个性化反馈和大规模实践训练等方式增强专业能力,从而实现低风险、高影响力的落地路径。文中以退伍军人互助心理支持培训为例,验证了该方案在真实世界中的有效性,强调应优先投资于生成式 AI 对心理健康服务人员的赋能训练。
链接: https://arxiv.org/abs/2510.07623
作者: Hannah R. Lawrence,Shannon Wiltsey Stirman,Samuel Dorison,Taedong Yun,Megan Jones Bell
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generative artificial intelligence (Generative AI) is transforming healthcare. With this evolution comes optimism regarding the impact it will have on mental health, as well as concern regarding the risks that come with generative AI operating in the mental health domain. Much of the investment in, and academic and public discourse about, AI-powered solutions for mental health has focused on therapist chatbots. Despite the common assumption that chatbots will be the most impactful application of GenAI to mental health, we make the case here for a lower-risk, high impact use case: leveraging generative AI to enhance and scale training in mental health service provision. We highlight key benefits of using generative AI to help train people to provide mental health services and present a real-world case study in which generative AI improved the training of veterans to support one another’s mental health. With numerous potential applications of generative AI in mental health, we illustrate why we should invest in using generative AI to support training people in mental health service provision.
zh
[AI-84] Retentive Relevance: Capturing Long-Term User Value in Recommendation Systems
【速读】:该论文旨在解决传统推荐系统过度依赖短期行为信号(如点击和点赞)所导致的用户长期满意度与留存预测能力不足的问题。这些问题信号通常噪声大、稀疏且难以反映用户的持续使用意图。解决方案的关键在于提出一种新的内容级反馈指标——“留存相关性”(Retentive Relevance),它通过问卷调查直接测量用户对平台类似内容的再次访问意愿,从而捕捉更长期的行为意图。该指标经心理测量学方法验证具有收敛效度、区分效度和行为效度,并在大规模离线建模中显著优于现有行为信号和其他问卷指标,尤其对历史互动较少的用户更具预测力。进一步地,作者构建了一个可投入生产的代理模型,将其集成至多阶段排序系统的最终阶段,在真实线上A/B实验中实现了显著的参与度和留存提升,同时降低了低质量内容曝光,为内容感知与用户留存之间的因果关系提供了首个实证验证框架。
链接: https://arxiv.org/abs/2510.07621
作者: Saeideh Bakhshi,Phuong Mai Nguyen,Robert Schiller,Tiantian Xu,Pawan Kodandapani,Andrew Levine,Cayman Simpson,Qifan Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Recommendation systems have traditionally relied on short-term engagement signals, such as clicks and likes, to personalize content. However, these signals are often noisy, sparse, and insufficient for capturing long-term user satisfaction and retention. We introduce Retentive Relevance, a novel content-level survey-based feedback measure that directly assesses users’ intent to return to the platform for similar content. Unlike other survey measures that focus on immediate satisfaction, Retentive Relevance targets forward-looking behavioral intentions, capturing longer term user intentions and providing a stronger predictor of retention. We validate Retentive Relevance using psychometric methods, establishing its convergent, discriminant, and behavioral validity. Through large-scale offline modeling, we show that Retentive Relevance significantly outperforms both engagement signals and other survey measures in predicting next-day retention, especially for users with limited historical engagement. We develop a production-ready proxy model that integrates Retentive Relevance into the final stage of a multi-stage ranking system on a social media platform. Calibrated score adjustments based on this model yield substantial improvements in engagement, and retention, while reducing exposure to low-quality content, as demonstrated by large-scale A/B experiments. This work provides the first empirically validated framework linking content-level user perceptions to retention outcomes in production systems. We offer a scalable, user-centered solution that advances both platform growth and user experience. Our work has broad implications for responsible AI development.
zh
[AI-85] DGTEN: A Robust Deep Gaussian based Graph Neural Network for Dynamic Trust Evaluation with Uncertainty-Quantification Support
【速读】:该论文旨在解决大规模动态图中信任评估的三大挑战:捕捉关系变化、表达校准后的置信度以及抵御针对信任目标的对抗攻击。其核心解决方案是提出DGTEN(Deep Gaussian-based Trust Evaluation Network),关键在于通过三个创新模块实现统一建模:首先,采用不确定性感知的消息传递机制,将节点和边表示为高斯分布,使语义信号与认知不确定性在图神经网络中传播,从而做出风险感知的信任决策而非过度自信的预测;其次,引入混合绝对-高斯-沙漏(HAGH)位置编码与基于Kolmogorov-Arnold网络的无偏多头注意力机制,并结合常微分方程(ODE)驱动的残差学习模块,联合捕捉信任演变中的突变与平滑趋势;最后,利用鲁棒自适应集成系数分析方法,基于余弦相似度和Jaccard相似度互补度量对可疑交互进行剪枝或降权,有效缓解声誉洗白、破坏攻击及开关式攻击。
链接: https://arxiv.org/abs/2510.07620
作者: Muhammad Usman,Yugyung Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 9 figures, 5 tables
Abstract:Dynamic trust evaluation in large, rapidly evolving graphs requires models that can capture changing relationships, express calibrated confidence, and resist adversarial manipulation. DGTEN (Deep Gaussian-based Trust Evaluation Network) introduces a unified graph framework that achieves all three by combining uncertainty-aware message passing, expressive temporal modeling, and built-in defenses against trust-targeted attacks. It represents nodes and edges as Gaussian distributions so that both semantic signals and epistemic uncertainty propagate through the graph neural network, enabling risk-aware trust decisions rather than overconfident guesses. To model how trust evolves, it employs hybrid Absolute-Gaussian-Hourglass (HAGH) positional encoding with Kolmogorov-Arnold network-based unbiased multi-head attention, followed by an ordinary differential equation (ODE)-based residual learning module to jointly capture abrupt shifts and smooth trends. Robust adaptive ensemble coefficient analysis prunes or down-weights suspicious interactions using complementary cosine and Jaccard similarity measures, mitigating reputation laundering, sabotage, and on/off attacks. On two signed Bitcoin trust networks, DGTEN delivers significant improvements: in single-timeslot prediction on Bitcoin-Alpha, it improves MCC by 10.77% over the best dynamic baseline; in the cold-start scenario, it achieves a 16.41% MCC gain - the largest across all tasks and datasets. Under adversarial on/off attacks, it surpasses the baseline by up to 11.63% MCC. These results validate the effectiveness of the unified DGTEN framework.
zh
[AI-86] raceability and Accountability in Role-Specialized Multi-Agent LLM Pipelines
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体系统在执行复杂软件任务时因错误逐级传递而导致可信度低的问题。其核心解决方案是构建一个可追溯且责任明确的流水线架构,即“规划者-执行者-批评者”(Planner-Executor-Critic)结构,通过清晰的角色分工、结构化的任务交接机制以及完整的操作记录,实现对每个步骤的可追踪性和责任归属。关键创新在于引入结构化问责机制,显著提升准确性并抑制错误扩散,同时量化不同角色的性能表现(如修复率与伤害率),从而为高效、可靠、可调试的多智能体系统设计提供数据驱动的方法。
链接: https://arxiv.org/abs/2510.07614
作者: Amine Barrak
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Sequential multi-agent systems built with large language models (LLMs) can automate complex software tasks, but they are hard to trust because errors quietly pass from one stage to the next. We study a traceable and accountable pipeline, meaning a system with clear roles, structured handoffs, and saved records that let us trace who did what at each step and assign blame when things go wrong. Our setting is a Planner - Executor - Critic pipeline. We evaluate eight configurations of three state-of-the-art LLMs on three benchmarks and analyze where errors start, how they spread, and how they can be fixed. Our results show: (1) adding a structured, accountable handoff between agents markedly improves accuracy and prevents the failures common in simple pipelines; (2) models have clear role-specific strengths and risks (e.g., steady planning vs. high-variance critiquing), which we quantify with repair and harm rates; and (3) accuracy-cost-latency trade-offs are task-dependent, with heterogeneous pipelines often the most efficient. Overall, we provide a practical, data-driven method for designing, tracing, and debugging reliable, predictable, and accountable multi-agent systems.
zh
[AI-87] Agent Ask: Multi-Agent Systems Need to Ask
【速读】:该论文旨在解决多智能体系统(Multi-agent Systems)在基于大语言模型(LLM)时因边缘级错误传播而导致性能低于单智能体基线的问题。其关键解决方案是提出一种轻量且即插即用的澄清模块 AgentAsk,该模块将每个智能体间消息传递视为潜在故障点,并插入最小必要问题以阻止错误扩散。AgentAsk 采用三阶段流程:首先从精心构建的失败轨迹中提炼出边缘级判断策略,其次通过监督学习确定何时、何地、向谁提问,最后利用 E-GRPO 强化学习目标在线优化策略,平衡准确性、延迟与成本。该方法架构无关,易于集成,在数学、推理和编码基准上显著提升准确性和鲁棒性,同时保持延迟和额外成本均低于 5%,接近强评估器性能。
链接: https://arxiv.org/abs/2510.07593
作者: Bohan Lin,Kuo Yang,Yingchuan Lai,Yudong Zhang,Chen Zhang,Guibin Zhang,Xinlei Yu,Miao Yu,Xu Wang,Yang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor. However, they frequently underperform single-agent baselines due to edge-level error cascades: minor inaccuracies at one message handoff propagate across the entire chain. We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation. AgentAsk follows a three-stage pipeline: (i) distilling edge-level judgments from curated failure traces into a compact policy, (ii) supervising the policy to determine when/what/whom/how to ask, and (iii) optimizing online with E-GRPO, a reinforcement learning objective that balances accuracy, latency, and cost. The module is architecture-agnostic and easy to integrate into existing orchestration. Across math, reasoning, and coding benchmarks, AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with latency and extra cost all less than 5%, approaching the performance of a strong evaluator. Beyond empirical improvements, we contribute a principled taxonomy of edge-level errors and a practical recipe for link-local intervention, offering a scalable pathway toward more reliable LLM-based multi-agent systems.
zh
[AI-88] GM: a Modular and Efficient Library for Machine Learning on Temporal Graphs
【速读】:该论文旨在解决时序图(Temporal Graph, TG)机器学习领域缺乏统一、高效且支持多种建模范式的开源基础设施问题。现有库多针对特定模型设计,难以支持多样化任务,且连续时间动态图方法(Continuous-Time Dynamic Graph, CTDG)与离散时间动态图方法(Discrete-Time Dynamic Graph, DTDG)之间存在壁垒,限制了跨方法比较与技术迁移。解决方案的关键在于提出首个统一CTDG与DTDG框架的科研导向库Temporal Graph Modelling (TGM),其核心创新包括:原生支持动态节点特征、时间粒度转换、以及链接级、节点级和图级任务;同时通过优化实现平均比DyGLib快7.8倍、在图离散化任务上比现有方案快175倍,显著提升效率,并首次使动态图属性预测和时间驱动训练等新研究方向成为可能。
链接: https://arxiv.org/abs/2510.07586
作者: Jacob Chmura,Shenyang Huang,Tran Gia Bao Ngo,Ali Parviz,Farimah Poursafaei,Jure Leskovec,Michael Bronstein,Guillaume Rabusseau,Matthias Fey,Reihaneh Rabbany
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures, 14 tables
Abstract:Well-designed open-source software drives progress in Machine Learning (ML) research. While static graph ML enjoys mature frameworks like PyTorch Geometric and DGL, ML for temporal graphs (TG), networks that evolve over time, lacks comparable infrastructure. Existing TG libraries are often tailored to specific architectures, hindering support for diverse models in this rapidly evolving field. Additionally, the divide between continuous- and discrete-time dynamic graph methods (CTDG and DTDG) limits direct comparisons and idea transfer. To address these gaps, we introduce Temporal Graph Modelling (TGM), a research-oriented library for ML on temporal graphs, the first to unify CTDG and DTDG approaches. TGM offers first-class support for dynamic node features, time-granularity conversions, and native handling of link-, node-, and graph-level tasks. Empirically, TGM achieves an average 7.8x speedup across multiple models, datasets, and tasks compared to the widely used DyGLib, and an average 175x speedup on graph discretization relative to available implementations. Beyond efficiency, we show in our experiments how TGM unlocks entirely new research possibilities by enabling dynamic graph property prediction and time-driven training paradigms, opening the door to questions previously impractical to study. TGM is available at this https URL
zh
[AI-89] Accuracy Memory Efficiency and Generalization: A Comparative Study on Liquid Neural Networks and Recurrent Neural Networks
【速读】:该论文旨在解决传统循环神经网络(Recurrent Neural Networks, RNNs)及其变体(如长短期记忆网络 Long Short-Term Memory, LSTM 和门控循环单元 Gated Recurrent Unit, GRU)在处理序列数据时存在的局限性,特别是面对噪声干扰、非平稳数据和分布外泛化能力不足的问题。其解决方案的关键在于系统比较液态神经网络(Liquid Neural Networks, LNNs)与传统RNN架构在模型精度、内存效率和泛化能力等方面的差异,揭示LNN作为一种受生物学启发的连续时间动态神经网络,在处理复杂时序数据中的优势,包括更高的参数效率、计算速度以及对分布外样本的鲁棒性,同时指出提升LNN可扩展性是推动其在更广泛场景中应用的核心方向。
链接: https://arxiv.org/abs/2510.07578
作者: Shilong Zong,Alex Bierly,Almuatazbellah Boker,Hoda Eldardiry
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 13 pages, 12 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:This review aims to conduct a comparative analysis of liquid neural networks (LNNs) and traditional recurrent neural networks (RNNs) and their variants, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs). The core dimensions of the analysis include model accuracy, memory efficiency, and generalization ability. By systematically reviewing existing research, this paper explores the basic principles, mathematical models, key characteristics, and inherent challenges of these neural network architectures in processing sequential data. Research findings reveal that LNN, as an emerging, biologically inspired, continuous-time dynamic neural network, demonstrates significant potential in handling noisy, non-stationary data, and achieving out-of-distribution (OOD) generalization. Additionally, some LNN variants outperform traditional RNN in terms of parameter efficiency and computational speed. However, RNN remains a cornerstone in sequence modeling due to its mature ecosystem and successful applications across various tasks. This review identifies the commonalities and differences between LNNs and RNNs, summarizes their respective shortcomings and challenges, and points out valuable directions for future research, particularly emphasizing the importance of improving the scalability of LNNs to promote their application in broader and more complex scenarios.
zh
[AI-90] Benchmarking is Broken - Dont Let AI be its Own Judge NEURIPS2025 NEURIPS
【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)评估体系中存在的系统性缺陷,包括数据污染、选择性报告、质量控制不足等问题,这些问题导致评估结果易受偏倚影响,难以区分真实进步与夸大宣传,进而削弱公众对AI发展的信任。其解决方案的关键在于提出一个统一、实时且质量可控的基准测试框架——PeerBench,该框架通过密封执行(sealed execution)、动态更新的题库(item banking with rolling renewal)以及延迟透明度(delayed transparency)等机制,构建出一种“内置鲁棒性”的评估范式,从而实现可信、公平、可持续的AI性能衡量。
链接: https://arxiv.org/abs/2510.07575
作者: Zerui Cheng,Stella Wohnig,Ruchika Gupta,Samiul Alam,Tassallah Abdullahi,João Alves Ribeiro,Christian Nielsen-Garcia,Saif Mir,Siran Li,Jason Orender,Seyed Ali Bahrainian,Daniel Kirste,Aaron Gokaslan,Mikołaj Glinka,Carsten Eickhoff,Ruben Wolff
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages; Accepted to NeurIPS 2025. Link to poster: this https URL
Abstract:The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this “Wild West” of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody’s. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today’s AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench, a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress. Comments: 12 pages; Accepted to NeurIPS 2025. Link to poster: this https URL Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2510.07575 [cs.AI] (or arXiv:2510.07575v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.07575 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-91] Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic
【速读】:该论文旨在解决如何从大规模多语言对话数据中识别主题模式,并探究这些主题与用户对不同大语言模型(Large Language Models, LLMs)偏好之间的关联问题。其核心挑战在于处理多语言变体、平衡对话轮次以及清洗噪声和删减数据,从而准确捕捉用户在特定话题下对LLM输出的倾向性。解决方案的关键在于采用基于Transformer架构的BERTopic主题建模方法,结合精心设计的预处理流程,提取出29个语义连贯的主题(如人工智能、编程、伦理等),并通过可视化工具(如主题间距离图、模型-主题矩阵)系统分析模型表现与主题之间的对齐趋势,为领域特定微调和优化提供实证依据。
链接: https://arxiv.org/abs/2510.07557
作者: Abhay Bhandarkar,Gaurav Mishra,Khushi Juchani,Harsh Singhal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly if certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.
zh
[AI-92] An Evaluation Study of Hybrid Methods for Multilingual PII Detection
【速读】:该论文旨在解决低资源语言环境下个人身份信息(Personally Identifiable Information, PII)检测的挑战,这一问题源于语言多样性以及标注数据稀缺。解决方案的关键在于提出一种混合框架RECAP,其结合确定性正则表达式与上下文感知的大语言模型(Large Language Models, LLMs),通过三阶段精炼流水线实现实体消歧与过滤,从而在不重新训练的情况下支持超过300种实体类型,并显著提升检测性能——在加权F1分数上相比微调的命名实体识别(Named Entity Recognition, NER)模型提高82%,相比零样本LLM提高17%。
链接: https://arxiv.org/abs/2510.07551
作者: Harshit Rajgarhia,Suryam Gupta,Asif Shaik,Gulipalli Praveen Kumar,Y Santhoshraj,Sanka Nithya Tanvy Nishitha,Abhishek Mukherji
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP’s modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
zh
[AI-93] EEG Sleep Stage Classification with Continuous Wavelet Transform and Deep Learning
【速读】:该论文旨在解决睡眠阶段分类的准确性问题,以支持睡眠障碍的诊断与管理。传统方法依赖人工标注或从脑电图(EEG)信号中提取时域/频域特征,存在主观性强、效率低等问题。其解决方案的关键在于提出一种基于小波变换(wavelet transform)的时间-频率分析框架,通过连续小波变换(CWT)生成能够捕捉不同频段瞬态和振荡模式的时间-频率图谱,并结合集成学习方法进行分类。该方法在Sleep-EDF Expanded Database上实现88.37%的整体准确率和73.15%的宏平均F1分数,优于传统机器学习方法,且性能可与最新深度学习方法相媲美,展现出鲁棒性、可解释性和临床适用性的优势。
链接: https://arxiv.org/abs/2510.07524
作者: Mehdi Zekriyapanah Gashti,Ghasem Farjamnia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures
Abstract:Accurate classification of sleep stages is crucial for the diagnosis and management of sleep disorders. Conventional approaches for sleep scoring rely on manual annotation or features extracted from EEG signals in the time or frequency domain. This study proposes a novel framework for automated sleep stage scoring using time-frequency analysis based on the wavelet transform. The Sleep-EDF Expanded Database (sleep-cassette recordings) was used for evaluation. The continuous wavelet transform (CWT) generated time-frequency maps that capture both transient and oscillatory patterns across frequency bands relevant to sleep staging. Experimental results demonstrate that the proposed wavelet-based representation, combined with ensemble learning, achieves an overall accuracy of 88.37 percent and a macro-averaged F1 score of 73.15, outperforming conventional machine learning methods and exhibiting comparable or superior performance to recent deep learning approaches. These findings highlight the potential of wavelet analysis for robust, interpretable, and clinically applicable sleep stage classification.
zh
[AI-94] Measuring and Mitigating Identity Bias in Multi-Agent Debate via Anonymization
【速读】:该论文旨在解决多智能体辩论(Multi-agent Debate, MAD)中因身份偏见(identity bias)导致的推理可靠性下降问题,具体表现为代理间存在的谄媚行为(sycophancy)和自我偏执(self-bias),即代理倾向于盲目采纳同伴观点或固守自身先前输出,而非基于内容进行理性判断。解决方案的关键在于提出一个原则性的框架:首先将辩论动态形式化为一种身份加权的贝叶斯更新过程;其次引入响应匿名化(response anonymization)策略,通过移除提示中的身份标识使代理无法区分“自我”与“同伴”,从而强制赋予各代理身份权重相等,有效降低偏见;最后定义了身份偏见系数(Identity Bias Coefficient, IBC)作为量化指标,用于衡量代理遵循同伴观点的频率与自身观点的频率之比。实证研究表明,身份偏见普遍存在,且谄媚行为远多于自我偏执,强调在MAD系统中屏蔽身份信息对于确保基于内容而非来源的可靠推理至关重要。
链接: https://arxiv.org/abs/2510.07517
作者: Hyeong Kyu Choi,Xiaojin Zhu,Yixuan Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer’s view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish “self” from “peer”, which forces equal weights on agent identity, thereby reducing bias. Third, we define the Identity Bias Coefficient (IBC), a principled metric that measures how often an agent follows a peer versus itself. Empirical studies across multiple models, datasets and debate rounds confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to “mask” identity to ensure that MAD systems reason based on content rather than source identity. Code is released in this https URL.
zh
[AI-95] Optimizing Ethical Risk Reduction for Medical Intelligent Systems with Constraint Programming
【速读】:该论文旨在解决医疗智能系统(Medical Intelligent Systems, MIS)在临床应用中因高风险特性所引发的伦理与安全问题,特别是如何通过优化风险分配来满足可信赖人工智能(trustworthy AI)的伦理要求。其核心问题是:在保证覆盖所有可信AI伦理维度的前提下,找到最优的风险评估值分配方案以实现风险最小化。解决方案的关键在于将该问题形式化为一个带约束的优化任务,并采用三种不同的求解范式——混合整数规划(Mixed Integer Programming, MIP)、可满足性(Satisfiability, SAT)和约束规划(Constraint Programming, CP)进行建模与实验比较,其中使用Minizinc语言对问题进行了精确建模,从而系统评估各方法在性能、表达能力和可扩展性方面的表现,为后续构建完整的MIS伦理风险管理体系提供理论基础和技术路径。
链接: https://arxiv.org/abs/2510.07491
作者: Clotilde Brayé,Aurélien Bricout,Arnaud Gotlieb,Nadjib Lazaar,Quentin Vallet
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Medical Intelligent Systems (MIS) are increasingly integrated into healthcare workflows, offering significant benefits but also raising critical safety and ethical concerns. According to the European Union AI Act, most MIS will be classified as high-risk systems, requiring a formal risk management process to ensure compliance with the ethical requirements of trust- worthy AI. In this context, we focus on risk reduction optimization problems, which aim to reduce risks with ethical considerations by finding the best balanced assignment of risk assessment values according to their coverage of trustworthy AI ethical requirements. We formalize this problem as a constrained optimization task and investigate three resolution paradigms: Mixed Integer Programming (MIP), Satisfiability (SAT), and Constraint Pro- gramming(CP).Our contributions include the mathematical formulation of this optimization problem, its modeling with the Minizinc constraint modeling language, and a comparative experimental study that analyzes the performance, expressiveness, and scalability of each ap- proach to solving. From the identified limits of the methodology, we draw some perspectives of this work regarding the integration of the Minizinc model into a complete trustworthy AI ethical risk management process for MIS.
zh
[AI-96] HEMERA: A Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data
【速读】:该论文旨在解决肺癌(Lung Cancer, LC)风险预测中缺乏高精度、可解释性模型的问题,尤其关注如何利用全基因组关联研究(Genome-Wide Association Studies, GWAS)数据实现个体化风险评估。其解决方案的关键在于提出了一种名为HEMERA(Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data)的新框架,该框架直接处理原始单核苷酸多态性(Single Nucleotide Polymorphisms, SNPs)基因型数据,引入加性位置编码、神经基因型嵌入以及优化的变异过滤策略,并结合基于层间积分梯度(Layer-wise Integrated Gradients)的后验可解释模块,使模型预测结果能够精准归因于特定SNP位点,从而与已知的肺癌风险区域高度一致。在27,254名退伍军人计划参与者数据上训练后,HEMERA实现了99%的AUC得分,展现了强大的预测性能与透明性。
链接: https://arxiv.org/abs/2510.07477
作者: Maria Mahbub,Robert J. Klein,Myvizhi Esai Selvan,Rowena Yip,Claudia Henschke,Providencia Morales,Ian Goethert,Olivera Kotevska,Mayanka Chandra Shekar,Sean R. Wilkinson,Eileen McAllister,Samuel M. Aguayo,Zeynep H. Gümüş,Ioana Danciu,VA Million Veteran Program
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures, 3 tables
Abstract:Lung cancer (LC) is the third most common cancer and the leading cause of cancer deaths in the US. Although smoking is the primary risk factor, the occurrence of LC in never-smokers and familial aggregation studies highlight a genetic component. Genetic biomarkers identified through genome-wide association studies (GWAS) are promising tools for assessing LC risk. We introduce HEMERA (Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data), a new framework that applies explainable transformer-based deep learning to GWAS data of single nucleotide polymorphisms (SNPs) for predicting LC risk. Unlike prior approaches, HEMERA directly processes raw genotype data without clinical covariates, introducing additive positional encodings, neural genotype embeddings, and refined variant filtering. A post hoc explainability module based on Layer-wise Integrated Gradients enables attribution of model predictions to specific SNPs, aligning strongly with known LC risk loci. Trained on data from 27,254 Million Veteran Program participants, HEMERA achieved 99% AUC (area under receiver characteristics) score. These findings support transparent, hypothesis-generating models for personalized LC risk assessment and early intervention.
zh
[AI-97] MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting
【速读】:该论文旨在解决传统混合专家(Mixture-of-Experts, MoE)模型在回归任务中仅提供点预测而无法量化预测不确定性的问题。解决方案的关键在于提出一种基于不确定性的门控机制(uncertainty-based gating mechanism),即不再依赖输入特征来决定各专家的权重,而是利用每个专家输出的方差(variance)作为门控信号,动态调整其对最终预测的贡献。这一设计使模型不仅能输出预测均值,还能直接建模并输出预测的不确定性,且该不确定性与实际预测误差高度相关,从而提升了时间序列预测的可靠性与可解释性。
链接: https://arxiv.org/abs/2510.07459
作者: Yoli Shavit,Jacob Goldberger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Mixture-of-Gaussians with Uncertainty-based Gating (MoGU), a novel Mixture-of-Experts (MoE) framework designed for regression tasks and applied to time series forecasting. Unlike conventional MoEs that provide only point estimates, MoGU models each expert’s output as a Gaussian distribution. This allows it to directly quantify both the forecast (the mean) and its inherent uncertainty (variance). MoGU’s core innovation is its uncertainty-based gating mechanism, which replaces the traditional input-based gating network by using each expert’s estimated variance to determine its contribution to the final prediction. Evaluated across diverse time series forecasting benchmarks, MoGU consistently outperforms single-expert models and traditional MoE setups. It also provides well-quantified, informative uncertainties that directly correlate with prediction errors, enhancing forecast reliability. Our code is available from: this https URL
zh
[AI-98] ExpertAgent : Enhancing Personalized Education through Dynamic Planning and Retrieval-Augmented Long-Chain Reasoning NEURIPS2025
【速读】:该论文旨在解决生成式 AI 在教育应用中面临的实时适应性不足、个性化程度低以及内容可靠性差的问题。其解决方案的关键在于提出 ExpertAgent 框架,该框架通过持续更新的学生模型(student model)动态规划学习内容与策略,从而实现高度自适应的学习体验;同时,所有教学内容均基于经过验证的课程知识库(curriculum repository),有效降低大语言模型的幻觉风险,提升内容的可靠性与可信度。
链接: https://arxiv.org/abs/2510.07456
作者: Binrong Zhu,Guiran Liu,Nina Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Manuscript previously submitted to the NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models (LAW 2025)
Abstract:The application of advanced generative artificial intelligence in education is often constrained by the lack of real-time adaptability, personalization, and reliability of the content. To address these challenges, we propose ExpertAgent - an intelligent agent framework designed for personalized education that provides reliable knowledge and enables highly adaptive learning experiences. Therefore, we developed ExpertAgent, an innovative learning agent that provides users with a proactive and personalized learning experience. ExpertAgent dynamic planning of the learning content and strategy based on a continuously updated student model. Therefore, overcoming the limitations of traditional static learning content to provide optimized teaching strategies and learning experience in real time. All instructional content is grounded in a validated curriculum repository, effectively reducing hallucination risks in large language models and improving reliability and trustworthiness.
zh
[AI-99] S-Agent : A Time Series Reasoning Agent with Iterative Statistical Insight Gathering NEURIPS2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在时间序列推理任务中表现不佳的问题,尤其是由幻觉(hallucination)和知识泄露(knowledge leakage)导致的输出不可靠性。其解决方案的关键在于设计了一个名为TS-Agent的时间序列推理代理,该代理将LLM的作用限定在其强项——通过逐步推理收集证据并合成结论,而将时间序列的统计与结构信息提取任务交由专门的时序分析工具处理;同时,TS-Agent直接操作原始数值序列,利用原子操作记录显式证据日志,并在自评机制和最终质量门控的指导下迭代优化推理过程,从而避免多模态对齐训练、保持时间序列原生形式、提升可解释性和可验证性,并有效缓解知识泄露与幻觉问题。
链接: https://arxiv.org/abs/2510.07432
作者: Penghang Liu,Elizabeth Fons,Svitlana Vyetrenko,Daniel Borrajo,Vamsi Potluru,Manuela Veloso
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models
Abstract:Large language models (LLMs) have shown strong abilities in reasoning and problem solving, but recent studies reveal that they still struggle with time series reasoning tasks, where outputs are often affected by hallucination or knowledge leakage. In this work we propose TS-Agent, a time series reasoning agent that leverages LLMs strictly for what they excel at, i.e., gathering evidence and synthesizing it into conclusions through step-by-step reasoning, while delegating the extraction of statistical and structural information to time series analytical tools. Instead of mapping time series into text tokens, images, or embeddings, our agent interacts with raw numeric sequences through atomic operators, records outputs in an explicit evidence log, and iteratively refines its reasoning under the guidance of a self-critic and a final quality gate. This design avoids multi-modal alignment training, preserves the native form of time series, ensures interpretability and verifiability, and mitigates knowledge leakage or hallucination. Empirically, we evaluate the agent on established benchmarks. Our experiments show that TS-Agent achieves performance comparable to state-of-the-art LLMs on understanding benchmarks, and delivers significant improvements on reasoning tasks, where existing models often rely on memorization and fail in zero-shot settings.
zh
[AI-100] Less is More: Strategic Expert Selection Outperforms Ensemble Complexity in Traffic Forecasting ICTAI2025
【速读】:该论文旨在解决交通流量预测中现有混合专家(Mixture of Experts, MoE)框架未能显式融合物理道路网络拓扑结构的问题,从而限制了其空间建模能力。解决方案的关键在于提出TESTAM+框架,引入一种新型的时空语义专家(SpatioSemantic Expert),通过混合图构建机制将物理道路拓扑与数据驱动的特征相似性相结合,实现更精准的空间特征提取。此外,研究发现策略性地选择少数专家(如Identity + Adaptive组合)显著优于简单的多专家集成,不仅在METR LA和PEMS BAY数据集上达到新的SOTA性能(MAE分别降低11.5%),还大幅降低推理延迟(减少53.1%),为实时部署提供了高效可行的方案。
链接: https://arxiv.org/abs/2510.07426
作者: Walid Guettala,Yufan Zhao,László Gulyás
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ICTAI 2025. Version 0.9. 10 pages, 5 figures. Preprint differs from the published version in formatting and minor wording
Abstract:Traffic forecasting is fundamental to intelligent transportation systems, enabling congestion mitigation and emission reduction in increasingly complex urban environments. While recent graph neural network approaches have advanced spatial temporal modeling, existing mixture of experts frameworks like Time Enhanced Spatio Temporal Attention Model (TESTAM) lack explicit incorporation of physical road network topology, limiting their spatial capabilities. We present TESTAM+, an enhanced spatio temporal forecasting framework that introduces a novel SpatioSemantic Expert integrating physical road topology with data driven feature similarity through hybrid graph construction. TESTAM+ achieves significant improvements over TESTAM: 1.3% MAE reduction on METR LA (3.10 vs. 3.14) and 4.1% improvement on PEMS BAY (1.65 vs. 1.72). Through comprehensive ablation studies, we discover that strategic expert selection fundamentally outperforms naive ensemble aggregation. Individual experts demonstrate remarkable effectiveness: the Adaptive Expert achieves 1.63 MAE on PEMS BAY, outperforming the original three expert TESTAM (1.72 MAE), while the SpatioSemantic Expert matches this performance with identical 1.63 MAE. The optimal Identity + Adaptive configuration achieves an 11.5% MAE reduction compared to state of the art MegaCRN on METR LA (2.99 vs. 3.38), while reducing inference latency by 53.1% compared to the full four expert TESTAM+. Our findings reveal that fewer, strategically designed experts outperform complex multi expert ensembles, establishing new state of the art performance with superior computational efficiency for real time deployment.
zh
[AI-101] ProSEA: Problem Solving via Exploration Agents
【速读】:该论文旨在解决当前AI代理在复杂任务中面临的静态规划与脆弱交互问题,即现有系统缺乏动态适应能力与协同推理机制,难以实现真正意义上的协作与自适应决策。其解决方案的关键在于提出ProSEA框架,该框架采用分层架构,由管理代理(Manager Agent)协调领域专业代理(Expert Agents),通过结构化反馈机制识别失败原因及新约束,并据此进行计划演化与迭代优化;同时支持自主运行与人类协同的无缝集成,从而显著提升任务成功率与鲁棒性。
链接: https://arxiv.org/abs/2510.07423
作者: William Nguyen,Vinh Luong,Christopher Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have empowered AI agents to tackle increasingly complex tasks. However, most existing agents remain limited to static planning and brittle interactions, falling short of true collaboration or adaptive reasoning. We introduce ProSEA, a modular, general-purpose multi-agent framework designed for iterative problem solving through exploration and plan evolution. ProSEA features a hierarchical architecture in which a Manager Agent orchestrates domain-specialized Expert Agents, decomposes tasks, and adaptively replans based on structured feedback from failed attempts. Unlike prior systems, ProSEA agents report not only success or failure but also detailed reasons for failure and newly discovered constraints, enabling dynamic plan refinement informed by exploratory traces. The framework operates autonomously but supports seamless integration with human collaborators when needed. Experiments on the challenging FinanceBench benchmark demonstrate that ProSEA, even without human feedback, outperforms state-of-the-art baselines and achieves robust performance across reasoning-heavy tasks. These results underscore ProSEA’s potential as a foundation for more transparent, adaptive, and human-aligned AI agents.
zh
[AI-102] Position: AI Will Transform Neuropsychology Through Mental Health Digital Twins for Dynamic Mental Health Care Especially for ADHD
【速读】:该论文试图解决当前精神健康诊断评估中静态评估方法无法适应动态心理状态变化的问题,尤其在注意力缺陷多动障碍(Attention-Deficit/Hyperactivity Disorder, ADHD)领域,传统方法难以满足个性化与长期追踪的需求。其解决方案的关键在于引入生成式 AI(Generative AI)驱动的连续、低侵入式体验采样(experience sampling),结合心理健康数字孪生(Mental Health Digital Twins, MHDTs)这一计算框架,实现对个体症状动态演化过程的持续建模与实时调整,从而构建更精准、可扩展且以患者为中心的个性化干预路径。
链接: https://arxiv.org/abs/2510.07409
作者: Neil Natarajan,Sruthi Viswanathan,Xavier Roberts-Gaal,Michelle Marie Martel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Static solutions don’t serve a dynamic mind. Thus, we advocate a shift from static mental health diagnostic assessments to continuous, artificial intelligence (AI)-driven assessment. Focusing on Attention-Deficit/Hyperactivity Disorder (ADHD) as a case study, we explore how generative AI has the potential to address current capacity constraints in neuropsychology, potentially enabling more personalized and longitudinal care pathways. In particular, AI can efficiently conduct frequent, low-level experience sampling from patients and facilitate diagnostic reconciliation across care pathways. We envision a future where mental health care benefits from continuous, rich, and patient-centered data sampling to dynamically adapt to individual patient needs and evolving conditions, thereby improving both accessibility and efficacy of treatment. We further propose the use of mental health digital twins (MHDTs) - continuously updated computational models that capture individual symptom dynamics and trajectories - as a transformative framework for personalized mental health care. We ground this framework in empirical evidence and map out the research agenda required to refine and operationalize it.
zh
[AI-103] Base Models Know How to Reason Thinking Models Learn When
【速读】:该论文试图解决的问题是:生成式 AI (Generative AI) 中的“思考型模型”(如 DeepSeek R1)为何能显著优于基础模型(base models),其性能提升究竟源于全新推理能力的学习,还是对已有基础模型推理机制的重新利用与优化。解决方案的关键在于提出一种混合模型架构,通过在基础模型中适时激活已存在的推理机制来模拟思考型模型的推理链,从而验证思考型模型主要依赖于预训练阶段习得的推理能力,并在后训练阶段学习如何高效地触发这些机制。实验表明,在不更新任何权重的前提下,仅通过对约 12% 的 token 进行引导,该混合模型即可恢复高达 91% 的性能差距,揭示了推理机制在基础模型中的潜在可复用性及其部署时机的重要性。
链接: https://arxiv.org/abs/2510.07364
作者: Constantin Venhoff,Iván Arcuschin,Philip Torr,Arthur Conmy,Neel Nanda
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages
Abstract:Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.
zh
[AI-104] L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)
【速读】:该论文旨在解决工业物联网(IIoT)环境中关键网络物理系统(CPS)面临的复杂、多阶段攻击问题,传统防御机制因缺乏上下文感知能力而难以有效应对。其解决方案的核心在于提出L2M-AID框架,该框架融合生成式AI与多智能体强化学习(MARL),通过大型语言模型(LLM)作为语义桥梁,将海量非结构化遥测数据转化为富含上下文的状态表示,使智能体能够推理攻击者意图而非仅依赖模式匹配;在此基础上,基于MAPPO算法的MARL模块学习复杂的协同防御策略,并设计兼顾安全目标与物理过程稳定性的奖励函数,从而实现高效且稳定的自主工业防御。
链接: https://arxiv.org/abs/2510.07363
作者: Tianxiang Xu,Zhichao Wen,Xinyu Zhao,Jun Wang,Yan Li,Chang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This preprint was submitted to IEEE TrustCom 2025. The accepted version will be published under copyright 2025 IEEE
Abstract:The increasing integration of Industrial IoT (IIoT) exposes critical cyber-physical systems to sophisticated, multi-stage attacks that elude traditional defenses lacking contextual awareness. This paper introduces L2M-AID, a novel framework for Autonomous Industrial Defense using LLM-empowered, Multi-agent reinforcement learning. L2M-AID orchestrates a team of collaborative agents, each driven by a Large Language Model (LLM), to achieve adaptive and resilient security. The core innovation lies in the deep fusion of two AI paradigms: we leverage an LLM as a semantic bridge to translate vast, unstructured telemetry into a rich, contextual state representation, enabling agents to reason about adversary intent rather than merely matching patterns. This semantically-aware state empowers a Multi-Agent Reinforcement Learning (MARL) algorithm, MAPPO, to learn complex cooperative strategies. The MARL reward function is uniquely engineered to balance security objectives (threat neutralization) with operational imperatives, explicitly penalizing actions that disrupt physical process stability. To validate our approach, we conduct extensive experiments on the benchmark SWaT dataset and a novel synthetic dataset generated based on the MITRE ATTCK for ICS framework. Results demonstrate that L2M-AID significantly outperforms traditional IDS, deep learning anomaly detectors, and single-agent RL baselines across key metrics, achieving a 97.2% detection rate while reducing false positives by over 80% and improving response times by a factor of four. Crucially, it demonstrates superior performance in maintaining physical process stability, presenting a robust new paradigm for securing critical national infrastructure.
zh
[AI-105] Local MAP Sampling for Diffusion Models
【速读】:该论文旨在解决逆问题(inverse problems)中传统生成式方法与优化方法之间的范式差异问题:尽管扩散后验采样(Diffusion Posterior Sampling, DPS)提供了严格的贝叶斯框架来从 $ p(x_0 \mid y) $ 中采样,但实际应用中更关注的是最优重建结果而非完整的后验分布覆盖;而优化类扩散求解器虽在重建精度上表现优异,却缺乏清晰的概率基础。为此,作者提出局部最大后验采样(Local MAP Sampling, LMAPS),其核心在于沿扩散轨迹迭代求解局部最大后验(Local MAP)子问题,从而建立优化方法与全局最大后验估计及DPS之间的统一概率解释。关键创新包括:一个具有概率意义的协方差近似、用于稳定性和可解释性的重构目标重定义,以及针对不可微算子的梯度近似策略。实验表明,LMAPS在图像复原和科学计算任务中均达到当前最优性能,如运动模糊去噪、JPEG恢复和量化修复提升 ≥2 dB,反散射任务提升 1.5 dB。
链接: https://arxiv.org/abs/2510.07343
作者: Shaorong Zhang,Rob Brekelmans,Greg Ver Steeg
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from p(x_0 \mid y) . However, in practice, the goal of inverse problem solving is not to cover the posterior but to recover the most accurate reconstruction, where optimization-based diffusion solvers often excel despite lacking a clear probabilistic foundation. We introduce Local MAP Sampling (LMAPS), a new inference framework that iteratively solving local MAP subproblems along the diffusion trajectory. This perspective clarifies their connection to global MAP estimation and DPS, offering a unified probabilistic interpretation for optimization-based methods. Building on this foundation, we develop practical algorithms with a probabilistically interpretable covariance approximation, a reformulated objective for stability and interpretability, and a gradient approximation for non-differentiable operators. Across a broad set of image restoration and scientific tasks, LMAPS achieves state-of-the-art performance, including \geq 2 dB gains on motion deblurring, JPEG restoration, and quantization, and 1.5 dB improvements on inverse scattering benchmarks.
zh
[AI-106] ruth-Aware Decoding: A Program-Logic Approach to Factual Language Generation
【速读】:该论文旨在解决大语言模型在生成过程中产生的幻觉(hallucination)问题,即模型输出与知识库(knowledge base)不一致的现象。解决方案的关键在于提出一种称为Truth-Aware Decoding (TAD) 的验证导向解码机制,其核心是在解码阶段引入一个语义守卫(semantic guards)构成的格结构(lattice),通过约束式语义将知识库验证转化为程序逻辑判断,并利用贪心选择策略在守卫完备且正确的情况下保证局部似然优势,同时引入基于熵的不变量来量化事实风险(factual risk),从而在不牺牲吞吐量的前提下显著降低幻觉。
链接: https://arxiv.org/abs/2510.07331
作者: Faruk Alpay,Hamdi Alakkad
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 18 pages, Lean code provided
Abstract:This paper introduces Truth-Aware Decoding (TAD), a verification-oriented decoding scheme that aligns neural language generation with knowledge bases. Situated in the tradition of probabilistic program semantics for sequence models, TAD augments modern instruction-tuned systems with a lattice of semantic guards that operate at decode time. Our contributions are fourfold: (i) a constraint-based semantics that renders oracle filtering as a program-logic judgment, (ii) a proof that greedy selection enjoys local likelihood dominance under sound and complete guards (Theorem 2.7), (iii) an entropy-style invariant that quantifies factual risk via knowledge-aware safe mass, and (iv) a multi-agent operational calculus with verified Lean artefacts to certify implementation behaviour. Numerical and algorithmic case studies confirm that the resulting guardrails reduce hallucinations without sacrificing throughput, yielding a pragmatic bridge between large-scale empirical models and formal verification.
zh
[AI-107] Platform-Agnostic Modular Architecture for Quantum Benchmarking
【速读】:该论文旨在解决量子计算基准测试领域日益碎片化的问题,即不同框架和工具之间缺乏互操作性,导致基准测试难以统一评估与比较。解决方案的关键在于提出一种平台无关的模块化架构,将问题生成、电路执行和结果分析解耦为独立且可互操作的组件,并通过标准化接口实现跨框架兼容性,从而在保持优化灵活性的同时降低生态系统碎片化程度。
链接: https://arxiv.org/abs/2510.08469
作者: Neer Patel,Anish Giri,Hrushikesh Pramod Patil,Noah Siekierski,Avimita Chatterjee,Sonika Johri,Timothy Proctor,Thomas Lubinski,Siyuan Niu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:We present a platform-agnostic modular architecture that addresses the increasingly fragmented landscape of quantum computing benchmarking by decoupling problem generation, circuit execution, and results analysis into independent, interoperable components. Supporting over 20 benchmark variants ranging from simple algorithmic tests like Bernstein-Vazirani to complex Hamiltonian simulation with observable calculations, the system integrates with multiple circuit generation APIs (Qiskit, CUDA-Q, Cirq) and enables diverse workflows. We validate the architecture through successful integration with Sandia’s \textitpyGSTi for advanced circuit analysis and CUDA-Q for multi-GPU HPC simulations. Extensibility of the system is demonstrated by implementing dynamic circuit variants of existing benchmarks and a new quantum reinforcement learning benchmark, which become readily available across multiple execution and analysis modes. Our primary contribution is identifying and formalizing modular interfaces that enable interoperability between incompatible benchmarking frameworks, demonstrating that standardized interfaces reduce ecosystem fragmentation while preserving optimization flexibility. This architecture has been developed as a key enhancement to the continually evolving QED-C Application-Oriented Performance Benchmarks for Quantum Computing suite.
zh
[AI-108] Iterated Agent for Symbolic Regression
【速读】:该论文旨在解决符号回归(Symbolic Regression, SR)中因搜索空间组合爆炸导致的模型过拟合及可解释性差的问题。传统基于遗传编程的方法在语法层面探索表达式空间,常生成复杂且难以理解的模型。其解决方案的关键在于提出IdeaSearchFitter框架,利用大语言模型(Large Language Models, LLMs)作为语义算子,在进化搜索过程中生成受自然语言推理引导的候选表达式,从而偏向于发现既准确又具有概念一致性和可解释性的数学模型。
链接: https://arxiv.org/abs/2510.08317
作者: Zhuo-Yang Song,Zeyu Cai,Shutao Zhang,Jiashen Wei,Jichen Pan,Shi Qiu,Qing-Hong Cao,Tie-Jiun Hou,Xiaohui Liu,Ming-xing Luo,Hua Xing Zhu
机构: 未知
类目: Computational Physics (physics.comp-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
备注: 45 pages, 22 figures, 8 tables
Abstract:Symbolic regression (SR), the automated discovery of mathematical expressions from data, is a cornerstone of scientific inquiry. However, it is often hindered by the combinatorial explosion of the search space and a tendency to overfit. Popular methods, rooted in genetic programming, explore this space syntactically, often yielding overly complex, uninterpretable models. This paper introduces IdeaSearchFitter, a framework that employs Large Language Models (LLMs) as semantic operators within an evolutionary search. By generating candidate expressions guided by natural-language rationales, our method biases discovery towards models that are not only accurate but also conceptually coherent and interpretable. We demonstrate IdeaSearchFitter’s efficacy across diverse challenges: it achieves competitive, noise-robust performance on the Feynman Symbolic Regression Database (FSReD), outperforming several strong baselines; discovers mechanistically aligned models with good accuracy-complexity trade-offs on real-world data; and derives compact, physically-motivated parametrizations for Parton Distribution Functions in a frontier high-energy physics application. IdeaSearchFitter is a specialized module within our broader iterated agent framework, IdeaSearch, which is publicly available at this https URL.
zh
[AI-109] Quantum Agents for Algorithmic Discovery
【速读】:该论文旨在解决量子算法与协议自动发现的问题,即如何通过智能体自主学习和探索,无需预先知晓最优解即可重新发现已知的著名量子算法和协议。其解决方案的关键在于使用基于回合制奖励的强化学习(episodic, reward-based reinforcement learning)训练量子智能体(quantum agents),使其在与环境的交互中直接学习到高效量子电路结构和策略,例如量子傅里叶变换(Quantum Fourier Transform)、Grover搜索算法、强抛币博弈中的最优欺骗策略以及CHSH等非定域博弈的最优获胜策略,从而展示了量子智能在算法发现中的潜力,为自动化设计新型量子算法提供了可行路径。
链接: https://arxiv.org/abs/2510.08159
作者: Iordanis Kerenidis,El-Amine Cherrat
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce quantum agents trained by episodic, reward-based reinforcement learning to autonomously rediscover several seminal quantum algorithms and protocols. In particular, our agents learn: efficient logarithmic-depth quantum circuits for the Quantum Fourier Transform; Grover’s search algorithm; optimal cheating strategies for strong coin flipping; and optimal winning strategies for the CHSH and other nonlocal games. The agents achieve these results directly through interaction, without prior access to known optimal solutions. This demonstrates the potential of quantum intelligence as a tool for algorithmic discovery, opening the way for the automated design of novel quantum algorithms and protocols.
zh
[AI-110] An Adaptive Multi Agent Bitcoin Trading System
【速读】:该论文旨在解决加密货币市场中因极端波动性、快速变化的市场情绪和监管公告等因素导致传统静态回归模型或仅依赖历史数据训练的神经网络难以有效建模的问题。其解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的多智能体交易系统,将LLM拆分为专业化代理:技术分析代理、情绪评估代理、决策代理与反思代理(Reflect agent),其中反思代理通过自然语言形式提供每日及每周的交易决策反馈,并将这些文本评价注入后续提示中,从而在不进行参数更新或微调的前提下动态调整指标优先级、情绪权重和资产配置逻辑,实现持续优化。该机制显著提升了系统在不同市场周期下的表现,验证了自然语言反馈作为低成本地调优LLM以达成金融目标的新范式。
链接: https://arxiv.org/abs/2510.08068
作者: Aadi Singhi
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注: 18 pages, 6 figures , 2 tables
Abstract:This paper presents a Multi Agent Bitcoin Trading system that utilizes Large Lan- guage Models (LLMs) for alpha generation and portfolio management in the cryptocur- rencies market. Unlike equities, cryptocurrencies exhibit extreme volatility and are heavily influenced by rapidly shifting market sentiments and regulatory announcements, making them difficult to model using static regression models or neural networks trained solely on historical data [53]. The proposed framework overcomes this by structuring LLMs into specialised agents for technical analysis, sentiment evaluation, decision-making, and performance reflection. The system improves over time through a novel verbal feedback mechanism where a Reflect agent provides daily and weekly natural-language critiques of trading decisions. These textual evaluations are then injected into future prompts, al- lowing the system to adjust indicator priorities, sentiment weights, and allocation logic without parameter updates or finetuning. Back-testing on Bitcoin price data from July 2024 to April 2025 shows consistent outperformance across market regimes: the Quantita- tive agent delivered over 30% higher returns in bullish phases and 15% overall gains versus buy-and-hold, while the sentiment-driven agent turned sideways markets from a small loss into a gain of over 100%. Adding weekly feedback further improved total performance by 31% and reduced bearish losses by 10%. The results demonstrate that verbal feedback represents a new, scalable, and low-cost method of tuning LLMs for financial goals.
zh
[AI-111] MRI-derived quantification of hepatic vessel-to-volume ratios in chronic liver disease using a deep learning approach
【速读】:该论文旨在解决慢性肝病(chronic liver disease, CLD)不同阶段肝脏血管体积变化的量化问题,并评估其与肝功能障碍及肝纤维化/门静脉高压相关生物标志物之间的关联。解决方案的关键在于采用基于深度学习的3D U-Net模型对钆塞酸二钠增强的3特斯拉磁共振成像(gadoxetic acid-enhanced 3-T MRI)进行肝脏血管分割,从而精确计算总肝血管体积比(TVVR)、肝内血管体积比(HVVR)和门静脉体积比(PVVR),并揭示这些参数在健康对照组、非进展性CLD及进展性CLD(advanced chronic liver disease, ACLD)中的差异及其与ALBI评分、MELD-Na评分、FIB-4指数、肝硬度测量(LSM)、肝静脉压力梯度(HVPG)、血小板计数(PLT)和脾脏体积等指标的相关性。
链接: https://arxiv.org/abs/2510.08039
作者: Alexander Herold,Daniel Sobotka,Lucian Beer,Nina Bastati,Sarah Poetter-Lang,Michael Weber,Thomas Reiberger,Mattias Mandorfer,Georg Semmler,Benedikt Simbrunner,Barbara D. Wichtmann,Sami A. Ba-Ssalamah,Michael Trauner,Ahmed Ba-Ssalamah,Georg Langs
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注: ^Alexander Herold and Daniel Sobotka share first-authorship
Abstract:Background: We aimed to quantify hepatic vessel volumes across chronic liver disease stages and healthy controls using deep learning-based magnetic resonance imaging (MRI) analysis, and assess correlations with biomarkers for liver (dys)function and fibrosis/portal hypertension. Methods: We assessed retrospectively healthy controls, non-advanced and advanced chronic liver disease (ACLD) patients using a 3D U-Net model for hepatic vessel segmentation on portal venous phase gadoxetic acid-enhanced 3-T MRI. Total (TVVR), hepatic (HVVR), and intrahepatic portal vein-to-volume ratios (PVVR) were compared between groups and correlated with: albumin-bilirubin (ALBI) and model for end-stage liver disease-sodium (MELD-Na) score, and fibrosis/portal hypertension (Fibrosis-4 [FIB-4] score, liver stiffness measurement [LSM], hepatic venous pressure gradient [HVPG], platelet count [PLT], and spleen volume). Results: We included 197 subjects, aged 54.9 \pm 13.8 years (mean \pm standard deviation), 111 males (56.3%): 35 healthy controls, 44 non-ACLD, and 118 ACLD patients. TVVR and HVVR were highest in controls (3.9; 2.1), intermediate in non-ACLD (2.8; 1.7), and lowest in ACLD patients (2.3; 1.0) ( p \leq 0.001 ). PVVR was reduced in both non-ACLD and ACLD patients (both 1.2) compared to controls (1.7) ( p \leq 0.001 ), but showed no difference between CLD groups ( p = 0.999 ). HVVR significantly correlated indirectly with FIB-4, ALBI, MELD-Na, LSM, and spleen volume ( \rho ranging from -0.27 to -0.40), and directly with PLT ( \rho = 0.36 ). TVVR and PVVR showed similar but weaker correlations. Conclusions: Deep learning-based hepatic vessel volumetry demonstrated differences between healthy liver and chronic liver disease stages and shows correlations with established markers of disease severity. Comments: ^Alexander Herold and Daniel Sobotka share first-authorship Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.08039 [physics.med-ph] (or arXiv:2510.08039v1 [physics.med-ph] for this version) https://doi.org/10.48550/arXiv.2510.08039 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1186/s41747-025-00612-y Focus to learn more DOI(s) linking to related resources Submission history From: Alexander Herold [view email] [v1] Thu, 9 Oct 2025 10:23:16 UTC (1,035 KB)
zh
[AI-112] Minimizing the Value-at-Risk of Loan Portfolio via Deep Neural Networks
【速读】:该论文旨在解决P2P借贷中投资者面临的信用风险问题,核心目标是通过优化贷款组合的VaR(Value-at-Risk)或CVaR(Conditional Value-at-Risk)来降低投资风险。解决方案的关键在于提出两种深度神经网络模型——低自由度的DeNN和高自由度的DSNN,它们不仅能预测单笔贷款的违约概率,还能预测违约发生的时间点,从而为投资组合的风险控制提供更精细的决策依据。实验表明,两种模型均显著降低了不同置信水平下的组合VaR,其中DeNN在多数场景下表现优于DSNN,体现出模型复杂度与实际效果之间的平衡优势。
链接: https://arxiv.org/abs/2510.07444
作者: Albert Di Wang,Ye Du
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Mathematical Finance (q-fin.MF); Portfolio Management (q-fin.PM)
备注:
Abstract:Risk management is a prominent issue in peer-to-peer lending. An investor may naturally reduce his risk exposure by diversifying instead of putting all his money on one loan. In that case, an investor may want to minimize the Value-at-Risk (VaR) or Conditional Value-at-Risk (CVaR) of his loan portfolio. We propose a low degree of freedom deep neural network model, DeNN, as well as a high degree of freedom model, DSNN, to tackle the problem. In particular, our models predict not only the default probability of a loan but also the time when it will default. The experiments demonstrate that both models can significantly reduce the portfolio VaRs at different confidence levels, compared to benchmarks. More interestingly, the low degree of freedom model, DeNN, outperforms DSNN in most scenarios.
zh
[AI-113] Quantum Grid Path Planning Using Parallel QAOA Circuits Based on Minimum Energy Principle
【速读】:该论文旨在解决经典路径规划方案在处理NP难问题时的瓶颈,以及当前主流量子路径规划框架在噪声中等规模量子(Noisy Intermediate-Scale Quantum, NISQ)时代所面临的困境。其解决方案的关键在于构建一种基于并行量子近似优化算法(Quantum Approximate Optimization Algorithm, QAOA)架构的量子路径规划方法:将网格路径规划问题映射为寻找最低量子能量状态的问题,并设计两个并行的QAOA电路,分别执行连通性能量计算和路径能量计算;同时引入经典算法对连通性能量结果进行过滤,以剔除不合理解,最终通过融合两路并行计算结果获得近似最优路径解。研究表明,合理设置滤波参数可有效抑制低概率量子态,提升目标量子态获取概率,即使在电路层数 $ p=1 $ 的情况下仍能通过滤波机制找到最优路径编码组合,且并行结构相比串行电路具有显著优势,能够以更高概率捕获最优可行路径编码组合。
链接: https://arxiv.org/abs/2510.07413
作者: Jun Liu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)
备注:
Abstract:To overcome the bottleneck of classical path planning schemes in solving NP problems and address the predicament faced by current mainstream quantum path planning frameworks in the Noisy Intermediate-Scale Quantum (NISQ) era, this study attempts to construct a quantum path planning solution based on parallel Quantum Approximate Optimization Algorithm (QAOA) architecture. Specifically, the grid path planning problem is mapped to the problem of finding the minimum quantum energy state. Two parallel QAOA circuits are built to simultaneously execute two solution processes, namely connectivity energy calculation and path energy calculation. A classical algorithm is employed to filter out unreasonable solutions of connectivity energy, and finally, the approximate optimal solution to the path planning problem is obtained by merging the calculation results of the two parallel circuits. The research findings indicate that by setting appropriate filter parameters, quantum states corresponding to position points with extremely low occurrence probabilities can be effectively filtered out, thereby increasing the probability of obtaining the target quantum state. Even when the circuit layer number p is only 1, the theoretical solution of the optimal path coding combination can still be found by leveraging the critical role of the filter. Compared with serial circuits, parallel circuits exhibit a significant advantage, as they can find the optimal feasible path coding combination with the highest probability.
zh
[AI-114] Attention to Order: Transformers Discover Phase Transitions via Learnability
链接: https://arxiv.org/abs/2510.07401
作者: Şener Özönder
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
[AI-115] Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model
链接: https://arxiv.org/abs/2510.07345
作者: Danush Kumar Venkatesh,Adam Schmidt,Muhammad Abdullah Jamal,Omid Mohareri
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 29 pages, 16 figures
机器学习
[LG-0] Who Said Neural Networks Arent Linear?
链接: https://arxiv.org/abs/2510.08570
作者: Nimrod Berman,Assaf Hallak,Assaf Shocher
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, f : X \to Y . Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is, in fact, linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if we sandwich a linear operator A between two invertible neural networks, f(x)=g_y^-1(A g_x(x)) , then the corresponding vector spaces X and Y are induced by newly defined addition and scaling actions derived from g_x and g_y . We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e. f(f(x))=f(x) ) on networks leading to a globally projective generative model and to demonstrate modular style transfer.
[LG-1] Where Have All the Kaczmarz Iterates Gone?
链接: https://arxiv.org/abs/2510.08563
作者: El Houcine Bergou,Soumia Boucherouite,Aritra Dutta,Xin Li,Anna Ma
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The randomized Kaczmarz (RK) algorithm is one of the most computationally and memory-efficient iterative algorithms for solving large-scale linear systems. However, practical applications often involve noisy and potentially inconsistent systems. While the convergence of RK is well understood for consistent systems, the study of RK on noisy, inconsistent linear systems is limited. This paper investigates the asymptotic behavior of RK iterates in expectation when solving noisy and inconsistent systems, addressing the locations of their limit points. We explore the roles of singular vectors of the (noisy) coefficient matrix and derive bounds on the convergence horizon, which depend on the noise levels and system characteristics. Finally, we provide extensive numerical experiments that validate our theoretical findings, offering practical insights into the algorithm’s performance under realistic conditions. These results establish a deeper understanding of the RK algorithm’s limitations and robustness in noisy environments, paving the way for optimized applications in real-world scientific and engineering problems.
[LG-2] Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
链接: https://arxiv.org/abs/2510.08554
作者: Kevin Rojas,Jiahe Lin,Kashif Rasul,Anderson Schneider,Yuriy Nevmyvaka,Molei Tao,Wei Deng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce \textbfGroup Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
[LG-3] Entropy Regularizing Activation: Boosting Continuous Control Large Language Models and Image Classification with Activation as Entropy Constraints
链接: https://arxiv.org/abs/2510.08549
作者: Zilin Kang,Chonghua Liao,Tingqiang Xu,Huazhe Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models(LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
[LG-4] SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
链接: https://arxiv.org/abs/2510.08544
作者: Hengrui Zhang,Pratyush Patel,August Ning,David Wentzlaff
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode phase. To efficiently serve LLMs, prior work proposes prefill-decode disaggregation to run each phase on separate hardware. However, existing hardware poorly matches the different requirements of each phase. Current datacenter GPUs and TPUs follow a more-is-better design philosophy that maximizes compute and memory resources, causing memory bandwidth underutilization in the prefill phase and compute underutilization in the decode phase. Such underutilization directly translates into increased serving costs. This paper proposes SPAD (Specialized Prefill and Decode hardware), adopting a less-is-more methodology to design specialized chips tailored to the distinct characteristics of prefill and decode phases. The proposed Prefill Chips have larger systolic arrays and use cost-effective GDDR memory, whereas the proposed Decode Chips retain high memory bandwidth but reduce compute capacity. Compared to modeled H100s, simulations show that the proposed Prefill Chips deliver 8% higher prefill performance on average at 52% lower hardware cost, while the proposed Decode Chips achieve 97% of the decode performance with 28% lower TDP. End-to-end simulations on production traces show that SPAD reduces hardware cost by 19%-41% and TDP by 2%-17% compared to modeled baseline clusters while offering the same performance. Even when models and workloads change, SPAD can reallocate either type of chip to run either phase and still achieve 11%-43% lower hardware costs, demonstrating the longevity of the SPAD design. Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2510.08544 [cs.AR] (or arXiv:2510.08544v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2510.08544 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-5] Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2510.08526
作者: Yash Jhaveri,Harley Wiltzer,Patrick Shafto,Marc G. Bellemare,David Meger
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025. First two authors contributed equally
Abstract:In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy derived objects–value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated to its interpretable, diversity-preserving optimal policy.
[LG-6] DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
链接: https://arxiv.org/abs/2510.08522
作者: Yuanjun Dai,Keqiang He,An Wang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Existing batch size selection approaches in dis- tributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequen- tial decision-making problem using Proximal Policy Optimiza- tion (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse work- loads, hardware configurations, and network conditions, DY- NAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.
[LG-7] Implementing Semantic Join Operators Efficiently
链接: https://arxiv.org/abs/2510.08489
作者: Immanuel Trummer
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
Abstract:Semantic query processing engines often support semantic joins, enabling users to match rows that satisfy conditions specified in natural language. Such join conditions can be evaluated using large language models (LLMs) that solve novel tasks without task-specific training. Currently, many semantic query processing engines implement semantic joins via nested loops, invoking the LLM to evaluate the join condition on row pairs. Instead, this paper proposes a novel algorithm, inspired by the block nested loops join operator implementation in traditional database systems. The proposed algorithm integrates batches of rows from both input tables into a single prompt. The goal of the LLM invocation is to identify all matching row pairs in the current input. The paper introduces formulas that can be used to optimize the size of the row batches, taking into account constraints on the size of the LLM context window (limiting both input and output size). An adaptive variant of the proposed algorithm refers to cases in which the size of the output is difficult to estimate. A formal analysis of asymptotic processing costs, as well as empirical results, demonstrates that the proposed approach reduces costs significantly and performs well compared to join implementations used by recent semantic query processing engines. Subjects: Databases (cs.DB); Machine Learning (cs.LG) Cite as: arXiv:2510.08489 [cs.DB] (or arXiv:2510.08489v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2510.08489 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-8] In-Context Clustering with Large Language Models
链接: https://arxiv.org/abs/2510.08466
作者: Ying Wang,Mengye Ren,Andrew Gordon Wilson
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose In-Context Clustering (ICC), a flexible LLM-based procedure for clustering data from diverse distributions. Unlike traditional clustering algorithms constrained by predefined similarity measures, ICC flexibly captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data, with attention matrices showing salient cluster patterns. Spectral clustering using attention matrices offers surprisingly competitive performance. We further enhance the clustering capabilities of LLMs on numeric and image data through fine-tuning using the Next Token Prediction (NTP) loss. Moreover, the flexibility of LLM prompting enables text-conditioned image clustering, a capability that classical clustering methods lack. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering. Our code is available at this https URL.
[LG-9] Dont Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered
链接: https://arxiv.org/abs/2510.08464
作者: Jason Jabbour,Dong-Ki Kim,Max Smith,Jay Patrikar,Radhika Ghosal,Youhui Wang,Ali Agha,Vijay Janapa Reddi,Shayegan Omidshafiei
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Vision-Language-Action (VLA) models have advanced robotic capabilities but remain challenging to deploy on resource-limited hardware. Pruning has enabled efficient compression of large language models (LLMs), yet it is largely understudied in robotics. Surprisingly, we observe that pruning VLA models leads to drastic degradation and increased safety violations. We introduce GLUESTICK, a post-pruning recovery method that restores much of the original model’s functionality while retaining sparsity benefits. Our method performs a one-time interpolation between the dense and pruned models in weight-space to compute a corrective term. This correction is used during inference by each pruned layer to recover lost capabilities with minimal overhead. GLUESTICK requires no additional training, is agnostic to the pruning algorithm, and introduces a single hyperparameter that controls the tradeoff between efficiency and accuracy. Across diverse VLA architectures and tasks in manipulation and navigation, GLUESTICK achieves competitive memory efficiency while substantially recovering success rates and reducing safety violations. Additional material can be found at: this https URL.
[LG-10] SummDiff: Generative Modeling of Video Summarization with Diffusion
链接: https://arxiv.org/abs/2510.08458
作者: Kwanseok Kim,Jaehoon Hahm,Sumin Kim,Jinhwan Sul,Byunghak Kim,Joonseok Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Video summarization is a task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to an averaged frame score over multiple raters, ignoring the inherent subjectivity of what constitutes a good summary. We propose a novel problem formulation by framing video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves the state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide a deeper insight with novel metrics from an analysis of the knapsack, which is an important last step of generating summaries but has been overlooked in evaluation.
[LG-11] Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions
链接: https://arxiv.org/abs/2510.08382
作者: Jacob Trauger,Tyson Trauger,Ambuj Tewari
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages
Abstract:In this paper we will give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension that is based off of the Natarajan Dimension \citepnatarajan1989learning and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.
[LG-12] Contrastive Self-Supervised Learning at the Edge: An Energy Perspective
链接: https://arxiv.org/abs/2510.08374
作者: Fernanda Famá,Roberto Pereira,Charalampos Kalalas,Paolo Dini,Lorena Qendro,Fahim Kawsar,Mohammad Malekzadeh
类目: Machine Learning (cs.LG)
*备注:
Abstract:While contrastive learning (CL) shows considerable promise in self-supervised representation learning, its deployment on resource-constrained devices remains largely underexplored. The substantial computational demands required for training conventional CL frameworks pose a set of challenges, particularly in terms of energy consumption, data availability, and memory usage. We conduct an evaluation of four widely used CL frameworks: SimCLR, MoCo, SimSiam, and Barlow Twins. We focus on the practical feasibility of these CL frameworks for edge and fog deployment, and introduce a systematic benchmarking strategy that includes energy profiling and reduced training data conditions. Our findings reveal that SimCLR, contrary to its perceived computational cost, demonstrates the lowest energy consumption across various data regimes. Finally, we also extend our analysis by evaluating lightweight neural architectures when paired with CL frameworks. Our study aims to provide insights into the resource implications of deploying CL in edge/fog environments with limited processing capabilities and opens several research directions for its future optimization.
[LG-13] Guided Star-Shaped Masked Diffusion
链接: https://arxiv.org/abs/2510.08369
作者: Viacheslav Meshchaninov,Egor Shibaev,Artem Makoian,Ivan Klimov,Danil Sheshenya,Andrei Malinin,Nikita Balagansky,Daniil Gavrilov,Aibek Alanov,Dmitry Vetrov
类目: Machine Learning (cs.LG)
*备注:
Abstract:The performance of pre-trained masked diffusion models is often constrained by their sampling procedure, which makes decisions irreversible and struggles in low-step generation regimes. We introduce a novel sampling algorithm that works with pre-trained models and, after a lightweight fine-tuning of a single layer, significantly improves sample quality and efficiency. Our method reformulates the generation process using a star-shaped paradigm, which inherently allows for error correction. To make this process effective, we augment it with a learnable re-masking scheduler that intelligently identifies and revises likely errors. This approach yields a substantial quality boost, particularly when using a small number of sampling steps. We extensively ablate key components of our approach and show its usability in different scenarios. In comprehensive experiments on text, and code generation, our sampling algorithm outperforms or matches existing methods.
[LG-14] New Machine Learning Approaches for Intrusion Detection in ADS-B
链接: https://arxiv.org/abs/2510.08333
作者: Mikaëla Ngamboé,Jean-Simon Marrocco,Jean-Yves Ouattara,José M. Fernandez,Gabriela Nicolescu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This is the author’s version of the work accepted for publication Digital Avionics Systems Conference (DASC) 2025. The final version will be available via IEEE Xplore
Abstract:With the growing reliance on the vulnerable Automatic Dependent Surveillance-Broadcast (ADS-B) protocol in air traffic management (ATM), ensuring security is critical. This study investigates emerging machine learning models and training strategies to improve AI-based intrusion detection systems (IDS) for ADS-B. Focusing on ground-based ATM systems, we evaluate two deep learning IDS implementations: one using a transformer encoder and the other an extended Long Short-Term Memory (xLSTM) network, marking the first xLSTM-based IDS for ADS-B. A transfer learning strategy was employed, involving pre-training on benign ADS-B messages and fine-tuning with labeled data containing instances of tampered messages. Results show this approach outperforms existing methods, particularly in identifying subtle attacks that progressively undermine situational awareness. The xLSTM-based IDS achieves an F1-score of 98.9%, surpassing the transformer-based model at 94.3%. Tests on unseen attacks validated the generalization ability of the xLSTM model. Inference latency analysis shows that the 7.26-second delay introduced by the xLSTM-based IDS fits within the Secondary Surveillance Radar (SSR) refresh interval (5-12 s), although it may be restrictive for time-critical operations. While the transformer-based IDS achieves a 2.1-second latency, it does so at the cost of lower detection performance.
[LG-15] o Ask or Not to Ask: Learning to Require Human Feedback
链接: https://arxiv.org/abs/2510.08314
作者: Andrea Pugnana,Giovanni De Toni,Cesare Barbera,Roberto Pellungrini,Bruno Lepri,Andrea Passerini
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Developing decision-support systems that complement human performance in classification tasks remains an open challenge. A popular approach, Learning to Defer (LtD), allows a Machine Learning (ML) model to pass difficult cases to a human expert. However, LtD treats humans and ML models as mutually exclusive decision-makers, restricting the expert contribution to mere predictions. To address this limitation, we propose Learning to Ask (LtA), a new framework that handles both when and how to incorporate expert input in an ML model. LtA is based on a two-part architecture: a standard ML model and an enriched model trained with additional expert human feedback, with a formally optimal strategy for selecting when to query the enriched model. We provide two practical implementations of LtA: a sequential approach, which trains the models in stages, and a joint approach, which optimises them simultaneously. For the latter, we design surrogate losses with realisable-consistency guarantees. Our experiments with synthetic and real expert data demonstrate that LtA provides a more flexible and powerful foundation for effective human-AI collaboration.
[LG-16] Robust and Efficient Collaborative Learning
链接: https://arxiv.org/abs/2510.08311
作者: Abdellah El Mrini,Sadegh Farhadkhan,Rachid Guerraoui
类目: Machine Learning (cs.LG)
*备注:
Abstract:Collaborative machine learning is challenged by training-time adversarial behaviors. Existing approaches to tolerate such behaviors either rely on a central server or induce high communication costs. We propose Robust Pull-based Epidemic Learning (RPEL), a novel, scalable collaborative approach to ensure robust learning despite adversaries. RPEL does not rely on any central server and, unlike traditional methods, where communication costs grow in \mathcalO(n^2) with the number of nodes n , RPEL employs a pull-based epidemic-based communication strategy that scales in \mathcalO(n \log n) . By pulling model parameters from small random subsets of nodes, RPEL significantly lowers the number of required messages without compromising convergence guarantees, which hold with high probability. Empirical results demonstrate that RPEL maintains robustness in adversarial settings, competes with all-to-all communication accuracy, and scales efficiently across large networks.
[LG-17] Dynamic Features Adaptation in Networking: Toward Flexible training and Explainable inference NEURIPS2025
链接: https://arxiv.org/abs/2510.08303
作者: Yannis Belkhiter,Seshu Tirupathi,Giulio Zizzo,Merim Dzaferagic,John D. Kelleher
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Accepted at AI4NextG Workshop, NeurIPS 2025
Abstract:As AI becomes a native component of 6G network control, AI models must adapt to continuously changing conditions, including the introduction of new features and measurements driven by multi-vendor deployments, hardware upgrades, and evolving service requirements. To address this growing need for flexible learning in non-stationary environments, this vision paper highlights Adaptive Random Forests (ARFs) as a reliable solution for dynamic feature adaptation in communication network scenarios. We show that iterative training of ARFs can effectively lead to stable predictions, with accuracy improving over time as more features are added. In addition, we highlight the importance of explainability in AI-driven networks, proposing Drift-Aware Feature Importance (DAFI) as an efficient XAI feature importance (FI) method. DAFI uses a distributional drift detector to signal when to apply computationally intensive FI methods instead of lighter alternatives. Our tests on 3 different datasets indicate that our approach reduces runtime by up to 2 times, while producing more consistent feature importance values. Together, ARFs and DAFI provide a promising framework to build flexible AI methods adapted to 6G network use-cases.
[LG-18] Bridging the Physics-Data Gap with FNO-Guided Conditional Flow Matching: Designing Inductive Bias through Hierarchical Physical Constraints
链接: https://arxiv.org/abs/2510.08295
作者: Tsuyoshi Okita
类目: Machine Learning (cs.LG)
*备注: 8 pages, 1 figure
Abstract:Conventional time-series generation often ignores domain-specific physical constraints, limiting statistical and physical consistency. We propose a hierarchical framework that embeds the inherent hierarchy of physical laws-conservation, dynamics, boundary, and empirical relations-directly into deep generative models, introducing a new paradigm of physics-informed inductive bias. Our method combines Fourier Neural Operators (FNOs) for learning physical operators with Conditional Flow Matching (CFM) for probabilistic generation, integrated via time-dependent hierarchical constraints and FNO-guided corrections. Experiments on harmonic oscillators, human activity recognition, and lithium-ion battery degradation show 16.3% higher generation quality, 46% fewer physics violations, and 18.5% improved predictive accuracy over baselines.
[LG-19] Enhancing Reasoning for Diffusion LLM s via Distribution Matching Policy Optimization
链接: https://arxiv.org/abs/2510.08233
作者: Yuchen Zhu,Wei Guo,Jaemoo Choi,Petr Molodyk,Bo Yuan,Molei Tao,Yongxin Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs’ unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to 42.9% over previously SOTA baselines and 55.8% over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at this https URL.
[LG-20] Reinforcement Learning from Probabilistic Forecasts for Safe Decision-Making via Conditional Value-at-Risk Planning
链接: https://arxiv.org/abs/2510.08226
作者: Michal Koren,Or Peretz,Tai Dinh,Philip S. Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sequential decisions in volatile, high-stakes settings require more than maximizing expected return; they require principled uncertainty management. This paper presents the Uncertainty-Aware Markov Decision Process (UAMDP), a unified framework that couples Bayesian forecasting, posterior-sampling reinforcement learning, and planning under a conditional value-at-risk (CVaR) constraint. In a closed loop, the agent updates its beliefs over latent dynamics, samples plausible futures via Thompson sampling, and optimizes policies subject to preset risk tolerances. We establish regret bounds that converge to the Bayes-optimal benchmark under standard regularity conditions. We evaluate UAMDP in two domains-high-frequency equity trading and retail inventory control-both marked by structural uncertainty and economic volatility. Relative to strong deep learning baselines, UAMDP improves long-horizon forecasting accuracy (RMSE decreases by up to 25% and sMAPE by 32%), and these gains translate into economic performance: the trading Sharpe ratio rises from 1.54 to 1.74 while maximum drawdown is roughly halved. These results show that integrating calibrated probabilistic modeling, exploration aligned with posterior uncertainty, and risk-aware control yields a robust, generalizable approach to safer and more profitable sequential decision-making.
[LG-21] Post-hoc Stochastic Concept Bottleneck Models
链接: https://arxiv.org/abs/2510.08219
作者: Wiktor Jan Hoffmann,Sonia Laguna,Moritz Vandenhirtz,Emanuele Palumbo,Julia E. Vogt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Concept Bottleneck Models (CBMs) are interpretable models that predict the target variable through high-level human-understandable concepts, allowing users to intervene on mispredicted concepts to adjust the final output. While recent work has shown that modeling dependencies between concepts can improve CBM performance, especially under interventions, such approaches typically require retraining the entire model, which may be infeasible when access to the original data or compute is limited. In this paper, we introduce Post-hoc Stochastic Concept Bottleneck Models (PSCBMs), a lightweight method that augments any pre-trained CBM with a multivariate normal distribution over concepts by adding only a small covariance-prediction module, without retraining the backbone model. We propose two training strategies and show on real-world data that PSCBMs consistently match or improve both concept and target accuracy over standard CBMs at test time. Furthermore, we show that due to the modeling of concept dependencies, PSCBMs perform much better than CBMs under interventions, while remaining far more efficient than retraining a similar stochastic model from scratch.
[LG-22] Long-tailed Recognition with Model Rebalancing
链接: https://arxiv.org/abs/2510.08177
作者: Jiaan Luo,Feng Hong,Qiang Hu,Xiaofeng Cao,Feng Liu,Jiangchao Yao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Long-tailed recognition is ubiquitous and challenging in deep learning and even in the downstream finetuning of foundation models, since the skew class distribution generally prevents the model generalization to the tail classes. Despite the promise of previous methods from the perspectives of data augmentation, loss rebalancing and decoupled training etc., consistent improvement in the broad scenarios like multi-label long-tailed recognition is difficult. In this study, we dive into the essential model capacity impact under long-tailed context, and propose a novel framework, Model Rebalancing (MORE), which mitigates imbalance by directly rebalancing the model’s parameter space. Specifically, MORE introduces a low-rank parameter component to mediate the parameter space allocation guided by a tailored loss and sinusoidal reweighting schedule, but without increasing the overall model complexity or inference costs. Extensive experiments on diverse long-tailed benchmarks, spanning multi-class and multi-label tasks, demonstrate that MORE significantly improves generalization, particularly for tail classes, and effectively complements existing imbalance mitigation methods. These results highlight MORE’s potential as a robust plug-and-play module in long-tailed settings.
[LG-23] Bidirectional Representations Augmented Autoregressive Biological Sequence Generation:Application in De Novo Peptide Sequencing NEURIPS2025
链接: https://arxiv.org/abs/2510.08169
作者: Xiang Zhang,Jiaqi Wei,Zijie Qiu,Sheng Xu,Zhi Jin,ZhiQiang Gao,Nanqing Dong,Siqi Sun
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025
Abstract:Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at this https URL.
[LG-24] Beyond Sub-6 GHz: Leverag ing mmWave Wi-Fi for Gait-Based Person Identification
链接: https://arxiv.org/abs/2510.08160
作者: Nabeel Nisar Bhat,Maksim Karnaukh,Jakob Struye,Rafael Berkvens,Jeroen Famaey
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Person identification plays a vital role in enabling intelligent, personalized, and secure human-computer interaction. Recent research has demonstrated the feasibility of leveraging Wi-Fi signals for passive person identification using a person’s unique gait pattern. Although most existing work focuses on sub-6 GHz frequencies, the emergence of mmWave offers new opportunities through its finer spatial resolution, though its comparative advantages for person identification remain unexplored. This work presents the first comparative study between sub-6 GHz and mmWave Wi-Fi signals for person identification with commercial off-the-shelf (COTS) Wi-Fi, using a novel dataset of synchronized measurements from the two frequency bands in an indoor environment. To ensure a fair comparison, we apply identical training pipelines and model configurations across both frequency bands. Leveraging end-to-end deep learning, we show that even at low sampling rates (10 Hz), mmWave Wi-Fi signals can achieve high identification accuracy (91.2% on 20 individuals) when combined with effective background subtraction.
[LG-25] Unsupervised Multi-Source Federated Domain Adaptation under Domain Diversity through Group-Wise Discrepancy Minimization
链接: https://arxiv.org/abs/2510.08150
作者: Larissa Reichart,Cem Ata Baykara,Ali Burak Ünal,Mete Akgün,Harlin Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
[LG-26] Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Finetuning
链接: https://arxiv.org/abs/2510.08141
作者: Chen Wang,Zhaochun Li,Jionghao Bai,Yuzhi Zhang,Shisheng Cui,Zhou Zhao,Yue Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement finetuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
[LG-27] Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
链接: https://arxiv.org/abs/2510.08078
作者: Liyang Chen,Hongkai Chen,Yujun Cai,Sifan Li,Qingwen Ye,Yiwei Wang
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
[LG-28] Mitigating Subject Dependency in EEG Decoding with Subject-Specific Low-Rank Adapters
链接: https://arxiv.org/abs/2510.08059
作者: Timon Klein,Piotr Minakowski,Sebastian Sager
类目: Machine Learning (cs.LG)
*备注:
Abstract:Subject-specific distribution shifts represent an important obstacle to the development of foundation models for EEG decoding. To address this, we propose Subject-Conditioned Layer, an adaptive layer designed as a drop-in replacement for standard linear or convolutional layers in any neural network architecture. Our layer captures subject-specific variability by decomposing its weights into a shared, subject-invariant component and a lightweight, low-rank correction unique to each subject. This explicit separation of general knowledge from personalized adaptation allows existing models to become robust to subject shifts. Empirically, models equipped with our layer outperform both a shared-weight-only model (subject-agnostic model) and the average of individually trained subject-specific models. Consequently, the Subject-Conditioned Layer, offers a practical and scalable path towards building effective cross-subject foundation models for EEG.
[LG-29] From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
链接: https://arxiv.org/abs/2510.08055
作者: Gunjun Lee,Jiwon Kim,Jaiyoung Park,Younjoo Lee,Jung Ho Ahn
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 13 pages, 5 figure, 8 tables
Abstract:Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to 39% and inflate energy consumption. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to 70%, End-to-End latency by 41% and per-token energy by up to 22%. Evaluations show that layered prefill consistently improves the TTFT–TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.
[LG-30] Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity
链接: https://arxiv.org/abs/2510.08023
作者: Akira Ito,Masanori Yamada,Daiki Chijiwa,Atsutoshi Kumagai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al., have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a 32 \times width multiplier for ResNet-20). This is broadly believed to be because increasing the model width ensures a large enough space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when using a suitable softmax temperature calibration. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently the merged model’s output matches that of an ensemble of the original models, which facilitates LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.
[LG-31] Unsupervised Radio Map Construction in Mixed LoS/NLoS Indoor Environments
链接: https://arxiv.org/abs/2510.08015
作者: Zheng Xing,Junting Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Radio maps are essential for enhancing wireless communications and localization. However, existing methods for constructing radio maps typically require costly calibration pro- cesses to collect location-labeled channel state information (CSI) datasets. This paper aims to recover the data collection trajectory directly from the channel propagation sequence, eliminating the need for location calibration. The key idea is to employ a hidden Markov model (HMM)-based framework to conditionally model the channel propagation matrix, while simultaneously modeling the location correlation in the trajectory. The primary challenges involve modeling the complex relationship between channel propagation in multiple-input multiple-output (MIMO) networks and geographical locations, and addressing both line-of-sight (LOS) and non-line-of-sight (NLOS) indoor conditions. In this paper, we propose an HMM-based framework that jointly characterizes the conditional propagation model and the evolution of the user trajectory. Specifically, the channel propagation in MIMO networks is modeled separately in terms of power, delay, and angle, with distinct models for LOS and NLOS conditions. The user trajectory is modeled using a Gaussian-Markov model. The parameters for channel propagation, the mobility model, and LOS/NLOS classification are optimized simultaneously. Experimental validation using simulated MIMO-Orthogonal Frequency-Division Multiplexing (OFDM) networks with a multi-antenna uniform linear arrays (ULA) configuration demonstrates that the proposed method achieves an average localization accuracy of 0.65 meters in an indoor environment, covering both LOS and NLOS regions. Moreover, the constructed radio map enables localization with a reduced error compared to conventional supervised methods, such as k-nearest neighbors (KNN), support vector machine (SVM), and deep neural network (DNN).
[LG-32] Accelerated Evolving Set Processes for Local PageRank Computation
链接: https://arxiv.org/abs/2510.08010
作者: Binbin Huang,Luo Luo,Yanghua Xiao,Deqing Yang,Baojian Zhou
类目: Machine Learning (cs.LG)
*备注:
Abstract:This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by \min\tilde\mathcalO(R^2/\epsilon^2), \tilde\mathcalO(m)\ to obtain an \epsilon -approximation of the PPR vector, where m denotes the number of edges in the graph and R is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only \tilde\mathcalO(1/\sqrt\alpha) such linear systems, where \alpha is the damping factor. When 1/\epsilon^2\ll m , this implies the existence of an algorithm that computes an \ epsilon -approximation of the PPR vector with an overall time complexity of \tilde\mathcalO\left(R^2 / (\sqrt\alpha\epsilon^2)\right) , independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
[LG-33] Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training
链接: https://arxiv.org/abs/2510.08008
作者: Ruizhe Wang,Yucheng Ding,Xiao Liu,Yaoxiang Wang,Peng Cheng,Baining Guo,Zhengjun Zha,Yeyun Gong
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Numerous computational costs have been invested in existing well-trained checkpoints, but many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this “sunk” cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose orthogonal growth method well-suited for converged Mixture-of-Experts model: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across checkpoints sequences, we perform comprehensive scaling experiments revealing that the final accuracy has a strong positive correlation with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.
[LG-34] DemandCast: Global hourly electricity demand forecasting NEURIPS2025
链接: https://arxiv.org/abs/2510.08000
作者: Kevin Steijn,Vamsi Priya Goli,Enrico Antonini
类目: Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注: 7 pages, 4 figures, accepted at the NeurIPS 2025 Workshop: Tackling Climate Change with Machine Learning
Abstract:This paper presents a machine learning framework for electricity demand forecasting across diverse geographical regions using the gradient boosting algorithm XGBoost. The model integrates historical electricity demand and comprehensive weather and socioeconomic variables to predict normalized electricity demand profiles. To enable robust training and evaluation, we developed a large-scale dataset spanning multiple years and countries, applying a temporal data-splitting strategy that ensures benchmarking of out-of-sample performance. Our approach delivers accurate and scalable demand forecasts, providing valuable insights for energy system planners and policymakers as they navigate the challenges of the global energy transition.
[LG-35] Climate Surrogates for Scalable Multi-Agent Reinforcement Learning: A Case Study with CICERO-SCM
链接: https://arxiv.org/abs/2510.07971
作者: Oskar Bohn Lassen,Serio Angelo Maria Agriesti,Filipe Rodrigues,Francisco Camara Pereira
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Climate policy studies require models that capture the combined effects of multiple greenhouse gases on global temperature, but these models are computationally expensive and difficult to embed in reinforcement learning. We present a multi-agent reinforcement learning (MARL) framework that integrates a high-fidelity, highly efficient climate surrogate directly in the environment loop, enabling regional agents to learn climate policies under multi-gas dynamics. As a proof of concept, we introduce a recurrent neural network architecture pretrained on ( 20,000 ) multi-gas emission pathways to surrogate the climate model CICERO-SCM. The surrogate model attains near-simulator accuracy with global-mean temperature RMSE \approx 0.0004 \mathrmK and approximately 1000\times faster one-step inference. When substituted for the original simulator in a climate-policy MARL setting, it accelerates end-to-end training by !100\times . We show that the surrogate and simulator converge to the same optimal policies and propose a methodology to assess this property in cases where using the simulator is intractable. Our work allows to bypass the core computational bottleneck without sacrificing policy fidelity, enabling large-scale multi-agent experiments across alternative climate-policy regimes with multi-gas dynamics and high-fidelity climate response.
[LG-36] PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation NEURIPS2025
链接: https://arxiv.org/abs/2510.07964
作者: Jiabei Cheng,Changxi Chi,Jingbo Zhou,Hongyi Xin,Jun Xia
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:In single-cell perturbation prediction, a central task is to forecast the effects of perturbing a gene unseen in the training data. The efficacy of such predictions depends on two factors: (1) the similarity of the target gene to those covered in the training data, which informs model (epistemic) uncertainty, and (2) the quality of the corresponding training data, which reflects data (aleatoric) uncertainty. Both factors are critical for determining the reliability of a prediction, particularly as gene perturbation is an inherently stochastic biochemical process. In this paper, we propose PRESCRIBE (PREdicting Single-Cell Response wIth Bayesian Estimation), a multivariate deep evidential regression framework designed to measure both sources of uncertainty jointly. Our analysis demonstrates that PRESCRIBE effectively estimates a confidence score for each prediction, which strongly correlates with its empirical accuracy. This capability enables the filtering of untrustworthy results, and in our experiments, it achieves steady accuracy improvements of over 3% compared to comparable baselines.
[LG-37] Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks
链接: https://arxiv.org/abs/2510.07935
作者: Diego García-Pérez,Emilio Parrado-Hernández,John Shawe-Taylor
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
Abstract:This paper presents four theoretical contributions that improve the usability of risk certificates for neural networks based on PAC-Bayes bounds. First, two bounds on the KL divergence between Bernoulli distributions enable the derivation of the tightest explicit bounds on the true risk of classifiers across different ranges of empirical risk. The paper next focuses on the formalization of an efficient methodology based on implicit differentiation that enables the introduction of the optimization of PAC-Bayesian risk certificates inside the loss/objective function used to fit the network/model. The last contribution is a method to optimize bounds on non-differentiable objectives such as the 0-1 loss. These theoretical contributions are complemented with an empirical evaluation on the MNIST and CIFAR-10 datasets. In fact, this paper presents the first non-vacuous generalization bounds on CIFAR-10 for neural networks.
[LG-38] Synergy Between the Strong and the Weak: Spiking Neural Networks are Inherently Self-Distillers NEURIPS2025
链接: https://arxiv.org/abs/2510.07924
作者: Yongqi Ding,Lin Zuo,Mengmeng Jing,Kunshan Yang,Pei He,Tonglan Xie
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025
Abstract:Brain-inspired spiking neural networks (SNNs) promise to be a low-power alternative to computationally intensive artificial neural networks (ANNs), although performance gaps persist. Recent studies have improved the performance of SNNs through knowledge distillation, but rely on large teacher models or introduce additional training overhead. In this paper, we show that SNNs can be naturally deconstructed into multiple submodels for efficient self-distillation. We treat each timestep instance of the SNN as a submodel and evaluate its output confidence, thus efficiently identifying the strong and the weak. Based on this strong and weak relationship, we propose two efficient self-distillation schemes: (1) \textbfStrong2Weak: During training, the stronger “teacher” guides the weaker “student”, effectively improving overall performance. (2) \textbfWeak2Strong: The weak serve as the “teacher”, distilling the strong in reverse with underlying dark knowledge, again yielding significant performance gains. For both distillation schemes, we offer flexible implementations such as ensemble, simultaneous, and cascade distillation. Experiments show that our method effectively improves the discriminability and overall performance of the SNN, while its adversarial robustness is also enhanced, benefiting from the stability brought by self-distillation. This ingeniously exploits the temporal properties of SNNs and provides insight into how to efficiently train high-performance SNNs.
[LG-39] SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening
链接: https://arxiv.org/abs/2510.07922
作者: Murtaza Rangwala,Farag Azzedin,Richard O. Sinnott,Rajkumar Buyya
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 23 pages, 5 figures, Code Available: this https URL
Abstract:Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses d -dimensional models to k -dimensional sketches ( k \ll d ) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from O(d|N_i|) to O(k|N_i| + d|S_i|) , where |N_i| is the neighbor count and |S_i| \le |N_i| is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a (1+O(\epsilon)) factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.
[LG-40] GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploratio
链接: https://arxiv.org/abs/2510.07919
作者: Tingfeng Hong,Pingye Ren,Xinlong Xiao,Chao Wang,Chenyi Lei,Wenwu Ou,Han Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Overall architecture of the personalized multi-objective ranking system. It comprises: (1) a Feature Center and Prerank Model for initial feature processing and candidate generation; (2) a Multi-Task Learning (MTL) model predicting various user feedback signals; (3) a Multi-Task Fusion (MTF) module (our proposed GRADE framework) that learns personalized weights ( w_1, \dots, w_n ); these weights are then applied to calculate final scores and sorted to generate a blended ranking by the Blended Ranking Model, which ultimately delivers results to users.
[LG-41] Multi-level informed optimization via decomposed Kriging for large design problems under uncertainty
链接: https://arxiv.org/abs/2510.07904
作者: Enrico Ampellio,Blazhe Gjorgiev,Giovanni Sansavini
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 34 pages, 18 figures
Abstract:Engineering design involves demanding models encompassing many decision variables and uncontrollable parameters. In addition, unavoidable aleatoric and epistemic uncertainties can be very impactful and add further complexity. The state-of-the-art adopts two steps, uncertainty quantification and design optimization, to optimize systems under uncertainty by means of robust or stochastic metrics. However, conventional scenario-based, surrogate-assisted, and mathematical programming methods are not sufficiently scalable to be affordable and precise in large and complex cases. Here, a multi-level approach is proposed to accurately optimize resource-intensive, high-dimensional, and complex engineering problems under uncertainty with minimal resources. A non-intrusive, fast-scaling, Kriging-based surrogate is developed to map the combined design/parameter domain efficiently. Multiple surrogates are adaptively updated by hierarchical and orthogonal decomposition to leverage the fewer and most uncertainty-informed data. The proposed method is statistically compared to the state-of-the-art via an analytical testbed and is shown to be concurrently faster and more accurate by orders of magnitude.
[LG-42] Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression Filtering Method for SEM Images
链接: https://arxiv.org/abs/2510.07895
作者: D. Chee Yong Ong,I. Bukhori,K. S. Sim,K. Beng Gan
类目: Machine Learning (cs.LG)
*备注: “Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression Filtering Method for SEM Images,” in IEEE Access, vol. 13, pp. 93574-93592, 2025, doi: https://doi.org/10.1109/ACCESS.2025.3573389
Abstract:Scanning Electron Microscopy (SEM) images often suffer from noise contamination, which degrades image quality and affects further analysis. This research presents a complete approach to estimate their Signal-to-Noise Ratio (SNR) and noise variance (NV), and enhance image quality using NV-guided Wiener filter. The main idea of this study is to use a good SNR estimation technique and infuse a machine learning model to estimate NV of the SEM image, which then guides the wiener filter to remove the noise, providing a more robust and accurate SEM image filtering pipeline. First, we investigate five different SNR estimation techniques, namely Nearest Neighbourhood (NN) method, First-Order Linear Interpolation (FOL) method, Nearest Neighbourhood with First-Order Linear Interpolation (NN+FOL) method, Non-Linear Least Squares Regression (NLLSR) method, and Linear Least Squares Regression (LSR) method. It is shown that LSR method to perform better than the rest. Then, Support Vector Machines (SVM) and Gaussian Process Regression (GPR) are tested by pairing it with LSR. In this test, the Optimizable GPR model shows the highest accuracy and it stands as the most effective solution for NV estimation. Combining these results lead to the proposed Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression (AO-GPRLLSR) Filtering pipeline. The AO-GPRLLSR method generated an estimated noise variance which served as input to NV-guided Wiener filter for improving the quality of SEM images. The proposed method is shown to achieve notable success in estimating SNR and NV of SEM images and leads to lower Mean Squared Error (MSE) after the filtering process.
[LG-43] Signal-to-Noise Ratio in Scanning Electron Microscopy: A Comprehensive Review
链接: https://arxiv.org/abs/2510.07886
作者: K. S. Sim,I. Bukhori,D. C. Y. Ong,K. B. Gan
类目: Machine Learning (cs.LG)
*备注: in IEEE Access, vol. 13, pp. 154395-154421, 2025, doi: https://doi.org/10.1109/ACCESS.2025.3603013
Abstract:Scanning Electron Microscopy (SEM) is critical in nanotechnology, materials science, and biological imaging due to its high spatial resolution and depth of focus. Signal-to-noise ratio (SNR) is an essential parameter in SEM because it directly impacts the quality and interpretability of the images. SEM is widely used in various scientific disciplines, but its utility can be compromised by noise, which degrades image clarity. This review explores multiple aspects of the SEM imaging process, from the principal operation of SEM, sources of noise in SEM, methods for SNR measurement and estimations, to various aspects that affect the SNR measurement and approaches to enhance SNR, both from a hardware and software standpoint. We review traditional and emerging techniques, focusing on their applications, advantages, and limitations. The paper aims to provide a comprehensive understanding of SNR optimization in SEM for researchers and practitioners and to encourage further research in the field.
[LG-44] Adaptive Execution Scheduler for DataDios SmartDiff
链接: https://arxiv.org/abs/2510.07811
作者: Aryan Poduri
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 4 pages, 1 figure
Abstract:We present an adaptive scheduler for a single differencing engine (SmartDiff) with two execution modes: (i) in-memory threads and (ii) Dask based parallelism. The scheduler continuously tunes batch size and worker/thread count within fixed CPU and memory budgets to minimize p95 latency. A lightweight preflight profiler estimates bytes/row and I/O rate; an online cost/memory model prunes unsafe actions; and a guarded hill-climb policy favors lower latency with backpressure and straggler mitigation. Backend selection is gated by a conservative working-set estimate so that in-memory execution is chosen when safe, otherwise Dask is used. Across synthetic and public tabular benchmarks, the scheduler reduces p95 latency by 23 to 28 percent versus a tuned warm-up heuristic (and by 35 to 40 percent versus fixed grid baselines), while lowering peak memory by 16 to 22 percent (25 to 32 percent vs. fixed) with zero OOMs and comparable throughput.
[LG-45] HySim-LLM : Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLM s
链接: https://arxiv.org/abs/2510.07796
作者: Majid Jaberi-Douraki,Hossein Sholehrasa,Xuan Xu,Remya Ampadi Ramachandran
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
Abstract:The extraction and standardization of pharmacokinetic (PK) information from scientific literature remain significant challenges in computational pharmacology, which limits the reliability of data-driven models in drug development. Large language models (LLMs) have achieved remarkable progress in text understanding and reasoning, yet their adaptation to structured biomedical data, such as PK tables, remains constrained by heterogeneity, noise, and domain shift. To address these limitations, we propose HySim-LLM, a unified mathematical and computational framework that integrates embedding-weighted fine-tuning and manifold-aware denoising to enhance the robustness and interpretability of LLMs. We establish two theoretical results: (1) a similarity-weighted generalization bound that quantifies adaptation performance under embedding divergence, and (2) a manifold-based denoising guarantee that bounds loss contributions from noisy or off-manifold samples. These theorems provide a principled foundation for fine-tuning LLMs in structured biomedical settings. The framework offers a mathematically grounded pathway toward reliable and interpretable LLM adaptation for biomedical and data-intensive scientific domains.
[LG-46] Weak Form Learning for Mean-Field Partial Differential Equations: an Application to Insect Movement
链接: https://arxiv.org/abs/2510.07786
作者: Seth Minor,Bret D. Elderd,Benjamin Van Allen,David M. Bortz,Vanja Dukic
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Populations and Evolution (q-bio.PE)
*备注: 39 pages, 16 figures
Abstract:Insect species subject to infection, predation, and anisotropic environmental conditions may exhibit preferential movement patterns. Given the innate stochasticity of exogenous factors driving these patterns over short timescales, individual insect trajectories typically obey overdamped stochastic dynamics. In practice, data-driven modeling approaches designed to learn the underlying Fokker-Planck equations from observed insect distributions serve as ideal tools for understanding and predicting such behavior. Understanding dispersal dynamics of crop and silvicultural pests can lead to a better forecasting of outbreak intensity and location, which can result in better pest management. In this work, we extend weak-form equation learning techniques, coupled with kernel density estimation, to learn effective models for lepidopteran larval population movement from highly sparse experimental data. Galerkin methods such as the Weak form Sparse Identification of Nonlinear Dynamics (WSINDy) algorithm have recently proven useful for learning governing equations in several scientific contexts. We demonstrate the utility of the method on a sparse dataset of position measurements of fall armyworms (Spodoptera frugiperda) obtained in simulated agricultural conditions with varied plant resources and infection status.
[LG-47] PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations
链接: https://arxiv.org/abs/2510.07784
作者: Ruining He,Lukasz Heldt,Lichan Hong,Raghunandan Keshavan,Shifan Mao,Nikhil Mehta,Zhengyang Su,Alicia Tsai,Yueqi Wang,Shao-Chuan Wang,Xinyang Yi,Lexi Baugher,Baykal Cakici,Ed Chi,Cristos Goodrow,Ningren Han,He Ma,Romer Rosales,Abby Van Soest,Devansh Tandon,Su-Lin Wu,Weilong Yang,Yilin Zheng
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures
Abstract:Large Language Models (LLMs) pose a new paradigm of modeling and computation for information tasks. Recommendation systems are a critical application domain poised to benefit significantly from the sequence modeling capabilities and world knowledge inherent in these large models. In this paper, we introduce PLUM, a framework designed to adapt pre-trained LLMs for industry-scale recommendation tasks. PLUM consists of item tokenization using Semantic IDs, continued pre-training (CPT) on domain-specific data, and task-specific fine-tuning for recommendation objectives. For fine-tuning, we focus particularly on generative retrieval, where the model is directly trained to generate Semantic IDs of recommended items based on user context. We conduct comprehensive experiments on large-scale internal video recommendation datasets. Our results demonstrate that PLUM achieves substantial improvements for retrieval compared to a heavily-optimized production model built with large embedding tables. We also present a scaling study for the model’s retrieval performance, our learnings about CPT, a few enhancements to Semantic IDs, along with an overview of the training and inference methods that enable launching this framework to billions of users in YouTube.
[LG-48] FedLAM: Low-latency Wireless Federated Learning via Layer-wise Adaptive Modulation
链接: https://arxiv.org/abs/2510.07766
作者: Linping Qu,Shenghui Song,Chi-Ying Tsui
类目: Machine Learning (cs.LG)
*备注:
Abstract:In wireless federated learning (FL), the clients need to transmit the high-dimensional deep neural network (DNN) parameters through bandwidth-limited channels, which causes the communication latency issue. In this paper, we propose a layer-wise adaptive modulation scheme to save the communication latency. Unlike existing works which assign the same modulation level for all DNN layers, we consider the layers’ importance which provides more freedom to save the latency. The proposed scheme can automatically decide the optimal modulation levels for different DNN layers. Experimental results show that the proposed scheme can save up to 73.9% of communication latency compared with the existing schemes.
[LG-49] Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization
链接: https://arxiv.org/abs/2510.07758
作者: Qiaozhe Zhang,Jun Sun,Ruijie Zhang,Yingzhuang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better generalization might be. Unfortunately, the correlation between many existing sharpness measures and the generalization is usually not strong, sometimes even weak. To close the gap between the intuition and the reality, we propose a novel sharpness measure, i.e., \textitRényi sharpness, which is defined as the negative Rényi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that \textituniform (identical) eigenvalues of the loss Hessian is most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the \textitRényi entropy to concisely characterize the extent of the spread of the eigenvalues of loss Hessian. Normally, the larger the spread, the smaller the (Rényi) entropy. To rigorously establish the relationship between generalization and (Rényi) sharpness, we provide several generalization bounds in terms of Rényi sharpness, by taking advantage of the reparametrization invariance property of Rényi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (in specific, Kendall rank correlation) between the Rényi sharpness and generalization. Moreover, we propose to use a variant of Rényi Sharpness as regularizer during training, i.e., Rényi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worthy noting that the test accuracy gain of our proposed RSAM method could be as high as nearly 2.5%, compared against the classical SAM method.
[LG-50] FedBook: A Unified Federated Graph Foundation Codebook with Intra-domain and Inter-domain Knowledge Modeling
链接: https://arxiv.org/abs/2510.07755
作者: Zhengyu Wu,Yinlin Zhu,Xunkai Li,Ziang Qiu,Rong-Hua Li,Guoren Wang,Chenghu Zhou
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Foundation models have shown remarkable cross-domain generalization in language and vision, inspiring the development of graph foundation models (GFMs). However, existing GFMs typically assume centralized access to multi-domain graphs, which is often infeasible due to privacy and institutional constraints. Federated Graph Foundation Models (FedGFMs) address this limitation, but their effectiveness fundamentally hinges on constructing a robust global codebook that achieves intra-domain coherence by consolidating mutually reinforcing semantics within each domain, while also maintaining inter-domain diversity by retaining heterogeneous knowledge across domains. To this end, we propose FedBook, a unified federated graph foundation codebook that systematically aggregates clients’ local codebooks during server-side federated pre-training. FedBook follows a two-phase process: (1) Intra-domain Collaboration, where low-frequency tokens are refined by referencing more semantically reliable high-frequency tokens across clients to enhance domain-specific coherence; and (2) Inter-domain Integration, where client contributions are weighted by the semantic distinctiveness of their codebooks during the aggregation of the global GFM, thereby preserving cross-domain diversity. Extensive experiments on 8 benchmarks across multiple domains and tasks demonstrate that FedBook consistently outperforms 21 baselines, including isolated supervised learning, FL/FGL, federated adaptations of centralized GFMs, and FedGFM techniques.
[LG-51] -SNE Exaggerates Clusters Provably
链接: https://arxiv.org/abs/2510.07746
作者: Noah Bergam,Szymon Snoeck,Nakul Verma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. To the contrary, we prove that (1) the strength of the input clustering, and (2) the extremity of outlier points, cannot be reliably inferred from the t-SNE output. We demonstrate the prevalence of these failure modes in practice as well.
[LG-52] GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation
链接: https://arxiv.org/abs/2510.07735
作者: Rongchao Xu,Kunlin Cai,Lin Jiang,Dahai Yu,Zhiqing Hong,Yuan Tian,Guang Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Location-Based Social Network (LBSN) check-in trajectory data are important for many practical applications, like POI recommendation, advertising, and pandemic intervention. However, the high collection costs and ever-increasing privacy concerns prevent us from accessing large-scale LBSN trajectory data. The recent advances in synthetic data generation provide us with a new opportunity to achieve this, which utilizes generative AI to generate synthetic data that preserves the characteristics of real data while ensuring privacy protection. However, generating synthetic LBSN check-in trajectories remains challenging due to their spatially discrete, temporally irregular nature and the complex spatio-temporal patterns caused by sparse activities and uncertain human mobility. To address this challenge, we propose GeoGen, a two-stage coarse-to-fine framework for large-scale LBSN check-in trajectory generation. In the first stage, we reconstruct spatially continuous, temporally regular latent movement sequences from the original LBSN check-in trajectories and then design a Sparsity-aware Spatio-temporal Diffusion model (S ^2 TDiff) with an efficient denosing network to learn their underlying behavioral patterns. In the second stage, we design Coarse2FineNet, a Transformer-based Seq2Seq architecture equipped with a dynamic context fusion mechanism in the encoder and a multi-task hybrid-head decoder, which generates fine-grained LBSN trajectories based on coarse-grained latent movement sequences by modeling semantic relevance and behavioral uncertainty. Extensive experiments on four real-world datasets show that GeoGen excels state-of-the-art models for both fidelity and utility evaluation, e.g., it increases over 69% and 55% in distance and radius metrics on the FS-TKY dataset.
[LG-53] Computationally-efficient Graph Modeling with Refined Graph Random Features
链接: https://arxiv.org/abs/2510.07716
作者: Krzysztof Choromanski,Avinava Dubey,Arijit Sehanobish,Isaac Reid
类目: Machine Learning (cs.LG)
*备注: Preprint. Comments welcome
Abstract:We propose refined GRFs (GRFs++), a new class of Graph Random Features (GRFs) for efficient and accurate computations involving kernels defined on the nodes of a graph. GRFs++ resolve some of the long-standing limitations of regular GRFs, including difficulty modeling relationships between more distant nodes. They reduce dependence on sampling long graph random walks via a novel walk-stitching technique, concatenating several shorter walks without breaking unbiasedness. By applying these techniques, GRFs++ inherit the approximation quality provided by longer walks but with greater efficiency, trading sequential, inefficient sampling of a long walk for parallel computation of short walks and matrix-matrix multiplication. Furthermore, GRFs++ extend the simplistic GRFs walk termination mechanism (Bernoulli schemes with fixed halting probabilities) to a broader class of strategies, applying general distributions on the walks’ lengths. This improves the approximation accuracy of graph kernels, without incurring extra computational cost. We provide empirical evaluations to showcase all our claims and complement our results with theoretical analysis.
[LG-54] FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning NEURIPS2025
链接: https://arxiv.org/abs/2510.07664
作者: Yunbo Li,Jiaping Gui,Zhihang Deng,Fanchao Meng,Yue Wu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by NeurIPS 2025
Abstract:Federated learning (FL) enables collaborative model training across multiple parties without sharing raw data, with semi-asynchronous FL (SAFL) emerging as a balanced approach between synchronous and asynchronous FL. However, SAFL faces significant challenges in optimizing both gradient-based (e.g., FedSGD) and model-based (e.g., FedAvg) aggregation strategies, which exhibit distinct trade-offs in accuracy, convergence speed, and stability. While gradient aggregation achieves faster convergence and higher accuracy, it suffers from pronounced fluctuations, whereas model aggregation offers greater stability but slower convergence and suboptimal accuracy. This paper presents FedQS, the first framework to theoretically analyze and address these disparities in SAFL. FedQS introduces a divide-and-conquer strategy to handle client heterogeneity by classifying clients into four distinct types and adaptively optimizing their local training based on data distribution characteristics and available computational resources. Extensive experiments on computer vision, natural language processing, and real-world tasks demonstrate that FedQS achieves the highest accuracy, attains the lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines. Our work bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning. The code and datasets are available at this https URL.
[LG-55] Incremental Hybrid Ensemble with Graph Attention and Frequency-Domain Features for Stable Long-Term Credit Risk Modeling
链接: https://arxiv.org/abs/2510.07663
作者: Jiajing Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predicting long-term loan defaults is hard because borrower behavior often changes and data distributions shift over time. This paper presents HYDRA-EI, a hybrid ensemble incremental learning framework. It uses several stages of feature processing and combines multiple models. The framework builds relational, cross, and frequency-based features. It uses graph attention, automatic cross-feature creation, and transformations from the frequency domain. HYDRA-EI updates weekly using new data and adjusts the model weights with a simple performance-based method. It works without frequent manual changes or fixed retraining. HYDRA-EI improves model stability and generalization, which makes it useful for long-term credit risk tasks.
[LG-56] Continual Learning for Adaptive AI Systems
链接: https://arxiv.org/abs/2510.07648
作者: Md Hasibul Amin,Tamzid Tanvi Alam
类目: Machine Learning (cs.LG)
*备注: 5 pages 2 figures 2 tables
Abstract:Continual learning the ability of a neural network to learn multiple sequential tasks without losing previously acquired knowledge remains a significant obstacle to developing truly adaptive artificial intelligence. Deep learning models have achieved remarkable results in various applications, but overfitting remains a common issue. Regularization techniques can help prevent overfitting by adding constraints to the model’s parameters. To prevent catastrophic forgetting, in this paper we introduce a novel regularization technique based on inter-cluster separation (ICS) in the loss function, which penalizes the model for producing outputs that are far away from the centroids of the clusters formed by the data from previous tasks. We also performed hyperparameter tuning to find the optimal weighting of the proposed regularization term. This ensures clearer separation between tasks in the neural network’s internal representation, reducing overlap and mitigating forgetting. Using the standard 5-task Split CIFAR-10 benchmark and a ResNet-18 architecture, we demonstrate ICS’s effectiveness in maintaining strong performance on initial tasks. However, our results also highlight limitations in long-term knowledge retention, particularly when the number of tasks increases. This underscores the complexity and trade-offs inherent in continual learning and points toward avenues for further research.
[LG-57] Design-Based Bandits Under Network Interference: Trade-Off Between Regret and Statistical Inference
链接: https://arxiv.org/abs/2510.07646
作者: Zichen Wang,Haoyang Hong,Chuanhao Li,Haoxuan Li,Zhiheng Zhang,Huazheng Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In multi-armed bandits with network interference (MABNI), the action taken by one node can influence the rewards of others, creating complex interdependence. While existing research on MABNI largely concentrates on minimizing regret, it often overlooks the crucial concern that an excessive emphasis on the optimal arm can undermine the inference accuracy for sub-optimal arms. Although initial efforts have been made to address this trade-off in single-unit scenarios, these challenges have become more pronounced in the context of MABNI. In this paper, we establish, for the first time, a theoretical Pareto frontier characterizing the trade-off between regret minimization and inference accuracy in adversarial (design-based) MABNI. We further introduce an anytime-valid asymptotic confidence sequence along with a corresponding algorithm, \textttEXP3-N-CS , specifically designed to balance the trade-off between regret minimization and inference accuracy in this setting.
[LG-58] Property Classification of Vacation Rental Properties during Covid-19
链接: https://arxiv.org/abs/2510.07639
作者: Favour Yahdii Aghaebe,Dustin Foley,Eric Atwell,Stephen Clark
类目: Machine Learning (cs.LG)
*备注: GISRUK 2024 Poster
Abstract:This study advocates for employing clustering techniques to classify vacation rental properties active during the Covid pandemic to identify inherent patterns and behaviours. The dataset, a collaboration between the ESRC funded Consumer Data Research Centre (CDRC) and AirDNA, encompasses data for over a million properties and hosts. Utilising K-means and K-medoids clustering techniques, we identify homogenous groups and their common characteristics. Our findings enhance comprehension of the intricacies of vacation rental evaluations and could potentially be utilised in the creation of targeted, cluster-specific policies.
[LG-59] ransformer-Based Indirect Structural Health Monitoring of Rail Infrastructure with Attention-Driven Detection and Localization of Transient Defects ALT
链接: https://arxiv.org/abs/2510.07606
作者: Sizhe Ma,Katherine A. Flanigan,Mario Bergés,James D. Brooks
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Preprint presented at the 15th International Workshop on Structural Health Monitoring (IWSHM)
Abstract:Indirect structural health monitoring (iSHM) for broken rail detection using onboard sensors presents a cost-effective paradigm for railway track assessment, yet reliably detecting small, transient anomalies (2-10 cm) remains a significant challenge due to complex vehicle dynamics, signal noise, and the scarcity of labeled data limiting supervised approaches. This study addresses these issues through unsupervised deep learning. We introduce an incremental synthetic data benchmark designed to systematically evaluate model robustness against progressively complex challenges like speed variations, multi-channel inputs, and realistic noise patterns encountered in iSHM. Using this benchmark, we evaluate several established unsupervised models alongside our proposed Attention-Focused Transformer. Our model employs a self-attention mechanism, trained via reconstruction but innovatively deriving anomaly scores primarily from deviations in learned attention weights, aiming for both effectiveness and computational efficiency. Benchmarking results reveal that while transformer-based models generally outperform others, all tested models exhibit significant vulnerability to high-frequency localized noise, identifying this as a critical bottleneck for practical deployment. Notably, our proposed model achieves accuracy comparable to the state-of-the-art solution while demonstrating better inference speed. This highlights the crucial need for enhanced noise robustness in future iSHM models and positions our more efficient attention-based approach as a promising foundation for developing practical onboard anomaly detection systems.
[LG-60] Expanding the Action Space of LLM s to Reason Beyond Language
链接: https://arxiv.org/abs/2510.07581
作者: Zhongqi Yue,Weishi Wang,Yundaichuan Zhan,Juncheng Li,Daniel Dahlmeier,Fredrik D. Johansson
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments – such as symbolic operators or simulators – must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model’s language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
[LG-61] Symbolic-Diffusion: Deep Learning Based Symbolic Regression with D3PM Discrete Token Diffusion
链接: https://arxiv.org/abs/2510.07570
作者: Ryan T. Tymkow,Benjamin D. Schnapp,Mojtaba Valipour,Ali Ghodshi
类目: Machine Learning (cs.LG)
*备注: 9 Pages, 3 Figurees
Abstract:Symbolic regression refers to the task of finding a closed-form mathematical expression to fit a set of data points. Genetic programming based techniques are the most common algorithms used to tackle this problem, but recently, neural-network based approaches have gained popularity. Most of the leading neural-network based models used for symbolic regression utilize transformer-based autoregressive models to generate an equation conditioned on encoded input points. However, autoregressive generation is limited to generating tokens left-to-right, and future generated tokens are conditioned only on previously generated tokens. Motivated by the desire to generate all tokens simultaneously to produce improved closed-form equations, we propose Symbolic Diffusion, a D3PM based discrete state-space diffusion model which simultaneously generates all tokens of the equation at once using discrete token diffusion. Using the bivariate dataset developed for SymbolicGPT, we compared our diffusion-based generation approach to an autoregressive model based on SymbolicGPT, using equivalent encoder and transformer architectures. We demonstrate that our novel approach of using diffusion-based generation for symbolic regression can offer comparable and, by some metrics, improved performance over autoregressive generation in models using similar underlying architectures, opening new research opportunities in neural-network based symbolic regression.
[LG-62] Automated Machine Learning for Unsupervised Tabular Tasks
链接: https://arxiv.org/abs/2510.07569
作者: Prabhant Singh,Pieter Gijsbers,Elif Ceren Gok Yildirim,Murat Onur Yildirim,Joaquin Vanschoren
类目: Machine Learning (cs.LG)
*备注: Accepted at Machine Learning Journal, 2025
Abstract:In this work, we present LOTUS (Learning to Learn with Optimal Transport for Unsupervised Scenarios), a simple yet effective method to perform model selection for multiple unsupervised machine learning(ML) tasks such as outlier detection and clustering. Our intuition behind this work is that a machine learning pipeline will perform well in a new dataset if it previously worked well on datasets with a similar underlying data distribution. We use Optimal Transport distances to find this similarity between unlabeled tabular datasets and recommend machine learning pipelines with one unified single method on two downstream unsupervised tasks: outlier detection and clustering. We present the effectiveness of our approach with experiments against strong baselines and show that LOTUS is a very promising first step toward model selection for multiple unsupervised ML tasks.
[LG-63] EBGAN-MDN: An Energy-Based Adversarial Framework for Multi-Modal Behavior Cloning
链接: https://arxiv.org/abs/2510.07562
作者: Yixiao Li,Julia Barth,Thomas Kiefer,Ahmad Fraij
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-modal behavior cloning faces significant challenges due to mode averaging and mode collapse, where traditional models fail to capture diverse input-output mappings. This problem is critical in applications like robotics, where modeling multiple valid actions ensures both performance and safety. We propose EBGAN-MDN, a framework that integrates energy-based models, Mixture Density Networks (MDNs), and adversarial training. By leveraging a modified InfoNCE loss and an energy-enforced MDN loss, EBGAN-MDN effectively addresses these challenges. Experiments on synthetic and robotic benchmarks demonstrate superior performance, establishing EBGAN-MDN as a effective and efficient solution for multi-modal learning tasks.
[LG-64] Phase Diagram of Dropout for Two-Layer Neural Networks in the Mean-Field Regime
链接: https://arxiv.org/abs/2510.07554
作者: Lénaïc Chizat,Pierre Marion,Yerkin Yesbay
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dropout is a standard training technique for neural networks that consists of randomly deactivating units at each step of their gradient-based training. It is known to improve performance in many settings, including in the large-scale training of language or vision models. As a first step towards understanding the role of dropout in large neural networks, we study the large-width asymptotics of gradient descent with dropout on two-layer neural networks with the mean-field initialization scale. We obtain a rich asymptotic phase diagram that exhibits five distinct nondegenerate phases depending on the relative magnitudes of the dropout rate, the learning rate, and the width. Notably, we find that the well-studied “penalty” effect of dropout only persists in the limit with impractically small learning rates of order O(1/\textwidth) . For larger learning rates, this effect disappears and in the limit, dropout is equivalent to a “random geometry” technique, where the gradients are thinned randomly after the forward and backward pass have been computed. In this asymptotic regime, the limit is described by a mean-field jump process where the neurons’ update times follow independent Poisson or Bernoulli clocks (depending on whether the learning rate vanishes or not). For some of the phases, we obtain a description of the limit dynamics both in path-space and in distribution-space. The convergence proofs involve a mix of tools from mean-field particle systems and stochastic processes. Together, our results lay the groundwork for a renewed theoretical understanding of dropout in large-scale neural networks.
[LG-65] argeted Digital Twin via Flow Map Learning and Its Application to Fluid Dynamics
链接: https://arxiv.org/abs/2510.07549
作者: Qifan Chen,Zhongshu Xu,Jinjin Zhang,Dongbin Xiu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:We present a numerical framework for constructing a targeted digital twin (tDT) that directly models the dynamics of quantities of interest (QoIs) in a full digital twin (DT). The proposed approach employs memory-based flow map learning (FML) to develop a data-driven model of the QoIs using short bursts of trajectory data generated through repeated executions of the full DT. This renders the construction of the FML-based tDT an entirely offline computational process. During online simulation, the learned tDT can efficiently predict and analyze the long-term dynamics of the QoIs without requiring simulations of the full DT system, thereby achieving substantial computational savings. After introducing the general numerical procedure, we demonstrate the construction and predictive capability of the tDT in a computational fluid dynamics (CFD) example: two-dimensional incompressible flow past a cylinder. The QoIs in this problem are the hydrodynamic forces exerted on the cylinder. The resulting tDTs are compact dynamical systems that evolve these forces without explicit knowledge of the underlying flow field. Numerical results show that the tDTs yield accurate long-term predictions of the forces while entirely bypassing full flow simulations.
[LG-66] Estimating Fair Graphs from Graph-Stationary Data
链接: https://arxiv.org/abs/2510.07536
作者: Madeline Navarro,Andrei Buciulea,Samuel Rey,Antonio G. Marques,Santiago Segarra
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:We estimate fair graphs from graph-stationary nodal observations such that connections are not biased with respect to sensitive attributes. Edges in real-world graphs often exhibit preferences for connecting certain pairs of groups. Biased connections can not only exacerbate but even induce unfair treatment for downstream graph-based tasks. We therefore consider group and individual fairness for graphs corresponding to group- and node-level definitions, respectively. To evaluate the fairness of a given graph, we provide multiple bias metrics, including novel measurements in the spectral domain. Furthermore, we propose Fair Spectral Templates (FairSpecTemp), an optimization-based method with two variants for estimating fair graphs from stationary graph signals, a general model for graph data subsuming many existing ones. One variant of FairSpecTemp exploits commutativity properties of graph stationarity while directly constraining bias, while the other implicitly encourages fair estimates by restricting bias in the graph spectrum and is thus more flexible. Our methods enjoy high probability performance bounds, yielding a conditional tradeoff between fairness and accuracy. In particular, our analysis reveals that accuracy need not be sacrificed to recover fair graphs. We evaluate FairSpecTemp on synthetic and real-world data sets to illustrate its effectiveness and highlight the advantages of both variants of FairSpecTemp.
[LG-67] Efficient Generalization via Multimodal Co-Training under Data Scarcity and Distribution Shift
链接: https://arxiv.org/abs/2510.07509
作者: Tianyu Bell Pan,Damon L. Woodard
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:This paper explores a multimodal co-training framework designed to enhance model generalization in situations where labeled data is limited and distribution shifts occur. We thoroughly examine the theoretical foundations of this framework, deriving conditions under which the use of unlabeled data and the promotion of agreement between classifiers for different modalities lead to significant improvements in generalization. We also present a convergence analysis that confirms the effectiveness of iterative co-training in reducing classification errors. In addition, we establish a novel generalization bound that, for the first time in a multimodal co-training context, decomposes and quantifies the distinct advantages gained from leveraging unlabeled multimodal data, promoting inter-view agreement, and maintaining conditional view independence. Our findings highlight the practical benefits of multimodal co-training as a structured approach to developing data-efficient and robust AI systems that can effectively generalize in dynamic, real-world environments. The theoretical foundations are examined in dialogue with, and in advance of, established co-training principles.
[LG-68] PEAR: Planner-Executor Agent Robustness Benchmark
链接: https://arxiv.org/abs/2510.07505
作者: Shen Dong,Mingxuan Zhang,Pengfei He,Li Ma,Bhavani Thuraisingham,Hui Liu,Yue Xing
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.
[LG-69] Black-box Detection of LLM -generated Text Using Generalized Jensen-Shannon Divergence
链接: https://arxiv.org/abs/2510.07500
作者: Shuangyi Chen,Ashish Khisti
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Preprint
Abstract:We study black-box detection of machine-generated text under practical constraints: the scoring model (proxy LM) may mismatch the unknown source model, and per-input contrastive generation is costly. We propose SurpMark, a reference-based detector that summarizes a passage by the dynamics of its token surprisals. SurpMark quantizes surprisals into interpretable states, estimates a state-transition matrix for the test text, and scores it via a generalized Jensen-Shannon (GJS) gap between the test transitions and two fixed references (human vs. machine) built once from historical corpora. We prove a principled discretization criterion and establish the asymptotic normality of the decision statistic. Empirically, across multiple datasets, source models, and scenarios, SurpMark consistently matches or surpasses baselines; our experiments corroborate the statistic’s asymptotic normality, and ablations validate the effectiveness of the proposed discretization.
[LG-70] Reinforcement Learning-based Task Offloading in the Internet of Wearable Things
链接: https://arxiv.org/abs/2510.07487
作者: Waleed Bin Qaim,Aleksandr Ometov,Claudia Campolo,Antonella Molinaro,Elena Simona Lohan,Jari Nurmi
类目: Machine Learning (cs.LG)
*备注: 16 pages, 12 figures, Under review in the IEEE Internet of Things Journal
Abstract:Over the years, significant contributions have been made by the research and industrial sectors to improve wearable devices towards the Internet of Wearable Things (IoWT) paradigm. However, wearables are still facing several challenges. Many stem from the limited battery power and insufficient computation resources available on wearable devices. On the other hand, with the popularity of smart wearables, there is a consistent increase in the development of new computationally intensive and latency-critical applications. In such a context, task offloading allows wearables to leverage the resources available on nearby edge devices to enhance the overall user experience. This paper proposes a framework for Reinforcement Learning (RL)-based task offloading in the IoWT. We formulate the task offloading process considering the tradeoff between energy consumption and task accomplishment time. Moreover, we model the task offloading problem as a Markov Decision Process (MDP) and utilize the Q-learning technique to enable the wearable device to make optimal task offloading decisions without prior knowledge. We evaluate the performance of the proposed framework through extensive simulations for various applications and system configurations conducted in the ns-3 network simulator. We also show how varying the main system parameters of the Q-learning algorithm affects the overall performance in terms of average task accomplishment time, average energy consumption, and percentage of tasks offloaded.
[LG-71] Surrogate Modeling for the Design of Optimal Lattice Structures using Tensor Completion NEURIPS2025
链接: https://arxiv.org/abs/2510.07474
作者: Shaan Pakala,Aldair E. Gongora,Brian Giera,Evangelos E. Papalexakis
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025 AI4Mat Workshop
Abstract:When designing new materials, it is often necessary to design a material with specific desired properties. Unfortunately, as new design variables are added, the search space grows exponentially, which makes synthesizing and validating the properties of each material very impractical and time-consuming. In this work, we focus on the design of optimal lattice structures with regard to mechanical performance. Computational approaches, including the use of machine learning (ML) methods, have shown improved success in accelerating materials design. However, these ML methods are still lacking in scenarios when training data (i.e. experimentally validated materials) come from a non-uniformly random sampling across the design space. For example, an experimentalist might synthesize and validate certain materials more frequently because of convenience. For this reason, we suggest the use of tensor completion as a surrogate model to accelerate the design of materials in these atypical supervised learning scenarios. In our experiments, we show that tensor completion is superior to classic ML methods such as Gaussian Process and XGBoost with biased sampling of the search space, with around 5% increased R^2 . Furthermore, tensor completion still gives comparable performance with a uniformly random sampling of the entire search space.
[LG-72] metabeta - A fast neural model for Bayesian mixed-effects regression
链接: https://arxiv.org/abs/2510.07473
作者: Alex Kipnis,Marcel Binz,Eric Schulz
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 19 pages, 9 main text, 8 figures
Abstract:Hierarchical data with multiple observations per group is ubiquitous in empirical sciences and is often analyzed using mixed-effects regression. In such models, Bayesian inference gives an estimate of uncertainty but is analytically intractable and requires costly approximation using Markov Chain Monte Carlo (MCMC) methods. Neural posterior estimation shifts the bulk of computation from inference time to pre-training time, amortizing over simulated datasets with known ground truth targets. We propose metabeta, a transformer-based neural network model for Bayesian mixed-effects regression. Using simulated and real data, we show that it reaches stable and comparable performance to MCMC-based parameter estimation at a fraction of the usually required time.
[LG-73] Comparison of Fully Homomorphic Encryption and Garbled Circuit Techniques in Privacy-Preserving Machine Learning Inference
链接: https://arxiv.org/abs/2510.07457
作者: Kalyan Cheerla,Lotfi Ben Othmane,Kirill Morozov(University of North Texas)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 8 pages, 9 figures, 2 tables, 32 references
Abstract:Machine Learning (ML) is making its way into fields such as healthcare, finance, and Natural Language Processing (NLP), and concerns over data privacy and model confidentiality continue to grow. Privacy-preserving Machine Learning (PPML) addresses this challenge by enabling inference on private data without revealing sensitive inputs or proprietary models. Leveraging Secure Computation techniques from Cryptography, two widely studied approaches in this domain are Fully Homomorphic Encryption (FHE) and Garbled Circuits (GC). This work presents a comparative evaluation of FHE and GC for secure neural network inference. A two-layer neural network (NN) was implemented using the CKKS scheme from the Microsoft SEAL library (FHE) and the TinyGarble2.0 framework (GC) by IntelLabs. Both implementations are evaluated under the semi-honest threat model, measuring inference output error, round-trip time, peak memory usage, communication overhead, and communication rounds. Results reveal a trade-off: modular GC offers faster execution and lower memory consumption, while FHE supports non-interactive inference.
[LG-74] VeMo: A Lightweight Data-Driven Approach to Model Vehicle Dynamics
链接: https://arxiv.org/abs/2510.07447
作者: Girolamo Oddo,Roberto Nuca,Matteo Parsani
类目: Robotics (cs.RO); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:Developing a dynamic model for a high-performance vehicle is a complex problem that requires extensive structural information about the system under analysis. This information is often unavailable to those who did not design the vehicle and represents a typical issue in autonomous driving applications, which are frequently developed on top of existing vehicles; therefore, vehicle models are developed under conditions of information scarcity. This paper proposes a lightweight encoder-decoder model based on Gate Recurrent Unit layers to correlate the vehicle’s future state with its past states, measured onboard, and control actions the driver performs. The results demonstrate that the model achieves a maximum mean relative error below 2.6% in extreme dynamic conditions. It also shows good robustness when subject to noisy input data across the interested frequency components. Furthermore, being entirely data-driven and free from physical constraints, the model exhibits physical consistency in the output signals, such as longitudinal and lateral accelerations, yaw rate, and the vehicle’s longitudinal velocity.
[LG-75] Parameter-Free Federated TD Learning with Markov Noise in Heterogeneous Environments
链接: https://arxiv.org/abs/2510.07436
作者: Ankur Naskar,Gugan Thoppe,Utsav Negi,Vijay Gupta
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) can dramatically speed up reinforcement learning by distributing exploration and training across multiple agents. It can guarantee an optimal convergence rate that scales linearly in the number of agents, i.e., a rate of \tildeO(1/(NT)), where T is the iteration index and N is the number of agents. However, when the training samples arise from a Markov chain, existing results on TD learning achieving this rate require the algorithm to depend on unknown problem parameters. We close this gap by proposing a two-timescale Federated Temporal Difference (FTD) learning with Polyak-Ruppert averaging. Our method provably attains the optimal \tildeO(1/NT) rate in both average-reward and discounted settings–offering a parameter-free FTD approach for Markovian data. Although our results are novel even in the single-agent setting, they apply to the more realistic and challenging scenario of FL with heterogeneous environments.
[LG-76] Learning to Route LLM s from Bandit Feedback: One Policy Many Trade-offs
链接: https://arxiv.org/abs/2510.07429
作者: Wang Wei,Tiankai Yang,Hongjie Chen,Yue Zhao,Franck Dernoncourt,Ryan A. Rossi,Hoda Eldardiry
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures
Abstract:Efficient use of large language models (LLMs) is critical for deployment at scale: without adaptive routing, systems either overpay for strong models or risk poor performance from weaker ones. Selecting the right LLM for each query is fundamentally an online decision problem: models differ in strengths, prices fluctuate, and users value accuracy and cost differently. Yet most routers are trained offline with labels for all candidate models, an assumption that breaks in deployment, where only the outcome of the chosen model is observed. We bridge this gap with BaRP, a Bandit-feedback Routing with Preferences approach that trains under the same partial-feedback restriction as deployment, while supporting preference-tunable inference: operators can dial the performance/cost trade-off at test time without retraining. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt, rather than depending on full-information offline supervision. Comprehensive experiments show that our method consistently outperforms strong offline routers by at least 12.46% and the largest LLM by at least 2.45%, and generalizes robustly for unseen tasks.
[LG-77] Best-of-Both Worlds for linear contextual bandits with paid observations
链接: https://arxiv.org/abs/2510.07424
作者: Nathan Boyer,Dorian Baudry,Patrick Rebeschini
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of linear contextual bandits with paid observations, where at each round the learner selects an action in order to minimize its loss in a given context, and can then decide to pay a fixed cost to observe the loss of any arm. Building on the Follow-the-Regularized-Leader framework with efficient estimators via Matrix Geometric Resampling, we introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem. We show that it achieves the minimax-optimal regret of \Theta(T^2/3) in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) stochastic regimes. Our approach builds on the framework from \citeBOBWhardproblems to design BOBW algorithms for ``hard problem’', using analysis techniques tailored for the setting that we consider.
[LG-78] Out-of-Distribution Generalization in Climate-Aware Yield Prediction with Earth Observation Data
链接: https://arxiv.org/abs/2510.07350
作者: Aditya Chakravarty
类目: Machine Learning (cs.LG)
*备注:
Abstract:Climate change is increasingly disrupting agricultural systems, making accurate crop yield forecasting essential for food security. While deep learning models have shown promise in yield prediction using satellite and weather data, their ability to generalize across geographic regions and years - critical for real-world deployment - remains largely untested. We benchmark two state-of-the-art models, GNN-RNN and MMST-ViT, under realistic out-of-distribution (OOD) conditions using the large-scale CropNet dataset spanning 1,200+ U.S. counties from 2017-2022. Through leave-one-cluster-out cross-validation across seven USDA Farm Resource Regions and year-ahead prediction scenarios, we identify substantial variability in cross-region transferability. GNN-RNN demonstrates superior generalization with positive correlations under geographic shifts, while MMST-ViT performs well in-domain but degrades sharply under OOD conditions. Regions like Heartland and Northern Great Plains show stable transfer dynamics (RMSE less than 10 bu/acre for soybean), whereas Prairie Gateway exhibits persistent underperformance (RMSE greater than 20 bu/acre) across both models and crops, revealing structural dissimilarities likely driven by semi-arid climate, irrigation patterns, and incomplete spectral coverage. Beyond accuracy differences, GNN-RNN achieves 135x faster training than MMST-ViT (14 minutes vs. 31.5 hours), making it more viable for sustainable deployment. Our findings underscore that spatial-temporal alignment - not merely model complexity or data scale - is key to robust generalization, and highlight the need for transparent OOD evaluation protocols to ensure equitable and reliable climate-aware agricultural forecasting.
[LG-79] SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation
链接: https://arxiv.org/abs/2510.07340
作者: Yongzhi Li,Saining Zhang,Yibing Chen,Boying Li,Yanxin Zhang,Xiaoyu Du
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注:
Abstract:Personalized image generation aims to faithfully preserve a reference subject’s identity while adapting to diverse text prompts. Existing optimization-based methods ensure high fidelity but are computationally expensive, while learning-based approaches offer efficiency at the cost of entangled representations influenced by nuisance factors. We introduce SpotDiff, a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. Leveraging a pre-trained CLIP image encoder and specialized expert networks for pose and background, SpotDiff isolates subject identity through orthogonality constraints in the feature space. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and controllable editing than prior methods, while attaining competitive performance with only 10k training samples.
[LG-80] A Modality-Aware Cooperative Co-Evolutionary Framework for Multimodal Graph Neural Architecture Search
链接: https://arxiv.org/abs/2510.07325
作者: Sixuan Wang,Jiao Yin,Jinli Cao,Mingjian Tang,Yong-Feng Ge
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, 6 figures. This work has been submitted to the IEEE for possible publication
Abstract:Co-exploitation attacks on software vulnerabilities pose severe risks to enterprises, a threat that can be mitigated by analyzing heterogeneous and multimodal vulnerability data. Multimodal graph neural networks (MGNNs) are well-suited to integrate complementary signals across modalities, thereby improving attack-prediction accuracy. However, designing an effective MGNN architecture is challenging because it requires coordinating modality-specific components at each layer, which is infeasible through manual tuning. Genetic algorithm (GA)-based graph neural architecture search (GNAS) provides a natural solution, yet existing methods are confined to single modalities and overlook modality heterogeneity. To address this limitation, we propose a modality-aware cooperative co-evolutionary algorithm for multimodal graph neural architecture search, termed MACC-MGNAS. First, we develop a modality-aware cooperative co-evolution (MACC) framework under a divide-and-conquer paradigm: a coordinator partitions a global chromosome population into modality-specific gene groups, local workers evolve them independently, and the coordinator reassembles chromosomes for joint evaluation. This framework effectively captures modality heterogeneity ignored by single-modality GNAS. Second, we introduce a modality-aware dual-track surrogate (MADTS) method to reduce evaluation cost and accelerate local gene evolution. Third, we design a similarity-based population diversity indicator (SPDI) strategy to adaptively balance exploration and exploitation, thereby accelerating convergence and avoiding local optima. On a standard vulnerabilities co-exploitation (VulCE) dataset, MACC-MGNAS achieves an F1-score of 81.67% within only 3 GPU-hours, outperforming the state-of-the-art competitor by 8.7% F1 while reducing computation cost by 27%.
[LG-81] Reconstructing the local density field with combined convolutional and point cloud architecture NEURIPS2025
链接: https://arxiv.org/abs/2510.08573
作者: Baptiste Barthe-Gold,Nhat-Minh Nguyen,Leander Thiele
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 6 pages, 4 figures, 1 table. Accepted at the NeurIPS 2025 Workshop: ML4PS. Comments welcome!
Abstract:We construct a neural network to perform regression on the local dark-matter density field given line-of-sight peculiar velocities of dark-matter halos, biased tracers of the dark matter field. Our architecture combines a convolutional U-Net with a point-cloud DeepSets. This combination enables efficient use of small-scale information and improves reconstruction quality relative to a U-Net-only approach. Specifically, our hybrid network recovers both clustering amplitudes and phases better than the U-Net on small scales.
[LG-82] Computational and statistical lower bounds for low-rank estimation under general inhomogeneous noise
链接: https://arxiv.org/abs/2510.08541
作者: Debsurya De,Dmitriy Kunisky
类目: atistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
*备注: 52 pages, 3 figures
Abstract:Recent work has generalized several results concerning the well-understood spiked Wigner matrix model of a low-rank signal matrix corrupted by additive i.i.d. Gaussian noise to the inhomogeneous case, where the noise has a variance profile. In particular, for the special case where the variance profile has a block structure, a series of results identified an effective spectral algorithm for detecting and estimating the signal, identified the threshold signal strength required for that algorithm to succeed, and proved information-theoretic lower bounds that, for some special signal distributions, match the above threshold. We complement these results by studying the computational optimality of this spectral algorithm. Namely, we show that, for a much broader range of signal distributions, whenever the spectral algorithm cannot detect a low-rank signal, then neither can any low-degree polynomial algorithm. This gives the first evidence for a computational hardness conjecture of Guionnet, Ko, Krzakala, and Zdeborová (2023). With similar techniques, we also prove sharp information-theoretic lower bounds for a class of signal distributions not treated by prior work. Unlike all of the above results on inhomogeneous models, our results do not assume that the variance profile has a block structure, and suggest that the same spectral algorithm might remain optimal for quite general profiles. We include a numerical study of this claim for an example of a smoothly-varying rather than piecewise-constant profile. Our proofs involve analyzing the graph sums of a matrix, which also appear in free and traffic probability, but we require new bounds on these quantities that are tighter than existing ones for non-negative matrices, which may be of independent interest.
[LG-83] Permutation-Invariant Spectral Learning via Dyson Diffusion
链接: https://arxiv.org/abs/2510.08535
作者: Tassilo Schwarz,Cai Dieball,Constantin Kogler,Kevin Lam,Renaud Lambiotte,Arnaud Doucet,Aljaž Godec,George Deligiannidis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:Diffusion models are central to generative modeling and have been adapted to graphs by diffusing adjacency matrix representations. The challenge of having up to n! such representations for graphs with n nodes is only partially mitigated by using permutation-equivariant learning architectures. Despite their computational efficiency, existing graph diffusion models struggle to distinguish certain graph families, unless graph data are augmented with ad hoc features. This shortcoming stems from enforcing the inductive bias within the learning architecture. In this work, we leverage random matrix theory to analytically extract the spectral properties of the diffusion process, allowing us to push the inductive bias from the architecture into the dynamics. Building on this, we introduce the Dyson Diffusion Model, which employs Dyson’s Brownian Motion to capture the spectral dynamics of an Ornstein-Uhlenbeck process on the adjacency matrix while retaining all non-spectral information. We demonstrate that the Dyson Diffusion Model learns graph spectra accurately and outperforms existing graph diffusion models.
[LG-84] Accelerated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models
链接: https://arxiv.org/abs/2510.08465
作者: Chih-Yu Chang,Ming-Chung Chang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-85] Wavefunction Flows: Efficient Quantum Simulation of Continuous Flow Models
链接: https://arxiv.org/abs/2510.08462
作者: David Layden,Ryan Sweke,Vojtěch Havlíček,Anirban Chowdhury,Kirill Neklyudov
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Flow models are a cornerstone of modern machine learning. They are generative models that progressively transform probability distributions according to learned dynamics. Specifically, they learn a continuous-time Markov process that efficiently maps samples from a simple source distribution into samples from a complex target distribution. We show that these models are naturally related to the Schrödinger equation, for an unusual Hamiltonian on continuous variables. Moreover, we prove that the dynamics generated by this Hamiltonian can be efficiently simulated on a quantum computer. Together, these results give a quantum algorithm for preparing coherent encodings (a.k.a., qsamples) for a vast family of probability distributions–namely, those expressible by flow models–by reducing the task to an existing classical learning problem, plus Hamiltonian simulation. For statistical problems defined by flow models, such as mean estimation and property testing, this enables the use of quantum algorithms tailored to qsamples, which may offer advantages over classical algorithms based only on samples from a flow model. More broadly, these results reveal a close connection between state-of-the-art machine learning models, such as flow matching and diffusion models, and one of the main expected capabilities of quantum computers: simulating quantum dynamics.
[LG-86] Navigating Sparsities in High-Dimensional Linear Contextual Bandits
链接: https://arxiv.org/abs/2510.08435
作者: Rui Zhao,Zihan Chen,Zemin Zheng
类目: atistics Theory (math.ST); Machine Learning (cs.LG)
*备注:
Abstract:High-dimensional linear contextual bandit problems remain a significant challenge due to the curse of dimensionality. Existing methods typically consider either the model parameters to be sparse or the eigenvalues of context covariance matrices to be (approximately) sparse, lacking general applicability due to the rigidity of conventional reward estimators. To overcome this limitation, a powerful pointwise estimator is introduced in this work that adaptively navigates both kinds of sparsity. Based on this pointwise estimator, a novel algorithm, termed HOPE, is proposed. Theoretical analyses demonstrate that HOPE not only achieves improved regret bounds in previously discussed homogeneous settings (i.e., considering only one type of sparsity) but also, for the first time, efficiently handles two new challenging heterogeneous settings (i.e., considering a mixture of two types of sparsity), highlighting its flexibility and generality. Experiments corroborate the superiority of HOPE over existing methods across various scenarios.
[LG-87] Optimal Stopping in Latent Diffusion Models
链接: https://arxiv.org/abs/2510.08409
作者: Yu-Han Wu,Quentin Berthet,Gérard Biau,Claire Boyer,Romuald Elie,Pierre Marion
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We identify and analyze a surprising phenomenon of Latent Diffusion Models (LDMs) where the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lower-dimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping time. We further establish that the latent dimension interplays with other hyperparameters of the problem such as constraints in the parameters of score matching. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences the sample quality, and highlight stopping time as a key hyperparameter in LDMs.
[LG-88] PAC Learnability in the Presence of Performativity
链接: https://arxiv.org/abs/2510.08335
作者: Ivan Kirev,Lyuben Baltadzhiev,Nikola Konstantinov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages, 3 figures
Abstract:Following the wide-spread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely based on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are learnable, via the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function, which depends only on data from the original distribution and on the type performative effect, and is yet an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this notion of performative risk allows us to show that any PAC-learnable hypothesis space in the standard binary classification setting remains PAC-learnable for the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase benefits on synthetic and real data.
[LG-89] High-dimensional Analysis of Synthetic Data Selection
链接: https://arxiv.org/abs/2510.08123
作者: Parham Rezaei,Filip Kovacevic,Francesco Locatello,Marco Mondelli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-90] Beyond Real Data: Synthetic Data through the Lens of Regularization
链接: https://arxiv.org/abs/2510.08095
作者: Amitis Shidani,Tyler Farghly,Yang Sun,Habib Ganjgahi,George Deligiannidis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.
[LG-91] Computations and ML for surjective rational maps
链接: https://arxiv.org/abs/2510.08093
作者: Ilya Karzhemanov
类目: Algebraic Geometry (math.AG); Machine Learning (cs.LG)
*备注: 15 pages, 2 figures, a couple of Python codes
[LG-92] Stick-Breaking Mixture Normalizing Flows with Component-Wise Tail Adaptation for Variational Inference
链接: https://arxiv.org/abs/2510.07965
作者: Seungsu Han,Juyoung Hwang,Won Chang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Normalizing flows with a Gaussian base provide a computationally efficient way to approximate posterior distributions in Bayesian inference, but they often struggle to capture complex posteriors with multimodality and heavy tails. We propose a stick-breaking mixture base with component-wise tail adaptation (StiCTAF) for posterior approximation. The method first learns a flexible mixture base to mitigate the mode-seeking bias of reverse KL divergence through a weighted average of component-wise ELBOs. It then estimates local tail indices of unnormalized densities and finally refines each mixture component using a shared backbone combined with component-specific tail transforms calibrated by the estimated indices. This design enables accurate mode coverage and anisotropic tail modeling while retaining exact density evaluation and stable optimization. Experiments on synthetic posteriors demonstrate improved tail recovery and better coverage of multiple modes compared to benchmark models. We also present a real-data analysis illustrating the practical benefits of our approach for posterior inference.
[LG-93] On the Optimality of the Median-of-Means Estimator under Adversarial Contamination
链接: https://arxiv.org/abs/2510.07867
作者: Xabier de Juan,Santiago Mazuelas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:The Median-of-Means (MoM) is a robust estimator widely used in machine learning that is known to be (minimax) optimal in scenarios where samples are i.i.d. In more grave scenarios, samples are contaminated by an adversary that can inspect and modify the data. Previous work has theoretically shown the suitability of the MoM estimator in certain contaminated settings. However, the (minimax) optimality of MoM and its limitations under adversarial contamination remain unknown beyond the Gaussian case. In this paper, we present upper and lower bounds for the error of MoM under adversarial contamination for multiple classes of distributions. In particular, we show that MoM is (minimax) optimal in the class of distributions with finite variance, as well as in the class of distributions with infinite variance and finite absolute (1+r) -th moment. We also provide lower bounds for MoM’s error that match the order of the presented upper bounds, and show that MoM is sub-optimal for light-tailed distributions.
[LG-94] On the Optimality of Tracking Fisher Information in Adaptive Testing with Stochastic Binary Responses
链接: https://arxiv.org/abs/2510.07862
作者: Sanghwa Kim(KAIST),Dohyun Ahn(The Chinese University of Hong Kong),Seungki Min(Seoul National University)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of estimating a continuous ability parameter from sequential binary responses by actively asking questions with varying difficulties, a setting that arises naturally in adaptive testing and online preference learning. Our goal is to certify that the estimate lies within a desired margin of error, using as few queries as possible. We propose a simple algorithm that adaptively selects questions to maximize Fisher information and updates the estimate using a method-of-moments approach, paired with a novel test statistic to decide when the estimate is accurate enough. We prove that this Fisher-tracking strategy achieves optimal performance in both fixed-confidence and fixed-budget regimes, which are commonly invested in the best-arm identification literature. Our analysis overcomes a key technical challenge in the fixed-budget setting – handling the dependence between the evolving estimate and the query distribution – by exploiting a structural symmetry in the model and combining large deviation tools with Ville’s inequality. Our results provide rigorous theoretical support for simple and efficient adaptive testing procedures.
[LG-95] Surrogate Graph Partitioning for Spatial Prediction
链接: https://arxiv.org/abs/2510.07832
作者: Yuta Shikuri,Hironori Fujisawa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 18 pages, 5 figures, 2 tables
Abstract:Spatial prediction refers to the estimation of unobserved values from spatially distributed observations. Although recent advances have improved the capacity to model diverse observation types, adoption in practice remains limited in industries that demand interpretability. To mitigate this gap, surrogate models that explain black-box predictors provide a promising path toward interpretable decision making. In this study, we propose a graph partitioning problem to construct spatial segments that minimize the sum of within-segment variances of individual predictions. The assignment of data points to segments can be formulated as a mixed-integer quadratic programming problem. While this formulation potentially enables the identification of exact segments, its computational complexity becomes prohibitive as the number of data points increases. Motivated by this challenge, we develop an approximation scheme that leverages the structural properties of graph partitioning. Experimental results demonstrate the computational efficiency of this approximation in identifying spatial segments.
[LG-96] When Robustness Meets Conservativeness: Conformalized Uncertainty Calibration for Balanced Decision Making
链接: https://arxiv.org/abs/2510.07750
作者: Wenbin Zhou,Shixiang Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet their effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage-regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost-risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance than existing approaches. These results offer the first principled data-driven methodology for guiding robustness selection and empower practitioners to balance robustness and conservativeness in high-stakes decision-making.
[LG-97] A Honest Cross-Validation Estimator for Prediction Performance
链接: https://arxiv.org/abs/2510.07649
作者: Tianyu Pan,Vincent Z. Yu,Viswanath Devanarayan,Lu Tian
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:
Abstract:Cross-validation is a standard tool for obtaining a honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, we propose a new method to estimate the performance of a model trained on a specific (random) training set. A naive estimator can be obtained by applying the model to a disjoint testing set. Surprisingly, cross-validation estimators computed from other random splits can be used to improve this naive estimator within a random-effects model framework. We develop two estimators – a hierarchical Bayesian estimator and an empirical Bayes estimator – that perform similarly to or better than both the conventional cross-validation estimator and the naive single-split estimator. Simulations and a real-data example demonstrate the superior performance of the proposed method.
[LG-98] From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
链接: https://arxiv.org/abs/2510.07624
作者: Abdelhakim Benechehab,Gabriel Singer,Corentin Léger,Youssef Attia El Hili,Giuseppe Paolo,Albert Thomas,Maurizio Filippone,Balázs Kégl
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work have highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at this https URL .
[LG-99] Locality-Sensitive Hashing-Based Efficient Point Transformer for Charged Particle Reconstruction NEURIPS2025
链接: https://arxiv.org/abs/2510.07594
作者: Shitij Govil,Jack P. Rodgers,Yuan-Tang Chou,Siqi Miao,Amit Saha,Advaith Anand,Kilian Lieret,Gage DeZoort,Mia Liu,Javier Duarte,Pan Li,Shih-Chieh Hsu
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025 Machine Learning and the Physical Sciences Workshop
Abstract:Charged particle track reconstruction is a foundational task in collider experiments and the main computational bottleneck in particle reconstruction. Graph neural networks (GNNs) have shown strong performance for this problem, but costly graph construction, irregular computations, and random memory access patterns substantially limit their throughput. The recently proposed Hashing-based Efficient Point Transformer (HEPT) offers a theoretically guaranteed near-linear complexity for large point cloud processing via locality-sensitive hashing (LSH) in attention computations; however, its evaluations have largely focused on embedding quality, and the object condensation pipeline on which HEPT relies requires a post-hoc clustering step (e.g., DBScan) that can dominate runtime. In this work, we make two contributions. First, we present a unified, fair evaluation of physics tracking performance for HEPT and a representative GNN-based pipeline under the same dataset and metrics. Second, we introduce HEPTv2 by extending HEPT with a lightweight decoder that eliminates the clustering stage and directly predicts track assignments. This modification preserves HEPT’s regular, hardware-friendly computations while enabling ultra-fast end-to-end inference. On the TrackML dataset, optimized HEPTv2 achieves approximately 28 ms per event on an A100 while maintaining competitive tracking efficiency. These results position HEPTv2 as a practical, scalable alternative to GNN-based pipelines for fast tracking.
[LG-100] Beyond independent component analysis: identifiability and algorithms
链接: https://arxiv.org/abs/2510.07525
作者: Alvaro Ribot,Anna Seigal,Piotr Zwiernik
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages, 8 figures
Abstract:Independent Component Analysis (ICA) is a classical method for recovering latent variables with useful identifiability properties. For independent variables, cumulant tensors are diagonal; relaxing independence yields tensors whose zero structure generalizes diagonality. These models have been the subject of recent work in non-independent component analysis. We show that pairwise mean independence answers the question of how much one can relax independence: it is identifiable, any weaker notion is non-identifiable, and it contains the models previously studied as special cases. Our results apply to distributions with the required zero pattern at any cumulant tensor. We propose an algebraic recovery algorithm based on least-squares optimization over the orthogonal group. Simulations highlight robustness: enforcing full independence can harm estimation, while pairwise mean independence enables more stable recovery. These findings extend the classical ICA framework and provide a rigorous basis for blind source separation beyond independence.
[LG-101] me-Frequency Filtering Meets Graph Clustering
链接: https://arxiv.org/abs/2510.07503
作者: Marcelo A. Colominas,Stefan Steinerberger,Hau-Tieng Wu
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:We show that the problem of identifying different signal components from a time-frequency representation can be equivalently phrased as a graph clustering problem: given a graph G=(V,E) one aims to identify `clusters’, subgraphs that are strongly connected and have relatively few connections between them. The graph clustering problem is well studied, we show how these ideas can suggest (many) new ways to identify signal components. Numerical experiments illustrate the ideas.
[LG-102] Evaluating and Learning Optimal Dynamic Treatment Regimes under Truncation by Death
链接: https://arxiv.org/abs/2510.07501
作者: Sihyung Park(1),Wenbin Lu(1),Shu Yang(1) ((1) North Carolina State University)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 30 pages, 5 figures, 6 tables, The Thirty-Ninth Annual Conference on Neural Information Processing Systems
Abstract:Truncation by death, a prevalent challenge in critical care, renders traditional dynamic treatment regime (DTR) evaluation inapplicable due to ill-defined potential outcomes. We introduce a principal stratification-based method, focusing on the always-survivor value function. We derive a semiparametrically efficient, multiply robust estimator for multi-stage DTRs, demonstrating its robustness and efficiency. Empirical validation and an application to electronic health records showcase its utility for personalized treatment optimization.
[LG-103] Bayesian Optimization of Multi-Bit Pulse Encoding in In2O3/Al2O3 Thin-film Transistors for Temporal Data Processing
链接: https://arxiv.org/abs/2510.07421
作者: Javier Meza-Arroyo,Benius Dunn,Weijie Xu,Yu-Chieh Chen,Jen-Sue Chen,Julia W.P. Hsu
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Utilizing the intrinsic history-dependence and nonlinearity of hardware, physical reservoir computing is a promising neuromorphic approach to encode time-series data for in-sensor computing. The accuracy of this encoding critically depends on the distinguishability of multi-state outputs, which is often limited by suboptimal and empirically chosen reservoir operation conditions. In this work, we demonstrate a machine learning approach, Bayesian optimization, to improve the encoding fidelity of solution-processed Al2O3/In2O3 thin-film transistors (TFTs). We show high-fidelity 6-bit temporal encoding by exploring five key pulse parameters and using the normalized degree of separation (nDoS) as the metric of output state separability. Additionally, we show that a model trained on simpler 4-bit data can effectively guide optimization of more complex 6-bit encoding tasks, reducing experimental cost. Specifically, for the encoding and reconstruction of binary-patterned images of a moving car across 6 sequential frames, we demonstrate that the encoding is more accurate when operating the TFT using optimized pulse parameters and the 4-bit optimized operating condition performs almost as well as the 6-bit optimized condition. Finally, interpretability analysis via Shapley Additive Explanations (SHAP) reveals that gate pulse amplitude and drain voltage are the most influential parameters in achieving higher state separation. This work presents the first systematic method to identify optimal operating conditions for reservoir devices, and the approach can be extended to other physical reservoir implementations across different material platforms.
[LG-104] Beyond Grid-Locked Voxels: Neural Response Functions for Continuous Brain Encoding
链接: https://arxiv.org/abs/2510.07342
作者: Haomiao Chen,Keith W Jamison,Mert R. Sabuncu,Amy Kuceyeski
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:
Abstract:Neural encoding models aim to predict fMRI-measured brain responses to natural images. fMRI data is acquired as a 3D volume of voxels, where each voxel has a defined spatial location in the brain. However, conventional encoding models often flatten this volume into a 1D vector and treat voxel responses as independent outputs. This removes spatial context, discards anatomical information, and ties each model to a subject-specific voxel grid. We introduce the Neural Response Function (NRF), a framework that models fMRI activity as a continuous function over anatomical space rather than a flat vector of voxels. NRF represents brain activity as a continuous implicit function: given an image and a spatial coordinate (x, y, z) in standardized MNI space, the model predicts the response at that location. This formulation decouples predictions from the training grid, supports querying at arbitrary spatial resolutions, and enables resolution-agnostic analyses. By grounding the model in anatomical space, NRF exploits two key properties of brain responses: (1) local smoothness – neighboring voxels exhibit similar response patterns; modeling responses continuously captures these correlations and improves data efficiency, and (2) cross-subject alignment – MNI coordinates unify data across individuals, allowing a model pretrained on one subject to be fine-tuned on new subjects. In experiments, NRF outperformed baseline models in both intrasubject encoding and cross-subject adaptation, achieving high performance while reducing the data size needed by orders of magnitude. To our knowledge, NRF is the first anatomically aware encoding model to move beyond flattened voxels, learning a continuous mapping from images to brain responses in 3D space.
[LG-105] Decoding the dark proteome: Deep learning-enabled discovery of druggable enzymes in Wuchereria bancrofti
链接: https://arxiv.org/abs/2510.07337
作者: Shawnak Shivakumar,Jefferson Hernandez
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
*备注: Accepted for peer-reviewed publication at the STEM Fellowship Journal
Abstract:Wuchereria bancrofti, the parasitic roundworm responsible for lymphatic filariasis, permanently disables over 36 million people and places 657 million at risk across 39 countries. A major bottleneck for drug discovery is the lack of functional annotation for more than 90 percent of the W. bancrofti dark proteome, leaving many potential targets unidentified. In this work, we present a novel computational pipeline that converts W. bancrofti’s unannotated amino acid sequence data into precise four-level Enzyme Commission (EC) numbers and drug candidates. We utilized a DEtection TRansformer to estimate the probability of enzymatic function, fine-tuned a hierarchical nearest neighbor EC predictor on 4,476 labeled parasite proteins, and applied rejection sampling to retain only four-level EC classifications at 100 percent confidence. This pipeline assigned precise EC numbers to 14,772 previously uncharacterized proteins and discovered 543 EC classes not previously known in W. bancrofti. A qualitative triage emphasizing parasite-specific targets, chemical tractability, biochemical importance, and biological plausibility prioritized six enzymes across five separate strategies: anti-Wolbachia cell-wall inhibition, proteolysis blockade, transmission disruption, purinergic immune interference, and cGMP-signaling destabilization. We curated a 43-compound library from ChEMBL and BindingDB and co-folded across multiple protein conformers with Boltz-2. All six targets exhibited at least moderately strong predicted binding affinities below 1 micromolar, with moenomycin analogs against peptidoglycan glycosyltransferase and NTPase inhibitors showing promising nanomolar hits and well-defined binding pockets. While experimental validation remains essential, our results provide the first large-scale functional map of the W. bancrofti dark proteome and accelerate early-stage drug development for the species.
[LG-106] Geodesics in the Deep Linear Network
链接: https://arxiv.org/abs/2510.07324
作者: Alan Chen
类目: Differential Geometry (math.DG); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:We derive a general system of ODEs and associated explicit solutions in a special case for geodesics between full rank matrices in the deep linear network geometry. In the process, we characterize all horizontal straight lines in the invariant balanced manifold that remain geodesics under Riemannian submersion.
信息检索
[IR-0] Mobile Gamer Lifetime Value Prediction via Objective Decomposition and Reconstruction
链接: https://arxiv.org/abs/2510.08281
作者: Tianwei Li,Yu Zhao,Yunze Li,Sheng Li
类目: Information Retrieval (cs.IR)
*备注: 6 pages, 6 figures
Abstract:For Internet platforms operating real-time bidding (RTB) advertising service, a comprehensive understanding of user lifetime value (LTV) plays a pivotal role in optimizing advertisement allocation efficiency and maximizing the return on investment (ROI) for advertisement sponsors, thereby facilitating growth of commercialization revenue for the platform. However, the inherent complexity of user LTV distributions induces significant challenges in accurate LTV prediction. Existing state-of-the-art works, which primarily focus on directly learning the LTV distributions through well-designed loss functions, achieve limited success due to their vulnerability to outliers. In this paper, we proposed a novel LTV prediction method to address distribution challenges through an objective decomposition and reconstruction framework. Briefly speaking, based on the in-app purchase characteristics of mobile gamers, our model was designed to first predict the number of transactions at specific prices and then calculate the total payment amount from these intermediate predictions. Our proposed model was evaluated through experiments on real-world industrial dataset, and deployed on the TapTap RTB advertising system for online A/B testing along with the state-of-the-art ZILN model.
[IR-1] Generation and annotation of item usage scenarios in e-commerce using large language models
链接: https://arxiv.org/abs/2510.07885
作者: Madoka Hagiri,Kazushi Okamoto,Koki Karube,Kei Harada,Atsushi Shibata
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Complementary recommendations suggest combinations of useful items that play important roles in e-commerce. However, complementary relationships are often subjective and vary among individuals, making them difficult to infer from historical data. Unlike conventional history-based methods that rely on statistical co-occurrence, we focus on the underlying usage context that motivates item combinations. We hypothesized that people select complementary items by imagining specific usage scenarios and identifying the needs in such situations. Based on this idea, we explored the use of large language models (LLMs) to generate item usage scenarios as a starting point for constructing complementary recommendation systems. First, we evaluated the plausibility of LLM-generated scenarios through manual annotation. The results demonstrated that approximately 85% of the generated scenarios were determined to be plausible, suggesting that LLMs can effectively generate realistic item usage scenarios.
[IR-2] Queries Are Not Alone: Clustering Text Embeddings for Video Search SIGIR
链接: https://arxiv.org/abs/2510.07720
作者: Peyang Liu,Xi Wang,Ziqiang Cui,Wei Ye
类目: Information Retrieval (cs.IR)
*备注: Accepted by International ACM SIGIR Conference on Research and Development in Information Retrieval 2025
Abstract:The rapid proliferation of video content across various platforms has highlighted the urgent need for advanced video retrieval systems. Traditional methods, which primarily depend on directly matching textual queries with video metadata, often fail to bridge the semantic gap between text descriptions and the multifaceted nature of video content. This paper introduces a novel framework, the Video-Text Cluster (VTC), which enhances video retrieval by clustering text queries to capture a broader semantic scope. We propose a unique clustering mechanism that groups related queries, enabling our system to consider multiple interpretations and nuances of each query. This clustering is further refined by our innovative Sweeper module, which identifies and mitigates noise within these clusters. Additionally, we introduce the Video-Text Cluster-Attention (VTC-Att) mechanism, which dynamically adjusts focus within the clusters based on the video content, ensuring that the retrieval process emphasizes the most relevant textual features. Further experiments have demonstrated that our proposed model surpasses existing state-of-the-art models on five public datasets.
[IR-3] ISMIE: A Framework to Characterize Information Seeking in Modern Information Environments SIGIR
链接: https://arxiv.org/abs/2510.07644
作者: Shuoqi Sun,Danula Hettiachchi,Damiano Spina
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted to SIGIR-AP 2025
Abstract:The modern information environment (MIE) is increasingly complex, shaped by a wide range of techniques designed to satisfy users’ information needs. Information seeking (IS) models are effective mechanisms for characterizing user-system interactions. However, conceptualizing a model that fully captures the MIE landscape poses a challenge. We argue: Does such a model exist? To address this, we propose the Information Seeking in Modern Information Environments (ISMIE) framework as a fundamental step. ISMIE conceptualizes the information seeking process (ISP) via three key concepts: Components (e.g., Information Seeker), Intervening Variables (e.g., Interactive Variables), and Activities (e.g., Acquiring). Using ISMIE’s concepts and employing a case study based on a common scenario - misinformation dissemination - we analyze six existing IS and information retrieval (IR) models to illustrate their limitations and the necessity of ISMIE. We then show how ISMIE serves as an actionable framework for both characterization and experimental design. We characterize three pressing issues and then outline two research blueprints: a user-centric, industry-driven experimental design for the authenticity and trust crisis to AI-generated content and a system-oriented, academic-driven design for tackling dopamine-driven content consumption. Our framework offers a foundation for developing IS and IR models to advance knowledge on understanding human interactions and system design in MIEs.
[IR-4] Reasoning by Exploration: A Unified Approach to Retrieval and Generation over Graphs
链接: https://arxiv.org/abs/2510.07484
作者: Haoyu Han,Kai Guo,Harry Shomer,Yu Wang,Yucheng Chu,Hang Li,Li Ma,Jiliang Tang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Reasoning over structured graphs remains a fundamental challenge for Large Language Models (LLMs), particularly when scaling to large graphs. Existing approaches typically follow the retrieval-augmented generation (RAG) paradigm: first retrieving subgraphs relevant to the query and then generating answers conditioned on the retrieved subgraphs. However, such two-phase pipelines often struggle to faithfully incorporate graph structure, since the generation process is ultimately constrained by the quality and completeness of the retrieved subgraph. Although many advanced retrievers have been proposed recently to mitigate this issue, they are usually tailored to the training graphs and generalize poorly to unseen graphs, which limits their practical applicability. In this work, we propose Reasoning by Exploration (RoE), a novel approach that unifies retrieval and generation by framing reasoning over graphs as a process of graph exploration. At each step, the LLM selects candidate nodes and edges to explore, gradually constructing reasoning paths and generating answers along the way. To enable effective exploration, RoE is trained in two stages: supervised fine-tuning (SFT) on gold reasoning paths, followed by reinforcement learning (RL) to enhance exploration effectiveness and generalization. Experiments on benchmark datasets demonstrate that RoE achieves substantial overall improvements over baselines, while also generalizing effectively to unseen graphs.