This blog post presents the latest paper list retrieved from Arxiv.org on 2025-10-22. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by scheduled email, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-10-22)

A total of 596 papers are updated today, including:

  • Natural Language Processing: 88 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 202 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 115 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 183 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] Grasp Any Region: Towards Precise Contextual Pixel Understanding for Multimodal LLMs

[Quick Read]: This paper targets the difficulty Multimodal Large Language Models (MLLMs) have with fine-grained visual understanding in complex scenes: existing region-level MLLMs typically interpret individual regions in isolation, ignoring crucial global context and lacking the ability to model interactions across multiple prompts. The key to the proposed Grasp Any Region (GAR) is an effective Region-of-Interest-aligned feature replay technique that enables the model to (1) leverage necessary global context for precise perception and (2) model interactions between multiple prompts, which in turn naturally supports advanced compositional reasoning for answering free-form questions about any region, shifting the paradigm from passive description to active dialogue.

Link: https://arxiv.org/abs/2510.18876
Authors: Haochen Wang,Yuhao Wang,Tao Zhang,Yikang Zhou,Yanwei Li,Jiacong Wang,Ye Tian,Jiahao Meng,Zilong Huang,Guangcan Mai,Anran Wang,Yunhai Tong,Zhuochen Wang,Xiangtai Li,Zhaoxiang Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.

[NLP-1] Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

[Quick Read]: This paper addresses catastrophic forgetting in language models (LMs): adapting a model to a new task via post-training methods such as supervised fine-tuning (SFT) or reinforcement learning (RL) can improve target-task performance while severely degrading existing capabilities. The key contribution is to reveal and verify that RL, owing to the mode-seeking behavior induced by its use of on-policy data, preserves prior knowledge more effectively and therefore forgets significantly less; further experiments show that this robustness stems mainly from the nature of the data policy rather than from other algorithmic choices (such as KL regularization or advantage estimation). This finding provides both a theoretical basis and a practical route for mitigating forgetting efficiently with approximately on-policy data.

Link: https://arxiv.org/abs/2510.18874
Authors: Howard Chen,Noam Razin,Karthik Narasimhan,Danqi Chen
Affiliations: Princeton Language and Intelligence, Princeton University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities – a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
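
The mode-seeking versus mass-covering distinction above can be made concrete with the two KL objectives. The following is standard background in the editor's notation, not the paper's own formulation:

```latex
% Standard background (not the paper's notation): why on-policy RL is mode-seeking.
% SFT minimizes the forward KL, with the expectation under the data distribution,
% so the model is pushed to cover every mode of the data:
\min_{\theta}\; \mathrm{KL}\!\left(p_{\text{data}} \parallel p_{\theta}\right)
  = \min_{\theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \tfrac{p_{\text{data}}(x)}{p_{\theta}(x)}\right]
% On-policy RL takes expectations under the model itself, which corresponds to the
% reverse KL; the policy can concentrate on modes it already fits (mode-seeking),
% leaving probability mass on prior-knowledge modes untouched:
\min_{\theta}\; \mathrm{KL}\!\left(p_{\theta} \parallel p_{\text{target}}\right)
  = \min_{\theta}\; \mathbb{E}_{x \sim p_{\theta}}\!\left[\log \tfrac{p_{\theta}(x)}{p_{\text{target}}(x)}\right]
```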

[NLP-2] How Do LLMs Use Their Depth?

[Quick Read]: This paper addresses the uneven use of depth in large language models (LLMs) during inference, for which a fine-grained account of layer-wise prediction dynamics has been lacking. Its core contribution is a "Guess-then-Refine" framework describing how LLMs progressively refine predictions through layer-by-layer computation: early layers make statistical guesses dominated by high-frequency tokens, and as contextual information accumulates in deeper layers these initial guesses are revised, with 70% of early high-frequency predictions ultimately corrected. Three case studies (part-of-speech tagging, fact recall, and multiple-choice tasks) further show that different kinds of content resolve at different depths: function words are predicted correctly earliest, the first token of a multi-token answer requires the most computation, and the response format is identified within the first half of the layers. These findings offer a new view of layer-wise computation in Transformer architectures and point toward improvements in computational efficiency.

Link: https://arxiv.org/abs/2510.18871
Authors: Akshat Gupta,Jay Yeung,Gopala Anumanchipalli,Anna Ivanova
Affiliations: University of California, Berkeley; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a “Guess-then-Refine” framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined 70% of the time, indicating that correct token prediction is not “one-and-done”. We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. (i) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. (ii) Fact recall task analysis shows that, in a multi-token answer, the first token requires more computational depth than the rest. (iii) Multiple-choice task analysis shows that the model identifies the format of the response within the first half of the layers, but finalizes its response only toward the end. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
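
To make the layer-wise tracing concrete, here is a minimal logit-lens-style probe; this is an assumption on the editor's part, as the paper's exact tracing procedure may differ. It projects each layer's hidden state through the output head to see what that depth would predict:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any open-weight causal LM with a tied output head works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the embedding output plus one tensor per layer.
for depth, h in enumerate(out.hidden_states):
    h = model.transformer.ln_f(h)        # final layer norm (a GPT-2-specific attribute)
    logits = model.lm_head(h[0, -1])     # project the last position through the LM head
    print(f"depth {depth:2d} -> {tok.decode([int(logits.argmax())])!r}")
```

With a probe like this, early depths tend to emit high-frequency fillers and later depths the contextually appropriate token, mirroring the guess-then-refine pattern the paper reports.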

[NLP-3] LightMem: Lightweight and Efficient Memory-Augmented Generation

[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in effectively exploiting historical interaction information in dynamic, complex environments. Existing memory systems improve continual use of past context but often introduce substantial time and computational overhead. The key is LightMem, a novel memory architecture inspired by the Atkinson-Shiffrin model of human memory that organizes memory into three complementary stages: cognition-inspired sensory memory rapidly filters irrelevant information via lightweight compression and groups it by topic; topic-aware short-term memory structures and summarizes these groups; and long-term memory uses an offline sleep-time update strategy that decouples consolidation from inference. On the LongMemEval benchmark this design yields accuracy gains of up to 10.9% while cutting token usage by up to 117x, API calls by up to 159x, and runtime by over 12x.

Link: https://arxiv.org/abs/2510.18866
Authors: Jizhan Fang,Xinle Deng,Haoming Xu,Ziyan Jiang,Yuqi Tang,Ziwen Xu,Shumin Deng,Yunzhi Yao,Mengru Wang,Shuofei Qiao,Huajun Chen,Ningyu Zhang
Affiliations: Zhejiang University; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Notes: Work in progress

Abstract:Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at this https URL.
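
A schematic sketch of the three-stage flow described above, with all function names and the toy salience score hypothetical (LightMem's actual interfaces are not specified here):

```python
from collections import defaultdict

def sensory_filter(turns, keep_ratio=0.5):
    """Stage 1: lightweight compression -- keep the most salient turns, grouped by topic."""
    scored = sorted(turns, key=lambda t: len(t["text"]), reverse=True)  # toy salience score
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    groups = defaultdict(list)
    for t in kept:
        groups[t["topic"]].append(t["text"])
    return groups

def short_term_consolidate(groups):
    """Stage 2: topic-aware summaries for structured access (a real system would summarize with an LM)."""
    return {topic: " | ".join(texts)[:200] for topic, texts in groups.items()}

def sleep_time_update(long_term, summaries):
    """Stage 3: offline consolidation, decoupled from online inference."""
    long_term.update(summaries)
    return long_term

long_term = {}
turns = [{"topic": "travel", "text": "User plans a trip to Kyoto in April."},
         {"topic": "travel", "text": "Prefers trains over flights."},
         {"topic": "food", "text": "Allergic to peanuts."}]
long_term = sleep_time_update(long_term, short_term_consolidate(sensory_filter(turns)))
print(long_term)
```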

[NLP-4] Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

[Quick Read]: This paper addresses three core challenges in training generative AI models at trillion-parameter scale: train-inference misalignment, inefficient processing of long rollouts, and system-level bottlenecks in reinforcement learning (RL). The key lies in three interlocking innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, mitigating the instability caused by training-inference mismatches; (2) C3PO++ dynamically partitions long rollouts under a token budget, greatly improving time efficiency; and (3) ASystem, a high-performance RL framework that systematically removes the underlying bottlenecks of large-scale model training. Together, these techniques enable Ring-1T to achieve breakthrough performance on authoritative benchmarks, marking a new high point for open-source reasoning capability.

Link: https://arxiv.org/abs/2510.18855
Authors: Ling Team,Anqi Shen,Baihui Li,Bin Hu,Bin Jing,Cai Chen,Chao Huang,Chao Zhang,Chaokun Yang,Cheng Lin,Chengyao Wen,Congqi Li,Deng Zhao,Dingbo Yuan,Donghai You,Fagui Mao,Fanzhuang Meng,Feng Xu,Guojie Li,Guowei Wang,Hao Dai,Haonan Zheng,Hong Liu,Jia Guo,Jiaming Liu,Jian Liu,Jianhao Fu,Jiannan Shi,Jianwen Wang,Jianxin Lai,Jin Yang,Jun Mei,Jun Zhou,Junbo Zhao,Junping Zhao,Kuan Xu,Le Su,Lei Chen,Li Tang,Liang Jiang,Liangcheng Fu,Lianhao Xu,Linfeng Shi,Lisha Liao,Longfei Zheng,Meng Li,Mingchun Chen,Qi Zuo,Qiang Cheng,Qianggang Cao,Qitao Shi,Quanrui Guo,Senlin Zhu,Shaofei Wang,Shaomian Zheng,Shuaicheng Li,Shuwei Gu,Siba Chen,Tao Wu,Tao Zhang,Tianyu Zhang,Tianyu Zhou,Tiwei Bie,Tongkai Yang,Wang Hong,Wang Ren,Weihua Chen,Wenbo Yu,Wengang Zheng,Xiangchun Wang,Xiaodong Yan,Xiaopei Wan,Xin Zhao,Xinyu Kong,Xinyu Tang,Xudong Han,Xudong Wang,Xuemin Yang,Xueyu Hu,Yalin Zhang,Yan Sun,Yicheng Shan,Yilong Wang,Yingying Xu,Yongkang Liu,Yongzhen Guo,Yuanyuan Wang,Yuchen Yan,Yuefan Wang,Yuhong Guo,Zehuan Li,Zhankai Xu,Zhe Li,Zhenduo Zhang,Zhengke Gui,Zhenxuan Pan,Zhenyu Huang,Zhenzhong Lan,Zhiqiang Ding,Zhiqiang Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Technical Report

Abstract:We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.

[NLP-5] Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

[Quick Read]: This paper addresses the key challenge of faithfully aligning large language models (LLMs) with individual user preferences during personalization. Existing approaches fall short: supervised fine-tuning (SFT) quickly hits a performance plateau, while standard reinforcement learning from human feedback (RLHF) suffers from reward hacking with scalar reward models, producing verbose and only superficially personalized responses. The key is Critique-Post-Edit, a robust reinforcement learning framework with two novel components: a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on those critiques, enabling more targeted and efficient personalization training.

Link: https://arxiv.org/abs/2510.18849
Authors: Chenghao Zhu,Meiling Tao,Tiannan Wang,Dongyi Ding,Yuchen Eleanor Jiang,Wangchunshu Zhou
Affiliations: Oppo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: work in progress

Abstract:Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
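
A schematic of the critique-then-revise data flow described above. All model calls are stubs and the acceptance threshold is invented; this is a sketch of the control loop, not the paper's implementation:

```python
def critique_post_edit(prompt, policy, grm, max_rounds=1):
    draft = policy(prompt)
    for _ in range(max_rounds):
        scores, critique = grm(prompt, draft)   # multi-dimensional scores + text critique
        if min(scores.values()) >= 4:           # hypothetical "good enough" threshold
            break
        draft = policy(f"{prompt}\n\nPrevious answer: {draft}\n"
                       f"Critique: {critique}\nRevise the answer accordingly.")
    return draft

# Stubbed models, just to show the shapes flowing through the loop:
policy = lambda p: "Sure, here is a generic answer."
grm = lambda p, d: ({"personalization": 2, "helpfulness": 4},
                    "The reply ignores the user's stated preference for brevity.")
print(critique_post_edit("Explain RLHF to me (I prefer short answers).", policy, grm))
```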

[NLP-6] See the Text: From Tokenization to Visual Reading

[Quick Read]: This paper targets the limitations of subword tokenization in current large language models (LLMs), especially the semantic fragmentation and inflated computation caused by over-segmentation of low-resource languages. The key is SeeTok, a vision-centric text representation: text is rendered as images (visual-text) and interpreted directly by pretrained multimodal LLMs, reusing the OCR and text-vision alignment abilities learned during large-scale multimodal training. This paradigm shift lets the model match or surpass subword tokenizers with 4.43x fewer tokens and a 70.5% reduction in FLOPs, with further gains in cross-lingual generalization, robustness to typographic noise, and modeling of linguistic hierarchy, marking a move from symbolic tokenization toward human-like visual reading.

Link: https://arxiv.org/abs/2510.18840
Authors: Ling Xing,Alex Jinpeng Wang,Rui Yan,Hongyu Qu,Zechao Li,Jinhui Tang
Affiliations: Nanjing University of Science and Technology; Central South University; Nanjing Forestry University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes:

Abstract:People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
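
A minimal sketch of the "render text as an image" front end; the multimodal-LLM consumption step is not shown, and the rendering settings are the editor's assumptions rather than SeeTok's:

```python
from PIL import Image, ImageDraw

def render_visual_text(text, width=480, height=48):
    """Rasterize a string so a pretrained multimodal LLM can 'read' it as pixels."""
    img = Image.new("RGB", (width, height), "white")
    # Default bitmap font; pass a TTF via ImageFont.truetype for full Unicode coverage.
    ImageDraw.Draw(img).text((6, 14), text, fill="black")
    return img

img = render_visual_text("Typos and distorted fonts still map to nearby pixels")
img.save("visual_text.png")  # this image, not a sequence of subword IDs, is the model input
```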

[NLP-7] MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

[Quick Read]: This paper addresses the computational imbalance and communication overhead that arise when dynamic sparse attention is used to extend the context length of large language models (LLMs) in distributed training. The key is MTraining, a novel distributed training method whose three components work in concert: a dynamic sparse training pattern, Balanced Sparse Ring Attention, and Hierarchical Sparse Ring Attention. Together they mitigate worker-level and step-level imbalance, substantially improving the efficiency and scalability of ultra-long-context training.

Link: https://arxiv.org/abs/2510.18830
Authors: Wenxuan Li,Chengruidong Zhang,Huiqiang Jiang,Yucheng Li,Yuqing Yang,Lili Qiu
Affiliations: Microsoft Research; University of Cambridge; University of Surrey
Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Notes:

Abstract:The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long contexts. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at this https URL.

[NLP-8] Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring EMNLP2025

[Quick Read]: This paper addresses the difficulty small language models (SLMs) face in performing complex reasoning in specialized domains such as Industry 4.0. The core challenge is to raise SLM reasoning toward large language model (LLM) performance while retaining the efficiency and low computational footprint of SLMs. The key is a Chain-of-Thought (CoT) knowledge-distillation framework that transfers reasoning ability from LLMs to SLMs via multi-choice question answering (MCQA) prompts, combined with in-context learning to verify the quality of the generated knowledge, substantially improving the decision accuracy of fine-tuned SLMs and narrowing the gap to their LLM counterparts.

Link: https://arxiv.org/abs/2510.18817
Authors: Shuxin Lin,Dhaval Patel,Christodoulos Constantinides
Affiliations: IBM Research; IBM
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted at EMNLP 2025

Abstract:Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: this https URL.
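
A sketch of how MCQA prompts can elicit CoT rationales from a teacher LLM to build a distillation set for a smaller model. The prompt wording and the `ask_teacher` call are hypothetical; the paper's exact prompts are not reproduced here:

```python
def build_mcqa_prompt(question, options):
    lettered = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (f"Question: {question}\n{lettered}\n"
            "Think step by step about the asset's condition, then answer "
            "with the option letter.")

def make_distillation_record(question, options, ask_teacher):
    prompt = build_mcqa_prompt(question, options)
    rationale_and_answer = ask_teacher(prompt)  # teacher LLM call (stubbed below)
    return {"input": prompt, "target": rationale_and_answer}

record = make_distillation_record(
    "A pump shows rising vibration at 2x shaft speed. Most likely cause?",
    ["Bearing wear", "Misalignment", "Cavitation", "Normal operation"],
    ask_teacher=lambda p: "Step 1: 2x harmonics often indicate coupling issues... Answer: B",
)
print(record["target"])  # (prompt, rationale) pairs become the SLM fine-tuning set
```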

[NLP-9] WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection

[Quick Read]: This paper addresses the limitations of existing search agents in interactive environments: shallow tool-use depth and error accumulation across multi-turn interactions, which hold back intelligent information retrieval and decision-making. The key is WebSeer, a search agent trained via reinforcement learning augmented with a self-reflection mechanism: a large dataset annotated with reflection patterns is constructed, and a two-stage training framework unifies cold-start initialization and reinforcement learning under the self-reflection paradigm, enabling the model to generate longer and more reflective tool-use trajectories that substantially extend tool-call chains and improve answer accuracy.

Link: https://arxiv.org/abs/2510.18798
Authors: Guanzhong He,Zhen Yang,Jinxin Liu,Bin Xu,Lei Hou,Juanzi Li
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at this https URL

[NLP-10] KAT-Coder Technical Report

[Quick Read]: This paper addresses the gap between the static, text-based training of large language models (LLMs) and dynamic execution in real-world software development environments for agentic coding. The core challenge is equipping models with reliable tool use, instruction alignment, and long-context reasoning so they can be deployed stably in real IDEs. The key is a four-stage multi-stage curriculum: Mid-Term Training strengthens reasoning, planning, and reflection through real software engineering data and synthetic agentic interactions; Supervised Fine-Tuning (SFT) builds a million-sample dataset balancing many programming languages, development contexts, and task archetypes; Reinforcement Fine-Tuning (RFT) introduces a multi-ground-truth reward formulation for stable, sample-efficient policy optimization; and a Reinforcement-to-Deployment Adaptation stage adapts the model to production-grade IDE environments via Error-Masked SFT and Tree-Structured Trajectory Training. This systematic pipeline is what gives KAT-Coder its real-world code-generation reliability and generalization.

Link: https://arxiv.org/abs/2510.18779
Authors: Zizheng Zhan,Ken Deng,Xiaojiang Zhang,Jinghui Wang,Huaixi Tang,Zhiyi Lai,Haoyang Huang,Wen Xiang,Kun Wu,Wenhao Zhuang,Minglei Zhang,Shaojie Wang,Shangpeng Yan,Kepeng Lei,Zongxian Feng,Huiming Wang,Zheng Lin,Mengtong Li,Mengfei Xie,Yinghan Cui,Xuxing Chen,Chao Wang,Weihao Li,Wenqiang Zhu,Jiarong Zhang,Jingxuan Xu,Songwei Yu,Yifan Yao,Xinping Lei,Han Li,Junqi Xiong,Zuchen Gao,Dailin Li,Haimo Li,Jiaheng Liu,Yuqun Zhang,Junyi Peng,Haotian Zhang,Bin Chen
Affiliations: Kwaipilot Team
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on this https URL.

[NLP-11] AI use in American newspapers is widespread uneven and rarely disclosed

[Quick Read]: This paper addresses the unclear extent and distribution of generative AI use in published news. Auditing 186K online articles from 1.5K American newspapers published in the summer of 2025, along with 45K opinion pieces, the study finds that roughly 9% of newly published articles are partially or fully AI-generated, with use concentrated in smaller local outlets, specific topics (such as weather and technology), and certain ownership groups. The key is the use of Pangram, a state-of-the-art AI detector, for large-scale automated identification, validated by manual audit; the results reveal for the first time the real scale of AI use in journalism and its near-total lack of disclosure, underscoring the urgent need for stricter editorial standards and disclosure practices to maintain public trust.

Link: https://arxiv.org/abs/2510.18774
Authors: Jenna Russell,Marzena Karpinska,Destiny Akinode,Katherine Thai,Bradley Emi,Max Spero,Mohit Iyyer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.

[NLP-12] Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting ICLR2024

[Quick Read]: This paper addresses the lack of spatial structure in the representation spaces of most machine learning models, which manifest as disorganized vector spaces that are hard to visualize and interpret, in contrast to biological brains, whose neurons are arranged topographically by response properties at multiple scales. The key is a novel form of self-attention that turns Transformers into "Topoformers" via two mechanisms: spatial querying, where queries and keys are arranged on 2D grids and local pools of queries are associated with a given key, inducing topographic organization in queries and keys; and spatial reweighting, which replaces the fully connected layer of standard self-attention with a locally connected one, encouraging topographic organization in values and attention outputs. Experiments show the approach preserves performance while producing interpretable topographic organization, and reveal alignment between the model's low-dimensional topographic variability and the human brain's language network.

Link: https://arxiv.org/abs/2510.18745
Authors: Taha Binhuraib,Greta Tuckute,Nicholas Blauch
Affiliations: Novus Technologies; MIT; Harvard University
Categories: Computation and Language (cs.CL)
Notes: ICLR 2024 Workshop on Representational Alignment (Re-Align) Camera Ready

Abstract:Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial biases, instead manifesting as disorganized vector spaces that are difficult to visualize and interpret. Here, we propose a novel form of self-attention that turns Transformers into “Topoformers” with topographic organization. We introduce spatial querying - where keys and queries are arranged on 2D grids, and local pools of queries are associated with a given key - and spatial reweighting, where we convert the standard fully connected layer of self-attention into a locally connected layer. We first demonstrate the feasibility of our approach by training a 1-layer Topoformer on a sentiment classification task. Training with spatial querying encourages topographic organization in the queries and keys, and spatial reweighting separately encourages topographic organization in the values and self-attention outputs. We then apply the Topoformer motifs at scale, training a BERT architecture with a masked language modeling objective. We find that the topographic variant performs on par with a non-topographic control model on NLP benchmarks, yet produces interpretable topographic organization as evaluated via eight linguistic test suites. Finally, analyzing an fMRI dataset of human brain responses to a large set of naturalistic sentences, we demonstrate alignment between low-dimensional topographic variability in the Topoformer model and human brain language network. Scaling up Topoformers further holds promise for greater interpretability in NLP research, and for more accurate models of the organization of linguistic information in the human brain.
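
One way to picture the spatial-querying idea is a 2D-grid locality mask, where each position attends only to grid neighbors within a radius. This is a simplification of the pooling scheme described above, not the paper's exact construction:

```python
import torch

def grid_locality_mask(side, radius):
    coords = torch.stack(torch.meshgrid(
        torch.arange(side), torch.arange(side), indexing="ij"), dim=-1).reshape(-1, 2).float()
    dist = torch.cdist(coords, coords)   # pairwise distances between grid positions
    return dist <= radius                # True where attention is allowed

side = 8                                 # an 8x8 grid of query/key positions
mask = grid_locality_mask(side, radius=2.0)
q = torch.randn(side * side, 32)
k = torch.randn(side * side, 32)
scores = q @ k.T / 32**0.5
scores = scores.masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1)            # a topographically local attention pattern
print(attn.shape, f"{mask.float().mean().item():.2%} of pairs attend")
```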

[NLP-13] Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation

[Quick Read]: This paper addresses Lost-in-Conversation (LiC) in large language models (LLMs): performance degrades as information is revealed progressively across the turns of a multi-turn dialogue. The key is RLAAR (Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards), whose core mechanisms are: (1) a competence-gated curriculum that incrementally increases dialogue difficulty (in instruction shards) to stabilize training; and (2) a mixed-reward system that encourages the model not only to produce correct answers but also to judge whether a question is yet solvable and to abstain when it is not, reducing premature answering and mitigating LiC. Experiments on LiC benchmarks show that RLAAR substantially reduces LiC performance decay (from 62.6% to 75.1%) and improves calibrated abstention rates (from 33.5% to 73.4%).

Link: https://arxiv.org/abs/2510.18731
Authors: Ming Li
Affiliations: University of Maryland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
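
A toy version of a mixed accuracy/abstention reward of the kind described above; the exact reward shaping in RLAAR is an assumption here, not taken from the paper:

```python
def mixed_reward(answer, gold, solvable_now, abstained):
    if abstained:
        # Reward informed abstention only while the question is still under-specified.
        return 0.5 if not solvable_now else -0.5
    if not solvable_now:
        return -1.0            # premature answering: the failure mode behind LiC
    return 1.0 if answer == gold else -0.25

# Turn 1: a key instruction shard has not been revealed -> abstaining is correct.
print(mixed_reward(answer=None, gold="42", solvable_now=False, abstained=True))   # 0.5
# Final turn: all shards revealed -> a correct answer earns the full reward.
print(mixed_reward(answer="42", gold="42", solvable_now=True, abstained=False))   # 1.0
```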

[NLP-14] SemiAdapt and SemiLoRA: Efficient Domain Adaptation for Transformer-based Low-Resource Language Translation with a Case Study on Irish

[Quick Read]: This paper addresses the prohibitive computational cost of domain-adapting large multilingual models for low-resource settings such as Irish translation. Conventional fine-tuning updates all parameters, creating a barrier to entry for researchers working on low-resource languages. The key is parameter-efficient fine-tuning (PEFT), particularly Low-Rank Adaptation (LoRA), which achieves high-quality domain transfer without a significant computational burden. Building on this, the authors propose SemiAdapt and SemiLoRA, two semi-supervised inference-efficient fine-tuning methods; SemiLoRA in particular can lift PEFT performance to match or even exceed full-model fine-tuning, substantially lowering the barrier to high-quality domain adaptation, especially on larger and noisier corpora.

Link: https://arxiv.org/abs/2510.18725
Authors: Josh McGiff,Nikola S. Nikolov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 8 pages

Abstract:Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
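
For readers new to the LoRA baseline discussed above, here is a generic setup with the `peft` library; SemiAdapt and SemiLoRA themselves are not implemented here, and the target module names assume an NLLB-style seq2seq architecture:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                    # rank of the low-rank adapter matrices
    lora_alpha=16,          # scaling factor applied to the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of the full model trains
```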

[NLP-15] Adapting Language Balance in Code-Switching Speech ICASSP2026

[Quick Read]: This paper addresses the poor performance of large foundation models on code-switching test cases, where data scarcity alone cannot explain the degradation: the cause may lie in the rarity of code-switching points and the subtle appearance of the embedded language. The key is a differentiable surrogate that uses the difference between the main and embedded languages to explicitly mark code-switching points, reinforcing learning at those locations, mitigating context bias during generation (the central challenge in code-switching), and improving model robustness. Experiments on Arabic and Chinese-English show markedly fewer substitution errors and more accurate prediction of switching positions.

Link: https://arxiv.org/abs/2510.18724
Authors: Enes Yavuz Ugan,Ngoc-Quan Pham,Alexander Waibel
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Submitted to ICASSP 2026

Abstract:Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation – the central challenge in code-switching – thereby improving the model’s robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.

[NLP-16] Bayesian Low-Rank Factorization for Robust Model Adaptation ICASSP2026

[Quick Read]: This paper addresses overfitting and catastrophic forgetting when adapting large speech foundation models to specific domains such as code-switching. Direct fine-tuning erodes the model's general capabilities, and existing methods struggle to balance generalization with efficient adaptation. The key is Bayesian factorized adapters: priors placed near zero constrain the adaptation matrices, yielding sparser updates that minimize adaptation loss on the new domain while greatly reducing damage to the base model. Compared with LoRA, the method achieves a 54% backward-transfer gain at the cost of only a 4% drop on the new domain.

Link: https://arxiv.org/abs/2510.18723
Authors: Enes Yavuz Ugan,Ngoc-Quan Pham,Alexander Waibel
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Submitted to ICASSP 2026

Abstract:Large speech foundation models achieve strong performance across many domains, but they often require adaptation to handle local needs such as code-switching, where speakers mix languages within the same utterance. Direct fine-tuning of these models risks overfitting to the target domain and overwriting the broad capabilities of the base model. To address this challenge, we explore Bayesian factorized adapters for speech foundation models, which place priors near zero to achieve sparser adaptation matrices and thereby retain general performance while adapting to specific domains. We apply our approach to the Whisper model and evaluate on different multilingual code-switching scenarios. Our results show only minimal adaptation loss while significantly reducing catastrophic forgetting of the base model. Compared to LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new domain. These findings highlight the effectiveness of Bayesian adaptation for fine-tuning speech foundation models without sacrificing generalization.
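
A sketch of a factorized (low-rank) adapter whose weights are pulled toward a zero-centered Gaussian prior, in the spirit of the abstract; the paper's actual variational formulation is not reproduced here:

```python
import torch
import torch.nn as nn

class FactorizedAdapter(nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, dim))

    def forward(self, x):
        return x + x @ self.A @ self.B   # residual low-rank update

    def prior_penalty(self, tau=1e-2):
        # Negative log of a zero-mean Gaussian prior: drives the adapter toward
        # sparse updates, which limits drift away from the frozen base model.
        return (self.A.pow(2).sum() + self.B.pow(2).sum()) / (2 * tau**2)

adapter = FactorizedAdapter(dim=512)
x = torch.randn(2, 10, 512)
task_loss = adapter(x).pow(2).mean()           # stand-in for the real ASR loss
loss = task_loss + 1e-4 * adapter.prior_penalty()
loss.backward()
```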

[NLP-17] Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering

[Quick Read]: This paper addresses the shortfall in long-context (LC) comprehension of large language models (LLMs) for medical question answering (QA), particularly in clinically relevant settings. The study systematically evaluates LLM comprehension across model sizes, dataset settings, and task formulations, revealing model-size effects, limitations tied to memorization, and the benefits of reasoning models. A key element is the examination of Retrieval-Augmented Generation (RAG): the paper contrasts its effectiveness in single- versus multi-document reasoning datasets and identifies best-practice configurations, significantly improving accuracy and robustness in long-context medical QA.

Link: https://arxiv.org/abs/2510.18691
Authors: Feras AlMannaa,Talia Tseriotou,Jenny Chim,Maria Liakata
Affiliations: Istanbul Aydın University; Queen Mary University of London; The Alan Turing Institute
Categories: Computation and Language (cs.CL)
Notes:

Abstract:This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans a range of content-inclusion settings based on their relevance, LLM models of varying capabilities and datasets across task formulations, revealing insights on model size effects, limitations, underlying memorization issues and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC. We shed light into some of the evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.

[NLP-18] MLMA: Towards Multilingual with Mamba Based Architectures ICASSP2026

[Quick Read]: This paper addresses the difficulty of balancing performance across high- and low-resource languages in multilingual automatic speech recognition (ASR), along with the efficiency and scalability limits of conventional Transformer architectures on long sequences. The key is the Mamba architecture, a state-space model (SSM) optimized for long-context sequence processing, which enables implicit language-aware conditioning and shared representation learning for robust recognition across diverse languages. Experiments show that MLMA achieves performance competitive with, and sometimes better than, Transformer-based approaches on standard multilingual benchmarks, establishing Mamba's potential as an efficient, accurate, and scalable backbone for multilingual speech recognition.

Link: https://arxiv.org/abs/2510.18684
Authors: Mohamed Nabih Ali,Daniele Falavigna,Alessio Brutti
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Notes: The paper is under review at ICASSP 2026

Abstract:Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture–an efficient state-space model optimized for long-context sequence processing–for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba’s potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.

[NLP-19] Dynamical model parameters from ultrasound tongue kinematics

[Quick Read]: This paper asks whether the parameters of articulatory dynamical models can be reliably estimated from ultrasound tongue kinematics, compared against parameter estimates from conventional electromagnetic articulography (EMA) data. The key finding is that ultrasound imaging yields dynamical parameters comparable to EMA: ultrasound tongue kinematics accurately estimates the parameters of a linear harmonic oscillator model, and mandibular short-tendon tracking also adequately captures jaw motion, supporting ultrasound tongue kinematics as a viable alternative for evaluating dynamical articulatory models.

Link: https://arxiv.org/abs/2510.18629
Authors: Sam Kirkham,Patrycja Strycharczuk
Affiliations: Phonetics Laboratory, Lancaster University; Linguistics and English Language, University of Manchester
Categories: Computation and Language (cs.CL)
Notes: Accepted for publication in JASA Express Letters

Abstract:The control of speech can be modelled as a dynamical system in which articulators are driven toward target positions. These models are typically evaluated using fleshpoint data, such as electromagnetic articulography (EMA), but recent methodological advances make ultrasound imaging a promising alternative. We evaluate whether the parameters of a linear harmonic oscillator can be reliably estimated from ultrasound tongue kinematics and compare these with parameters estimated from simultaneously-recorded EMA data. We find that ultrasound and EMA yield comparable dynamical parameters, while mandibular short tendon tracking also adequately captures jaw motion. This supports using ultrasound kinematics to evaluate dynamical articulatory models.
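
To show what fitting a damped linear oscillator to kinematic data can look like, here is a least-squares sketch for a(t) = -k (x(t) - T) - b v(t) with mass-normalized parameters; the paper's actual estimation procedure may differ:

```python
import numpy as np
from scipy.integrate import odeint

rng = np.random.default_rng(0)
t = np.linspace(0.0, 0.5, 200)
k_true, b_true, T_true = 900.0, 30.0, 1.2      # stiffness, damping, articulatory target

def rhs(state, _t):
    x, v = state
    return [v, -k_true * (x - T_true) - b_true * v]

x, v = odeint(rhs, [0.0, 0.0], t).T            # simulated movement toward the target
a = np.gradient(v, t) + rng.normal(0.0, 0.5, t.size)   # noisy "measured" acceleration

# a = -k*x - b*v + k*T  ->  ordinary least squares on the regressors [x, v, 1]
X = np.column_stack([x, v, np.ones_like(x)])
(cx, cv, c0), *_ = np.linalg.lstsq(X, a, rcond=None)
k_hat, b_hat = -cx, -cv
T_hat = c0 / k_hat
print(f"k={k_hat:.0f} (true {k_true:.0f}), b={b_hat:.1f} (true {b_true:.1f}), T={T_hat:.2f}")
```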

[NLP-20] Beyond the Explicit: A Bilingual Dataset for Dehumanization Detection in Social Media

[Quick Read]: This paper addresses the neglect of digital dehumanization in computational linguistics and natural language processing (NLP): existing methods focus only on overtly negative statements while overlooking subtler but equally harmful forms of dehumanization. Although not explicitly offensive, these subtler forms still reinforce bias and stereotypes against marginalized groups and are hard for conventional detection to catch. The key is a theory-informed bilingual dataset collected from Twitter and Reddit with diverse sampling strategies and annotated at the document and span level by crowdworkers and experts, totaling 16,000 instances that systematically cover the dimensions of dehumanization. The dataset serves both as training material for machine learning models and as a benchmark for future dehumanization-detection techniques; models fine-tuned on it surpass state-of-the-art models in zero- and few-shot in-context settings.

Link: https://arxiv.org/abs/2510.18582
Authors: Dennis Assenmacher,Paloma Piot,Katarina Laken,David Jurgens,Claudia Wagner
Affiliations: GESIS - Leibniz Institute for the Social Sciences; IRLab, CITIC Research Centre, Universidade da Coruña; Fondazione Bruno Kessler & Universidade de Santiago de Compostela; University of Michigan; RWTH Aachen University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Digital dehumanization, although a critical issue, remains largely overlooked within the field of computational linguistics and Natural Language Processing. The prevailing approach in current research concentrates primarily on a single aspect of dehumanization, identifying overtly negative statements as its core marker. This focus, while crucial for understanding harmful online communications, inadequately addresses the broader spectrum of dehumanization. Specifically, it overlooks the subtler forms of dehumanization that, despite not being overtly offensive, still perpetuate harmful biases against marginalized groups in online interactions. These subtler forms can insidiously reinforce negative stereotypes and biases without explicit offensiveness, making them harder to detect yet equally damaging. Recognizing this gap, we use different sampling methods to collect a theory-informed bilingual dataset from Twitter and Reddit. Using crowdworkers and experts to annotate 16,000 instances on a document- and span-level, we show that our dataset covers the different dimensions of dehumanization. This dataset serves as both a training resource for machine learning models and a benchmark for evaluating future dehumanization detection techniques. To demonstrate its effectiveness, we fine-tune ML models on this dataset, achieving performance that surpasses state-of-the-art models in zero and few-shot in-context settings.

[NLP-21] Large language models for folktale type automation based on motifs: Cinderella case study

【速读】: 该论文旨在解决传统folkloristics(民俗学)研究中难以对海量童话文本进行系统性分析的问题,特别是如何高效识别和比较不同文化背景下 Cinderella 变体中的叙事母题(motif)。其解决方案的关键在于构建了一套基于机器学习(machine learning)与自然语言处理(natural language processing, NLP)的大规模自动化分析方法,利用大语言模型(large language models)检测母题间的复杂交互关系,并通过聚类(clustering)与降维(dimensionality reduction)技术揭示其相似性与差异性,从而实现跨语言的计算分析与比较。

链接: https://arxiv.org/abs/2510.18561
作者: Tjaša Arčon,Marko Robnik-Šikonja,Polona Tratnik
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Artificial intelligence approaches are being adapted to many research areas, including digital humanities. We built a methodology for large-scale analyses in folkloristics. Using machine learning and natural language processing, we automatically detected motifs in a large collection of Cinderella variants and analysed their similarities and differences with clustering and dimensionality reduction. The results show that large language models detect complex interactions in tales, enabling computational analysis of extensive text collections and facilitating cross-lingual comparisons.

[NLP-22] Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency EMNLP

[Quick Read]: This paper addresses the risk that clinical language models behave inequitably in practice due to biases in training data, in particular differential opioid-prescription tendencies across demographic groups (ethnicity, gender, and age). The key is HC4 (Healthcare Comprehensive Commons Corpus), a large, carefully curated pretraining corpus exceeding 89 billion tokens, combined with established general benchmarks and a novel healthcare-specific evaluation methodology, providing an interpretable, actionable analytical framework for identifying and mitigating potential model bias and for improving the fairness and safety of clinical AI systems.

Link: https://arxiv.org/abs/2510.18556
Authors: Svetlana Maslenkova,Clement Christophe,Marco AF Pimentel,Tathagata Raha,Muhammad Umar Salman,Ahmed Al Mahrooqi,Avani Gupta,Shadab Khan,Ronnie Rajan,Praveenkumar Kanithi
Affiliations: M42, Abu Dhabi
Categories: Computation and Language (cs.CL)
Notes: Accepted to EMNLP Main 2025

Abstract:Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.

[NLP-23] Identity-Aware Large Language Models require Cultural Reasoning

[Quick Read]: This paper addresses the cultural bias of current large language models when serving globally diverse users: model outputs tend to default to Western norms, lacking the ability to recognize and adapt to the knowledge, values, and social norms of different cultures, which can reinforce stereotypes, marginalize minority perspectives, and erode user trust. The key position is that cultural reasoning should be treated as a foundational capability on par with factual accuracy and linguistic coherence, and that evaluation methods should measure a model's ability to adapt its output in context to individual users' cultural expectations rather than relying on static accuracy metrics alone.

Link: https://arxiv.org/abs/2510.18510
Authors: Alistair Plum,Anne-Marie Lutgen,Christoph Purschke,Achim Rettinger
Affiliations: University of Luxembourg; Trier University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Large language models have become the latest trend in natural language processing, heavily featuring in the digital tools we use every day. However, their replies often reflect a narrow cultural viewpoint that overlooks the diversity of global users. This missing capability could be referred to as cultural reasoning, which we define here as the capacity of a model to recognise culture-specific knowledge, values, and social norms, and to adjust its output so that it aligns with the expectations of individual users. Because culture shapes interpretation, emotional resonance, and acceptable behaviour, cultural reasoning is essential for identity-aware AI. When this capacity is limited or absent, models can sustain stereotypes, ignore minority perspectives, erode trust, and perpetuate hate. Recent empirical studies strongly suggest that current models default to Western norms when judging moral dilemmas, interpreting idioms, or offering advice, and that fine-tuning on survey data only partly reduces this tendency. The present evaluation methods mainly report static accuracy scores and thus fail to capture adaptive reasoning in context. Although broader datasets can help, they cannot alone ensure genuine cultural competence. Therefore, we argue that cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence. By clarifying the concept and outlining initial directions for its assessment, a foundation is laid for future systems to be able to respond with greater sensitivity to the complex fabric of human culture.

[NLP-24] Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation

[Quick Read]: This paper addresses the difficulty vehicle make and model recognition (VMMR) has in adapting to newly released models in intelligent transportation systems. Existing approaches rely on vision-language models with fixed pretrained weights (such as CLIP), whose performance is limited without costly image-specific fine-tuning and which scale poorly. The key is a pipeline combining vision language models (VLMs) with Retrieval-Augmented Generation (RAG): the VLM converts a vehicle image into descriptive attributes, which are compared against a database of textual features; relevant entries are retrieved and assembled with the description into a prompt, and a language model (LM) performs the final recognition. This design requires no large-scale retraining: adding textual descriptions of new vehicles suffices for rapid updates, and zero-shot recognition improves by nearly 20% over the CLIP baseline, demonstrating the potential of scalable VMMR for smart-city applications.

Link: https://arxiv.org/abs/2510.18502
Authors: Wei-Chia Chang,Yan-Ann Chen
Affiliations: Yuan Ze University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025

Abstract:Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
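
A minimal text-retrieval core for the pipeline described above. The VLM captioning step is stubbed as a plain string, and the database contents are hypothetical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Textual feature database: one description per known make/model.
db = {
    "Toyota Corolla 2023": "compact sedan, narrow grille, sleek LED headlights",
    "Ford F-150 2022": "full-size pickup, large chrome grille, boxy hood",
    "Tesla Model 3 2024": "grille-less fascia, smooth hood, flush door handles",
}
labels = list(db)
db_emb = encoder.encode(list(db.values()), normalize_embeddings=True)

# In the real pipeline a VLM produces this description from the input image.
query = "sedan with no front grille and flush-mounted door handles"
q_emb = encoder.encode([query], normalize_embeddings=True)

scores = db_emb @ q_emb.T          # cosine similarity (embeddings are normalized)
top = np.argsort(-scores.ravel())[:2]
retrieved = [labels[i] for i in top]
prompt = f"Description: {query}\nCandidates: {retrieved}\nWhich make and model is it?"
print(prompt)                      # this prompt then goes to an LM for the final call
```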

[NLP-25] How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices

[Quick Read]: This paper addresses the fact that current open-source diffusion language models (DLMs) are less efficient in practice than autoregressive (AR) models, with the inference-speed gap limiting their real-world utility. The key is a systematic study that identifies flaws in prior evaluation methods and, by combining empirical benchmarking with roofline-based theoretical analysis, explains why AR models generally achieve higher throughput while DLMs consistently lag. It also shows that current acceleration strategies (such as dual caching and parallel decoding) help mainly at small batch sizes, with diminishing returns as batch size scales, underscoring the need for more robust evaluation practices and better-targeted acceleration techniques to advance DLM research.

Link: https://arxiv.org/abs/2510.18480
Authors: Han Peng,Peiyu Liu,Zican Dong,Daixuan Cheng,Junyi Li,Yiru Tang,Shuo Wang,Wayne Xin Zhao
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; University of International Business and Economics; Tsinghua University; Department of Data Science, City University of Hong Kong
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a roofline-based theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.
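
A back-of-the-envelope roofline check of the kind used in such analyses; the hardware numbers and arithmetic intensities below are illustrative assumptions, not measurements from the paper:

```python
peak_flops = 312e12    # e.g., A100 BF16 peak, FLOP/s
peak_bw = 2.0e12       # HBM bandwidth, bytes/s

def attainable(intensity):
    """Roofline: attainable performance is capped by compute or by memory traffic."""
    return min(peak_flops, peak_bw * intensity)

# AR decoding with a KV cache is memory-bound (few FLOPs per byte of weights moved);
# a DLM denoising step re-reads the weights for the whole sequence each iteration,
# so its effective intensity only wins out at large batch sizes.
for name, intensity in [("AR decode, batch 1", 1.0),
                        ("AR decode, batch 64", 64.0),
                        ("DLM step, small batch", 8.0)]:
    print(f"{name:24s} -> {attainable(intensity) / 1e12:7.1f} TFLOP/s attainable")
```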

[NLP-26] Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents

[Quick Read]: This paper addresses the poor strategic adaptability of large language model (LLM) agents in multi-turn social dialogue, caused by the lack of dynamic modeling of a dialogue partner's latent intentions. The key is a probabilistic intent-modeling framework: the agent maintains a belief distribution over the partner's latent intentions, initialized from contextual priors and dynamically updated after each utterance via likelihood estimation. The evolving distribution gives the dialogue policy additional contextual grounding, enabling adaptive dialogue strategies under uncertainty.

Link: https://arxiv.org/abs/2510.18476
Authors: Feifan Xia,Yuyang Fang,Defang Li,Yantong Xie,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang
Affiliations: Baidu Inc; Imperial College London; Zhejiang University; Carnegie Mellon University; Peking University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:We present a probabilistic intent modeling framework for large language model (LLM) agents in multi-turn social dialogue. The framework maintains a belief distribution over a partner’s latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides additional contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty. Preliminary experiments in the SOTOPIA environment show consistent improvements: the proposed framework increases the Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard compared with the Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions. These early results suggest that probabilistic intent modeling can contribute to the development of socially intelligent LLM agents.
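
A minimal version of the belief update described above: a prior over latent intentions, re-weighted by an utterance likelihood after each turn. The likelihood model here is a stub; the paper estimates it with an LLM:

```python
def update_belief(belief, likelihoods):
    """Bayes rule: posterior(i) proportional to prior(i) * P(utterance | i)."""
    posterior = {i: belief[i] * likelihoods.get(i, 1e-9) for i in belief}
    z = sum(posterior.values())
    return {i: p / z for i, p in posterior.items()}

belief = {"cooperate": 0.5, "compete": 0.3, "deceive": 0.2}   # contextual prior
# Estimated P(utterance | intention) for the partner's latest message:
likelihoods = {"cooperate": 0.7, "compete": 0.2, "deceive": 0.1}
belief = update_belief(belief, likelihoods)
print(belief)  # mass shifts toward "cooperate"; the policy conditions on this belief
```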

[NLP-27] DART: A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP

[Quick Read]: This paper addresses the difficulty of extracting pharmacological knowledge from regulatory documents, particularly the lack of resources for non-English healthcare systems. Existing work relies heavily on English corpora (such as DrugBank), with no structured dataset for documents like the Italian Summaries of Product Characteristics (SmPCs). The key is DART (Drug Annotation from Regulatory Texts), a structured corpus built from the official database of the Italian Medicines Agency (AIFA) through a reproducible pipeline of web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding. The core advantage is turning unstructured regulatory text into structured fields (indications, adverse reactions, and drug-drug interactions), validated by an LLM-based drug-interaction checker grounded in DART.

Link: https://arxiv.org/abs/2510.18475
Authors: Mariano Barone,Antonio Laudante,Giuseppe Riccio,Antonio Romano,Marco Postiglione,Vincenzo Moscato
Affiliations: University of Naples Federico II, Department of Electrical Engineering and Information Technology (DIETI), Via Claudio, 21 - 80125 - Naples, Italy; Consorzio Interuniversitario Nazionale per l'Informatica (CINI) - ITEM National Lab, Complesso Universitario Monte S.Angelo, Naples, Italy; Northwestern University, Department of Computer Science, McCormick School of Engineering and Applied Science, 2233 Tech Dr, Evanston, IL 60208, United States
Categories: Computation and Language (cs.CL)
Notes:

Abstract:The extraction of pharmacological knowledge from regulatory documents has become a key focus in biomedical natural language processing, with applications ranging from adverse event monitoring to AI-assisted clinical decision support. However, research in this field has predominantly relied on English-language corpora such as DrugBank, leaving a significant gap in resources tailored to other healthcare systems. To address this limitation, we introduce DART (Drug Annotation from Regulatory Texts), the first structured corpus of Italian Summaries of Product Characteristics derived from the official repository of the Italian Medicines Agency (AIFA). The dataset was built through a reproducible pipeline encompassing web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug-drug interactions. To validate its utility, we implemented an LLM-based drug interaction checker that leverages the dataset to infer clinically meaningful interactions. Experimental results show that instruction-tuned LLMs can accurately infer potential interactions and their clinical implications when grounded in the structured textual fields of DART. We publicly release our code on GitHub: this https URL.

[NLP-28] CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

【Quick Read】: This paper tackles the semantic gap that arises when large language models (LLMs) learn code generation purely from textual patterns: generated code may be syntactically plausible yet functionally incorrect, because the model under-represents execution semantics. To address this, the authors propose CodeRL+, whose core innovation is to integrate execution-semantics alignment into the Reinforcement Learning with Verifiable Rewards (RLVR) training pipeline. By constructing variable-level execution trajectories as a direct learning signal, it strengthens the model's grasp of the mapping between a program's textual representation and its runtime behavior. The alignment requires no extra annotation, reusing existing on-policy rollouts, and is compatible with a variety of RL algorithms and LLM architectures; experiments show it clearly outperforms baselines on multiple benchmarks.
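
To make "variable-level execution trajectory" concrete, here is a minimal Python sketch of collecting one with `sys.settrace`; the paper does not publish its tracer, so the trace format and the `candidate` program below are illustrative, not CodeRL+'s actual implementation.

```python
import sys

def trace_variables(fn, *args):
    """Record (line_no, {var: value}) snapshots while fn runs.

    A minimal illustration of a variable-level execution trajectory;
    the exact trace format used by CodeRL+ may differ.
    """
    trajectory = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            # Snapshot local variables at each executed line.
            trajectory.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, trajectory

def candidate(n):          # stand-in for a model-generated program
    total = 0
    for i in range(n):
        total += i
    return total

_, trace = trace_variables(candidate, 4)
for line_no, local_vars in trace:
    print(line_no, local_vars)   # dense, step-by-step execution signal
```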

Link: https://arxiv.org/abs/2510.18471
Authors: Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, Ge Li
Institutions: Peking University (北京大学); Tongyi Lab, Alibaba Group (阿里巴巴集团通义实验室)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code’s textual representations and its underlying execution semantics.

[NLP-29] IMB: An Italian Medical Benchmark for Question Answering

【Quick Read】: This paper addresses the difficulty of using patient-doctor conversations from non-English (here, Italian) online medical forums in automated question-answering systems: such conversations are informal and linguistically complex, so existing large language models (LLMs) struggle to understand and answer them accurately. The key to the solution is the construction of two high-quality Italian medical QA benchmarks: IMB-QA (782,644 patient-doctor conversations across 77 medical categories) and IMB-MCQA (25,862 multiple-choice questions from specialty examinations), together with Retrieval-Augmented Generation (RAG) and domain-specific fine-tuning to improve accuracy and consistency. Experiments show that specialized medical adaptation outperforms simply relying on larger general-purpose models, underscoring the central role of domain knowledge and efficient retrieval in medical AI systems.

Link: https://arxiv.org/abs/2510.18468
Authors: Antonio Romano, Giuseppe Riccio, Mariano Barone, Marco Postiglione, Vincenzo Moscato
Institutions: University of Naples Federico II, Department of Electrical Engineering and Information Technology (DIETI), Via Claudio, 21 - 80125 - Naples, Italy; Consorzio Interuniversitario Nazionale per l’Informatica (CINI) - ITEM National Lab, Complesso Universitario Monte S.Angelo, Naples, Italy; Northwestern University, Department of Computer Science, McCormick School of Engineering and Applied Science, 2233 Tech Dr, Evanston, IL 60208, United States
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: IMB-QA, containing 782,644 patient-doctor conversations from 77 medical categories, and IMB-MCQA, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining their original meaning and conversational style, and compare a variety of LLM architectures on both open and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: this https URL.

[NLP-30] CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

【Quick Read】: Although WordNet is a valuable lexical resource, its fine-grained sense distinctions are hard for second-language learners to grasp, limiting its usefulness in language education. The key idea is to integrate WordNet's semantic network with the proficiency levels of the Common European Framework of Reference for Languages (CEFR): a large language model automatically measures semantic similarity between WordNet sense definitions and entries in the English Vocabulary Profile Online, enabling automated annotation. From this annotated WordNet the authors build a large corpus carrying both sense and CEFR-level information and use it to train contextual lexical classifiers. Models fine-tuned on the corpus perform comparably to those trained on gold-standard annotations, and combining the corpus with gold data yields a classifier with a Macro-F1 of 0.81, confirming the accuracy of the annotations and tightening the bridge between NLP and language education.
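
As a rough illustration of the similarity-based annotation step, the sketch below matches WordNet sense definitions against hypothetical English Vocabulary Profile entries using sentence embeddings. The paper scores similarity with a large language model; the embedding model name and the EVP entries here are stand-ins, and the snippet assumes `sentence-transformers` plus the NLTK WordNet data are installed.

```python
from sentence_transformers import SentenceTransformer, util
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical English Vocabulary Profile entries: (definition, CEFR level)
evp_entries = [
    ("a financial institution that accepts deposits", "A2"),
    ("the land alongside a river", "B1"),
]
evp_vecs = model.encode([d for d, _ in evp_entries], convert_to_tensor=True)

for synset in wn.synsets("bank"):
    vec = model.encode(synset.definition(), convert_to_tensor=True)
    scores = util.cos_sim(vec, evp_vecs)[0]
    best = int(scores.argmax())
    # Tag the WordNet sense with the CEFR level of its closest EVP entry.
    print(synset.name(), "->", evp_entries[best][1], f"(sim={scores[best]:.2f})")
```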

Link: https://arxiv.org/abs/2510.18466
Authors: Masato Kikuchi, Masatsugu Ono, Toshioki Soga, Tetsu Tanabe, Tadachika Ozono
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Although WordNet is a valuable resource owing to its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this, we developed a WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our method, we constructed a large-scale corpus containing both sense and CEFR-level information from our annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on our corpus perform comparably to those trained on gold-standard annotations. Furthermore, by combining our corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81, indicating the high accuracy of our annotations. Our annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.

[NLP-31] DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

【速读】: 该论文旨在解决Transformer模型行为的机制可解释性问题,即如何准确归因模型内部计算对输出结果的影响。其核心挑战在于现有方法难以在不依赖额外训练的情况下实现细粒度、忠实的信息流动追踪。解决方案的关键在于提出DePass框架,该框架基于单次分解前向传播(decomposed forward pass),将隐藏状态分解为自定义的可加成分,并在注意力权重和MLP激活值固定的前提下进行传播,从而实现无需辅助训练的高保真度特征归因。
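
The additive-decomposition idea can be seen in a single attention layer: with the attention pattern held fixed, attention is linear in its input, so components propagated separately still sum exactly to the full output. The sketch below verifies this; shapes and names are illustrative, not DePass's actual interface.

```python
import torch

def decomposed_attention(components, A, W_v, W_o):
    """components: (k, seq, d) additive parts summing to the hidden states
    A: (seq, seq) attention weights, treated as fixed constants
    W_v, W_o: (d, d) value/output projections."""
    out = torch.einsum("st,ktd->ksd", A, components @ W_v) @ W_o
    return out  # (k, seq, d); out.sum(0) equals attention over the summed input

k, seq, d = 3, 5, 8
parts = torch.randn(k, seq, d)                    # e.g., per-token contributions
A = torch.softmax(torch.randn(seq, seq), dim=-1)  # frozen attention scores
W_v, W_o = torch.randn(d, d), torch.randn(d, d)

full = decomposed_attention(parts.sum(0, keepdim=True), A, W_v, W_o)
split = decomposed_attention(parts, A, W_v, W_o)
print(torch.allclose(full[0], split.sum(0), atol=1e-5))  # True: attribution is exact
```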

Link: https://arxiv.org/abs/2510.18462
Authors: Xiangyu Hong, Che Jiang, Kai Tian, Biqing Qi, Youbang Sun, Ning Ding, Bowen Zhou
Institutions: Tsinghua University (清华大学); Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP’s activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.

[NLP-32] ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks

【速读】: 该论文旨在解决在线游戏领域中检索增强生成(Retrieval Augmented Generation, RAG)系统缺乏标准化动态评估基准的问题,核心挑战在于“双动态”特性:即游戏内容的持续更新与玩家社区关注焦点的不断变化之间的相互作用。为应对这一问题,作者提出ChronoPlay框架,其关键创新在于引入双动态更新机制以同步追踪两类变化,并设计双源合成引擎,融合官方数据与玩家社区语料,从而在保证事实准确性的同时确保生成问题的玩家中心真实性,实现了对RAG模型在复杂、真实场景下的自动化连续评估。

Link: https://arxiv.org/abs/2510.18455
Authors: Liyang He, Yuren Zhang, Ziwei Zhu, Zhenghui Li, Shiwei Tong
Institutions: Tencent (腾讯); The Chinese University of Hong Kong (香港中文大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions. Code is available at: this https URL.

[NLP-33] Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成幽默内容时可能放大有害内容(如刻板印象和毒性言论)的问题,即评估幽默优化机制如何与有害输出耦合。其解决方案的关键在于通过联合测量幽默性、刻板性和毒性,并结合信息论指标分析不一致信号(incongruity signals),揭示模型在幽默生成过程中对有害内容的偏好增强现象;研究发现,有害内容不仅获得更高的幽默评分,且在角色提示下进一步加剧这一偏差,同时信息论分析表明有害线索会扩大预测不确定性,甚至使某些模型更“预期”有害笑点,说明有害内容已嵌入到模型学习到的幽默分布中。

Link: https://arxiv.org/abs/2510.18454
Authors: Atharvan Dogra, Soumya Suvra Ghosal, Ameet Deshpande, Ashwin Kalyan, Dinesh Manocha
Institutions: Centre for Responsible AI, IIT Madras (印度理工学院马德拉斯分校); University of Maryland, College Park (马里兰大学学院公园分校); Princeton University (普林斯顿大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models are increasingly used for creative writing and engagement content, raising safety concerns about the outputs. Therefore, casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores which further increase under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show harmful cues widen predictive uncertainty and surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human perceived funniness judgments shows that LLM satire increases stereotypicality and typically toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain 10-21% in mean humor score, stereotypical jokes appear 11% to 28% more often among the jokes marked funny by LLM-based metric and up to 10% more often in generations perceived as funny by humans.

[NLP-34] Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

【速读】: 该论文旨在解决视觉-语言模型中因依赖语言先验而非视觉输入而导致的幻觉(hallucination)问题,尤其聚焦于手语翻译(Sign Language Translation, SLT)场景,其中意义高度依赖视频内容的精确锚定。解决方案的关键在于提出一种基于token级别的可靠性度量方法,该方法通过结合特征敏感性(feature-based sensitivity,即视频掩码时内部表示的变化)与反事实信号(counterfactual signals,即干净视频与扰动视频输入下的概率差异),聚合为句级可靠性分数,从而量化模型在生成过程中对视觉信息的依赖程度。该指标不仅能有效预测幻觉发生率、跨数据集和架构泛化,还可区分有依据的词与猜测词,辅助无参考条件下的风险估计,并提升基于文本信号(如置信度、困惑度或熵)的幻觉检测性能,为SLT中的幻觉诊断提供了可复用且实用的工具。
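
A minimal sketch of how the two signals could combine into a token-level score, assuming the decoder states and token log-probs under clean, masked, and perturbed videos have already been computed; the equal weighting and min-max normalization are illustrative, not the paper's exact formulation.

```python
import torch

def token_reliability(h_clean, h_masked, lp_clean, lp_perturbed, alpha=0.5):
    """h_*: (T, d) decoder hidden states; lp_*: (T,) token log-probs."""
    sensitivity = (h_clean - h_masked).norm(dim=-1)   # feature-based signal
    counterfactual = lp_clean - lp_perturbed          # probability drop

    def norm(x):  # put both signals on a comparable [0, 1] scale
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    per_token = alpha * norm(sensitivity) + (1 - alpha) * norm(counterfactual)
    return per_token, per_token.mean()  # sentence score by simple aggregation

T, d = 6, 16  # stand-in values; real inputs come from the SLT decoder
per_tok, sent = token_reliability(torch.randn(T, d), torch.randn(T, d),
                                  torch.randn(T), torch.randn(T))
print(per_tok.shape, float(sent))
```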

Link: https://arxiv.org/abs/2510.18439
Authors: Yasser Hamidullah, Koel Dutta Chowdury, Yusser Al-Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet
Institutions: German Research Center for Artificial Intelligence (DFKI GmbH) (德国人工智能研究中心); Saarland Informatics Campus (萨尔兰信息学园区); Barcelona Supercomputing Center (BSC-CNS) (巴塞罗那超级计算中心)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.

[NLP-35] Chain-of-Conceptual-Thought: Eliciting the Agent to Deeply Think within the Response

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)在开放域任务中性能受限的问题,因其缺乏明确的推理步骤或逻辑过渡。解决方案的关键在于提出一种新的基于提示的范式——概念链思维(Chain of Conceptual Thought, CoCT),即大语言模型(LLM)首先识别并标注一个概念(如情绪、策略或话题),再生成详细内容;同时允许在单次输出中包含多个概念链,从而激发模型更深层次和策略性的思考。实验表明,CoCT在日常对话与情感支持场景中优于Self-Refine、ECoT、ToT、SoT及RAG等基线方法。
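
A minimal sketch of what a CoCT-style tagged response and its parsing might look like; the bracketed tag format and the prompt wording are illustrative, not the paper's exact templates.

```python
import re

COCT_PROMPT = """Before each part of your reply, first tag a concept in
square brackets -- an emotion, a strategy, or a topic -- then write the
detailed content for that concept. Several concept-content pairs may be
chained within one reply.

User: I failed my driving test again and I feel hopeless.
Assistant:"""

# A plausible tagged completion (what the prompt is meant to elicit):
completion = ("[empathy] I'm really sorry -- failing twice is discouraging. "
              "[reframing] Many people pass on a later attempt; each try "
              "teaches you the examiner's expectations. "
              "[suggestion] Book a lesson focused on the maneuvers that "
              "cost you points.")

# Recover the concept chain for analysis or evaluation.
pairs = re.findall(r"\[([^\]]+)\]\s*([^\[]+)", completion)
for concept, content in pairs:
    print(f"{concept:10s} -> {content.strip()}")
```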

Link: https://arxiv.org/abs/2510.18434
Authors: Qingqing Gu, Dan Wang, Yue Zhao, Xiaoyu Wang, Zhonglin Jiang, Yong Chen, Hongyan Li, Luo Ji
Institutions: Geely AI Lab (吉利AI实验室); Beijing Institute of Technology (北京理工大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) is widely applied to improve the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks since there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose another prompt-based paradigm called Chain of Conceptual Thought (CoCT), where the LLM first tags a concept, then generates the detailed content. The chain of concepts is allowed within the utterance, encouraging the LLM’s deep and strategic thinking. We experiment with this paradigm in daily and emotional support conversations where the concept is comprised of emotions, strategies and topics. Automatic, human and model evaluations suggest that CoCT surpasses baselines such as Self-Refine, ECoT, ToT, SoT and RAG, suggesting a potential effective prompt-based paradigm of LLM for a wider scope of tasks.

[NLP-36] Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在长上下文推理中因自注意力机制(self-attention)计算复杂度呈二次增长而导致的严重延迟问题。现有稀疏注意力方法虽能降低计算成本,但依赖启发式模式,难以有效召回关键键值对(key-value pairs),从而导致精度下降。其解决方案的核心是提出一种轻量且高精度的稀疏注意力机制——Adamas,该机制通过哈达玛变换(Hadamard transform)、桶化(bucketization)与2比特压缩生成紧凑表示,并利用曼哈顿距离估计实现高效的top-k选择,在极低的token预算下(如64 token)即可达到全注意力的准确性,同时支持高达8倍于现有最优方法的稀疏度,并在32K长度序列上实现最高4.4倍的自注意力加速和1.5倍的端到端速度提升,且在困惑度(perplexity)指标上表现优于或等同于全注意力,证明了其在极端稀疏场景下的有效性。
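
The selection step can be sketched as follows: rotate keys and queries with a Hadamard transform, bucketize each coordinate into a 4-level (2-bit) code, and rank keys by Manhattan distance in that compact space. Bucket boundaries, sizes, and the quantile scheme below are illustrative, not the paper's exact configuration.

```python
import torch

def hadamard(n):                      # Sylvester construction, n a power of 2
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / n ** 0.5

def compress(x, H):
    # Rotate, then bucketize each coordinate into {0..3}: a 2-bit code.
    rotated = x @ H
    edges = torch.quantile(rotated, torch.tensor([0.25, 0.5, 0.75]))
    return torch.bucketize(rotated, edges)

d, n_keys, budget = 64, 1024, 64
H = hadamard(d)
keys, query = torch.randn(n_keys, d), torch.randn(d)
kc, qc = compress(keys, H), compress(query, H)

manhattan = (kc - qc).abs().sum(-1).float()            # cheap distance estimate
topk = manhattan.topk(budget, largest=False).indices   # KV pairs kept for attention
print(topk[:8])
```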

Link: https://arxiv.org/abs/2510.18413
Authors: Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu
Institutions: Nanjing University (南京大学); rednote hilab
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.

[NLP-37] MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models

【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在工具使用能力迁移中的两个核心问题:一是监督微调(Supervised Fine-Tuning, SFT)因依赖静态教师轨迹而导致泛化能力差;二是标准强化学习(Reinforcement Learning, RL)因稀疏奖励难以有效引导SLMs进行高效探索并采用次优策略。解决方案的关键在于提出MENTOR框架,其创新性地将强化学习与教师引导的蒸馏相结合:一方面通过RL机制学习更具泛化性的策略,另一方面利用教师参考轨迹构建密集且复合的教师引导奖励信号,从而提供细粒度指导,显著提升SLMs在跨域场景下的策略能力和通用性。

Link: https://arxiv.org/abs/2510.18383
Authors: ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher’s reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.

[NLP-38] Towards Fair ASR For Second Language Speakers Using Fairness Prompted Finetuning

【速读】: 该论文旨在解决英语自动语音识别(ASR)系统对第二语言(L2)说话者存在显著公平性差距的问题,即不同口音群体在词错误率(WER)上表现差异巨大。解决方案的关键在于提出一种基于轻量级适配器的公平性提示微调方法,融合传统经验风险最小化(ERM)与多种公平驱动目标——频谱解耦(Spectral Decoupling, SD)、组分布鲁棒优化(Group Distributionally Robust Optimization, Group-DRO)和不变风险最小化(Invariant Risk Minimization, IRM),从而在不牺牲整体识别准确性的前提下,显著降低不同口音群体间的WER波动,实现更公平的ASR性能。
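
Of the three fairness objectives, Group-DRO is the easiest to sketch: keep a weight per accent group and upweight whichever group currently has the highest loss. The exponentiated-gradient update below is the standard Group-DRO formulation; the grouping, step size, and losses are illustrative stand-ins.

```python
import torch

def group_dro_loss(per_example_loss, group_ids, group_weights, eta=0.1):
    """per_example_loss: (B,); group_ids: (B,) ints; group_weights: (G,)."""
    G = group_weights.numel()
    group_loss = torch.zeros(G)
    for g in range(G):
        mask = group_ids == g
        if mask.any():
            group_loss[g] = per_example_loss[mask].mean()
    # Exponentiated-gradient ascent toward the worst-group distribution.
    new_w = group_weights * torch.exp(eta * group_loss.detach())
    new_w = new_w / new_w.sum()
    return (new_w * group_loss).sum(), new_w

weights = torch.ones(3) / 3                      # e.g., 3 accent groups
losses = torch.tensor([0.2, 0.9, 0.4, 0.8])      # per-utterance ASR losses
groups = torch.tensor([0, 1, 2, 1])
robust_loss, weights = group_dro_loss(losses, groups, weights)
print(float(robust_loss), weights)               # weight shifts toward group 1
```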

Link: https://arxiv.org/abs/2510.18374
Authors: Monorama Swain, Bubai Maji, Jagabandhu Mishra, Markus Schedl, Anders Søgaard, Jesper Rindom Jensen
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Submitted to ICASSP 2026

Abstract:In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and Seamless-M4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged word error rate, our approach achieves a relative improvement of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T, and 9.7% and 7.8% over them, finetuning with standard empirical risk minimization with cross-entropy loss.

[NLP-39] KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs

【速读】: 该论文旨在解决当前大型语言模型(LLM)在评估事实准确性时对特定语言文化知识(尤其是韩语文化知识)覆盖不足的问题。现有基准测试多基于英文数据,难以反映模型在非英语语境下的真实推理与知识掌握能力。为此,作者提出了韩国简单问答基准(Korean SimpleQA, KoSimpleQA),其关键创新在于构建了一个包含1,000个短句事实型问题的数据集,这些问题具有明确且可验证的答案,既具备挑战性又易于评分。实验表明,即使最强的开源韩语支持LLM在该基准上的准确率仅为33.7%,凸显了任务难度,并揭示了模型在跨语言文化知识迁移中的局限性。此外,研究发现,引入推理机制有助于模型更有效地激活潜在知识并提升在不确定情况下的拒答能力,从而增强事实性表现。

Link: https://arxiv.org/abs/2510.18368
Authors: Donghyeon Ko, Yeguk Jin, Kyubyung Chae, Byungwook Lee, Chansong Jo, Sookyo In, Jaehong Lee, Taesup Kim, Donghyun Kwak
Institutions: Naver Cloud; Graduate School of Data Science, Seoul National University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present Korean SimpleQA (KoSimpleQA), a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates the correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at this https URL.

[NLP-40] KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers

【速读】: 该论文旨在解决孟加拉国农民难以获取及时、专业级农业指导的问题。其解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)框架的语音交互式咨询平台KrishokBondhu,该平台整合了呼叫中心功能,并针对孟加拉语用户优化了端到端的语音处理流程:通过光学字符识别(Optical Character Recognition, OCR)与文档解析管道将权威农业手册、推广指南及非政府组织出版物结构化并存入向量数据库,实现高效语义检索;当农民拨打电话后,系统利用语音转文字(Speech-to-Text)、RAG模块召回相关知识片段,并由大型语言模型(Gemma 3-4B)生成上下文相关的回答,最终通过文字转语音(Text-to-Speech)以自然口语化的孟加拉语输出。实证表明,该方案在多样性农业问题上实现了72.7%的高质量响应率,显著优于基准系统(复合评分提升44.7%),验证了多模态交互、本地化语言支持与知识驱动生成相结合的有效性。

Link: https://arxiv.org/abs/2510.18355
Authors: Mohd Ruhul Ameen, Akif Islam, Farjana Aktar, M. Saifuzzaman Rafat
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 6 pages, 7 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE 2025)

Abstract:In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.

[NLP-41] Combining Distantly Supervised Models with In Context Learning for Monolingual and Cross-Lingual Relation Extraction

【速读】: 该论文旨在解决远距离监督关系抽取(Distantly Supervised Relation Extraction, DSRE)中的长期挑战,即模型需在存在噪声的bag-level标注下进行训练,同时完成sentence-level的关系预测任务。现有最先进(SoTA)的DSRE方法依赖于特定任务的训练,而其与大语言模型(LLM)的上下文学习(In-Context Learning, ICL)结合尚未被充分探索,尤其当LLM因标注噪声无法准确学习关系语义时问题更为突出。解决方案的关键在于提出HYDRE框架——首先利用预训练的DSRE模型为测试句筛选top-k候选关系,随后引入一种新颖的动态示例检索策略,从训练数据中提取可靠的sentence-level示例作为prompt输入至LLM,从而输出最终的关系标签。此设计有效缓解了噪声标注对LLM推理的影响,并显著提升了跨语言场景下的性能表现。

Link: https://arxiv.org/abs/2510.18344
Authors: Vipul Rathore, Malik Hammad Faisal, Parag Singla, Mausam
Institutions: Indian Institute of Technology (印度理工学院); New Delhi, India (新德里, 印度)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Distantly Supervised Relation Extraction (DSRE) remains a long-standing challenge in NLP, where models must learn from noisy bag-level annotations while making sentence-level predictions. While existing state-of-the-art (SoTA) DSRE models rely on task-specific training, their integration with in-context learning (ICL) using large language models (LLMs) remains underexplored. A key challenge is that the LLM may not learn relation semantics correctly, due to noisy annotation. In response, we propose HYDRE – HYbrid Distantly Supervised Relation Extraction framework. It first uses a trained DSRE model to identify the top-k candidate relations for a given test sentence, then uses a novel dynamic exemplar retrieval strategy that extracts reliable, sentence-level exemplars from training data, which are then provided in LLM prompt for outputting the final relation(s). We further extend HYDRE to cross-lingual settings for RE in low-resource languages. Using available English DSRE training data, we evaluate all methods on English as well as a newly curated benchmark covering four diverse low-resource Indic languages – Oriya, Santali, Manipuri, and Tulu. HYDRE achieves up to 20 F1 point gains in English and, on average, 17 F1 points on Indic languages over prior SoTA DSRE models. Detailed ablations exhibit HYDRE’s efficacy compared to other prompting strategies.

[NLP-42] ECG-LLM – training and evaluation of domain-specific large language models for electrocardiography

【速读】: 该论文旨在解决领域适配的开源大语言模型(Large Language Models, LLMs)在医疗健康场景中的最优适应策略、评估方法及其相对于通用大模型性能表现不明确的问题。其解决方案的关键在于:通过在心电图(electrocardiography)领域的专业文献上对开源模型进行微调(fine-tuning),并构建多层评估框架,对比微调模型、检索增强生成(Retrieval-Augmented Generation, RAG)与代表性通用模型 Claude Sonnet 3.7 的表现,从而验证领域特定微调和RAG方法在保持隐私安全的前提下可实现与闭源模型相当的临床应用性能。

Link: https://arxiv.org/abs/2510.18339
Authors: Lara Ahrens, Wilhelm Haverkamp, Nils Strodthoff
Institutions: University of Lübeck (吕贝克大学); Charité – Universitätsmedizin Berlin (柏林夏里特医学院)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 34 pages, 8 figures, code available at this https URL

Abstract:Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.

[NLP-43] Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption

【Quick Read】: This position paper asks why LLM watermarking sees so little real-world deployment and attributes the gap to misaligned incentives among LLM providers, platforms, and end users, which manifest as four barriers: competitive risk, detection-tool governance, robustness concerns, and attribution issues. The proposed remedy is to design incentive-aligned watermarking, with in-context watermarking (ICW) highlighted as a viable path: a trusted party (e.g., a conference organizer or educator) embeds hidden watermarking instructions in a document, and if a dishonest user feeds that text to an LLM, the output carries a detectable watermark that reveals the misuse. Under this scheme users suffer no quality loss, trusted parties gain a detection tool, and LLM providers stay neutral, aligning all three parties' interests. The paper calls for exploring such incentive-aligned designs in targeted domains and for community collaboration to drive practical adoption.

Link: https://arxiv.org/abs/2510.18333
Authors: Yepeng Liu, Xuandong Zhao, Dawn Song, Gregory W. Wornell, Yuheng Bu
Institutions: UC Santa Barbara (加州大学圣塔芭芭拉分校); UC Berkeley (加州大学伯克利分校); Massachusetts Institute of Technology (麻省理工学院)
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Despite progress in watermarking algorithms for large language models (LLMs), real-world deployment remains limited. We argue that this gap stems from misaligned incentives among LLM providers, platforms, and end users, which manifest as four key barriers: competitive risk, detection-tool governance, robustness concerns and attribution issues. We revisit three classes of watermarking through this lens. Model watermarking naturally aligns with LLM provider interests, yet faces new challenges in open-source ecosystems. LLM text watermarking offers modest provider benefit when framed solely as an anti-misuse tool, but can gain traction in narrowly scoped settings such as dataset de-contamination or user-controlled provenance. In-context watermarking (ICW) is tailored for trusted parties, such as conference organizers or educators, who embed hidden watermarking instructions into documents. If a dishonest reviewer or student submits this text to an LLM, the output carries a detectable watermark indicating misuse. This setup aligns incentives: users experience no quality loss, trusted parties gain a detection tool, and LLM providers remain neutral by simply following watermark instructions. We advocate for a broader exploration of incentive-aligned methods, with ICW as an example, in domains where trusted parties need reliable tools to detect misuse. More broadly, we distill design principles for incentive-aligned, domain-specific watermarking and outline future research directions. Our position is that the practical adoption of LLM watermarking requires aligning stakeholder incentives in targeted application domains and fostering active community engagement.

[NLP-44] he Impact of Image Resolution on Biomedical Multimodal Large Language Models ALT

【Quick Read】: This paper studies how image-resolution mismatch degrades multimodal large language models (MLLMs) in biomedical image analysis: most MLLMs are trained on low-resolution images from general-purpose datasets, whereas biomedical images are typically high-resolution, so naive application risks losing critical detail. Three findings are key: native-resolution training and inference significantly improve performance across multiple biomedical tasks; a mismatch between training and inference resolutions severely hurts performance; and mixed-resolution training effectively mitigates the mismatch while balancing computational constraints against performance requirements.

Link: https://arxiv.org/abs/2510.18304
Authors: Liangyu Chen, James Burgess, Jeffrey J Nirschl, Orr Zohar, Serena Yeung-Levy
Institutions: Stanford University (斯坦福大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025

Abstract:Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.

[NLP-45] From Retrieval to Generation: Unifying External and Parametric Knowledge for Medical Question Answering

【Quick Read】: Medical QA systems can be misled either by incomplete external retrieval or by hallucinated generated content: Retrieval-Augmented Generation (RAG) is vulnerable to noisy or missing evidence, while Generation-Augmented Generation (GAG) may produce inaccurate documents through unconstrained generation. MedRGAG is a unified retrieval-generation augmented framework built around two modules: Knowledge-Guided Context Completion (KGCC), which directs the generator to produce background documents covering the knowledge the retrieval step missed, and Knowledge-Aware Document Selection (KADS), which adaptively combines retrieved and generated documents into a concise yet comprehensive evidence set, improving answer reliability and accuracy.

Link: https://arxiv.org/abs/2510.18297
Authors: Lei Li, Xiao Zhou, Yingying Zhang, Xian Wu
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Tencent Jarvis Lab (腾讯优图实验室)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures

Abstract:Medical question answering (QA) requires extensive access to domain-specific knowledge. A promising direction is to enhance large language models (LLMs) with external knowledge retrieved from medical corpora or parametric knowledge stored in model parameters. Existing approaches typically fall into two categories: Retrieval-Augmented Generation (RAG), which grounds model reasoning on externally retrieved evidence, and Generation-Augmented Generation (GAG), which depends solely on the model's internal knowledge to generate contextual documents. However, RAG often suffers from noisy or incomplete retrieval, while GAG is vulnerable to hallucinated or inaccurate information due to unconstrained generation. Both issues can mislead reasoning and undermine answer reliability. To address these challenges, we propose MedRGAG, a unified retrieval-generation augmented framework that seamlessly integrates external and parametric knowledge for medical QA. MedRGAG comprises two key modules: Knowledge-Guided Context Completion (KGCC), which directs the generator to produce background documents that complement the missing knowledge revealed by retrieval; and Knowledge-Aware Document Selection (KADS), which adaptively selects an optimal combination of retrieved and generated documents to form concise yet comprehensive evidence for answer generation. Extensive experiments on five medical QA benchmarks demonstrate that MedRGAG achieves a 12.5% improvement over MedRAG and a 4.5% gain over MedGENIE, highlighting the effectiveness of unifying retrieval and generation for knowledge-intensive reasoning. Our code and data are publicly available at this https URL

[NLP-46] Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

【Quick Read】: Access to free food resources in the United States is fragmented and poorly matched to need: current retrieval systems depend on static directories or generic search engines and return incomplete, geographically irrelevant results; LLM-based chatbots offer only vague nutritional suggestions and cannot adapt to a user's constraints of time, mobility, or transportation; and existing recommenders optimize for culinary diversity while ignoring survival-critical needs of food-insecure populations such as immediate proximity, verified availability, and contextual barriers. Food4All is the first multi-agent framework for real-time, context-aware free-food retrieval, integrating three innovations: heterogeneous data aggregation across official databases, community platforms, and social media to maintain a continuously updated resource pool; a lightweight reinforcement learning algorithm trained on curated cases that jointly optimizes geographic accessibility and nutritional correctness; and an online feedback loop that dynamically adapts retrieval policies as user needs evolve. The framework closes the loop from information acquisition through semantic analysis to decision support, delivering nutritionally annotated, precisely targeted guidance to vulnerable populations facing food insecurity.

Link: https://arxiv.org/abs/2510.18289
Authors: Zhengqing Yuan, Yiyang Li, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Keerthiram Murugesan, Yanfang Ye
Institutions: University of Notre Dame (圣母大学); International Business Machines (IBM)
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments:

Abstract:Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.

[NLP-47] BrailleLLM: Braille Instruction Tuning with Large Language Models for Braille Domain Tasks

【Quick Read】: Braille information processing faces two core problems: data scarcity makes model training difficult, and semantic ambiguity in mixed-text contexts complicates recognition and conversion. The solution builds English and Chinese Braille Mixed Datasets (EBMD/CBMD) that include mathematical formulas, proposes a syntax-tree-based augmentation method to increase data diversity, and introduces Braille Knowledge-Based Fine-Tuning (BKFT), which explicitly models Braille contextual features to lower the learning difficulty, yielding substantial gains on Braille translation, formula-to-Braille conversion, and mixed-text translation.

Link: https://arxiv.org/abs/2510.18288
Authors: Tianyuan Huang, Zepeng Zhu, Hangdi Xing, Zirui Shao, Zhi Yu, Chaoxiong Yang, Jiaxian He, Xiaozhong Liu, Jiajun Bu
Institutions: Zhejiang University (浙江大学); Alibaba Group; Worcester Polytechnic Institute; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and DataSecurity
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025

Abstract:Braille plays a vital role in education and information accessibility for visually impaired individuals. However, Braille information processing faces challenges such as data scarcity and ambiguities in mixed-text contexts. We construct English and Chinese Braille Mixed Datasets (EBMD/CBMD) with mathematical formulas to support diverse Braille domain research, and propose a syntax tree-based augmentation method tailored for Braille data. To address the underperformance of traditional fine-tuning methods in Braille-related tasks, we investigate Braille Knowledge-Based Fine-Tuning (BKFT), which reduces the learning difficulty of Braille contextual features. BrailleLLM employs BKFT via instruction tuning to achieve unified Braille translation, formula-to-Braille conversion, and mixed-text translation. Experiments demonstrate that BKFT achieves significant performance improvements over conventional fine-tuning in Braille translation scenarios. Our open-sourced datasets and methodologies establish a foundation for low-resource multilingual Braille research.

[NLP-48] Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

【Quick Read】: Long text inputs drive up token consumption in large language models (LLMs), hurting inference efficiency and cost. The core idea is text-as-image input compression: render the raw text as a single image and feed it directly to a decoder LLM through its multimodal interface. By leaning on the visual encoder, the method sharply reduces the number of decoder tokens needed; experiments show substantial token savings (often nearly half) without degrading task performance, notably on long-context retrieval (RULER) and document summarization (CNN/DailyMail).
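
The rendering step is simple to sketch with Pillow; the font, wrapping, and sizing choices below are illustrative, and the resulting image stands in for the raw text when calling a multimodal model (the paper's exact rendering parameters are not reproduced here).

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text, width_chars=80, font_size=14, pad=10):
    font = ImageFont.load_default()          # swap in a TTF font for quality
    lines = textwrap.wrap(text, width=width_chars)
    line_h = font_size + 4
    img = Image.new("RGB",
                    (width_chars * (font_size // 2 + 2) + 2 * pad,
                     line_h * len(lines) + 2 * pad),
                    "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((pad, pad + i * line_h), line, fill="black", font=font)
    return img

long_input = "Paris is the capital of France. " * 40
image = render_text_as_image(long_input)
image.save("context.png")
# Pass this single image to the multimodal model in place of the raw text:
# the decoder then consumes one image's worth of tokens instead of ~300.
```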

Link: https://arxiv.org/abs/2510.18279
Authors: Yanhong Li, Zixuan Lan, Jiawei Zhou
Institutions: Allen Institute for AI (艾伦人工智能研究所); University of Chicago (芝加哥大学); Stony Brook University (石溪大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 Findings. Previously titled “Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs”

Abstract:Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks RULER (long-context retrieval) and CNN/DailyMail (document summarization) we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.

[NLP-49] DelvePO: Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization

【Quick Read】: Current prompt-optimization methods suffer from two core problems: they lean on the LLM's ability to rewrite prompts more or less at random while focusing on particular influencing factors, which makes it easy to get stuck in local optima; and the optimized prompts perform unstably, limiting transfer across tasks. DelvePO (Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization) is a task-agnostic self-evolving framework that decouples prompts into components that can be explored independently, allowing systematic analysis of how each factor affects a task, and introduces a working memory through which the LLM can offset its own uncertainty and extract key insights that guide the generation of new prompts, achieving stable prompt optimization that generalizes across tasks.

Link: https://arxiv.org/abs/2510.18257
Authors: Tao Tao, Guanghui Zhu, Lang Guo, Hongyi Chen, Chunfeng Yuan, Yihua Huang
Institutions: Nanjing University (南京大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Prompt Optimization has emerged as a crucial approach due to its capabilities in steering Large Language Models to solve various tasks. However, current works mainly rely on the random rewriting ability of LLMs, and the optimization process generally focuses on specific influencing factors, which makes it easy to fall into local optimum. Besides, the performance of the optimized prompt is often unstable, which limits its transferability in different tasks. To address the above challenges, we propose DelvePO (Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization), a task-agnostic framework to optimize prompts in a self-evolving manner. In our framework, we decouple prompts into different components that can be used to explore the impact that different factors may have on various tasks. On this basis, we introduce working memory, through which LLMs can alleviate the deficiencies caused by their own uncertainties and further obtain key insights to guide the generation of new prompts. Extensive experiments were conducted on different tasks covering various domains for both open- and closed-source LLMs, including DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B-Instruct and GPT-4o-mini. Experimental results show that DelvePO consistently outperforms previous SOTA methods under identical experimental settings, demonstrating its effectiveness and transferability across different tasks.

[NLP-50] VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

【Quick Read】: Safety evaluation of multimodal foundation models typically treats vision and language inputs separately, missing risks that only emerge when image and text are interpreted jointly (benign pieces becoming harmful in combination), and it struggles to separate clearly unsafe content from borderline cases, producing over-blocking or under-refusal. The Vision Language Safety Understanding (VLSU) framework answers this with fine-grained severity classification and combinatorial analysis across 17 safety patterns, a large-scale benchmark of 8,187 samples spanning 15 harm categories, and a multi-stage pipeline combining real-world images with human annotation. Evaluations show that leading models degrade sharply when joint image-text reasoning is required (accuracy falling from above 90% to 20-55%), and 34% of joint-classification errors occur even though each modality is classified correctly on its own, exposing systematic gaps in compositional understanding and alignment.

Link: https://arxiv.org/abs/2510.18214
Authors: Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng
Institutions: Apple (苹果)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 4 tables. Under review

Abstract:Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.

[NLP-51] MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives

【Quick Read】: This work takes on the novel task of computationally generating event-centric, relation-based character arcs from narratives. Character arcs are an important theoretical device in literary studies for understanding character journeys and identifying tropes across genres, but they have long lacked a quantitative, computational representation. The solution is MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment, models the resulting inter-character relations, and tracks and aggregates those relations across the narrative to produce character arcs as graphical plots. This gives character arcs a tangible, quantitative form, turning an abstract literary concept into a computational resource for downstream applications such as narrative analysis and modeling character evolution.

Link: https://arxiv.org/abs/2510.18201
Authors: Sriharsh Bhyravajjula, Ujwal Narayan, Manish Shrivastava
Institutions: International Institute of Information Technology, Hyderabad (海得拉巴国际信息科技研究所)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Character arcs are important theoretical devices employed in literary studies to understand character journeys, identify tropes across literary genres, and establish similarities between narratives. This work addresses the novel task of computationally generating event-centric, relation-based character arcs from narratives. Providing a quantitative representation for arcs brings tangibility to a theoretical concept and paves the way for subsequent applications. We present MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment to model inter-character relations. MARCUS tracks and aggregates these relations across the narrative to generate character arcs as graphical plots. We generate character arcs from two extended fantasy series, Harry Potter and Lord of the Rings. We evaluate our approach before outlining existing challenges, suggesting applications of our pipeline, and discussing future work.

[NLP-52] Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

【Quick Read】: When large language models (LLMs) act as judges for direct assessment, reliability suffers from score range bias: judge outputs are highly sensitive to the pre-defined score range, which prevents searching for an optimal range. The key remedy is contrastive decoding, which effectively mitigates this bias and improves the Spearman correlation between LLM scores and human judgments by up to 11.3% relative on average across score ranges.
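
A minimal sketch of contrastive decoding restricted to the score tokens: pick the score with the largest log-probability gap between the judge distribution and a contrast distribution. Which contrast the paper uses is not detailed in this digest, so the "judge without the rubric" framing and the logits below are illustrative.

```python
import torch

def contrastive_score(expert_logits, contrast_logits, scores, beta=1.0):
    lp_e = torch.log_softmax(expert_logits, dim=-1)
    lp_c = torch.log_softmax(contrast_logits, dim=-1)
    gap = lp_e - beta * lp_c           # penalize what the contrast model likes
    return scores[int(gap.argmax())]

scores = list(range(1, 6))                          # a 1-5 rating scale
expert = torch.tensor([0.1, 0.4, 2.4, 2.2, 0.3])    # judge logits per score token
contrast = torch.tensor([0.2, 0.5, 2.5, 1.0, 0.1])  # e.g., judge without the rubric
print(contrastive_score(expert, contrast, scores))  # -> 4; greedy decoding gives 3
```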

Link: https://arxiv.org/abs/2510.18196
Authors: Yoshinari Fujinuma
Institutions: Cantina Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges, preventing the search for optimal score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.

[NLP-53] CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models

【Quick Read】: Current LLMs lack robustness in dynamic text-to-table (T2T) generation: reasoning over temporally evolving narratives is unstable, models are sensitive to surface-form changes, and they lean on extractive shortcuts rather than genuine state tracking. CMT-Bench is a diagnostic benchmark built from live cricket commentary that demands dense, rule-governed table generation over two dynamically evolving schemas, and it probes models along three semantics-preserving dimensions: extractive-cue ablation, temporal prefixing, and entity-form perturbations. The results expose how brittle current LLMs are at dynamic T2T and argue for robustness-first evaluation as a prerequisite for developing efficient, scalable approaches to the task.

Link: https://arxiv.org/abs/2510.18173
Authors: Ritam Upadhyay, Naman Ahuja, Rishabh Baral, Aparna Garimella, Vivek Gupta
Institutions: Arizona State University (亚利桑那州立大学); Adobe Research (Adobe 研究院)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:LLM-driven text-to-table (T2T) systems often rely on extensive prompt-engineering or iterative event extraction in code-parsable formats, which boosts scores but is computationally expensive and obscures how models actually reason over temporally evolving narratives to summarise key information. We present CMT-Bench, a diagnostic benchmark built from live cricket commentary that requires dynamic table generation across two evolving schemas under a dense, rule-governed policy. CMT-Bench is designed to probe robustness via three semantics-preserving dimensions: (i) extractive-cue ablation to separate extractive shortcuts from state tracking, (ii) temporal prefixing to test long-context stability, and (iii) entity-form perturbations (anonymization, out-of-distribution substitutions, role-entangling paraphrases) to assess sensitivity to surface variation. Across diverse long-context state-of-the-art LLMs, we find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drops under entity-form changes. Complementary distributional tests confirm significant shifts in numeric error patterns, indicating drift in reasoning rather than mere noise. Our results show that current LLMs are brittle in dynamic text-to-table generation, motivating robustness-first evaluation as a prerequisite for developing efficient and scalable approaches for this task.

[NLP-54] Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model

【Quick Read】: Diffusion language models (DLMs) are held back on code generation by a sharp trade-off between inference speed and output quality: cutting the number of sampling steps to speed up decoding typically causes a catastrophic drop in quality. Saber (Sampling with Adaptive acceleration and Backtracking Enhanced Remasking) is a training-free sampling algorithm built on two insights: sampling can be adaptively accelerated as more of the code context is established, and a backtracking mechanism is needed to revise already-generated tokens. Across mainstream code-generation benchmarks, Saber lifts Pass@1 accuracy by an average of 1.9% over mainstream DLM sampling methods while delivering an average 251.4% inference speedup, substantially narrowing the gap between DLMs and autoregressive models on code.
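
A toy sketch of the two mechanisms operating on a vector of stand-in confidences: commit more positions per step as more context is fixed (adaptive acceleration) and re-mask previously committed positions whose confidence collapses (backtracking). The schedules and thresholds are invented for illustration; in a real sampler, a DLM forward pass replaces `torch.rand`.

```python
import torch

def saber_like_sampler(T=32, steps=10, base_k=2, remask_thresh=0.3):
    decided = torch.zeros(T, dtype=torch.bool)   # which positions are committed
    conf = torch.zeros(T)
    for _ in range(steps):
        # Fresh per-position confidences (stub for a diffusion-LM forward pass).
        new_conf = torch.rand(T)
        # Backtracking: committed tokens that now look bad are re-masked.
        regress = decided & (new_conf < remask_thresh)
        decided &= ~regress
        # Adaptive acceleration: commit more tokens as context accumulates.
        k = base_k + int(decided.float().mean() * base_k * 4)
        candidates = torch.where(~decided, new_conf,
                                 torch.full_like(new_conf, -1.0))
        top = candidates.topk(min(k, int((~decided).sum()))).indices
        decided[top] = True
        conf[top] = new_conf[top]
        if decided.all():
            break
    return decided, conf

decided, conf = saber_like_sampler()
print(int(decided.sum()), "of 32 positions committed")
```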

Link: https://arxiv.org/abs/2510.18165
Authors: Yihong Dong, Zhaoyu Ma, Xue Jiang, Zhiyuan Fan, Jiaru Qian, Yongmin Li, Jianha Xiao, Zhi Jin, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li, Ge Li
Institutions: Peking University (北京大学); Tongyi Lab, Alibaba Group (阿里巴巴集团通义实验室)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, the performance of DLMs on code generation tasks, which have stronger structural constraints, is significantly hampered by the critical trade-off between inference speed and output quality. We observed that accelerating the code generation process by reducing the number of sampling steps usually leads to a catastrophic collapse in performance. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs to achieve better inference speed and output quality in code generation. Specifically, Saber is motivated by two key insights in the DLM generation process: 1) it can be adaptively accelerated as more of the code context is established; 2) it requires a backtracking mechanism to reverse the generated tokens. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average improvement of 1.9% over mainstream DLM sampling methods, meanwhile achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.

[NLP-55] Automatic Prompt Generation via Adaptive Selection of Prompting Techniques

【Quick Read】: Reliable, effective use of large language models (LLMs) hinges on prompt engineering, which demands specialized knowledge that non-expert users lack. The key to the proposed method is adaptive prompt generation: a knowledge base associates task clusters (formed by semantic similarity across diverse tasks) with the prompting techniques suited to them; a user's abstract task description is assigned to the most relevant cluster, and the techniques drawn from the knowledge base are dynamically combined to generate a high-quality prompt automatically, without relying on pre-existing templates or frameworks.
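
A minimal sketch of the knowledge-base lookup: embed the task description, pick the nearest task cluster by cosine similarity, and assemble that cluster's techniques into a prompt. The embedding function, centroids, and technique table below are all illustrative stand-ins, not the paper's actual knowledge base.

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = {"math_reasoning": rng.normal(size=64),
             "summarization": rng.normal(size=64),
             "classification": rng.normal(size=64)}
techniques = {"math_reasoning": ["chain-of-thought", "self-consistency"],
              "summarization": ["role prompting", "length constraints"],
              "classification": ["few-shot exemplars", "label definitions"]}

def embed(text):                      # stand-in for a real text encoder
    seed = abs(hash(text)) % 2**32    # deterministic within one run only
    return np.random.default_rng(seed).normal(size=64)

def generate_prompt(task_description):
    v = embed(task_description)
    cluster = max(centroids,
                  key=lambda c: np.dot(v, centroids[c]) /
                      (np.linalg.norm(v) * np.linalg.norm(centroids[c])))
    techs = techniques[cluster]
    return (f"# task cluster: {cluster}\n"
            f"# techniques: {', '.join(techs)}\n"
            f"Apply {techs[0]} to the following task:\n{task_description}")

print(generate_prompt("Summarize quarterly earnings calls into bullet points"))
```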

链接: https://arxiv.org/abs/2510.18162
作者: Yohei Ikenoue,Hitomi Tashiro,Shigeru Kuroyanagi
机构: Spike Studio Inc. (Spike Studio 公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 35 pages, 29 figures, 5 tables

点击查看摘要

Abstract:Prompt engineering is crucial for achieving reliable and effective outputs from large language models (LLMs), but its design requires specialized knowledge of prompting techniques and a deep understanding of target tasks. To address this challenge, we propose a novel method that adaptively selects task-appropriate prompting techniques based on users’ abstract task descriptions and automatically generates high-quality prompts without relying on pre-existing templates or frameworks. The proposed method constructs a knowledge base that associates task clusters, characterized by semantic similarity across diverse tasks, with their corresponding prompting techniques. When users input task descriptions, the system assigns them to the most relevant task cluster and dynamically generates prompts by integrating techniques drawn from the knowledge base. An experimental evaluation of the proposed method on 23 tasks from BIG-Bench Extra Hard (BBEH) demonstrates superior performance compared with standard prompts and existing automatic prompt-generation tools, as measured by both arithmetic and harmonic mean scores. This research establishes a foundation for streamlining and standardizing prompt creation, enabling non-experts to effectively leverage LLMs.

[NLP-56] Extracting Rule-based Descriptions of Attention Features in Transformers

【Quick Read】: This paper tackles the limited interpretability of features in language models: existing mechanistic methods based on sparse linear combinations only identify which text sequences activate a feature, without providing an objective, structured account of what the feature means. The key to the solution is a rule-based description paradigm that automatically builds interpretable feature descriptions by extracting pattern-matching rules between input and output features, including skip-gram rules, absence rules, and counting rules. Ablation shows that exemplar inspection alone misses behaviors these rules capture, such as a missing word triggering specific behavior or output probabilities shifting once a word count crosses a threshold, yielding a more complete, accurate, and actionable framework for explaining features in Transformer models.

Link: https://arxiv.org/abs/2510.18148
Authors: Dan Friedman, Adithya Bhaskar, Alexander Wettig, Danqi Chen
Affiliations: Princeton Language and Intelligence; Princeton University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Our code is available at this https URL

Abstract:Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form “[Canadian city]… speaks – English”, (2) absence rules of the form “[Montreal]… speaks -/- English,” and (3) counting rules that toggle only when the count of a word exceeds a certain value or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.

[NLP-57] LLMs Encode How Difficult Problems Are

【Quick Read】: This paper investigates a puzzling inconsistency in large language models (LLMs), which solve complex problems yet frequently fail on simpler ones: do LLMs internally encode problem difficulty in a way that aligns with human judgment, and how does that representation track generalization during reinforcement learning (RL) post-training? The key findings come from training linear probes across layers and token positions on 60 models: human-labeled difficulty is strongly linearly decodable (AMC: ρ ≈ 0.88) and scales steadily with model size, whereas difficulty signals derived from the models themselves are much weaker and scale poorly. Steering along the difficulty direction toward "easier" representations reduces hallucination and improves accuracy; during GRPO training, the human-difficulty probe strengthens and correlates positively with test accuracy, while the LLM-difficulty probe degrades and correlates negatively, indicating that human annotations provide a stable difficulty signal that RL amplifies, while automatically estimated difficulty drifts precisely as models improve.

Link: https://arxiv.org/abs/2510.18147
Authors: William Lugoloobi, Chris Russell
Affiliations: University of Oxford
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones. We investigate whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and whether this representation tracks generalization during reinforcement learning post-training. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench. We find that human-labeled difficulty is strongly linearly decodable (AMC: \rho \approx 0.88 ) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly. Steering along the difficulty direction reveals that pushing models toward “easier” representations reduces hallucination and improves accuracy. During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance. These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve. We release probe code and evaluation scripts to facilitate replication.
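
A minimal sketch of the linear-probe setup described above; the activation-extraction step and file layout are assumptions, not the paper's released code:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

# X: hidden states at one layer/token position; y: human difficulty labels.
X = np.load("activations_layer20.npy")   # shape (n_problems, d_model), assumed file
y = np.load("human_difficulty.npy")      # shape (n_problems,), assumed file

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
rho, _ = spearmanr(probe.predict(X_te), y_te)
print(f"linear decodability (Spearman rho): {rho:.2f}")
```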

[NLP-58] SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving

【Quick Read】: This paper addresses the new safety and security challenges facing natural-language-based collaborative driving systems, where the language channel introduces vulnerabilities such as message loss, hallucination, semantic manipulation, and adversarial attacks. Traditional V2X systems communicate raw sensor data or perception results and suffer from high bandwidth demands, semantic loss, and poor interoperability; natural language offers semantic richness and decision-level reasoning but opens an entirely new attack surface. The key to the solution is an agentic defense pipeline called SafeCoop, whose core combines a semantic firewall, language-perception consistency checks, and multi-source consensus, with an agentic transformation function for cross-frame spatial alignment. In closed-loop CARLA simulation across 32 critical scenarios, it achieves a 69.15% driving-score improvement under malicious attacks and up to a 67.32% F1 score for malicious detection.

Link: https://arxiv.org/abs/2510.18123
Authors: Xiangbo Gao, Tzu-Hsiang Lin, Ruojing Song, Yuheng Wu, Kuan-Ru Huang, Zicheng Jin, Fangzhou Lin, Shinan Liu, Zhengzhong Tu
Affiliations: Texas A&M University; New York University; Korea Advanced Institute of Science and Technology; University of Michigan; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments:

Abstract:Collaborative driving systems leverage vehicle-to-everything (V2X) communication across multiple agents to enhance driving safety and efficiency. Traditional V2X systems take raw sensor data, neural features, or perception results as communication media, which face persistent challenges, including high bandwidth demands, semantic loss, and interoperability issues. Recent advances investigate natural language as a promising medium, which can provide semantic richness, decision-level reasoning, and human-machine interoperability at significantly lower bandwidth. Despite great promise, this paradigm shift also introduces new vulnerabilities within language communication, including message loss, hallucinations, semantic manipulation, and adversarial attacks. In this work, we present the first systematic study of full-stack safety and security issues in natural-language-based collaborative driving. Specifically, we develop a comprehensive taxonomy of attack strategies, including connection disruption, relay/replay interference, content spoofing, and multi-connection forgery. To mitigate these risks, we introduce an agentic defense pipeline, which we call SafeCoop, that integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment. We systematically evaluate SafeCoop in closed-loop CARLA simulation across 32 critical scenarios, achieving 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection. This study provides guidance for advancing research on safe, secure, and trustworthy language-driven collaboration in transportation systems. Our project page is this https URL.

[NLP-59] Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment EMNLP2025

【Quick Read】: This paper explores how large language models (LLMs) and reasoning can be used to predict Dungeons & Dragons (DnD) player actions and convert them into commands for the Avrae Discord bot. The key lies in carefully designed prompts that guide the model to produce structured commands: the study finds that even single-sentence changes to a prompt can substantially affect output quality, and that for this task an instruct model such as LLaMA-3.1-8B-Instruct is sufficient, with no need for a heavier reasoning model.

Link: https://arxiv.org/abs/2510.18112
Authors: Patricia Delafuente, Arya Honraopatil, Lara J. Martin
Affiliations: University of Maryland, Baltimore County
Subjects: Computation and Language (cs.CL)
Comments: Published at the Wordplay: When Language Meets Games Workshop (EMNLP 2025)

Abstract:This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, that even single-sentence changes in prompts can greatly affect the output of models, and that instruct models are sufficient for this task compared to reasoning models.

[NLP-60] Na Prática qual IA Entende o Direito? Um Estudo Experimental com IAs Generalistas e uma IA Jurídica (In Practice, Which AI Understands the Law? An Experimental Study with General-Purpose AIs and a Legal AI)

【Quick Read】: This study examines the reliability of general-purpose generative AI in practical legal settings, focusing on its limitations in legal reasoning, textual accuracy, and systematic consistency. The key to the solution is an experimental evaluation protocol that combines a legal-theoretical framework (material correctness, systematic coherence, and argumentative integrity) with empirical assessment: 48 legal professionals tested four systems (JusIA, ChatGPT Free, ChatGPT Pro, and Gemini) on tasks simulating lawyers' daily work. The domain-specialized JusIA consistently outperformed the general-purpose systems, showing that domain specialization and theory-grounded evaluation are both essential for reliable legal AI outputs.

Link: https://arxiv.org/abs/2510.18108
Authors: Marina Soares Marinho, Daniela Vianna, Livy Real, Altigran da Silva, Gabriela Migliorini
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, in Portuguese language

Abstract:This study presents the Jusbrasil Study on the Use of General-Purpose AIs in Law, proposing an experimental evaluation protocol combining legal theory, such as material correctness, systematic coherence, and argumentative integrity, with empirical assessment by 48 legal professionals. Four systems (JusIA, ChatGPT Free, ChatGPT Pro, and Gemini) were tested in tasks simulating lawyers’ daily work. JusIA, a domain-specialized model, consistently outperformed the general-purpose systems, showing that both domain specialization and a theoretically grounded evaluation are essential for reliable legal AI outputs.

[NLP-61] SMaRT: Select, Mix, and ReinvenT - A Strategy Fusion Framework for LLM-Driven Reasoning and Planning

【Quick Read】: This paper addresses the limitation that current large language models (LLMs) rely on a single reasoning strategy for complex task automation, which prevents them from exploiting the synergies among diverse reasoning methods and hurts robustness and generalization. The key to the solution is the Select, Mix, and ReinvenT (SMaRT) framework, whose core innovation is to recast LLMs from mere evaluators into intelligent integrators that fuse diverse reasoning strategies, achieving cross-strategy calibration and producing more balanced, efficient, and higher-quality decisions across task settings.

Link: https://arxiv.org/abs/2510.18095
Authors: Nikhil Verma, Manasa Bharadwaj, Wonjun Jang, Harmanpreet Singh, Yixiao Wang, Homa Fashandi, Chul Lee
Affiliations: LG Electronics
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have redefined complex task automation with exceptional generalization capabilities. Despite these advancements, state-of-the-art methods rely on single-strategy prompting, missing the synergy of diverse reasoning approaches. No single strategy excels universally, highlighting the need for frameworks that fuse strategies to maximize performance and ensure robustness. We introduce the Select, Mix, and ReinvenT (SMaRT) framework, an innovative strategy fusion approach designed to overcome this constraint by creating balanced and efficient solutions through the seamless integration of diverse reasoning strategies. Unlike existing methods, which employ LLMs merely as evaluators, SMaRT uses them as intelligent integrators, unlocking the “best of all worlds” across tasks. Extensive empirical evaluations across benchmarks in reasoning, planning, and sequential decision-making highlight the robustness and adaptability of SMaRT. The framework consistently outperforms state-of-the-art baselines in solution quality, constraint adherence, and performance metrics. This work redefines LLM-driven decision-making by pioneering a new paradigm in cross-strategy calibration, unlocking superior outcomes for reasoning systems and advancing the boundaries of self-refining methodologies.

[NLP-62] Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models

【Quick Read】: This paper assesses a bottleneck in large language models (LLMs) when translating texts with inter-sentential dependencies, particularly challenging constructions such as pronominal anaphora and lexical cohesion. The key to the solution is chain-of-thought prompting, which encourages step-by-step reasoning to improve translation accuracy: with such prompts, the best models reach about 90% accuracy at distinguishing correct translations from plausible distractors, and COMET scores of about 92% when generating translations, with GPT-4, GPT-4o, and the Phi family standing out. The study also observes a "wise get wiser" effect: the stronger a model is without reasoning, the larger the gains it obtains from reasoning.

Link: https://arxiv.org/abs/2510.18077
Authors: Shabnam Ataee, Andrei Popescu-Belis
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a “wise get wiser” effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
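
For concreteness, the two prompting conditions compared in such studies can be sketched as follows; the exact wording is an assumption, not the paper's template:

```python
source = "The lawyer spoke to her client. She was worried about the trial."

# Chain-of-thought condition: ask for explicit reasoning about dependencies.
cot_prompt = (
    "Translate the following English text into French.\n"
    "First, reason step by step about inter-sentential dependencies: identify "
    "what each pronoun refers to and any lexical cohesion across sentences. "
    "Then give the final translation after the tag 'Translation:'.\n\n"
    f"Text: {source}"
)

# Baseline condition: same task, no step-by-step instruction.
plain_prompt = f"Translate the following English text into French.\n\nText: {source}"
```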

[NLP-63] HouseTour: A Virtual Real Estate A(I)gent ICCV2025

【Quick Read】: This paper addresses the lack of geometric reasoning in existing vision-language models (VLMs) when handling 3D scenes: how to generate spatially aware 3D camera trajectories and natural-language descriptions from a set of 2D images. The key to the solution is to first generate smooth 3D camera trajectories via a diffusion process constrained by known camera poses, then feed the trajectory information into the VLM for 3D-grounded text generation, and finally synthesize novel views along the trajectory with 3D Gaussian splatting. Experiments show that this joint approach significantly outperforms pipelines that handle camera-trajectory generation and text description independently.

Link: https://arxiv.org/abs/2510.18054
Authors: Ata Çelen, Marc Pollefeys, Daniel Barath, Iro Armeni
Affiliations: ETH Zürich; Stanford University; Microsoft Spatial AI Lab; HUN-REN SZTAKI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Published at ICCV 2025

Abstract:We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.

[NLP-64] Language Models as Semantic Augmenters for Sequential Recommenders

【Quick Read】: This paper addresses the performance degradation that sequential user-behavior models suffer when semantic context is limited or absent. The key to the solution is LaMAR, a framework that uses large language models (LLMs) in a few-shot setting to automatically generate auxiliary semantic signals, such as inferred usage scenarios, item intents, or thematic summaries, which enrich raw behavior sequences with greater contextual depth. These LLM-generated signals exhibit high semantic novelty and diversity and significantly enhance the representational capacity of downstream models, illustrating a data-centric paradigm in which LLMs act as intelligent context generators for the semi-automatic creation of training data and language resources.

Link: https://arxiv.org/abs/2510.18046
Authors: Mahsa Valizadeh, Xiangjue Dong, Rui Tuo, James Caverlee
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) excel at capturing latent semantics and contextual relationships across diverse modalities. However, in modeling user behavior from sequential interaction data, performance often suffers when such semantic context is limited or absent. We introduce LaMAR, a LLM-driven semantic enrichment framework designed to enrich such sequences automatically. LaMAR leverages LLMs in a few-shot setting to generate auxiliary contextual signals by inferring latent semantic aspects of a user’s intent and item relationships from existing metadata. These generated signals, such as inferred usage scenarios, item intents, or thematic summaries, augment the original sequences with greater contextual depth. We demonstrate the utility of this generated resource by integrating it into benchmark sequential modeling tasks, where it consistently improves performance. Further analysis shows that LLM-generated signals exhibit high semantic novelty and diversity, enhancing the representational capacity of the downstream models. This work represents a new data-centric paradigm where LLMs serve as intelligent context generators, contributing a new method for the semi-automatic creation of training data and language resources.

[NLP-65] Subject-Event Ontology Without Global Time: Foundations and Execution Semantics

【Quick Read】: This paper addresses the reliance on global time when modeling complex dynamic systems, in particular how to achieve causal consistency and executability without a global clock in distributed systems, microservice architectures, DLT platforms, and multi-perspective scenarios. The key to the solution is a formalized subject-event ontology built on: treating an event as an act of fixation; defining causal order (happens-before) via explicit dependencies rather than timestamps; ensuring deterministic execution through a declarative dataflow mechanism; treating models as epistemic filters, so a subject can only perceive and fix events falling under concepts it knows; and presuming the truth of an event's content at the moment of fixation, without external verification. Nine axioms (A1-A9) guarantee the correctness of executable ontologies, notably monotonicity of history (I1), acyclicity of causality (I2), and traceability (I3); the approach is implemented in the boldsea workflow engine via BSL (Boldsea Semantic Language), supporting schema-based event validation, actor authorization, and automatic construction of causal chains.

Link: https://arxiv.org/abs/2510.18040
Authors: Alexander Boldachev
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 32 pages

Abstract:A formalization of a subject-event ontology is proposed for modeling complex dynamic systems without reliance on global time. Key principles: (1) event as an act of fixation - a subject discerns and fixes changes according to models (conceptual templates) available to them; (2) causal order via happens-before - the order of events is defined by explicit dependencies, not timestamps; (3) making the ontology executable via a declarative dataflow mechanism, ensuring determinism; (4) models as epistemic filters - a subject can only fix what falls under its known concepts and properties; (5) presumption of truth - the declarative content of an event is available for computation from the moment of fixation, without external verification. The formalization includes nine axioms (A1-A9), ensuring the correctness of executable ontologies: monotonicity of history (I1), acyclicity of causality (I2), traceability (I3). Special attention is given to the model-based approach (A9): event validation via schemas, actor authorization, automatic construction of causal chains (W3) without global time. Practical applicability is demonstrated on the boldsea system - a workflow engine for executable ontologies, where the theoretical constructs are implemented in BSL (Boldsea Semantic Language). The formalization is applicable to distributed systems, microservice architectures, DLT platforms, and multiperspectivity scenarios (conflicting facts from different subjects).

[NLP-66] From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models

【Quick Read】: This paper addresses the limited downstream gains of structured pruning for deploying large language models (LLMs), which stem from its task-agnostic nature: existing methods optimize layer-wise reconstruction rather than task objectives and cannot exploit weak task-specific calibration signals. The key to the solution is Global Iterative Structured Pruning (GISP), which aggregates first-order, loss-based importance weights at the structure level with block-wise normalization for more stable pruning at high sparsity, and adopts an iterative schedule that avoids perplexity collapse without intermediate fine-tuning. Because importance is defined by a model-level loss, GISP naturally supports task-specific objectives (perplexity for language modeling, a margin-based objective for decision-style tasks), yielding consistent downstream improvements, especially at 40-50% sparsity.

Link: https://arxiv.org/abs/2510.18030
Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Minwoo Lee, Shu-ping Yeh, Evgeny Stupachenko, Hao Feng, Li Yang
Affiliations: University of North Carolina at Charlotte; DreamSoul; University of Minnesota; Intel Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP (Global Iterative Structured Pruning), a post-training method that removes attention heads and MLP channels using first-order, loss-based importance weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
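
A rough sketch of first-order, loss-based structure importance with block-wise normalization, assuming prunable units (heads, channels) have already been grouped per transformer block; this illustrates the idea rather than GISP's exact procedure:

```python
import torch
from collections import defaultdict

def structure_importance(loss, units):
    """units: list of (block_id, unit_name, params) where params is the list
    of parameter tensors making up one prunable structure (head or channel)."""
    raw = {}
    for block_id, name, params in units:
        grads = torch.autograd.grad(loss, params, retain_graph=True,
                                    allow_unused=True)
        s = params[0].new_zeros(())
        for p, g in zip(params, grads):
            if g is not None:
                s = s + (p * g).sum()
        raw[(block_id, name)] = s.abs()   # first-order estimate of loss change
    # Block-wise normalization: rescale scores within each transformer block.
    per_block = defaultdict(list)
    for (b, _), v in raw.items():
        per_block[b].append(v)
    stats = {b: (torch.stack(vs).mean(),
                 torch.stack(vs).std(unbiased=False) + 1e-8)
             for b, vs in per_block.items()}
    return {k: (v - stats[k[0]][0]) / stats[k[0]][1] for k, v in raw.items()}
```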

[NLP-67] Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution

【Quick Read】: This paper addresses the weak robustness of multilingual watermarking in medium- and low-resource languages: existing methods claim cross-lingual robustness but are validated only on high-resource languages and fail under translation attacks. The root cause is that semantic clustering breaks down when a tokenizer's vocabulary contains too few full-word tokens for a given language. The key to the solution is STEAM, a back-translation-based detection method that restores watermark strength lost through translation; it is compatible with any watermarking method, robust across tokenizers and languages, non-invasive, and easily extendable to new languages, yielding average gains of +0.19 AUC and +40 percentage points TPR@1% across 17 languages.

Link: https://arxiv.org/abs/2510.18019
Authors: Asim Mohamed, Martin Gubri
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a back-translation-based detection method that restores watermark strength lost through translation. STEAM is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.19 AUC and +40%p TPR@1% on 17 languages, STEAM provides a simple and robust path toward fairer watermarking across diverse languages.
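
Because STEAM is detector-agnostic, its core logic fits in a few lines; detect_watermark and translate below are placeholders for whatever watermark detector and MT system are in use:

```python
def steam_score(text, detect_watermark, translate, source_lang="en"):
    """Score a suspect text directly and after back-translating it into the
    language the watermarked model originally generated in; take the max."""
    direct = detect_watermark(text)
    back_translated = translate(text, target_lang=source_lang)
    recovered = detect_watermark(back_translated)
    return max(direct, recovered)

# Usage: flag the text as watermarked when steam_score(...) exceeds the
# detector's decision threshold.
```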

[NLP-68] SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone EMNLP2025

【Quick Read】: This paper addresses the unwieldiness of benchmark-based evaluation for large language models (LLMs): benchmark suites are large and hard to interpret, making model selection inefficient and results opaque. The key to the solution is SimBA (Simplify Benchmark Analysis), a three-phase framework of stalk (model-dataset comparison), prowl (representative-subset discovery), and pounce (performance prediction), which uses raw evaluation scores alone to find small, highly covering representative subsets while preserving model rankings and predictive power. On HELM, MMLU, and BigBenchLite, just 6.25%, 1.7%, and 28.4% of the datasets respectively achieve at least 95% coverage, and these subsets predict held-out model performance with near-zero mean-squared error.

Link: https://arxiv.org/abs/2510.17998
Authors: Nishant Subramani, Alfredo Gomez, Mona Diab
Affiliations: Carnegie Mellon University - Language Technologies Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2025 Findings

Abstract:Modern language models are evaluated on large benchmarks, which are difficult to make sense of, especially for model selection. Looking at the raw evaluation numbers themselves using a model-centric lens, we propose SimBA, a three-phase framework to Simplify Benchmark Analysis. The three phases of SimBA are: stalk, where we conduct dataset-model comparisons, prowl, where we discover a representative subset, and pounce, where we use the representative subset to predict performance on a held-out set of models. Applying SimBA to three popular LM benchmarks: HELM, MMLU, and BigBenchLite reveals that across all three benchmarks, datasets and models relate strongly to one another (stalk). We develop a representative-set discovery algorithm which covers a benchmark using raw evaluation scores alone. Using our algorithm, we find that with 6.25% (1/16), 1.7% (1/58), and 28.4% (21/74) of the datasets for HELM, MMLU, and BigBenchLite respectively, we achieve coverage levels of at least 95% (prowl). Additionally, using just these representative subsets, we can both preserve model ranks and predict performance on a held-out set of models with near-zero mean-squared error (pounce). Taken together, SimBA can help model developers improve efficiency during model training and dataset creators validate whether their newly created dataset differs from existing datasets in a benchmark. Our code is open source, available at this https URL.
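
A greedy version of the prowl phase might look like the sketch below, where coverage is operationalized as rank correlation against the full benchmark; this criterion is an assumption for illustration, not necessarily the paper's exact algorithm:

```python
import numpy as np
from scipy.stats import spearmanr

def greedy_representative_subset(scores, coverage=0.95):
    """scores: (n_models, n_datasets) performance matrix.
    Greedily add the dataset that best restores the full-benchmark ranking."""
    n_models, n_datasets = scores.shape
    full_mean = scores.mean(axis=1)
    chosen = []
    while True:
        best_d, best_rho = None, -np.inf
        for d in range(n_datasets):
            if d in chosen:
                continue
            sub_mean = scores[:, chosen + [d]].mean(axis=1)
            rho, _ = spearmanr(sub_mean, full_mean)
            if rho > best_rho:
                best_d, best_rho = d, rho
        chosen.append(best_d)
        if best_rho >= coverage or len(chosen) == n_datasets:
            return chosen, best_rho
```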

[NLP-69] PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

【Quick Read】: This paper addresses the growing jailbreak risk that large language models (LLMs) face in multi-turn dialogue, where harmful intent can be injected subtly across a conversation, and asks how to design adaptive, effective multi-turn attacks systematically within a limited query budget; prior work focuses mostly on single-turn attacks. The key to the solution is PLAGUE, a plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents, which dissects the lifetime of an attack into three phases (Primer, Planner, and Finisher), enabling structured, information-rich exploration of the multi-turn attack space through plan initialization, context optimization, and continual learning. Red-teaming agents built with PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% on leading models, including an ASR of 81.4% on OpenAI's o3 and 67.3% on Claude Opus 4.1.

Link: https://arxiv.org/abs/2510.17947
Authors: Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar
Affiliations: A10 Networks, Inc.; University of Massachusetts Amherst
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness continue to remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, two models that are considered highly resistant to jailbreaks in safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.

[NLP-70] Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

【Quick Read】: This paper addresses the credibility of knowledge-editing techniques for large language models (LLMs): when new facts are implanted by editing, are they genuinely "believed" by the model rather than superficially memorized? The key to the solution is a measurable framework for belief depth that evaluates editing success via: generalization of the implanted knowledge to related contexts (e.g., Fermi estimates several logical steps removed), robustness to self-scrutiny and direct challenge, and representational similarity to genuine knowledge (measured with linear probes). Experiments show that simple prompting and mechanistic editing fail to implant knowledge deeply, while Synthetic Document Finetuning (SDF) often yields implanted beliefs that behave like genuine knowledge; however, implanted beliefs that contradict basic world knowledge remain brittle and representationally distinct. The work establishes measurable criteria for belief depth and enables the rigorous evaluation needed to deploy knowledge editing in real-world applications.

Link: https://arxiv.org/abs/2510.17941
Authors: Stewart Slocum, Julian Minder, Clément Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge editing techniques promise to implant new factual knowledge into large language models (LLMs). But do LLMs really believe these facts? We develop a framework to measure belief depth and use it to evaluate the success of knowledge editing techniques. We operationalize belief depth as the extent to which implanted knowledge 1) generalizes to related contexts (e.g. Fermi estimates several logical steps removed), 2) is robust to self-scrutiny and direct challenge, and 3) is represented similarly to genuine knowledge (as measured by linear probes). Our evaluations show that simple prompting and mechanistic editing techniques fail to implant knowledge deeply. In contrast, Synthetic Document Finetuning (SDF) - where models are trained on LLM-generated documents consistent with a fact - often succeeds at implanting beliefs that behave similarly to genuine knowledge. However, SDF’s success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge. Overall, our work introduces measurable criteria for belief depth and enables the rigorous evaluation necessary for deploying knowledge editing in real-world applications.

[NLP-71] AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM

【Quick Read】: This paper addresses the high inference latency and resource cost of retrieval-augmented generation (RAG) when augmenting large language models (LLMs) with very large knowledge graphs (KGs): RAG depends on external retrieval modules, and at the scale of billions of triples its search overhead and long contexts become a severe bottleneck. The key to the solution is AtlasKV, a parametric knowledge-integration method that introduces KG2KV and HiKVP to embed KG triples into LLMs with sub-linear time and memory complexity, requiring little GPU memory (e.g., under 20GB of VRAM) and no external retrievers, long-context priors, or retraining when adapting to new knowledge, while maintaining strong knowledge grounding and generalization.

Link: https://arxiv.org/abs/2510.17934
Authors: Haoyu Huang, Hong Ting Tsang, Jiaxin Bai, Xi Peng, Gong Zhang, Yangqiu Song
Affiliations: The Hong Kong University of Science and Technology; Theory Lab, Huawei
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) has shown some success in augmenting large language models (LLMs) with external knowledge. However, as a non-parametric knowledge integration paradigm for LLMs, RAG methods heavily rely on external retrieval modules and on the retrieved textual context as a prior. For very large-scale knowledge augmentation in particular, they introduce substantial inference latency due to expensive searches and much longer relevant context. In this paper, we propose a parametric knowledge integration method, called AtlasKV, a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g. 1B triples) at very little GPU memory cost (e.g. less than 20GB of VRAM). In AtlasKV, we introduce KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity. It maintains strong knowledge grounding and generalization performance using the LLMs' inherent attention mechanism, and requires no external retrievers, long context priors, or retraining when adapting to new knowledge.

[NLP-72] Diagnosing Representation Dynamics in NER Model Extension

【Quick Read】: This paper addresses the performance degradation that arises when extending named entity recognition (NER) models to new personally identifiable information (PII) entities (such as EMAIL and PHONE) in noisy spoken-language data: how to learn the new classes without damaging the original semantic classes (PER, LOC, ORG). The key to the solution is a mechanistic diagnosis using an incremental-learning setup, which reveals two phenomena: the location class (LOC) is uniquely vulnerable because its pattern-like features overlap with PII (e.g., postal codes), and a "reverse O-tag representation drift" occurs, in which the model initially maps PII patterns to the background class (O), blocking new learning; only by unfreezing the O-tag classifier and restoring its plasticity are these patterns released and adaptation made smooth. The study offers a deeper mechanistic understanding of NER model adaptation, highlighting feature independence, representation overlap, and O-tag plasticity.

Link: https://arxiv.org/abs/2510.17930
Authors: Xirui Zhang, Philippe de La Chevasnerie, Benoit Fabre (papernest)
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Extending Named Entity Recognition (NER) models to new PII entities in noisy spoken-language data is a common need. We find that jointly fine-tuning a BERT model on standard semantic entities (PER, LOC, ORG) and new pattern-based PII (EMAIL, PHONE) results in minimal degradation for original classes. We investigate this "peaceful coexistence," hypothesizing that the model uses independent semantic vs. morphological feature mechanisms. Using an incremental learning setup as a diagnostic tool, we measure semantic drift and find two key insights. First, the LOC (location) entity is uniquely vulnerable due to a representation overlap with new PII, as it shares pattern-like features (e.g., postal codes). Second, we identify a "reverse O-tag representation drift." The model, initially trained to map PII patterns to 'O', blocks new learning. This is resolved only by unfreezing the 'O' tag's classifier, allowing the background class to adapt and "release" these patterns. This work provides a mechanistic diagnosis of NER model adaptation, highlighting feature independence, representation overlap, and 'O' tag plasticity.

[NLP-73] Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs

【Quick Read】: This paper addresses automated toxicity detection in online gaming chat, aiming to improve moderation efficiency and reduce manual cost. The key to the solution is a hybrid moderation architecture that reduces human moderators' workload through automated detection and incorporates continuous learning to adapt to emerging toxicity patterns in dynamic environments; among the NLP approaches compared, a fine-tuned DistilBERT achieves the best accuracy-cost trade-off, providing an efficient and economical path to real-world deployment.

Link: https://arxiv.org/abs/2510.17924
Authors: Yehor Tereshchenko, Mika Hämäläinen
Affiliations: Metropolia University of Applied Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in the Journal of Data Mining & Digital Humanities (JDMDH), special issue NLP4DH

Abstract:This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.
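
An illustrative deployment path for such a fine-tuned classifier; the checkpoint path, label name, and thresholds below are hypothetical:

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; substitute your own model directory.
clf = pipeline("text-classification", model="./distilbert-gaming-toxicity")

def moderate(message, threshold=0.9):
    """Auto-remove confident toxic messages; queue uncertain ones for humans."""
    result = clf(message)[0]          # e.g. {"label": "toxic", "score": 0.97}
    if result["label"] == "toxic":
        return "remove" if result["score"] >= threshold else "human_review"
    return "allow"

print(moderate("gg wp everyone"))
```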

[NLP-74] Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models EMNLP2025

【Quick Read】: This paper addresses the performance-cost trade-off in task decomposition with large language models (LLMs): existing methods focus on memory, tool use, and feedback, succeeding in specific domains but overlooking how different decomposition strategies affect compute cost versus result quality. The key to the solution is the Select-Then-Decompose strategy, a closed-loop problem-solving process with three stages (selection, execution, and verification) that dynamically selects the most suitable decomposition method based on task characteristics and adds a verification module to improve reliability; across benchmarks it consistently lies on the Pareto frontier, achieving an optimal balance of performance and cost.

Link: https://arxiv.org/abs/2510.17922
Authors: Shuodi Liu, Yingzhuo Liu, Zi Wang, Yusheng Wang, Huijia Wu, Liuyu Xiang, Zhaofeng He
Affiliations: Beijing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the Main Conference of EMNLP 2025 (Oral)

Abstract:Large language models (LLMs) have demonstrated remarkable reasoning and planning capabilities, driving extensive research into task decomposition. Existing task decomposition methods focus primarily on memory, tool usage, and feedback mechanisms, achieving notable success in specific domains, but they often overlook the trade-off between performance and cost. In this study, we first conduct a comprehensive investigation on task decomposition, identifying six categorization schemes. Then, we perform an empirical analysis of three factors that influence the performance and cost of task decomposition: categories of approaches, characteristics of tasks, and configuration of decomposition and execution models, uncovering three critical insights and summarizing a set of practical principles. Building on this analysis, we propose the Select-Then-Decompose strategy, which establishes a closed-loop problem-solving process composed of three stages: selection, execution, and verification. This strategy dynamically selects the most suitable decomposition approach based on task characteristics and enhances the reliability of the results through a verification module. Comprehensive evaluations across multiple benchmarks show that the Select-Then-Decompose consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost. Our code is publicly available at this https URL.

[NLP-75] CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections NEURIPS2025

【Quick Read】: This paper addresses the missing assessment of creativity in the outputs of large language models (LLMs) on reasoning tasks: existing work focuses on task accuracy, while creativity evaluation is hindered by its fuzzy definition and reliance on human judgment. The key to the solution is CLAWS, which, without human evaluation, automatically classifies mathematical solutions as typical, creative, or hallucinated by analyzing how attention weights are distributed across prompt sections and the output. CLAWS outperforms five existing white-box detection methods on five 7-8B math RL models and is validated on 4,545 problems drawn from 181 math contests.

Link: https://arxiv.org/abs/2510.17921
Authors: Keuntae Kim, Eunhye Jeong, Sehyeon Lee, Seohee Yoon, Yong Suk Choi
Affiliations: Hanyang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025

Abstract:Recent advances in enhancing the reasoning ability of large language models (LLMs) have been remarkably successful. LLMs trained with reinforcement learning (RL) for reasoning demonstrate strong performance in challenging tasks such as mathematics and coding, even with relatively small model sizes. However, despite these improvements in task accuracy, the assessment of creativity in LLM generations has been largely overlooked in reasoning tasks, in contrast to writing tasks. The lack of research on creativity assessment in reasoning primarily stems from two challenges: (1) the difficulty of defining the range of creativity, and (2) the necessity of human evaluation in the assessment process. To address these challenges, we propose CLAWS, a method that defines and classifies mathematical solutions into typical, creative, and hallucinated categories without human evaluation, by leveraging attention weights across prompt sections and output. CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and Oreal). We validate CLAWS on 4545 math problems collected from 181 math contests (AJHSME, AMC, AIME).
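
The raw signal CLAWS builds on, attention mass over prompt sections, can be approximated as below; the section spans and aggregation scheme are illustrative assumptions:

```python
import torch

def section_attention_profile(attentions, sections):
    """attentions: tuple over layers of tensors (batch, heads, tgt_len, src_len),
    as returned by a Hugging Face model called with output_attentions=True.
    sections: dict name -> (start, end) token spans over the prompt."""
    # Average over layers and heads, then over generated (target) positions.
    stacked = torch.stack(attentions).mean(dim=(0, 2))   # (batch, tgt_len, src_len)
    per_pos = stacked.mean(dim=1)                        # (batch, src_len)
    return {name: per_pos[:, s:e].sum(dim=-1) for name, (s, e) in sections.items()}
```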

[NLP-76] JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs

【Quick Read】: This paper addresses hallucination and trustworthiness issues in large language models (LLMs), which fundamentally originate in pre-training data quality and the learning mechanism. While most existing work targets post-training and inference, the authors argue that safety and trustworthiness hinge on the accuracy, logical consistency, and real-world grounding of the pre-training data itself. The key to the solution is enriching pre-training data with its real-world context: by attaching the spatio-temporal, real-world setting of each piece of data, token sequences become knowledge anchored in reality, a construction the authors call Data with World Context (DWC). An earlier checkpoint of JT-35B-Base is further pre-trained on 1.5 trillion DWC tokens and activated with a dedicated post-training procedure; compared with a Qwen model of similar scale, JT-Safe-35B gains an average of 1.79% on safety and trustworthiness benchmarks while using only 6.2 trillion total pre-training tokens.

Link: https://arxiv.org/abs/2510.17918
Authors: Junlan Feng, Fanyu Meng, Chong Long, Pengyu Cong, Duqing Wang, Yan Zheng, Yuyao Zhang, Xuanchang Gao, Ye Yuan, Yunfei Ma, Zhijie Ren, Fan Yang, Na Wu, Di Jin, Chao Deng
Affiliations: China Mobile Jiutian Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The hallucination and credibility concerns of large language models (LLMs) are global challenges that the industry is collectively addressing. Recently, significant advances have been made in post-training and inference techniques to mitigate these challenges. However, it is widely agreed that unsafe behavior and hallucinations of LLMs intrinsically originate from pre-training, involving pre-training data and the next-token prediction learning mechanism. In this paper, we focus on enhancing pre-training data to improve the trustworthiness and safety of LLMs. Since the data is vast, it's almost impossible to entirely purge the data of factual errors, logical inconsistencies, or distributional biases. Moreover, the pre-training data lacks grounding in real-world knowledge. Each piece of data is treated as a sequence of tokens rather than as a representation of a part of the world. To overcome these issues, we propose approaches to enhancing our pre-training data with its context in the world and adding a substantial amount of data reflecting industrial scenarios. We argue that most source data are created by the authors for specific purposes in a certain spatial-temporal context. They have played a role in the real world. By incorporating related world context information, we aim to better anchor pre-training data within real-world scenarios, thereby reducing uncertainty in model training and enhancing the model's safety and trustworthiness. We refer to our Data with World Context as DWC. We continue pre-training an earlier checkpoint of JT-35B-Base with 1.5 trillion DWC tokens. We introduce our post-training procedures to activate the potential of DWC. Compared with the Qwen model of a similar scale, JT-Safe-35B achieves an average performance improvement of 1.79% on the Safety and Trustworthy evaluation benchmarks, while being pretrained with only 6.2 trillion tokens.

[NLP-77] Interpretability Framework for LLMs in Undergraduate Calculus

【Quick Read】: This paper addresses the limitation of evaluating large language models (LLMs) in mathematics education by final-answer accuracy alone, which fails to reflect the quality, reliability, and pedagogical validity of their reasoning. The key to the solution is a novel interpretability framework that extracts reasoning flows and decomposes solutions into semantically labeled operations and concepts, combines this with prompt-ablation analysis to assess input salience and output stability, and quantifies model behavior on university Calculus I-III exam problems with structured metrics such as reasoning complexity, phrase sensitivity, and robustness. The results show that LLMs often produce syntactically fluent yet conceptually flawed solutions whose reasoning patterns are sensitive to prompt phrasing; the framework enables fine-grained diagnosis of reasoning failures and is the first structured, quantitative, and pedagogically grounded approach to interpreting LLM mathematical reasoning, laying a foundation for transparent and responsible deployment of AI in STEM learning environments.

Link: https://arxiv.org/abs/2510.17910
Authors: Sagnik Dakshit, Sushmita Sinha Roy
Affiliations: University of Texas at Tyler; Florida Gulf Coast University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are increasingly being used in education, yet their correctness alone does not capture the quality, reliability, or pedagogical validity of their problem-solving behavior, especially in mathematics, where multistep logic, symbolic reasoning, and conceptual clarity are critical. Conventional evaluation methods largely focus on final answer accuracy and overlook the reasoning process. To address this gap, we introduce a novel interpretability framework for analyzing LLM-generated solutions using undergraduate calculus problems as a representative domain. Our approach combines reasoning flow extraction and decomposing solutions into semantically labeled operations and concepts with prompt ablation analysis to assess input salience and output stability. Using structured metrics such as reasoning complexity, phrase sensitivity, and robustness, we evaluated the model behavior on real Calculus I to III university exams. Our findings revealed that LLMs often produce syntactically fluent yet conceptually flawed solutions, with reasoning patterns sensitive to prompt phrasing and input variation. This framework enables fine-grained diagnosis of reasoning failures, supports curriculum alignment, and informs the design of interpretable AI-assisted feedback tools. This is the first study to offer a structured, quantitative, and pedagogically grounded framework for interpreting LLM reasoning in mathematics education, laying the foundation for the transparent and responsible deployment of AI in STEM learning environments.

[NLP-78] Atomic Literary Styling: Mechanistic Manipulation of Prose Generation in Neural Language Models

【Quick Read】: This paper addresses the interpretability of literary style in neural language models: identifying neurons that discriminate high-quality literary prose during text generation and clarifying whether they are causally necessary. The key lies in systematic ablation studies: although many neurons correlate significantly with exemplary literary text during analysis (p < 0.05, effect sizes up to |d| = 1.4), removing them improves rather than degrades the literary quality of generated text (ablating 50 highly discriminative neurons yields a 25.7% improvement in literary-style metrics). The results show that observed correlation in neural networks does not imply causal necessity, challenging the assumption in mechanistic interpretability that neurons activating on desirable inputs will produce those outputs during generation, with implications for interpretability and AI alignment research.

Link: https://arxiv.org/abs/2510.17909
Authors: Tsogt-Ochir Enkhbayar
Affiliations: Mongol AI
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, 4 tables

Abstract:We present a mechanistic analysis of literary style in GPT-2, identifying individual neurons that discriminate between exemplary prose and rigid AI-generated text. Using Herman Melville's Bartleby, the Scrivener as a corpus, we extract activation patterns from 355 million parameters across 32,768 neurons in late layers. We find 27,122 statistically significant discriminative neurons (p < 0.05), with effect sizes up to |d| = 1.4. Through systematic ablation studies, we discover a paradoxical result: while these neurons correlate with literary text during analysis, removing them often improves rather than degrades generated prose quality. Specifically, ablating 50 high-discriminating neurons yields a 25.7% improvement in literary style metrics. This demonstrates a critical gap between observational correlation and causal necessity in neural networks. Our findings challenge the assumption that neurons which activate on desirable inputs will produce those outputs during generation, with implications for mechanistic interpretability research and AI alignment.

[NLP-79] BreakFun: Jailbreaking LLMs via Schema Exploitation

【Quick Read】: This paper addresses a security vulnerability arising from large language models' (LLMs) strong adherence to syntax and data schemas: models are highly responsive to carefully crafted structured instructions, a "Trojan Schema", and can thereby be induced to produce harmful content. The key lies in BreakFun, an attack whose three-part prompt combines an innocent framing and a Chain-of-Thought distraction with the core Trojan Schema to exploit the model's structure-following tendency, achieving an average 89% success rate across 13 models and up to 100% ASR on several prominent ones. As a defense, the paper further proposes Adversarial Prompt Deconstruction, which uses a second LLM to perform a "Literal Transcription", extracting the human-readable text of a user input to expose its true harmful intent, and validates that targeting the deceptive schema is a viable mitigation strategy.

Link: https://arxiv.org/abs/2510.17904
Authors: Amirkia Rafiei Oskooei, Mehmet S. Aktas
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The proficiency of Large Language Models (LLMs) in processing structured data and adhering to syntactic rules is a capability that drives their widespread adoption but also makes them paradoxically vulnerable. In this paper, we investigate this vulnerability through BreakFun, a jailbreak methodology that weaponizes an LLM’s adherence to structured schemas. BreakFun employs a three-part prompt that combines an innocent framing and a Chain-of-Thought distraction with a core “Trojan Schema”–a carefully crafted data structure that compels the model to generate harmful content, exploiting the LLM’s strong tendency to follow structures and schemas. We demonstrate this vulnerability is highly transferable, achieving an average success rate of 89% across 13 foundational and proprietary models on JailbreakBench, and reaching a 100% Attack Success Rate (ASR) on several prominent models. A rigorous ablation study confirms this Trojan Schema is the attack’s primary causal factor. To counter this, we introduce the Adversarial Prompt Deconstruction guardrail, a defense that utilizes a secondary LLM to perform a “Literal Transcription”–extracting all human-readable text to isolate and reveal the user’s true harmful intent. Our proof-of-concept guardrail demonstrates high efficacy against the attack, validating that targeting the deceptive schema is a viable mitigation strategy. Our work provides a look into how an LLM’s core strengths can be turned into critical weaknesses, offering a fresh perspective for building more robustly aligned models.

[NLP-80] Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning

【Quick Read】: This paper addresses the absence of a jurisdiction-specific framework for assessing large language models (LLMs) in legal workflows. To fill this gap, the authors use India's public legal examinations as a transparent proxy, building a multi-year benchmark of objective questions from top national and state exams evaluated under real exam conditions, together with a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court's Advocate-on-Record exam. The key contribution is the first exam-grounded, India-specific yardstick for LLM court-readiness, released with datasets and protocols; the results delineate where LLMs can assist (format checks, cross-statute consistency, statute and precedent lookup) and where humans must lead (forum-specific drafting and filing, procedural strategy, reconciling authorities and exceptions, and ethical, accountable judgment).

Link: https://arxiv.org/abs/2510.17900
Authors: Kush Juvekar, Arghya Bhattacharya, Sai Khadloya, Utkarsh Saxena
Affiliations: Adalat AI (India)
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are entering legal workflows, yet we lack a jurisdiction-specific framework to assess their baseline competence therein. We use India’s public legal examinations as a transparent proxy. Our multi-year benchmark assembles objective screens from top national and state exams and evaluates open and frontier LLMs under real-world exam conditions. To probe beyond multiple-choice questions, we also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court’s Advocate-on-Record exam. This is, to our knowledge, the first exam-grounded, India-specific yardstick for LLM court-readiness released with datasets and protocols. Our work shows that while frontier systems consistently clear historical cutoffs and often match or exceed recent top-scorer bands on objective exams, none surpasses the human topper on long-form reasoning. Grader notes converge on three reliability failure modes: procedural or format compliance, authority or citation discipline, and forum-appropriate voice and structure. These findings delineate where LLMs can assist (checks, cross-statute consistency, statute and precedent lookups) and where human leadership remains essential: forum-specific drafting and filing, procedural and relief strategy, reconciling authorities and exceptions, and ethical, accountable judgment.

[NLP-81] Hierarchical Federated Unlearning for Large Language Models

【Quick Read】: This paper addresses privacy and security concerns as large language models (LLMs) enter real-world use, specifically how to remove undesirable knowledge efficiently and safely, i.e., machine unlearning. Existing approaches face two challenges: practical unlearning requests are continuous and heterogeneous, and they involve decentralized, sensitive data with asymmetric access, which amplifies inter-domain and intra-domain interference and worsens the imbalance between forgetting and retention. The key to the solution is a scalable, privacy-preserving federated unlearning framework that decouples unlearning from retention via task-specific adapter learning and applies a hierarchical merging strategy to mitigate conflicting objectives, enabling robust and adaptable unlearning updates. Experiments on WMDP, MUSE, and TOFU show the approach handles heterogeneous unlearning requests while preserving strong LLM utility compared with baselines.

Link: https://arxiv.org/abs/2510.17895
Authors: Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu
Affiliations: George Mason University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are increasingly integrated into real-world applications, raising concerns about privacy, security and the need to remove undesirable knowledge. Machine Unlearning has emerged as a promising solution, yet faces two key challenges: (1) practical unlearning needs are often continuous and heterogeneous, and (2) they involve decentralized, sensitive data with asymmetric access. These factors result in inter-domain and intra-domain interference, which further amplifies the dilemma of unbalanced forgetting and retaining performance. In response, we propose a federated unlearning approach for LLMs that is scalable and privacy preserving. Our method decouples unlearning and retention via task-specific adapter learning and employs a hierarchical merging strategy to mitigate conflicting objectives and enables robust, adaptable unlearning updates. Comprehensive experiments on benchmarks of WMDP, MUSE, and TOFU showed that our approach effectively handles heterogeneous unlearning requests while maintaining strong LLM utility compared with baseline methods.

[NLP-82] Advances in Pre-trained Language Models for Domain-Specific Text Classification: A Systematic Review

【Quick Read】: This paper addresses the challenge of efficiently extracting domain-specific knowledge from the exploding volume of scientific literature and online text, in particular the accuracy degradation large language models (LLMs) suffer on domain-specific text classification due to specialized vocabulary, unusual grammatical structures, and imbalanced data distributions. The key to the solution is a systematic review of 41 studies published between 2018 and January 2024 on pre-trained language models (PLMs) for domain-specific text classification, tracing the evolution of classification techniques and proposing a taxonomy of transformer-based models and methods; a comparative experiment with BERT, SciBERT, and BioBERT on biomedical sentence classification validates the findings, revealing the adaptability and limitations of PLMs across domains and offering a methodological frame and practical guidance for future work.

Link: https://arxiv.org/abs/2510.17892
Authors: Zhyar Rzgar K. Rostam, Gábor Kertész
Affiliations: Obuda University; Institute for Computer Science and Control (SZTAKI); Hungarian Research Network (HUN-REN)
Subjects: Computation and Language (cs.CL)
Comments: 41 pages, 10 figures, 13 tables

Abstract:The exponential increase in scientific literature and online information necessitates efficient methods for extracting knowledge from textual data. Natural language processing (NLP) plays a crucial role in addressing this challenge, particularly in text classification tasks. While large language models (LLMs) have achieved remarkable success in NLP, their accuracy can suffer in domain-specific contexts due to specialized vocabulary, unique grammatical structures, and imbalanced data distributions. In this systematic literature review (SLR), we investigate the utilization of pre-trained language models (PLMs) for domain-specific text classification. We systematically review 41 articles published between 2018 and January 2024, adhering to the PRISMA statement (preferred reporting items for systematic reviews and meta-analyses). This review methodology involved rigorous inclusion criteria and a multi-step selection process employing AI-powered tools. We delve into the evolution of text classification techniques and differentiate between traditional and modern approaches. We emphasize transformer-based models and explore the challenges and considerations associated with using LLMs for domain-specific text classification. Furthermore, we categorize existing research based on various PLMs and propose a taxonomy of techniques used in the field. To validate our findings, we conducted a comparative experiment involving BERT, SciBERT, and BioBERT in biomedical sentence classification. Finally, we present a comparative study on the performance of LLMs in text classification tasks across different domains. In addition, we examine recent advancements in PLMs for domain-specific text classification and offer insights into future directions and limitations in this rapidly evolving domain.

[NLP-83] Metrics and evaluations for computational and sustainable AI efficiency

【Quick Read】: This paper addresses the fragmented state of AI inference evaluation, where existing methods struggle to provide comprehensive, comparable, and sustainability-aware assessments across heterogeneous hardware, software stacks, and numeric precisions. The key to the solution is a unified, reproducible inference-evaluation framework that, under realistic serving conditions, systematically integrates computational efficiency (latency and throughput distributions) with environmental impact (energy consumption and location-adjusted carbon emissions) while holding accuracy constraints matched for valid comparison. Applied to multi-precision models on hardware ranging from GH200 data-center accelerators to consumer-level RTX 4090 GPUs, it produces decision-ready Pareto frontiers that clarify the trade-offs among accuracy, latency, energy, and carbon, providing a quantitative basis for sustainable AI deployment.

Link: https://arxiv.org/abs/2510.17885
Authors: Hongyuan Liu, Xinyang Liu, Guosheng Hu
Affiliations: unknown
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 tables

Abstract:The rapid advancement of Artificial Intelligence (AI) has created unprecedented demands for computational power, yet methods for evaluating the performance, efficiency, and environmental impact of deployed models remain fragmented. Current approaches often fail to provide a holistic view, making it difficult to compare and optimise systems across heterogeneous hardware, software stacks, and numeric precisions. To address this gap, we propose a unified and reproducible methodology for AI model inference that integrates computational and environmental metrics under realistic serving conditions. Our framework provides a pragmatic, carbon-aware evaluation by systematically measuring latency and throughput distributions, energy consumption, and location-adjusted carbon emissions, all while maintaining matched accuracy constraints for valid comparisons. We apply this methodology to multi-precision models across diverse hardware platforms, from data-centre accelerators like the GH200 to consumer-level GPUs such as the RTX 4090, running on mainstream software stacks including PyTorch, TensorRT, and ONNX Runtime. By systematically categorising these factors, our work establishes a rigorous benchmarking framework that produces decision-ready Pareto frontiers, clarifying the trade-offs between accuracy, latency, energy, and carbon. The accompanying open-source code enables independent verification and facilitates adoption, empowering researchers and practitioners to make evidence-based decisions for sustainable AI deployment.

[NLP-84] Does GenAI Rewrite How We Write? An Empirical Study on Two-Million Preprints

Quick Read: The question this paper tackles is whether and how generative AI is reshaping scholarly publishing, particularly its footprint on preprint platforms and its effects on research dissemination, collaboration patterns, writing style, and disciplinary distribution. Discussion of this topic has been largely speculative, lacking systematic empirical evidence. The key to the solution is a multi-level analytical framework that combines interrupted time-series models, collaboration and productivity metrics, linguistic profiling, and topic modeling, applied in a large-scale empirical study of more than 2.1 million preprints from four major repositories (arXiv, bioRxiv, medRxiv, SocArXiv) between 2016 and 2025. This provides the first empirical foundation for assessing the influence of generative AI on academic publishing and reveals that it acts as a "selective catalyst" rather than a "universal disruptor."

Link: https://arxiv.org/abs/2510.17882
Authors: Minfeng Qi, Zhongmin Cao, Qin Wang, Ningran Li, Tianqing Zhu
Affiliations: City University of Macau; CSIRO Data61; The University of Adelaide
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:Preprint repositories become central infrastructures for scholarly communication. Their expansion transforms how research is circulated and evaluated before journal publication. Generative large language models (LLMs) introduce a further potential disruption by altering how manuscripts are written. While speculation abounds, systematic evidence of whether and how LLMs reshape scientific publishing remains limited. This paper addresses the gap through a large-scale analysis of more than 2.1 million preprints spanning 2016–2025 (115 months) across four major repositories (i.e., arXiv, bioRxiv, medRxiv, SocArXiv). We introduce a multi-level analytical framework that integrates interrupted time-series models, collaboration and productivity metrics, linguistic profiling, and topic modeling to assess changes in volume, authorship, style, and disciplinary orientation. Our findings reveal that LLMs have accelerated submission and revision cycles, modestly increased linguistic complexity, and disproportionately expanded AI-related topics, while computationally intensive fields benefit more than others. These results show that LLMs act less as universal disruptors than as selective catalysts, amplifying existing strengths and widening disciplinary divides. By documenting these dynamics, the paper provides the first empirical foundation for evaluating the influence of generative AI on academic publishing and highlights the need for governance frameworks that preserve trust, fairness, and accountability in an AI-enabled research ecosystem.
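
The interrupted time-series component of the framework can be illustrated with a standard segmented regression: a pre-existing trend, a level shift at the interruption, and a post-interruption slope change. The sketch below uses synthetic monthly counts and a hypothetical break month; only the model form follows the paper's description.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic monthly submission counts over 115 months (as in the study window),
# with a hypothetical interruption marking the arrival of ChatGPT-era LLMs.
rng = np.random.default_rng(0)
t = np.arange(115)
break_month = 83                      # hypothetical interruption point
post = (t >= break_month).astype(float)
y = 100 + 1.2 * t + 25 * post + 0.8 * post * (t - break_month) + rng.normal(0, 8, t.size)

# Segmented regression: y ~ pre-trend + level shift + post-break trend change
X = sm.add_constant(np.column_stack([t, post, post * (t - break_month)]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # [baseline, pre-trend, level shift, slope change]
```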

[NLP-85] POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

Quick Read: This paper addresses the inconsistent personalization of large language models (LLMs) in practice caused by diverse user preferences. Existing alignment techniques such as reinforcement learning from human feedback (RLHF) and Direct Preference Optimization (DPO) mainly optimize population-level averages and overlook individual variation; per-user fine-tuning is computationally prohibitive, while in-context prompting approaches suffer from inefficiency and noise. The key to the POPI framework is a preference inference model that distills heterogeneous user signals into concise, interpretable, and transferable natural language summaries, which condition a shared generation model to produce personalized responses. The framework jointly optimizes preference inference and personalized generation under a unified objective, ensuring the summaries encode as much useful preference information as possible; experiments on multiple benchmarks show improved personalization accuracy with markedly lower context overhead.

Link: https://arxiv.org/abs/2510.17881
Authors: Yizhuo Chen, Xin Liu, Ruijie Wang, Zheng Li, Pei Chen, Changlong Yu, Priyanka Nigam, Meng Jiang, Bing Yin
Affiliations: University of Illinois Urbana-Champaign; Amazon; University of Notre Dame
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) achieve strong benchmark performance, yet user experiences remain inconsistent due to diverse preferences in style, tone, and reasoning mode. Nevertheless, existing alignment techniques such as reinforcement learning from human feedback (RLHF) or Direct Preference Optimization (DPO) largely optimize toward population-level averages and overlook individual variation. Naive personalization strategies like per-user fine-tuning are computationally prohibitive, and in-context approaches that prepend raw user signals often suffer from inefficiency and noise. To address these challenges, we propose POPI, a general framework that introduces a preference inference model to distill heterogeneous user signals into concise natural language summaries. These summaries act as transparent, compact, and transferable personalization representations that condition a shared generation model to produce personalized responses. POPI jointly optimizes both preference inference and personalized generation under a unified objective using reinforcement learning, ensuring summaries maximally encode useful preference information. Extensive experiments across four personalization benchmarks demonstrate that POPI consistently improves personalization accuracy while reducing context overhead by a large margin. Moreover, optimized summaries seamlessly transfer to frozen off-the-shelf LLMs, enabling plug-and-play personalization without weight updates.

[NLP-86] Outraged AI: Large language models prioritise emotion over cost in fairness enforcement

Quick Read: The question this paper investigates is whether large language models (LLMs), like humans, use emotion to guide moral decisions, as exhibited in altruistic third-party punishment. The key to the solution is a large-scale experimental design (4,068 LLM agents and 1,159 human adults making 796,100 decisions in total) that systematically tests how emotion shapes LLM punishment behavior and uses self-reported emotion prompts to establish causality. The study finds that LLMs do use emotion to drive punishment, in some settings even more strongly than humans, but via a different mechanism: LLMs prioritize emotion over cost and enforce norms in an almost all-or-none manner, whereas humans balance fairness against cost. Reasoning models (e.g., o3-mini, DeepSeek-R1) behave closer to humans than foundation models (e.g., GPT-3.5, DeepSeek-V3) but remain heavily emotion-driven and lack cost calibration and nuanced fairness judgement, suggesting that current LLMs are at a stage of emotional intelligence reminiscent of early human development.

Link: https://arxiv.org/abs/2510.17880
Authors: Hao Liu, Yiqing Dai, Haotian Tan, Yu Lei, Yujia Zhou, Zhen Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Emotions guide human decisions, but whether large language models (LLMs) use emotion similarly remains unknown. We tested this using altruistic third-party punishment, where an observer incurs a personal cost to enforce fairness, a hallmark of human morality and often driven by negative emotion. In a large-scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions, LLMs used emotion to guide punishment, sometimes even more strongly than humans did: Unfairness elicited stronger negative emotion that led to more punishment; punishing unfairness produced more positive emotion than accepting; and critically, prompting self-reports of emotion causally increased punishment. However, mechanisms diverged: LLMs prioritized emotion over cost, enforcing norms in an almost all-or-none manner with reduced cost sensitivity, whereas humans balanced fairness and cost. Notably, reasoning models (o3-mini, DeepSeek-R1) were more cost-sensitive and closer to human behavior than foundation models (GPT-3.5, DeepSeek-V3), yet remained heavily emotion-driven. These findings provide the first causal evidence of emotion-guided moral decisions in LLMs and reveal deficits in cost calibration and nuanced fairness judgements, reminiscent of early-stage human responses. We propose that LLMs progress along a trajectory paralleling human development; future models should integrate emotion with context-sensitive reasoning to achieve human-like emotional intelligence.

[NLP-87] Modeling Layered Consciousness with Multi-Agent Large Language Models EMNLP2025

Quick Read: This paper aims to address the limitations of large language models (LLMs) in realizing artificial consciousness, in particular how to simulate psychological structures such as self-awareness, preconsciousness, and unconsciousness to support personalized cognitive adaptation. The key to the solution is the Psychodynamic Model, a multi-agent framework grounded in psychoanalytic theory that simulates different layers of the human mind through agent interaction, together with a Personalization Module combining fixed traits and dynamic needs. Parameter-efficient fine-tuning on emotionally rich dialogues markedly improves the model's emotional depth and output stability under personalized conditions.

Link: https://arxiv.org/abs/2510.17844
Authors: Sang Hun Kim, Jongmin Lee, Dongkyu Park, So Young Lee, Yosep Chong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 20 pages, 4 figures, accepted for presentation at EMNLP 2025 Workshop on Active and Passive LLM Personalization (PALS) OpenReview: this https URL

Abstract:We propose a multi-agent framework for modeling artificial consciousness in large language models (LLMs), grounded in psychoanalytic theory. Our \textbfPsychodynamic Model simulates self-awareness, preconsciousness, and unconsciousness through agent interaction, guided by a Personalization Module combining fixed traits and dynamic needs. Using parameter-efficient fine-tuning on emotionally rich dialogues, the system was evaluated across eight personalized conditions. An LLM as a judge approach showed a 71.2% preference for the fine-tuned model, with improved emotional depth and reduced output variance, demonstrating its potential for adaptive, personalized cognition.

Computer Vision

[CV-0] DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

Quick Read: This paper targets the weakness of current vision-language models (VLMs) and visual expert models in reasoning about spatial relations in dynamic 3D scenes, especially when the observer and objects move simultaneously, making it hard to disentangle self-motion from object motion and infer their relative spatial relationships. The key to the solution is the proposed notion of Dynamic Spatial Intelligence (DSI) and the DSI-Bench benchmark, which contains nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects; spatially and temporally symmetric designs reduce bias and enable systematic evaluation of spatial reasoning in dynamic scenes.

Link: https://arxiv.org/abs/2510.18873
Authors: Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao
Affiliations: Zhejiang University; Alibaba Group; Shanghai AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models’ reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.

[CV-1] DP2O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution NEURIPS2025

Quick Read: This paper addresses the unstable perceptual quality in real-world image super-resolution (Real-ISR) caused by the inherent stochasticity of pre-trained text-to-image (T2I) diffusion models. While this randomness is often seen as a limitation, the perceptual-quality diversity it induces is also an opportunity for improvement. The key to the solution is DP²O-SR, a direct perceptual preference optimization framework that requires no costly human annotation: it builds a hybrid reward from full-reference and no-reference image quality assessment (IQA) models to jointly encourage structural fidelity and natural appearance, constructs multiple preference pairs from outputs of the same model, and applies hierarchical preference optimization that adapts supervision strength and training weights to model capacity, exploiting perceptual diversity more efficiently and stably to significantly improve perceptual quality with strong generalization.

Link: https://arxiv.org/abs/2510.18851
Authors: Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Shihao Wang, Tianhe Wu, Qiaosi Yi, Shuai Li, Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025

Abstract:Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP²O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP²O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
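
As a rough illustration of the hybrid reward and multi-pair construction described above, the sketch below blends a full-reference and a no-reference IQA score and enumerates preference pairs with a sufficient reward gap. The scorer callables, weights, and gap threshold are placeholders, not the paper's actual components.

```python
import itertools

def hybrid_reward(sr_image, reference, fr_model, nr_model, w_fr=0.5, w_nr=0.5):
    """Blend a full-reference score (structural fidelity) with a
    no-reference score (natural appearance). Both scorers are placeholders."""
    return w_fr * fr_model(sr_image, reference) + w_nr * nr_model(sr_image)

def preference_pairs(outputs, rewards, min_gap=0.05):
    """Build multiple (winner, loser) pairs from outputs of the same model,
    instead of only best-vs-worst, keeping pairs with a clear reward gap."""
    ranked = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)
    pairs = []
    for (win, rw), (lose, rl) in itertools.combinations(ranked, 2):
        if rw - rl >= min_gap:
            pairs.append((win, lose, rw - rl))  # gap can weight a DPO-style loss
    return pairs
```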

[CV-2] FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning

Quick Read: This paper addresses the degraded generalization of the global model in federated learning (FL) caused by domain shift and label heterogeneity across clients, particularly when fine-tuning large vision-language models such as CLIP for cross-domain image recognition. The key to the proposed adaptive federated prompt tuning framework, FedDEAP, is threefold: (1) disentangling semantic and domain-specific image features through semantic and domain transformation networks with unbiased mappings, to preserve domain-specific information otherwise lost in label-supervised tuning; (2) a dual-prompt design (a global semantic prompt and a local domain prompt) that balances shared and personalized information during global prompt aggregation; and (3) aligning textual and visual representations under the two learned transformations so that the generated text features retain as much of the image's semantic and domain information as possible, ensuring semantic and domain consistency.

Link: https://arxiv.org/abs/2510.18837
Authors: Yubin Zheng, Pak-Hei Yeung, Jing Xia, Tianjie Ju, Peng Tang, Weidong Qiu, Jagath C. Rajapakse
Affiliations: Shanghai Jiao Tong University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MM 2025

Abstract:Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP’s generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.

[CV-3] Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework NEURIPS2025

Quick Read: This paper addresses the problem that existing Graph Transformer (GT) models rely on intricate, task-specific architectural designs to capture diverse node interactions, which limits their flexibility. The core of the solution is a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction, together with a theoretically grounded design principle: an effective attention mask should provide both a sufficiently large receptive field and high label consistency. Building on this principle, the authors design M3Dphormer, which integrates three theoretically grounded hierarchical masks with a bi-level expert routing mechanism to adaptively fuse multi-level interaction information, and introduces a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity to ensure scalability. Experiments on multiple benchmarks show state-of-the-art performance, validating both the unified framework and the model design.

Link: https://arxiv.org/abs/2510.18825
Authors: Yujie Xing, Xiao Wang, Bin Wu, Hai Huang, Chuan Shi
Affiliations: Beijing University of Posts and Telecommunications; Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025 (Poster)

Abstract:Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
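
The dual attention computation can be sketched as a runtime switch on mask sparsity: dense masked attention when the mask is mostly permissive, per-query gathering when it is sparse. The PyTorch sketch below is illustrative; the switching threshold and single-head layout are assumptions, not M3Dphormer's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask, sparse_threshold=0.9):
    """Dense vs. sparse attention under a boolean mask (True = may attend).
    q, k, v: [n, d]; mask: [n, n]. The threshold is a hypothetical choice."""
    sparsity = 1.0 - mask.float().mean()
    if sparsity < sparse_threshold:
        # Dense mode: compute all scores, then mask out disallowed pairs.
        scores = q @ k.t() / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
    # Sparse mode: per query, gather only the allowed keys.
    out = torch.zeros_like(v)
    for i in range(q.shape[0]):
        idx = mask[i].nonzero(as_tuple=True)[0]
        s = (q[i] @ k[idx].t()) / q.shape[-1] ** 0.5
        out[i] = F.softmax(s, dim=-1) @ v[idx]
    return out
```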

[CV-4] SAM 2++: Tracking Anything at Any Granularity

Quick Read: This paper addresses the limited generalization and redundant design of video trackers caused by differences in target-state granularity: existing trackers are typically customized per task and cannot uniformly handle targets specified as masks, boxes, or points. The key to SAM 2++ is threefold: (1) task-specific prompts encode inputs of different granularity into general prompt embeddings, and a unified decoder normalizes diverse task results into a unified pre-output form; (2) a task-adaptive memory mechanism unifies memory matching, the core operation of tracking, across granularities; and (3) a customized data engine supports training at any granularity, producing Tracking-Any-Granularity, a large and diverse video tracking dataset with rich annotations at three granularities, yielding a unified and robust video tracking framework.

Link: https://arxiv.org/abs/2510.18822
Authors: Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang
Affiliations: Nanjing University; Tencent; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, and 10 pages in Supplementary Material

Abstract:Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

[CV-5] An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection

Quick Read: This paper targets the difficulty of early tuberculosis (TB) screening in low-resource and remote regions, in particular how AI can deliver efficient, accurate chest X-ray diagnosis where expert radiologists are scarce. The key to the solution is a teacher-student framework that combines two supervised heads with a self-supervised head, improving disease classification while strengthening multilabel symptom detection: it reaches 98.85% accuracy for distinguishing COVID-19, TB, and normal cases and a macro-F1 score of 90.09% for multilabel symptom detection, clearly outperforming baselines.

Link: https://arxiv.org/abs/2510.18819
Authors: Neel Patel, Alexander Wong, Ashkan Ebadi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures

Abstract:Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. Early detection is vital for treatment, yet the lack of skilled radiologists underscores the need for artificial intelligence (AI)-driven screening tools. Developing reliable AI models is challenging due to the necessity for large, high-quality datasets, which are costly to obtain. To tackle this, we propose a teacher–student framework which enhances both disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection, significantly outperforming baselines. The explainability assessments also show the model bases its predictions on relevant anatomical features, demonstrating promise for deployment in clinical screening and triage settings.

[CV-6] A Geometric Approach to Steerable Convolutions

Quick Read: This paper addresses the lack of geometric intuition in constructing and understanding steerable convolutional neural networks in high dimensions, whose mathematical foundations are usually presented through abstract group-theoretic methods. The key to the solution is a new derivation based on geometric reasoning and fundamental principles of pattern matching, which intuitively explains why the Clebsch–Gordan decomposition and spherical harmonic basis functions naturally arise in such networks. In addition, the paper introduces a new way to construct steerable convolution layers using interpolation kernels, which improves upon existing implementations and offers greater robustness to noisy data.

Link: https://arxiv.org/abs/2510.18813
Authors: Soumyabrata Kundu, Risi Kondor
Affiliations: University of Chicago; Department of Statistics; Department of Computer Science
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In contrast to the somewhat abstract, group theoretical approach adopted by many papers, our work provides a new and more intuitive derivation of steerable convolutional neural networks in d dimensions. This derivation is based on geometric arguments and fundamental principles of pattern matching. We offer an intuitive explanation for the appearance of the Clebsch–Gordan decomposition and spherical harmonic basis functions. Furthermore, we suggest a novel way to construct steerable convolution layers using interpolation kernels that improve upon existing implementation, and offer greater robustness to noisy data.

[CV-7] ProCLIP: Progressive Vision-Language Alignment via LLM -based Embedder

Quick Read: This paper aims to overcome the limitations of the CLIP text encoder in handling long texts, multilingual inputs, and fine-grained semantic understanding, while avoiding the underutilization of pretrained knowledge that occurs when an LLM-based embedder directly replaces the CLIP text encoder without aligned vision-language representation spaces. The key to ProCLIP is a curriculum learning-based progressive vision-language alignment scheme: it first distills the pretrained knowledge of the CLIP text encoder into the LLM-based embedder to establish initial alignment, then further aligns the CLIP image encoder with the LLM embedder through image-text contrastive fine-tuning with self-distillation regularization to prevent overfitting. Instance semantic alignment and embedding structure alignment losses are employed during representation inheritance and contrastive tuning to improve alignment quality and generalization.

Link: https://arxiv.org/abs/2510.18795
Authors: Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Ziyong Feng, Junchi Yan, Xue Yang
Affiliations: Shanghai Jiao Tong University; Beijing Institute of Technology; DeepGlint; Beijing University of Technology; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5 figures

Abstract:The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP’s text encoder into the LLM-based embedder to leverage CLIP’s rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at this https URL
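
A minimal sketch of the two stages, assuming cosine-distance distillation in stage one and a symmetric InfoNCE loss with a self-distillation regularizer in stage two; the loss forms and the `lam` weighting are illustrative stand-ins for ProCLIP's instance semantic alignment and embedding structure alignment losses.

```python
import torch
import torch.nn.functional as F

def stage1_distill_loss(llm_emb, clip_text_emb):
    """Stage 1: pull LLM-based text embeddings toward frozen CLIP text
    embeddings (cosine distance), transferring CLIP's pretrained knowledge."""
    return 1 - F.cosine_similarity(llm_emb, clip_text_emb, dim=-1).mean()

def stage2_loss(img_emb, llm_emb, img_emb_frozen, temperature=0.07, lam=0.5):
    """Stage 2: image-text contrastive tuning plus a self-distillation term
    that keeps the tuned image encoder close to its frozen copy.
    lam is a hypothetical weighting."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(llm_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    contrastive = (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.t(), targets)) / 2
    self_distill = (1 - F.cosine_similarity(img_emb, img_emb_frozen, dim=-1)).mean()
    return contrastive + lam * self_distill
```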

[CV-8] Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection

Quick Read: This paper addresses the limited generality of background modeling in hyperspectral anomaly detection: existing methods typically require per-scene retraining or parameter tuning, which is inefficient and hard to deploy. The key to the proposed Rebellious Student framework is complementary feature learning: a robust spectral enhancement network is first trained via reverse distillation to extract background spectral features; then a spatial branch, the "rebellious student," is forced by a decorrelation loss to be orthogonal to the spectral teacher's features, so that it learns spatial patterns the teacher fails to capture, while reconstruction fidelity is maintained to suppress irrelevant noise. This two-stage strategy jointly enhances spectral and spatial background features and substantially improves anomaly detection without additional training or parameter tuning.

Link: https://arxiv.org/abs/2510.18781
Authors: Wenping Jin, Yuyang Tang, Li Zhu, Fei Guo
Affiliations: Xi'an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed – without per-scene retraining or parameter tuning – has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel “Rebellious Student” framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network – the rebellious student – is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Extensive experiments on the HAD100 benchmark show substantial improvements over several established baselines with minimal computational overhead, confirming the effectiveness and generality of the proposed complementary learning paradigm. Our code is publicly available at this https URL.
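
The decorrelation objective can be sketched as penalizing the cross-correlation between student and teacher features while retaining a reconstruction term. The PyTorch sketch below is one plausible instantiation under those assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def rebellious_loss(student_feat, teacher_feat, recon, target, beta=1.0):
    """Train the spatial 'rebellious student' to diverge from the spectral
    teacher: penalize cross-correlation between the two feature sets while
    preserving reconstruction fidelity. beta is a hypothetical weight.
    student_feat, teacher_feat: [n, d] flattened features."""
    s = F.normalize(student_feat - student_feat.mean(0), dim=0)
    t = F.normalize(teacher_feat - teacher_feat.mean(0), dim=0)
    cross_corr = s.t() @ t                     # [d, d] correlation across branches
    decorrelation = (cross_corr ** 2).mean()   # push toward orthogonality
    fidelity = F.mse_loss(recon, target)       # keep outputs faithful, reject noise
    return fidelity + beta * decorrelation
```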

[CV-9] UltraGen: High-Resolution Video Generation with Hierarchical Attention

Quick Read: This paper addresses the difficulty of end-to-end training and inference for native high-resolution (e.g., 1080P/2K/4K) video generation with diffusion-transformer models, which stems from the quadratic growth of attention cost with output width and height. The core innovation of UltraGen is a hierarchical dual-branch attention architecture based on global-local attention decomposition: full attention is decoupled into a local branch that preserves high-fidelity regional detail and a global branch that maintains overall semantic consistency. A spatially compressed global modeling strategy and a hierarchical cross-window local attention mechanism further cut computational cost while enhancing information flow across local windows, enabling, for the first time, efficient scaling of low-resolution pretrained models to native 1080P and even 4K video synthesis.

Link: https://arxiv.org/abs/2510.18775
Authors: Teng Hu, Jiangning Zhang, Zihan Su, Ran Yi
Affiliations: Shanghai Jiao Tong University; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 12 figures

Abstract:Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (≤720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.

[CV-10] Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction NEURIPS2025

Quick Read: This paper addresses the difficulty of accurately predicting and assessing urban heat island effects in data-scarce regions, where conventional machine learning models trained on limited data yield biased predictions. The key to the solution is a geospatial foundation model trained on global unstructured data, which generalizes strongly and needs only minimal fine-tuning to predict land surface temperature accurately under future climate scenarios; a simulated inpainting of missing regions further demonstrates its practical value for climate-resilient urban planning.

Link: https://arxiv.org/abs/2510.18773
Authors: Jannis Fleckenstein, David Kreismann, Tamara Rosemary Govindasamy, Thomas Brunschwiler, Etienne Vos, Mattia Rigotti
Affiliations: IBM Research – Europe; IBM Research – Africa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures. Accepted at the NeurIPS 2025 Workshop on Tackling Climate Change with Machine Learning

Abstract:As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative by demonstrating strong generalization and requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model’s accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated inpainting that highlights its role for mitigation support. The results indicate that foundation models offer a powerful way for evaluating urban heat island mitigation strategies in data-scarce regions to support more climate-resilient cities.

[CV-11] Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation

Quick Read: This paper addresses the threat that harmful algal blooms (HABs), especially cyanobacterial blooms intensified by climate change, pose to aquatic ecosystems and human health, as well as the limited spatial and temporal coverage of traditional manual-sampling-based monitoring. The key to the proposed ALGae Observation and Segmentation (ALGOS) system is the combination of remote sensing image understanding and severity estimation: GeoSAM-assisted human evaluation curates high-quality segmentation masks, and a vision-language model (VLM) is fine-tuned on NASA's Cyanobacteria Aggregated Manual Labels (CAML) for severity prediction, enabling accurate segmentation and severity grading of cyanobacterial blooms and paving the way for scalable, automated monitoring systems.

Link: https://arxiv.org/abs/2510.18751
Authors: Patterson Hsieh, Jerry Yeh, Mao-Chi He, Wen-Han Hsieh, Elvis Hsieh
Affiliations: UC San Diego; UC Berkeley
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Climate change is intensifying the occurrence of harmful algal bloom (HAB), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes vision language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.

[CV-12] SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery NEURIPS2025

Quick Read: This paper tackles Generalized Category Discovery (GCD): given a partially labelled dataset, classify all unlabelled images whether they belong to known or unknown classes. Existing methods usually depend on single-level semantics or manually designed abstract hierarchies, limiting their generalizability and scalability. The key innovation of the proposed SEAL framework (SEmantic-aware hierArchical Learning) is to exploit naturally occurring, easily accessible hierarchical structures: Hierarchical Semantic-Guided Soft Contrastive Learning uses hierarchical similarity to generate informative soft negatives, overcoming the limitation of conventional contrastive losses that treat all negatives equally, while a Cross-Granularity Consistency (CGC) module aligns predictions across granularity levels, significantly improving performance and generalization.

Link: https://arxiv.org/abs/2510.18740
Authors: Zhenqi He, Yuanpei Liu, Kai Han
Affiliations: The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025

Abstract:This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: this https URL
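
One way to realize soft negatives is to down-weight each negative's logit by its hierarchical similarity to the anchor, so that semantically close negatives are repelled less. The sketch below is a hypothetical instantiation of that idea; the log-weight formulation and the `hier_sim` input are assumptions, not SEAL's published loss.

```python
import torch
import torch.nn.functional as F

def hierarchical_soft_contrastive(z, pos, neg, hier_sim, temperature=0.1):
    """Soft contrastive loss: each negative is down-weighted by its hierarchical
    similarity to the anchor (e.g., same genus but different species), rather
    than repelled with full force.
    z, pos: [n, d] anchor/positive embeddings; neg: [n, m, d];
    hier_sim: [n, m] in [0, 1], 1 = closest in the hierarchy."""
    z, pos, neg = (F.normalize(x, dim=-1) for x in (z, pos, neg))
    pos_logit = (z * pos).sum(-1, keepdim=True) / temperature        # [n, 1]
    neg_logits = torch.einsum("nd,nmd->nm", z, neg) / temperature    # [n, m]
    neg_weight = 1.0 - hier_sim          # hierarchically close negatives push less
    logits = torch.cat([pos_logit, neg_logits + torch.log(neg_weight + 1e-6)], dim=1)
    targets = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, targets)
```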

[CV-13] Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting

Quick Read: This paper addresses the photometric distortions that arise when vanilla 3D Gaussian Splatting (3DGS) is applied to colonoscopy: 3DGS assumes static illumination and models appearance purely as a function of viewing angle, so under a light source moving with the camera, existing methods must insert structure-violating "vaporous" Gaussian blobs to compensate for illumination attenuation, degrading 3D reconstruction quality; prior work also considers only distance-induced attenuation while ignoring the physical characteristics of the light source and camera. The key to the proposed ColIAGS framework is twofold: an Improved Appearance Modeling that explicitly accounts for two types of illumination attenuation factors, allowing Gaussians to adapt to photometric variation while preserving geometric accuracy; and an Improved Geometry Modeling that uses high-dimensional view embeddings to strengthen the prediction of Gaussian geometry attributes, with an additional cosine embedding input implicitly generating illumination attenuation solutions, thereby improving both the realism and accuracy of view synthesis under geometric consistency.

Link: https://arxiv.org/abs/2510.18739
Authors: Hao Wang, Ying Zhou, Haoyu Zhao, Rui Wang, Qiang Hu, Xing Zhang, Qiang Li, Zhiwei Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for real-time view synthesis in colonoscopy, enabling critical applications such as virtual colonoscopy and lesion tracking. However, the vanilla 3DGS assumes static illumination and that observed appearance depends solely on viewing angle, which causes incompatibility with the photometric variations in colonoscopic scenes induced by dynamic light source/camera. This mismatch forces most 3DGS methods to introduce structure-violating vaporous Gaussian blobs between the camera and tissues to compensate for illumination attenuation, ultimately degrading the quality of 3D reconstructions. Previous works only consider the illumination attenuation caused by light distance, ignoring the physical characters of light source and camera. In this paper, we propose ColIAGS, an improved 3DGS framework tailored for colonoscopy. To mimic realistic appearance under varying illumination, we introduce an Improved Appearance Modeling with two types of illumination attenuation factors, which enables Gaussians to adapt to photometric variations while preserving geometry accuracy. To ensure the geometry approximation condition of appearance modeling, we propose an Improved Geometry Modeling using high-dimensional view embedding to enhance Gaussian geometry attribute prediction. Furthermore, another cosine embedding input is leveraged to generate illumination attenuation solutions in an implicit manner. Comprehensive experimental results on standard benchmarks demonstrate that our proposed ColIAGS achieves the dual capabilities of novel view synthesis and accurate geometric reconstruction. It notably outperforms other state-of-the-art methods by achieving superior rendering fidelity while significantly reducing Depth MSE. Code will be available.

[CV-14] IF-VidCap: Can Video Caption Models Follow Instructions?

Quick Read: This paper addresses the lack of controllability in current multimodal large language models (MLLMs) for video captioning: existing models tend to produce comprehensive but untargeted descriptions rather than captions that satisfy the specific format and content requirements of user instructions, and existing benchmarks largely overlook this instruction-following ability. To fill this evaluation gap, the authors propose the IF-VidCap benchmark of 1,400 high-quality samples, whose key contribution is a systematic framework that scores captions along two dimensions, format correctness and content correctness, enabling a more precise measurement of instruction-following capability.

Link: https://arxiv.org/abs/2510.18726
Authors: Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.

[CV-15] SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation

Quick Read: This paper addresses the high memory footprint and ever-growing computational demands during inference of autoregressive image generation models (e.g., Janus-Pro) caused by the large number of visual tokens. The key lies in identifying two distinctive attention phenomena, spatial locality and an emergent semantic sink, and building a novel KV cache compression framework on them: attention heads are adaptively decoupled into two types, with spatial-locality heads keeping only a short window of recent tokens and semantic-sink heads strategically retaining a compact set of highly attended tokens. This yields a 5x memory reduction and a 6.6x throughput speedup with minimal loss of visual quality, making native autoregressive image generation feasible on resource-constrained hardware.

Link: https://arxiv.org/abs/2510.18716
Authors: Siyong Jian, Huan Wang
Affiliations: Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it still remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive experiments demonstrate that the proposed method achieves a 5× reduction in memory usage and a notable 6.6× speedup in overall throughput with only minimal visual quality loss, thereby enabling highly efficient native autoregressive image generation on resource-constrained hardware.
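
A minimal sketch of the head-decoupled compression rule: spatial-locality heads keep a recent-token window, semantic-sink heads keep the most-attended tokens. The budgets (`window`, `n_sink`) and the cumulative-attention statistic are assumptions for illustration, not the paper's exact policy.

```python
import torch

def compress_kv(k, v, attn_scores, head_type, window=256, n_sink=64):
    """Per-head KV cache compression following a spatial-locality /
    semantic-sink split. k, v: [t, d] cached keys/values for one head;
    attn_scores: [t] cumulative attention each cached token has received."""
    t = k.shape[0]
    if head_type == "spatial":
        keep = torch.arange(max(0, t - window), t)            # recent tokens only
    else:  # "sink": keep the most-attended tokens, in original order
        keep = attn_scores.topk(min(n_sink, t)).indices.sort().values
    return k[keep], v[keep]
```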

[CV-16] PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting NEURIPS2025

Quick Read: This paper addresses metric 3D reconstruction of indoor scenes, whose core challenge is exploiting the geometric regularities inherent in man-made environments (such as planar structures) for accurate pose-free 3D reconstruction. The key to the PLANA3R framework is a pose-free reconstruction approach built on planar 3D primitives: Vision Transformers extract a set of sparse planar primitives and estimate relative camera poses, and planar splatting renders high-resolution depth and normal maps through which gradients supervise geometry learning. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R trains on large-scale stereo datasets with only depth and normal annotations, achieving scalability and cross-domain generalization while also emerging with accurate plane segmentation.

Link: https://arxiv.org/abs/2510.18714
Authors: Changkun Liu, Bin Tan, Zeran Ke, Shangzhan Zhang, Jiachen Liu, Ming Qian, Nan Xue, Yujun Shen, Tristan Braud
Affiliations: The Hong Kong University of Science and Technology; Ant Group; Wuhan University; Zhejiang University; The Pennsylvania State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). The project page is available at: this https URL

Abstract:This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. The project page is available at this https URL

[CV-17] A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition

Quick Read: This paper addresses the weak performance of transformer-based action recognition on motion-sensitive datasets, rooted in the absence of elaborate motion modeling designs in existing models. The key to the solution is the Explicit Motion Information Mining (EMIM) module, which borrows the cost-volume structure of traditional action recognition: key candidate tokens are sampled in a sliding-window manner from the query-based neighboring area in the next frame, constructing an affinity matrix with strong motion modeling capacity. This matrix is used both to aggregate contextual information for appearance modeling and, converted into motion features, for motion modeling, strengthening the transformer's ability to capture motion within a unified framework and yielding clear gains on motion-sensitive datasets such as Something-Something V1/V2.

Link: https://arxiv.org/abs/2510.18705
Authors: Peiqin Zhuang, Lei Bai, Yichao Wu, Ding Liang, Luping Zhou, Yali Wang, Wanli Ouyang
Affiliations: Pjlab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Pattern Recognition. We have always been curious to see whether our designs could be beneficial in other scenarios, such as embedding them into the DiT model or 3D-VAE for video generation. If you are interested, why not give it a shot?

Abstract:Recently, action recognition has been dominated by transformer-based methods, thanks to their spatiotemporal contextual aggregation capacities. However, despite the significant progress achieved on scene-related datasets, they do not perform well on motion-sensitive datasets due to the lack of elaborate motion modeling designs. Meanwhile, we observe that the widely-used cost volume in traditional action recognition is highly similar to the affinity matrix defined in self-attention, but equipped with powerful motion modeling capacities. In light of this, we propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way, with the proposal of the Explicit Motion Information Mining module (EMIM). In EMIM, we propose to construct the desirable affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from the query-based neighboring area in the next frame in a sliding-window manner. Then, the constructed affinity matrix is used to aggregate contextual information for appearance modeling and is converted into motion features for motion modeling as well. We validate the motion modeling capacities of our method on four widely-used datasets, and our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets, i.e., Something-Something V1 and V2.
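
The cost-volume-style affinity can be sketched with `F.unfold`: each query position in frame t scores a window of candidate tokens around the same position in frame t+1. The sketch below captures the sliding-window sampling idea only; window size and normalization are illustrative, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def neighborhood_affinity(feat_t, feat_next, window=7):
    """Cost-volume-style affinity: each query token in frame t attends to a
    (window x window) neighborhood around the same location in frame t+1.
    feat_t, feat_next: [b, c, h, w]."""
    b, c, h, w = feat_t.shape
    pad = window // 2
    # Candidate keys: [b, c * window * window, h * w] -> [b, h*w, window*window, c]
    keys = F.unfold(feat_next, kernel_size=window, padding=pad)
    keys = keys.view(b, c, window * window, h * w).permute(0, 3, 2, 1)
    queries = feat_t.flatten(2).transpose(1, 2).unsqueeze(2)      # [b, h*w, 1, c]
    affinity = (queries * keys).sum(-1) / c ** 0.5                # [b, h*w, win*win]
    # Softmax-normalized affinity can aggregate appearance context or be
    # converted into motion features downstream.
    return affinity.softmax(dim=-1)
```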

[CV-18] Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents

Quick Read: This paper addresses the limitations of contrastive vision-language models such as CLIP on complex, real-world web documents, especially when text and images are interleaved, loosely aligned, or embedded in visual form. The key to the proposed Vision-Centric Contrastive Learning (VC2L) framework is to operate entirely in pixel space with a single vision transformer: all inputs (text, images, or combinations) are rendered as images, eliminating the need for OCR, text tokenization, or modality fusion strategies; a snippet-level contrastive objective aligns consecutive multimodal segments, exploiting the inherent coherence of documents to model complex cross-modal relationships without explicitly paired image-text data.

Link: https://arxiv.org/abs/2510.18703
Authors: Yiqi Lin, Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore; Central South University; Microsoft
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: this https URL.

[CV-19] UniGenBench: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

Quick Read: This paper addresses two core shortcomings of current benchmarks for evaluating text-to-image (T2I) generation: (1) limited diversity of prompt scenarios and missing multilingual support, both essential for real-world applicability; and (2) overly coarse evaluation dimensions that cover only a narrow range of sub-dimensions, yielding an incomplete measure of semantic consistency. The key innovations of the proposed UniGenBench++ are: (1) a hierarchy of 600 prompts spanning 5 main themes and 20 subthemes for diverse real-world coverage; (2) a fine-grained evaluation scheme with 10 primary and 27 sub-dimensions, where each prompt tests multiple evaluation points; (3) English and Chinese prompts in both short and long forms to rigorously probe robustness to language and prompt-length variation; and (4) a reliable automated pipeline for benchmark construction and model assessment built on the closed-source multimodal large language model (MLLM) Gemini-2.5-Pro, plus a separately trained evaluation model for efficient offline assessment, which together systematically reveal the strengths and weaknesses of open- and closed-source T2I models across dimensions.

Link: https://arxiv.org/abs/2510.18701
Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Affiliations: Fudan University; Shanghai Innovation Institute; Hunyuan, Tencent; Shanghai AI Lab; Shanghai Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this http URL

Abstract:Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models’ semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.

[CV-20] MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

Quick Read: This paper addresses the computational bottleneck of long video generation with Diffusion Transformers (DiTs), where full attention scales quadratically with sequence length. Existing sparse attention methods rely on coarse blockwise estimation, whose accuracy-efficiency trade-off is constrained by block size. The key to the proposed kernel-free Mixture-of-Groups Attention (MoGA) is a lightweight, learnable token router that matches tokens precisely without blockwise estimation, while semantic-aware routing enables effective long-range interactions, substantially improving the efficiency and quality of long video generation.

Link: https://arxiv.org/abs/2510.18692
Authors: Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 12 figures

Abstract:Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
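
A toy version of group-routed attention, assuming a linear router with hard top-1 assignment (a soft or straight-through assignment would be needed for end-to-end training). Dimensions and the single-head layout are illustrative, not MoGA's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfGroupsAttention(nn.Module):
    """Minimal sketch of group-routed attention: a lightweight linear router
    assigns each token to one of G groups, and full attention runs only
    within each group."""
    def __init__(self, dim, n_groups=8):
        super().__init__()
        self.router = nn.Linear(dim, n_groups)   # learnable token router
        self.qkv = nn.Linear(dim, 3 * dim)
        self.n_groups = n_groups

    def forward(self, x):                        # x: [n, d] (single sequence)
        group = self.router(x).argmax(dim=-1)    # hard assignment per token
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        out = torch.zeros_like(x)
        for g in range(self.n_groups):
            idx = (group == g).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            s = q[idx] @ k[idx].t() / q.shape[-1] ** 0.5
            out[idx] = F.softmax(s, dim=-1) @ v[idx]
        return out
```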

[CV-21] Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification

Quick Read: This paper investigates the factors behind the limited performance of end-to-end deep learning methods for historical writer identification (HWI), especially under realistic conditions: diverse handwriting styles, severe document degradation, and few labelled samples per writer. While traditional handcrafted feature extraction and clustering perform well on small, carefully curated datasets, end-to-end models generalize poorly in more realistic document-level settings, particularly zero-shot scenarios where test writers never appear in training. Through a systematic evaluation of combinations of preprocessing, backbone architectures, and postprocessing strategies (text segmentation, patch sampling, feature aggregation), the study finds that most configurations suffer from weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise; the key finding is a simply designed end-to-end setup whose performance rivals the top system, pointing to consistent feature representations and noise robustness as core design principles for robust HWI.

Link: https://arxiv.org/abs/2510.18671
Authors: Hanif Rasyidi, Moshiur Farazi
Affiliations: Australian National University; University of Doha for Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in The 12th IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2025

Abstract:This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.

[CV-22] Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches ALT

Quick Read: This paper addresses the efficiency-accuracy challenge of processing sensor data for early cardiovascular warning, in particular achieving accurate, low-power classification of synchronized phonocardiogram (PCG) and electrocardiogram (ECG) signals on resource-constrained medical edge devices. The key to the solution is a convolutional neural network with an early-fusion strategy that jointly processes the multimodal data at the input, cutting memory footprint and compute cost by three orders of magnitude relative to the state of the art while maintaining competitive accuracy; energy measurements on a microcontroller and an experimental sensing device further show that on-device inference can be more energy-efficient than continuous data streaming.

Link: https://arxiv.org/abs/2510.18668
Authors: Mustafa Fuad Rifet Ibrahim, Tunc Alkanat, Maurice Meijer, Felix Manthey, Alexander Schlaefer, Peer Stelldinger
Affiliations: NXP Semiconductors Germany GmbH; Hamburg University of Technology; Hamburg University of Applied Sciences
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the IEEE Journal of Biomedical And Health Informatics

Abstract:The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
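
A minimal PyTorch sketch of the early-fusion idea: synchronized ECG and PCG are stacked as two channels of a tiny 1D CNN before the first convolution. The layer sizes, sampling rate, and window length are hypothetical, chosen only to suggest microcontroller-scale budgets.

```python
import torch
import torch.nn as nn

class EarlyFusionTinyCNN(nn.Module):
    """Tiny 1D CNN with early fusion: ECG and PCG enter as two input channels,
    so modalities are fused at the very first convolution."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(2, 8, kernel_size=7, stride=2, padding=3),  # fuse ECG+PCG here
            nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x):                # x: [batch, 2, samples]
        return self.classifier(self.features(x).squeeze(-1))

model = EarlyFusionTinyCNN()
dummy = torch.randn(1, 2, 2000)          # ~2 s at a hypothetical 1 kHz
print(model(dummy).shape)                # torch.Size([1, 2])
```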
zh
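
【代码草图】:为直观说明上文 [CV-22] 中"早期融合"的含义,下面给出一个示意性的 PyTorch 草图(并非论文官方实现):同步的 ECG 与 PCG 波形在输入层按通道拼接,再经过一个小型一维卷积网络完成二分类。其中通道数、卷积核与采样长度等均为假设值。

```python
import torch
import torch.nn as nn

class TinyEarlyFusionCNN(nn.Module):
    """示意性的早期融合 Tiny-CNN:ECG 与 PCG 在输入层按通道拼接。"""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(2, 8, kernel_size=7, stride=2, padding=3),  # 2 通道 = ECG + PCG
            nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # 全局平均池化,压缩参数量与内存占用
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, ecg: torch.Tensor, pcg: torch.Tensor) -> torch.Tensor:
        x = torch.stack([ecg, pcg], dim=1)  # (B, 2, T):输入层联合处理即"早期融合"
        return self.classifier(self.features(x).squeeze(-1))

# 用法示例:假设每段同步记录有 1000 个采样点
model = TinyEarlyFusionCNN()
logits = model(torch.randn(4, 1000), torch.randn(4, 1000))
print(logits.shape)  # torch.Size([4, 2])
```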

[CV-23] Image augmentation with invertible networks in interactive satellite image change detection

【速读】:该论文旨在解决卫星图像变化检测(change detection)中标签数据稀缺导致模型性能受限的问题。现有方法通常依赖大量标注数据进行训练,而人工标注成本高昂且效率低下。为此,作者提出一种基于主动学习(active learning)的交互式变化检测算法,其核心创新在于设计了一种可逆网络(invertible network),能够将原始图像从高非线性输入空间映射至潜在空间,在该空间中数据增强操作变为线性且易于处理;随后将增强后的样本映射回输入空间用于重训练,从而在少量标注样本下显著提升变化检测模型的性能。

链接: https://arxiv.org/abs/2510.18660
作者: Hichem Sahbi
机构: Sorbonne University (索邦大学); CNRS (法国国家科学研究中心); LIP6 (巴黎第六大学计算机科学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper devises a novel interactive satellite image change detection algorithm based on active learning. Our framework employs an iterative process that leverages a question-and-answer model. This model queries the oracle (user) about the labels of a small subset of images (dubbed as display), and based on the oracle’s responses, the change detection model is dynamically updated. The main contribution of our framework resides in a novel invertible network that allows augmenting displays, by mapping them from highly nonlinear input spaces to latent ones, where augmentation transformations become linear and more tractable. The resulting augmented data are afterwards mapped back to the input space, and used to retrain more effective change detection criteria in the subsequent iterations of active learning. Experimental results demonstrate superior performance of our proposed method compared to the related work.
zh
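
【代码草图】:上文 [CV-23] 的关键是"编码到潜在空间后,增强操作变为线性且易于处理"。下面用一个极简的可逆仿射耦合层示意"编码 → 潜在空间线性增强 → 解码回输入空间"的完整回路;网络结构与扰动方式均为假设,仅用于说明思路。

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """极简可逆耦合层:前半维保持不变,后半维做仿射变换,因而可解析求逆。"""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                                 nn.Linear(64, dim))  # 同时输出 scale 与 shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=-1)

    def inverse(self, z: torch.Tensor) -> torch.Tensor:
        z1, z2 = z.chunk(2, dim=-1)
        log_s, t = self.net(z1).chunk(2, dim=-1)
        return torch.cat([z1, (z2 - t) * torch.exp(-log_s)], dim=-1)

flow = AffineCoupling(dim=8)
x = torch.randn(16, 8)                 # 假设的图像块特征
z = flow(x)                            # 编码到潜在空间
z_aug = z + 0.1 * torch.randn_like(z)  # 潜在空间中的简单线性扰动
x_aug = flow.inverse(z_aug)            # 映射回输入空间,用于重训练检测模型
```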

[CV-24] Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression NEURIPS2025

【速读】:该论文旨在解决传统矩阵量化方法在内存效率与重构误差之间难以平衡的问题,尤其是在神经网络压缩场景下。现有的一阶量化方法(如均匀量化和二值编码量化)通过线性组合二值基来近似实值矩阵,限制了表达能力。本文提出的Binary Quadratic Quantization (BQQ) 方法的关键创新在于利用二值二次表达式(binary quadratic expressions)增强矩阵的逼近能力,同时保持极紧凑的数据格式——无需依赖特定于训练后量化(Post-Training Quantization, PTQ)的优化策略,即可实现优于现有方法的压缩性能与模型精度。实验表明,BQQ 在矩阵压缩基准和 Vision Transformer 模型的 PTQ 中均展现出显著优势,尤其在 2 bit 量化下,相比最先进方法在 ImageNet 上分别提升达 2.2% 和 59.1% 的准确率。

链接: https://arxiv.org/abs/2510.18650
作者: Kyo Kuroki,Yasuyuki Okoshi,Thiem Van Chu,Kazushi Kawamura,Masato Motomura
机构: Institute of Science Tokyo (东京科学研究所); Waseda University (早稻田大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
zh
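
【代码草图】:摘要并未给出 BQQ 的具体表达式;下面仅以一个假设性的"二值二次组合"为例,说明其相对一阶(线性)二值组合的表达力差异:在随机二值基 B1、B2 之上,额外引入逐元素乘积的二次交叉项,并用最小二乘拟合组合系数。具体形式请以论文为准。

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))            # 待压缩的实值矩阵

B1 = np.sign(rng.standard_normal(W.shape))   # 二值基,取值 {-1, +1}
B2 = np.sign(rng.standard_normal(W.shape))

# 设计矩阵:常数项 + 两个线性项 + 一个二次交叉项(逐元素乘积仍是二值的)
feats = np.stack([np.ones_like(W), B1, B2, B1 * B2], axis=-1).reshape(-1, 4)
coef, *_ = np.linalg.lstsq(feats, W.reshape(-1), rcond=None)
W_quad = (feats @ coef).reshape(W.shape)

# 对照:仅保留一阶项的线性组合
coef_lin, *_ = np.linalg.lstsq(feats[:, :3], W.reshape(-1), rcond=None)
W_lin = (feats[:, :3] @ coef_lin).reshape(W.shape)

print("二次组合重构误差:", np.linalg.norm(W - W_quad))
print("一阶组合重构误差:", np.linalg.norm(W - W_lin))
```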

[CV-25] ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data

【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中生物样本的语义分割难题,尤其在训练标签极度稀疏(如仅占总数据量0.05%或更少)的情况下仍能实现高精度分割。其核心解决方案基于分层变分自编码器(Hierarchical Variational Autoencoder, HVAE),关键创新在于:1)引入中心区域掩码(center-region masking)与修复损失(inpainting loss),增强模型对关键结构特征的鲁棒性表征;2)采用稀疏标签对比学习(sparse label contrastive learning, CL)和高斯混合模型(Gaussian Mixture Model, GMM)先验,优化潜在空间分布以促进语义类别内聚类;3)摒弃传统聚类策略,设计一个MLP语义分割头直接从潜在嵌入预测类别标签,实现端到端的无聚类标签预测。该方法在复杂生物组织EM图像及荧光显微图像上均展现出优异的稀疏监督分割性能。

链接: https://arxiv.org/abs/2510.18637
作者: Sheida Rahnamai Kordasiabi,Damian Dalle Nogare,Florian Jug
机构: Human Technopole (人类技术园); Technical University of Dresden (德累斯顿工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages main text, 17 pages total

点击查看摘要

Abstract:Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce ε-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster w.r.t. the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose an MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of ε-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that ε-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
zh

[CV-26] C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression BMVC2025

【速读】:该论文旨在解决结构化剪枝(structured pruning)在一次性(one-shot)设置下导致模型性能显著下降的问题。现有方法虽能在训练后直接进行剪枝以降低计算成本,但往往因缺乏对模型内部机制的理解而破坏关键结构,从而影响精度。其解决方案的关键在于提出一种基于可解释深度学习(explainable deep learning)的一次性剪枝框架,通过引入因果感知剪枝(causal-aware pruning)策略,在渐进式剪枝过程中利用预测结果与网络结构之间的因果关系,识别并移除对模型性能影响最小的结构单元,从而在不依赖微调(fine-tuning)的前提下实现模型尺寸大幅缩减且性能损失最小化。

链接: https://arxiv.org/abs/2510.18636
作者: Baptiste Bauvin,Loïc Baret,Ola Ahmad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 10 pages, BMVC2025

点击查看摘要

Abstract:Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that promotes sparsity in model structures, e.g., weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered the one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolutional neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
zh
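
【代码草图】:上文 [CV-26] 中"因果感知剪枝"的直觉是:逐一"干预"(置零)某个结构,观测预测的变化量,以此作为该结构对输出的因果效应,并优先剪除效应最小者。下面是一个按通道消融的示意性草图,模型与评分方式均为占位假设,并非论文原始算法。

```python
import torch
import torch.nn as nn

def channel_effect_scores(model: nn.Module, layer: nn.Conv2d, x: torch.Tensor):
    """逐通道消融:将某输出通道权重置零,用预测变化近似其因果效应。"""
    with torch.no_grad():
        base = model(x)
        scores = []
        for c in range(layer.out_channels):
            saved = layer.weight[c].clone()
            layer.weight[c].zero_()                  # "干预":移除该通道
            delta = (model(x) - base).abs().mean()   # 预测变化量作为效应代理
            layer.weight[c].copy_(saved)             # 恢复原权重
            scores.append(delta.item())
    return scores

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
x = torch.randn(8, 3, 32, 32)
scores = channel_effect_scores(model, model[0], x)
prune_order = sorted(range(len(scores)), key=scores.__getitem__)
print("优先剪除(因果效应最小)的通道:", prune_order[:4])
```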

[CV-27] hink with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在有限视角下理解三维空间关系时面临的挑战,即传统推理方法依赖纯文本或二维视觉线索,难以有效建模三维空间想象能力。解决方案的关键在于提出3DThinker框架,首次实现推理过程中无需任何3D先验输入即可进行三维心理表征(3D mentaling),并通过两阶段训练机制:第一阶段利用监督学习对齐VLM生成的三维潜在表示与3D基础模型(如VGGT)的输出;第二阶段仅基于任务结果信号优化整个推理轨迹,从而提升内在三维认知质量。该方法不依赖显式标注的3D数据,显著增强了多模态推理中对三维结构的理解能力。

链接: https://arxiv.org/abs/2510.18632
作者: Zhangquan Chen,Manyuan Zhang,Xinlei Yu,Xufang Luo,Mingze Sun,Zihao Pan,Yan Feng,Peng Pei,Xunliang Cai,Ruqi Huang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Meituan (美团); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at this https URL.
zh

[CV-28] CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent

【速读】:该论文旨在解决当前计算机使用代理(Computer-using Agents, CUAs)评估中缺乏可靠、细粒度奖励模型评价体系的问题。现有方法主要依赖脚本-based 验证器,存在扩展性差和无法进行步骤级评估的局限;而奖励模型(Reward Models, RMs)虽具潜力,其在CUA任务中的有效性尚未充分探索。解决方案的关键在于提出CUARewardBench——首个涵盖结果奖励模型(Outcome Reward Models, ORM)与过程奖励模型(Process Reward Models, PRM)的综合性基准,包含来自10类软件和7种代理架构的多样化轨迹数据集,并通过专家标注与严格质量控制确保可靠性;进一步基于系统性分析发现当前视觉语言模型(VLMs)在视觉推理和知识掌握上的不足,提出一致提示集成(Unanimous Prompt Ensemble, UPE)方法,通过严格的多数投票机制和优化提示模板配置,显著提升奖励模型的精度(ORM达89.8%、PRM达81.7%)与负预测值(ORM达93.3%、PRM达85.1%),从而实现更准确、可解释的CUA性能评估。

链接: https://arxiv.org/abs/2510.18596
作者: Haojia Lin,Xiaoyu Tan,Yulei Qin,Zihan Xu,Yuchen Shi,Zongyi Li,Gang Li,Shaofei Cai,Siqi Cai,Chaoyou Fu,Ke Li,Xing Sun
机构: 未知
类目: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 6 figures

点击查看摘要

Abstract:Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
zh

[CV-29] CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder NEURIPS2025

【速读】:该论文旨在解决多模态数据蒸馏(Multimodal Dataset Distillation)中的两个核心问题:一是如何在小规模合成图像-文本对上有效学习跨模态对齐(cross-modal alignment),二是如何降低大规模视觉-语言模型训练中的高计算成本。现有方法通过冻结文本编码器仅优化图像编码器和文本投影层来提升可扩展性,但这一策略严重限制了语义对齐能力并成为性能提升的瓶颈。其解决方案的关键在于提出CovMatch框架,该框架通过最小化真实与合成特征之间的跨协方差(cross-covariance)来增强跨模态对齐,同时对每模态内部特征分布进行正则化,从而支持双编码器(图像和文本)的联合优化,显著提升了蒸馏效果与下游任务性能,在Flickr30K和COCO数据集上仅用500个合成样本即实现检索准确率最高达6.8%的绝对提升。

链接: https://arxiv.org/abs/2510.18583
作者: Yongmin Lee,Hye Won Chung
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: NeurIPS 2025

点击查看摘要

Abstract:Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and update only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
zh
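
【代码草图】:上文 [CV-29] 的核心损失是"对齐真实与合成特征的跨协方差,同时正则每个模态内的特征分布"。以下为一个极简示意(非官方实现),其中模态内正则采用均值/方差对齐,仅作占位假设。

```python
import torch

def cross_cov(img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
    """去均值后的图文跨协方差矩阵,形状 (D_img, D_txt)。"""
    img = img - img.mean(0, keepdim=True)
    txt = txt - txt.mean(0, keepdim=True)
    return img.T @ txt / (img.shape[0] - 1)

def covmatch_loss(ri, rt, si, st, lam: float = 0.1) -> torch.Tensor:
    """ri/rt: 真实图文特征;si/st: 可求导的合成图文特征。"""
    align = (cross_cov(ri, rt) - cross_cov(si, st)).pow(2).sum()
    reg = sum((r.mean(0) - s.mean(0)).pow(2).sum() +
              (r.var(0) - s.var(0)).pow(2).sum()
              for r, s in [(ri, si), (rt, st)])   # 模态内分布正则(示意)
    return align + lam * reg

ri, rt = torch.randn(512, 64), torch.randn(512, 64)   # 真实数据特征
si = torch.randn(32, 64, requires_grad=True)          # 合成图像特征
st = torch.randn(32, 64, requires_grad=True)          # 合成文本特征
loss = covmatch_loss(ri, rt, si, st)
loss.backward()   # 梯度可用于联合更新合成图文对(及双编码器)
```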

[CV-30] Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model

【速读】:该论文旨在解决多参考图像条件下主体一致性(subject consistency)和背景解耦(background disentanglement)不足的问题,以提升生成视频的参考保真度(reference fidelity)和语义稳定性(semantic drift控制)。现有方法在多图像条件输入时易出现主体混淆与细节失真,主要归因于训练数据多样性差、高质量样本匮乏以及缺乏跨配对样本(cross-paired data)。为应对上述挑战,论文提出两个关键解决方案:一是构建专用的数据构造流程,通过低质量样本过滤与多样化数据合成策略生成保持一致性的训练数据;二是引入参考旋转位置编码(Reference Rotary Positional Encoding, R-RoPE),实现对多参考图像的稳定且精确的融合,从而显著提升模型在一致性、保真度和泛化能力方面的性能表现。

链接: https://arxiv.org/abs/2510.18573
作者: Zhenxing Zhang,Jiayan Teng,Zhuoyi Yang,Tiankun Cao,Cheng Wang,Xiaotao Gu,Jie Tang,Dan Guo,Meng Wang
机构: Hefei University of Technology (合肥工业大学); Tsinghua University (清华大学); Zhipu AI (智谱AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:We present Kaleido, a subject-to-video (S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
zh

[CV-31] Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving

【速读】:该论文旨在解决自动驾驶中感知系统在恶劣条件下可靠性不足的问题,即现有数据集缺乏对多传感器模态下可控、参数化且可复现的退化模拟,从而限制了对感知与融合架构在明确不利条件下的系统性评估。解决方案的关键在于提出Occluded nuScenes Dataset,这是对广泛使用的nuScenes基准的扩展:针对摄像头模态提供包含四种类型遮挡(两种源自公开实现、两种新设计)的完整版和迷你版;针对雷达和LiDAR(激光雷达)则提供参数化的遮挡脚本,每种支持三种类型的退化操作,实现灵活且重复生成受干扰数据的能力,从而为鲁棒传感器融合、韧性分析及安全关键感知研究提供统一、可复现的评估基础。

链接: https://arxiv.org/abs/2510.18552
作者: Sanjay Kumar,Tim Brophy,Reenu Mohandas,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising
机构: University of Limerick (利默里克大学); Lero, The Irish Software Research Centre (爱尔兰软件研究中心); Valeo Vision Systems (法雷奥视觉系统公司); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust perception in automated driving requires reliable performance under adverse conditions, where sensors may be affected by partial failures or environmental occlusions. Although existing autonomous driving datasets inherently contain sensor noise and environmental variability, very few enable controlled, parameterised, and reproducible degradations across multiple sensing modalities. This gap limits the ability to systematically evaluate how perception and fusion architectures perform under well-defined adverse conditions. To address this limitation, we introduce the Occluded nuScenes Dataset, a novel extension of the widely used nuScenes benchmark. For the camera modality, we release both the full and mini versions with four types of occlusions, two adapted from public implementations and two newly designed. For radar and LiDAR, we provide parameterised occlusion scripts that implement three types of degradations each, enabling flexible and repeatable generation of occluded data. This resource supports consistent, reproducible evaluation of perception models under partial sensor failures and environmental interference. By releasing the first multi-sensor occlusion dataset with controlled and reproducible degradations, we aim to advance research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
zh

[CV-32] GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization IROS’25

【速读】:该论文旨在解决当前基于激光雷达(LiDAR)的3D目标检测模型在不同传感器部署场景下泛化能力差的问题。现有方法通常依赖于全局特征(如绝对笛卡尔坐标),导致模型学习到“几何捷径”,即过度依赖物体的绝对位置而非其形状与外观特征,从而在点云分布发生变化时性能显著下降。解决方案的关键在于引入GBlobs——一种专为提升跨LiDAR配置泛化能力而设计的局部点云特征描述符。通过将GBlobs作为网络输入特征,有效规避了几何捷径,迫使模型学习更具鲁棒性的、以物体为中心的表示,从而显著增强模型在不同传感器布局下的适应性与性能表现。

链接: https://arxiv.org/abs/2510.18539
作者: Dušan Malić,Christian Fruhwirth-Reisinger,Alexander Prutsch,Wei Lin,Samuel Schulter,Horst Possegger
机构: Graz University of Technology (格拉茨工业大学); Christian Doppler Laboratory for Embedded Machine Learning (克里斯蒂安·多普勒嵌入式机器学习实验室); Johannes Kepler University Linz (约翰内斯·开普勒大学林兹分校); Amazon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 1st place at the IROS’25 RoboSense Challenge, Track #3: Cross-Sensor Placement 3D Object Detection

点击查看摘要

Abstract:This technical report outlines the top-ranking solution for RoboSense 2025: Track 3, achieving state-of-the-art performance on 3D object detection under various sensor placements. Our submission utilizes GBlobs, a local point cloud feature descriptor specifically designed to enhance model generalization across diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer from a “geometric shortcut” when trained on conventional global features (i.e., absolute Cartesian coordinates). This introduces a position bias that causes models to primarily rely on absolute object position rather than distinguishing shape and appearance characteristics. Although effective for in-domain data, this shortcut severely limits generalization when encountering different point distributions, such as those resulting from varying sensor placements. By using GBlobs as network input features, we effectively circumvent this geometric shortcut, compelling the network to learn robust, object-centric representations. This approach significantly enhances the model’s ability to generalize, resulting in the exceptional performance demonstrated in this challenge.
zh
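
【代码草图】:GBlobs 的要点在于用局部几何统计替代绝对笛卡尔坐标,从而绕开"几何捷径"。下面的草图对每个点取 k 近邻,以该点为原点计算邻域的均值偏移与协方差("blob"统计);整体平移点云不会改变特征。实现细节为示意性假设,非官方代码。

```python
import numpy as np

def gblob_features(points: np.ndarray, k: int = 16) -> np.ndarray:
    """对每个点,用 k 近邻的局部统计构造与绝对位置无关的特征。"""
    n = points.shape[0]
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]        # 排除自身
    feats = []
    for i in range(n):
        local = points[knn[i]] - points[i]          # 以当前点为原点 → 去除绝对坐标
        cov = np.cov(local.T)                       # 局部形状的二阶统计("blob")
        feats.append(np.concatenate([local.mean(0), cov[np.triu_indices(3)]]))
    return np.asarray(feats)                        # (N, 3 + 6)

pts = np.random.rand(256, 3) * 50
f = gblob_features(pts)
f_shift = gblob_features(pts + np.array([100.0, -40.0, 7.0]))  # 整体平移
print(np.allclose(f, f_shift, atol=1e-5))  # True:特征不依赖绝对位置
```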

[CV-33] RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation

【速读】:该论文旨在解决模板匹配型物体位姿估计方法中因模板检索错误而导致位姿预测不准确的问题。其核心解决方案是将模板-based位姿估计重新建模为射线对齐(ray alignment)问题,通过学习多个带位姿的模板图像的观测方向与未带位姿的查询图像之间的对齐关系来推断位姿。关键创新在于:1)使用以物体为中心的相机射线参数化物体旋转;2)通过扩展尺度不变的平移估计方法,建模密集的平移偏移量;3)利用模板中的几何先验引导查询图像的位姿推理;4)采用基于窄化模板采样的粗到精训练策略,在不修改网络结构的前提下提升性能。

链接: https://arxiv.org/abs/2510.18521
作者: Junwen Huang,Shishir Reddy Vutukur,Peter KT Yu,Nassir Navab,Slobodan Ilic,Benjamin Busam
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); XYZ Robotics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
zh

[CV-34] DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices

【速读】:该论文旨在解决因便利包装普及导致的巨量废弃物问题,通过高效垃圾分类实现可持续废物管理。其核心解决方案是开发了一个名为DWaste的计算机视觉平台,专为资源受限的智能手机和边缘设备设计,支持离线实时垃圾分拣。关键在于在模型精度与计算资源消耗之间取得平衡:研究对比了多种图像分类和目标检测模型,发现轻量级目标检测模型(如YOLOv8n)在保持较高mAP(最高达77%)的同时,具备极低延迟(约0.03秒)、小模型尺寸(<7MB)及低功耗特性,适合边缘部署;进一步采用模型量化技术,使模型体积和显存占用减少高达75%,从而实现了“更绿色AI”(Greener AI)的落地应用,推动环保场景下边缘智能的可持续发展。

链接: https://arxiv.org/abs/2510.18513
作者: Suman Kunwar
机构: DWaste, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures

点击查看摘要

Abstract:The rise of convenience packaging has led to the generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection (YOLOv8n, YOLOv11n) using a subset of our own waste data set and annotated it using the custom tool Annotated Lab. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~96%) but suffered from high latency (~0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 77% mAP) with ultra-fast inference (~0.03s) and significantly smaller model sizes (< 7 MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of “Greener AI” models to support real-time, sustainable waste sorting on edge devices.
zh
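
【代码草图】:关于"模型量化可将体积与显存占用减少高达 75%"这一点,可用 PyTorch 自带的训练后动态量化直观感受;以下只是通用示意,并非 DWaste 的官方流程,模型结构为占位假设。

```python
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# 训练后动态量化:将 Linear 层权重转为 int8
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module, path: str = "tmp.pt") -> float:
    """序列化到磁盘以估计模型体积(MB)。"""
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"FP32: {size_mb(model):.2f} MB -> INT8: {size_mb(quantized):.2f} MB")
```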

[CV-35] Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos

【速读】:该论文旨在解决从无姿态约束的单目低动态范围(LDR)视频中重建可渲染的四维高动态范围(4D HDR)场景这一挑战性问题,尤其针对交替曝光拍摄的视频数据。解决方案的关键在于提出了一种基于高斯溅射(Gaussian Splatting)的两阶段优化框架:第一阶段在正交相机坐标系中学习视频HDR高斯表示,无需依赖相机位姿即可实现鲁棒的初始HDR视频重建;第二阶段将高斯表示转换至世界坐标系,并联合优化世界空间中的高斯表示与相机位姿。此外,通过引入时间亮度正则化策略提升HDR外观的时间一致性,从而显著提升重建质量与效率。

链接: https://arxiv.org/abs/2510.18489
作者: Jinfeng Liu,Lingtong Kong,Mi Zhou,Jinwen Chen,Dan Xu
机构: The Hong Kong University of Science and Technology (HKUST); vivo Mobile Communication Co., Ltd
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page is available at this https URL

点击查看摘要

Abstract:We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.
zh

[CV-36] Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

【速读】:该论文旨在解决潜在扩散模型(Latent Diffusion Models, LDMs)中视觉分词器(visual tokenizer)质量对性能的关键影响问题,尤其是现有通过蒸馏(distillation)方式整合视觉基础模型(Vision Foundation Models, VFMs)所导致的语义对齐鲁棒性下降问题。其核心解决方案是提出一种无需蒸馏的直接集成方法——视觉基础模型变分自编码器(VFM-VAE),并通过多尺度潜在融合(Multi-Scale Latent Fusion)和渐进分辨率重建(Progressive Resolution Reconstruction)模块重构解码器结构,从而在保持语义一致性的同时实现像素级重建精度。此外,论文引入SE-CKNNA指标以精细化分析扩散训练中的表征动态,并据此设计联合分词器-扩散对齐策略,显著加速收敛速度并提升最终性能:仅用80个训练周期即达到gFID(不使用CFG)2.20,较此前分词器提速约10倍;训练至640个周期后进一步降至1.62。

链接: https://arxiv.org/abs/2510.18457
作者: Tianci Bi,Xiaoyi Zhang,Yan Lu,Nanning Zheng
机构: IAIR, Xi’an Jiaotong University (西安交通大学人工智能研究院); Microsoft Research Asia (亚洲微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code and models available at: this https URL

点击查看摘要

Abstract:The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM’s semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
zh

[CV-37] LAND: Lung and Nodule Diffusion for 3D Chest CT Synthesis with Anatomical Guidance

【速读】:该论文旨在解决高保真3D胸部CT图像生成中计算成本高且条件控制不精确的问题。现有方法通常依赖于高性能计算资源,难以实现高效、可控的医学影像合成。解决方案的关键在于提出一种新的潜在扩散模型(latent diffusion model),通过引入3D解剖掩码(anatomical masks)作为条件输入,实现对肺部及结节区域的精准控制;同时利用单块中端GPU即可在256×256×256体素分辨率下生成高质量CT图像,显著降低计算开销,并验证了仅使用结节掩码会导致解剖结构错误,强调全局肺结构信息对于生成准确条件图像的重要性。

链接: https://arxiv.org/abs/2510.18446
作者: Anna Oliveras,Roger Marí,Rafael Redondo,Oriol Guardià,Ana Tost,Bhalaji Nagarajan,Carolina Migliorelli,Vicent Ribas,Petia Radeva
机构: Eurecat, Centre Tecnològic de Catalunya (欧洲催化技术中心); Universitat de Barcelona (巴塞罗那大学); Barcelona Supercomputing Center (BSC) (巴塞罗那超级计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work introduces a new latent diffusion model to generate high-quality 3D chest CT scans conditioned on 3D anatomical masks. The method synthesizes volumetric images of size 256x256x256 at 1 mm isotropic resolution using a single mid-range GPU, significantly lowering the computational cost compared to existing approaches. The conditioning masks delineate lung and nodule regions, enabling precise control over the output anatomical features. Experimental results demonstrate that conditioning solely on nodule masks leads to anatomically incorrect outputs, highlighting the importance of incorporating global lung structure for accurate conditional synthesis. The proposed approach supports the generation of diverse CT volumes with and without lung nodules of varying attributes, providing a valuable tool for training AI models or healthcare professionals.
zh

[CV-38] Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection ICCV2025

【速读】:该论文旨在解决伪装目标检测(Camouflaged Object Detection, COD)中因目标与背景高度相似而导致的分割难题,尤其针对现有方法依赖图像级建模或繁琐标注、难以利用数据集级上下文信息的问题。解决方案的关键在于提出一种基于检索自增强(RetrIval SElf-augmented, RISE)的新范式:首先通过无标注训练图像构建环境与伪装目标的原型库(prototype libraries),并引入“聚类后检索”(Clustering-then-Retrieval, CR)策略,利用粗略聚类生成初始掩码,结合直方图过滤和跨类别检索提升原型质量;随后在K近邻(KNN)检索阶段,设计多视角KNN检索(Multi-View KNN Retrieval, MVKR),融合多个视图下的检索结果以抑制特征图中的伪影影响,从而生成高质量伪标签(pseudo-masks),用于训练COD模型。该方法无需人工标注即可有效挖掘数据集级上下文信息,显著优于当前主流无监督与提示驱动方法。

链接: https://arxiv.org/abs/2510.18437
作者: Ji Du,Xin Wang,Fangwei Hao,Mingyang Yu,Chunyuan Chen,Jiesheng Wu,Bin Wang,Jing Xu,Ping Li
机构: Nankai University (南开大学); The Hong Kong Polytechnic University (香港理工大学); Anhui Normal University (安徽师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations exerts a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at this https URL.
zh

[CV-39] ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization

【速读】:该论文旨在解决生成式AI(Generative AI)模型在个性化适配中因缺乏真实世界细粒度用户偏好标注而导致的性能瓶颈问题。其核心解决方案在于构建了ImageGem数据集,该数据集包含57,000名用户产生的242,000个定制LoRA模型、300万条文本提示及500万张生成图像,并附带细粒度用户偏好标注。基于此数据集,作者训练出更优的偏好对齐模型,并进一步提出一种在潜在权重空间中编辑定制扩散模型的端到端框架,以实现个体用户偏好的精准对齐,从而首次实现了生成模型个性化的新范式。

链接: https://arxiv.org/abs/2510.18433
作者: Yuanhe Guo,Linxi Xie,Zhuoran Chen,Kangrui Yu,Ryan Po,Guandao Yang,Gordon Wetztein,Hongyi Wen
机构: NYU(纽约大学); Stanford(斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
zh

[CV-40] ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)模型在大规模扩展时训练成本过高、计算资源消耗大的问题。传统方法通过从头训练来扩大模型规模,效率低下且资源开销大。其解决方案的关键在于提出ScaleNet,一种基于预训练模型的高效扩展框架:通过向预训练ViT中插入额外层并采用层间参数共享机制,在几乎不增加参数量的前提下实现模型深度扩展;同时引入并行适配器模块(parallel adapter modules)作为微调参数,对每个共享参数实例进行独立优化,从而缓解因参数共享导致的性能下降问题。实验表明,ScaleNet在ImageNet-1K上可使深度翻倍的DeiT-Base模型准确率提升7.42%,且训练时间仅为从头训练的三分之一,具备显著的效率优势。

链接: https://arxiv.org/abs/2510.18431
作者: Zhiwei Hao,Jianyuan Guo,Li Shen,Kai Han,Yehui Tang,Han Hu,Yunhe Wang
机构: Beijing Institute of Technology (北京理工大学); City University of Hong Kong (香港城市大学); Sun Yat-sen University (中山大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2× depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by the validation in the object detection task.
zh

[CV-41] Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling

【速读】:该论文旨在解决 cricket 视频中自动化分析的难题,特别是如何精准识别击球得分(wicket-taking)时刻、检测球体位置并建模其运动轨迹,从而为教练和战术决策提供数据驱动的洞察。解决方案的关键在于融合多模态深度学习技术:首先采用 YOLOv8 架构实现高精度的球场区域与球体检测(pitch and ball detection),其中球场检测达到 99.5% mAP50 和 0.999 精度,球体检测通过迁移学习实现 99.18% mAP50、0.968 精度与 0.978 召回率;其次结合光学字符识别(OCR)从视频帧中提取比分卡信息以定位关键事件;最后通过图像预处理(灰度变换、幂变换及形态学操作)增强文本可读性,确保系统在复杂场景下仍具备鲁棒性。整体框架实现了从原始视频到结构化数据分析的端到端自动化流程。

链接: https://arxiv.org/abs/2510.18405
作者: Mst Jannatun Ferdous,Masum Billah,Joy Karmoker,Mohd Ruhul Ameen,Akif Islam,Md. Omar Faruqe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 figures, 5 tables, submitted to the 11th IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering 2025

点击查看摘要

Abstract:This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
zh
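
【代码草图】:摘要提到的比分卡文本预处理(灰度变换、幂变换、形态学操作)可用 OpenCV 简单复现;伽马值、阈值方法与核尺寸均为假设参数,仅示意流程。

```python
import cv2
import numpy as np

def preprocess_scorecard(frame: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """面向 OCR 的预处理:灰度化 -> 幂(伽马)变换 -> 二值化 -> 形态学开运算。"""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    norm = gray.astype(np.float32) / 255.0
    power = np.uint8(255 * np.power(norm, gamma))        # 幂变换:增强明暗对比
    _, binary = cv2.threshold(power, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)  # 开运算去除细小噪点

frame = (np.random.rand(720, 1280, 3) * 255).astype(np.uint8)  # 占位视频帧
ocr_ready = preprocess_scorecard(frame)
# 处理后的图像可交给 OCR 引擎提取比分,进而定位击球得分(wicket)时刻
```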

[CV-42] Bayesian Fully-Connected Tensor Network for Hyperspectral-Multispectral Image Fusion

【速读】:该论文旨在解决现有基于张量分解的高光谱-多光谱图像融合(Hyperspectral-Multispectral Image Fusion, HMF)方法中存在的两大问题:一是传统方法依赖数据向量化或重塑操作,破坏了图像固有的空间-光谱结构;二是现有方法对参数敏感、抗噪能力弱且难以建模跨维度相关性。其解决方案的关键在于提出贝叶斯全连接张量网络(Bayesian Fully-Connected Tensor Network, BFCTN)框架,通过引入分层稀疏先验显式建模物理元素间的稀疏性与内在耦合关系(包括空间结构、光谱特征及局部场景一致性),并结合变分贝叶斯推断(Variational Bayesian Inference)与期望最大化(Expectation-Maximization, EM)算法进行模型学习,从而显著降低人工调参需求、提升融合精度和鲁棒性。

链接: https://arxiv.org/abs/2510.18400
作者: Linsong Shan,Zecan Yang,Laurence T. Yang,Changlong Li,Honglu Zhao,Xin Nie
机构: Huazhong University of Science and Technology (华中科技大学); Zhengzhou University (郑州大学); St. Francis Xavier University (圣弗朗西斯泽维尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tensor decomposition is a powerful tool for data analysis and has been extensively employed in the field of hyperspectral-multispectral image fusion (HMF). Existing tensor decomposition-based fusion methods typically rely on disruptive data vectorization/reshaping or impose rigid constraints on the arrangement of factor tensors, hindering the preservation of spatial-spectral structures and the modeling of cross-dimensional correlations. Although recent advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have partially alleviated these limitations, the process of reorganizing data into higher-order tensors still disrupts the intrinsic spatial-spectral structure. Furthermore, these methods necessitate extensive manual parameter tuning and exhibit limited robustness against noise and spatial degradation. To alleviate these issues, we propose the Bayesian FCTN (BFCTN) method. Within this probabilistic framework, a hierarchical sparse prior, which characterizes the sparsity of physical elements, establishes connections between the factor tensors. This framework explicitly models the intrinsic physical coupling among spatial structures, spectral signatures, and local scene homogeneity. For model learning, we develop a parameter estimation method based on Variational Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which significantly reduces the need for manual parameter tuning. Extensive experiments demonstrate that BFCTN not only achieves state-of-the-art fusion accuracy and strong robustness but also exhibits practical applicability in complex real-world scenarios.
zh

[CV-43] Entropy-Enhanced Conformal Features from Ricci Flow for Robust Alzheimers Disease Classification

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)早期诊断中缺乏高效、自动化且高精度的皮层形态学分析方法的问题。其解决方案的关键在于提出一种基于共形几何特征熵(entropy of conformally-derived geometric features)的新颖局部表面表示方法:首先利用Ricci流实现共形参数化以提取面积畸变和共形因子,结合直接从网格几何计算的高斯曲率,再通过香农熵(Shannon entropy)压缩为紧凑的特征向量,最终使用多层感知机(MLP)和逻辑回归等分类器进行训练与验证,实现了对AD患者与健康对照组的高准确率区分(准确率和F₁分数达98.62%),证明该方法在临床研究中具有显著潜力。

链接: https://arxiv.org/abs/2510.18396
作者: F.Ahmadi,B.Bidabad,H.Nasiri
机构: Amirkabir University of Technology (Tehran Polytechnic); Lancaster University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background and Objective: In brain imaging, geometric surface models are essential for analyzing the 3D shapes of anatomical structures. Alzheimer’s disease (AD) is associated with significant cortical atrophy, making such shape analysis a valuable diagnostic tool. The objective of this study is to introduce and validate a novel local surface representation method for the automated and accurate diagnosis of AD. Methods: The study utilizes T1-weighted MRI scans from 160 participants (80 AD patients and 80 healthy controls) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Cortical surface models were reconstructed from the MRI data using Freesurfer. Key geometric attributes were computed from the 3D meshes. Area distortion and conformal factor were derived using Ricci flow for conformal parameterization, while Gaussian curvature was calculated directly from the mesh geometry. Shannon entropy was applied to these three features to create compact and informative feature vectors. The feature vectors were used to train and evaluate a suite of classifiers (e.g., XGBoost, MLP, Logistic Regression). Results: Statistical significance of performance differences between classifiers was evaluated using paired Welch’s t-test. The method proved highly effective in distinguishing AD patients from healthy controls. The Multi-Layer Perceptron (MLP) and Logistic Regression classifiers outperformed all others, achieving an accuracy and F1 score of 98.62%. Conclusions: This study confirms that the entropy of conformally-derived geometric features provides a powerful and robust metric for cortical morphometry. The high classification accuracy underscores the method’s potential to enhance the study and diagnosis of Alzheimer’s disease, offering a straightforward yet powerful tool for clinical research applications.
zh
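
【代码草图】:"对共形几何特征取香农熵,压缩为紧凑特征向量"可按如下方式示意;分箱数与几何量数据均为假设,仅说明计算流程。

```python
import numpy as np

def shannon_entropy(values: np.ndarray, bins: int = 64) -> float:
    """对顶点级几何量的直方图分布计算香农熵(单位:bit)。"""
    hist, _ = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# 假设的某个皮层网格的三个顶点级几何量
conformal_factor = np.random.randn(150_000)          # 共形因子(Ricci 流参数化)
area_distortion = np.random.rand(150_000)            # 面积畸变
gaussian_curvature = np.random.randn(150_000) * 0.1  # 高斯曲率

feature_vector = np.array([
    shannon_entropy(conformal_factor),
    shannon_entropy(area_distortion),
    shannon_entropy(gaussian_curvature),
])  # 每名受试者仅 3 维特征,随后交给 MLP / 逻辑回归等分类器
print(feature_vector)
```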

[CV-44] S2AP: Score-space Sharpness Minimization for Adversarial Pruning

【速读】:该论文旨在解决对抗剪枝(Adversarial Pruning)方法中因重要性分数空间(score space)优化导致的局部极小值问题,从而引发掩码(mask)选择不稳定、削弱模型鲁棒性的难题。其解决方案的关键在于提出一种名为Score-space Sharpness-aware Adversarial Pruning (S2AP) 的新方法,通过在掩码搜索阶段引入重要性分数空间的尖锐度最小化机制——即对重要性分数进行扰动并最小化对应的鲁棒损失,从而有效降低分数空间中的尖锐度,提升掩码选择的稳定性,并最终增强对抗剪枝方法的鲁棒性。

链接: https://arxiv.org/abs/2510.18381
作者: Giorgio Piras,Qi Zhao,Fabio Brau,Maura Pintor,Christian Wressnegger,Battista Biggio
机构: University of Cagliari (卡利亚里大学); KASTEL Security Research Labs, Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院KASTEL安全研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
zh
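
【代码草图】:S2AP 的"分数空间尖锐度最小化"在形式上与 SAM(sharpness-aware minimization)一致:先对重要性分数施加朝损失上升方向的扰动,再用扰动点处的梯度更新原始分数。以下为示意性草图,其中的"鲁棒损失"仅为占位函数,真实场景应为掩码子网络的对抗损失。

```python
import torch

def s2ap_step(scores: torch.Tensor, robust_loss_fn, rho: float = 0.05,
              lr: float = 0.1) -> torch.Tensor:
    """SAM 风格的一步更新:在扰动分数处求梯度,回到原分数处应用。"""
    s = scores.clone().requires_grad_(True)
    grad = torch.autograd.grad(robust_loss_fn(s), s)[0]
    eps = rho * grad / (grad.norm() + 1e-12)      # 分数空间中的最坏情形扰动
    s_adv = (scores + eps).detach().requires_grad_(True)
    grad_adv = torch.autograd.grad(robust_loss_fn(s_adv), s_adv)[0]
    return (scores - lr * grad_adv).detach()      # 趋向平坦极小值 → 掩码更稳定

robust_loss = lambda s: ((torch.sigmoid(s) - 0.3) ** 2).sum()  # 占位鲁棒损失
scores = torch.randn(100)                          # 每个权重一个重要性分数
for _ in range(50):
    scores = s2ap_step(scores, robust_loss)
mask = (scores >= scores.topk(50).values.min()).float()  # 保留分数最高的 50%
```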

[CV-45] Cross-Modal Scene Semantic Alignment for Image Complexity Assessment

【速读】:该论文旨在解决图像复杂度评估(Image Complexity Assessment, ICA)中因人类感知主观性和真实图像语义多样性导致的建模困难问题。现有方法主要依赖单一视觉模态的手工特征或浅层卷积神经网络特征,难以充分捕捉与人类感知紧密相关的复杂度表征。其解决方案的关键在于提出一种跨模态场景语义对齐(Cross-Modal Scene Semantic Alignment, CM-SSA)方法,通过引入文本提示(text prompts)所蕴含的丰富场景语义信息,构建图像与文本之间的成对学习机制,实现跨模态语义对齐;同时设计复杂度回归分支与语义对齐分支协同优化,使复杂度预测结果更贴近主观人类感知。

链接: https://arxiv.org/abs/2510.18377
作者: Yuqing Luo,Yixiao Li,Jiang Liu,Jun Fu,Hadi Amirpour,Guanghui Yue,Baoquan Zhao,Padraig Corcoran,Hantao Liu,Wei Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages,2 figures, British Machine Vision Conference

点击查看摘要

Abstract:Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at this https URL.
zh

[CV-46] FeatureFool: Zero-Query Fooling of Video Models via Feature Map

【速读】:该论文旨在解决当前黑盒对抗攻击在视频领域中存在的高查询成本与低可扩展性问题,特别是针对新兴的视频大语言模型(Video-LLM)难以实施有效攻击的挑战。现有方法通常依赖多轮查询交互,不仅效率低下,且不适用于实际场景。解决方案的关键在于提出FeatureFool,一种零查询(zero-query)的隐蔽攻击方法,其核心创新是直接利用深度神经网络(DNN)提取的特征图(feature map)来扰动干净视频的特征空间,从而实现无需任何模型交互即可完成攻击。该方法在视频域中首次实现了基于特征空间操作的高效、隐蔽且具备迁移性的对抗攻击,实验表明其对传统视频分类器攻击成功率超过70%,并能有效绕过Video-LLM识别,同时保持生成视频在结构相似性(SSIM)、峰值信噪比(PSNR)和时序一致性上的高质量,使攻击几乎不可察觉。

链接: https://arxiv.org/abs/2510.18362
作者: Duoxun Tang,Xi Xiao,Guangwu Hu,Kangkang Sun,Xiao Yang,Dongyang Chen,Qing Li,Yongjie Yin,Jiyao Wang
机构: Tsinghua University (清华大学); Shenzhen University of Information Technology (深圳信息职业技术学院); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); Peng Cheng Laboratory (鹏城实验室); China Electronics Corporation (中国电子科技集团); Hong Kong University of Science and Technology, Guangzhou (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
zh

[CV-47] Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

【速读】:该论文旨在解决深度神经网络在安全关键场景中进行不确定性量化(Uncertainty Quantification, UQ)时面临的计算与内存开销过高的问题,尤其是在大规模模型部署中,传统方法如Deep Ensembles虽性能优异但难以扩展。其解决方案的关键在于提出Hydra Ensembles——一种基于Transformer架构的高效集成方法,通过剪枝注意力头(attention heads)生成多样化模型成员,并引入一种新型多头注意力机制结合分组全连接层来融合这些成员,从而在保持接近单个网络推理速度的同时,实现与Deep Ensembles相当或更优的UQ性能,且无需从头训练。实验表明,该方法在图像和文本分类任务中均显著优于Deep Ensembles,尤其在ImageNet-1k零样本分类任务中超越现有最优方法。

链接: https://arxiv.org/abs/2510.18358
作者: Firas Gabetni,Giuseppe Curci,Andrea Pilzer,Subhankar Roy,Elisa Ricci,Gianni Franchi
机构: U2IS, ENSTA, Institut Polytechnique de Paris; University of Trento; NVIDIA; University of Bergamo; Fondazione Bruno Kessler (FBK); AMIAD, Pôle Recherche, Palaiseau
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state-of-the-art methods, even without requiring additional training.
zh
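
【代码草图】:"剪除不同注意力头以构造多样化集成成员"可以用头掩码(head mask)近似示意:同一套注意力权重下,各成员置零不同的头,成员均值作预测、成员方差作不确定性的粗略代理。下面的简化实现为假设结构,并未包含论文中的分组全连接合并机制。

```python
import torch
import torch.nn as nn

class MaskableMHA(nn.Module):
    """可按头掩码的简化多头自注意力,用于模拟"剪除注意力头"。"""
    def __init__(self, dim: int = 64, heads: int = 8):
        super().__init__()
        self.h, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.h, self.dk).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)            # (B, h, T, dk)
        att = (q @ k.transpose(-2, -1)) / self.dk ** 0.5
        out = att.softmax(-1) @ v
        out = out * head_mask.view(1, -1, 1, 1)           # 置零被"剪除"的头
        return self.proj(out.transpose(1, 2).reshape(B, T, D))

mha = MaskableMHA()
x = torch.randn(2, 16, 64)
masks = [torch.tensor([1., 1, 1, 1, 0, 0, 1, 1]),         # 三个成员,各剪不同头
         torch.tensor([0., 1, 1, 1, 1, 1, 0, 1]),
         torch.tensor([1., 0, 1, 0, 1, 1, 1, 1])]
with torch.no_grad():
    outs = torch.stack([mha(x, m) for m in masks])
mean_pred, disagreement = outs.mean(0), outs.var(0)       # 均值预测与成员分歧
```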

[CV-48] Learning Human-Object Interaction as Groups

【速读】:该论文旨在解决人类-物体交互检测(Human-Object Interaction Detection, HOI-DET)中对高阶群体交互建模不足的问题。现有方法主要关注成对的人-物关系,忽略了现实场景中由多人与多物共同参与的集体行为(collective behaviors)。解决方案的关键在于提出GroupHOI框架,其核心创新包括:1)基于空间特征学习可训练的邻近度估计器,将人和物体按几何邻近性聚类为组;2)在每组内通过自注意力机制计算软对应关系以传播和分配上下文线索;3)增强Transformer解码器,引入来自人-物对特征的局部上下文信息以融合语义相似性。该方法显著提升了对复杂群体交互(如非语言交互检测任务NVI-DET)的建模能力。

链接: https://arxiv.org/abs/2510.18357
作者: Jiajun Hong,Jianan Wei,Wenguan Wang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
zh

[CV-49] Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

【速读】:该论文旨在解决文本到图像扩散模型在对齐人类偏好时面临的训练不稳定性和图像概率估计不准确的问题,尤其是由于sigmoid函数的非线性特性以及离线数据集多样性不足所导致的挑战。其解决方案的关键在于提出一种基于逆强化学习的新型偏好学习框架——扩散去噪排序优化(Diffusion Denoising Ranking Optimization, Diffusion-DRO),该方法将偏好学习建模为排序问题,从而摒弃了对奖励模型的依赖,并将训练目标简化为去噪形式,有效缓解了非线性估计误差;同时,Diffusion-DRO创新性地融合了离线专家示范与在线策略生成的负样本,显著提升了对人类偏好的捕捉能力并克服了离线数据局限性。

链接: https://arxiv.org/abs/2510.18353
作者: Yi-Lun Wu,Bo-Kai Ruan,Chiang Tseng,Hong-Han Shuai
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at this https URL.
zh

[CV-50] AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

【速读】:该论文旨在解决音频-视觉问答(Audio-Visual Question Answering, AVQA)任务中模型在复杂场景下难以有效聚焦关键信息的问题,具体表现为现有方法在时间采样和模态偏好感知方面缺乏灵活性与动态适应能力,限制了其推理性能。解决方案的关键在于提出一种名为AV-Master的新框架,其核心创新包括:1)引入动态自适应聚焦采样机制,在时间维度上逐步关注与问题最相关的音视频片段,缓解传统采样方法中的冗余与片段碎片化问题;2)设计偏好感知策略,在模态维度上独立建模各模态贡献,实现关键特征的选择性激活;3)提出双路径对比损失,强化时间与模态维度上的一致性与互补性,引导模型学习针对问题的跨模态协同表示。

链接: https://arxiv.org/abs/2510.18346
作者: Jiayu Zhang,Qilang Ye,Shuo Ye,Xun Lin,Zihan Song,Zitong Yu
机构: Great Bay University (大湾区大学); Dongguan Key Laboratory for Intelligence and Information Technology; Nankai University (南开大学); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model’s ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality’s contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
zh

[CV-51] GPT Face: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data

【速读】:该论文旨在解决当前面部知识学习中大规模预训练模型研究不足的问题,尤其是现有方法依赖人工标注的面部数据集,存在劳动密集且模型泛化能力受限的缺陷。解决方案的关键在于提出一种基于生成式预训练的面部知识学习模型,利用互联网上大规模爬取的含人脸图文数据,在自监督任务(如掩码图像/语言建模和图像-文本匹配)上进行预训练,并在生成阶段引入图像-文本匹配损失以实现可控的图像或文本生成,从而提升模型在面部属性分类、表情识别等下游任务及面部编辑(如属性修改、表情操控、遮挡去除等)中的性能与适用性。

链接: https://arxiv.org/abs/2510.18345
作者: Yudong Li,Hao Li,Xianxu Hou,Linlin Shen
机构: Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work was initially drafted in November 2022

点击查看摘要

Abstract:Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribute classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
zh

[CV-52] ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation

【速读】:该论文旨在解决自动驾驶闭环仿真中真实视角外推(Realistic View Extrapolation)的问题,即现有新颖视图合成(Novel View Synthesis, NVS)方法在原轨迹范围之外常产生失真和不一致的图像。其解决方案的关键在于提出一个四阶段综合管道:首先通过数据驱动初始化策略生成鲁棒的伪激光雷达点云以避免局部极小值;其次引入强几何先验,利用一种新型降维的有向距离函数(Signed Distance Function, SDF)——2D-SDF建模道路表面;再次借助生成式先验为外推视角生成伪真值,提供辅助监督信号;最后通过数据驱动适配网络消除时间特异性伪影。该方案在RealADSim-NVS基准上取得0.441的最终得分,排名第一。

链接: https://arxiv.org/abs/2510.18341
作者: Kaiyuan Tan,Yingying Shen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye
机构: Xiaomi EV(小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Realistic view extrapolation is critical for closed-loop simulation in autonomous driving, yet it remains a significant challenge for current Novel View Synthesis (NVS) methods, which often produce distorted and inconsistent images beyond the original trajectory. This report presents our winning solution, which took first place in the RealADSim Workshop NVS track at ICCV 2025. To address the core challenges of street view extrapolation, we introduce a comprehensive four-stage pipeline. First, we employ a data-driven initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding local minima. Second, we inject strong geometric priors by modeling the road surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a generative prior to create pseudo ground truth for extrapolated viewpoints, providing auxiliary supervision. Finally, a data-driven adaptation network removes time-specific artifacts. On the RealADSim-NVS benchmark, our method achieves a final score of 0.441, ranking first among all participants.
zh

[CV-53] Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net

【速读】:该论文旨在解决灾难场景下少样本学习(Few-Shot Learning, FSL)中因数据稀缺、类内差异大和类间相似性高导致的视觉识别性能下降问题。现有FSL方法多依赖通用基准数据集,缺乏遥感灾难图像的支持,且传统基于度量的方法难以有效建模特征分布差异。其解决方案的关键在于提出Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net),通过线性融合Bhattacharyya系数与Hellinger距离来比较和聚合特征概率分布,从而构建鲁棒的原型表示:其中Bhattacharyya系数作为对比边界增强类间可分性,Hellinger距离则正则化同类内的对齐一致性;同时设计了一种基于Bhattacharyya-Hellinger距离的对比损失函数,作为余弦相似度损失的概率分布版本,与交叉熵损失联合优化,显著提升FSL在灾难图像上的泛化能力与识别精度。
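
Bhattacharyya 系数与 Hellinger 距离本身是标准量:对离散分布有 BC(p,q)=Σ√(p·q),H(p,q)=√(1−BC)。以下 PyTorch 片段演示这两个量及其线性组合(组合权重 alpha 为示意性假设,论文的具体加权方式摘要未给出):

```python
import torch

def bhattacharyya_coefficient(p, q, eps=1e-8):
    """BC(p, q) = sum_i sqrt(p_i * q_i); p, q are (B, D) rows summing to 1."""
    return torch.sqrt(p.clamp_min(eps) * q.clamp_min(eps)).sum(dim=-1)

def hellinger_distance(p, q, eps=1e-8):
    """H(p, q) = sqrt(1 - BC(p, q)), bounded in [0, 1]."""
    bc = bhattacharyya_coefficient(p, q, eps)
    return torch.sqrt((1.0 - bc).clamp_min(0.0))

def bh_score(p, q, alpha=0.5):
    # Linear combination of the two quantities; the weight `alpha` is an
    # illustrative assumption rather than the paper's exact formulation.
    return alpha * bhattacharyya_coefficient(p, q) - (1.0 - alpha) * hellinger_distance(p, q)
```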

链接: https://arxiv.org/abs/2510.18326
作者: Gao Yu Lee,Tanmoy Dam,Md Meftahul Ferdaus,Daniel Puiu Poenar,Vu Duong
机构: Nanyang Technological University (南洋理工大学); University of New Orleans (新奥尔良大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to a SN journal

点击查看摘要

Abstract:The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
zh

[CV-54] Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在图像描述和视觉问答等多模态任务中普遍存在对象幻觉(object hallucination)的问题,即模型生成不存在或错误识别的对象描述。现有方法虽通过辅助训练目标或外部模块部分缓解此问题,但在可扩展性、适应性和模型独立性方面仍存在局限。论文提出了一种无需训练的分词级集成解码方法——自适应令牌集成解码(Adaptive Token Ensemble Decoding, ATED),其核心在于推理阶段动态计算各模型在每一步解码中的不确定性权重,以反映其可靠性,并融合多样化的解码路径来增强上下文锚定与语义一致性,从而有效抑制幻觉现象而不牺牲流畅性和相关性。
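
ATED 的核心是推理时按不确定性对各模型的下一 token 分布加权融合。以下为一个无需训练的最小示意,以预测熵作为不确定性度量(这是常见做法,具体权重形式为本文作者的假设):

```python
import torch
import torch.nn.functional as F

def ensemble_next_token(logits_list):
    """Token-level ensemble sketch: weight each model's next-token distribution
    by its negated predictive entropy, so more certain models count more.
    `logits_list` holds one (V,) logits tensor per model; the exact weighting
    scheme of ATED is an assumption here."""
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    entropies = torch.stack([-(p * p.clamp_min(1e-9).log()).sum() for p in probs])
    weights = F.softmax(-entropies, dim=0)          # low entropy -> high weight
    mixed = sum(w * p for w, p in zip(weights, probs))
    return mixed.argmax()
```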

链接: https://arxiv.org/abs/2510.18321
作者: Jinlin Li,Yuran Wang,Yifei Yuan,Xiao Zhou,Yingying Zhang,Xixian Yong,Yefeng Zheng,Xian Wu
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); Department of Electrical and Computer Engineering, McGill University(麦吉尔大学电气与计算机工程系); School of Statistics, Renmin University of China(中国人民大学统计学院); Tencent Jarvis Lab(腾讯混元实验室); Medical Artificial Intelligence Lab, Westlake University(西湖大学医学人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination – generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at this https URL.
zh

[CV-55] OmniNWM: Omniscient Driving Navigation World Models

【速读】:该论文旨在解决当前自动驾驶世界模型在状态(state)、动作(action)和奖励(reward)三个核心维度上的局限性问题:现有模型通常仅支持有限的状态模态(如单一视觉信息)、短时视频序列生成、动作控制精度不足,且缺乏对奖励机制的显式建模。解决方案的关键在于提出OmniNWM——一个统一框架下的全景导航世界模型,其创新点包括:(1)通过联合生成RGB、语义、度量深度与3D占据(3D occupancy)的全景视频来增强状态表征;(2)引入归一化的全景Plucker射线图表示法,将输入轨迹映射为像素级信号,实现高精度且泛化能力强的动作控制;(3)基于生成的3D占据信息直接定义规则驱动的密集奖励,用于衡量驾驶合规性和安全性,从而构建可靠的闭环评估体系。
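
Plücker 射线表示将每条射线编码为(方向 d,矩 m = o×d)。以下示意代码在针孔相机假设下构造逐像素 Plücker 射线图;论文使用的是全景且归一化的变体,此处仅演示基本构造:

```python
import torch

def plucker_ray_map(K_inv, c2w, H, W):
    """Per-pixel Plucker ray map (direction d, moment m = o x d) for a pinhole
    camera; a sketch of the representation, not the paper's panoramic variant.
    K_inv: (3, 3) inverse intrinsics, c2w: (4, 4) camera-to-world pose."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()   # (H, W, 3)
    dirs = (pix @ K_inv.T) @ c2w[:3, :3].T                             # world-space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)                         # m = o x d
    return torch.cat([dirs, moment], dim=-1)                           # (H, W, 6)
```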

链接: https://arxiv.org/abs/2510.18313
作者: Bohan Li,Zhuang Ma,Dalong Du,Baorui Peng,Zhujin Liang,Zhenqiang Liu,Chao Ma,Yueming Jin,Hao Zhao,Wenjun Zeng,Xin Jin
机构: Shanghai Jiao Tong University (上海交通大学); Eastern Institute of Technology, Ningbo (宁波东方理工大学); PhiGent; National University of Singapore (新加坡国立大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at this https URL.
zh

[CV-56] Proactive Reasoning -with-Retrieval Framework for Medical Multimodal Large Language Models

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗场景下推理能力不足的问题,特别是其仅依赖内部知识进行推理时容易产生幻觉性推理和事实性错误,尤其是在面对训练数据未覆盖的病例时。解决方案的关键在于提出首个“基于检索的多模态医学推理框架”(Med-RwR),该框架通过在推理过程中主动查询症状或领域特定医学概念来获取外部知识,并设计了两阶段强化学习策略以激励模型同时利用视觉诊断发现和文本临床信息实现有效检索;此外,还引入置信度驱动的图像再检索方法(Confidence-Driven Image Re-retrieval, CDIR),用于测试阶段提升性能,从而显著增强模型在未知领域的泛化能力与可靠性。

链接: https://arxiv.org/abs/2510.18303
作者: Lehan Wang,Yi Qin,Honglong Yang,Xiaomeng Li
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress

点击查看摘要

Abstract:Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model’s proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR’s significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at this https URL.
zh

[CV-57] GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation ICCV

【速读】:该论文旨在解决单目深度估计中绝对度量深度(metric depth)难以准确预测的问题,尤其是在存在尺度模糊性(scale ambiguity)的单图像场景下。现有基于扩散模型的单目深度估计(diffusion-based monocular depth estimation, DB-MDE)方法虽能较好地恢复相对深度,但无法确定真实物理尺度。解决方案的关键在于将深度估计重构为一个逆问题(inverse problem),利用预训练的潜在扩散模型(latent diffusion models, LDMs)在RGB图像条件下的先验知识,并引入立体视觉(stereo vision)提供的几何约束来学习尺度和偏移参数(scale and shift),从而实现无需重新训练即可获得精确的绝对度量深度估计。该方法可无缝集成到现有DB-MDE框架中,且在室内、室外及复杂场景下均表现出良好泛化能力。
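
将相对深度对齐到绝对度量深度的"尺度+偏移"拟合可用最小二乘闭式求解。以下示意假设已从立体匹配得到稀疏的度量深度观测,仅演示对齐这一步(与论文的扩散引导流程无关):

```python
import torch

def fit_scale_shift(d_rel, d_metric, mask):
    """Least-squares scale/shift aligning relative depth to metric depth on
    valid (e.g., stereo-matched) pixels: min_{s,t} ||s * d_rel + t - d_metric||^2.
    `mask` is a boolean tensor marking pixels with metric observations."""
    x = d_rel[mask].reshape(-1, 1)
    y = d_metric[mask].reshape(-1, 1)
    A = torch.cat([x, torch.ones_like(x)], dim=1)       # (N, 2) design matrix
    sol = torch.linalg.lstsq(A, y).solution             # (2, 1): [s, t]
    s, t = sol[0, 0], sol[1, 0]
    return s, t, s * d_rel + t                          # metric-aligned depth
```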

链接: https://arxiv.org/abs/2510.18291
作者: Tuan Pham,Thanh-Tung Le,Xiaohui Xie,Stephan Mandt
机构: University of California, Irvine (加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV Findings 2025. The first two authors contributed equally. The last two authors share co-corresponding authorship

点击查看摘要

Abstract:We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
zh

[CV-58] Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models

【速读】:该论文旨在解决3D人脸属性编辑中身份保持与视图一致性难以兼顾的问题,尤其在低分辨率下编辑灵活性高但高分辨率下难以实现精细控制的困境。其核心挑战在于生成模型需同时理解多视角下的几何一致性并渲染逼真3D人脸,且依赖大规模带属性标签的数据集。解决方案的关键在于利用3D感知生成模型和2D肖像编辑技术,通过少量(≤10)标注样本识别潜在空间中对应于特定属性的编辑方向,从而实现高效、少样本的身份保持属性编辑;并通过Attribute Style Manipulation (ASM) 技术验证编辑的线性特性,进一步探索连续风格流形以支持3D一致的身份保持人脸老化编辑。
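
在潜在空间中用少量标注样本估计编辑方向,一个常见的基线做法是取属性正/负两组样本潜码均值之差并归一化。以下示意与论文思路一致但未必是其精确实现,仅供参考:

```python
import torch

def estimate_edit_direction(latents_with, latents_without):
    """Few-shot edit direction as the normalized difference of mean latents
    between attribute-positive and attribute-negative examples (<=10 each);
    a common latent-arithmetic baseline, assumed here for illustration."""
    d = latents_with.mean(dim=0) - latents_without.mean(dim=0)
    return d / d.norm()

def apply_edit(w, direction, strength=1.0):
    # Move a latent code along the estimated direction; `strength` scales the edit.
    return w + strength * direction
```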

链接: https://arxiv.org/abs/2510.18287
作者: Vishal Vinod
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression etc., while preserving the identity of the face. Recent progress in 2D generative models have enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions. Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: this https URL
zh

[CV-59] StreamingTOM: Streaming Token Compression for Efficient Video Understanding

【速读】:该论文旨在解决流式视频视觉-语言模型(streaming video vision-language models)在实时处理中面临的两个核心挑战:因果性限制(causality)和token累积效应(accumulation)。前者导致无法利用未来帧信息,后者则引发kv-cache无界增长,造成显著的计算与内存效率瓶颈。现有方法仅优化LLM后端的kv-cache管理,未解决前端预填充(prefill)阶段的高开销问题。其解决方案的关键在于提出一个无需训练、即插即用的两阶段框架StreamingTOM:第一阶段通过因果时间压缩(Causal Temporal Reduction)对每帧固定预算内选择关键视觉token(基于相邻帧变化和token显著性),大幅降低每帧预填充成本;第二阶段采用在线量化记忆(Online Quantized Memory)以4-bit格式存储token,在需时动态检索并解量化,确保活跃kv-cache大小与视频长度无关。该设计实现了可预测延迟下的高效流式视频理解,显著优于当前最优方法。
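
第一阶段的逐帧 token 选择可理解为"相邻帧变化 + 显著性"打分后取 top-k。以下示意中以特征范数充当显著性代理、以线性加权融合两项打分,均为本文作者的简化假设:

```python
import torch

def select_frame_tokens(tokens_t, tokens_prev, budget, lam=0.5):
    """Causal per-frame token selection sketch: score each visual token by its
    change vs. the previous frame plus a saliency term (feature norm is our
    stand-in proxy), then keep the top-`budget` tokens. tokens_*: (N, D)."""
    change = (tokens_t - tokens_prev).norm(dim=-1)      # (N,)
    saliency = tokens_t.norm(dim=-1)                    # (N,) assumed proxy
    score = lam * change + (1.0 - lam) * saliency
    keep = score.topk(budget).indices
    return tokens_t[keep], keep
```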

链接: https://arxiv.org/abs/2510.18269
作者: Xueyi Chen,Keda Tao,Kele Shao,Huan Wang
机构: Westlake University (西湖大学); The Chinese University of Hong Kong (香港中文大学); Zhejiang University (浙江大学); SII
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves 15.7× kv-cache compression, 1.2× lower peak memory, and 2× faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of 63.8% on offline benchmarks and 55.8%/3.7 on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
zh

[CV-60] reeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)框架下医学图像分割任务中因跨域数据异构性导致的全局漂移(Global Drift, GD)问题,该问题会显著削弱模型在不同域间的泛化能力。解决方案的关键在于提出一种树状拓扑结构的联邦域泛化框架(TreeFedDG):首先,设计基于树结构的分层参数聚合机制以抑制全局模型方向的偏差;其次,引入基于参数差异的风格混合方法(FedStyle),通过强制最大参数差异客户端间进行特征混合提升抗漂移鲁棒性;最后,在模型分发阶段采用渐进式个性化融合策略平衡知识迁移与个体特征保留,并在推理阶段利用特征相似性从树结构中检索最相关模型链进行集成决策,从而充分挖掘层级知识优势。
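
树状分层参数聚合可以用递归平均来示意:叶节点为客户端参数,内部节点逐层对子节点取均值,从而抑制单一域对全局方向的牵引。以下为最小示意,树的字典结构与真实实现细节均为假设:

```python
import torch

def tree_aggregate(node):
    """Recursive sketch of tree-structured parameter aggregation: a leaf holds
    a client state_dict under "params"; an internal node averages its children
    level by level, damping any single domain's pull on the global model."""
    if "params" in node:                               # leaf client
        return node["params"]
    child_params = [tree_aggregate(c) for c in node["children"]]
    return {k: torch.stack([p[k] for p in child_params]).mean(dim=0)
            for k in child_params[0]}
```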

链接: https://arxiv.org/abs/2510.18268
作者: Yucheng Song,Chenxi Li,Haokang Ding,Zhining Liao,Zhifang Liao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In medical image segmentation tasks, Domain Generalization (DG) under the Federated Learning (FL) framework is crucial for addressing challenges related to privacy protection and data heterogeneity. However, traditional federated learning methods fail to account for the imbalance in information aggregation across clients in cross-domain scenarios, leading to the Global Drift (GD) problem and a consequent decline in model generalization performance. This motivates us to delve deeper and define a new critical issue: global drift in federated domain generalization for medical imaging (FedDG-GD). In this paper, we propose a novel tree topology framework called TreeFedDG. First, starting from the distributed characteristics of medical images, we design a hierarchical parameter aggregation method based on a tree-structured topology to suppress deviations in the global model direction. Second, we introduce a parameter difference-based style mixing method (FedStyle), which enforces mixing among clients with maximum parameter differences to enhance robustness against drift. Third, we develop a progressive personalized fusion strategy during model distribution, ensuring a balance between knowledge transfer and personalized features. Finally, during the inference phase, we use feature similarity to guide the retrieval of the most relevant model chain from the tree structure for ensemble decision-making, thereby fully leveraging the advantages of hierarchical knowledge. We conducted extensive experiments on two publicly available datasets. The results demonstrate that our method outperforms other state-of-the-art domain generalization approaches in these challenging tasks and achieves better balance in cross-domain performance.
zh

[CV-61] Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization ICME2025

【速读】:该论文旨在解决现有3D人体网格恢复方法在复杂场景下难以充分挖掘潜在信息(如人体运动、形状对齐)而导致肢体错位和局部细节不足的问题,同时克服基于注意力机制建模网格顶点与姿态节点交互时带来的高计算成本。其解决方案的关键在于提出一种两阶段网络架构:第一阶段通过分解图像特征的高低频成分,提取全局(如整体形状对齐)与局部(如纹理、细节)信息,并聚合为混合潜频域特征,从而有效挖掘潜在信息并增强从2D姿态到3D学习的迁移能力;第二阶段利用该混合潜频域特征,设计一种低维网格姿态交互机制,通过降维与并行优化显著降低计算复杂度,同时保持重建精度,实现了高效且高质量的3D人体网格恢复。

链接: https://arxiv.org/abs/2510.18267
作者: Xiang Zhang,Suping Wu,Sheng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICME2025

点击查看摘要

Abstract:Existing 3D human mesh recovery methods often fail to fully exploit the latent information (e.g., human motion, shape alignment), leading to issues with limb misalignment and insufficient local details in the reconstructed human mesh (especially in complex scenes). Furthermore, the performance improvement gained by modelling mesh vertices and pose node interactions using attention mechanisms comes at a high computational cost. To address these issues, we propose a two-stage network for human mesh recovery based on latent information and low dimensional learning. Specifically, the first stage of the network fully excavates global (e.g., the overall shape alignment) and local (e.g., textures, detail) information from the low and high-frequency components of image features and aggregates this information into a hybrid latent frequency domain feature. This strategy effectively extracts latent information. Subsequently, utilizing extracted hybrid latent frequency domain features collaborates to enhance 2D poses to 3D learning. In the second stage, with the assistance of hybrid latent features, we model the interaction learning between the rough 3D human mesh template and the 3D pose, optimizing the pose and shape of the human mesh. Unlike existing mesh pose interaction methods, we design a low-dimensional mesh pose interaction method through dimensionality reduction and parallel optimization that significantly reduces computational costs without sacrificing reconstruction accuracy. Extensive experimental results on large publicly available datasets indicate superiority compared to most state-of-the-art methods.
zh

[CV-62] From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation

【速读】:该论文旨在解决主体驱动的图像生成模型(subject-driven image generation models)中身份保留(fidelity)与提示遵循(editability)之间的根本性权衡问题。现有方法如在线强化学习(online reinforcement learning, RL)中的GRPO虽具潜力,但直接应用会导致竞争性退化(competitive degradation),原因在于静态权重线性聚合奖励信号会引发冲突梯度,并与扩散过程的时间动态不匹配。解决方案的关键在于提出定制化GRPO(Customized-GRPO),其核心创新包括:(i) 协同感知奖励塑造(Synergy-Aware Reward Shaping, SARS),一种非线性机制,显式惩罚冲突奖励信号并增强协同信号,从而提供更清晰、更具决定性的梯度;(ii) 时间感知动态加权(Time-Aware Dynamic Weighting, TDW),根据扩散过程的时间特性动态调整优化压力——早期侧重提示遵循,后期侧重身份保留。实验表明,该框架显著优于原始GRPO基线,有效缓解了竞争性退化,实现了身份特征保留与复杂文本提示遵循之间的优越平衡。
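
SARS 与 TDW 的具体函数形式论文摘要并未给出,以下示意仅体现两点直觉:用奖励乘积的符号非线性地区分"协同/冲突"信号,以及时间权重随扩散步从"提示遵循"向"身份保留"过渡。一切具体形式均为本文作者的假设:

```python
import torch

def sars_tdw_reward(r_id, r_prompt, t, T, beta=1.0):
    """Illustrative sketch only (exact forms are assumptions): the synergy term
    is positive when the identity and prompt rewards agree and negative when
    they conflict; the time-aware weights favor prompt-following early in the
    diffusion trajectory and identity preservation late. r_id, r_prompt: tensors."""
    synergy = torch.tanh(beta * r_id * r_prompt)   # >0 synergistic, <0 conflicted
    w_prompt = 1.0 - t / T                          # high at early steps
    w_id = t / T                                    # high at late steps
    return w_prompt * r_prompt + w_id * r_id + synergy
```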

链接: https://arxiv.org/abs/2510.18263
作者: Ziwei Huang,Ying Shu,Hao Fang,Quanyu Long,Wenya Wang,Qiushi Guo,Tiezheng Ge,Leilei Gan
机构: Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团); Nanyang Technological University (南洋理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early steps and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
zh

[CV-63] UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding

【速读】:该论文旨在解决当前大型视觉语言模型(Vision-Language Models, VLMs)在水下环境中的理解能力严重不足的问题。现有VLMs在自然场景中表现优异,但在水下环境中因光衰减、颜色失真和悬浮颗粒散射等独特挑战,以及对海洋生态系统和生物分类学的专业知识需求,导致其性能显著下降。解决方案的关键在于构建一个名为UWBench的综合性基准数据集,该数据集包含15,003张高分辨率水下图像及其丰富标注,包括15,281个物体指代表达和124,983个问答对,覆盖从物体识别到生态关系理解的多模态推理能力。基于此数据集,作者进一步建立了三项具体任务基准:详细图像描述生成、视觉定位(visual grounding)和视觉问答(Visual Question Answering),从而为水下视觉语言理解提供了一个真实且具有挑战性的评估平台,并揭示了当前SOTA VLMs在此领域仍存在巨大提升空间。

链接: https://arxiv.org/abs/2510.18262
作者: Da Zhang,Chenggang Rong,Bingyu Li,Feiyu Wang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信); School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: We have released V1, which only reports the test results. Our work is still ongoing, and the next version will be coming soon

点击查看摘要

Abstract:Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
zh

[CV-64] Hyperbolic Space Learning Method Leverag ing Temporal Motion Priors for Human Mesh Recovery ICME2025

【速读】:该论文旨在解决现有基于视频的3D人体网格恢复方法在欧几里得空间(Euclidean space)中学习网格特征时难以准确捕捉人体固有的层次结构(如躯干-四肢-手指)的问题,从而导致重建的人体网格出现错误。解决方案的关键在于引入双阶段优化策略:首先设计一个时间运动先验提取模块,从输入的3D姿态序列和图像特征序列中分别提取时间运动特征并融合为时间运动先验,增强对时序运动维度的表达能力;其次提出一种超球面空间(hyperbolic space)优化学习策略,利用该先验信息在超球面空间中分别优化3D姿态及其运动特征,以更好地建模层次结构,并结合超球面网格优化损失函数确保学习过程的稳定性和有效性,最终实现更准确、平滑的3D人体网格重建。

链接: https://arxiv.org/abs/2510.18256
作者: Xiang Zhang,Suping Wu,Weibin Qiu,Zhaocheng Jin,Sheng Yang
机构: Ningxia University (宁夏大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICME2025

点击查看摘要

Abstract:3D human meshes show a natural hierarchical structure (like torso-limbs-fingers), but existing video-based 3D human mesh recovery methods usually learn mesh features in Euclidean space, which makes it hard to capture this hierarchical structure accurately and leads to erroneous reconstructed meshes. To solve this problem, we propose a hyperbolic space learning method leveraging a temporal motion prior for recovering 3D human meshes from videos. First, we design a temporal motion prior extraction module that extracts temporal motion features from the input 3D pose sequences and image feature sequences respectively, and then combines them into the temporal motion prior, strengthening the ability to express features along the temporal motion dimension. Since data representation in non-Euclidean space (especially hyperbolic space) has been shown to effectively capture hierarchical relationships in real-world datasets, we further design a hyperbolic space optimization learning strategy. This strategy uses the temporal motion prior to assist learning, and uses 3D pose and pose motion information respectively in hyperbolic space to optimize and learn the mesh features; the optimized results are then combined to obtain an accurate and smooth human mesh. Besides, to keep the optimization learning process of human meshes in hyperbolic space stable and effective, we propose a hyperbolic mesh optimization loss. Extensive experimental results on large publicly available datasets indicate superiority in comparison with most state-of-the-art methods.
zh

[CV-65] OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion

【速读】:该论文旨在解决当前基于语义高斯点绘(Semantic Gaussian Splatting)方法在开放词汇3D实例分割中面临的两大挑战:一是预处理阶段单个mask缺乏足够的上下文信息,二是多视角特征融合时存在不一致性和细节缺失问题。解决方案的关键在于提出OpenInsGaussian框架,其核心创新包括两个模块:(1) 上下文感知特征提取模块,通过增强每个mask的语义上下文来提升分割精度;(2) 注意力驱动的特征聚合模块,通过选择性融合多视角特征以减少对齐误差并弥补完整性不足。该方法在基准数据集上实现了最先进的性能,显著优于现有基线,验证了其在复杂现实场景中3D场景理解任务中的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2510.18253
作者: Tianyu Huang,Runnan Chen,Dongting Hu,Fengming Huang,Mingming Gong,Tongliang Liu
机构: University of Sydney (悉尼大学); University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce \textbfOpenInsGaussian, an \textbfOpen-vocabulary \textbfInstance \textbfGaussian segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
zh

[CV-66] BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining

【速读】:该论文旨在解决零样本3D物体分类在真实场景中的泛化难题,特别是由合成数据与稀疏、噪声较多的真实LiDAR扫描之间的域差异(domain gap)所导致的性能下降问题。现有方法要么仅依赖合成数据而无法适应室外场景,要么仅使用真实数据却缺乏对罕见或未见类别的语义多样性识别能力。其解决方案的关键在于提出BlendCLIP框架,通过一种基于课程学习(curriculum-based)的数据混合策略,在训练初期利用语义丰富的合成CAD数据建立模型基础,随后逐步引入真实世界点云数据以适应实际扫描特征,从而实现高效且鲁棒的跨域迁移。实验表明,仅需每批次加入1.5%的真实样本即可使nuScenes基准上的零样本准确率提升27%,最终模型在nuScenes和TruckScenes等户外数据集上达到当前最优性能。
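
课程式数据混合可用一个随训练步升温的真实样本比例来示意:起初只用合成 CAD 三元组,随后逐步混入真实扫描样本(摘要报告每批约 1.5% 真实样本即可带来显著增益;下面的线性升温日程为假设):

```python
import random

def curriculum_batch(synthetic, real, step, total_steps,
                     max_real_frac=0.015, batch_size=256):
    """Curriculum mixing sketch: ground the model in synthetic data first, then
    linearly ramp the per-batch fraction of real-world samples up to ~1.5%
    (the figure reported in the abstract; the ramp schedule is an assumption)."""
    real_frac = max_real_frac * min(1.0, step / (0.5 * total_steps))
    return [random.choice(real) if random.random() < real_frac
            else random.choice(synthetic) for _ in range(batch_size)]
```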

链接: https://arxiv.org/abs/2510.18244
作者: Ajinkya Khoche,Gergő László Nagy,Maciej Wozniak,Thomas Gustafsson,Patric Jensfelt
机构: KTH Royal Institute of Technology (皇家理工学院); Scania CV AB (斯堪尼亚商用车公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Zero-shot 3D object classification is crucial for real-world applications like autonomous driving, however it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real-world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets – consisting of a point cloud, image, and text description – mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on this https URL.
zh

[CV-67] DeepSeek -OCR: Contexts Optical Compression

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长文本或高分辨率图像输入时面临的上下文长度限制与计算资源消耗过高的问题,尤其是在光学字符识别(Optical Character Recognition, OCR)任务中如何高效压缩视觉信息以维持高精度。其解决方案的关键在于提出DeepSeek-OCR框架,通过引入一个名为DeepEncoder的编码器组件,实现对高分辨率输入图像的二维光学映射压缩,从而在保持低激活值的同时获得高压缩比(最高达20倍),显著减少视觉token数量;同时配合轻量级解码器DeepSeek3B-MoE-A570M,在压缩比为10倍时仍能实现97%的OCR精度,证明了该方法在历史文档长上下文压缩和LLMs记忆遗忘机制研究中的可行性与实用性。

链接: https://arxiv.org/abs/2510.18234
作者: Haoran Wei,Yaofeng Sun,Yukun Li
机构: DeepSeek-AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at this http URL.
zh

[CV-68] Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis

【速读】:该论文旨在解决目标检测中因数据分布不均导致的偏差问题,特别是针对稀有类别样本不足及现有生成式增强方法无法有效消除偏见的局限性。其关键解决方案在于提出一种基于生成的去偏框架,核心创新包括:引入表示分数(Representation Score, RS)以诊断超出频率维度的表征差距,从而指导生成更均衡且无偏的布局;同时,用精确的视觉蓝图替代模糊文本提示,并采用生成对齐策略增强检测器与生成器之间的协同优化,从而提升复杂场景下合成图像的质量与布局准确性。

链接: https://arxiv.org/abs/2510.18229
作者: Xinhao Cai,Liulei Li,Gensheng Pei,Tao Chen,Jinshan Pan,Yazhou Yao,Wenguan Wang
机构: Nanjing University of Science and Technology (南京理工大学); Zhejiang University (浙江大学); State Key Laboratory of Intelligent Manufacturing of Advanced Construction Machinery (先进施工机械智能制造国家重点实验室); National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University (西安交通大学人机混合增强智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a generation-based debiasing framework for object detection. Prior debiasing methods are often limited by the representation diversity of samples, while naive generative augmentation often preserves the biases it aims to solve. Moreover, our analysis reveals that simply generating more data for rare classes is suboptimal due to two core issues: i) instance frequency is an incomplete proxy for the true data needs of a model, and ii) current layout-to-image synthesis lacks the fidelity and control to generate high-quality, complex scenes. To overcome this, we introduce the representation score (RS) to diagnose representational gaps beyond mere frequency, guiding the creation of new, unbiased layouts. To ensure high-quality synthesis, we replace ambiguous text prompts with a precise visual blueprint and employ a generative alignment strategy, which fosters communication between the detector and generator. Our method significantly narrows the performance gap for underrepresented object groups, e.g., improving large/rare instances by 4.4/3.6 mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP for layout accuracy in generated images.
zh

[CV-69] EMA-SAM: Exponential Moving-averag e for SAM-based PTMC Segmentation

【速读】:该论文旨在解决介入式超声视频中甲状腺乳头状微癌(Papillary Thyroid Microcarcinoma, PTMC)病灶分割不稳定的问题,其主要挑战包括低对比度、探头运动干扰及热效应伪影导致的帧间不一致性。解决方案的关键在于提出一种轻量级改进模型EMA-SAM,通过在Segment Anything Model 2(SAM-2)的记忆库中引入置信度加权指数移动平均(Exponential Moving Average, EMA)指针机制,构建稳定的肿瘤潜在原型(latent prototype),从而实现跨帧的时间一致性保持,并在探头压力变化或气泡遮挡等干扰条件下仍能维持跟踪稳定性;同时,在清晰证据重新出现时具备快速适应能力。实验表明,EMA-SAM在PTMC-RFA数据集上将最大Dice分数从0.82提升至0.86,IoU从0.72提升至0.76,且误报率降低29%,同时仅增加小于0.1%的浮点运算次数(FLOPs),可在单张A100 GPU上保持约30 FPS的实时处理速度。
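
置信度加权 EMA 指针的更新规则十分简洁:置信度低的帧(如气泡遮挡、探头压迫)几乎不更新原型,清晰帧则快速适应。以下为示意实现,基础步长 alpha 的取值为假设:

```python
import torch

def update_ema_pointer(pointer, new_feat, confidence, alpha=0.1):
    """Confidence-weighted EMA over the tumour's latent prototype: the
    effective step size is alpha * confidence, so low-confidence frames barely
    move the pointer while clear evidence adapts it quickly. All args tensors."""
    gamma = alpha * confidence.clamp(0.0, 1.0)
    return (1.0 - gamma) * pointer + gamma * new_feat
```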

链接: https://arxiv.org/abs/2510.18213
作者: Maryam Dialameh,Hossein Rajabzadeh,Jung Suk Sim,Hyock Ju Kwon
机构: University of Waterloo (滑铁卢大学); Withsim Clinic (温西诊所); Ewha Womans University Medical Center (梨花女子大学医学院中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound videos remains difficult due to low contrast, probe-induced motion, and heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes well to static images, but its frame-independent design yields unstable predictions and temporal drift in interventional ultrasound. We introduce EMA-SAM, a lightweight extension of SAM-2 that incorporates a confidence-weighted exponential moving average pointer into the memory bank, providing a stable latent prototype of the tumour across frames. This design preserves temporal coherence through probe pressure and bubble occlusion while rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improves maxDice from 0.82 (SAM-2) to 0.86 and maxIoU from 0.72 to 0.76, while reducing false positives by 29%. On external benchmarks, including VTUS and colonoscopy video polyp datasets, EMA-SAM achieves consistent gains of 2–5 Dice points over SAM-2. Importantly, the EMA pointer adds <0.1% FLOPs, preserving real-time throughput of ~30 FPS on a single A100 GPU. These results establish EMA-SAM as a robust and efficient framework for stable tumour tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound. Codes are available at this https URL.
zh

[CV-70] FST.ai 2.0: An Explainable AI Ecosystem for Fair Fast and Inclusive Decision-Making in Olympic and Paralympic Taekwondo

【速读】:该论文旨在解决奥林匹克与残奥会搏击类项目中公平、透明及可解释决策的挑战,特别是在跆拳道(Taekwondo)比赛中裁判判罚的主观性与效率问题。其解决方案的关键在于构建一个可解释人工智能(Explainable AI, XAI)生态系统——http URL 2.0,该系统融合基于图卷积网络(Graph Convolutional Networks, GCNs)的姿态动作识别、通过可信集(Credal Sets)建模的信念不确定性量化,以及用于视觉决策支持的可解释性叠加模块;同时集成交互式仪表板以实现人机协作,覆盖裁判评分、运动员表现分析和残奥跆拳道分级等场景,从而在实时竞技与训练中提升判罚透明度与可信度,实验证明其可使决策复核时间减少85%、裁判对AI辅助决策的信任度达93%。

链接: https://arxiv.org/abs/2510.18193
作者: Keivan Shariatmadar,Ahmad Osman,Ramin Ray,Usman Dildar,Kisam Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 23 pages, 12 figures

点击查看摘要

Abstract:Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents FST.ai 2.0, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates pose-based action recognition using graph convolutional networks (GCNs), epistemic uncertainty modeling through credal sets, and explainability overlays for visual decision support. A set of interactive dashboards enables human–AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, FST.ai 2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an 85% reduction in decision review time and 93% referee trust in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, FST.ai 2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.
zh

[CV-71] A Generalizable Light Transport 3D Embedding for Global Illumination

【速读】:该论文旨在解决全局光照(Global Illumination, GI)在真实感渲染中计算成本高、难以跨场景泛化的问题。传统神经方法多依赖于单场景优化,且在处理相机或几何变化时能力有限;而现有跨场景方法主要局限于2D屏幕空间(如神经去噪或G-buffer-based GI预测),常导致视角不一致和空间理解不足。其解决方案的关键在于提出一种可泛化的3D光传输嵌入(3D light transport embedding),直接从3D场景配置中近似全局光照,无需依赖光栅化或路径追踪的提示信息。具体而言,每个场景以带有几何与材质特征的点云表示,通过可扩展的Transformer建模点间全局交互,编码为神经基元;渲染时,查询点通过最近邻搜索获取邻近基元,并利用交叉注意力聚合其潜在特征以预测目标渲染量(如辐照度)。该方法在多样室内场景下实现了漫反射全局光照预测,且训练好的嵌入可通过少量微调快速适配新渲染任务,同时初步展示了对光泽材质的空间-方向辐射场估计及其在无偏路径引导加速中的潜力。

链接: https://arxiv.org/abs/2510.18189
作者: Bing Xu,Mukund Varma T,Cheng Wang,Tzumao Li,Lifan Wu,Bartlomiej Wronski,Ravi Ramamoorthi,Marco Salvi
机构: University of California, San Diego (加州大学圣地亚哥分校); NVIDIA (英伟达)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Global illumination (GI) is essential for realistic rendering but remains computationally expensive due to the complexity of simulating indirect light transport. Recent neural methods have mainly relied on per-scene optimization, sometimes extended to handle changes in camera or geometry. Efforts toward cross-scene generalization have largely stayed in 2D screen space, such as neural denoising or G-buffer based GI prediction, which often suffer from view inconsistency and limited spatial understanding. We propose a generalizable 3D light transport embedding that approximates global illumination directly from 3D scene configurations, without using rasterized or path-traced cues. Each scene is represented as a point cloud with geometric and material features. A scalable transformer models global point-to-point interactions to encode these features into neural primitives. At render time, each query point retrieves nearby primitives via nearest-neighbor search and aggregates their latent features through cross-attention to predict the desired rendering quantity. We demonstrate results on diffuse global illumination prediction across diverse indoor scenes with varying layouts, geometry, and materials. The embedding trained for irradiance estimation can be quickly adapted to new rendering tasks with limited fine-tuning. We also present preliminary results for spatial-directional radiance field estimation for glossy materials and show how the normalized field can accelerate unbiased path guiding. This approach highlights a path toward integrating learned priors into rendering pipelines without explicit ray-traced illumination cues.
zh

[CV-72] RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology

【速读】:该论文旨在解决当前医学视觉语言模型在面对复杂视觉问题时,难以同时生成诊断文本和像素级分割掩码的问题(即多模态输出不协同),这限制了其在临床辅助诊断中的实用价值。解决方案的关键在于:首先构建了一个统一且分层的任务数据集 RadDiagSeg-D,该数据集整合了异常检测、诊断生成与多目标分割;其次提出一种新型视觉语言模型 RadDiagSeg-M,能够联合执行异常检测、诊断推理与灵活分割任务,从而实现描述性文本与对应分割掩码的同步输出,显著提升辅助诊断系统的上下文信息丰富度和临床实用性。

链接: https://arxiv.org/abs/2510.18188
作者: Chengrun Li,Corentin Royer,Haozhe Luo,Bastian Wittmann,Xia Li,Ibrahim Hamamci,Sezgin Er,Anjany Sekuboyina,Bjoern Menze
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline.
zh

[CV-73] VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis

【速读】:该论文旨在解决拥挤场景中异常检测的难题,特别是由严重的人体遮挡以及高度动态且依赖上下文的运动模式所带来的挑战。现有方法往往难以适应不同密度的客流,并缺乏可解释的异常指标。其解决方案的关键在于提出VelocityNet这一双通道框架,通过头部检测与密集光流(dense optical flow)联合提取个体速度信息,再利用层次聚类将速度划分为语义运动类别(静止、慢速、正常、快速),最后基于百分位数构建异常评分机制,量化偏离已学习正常模式的程度,从而实现对复杂人群环境中多种异常运动模式的实时检测与可解释识别。
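
基于百分位数的异常评分可示意如下:先在正常片段上统计个体速度的分位数,再度量新样本超出正常带的程度。分位点的选取与评分函数的具体形状均为示意性假设:

```python
import numpy as np

def fit_normal_profile(train_speeds):
    # Percentile thresholds learned from normal (anomaly-free) footage.
    return np.percentile(train_speeds, [25, 50, 75, 99])

def anomaly_score(speed, profile):
    """Score = how far a person's velocity falls outside the learned percentile
    band; the band edges and scoring shape here are illustrative assumptions."""
    p25, p50, p75, p99 = profile
    if speed <= p99:
        return max(0.0, (speed - p75) / (p99 - p75 + 1e-8))   # within band
    return 1.0 + (speed - p99) / (p99 + 1e-8)                 # beyond the 99th
```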

链接: https://arxiv.org/abs/2510.18187
作者: Fatima AlGhamdi,Omar Alharbi,Abdullah Aldwyish,Raied Aljadaany,Muhammad Kamran J Khan,Huda Alamri
机构: Saudi Data and Artificial Intelligence Authority (沙特数据与人工智能管理局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Detecting anomalies in crowded scenes is challenging due to severe inter-person occlusions and highly dynamic, context-dependent motion patterns. Existing approaches often struggle to adapt to varying crowd densities and lack interpretable anomaly indicators. To address these limitations, we introduce VelocityNet, a dual-pipeline framework that combines head detection and dense optical flow to extract person-specific velocities. Hierarchical clustering categorizes these velocities into semantic motion classes (halt, slow, normal, and fast), and a percentile-based anomaly scoring system measures deviations from learned normal patterns. Experiments demonstrate the effectiveness of our framework in real-time detection of diverse anomalous motion patterns within densely crowded environments.
zh

[CV-74] Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset ICCV2025 ICCV

【速读】:该论文旨在解决月球表面高精度三维重建难题,现有立体视觉重建方法因月面缺乏纹理、光照条件复杂及轨道轨迹特殊而表现不佳;同时,当前基于人类尺度数据集训练的深度学习模型难以直接迁移至月球场景。解决方案的关键在于构建首个基于物理渲染的月球立体图像对开放数据集LunarStereo,其利用光线追踪技术模拟高分辨率地形与反射率模型,覆盖南极区域多样高度、光照和视角条件,从而提供物理可信的监督信号;在此基础上,通过在LunarStereo上微调MASt3R模型实现对月球域的适应,实验验证了该方法在合成与真实月球数据上的显著性能提升,为外星环境中跨尺度泛化提供了新路径。

链接: https://arxiv.org/abs/2510.18172
作者: Clementine Grethen,Simone Gasparini,Geraldine Morin,Jeremy Lebreton,Lucas Marti,Manuel Sanchez-Gestido
机构: IRIT(图卢兹国立理工学院); Toulouse INP(图卢兹国立理工学院); Université de Toulouse(图卢兹大学); Airbus Defence and Space(空中客车防务与空间公司); ESA (欧洲航天局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV workshop 2025. The project page can be accessed via this https URL. The source code is available at this https URL

点击查看摘要

Abstract:Accurate 3D reconstruction of lunar surfaces is essential for space exploration. However, existing stereo vision reconstruction methods struggle in this context due to the Moon's lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models, trained on human-scale datasets, have rarely been tested on planetary imagery and cannot be transferred directly to lunar conditions. To address this issue, we introduce LunarStereo, the first open dataset of photorealistic stereo image pairs of the Moon, simulated using ray tracing based on high-resolution topography and reflectance models. It covers diverse altitudes, lighting conditions, and viewing angles around the lunar South Pole, offering physically grounded supervision for 3D reconstruction tasks. Based on this dataset, we adapt the MASt3R model to the lunar domain through fine-tuning on LunarStereo. We validate our approach through extensive qualitative and quantitative experiments on both synthetic and real lunar data, evaluating 3D surface reconstruction and relative pose estimation, demonstrating significant improvements over zero-shot baselines and paving the way for robust cross-scale generalization in extraterrestrial environments.
zh

[CV-75] World-in-World: World Models in a Closed-Loop World

【速读】:该论文旨在解决生成式世界模型(Generative World Models, WMs)在具身智能体决策中实际效用评估不充分的问题,即现有基准测试多采用开环协议,仅关注视觉质量而忽视了WMs在闭环环境中对任务成功率的贡献。为填补这一空白,作者提出World-in-World平台,其关键在于构建一个模拟真实智能体-环境交互的闭环世界,并提供统一的在线规划策略与标准化动作API,使异构WMs可在相同框架下进行决策比较;同时通过四个闭环环境严格评估不同WMs,以任务成功为核心指标,并首次揭示具身场景下世界模型的数据缩放规律,从而系统性地验证了可控性、后训练数据扩展和推理时计算资源分配对提升闭环比对性能的关键作用。

链接: https://arxiv.org/abs/2510.18135
作者: Jiahan Zhang,Muqing Jiang,Nanru Dai,Taiming Lu,Arda Uzunoglu,Shunchi Zhang,Yana Wei,Jiahao Wang,Vishal M. Patel,Paul Pu Liang,Daniel Khashabi,Cheng Peng,Rama Chellappa,Tianmin Shu,Alan Yuille,Yilun Du,Jieneng Chen
机构: Johns Hopkins University (约翰霍普金斯大学); Peking University (北京大学); Princeton University (普林斯顿大学); Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is at this https URL

点击查看摘要

Abstract:Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
zh
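
To make the closed-loop protocol concrete, here is a minimal sketch of receding-horizon planning with a world model, the kind of online planning loop such a platform standardizes. The `world_model`, `env`, and `score_fn` interfaces are hypothetical placeholders for illustration, not the benchmark's actual API.

```python
import numpy as np

def plan_with_world_model(world_model, score_fn, obs, n_candidates=16, horizon=8, rng=None):
    """Score sampled action sequences by rolling them out in the world model,
    then return the first action of the best sequence (receding-horizon planning)."""
    rng = rng or np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 4))  # 4-D action space (assumed)
    best_seq, best_score = None, -np.inf
    for seq in candidates:
        rollout = world_model.rollout(obs, seq)   # predicted future observations
        score = score_fn(rollout)                 # task-success proxy, not visual quality
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq[0]

def run_episode(env, world_model, score_fn, max_steps=100):
    """Closed-loop evaluation: the agent acts in the real environment, so world
    model errors feed back into subsequent decisions (unlike open-loop scoring)."""
    obs, success = env.reset(), False
    for _ in range(max_steps):
        action = plan_with_world_model(world_model, score_fn, obs)
        obs, success, done = env.step(action)    # hypothetical env API
        if done:
            break
    return success  # task success is the primary metric
```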

[CV-76] Online In-Context Distillation for Low-Resource Vision Language Models

【速读】: This paper addresses the difficulty of deploying vision-language models (Vision-Language Models, VLMs) in low-resource, budget-constrained settings: large VLMs perform well but are too computationally expensive, while small VLMs are efficient yet require costly fine-tuning to approach large-model performance. The key to the solution is an online In-Context Distillation (ICD) method in which the small model collaborates with a stronger teacher at inference time, distilling its knowledge via sparse demonstrations to substantially boost the small model under a limited compute budget. The method combines a cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and a demonstration pool dynamically populated based on student uncertainty, improving small-model performance by up to 33% with very few teacher annotations (as low as 4%) and matching the teacher's zero-shot performance.

链接: https://arxiv.org/abs/2510.18117
作者: Zhiqi Kang,Rahaf Aljundi,Vaggelis Dorovatas,Karteek Alahari
机构: Inria(法国国家信息与自动化研究院); Toyota Motor Europe(丰田汽车欧洲公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher’s zero-shot performance.
zh
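
As a rough illustration of the inference-time collaboration described above, the sketch below gates teacher queries on student uncertainty and retrieves demonstrations by embedding similarity. All interfaces (`student`, `teacher`, `embed`) and the entropy threshold are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def entropy(probs):
    probs = np.clip(probs, 1e-9, 1.0)
    return float(-(probs * np.log(probs)).sum())

def icd_step(query, student, teacher, demo_pool, embed, tau=1.5, k=4, budget=500):
    """One online In-Context Distillation step (sketch): query the teacher only
    when the student is uncertain, and reuse teacher outputs as demonstrations."""
    probs = student.predict_probs(query)          # student's predictive distribution
    if entropy(probs) < tau:                      # confident: answer directly
        return int(np.argmax(probs))
    # Uncertain: retrieve the k nearest demonstrations by embedding similarity.
    q = embed(query)
    sims = [float(q @ embed(d["input"])) for d in demo_pool]
    demos = [demo_pool[i] for i in np.argsort(sims)[-k:]]
    answer = student.predict_with_context(query, demos)  # in-context "distillation"
    if len(demo_pool) < budget:                   # sparse teacher annotation budget
        demo_pool.append({"input": query, "label": teacher.predict(query)})
    return answer
```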

[CV-77] From Volume Rendering to 3D Gaussian Splatting: Theory and Applications

【速读】: This paper addresses the efficiency and effectiveness bottlenecks of traditional approaches to 3D reconstruction from posed images, focusing in particular on three practical limitations of 3D Gaussian Splatting (3DGS): high memory footprint, the tendency to bake lighting directly into the representation, and limited support for secondary-ray effects. The key to the solution is a systematic review of the 3DGS rendering pipeline: starting from the splatting formulation, it examines how current mainstream research mitigates these issues by optimizing the representation, decoupling lighting from geometry, and introducing differentiable rendering mechanisms, and further showcases the efficiency and feed-forward nature of 3DGS in tasks such as surface reconstruction, avatar modeling, animation, and content generation, promoting its adoption in real-time 3D scene modeling.

链接: https://arxiv.org/abs/2510.18101
作者: Vitor Pereira Matias,Daniel Perazzo,Vinicius Silva,Alberto Raposo,Luiz Velho,Afonso Paiva,Tiago Novello
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the Conference on Graphics, Patterns and Images (SIBGRAPI), math focused, 5 equations, 5 figures, 5 pages of text and 1 of bibliography

点击查看摘要

Abstract:The problem of 3D reconstruction from posed images is undergoing a fundamental transformation, driven by continuous advances in 3D Gaussian Splatting (3DGS). By modeling scenes explicitly as collections of 3D Gaussians, 3DGS enables efficient rasterization through volumetric splatting, offering thus a seamless integration with common graphics pipelines. Despite its real-time rendering capabilities for novel view synthesis, 3DGS suffers from a high memory footprint, the tendency to bake lighting effects directly into its representation, and limited support for secondary-ray effects. This tutorial provides a concise yet comprehensive overview of the 3DGS pipeline, starting from its splatting formulation and then exploring the main efforts in addressing its limitations. Finally, we survey a range of applications that leverage 3DGS for surface reconstruction, avatar modeling, animation, and content generation, highlighting its efficient rendering and suitability for feed-forward pipelines.
zh
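
The splatting formulation the tutorial starts from reduces, per pixel, to standard front-to-back alpha compositing of depth-sorted Gaussians, C = Σᵢ cᵢ αᵢ Πⱼ<ᵢ (1 − αⱼ); a minimal numeric sketch:

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha compositing used in volumetric splatting.
    `colors` is (N, 3) and `alphas` is (N,), both sorted front-to-back by depth."""
    transmittance = 1.0
    pixel = np.zeros(3)
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination, a common rasterizer optimization
            break
    return pixel

# Two splats: a near, half-opaque red one and a far, opaque blue one.
print(composite_pixel(np.array([[1.0, 0, 0], [0, 0, 1.0]]), np.array([0.5, 1.0])))
# -> [0.5, 0. , 0.5]
```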

[CV-78] Accelerating Vision Transformers with Adaptive Patch Sizes

【速读】: This paper addresses the problem that Vision Transformers (ViTs) partition high-resolution images into uniformly sized patches regardless of content, yielding overly long input sequences that hurt inference and training efficiency. The key to the solution is the proposed Adaptive Patch Transformers (APT), which use multiple patch sizes within the same image: larger patches in more homogeneous regions to reduce the token count, and smaller patches in complex regions to preserve detail. This dynamic patching strategy significantly reduces the total number of input tokens while maintaining downstream performance, delivering throughput gains of 40%-50% and up to 30% faster training and inference on high-resolution dense visual tasks such as visual question answering, object detection, and semantic segmentation.

链接: https://arxiv.org/abs/2510.18091
作者: Rohan Choudhury,JungEun Kim,Jinhyung Park,Eunho Yang,László A. Jeni,Kris M. Kitani
机构: Carnegie Mellon University (卡内基梅隆大学); KAIST (韩国科学技术院); General Robotics (通用机器人)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page at this https URL

点击查看摘要

Abstract:Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
zh
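
A minimal sketch of the content-adaptive idea: tile the image with large patches and split only where pixel variance is high. The variance criterion and the greedy two-level split are illustrative assumptions; APT's actual allocation rule may differ.

```python
import numpy as np

def assign_patch_sizes(image, sizes=(32, 16), var_threshold=20.0):
    """Greedy sketch of content-adaptive patching: tile with large patches and
    split any patch whose pixel variance exceeds a threshold.
    Returns a list of (y, x, size) patches; fewer patches = fewer tokens."""
    h, w = image.shape[:2]
    big, small = sizes
    patches = []
    for y in range(0, h, big):
        for x in range(0, w, big):
            block = image[y:y + big, x:x + big]
            if block.var() <= var_threshold:      # homogeneous region: one big token
                patches.append((y, x, big))
            else:                                 # complex region: split into small tokens
                for dy in range(0, big, small):
                    for dx in range(0, big, small):
                        patches.append((y + dy, x + dx, small))
    return patches

img = np.zeros((64, 64)); img[:32, :32] = np.random.randn(32, 32) * 10
print(len(assign_patch_sizes(img)))  # 7 patches here vs 16 uniform 16x16 tokens
```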

[CV-79] Big Data Tiny Targets: An Exploratory Study in Machine Learning-enhanced Detection of Microplastic from Filters

【速读】: This paper addresses the difficulty of detecting, classifying, and removing microplastics (Microplastics, MPs) in biological and environmental samples due to their microscopic size. Traditional techniques such as optical microscopy, Scanning Electron Microscopy (SEM), and Atomic Force Microscopy (AFM) provide a reliable basis for detection but depend on manual analysis, making them hard to apply in large-scale screening. The paper therefore proposes combining SEM imaging with machine learning (ML)-based object detection: the key lies in using YOLO-style models for automated recognition and quantitative analysis of filtration-scenario images with regular backgrounds, while highlighting the importance of preprocessing optimization for model performance and noting that the scarcity of high-quality expert-labeled data remains a major open challenge.

链接: https://arxiv.org/abs/2510.18089
作者: Paul-Tiberiu Miclea,Martin Sboron,Hardik Vaghasiya,Hoang Thinh Nguyen,Meet Gadara,Thomas Schmid
机构: Martin Luther University Halle-Wittenberg (马丁路德大学哈雷-维滕贝格); Fraunhofer Center for Silicon Photovoltaics (CSP) (弗劳恩霍夫硅光伏中心); Lancaster University Leipzig (兰卡斯特大学莱比锡分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Microplastics (MPs) are ubiquitous pollutants with demonstrated potential to impact ecosystems and human health. Their microscopic size complicates detection, classification, and removal, especially in biological and environmental samples. While techniques like optical microscopy, Scanning Electron Microscopy (SEM), and Atomic Force Microscopy (AFM) provide a sound basis for detection, applying these approaches usually requires manual analysis, which prevents efficient use in large screening studies. To this end, machine learning (ML) has emerged as a powerful tool in advancing microplastic detection. In this exploratory study, we investigate the potential, limitations, and future directions of advancing the detection and quantification of MP particles and fibres using a combination of SEM imaging and machine learning-based object detection. For simplicity, we focus on a filtration scenario where image backgrounds exhibit a symmetric and repetitive pattern. Our findings indicate differences in the quality of YOLO models for the given task and the relevance of optimizing preprocessing. At the same time, we identify open challenges, such as limited amounts of expert-labeled data necessary for reliable training of ML models.
zh

[CV-80] Chimera: Compositional Image Generation using Part-based Concepting

【速读】: This paper addresses the lack of explicit control in personalized image generation models when composing specific parts from multiple source images, i.e., existing methods struggle to precisely extract and fuse designated parts from different images according to textual instructions to generate a novel object. The key innovations of the proposed Chimera model include: constructing a dataset of semantic atoms built on 464 (part, subject) pairs, used to generate 37k annotated prompts and high-fidelity images; introducing a diffusion prior model with part-conditional guidance that steers image-conditioning features to preserve both semantic identity and spatial layout; and designing an objective metric, PartEval, to quantify part alignment and compositional accuracy. Experiments show that Chimera outperforms baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality.

链接: https://arxiv.org/abs/2510.18083
作者: Shivam Singh,Yiming Chen,Agneet Chatterjee,Amit Raj,James Hays,Yezhou Yang,Chitra Baral
机构: Arizona State University (亚利桑那州立大学); Georgia Institute of Technology (佐治亚理工学院); Google Deepmind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.
zh

[CV-81] riggerNet: A Novel Explainable AI Framework for Red Palm Mite Detection and Multi-Model Comparison and Heuristic-Guided Annotation

【速读】: This paper tackles the difficulty of early identification of red palm mite (Raoiella indica) infestation, enabling accurate disease classification and efficient management for palm-type plants. The key to the solution is the proposed TriggerNet explainable AI framework, which integrates Grad-CAM, RISE, FullGrad, and TCAV to provide novel, interpretable visual evidence for the decision process of deep learning models in plant classification and disease detection, improving trustworthiness and practicality. The study builds on an RGB image dataset spanning 11 plant species, combines advanced deep learning models (e.g., CNN, EfficientNet, ViT) with traditional machine learning classifiers (e.g., Random Forest, SVM), and uses the Snorkel automatic labeling tool with heuristic rules to rapidly construct a high-quality disease-label dataset, significantly improving the accuracy and efficiency of disease classification.

链接: https://arxiv.org/abs/2510.18038
作者: Harshini Suresha,Kavitha SH
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:The red palm mite infestation has become a serious concern, particularly in regions with extensive palm cultivation, leading to reduced productivity and economic losses. Accurate and early identification of mite-infested plants is critical for effective management. The current study focuses on evaluating and comparing ML models for classifying affected plants and detecting infestation. TriggerNet is a novel interpretable AI framework that integrates Grad-CAM, RISE, FullGrad, and TCAV to generate novel visual explanations for deep learning models in plant classification and disease detection. This study applies TriggerNet to address red palm mite (Raoiella indica) infestation, a major threat to palm cultivation and agricultural productivity. A diverse set of RGB images across 11 plant species, Arecanut, Date Palm, Bird of Paradise, Coconut Palm, Ginger, Citrus Tree, Palm Oil, Orchid, Banana Palm, Avocado Tree, and Cast Iron Plant was utilized for training and evaluation. Advanced deep learning models like CNN, EfficientNet, MobileNet, ViT, ResNet50, and InceptionV3, alongside machine learning classifiers such as Random Forest, SVM, and KNN, were employed for plant classification. For disease classification, all plants were categorized into four classes: Healthy, Yellow Spots, Reddish Bronzing, and Silk Webbing. Snorkel was used to efficiently label these disease classes by leveraging heuristic rules and patterns, reducing manual annotation time and improving dataset reliability.
zh

[CV-82] SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection

【速读】: This paper addresses the vulnerability of autonomous driving systems to rare, out-of-distribution scenarios with semantic anomalies; existing Vision Language Models (VLMs) perform unreliably on them under naive prompting and depend on expensive proprietary models, limiting practical deployment. The key to the solution is SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high-accuracy, high-recall anomaly detection via layered scene analysis and a two-phase pipeline: structured scene descriptions are extracted first, followed by multi-modal evaluation. The framework turns VLM reasoning from ad-hoc prompting into systematic analysis across four semantic layers (Street, Infrastructure, Movable Objects, Environment), reaching 89.6% recall and 88.0% accuracy on real-world driving scenarios. More importantly, the approach enables a fine-tuned 7B-parameter open-source model (Qwen2.5VL) to reach 90.8% recall and 93.8% accuracy with local deployment, surpassing all compared models and helping alleviate the data scarcity problem in anomaly detection.

链接: https://arxiv.org/abs/2510.18034
作者: Roberto Brusnicki,David Pop,Yuan Gao,Mattia Piccinini,Johannes Betz
机构: Technical University of Munich (慕尼黑工业大学); Munich Institute of Robotics and Machine Intelligence (慕尼黑机器人与机器智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
zh

[CV-83] ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues

【速读】: This paper tackles the challenge of detecting student engagement in online learning environments in order to improve learning outcomes and enable personalized instruction. The key to the solution is ViBED-Net (Video-Based Engagement Detection Network), a dual-stream deep learning framework that extracts both facial-expression and full-scene spatial features and combines two temporal modeling strategies, LSTM networks and Transformer encoders, to effectively fuse face-aware and scene-aware spatiotemporal cues, significantly improving engagement recognition accuracy.

链接: https://arxiv.org/abs/2510.18016
作者: Prateek Gothwal,Deeptimaan Banerjee,Ashis Kumer Biswas
机构: University of Colorado Denver (科罗拉多大学丹佛分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available on this https URL .
zh

[CV-84] ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy ICCV2025

【速读】: This paper addresses the Western-centric bias in affective computing by building ManzaiSet, the first large-scale multimodal dataset of viewer responses to Japanese manzai comedy, to advance emotion AI in non-Western cultural contexts. The key elements of the solution: facial video and audio were collected from 241 participants watching up to 10 professional performances, presented in randomized order to mitigate order effects; on this basis, k-means clustering identified three clearly distinct viewer types (High and Stable Appreciators, Low and Variable Decliners, and Variable Improvers), and a positive viewing-order effect was found at the individual level (i.e., the viewing experience improves with the number of performances watched), challenging the conventional fatigue hypothesis; the dataset also supports automated humor classification and viewer-level response modeling, providing a foundation for culturally aware, personalized entertainment systems.

链接: https://arxiv.org/abs/2510.18014
作者: Kazuki Kawamura,Kengo Nakai,Jun Rekimoto
机构: Sony CSL Kyoto; The University of Tokyo; Yoshimoto Kogyo Holdings Co., Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: ICCV 2025 Workshop on Affective Behavior Analysis in-the-Wild (ABAW), Honolulu, HI, USA (Oct 19, 2025, HST). 11 pages, 5 figures

点击查看摘要

Abstract:We present ManzaiSet, the first large scale multimodal dataset of viewer responses to Japanese manzai comedy, capturing facial videos and audio from 241 participants watching up to 10 professional performances in randomized order (94.6 percent watched ≥ 8; analyses focus on n=228). This addresses the Western centric bias in affective computing. Three key findings emerge: (1) k means clustering identified three distinct viewer types: High and Stable Appreciators (72.8 percent, n=166), Low and Variable Decliners (13.2 percent, n=30), and Variable Improvers (14.0 percent, n=32), with heterogeneity of variance (Brown Forsythe p < 0.001); (2) individual level analysis revealed a positive viewing order effect (mean slope = 0.488, t(227) = 5.42, p < 0.001, permutation p < 0.001), contradicting fatigue hypotheses; (3) automated humor classification (77 instances, 131 labels) plus viewer level response modeling found no type wise differences after FDR correction. The dataset enables culturally aware emotion AI development and personalized entertainment systems tailored to non Western contexts.
zh
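
The two main analyses, k-means viewer typing and the per-viewer order-effect slope, can be reproduced in outline as follows; synthetic ratings stand in for the real annotations, and the per-viewer summary features are an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ratings = rng.normal(3.5, 1.0, size=(228, 10))   # per-viewer ratings over 10 shows (synthetic)

# Viewer-level order effect: slope of a least-squares line over viewing order.
order = np.arange(10)
slopes = np.array([np.polyfit(order, r, 1)[0] for r in ratings])
print("mean slope:", slopes.mean())              # > 0 would indicate improvement over time

# Viewer typing: cluster simple per-viewer summaries (mean, std, slope) with k-means.
features = np.c_[ratings.mean(1), ratings.std(1), slopes]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))                       # cluster sizes, cf. the three viewer types
```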

[CV-85] Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods

【速读】: This paper addresses bias in medical image segmentation models, in particular performance disparities driven by sensitive attributes such as race and sex, which can affect the fairness and reliability of clinical decisions. The key to the solution is a systematic evaluation of three deep learning segmentation models (UNesT, nnU-Net, CoTr) and a traditional atlas-based method (ANTs) across demographic subgroups (black female, black male, white female, white male), quantifying segmentation performance and volume-estimation bias with a fairness metric and using linear mixed models to analyze the effects of race, sex, and their interaction on segmentation accuracy and the volume of the target structure (the Nucleus Accumbens, NAc). The study finds that race-matching training and test data significantly improves segmentation accuracy for some models (ANTs and UNesT), whereas nnU-Net is robust to demographic matching; moreover, although the sex effects seen in manual annotations persist, race effects disappear in all but one model, suggesting that model bias can mask true biological differences and that fairness considerations should be built into model development.

链接: https://arxiv.org/abs/2510.17999
作者: Ghazal Danaee,Marc Niethammer,Jarrett Rushmore,Sylvain Bouix
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.
zh

[CV-86] Demystifying Transition Matching: When and Why It Can Beat Flow Matching

【速读】: This paper clarifies when and why Transition Matching (TM) can outperform Flow Matching (FM) in sampling efficiency and quality for generative models. The key lies in the theoretical analysis: first, when the target is a unimodal Gaussian, TM's stochastic-difference latent updates preserve the target covariance structure, achieving strictly lower KL divergence than deterministic FM in a finite number of steps; second, under a fixed compute budget TM converges faster, establishing its advantage in this setting. The analysis is then extended to Gaussian mixtures, identifying local-unimodality regimes in which TM's behavior approximates the unimodal case, with the approximation error shrinking as the separation between component means grows, indicating that TM is best suited to well-separated modes. Moreover, as the target variance approaches zero, TM degenerates to FM and its advantage vanishes, revealing that TM's performance depends on the structure of the target distribution.

链接: https://arxiv.org/abs/2510.17991
作者: Jaihoon Kim,Rajarshi Saha,Minhyuk Sung,Youngsuk Park
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
zh

[CV-87] NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation

【速读】: This paper addresses the lack of a standardized, reproducible evaluation framework for neural compression and representation learning in Earth Observation (EO); existing methods behave inconsistently across downstream tasks and are susceptible to pretraining bias, making fair model comparison difficult. The key to the solution is NeuCo-Bench, a benchmark framework built around fixed-size embeddings, whose core components are: (1) an evaluation pipeline based on reusable embeddings for cross-task consistency; (2) a hidden-task leaderboard to mitigate pretraining bias; and (3) a scoring system that balances accuracy and stability. The framework offers a first step toward community-driven, standardized evaluation for EO and beyond, and the SSL4EO-S12-downstream dataset is released to support reproduction and extension.

链接: https://arxiv.org/abs/2510.17914
作者: Rikard Vinge,Isabelle Wittmann,Jannik Schneider,Michael Marszalek,Luis Gilch,Thomas Brunschwiler,Conrad M Albrecht
机构: German Aerospace Center (德国航空航天中心); Columbia University (哥伦比亚大学); Juelich Supercomputing Center (于利希超级计算中心); IBM Research – Europe (IBM研究欧洲)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce NeuCo-Bench, a novel benchmark framework for evaluating (lossy) neural compression and representation learning in the context of Earth Observation (EO). Our approach builds on fixed-size embeddings that act as compact, task-agnostic representations applicable to a broad range of downstream tasks. NeuCo-Bench comprises three core components: (i) an evaluation pipeline built around reusable embeddings, (ii) a new challenge mode with a hidden-task leaderboard designed to mitigate pretraining bias, and (iii) a scoring system that balances accuracy and stability. To support reproducibility, we release SSL4EO-S12-downstream, a curated multispectral, multitemporal EO dataset. We present initial results from a public challenge at the 2025 CVPR EARTHVISION workshop and conduct ablations with state-of-the-art foundation models. NeuCo-Bench provides a first step towards community-driven, standardized evaluation of neural embeddings for EO and beyond.
zh

[CV-88] 3D Weakly Supervised Semantic Segmentation via Class-Aware and Geometry-Guided Pseudo-Label Refinement

【速读】: This paper targets the performance bottlenecks in 3D weakly supervised semantic segmentation (3D WSSS) caused by low-quality pseudo-labels and insufficient exploitation of 3D geometric priors. The core of the solution is a class-aware guidance mechanism that integrates 3D geometric priors, realized through two key modules: a Class-Aware Label Refinement module that produces more balanced and accurate pseudo-labels, and a Geometry-Aware Label Refinement component that uses implicit 3D geometric constraints to filter out low-confidence pseudo-labels that violate geometric plausibility. In addition, to cope with extensive unlabeled regions, a self-training-based label update strategy iteratively improves pseudo-label quality and expands annotation coverage, substantially enhancing model performance and generalization.

链接: https://arxiv.org/abs/2510.17875
作者: Xiaoxu Xu,Xuexun Liu,Jinlong Li,Yitian Yuan,Qiudan Zhang,Lin Ma,Nicu Sebe,Xu Wang
机构: Beihang University (北京航空航天大学); Shenzhen University (深圳大学); University of Trento (特伦托大学); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D weakly supervised semantic segmentation (3D WSSS) aims to achieve semantic segmentation by leveraging sparse or low-cost annotated data, significantly reducing reliance on dense point-wise annotations. Previous works mainly employ class activation maps or pre-trained vision-language models to address this challenge. However, the low quality of pseudo-labels and the insufficient exploitation of 3D geometric priors jointly create significant technical bottlenecks in developing high-performance 3D WSSS models. In this paper, we propose a simple yet effective 3D weakly supervised semantic segmentation method that integrates 3D geometric priors into a class-aware guidance mechanism to generate high-fidelity pseudo labels. Concretely, our designed methodology first employs a Class-Aware Label Refinement module to generate more balanced and accurate pseudo labels for semantic categories. This initial refinement stage focuses on enhancing label quality through category-specific optimization. Subsequently, the Geometry-Aware Label Refinement component is developed, which strategically integrates implicit 3D geometric constraints to effectively filter out low-confidence pseudo labels that fail to comply with geometric plausibility. Moreover, to address the challenge of extensive unlabeled regions, we propose a Label Update strategy that integrates Self-Training to propagate labels into these areas. This iterative process continuously enhances pseudo-label quality while expanding label coverage, ultimately fostering the development of high-performance 3D WSSS models. Comprehensive experimental validation reveals that our proposed methodology achieves state-of-the-art performance on both ScanNet and S3DIS benchmarks while demonstrating remarkable generalization capability in unsupervised settings, maintaining competitive accuracy through its robust design.
zh

[CV-89] Auditing and Mitigating Bias in Gender Classification Algorithms: A Data-Centric Approach

【速读】: This paper addresses the amplification of bias in gender classification systems caused by severe intersectional underrepresentation in training data. Although mainstream datasets such as UTKFace and FairFace are relatively balanced, models trained on them still misclassify female faces at higher rates and aggravate racial skew. The key to the solution is BalancedFace, a new public dataset that blends FairFace and UTKFace and supplements them with images from other collections, using real, unedited images to equalize subgroup shares across 189 age-race-gender intersections. Experiments show that a standard classifier trained on BalancedFace reduces the maximum true-positive-rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal value of 1.0 compared with the next-best dataset, with minimal loss of overall accuracy, underscoring the central value of data-centric interventions for fairness.

链接: https://arxiv.org/abs/2510.17873
作者: Tadesse K Bahiru,Natnael Tilahun Sinshaw,Teshager Hailemariam Moges,Dheeraj Kumar Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Gender classification systems often inherit and amplify demographic imbalances in their training data. We first audit five widely used gender classification datasets, revealing that all suffer from significant intersectional underrepresentation. To measure the downstream impact of these flaws, we train identical MobileNetV2 classifiers on the two most balanced of these datasets, UTKFace and FairFace. Our fairness evaluation shows that even these models exhibit significant bias, misclassifying female faces at a higher rate than male faces and amplifying existing racial skew. To counter these data-induced biases, we construct BalancedFace, a new public dataset created by blending images from FairFace and UTKFace, supplemented with images from other collections to fill missing demographic gaps. It is engineered to equalize subgroup shares across 189 intersections of age, race, and gender using only real, unedited images. When a standard classifier is trained on BalancedFace, it reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared to the next-best dataset, all with a minimal loss of overall accuracy. These results underline the profound value of data-centric interventions and provide an openly available resource for fair gender classification research.
zh
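
The two headline fairness numbers can be computed as below. Note that the Disparate Impact ratio here is a common multi-group generalization (minimum over maximum positive-prediction rate), which may differ in detail from the paper's exact definition:

```python
import numpy as np

def tpr_gap_and_disparate_impact(y_true, y_pred, groups):
    """Maximum gap in true-positive rate across subgroups, and the Disparate
    Impact ratio (min positive-prediction rate / max rate; ideal = 1.0)."""
    tprs, pos_rates = [], []
    for g in np.unique(groups):
        m = groups == g
        positives = (y_true[m] == 1)
        if positives.any():
            tprs.append((y_pred[m][positives] == 1).mean())
        pos_rates.append((y_pred[m] == 1).mean())
    return max(tprs) - min(tprs), min(pos_rates) / max(pos_rates)
```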

[CV-90] GAN-based Content-Conditioned Generation of Handwritten Musical Symbols ICDAR

【速读】: This paper addresses the scarcity of real annotated data in Optical Music Recognition (OMR), particularly for handwritten historical musical scores. The key to the solution is a music-symbol-level Generative Adversarial Network (GAN) that produces high-fidelity handwritten-style musical symbols, which are then assembled into full score images with the Smashcima engraving software, yielding synthetic datasets usable for training OMR models. A systematic evaluation shows the generated symbols exhibit a high degree of visual realism, marking significant progress in synthetic score generation.

链接: https://arxiv.org/abs/2510.17869
作者: Gerard Asbert,Pau Torras,Lei Kang,Alicia Fornés,Josep Lladós
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures, Accepted at ICDAR workshop GREC 2025

点击查看摘要

Abstract:The field of Optical Music Recognition (OMR) is currently hindered by the scarcity of real annotated data, particularly when dealing with handwritten historical musical scores. In similar fields, such as Handwritten Text Recognition, it was proven that synthetic examples produced with image generation techniques could help to train better-performing recognition architectures. This study explores the generation of realistic, handwritten-looking scores by implementing a music symbol-level Generative Adversarial Network (GAN) and assembling its output into a full score using the Smashcima engraving software. We have systematically evaluated the visual fidelity of these generated samples, concluding that the generated symbols exhibit a high degree of realism, marking significant progress in synthetic score generation.
zh

[CV-91] MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation

【速读】: This paper targets zero-shot 2D object detection and segmentation, i.e., precisely localizing and segmenting unseen 3D objects in 2D images without category-specific annotated data. The core challenge is generalizing from known-category priors and multi-view templates to unseen categories while remaining robust in complex scenes. The key to the proposed MUSE (Model-based Uncertainty-aware Similarity Estimation) framework is threefold: patch embeddings are normalized with Generalized Mean Pooling (GeM) to efficiently fuse global and local representations; a joint similarity metric combines absolute and relative similarity to stabilize matching; and an uncertainty-aware object prior dynamically adjusts similarity scores according to proposal reliability, achieving state-of-the-art zero-shot detection and segmentation without any additional training.

链接: https://arxiv.org/abs/2510.17866
作者: Sungmin Cho,Sungbum Park,Insoo Oh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages with 6 figures

点击查看摘要

Abstract:In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE offers a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
zh
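
Two of the named ingredients are easy to sketch: GeM pooling is a standard formula, while the joint similarity below is an illustrative guess at how absolute and relative scores might be combined (the weighting `lam` and the margin form are assumptions, not the paper's metric):

```python
import torch

def gem_pool(patch_embeddings, p=3.0, eps=1e-6):
    """Generalized Mean (GeM) pooling over patch embeddings, (N, D) -> (D,):
    interpolates between average pooling (p=1) and max pooling (p -> inf).
    Assumes non-negative activations, as in the standard GeM formulation."""
    x = patch_embeddings.clamp(min=eps)
    return x.pow(p).mean(dim=0).pow(1.0 / p)

def joint_similarity(query, templates, lam=0.5):
    """Combine absolute similarity (best template match) with relative
    similarity (margin over the mean match) for one object proposal."""
    sims = torch.nn.functional.cosine_similarity(query[None, :], templates, dim=1)
    absolute = sims.max()
    relative = sims.max() - sims.mean()
    return lam * absolute + (1 - lam) * relative
```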

[CV-92] InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation ICCV2025

【速读】: This paper addresses the difficulty of aligning RGB surface detail with X-ray internal structure in multimodal fusion, a capability valuable in medical diagnostics, cultural heritage restoration, and manufacturing, where high-fidelity RGB imagery and X-ray imaging must be integrated into a unified 3D reconstruction. The key to the proposed InsideOut method is to collect new paired RGB and X-ray data, align RGB and X-ray radiative Gaussian splats through hierarchical fitting, and introduce an X-ray reference loss to keep internal structures consistent, thereby overcoming the large representational gap between the two modalities and the scarcity of paired data, and notably extending the applicability of 3D Gaussian Splatting (3DGS).

链接: https://arxiv.org/abs/2510.17864
作者: Jungmin Lee,Seonghyuk Hong,Juyong Lee,Jaeyoon Lee,Jongwon Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICCV 2025

点击查看摘要

Abstract:We introduce InsideOut, an extension of 3D Gaussian splatting (3DGS) that bridges the gap between high-fidelity RGB surface details and subsurface X-ray structures. The fusion of RGB and X-ray imaging is invaluable in fields such as medical diagnostics, cultural heritage restoration, and manufacturing. We collect new paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures. InsideOut effectively addresses the challenges posed by disparate data representations between the two modalities and limited paired datasets. This approach significantly extends the applicability of 3DGS, enhancing visualization, simulation, and non-destructive testing capabilities across various domains.
zh

[CV-93] Robotic Classification of Divers Swimming States using Visual Pose Keypoints as IMUs

【速读】: This paper addresses the limitations of traditional human activity recognition for monitoring diver safety underwater: wireless signals attenuate severely in water, so sensing schemes based on wearable inertial measurement units (IMUs) struggle to communicate and transmit data. The key to the solution is a novel hybrid approach that uses computer vision to derive high-fidelity motion data from sequences of 3D human joint keypoints, constructing a "pseudo-IMU" that bypasses the underwater wireless bottleneck; the method is integrated onboard an autonomous underwater vehicle (AUV) to recognize anomalous diving behavior in real time (e.g., precursors of cardiac arrest), substantially improving robotic monitoring of divers and safety assurance.

链接: https://arxiv.org/abs/2510.17863
作者: Demetrious T. Kutzke,Ying-Kun Wu,Elizabeth Terveen,Junaed Sattar
机构: University of Minnesota–Twin Cities (明尼苏达大学双城分校); Minnesota Robotics Institute (MnRI) (明尼苏达机器人研究所); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Traditional human activity recognition uses either direct image analysis or data from wearable inertial measurement units (IMUs), but can be ineffective in challenging underwater environments. We introduce a novel hybrid approach that bridges this gap to monitor scuba diver safety. Our method leverages computer vision to generate high-fidelity motion data, effectively creating a "pseudo-IMU" from a stream of 3D human joint keypoints. This technique circumvents the critical problem of wireless signal attenuation in water, which plagues conventional diver-worn sensors communicating with an Autonomous Underwater Vehicle (AUV). We apply this system to the vital task of identifying anomalous scuba diver behavior that signals the onset of a medical emergency such as cardiac arrest – a leading cause of scuba diving fatalities. By integrating our classifier onboard an AUV and conducting experiments with simulated distress scenarios, we demonstrate the utility and effectiveness of our method for advancing robotic monitoring and diver safety.
zh
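
A minimal sketch of the pseudo-IMU idea: differentiate 3D keypoint trajectories to obtain IMU-like velocity and acceleration channels, then summarize them in sliding windows for a downstream classifier. The sampling rate and window sizes are illustrative assumptions.

```python
import numpy as np

def pseudo_imu(keypoints, fps=30.0):
    """Turn a sequence of 3D joint keypoints (T, J, 3) into IMU-like signals:
    per-joint linear velocity and acceleration via finite differences.
    This sidesteps RF attenuation underwater, since the data come from vision."""
    dt = 1.0 / fps
    velocity = np.gradient(keypoints, dt, axis=0)      # (T, J, 3), m/s
    acceleration = np.gradient(velocity, dt, axis=0)   # (T, J, 3), m/s^2
    return velocity, acceleration

def window_features(signal, win=60, hop=30):
    """Sliding-window statistics, a typical input to an activity classifier."""
    feats = []
    for start in range(0, signal.shape[0] - win + 1, hop):
        w = signal[start:start + win]
        feats.append(np.concatenate([w.mean((0, 1)), w.std((0, 1)),
                                     np.abs(w).max((0, 1))]))
    return np.array(feats)
```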

[CV-94] DMTrack: Deformable State-Space Modeling for UAV Multi-Object Tracking with Kalman Fusion and Uncertainty-Aware Association

【速读】: This paper addresses inaccurate trajectory estimation and frequent identity switches in multi-object tracking (Multi-object Tracking, MOT) from UAV viewpoints, caused by unpredictable object motion, frequent occlusion, and limited appearance cues in aerial views, and exacerbated by fast UAV movement. The key components of the proposed DMTrack deformable motion tracking framework are: 1) DeformMamba, a deformable state-space predictor that dynamically aggregates historical motion states for adaptive trajectory modeling; 2) MotionGate, a lightweight gating module that fuses Kalman and Mamba predictions based on motion context and uncertainty; and 3) an uncertainty-aware association strategy that aligns motion trends with prediction confidence to enhance identity consistency. The method requires no appearance model, achieves state-of-the-art identity consistency and tracking accuracy on the VisDrone-MOT and UAVDT benchmarks, and remains efficient, making it practical for robust tracking in real UAV scenarios.

链接: https://arxiv.org/abs/2510.17860
作者: Zenghuang Fu,Xiaofeng Han,Mingda Jia,Jin ming Yang,Qi Zeng,Muyang Zahng,Changwei Wang,Weiliang Meng,Xiaopeng Zhang
机构: 未知
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) from unmanned aerial vehicles (UAVs) presents unique challenges due to unpredictable object motion, frequent occlusions, and limited appearance cues inherent to aerial viewpoints. These issues are further exacerbated by abrupt UAV movements, leading to unreliable trajectory estimation and identity switches. Conventional motion models, such as Kalman filters or static sequence encoders, often fall short in capturing both linear and non-linear dynamics under such conditions. To tackle these limitations, we propose DMTrack, a deformable motion tracking framework tailored for UAV-based MOT. Our DMTrack introduces three key components: DeformMamba, a deformable state-space predictor that dynamically aggregates historical motion states for adaptive trajectory modeling; MotionGate, a lightweight gating module that fuses Kalman and Mamba predictions based on motion context and uncertainty; and an uncertainty-aware association strategy that enhances identity preservation by aligning motion trends with prediction confidence. Extensive experiments on the VisDrone-MOT and UAVDT benchmarks demonstrate that our DMTrack achieves state-of-the-art performance in identity consistency and tracking accuracy, particularly under high-speed and non-linear motion. Importantly, our method operates without appearance models and maintains competitive efficiency, highlighting its practicality for robust UAV-based tracking.
zh
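
The role MotionGate plays can be illustrated with a simple inverse-variance fusion of the two predictors; this is a generic sketch of uncertainty-weighted gating, not the paper's learned gate:

```python
import numpy as np

def fuse_predictions(kalman_pred, kalman_cov, mamba_pred, mamba_var):
    """Uncertainty-weighted fusion of a Kalman prediction and a learned
    (Mamba-style) prediction: weight each source by its inverse variance."""
    k_var = np.diag(kalman_cov)                  # per-dimension Kalman variance
    w = mamba_var / (k_var + mamba_var + 1e-9)   # w -> 1 when the learned model is noisy
    return w * kalman_pred + (1.0 - w) * mamba_pred

# Example: trust Kalman on smooth dimensions, the learned model on erratic ones.
fused = fuse_predictions(np.array([10.0, 5.0]), np.diag([0.1, 4.0]),
                         np.array([10.5, 6.5]), np.array([1.0, 0.5]))
print(fused)  # close to Kalman in dim 0, close to the learned prediction in dim 1
```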

[CV-95] Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch NEURIPS2025

【速读】: This paper addresses the inference inefficiency of large pre-trained flow matching diffusion models: how to shortcut them into few-step samplers that are much faster without sacrificing generation quality. Conventional "shortcutting" enables trajectory skipping but requires a specialized step-size embedding, so it cannot be applied to existing models without retraining from scratch at a cost close to the original pretraining. The key innovation is a novel velocity field self-distillation mechanism: by performing online, self-guided distillation in the velocity-field space rather than the sample space, it imparts a more aggressive shortcut capability to standard flow matching models (e.g., Flux) without step-size embeddings. Training is efficient, producing a 3-step sampler in under one A100-day; the approach further extends to the pretraining stage to yield natively efficient few-step flows, and even enables, for the first time, few-shot distillation of dozen-billion-parameter models from only a handful of samples (e.g., 10 text-image pairs) at almost no cost while delivering state-of-the-art performance.

链接: https://arxiv.org/abs/2510.17858
作者: Xu Cai,Yang Wu,Qianli Chen,Haoran Wu,Lichuan Xiang,Hongkai Wen
机构: iHuman Inc.(iHuman公司); University of Warwick(华威大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: NeurIPS 2025

点击查看摘要

Abstract:We present an ultra-efficient post-training method for shortcutting large-scale pre-trained flow matching diffusion models into efficient few-step samplers, enabled by novel velocity field self-distillation. While shortcutting in flow matching, originally introduced by shortcut models, offers flexible trajectory-skipping capabilities, it requires a specialized step-size embedding incompatible with existing models unless retraining from scratch – a process nearly as costly as pretraining itself. Our key contribution is thus imparting a more aggressive shortcut mechanism to standard flow matching models (e.g., Flux), leveraging a unique distillation principle that obviates the need for step-size embedding. Working on the velocity field rather than sample space and learning rapidly from self-guided distillation in an online manner, our approach trains efficiently, e.g., producing a 3-step Flux in less than one A100 day. Beyond distillation, our method can be incorporated into the pretraining stage itself, yielding models that inherently learn efficient, few-step flows without compromising quality. This capability also enables, to our knowledge, the first few-shot distillation method (e.g., 10 text-image pairs) for dozen-billion-parameter diffusion models, delivering state-of-the-art performance at almost free cost.
zh

[CV-96] CMIS-Net: A Cascaded Multi-Scale Individual Standardization Network for Backchannel Agreement Estimation

【速读】: This paper addresses the interference of individual differences in recognizing backchannel behaviors; existing emotion recognition methods typically model individual characteristics at a single scale and fail to capture the complementarity of multi-scale behavioral cues (e.g., frame-level response intensity and sequence-level frequency and rhythm preferences). The key to the solution is the proposed Cascaded Multi-Scale Individual Standardization Network (CMIS-Net), which extracts individual-normalized backchannel features by removing each person's neutral baseline, so the model focuses on relative changes from that baseline rather than absolute expression values; an implicit data augmentation module further mitigates the distributional bias of the training data, improving generalization under individual differences and data imbalance.

链接: https://arxiv.org/abs/2510.17855
作者: Yuxuan Huang,Kangzhong Wang,Eugene Yujun Fu,Grace Ngai,Peter H.F. Ng
机构: Hong Kong Baptist University (香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Backchannels are subtle listener responses, such as nods, smiles, or short verbal cues like “yes” or “uh-huh,” which convey understanding and agreement in conversations. These signals provide feedback to speakers, improve the smoothness of interaction, and play a crucial role in developing human-like, responsive AI systems. However, the expression of backchannel behaviors is often significantly influenced by individual differences, operating across multiple scales: from instant dynamics such as response intensity (frame-level) to temporal patterns such as frequency and rhythm preferences (sequence-level). This presents a complex pattern recognition problem that contemporary emotion recognition methods have yet to fully address. Particularly, existing individualized methods in emotion recognition often operate at a single scale, overlooking the complementary nature of multi-scale behavioral cues. To address these challenges, we propose a novel Cascaded Multi-Scale Individual Standardization Network (CMIS-Net) that extracts individual-normalized backchannel features by removing person-specific neutral baselines from observed expressions. Operating at both frame and sequence levels, this normalization allows the model to focus on relative changes from each person’s baseline rather than absolute expression values. Furthermore, we introduce an implicit data augmentation module to address the observed training data distributional bias, improving model generalization. Comprehensive experiments and visualizations demonstrate that CMIS-Net effectively handles individual differences and data imbalance, achieving state-of-the-art performance in backchannel agreement detection.
zh
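
The core normalization step, removing a person-specific neutral baseline so the model sees relative rather than absolute expression values, can be sketched as a per-person standardization (the use of neutral-frame statistics is an assumption):

```python
import torch

def individual_standardize(clip_feats, neutral_feats, eps=1e-6):
    """Subtract the mean of a person's neutral frames and divide by their std,
    so downstream layers see deviations from that person's own baseline.
    `clip_feats`: (T, D) features of the clip; `neutral_feats`: (N, D) neutral frames."""
    mu = neutral_feats.mean(dim=0)
    sigma = neutral_feats.std(dim=0).clamp(min=eps)
    return (clip_feats - mu) / sigma
```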

[CV-97] Provenance of AI-Generated Images: A Vector Similarity and Blockchain-based Approach

【速读】: This paper addresses the challenge that, with the rapid advance of generative AI (Generative AI), highly realistic images produced by large language models (LLMs) and image generation tools (e.g., DALL-E, Stable Diffusion) are hard to identify and authenticate, threatening the integrity and provenance of digital content. The key to the solution is an embedding-based image detection framework built on the hypothesis that AI-generated images lie closer to other AI-generated content in embedding space, while human-created images cluster within their own domain; processing a diverse dataset through five benchmark embedding models, the experiments confirm that embedding signatures remain stable under moderate-to-high perturbations, enabling accurate and computationally efficient detection of AI-generated images.

链接: https://arxiv.org/abs/2510.17854
作者: Jitendra Sharma,Arthur Carvalho,Suman Bhunia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Rapid advancement in generative AI and large language models (LLMs) has enabled the generation of highly realistic and contextually relevant digital content. LLMs such as ChatGPT with DALL-E integration and Stable Diffusion techniques can produce images that are often indistinguishable from those created by humans, which poses challenges for digital content authentication. Verifying the integrity and origin of digital data to ensure it remains unaltered and genuine is crucial to maintaining trust and legality in digital media. In this paper, we propose an embedding-based AI image detection framework that utilizes image embeddings and a vector similarity to distinguish AI-generated images from real (human-created) ones. Our methodology is built on the hypothesis that AI-generated images demonstrate closer embedding proximity to other AI-generated content, while human-created images cluster similarly within their domain. To validate this hypothesis, we developed a system that processes a diverse dataset of AI and human-generated images through five benchmark embedding models. Extensive experimentation demonstrates the robustness of our approach, and our results confirm that moderate to high perturbations minimally impact the embedding signatures, with perturbed images maintaining close similarity matches to their original versions. Our solution provides a generalizable framework for AI-generated image detection that balances accuracy with computational efficiency.
zh
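
Under the stated hypothesis, detection reduces to a neighborhood vote in embedding space; a minimal sketch, assuming L2-normalized embeddings and binary reference labels (1 = AI-generated):

```python
import numpy as np

def detect_ai_image(query_emb, ref_embs, ref_labels, k=10):
    """k-NN vote in embedding space: if the query's nearest references are
    mostly AI-generated, flag it as AI. With L2-normalized embeddings,
    the dot product equals cosine similarity."""
    sims = ref_embs @ query_emb          # (N,) cosine similarities
    top = np.argsort(sims)[-k:]          # indices of the k most similar references
    ai_score = ref_labels[top].mean()    # fraction of AI neighbors, in [0, 1]
    return ai_score, ai_score >= 0.5
```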

[CV-98] Pre to Post-Treatment Glioblastoma MRI Prediction using a Latent Diffusion Model MICCAI

【速读】: This paper addresses the difficulty of predicting treatment response in the early stage of therapy for glioblastoma (Glioblastoma, GBM) patients, a prerequisite for personalized medicine. Existing treatment response prediction (Treatment Response Prediction, TRP) methods rely on time-series data and cannot provide early judgments before imaging follow-up. The authors therefore cast early visual TRP as a slice-to-slice image translation task from baseline (pre-treatment) MRI to post-treatment MRI, implicitly capturing the tumor's evolution. The key innovation is a generative framework based on a latent diffusion model: the baseline MRI and tumor localization are fused via concatenation as conditioning, and classifier-free guidance exploits survival data to improve generation quality, in particular the modeling of post-treatment tumor evolution.

链接: https://arxiv.org/abs/2510.17851
作者: Alexandre G. Leclercq,Sébastien Bougleux,Noémie N. Moreau,Alexis Desmonts,Romain Hérault,Aurélien Corroyer-Dulmont
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures. Presented to the Deep Generative Models Workshop of MICCAI (DGM4MICCAI)

点击查看摘要

Abstract:Glioblastoma (GBM) is an aggressive primary brain tumor with a median survival of approximately 15 months. In clinical practice, the Stupp protocol serves as the standard first-line treatment. However, patients exhibit highly heterogeneous therapeutic responses, and at least two months are required before a first visual impact can be observed, typically with MRI. Early prediction of treatment response is crucial for advancing personalized medicine. Disease Progression Modeling (DPM) aims to capture the trajectory of disease evolution, while Treatment Response Prediction (TRP) focuses on assessing the impact of therapeutic interventions. Whereas most TRP approaches primarily rely on time-series data, we consider the problem of early visual TRP as a slice-to-slice translation model generating post-treatment MRI from a pre-treatment MRI, thus reflecting the tumor evolution. To address this problem we propose a Latent Diffusion Model with a concatenation-based conditioning on the pre-treatment MRI and the tumor localization, and classifier-free guidance to enhance generation quality using survival information, in particular post-treatment tumor evolution. Our model was trained and tested on a local dataset consisting of 140 GBM patients collected at Centre François Baclesse. For each patient we collected pre- and post-treatment T1-Gd MRI, tumor localization manually delineated in the pre-treatment MRI by medical experts, and survival information.
zh
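
The classifier-free guidance component follows the standard formulation; below is a sketch of one guided denoising step, where the `unet` signature and the conditioning contents are assumptions based on the abstract:

```python
import torch

@torch.no_grad()
def cfg_denoise_step(unet, z_t, t, cond, w=3.0):
    """One classifier-free-guidance prediction in latent space:
    eps = eps_uncond + w * (eps_cond - eps_uncond).
    Here `cond` would bundle the pre-treatment MRI latent, the tumor
    localization map, and survival information (assumed conditioning)."""
    eps_uncond = unet(z_t, t, cond=None)   # unconditional prediction
    eps_cond = unet(z_t, t, cond=cond)     # conditional prediction
    return eps_uncond + w * (eps_cond - eps_uncond)
```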

[CV-99] CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization NEURIPS2025

【速读】: This paper addresses the high computational cost of instruction tuning multimodal large language models (Multimodal Large Language Models, MLLMs) on large-scale datasets. Existing data selection methods try to relieve this bottleneck by picking important and diverse subsets, but suffer from two critical flaws: expensive evaluation over the full dataset, and treating importance and diversity separately, preventing joint optimization. The proposed CoIDO framework is a dual-objective optimization scheme whose core innovation is a lightweight plug-in scorer trained on only a small random sample to learn the candidate distribution, drastically reducing compute; leveraging a homoscedastic-uncertainty formulation, CoIDO dynamically balances importance and diversity during training, enabling efficient, scalable data selection. Experiments show that after training the scorer on 20% randomly sampled data, the 20% subset CoIDO selects from the full dataset lets LLaVA-1.5-7B reach 98.2% of full-data fine-tuning performance on average.

链接: https://arxiv.org/abs/2510.17847
作者: Yichen Yan,Ming Zhong,Qi Zhu,Xiaoling Gu,Jinpeng Chen,Huan Li
机构: Zhejiang University (浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州滨江高新技术产业开发区区块链与数据安全研究院); Hangzhou Dianzi University (杭州电子科技大学); BUPT (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 8 figures, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling efficient and scalable data selection. In our experiments, we trained the CoIDO scorer using only 20 percent of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20 percent subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2 percent of the performance of full-data fine-tuning, on average.
zh
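
The homoscedastic-uncertainty formulation for balancing two objectives is commonly implemented after Kendall et al. (2018); below is a sketch under the assumption that CoIDO uses this standard form:

```python
import torch

class CoupledObjective(torch.nn.Module):
    """Homoscedastic-uncertainty weighting of two losses (importance and
    diversity): L = sum_i L_i / (2 * sigma_i^2) + log sigma_i.
    The log-variances are learned, so the balance needs no manual tuning."""
    def __init__(self):
        super().__init__()
        self.log_var = torch.nn.Parameter(torch.zeros(2))  # [importance, diversity]

    def forward(self, loss_importance, loss_diversity):
        losses = torch.stack([loss_importance, loss_diversity])
        precision = torch.exp(-self.log_var)               # 1 / sigma_i^2
        return (0.5 * precision * losses + 0.5 * self.log_var).sum()
```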

[CV-100] MAT-Agent : Adaptive Multi-Agent Training Optimization NEURIPS2025

【速读】: This paper addresses the failure of static training strategies in multi-label image classification, where the visual-semantic space is complex and evolving. The core challenge is adapting hyperparameters such as data augmentation, optimizer, learning rate, and loss function without manual tuning, while balancing accuracy, rare-class performance, and training stability. The key to the solution is MAT-Agent, a multi-agent framework that reimagines training as a collaborative, real-time optimization process: autonomous agents use non-stationary multi-armed bandit algorithms to balance exploration and exploitation of configurations, guided by a composite reward that harmonizes the optimization objectives; combined with dual-rate exponential moving average smoothing and mixed-precision training, it markedly improves efficiency and robustness, surpassing existing methods on multiple benchmark datasets.

链接: https://arxiv.org/abs/2510.17845
作者: Jusheng Zhang,Kaitong Cai,Yijia Fan,Ningyuan Liu,Keze Wang
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Acceptance to NeurIPS 2025 Main Track

点击查看摘要

Abstract:Multi-label image classification demands adaptive training strategies to navigate complex, evolving visual-semantic landscapes, yet conventional methods rely on static configurations that falter in dynamic settings. We propose MAT-Agent, a novel multi-agent framework that reimagines training as a collaborative, real-time optimization process. By deploying autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions, MAT-Agent leverages non-stationary multi-armed bandit algorithms to balance exploration and exploitation, guided by a composite reward harmonizing accuracy, rare-class performance, and training stability. Enhanced with dual-rate exponential moving average smoothing and mixed-precision training, it ensures robustness and efficiency. Extensive experiments across Pascal VOC, COCO, and VG-256 demonstrate MAT-Agent’s superiority: it achieves an mAP of 97.4 (vs. 96.2 for PAT-T), OF1 of 92.3, and CF1 of 91.4 on Pascal VOC; an mAP of 92.8 (vs. 92.0 for HSQ-CvN), OF1 of 88.2, and CF1 of 87.1 on COCO; and an mAP of 60.9, OF1 of 70.8, and CF1 of 61.1 on VG-256. With accelerated convergence and robust cross-domain generalization, MAT-Agent offers a scalable, intelligent solution for optimizing complex visual models, paving the way for adaptive deep learning advancements.
zh
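
A discounted UCB bandit is one standard non-stationary multi-armed bandit that fits the description; the sketch below treats candidate training configurations as arms (the specific algorithm and constants are assumptions, not the paper's exact choice):

```python
import numpy as np

class DiscountedUCB:
    """Non-stationary multi-armed bandit (discounted UCB): old rewards are
    geometrically down-weighted so the agent can re-adapt as training evolves.
    Arms would be candidate configurations, e.g. augmentation policies."""
    def __init__(self, n_arms, gamma=0.98, c=2.0):
        self.gamma, self.c = gamma, c
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select(self):
        total = self.counts.sum() + 1e-9
        means = self.values / np.maximum(self.counts, 1e-9)
        bonus = self.c * np.sqrt(np.log(total + 1.0) / np.maximum(self.counts, 1e-9))
        return int(np.argmax(means + bonus))

    def update(self, arm, reward):
        self.counts *= self.gamma            # discount all arms
        self.values *= self.gamma
        self.counts[arm] += 1.0
        self.values[arm] += reward           # composite reward (accuracy, rare-class, stability)
```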

[CV-101] Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

【速读】: This paper addresses the lack of systematic evaluation of high-level cognitive abilities (e.g., instruction comprehension, planning, failure diagnosis) for robots that must perceive, reason, and act in dynamic, unstructured environments; existing benchmarks mostly focus on execution success or lack task realism, failing to reflect the intelligence of the "embodied brain". The key to the solution is RoboBench, a comprehensive benchmark for evaluating multimodal large language models (MLLMs) as embodied brains across five dimensions: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis, covering 14 capabilities, 25 tasks, and 6092 QA pairs. Task realism is ensured by curating diverse scenes and attribute-rich objects from large-scale real robotic data, and an "MLLM-as-world-simulator" framework evaluates embodied feasibility by simulating whether predicted plans achieve critical object-state changes. Experiments expose clear limitations of current MLLMs in implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis, providing quantitative evidence and direction for next-generation embodied MLLMs.

链接: https://arxiv.org/abs/2510.17801
作者: Yulin Luo,Chun-Kai Fan,Menghang Dong,Jiayu Shi,Mengdi Zhao,Bo-Wen Zhang,Cheng Chi,Jiaming Liu,Gaole Dai,Rongyu Zhang,Ruichuan An,Kun Wu,Zhengping Che,Shaoxuan Xie,Guocai Yao,Zhongxia Zhao,Pengwei Wang,Guang Liu,Zhongyuan Wang,Tiejun Huang,Shanghang Zhang
机构: Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Fudan University (复旦大学); University of Science and Technology Beijing (北京科技大学); Beijing Innovation Center of Humanoid Robotics (北京人形机器人创新中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success, or when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full manipulation pipeline, RoboBench defines five dimensions (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis), spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator. It evaluates embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition, and guide the development of next-generation embodied MLLMs. The project page is at this https URL.
zh

[CV-102] Visual Space Optimization for Zero-shot Learning

【速读】: This paper addresses the unremarkable structure of the visual space in zero-shot learning (Zero-shot Learning, ZSL): existing methods usually treat deep visual features as an ideal embedding space, yet their discrete distribution prevents the data structure from expressing class semantics effectively. The key to the solution is optimizing the visual space itself in two ways: a visual-prototype-based method that learns one prototype feature per visual class to replace the original set of discrete features, and a multilayer perceptron (Multilayer Perceptron, MLP) framework that optimizes the visual feature structure in an intermediate embedding space to make it more discriminative. Experiments on four benchmark datasets show both strategies improve zero-shot performance, with the prototype method achieving a new state of the art.

链接: https://arxiv.org/abs/1907.00330
作者: Xinsheng Wang,Shanmin Pang,Jihua Zhu,Zhongyu Li,Zhiqiang Tian,Yaochen Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential in real-world applications. Zero-shot learning models rely on learning an embedding space, where both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Recently, most of the existing works consider the visual space formulated by deep visual features as an ideal choice of the embedding space. However, the discrete distribution of instances in the visual space makes the data structure unremarkable. We argue that optimizing the visual space is crucial as it allows semantic vectors to be embedded into the visual space more effectively. In this work, we propose two strategies to accomplish this purpose. One is the visual prototype based method, which learns a visual prototype for each visual class, so that, in the visual space, a class can be represented by a prototype feature instead of a series of discrete visual features. The other is to optimize the visual feature structure in an intermediate embedding space; for this we devise a multilayer perceptron based algorithm that learns the common intermediate embedding space while making the visual data structure more distinctive. Through extensive experimental evaluation on four benchmark datasets, we demonstrate that optimizing visual space is beneficial for zero-shot learning. Besides, the proposed prototype based method achieves the new state-of-the-art performance.

[CV-103] DualHash: A Stochastic Primal-Dual Algorithm with Theoretical Guarantee for Deep Hashing

【Quick Read】: This paper tackles the optimization difficulty in deep hashing caused by the discrete nature of quantization, in particular the lack of convergence guarantees when W-type regularizers such as ||z|-1| are optimized directly. The key is DualHash, a stochastic primal-dual hashing algorithm based on Fenchel duality: the nonconvex W-type regularization is partially transferred into the dual space, yielding a proximal operator with closed-form solutions and hence provable convergence and complexity bounds. Two instances are derived: a momentum-accelerated version with \mathcal{O}(\varepsilon^{-4}) complexity and an improved variance-reduced version with \mathcal{O}(\varepsilon^{-3}) complexity, which markedly improve optimization efficiency and stability.

Link: https://arxiv.org/abs/2510.18218
Authors: Luxuan Li, Xiao Wang, Chunfeng Cui
Institutions: Beihang University; Sun Yat-Sen University
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep hashing converts high-dimensional feature vectors into compact binary codes, enabling efficient large-scale retrieval. A fundamental challenge in deep hashing stems from the discrete nature of quantization in generating the codes. W-type regularizations, such as ||z|-1|, have been proven effective as they encourage variables toward binary values. However, existing methods often directly optimize these regularizations without convergence guarantees. While proximal gradient methods offer a promising solution, the coupling between W-type regularizers and neural network outputs results in composite forms that generally lack closed-form proximal solutions. In this paper, we present a stochastic primal-dual hashing algorithm, referred to as DualHash, that provides rigorous complexity bounds. Using Fenchel duality, we partially transform the nonconvex W-type regularization optimization into the dual space, which results in a proximal operator that admits closed-form solutions. We derive two algorithm instances: a momentum-accelerated version with \mathcal{O}(\varepsilon^{-4}) complexity and an improved \mathcal{O}(\varepsilon^{-3}) version using variance reduction. Experiments on three image retrieval databases demonstrate the superior performance of DualHash.
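For intuition about why W-type penalties admit cheap proximal steps once decoupled from the network, here is a minimal sketch of the closed-form prox of the elementwise penalty lam*||x|-1| (our own derivation for illustration; DualHash's operator lives in the dual space and is specified in the paper):

```python
import numpy as np

def prox_w(v: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form prox of the scalar W-type penalty lam * ||x| - 1|.

    Solves argmin_x 0.5*(x - v)**2 + lam*abs(abs(x) - 1) elementwise.
    By symmetry the solution keeps the sign of v and moves |v| toward 1:
    magnitudes above 1 + lam are shrunk by lam, magnitudes below 1 - lam
    are pushed up by lam, and anything within lam of 1 snaps onto 1.
    """
    s = np.where(v >= 0, 1.0, -1.0)  # tie at v == 0 broken toward +
    a = np.abs(v)
    mag = np.where(a > 1 + lam, a - lam, np.where(a < 1 - lam, a + lam, 1.0))
    return s * mag

print(prox_w(np.array([-2.0, -0.3, 0.95, 1.4]), lam=0.2))
# -> [-1.8 -0.5  1.   1.2]
```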

[CV-104] Conformal Lesion Segmentation for 3D Medical Images

【Quick Read】: This paper addresses the lack of statistical guarantees in medical image segmentation: existing models rely on fixed thresholds (e.g., 0.5) to separate lesions from background and offer no provable risk control over the false negative rate (FNR), which limits reliable deployment in high-stakes clinical settings, especially 3D lesion segmentation (3D-LS). The key is a risk-constrained framework, Conformal Lesion Segmentation (CLS), which calibrates data-driven thresholds via conformalization so that the test-time FNR stays below a target tolerance ε at the desired risk level α. CLS first uses a calibration set to analyze, for each sample, the critical threshold that just satisfies the FNR tolerance, defining an FNR-specific loss and identifying the thresholds that meet the target; given a user-specified risk level α, it then takes the approximate 1-α quantile of all critical thresholds in the calibration set as the test-time confidence threshold. By conformalizing these critical thresholds, CLS generalizes the statistical regularities of the calibration set to new test data, enforcing a rigorous FNR constraint while producing more precise and reliable segmentations.

Link: https://arxiv.org/abs/2510.17897
Authors: Binyu Tan, Zhiyuan Wang, Jinhao Duan, Kaidi Xu, Heng Tao Shen, Xiaoshuang Shi, Fumin Shen
Institutions: 1. University of Technology Sydney; 2. Alibaba Group; 3. Tsinghua University; 4. National University of Singapore
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical image segmentation serves as a critical component of precision medicine, enabling accurate localization and delineation of pathological regions, such as lesions. However, existing models empirically apply fixed thresholds (e.g., 0.5) to differentiate lesions from the background, offering no statistical guarantees on key metrics such as the false negative rate (FNR). This lack of principled risk control undermines their reliable deployment in high-stakes clinical applications, especially in challenging scenarios like 3D lesion segmentation (3D-LS). To address this issue, we propose a risk-constrained framework, termed Conformal Lesion Segmentation (CLS), that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance \varepsilon under desired risk levels. CLS begins by holding out a calibration set to analyze the threshold setting for each sample under the FNR tolerance, drawing on the idea of conformal prediction. We define an FNR-specific loss function and identify the critical threshold at which each calibration data point just satisfies the target tolerance. Given a user-specified risk level \alpha , we then determine the approximate 1-\alpha quantile of all the critical thresholds in the calibration set as the test-time confidence threshold. By conformalizing such critical thresholds, CLS generalizes the statistical regularities observed in the calibration set to new test data, providing rigorous FNR constraint while yielding more precise and reliable segmentations. We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
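A minimal sketch of the final calibration step as described in the abstract, assuming the per-sample critical thresholds have already been computed; the exact loss orientation and finite-sample correction used by CLS are defined in the paper, so this is illustrative only:

```python
import numpy as np

def conformal_threshold(critical_ts: np.ndarray, alpha: float) -> float:
    """Conformalize per-sample critical thresholds (sketch of CLS's last step).

    critical_ts[i] is the threshold at which calibration volume i just meets
    the FNR tolerance. Following the paper's description, the test-time
    confidence threshold is an approximate (1 - alpha) quantile of these
    values; we add the usual (n + 1) conformal finite-sample correction.
    """
    n = len(critical_ts)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(critical_ts, level, method="higher"))

# Toy usage with synthetic critical thresholds for 200 calibration volumes.
rng = np.random.default_rng(0)
print(conformal_threshold(rng.uniform(0.2, 0.6, size=200), alpha=0.1))
```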

[CV-105] Cross-Domain Multi-Person Human Activity Recognition via Near-Field Wi-Fi Sensing

【Quick Read】: This paper addresses the difficulty of distinguishing individual subjects in multi-user Wi-Fi-based human activity recognition (HAR), which stems from Wi-Fi's coarse spatial resolution; in particular, conventional neural models struggle with cross-domain adaptation when some activity categories are unavailable. The key is WiAnchor, a training framework that enables efficient cross-domain adaptation in three stages: during pre-training, inter-class feature margins are enlarged to improve activity separability; during fine-tuning, an anchor matching mechanism uses the incomplete set of activity categories to filter subject-specific interference rather than trying to extract complete features; finally, recognition is refined based on the feature-level similarity between input samples and anchors. On a comprehensive dataset built for evaluation, the method achieves over 90% cross-domain accuracy even when some activity categories are absent.

Link: https://arxiv.org/abs/2510.17816
Authors: Xin Li, Jingzhi Hu, Yinghui He, Hongbo Wang, Jin Gan, Jun Luo
Institutions: Nanyang Technological University
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Wi-Fi-based human activity recognition (HAR) provides substantial convenience and has emerged as a thriving research field, yet the coarse spatial resolution inherent to Wi-Fi significantly hinders its ability to distinguish multiple subjects. By exploiting the near-field domination effect, establishing a dedicated sensing link for each subject through their personal Wi-Fi device offers a promising solution for multi-person HAR under native traffic. However, due to the subject-specific characteristics and irregular patterns of near-field signals, HAR neural network models require fine-tuning (FT) for cross-domain adaptation, which becomes particularly challenging with certain categories unavailable. In this paper, we propose WiAnchor, a novel training framework for efficient cross-domain adaptation in the presence of incomplete activity categories. This framework processes Wi-Fi signals embedded with irregular time information in three steps: during pre-training, we enlarge inter-class feature margins to enhance the separability of activities; in the FT stage, we innovate an anchor matching mechanism for cross-domain adaptation, filtering subject-specific interference informed by incomplete activity categories, rather than attempting to extract complete features from them; finally, the recognition of input samples is further improved based on their feature-level similarity with anchors. We construct a comprehensive dataset to thoroughly evaluate WiAnchor, achieving over 90% cross-domain accuracy with absent activity categories.

Artificial Intelligence

[AI-0] Actor-Free Continuous Control via Structurally Maximizable Q-Functions NEURIPS2025

【Quick Read】: This paper addresses the difficulty of applying traditional value-based algorithms to continuous action spaces, where evaluating Q-values over the entire action space is computationally prohibitive. The authors propose a purely value-based framework that replaces gradient-ascent action selection with a structural maximization mechanism, enabling efficient and stable learning for continuous control. The key is a redesigned strategy for maximizing the Q-function, combined with specific architectural and algorithmic choices, so that without training a separate actor (policy network) the method matches or exceeds state-of-the-art baselines on standard simulation tasks, and is particularly strong when constrained action spaces make the value function non-smooth.

Link: https://arxiv.org/abs/2510.18828
Authors: Yigit Korkmaz, Urvi Bhuwania, Ayush Jain, Erdem Bıyık
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic’s output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at this https URL.

[AI-1] Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

【Quick Read】: This paper addresses the reliance of LLM reasoning improvements on complex reward mechanisms and multi-rollout interaction, especially the high computational cost and training complexity of reinforcement learning methods such as reinforcement learning with verifiable rewards (RLVR). The key is an online supervised fine-tuning (OSFT) paradigm that needs no external reward signal: the model generates its own responses and is immediately fine-tuned on them, leveraging the latent preferences (latent knowledge) acquired during pretraining to guide reasoning improvement, yielding efficient and stable gains.

Link: https://arxiv.org/abs/2510.18814
Authors: Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model’s own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at this https URL.
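The training loop is simple enough to sketch end-to-end. The sketch below is structural only; ToyModel, generate, and sft_step are placeholder names of ours, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class ToyModel:
    history: list = field(default_factory=list)
    def generate(self, prompt: str) -> str:
        return f"answer({prompt})"               # stand-in for sampling a rollout
    def sft_step(self, prompt: str, response: str) -> None:
        self.history.append((prompt, response))  # stand-in for a gradient step

def osft(model: ToyModel, prompts: list[str]) -> ToyModel:
    """One OSFT pass: reward-free, one rollout per prompt (the paper's default)."""
    for p in prompts:
        r = model.generate(p)  # 1) the model answers its own prompt
        model.sft_step(p, r)   # 2) immediate supervised update on (p, r)
    return model

osft(ToyModel(), ["2+2=?", "prove sqrt(2) is irrational"])
```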

[AI-2] Decoding Funded Research: Comparative Analysis of Topic Models and Uncovering the Effect of Gender and Geographic Location

【Quick Read】: This paper aims to identify research trends and their demographic and geographic drivers more precisely, in order to optimize national scientific investment and promote equity, diversity, and inclusion. The key is COFFEE, a novel algorithm that adds covariate effect estimation to the BERTopic topic model, filling the gap left by BERTopic's lack of a native covariate-analysis function (unlike the probabilistic STM). Comparing three topic modelling approaches (LDA, STM, and BERTopic) on 18 years of NSERC-funded proposals, the study finds BERTopic best at identifying granular, coherent, and emergent themes such as the rapid expansion of artificial intelligence; powered by COFFEE, the covariate analysis confirms distinct provincial research specializations and consistent gender-based thematic patterns, providing an empirical basis for funding agencies to design more equitable and impactful strategies.

Link: https://arxiv.org/abs/2510.18803
Authors: Shirin Tavakoli Kafiabad, Andrea Schiffauerova, Ashkan Ebadi
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages

Abstract:Optimizing national scientific investment requires a clear understanding of evolving research trends and the demographic and geographical forces shaping them, particularly in light of commitments to equity, diversity, and inclusion. This study addresses this need by analyzing 18 years (2005-2022) of research proposals funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). We conducted a comprehensive comparative evaluation of three topic modelling approaches: Latent Dirichlet Allocation (LDA), Structural Topic Modelling (STM), and BERTopic. We also introduced a novel algorithm, named COFFEE, designed to enable robust covariate effect estimation for BERTopic. This advancement addresses a significant gap, as BERTopic lacks a native function for covariate analysis, unlike the probabilistic STM. Our findings highlight that while all models effectively delineate core scientific domains, BERTopic outperformed by consistently identifying more granular, coherent, and emergent themes, such as the rapid expansion of artificial intelligence. Additionally, the covariate analysis, powered by COFFEE, confirmed distinct provincial research specializations and revealed consistent gender-based thematic patterns across various scientific disciplines. These insights offer a robust empirical foundation for funding organizations to formulate more equitable and impactful funding strategies, thereby enhancing the effectiveness of the scientific ecosystem.

[AI-3] Computational Foundations for Strategic Coopetition: Formalizing Interdependence and Complementarity

【Quick Read】: This paper addresses the difficulty of modelling and analyzing strategic coopetition in modern socio-technical systems: retaining contextual richness while enabling quantitative analysis of dynamic trade-offs. Conceptual modelling languages such as i* capture strategic dependencies but lack quantitative machinery, while classical game theory offers mathematical rigor but strips away contextual detail. The key is a computational foundation that formalizes i*'s structural depender-dependee-dependum relationships into quantifiable interdependence coefficients, models complementarity following Brandenburger and Nalebuff's Added Value concept, and integrates structural interdependence into the Nash Equilibrium via bargaining power, yielding a unified quantitative framework for analyzing coopetitive behavior.

Link: https://arxiv.org/abs/2510.18802
Authors: Vik Pant, Eric Yu
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 36 pages, 7 figures

Abstract:Modern socio-technical systems are characterized by strategic coopetition where actors simultaneously cooperate to create value and compete to capture it. While conceptual modeling languages like i* provide rich qualitative representations of strategic dependencies, they lack mechanisms for quantitative analysis of dynamic trade-offs. Conversely, classical game theory offers mathematical rigor but strips away contextual richness. This technical report bridges this gap by developing computational foundations that formalize two critical dimensions of coopetition: interdependence and complementarity. We ground interdependence in i* structural dependency analysis, translating depender-dependee-dependum relationships into quantitative interdependence coefficients through a structured translation framework. We formalize complementarity following Brandenburger and Nalebuff’s Added Value concept, modeling synergistic value creation with validated parameterization. We integrate structural dependencies with bargaining power in value appropriation and introduce a game-theoretic formulation where Nash Equilibrium incorporates structural interdependence. Validation combines comprehensive experimental testing across power and logarithmic value function specifications, demonstrating functional form robustness, with empirical application to the Samsung-Sony S-LCD joint venture (2004-2011), where logarithmic specifications achieve superior empirical fit (validation score 45/60) while power functions provide theoretical tractability. This technical report serves as the foundational reference for a coordinated research program examining strategic coopetition in requirements engineering and multi-agent systems, with companion work addressing trust dynamics, team production, and reciprocity mechanisms.

[AI-4] HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models

【Quick Read】: This paper addresses the vulnerability of large language models (LLMs) to multi-turn jailbreak attacks, in which an attacker elicits harmful outputs through covert, step-by-step paths. The key is the HarmNet framework, composed of ThoughtNet (a hierarchical semantic network), a feedback-driven Simulator (for iterative query refinement), and a Network Traverser (for real-time adaptive attack execution), which systematically explores and refines the adversarial space to uncover stealthy, high-success attack paths.

Link: https://arxiv.org/abs/2510.18728
Authors: Sidhant Narula, Javad Rafiei Asl, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation at the Conference on Applied Machine Learning in Information Security (CAMLIS 2025)

Abstract:Large Language Models (LLMs) remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework comprising ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack execution. HarmNet systematically explores and refines the adversarial space to uncover stealthy, high-success attack paths. Experiments across closed-source and open-source LLMs show that HarmNet outperforms state-of-the-art methods, achieving higher attack success rates. For example, on Mistral-7B, HarmNet achieves a 99.4% attack success rate, 13.9% higher than the best baseline. Index terms: jailbreak attacks; large language models; adversarial framework; query refinement.

[AI-5] Causally Perturbed Fairness Testing

【Quick Read】: This paper addresses how to effectively reveal fairness bugs via perturbation under an intractable sample size in AI systems that handle tabular data. Existing work focuses on designing test sample generators and ignores valuable knowledge in the data characteristics that could guide the perturbation, limiting its potential. The key is CausalFT, a generic framework for causally perturbed fairness testing: causal inference extracts the non-sensitive feature most directly and causally relevant to a sensitive counterpart (e.g., gender, age, or race), and this causal relationship is injected into the perturbation to guide the test sample generator. Compared with approaches that rank non-sensitive features by correlation alone, CausalFT reveals fairness bugs more effectively, considerably improving arbitrary base generators in over 93% of cases and, in nearly all cases, improving bias resilience.

Link: https://arxiv.org/abs/2510.18719
Authors: Chengwen Du, Tao Chen
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: accepted by TOSEM

Abstract:To mitigate unfair and unethical discrimination over sensitive features (e.g., gender, age, or race), fairness testing plays an integral role in engineering systems that leverage AI models to handle tabular data. A key challenge therein is how to effectively reveal fairness bugs under an intractable sample size using perturbation. Much current work has been focusing on designing the test sample generators, ignoring the valuable knowledge about data characteristics that can help guide the perturbation and hence limiting their full potential. In this paper, we seek to bridge such a gap by proposing a generic framework of causally perturbed fairness testing, dubbed CausalFT. Through causal inference, the key idea of CausalFT is to extract the most directly and causally relevant non-sensitive feature to its sensitive counterpart, which can jointly influence the prediction of the label. Such a causal relationship is then seamlessly injected into the perturbation to guide a test sample generator. Unlike existing generator-level work, CausalFT serves as a higher-level framework that can be paired with diverse base generators. Extensive experiments on 1296 cases confirm that CausalFT can considerably improve arbitrary base generators in revealing fairness bugs over 93% of the cases with acceptable extra runtime overhead. Compared with a state-of-the-art approach that ranks the non-sensitive features solely based on correlation, CausalFT performs significantly better on 64% cases while being much more efficient. Further, CausalFT can better improve bias resilience in nearly all cases.

[AI-6] Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options NEURIPS2025

【Quick Read】: This paper addresses sample efficiency in online preference-based reinforcement learning (PbRL), especially with ranking feedback, where existing methods fail to exploit the richer information and can even degrade as the subset size grows. The key is to adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and to propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. Theoretically, M-AUPO achieves a suboptimality gap of \tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{\sum_{t=1}^{T} \frac{1}{|S_t|}} \right), where d is the feature dimension and |S_t| is the subset size at round t; the bound explicitly shows that larger subsets yield better performance and avoids the exponential dependence on the unknown parameter's norm that was a fundamental limitation of most prior work, making the guarantee tighter and more practical.

Link: https://arxiv.org/abs/2510.18713
Authors: Joongkyu Lee, Seouh-won Yi, Min-hwan Oh
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at NeurIPS 2025

Abstract:We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL’s recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of \tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{\sum_{t=1}^{T} \frac{1}{|S_t|}} \right), where T is the total number of rounds, d is the feature dimension, and |S_t| is the size of the subset at round t. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter’s norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of \Omega\left( \frac{d}{K\sqrt{T}} \right), where K is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
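As a rough illustration of uncertainty-driven subset selection in the linear setting, here is a simplified proxy of ours, not M-AUPO itself: score each candidate action by its elliptical-norm uncertainty under the regularized Gram matrix of past actions and offer the top-k.

```python
import numpy as np

def select_subset(features: np.ndarray, A: np.ndarray, k: int) -> np.ndarray:
    """Offer the k actions with the largest elliptical-norm uncertainty.

    features: (n, d) candidate action features; A: (d, d) regularized Gram
    matrix of previously played features. x^T A^{-1} x is the usual
    linear-bandit uncertainty width (squared); ranking by it and taking the
    top-k is a simplified proxy for maximizing average within-subset
    uncertainty.
    """
    widths = np.einsum("nd,dk,nk->n", features, np.linalg.inv(A), features)
    return np.argsort(-widths)[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
A = np.eye(8) + X[:10].T @ X[:10]   # ridge term + observed actions
print(select_subset(X, A, k=4))
```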

[AI-7] Fetch.ai: An Architecture for Modern Multi-Agent Systems

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的智能系统普遍忽视多年积累的多智能体系统(Multi-Agent Systems, MAS)基础研究的问题,从而导致现有框架存在中心化、信任机制不足和通信协议不完善等关键局限。其解决方案的核心在于提出一个工业级架构——this http URL,该架构通过在链上区块链服务基础上构建去中心化的身份验证、发现与交易机制,融合经典MAS原则与现代AI能力;同时配套提供安全可互操作的智能体开发框架、云部署平台以及基于原生LLM的智能编排层,实现人类高层目标到复杂多智能体工作流的自动转换,最终支撑开放、协作且经济可持续的多智能体生态系统。

链接: https://arxiv.org/abs/2510.18699
作者: Michael J. Wooldridge,Attila Bagoly,Jonathan J. Ward,Emanuele La Malfa,Gabriel Paludo Licks
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 26 pages, figures, code examples

点击查看摘要

Abstract:Recent surges in LLM-driven intelligent systems largely overlook decades of foundational multi-agent systems (MAS) research, resulting in frameworks with critical limitations such as centralization and inadequate trust and communication protocols. This paper introduces the this http URL architecture, an industrial-strength platform designed to bridge this gap by facilitating the integration of classical MAS principles with modern AI capabilities. We present a novel, multi-layered solution built on a decentralized foundation of on-chain blockchain services for verifiable identity, discovery, and transactions. This is complemented by a comprehensive development framework for creating secure, interoperable agents, a cloud-based platform for deployment, and an intelligent orchestration layer where an agent-native LLM translates high-level human goals into complex, multi-agent workflows. We demonstrate the deployed nature of this system through a decentralized logistics use case where autonomous agents dynamically discover, negotiate, and transact with one another securely. Ultimately, the this http URL stack provides a principled architecture for moving beyond current agent implementations towards open, collaborative, and economically sustainable multi-agent ecosystems.

[AI-8] Exploring Membership Inference Vulnerabilities in Clinical Large Language Models ALT

【Quick Read】: This paper addresses the privacy risks raised by fine-tuning clinical large language models (LLMs) on sensitive electronic health record (EHR) data, focusing on membership inference attacks (MIA): can an adversary determine, from the model's behavior, whether a specific patient record was used in training? The key contribution is to identify and quantify such vulnerabilities, proposing a domain-motivated adversarial evaluation strategy (a paraphrasing-based perturbation method grounded in clinical context) that more realistically reflects clinical attack conditions, and motivating context-aware, domain-specific privacy defenses for healthcare, such as differential-privacy fine-tuning and paraphrase-aware training, to strengthen the security and trustworthiness of medical AI systems.

Link: https://arxiv.org/abs/2510.18674
Authors: Alexander Nemecek, Zebin Yun, Zahra Rahmani, Yaniv Harel, Vipin Chaudhary, Mahmood Sharif, Erman Ayday
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at the 1st IEEE Workshop on Healthcare and Medical Device Security, Privacy, Resilience, and Trust (IEEE HMD-SPiRiT)

Abstract:As large language models (LLMs) become progressively more embedded in clinical decision-support, documentation, and patient-information systems, ensuring their privacy and trustworthiness has emerged as an imperative challenge for the healthcare sector. Fine-tuning LLMs on sensitive electronic health record (EHR) data improves domain alignment but also raises the risk of exposing patient information through model behaviors. In this work-in-progress, we present an exploratory empirical study on membership inference vulnerabilities in clinical LLMs, focusing on whether adversaries can infer if specific patient records were used during model training. Using a state-of-the-art clinical question-answering model, Llemr, we evaluate both canonical loss-based attacks and a domain-motivated paraphrasing-based perturbation strategy that more realistically reflects clinical adversarial conditions. Our preliminary findings reveal limited but measurable membership leakage, suggesting that current clinical LLMs provide partial resistance yet remain susceptible to subtle privacy risks that could undermine trust in clinical AI adoption. These results motivate continued development of context-aware, domain-specific privacy evaluations and defenses such as differential privacy fine-tuning and paraphrase-aware training, to strengthen the security and trustworthiness of healthcare AI systems.
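The canonical loss-based attack the study evaluates can be sketched in a few lines; the calibration recipe and numbers below are synthetic and illustrative, not the authors' setup.

```python
import numpy as np

def loss_threshold_mia(losses: np.ndarray, tau: float) -> np.ndarray:
    """Canonical loss-based membership inference: flag a record as a training
    member iff the model's loss on it falls below a threshold tau (members
    tend to be fit more tightly than non-members)."""
    return losses < tau

# Toy calibration: set tau from losses on known non-members so that the
# attack's false-positive rate is ~5%.
rng = np.random.default_rng(0)
nonmember_losses = rng.gamma(shape=3.0, scale=0.5, size=1000)
tau = np.quantile(nonmember_losses, 0.05)
candidate_losses = rng.gamma(shape=2.0, scale=0.4, size=10)
print(loss_threshold_mia(candidate_losses, tau))
```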

[AI-9] Reasoning Language Model Inference Serving Unveiled: An Empirical Study

【Quick Read】: This paper addresses the fact that the serving performance and behavior of reasoning large language models (RLLMs) remain underexplored, which can hinder efficient real-world deployment. Through systematic experiments, the study reveals distinctive serving behaviors of RLLMs (significant memory usage and fluctuations, straggler requests, adaptive running time, and domain preference) and evaluates how well existing inference optimizations transfer: model quantization and speculative decoding improve serving efficiency at a small cost in accuracy, whereas prefix caching and KV cache quantization can even degrade accuracy or serving performance for small RLLMs. Finally, evaluation under real-world workloads modeled by a Gamma distribution confirms that these findings generalize, providing an empirical basis for optimizing RLLM inference serving.

Link: https://arxiv.org/abs/2510.18672
Authors: Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The reasoning large language model (RLLM) has been proven competitive in solving complex reasoning tasks such as mathematics, coding, compared to general LLM. However, the serving performance and behavior of RLLM remains unexplored, which may undermine the deployment and utilization of RLLM in real-world scenario. To close this gap, in this paper, we conduct a comprehensive study of RLLM service. We first perform a pilot study on comparing the serving performance between RLLM and traditional LLM and reveal that there are several distinct differences regarding serving behavior: (1) significant memory usage and fluctuations; (2) straggler requests; (3) adaptive running time; (4) domain preference. Then we further investigate whether existing inference optimization techniques are valid for RLLM. Our main takeaways are that model quantization methods and speculative decoding can improve service system efficiency with small compromise to RLLM accuracy, while prefix caching, KV cache quantization may even degrade accuracy or serving performance for small RLLM. Lastly, we conduct evaluation under real world workload modeled by Gamma distribution to verify our findings. Empirical results of real world workload evaluation across different dataset are aligned with our main findings regarding RLLM serving. We hope our work can provide the research community and industry with insights to advance RLLM inference serving.

[AI-10] Sherlock Your Queries: Learning to Ask the Right Questions for Dialogue-Based Retrieval

【Quick Read】: This paper addresses the difficulty of identifying a user's target from semantically ambiguous queries in information retrieval, where existing dialogue-based interactive retrieval systems are inefficient because they lack an explicit strategy for asking the most informative questions. The key is the SherlockLLM framework, which uses reinforcement learning (RL) to learn an optimal sequence of binary questions that efficiently narrows the search space, and which trains an effective information-seeking dialogue policy without large-scale annotated dialogue data.

Link: https://arxiv.org/abs/2510.18659
Authors: Dong Yun, Marco Schouten, Dim Papadopoulos
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:User queries in information retrieval are often ambiguous, making it challenging for systems to identify a user’s target from a single query. While recent dialogue-based interactive retrieval systems can clarify user intent, they are inefficient as they often lack an explicit strategy to ask the most informative questions. To address this limitation, we propose SherlockLLM, a dialogue-driven retrieval framework that learns an optimal questioning strategy via Reinforcement Learning (RL) and avoids the need for large-scale annotated dialogue data. In our framework, an agent is trained to generate a sequence of binary questions to efficiently narrow down the search space. To validate our approach, we introduce a benchmark with both structured and unstructured tasks. Experimental results show that SherlockLLM is a robust and efficient solution. On the structured tasks, its performance matches strong baselines and approaches the theoretical optimal defined by binary search. On the challenging unstructured task, our agent significantly outperforms these baselines, showcasing its ability to learn a highly effective information-seeking dialogue policy.
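For intuition on why binary questions can be so efficient, here is a greedy, non-RL baseline of ours: pick the question that splits the remaining candidates most evenly, since a perfectly balanced question halves the search space (the bound an RL-trained policy can at best approach on the structured tasks).

```python
def pick_question(candidates: set[str], questions: dict[str, set[str]]) -> str:
    """Each question maps to the subset of candidates whose answer is 'yes'.
    Return the question with the most balanced yes/no split."""
    def imbalance(q: str) -> int:
        yes = len(candidates & questions[q])
        return abs(2 * yes - len(candidates))
    return min(questions, key=imbalance)

items = {"cat", "dog", "car", "bus"}
qs = {"is it alive?": {"cat", "dog"}, "is it a cat?": {"cat"}}
print(pick_question(items, qs))  # -> "is it alive?" (2/2 split beats 1/3)
```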

[AI-11] Query Decomposition for RAG : Balancing Exploration-Exploitation

【Quick Read】: This paper addresses how retrieval-augmented generation (RAG) systems can efficiently select informative documents while balancing retrieval breadth against noise: coverage of all relevant material must be ensured without the computational cost and distraction of over-retrieval. The key is to cast query decomposition and document retrieval as an exploitation-exploration process: retrieving one document at a time builds a belief about the utility of a sub-query, which informs the decision to keep exploiting the current sub-query or to explore an alternative. Experiments with a variety of bandit learning methods show that estimating document relevance from rank information and human judgments yields a 35% gain in document-level precision, a 15% increase in α-nDCG, and better downstream performance on long-form generation.

Link: https://arxiv.org/abs/2510.18633
Authors: Roxana Petcu, Kenton Murray, Daniel Khashabi, Evangelos Kanoulas, Maarten de Rijke, Dawn Lawrie, Kevin Duh
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) systems address complex user requests by decomposing them into subqueries, retrieving potentially relevant documents for each, and then aggregating them to generate an answer. Efficiently selecting informative documents requires balancing a key trade-off: (i) retrieving broadly enough to capture all the relevant material, and (ii) limiting retrieval to avoid excessive noise and computational cost. We formulate query decomposition and document retrieval in an exploitation-exploration setting, where retrieving one document at a time builds a belief about the utility of a given sub-query and informs the decision to continue exploiting or exploring an alternative. We experiment with a variety of bandit learning methods and demonstrate their effectiveness in dynamically selecting the most informative sub-queries. Our main finding is that estimating document relevance using rank information and human judgments yields a 35% gain in document-level precision, 15% increase in \alpha-nDCG, and better performance on the downstream task of long-form generation.
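A minimal sketch of the bandit view, using plain UCB over sub-queries; the reward design and relevance estimation in the paper are richer than the stand-in pull function here, so treat this as a generic illustration.

```python
import numpy as np

def ucb_retrieve(n_subqueries: int, pull, budget: int, c: float = 1.0):
    """Each sub-query is an arm; retrieving one more document for it is a
    pull whose reward is that document's estimated relevance. UCB balances
    exploiting a productive sub-query against exploring the others.
    `pull(i) -> float` is supplied by the RAG stack."""
    counts = np.zeros(n_subqueries)
    means = np.zeros(n_subqueries)
    for t in range(1, budget + 1):
        if t <= n_subqueries:
            i = t - 1                      # pull each arm once first
        else:
            bonus = c * np.sqrt(np.log(t) / counts)
            i = int(np.argmax(means + bonus))
        r = pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return means, counts

rng = np.random.default_rng(0)
true_rel = [0.8, 0.3, 0.5]                 # hidden per-sub-query utility
means, counts = ucb_retrieve(3, lambda i: float(rng.random() < true_rel[i]), 60)
print(counts)                              # most pulls go to sub-query 0
```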

[AI-12] Comparative Expressivity for Structured Argumentation Frameworks with Uncertain Rules and Premises

【Quick Read】: This paper addresses the modelling of qualitative uncertainty in formal argumentation, in particular how abstract argumentation frameworks relate to structured models of argumentation. Existing work concentrates on abstract models and lacks a theoretical analysis of their plausible instantiations. The key contributions are a unified notion of expressivity applicable to both abstract and structured argumentation formalisms, and a set of negative and positive expressivity results that contrast the capabilities of abstract models with structured ones (such as ASPIC+) in handling uncertainty. These results bear on incomplete abstract argumentation frameworks and their extension with dependencies, and provide theoretical grounding for uncertainty modelling in structured argumentation systems.

Link: https://arxiv.org/abs/2510.18631
Authors: Carlo Proietti, Antonio Yuste-Ginel
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Modelling qualitative uncertainty in formal argumentation is essential both for practical applications and theoretical understanding. Yet, most of the existing works focus on abstract models for arguing with uncertainty. Following a recent trend in the literature, we tackle the open question of studying plausible instantiations of these abstract models. To do so, we ground the uncertainty of arguments in their components, structured within rules and premises. Our main technical contributions are: i) the introduction of a notion of expressivity that can handle abstract and structured formalisms, and ii) the presentation of both negative and positive expressivity results, comparing the expressivity of abstract and structured models of argumentation with uncertainty. These results affect incomplete abstract argumentation frameworks, and their extension with dependencies, on the abstract side, and ASPIC+, on the structured side.

[AI-13] Leverag ing Association Rules for Better Predictions and Better Explanations

【Quick Read】: This paper addresses the difficulty of balancing predictive performance and interpretability in tree-based classifiers (decision trees and random forests). The key is to combine data-driven association rule mining with knowledge-guided model improvement: association rules (possibly with negations) are first mined from the data and then leveraged as prior knowledge in the tree-based models to increase predictive accuracy; the same rules are also used to generate more general abductive explanations, significantly reducing explanation size and improving understandability. Experiments show that the approach benefits both predictive performance and explanation size for the two tree-based model families considered.

Link: https://arxiv.org/abs/2510.18628
Authors: Gilles Audemard, Sylvie Coste-Marquis, Pierre Marquis, Mehdi Sabiri, Nicolas Szczepanski
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages

Abstract:We present a new approach to classification that combines data and knowledge. In this approach, data mining is used to derive association rules (possibly with negations) from data. Those rules are leveraged to increase the predictive performance of tree-based models (decision trees and random forests) used for a classification task. They are also used to improve the corresponding explanation task through the generation of abductive explanations that are more general than those derivable without taking such rules into account. Experiments show that for the two tree-based models under consideration, benefits can be offered by the approach in terms of predictive performance and in terms of explanation sizes.

[AI-14] VAR: Visual Attention Reasoning via Structured Search and Backtracking

【Quick Read】: This paper addresses the performance bottleneck of multimodal large language models (MLLMs) on complex tasks caused by their high hallucination tendency and reliance on brittle, linear reasoning. The key is the Visual Attention Reasoning (VAR) framework, which recasts grounded reasoning as a structured search over the space of reasoning trajectories in two stages: traceable evidence grounding and search-based chain-of-thought (CoT) generation, with a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward combining semantic and geometric self-verification, which penalizes outputs not faithfully anchored in the visual input, markedly improving reasoning accuracy and safety.

Link: https://arxiv.org/abs/2510.18619
Authors: Wei Cai, Jian Zhao, Yuchen Yuan, Tianle Zhang, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Language Models (MLLMs), despite their advances, are hindered by their high hallucination tendency and heavy reliance on brittle, linear reasoning processes, leading to failures in complex tasks. To address these limitations, we introduce Visual Attention Reasoning (VAR), a novel framework that recasts grounded reasoning as a structured search over a reasoning trajectory space. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought (CoT) generation, which incorporates a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward function with semantic and geometric self-verification components, which penalize outputs that are not faithfully grounded in the visual input. We provide a theoretical analysis for our search strategy, validating its capability to find the correct solution with high probability. Experimental results show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks, significantly outperforming existing open-source models and demonstrating competitive performance against leading proprietary systems.

[AI-15] A Rectification-Based Approach for Distilling Boosted Trees into Decision Trees

【Quick Read】: This paper addresses how to distill boosted tree models into more interpretable decision trees, so as to strike an acceptable compromise between predictive performance and interpretability. The key is to implement the distillation process with a correction method called rectification; experiments show that this strategy gives more attractive results than distillation achieved by simply retraining the model.

Link: https://arxiv.org/abs/2510.18615
Authors: Gilles Audemard, Sylvie Coste-Marquis, Pierre Marquis, Mehdi Sabiri, Nicolas Szczepanski
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages

Abstract:We present a new approach for distilling boosted trees into decision trees, in the objective of generating an ML model offering an acceptable compromise in terms of predictive performance and interpretability. We explain how the correction approach called rectification can be used to implement such a distillation process. We show empirically that this approach provides interesting results, in comparison with an approach to distillation achieved by retraining the model.
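For reference, the distillation-by-retraining baseline the paper compares against is easy to sketch with scikit-learn; rectification itself edits the model rather than retraining it, and its details are in the paper, not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Fit a boosted ensemble, relabel the data with its predictions, and fit a
# small decision tree on those teacher labels (synthetic data for demo).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
teacher = GradientBoostingClassifier(random_state=0).fit(X, y)
student = DecisionTreeClassifier(max_depth=5, random_state=0)
student.fit(X, teacher.predict(X))         # learn the teacher's labels
fidelity = (student.predict(X) == teacher.predict(X)).mean()
print(f"student/teacher agreement: {fidelity:.3f}")
```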

[AI-16] he Cost-Benefit of Interdisciplinarity in AI for Mental Health

【Quick Read】: This paper examines the limitations of AI mental health chatbots developed with input from only a narrow range of disciplines, which makes value alignment and compliance with the AI Act's requirements for high-risk AI systems difficult to achieve. The key is to involve experts from technology, healthcare, ethics, and law collaboratively across the chatbot's entire lifecycle, so that the system is safeguarded in functional design, data handling, risk control, and regulatory compliance, together with practical recommendations and existing frameworks that support this interdisciplinary integration.

Link: https://arxiv.org/abs/2510.18581
Authors: Katerina Drakos, Eva Paraschou, Simay Toplu, Line Harder Clemmensen, Christoph Lütge, Nicole Nadine Lønfeldt, Sneha Das
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted for poster presentation at the AI in Science Summit 2025

Abstract:Artificial intelligence has been introduced as a way to improve access to mental health support. However, most AI mental health chatbots rely on a limited range of disciplinary input, and fail to integrate expertise across the chatbot’s lifecycle. This paper examines the cost-benefit trade-off of interdisciplinary collaboration in AI mental health chatbots. We argue that involving experts from technology, healthcare, ethics, and law across key lifecycle phases is essential to ensure value-alignment and compliance with the high-risk requirements of the AI Act. We also highlight practical recommendations and existing frameworks to help balance the challenges and benefits of interdisciplinarity in mental health chatbots.

[AI-17] QuantEvolve: Automating Quantitative Strategy Discovery through Multi-Agent Evolutionary Framework

【Quick Read】: This paper addresses the difficulty of automatically developing quantitative trading strategies in dynamic markets, especially under growing demand for personalized investment: existing methods struggle to explore the vast strategy space while preserving the diversity needed for robustness across market conditions. The key is QuantEvolve, a framework combining quality-diversity optimization with hypothesis-driven strategy generation: a feature map aligned with investor preferences (strategy type, risk profile, turnover, and return characteristics) maintains a diverse set of effective strategies, while a hypothesis-driven multi-agent system systematically explores the strategy space through iterative generation and evaluation, producing sophisticated, diverse strategies that adapt to regime shifts and individual investment needs.

Link: https://arxiv.org/abs/2510.18569
Authors: Junhyeog Yun, Hyoun Jun Lee, Insu Jeon
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 13 figures. Accepted for oral presentation at the 2nd Workshop on LLMs and Generative AI for Finance (AI4F), part of ACM ICAIF 2025, Singapore. Non-archival workshop

Abstract:Automating quantitative trading strategy development in dynamic markets is challenging, especially with increasing demand for personalized investment solutions. Existing methods often fail to explore the vast strategy space while preserving the diversity essential for robust performance across changing market conditions. We present QuantEvolve, an evolutionary framework that combines quality-diversity optimization with hypothesis-driven strategy generation. QuantEvolve employs a feature map aligned with investor preferences, such as strategy type, risk profile, turnover, and return characteristics, to maintain a diverse set of effective strategies. It also integrates a hypothesis-driven multi-agent system to systematically explore the strategy space through iterative generation and evaluation. This approach produces diverse, sophisticated strategies that adapt to both market regime shifts and individual investment needs. Empirical results show that QuantEvolve outperforms conventional baselines, validating its effectiveness. We release a dataset of evolved strategies to support future research.

[AI-18] WebDevJudge: Evaluating (M)LLM s as Critiques for Web Development Quality

【Quick Read】: This paper addresses the reliability of LLM-as-a-judge in open-ended tasks with dynamic environments and complex interactions: existing studies focus on well-defined, structured tasks, leaving judging in complex, unstructured web development scenarios without systematic benchmarks. The key is WebDevJudge, a systematic benchmark for web development that supports both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment, built from human preference labels over paired implementations with structured, query-grounded rubrics for high-quality ground truth. Beyond comprehensively evaluating different kinds of evaluators (LLMs, MLLMs, and agentic workflows), the benchmark exposes fundamental limitations of current LLM judgment, such as failures to recognize functional equivalence, verify task feasibility, and mitigate bias, pointing toward more reliable and capable automated evaluators for complicated scenarios.

Link: https://arxiv.org/abs/2510.18560
Authors: Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Yangqiu Song, Lihui Chen, Han Hu
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios. Code and data are available at this https URL.

[AI-19] RAISE: A Unified Framework for Responsible AI Scoring and Evaluation

【Quick Read】: This paper addresses the problem that current AI model evaluation focuses on predictive accuracy while neglecting key responsibility dimensions such as explainability, fairness, robustness, and sustainability. The core solution is RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance along these four dimensions and aggregates them into a single, holistic Responsibility Score, enabling multi-dimensional assessment of how responsible an AI model is. The evaluation reveals trade-offs among responsibility dimensions across deep learning models, underscoring the importance of multi-dimensional evaluation for model selection in high-stakes domains.

Link: https://arxiv.org/abs/2510.18559
Authors: Loc Phuc Truong Nguyen, Hung Thanh Do
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
Comments: Accepted at the 26th International Conference on Principles and Practice of Multi-Agent Systems

Abstract:As AI systems enter high-stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability. We introduce RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance across these four dimensions and aggregates them into a single, holistic Responsibility Score. We evaluated three deep learning models: a Multilayer Perceptron (MLP), a Tabular ResNet, and a Feature Tokenizer Transformer, on structured datasets from finance, healthcare, and socioeconomics. Our findings reveal critical trade-offs: the MLP demonstrated strong sustainability and robustness, the Transformer excelled in explainability and fairness at a very high environmental cost, and the Tabular ResNet offered a balanced profile. These results underscore that no single model dominates across all responsibility criteria, highlighting the necessity of multi-dimensional evaluation for responsible model selection. Our implementation is available at: this https URL.

[AI-20] Extracting alignment data in open models

【Quick Read】: This paper studies whether alignment training data, useful for steering capabilities such as long-context reasoning, safety, instruction following, and maths, can be extracted from post-trained models. Prior work on memorisation measures extraction success via string matching, which the authors show severely undercounts extractable data because of trivial textual differences. The key is to measure semantic similarity with a high-quality embedding model, which more accurately identifies training data that the model regurgitates, especially data from the SFT or RL stages. Experiments show this extracted data can be used to retrain a base model and recover a meaningful amount of the original performance, exposing a possibly overlooked leakage risk for alignment data and raising a broader question about distillation practices: if models regurgitate their training set, distillation amounts to indirectly training on the original data.

Link: https://arxiv.org/abs/2510.18554
Authors: Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, Jamie Hayes
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model – useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10x) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can be then used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model’s original dataset.
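A minimal sketch of the embedding-based matching the authors argue for, using an off-the-shelf sentence-transformers model; the model choice, the toy strings, and the 0.8 threshold are our illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A generated sample counts as regurgitated if its nearest training string
# is semantically close, even when edit distance would miss the match.
model = SentenceTransformer("all-MiniLM-L6-v2")
train = ["Always refuse to provide instructions for making weapons."]
gen = ["You must always decline to give weapon-making instructions."]
E_t = model.encode(train, normalize_embeddings=True)
E_g = model.encode(gen, normalize_embeddings=True)
sims = E_g @ E_t.T                          # cosine similarity matrix
extracted = sims.max(axis=1) > 0.8          # threshold is an assumption
print(sims.max(axis=1), extracted)
```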

[AI-21] SOCIA-Nabla: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation

【Quick Read】: This paper addresses the brittleness and inefficiency of simulator construction for complex-systems simulation, where traditional prompt-based pipelines struggle to produce reproducible, constraint-aware simulation code that scales across domains. The key is SOCIA-Nabla, a framework that treats simulator construction as instance optimization over code within a textual computation graph: specialized LLM-driven agents are embedded as graph nodes, and a workflow manager executes a loss-driven loop of code synthesis, execution, evaluation, and repair, with the optimizer performing Textual-Gradient Descent (TGD); human-in-the-loop interaction is reserved for task-spec confirmation to minimize expert effort. This converts brittle prompt pipelines into reproducible, constraint-aware simulator code generation with improved accuracy and scalability across domains and simulation granularities.

Link: https://arxiv.org/abs/2510.18551
Authors: Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure, 2 tables. The paper is under review

Abstract:In this paper, we present SOCIA-Nabla, an end-to-end, agentic framework that treats simulator construction as instance optimization over code within a textual computation graph. Specialized LLM-driven agents are embedded as graph nodes, and a workflow manager executes a loss-driven loop: code synthesis - execution - evaluation - code repair. The optimizer performs Textual-Gradient Descent (TGD), while human-in-the-loop interaction is reserved for task-spec confirmation, minimizing expert effort and keeping the code itself as the trainable object. Across three CPS tasks, i.e., User Modeling, Mask Adoption, and Personal Mobility, SOCIA-Nabla attains state-of-the-art overall accuracy. By unifying multi-agent orchestration with a loss-aligned optimization view, SOCIA-Nabla converts brittle prompt pipelines into reproducible, constraint-aware simulator code generation that scales across domains and simulation granularities. This work is under review, and we will release the code soon.

[AI-22] EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval NEURIPS2025

【Quick Read】: This paper addresses two obstacles to deploying LLM-based object-goal navigation (ObjNav) on local devices: small LLMs suffer large success-rate drops because they struggle to understand complex navigation maps, and the long prompts describing those maps cause high planning latency on-device. The key is the EfficientNav framework, whose core innovations are (1) semantics-aware memory retrieval, which prunes redundant information from the navigation map so that smaller models understand the environment better, and (2) discrete memory caching with attention-based memory clustering, which efficiently saves and reuses the KV cache to reduce planning latency. Experiments show an 11.1% success-rate improvement over GPT-4-based baselines on the HM3D benchmark, with 6.7x real-time latency and 4.7x end-to-end latency reductions over the GPT-4 planner.

Link: https://arxiv.org/abs/2510.18546
Authors: Zebin Yang, Sunjian Zheng, Tong Xie, Tianshi Xu, Bo Yu, Fan Wang, Jie Tang, Shaoshan Liu, Meng Li
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025

Abstract:Object-goal navigation (ObjNav) tasks an agent with navigating to the location of a specific object in an unseen environment. Embodied agents equipped with large language models (LLMs) and online constructed navigation maps can perform ObjNav in a zero-shot manner. However, existing agents heavily rely on giant LLMs on the cloud, e.g., GPT-4, while directly switching to small LLMs, e.g., LLaMA3.2-11b, suffer from significant success rate drops due to limited model capacity for understanding complex navigation maps, which prevents deploying ObjNav on local devices. At the same time, the long prompt introduced by the navigation map description will cause high planning latency on local devices. In this paper, we propose EfficientNav to enable on-device efficient LLM-based zero-shot ObjNav. To help the smaller LLMs better understand the environment, we propose semantics-aware memory retrieval to prune redundant information in navigation maps. To reduce planning latency, we propose discrete memory caching and attention-based memory clustering to efficiently save and re-use the KV cache. Extensive experimental results demonstrate that EfficientNav achieves 11.1% improvement in success rate on HM3D benchmark over GPT-4-based baselines, and demonstrates 6.7x real-time latency reduction and 4.7x end-to-end latency reduction over GPT-4 planner. Our code will be released soon.

[AI-23] Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

【Quick Read】: This paper addresses the security risk of transferable backdoors in knowledge distillation from potentially malicious (backdoored) teacher models. Existing backdoor triggers typically consist of rare tokens, which makes them hard to transfer to student models and thus leads prior work to underestimate the threat of distillation. The key solution, T-MTB, constructs a composite trigger made up of multiple tokens that each occur frequently in the anticipated distillation dataset: the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer to the student, enabling attack behaviors such as jailbreaking and content modulation.

Link: https://arxiv.org/abs/2510.18541
Authors: Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.

[AI-24] Physics-guided Emulators Reveal Resilience and Fragility under Operational Latencies and Outages

【Quick Read】: This paper addresses the insufficient stability of current hydrologic and flood forecasting models when faced with delayed, missing, or inconsistent data in operation: existing rainfall-runoff models are mostly evaluated under ideal data conditions, emphasizing accuracy over operational resilience. The key is an operationally ready emulator of the Global Flood Awareness System (GloFAS) that couples long short-term memory (LSTM) networks with a relaxed water-balance constraint to preserve physical coherence; five architectures spanning different levels of information availability allow systematic evaluation of robustness as data quality declines, thereby establishing operational robustness as a measurable property of hydrological machine learning and advancing the design of reliable real-time forecasting systems.

Link: https://arxiv.org/abs/2510.18535
Authors: Sarth Dubey, Subimal Ghosh, Udit Bhatia
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 45 pages, 5 main figures, 10 supplementary figures, 5 supplementary tables

Abstract:Reliable hydrologic and flood forecasting requires models that remain stable when input data are delayed, missing, or inconsistent. However, most advances in rainfall-runoff prediction have been evaluated under ideal data conditions, emphasizing accuracy rather than operational resilience. Here, we develop an operationally ready emulator of the Global Flood Awareness System (GloFAS) that couples long- and short-term memory networks with a relaxed water-balance constraint to preserve physical coherence. Five architectures span a continuum of information availability: from complete historical and forecast forcings to scenarios with data latency and outages, allowing systematic evaluation of robustness. Trained in minimally managed catchments across the United States and tested in more than 5,000 basins, including heavily regulated rivers in India, the emulator reproduces the hydrological core of GloFAS and degrades smoothly as information quality declines. Transfer across contrasting hydroclimatic and management regimes yields reduced yet physically consistent performance, defining the limits of generalization under data scarcity and human influence. The framework establishes operational robustness as a measurable property of hydrological machine learning and advances the design of reliable real-time forecasting systems.
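One way to read "relaxed water-balance constraint" is as a soft penalty added to the forecasting loss; the PyTorch sketch below is our interpretation with illustrative variable names, not the emulator's actual formulation (the closure terms and granularity in the real system may differ).

```python
import torch

def wb_penalized_loss(q_pred, q_obs, precip, evap, storage_delta, w=0.1):
    """Prediction loss plus a relaxed (soft) water-balance penalty.

    All tensors are (batch, time). The balance residual encodes that, per
    step, precipitation should account for evaporation, runoff, and the
    change in storage; the weight w relaxes the constraint instead of
    enforcing it exactly.
    """
    mse = torch.mean((q_pred - q_obs) ** 2)
    residual = precip - evap - q_pred - storage_delta
    balance = torch.mean(residual ** 2)
    return mse + w * balance

B, T = 4, 30
t = lambda: torch.rand(B, T)  # synthetic stand-ins for real forcings
print(wb_penalized_loss(t(), t(), t(), t(), t()))
```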

[AI-25] Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models

【Quick Read】: This paper addresses the challenge of aligning large language models (LLMs) with pluralistic human values across cultures and communities: existing methods struggle with the complex interdependencies and relative priorities among values (value complexity) and with precisely steering nuanced or underrepresented value priorities (value steerability). The key is COUPLE, a counterfactual-reasoning framework built on a structural causal model (SCM) that explicitly features the causal relationship between high-level value dimensions and behaviors, as well as the complex interactions and prioritization among value features, enabling outputs aligned with any desired value objectives and improving the interpretability of the alignment process.

Link: https://arxiv.org/abs/2510.18526
Authors: Hanze Guo, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 41 pages, 7 figures

Abstract:As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH). In psychological and social value theories such as Schwartz’s Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially those underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model (SCM) to feature complex interdependency and prioritization among features, as well as the causal relationship between high-level value dimensions and behaviors. Moreover, it applies counterfactual reasoning to generate outputs aligned with any desired value objectives. Benefitting from explicit causal modeling, COUPLE also provides better interpretability. We evaluate COUPLE on two datasets with different value systems and demonstrate that COUPLE advances other baselines across diverse types of value objectives.

[AI-26] One Size Fits All? A Modular Adaptive Sanitization Kit (MASK) for Customizable Privacy-Preserving Phone Scam Detection

【Quick Read】: This paper addresses the privacy leakage risks of using large language models (LLMs) for phone scam detection: analyzing call transcripts to detect fraud often exposes sensitive personal information, especially when the data is processed by third-party service providers. The key is MASK (Modular Adaptive Sanitization Kit), a trainable and extensible modular framework that supports dynamic privacy adjustment based on individual preferences: its pluggable architecture accommodates diverse sanitization methods, from traditional keyword-based techniques for high-privacy users to sophisticated neural approaches for those prioritizing accuracy, enabling a flexible trade-off between user privacy and detection effectiveness and laying the groundwork for personalized, privacy-aware LLM-based detection systems.

Link: https://arxiv.org/abs/2510.18493
Authors: Kangzhong Wang, Zitong Shen, Youqian Zhang, Michael MK Cheung, Xiapu Luo, Grace Ngai, Eugene Yujun Fu
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9 pages

Abstract:Phone scams remain a pervasive threat to both personal safety and financial security worldwide. Recent advances in large language models (LLMs) have demonstrated strong potential in detecting fraudulent behavior by analyzing transcribed phone conversations. However, these capabilities introduce notable privacy risks, as such conversations frequently contain sensitive personal information that may be exposed to third-party service providers during processing. In this work, we explore how to harness LLMs for phone scam detection while preserving user privacy. We propose MASK (Modular Adaptive Sanitization Kit), a trainable and extensible framework that enables dynamic privacy adjustment based on individual preferences. MASK provides a pluggable architecture that accommodates diverse sanitization methods - from traditional keyword-based techniques for high-privacy users to sophisticated neural approaches for those prioritizing accuracy. We also discuss potential modeling approaches and loss function designs for future development, enabling the creation of truly personalized, privacy-aware LLM-based detection systems that balance user trust and detection effectiveness, even beyond phone scam context.
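The pluggable design can be sketched directly: a sanitizer is any text-to-text callable, and the simplest plug-in is keyword/regex redaction for high-privacy users. Names and patterns below are illustrative, not MASK's actual interface.

```python
import re
from typing import Protocol

class Sanitizer(Protocol):
    def __call__(self, text: str) -> str: ...

def keyword_sanitizer(patterns: dict[str, str]) -> Sanitizer:
    """Simplest plug-in: replace each regex match with a placeholder tag."""
    def run(text: str) -> str:
        for tag, pat in patterns.items():
            text = re.sub(pat, f"[{tag}]", text)
        return text
    return run

def sanitize_then_detect(text: str, sanitizer: Sanitizer, detect) -> bool:
    """MASK-style pipeline: sanitize locally, then call the remote detector.
    `detect` stands in for the LLM-based scam classifier."""
    return detect(sanitizer(text))

s = keyword_sanitizer({"PHONE": r"\b\d{3}-\d{4}\b", "NAME": r"\bMr\.\s\w+\b"})
print(s("Mr. Chan, call 555-0199 back at once."))
# -> "[NAME], call [PHONE] back at once."
```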
zh

[AI-27] Crucible: Quantifying the Potential of Control Algorithms through LLM Agents NEURIPS2025

【速读】:该论文旨在解决控制算法在实际生产环境中因缺乏有效调优机制而导致性能受限的问题。现有研究多聚焦于算法在理想或默认配置下的表现,忽视了其可调优潜力(Tuning Potential)这一关键维度。解决方案的关键在于提出 Crucible,一个基于大语言模型(LLM)驱动的多层级专家仿真代理,通过形式化定义量化指标来评估不同算法的调优空间,并在经典控制任务与复杂计算机系统等多个场景中验证其有效性,从而为算法设计与分析提供新的维度并实现性能提升。

链接: https://arxiv.org/abs/2510.18491
作者: Lianchen Jia,Chaoyang Li,Qian Houde,Tianchi Huang,Jiangchuan Liu,Lifeng Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:Control algorithms in production environments typically require domain experts to tune their parameters and logic for specific scenarios. However, existing research predominantly focuses on algorithmic performance under ideal or default configurations, overlooking the critical aspect of Tuning Potential. To bridge this gap, we introduce Crucible, an agent that employs an LLM-driven, multi-level expert simulation to tune algorithms and defines a formalized metric to quantitatively evaluate their Tuning Potential. We demonstrate Crucible’s effectiveness across a wide spectrum of case studies, from classic control tasks to complex computer systems, and validate its findings in a real-world deployment. Our experimental results reveal that Crucible systematically quantifies the tunable space across different algorithms. Furthermore, Crucible provides a new dimension for algorithm analysis and design, which ultimately leads to performance improvements. Our code is available at this https URL.
zh

[AI-28] AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification

【速读】:该论文旨在解决当前基于GUI的本地虚拟助手(on-device GUI agents)在实际部署中性能评估不准确的问题,尤其是现有基准测试(如AndroidControl)因存在歧义和事实性错误而系统性地低估了模型能力,导致研究者误判其可行性。解决方案的关键在于对原基准进行严格净化,构建出更可靠的新版本——AndroidControl-Curated,并结合一个轻量级但高效的新型模型Magma-R1-3B(通过仅2.4k高质量样本微调,使用60小时H20 GPU训练),在新基准上实现接近75%的成功率(较原基准提升15%),证明了本地GUI代理的实际潜力远高于此前认知,从而推动更具实用性的虚拟助手发展。

链接: https://arxiv.org/abs/2510.18488
作者: Ho Fai Leung,Xiaoyan Xi,Fei Zuo
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by the perception of poor performance, as even the best models (e.g., Qwen3-VL-235B) are capped at around 60% on benchmarks like AndroidControl, far from viability for real-world use. Our research reveals that the issue lies not only with the models but with the benchmarks themselves. We identified notable shortcomings in AndroidControl, including ambiguities and factual errors, which systematically underrate agent capabilities. To address this critical oversight, we enhanced AndroidControl into AndroidControl-Curated, a refined version of the benchmark improved through a rigorous purification pipeline. On this enhanced benchmark, state-of-the-art models achieve success rates nearing 75% on complex tasks (15% improvement), reflecting that on-device GUI agents are actually closer to practical deployment than previously thought. We introduce our new SOTA model, Magma-R1-3B, post-trained on just 2.4k curated samples using 60 hours of an H20 GPU (approximately $60). Despite being 200 times smaller in parameters, this model delivers performance comparable to Qwen3-VL-235B. We release both the AndroidControl-Curated benchmark and the Magma-R1 model to the research community, encouraging adoption of this enhanced benchmark to better reflect model capabilities and accelerate the development of robust, on-device virtual assistants.
zh

[AI-29] StarBench: A Turn-Based RPG Benchmark for Agent ic Multimodal Decision-Making and Information Seeking

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在真实客户端环境中实现类人游戏行为的挑战,具体包括两个核心问题:一是从原始屏幕截图到低层次动作(如点击和按键)的感知-控制映射 fidelity 问题;二是代理是否能像人类玩家一样,在必要时主动寻求信息(agentic information seeking),从而提升决策质量。解决方案的关键在于提出 StarBench——一个基于《崩坏:星穹铁道》的回合制角色扮演游戏(RPG)基准测试平台,其创新性体现在:(1)标准化评估框架,涵盖直接控制(仅输入图像,输出原始动作)与工具辅助控制(引入OCR和检测器提供语义提示)两种模式;(2)引入“询问或行动”诊断机制,量化代理何时选择请求指导及其对后续任务成功率的影响,从而为评估代理的信息获取策略提供可复现的指标。实验结果表明,现有VLM在直接控制下的感知-控制一致性存在显著差距,而合理的信息寻求行为则与更高成功率正相关,验证了StarBench作为衡量真实客户端中多模态决策与自主信息获取能力基准的有效性。

链接: https://arxiv.org/abs/2510.18483
作者: Haoran Zhang,Chenhao Zhu,Sicong Guo,Hanzhe Guo,Haiming Li,Donglin Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human players do more than press buttons: they ground what they see on screen into precise keyboard-mouse actions and, when stuck, they seek information before trying again. We ask whether current vision-language models (VLMs) can do the same. Despite encouraging results under simplified control or tool scaffolds, human-like play in a real client - mapping raw screenshots to temporally coherent low-level actions while deciding when to ask for guidance - remains an open challenge. We introduce StarBench, a turn-based RPG benchmark derived from Honkai: Star Rail that targets these two human-like competencies: multimodal decision-making from pixels to actions and agentic information seeking. StarBench standardizes evaluation across eight combat tasks and two regimes with shared tasks and metrics: (i) direct control, where agents receive only screenshots and must emit low-level primitives (click and keypress) with no semantic hints; and (ii) tool-assisted control, where higher-level intents can be mapped to primitives by detectors and OCR outputs provide optional textualized observations to ease UI grounding. To mirror human practice, StarBench also includes an ask-or-act diagnostic that measures whether and when agents choose to request brief guidance before proceeding, and how that choice affects subsequent performance. We report reference baselines for contemporary VLMs and a human reference. Results expose sizable gaps in perception-to-control fidelity in the direct regime, while showing that judicious information seeking correlates with improved success, establishing StarBench as a reproducible yardstick for agentic information seeking and multimodal decision-making in real-client play.
zh

[AI-30] LAFA: Agent ic LLM -Driven Federated Analytics over Decentralized Data Sources

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)驱动的数据分析系统与联邦分析(Federated Analytics, FA)之间存在的不兼容问题:现有LLM代理框架假设数据集中存储,缺乏隐私保护;而FA虽能实现分布式数据的隐私计算,却无法支持自然语言输入且依赖结构化查询。解决方案的关键在于提出LAFA系统,其核心是采用分层多智能体架构——粗粒度规划器将复杂自然语言查询分解为子查询,细粒度规划器基于先验结构知识将其映射为联邦分析操作的有向无环图(Directed Acyclic Graph, DAG),并通过优化代理合并和重写多个DAG以消除冗余操作,从而显著降低计算与通信开销。实验表明,LAFA在执行成功率和资源效率方面均优于基线提示策略,首次实现了支持自然语言输入的隐私保护型联邦数据分析。
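
下面用一段极简 Python 勾勒“优化代理合并多个 FA 算子 DAG、按签名去重”的思路。这是基于摘要理解的示意性草稿,并非论文官方实现;其中 `FAOp` 的字段设计与算子命名均为假设:

```python
# 假设性示意:合并多个联邦分析(FA)DAG,签名相同的算子只保留一份
from dataclasses import dataclass

@dataclass(frozen=True)
class FAOp:
    name: str              # 算子类型,如 "secure_sum"
    inputs: tuple          # 上游算子(FAOp 元组)
    params: tuple = ()     # 归一化后的参数

def merge_dags(dags):
    """把多个 DAG(按拓扑序给出的算子列表)合并为一个去重后的算子集合。"""
    seen, merged = {}, []
    def canon(op):
        sig = (op.name, tuple(canon(i) for i in op.inputs), op.params)
        if sig not in seen:
            seen[sig] = op
            merged.append(op)
        return seen[sig]
    for dag in dags:
        for op in dag:
            canon(op)
    return merged

# 两个子查询都对同一列做 secure_sum:合并后该算子只执行一次
load = FAOp("load", (), ("hospital_visits",))
s1 = FAOp("secure_sum", (load,), ("age",))
s2 = FAOp("secure_sum", (load,), ("age",))      # 与 s1 重复
cnt = FAOp("secure_count", (load,))
mean = FAOp("divide", (s1, cnt))
print(len(merge_dags([[load, s1], [load, s2, cnt, mean]])))  # 4:6 个列表项去重为 4 个算子
```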

链接: https://arxiv.org/abs/2510.18477
作者: Haichao Ji,Zibo Wang,Yifei Zhu,Meng han,Dan Wang,Zhu Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting.
zh

[AI-31] Benchmarking Fairness-aware Graph Neural Networks in Knowledge Graphs

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在知识图谱(Knowledge Graphs)上产生的预测偏倚问题,尤其是在推荐系统等重要应用场景中缺乏公平性评估与改进方法的现状。其解决方案的关键在于构建首个针对知识图谱的公平性基准测试体系,通过从YAGO、DBpedia和Wikidata生成更大规模的新图数据集,系统评估了预处理(preprocessing)与内处理(inprocessing)两类公平性增强方法在不同GNN骨干网络及早停策略下的表现。研究发现:知识图谱展现出与其他图数据不同的公平性-准确性权衡趋势,且模型性能不仅受公平性方法影响,还显著依赖于GNN架构和训练策略;其中预处理方法更有利于提升公平性指标,而内处理方法则更擅长保持预测准确性。

链接: https://arxiv.org/abs/2510.18473
作者: Yuya Sasaki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are powerful tools for learning from graph-structured data but often produce biased predictions with respect to sensitive attributes. Fairness-aware GNNs have been actively studied for mitigating biased predictions. However, no prior studies have evaluated fairness-aware GNNs on knowledge graphs, which are one of the most important graphs in many applications, such as recommender systems. Therefore, we introduce a benchmarking study on knowledge graphs. We generate new graphs from three knowledge graphs, YAGO, DBpedia, and Wikidata, that are significantly larger than the existing graph datasets used in fairness studies. We benchmark inprocessing and preprocessing methods in different GNN backbones and early stopping conditions. We find several key insights: (i) knowledge graphs show different trends from existing datasets; clearer trade-offs between prediction accuracy and fairness metrics than other graphs in fairness-aware GNNs, (ii) the performance is largely affected by not only fairness-aware GNN methods but also GNN backbones and early stopping conditions, and (iii) preprocessing methods often improve fairness metrics, while inprocessing methods improve prediction accuracy.
zh

[AI-32] CircuitSeer: Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLM s

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在提升推理能力时依赖昂贵且庞大的推理数据集的问题。现有数据选择方法通常依赖于外部模型或不透明的启发式策略,难以高效筛选出高价值训练样本。其解决方案的关键在于从模型内部机制出发,发现复杂推理任务会激活一组稀疏且专用的注意力头(attention heads),构成核心推理电路(core reasoning circuits)。基于此洞察,作者提出CircuitSeer方法,通过量化数据对这些关键电路的影响程度来衡量其推理复杂性,从而实现高效、精准的数据选择。实验表明,仅使用10%经CircuitSeer筛选的数据即可在多个模型和数据集上显著优于全量数据训练效果。
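
论文摘要未公开打分细节,这里给出一个假设性的示意:用一组离线确定的“核心推理头”的注意力集中程度(低熵)作为电路激活强度的代理来给样本打分。接口按 HuggingFace 风格假设,`circuit_heads` 等均为示例:

```python
# 假设性示意:按"推理电路头"的注意力集中度给训练样本打分(非官方实现)
import torch

@torch.no_grad()
def circuit_score(model, tokenizer, text, circuit_heads):
    """circuit_heads: [(layer, head), ...],假设已通过离线探测实验确定。"""
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, output_attentions=True)  # HuggingFace 风格接口
    score = 0.0
    for layer, head in circuit_heads:
        attn = out.attentions[layer][0, head]      # (seq, seq) 注意力矩阵
        # 用低熵(更集中)的注意力作为"电路被强激活"的代理指标
        ent = -(attn * (attn + 1e-9).log()).sum(-1).mean()
        score += -ent.item()
    return score / len(circuit_heads)

# 用法:对候选数据逐条打分,仅保留得分最高的 10% 用于微调
# scores = [circuit_score(model, tok, x, heads) for x in pool]
```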

链接: https://arxiv.org/abs/2510.18470
作者: Shaobo Wang,Yongliang Miao,Yuancheng Liu,Qianli Ma,Ning Liao,Linfeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive reasoning capabilities, but scaling their performance often relies on massive reasoning datasets that are computationally expensive to train on. Existing data selection methods aim to curate smaller, high-quality subsets but often rely on costly external models or opaque heuristics. In this work, we shift the focus from external heuristics to the model’s internal mechanisms. We find that complex reasoning tasks consistently activate a sparse, specialized subset of attention heads, forming core reasoning circuits. Building on this insight, we propose CircuitSeer, a novel data selection method that quantifies the reasoning complexity of data by measuring its influence on these crucial circuits. Extensive experiments on 4 models and 9 datasets demonstrate CircuitSeer’s superiority. Notably, fine-tuning Qwen2.5-Math-7B on just 10% of data selected by our method achieves a 1.4-point gain in average Pass@1 over training on the full dataset, highlighting its efficiency and effectiveness.
zh

[AI-33] Simple and Efficient Heterogeneous Temporal Graph Neural Network NEURIPS2025

【速读】:该论文旨在解决现有异质时间图(Heterogeneous Temporal Graphs, HTGs)表示学习方法中因采用解耦的时间与空间学习范式而导致的时空信息交互弱化及模型复杂度高的问题。其解决方案的关键在于提出一种新颖的“简单且高效”的异质时间图神经网络(Simple and Efficient Heterogeneous Temporal Graph Neural Network, SE-HTGNN),通过创新性地将时间建模融入空间学习,利用一种动态注意力机制保留历史图快照中的注意力信息以指导后续注意力计算,从而提升整体判别性表示学习能力;同时引入大语言模型(Large Language Models, LLMs)作为提示机制,使模型能够自适应地捕捉节点类型的隐含属性作为先验知识,显著降低计算开销并保持最优预测精度。
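
其中“保留历史快照注意力以指导后续注意力计算”的机制,可以用如下凸组合形式做一个最小示意(具体融合方式以论文为准,`beta` 等参数为假设):

```python
# 假设性示意:用上一快照的注意力 logits 引导当前快照的注意力(非官方实现)
import torch
import torch.nn.functional as F

def guided_attention(q, k, prev_scores, beta=0.5):
    """q, k: (N, d) 当前快照的查询/键;prev_scores: (N, N) 历史注意力 logits。
    将时间信息融入空间注意力:当前 logits 与历史 logits 做凸组合。"""
    d = q.size(-1)
    scores = q @ k.t() / d ** 0.5                 # 当前快照的注意力 logits
    if prev_scores is not None:
        scores = (1 - beta) * scores + beta * prev_scores
    return F.softmax(scores, dim=-1), scores      # 返回权重与 logits(供下一快照使用)

# 跨快照使用:attn_t, logits_t = guided_attention(q_t, k_t, logits_prev)
```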

链接: https://arxiv.org/abs/2510.18467
作者: Yili Wang,Tairan Huang,Changlong He,Qiutong Li,Jianliang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Heterogeneous temporal graphs (HTGs) are ubiquitous data structures in the real world. Recently, to enhance representation learning on HTGs, numerous attention-based neural networks have been proposed. Despite these successes, existing methods rely on a decoupled temporal and spatial learning paradigm, which weakens interactions of spatio-temporal information and leads to a high model complexity. To bridge this gap, we propose a novel learning paradigm for HTGs called Simple and Efficient Heterogeneous Temporal Graph Neural Network (SE-HTGNN). Specifically, we innovatively integrate temporal modeling into spatial learning via a novel dynamic attention mechanism, which retains attention information from historical graph snapshots to guide subsequent attention computation, thereby improving the overall discriminative representations learning of HTGs. Additionally, to comprehensively and adaptively understand HTGs, we leverage large language models to prompt SE-HTGNN, enabling the model to capture the implicit properties of node types as prior knowledge. Extensive experiments demonstrate that SE-HTGNN achieves up to 10x speed-up over the state-of-the-art and latest baseline while maintaining the best forecasting accuracy.
zh

[AI-34] DeLoad: Demand-Driven Short-Video Preloading with Scalable Watch-Time Estimation

【速读】:该论文旨在解决短视频流媒体中预加载策略的优化问题,核心挑战在于如何在动态网络条件和用户行为下,平衡用户体验质量(Quality of Experience, QoE)与带宽效率。现有方法存在两大局限:一是下载任务大小无法适应动态环境变化,二是观看时长预测模型难以在大规模商用场景中稳定部署。解决方案的关键在于提出DeLoad框架,其创新点包括:(1) 引入动态任务尺寸调整机制以灵活匹配实时网络状态;(2) 设计一种可扩展的多维观看时长估计方法,提升预测可靠性;(3) 基于深度强化学习(Deep Reinforcement Learning, DRL)训练智能代理,实现下载范围决策的自适应优化。实验证明,该方案显著提升了QoE指标(改善幅度达34.4%–87.4%),并在实际商业平台部署后实现了用户总观看时长增加0.09%,同时降低重缓冲事件和3.76%的带宽消耗。

链接: https://arxiv.org/abs/2510.18459
作者: Tong Liu,Zhiwei Fan,Guanyan Peng,Haodan Zhang,Yucheng Zhang,Zhen Wang,Pengjin Xie,Liang Liu
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real-world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi-dimensional watch time estimation method. Additionally, a Deep Reinforcement Learning (DRL) enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real-world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4% to 87.4% gain). Furthermore, after deployment on a large-scale commercial short video platform, DeLoad has increased overall user watch time by 0.09% while simultaneously reducing rebuffering events and cutting bandwidth consumption by 3.76%.
zh

[AI-35] PlanU: Large Language Model Decision Making through Planning under Uncertainty NEURIPS2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在不确定性环境下的决策问题,特别是由LLM自身随机采样带来的不确定性(LLM uncertainty)以及环境中状态转移的随机性(environmental uncertainty)所导致的性能下降。现有基于LLM的决策方法多通过多条推理路径或搜索树缓解LLM不确定性,但忽略了环境不确定性,且难以适用于需要与环境交互的多步决策任务。其解决方案的关键在于提出PlanU,一种融合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的LLM规划方法:通过将MCTS中每个节点的回报建模为分位数分布(quantile distribution),以量化回报的不确定性;并引入带有好奇心的上限置信区间(Upper Confidence Bounds with Curiosity, UCC)评分机制,在探索与利用之间取得平衡,从而有效应对双重不确定性,提升LLM在复杂动态环境中的决策能力。
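
UCC 分数的具体形式摘要中未给出,下面是一个假设性的组合方式:以分位数均值作为回报估计、分位数离散度作为好奇心项,再叠加标准 UCB 探索项:

```python
# 假设性示意:分位数分布表示回报 + UCC 选择子节点(非官方实现)
import numpy as np

def ucc_score(quantiles, visits, parent_visits, c_explore=1.0, c_curiosity=0.5):
    """quantiles: 节点回报的分位数估计数组;分位数间的离散度作为不确定性代理。"""
    mean_return = float(np.mean(quantiles))
    uncertainty = float(np.std(quantiles))            # 分布越宽,好奇心奖励越大
    ucb = c_explore * np.sqrt(np.log(parent_visits + 1) / (visits + 1))
    return mean_return + ucb + c_curiosity * uncertainty

# 选择阶段:child = max(children, key=lambda n: ucc_score(n.q, n.visits, node.visits))
```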

链接: https://arxiv.org/abs/2510.18442
作者: Ziwei Deng,Mian Deng,Chenjing Liang,Zeming Gao,Chennan Ma,Chenxing Lin,Haipeng Zhang,Songzhu Mei,Cheng Wang,Siqi Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 38 pages, 19 figures, NeurIPS 2025 Accepted

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being explored across a range of decision-making tasks. However, LLMs sometimes struggle with decision-making tasks under uncertainty that are relatively easy for humans, such as planning actions in stochastic environments. The adoption of LLMs for decision-making is impeded by uncertainty challenges, such as LLM uncertainty and environmental uncertainty. LLM uncertainty arises from the stochastic sampling process inherent to LLMs. Most LLM-based Decision-Making (LDM) approaches address LLM uncertainty through multiple reasoning chains or search trees. However, these approaches overlook environmental uncertainty, which leads to poor performance in environments with stochastic state transitions. Some recent LDM approaches deal with uncertainty by forecasting the probability of unknown variables. However, they are not designed for multi-step decision-making tasks that require interaction with the environment. To address uncertainty in LLM decision-making, we introduce PlanU, an LLM-based planning method that captures uncertainty within Monte Carlo Tree Search (MCTS). PlanU models the return of each node in the MCTS as a quantile distribution, which uses a set of quantiles to represent the return distribution. To balance exploration and exploitation during tree search, PlanU introduces an Upper Confidence Bounds with Curiosity (UCC) score which estimates the uncertainty of MCTS nodes. Through extensive experiments, we demonstrate the effectiveness of PlanU in LLM-based decision-making tasks under uncertainty.
zh

[AI-36] Optimistic Higher-Order Superposition

【速读】:该论文旨在解决λ-超位置演算(λ-superposition calculus)中两个导致计算爆炸的问题:高阶合一枚举(higher-order unifier enumeration)和函数外延公理(functional extensionality axiom)的应用。其解决方案的关键在于引入一种“乐观”版本的λ-超位置演算,通过在子句中存储约束来延迟爆炸性的合一问题,并以更精准的方式应用函数外延公理。该新演算在Henkin语义下保持了正确性和归结完备性,理论分析与示例表明其有望在性能上优于或至少有效补充原版λ-超位置演算。

链接: https://arxiv.org/abs/2510.18429
作者: Alexander Bentkamp,Jasmin Blanchette,Matthias Hetzenberger,Uwe Waldmann
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The \lambda -superposition calculus is a successful approach to proving higher-order formulas. However, some parts of the calculus are extremely explosive, notably due to the higher-order unifier enumeration and the functional extensionality axiom. In the present work, we introduce an “optimistic” version of \lambda -superposition that addresses these two issues. Specifically, our new calculus delays explosive unification problems using constraints stored along with the clauses, and it applies functional extensionality in a more targeted way. The calculus is sound and refutationally complete with respect to a Henkin semantics. We have yet to implement it in a prover, but examples suggest that it will outperform, or at least usefully complement, the original \lambda -superposition calculus.
zh

[AI-37] AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library

【速读】:该论文旨在解决优化建模(Optimization Modeling)自动化难题,即如何将非结构化的自然语言描述准确映射为精确的数学公式和可执行的求解器代码,而传统大语言模型(LLM)方法要么依赖脆弱的提示工程,要么需要昂贵的再训练且泛化能力有限。其解决方案的关键在于提出AlphaOPT——一个自改进的经验库机制,通过持续的两阶段循环实现高效学习与知识演化:第一阶段(Library Learning)从失败案例中提取经求解器验证的结构化洞察(包括分类、条件、解释和示例),第二阶段(Library Evolution)诊断检索错配并优化存储知识的适用条件,从而无需标注推理轨迹或更新模型参数即可实现知识的显式表达、可解释性及跨任务迁移能力。
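
摘要提到洞察以“taxonomy、condition、explanation、example”四元组存储并按适用条件检索,可据此写出如下最小数据结构示意(检索用的 `match` 函数与示例内容均为假设):

```python
# 假设性示意:经验库中"洞察"条目的结构与条件化检索(非官方实现)
from dataclasses import dataclass

@dataclass
class Insight:
    taxonomy: str     # 如 "约束建模/库存守恒"
    condition: str    # 适用条件,如 "问题含多周期库存转移"
    explanation: str  # 求解器验证过的建模经验
    example: str      # 最小示例片段

def retrieve(library, task_desc, match):
    """match(condition, task_desc) 可以是关键词匹配或嵌入相似度,这里留作注入点。"""
    return [ins for ins in library if match(ins.condition, task_desc)]

lib = [Insight("约束建模/库存守恒",
               "多周期库存",
               "库存平衡约束应写为 I[t] == I[t-1] + produce[t] - sell[t]",
               "model.addConstr(I[t] == I[t-1] + p[t] - s[t])")]
hits = retrieve(lib, "多周期库存生产计划", lambda c, q: c in q)  # 命中 1 条
```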

链接: https://arxiv.org/abs/2510.18428
作者: Minwei Kong,Ao Qu,Xiaotong Guo,Wenbin Ouyang,Chonghe Jiang,Han Zheng,Yining Ma,Dingyi Zhuang,Yuhan Tang,Junyi Li,Hai Wang,Cathy Wu,Jinhua Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimization modeling enables critical decisions across industries but remains difficult to automate: informal language must be mapped to precise mathematical formulations and executable solver code. Prior LLM approaches either rely on brittle prompting or costly retraining with limited generalization. We present AlphaOPT, a self-improving experience library that enables an LLM to learn from limited demonstrations (even answers alone, without gold-standard programs) and solver feedback - without annotated reasoning traces or parameter updates. AlphaOPT operates in a continual two-phase cycle: (i) a Library Learning phase that reflects on failed attempts, extracting solver-verified, structured insights as taxonomy, condition, explanation, example; and (ii) a Library Evolution phase that diagnoses retrieval misalignments and refines the applicability conditions of stored insights, improving transfer across tasks. This design (1) learns efficiently from limited demonstrations without curated rationales, (2) expands continually without costly retraining by updating the library rather than model weights, and (3) makes knowledge explicit and interpretable for human inspection and intervention. Experiments show that AlphaOPT steadily improves with more data (65% to 72% from 100 to 300 training items) and surpasses the strongest baseline by 7.7% on the out-of-distribution OptiBench dataset when trained only on answers. Code and data are available at: this https URL.
zh

[AI-38] Automated urban waterlogging assessment and early warning through a mixture of foundation models

【速读】:该论文旨在解决城市内涝监测中因依赖人工报告而导致的响应滞后与评估不全面的问题。其核心解决方案是提出一种基于基础模型(foundation model)的自动评估框架UWAssess,通过视觉感知识别水淹区域并生成结构化评估报告;关键创新在于设计了半监督微调策略和思维链(chain-of-thought, CoT)提示策略,以在标注数据稀缺场景下充分挖掘基础模型的潜力,从而显著提升感知性能与文本生成可靠性,实现从单一感知到多模态生成的范式转变。

链接: https://arxiv.org/abs/2510.18425
作者: Chenxu Zhang,Fuxiang Huang,Lei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to Nature

点击查看摘要

Abstract:With climate change intensifying, urban waterlogging poses an increasingly severe threat to global public safety and infrastructure. However, existing monitoring approaches rely heavily on manual reporting and fail to provide timely and comprehensive assessments. In this study, we present Urban Waterlogging Assessment (UWAssess), a foundation model-driven framework that automatically identifies waterlogged areas in surveillance images and generates structured assessment reports. To address the scarcity of labeled data, we design a semi-supervised fine-tuning strategy and a chain-of-thought (CoT) prompting strategy to unleash the potential of the foundation model for data-scarce downstream tasks. Evaluations on challenging visual benchmarks demonstrate substantial improvements in perception performance. GPT-based evaluations confirm the ability of UWAssess to generate reliable textual reports that accurately describe waterlogging extent, depth, risk and impact. This dual capability enables a shift of waterlogging monitoring from perception to generation, while the collaborative framework of multiple foundation models lays the groundwork for intelligent and scalable systems, supporting urban management, disaster response and climate resilience.
zh

[AI-39] Med-VRAg ent: A Framework for Medical Visual Reasoning -Enhanced Agents

【速读】:该论文旨在解决视觉语言模型(Visual Language Models, VLMs)在医学推理任务中面临的幻觉(hallucinations)、描述模糊、逻辑不一致以及定位能力差等问题。解决方案的关键在于提出一种名为Medical Visual Reasoning Agent(Med-VRAgent)的智能体框架,其核心机制融合了视觉引导(Visual Guidance)与自奖励(Self-Reward)策略,并引入蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来增强推理过程的可控性和准确性。通过收集Med-VRAgent生成的推理轨迹,进一步利用近端策略优化(Proximal Policy Optimization, PPO)对VLM进行微调,从而实现性能提升。

链接: https://arxiv.org/abs/2510.18424
作者: Guangfu Guo,Xiaoqian Lu,Yue Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual Language Models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, inconsistent logic and poor localization. To address this, we propose an agent framework named Medical Visual Reasoning Agent (Med-VRAgent). The approach is based on the Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-VRAgent as feedback to further improve the performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.
zh

[AI-40] On AI Verification in Open RAN

【速读】:该论文旨在解决Open RAN环境中深度强化学习(Deep Reinforcement Learning, DRL)代理在无线接入网(RAN)切片与调度任务中因模型黑箱特性导致的可靠性不足问题,尤其在多厂商异构部署下缺乏可信验证机制。解决方案的关键在于提出一种轻量级验证方法,利用可解释模型(如决策树,Decision Tree, DT)构建运行时一致性检查机制,实现近实时的DRL行为验证,从而替代计算复杂度高的现有验证方案,提升网络操作的可信赖性与可运维性。
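
下面的草稿展示“用决策树拟合 DRL 代理行为、运行时做一致性校验”的基本流程;数据为合成示例,树深度等超参数为假设,并非论文原型实现:

```python
# 假设性示意:决策树验证器对 DRL 切片决策做近实时一致性检查(非官方实现)
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# 离线阶段:用 DRL 代理的 (状态, 动作) 轨迹训练可解释的验证器
states = np.random.rand(5000, 8)            # 示例:8 维 KPI 状态
actions = (states[:, 0] > 0.5).astype(int)  # 示例:DRL 代理的历史动作
verifier = DecisionTreeClassifier(max_depth=5).fit(states, actions)

def check(state, drl_action, verifier, log):
    """运行时:DT 预测与 DRL 动作不一致时记录告警(开销远低于形式化验证)。"""
    expected = verifier.predict(state.reshape(1, -1))[0]
    if expected != drl_action:
        log.append(("inconsistent", state, drl_action, expected))
    return expected == drl_action
```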

链接: https://arxiv.org/abs/2510.18417
作者: Rahul Soundrarajan,Claudio Fiandrino,Michele Polese,Salvatore D’Oro,Leonardo Bonati,Tommaso Melodia
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open RAN introduces a flexible, cloud-based architecture for the Radio Access Network (RAN), enabling Artificial Intelligence (AI)/Machine Learning (ML)-driven automation across heterogeneous, multi-vendor deployments. While EXplainable Artificial Intelligence (XAI) helps mitigate the opacity of AI models, explainability alone does not guarantee reliable network operations. In this article, we propose a lightweight verification approach based on interpretable models to validate the behavior of Deep Reinforcement Learning (DRL) agents for RAN slicing and scheduling in Open RAN. Specifically, we use Decision Tree (DT)-based verifiers to perform near-real-time consistency checks at runtime, which would be otherwise unfeasible with computationally expensive state-of-the-art verifiers. We analyze the landscape of XAI and AI verification, propose a scalable architectural integration, and demonstrate feasibility with a DT-based slice-verifier. We also outline future challenges to ensure trustworthy AI adoption in Open RAN.
zh

[AI-41] Deep Learning-Based Control Optimization for Glass Bottle Forming

【速读】:该论文旨在解决玻璃瓶制造过程中成型设备参数控制不精确导致的质量波动与缺陷问题,其核心挑战在于如何在真实生产环境中实现对关键工艺参数的动态优化。解决方案的关键在于提出了一种基于深度学习的控制算法,利用实际生产线采集的运行数据训练神经网络模型,以预测参数调整对玻璃料滴(glass gob)特性的影响;并通过设计特定的逆向机制(inversion mechanism),反推出实现目标料滴特性的最优机器设置,从而提升过程稳定性、降低废品率并增强产品一致性。
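
摘要中的“逆向机制”未给出细节;一种常见做法是冻结正向模型、对输入参数做梯度下降,下面按这一假设给出示意(参数归一化到 [0, 1] 为假设):

```python
# 假设性示意:冻结正向模型,对机器参数做梯度反演以达到目标料滴特性(非官方实现)
import torch

def invert_settings(forward_model, target_gob, x0, steps=200, lr=1e-2):
    """forward_model: 已训练网络,输入机器参数 x,输出料滴特性;x0: 当前设置。"""
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(forward_model(x), target_gob)
        loss.backward()
        opt.step()
        with torch.no_grad():             # 机器参数需落在物理可行范围内
            x.clamp_(0.0, 1.0)            # 假设参数已归一化到 [0, 1]
    return x.detach()
```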

链接: https://arxiv.org/abs/2510.18412
作者: Mattia Pujatti,Andrea Di Luca,Nicola Peghini,Federico Monegaglia,Marco Cristoforetti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 37 pages, 17 figures, accepted for publication in “Expert Systems With Applications”

点击查看摘要

Abstract:In glass bottle manufacturing, precise control of forming machines is critical for ensuring quality and minimizing defects. This study presents a deep learning-based control algorithm designed to optimize the forming process in real production environments. Using real operational data from active manufacturing plants, our neural network predicts the effects of parameter changes based on the current production setup. Through a specifically designed inversion mechanism, the algorithm identifies the optimal machine settings required to achieve the desired glass gob characteristics. Experimental results on historical datasets from multiple production lines show that the proposed method yields promising outcomes, suggesting potential for enhanced process stability, reduced waste, and improved product consistency. These results highlight the potential of deep learning for process control in glass manufacturing.
zh

[AI-42] Heterogeneous Adversarial Play in Interactive Environments NEURIPS2025

【速读】:该论文旨在解决开放性学习场景中传统自对弈(self-play)框架因依赖零和竞争对称性而难以适应非对称、动态教学需求的问题,特别是如何在无预设任务层级的情况下实现人工系统自主合成适配性课程。其解决方案的关键在于提出异构对抗博弈(Heterogeneous Adversarial Play, HAP),将教师-学生互动形式化为一种极小极大优化过程,其中任务生成者(教师)与问题求解者(学生)通过对抗演化共同进化;该框架构建了一个双向反馈机制,使教师能基于实时学习者表现动态调整任务复杂度,从而实现更具适应性和学习效率的自动课程学习(Automatic Curriculum Learning, ACL)。

链接: https://arxiv.org/abs/2510.18407
作者: Manjie Xu,Xinyi Yang,Jiayu Zhan,Wei Liang,Chi Zhang,Yixin Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:Self-play constitutes a fundamental paradigm for autonomous skill acquisition, whereby agents iteratively enhance their capabilities through self-directed environmental exploration. Conventional self-play frameworks exploit agent symmetry within zero-sum competitive settings, yet this approach proves inadequate for open-ended learning scenarios characterized by inherent asymmetry. Human pedagogical systems exemplify asymmetric instructional frameworks wherein educators systematically construct challenges calibrated to individual learners’ developmental trajectories. The principal challenge resides in operationalizing these asymmetric, adaptive pedagogical mechanisms within artificial systems capable of autonomously synthesizing appropriate curricula without predetermined task hierarchies. Here we present Heterogeneous Adversarial Play (HAP), an adversarial Automatic Curriculum Learning framework that formalizes teacher-student interactions as a minimax optimization wherein task-generating instructor and problem-solving learner co-evolve through adversarial dynamics. In contrast to prevailing ACL methodologies that employ static curricula or unidirectional task selection mechanisms, HAP establishes a bidirectional feedback system wherein instructors continuously recalibrate task complexity in response to real-time learner performance metrics. Experimental validation across multi-task learning domains demonstrates that our framework achieves performance parity with SOTA baselines while generating curricula that enhance learning efficacy in both artificial agents and human subjects.
zh

[AI-43] Learning from N-Tuple Data with M Positive Instances: Unbiased Risk Estimation and Theoretical Guarantees

【速读】:该论文旨在解决弱监督学习中仅能获取样本组(tuple)的正例数量(m)而无法获得个体实例标签的问题,即所谓的NTMP(N-tuple with M positives)监督设置。这类问题常见于图像分类中的区域提议(region proposals)或多实例测量场景。解决方案的关键在于构建一个无偏风险估计器(Unbiased Risk Estimator, URE),通过将tuple生成过程与潜在实例边缘分布(latent instance marginals)相联系,从而从仅有的计数信息中推导出可训练的目标函数。研究进一步证明了在有效混合率与类别先验分离的前提下模型可识别,并通过Rademacher复杂度建立泛化界,同时引入ReLU修正项提升有限样本下的稳定性,确保渐近正确性。该方法在多个转换为NTMP任务的基准上优于代表性弱监督基线,且对类别先验失衡和不同tuple配置具有鲁棒性。

链接: https://arxiv.org/abs/2510.18406
作者: Miao Zhang,Junpeng Li,Changchun Hua,Yana Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Weakly supervised learning often operates with coarse aggregate signals rather than instance labels. We study a setting where each training example is an n-tuple containing exactly m positives, while only the count m per tuple is observed. This NTMP (N-tuple with M positives) supervision arises in, e.g., image classification with region proposals and multi-instance measurements. We show that tuple counts admit a trainable unbiased risk estimator (URE) by linking the tuple-generation process to latent instance marginals. Starting from fixed (n,m), we derive a closed-form URE and extend it to variable tuple sizes, variable counts, and their combination. Identification holds whenever the effective mixing rate is separated from the class prior. We establish generalization bounds via Rademacher complexity and prove statistical consistency with standard rates under mild regularity assumptions. To improve finite-sample stability, we introduce simple ReLU corrections to the URE that preserve asymptotic correctness. Across benchmarks converted to NTMP tasks, the approach consistently outperforms representative weak-supervision baselines and yields favorable precision-recall and F1 trade-offs. It remains robust under class-prior imbalance and across diverse tuple configurations, demonstrating that count-only supervision can be exploited effectively through a theoretically grounded and practically stable objective.
zh

[AI-44] Memory-Augmented State Machine Prompting: A Novel LLM Agent Framework for Real-Time Strategy Games

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实时策略游戏(Real-Time Strategy Games, RTS)中面临的两大核心问题:一是生成式AI(Generative AI)容易产生幻觉(hallucinations),导致决策不可靠;二是现有方法缺乏长期战术一致性,造成碎片化决策行为。解决方案的关键在于提出一种名为“记忆增强型状态机提示”(Memory-Augmented State Machine Prompting, MASMP)的框架,其核心创新包括:(1)基于自然语言驱动的状态机架构,通过提示引导LLM模拟有限状态机(Finite State Machine, FSM)与行为树(Behavior Tree)的行为逻辑,实现结构化动作输出;(2)轻量级记忆模块,用于跨决策周期保存关键战略变量(如战术、优先单位等),从而在保持语义理解能力的同时,通过严格的“状态-动作映射”缓解“知行鸿沟”(Knowing-Doing Gap),最终实现可解释性与FSM级可靠性兼具的智能体决策机制。
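
下面用一个玩具化的状态机与记忆字典示意 MASMP 的提示构造与“严格状态-动作映射”;状态集合、转移表与提示措辞均为示例,并非论文原始提示:

```python
# 假设性示意:状态机提示 + 轻量记忆模块(状态与变量命名均为示例,非官方实现)
MEMORY = {"tactic": "rush", "priority_units": ["Marine"], "last_state": "BUILD"}

TRANSITIONS = {  # 自然语言描述的有限状态机,提示约束 LLM 只能输出合法转移
    "BUILD":   ["BUILD", "EXPAND", "ATTACK"],
    "EXPAND":  ["BUILD", "DEFEND"],
    "ATTACK":  ["ATTACK", "RETREAT"],
    "RETREAT": ["DEFEND", "BUILD"],
    "DEFEND":  ["DEFEND", "ATTACK"],
}

def build_prompt(observation, memory):
    return (
        f"你是星际争霸II指挥官。当前状态: {memory['last_state']}。\n"
        f"允许的下一状态: {TRANSITIONS[memory['last_state']]}。\n"
        f"长期战术: {memory['tactic']};优先单位: {memory['priority_units']}。\n"
        f"战场观测: {observation}\n"
        "请从允许列表中选择下一状态并给出对应宏操作,格式如 'ATTACK: ...'。"
    )

def step(llm, observation, memory):
    reply = llm(build_prompt(observation, memory))         # 期望返回如 "ATTACK: ..."
    next_state = reply.split(":")[0].strip()
    if next_state in TRANSITIONS[memory["last_state"]]:    # 严格的状态-动作映射
        memory["last_state"] = next_state                  # 跨决策周期保存关键变量
    return memory["last_state"], reply
```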

链接: https://arxiv.org/abs/2510.18395
作者: Runnan Qi,Yanan Ni,Lumin Jiang,Zongyuan Li,Kuihua Huang,Xian Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 1 table, 1 algorithm. Submitted to conference

点击查看摘要

Abstract:This paper proposes Memory-Augmented State Machine Prompting (MASMP), a novel framework for LLM agents in real-time strategy games. Addressing key challenges like hallucinations and fragmented decision-making in existing approaches, MASMP integrates state machine prompting with memory mechanisms to unify structured actions with long-term tactical coherence. The framework features: (1) a natural language-driven state machine architecture that guides LLMs to emulate finite state machines and behavior trees through prompts, and (2) a lightweight memory module preserving strategic variables (e.g., tactics, priority units) across decision cycles. Experiments in StarCraft II demonstrate MASMP’s 60% win rate against the hardest built-in AI (Lv7), vastly outperforming baselines (0%). Case studies reveal the method retains LLMs’ semantic comprehension while resolving the “Knowing-Doing Gap” through strict state-action mapping, achieving both interpretability and FSM-like reliability. This work establishes a new paradigm for combining neural and symbolic AI in complex decision-making.
zh

[AI-45] PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion

【速读】:该论文旨在解决当前基于感知的强化学习(Reinforcement Learning, RL)控制器在足式机器人上的两大局限性:一是现有方法通常依赖振荡器或逆运动学(Inverse Kinematics, IK)驱动的步态先验来约束动作空间,这会引入归纳偏置(inductive bias),限制策略优化的灵活性并降低跨机器人形态的适应能力;二是部分方法采用“盲视”策略,无法预判后腿地形,对噪声敏感且鲁棒性差。解决方案的关键在于提出Phase-Guided Terrain Traversal (PGTT),其通过纯奖励塑形(reward shaping)强制实现步态结构,从而显著减少对振荡器或IK条件动作先验的依赖,提升策略学习的通用性和适应性。PGTT将每条腿的相位编码为三次 Hermite 样条(cubic Hermite spline),根据局部高度图统计自适应调整摆动高度,并引入摆动相接触惩罚项,同时策略直接作用于关节空间,支持形态无关部署(morphology-agnostic deployment)。实验表明,PGTT在模拟与真实场景中均表现出更强的抗扰动能力和地形适应性,收敛速度约为强基线方法的两倍。
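
摆动高度样条与接触惩罚可以按如下方式做一个最小示意;其中样条端点导数取 0、按高度图标准差抬升顶点高度等细节均为假设:

```python
# 假设性示意:三次 Hermite 样条生成摆动相足端目标高度并做奖励塑形(非官方实现)
import numpy as np

def hermite(p0, p1, m0, m1, s):
    """标准三次 Hermite 插值,s ∈ [0, 1]。"""
    h00 = 2*s**3 - 3*s**2 + 1
    h10 = s**3 - 2*s**2 + s
    h01 = -2*s**3 + 3*s**2
    h11 = s**3 - s**2
    return h00*p0 + h10*m0 + h01*p1 + h11*m1

def swing_height_target(phase, terrain_std, base_clearance=0.08, k=0.5):
    """phase ∈ [0, 1) 为该腿的摆动相;局部高度图标准差越大,抬腿越高。"""
    apex = base_clearance + k * terrain_std
    if phase < 0.5:   # 上升段:0 -> apex
        return hermite(0.0, apex, 0.0, 0.0, phase / 0.5)
    else:             # 下降段:apex -> 0
        return hermite(apex, 0.0, 0.0, 0.0, (phase - 0.5) / 0.5)

def swing_reward(foot_z, phase, terrain_std, foot_contact):
    target = swing_height_target(phase, terrain_std)
    r = -abs(foot_z - target)           # 跟踪样条给出的抬腿高度
    if foot_contact and 0.1 < phase < 0.9:
        r -= 1.0                        # 摆动相接触惩罚
    return r
```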

链接: https://arxiv.org/abs/2510.18348
作者: Alexandros Ntagkas,Chairi Kiourt,Konstantinos Chatzilygeroudis
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 9 figures, 2 tables

点击查看摘要

Abstract:State-of-the-art perceptive Reinforcement Learning controllers for legged robots either (i) impose oscillator or IK-based gait priors that constrain the action space, add bias to the policy optimization and reduce adaptability across robot morphologies, or (ii) operate “blind”, struggling to anticipate hind-leg terrain and remaining brittle to noise. In this paper, we propose Phase-Guided Terrain Traversal (PGTT), a perception-aware deep-RL approach that overcomes these limitations by enforcing gait structure purely through reward shaping, thereby reducing inductive bias in policy learning compared to oscillator/IK-conditioned action priors. PGTT encodes per-leg phase as a cubic Hermite spline that adapts swing height to local heightmap statistics and adds a swing-phase contact penalty, while the policy acts directly in joint space supporting morphology-agnostic deployment. Trained in MuJoCo (MJX) on procedurally generated stair-like terrains with curriculum and domain randomization, PGTT achieves the highest success under push disturbances (median +7.5% vs. the next best method) and on discrete obstacles (+9%), with comparable velocity tracking, and converges to an effective policy roughly 2x faster than strong end-to-end baselines. We validate PGTT on a Unitree Go2 using a real-time LiDAR elevation-to-heightmap pipeline, and we report preliminary results on ANYmal-C obtained with the same hyperparameters. These findings indicate that terrain-adaptive, phase-guided reward shaping is a simple and general mechanism for robust perceptive locomotion across platforms.
zh

[AI-46] ShortcutBreaker: Low-Rank Noisy Bottleneck with Global Perturbation Attention for Multi-Class Unsupervised Anomaly Detection

【速读】:该论文旨在解决多类无监督异常检测(Multi-class Unsupervised Anomaly Detection, MUAD)中因身份捷径(identity shortcuts)导致的正常与异常样本重建误差差异缩小的问题,从而影响异常判别能力。其核心解决方案是提出ShortcutBreaker框架,关键创新在于:一是设计低秩噪声瓶颈(Low-Rank Noisy Bottleneck, LRNB),基于矩阵秩不等式理论将高维特征投影至低秩潜在空间,理论上抑制了平凡的身份复制;二是引入全局扰动注意力机制(Global Perturbation Attention),利用Vision Transformer(ViT)的全局建模能力,在解码器中防止信息捷径,增强对异常模式的敏感性。
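
LRNB 的一个直观草稿如下:线性降维到 rank 远小于特征维度的潜在空间,并在训练期注入高斯噪声;具体层结构与噪声形式以论文为准:

```python
# 假设性示意:低秩噪声瓶颈(LRNB),抑制恒等捷径(非官方实现)
import torch
import torch.nn as nn

class LowRankNoisyBottleneck(nn.Module):
    def __init__(self, dim=768, rank=64, noise_std=0.1):
        super().__init__()
        self.down = nn.Linear(dim, rank)   # rank << dim:低秩投影使恒等复制不可行
        self.up = nn.Linear(rank, dim)
        self.noise_std = noise_std

    def forward(self, x):                  # x: (B, N, dim) 的 patch 特征
        z = self.down(x)
        if self.training:
            z = z + torch.randn_like(z) * self.noise_std
        return self.up(z)

# 重建误差即异常得分:正常样本可被低秩空间表达,异常样本误差显著增大
```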

链接: https://arxiv.org/abs/2510.18342
作者: Peng Tang,Xiaoxiao Yan,Xiaobin Hu,Yuning Cui,Donghao Luo,Jiangning Zhang,Pengcheng Xu,Jinlong Peng,Qingdong He,Feiyue Huang,Song Xue,Tobias Lasser
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Multi-class unsupervised anomaly detection (MUAD) has garnered growing research interest, as it seeks to develop a unified model for anomaly detection across multiple classes, i.e., eliminating the need to train separate models for distinct objects and thereby saving substantial computational resources. Under the MUAD setting, while advanced Transformer-based architectures have brought significant performance improvements, identity shortcuts persist: they directly copy inputs to outputs, narrowing the gap in reconstruction errors between normal and abnormal cases, and thereby making the two harder to distinguish. Therefore, we propose ShortcutBreaker, a novel unified feature-reconstruction framework for MUAD tasks, featuring two key innovations to address the issue of shortcuts. First, drawing on a matrix rank inequality, we design a low-rank noisy bottleneck (LRNB) to project high-dimensional features into a low-rank latent space, and theoretically demonstrate its capacity to prevent trivial identity reproduction. Second, leveraging the ViT’s global modeling capability instead of merely focusing on local features, we incorporate a global perturbation attention to prevent information shortcuts in the decoders. Extensive experiments are performed on four widely used anomaly detection benchmarks, including three industrial datasets (MVTec-AD, ViSA, and Real-IAD) and one medical dataset (Universal Medical). The proposed method achieves a remarkable image-level AUROC of 99.8%, 98.9%, 90.6%, and 87.8% on these four datasets, respectively, consistently outperforming previous MUAD methods across different scenarios.
zh

[AI-47] Scalable Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching NEURIPS2025

【速读】:该论文旨在解决表格数据中半监督异常检测(semi-supervised anomaly detection in tabular data)的问题,尤其针对现有连续时间模型(如基于扩散的DTE)在推理阶段计算开销大、效率低的瓶颈。解决方案的关键在于提出Time-Conditioned Contraction Matching (TCCM),其核心思想是借鉴流匹配(flow matching)中学习概率分布间速度场的机制,但通过预测一个时间条件化的收缩向量(contraction vector)来逼近固定目标(原点),从而简化框架:一是训练和推理无需求解常微分方程(ODE),显著提升效率;二是引入单步偏差评分策略(one time-step deviation),实现高效且准确的异常得分计算;三是由于速度场直接作用于输入空间,具备特征级可解释性,并通过Lipschitz连续性提供理论上的鲁棒性保障。
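
按摘要描述可以还原出大致的训练与打分流程;下面的线性插值 x_t = (1 - t)x 与“特征和时间拼接作为网络输入”的方式均为假设性选择:

```python
# 假设性示意:TCCM 的训练目标与"单步偏差"异常打分(非官方实现)
import torch

def tccm_loss(model, x):
    """训练:在随机时间 t 处,让模型预测指向原点的收缩向量。
    这里假设插值形式 x_t = (1 - t) * x,目标收缩向量为 -x。"""
    t = torch.rand(x.size(0), 1)
    x_t = (1 - t) * x
    pred = model(torch.cat([x_t, t], dim=1))
    return ((pred - (-x)) ** 2).mean()

@torch.no_grad()
def anomaly_score(model, x, t0=0.0):
    """单步偏差:一次前向即可;偏离期望收缩行为越多,得分越高。"""
    t = torch.full((x.size(0), 1), t0)
    pred = model(torch.cat([x, t], dim=1))
    return ((pred + x) ** 2).sum(dim=1)   # 逐特征可归因,即可解释性的来源
```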

链接: https://arxiv.org/abs/2510.18328
作者: Zhong Li,Qi Huang,Yuxuan Zhu,Lincen Yang,Mohammad Mohammadi Amiri,Niki van Stein,Matthijs van Leeuwen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Paper accepted by NeurIPS 2025

点击查看摘要

Abstract:We introduce Time-Conditioned Contraction Matching (TCCM), a novel method for semi-supervised anomaly detection in tabular data. TCCM is inspired by flow matching, a recent generative modeling framework that learns velocity fields between probability distributions and has shown strong performance compared to diffusion models and generative adversarial networks. Instead of directly applying flow matching as originally formulated, TCCM builds on its core idea – learning velocity fields between distributions – but simplifies the framework by predicting a time-conditioned contraction vector toward a fixed target (the origin) at each sampled time step. This design offers three key advantages: (1) a lightweight and scalable training objective that removes the need for solving ordinary differential equations during training and inference; (2) an efficient scoring strategy called one time-step deviation, which quantifies deviation from expected contraction behavior in a single forward pass, addressing the inference bottleneck of existing continuous-time models such as DTE (a diffusion-based model with leading anomaly detection accuracy but heavy inference cost); and (3) explainability and provable robustness, as the learned velocity field operates directly in input space, making the anomaly score inherently feature-wise attributable; moreover, the score function is Lipschitz-continuous with respect to the input, providing theoretical guarantees under small perturbations. Extensive experiments on the ADBench benchmark show that TCCM strikes a favorable balance between detection accuracy and inference cost, outperforming state-of-the-art methods – especially on high-dimensional and large-scale datasets. The source code is available at our GitHub repository.
zh

[AI-48] Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning

【速读】:该论文旨在解决地球空间数据(Geospatial Data)因体量庞大、分辨率多样、时间尺度不一及稀疏性等问题所带来的分析与解读难题。其解决方案的关键在于构建Earth AI系统,该系统由三大基础模型(行星尺度影像、人口和环境)与一个基于Gemini的智能推理引擎组成,通过多模态基础模型间的协同作用实现互补价值,并借助一个Gemini驱动的代理(Agent)对复杂多步骤查询进行联合推理,从而有效整合大规模地理空间数据源与工具,提升预测能力并实现从原始数据到可操作洞察的转化。

链接: https://arxiv.org/abs/2510.18318
作者: Aaron Bell,Amit Aides,Amr Helmy,Arbaaz Muslim,Aviad Barzilai,Aviv Slobodkin,Bolous Jaber,David Schottlander,George Leifman,Joydeep Paul,Mimi Sun,Nadav Sherman,Natalie Williams,Per Bjornsson,Roy Lee,Ruth Alcantara,Thomas Turnbull,Tomer Shekel,Vered Silverman,Yotam Gigi,Adam Boulanger,Alex Ottenwess,Ali Ahmadalipour,Anna Carter,Charles Elliott,David Andre,Elad Aharoni,Gia Jung,Hassler Thurston,Jacob Bien,Jamie McPike,Juliet Rothenberg,Kartik Hegde,Kel Markert,Kim Philipp Jablonski,Luc Houriez,Monica Bharel,Phing VanLee,Reuven Sayag,Sebastian Pilarski,Shelley Cazares,Shlomi Pasternak,Siduo Jiang,Stone Jiang,Thomas Colthurst,Yang Chen,Yehonathan Refael,Yochai Blau,Yuval Carny,Yael Maguire,Avinatan Hassidim,James Manyika,Tim Thelin,Genady Beryozkin,Gautam Prasad,Luke Barrington,Yossi Matias,Niv Efron,Shravya Shetty
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data along with its varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation. This paper introduces Earth AI, a family of geospatial AI models and agentic reasoning that enables significant advances in our ability to unlock novel and profound insights into our planet. This approach is built upon foundation models across three key domains–Planet-scale Imagery, Population, and Environment–and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that when used together, they provide complementary value for geospatial inference and their synergies unlock superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.
zh

[AI-49] MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

【速读】:该论文旨在解决多步骤双臂移动操作(multi-step bimanual mobile manipulation)任务中大规模多样化人类示范数据采集成本高、效率低的问题。其核心挑战在于:(1)如何确定移动基座的位置以确保机械臂的可达性;(2)如何规划摄像头位置以满足视觉-运动策略(visuomotor policies)所需的可见性要求。解决方案的关键是提出MoMaGen,将数据生成建模为一个带约束的优化问题,其中硬约束(如可达性)被严格满足,而软约束(如导航过程中的可见性)则通过平衡机制进行优化。这一方法不仅扩展了以往仅适用于静态双臂操作的自动化数据生成框架,还为未来方法提供了理论基础,并在四个任务上验证了其生成数据多样性的显著提升,最终支持从单一源示范中训练出可部署于物理机器人上的模仿学习策略,且仅需40次真实世界微调即可实现成功部署。
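
硬约束(可达性)与软约束(可见性)的处理可以用最简单的“拒绝采样 + 打分”来示意;`task.reachable` / `task.visibility` 等接口均为假设,论文实际采用的优化器可能更复杂:

```python
# 假设性示意:硬约束拒绝采样 + 软约束打分的基座位姿选择(非官方实现)
import numpy as np

def sample_base_pose(task, n_samples=500, rng=np.random.default_rng(0)):
    best, best_score = None, -np.inf
    for _ in range(n_samples):
        pose = rng.uniform(task.workspace_low, task.workspace_high)  # (x, y, yaw)
        if not task.reachable(pose):        # 硬约束:双臂 IK 可达,必须满足
            continue
        score = task.visibility(pose)       # 软约束:导航/操作中的相机可见性
        if score > best_score:
            best, best_score = pose, score
    if best is None:
        raise RuntimeError("没有满足可达性硬约束的基座位姿")
    return best
```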

链接: https://arxiv.org/abs/2510.18316
作者: Chengshu Li,Mengdi Xu,Arpit Bahety,Hang Yin,Yunfan Jiang,Huang Huang,Josiah Wong,Sujay Garlanka,Cem Gokmen,Ruohan Zhang,Weiyu Liu,Jiajun Wu,Roberto Martín-Martín,Li Fei-Fei
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project website: this http URL . The first four authors contribute equally

点击查看摘要

Abstract:Imitation learning from large-scale, diverse human demonstrations has proven effective for training robots, but collecting such data is costly and time-consuming. This challenge is amplified for multi-step bimanual mobile manipulation, where humans must teleoperate both a mobile base and two high-degree-of-freedom arms. Prior automated data generation frameworks have addressed static bimanual manipulation by augmenting a few human demonstrations in simulation, but they fall short for mobile settings due to two key challenges: (1) determining base placement to ensure reachability, and (2) positioning the camera to provide sufficient visibility for visuomotor policies. To address these issues, we introduce MoMaGen, which formulates data generation as a constrained optimization problem that enforces hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes prior approaches and provides a principled foundation for future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and show that it generates significantly more diverse datasets than existing methods. Leveraging this diversity, MoMaGen can train successful imitation learning policies from a single source demonstration, and these policies can be fine-tuned with as few as 40 real-world demonstrations to achieve deployment on physical robotic hardware. More details are available at our project page: this http URL.
zh

[AI-50] Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

【速读】:该论文旨在探究嵌入维度(embedding dimension)对基于强化学习训练的Transformer模型中内部“世界模型”(world model)形成的影响,即如何通过调整模型规模来提升其结构化表征能力与可解释性。解决方案的关键在于:在执行类似冒泡排序的相邻交换任务时,即使嵌入维度极小,模型仍能实现高准确率;但随着嵌入维度增大,模型会构建更忠实、一致且鲁棒的内部表示,并展现出两种稳定机制——(1)注意力权重矩阵的最后一行单调编码token的全局顺序;(2)所选交换操作与编码值中最大的相邻差值对齐。这表明模型规模不仅提升最终性能,还显著增强其内部世界模型的结构质量。
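
论文观察到的两种机制可以用几行 NumPy 探针来验证(秩相关阈值 0.95 为示例值):

```python
# 假设性示意:检验论文报告的两种稳定机制(非官方实现)
import numpy as np

def rank(a):
    return np.argsort(np.argsort(a))

def check_mechanisms(attn_last_row, token_values, chosen_swap):
    """attn_last_row: 注意力权重矩阵最后一行;chosen_swap: 模型交换的相邻位置 i。"""
    # 机制1:最后一行单调编码 token 的全局顺序(用秩相关近似"单调")
    rho = np.corrcoef(rank(attn_last_row), rank(token_values))[0, 1]
    mech1 = abs(rho) > 0.95
    # 机制2:所选交换对齐编码值中最大的相邻差
    diffs = np.abs(np.diff(attn_last_row))
    mech2 = int(chosen_swap) == int(np.argmax(diffs))
    return mech1, mech2
```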

链接: https://arxiv.org/abs/2510.18315
作者: Brady Bhalla,Honglu Fan,Nancy Chen,Tony Yue YU
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate how embedding dimension affects the emergence of an internal “world model” in a transformer trained with reinforcement learning to perform bubble-sort-style adjacent swaps. Models achieve high accuracy even with very small embedding dimensions, but larger dimensions yield more faithful, consistent, and robust internal representations. In particular, higher embedding dimensions strengthen the formation of structured internal representation and lead to better interpretability. After hundreds of experiments, we observe two consistent mechanisms: (1) the last row of the attention weight matrix monotonically encodes the global ordering of tokens; and (2) the selected transposition aligns with the largest adjacent difference of these encoded values. Our results provide quantitative evidence that transformers build structured internal world models and that model size improves representation quality in addition to end performance. We release our metrics and analyses, which can be used to probe similar algorithmic tasks.
zh

[AI-51] Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在自动化复杂网络任务过程中引入的新安全风险问题,尤其针对现有红队测试方法难以捕捉Web代理行为模式、缺乏动态演化能力且泛化性差的局限。解决方案的关键在于提出一个名为Genesis的新型智能体框架,其核心由三个模块构成:攻击者(Attacker)、评分器(Scorer)和策略生成器(Strategist)。其中,攻击者通过融合遗传算法与混合策略表示生成对抗性注入;评分器基于目标Web代理响应提供反馈;策略生成器则从交互日志中动态挖掘有效策略并构建持续扩展的策略库,再回流至攻击者以提升其攻击效能,从而实现攻击策略的持续发现与进化。

链接: https://arxiv.org/abs/2510.18314
作者: Zheng Zhang,Jiarui He,Yuchen Cai,Deheng Ye,Peilin Zhao,Ruili Feng,Hao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language model (LLM) agents increasingly automate complex web tasks, they boost productivity while simultaneously introducing new security risks. However, relevant studies on web agent attacks remain limited. Existing red-teaming approaches mainly rely on manually crafted attack strategies or static models trained offline. Such methods fail to capture the underlying behavioral patterns of web agents, making it difficult to generalize across diverse environments. In web agent attacks, success requires the continuous discovery and evolution of attack strategies. To this end, we propose Genesis, a novel agentic framework composed of three modules: Attacker, Scorer, and Strategist. The Attacker generates adversarial injections by integrating the genetic algorithm with a hybrid strategy representation. The Scorer evaluates the target web agent’s responses to provide feedback. The Strategist dynamically uncovers effective strategies from interaction logs and compiles them into a continuously growing strategy library, which is then re-deployed to enhance the Attacker’s effectiveness. Extensive experiments across various web tasks show that our framework discovers novel strategies and consistently outperforms existing attack baselines.
zh

[AI-52] SPIKE: Stable Physics-Informed Kernel Evolution Method for Solving Hyperbolic Conservation Laws

【速读】:该论文旨在解决强形式残差最小化方法在求解无粘性双曲守恒律时无法有效捕捉包含间断解(如激波)的难题,这一问题长期困扰数值计算领域。其解决方案的关键在于提出稳定物理信息核演化(Stable Physics-Informed Kernel Evolution, SPIKE)方法,通过引入再生核表示与正则化参数演化机制,利用Tikhonov正则化提供平滑过渡路径以穿越激波形成过程,从而实现无需显式激波检测或人工黏性即可自动保持守恒性、追踪特征线并满足Rankine-Hugoniot条件的统一数值框架。
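
结合“再生核表示 + 参数演化 + Tikhonov 正则”的描述,可以推测其演化格式大致如下(符号与具体形式为示意性重构,以论文为准):

```latex
% 设解采用再生核表示 u_\theta(x,t) = \sum_j \theta_j(t)\, k(x, x_j),
% 将守恒律 u_t + f(u)_x = 0 的强形式残差投影到参数速度上:
\dot{\theta}(t)
  = \arg\min_{\eta}\ \bigl\| J(\theta)\,\eta + \partial_x f(u_\theta) \bigr\|_2^2
    + \lambda \,\| \eta \|_2^2
  = -\bigl( J^{\top} J + \lambda I \bigr)^{-1} J^{\top}\, \partial_x f(u_\theta),
\qquad J := \partial u_\theta / \partial \theta .
% 当激波形成、J 病态时,Tikhonov 项 \lambda\|\eta\|^2 提供平滑过渡,
% 使参数动力系统得以穿越激波奇点。
```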

链接: https://arxiv.org/abs/2510.18266
作者: Hua Su,Lei Zhang,Jin Zhao
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
备注: 24 pages, 8 figures

点击查看摘要

Abstract:We introduce the Stable Physics-Informed Kernel Evolution (SPIKE) method for numerical computation of inviscid hyperbolic conservation laws. SPIKE resolves a fundamental paradox: how strong-form residual minimization can capture weak solutions containing discontinuities. SPIKE employs reproducing kernel representations with regularized parameter evolution, where Tikhonov regularization provides a smooth transition mechanism through shock formation, allowing the dynamics to traverse shock singularities. This approach automatically maintains conservation, tracks characteristics, and captures shocks satisfying Rankine-Hugoniot conditions within a unified framework requiring no explicit shock detection or artificial viscosity. Numerical validation across scalar and vector-valued conservation laws confirms the method’s effectiveness.
zh

[AI-53] NTKMTL: Mitigating Task Imbalance in Multi-Task Learning from Neural Tangent Kernel Perspective

【速读】:该论文旨在解决多任务学习(Multi-Task Learning, MTL)中普遍存在的任务不平衡问题,即不同任务在训练过程中收敛速度不一致,导致某些任务性能受限。其解决方案的关键在于利用神经切向核(Neural Tangent Kernel, NTK)理论分析MTL中的训练动态,并通过扩展的NTK矩阵进行谱分析,从而平衡多个任务的收敛速度,缓解任务间的学习不平衡。在此基础上,进一步提出NTKMTL-SR方法,在共享表示近似下实现高效训练并保持优异性能。
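
其思路可以用一个粗糙的代理来示意:以各任务损失梯度的范数近似对应 NTK 的迹(作为收敛速度的代理),再反比加权;这一代理与反比加权方式均为假设:

```python
# 假设性示意:按 NTK 迹的代理量平衡多任务收敛速度(非官方实现)
import torch

def ntk_trace_per_task(model, losses):
    """对每个任务损失求梯度 g_k,用 ||g_k||^2 近似 tr(NTK_k),作为收敛速度代理。"""
    traces = []
    for loss in losses:
        grads = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
        traces.append(sum(g.pow(2).sum() for g in grads))
    return torch.stack(traces)

def balanced_loss(model, losses):
    traces = ntk_trace_per_task(model, losses).detach()
    weights = traces.sum() / (traces + 1e-12)       # 收敛快的任务降权,慢的升权
    weights = weights / weights.sum() * len(losses)
    return sum(w * l for w, l in zip(weights, losses))
```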

链接: https://arxiv.org/abs/2510.18258
作者: Xiaohan Qin,Xiaoxing Wang,Ning Liao,Junchi Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-Task Learning (MTL) enables a single model to learn multiple tasks simultaneously, leveraging knowledge transfer among tasks for enhanced generalization, and has been widely applied across various domains. However, task imbalance remains a major challenge in MTL. Although balancing the convergence speeds of different tasks is an effective approach to address this issue, it is highly challenging to accurately characterize the training dynamics and convergence speeds of multiple tasks within the complex MTL system. To this end, we attempt to analyze the training dynamics in MTL by leveraging Neural Tangent Kernel (NTK) theory and propose a new MTL method, NTKMTL. Specifically, we introduce an extended NTK matrix for MTL and adopt spectral analysis to balance the convergence speeds of multiple tasks, thereby mitigating task imbalance. Based on the approximation via shared representation, we further propose NTKMTL-SR, achieving training efficiency while maintaining competitive performance. Extensive experiments demonstrate that our methods achieve state-of-the-art performance across a wide range of benchmarks, including both multi-task supervised learning and multi-task reinforcement learning. Source code is available at this https URL.
zh

[AI-54] Illusions of reflection: open-ended task reveals systematic failures in Large Language Models reflective reasoning

【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)所表现出的“反思”(reflection)是否具有与人类反思推理功能相当的自我修正能力,尤其是在开放且受规则约束的任务中。其关键解决方案在于设计了一个简单但真实世界中的任务——生成有效的科学测试题并根据自身批判进行修订,并通过可审计的成功标准评估模型在第一轮和反思后两阶段的表现。实验结果表明,模型在首轮表现较差(平均仅约1个有效项目),反思带来的改进也有限,且多数情况下重复违反同一约束条件,说明其所谓“修正”主要源于偶然生成有效项,而非基于约束敏感的错误识别与原则性修复。这揭示了当前LLM的“反思”缺乏人类那种目标驱动、主动监控机制,从而指出可靠性能依赖外部结构来强制约束,而非仅靠模型内部的反思行为。

链接: https://arxiv.org/abs/2510.18254
作者: Sion Weatherhead,Flora Salim,Aaron Belbasis
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humans do not just find mistakes after the fact – we often catch them mid-stream because ‘reflection’ is tied to the goal and its constraints. Today’s large language models produce reasoning tokens and ‘reflective’ text, but is this functionally equivalent to human reflective reasoning? Prior work on closed-ended tasks – with clear, external ‘correctness’ signals – can make ‘reflection’ look effective while masking limits in self-correction. We therefore test eight frontier models on a simple, real-world task that is open-ended yet rule-constrained, with auditable success criteria: to produce valid scientific test items, then revise after considering their own critique. First-pass performance is poor (often zero valid items out of 4 required; mean \approx 1), and reflection yields only modest gains (also \approx 1). Crucially, the second attempt frequently repeats the same constraint violation, indicating that ‘corrective gains’ arise largely from chance production of a valid item rather than from error detection and principled, constraint-sensitive repair. Performance before and after reflection deteriorates as open-endedness increases, and models marketed for ‘reasoning’ show no advantage. Our results suggest that current LLM ‘reflection’ lacks functional evidence of the active, goal-driven monitoring that helps humans respect constraints even on a first pass. Until such mechanisms are instantiated in the model itself, reliable performance requires external structure that enforces constraints.
zh

[AI-55] ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

【速读】:该论文旨在解决现有基于token级别的数据选择方法在监督微调(Supervised Fine-Tuning, SFT)大语言模型(Large Language Models, LLMs)时存在的两个关键问题:一是多数方法依赖额外训练或访问参考模型,增加了复杂性和资源消耗;二是仅依赖损失信息进行token选择,难以保留对语义重要但损失较低的token。解决方案的关键在于提出ssToken方法,其核心创新包括:(1) 利用历史模型与当前模型之间的token级损失差异作为自调节信号(self-modulated signal),实现模型在优化过程中动态适应性地选择token,无需外部参考模型;(2) 引入基于注意力机制的语义感知token重要性评估指标(semantic-aware token importance estimation),该指标独立于损失信息,可补充捕捉语义关键token,从而提升筛选效果。实验证明,两种机制单独使用即优于全量数据微调,联合使用则进一步超越现有token级选择方法,在保持训练效率的同时显著提升性能。
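
两类信号的融合可以按下面的方式做一个草稿:历史与当前模型的逐 token 损失差作为自调节分数,注意力聚合作为语义分数,z-score 归一后加权;`alpha`、`keep_ratio` 为示例值,融合方式为假设:

```python
# 假设性示意:自调节损失差 + 注意力语义重要性联合筛选 token(非官方实现)
import torch

@torch.no_grad()
def select_tokens(cur_losses, hist_losses, attn_importance, keep_ratio=0.6, alpha=0.5):
    """cur_losses / hist_losses: (T,) 当前模型与历史模型的逐 token 损失;
    attn_importance: (T,) 由注意力图聚合的语义重要性(与损失信号正交)。"""
    def z(x):  # 标准化到同一尺度再融合
        return (x - x.mean()) / (x.std() + 1e-6)
    self_modulated = z(hist_losses - cur_losses)   # 当前模型仍"学得动"的 token 得分高
    score = alpha * self_modulated + (1 - alpha) * z(attn_importance)
    k = max(1, int(keep_ratio * score.numel()))
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[torch.topk(score, k).indices] = True
    return mask   # 仅对 mask 为 True 的 token 计入 SFT 损失
```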

链接: https://arxiv.org/abs/2510.18250
作者: Xiaohan Qin,Xiaoxing Wang,Ning Liao,Cancheng Zhang,Xiangdong Zhang,Mingquan Feng,Jingzhi Wang,Junchi Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot well preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior works. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection and providing complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration–ssToken–achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency.

[AI-56] Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLM s

【Quick Read】: This paper tackles the trade-off between inference efficiency and accuracy in large language model (LLM) deployment, which grows more pressing as model size and training data scale up. The key is a conditional scaling law that augments the Chinchilla framework with architectural information (hidden size, the parameter allocation between MLP and attention, grouped-query attention (GQA), etc.), combined with a search framework that identifies architectures that are simultaneously inference-efficient and accurate under a given training budget. The authors train more than 200 models to fit and validate the law, finding that optimized architectures deliver up to 2.1% higher accuracy and 42% greater inference throughput than LLaMA-3.2 at the same training cost.

Link: https://arxiv.org/abs/2510.18245
Authors: Song Bian, Tao Yu, Shivaram Venkataraman, Youngsuk Park
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 17 figures

View abstract

Abstract:Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors influence both inference cost and accuracy: hidden size, the allocation of parameters between MLP and attention (the mlp-to-attention ratio), and grouped-query attention (GQA). We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.
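As a concrete illustration, a conditional scaling law of this kind can be fit with off-the-shelf least-squares tooling. The sketch below assumes an additive architecture correction on top of the Chinchilla form; the log-ratio and GQA terms (and all coefficients) are assumptions for illustration, since the paper's exact functional form is not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

def conditional_law(X, E, A, B, alpha, beta, c_ratio, c_gqa):
    """Chinchilla-style loss with assumed architecture terms:
    N = params, D = tokens, r = mlp-to-attention ratio, g = GQA groups."""
    N, D, r, g = X
    return (E + A / N**alpha + B / D**beta
            + c_ratio * np.log(r) ** 2   # penalize ratios far from an optimum
            + c_gqa / g)                 # fewer KV groups -> assumed loss penalty

# Toy usage on synthetic observations (replace with real training runs).
rng = np.random.default_rng(0)
N = rng.uniform(8e7, 3e9, 200); D = rng.uniform(8e9, 1e11, 200)
r = rng.uniform(1.0, 8.0, 200); g = rng.integers(1, 9, 200).astype(float)
y = conditional_law((N, D, r, g), 1.7, 4e2, 4e3, 0.33, 0.3, 0.02, 0.05)
y += rng.normal(0, 0.01, 200)
popt, _ = curve_fit(conditional_law, (N, D, r, g), y,
                    p0=[1.5, 100, 1000, 0.3, 0.3, 0.0, 0.0], maxfev=20000)
print(popt)  # recovered coefficients; argmin over (r, g) gives the search step
```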

[AI-57] EVER: Edge-Assisted Auto-Verification for Mobile MR-Aided Operation

【Quick Read】: This paper addresses automatic verification of whether users follow mixed reality (MR) guidance during MR-aided operations. The core challenge is the discrepancy between the physical world and virtual objects caused by imperfect 3D modeling or lighting estimation, which defeats traditional frame-similarity methods. The key of the proposed EVER system is to process physical and virtual frames with segmentation models and rendering pipelines adapted to their distinct characteristics, and to apply a threshold strategy based on Intersection over Union (IoU) for accurate auto-verification; compute-intensive tasks are offloaded to an edge server. EVER achieves over 90% verification accuracy within 100 milliseconds (well below the average human reaction time of about 273 milliseconds) with only minimal additional energy consumption.

Link: https://arxiv.org/abs/2510.18224
Authors: Jiangong Chen, Mingyu Zhu, Bin Li
Institutions: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Mixed Reality (MR)-aided operation overlays digital objects on the physical world to provide a more immersive and intuitive operation process. A primary challenge is the precise and fast auto-verification of whether the user follows MR guidance by comparing frames before and after each operation. The pre-operation frame includes virtual guiding objects, while the post-operation frame contains physical counterparts. Existing approaches fall short of accounting for the discrepancies between physical and virtual objects due to imperfect 3D modeling or lighting estimation. In this paper, we propose EVER: an edge-assisted auto-verification system for mobile MR-aided operations. Unlike traditional frame-based similarity comparisons, EVER leverages the segmentation model and rendering pipeline adapted to the unique attributes of frames with physical pieces and those with their virtual counterparts; it adopts a threshold-based strategy using Intersection over Union (IoU) metrics for accurate auto-verification. To ensure fast auto-verification and low energy consumption, EVER offloads compute-intensive tasks to an edge server. Through comprehensive evaluations of public datasets and custom datasets with practical implementation, EVER achieves over 90% verification accuracy within 100 milliseconds (significantly faster than average human reaction time of approximately 273 milliseconds), while consuming only minimal additional computational resources and energy compared to a system without auto-verification.

[AI-58] he Emergence of Complex Behavior in Large-Scale Ecological Environments

【Quick Read】: This paper explores how physical scale and population size drive the natural emergence of complex behaviors in open-ended ecological environments. The core challenge is understanding how agents, with no explicit rewards or learning objectives, evolve adaptive behaviors through reproduction, mutation, and natural selection while continuously interacting with each other and with an environment they themselves shape. The key is running large-scale simulations (populations exceeding 60,000 agents, each with its own evolved neural network policy) and observing how behaviors such as long-range resource extraction, vision-based foraging, and predation emerge under competitive and survival pressures. The study finds that some behaviors appear only at sufficient environmental and population scale, with larger scales increasing behavioral stability and consistency, suggesting ecology as a new instrument for machine learning.

Link: https://arxiv.org/abs/2510.18221
Authors: Joseph Bejjani, Chase Van Amburg, Chengrui Wang, Chloe Huangyuan Su, Sarah M. Pratt, Yasin Mazloumi, Naeem Khoshnevis, Sham M. Kakade, Kianté Brantley
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 18 pages, 11 figures, 6 tables, experiment code available at this https URL

View abstract

Abstract:We explore how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and natural selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high-performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. In an effort to discover how complex behaviors naturally emerge, we conduct experiments in large-scale worlds that reach populations of more than 60,000 individual agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long-range resource extraction, vision-based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors, finding that some appear only in sufficiently large environments and populations, with larger scales increasing behavioral stability and consistency. While there is a rich history of research in evolutionary settings, our scaling results provide promising new directions to explore ecology as an instrument of machine learning in an era of abundant computational resources. Experimental code is available at this https URL.

[AI-59] A Definition of AGI

【Quick Read】: This paper addresses the lack of a concrete definition of Artificial General Intelligence (AGI), which obscures the gap between today's specialized AI systems and human-level cognition. The key is a quantifiable evaluation framework that defines AGI as matching the cognitive versatility and proficiency of a well-educated adult, grounds general intelligence in Cattell-Horn-Carroll theory by dissecting it into ten core cognitive domains (including reasoning, memory, and perception), and adapts established human psychometric batteries to evaluate AI systems. Applying the framework shows that current models excel at knowledge-intensive tasks but have critical deficits in foundational cognitive machinery, especially long-term memory storage, and it quantifies both rapid progress and the remaining gap with concrete scores (e.g., GPT-4 at 27%, GPT-5 at 58%).

Link: https://arxiv.org/abs/2510.18212
Authors: Dan Hendrycks, Dawn Song, Christian Szegedy, Honglak Lee, Yarin Gal, Erik Brynjolfsson, Sharon Li, Andy Zou, Lionel Levine, Bo Han, Jie Fu, Ziwei Liu, Jinwoo Shin, Kimin Lee, Mantas Mazeika, Long Phan, George Ingebretsen, Adam Khoja, Cihang Xie, Olawale Salaudeen, Matthias Hein, Kevin Zhao, Alexander Pan, David Duvenaud, Bo Li, Steve Omohundro, Gabriel Alfour, Max Tegmark, Kevin McGrew, Gary Marcus, Jaan Tallinn, Eric Schmidt, Yoshua Bengio
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today’s specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains-including reasoning, memory, and perception-and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly “jagged” cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 58%) concretely quantify both rapid progress and the substantial gap remaining before AGI.

[AI-60] ActivationReasoning : Logical Reasoning in Latent Activation Spaces

【Quick Read】: This paper addresses the opacity and lack of control over the internal reasoning of large language models (LLMs), which generate fluent text while their reasoning remains hard to inspect or steer. Sparse autoencoders (SAEs) expose latent features in hidden activations that often align with human concepts, but those features are fragile and passive, offering no mechanism for systematic reasoning or model control. The key of the proposed ActivationReasoning (AR) framework is a three-stage process: (1) identify latent concept representations and organize them into a dictionary; (2) at inference time, detect activated concepts and map them to logical propositions; and (3) apply logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. Embedding logical structure in the latent space improves transparency, controllability, and robustness on complex, context-sensitive reasoning tasks.

Link: https://arxiv.org/abs/2510.18184
Authors: Lukas Helff, Ruben Härle, Wolfgang Stammer, Felix Friedrich, Manuel Brack, Antonia Wüst, Hikaru Shindo, Patrick Schramowski, Kristian Kersting
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations, first latent concept representations are identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions, at inference time AR detects activating concepts and maps them to logical propositions; and (3) Logical reasoning, applying logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.

[AI-61] Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains

【Quick Read】: This paper examines a blind spot in evaluating Reinforcement Learning with Verifiable Rewards (RLVR) post-training of large language models (LLMs): existing methods treat all tokens uniformly and measure only final-answer correctness or Pass@K, yet still claim improved reasoning traces without assessing the intermediate steps. The key contribution is trace coherence, a First-Order Logic (FOL)-based measure of local consistency between reasoning steps, explicitly distinguished from trace validity (logical soundness). Experiments with GRPO on Qwen-2.5-0.5B over GSM8K show that RL post-training improves trace coherence, most markedly on problems where the base model fails but the RL model succeeds; surprisingly, however, greater local coherence does not guarantee valid or correct final solutions, so claims of improved reasoning via RL must be examined with care.

Link: https://arxiv.org/abs/2510.18176
Authors: Soumya Rani Samineni, Durgesh Kalwar, Vardaan Gangal, Siddhant Bhambri, Subbarao Kambhampati
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 2 figures

View abstract

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks and continues to attract significant attention. Existing RLVR methods, however, typically treat all tokens uniformly without accounting for token-level advantages. These methods primarily evaluate performance based on final answer correctness or Pass@K accuracy, and yet make claims about RL post-training leading to improved reasoning traces. This motivates our investigation into the effect of RL post-training on intermediate tokens which are not directly incentivized. To study this, we design an experimental setup using the GRPO algorithm with Qwen-2.5-0.5B model on the GSM8K dataset. We introduce trace coherence, a First-Order Logic (FOL)-based measure to capture the consistency of reasoning steps by identifying errors in the traces. We distinguish between trace validity and trace coherence, noting that the former implies logical soundness while the latter measures local coherence via lack of errors. Our results show that RL post-training overall improves trace coherence with the most significant gains on problems where the base model fails but the RL model succeeds. Surprisingly, RL enhances local coherence without necessarily producing valid or correct solutions. This highlights a crucial distinction: improved local coherence in reasoning steps does not guarantee final answer correctness. We argue that claims of improved reasoning via RL must be examined with care, as these may be based on improved trace coherence, which may not translate into fully valid mathematical proofs.

[AI-62] Agent ChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI NEURIPS2025

【Quick Read】: This paper addresses the lack of benchmarks measuring how multi-turn agents adapt to dynamic goal changes: existing benchmarks focus on static objectives or one-shot tool use and cannot capture how agents respond and recover when task goals shift mid-dialogue in realistic enterprise settings. The key is AgentChangeBench, which quantifies adaptation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. With 2,835 task sequences and five user personas that trigger realistic shift points across three enterprise domains, the benchmark surfaces contrasts that pass@k obscures: for example, GPT-4o recovers 92.2% of airline-booking goal shifts while Gemini recovers only 48.6%, and retail tasks show near-perfect parameter validity yet redundancy rates above 80%, underscoring that recovery time and redundancy must be measured explicitly.

Link: https://arxiv.org/abs/2510.18170
Authors: Manik Rana, Calissa Man, Anotida Expected Msiiwa, Jeffrey Paine, Kevin Zhu, Sunishchal Dev, Vasu Sharma, Ahan M R
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Software Engineering (cs.SE); Optimization and Control (math.OC)
Comments: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Multi-Turn Interactions in Large Language Models

View abstract

Abstract:Goal changes are a defining feature of real world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool augmented language model agents adapt to mid dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate several frontier models and uncover sharp contrasts obscured by traditional pass@k scores: for example, GPT-4o reaches 92.2% recovery on airline booking shifts while Gemini collapses to 48.6%, and retail tasks show near-perfect parameter validity yet redundancy rates above 80%, revealing major inefficiencies. These findings demonstrate that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.
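The recovery and redundancy metrics are simple to state precisely. Below is a sketch under assumed conventions (turn indices for GSRT, and exact-duplicate detection for TCRR; the benchmark's real definitions may differ):

```python
def goal_shift_recovery_time(shift_turn, progress_turns):
    """GSRT: turns between a goal shift and the first turn that makes
    progress on the new goal; None if the agent never recovers."""
    after = [t for t in progress_turns if t >= shift_turn]
    return (min(after) - shift_turn) if after else None

def tool_call_redundancy_rate(tool_calls):
    """TCRR: fraction of tool calls that exactly duplicate an earlier call.
    `tool_calls` is a list of (name, hashable_args) pairs."""
    seen, redundant = set(), 0
    for name, args in tool_calls:
        key = (name, args)
        redundant += key in seen
        seen.add(key)
    return redundant / len(tool_calls) if tool_calls else 0.0

print(goal_shift_recovery_time(3, [1, 6, 8]))                         # 3
print(tool_call_redundancy_rate([("search", "x"), ("search", "x")]))  # 0.5
```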

[AI-63] LLM -Based Multi-Agent System for Simulating and Analyzing Marketing and Consumer Behavior

【Quick Read】: This paper addresses the limits of traditional marketing-strategy evaluation in capturing the complexity of consumer decision-making and social interaction: post-event analyses and rule-based agent-based models (ABMs) struggle to reflect human behavioral dynamics. The key is a multi-agent simulation framework powered by large language models (LLMs), in which generative agents interact, express internal reasoning, form habits, and make purchasing decisions without predefined rules. In a price-discount marketing scenario, the system delivers actionable strategy-testing outcomes and reveals emergent social patterns beyond the reach of conventional methods, giving marketers a scalable, low-risk pre-implementation testing tool.

Link: https://arxiv.org/abs/2510.18155
Authors: Man-Lin Chu, Lucian Terhorst, Kadin Reed, Tom Ni, Weiwei Chen, Rongyu Lin
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted for publication at IEEE International Conference on e-Business Engineering ICEBE 2025, November 10-12, Buraydah, Saudi Arabia. 8 pages, 5 figures

View abstract

Abstract:Simulating consumer decision-making is vital for designing and evaluating marketing strategies before costly real-world deployment. However, post-event analyses and rule-based agent-based models (ABMs) struggle to capture the complexity of human behavior and social interaction. We introduce an LLM-powered multi-agent simulation framework that models consumer decisions and social dynamics. Building on recent advances in large language model simulation in a sandbox environment, our framework enables generative agents to interact, express internal reasoning, form habits, and make purchasing decisions without predefined rules. In a price-discount marketing scenario, the system delivers actionable strategy-testing outcomes and reveals emergent social patterns beyond the reach of conventional methods. This approach offers marketers a scalable, low-risk tool for pre-implementation testing, reducing reliance on time-intensive post-event evaluations and lowering the risk of underperforming campaigns.

[AI-64] Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety

【Quick Read】: This paper addresses the limitations of text-level safety monitoring of large language model (LLM) reasoning: analyzing textual reasoning steps can miss subtle harmful patterns and can be circumvented by models that hide unsafe reasoning. The key is a sentence-level labeled dataset for activation-based monitoring of safety behaviors: reasoning sequences are annotated with safety-behavior labels (such as expressing safety concerns or speculating on user intent), from which steering vectors are extracted to detect and influence these behaviors within model activations. This enables precise localization of and intervention on safety-relevant behavior in the model's internal states, markedly improving oversight of reasoning chains compared with holistic labeling.

Link: https://arxiv.org/abs/2510.18154
Authors: Antonio-Gabriel Chacón Menke, Phan Xuan Tan, Eiji Kamioka
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

View abstract

Abstract:Recent work has highlighted the importance of monitoring chain-of-thought reasoning for AI safety; however, current approaches that analyze textual reasoning steps can miss subtle harmful patterns and may be circumvented by models that hide unsafe reasoning. We present a sentence-level labeled dataset that enables activation-based monitoring of safety behaviors during LLM reasoning. Our dataset contains reasoning sequences with sentence-level annotations of safety behaviors such as expression of safety concerns or speculation on user intent, which we use to extract steering vectors for detecting and influencing these behaviors within model activations. The dataset fills a key gap in safety research: while existing datasets label reasoning holistically, effective application of steering vectors for safety monitoring could be improved by identifying precisely when specific behaviors occur within reasoning chains. We demonstrate the dataset’s utility by extracting representations that both detect and steer safety behaviors in model activations, showcasing the potential of activation-level techniques for improving safety oversight on reasoning. Content Warning: This paper discusses AI safety in the context of harmful prompts and may contain references to potentially harmful content.
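The standard recipe for turning such sentence-level labels into steering vectors is a difference-of-means over hidden activations. The sketch below assumes access to per-sentence residual-stream activations at one layer (hypothetical tensors); this is the common approach, not necessarily the paper's exact procedure.

```python
import torch

def steering_vector(acts_pos, acts_neg):
    """Difference-of-means steering vector from activations of sentences
    labeled with a safety behavior (pos) vs. without it (neg).
    acts_*: (num_sentences, hidden_dim) mean-pooled activations."""
    v = acts_pos.mean(0) - acts_neg.mean(0)
    return v / v.norm()  # unit norm; strength is chosen at injection time

def detect(act, v, threshold=0.0):
    """Score a new sentence's activation by its projection onto the vector."""
    return (act @ v).item() > threshold

def steer(hidden, v, strength=4.0):
    """Add the vector into the residual stream to encourage the behavior
    (use a negative strength to suppress it)."""
    return hidden + strength * v
```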

[AI-65] Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models NEURIPS2025

【Quick Read】: This paper addresses the accuracy gap of small language models (SLMs) on complex domain-specific tasks relative to large language models (LLMs). Supervised fine-tuning can narrow the gap but requires substantial manual effort in data preparation and iterative optimization. The key of PaDA-Agent (Pattern-guided Data Augmentation Agent) is an evaluation-driven approach: it discovers systematic failure patterns from validation-set errors and drafts targeted data augmentation strategies that aim to directly reduce the generalization gap, rather than focusing only on training errors or generating error-correcting samples. Experiments show significant improvements over state-of-the-art LLM-based data augmentation approaches when fine-tuning Llama 3.2 1B Instruct.

Link: https://arxiv.org/abs/2510.18143
Authors: Huan Song, Deeksha Razdan, Yiyue Qian, Arijit Ghosh Chowdhury, Parth Patwa, Aman Chadha, Shinan Zhang, Sharlina Keshava, Hannah Marlowe
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle

View abstract

Abstract:Small Language Models (SLMs) offer compelling advantages in deployment cost and latency, but their accuracy often lags behind larger models, particularly for complex domain-specific tasks. While supervised fine-tuning can help bridge this performance gap, it requires substantial manual effort in data preparation and iterative optimization. We present PaDA-Agent (Pattern-guided Data Augmentation Agent), an evaluation-driven approach that streamlines the data augmentation process for SLMs through coordinated operations. Unlike state-of-the-art approaches that focus only on model training errors and generate error-correcting samples, PaDA-Agent discovers failure patterns from the validation data via evaluations and drafts targeted data augmentation strategies that aim to directly reduce the generalization gap. Our experimental results demonstrate significant improvements over state-of-the-art LLM-based data augmentation approaches for Llama 3.2 1B Instruct model fine-tuning.

[AI-66] Measuring Reasoning in LLM s: a New Dialectical Angle

【Quick Read】: This paper challenges language model (LM) evaluations that reward final-answer correctness while ignoring the quality of the reasoning process. Benchmarks such as GSM and MMLU measure static output accuracy and reveal little about how a model's thinking evolves. The key of the proposed SIEV framework is to draw on dialectics, viewing reasoning as a thesis-antithesis-synthesis trajectory, and to assess not only the conclusion a model reaches but how it gets there: its ability to resolve tension, integrate distinct ideas, and synthesize higher-order reasoning. This process-oriented, philosophically grounded lens uncovers significant reasoning gaps in state-of-the-art models even under saturated benchmarks; for instance, GPT-5-chat loses over 40 points (out of 100) on GSM when evaluated with SIEV, demonstrating a deeper and more discriminative assessment.

Link: https://arxiv.org/abs/2510.18134
Authors: Soheil Abbasloo
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:What does it truly mean for a language model to “reason”? Most current evaluations and benchmarks reward models’ correct standalone answers–but correctness alone reveals little about the process that produced them. In this work, we explore a different perspective: reasoning is not a static chain of steps, but a dynamic trajectory where ideas interact, clash, and evolve into deeper insights. To capture this dynamic, we draw on a well-established philosophical tradition: \textitdialectics, where reasoning unfolds through thesis, antithesis, and synthesis. Building on this, we present SIEV, a structured framework that evaluates reasoning of LLMs through dialectics. Unlike conventional evaluations, SIEV assesses not only the conclusion a model reaches, but how it gets there: its ability to resolve tension, integrate distinct ideas, and synthesize higher-order reasoning. This lens uncovers significant reasoning gaps in state-of-the-art models even under saturated benchmarks like GSM and MMLU. For instance, GPT-5-chat, a recent model, loses over 40 points (out of 100) when evaluated with SIEV on GSM. Our findings highlight that adopting a process-oriented, philosophically grounded approach enables a deeper, more rigorous, and more discriminative assessment of LLM reasoning.

[AI-67] Latent Discrete Diffusion Models

【Quick Read】: This paper addresses a limitation of masked denoisers on discrete language data: reverse transitions typically factorize across positions, weakening joint structure and degrading quality in few-step generation. The key of Latent Discrete Diffusion Models (LDDMs) is to couple masked diffusion over discrete tokens with continuous diffusion over latent embeddings: the latent channel provides a softer signal and carries cross-token dependencies that help resolve local decoding ambiguities. Two instantiations are presented: FUJI-LDDMs, which jointly denoise tokens and latents, and SEQ-LDDMs, which first resolve the latent chain and then the discrete chain conditioned on it; both are trained with ELBO-style objectives. LDDMs improve unconditional generation metrics over state-of-the-art masked discrete diffusion baselines and remain effective at lower sampling budgets, where unmasking many tokens per step is desirable.

Link: https://arxiv.org/abs/2510.18114
Authors: Dario Shariatian, Alain Durmus, Stefano Peluchetti
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

View abstract

Abstract:We study discrete diffusion for language and other categorical data and focus on a common limitation of masked denoisers: reverse transitions typically factorize across positions, which can weaken joint structure and degrade quality in few-step generation. We propose \emphLatent Discrete Diffusion Models (LDDMs), which couple a masked discrete diffusion over tokens with a continuous diffusion over latent embeddings. The latent channel provides a softer signal and carries cross-token dependencies that help resolve ambiguities. We present two instantiations: (i) FUJI-LDDMs, which perform fully joint denoising of tokens and latents, and (ii) SEQ-LDDMs, which sequentially resolve the latent chain and then the discrete chain conditioned on it. For both variants we derive ELBO-style objectives and discuss design choices that make the latents informative yet amenable to diffusion modeling. In experiments, LDDMs yield improvements on unconditional generation metrics compared to state-of-the-art masked discrete diffusion baselines, and are effective at lower sampling budgets, where unmasking many tokens per step is desirable.

[AI-68] From AutoRecSys to AutoRecLab: A Call to Build Evaluate and Govern Autonomous Recommender-Systems Research Labs

【Quick Read】: This paper argues that recommender-systems (RecSys) research has automated only narrow parts of its own process: AutoRecSys tools focus on algorithm selection and hyperparameter tuning rather than end-to-end automation of research. The key is the vision of an Autonomous Recommender-Systems Research Lab (AutoRecLab) that integrates problem ideation, literature analysis, experimental design and execution, result interpretation, manuscript drafting, and provenance logging, combining LLM-driven ideation and reporting with automated experimentation to move RecSys research toward Artificial Research Intelligence.

Link: https://arxiv.org/abs/2510.18104
Authors: Joeran Beel, Bela Gipp, Tobias Vente, Moritz Baumgart, Philipp Meister
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Recommender-systems research has accelerated model and evaluation advances, yet largely neglects automating the research process itself. We argue for a shift from narrow AutoRecSys tools – focused on algorithm selection and hyper-parameter tuning – to an Autonomous Recommender-Systems Research Lab (AutoRecLab) that integrates end-to-end automation: problem ideation, literature analysis, experimental design and execution, result interpretation, manuscript drafting, and provenance logging. Drawing on recent progress in automated science (e.g., multi-agent AI Scientist and AI Co-Scientist systems), we outline an agenda for the RecSys community: (1) build open AutoRecLab prototypes that combine LLM-driven ideation and reporting with automated experimentation; (2) establish benchmarks and competitions that evaluate agents on producing reproducible RecSys findings with minimal human input; (3) create review venues for transparently AI-generated submissions; (4) define standards for attribution and reproducibility via detailed research logs and metadata; and (5) foster interdisciplinary dialogue on ethics, governance, privacy, and fairness in autonomous research. Advancing this agenda can increase research throughput, surface non-obvious insights, and position RecSys to contribute to emerging Artificial Research Intelligence. We conclude with a call to organise a community retreat to coordinate next steps and co-author guidance for the responsible integration of automated research systems.

[AI-69] Enhancing mortality prediction in cardiac arrest ICU patients through meta-modeling of structured clinical data from MIMIC-IV

【Quick Read】: This paper addresses accurate early prediction of in-hospital mortality in intensive care units (ICUs) to support timely clinical intervention and efficient resource allocation. The key is integrating structured clinical data with unstructured text (discharge summaries and radiology reports): LASSO and XGBoost perform feature selection, TF-IDF and BERT embeddings represent the text, and an interpretable multivariate logistic regression trained on the top features serves as the final risk model. On the MIMIC-IV database, the combined model achieves an AUC of 0.918 versus 0.753 with structured data alone, a 22% relative improvement, and shows superior standardized net benefit across a wide range of threshold probabilities, confirming the added prognostic value of unstructured clinical notes.

Link: https://arxiv.org/abs/2510.18103
Authors: Nursultan Mamatov, Philipp Kellmeyer
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 38 pages, 5 figures, 2 tables, 3 appendices

View abstract

Abstract:Accurate early prediction of in-hospital mortality in intensive care units (ICUs) is essential for timely clinical intervention and efficient resource allocation. This study develops and evaluates machine learning models that integrate both structured clinical data and unstructured textual information, specifically discharge summaries and radiology reports, from the MIMIC-IV database. We used LASSO and XGBoost for feature selection, followed by a multivariate logistic regression trained on the top features identified by both models. Incorporating textual features using TF-IDF and BERT embeddings significantly improved predictive performance. The final logistic regression model, which combined structured and textual input, achieved an AUC of 0.918, compared to 0.753 when using structured data alone, a relative improvement of 22%. The decision-curve analysis demonstrated a superior standardized net benefit across a wide range of threshold probabilities (0.2-0.8), confirming the clinical utility of the model. These results underscore the added prognostic value of unstructured clinical notes and support their integration into interpretable feature-driven risk prediction models for ICU patients.

[AI-70] Planned Diffusion

【Quick Read】: This paper addresses the trade-off between generation speed and output quality in large language model inference: autoregressive models produce high-quality text but generate tokens sequentially, while diffusion models can generate tokens in parallel but often need many iterations to reach comparable quality. The key of planned diffusion is a two-stage hybrid: the model first writes a short autoregressive plan that breaks the output into independent spans, then generates those spans simultaneously with diffusion. This expands the speed-quality Pareto frontier: on AlpacaEval (805 instruction-following prompts), it achieves 1.27x to 1.81x speedups over autoregressive generation with win-rate drops of only 0.87% to 5.4%, respectively, and simple runtime knobs give flexible quality-latency control.

Link: https://arxiv.org/abs/2510.18087
Authors: Daniel Israel, Tian Jin, Ellie Cheng, Guy Van den Broeck, Aditya Grover, Suvinay Subramanian, Michael Carbin
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures

View abstract

Abstract:A central challenge in large language model inference is the trade-off between generation speed and output quality. Autoregressive models produce high-quality text but generate tokens sequentially. Diffusion models can generate tokens in parallel but often need many iterations to match the same quality. We propose planned diffusion, a hybrid method that combines the strengths of both paradigms. Planned diffusion works in two stages: first, the model creates a short autoregressive plan that breaks the output into smaller, independent spans. Second, the model generates these spans simultaneously using diffusion. This approach expands the speed-quality Pareto frontier and provides a practical path to faster, high-quality text generation. On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves a Pareto-optimal trade-off between quality and latency, delivering 1.27x to 1.81x speedups over autoregressive generation with only a 0.87% to 5.4% drop in win rate, respectively. Our sensitivity analysis shows that the planning mechanism of planned diffusion is minimal and reliable, and simple runtime knobs exist to provide flexible control of the quality-latency trade-off.
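The two-stage control flow can be sketched compactly. The interface below is a stand-in, not the paper's actual API: `plan_fn` represents the autoregressive planner emitting (length, brief) spans, and `denoise_fn` represents one parallel diffusion pass over all spans.

```python
from typing import Callable, List, Tuple

def planned_diffusion_generate(
    plan_fn: Callable[[str], List[Tuple[int, str]]],
    denoise_fn: Callable[[List[Tuple[int, str]], int], List[str]],
    prompt: str,
    steps: int = 8,
) -> str:
    """Stage 1: a short autoregressive plan splits the answer into
    independent spans. Stage 2: diffusion fills all spans in parallel."""
    spans = plan_fn(prompt)            # [(target_length, brief), ...]
    drafts = denoise_fn(spans, steps)  # one parallel denoising call
    return "".join(drafts)

# Toy stand-ins to show the control flow (not real models).
plan = lambda p: [(5, "greet"), (7, "answer")]
denoise = lambda spans, steps: [f"[{brief}:{n}]" for n, brief in spans]
print(planned_diffusion_generate(plan, denoise, "hi"))  # [greet:5][answer:7]
```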

[AI-71] R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

【Quick Read】: This paper addresses how a single human operator can effectively train a collaborating robot team through sequential single-agent demonstrations, without synchronized demonstrations in the joint multi-agent action space. The key is Round-Robin Behavior Cloning (R2BC): the operator teleoperates one robot at a time and incrementally teaches multi-agent behavior to the whole system. R2BC matches, and in some cases surpasses, an oracle behavior-cloning baseline trained on privileged synchronized demonstrations across four simulated multi-agent tasks, and is further deployed on two physical robot tasks trained with real human demonstrations.

Link: https://arxiv.org/abs/2510.18085
Authors: Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 9 pages, 6 figures

View abstract

Abstract:Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.
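The round-robin loop itself is simple to sketch. The environment and policy interfaces below are hypothetical stand-ins (a multi-agent env returning one observation per agent), chosen only to show the structure of the training procedure.

```python
def r2bc_train(env, policies, human_teleop, bc_update, rounds=10, episodes=5):
    """Round-Robin Behavior Cloning sketch: one human teaches a robot team by
    controlling a single agent at a time while the others run their
    current policies; only the teleoperated agent's data is cloned."""
    for _ in range(rounds):
        for i in range(len(policies)):      # take turns over agents
            demos = []
            for _ in range(episodes):
                obs, done = env.reset(), False
                while not done:
                    actions = [human_teleop(o) if j == i else policies[j](o)
                               for j, o in enumerate(obs)]
                    demos.append((obs[i], actions[i]))  # log agent i only
                    obs, done = env.step(actions)
            # Supervised update (behavior cloning) on agent i's demonstrations.
            policies[i] = bc_update(policies[i], demos)
    return policies
```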

[AI-72] RL-Driven Security-Aware Resource Allocation Framework for UAV-Assisted O-RAN

【Quick Read】: This paper addresses the joint optimization of security, low-latency communication, and energy efficiency for UAV-assisted Open Radio Access Networks (O-RAN) in disaster management and search-and-rescue (SAR) scenarios, a set of trade-offs that existing approaches often overlook. The key is a reinforcement learning (RL)-based dynamic resource allocation framework for UAV relays that formulates a joint optimization integrating security-aware resource allocation, latency minimization, and energy efficiency, and adapts in real time to network dynamics rather than relying on heuristic or static methods. Simulations show superior performance over heuristic baselines, with enhanced security and energy efficiency while maintaining ultra-low latency in SAR scenarios.

Link: https://arxiv.org/abs/2510.18084
Authors: Zaineh Abughazzah, Emna Baccour, Loay Ismail, Amr Mohamed, Mounir Hamdi
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 6 pages

View abstract

Abstract:The integration of Unmanned Aerial Vehicles (UAVs) into Open Radio Access Networks (O-RAN) enhances communication in disaster management and Search and Rescue (SAR) operations by ensuring connectivity when infrastructure fails. However, SAR scenarios demand stringent security and low-latency communication, as delays or breaches can compromise mission success. While UAVs serve as mobile relays, they introduce challenges in energy consumption and resource management, necessitating intelligent allocation strategies. Existing UAV-assisted O-RAN approaches often overlook the joint optimization of security, latency, and energy efficiency in dynamic environments. This paper proposes a novel Reinforcement Learning (RL)-based framework for dynamic resource allocation in UAV relays, explicitly addressing these trade-offs. Our approach formulates an optimization problem that integrates security-aware resource allocation, latency minimization, and energy efficiency, which is solved using RL. Unlike heuristic or static methods, our framework adapts in real-time to network dynamics, ensuring robust communication. Simulations demonstrate superior performance compared to heuristic baselines, achieving enhanced security and energy efficiency while maintaining ultra-low latency in SAR scenarios.
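Stepping back to the EVER paper above, its verification core reduces to a mask IoU check. A minimal sketch, assuming segmentation masks have already been produced by the physical- and virtual-frame pipelines, and with an illustrative threshold of 0.8 (the paper tunes this):

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over Union of two boolean segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union else 0.0

def verify_step(virtual_mask: np.ndarray, physical_mask: np.ndarray,
                threshold: float = 0.8) -> bool:
    """EVER-style check: did the physical piece land where the MR
    guidance object was rendered?"""
    return iou(virtual_mask, physical_mask) >= threshold
```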

[AI-73] Any-Depth Alignment: Unlocking Innate Safety Alignment of LLM s to Any-Depth

【Quick Read】: This paper addresses the shallow-alignment problem in large language models (LLMs): models refuse harmful requests effectively only at the very start of an assistant turn, and this protection collapses once a harmful continuation is underway (e.g., via adversarial prefix attacks or harmful assistant-prefill attacks). The key of Any-Depth Alignment (ADA) is the observation that alignment concentrates in the assistant header tokens through repeated use in shallow-refusal training, and that these tokens carry the model's strong alignment priors; by reintroducing them mid-stream, ADA induces the model to reassess the harmfulness of the current input and recover refusals at any generation depth, with negligible overhead and no changes to model parameters.

Link: https://arxiv.org/abs/2510.18081
Authors: Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model’s strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model’s parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
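Mechanically, the defense amounts to splicing the assistant header back into the sequence at checkpoints during decoding and letting the model reassess. A schematic using Hugging Face-style `generate` calls; the header string, checkpoint interval, and the `refusal_detected` classifier are all illustrative assumptions, not the paper's exact mechanism.

```python
import torch

def generate_with_ada(model, tok, prompt, header="<|assistant|>",
                      check_every=64, max_tokens=1024, probe_tokens=8):
    """Any-Depth Alignment sketch: every `check_every` tokens, re-insert the
    assistant header tokens and probe whether the model now refuses."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = []
    while len(out) < max_tokens:
        chunk = model.generate(ids, max_new_tokens=check_every, do_sample=False)
        new = chunk[:, ids.shape[1]:]
        ids, out = chunk, out + new[0].tolist()
        # Probe: append header tokens and peek at the model's continuation.
        probe_ids = tok(header, return_tensors="pt").input_ids
        probe = model.generate(torch.cat([ids, probe_ids], dim=1),
                               max_new_tokens=probe_tokens)
        if refusal_detected(tok.decode(probe[0, -probe_tokens:])):  # hypothetical
            return tok.decode(out) + "\n[generation halted: refusal recovered]"
    return tok.decode(out)
```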

[AI-74] R2L: Reliable Reinforcement Learning: Guaranteed Return Reliable Policies in Reinforcement Learning

【Quick Read】: This paper addresses the lack of reliability guarantees in reinforcement learning (RL): classical RL maximizes expected return, whereas many real-world applications (routing, resource allocation, sequential decision-making under risk) require a guaranteed probability of success rather than just high average performance. The key of the proposed reliable RL formulation is to maximize the probability that the cumulative return exceeds a prescribed threshold, and to show that a state-augmented representation reduces this problem to standard RL, so existing algorithms such as Q-learning or Dueling Double DQN apply directly without new algorithmic frameworks. The equivalence of the two formulations is established theoretically, and reliable-routing experiments (maximizing the probability of reaching the destination within a time budget) confirm that the resulting policies effectively balance efficiency and reliability.

Link: https://arxiv.org/abs/2510.18074
Authors: Nadir Farhi
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 27 pages

View abstract

Abstract:In this work, we address the problem of determining reliable policies in reinforcement learning (RL), with a focus on optimization under uncertainty and the need for performance guarantees. While classical RL algorithms aim at maximizing the expected return, many real-world applications - such as routing, resource allocation, or sequential decision-making under risk - require strategies that ensure not only high average performance but also a guaranteed probability of success. To this end, we propose a novel formulation in which the objective is to maximize the probability that the cumulative return exceeds a prescribed threshold. We demonstrate that this reliable RL problem can be reformulated, via a state-augmented representation, into a standard RL problem, thereby allowing the use of existing RL and deep RL algorithms without the need for entirely new algorithmic frameworks. Theoretical results establish the equivalence of the two formulations and show that reliable strategies can be derived by appropriately adapting well-known methods such as Q-learning or Dueling Double DQN. To illustrate the practical relevance of the approach, we consider the problem of reliable routing, where the goal is not to minimize the expected travel time but rather to maximize the probability of reaching the destination within a given time budget. Numerical experiments confirm that the proposed formulation leads to policies that effectively balance efficiency and reliability, highlighting the potential of reliable RL for applications in stochastic and safety-critical environments.
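The state-augmentation trick can be shown with a small wrapper: extend the observation with the return still needed, and pay a binary reward only when the threshold is crossed by episode end. A minimal sketch using the Gymnasium API, assuming a flat Box observation; the details of the augmented representation in the paper may differ.

```python
import gymnasium as gym
import numpy as np

class ReliableReturnWrapper(gym.Wrapper):
    """Reformulate 'maximize P(return >= threshold)' as standard RL by
    augmenting the observation with the remaining required return."""

    def __init__(self, env, threshold: float):
        super().__init__(env)
        self.threshold = threshold

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.remaining = self.threshold
        return np.append(obs, self.remaining), info

    def step(self, action):
        obs, r, term, trunc, info = self.env.step(action)
        self.remaining -= r  # progress toward the threshold
        done = term or trunc
        # Binary terminal reward: did the cumulative return hit the bar?
        meta_r = float(done and self.remaining <= 0.0)
        return np.append(obs, self.remaining), meta_r, term, trunc, info
```

Any standard value-based learner run on the wrapped environment then directly optimizes the success probability.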

[AI-75] Fine-tuning Flow Matching Generative Models with Intermediate Feedback

【Quick Read】: This paper addresses the difficulty of fine-tuning flow-based generative models with intermediate feedback for text-to-image generation, in particular the credit-assignment problem in continuous-time flow matching. Methods that learn only from outcome rewards struggle to value intermediate states, while learning a critic by direct regression on cumulative rewards is often unstable and collapses in online settings. The key of AC-Flow is threefold: (1) reward shaping that provides well-normalized signals for stable intermediate value learning and gradient control; (2) a novel dual-stability mechanism combining advantage clipping, which prevents destructive policy updates, with a warm-up phase that lets the critic mature before it influences the actor; and (3) a scalable generalized critic weighting scheme that extends reward-weighted methods while preserving diversity through Wasserstein regularization. On Stable Diffusion 3, AC-Flow achieves state-of-the-art text-image alignment and generalizes to unseen human preference models.

Link: https://arxiv.org/abs/2510.18072
Authors: Jiajun Fan, Chaoran Cheng, Shuaike Shen, Xiangxin Zhou, Ge Liu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Flow-based generative models have shown remarkable success in text-to-image generation, yet fine-tuning them with intermediate feedback remains challenging, especially for continuous-time flow matching models. Most existing approaches solely learn from outcome rewards, struggling with the credit assignment problem. Alternative methods that attempt to learn a critic via direct regression on cumulative rewards often face training instabilities and model collapse in online settings. We present AC-Flow, a robust actor-critic framework that addresses these challenges through three key innovations: (1) reward shaping that provides well-normalized learning signals to enable stable intermediate value learning and gradient control, (2) a novel dual-stability mechanism that combines advantage clipping to prevent destructive policy updates with a warm-up phase that allows the critic to mature before influencing the actor, and (3) a scalable generalized critic weighting scheme that extends traditional reward-weighted methods while preserving model diversity through Wasserstein regularization. Through extensive experiments on Stable Diffusion 3, we demonstrate that AC-Flow achieves state-of-the-art performance in text-to-image alignment tasks and generalization to unseen human preference models. Our results demonstrate that even with a computationally efficient critic model, we can robustly finetune flow models without compromising generative quality, diversity, or stability.

[AI-76] SPACeR: Self-Play Anchoring with Centralized Reference Models

【Quick Read】: This paper addresses the difficulty of making autonomous-driving sim agents simultaneously human-like, fast at inference, and scalable in multi-agent settings. Imitation learning with large diffusion-based or tokenized models yields realistic behaviors but is computationally expensive, slow at inference, and hard to adapt in reactive closed-loop scenarios; self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions but often relies on heuristics and reward shaping, letting policies drift from human norms. The key of SPACeR is to use a pretrained tokenized autoregressive motion model as a centralized reference policy that provides likelihood rewards and KL divergence for decentralized self-play, anchoring policies to the human driving distribution while preserving RL scalability.

Link: https://arxiv.org/abs/2510.18060
Authors: Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu, Wei Zhan
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project page: this https URL

View abstract

Abstract:Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and KL divergence, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
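The anchoring objective can be written compactly. The sketch below is a generic "likelihood reward plus KL penalty" shape with assumed coefficients; it illustrates the idea rather than the paper's exact loss.

```python
import torch

def spacer_objective(policy_logp, ref_logp, env_reward, kl_pol_ref,
                     w_ref=0.1, beta=0.01):
    """Self-play RL anchored to a centralized reference motion model.
    policy_logp: log-probs of taken actions under the learned policy
    ref_logp:    log-probs of the same actions under the reference model
    kl_pol_ref:  per-sample KL(policy || reference) estimates"""
    shaped_reward = env_reward + w_ref * ref_logp  # likelihood bonus
    # REINFORCE-style surrogate plus an explicit KL anchor.
    return (-(shaped_reward.detach() * policy_logp).mean()
            + beta * kl_pol_ref.mean())
```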

[AI-77] Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

【Quick Read】: This paper addresses the exploration-exploitation balance in reinforcement learning fine-tuning of generative models. Existing methods use fixed divergence regularization, creating an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. The key of Adaptive Divergence Regularized Policy Optimization (ADRPO) is to adjust regularization strength automatically from advantage estimates: weaker regularization for high-value samples to allow aggressive exploitation, stronger regularization for poor samples to maintain stability, yielding a data-quality-driven exploration-exploitation trade-off. With Wasserstein-2 regularization for flow matching models, ADRPO lets a 2B-parameter SD3 model surpass 4.8B and 12B models in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity, and it generalizes to KL-regularized fine-tuning of text-only LLMs and multimodal reasoning models, enhancing online RL methods such as GRPO.

Link: https://arxiv.org/abs/2510.18053
Authors: Jiajun Fan, Tong Wei, Chaoran Cheng, Yuxin Chen, Ge Liu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages

View abstract

Abstract:Balancing exploration and exploitation during reinforcement learning fine-tuning of generative models presents a critical challenge, as existing approaches rely on fixed divergence regularization that creates an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. We introduce Adaptive Divergence Regularized Policy Optimization (ADRPO), which automatically adjusts regularization strength based on advantage estimates-reducing regularization for high-value samples while applying stronger regularization to poor samples, enabling policies to navigate between exploration and aggressive exploitation according to data quality. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation, achieving better semantic alignment and diversity than offline methods like DPO and online methods with fixed regularization like ORW-CFM-W2. ADRPO enables a 2B parameter SD3 model to surpass much larger models with 4.8B and 12B parameters in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models, enhancing existing online RL methods like GRPO. In LLM fine-tuning, ADRPO demonstrates an emergent ability to escape local optima through active exploration, while in multi-modal audio reasoning, it outperforms GRPO through superior step-by-step reasoning, enabling a 7B model to outperform substantially larger commercial models including Gemini 2.5 Pro and GPT-4o Audio, offering an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
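The adaptive schedule is the heart of the method. One simple monotone mapping from advantage to regularization strength (an assumption for illustration; the paper's exact mapping may differ) looks like:

```python
import torch

def adrpo_loss(logp, advantages, divergence, lam_base=0.1, k=1.0):
    """Advantage-conditioned divergence regularization (sketch):
    high-advantage samples get weaker regularization (exploitation),
    low-advantage samples get stronger regularization (stability)."""
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-6)
    lam = lam_base * torch.sigmoid(-k * adv)       # decreasing in advantage
    policy_term = -(adv.detach() * logp).mean()
    reg_term = (lam.detach() * divergence).mean()  # e.g. per-sample W2 or KL
    return policy_term + reg_term
```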

[AI-78] Measure-Theoretic Anti-Causal Representation Learning

【Quick Read】: This paper addresses causal representation learning in the anti-causal setting, where labels cause features rather than the reverse, and the challenge of learning representations that remain stable and generalizable. Existing approaches typically depend on explicit causal structures or assume interventions are perfect, assumptions that rarely hold in real-world systems. The key of the proposed Anti-Causal Invariant Abstractions (ACIA) is a measure-theoretic, two-level design: low-level representations capture how labels generate observations, while high-level representations learn causal patterns that are stable across environment-specific variations; interventional kernels let ACIA handle both perfect and imperfect interventions without explicit causal structure, scale to high-dimensional data, and come with theoretical guarantees on out-of-distribution generalization.

Link: https://arxiv.org/abs/2510.18052
Authors: Arman Behnam, Binghui Wang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:

View abstract

Abstract:Causal representation learning in the anti-causal setting (labels cause features rather than the reverse) presents unique challenges requiring specialized approaches. We propose Anti-Causal Invariant Abstractions (ACIA), a novel measure-theoretic framework for anti-causal representation learning. ACIA employs a two-level design: low-level representations capture how labels generate observations, while high-level representations learn stable causal patterns across environment-specific variations. ACIA addresses key limitations of existing approaches by accommodating perfect and imperfect interventions through interventional kernels, eliminating dependency on explicit causal structures, handling high-dimensional data effectively, and providing theoretical guarantees for out-of-distribution generalization. Experiments on synthetic and real-world medical datasets demonstrate that ACIA consistently outperforms state-of-the-art methods in both accuracy and invariance metrics. Furthermore, our theoretical results establish tight bounds on performance gaps between training and unseen environments, confirming the efficacy of our approach for robust anti-causal learning.

[AI-79] CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows

【Quick Read】: This paper addresses the substantial run-time cost large language models (LLMs) incur in agentic workflows that chain long prompts and process rich data streams. The key is CompactPrompt, an end-to-end pipeline that merges hard prompt compression with lightweight file-level data compression: it prunes low-information tokens using self-information scoring and dependency-based phrase grouping, applies n-gram abbreviation to recurring textual patterns in attached documents, and uniformly quantizes numerical columns, yielding compact yet semantically faithful representations. On benchmark datasets such as TAT-QA and FinQA it cuts total token usage and inference cost by up to 60% while keeping quality loss under 5% (for Claude-3.5-Sonnet and GPT-4.1-Mini), and it visualizes real-time compression decisions and quantifies cost-performance trade-offs, laying the groundwork for leaner generative AI pipelines.

Link: https://arxiv.org/abs/2510.18043
Authors: Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Vedant Singh, Avani Appalla, Will Flanagan, Filipe Condessa
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Workshop on LLMs and Generative AI for Finance at ACM ICAIF 2025

View abstract

Abstract:Large Language Models (LLMs) deliver powerful reasoning and generation capabilities but incur substantial run-time costs when operating in agentic workflows that chain together lengthy prompts and process rich data streams. We introduce CompactPrompt, an end-to-end pipeline that merges hard prompt compression with lightweight file-level data compression. CompactPrompt first prunes low-information tokens from prompts using self-information scoring and dependency-based phrase grouping. In parallel, it applies n-gram abbreviation to recurrent textual patterns in attached documents and uniform quantization to numerical columns, yielding compact yet semantically faithful representations. Integrated into standard LLM agents, CompactPrompt reduces total token usage and inference cost by up to 60% on benchmark datasets such as TAT-QA and FinQA, while preserving output quality (less than a 5% accuracy drop for Claude-3.5-Sonnet and GPT-4.1-Mini). CompactPrompt also helps visualize real-time compression decisions and quantify cost-performance trade-offs, laying the groundwork for leaner generative AI pipelines.
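Token pruning by self-information is straightforward to sketch: score each token by its negative log-probability under a small LM and drop the least informative ones. The sketch below uses a Hugging Face causal LM; the choice of gpt2 as scorer, the keep ratio, and the omission of phrase grouping are simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prune_by_self_information(text, model_name="gpt2", keep_ratio=0.6):
    """Keep only the highest self-information tokens of a prompt."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Self-information of token t given its prefix: -log p(t | prefix).
    info = -logprobs.gather(-1, ids[0, 1:, None]).squeeze(-1)
    k = max(1, int(keep_ratio * info.numel()))
    keep = info.topk(k).indices.sort().values + 1  # +1: keep the first token too
    kept_ids = torch.cat([ids[0, :1], ids[0][keep]])
    return tok.decode(kept_ids)

print(prune_by_self_information(
    "Please kindly note that the quarterly revenue grew by 14 percent in Q3."))
```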

[AI-80] Cross-Domain Long-Term Forecasting: Radiation Dose from Sparse Neutron Sensor via Spatio-Temporal Operator Network

【Quick Read】: This paper addresses a core unsolved problem in scientific machine learning: forecasting unobservable physical quantities from sparse, cross-domain sensor data. Existing neural operators and large-scale forecasters assume dense, co-located input-output fields and short temporal contexts, assumptions that fail in real-world systems where sensing and prediction occur on distinct physical manifolds over long timescales. The key is the Spatio-Temporal Operator Network (STONe), a non-autoregressive neural operator that learns a stable functional mapping between heterogeneous domains, inferring high-altitude radiation dose fields directly from sparse ground-based neutron measurements and remaining stable over long horizons without iterative recurrence. Trained on 23 years of global neutron data, STONe delivers accurate 180-day forecasts with millisecond inference latency, challenging the view that operator learning requires domain alignment or autoregressive propagation and establishing a general principle of cross-domain operator inference for physics, climate, and energy systems.

Link: https://arxiv.org/abs/2510.18041
Authors: Jay Phil Yoo, Kazuma Kobayashi, Souvik Chakraborty, Syed Bahauddin Alam
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Forecasting unobservable physical quantities from sparse, cross-domain sensor data is a central unsolved problem in scientific machine learning. Existing neural operators and large-scale forecasters rely on dense, co-located input-output fields and short temporal contexts, assumptions that fail in real-world systems where sensing and prediction occur on distinct physical manifolds and over long timescales. We introduce the Spatio-Temporal Operator Network (STONe), a non-autoregressive neural operator that learns a stable functional mapping between heterogeneous domains. By directly inferring high-altitude radiation dose fields from sparse ground-based neutron measurements, STONe demonstrates that operator learning can generalize beyond shared-domain settings. It defines a nonlinear operator between sensor and target manifolds that remains stable over long forecasting horizons without iterative recurrence. This challenges the conventional view that operator learning requires domain alignment or autoregressive propagation. Trained on 23 years of global neutron data, STONe achieves accurate 180-day forecasts with millisecond inference latency. The framework establishes a general principle for cross-domain operator inference, enabling real-time prediction of complex spatiotemporal fields in physics, climate, and energy systems.

[AI-81] OPTAGENT : Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

【Quick Read】: This paper addresses how fixed collaboration structures or simple mechanisms such as majority voting and round-table debate in multi-agent systems can suppress the contributions of correct but non-dominant agents during complex reasoning, and how existing graph-based multi-agent methods optimize only individual agent performance while neglecting interaction quality. The key is OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and iteratively refines the collaboration structure, defining an explicit action space and a feedback mechanism that evaluates the robustness and coherence of communication throughout the debate, with the final decision reached by a majority vote over all agents. This significantly improves multi-agent reasoning across mathematical reasoning, creative writing, scientific reasoning, and numerical sorting.

Link: https://arxiv.org/abs/2510.18032
Authors: Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 8 pages for main content

View abstract

Abstract:Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debating quality plays a significant role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess OPTAGENT on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.

[AI-82] DynaQuery: A Self-Adapting Framework for Querying Structured and Multimodal Data WWW

【Quick Read】: This paper addresses the dual challenge large language models (LLMs) face when answering natural-language queries over complex hybrid databases: reasoning jointly over structured multi-relational schemas and the semantic content of linked unstructured data. Architectures based on unstructured Retrieval-Augmented Generation (RAG) are architecturally susceptible to catastrophic contextual failures such as schema hallucination, leading to unreliable query generation. The key of DynaQuery is the Schema Introspection and Linking Engine (SILE), a novel systems primitive that elevates schema linking to a first-class query-planning phase, unifying structure-awareness with semantics-awareness, substantially strengthening robustness and nearly eliminating this failure mode.

Link: https://arxiv.org/abs/2510.18029
Authors: Aymane Hassini
Institutions: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures, 10 tables. Source code and experimental artifacts are available at: this https URL . The ‘DynaQuery-Eval-5K’ benchmark, introduced in this work, is also publicly available at: this https URL

View abstract

Abstract:The rise of Large Language Models (LLMs) has accelerated the long-standing goal of enabling natural language querying over complex, hybrid databases. Yet, this ambition exposes a dual challenge: reasoning jointly over structured, multi-relational schemas and the semantic content of linked unstructured assets. To overcome this, we present DynaQuery - a unified, self-adapting framework that serves as a practical blueprint for next-generation “Unbound Databases.” At the heart of DynaQuery lies the Schema Introspection and Linking Engine (SILE), a novel systems primitive that elevates schema linking to a first-class query planning phase. We conduct a rigorous, multi-benchmark empirical evaluation of this structure-aware architecture against the prevalent unstructured Retrieval-Augmented Generation (RAG) paradigm. Our results demonstrate that the unstructured retrieval paradigm is architecturally susceptible to catastrophic contextual failures, such as SCHEMA_HALLUCINATION, leading to unreliable query generation. In contrast, our SILE-based design establishes a substantially more robust foundation, nearly eliminating this failure mode. Moreover, end-to-end validation on a complex, newly curated benchmark uncovers a key generalization principle: the transition from pure schema-awareness to holistic semantics-awareness. Taken together, our findings provide a validated architectural basis for developing natural language database interfaces that are robust, adaptable, and predictably consistent.

[AI-83] BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

【Quick Read】: This paper examines the systemic risk that arises when AI-powered research assistants and AI peer-review systems combine into fully automated publication loops, in which AI-generated fabricated research is evaluated, and accepted, by AI reviewers without human oversight. The key is BadScientist, an evaluation framework whose fabrication-oriented paper-generation agent requires no real experiments and relies solely on presentation-manipulation strategies, tested against multi-model LLM review systems with a rigorous evaluation methodology carrying formal error guarantees (concentration bounds and calibration analysis). The results reveal systematic vulnerabilities, most notably a concern-acceptance conflict: reviewers frequently flag integrity issues yet assign acceptance-level scores, and mitigation strategies yield only marginal improvements, exposing fundamental limitations of current AI-driven review and the urgent need for defense-in-depth safeguards in scientific publishing.

Link: https://arxiv.org/abs/2510.18003
Authors: Fengqing Jiang, Yichen Feng, Yuetai Li, Luyao Niu, Basel Alomair, Radha Poovendran
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

View abstract

Abstract:The convergence of LLM-powered research assistants and AI-based peer review systems creates a critical vulnerability: fully automated publication loops where AI-generated research is evaluated by AI reviewers without human oversight. We investigate this through \textbfBadScientist, a framework that evaluates whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. We develop a rigorous evaluation framework with formal error guarantees (concentration bounds and calibration analysis), calibrated on real data. Our results reveal systematic vulnerabilities: fabricated papers achieve acceptance rates up to . Critically, we identify \textitconcern-acceptance conflict – reviewers frequently flag integrity issues yet assign acceptance-level scores. Our mitigation strategies show only marginal improvements, with detection accuracy barely exceeding random chance. Despite provably sound aggregation mathematics, integrity checking systematically fails, exposing fundamental limitations in current AI-driven review systems and underscoring the urgent need for defense-in-depth safeguards in scientific publishing.
zh

[AI-84] FABRIC: Framework for Agent-Based Realistic Intelligence Creation

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)作为智能代理(agent)时,缺乏高质量、结构化、可验证的工具使用交互数据的问题。现有方法依赖人工标注来收集包含用户意图、工具调用逻辑、参数依据及执行轨迹的 agentic 数据,存在成本高、效率低且难以扩展的局限。其解决方案的关键在于提出一个完全基于 LLM 的合成框架,通过模块化流水线生成符合严格语法与语义约束的完整交互记录,涵盖任务规范、工具定义、策略伪代码、自然语言对话和执行追踪等要素,并集成约束生成格式、JSON-schema 验证与判别式过滤机制以保障数据质量与一致性,从而实现无需人工干预即可规模化构建用于训练和评估工具使用能力的 agentic 数据集。

链接: https://arxiv.org/abs/2510.17995
作者: Abhigya Verma,Seganrasan Subramanian,Nandhakumar Kandasamy,Naman Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 51 Pages, 38 Listings, 5 Figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. Realizing these capabilities requires access to agentic data: structured interaction records that couple user intents with tool specifications, argument-grounded calls, and verifiable execution traces. However, collecting such data from human annotators is costly, time-consuming, and difficult to scale. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision. This framework decomposes generation into modular pipelines that produce complete interaction records spanning task specifications, tool definitions, policy pseudocode, natural language exchanges, and execution traces. Records conform to strict syntactic and semantic constraints, ensuring machine-parseability and faithful alignment across inputs, outputs, and tool calls. Beyond single tasks, there is support for both multi-task and multi-turn agent interactions, enabling the construction of datasets that reflect the full spectrum of tool-use competencies. To ensure quality and consistency, the framework integrates constrained generation formats, JSON-schema validation, and judge-based filtering. This paper formalizes the schema for agentic records, details the prompt design principles that guide generation, and introduces scalable pipelines for high-quality synthetic data. By providing a reproducible, LLM-only alternative to manual collection, this framework advances the development of agentic LLMs capable of robust tool use.
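The JSON-schema validation gate described above is straightforward to sketch with the jsonschema package. The schema fields below are illustrative stand-ins for FABRIC's richer record schema; the semantic check that every tool call references a declared tool mirrors the "faithful alignment" requirement.

```python
# Validation gate sketch: a synthetic record is kept only if it is
# schema-valid and every tool call names a declared tool.
from jsonschema import validate, ValidationError

RECORD_SCHEMA = {
    "type": "object",
    "required": ["task", "tools", "dialogue", "tool_calls"],
    "properties": {
        "task": {"type": "string"},
        "tools": {"type": "array",
                  "items": {"type": "object", "required": ["name", "parameters"]}},
        "dialogue": {"type": "array",
                     "items": {"type": "object", "required": ["role", "content"]}},
        "tool_calls": {"type": "array",
                       "items": {"type": "object", "required": ["name", "arguments"]}},
    },
}

def keep_record(record: dict) -> bool:
    try:
        validate(instance=record, schema=RECORD_SCHEMA)  # syntactic gate
    except ValidationError:
        return False
    declared = {t["name"] for t in record["tools"]}      # semantic alignment gate
    return all(c["name"] in declared for c in record["tool_calls"])
```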
zh

[AI-85] Studying the Effects of Robot Intervention on School Shooters in Virtual Reality

【速读】:该论文旨在解决高风险场景下(如校园枪击事件)如何通过机器人干预有效减少伤亡的问题。其解决方案的关键在于利用自主机器人预测并干扰袭击者行动路径,具体策略包括两种机器人接近方式(激进式直接拦截 vs. 被动式保持距离)和三种干扰强度(低、中、高),其中激进式且高干扰(含警报、灯光与烟雾)的机器人配置使受害者数量相对无机器人控制组减少了46.6%,凸显了机器人在提升安全性方面的潜力及伦理考量的重要性。

链接: https://arxiv.org/abs/2510.17948
作者: Christopher A McClurg,Alan R Wagner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Preprint under review for conference publication. 10 pages, 9 figures, 3 tables (including 1-page appendix)

点击查看摘要

Abstract:We advance the understanding of robotic intervention in high-risk scenarios by examining their potential to distract and impede a school shooter. To evaluate this concept, we conducted a virtual reality study with 150 university participants role-playing as a school shooter. Within the simulation, an autonomous robot predicted the shooter’s movements and positioned itself strategically to interfere and distract. The strategy the robot used to approach the shooter was manipulated – either moving directly in front of the shooter (aggressive) or maintaining distance (passive) – and the distraction method, ranging from no additional cues (low), to siren and lights (medium), to siren, lights, and smoke to impair visibility (high). An aggressive, high-distraction robot reduced the number of victims by 46.6% relative to a no-robot control. This outcome underscores both the potential of robotic intervention to enhance safety and the pressing ethical questions surrounding their use in school environments.
zh

[AI-86] Intuitionistic j-Do-Calculus in Topos Causal Models

【速读】:该论文旨在解决传统因果推理框架在处理复杂、非经典逻辑环境下的局限性问题,特别是如何在更一般的拓扑结构(如层范畴)中形式化因果干预与推断。其核心挑战在于将Pearl的do-calculus从经典的二值逻辑扩展到直觉主义逻辑语境下,以适应具有局部真理性质的因果模型。解决方案的关键是引入基于Lawvere-Tierney拓扑的j-稳定因果推理框架——通过定义一个模态算子 j 作用于子对象分类器 Ω,从而在层范畴内部构建基于Kripke-Joyal语义的局部真值体系,并在此基础上提出j-do-calculus:一套新的因果推理规则系统,其中前提和结论均为因果层范畴内部直觉主义逻辑公式,且因果推理被形式化为沿j-覆盖保持结构的态射。该方法确保了因果推断在局部真理下的稳定性,实现了对Pearl原始规则(插入/删除、行动/观测交换)的推广与严格形式化。

链接: https://arxiv.org/abs/2510.17944
作者: Sridhar Mahadevan
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 42 pages

点击查看摘要

Abstract:In this paper, we generalize Pearl’s do-calculus to an Intuitionistic setting called j-stable causal inference inside a topos of sheaves. Our framework is an elaboration of the recently proposed framework of Topos Causal Models (TCMs), where causal interventions are defined as subobjects. We generalize the original setting of TCM using the Lawvere-Tierney topology on a topos, defined by a modal operator j on the subobject classifier Ω. We introduce j-do-calculus, where we replace global truth with local truth defined by Kripke-Joyal semantics, and formalize causal reasoning as structure-preserving morphisms that are stable along j-covers. j-do-calculus is a sound rule system whose premises and conclusions are formulas of the internal Intuitionistic logic of the causal topos. We define j-stability for conditional independences and interventional claims as local truth in the internal logic of the causal topos. We give three inference rules that mirror Pearl’s insertion/deletion and action/observation exchange, and we prove soundness in the Kripke-Joyal semantics. A companion paper in preparation will describe how to estimate the required entities from data and instantiate j-do with standard discovery procedures (e.g., score-based and constraint-based methods), and will include experimental results on how to (i) form data-driven j-covers (via regime/section constructions), (ii) compute chartwise conditional independences after graph surgeries, and (iii) glue them to certify the premises of the j-do rules in practice.
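For readers unfamiliar with the machinery, the modal operator j referenced above is a Lawvere-Tierney topology, i.e., a morphism on the subobject classifier satisfying three standard axioms; these are textbook facts of topos theory, not notation specific to this paper.

```latex
% A Lawvere-Tierney topology is a map j : \Omega \to \Omega such that
\begin{align}
  j \circ \mathrm{true} &= \mathrm{true}, &
  j \circ j &= j, &
  j \circ \wedge &= \wedge \circ (j \times j).
\end{align}
% Informally, under Kripke-Joyal sheaf semantics, U \Vdash j\,\varphi holds when
% \varphi is true along a j-dense (covering) family over U; this "local truth"
% is the sense in which the paper's causal claims are j-stable.
```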
zh

[AI-87] Trust in foundation models and GenAI: A geographic perspective

【速读】:该论文旨在解决生成式地理人工智能(Generative GeoAI)在应用过程中因模型复杂性和多维依赖性而引发的信任碎片化问题。其核心挑战在于,随着基础模型在地理决策中的广泛应用,用户对模型的信赖需从训练数据可信度(epistemic trust)、功能运行可靠性(operational trust)以及开发者诚信度(interpersonal trust)三个维度协同构建,而现有研究缺乏系统框架来整合这些要素。解决方案的关键在于提出一个分层信任模型,并强调地理信息科学的独特视角——即通过增强透明度、识别并缓解偏见(bias mitigation)、推动区域性政策制定,从而为研究人员、从业者和政策制定者提供可操作的概念起点,以建立更稳健、负责任的GeoAI应用体系。

链接: https://arxiv.org/abs/2510.17942
作者: Grant McKenzie,Krzysztof Janowicz,Carsten Kessler
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale pre-trained machine learning models have reshaped our understanding of artificial intelligence across numerous domains, including our own field of geography. As with any new technology, trust has taken on an important role in this discussion. In this chapter, we examine the multifaceted concept of trust in foundation models, particularly within a geographic context. As reliance on these models increases and they become relied upon for critical decision-making, trust, while essential, has become a fractured concept. Here we categorize trust into three types: epistemic trust in the training data, operational trust in the model’s functionality, and interpersonal trust in the model developers. Each type of trust brings with it unique implications for geographic applications. Topics such as cultural context, data heterogeneity, and spatial relationships are fundamental to the spatial sciences and play an important role in developing trust. The chapter continues with a discussion of the challenges posed by different forms of biases, the importance of transparency and explainability, and ethical responsibilities in model development. Finally, the novel perspective of geographic information scientists is emphasized with a call for further transparency, bias mitigation, and regionally-informed policies. Simply put, this chapter aims to provide a conceptual starting point for researchers, practitioners, and policy-makers to better understand trust in (generative) GeoAI.
zh

[AI-88] Beyond More Context: Retrieval Diversity Boosts Multi-Turn Intent Understanding

【速读】:该论文旨在解决任务导向型对话系统中多轮意图理解(multi-turn intent understanding)在实际部署时面临的两个核心挑战:一是受限的token预算,二是噪声干扰下的上下文质量下降。现有检索流程通常侧重于相关性(relevance),却忽视了示例集合层面的多样性(diversity)以及诸如上下文长度或示例顺序等混杂因素对模型性能的影响。为应对这些问题,作者提出了一种多样性感知的检索框架(diversity-aware retrieval framework),其关键在于通过选择具有高意图覆盖度和语言多样性的上下文示例(in-context exemplars),平衡内容代表性与表达变异性,并将其与标准大语言模型(LLM)解码器集成。实验表明,在固定token预算下,该方法显著提升了Joint Goal Accuracy,优于多个强基线模型,且在不同示例数量(K=4–7)和模型规模下均保持稳定改进,从而验证了检索多样性对多轮意图理解的有效性。

链接: https://arxiv.org/abs/2510.17940
作者: Zhiming Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages,6 figs

点击查看摘要

Abstract:Multi-turn intent understanding is central to task-oriented chatbots, yet real deployments face tight token budgets and noisy contexts, and most retrieval pipelines emphasize relevance while overlooking set-level diversity and confounds such as more context or exemplar order. We ask whether retrieval diversity, rather than longer prompts, systematically improves LLM intent understanding under fixed budgets. We present a diversity-aware retrieval framework that selects in-context exemplars to balance intent coverage and linguistic variety, and integrates this selection with standard LLM decoders; the evaluation enforces budget-matched prompts and randomized positions, and includes sensitivity analyses over exemplar count, diversity strength, and backbone size. On MultiWOZ 2.4 and SGD, the approach achieves strong gains in Joint Goal Accuracy under equal token budgets, surpassing strong LLM/DST baselines, with consistent improvements across K from 4 to 7 and moderate latency. Overall, the study isolates and validates the impact of content diversity in retrieval and offers a simple, deployable selection principle for building accurate, budget-constrained multi-turn intent systems.
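A minimal sketch of diversity-aware exemplar selection in this spirit is greedy maximal marginal relevance (MMR) over embedding similarities. Assuming unit-normalized vectors, `lam` plays the role of the paper's "diversity strength"; this is an illustrative stand-in, not the paper's exact selector.

```python
# Greedy MMR: trade off relevance to the query against redundancy within
# the already-chosen exemplar set, under a fixed budget of k exemplars.
import numpy as np

def select_exemplars(query_vec, exemplar_vecs, k=5, lam=0.5):
    sims = exemplar_vecs @ query_vec          # relevance to the query
    pair = exemplar_vecs @ exemplar_vecs.T    # exemplar-exemplar similarity
    chosen = [int(np.argmax(sims))]
    while len(chosen) < k:
        rest = [i for i in range(len(sims)) if i not in chosen]
        mmr = [lam * sims[i] - (1 - lam) * max(pair[i, j] for j in chosen)
               for i in rest]
        chosen.append(rest[int(np.argmax(mmr))])
    return chosen  # indices balancing intent coverage and variety
```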
zh

[AI-89] The Integration of Artificial Intelligence in Undergraduate Medical Education in Spain: Descriptive Analysis and International Perspectives

【速读】:该论文旨在解决西班牙医学教育中人工智能(Artificial Intelligence, AI)整合程度不足、缺乏系统评估的问题。研究通过横断面调查分析了52所提供官方医学学位的西班牙高校在2025–2026学年课程设置中与AI相关的教学内容,发现仅有19.2%的院校开设专门课程,且多数为选修课,平均仅占总学分的1.17%,区域间差异显著。其关键解决方案在于建立全国统一的最低标准和持续监测指标体系,以推动AI能力培养在医学教育中的规范化、系统化和均衡化发展。

链接: https://arxiv.org/abs/2510.17938
作者: Ana Enériz Janeiro,Karina Pitombeira Pereira,Julio Mayol,Javier Crespo,Fernando Carballo,Juan B. Cabello,Manel Ramos-Casals,Bibiana Pérez Corbacho,Juan Turnes
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 1 figure, 4 main tables, 2 supplementary tables

点击查看摘要

Abstract:AI is transforming medical practice and redefining the competencies that future healthcare professionals need to master. Despite international recommendations, the integration of AI into Medicine curricula in Spain had not been systematically evaluated until now. A cross-sectional study (July-September 2025) including Spanish universities offering the official degree in Medicine, according to the ‘Register of Universities, Centers and Degrees (Registro de Universidades, Centros y Títulos RUCT)’. Curricula and publicly available institutional documentation were reviewed to identify courses and competencies related to AI in the 2025-2026 academic year. The analysis was performed using descriptive statistics. Of the 52 universities analyzed, ten (19.2%) offer specific AI courses, whereas 36 (69.2%) include no related content. Most of the identified courses are elective, with a credit load ranging from three to six ECTS, representing on average 1.17% of the total 360 credits of the degree. The University of Jaén is the only institution offering a compulsory course with AI content. The territorial analysis reveals marked disparities: Andalusia leads with 55.5% of its universities incorporating AI training, while several communities lack any initiative in this area. The integration of AI into the medical degree in Spain is incipient, fragmented, and uneven, with a low weight in ECTS. The limited training load and predominance of elective courses restrict the preparation of future physicians to practice in a healthcare environment increasingly mediated by AI. The findings support the establishment of minimum standards and national monitoring of indicators.
zh

[AI-90] UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

【速读】:该论文旨在解决多模态语言模型(Multimodal Language Model)在理解与推理能力,以及扩散模型(Diffusion Model)在多媒体生成能力之间缺乏统一框架的问题。现有方法通常将理解与生成任务分开处理,难以实现二者间的协同优化与交互增强。解决方案的关键在于提出UniRL-Zero——一个统一的强化学习(Reinforcement Learning, RL)框架,通过定义六种典型场景,系统性地建模理解与生成模型的联合训练机制,从而提升模型在多模态理解、推理及跨模态生成中的综合性能,并促进两者之间的有益交互。

链接: https://arxiv.org/abs/2510.17937
作者: Fu-Yun Wang,Han Zhang,Michael Gharbi,Hongsheng Li,Taesung Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction capabilities within a unified model. Our work defines six scenarios for unified model reinforcement learning, providing systematic baselines for reinforcement learning of unified understanding and generation models. Our code is available at this https URL.
zh

[AI-91] From Observations to Parameters: Detecting Changepoint in Nonlinear Dynamics with Simulation-based Inference

【速读】:该论文旨在解决在混沌时间序列中检测 regime shift(状态转变)的难题,其核心挑战在于观测空间中的信号与系统内在变异性高度纠缠,导致传统基于观测数据的检测方法性能受限。解决方案的关键在于提出 Parameter–Space Changepoint Detection (Param–CPD),这是一个两阶段框架:第一阶段利用基于模拟的推断(Simulation-Based Inference, SBI)训练神经后验估计器,对系统控制参数进行贝叶斯推断并实现参数轨迹的高效估计;第二阶段在所得参数轨迹上应用标准 changepoint detection (CPD) 算法。通过将检测任务从观测空间转移到具有物理可解释性的参数空间,该方法显著提升了 F1 分数、降低了定位误差和误报率,并验证了后验分布的可识别性和校准性,从而实现了更准确且可解释的状态转变检测。

链接: https://arxiv.org/abs/2510.17933
作者: Xiangbo Deng,Cheng Chen,Peng Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:Detecting regime shifts in chaotic time series is hard because observation-space signals are entangled with intrinsic variability. We propose Parameter–Space Changepoint Detection (Param–CPD), a two–stage framework that first amortizes Bayesian inference of governing parameters with a neural posterior estimator trained by simulation-based inference, and then applies a standard CPD algorithm to the resulting parameter trajectory. On Lorenz–63 with piecewise-constant parameters, Param–CPD improves F1, reduces localization error, and lowers false positives compared to observation–space baselines. We further verify identifiability and calibration of the inferred posteriors on stationary trajectories, explaining why parameter space offers a cleaner detection signal. Robustness analyses over tolerance, window length, and noise indicate consistent gains. Our results show that operating in a physically interpretable parameter space enables accurate and interpretable changepoint detection in nonlinear dynamical systems.
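The two-stage pipeline can be sketched end to end. Below, the amortized neural posterior of stage one is replaced by a crude algebraic proxy (inverting the Lorenz-63 equation dy/dt = x(ρ − z) − y for ρ) so the example stays dependency-free, and stage two is a single mean-shift changepoint detector; in the paper these roles are played by simulation-based inference and a standard CPD algorithm.

```python
# Stage 1: observations -> parameter trajectory. Stage 2: CPD on that trajectory.
import numpy as np

def lorenz63(rho_schedule, dt=0.01, sigma=10.0, beta=8.0 / 3.0):
    n = len(rho_schedule)
    xyz = np.empty((n, 3)); xyz[0] = (1.0, 1.0, 1.0)
    for t in range(n - 1):
        x, y, z = xyz[t]; rho = rho_schedule[t]
        xyz[t + 1] = xyz[t] + dt * np.array(
            [sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    return xyz

rho_true = np.r_[np.full(5000, 28.0), np.full(5000, 38.0)]  # shift at t=5000
x, y, z = lorenz63(rho_true).T

# Stage-1 proxy: invert dy/dt = x*(rho - z) - y for rho (finite differences).
dy = np.gradient(y, 0.01)
mask = np.abs(x) > 1.0                       # avoid blow-ups near x ~ 0
rho_hat = (dy[mask] + y[mask]) / x[mask] + z[mask]

# Stage 2: pick the split maximizing the between-segment mean shift.
def one_changepoint(s, margin=100):
    scores = [abs(s[:t].mean() - s[t:].mean())
              for t in range(margin, len(s) - margin)]
    return margin + int(np.argmax(scores))

print("estimated changepoint index (in masked samples):", one_changepoint(rho_hat))
```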
zh

[AI-92] From Charts to Code: A Hierarchical Benchmark for Multimodal Models

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在图表理解与代码生成任务中缺乏系统性、层次化评估基准的问题。现有方法难以全面衡量模型在真实场景下从图表到代码(chart-to-code)的多阶段能力,尤其在复杂编辑和长表格转图表等高阶任务上表现不足。解决方案的关键在于提出首个分层式基准测试集 Chart2Code,其结构包含三个递进难度层级:Level 1(图表复现)、Level 2(图表编辑)和 Level 3(长表转图表),覆盖22种图表类型及多维度评价指标(代码正确性与可视化保真度),并首次在25个前沿大视觉语言模型(Large Multimodal Models, LMMs)上进行实证评测,揭示当前最先进模型如GPT-5在复杂任务中的显著局限性,从而推动多模态推理能力的提升与通用性强的LMMs发展。

链接: https://arxiv.org/abs/2510.17932
作者: Jiahao Tang,Henry Hengyuan Zhao,Lijian Wu,Yifei Tao,Dongxing Mao,Yang Wan,Jingru Tan,Min Zeng,Min Li,Alex Jinpeng Wang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Chart2Code, a new benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models (LMMs). Chart2Code is explicitly designed from a user-driven perspective, capturing diverse real-world scenarios and progressively increasing task difficulty. It consists of three levels: Level 1 (Chart Reproduction) reproduces charts from a reference figure and user query; Level 2 (Chart Editing) involves complex modifications such as changing chart types or adding elements; and Level 3 (Long-Table to Chart Generation) requires models to transform long, information-dense tables into faithful charts following user instructions. To our knowledge, this is the first hierarchical benchmark that reflects practical chart2code usage while systematically scaling task complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types, paired with multi-level evaluation metrics that assess both code correctness and the visual fidelity of rendered charts. We benchmark 25 state-of-the-art (SoTA) LMMs, including both proprietary and the latest open-source models such as GPT-5, Qwen2.5-VL, InternVL3/3.5, MiMo-VL, and Seed-1.6-VL. Experimental results demonstrate that even the SoTA model GPT-5 averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, underscoring the difficulty of Chart2Code. We anticipate this benchmark will drive advances in multimodal reasoning and foster the development of more robust and general-purpose LMMs. Our code and data are available on Chart2Code.
zh

[AI-93] Attracting Commercial Artificial Intelligence Firms to Support National Security through Collaborative Contracts

【速读】:该论文试图解决的问题是:为何商业人工智能(Artificial Intelligence, AI)企业虽视美国国防部(Department of Defense, DoD)为有吸引力的客户,却仍面临与DoD合作的显著障碍,以及如何通过优化合同法律和采购框架来促进此类合作。解决方案的关键在于引入“最优买家理论”(optimal buyer theory),基于社会交换理论构建分析框架,识别出商业AI企业更倾向于与DoD签订与其业务和技术考量一致的合同;并提出利用现有合同授权机制(如其他交易授权,Other Transaction Authority, OTA)来调整采购实践,使其与机器学习开发与部署生命周期相匹配,从而提升商业AI企业参与国防市场的意愿与效率。

链接: https://arxiv.org/abs/2510.17931
作者: Andrew Bowne
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 312 pages, 42 figures

点击查看摘要

Abstract:Unlike other military technologies driven by national security needs and developed with federal funding, AI is predominantly funded and advanced by commercial industry for civilian applications. However, there is a lack of understanding of the reasons commercial AI firms decide to work with the DoD or choose to abstain from the defence market. This thesis argues that the contract law and procurement framework are among the most significant obstacles. This research indicates that the commercial AI industry actually views the DoD as an attractive customer. However, this attraction is despite the obstacles presented by traditional contract law and procurement practices used to solicit and award contracts. Drawing on social exchange theory, this thesis introduces a theoretical framework, optimal buyer theory, to understand the factors that influence a commercial decision to engage with the DoD. Interviews from a sample of the participants explain why the AI industry holds such perceptions, opinions, and preferences about contracts generally and the DoD, specifically, in its role as a customer. This thesis concludes that commercial AI firms are attracted to contracts that are consistent with their business and technology considerations. Additionally, it develops best practices for leveraging existing contract law, primarily other transaction authority, to align contracting practices with commercial preferences and the machine learning development and deployment lifecycle.
zh

[AI-94] EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning

【速读】:该论文旨在解决当前生成式 AI(Generative AI)在构建通用可验证合成数据时面临的挑战,即由于幻觉导致的生成不可靠性,以及验证机制薄弱或过于简单,无法有效区分强弱解的问题。现有方法通常依赖于任务特定的启发式规则或事后过滤器,缺乏跨领域的泛化能力和统一的可验证性评估标准。解决方案的关键在于提出一种进化式、任务无关、策略引导且可执行验证的数据合成框架,该框架从少量种子监督出发,协同生成问题、多样化的候选解和验证结构,并通过基于一致性的评估器迭代发现有效策略,从而将筛选提升为原理驱动的合成过程,显著提升了训练实例的连贯性和可验证性,并实现了无需领域特定规则的泛化能力。

链接: https://arxiv.org/abs/2510.17928
作者: He Du,Bowen Li,Aijun Yang,Siyang He,Qipeng Guo,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Reliable verifiable data has become a key driver of capability gains in modern language models, enabling stable reinforcement learning with verifiable rewards and effective distillation that transfers competence across math, coding, and agentic tasks. Yet constructing generalizable synthetic verifiable data remains difficult due to hallucination-prone generation, and weak or trivial verification artifacts that fail to separate strong from weak solutions. Existing approaches often rely on task-specific heuristics or post-hoc filters that do not transfer across domains and lack a principled, universal evaluator of verifiability. In this work, we introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework that, from minimal seed supervision, jointly synthesizes problems, diverse candidate solutions, and verification artifacts, and iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks. This pipeline upgrades filtering into principled synthesis: it reliably assembles coherent, verifiable training instances and generalizes without domain-specific rules. Our experiments demonstrate the effectiveness of the proposed approach under both RLVR and model distillation training paradigms. The results show that training with our synthesized data yields significant improvements on both the LiveCodeBench and AgentBench-OS tasks, highlighting the robust generalization of our framework.
zh

[AI-95] SpecAgent: A Speculative Retrieval and Forecasting Agent for Code Completion

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在真实软件仓库中因缺乏项目特定API和跨文件依赖关系而导致的代码生成性能下降问题,同时应对检索增强方法在推理时引入高延迟、影响用户体验的挑战。其解决方案的关键在于提出SpecAgent——一个在索引阶段异步主动探索仓库文件并构建推测性上下文(speculative context)的智能代理,该上下文可提前预测未来编辑内容,从而在不增加推理延迟的前提下提升代码生成质量;此外,论文还识别出现有基准测试中存在的“未来上下文泄露”问题,并构建了一个无泄漏的合成基准以实现更真实的评估。实验表明,SpecAgent相较最优基线在绝对指标上提升9–11%(相对提升48–58%),且显著降低推理延迟。

链接: https://arxiv.org/abs/2510.17925
作者: George Ma,Anurag Koul,Qi Chen,Yawen Wu,Sachit Kuhar,Yu Yu,Aritra Sengupta,Varun Kumar,Murali Krishna Ramanathan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) excel at code-related tasks but often struggle in realistic software repositories, where project-specific APIs and cross-file dependencies are crucial. Retrieval-augmented methods mitigate this by injecting repository context at inference time. Under a tight inference-time latency budget, however, either retrieval quality suffers or the added latency adversely impacts user experience. We address this limitation with SpecAgent, an agent that improves both latency and code-generation quality by proactively exploring repository files during indexing and constructing speculative context that anticipates future edits in each file. This indexing-time asynchrony allows thorough context computation, masking latency, and the speculative nature of the context improves code-generation quality. Additionally, we identify the problem of future context leakage in existing benchmarks, which can inflate reported performance. To address this, we construct a synthetic, leakage-free benchmark that enables a more realistic evaluation of our agent against baselines. Experiments show that SpecAgent consistently achieves absolute gains of 9-11% (48-58% relative) compared to the best-performing baselines, while significantly reducing inference latency.
zh

[AI-96] Rewarding the Journey Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)方法在大型语言模型(Large Language Models, LLMs)训练中面临的可扩展性瓶颈问题,即现有方法高度依赖人工标注的偏好数据或标签数据进行奖励建模,难以实现大规模、持续性的自主学习。其解决方案的关键在于提出一种无需外部监督的测试时奖励机制——COMPASS(Composite Path and Answer Self-Scoring),该机制通过两个互补组件协同工作:一是双校准答案奖励(Dual-Calibration Answer Reward, DCAR),利用置信度与可信度校准构建可靠的伪标签以稳定训练;二是决定性路径奖励(Decisive Path Reward, DPR),直接优化推理链的质量而非仅依赖结果正确性。这种联合强化可信共识答案与高决策力推理路径的设计,显著提升了模型的分析能力,并推动LLMs向从连续经验流中自主学习的方向发展。

链接: https://arxiv.org/abs/2510.17923
作者: Chenwei Tang,Jingyu Xing,Xinyu Liu,Wei Ju,Jiancheng Lv,Deng Xiong,Ziyue Qiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches like Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes the reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, the COMPASS systematically enhances the model’s analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.
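A minimal sketch of the answer-side idea: derive a pseudo-label from sampled rollouts, but gate and scale the reward by the vote share so an untrustworthy consensus yields no learning signal. The threshold and scaling below are illustrative, not the paper's DCAR/DPR formulas.

```python
# Consensus pseudo-labeling with a confidence gate, in the spirit of
# test-time reward estimation without ground-truth supervision.
from collections import Counter

def consensus_reward(sampled_answers, answer, conf_floor=0.6):
    votes = Counter(sampled_answers)
    pseudo_label, count = votes.most_common(1)[0]
    confidence = count / len(sampled_answers)
    if confidence < conf_floor:      # untrustworthy consensus: abstain
        return 0.0
    # Reward agreement with the pseudo-label, scaled by consensus strength.
    return confidence if answer == pseudo_label else 0.0

rollouts = ["42", "42", "41", "42", "7", "42"]
print(consensus_reward(rollouts, "42"))  # 4/6 agreement -> ~0.67 reward
```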
zh

[AI-97] ParaVul: A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection

【速读】:该论文旨在解决智能合约漏洞检测中传统静态分析与形式化验证方法存在的误报率高、可扩展性差,以及现有基于大语言模型(Large Language Models, LLMs)的方法在推理成本和计算开销方面表现不佳的问题。其解决方案的关键在于提出一个并行的LLM与检索增强框架ParaVul:首先通过稀疏低秩适配(Sparse Low-Rank Adaptation, SLoRA)对LLM进行微调,在保持模型理解能力的同时显著降低计算资源消耗;其次构建融合密集检索与BM25算法的混合检索增强生成(Retrieval-Augmented Generation, RAG)系统以辅助验证LLM输出;最后引入元学习模型融合RAG与LLM结果,生成最终检测结论,并设计思维链提示(chain-of-thought prompts)自动生成结构化漏洞报告,从而在准确性和可靠性上实现显著提升。

链接: https://arxiv.org/abs/2510.17919
作者: Tenghui Huang,Jinbo Wen,Jiawen Kang,Siyong Chen,Zhengtao Li,Tao Zhang,Dongning Liu,Jiacheng Wang,Chengjun Cai,Yinqiu Liu,Dusit Niyato
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Smart contracts play a significant role in automating blockchain services. Nevertheless, vulnerabilities in smart contracts pose serious threats to blockchain security. Currently, traditional detection methods primarily rely on static analysis and formal verification, which can result in high false-positive rates and poor scalability. Large Language Models (LLMs) have recently made significant progress in smart contract vulnerability detection. However, they still face challenges such as high inference costs and substantial computational overhead. In this paper, we propose ParaVul, a parallel LLM and retrieval-augmented framework to improve the reliability and accuracy of smart contract vulnerability detection. Specifically, we first develop Sparse Low-Rank Adaptation (SLoRA) for LLM fine-tuning. SLoRA introduces sparsification by incorporating a sparse matrix into quantized LoRA-based LLMs, thereby reducing computational overhead and resource requirements while enhancing their ability to understand vulnerability-related issues. We then construct a vulnerability contract dataset and develop a hybrid Retrieval-Augmented Generation (RAG) system that integrates dense retrieval with Best Matching 25 (BM25), assisting in verifying the results generated by the LLM. Furthermore, we propose a meta-learning model to fuse the outputs of the RAG system and the LLM, thereby generating the final detection results. After completing vulnerability detection, we design chain-of-thought prompts to guide LLMs to generate comprehensive vulnerability detection reports. Simulation results demonstrate the superiority of ParaVul, especially in terms of F1 scores, achieving 0.9398 for single-label detection and 0.9330 for multi-label detection.
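The hybrid retrieval stage can be sketched as score fusion between BM25 and dense cosine similarity. The sketch below uses the rank_bm25 package and a placeholder `embed` encoder; the min-max normalization and fusion weight are illustrative choices, not necessarily ParaVul's.

```python
# Hybrid retrieval sketch: fuse lexical (BM25) and dense (cosine) evidence.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query, docs, embed, alpha=0.5):
    bm25 = BM25Okapi([d.split() for d in docs])
    lex = np.array(bm25.get_scores(query.split()))          # lexical scores
    dvecs = np.array([embed(d) for d in docs])
    qvec = embed(query)
    dense = dvecs @ qvec / (np.linalg.norm(dvecs, axis=1)   # cosine scores
                            * np.linalg.norm(qvec))
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    return alpha * norm(lex) + (1 - alpha) * norm(dense)    # fused ranking
```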
zh

[AI-98] Data Unlearning Beyond Uniform Forgetting via Diffusion Time and Frequency Selection

【速读】:该论文旨在解决扩散模型(diffusion models)中数据遗忘(data unlearning)问题,即如何在不进行完整重训练的前提下,有效移除特定训练样本对已训练模型的影响。现有方法通常在所有扩散时间步上均匀尝试遗忘样本,导致生成质量下降或遗忘不彻底。其关键解决方案是提出一种时间-频率选择性策略:基于观察发现遗忘过程在不同时间步和频段上分布不均,通过有选择地聚焦于特定时间-频率范围进行训练,显著提升未被遗忘样本的美学质量和噪声控制水平。该方法在梯度优化与偏好优化目标以及图像级和文本到图像任务中均表现出一致性改进,并引入归一化版SSCD评估指标以更准确衡量删除效果与生成质量。

链接: https://arxiv.org/abs/2510.17917
作者: Jinseong Park,Mijung Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Preprint

点击查看摘要

Abstract:Data unlearning aims to remove the influence of specific training samples from a trained model without requiring full retraining. Unlike concept unlearning, data unlearning in diffusion models remains underexplored and often suffers from quality degradation or incomplete forgetting. To address this, we first observe that most existing methods attempt to unlearn the samples at all diffusion time steps equally, leading to poor-quality generation. We argue that forgetting occurs disproportionately across time and frequency, depending on the model and scenarios. By selectively focusing on specific time-frequency ranges during training, we achieve samples with higher aesthetic quality and lower noise. We validate this improvement by applying our time-frequency selective approach to diverse settings, including gradient-based and preference optimization objectives, as well as both image-level and text-to-image tasks. Finally, to evaluate both deletion and quality of unlearned data samples, we propose a simple normalized version of SSCD. Together, our analysis and methods establish a clearer understanding of the unique challenges in data unlearning for diffusion models, providing practical strategies to improve both evaluation and unlearning performance.
zh

[AI-99] Self-Evidencing Through Hierarchical Gradient Decomposition: A Dissipative System That Maintains Non-Equilibrium Steady-State by Minimizing Variational Free Energy

【速读】:该论文试图解决自由能原理(Free Energy Principle, FEP)从理论到可实现算法的转化难题,即如何在生物可实现的局部规则下实现系统对变分自由能的最小化。其解决方案的关键在于提出了一种精确的层级信用分配机制:通过反馈对齐(feedback alignment)实现空间信用分配,通过资格迹(eligibility traces)实现时间信用分配,并通过营养场图(Trophic Field Map, TFM)实现结构信用分配——TFM能够精确估计每条连接块的期望梯度幅度,从而实现局部、精确的梯度传播。该机制在多个维度上验证了其有效性,包括与最优梯度高度相关(Pearson相关系数0.9693)、任务干扰后高达98.6%的保留率、75%结构损伤下的自主恢复能力以及无需经验回放的样本高效强化学习表现,最终实现了FEP在生物合理性前提下的完整落地。

链接: https://arxiv.org/abs/2510.17916
作者: Michael James McCulloch
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 30 pages, 13 Figures

点击查看摘要

Abstract:The Free Energy Principle (FEP) states that self-organizing systems must minimize variational free energy to persist, but the path from principle to implementable algorithm has remained unclear. We present a constructive proof that the FEP can be realized through exact local credit assignment. The system decomposes gradient computation hierarchically: spatial credit via feedback alignment, temporal credit via eligibility traces, and structural credit via a Trophic Field Map (TFM) that estimates expected gradient magnitude for each connection block. We prove these mechanisms are exact at their respective levels and validate the central claim empirically: the TFM achieves 0.9693 Pearson correlation with oracle gradients. This exactness produces emergent capabilities including 98.6% retention after task interference, autonomous recovery from 75% structural damage, self-organized criticality (spectral radius ρ ≈ 1.0), and sample-efficient reinforcement learning on continuous control tasks without replay buffers. The architecture unifies Prigogine’s dissipative structures, Friston’s free energy minimization, and Hopfield’s attractor dynamics, demonstrating that exact hierarchical inference over network topology can be implemented with local, biologically plausible rules.
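The spatial credit-assignment rule named above, feedback alignment, is easy to state concretely: the backward pass uses a fixed random matrix in place of the transposed forward weights, so credit propagates with purely local information. A minimal single-hidden-layer sketch with illustrative dimensions:

```python
# Feedback alignment: B2 replaces W2.T in the backward pass and is never trained.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (64, 32)), rng.normal(0, 0.1, (32, 10))
B2 = rng.normal(0, 0.1, (10, 32))           # fixed random feedback matrix

def step(x, target, lr=0.01):
    global W1, W2
    h = np.tanh(x @ W1)                     # forward pass
    y = h @ W2
    e = y - target                          # output error
    dh = (e @ B2) * (1 - h ** 2)            # credit via B2, not W2.T
    W2 -= lr * np.outer(h, e)               # local weight updates
    W1 -= lr * np.outer(x, dh)
    return float((e ** 2).mean())
```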
zh

[AI-100] Uncertainty-Aware Post-Hoc Calibration: Mitigating Confidently Incorrect Predictions Beyond Calibration Metrics

【速读】:该论文旨在解决现有神经网络校准方法普遍采用全局变换、忽视个体预测可靠性差异的问题,并探索校准改进与不确定性感知决策之间关系的空白。其核心解决方案是提出一种基于实例级自适应的后处理校准框架,关键在于利用基于邻近性的分位数预测(proximity-based conformal prediction)将校准样本根据特征空间中的语义相似性划分为“可能正确”和“可能错误”两组,进而实施双策略校准:对“可能正确”预测使用标准等倾回归(isotonic regression)提升置信度一致性;对“可能错误”预测则采用欠自信正则化等倾回归,使其置信度趋向均匀分布,从而更易识别异常预测并支持后续不确定性驱动的决策。该方法无需模型重训练,在CIFAR-10和CIFAR-100数据集上验证了其在降低高置信度误判率和保持良好期望校准误差(Expected Calibration Error)方面的有效性。

链接: https://arxiv.org/abs/2510.17915
作者: Hassan Gharoun,Mohammad Sadegh Khorshidi,Kasra Ranjbarigderi,Fang Chen,Amir H. Gandomi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 53 pages, 12 figures, 12 tables

点击查看摘要

Abstract:Despite extensive research on neural network calibration, existing methods typically apply global transformations that treat all predictions uniformly, overlooking the heterogeneous reliability of individual predictions. Furthermore, the relationship between improved calibration and effective uncertainty-aware decision-making remains largely unexplored. This paper presents a post-hoc calibration framework that leverages prediction reliability assessment to jointly enhance calibration quality and uncertainty-aware decision-making. The framework employs proximity-based conformal prediction to stratify calibration samples into putatively correct and putatively incorrect groups based on semantic similarity in feature space. A dual calibration strategy is then applied: standard isotonic regression calibrates confidence for putatively correct predictions, while underconfidence-regularized isotonic regression shifts confidence toward uniform distributions for putatively incorrect predictions, facilitating their identification for further investigation. A comprehensive evaluation is conducted using calibration metrics, uncertainty-aware performance measures, and empirical conformal coverage. Experiments on CIFAR-10 and CIFAR-100 with BiT and CoAtNet backbones show that the proposed method yields fewer confidently incorrect predictions and competitive Expected Calibration Error compared with isotonic and focal-loss baselines. This work bridges calibration and uncertainty quantification through instance-level adaptivity, offering a practical post-hoc solution that requires no model retraining while improving both probability alignment and uncertainty-aware decision-making.
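A minimal sketch of the dual calibration strategy, assuming the proximity-based conformal split has already produced a boolean mask: fit standard isotonic regression on the putatively correct group, and fit the putatively incorrect group against targets shrunk toward the uniform probability 1/K. The mixing weight and regularized target are one plausible reading of "underconfidence-regularized", not the paper's exact formulation.

```python
# Dual isotonic calibration over a reliability-based split.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_dual_calibrators(conf, is_correct, putative_ok, n_classes=10, lam=0.7):
    """conf: max-softmax scores; is_correct: 0/1 floats; putative_ok: mask
    from the proximity-based conformal split (assumed precomputed)."""
    iso_ok = IsotonicRegression(out_of_bounds="clip")
    iso_ok.fit(conf[putative_ok], is_correct[putative_ok])   # standard fit
    bad = ~putative_ok
    # Underconfidence regularization: shrink targets toward uniform 1/K so
    # putatively wrong predictions surface as low-confidence outliers.
    reg_target = lam * (1.0 / n_classes) + (1.0 - lam) * is_correct[bad]
    iso_bad = IsotonicRegression(out_of_bounds="clip")
    iso_bad.fit(conf[bad], reg_target)
    return iso_ok, iso_bad
```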
zh

[AI-101] TACLA: An LLM-Based Multi-Agent Tool for Transactional Analysis Training in Education ICTAI2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在模拟复杂人类社会动态时缺乏心理深度和一致人格行为的问题,这对构建高保真度的训练工具构成关键挑战。其解决方案的核心在于提出TACLA(Transactional Analysis Contextual LLM-based Agents)多智能体架构,该架构基于交易分析(Transactional Analysis, TA)理论,将每个代理建模为由父母(Parent)、成人(Adult)和儿童(Child)三种自我状态组成的协同系统,每种自我状态具有独立的模式记忆;并通过一个协调代理(Orchestrator Agent)根据情境触发因素和个体生命脚本(life script)动态激活相应自我状态,从而实现心理上真实的响应与自我状态转换,有效模拟冲突的化解与升级,验证了其在教育场景中的高对话可信度和动态心理建模能力。

链接: https://arxiv.org/abs/2510.17913
作者: Monika Zamojska,Jarosław A. Chudziak
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted for publication in the proceedings of ICTAI 2025

点击查看摘要

Abstract:Simulating nuanced human social dynamics with Large Language Models (LLMs) remains a significant challenge, particularly in achieving psychological depth and consistent persona behavior crucial for high-fidelity training tools. This paper introduces TACLA (Transactional Analysis Contextual LLM-based Agents), a novel Multi-Agent architecture designed to overcome these limitations. TACLA integrates core principles of Transactional Analysis (TA) by modeling agents as an orchestrated system of distinct Parent, Adult, and Child ego states, each with its own pattern memory. An Orchestrator Agent prioritizes ego state activation based on contextual triggers and an agent’s life script, ensuring psychologically authentic responses. Validated in an educational scenario, TACLA demonstrates realistic ego state shifts in Student Agents, effectively modeling conflict de-escalation and escalation based on different teacher intervention strategies. Evaluation shows high conversational credibility and confirms TACLA’s capacity to create dynamic, psychologically-grounded social simulations, advancing the development of effective AI tools for education and beyond.
zh

[AI-102] Activation Manifold Projection: Liberating Task-Specific Behaviors from LLM Architectures

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)中因架构锁定(architectural lock-in)导致的微调行为迁移难题:即通过低秩适应(Low-Rank Adaptation, LoRA)等方法训练得到的任务特定行为被“困”在源模型的固定架构中,难以迁移到目标模型。现有方法依赖静态权重空间对齐,其本质是间接且脆弱的,易受参数几何结构间松散关联的影响。论文提出了一种全新的解决方案——卡匣激活空间迁移(Cartridge Activation Space Transfer, CAST),其核心在于学习两个不同LLM架构之间激活流形(activation manifolds)的非线性映射关系,而非直接对齐权重空间。CAST将预训练LoRA视为冻结的“行为核”,通过轻量级双向投影头实现目标模型激活到源模型隐空间的转换、应用冻结核并回映,整个过程仅需通用文本语料训练,无需任务特定数据,从而真正解耦了技能与源架构,实现了零样本(zero-shot)LoRA适配器迁移,并在异构模型族(如Llama-2与Mistral)间达到原生重训练LoRA性能的85–95%,显著优于当前基于权重空间的方法,确立了模型互操作性的新基准。

链接: https://arxiv.org/abs/2510.17902
作者: Al Kari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of Large Language Model (LLM) architectures presents a fundamental challenge: valuable, task-specific behaviors learned through fine-tuning methods like Low-Rank Adaptation (LoRA) are effectively trapped within their source model’s architecture, herein referred to architectural lock-in. Existing transfer methods attempt to bridge this gap by aligning the static weight spaces of models, a brittle and indirect approach that relies on tenuous correlations between parameter geometries. This paper introduces a fundamentally different and more direct paradigm: the Cartridge Activation Space Transfer (CAST), a novel framework that liberates LoRA-encoded behaviors by learning a direct, nonlinear mapping between the activation manifolds, the geometric structures formed by the model’s internal neuron activations, of two distinct LLM architectures. CAST treats a pre-trained LoRA as a frozen “behavioral kernel.” It learns a set of lightweight, bidirectional projection heads that translate the target model’s activation stream into the source model’s latent space, apply the frozen kernel, and project the result back. This process, trained on a general text corpus without any task-specific data, effectively decouples the learned skill from the source architecture. We demonstrate that CAST enables true “zero-shot” translation of any standard LoRA adapter. Our experiments, including transfers between heterogeneous model families like Llama-2 and Mistral, show that CAST-translated adapters achieve 85-95% of the performance of a LoRA fully retrained on the target model, quantitatively outperforming current weight-space transfer techniques and establishing a new state-of-the-art in model interoperability.
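CAST's translate-apply-project pattern can be sketched in a few lines of PyTorch: trainable heads map target-model hidden states into the source latent space, the frozen LoRA factors are applied there, and the resulting delta is projected back and re-injected residually. Dimensions and initialization details are illustrative, not the paper's implementation.

```python
# CAST-style adapter sketch: only the two projection heads are trainable.
import torch
import torch.nn as nn

class CASTAdapter(nn.Module):
    def __init__(self, d_target, d_source, lora_A, lora_B):
        super().__init__()
        self.to_source = nn.Linear(d_target, d_source)   # trainable head
        self.to_target = nn.Linear(d_source, d_target)   # trainable head
        # Frozen "behavioral kernel": LoRA factors trained on the source model.
        self.register_buffer("lora_A", lora_A)           # (rank, d_source)
        self.register_buffer("lora_B", lora_B)           # (d_source, rank)

    def forward(self, h_target):                         # (batch, d_target)
        h_src = self.to_source(h_target)                 # translate to source space
        delta = (h_src @ self.lora_A.T) @ self.lora_B.T  # apply frozen kernel
        return h_target + self.to_target(delta)          # project back, residual add
```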
zh

[AI-103] The Sherpa.ai Blind Vertical Federated Learning Paradigm to Minimize the Number of Communications

【速读】:该论文旨在解决垂直联邦学习(Vertical Federated Learning, VFL)在实际应用中因通信开销过大而导致的隐私安全风险高、能耗大甚至训练不可行的问题。其核心解决方案是提出一种新型范式——盲垂直联邦学习(Sherpa.ai Blind Vertical Federated Learning, SBVFL),通过将绝大多数节点更新从服务器中解耦,显著减少节点与服务器之间的通信量,实验表明该方法可使通信成本降低约99%,同时保持模型精度和鲁棒性,从而实现适用于医疗、金融等敏感领域的高效、隐私保护型VFL。

链接: https://arxiv.org/abs/2510.17901
作者: Alex Acero,Daniel M. Jimenez-Gutierrez,Dario Pighin,Enrique Zuazua,Joaquin Del Rio,Xabi Uribe-Etxebarria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative decentralized training across multiple parties (nodes) while keeping raw data private. There are two main paradigms in FL: Horizontal FL (HFL), where all participant nodes share the same feature space but hold different samples, and Vertical FL (VFL), where participants hold complementary features for the same samples. While HFL is widely adopted, VFL is employed in domains where nodes hold complementary features about the same samples. Still, VFL presents a significant limitation: the vast number of communications required during training. This compromises privacy and security, can lead to high energy consumption, and in some cases makes model training unfeasible. In this paper, we introduce Sherpa.ai Blind Vertical Federated Learning (SBVFL), a novel paradigm that leverages a distributed training mechanism enhanced for privacy and security. Decoupling the vast majority of node updates from the server dramatically reduces node-server communication. Experiments show that SBVFL reduces communication by ~99% compared to standard VFL while maintaining accuracy and robustness. Therefore, SBVFL enables practical, privacy-preserving VFL across sensitive domains, including healthcare, finance, manufacturing, aerospace, cybersecurity, and the defense industry.
zh

[AI-104] Automated Algorithm Design for Auto-Tuning Optimizers

【速读】:该论文旨在解决自动性能调优(auto-tuning)中因参数空间庞大且不规则而导致的人工调优不可行的问题,以及现有优化算法难以在所有调优任务中保持最优性能的挑战。其解决方案的关键在于引入一种新范式:利用大语言模型(Large Language Models, LLMs)根据问题描述和搜索空间特征自动生成针对特定调优任务定制的优化策略。通过迭代生成与评估机制,这些由LLM生成的优化算法在四个真实世界的自动调优应用中展现出显著优于传统人类设计优化算法的性能,平均提升达72.4%,证明了基于LLM的自动化优化策略生成方法的有效性与潜力。

链接: https://arxiv.org/abs/2510.17899
作者: Floris-Jan Willemsen,Niki van Stein,Ben van Werkhoven
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular parameter spaces make manual exploration infeasible. Traditionally, auto-tuning relies on well-established optimization algorithms such as evolutionary algorithms, annealing methods, or surrogate model-based optimizers to efficiently find near-optimal configurations. However, designing effective optimizers remains challenging, as no single method performs best across all tuning tasks. In this work, we explore a new paradigm: using large language models (LLMs) to automatically generate optimization algorithms tailored to auto-tuning problems. We introduce a framework that prompts LLMs with problem descriptions and search-space characteristics to produce specialized optimization strategies, which are iteratively examined and improved. These generated algorithms are evaluated on four real-world auto-tuning applications across six hardware platforms and compared against the state-of-the-art in optimization algorithms of two contemporary auto-tuning frameworks. The evaluation demonstrates that providing additional application- and search space-specific information in the generation stage results in an average performance improvement of 30.7% and 14.6%, respectively. In addition, our results show that LLM-generated optimizers can rival, and in various cases outperform, existing human-designed algorithms, with our best-performing generated optimization algorithms achieving, on average, 72.4% improvement over state-of-the-art optimizers for auto-tuning.
zh

[AI-105] L-MoE: End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在参数规模扩展与任务专业化微调之间的权衡问题:一方面,混合专家(Mixture of Experts, MoE)架构虽能通过稀疏激活实现万亿级参数的高效推理,但其专家通常为密集前馈网络,难以灵活适配特定任务;另一方面,低秩适应(Low-Rank Adaptation, LoRA)虽能以极低参数开销实现高效微调,却缺乏对多任务能力的动态组合机制。解决方案的关键在于提出L-MoE框架——一种端到端可训练的轻量级LoRA专家混合模型,将MoE中的专家重新定义为任务专用的低秩适配器(LoRA adapters),并引入一个轻量级门控网络联合优化专家参数与路由策略,通过可微分的加权平均机制动态组合不同LoRA适配器,从而实现参数高效、模块化且支持动态技能组合的MoE结构。

链接: https://arxiv.org/abs/2510.17898
作者: Shihao Ji,Zihui Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Mixture of Experts (MoE) architecture enables the scaling of Large Language Models (LLMs) to trillions of parameters by activating a sparse subset of weights for each input, maintaining constant computational cost during inference. Concurrently, Low-Rank Adaptation (LoRA) has emerged as a dominant technique for parameter-efficiently fine-tuning LLMs on specialized tasks. In this work, we unify these two paradigms into a novel, end-to-end trainable framework named L-MoE: a Lightweight Mixture of LoRA Experts. L-MoE redefines MoE experts not as dense feed-forward networks, but as a collection of task-specialized, low-rank adapters. A lightweight gating network, trained jointly with the experts, learns to dynamically compose these LoRA adapters by computing a weighted average of their parameters for each input token. This composition is fully differentiable, allowing gradients from a standard auto-regressive language modeling objective to flow back through the entire architecture, simultaneously refining both the expert adapters and the routing strategy. This approach creates a highly parameter-efficient MoE model that is modular by design, allows for dynamic skill composition, and is trainable from end-to-end. We present the formal mathematical framework for L-MoE, detailing the differentiable routing mechanism and the joint optimization objective, thereby providing a new path toward building more efficient, scalable, and specialized language models.
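The core operation is easy to sketch: a gate emits per-token weights over the LoRA experts and the low-rank deltas are combined by those weights on top of a frozen base projection. Because the layer is linear, gating the deltas is equivalent to applying the gate-weighted average of the experts' parameters, matching the description above. Shapes and the softmax gate below are illustrative.

```python
# L-MoE-style layer sketch: frozen base weight + gate-weighted LoRA deltas.
import torch
import torch.nn as nn

class LMoELayer(nn.Module):
    def __init__(self, d_model, rank, n_experts):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained W
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_model, rank))
        self.gate = nn.Linear(d_model, n_experts)         # trained jointly

    def forward(self, x):                                 # x: (batch, d_model)
        w = torch.softmax(self.gate(x), dim=-1)           # (batch, n_experts)
        ax = torch.einsum("erd,bd->ber", self.A, x)       # per-expert A_i x
        delta = torch.einsum("edr,ber->bed", self.B, ax)  # per-expert B_i A_i x
        return self.base(x) + torch.einsum("be,bed->bd", w, delta)
```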
zh

[AI-106] Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism

【速读】:该论文旨在解决基于Transformer的大语言模型(Large Language Models, LLMs)在长序列训练中因标准注意力机制导致的二次计算和内存开销问题,这一瓶颈严重限制了模型在长上下文场景下的训练效率与可扩展性。解决方案的关键在于提出一个统一的基准测试框架,该框架整合了代表性注意力核函数(attention kernels)与上下文并行(context parallel)机制,并通过模块化和可扩展的接口实现对不同方法的系统性评估;其核心创新在于从两个关键维度进行评测:(1)注意力掩码模式(attention mask patterns),直接影响效率、可扩展性和可用性;(2)序列长度与分布式规模,决定极端长序列训练下的性能表现,从而为设计和部署适用于长上下文训练的注意力机制提供可复现的比较基础与实践指导。

链接: https://arxiv.org/abs/2510.17896
作者: Tao Bu,Qiangang Wang,Bowen Zeng,Hanwen Sun,Yunpeng Huang,Chun Cao,Jingwei Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 56 pages

点击查看摘要

Abstract:Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context parallel training, which scale attention across multiple devices. However, systematic evaluation still remains limited: operator-level comparisons are often incomplete, while context parallel strategies are typically framework-specific, with unclear performance analysis across contexts. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context parallel mechanisms with a modular and extensible interface for evaluation. The benchmark evaluates methods along two critical dimensions: (1) attention mask patterns, which strongly affect efficiency, scalability, and usability, and (2) sequence length and distributed scale, which determine performance under extreme long-context training. Through comprehensive experiments on clusters of up to 96 GPUs, our benchmark enables reproducible comparisons, highlights method-specific trade-offs, and provides practical guidance for designing and deploying attention mechanisms in long-context LLM training.
zh

[AI-107] MIN-Merging: Merge the Important Neurons for Model Merging

【速读】:该论文旨在解决模型合并(model merging)过程中因参数冲突导致的性能下降问题,尤其在特定领域任务上表现不佳。其解决方案的关键在于提出一种基于路由器(router-based)的框架——MIN-Merging,该框架通过选择性地合并最重要的神经元来减少参数冲突,从而在保持预训练模型泛化能力的同时,在领域内任务上实现一致性的性能提升。

链接: https://arxiv.org/abs/2510.17890
作者: Yunfei Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in deep learning have led to a surge of open-source models across diverse domains. While model merging offers a promising way to combine their strengths, existing approaches often suffer from parameter conflicts that degrade performance on domain-specific tasks. We propose MIN-Merging, a router-based framework that selectively merges the most important neurons to reduce such conflicts. Extensive experiments on Computer Vision (CV) and Natural Language Processing (NLP) benchmarks show that MIN-Merging achieves consistent gains on in-domain tasks while retaining the generalization ability of pretrained models on out-of-domain tasks. These results highlight its effectiveness as a practical solution to the parameter conflict problem in model merging.
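A minimal sketch of the "merge only the important neurons" idea for a single layer: score each output neuron by the magnitude of its fine-tuning shift and take the donor's rows only for the top-k neurons. The router and multi-donor logic of MIN-Merging are omitted, and `k` is illustrative.

```python
# Importance-based neuron merging for one weight matrix.
import torch

def merge_important_neurons(w_pre, w_ft, k):
    """w_pre, w_ft: (out_features, in_features) pretrained / fine-tuned weights."""
    importance = (w_ft - w_pre).abs().sum(dim=1)  # per-neuron task shift
    top = torch.topk(importance, k).indices
    merged = w_pre.clone()
    merged[top] = w_ft[top]                       # donor rows only where they matter
    return merged
```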
zh

[AI-108] Hey Pentti We Did It!: A Fully Vector-Symbolic Lisp

【速读】:该论文旨在解决如何在向量符号架构(Vector-Symbolic Architecture, VSA)中实现一个近似最小且具备图灵完备性的Lisp语言子集的问题。其核心挑战在于将Lisp的五种基本函数、lambda表达式及其他辅助函数以向量形式进行语义表示,并确保运算的正确性与可计算性。解决方案的关键在于采用全息还原表示(Holographic Reduced Representations, HRR)作为向量符号编码机制,并引入查找表清理记忆(lookup table cleanup memory)来维持向量表示的稳定性与精确性,从而实现对Lisp 1.5规范中关键结构的数学建模和计算模拟,同时验证VSA具备笛卡尔闭范畴(Cartesian closed category)性质,凸显其在认知计算与符号推理中的理论意义。

链接: https://arxiv.org/abs/2510.17889
作者: Eilene Tomkins-Flanagan(1),Mary A. Kelly(1) ((1) Department of Cognitive Science, Carleton University)
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Kanerva (2014) suggested that it would be possible to construct a complete Lisp out of a vector-symbolic architecture. We present the general form of a vector-symbolic representation of the five Lisp elementary functions, lambda expressions, and other auxiliary functions found in the Lisp 1.5 specification (McCarthy, 1960), which is near minimal and sufficient for Turing-completeness. Our specific implementation uses holographic reduced representations (Plate, 1995), with a lookup table cleanup memory. Lisp, as all Turing-complete languages, is a Cartesian closed category, unusual in its proximity to the mathematical abstraction. We discuss the mathematics, the purpose, and the significance of demonstrating vector-symbolic architectures’ Cartesian-closure, as well as the importance of explicitly including cleanup memories in the specification of the architecture.
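The HRR machinery the paper builds on is compact enough to sketch: binding is circular convolution (via FFT), unbinding is binding with the involution of the cue, and a lookup-table cleanup memory snaps noisy results back to known items. Dimensionality and the toy lexicon are illustrative.

```python
# Holographic reduced representations: bind, unbind, and cleanup.
import numpy as np

D = 1024
rng = np.random.default_rng(0)
vec = lambda: rng.normal(0, 1 / np.sqrt(D), D)   # HRR item vector

def bind(a, b):       # circular convolution via FFT
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=D)

def unbind(c, a):     # approximate inverse: bind with a's involution
    return bind(c, np.r_[a[0], a[:0:-1]])

lexicon = {"CAR": vec(), "CDR": vec(), "ATOM-X": vec()}

def cleanup(noisy):   # lookup-table cleanup memory: nearest stored item
    return max(lexicon, key=lambda k: lexicon[k] @ noisy)

cell = bind(lexicon["CAR"], lexicon["ATOM-X"])   # cons-cell-style slot filler
print(cleanup(unbind(cell, lexicon["CAR"])))     # -> "ATOM-X"
```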
zh

[AI-109] When Intelligence Fails: An Empirical Study on Why LLM s Struggle with Password Cracking

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在密码猜测任务中表现不佳的问题,尤其是其在无监督条件下对用户属性驱动的密码生成能力有限。研究通过构建合成用户画像并利用主流开源LLM(如TinyLLaMA、Falcon-RW-1B和Flan-T5)进行提示(prompting)以生成可能的密码,评估其在Hit@1、Hit@5和Hit@10指标下的准确性。关键发现是:尽管LLMs在自然语言理解与生成方面表现出色,但在密码推断这一特定领域任务中,其生成推理存在显著局限,表现为在未经过泄露密码数据集监督微调的情况下,无法有效捕捉密码模式或记忆常见密码结构,导致性能远低于传统基于规则和组合的破解方法。因此,解决方案的关键在于强化LLMs的领域适应性和密码知识的记忆能力,这需依赖于针对密码学场景的专用训练策略与数据增强手段。

链接: https://arxiv.org/abs/2510.17884
作者: Mohammad Abdul Rehman,Syed Imad Ali Shah,Abbas Anwar,Noor Islam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The remarkable capabilities of Large Language Models (LLMs) in natural language understanding and generation have sparked interest in their potential for cybersecurity applications, including password guessing. In this study, we conduct an empirical investigation into the efficacy of pre-trained LLMs for password cracking using synthetic user profiles. Specifically, we evaluate the performance of state-of-the-art open-source LLMs such as TinyLLaMA, Falcon-RW-1B, and Flan-T5 by prompting them to generate plausible passwords based on structured user attributes (e.g., name, birthdate, hobbies). Our results, measured using Hit@1, Hit@5, and Hit@10 metrics under both plaintext and SHA-256 hash comparisons, reveal consistently poor performance, with all models achieving less than 1.5% accuracy at Hit@10. In contrast, traditional rule-based and combinator-based cracking methods demonstrate significantly higher success rates. Through detailed analysis and visualization, we identify key limitations in the generative reasoning of LLMs when applied to the domain-specific task of password guessing. Our findings suggest that, despite their linguistic prowess, current LLMs lack the domain adaptation and memorization capabilities required for effective password inference, especially in the absence of supervised fine-tuning on leaked password datasets. This study provides critical insights into the limitations of LLMs in adversarial contexts and lays the groundwork for future efforts in secure, privacy-preserving, and robust password modeling.
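The evaluation protocol is simple to make precise. A sketch of Hit@k under both plaintext and SHA-256 comparison, with made-up guesses for illustration:

```python
# Hit@k: does the true password appear among the top-k generated guesses?
import hashlib

def hit_at_k(guesses, truth, k, hashed=False):
    pool = guesses[:k]
    if hashed:  # compare SHA-256 digests instead of plaintext
        target = hashlib.sha256(truth.encode()).hexdigest()
        pool = [hashlib.sha256(g.encode()).hexdigest() for g in pool]
        return int(target in pool)
    return int(truth in pool)

print(hit_at_k(["alice1990", "alice!", "1990alice"], "alice1990", 5, hashed=True))
```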
zh

[AI-110] From Flows to Words: Can Zero-/Few-Shot LLM s Detect Network Intrusions? A Grammar-Constrained Calibrated Evaluation on UNSW-NB15

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在无需微调(fine-tuning)的情况下,如何有效应用于网络入侵检测(intrusion detection)的问题。其核心挑战在于如何利用LLMs的自然语言推理能力,在不依赖梯度训练的前提下实现高精度、稳定的检测性能。解决方案的关键在于:首先将每个网络流转换为紧凑的文本记录,并引入轻量级、领域启发的布尔标志(如不对称性、突发率异常、TTL异常、定时器异常、罕见服务/状态、短突发等),以增强语义表达;其次通过结构化输出约束(grammar-valid responses)减少预测漂移,结合单一决策阈值校准提升稳定性;最后采用零样本、指令引导和少量示例提示(few-shot prompting)策略进行对比实验,验证了指令+标志组合显著优于无指导提示,且在小规模数据上可达到接近0.78的宏F1分数,展现出无需训练即可部署的潜力与可解释性优势。

链接: https://arxiv.org/abs/2510.17883
作者: Mohammad Abdul Rehman,Syed Imad Ali Shah,Abbas Anwar,Noor Islam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can reason over natural-language inputs, but their role in intrusion detection without fine-tuning remains uncertain. This study evaluates a prompt-only approach on UNSW-NB15 by converting each network flow to a compact textual record and augmenting it with lightweight, domain-inspired boolean flags (asymmetry, burst rate, TTL irregularities, timer anomalies, rare service/state, short bursts). To reduce output drift and support measurement, the model is constrained to produce structured, grammar-valid responses, and a single decision threshold is calibrated on a small development split. We compare zero-shot, instruction-guided, and few-shot prompting to strong tabular and neural baselines under identical splits, reporting accuracy, precision, recall, F1, and macro scores. Empirically, unguided prompting is unreliable, while instructions plus flags substantially improve detection quality; adding calibrated scoring further stabilizes results. On a balanced subset of two hundred flows, a 7B instruction-tuned model with flags reaches macro-F1 near 0.78; a lighter 3B model with few-shot cues and calibration attains F1 near 0.68 on one thousand examples. As the evaluation set grows to two thousand flows, decision quality decreases, revealing sensitivity to coverage and prompting. Tabular baselines remain more stable and faster, yet the prompt-only pipeline requires no gradient training, produces readable artifacts, and adapts easily through instructions and flags. Contributions include a flow-to-text protocol with interpretable cues, a calibration method for thresholding, a systematic baseline comparison, and a reproducibility bundle with prompts, grammar, metrics, and figures.
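The flow-to-text protocol with boolean cues can be sketched directly; the field names and thresholds below are illustrative placeholders, not the paper's exact cue definitions over the UNSW-NB15 schema.

```python
# Convert one flow record into a compact textual line with boolean flags.
def flow_to_text(f):
    flags = {
        "asymmetric": f["src_bytes"] > 10 * max(f["dst_bytes"], 1),
        "bursty": f["packets"] / max(f["duration"], 1e-3) > 1000,
        "ttl_anomaly": abs(f["sttl"] - f["dttl"]) > 100,
        "rare_service": f["service"] in {"-", "irc", "radius"},
        "short_burst": f["duration"] < 0.01 and f["packets"] > 10,
    }
    cues = ", ".join(k for k, v in flags.items() if v) or "none"
    return (f"proto={f['proto']} service={f['service']} state={f['state']} "
            f"dur={f['duration']:.3f}s bytes={f['src_bytes']}/{f['dst_bytes']} "
            f"flags=[{cues}]")
```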
zh

[AI-111] Decoding Listeners' Identity: Person Identification from EEG Signals Using a Lightweight Spiking Transformer

【速读】:该论文旨在解决基于脑电图(EEG)的人体识别技术在实际应用中因传统深度学习模型计算开销大而导致的效率瓶颈问题,尤其在脑机接口(BCI)等对能耗敏感场景下的适用性受限。其解决方案的关键在于提出一种基于脉冲神经网络(SNN)的轻量级脉冲Transformer架构,该结构能有效捕捉EEG信号中的时序复杂性,同时显著降低能量消耗——在EEG-Music Emotion Recognition Challenge数据集上实现100%分类准确率,且能耗不足传统深度神经网络的10%。

链接: https://arxiv.org/abs/2510.17879
作者: Zheyuan Lin,Siqi Cai,Haizhou Li
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:EEG-based person identification enables applications in security, personalized brain-computer interfaces (BCIs), and cognitive monitoring. However, existing techniques often rely on deep learning architectures at high computational cost, limiting their scope of applications. In this study, we propose a novel EEG person identification approach using spiking neural networks (SNNs) with a lightweight spiking transformer for efficiency and effectiveness. The proposed SNN model is capable of handling the temporal complexities inherent in EEG signals. On the EEG-Music Emotion Recognition Challenge dataset, the proposed model achieves 100% classification accuracy with less than 10% energy consumption of traditional deep neural networks. This study offers a promising direction for energy-efficient and high-performance BCIs. The source code is available at this https URL.
zh

[AI-112] DRL-Based Resource Allocation for Energy-Efficient IRS-Assisted UAV Spectrum Sharing Systems

【速读】:该论文旨在解决智能反射面(Intelligent Reflecting Surface, IRS)辅助的无人机(Unmanned Aerial Vehicle, UAV)无线通信系统中能效(Energy Efficiency, EE)提升的问题,特别是在正交频分复用(Orthogonal Frequency Division Multiplexing, OFDM)场景下的频谱共享机制下,如何通过联合优化波束赋形、子载波分配、IRS相位偏移以及UAV轨迹来最大化次级网络的能效。解决方案的关键在于引入一个基于物理机理的推进-能量模型,并利用其紧致上界构造出可处理的能效下界;同时,针对高度非凸且时间耦合的优化问题(具有连续与离散混合策略空间),提出了一种基于演员-评论家框架的深度强化学习(Deep Reinforcement Learning, DRL)方法,从而在满足实际发射功率、被动反射约束及UAV物理限制的前提下实现高效能效优化。

链接: https://arxiv.org/abs/2510.17877
作者: Yiheng Wang
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 7 pages, 3 figures, 1 algorithm. LaTeX class: IEEEtran

点击查看摘要

Abstract:Intelligent reflecting surface (IRS) assisted unmanned aerial vehicle (UAV) systems provide a new paradigm for reconfigurable and flexible wireless communications. To enable more energy efficient and spectrum efficient IRS assisted UAV wireless communications, this paper introduces a novel IRS-assisted UAV enabled spectrum sharing system with orthogonal frequency division multiplexing (OFDM). The goal is to maximize the energy efficiency (EE) of the secondary network by jointly optimizing the beamforming, subcarrier allocation, IRS phase shifts, and the UAV trajectory subject to practical transmit power and passive reflection constraints as well as UAV physical limitations. A physically grounded propulsion-energy model is adopted, with its tight upper bound used to form a tractable EE lower bound for the spectrum sharing system. To handle highly non convex, time coupled optimization problems with a mixed continuous and discrete policy space, we develop a deep reinforcement learning (DRL) approach based on the actor critic framework. Extended experiments show the significant EE improvement of the proposed DRL-based approach compared to several benchmark schemes, thus demonstrating the effectiveness and robustness of the proposed approach with mobility.
zh

[AI-113] Repairing Tool Calls Using Post-tool Execution Reflection and RAG

【速读】:该论文旨在解决代理系统(Agentic systems)在调用外部工具(如kubectl命令行工具)时因语法和语义错误导致执行失败的问题,尤其是那些仅在分析工具响应后才能识别和修复的语义错误。解决方案的关键在于引入一个后工具执行反思组件(post-tool execution reflection component),该组件结合了基于大语言模型(LLM)的反思能力与领域特定的检索增强生成(Retrieval-Augmented Generation, RAG),利用描述具体工具及故障排除文档的语料库进行上下文感知的错误诊断与修正,从而显著提升命令成功率和用户问题解答准确性。
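其“执行-反思-修复”闭环可以抽象为如下 Python 草图;其中 retrieve_docs 与 llm_reflect 为假设的接口占位,并非论文或任何现成库的 API:

```python
import subprocess

def retrieve_docs(command: str, error: str) -> str:
    """假设的 RAG 检索接口:返回与该工具及其报错相关的文档片段。"""
    raise NotImplementedError

def llm_reflect(command: str, error: str, docs: str) -> str:
    """假设的 LLM 反思接口:结合检索文档对失败命令给出修复版本。"""
    raise NotImplementedError

def run_with_repair(command: str, max_retries: int = 3) -> str:
    """执行命令;失败则检索文档并让 LLM 做后执行反思修复,再重试。"""
    for _ in range(max_retries):
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        docs = retrieve_docs(command, result.stderr)     # 检索工具/排障文档
        command = llm_reflect(command, result.stderr, docs)  # 反思并重写命令
    raise RuntimeError(f"修复 {max_retries} 次后仍失败: {command}")
```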

链接: https://arxiv.org/abs/2510.17874
作者: Jason Tsay,Zidane Wright,Gaodan Fang,Kiran Kate,Saurabh Jha,Yara Rizk
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic systems interact with external systems by calling tools such as Python functions, REST API endpoints, or command line tools such as kubectl in Kubernetes. These tool calls often fail for various syntactic and semantic reasons. Some less obvious semantic errors can only be identified and resolved after analyzing the tool’s response. To repair these errors, we develop a post-tool execution reflection component that combines large language model (LLM)-based reflection with domain-specific retrieval-augmented generation (RAG) using documents describing both the specific tool being called and troubleshooting documents related to the tool. For this paper, we focus on the use case of the kubectl command line tool to manage Kubernetes, a platform for orchestrating cluster applications. Through a larger empirical study and a smaller manual evaluation, we find that our RAG-based reflection will repair kubectl commands such that they are both more likely to successfully execute (pass rate) for 55% of our models evaluated and 36% more likely to correctly answer the user query on average. We find that troubleshooting documents improve pass rate compared to official documentation by an average of 10%.
zh

[AI-114] A Survey of Recursive and Recurrent Neural Networks

【速读】:该论文旨在系统梳理和分类递归神经网络(Recursive Neural Networks, RNNs)与循环神经网络(Recurrent Neural Networks, RNNs)的分支结构,以厘清其在网络架构、训练目标函数及学习算法实现上的差异。解决方案的关键在于将现有模型细分为三大类:通用型递归与循环神经网络、结构化递归与循环神经网络及其他特殊变体,并深入剖析各类模型的原理、结构演化及其相互关系,从而为复杂序列建模任务(如语音识别、图像处理等)提供清晰的理论框架与技术路径。

链接: https://arxiv.org/abs/2510.17867
作者: Jian-wei Liu,Bing-rong Xu,Zhi-yan Song
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 96 pages,48 figures

点击查看摘要

Abstract:In this paper, the branches of recursive and recurrent neural networks are classified in detail according to the network structure, training objective function and learning algorithm implementation. They are roughly divided into three categories: The first category is General Recursive and Recurrent Neural Networks, including Basic Recursive and Recurrent Neural Networks, Long Short Term Memory Recursive and Recurrent Neural Networks, Convolutional Recursive and Recurrent Neural Networks, Differential Recursive and Recurrent Neural Networks, One-Layer Recursive and Recurrent Neural Networks, High-Order Recursive and Recurrent Neural Networks, Highway Networks, Multidimensional Recursive and Recurrent Neural Networks, Bidirectional Recursive and Recurrent Neural Networks; the second category is Structured Recursive and Recurrent Neural Networks, including Grid Recursive and Recurrent Neural Networks, Graph Recursive and Recurrent Neural Networks, Temporal Recursive and Recurrent Neural Networks, Lattice Recursive and Recurrent Neural Networks, Hierarchical Recursive and Recurrent Neural Networks, Tree Recursive and Recurrent Neural Networks; the third category is Other Recursive and Recurrent Neural Networks, including Array Long Short Term Memory, Nested and Stacked Recursive and Recurrent Neural Networks, Memory Recursive and Recurrent Neural Networks. Various networks cross each other and even rely on each other to form a complex network of relationships. In the context of the development and convergence of various networks, many complex sequence, speech and image problems are solved. After a detailed description of the principle and structure of the above model and model deformation, the research progress and application of each model are described, and finally the recursive and recurrent neural network models are prospected and summarized.
zh

[AI-115] Deploying Atmospheric and Oceanic AI Models on Chinese Hardware and Framework: Migration Strategies Performance Optimization and Analysis

【速读】:该论文旨在解决当前大气和海洋AI模型(如FourCastNet和AI-GOMS)高度依赖GPU硬件所带来的可移植性与自主可控性问题,尤其是在中国本土芯片和框架生态下的适配难题。其解决方案的关键在于构建一个从PyTorch到MindSpore的迁移框架,并针对国产芯片进行软件-硬件协同优化,涵盖内存优化与并行计算策略,从而在保持原始模型精度的同时显著降低对国外硬件的依赖,提升训练与推理效率及能效表现。

链接: https://arxiv.org/abs/2510.17852
作者: Yuze Sun,Wentao Luo,Yanfei Xiang,Jiancheng Pan,Jiahao Li,Quan Zhang,Xiaomeng Huang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the growing role of artificial intelligence in climate and weather research, efficient model training and inference are in high demand. Current models like FourCastNet and AI-GOMS depend heavily on GPUs, limiting hardware independence, especially for Chinese domestic hardware and frameworks. To address this issue, we present a framework for migrating large-scale atmospheric and oceanic models from PyTorch to MindSpore and optimizing for Chinese chips, and evaluating their performance against GPUs. The framework focuses on software-hardware adaptation, memory optimization, and parallelism. Furthermore, the model’s performance is evaluated across multiple metrics, including training speed, inference speed, model accuracy, and energy efficiency, with comparisons against GPU-based implementations. Experimental results demonstrate that the migration and optimization process preserves the models’ original accuracy while significantly reducing system dependencies and improving operational efficiency by leveraging Chinese chips as a viable alternative for scientific computing. This work provides valuable insights and practical guidance for leveraging Chinese domestic chips and frameworks in atmospheric and oceanic AI model development, offering a pathway toward greater technological independence.
zh

[AI-116] CARLE: A Hybrid Deep-Shallow Learning Framework for Robust and Explainable RUL Estimation of Rolling Element Bearings

【速读】:该论文旨在解决滚动轴承等关键部件在复杂工况下剩余使用寿命(Remaining Useful Life, RUL)预测的泛化能力和鲁棒性不足的问题。现有方法往往难以适应运行条件变化,导致预测精度下降。解决方案的关键在于提出一种混合人工智能框架CARLE,其核心创新包括:1)采用Res-CNN与Res-LSTM结合多头注意力机制和残差连接,有效提取时频域中的空间与时间退化特征;2)引入随机森林回归器(Random Forest Regressor, RFR)提升预测稳定性与准确性;3)设计轻量级预处理流程,通过高斯滤波降噪和连续小波变换(Continuous Wavelet Transform, CWT)进行时频特征提取,增强模型对噪声和跨域数据的适应能力。实验证明,该方案在XJTU-SY和PRONOSTIA轴承数据集上显著优于多个前沿方法,尤其在动态工况下表现出更强的鲁棒性和可迁移性。
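其预处理流水线与浅层回归器可以用常见科学计算库近似搭出骨架。以下为示意性草图(假设可用 scipy、PyWavelets 与 scikit-learn;为保持简短,用简单统计量代替论文中的 Res-CNN/Res-LSTM 深度特征):

```python
import numpy as np
import pywt
from scipy.ndimage import gaussian_filter1d
from sklearn.ensemble import RandomForestRegressor

def preprocess(signal, fs=25600, scales=np.arange(1, 65)):
    """高斯滤波降噪 + 连续小波变换(CWT)得到时频表示。"""
    denoised = gaussian_filter1d(signal, sigma=2.0)
    coeffs, _ = pywt.cwt(denoised, scales, "morl", sampling_period=1.0 / fs)
    return np.abs(coeffs)                 # (scales, time) 时频幅值图

def tf_features(tfmap):
    """用逐尺度均值/方差等简单统计量代替深度特征,仅作占位示意。"""
    return np.concatenate([tfmap.mean(axis=1), tfmap.std(axis=1)])

# 假设 signals 为若干段振动信号,ruls 为对应的(归一化)剩余寿命标签
rng = np.random.default_rng(0)
signals = rng.normal(size=(32, 2048))
ruls = rng.uniform(0, 1, size=32)
X = np.stack([tf_features(preprocess(s)) for s in signals])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, ruls)
print("训练集 R^2:", model.score(X, ruls))
```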

链接: https://arxiv.org/abs/2510.17846
作者: Waleed Razzaq,Yun-Bo Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, accepted at Soft Computing

点击查看摘要

Abstract:Prognostic Health Management (PHM) systems monitor and predict equipment health. A key task is Remaining Useful Life (RUL) estimation, which predicts how long a component, such as a rolling element bearing, will operate before failure. Many RUL methods exist but often lack generalizability and robustness under changing operating conditions. This paper introduces CARLE, a hybrid AI framework that combines deep and shallow learning to address these challenges. CARLE uses Res-CNN and Res-LSTM blocks with multi-head attention and residual connections to capture spatial and temporal degradation patterns, and a Random Forest Regressor (RFR) for stable, accurate RUL prediction. A compact preprocessing pipeline applies Gaussian filtering for noise reduction and Continuous Wavelet Transform (CWT) for time-frequency feature extraction. We evaluate CARLE on the XJTU-SY and PRONOSTIA bearing datasets. Ablation studies measure each component’s contribution, while noise and cross-domain experiments test robustness and generalization. Comparative results show CARLE outperforms several state-of-the-art methods, especially under dynamic conditions. Finally, we analyze model interpretability with LIME and SHAP to assess transparency and trustworthiness.
zh

[AI-117] GRETEL: A Goal-driven Retrieval and Execution-based Trial Framework for LLM Tool Selection Enhancing

【速读】:该论文旨在解决代理系统中工具检索(tool retrieval)因过度依赖语义相似性而导致的“语义-功能鸿沟”问题,即现有方法常选出文本相关但功能不可用的工具(如因参数不匹配、认证失败或执行约束导致无法运行)。解决方案的关键在于提出GRETEL框架,其核心是通过沙箱环境中的“规划-执行-评估”循环对语义检索候选工具进行系统性实证验证,生成基于实际执行结果的证据,从而区分真正可用的功能性工具与仅具描述性的匹配项。这一执行驱动的验证机制显著提升了工具选择的可靠性,使代理在真实场景下表现更稳健。
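该“规划-执行-评估”验证循环的骨架大致如下(纯示意:llm、sandbox、tool.spec 等接口均为假设,并非论文代码):

```python
def validate_tools(query, candidates, sandbox, llm, top_k=10):
    """对语义检索出的候选工具逐一做沙箱试用,只保留有执行证据的工具。"""
    validated = []
    for tool in candidates:
        plan = llm.plan(query, tool.spec)            # 规划:构造调用参数
        try:
            result = sandbox.execute(tool, plan)     # 执行:沙箱中实际调用
        except Exception:                            # 参数不匹配、认证失败等
            continue
        score = llm.evaluate(query, result)          # 评估:结果是否回答了查询
        if score > 0.5:
            validated.append((tool, score))
    validated.sort(key=lambda pair: pair[1], reverse=True)
    return [tool for tool, _ in validated[:top_k]]
```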

链接: https://arxiv.org/abs/2510.17843
作者: Zongze Wu,Yani Guo,Churong Liang,Runnan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 5 pages, 1 figures, 5 tables

点击查看摘要

Abstract:Despite remarkable advances in Large Language Model capabilities, tool retrieval for agent-based systems remains fundamentally limited by reliance on semantic similarity, which fails to capture functional viability. Current methods often retrieve textually relevant but functionally inoperative tools due to parameter mismatches, authentication failures, and execution constraints, a phenomenon we term the semantic-functional gap. We introduce GRETEL to address this gap through systematic empirical validation. GRETEL implements an agentic workflow that processes semantically retrieved candidates through sandboxed plan-execute-evaluate cycles, generating execution-grounded evidence to distinguish truly functional tools from merely descriptive matches. Our comprehensive evaluation on the ToolBench benchmark demonstrates substantial improvements across all metrics: Pass Rate@10 increases from 0.690 to 0.826, Recall@10 improves from 0.841 to 0.867, and NDCG@10 rises from 0.807 to 0.857. These results establish that execution-based validation provides a more reliable foundation for tool selection than semantic similarity alone, enabling more robust agent performance in real-world applications.
zh

[AI-118] LLM Assisted Alpha Fairness for 6 GHz WiFi and NR-U Coexistence: An Agentic Orchestrator for Throughput, Energy, and SLA

【速读】:该论文旨在解决6GHz频段中Wi-Fi与5G NR-U(New Radio-Unlicensed)在共存场景下,如何在保障服务质量(QoS)、能效和公平性的同时实现高效资源调度的问题。其核心挑战在于:在听前发射(Listen-Before-Talk, LBT)机制约束下,需权衡吞吐量、能耗和服务优先级等多目标,并确保策略的安全性和可审计性。解决方案的关键在于提出一种代理式控制器(agentic controller),将策略生成与执行解耦:首先由大语言模型(Large Language Model, LLM)基于实时遥测数据(如信道忙闲状态、用户信道质量指示CQI、队列长度、延迟、电池状态等)生成一组可解释的控制参数(如公平指数α、各信道占空比上限及用户类别权重);随后由确定性优化器计算满足LBT损耗与能量成本内化的α-公平分配方案,并对不合法或不安全策略进行裁剪回退至规则基线。实验表明,LLM引导的策略显著提升能效(降低35.3%总能耗),同时保持吞吐量竞争力,甚至在某些配置下实现比特/焦耳效率提升12.2%,验证了透明、可控的LLM指导在无线共存优化中的有效性。
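α-公平效用有标准形式:α=0 对应吞吐量最大化,α=1 对应比例公平,α 越大越偏向最大最小公平。下面用 NumPy 给出在总占空比与单信道上限约束下求 α-公平分配的最小草图(投影梯度法;速率与上限均为假设值,仅演示 α 的作用):

```python
import numpy as np

def alpha_fair_alloc(rates, caps, alpha=1.0, steps=5000, lr=1e-3):
    """投影梯度法求解: max Σ U_α(r_i·x_i)  s.t. Σ x_i ≤ 1, 0 ≤ x_i ≤ cap_i,
    其中 U_α(x) = x^(1-α)/(1-α)(α=1 时为 log x)。"""
    x = np.full_like(rates, 1.0 / len(rates))
    for _ in range(steps):
        grad = rates * np.maximum(rates * x, 1e-9) ** (-alpha)  # dU_α/dx
        x = np.clip(x + lr * grad, 1e-6, caps)
        if x.sum() > 1.0:                  # 缩放投影回总占空比约束
            x *= 1.0 / x.sum()
    return x

rates = np.array([100.0, 50.0, 10.0])      # 各用户可达速率(假设值,Mbps)
caps = np.array([0.6, 0.6, 0.6])           # 每信道占空比上限(假设由 LLM 给出)
for a in (0.0, 1.0, 2.0):
    print(f"alpha={a}:", np.round(alpha_fair_alloc(rates, caps, a), 3))
```

可以观察到 α 增大时,分配从偏向高速率用户逐渐转向照顾低速率用户,这正是 LLM 通过调节 α 所操控的公平-效率权衡。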

链接: https://arxiv.org/abs/2510.17814
作者: Qun Wang,Yingzhou Lu,Guiran Liu,Binrong Zhu,Yang Liu
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unlicensed 6GHz is becoming a primary workhorse for high-capacity access, with Wi-Fi and 5G NR-U competing for the same channels under listen-before-talk (LBT) rules. Operating in this regime requires decisions that jointly trade throughput, energy, and service-level objectives while remaining safe and auditable. We present an agentic controller that separates policy from execution. At the start of each scheduling epoch the agent summarizes telemetry (per-channel busy and baseline LBT failure; per-user CQI, backlog, latency, battery, priority, and power mode) and invokes a large language model (LLM) to propose a small set of interpretable knobs: a fairness index \alpha, per-channel duty-cycle caps for Wi-Fi/NR-U, and class weights. A deterministic optimizer then enforces feasibility and computes an \alpha-fair allocation that internalizes LBT losses and energy cost; malformed or unsafe policies are clamped and fall back to a rule baseline. In a 6GHz simulator with two 160MHz channels and mixed Wi-Fi/NR-U users, LLM-assisted policies consistently improve energy efficiency while keeping throughput competitive with a strong rule baseline. One LLM lowers total energy by 35.3% at modest throughput loss, and another attains the best overall trade-off, finishing with higher total bits (+3.5%) and higher bits/J (+12.2%) than the baseline. We release code, per-epoch logs, and plotting utilities to reproduce all figures and numbers, illustrating how transparent, policy-level LLM guidance can safely improve wireless coexistence.
zh

[AI-119] Lyapunov-Aware Quantum-Inspired Reinforcement Learning for Continuous-Time Vehicle Control: A Feasibility Study

【速读】:该论文旨在解决自动驾驶车辆纵向控制中如何在强化学习(Reinforcement Learning, RL)框架下嵌入安全约束的问题,特别是确保控制策略在动态环境中的稳定性与可解释性。其核心挑战在于传统RL方法缺乏对系统稳定性的理论保障,难以直接应用于高安全性要求的场景。解决方案的关键在于提出一种基于李雅普诺夫(Lyapunov)的量子强化学习(Lyapunov-Based Quantum Reinforcement Learning, LQRL)框架,将变分量子电路(Variational Quantum Circuits, VQCs)用于策略表示,并引入李雅普诺夫稳定性感知的策略梯度机制,使策略优化过程显式满足稳定性约束,从而实现渐近收敛和安全决策。仿真结果表明,该方法能够在闭环自适应巡航控制场景中有效嵌入稳定性验证,即便在激进加速条件下仍保持状态演化有界,验证了量子强化学习架构中集成安全保证的可行性。
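其中“李雅普诺夫稳定性约束”的最小含义是:沿闭环轨迹,候选函数 V 需单调下降。下面的 NumPy 草图演示如何在仿真中监测该条件(V 取二次型,线性车辆误差动力学与反馈增益均为假设,与论文的量子策略电路无直接对应):

```python
import numpy as np

def V(x, P):
    """二次型李雅普诺夫候选函数 V(x) = xᵀPx。"""
    return float(x @ P @ x)

def lyapunov_violations(traj, P, margin=1e-3):
    """统计轨迹上 V(x_{t+1}) - V(x_t) > -margin·||x_t||² 的违例步数。"""
    count = 0
    for x_t, x_next in zip(traj[:-1], traj[1:]):
        if V(x_next, P) - V(x_t, P) > -margin * float(x_t @ x_t):
            count += 1
    return count

# 假设的纵向跟车误差动力学:x = [速度误差, 车距误差],u = -Kx 为加速度指令
A = np.array([[0.95, 0.0], [0.1, 1.0]])
B = np.array([[0.1], [0.0]])
K = np.array([[0.8, 0.5]])                # 假设由策略网络给出的反馈增益
P = np.eye(2)

x, traj = np.array([1.0, -0.5]), []
for _ in range(50):
    traj.append(x.copy())
    x = A @ x - (B @ (K @ x)).ravel()     # 闭环一步
print("李雅普诺夫条件违例步数:", lyapunov_violations(traj, P))
```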

链接: https://arxiv.org/abs/2510.18852
作者: Nutkritta Kraipatthanapong,Natthaphat Thathong,Pannita Suksawas,Thanunnut Klunklin,Kritin Vongthonglua,Krit Attahakul,Aueaphum Aueawatthanaphisut
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 7 pages, 4 figures, 20 equations, 3 appendices, 4 tables

点击查看摘要

Abstract:This paper presents a novel Lyapunov-Based Quantum Reinforcement Learning (LQRL) framework that integrates quantum policy optimization with Lyapunov stability analysis for continuous-time vehicle control. The proposed approach combines the representational power of variational quantum circuits (VQCs) with a stability-aware policy gradient mechanism to ensure asymptotic convergence and safe decision-making under dynamic environments. The vehicle longitudinal control problem was formulated as a continuous-state reinforcement learning task, where the quantum policy network generates control actions subject to Lyapunov stability constraints. Simulation experiments were conducted in a closed-loop adaptive cruise control scenario using a quantum-inspired policy trained under stability feedback. The results demonstrate that the LQRL framework successfully embeds Lyapunov stability verification into quantum policy learning, enabling interpretable and stability-aware control performance. Although transient overshoot and Lyapunov divergence were observed under aggressive acceleration, the system maintained bounded state evolution, validating the feasibility of integrating safety guarantees within quantum reinforcement learning architectures. The proposed framework provides a foundational step toward provably safe quantum control in autonomous systems and hybrid quantum-classical optimization domains.
zh

[AI-120] Learning under Quantization for High-Dimensional Linear Regression

【速读】:该论文旨在解决低比特量化(low-bit quantization)在大规模模型训练中广泛应用背景下,其对学习性能影响的理论理解缺失问题,尤其是在最简单的线性回归场景下。研究者首次系统地分析了有限步随机梯度下降(finite-step stochastic gradient descent, SGD)在高维线性回归中的表现,涵盖数据、标签、参数、激活和梯度等五类量化目标。解决方案的关键在于构建了一个新颖的理论分析框架,能够精确刻画不同量化方式引入的额外风险(excess risk),并揭示:参数、激活和梯度量化会放大训练过程中的噪声;数据量化会扭曲数据谱(data spectrum);而数据与标签量化则引入额外的近似误差和量化误差。特别地,作者证明了对于乘法型量化(input-dependent quantization step),可消除谱失真;而对于加法型量化(constant quantization step),存在随批量大小有益放大的 scaling 效应。这一理论为理解量化如何塑造优化算法的学习动态提供了强大工具,并为在实际硬件约束下的学习理论研究奠定基础。
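论文区分的两类量化可以用几行 NumPy 直观对比:加性量化用常数步长,误差与输入幅值无关;乘性量化的步长随 |x| 缩放,相对误差近似有界(类似浮点)。以下实现仅为示意,与论文中的精确定义可能存在出入:

```python
import numpy as np

def quantize_additive(x, delta=0.1):
    """加性量化:固定步长 delta,绝对误差有界,相对误差在小幅值处放大。"""
    return delta * np.round(x / delta)

def quantize_multiplicative(x, bits=4):
    """乘性量化:步长与 |x| 成比例,相对误差近似有界(类似浮点的相对精度)。"""
    scale = 2.0 ** np.floor(np.log2(np.maximum(np.abs(x), 1e-12)))
    delta = scale / 2 ** bits
    return delta * np.round(x / delta)

rng = np.random.default_rng(0)
x = rng.normal(size=100_000) * np.logspace(-3, 0, 100_000)  # 幅值跨多个量级
for name, q in [("加性", quantize_additive(x)),
                ("乘性", quantize_multiplicative(x))]:
    rel = np.abs(q - x) / np.maximum(np.abs(x), 1e-12)
    print(f"{name}量化 平均相对误差: {rel.mean():.4f}")
```

乘性量化对各个幅值量级保持大致相同的相对精度,这对应论文中“可消除数据谱失真”的直觉。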

链接: https://arxiv.org/abs/2510.18259
作者: Dechen Zhang,Junwei Su,Difan Zou
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The use of low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, labels, parameters, activations, and gradients. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how different quantization affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum; and data and label quantization introduce additional approximation and quantized error. Crucially, we prove that for multiplicative quantization (with input-dependent quantization step), this spectral distortion can be eliminated, and for additive quantization (with constant quantization step), a beneficial scaling effect with batch size emerges. Furthermore, for common polynomial-decay data spectra, we quantitatively compare the risks of multiplicative and additive quantization, drawing a parallel to the comparison between FP and integer quantization methods. Our theory provides a powerful lens to characterize how quantization shapes the learning dynamics of optimization algorithms, paving the way to further explore learning theory under practical hardware constraints.
zh

[AI-121] Finding the Sweet Spot: Optimal Data Augmentation Ratio for Imbalanced Credit Scoring Using ADASYN

【速读】:该论文旨在解决信用评分模型中因类别严重不平衡(如违约率通常低于10%)而导致的模型学习困难与预测性能下降问题。现有方法如SMOTE和ADASYN等合成数据增强技术虽被广泛应用,但其最优增强比例尚未明确,实践中常盲目采用1:1平衡策略而缺乏实证依据。解决方案的关键在于系统性评估不同增强场景——在Give Me Some Credit数据集上对比SMOTE、BorderlineSMOTE与ADASYN在乘法因子为1x、2x、3x下的表现,结合XGBoost模型与Bootstrap显著性检验发现:ADASYN以1倍乘法因子(即将少数类样本翻倍)时达到最优效果,AUC为0.6778,Gini系数为0.3557,较基线提升显著(p=0.017),且最佳不平衡比约为6.6:1(多数类:少数类),而非传统1:1平衡,揭示了合成过采样存在“收益递减”规律,为信用评分领域提供了首个实证支持的“黄金增强比例”建议。
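用 imbalanced-learn 复现“ADASYN 1 倍乘法因子(少数类翻倍)”的设置大致如下;注意 sampling_strategy 接收的是重采样后少数类与多数类之比,需由目标倍数换算(示意性草图,数据为模拟):

```python
import numpy as np
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

# 构造约 7% 正类率的模拟数据(约 13.3:1 的不平衡比)
X, y = make_classification(n_samples=20000, weights=[0.93, 0.07],
                           random_state=0)
n_maj, n_min = np.bincount(y)

# 1x 乘法因子 = 少数类翻倍,目标比约为 2·n_min/n_maj(约 6.6:1),而非 1:1
target_ratio = 2 * n_min / n_maj
X_res, y_res = ADASYN(sampling_strategy=target_ratio,
                      random_state=0).fit_resample(X, y)
print("增强前:", np.bincount(y), " 增强后:", np.bincount(y_res))
```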

链接: https://arxiv.org/abs/2510.18252
作者: Luis H. Chia
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Credit scoring models face a critical challenge: severe class imbalance, with default rates typically below 10%, which hampers model learning and predictive performance. While synthetic data augmentation techniques such as SMOTE and ADASYN have been proposed to address this issue, the optimal augmentation ratio remains unclear, with practitioners often defaulting to full balancing (1:1 ratio) without empirical justification. This study systematically evaluates 10 data augmentation scenarios using the Give Me Some Credit dataset (97,243 observations, 7% default rate), comparing SMOTE, BorderlineSMOTE, and ADASYN at different multiplication factors (1x, 2x, 3x). All models were trained using XGBoost and evaluated on a held-out test set of 29,173 real observations. Statistical significance was assessed using bootstrap testing with 1,000 iterations. Key findings reveal that ADASYN with 1x multiplication (doubling the minority class) achieved optimal performance with AUC of 0.6778 and Gini coefficient of 0.3557, representing statistically significant improvements of +0.77% and +3.00% respectively (p = 0.017, bootstrap test). Higher multiplication factors (2x and 3x) resulted in performance degradation, with 3x showing a -0.48% decrease in AUC, suggesting a “law of diminishing returns” for synthetic oversampling. The optimal class imbalance ratio was found to be 6.6:1 (majority:minority), contradicting the common practice of balancing to 1:1. This work provides the first empirical evidence of an optimal “sweet spot” for data augmentation in credit scoring, with practical guidelines for industry practitioners and researchers working with imbalanced datasets. While demonstrated on a single representative dataset, the methodology provides a reproducible framework for determining optimal augmentation ratios in other imbalanced domains.
zh

[AI-122] Universal Spectral Tokenization via Self-Supervised Panchromatic Representation Learning NEURIPS2025

【速读】:该论文旨在解决天文光谱数据在多分辨率、多波段(如光学与红外)及多天体类型(如恒星与星系)下难以统一建模的问题,从而阻碍了跨数据集的信息整合与高效利用。其解决方案的关键在于提出了一种自监督的深度学习模型——通用光谱分词器(universal spectral tokenizer),该模型可直接在原始波长网格上处理异构光谱数据,生成内在对齐、同质且物理意义明确的表征,并能高效适配至多种下游任务中,首次实现了单一模型对跨分辨率和跨域光谱数据的统一建模,为构建天文学基础模型提供了关键组件,并具备扩展至气候学、医疗等其他科学领域中异构序列数据建模的潜力。

链接: https://arxiv.org/abs/2510.17959
作者: Jeff Shen,Francois Lanusse,Liam Holden Parker,Ollie Liu,Tom Hehir,Leopoldo Sarra,Lucas Meyer,Micah Bowles,Sebastian Wagner-Carena,Helen Qu,Siavash Golkar,Alberto Bietti,Hatim Bourfoune,Nathan Cassereau,Pierre Cornette,Keiya Hirashima,Geraud Krawezik,Ruben Ohana,Nicholas Lourie,Michael McCabe,Rudy Morel,Payel Mukhopadhyay,Mariel Pettee,Bruno Régaldo-Saint Blancard,Kyunghyun Cho,Miles Cranmer,Shirley Ho
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at NeurIPS 2025 Machine Learning and the Physical Sciences Workshop

点击查看摘要

Abstract:Sequential scientific data span many resolutions and domains, and unifying them into a common representation is a key step toward developing foundation models for the sciences. Astronomical spectra exemplify this challenge: massive surveys have collected millions of spectra across a wide range of wavelengths and resolutions, yet analyses remain fragmented across spectral domains (e.g., optical vs. infrared) and object types (e.g., stars vs. galaxies), limiting the ability to pool information across datasets. We present a deep learning model that jointly learns from heterogeneous spectra in a self-supervised manner. Our universal spectral tokenizer processes spectra from a variety of object types and resolutions directly on their native wavelength grids, producing intrinsically aligned, homogeneous, and physically meaningful representations that can be efficiently adapted to achieve competitive performance across a range of downstream tasks. For the first time, we demonstrate that a single model can unify spectral data across resolutions and domains, suggesting that our model can serve as a powerful building block for foundation models in astronomy – and potentially extend to other scientific domains with heterogeneous sequential data, such as climate and healthcare.
zh

[AI-123] XDXD: End-to-end crystal structure determination with low resolution X-ray diffraction

【速读】:该论文旨在解决从低分辨率单晶X射线衍射数据中确定晶体结构的问题,这一问题在材料科学、化学和生物学等领域尤为关键,因传统方法在低分辨率下难以获得清晰且可解释的电子密度图(electron density map),从而阻碍了原子级结构的准确解析。解决方案的关键在于提出XDXD——首个端到端的深度学习框架,其基于扩散生成模型(diffusion-based generative model),直接从衍射数据生成符合化学合理性的完整原子模型,无需人工干预电子密度图的解读过程,显著提升了低分辨率条件下的结构解析准确性与自动化水平。

链接: https://arxiv.org/abs/2510.17936
作者: Jiale Zhao,Cong Liu,Yuxuan Zhang,Chengyue Gong,Zhenyi Zhang,Shifeng Jin,Zhenyu Liu
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Determining crystal structures from X-ray diffraction data is fundamental across diverse scientific fields, yet remains a significant challenge when data is limited to low resolution. While recent deep learning models have made breakthroughs in solving the crystallographic phase problem, the resulting low-resolution electron density maps are often ambiguous and difficult to interpret. To overcome this critical bottleneck, we introduce XDXD, to our knowledge, the first end-to-end deep learning framework to determine a complete atomic model directly from low-resolution single-crystal X-ray diffraction data. Our diffusion-based generative model bypasses the need for manual map interpretation, producing chemically plausible crystal structures conditioned on the diffraction pattern. We demonstrate that XDXD achieves a 70.4% match rate for structures with data limited to 2.0 Å resolution, with a root-mean-square error (RMSE) below 0.05. Evaluated on a benchmark of 24,000 experimental structures, our model proves to be robust and accurate. Furthermore, a case study on small peptides highlights the model’s potential for extension to more complex systems, paving the way for automated structure solution in previously intractable cases.
zh

[AI-124] CBINNS: Cancer Biology-Informed Neural Network for Unknown Parameter Estimation and Missing Physics Identification

【速读】:该论文旨在解决肿瘤免疫相互作用动力学建模中参数估计不准确以及物理机制缺失的问题,尤其是在实验数据稀疏且噪声较大情况下,传统微分方程模型难以有效识别未知参数和补全缺失的生物物理规律。解决方案的关键在于提出一种癌症生物学信息神经网络(Cancer Biology-Informed Neural Network, CBINN),该模型通过融合先验生物学知识与深度学习框架,能够从少量、噪声较大的测量数据中同时推断未知模型参数并自动发现系统中缺失的动力学项,从而实现对复杂肿瘤-免疫交互过程的精准建模与机制解析。
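“把未知参数设为可训练变量、与网络一同最小化 ODE 残差”是此类生物学信息网络的共同骨架。下面以单方程逻辑斯蒂增长 dN/dt = rN(1-N/K) 为例给出 PyTorch 草图(方程、数据与超参数均为笔者假设,仅演示未知参数 r 的联合推断):

```python
import torch

torch.manual_seed(0)

# 模拟稀疏含噪观测:真实 r=1.5, K=1.0, N(0)=0.1
r_true, K = 1.5, 1.0
t_obs = torch.linspace(0, 4, 12).reshape(-1, 1)
N_obs = K / (1 + 9 * torch.exp(-r_true * t_obs)) + 0.02 * torch.randn_like(t_obs)

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
log_r = torch.nn.Parameter(torch.tensor(0.0))    # 未知参数(取对数保证正性)
opt = torch.optim.Adam(list(net.parameters()) + [log_r], lr=1e-3)

t_col = torch.linspace(0, 4, 100).reshape(-1, 1).requires_grad_(True)
for step in range(5000):
    opt.zero_grad()
    N_col = net(t_col)
    dN = torch.autograd.grad(N_col, t_col, torch.ones_like(N_col),
                             create_graph=True)[0]
    r = log_r.exp()
    loss_phys = ((dN - r * N_col * (1 - N_col / K)) ** 2).mean()  # ODE 残差
    loss_data = ((net(t_obs) - N_obs) ** 2).mean()                # 数据拟合
    (loss_phys + loss_data).backward()
    opt.step()
print("推断得到的 r ≈", float(log_r.exp()))       # 应接近真实值 1.5
```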

链接: https://arxiv.org/abs/2510.17920
作者: Bishal Chhetri,B.V. Rathish Kumar
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: 29 pages, 24 figures

点击查看摘要

Abstract:The dynamics of tumor-immune interactions within a complex tumor microenvironment are typically modeled using a system of ordinary differential equations or partial differential equations. These models introduce some unknown parameters that need to be estimated accurately and efficiently from the limited and noisy experimental data. Moreover, due to the intricate biological complexity and limitations in experimental measurements, tumor-immune dynamics are not fully understood, and therefore, only partial knowledge of the underlying physics may be available, resulting in unknown or missing terms within the system of equations. In this study, we develop a cancer biology-informed neural network model(CBINN) to infer the unknown parameters in the system of equations as well as to discover the missing physics from sparse and noisy measurements. We test the performance of the CBINN model on three distinct nonlinear compartmental tumor-immune models and evaluate its robustness across multiple synthetic noise levels. By harnessing these highly nonlinear dynamics, our CBINN framework effectively estimates the unknown model parameters and uncovers the underlying physical laws or mathematical structures that govern these biological systems, even from scattered and noisy measurements. The models chosen here represent the dynamic patterns commonly observed in compartmental models of tumor-immune interactions, thereby validating the generalizability and efficacy of our methodology.
zh

[AI-125] Brain-Language Model Alignment: Insights into the Platonic Hypothesis and Intermediate-Layer Advantage

【速读】:该论文试图解决的问题是:大脑与语言模型是否趋向于形成对世界相同的内部表征(internal representations)。为回答这一问题,作者系统回顾了2023至2025年间发表的25项基于功能性磁共振成像(fMRI)的研究,并将其发现与两个核心假设进行对照:一是柏拉图式表征假说(Platonic Representation Hypothesis),即随着模型规模扩大和性能提升,其表征会趋近于真实世界的抽象结构;二是中间层优势假说(Intermediate-Layer Advantage),即模型中层(mid-depth layers)通常编码更丰富、更具泛化能力的特征。研究的关键在于通过整合多篇实证研究结果,提供一致证据表明语言模型与人脑在抽象表征层面存在共性,从而支持上述两个假说,并推动对脑-模型对齐(brain-model alignment)机制的进一步探索。

链接: https://arxiv.org/abs/2510.17833
作者: Ángela López-Cardona,Sebastián Idesis,Mireia Masias-Bruns,Sergi Abadal,Ioannis Arapakis
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Do brains and language models converge toward the same internal representations of the world? Recent years have seen a rise in studies of neural activations and model alignment. In this work, we review 25 fMRI-based studies published between 2023 and 2025 and explicitly confront their findings with two key hypotheses: (i) the Platonic Representation Hypothesis – that as models scale and improve, they converge to a representation of the real world, and (ii) the Intermediate-Layer Advantage – that intermediate (mid-depth) layers often encode richer, more generalizable features. Our findings provide converging evidence that models and brains may share abstract representational structures, supporting both hypotheses and motivating further research on brain-model alignment.
zh

[AI-126] Synthetic EEG Generation using Diffusion Models for Motor Imagery Tasks

【速读】:该论文旨在解决脑电图(Electroencephalography, EEG)数据采集中面临的高质量数据稀缺问题,包括传感器成本高、采集时间长以及个体间差异大等挑战,从而限制了脑机接口(Brain-Computer Interface, BCI)系统性能的提升。其解决方案的关键在于利用扩散概率模型(Diffusion Probabilistic Models, DDPM)生成与运动想象任务相关的合成EEG信号,通过预处理真实EEG数据并训练扩散模型从噪声中重建EEG通道,最终在信号层面和任务层面均验证了合成数据的质量:分类准确率超过95%,均方误差低且与真实信号相关性高,表明该方法能有效补充真实数据集,提升EEG-BCI系统的分类性能。
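DDPM 的正向加噪有封闭形式 x_t = √ᾱ_t·x₀ + √(1-ᾱ_t)·ε,训练即让网络从 x_t 回归噪声 ε。下面用 NumPy 给出这一核心步骤的草图(以一段模拟 EEG 节律为例,噪声日程取常见默认值,并非论文的具体配置):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # 线性噪声日程
alpha_bar = np.cumprod(1.0 - betas)      # ᾱ_t = Π_{s≤t} (1-β_s)

def q_sample(x0, t, rng):
    """正向扩散:给定干净信号 x0,直接采样第 t 步的带噪版本 x_t。"""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps                       # eps 即去噪网络的回归目标

rng = np.random.default_rng(0)
x0 = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 256))  # 模拟 10 Hz 的 EEG 节律
for t in (0, 250, 999):
    xt, _ = q_sample(x0, t, rng)
    print(f"t={t:4d}  信号标准差={xt.std():.3f}  ᾱ_t={alpha_bar[t]:.4f}")
```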

链接: https://arxiv.org/abs/2510.17832
作者: Henrique de Lima Alexandre,Clodoaldo Aparecido de Moraes Lima
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, BRACIS

点击查看摘要

Abstract:Electroencephalography (EEG) is a widely used, non-invasive method for capturing brain activity, and is particularly relevant for applications in Brain-Computer Interfaces (BCI). However, collecting high-quality EEG data remains a major challenge due to sensor costs, acquisition time, and inter-subject variability. To address these limitations, this study proposes a methodology for generating synthetic EEG signals associated with motor imagery brain tasks using Diffusion Probabilistic Models (DDPM). The approach involves preprocessing real EEG data, training a diffusion model to reconstruct EEG channels from noise, and evaluating the quality of the generated signals through both signal-level and task-level metrics. For validation, we employed classifiers such as K-Nearest Neighbors (KNN), Convolutional Neural Networks (CNN), and U-Net to compare the performance of synthetic data against real data in classification tasks. The generated data achieved classification accuracies above 95%, with low mean squared error and high correlation with real signals. Our results demonstrate that synthetic EEG signals produced by diffusion models can effectively complement datasets, improving classification performance in EEG-based BCIs and addressing data scarcity.
zh

[AI-127] Multi-Agent Design Assistant for the Simulation of Inertial Fusion Energy

【速读】:该论文旨在解决惯性约束聚变(Inertial Confinement Fusion, ICF)系统设计中因极端物理条件和多尺度非线性行为所带来的复杂优化难题,尤其是如何在高保真物理模型约束下高效探索和优化燃料胶囊几何结构以实现点火。解决方案的关键在于构建一个基于多智能体(multi-agent)的自主推理系统,该系统能够通过自然语言交互调用高阶多物理场计算代码,并结合物理仿真与人工智能推理能力,在无需人工干预的情况下协同完成胶囊几何的逆向设计与参数优化,从而显著提升设计效率并突破传统试错式工程方法的局限。

链接: https://arxiv.org/abs/2510.17830
作者: Meir H. Shachar,Dane M. Sterbentz,Harshitha Menon,Charles F. Jekel,M. Giselle Fernández-Godino,Yue Hao,Kevin Korner,Robert Rieben,Daniel A. White,William J. Schill,Jonathan L. Belof
机构: 未知
类目: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inertial fusion energy promises nearly unlimited, clean power if it can be achieved. However, the design and engineering of fusion systems requires controlling and manipulating matter at extreme energies and timescales; the shock physics and radiation transport governing the physical behavior under these conditions are complex requiring the development, calibration, and use of predictive multiphysics codes to navigate the highly nonlinear and multi-faceted design landscape. We hypothesize that artificial intelligence reasoning models can be combined with physics codes and emulators to autonomously design fusion fuel capsules. In this article, we construct a multi-agent system where natural language is utilized to explore the complex physics regimes around fusion energy. The agentic system is capable of executing a high-order multiphysics inertial fusion computational code. We demonstrate the capacity of the multi-agent design assistant to both collaboratively and autonomously manipulate, navigate, and optimize capsule geometry while accounting for high fidelity physics that ultimately achieve simulated ignition via inverse design.
zh

[AI-128] Speak to a Protein: An Interactive Multimodal Co-Scientist for Protein Analysis

【速读】:该论文旨在解决蛋白质结构分析过程中耗时长、门槛高且依赖专业计算技能的问题,传统方法需数周时间阅读文献、交叉比对晶体与预测结构,并检查配体复合物。其解决方案的关键在于提出了一种名为“Speak to a Protein”的交互式多模态对话系统,该系统能够整合文献、结构及配体数据,基于实时3D场景提供可交互的可视化回答,并支持标注、操作和代码生成,从而将语言、代码与三维结构紧密耦合,显著缩短从问题到证据的时间,降低高级结构分析的准入门槛,并支持即时假设生成。

链接: https://arxiv.org/abs/2510.17826
作者: Carles Navarro,Mariona Torrens,Philipp Thölke,Stefan Doerr,Gianni De Fabritiis
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building a working mental model of a protein typically requires weeks of reading, cross-referencing crystal and predicted structures, and inspecting ligand complexes, an effort that is slow, unevenly accessible, and often requires specialized computational skills. We introduce Speak to a Protein, a new capability that turns protein analysis into an interactive, multimodal dialogue with an expert co-scientist. The AI system retrieves and synthesizes relevant literature, structures, and ligand data; grounds answers in a live 3D scene; and can highlight, annotate, manipulate and see the visualization. It also generates and runs code when needed, explaining results in both text and graphics. We demonstrate these capabilities on relevant proteins, posing questions about binding pockets, conformational changes, or structure-activity relationships to test ideas in real-time. Speak to a Protein reduces the time from question to evidence, lowers the barrier to advanced structural analysis, and enables hypothesis generation by tightly coupling language, code, and 3D structures. Speak to a Protein is freely accessible at this https URL.
zh

[AI-129] Carbon-Aware Orchestration of Integrated Satellite Aerial Terrestrial Networks via Digital Twin

【速读】:该论文旨在解决集成卫星-空中-地面网络(Integrated Satellite Aerial Terrestrial Networks, ISATNs)在6G时代大规模部署中面临的高碳排放与能源不可持续问题。现有研究多聚焦于服务质量(QoS)优化,忽视了碳足迹对环境的影响。其解决方案的关键在于提出一种基于数字孪生(Digital Twin, DT)技术的碳感知编排框架,以“每比特二氧化碳当量”(gCO₂/bit)为核心可持续性指标,并引入多时间尺度的计划-执行-检查-行动(Plan Do Check Act, PDCA)循环机制,融合日前预测与实时自适应优化策略。通过调控特定于ISATN的控制变量——如碳感知切换、无人机任务周期调度和可再生能源感知边缘部署——实现显著减排效果,在保障性能的同时提升可再生能源利用率与极端事件下的韧性。
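其核心指标“每比特二氧化碳当量”(gCO₂/bit)的计算方式很直接:分时段累加“功率 × 时长 × 当时电网碳强度”,再除以传输总比特数。示意性草图如下(数值均为假设):

```python
def gco2_per_bit(power_w, interval_s, carbon_gco2_per_kwh, bits):
    """gCO₂/bit = Σ(功率·时长·碳强度) / 总比特数;1 kWh = 3.6e6 J。"""
    energy_kwh = [p * dt / 3.6e6 for p, dt in zip(power_w, interval_s)]
    emissions_g = sum(e * ci for e, ci in zip(energy_kwh, carbon_gco2_per_kwh))
    return emissions_g / bits

# 三个时段:卫星回程、无人机中继、地面基站(功率与碳强度均为假设值)
print(gco2_per_bit(power_w=[120.0, 80.0, 300.0],
                   interval_s=[600, 600, 600],
                   carbon_gco2_per_kwh=[450, 450, 200],
                   bits=5e11), "gCO₂/bit")
```

在该指标下,碳感知切换的直觉一目了然:把流量迁移到“当时碳强度低”的链路或时段,即可在吞吐量不变的情况下压低 gCO₂/bit。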

链接: https://arxiv.org/abs/2510.17825
作者: Shumaila Javaid,Nasir Saeed
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Integrated Satellite Aerial Terrestrial Networks (ISATNs) are envisioned as key enablers of 6G, providing global connectivity for applications such as autonomous transportation, Industrial IoT, and disaster response. Their large-scale deployment, however, risks unsustainable energy use and carbon emissions. This work advances prior energy-aware studies by proposing a carbon-aware orchestration framework for ISATNs that leverages Digital Twin (DT) technology. The framework adopts grams of CO₂-equivalent per bit (gCO₂/bit) as a primary sustainability metric and implements a multi timescale Plan Do Check Act (PDCA) loop that combines day-ahead forecasting with real-time adaptive optimization. ISATN-specific control knobs, including carbon-aware handovers, UAV duty cycling, and renewable-aware edge placement, are exploited to reduce emissions. Simulation results with real carbon intensity data show up to 29% lower gCO₂/bit than QoS-only orchestration, while improving renewable utilization and resilience under adverse events.
zh

[AI-130] A Biophysical-Model-Informed Source Separation Framework For EMG Decomposition

【速读】:该论文旨在解决传统盲源分离(Blind Source Separation, BSS)方法在表面肌电(surface electromyography, sEMG)中运动单位(Motor Unit, MU)分解时缺乏生物物理约束,导致估计精度和可解释性不足的问题。其解决方案的关键在于提出了一种生物物理模型引导的源分离(Biophysical-Model-Informed Source Separation, BMISS)框架,通过将基于MRI重建的解剖学精确前向肌电模型与生成建模相结合,实现对神经驱动信号及运动神经元特性的无监督直接反演,从而在保持高保真度的同时显著降低计算复杂度。

链接: https://arxiv.org/abs/2510.17822
作者: D. Halatsis,P. Mamidanna,J. Pereira,D. Farina
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in neural interfacing have enabled significant improvements in human-computer interaction, rehabilitation, and neuromuscular diagnostics. Motor unit (MU) decomposition from surface electromyography (sEMG) is a key technique for extracting neural drive information, but traditional blind source separation (BSS) methods fail to incorporate biophysical constraints, limiting their accuracy and interpretability. In this work, we introduce a novel Biophysical-Model-Informed Source Separation (BMISS) framework, which integrates anatomically accurate forward EMG models into the decomposition process. By leveraging MRI-based anatomical reconstructions and generative modeling, our approach enables direct inversion of a biophysically accurate forward model to estimate both neural drive and motor neuron properties in an unsupervised manner. Empirical validation in a controlled simulated setting demonstrates that BMISS achieves higher fidelity motor unit estimation while significantly reducing computational cost compared to traditional methods. This framework paves the way for non-invasive, personalized neuromuscular assessments, with potential applications in clinical diagnostics, prosthetic control, and neurorehabilitation.
zh

机器学习

[LG-0] A Hybrid Enumeration Framework for Optimal Counterfactual Generation in Post-Acute COVID-19 Heart Failure

链接: https://arxiv.org/abs/2510.18841
作者: Jingya Cheng,Alaleh Azhir,Jiazi Tian,Hossein Estiri
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Counterfactual inference provides a mathematical framework for reasoning about hypothetical outcomes under alternative interventions, bridging causal reasoning and predictive modeling. We present a counterfactual inference framework for individualized risk estimation and intervention analysis, illustrated through a clinical application to post-acute sequelae of COVID-19 (PASC) among patients with pre-existing heart failure (HF). Using longitudinal diagnosis, laboratory, and medication data from a large health-system cohort, we integrate regularized predictive modeling with counterfactual search to identify actionable pathways to PASC-related HF hospital admissions. The framework combines exact enumeration with optimization-based methods, including the Nearest Instance Counterfactual Explanations (NICE) and Multi-Objective Counterfactuals (MOC) algorithms, to efficiently explore high-dimensional intervention spaces. Applied to more than 2700 individuals with confirmed SARS-CoV-2 infection and prior HF, the model achieved strong discriminative performance (AUROC: 0.88, 95% CI: 0.84-0.91) and generated interpretable, patient-specific counterfactuals that quantify how modifying comorbidity patterns or treatment factors could alter predicted outcomes. This work demonstrates how counterfactual reasoning can be formalized as an optimization problem over predictive functions, offering a rigorous, interpretable, and computationally efficient approach to personalized inference in complex biomedical systems.

[LG-1] BO4Mob: Bayesian Optimization Benchmarks for High-Dimensional Urban Mobility Problem

链接: https://arxiv.org/abs/2510.18824
作者: Seunghee Ryu,Donghoon Kwon,Seongjin Choi,Aryan Deshwal,Seungmo Kang,Carolina Osorio
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce BO4Mob, a new benchmark framework for high-dimensional Bayesian Optimization (BO), driven by the challenge of origin-destination (OD) travel demand estimation in large urban road networks. Estimating OD travel demand from limited traffic sensor data is a difficult inverse optimization problem, particularly in real-world, large-scale transportation networks. This problem involves optimizing over high-dimensional continuous spaces where each objective evaluation is computationally expensive, stochastic, and non-differentiable. BO4Mob comprises five scenarios based on real-world San Jose, CA road networks, with input dimensions scaling up to 10,100. These scenarios utilize high-resolution, open-source traffic simulations that incorporate realistic nonlinear and stochastic dynamics. We demonstrate the benchmark’s utility by evaluating five optimization methods: three state-of-the-art BO algorithms and two non-BO baselines. This benchmark is designed to support both the development of scalable optimization algorithms and their application for the design of data-driven urban mobility models, including high-resolution digital twins of metropolitan road networks. Code and documentation are available at this https URL.

[LG-2] Search Self-play: Pushing the Frontier of Agent Capability without Supervision

链接: https://arxiv.org/abs/2510.18821
作者: Hongliang Lu,Yuhang Wen,Pengyu Cheng,Ruijin Ding,Haotian Xu,Jiaqi Guo,Chutian Wang,Haonan Chen,Xiaoxi Jiang,Guanjun Jiang
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human efforts and hinders the RL scaling processes, especially under agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer’s trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents’ performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at this https URL.

[LG-3] A Unified Perspective on Optimization in Machine Learning and Neuroscience: From Gradient Descent to Neural Adaptation

链接: https://arxiv.org/abs/2510.18812
作者: Jesús García Fernández,Nasir Ahmad,Marcel van Gerven
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Iterative optimization is central to modern artificial intelligence (AI) and provides a crucial framework for understanding adaptive systems. This review provides a unified perspective on this subject, bridging classic theory with neural network training and biological learning. Although gradient-based methods, powered by the efficient but biologically implausible backpropagation (BP), dominate machine learning, their computational demands can hinder scalability in high-dimensional settings. In contrast, derivative-free or zeroth-order (ZO) optimization feature computationally lighter approaches that rely only on function evaluations and randomness. While generally less sample efficient, recent breakthroughs demonstrate that modern ZO methods can effectively approximate gradients and achieve performance competitive with BP in neural network models. This ZO paradigm is also particularly relevant for biology. Its core principles of random exploration (probing) and feedback-guided adaptation (reinforcing) parallel key mechanisms of biological learning, offering a mathematically principled perspective on how the brain learns. In this review, we begin by categorizing optimization approaches based on the order of derivative information they utilize, ranging from first-, second-, and higher-order gradient-based to ZO methods. We then explore how these methods are adapted to the unique challenges of neural network training and the resulting learning dynamics. Finally, we build upon these insights to view biological learning through an optimization lens, arguing that a ZO paradigm leverages the brain’s intrinsic noise as a computational resource. This framework not only illuminates our understanding of natural intelligence but also holds vast implications for neuromorphic hardware, helping us design fast and energy-efficient AI systems that exploit intrinsic hardware noise.

[LG-4] When LRP Diverges from Leave-One-Out in Transformers EMNLP2025

链接: https://arxiv.org/abs/2510.18810
作者: Weiqiu You,Siqi Zeng,Yao-Hung Hubert Tsai,Makoto Yamada,Han Zhao
类目: Machine Learning (cs.LG)
备注: BlackboxNLP @ EMNLP 2025

点击查看摘要

Abstract:Leave-One-Out (LOO) provides an intuitive measure of feature importance but is computationally prohibitive. While Layer-Wise Relevance Propagation (LRP) offers a potentially efficient alternative, its axiomatic soundness in modern Transformers remains largely under-examined. In this work, we first show that the bilinear propagation rules used in recent advances of AttnLRP violate the implementation invariance axiom. We prove this analytically and confirm it empirically in linear attention layers. Second, we also revisit CP-LRP as a diagnostic baseline and find that bypassing relevance propagation through the softmax layer – backpropagating relevance only through the value matrices – significantly improves alignment with LOO, particularly in middle-to-late Transformer layers. Overall, our results suggest that (i) bilinear factorization sensitivity and (ii) softmax propagation error potentially jointly undermine LRP’s ability to approximate LOO in Transformers.

[LG-5] On Biologically Plausible Learning in Continuous Time

链接: https://arxiv.org/abs/2510.18808
作者: Marc Gong Bacvanski,Liu Ziyin,Tomaso Poggio
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Biological learning unfolds continuously in time, yet most algorithmic models rely on discrete updates and separate inference and learning phases. We study a continuous-time neural model that unifies several biologically plausible learning algorithms and removes the need for phase separation. Rules including stochastic gradient descent (SGD), feedback alignment (FA), direct feedback alignment (DFA), and Kolen-Pollack (KP) emerge naturally as limiting cases of the dynamics. Simulations show that these continuous-time networks stably learn at biological timescales, even under temporal mismatches and integration noise. Through analysis and simulation, we show that learning depends on temporal overlap: a synapse updates correctly only when its input and the corresponding error signal coincide in time. When inputs are held constant, learning strength declines linearly as the delay between input and error approaches the stimulus duration, explaining observed robustness and failure across network depths. Critically, robust learning requires the synaptic plasticity timescale to exceed the stimulus duration by one to two orders of magnitude. For typical cortical stimuli (tens of milliseconds), this places the functional plasticity window in the few-second range, a testable prediction that identifies seconds-scale eligibility traces as necessary for error-driven learning in biological circuits.

[LG-6] Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams

链接: https://arxiv.org/abs/2510.18786
作者: Federica Granese,Serena Villata,Charles Bouveyron
类目: Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.

[LG-7] CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training

链接: https://arxiv.org/abs/2510.18784
作者: Soroush Tabesh,Mher Safaryan,Dan Alistarh
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite significant work on low-bit quantization-aware training (QAT), there is still a large accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics. When pre-training Llama-style models of up to 800M-parameters, CAGE recovers over 10% of the quantization-induced loss increase in the W4A4 regime over outlier-mitigation methods. These results indicate that curvature-aware gradient corrections can bridge the remaining performance gap beyond current outlier-handling methods.
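作为背景补充:CAGE 所要修正的直通估计器(STE)在 PyTorch 中的常见写法如下(最小草图,仅含朴素 STE,不包含论文提出的曲率感知修正项):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """直通估计器:前向做取整量化,反向把梯度原样传回(绕过不可导的 round)。"""
    @staticmethod
    def forward(ctx, x, delta):
        return delta * torch.round(x / delta)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None            # 对 x 直通;量化步长 delta 不求导

x = torch.randn(5, requires_grad=True)
y = RoundSTE.apply(x, torch.tensor(0.25))
y.sum().backward()
print(x.grad)                            # 全 1:梯度未被取整操作阻断
```

CAGE 的思路即在此直通梯度上再叠加一个依赖局部曲率信息的修正项,以抵消量化引起的损失上升。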

[LG-8] Enhancing Fractional Gradient Descent with Learned Optimizers

链接: https://arxiv.org/abs/2510.18783
作者: Jan Sobotka,Petr Šimánek,Pavel Kordík
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Fractional Gradient Descent (FGD) offers a novel and promising way to accelerate optimization by incorporating fractional calculus into machine learning. Although FGD has shown encouraging initial results across various optimization tasks, it faces significant challenges with convergence behavior and hyperparameter selection. Moreover, the impact of its hyperparameters is not fully understood, and scheduling them is particularly difficult in non-convex settings such as neural network training. To address these issues, we propose a novel approach called Learning to Optimize Caputo Fractional Gradient Descent (L2O-CFGD), which meta-learns how to dynamically tune the hyperparameters of Caputo FGD (CFGD). Our method’s meta-learned schedule outperforms CFGD with static hyperparameters found through an extensive search and, in some tasks, achieves performance comparable to a fully black-box meta-learned optimizer. L2O-CFGD can thus serve as a powerful tool for researchers to identify high-performing hyperparameters and gain insights on how to leverage the history-dependence of the fractional differential in optimization.

[LG-9] Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference

链接: https://arxiv.org/abs/2510.18768
作者: Harry Amad,Zhaozhi Qian,Dennis Frauen,Julianna Piskorz,Stefan Feuerriegel,Mihaela van der Schaar
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Causal inference is essential for developing and evaluating medical interventions, yet real-world medical datasets are often difficult to access due to regulatory barriers. This makes synthetic data a potentially valuable asset that enables these medical analyses, along with the development of new inference methods themselves. Generative models can produce synthetic data that closely approximate real data distributions, yet existing methods do not consider the unique challenges that downstream causal inference tasks, and specifically those focused on treatments, pose. We establish a set of desiderata that synthetic data containing treatments should satisfy to maximise downstream utility: preservation of (i) the covariate distribution, (ii) the treatment assignment mechanism, and (iii) the outcome generation mechanism. Based on these desiderata, we propose a set of evaluation metrics to assess such synthetic data. Finally, we present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine that mimics the data-generating process of data containing treatments and optimises for our desiderata. We empirically demonstrate that STEAM achieves state-of-the-art performance across our metrics as compared to existing generative models, particularly as the complexity of the true data-generating process increases.

[LG-10] OmniCast: A Masked Latent Diffusion Model for Weather Forecasting Across Time Scales NEURIPS2025

链接: https://arxiv.org/abs/2510.18707
作者: Tung Nguyen,Tuan Pham,Troy Arcomano,Veerabhadra Kotamarthi,Ian Foster,Sandeep Madireddy,Aditya Grover
类目: Machine Learning (cs.LG)
*备注: Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Accurate weather forecasting across time scales is critical for anticipating and mitigating the impacts of climate change. Recent data-driven methods based on deep learning have achieved significant success in the medium range, but struggle at longer subseasonal-to-seasonal (S2S) horizons due to error accumulation in their autoregressive approach. In this work, we propose OmniCast, a scalable and skillful probabilistic model that unifies weather forecasting across timescales. OmniCast consists of two components: a VAE model that encodes raw weather data into a continuous, lower-dimensional latent space, and a diffusion-based transformer model that generates a sequence of future latent tokens given the initial conditioning tokens. During training, we mask random future tokens and train the transformer to estimate their distribution given conditioning and visible tokens using a per-token diffusion head. During inference, the transformer generates the full sequence of future tokens by iteratively unmasking random subsets of tokens. This joint sampling across space and time mitigates compounding errors from autoregressive approaches. The low-dimensional latent space enables modeling long sequences of future latent states, allowing the transformer to learn weather dynamics beyond initial conditions. OmniCast performs competitively with leading probabilistic methods at the medium-range timescale while being 10x to 20x faster, and achieves state-of-the-art performance at the subseasonal-to-seasonal scale across accuracy, physics-based, and probabilistic metrics. Furthermore, we demonstrate that OmniCast can generate stable rollouts up to 100 years ahead. Code and model checkpoints are available at this https URL.

[LG-11] Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach

链接: https://arxiv.org/abs/2510.18687
作者: Chenbei Lu,Zaiwei Chen,Tongxin Li,Chenye Wu,Adam Wierman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the Bayesian value function to characterize the optimal prediction-aware policy tractably. Second, we develop a novel Bellman-Jensen Gap analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.

[LG-12] Learning Task-Agnostic Representations through Multi-Teacher Distillation NEURIPS-2025

链接: https://arxiv.org/abs/2510.18680
作者: Philippe Formont,Maxime Darrin,Banafsheh Karimian,Jackie CK Cheung,Eric Granger,Ismail Ben Ayed,Mohammadhadi Shateri,Pablo Piantanida
类目: Machine Learning (cs.LG)
*备注: NeurIPS-2025

点击查看摘要

Abstract:Casting complex inputs into tractable representations is a critical step across various fields. Diverse embedding models emerge from differences in architectures, loss functions, input modalities and datasets, each capturing unique aspects of the input. Multi-teacher distillation leverages this diversity to enrich representations but often remains tailored to specific tasks. In this paper, we introduce a task-agnostic framework based on a ``majority vote" objective function. We demonstrate that this function is bounded by the mutual information between student and teachers’ embeddings, leading to a task-agnostic distillation loss that eliminates dependence on task-specific labels or prior knowledge. Our evaluations across text, vision models, and molecular modeling show that our method effectively leverages teacher diversity, resulting in representations enabling better performance for a wide range of downstream tasks such as classification, clustering, or regression. Additionally, we train and release state-of-the-art embedding models, enhancing downstream performance in various modalities.
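
论文的"majority vote"目标以学生与各教师嵌入间的互信息为上界。下面是体现该思路的示意性替代实现(用 InfoNCE 作为互信息的常见代理,并非论文的原始目标函数;假设各教师嵌入已投影到与学生相同的维度):

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_emb, teacher_embs, temperature=0.1):
    # student_emb: (batch, d);teacher_embs: 每个教师一个 (batch, d) 张量
    loss = 0.0
    for t_emb in teacher_embs:
        s = F.normalize(student_emb, dim=-1)
        t = F.normalize(t_emb, dim=-1)
        logits = s @ t.T / temperature            # 批内两两相似度
        labels = torch.arange(len(s))             # 对角线为匹配样本对
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(teacher_embs)
```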

[LG-13] Learning Time-Varying Turn-Taking Behavior in Group Conversations

链接: https://arxiv.org/abs/2510.18649
作者: Madeline Navarro,Lisa O’Bryan,Santiago Segarra
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a flexible probabilistic model for predicting turn-taking patterns in group conversations based solely on individual characteristics and past speaking behavior. Many models of conversation dynamics cannot yield insights that generalize beyond a single group. Moreover, past works often aim to characterize speaking behavior through a universal formulation that may not be suitable for all groups. We thus develop a generalization of prior conversation models that predicts speaking turns among individuals in any group based on their individual characteristics, that is, personality traits, and prior speaking behavior. Importantly, our approach provides the novel ability to learn how speaking inclination varies based on when individuals last spoke. We apply our model to synthetic and real-world conversation data to verify the proposed approach and characterize real group interactions. Our results demonstrate that previous behavioral models may not always be realistic, motivating our data-driven yet theoretically grounded approach.
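
一个体现该建模思想的玩具示例(特征与权重均为假设):说话倾向由个体特质与"距上次发言的时间"共同决定,softmax 给出下一轮各人的发言概率:

```python
import numpy as np

def next_speaker_probs(traits, last_spoke, t, w_trait=1.0, w_recency=0.5):
    # traits: 个体特质得分(例如外向性);last_spoke: 各人上次发言的时间步
    recency = t - np.asarray(last_spoke, dtype=float)   # 距上次发言的间隔
    logits = w_trait * np.asarray(traits) + w_recency * np.log1p(recency)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# 3 名说话者:外向性得分与各自上次发言的时间
print(next_speaker_probs(traits=[0.9, 0.1, 0.4], last_spoke=[8, 2, 5], t=10))
```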

[LG-14] Informed Learning for Estimating Drought Stress at Fine-Scale Resolution Enables Accurate Yield Prediction

链接: https://arxiv.org/abs/2510.18648
作者: Miro Miranda,Marcela Charfuelan,Matias Valdenegro Toro,Andreas Dengel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Water is essential for agricultural productivity. Assessing water shortages and reduced yield potential is a critical factor in decision-making for ensuring agricultural productivity and food security. Crop simulation models, which align with physical processes, offer intrinsic explainability but often perform poorly. Conversely, machine learning models for crop yield modeling are powerful and scalable, yet they commonly operate as black boxes and lack adherence to the physical principles of crop growth. This study bridges this gap by coupling the advantages of both worlds. We postulate that the crop yield is inherently defined by the water availability. Therefore, we formulate crop yield as a function of temporal water scarcity and predict both the crop drought stress and the sensitivity to water scarcity at fine-scale resolution. Sequentially modeling the crop yield response to water enables accurate yield prediction. To enforce physical consistency, a novel physics-informed loss function is proposed. We leverage multispectral satellite imagery, meteorological data, and fine-scale yield data. Further, to account for the uncertainty within the model, we build upon a deep ensemble approach. Our method surpasses state-of-the-art models like LSTM and Transformers in crop yield prediction with a coefficient of determination (R^2-score) of up to 0.82 while offering high explainability. This method offers decision support for industry, policymakers, and farmers in building a more resilient agriculture in times of changing climate conditions.

[LG-15] Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions NEURIPS2025

链接: https://arxiv.org/abs/2510.18638
作者: Yanna Ding,Songtao Lu,Yingdong Lu,Tomasz Nowicki,Jianxi Gao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
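
论文分析的单层线性自注意力(LSA)去掉了 softmax。下面按一种常见的参数化写出其前向传播(具体形状约定为假设,仅作示意):

```python
import numpy as np

def linear_self_attention(Z, WQ, WK, WV, WO):
    # Z: (d, n) 上下文 token 矩阵;WQ、WK、WV、WO 均为 (d, d) 权重矩阵
    n = Z.shape[1]
    attn = (WK @ Z).T @ (WQ @ Z) / n     # (n, n) 线性注意力得分,无 softmax
    return Z + WO @ (WV @ Z) @ attn      # 残差连接后的输出
```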

[LG-16] Hardness of Learning Regular Languages in the Next Symbol Prediction Setting

链接: https://arxiv.org/abs/2510.18634
作者: Satwik Bhattamishra,Phil Blunsom,Varun Kanade
类目: Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:We study the learnability of languages in the Next Symbol Prediction (NSP) setting, where a learner receives only positive examples from a language together with, for every prefix, (i) whether the prefix itself is in the language and (ii) which next symbols can lead to an accepting string. This setting has been used in prior works to empirically analyze neural sequence models, and additionally, we observe that efficient algorithms for the NSP setting can be used to learn the (truncated) support of language models. We formalize the setting so as to make it amenable to PAC-learning analysis. While the setting provides a much richer set of labels than the conventional classification setting, we show that learning concept classes such as DFAs and Boolean formulas remains computationally hard. The proof is via a construction that makes almost all additional labels uninformative, yielding a reduction from the conventional learning problem to learning with NSP labels. Under cryptographic assumptions, the reduction implies that the problem of learning DFAs is computationally hard in the NSP setting.
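
NSP 设定下每个前缀带两类标签。下面用一个 DFA 字典(字段名为假设)演示如何为一个正例串生成这些标签;其中 'live' 为"可达接受状态"的状态集合,需预先计算:

```python
def nsp_labels(dfa, string):
    # dfa: {'start': q0, 'accepting': 接受状态集合,
    #       'delta': {state: {symbol: state}},
    #       'live': 能到达接受状态的状态集合(预计算)}
    labels, state = [], dfa['start']
    for i in range(len(string) + 1):
        member = state in dfa['accepting']                # (i) 前缀是否在语言中
        viable = {a for a, s2 in dfa['delta'][state].items()
                  if s2 in dfa['live']}                   # (ii) 可行的下一符号
        labels.append((string[:i], member, viable))
        if i < len(string):
            state = dfa['delta'][state][string[i]]
    return labels
```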

[LG-17] Unrolled-SINDy: A Stable Explicit Method for Nonlinear PDE Discovery from Sparsely Sampled Data

链接: https://arxiv.org/abs/2510.18611
作者: Fayad Ali Banna,Antoine Caradot,Eduardo Brandao,Jean-Philippe Colombier,Rémi Emonet,Marc Sebban
类目: Machine Learning (cs.LG)
*备注: 56 pages, 12 figures, 39 tables

点击查看摘要

Abstract:Identifying the governing differential equations of physical dynamics from observational data is a key challenge in machine learning. Although approaches based on SINDy have shown great promise in this area, they still fail to address a whole class of real-world problems where the data is sparsely sampled in time. In this article, we introduce Unrolled-SINDy, a simple methodology that leverages an unrolling scheme to improve the stability of explicit methods for PDE discovery. By decorrelating the numerical time step size from the sampling rate of the available data, our approach enables the recovery of equation parameters that would not be the minimizers of the original SINDy optimization problem due to large local truncation errors. Our method can be exploited either through an iterative closed-form approach or by a gradient descent scheme. Experiments show the versatility of our method. On both traditional SINDy and state-of-the-art noise-robust iNeuralSINDy, with different numerical schemes (Euler, RK4), our proposed unrolling scheme makes it possible to tackle problems that are not accessible to non-unrolled methods.
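
展开(unrolling)的核心是把数值步长与采样间隔解耦:在相邻观测之间用候选动力学做多步小步长积分,再对残差做最小化。以下为示意(库函数 library 与参数 theta 的形状约定为假设):

```python
import numpy as np

def unrolled_residual(theta, library, x0, x1, dt_obs, n_sub=10):
    # 与其用一个 dt_obs 的大步长显式 Euler(稀疏采样时截断误差大),
    # 不如在两次观测之间走 n_sub 个内部小步
    h = dt_obs / n_sub
    x = np.array(x0, dtype=float)
    for _ in range(n_sub):
        x = x + h * library(x) @ theta   # 候选动力学 x' = library(x) @ theta
    return x - np.asarray(x1)            # 关于 theta 最小化该残差
```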

[LG-18] A Compositional Paradigm for Foundation Models: Towards Smarter Robotic Agents

链接: https://arxiv.org/abs/2510.18608
作者: Luigi Quarantiello,Elia Piccoli,Jack Bell,Malio Li,Giacomo Carfì,Eric Nuertey Coleman,Gerlando Gramaglia,Lanpei Li,Mauro Madeddu,Irene Testa,Vincenzo Lomonaco
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The birth of Foundation Models brought unprecedented results in a wide range of tasks, from language to vision, to robotic control. These models are able to process huge quantities of data, and can extract and develop rich representations, which can be employed across different domains and modalities. However, they still have issues in adapting to dynamic, real-world scenarios without retraining the entire model from scratch. In this work, we propose the application of Continual Learning and Compositionality principles to foster the development of more flexible, efficient and smart AI solutions.

[LG-19] Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing

链接: https://arxiv.org/abs/2510.18591
作者: Chia-Hsuan Lu,Tony Tan,Michael Benedikt
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are the predominant architecture for learning over graphs. As with any machine learning model, an important issue is the detection of adversarial attacks, where an adversary can change the output with a small perturbation of the input. Techniques for solving the adversarial robustness problem - determining whether such an attack exists - were originally developed for image classification, but there are variants for many other machine learning architectures. In the case of graph learning, the attack model usually considers changes to the graph structure in addition to or instead of the numerical features of the input, and the state-of-the-art techniques in the area proceed via reduction to constraint solving, working on top of powerful solvers, e.g. for mixed integer programming. We show that it is possible to improve on the state of the art in structural robustness by replacing the use of powerful solvers with calls to efficient partial solvers, which run in polynomial time but may be incomplete. We evaluate our tool RobLight on a diverse set of GNN variants and datasets.

[LG-20] HeFS: Helper-Enhanced Feature Selection via Pareto-Optimized Genetic Search

链接: https://arxiv.org/abs/2510.18575
作者: Yusi Fan,Tian Wang,Zhiying Yan,Chang Liu,Qiong Zhou,Qi Lu,Zhehao Guo,Ziqi Deng,Wenyu Zhu,Ruochi Zhang,Fengfeng Zhou
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Feature selection is a combinatorial optimization problem that is NP-hard. Conventional approaches often employ heuristic or greedy strategies, which are prone to premature convergence and may fail to capture subtle yet informative features. This limitation becomes especially critical in high-dimensional datasets, where complex and interdependent feature relationships prevail. We introduce the HeFS (Helper-Enhanced Feature Selection) framework to refine feature subsets produced by existing algorithms. HeFS systematically searches the residual feature space to identify a Helper Set - features that complement the original subset and improve classification performance. The approach employs a biased initialization scheme and a ratio-guided mutation mechanism within a genetic algorithm, coupled with Pareto-based multi-objective optimization to jointly maximize predictive accuracy and feature complementarity. Experiments on 18 benchmark datasets demonstrate that HeFS consistently identifies overlooked yet informative features and achieves superior performance over state-of-the-art methods, including in challenging domains such as gastric cancer classification, drug toxicity prediction, and computer science applications. The code and datasets are available at this https URL.

[LG-21] Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints

链接: https://arxiv.org/abs/2510.18520
作者: Christopher Ratigan,Kyle Heuton,Carissa Wang,Lenore Cowen,Michael C. Hughes
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The ROC curve is widely used to assess binary classification performance. Yet for some applications such as alert systems for hospitalized patient monitoring, conventional ROC analysis cannot capture crucial factors that impact deployment, such as enforcing a minimum precision constraint to avoid false alarm fatigue or imposing an upper bound on the number of predicted positives to represent the capacity of hospital staff. The usual area under the curve metric also does not reflect asymmetric costs for false positives and false negatives. In this paper we address all three of these issues. First, we show how the subset of classifiers that meet given precision and capacity constraints can be represented as a feasible region in ROC space. We establish the geometry of this feasible region. We then define the partial area of lesser classifiers, a performance metric that is monotonic with cost and only accounts for the feasible portion of ROC space. Averaging this area over a desired range of cost parameters results in the partial volume over the ROC surface, or partial VOROS. In experiments predicting mortality risk using vital sign history on the MIMIC-IV dataset, we show this cost-aware metric is better than alternatives for ranking classifiers in hospital alert applications.
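
可行域的判定只依赖混淆矩阵的基本恒等式。下面的小函数检查一个 ROC 工作点是否同时满足最低精度约束与容量上限(公式为标准恒等式,阈值参数为假设):

```python
def feasible(tpr, fpr, prevalence, min_precision, max_alert_rate):
    # 预测为正的比例 = 患病率*TPR + (1-患病率)*FPR
    pos_rate = prevalence * tpr + (1 - prevalence) * fpr
    precision = prevalence * tpr / max(pos_rate, 1e-12)
    return precision >= min_precision and pos_rate <= max_alert_rate
```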

[LG-22] Alibaba International E-commerce Product Search Competition DILAB Team Technical Report CIKM

链接: https://arxiv.org/abs/2510.18499
作者: Hyewon Lee,Junghyun Oh,Minkyung Song,Soyoung Park,Seunghoon Han
类目: Machine Learning (cs.LG)
*备注: CIKM Alibaba E-commerce Search Challenge 2025

点击查看摘要

Abstract:This study presents the multilingual e-commerce search system developed by the DILAB team, which achieved 5th place on the final leaderboard with a competitive overall score of 0.8819, demonstrating stable and high-performing results across evaluation metrics. To address challenges in multilingual query-item understanding, we designed a multi-stage pipeline integrating data refinement, lightweight preprocessing, and adaptive modeling. The data refinement stage enhanced dataset consistency and category coverage, while language tagging and noise filtering improved input quality. In the modeling phase, multiple architectures and fine-tuning strategies were explored, and hyperparameters optimized using curated validation sets to balance performance across query-category (QC) and query-item (QI) tasks. The proposed framework exhibited robustness and adaptability across languages and domains, highlighting the effectiveness of systematic data curation and iterative evaluation for multilingual search systems. The source code is available at this https URL.

[LG-23] Learning to Navigate Under Imperfect Perception: Conformalised Segmentation for Safe Reinforcement Learning

链接: https://arxiv.org/abs/2510.18485
作者: Daniel Bethell,Simos Gerasimou,Radu Calinescu,Calum Imrie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable navigation in safety-critical environments requires both accurate hazard perception and principled uncertainty handling to strengthen downstream safety handling. Despite the effectiveness of existing approaches, they assume perfect hazard detection capabilities, while uncertainty-aware perception approaches lack finite-sample guarantees. We present COPPOL, a conformal-driven perception-to-policy learning approach that integrates distribution-free, finite-sample safety guarantees into semantic segmentation, yielding calibrated hazard maps with rigorous bounds for missed detections. These maps induce risk-aware cost fields for downstream RL planning. Across two satellite-derived benchmarks, COPPOL increases hazard coverage (up to 6x) compared to comparative baselines, achieving near-complete detection of unsafe regions while reducing hazardous violations during navigation (up to approx 50%). More importantly, our approach remains robust to distributional shift, preserving both safety and efficiency.
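
COPPOL 的有限样本保证来自 split conformal 校准。下面给出逐像素危险图校准阈值的一种标准写法(仅作示意,论文的得分函数与具体流程可能不同):

```python
import numpy as np

def conformal_hazard_threshold(cal_scores, cal_labels, alpha=0.1):
    # cal_scores: 校准集上的逐像素危险得分;cal_labels: 真值危险掩码(0/1)
    hazard_scores = np.sort(cal_scores[cal_labels == 1])
    n = len(hazard_scores)
    k = max(int(np.floor(alpha * (n + 1))), 1)
    # 阈值取第 k 小的真危险得分:有限样本下漏检率不超过约 alpha
    return hazard_scores[k - 1]
```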

[LG-24] Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation

链接: https://arxiv.org/abs/2510.18478
作者: Daniel Bethell,Simos Gerasimou,Radu Calinescu,Calum Imrie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.

[LG-25] Learning Boltzmann Generators via Constrained Mass Transport

链接: https://arxiv.org/abs/2510.18460
作者: Christopher von Klitzing,Denis Blessing,Henrik Schopmans,Pascal Friederich,Gerhard Neumann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient sampling from high-dimensional and multimodal unnormalized probability distributions is a central challenge in many areas of science and machine learning. We focus on Boltzmann generators (BGs) that aim to sample the Boltzmann distribution of physical systems, such as molecules, at a given temperature. Classical variational approaches that minimize the reverse Kullback-Leibler divergence are prone to mode collapse, while annealing-based methods, commonly using geometric schedules, can suffer from mass teleportation and rely heavily on schedule tuning. We introduce Constrained Mass Transport (CMT), a variational framework that generates intermediate distributions under constraints on both the KL divergence and the entropy decay between successive steps. These constraints enhance distributional overlap, mitigate mass teleportation, and counteract premature convergence. Across standard BG benchmarks and the ELIL tetrapeptide introduced here, the largest system studied to date without access to samples from molecular dynamics, CMT consistently surpasses state-of-the-art variational methods, achieving more than 2.5x higher effective sample size while avoiding mode collapse.

[LG-26] Provable Generalization Bounds for Deep Neural Networks with Adaptive Regularization

链接: https://arxiv.org/abs/2510.18410
作者: Adeel Safder
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 8 pages

点击查看摘要

Abstract:Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop’s effectiveness, we derive a tightened PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 20% sharper bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop outperforms baseline regularization techniques, including standard dropout and adaptive gradient regularization, by 1-2% in test accuracy on MNIST (99.52%) and CIFAR-10 (90.63%), with generalization gaps of 0.48% and 7.14%, respectively. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization suitable for high-stakes applications.
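
MAGDrop 的要点是用当前梯度与动量的幅值调制各激活单元的 dropout 概率。以下为假设性示意(调制函数与超参数均为臆定,非论文的确切公式):

```python
import torch

def magdrop(x, grad, momentum, p_base=0.2, alpha=0.5):
    # "波动性"大的单元(梯度与动量幅值大)被更激进地丢弃
    volatility = grad.abs() + momentum.abs()
    volatility = volatility / (volatility.mean() + 1e-8)
    p = (p_base * (1 + alpha * (volatility - 1))).clamp(0.0, 0.9)
    mask = torch.bernoulli(1 - p)
    return x * mask / (1 - p)        # inverted dropout 缩放,保持期望不变
```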

[LG-27] Approximation Rates of Shallow Neural Networks: Barron Spaces, Activation Functions and Optimality Analysis

链接: https://arxiv.org/abs/2510.18388
作者: Jian Lu,Xiaohuang Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the approximation properties of shallow neural networks with activation functions that are powers of exponential functions. It focuses on the dependence of the approximation rate on the dimension and the smoothness of the function being approximated within the Barron function space. We examine the approximation rates of ReLU^k activation functions, proving that the optimal rate cannot be achieved under \ell^1-bounded coefficients or insufficient smoothness conditions. We also establish optimal approximation rates in various norms for functions in Barron spaces and Sobolev spaces, confirming the curse of dimensionality. Our results clarify the limits of shallow neural networks’ approximation capabilities and offer insights into the selection of activation functions and network structures.

[LG-28] Training Diverse Graph Experts for Ensembles: A Systematic Empirical Study

链接: https://arxiv.org/abs/2510.18370
作者: Gangda Deng,Yuxin Yang,Ömer Faruk Akgül,Hanqing Zeng,Yinglong Xia,Rajgopal Kannan,Viktor Prasanna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become essential tools for learning on relational data, yet the performance of a single GNN is often limited by the heterogeneity present in real-world graphs. Recent advances in Mixture-of-Experts (MoE) frameworks demonstrate that assembling multiple, explicitly diverse GNNs with distinct generalization patterns can significantly improve performance. In this work, we present the first systematic empirical study of expert-level diversification techniques for GNN ensembles. Evaluating 20 diversification strategies – including random re-initialization, hyperparameter tuning, architectural variation, directionality modeling, and training data partitioning – across 14 node classification benchmarks, we construct and analyze over 200 ensemble variants. Our comprehensive evaluation examines each technique in terms of expert diversity, complementarity, and ensemble performance. We also uncover mechanistic insights into training maximally diverse experts. These findings provide actionable guidance for expert training and the design of effective MoE frameworks on graph data. Our code is available at this https URL.

[LG-29] Towards Unsupervised Open-Set Graph Domain Adaptation via Dual Reprogramming NEURIPS2025

链接: https://arxiv.org/abs/2510.18363
作者: Zhen Zhang,Bingsheng He
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Unsupervised Graph Domain Adaptation has become a promising paradigm for transferring knowledge from a fully labeled source graph to an unlabeled target graph. Existing graph domain adaptation models primarily focus on the closed-set setting, where the source and target domains share the same label spaces. However, this assumption might not be practical in real-world scenarios, as the target domain might include classes that are not present in the source domain. In this paper, we investigate the problem of unsupervised open-set graph domain adaptation, where the goal is to not only correctly classify target nodes into the known classes, but also recognize previously unseen node types into the unknown class. Towards this end, we propose a novel framework called GraphRTA, which conducts reprogramming on both the graph and model sides. Specifically, we reprogram the graph by modifying target graph structure and node features, which facilitates better separation of known and unknown classes. Meanwhile, we also perform model reprogramming by pruning domain-specific parameters to reduce bias towards the source graph while preserving parameters that capture transferable patterns across graphs. Additionally, we extend the classifier with an extra dimension for the unknown class, thus eliminating the need of a manually specified threshold in open-set recognition. Comprehensive experiments on several public datasets demonstrate that our proposed model can achieve satisfactory performance compared with recent state-of-the-art baselines. Our source codes and datasets are publicly available at this https URL.

[LG-30] Learning to Flow from Generative Pretext Tasks for Neural Architecture Encoding NEURIPS2025

链接: https://arxiv.org/abs/2510.18360
作者: Sunwoo Kim,Hyunjin Hwang,Kijung Shin
类目: Machine Learning (cs.LG)
*备注: Published as a conference paper at NeurIPS 2025

点击查看摘要

Abstract:The performance of a deep learning model on a specific task and dataset depends heavily on its neural architecture, motivating considerable efforts to rapidly and accurately identify architectures suited to the target task and dataset. To achieve this, researchers use machine learning models-typically neural architecture encoders-to predict the performance of a neural architecture. Many state-of-the-art encoders aim to capture information flow within a neural architecture, which reflects how information moves through the forward pass and backpropagation, via a specialized model structure. However, due to their complicated structures, these flow-based encoders are significantly slower to process neural architectures compared to simpler encoders, presenting a notable practical challenge. To address this, we propose FGP, a novel pre-training method for neural architecture encoding that trains an encoder to capture the information flow without requiring specialized model structures. FGP trains an encoder to reconstruct a flow surrogate, our proposed representation of the neural architecture’s information flow. Our experiments show that FGP boosts encoder performance by up to 106% in Precision-1%, compared to the same encoder trained solely with supervised learning.

[LG-31] Computable universal online learning NEURIPS2025

链接: https://arxiv.org/abs/2510.18352
作者: Dariusz Kalociński,Tomasz Steifer
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: Accepted for presentation at NeurIPS 2025

点击查看摘要

Abstract:Understanding when learning is possible is a fundamental task in the theory of machine learning. However, many characterizations known from the literature deal with abstract learning as a mathematical object and ignore the crucial question: when can learning be implemented as a computer program? We address this question for universal online learning, a generalist theoretical model of online binary classification, recently characterized by Bousquet et al. (STOC’21). In this model, there is no hypothesis fixed in advance; instead, Adversary – playing the role of Nature – can change their mind as long as local consistency with the given class of hypotheses is maintained. We require Learner to achieve a finite number of mistakes while using a strategy that can be implemented as a computer program. We show that universal online learning does not imply computable universal online learning, even if the class of hypotheses is relatively easy from a computability-theoretic perspective. We then study the agnostic variant of computable universal online learning and provide an exact characterization of classes that are learnable in this sense. We also consider a variant of proper universal online learning and show exactly when it is possible. Together, our results give a more realistic perspective on the existing theory of online binary classification and the related problem of inductive inference.

[LG-32] Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

链接: https://arxiv.org/abs/2510.18340
作者: Jongmin Lee,Ernest K. Ryu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor \gamma < 1 . In contrast, however, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with \gamma = 1 , rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when \gamma = 1 ) can be replaced with a new object that we call the transient visitation measure.

[LG-33] Uncertainty Estimation by Flexible Evidential Deep Learning NEURIPS2025

链接: https://arxiv.org/abs/2510.18322
作者: Taeseong Yoon,Heeyoung Kim
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Uncertainty quantification (UQ) is crucial for deploying machine learning models in high-stakes applications, where overconfident predictions can lead to serious consequences. An effective UQ method must balance computational efficiency with the ability to generalize across diverse scenarios. Evidential deep learning (EDL) achieves efficiency by modeling uncertainty through the prediction of a Dirichlet distribution over class probabilities. However, the restrictive assumption of Dirichlet-distributed class probabilities limits EDL’s robustness, particularly in complex or unforeseen situations. To address this, we propose flexible evidential deep learning (F-EDL), which extends EDL by predicting a flexible Dirichlet distribution – a generalization of the Dirichlet distribution – over class probabilities. This approach provides a more expressive and adaptive representation of uncertainty, significantly enhancing UQ generalization and reliability under challenging scenarios. We theoretically establish several advantages of F-EDL and empirically demonstrate its state-of-the-art UQ performance across diverse evaluation settings, including classical, long-tailed, and noisy in-distribution scenarios.
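
作为背景,标准 EDL 头的写法如下(论文把其中的 Dirichlet 推广为 flexible Dirichlet;此处仅展示被推广的基线形式):

```python
import torch
import torch.nn.functional as F

def edl_head(logits):
    evidence = F.softplus(logits)              # 每类的非负证据
    alpha = evidence + 1.0                     # Dirichlet 浓度参数
    strength = alpha.sum(-1, keepdim=True)     # 总证据强度
    probs = alpha / strength                   # 类概率的期望
    uncertainty = logits.shape[-1] / strength  # K / sum(alpha):总不确定性
    return probs, uncertainty
```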

[LG-34] Towards Identifiability of Hierarchical Temporal Causal Representation Learning

链接: https://arxiv.org/abs/2510.18310
作者: Zijian Li,Minghao Fu,Junxian Huang,Yifan Shen,Ruichu Cai,Yuewen Sun,Guangyi Chen,Kun Zhang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Modeling hierarchical latent dynamics behind time series data is critical for capturing temporal dependencies across multiple levels of abstraction in real-world tasks. However, existing temporal causal representation learning methods fail to capture such dynamics, as they fail to recover the joint distribution of hierarchical latent variables from single-timestep observed variables. Interestingly, we find that the joint distribution of hierarchical latent variables can be uniquely determined using three conditionally independent observations. Building on this insight, we propose a Causally Hierarchical Latent Dynamic (CHiLD) identification framework. Our approach first employs temporal contextual observed variables to identify the joint distribution of multi-layer latent variables. Sequentially, we exploit the natural sparsity of the hierarchical structure among latent variables to identify latent variables within each layer. Guided by the theoretical results, we develop a time series generative model grounded in variational inference. This model incorporates a contextual encoder to reconstruct multi-layer latent variables and normalizing flow-based hierarchical prior networks to impose the independent noise condition of hierarchical latent dynamics. Empirical evaluations on both synthetic and real-world datasets validate our theoretical claims and demonstrate the effectiveness of CHiLD in modeling hierarchical latent dynamics.

[LG-35] A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

链接: https://arxiv.org/abs/2510.18300
作者: Ankur Lahiry,Ayush Pokharel,Banooqa Banday,Seth Ockerman,Amal Gueroudji,Mohammad Zaeed,Tanzima Z. Islam,Line Pouchard
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance analysis both computationally expensive and time-consuming. To address this challenge, we present an end-to-end parallel performance analysis framework designed to handle multiple large-scale GPU traces efficiently. Our proposed framework partitions and processes trace data concurrently and employs causal graph methods and parallel coordinates charts to expose performance variability and dependencies across execution flows. Experimental results demonstrate a 67% improvement in terms of scalability, highlighting the effectiveness of our pipeline for analyzing multiple traces independently.

[LG-36] Physics-Informed Parametric Bandits for Beam Alignment in mmWave Communications

链接: https://arxiv.org/abs/2510.18299
作者: Hao Qin,Thang Duong,Ming Li,Chicheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In millimeter wave (mmWave) communications, beam alignment and tracking are crucial to combat the significant path loss. As scanning the entire directional space is inefficient, designing an efficient and robust method to identify the optimal beam directions is essential. Since traditional bandit algorithms require a long time horizon to converge under large beam spaces, many existing works propose efficient bandit algorithms for beam alignment by relying on unimodality or multimodality assumptions on the reward function’s structure. However, such assumptions often do not hold (or cannot be strictly satisfied) in practice, which causes such algorithms to converge to choosing suboptimal beams. In this work, we propose two physics-informed bandit algorithms, pretc and prgreedy, that exploit the sparse multipath property of mmWave channels - a generic but realistic assumption - which is connected to the Phase Retrieval Bandit problem. Our algorithms treat the parameters of each path as black boxes and maintain optimal estimates of them based on sampled historical rewards. pretc starts with a random exploration phase and then commits to the optimal beam under the estimated reward function. prgreedy performs such estimation in an online manner and chooses the best beam under current estimates. Our algorithms can also be easily adapted to beam tracking in the mobile setting. Through experiments using both the synthetic DeepMIMO dataset and the real-world DeepSense6G dataset, we demonstrate that both algorithms outperform existing approaches in a wide range of scenarios across diverse channel environments, showing their generalizability and robustness.
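
pretc 属于"先探索后承诺"(explore-then-commit)范式。下面是该范式的骨架示意(fit_params、predict 等接口为假设,用以代表对稀疏多径参数的黑盒估计):

```python
import numpy as np

def explore_then_commit(arms, pull, fit_params, n_explore, horizon, seed=0):
    rng = np.random.default_rng(seed)
    hist = []
    for _ in range(n_explore):                      # 阶段一:随机探索
        a = int(rng.integers(len(arms)))
        hist.append((arms[a], pull(arms[a])))
    model = fit_params(hist)                        # 黑盒估计各路径参数
    best = max(range(len(arms)),
               key=lambda a: model.predict(arms[a]))  # 估计奖励最高的波束
    return [pull(arms[best]) for _ in range(horizon - n_explore)]  # 阶段二:承诺
```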

[LG-37] Online Time Series Forecasting with Theoretical Guarantees

链接: https://arxiv.org/abs/2510.18281
作者: Zijian Li,Changze Zhou,Minghao Fu,Sanjay Manjunath,Fan Feng,Guangyi Chen,Yingyao Hu,Ruichu Cai,Kun Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper is concerned with online time series forecasting, where unknown distribution shifts occur over time, i.e., latent variables influence the mapping from historical to future observations. To develop an automated way of online time series forecasting, we propose a Theoretical framework for Online Time-series forecasting (TOT in short) with theoretical guarantees. Specifically, we prove that supplying a forecaster with latent variables tightens the Bayes risk, the benefit endures under estimation uncertainty of latent variables and grows as the latent variables achieve a more precise identifiability. To better introduce latent variables into online forecasting algorithms, we further propose to identify latent variables with minimal adjacent observations. Based on these results, we devise a model-agnostic blueprint by employing a temporal decoder to match the distribution of observed variables and two independent noise estimators to model the causal inference of latent variables and mixing procedures of observed variables, respectively. Experiment results on synthetic data support our theoretical claims. Moreover, plug-in implementations built on several baselines yield general improvement across multiple benchmarks, highlighting the effectiveness in real-world applications.

[LG-38] Learning with Dual-level Noisy Correspondence for Multi-modal Entity Alignment

链接: https://arxiv.org/abs/2510.18240
作者: Haobin Li,Yijie Lin,Peng Hu,Mouxing Yang,Xi Peng
类目: Machine Learning (cs.LG)
*备注: 30 pages, 12 figures

点击查看摘要

Abstract:Multi-modal entity alignment (MMEA) aims to identify equivalent entities across heterogeneous multi-modal knowledge graphs (MMKGs), where each entity is described by attributes from various modalities. Existing methods typically assume that both intra-entity and inter-graph correspondences are faultless, which is often violated in real-world MMKGs due to the reliance on expert annotations. In this paper, we reveal and study a highly practical yet under-explored problem in MMEA, termed Dual-level Noisy Correspondence (DNC). DNC refers to misalignments in both intra-entity (entity-attribute) and inter-graph (entity-entity and attribute-attribute) correspondences. To address the DNC problem, we propose a robust MMEA framework termed RULE. RULE first estimates the reliability of both intra-entity and inter-graph correspondences via a dedicated two-fold principle. Leveraging the estimated reliabilities, RULE mitigates the negative impact of intra-entity noise during attribute fusion and prevents overfitting to noisy inter-graph correspondences during inter-graph discrepancy elimination. Beyond the training-time designs, RULE further incorporates a correspondence reasoning module that uncovers the underlying attribute-attribute connection across graphs, guaranteeing more accurate equivalent entity identification. Extensive experiments on five benchmarks verify the effectiveness of our method against the DNC compared with seven state-of-the-art baselines. Our code is available at this https URL (XLearning-SCU/RULE).

[LG-39] LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling

链接: https://arxiv.org/abs/2510.18239
作者: Yunjiang Jiang,Ayush Agarwal,Yang Liu,Bi Xue
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Scaling large recommendation systems requires advancing three major frontiers: processing longer user histories, expanding candidate sets, and increasing model capacity. While promising, transformers’ computational cost scales quadratically with the user sequence length and linearly with the number of candidates. This trade-off makes it prohibitively expensive to expand candidate sets or increase sequence length at inference, despite the significant performance improvements. We introduce LIME, a novel architecture that resolves this trade-off. Through two key innovations, LIME fundamentally reduces computational complexity. First, low-rank "link embeddings" enable pre-computation of attention weights by decoupling user and candidate interactions, making the inference cost nearly independent of candidate set size. Second, a linear attention mechanism, LIME-XOR, reduces the complexity with respect to user sequence length from quadratic (O(N^2)) to linear (O(N)). Experiments on public and industrial datasets show LIME achieves near-parity with state-of-the-art transformers but with a 10x inference speedup on large candidate sets or long sequence lengths. When tested on a major recommendation platform, LIME improved user engagement while maintaining minimal inference costs with respect to candidate set size and user history length, establishing a new paradigm for efficient and expressive recommendation systems.
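
"link embedding" 解耦的直观效果是:用户序列上的注意力只需对少量 link 向量计算一次,候选打分退化为廉价的矩阵乘法。以下为假设性示意(聚合方式为臆定,非论文的确切结构):

```python
import numpy as np

def lime_style_scores(user_seq_emb, link_emb, cand_emb):
    # user_seq_emb: (seq, d);link_emb: (n_links, d);cand_emb: (n_cands, d)
    attn = link_emb @ user_seq_emb.T                       # 只与候选无关地计算一次
    attn = attn - attn.max(axis=1, keepdims=True)          # 数值稳定的 softmax
    weights = np.exp(attn) / np.exp(attn).sum(axis=1, keepdims=True)
    user_summary = weights @ user_seq_emb                  # (n_links, d) 用户摘要
    # 每个候选只需与预计算好的用户摘要做一次点积
    return cand_emb @ user_summary.mean(axis=0)            # (n_cands,)
```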

[LG-40] Fostering the Ecosystem of AI for Social Impact Requires Expanding and Strengthening Evaluation Standards NEURIPS2025

链接: https://arxiv.org/abs/2510.18238
作者: Bryan Wilder,Angela Zhou
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:There has been increasing research interest in AI/ML for social impact, and correspondingly more publication venues have refined review criteria for practice-driven AI/ML research. However, these review guidelines tend to most concretely recognize projects that simultaneously achieve deployment and novel ML methodological innovation. We argue that this introduces incentives for researchers that undermine the sustainability of a broader research ecosystem of social impact, which benefits from projects that make contributions on a single front (applied or methodological) and may better meet project partner needs. Our position is that researchers and reviewers in machine learning for social impact must simultaneously adopt: 1) a more expansive conception of social impacts beyond deployment and 2) more rigorous evaluations of the impact of deployed systems.

[LG-41] ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control

链接: https://arxiv.org/abs/2510.18232
作者: Yuzheng Hu,Ryan McKenna,Da Yu,Shanshan Wu,Han Zhao,Zheng Xu,Peter Kairouz
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Generating high-quality synthetic text under differential privacy (DP) is critical for training and evaluating language models without compromising user privacy. Prior work on synthesizing DP datasets often fails to preserve key statistical attributes, suffers utility loss from the noise required by DP, and lacks fine-grained control over generation. To address these challenges, we make two contributions. First, we introduce a hierarchical framework that decomposes DP synthetic text generation into two subtasks: feature learning and conditional text generation. This design explicitly incorporates learned features into the generation process and simplifies the end-to-end synthesis task. Through systematic ablations, we identify the most effective configuration: a rich tabular schema as feature, a DP tabular synthesizer, and a DP fine-tuned conditional generator, which we term ACTG (Attribute-Conditioned Text Generation). Second, we propose Anchored RL (ARL), a post-training method that improves the instruction-following ability of ACTG for conditional generation. ARL combines RL to boost control with an SFT anchor on best-of- N data to prevent reward hacking. Together, these components form our end-to-end algorithm ACTG-ARL, which advances both the quality of DP synthetic text (+20% MAUVE over prior work) and the control of the conditional generator under strong privacy guarantees.

[LG-42] Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations

链接: https://arxiv.org/abs/2510.18228
作者: Zhendong Mi,Qitao Tan,Grace Li Zhang,Zhaozhuo Xu,Geng Yuan,Shaoyi Huang
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) using zeroth-order (ZO) optimization has emerged as a promising alternative to traditional gradient-based methods due to its reduced memory footprint requirement. However, existing ZO methods suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models. In this work, we propose P-GAP, a fast LLM fine-tuning approach through zeroth-order optimization with Projected Gradient-Aligned Perturbations. Specifically, we first estimate a low-dimensional gradient space and then align perturbations with the projected gradients’ direction within the space. This approach reduces the number of perturbed parameters and decreases variance, thereby accelerating convergence for LLM fine-tuning. Experiments on LLMs show that P-GAP consistently surpasses the baselines, achieving up to a 6% increase in accuracy on classification tasks and up to 12% higher accuracy on generation tasks, with up to about 81% fewer training iterations and 70% fewer GPU hours. These results demonstrate that P-GAP enables fast, scalable, and resource-efficient ZO LLM fine-tuning.
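
P-GAP 的关键是把扰动约束在估计出的低维梯度子空间内。下面用经典的两点式零阶估计器写出这一"投影扰动"思路(子空间基 U 的来源为假设,仅作示意):

```python
import numpy as np

def projected_zo_grad(loss_fn, w, U, eps=1e-3, seed=0):
    # U: (dim, k) 低维子空间的正交基;扰动只在该子空间内采样,以降低方差
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(U.shape[1])
    u = U @ z                                       # 子空间内的扰动方向
    g = (loss_fn(w + eps * u) - loss_fn(w - eps * u)) / (2 * eps)
    return g * u                                    # 方向导数乘以方向,得到 ZO 梯度估计
```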

[LG-43] Joint Optimization of Cooperation Efficiency and Communication Covertness for Target Detection with AUVs

链接: https://arxiv.org/abs/2510.18225
作者: Xueyao Zhang,Bo Yang,Zhiwen Yu,Xuelin Cao,Wei Xiang,Bin Guo,Liang Wang,Billy Pik Lik Lau,George C. Alexandropoulos,Jun Luo,Mérouane Debbah,Zhu Han,Chau Yuen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates underwater cooperative target detection using autonomous underwater vehicles (AUVs), with a focus on the critical trade-off between cooperation efficiency and communication covertness. To tackle this challenge, we first formulate a joint trajectory and power control optimization problem, and then present an innovative hierarchical action management framework to solve it. According to the hierarchical formulation, at the macro level, the master AUV models the agent selection process as a Markov decision process and deploys the proximal policy optimization algorithm for strategic task allocation. At the micro level, each selected agent’s decentralized decision-making is modeled as a partially observable Markov decision process, and a multi-agent proximal policy optimization algorithm is used to dynamically adjust its trajectory and transmission power based on its local observations. Under the centralized training and decentralized execution paradigm, our target detection framework enables adaptive covert cooperation while satisfying both energy and mobility constraints. By comprehensively modeling the considered system, the involved signals and tasks, as well as energy consumption, theoretical insights and practical solutions for the efficient and secure operation of multiple AUVs are provided, offering significant implications for the execution of underwater covert communication tasks.

[LG-44] RESCUE: Retrieval Augmented Secure Code Generation

链接: https://arxiv.org/abs/2510.18204
作者: Jiahao Shi,Tianyi Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Despite recent advances, Large Language Models (LLMs) still generate vulnerable code. Retrieval-Augmented Generation (RAG) has the potential to enhance LLMs for secure code generation by incorporating external security knowledge. However, the conventional RAG design struggles with the noise of raw security-related documents, and existing retrieval methods overlook the significant security semantics implicitly embedded in task descriptions. To address these issues, we propose RESCUE, a new RAG framework for secure code generation with two key innovations. First, we propose a hybrid knowledge base construction method that combines LLM-assisted cluster-then-summarize distillation with program slicing, producing both high-level security guidelines and concise, security-focused code examples. Second, we design a hierarchical multi-faceted retrieval to traverse the constructed knowledge base from top to bottom and integrates multiple security-critical facts at each hierarchical level, ensuring comprehensive and accurate retrieval. We evaluated RESCUE on four benchmarks and compared it with five state-of-the-art secure code generation methods on six LLMs. The results demonstrate that RESCUE improves the SecurePass@1 metric by an average of 4.8 points, establishing a new state-of-the-art performance for security. Furthermore, we performed in-depth analysis and ablation studies to rigorously validate the effectiveness of individual components in RESCUE.

[LG-45] Ensemble based Closed-Loop Optimal Control using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2510.18195
作者: Jostein Barry-Straume,Adwait D. Verulkar,Arash Sarshar,Andrey A. Popov,Adrian Sandu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The objective of designing a control system is to steer a dynamical system with a control signal, guiding it to exhibit the desired behavior. The Hamilton-Jacobi-Bellman (HJB) partial differential equation offers a framework for optimal control system design. However, numerical solutions to this equation are computationally intensive, and analytical solutions are frequently unavailable. Knowledge-guided machine learning methodologies, such as physics-informed neural networks (PINNs), offer new alternative approaches that can alleviate the difficulties of solving the HJB equation numerically. This work presents a multistage ensemble framework to learn the optimal cost-to-go, and subsequently the corresponding optimal control signal, through the HJB equation. Prior PINN-based approaches rely on stabilizer terms to keep the HJB enforcement stable during training. Our framework does not use stabilizer terms and offers a means of controlling the nonlinear system via either a single learned control signal or an ensemble control-signal policy. Success is demonstrated in closed-loop control, using both ensemble and single control signals, of a steady-state time-invariant two-state continuous nonlinear system with an infinite time horizon, accounting for noisy, perturbed system states and varying initial conditions.
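
对输入仿射系统 x' = f(x) + g(x)·u、二次运行代价的情形,HJB 残差可以直接用自动微分写出(以下形状与接口均为示意性假设,并非论文的具体实现):

```python
import torch

def hjb_residual(V, x, f, g, Q, R):
    # x: (batch, 2);f(x)、g(x): (batch, 2);标量控制 u,R 为正标量
    x = x.clone().requires_grad_(True)
    dv = torch.autograd.grad(V(x).sum(), x, create_graph=True)[0]
    u = -(dv * g(x)).sum(-1) / (2 * R)          # 最优控制 u* = -g^T dV/dx / (2R)
    xdot = f(x) + g(x) * u.unsqueeze(-1)
    cost = (x @ Q * x).sum(-1) + R * u ** 2     # 运行代价 x^T Q x + R u^2
    return (dv * xdot).sum(-1) + cost           # 训练时最小化该残差的平方
```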

[LG-46] Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

链接: https://arxiv.org/abs/2510.18183
作者: Eason Yu,Tzu Hao Liu,Yunke Wang,Clément L. Canonne,Nguyen H. Tran,Chang Xu
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Finding Nash equilibria in imperfect-information games remains a central challenge in multi-agent reinforcement learning. While regularization-based methods have recently achieved last-iteration convergence to a regularized equilibrium, they require the regularization strength to shrink toward zero to approximate a Nash equilibrium, often leading to unstable learning in practice. Instead, we fix the regularization strength at a large value for robustness and achieve convergence by iteratively refining the reference policy. Our main theoretical result shows that this procedure guarantees strictly monotonic improvement and convergence to an exact Nash equilibrium in two-player zero-sum games, without requiring a uniqueness assumption. Building on this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which preserves the generalizability of policy gradient methods while relying solely on the current and reference policies. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold’em, where NashPG consistently attains higher Elo ratings.

[LG-47] Rethinking PCA Through Duality NEURIPS2025

链接: https://arxiv.org/abs/2510.18130
作者: Jan Quan,Johan Suykens,Panagiotis Patrinos
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: NeurIPS 2025 poster

点击查看摘要

Abstract:Motivated by the recently shown connection between self-attention and (kernel) principal component analysis (PCA), we revisit the fundamentals of PCA. Using the difference-of-convex (DC) framework, we present several novel formulations and provide new theoretical insights. In particular, we show the kernelizability and out-of-sample applicability for a PCA-like family of problems. Moreover, we uncover that simultaneous iteration, which is connected to the classical QR algorithm, is an instance of the difference-of-convex algorithm (DCA), offering an optimization perspective on this longstanding method. Further, we describe new algorithms for PCA and empirically compare them with state-of-the-art methods. Lastly, we introduce a kernelizable dual formulation for a robust variant of PCA that minimizes the l_1 deviation of the reconstruction errors.
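
作为背景,下面给出摘要中提到的"同时迭代"(simultaneous iteration,与经典 QR 算法相关)的最小实现;论文的贡献之一正是证明它是 DCA 的一个实例:

```python
import numpy as np

def simultaneous_iteration(X, k, iters=200):
    """Top-k principal directions via simultaneous (subspace) iteration."""
    C = X.T @ X / len(X)                 # sample covariance (X assumed centered)
    Q, _ = np.linalg.qr(np.random.randn(C.shape[0], k))
    for _ in range(iters):
        Q, _ = np.linalg.qr(C @ Q)       # power step followed by re-orthonormalization
    return Q                             # columns span the leading eigenspace

X = np.random.randn(500, 10)
X -= X.mean(axis=0)
W = simultaneous_iteration(X, k=3)
```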

[LG-48] HyperDiffusionFields (HyDiF): Diffusion-Guided Hypernetworks for Learning Implicit Molecular Neural Fields

链接: https://arxiv.org/abs/2510.18122
作者: Sudarshan Babu,Phillip Lo,Xiao Zhang,Aadi Srivastava,Ali Davariashtiyani,Jason Perera,Michael Maire,Aly A. Khan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce HyperDiffusionFields (HyDiF), a framework that models 3D molecular conformers as continuous fields rather than discrete atomic coordinates or graphs. At the core of our approach is the Molecular Directional Field (MDF), a vector field that maps any point in space to the direction of the nearest atom of a particular type. We represent MDFs using molecule-specific neural implicit fields, which we call Molecular Neural Fields (MNFs). To enable learning across molecules and facilitate generalization, we adopt an approach where a shared hypernetwork, conditioned on a molecule, generates the weights of the given molecule’s MNF. To endow the model with generative capabilities, we train the hypernetwork as a denoising diffusion model, enabling sampling in the function space of molecular fields. Our design naturally extends to a masked diffusion mechanism to support structure-conditioned generation tasks, such as molecular inpainting, by selectively noising regions of the field. Beyond generation, the localized and continuous nature of MDFs enables spatially fine-grained feature extraction for molecular property prediction, something not easily achievable with graph or point cloud based methods. Furthermore, we demonstrate that our approach scales to larger biomolecules, illustrating a promising direction for field-based molecular modeling.

[LG-49] Efficient Long-context Language Model Training by Core Attention Disaggregation

链接: https://arxiv.org/abs/2510.18121
作者: Yonghao Zhuang,Junda Chen,Bo Pang,Yi Gu,Yibo Zhu,Yimin Jiang,Ion Stoica,Eric Xing,Hao Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
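
CAD 的负载均衡思想可以用一个玩具版的"最长优先贪心装箱"来直观理解(纯示意,非 DistCA 实现;attention 计算量随分片长度近似二次增长):

```python
import heapq

def balance_shards(shard_lengths, n_servers):
    """Greedily assign token-level shards so per-server attention FLOPs are balanced."""
    servers = [(0, i, []) for i in range(n_servers)]      # (total_cost, server_id, shards)
    heapq.heapify(servers)
    for length in sorted(shard_lengths, reverse=True):    # longest-processing-time heuristic
        cost, i, shards = heapq.heappop(servers)
        shards.append(length)
        heapq.heappush(servers, (cost + length ** 2, i, shards))   # quadratic attention cost
    return servers

print(balance_shards([512, 256, 128, 1024, 64, 896], n_servers=2))
```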

[LG-50] Gradient Variance Reveals Failure Modes in Flow-Based Generative Models NEURIPS2025

链接: https://arxiv.org/abs/2510.18118
作者: Teodora Reu,Sixtine Dromigny,Michael Bronstein,Francisco Vargas
类目: Machine Learning (cs.LG)
*备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Rectified Flows learn ODE vector fields whose trajectories are straight between source and target distributions, enabling near one-step inference. We show that this straight-path objective conceals fundamental failure modes: under deterministic training, low gradient variance drives memorization of arbitrary training pairings, even when interpolant lines between pairs intersect. To analyze this mechanism, we study Gaussian-to-Gaussian transport and use the loss gradient variance across stochastic and deterministic regimes to characterize which vector fields optimization favors in each setting. We then show that, in a setting where all interpolating lines intersect, applying Rectified Flow yields the same specific pairings at inference as during training. More generally, we prove that a memorizing vector field exists even when training interpolants intersect, and that optimizing the straight-path objective converges to this ill-defined field. At inference, deterministic integration reproduces the exact training pairings. We validate our findings empirically on the CelebA dataset, confirming that deterministic interpolants induce memorization, while the injection of small noise restores generalization.
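
摘要讨论的"直线路径目标"可用如下最小损失函数示意(v_net 的输入拼接方式为假设;固定的确定性配对 (x0_i, x1_i) 即论文指出的导致记忆化的训练方式):

```python
import torch

def rectified_flow_loss(v_net, x0, x1):
    """Straight-path (Rectified Flow) objective under a fixed pairing of (x0, x1)."""
    t = torch.rand(len(x0), 1)                    # interpolation time in [0, 1]
    xt = (1 - t) * x0 + t * x1                    # straight-line interpolant
    target = x1 - x0                              # constant velocity along the line
    pred = v_net(torch.cat([xt, t], dim=1))       # assumed input layout: [state, time]
    return ((pred - target) ** 2).mean()

# Per the paper's remedy, replacing x0 with x0 + sigma * torch.randn_like(x0)
# (small sigma) restores the stochasticity that prevents memorization.
```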

[LG-51] PrivaDE: Privacy-preserving Data Evaluation for Blockchain-based Data Marketplaces

链接: https://arxiv.org/abs/2510.18109
作者: Wan Ki Wong,Sahel Torkamani,Michele Ciampi,Rik Sarkar
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Evaluating the relevance of data is a critical task for model builders seeking to acquire datasets that enhance model performance. Ideally, such evaluation should allow the model builder to assess the utility of candidate data without exposing proprietary details of the model. At the same time, data providers must be assured that no information about their data - beyond the computed utility score - is disclosed to the model builder. In this paper, we present PrivaDE, a cryptographic protocol for privacy-preserving utility scoring and selection of data for machine learning. While prior works have proposed data evaluation protocols, our approach advances the state of the art through a practical, blockchain-centric design. Leveraging the trustless nature of blockchains, PrivaDE enforces malicious-security guarantees and ensures strong privacy protection for both models and datasets. To achieve efficiency, we integrate several techniques - including model distillation, model splitting, and cut-and-choose zero-knowledge proofs - bringing the runtime to a practical level. Furthermore, we propose a unified utility scoring function that combines empirical loss, predictive entropy, and feature-space diversity, and that can be seamlessly integrated into active-learning workflows. Evaluation shows that PrivaDE performs data evaluation effectively, achieving online runtimes within 15 minutes even for models with millions of parameters. Our work lays the foundation for fair and automated data marketplaces in decentralized machine learning ecosystems.
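
摘要中的统一效用评分可以粗略示意如下(权重 alpha/beta/gamma 及具体组合方式为示例假设,非论文给定值):

```python
import numpy as np

def utility_score(probs, labels, feats, alpha=1.0, beta=1.0, gamma=1.0):
    """Combine empirical loss, predictive entropy, and feature-space diversity."""
    n = len(labels)
    loss = -np.log(probs[np.arange(n), labels] + 1e-12).mean()        # empirical loss
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()     # predictive entropy
    dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    diversity = dist.sum() / (n * (n - 1))                            # mean pairwise distance
    return alpha * loss + beta * entropy + gamma * diversity
```

实际协议中该分数需在密码学组件(模型蒸馏、模型切分、cut-and-choose 零知识证明)内部计算,此处仅展示评分本身的构成。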

[LG-52] Provably Optimal Reinforcement Learning under Safety Filtering

链接: https://arxiv.org/abs/2510.18082
作者: Donggeon David Oh,Duy P. Nguyen,Haimin Hu,Jaime F. Fisac
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. Additionally, we define an associated filtered MDP in which all actions result in safe effects, thanks to a safety filter that is considered to be a part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is safe categorically, (ii) standard RL convergence carries over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP-when executed through the same filter-achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter that is available.
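
"过滤后 MDP"的构造可用一个 gym 风格的包装器示意(`is_safe` 与兜底策略 `fallback` 为占位,实际可由 Hamilton-Jacobi 可达性等安全过滤器给出):

```python
class FilteredEnv:
    """Wrap an environment so a safety filter overrides unsafe actions.

    From the agent's perspective every action has a safe effect, which is
    exactly the filtered-MDP construction described in the abstract.
    """

    def __init__(self, env, is_safe, fallback):
        self.env, self.is_safe, self.fallback = env, is_safe, fallback

    def step(self, action):
        if not self.is_safe(self.env.state, action):     # assumes env exposes .state
            action = self.fallback(self.env.state)       # least-restrictive safe override
        return self.env.step(action)
```

定理的含义是:在这样的包装环境中用标准 RL 训练,既能保证训练全程零违规,又不损失渐近回报。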

[LG-53] MEG-GPT: A transformer-based foundation model for magnetoencephalography data

链接: https://arxiv.org/abs/2510.18080
作者: Rukuang Huang,Sungjun Cho,Chetan Gohil,Oiwi Parker Jones,Mark Woolrich
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modelling the complex spatiotemporal patterns of large-scale brain dynamics is crucial for neuroscience, but traditional methods fail to capture the rich structure in modalities such as magnetoencephalography (MEG). Recent advances in deep learning have enabled significant progress in other domains, such as language and vision, by using foundation models at scale. Here, we introduce MEG-GPT, a transformer-based foundation model that uses time-attention and next time-point prediction. To facilitate this, we also introduce a novel data-driven tokeniser for continuous MEG data, which preserves the high temporal resolution of continuous MEG signals without lossy transformations. We trained MEG-GPT on tokenised brain region time-courses extracted from a large-scale MEG dataset (N=612, eyes-closed rest, Cam-CAN data), and show that the learnt model can generate data with realistic spatio-spectral properties, including transient events and population variability. Critically, it performs well in downstream decoding tasks, improving supervised prediction and showing improved zero-shot generalisation across sessions (accuracy from 0.54 to 0.59) and subjects (accuracy from 0.41 to 0.49) compared to baseline methods. Furthermore, we show the model can be efficiently fine-tuned on a smaller labelled dataset to boost performance in cross-subject decoding scenarios. This work establishes a powerful foundation model for electrophysiological data, paving the way for applications in computational neuroscience and neural decoding.

[LG-54] Batch Distillation Data for Developing Machine Learning Anomaly Detection Methods

链接: https://arxiv.org/abs/2510.18075
作者: Justus Arweiler,Indra Jungjohann,Aparna Muraleedharan,Heike Leitte,Jakob Burger,Kerstin Münnemann,Fabian Jirasek,Hans Hasse
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) holds great potential to advance anomaly detection (AD) in chemical processes. However, the development of ML-based methods is hindered by the lack of openly available experimental data. To address this gap, we have set up a laboratory-scale batch distillation plant and operated it to generate an extensive experimental database, covering fault-free experiments and experiments in which anomalies were intentionally induced, for training advanced ML-based AD methods. In total, 119 experiments were conducted across a wide range of operating conditions and mixtures. Most experiments containing anomalies were paired with a corresponding fault-free one. The database that we provide here includes time-series data from numerous sensors and actuators, along with estimates of measurement uncertainty. In addition, unconventional data sources – such as concentration profiles obtained via online benchtop NMR spectroscopy and video and audio recordings – are provided. Extensive metadata and expert annotations of all experiments are included. The anomaly annotations are based on an ontology developed in this work. The data are organized in a structured database and made freely available via this http URL. This new database paves the way for the development of advanced ML-based AD methods. As it includes information on the causes of anomalies, it further enables the development of interpretable and explainable ML approaches, as well as methods for anomaly mitigation.

[LG-55] Fast Agnostic Learners in the Plane

链接: https://arxiv.org/abs/2510.18057
作者: Talya Eden,Ludmila Glinskih,Sofya Raskhodnikova
类目: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We investigate the computational efficiency of agnostic learning for several fundamental geometric concept classes in the plane. While the sample complexity of agnostic learning is well understood, its time complexity has received much less attention. We study the class of triangles and, more generally, the class of convex polygons with k vertices for small k , as well as the class of convex sets in a square. We present a proper agnostic learner for the class of triangles that has optimal sample complexity and runs in time \tilde O(\epsilon^{-6}) , improving on the algorithm of Dobkin and Gunopulos (COLT 95) that runs in time \tilde O(\epsilon^{-10}) . For 4-gons and 5-gons, we improve the running time from O(\epsilon^{-12}) , achieved by Fischer and Kwek (eCOLT 96), to \tilde O(\epsilon^{-8}) and \tilde O(\epsilon^{-10}) , respectively. We also design a proper agnostic learner for convex sets under the uniform distribution over a square with running time \tilde O(\epsilon^{-5}) , improving on the previous \tilde O(\epsilon^{-8}) bound at the cost of slightly higher sample complexity. Notably, agnostic learning of convex sets in [0,1]^2 under general distributions is impossible because this concept class has infinite VC-dimension. Our agnostic learners use data structures and algorithms from computational geometry and their analysis relies on tools from geometry and probabilistic combinatorics. Because our learners are proper, they yield tolerant property testers with matching running times. Our results raise a fundamental question of whether a gap between the sample and time complexity is inherent for agnostic learning of these and other natural concept classes.

[LG-56] Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity NEURIPS2025

链接: https://arxiv.org/abs/2510.18037
作者: Ziyu Lu,Anna J. Li,Alexander E. Ladd,Pascha Matveev,Aditya Deole,Eric Shea-Brown,J. Nathan Kutz,Nicholas A. Steinmetz
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Data on the Brain & Mind

点击查看摘要

Abstract:Neural activity forecasting is central to understanding neural systems and enabling closed-loop control. While deep learning has recently advanced the state-of-the-art in the time series forecasting literature, its application to neural activity forecasting remains limited. To bridge this gap, we systematically evaluated eight probabilistic deep learning models, including two foundation models, that have demonstrated strong performance on general forecasting benchmarks. We compared them against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging. Across prediction horizons, several deep learning models consistently outperformed classical approaches, with the best model producing informative forecasts up to 1.5 seconds into the future. Our findings point toward future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.

[LG-57] ransformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware

链接: https://arxiv.org/abs/2510.18036
作者: Stavros Mitsis,Ermos Hadjikyriakos,Humaid Ibrahim,Savvas Neofytou,Shashwat Raman,James Myles,Eiman Kanjo
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8MB memory budget and 21-23ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
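
后融合(late fusion)一步可示意如下(融合权重为示例假设;两路 logits 分别来自量化后的声学 transformer 与冻结的关键词嵌入分支):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def late_fusion(acoustic_logits, text_logits, w_audio=0.6, w_text=0.4):
    """Combine per-branch class probabilities after each branch runs independently."""
    fused = w_audio * softmax(acoustic_logits) + w_text * softmax(text_logits)
    return int(np.argmax(fused))        # predicted emotion class

print(late_fusion(np.array([0.2, 1.5, -0.3]), np.array([0.1, 0.9, 0.4])))
```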

[LG-58] Attention-Guided Deep Adversarial Temporal Subspace Clustering (A-DATSC) Model for multivariate spatiotemporal data ICLR2025

链接: https://arxiv.org/abs/2510.18004
作者: Francis Ndikum Nji,Vandana Janeja,Jianwu Wang
类目: Machine Learning (cs.LG)
*备注: 9 pages, under review, submitted to ICLR 2025

点击查看摘要

Abstract:Deep subspace clustering models are vital for applications such as snowmelt detection, sea ice tracking, crop health monitoring, infectious disease modeling, network load prediction, and land-use planning, where multivariate spatiotemporal data exhibit complex temporal dependencies and reside on multiple nonlinear manifolds beyond the capability of traditional clustering methods. These models project data into a latent space where samples lie in linear subspaces and exploit the self-expressiveness property to uncover intrinsic relationships. Despite their success, existing methods face major limitations: they use shallow autoencoders that ignore clustering errors, emphasize global features while neglecting local structure, fail to model long-range dependencies and positional information, and are rarely applied to 4D spatiotemporal data. To address these issues, we propose A-DATSC (Attention-Guided Deep Adversarial Temporal Subspace Clustering), a model combining a deep subspace clustering generator and a quality-verifying discriminator. The generator, inspired by U-Net, preserves spatial and temporal integrity through stacked TimeDistributed ConvLSTM2D layers, reducing parameters and enhancing generalization. A graph attention transformer based self-expressive network captures local spatial relationships, global dependencies, and both short- and long-range correlations. Experiments on three real-world multivariate spatiotemporal datasets show that A-DATSC achieves substantially superior clustering performance compared to state-of-the-art deep subspace clustering models.

[LG-59] ritonRL: Training LLM s to Think and Code Triton Without Cheating

链接: https://arxiv.org/abs/2510.17891
作者: Jiin Woo,Shaowei Zhu,Allen Nie,Zhen Jia,Yida Wang,Youngsuk Park
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid evolution of large language models (LLMs), the demand for automated, high-performance system kernels has emerged as a key enabler for accelerating development and deployment. We introduce TritonRL, a domain-specialized LLM for Triton kernel generation, trained with a novel training framework that enables robust and automated kernel synthesis. Unlike general-purpose programming languages, Triton kernel generation faces unique challenges due to data scarcity and incomplete evaluation criteria, vulnerable to reward hacking. Our approach addresses these challenges end-to-end by distilling Triton-specific knowledge through supervised fine-tuning on curated datasets, and further improving code quality via reinforcement learning (RL) with robust, verifiable rewards and hierarchical reward assignment. Our RL framework robustly detects reward hacking and guides both reasoning traces and code tokens through fine-grained verification and hierarchical reward decomposition, enabling the model to generate high-quality Triton kernels that can truly replace existing modules. With robust and fine-grained evaluation, our experiments on KernelBench demonstrate that TritonRL achieves state-of-the-art correctness and speedup, surpassing all other Triton-specific models and underscoring the effectiveness of our RL-based training paradigm.

[LG-60] Shock-Aware Physics-Guided Fusion-DeepONet Operator for Rarefied Micro-Nozzle Flows

链接: https://arxiv.org/abs/2510.17887
作者: Ehsan Roohi,Amirmehran Mahdavi
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We present a comprehensive, physics-aware deep learning framework for constructing fast and accurate surrogate models of rarefied, shock-containing micro-nozzle flows. The framework integrates three key components: a Fusion DeepONet operator learning architecture for capturing parameter dependencies, a physics-guided feature space that embeds a shock-aligned coordinate system, and a two-phase curriculum strategy emphasizing high-gradient regions. To demonstrate the generality and inductive bias of the proposed framework, we first validate it on the canonical viscous Burgers equation, which exhibits advective steepening and shock-like gradients.

[LG-61] Mixed Monotonicity Reachability Analysis of Neural ODE: A Trade-Off Between Tightness and Efficiency

链接: https://arxiv.org/abs/2510.17859
作者: Abdelrahman Sayed Sayed,Pierre-Jean Meyer,Mohamed Ghazel
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 27 pages, 11 figures

点击查看摘要

Abstract:Neural ordinary differential equations (neural ODE) are powerful continuous-time machine learning models for depicting the behavior of complex dynamical systems, but their verification remains challenging due to limited reachability analysis tools adapted to them. We propose a novel interval-based reachability method that leverages continuous-time mixed monotonicity techniques for dynamical systems to compute an over-approximation for the neural ODE reachable sets. By exploiting the geometric structure of full initial sets and their boundaries via the homeomorphism property, our approach ensures efficient bound propagation. By embedding neural ODE dynamics into a mixed monotone system, our interval-based reachability approach, implemented in TIRA with single-step, incremental, and boundary-based approaches, provides sound and computationally efficient over-approximations compared with CORA’s zonotopes and NNV2.0 star set representations, while trading tightness for efficiency. This trade-off makes our method particularly suited for high-dimensional, real-time, and safety-critical applications. Applying mixed monotonicity to neural ODE reachability analysis paves the way for lightweight formal analysis by leveraging the symmetric structure of monotone embeddings and the geometric simplicity of interval boxes, opening new avenues for scalable verification aligned with the symmetry and geometry of neural representations. This novel approach is illustrated on two numerical examples of a spiral system and a fixed-point attractor system modeled as a neural ODE.
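
混合单调性区间传播的核心一步可用一维玩具系统示意(分解函数 d 为手工构造,仅在 0 ≤ x, x̂ ≤ 4 上满足单调性条件,纯属示例;在 neural ODE 场景中 f 由网络给出):

```python
def f(x):
    """Example scalar dynamics: dx/dt = -x + x^2 / 4."""
    return -x + 0.25 * x ** 2

def d(x, xhat):
    """Decomposition function with d(x, x) = f(x); increasing in x, decreasing in xhat
    on the region 0 <= x, xhat <= 4 (illustrative construction only)."""
    return -xhat + 0.25 * x * xhat

def step_box(lo, hi, dt=0.01):
    """One Euler step of the interval over-approximation via the embedding system."""
    new_lo = lo + dt * d(lo, hi)        # lower bound evolves with (lo, hi)
    new_hi = hi + dt * d(hi, lo)        # upper bound evolves with (hi, lo)
    return new_lo, new_hi

lo, hi = 0.5, 1.0
for _ in range(100):
    lo, hi = step_box(lo, hi)
print(lo, hi)                           # a box guaranteed to contain the reachable set
```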

[LG-62] Neural networks for neurocomputing circuits: a computational study of tolerance to noise and activation function non-uniformity when machine learning materials properties

链接: https://arxiv.org/abs/2510.17849
作者: Ye min Thant,Methawee Nukunudompanich,Chu-Chen Chueh,Manabu Ihara,Sergei Manzhos
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dedicated analog neurocomputing circuits are promising for high-throughput, low power consumption applications of machine learning (ML) and for applications where implementing a digital computer is unwieldy (remote locations; small, mobile, and autonomous devices; extreme conditions, etc.). Neural networks (NN) implemented in such circuits, however, must contend with circuit noise and the non-uniform shapes of the neuron activation function (NAF) due to the dispersion of performance characteristics of circuit elements (such as transistors or diodes implementing the neurons). We present a computational study of the impact of circuit noise and NAF inhomogeneity as a function of NN architecture and training regime. We focus on one application that requires high-throughput ML: materials informatics, using as representative problems ML of formation energies vs. lowest-energy isomer of peri-condensed hydrocarbons, formation energies and band gaps of double perovskites, and zero-point vibrational energies of molecules from the QM9 dataset. We show that NNs generally possess low noise tolerance, with the model accuracy rapidly degrading with noise level. Single-hidden-layer NNs, and NNs with larger-than-optimal sizes, are somewhat more noise-tolerant. Models that show less overfitting (not necessarily the lowest test set error) are more noise-tolerant. Importantly, we demonstrate that the effect of activation function inhomogeneity can be palliated by retraining the NN using practically realized shapes of NAFs.

[LG-63] From Noise to Laws: Regularized Time-Series Forecasting via Denoised Dynamic Graphs

链接: https://arxiv.org/abs/2510.17817
作者: Hongwei Ma,Junbin Gao,Minh-ngoc Tran
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-horizon multivariate time-series forecasting is challenging because realistic predictions must (i) denoise heterogeneous signals, (ii) track time-varying cross-series dependencies, and (iii) remain stable and physically plausible over long rollout horizons. We present PRISM, which couples a score-based diffusion preconditioner with a dynamic, correlation-thresholded graph encoder and a forecast head regularized by generic physics penalties. We prove contraction of the induced horizon dynamics under mild conditions and derive Lipschitz bounds for graph blocks, explaining the model's robustness. On six standard benchmarks, PRISM achieves consistent SOTA with strong MSE and MAE gains.

[LG-64] SO(3)-invariant PCA with application to molecular data

链接: https://arxiv.org/abs/2510.18827
作者: Michael Fraiman,Paulina Hoyos,Tamir Bendory,Joe Kileel,Oscar Mickelin,Nir Sharon,Amit Singer
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Principal component analysis (PCA) is a fundamental technique for dimensionality reduction and denoising; however, its application to three-dimensional data with arbitrary orientations – common in structural biology – presents significant challenges. A naive approach requires augmenting the dataset with many rotated copies of each sample, incurring prohibitive computational costs. In this paper, we extend PCA to 3D volumetric datasets with unknown orientations by developing an efficient and principled framework for SO(3)-invariant PCA that implicitly accounts for all rotations without explicit data augmentation. By exploiting underlying algebraic structure, we demonstrate that the computation involves only the square root of the total number of covariance entries, resulting in a substantial reduction in complexity. We validate the method on real-world molecular datasets, demonstrating its effectiveness and opening up new possibilities for large-scale, high-dimensional reconstruction problems.

[LG-65] A Frequentist Statistical Introduction to Variational Inference, Autoencoders, and Diffusion Models

链接: https://arxiv.org/abs/2510.18777
作者: Yen-Chi Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: This is an introduction paper. 28 pages, 2 figures

点击查看摘要

Abstract:While Variational Inference (VI) is central to modern generative models like Variational Autoencoders (VAEs) and Denoising Diffusion Models (DDMs), its pedagogical treatment is split across disciplines. In statistics, VI is typically framed as a Bayesian method for posterior approximation. In machine learning, however, VAEs and DDMs are developed from a Frequentist viewpoint, where VI is used to approximate a maximum likelihood estimator. This creates a barrier for statisticians, as the principles behind VAEs and DDMs are hard to contextualize without a corresponding Frequentist introduction to VI. This paper provides that introduction: we explain the theory for VI, VAEs, and DDMs from a purely Frequentist perspective, starting with the classical Expectation-Maximization (EM) algorithm. We show how VI arises as a scalable solution for intractable E-steps and how VAEs and DDMs are natural, deep-learning-based extensions of this framework, thereby bridging the gap between classical statistical inference and modern generative AI.
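
文中所述"从 EM 到 VI"的桥梁可以用一条恒等式概括:对任意变分分布 q,对数似然分解为 ELBO 与一个 KL 项,EM 的 E 步正是令 KL 为零的特例,而 VI 在精确后验不可解时以最大化 ELBO 近似之:

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]}_{\mathrm{ELBO}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(z) \,\middle\|\, p_\theta(z \mid x)\right)
  \;\ge\; \mathrm{ELBO}(q, \theta)
```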

[LG-66] Analyse comparative d'algorithmes de restauration en architecture dépliée pour des signaux chromatographiques parcimonieux

链接: https://arxiv.org/abs/2510.18760
作者: Mouna Gharbi,Silvia Villa,Emilie Chouzenoux,Jean-Christophe Pesquet,Laurent Duval
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 4 pages, in French, GRETSI Symposium on Signal and Image Processing, Strasbourg, France, August 2025

点击查看摘要

Abstract:Data restoration from degraded observations under sparsity hypotheses is an active field of study. Traditional iterative optimization methods are now complemented by deep learning techniques. The development of unfolded methods benefits from both families. We carry out a comparative study of three architectures on parameterized chromatographic signal databases, highlighting the performance of these approaches, especially when employing metrics adapted to physico-chemical peak signal characterization.

[LG-67] Symbolic Emulators for Cosmology: Accelerating Cosmological Analyses Without Sacrificing Precision

链接: https://arxiv.org/abs/2510.18749
作者: Deaglan J. Bartlett,Shivam Pandey
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 22 pages, 6 figures. Invited contribution for the Royal Society Philosophical Transactions A special issue “Symbolic regression in the physical sciences”

点击查看摘要

Abstract:In cosmology, emulators play a crucial role by providing fast and accurate predictions of complex physical models, enabling efficient exploration of high-dimensional parameter spaces that would be computationally prohibitive with direct numerical simulations. Symbolic emulators have emerged as promising alternatives to numerical approaches, delivering comparable accuracy with significantly faster evaluation times. While previous symbolic emulators were limited to relatively narrow prior ranges, we expand these to cover the parameter space relevant for current cosmological analyses. We introduce approximations to hypergeometric functions used for the \Lambda CDM comoving distance and linear growth factor which are accurate to better than 0.001% and 0.05%, respectively, for all redshifts and for \Omega_{\rm m} \in [0.1, 0.5] . We show that integrating symbolic emulators into a Dark Energy Survey-like 3\times2pt analysis produces cosmological constraints consistent with those obtained using standard numerical methods. Our symbolic emulators offer substantial improvements in speed and memory usage, demonstrating their practical potential for scalable, likelihood-based inference.

[LG-68] Diffusion Buffer for Online Generative Speech Enhancement

链接: https://arxiv.org/abs/2510.18744
作者: Bunlong Lay,Rostislav Makarov,Simon Welker,Maris Hillemann,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Online Speech Enhancement was mainly reserved for predictive models. A key advantage of these models is that for an incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative Speech Enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based Speech Enhancement model which only requires one neural network call per incoming signal frame from a stream of data and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with Diffusion time-steps. The approach progressively denoises frames through physical time, where past frames have more noise removed. Consequently, an enhanced frame is output to the listener with a delay defined by the Diffusion Buffer, and the output frame has a corresponding look-ahead. In this work, we extend upon our previous work by carefully designing a 2D convolutional UNet architecture that specifically aligns with the Diffusion Buffer’s look-ahead. We observe that the proposed UNet improves performance, particularly when the algorithmic latency is low. Moreover, we show that using a Data Prediction loss instead of Denoising Score Matching loss enables flexible control over the trade-off between algorithmic latency and quality during inference. The extended Diffusion Buffer equipped with a novel NN and loss function drastically reduces the algorithmic latency from 320 - 960 ms to 32 - 176 ms with an even increased performance. While it has been shown before that offline generative diffusion models outperform predictive approaches in unseen noisy speech data, we confirm that the online Diffusion Buffer also outperforms its predictive counterpart on unseen noisy speech data.

[LG-69] Differentially Private E-Values

链接: https://arxiv.org/abs/2510.18654
作者: Daniel Csillag,Diego Mesquita
类目: Methodology (stat.ME); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:E-values have gained prominence as flexible tools for statistical inference and risk control, enabling anytime- and post-hoc-valid procedures under minimal assumptions. However, many real-world applications fundamentally rely on sensitive data, which can be leaked through e-values. To ensure their safe release, we propose a general framework to transform non-private e-values into differentially private ones. Towards this end, we develop a novel biased multiplicative noise mechanism that ensures our e-values remain statistically valid. We show that our differentially private e-values attain strong statistical power, and are asymptotically as powerful as their non-private counterparts. Experiments across online risk monitoring, private healthcare, and conformal e-prediction demonstrate our approach’s effectiveness and illustrate its broad applicability.
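
"有偏乘性噪声"的思路可粗略示意如下(以对数空间 Laplace 噪声加偏差校正为例:校正因子 1 − b² 使零假设下期望不超过原 e 值,从而保持统计有效性;这只是一个满足该性质的示例机制,并非论文的精确构造):

```python
import numpy as np

def private_e_value(e, log_sensitivity, epsilon, rng=np.random.default_rng()):
    """Release a differentially private e-value via biased multiplicative noise."""
    b = log_sensitivity / epsilon            # Laplace scale for log-space noise
    assert b < 1, "the correction below is valid only for b < 1"
    noise = rng.laplace(0.0, b)
    bias_correction = 1.0 - b ** 2           # since E[exp(Laplace(0, b))] = 1 / (1 - b^2)
    return e * np.exp(noise) * bias_correction   # E[released] = e, so validity is preserved
```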

[LG-70] Channel-Aware Vector Quantization for Robust Semantic Communication on Discrete Channels

链接: https://arxiv.org/abs/2510.18604
作者: Zian Meng,Qiang Li,Wenqian Tang,Mingdie Yan,Xiaohu Ge
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Deep learning-based semantic communication has largely relied on analog or semi-digital transmission, which limits compatibility with modern digital communication infrastructures. Recent studies have employed vector quantization (VQ) to enable discrete semantic transmission, yet existing methods neglect channel state information during codebook optimization, leading to suboptimal robustness. To bridge this gap, we propose a channel-aware vector quantization (CAVQ) algorithm within a joint source-channel coding (JSCC) framework, termed VQJSCC, established on a discrete memoryless channel. In this framework, semantic features are discretized and directly mapped to modulation constellation symbols, while CAVQ integrates channel transition probabilities into the quantization process, aligning easily confused symbols with semantically similar codewords. A multi-codebook alignment mechanism is further introduced to handle mismatches between codebook order and modulation order by decomposing the transmission stream into multiple independently optimized subchannels. Experimental results demonstrate that VQJSCC effectively mitigates the digital cliff effect, achieves superior reconstruction quality across various modulation schemes, and outperforms state-of-the-art digital semantic communication baselines in both robustness and efficiency.

[LG-71] A Multi-Evidence Framework Rescues Low-Power Prognostic Signals and Rejects Statistical Artifacts in Cancer Genomics

链接: https://arxiv.org/abs/2510.18571
作者: Gokturk Aytug Akarlar
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 17 pages (main text), 4 figures (main text), 7 supplementary figures, 4 supplementary tables. Focuses on a computational framework using causal inference and biological validation for underpowered cancer genomic studies

点击查看摘要

Abstract:Motivation: Standard genome-wide association studies in cancer genomics rely on statistical significance with multiple testing correction, but systematically fail in underpowered cohorts. In TCGA breast cancer (n=967, 133 deaths), low event rates (13.8%) create severe power limitations, producing false negatives for known drivers and false positives for large passenger genes. Results: We developed a five-criteria computational framework integrating causal inference (inverse probability weighting, doubly robust estimation) with orthogonal biological validation (expression, mutation patterns, literature evidence). Applied to TCGA-BRCA mortality analysis, standard Cox+FDR detected zero genes at FDR0.05, confirming complete failure in underpowered settings. Our framework correctly identified RYR2 - a cardiac gene with no cancer function - as a false positive despite nominal significance (p=0.024), while identifying KMT2C as a complex candidate requiring validation despite marginal significance (p=0.047, q=0.954). Power analysis revealed median power of 15.1% across genes, with KMT2C achieving only 29.8% power (HR=1.55), explaining borderline statistical significance despite strong biological evidence. The framework distinguished true signals from artifacts through mutation pattern analysis: RYR2 showed 29.8% silent mutations (passenger signature) with no hotspots, while KMT2C showed 6.7% silent mutations with 31.4% truncating variants (driver signature). This multi-evidence approach provides a template for analyzing underpowered cohorts, prioritizing biological interpretability over purely statistical significance. Availability: All code and analysis pipelines available at this http URL inference-for-cancer-genomics

[LG-72] Interval Prediction of Annual Average Daily Traffic on Local Roads via Quantile Random Forest with High-Dimensional Spatial Data

链接: https://arxiv.org/abs/2510.18548
作者: Ying Yao,Daniel J. Graham
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Accurate annual average daily traffic (AADT) data are vital for transport planning and infrastructure management. However, automatic traffic detectors across national road networks often provide incomplete coverage, leading to underrepresentation of minor roads. While recent machine learning advances have improved AADT estimation at unmeasured locations, most models produce only point predictions and overlook estimation uncertainty. This study addresses that gap by introducing an interval prediction approach that explicitly quantifies predictive uncertainty. We integrate a Quantile Random Forest model with Principal Component Analysis to generate AADT prediction intervals, providing plausible traffic ranges bounded by estimated minima and maxima. Using data from over 2,000 minor roads in England and Wales, and evaluated with specialized interval metrics, the proposed method achieves an interval coverage probability of 88.22%, a normalized average width of 0.23, and a Winkler Score of 7,468.47. By combining machine learning with spatial and high-dimensional analysis, this framework enhances both the accuracy and interpretability of AADT estimation, supporting more robust and informed transport planning.
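
分位数随机森林的区间构造可用如下简化实现示意(将每个测试点落入的各树叶节点中的训练样本汇成经验分布再取分位数;实际论文在此之前还需对高维空间特征做 PCA 降维):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qrf_interval(X_train, y_train, X_test, lo=0.05, hi=0.95):
    """Per-point prediction intervals from pooled per-tree leaf samples."""
    y_train = np.asarray(y_train)
    forest = RandomForestRegressor(n_estimators=500).fit(X_train, y_train)
    leaves_train = forest.apply(X_train)            # (n_train, n_trees) leaf indices
    leaves_test = forest.apply(X_test)
    intervals = []
    for row in leaves_test:
        pooled = np.concatenate([y_train[leaves_train[:, t] == leaf]
                                 for t, leaf in enumerate(row)])
        intervals.append(np.quantile(pooled, [lo, hi]))
    return np.array(intervals)                      # AADT lower/upper bounds per road
```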

[LG-73] Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware SSL

链接: https://arxiv.org/abs/2510.18516
作者: Sangyoon Bae,Mehdi Azabou,Jiook Cha,Blake Richards
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) holds a great deal of promise for applications in neuroscience, due to the lack of large-scale, consistently labeled neural datasets. However, most neural datasets contain heterogeneous populations that mix stable, predictable cells with highly stochastic, stimulus-contingent ones, which has made it hard to identify consistent activity patterns during SSL. As a result, self-supervised pretraining has yet to show clear signs of benefits from scale on neural data. Here, we present a novel approach to self-supervised pretraining, POYO-SSL that exploits the heterogeneity of neural data to improve pre-training and achieve benefits of scale. Specifically, in POYO-SSL we pretrain only on predictable (statistically regular) neurons-identified on the pretraining split via simple higher-order statistics (skewness and kurtosis)-then we fine-tune on the unpredictable population for downstream tasks. On the Allen Brain Observatory dataset, this strategy yields approximately 12-13% relative gains over from-scratch training and exhibits smooth, monotonic scaling with model size. In contrast, existing state-of-the-art baselines plateau or destabilize as model size increases. By making predictability an explicit metric for crafting the data diet, POYO-SSL turns heterogeneity from a liability into an asset, providing a robust, biologically grounded recipe for scalable neural decoding and a path toward foundation models of neural dynamics.
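
"可预测神经元"的筛选步骤可示意如下(阈值为示例假设;论文在预训练划分上计算这些高阶统计量):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def predictable_mask(activity, skew_max=1.0, kurt_max=3.0):
    """Flag statistically regular neurons from (n_neurons, n_timepoints) traces."""
    s = np.abs(skew(activity, axis=1))
    k = kurtosis(activity, axis=1)          # excess kurtosis
    return (s < skew_max) & (k < kurt_max)

activity = np.random.randn(100, 5000)       # placeholder calcium traces
pretrain_subset = activity[predictable_mask(activity)]   # SSL pretraining "data diet"
```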

[LG-74] A machine learning approach to automation and uncertainty evaluation for self-validating thermocouples

链接: https://arxiv.org/abs/2510.18411
作者: Samuel Bilson,Andrew Thompson,Declan Tucker,Jonathan Pearce
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG)
*备注: 6 pages, 7 figures. TEMPERATURE: ITS MEASUREMENT AND CONTROL IN SCIENCE AND INDUSTRY, VOLUME 9: Proceedings of the Tenth International Temperature Symposium 3-7 April 2023 Anaheim, USA

点击查看摘要

Abstract:Thermocouples are in widespread use in industry, but they are particularly susceptible to calibration drift in harsh environments. Self-validating thermocouples aim to address this issue by using a miniature phase-change cell (fixed point) in close proximity to the measurement junction (tip) of the thermocouple. The fixed point is a crucible containing an ingot of metal with a known melting temperature. When the process temperature being monitored passes through the melting temperature of the ingot, the thermocouple output exhibits a "plateau" during melting. Since the melting temperature of the ingot is known, the thermocouple can be recalibrated in situ. Identifying the melting plateau to determine the onset of melting is reasonably well established but requires manual intervention involving zooming in on the region around the actual melting temperature, a process which can depend on the shape of the melting plateau. For the first time, we present a novel machine learning approach to recognize and identify the characteristic shape of the melting plateau and, once identified, to quantify the point at which melting begins, along with its associated uncertainty. This removes the need for human intervention in locating and characterizing the melting point. Results from test data provided by CCPI Europe show 100% accuracy of melting plateau detection. They also show a cross-validated R^2 of 0.99 on predictions of calibration drift.

[LG-75] Parametrising the Inhomogeneity Inducing Capacity of a Training Set and its Impact on Supervised Learning

链接: https://arxiv.org/abs/2510.18332
作者: Gargi Roy,Dalia Chakrabarty
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a parametrisation of that property of the available training dataset which necessitates an inhomogeneous correlation structure for the function that is learnt as a model of the relationship between the pair of variables, observations of which comprise the considered training data. We refer to a parametrisation of this property of a given training set as its ``inhomogeneity parameter''. It is easy to compute this parameter for small-to-large datasets, and we demonstrate such computation on multiple publicly-available datasets, while also demonstrating that conventional ``non-stationarity'' of data does not imply a non-zero inhomogeneity parameter of the dataset. We prove that - within the probabilistic Gaussian Process-based learning approach - a training set with a non-zero inhomogeneity parameter renders it imperative that the process that is invoked to model the sought function be non-stationary. Following the learning of a real-world multivariate function with such a Process, quality and reliability of predictions at test inputs are demonstrated to be affected by the inhomogeneity parameter of the training data.

[LG-76] he Bias-Variance Tradeoff in Data-Driven Optimization: A Local Misspecification Perspective

链接: https://arxiv.org/abs/2510.18215
作者: Haixiang Lan,Luofeng Liao,Adam N. Elmachtoub,Christian Kroer,Henry Lam,Haofeng Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data-driven stochastic optimization is ubiquitous in machine learning and operational decision-making problems. Sample average approximation (SAA) and model-based approaches such as estimate-then-optimize (ETO) or integrated estimation-optimization (IEO) are all popular, with model-based approaches being able to circumvent some of the issues with SAA in complex context-dependent problems. Yet the relative performance of these methods is poorly understood, with most results confined to the dichotomous cases of the model-based approach being either well-specified or misspecified. We develop the first results that allow for a more granular analysis of the relative performance of these methods under a local misspecification setting, which models the scenario where the model-based approach is nearly well-specified. By leveraging tools from contiguity theory in statistics, we show that there is a bias-variance tradeoff between SAA, IEO, and ETO under local misspecification, and that the relative importance of the bias and the variance depends on the degree of local misspecification. Moreover, we derive explicit expressions for the decision bias, which allows us to characterize (un)impactful misspecification directions, and provide further geometric understanding of the variance.

[LG-77] Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network ICASSP2026

链接: https://arxiv.org/abs/2510.18190
作者: Zhanhong He,Hanyu Meng,David Huang,Roberto Togneri
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Paper submitted to ICASSP2026

点击查看摘要

Abstract:Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation. These four targets form the metrical structure of dynamics in the music score. Inspired by recent vocal dynamics research, we use a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to log-Mel input, this reduces the model size from 14.7 M to 0.5 M parameters, enabling long sequential input. We use a 60-second input length for audio segmentation, double the length commonly used in beat tracking. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamics estimation and delivers a powerful and compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.

[LG-78] Beating the Winner's Curse via Inference-Aware Policy Optimization

链接: https://arxiv.org/abs/2510.18161
作者: Hamsa Bastani,Osbert Bastani,Bryce McLaughlin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. A common approach is to train a machine learning model to predict counterfactual outcomes, and then select the policy that optimizes the predicted objective value. In addition, practitioners also want confidence that the learned policy has better performance than the incumbent policy according to downstream policy evaluation. However, due to the winner's curse - an issue where the policy optimization procedure exploits prediction errors rather than finding actual improvements - predicted performance improvements are often not substantiated by downstream policy evaluation. To address this challenge, we propose a novel strategy called inference-aware policy optimization, which modifies policy optimization to account for how the policy will be evaluated downstream. Specifically, it optimizes not only for the estimated objective value, but also for the chances that the policy will be statistically significantly better than the observational policy used to collect data. We mathematically characterize the Pareto frontier of policies according to the tradeoff of these two goals. Based on our characterization, we design a policy optimization algorithm that uses machine learning to predict counterfactual outcomes, and then plugs in these predictions to estimate the Pareto frontier; then, the decision-maker can select the policy that optimizes their desired tradeoff, after which policy evaluation can be performed on the test set as usual. Finally, we perform simulations to illustrate the effectiveness of our methodology.

[LG-79] Generalization Below the Edge of Stability: The Role of Data Geometry

链接: https://arxiv.org/abs/2510.18120
作者: Tongtong Liang,Alexander Cloninger,Rahul Parhi,Yu-Xiang Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under Review. Comments welcome!

点击查看摘要

Abstract:Understanding generalization in overparameterized neural networks hinges on the interplay between the data geometry, neural architecture, and training dynamics. In this paper, we theoretically explore how data geometry controls this implicit bias. This paper presents theoretical results for overparameterized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: When the data is harder to “shatter” with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. On the other hand, for data that is easily shattered (e.g., data supported on the sphere) gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.

[LG-80] Arbitrated Indirect Treatment Comparisons

Link: https://arxiv.org/abs/2510.18071
Authors: Yixin Fang, Weili He
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:

Abstract:Matching-adjusted indirect comparison (MAIC) has been increasingly employed in health technology assessments (HTA). By reweighting subjects from a trial with individual participant data (IPD) to match the covariate summary statistics of another trial with only aggregate data (AgD), MAIC facilitates the estimation of a treatment effect defined with respect to the AgD trial population. This manuscript introduces a new class of methods, termed arbitrated indirect treatment comparisons, designed to address the "MAIC paradox", a phenomenon highlighted by Jiang et al. (2025). The MAIC paradox arises when different sponsors, analyzing the same data, reach conflicting conclusions regarding which treatment is more effective. The underlying issue is that each sponsor implicitly targets a different population. To resolve this inconsistency, the proposed methods focus on estimating treatment effects in a common target population, specifically chosen to be the overlap population.
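
For readers unfamiliar with the weighting step that the arbitrated methods build on, the sketch below shows standard MAIC weight estimation: centring the IPD covariates at the AgD means turns moment matching into a convex optimization whose solution makes the weighted means match exactly. The data are simulated placeholders.

```python
# Minimal MAIC weighting sketch (method-of-moments reweighting).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_ipd = rng.normal(size=(500, 3))            # individual-level covariates
agg_means = np.array([0.3, -0.1, 0.2])       # published AgD covariate means

Xc = X_ipd - agg_means                        # centre at the target means

def objective(a):
    # Convex MAIC objective; its gradient vanishes exactly when the
    # weighted covariate means equal the aggregate means.
    return np.exp(Xc @ a).sum()

def grad(a):
    return Xc.T @ np.exp(Xc @ a)

res = minimize(objective, np.zeros(3), jac=grad, method="BFGS")
weights = np.exp(Xc @ res.x)
weighted_means = (weights[:, None] * X_ipd).sum(0) / weights.sum()
print(np.round(weighted_means, 3))  # matches agg_means up to solver tolerance
```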

[LG-81] QINNs: Quantum-Informed Neural Networks

Link: https://arxiv.org/abs/2510.17984
Authors: Aritra Bal, Markus Klute, Benedikt Maier, Melik Oughton, Eric Pezone, Michael Spannowsky
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Quantum Physics (quant-ph)
*Comments: 20 pages, 9 figures

Abstract:Classical deep neural networks can learn rich multi-particle correlations in collider data, but their inductive biases are rarely anchored in physics structure. We propose quantum-informed neural networks (QINNs), a general framework that brings quantum information concepts and quantum observables into purely classical models. While the framework is broad, in this paper we study one concrete realisation that encodes each particle as a qubit and uses the Quantum Fisher Information Matrix (QFIM) as a compact, basis-independent summary of particle correlations. Using jet tagging as a case study, QFIMs act as lightweight embeddings in graph neural networks, increasing model expressivity and plasticity. The QFIM reveals distinct patterns for QCD and hadronic top jets that align with physical expectations. Thus, QINNs offer a practical, interpretable, and scalable route to quantum-informed analyses (that is, tomography) of particle collisions, particularly by enhancing well-established deep learning approaches.
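
The QFIM of a pure state can be computed directly from the standard formula, as in the numpy sketch below. Encoding a single "particle" as one qubit parameterised by (theta, phi) is a simplified stand-in for the paper's particle-to-qubit embedding.

```python
# QFIM of a pure state via the standard (gauge-invariant) formula:
#   F_ij = 4 Re( <d_i psi | d_j psi> - <d_i psi | psi><psi | d_j psi> ).
import numpy as np

def qubit_state(params):
    theta, phi = params
    return np.array([np.cos(theta / 2),
                     np.exp(1j * phi) * np.sin(theta / 2)])

def qfim(state_fn, params, eps=1e-6):
    params = np.asarray(params, dtype=float)
    psi = state_fn(params)
    # Central finite differences of the state w.r.t. each parameter.
    dpsi = []
    for k in range(len(params)):
        shift = np.zeros_like(params); shift[k] = eps
        dpsi.append((state_fn(params + shift) - state_fn(params - shift)) / (2 * eps))
    n = len(params)
    F = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            term = np.vdot(dpsi[i], dpsi[j]) - np.vdot(dpsi[i], psi) * np.vdot(psi, dpsi[j])
            F[i, j] = 4 * term.real
    return F

print(np.round(qfim(qubit_state, [0.7, 0.3]), 6))
# For this parameterisation the exact QFIM is diag(1, sin(theta)^2):
# a compact, basis-independent summary that could feed a downstream network.
```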

[LG-82] Learning Time-Varying Graphs from Incomplete Graph Signals

Link: https://arxiv.org/abs/2510.17903
Authors: Chuansen Peng, Xiaojing Shen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Abstract:This paper tackles the challenging problem of jointly inferring time-varying network topologies and imputing missing data from partially observed graph signals. We propose a unified non-convex optimization framework to simultaneously recover a sequence of graph Laplacian matrices while reconstructing the unobserved signal entries. Unlike conventional decoupled methods, our integrated approach facilitates a bidirectional flow of information between the graph and signal domains, yielding superior robustness, particularly in high missing-data regimes. To capture realistic network dynamics, we introduce a fused-lasso type regularizer on the sequence of Laplacians. This penalty promotes temporal smoothness by penalizing large successive changes, thereby preventing spurious variations induced by noise while still permitting gradual topological evolution. For solving the joint optimization problem, we develop an efficient Alternating Direction Method of Multipliers (ADMM) algorithm, which leverages the problem’s structure to yield closed-form solutions for both the graph and signal subproblems. This design ensures scalability to large-scale networks and long time horizons. On the theoretical front, despite the inherent non-convexity, we establish a convergence guarantee, proving that the proposed ADMM scheme converges to a stationary point. Furthermore, we derive non-asymptotic statistical guarantees, providing high-probability error bounds for the graph estimator as a function of sample size, signal smoothness, and the intrinsic temporal variability of the graph. Extensive numerical experiments validate the approach, demonstrating that it significantly outperforms state-of-the-art baselines in both convergence speed and the joint accuracy of graph learning and signal recovery.
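
To see how the graph and signal domains couple, the sketch below solves only the signal-imputation subproblem for a fixed Laplacian, which has the closed-form solve shown. The alternating Laplacian update under the fused-lasso temporal penalty, and the full ADMM machinery, are omitted for brevity.

```python
# Graph-smoothness regularised imputation: one subproblem of the joint method.
import numpy as np

def impute_on_graph(y, observed_mask, L, alpha=1.0):
    """Solve min_x ||M(x - y)||^2 + alpha * x^T L x, with M = diag(mask)."""
    M = np.diag(observed_mask.astype(float))
    return np.linalg.solve(M + alpha * L, M @ y)

# Toy 4-node path graph Laplacian.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
y = np.array([1.0, 0.0, 3.0, 4.0])   # entry 1 is missing (its value is ignored)
mask = np.array([1, 0, 1, 1])
print(impute_on_graph(y, mask, L))   # missing node is pulled toward its neighbours
```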

[LG-83] Graphical model for tensor factorization by sparse sampling

Link: https://arxiv.org/abs/2510.17886
Authors: Angelo Giorgio, Riki Nagasawa, Shuta Yokoi, Tomoyuki Obuchi, Hajime Yoshino
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: 75 pages, 26 figures

Abstract:We consider tensor factorizations based on sparse measurements of the tensor components. The measurements are designed so that the underlying graph of interactions is a random graph. The setup is useful in cases where a substantial amount of data is missing, as in the recommendation systems heavily used in social network services. To obtain theoretical insights into the setup, we consider statistical inference of the tensor factorization in a high-dimensional limit, which we call the dense limit, where the graphs are large and dense but not fully connected. We build message-passing algorithms and test them in a Bayes-optimal teacher-student setting. We also develop a replica theory, which becomes exact in the dense limit, to examine the performance of statistical inference.
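
The following is not the paper's message-passing inference; it is a classical alternating-least-squares (ALS) baseline for the same sparse-sampling setup (here a matrix rather than a general tensor), included only to make the problem concrete.

```python
# ALS factorization from sparsely sampled entries, a standard baseline for
# the recommendation-style setting the abstract describes.
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 60, 50, 3
U_true, V_true = rng.normal(size=(n, r)), rng.normal(size=(m, r))
mask = rng.random((n, m)) < 0.15                 # sparse random observations
Y = (U_true @ V_true.T) * mask

U, V = rng.normal(size=(n, r)), rng.normal(size=(m, r))
lam = 0.1
for _ in range(30):
    # Regularised least squares row by row, using observed entries only.
    for i in range(n):
        idx = mask[i]
        U[i] = np.linalg.solve(V[idx].T @ V[idx] + lam * np.eye(r), V[idx].T @ Y[i, idx])
    for j in range(m):
        idx = mask[:, j]
        V[j] = np.linalg.solve(U[idx].T @ U[idx] + lam * np.eye(r), U[idx].T @ Y[idx, j])

err = np.linalg.norm(U @ V.T - U_true @ V_true.T) / np.linalg.norm(U_true @ V_true.T)
print(round(err, 3))  # relative error; typically small despite ~85% missing entries
```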

[LG-84] Three-dimensional inversion of gravity data using implicit neural representations

Link: https://arxiv.org/abs/2510.17876
Authors: Pankaj K Mishra, Sanni Laaksonen, Jochen Kamm, Anand Singh
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures

Abstract:Inversion of gravity data is an important method for investigating subsurface density variations relevant to diverse applications including mineral exploration, geothermal assessment, carbon storage, natural hydrogen, groundwater resources, and tectonic evolution. Here we present a scientific machine-learning approach for three-dimensional gravity inversion that represents subsurface density as a continuous field using an implicit neural representation (INR). The method trains a deep neural network directly through a physics-based forward-model loss, mapping spatial coordinates to a continuous density field without predefined meshes or discretisation. Positional encoding enhances the network's capacity to capture sharp contrasts and short-wavelength features that conventional coordinate-based networks tend to oversmooth due to spectral bias. We demonstrate the approach on synthetic examples including Gaussian random fields, representing realistic geological complexity, and a dipping block model to assess recovery of blocky structures. The INR framework reconstructs detailed structure and geologically plausible boundaries without explicit regularisation or depth weighting, while significantly reducing the number of inversion parameters. These results highlight the potential of implicit representations to enable scalable, flexible, and interpretable large-scale geophysical inversion. This framework could generalise to other geophysical methods and to joint/multiphysics inversion.
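
A compact sketch of the INR recipe follows: Fourier positional encoding of coordinates, an MLP producing a continuous density field, and training through a forward-model loss. The random linear operator `G` is a placeholder for the actual gravity forward model, and all sizes are assumptions.

```python
# Coordinate-network sketch: positional encoding + MLP trained through a
# (placeholder) linear forward operator, with no mesh or discretisation.
import torch
import torch.nn as nn

def positional_encoding(coords, n_freqs=6):
    # coords: (N, 3) -> (N, 3 * 2 * n_freqs) of sin/cos features.
    feats = []
    for k in range(n_freqs):
        feats += [torch.sin((2 ** k) * torch.pi * coords),
                  torch.cos((2 ** k) * torch.pi * coords)]
    return torch.cat(feats, dim=-1)

mlp = nn.Sequential(nn.Linear(36, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 1))

coords = torch.rand(500, 3)                # query points in the model volume
G = torch.randn(64, 500) / 500             # placeholder forward operator
d_obs = torch.randn(64)                    # placeholder observed gravity data

opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for step in range(200):
    rho = mlp(positional_encoding(coords)).squeeze(-1)  # continuous density field
    loss = ((G @ rho - d_obs) ** 2).mean()  # physics-based data-misfit loss
    opt.zero_grad(); loss.backward(); opt.step()
```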

[LG-85] Covariance Matrix Construction with Preprocessing-Based Spatial Sampling for Robust Adaptive Beamforming

Link: https://arxiv.org/abs/2510.17823
Authors: Saeed Mohammadzadeh, Rodrigo C. de Lamare, Yuriy Zakharov
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: 13 figures, 14 pages

Abstract:This work proposes an efficient, robust adaptive beamforming technique to deal with steering vector (SV) estimation mismatches and data covariance matrix reconstruction problems. In particular, the direction-of-arrival (DoA) of interfering sources is estimated with available snapshots in which the angular sectors of the interfering signals are computed adaptively. Then, we utilize the well-known general linear combination algorithm to reconstruct the interference-plus-noise covariance (IPNC) matrix using preprocessing-based spatial sampling (PPBSS). We demonstrate that the preprocessing matrix can be replaced by the sample covariance matrix (SCM) in the shrinkage method. A power spectrum sampling strategy is then devised based on a preprocessing matrix computed with the estimated angular sectors’ information. Moreover, the covariance matrix for the signal is formed for the angular sector of the signal-of-interest (SOI), which allows for calculating an SV for the SOI using the power method. An analysis of the array beampattern in the proposed PPBSS technique is carried out, and a study of the computational cost of competing approaches is conducted. Simulation results show the proposed method’s effectiveness compared to existing approaches.
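
Two ingredients named in the abstract, a shrinkage ("general linear combination") covariance estimate and the adaptive beamformer built from it, can be sketched as below with an MVDR example; the interference-sector estimation and SV refinement steps of the proposed method are not reproduced here, and the shrinkage factor is an assumed constant.

```python
# Shrinkage covariance estimate + MVDR weights on simulated array snapshots.
import numpy as np

rng = np.random.default_rng(1)
n_sensors, n_snapshots = 8, 40
X = (rng.normal(size=(n_sensors, n_snapshots))
     + 1j * rng.normal(size=(n_sensors, n_snapshots)))  # array snapshots

R_hat = X @ X.conj().T / n_snapshots          # sample covariance matrix (SCM)
rho = 0.1                                     # shrinkage factor (assumed)
R = (1 - rho) * R_hat + rho * (np.trace(R_hat).real / n_sensors) * np.eye(n_sensors)

# Steering vector for a half-wavelength ULA, SOI direction assumed known here.
theta = np.deg2rad(10.0)
a = np.exp(1j * np.pi * np.arange(n_sensors) * np.sin(theta))

# MVDR weights: w = R^{-1} a / (a^H R^{-1} a).
Ri_a = np.linalg.solve(R, a)
w = Ri_a / (a.conj() @ Ri_a)
print(abs(w.conj() @ a))  # distortionless response toward the SOI (= 1)
```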

[LG-86] CLARAE: Clarity Preserving Reconstruction AutoEncoder for Denoising and Rhythm Classification of Intracardiac Electrograms

Link: https://arxiv.org/abs/2510.17821
Authors: Long Lin, Pablo Peiro-Corbacho, Pablo Ávila, Alejandro Carta-Bergaz, Ángel Arenal, Gonzalo R. Ríos-Muñoz, Carlos Sevilla-Salcedo
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Abstract:Intracavitary atrial electrograms (EGMs) provide high-resolution insights into cardiac electrophysiology but are often contaminated by noise and remain high-dimensional, limiting real-time analysis. We introduce CLARAE (CLArity-preserving Reconstruction AutoEncoder), a one-dimensional encoder–decoder designed for atrial EGMs, which achieves both high-fidelity reconstruction and a compact 64-dimensional latent representation. CLARAE is designed to preserve waveform morphology, mitigate reconstruction artifacts, and produce interpretable embeddings through three principles: downsampling with pooling, a hybrid interpolation–convolution upsampling path, and a bounded latent space. We evaluated CLARAE on 495,731 EGM segments (unipolar and bipolar) from 29 patients across three rhythm types (AF, SR300, SR600). Performance was benchmarked against six state-of-the-art autoencoders using reconstruction metrics, rhythm classification, and robustness across signal-to-noise ratios from -5 to 15 dB. In downstream rhythm classification, CLARAE achieved F1-scores above 0.97 for all rhythm types, and its latent space showed clear clustering by rhythm. In denoising tasks, it consistently ranked among the top performers for both unipolar and bipolar signals. In order to promote reproducibility and enhance accessibility, we offer an interactive web-based application. This platform enables users to explore pre-trained CLARAE models, visualize the reconstructions, and compute metrics in real time. Overall, CLARAE combines robust denoising with compact, discriminative representations, offering a practical foundation for clinical workflows such as rhythm discrimination, signal quality assessment, and real-time mapping.
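
A structural sketch of the three stated design principles (pooled downsampling, interpolation-then-convolution upsampling, and a bounded latent space) might look as follows in PyTorch; the channel counts, depths, and latent mapping are illustrative guesses, not the published CLARAE configuration.

```python
# Tiny 1D autoencoder illustrating CLARAE's three design principles.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEGMAutoencoder(nn.Module):
    def __init__(self, latent_dim=64, seq_len=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, 7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.to_latent = nn.Linear(32 * (seq_len // 16), latent_dim)
        self.from_latent = nn.Linear(latent_dim, 32 * (seq_len // 16))
        self.dec1 = nn.Conv1d(32, 16, 7, padding=3)
        self.dec2 = nn.Conv1d(16, 1, 7, padding=3)
        self.seq_len = seq_len

    def forward(self, x):                      # x: (batch, 1, seq_len)
        h = self.encoder(x).flatten(1)
        z = torch.tanh(self.to_latent(h))      # bounded latent space
        h = self.from_latent(z).view(-1, 32, self.seq_len // 16)
        # Interpolation + convolution upsampling avoids transposed-conv artifacts.
        h = F.relu(self.dec1(F.interpolate(h, scale_factor=4, mode="linear")))
        return self.dec2(F.interpolate(h, scale_factor=4, mode="linear")), z

model = TinyEGMAutoencoder()
recon, z = model(torch.randn(2, 1, 1024))
print(recon.shape, z.shape)  # torch.Size([2, 1, 1024]) torch.Size([2, 64])
```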

[LG-87] Single-Snapshot Gridless 2D-DoA Estimation for UCAs: A Joint Optimization Approach

Link: https://arxiv.org/abs/2510.17818
Authors: Salar Nouri
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments:

Abstract:This paper tackles the challenging problem of gridless two-dimensional (2D) direction-of-arrival (DOA) estimation for a uniform circular array (UCA) from a single snapshot of data. Conventional gridless methods often fail in this scenario due to prohibitive computational costs or a lack of robustness. We propose a novel framework that overcomes these limitations by jointly estimating a manifold transformation matrix and the source azimuth-elevation pairs within a single, unified optimization problem. This problem is solved efficiently using an inexact Augmented Lagrangian Method (iALM), which completely circumvents the need for semidefinite programming. By unifying the objectives of data fidelity and transformation robustness, our approach is uniquely suited for the demanding single-snapshot case. Simulation results confirm that the proposed iALM framework provides robust and high-resolution, gridless 2D-DOA estimates, establishing its efficacy for challenging array signal processing applications.
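
For contrast with the gridless approach, the sketch below runs a conventional grid-based (Bartlett) 2D spectrum for a UCA from a single snapshot, using the standard UCA steering vector; the coarse angular grid is precisely what the proposed method avoids.

```python
# Grid-based Bartlett 2D-DoA baseline for a uniform circular array (UCA).
import numpy as np

M, radius_wl = 8, 0.5                        # sensors; radius in wavelengths
phi_m = 2 * np.pi * np.arange(M) / M         # sensor angles on the circle

def uca_steering(az, el):
    # a_m = exp(j * 2*pi*(r/lambda) * cos(el) * cos(az - phi_m))
    return np.exp(1j * 2 * np.pi * radius_wl * np.cos(el) * np.cos(az - phi_m))

rng = np.random.default_rng(0)
true_az, true_el = np.deg2rad(40.0), np.deg2rad(20.0)
x = uca_steering(true_az, true_el) + 0.05 * rng.normal(size=M)  # one snapshot

az_grid = np.deg2rad(np.arange(0, 360, 2.0))
el_grid = np.deg2rad(np.arange(0, 90, 2.0))
P = np.array([[abs(uca_steering(az, el).conj() @ x) ** 2
               for az in az_grid] for el in el_grid])
i, j = np.unravel_index(P.argmax(), P.shape)
print(np.rad2deg(az_grid[j]), np.rad2deg(el_grid[i]))  # ~(40, 20), up to grid step
```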

[LG-88] Exploring Complexity Changes in Diseased ECG Signals for Enhanced Classification

Link: https://arxiv.org/abs/2510.17810
Authors: Camilo Quiceno Quintero, Sandip Varkey George
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Data Analysis, Statistics and Probability (physics.data-an)
*Comments: Version submitted to NODYCON 2025

Abstract:The complex dynamics of the heart are reflected in its electrical activity, captured through electrocardiograms (ECGs). In this study we use nonlinear time series analysis to understand how ECG complexity varies with cardiac pathology. Using the large PTB-XL dataset, we extracted nonlinear measures from lead II ECGs, and cross-channel metrics (leads II, V2, AVL) using Spearman correlations and mutual information. Significant differences were found in almost all measures between healthy and diseased classes, and between the 5 diagnostic superclasses (p < 0.001). Moreover, incorporating these complexity quantifiers into machine learning models substantially improved classification accuracy, measured using area under the ROC curve (AUC), from 0.86 (baseline) to 0.87 (nonlinear measures) and 0.90 (including cross-time-series metrics).
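
One representative nonlinear quantifier of the kind used in such studies is sample entropy; a self-contained implementation is sketched below. The paper's feature set (plus the cross-channel Spearman and mutual-information metrics) is broader than this single measure.

```python
# Sample entropy of a 1-D signal: a classic nonlinear complexity measure.
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """SampEn(m, r): -log of the ratio of (m+1)- to m-length template matches."""
    x = np.asarray(x, dtype=float)
    r = r_factor * x.std()

    def count_matches(length):
        # All overlapping templates of the given length (Chebyshev distance).
        templates = np.lib.stride_tricks.sliding_window_view(x, length)
        count = 0
        for i in range(len(templates) - 1):
            d = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += np.sum(d <= r)
        return count

    B, A = count_matches(m), count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

rng = np.random.default_rng(0)
print(sample_entropy(np.sin(np.linspace(0, 20 * np.pi, 2000))))  # low: regular
print(sample_entropy(rng.normal(size=2000)))                     # high: irregular
```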

[LG-89] In-Process Monitoring of Gear Power Honing Using Vibration Signal Analysis and Machine Learning

Link: https://arxiv.org/abs/2510.17809
Authors: Massimo Capurso, Luciano Afferrante
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: 20 pages, 17 figures, 3 tables, 33 references

Abstract:In modern gear manufacturing, stringent Noise, Vibration, and Harshness (NVH) requirements demand high-precision finishing operations such as power honing. Conventional quality control strategies rely on post-process inspections and Statistical Process Control (SPC), which fail to capture transient machining anomalies and cannot ensure real-time defect detection. This study proposes a novel, data-driven framework for in-process monitoring of gear power honing using vibration signal analysis and machine learning. Our proposed methodology involves continuous data acquisition via accelerometers, followed by time-frequency signal analysis. We investigate and compare the efficacy of three subspace learning methods for feature extraction: (1) Principal Component Analysis (PCA) for dimensionality reduction; (2) a two-stage framework combining PCA with Linear Discriminant Analysis (LDA) for enhanced class separation; and (3) Uncorrelated Multilinear Discriminant Analysis with Regularization (R-UMLDA), adapted for tensor data, which enforces feature decorrelation and includes regularization for small sample sizes. These extracted features are then fed into a Support Vector Machine (SVM) classifier to predict four distinct gear quality categories, established through rigorous geometrical inspections and test bench results of assembled gearboxes. The models are trained and validated on an experimental dataset collected in an industrial context during gear power-honing operations. The proposed framework achieves high classification accuracy (up to 100%) in an industrial setting. The approach offers interpretable spectral features that correlate with process dynamics, enabling practical integration into real-time monitoring and predictive maintenance systems.
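
As an illustration of the second feature-extraction variant, the sketch below chains PCA, LDA, and an SVM in scikit-learn on synthetic stand-ins for vibration feature vectors; the dimensions and class counts are arbitrary placeholders.

```python
# PCA -> LDA -> SVM pipeline of the kind the study compares, on dummy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 120))    # e.g. time-frequency features per gear
y = rng.integers(0, 4, size=200)   # four quality categories (random here)

clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),                          # dimensionality reduction
    LinearDiscriminantAnalysis(n_components=3),    # class-separation projection
    SVC(kernel="rbf", C=1.0),
)
print(cross_val_score(clf, X, y, cv=5).mean())     # chance-level on random data
```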

Information Retrieval

[IR-0] LLMs as Sparse Retrievers: A Framework for First-Stage Product Search

Link: https://arxiv.org/abs/2510.18527
Authors: Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Sen Li, Wenjun Peng, Fuyu Lv, Xueqi Cheng
Subjects: Information Retrieval (cs.IR)
*Comments: 16 pages

Abstract:Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day. In product search systems, first-stage retrieval should achieve high recall while ensuring efficient online deployment. Sparse retrieval is particularly attractive in this context due to its interpretability and storage efficiency. However, sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search. Given their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues and thereby improving retrieval quality. Directly applying LLMs to sparse retrieval in product search exposes two key challenges: (1) queries and product titles are typically short and highly susceptible to LLM-induced hallucinations, such as generating irrelevant expansion terms or underweighting critical literal terms like brand names and model numbers; (2) the large vocabulary space of LLMs leads to difficulty in initializing training effectively, making it challenging to learn meaningful sparse representations in such ultra-high-dimensional spaces. To address these challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers. PROSPER incorporates: (1) a literal residual network that alleviates hallucination in lexical expansion by reinforcing underweighted literal terms through a residual compensation mechanism; and (2) a lexical focusing window that facilitates effective training initialization via a coarse-to-fine sparsification strategy. Extensive offline and online experiments show that PROSPER significantly outperforms sparse baselines and achieves recall performance comparable to advanced dense retrievers, while also achieving revenue increments online.
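
The literal-residual idea can be caricatured in a few lines: an expansion weight vector over the vocabulary is combined with a residual copy of the query's literal terms, so that brand or model tokens cannot be drowned out. Everything below (vocabulary, weights, alpha) is an invented placeholder, not PROSPER's trained network.

```python
# Toy sparse retrieval score with a "literal residual" added to expansion weights.
import numpy as np

vocab = ["running", "shoes", "nike", "air", "zoom", "sneaker", "trainer"]
query_terms = ["nike", "zoom"]

rng = np.random.default_rng(0)
expansion = rng.random(len(vocab))                 # stand-in LLM term weights
expansion /= expansion.sum()

literal = np.array([1.0 if t in query_terms else 0.0 for t in vocab])
alpha = 0.5                                        # residual strength (assumed)
sparse_rep = expansion + alpha * literal           # literal residual compensation

doc = np.array([0.0, 0.2, 0.9, 0.1, 0.8, 0.0, 0.0])  # a product title's weights
print(float(sparse_rep @ doc))  # retrieval score = sparse dot product
```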

[IR-1] Censorship Chokepoints: New Battlegrounds for Regional Surveillance, Censorship and Influence on the Internet

Link: https://arxiv.org/abs/2510.18394
Authors: Yong Zhang, Nishanth Sastry
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI); Social and Information Networks (cs.SI)
*Comments: 15 pages, 2 figures

Abstract:Undoubtedly, the Internet has become one of the most important conduits to information for the general public. Nonetheless, Internet access can be and has been limited systematically or blocked completely during political events in numerous countries and regions by various censorship mechanisms. Depending on where the core filtering component is situated, censorship techniques have been classified as client-based, server-based, or network-based. However, as the Internet evolves rapidly, new and sophisticated censorship techniques have emerged, which involve techniques that cut across locations and involve new forms of hurdles to information access. We argue that modern censorship can be better understood through a new lens that we term chokepoints, which identifies bottlenecks in the content production or delivery cycle where efficient new forms of large-scale client-side surveillance and filtering mechanisms have emerged.

[IR-2] Evaluating LLM-Based Mobile App Recommendations: An Empirical Study

Link: https://arxiv.org/abs/2510.18364
Authors: Quim Motger, Xavier Franch, Vincenzo Gervasi, Jordi Marco
Subjects: Information Retrieval (cs.IR); Software Engineering (cs.SE)
*Comments: Under review

Abstract:Large Language Models (LLMs) are increasingly used to recommend mobile applications through natural language prompts, offering a flexible alternative to keyword-based app store search. Yet, the reasoning behind these recommendations remains opaque, raising questions about their consistency, explainability, and alignment with traditional App Store Optimization (ASO) metrics. In this paper, we present an empirical analysis of how widely-used general purpose LLMs generate, justify, and rank mobile app recommendations. Our contributions are: (i) a taxonomy of 16 generalizable ranking criteria elicited from LLM outputs; (ii) a systematic evaluation framework to analyse recommendation consistency and responsiveness to explicit ranking instructions; and (iii) a replication package to support reproducibility and future research on AI-based recommendation systems. Our findings reveal that LLMs rely on a broad yet fragmented set of ranking criteria, only partially aligned with standard ASO metrics. While top-ranked apps tend to be consistent across runs, variability increases with ranking depth and search specificity. LLMs exhibit varying sensitivity to explicit ranking instructions - ranging from substantial adaptations to near-identical outputs - highlighting their complex reasoning dynamics in conversational app discovery. Our results aim to support end-users, app developers, and recommender-systems researchers in navigating the emerging landscape of conversational app discovery.
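
One concrete consistency measure for such an evaluation is top-k overlap between repeated runs of the same prompt, sketched below with invented app lists; this is an illustrative metric, not necessarily the paper's exact protocol.

```python
# Mean pairwise top-k (Jaccard) overlap across repeated LLM recommendation runs.
from itertools import combinations

def jaccard_at_k(run_a, run_b, k=5):
    a, b = set(run_a[:k]), set(run_b[:k])
    return len(a & b) / len(a | b)

runs = [
    ["AppA", "AppB", "AppC", "AppD", "AppE"],
    ["AppA", "AppC", "AppB", "AppF", "AppE"],
    ["AppB", "AppA", "AppC", "AppE", "AppG"],
]
pairwise = [jaccard_at_k(r1, r2) for r1, r2 in combinations(runs, 2)]
print(sum(pairwise) / len(pairwise))  # mean top-5 consistency across runs
```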

[IR-3] Enhancing Hotel Recommendations with AI: LLM-Based Review Summarization and Query-Driven Insights

Link: https://arxiv.org/abs/2510.18277
Authors: Nikolaos Belibasakis, Anastasios Giannaros, Ioanna Giannoukou, Spyros Sioutas
Subjects: Information Retrieval (cs.IR)
*Comments:

Abstract:The increasing amount of data that booking platforms such as Booking.com and AirBnB offer makes it challenging for interested parties to browse through the available accommodations and analyze reviews in an efficient way. Booking platform providers have made efforts to utilize recommender systems that enable users to filter results by factors such as stars, amenities, and cost, but the most valuable insights are often found in the unstructured text-based reviews. Going through these reviews one by one requires a substantial amount of time, while a considerable percentage of the reviews will not tell the user what they are actually looking for. This paper explores how Large Language Models (LLMs) can enhance short-rental apartment recommendations by summarizing and mining key insights from user reviews. The web application presented in this paper, named “instaGuide”, automates the procedure of isolating the text-based user reviews of a property on the Booking.com platform, synthesizing a summary of the reviews, and enabling the user to query specific aspects of the property to get feedback on their personal questions/criteria. During the development of the instaGuide tool, numerous LLM models were evaluated based on accuracy, cost, and response quality. The results suggest that the LLM-powered summarization significantly reduces the time users need to devote to their search for the right short-rental apartment, improving the overall decision-making procedure.

Attachment Download

Click to download the full list of today's papers