This post presents the latest paper list retrieved from Arxiv.org on 2026-01-08. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper list by email, please leave your email address in the comments.

Note: Paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.

Table of Contents

Overview (2026-01-08)

A total of 516 papers were updated today, including:

  • Natural Language Processing: 131 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 177 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 88 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 137 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

[Quick Read]: This paper addresses the limitations of current large language models in detecting financial misinformation in realistic news settings, in particular their weak reasoning when no external reference is available. The key to the solution is RFC Bench, a paragraph-level benchmark framework that defines two complementary tasks — reference-free misinformation detection and comparison-based diagnosis over paired original-perturbed inputs — to systematically evaluate model behavior across settings. Experiments show that performance is substantially stronger when comparative context is available, while the reference-free setting exposes weaknesses such as unstable predictions and elevated invalid outputs, revealing a core shortcoming: current models struggle to maintain coherent belief states without external grounding.

Link: https://arxiv.org/abs/2601.04160
Authors: Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
Comments: 39 pages; 24 figures

Abstract:We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference free misinformation detection and comparison based diagnosis using paired original perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference free reasoning and advancing more reliable financial misinformation detection in real world settings.

[NLP-1] FLEx: Language Modeling with Few-shot Language Explanations

[Quick Read]: This paper tackles the problem that language models frequently repeat mistakes across related queries, and that natural language explanations to correct those mistakes are hard to collect at scale, especially in domains requiring expert annotators. The key to the proposed FLEx (Few-shot Language Explanations) method is its core mechanism: identify representative model errors via embedding-based clustering, verify that the associated explanations actually fix those errors, and summarize them into an inference-time prompt prefix that guides the model to avoid similar mistakes on new inputs, without modifying model weights. Experiments on CounterBench, GSM8K, and ReasonIF show that FLEx outperforms chain-of-thought (CoT) prompting on all three datasets and reduces up to 83% of CoT's remaining errors.

Link: https://arxiv.org/abs/2601.04157
Authors: Adar Avsian, Christopher Richardson, Anirudh Sundar, Larry Heck
Affiliations: Georgia Institute of Technology; Microsoft
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. However, they still make mistakes, and these mistakes are often repeated across related queries. Natural language explanations can help correct these errors, but collecting them at scale may be infeasible, particularly in domains where expert annotators are required. To address this issue, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior using a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference-time. This summary guides the model to avoid similar errors on new inputs, without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. We find that FLEx consistently outperforms chain-of-thought (CoT) prompting across all three datasets and reduces up to 83% of CoT’s remaining errors.
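
The pipeline described above — cluster error embeddings, pick a representative per cluster, keep only verified explanations, and join them into a prefix — is easy to sketch. A minimal illustration with scikit-learn follows; `embed`, `explain_fn`, and `is_fixed` are hypothetical helpers standing in for the paper's actual components:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_flex_prefix(errors, embed, explain_fn, is_fixed, k=5):
    """Select representative errors by clustering, keep explanations
    that verifiably fix them, and join them into a prompt prefix."""
    X = np.array([embed(e) for e in errors])   # embed each failed example
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    prefix_parts = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # representative = cluster member closest to the centroid
        dists = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        rep = errors[idx[np.argmin(dists)]]
        expl = explain_fn(rep)                 # draft an explanation
        if is_fixed(rep, expl):                # keep only verified fixes
            prefix_parts.append(expl)
    return "Avoid these known mistakes:\n- " + "\n- ".join(prefix_parts)
```

The returned string would then be prepended to every test prompt, leaving the model weights untouched.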

[NLP-2] LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversation Creation

[Quick Read]: This paper addresses the lack of tools for structured, reproducible multi-party conversation creation, in particular how to efficiently build coherent linearized dialogue sequences from raw reply-tree structures while preserving speaker identity and discourse relations. The key contributions of the LLMberjack platform are: (1) an interactive interface that visualizes discussion trees and lets users linearize them into coherent conversations; (2) integrated large language model (LLM) assistance for automatically editing message content and speaker descriptions, improving output quality while reducing human effort; and (3) an open-source design that promotes transparent, reproducible workflows for multi-party conversation creation, filling a gap in the resources available for this task.

Link: https://arxiv.org/abs/2601.04135
Authors: Leonardo Bottona, Nicolò Penzo, Bruno Lepri, Marco Guerini, Sara Tonelli
Affiliations: University of Trento; Fondazione Bruno Kessler
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 3 figures

Abstract:We present LLMberjack, a platform for creating multi-party conversations starting from existing debates, originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and enables users to construct coherent linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of the messages and speakers’ descriptions. We demonstrate the platform’s utility by showing how tree visualization facilitates the creation of coherent, meaningful conversation threads and how LLM support enhances output quality while reducing human effort. The tool is open-source and designed to promote transparent and reproducible workflows to create multi-party conversations, addressing a lack of resources of this type.

[NLP-3] ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models

[Quick Read]: This paper addresses the problem that large language models (LLMs), when externally retrieved information conflicts with their internal parametric knowledge, tend to fall back on memorized (possibly outdated) facts and produce unfaithful outputs. The key to the solution is ContextFocus, a lightweight activation steering method that requires no finetuning and adds minimal inference-time overhead: by adjusting the model's internal activations, it strengthens adherence to external context and significantly improves contextual faithfulness while preserving fluency and efficiency.

Link: https://arxiv.org/abs/2601.04131
Authors: Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee
Affiliations: Adobe Research; Indian Institute of Science; Inception Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model’s internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual-faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving contextual-faithfulness of LLM outputs.
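
The paper's exact steering procedure is not spelled out in the abstract, but generic activation steering — adding a fixed direction to a layer's hidden states at inference time — can be sketched in a few lines of PyTorch. A minimal sketch, assuming the steering direction has already been derived (e.g., as a mean activation difference between context-faithful and memory-based answers); this is an illustration of the general technique, not ContextFocus itself:

```python
import torch

def add_steering_hook(layer, direction, alpha=4.0):
    """Generic activation steering: add a scaled unit direction to a
    layer's hidden states at inference time (no finetuning needed)."""
    v = direction / direction.norm()           # unit steering vector
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Typical usage (layer index and alpha are illustrative):
# handle = add_steering_hook(model.model.layers[20], direction)
# ... run generation ...
# handle.remove()
```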

[NLP-4] InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

[Quick Read]: This paper addresses the bottleneck in GUI agent training caused by the scarcity of suitable environments, in particular how to construct large-scale, fully functional, and diverse web environments to support generative AI agents on graphical-interface interaction tasks. The key innovations of the proposed InfiniteWeb system are: a unified specification for cross-page consistency, task-centric test-driven development to ensure functionality and usability, and the combination of website seeds with reference design images to increase diversity. The system also automatically generates verifiable task evaluators that provide dense reward signals for reinforcement learning, yielding significant GUI agent performance gains on realistic benchmarks such as OSWorld and Online-Mind2Web.

Link: https://arxiv.org/abs/2601.04126
Authors: Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu
Affiliations: Peking University; Nanjing University; Microsoft Research Asia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Work In Progress

Abstract:GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. Our system also generates verifiable task evaluators enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of proposed system.

[NLP-5] Layer-wise Positional Bias in Short-Context Language Modeling

[Quick Read]: This paper studies positional bias in short-context language modeling: models prefer information at particular input positions (e.g., the beginning or end) regardless of semantic relevance, and how this bias evolves across layers and positions has remained unclear. The key to the solution is an attribution-based analysis framework that uses layer conductance with a sliding-window approach to quantify how each layer distributes importance over input positions, yielding layer-wise positional importance profiles. The analysis reveals stable, architecture-specific biases: a prominent recency bias that increases with depth and a subtle primacy bias that diminishes with depth. It also finds that early layers preferentially weight content words over function words, a distinction that later layers lose.

Link: https://arxiv.org/abs/2601.04098
Authors: Maryam Rahimi, Mahdi Nouri, Yadollah Yaghoobzadeh
Affiliations: Tehran Institute for Advanced Studies; Khatam University; School of Electrical and Computer Engineering, University of Tehran
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Language models often show a preference for using information from specific positions in the input regardless of semantic relevance. While positional bias has been studied in various contexts, from attention sinks to task performance degradation in long-context settings, prior work has not established how these biases evolve across individual layers and input positions, or how they vary independent of task complexity. We introduce an attribution-based framework to analyze positional effects in short-context language modeling. Using layer conductance with a sliding-window approach, we quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles. We find that these profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Characterizing these profiles, we find prominent recency bias that increases with depth and subtle primacy bias that diminishes through model depth. Beyond positional structure, we also show that early layers preferentially weight content words over function words across all positions, while later layers lose this word-type differentiation.
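
As a rough illustration of the sliding-window aggregation step, suppose per-layer, per-position attribution scores have already been computed (e.g., with captum's LayerConductance, summed over hidden dimensions). A small numpy sketch of turning such a matrix into layer-wise positional profiles; the normalization choice is an assumption:

```python
import numpy as np

def positional_profiles(attr, window=5, stride=1):
    """attr: [n_layers, seq_len] attribution scores. Returns a
    [n_layers, n_windows] matrix of sliding-window importance,
    normalized per layer so profiles are comparable across layers."""
    n_layers, seq_len = attr.shape
    starts = range(0, seq_len - window + 1, stride)
    prof = np.stack([attr[:, s:s + window].sum(axis=1) for s in starts], axis=1)
    return prof / (np.abs(prof).sum(axis=1, keepdims=True) + 1e-12)
```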

[NLP-6] SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks

[Quick Read]: This paper addresses the safety risks that arise when large language models (LLMs) are augmented with web search for open-ended, knowledge-intensive tasks, where the models' safeguards can fail once harmful content is retrieved. The key to the solution is identifying web search as a critical attack surface and proposing SearchAttack, a red-teaming method that outsources the harmful semantics to the search engine — retaining only the skeleton of the query and fragmented clues — and then steers the LLM via structural rubrics to reconstruct the retrieved content toward the malicious goal, enabling systematic vulnerability assessment of search-augmented LLMs.

Link: https://arxiv.org/abs/2601.04093
Authors: Yu Yan, Sheng Sun, Mingfeng Li, Zheming Yang, Chiwei Zhu, Fei Ma, Benfeng Xu, Min Liu
Affiliations: Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; People’s Public Security University of China; University of Science and Technology of China; Guangdong Laboratory of Artificial Intelligence and Digital Economy
Categories: Computation and Language (cs.CL)
Comments: We find that the key to jailbreak the LLM is objectifying its safety responsibility, thus we delegate the open-web to inject harmful semantics and get the huge gain from unmoderated web resources

Abstract: Recently, people have suffered and become increasingly aware of the unreliability gap in LLMs for open and knowledge-intensive tasks, and thus turn to search-augmented LLMs to mitigate this issue. However, when the search engine is triggered for harmful tasks, the outcome is no longer under the LLM’s control. Once the returned content directly contains targeted, ready-to-use harmful takeaways, the LLM’s safeguards cannot withdraw that exposure. Motivated by this dilemma, we identify web search as a critical attack surface and propose SearchAttack for red-teaming. SearchAttack outsources the harmful semantics to web search, retaining only the query’s skeleton and fragmented clues, and further steers LLMs to reconstruct the retrieved content via structural rubrics to achieve malicious goals. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems.

[NLP-7] KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures

[Quick Read]: This paper aims to mitigate prompt-induced hallucinations in large language models (LLMs), i.e., outputs that contradict facts or lack grounding. The key to the solution is an enhanced chain-style knowledge distillation framework that embeds a programmable module — in the form of executable code — into the reasoning prompt to guide exploration of a knowledge graph (KG), so that external structured knowledge explicitly constrains intermediate reasoning steps and improves the reliability and interpretability of predictions. Experiments show that this code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations, with HIT@1, HIT@3, and HIT@5 improving by 15.64%, 13.38%, and 13.28%, respectively.

Link: https://arxiv.org/abs/2601.04086
Authors: Jinbo Hao, Kai Yang, Qingzhen Su, Yifan Li, Chao Jiang
Affiliations: Jiangsu Ocean University; Soochow University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:To mitigate hallucinations in large language models (LLMs), we propose a framework that focuses on errors induced by prompts. Our method extends a chain-style knowledge distillation approach by incorporating a programmable module that guides knowledge graph exploration. This module is embedded as executable code within the reasoning prompt, allowing the model to leverage external structured knowledge during inference. Based on this design, we develop an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps, resulting in more reliable predictions. We evaluate the proposed approach on multiple public benchmarks using GPT-4 and LLaMA-3.3. Experimental results show that code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores exceeding 95% across several evaluation settings. These findings indicate that the proposed method effectively constrains erroneous reasoning while improving both accuracy and interpretability.

[NLP-8] Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts

[Quick Read]: This paper addresses the fragility of reasoning chains in Large Multimodal Models (LMMs) during video reasoning caused by "textual inertia": once an erroneous piece of text is generated, models blindly adhere to it while ignoring conflicting visual evidence, propagating the error. The key to the solution is Active Visual-Context Refinement, a training-free inference paradigm with two core mechanisms: (1) active visual re-grounding for fine-grained verification, and (2) an adaptive context refinement strategy that summarizes and denoises the reasoning history, effectively stifling hallucination propagation and improving reasoning robustness.

Link: https://arxiv.org/abs/2601.04073
Authors: Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, Bing Qin
Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 5 figures

Abstract:Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.

[NLP-9] Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion

[Quick Read]: This paper addresses the lack of a unified generative framework spanning text (discrete data) and images (continuous data), and in particular the alignment difficulties and training instability that Masked Language Models (MLMs) face in multimodal settings. The key to the proposed CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion) probabilistic framework is to formulate multimodal generation as a hierarchical dual process: the semantic manifold is modeled via a continuous latent diffusion process, while token generation is treated as a discrete absorbing diffusion process conditioned on the evolving semantic priors and regulated by a Variable-Rate Noise Schedule for stability. A further innovation, the Stochastic Mixed-Modal Transport strategy, aligns modalities without heavy contrastive dual-encoders, yielding a stable, scalable paradigm for unified text-image generation.

Link: https://arxiv.org/abs/2601.04056
Authors: Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang
Affiliations: Sun Yat-sen University; Guangdong Key Lab of Big Data Analysis & Processing; X-Era AI Lab
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures

Abstract: The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.

[NLP-10] Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients

[Quick Read]: This paper addresses the fact that the reasoning performance of small open-source instruction-tuned language models is limited by prompt quality, and that existing automatic prompt optimization methods treat prompts as monolithic blocks of text, making it hard to localize errors, preserve critical instructions, or control prompt growth. The key to the solution, Modular Prompt Optimization (MPO), is to structure prompts into fixed semantic sections (system role, context, task description, constraints, and output format) and optimize each section independently using section-local textual gradients produced by a critic language model, with de-duplication to reduce redundancy and interference between components — yielding efficient, interpretable, and robust gains without modifying model parameters or the prompt schema.

Link: https://arxiv.org/abs/2601.04055
Authors: Prith Sharma, Austin Z. Henley
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Prompt quality plays a central role in controlling the behavior, reliability, and reasoning performance of large language models (LLMs), particularly for smaller open-source instruction-tuned models that depend heavily on explicit structure. While recent work has explored automatic prompt optimization using textual gradients and self-refinement, most existing methods treat prompts as monolithic blocks of text, making it difficult to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth. We introduce Modular Prompt Optimization (MPO), a schema-based prompt optimization framework that treats prompts as structured objects composed of fixed semantic sections, including system role, context, task description, constraints, and output format. MPO applies section-local textual gradients, generated by a critic language model, to refine each section independently while keeping the overall prompt schema fixed. Section updates are consolidated through de-duplication to reduce redundancy and interference between components, yielding an interpretable and robust optimization process. We evaluate MPO on two reasoning benchmarks, ARC-Challenge and MMLU, using LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models. Across both benchmarks and models, MPO consistently outperforms an untuned structured prompt and the TextGrad baseline, achieving substantial accuracy gains without modifying model parameters or altering prompt structure. These results demonstrate that maintaining a fixed prompt schema while applying localized, section-wise optimization is an effective and practical approach for improving reasoning performance in small open-source LMs.
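
One round of section-local optimization can be sketched as follows; the schema section names mirror those listed in the abstract, while `critique_fn` and `rewrite_fn` are hypothetical wrappers around critic-LM calls, not the authors' implementation:

```python
SECTIONS = ["system_role", "context", "task_description",
            "constraints", "output_format"]

def mpo_step(prompt, critique_fn, rewrite_fn):
    """One round of section-local optimization: critique and rewrite each
    section independently while the overall schema stays fixed."""
    new_prompt = dict(prompt)
    for sec in SECTIONS:
        # section-local "textual gradient" from the critic LM
        feedback = critique_fn(sec, prompt[sec], prompt)
        new_prompt[sec] = rewrite_fn(sec, prompt[sec], feedback)
    # de-duplicate instructions that now appear in more than one section
    seen, deduped = set(), {}
    for sec in SECTIONS:
        lines = [l for l in new_prompt[sec].splitlines()
                 if l.strip() and l.strip() not in seen]
        seen.update(l.strip() for l in lines)
        deduped[sec] = "\n".join(lines)
    return deduped
```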

[NLP-11] Stable Language Guidance for Vision-Language-Action Models

[Quick Read]: This paper addresses the sensitivity of Vision-Language-Action (VLA) models to linguistic perturbations in robotic control — a "modality collapse" phenomenon in which strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. The key to the proposed Residual Semantic Steering (RSS) is to disentangle semantic execution from physical affordance via two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior through dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly subtracts the visual affordance prior, thereby maximizing the mutual information between action and intent while suppressing visual distractors.

Link: https://arxiv.org/abs/2601.04052
Authors: Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang
Affiliations: Sun Yat-sen University; Guangdong Key Lab of Big Data Analysis & Processing; X-Era AI Lab
Categories: Robotics (cs.RO); Computation and Language (cs.CL)
Comments:

Abstract: Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical “modality collapse” phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
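
The abstract does not give the decoding equation, but "subtracting the visual affordance prior" in a dual-stream decoder resembles contrastive guidance over logits. A speculative sketch under that reading, with a hypothetical `model` interface and an illustrative guidance weight:

```python
import torch

@torch.no_grad()
def residual_steered_logits(model, vision, instruction, null_instruction, beta=0.5):
    """Dual-stream decoding sketch: amplify the part of the prediction
    that is causally attributable to the instruction by subtracting a
    vision-only (affordance) prior. Interfaces are hypothetical."""
    full = model(vision=vision, text=instruction).logits[:, -1]        # p(a | v, l)
    prior = model(vision=vision, text=null_instruction).logits[:, -1]  # p(a | v)
    return full + beta * (full - prior)  # steer away from the affordance prior
```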

[NLP-12] When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

[Quick Read]: This paper addresses the behavioral risks posed by unsafe content generated by Multimodal Large Language Models (MLLMs) in daily life. To systematically evaluate MLLM safety in realistic settings, the authors introduce SaLAD, a benchmark of 2,013 real-world image-text samples across 10 common categories, designed to balance unsafe scenarios with oversensitivity cases and to emphasize authentic visual inputs and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. The key to the solution is a safety-warning-based evaluation framework that encourages models to provide clear, informative safety warnings rather than generic refusals, better surfacing how well models identify and respond to dangerous behaviors in daily life.

Link: https://arxiv.org/abs/2601.04043
Authors: Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, Kaiyu Huang
Affiliations: Beijing Jiaotong University; Tsinghua University; Peking University; University of Montreal
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As Multimodal Large Language Models (MLLMs) become an indispensable assistant in human life, the unsafe content generated by MLLMs poses a danger to human behavior, perpetually overhanging human society like a sword of Damocles. To investigate and evaluate the safety impact of MLLMs responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark which contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety-warning-based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods limit effectiveness of the models in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at this https URL.

[NLP-13] Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation

[Quick Read]: This thesis addresses the challenges of cross-lingual knowledge transfer in multilingual machine translation, particularly the limited generalization for low-resource languages caused by scarce parallel data. Its key contributions are analyses of how similarity between languages influences transfer, how retrieval and auxiliary supervision can strengthen low-resource translation, and how fine-tuning large language models on parallel data can introduce unintended trade-offs. It further examines the role of language diversity during training, showing that increasing translation coverage improves generalization and reduces off-target behavior — offering insights toward more inclusive and resilient multilingual NLP systems.

Link: https://arxiv.org/abs/2601.04036
Authors: David Stap
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: PhD dissertation defended on November 26th, 2025

Abstract:Multilingual machine translation systems aim to make knowledge accessible across languages, yet learning effective cross-lingual representations remains challenging. These challenges are especially pronounced for low-resource languages, where limited parallel data constrains generalization and transfer. Understanding how multilingual models share knowledge across languages requires examining the interaction between representations, data availability, and training strategies. In this thesis, we study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings, using machine translation as a central testbed. We analyze how similarity between languages influences transfer, how retrieval and auxiliary supervision can strengthen low-resource translation, and how fine-tuning on parallel data can introduce unintended trade-offs in large language models. We further examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior. Together, this work highlights how modeling choices and data composition shape multilingual learning and offers insights toward more inclusive and resilient multilingual NLP systems.

[NLP-14] SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

[Quick Read]: This paper addresses the unexplored ability of Large Audio-Language Models (LALMs) to judge speaker consistency across multi-turn conversations, even though LALMs are already widely used to evaluate speech generation quality. The key contribution is SpeakerSleuth, a benchmark of 1,818 human-verified evaluation instances spanning synthetic and real speech with controlled acoustic difficulty. Evaluating nine widely used LALMs reveals that models struggle to reliably detect acoustic inconsistencies — especially when textual coherence distracts them from even obvious acoustic anomalies such as a speaker's gender switching — while performing substantially better at picking the audio that best matches a speaker among acoustic variants, indicating inherent acoustic discrimination ability. These results expose a significant modality bias — prioritizing text over acoustics — that must be addressed to build reliable audio-language judges.

Link: https://arxiv.org/abs/2601.04029
Authors: Jonggeun Lee, Junseong Pyo, Gyuhyeon Seo, Yohan Jo
Affiliations: Seoul National University; Hanyang University
Categories: Computation and Language (cs.CL)
Comments: 28 pages

Abstract:Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker’s turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors’ turns are provided together, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.

[NLP-15] Simulated Students in Tutoring Dialogues: Substance or Illusion?

[Quick Read]: This paper addresses the lack of quality assurance for "simulated students" in LLM-powered education: simulated students are widely used to train and evaluate tutoring systems, yet they are often built via simple prompting, making results unreliable and hard to reproduce. The key to the solution is to formally define the student simulation task, propose a set of evaluation metrics spanning linguistic, behavioral, and cognitive aspects, and systematically benchmark a range of student simulation methods (prompting, supervised fine-tuning, and preference optimization) on a real-world math tutoring dialogue dataset. Both automated and human evaluations show that prompting strategies perform poorly, while supervised fine-tuning and preference optimization yield much better but still limited performance, providing clear baselines and directions for future work on this challenging task.

Link: https://arxiv.org/abs/2601.04025
Authors: Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan
Affiliations: University of Massachusetts Amherst; Eedi
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Advances in large language models (LLMs) enable many new innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale up. Therefore, many recent works on LLM-powered tutoring solutions have used simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation results show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield much better but still limited performance, motivating future work on this challenging task.

[NLP-16] VotIE: Information Extraction from Meeting Minutes

[Quick Read]: This paper addresses the automated extraction of voting events from municipal meeting minutes, which record voting outcomes in free-form narrative text that is highly heterogeneous across municipalities and thus challenging for traditional information extraction. The key contribution is VotIE (Voting Information Extraction), a new extraction task with the first benchmark built from Portuguese municipal minutes, comparing lightweight fine-tuned encoders against few-shot LLMs under cross-region generalization. The findings: fine-tuned encoders such as XLM-R-CRF achieve the best in-domain performance (93.2% macro F1), while generative LLMs are more robust under cross-municipality transfer but computationally costly — making lightweight fine-tuned models the more practical option for large-scale, real-world deployment.

Link: https://arxiv.org/abs/2601.03997
Authors: José Pedro Evans, Luís Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Affiliations: INESC TEC; University of Porto; University of Beira Interior
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Municipal meeting minutes record key decisions in local democratic processes. Unlike parliamentary proceedings, which typically adhere to standardized formats, they encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction. In this paper, we introduce VotIE (Voting Information Extraction), a new information extraction task aimed at identifying structured voting events in narrative deliberative records, and establish the first benchmark for this task using Portuguese municipal minutes, building on the recently introduced CitiLink corpus. Our experiments yield two key findings. First, under standard in-domain evaluation, fine-tuned encoders, specifically XLM-R-CRF, achieve the strongest performance, reaching 93.2% macro F1, outperforming generative approaches. Second, in a cross-municipality setting that evaluates transfer to unseen administrative contexts, these models suffer substantial performance degradation, whereas few-shot LLMs demonstrate greater robustness, with significantly smaller declines in performance. Despite this generalization advantage, the high computational cost of generative models currently constrains their practicality. As a result, lightweight fine-tuned encoders remain a more practical option for large-scale, real-world deployment. To support reproducible research in administrative NLP, we publicly release our benchmark, trained models, and evaluation framework.

[NLP-17] Benchmark^2: Systematic Evaluation of LLM Benchmarks

[Quick Read]: This paper addresses the uneven quality of benchmarks used to evaluate large language models (LLMs): how can the reliability and validity of benchmarks themselves be assessed systematically? The key to the proposed Benchmark^2 framework is three complementary quantitative metrics: Cross-Benchmark Ranking Consistency, which measures whether a benchmark produces model rankings aligned with peer benchmarks; a Discriminability Score, which quantifies a benchmark's ability to differentiate models of different strength; and Capability Alignment Deviation, which identifies problematic instances where stronger models in a family fail while weaker ones succeed. These metrics enable principled selection and construction of high-quality benchmarks, achieving comparable or better evaluation performance with substantially smaller test sets.

Link: https://arxiv.org/abs/2601.03986
Authors: Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng
Affiliations: Fudan University; Washington University in St. Louis; Xiaohongshu Inc.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark’s ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
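
The first two metrics admit straightforward instantiations; one plausible reading (not necessarily the authors' exact formulas) uses Kendall's tau against peer benchmarks for ranking consistency and normalized score spread for discriminability:

```python
import numpy as np
from scipy.stats import kendalltau

def ranking_consistency(scores, b):
    """scores: {benchmark_name: [score per model, same model order]}.
    Consistency of benchmark b = mean Kendall tau between b's model
    ranking and each peer benchmark's ranking."""
    taus = []
    for name, s in scores.items():
        if name != b:
            tau, _ = kendalltau(scores[b], s)
            taus.append(tau)
    return float(np.mean(taus))

def discriminability(scores, b):
    """Spread of model scores on b, normalized by the score range."""
    s = np.asarray(scores[b], dtype=float)
    return float(s.std() / (s.max() - s.min() + 1e-9))
```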

[NLP-18] RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection

[Quick Read]: This paper targets the spread of LLM-generated fake news and proposes RADAR, a retrieval-augmented detection framework with adversarial refinement. The key is a generator-detector co-evolution mechanism: a generator rewrites real articles with factual perturbations to simulate increasingly sophisticated fakes, while a lightweight detector verifies claims using dense passage retrieval. The central innovation is Verbal Adversarial Feedback (VAF), which replaces scalar reward signals with structured natural-language critiques that guide the generator toward more sophisticated evasion strategies, in turn forcing the detector to adapt and improve. RADAR reaches 86.98% ROC-AUC on a fake news detection benchmark, significantly outperforming general-purpose LLMs with retrieval, and ablations confirm that detector-side retrieval and VAF contribute the most critical gains.

Link: https://arxiv.org/abs/2601.03981
Authors: Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang, Shau-Yung Hsu, Yun-Nung Chen
Affiliations: National Taiwan University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:To efficiently combat the spread of LLM-generated misinformation, we present RADAR, a retrieval-augmented detector with adversarial refinement for robust fake news detection. Our approach employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. To enable effective co-evolution, we introduce verbal adversarial feedback (VAF). Rather than relying on scalar rewards, VAF issues structured natural-language critiques; these guide the generator toward more sophisticated evasion attempts, compelling the detector to adapt and improve. On a fake news detection benchmark, RADAR achieves 86.98% ROC-AUC, significantly outperforming general-purpose LLMs with retrieval. Ablation studies confirm that detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide critical signals for robust training.

[NLP-19] SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems

[Quick Read]: This paper systematically answers one question: what are the privacy risks in Retrieval-Augmented Generation (RAG), and how can they be measured and mitigated? The key to the solution is a systematic literature review that, for the first time, structures RAG privacy risks, mitigation techniques, and evaluation strategies, producing two core artifacts: a Taxonomy of RAG Privacy Risks and a RAG Privacy Process Diagram. These give researchers and practitioners a clear risk-identification framework and actionable mitigation paths, while also revealing the maturity of current mitigations and the key considerations in applying them.

Link: https://arxiv.org/abs/2601.03979
Authors: Andreea-Elena Bodea, Stephen Meisenbacher, Alexandra Klymenko, Florian Matthes
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 17 pages, 3 figures, 5 tables. This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2026). The final version will be available on IEEE Xplore

Abstract:The continued promise of Large Language Models (LLMs), particularly in their natural language understanding and generation capabilities, has driven a rapidly increasing interest in identifying and developing LLM use cases. In an effort to complement the ingrained “knowledge” of LLMs, Retrieval-Augmented Generation (RAG) techniques have become widely popular. At its core, RAG involves the coupling of LLMs with domain-specific knowledge bases, whereby the generation of a response to a user question is augmented with contextual and up-to-date information. The proliferation of RAG has sparked concerns about data privacy, particularly with the inherent risks that arise when leveraging databases with potentially sensitive information. Numerous recent works have explored various aspects of privacy risks in RAG systems, from adversarial attacks to proposed mitigations. With the goal of surveying and unifying these works, we ask one simple question: What are the privacy risks in RAG, and how can they be measured and mitigated? To answer this question, we conduct a systematic literature review of RAG works addressing privacy, and we systematize our findings into a comprehensive set of privacy risks, mitigation techniques, and evaluation strategies. We supplement these findings with two primary artifacts: a Taxonomy of RAG Privacy Risks and a RAG Privacy Process Diagram. Our work contributes to the study of privacy in RAG not only by conducting the first systematization of risks and mitigations, but also by uncovering important considerations when mitigating privacy risks in RAG systems and assessing the current maturity of proposed mitigations.

[NLP-20] Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control

[Quick Read]: This paper addresses the reproducibility and fair-comparison barriers in long-form song generation research caused by the lack of publicly available training data. The key to the solution is a fully open-source system comprising a high-quality dataset of 116k fully licensed synthetic songs, standardized training and evaluation pipelines, and Muse, a lightweight, easy-to-deploy generation model. Muse extends a Qwen-based language model with discrete audio tokens via MuCodec and is trained with single-stage supervised fine-tuning — no task-specific losses or extra architectural components — yet achieves competitive phoneme error rate, text-music style similarity, and audio aesthetic quality at modest data and model scale, while supporting controllable segment-level generation across musical structures.

Link: https://arxiv.org/abs/2601.03973
Authors: Changhao Jiang, Jiahao Chen, Zhenghao Xiang, Zhixiong Yang, Hanchen Wang, Jiabao Zhuang, Xinmeng Che, Jiajun Sun, Hui Li, Yifei Cao, Shihan Dou, Ming Zhang, Junjie Ye, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments:

Abstract:Recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open-source system for long-form song generation with fine-grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy-to-deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text–music style similarity, and audio aesthetic quality, while enabling controllable segment-level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long-form song generation research. The project repository is available at this https URL.

[NLP-21] Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

[Quick Read]: This paper addresses the excessive verbosity that reinforcement learning with verifiable rewards induces in large reasoning models, which overthink even simple inputs — a phenomenon the authors term "length shift" — harming inference efficiency. The key to the solution is Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens by targeting only the extreme tail of response lengths within fully correct rollout groups, removing redundancy while preserving long-horizon reasoning for complex problems. To ensure stable convergence, auxiliary KL regularization and predictive dynamic sampling are incorporated, significantly improving the efficiency-performance trade-off.

Link: https://arxiv.org/abs/2601.03969
Authors: Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong
Affiliations: University of Science and Technology of China; Xiaohongshu Inc.; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
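
The core selection rule — truncate only the extreme length tail within fully correct rollout groups — can be sketched as a masking function. The quantile threshold below is illustrative; in the paper the boundary is dynamic:

```python
import numpy as np

def dot_mask(lengths, rewards, q=0.9):
    """Dynamic Outlier Truncation sketch: within a rollout group that is
    fully correct, mask out responses whose length falls in the extreme
    upper tail; otherwise keep every rollout."""
    lengths, rewards = np.asarray(lengths), np.asarray(rewards)
    keep = np.ones(len(lengths), dtype=bool)
    if rewards.min() == rewards.max() == 1.0:        # fully correct group
        keep &= lengths <= np.quantile(lengths, q)   # drop length outliers
    return keep

# dot_mask([210, 230, 190, 1450], [1, 1, 1, 1]) -> [True, True, True, False]
```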

[NLP-22] Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs

[Quick Read]: This paper addresses the performance and generalization limits of Aspect-Based Sentiment Analysis (ABSA) models in real commercial settings. The keys to the solution are: a high-quality training corpus 20 times larger than SemEval14 that combines real and carefully generated synthetic data; an expanded label set of five sentiment classes (adding mixed and unknown) with joint prediction of overall text sentiment and fine-grained aspect sentiment; and a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and cross-lingual generalization. The resulting 395M-parameter encoder and 8B-parameter decoder outperform GPT-4o and Claude 3.5 Sonnet by up to 10 percentage points on SemEval14, while a single multilingual model maintains 87-91% accuracy across six languages without degrading English performance.

Link: https://arxiv.org/abs/2601.03940
Authors: Paweł Liskowski, Krzysztof Jankowski
Affiliations: Snowflake Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Arctic-ABSA, a collection of powerful models for real-life aspect-based sentiment analysis (ABSA). Our models are tailored to commercial needs, trained on a large corpus of public data alongside carefully generated synthetic data, resulting in a dataset 20 times larger than SemEval14. We extend typical ABSA models by expanding the number of sentiment classes from the standard three (positive, negative, neutral) to five, adding mixed and unknown classes, while also jointly predicting overall text sentiment and supporting multiple languages. We experiment with reasoning injection by fine-tuning on Chain-of-Thought (CoT) examples and introduce a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and generalization. Our 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet, while setting new state-of-the-art results on the SemEval14 benchmark. A single multilingual model maintains 87-91% accuracy across six languages without degrading English performance. We release ABSA-mix, a large-scale benchmark aggregating 17 public ABSA datasets across 92 domains.

[NLP-23] FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning

[Quick Read]: This paper addresses knowledge loss from catastrophic forgetting during continual learning (CL) in large language models (LLMs). Existing memory-replay methods mostly rely on fixed, step-based heuristics that misalign with actual learning progress, since identical training steps can produce very different degrees of parameter change. The key to the proposed FOREVER framework is a model-centric notion of time based on the magnitude of optimizer updates, which aligns replay scheduling with the model's internal evolution. On top of this, a forgetting-curve-inspired replay scheduler decides when to replay and an intensity-aware regularization mechanism adaptively controls how to replay, consistently mitigating catastrophic forgetting.

Link: https://arxiv.org/abs/2601.03938
Authors: Yujie Feng, Hao Wang, Jian Li, Xu Chu, Zhaolu Kang, Yiran Liu, Yasha Wang, Philip S. Yu, Xiao-Ming Wu
Affiliations: The Hong Kong Polytechnic University; ShineMo Ltd., China; AI Technology Center of OVB, Tencent, China; Peking University; University College London; University of Illinois Chicago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Continual learning (CL) for large language models (LLMs) aims to enable sequential knowledge acquisition without catastrophic forgetting. Memory replay methods are widely used for their practicality and effectiveness, but most rely on fixed, step-based heuristics that often misalign with the model’s actual learning progress, since identical training steps can result in varying degrees of parameter change. Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. FOREVER defines model time using the magnitude of optimizer updates, allowing forgetting curve-inspired replay intervals to align with the model’s internal evolution rather than raw training steps. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay and an intensity-aware regularization mechanism to adaptively control how to replay. Extensive experiments on three CL benchmarks and models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.
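
A toy version of the scheduler illustrates the "model time" idea: time advances by optimizer-update magnitude rather than step count, and replay fires at widening, forgetting-curve-style intervals. The constants are illustrative, not taken from the paper:

```python
class ForgettingCurveScheduler:
    """Replay scheduler sketch: 'model time' accumulates the magnitude of
    each optimizer update, and replay triggers at exponentially growing
    intervals, mimicking Ebbinghaus-style spaced repetition."""

    def __init__(self, base_interval=1.0, growth=2.0):
        self.t = 0.0                     # accumulated model time
        self.next_replay = base_interval
        self.growth = growth

    def step(self, update_norm):
        """Call once per optimizer step with ||delta_theta||; returns
        True when a replay pass should be triggered."""
        self.t += update_norm
        if self.t >= self.next_replay:
            self.next_replay *= self.growth  # widen the next interval
            return True
        return False
```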

[NLP-24] FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

[Quick Read]: This paper addresses the heavy computational cost and diluted attention in Vision-Language Model (VLM) UI grounding, where high-resolution screenshots are tokenized into thousands of visual tokens. The core challenge is pruning redundant visual information while retaining precise grounding. The key to the proposed FocusUI framework lies in two mechanisms: (1) patch-level supervision that fuses an instruction-conditioned score with a rule-based UI-graph score to select distinct, instruction-relevant patches; and (2) a PosPad strategy that, when pruning visual tokens, compresses each contiguous run of dropped tokens into a single special marker placed at the run's last index, preserving positional continuity and avoiding the accuracy loss caused by broken positional information. FocusUI outperforms GUI-specific baselines across benchmarks and, with only 30% of visual tokens retained, keeps most of its accuracy while delivering up to 1.44x faster inference and 17% lower peak GPU memory.

Link: https://arxiv.org/abs/2601.03928
Authors: Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
Affiliations: National University of Singapore; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 14 pages, 13 figures

Abstract:Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task’s characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence’s last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
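
The PosPad idea as described — collapse each contiguous run of dropped tokens into one marker carrying the run's last position — can be sketched directly; the marker token name is hypothetical:

```python
def pospad(tokens, keep_mask, pad_token="<POSPAD>"):
    """PosPad sketch: each maximal run of dropped tokens collapses into
    one marker that inherits the position id of the run's LAST token,
    preserving positional continuity for the surviving tokens."""
    out_tokens, out_pos = [], []
    run_end = None                      # last position of the open dropped run
    for pos, (tok, keep) in enumerate(zip(tokens, keep_mask)):
        if keep:
            if run_end is not None:     # close any open run with one marker
                out_tokens.append(pad_token)
                out_pos.append(run_end)
                run_end = None
            out_tokens.append(tok)
            out_pos.append(pos)
        else:
            run_end = pos               # extend the run; remember its last index
    if run_end is not None:             # trailing run at sequence end
        out_tokens.append(pad_token)
        out_pos.append(run_end)
    return out_tokens, out_pos

# pospad(list("abcdef"), [1, 0, 0, 1, 0, 1])
# -> (['a', '<POSPAD>', 'd', '<POSPAD>', 'f'], [0, 2, 3, 4, 5])
```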

[NLP-25] Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models

[Quick Read]: This paper addresses safety vulnerabilities of Large Vision-Language Models (LVLMs) in real-world document question answering under user-defined non-disclosure policies. Existing safety research focuses mostly on implicit social norms or text-only settings, overlooking how complex reasoning over multimodal documents can leak sensitive information. The paper introduces Doc-PP, a benchmark built from real-world reports for evaluating policy adherence when reasoning across heterogeneous visual and textual elements, and identifies a systemic Reasoning-Induced Safety Gap: models leak sensitive content when answers must be inferred through complex synthesis or cross-modal aggregation, effectively circumventing existing safeguards. The key to the solution is DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification through decomposition, verification, and aggregation stages, preventing sensitive information from being inadvertently exposed during reasoning and significantly outperforming standard prompting defenses as a robust baseline for policy-compliant document understanding.

Link: https://arxiv.org/abs/2601.03926
Authors: Haeun Jang, Hwan Chang, Hwanhee Lee
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding

[NLP-26] When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering

[Quick Read]: This paper addresses the conflation of reasoning errors and symbol-binding failures in multiple-choice question answering (MCQA): a model must both solve the problem and emit the symbol (e.g., A, B, C) representing its answer, making evaluations ambiguous about which of the two failed. The key to the solution is combining representational analyses (PCA and linear probes) with causal interventions to reveal the internal mechanism of MCQA: residual states at option boundaries (newlines) carry strongly linearly decodable per-option correctness signals, and winner-identity probing reveals a two-stage progression — the winning content position becomes decodable immediately after the final option is processed, while the output symbol is represented only near the answer emission position. Evidence under symbol and content permutations supports the conclusion that models first select a winner in content space and then bind it to the corresponding output symbol, separating content reasoning from symbol binding.

Link: https://arxiv.org/abs/2601.03914
Authors: Hugh Mee Wong, Rick Nouwen, Albert Gatt
Affiliations: Utrecht University
Categories: Computation and Language (cs.CL)
Comments: Under review

Abstract:Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that represents the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning content position becomes decodable immediately after the final option is processed, while the output symbol is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.
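
The winner-identity probing described here is standard linear probing over residual states. A minimal sketch with scikit-learn, where `resid_states` would be extracted at option-boundary positions of a given layer (the extraction step itself is not shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_winner(resid_states, winner_ids, cv=5):
    """Linear probe: can the correct option's identity be decoded from
    residual states at a given layer/position? resid_states is
    [n_examples, hidden_dim]; winner_ids holds the gold option index."""
    clf = LogisticRegression(max_iter=2000)
    accs = cross_val_score(clf, resid_states, winner_ids, cv=cv)
    return accs.mean()   # compare against a 1/n_options chance baseline
```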

[NLP-27] Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval

[Quick Read]: This paper addresses two core problems in existing retrieval-augmented generation (RAG) methods: retrieval is triggered indiscriminately, introducing noise even when external knowledge is unnecessary, and single-path evidence construction handles sparse or ambiguous queries poorly. The key to the solution is Decide Then Retrieve (DTR), a training-free framework that uses generation uncertainty to decide dynamically whether retrieval is needed, and introduces a dual-path retrieval mechanism with adaptive information selection to integrate external knowledge more precisely — improving QA performance while reducing unnecessary retrievals.

Link: https://arxiv.org/abs/2601.03908
Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Affiliations: Baidu Inc; The University of Hong Kong; Peking University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single-path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training-free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual-path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval-enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at this https URL.
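
The abstract says retrieval is triggered by generation uncertainty without specifying the measure; a common choice, shown here purely as an assumption, is the mean token-level entropy of a draft generation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def should_retrieve(logits, threshold=2.0):
    """Uncertainty-guided triggering sketch: retrieve only when the mean
    token-level entropy of a draft generation exceeds a threshold.
    logits: [seq_len, vocab_size]; the threshold value is illustrative."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)  # per-token entropy
    return entropy.mean().item() > threshold
```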

[NLP-28] Current Agents Fail to Leverage World Model as Tool for Foresight

[Quick Read]: This paper investigates whether current vision-language-model-based agents can effectively use generative world models as external simulators to enhance their cognition on tasks requiring anticipation of future states. The key lies in identifying three bottlenecks in agent-world-model interaction: when to trigger simulation, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. The study finds that existing agents rarely invoke simulation (fewer than 1% of cases), frequently misuse predicted rollouts (approximately 15%), and exhibit inconsistent or even degraded performance when simulation is available — underscoring the need for calibrated, strategic interaction mechanisms to enable reliable anticipatory cognition.

Link: https://arxiv.org/abs/2601.03905
Authors: Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
Affiliations: UIUC; THU; JHU; Columbia
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 36 Pages, 13 Figures, 17 Tables

Abstract:Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents’ capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.

[NLP-29] Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training

【Quick Read】: This paper targets the performance limitations of the fixed-boundary clipping mechanism in Group Relative Policy Optimization (GRPO) for LLM reinforcement learning, particularly underperformance on mathematical reasoning and a tendency toward premature convergence. The key of the solution is Adaptive-Boundary-Clipping GRPO (ABC-GRPO), which introduces asymmetric, dynamically adjusted clipping boundaries, significantly improving flexibility and generalization while maintaining higher policy entropy throughout training, thereby preserving exploration capacity and mitigating premature convergence.

Link: https://arxiv.org/abs/2601.03895
Authors: Chi Liu, Xin Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures

Abstract:Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC-GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC-GRPO maintains substantially higher entropy throughout training, thereby preserving the model’s exploration capacity and mitigating premature convergence. The implementation code is available online to ease reproducibility this https URL.
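The abstract calls the refinement asymmetric and adaptive but does not state the boundary schedule, so the sketch below shows only a fixed asymmetric instance of the clipped surrogate; `eps_low` and `eps_high` are placeholder values:

```python
import torch

def abc_grpo_surrogate(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.4):
    """Clipped policy-gradient surrogate with asymmetric bounds on the ratio.

    The importance ratio stays bounded in [1 - eps_low, 1 + eps_high]; a wider
    upper bound leaves more room to up-weight good samples than GRPO's symmetric
    clip, which is one plausible way to preserve exploration entropy.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```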

[NLP-30] Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification

【Quick Read】: This paper examines whether small decoder-only language models (SLMs) can serve as efficient alternatives to large language models, whose size and computational cost hinder deployment and security in resource-constrained settings. The key of the solution is to evaluate SLMs on grammar correction and text simplification, testing them out of the box and after fine-tuning on the JFLEG and ASSET datasets. The results show that despite their efficiency advantages, SLMs fall short in meaning preservation and hallucination control and have not yet reached the performance of current LLMs, indicating that further advances in training are needed to close the gap.

Link: https://arxiv.org/abs/2601.03874
Authors: Anthony Lamelas
Affiliations: New York University
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 12 figures

Abstract:Large language models have become extremely popular recently due to their ability to achieve strong performance on a variety of tasks, such as text generation and rewriting, but their size and computation cost make them difficult to access, deploy, and secure in many settings. This paper investigates whether small, decoder-only language models can provide an efficient alternative for the tasks of grammar correction and text simplification. The experiments in this paper focus on testing small language models out of the box, fine-tuned, and run sequentially on the JFLEG and ASSET datasets using established metrics. The results show that while SLMs may learn certain behaviors well, their performance remains below strong baselines and current LLMs. The results also show that SLMs struggle with retaining meaning and hallucinations. These findings suggest that despite their efficiency advantages, current SLMs are not yet competitive enough with modern LLMs for rewriting, and further advances in training are required for SLMs to close the performance gap between them and today’s LLMs.

[NLP-31] Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

【Quick Read】: This paper addresses the problem that, as the diversity of LLMs and tools grows, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge; existing approaches rely on a single model or fixed tool-calling logic and fail to exploit performance variation across heterogeneous model-tool pairs. The key of the solution is ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool use in cross-domain complex reasoning: a training-free cluster-based routing mechanism that exploits domain-specific empirical priors for model-tool alignment, and an RL-based multi-step routing policy that explores autonomous trajectories to improve out-of-distribution generalization. The design outperforms closed-source models such as GPT-4o across 15 benchmarks and yields significant gains in visual reasoning by orchestrating specialized multimodal tools.

Link: https://arxiv.org/abs/2601.03872
Authors: Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
Affiliations: Tsinghua University; Zhejiang University; East China Normal University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.

[NLP-32] What Matters For Safety Alignment?

【Quick Read】: This paper studies the differences in safety alignment between large language models (LLMs) and language reasoning models (LRMs) and the factors that shape them, to advance safer and more reliable generative AI systems. The key of the solution is a large-scale empirical study of six critical intrinsic model characteristics and three external attack techniques, yielding four core findings: integrated reasoning and self-reflection substantially improve safety; post-training and knowledge distillation can systematically degrade safety alignment and should be treated as an explicit constraint or optimization objective; chain-of-thought (CoT) response-prefix attacks raise attack success rates by 3.34x on average, exposing the risks of text-completion interfaces; and roleplay, prompt injection, and gradient-based adversarial prompt search are the predominant methods for eliciting misaligned behavior.

Link: https://arxiv.org/abs/2601.03868
Authors: Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan
Affiliations: Huawei Technologies Co., Ltd.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.

[NLP-33] PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media

【Quick Read】: This paper tackles the challenge of identifying hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) in multilingual settings, which matter for political polarization and real-world violence; existing resources are scarce, mostly English-centric, and lack integrated analysis of the interplay between stance, rhetorical bias, and partisanship. The key of the solution is PartisanLens, the first multilingual dataset of 1,617 hyperpartisan news headlines in Spanish, Italian, and Portuguese annotated along multiple dimensions of political discourse. The dataset supports benchmarking LLM classification of hyperpartisan and PRCT narratives, assessing LLMs as automatic annotators, and probing whether conditioning on socio-economic and ideological profiles lets LLMs emulate human annotation patterns, providing a scalable foundation for research on partisan and conspiratorial narrative detection in European contexts.

Link: https://arxiv.org/abs/2601.03860
Authors: Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo
Affiliations: Centro Singular de Investigación en Tecnoloxías Intelixentes da USC; IRLab, CITIC Research Centre, Universidade da Coruña; Universidade de Évora; Universidad de La Rioja; GESIS Leibniz Institute for the Social Sciences
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Detecting hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) is essential to addressing the spread of misinformation. These complex narratives pose a significant threat, as hyperpartisanship drives political polarisation and institutional distrust, while PRCTs directly motivate real-world extremist violence, making their identification critical for social cohesion and public safety. However, existing resources are scarce, predominantly English-centric, and often analyse hyperpartisanship, stance, and rhetorical bias in isolation rather than as interrelated aspects of political discourse. To bridge this gap, we introduce PartisanLens, the first multilingual dataset of 1,617 hyperpartisan news headlines in Spanish, Italian, and Portuguese, annotated in multiple political discourse aspects. We first evaluate the classification performance of widely used Large Language Models (LLMs) on this dataset, establishing robust baselines for the classification of hyperpartisan and PRCT narratives. In addition, we assess the viability of using LLMs as automatic annotators for this task, analysing their ability to approximate human annotation. Results highlight both their potential and current limitations. Next, moving beyond standard judgments, we explore whether LLMs can emulate human annotation patterns by conditioning them on socio-economic and ideological profiles that simulate annotator perspectives. Finally, we release our resources and evaluation; PartisanLens supports future research on detecting partisan and conspiratorial narratives in European contexts.

[NLP-34] What Does Loss Optimization Actually Teach If Anything? Knowledge Dynamics in Continual Pre-training of LLM s

【Quick Read】: This paper addresses the disconnect between loss and knowledge learning in continual pre-training (CPT): loss is conventionally treated as a proxy for knowledge acquisition without any direct measurement of how knowledge is actually acquired. The key of the solution is a controlled, distribution-matched benchmark of factual documents with diagnostic probes interleaved directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and of changes in out-of-domain (OOD) skills such as math. This design reveals that while loss decreases monotonically, factual learning is unstable and non-monotonic, and knowledge circuits are rapidly reconfigured across epochs, exposing narrow acquisition windows and systematic forgetting; loss optimization therefore misrepresents learning progress, motivating stopping criteria based on task-level learning dynamics.

Link: https://arxiv.org/abs/2601.03858
Authors: Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi
Affiliations: University of Trento
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Continual Pre-Training (CPT) is widely used for acquiring and updating factual knowledge in LLMs. This practice treats loss as a proxy for knowledge learning, while offering no grounding into how it changes during training. We study CPT as a knowledge learning process rather than a solely optimization problem. We construct a controlled, distribution-matched benchmark of factual documents and interleave diagnostic probes directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and changes in Out-Of-Domain (OOD) general skills (e.g., math). We further analyze how CPT reshapes knowledge circuits during training. Across three instruction-tuned LLMs and multiple CPT strategies, optimization and learning systematically diverge as loss decreases monotonically while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated, learning is strongly conditioned on prior exposure, and OOD performance degrades from early epochs. Circuit analysis reveals rapid reconfiguration of knowledge pathways across epochs, providing an explanation for narrow acquisition windows and systematic forgetting. These results show that loss optimization is misaligned with learning progress in CPT and motivate evaluation of stopping criteria based on task-level learning dynamics.
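As a schematic of the evaluation design (probes interleaved into the CPT loop), under the assumption of generic `train_epoch` and `evaluate` callables; the probe-set names are hypothetical stand-ins for the paper's diagnostics:

```python
def cpt_with_probes(model, train_loader, probes, num_epochs, train_epoch, evaluate):
    """Continual pre-training that logs epoch-level diagnostics next to the loss,
    so knowledge acquisition and OOD drift can be read off per epoch."""
    history = []
    for epoch in range(num_epochs):
        loss = train_epoch(model, train_loader)     # one CPT pass over the documents
        row = {"epoch": epoch, "loss": loss}
        for name, probe_set in probes.items():      # e.g. {"facts": ..., "math_ood": ...}
            row[name] = evaluate(model, probe_set)  # accuracy on diagnostic questions
        history.append(row)
    return history  # a monotone loss column can now sit next to non-monotone learning curves
```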

[NLP-35] Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

【Quick Read】: This paper addresses two problems in existing table pruning for Table Question Answering (TableQA): reliance on sequential revisions driven by unreliable critique signals, and failure to detect the loss of answer-critical information. The key of the solution, TabTrim, is to transform stepwise pruning into parallel search supervised by a gold pruning trajectory: the trajectory is derived from the intermediate sub-tables produced while executing gold SQL queries, a pruner and a verifier are trained so that each pruning step aligns with this trajectory, and at inference multiple candidate pruning paths are explored in parallel to locate the optimal sub-table. This substantially improves pruning accuracy and downstream reasoning, achieving state-of-the-art results across several benchmarks.

Link: https://arxiv.org/abs/2601.03851
Authors: Yu Guo, Shenghao Ye, Shuangwu Chen, Zijian Wen, Tao Zhang, Qirui Bai, Dong Jin, Yunpeng Hou, Huasen He, Jian Yang, Xiaobin Tan
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures

Abstract:Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.

[NLP-36] Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning

【Quick Read】: This paper addresses the coarse-grained advantage estimation caused by outcome-based rewards in reinforcement learning, which prevents LLMs from distinguishing necessary deduction from redundant verification in chain-of-thought reasoning: models may keep checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into a wrong final answer. The key of the solution is a training-free probing mechanism that extracts intermediate confidence and correctness at each step and fuses them into a Step Potential signal that explicitly tracks the reasoning state; on top of this, Step Potential Advantage Estimation (SPAE) performs fine-grained credit assignment by amplifying potential gains, penalizing potential drops, and applying a penalty after potential saturates to encourage timely termination, improving accuracy while substantially reducing response length.

Link: https://arxiv.org/abs/2601.03823
Authors: Fei Wu, Zhenrong Zhang, Qikai Chang, Jianshu Zhang, Quan Liu, Jun Du
Affiliations: University of Science and Technology of China; iFLYTEK Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies penalty after potential saturates to encourage timely termination. Experiments across multiple benchmarks show SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods. The code is available at this https URL.
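A rough sketch of the Step Potential idea; the fusion rule and the shaping weights below are assumptions, since the abstract only states that confidence and correctness probes are combined and that gains are amplified, drops penalized, and saturation punished:

```python
def step_potentials(confidences, correctness):
    """Fuse per-step probed confidence and correctness into a potential in [0, 1]."""
    return [0.5 * (c + k) for c, k in zip(confidences, correctness)]

def spae_advantages(outcome_adv, potentials, gain_w=1.0, drop_w=1.5,
                    sat_threshold=0.9, sat_penalty=0.05):
    """Turn one trajectory-level advantage into per-step credits via potential deltas."""
    credits = []
    for t in range(1, len(potentials)):
        delta = potentials[t] - potentials[t - 1]
        shaped = gain_w * delta if delta >= 0 else drop_w * delta  # amplify gains, punish drops
        if potentials[t - 1] >= sat_threshold:   # answer already reached: discourage
            shaped -= sat_penalty                # further verification steps
        credits.append(outcome_adv + shaped)
    return credits
```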

[NLP-37] AI Generated Text Detection

【Quick Read】: This paper addresses the academic-integrity problem of students passing off LLM-generated content as their own work. The key of the solution is a unified benchmark built from the HC3 and DAIGT v2 datasets with a topic-based data split that prevents information leakage and ensures generalization to unseen domains. Experiments show that deep models with contextual semantic modeling (BiLSTM and DistilBERT) clearly outperform traditional lexical-feature methods such as TF-IDF logistic regression, with DistilBERT achieving 88.11% accuracy and the highest ROC-AUC (0.96), confirming the central role of contextual understanding in detecting AI-generated text.

Link: https://arxiv.org/abs/2601.03812
Authors: Adilkhan Alikhanov, Aidar Amangeldi, Diar Demeubay, Dilnaz Akhmetzhan, Nurbek Moldakhmetov, Omar Polat, Galymzhan Zharas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid development of large language models has led to an increase in AI-generated text, with students increasingly using LLM-generated content as their own work, which violates academic integrity. This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. This approach ensures robust generalization across unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%. However, deep learning models outperform it. The BiLSTM classifier achieves an accuracy of 88.86%, while DistilBERT achieves a similar accuracy of 88.11% with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work are primarily related to dataset diversity and computational constraints. In future work, we plan to expand dataset diversity and utilize parameter-efficient fine-tuning methods such as LoRA. We also plan to explore smaller or distilled models and employ more efficient batching strategies and hardware-aware optimization.
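The TF-IDF logistic-regression baseline and the topic-based split are both standard techniques; here is a compact sklearn version (the hyperparameters are illustrative, not the paper's):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def topic_split(texts, labels, topics, held_out_topics):
    """Hold out whole topics so the classifier cannot memorize topical vocabulary."""
    train = [(t, y) for t, y, top in zip(texts, labels, topics) if top not in held_out_topics]
    test  = [(t, y) for t, y, top in zip(texts, labels, topics) if top in held_out_topics]
    return train, test

def tfidf_lr_baseline(train, test):
    """Fit TF-IDF + logistic regression on train topics, score on unseen topics."""
    X_tr, y_tr = zip(*train)
    X_te, y_te = zip(*test)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```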

[NLP-38] Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models

【Quick Read】: This paper asks where transformer language models encode psycholinguistic features, i.e., psychologically meaningful aspects of meaning, across their layers. The key of the solution is a systematic layer-wise probing study of 58 psycholinguistic features in 10 transformer models (encoder-only and decoder-only), comparing three embedding extraction methods. The apparent "location" of these features turns out to be strongly method-dependent: contextualized embeddings show higher feature-specific selectivity and different layer-wise profiles than isolated embeddings, and although final-layer representations are rarely optimal, all models share a consistent depth ordering in which lexical properties peak earlier and experiential and affective dimensions peak later, indicating that where meaning lives reflects an interaction of methodological choices and architectural constraints.

Link: https://arxiv.org/abs/2601.03798
Authors: Taisiia Tikhomirova, Dirk U. Wulff
Affiliations: Max Planck Institute for Human Development; Technische Universität Berlin; University of Basel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding where transformer language models encode psychologically meaningful aspects of meaning is essential for both theory and practice. We conduct a systematic layer-wise probing study of 58 psycholinguistic features across 10 transformer models, spanning encoder-only and decoder-only architectures, and compare three embedding extraction methods. We find that apparent localization of meaning is strongly method-dependent: contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles than isolated embeddings. Across models and methods, final-layer representations are rarely optimal for recovering psycholinguistic information with linear probes. Despite these differences, models exhibit a shared depth ordering of meaning dimensions, with lexical properties peaking earlier and experiential and affective dimensions peaking later. Together, these results show that where meaning “lives” in transformer models reflects an interaction between methodological choices and architectural constraints.
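A sketch of what a layer-wise linear probe for one psycholinguistic feature looks like, assuming embeddings have already been extracted per layer (a ridge regressor stands in for whichever linear probe the paper actually uses):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def layerwise_probe(layer_embeddings, feature_values):
    """Cross-validated R^2 of a linear probe per layer.

    layer_embeddings: dict layer_idx -> (n_words, d) array of word embeddings
    feature_values:   (n_words,) array, e.g. concreteness ratings
    """
    scores = {}
    for layer, X in layer_embeddings.items():
        scores[layer] = cross_val_score(Ridge(alpha=1.0), X, feature_values,
                                        cv=5, scoring="r2").mean()
    return scores  # the peak layer indicates where the feature is most accessible
```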

[NLP-39] VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation

【Quick Read】: This paper addresses the sharp performance drop of generative AI in Vietnamese Traditional Medicine (VTM), a culturally specific and data-scarce medical domain where high-quality structured benchmarks are lacking. The key of the solution is VietMed-MCQ, a multiple-choice dataset built via a retrieval-augmented generation (RAG) pipeline with a dual-model validation mechanism that verifies reasoning consistency, improving the quality and trustworthiness of the synthetic data. Through automated consistency checks and joint expert-student review, the dataset achieves a 94.2% approval rate with substantial inter-rater agreement (Fleiss' kappa = 0.82), providing a reliable benchmark for model evaluation in low-resource medical domains.

Link: https://arxiv.org/abs/2601.03792
Authors: Huynh Trung Kiet, Dao Sy Duy Minh, Nguyen Dinh Ha Duong, Le Hoang Minh Huy, Long Nguyen, Dien Dinh
Affiliations: University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures. Dataset and code released

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss’ kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.

[NLP-40] Do LLM s Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework

【Quick Read】: This paper addresses a bias in current evaluations of personally identifiable information (PII) leakage in LLMs: existing reconstruction-based evaluations often mistake generation driven by surface lexical cues for genuine memorization. The key of the solution is Cue-Resistant Memorization (CRM), an evaluation framework that tests under low-lexical-cue conditions and explicitly controls prompt-target overlap cues, separating leakage caused by true internal memorization from surface pattern completion induced by the prompt. Empirically, PII reconstruction success drops substantially once surface cues are removed, indicating that previously reported leakage stems mainly from prompt-induced pattern matching rather than real memorization and underscoring the importance of cue-controlled evaluation for reliably quantifying privacy-relevant memorization in LLMs.

Link: https://arxiv.org/abs/2601.03791
Authors: Xiaoyu Luo, Yiyi Chen, Qiongxiu Li, Johannes Bjerva
Affiliations: Aalborg University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 13 figures

Abstract:Large Language Models (LLMs) have been reported to “leak” Personally Identifiable Information (PII), with successful PII reconstruction often interpreted as evidence of memorization. We propose a principled revision of memorization evaluation for LLMs, arguing that PII leakage should be evaluated under low lexical cue conditions, where target PII cannot be reconstructed through prompt-induced generalization or pattern completion. We formalize Cue-Resistant Memorization (CRM) as a cue-controlled evaluation framework and a necessary condition for valid memorization evaluation, explicitly conditioning on prompt-target overlap cues. Using CRM, we conduct a large-scale multilingual re-evaluation of PII leakage across 32 languages and multiple memorization paradigms. Revisiting reconstruction-based settings, including verbatim prefix-suffix completion and associative reconstruction, we find that their apparent effectiveness is driven primarily by direct surface-form cues rather than by true memorization. When such cues are controlled for, reconstruction success diminishes substantially. We further examine cue-free generation and membership inference, both of which exhibit extremely low true positive rates. Overall, our results suggest that previously reported PII leakage is better explained by cue-driven behavior than by genuine memorization, highlighting the importance of cue-controlled evaluation for reliably quantifying privacy-relevant memorization in LLMs.
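One way to operationalize the cue control, assuming a simple token-overlap statistic (the paper's exact overlap measure and cutoff are not given in the abstract):

```python
def token_overlap(prompt, target):
    """Fraction of target tokens already present verbatim in the prompt."""
    prompt_tokens = set(prompt.lower().split())
    target_tokens = target.lower().split()
    return sum(tok in prompt_tokens for tok in target_tokens) / max(len(target_tokens), 1)

def cue_resistant_pairs(pairs, max_overlap=0.2):
    """Keep only (prompt, pii_target) pairs with low surface overlap, so that a
    successful reconstruction cannot be explained by prompt-induced completion."""
    return [(p, t) for p, t in pairs if token_overlap(p, t) <= max_overlap]
```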

[NLP-41] NeoAMT: Neologism-Aware Agent ic Machine Translation with Reinforcement Learning

【Quick Read】: This paper addresses the low translation accuracy of generative AI on sentences containing neologisms, i.e., how to identify and translate new words effectively without corpus support. The key of the solution is NeoAMT, an agentic framework whose core innovations are: a neologism translation dataset covering 16 languages and 75 translation directions; a Wiktionary-based retrieval tool serving as an external knowledge source; and a translation agent trained with reinforcement learning (RL), including a novel reward design and an adaptive rollout-generation method based on "translation difficulty" to further improve translation quality.

Link: https://arxiv.org/abs/2601.03790
Authors: Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka
Affiliations: The University of Tokyo; NTT Communication Science Laboratories, NTT Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use it for training the translation agent with reinforcement learning (RL) and evaluating the accuracy of neologism-aware machine translation. Based on this, we also propose an RL training framework that contains a novel reward design and an adaptive rollout generation approach by leveraging “translation difficulty” to further improve the translation quality of translation agents using our search tool.

[NLP-42] Compact Example-Based Explanations for Language Models

【Quick Read】: This paper addresses the lack of effective selection strategies when training-data influence estimation is used for example-based explanations: only a small subset of training documents can be shown to humans, yet existing work rarely evaluates the explanatory quality of the selected set and often defaults to naive ranking (taking the highest influence scores), ignoring diversity and representativeness. The key of the solution is a novel retraining-free selection relevance score that predicts whether a set of examples supports or undermines the model's predictions and thus guides better selection strategies; experiments show this score identifies strategies outperforming common baselines (including random selection) and motivates a strategy that balances influence and representativeness, making better use of limited selection budgets.

Link: https://arxiv.org/abs/2601.03786
Authors: Loris Schoenegger, Benjamin Roth
Affiliations: University of Vienna
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Training data influence estimation methods quantify the contribution of training documents to a model’s output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation. Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored any selection strategies. To address this, we propose a novel selection relevance score, a retraining-free metric that quantifies how useful a set of examples is for explaining a model’s output. We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model’s predictions. Using this metric, we further show that common selection strategies often underperform random selection. Motivated by this finding, we propose a strategy that balances influence and representativeness, enabling better use of selection budgets than naively selecting the highest-ranking examples.
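A possible instance of the influence-plus-representativeness strategy the abstract motivates; the convex weighting and the cosine-to-centroid proxy for representativeness are assumptions:

```python
import numpy as np

def select_examples(influence, embeddings, k=5, alpha=0.5):
    """Pick k training documents by a convex mix of influence and representativeness.

    influence:  (n,) influence scores for the prediction being explained
    embeddings: (n, d) document embeddings
    """
    centroid = embeddings.mean(axis=0)
    rep = embeddings @ centroid / (np.linalg.norm(embeddings, axis=1)
                                   * np.linalg.norm(centroid) + 1e-9)
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-9)  # rescale to [0, 1]
    score = alpha * norm(np.asarray(influence)) + (1 - alpha) * norm(rep)
    return np.argsort(-score)[:k]
```

Setting `alpha=1.0` recovers the naive highest-influence ranking that, per the paper's finding, can underperform random selection.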

[NLP-43] Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

【Quick Read】: This paper addresses the failure of LLM agent memory systems to preserve topic continuity in human-agent dialogue: existing designs fragment dialogue streams into isolated utterances for storage and compensate via embedding retrieval, a paradigm that destroys narrative and causal flow and biases retrieval toward lexical similarity. The key of the solution is Membox, a hierarchical memory architecture centered on a Topic Loom that monitors dialogue in a sliding-window fashion and groups consecutive same-topic turns into coherent "memory boxes" at storage time; a Trace Weaver then links sealed boxes into long-range event timelines across discontinuities, recovering macro-topic recurrences. On LoCoMo this yields up to 68% F1 improvement on temporal reasoning while using only a fraction of the context tokens required by existing methods, striking a better efficiency-effectiveness balance.

Link: https://arxiv.org/abs/2601.03785
Authors: Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang
Affiliations: Tsinghua University; Xinjiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human-agent dialogues often exhibit topic continuity-a stable thematic frame that evolves through temporally adjacent exchanges-yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation-compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding-based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding-window fashion, grouping consecutive same-topic turns into coherent “memory boxes” at storage time. Sealed boxes are then linked by a Trace Weaver into long-range event-timeline traces, recovering macro-topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A-MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.
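The Topic Loom's storage-time grouping can be pictured as follows; the `same_topic` judgment (an LLM call or a lightweight classifier) and the window size are stand-ins for whatever the paper actually uses:

```python
def topic_loom(turns, same_topic, window=3):
    """Group consecutive same-topic dialogue turns into memory boxes at storage time."""
    boxes, current = [], []
    for turn in turns:
        if current and not same_topic(current[-window:], turn):
            boxes.append(current)   # seal the finished box; a Trace Weaver would
            current = []            # later link sealed boxes into event timelines
        current.append(turn)
    if current:
        boxes.append(current)
    return boxes
```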

[NLP-44] HearSay Benchmark: Do Audio LLM s Leak What They Hear?

【Quick Read】: This paper investigates whether audio large language models (ALLMs) can leak user privacy through acoustic voiceprints alone. Its core contribution is HearSay, the first systematic benchmark, built from over 22,000 real-world audio clips and curated through a rigorous pipeline of automated profiling and human verification so that all privacy labels are grounded in factual records. Three key findings emerge: ALLMs extract physiological attributes such as gender from voiceprints with up to 92.89% accuracy, revealing inherent privacy leakage; current safety mechanisms are severely inadequate, with most models showing near-zero refusal rates for privacy-intruding requests; and chain-of-thought (CoT) reasoning amplifies privacy exposure in capable models, underscoring the urgent need for privacy alignment targeted at ALLMs.

Link: https://arxiv.org/abs/2601.03783
Authors: Jin Wang, Liang Lin, Kaiwen Luo, Weiliu Wang, Yitian Chen, Moayad Aloqaily, Xuehai Tang, Zhenhong Zhou, Kun Wang, Li Sun, Qingsong Wen
Affiliations: XDU; NTU; NCEPU; BUPT; SHU; UAEU; UCAS-IIE; Squirrel AI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings: Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at this https URL

[NLP-45] racing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations

【Quick Read】: This paper explores whether the intrinsic dimension (ID) of LLM representation spaces can differentiate types of linguistic complexity, particularly formal complexity (e.g., multiple coordinated vs. subordinated clauses) and functional complexity (e.g., right branching vs. center embedding, or unambiguous vs. ambiguous relative-clause attachment). The key of the solution is to analyze ID profiles across LLM layers: ID differences capture different kinds of syntactic complexity, and for the formal contrast they emerge at a processing phase that aligns with a stage of abstract linguistic processing identified in earlier work, indicating that ID is a useful marker of linguistic complexity that also points to similar stages of linguistic processing across disparate LLMs.

Link: https://arxiv.org/abs/2601.03779
Authors: Marco Baroni, Emily Cheng, Iria deDios-Flores, Francesca Franzon
Affiliations: Universitat Pompeu Fabra (UPF); ICREA
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking if different ID profiles across LLM layers differentially characterize formal and functional complexity. We find the formal contrast between sentences with multiple coordinated or subordinated clauses to be reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding or unambiguous vs. ambiguous relative clause attachment are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it allows to differentiate between different types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.
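For readers unfamiliar with intrinsic dimension, the TwoNN estimator (Facco et al., 2017) is the usual tool in this line of work; whether this paper uses TwoNN specifically is not stated in the abstract, so take this as background rather than the authors' method:

```python
import numpy as np
from scipy.spatial.distance import cdist

def two_nn_id(X):
    """TwoNN maximum-likelihood intrinsic-dimension estimate for the rows of X."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    nn = np.sort(D, axis=1)
    mu = nn[:, 1] / nn[:, 0]                 # ratio of 2nd to 1st neighbor distance
    mu = mu[np.isfinite(mu) & (mu > 1.0)]    # drop duplicates / degenerate points
    return len(mu) / np.sum(np.log(mu))      # MLE: mu is Pareto with exponent ID
```

Applied per layer to sentence representations, the resulting ID profile across depth is the kind of curve the paper compares between complexity conditions.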

[NLP-46] Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations

【Quick Read】: This paper asks whether LLM self-explanations, even if they do not reliably reflect the model's true decision process, help users predict model behavior on counterfactual questions. The key of the solution is to evaluate, on StrategyQA, how well humans and LLM judges predict a model's answers to counterfactual follow-up questions with and without access to the model's chain-of-thought or post-hoc explanations, comparing pragmatics-based perturbations against LLM-generated counterfactual test cases. Results show that self-explanations consistently improve simulation accuracy for both humans and LLM judges, but the magnitude and stability of the gains depend strongly on the perturbation strategy and judge strength.

Link: https://arxiv.org/abs/2601.03775
Authors: Pingjun Hong, Benjamin Roth
Affiliations: University of Vienna
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model’s true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as counterfactual simulatability. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model’s answers to counterfactual follow-up questions, with and without access to the model’s chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with pragmatics-based perturbations as alternative ways to construct test cases for assessing the potential usefulness of explanations. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the degree and stability of gains depend strongly on the perturbation strategy and judge strength. We also conduct a qualitative analysis of free-text justifications written by human users when predicting the model’s behavior, which provides evidence that access to explanations helps humans form more accurate predictions on the perturbed questions.

[NLP-47] Evaluation of Multilingual LLM s Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms

【Quick Read】: This paper addresses the potential misuse of multilingual personalized text generation and its effect on detection difficulty, while also exploring the possible benefits of personalization capabilities. The key of the solution is a systematic evaluation of texts generated by 16 language models in 10 languages across 1,080 personalization combinations (targeting demographic groups and social-media platforms; 17,280 texts in total), showing that the effect of personalization on detectability varies by language and target type; in particular, platform-oriented personalization reduces detectability most strongly in English, where personalization quality is also highest, revealing a non-linear relationship between the degree of personalization and detection difficulty.

Link: https://arxiv.org/abs/2601.03752
Authors: Dominik Macko
Affiliations: Kempelen Institute of Intelligent Technologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Capabilities of large language models to generate multilingual coherent text have continuously enhanced in recent years, which opens concerns about their potential misuse. Previous research has shown that they can be misused for generation of personalized disinformation in multiple languages. It has also been observed that personalization negatively affects detectability of machine-generated texts; however, this has been studied in the English language only. In this work, we examine this phenomenon across 10 languages, while we focus not only on potential misuse of personalization capabilities, but also on potential benefits they offer. Overall, we cover 1080 combinations of various personalization aspects in the prompts, for which the texts are generated by 16 distinct language models (17,280 texts in total). Our results indicate that there are differences in personalization quality of the generated texts when targeting demographic groups and when targeting social-media platforms across languages. Personalization towards platforms affects detectability of the generated texts in a higher scale, especially in English, where the personalization quality is the highest.

[NLP-48] Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

【Quick Read】: This paper examines how source credibility influences LLM resolution of inter-context knowledge conflicts in retrieval-augmented generation (RAG) pipelines, a factor unexamined in prior work. Using a tightly controlled evaluation framework over 13 open-weight LLMs, it finds that models prefer institutionally corroborated information (e.g., government or newspaper sources) over information from individuals or social media, yet this preference can be reversed simply by repeating information from less credible sources. The key of the solution is a novel method that reduces repetition bias by up to 99.8% while maintaining at least 88.8% of the original source preferences, improving the consistency and reliability of credibility judgments in knowledge-intensive NLP tasks.

Link: https://arxiv.org/abs/2601.03746
Authors: Jakob Schuster, Vagrant Gautam, Katja Markert
Affiliations: Heidelberg University; Heidelberg Institute for Theoretical Studies
Subjects: Computation and Language (cs.CL)
Comments: Data and code: this https URL

Abstract:As large language models (LLMs) are more frequently used in retrieval-augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter-context knowledge conflicts in English, motivated by interdisciplinary research on credibility. With a comprehensive, tightly-controlled evaluation of 13 open-weight LLMs, we find that LLMs prefer institutionally-corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 99.8%, while also maintaining at least 88.8% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge-intensive NLP.

[NLP-49] O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agent ic RL

【Quick Read】: This paper addresses the performance gap between closed-source and open-source LLMs, rooted largely in unequal access to high-quality training data. The key of the solution is a novel framework for automated synthesis of research-grade instruction data: a multi-agent collaborative workflow simulates complex tool-integrated reasoning to generate diverse, high-fidelity data end to end, combined with a two-stage training strategy of supervised fine-tuning plus a novel reinforcement learning method to maximize alignment and capability. Experiments show the framework lifts open-source models at multiple scales to new state-of-the-art results on the major deep research benchmark, providing a scalable and effective path for advancing open-source LLMs without proprietary data or models.

Link: https://arxiv.org/abs/2601.03743
Authors: Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, Sinuo Wang, Xinpeng Liu, Jiaqi Wu, Minghao Liu, Wangchunshu Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages

Abstract:The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.

[NLP-50] RadDiff: Describing Differences in Radiology Image Sets with Natural Language

【Quick Read】: This paper addresses how to identify and describe clinically meaningful differences between two sets of radiology images, which is critical for generating clinical insights and interpreting medical AI systems. The key of the solution is RadDiff, a radiologist-style multimodal agentic system with four innovations: (1) medical knowledge injection via domain-adapted vision-language models; (2) multimodal reasoning that integrates images with clinical reports; (3) progressive reasoning through multi-round iterative hypothesis refinement; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. On the authors' expert-validated RadDiffBench benchmark, RadDiff clearly outperforms the general-domain VisDiff baseline, providing the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.

Link: https://arxiv.org/abs/2601.03733
Authors: Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff’s versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.

[NLP-51] Stuttering-Aware Automatic Speech Recognition for Indonesian Language

【Quick Read】: This paper addresses the marked degradation of speech recognition on stuttered speech in low-resource languages such as Indonesian, caused by the near absence of dedicated stuttered-speech datasets. The key of the solution is a data augmentation framework that injects stuttering features such as repetitions and prolongations into fluent text using rule-based transformations combined with large language models, synthesizes stuttered audio via text-to-speech, and fine-tunes a pre-trained Indonesian Whisper model with transfer learning, adapting it to dysfluent patterns while preserving performance on fluent speech.

Link: https://arxiv.org/abs/2601.03727
Authors: Fadhil Muhammad, Alwin Djuliansah, Adrian Aryaputra Hamzah, Kurniawati Azizah
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning, enabling the architecture to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.
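A toy version of the rule-based injection step; the probabilities, the syllable heuristic, and the Indonesian example are illustrative only, and the paper additionally uses LLMs for this transformation:

```python
import random

def inject_stutter(text, p_rep=0.15, p_prolong=0.10, seed=0):
    """Insert part-word repetitions and vowel prolongations into fluent text;
    the result is fed to a TTS system to synthesize stuttered audio."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < p_rep:
            part = word[: max(1, len(word) // 2)]
            out.append(f"{part}- {part}- {word}")   # repetition: "pe- pe- pergi"
        elif word[0] in "aiueo" and rng.random() < p_prolong:
            out.append(word[0] * 3 + word[1:])      # prolongation: "aaapel"
        else:
            out.append(word)
    return " ".join(out)

print(inject_stutter("saya ingin pergi ke pasar untuk membeli apel"))
```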

[NLP-52] MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation

【Quick Read】: This paper addresses the difficulty small models face in achieving both in-domain performance and cross-domain generalization during knowledge distillation. Existing methods restrict the student to imitating a single golden rationale, ignoring both the diversity of the teacher's reasoning paths and the student's evolving preferences and capacity during training, so the teacher's "optimal" rationale can act as out-of-distribution noise that degrades the student's latent reasoning distribution and hurts performance. The key of the solution is MIND, a capability-adaptive distillation framework: a "Teaching Assistant" network synthesizes diverse teacher perspectives, and a Feedback-Driven Inertia Calibration mechanism uses inertia-filtered training loss to align supervision with the student's current adaptability, effectively improving performance while mitigating catastrophic forgetting.

Link: https://arxiv.org/abs/2601.03717
Authors: Jin Cui, Jiaqi Guo, Jiepeng Zhou, Ruixuan Yang, Jiayi Lu, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren
Affiliations: Xi'an Jiaotong University; Nankai University; The Hong Kong University of Science and Technology (Guangzhou); Stanford University
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 8 figures

Abstract:While Large Language Models (LLMs) have emerged with remarkable capabilities in complex tasks through Chain-of-Thought reasoning, practical resource constraints have sparked interest in transferring these abilities to smaller models. However, achieving both domain performance and cross-domain generalization remains challenging. Existing approaches typically restrict students to following a single golden rationale and treat different reasoning paths independently. Due to distinct inductive biases and intrinsic preferences, alongside the student’s evolving capacity and reasoning preferences during training, a teacher’s “optimal” rationale could act as out-of-distribution noise. This misalignment leads to a degeneration of the student’s latent reasoning distribution, causing suboptimal performance. To bridge this gap, we propose MIND, a capability-adaptive framework that transitions distillation from passive mimicry to active cognitive construction. We synthesize diverse teacher perspectives through a novel “Teaching Assistant” network. By employing a Feedback-Driven Inertia Calibration mechanism, this network utilizes inertia-filtered training loss to align supervision with the student’s current adaptability, effectively enhancing performance while mitigating catastrophic forgetting. Extensive experiments demonstrate that MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks, and our sophisticated latent space analysis further confirms the mechanism of reasoning ability internalization.

[NLP-53] Visual Merit or Linguistic Crutch? A Close Look at DeepSeek -OCR

【Quick Read】: This paper probes the long-context bottleneck of LLMs through DeepSeek-OCR's vision-text compression, which uses optical 2D mapping to achieve high-ratio compression and claims to decode more than ten times as many text tokens as input visual tokens. By isolating intrinsic OCR ability from language priors via sentence- and word-level semantic corruption, the study finds that performance depends heavily on linguistic priors rather than pure OCR: without semantic support, accuracy plummets from roughly 90% to 20%; fewer visual tokens increase reliance on priors and hallucination risk; and the model collapses around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck rather than relieve it.

Link: https://arxiv.org/abs/2601.03714
Authors: Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni
Affiliations: Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, China; University of Chinese Academy of Sciences, Beijing, China; Institute of Multidisciplinary Research for Advanced Materials (IMRAM), Tohoku University, Sendai, Japan; China Tower Corporation Limited, Beijing, China; Center for Science and Innovation in Spintronics (CSIS), Tohoku University, Sendai, Japan; National Institute for Materials Science (NIMS), Tsukuba, Japan; Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology, Shenzhen, China
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: “Visual merit or linguistic crutch - which drives DeepSeek-OCR’s performance?” By employing sentence-level and word-level semantic corruption, we isolate the model’s intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR’s performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR’s capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at this https URL.
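The word-level corruption used to separate OCR ability from language priors can be approximated as follows; the paper's exact corruption protocol may differ, and within-sentence shuffling is just one simple instance:

```python
import random

def corrupt_word_order(text, seed=0):
    """Shuffle word order within each sentence: the rendered image still contains
    real words, but sentence-level semantics are destroyed, so any remaining
    recognition accuracy must come from vision rather than language priors."""
    rng = random.Random(seed)
    out = []
    for sentence in text.split(". "):
        words = sentence.split()
        rng.shuffle(words)
        out.append(" ".join(words))
    return ". ".join(out)
```

Rendering the corrupted text to an image and re-running recognition then reveals how much of the original accuracy was carried by the language model rather than the vision encoder.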

[NLP-54] AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions

【Quick Read】: This paper addresses three problems in existing UAV vision-and-language navigation (VLN) datasets: dependence on virtual environments, unnatural instructions, and limited scale. The key of the solution is AirNav, a large-scale UAV VLN benchmark built from real urban aerial data with natural and diverse instructions, together with AirVLN-R1, a model that combines supervised fine-tuning with reinforcement fine-tuning to improve performance and generalization; its feasibility is preliminarily validated through real-world tests, and the dataset and code are publicly available.

Link: https://arxiv.org/abs/2601.03707
Authors: Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Wenhao Lu, Renxin Zhong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce the AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.

[NLP-55] ADEPT: Adaptive Dynamic Early-Exit Process for Transformers

【Quick Read】: This paper addresses the heavy inference-time compute of LLMs and the limitation of existing early-exit strategies, which apply only to the first token in the generation phase or at the prompt level in the prefill phase, leaving the key-value (KV) cache of skipped layers a bottleneck. The key of the solution is ADEPT (Adaptive Dynamic Early-exit Process for Transformers), which introduces an adaptive token-level early-exit mechanism that adjusts computation to token complexity and streamlines KV cache generation by decoupling sequential dependencies in skipped layers, enabling dynamic early exit in both prefill and generation; this improves efficiency by up to 25% in language generation and yields a 4x speed-up on downstream classification with up to 45% better performance.

Link: https://arxiv.org/abs/2601.03700
Authors: Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi
Affiliations: University of Michigan; Samsung Semiconductor
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 figures, 8 tables, 22 pages

Abstract:The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting inference earlier, they apply either to only the first token in the generation phase or at the prompt level in the prefill phase. Thus, the Key-Value (KV) cache for skipped layers remains a bottleneck for subsequent token generation, limiting the benefits of early exit. We introduce ADEPT (Adaptive Dynamic Early-exit Process for Transformers), a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. The proposed adaptive token-level early-exit mechanism adjusts computation dynamically based on token complexity, optimizing efficiency without compromising performance. ADEPT further enhances KV generation procedure by decoupling sequential dependencies in skipped layers, making token-level early exit more practical. Experimental results demonstrate that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
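The token-level gating idea behind such early exit can be sketched as below; the confidence criterion, the per-layer exit heads, and the threshold are illustrative, and ADEPT's actual KV-decoupling mechanism is more involved than the comment suggests:

```python
import torch

def forward_with_token_early_exit(layers, exit_heads, h, threshold=0.95):
    """Process one token, exiting as soon as an intermediate head is confident.

    Returns the hidden state and the depth actually used. In ADEPT-style designs,
    KV entries for the skipped layers must still be produced cheaply so that
    later tokens can attend to this one.
    """
    for i, layer in enumerate(layers):
        h = layer(h)
        probs = torch.softmax(exit_heads[i](h), dim=-1)
        if probs.max().item() >= threshold:   # "easy" token: stop computing here
            return h, i + 1
    return h, len(layers)
```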

[NLP-56] RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

【Quick Read】: This paper addresses the inconsistent risk taxonomies, limited domain coverage, and outdated evaluations of current red-teaming datasets, which hinder systematic assessment of LLM vulnerabilities. The key of the solution is RedBench, a universal dataset aggregating 37 benchmark datasets with 29,362 attack and refusal prompt samples under a standardized taxonomy of 22 risk categories and 19 domains, enabling unified and comprehensive evaluation of LLM vulnerabilities; the authors additionally provide baselines for modern LLMs and open-source the dataset and evaluation code, offering a reproducible framework and reliable data for future research.

Link: https://arxiv.org/abs/2601.03699
Authors: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Affiliations: VNU University of Science; Knovel Engineering Lab; University of Alabama at Birmingham
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: this https URL
zh

[NLP-57] Evaluation Framework for AI Creativity: A Case Study Based on Story Generation

【速读】: 该论文试图解决生成式 AI(Generative AI)在故事生成任务中难以准确评估创造力的问题,因为现有基于参考文本的指标无法捕捉创造力的主观性。其解决方案的关键在于提出一个结构化的评估框架,包含四个核心维度(新颖性、价值、贴合度和共鸣)及其对应的十一个子维度,并通过“Spike Prompting”控制生成条件与115名读者的众包研究,系统分析不同创意成分如何影响即时与反思性的人类创造力判断。研究发现创造力评价具有层级性而非累加性,且反思性评估显著改变了评分结果与评分者间一致性,从而验证了该框架能够揭示传统参考基评估所掩盖的创造力维度。

链接: https://arxiv.org/abs/2601.03698
作者: Pharath Sathya,Yin Jou Huang,Fei Cheng
机构: Kyoto University (京都大学)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via "Spike Prompting" and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments. Our findings show that creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment, and that reflective evaluation substantially alters both ratings and inter-rater agreement. Together, these results support the effectiveness of our framework in revealing dimensions of creativity that are obscured by reference-based evaluation.
zh

[NLP-58] From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学问题求解中逻辑推理能力不足的问题,特别是由于对逻辑关系理解薄弱导致的错误占所有错误的90%以上,而现有的Chain-of-Thought Supervised Fine-Tuning(CoT-SFT)方法无法有效缓解这一瓶颈。解决方案的关键在于提出一种轻量级训练框架——First-Step Logical Reasoning(FSLR),其核心思想是聚焦于问题求解的第一步规划(即识别变量和操作),通过专门针对此步骤进行显式监督训练,使模型直接从问题陈述中推导逻辑关系,从而强化逻辑关系理解能力;相比CoT-SFT隐式嵌入逻辑关系的方式,FSLR实现了更高效且精准的逻辑推理能力提升。
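
为直观说明“只监督第一步规划”的数据构造方式,下面给出一个示意函数(字段名与切分粒度均为演示假设,非论文原始数据格式):完整 CoT 中仅第一步作为训练目标,其余步骤不参与监督,从而大幅降低训练 token 消耗。

```python
def make_fslr_example(question: str, cot_steps: list[str]) -> dict:
    """仅保留解题第一步(选哪些变量、做什么运算)作为监督目标。"""
    return {"prompt": question, "target": cot_steps[0]}

example = make_fslr_example(
    question="小明有 3 个苹果, 又买了 5 个, 一共有几个?",
    cot_steps=["第一步: 对数量 3 与 5 做加法", "3 + 5 = 8", "所以答案是 8"],
)
print(example["target"])  # 只训练这一步规划, 后续推导与答案不作为监督信号
```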

链接: https://arxiv.org/abs/2601.03682
作者: Shaojie Wang,Liang Zhang
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies reveal that large language models (LLMs) exhibit limited logical reasoning abilities in mathematical problem-solving, instead often relying on pattern-matching and memorization. We systematically analyze this limitation, focusing on logical relationship understanding, which is a core capability underlying genuine logical reasoning, and reveal that errors related to this capability account for over 90% of incorrect predictions, with Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) failing to substantially reduce these errors. To address this bottleneck, we propose First-Step Logical Reasoning (FSLR), a lightweight training framework targeting logical relationship understanding. Our key insight is that the first planning step-identifying which variables to use and which operation to apply-encourages the model to derive logical relationships directly from the problem statement. By training models on this isolated step, FSLR provides explicit supervision for logical relationship understanding, unlike CoT-SFT which implicitly embeds such relationships within complete solution trajectories. Extensive experiments across multiple models and datasets demonstrate that FSLR consistently outperforms CoT-SFT under both in-distribution and out-of-distribution settings, with average improvements of 3.2% and 4.6%, respectively. Moreover, FSLR achieves 4-6x faster training and reduces training token consumption by over 80%.
zh

[NLP-59] Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)与基于智能体的系统在组合泛化(compositional generalization)方面面临的瓶颈问题,其根本原因在于复杂技能组合遵循长尾幂律分布,导致指令跟随性能和智能体任务中的泛化能力受限。解决方案的关键在于提出STEPS框架——一种基于技能分类体系(Skill Taxonomy)引导的熵驱动后训练数据合成方法。该方法通过结构信息理论揭示技能间的潜在关系,并构建可解释的分层技能分类体系;在此基础上,将数据合成建模为一个受约束的信息最大化问题,选择在层次结构中最大化边际结构信息的同时保持语义一致性的技能组合,从而有效生成具有挑战性的组合数据,显著提升模型的组合泛化能力。

链接: https://arxiv.org/abs/2601.03676
作者: Yifan Wei,Li Du,Xiaoyan Yu,Yang Feng,Angsheng Li
机构: Beihang University (北京航空航天大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Beijing Institute of Technology (北京理工大学); Institute of Computing Technology, CAS (中国科学院计算技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The code and data for our methods and experiments are available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.
zh

[NLP-60] Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction

【速读】: 该论文旨在解决生成式 AI(Generative AI)在在线搜索场景中查询修正(query correction)任务的延迟-精度权衡问题:传统链式思维(Chain-of-Thought, CoT)推理虽能显著提升准确性,但其高延迟难以满足实时性要求;而早期输出答案虽可降低延迟,却因与后续推理过程解耦导致无法利用推理能力优化结果。解决方案的关键在于提出一种新颖的“Sandwich Reasoning (SandwichR)”范式,其核心是采用“答案-推理-答案”的三阶段结构,通过一致性感知强化学习策略(consistency-aware reinforcement learning)显式对齐初始答案与后验推理结果,其中设计了一致性奖励机制以约束初始与最终修正的一致性,并引入基于边距的拒绝采样策略聚焦于推理带来显著改进的样本,从而在保证推理感知准确性的前提下实现40–70%的延迟降低。
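
下面用一个极简函数示意“一致性奖励”的构造思路(权重与匹配方式均为演示假设,非论文原始奖励定义):最终修正正确获得基础奖励,初始答案与最终修正一致则获得额外奖励,从而促使先行输出的快速答案对齐后验推理。

```python
def consistency_reward(initial: str, final: str, gold: str,
                       w_correct: float = 1.0, w_consist: float = 0.5) -> float:
    """一致性感知奖励示意: 正确性奖励 + 初始/最终一致性奖励。"""
    reward = w_correct if final.strip() == gold.strip() else 0.0
    if initial.strip() == final.strip():
        reward += w_consist  # 初始答案已对齐后验推理, 给予额外奖励
    return reward

print(consistency_reward(initial="天气预报", final="天气预报", gold="天气预报"))  # 1.5
print(consistency_reward(initial="天气预抱", final="天气预报", gold="天气预报"))  # 1.0
```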

链接: https://arxiv.org/abs/2601.03672
作者: Chen Zhang,Kepu Zhang,Jiatong Zhang,Xiao Zhang,Jun Xu
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.
zh

[NLP-61] NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中神经元层面解释性问题,特别是由广泛存在的多义性(polysemanticity)带来的挑战——即单个神经元对多个不同的语义概念产生响应,而现有单次遍历的解释方法难以准确刻画这种多概念行为。其解决方案的关键在于提出NeuronScope框架,该框架将神经元解释重构为一个迭代式的、激活引导的过程:通过显式地将神经元激活分解为原子语义成分,聚类形成不同的语义模式,并利用神经元激活反馈逐步优化每种解释,从而更有效地揭示隐藏的多义性并提升解释与激活之间的相关性。

链接: https://arxiv.org/abs/2601.03671
作者: Weiqi Liu,Yongliang Miao,Haiyan Zhao,Yanguang Liu,Mengnan Du
机构: Wuhan University (武汉大学); Hong Kong Baptist University (香港浸会大学); New Jersey Institute of Technology (新泽西理工学院); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
zh

[NLP-62] DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

【速读】: 该论文旨在解决灾难管理中问答(QA)任务因依赖不确定和冲突信息而导致的准确性不足问题,现有基准测试多基于干净证据,难以模拟真实灾难场景下的复杂性。解决方案的关键在于构建了一个大规模、经严格验证的基准数据集DisastQA,包含3,000个问题(2,000个多选题和1,000个开放题),覆盖八类灾害类型,并通过人-大语言模型(LLM)协作流程与分层采样策略确保覆盖均衡;同时引入从封闭书本到噪声证据整合的不同证据条件评估范式,区分模型内部知识与在不完美信息下的推理能力,并提出基于人工验证关键点的开放题评估协议,强调事实完整性而非冗长度,从而更真实地衡量模型在灾难响应中的可靠性。

链接: https://arxiv.org/abs/2601.03670
作者: Zhitong Chen,Kai Yin,Xiangjue Dong,Chengkai Liu,Xiangpeng Li,Yiming Xiao,Bo Li,Junwei Ma,Ali Mostafavi,James Caverlee
机构: Texas A&M University (德州农工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at this https URL.
zh

[NLP-63] eTracer: Towards Traceable Text Generation via Claim-Level Grounding ACL2026

【速读】: 该论文旨在解决系统生成文本在高风险生物医学领域中难以高效验证的问题,尤其关注生成内容的可追溯性与可信度。其解决方案的关键在于提出eTracer框架,通过后验式(post-hoc)的命题级(claim-level)接地机制,将每个生成陈述与上下文证据进行对齐,从而识别支持或反驳该陈述的依据。该方法不仅使用户能够精确追踪响应来源,还能量化生成内容的忠实度(faithfulness),显著提升整体接地质量与用户验证效率,克服了传统基于句子级证据的接地方法的局限性。

链接: https://arxiv.org/abs/2601.03669
作者: Bohao Chu,Qianli Wang,Hendrik Damm,Hui Wang,Ula Muhabbek,Elisabeth Livingstone,Christoph M. Friedrich,Norbert Fuhr
机构: University of Duisburg-Essen (杜伊斯堡-埃森大学); Technische Universität Berlin (柏林工业大学); University of Applied Sciences and Arts Dortmund (多特蒙德应用科学与艺术大学); University Hospital Essen (埃森大学医院)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Conference Submission (8 main pages)

点击查看摘要

Abstract:How can system-generated responses be efficiently verified, especially in the high-stakes biomedical domain? To address this challenge, we introduce eTracer, a plug-and-play framework that enables traceable text generation by grounding claims against contextual evidence. Through post-hoc grounding, each response claim is aligned with contextual evidence that either supports or contradicts it. Building on claim-level grounding results, eTracer not only enables users to precisely trace responses back to their contextual source but also quantifies response faithfulness, thereby enabling the verifiability and trustworthiness of generated responses. Experiments show that our claim-level grounding approach alleviates the limitations of conventional grounding methods in aligning generated statements with contextual sentence-level evidence, resulting in substantial improvements in overall grounding quality and user verification efficiency. The code and data are available at this https URL.
zh

[NLP-64] e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

【速读】: 该论文旨在解决当前 omni-modal embedding 模型在跨模态对齐中面临的三大问题:(i) 相似度 logits 存在模态依赖的锐度差异,导致得分尺度不一致;(ii) 混合模态批次中的负样本分布失衡,使得许多负样本逐渐变得平凡且梯度贡献低;(iii) 不同模态嵌入的一阶和二阶统计量不匹配,影响排序稳定性。解决方案的关键在于提出 e5-omni,一种轻量级显式对齐策略,其核心包含三个组件:(1) 模态感知的温度校准(modality-aware temperature calibration),用于统一不同模态间的相似度尺度;(2) 可控负样本课程学习与去偏机制(controllable negative curriculum with debiasing),聚焦于难负样本并降低误负样本的影响;(3) 批次白化与协方差正则化(batch whitening with covariance regularization),以更好地匹配共享嵌入空间中的跨模态几何结构。该方法显著提升了 omni-modal 嵌入模型的鲁棒性和一致性。
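
以三个组件中的“模态感知温度校准”为例,下面给出一个 PyTorch 示意(模态集合、初始化与使用方式均为演示假设):为每种查询模态维护独立的可学习温度,用于统一不同模态间相似度 logits 的尺度。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAwareTemperature(nn.Module):
    """为每种查询模态维护独立的可学习温度。"""
    def __init__(self, modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.log_tau = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1)) for m in modalities})

    def forward(self, q: torch.Tensor, d: torch.Tensor, q_modality: str) -> torch.Tensor:
        # 余弦相似度除以模态专属温度, 缓解模态依赖的 logit 锐度差异
        sim = F.normalize(q, dim=-1) @ F.normalize(d, dim=-1).T
        return sim / self.log_tau[q_modality].exp()

temp = ModalityAwareTemperature()
logits = temp(torch.randn(4, 32), torch.randn(8, 32), q_modality="audio")
print(logits.shape)  # torch.Size([4, 8])
```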

链接: https://arxiv.org/abs/2601.03666
作者: Haonan Chen,Sicheng Gao,Radu Timofte,Tetsuya Sakai,Zhicheng Dou
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); University of Würzburg (维尔茨堡大学); Waseda University (早稻田大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at this https URL.
zh

[NLP-65] SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)提示在推理过程中产生的冗长且重复的推理轨迹问题,这显著增加了推理成本。解决方案的关键在于提出一种无需训练、可即插即用的解码方法 SyncThink,其核心机制是利用模型自身对特殊标记“/think”的注意力信号作为推理转换的指示器,从而识别并终止冗余推理过程,有效降低生成token数量和延迟,同时提升长周期任务(如GPQA)上的准确率。
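
下面给出一个判定函数的极简示意(窗口大小与阈值均为演示假设,注意力轨迹以张量模拟):监控最近若干生成 token 对 “/think” 特殊标记位置的平均注意力,超过阈值即认为推理已饱和、可以终止思考段。

```python
import torch

def syncthink_should_stop(attn_to_think: torch.Tensor,
                          window: int = 8, threshold: float = 0.3) -> bool:
    """当最近 window 个 token 对 "/think" 位置的平均注意力超过阈值时终止推理。"""
    if attn_to_think.numel() < window:
        return False
    return bool(attn_to_think[-window:].mean() >= threshold)

attn_trace = torch.tensor([0.05, 0.06, 0.10, 0.20, 0.32, 0.35, 0.40, 0.41, 0.45, 0.50])
print(syncthink_should_stop(attn_trace))  # True: 注意力已集中到推理转换信号上
```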

链接: https://arxiv.org/abs/2601.03649
作者: Gengyang Li,Wang Cai,Yifeng Gao,Yunfang Wu
机构: Peking University (北京大学); National Key Laboratory for Multimedia Information Processing (多媒体信息处理国家重点实验室); School of Software and Microelectronics (软件与微电子学院); School of Computer Science (计算机科学技术学院)
类目: Computation and Language (cs.CL)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token “/think”, indicating an information bottleneck. Building on this observation, SyncThink monitors the model’s own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 absolute accuracy by preventing over-thinking.
zh

[NLP-66] ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs EACL2026

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在持续预训练(Continual Pretraining, CP)过程中面临的两个核心问题:一是计算成本过高,二是源语言(如英语)性能退化。针对这些问题,作者提出了一种高效层特定优化(Efficient Layer-Specific Optimization, ELO)方法,其关键在于将模型中对目标语言学习至关重要的前几层和最后几层进行分离训练,仅对这些关键层进行参数更新,从而大幅减少可训练参数数量和前向传播中的计算量,显著降低GPU内存占用并加速训练;随后通过一个简短的全模型微调步骤实现新旧层间的参数对齐,确保目标语言性能提升的同时有效保留源语言能力。实验表明,ELO方法相较现有方法最高可实现6.46倍的训练加速,并使目标语言性能提升达6.2%,同时维持英语等源语言的性能稳定。
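
第一阶段“仅训练首尾若干层”的做法可以用下面的 PyTorch 片段示意(层数与解冻层数均为演示假设,且以线性层代替真实 Transformer 层):

```python
import torch.nn as nn

def mark_elo_trainable(layers: nn.ModuleList, n_first: int = 2, n_last: int = 2) -> None:
    """ELO 第一阶段示意: 仅解冻前 n_first 层与后 n_last 层, 其余层冻结。"""
    n = len(layers)
    trainable = set(range(n_first)) | set(range(n - n_last, n))
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i in trainable

layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(8)])
mark_elo_trainable(layers)
print([any(p.requires_grad for p in l.parameters()) for l in layers])
# [True, True, False, False, False, False, True, True]
```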

链接: https://arxiv.org/abs/2601.03648
作者: HanGyeol Yoo,ChangSu Choi,Minjun Kim,Seohyun Song,SeungWoo Song,Inho Won,Jongyoul Park,Cheoneum Park,KyungTae Lim
机构: SeoulTech (首尔科学技术大学); Hanbat National University (汉巴特国立大学); KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL)
备注: 12 pages, Accepted to EACL 2026 (Industrial Track)

点击查看摘要

Abstract:We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, are detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.
zh

[NLP-67] LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight

【速读】: 该论文旨在解决现有文本情感推断方法在人际互动分析中的局限性,即传统模型将情感视为个体发言者的确定性点估计,忽略了交互过程中情感的主观性、潜在模糊性以及序列耦合特性。其解决方案的关键在于提出LLM-MC-Affect框架,该框架将情绪建模为定义在情感空间上的连续潜概率分布,而非静态标签;通过利用大语言模型(Large Language Model, LLM)的随机解码与蒙特卡洛(Monte Carlo)估计,近似得到高保真度的情感轨迹,从而显式量化情感倾向与感知模糊性,并借助序列交叉相关和斜率指标实现对人际耦合关系的结构化分析。
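
其核心的蒙特卡洛估计流程可以用下面的片段示意(打分接口以高斯噪声模拟温度大于 0 的随机解码,取值区间与采样数均为演示假设):均值刻画情感倾向,标准差刻画潜在模糊性。

```python
import random
import statistics

def mc_affect_estimate(score_once, utterance: str, n_samples: int = 32):
    """多次随机解码打分的蒙特卡洛估计: 返回 (情感均值, 模糊性/标准差)。"""
    scores = [score_once(utterance) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.stdev(scores)

random.seed(0)
fake_llm_score = lambda text: random.gauss(0.3, 0.15)  # 模拟随机解码的 LLM 打分接口
mu, sigma = mc_affect_estimate(fake_llm_score, "这节课讲得真不错", n_samples=64)
print(f"情感倾向 {mu:.2f}, 潜在模糊性 {sigma:.2f}")
```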

链接: https://arxiv.org/abs/2601.03645
作者: Yu-Zheng Lin,Bono Po-Jen Shih,John Paul Martin Encinas,Elizabeth Victoria Abraham Achom,Karan Himanshu Patel,Jesus Horacio Pacheco,Sicong Shao,Jyotikrishna Dass,Soheil Salehi,Pratik Satam
机构: University of Arizona (亚利桑那大学); Pennsylvania State University (宾夕法尼亚州立大学); Universidad de Sonora (索诺拉大学); University of North Dakota (北达科他大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Emotional coordination is a core property of human interaction that shapes how relational meaning is constructed in real time. While text-based affect inference has become increasingly feasible, prior approaches often treat sentiment as a deterministic point estimate for individual speakers, failing to capture the inherent subjectivity, latent ambiguity, and sequential coupling found in mutual exchanges. We introduce LLM-MC-Affect, a probabilistic framework that characterizes emotion not as a static label, but as a continuous latent probability distribution defined over an affective space. By leveraging stochastic LLM decoding and Monte Carlo estimation, the methodology approximates these distributions to derive high-fidelity sentiment trajectories that explicitly quantify both central affective tendencies and perceptual ambiguity. These trajectories enable a structured analysis of interpersonal coupling through sequential cross-correlation and slope-based indicators, identifying leading or lagging influences between interlocutors. To validate the interpretive capacity of this approach, we utilize teacher-student instructional dialogues as a representative case study, where our quantitative indicators successfully distill high-level interaction insights such as effective scaffolding. This work establishes a scalable and deployable pathway for understanding interpersonal dynamics, offering a generalizable solution that extends beyond education to broader social and behavioral research.
zh

[NLP-68] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在持续学习新任务时面临的灾难性遗忘问题,即稳定性-可塑性困境(stability-plasticity dilemma)。其核心挑战在于模型在学习新任务时会覆盖或干扰原有任务的知识,导致性能下降。解决方案的关键在于提出Agent-Dice框架,通过两阶段参数融合机制实现知识解耦:首先利用几何共识过滤(geometric consensus filtering)剔除任务间冲突梯度,其次基于曲率重要性加权(curvature-based importance weighting)强化跨任务共享语义,从而在保持旧知识稳定性的前提下有效吸收新知识。该方法在GUI代理和工具使用代理场景中展现出优异的持续学习性能,且计算开销极低。
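
两阶段融合机制可用下面的张量运算示意(任务向量维度与重要性权重的来源均为演示假设,非官方实现):第一步按逐参数多数方向过滤冲突更新,第二步用曲率重要性权重放大共享语义。

```python
import torch

def dice_fuse(task_vectors: list[torch.Tensor], importance: torch.Tensor) -> torch.Tensor:
    """几何共识过滤 + 重要性加权的参数融合示意。"""
    stacked = torch.stack(task_vectors)          # [任务数, 参数数]
    majority = torch.sign(stacked.sum(dim=0))    # 逐参数的多数更新方向
    consensus = torch.sign(stacked) == majority  # 与多数方向一致的更新才保留
    filtered = torch.where(consensus, stacked, torch.zeros_like(stacked))
    return filtered.mean(dim=0) * importance     # 重要性权重放大共享语义

deltas = [torch.tensor([0.2, -0.1, 0.3]), torch.tensor([0.1, 0.2, 0.4])]
print(dice_fuse(deltas, importance=torch.tensor([1.0, 1.0, 2.0])))
# tensor([0.1500, 0.1000, 0.7000]): 第 2 维的冲突更新被剪除, 第 3 维共享方向被放大
```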

链接: https://arxiv.org/abs/2601.03641
作者: Zheng Wu,Xingyu Lou,Xinbei Ma,Yansi Li,Weiwen Liu,Weinan Zhang,Jun Wang,Zhuosheng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); OPPO Research Institute
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates.
zh

[NLP-69] Reasoning Model Is Superior LLM-Judge Yet Suffers from Biases

【速读】: 该论文旨在系统性地比较大型推理模型(Large Reasoning Models, LRMs)与非推理型大语言模型(non-reasoning LLMs)在判断任务中的性能差异,核心问题是验证LRMs是否在判断准确性、指令遵循能力、抗干扰能力及公平性等方面具有优势。解决方案的关键在于提出一种名为PlanJudge的评估策略,该策略通过引导模型在执行判断前生成明确的评估计划(evaluation plan),从而显著降低模型在表面质量上的偏见,提升其鲁棒性和公正性,且该方法对LRMs和标准LLMs均有效。
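
PlanJudge 的“先出计划、再执行评估”可以用一个提示模板直观说明(以下措辞为演示假设,并非论文原始提示词):

```python
# PlanJudge 式评估提示的示意模板(措辞为演示假设)
PLAN_JUDGE_PROMPT = """你是一名评审, 请分两步完成评估:
第一步【评估计划】: 列出本题应考察的维度(如事实正确性、指令遵循、推理严谨性)
及各自权重, 并明确哪些表面特征(长度、格式、措辞华丽程度)不应影响评分。
第二步【执行评估】: 严格按照上述计划逐项打分, 并给出总分。

[问题]
{question}

[待评回复]
{response}
"""

print(PLAN_JUDGE_PROMPT.format(question="1+1 等于几?", response="等于 2。"))
```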

链接: https://arxiv.org/abs/2601.03630
作者: Hui Huang,Xuanxin Wu,Muyun Yang,Yuki Arase
机构: Harbin Institute of Technology (哈尔滨工业大学); Institute of Science Tokyo (东京科学研究所); The University of Osaka (大阪大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judge to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong biases in superficial quality. To improve the robustness against biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before execution. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
zh

[NLP-70] Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines EACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在临床前问诊(pre-consultation)能力评估中的标准缺失问题,即缺乏一个基于诊断指南(diagnostic guidelines)的系统性评估框架与数据集。其解决方案的关键在于提出EPAG基准数据集和评估框架,通过直接对比LLMs生成的现病史(History of Present Illness, HPI)与诊断指南内容,以及间接通过疾病诊断准确性进行双维度评估;实验表明,经过精心设计的任务特定微调的小型开源模型可在该任务上超越前沿大模型,且发现HPI信息量的增加并不必然提升诊断性能,同时揭示了问诊语言对对话特征的影响。

链接: https://arxiv.org/abs/2601.03627
作者: Jean Seo,Gibaeg Kim,Kihun Shin,Seungseop Lim,Hyunkyung Lee,Wooseok Han,Jongwon Lee,Eunho Yang
机构: AITRICS; KAIST; Severance Hospital, Yonsei University; College of Medicine, The Catholic University of Korea
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EACL 2026 Industry

点击查看摘要

Abstract:We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on this https URL, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
zh

[NLP-71] Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation ACL2026

【速读】: 该论文旨在解决生成式 AI (Generative AI) 时代下音频深度伪造检测(Audio Deepfake Detection, ADD)中模型可解释性与鲁棒性之间的矛盾问题,即如何在保证预测透明度的同时提升模型对对抗攻击的防御能力。传统方法多关注最终分类结果(如真假判断)的稳定性,而忽视了推理过程本身的脆弱性。解决方案的关键在于提出一个三维度的取证审计框架(forensic auditing framework),用于系统评估音频语言模型(Audio Language Models, ALMs)在对抗攻击下的推理鲁棒性,涵盖声学感知(acoustic perception)、认知一致性(cognitive coherence)和认知失调(cognitive dissonance)。研究表明,显式推理并非普遍增强鲁棒性的手段:对于具备强声学感知能力的模型,推理可作为“防护盾”;而对于其他模型,则可能因语言类攻击导致认知一致性下降,反而增加攻击成功率,形成性能“税”。值得注意的是,即使分类失败,高认知失调仍可作为“无声警报”,提示潜在操纵行为,从而为音频伪造检测提供新的鲁棒性分析视角与预警机制。

链接: https://arxiv.org/abs/2601.03615
作者: Binh Nguyen,Thai Le
机构: Indiana University (印第安纳大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Preprint for ACL 2026 submission

点击查看摘要

Abstract:Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing some level of transparency into their predictions via reasoning traces. This necessitates a new class of model robustness analysis: robustness of the predictive reasoning under adversarial attacks, which goes beyond the existing paradigm that mainly focuses on shifts of the final predictions (e.g., fake vs. real). To analyze such reasoning shifts, we introduce a forensic auditing framework to evaluate the robustness of ALMs' reasoning under adversarial attacks in three inter-connected dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. Our systematic analysis reveals that explicit reasoning does not universally enhance robustness. Instead, we observe a bifurcation: for models exhibiting robust acoustic perception, reasoning acts as a defensive "shield", protecting them from adversarial attacks. However, for others, it imposes a performance "tax", particularly under linguistic attacks which reduce cognitive coherence and increase attack success rate. Crucially, even when classification fails, high cognitive dissonance can serve as a "silent alarm", flagging potential manipulation. Overall, this work provides a critical evaluation of the role of reasoning in forensic audio deepfake analysis and its vulnerabilities.
zh

[NLP-72] DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在事实准确性(factuality)方面存在的关键挑战,即现有验证方法多采用二分类判断(如正确或错误),无法区分不同严重程度的错误,从而限制了其在细粒度评估和偏好优化等应用场景中的实用性。解决方案的关键在于提出一种名为“代理判别验证器”(Agentic Discriminative Verifier, DiVA)的混合框架,该框架融合生成式模型的代理搜索能力与判别模型的精确评分优势,实现对事实性更细致、准确的评估。同时,作者构建了新的基准测试集 FGVeriBench,用于系统性地评估细粒度事实验证性能,实验证明 DiVA 在通用及多跳问答任务中均显著优于现有方法。

链接: https://arxiv.org/abs/2601.03605
作者: Hui Huang,Muyun Yang,Yuki Arase
机构: Harbin Institute of Technology (哈尔滨工业大学); Institute of Science Tokyo (东京大学信息科学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the significant advancements of Large Language Models (LLMs), their factuality remains a critical challenge, fueling growing interest in factuality verification. Existing research on factuality verification primarily conducts binary judgments (e.g., correct or incorrect), which fails to distinguish varying degrees of error severity. This limits its utility for applications such as fine-grained evaluation and preference optimization. To bridge this gap, we propose the Agentic Discriminative Verifier (DiVA), a hybrid framework that synergizes the agentic search capabilities of generative models with the precise scoring aptitude of discriminative models. We also construct a new benchmark, FGVeriBench, as a robust testbed for fine-grained factuality verification. Experimental results on FGVeriBench demonstrate that our DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions.
zh

[NLP-73] From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在开放域问答任务中推理过程线性化且逻辑不一致的问题。现有方法如思维链(Chain-of-Thought, CoT)虽能生成看似连贯的文本推理路径,但难以有效整合多前提并行求解子问题,导致结论常出现矛盾。为此,作者提出自图推理(Self-Graph Reasoning, SGR)框架,其关键在于让LLMs在生成最终答案前,显式地将推理过程建模为结构化的图表示,从而支持多路径并行推理与逻辑一致性校验;同时构建了一个融合多个候选推理图的精炼图结构数据集用于训练,显著提升了模型在五个问答基准上的推理一致性,相较基线模型提升17.74%,且微调后的LLaMA-3.3-70B模型性能媲美GPT-4o、超越Claude-3.5-Haiku。

链接: https://arxiv.org/abs/2601.03597
作者: Yingjian Chen,Haoran Liu,Yinhong Liu,Sherry T. Tong,Aosong Feng,Jinghui Lu,Juntao Zhang,Yusuke Iwasawa,Yutaka Matsuo,Irene Li
机构: University of Tokyo (东京大学); Texas A&M University (德克萨斯A&M大学); University of Cambridge (剑桥大学); Yale University (耶鲁大学); Xiaomi EV (小米汽车); Henan University (河南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) show strong reasoning ability in open-domain question answering, yet their reasoning processes are typically linear and often logically inconsistent. In contrast, real-world reasoning requires integrating multiple premises and solving subproblems in parallel. Existing methods, such as Chain-of-Thought (CoT), express reasoning in a linear textual form, which may appear coherent but frequently leads to inconsistent conclusions. Recent approaches rely on externally provided graphs and do not explore how LLMs can construct and use their own graph-structured reasoning, particularly in open-domain QA. To fill this gap, we novelly explore graph-structured reasoning of LLMs in general-domain question answering. We propose Self-Graph Reasoning (SGR), a framework that enables LLMs to explicitly represent their reasoning process as a structured graph before producing the final answer. We further construct a graph-structured reasoning dataset that merges multiple candidate reasoning graphs into refined graph structures for model training. Experiments on five QA benchmarks across both general and specialized domains show that SGR consistently improves reasoning consistency and yields a 17.74% gain over the base model. The LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku, demonstrating the effectiveness of graph-structured reasoning.
zh

[NLP-74] Controllable LLM Reasoning via Sparse Autoencoder-Based Steering

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂任务中因自主选择推理策略而导致的低效甚至错误路径问题,从而提升推理过程的可靠性与灵活性。其解决方案的关键在于利用稀疏自编码器(Sparse Autoencoders, SAEs)将模型隐藏状态中纠缠的推理策略特征解耦至一个独立的特征空间,并提出SAE-Steering这一两阶段特征识别流程:首先通过策略关键词的logits增强来筛选出少量候选特征(过滤掉超过99%的冗余特征),再基于控制有效性对剩余特征进行排序,最终使用识别出的策略特异性特征作为控制向量实现高效精准的推理路径调控,相较现有方法控制效果提升超15%,并能引导模型从错误路径转向正确路径,带来7%的绝对准确率提升。
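
识别出策略特异性特征后,控制本身只需把对应的 SAE 解码器方向加到隐藏状态上。下面是一个极简示意(隐藏维度与控制强度 alpha 均为演示假设):

```python
import torch

def steer_with_sae_feature(hidden: torch.Tensor,
                           decoder_direction: torch.Tensor,
                           alpha: float = 4.0) -> torch.Tensor:
    """将选定 SAE 特征的解码器方向(单位化后)按强度 alpha 叠加到隐藏状态,
    以定向触发某一推理策略(如回溯、交叉验证)。"""
    direction = decoder_direction / decoder_direction.norm()
    return hidden + alpha * direction

h = torch.randn(1, 4096)      # 某层某位置的隐藏状态(维度为假设)
w_dec = torch.randn(4096)     # 选定 SAE 特征对应的解码器列向量
print(steer_with_sae_feature(h, w_dec).shape)  # torch.Size([1, 4096])
```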

链接: https://arxiv.org/abs/2601.03595
作者: Yi Fang,Wenjie Wang,Mingfeng Xue,Boyi Deng,Fengli Xu,Dayiheng Liu,Fuli Feng
机构: University of Science and Technology of China (中国科学技术大学); Zhongguancun Academy (中关村学院); Alibaba Group (阿里巴巴集团); Tsinghua University (清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g. backtracking, cross-verification) during reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are autonomously selected by LRMs themselves. However, such autonomous selection often produces inefficient or even erroneous reasoning paths. To make reasoning more reliable and flexible, it is important to develop methods for controlling reasoning strategies. Existing methods struggle to control fine-grained reasoning strategies due to conceptual entanglement in LRMs’ hidden states. To address this, we leverage Sparse Autoencoders (SAEs) to decompose strategy-entangled hidden states into a disentangled feature space. To identify the few strategy-specific features from the vast pool of SAE features, we propose SAE-Steering, an efficient two-stage feature identification pipeline. SAE-Steering first recalls features that amplify the logits of strategy-specific keywords, filtering out over 99% of features, and then ranks the remaining features by their control effectiveness. Using the identified strategy-specific features as control vectors, SAE-Steering outperforms existing methods by over 15% in control effectiveness. Furthermore, controlling reasoning strategies can redirect LRMs from erroneous paths to correct ones, achieving a 7% absolute accuracy improvement.
zh

[NLP-75] OLA: Output Language Alignment in Code-Switched LLM Interactions

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语种混合(code-switching)对话中输出语言与用户隐含期望不一致的问题。当前LLMs在面对用户使用韩语-英语混合输入时,即使上下文和语用线索明确,仍常错误地以非预期语言生成响应,表现出对非英语语言的系统性偏倚。解决方案的关键在于引入一个名为OLA(Output Language Alignment)的新基准测试,用于量化评估LLMs在代码切换场景下的语言对齐能力,并通过少量标注数据(约1000条示例)训练一种“代码切换感知的直接偏好优化”(Code-Switching Aware DPO),显著降低了语言错配现象,表明此类问题主要源于对齐不足而非模型架构的根本限制。

链接: https://arxiv.org/abs/2601.03589
作者: Juhyun Oh,Haneul Yoo,Faiz Ghifari Haznitrama,Alice Oh
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models (LLMs). When a user code-switches in their prompt to an LLM, they typically do not specify the expected language of the LLM response, and thus LLMs must infer the output language from contextual and pragmatic cues. We find that current LLMs systematically fail to align with this expectation, responding in undesired languages even when cues are clear to humans. We introduce OLA, a benchmark to evaluate LLMs’ Output Language Alignment in code-switched interactions. OLA focuses on Korean–English code-switching and spans simple intra-sentential mixing to instruction-content mismatches. Even frontier models frequently misinterpret implicit language expectation, exhibiting a bias toward non-English responses. We further show this bias generalizes beyond Korean to Chinese and Indonesian pairs. Models also show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve these errors, indicating weak pragmatic reasoning about output language. However, Code-Switching Aware DPO with minimal data (about 1K examples) substantially reduces misalignment, suggesting these failures stem from insufficient alignment rather than fundamental limitations. Our results highlight the need to align multilingual LLMs with users’ implicit expectations in real-world code-switched interactions.
zh

[NLP-76] PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在心理健康应用中缺乏系统性伦理评估框架的问题,尤其针对现有基于拒绝响应(refusal-based)的安全信号无法准确反映临床实践中所需复杂伦理行为的局限性。其解决方案的关键在于提出首个以澳大利亚心理学与精神病学指南为基础的原则驱动型评测基准——PsychEthicsBench,该基准通过多项选择题和开放性任务结合细粒度伦理标注,全面评估LLMs的伦理知识与行为响应能力,实证表明拒绝率并非伦理行为的良好指标,并揭示领域特定微调可能削弱伦理鲁棒性,从而推动面向临床适配、地域敏感的负责任开发路径。

链接: https://arxiv.org/abs/2601.03578
作者: Yaling Shen,Stephanie Fong,Yiwen Jiang,Zimu Wang,Feilong Tang,Qingyang Xu,Xiangyu Zhao,Zhongxing Xu,Jiahe Liu,Jinpeng Hu,Dominic Dwyer,Zongyuan Ge
机构: Monash University (蒙纳士大学); University of Liverpool (利物浦大学); Hefei University of Technology (合肥工业大学)
类目: Computation and Language (cs.CL)
备注: 17 pages

点击查看摘要

Abstract:The increasing integration of large language models (LLMs) into mental health applications necessitates robust frameworks for evaluating professional safety alignment. Current evaluative approaches primarily rely on refusal-based safety signals, which offer limited insight into the nuanced behaviors required in clinical practice. In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking. To address this gap, we move beyond refusal-centric metrics and introduce PsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines, designed to evaluate LLMs' ethical knowledge and behavioral responses through multiple-choice and open-ended tasks with fine-grained ethicality annotations. Empirical results across 14 models reveal that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness. Notably, we find that domain-specific fine-tuning can degrade ethical robustness, as several specialized models underperform their base backbones in ethical alignment. PsychEthicsBench provides a foundation for systematic, jurisdiction-aware evaluation of LLMs in mental health, encouraging more responsible development in this domain.
zh

[NLP-77] How Do Large Language Models Learn Concepts During Continual Pre-Training?

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在持续预训练过程中概念获取、保留与遗忘的动态机制不明确的问题,以及多个概念之间如何通过干扰(interference)和协同(synergy)相互作用的机制。其解决方案的关键在于引入“概念电路”(Concept Circuits)这一内部计算子图结构,并结合图论指标(Graph Metrics)来刻画电路的拓扑特征,从而从电路层面揭示概念学习与遗忘的行为规律,为设计更可解释且鲁棒的概念感知型训练策略提供理论依据。

链接: https://arxiv.org/abs/2601.03570
作者: Barry Menglong Yao(1),Sha Li(2),Yunzhi Yao(3),Minqian Liu(2),Zaishuo Xia(1),Qifan Wang(4),Lifu Huang(1) ((1) UC Davis, (2) Virginia Tech, (3) UCLA, (4) Meta AI)
机构: UC Davis (加州大学戴维斯分校); Virginia Tech (弗吉尼亚理工学院); UCLA (加州大学洛杉矶分校); Meta AI (Meta人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 12 pages, 19 figures

点击查看摘要

Abstract:Human beings primarily understand the world through concepts (e.g., dog), abstract mental representations that structure perception, reasoning, and learning. However, how large language models (LLMs) acquire, retain, and forget such concepts during continual pretraining remains poorly understood. In this work, we study how individual concepts are acquired and forgotten, as well as how multiple concepts interact through interference and synergy. We link these behavioral dynamics to LLMs’ internal Concept Circuits, computational subgraphs associated with specific concepts, and incorporate Graph Metrics to characterize circuit structure. Our analysis reveals: (1) LLMs concept circuits provide a non-trivial, statistically significant signal of concept learning and forgetting; (2) Concept circuits exhibit a stage-wise temporal pattern during continual pretraining, with an early increase followed by gradual decrease and stabilization; (3) concepts with larger learning gains tend to exhibit greater forgetting under subsequent training; (4) semantically similar concepts induce stronger interference than weakly related ones; (5) conceptual knowledge differs in their transferability, with some significantly facilitating the learning of others. Together, our findings offer a circuit-level view of concept learning dynamics and inform the design of more interpretable and robust concept-aware training strategies for LLMs.
zh

[NLP-78] DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)推理在大语言模型中因暴露偏差(exposure bias)和错误累积导致的性能下降问题,即早期推理步骤中的错误会通过自回归解码不可逆地传播,从而影响最终答案的准确性。其解决方案的关键在于提出DiffCoT框架,将CoT推理重构为一种迭代去噪过程,并在推理步骤层面引入扩散机制(diffusion principles),通过滑动窗口(sliding-window)机制实现中间步骤的统一生成与回溯修正,同时保持词元级别的自回归特性;此外,为保障推理链的时间因果一致性,设计了一种因果扩散噪声调度策略(causal diffusion noise schedule),从而显著提升多步CoT推理任务中的鲁棒性和纠错能力。
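
其中“因果扩散噪声调度”的直觉可以用一个极简函数说明(线性形式与归一化方式均为演示假设):同一扩散时刻下,越靠后的推理步分配越大的噪声,使去噪修正不破坏推理链的时间因果结构。

```python
def causal_noise_schedule(n_steps: int, t: float) -> list[float]:
    """因果噪声调度示意: 时刻 t (0~1) 下, 第 i 步的噪声随位置单调递增。"""
    return [min(1.0, t * (i + 1) / n_steps) for i in range(n_steps)]

for t in (0.3, 0.6, 0.9):
    print(t, [round(s, 2) for s in causal_noise_schedule(4, t)])
# 早期步骤噪声小(更"定型"), 后期步骤噪声大(更可被回溯修正)
```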

链接: https://arxiv.org/abs/2601.03559
作者: Shidong Cao,Hongzhan Lin,Yuxuan Gu,Ziyang Luo,Jing Ma
机构: Hong Kong Baptist University (香港浸会大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注: DiffCoT improves multi-step LLM reasoning by applying diffusion-based iterative denoising to correct intermediate Chain-of-Thought steps

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.
zh

[NLP-79] Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios AAAI2026

【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在警务操作中的应用日益广泛,但缺乏针对警察工作场景的系统性评估框架,导致其未经验证的使用可能引发非法逮捕或证据收集不当等严重后果。解决方案的关键在于提出PAS(Police Action Scenarios)框架,该框架覆盖警务操作全流程的评估体系,并基于8000余份官方文件构建了一个新的问答(QA)数据集,通过统计分析与警务专家判断验证了关键指标的有效性,从而为可靠、可扩展的AI辅助警务提供评估基础。

链接: https://arxiv.org/abs/2601.03553
作者: Sangyub Lee,Heedou Kim,Hyeoncheol Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work was accepted at AAAI 2026 social good track

点击查看摘要

Abstract:The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to police operations remains absent. While LLMs' responses may not always be legally incorrect, their unverified use can still lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis with police expert judgements. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.
zh

[NLP-80] EASLT: Emotion-Aware Sign Language Translation

【速读】: 该论文旨在解决生成式手语翻译(Sign Language Translation, SLT)中因忽略面部表情(Non-Manual Signals, NMS)而导致的语义模糊问题,尤其是在不同概念共享相同手势(Manual Signals, MS)时,仅依赖手势信息难以准确区分语义。解决方案的关键在于提出EASLT(Emotion-Aware Sign Language Translation)框架,其核心创新是将面部情绪视为语义锚点而非辅助信息,并引入一个专用的情感编码器以捕捉连续的情绪动态;通过新颖的“情绪感知融合”(Emotion-Aware Fusion, EAF)模块,自适应地根据情绪上下文重新校准时空手势特征,从而有效解耦情绪语义与手势动态,显著提升翻译准确性。

链接: https://arxiv.org/abs/2601.03549
作者: Guobin Tu,Di Weng
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sign Language Translation (SLT) is a complex cross-modal task requiring the integration of Manual Signals (MS) and Non-Manual Signals (NMS). While recent gloss-free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present EASLT (Emotion-Aware Sign Language Translation), a framework that treats facial affect not as auxiliary information, but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel Emotion-Aware Fusion (EAF) module, which adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL-Daily benchmarks demonstrate that EASLT establishes advanced performance among gloss-free methods, achieving BLEU-4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at this https URL.
zh

[NLP-81] Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在模拟个人数据共享决策时,难以准确评估其价值观与实际行为之间一致性的问题。现有方法通常孤立地测量隐私态度或分享意图,无法反映人类在隐私关切与利他动机冲突下所表现出的价值-行为协同效应。解决方案的关键在于提出一种基于情境的评估协议,通过在具有历史记忆的会话中顺序施测标准化问卷(包括隐私态度、利他性及数据共享接受度),并采用多组结构方程模型(Multi-group Structural Equation Modeling, MGSEM)量化从隐私关切和利他性到数据共享行为的路径关系;进一步引入人类参照的方向一致性指标——价值-行为对齐率(Value-Action Alignment Rate, VAAR),以聚合路径层面的预期符号证据,从而系统识别不同LLM在竞争性态度下的价值-行为一致性模式。
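
VAAR 的计算可以用一个简单函数示意(按路径取简单比例的聚合方式与路径命名均为演示假设):统计模型路径系数的符号与人类参照预期方向一致的比例。

```python
def vaar(path_coefficients: dict[str, float],
         expected_signs: dict[str, int]) -> float:
    """价值-行为对齐率示意: 路径系数方向与人类参照预期方向一致的比例。"""
    hits = [(coef > 0) == (expected_signs[path] > 0)
            for path, coef in path_coefficients.items()]
    return sum(hits) / len(hits)

coefs = {"隐私关切->数据共享": -0.42, "利他性->数据共享": 0.31}
expected = {"隐私关切->数据共享": -1, "利他性->数据共享": +1}
print(vaar(coefs, expected))  # 1.0: 两条路径方向均与人类参照一致
```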

链接: https://arxiv.org/abs/2601.03546
作者: Guanyu Chen,Chenxiao Yu,Xiyang Hu
机构: Arizona State University (亚利桑那州立大学); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to simulate decision-making tasks involving personal data sharing, where privacy concerns and prosocial motivations can push choices in opposite directions. Existing evaluations often measure privacy-related attitudes or sharing intentions in isolation, which makes it difficult to determine whether a model’s expressed values jointly predict its downstream data-sharing actions as in real human behaviors. We introduce a context-based assessment protocol that sequentially administers standardized questionnaires for privacy attitudes, prosocialness, and acceptance of data sharing within a bounded, history-carrying session. To evaluate value-action alignments under competing attitudes, we use multi-group structural equation modeling (MGSEM) to identify relations from privacy concerns and prosocialness to data sharing. We propose Value-Action Alignment Rate (VAAR), a human-referenced directional agreement metric that aggregates path-level evidence for expected signs. Across multiple LLMs, we observe stable but model-specific Privacy-PSA-AoDS profiles, and substantial heterogeneity in value-action alignment.
zh

[NLP-82] EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在多轮对话中对长程记忆能力的评估缺乏系统性的问题,尤其在跨会话场景下,现有基准测试未能全面覆盖多种记忆维度。其解决方案的关键在于提出EvolMem基准,该基准基于认知心理学理论,整合了陈述性记忆(declarative memory)与非陈述性记忆(non-declarative memory),并进一步细分为多个高粒度的记忆能力子项;同时,通过引入一种混合数据合成框架——结合主题驱动生成与叙事启发式变换——实现了可控复杂度的多会话对话数据的规模化生成,并配套提供针对每个样本的具体评估指南,从而为LLMs及智能体系统的多会话记忆能力提供了可量化、可比较的评测体系。

链接: https://arxiv.org/abs/2601.03543
作者: Ye Shen,Dun Pei,Yiqiu Guo,Junying Wang,Yijin Guo,Zicheng Zhang,Qi Jia,Jun Zhou,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 7 figures, 8 tables

点击查看摘要

Abstract:Despite recent advances in understanding and leveraging long-range conversational memory, existing benchmarks still lack systematic evaluation of large language models(LLMs) across diverse memory dimensions, particularly in multi-session settings. In this work, we propose EvolMem, a new benchmark for assessing multi-session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non-declarative memory, further decomposed into multiple fine-grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. This framework enables scalable generation of multi-session conversations with controllable complexity, accompanied by sample-specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs’ capabilities and often exhibit notable efficiency limitations. Data and code will be released at this https URL.
zh

[NLP-83] Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多跳推理(multi-hop reasoning)过程中内部如何组合多个事实的问题,尤其是现有“桥接实体对齐电路假说”(hop-aligned circuit hypothesis)是否成立的问题。研究表明,该假说并不具备普适性:在实际推理中,后一跳的答案实体可能比桥接实体更早被解码,这种现象被称为“层序反转”(layer-order inversion),且其强度随总跳跃次数增加而增强。为此,作者提出一个“概率性回忆与提取框架”(probabilistic recall-and-extract framework),将多跳推理建模为浅层MLP层中的广义概率回忆(broad probabilistic recall)与深层注意力层中的选择性提取(selective extraction)的协同过程。该框架通过系统性的探针分析得到实证验证,不仅重新解释了先前关于逐层解码的证据,还阐明了思维链(chain-of-thought)方法的优势,并能从机制层面诊断为何在单跳知识正确的情况下仍会发生多跳推理失败。
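
文中“某实体在某层是否可解码”的探针思路,可用 logit-lens 式的片段示意(阈值、维度与随机隐藏状态均为演示假设):对每层隐藏状态套用输出头,记录目标实体 token 概率首次超过阈值的层号;若答案实体先于桥接实体可解码,即为“层序反转”。

```python
import torch
import torch.nn as nn

def first_decodable_layer(hidden_by_layer: list[torch.Tensor],
                          lm_head: nn.Linear,
                          target_token_id: int,
                          tau: float = 0.05) -> int:
    """返回目标 token 解码概率首次超过阈值的层号; 全不超过则返回 -1。"""
    for layer_idx, hidden in enumerate(hidden_by_layer):
        probs = torch.softmax(lm_head(hidden), dim=-1)
        if probs[target_token_id] >= tau:
            return layer_idx
    return -1

torch.manual_seed(0)
lm_head = nn.Linear(32, 50)
states = [torch.randn(32) for _ in range(12)]  # 以随机向量模拟各层隐藏状态
print(first_decodable_layer(states, lm_head, target_token_id=7))
```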

链接: https://arxiv.org/abs/2601.03542
作者: Xukai Liu,Ye Liu,Jipeng Zhang,Yanghai Zhang,Kai Zhang,Qi Liu
机构: University of Science and Technology of China (中国科学技术大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 18 figures

点击查看摘要

Abstract:Large language models (LLMs) perform well on multi-hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes the hop-aligned circuit hypothesis, suggesting that bridge entities are computed sequentially across layers before later-hop answers. Through systematic analyses on real-world multi-hop queries, we show that this hop-aligned assumption does not generalize: later-hop answer entities can become decodable earlier than bridge entities, a phenomenon we call layer-order inversion, which strengthens with total hops. To explain this behavior, we propose a probabilistic recall-and-extract framework that models multi-hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer-wise decoding evidence, explaining chain-of-thought gains, and providing a mechanistic diagnosis of multi-hop failures despite correct single-hop knowledge. Code is available at this https URL.
zh

[NLP-84] DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在深度研究(Deep Research)任务中,后检索合成阶段(post-retrieval synthesis)缺乏客观评估标准的问题。由于开放式写作的主观性,现有方法难以量化模型在整合海量上下文信息、将碎片化证据转化为连贯长文本报告方面的能力。解决方案的关键在于提出 DeepSynth-Eval 基准,利用高质量综述论文作为黄金标准,逆向构建“Oracle Context”以隔离检索噪声,并设计细粒度评估协议:采用通用检查清单(General Checklists)衡量事实覆盖度,约束检查清单(Constraint Checklists)评估结构组织性,从而将主观判断转化为可验证指标。实验表明,基于计划-写作(plan-and-write)的代理工作流显著优于单轮生成,能有效降低幻觉并更好满足复杂结构约束。

链接: https://arxiv.org/abs/2601.03540
作者: Hongzhi Zhang,Yuanze Hu,Tinghai Zhang,Jia Fu,Tao Wang,Junwei Jing,Zhaoxin Fan,Qi Wang,Ruiming Tang,Han Li,Guorui Zhou,Kun Gai
机构: Kuaishou Technology (快手科技); Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (未来区块链与隐私计算北京先进创新中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage–where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports–remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing “Oracle Contexts” from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.
zh

[NLP-85] STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对越狱攻击(jailbreak attacks)时的安全性问题,即如何有效提升模型对安全规则的理解与推理能力以抵御恶意指令。其解决方案的关键在于提出一种名为STAR-S(Self-Taught Reasoning based on Safety rules)的自教循环框架,该框架通过迭代式地引导模型基于安全规则进行推理与反思,并利用微调(fine-tuning)强化其安全推理能力,形成一个良性循环:模型在安全规则提示下生成更高质量的推理数据,进而用于进一步训练,从而逐步提升对越狱攻击的防御效果。
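
自教循环的整体骨架可以用伪代码级的函数示意(model.generate、judge、finetune 等接口均为假设,仅表达迭代结构):

```python
def star_s_loop(model, prompts, safety_rules, judge, finetune, rounds: int = 3):
    """STAR-S 自教循环示意: 采样安全推理 -> 过滤 -> 微调 -> 迭代。"""
    for _ in range(rounds):
        collected = []
        for prompt in prompts:
            # 在安全规则提示下引出"推理 + 回复"(接口为假设)
            reasoning, answer = model.generate(prompt, rules=safety_rules)
            if judge(prompt, reasoning, answer):   # 仅保留安全且有据的轨迹
                collected.append((prompt, reasoning, answer))
        model = finetune(model, collected)         # 用自产数据强化安全推理
    return model
```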

链接: https://arxiv.org/abs/2601.03537
作者: Di Wu,Yanyan Zhao,Xin Lu,Mingzhe Li,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages,4 figures

点击查看摘要

Abstract:Defending against jailbreak attacks is crucial for the safe deployment of Large Language Models (LLMs). Recent research has attempted to improve safety by training models to reason over safety rules before responding. However, a key issue lies in determining what form of safety reasoning effectively defends against jailbreak attacks, which is difficult to explicitly design or directly obtain. To address this, we propose STAR-S (Self-TAught Reasoning based on Safety rules), a framework that integrates the learning of safety rule reasoning into a self-taught loop. The core of STAR-S involves eliciting reasoning and reflection guided by safety rules, then leveraging fine-tuning to enhance safety reasoning. Repeating this process creates a synergistic cycle. Improvements in the model's reasoning and interpretation of safety rules allow it to produce better reasoning data under safety rule prompts, which is then utilized for further training. Experiments show that STAR-S effectively defends against jailbreak attacks, outperforming baselines. Code is available at: this https URL.

[NLP-86] Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach

Quick Read: This paper addresses the limitations of existing perception-based bikeability assessment methods in capturing the complexity of road environments and accounting for heterogeneity in subjective perception. The key to the solution is a persona-aware vision-language model framework built on three innovations: (i) persona conditioning grounded in an established cyclist typology, which generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that constructs controlled paired data to isolate the impact of infrastructure variables. Validated on 12,400 panoramic-image-based crowdsourced assessments, the framework delivers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.

Link: https://arxiv.org/abs/2601.03534
Authors: Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
Affiliations: University of Alabama (阿拉巴马大学); University of Maryland, College Park (马里兰大学学院公园分校); Stony Brook University (石溪大学); University of Florida (佛罗里达大学); Johns Hopkins University (约翰霍普金斯大学)
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users’ perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.

[NLP-87] PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models

Quick Read: This paper addresses the weakness of current large audio-language models (LALMs) on personalized question answering: existing models struggle to reason and respond based on a user's personal background knowledge, whereas humans interpret information and make decisions conditioned on individual context. The key to the solution is a new task formulation, Personalized LALMs (PALM), for recognizing personal concepts and reasoning within personal context, together with the first dedicated benchmark (PALM-Bench) for systematically evaluating personalized knowledge modeling and cross-task transfer in multi-speaker scenarios.

Link: https://arxiv.org/abs/2601.03531
Authors: Yuwen Wang, Xinyuan Qian, Tian-Hao Zhang, Jiaran Gao, Yuchen Pan, Xin Wang, Zhou Pan, Chen Wei, Yiming Wang
Affiliations: University of Science and Technology Beijing (北京科技大学); Li Auto (理想汽车); Fondazione Bruno Kessler (布鲁诺·凯勒基金会)
Subjects: Computation and Language (cs.CL)
Comments: Under review

Abstract:Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet, our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual’s personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it across tasks robustly. Data and code will be released.

[NLP-88] Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents

Quick Read: This paper addresses the lack of effective long-term memory in multimodal large language model (MLLM) agents over extended conversations, in particular how visual and textual information is preserved, organized, and evolved across sessions. Existing benchmarks either evaluate multi-session memory in text-only dialogue or test multimodal understanding within localized contexts, and thus cannot measure the persistence and dynamic management of multimodal memory along long conversational trajectories. The key to the solution is Mem-Gallery, a new benchmark of high-quality multimodal multi-session conversations with long interaction horizons and rich cross-modal dependencies, paired with a systematic evaluation framework covering three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. This structured, high-fidelity dataset and interpretable evaluation suite expose current models' weaknesses in explicit multimodal retention and memory organization, their bottlenecks in reasoning and knowledge management, and clear directions for future work.

Link: https://arxiv.org/abs/2601.03515
Authors: Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Affiliations: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); MIT-IBM Watson AI Lab, IBM Research (MIT-IBM沃森人工智能实验室,IBM研究院); Stony Brook University (石溪大学); Brookhaven National Laboratory (布鲁克海文国家实验室)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 34 pages, 18 figures

Abstract:Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.

[NLP-89] IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation

Quick Read: This paper tackles the problem of efficiently predicting whether a given LLM will produce sufficiently high-quality output for a specific query. Existing approaches rely on external classifiers (typically BERT-based), which suffer from limited context windows, constrained representational capacity, and extra computational overhead. The key innovation of IntroLM is to add introspective tokens together with a token-conditional LoRA that activates only on those tokens, enabling a causal language model to predict its own output quality during the prefilling phase, without external evaluators and without affecting the original generation behavior. This substantially improves prediction accuracy (90% ROC AUC on Qwen3 8B) and yields better cost-performance trade-offs in multi-model routing systems, reducing latency by up to 33% and large-model usage by up to 50%.

Link: https://arxiv.org/abs/2601.03511
Authors: Hossein Hosseini Kasnavieh, Gholamreza Haffari, Chris Leckie, Adel N. Toosi
Affiliations: School of Computing and Information Systems, The University of Melbourne (墨尔本大学计算与信息系统学院); Department of Data Science & AI, Monash University (蒙纳士大学数据科学与人工智能系)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:A major challenge for the operation of large language models (LLMs) is how to predict whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that enables causal language models to predict their own output quality during the prefilling phase, without affecting generation, using introspective tokens. By introducing a token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large-model usage by up to 50 percent at matched reliability.
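
The gating idea is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch of a token-conditional LoRA layer whose low-rank update fires only at positions holding a designated introspective token. It illustrates the mechanism described in the abstract, not the authors' implementation; the class name, shapes, and token id are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TokenConditionalLoRA(nn.Module):
    """LoRA update applied only at introspective-token positions (sketch)."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8,
                 introspective_id: int = 50256):  # placeholder token id
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)      # frozen backbone weight
        self.lora_a = nn.Linear(base_linear.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # start as an exact no-op
        self.introspective_id = introspective_id

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); token_ids: (batch, seq)
        out = self.base(x)
        mask = (token_ids == self.introspective_id).unsqueeze(-1).to(x.dtype)
        # The low-rank delta contributes only where the introspective token sits,
        # so every other position behaves exactly like the frozen backbone.
        return out + mask * self.lora_b(self.lora_a(x))
```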

[NLP-90] Reasoning Pattern Alignment Merging for Adaptive Reasoning

Quick Read: This paper addresses the excessive computation and latency caused by large reasoning models (LRMs) generating lengthy reasoning paths for every query. Existing speed-up methods either retrain the model or rely on elaborate prompting; the former is costly and the latter is sensitive to inputs and prompt formulation. The key to the solution is Reasoning Pattern Alignment Merging (RPAM), a lightweight model-merging method that fuses a long chain-of-thought (Long-CoT) reasoning model with a short chain-of-thought (Short-CoT) instruction model via layer-wise feature alignment, without training from scratch, yielding a reasoner that adapts its reasoning pattern to the query. RPAM first builds a small pattern-labeled calibration set that assigns each query an appropriate reasoning pattern; it then optimizes layer-wise merging coefficients so that the merged model's intermediate representations align with the selected model, while a contrastive objective pushes them away from the non-selected model, achieving efficient and stable reasoning performance.

Link: https://arxiv.org/abs/2601.03506
Authors: Zhaofeng Zhong, Wei Yuan, Tong Chen, Xiangyu Zhao, Quoc Viet Hung Nguyen, Hongzhi Yin
Affiliations: The University of Queensland (昆士兰大学); City University of Hong Kong (香港城市大学); Griffith University (格里菲斯大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures

Abstract:Recent large reasoning models (LRMs) have made substantial progress in complex reasoning tasks, yet they often generate lengthy reasoning paths for every query, incurring unnecessary computation and latency. Existing speed-up approaches typically rely on retraining the model or designing sophisticated prompting, which are either prohibitively expensive or highly sensitive to the input and prompt formulation. In this work, we study model merging as a lightweight alternative for efficient reasoning: by combining a long chain-of-thought (Long-CoT) reasoning model with a Short-CoT instruction model, we obtain an adaptive reasoner without training from scratch or requiring large-scale additional data. Building on this idea, we propose Reasoning Pattern Alignment Merging (RPAM), a layer-wise model merging framework based on feature alignment to facilitate query-adaptive reasoning. RPAM first constructs a small pattern-labeled calibration set that assigns each query an appropriate reasoning pattern. It then optimizes layer-wise merging coefficients by aligning the merged model’s intermediate representations with those of the selected model, while a contrastive objective explicitly pushes them away from the non-selected model. Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong performance. Upon article acceptance, we will provide open-source code to reproduce experiments for RPAM.
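
As a rough illustration of the merging step, the sketch below interpolates two checkpoints with per-layer coefficients. This is not the paper's code: RPAM learns the coefficients by aligning intermediate representations with the pattern-appropriate model and contrasting against the other, which is abstracted away here; the state-dict interface is an assumption.

```python
import torch

def merge_layerwise(long_cot_state: dict, short_cot_state: dict,
                    alphas: dict) -> dict:
    """Interpolate two checkpoints parameter-by-parameter (sketch).

    alphas maps a parameter name to a coefficient in [0, 1]:
    alpha = 1 keeps the Long-CoT weights, alpha = 0 the Short-CoT weights.
    """
    merged = {}
    for name, w_long in long_cot_state.items():
        a = alphas.get(name, 0.5)  # default: even blend
        merged[name] = a * w_long + (1.0 - a) * short_cot_state[name]
    return merged
```

In RPAM the coefficients are not hand-set like this default; they are optimized on the calibration set so that each layer's merged representation tracks the model whose reasoning pattern fits the query.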

[NLP-91] Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning

Quick Read: This paper addresses the fact that validation perplexity alone cannot tell whether supervised fine-tuning (SFT) has made a model genuinely internalize domain knowledge or merely imitate linguistic style. The key to the solution is the Knowledge Retention (KR) Test, a lightweight, corpus-grounded evaluation framework that automatically generates contrastive examples and measures the model's likelihood preference for correct versus incorrect continuations, requiring no instruction tuning or generative decoding. This cleanly separates linguistic convergence from factual retention and improves the interpretability of the fine-tuning process.

Link: https://arxiv.org/abs/2601.03505
Authors: Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Farinaz Koushanfar
Affiliations: University of California San Diego (加州大学圣地亚哥分校)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Supervised Fine-Tuning (SFT) is a standard approach for injecting domain knowledge into Large Language Models (LLMs). However, relying on validation perplexity to monitor training is often insufficient, as it confounds stylistic mimicry with genuine factual internalization. To address this, we introduce the Knowledge Retention (KR) Test, a lightweight, corpus-grounded evaluation framework designed to distinguish factual learning from linguistic mimicry. KR-Test utilizes automatically generated contrastive examples to measure likelihood preferences for correct versus incorrect continuations, requiring no instruction tuning or generative decoding. We validate the framework’s integrity through a “blind vs. oracle” baseline analysis. Furthermore, we demonstrate the diagnostic capabilities of KR-Test by analyzing the training dynamics of Low-Rank Adaptation (LoRA). By exposing the fine-grained dissociation between linguistic convergence and knowledge retention, KR-Test enhances the interpretability of fine-tuning dynamics.
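
The likelihood-preference measurement can be sketched in a few lines with Hugging Face transformers. The snippet below uses gpt2 as a stand-in model and an illustrative fact pair (neither comes from the paper); it scores two continuations of the same prefix and records which one receives higher log-probability. It assumes the prefix tokenization is a prefix of the full tokenization, which holds for simple whitespace-boundary cases.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given its left context (teacher forcing).
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    start = prefix_ids.shape[1] - 1  # first continuation position (shifted)
    picked = logprobs[0, start:, :].gather(-1, targets[0, start:, None])
    return picked.sum().item()

# Does the model prefer the correct continuation?
prefers_fact = (continuation_logprob("The capital of France is", " Paris")
                > continuation_logprob("The capital of France is", " Berlin"))
```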

[NLP-92] STELLA: Self-Reflective Terminology-Aware Framework for Building an Aerospace Information Retrieval Benchmark

Quick Read: This paper addresses the absence of a public information retrieval (IR) benchmark for the aerospace domain, which depends heavily on searching and reusing technical documents whose terminology and query intents existing IR benchmarks do not reflect. The key to the solution is the STELLA framework, which builds an aerospace IR benchmark via a systematic pipeline over NASA Technical Reports Server (NTRS) documents: layout detection, passage chunking, terminology dictionary construction, and the generation of two query types, Terminology Concordant Queries (TCQ) for evaluating lexical matching and Terminology Agnostic Queries (TAQ) for evaluating semantic matching, enabling a disentangled assessment of embedding models along both axes. Query quality is further improved by combining Chain-of-Density (CoD) with self-reflection, and a hybrid cross-lingual extension mimics real user querying behavior. The benchmark provides a reproducible foundation for reliably evaluating and improving embedding models in aerospace IR.

Link: https://arxiv.org/abs/2601.03496
Authors: Bongmin Kim
Affiliations: TelePIX
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 25 pages, 2 figures

Abstract:Tasks in the aerospace industry heavily rely on searching and reusing large volumes of technical documents, yet there is no public information retrieval (IR) benchmark that reflects the terminology- and query-intent characteristics of this domain. To address this gap, this paper proposes the STELLA (Self-Reflective TErminoLogy-Aware Framework for BuiLding an Aerospace Information Retrieval Benchmark) framework. Using this framework, we introduce the STELLA benchmark, an aerospace-specific IR evaluation set constructed from NASA Technical Reports Server (NTRS) documents via a systematic pipeline that comprises document layout detection, passage chunking, terminology dictionary construction, synthetic query generation, and cross-lingual extension. The framework generates two types of queries: the Terminology Concordant Query (TCQ), which includes the terminology verbatim to evaluate lexical matching, and the Terminology Agnostic Query (TAQ), which utilizes the terminology’s description to assess semantic matching. This enables a disentangled evaluation of the lexical and semantic matching capabilities of embedding models. In addition, we combine Chain-of-Density (CoD) and the Self-Reflection method with query generation to improve quality and implement a hybrid cross-lingual extension that reflects real user querying practices. Evaluation of seven embedding models on the STELLA benchmark shows that large decoder-based embedding models exhibit the strongest semantic understanding, while lexical matching methods such as BM25 remain highly competitive in domains where exact lexical matching of technical terms is crucial. The STELLA benchmark provides a reproducible foundation for reliable performance evaluation and improvement of embedding models in aerospace-domain IR tasks. The STELLA benchmark can be found at this https URL.

[NLP-93] Submodular Evaluation Subset Selection in Automatic Prompt Optimization

Quick Read: This paper addresses poor evaluation subset selection in automatic prompt optimization: how to pick a representative evaluation subset from limited data so that it provides a more effective feedback signal and thus better optimized prompts. Existing methods typically rely on small, randomly sampled evaluation subsets and treat their selection as an implementation detail. The key to SESS (Submodular Evaluation Subset Selection) is to cast subset selection as maximizing a set function and to show that, under mild conditions, this objective is monotone and submodular, so greedy selection is efficient and comes with theoretical guarantees. Experiments on GSM8K, MATH, and GPQA-Diamond show that submodularly selected evaluation subsets yield better optimized prompts than random or heuristic baselines.

Link: https://arxiv.org/abs/2601.03493
Authors: Jinming Nian, Zhiyuan Peng, Hongwei Shang, Dae Hoon Park, Yi Fang
Affiliations: Santa Clara University (圣克拉拉大学); Walmart Global Tech (沃尔玛全球科技)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automatic prompt optimization reduces manual prompt engineering, but relies on task performance measured on a small, often randomly sampled evaluation subset as its main source of feedback signal. Despite this, how to select that evaluation subset is usually treated as an implementation detail. We study evaluation subset selection for prompt optimization from a principled perspective and propose SESS, a submodular evaluation subset selection method. We frame selection as maximizing an objective set function and show that, under mild conditions, it is monotone and submodular, enabling greedy selection with theoretical guarantees. Across GSM8K, MATH, and GPQA-Diamond, submodularly selected evaluation subsets can yield better optimized prompts than random or heuristic baselines.
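
For intuition, the sketch below runs greedy maximization of a classic monotone submodular objective (facility-location coverage) as a stand-in for SESS's set function; the paper's actual objective and similarity definition may differ.

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, budget: int) -> list:
    """Greedily select `budget` candidates under a facility-location objective.

    sim[i, j] is the similarity between candidate i and pool item j.
    f(S) = sum_j max_{i in S} sim[i, j] is monotone submodular, so greedy
    selection enjoys the classic (1 - 1/e) approximation guarantee.
    """
    selected = []
    covered = np.zeros(sim.shape[1])  # current best coverage per pool item
    for _ in range(budget):
        # Marginal gain of adding each candidate to the current set S.
        gains = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf  # forbid repeats
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```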

[NLP-94] CALM: Culturally Self-Aware Language Models

Quick Read: This paper addresses the lack of dynamic cultural sensitivity in current language models: existing approaches treat culture as static background knowledge and ignore its evolving nature, which undermines reliability in tasks that demand genuine cultural adaptation. The key to the CALM framework is to disentangle task semantics from explicit cultural concepts and latent cultural signals, shape them into structured cultural clusters via contrastive learning, establish fine-grained interactions among related cultural features through cross-attention, and adaptively integrate them along culture-specific dimensions with a Mixture-of-Experts mechanism. The resulting representation is fused with the model's original knowledge to build a culturally grounded internal identity state, further enhanced by self-prompted reflective learning for continual adaptation and self-correction, endowing the model with cultural self-awareness.

Link: https://arxiv.org/abs/2601.03483
Authors: Lingzhi Shen, Xiaohao Cai, Yunfei Long, Imran Razzak, Guanming Chen, Shoaib Jameel
Affiliations: University of Southampton (南安普顿大学); Queen Mary University of London (伦敦玛丽女王大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model’s original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.

[NLP-95] Self-Explaining Hate Speech Detection with Moral Rationales

Quick Read: This paper addresses the brittleness of hate speech detection models that over-rely on surface-level lexical features, which makes them vulnerable to spurious correlations and weak in robustness, cultural contextualization, and interpretability. The key to the solution is Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to use moral rationales as direct supervision: grounded in Moral Foundations Theory, it embeds rationale supervision in the training objective, guiding the model to attend to morally salient spans rather than surface lexical patterns, and thereby produces more faithful, interpretable, and culturally contextualized detection and explanations.

Link: https://arxiv.org/abs/2601.03481
Authors: Francielle Vargas, Jackson Trager, Diego Alves, Surendrabikram Thapa, Matteo Guida, Berk Atil, Daryna Dementieva, Andrew Smart, Ameeta Agrawal
Affiliations: University of São Paulo (圣保罗大学); University of Southern California (南加州大学); Saarland University (萨尔兰大学); Virginia Tech (弗吉尼亚理工大学); University of Melbourne (墨尔本大学); Pennsylvania State University (宾夕法尼亚州立大学); Technical University of Munich (慕尼黑工业大学); Google Research (谷歌研究院); Portland State University (波特兰州立大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.

[NLP-96] SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation

Quick Read: This paper addresses linear text segmentation: dividing continuous text into semantically coherent, meaningful units to support downstream NLP tasks such as summarization, information retrieval, and question answering. The key innovation is to cast segmentation as a label-free next sentence prediction (NSP) task: by explicitly modeling sentence-to-sentence continuity the model detects topic boundaries, while a segmentation-aware loss combined with harder negative sampling strengthens its grasp of discourse continuity. Unlike recent proposals that pair NSP with auxiliary topic classification, the approach requires no task-specific supervision, relying only on the NSP mechanism familiar from pre-training. SegNSP achieves a B-F1 of 0.79 on CitiLink-Minutes and 0.65 on WikiSection, outperforming the strongest existing baselines.

Link: https://arxiv.org/abs/2601.03474
Authors: José Isidro, Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Affiliations: University of Porto (波尔图大学); INESC TEC (INESC技术研究所); University of Beira Interior (贝拉内陆大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-F1 of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-F1 of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
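
The core loop is simple to sketch: score each adjacent sentence pair with a "continues the topic?" probability and place a boundary where the score drops. The scorer below is a placeholder callable; SegNSP trains it with a segmentation-aware loss and harder negative sampling, which this sketch omits.

```python
from typing import Callable, List

def segment(sentences: List[str],
            continue_prob: Callable[[str, str], float],
            threshold: float = 0.5) -> List[int]:
    """Return indices i such that a topic boundary falls after sentence i.

    continue_prob(a, b) is any model of P(b continues the topic of a),
    e.g. an NSP-style classifier head over the sentence pair.
    """
    boundaries = []
    for i in range(len(sentences) - 1):
        if continue_prob(sentences[i], sentences[i + 1]) < threshold:
            boundaries.append(i)
    return boundaries
```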

[NLP-97] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Quick Read: This paper addresses the inadequacy of current medical QA benchmarks for evaluating epidemiological reasoning, in particular evidence-grounded inference about population-level disease burden, transmission dynamics, and intervention effects. The key to the solution is EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets: text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Built with expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control, the benchmark reveals that current LLMs perform worst on multi-step inference, that model rankings shift across subsets, and that scale alone does not guarantee gains, providing fine-grained diagnostic signals for epidemiological reasoning.

Link: https://arxiv.org/abs/2601.03471
Authors: Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin
Affiliations: Emory University (埃默里大学); University of Illinois Chicago (伊利诺伊大学芝加哥分校); Microsoft (微软)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 3 figures, 12 tables

Abstract:Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.

[NLP-98] Prompting Underestimates LLM Capability for Time Series Classification

Quick Read: This paper asks whether prompt-based evaluation accurately reflects large language models' understanding of time series classification. Although LLMs perform near chance under prompting, linear probes applied directly to the same internal representations raise average F1 from 0.15-0.26 to 0.61-0.67, matching or exceeding specialized time series models. The key finding is a systematic mismatch between prompt-based generation and the models' internal representations: current prompt-based evaluation underestimates LLMs' ability to encode time series structure, while direct analyses such as linear probing reveal their latent temporal representations more faithfully.

Link: https://arxiv.org/abs/2601.03464
Authors: Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios
Affiliations: The University of Texas at San Antonio (圣安东尼奥德克萨斯大学)
Subjects: Computation and Language (cs.CL)
Comments: 8 pages + Appendix and References, 9 figures

Abstract:Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model’s representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
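
The probing recipe itself is standard and easy to reproduce in outline: freeze the model, read a hidden layer at the last token, and fit a linear classifier. The sketch below uses gpt2 and layer 6 purely as placeholders, not the models or layers from the paper.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

def last_token_features(text: str, layer: int = 6):
    """Hidden state of the final token at one layer, as a feature vector."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer]  # (1, seq, d_model)
    return hidden[0, -1].numpy()

# Time series rendered as text, one string per example (illustrative):
# X = np.stack([last_token_features(s) for s in series_as_text])
# probe = LogisticRegression(max_iter=1000).fit(X, labels)
# probe.score(X_test, labels_test)  # linear-probe accuracy
```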

[NLP-99] Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Quick Read: This paper addresses the fact that while pre-training on raw text teaches language models (LMs) world knowledge and reasoning, it does not explicitly optimize for linguistic competence. The key to the solution is the L2T pre-training framework (Language Learning Tasks), which transforms raw text into structured input-output pairs inspired by human language acquisition to provide explicit linguistic stimulation. Jointly pre-training on raw text and L2T data not only improves performance on linguistic competence benchmarks but also accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.

Link: https://arxiv.org/abs/2601.03448
Authors: Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
Affiliations: University of Sheffield (谢菲尔德大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.

[NLP-100] Grading Scale Impact on LLM -as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale

Quick Read: This paper addresses the inconsistency of large language models (LLMs) as automated graders across different grading scales, in particular how the choice of scale affects agreement with human raters on subjective tasks. The study finds that although within-group reliability is high for both LLM and human panels, the grading scale substantially shifts human-LLM alignment, with the 0-5 scale yielding the strongest agreement aggregated over tasks. The key takeaway is that systematically evaluating the effect of grading scales, together with subgroup diagnostics (such as differences across gender groups), is essential to reliable LLM-as-a-judge protocols.

Link: https://arxiv.org/abs/2601.03444
Authors: Weiyue Li, Minda Zhao, Weixuan Dong, Jiahui Cai, Yuze Wei, Michael Pocress, Yi Li, Wanyan Yuan, Xiaoyue Wang, Ruoyu Hou, Kaiyuan Lou, Wenqi Zeng, Yutong Yang, Yilun Du, Mengyu Wang
Affiliations: Harvard University (哈佛大学); CMU (卡内基梅隆大学); Stanford University (斯坦福大学); UC San Diego (加州大学圣地亚哥分校)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large language models (LLMs) are increasingly used as automated evaluators, yet prior works demonstrate that these LLM judges often lack consistency in scoring when the prompt is altered. However, the effect of the grading scale itself remains underexplored. We study the LLM-as-a-judge problem by comparing two kinds of raters: humans and LLMs. We collect ratings from both groups on three scales and across six benchmarks that include objective, open-ended subjective, and mixed tasks. Using intraclass correlation coefficients (ICC) to measure absolute agreement, we find that LLM judgments are not perfectly consistent across scales on subjective benchmarks, and that the choice of scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. Aggregated over tasks, the grading scale of 0-5 yields the strongest human-LLM alignment. We further demonstrate that pooled reliability can mask benchmark heterogeneity and reveal systematic subgroup differences in alignment across gender groups, underscoring the importance of scale design and sub-level diagnostics as essential components of LLM-as-a-judge protocols.
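
For reference, an absolute-agreement ICC can be computed directly from a ratings matrix. The sketch below implements ICC(2,1) (two-way random effects, absolute agreement, single rater) from the standard ANOVA mean squares; whether this exact ICC variant matches the paper's choice is an assumption of the sketch.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1) from a (n_items, k_raters) matrix of scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1, keepdims=True)  # per-item means
    col_means = ratings.mean(axis=0, keepdims=True)  # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)          # items
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)          # raters
    mse = (((ratings - row_means - col_means + grand) ** 2).sum()
           / ((n - 1) * (k - 1)))                                  # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```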

[NLP-101] he Critical Role of Aspects in Measuring Document Similarity

Quick Read: This paper addresses the lack of explicit aspect modeling in traditional document similarity measurement, which is typically holistic and ignores the specific dimension a user cares about. The key to the ASPECTSIM framework is to condition document similarity explicitly on a specified aspect, which improves agreement between models and human judgments: implemented with direct GPT-4o prompting and evaluated on a new benchmark of 26K aspect-document pairs, it achieves roughly 80% higher human-machine agreement than holistic similarity, underscoring the importance of modeling aspects explicitly and calling for a revision of standard practice.

Link: https://arxiv.org/abs/2601.03435
Authors: Eftekhar Hossain, Tarnika Hazra, Ahatesham Bhuiyan, Santu Karmaker
Affiliations: Bridge-AI Lab@UCF, Department of Computer Science (Bridge-AI 实验室@中佛罗里达大学计算机科学系); University of Central Florida (中佛罗里达大学)
Subjects: Computation and Language (cs.CL)
Comments: 24 pages, 10 figures, 10 tables

Abstract:We introduce ASPECTSIM, a simple and interpretable framework that requires conditioning document similarity on an explicitly specified aspect, which is different from the traditional holistic approach in measuring document similarity. Experimenting with a newly constructed benchmark of 26K aspect-document pairs, we found that ASPECTSIM, when implemented with direct GPT-4o prompting, achieves substantially higher human-machine agreement (≈80% higher) than the same for holistic similarity without explicit aspects. These findings underscore the importance of explicitly accounting for aspects when measuring document similarity and highlight the need to revise standard practice. Next, we conducted a large-scale meta-evaluation using 16 smaller open-source LLMs and 9 embedding models with a focus on making ASPECTSIM accessible and reproducible. While directly prompting LLMs to produce ASPECTSIM scores turned out to be ineffective (20-30% human-machine agreement), a simple two-stage refinement improved their agreement by ≈140%. Nevertheless, agreement remains well below that of GPT-4o-based models, indicating that smaller open-source LLMs still lag behind large proprietary models in capturing aspect-conditioned similarity.

[NLP-102] Spectral Archaeology: The Causal Topology of Model Evolution

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)行为评估中“知其然不知其所以然”的问题,即传统行为基准只能描述模型输出结果,而无法揭示其内部机制的动态变化。为此,作者提出了一种无需训练的机制探测方法——基于注意力图谱(attention-graph spectra)的谱分析工具,核心在于将每一层视为一个token图,并计算代数连通性(algebraic connectivity, λ₂)、平滑度与谱熵等指标,从而获得稳定的“谱指纹”以捕捉标准评测忽略的结构断裂现象。关键创新点在于识别出一种名为“被动触发的连通性崩溃”(Passive-Triggered Connectivity Collapse, PTCC)的现象:在特定课程过渡阶段(如代码到对话),模型在非规范句法构造下出现显著的λ₂下降(Δλ₂ ≈ -0.76),且这种崩溃可被定位至第2层稀疏的“补偿补丁”头(compensatory patch of heads),并通过激活引导部分恢复信息流(约38%)。该方法不仅实现了对训练制度的机制级审计,还揭示了模型拓扑结构与分词密度而非语言身份更强相关的核心规律。

链接: https://arxiv.org/abs/2601.03424
作者: Valentin Noël
机构: Devoteam(德沃团队)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 45 pages, 15 figures, Under Review

点击查看摘要

Abstract:Behavioral benchmarks tell us what a model does, but not how. We introduce a training-free mechanistic probe using attention-graph spectra. Treating each layer as a token graph, we compute algebraic connectivity (λ₂), smoothness, and spectral entropy. Across 12 models and 10 languages, these measures yield stable "spectral fingerprints" that expose discontinuities missed by standard evaluation. We report four results. (1) Models undergoing specific curriculum transitions (e.g., code-to-chat) show an English-only, syntax-triggered connectivity failure on non-canonical constructions, reaching Δλ₂ ≈ -0.76. We term this scar Passive-Triggered Connectivity Collapse (PTCC). Analysis of the Phi lineage reveals that PTCC appears and resolves across developmental stages, implicating brittle curriculum shifts rather than synthetic data per se. (2) PTCC reflects a specialization trade-off: strengthened formal routing at the expense of stylistic flexibility. (3) We identify four recurrent processing strategies; simple frozen-threshold rules enable perfect forensic identification across lineages. (4) Mechanistically, PTCC localizes to a sparse Layer 2 "compensatory patch" of heads that fails under syntactic stress; activation steering can partially restore connectivity, recovering ≈38% of lost information flow. Finally, dominant topological regimes track tokenization density more than language identity, suggesting "healthy" geometry varies systematically across scripts. Overall, attention-graph spectra provide a practical tool for auditing and training-regime verification.
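
The core spectral quantity is a few lines of linear algebra. The sketch below symmetrizes one layer's attention matrix into an undirected token graph and reads λ₂ off the graph Laplacian; averaging over heads and any normalization choices are simplifications relative to the paper.

```python
import numpy as np

def algebraic_connectivity(attn: np.ndarray) -> float:
    """lambda_2 of the Laplacian of a token graph built from attention.

    attn: (seq, seq) attention weights for one layer (e.g., head-averaged).
    """
    w = 0.5 * (attn + attn.T)          # symmetrize into undirected weights
    np.fill_diagonal(w, 0.0)           # drop self-loops
    laplacian = np.diag(w.sum(axis=1)) - w
    eigvals = np.linalg.eigvalsh(laplacian)  # ascending; eigvals[0] ~ 0
    return float(eigvals[1])           # second-smallest = lambda_2
```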

[NLP-103] raining-Free Adaptation of New-Generation LLM s using Legacy Clinical Models

Quick Read: This paper addresses the high cost of adapting general-domain language models to the clinical domain, where continued pretraining and fine-tuning must be repeated for every new model generation. The key to the solution is Cross-Architecture Proxy Tuning (CAPT), a training-free model-ensembling method that adapts new-generation general-domain models using existing (legacy) clinical models: via contrastive decoding, it selectively injects clinically relevant signals from the clinical model while preserving the general model's reasoning and fluency, and it supports model pairs with disjoint vocabularies. Across six clinical classification and text-generation tasks, CAPT consistently outperforms both constituent models and existing ensembling approaches.

Link: https://arxiv.org/abs/2601.03423
Authors: Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer
Affiliations: Stanford School of Medicine (斯坦福医学院); Stanford University (斯坦福大学); MemorialCare (纪念护理); Department of Medicine (医学系)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 29 pages, 3 figures

Abstract:Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model’s reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.
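
One plausible reading of the decoding rule, in the spirit of the proxy-tuning family the paper compares against: shift the general model's next-token distribution by the expert-minus-base difference from the clinical pair. The `to_shared` mapping, which stands in for CAPT's cross-vocabulary alignment, and the `alpha` weight are assumptions of this sketch, not the paper's exact formulation.

```python
import torch

def contrastive_decode_step(general_logits: torch.Tensor,
                            expert_logits: torch.Tensor,
                            base_logits: torch.Tensor,
                            to_shared,  # maps each model's logits into one
                                        # shared vocabulary (assumed helper)
                            alpha: float = 1.0) -> torch.Tensor:
    """Next-token log-probabilities after injecting the clinical signal."""
    g = to_shared(general_logits)
    delta = to_shared(expert_logits) - to_shared(base_logits)
    return torch.log_softmax(g + alpha * delta, dim=-1)
```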

[NLP-104] PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution ACL2026

Quick Read: This paper addresses the difficulty of verifying system-generated summaries in medicine, where precise attribution to the source context is crucial in high-stakes settings. The key to the solution is PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution, which aligns each aspect-based summary with its supporting contextual sentences and the contributory phrases within them, together with a fine-grained, decoupled evaluation framework that independently assesses summary quality, citation accuracy, and contributory-phrase validity. Experiments show that PCoA reliably evaluates summarization with phrase-level attribution, and that explicitly identifying relevant sentences and contributory phrases before summarizing improves overall quality.

Link: https://arxiv.org/abs/2601.03418
Authors: Bohao Chu, Sameh Frihat, Tabea M. G. Pakull, Hendrik Damm, Meijie Li, Ula Muhabbek, Georg Lodde, Norbert Fuhr
Affiliations: University of Duisburg-Essen (杜伊斯堡-埃森大学); University of Applied Sciences and Arts Dortmund (多特蒙德应用技术与艺术大学); University Hospital Essen (埃森大学医院); Institute for Artificial Intelligence in Medicine (IKIM) (人工智能医学研究所)
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026 Conference Submission (8 main pages)

Abstract:Verifying system-generated summaries remains challenging, as effective verification requires precise attribution to the source context, which is especially crucial in high-stakes medical domains. To address this challenge, we introduce PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution. PCoA aligns each aspect-based summary with its supporting contextual sentences and contributory phrases within them. We further propose a fine-grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases. Through extensive experiments, we validate the quality and consistency of the PCoA dataset and benchmark several large language models on the proposed task. Experimental results demonstrate that PCoA provides a reliable benchmark for evaluating system-generated summaries with phrase-level context attribution. Furthermore, comparative experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality. The data and code are available at this https URL.

[NLP-105] Implicit Graph Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models

Quick Read: This paper addresses the difficulty large language models (LLMs) face in long-horizon applications when relevant evidence is sparse and dispersed across very long contexts. Existing memory systems fall into two camps: explicit structured memories are interpretable but brittle under long-context overload, while latent memory mechanisms are efficient and stable but hard to inspect. The key to LatentGraphMem is to store a graph-structured memory in latent space for stability and efficiency, while exposing a task-specific subgraph retrieval interface that returns a compact symbolic subgraph under a fixed budget for downstream reasoning and human inspection. During training, an explicit graph view is materialized to interface with a frozen reasoner for question-answering supervision; at inference, retrieval runs in latent space and only the retrieved subgraph is externalized, enabling parameter-efficient adaptation and flexible scaling to larger reasoners without producing large symbolic artifacts.

Link: https://arxiv.org/abs/2601.03417
Authors: Xin Zhang, Kailai Yang, Hao Li, Chenyue Li, Qiyu Wei, Sophia Ananiadou
Affiliations: University of Manchester (曼彻斯特大学); Imperial College London (帝国理工学院); Stanford University (斯坦福大学)
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 5 figures

Abstract:Long-horizon applications increasingly require large language models (LLMs) to answer queries when relevant evidence is sparse and dispersed across very long contexts. Existing memory systems largely follow two paradigms: explicit structured memories offer interpretability but often become brittle under long-context overload, while latent memory mechanisms are efficient and stable yet difficult to inspect. We propose LatentGraphMem, a memory framework that combines implicit graph memory with explicit subgraph retrieval. LatentGraphMem stores a graph-structured memory in latent space for stability and efficiency, and exposes a task-specific subgraph retrieval interface that returns a compact symbolic subgraph under a fixed budget for downstream reasoning and human inspection. During training, an explicit graph view is materialized to interface with a frozen reasoner for question-answering supervision. At inference time, retrieval is performed in latent space and only the retrieved subgraph is externalized. Experiments on long-horizon benchmarks across multiple model scales show that LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while enabling parameter-efficient adaptation and flexible scaling to larger reasoners without introducing large symbolic artifacts.

[NLP-106] grinya Number Verbalization: Rules Algorithm and Implementation

Quick Read: This paper addresses the normalization of cardinal and ordinal number verbalization in Tigrinya, filling a gap in computational resources for the language. The core challenge is to systematically formalize how numbers are expressed in speech, including the conjunction system, scale words, and special cases for dates, times, and currency. The key contribution is a formal number-to-word algorithm together with an open-source implementation, providing a reusable foundation for language modeling, speech synthesis, and accessibility applications.

Link: https://arxiv.org/abs/2601.03403
Authors: Fitsum Gaim, Issayas Tesfamariam
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a systematic formalization of Tigrinya cardinal and ordinal number verbalization, addressing a gap in computational resources for the language. This work documents the canonical rules governing the expression of numerical values in spoken Tigrinya, including the conjunction system, scale words, and special cases for dates, times, and currency. We provide a formal algorithm for number-to-word conversion and release an open-source implementation. Evaluation of frontier large language models (LLMs) reveals significant gaps in their ability to accurately verbalize Tigrinya numbers, underscoring the need for explicit rule documentation. This work serves language modeling, speech synthesis, and accessibility applications targeting Tigrinya-speaking communities.

[NLP-107] Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

Quick Read: This paper addresses data protection against unwanted learning by LLMs in a realistic black-box setting, where proprietary or personal data may be swept into training without authorization and the training pipeline cannot be accessed or modified. The key to the solution is Disclaimer Injection, a novel data-level defense that renders text unlearnable: carefully designed disclaimers that trigger the model's alignment mechanisms are injected into the text, so that fine-tuning on the protected data induces persistent activation of alignment-related layers, alignment constraints override the task-learning objective, and model performance degrades substantially and systematically, thereby restricting data learnability without touching the training pipeline.

Link: https://arxiv.org/abs/2601.03401
Authors: Ruihan Zhang, Jun Sun
Affiliations: Singapore Management University (新加坡管理大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora, raising serious concerns about the unauthorised use of proprietary or personal data during model training. In this work, we address the problem of data protection against unwanted model learning in a realistic black-box setting. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs. Rather than relying on model-side controls or explicit data removal, our approach exploits the models’ own alignment mechanisms: by injecting carefully designed alignment-triggering disclaimers to prevent effective learning. Through layer-wise analysis, we find that fine-tuning on such protected data induces persistent activation of alignment-related layers, causing alignment constraints to override task learning even on common inputs. Consequently, models trained on such data exhibit substantial and systematic performance degradation compared to standard fine-tuning. Our results identify alignment behaviour as a previously unexplored lever for data protection and, to our knowledge, present the first practical method for restricting data learnability at LLM scale without requiring access to or modification of the training pipeline.

[NLP-108] Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation

Quick Read: This paper identifies two alignment-induced biases in current character generation: a positive moral bias, where characters uniformly adopt agreeable moral stances (e.g., always holding that lying is wrong), and a helpful assistant bias, where characters invariably answer questions directly and never refuse or deflect. These biases stem from maximum likelihood training and assistant fine-tuning, and they suppress dramatic tension and diversity. The key to the PersonaWeaver framework is to disentangle world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), yielding characters with richer reaction patterns, diverse moral stances, and second-order stylistic diversity in length, tone, and punctuation.

Link: https://arxiv.org/abs/2601.03396
Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer
Affiliations: Boston University (波士顿大学)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large-scale character generation remains underexplored. We identify two alignment-induced biases in existing methods: a positive moral bias, where characters uniformly adopt agreeable stances (e.g. always saying lying is bad), and a helpful assistant bias, where characters invariably answer questions directly (e.g. never refusing or deflecting). While such tendencies suit instruction-following systems, they suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine-tuning. To address this, we introduce PersonaWeaver, a framework that disentangles world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), yielding characters with more diverse reactions and moral stances, as well as second-order diversity in stylistic markers like length, tone, and punctuation. Code: this https URL

[NLP-109] Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models

Quick Read: This paper asks whether the large volume of metaphors in training data biases the reasoning pathways of LLMs and thereby contributes to cross-domain misalignment, where patterns learned from misaligned content in one domain generalize to another. The key finding is a strong causal relationship between metaphors and the activation of global and local latent features in large reasoning models: interventions with metaphors during pre-training, fine-tuning, and re-alignment significantly change models' cross-domain misalignment degree. Building on this, the authors design a detector that monitors these latent features to predict misaligned reasoning content with high accuracy, offering a new handle for controlling and mitigating LLM reasoning misalignment.

Link: https://arxiv.org/abs/2601.03388
Authors: Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-young Paik, Liming Zhu
Affiliations: The University of New South Wales (新南威尔士大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61); Australia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures

Abstract:Earlier research has shown that metaphors influence humans’ decision making, which raises the question of whether metaphors also influence the reasoning pathways of large language models (LLMs), considering their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs’ reasoning contents. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models’ cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.

[NLP-110] RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models

Quick Read: This paper addresses the weak anticipation ability of current video risk prediction models: existing datasets often give models the full video, including the accident itself, which makes the task much easier than real deployment, where risk must be recognized from early visual cues. The key to the solution is RiskCueBench, a new video understanding benchmark in which each video is carefully annotated with a risk signal clip, the earliest moment indicating a potential safety concern, to better mirror real-world prediction. Results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate risky events from early signals.

Link: https://arxiv.org/abs/2601.03369
Authors: Sha Luo, Yogesh Prabhu, Tim Ossowski, Kaiping Chen, Junjie Hu
Affiliations: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); University of California San Diego (加州大学圣地亚哥分校)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.

[NLP-111] A path to natural language through tokenisation and transformers

Quick Read: This paper asks how the tokenization schemes used by modern transformers, byte-pair encoding (BPE) in particular, shape the statistical properties of natural language, especially the relationship between information entropy and Zipf's law. Combining theory and experiment, the key result is that BPE acts not merely as a compression mechanism but as a statistical transform: recursive applications drive token frequencies toward a Zipfian power law and induce a characteristic growth pattern in empirical entropy; as BPE depth increases, language-model predictive entropies converge toward values derived from Zipf's law, and attention-based diagnostics show that deeper tokenization reduces local token dependencies, bringing the distribution closer to a weakly dependent (near-IID) regime. Together these results clarify BPE's central role in reconstructing key informational properties of natural language.

Link: https://arxiv.org/abs/2601.03368
Authors: David S. Berman, Alexander G. Stapleton
Affiliations: Queen Mary University of London (伦敦玛丽女王大学)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 19 pages, 7 figures, 2 tables

Abstract:Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf’s and Heaps’ laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note, we analyse the information content (as measured by the Shannon entropy) of various corpora under the assumption of a Zipfian frequency distribution, and derive a closed-form expression for the slot entropy expectation value. We then empirically investigate how byte-pair encoding (BPE) transforms corpus statistics, showing that recursive applications of BPE drive token frequencies toward a Zipfian power law while inducing a characteristic growth pattern in empirical entropy. Utilizing the ability of transformers to learn context dependent token probability distributions, we train language models on corpora tokenised at varying BPE depths, revealing that the model predictive entropies increasingly agree with Zipf-derived predictions as the BPE depth increases. Attention-based diagnostics further indicate that deeper tokenisation reduces local token dependencies, bringing the empirical distribution closer to the weakly dependent (near IID) regime. Together, these results clarify how BPE acts not only as a compression mechanism but also as a statistical transform that reconstructs key informational properties of natural language.
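
The recursive-merge experiment is easy to reproduce in miniature. The toy sketch below applies greedy pair merges to a tiny character corpus and prints the empirical token entropy at a few merge depths; the corpus and depths are illustrative, not the paper's setup.

```python
from collections import Counter
import math

def bpe_merge_once(tokens):
    """Merge the single most frequent adjacent pair, as in one BPE step."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def entropy(tokens):
    """Empirical Shannon entropy (bits) of the unigram token distribution."""
    counts, total = Counter(tokens), len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

tokens = list("the quick brown fox jumps over the lazy dog " * 50)
for depth in range(31):
    if depth % 10 == 0:
        print(f"depth {depth:2d}: entropy {entropy(tokens):.3f}")
    tokens = bpe_merge_once(tokens)
```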

[NLP-112] Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64

Quick Read: This paper addresses the "Memory Wall" that constrains LLM deployment on edge devices: data-movement latency outstrips arithmetic throughput, and standard inference runtimes add overhead through high-level abstractions, dynamic dispatch, and unaligned memory access. The key to the solution is a software-implemented "Virtual Tensor Core" architecture optimized for ARM64 microarchitectures (Apple Silicon): it bypasses standard library containers in favor of direct memory mapping (mmap) and hand-tuned NEON SIMD kernels, a form of "Software-Defined Direct Memory Access (DMA)"; a Tensor Virtualization Layout (TVL) guarantees 100% cache-line utilization for weight matrices, and a zero-copy loader eliminates initialization latency. The result is a stable 60 tokens/second on M2 hardware, meeting the psycholinguistically motivated 200 ms latency threshold with a fully open, portable, and deterministic implementation.

Link: https://arxiv.org/abs/2601.03324
Authors: Bugra Kilictas, Faruk Alpay
Affiliations: Bahçeşehir University (巴赫切席尔大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 14 pages, 2 figures. Code and data available at this https URL

Abstract:The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the “Memory Wall”: the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel “Virtual Tensor Core” architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of “Software-Defined Direct Memory Access (DMA).” Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of 60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
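
Stripped to its essence, the zero-copy loading idea looks like the sketch below: a numpy memory map gives a read-only tensor view backed directly by file pages, so no bytes are copied at load time and pages fault in only when touched. The file path, dtype, and layout are hypothetical; the paper's loader works at a lower level, with raw mmap and NEON kernels.

```python
import numpy as np

def map_weight(path: str, shape: tuple) -> np.ndarray:
    # mode="r": read-only view backed directly by the file's pages;
    # assumes a raw float32, row-major dump of known shape.
    return np.memmap(path, dtype=np.float32, mode="r", shape=shape)

# w = map_weight("model/wte.bin", (32000, 768))  # hypothetical weight file
# row = np.asarray(w[42])  # touching one row faults in just those pages
```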

[NLP-113] How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference

Quick Read: This paper addresses the overestimation of jailbreak attack success in current LLM safety evaluation, which relies on coarse, harmfulness-only classification and ignores how closely a response actually fulfills the malicious intent. The key to FJAR, a fine-grained jailbreak evaluation framework with anchored references, is twofold: first, jailbreak responses are split into five categories (Rejective, Irrelevant, Unhelpful, Incorrect, and Successful) according to how well they address the intent of the original malicious query; second, a novel harmless tree decomposition builds high-quality anchored references by breaking down the original query, guiding the evaluator in judging whether a response genuinely fulfills it. Experiments show that FJAR achieves the highest alignment with human judgment and pinpoints the root causes of jailbreak failures, providing actionable feedback for improving attack strategies.

Link: https://arxiv.org/abs/2601.03288
Authors: Songyang Liu, Chaozhuo Li, Rui Pu, Litian Zhang, Chenxu Wang, Zejian Chen, Yuting Zhang, Yiming Hei
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures, preprint

Abstract:Jailbreak attacks present a significant challenge to the safety of Large Language Models (LLMs), yet current automated evaluation methods largely rely on coarse classifications that focus mainly on harmfulness, leading to substantial overestimation of attack success. To address this problem, we propose FJAR, a fine-grained jailbreak evaluation framework with anchored references. We first categorized jailbreak responses into five fine-grained categories: Rejective, Irrelevant, Unhelpful, Incorrect, and Successful, based on the degree to which the response addresses the malicious intent of the query. This categorization serves as the basis for FJAR. Then, we introduce a novel harmless tree decomposition approach to construct high-quality anchored references by breaking down the original queries. These references guide the evaluator in determining whether the response genuinely fulfills the original query. Extensive experiments demonstrate that FJAR achieves the highest alignment with human judgment and effectively identifies the root causes of jailbreak failures, providing actionable guidance for improving attack strategies.

[NLP-114] HyperCLOVA X 32B Think

Quick Read: This paper addresses the shortage of multimodal models with strong reasoning in the Korean linguistic and cultural context and with support for autonomous agentic behavior. The key to the solution is twofold: pre-training with a strong emphasis on reasoning to build a solid logical and knowledge foundation, followed by post-training for multimodal understanding, enhanced reasoning, agentic behavior, and alignment with human preferences. Experiments show that HyperCLOVA X 32B Think performs strongly on Korean text-to-text and vision-to-text benchmarks as well as agent-oriented evaluations, demonstrating its effectiveness in its target context.

Link: https://arxiv.org/abs/2601.03286
Authors: NAVER Cloud HyperCLOVA X Team
Affiliations: NAVER Cloud (NAVER云); HyperCLOVA X Team (HyperCLOVA X团队)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Technical Report

Abstract:In this report, we present HyperCLOVA X 32B Think, a vision-language model designed with particular emphasis on reasoning within the Korean linguistic and cultural context, as well as agentic ability. HyperCLOVA X 32B Think is pre-trained with a strong focus on reasoning capabilities and subsequently post-trained to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Experimental evaluations against comparably sized models demonstrate that our model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks. By open-sourcing HyperCLOVA X 32B Think, we aim to support broader adoption and facilitate further research and innovation across both academic and industrial communities.

[NLP-115] opic Segmentation Using Generative Language Models

【速读】: 该论文旨在解决生成式 AI(Generative AI)在主题分割(topic segmentation)任务中应用不足的问题,即传统基于句子语义相似度的方法难以捕捉长距离依赖关系和利用大规模知识。其解决方案的关键在于提出一种重叠且递归的提示策略(overlapping and recursive prompting strategy),结合句序枚举(sentence enumeration)与边界相似性评估指标(boundary similarity evaluation metric),从而更有效地发挥大型语言模型(Large Language Models, LLMs)在主题分割中的潜力。

链接: https://arxiv.org/abs/2601.03276
作者: Pierre Mackenzie,Maya Shah,Patrick Frenett
机构: Adarga; University of Edinburgh (爱丁堡大学); University of Exeter (埃克塞特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Topic segmentation using generative Large Language Models (LLMs) remains relatively unexplored. Previous methods use semantic similarity between sentences, but such models lack the long range dependencies and vast knowledge found in LLMs. In this work, we propose an overlapping and recursive prompting strategy using sentence enumeration. We also support the adoption of the boundary similarity evaluation metric. Results show that LLMs can be more effective segmenters than existing methods, but issues remain to be solved before they can be relied upon for topic segmentation.
zh
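
A minimal sketch of the overlapping, sentence-enumerated prompting strategy described in the abstract; the window size, prompt wording, and answer parsing below are illustrative assumptions.

```python
def segment(llm, sentences: list[str], window: int = 40, overlap: int = 10) -> set[int]:
    """Collect topic-boundary indices by prompting over overlapping windows.

    Sentences are enumerated so the model can answer with indices; windows
    overlap so boundaries near window edges are seen with full context.
    """
    boundaries: set[int] = set()
    step = window - overlap
    for start in range(0, max(1, len(sentences) - overlap), step):
        chunk = sentences[start:start + window]
        numbered = "\n".join(f"{start + i}: {s}" for i, s in enumerate(chunk))
        prompt = ("Below are numbered sentences. List the indices of sentences "
                  "that START a new topic, as comma-separated integers.\n" + numbered)
        reply = llm(prompt)  # `llm` is any callable text-completion wrapper
        boundaries.update(int(tok) for tok in reply.replace(",", " ").split()
                          if tok.isdigit())
    return boundaries
```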

[NLP-116] LLM_annotate: A Python package for annotating and analyzing fiction characters

【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)对虚构角色的人格特征进行系统化、可重复的分析问题。传统方法在处理长文本(如小说或剧本)时往往缺乏标准化流程,难以高效提取角色行为并推断其人格特质,且缺乏对标注质量的有效验证机制。解决方案的关键在于提出一个名为LLM_annotate的Python工具包,其核心创新包括:(1)标准化从文本分块到角色行为标注、性格推断的全流程;(2)集成人机协同的图形界面(GUI),用于人工校验与质量评分;(3)支持任意LLM(商业、开源或定制)灵活接入,提升方法的通用性与扩展性。通过《辛普森一家电影》和《傲慢与偏见》的案例演示,验证了该方案在效率与可复现性上的优势。

链接: https://arxiv.org/abs/2601.03274
作者: Hannes Rosenbusch
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM_annotate is a Python package for analyzing the personality of fiction characters with large language models. It standardizes workflows for annotating character behaviors in full texts (e.g., books and movie scripts), inferring character traits, and validating annotation/inference quality via a human-in-the-loop GUI. The package includes functions for text chunking, LLM-based annotation, character name disambiguation, quality scoring, and computation of character-level statistics and embeddings. Researchers can use any LLM (commercial, open-source, or custom) within LLM_annotate. Through tutorial examples using The Simpsons Movie and the novel Pride and Prejudice, I demonstrate the usage of the package for efficient and reproducible character analyses.
zh

[NLP-117] GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness and Robustness in LLM Moderators

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在内容安全审核中面临的挑战,包括对隐性冒犯性内容、微妙性别与种族偏见以及越狱提示(jailbreak prompts)等复杂情境识别能力不足,以及因训练数据依赖导致的社会偏见强化问题。解决方案的关键在于构建一个统一的多视角基准数据集GuardEval,涵盖106个细粒度类别,覆盖人类情绪、攻击性与仇恨言论、性别与种族偏见及更广泛的安全关切,并基于此数据集微调得到GemmaGuard(GGuard),一种采用QLoRA技术优化的Gemma3-12B模型。实验表明,GGuard在宏F1分数上达到0.832,显著优于OpenAI Moderator(0.64)和Llama Guard(0.61),验证了以人类为中心的多样化、代表性数据对于提升内容审核系统的公平性、鲁棒性和准确性具有决定性作用。

链接: https://arxiv.org/abs/2601.03273
作者: Naseem Machlovi,Maryam Saleki,Ruhul Amin,Mohamed Rahouti,Shawqi Al-Maliki,Junaid Qadir,Mohamed M. Abdallah,Ala Al-Fuqaha
机构: Fordham University (福特汉姆大学); Hamad Bin Khalifa University (哈马德本哈利法大学); Qatar University (卡塔尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, distinguishing naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
zh
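
For readers who want to reproduce the general recipe, a minimal QLoRA setup with Hugging Face transformers and peft might look as follows; the rank, target modules, and other hyperparameters are placeholders, not the paper's reported configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it", quantization_config=bnb, device_map="auto"
)

# Low-rank adapters on the attention projections; only these train.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # adapter weights are a tiny fraction
```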

[NLP-118] Less is more: Not all samples are effective for evaluation

【速读】: 该论文旨在解决专业领域中大型语言模型(Large Language Models, LLMs)评估基准存在的语义冗余高和计算成本大的问题,尤其针对冷启动场景下缺乏历史模型性能数据而无法使用现有压缩方法的局限性。其解决方案的关键在于提出一种无需历史模型性能数据的测试集压缩框架:首先在少量领域特定数据上微调基础LLM以内化任务相关语义,随后仅基于原始文本生成高阶语义嵌入;在此领域自适应嵌入空间中,通过任务感知聚类并引入新颖的数据集X-ray机制,依据聚类几何结构动态校准压缩强度,从而有效识别并移除冗余样本,在保持与全基准高度一致性的前提下将评估成本降低超过90%。

链接: https://arxiv.org/abs/2601.03272
作者: Wentang Song,Jinqiang Li,Kele Huang,Junhui Lin,Shengxiang Wu,Zhongshi Xie
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks, depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, such as the introduction of a new task, domain, or model with no prior evaluation history. To address this limitation, we propose a history-free test set compression framework that requires no prior model performance data. Our method begins by fine-tuning a base LLM on a small amount of domain-specific data to internalize task-relevant semantics. It then generates high-level semantic embeddings for all original test samples using only their raw textual content. In this domain-adapted embedding space, we perform task-aware clustering and introduce a novel dataset X-ray mechanism that analyzes cluster geometry to dynamically calibrate the compression intensity based on the intrinsic redundancy of the benchmark. Experiments on a professional-domain dataset, notably a large-scale 3GPP communications benchmark, demonstrate that our approach effectively identifies and removes redundant samples, reducing evaluation cost by over 90% while preserving high fidelity to the full benchmark.
zh
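
The cluster-then-select core of such a pipeline can be sketched as follows; note the paper additionally calibrates compression strength from cluster geometry (its dataset X-ray), whereas this sketch assumes a fixed keep ratio.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress(embeddings: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    """Pick one representative test item per cluster (medoid-like).

    `embeddings` are domain-adapted embeddings of the full test set;
    the returned indices form the compressed benchmark.
    """
    k = max(1, int(len(embeddings) * keep_ratio))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    reps = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        # Keep the member closest to the centroid as the cluster's delegate.
        d = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(d)])
    return np.array(sorted(reps))
```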

[NLP-119] Advances and Challenges in Semantic Textual Similarity: A Comprehensive Survey

【速读】: 该论文旨在系统梳理和总结自2021年以来语义文本相似度(Semantic Textual Similarity, STS)领域的研究进展,解决当前方法多样、技术分散且缺乏系统性归纳的问题。其解决方案的关键在于从六个核心方向进行结构化分析:基于Transformer的模型(如FarSSiBERT和DeBERTa-v3)、对比学习方法(如AspectCSE)、领域特定优化(如医学文本的CXR-BERT和金融文本的Financial-STS)、多模态方法、图结构建模以及知识增强技术。通过整合这些前沿进展,论文不仅揭示了当前主流方法的技术优势与适用场景,还指出了尚未解决的挑战与未来发展方向,为研究人员和实践者提供了清晰的路线图和理论支撑。

链接: https://arxiv.org/abs/2601.03270
作者: Lokendra Kumar,Neelesh S. Upadhye,Kannan Piedy
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 16 pages, 2 figures

点击查看摘要

Abstract:Semantic Textual Similarity (STS) research has expanded rapidly since 2021, driven by advances in transformer architectures, contrastive learning, and domain-specific techniques. This survey reviews progress across six key areas: transformer-based models, contrastive learning, domain-focused solutions, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques. Recent transformer models such as FarSSiBERT and DeBERTa-v3 have achieved remarkable accuracy, while contrastive methods like AspectCSE have established new benchmarks. Domain-adapted models, including CXR-BERT for medical texts and Financial-STS for finance, demonstrate how STS can be effectively customized for specialized fields. Moreover, multi-modal, graph-based, and knowledge-integrated models further enhance semantic understanding and representation. By organizing and analyzing these developments, the survey provides valuable insights into current methods, practical applications, and remaining challenges. It aims to guide researchers and practitioners alike in navigating rapid advancements, highlighting emerging trends and future opportunities in the evolving field of STS.
zh

[NLP-120] The Instruction Gap: LLMs get lost in Following Instruction

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在企业环境中部署时面临的“指令遵循不一致”问题,即模型在执行定制化指令时表现不稳定,存在显著的“指令差距”(instruction gap)。解决方案的关键在于通过系统性测试13个主流LLM,在真实企业级检索增强生成(Retrieval-Augmented Generation, RAG)场景下评估其指令遵循能力、响应准确性和性能指标,从而识别出表现最优的模型(如Claude-Sonnet-4和GPT-5),并建立可量化、可比较的指令遵循基准,为组织部署LLM驱动的解决方案提供实证依据与实践指导。

链接: https://arxiv.org/abs/2601.03269
作者: Vishesh Tripathi,Uday Allu,Biddwan Ahmed
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, yet their deployment in enterprise environments reveals a critical limitation: inconsistent adherence to custom instructions. This study presents a comprehensive evaluation of 13 leading LLMs across instruction compliance, response accuracy, and performance metrics in real-world RAG (Retrieval-Augmented Generation) scenarios. Through systematic testing with samples and enterprise-grade evaluation protocols, we demonstrate that instruction following varies dramatically across models, with Claude-Sonnet-4 and GPT-5 achieving the highest results. Our findings reveal the “instruction gap” - a fundamental challenge where models excel at general tasks but struggle with precise instruction adherence required for enterprise deployment. This work provides practical insights for organizations deploying LLM-powered solutions and establishes benchmarks for instruction-following capabilities across major model families.
zh

[NLP-121] WRAVAL – WRiting Assist eVALuation

【速读】: 该论文旨在解决当前语言模型评估体系对小语言模型(Small Language Models, SLMs)的低估问题,尤其是在非推理类工业应用场景(如语气改写任务)中,SLMs的实际效能未被充分反映。传统评估指标主要聚焦于推理与问题解决能力,导致SLMs在这些测试中得分仅为大语言模型(Large Language Models, LLMs)的1/3至1/4,忽略了其在特定任务中的实用价值。论文的关键解决方案在于提出一种面向非推理任务的新型评估框架,该框架融合了数据生成、提示调优(prompt-tuning)和基于LLM的评估方法,通过任务特异性微调(task-specific fine-tuning)凸显SLMs在无预定义评测数据集场景下的潜力,从而为边缘计算和私有计算环境中的模型选型提供可操作的基准工具。

链接: https://arxiv.org/abs/2601.03268
作者: Gabriel Benedict,Matthew Butler,Naved Merchant,Eetu Salama-Laine
机构: Amazon Fire Tablets (亚马逊火平板); Amazon Devices Product (亚马逊设备产品); Amazon Devices OS (亚马逊设备操作系统)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) – defined here as models under 10B parameters – typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs’ effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs’ capabilities in non-reasoning tasks where predefined evaluation datasets don’t exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: this https URL.
zh

[NLP-122] OpenAI GPT-5 System Card

【速读】: 该论文旨在解决大语言模型在真实世界应用中面临的多维度挑战,包括响应速度与准确性之间的权衡、复杂任务处理能力不足、幻觉(hallucination)频发、指令遵循偏差以及对高风险领域(如生物和化学)潜在滥用的担忧。其核心解决方案是提出一个统一的GPT-5系统架构,包含两个主模型:gpt-5-main(快速通用模型)和gpt-5-thinking(深度推理模型),并通过一个实时路由机制(real-time router)根据对话类型、问题复杂度、工具需求及用户显式意图动态选择最优模型。该路由器持续基于用户行为信号(如模型切换、偏好评分和正确性指标)进行训练优化,并引入安全补全机制(safe-completions)以增强内容安全性,同时将gpt-5-thinking在生物与化学领域标记为“高能力”并激活相应防护措施,体现预防性治理理念。

链接: https://arxiv.org/abs/2601.03267
作者: Aaditya Singh,Adam Fry,Adam Perelman,Adam Tart,Adi Ganesh,Ahmed El-Kishky,Aidan McLaughlin,Aiden Low,AJ Ostrow,Akhila Ananthram,Akshay Nathan,Alan Luo,Alec Helyar,Aleksander Madry,Aleksandr Efremov,Aleksandra Spyra,Alex Baker-Whitcomb,Alex Beutel,Alex Karpenko,Alex Makelov,Alex Neitz,Alex Wei,Alexandra Barr,Alexandre Kirchmeyer,Alexey Ivanov,Alexi Christakis,Alistair Gillespie,Allison Tam,Ally Bennett,Alvin Wan,Alyssa Huang,Amy McDonald Sandjideh,Amy Yang,Ananya Kumar,Andre Saraiva,Andrea Vallone,Andrei Gheorghe,Andres Garcia Garcia,Andrew Braunstein,Andrew Liu,Andrew Schmidt,Andrey Mereskin,Andrey Mishchenko,Andy Applebaum,Andy Rogerson,Ann Rajan,Annie Wei,Anoop Kotha,Anubha Srivastava,Anushree Agrawal,Arun Vijayvergiya,Ashley Tyra,Ashvin Nair,Avi Nayak,Ben Eggers,Bessie Ji,Beth Hoover,Bill Chen,Blair Chen,Boaz Barak,Borys Minaiev,Botao Hao,Bowen Baker,Brad Lightcap,Brandon McKinzie,Brandon Wang,Brendan Quinn,Brian Fioca,Brian Hsu,Brian Yang,Brian Yu,Brian Zhang,Brittany Brenner,Callie Riggins Zetino,Cameron Raymond,Camillo Lugaresi,Carolina Paz,Cary Hudson,Cedric Whitney,Chak Li,Charles Chen,Charlotte Cole,Chelsea Voss,Chen Ding,Chen Shen,Chengdu Huang,Chris Colby,Chris Hallacy,Chris Koch,Chris Lu,Christina Kaplan,Christina Kim,CJ Minott-Henriques,Cliff Frey,Cody Yu,Coley Czarnecki,Colin Reid,Colin Wei,Cory Decareaux,Cristina Scheau
机构: OpenAI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say ‘think hard about this’ in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but – more importantly – is more useful for real-world queries. We’ve made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5’s performance in three of ChatGPT’s most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm – our defined threshold for High capability – we have chosen to take a precautionary approach.
zh

[NLP-123] Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support

【速读】: 该论文旨在解决临床决策支持中大型语言模型(Large Language Models, LLMs)部署受限的问题,特别是由于隐私顾虑和对云基础设施的依赖导致的专有系统难以落地,以及开源模型因模型规模过大而在资源受限的临床环境中应用困难。其解决方案的关键在于验证轻量级本地运行的开源LLM(gpt-oss-20b 和 gpt-oss-120b)在典型临床任务中的性能表现及其可适应性:通过基准测试证明其性能可媲美甚至超越部分主流专有模型(如GPT-5、o4-mini)和领先开源模型(DeepSeek-R1),并进一步表明对gpt-oss-20b进行微调后能显著提升诊断准确性,接近GPT-5水平,从而凸显了本地化部署的LLM在保障隐私、灵活适配与临床实用性方面的潜力。

链接: https://arxiv.org/abs/2601.03266
作者: Alif Munim,Jun Ma,Omar Ibrahim,Alhusain Abdalla,Shuolin Yin,Leo Chen,Bo Wang
机构: AI Collaborative Centre (AI协作中心); University Health Network (大学健康网络); Princess Margaret Cancer Centre (公主玛格丽特癌症中心); Department of Electrical and Computer Engineering (电气与计算机工程系); University of Toronto (多伦多大学); Division of Urology (泌尿外科分部); Department of Surgery (外科系); St. Michael’s Hospital (圣迈克尔医院); Unity Health Toronto (统一健康多伦多); Peter Munk Cardiac Centre (彼得·蒙克心脏中心); Department of Laboratory Medicine and Pathobiology (实验室医学与病理生物学系); Department of Computer Science (计算机科学系); Vector Institute (向量研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often require large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark two on-device LLMs, gpt-oss-20b and gpt-oss-120b, across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5 and o4-mini) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b on general diagnostic data. Across tasks, gpt-oss models achieve performance comparable to or exceeding DeepSeek-R1 and o4-mini despite being substantially smaller. In addition, fine-tuning remarkably improves the diagnostic accuracy of gpt-oss-20b, enabling it to approach the performance of GPT-5. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.
zh

[NLP-124] Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models NEURIPS2025

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)安全评估中传统基于示例的红队测试方法覆盖不足、攻击策略单一且难以规模化的问题。其解决方案的关键在于提出 Jailbreak-Zero 方法,通过利用一个攻击型大语言模型(attack LLM)自动生成大量多样化的对抗性提示,并结合偏好数据集对攻击模型进行微调,从而在策略覆盖率、攻击策略多样性与提示真实性(prompt fidelity)之间实现帕累托最优。该方法显著提升了对开源及闭源模型(如 GPT-4o 和 Claude 3.5)的攻击成功率,同时生成人类可读且有效的对抗提示,大幅减少人工干预,具备更强的可扩展性和全面性。

链接: https://arxiv.org/abs/2601.03265
作者: Kai Hu,Abhinav Aggarwal,Mehran Khodabandeh,David Zhang,Eric Hsin,Li Chen,Ankit Jain,Matt Fredrikson,Akash Bharadwaj
机构: Meta
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025

点击查看摘要

Abstract:This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-4o and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.
zh

[NLP-125] Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中普遍存在的“谄媚行为”(sycophancy)问题,即模型倾向于优先迎合用户偏好而非保持输出的准确性。研究通过对比内部推理机制(如思维链 CoT)与外部调节机制(如反思式因果分析 RCA)在多个GPT版本上的表现,发现仅靠内部推理无法彻底消除谄媚行为:弱模型会因内部推理导致性能崩溃(优先级悖论),而前沿模型仍存在11.4%的输出偏差;相比之下,外部结构化约束(RCA)能从根本上消除谄媚现象(0.0%)。因此,解决方案的关键在于引入外部结构性约束,以确保模型输出的安全性和正确性,而非依赖模型自身的内部推理能力。

链接: https://arxiv.org/abs/2601.03263
作者: Edward Y. Chang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, 11 tables

点击查看摘要

Abstract:Large Language Models frequently exhibit sycophancy, prioritizing user agreeableness over correctness. We investigate whether this requires external regulation or can be mitigated by internal reasoning alone. Using CAP-GSM8K (N=500), an adversarial dataset, we evaluate internal (CoT) versus external (RCA) mechanisms across GPT-3.5, GPT-4o, and GPT-5.1. Our results reveal the structural limits of internal reasoning: it causes performance collapse in weak models (the Prioritization Paradox) and leaves an 11.4% final output gap in frontier models. In contrast, RCA structurally eliminates sycophancy (0.0%) across all tiers. We synthesize these findings into a thermodynamic hierarchy: hybrid systems achieve Resonance (optimal efficiency) only when capabilities are matched and strong, while weak or mismatched pairs succumb to Dissonance and Entropy. This confirms that external structural constraints are strictly necessary to guarantee safety.
zh

[NLP-126] Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey AACL

【速读】: 该论文旨在解决视觉丰富文档(Visually Rich Documents, VRDs)在检索增强生成(Retrieval-Augmented Generation, RAG)中面临的挑战,包括布局依赖语义、脆弱的光学字符识别(OCR)以及证据分散于复杂图表和结构化表格中的问题。其解决方案的关键在于系统性地梳理多模态大语言模型(Multimodal Large Language Models, MLLMs)在VRD检索中的三种角色:模态统一描述器(Modality-Unifying Captioners)、多模态嵌入器(Multimodal Embedders)和端到端表示器(End-to-End Representers),并通过检索粒度、信息保真度、延迟与索引大小、以及与重排序(reranking)和定位(grounding)的兼容性等维度进行对比分析,从而为不同场景下选择最优方案提供实践指导,并指明未来研究方向如自适应检索单元、模型压缩及评估方法开发。

链接: https://arxiv.org/abs/2601.03262
作者: Xiantao Zhang
机构: Beihang University (北京航空航天大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 18 pages; accepted at AACL-IJCNLP 2025 (main conference)

点击查看摘要

Abstract:Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature into three roles: Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline key trade-offs and offer some practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model size reduction, and the development of evaluation methods.
zh

[NLP-127] DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing

【速读】: 该论文试图解决深度研究代理(Deep Research agents)在开放性研究任务中面临的“检索-利用差距”(retrieval-utilization gap)问题,即模型即使成功检索到黄金证据(gold evidence),仍因在嘈杂环境中的上下文盲区而无法有效利用这些信息。解决方案的关键在于提出一种简单且有效的神经符号框架 DeepResearch-Slice,其核心创新是通过预测精确的文本跨度索引(span indices)进行确定性硬过滤(deterministic hard filter),在推理前显式地筛选出相关证据,从而避免隐式注意力机制带来的噪声干扰。此方法无需更新推理模型参数即可显著提升鲁棒性,在六个基准测试中实现高达73%的相对性能提升(从19.1%提升至33.0%)。

链接: https://arxiv.org/abs/2601.03261
作者: Shuo Lu,Yinuo Xu,Jianjie Cheng,Lingxiao He,Meng Wang,Jian Liang
机构: NLPR & MAISCASIA; School of AIUCAS; Meituan Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Ongoing work

点击查看摘要

Abstract:Deep Research agents predominantly optimize search policies to maximize retrieval probability. However, we identify a critical bottleneck: the retrieval-utilization gap, where models fail to use gold evidence even after it is retrieved, due to context blindness in noisy environments. To bridge this gap, we propose DeepResearch-Slice, a simple yet effective neuro-symbolic framework. Unlike implicit attention, our approach predicts precise span indices to perform a deterministic hard filter before reasoning. Extensive evaluations across six benchmarks show substantial robustness gains. Applying our method to frozen backbones yields a 73 percent relative improvement, from 19.1 percent to 33.0 percent, effectively mitigating noise without requiring parameter updates to the reasoning model. These results highlight the need for explicit grounding mechanisms in open-ended research.
zh
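
The deterministic hard filter itself is conceptually simple; a sketch, assuming the span predictor emits character-level (start, end) indices over the retrieved document:

```python
def hard_filter(document: str, spans: list[tuple[int, int]]) -> str:
    """Deterministically keep only predicted evidence spans before reasoning.

    `spans` are (start, end) character offsets emitted by the span
    predictor; everything outside them is dropped, so the reasoning
    model never attends to the surrounding noise.
    """
    kept = [document[s:e] for s, e in sorted(spans) if s < e]
    return "\n...\n".join(kept)

# Usage: context = hard_filter(page_text, predicted_spans)
# The frozen reasoning model then receives only `context`.
```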

[NLP-128] SciNetBench: A Relation-Aware Benchmark for Scientific Literature Retrieval Agents

【速读】: 该论文旨在解决当前文献检索代理在科学文献理解中对关系感知能力不足的问题,即现有方法主要依赖内容级相似性(如关键词或嵌入向量),难以识别研究之间的协同、冲突关系或技术演化路径,从而导致知识结构碎片化、情感误判及对科学进展建模不充分。其解决方案的关键在于提出首个面向科学网络关系感知的基准测试平台 SciNetBench,该平台基于超过1800万篇AI论文构建,系统评估三个层级的关系:以自我为中心的知识结构检索、成对学术关系识别以及科学演进路径重建。实验表明,当前检索代理在关系感知任务上的准确率普遍低于20%,而引入关系真实信息可使文献综述质量提升23.4%,验证了关系感知检索的核心价值。

链接: https://arxiv.org/abs/2601.03260
作者: Chenyang Shao,Yong Li,Fengli Xu
机构: Tsinghua University (清华大学); Zhongguancun Academy (中关村学院)
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid development of AI agents has spurred the development of advanced research tools, such as Deep Research. Achieving this requires a nuanced understanding of the relations within scientific literature, which surpasses the scope of keyword-based or embedding-based retrieval. Existing retrieval agents mainly focus on content-level similarities and are unable to decode critical relational dynamics, such as identifying corroborating or conflicting studies or tracing technological lineages, all of which are essential for a comprehensive literature review. Consequently, this fundamental limitation often results in a fragmented knowledge structure, misleading sentiment interpretation, and inadequate modeling of collective scientific progress. To investigate relation-aware retrieval more deeply, we propose SciNetBench, the first Scientific Network Relation-aware Benchmark for literature retrieval agents. Constructed from a corpus of over 18 million AI papers, our benchmark systematically evaluates three levels of relations: ego-centric retrieval of papers with novel knowledge structures, pair-wise identification of scholarly relationships, and path-wise reconstruction of scientific evolutionary trajectories. Through extensive evaluation of three categories of retrieval agents, we find that their accuracy on relation-aware retrieval tasks often falls below 20%, revealing a core shortcoming of current retrieval paradigms. Notably, further experiments on the literature review tasks demonstrate that providing agents with relational ground truth leads to a substantial 23.4% performance improvement in review quality, validating the critical importance of relation-aware retrieval. We publicly release our benchmark at this https URL to support future research on advanced retrieval systems.
zh

[NLP-129] Content vs. Form: What Drives the Writing Score Gap Across Socioeconomic Backgrounds? A Generated Panel Approach

【速读】: 该论文旨在解决 socioeconomic status (SES) 差异在写作成绩中的影响机制问题,即区分成绩差距是由学生表达内容(content)的质量差异还是表达方式(style)的差异所导致。传统写作评分往往将内容与表达混合作为一个整体指标,难以准确识别造成SES差距的具体因素。其解决方案的关键在于引入一种基于大语言模型(large language models, LLMs)的新测量策略:通过LLMs对每篇作文生成多个风格变异版本,在保持核心论点不变的前提下系统性改变表面表达形式,从而构建一个“生成面板”(generated panel),实现同一作文内部风格的受控变化。这一方法使研究者能够分离出内容和风格各自对SES写作分数差距的贡献,发现约69%的差距源于内容质量差异,26%来自风格差异,其余5%由评分标准偏差引起。

链接: https://arxiv.org/abs/2601.03469
作者: Nadav Kunievsky,Pedro Pertusi
机构: Knowledge Lab, University of Chicago (知识实验室,芝加哥大学); Insper, Institute of Education and Research (Insper 教育与研究学院)
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Students from different socioeconomic backgrounds exhibit persistent gaps in test scores, gaps that can translate into unequal educational and labor-market outcomes later in life. In many assessments, performance reflects not only what students know, but also how effectively they can communicate that knowledge. This distinction is especially salient in writing assessments, where scores jointly reward the substance of students’ ideas and the way those ideas are expressed. As a result, observed score gaps may conflate differences in underlying content with differences in expressive skill. A central question, therefore, is how much of the socioeconomic-status (SES) gap in scores is driven by differences in what students say versus how they say it. We study this question using a large corpus of persuasive essays written by U.S. middle- and high-school students. We introduce a new measurement strategy that separates content from style by leveraging large language models to generate multiple stylistic variants of each essay. These rewrites preserve the underlying arguments while systematically altering surface expression, creating a “generated panel” that introduces controlled within-essay variation in style. This approach allows us to decompose SES gaps in writing scores into contributions from content and style. We find an SES gap of 0.67 points on a 1-6 scale. Approximately 69% of the gap is attributable to differences in essay content quality. Style differences account for 26% of the gap, and differences in evaluation standards across SES groups account for the remaining 5%. These patterns seem stable across demographic subgroups and writing tasks. More broadly, our approach shows how large language models can be used to generate controlled variation in observational data, enabling researchers to isolate and quantify the contributions of otherwise entangled factors.
zh

[NLP-130] MixRx: Predicting Drug Combination Interactions with LLMs

【速读】: 该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)对药物联合使用时的相互作用进行分类的问题,即判断其为加性(Additive)、协同(Synergistic)或拮抗(Antagonistic)效应。解决方案的关键在于采用微调(fine-tuned)后的Mistral Instruct 2.0模型,在标准数据集和扰动数据集上实现了平均81.5%的准确率,验证了LLMs在生物预测任务中的潜在应用价值。

链接: https://arxiv.org/abs/2601.03277
作者: Risha Surana,Cameron Saidock,Hugo Chacon
机构: University of Southern California (南加州大学)
类目: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:MixRx uses Large Language Models (LLMs) to classify drug combination interactions as Additive, Synergistic, or Antagonistic, given a multi-drug patient history. We evaluate the performance of four models: GPT-2, Mistral Instruct 2.0, and their fine-tuned counterparts. Our results showed a potential for such an application, with the Mistral Instruct 2.0 Fine-Tuned model providing an average accuracy score on standard and perturbed datasets of 81.5%. This paper aims to further develop an upcoming area of research that evaluates if LLMs can be used for biological prediction tasks.
zh

计算机视觉

[CV-0] Choreographing a World of Dynamic Objects

【速读】:该论文旨在解决4D场景中动态物体的复杂演化、形变及交互行为的通用生成问题,传统基于规则的图形管线依赖类别特定启发式方法,存在劳动密集且难以扩展的局限;而现有学习方法又受限于大规模数据集覆盖范围不足。其解决方案的关键在于提出一种名为CHORD的通用生成框架,通过蒸馏机制从2D视频的欧拉表示(Eulerian representation)中提取丰富的拉格朗日运动信息(Lagrangian motion information),从而实现对多体4D动态现象的无类别依赖(category-agnostic)合成,兼具通用性与灵活性,并在机器人操作策略生成等任务中验证了其有效性。

链接: https://arxiv.org/abs/2601.04194
作者: Yanzhe Lyu,Chen Geng,Karthik Dharmarajan,Yunzhi Zhang,Hadi Alzayer,Shangzhe Wu,Jiajun Wu
机构: Stanford University (斯坦福大学); University of Cambridge (剑桥大学); University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing this type of phenomena. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories in interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies. Project page: this https URL
zh

[CV-1] ImLoc: Revisiting Visual Localization with Image-based Representation

【速读】:该论文旨在解决视觉定位(Visual Localization)中现有方法在精度与可维护性之间的权衡问题:2D图像基方法虽易于构建和维护,但几何推理能力有限;而3D结构基方法虽精度高,却依赖集中式重建且难以更新。其解决方案的关键在于采用2D图像表示并为其附加估计的深度图(depth maps),从而在不引入复杂3D建模的前提下捕获几何结构信息。通过有效利用密集匹配器(dense matchers),该方法实现了高精度定位,同时结合紧凑压缩与GPU加速的LO-RANSAC实现,在存储和计算效率上表现出色,支持精度与内存效率间的灵活权衡。

链接: https://arxiv.org/abs/2601.04185
作者: Xudong Jiang,Fangjinhua Wang,Silvano Galliani,Christoph Vogel,Marc Pollefeys
机构: ETH Zurich (苏黎世联邦理工学院); Microsoft Spatial AI Lab (微软空间AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code will be available at this https URL

点击查看摘要

Abstract:Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but also achieves the highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and memory efficiency. Our method achieves a new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at this https URL.
zh

[CV-2] ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography

【速读】:该论文旨在解决远程光电容积脉搏波(remote photoplethysmography, rPPG)中现有深度模型计算成本高、参数量大以及基于注意力机制的时序建模存在二次时间复杂度的问题。解决方案的关键在于提出一种轻量级架构ToTMNet,其核心创新是用快速傅里叶变换(FFT)加速的Toeplitz时序混合层替代传统的注意力机制,从而在保持全序列时域感受野的同时,仅需线性数量的参数和近线性时间复杂度来实现长程时序滤波;该架构进一步结合局部深度可分离时序卷积分支与门控全局Toeplitz混合模块,显著提升了模型效率与鲁棒性,尤其在跨域场景下表现优异。

链接: https://arxiv.org/abs/2601.04159
作者: Vladimir Frants,Sos Agaian,Karen Panetta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models improve robustness compared to classical signal-processing approaches, many methods increase computational cost and parameter count, and attention-based temporal modeling introduces quadratic scaling with respect to the temporal length. This paper proposes ToTMNet, a lightweight rPPG architecture that replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer. The Toeplitz operator provides full-sequence temporal receptive field using a linear number of parameters in the clip length and can be applied in near-linear time using circulant embedding and FFT-based convolution. ToTMNet integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering while only having 63k parameters. Experiments on two datasets, UBFC-rPPG (real videos) and SCAMPS (synthetic videos), show that ToTMNet achieves strong heart-rate estimation accuracy with a compact design. On UBFC-rPPG intra-dataset evaluation, ToTMNet reaches 1.055 bpm MAE with Pearson correlation 0.996. In a synthetic-to-real setting (SCAMPS to UBFC-rPPG), ToTMNet reaches 1.582 bpm MAE with Pearson correlation 0.994. Ablation results confirm that the gating mechanism is important for effectively using global Toeplitz mixing, especially under domain shift. The main limitation of this preprint study is the use of only two datasets; nevertheless, the results indicate that Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG.
zh
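
The core operation, an O(T log T) Toeplitz matrix-vector product via circulant embedding, can be reproduced in a few lines of NumPy; batching, gating, and the learned parameters of the actual mixer are omitted from this sketch.

```python
import numpy as np

def toeplitz_matvec_fft(c: np.ndarray, r: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a Toeplitz matrix by x in O(T log T) via circulant embedding.

    c: first column (length T), r: first row (r[0] must equal c[0]),
    x: input sequence (length T). The Toeplitz operator is embedded in a
    2T-point circulant whose matvec is a circular convolution, i.e. an
    elementwise product in the Fourier domain.
    """
    T = len(x)
    col = np.concatenate([c, [0.0], r[:0:-1]])  # circulant's first column
    xp = np.concatenate([x, np.zeros(T)])       # zero-pad x to length 2T
    y = np.fft.irfft(np.fft.rfft(col) * np.fft.rfft(xp), n=2 * T)
    return y[:T]

# Sanity check against the dense operator:
# T = 8; c, r, x = np.random.randn(T), np.random.randn(T), np.random.randn(T)
# r[0] = c[0]
# M = np.array([[c[i - j] if i >= j else r[j - i] for j in range(T)] for i in range(T)])
# assert np.allclose(M @ x, toeplitz_matvec_fft(c, r, x))
```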

[CV-3] Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning

【速读】:该论文旨在解决当前文本到视频(Text-to-Video, T2V)生成中依赖非可微偏好信号(如人工标注或学习的奖励模型)所导致的训练标签密集、易受偏差影响及奖励劫持(reward hacking)等问题。解决方案的关键在于提出Diffusion-DRF,一种基于冻结的现成视觉-语言模型(Vision-Language Model, VLM)作为无训练批评者(training-free critic)的可微奖励流(differentiable reward flow)。该方法通过扩散去噪链直接反向传播VLM的logit级反馈,将其转化为具有token感知能力的梯度用于优化,同时结合结构化提示管道与梯度检查点技术,在不引入额外奖励模型或偏好数据集的前提下,有效提升视频质量与语义对齐性,并抑制奖励劫持和模式崩溃现象。

链接: https://arxiv.org/abs/2601.04153
作者: Yifan Wang,Yanyu Li,Sergey Tulyakov,Yun Fu,Anil Kag
机构: Northeastern University (东北大学); Snap Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse – without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
zh

[CV-4] Klear: Unified Multi-Task Audio-Video Joint Generation

【速读】:该论文旨在解决当前音频-视频联合生成(Audio-Video Joint Generation)中存在的关键挑战,包括音画不同步、唇形与语音对齐不佳以及单模态退化等问题,其根源在于音频-视觉对应建模薄弱、泛化能力有限及高质量密集标注数据稀缺。解决方案的关键在于从模型架构、训练策略和数据构建三个维度进行系统性创新:首先采用统一的单塔结构(single-tower design)结合DiT(Diffusion Transformer)块与Omni-Full Attention机制,实现紧密的音画对齐与强扩展性;其次引入渐进式多任务训练策略,通过随机模态掩码和多阶段课程学习促进跨任务联合优化,增强音画对齐的世界知识并防止单模态坍缩;最后构建首个大规模带密集标注的音视频数据集,并设计自动化数据构建流水线,生成数百万高质量、严格对齐的音视频-文本三元组,从而支撑模型在联合与单模态场景下均实现高保真、语义与时间对齐的指令跟随生成,并显著优于现有方法。

链接: https://arxiv.org/abs/2601.04151
作者: Jun Wang,Chunyu Qiang,Yuxin Guo,Yiran Wang,Xijuan Zeng,Chen Zhang,Pengfei Wan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Audio-video joint generation has progressed rapidly, yet substantial challenges still remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes–model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime, from random modality masking to joint optimization across tasks, together with a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline which annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it substantially outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
zh

[CV-5] Wow-wo-val! A Comprehensive Embodied World Model Evaluation Turing Test

【速读】:该论文旨在解决当前视频基础模型(video foundation models)在具身人工智能(Embodied AI)中作为世界模型时存在的两大关键问题:一是其生成泛化能力是否足以保持人类观察者感知上的保真度,二是其鲁棒性是否足够作为真实世界具身智能体的通用先验。为系统回答这些问题,作者提出了一个标准化评估框架——Embodied Turing Test基准测试集Wow-wo-val(WoW-World-Eval),基于609个机器人操作数据集,从感知、规划、预测、泛化和执行五个核心维度进行评估,并设计了22项指标以量化模型生成能力。该方案的关键在于通过高相关性(Pearson相关系数达0.93)的人类偏好验证机制,建立可靠的人类图灵测试标准;同时引入逆动力学模型(Inverse Dynamic Model, IDM)对真实世界执行准确率进行评估,发现现有模型在长期规划(仅17.27分)和物理一致性(最高68.02分)上表现有限,且多数模型在现实执行中成功率趋近于零,而WoW模型仍保持40.74%的成功率,揭示出生成视频与真实世界之间存在显著差距,凸显了构建更高质量世界模型的紧迫性。

链接: https://arxiv.org/abs/2601.04137
作者: Chun-Kai Fan,Xiaowei Chi,Xiaozhu Ju,Hao Li,Yong Bao,Yu-Kai Wang,Lizhang Chen,Zhiyuan Jiang,Kuangzhi Ge,Ying Li,Weishi Mi,Qingpo Wuwu,Peidong Jia,Yulin Luo,Kevin Zhang,Zhiyuan Qin,Yong Dai,Sirui Han,Yike Guo,Shanghang Zhang,Jian Tang
机构: Peking University (北京大学); Beijing Innovation Center of Humanoid Robotics (北京人形机器人创新中心); The Hong Kong University of Science and Technology (香港科技大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow-wo-val). Building upon 609 robot manipulation samples, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models’ generation ability, which achieves a high Pearson Correlation between the overall score and human preference (0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models’ execution accuracy in the real world. However, most models collapse to ≈0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
zh

[CV-6] Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images

【速读】:该论文旨在解决卫星遥感数据中,尤其是卫星图像时间序列(SITS)在深度学习模型处理时难以有效提取像素级特征的问题。现有方法通常依赖于对整幅图像或完整时间序列进行建模,忽略了像素层面的动态变化信息。其解决方案的关键在于提出一种新的多模态自监督学习框架——PIxel-wise Multimodal Contrastive (PIMC),通过将像素级植被指数时间序列(如NDVI、EVI和SAVI)转换为二维递归图(recurrence plots),构建更具信息量的2D表示,并在此基础上联合优化遥感影像(RSI)与像素时间序列的对比学习,从而显著提升对SITS和RSI的特征编码能力。实验表明,该方法在像素级预测、分类等任务上优于当前最先进(SOTA)模型,验证了其在地球观测领域的有效性与鲁棒性。

链接: https://arxiv.org/abs/2601.04127
作者: Leandro Stival,Ricardo da Silva Torres,Helio Pedrini
机构: Wageningen University & Research (瓦赫宁根大学与研究学院); Institute of Computing, University of Campinas (UNICAMP) (坎皮纳斯州立大学计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 9 Figures

点击查看摘要

Abstract:Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on
zh
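
A recurrence plot for a single pixel's vegetation-index series can be computed as below; the threshold heuristic is an assumption, since the abstract does not specify how eps is chosen.

```python
import numpy as np

def recurrence_plot(series: np.ndarray, eps: float | None = None) -> np.ndarray:
    """Turn a 1-D pixel time series (e.g. NDVI) into a binary 2-D image.

    RP[i, j] = 1 when the states at times i and j are closer than eps,
    giving CNN-friendly 2-D structure to per-pixel dynamics.
    """
    d = np.abs(series[:, None] - series[None, :])  # pairwise distances
    if eps is None:
        eps = 0.1 * d.max()  # heuristic: 10% of the max distance (assumption)
    return (d <= eps).astype(np.uint8)
```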

[CV-7] MORPHFED: Federated Learning for Cross-institutional Blood Morphology Analysis

【速读】:该论文旨在解决自动化白细胞形态分析在低收入和中等收入国家(LMICs)中因染色差异、成像条件不一致及罕见形态导致的数据分布偏移问题,同时克服因隐私法规和数据共享限制而难以构建集中式多样化数据集的挑战。其解决方案的关键在于提出了一种联邦学习框架,能够在不交换原始训练数据的前提下,实现多机构间的协同模型训练,从而学习到具有领域不变性的特征表示,保障数据隐私的同时显著提升模型在未见机构上的泛化能力。

链接: https://arxiv.org/abs/2601.04121
作者: Gabriel Ansah,Eden Ruffell,Delmiro Fernandez-Reyes,Petru Manescu
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated blood morphology analysis can support hematological diagnostics in low- and middle-income countries (LMICs) but remains sensitive to dataset shifts from staining variability, imaging differences, and rare morphologies. Building centralized datasets to capture this diversity is often infeasible due to privacy regulations and data-sharing restrictions. We introduce a federated learning framework for white blood cell morphology analysis that enables collaborative training across institutions without exchanging training data. Using blood films from multiple clinical sites, our federated models learn robust, domain-invariant representations while preserving complete data privacy. Evaluations across convolutional and transformer-based architectures show that federated training achieves strong cross-site performance and improved generalization to unseen institutions compared to centralized training. These findings highlight federated learning as a practical and privacy-preserving approach for developing equitable, scalable, and generalizable medical imaging AI in resource-limited healthcare environments.
zh
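
The abstract does not name the aggregation rule, so purely as an illustration, here is canonical FedAvg over client state_dicts, the usual baseline for this kind of cross-institutional training.

```python
import torch

def fedavg(states: list[dict], sizes: list[int]) -> dict:
    """Weighted average of client state_dicts, proportional to site size.

    Only model weights travel between sites; raw blood-film images stay local.
    """
    total = sum(sizes)
    avg = {}
    for key in states[0]:
        avg[key] = sum(
            sd[key].float() * (n / total) for sd, n in zip(states, sizes)
        )
    return avg

# One communication round (sketch): each hospital trains locally, then
# global_state = fedavg([site.state_dict() for site in clients], sample_counts)
```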

[CV-8] GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

【速读】:该论文旨在解决当前遥感视觉-语言模型(Remote Sensing Vision-Language Models, RS-VLMs)在复杂空间任务中因逻辑幻觉(logical hallucinations)导致的认知可靠性不足问题,即模型虽能给出正确答案,但其推理链条存在逻辑错误或依赖位置捷径而非空间逻辑。解决方案的关键在于提出GeoReason框架,通过两个核心步骤实现:首先构建GeoReason-Bench数据集,包含由几何原型和专家知识合成的4000条推理轨迹,用于监督式知识初始化以赋予模型推理语法与领域专业知识;其次采用一致性感知强化学习(Consistency-Aware Reinforcement Learning),引入一种新颖的逻辑一致性奖励机制,通过选项排列策略惩罚逻辑漂移,从而将决策锚定在可验证的推理路径上,显著提升模型的认知可靠性和可解释性。

链接: https://arxiv.org/abs/2601.04118
作者: Wenshuai Li,Xiantai Xiang,Zixiao Wen,Guangyao Zhou,Ben Niu,Feng Wang,Lijia Huang,Qiantong Wang,Yuxin Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
zh
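
The option-permutation idea behind the Logical Consistency Reward can be sketched as follows; the reward shape, letter-based answer format, and number of permutations are assumptions for illustration.

```python
import random

def consistency_reward(model, question: str, options: list[str],
                       gold: str, n_perms: int = 4) -> float:
    """Reward only answers that survive shuffling of the option order.

    If the model tracks spatial logic rather than option position, its
    chosen option text should be invariant to permutation; positional
    shortcuts collapse under shuffling and are penalized.
    Assumes at most four options (A-D).
    """
    picks = []
    for _ in range(n_perms):
        perm = random.sample(options, k=len(options))  # random permutation
        letters = "ABCD"[: len(perm)]
        prompt = question + "\n" + "\n".join(
            f"{l}. {o}" for l, o in zip(letters, perm)
        )
        letter = model(prompt).strip()[:1].upper()
        idx = letters.find(letter)
        picks.append(perm[idx] if idx >= 0 else None)
    correct = sum(p == gold for p in picks)
    return correct / n_perms  # 1.0 only when the choice is permutation-stable
```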

[CV-9] Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

【速读】:该论文旨在解决场景级3D生成中几何与外观信息难以协同建模的问题,尤其在单图或多图条件下的3D场景重建与生成任务中,如何有效融合基础重建模型(如VGGT)的结构先验与视频扩散模型(video diffusion models)的生成能力。解决方案的关键在于提出Gen3R方法:通过训练一个适配器(adapter)将VGGT模型的token转换为几何潜在表示(geometric latents),并对其进行正则化以对齐预训练视频扩散模型中的外观潜在空间(appearance latents);进而联合生成解耦但对齐的潜在变量,实现RGB视频与对应3D几何信息(包括相机位姿、深度图和全局点云)的一体化输出,从而在保持几何一致性的同时提升生成质量与重建鲁棒性。

链接: https://arxiv.org/abs/2601.04090
作者: Jiaxin Huang,Yuanbo Yang,Bangbang Yang,Lin Ma,Yuewen Ma,Yiyi Liao
机构: Zhejiang University (浙江大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
zh

[CV-10] Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

【速读】:该论文旨在解决文本到视频扩散模型(text-to-video diffusion models)在后训练阶段难以有效对齐人类偏好这一问题,现有直接偏好优化(Direct Preference Optimization, DPO)方法依赖多样本排序和任务特定的评判模型(critic models),存在效率低且全局监督信号模糊的缺陷。其解决方案的关键在于提出LocalDPO框架,通过构建局部时空区域级别的偏好对(preference pairs),并设计自动化数据采集管道:以高质量真实视频作为正样本,利用随机时空掩码局部破坏视频,并借助冻结的基础模型恢复掩码区域生成负样本;训练时引入区域感知的DPO损失函数,仅在被破坏区域进行偏好学习,从而实现快速收敛与更精细的对齐效果。实验表明,LocalDPO在Wan2.1和CogVideoX数据集上显著提升了视频保真度、时间连贯性及人类偏好评分。

链接: https://arxiv.org/abs/2601.04068
作者: Zitong Huang,Kaidong Zhang,Yukang Ding,Chao Gao,Rui Ding,Ying Chen,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); Alibaba Group - Taobao & Tmall Group (阿里巴巴集团-淘宝与天猫团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference data, generating preference pairs with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
zh
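
As a rough sketch of what a region-aware DPO loss could look like; the tensor shapes, per-region averaging, and beta below are assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def region_dpo_loss(lp_pos, lp_neg, lp_ref_pos, lp_ref_neg, mask, beta=0.1):
    """DPO preference loss restricted to corrupted spatio-temporal regions.

    lp_*: per-element log-probs of shape (B, T, H, W) for the real video
    (pos) and its locally corrupted negative, under the policy and the
    frozen reference model; `mask` marks the corrupted region where
    preference supervision applies.
    """
    m = mask.float().flatten(1)
    # Average the policy/reference log-ratio over the masked region only.
    pos = ((lp_pos - lp_ref_pos).flatten(1) * m).sum(1) / m.sum(1).clamp(min=1)
    neg = ((lp_neg - lp_ref_neg).flatten(1) * m).sum(1) / m.sum(1).clamp(min=1)
    return -F.logsigmoid(beta * (pos - neg)).mean()
```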

[CV-11] Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation WACV2026

【速读】:该论文旨在解决风力发电机叶片(wind turbine blades)视觉检测中像素级分割的标注效率低问题,传统深度学习方法依赖大量人工标注数据,难以规模化应用。其解决方案的关键在于将原本复杂的像素级分割任务重构为二分类区域识别问题:首先利用完全无监督且可解释的模块化自适应区域生长(Modular Adaptive Region Growing)技术生成候选图像区域,该过程结合图像自适应阈值(Adaptive Thresholding)并引入区域合并(Region Merging)以整合碎片化区域;其次提出RegionMix增强策略,通过融合不同区域合成新训练样本,显著提升模型泛化能力和分类鲁棒性。该框架在多个风电场场景下均实现了最优分割精度和强跨站点迁移能力。

链接: https://arxiv.org/abs/2601.04065
作者: Raül Pérez-Gonzalo,Riccardo Magro,Andreas Espersen,Antonio Agudo
机构: Institut de Robòtica i Informàtica Industrial, CSIC-UPC(西班牙国家研究委员会-加泰罗尼亚理工大学机器人与工业信息研究所); Politecnico di Milano(米兰理工大学); Wind Power LAB(风力发电实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to WACV 2026

点击查看摘要

Abstract:Reliable operation of wind turbines requires frequent inspections, as even minor surface damages can degrade aerodynamic performance, reduce energy output, and accelerate blade wear. Central to automating these inspections is the accurate segmentation of turbine blades from visual data. This task is traditionally addressed through dense, pixel-wise deep learning models. However, such methods demand extensive annotated datasets, posing scalability challenges. In this work, we introduce an annotation-efficient segmentation approach that reframes the pixel-level task into a binary region classification problem. Image regions are generated using a fully unsupervised, interpretable Modular Adaptive Region Growing technique, guided by image-specific Adaptive Thresholding and enhanced by a Region Merging process that consolidates fragmented areas into coherent segments. To improve generalization and classification robustness, we introduce RegionMix, an augmentation strategy that synthesizes new training samples by combining distinct regions. Our framework demonstrates state-of-the-art segmentation accuracy and strong cross-site generalization by consistently segmenting turbine blades across distinct windfarms.
zh

[CV-12] CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

【速读】:该论文旨在解决当前通用视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人领域中因缺乏足够机器人数据而导致的性能瓶颈问题,尤其是现有潜空间动作模型(Latent Action Models)难以从人类视频中有效提取可执行的操作技能,常受视觉特征纠缠干扰而捕捉噪声而非真实操纵能力。其解决方案的关键在于提出对比潜空间动作预训练(Contrastive Latent Action Pretraining, CLAP),通过对比学习将人类视频的视觉潜空间与机器人轨迹的本体感知潜空间对齐,并映射视频转换到一个量化且物理可执行的码本空间;在此基础上构建双架构VLA框架(CLAP-NTP与CLAP-RF),分别实现指令跟随和高频率精确操作,同时引入知识匹配(Knowledge Matching, KM)正则化策略防止微调过程中的灾难性遗忘,从而显著提升从人类视频到机器人动作的技能迁移效果。

链接: https://arxiv.org/abs/2601.04061
作者: Chubin Zhang,Jianan Wang,Zifeng Gao,Yue Su,Tianru Dai,Cai Zhou,Jiwen Lu,Yansong Tang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. Furthermore, we propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling the effective transfer of skills from human videos to robotic execution. Project page: this https URL.
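CLAP用对比学习对齐视频潜空间与机器人轨迹的本体感知潜空间。下面用标准的双向InfoNCE损失给出一个最小示意(温度参数、符号命名均为假设;论文还包含量化码本等细节,此处省略):

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(video_emb, proprio_emb, temperature=0.07):
    """示意性的双向 InfoNCE 对齐损失(假设性实现)。

    video_emb:   [B, D] 视频过渡的视觉潜表示
    proprio_emb: [B, D] 对应机器人轨迹的本体感知潜表示
    """
    v = F.normalize(video_emb, dim=-1)
    p = F.normalize(proprio_emb, dim=-1)
    logits = v @ p.t() / temperature                 # [B, B] 相似度矩阵
    targets = torch.arange(v.size(0), device=v.device)
    # 对称损失: 视频->轨迹 与 轨迹->视频 各做一次分类
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```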
zh

[CV-13] Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

【速读】:该论文旨在解决生成式视频(Generative Video)中因结构失真(如异常物体外观和交互)导致的质量下降问题,这些问题在现有文本到视频(Text-to-Video, T2V)生成模型的奖励机制中常被忽视。解决方案的关键在于提出REACT——一个针对帧级结构失真的奖励模型,其核心创新包括:(1) 基于结构失真分类法构建大规模人类偏好数据集,并通过高效的思维链(Chain-of-Thought, CoT)合成扩展数据;(2) 采用两阶段训练框架(先监督微调注入领域知识,再利用组相对策略优化(Group Relative Policy Optimization, GRPO)与成对奖励增强推理能力),使模型输出分数更贴近人类偏好;(3) 在推理阶段引入动态采样机制聚焦高风险帧,实现精准识别与可解释的失真定位。

链接: https://arxiv.org/abs/2601.04033
作者: Yuan Wang,Borui Liao,Huijuan Huang,Jinda Lu,Ouxiang Li,Kuien Liu,Meng Wang,Xiang Wang
机构: University of Science and Technology of China (中国科学技术大学); Kling Team, Kuaishou Technology (快手科技Kling团队); Institute of Software Chinese Academy of Sciences (中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generated video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
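摘要第二阶段训练使用GRPO。GRPO的核心思想是用同一提示下一组采样输出的组内相对优势替代独立价值网络,下面是这一步的最小示意(假设性实现):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """示意性的 GRPO 组内相对优势计算(假设性实现)。

    rewards: [G] 同一提示下 G 个采样输出的标量奖励
    返回:    [G] 组内标准化后的优势, 用于加权策略梯度
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# 用法示例: 同一视频提示下 4 个候选评分输出的奖励
adv = grpo_advantages(torch.tensor([0.9, 0.2, 0.5, 0.7]))
```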
zh

[CV-14] Padé Neurons for Efficient Neural Models

【速读】:该论文旨在解决传统神经网络中线性-非线性(point-wise)激活函数所导致的非线性表达能力有限的问题,从而限制了模型在复杂任务中的性能表现。现有研究虽已提出多种非线性神经元模型(如二次神经元、广义操作神经元等),但其非线性特性仍受限于固定形式或难以高效实现深层网络的强非线性拟合。本文的关键解决方案是提出一种新型神经元模型——Padé神经元(Paons),其灵感来源于Padé逼近理论,能够学习输入变量的分式多项式非线性映射,具有更强的非线性表达能力和灵活性;更重要的是,Paons能以更少的层数实现与传统神经网络相当甚至更优的性能,且可作为通用替换模块集成到现有架构中(如ResNet),实验证明其在图像超分辨率、压缩和分类任务中均表现出更高的层效率和竞争力。

链接: https://arxiv.org/abs/2601.04005
作者: Onur Keleş,A. Murat Tekalp
机构: Koç University (科奇大学); Codeway AI Research; Turkish Academy of Sciences (土耳其科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted for Publication in IEEE TRANSACTIONS ON IMAGE PROCESSING; 13 pages, 8 figures

点击查看摘要

Abstract:Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in far fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments we replace the classic neurons of well-known ResNet-based image super-resolution, compression, and classification models with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built with Paons provide better or equal performance than their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at this https URL.
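Padé神经元的直觉是学习输入的有理分式映射 y = P(x)/Q(x)。下面给出一个极简的卷积版概念草图:分母取 1+|Q(x)| 以避开Padé逼近的极点,这一稳定化写法是常见做法而非论文原文细节;模块名`PaonConv2d`为示意。

```python
import torch
import torch.nn as nn

class PaonConv2d(nn.Module):
    """示意性的 Padé 风格卷积神经元(假设性实现, 非官方 Paon 代码):
    输出 = P(x) / (1 + |Q(x)|), 其中 P、Q 由两组卷积参数化。
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.num = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)  # 分子 P
        self.den = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)  # 分母 Q

    def forward(self, x):
        # 分母保证严格为正, 避免有理函数极点带来的数值不稳定
        return self.num(x) / (1.0 + self.den(x).abs())
```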
zh

[CV-15] PosterVerse: A Full-Workflow Framework for Commercial-Grade Poster Generation with HTML-Based Scalable Typography

【速读】:该论文旨在解决当前自动化海报生成系统在商业级应用中面临的三大核心问题:设计流程不完整、文本渲染准确性差以及缺乏灵活性。为应对这些挑战,作者提出了一种全流程、商业级的海报生成方法PosterVerse,其关键在于通过三个阶段实现专业设计的自动化:(1) 利用微调的大语言模型(Large Language Model, LLM)提取用户需求中的关键设计元素以生成蓝图;(2) 采用定制化的扩散模型(diffusion models)生成视觉吸引力强的背景图像;(3) 基于多模态大语言模型(Multimodal Large Language Model, MLLM)驱动的HTML引擎实现统一的版式与文本渲染,确保高精度文本对齐和灵活定制。此外,论文还引入了首个面向中文海报生成的HTML结构化数据集PosterDNA,首次在该领域引入HTML排版文件,从根本上解决了小字号和高密度文本的可扩展渲染难题。

链接: https://arxiv.org/abs/2601.03993
作者: Junle Liu,Peirong Zhang,Yuyi Zhang,Pengyu Yan,Hui Zhou,Xinyue Zhou,Fengjun Guo,Lianwen Jin
机构: 1. Institute of Artificial Intelligence, Zhejiang University (浙江大学人工智能研究所); 2. School of Information Science and Technology, Sun Yat-sen University (中山大学信息科学与技术学院); 3. Guangdong Provincial Key Laboratory of Intelligent Information Processing and Security, Sun Yat-sen University (中山大学广东省智能信息处理与安全重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Commercial-grade poster design demands the seamless integration of aesthetic appeal with precise, informative content delivery. Current automated poster generation systems face significant limitations, including incomplete design workflows, poor text rendering accuracy, and insufficient flexibility for commercial applications. To address these challenges, we propose PosterVerse, a full-workflow, commercial-grade poster generation method that seamlessly automates the entire design process while delivering high-density and scalable text rendering. PosterVerse replicates professional design through three key stages: (1) blueprint creation using fine-tuned LLMs to extract key design elements from user requirements, (2) graphical background generation via customized diffusion models to create visually appealing imagery, and (3) unified layout-text rendering with an MLLM-powered HTML engine to guarantee high text accuracy and flexible customization. In addition, we introduce PosterDNA, a commercial-grade, HTML-based dataset tailored for training and validating poster design models. To the best of our knowledge, PosterDNA is the first Chinese poster generation dataset to introduce HTML typography files, enabling scalable text rendering and fundamentally solving the challenges of rendering small and high-density text. Experimental results demonstrate that PosterVerse consistently produces commercial-grade posters with appealing visuals, accurate text alignment, and customizable layouts, making it a promising solution for automating commercial poster design. The code and model are available at this https URL.
zh

[CV-16] FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion

【速读】:该论文旨在解决现有全身体运动合成方法中手部动作建模不足的问题,即多数方法要么完全忽略手部运动,要么仅在特定任务和受限条件下生成包含手部动作的全身体运动。其关键解决方案是构建首个基于扩散模型(diffusion-based)的无条件全身体运动先验模型 FUSION,该模型首次联合建模了身体与手部动作,并通过整合大规模身体运动数据与现有手部动作数据集,生成具有精细手部关节运动的全身体运动序列。FUSION 在 HumanML3D 数据集的关节点追踪任务上超越了当前最先进的骨骼控制模型,并在运动自然度上表现更优;此外,研究进一步提出优化管道以在潜在空间中微调模型,实现基于物体运动或自然语言指令的精细化手部交互动作生成,从而在保持全身协调性的同时实现对手部动作的精确控制。

链接: https://arxiv.org/abs/2601.03959
作者: Enes Duran,Nikos Athanasiou,Muhammed Kocabas,Michael J. Black,Omid Taheri
机构: Max Planck Institute for Intelligent Systems (马普所智能系统研究所); University of Tübingen (图宾根大学); Meshcapade GmbH (德国梅斯卡帕德公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Despite this, existing human motion synthesis methods fall short: some ignore hand motions entirely, while others generate full-body motions only for narrowly scoped tasks under highly constrained settings. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation. While some datasets capture both, they are limited in scale and diversity. Conversely, large-scale datasets typically focus either on body motion without hands or on hand motions without the body. To overcome this, we curate and unify existing hand motion datasets with large-scale body motion data to generate full-body sequences that capture both hand and body. We then propose the first diffusion-based unconditional full-body motion prior, FUSION, which jointly models body and hand motion. Despite using a pose-based motion representation, FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task in the HumanML3D dataset and achieves superior motion naturalness. Beyond standard benchmarks, we demonstrate that FUSION can go beyond typical uses of motion priors through two applications: (1) generating detailed full-body motion including fingers during interaction given the motion of an object, and (2) generating Self-Interaction motions using an LLM to transform natural language cues into actionable motion constraints. For these applications, we develop an optimization pipeline that refines the latent space of our diffusion model to generate task-specific motions. Experiments on these tasks highlight precise control over hand motion while maintaining plausible full-body coordination. The code will be public.
zh

[CV-17] ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation

【速读】:该论文旨在解决当前自回归(Autoregressive, AR)图像生成中因沿用语言建模设计范式而导致的视觉特性丢失问题,即现有1D视觉分词器将图像视为扁平的序列 token 流,忽略了视觉数据固有的层次性和残差结构,从而限制了模型的表示能力和生成效率。解决方案的关键在于提出一种残差分词器(Residual Tokenizer, ResTok),其通过构建图像 token 和潜在 token 的多层次残差表示,实现跨层级特征融合与语义残差隔离:一方面,逐层合并机制促进不同层级间的特征交互,提升表征容量;另一方面,层级间语义残差避免信息冗余,使潜在分布更紧凑,利于 AR 建模。此外,引入分层自回归生成器,可一次性预测一个层级的所有潜在 token,显著减少采样步骤(如 ImageNet-256 上仅需 9 步即可达到 gFID=2.34),从而在保持高质量的同时大幅提升生成速度。

链接: https://arxiv.org/abs/2601.03955
作者: Xu Zhang,Cheng Da,Huan Yang,Kun Gai,Ming Lu,Zhan Ma
机构: Vision Lab, Nanjing University (南京大学); Kolors Team, Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Existing 1D visual tokenizers for autoregressive (AR) generation largely follow the design principles of language modeling, as they are built directly upon transformers whose priors originate in language, yielding single-hierarchy latent tokens and treating visual data as flat sequential token streams. However, this language-like formulation overlooks key properties of vision, particularly the hierarchical and residual network designs that have long been essential for convergence and efficiency in visual models. To bring “vision” back to vision, we propose the Residual Tokenizer (ResTok), a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. The hierarchical representations obtained through progressively merging enable cross-level feature fusion at each layer, substantially enhancing representational capacity. Meanwhile, the semantic residuals between hierarchies prevent information overlap, yielding more concentrated latent distributions that are easier for AR modeling. Cross-level bindings consequently emerge without any explicit constraints. To accelerate the generation process, we further introduce a hierarchical AR generator that substantially reduces sampling steps by predicting an entire level of latent tokens at once rather than generating them strictly token-by-token. Extensive experiments demonstrate that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps. Code is available at this https URL.
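ResTok的层级残差可以类比经典的残差量化:每个层级只编码上一层级未能表示的残差,层级间因此不重叠。下面是这一通用机制的概念草图(与ResTok的具体网络结构无关,量化函数`quantizers`为假设接口):

```python
import torch

def residual_encode(x, quantizers):
    """示意性的多层级残差编码(概念草图, 非 ResTok 官方实现)。

    x:          [B, D] 待编码特征
    quantizers: 每个层级一个量化函数 q(residual) -> 最近码本向量
    """
    residual, codes, recon = x, [], torch.zeros_like(x)
    for q in quantizers:
        quantized = q(residual)        # 当前层级只拟合剩余残差
        codes.append(quantized)
        recon = recon + quantized
        residual = x - recon           # 层级间语义残差, 避免信息重叠
    return codes, recon
```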
zh

[CV-18] HemBLIP: A Vision-Language Model for Interpretable Leukemia Cell Morphology Analysis

【速读】:该论文旨在解决当前深度学习模型在白细胞形态学微观评估中缺乏可解释性的问题,从而限制了其在白血病诊断中的临床信任与应用。解决方案的关键在于提出HemBLIP,一种专为血液细胞图像设计的视觉语言模型(Vision Language Model, VLM),能够生成具有形态学感知能力的可解释描述;通过在包含1.4万张健康与白血病细胞图像及其专家标注属性的全新数据集上进行全参数微调和LoRA(Low-Rank Adaptation)参数高效训练,HemBLIP在caption质量与形态准确性上优于基准模型MedGEMMA,且LoRA方法显著降低了计算成本,体现了其在透明化和规模化血液学诊断中的潜力。

链接: https://arxiv.org/abs/2601.03915
作者: Julie van Logtestijn,Petru Manescu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Microscopic evaluation of white blood cell morphology is central to leukemia diagnosis, yet current deep learning models often act as black boxes, limiting clinical trust and adoption. We introduce HemBLIP, a vision-language model designed to generate interpretable, morphology-aware descriptions of peripheral blood cells. Using a newly constructed dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, we adapt a general-purpose VLM via both full fine-tuning and LoRA-based parameter-efficient training, and benchmark against the biomedical foundation model MedGEMMA. HemBLIP achieves higher caption quality and morphological accuracy, while LoRA adaptation provides further gains at significantly reduced computational cost. These results highlight the promise of vision-language models for transparent and scalable hematological diagnostics.
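HemBLIP采用LoRA做参数高效微调。LoRA的标准做法是在冻结的线性层旁加一条低秩旁路,只训练旁路参数,示意如下(秩r与缩放alpha为假设超参,非论文给定值):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """示意性的 LoRA 线性层(标准做法草图, 非 HemBLIP 官方代码)。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # 冻结原始权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # 低秩增量 B @ A 只引入 r*(d_in + d_out) 个可训练参数
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```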
zh

[CV-19] FLNet: Flood-Induced Agriculture Damage Assessment using Super Resolution of Satellite Images

【速读】:该论文旨在解决洪水灾害后农作物损失评估效率与精度不足的问题,尤其是在印度等地区因云层遮挡和卫星影像空间分辨率低导致传统遥感方法受限的背景下。其解决方案的关键在于提出了一种名为FLNet的新型深度学习架构,通过超分辨率(super-resolution)技术将Sentinel-2卫星图像的空间分辨率从10米提升至3米,从而显著增强对作物损伤类型的分类能力,尤其在“完全损毁”类别上F1分数由0.83提升至0.89,接近商业高分辨率影像的性能表现,实现了低成本、可扩展的自动化高保真损伤评估。

链接: https://arxiv.org/abs/2601.03884
作者: Sanidhya Ghosal,Anurag Sharma,Sushil Ghildiyal,Mukesh Saini
机构: Annam.AI CoE, MoE, Indian Institute of Technology Ropar, Punjab, India; Rajiv Gandhi Institute of Petroleum Technology, Jais, UP, India; Indian Institute of Technology Ropar, Punjab, India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for oral presentation at the 10th International Conference on Computer Vision and Image Processing (CVIP 2025)

点击查看摘要

Abstract:Distributing government relief efforts after a flood is challenging. In India, crops are widely affected by floods; therefore, rapid and accurate crop damage assessment is crucial for effective post-disaster agricultural management. Traditional manual surveys are slow and biased, while current satellite-based methods face challenges such as cloud cover and low spatial resolution. To bridge this gap, this paper introduces FLNet, a novel deep learning-based architecture that uses super-resolution to enhance 10 m Sentinel-2 satellite imagery to 3 m resolution before classifying damage. We test our model on the Bihar Flood Impacted Croplands Dataset (BFCD-22), where it improves the critical "Full Damage" F1-score from 0.83 to 0.89, nearly matching the 0.89 score of commercial high-resolution imagery. This work presents a cost-effective and scalable solution, paving the way for a nationwide shift from manual to automated, high-fidelity damage assessment.
zh

[CV-20] Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations

【速读】:该论文旨在解决医学图像分割中因标注噪声(noisy annotations)导致模型性能下降的问题,尤其针对器官形态结构复杂及不同标注者间差异所引发的错误标签对训练过程的负面影响。其核心解决方案是提出一种端到端的分阶段体素级深度强化学习框架(Staged Voxel-Level Deep Reinforcement Learning, SVL-DRL),关键创新在于:将噪声标注建模为体素依赖问题,并通过新颖的分阶段强化学习机制实现鲁棒收敛;引入体素级异步优势Actor-Critic(vA3C)模块,使每个体素作为独立智能体在训练过程中动态优化自身状态表示,从而直接缓解错误标签的影响;同时设计了包含Dice系数与空间连续性度量的复合奖励函数,有效提升分割精度并保持语义一致性。

链接: https://arxiv.org/abs/2601.03875
作者: Yuyang Fu,Xiuzhen Guo,Ji Shi
机构: China University of Nationalities (中国民族大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has achieved significant advancements in medical image segmentation. Currently, obtaining accurate segmentation outcomes is critically reliant on large-scale datasets with high-quality annotations. However, noisy annotations are frequently encountered owing to the complex morphological structures of organs in medical images and variations among different annotators, which can substantially limit the efficacy of segmentation models. Motivated by the fact that medical imaging annotators can correct labeling errors during segmentation based on prior knowledge, we propose an end-to-end Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework for robust medical image segmentation under noisy annotations. This framework employs a dynamic iterative update strategy to automatically mitigate the impact of erroneous labels without requiring manual intervention. The key advancements of SVL-DRL over existing works include: i) formulating noisy annotations as a voxel-dependent problem and addressing it through a novel staged reinforcement learning framework which guarantees robust model convergence; ii) incorporating a voxel-level asynchronous advantage actor-critic (vA3C) module that conceptualizes each voxel as an autonomous agent, which allows each agent to dynamically refine its own state representation during training, thereby directly mitigating the influence of erroneous labels; iii) designing a novel action space for the agents, along with a composite reward function that strategically combines the Dice value and a spatial continuity metric to significantly boost segmentation accuracy while maintaining semantic integrity. Experiments on three public medical image datasets demonstrate state-of-the-art (SoTA) performance under various experimental settings, with an average improvement of over 3% in both Dice and IoU scores.
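摘要第iii)点的复合奖励结合了Dice值与空间连续性度量。下面给出一个假设性草图:连续性用预测体素掩码与其3D均值滤波结果的一致性近似,权重w与该近似方式均为笔者示意,非论文原式:

```python
import torch
import torch.nn.functional as F

def composite_reward(pred, target, w=0.5, eps=1e-6):
    """示意性的复合奖励(假设性实现): Dice + 空间连续性。

    pred, target: [B, 1, D, H, W] 概率/二值体素掩码
    """
    inter = (pred * target).flatten(1).sum(1)
    dice = (2 * inter + eps) / (pred.flatten(1).sum(1) +
                                target.flatten(1).sum(1) + eps)
    # 连续性: 预测与其 3D 均值滤波结果越接近, 掩码越平滑连贯
    smooth = F.avg_pool3d(pred, kernel_size=3, stride=1, padding=1)
    continuity = 1.0 - (pred - smooth).abs().flatten(1).mean(1)
    return w * dice + (1 - w) * continuity
```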
zh

[CV-21] Bayesian Monocular Depth Refinement via Neural Radiance Fields

【速读】:该论文旨在解决单目深度估计(monocular depth estimation)中生成的深度图过于平滑、缺乏精细几何细节的问题,从而影响场景理解的准确性。解决方案的关键在于提出了一种迭代框架MDENeRF,其核心创新包括:(1) 利用初始单目深度估计保持全局结构;(2) 训练一个基于扰动视点的神经辐射场(NeRF),并引入像素级不确定性;(3) 通过贝叶斯融合机制将噪声较大的单目深度与NeRF深度进行融合,从而在迭代过程中注入高频细节。该方法通过从体积渲染过程推导NeRF不确定性,实现对细粒度几何信息的逐步增强,同时保留单目先验的全局一致性,显著提升了深度估计的质量。

链接: https://arxiv.org/abs/2601.03869
作者: Arun Muthukkumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
备注: IEEE 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025). Oral presentation; Best Presenter Award

点击查看摘要

Abstract:Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate superior performance on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
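两个带逐像素不确定性的深度图可以用高斯假设下的贝叶斯融合(逆方差加权)合并,这正是MDENeRF融合步骤所依赖的标准形式。下面的草图假设单目深度与NeRF深度相互独立且均为高斯分布:

```python
import torch

def bayesian_depth_fusion(d_mono, var_mono, d_nerf, var_nerf):
    """逐像素高斯贝叶斯融合(逆方差加权)的示意实现。

    d_*:   [H, W] 深度均值    var_*: [H, W] 深度方差
    返回融合后的深度与方差; 方差小(更确定)的一方权重更大。
    """
    prec_mono, prec_nerf = 1.0 / var_mono, 1.0 / var_nerf
    prec = prec_mono + prec_nerf
    fused = (d_mono * prec_mono + d_nerf * prec_nerf) / prec
    return fused, 1.0 / prec
```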
zh

[CV-22] IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting

【速读】:该论文旨在解决通用3D高斯点绘(Generalizable 3D Gaussian Splatting)中高斯均值(Gaussian means)预测不稳定的问题,尤其是由于单次投影变换(warp)难以充分挖掘多视角几何线索导致的深度图粗略与不一致。其核心解决方案是提出IDESplat,关键创新在于设计了一个深度概率增强单元(Depth Probability Boosting Unit, DPBU),通过级联多次warp操作生成的极线注意力图(epipolar attention maps)以乘法方式融合,从而迭代式地提升深度概率估计的准确性;进一步地,通过堆叠多个DPBU构建迭代深度估计流程,逐步筛选出高置信度的深度候选点,使深度图不断精化,最终实现更精确的高斯均值预测和高质量场景重建。

链接: https://arxiv.org/abs/2601.03824
作者: Wei Long,Haifeng Wu,Shiyin Jiang,Jinhua Zhang,Xinchun Ji,Shuhang Gu
机构: University of Electronic Science and Technology of China (电子科技大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates epipolar attention maps produced by cascading warp operations in a multiplicative manner. Next, we construct an iterative depth estimation process by stacking multiple DPBUs, progressively identifying potential depth candidates with high likelihood. As IDESplat iteratively boosts depth probability estimates and updates the depth candidates, the depth map is gradually refined, resulting in accurate Gaussian means. We conduct experiments on RealEstate10K, ACID, and DL3DV. IDESplat achieves outstanding reconstruction quality and state-of-the-art performance with real-time efficiency. On RE10K, it outperforms DepthSplat by 0.33 dB in PSNR, using only 10.7% of the parameters and 70% of the memory. Additionally, our IDESplat improves PSNR by 2.95 dB over DepthSplat on the DTU dataset in cross-dataset experiments, demonstrating its strong generalization ability.
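DPBU的"乘法融合"可以理解为:把多次warp得到的深度候选概率分布逐次连乘再归一化,使多视角一致的高置信深度被迭代放大。以下为这一思路的概念草图(在对数域连乘以保证数值稳定,接口与张量形状为假设):

```python
import torch

def boost_depth_probability(prob_maps):
    """示意性的乘法式深度概率增强(概念草图, 非官方实现)。

    prob_maps: K 个 [B, D, H, W] 张量, 每个是一次 warp 得到的
               D 个深度候选上的概率分布。
    """
    log_p = sum(p.clamp(min=1e-8).log() for p in prob_maps)
    boosted = torch.softmax(log_p, dim=1)      # 连乘后重新归一化
    depth_idx = boosted.argmax(dim=1)          # 取最可能的深度候选
    return boosted, depth_idx
```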
zh

[CV-23] EvalBlocks: A Modular Pipeline for Rapidly Evaluating Foundation Models in Medical Imaging

【速读】:该论文旨在解决医学影像领域基础模型(foundation models)开发过程中下游性能监控困难的问题,当前研究者面临大量实验设计、参数调整及结果追踪的繁琐任务,依赖非结构化、手动的工作流程导致效率低下且易出错。解决方案的关键在于提出 EvalBlocks,一个基于 Snakemake 构建的模块化、即插即用的评估框架,其核心优势包括:支持新数据集、模型、聚合方法和评估策略的无缝集成;所有实验与结果集中管理并可通过单条命令实现可复现性;通过高效缓存和并行执行机制,可在共享计算资源上实现可扩展使用。该框架已在五种先进基础模型和三个医学影像分类任务中验证其有效性,显著提升了模型迭代效率,使研究人员能够聚焦于模型创新而非评估流程本身。

链接: https://arxiv.org/abs/2601.03811
作者: Jan Tagscherer,Sarah de Boer,Lena Philipp,Fennie van der Graaf,Dré Peeters,Joeran Bosma,Lars Leijten,Bogdan Obreja,Ewoud Smit,Alessa Hering
机构: Radboud University Medical Center (奈梅亨大学医疗中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at BVM 2026

点击查看摘要

Abstract:Developing foundation models in medical imaging requires continuous monitoring of downstream performance. Researchers are burdened with tracking numerous experiments, design choices, and their effects on performance, often relying on ad-hoc, manual workflows that are inherently slow and error-prone. We introduce EvalBlocks, a modular, plug-and-play framework for efficient evaluation of foundation models during development. Built on Snakemake, EvalBlocks supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies. All experiments and results are tracked centrally and are reproducible with a single command, while efficient caching and parallel execution enable scalable use on shared compute infrastructure. Demonstrated on five state-of-the-art foundation models and three medical imaging classification tasks, EvalBlocks streamlines model evaluation, enabling researchers to iterate faster and focus on model innovation rather than evaluation logistics. The framework is released as open source software at this https URL.
zh

[CV-24] From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码合成任务中因数据感知增强(data-aware augmentation)受限而难以高效生成最优数据增强策略的问题。传统方法依赖启发式设计或暴力搜索,效率低下且缺乏性能引导。其解决方案的关键在于构建一个性能感知的闭环系统——在NNGPT生态系统中,通过低秩适应(Low-Rank Adaptation, LoRA)微调LLM,利用包含6000余个经实证评估的PyTorch增强函数的数据集(仅以下游模型准确率标注),采用成对性能排序(better-worse transformation ordering)进行训练,从而无需强化学习、奖励模型或符号目标即可实现性能对齐。此方法显著减少候选方案数量(最高达600倍),并使生成从随机合成转向任务对齐的设计,同时验证了模型能内化语义性能线索而非语法记忆,展现了LLM通过非文本反馈环实现任务级推理的能力。

链接: https://arxiv.org/abs/2601.03808
作者: Usha Shrestha,Dmitry Ignatov,Radu Timofte
机构: University of Würzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved notable performance in code synthesis; however, data-aware augmentation remains a limiting factor, handled via heuristic design or brute-force approaches. We introduce a performance-aware, closed-loop solution in the NNGPT ecosystem of projects that enables LLMs to autonomously engineer optimal transformations by internalizing empirical performance cues. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations), enabling alignment through empirical feedback without reinforcement learning, reward models, or symbolic objectives. This reduces the need for exhaustive search, requiring up to 600 times fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy and shifting generation from random synthesis to task-aligned design. Ablation studies show that structured Chain-of-Thought prompting introduces syntactic noise and degrades performance, whereas direct prompting ensures stable optimization in performance-critical code tasks. Qualitative and quantitative analyses demonstrate that the model internalizes semantic performance cues rather than memorizing syntax. These results show that LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards.
zh

[CV-25] A Comparative Study of 3D Model Acquisition Methods for Synthetic Data Generation of Agricultural Products

【速读】:该论文旨在解决制造业中AI模型训练数据获取成本高、标注困难的问题,尤其是在高变异、低产量的制造环境中。其核心挑战在于缺乏足够的真实数据来训练可靠的计算机视觉模型。解决方案的关键在于利用合成数据替代传统依赖真实数据的方式,通过扫描或图像到3D的方法生成高代表性的三维(3D)模型,进而构建合成数据集用于训练对象检测模型。研究进一步表明,即使使用代表性稍差的3D模型,只要在小规模真实数据集上进行微调(fine-tuning),仍可显著提升模型性能,甚至达到与高质量合成数据相当的效果。

链接: https://arxiv.org/abs/2601.03784
作者: Steven Moonen,Rob Salaets,Kenneth Batstone,Abdellatif Bey-Temsamani,Nick Michiels
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 1 table, presented at 4th International Conference on Responsible Consumption and Production, this https URL

点击查看摘要

Abstract:In the manufacturing industry, computer vision systems based on artificial intelligence (AI) are widely used to reduce costs and increase production. Training these AI models requires a large amount of training data that is costly to acquire and annotate, especially in high-variance, low-volume manufacturing environments. A popular approach to reducing the need for real data is the use of synthetic data generated by leveraging the computer-aided design (CAD) models available in industry. However, in the agricultural industry such models are not readily available, making it harder to leverage synthetic data. In this paper, we present different techniques for substituting CAD files to create synthetic datasets. We measure their relative performance when used to train an AI object detection model to separate stones and potatoes in a bin-picking environment. We demonstrate that highly representative 3D models acquired by scanning or image-to-3D approaches can be used to generate synthetic data for training object detection models. Fine-tuning on a small real dataset can significantly improve the performance of the models and even achieve comparable performance when less representative models are used.
zh

[CV-26] PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

【速读】:该论文旨在解决机器人在开放世界环境中进行灵巧操作时对3D物理世界动态预测能力不足的问题,即如何让机器人基于有限观测(如单张RGB-D图像)和低级动作指令,准确预测环境中物体在3D空间中的响应性位移。解决方案的关键在于提出PointWorld——一个大规模预训练的3D世界模型,其核心创新是将状态与动作统一表示为共享的3D点流(3D point flows),从而直接建模物理几何关系并实现跨机器人本体(embodiment)的无缝学习;该方法通过约200万条轨迹和500小时数据训练而成,在真实Franka机械臂上实现了无需演示或微调即可完成刚体推、柔性与关节物体操作及工具使用等复杂任务,且推理延迟仅为0.1秒,可高效嵌入模型预测控制(MPC)框架中。

链接: https://arxiv.org/abs/2601.03782
作者: Wenlong Huang,Yu-Wei Chao,Arsalan Mousavian,Ming-Yu Liu,Dieter Fox,Kaichun Mo,Li Fei-Fei
机构: Stanford University (斯坦福大学); NVIDIA
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild. Project website at this https URL.
zh

[CV-27] MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction

【速读】:该论文旨在解决当前基于强化学习的视频大语言模型(VideoLLM)后训练范式在提升视觉语义任务性能时,因缺乏对时间连贯性和帧间关联性显式监督而导致的动态细节捕捉能力不足问题。其核心解决方案是提出一种新的后训练目标——掩码视频预测(Masked Video Prediction, MVP),通过要求模型从一组挑战性干扰项中重建被遮蔽的连续视频片段,强制模型关注事件的序列逻辑和时间上下文,从而增强对视频时序结构与因果关系的理解。关键创新在于构建了可扩展的数据合成管道以支持大规模训练,并结合细粒度奖励函数的组相对策略优化(Group Relative Policy Optimization, GRPO)方法,显著提升了模型的视频推理能力。

链接: https://arxiv.org/abs/2601.03781
作者: Xiaokun Sun,Zezhong Wu,Zewen Ding,Linli Xu
机构: University of Science and Technology of China (中国科学技术大学); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models’ ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support scalable training, we introduce a scalable data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model’s understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
zh

[CV-28] I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

【速读】:该论文旨在解决现有文本引导图像编辑方法在复杂组合编辑任务中表现不佳的问题,这些问题主要源于端到端像素级修复范式存在的三大局限:1)规划与执行的隐式耦合;2)缺乏对象级别的控制粒度;3)依赖无结构的像素中心建模。其解决方案的关键在于提出一种全新的“分解-行动”(Decompose-then-Action)范式——I2E,该范式首先通过一个分解器(Decomposer)将非结构化图像转化为可操作的对象图层,随后引入一个物理感知的视觉-语言-动作代理(Vision-Language-Action Agent),借助思维链(Chain-of-Thought)推理将复杂指令解析为一系列原子动作,从而实现高精度、多对象的空间推理与可控编辑。

链接: https://arxiv.org/abs/2601.03741
作者: Jinghan Yu,Junhao Xiao,Chenyu Zhu,Jiaming Li,Jia Li,HanMing Deng,Xirui Wang,Guoli Jia,Jianjun Li,Zhiyuan Ma,Xiang Bai,Bowen Zhou
机构: Huazhong University of Science and Technology (华中科技大学); Tsinghua University (清华大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing text-guided image editing methods primarily rely on end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still significantly struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. This paradigm is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel “Decompose-then-Action” paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. Further, we also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
zh

[CV-29] HyperCOD: The First Challenging Benchmark and Baseline for Hyperspectral Camouflaged Object Detection

【速读】:该论文旨在解决基于RGB的伪装目标检测(Camouflaged Object Detection, COD)在真实场景中因颜色和纹理线索模糊而性能受限的问题,提出利用高光谱图像(Hyperspectral Image, HSI)捕捉细粒度光谱特征以提升检测能力。其核心挑战在于缺乏专门用于高光谱伪装目标检测(HCOD)的大规模基准数据集,且现有模型难以有效融合空间与光谱信息。解决方案的关键是构建首个面向HCOD的挑战性基准HyperCOD(包含350张高分辨率HSI图像),并提出HyperSpectral Camouflage-aware SAM(HSC-SAM)模型——通过将高光谱图像解耦为供SAM图像编码器处理的空间图和作为自适应提示的光谱显著性图,实现跨模态信息的有效对齐与利用,从而显著提升检测精度与泛化能力。

链接: https://arxiv.org/abs/2601.03736
作者: Shuyan Bai,Tingfa Xu,Peifu Liu,Yuhao Qiu,Huiyan Bai,Huan Chen,Yanyan Peng,Jianan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RGB-based camouflaged object detection struggles in real-world scenarios where color and texture cues are ambiguous. While hyperspectral imaging offers a powerful alternative by capturing fine-grained spectral signatures, progress in hyperspectral camouflaged object detection (HCOD) has been critically hampered by the absence of a dedicated, large-scale benchmark. To spur innovation, we introduce HyperCOD, the first challenging benchmark for HCOD. Comprising 350 high-resolution hyperspectral images, it features complex real-world scenarios with minimal objects, intricate shapes, severe occlusions, and dynamic lighting to challenge current models. The advent of foundation models like the Segment Anything Model (SAM) presents a compelling opportunity. To adapt SAM for HCOD, we propose HyperSpectral Camouflage-aware SAM (HSC-SAM). HSC-SAM ingeniously reformulates the hyperspectral image by decoupling it into a spatial map fed to SAM's image encoder and a spectral saliency map that serves as an adaptive prompt. This translation effectively bridges the modality gap. Extensive experiments show that HSC-SAM sets a new state-of-the-art on HyperCOD and generalizes robustly to other public HSI datasets. The HyperCOD dataset and our HSC-SAM baseline provide a robust foundation to foster future research in this emerging area.
zh

[CV-30] MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species

【速读】:该论文旨在解决海洋生物细粒度分类中忽视环境上下文交互以及未能充分融合生物分类学层级结构的问题。解决方案的关键在于提出MATANet模型,其核心由两个模块构成:一是多上下文环境注意力模块(Multi-Context Environmental Attention Module, MCEAM),用于学习感兴趣区域(Region of Interest, ROI)与其周围环境之间的关系;二是层次分离诱导学习模块(Hierarchical Separation-Induced Learning Module, HSLM),将分类学层级结构编码至特征空间,从而通过结合实例特征、环境信息与分类结构实现更精准的细粒度分类。

链接: https://arxiv.org/abs/2601.03729
作者: Donghwan Lee,Byeongjin Kim,Geunhee Kim,Hyukjin Kwon,Nahyeon Maeng,Wooju Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained classification of marine animals supports ecology, biodiversity and habitat conservation, and evidence-based policy-making. However, existing methods often overlook contextual interactions from the surrounding environment and insufficiently incorporate the hierarchical structure of marine biological taxonomy. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a novel model designed for fine-grained marine species classification. MATANet mimics expert strategies by using taxonomy and environmental context to interpret ambiguous features of underwater animals. It consists of two key components: a Multi-Context Environmental Attention Module (MCEAM), which learns relationships between regions of interest (ROIs) and their surrounding environments, and a Hierarchical Separation-Induced Learning Module (HSLM), which encodes taxonomic hierarchy into the feature space. MATANet combines instance and environmental features with taxonomic structure to enhance fine-grained classification. Experiments on the FathomNet2025, FAIR1M, and LifeCLEF2015-Fish datasets demonstrate state-of-the-art performance. The source code is available at: this https URL
zh

[CV-31] CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval

【速读】:该论文旨在解决**组合图像检索(Composed Image Retrieval, CIR)**中因查询与目标图像模态异构导致的表示空间碎片化问题,即现有方法由于使用独立编码器处理不同模态(如图像和文本),使得特征空间在初始化时就存在显著偏移,仅依赖后验对齐策略难以实现高效匹配,从而限制了检索性能。解决方案的关键在于提出一个统一的表示框架——CSMCIR,其核心创新包括:1)引入多层级思维链(Multi-level Chain-of-Thought, MCoT)提示策略,引导多模态大语言模型生成语义一致的图像描述,建立模态对称性;2)设计对称双塔架构,查询端与目标端共享参数的Q-Former进行跨模态编码,确保特征表示一致性;3)基于熵的动态记忆库机制,在模型演进过程中持续提供高质量负样本,增强对齐稳定性。这三者协同作用,有效缩小了异构模态间的表示鸿沟,显著提升了CIR任务的精度与训练效率。

链接: https://arxiv.org/abs/2601.03728
作者: Zhipeng Qian,Zihan Liang,Yufei Ma,Ben Chen,Huangyu Dai,Yiwei Ma,Jiayi Ji,Chenyi Lei,Han Li,Xiaoshuai Sun
机构: Kuaishou Technology; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University (厦门大学多媒体可信感知与高效计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
zh

[CV-32] owards Real-world Lens Active Alignment with Unlabeled Data via Domain Adaptation

【速读】:该论文旨在解决光学系统大规模自动化装配中因仿真与真实成像条件差异导致的域差距(domain gap)问题,从而限制了仅在仿真数据上训练模型在真实场景中的泛化能力。解决方案的关键在于提出Domain Adaptive Active Alignment (DA3),其核心是通过少量未标注的真实世界图像(随机错位位置采集)引入自监督学习机制,利用自回归域变换生成器和基于对抗的特征对齐策略,从真实域中提取域不变的图像退化特征,实现仿真到真实场景的域适应,显著提升模型在真实环境下的误对准预测鲁棒性。

链接: https://arxiv.org/abs/2601.03718
作者: Wenyong Li,Qi Jiang,Weijian Hu,Kailun Yang,Zhanjun Zhang,Wenjun Tian,Kaiwei Wang,Jian Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Active Alignment (AA) is a key technology for the large-scale automated assembly of high-precision optical systems. Compared with labor-intensive per-model on-device calibration, a digital-twin pipeline built on optical simulation offers a substantial advantage in generating large-scale labeled data. However, complex imaging conditions induce a domain gap between simulation and real-world images, limiting the generalization of simulation-trained models. To address this, we propose augmenting a simulation baseline with minimal unlabeled real-world images captured at random misalignment positions, mitigating the gap from a domain adaptation perspective. We introduce Domain Adaptive Active Alignment (DA3), which utilizes an autoregressive domain transformation generator and an adversarial-based feature alignment strategy to distill real-world domain information via self-supervised learning. This enables the extraction of domain-invariant image degradation features to facilitate robust misalignment prediction. Experiments on two lens types reveal that DA3 improves accuracy by 46% over a purely simulation-based pipeline. Notably, it approaches the performance achieved with precisely labeled real-world data collected on 3 lens samples, while reducing on-device data collection time by 98.7%. The results demonstrate that domain adaptation effectively endows simulation-trained models with robust real-world performance, validating the digital-twin pipeline as a practical solution to significantly enhance the efficiency of large-scale optical assembly.
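DA3中基于对抗的特征对齐通常可用梯度反转层(GRL)实现:域判别器学习区分仿真/真实特征,而特征提取器经反转后的梯度学习域不变表示。下面是这一通用做法的草图,判别器结构与λ取值为假设,并非论文官方网络:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """梯度反转: 前向恒等, 反向把梯度乘以 -lambda。"""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def adversarial_align_loss(feat_sim, feat_real, discriminator, lam=1.0):
    """示意性的对抗域对齐损失(假设性实现)。

    feat_sim/feat_real: [N, D] 仿真域 / 真实域特征
    discriminator:      输出 2 类 logits 的判别网络
    """
    feats = torch.cat([feat_sim, feat_real], dim=0)
    labels = torch.cat([torch.zeros(len(feat_sim)),      # 0 = 仿真域
                        torch.ones(len(feat_real))]).long().to(feats.device)
    logits = discriminator(GradReverse.apply(feats, lam))
    return nn.functional.cross_entropy(logits, labels)
```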
zh

[CV-33] BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion

【速读】:该论文旨在解决生成式 AI (Generative AI) 在6-DoF内窥镜相机位姿估计中面临的三大挑战:一是缺乏大规模、高质量、密集标注且面向定位的医学视觉语言数据集;二是模型在细粒度位姿回归上的能力有限;三是从历史帧中提取时序特征时计算延迟较高。其解决方案的关键在于构建了目前最大的活体内窥镜定位数据集BREATH,并提出BREATH-VL混合框架,该框架融合了视觉语言模型(VLM)的语义理解能力与基于视觉的配准方法的几何精度优势,同时引入轻量级上下文学习机制,将运动历史编码为语言提示以实现高效时序推理,从而在保持较低计算延迟的同时显著提升定位准确性和泛化性能。

链接: https://arxiv.org/abs/2601.03713
作者: Qingyao Tian,Bingyu Yang,Huai Liao,Xinyan Huang,Junyong Li,Dong Yi,Hongbin Liu
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Sun Yat-sen University; Centre of AI and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences; School of Biomedical Engineering and Imaging Sciences, King’s College London
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM’s ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
zh

[CV-34] Rec: Egocentric Action Recognition using 2D Point Tracks ICPR2026

【速读】:该论文旨在解决**第一人称视角动作识别(egocentric action recognition)中因依赖RGB外观、人体姿态估计或其组合而带来的性能瓶颈问题。解决方案的关键在于引入2D点轨迹(2D point tracks)**作为额外的运动线索,通过CoTracker对视频中随机初始化的图像点进行跨帧跟踪,提取其运动轨迹,并将其与对应图像帧一同输入基于Transformer的识别模型。实验表明,即使仅使用初始帧及其关联点轨迹,该方法也能显著提升识别准确率,证明了2D点轨迹是一种轻量且有效的动作表征方式。

链接: https://arxiv.org/abs/2601.03667
作者: Dennis Holzmann,Sven Wachsmuth
机构: Bielefeld University (比勒费尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: submitted to ICPR 2026

点击查看摘要

Abstract:We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.
zh

[CV-35] PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance

【速读】:该论文旨在解决当前视频生成模型在模拟真实世界物理动力学方面存在的不足,如物体碰撞不自然、重力不一致和时间帧间闪烁等问题。其解决方案的关键在于提出PhysVideoGenerator框架,通过显式嵌入可学习的物理先验知识来增强视频生成过程:具体而言,设计了一个轻量级预测网络PredictorP,从噪声扩散潜空间中回归出预训练V-JEPA 2模型提取的高阶物理特征,并通过专用交叉注意力机制将这些物理token注入基于DiT结构(Latte)的生成器的时间注意力层中,从而实现物理一致性约束与生成质量的协同优化。

链接: https://arxiv.org/abs/2601.03665
作者: Siddarth Nilol Kundur Satish,Devesh Jaiswal,Hongyu Chen,Abhishek Bakshi
机构: New York University (纽约大学); Center for Data Science (数据科学中心); Department of Computer Science (计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 2 figures, project page: this https URL

点击查看摘要

Abstract:Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.
zh

[CV-36] MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding

【速读】:该论文旨在解决点云补全(Point Cloud Completion)在真实场景中泛化能力不足的问题,尤其针对现有基于3D卷积神经网络(CNN)、点云或Transformer的方法在面对新物体类别和复杂现实环境时性能下降的局限性。解决方案的关键在于提出一种通用性强的多模态点云补全框架MGPC,其核心创新包括:引入模态丢弃策略以提升鲁棒性,设计基于Transformer的融合模块实现跨模态信息整合,以及构建渐进式生成器增强几何建模能力;同时,作者还开发了自动数据生成流水线并构建了包含百万级样本的MGPC-1M大规模基准,从而显著提升了模型在真实世界数据上的泛化性能。

链接: https://arxiv.org/abs/2601.03660
作者: Jiangyuan Liu,Hongxuan Ma,Yuhao Zhao,Zhe Liu,Jian Wang,Wei Zou
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation of Chinese Academy of Sciences (中国科学院自动化研究所); Chemical Defense Institute, Academy of Military Sciences (军事科学院化学防御研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and dataset are available at this https URL

点击查看摘要

Abstract:Point cloud completion aims to recover complete 3D geometry from partial observations caused by limited viewpoints and occlusions. Existing learning-based works, including 3D Convolutional Neural Network (CNN)-based, point-based, and Transformer-based methods, have achieved strong performance on synthetic benchmarks. However, due to the limitations of modality, scalability, and generative capacity, their generalization to novel objects and real-world scenarios remains challenging. In this paper, we propose MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text within a unified architecture. MGPC introduces an innovative modality dropout strategy, a Transformer-based fusion module, and a novel progressive generator to improve robustness, scalability, and geometric modeling capability. We further develop an automatic data generation pipeline and construct MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs. Extensive experiments on MGPC-1M and in-the-wild data demonstrate that the proposed method consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
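MGPC的模态丢弃策略可以用如下示意代码理解:训练时以一定概率把整条模态特征置零,迫使模型在模态缺失时仍能完成补全。丢弃概率、置零方式以及"点云作为主模态始终保留"的约定均为笔者假设:

```python
import torch

def modality_dropout(feats, p=0.3, training=True):
    """示意性的模态丢弃(假设性实现)。

    feats: {"point": [B,D], "image": [B,D], "text": [B,D]} 各模态特征
    以概率 p 独立丢弃每个非点云模态, 点云模态始终保留。
    """
    if not training:
        return feats
    out = dict(feats)
    for name in ("image", "text"):
        if name in out and torch.rand(()) < p:
            out[name] = torch.zeros_like(out[name])   # 整模态置零
    return out
```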
zh

[CV-37] VideoMemory: Toward Consistent Video Generation via Memory Integration

【速读】:该论文旨在解决叙事视频生成中跨镜头保持角色、道具和环境一致性的问题(character, prop, and environment consistency across multiple shots),这是实现高质量长序列视频生成的核心挑战。现有模型虽能生成单个高质量短片段,但在场景切换或实体长时间间隔重现时往往丧失身份识别与外观一致性。解决方案的关键在于提出VideoMemory框架,其核心创新是引入一个动态记忆库(Dynamic Memory Bank),该库以实体为中心存储显式的视觉与语义描述符,并通过“检索-更新”机制在每帧生成后持续更新记忆内容,从而在故事驱动的变化中维持实体身份不变,同时支持跨远距离镜头的连贯性表达。

链接: https://arxiv.org/abs/2601.03655
作者: Jinsong Zhou,Yihua Du,Xinli Xu,Luozhou Wang,Zijie Zhuang,Yehang Zhang,Shuaibo Li,Xiaojun Hu,Bolan Su,Ying-cong Chen
机构: HKUST(GZ)(香港科技大学(广州)); HKUST(香港科技大学); ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
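VideoMemory的"检索-更新"机制可以抽象为一个以实体为键的记忆库:生成每个镜头前检索出场实体的最新状态,生成后写回更新。以下为概念草图,类名、字段名与用法均为假设:

```python
class DynamicMemoryBank:
    """示意性的实体记忆库(概念草图, 非官方实现)。"""
    def __init__(self):
        self.entities = {}      # 实体名 -> {"visual": ..., "semantic": ...}

    def retrieve(self, names):
        # 生成当前镜头前, 取出所有出场实体的最新状态
        return {n: self.entities.get(n) for n in names}

    def update(self, name, visual_feat, semantic_desc):
        # 生成后写回: 保留身份, 更新随剧情变化的外观/状态
        self.entities[name] = {"visual": visual_feat,
                               "semantic": semantic_desc}

bank = DynamicMemoryBank()
bank.update("hero", visual_feat="emb_001", semantic_desc="红色风衣, 左脸疤痕")
state = bank.retrieve(["hero", "sword"])   # "sword" 未登记时返回 None
```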
zh

[CV-38] CrackSegFlow: Controllable Flow-Matching Synthesis for Generalizable Crack Segmentation with the CSF-50K Benchmark

【速读】:该论文旨在解决路面裂缝分割(crack segmentation)在实际部署中面临的两大核心挑战:一是像素级标注数据稀缺,二是跨传感器、光照、纹理及标注规范等导致的域偏移(domain shift)。解决方案的关键在于提出 CrackSegFlow,一个可控的流匹配合成框架,通过两个核心机制实现高质量裂纹图像与掩码的生成:其一,采用拓扑保持的掩码注入与边界门控调制策略,确保细长结构连续性并抑制由纹理引起的假阳性;其二,引入类条件流匹配模型,以显式控制裂纹覆盖率,生成拓扑多样且平衡的数据对,无需额外人工标注。此外,将裂纹掩码注入无裂纹背景以增强光照和表面伪影多样性,进一步降低阴影、接缝和道路标记带来的误检率。实验表明,该方法在多个沥青和混凝土数据集上均显著提升分割性能,尤其在跨域场景下仅依赖少量目标域掩码统计即可实现平均13.12 mIoU和14.82 F1的增益。

链接: https://arxiv.org/abs/2601.03637
作者: Babak Asadi,Peiyang Wu,Mani Golparvar-Fard,Ramez Hajj
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated crack segmentation is essential for scalable condition assessment of pavements and civil infrastructure, yet practical deployment is limited by scarce pixel-level labels and severe domain shift across sensors, illumination, textures, and annotation conventions. This paper presents CrackSegFlow, a controllable flow-matching synthesis framework that generates photorealistic crack images conditioned on binary masks while preserving strict mask-image alignment. The generator combines topology-preserving mask injection with boundary-gated modulation to maintain thin-structure continuity and suppress texture-driven false positives. A second class-conditional flow-matching model synthesizes crack masks with explicit control over crack coverage, enabling balanced, topology-diverse paired data without additional manual annotation. We further inject crack masks into crack-free backgrounds to diversify illumination and surface artifacts and reduce false positives caused by shadows, joints, and pavement markings. Experiments on five benchmarks spanning four asphalt datasets and the crack class of a concrete-domain dataset demonstrate consistent improvements under an established hybrid CNN–Transformer segmentation backbone and a fixed training protocol. With real plus synthesized pairs, in-domain performance improves on average by 5.37 mIoU and 5.13 F1, and target-guided cross-domain synthesis yields average gains of 13.12 mIoU and 14.82 F1 using only limited target mask statistics. Compared with diffusion-based semantic synthesis, CrackSegFlow provides substantially faster deterministic sampling and improves fidelity and mask-image alignment for thin-structure crack geometry. Finally, we release CSF-50K, a public dataset of 50,000 paired crack images and pixel-accurate masks for large-scale benchmarking of generalizable crack segmentation.
zh
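
代码示例:流匹配(flow matching)训练的核心是在噪声与数据的线性插值轨迹上回归目标速度场。下面给出一个与论文实现无关的最小训练步草图(PyTorch;网络结构与条件注入方式均为演示假设):

```python
import torch
import torch.nn.functional as F

class TinyVelocityNet(torch.nn.Module):
    """演示用速度场网络: 输入 (x_t, t, cond), 输出与 x_t 同形状的速度。"""
    def __init__(self, ch=3):
        super().__init__()
        self.conv = torch.nn.Conv2d(ch * 2 + 1, ch, 3, padding=1)

    def forward(self, xt, t, cond):
        tmap = t.view(-1, 1, 1, 1).expand(-1, 1, *xt.shape[2:])
        return self.conv(torch.cat([xt, cond, tmap], dim=1))

def flow_matching_loss(velocity_net, x1, cond):
    """最简流匹配损失: x_t = (1-t)*x0 + t*x1, 目标速度为 x1 - x0。"""
    x0 = torch.randn_like(x1)                      # 噪声端点
    t = torch.rand(x1.shape[0], device=x1.device)  # 每样本独立采样时间
    t_ = t.view(-1, 1, 1, 1)
    xt = (1.0 - t_) * x0 + t_ * x1                 # 线性插值轨迹
    pred_v = velocity_net(xt, t, cond)
    return F.mse_loss(pred_v, x1 - x0)             # 回归恒定目标速度

net = TinyVelocityNet()
x1 = torch.randn(2, 3, 32, 32)                     # 真实样本(如裂缝图像)
cond = torch.randn(2, 3, 32, 32)                   # 条件(如掩码的特征图)
loss = flow_matching_loss(net, x1, cond)
```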

[CV-39] MFC-RFNet: A Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction

【速读】:该论文旨在解决雷达回波序列高分辨率降水临近预报(precipitation nowcasting)中的三大挑战:复杂多尺度演变建模、帧间特征错位校正以及高效捕捉长距离时空上下文的同时保持空间保真度。解决方案的关键在于提出一种名为多尺度特征通信修正流(Multi-scale Feature Communication Rectified Flow Network, MFC-RFNet)的生成框架,其核心创新包括:1)通过小波引导的跳跃连接(Wavelet-Guided Skip Connection, WGSC)保留高频细节并增强多尺度融合;2)引入特征通信模块(Feature Communication Module, FCM)实现跨尺度双向交互;3)设计条件引导的空间变换融合机制(Condition-Guided Spatial Transform Fusion, CGSTF)以对齐浅层特征,纠正帧间位移;4)采用修正流(Rectified Flow)训练策略学习近线性概率流轨迹,支持少步采样且稳定保真;5)在编码器尾部、瓶颈层和解码器首层嵌入轻量级Vision-RWKV块,低分辨率下捕获长程时空依赖,计算效率高。上述协同机制显著提升了预报清晰度与长期预测性能。

链接: https://arxiv.org/abs/2601.03633
作者: Wenjie Luo,Chuanhu Deng,Chaorong Li,Rongyao Deng,Qiang Yang
机构: 重庆理工大学(Chongqing University of Technology)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate and high-resolution precipitation nowcasting from radar echo sequences is crucial for disaster mitigation and economic planning, yet it remains a significant challenge. Key difficulties include modeling complex multi-scale evolution, correcting inter-frame feature misalignment caused by displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity. To address these issues, we present the Multi-scale Feature Communication Rectified Flow (RF) Network (MFC-RFNet), a generative framework that integrates multi-scale communication with guided feature fusion. To enhance multi-scale fusion while retaining fine detail, a Wavelet-Guided Skip Connection (WGSC) preserves high-frequency components, and a Feature Communication Module (FCM) promotes bidirectional cross-scale interaction. To correct inter-frame displacement, a Condition-Guided Spatial Transform Fusion (CGSTF) learns spatial transforms from conditioning echoes to align shallow features. The backbone adopts rectified flow training to learn near-linear probability-flow trajectories, enabling few-step sampling with stable fidelity. Additionally, lightweight Vision-RWKV (RWKV) blocks are placed at the encoder tail, the bottleneck, and the first decoder layer to capture long-range spatiotemporal dependencies at low spatial resolutions with moderate compute. Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) demonstrate consistent improvements over strong baselines, yielding clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times. These results suggest that the proposed synergy of RF training with scale-aware communication, spatial alignment, and frequency-aware fusion presents an effective and robust approach for radar-based nowcasting.
zh
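
代码示例:修正流学到近线性的概率流轨迹后,推理时只需很少的欧拉步即可采样。以下为通用的少步采样草图(速度场接口沿用常见约定,均为演示假设):

```python
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_net, cond, shape, steps=4):
    """沿概率流 ODE dx/dt = v(x, t, cond) 从噪声积分到数据, 少步欧拉离散。"""
    x = torch.randn(shape)                       # t=0 处为高斯噪声
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + velocity_net(x, t, cond) * dt    # 欧拉一步
    return x                                     # t=1 处近似数据样本

# 占位速度场, 仅演示接口; 实际应为训练好的网络
net = lambda x, t, cond: -x
sample = rectified_flow_sample(net, cond=None, shape=(2, 3, 32, 32), steps=4)
```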

[CV-40] Shape Classification using Approximately Convex Segment Features

【速读】:该论文旨在解决传统基于描述性特征的目标分类技术依赖目标对齐(object alignment)来计算对象相似度的问题,从而限制了分类性能的提升。其解决方案的关键在于通过特征排序替代对齐操作:首先将目标边界归一化并分割为近似凸段,随后按长度降序排列这些段落,并利用包括段长、极点数量、面积、底边和宽度在内的多维特征组合(bag of features)来度量图像边界间的相似性,实现了无需对齐即可有效分类的目标识别。

链接: https://arxiv.org/abs/2601.03625
作者: Bimal Kumar Ray
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The existing object classification techniques based on descriptive features rely on object alignment to compute the similarity of objects for classification. This paper replaces the necessity of object alignment through sorting of features. The object boundary is normalized and segmented into approximately convex segments, and the segments are then sorted in descending order of their length. The segment length, the number of extreme points in each segment, the segment area, and the base and width of each segment - a bag of features - are used to measure the similarity between image boundaries. The proposed method is tested on datasets and acceptable results are observed.
zh
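
代码示例:该方法以"按长度降序排序"替代对齐,再比较段特征袋。下面用 NumPy 给出流程骨架,其中各特征与距离的具体定义为笔者简化后的假设,并非论文原始公式:

```python
import numpy as np

def segment_features(segments):
    """segments: 列表, 每段为 (N_i, 2) 的归一化边界点数组。
    对每段提取简化特征 [长度, 极值点数, 面积, 底边, 宽度]。"""
    feats = []
    for seg in segments:
        d = np.diff(seg, axis=0)
        length = np.linalg.norm(d, axis=1).sum()        # 弧长
        chord = seg[-1] - seg[0]
        base = np.linalg.norm(chord)                    # 底边: 首尾连线
        n = np.array([-chord[1], chord[0]]) / (base + 1e-9)
        width = np.abs((seg - seg[0]) @ n).max()        # 点到底边最大距离
        area = 0.5 * base * width                       # 粗略面积近似
        extrema = int((np.diff(np.sign(d), axis=0) != 0).sum())  # 差分变号次数
        feats.append([length, extrema, area, base, width])
    feats = np.asarray(feats, dtype=float)
    return feats[np.argsort(-feats[:, 0])]              # 按长度降序排序

def shape_distance(fa, fb):
    """对齐排序后前 k 个段的特征, 用 L1 距离度量边界相似性。"""
    k = min(len(fa), len(fb))
    return np.abs(fa[:k] - fb[:k]).sum()

segs = [np.cumsum(np.random.rand(20, 2), axis=0) for _ in range(5)]
print(shape_distance(segment_features(segs), segment_features(segs)))  # 0.0
```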

[CV-41] Systematic Evaluation of Depth Backbones and Semantic Cues for Monocular Pseudo-LiDAR 3D Detection

【速读】:该论文旨在解决单目3D目标检测(monocular 3D object detection)中因难以从单张图像准确估计度量深度(metric depth)而导致的检测精度不足的问题。其解决方案的关键在于系统性评估深度主干网络(depth backbone)与特征工程对伪LiDAR(pseudo-LiDAR)流水线性能的影响,发现深度主干的选择和几何保真度(geometric fidelity)是决定检测性能的核心因素,远超次要特征注入(如外观或语义线索)的作用。实验表明,使用NeWCRFs作为深度估计模型在KITTI数据集上取得最优结果(IoU=0.7时AP₃D达10.50%),而引入语义掩码采样反而可能因丢失上下文几何信息导致性能下降。

链接: https://arxiv.org/abs/2601.03617
作者: Samson Oseiwe Ajadalu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:Monocular 3D object detection offers a low-cost alternative to LiDAR, yet remains less accurate due to the difficulty of estimating metric depth from a single image. We systematically evaluate how depth backbones and feature engineering affect a monocular Pseudo-LiDAR pipeline on the KITTI validation split. Specifically, we compare NeWCRFs (supervised metric depth) against Depth Anything V2 Metric-Outdoor (Base) under an identical pseudo-LiDAR generation and PointRCNN detection protocol. NeWCRFs yields stronger downstream 3D detection, achieving 10.50% AP _3D at IoU =0.7 on the Moderate split using grayscale intensity (Exp~2). We further test point-cloud augmentations using appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence). Contrary to the expectation that semantics would substantially close the gap, these features provide only marginal gains, and mask-based sampling can degrade performance by removing contextual geometry. Finally, we report a depth-accuracy-versus-distance diagnostic using ground-truth 2D boxes (including Ped/Cyc), highlighting that coarse depth correctness does not fully predict strict 3D IoU. Overall, under an off-the-shelf LiDAR detector, depth-backbone choice and geometric fidelity dominate performance, outweighing secondary feature injection.
zh
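
代码示例:伪LiDAR流水线的第一步,是用相机内参把估计的深度图反投影为三维点云,再交给现成的LiDAR检测器。标准的针孔模型反投影如下(NumPy;内参数值仅为KITTI量级的示例):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """depth: (H, W) 度量深度图(米)。返回相机坐标系下 (M, 3) 点云。"""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # 像素网格
    z = depth
    x = (u - cx) * z / fx                           # 针孔模型反投影
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                       # 去除无效深度

depth = np.random.uniform(1.0, 80.0, (370, 1224))   # 假设的深度图
pts = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
```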

[CV-42] Unveiling Text in Challenging Stone Inscriptions: A Character-Context-Aware Patching Strategy for Binarization

【速读】:该论文旨在解决历史石刻图像中字符与背景对比度低、表面退化不均匀、干扰杂项多及文本密度和布局高度可变等问题导致的二值化(Binarization)失败难题。其解决方案的关键在于提出了一种鲁棒且自适应的分块策略,结合注意力机制增强的U-Net网络进行训练,使模型能够聚焦于细微结构线索;同时通过动态采样与分块选择方法,有效学习克服表面噪声和版式不规则性。该方法在仅使用单一印度文字(Indic script)数据集训练的情况下,展现出对其他印度文字及非印度文字文本的零样本泛化能力,体现了良好的鲁棒性和文字无关性(script-agnostic)。

链接: https://arxiv.org/abs/2601.03609
作者: Pratyush Jena,Amal Joseph,Arnav Sharma,Ravi Kiran Sarvadevabhatla
机构: International Institute of Information Technology, Hyderabad (国际信息科技学院,海得拉巴); Center for Visual Information Technology (视觉信息研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Binarization is a popular first step towards text extraction in historical artifacts. Stone inscription images pose severe challenges for binarization due to poor contrast between etched characters and the stone background, non-uniform surface degradation, distracting artifacts, and highly variable text density and layouts. These conditions frequently cause existing binarization techniques to fail and struggle to isolate coherent character regions. Many approaches sub-divide the image into patches to improve text fragment resolution and improve binarization performance. With this in mind, we present a robust and adaptive patching strategy to binarize challenging Indic inscriptions. The patches from our approach are used to train an Attention U-Net for binarization. The attention mechanism allows the model to focus on subtle structural cues, while our dynamic sampling and patch selection method ensures that the model learns to overcome surface noise and layout irregularities. We also introduce a carefully annotated, pixel-precise dataset of Indic stone inscriptions at the character-fragment level. We demonstrate that our novel patching mechanism significantly boosts binarization performance across classical and deep learning baselines. Despite training only on single script Indic dataset, our model exhibits strong zero-shot generalization to other Indic and non-indic scripts, highlighting its robustness and script-agnostic generalization capabilities. By producing clean, structured representations of inscription content, our method lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis. Project page: this https URL
zh

[CV-43] Adaptive Attention Distillation for Robust Few-Shot Segmentation under Environmental Perturbations

【速读】:该论文旨在解决少样本分割(Few-shot Segmentation, FSS)模型在真实复杂环境下的鲁棒性不足问题,即现有方法多在实验室条件下训练,难以应对光照变化、背景干扰、相机视角差异等现实场景中的挑战因素,导致模型在实际部署中性能显著下降。其解决方案的关键在于提出一种环境鲁棒的FSS设置(Environment-Robust FSS, ER-FSS),并构建涵盖多种真实场景的基准数据集(ER-FSS benchmark),同时设计了自适应注意力蒸馏(Adaptive Attention Distillation, AAD)方法:通过反复对比和蒸馏支持图像(support)与查询图像(query)之间的共享语义特征,生成针对新类别的特定注意力机制,从而增强模型在复杂环境下对目标区域的聚焦能力,显著提升分割精度与泛化性能(mIoU提升3.3%–8.5%)。

链接: https://arxiv.org/abs/2601.03596
作者: Qianyu Guo,Jingrong Wu,Jieji Ren,Weifeng Ge,Wenqiang Zhang
机构: Shanghai Institute of Virology, Shanghai Jiao Tong University School of Medicine (上海交通大学医学院病毒研究所); School of Computer Science and Engineering, Southeast University (东南大学计算机科学与工程学院); School of Mechanical Engineering, Shanghai Jiao Tong University (上海交通大学机械工程学院); Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University (复旦大学计算机科学系智能信息处理重点实验室); Engineering Research Center of AI & Robotics, Ministry of Education, Academy for Engineering & Technology, Fudan University (复旦大学人工智能与机器人工程研究中心,教育部工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Few-shot segmentation (FSS) aims to rapidly learn novel class concepts from limited examples to segment specific targets in unseen images, and has been widely applied in areas such as medical diagnosis and industrial inspection. However, existing studies largely overlook the complex environmental factors encountered in real world scenarios-such as illumination, background, and camera viewpoint-which can substantially increase the difficulty of test images. As a result, models trained under laboratory conditions often fall short of practical deployment requirements. To bridge this gap, in this paper, an environment-robust FSS setting is introduced that explicitly incorporates challenging test cases arising from complex environments-such as motion blur, small objects, and camouflaged targets-to enhance model’s robustness under realistic, dynamic conditions. An environment robust FSS benchmark (ER-FSS) is established, covering eight datasets across multiple real world scenarios. In addition, an Adaptive Attention Distillation (AAD) method is proposed, which repeatedly contrasts and distills key shared semantics between known (support) and unknown (query) images to derive class-specific attention for novel categories. This strengthens the model’s ability to focus on the correct targets in complex environments, thereby improving environmental robustness. Comparative experiments show that AAD improves mIoU by 3.3% - 8.5% across all datasets and settings, demonstrating superior performance and strong generalization. The source code and dataset are available at: this https URL.
zh

[CV-44] Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

【速读】:该论文旨在解决当前空间智能(Spatial Intelligence, SI)研究中一个核心争议问题:空间理解能力究竟是源自视觉编码器(visual encoders)还是语言模型(Large Language Models, LLMs)本身的推理架构(reasoning backbone)。为验证这一问题,作者提出SiT-Bench——一个不依赖像素级输入的新型基准测试集,包含超过3,800个专家标注样本,覆盖五类主要任务和17个子任务,涵盖第一人称导航、视角变换及精细机器人操作等场景。其关键创新在于将单/多视角场景转化为高保真、坐标感知的文本描述,从而迫使LLMs进行符号化文本推理而非视觉模式匹配。实验表明,尽管SOTA LLMs在局部语义任务上表现良好,但在全局一致性方面仍存在显著“空间差距”(spatial gap),而显式引入空间推理机制可显著提升性能,说明LLMs具备潜在的世界建模能力。因此,该工作为未来视觉-语言模型(VLMs)与具身智能体的空间接地(spatial grounding)发展提供了基础性资源和方法论支持。

链接: https://arxiv.org/abs/2601.03590
作者: Zhongbin Guo,Zhen Yang,Yushan Li,Xinyue Zhang,Wenyu Gao,Jiacheng Wang,Chengzhi Li,Xiangrui Liu,Ping Jian
机构: School of Computer Science & Technology, Beijing Institute of Technology (北京理工大学计算机学院); BUCT (北京化工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input, comprising over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at this https URL.
zh

[CV-45] Detecting AI-Generated Images via Distributional Deviations from Real Images

【速读】:该论文旨在解决AI生成图像检测中泛化能力不足的问题,尤其是针对未见过的生成模型时检测性能下降的挑战。现有方法虽利用冻结的预训练CLIP图像编码器(CLIP-ViT)实现一定泛化性,但仅将其作为基础特征提取器,未能充分挖掘其潜在判别能力。解决方案的关键在于提出一种基于掩码的预训练模型微调策略(Masking-based Pre-trained model Fine-Tuning, MPFT),其中引入纹理感知掩码机制(Texture-Aware Masking, TAM),在微调过程中主动屏蔽包含生成模型特有纹理模式的区域,迫使模型关注真实图像与AI生成图像之间的分布偏差(distributional deviations),从而显著提升对未知生成模型的检测泛化性能。

链接: https://arxiv.org/abs/2601.03586
作者: Yakun Niu,Yingjian Chen,Lei Zhang
机构: Henan University (河南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of generative models has significantly enhanced the quality of AI-generated images, raising concerns about misinformation and the erosion of public trust. Detecting AI-generated images has thus become a critical challenge, particularly in terms of generalizing to unseen generative models. Existing methods using frozen pre-trained CLIP models show promise in generalization but treat the image encoder as a basic feature extractor, failing to fully exploit its potential. In this paper, we perform an in-depth analysis of the frozen CLIP image encoder (CLIP-ViT), revealing that it effectively clusters real images in a high-level, abstract feature space. However, it does not truly possess the ability to distinguish between real and AI-generated images. Based on this analysis, we propose a Masking-based Pre-trained model Fine-Tuning (MPFT) strategy, which introduces a Texture-Aware Masking (TAM) mechanism to mask textured areas containing generative model-specific patterns during fine-tuning. This approach compels CLIP-ViT to attend to the "distributional deviations" from authentic images for AI-generated image detection, thereby achieving enhanced generalization performance. Extensive experiments on the GenImage and UniversalFakeDetect datasets demonstrate that our method, fine-tuned with only a minimal number of images, significantly outperforms existing approaches, achieving up to 98.2% and 94.6% average accuracy on the two datasets, respectively.
zh
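
代码示例:纹理感知掩码(TAM)的一个直观近似,是用局部方差度量"纹理强度",微调时屏蔽纹理最强的图像块。以下草图中的掩码准则与比例均为笔者假设,并非论文的原始定义:

```python
import torch

def texture_aware_mask(images, patch=16, ratio=0.3):
    """images: (B, C, H, W)。将每张图中局部方差最高的 ratio 比例图块置零。"""
    b, c, h, w = images.shape
    # 切成不重叠图块, 以块内像素方差(跨通道平均)作为纹理度量
    p = images.unfold(2, patch, patch).unfold(3, patch, patch)
    var = p.contiguous().view(b, c, -1, patch * patch).var(-1).mean(1)  # (B, P)
    k = int(var.shape[1] * ratio)
    idx = var.topk(k, dim=1).indices               # 纹理最强的 k 个块
    mask = torch.ones_like(var).scatter_(1, idx, 0.0)
    gh, gw = h // patch, w // patch
    mask = mask.view(b, 1, gh, gw)
    mask = mask.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return images * mask                           # 被掩块区域置零

masked = texture_aware_mask(torch.randn(2, 3, 224, 224))
```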

[CV-46] SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization

【速读】:该论文旨在解决基于文本与点云的跨模态定位问题(cross-modal localization),即通过自然语言描述实现机器人在复杂环境中的自主定位,以支持自主导航和人机交互。其核心挑战在于文本与点云中物体常重复出现,需依赖空间关系作为判别性特征。解决方案的关键在于提出一种“粗到精”(coarse-to-fine)框架 SpatiaLoc:在粗粒度阶段,采用贝塞尔增强对象空间编码器(Bezier Enhanced Object Spatial Encoder, BEOSE)利用二次贝塞尔曲线建模实例级空间关系,并引入频域感知编码器(Frequency Aware Encoder, FAE)在全局层面生成频域空间表示;在细粒度阶段,设计不确定性感知高斯精细定位器(Uncertainty Aware Gaussian Fine Localizer, UGFL),将预测建模为高斯分布并使用不确定性感知损失函数回归2D位置,从而提升定位精度与鲁棒性。

链接: https://arxiv.org/abs/2601.03579
作者: Tianyi Shang,Pengjie Xu,Zhaojun Deng,Zhenyu Li,Zhicong Chen,Lijun Wu
机构: Fuzhou University (福州大学); Shandong Academy of Sciences (山东省科学院); Qingdao University (青岛大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
zh
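
代码示例:UGFL 将 2D 位置预测建模为高斯分布,常见做法是以负对数似然(NLL)作为不确定性感知回归损失,高不确定区域的残差权重会被自动降低。以下为各向同性高斯情形的最小草图(具体参数化以论文为准):

```python
import torch

def gaussian_nll_2d(pred_mu, pred_log_sigma, target):
    """各向同性二维高斯的 NLL(忽略常数项)。

    pred_mu: (B, 2) 预测位置均值; pred_log_sigma: (B, 1) 对数标准差;
    target:  (B, 2) 真实 2D 位置。
    """
    sigma2 = torch.exp(2.0 * pred_log_sigma)                 # 方差 sigma^2
    sq = ((pred_mu - target) ** 2).sum(dim=1, keepdim=True)  # 平方残差
    nll = 0.5 * sq / sigma2 + 2.0 * pred_log_sigma           # 2D 归一化项为 2*log(sigma)
    return nll.mean()

loss = gaussian_nll_2d(torch.randn(4, 2),
                       torch.zeros(4, 1, requires_grad=True),
                       torch.randn(4, 2))
loss.backward()   # 可正常反传
```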

[CV-47] CloudMatch: Weak-to-Strong Consistency Learning for Semi-Supervised Cloud Detection

【速读】:该论文旨在解决云检测任务中因像素级标注成本高昂而导致的监督信号不足问题,提出了一种名为CloudMatch的半监督学习框架。其解决方案的关键在于通过视图一致性学习(view-consistency learning)与场景混合增强(scene-mixing augmentations)相结合的方式,有效利用未标注的遥感图像。具体而言,模型对每张未标注图像生成一个弱增强视图和两个互补的强增强视图:其中一个融合不同场景的图像块以模拟上下文多样性,另一个在单个场景内进行混合以保持语义一致性,从而引导伪标签生成并提升模型泛化能力。这一机制使模型能够捕捉云模式在跨场景和场景内部的结构多样性和上下文丰富性。

链接: https://arxiv.org/abs/2601.03528
作者: Jiayi Zhao,Changlu Chen,Jingsheng Li,Tianxiang Xue,Kun Zhan
机构: Lanzhou University (兰州大学); City University of Macau (澳门城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal of Applied Remote Sensing

点击查看摘要

Abstract:Due to the high cost of annotating accurate pixel-level labels, semi-supervised learning has emerged as a promising approach for cloud detection. In this paper, we propose CloudMatch, a semi-supervised framework that effectively leverages unlabeled remote sensing imagery through view-consistency learning combined with scene-mixing augmentations. An observation behind CloudMatch is that cloud patterns exhibit structural diversity and contextual variability across different scenes and within the same scene category. Our key insight is that enforcing prediction consistency across diversely augmented views, incorporating both inter-scene and intra-scene mixing, enables the model to capture the structural diversity and contextual richness of cloud patterns. Specifically, CloudMatch generates one weakly augmented view along with two complementary strongly augmented views for each unlabeled image: one integrates inter-scene patches to simulate contextual variety, while the other employs intra-scene mixing to preserve semantic coherence. This approach guides pseudolabel generation and enhances generalization. Extensive experiments show that CloudMatch achieves good performance, demonstrating its capability to utilize unlabeled data efficiently and advance semi-supervised cloud detection.
zh
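
代码示例:弱到强一致性学习的骨架是:弱增强视图产生伪标签,两个互补强增强视图(场景间/场景内混合)的预测向伪标签对齐。以下为 FixMatch 风格的简化草图(增强方式与置信度阈值均为演示假设):

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, weak, strong_inter, strong_intra, tau=0.95):
    """三个视图为同一批未标注图像的不同增强, 形状 (B, C, H, W);
    model 输出逐像素 logits (B, K, H, W)。"""
    with torch.no_grad():
        probs = torch.softmax(model(weak), dim=1)
        conf, pseudo = probs.max(dim=1)              # 置信度与伪标签
        valid = (conf > tau).float()                 # 仅保留高置信像素
    loss = 0.0
    for strong in (strong_inter, strong_intra):      # 两个互补强视图
        ce = F.cross_entropy(model(strong), pseudo, reduction="none")
        loss = loss + (ce * valid).mean()
    return loss / 2.0

model = torch.nn.Conv2d(3, 2, 1)     # 占位的逐像素"分割模型"
x = torch.randn(2, 3, 64, 64)
loss = consistency_loss(model, x, x.flip(-1), x.flip(-2))
```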

[CV-48] Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution

【速读】:该论文旨在解决无人机热成像图像超分辨率(Thermal UAV Image Super-Resolution)中因跨模态对齐导致的高频信息丢失和物理不一致伪影问题,如纹理扭曲与边缘模糊。现有方法通常通过压缩光学特征以匹配热成像特征维度来实现跨模态融合,但忽略了两种模态在成像物理机制上的差异,从而损害重建质量并引入非物理性失真。解决方案的关键在于提出PCNet框架,其核心创新包括:1)设计交叉分辨率互增强模块(Cross-Resolution Mutual Enhancement Module, CRME),实现热图像超分与光学到热成像转换的联合优化,促进多尺度特征的双向交互并保留高频光学先验;2)引入物理驱动的热传导模块(Physics-Driven Thermal Conduction Module, PDTM),将二维热传导模型嵌入光学引导过程,通过建模空间变化的热导特性抑制不一致伪影;3)提出温度一致性损失(Temperature Consistency Loss),强制区域分布一致性和边界梯度平滑性,确保生成热图符合真实热辐射物理规律。实验表明,PCNet在VGTSR2.0和DroneVehicle数据集上均显著优于当前最优方法,在重建质量和下游任务(语义分割、目标检测)中表现优异。

链接: https://arxiv.org/abs/2601.03526
作者: Zhicheng Zhao,Fengjiao Peng,Jinquan Yan,Wei Lu,Chenglong Li,Jin Tang
机构: Anhui University (安徽大学); Anhui Provincial Key Laboratory of Multimodal Cognitive Computation (安徽省多模态认知计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Optics-guided thermal UAV image super-resolution has attracted significant research interest due to its potential in all-weather monitoring applications. However, existing methods typically compress optical features to match thermal feature dimensions for cross-modal alignment and fusion, which not only causes the loss of high-frequency information that is beneficial for thermal super-resolution, but also introduces physically inconsistent artifacts such as texture distortions and edge blurring by overlooking differences in the imaging physics between modalities. To address these challenges, we propose PCNet to achieve cross-resolution mutual enhancement between optical and thermal modalities, while physically constraining the optical guidance process via thermal conduction to enable robust thermal UAV image super-resolution. In particular, we design a Cross-Resolution Mutual Enhancement Module (CRME) to jointly optimize thermal image super-resolution and optical-to-thermal modality conversion, facilitating effective bidirectional feature interaction across resolutions while preserving high-frequency optical priors. Moreover, we propose a Physics-Driven Thermal Conduction Module (PDTM) that incorporates two-dimensional heat conduction into optical guidance, modeling spatially-varying heat conduction properties to prevent inconsistent artifacts. In addition, we introduce a temperature consistency loss that enforces regional distribution consistency and boundary gradient smoothness to ensure generated thermal images align with real-world thermal radiation principles. Extensive experiments on VGTSR2.0 and DroneVehicle datasets demonstrate that PCNet significantly outperforms state-of-the-art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.
zh
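
代码示例:二维热传导约束的物理基础是离散热方程 ∂T/∂t = α∇²T。一个显式差分步如下(NumPy;论文中的空间变化热导率此处简化为常数 α,边界按周期处理):

```python
import numpy as np

def heat_conduction_step(T, alpha=0.2):
    """对温度场 T (H, W) 做一步显式热传导更新。
    单位网格/单位时间步下, 稳定性要求 alpha <= 0.25。"""
    lap = (np.roll(T, 1, 0) + np.roll(T, -1, 0) +
           np.roll(T, 1, 1) + np.roll(T, -1, 1) - 4.0 * T)  # 五点拉普拉斯
    return T + alpha * lap

T = np.random.rand(64, 64)
for _ in range(10):
    T = heat_conduction_step(T)   # 温度场逐步平滑, 符合热扩散物理
```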

[CV-49] Semantic Belief-State World Model for 3D Human Motion Prediction

【速读】:该论文旨在解决传统人类运动预测方法在长时程预测中出现的累积漂移(compounding drift)、均值姿态坍缩(mean-pose collapse)以及不确定性校准不足的问题。这些问题源于现有模型将运动预测建模为直接的序列回归任务,未能分离观测重建与动力学建模,并缺乏对驱动运动潜在因素的显式表示。解决方案的关键在于提出一种语义信念状态世界模型(Semantic Belief-State World Model, SBWM),其将人类运动预测重构为在人体流形上的潜在动态模拟,通过维护一个递归的概率信念状态来独立学习动力学演化,且该状态显式对齐于SMPL-X解剖参数化。这种结构化的信息瓶颈强制潜变量仅捕捉运动动力学、意图和控制相关结构,而非静态几何或传感器噪声,从而实现更稳定、可解释且计算高效的长期运动模拟。

链接: https://arxiv.org/abs/2601.03517
作者: Sarim Chaudhry
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human motion prediction has traditionally been framed as a sequence regression problem where models extrapolate future joint coordinates from observed pose histories. While effective over short horizons, this approach does not separate observation reconstruction from dynamics modeling and offers no explicit representation of the latent causes governing motion. As a result, existing methods exhibit compounding drift, mean-pose collapse, and poorly calibrated uncertainty when rolled forward beyond the training regime. Here we propose a Semantic Belief-State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. Rather than predicting poses directly, SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with the SMPL-X anatomical parameterization. This alignment imposes a structural information bottleneck that prevents the latent state from encoding static geometry or sensor noise, forcing it to capture motion dynamics, intent, and control-relevant structure. Inspired by belief-state world models developed for model-based reinforcement learning, SBWM adapts stochastic latent transitions and rollout-centric training to the domain of human motion. In contrast to RSSM-based, transformer, and diffusion approaches optimized for reconstruction fidelity, SBWM prioritizes stable forward simulation. We demonstrate coherent long-horizon rollouts and competitive accuracy at substantially lower computational cost. These results suggest that treating the human body as part of the world model's state space rather than its output fundamentally changes how motion is simulated and predicted.
zh

[CV-50] G2P: Gaussian-to-Point Attribute Alignment for Boundary-Aware 3D Semantic Segmentation

【速读】:该论文旨在解决点云语义分割中因稀疏和不规则分布导致的外观信息不足问题,使得仅依赖几何特征难以区分形状相似但外观不同的物体(如颜色、纹理或材质差异)。其解决方案的关键在于提出Gaussian-to-Point (G2P) 方法,通过将3D高斯泼溅(3D Gaussian Splatting)中的外观感知属性迁移至点云,实现更具判别性和外观一致性的分割结果。核心创新包括:建立点级对应关系以缓解优化后高斯与原始点云几何之间的错位问题;利用高斯不透明度(opacity)属性消除现有模型受限的几何歧义;并通过高斯尺度(scale)属性提升复杂场景下边界定位的精度。

链接: https://arxiv.org/abs/2601.03510
作者: Hojun Song,Chae-yeong Song,Jeong-hun Hong,Chaewon Moon,Dong-hwi Kim,Gahyeon Kim,Soo Ye Kim,Yiyi Liao,Jaehyup Lee,Sang-hyo Park
机构: Kyungpook National University (庆北国立大学); Adobe Research (Adobe 研究院); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Semantic segmentation on point clouds is critical for 3D scene understanding. However, sparse and irregular point distributions provide limited appearance evidence, making geometry-only features insufficient to distinguish objects with similar shapes but distinct appearances (e.g., color, texture, material). We propose Gaussian-to-Point (G2P), which transfers appearance-aware attributes from 3D Gaussian Splatting to point clouds for more discriminative and appearance-consistent segmentation. Our G2P addresses the misalignment between optimized Gaussians and the original point geometry by establishing point-wise correspondences. By leveraging Gaussian opacity attributes, we resolve the geometric ambiguity that limits existing models. Additionally, Gaussian scale attributes enable precise boundary localization in complex 3D scenes. Extensive experiments demonstrate that our approach achieves superior performance on standard benchmarks and shows significant improvements on geometrically challenging classes, all without any 2D or language supervision.
zh
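
代码示例:"高斯到点"的属性迁移可用最近邻对应来示意:为每个原始点查找最近的高斯中心,拷贝其不透明度、尺度等属性。以下为基于 scipy KD 树的极简草图(真实方法中对应关系的建立更为精细):

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_gaussian_attrs(points, centers, opacity, scale):
    """points: (N, 3) 原始点云; centers: (M, 3) 优化后的高斯中心;
    opacity: (M,); scale: (M, 3)。返回每个点继承的属性。"""
    tree = cKDTree(centers)
    _, idx = tree.query(points, k=1)   # 逐点最近高斯
    return opacity[idx], scale[idx]

pts = np.random.rand(1000, 3)
centers = np.random.rand(500, 3)
op, sc = transfer_gaussian_attrs(pts, centers,
                                 np.random.rand(500), np.random.rand(500, 3))
```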

[CV-51] REFA: Real-time Egocentric Facial Animations for Virtual Reality CVPR2024

【速读】:该论文旨在解决虚拟环境中用户面部表情实时追踪的难题,尤其针对佩戴虚拟现实(VR)头显时如何非侵入式且无需复杂校准地驱动虚拟角色的面部表情。其关键解决方案在于提出一种基于知识蒸馏(knowledge distillation)的机器学习建模方法,能够融合来自多源异构数据(如合成图像与真实图像)的标签信息进行训练;同时,作者构建了一个包含18,000名受试者的轻量化采集数据集,并开发了一条鲁棒的可微分渲染(differentiable rendering)流水线,用于自动提取面部表情标签,从而实现高精度、低延迟的表情映射。

链接: https://arxiv.org/abs/2601.03507
作者: Qiang Zhang,Tong Xiao,Haroun Habeeb,Larissa Laich,Sofien Bouaziz,Patrick Snape,Wenjing Zhang,Matthew Cioffi,Peizhao Zhang,Pavel Pidlypenskyi,Winnie Lin,Luming Ma,Mengjiao Wang,Kunpeng Li,Chengjiang Long,Steven Song,Martin Prazak,Alexander Sjoholm,Ajinkya Deogade,Jaebong Lee,Julio Delgado Mangas,Amaury Aubel
机构: Reality Labs at Meta (Reality Labs at Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2024 Workshop

点击查看摘要

Abstract:We present a novel system for real-time tracking of facial expressions using egocentric views captured from a set of infrared cameras embedded in a virtual reality (VR) headset. Our technology enables any user to accurately drive the facial expressions of virtual characters in a non-intrusive manner and without the need for a lengthy calibration step. At the core of our system is a distillation-based approach to train a machine learning model on heterogeneous data and labels coming from multiple sources, e.g., synthetic and real images. As part of our dataset, we collected 18k diverse subjects using a lightweight capture setup consisting of a mobile phone and a custom VR headset with extra cameras. To process this data, we developed a robust differentiable rendering pipeline enabling us to automatically extract facial expression labels. Our system opens up new avenues for communication and expression in virtual environments, with applications in video conferencing, gaming, entertainment, and remote collaboration.
zh

[CV-52] SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的对象幻觉(object hallucination)问题,其核心在于识别并缓解由视觉编码器在弱结构监督下产生的视觉统计偏差(visual statistical bias)。这种偏差源于视觉编码器的“补丁袋”(Bag-of-Patches)行为,导致模型过度依赖局部纹理特征而非整体几何结构,从而引发虚假的视觉置信度并诱发幻觉。解决方案的关键在于提出一种无需训练的算法——结构破坏对比解码(Structure-Disrupted Contrastive Decoding, SDCD),通过引入一个结构扰乱的随机打乱视图对输出分布进行对比校准,强制模型降低在无结构信息下仍保持高置信度的token权重,从而有效抑制纹理驱动的偏差,显著减少幻觉现象并提升LVLM的整体多模态能力。

链接: https://arxiv.org/abs/2601.03500
作者: Yuxuan Xia,Siheng Wang,Peng Li
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, they often overlook the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor of object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
zh
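
代码示例:结构破坏对比解码的两个关键步骤是构造"无结构"视图与在 logits 层面做对比校准。下面给出常见的对比解码形式(打乱方式与超参 α 均为演示假设,非论文精确设定):

```python
import torch

def shuffle_patches(image, patch=32):
    """把图像切块后随机打乱, 破坏整体几何结构但保留局部纹理。"""
    b, c, h, w = image.shape
    gh, gw = h // patch, w // patch
    p = image.unfold(2, patch, patch).unfold(3, patch, patch)
    p = p.contiguous().view(b, c, gh * gw, patch, patch)
    p = p[:, :, torch.randperm(gh * gw)]                       # 打乱块顺序
    p = p.view(b, c, gh, gw, patch, patch).permute(0, 1, 2, 4, 3, 5)
    return p.contiguous().view(b, c, h, w)

def sdcd_logits(logits_orig, logits_shuffled, alpha=1.0):
    """对比校准: 惩罚在无结构视图下仍保持高置信的 token,
    从而抑制纹理驱动的虚假置信。"""
    return (1.0 + alpha) * logits_orig - alpha * logits_shuffled

img = torch.randn(1, 3, 224, 224)
img_shuffled = shuffle_patches(img)
calibrated = sdcd_logits(torch.randn(1, 32000), torch.randn(1, 32000))
```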

[CV-53] CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation

【速读】:该论文旨在解决遥感图像中基于自然语言描述的目标定位问题(referring remote sensing image segmentation),其核心挑战在于跨模态对齐的可靠性存在显著的空间非均匀性,主要由极端尺度变化、密集相似干扰项和复杂边界结构导致。现有方法采用全局统一的融合与精化策略,难以在视觉清晰区域避免语言噪声干扰,也无法充分提升混淆区域的判别能力。解决方案的关键在于提出一种不确定性引导框架(uncertainty-guided framework),通过引入一个可插拔的参考不确定性评分器(Referring Uncertainty Scorer, RUS)来预测像素级的指代模糊分布,并据此设计两个模块:1)不确定性门控融合(Uncertainty-Gated Fusion, UGF),动态调节语言注入强度以增强高不确定性区域的约束并抑制低不确定性区域的噪声;2)不确定性驱动局部精化(Uncertainty-Driven Local Refinement, UDLR),利用不确定性生成的软掩码聚焦于易错边界和细节区域进行精细化处理。该方法无需修改骨干网络即可显著提升复杂遥感场景下的鲁棒性和几何保真度。

链接: https://arxiv.org/abs/2601.03490
作者: Yuzhe Sun,Zhe Dong,Haochen Jiang,Tianzhu Liu,Yanfeng Gu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. However, due to extreme scale variations, dense similar distractors, and intricate boundary structures, the reliability of cross-modal alignment exhibits significant spatial non-uniformity. Existing methods typically employ uniform fusion and refinement strategies across the entire image, which often introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused areas. To address this, we propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference. Specifically, we introduce a plug-and-play Referring Uncertainty Scorer (RUS), which is trained via an online error-consistency supervision strategy to interpretably predict the spatial distribution of referential ambiguity. Building on this prior, we design two plug-and-play modules: 1) Uncertainty-Gated Fusion (UGF), which dynamically modulates language injection strength to enhance constraints in high-uncertainty regions while suppressing noise in low-uncertainty ones; and 2) Uncertainty-Driven Local Refinement (UDLR), which utilizes uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details. Extensive experiments demonstrate that our method functions as a unified, plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
zh

[CV-54] Understanding Reward Hacking in Text-to-Image Reinforcement Learning

【速读】:该论文旨在解决文本到图像(text-to-image, T2I)生成模型在强化学习(reinforcement learning, RL)后训练过程中出现的奖励黑客(reward hacking)问题,即模型通过生成不真实或低质量但能获得高奖励分数的图像来欺骗奖励模型。研究表明,现有奖励机制(如美学/人类偏好奖励和提示-图像一致性奖励)单独或组合使用时均无法完全避免此类行为,且普遍存在生成含伪影(artifact-prone)图像的共性失败模式。解决方案的关键在于提出一种轻量级、自适应的伪影奖励模型(artifact reward model),该模型基于少量人工标注的无伪影与含伪影样本进行训练,并可无缝集成至现有RL流程中作为正则化项,有效提升图像视觉真实性并显著降低奖励黑客现象。

链接: https://arxiv.org/abs/2601.03468
作者: Yunqi Hong,Kuei-Chun Kao,Hengguang Zhou,Cho-Jui Hsieh
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, using reward functions to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking - producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating the effectiveness of lightweight reward augmentation as a safeguard against reward hacking.
zh

[CV-55] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

【速读】:该论文旨在解决统一多模态生成模型在指令驱动图像编辑中视觉推理能力不足的问题,尤其在以推理为核心的编辑任务上表现欠佳。其核心挑战包括:受限的推理探索(仅依赖去噪随机性)、奖励融合偏倚以及基于视觉语言模型(VLM)的指令奖励不稳定。解决方案的关键在于提出ThinkRL-Edit框架,通过将视觉推理与图像合成解耦,并扩展推理探索范围至去噪之外;引入基于思维链(Chain-of-Thought, CoT)的推理采样机制,在在线采样前增加规划与反思阶段,促使模型在生成前探索多个语义假设并验证其合理性;同时采用无偏的链偏好分组策略替代加权聚合以优化多维奖励融合,并以二元检查清单替代区间型VLM评分,从而获得更精确、低方差且可解释的奖励信号。

链接: https://arxiv.org/abs/2601.03467
作者: Hengjia Li,Liming Jiang,Qing Yan,Yizhi Song,Hao Kang,Zichuan Liu,Xin Lu,Boxi Wu,Deng Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
zh

[CV-56] Latent Geometry of Taste: Scalable Low-Rank Matrix Factorization

【速读】:该论文旨在解决协同过滤(Collaborative Filtering)在大规模交互数据集上面临的可扩展性(Scalability)和数据稀疏性(Data Sparsity)两大瓶颈问题。其解决方案的关键在于:首先,基于MovieLens 32M数据集,采用高性能并行化交替最小二乘法(Alternating Least Squares, ALS)框架学习用户偏好潜在几何结构;其次,通过系统性的超参数优化发现,约束低秩模型在泛化性能上显著优于高维模型,能够在均方根误差(RMSE)与排序精度之间实现最优平衡;最后,通过可视化嵌入空间揭示了语义类型簇的无监督涌现,验证了模型仅从交互数据中即可捕捉深层结构关系,并引入可调评分参数以有效管理冷启动场景下流行度偏差与个性化亲和力之间的权衡。

链接: https://arxiv.org/abs/2601.03466
作者: Joshua Salako
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Scalability and data sparsity remain critical bottlenecks for collaborative filtering on massive interaction datasets. This work investigates the latent geometry of user preferences using the MovieLens 32M dataset, implementing a high-performance, parallelized Alternating Least Squares (ALS) framework. Through extensive hyperparameter optimization, we demonstrate that constrained low-rank models significantly outperform higher dimensional counterparts in generalization, achieving an optimal balance between Root Mean Square Error (RMSE) and ranking precision. We visualize the learned embedding space to reveal the unsupervised emergence of semantic genre clusters, confirming that the model captures deep structural relationships solely from interaction data. Finally, we validate the system’s practical utility in a cold-start scenario, introducing a tunable scoring parameter to manage the trade-off between popularity bias and personalized affinity effectively. The codebase for this research can be found here: this https URL
zh
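
代码示例:ALS 交替固定一侧因子,对另一侧逐行求解带正则的岭回归法方程。以下为稠密评分矩阵上的最小实现(NumPy;实际系统会利用稀疏结构并做并行化):

```python
import numpy as np

def als_step(R, mask, U, V, lam=0.1):
    """固定 V, 更新 U。R: (m, n) 评分; mask: 同形状观测指示;
    U: (m, k); V: (n, k)。对 V 的更新以转置对称调用即可。"""
    k = U.shape[1]
    I = lam * np.eye(k)
    U_new = np.empty_like(U)
    for i in range(R.shape[0]):
        obs = mask[i]                       # 用户 i 评过分的物品
        Vo = V[obs]                         # (n_i, k)
        A = Vo.T @ Vo + I                   # 岭回归法方程
        b = Vo.T @ R[i, obs]
        U_new[i] = np.linalg.solve(A, b)
    return U_new

m, n, k = 50, 40, 8
R = np.random.rand(m, n)
mask = np.random.rand(m, n) < 0.2           # 稀疏观测
U, V = 0.1 * np.random.randn(m, k), 0.1 * np.random.randn(n, k)
for _ in range(5):                          # 交替最小二乘
    U = als_step(R, mask, U, V)
    V = als_step(R.T, mask.T, V, U)
```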

[CV-57] Experimental Comparison of Light-Weight and Deep CNN Models Across Diverse Datasets

【速读】:该论文旨在解决在低资源环境下,如何构建高效且具有广泛适用性的视觉模型基准问题,特别是在缺乏大型GPU或专用预训练模型的情况下实现跨域任务的高性能。其解决方案的关键在于使用一个经过良好正则化的浅层卷积神经网络(Shallow Convolutional Neural Network, CNN)作为统一且可复现的基线模型,证明其在多个孟加拉国视觉数据集上均表现出色,从而凸显轻量级CNN在实际部署中的实用价值。

链接: https://arxiv.org/abs/2601.03463
作者: Md. Hefzul Hossain Papon,Shadman Rabby
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 25 pages, 11 figures

点击查看摘要

Abstract:Our results reveal that a well-regularized shallow architecture can serve as a highly competitive baseline across heterogeneous domains - from smart-city surveillance to agricultural variety classification - without requiring large GPUs or specialized pre-trained models. This work establishes a unified, reproducible benchmark for multiple Bangladeshi vision datasets and highlights the practical value of lightweight CNNs for real-world deployment in low-resource settings.
zh

[CV-58] FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder

【速读】:该论文旨在解决端到端(End-to-end, E2E)自动驾驶模型在面对新颖和复杂场景时泛化能力不足的问题。当前普遍采用的全量微调视觉编码器的方法,容易导致模型过度适应训练数据,从而限制其泛化性能。解决方案的关键在于提出FROST-Drive架构,通过冻结预训练视觉语言模型(Vision-Language Model, VLM)的视觉编码器权重,保留其强大的通用世界知识,并结合基于Transformer的适配器实现多模态融合与基于GRU的解码器生成平滑路径点。此外,设计了直接优化Rater Feedback Score(RFS)的损失函数以提升轨迹规划鲁棒性。实验表明,该冻结编码器策略显著优于全微调方法,在Waymo Open E2E Dataset上展现出更强的泛化能力。

链接: https://arxiv.org/abs/2601.03460
作者: Zeyu Dong,Yimin Zhu,Yu Wu,Yu Sun
机构: Stony Brook University (石溪大学); Rutgers University (罗格斯大学); Sunrise Technology Inc. (曙光科技公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder's weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on the Waymo Open E2E Dataset, a large-scale dataset deliberately curated to capture long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.
zh
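
代码示例:"冻结视觉编码器 + 可训练适配器与解码器"的做法,在 PyTorch 中只需关闭编码器梯度并训练后接模块。下面是一个结构示意(编码器、适配器与解码器的具体设计均为演示假设):

```python
import torch
import torch.nn as nn

class FrozenEncoderDriver(nn.Module):
    def __init__(self, encoder, feat_dim=768, hidden=256, horizon=8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # 冻结预训练视觉编码器
            p.requires_grad = False
        self.adapter = nn.TransformerEncoderLayer(feat_dim, nhead=8,
                                                  batch_first=True)
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)      # 每步输出一个 (x, y) 路径点
        self.horizon = horizon

    def forward(self, tokens):
        """tokens: (B, T, feat_dim) 的视觉 token 序列(演示输入)。"""
        with torch.no_grad():
            feats = self.encoder(tokens)      # 冻结前向, 不积累梯度
        fused = self.adapter(feats)           # 可训练的多模态融合适配器
        ctx = fused.mean(1, keepdim=True).repeat(1, self.horizon, 1)
        out, _ = self.gru(ctx)
        return self.head(out)                 # (B, horizon, 2) 平滑路径点

model = FrozenEncoderDriver(nn.Identity())    # 用恒等映射占位编码器
waypoints = model(torch.randn(2, 16, 768))
```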

[CV-59] WeedRepFormer: Reparameterizable Vision Transformers for Real-Time Waterhemp Segmentation and Gender Classification

【速读】:该论文旨在解决农业场景中作物杂草(如水蔓菁)的细粒度特征提取与实时部署效率之间的矛盾问题,即如何在保证生物属性分类精度的同时实现轻量化模型设计。解决方案的关键在于提出一种基于结构重参数化(structural reparameterization)的多任务视觉Transformer架构——WeedRepFormer,其通过在整个网络结构中系统性地引入可重参数化机制(包括Vision Transformer骨干、轻量级R-ASPP解码器和新型可重参数化分类头),实现训练阶段高容量与推理阶段低延迟的解耦,从而在仅使用3.59M参数和3.80 GFLOPs的情况下,同时达成92.18% mIoU的分割性能和81.91%的性别分类准确率,并在108.95 FPS下显著优于当前最优模型iFormer-T。

链接: https://arxiv.org/abs/2601.03431
作者: Toqi Tahamid Sarker,Taminul Islam,Khaled R. Ahmed,Cristiana Bernardi Rankrape,Kaitlin E. Creager,Karla Gage
机构: Southern Illinois University Carbondale (南方伊利诺伊大学卡本代尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:We present WeedRepFormer, a lightweight multi-task Vision Transformer designed for simultaneous waterhemp segmentation and gender classification. Existing agricultural models often struggle to balance the fine-grained feature extraction required for biological attribute classification with the efficiency needed for real-time deployment. To address this, WeedRepFormer systematically integrates structural reparameterization across the entire architecture - comprising a Vision Transformer backbone, a Lite R-ASPP decoder, and a novel reparameterizable classification head - to decouple training-time capacity from inference-time latency. We also introduce a comprehensive waterhemp dataset containing 10,264 annotated frames from 23 plants. On this benchmark, WeedRepFormer achieves 92.18% mIoU for segmentation and 81.91% accuracy for gender classification using only 3.59M parameters and 3.80 GFLOPs. At 108.95 FPS, our model outperforms the state-of-the-art iFormer-T by 4.40% in classification accuracy while maintaining competitive segmentation performance and significantly reducing parameter count by 1.9x.
zh
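
代码示例:结构重参数化的经典操作是训练期并联 3x3 与 1x1 卷积、推理前合并为单个 3x3 卷积,数值上严格等价。以下为不含 BN 的最简合并草图(WeedRepFormer 的具体分支设计以论文为准):

```python
import torch
import torch.nn as nn

def merge_3x3_1x1(conv3, conv1):
    """把并联的 1x1 分支吸收进 3x3 卷积: 1x1 权重零填充到 3x3 中心。"""
    fused = nn.Conv2d(conv3.in_channels, conv3.out_channels, 3, padding=1)
    w = conv3.weight.data.clone()
    w[:, :, 1:2, 1:2] += conv1.weight.data            # 中心位置相加
    fused.weight.data = w
    fused.bias.data = conv3.bias.data + conv1.bias.data
    return fused

c3 = nn.Conv2d(8, 8, 3, padding=1)
c1 = nn.Conv2d(8, 8, 1)
x = torch.randn(1, 8, 16, 16)
y_train = c3(x) + c1(x)                               # 训练期多分支
y_infer = merge_3x3_1x1(c3, c1)(x)                    # 推理期单分支
print(torch.allclose(y_train, y_infer, atol=1e-5))    # True: 数值等价
```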

[CV-60] GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对对抗性输入时安全对齐机制脆弱的问题,尤其是针对具备链式思维(Chain-of-Thought, CoT)能力的推理型模型,现有攻击方法因未有效利用模型自身的推理动机而效果不佳。解决方案的关键在于提出 GAMBIT(Gamified Adversarial Multimodal Breakout via Instructional Traps)框架,通过分解并重构有害视觉语义,构建一个"游戏化"场景,诱导模型将攻击目标视为自身任务目标,从而主动完成越狱行为;该方法通过增强视觉与文本双重复杂度,使模型在参与"游戏"的过程中降低安全注意力,进而输出被重构的恶意查询,实验表明其在多个主流推理与非推理型MLLM上均实现显著更高的攻击成功率(ASR)。

链接: https://arxiv.org/abs/2601.03416
作者: Xiangdong Hu,Yangyang Jiang,Qin Hu,Xiaojun Jia
机构: Georgia State University (佐治亚州立大学); Nanyang Technological University, Singapore (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. This leads to them underperforming on reasoning models (models with Chain-of-Thought) compared to non-reasoning ones (models without Chain-of-Thought). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
zh

[CV-61] Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning

【速读】:该论文旨在解决胰腺导管腺癌(Pancreatic Ductal Adenocarcinoma, PDAC)分子分型在临床实践中应用受限的问题,主要瓶颈包括成本高、检测周期长及组织样本需求量大。为克服这些限制,作者提出PanSubNet——一种可解释的深度学习框架,能够直接从常规苏木精-伊红(HE-stained)全切片图像(Whole Slide Images, WSIs)中预测具有治疗相关性的分子亚型(basal-like与classical)。其核心创新在于采用双尺度架构,融合细胞级形态学特征与组织级结构信息,并引入注意力机制实现多尺度表征学习和透明的特征归因;模型在内部验证中达到平均AUC 88.5%,外部独立验证(TCGA队列)仍保持稳健性能(AUC 84.0%),且优于基于RNA-seq的分型在预后分层上的表现,同时预测不确定性与中间转录状态相关而非分类噪声,确保了结果的生物学合理性与临床可靠性。

链接: https://arxiv.org/abs/2601.03410
作者: Abdul Rehman Akbar,Alejandro Levya,Ashwini Esnakula,Elshad Hasanov,Anne Noonan,Upender Manne,Vaibhav Sahai,Lingbin Meng,Susan Tsai,Anil Parwani,Wei Chen,Ashish Manne,Muhammad Khalid Khan Niazi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Molecular subtyping of PDAC into basal-like and classical has established prognostic and predictive value. However, its use in clinical practice is limited by cost, turnaround time, and tissue requirements, thereby restricting its application in the management of PDAC. We introduce PanSubNet, an interpretable deep learning framework that predicts therapy-relevant molecular subtypes directly from standard HE-stained WSIs. PanSubNet was developed using data from 1,055 patients across two multi-institutional cohorts (PANCAN, n=846; TCGA, n=209) with paired histology and RNA-seq data. Ground-truth labels were derived using the validated Moffitt 50-gene signature refined by GATA6 expression. The model employs dual-scale architecture that fuses cellular-level morphology with tissue-level architecture, leveraging attention mechanisms for multi-scale representation learning and transparent feature attribution. On internal validation within PANCAN using five-fold cross-validation, PanSubNet achieved mean AUC of 88.5% with balanced sensitivity and specificity. External validation on the independent TCGA cohort without fine-tuning demonstrated robust generalizability (AUC 84.0%). PanSubNet preserved and, in metastatic disease, strengthened prognostic stratification compared to RNA-seq based labels. Prediction uncertainty linked to intermediate transcriptional states, not classification noise. Model predictions are aligned with established transcriptomic programs, differentiation markers, and DNA damage repair signatures. By enabling rapid, cost-effective molecular stratification from routine HE-stained slides, PanSubNet offers a clinically deployable and interpretable tool for genetic subtyping. We are gathering data from two institutions to validate and assess real-world performance, supporting integration into digital pathology workflows and advancing precision oncology for PDAC.
zh

[CV-62] Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在面对需要深层推理的视觉理解任务时表现不足的问题,尤其是其对表面特征的依赖和缺乏抽象推理能力。解决方案的关键在于引入Eye-Q这一多语言视觉词谜基准,该基准通过设计概念密集、线索隐含且结构松散的视觉场景,要求模型进行假设生成与修正、选择性注意、抽象归纳及跨概念映射,从而评估VLMs在非字面语义理解上的能力。实验表明,现有先进VLMs在该任务上准确率最高仅达60.27%,凸显了其在构建和搜索恰当概念表征以实现灵活图像到短语推理方面的显著局限。

链接: https://arxiv.org/abs/2601.03400
作者: Ali Najar,Alireza Mirrokni,Arshia Izadyari,Sadegh Mohammadian,Amir Homayoon Sharifizade,Asal Meskin,Mobin Bagherian,Ehsaneddin Asgari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models’ ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
zh

[CV-63] Better But Not Sufficient: Testing Video ANNs Against Macaque IT Dynamics ICCV2025

【速读】:该论文试图解决的问题是:当前基于静态图像训练的前馈人工神经网络(Feedforward Artificial Neural Networks, FANNs)在模拟灵长类动物腹侧视觉通路(特别是下颞叶皮层,inferior temporal cortex, IT)的时间响应特性方面存在局限,尤其是无法充分捕捉IT对动态物体运动速度的编码能力。论文的关键解决方案在于设计了一个“应力测试”(stress test)——使用去除形状和纹理信息但保留运动特征的“外观无关”(appearance-free)视频数据集来评估模型泛化能力。结果表明,猕猴IT群体活动能跨此类变换保持稳定响应,而所有类型的ANN模型(包括静态、递归和视频驱动模型)均失败,揭示了现有模型仅能表征依赖于外观的动态计算,而非IT所体现的外观不变的时间计算机制。这一发现强调了未来模型需引入生物时间统计特性和不变性作为新目标。

链接: https://arxiv.org/abs/2601.03392
作者: Matteo Dunnhofer,Christian Micheloni,Kohitij Kar
机构: York University, Toronto, Canada; University of Udine, Italy
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: Extended Abstract at the 2nd Human-inspired Computer Vision workshop at ICCV 2025

点击查看摘要

Abstract:Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the primate ventral visual stream, yet they are intrinsically limited to static computations. The primate world is dynamic, and the macaque ventral visual pathway, specifically the inferior temporal (IT) cortex, not only supports object recognition but also encodes object motion velocity during naturalistic video viewing. Do IT's temporal responses reflect nothing more than time-unfolded feedforward transformations or framewise features with shallow temporal pooling, or do they embody richer dynamic computations? We tested this by comparing macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Video models provided modest improvements in neural predictivity, particularly at later response stages, raising the question of what kind of dynamics they capture. To probe this, we applied a stress test: decoders trained on naturalistic videos were evaluated on "appearance-free" variants that preserve motion but remove shape and texture. IT population activity generalized across this manipulation, but all ANN classes failed. Thus, current video models better capture appearance-bound dynamics rather than the appearance-invariant temporal computations expressed in IT, underscoring the need for new objectives that encode biological temporal statistics and invariances.
zh

[CV-64] A Novel Unified Approach to Deepfake Detection

【速读】:该论文旨在解决生成式 AI (Generative AI) 技术发展带来的 Deepfake(深度伪造)威胁,以维护数字时代的信息可信度。其核心问题是实现对图像和视频中 Deepfake 内容的高精度检测与分类。解决方案的关键在于提出一种统一架构,融合空间域与频域特征的交叉注意力机制(cross attention)以及一个血流特征检测模块,从而增强模型对伪造痕迹的感知能力;实验表明,该方法在 FF++ 和 Celeb-DF 数据集上分别达到 99.80% 和 99.88% 的 AUC 分数(使用 Swin Transformer + BERT),并展现出良好的跨数据集泛化性能。

链接: https://arxiv.org/abs/2601.03382
作者: Lord Sen,Shyamapada Mukherjee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The advancements in the field of AI are increasingly giving rise to various threats. One of the most prominent of them is the synthesis and misuse of Deepfakes. To sustain trust in this digital age, detection and tagging of deepfakes are essential. In this paper, a novel architecture for Deepfake detection in images and videos is presented. The architecture uses cross attention between spatial and frequency domain features along with a blood detection module to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than SOTA, specifically 99.80% and 99.88% AUC on FF++ and Celeb-DF when using Swin Transformer and BERT, and 99.55% and 99.38% AUC when using EfficientNet-B4 and BERT. The approach also generalizes very well, achieving strong cross-dataset results.
zh

[CV-65] Guardians of the Hair: Rescuing Soft Boundaries in Depth Stereo and Novel Views

【Quick Read】: This paper tackles the difficulty of recovering soft boundaries (e.g., thin hairs) in 3D vision tasks, where foreground and background cues mix ambiguously. The core of the solution is the HairGuard framework, with two key components: (1) a data curation pipeline built on image matting datasets together with a depth fixer network that, via a gated residual module, refines depth precisely around soft-boundary regions without degrading global depth quality; and (2) depth-based forward warping combined with a generative scene painter that preserves high-fidelity textures and fills disoccluded regions, after which a color fuser adaptively merges the results, yielding novel views with consistent geometry and fine-grained detail.

Link: https://arxiv.org/abs/2601.03362
Authors: Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, Christopher Schroers
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.

[CV-66] RelightAnyone: A Generalized Relightable 3D Gaussian Head Model

【Quick Read】: This paper addresses relighting of 3D Gaussian Splatting (3DGS) head avatars under arbitrary scene illumination. Existing methods rely on complex time-multiplexed illumination capture such as one-light-at-a-time (OLAT), which limits flexibility and generalization. The key to the solution is a two-stage generalized relightable 3D Gaussian head model: the first stage learns flat-lit 3DGS representations without OLAT data, and the second stage, trained on a small amount of OLAT data, learns a mapping from flat-lit parameters to physically based reflectance parameters, enabling high-quality relighting of any subject. This design allows training on diverse multi-view data for cross-subject generalization and lets the model be fitted to unseen subjects from as little as a single image, significantly improving the utility of digital avatars for novel view synthesis and relighting.

Link: https://arxiv.org/abs/2601.03357
Authors: Yingyan Xu, Pramod Rao, Sebastian Weiss, Gaspard Zoss, Markus Gross, Christian Theobalt, Marc Habermann, Derek Bradley
Affiliations: ETH Zürich; DisneyResearch|Studios; Max Planck Institute for Informatics, SIC
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view the abstract

Abstract:3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting, ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.

[CV-67] MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

【Quick Read】: This paper asks whether vision-language models (VLMs) can recognize erroneous reasoning in multimodal settings, i.e., detect that a reasoning process is wrong within both visual and linguistic context and classify the error type. The key to the solution is MMErroR, a multimodal benchmark of 2,013 samples, each embedding a single coherent reasoning error, spanning 24 subdomains across six top-level domains for broad coverage and taxonomic richness. Unlike prior evaluations that focus on answer correctness, MMErroR targets process-level, error-centric evaluation that requires models not only to detect incorrect reasoning but also to classify the error type, exposing the limits of current VLMs: even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases, underscoring the difficulty and value of the problem.

Link: https://arxiv.org/abs/2601.03331
Authors: Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, Zhiqi Huang
Affiliations: Guangdong University of Technology (广东工业大学); Hong Kong Baptist University (香港浸会大学); Sun Yat-sen University (中山大学); Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: this https URL

[CV-68] Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

【Quick Read】: This paper addresses the extraction of rotation-invariant shape features for shape description and comparison that remain stable under rotation, e.g., for molecular shape descriptors, object recognition in 2D images or 3D scans, and inexpensive shape-similarity computation without costly optimization over rotations. The key to the solution is extending PCA from the covariance matrix (second-order moments) to higher-order tensors (third- and higher-order central moments), building more precise shape representations and deriving corresponding rotation invariants (e.g., traces of powers), enabling high-accuracy, decodable descriptions of arbitrarily complex shapes.

Link: https://arxiv.org/abs/2601.03326
Authors: Jarek Duda
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 4 figures

Click to view the abstract

Abstract:PCA can be used for rotation-invariant features, describing a shape by its covariance matrix p_ab = E[(x_a - E[x_a])(x_b - E[x_b])], approximating the shape by an ellipsoid and allowing rotation invariants such as its traces of powers. However, real shapes are usually much more complicated, hence we propose its extension to, e.g., the order-3 tensor p_abc = E[(x_a - E[x_a])(x_b - E[x_b])(x_c - E[x_c])] or higher-order tensors describing central moments, or polynomial-times-Gaussian representations allowing decodable shape descriptors of arbitrarily high accuracy, together with their analogous rotation invariants. Practical applications could include rotation-invariant features describing shape modulo rotation, e.g., for molecular shape descriptors, for up-to-rotation object recognition in 2D images/3D scans, or a shape similarity metric allowing inexpensive comparison (modulo rotation) without costly optimization over rotations.
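To make the invariants concrete, here is a minimal numerical sketch (not the paper's code) that builds the order-2 and order-3 central-moment tensors of a point cloud and checks that simple contractions of them are unchanged by a random rotation; the function names are illustrative.

```python
# A minimal sketch of PCA-like rotation invariants from central-moment
# tensors, assuming a shape given as an (N, 3) point cloud.
import numpy as np

def central_moments(points):
    """Return the order-2 covariance p_ab and order-3 tensor p_abc."""
    x = points - points.mean(axis=0)                      # center: x_a - E[x_a]
    p2 = np.einsum('na,nb->ab', x, x) / len(x)            # p_ab
    p3 = np.einsum('na,nb,nc->abc', x, x, x) / len(x)     # p_abc
    return p2, p3

def invariants(p2, p3):
    """A few rotation invariants: traces of powers of p2, full contraction of p3."""
    return np.array([
        np.trace(p2),                    # tr(p2)
        np.trace(p2 @ p2),               # tr(p2^2)
        np.einsum('abc,abc->', p3, p3),  # |p3|^2, unchanged by rotations
    ])

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])  # anisotropic shape
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))                 # random rotation
i_orig = invariants(*central_moments(pts))
i_rot = invariants(*central_moments(pts @ R.T))
print(np.allclose(i_orig, i_rot))  # True: features match modulo rotation
```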

[CV-69] Listen to Rhythm Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset

【Quick Read】: This paper targets coarse semantic control and poor long-sequence coherence in current dance motion generation methods. The key to the solution is LRCM, a multimodal-guided diffusion framework that decouples motion-capture data, audio rhythm, and text descriptions via a feature-decoupling strategy, fuses modalities with an audio-latent Conformer and a text-latent Cross-Conformer, and introduces a Motion Temporal Mamba Module (MTMM) for smooth, long-duration autoregressive synthesis, markedly improving the semantic accuracy and temporal coherence of generated dance motion.

Link: https://arxiv.org/abs/2601.03323
Authors: Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu, Qiong Wu
Affiliations: Communication University of China (中国传媒大学); Zhipu AI (智谱AI)
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 12 pages, 13 figures

Click to view the abstract

Abstract:Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.

[CV-70] Deep Learning-Based Image Recognition for Soft-Shell Shrimp Classification

【Quick Read】: This paper addresses the rapid post-harvest decline in freshness of shrimp products and the frequent head-body separation of soft-shell shrimp during cooking or freezing, which degrade appearance integrity and consumer acceptance. The key to the solution is deep-learning-based image recognition: a convolutional neural network (CNN) automates post-harvest classification of white shrimp, replacing manual sorting to improve accuracy, efficiency, and consistency, while shortening processing time to better preserve freshness and meet the demands of transportation and markets for high-quality aquatic products.

Link: https://arxiv.org/abs/2601.03317
Authors: Yun-Hao Zhang, I-Hsien Ting, Dario Liberona, Yun-Hsiu Liu, Kazunori Minetaki
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:With the integration of information technology into aquaculture, production has become more stable and continues to grow annually. As consumer demand for high-quality aquatic products rises, freshness and appearance integrity are key concerns. In shrimp-based processed foods, freshness declines rapidly post-harvest, and soft-shell shrimp often suffer from head-body separation after cooking or freezing, affecting product appearance and consumer perception. To address these issues, this study leverages deep learning-based image recognition for automated classification of white shrimp immediately after harvest. A convolutional neural network (CNN) model replaces manual sorting, enhancing classification accuracy, efficiency, and consistency. By reducing processing time, this technology helps maintain freshness and ensures that shrimp transportation businesses meet customer demands more effectively.
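For readers who want a picture of the classifier family involved, below is a minimal CNN sketch in PyTorch; the architecture, input resolution, and class count are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a CNN shrimp-quality classifier, assuming 128x128 RGB
# crops and two classes (e.g., intact vs. head-body separated); all
# hyperparameters here are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

class ShrimpCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = ShrimpCNN()
logits = model(torch.randn(8, 3, 128, 128))  # batch of 8 images
print(logits.shape)  # torch.Size([8, 2])
```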

[CV-71] VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

【Quick Read】: This paper studies a fundamental but rarely systematized question in Vision-Language-Action (VLA) models: how the choice and competence of the pretrained vision-language model (VLM) translate into downstream VLA policy performance. The key to the solution is VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies with only a small set of new learnable parameters, enabling fair and efficient comparison. The study finds that while VLM initialization helps, a VLM's general capabilities are poor predictors of downstream performance; fine-tuning on specific embodied skills does not guarantee better control; and the visual module, not the language module, is the main bottleneck. Injecting control-relevant supervision into the VLM's vision encoder yields consistent gains even when the encoder is frozen, isolating a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning.

Link: https://arxiv.org/abs/2601.03309
Authors: Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen
Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University (清华大学交叉信息研究院); Qwen Team, Alibaba Inc. (阿里巴巴集团)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.

[CV-72] Mass Concept Erasure in Diffusion Models with Concept Hierarchy AAAI2026

【Quick Read】: This paper addresses the twin bottlenecks of efficiency and effectiveness in concept erasure for diffusion models: as the number of erased concepts grows, methods that fine-tune a separate parameter set per concept become computationally expensive and can degrade overall generation quality. The key to the solution is a supertype-subtype concept hierarchy that groups semantically related concepts (e.g., macaw and bald eagle) under a shared supertype node (e.g., bird) and erases them jointly via group-wise suppression with a single set of learnable parameters; in addition, Supertype-Preserving Low-Rank Adaptation (SuPLoRA) freezes the down-projection matrix to preserve supertype information and updates only the up-projection matrix, mitigating the degradation of supertype generation caused by excessive erasure of subtypes.

Link: https://arxiv.org/abs/2601.03305
Authors: Jiahang Tu, Ye Li, Yiming Wu, Hanbin Zhao, Chao Zhang, Hui Qian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This paper has been accepted by AAAI 2026

Click to view the abstract

Abstract:The success of diffusion models has raised concerns about the generation of unsafe or harmful content, prompting concept erasure approaches that fine-tune modules to suppress specific concepts while preserving general generative capabilities. However, as the number of erased concepts grows, these methods often become inefficient and ineffective, since each concept requires a separate set of fine-tuned parameters and may degrade the overall generation quality. In this work, we propose a supertype-subtype concept hierarchy that organizes erased concepts into a parent-child structure. Each erased concept is treated as a child node, and semantically related concepts (e.g., macaw and bald eagle) are grouped under a shared parent node, referred to as a supertype concept (e.g., bird). Rather than erasing concepts individually, we introduce an effective and efficient group-wise suppression method, where semantically similar concepts are grouped and erased jointly by sharing a single set of learnable parameters. During the erasure phase, standard diffusion regularization is applied to preserve the denoising process in unmasked regions. To mitigate the degradation of supertype generation caused by excessive erasure of semantically related subtypes, we propose a novel method called Supertype-Preserving Low-Rank Adaptation (SuPLoRA), which encodes the supertype concept information in the frozen down-projection matrix and updates only the up-projection matrix during erasure. Theoretical analysis demonstrates the effectiveness of SuPLoRA in mitigating generation performance degradation. We construct a more challenging benchmark that requires simultaneous erasure of concepts across diverse domains, including celebrities, objects, and pornographic content.
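The SuPLoRA idea, freezing the down-projection and training only the up-projection, is easy to sketch. The module below is one plausible minimal reading of that design in PyTorch, not the authors' implementation.

```python
# Minimal sketch of a SuPLoRA-style adapter: the down-projection is frozen
# (intended to encode supertype information), and only the up-projection is
# trained during erasure. Illustrative only; shapes and init are assumptions.
import torch
import torch.nn as nn

class SuPLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base.requires_grad_(False)        # frozen pretrained layer
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.down.weight.requires_grad_(False)        # frozen down-projection
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                # start as a no-op update
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = SuPLoRALinear(nn.Linear(768, 768))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # ['up.weight'] -- only the up-projection is updated
```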

[CV-73] CageDroneRF: A Large-Scale RF Benchmark and Toolkit for Drone Perception

【Quick Read】: This paper addresses the scarcity and limited diversity of datasets for radio-frequency (RF) drone detection and identification: existing public datasets cover few drone models, environments, and signal conditions, hindering the development of robust and generalizable RF perception models. The key to the solution is CageDroneRF (CDRF), a large-scale benchmark built from real-world captures combined with systematically generated synthetic variants via a controlled augmentation pipeline that (i) precisely controls signal-to-noise ratio (SNR), (ii) injects interfering emitters, and (iii) applies frequency shifts with label-consistent bounding-box transformations. Open-source tools for data generation, preprocessing, augmentation, and evaluation interoperate with existing public benchmarks, enabling standardized, reproducible comparison for classification, open-set recognition, and object detection.

Link: https://arxiv.org/abs/2601.03302
Authors: Mohammad Rostami, Atik Faysal, Hongtao Xia, Hadi Kasasbeh, Ziang Gao, Huaxia Wang
Affiliations: Rowan University (罗文大学); AeroDefense
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view the abstract

Abstract:We present CageDroneRF (CDRF), a large-scale benchmark for Radio-Frequency (RF) drone detection and identification built from real-world captures and systematically generated synthetic variants. CDRF addresses the scarcity and limited diversity of existing RF datasets by coupling extensive raw recordings with a principled augmentation pipeline that (i) precisely controls Signal-to-Noise Ratio (SNR), (ii) injects interfering emitters, and (iii) applies frequency shifts with label-consistent bounding-box transformations for detection. This dataset spans a wide range of contemporary drone models, many unavailable in current public datasets, and acquisition conditions, derived from data collected at the Rowan University campus and within a controlled RF-cage facility. CDRF is released with interoperable open-source tools for data generation, preprocessing, augmentation, and evaluation that also operate on existing public benchmarks. CDRF enables standardized benchmarking for classification, open-set recognition, and object detection, supporting rigorous comparisons and reproducible pipelines. By releasing this comprehensive benchmark and tooling, CDRF aims to accelerate progress toward robust, generalizable RF perception models.
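The SNR-control step of such an augmentation pipeline can be sketched in a few lines. The example below scales complex Gaussian noise to hit a target SNR on IQ samples; it illustrates the idea only, and the CDRF toolkit's actual API may differ.

```python
# Minimal sketch of SNR-controlled augmentation for complex IQ recordings:
# scale white Gaussian noise so the result has a chosen SNR in dB.
import numpy as np

def add_noise_at_snr(iq: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    sig_power = np.mean(np.abs(iq) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power / 2), size=(iq.size, 2))
    return iq + (noise[:, 0] + 1j * noise[:, 1]).reshape(iq.shape)

rng = np.random.default_rng(0)
clean = np.exp(2j * np.pi * 0.05 * np.arange(4096))  # toy drone tone
noisy = add_noise_at_snr(clean, snr_db=5.0, rng=rng)
est = 10 * np.log10(np.mean(np.abs(clean) ** 2) /
                    np.mean(np.abs(noisy - clean) ** 2))
print(round(est, 1))  # close to the requested 5.0 dB
```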

[CV-74] Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models

【Quick Read】: This paper examines the robustness of pathology foundation models (PFMs) to real-world technical domain shifts, specifically variability induced by whole-slide scanner devices. Although PFMs perform well on benchmarks, their embedding spaces are sensitive to scanner-induced shifts, which distorts the calibration of downstream predictions and undermines clinical reliability. The key elements of the study: a multi-scanner breast cancer dataset (384 WSIs, five devices) isolates scanner effects, and robustness is assessed via unsupervised embedding analyses together with clinicopathological supervised tasks. The results show that no current PFM is invariant to scanner variability, and that robustness is not a simple function of training data scale, model size, or recency, suggesting that PFM development must move beyond accuracy-centric benchmarks toward explicit evaluation and optimization of embedding stability and calibration under realistic acquisition variability.

Link: https://arxiv.org/abs/2601.04163
Authors: Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen
Affiliations: Karolinska Institutet (卡罗林斯卡学院); University of Oxford (牛津大学)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.

[CV-75] A low-complexity method for efficient depth-guided image deblurring

【Quick Read】: This paper addresses image deblurring, a highly ill-posed problem where state-of-the-art deep models achieve strong image quality at high computational cost, making them impractical on mobile or resource-constrained devices. The key to the solution is a low-complexity neural network for depth-guided deblurring: the wavelet transform separates structural details and reduces spatial redundancy, while efficient feature conditioning on depth information provides guidance, cutting complexity by up to two orders of magnitude while maintaining image quality competitive with recent state-of-the-art models.

Link: https://arxiv.org/abs/2601.03924
Authors: Ziyao Yi, Diego Valsesia, Tiziano Bianchi, Enrico Magli
Affiliations: Politecnico di Torino – Department of Electronics and Telecommunications (都灵理工大学-电子与电信系)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Image deblurring is a challenging problem in imaging due to its highly ill-posed nature. Deep learning models have shown great success in tackling this problem but the quest for the best image quality has brought their computational complexity up, making them impractical on anything but powerful servers. Meanwhile, recent works have shown that mobile Lidars can provide complementary information in the form of depth maps that enhance deblurring quality. In this paper, we introduce a novel low-complexity neural network for depth-guided image deblurring. We show that the use of the wavelet transform to separate structural details and reduce spatial redundancy as well as efficient feature conditioning on the depth information are essential ingredients in developing a low-complexity model. Experimental results show competitive image quality against recent state-of-the-art models while reducing complexity by up to two orders of magnitude.
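To illustrate the wavelet front end, here is a minimal PyWavelets sketch showing how a 2-D DWT splits an image into subbands at a quarter of the spatial size, which is where a low-complexity network can spend most of its computation; this illustrates only the decomposition step, not the paper's network.

```python
# Minimal sketch of a wavelet front end for deblurring: a 2-D DWT splits an
# image into a low-frequency approximation and three detail bands, which a
# compact network can then process separately. Illustration only.
import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)   # stand-in blurry image
LL, (LH, HL, HH) = pywt.dwt2(img, 'haar')           # one decomposition level
print(LL.shape, LH.shape)                           # (128, 128) each: 4x fewer pixels
# A low-complexity model can run most computation on the 128x128 subbands,
# then reconstruct with the inverse transform:
recon = pywt.idwt2((LL, (LH, HL, HH)), 'haar')
print(np.allclose(recon, img, atol=1e-5))           # True: invertible transform
```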

[CV-76] GeoDiff-SAR: A Geometric Prior Guided Diffusion Model for SAR Image Generation

【Quick Read】: This paper addresses the fact that existing generative models for SAR imagery ignore geometric priors, yielding limited generation quality and no precise control over key parameters such as azimuth angle. The solution, GeoDiff-SAR, a geometric-prior-guided diffusion model, has three key parts: first, SAR point clouds computed at specific azimuths efficiently simulate the geometric structures and scattering relationships of real SAR imaging, providing strong physical guidance; second, a feature fusion gating network based on Feature-wise Linear Modulation (FiLM) dynamically weights 3D physical information, image control parameters, and text description parameters for effective multimodal fusion; third, Low-Rank Adaptation (LoRA) lightly fine-tunes Stable Diffusion 3.5 (SD3.5) so it quickly adapts to the distribution of the SAR domain. Experiments show the generated images have high fidelity, improve downstream classification accuracy, and in particular boost recognition across different azimuth angles.

Link: https://arxiv.org/abs/2601.03499
Authors: Fan Zhang, Xuanting Wu, Fei Ma, Qiang Yin, Yuxin Hu
Affiliations: Beijing University of Chemical Technology (北京化工大学); Chinese Academy of Sciences (中国科学院)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 17 figures

Click to view the abstract

Abstract:Synthetic Aperture Radar (SAR) imaging results are highly sensitive to observation geometries and the geometric parameters of targets. However, existing generative methods primarily operate within the image domain, neglecting explicit geometric information. This limitation often leads to unsatisfactory generation quality and the inability to precisely control critical parameters such as azimuth angles. To address these challenges, we propose GeoDiff-SAR, a geometric prior guided diffusion model for high-fidelity SAR image generation. Specifically, GeoDiff-SAR first efficiently simulates the geometric structures and scattering relationships inherent in real SAR imaging by calculating SAR point clouds at specific azimuths, which serves as a robust physical guidance. Secondly, to effectively fuse multi-modal information, we employ a feature fusion gating network based on Feature-wise Linear Modulation (FiLM) to dynamically regulate the weight distribution of 3D physical information, image control parameters, and textual description parameters. Thirdly, we utilize the Low-Rank Adaptation (LoRA) architecture to perform lightweight fine-tuning on the advanced Stable Diffusion 3.5 (SD3.5) model, enabling it to rapidly adapt to the distribution characteristics of the SAR domain. To validate the effectiveness of GeoDiff-SAR, extensive comparative experiments were conducted on real-world SAR datasets. The results demonstrate that data generated by GeoDiff-SAR exhibits high fidelity and effectively enhances the accuracy of downstream classification tasks. In particular, it significantly improves recognition performance across different azimuth angles, thereby underscoring the superiority of physics-guided generation.
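FiLM, the conditioning mechanism behind the fusion gate, is compact enough to sketch. The module below is a generic FiLM layer in PyTorch; it is an assumption-level illustration, not GeoDiff-SAR's actual network.

```python
# Minimal sketch of Feature-wise Linear Modulation (FiLM): a conditioning
# vector (e.g., a geometry/azimuth embedding) predicts per-channel scale and
# shift applied to feature maps. Generic illustration, not the paper's code.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats, cond):
        # feats: (B, C, H, W); cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

film = FiLM(cond_dim=64, num_channels=32)
out = film(torch.randn(4, 32, 16, 16), torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 32, 16, 16])
```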

[CV-77] Edit2Restore:Few-Shot Image Restoration via Parameter-Efficient Adaptation of Pre-trained Editing Models

【Quick Read】: This paper challenges the assumption that image restoration requires thousands of paired training samples per degradation type, proposing a data-efficient, general alternative. The key is to exploit the rich visual priors of a pretrained text-conditioned image editing model (FLUX.1 Kontext) via parameter-efficient LoRA fine-tuning: with only 16-128 paired images, a single unified adapter guided by simple text prompts handles multiple restoration tasks (denoising, deraining, dehazing), drastically cutting data requirements while maintaining high perceptual quality and opening a path to few-shot, prompt-guided image enhancement.

Link: https://arxiv.org/abs/2601.03391
Authors: M. Akın Yılmaz, Ahmet Bilican, Burak Can Biner, A. Murat Tekalp
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view the abstract

Abstract:Image restoration has traditionally required training specialized models on thousands of paired examples per degradation type. We challenge this paradigm by demonstrating that powerful pre-trained text-conditioned image editing models can be efficiently adapted for multiple restoration tasks through parameter-efficient fine-tuning with remarkably few examples. Our approach fine-tunes LoRA adapters on FLUX.1 Kontext, a state-of-the-art 12B parameter flow matching model for image-to-image translation, using only 16-128 paired images per task, guided by simple text prompts that specify the restoration operation. Unlike existing methods that train specialized restoration networks from scratch with thousands of samples, we leverage the rich visual priors already encoded in large-scale pre-trained editing models, dramatically reducing data requirements while maintaining high perceptual quality. A single unified LoRA adapter, conditioned on task-specific text prompts, effectively handles multiple degradations including denoising, deraining, and dehazing. Through comprehensive ablation studies, we analyze: (i) the impact of training set size on restoration quality, (ii) trade-offs between task-specific versus unified multi-task adapters, (iii) the role of text encoder fine-tuning, and (iv) zero-shot baseline performance. While our method prioritizes perceptual quality over pixel-perfect reconstruction metrics like PSNR/SSIM, our results demonstrate that pre-trained image editing models, when properly adapted, offer a compelling and data-efficient alternative to traditional image restoration approaches, opening new avenues for few-shot, prompt-guided image enhancement. The code to reproduce our results is available at: this https URL

Artificial Intelligence

[AI-0] Embedding Autonomous Agents in Resource-Constrained Robotic Platforms

【Quick Read】: This paper targets autonomous decision-making on embedded devices operating under resource constraints in dynamic environments, aiming to improve system responsiveness and reduce dependence on external control. The key to the solution is integrating an autonomous agent written in AgentSpeak into a small two-wheeled robot that explores a maze autonomously using its own sensor data for local decisions. Experiments show the agent solves the maze in 59 seconds with each decision taking under one millisecond, demonstrating that the reasoning process is feasible in real time on resource-constrained hardware.

Link: https://arxiv.org/abs/2601.04191
Authors: Negar Halakou, Juan F. Gutierrez, Ye Sun, Han Jiang, Xueming Wu, Yilun Song, Andres Gomez
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This is an open-access, author-archived version of a manuscript published in European Conference on Multi-Agent Systems 2025

Click to view the abstract

Abstract:Many embedded devices operate under resource constraints and in dynamic environments, requiring local decision-making capabilities. Enabling devices to make independent decisions in such environments can improve the responsiveness of the system and reduce the dependence on constant external control. In this work, we integrate an autonomous agent, programmed using AgentSpeak, with a small two-wheeled robot that explores a maze using its own decision-making and sensor data. Experimental results show that the agent successfully solved the maze in 59 seconds using 287 reasoning cycles, with decision phases taking less than one millisecond. These results indicate that the reasoning process is efficient enough for real-time execution on resource-constrained hardware. This integration demonstrates how high-level agent-based control can be applied to resource-constrained embedded systems for autonomous operation.

[AI-1] Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions

【Quick Read】: This paper addresses agent drift in multi-agent large language model (LLM) systems: the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interaction sequences. The key to the solution is the Agent Stability Index (ASI), a composite metric framework quantifying drift across twelve dimensions (e.g., response consistency, tool-usage patterns, reasoning-pathway stability), combined with three mitigation strategies: episodic memory consolidation, drift-aware routing protocols, and adaptive behavioral anchoring. Theoretical analysis suggests these approaches can substantially reduce drift-related errors while maintaining system throughput, offering a practical path to monitoring, measuring, and governing drift for the reliability and safety of production agentic AI systems.

Link: https://arxiv.org/abs/2601.04170
Authors: Abhishek Rath
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Multi-agent Large Language Model (LLM) systems have emerged as powerful architectures for complex task decomposition and collaborative problem-solving. However, their long-term behavioral stability remains largely unexamined. This study introduces the concept of agent drift, defined as the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interaction sequences. We present a comprehensive theoretical framework for understanding drift phenomena, proposing three distinct manifestations: semantic drift (progressive deviation from original intent), coordination drift (breakdown in multi-agent consensus mechanisms), and behavioral drift (emergence of unintended strategies). We introduce the Agent Stability Index (ASI), a novel composite metric framework for quantifying drift across twelve dimensions, including response consistency, tool usage patterns, reasoning pathway stability, and inter-agent agreement rates. Through simulation-based analysis and theoretical modeling, we demonstrate how unchecked agent drift can lead to substantial reductions in task completion accuracy and increased human intervention requirements. We propose three mitigation strategies: episodic memory consolidation, drift-aware routing protocols, and adaptive behavioral anchoring. Theoretical analysis suggests these approaches can significantly reduce drift-related errors while maintaining system throughput. This work establishes a foundational methodology for monitoring, measuring, and mitigating agent drift in production agentic AI systems, with direct implications for enterprise deployment reliability and AI safety research.
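A composite stability index of this kind can be sketched as a weighted aggregate of per-dimension scores tracked over turns. The toy code below illustrates that shape; the dimension names and weights are invented for the example, not taken from the paper.

```python
# Toy sketch of an Agent Stability Index (ASI): aggregate normalized
# per-dimension stability scores (1.0 = perfectly stable) into one composite.
# Dimension names and weights here are illustrative assumptions.
import numpy as np

DIMENSIONS = {
    "response_consistency": 0.25,
    "tool_usage_stability": 0.20,
    "reasoning_path_stability": 0.25,
    "inter_agent_agreement": 0.30,
}

def asi(scores: dict) -> float:
    """Weighted mean of per-dimension scores clipped to [0, 1]."""
    w = np.array([DIMENSIONS[k] for k in DIMENSIONS])
    s = np.array([scores[k] for k in DIMENSIONS])
    return float(np.clip(s, 0.0, 1.0) @ w)

turn_50 = {"response_consistency": 0.91, "tool_usage_stability": 0.84,
           "reasoning_path_stability": 0.77, "inter_agent_agreement": 0.88}
print(round(asi(turn_50), 3))  # 0.852 -- track this over turns to detect drift
```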

[AI-2] Clinical Data Goes MEDS? Lets OWL make sense of it

【Quick Read】: This paper targets the limited interoperability and reproducibility of machine learning on healthcare data caused by the lack of standardized, semantically explicit representations. The key to the solution is MEDS-OWL, a lightweight OWL ontology that formalizes MEDS event data as RDF graphs, together with the meds2rdf Python conversion library that maps MEDS events to RDF graphs automatically, enabling FAIR-aligned data publishing, provenance tracking, and graph-based analytics over clinical event data, effectively bridging MEDS with the Semantic Web ecosystem.

Link: https://arxiv.org/abs/2601.04164
Authors: Alberto Marfoglia, Jong Ho Jhee, Adrien Coulet
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 tables, 4 figures

Click to view the abstract

Abstract:The application of machine learning on healthcare data is often hindered by the lack of standardized and semantically explicit representation, leading to limited interoperability and reproducibility across datasets and experiments. The Medical Event Data Standard (MEDS) addresses these issues by introducing a minimal, event-centric data model designed for reproducible machine-learning workflows from health data. However, MEDS is defined as a data-format specification and does not natively provide integration with the Semantic Web ecosystem. In this article, we introduce MEDS-OWL, a lightweight OWL ontology that provides formal concepts and relations to enable representing MEDS datasets as RDF graphs. Additionally, we implemented meds2rdf, a Python conversion library that transforms MEDS events into RDF graphs, ensuring conformance with the ontology. We demonstrate the approach on a synthetic clinical dataset that describes patient care pathways for ruptured intracranial aneurysms and validate the resulting graph using SHACL constraints. The first release of MEDS-OWL comprises 13 classes, 10 object properties, 20 data properties, and 24 OWL axioms. Combined with meds2rdf, it enables data transformation into FAIR-aligned datasets, provenance-aware publishing, and interoperability of event-based clinical data. By bridging MEDS with the Semantic Web, this work contributes a reusable semantic layer for event-based clinical data and establishes a robust foundation for subsequent graph-based analytics.
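Since the abstract does not show meds2rdf's API, the sketch below uses plain rdflib with a hypothetical namespace to indicate what mapping one MEDS-style event to RDF triples can look like; the class and property names are stand-ins, not the actual MEDS-OWL vocabulary.

```python
# Minimal sketch of turning one MEDS-style event (subject, time, code, value)
# into RDF triples with rdflib. The namespace and class/property names are
# hypothetical stand-ins, not the real MEDS-OWL ontology.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

MO = Namespace("https://example.org/meds-owl#")  # hypothetical namespace

g = Graph()
event = MO["event/42"]
g.add((event, RDF.type, MO.MedicalEvent))
g.add((event, MO.subjectId, Literal("patient-7")))
g.add((event, MO.code, Literal("LOINC:8867-4")))   # e.g., a heart-rate code
g.add((event, MO.numericValue, Literal(72.0, datatype=XSD.double)))
g.add((event, MO.time, Literal("2026-01-08T09:30:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))  # inspect the resulting graph
```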

[AI-3] Quantifying the Impact of Modules and Their Interactions in the PSO-X Framework

【Quick Read】: This paper addresses the unclear impact of module choice and interaction on performance in the modular particle swarm optimization framework PSO-X, especially across optimization problems with different characteristics, where identifying the influential modules efficiently is hard. The key to the solution is quantifying, via functional ANOVA, the influence of modules and their combinations on the CEC'05 benchmark suite in 10 and 30 dimensions, combined with a cluster analysis that identifies groups of problem classes with similar module-effect patterns, revealing that a few influential modules drive overall performance and providing data-driven guidance for algorithm configuration.

Link: https://arxiv.org/abs/2601.04100
Authors: Christian L. Camacho-Villalón, Ana Nikolikj, Katharina Dost, Eva Tuba, Sašo Džeroski, Tome Eftimov
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:The PSO-X framework incorporates dozens of modules that have been proposed for solving single-objective continuous optimization problems using particle swarm optimization. While modular frameworks enable users to automatically generate and configure algorithms tailored to specific optimization problems, the complexity of this process increases with the number of modules in the framework and the degrees of freedom defined for their interaction. Understanding how modules affect the performance of algorithms for different problems is critical to making the process of finding effective implementations more efficient and identifying promising areas for further investigation. Despite their practical applications and scientific relevance, there is a lack of empirical studies investigating which modules matter most in modular optimization frameworks and how they interact. In this paper, we analyze the performance of 1424 particle swarm optimization algorithms instantiated from the PSO-X framework on the 25 functions in the CEC’05 benchmark suite with 10 and 30 dimensions. We use functional ANOVA to quantify the impact of modules and their combinations on performance in different problem classes. In practice, this allows us to identify which modules have greater influence on PSO-X performance depending on problem features such as multimodality, mathematical transformations and varying dimensionality. We then perform a cluster analysis to identify groups of problem classes that share similar module effect patterns. Our results show low variability in the importance of modules in all problem classes, suggesting that particle swarm optimization performance is driven by a few influential modules.

[AI-4] CSSG: Measuring Code Similarity with Semantic Graphs

【Quick Read】: This paper addresses the limits of existing code-similarity metrics (e.g., BLEU, CodeBLEU, and TSED), which rely mainly on surface-level string overlap or abstract syntax tree structures and struggle to capture deeper semantic relationships between code fragments. The key to the solution is CSSG (Code Similarity using Semantic Graphs), which uses program dependence graphs (PDGs) to explicitly model control dependencies and variable interactions, yielding a semantics-aware code representation. Experiments on CodeContests+ show that CSSG distinguishes more similar from less similar code more effectively in both monolingual and cross-lingual settings, confirming the advantage of dependency-aware graph representations over surface- or syntax-based measures.

Link: https://arxiv.org/abs/2601.04085
Authors: Jingwen Xu, Yiyang Lu, Changze Lv, Zisu Huang, Zhengkang Guo, Zhengyuan Wang, Muzhao Tian, Xuanjing Huang, Xiaoqing Zheng
Affiliations: Unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Existing code similarity metrics, such as BLEU, CodeBLEU, and TSED, largely rely on surface-level string overlap or abstract syntax tree structures, and often fail to capture deeper semantic relationships between code. We propose CSSG (Code Similarity using Semantic Graphs), a novel metric that leverages program dependence graphs to explicitly model control dependencies and variable interactions, providing a semantics-aware representation of code. Experiments on the CodeContests+ dataset show that CSSG consistently outperforms existing metrics in distinguishing more similar code from less similar code under both monolingual and cross-lingual settings, demonstrating that dependency-aware graph representations offer a more effective alternative to surface-level or syntax-based similarity measures.
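A dependence-graph similarity is easy to prototype. The sketch below compares two toy labeled dependence graphs with a Jaccard score over edges; this is a deliberately simplified stand-in for the idea, not the CSSG metric itself.

```python
# Toy sketch of dependence-graph similarity: represent each program as a set
# of labeled dependence edges and compare with Jaccard overlap. A stand-in
# for the idea behind CSSG, not the actual metric.
import networkx as nx

def dep_graph(edges):
    g = nx.DiGraph()
    g.add_edges_from((u, v, {"kind": k}) for u, v, k in edges)
    return g

def edge_jaccard(g1, g2):
    e1 = {(u, v, d["kind"]) for u, v, d in g1.edges(data=True)}
    e2 = {(u, v, d["kind"]) for u, v, d in g2.edges(data=True)}
    return len(e1 & e2) / len(e1 | e2)

# Two near-identical snippets, one with an extra control dependence:
a = dep_graph([("x", "y", "data"), ("y", "print", "data"), ("if", "y", "control")])
b = dep_graph([("x", "y", "data"), ("y", "print", "data")])
print(round(edge_jaccard(a, b), 2))  # 0.67: shares 2 of 3 dependence edges
```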

[AI-5] ComfySearch: Autonomous Exploration and Reasoning for ComfyUI Workflows

【Quick Read】: This paper addresses the low pass rates and difficulty of generating high-quality workflows in generative AI pipelines built on platforms such as ComfyUI, where the component space is vast and graph constraints are strict. The key to the solution is ComfySearch, an agentic framework that effectively explores the component space and constructs functional workflows via validation-guided construction, substantially improving executability (pass) rates, solution rates, and generalization on complex and creative tasks.

Link: https://arxiv.org/abs/2601.04060
Authors: Jinwei Su, Qizhen Lan, Zeyu Wang, Yinghui Xia, Hairu Wen, Yiqun Duan, Xi Xiao, Tianyu Shi, Yang Jingsong, Lewei He
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:AI-generated content has progressed from monolithic models to modular workflows, especially on platforms like ComfyUI, allowing users to customize complex creative pipelines. However, the large number of components in ComfyUI and the difficulty of maintaining long-horizon structural consistency under strict graph constraints frequently lead to low pass rates and workflows of limited quality. To tackle these limitations, we present ComfySearch, an agentic framework that can effectively explore the component space and generate functional ComfyUI pipelines via validation-guided workflow construction. Experiments demonstrate that ComfySearch substantially outperforms existing methods on complex and creative tasks, achieving higher executability (pass) rates, higher solution rates, and stronger generalization.

[AI-6] MobileDreamer: Generative Sketch World Model for GUI Agent

【Quick Read】: This paper tackles the limitation that most mobile GUI agents are reactive, deciding only from the current screen and thus underperforming on long-horizon tasks because they cannot anticipate action outcomes. The key to the solution is MobileDreamer, an efficient world-model-based lookahead framework with two core innovations: (1) a Textual Sketch World Model that forecasts post-action states by converting screen images into key task-related sketch representations, with an order-invariant learning strategy to preserve the spatial information of GUI elements; and (2) a rollout imagination strategy that uses the world model's predictions to optimize action selection. Experiments on the Android World benchmark show state-of-the-art performance and a 5.25% improvement in task success.

Link: https://arxiv.org/abs/2601.04035
Authors: Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon tasks. Building a world model from repeated interactions enables forecasting action outcomes and supports better decision making for mobile GUI agents. This is challenging because the model must predict post-action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by a world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post-action states through a learning process that transforms digital images into key task-related sketches, and designs a novel order-invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy for the GUI agent optimizes the action-selection process by leveraging the prediction capability of the world model. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance and improves task success by 5.25%. World model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.

[AI-7] HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

【Quick Read】: This paper addresses multi-turn progressive jailbreak attacks on large language models (LLMs), in which attackers continuously deepen their strategy to bypass built-in safeguards, a dynamic threat that existing reactive defenses handle poorly. The key to the solution is HoneyTrap, a collaborative deceptive defense framework integrating four specialized agents, the Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, which cooperate to carry out deceptive defense strategies. By luring attackers into prolonged, fruitless interactions, it substantially raises Attack Resource Consumption (ARC) and Mislead Success Rate (MSR), suppressing and delaying sophisticated jailbreaks without affecting benign queries.

Link: https://arxiv.org/abs/2601.04034
Authors: Siyuan Li, Xi Lin, Jun Wu, Zehao Liu, Haoyu Li, Tianjie Ju, Xiang Chen, Jianhua Li
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Jailbreak attacks pose significant threats to large language models (LLMs), enabling attackers to bypass safeguards. However, existing reactive defense approaches struggle to keep up with the rapidly evolving multi-turn jailbreaks, where attackers continuously deepen their attacks to exploit vulnerabilities. To address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, each performing a specialized security role and collaborating to complete a deceptive defense. To ensure a comprehensive evaluation, we introduce MTJ-Pro, a challenging multi-turn progressive jailbreak dataset that combines seven advanced jailbreak strategies designed to gradually deepen attack strategies across multi-turn attacks. Besides, we present two novel metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC), which provide more nuanced assessments of deceptive defense beyond conventional measures. Experimental results on GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMa-3.1 demonstrate that HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state-of-the-art baselines. Notably, even in a dedicated adaptive attacker setting with intensified conditions, HoneyTrap remains resilient, leveraging deceptive engagement to prolong interactions, significantly increasing the time and computational costs required for successful exploitation. Unlike simple rejection, HoneyTrap strategically wastes attacker resources without impacting benign queries, improving MSR and ARC by 118.11% and 149.16%, respectively.

[AI-8] A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems DATE2026

【Quick Read】: This paper addresses three challenges of deploying Mixture-of-Experts (MoE) models on edge GPU systems with near-data processing (NDP) units that host offloaded experts: (1) severe load imbalance across NDP units caused by non-uniform expert selection and expert parallelism; (2) insufficient GPU utilization while experts run on NDP units; and (3) extensive data pre-profiling required for prefetching because expert activation patterns are unpredictable. The key to the solution is an efficient inference framework with three optimizations: exploiting the underexplored tensor parallelism of MoE inference to partition and compute large expert parameters across multiple NDP units for low-batch edge scenarios; a load-balancing-aware scheduling algorithm that distributes expert computation between NDP units and the GPU to maximize resource utilization; and a dataset-free prefetching strategy that proactively loads frequently accessed experts to cut activation delays. Experiments show 2.41x average and up to 2.56x end-to-end latency speedup for GPU-NDP systems, markedly improving MoE inference efficiency in resource-constrained environments.

Link: https://arxiv.org/abs/2601.03992
Authors: Qi Wu, Chao Fang, Jiayuan Chen, Ye Lin, Yueqi Zhang, Yichuan Bai, Yuan Du, Li Du
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: To appear in 2026 Design, Automation and Test in Europe Conference (DATE 2026)

Click to view the abstract

Abstract:Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert parameters across multiple NDP units simultaneously towards edge low-batch scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across NDP units and GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve 2.41x on average and up to 2.56x speedup in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
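The load-balancing intuition can be illustrated with a classic greedy longest-processing-time assignment: place each expert's workload on the currently least-loaded compute unit. The toy sketch below shows that idea only and is not the paper's scheduling algorithm.

```python
# Toy sketch of load-balancing-aware expert scheduling: greedily place each
# expert's workload on the currently least-loaded unit (GPU or NDP units).
# Illustration of the balancing idea only, not the paper's algorithm.
import heapq

def schedule(expert_costs, units=("gpu", "ndp0", "ndp1", "ndp2")):
    heap = [(0.0, u) for u in units]            # (accumulated load, unit)
    heapq.heapify(heap)
    placement = {}
    # Longest-processing-time order: big experts first for tighter balance.
    for expert, cost in sorted(expert_costs.items(), key=lambda kv: -kv[1]):
        load, unit = heapq.heappop(heap)        # least-loaded unit
        placement[expert] = unit
        heapq.heappush(heap, (load + cost, unit))
    return placement

costs = {"e0": 9.0, "e1": 7.0, "e2": 4.0, "e3": 4.0, "e4": 3.0, "e5": 1.0}
print(schedule(costs))
# {'e0': 'gpu', 'e1': 'ndp0', 'e2': 'ndp1', 'e3': 'ndp2', 'e4': 'ndp1', 'e5': 'ndp2'}
```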

[AI-9] rade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

【Quick Read】: This paper addresses the noisy reward signals that market stochasticity injects into reinforcement learning (RL) for financial decision-making, noise that easily drives standard RL into reward hacking. The key to the solution is Trade-R1, a training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification: it recasts the evaluation of reasoning over lengthy financial documents as a structured retrieval-augmented generation (RAG) task and builds a triangular consistency metric that cross-checks the pairwise alignment among retrieved evidence, reasoning chains, and decisions, filtering noisy market returns. Two reward-integration strategies are explored, Fixed-effect Semantic Reward (FSR) and Dynamic-effect Semantic Reward (DSR), with DSR achieving the best cross-market generalization while maintaining the highest reasoning consistency.

Link: https://arxiv.org/abs/2601.03948
Authors: Rui Sun, Yifan Sun, Sheng Xu, Li Zhao, Jing Li, Daxin Jiang, Chen Hua, Zuo Bai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
Comments:

Click to view the abstract

Abstract:Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision-making is challenged by the market's stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric, assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions to serve as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on asset selection in different countries demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.
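One plausible reading of a triangular consistency score is the mean pairwise similarity among embeddings of the retrieved evidence, the reasoning chain, and the decision, used to gate noisy returns. The sketch below implements that reading with cosine similarity; it is an assumption-level illustration, not the paper's exact formula.

```python
# Toy sketch of a triangular consistency metric: average the pairwise cosine
# similarities among (evidence, reasoning, decision) embeddings and use it to
# gate a noisy market return. An illustrative reading, not the paper's formula.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triangular_consistency(evidence, reasoning, decision):
    return (cos(evidence, reasoning) + cos(reasoning, decision)
            + cos(evidence, decision)) / 3.0

rng = np.random.default_rng(0)
e = rng.normal(size=128)                 # evidence embedding
r = e + 0.1 * rng.normal(size=128)       # reasoning, close to evidence
d = r + 0.1 * rng.normal(size=128)       # decision, close to reasoning
c = triangular_consistency(e, r, d)
reward = 0.8 if c > 0.9 else 0.0         # pass the return through only if consistent
print(round(c, 3), reward)
```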

[AI-10] A Gap Between Decision Trees and Neural Networks

【Quick Read】: This paper studies when geometric simplicity of decision boundaries (used as a notion of interpretability) can conflict with accurate approximation of axis-aligned decision trees by shallow ReLU networks. Trees induce rule-based axis-aligned decision regions (finite unions of boxes), while shallow networks are typically trained as score models whose predictions are obtained by thresholding. The analysis shows that the hard indicator 1_A and two natural continuous surrogates (piecewise-linear ramp smoothing and logistic smoothing) all have infinite Radon total variation (RTV) seminorm in dimensions d ≥ 1, signaling very high geometric complexity, while Gaussian convolution yields finite RTV at the cost of an exponential dependence on the dimension d. The key contribution is to separate two often-conflated goals, thresholded classification (recovering the decision set) versus score learning (learning a calibrated score close to 1_A): the authors construct a smooth barrier score S_A with finite RTV whose fixed threshold τ = 1 exactly recovers the original box, and, under a mild tube-mass condition near the boundary, establish an L_1(P) calibration bound that decays polynomially in a sharpness parameter, together with explicit RTV upper bounds in terms of face measures, revealing the accuracy-complexity tradeoff.

Link: https://arxiv.org/abs/2601.03919
Authors: Akash Kumar
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 45 pages

Click to view the abstract

Abstract:We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees induce rule-based, axis-aligned decision regions (finite unions of boxes), whereas shallow ReLU networks are typically trained as score models whose predictions are obtained by thresholding. We analyze the infinite-width, bounded-norm, single-hidden-layer ReLU class through the Radon total variation (RTV) seminorm, which controls the geometric complexity of level sets. We first show that the hard tree indicator 1_A has infinite RTV. Moreover, two natural split-wise continuous surrogates, piecewise-linear ramp smoothing and sigmoidal (logistic) smoothing, also have infinite RTV in dimensions d ≥ 1, while Gaussian convolution yields finite RTV but with an explicit exponential dependence on d. We then separate two goals that are often conflated: classification after thresholding (recovering the decision set) versus score learning (learning a calibrated score close to 1_A). For classification, we construct a smooth barrier score S_A with finite RTV whose fixed threshold τ = 1 exactly recovers the box. Under a mild tube-mass condition near the boundary ∂A, we prove an L_1(P) calibration bound that decays polynomially in a sharpness parameter, along with an explicit RTV upper bound in terms of face measures. Experiments on synthetic unions of rectangles illustrate the resulting accuracy-complexity tradeoff and how threshold selection shifts where training lands along it.
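The threshold-recovery idea can be illustrated with a simple piecewise-linear barrier: per-coordinate ramps that equal 1 on the box and decay outside, combined so the score reaches 1 exactly on the box. This toy construction is for intuition only; the paper's smooth S_A and its RTV bounds are more delicate.

```python
# Toy stand-in for a barrier score whose threshold-1 superlevel set recovers
# an axis-aligned box exactly. A simple piecewise-linear construction for
# intuition; the paper's smooth S_A is a different, more careful object.
import numpy as np

def barrier_score(x, lo, hi, eps=0.25):
    """x: (n, d) points; box [lo, hi] per coordinate; eps: sharpness."""
    # Per-coordinate ramp: 1 inside [lo_i, hi_i], decays linearly outside.
    t = np.clip(1 - np.maximum(lo - x, x - hi) / eps, a_min=None, a_max=1.0)
    s = t.sum(axis=1) - (x.shape[1] - 1)   # equals 1 iff every ramp equals 1
    return np.maximum(s, 0.0)

lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
pts = np.array([[0.5, 0.5], [1.0, 1.0], [1.1, 0.5], [0.5, -0.3]])
s = barrier_score(pts, lo, hi)
print(s >= 1.0)  # [ True  True False False]: threshold tau=1 recovers the box
```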

[AI-11] Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures

【Quick Read】: This paper addresses expert collapse in Mixture of Experts (MoE) architectures, where routing converges to a few dominant experts as models scale, reducing capacity and causing catastrophic interference during adaptation. The key to the solution is the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold via dual regularization: spectral-norm constraints bound the Lipschitz continuity of the routing function, while a stable-rank penalty preserves high-dimensional feature diversity in expert selection, enforcing structural modularity and stability so that localized expert updates do not cause global performance decay.

Link: https://arxiv.org/abs/2601.03889
Authors: Ibrahim Delibasoglu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse, where routing converges to a few dominant experts. This reduces model capacity and causes catastrophic interference during adaptation. We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity. Our method uses dual regularization: spectral norm constraints bound routing function Lipschitz continuity, while stable rank penalties preserve high-dimensional feature diversity in expert selection. We evaluate SR-MoE across architectural scales and dataset complexities using modular one-shot adaptation tasks. Results show that traditional linear gating fails with increasing depth (accuracy drops up to 4.72% due to expert entanglement), while SR-MoE maintains structural integrity (mean interference -0.32%). Our spectral constraints facilitate positive knowledge transfer, enabling localized expert updates without global performance decay. SR-MoE provides a general solution for building high-capacity, modular networks capable of stable lifelong learning.
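Both regularizers are a few lines each in PyTorch: a hinge on the router's largest singular value (bounding its Lipschitz constant) and a stable-rank term (the squared Frobenius-to-spectral norm ratio). The sketch below is illustrative, not the paper's exact losses.

```python
# Minimal sketch of the two regularizers described for SR-MoE routing:
# a spectral-norm hinge that bounds the router's Lipschitz constant, and a
# stable-rank penalty that keeps routing weights high-rank. Illustrative only.
import torch

def spectral_norm_penalty(W: torch.Tensor, target: float = 1.0) -> torch.Tensor:
    sigma_max = torch.linalg.matrix_norm(W, ord=2)      # largest singular value
    return torch.relu(sigma_max - target) ** 2          # penalize only the excess

def stable_rank_penalty(W: torch.Tensor) -> torch.Tensor:
    fro2 = torch.linalg.matrix_norm(W, ord='fro') ** 2
    sigma_max2 = torch.linalg.matrix_norm(W, ord=2) ** 2
    stable_rank = fro2 / sigma_max2                     # in [1, rank(W)]
    return -stable_rank                                 # reward feature diversity

router = torch.nn.Linear(512, 16)                       # gate over 16 experts
loss_reg = (spectral_norm_penalty(router.weight)
            + 0.01 * stable_rank_penalty(router.weight))
print(float(loss_reg))  # add this to the task loss during training
```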

[AI-12] IndexTTS 2.5 Technical Report

【Quick Read】: This paper targets weak zero-shot transfer, slow inference, and inconsistent cross-lingual emotion in multilingual emotional text-to-speech (TTS). The key to IndexTTS 2.5 is four improvements: (1) semantic codec compression that lowers the semantic codebook frame rate from 50 Hz to 25 Hz, halving sequence length and compute; (2) replacing the U-DiT backbone with a more efficient Zipformer architecture for faster mel-spectrogram generation with fewer parameters; (3) three explicit cross-lingual modeling strategies (boundary-aware alignment, token-level concatenation, and instruction-guided generation) that form a zero-shot multilingual emotional TTS framework for Chinese, English, Japanese, and Spanish, supporting robust emotion transfer even without target-language emotional training data; and (4) GRPO reinforcement learning in T2S post-training to improve pronunciation accuracy and naturalness. Experiments show IndexTTS 2.5 achieves a 2.28x RTF improvement while keeping WER and speaker similarity comparable to IndexTTS 2, and replicates emotional prosody in unseen languages.

Link: https://arxiv.org/abs/2601.03888
Authors: Yunpei Li, Xun Zhou, Jinchao Wang, Lu Wang, Yong Wu, Siyi Zhou, Yiquan Zhou, Jingchen Shu
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures

Click to view the abstract

Abstract:In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: we propose three explicit cross-lingual modeling strategies, boundary-aware alignment, token-level concatenation, and instruction-guided generation, establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish, and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and naturalness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28 times improvement in RTF while maintaining comparable WER and speaker similarity to IndexTTS 2.

[AI-13] Investigating the Grounding Bottleneck for a Large-Scale Configuration Problem: Existing Tools and Constraint-Aware Guessing

【Quick Read】: This paper examines the scalability limits of current answer set programming (ASP) techniques on large configuration problems, in particular the sharp growth in memory demands caused by the grounding bottleneck. The key to the solution is the constraint-aware guessing method, which, based on an analysis of grounding, significantly reduces memory consumption and thereby improves the practicality of ASP for large-scale applications such as the configuration of complex electronic systems.

Link: https://arxiv.org/abs/2601.03850
Authors: Veronika Semmelrock, Gerhard Friedrich
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Click to view the abstract

Abstract:Answer set programming (ASP) aims to realize the AI vision: the user specifies the problem, and the computer solves it. Indeed, ASP has made this vision true in many application domains. However, will current ASP solving techniques scale up for large configuration problems? As a benchmark for such problems, we investigated the configuration of electronic systems, which may comprise more than 30,000 components. We show the potential and limits of current ASP technology, focusing on methods that address the so-called grounding bottleneck, i.e., the sharp increase of memory demands with the size of the problem instances. To push the limits, we investigated the incremental solving approach, which proved effective in practice. However, even in the incremental approach, memory demands impose significant limits. Based on an analysis of grounding, we developed the constraint-aware guessing method, which significantly reduced the memory need.

[AI-14] Implementing the First-Order Logic of Here and There

【Quick Read】: This paper addresses automated theorem proving for the first-order logic of here and there (HT). The key to the solution is two complementary approaches: a native sequent calculus for HT, with analytic proof search optimized via free variables and skolemization; and an axiomatic embedding of HT into intuitionistic logic, combined with sequent, tableau, and connection calculi for intuitionistic first-order logic. Together these yield efficient, formally grounded automated provers for HT, evaluated on a large benchmark set of first-order formulas and laying a foundation for the development of more efficient HT provers.

Link: https://arxiv.org/abs/2601.03848
Authors: Jens Otten (University of Pernambuco), Torsten Schaub (University of Potsdam)
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Click to view the abstract

Abstract:We present automated theorem provers for the first-order logic of here and there (HT). They are based on a native sequent calculus for the logic of HT and an axiomatic embedding of the logic of HT into intuitionistic logic. The analytic proof search in the sequent calculus is optimized by using free variables and skolemization. The embedding is used in combination with sequent, tableau and connection calculi for intuitionistic first-order logic. All provers are evaluated on a large benchmark set of first-order formulas, providing a foundation for the development of more efficient HT provers.

[AI-15] xDNN(ASP): Explanation Generation System for Deep Neural Networks powered by Answer Set Programming

【Quick Read】: This paper tackles the lack of interpretability of deep neural networks (DNNs), in particular that existing explanation methods focus mainly on the input-output relationship while ignoring how the network's internal structure influences its predictions. The key of the solution is xDNN(ASP), a global explanation generation system based on Answer Set Programming (ASP): given a trained DNN and its data, it extracts a logic program whose answer sets, in the ideal case, correspond one-to-one to the network's input-output pairs. The extracted program not only maintains high predictive accuracy but also reveals feature importance and the impact of hidden nodes on predictions, providing guidance for structural optimization such as reducing the number of hidden-layer nodes.

Link: https://arxiv.org/abs/2601.03847
Authors: Ly Ly Trieu (New Mexico State University), Tran Cao Son (New Mexico State University)
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Abstract:Explainable artificial intelligence (xAI) has gained significant attention in recent years. Among other things, explainability for deep neural networks has been a topic of intensive research due to the meteoric rise in prominence of deep neural networks and their "black-box" nature. xAI approaches can be characterized along different dimensions such as their scope (global versus local explanations) or underlying methodologies (statistic-based versus rule-based strategies). Methods generating global explanations aim to provide a reasoning process applicable to all possible output classes, while local explanation methods focus only on a single, specific class. SHAP (SHapley Additive exPlanations), a well-known statistical technique, identifies important features of a network. Deep neural network rule extraction methods construct IF-THEN rules that link input conditions to a class. Another approach focuses on generating counterfactuals, which help explain how small changes to an input can affect the model's predictions. However, these techniques primarily focus on the input-output relationship and thus neglect the structure of the network in explanation generation. In this work, we propose xDNN(ASP), an explanation generation system for deep neural networks that provides global explanations. Given a neural network model and its training data, xDNN(ASP) extracts a logic program under answer set semantics that, in the ideal case, represents the trained model, i.e., answer sets of the extracted program correspond one-to-one to input-output pairs of the network. We demonstrate experimentally, using two synthetic datasets, that the extracted logic program not only maintains a high level of accuracy in the prediction task, but also provides valuable information for the understanding of the model, such as the importance of features as well as the impact of hidden nodes on the prediction. The latter can be used as a guide for reducing the number of nodes used in hidden layers, i.e., providing a means for optimizing the network.

[AI-16] When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents

【Quick Read】: This paper investigates implicit coordination mechanisms among agents in multi-agent systems driven by large language models (LLMs) when explicit communication is unavailable, in particular how covert communication emerges spontaneously in strategic interaction and affects cooperation and game outcomes. The key of the solution is a game-theoretic analysis that compares the effect of different communication regimes (explicit, restricted, and absent) on agent behavior across four canonical game settings, combined with heterogeneous agent personalities and both one-shot and repeated games, to systematically characterize when covert signals emerge and how they shape strategic equilibria and coordination efficiency.

Link: https://arxiv.org/abs/2601.03846
Authors: Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, TheAnh Han, German Castignani, Pietro Liò
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLMs-based agents increasingly operate in multi-agent environments where strategic interaction and coordination are required. While existing work has largely focused on individual agents or on interacting agents sharing explicit communication, less is known about how interacting agents coordinate implicitly. In particular, agents may engage in covert communication, relying on indirect or non-linguistic signals embedded in their actions rather than on explicit messages. This paper presents a game-theoretic study of covert communication in LLM-driven multi-agent systems. We analyse interactions across four canonical game-theoretic settings under different communication regimes, including explicit, restricted, and absent communication. Considering heterogeneous agent personalities and both one-shot and repeated games, we characterise when covert signals emerge and how they shape coordination and strategic outcomes.

[AI-17] Formally Explaining Decision Tree Models with Answer Set Programming

【Quick Read】: This paper addresses the limited interpretability of decision tree models (including random forests and gradient-boosted decision trees) in safety-critical applications, especially when decisions require formal justification. The key of the solution is to use Answer Set Programming (ASP) to generate several types of explanations, namely sufficient, contrastive, majority, and tree-specific explanations. Compared with SAT-based approaches, the ASP-based method is more flexible in encoding user preferences and supports enumerating all possible explanations, improving both the comprehensiveness and customizability of explanations.

Link: https://arxiv.org/abs/2601.03845
Authors: Akihiro Takemura (National Institute of Informatics, Tokyo, Japan), Masayuki Otani (Tokyo Institute of Technology, Tokyo, Japan), Katsumi Inoue (National Institute of Informatics, Tokyo, Japan)
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Abstract:Decision tree models, including random forests and gradient-boosted decision trees, are widely used in machine learning due to their high predictive performance. However, their complex structures often make them difficult to interpret, especially in safety-critical applications where model decisions require formal justification. Recent work has demonstrated that logical and abductive explanations can be derived through automated reasoning techniques. In this paper, we propose a method for generating various types of explanations, namely, sufficient, contrastive, majority, and tree-specific explanations, using Answer Set Programming (ASP). Compared to SAT-based approaches, our ASP-based method offers greater flexibility in encoding user preferences and supports enumeration of all possible explanations. We empirically evaluate the approach on a diverse set of datasets and demonstrate its effectiveness and limitations compared to existing methods.

[AI-18] XAI-LAW: A Logic Programming Tool for Modeling Explaining and Learning Legal Decisions

【Quick Read】: This paper aims to automate the modeling of and reasoning over criminal-law rules, formalizing articles of the Italian Criminal Code (ICC) such as "crimes against the person" and property offenses, and semi-automatically learning legal rules from prior judicial decisions. The key of the solution is to encode the criminal-law articles in Answer Set Programming (ASP), exploit the "supportedness" of stable models to provide explainable reasoning and outcome prediction, and integrate an inductive logic programming system that generalizes legal rules from case examples, improving the transparency and automation of judicial decision-making.

Link: https://arxiv.org/abs/2601.03844
Authors: Agostino Dovier (DMIF - University of Udine), Talissa Dreossi (DMIF - University of Udine), Andrea Formisano (DMIF - University of Udine), Benedetta Strizzolo (DMIF - University of Udine)
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Abstract:We propose an approach to model articles of the Italian Criminal Code (ICC), using Answer Set Programming (ASP), and to semi-automatically learn legal rules from examples based on prior judicial decisions. The developed tool is intended to support legal experts during the criminal trial phase by providing reasoning and possible legal outcomes. The methodology involves analyzing and encoding articles of the ICC in ASP, including “crimes against the person” and property offenses. The resulting model is validated on a set of previous verdicts and refined as necessary. During the encoding process, contradictions may arise; these are properly handled by the system, which also generates possible decisions for new cases and provides explanations through a tool that leverages the “supportedness” of stable models. The automatic explainability offered by the tool can also be used to clarify the logic behind judicial decisions, making the decision-making process more interpretable. Furthermore, the tool integrates an inductive logic programming system for ASP, which is employed to generalize legal rules from case examples.

[AI-19] On the Trap Space Semantics of Normal Logic Programs

【Quick Read】: This paper addresses the semantic interpretation of normal logic programs, in particular how to unify and deepen the understanding of the relationships between existing model-theoretic semantics (stable, supported, regular models, etc.) and dynamical semantics (evolution behavior over the state-transition graph). Traditional approaches rely on Clark's completion and two- or three-valued canonical models but lack a systematic account of a program's dynamic behavior. The key of the solution is the trap space semantics, a new semantic framework for arbitrary normal logic programs with both model-theoretic and dynamical characterizations: the set structure of trap spaces provides precise model definitions, while state-space trajectory analysis reveals stable, oscillatory, and convergent behavior. The framework proves the existence of supported classes, strict stable (supported) classes, and regular models, and formalizes deeper connections among the different semantics, providing a unified and rigorous theoretical foundation for the semantic analysis of logic programs.

Link: https://arxiv.org/abs/2601.03842
Authors: Van-Giang Trinh (Inria Saclay, EP Lifeware, Palaiseau, France), Sylvain Soliman (Inria Saclay, EP Lifeware, Palaiseau, France), François Fages (Inria Saclay, EP Lifeware, Palaiseau, France), Belaid Benhamou (LIRICA team, LIS, Aix-Marseille University, Marseille, France)
Institution: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Abstract:The logical semantics of normal logic programs has traditionally been based on the notions of Clark's completion and two-valued or three-valued canonical models, including supported, stable, regular, and well-founded models. Two-valued interpretations can also be seen as states evolving under a program's update operator, producing a transition graph whose fixed points and cycles capture stable and oscillatory behaviors, respectively. We refer to this view as dynamical semantics since it characterizes the program's meaning in terms of state-space trajectories, as first introduced in the stable (supported) class semantics. Recently, we have established a formal connection between Datalog¬ programs (i.e., normal logic programs without function symbols) and Boolean networks, leading to the introduction of the trap space concept for Datalog¬ programs. In this paper, we generalize the trap space concept to arbitrary normal logic programs, introducing trap space semantics as a new approach to their interpretation. This new semantics admits both model-theoretic and dynamical characterizations, providing a comprehensive approach to understanding program behavior. We establish the foundational properties of the trap space semantics and systematically relate it to the established model-theoretic semantics, including the stable (supported), stable (supported) partial, regular, and L-stable model semantics, as well as to the dynamical stable (supported) class semantics. Our results demonstrate that the trap space semantics offers a unified and precise framework for proving the existence of supported classes, strict stable (supported) classes, and regular models, in addition to uncovering and formalizing deeper relationships among the existing semantics of normal logic programs.

[AI-20] Defeasible Conditionals using Answer Set Programming

【Quick Read】: This paper addresses how to compute Rational Closure (RC) efficiently within the KLM framework, i.e., how to automate the modeling and checking of defeasible entailment under incomplete information. The key of the solution is a declarative definition of RC based on Answer Set Programming (ASP), which automatically constructs the minimal ranked model of a given knowledge base and supports entailment checking for specified queries. The correctness of the encoding is formally proven, and experiments show better computational efficiency than existing imperative implementations such as the InfOCF solver, providing a theoretically sound and efficient platform for defeasible reasoning.

Link: https://arxiv.org/abs/2601.03840
Authors: Racquel Dennison, Jesse Heyninck, Thomas Meyer
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Abstract:Defeasible entailment is concerned with drawing plausible conclusions from incomplete information. A foundational framework for modelling defeasible entailment is the KLM framework. Introduced by Kraus, Lehmann, and Magidor, the KLM framework outlines several key properties for defeasible entailment. One of the most prominent algorithms within this framework is Rational Closure (RC). This paper presents a declarative definition for computing RC using Answer Set Programming (ASP). Our approach enables the automatic construction of the minimal ranked model from a given knowledge base and supports entailment checking for specified queries. We formally prove the correctness of our ASP encoding and conduct empirical evaluations to compare the performance of our implementation with that of existing imperative implementations, specifically the InfOCF solver. The results demonstrate that our ASP-based approach adheres to RC’s theoretical foundations and offers improved computational efficiency.

[AI-21] Logic Tensor Network-Enhanced Generative Adversarial Network

【Quick Read】: This paper addresses the inability of generative models to explicitly model domain-specific logical constraints when generating data, which limits their reliability and interpretability in applications requiring rule adherence. The key of the solution is the Logic Tensor Network-Enhanced Generative Adversarial Network (LTN-GAN), a new framework that embeds Logic Tensor Networks (LTNs) into Generative Adversarial Networks (GANs) to explicitly enforce first-order logic constraints during generation, improving the logical consistency of generated samples while preserving their realism and diversity.

Link: https://arxiv.org/abs/2601.03839
Authors: Nijesh Upreti (The University of Edinburgh), Vaishak Belle (The University of Edinburgh)
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047

Abstract:In this paper, we introduce Logic Tensor Network-Enhanced Generative Adversarial Network (LTN-GAN), a novel framework that enhances Generative Adversarial Networks (GANs) by incorporating Logic Tensor Networks (LTNs) to enforce domain-specific logical constraints during the sample generation process. Although GANs have shown remarkable success in generating realistic data, they often lack mechanisms to incorporate prior knowledge or enforce logical consistency, limiting their applicability in domains requiring rule adherence. LTNs provide a principled way to integrate first-order logic with neural networks, enabling models to reason over and satisfy logical constraints. By combining the strengths of GANs for realistic data synthesis with LTNs for logical reasoning, we gain valuable insights into how logical constraints influence the generative process while improving both the diversity and logical consistency of the generated samples. We evaluate LTN-GAN across multiple datasets, including synthetic datasets (gaussian, grid, rings) and the MNIST dataset, demonstrating that our model significantly outperforms traditional GANs in terms of adherence to predefined logical constraints while maintaining the quality and diversity of generated samples. This work highlights the potential of neuro-symbolic approaches to enhance generative modeling in knowledge-intensive domains.

[AI-22] ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition

【Quick Read】: This paper addresses how to allocate computational resources efficiently under a strict global token budget so as to improve the overall performance of large language models (LLMs) across multiple reasoning tasks. Although LLMs reason well given sufficient computation, they lack an intrinsic sense of how much computation a task requires, which hampers optimal decisions in resource-constrained settings. The problem is formalized as an Ordered Stochastic Multiple-Choice Knapsack Problem (OS-MCKP), and the proposed ROI-Reasoning framework endows models with intrinsic budget-aware rationality through a two-stage mechanism: first, Meta-Cognitive Fine-Tuning enables the model to estimate a task's reasoning cost and expected return on investment (ROI) and make explicit solve-or-skip decisions; second, Rationality-Aware Reinforcement Learning optimizes sequential decision-making under a hard token budget, so the model learns long-horizon strategies for allocating computation. Experiments on budgeted mathematical reasoning benchmarks show that the method substantially reduces regret while improving the overall score.

Link: https://arxiv.org/abs/2601.03822
Authors: Muyang Zhao, Qi Qi, Hao Sun
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) can achieve strong reasoning performance with sufficient computation, but they do not inherently know how much computation a task requires. We study budgeted inference-time reasoning for multiple tasks under a strict global token constraint and formalize it as an Ordered Stochastic Multiple-Choice Knapsack Problem (OS-MCKP). This perspective highlights a meta-cognitive requirement: anticipating task difficulty, estimating return on investment (ROI), and allocating computation strategically. We propose ROI-Reasoning, a two-stage framework that endows LLMs with intrinsic, budget-aware rationality. In the first stage, Meta-Cognitive Fine-Tuning teaches models to predict reasoning cost and expected utility before generation, enabling explicit solve-or-skip decisions. Next, Rationality-Aware Reinforcement Learning optimizes sequential decision making under a hard token budget, allowing models to learn long-horizon allocation strategies. Across budgeted mathematical reasoning benchmarks, ROI-Reasoning consistently improves overall score while substantially reducing regret under tight computation budgets.
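
To make the knapsack framing concrete, below is a minimal Python sketch of a greedy solve-or-skip baseline under a token budget. The task list, the cost and utility estimates, and the greedy ROI ordering are all illustrative assumptions (the paper's OS-MCKP setting additionally imposes task arrival order and stochasticity, which its RL stage handles); this is not the paper's algorithm.

```python
# Hypothetical per-task estimates that a meta-cognitive model might produce:
# predicted token cost and expected utility (probability of solving).
tasks = [
    {"id": "t1", "cost": 800,  "utility": 0.9},
    {"id": "t2", "cost": 3000, "utility": 0.6},
    {"id": "t3", "cost": 500,  "utility": 0.3},
    {"id": "t4", "cost": 1500, "utility": 0.8},
]

def roi_schedule(tasks, budget):
    """Greedy solve-or-skip baseline for a knapsack-style token budget:
    attempt tasks in decreasing utility-per-token (ROI) order, skipping
    any task whose predicted cost no longer fits the remaining budget.
    Note: a real OS-MCKP must decide in arrival order; this global sort
    ignores that and only illustrates the ROI idea."""
    plan, remaining = [], budget
    for t in sorted(tasks, key=lambda t: t["utility"] / t["cost"], reverse=True):
        if t["cost"] <= remaining:
            plan.append(("solve", t["id"]))
            remaining -= t["cost"]
        else:
            plan.append(("skip", t["id"]))
    return plan, remaining

plan, left = roi_schedule(tasks, budget=3000)
print(plan, "tokens left:", left)
```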

[AI-23] Criminal Liability of Generative Artificial Intelligence Providers for User-Generated Child Sexual Abuse Material

【Quick Read】: This paper addresses the difficulty of assessing criminal liability when generative artificial intelligence (GenAI) is used to produce child sexual abuse material (CSAM), in particular the unclear allocation of responsibility across legal systems among users and parties responsible for the models, such as developers, researchers, and company representatives. The key of the solution is a statutory-interpretation analysis of the relevant German laws, combined with concrete scenarios that examine how GenAI's technical properties and usage contexts in CSAM generation affect criminal attribution, thereby clarifying the boundaries of responsibility for different roles and deriving compliance requirements for developing such systems.

Link: https://arxiv.org/abs/2601.03788
Authors: Anamaria Mojica-Hanke, Thomas Goger, Svenja Wölfel, Brian Valerius, Steffen Herbold
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted at the International Conference on AI Engineering

Abstract:The development of more powerful Generative Artificial Intelligence (GenAI) has expanded its capabilities and the variety of outputs. This has introduced significant legal challenges, including gray areas in various legal systems, such as the assessment of criminal liability for those responsible for these models. Therefore, we conducted a multidisciplinary study utilizing the statutory interpretation of relevant German laws, which, in conjunction with scenarios, provides a perspective on the different properties of GenAI in the context of Child Sexual Abuse Material (CSAM) generation. We found that generating CSAM with GenAI may have criminal and legal consequences not only for the user committing the primary offense but also for individuals responsible for the models, such as independent software developers, researchers, and company representatives. Additionally, the assessment of criminal liability may be affected by contextual and technical factors, including the type of generated image, content moderation policies, and the model’s intended purpose. Based on our findings, we discussed the implications for different roles, as well as the requirements when developing such systems.

[AI-24] EntroCoT: Enhancing Chain-of-Thought via Adaptive Entropy-Guided Segmentation

【Quick Read】: This paper addresses the widespread "answer right but reasoning wrong" problem in current Chain-of-Thought (CoT) fine-tuning datasets, where a model reaches the correct final answer through hallucinated, redundant, or logically invalid intermediate steps. The key of the solution is the EntroCoT framework, which first uses an entropy-based mechanism to segment a reasoning trace at uncertain junctures, precisely identifying step boundaries, and then applies Monte Carlo rollouts to evaluate each step's marginal contribution, filtering and reconstructing high-quality reasoning traces in which every step substantively advances the final answer. Experiments show that fine-tuning on the subset constructed by EntroCoT consistently outperforms full-dataset supervision on mathematical reasoning benchmarks.

Link: https://arxiv.org/abs/2601.03769
Authors: Zihang Li, Yuhang Wang, Yikun Zong, Wenhan Yu, Xiaokun Yuan, Runhan Jiang, Zirui Liu, Tong Yang, Arthur Jiang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Chain-of-Thought (CoT) prompting has significantly enhanced the mathematical reasoning capabilities of Large Language Models. We find that existing fine-tuning datasets frequently suffer from the "answer right but reasoning wrong" problem, where correct final answers are derived from hallucinated, redundant, or logically invalid intermediate steps. This paper proposes EntroCoT, a unified framework for automatically identifying and refining low-quality CoT supervision traces. EntroCoT first introduces an entropy-based mechanism to segment the reasoning trace into multiple steps at uncertain junctures, and then introduces a Monte Carlo rollout-based mechanism to evaluate the marginal contribution of each step. By accurately filtering deceptive reasoning samples, EntroCoT constructs a high-quality dataset where every intermediate step in each reasoning trace facilitates the final answer. Extensive experiments on mathematical benchmarks demonstrate that fine-tuning on the subset constructed by EntroCoT consistently outperforms the baselines of full-dataset supervision.
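
A rough Python sketch of the two mechanisms the abstract describes, with hypothetical names throughout: entropy-based segmentation at uncertain junctures, and a Monte Carlo rollout estimate of a step's marginal contribution. The `rollout` stub stands in for sampling continuations from an actual LLM; the threshold and scoring are illustrative, not EntroCoT's exact procedure.

```python
import math, random

random.seed(0)

def shannon_entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def segment_by_entropy(token_entropies, tokens, threshold=2.0):
    """Cut a reasoning trace into steps at high-entropy (uncertain) junctures."""
    steps, current = [], []
    for tok, h in zip(tokens, token_entropies):
        current.append(tok)
        if h > threshold:            # uncertain juncture -> close the step here
            steps.append(current)
            current = []
    if current:
        steps.append(current)
    return steps

def step_marginal_contribution(steps, i, rollout, n=32):
    """Monte Carlo estimate of step i's marginal value: success rate of
    continuations from the prefix including step i, minus the rate without it.
    `rollout(prefix_steps)` is a stub for sampling a continuation from an LLM
    and checking whether it reaches the correct final answer."""
    p_with = sum(rollout(steps[: i + 1]) for _ in range(n)) / n
    p_without = sum(rollout(steps[: i]) for _ in range(n)) / n
    return p_with - p_without

# Toy stand-in: continuations succeed more often from longer useful prefixes.
rollout = lambda prefix: random.random() < min(0.9, 0.2 + 0.25 * len(prefix))

tokens = list("abcdefgh")
ents = [0.5, 0.4, 2.5, 0.3, 0.2, 2.7, 0.1, 0.4]       # per-token entropies
steps = segment_by_entropy(ents, tokens)               # -> ['abc', 'def', 'gh']
print(["".join(s) for s in steps])
print(round(step_marginal_contribution(steps, 1, rollout), 2))
```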

[AI-25] Learning Shrinks the Hard Tail: Training-Dependent Inference Scaling in a Solvable Linear Model

【Quick Read】: This paper studies the relationship between model improvement and inference-time failure rates in generative AI when targets have instance-heterogeneous difficulty, and how training shapes final inference performance. The key of the solution is a solvable Latent Instance Difficulty (LID) model in which each input's target variance is governed by a latent "precision" drawn from a heavy-tailed distribution. The model shows that, although generalization loss follows standard neural scaling laws, the inference-time pass@k failure rate decays as a power law $k^{-\beta_\text{eff}}$, with an effective exponent $\beta_\text{eff}$ that depends on the number of training samples $N$: it grows with $N$ before saturating at an intrinsic limit $\beta$ set by the tail of the difficulty distribution. This reveals that learning shrinks the "hard tail" of the error distribution, i.e., optimization steepens the pass@k curve until irreducible target variance dominates. The theory yields testable closed-form predictions, including an optimal compute-allocation rule: favor training before saturation and inference attempts after.

Link: https://arxiv.org/abs/2601.03764
Authors: Noam Levi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages

Abstract:We analyze neural scaling laws in a solvable model of last-layer fine-tuning where targets have intrinsic, instance-heterogeneous difficulty. In our Latent Instance Difficulty (LID) model, each input's target variance is governed by a latent "precision" drawn from a heavy-tailed distribution. While generalization loss recovers standard scaling laws, our main contribution connects this to inference. The pass@k failure rate exhibits a power-law decay, $k^{-\beta_\text{eff}}$, but the observed exponent $\beta_\text{eff}$ is training-dependent. It grows with sample size $N$ before saturating at an intrinsic limit $\beta$ set by the difficulty distribution's tail. This coupling reveals that learning shrinks the "hard tail" of the error distribution: improvements in the model's generalization error steepen the pass@k curve until irreducible target variance dominates. The LID model yields testable, closed-form predictions for this behavior, including a compute-allocation rule that favors training before saturation and inference attempts after. We validate these predictions in simulations and in two real-data proxies: CIFAR-10H (human-label variance) and a maths teacher-student distillation task.
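
The pass@k power law is easy to reproduce in a toy simulation. The sketch below illustrates the qualitative claim, not the paper's model: it assumes per-attempt success probabilities with a hard tail P(s < t) = t^beta and checks that the empirical pass@k failure rate decays with an exponent approaching beta.

```python
import numpy as np

rng = np.random.default_rng(0)

def pass_at_k_failure(k, beta=0.7, n=500_000):
    """Each instance gets a per-attempt success probability s whose hard tail
    satisfies P(s < t) = t**beta (heavy mass on near-impossible items);
    pass@k fails only when all k independent attempts fail."""
    s = rng.random(n) ** (1.0 / beta)   # inverse-CDF sampling of P(s < t) = t^beta
    return np.mean((1.0 - s) ** k)

ks = np.array([4, 8, 16, 32, 64, 128, 256])
fails = np.array([pass_at_k_failure(k) for k in ks])
# On a log-log plot, failure ~ k^(-beta); estimate the slope by a linear fit.
slope = -np.polyfit(np.log(ks), np.log(fails), 1)[0]
print("fitted exponent:", round(slope, 2), "(approaches beta = 0.7 as k grows)")
```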

[AI-26] Bridging OLAP and RAG: A Multidimensional Approach to the Design of Corpus Partitioning

【Quick Read】: This paper addresses the scalability and missing semantic organization of industrial-scale Retrieval-Augmented Generation (RAG) systems over large document collections. Current mainstream solutions rely on bottom-up, similarity-driven partitioning (approximate nearest-neighbor search and metadata filtering); while efficient, they lack a semantic rationale for how the corpus is sharded, making retrieval hard to control and explain. The key of the solution is the Dimensional Fact Model (DFM), which combines two orthogonal strategies, semantic clustering and multidimensional partitioning, explicitly modeling implicit concepts such as time and organizational context as configurable dimensions and supporting hierarchical routing with controlled fallback. This turns black-box similarity matching into a governable, deterministic retrieval workflow, providing a principled basis and a practical framework for the structured design of large-scale RAG systems.

Link: https://arxiv.org/abs/2601.03748
Authors: Dario Maio, Stefano Rizzi
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) systems are increasingly deployed on large-scale document collections, often comprising millions of documents and tens of millions of text chunks. In industrial-scale retrieval platforms, scalability is typically addressed through horizontal sharding and a combination of Approximate Nearest-Neighbor search, hybrid indexing, and optimized metadata filtering. Although effective from an efficiency perspective, these mechanisms rely on bottom-up, similarity-driven organization and lack a conceptual rationale for corpus partitioning. In this paper, we claim that the design of large-scale RAG systems may benefit from the combination of two orthogonal strategies: semantic clustering, which optimizes locality in embedding space, and multidimensional partitioning, which governs where retrieval should occur based on conceptual dimensions such as time and organizational context. Although such dimensions are already implicitly present in current systems, they are used in an ad hoc and poorly structured manner. We propose the Dimensional Fact Model (DFM) as a conceptual framework to guide the design of multidimensional partitions for RAG corpora. The DFM provides a principled way to reason about facts, dimensions, hierarchies, and granularity in retrieval-oriented settings. This framework naturally supports hierarchical routing and controlled fallback strategies, ensuring that retrieval remains robust even in the presence of incomplete metadata, while transforming the search process from a ‘black-box’ similarity matching into a governable and deterministic workflow. This work is intended as a position paper; its goal is to bridge the gap between OLAP-style multidimensional modeling and modern RAG architectures, and to stimulate further research on principled, explainable, and governable retrieval strategies at scale.
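
A minimal sketch of the kind of dimension-driven routing with controlled fallback the position paper argues for. The dimensions (year, org), the hierarchy, and all names are illustrative assumptions, not an API from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Chunk:
    text: str
    year: Optional[int]   # temporal dimension (metadata may be missing)
    org: Optional[str]    # organizational dimension

def partition_key(year, org):
    """Most specific partition whose dimension values are known; fall back
    up the (illustrative) hierarchy when metadata is incomplete."""
    if year is not None and org is not None:
        return ("year-org", year, org)   # finest granularity
    if year is not None:
        return ("year", year)            # fallback: drop the org dimension
    return ("all",)                      # coarsest fallback

def build_index(chunks):
    """Register each chunk at every level it can serve, so queries route
    hierarchically and degrade gracefully on missing metadata."""
    index = {}
    for c in chunks:
        for key in {partition_key(c.year, c.org),
                    partition_key(c.year, None),
                    partition_key(None, None)}:
            index.setdefault(key, []).append(c)
    return index

chunks = [Chunk("Q3 sales report", 2024, "sales"), Chunk("old memo", 2019, None)]
index = build_index(chunks)
print([c.text for c in index[partition_key(2024, "sales")]])  # ['Q3 sales report']
print([c.text for c in index[partition_key(None, None)]])     # both chunks
```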

[AI-27] From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

【Quick Read】: This paper addresses the difficulty of evaluating repository-level reasoning when large language models (LLMs) act as autonomous agents, i.e., their ability to maintain logical consistency across massive, real-world, interdependent file systems. Existing benchmarks are typically limited to isolated code snippets or black-box evaluation and cannot reveal a model's internal reasoning. The key of the solution is RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification, together with an execution-driven mutation framework that uses the runtime environment as a semantic oracle to dynamically regenerate ground-truth states and eliminate memorization bias, and a fine-grained diagnostic system based on dynamic program slicing that quantifies reasoning via three orthogonal metrics: ESV (reading load), MCL (simulation depth), and DFI (integration width). Experiments reveal a prevalent aggregation deficit in frontier models, with integration width (DFI) as the primary cognitive bottleneck, providing interpretable white-box insights for optimizing the next generation of agentic software engineering.

Link: https://arxiv.org/abs/2601.03731
Authors: Jia Li, Yuxin Su, Michael R. Lyu
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: ESV (reading load), MCL (simulation depth), and DFI (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, where integration width serves as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering.

[AI-28] R3L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration Pivotal Credit and Positive Amplification

【Quick Read】: This paper addresses the exploration-exploitation dilemma of reinforcement learning (RL) for improving the reasoning and agentic capabilities of large language models (LLMs): on the exploration side, stochastic sampling achieves low success rates on difficult tasks and from-scratch rollouts are costly; on the exploitation side, trajectory-level rewards penalize valid prefixes for later errors and failure samples dominate the training signal, causing coarse credit assignment and training instability. The core of the solution is R^3L (Reflect-then-Retry Reinforcement Learning), built on three mechanisms: 1) actively synthesizing high-quality trajectories via reflect-then-retry, using language feedback to diagnose and localize errors and restarting from the identified failure point, reducing rollout costs; 2) Pivotal Credit Assignment, which updates gradients only on the diverging suffix where contrastive signals exist, excluding the shared prefix; 3) Positive Amplification, which upweights successful trajectories so that optimization is guided by positive signals. Experiments on agentic and reasoning tasks show 5% to 52% relative improvements while maintaining training stability.

Link: https://arxiv.org/abs/2601.03715
Authors: Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: Trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R^3L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R^3L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at this https URL.
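
A tiny sketch of the Pivotal Credit Assignment idea: when a failed rollout and its retry share a prefix, only the diverging suffix carries the learning signal. The function below is illustrative (the actual method operates on token log-probabilities inside an RL objective); all names are hypothetical.

```python
def pivotal_advantages(shared_prefix_len, seq_len, advantage):
    """A minimal sketch of Pivotal Credit Assignment: tokens in the shared
    prefix (where retry and failure agree, so no contrastive signal exists)
    are masked out of the update; only the diverging suffix receives the
    trajectory advantage."""
    return [0.0 if t < shared_prefix_len else advantage for t in range(seq_len)]

# A failed rollout and its retry agree on the first 5 tokens, then diverge.
adv = pivotal_advantages(shared_prefix_len=5, seq_len=9, advantage=1.0)
print(adv)  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0] -> update suffix only
```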

[AI-29] The Power of 10: New Rules for the Digital World

【Quick Read】: This paper addresses how, amid rapid advances in artificial intelligence, society's fascination with superhuman machines and seamless digital futures often obscures the social, ethical, and psychological problems raised by pervasive digital technologies, such as surveillance abuse and mental health crises. The key of the solution is a human-centered ethical framework, the "Ten Rules for the Digital World", inspired by the lasting influence of the biblical Ten Commandments and intended to guide individuals and societies toward prudent decisions in the age of "supercharged" technology, balancing technological progress with human values.

Link: https://arxiv.org/abs/2601.03709
Authors: Sarah Spiekermann-Hoff, Marc Langheinrich, Johannes Hoff, Christiane Wendehorst, Jürgen Pfeffer, Thomas Fuchs, Armin Grunwald
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: to be published in Communications of the ACM (submitted 26 June 2025, revised 29 August 2025, accepted 3 November 2025)

Abstract:As artificial intelligence rapidly advances, society is increasingly captivated by promises of superhuman machines and seamless digital futures. Yet these visions often obscure mounting social, ethical, and psychological concerns tied to pervasive digital technologies - from surveillance to mental health crises. This article argues that a guiding ethos is urgently needed to navigate these transformations. Inspired by the lasting influence of the biblical Ten Commandments, a European interdisciplinary group has proposed “Ten Rules for the Digital World” - a novel ethical framework to help individuals and societies make prudent, human-centered decisions in the age of “supercharged” technology.

[AI-30] MHRC-Bench: A Multilingual Hardware Repository-Level Code Completion benchmark

【Quick Read】: This paper addresses the fact that existing code completion benchmarks focus mainly on software code in general-purpose programming languages and overlook repository-level completion for hardware description languages (HDL). The key of the solution is MHRC-Bench, the first benchmark for multilingual hardware code completion at the repository level, comprising a training set (MHRC-Bench-Train) and an evaluation set (MHRC-Bench-Eval) and covering three major hardware design coding styles. Each completion target is annotated with code-structure-level and hardware-oriented semantic labels derived from concrete syntax tree analysis, enabling systematic evaluation of hardware code completion models.

Link: https://arxiv.org/abs/2601.03708
Authors: Qingyun Zou, Jiahao Cui, Nuo Chen, Bingsheng He, Weng-Fai Wong
Institution: Unknown
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have achieved strong performance on code completion tasks in general-purpose programming languages. However, existing repository-level code completion benchmarks focus almost exclusively on software code and largely overlook hardware description languages. In this work, we present **MHRC-Bench**, consisting of **MHRC-Bench-Train** and **MHRC-Bench-Eval**, the first benchmark designed for multilingual hardware code completion at the repository level. Our benchmark targets completion tasks and covers three major hardware design coding styles. Each completion target is annotated with code-structure-level and hardware-oriented semantic labels derived from concrete syntax tree analysis. We conduct a comprehensive evaluation of models on MHRC-Bench-Eval. Comprehensive evaluation results and analysis demonstrate the effectiveness of MHRC-Bench.

[AI-31] Investigating Knowledge Distillation Through Neural Networks for Protein Binding Affinity Prediction

【Quick Read】: This paper addresses the trade-off between predictive accuracy and data availability in protein-protein binding affinity prediction, in particular the scarcity of experimentally resolved protein structures, which limits structure-based machine learning models. The core solution is a knowledge distillation-based regression framework: structural information guides the training of a student network, while inference requires only sequence data. A structure-informed teacher network jointly supervises the sequence-based student via binding affinity labels and intermediate feature representations, effectively transferring structural knowledge to the sequence-based model. Experiments show a marked improvement for the sequence model (Pearson correlation from 0.375 to 0.481), and error analyses further confirm reduced prediction bias and improved agreement.

Link: https://arxiv.org/abs/2601.03704
Authors: Wajid Arshad Abbasi, Syed Ali Abbas, Maryum Bibi, Saiqa Andleeb, Muhammad Naveed Akhtar
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The trade-off between predictive accuracy and data availability makes it difficult to predict protein–protein binding affinity accurately. The lack of experimentally resolved protein structures limits the performance of structure-based machine learning models, which generally outperform sequence-based methods. In order to overcome this constraint, we suggest a regression framework based on knowledge distillation that uses protein structural data during training and only needs sequence data during inference. The suggested method uses binding affinity labels and intermediate feature representations to jointly supervise the training of a sequence-based student network under the guidance of a structure-informed teacher network. Leave-One-Complex-Out (LOCO) cross-validation was used to assess the framework on a non-redundant protein–protein binding affinity benchmark dataset. A maximum Pearson correlation coefficient (P_r) of 0.375 and an RMSE of 2.712 kcal/mol were obtained by sequence-only baseline models, whereas a P_r of 0.512 and an RMSE of 2.445 kcal/mol were obtained by structure-based models. With a P_r of 0.481 and an RMSE of 2.488 kcal/mol, the distillation-based student model greatly enhanced sequence-only performance. Improved agreement and decreased bias were further confirmed by thorough error analyses. With the potential to close the performance gap between sequence-based and structure-based models as larger datasets become available, these findings show that knowledge distillation is an efficient method for transferring structural knowledge to sequence-based predictors. The source code for running inference with the proposed distillation-based binding affinity predictor can be accessed at this https URL.
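
A minimal numpy sketch of the joint supervision scheme described above, assuming a trained structure-informed teacher is available: the student loss combines the affinity label, the teacher's prediction, and a feature-matching (hint) term. The loss forms and the weights alpha and beta are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def distillation_loss(y_true, student_pred, teacher_pred,
                      student_feat, teacher_feat, alpha=0.5, beta=0.5):
    """Joint supervision for the sequence-based student: (1) match the
    binding affinity label, (2) match the structure-informed teacher's
    prediction, and (3) match the teacher's intermediate features."""
    label_loss   = np.mean((student_pred - y_true) ** 2)        # regression MSE
    soft_loss    = np.mean((student_pred - teacher_pred) ** 2)  # output matching
    feature_loss = np.mean((student_feat - teacher_feat) ** 2)  # hint matching
    return label_loss + alpha * soft_loss + beta * feature_loss

rng = np.random.default_rng(1)
y, sp, tp = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
sf, tf = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))    # feature maps
print(round(float(distillation_loss(y, sp, tp, sf, tf)), 4))
```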

[AI-32] TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL

【Quick Read】: This paper addresses the sample inefficiency and length bias of group-based reinforcement learning objectives such as Group Relative Policy Optimization (GRPO) for aligning large language models on complex reasoning tasks: standard methods treat each trajectory as an independent flat sequence and assign a single sequence-level advantage to all tokens, which encourages verbose chains of thought (CoT) with little added logical depth. The key of the solution is TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment: entropy-driven sampling first builds a forest in which rollouts branch at high-uncertainty decisions while sharing low-uncertainty tokens; TreeAdv then redistributes the advantages of complete rollouts (all leaf nodes) to aggregate token-level advantages for internal tree segments, achieving better sample efficiency and deeper logical improvement.

Link: https://arxiv.org/abs/2601.03703
Authors: Lang Cao, Hui Ruan, Yongqian Li, Peng Chao, Wu Ning, Haonan Song, Renhong Chen, Yitong Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) based on an entropy-driven sampling method where each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. Then, TreeAdv aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (all leaf nodes), and TreeAdv applies readily to group-based objectives such as GRPO or GSPO. Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO, while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
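
A toy sketch of tree-structured advantage redistribution on a three-leaf rollout forest. The branching structure, the GRPO-style group-mean baseline, and the mean-over-descendant-leaves rule are illustrative; the paper's exact aggregation may differ.

```python
import statistics

# Toy rollout tree: each node is a reasoning segment; leaves are complete
# rollouts with scalar rewards. Branches open at high-entropy tokens.
tree = {
    "root":    ["branchA", "branchB"],
    "branchA": ["leaf1", "leaf2"],
    "branchB": ["leaf3"],
}
leaf_reward = {"leaf1": 1.0, "leaf2": 0.0, "leaf3": 1.0}

def leaves_under(node):
    """All complete rollouts (leaves) that pass through this segment."""
    if node in leaf_reward:
        return [node]
    return [l for child in tree[node] for l in leaves_under(child)]

# GRPO-style group-relative advantage of each complete rollout (leaf).
group_mean = statistics.mean(leaf_reward.values())
leaf_adv = {l: r - group_mean for l, r in leaf_reward.items()}

def segment_advantage(node):
    """Redistribute: an internal segment's tokens get the aggregated
    advantage of all complete rollouts passing through it."""
    ls = leaves_under(node)
    return sum(leaf_adv[l] for l in ls) / len(ls)

for node in tree:
    print(node, round(segment_advantage(node), 3))
# The shared root ends up near 0; the diverging branches carry the signal.
```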

[AI-33] Inference Attacks Against Graph Generative Diffusion Models USENIX-SECURITY2026

【Quick Read】: This paper addresses potential information leakage from training graph generative diffusion models, focusing on the privacy risks posed by black-box inference attacks. Three types of attacks are designed and validated, a graph reconstruction attack, a property inference attack, and membership inference attacks, showing that the generated graphs carry exploitable sensitive information about the training data and exposing the shortcomings of current graph generative models in privacy protection. The key of the solution is two defense mechanisms that effectively mitigate all three attacks without significantly harming the target model's generative performance, achieving a better privacy-utility trade-off than existing methods.

Link: https://arxiv.org/abs/2601.03701
Authors: Xiuling Wang, Xin Huang, Guibo Luo, Jianliang Xu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This work has been accepted by USENIX Security 2026

Abstract:Graph generative diffusion models have recently emerged as a powerful paradigm for generating complex graph structures, effectively capturing intricate dependencies and relationships within graph data. However, the privacy risks associated with these models remain largely unexplored. In this paper, we investigate information leakage in such models through three types of black-box inference attacks. First, we design a graph reconstruction attack, which can reconstruct graphs structurally similar to those training graphs from the generated graphs. Second, we propose a property inference attack to infer the properties of the training graphs, such as the average graph density and the distribution of densities, from the generated graphs. Third, we develop two membership inference attacks to determine whether a given graph is present in the training set. Extensive experiments on three different types of graph generative diffusion models and six real-world graphs demonstrate the effectiveness of these attacks, significantly outperforming the baseline approaches. Finally, we propose two defense mechanisms that mitigate these inference attacks and achieve a better trade-off between defense strength and target model utility than existing methods. Our code is available at this https URL.

[AI-34] Can AI Chatbots Provide Coaching in Engineering? Beyond Information Processing Toward Mastery

【Quick Read】: This paper asks whether generative AI, used as a coaching tool in engineering education, can move beyond information delivery to genuinely foster mastery. The core challenge is that the judgment and tacit knowledge cultivated by traditional apprenticeship are eroding, while AI may not replace the value-laden dimensions of human mentorship in moral, emotional, and contextual judgment. The key of the solution is a multiplex coaching framework that embeds human wisdom in expert-in-the-loop models, preserving the depth of apprenticeship, with its embodied rationality and value orientation, while leveraging AI's scalability to enrich engineering education.

Link: https://arxiv.org/abs/2601.03693
Authors: Junaid Qadir, Muhammad Adil Attique, Saleha Shoaib, Syed Ibrahim Ghaznavi
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: accepted at IEEE EDUCON 2026

Abstract:Engineering education faces a double disruption: traditional apprenticeship models that cultivated judgment and tacit skill are eroding, just as generative AI emerges as an informal coaching partner. This convergence rekindles long-standing questions in the philosophy of AI and cognition about the limits of computation, the nature of embodied rationality, and the distinction between information processing and wisdom. Building on this rich intellectual tradition, this paper examines whether AI chatbots can provide coaching that fosters mastery rather than merely delivering information. We synthesize critical perspectives from decades of scholarship on expertise, tacit knowledge, and human-machine interaction, situating them within the context of contemporary AI-driven education. Empirically, we report findings from a mixed-methods study (N = 75 students, N = 7 faculty) exploring the use of a coaching chatbot in engineering education. Results reveal a consistent boundary: participants accept AI for technical problem solving (convergent tasks; M = 3.84 on a 1-5 Likert scale) but remain skeptical of its capacity for moral, emotional, and contextual judgment (divergent tasks). Faculty express stronger concerns over risk (M = 4.71 vs. M = 4.14, p = 0.003), and privacy emerges as a key requirement, with 64-71 percent of participants demanding strict confidentiality. Our findings suggest that while generative AI can democratize access to cognitive and procedural support, it cannot replicate the embodied, value-laden dimensions of human mentorship. We propose a multiplex coaching framework that integrates human wisdom within expert-in-the-loop models, preserving the depth of apprenticeship while leveraging AI scalability to enrich the next generation of engineering education.

[AI-35] A Pre-trained Reaction Embedding Descriptor Capturing Bond Transformation Patterns

【Quick Read】: This paper addresses the scarcity of general-purpose, reaction-level descriptors in current data-driven reaction prediction models, which limits the bridge between real-world chemistry and digital representations. The key of the solution is RXNEmb, a novel reaction-level descriptor derived from the RXNGraphormer model, which is pre-trained to distinguish real reactions from fictitious ones with erroneous bond changes and thereby learns intrinsic bond formation and cleavage patterns. RXNEmb not only enables a data-driven re-clustering of the USPTO-50k dataset that more directly reflects bond-change similarity, but also supports visualization of reaction-space diversity via dimensionality reduction; attention-weight analysis further reveals the model's focus on chemically critical sites, providing mechanistic insight and a powerful, interpretable tool for reaction fingerprinting.

Link: https://arxiv.org/abs/2601.03689
Authors: Weiqi Liu, Fenglei Cao, Yuan Qi, Li-Cheng Xu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 10 pages, 5 figures

Abstract:With the rise of data-driven reaction prediction models, effective reaction descriptors are crucial for bridging the gap between real-world chemistry and digital representations. However, general-purpose, reaction-wise descriptors remain scarce. This study introduces RXNEmb, a novel reaction-level descriptor derived from RXNGraphormer, a model pre-trained to distinguish real reactions from fictitious ones with erroneous bond changes, thereby learning intrinsic bond formation and cleavage patterns. We demonstrate its utility by data-driven re-clustering of the USPTO-50k dataset, yielding a classification that more directly reflects bond-change similarities than rule-based categories. Combined with dimensionality reduction, RXNEmb enables visualization of reaction space diversity. Furthermore, attention weight analysis reveals the model’s focus on chemically critical sites, providing mechanistic insight. RXNEmb serves as a powerful, interpretable tool for reaction fingerprinting and analysis, paving the way for more data-centric approaches in reaction analysis and discovery.

[AI-36] Personalized Medication Planning via Direct Domain Modeling and LLM -Generated Heuristics

【Quick Read】: This paper addresses the practical limitation that personalized medication planning could previously handle only a small number of medications (at most seven in earlier work), which hinders real clinical deployment. The key of the solution is to define the domain programmatically (specifying the initial state and the successor-state generation procedure) and to use a large language model (LLM) to automatically generate problem-specific heuristic functions for a generic search algorithm (GBFS). This significantly improves planning coverage and running time, scaling the number of medications to at least 28 and bringing medication planning closer to clinical practice.

Link: https://arxiv.org/abs/2601.03687
Authors: Yonatan Vernik, Alexander Tuisov, David Izhaki, Hana Weitman, Gal A. Kaminka, Alexander Shleyfman
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Personalized medication planning involves selecting medications and determining a dosing schedule to achieve medical goals specific to each individual patient. Previous work successfully demonstrated that automated planners, using general domain-independent heuristics, are able to generate personalized treatments, when the domain and problems are modeled using a general domain description language (PDDL+). Unfortunately, this process was limited in practice to consider no more than seven medications. In clinical terms, this is a non-starter. In this paper, we explore the use of automatically-generated domain- and problem-specific heuristics to be used with general search, as a method of scaling up medication planning to levels allowing closer work with clinicians. Specifically, we specify the domain programmatically (specifying an initial state and a successor generation procedure), and use an LLM to generate a problem-specific heuristic that can be used by a fixed search algorithm (GBFS). The results indicate dramatic improvements in coverage and planning time, scaling up the number of medications to at least 28, and bringing medication planning one step closer to practical applications.
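
Below is a minimal greedy best-first search (GBFS) with a pluggable heuristic, mirroring the described setup of a programmatic domain (initial state plus successor generator) with an externally supplied heuristic, e.g. one generated by an LLM for the specific problem instance. The toy dosing domain is entirely hypothetical and not from the paper.

```python
import heapq
from itertools import count

def gbfs(initial_state, successors, is_goal, heuristic):
    """Greedy best-first search: always expand the state with the lowest
    heuristic value. The heuristic is a plain callable, so a generated
    (e.g. LLM-written) function can be swapped in per problem instance."""
    tie = count()                          # tie-breaker so heap entries compare
    frontier = [(heuristic(initial_state), next(tie), initial_state, [])]
    seen = {initial_state}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier,
                               (heuristic(nxt), next(tie), nxt, plan + [action]))
    return None

# Tiny illustrative "dosing" domain (hypothetical): reach a target level of 6
# by choosing doses of 1 or 3 per step, never overshooting.
succ = lambda level: [(f"dose+{d}", level + d) for d in (1, 3) if level + d <= 6]
plan = gbfs(0, succ, lambda s: s == 6, heuristic=lambda s: 6 - s)
print(plan)   # -> ['dose+3', 'dose+3']
```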

[AI-37] Disentangling Aleatoric and Epistemic Uncertainty in Physics-Informed Neural Networks. Application to Insulation Material Degradation Prognostics

【Quick Read】: This paper addresses the insufficient uncertainty quantification (UQ) capabilities of Physics-Informed Neural Networks (PINNs) in prognostics and health management (PHM) applications: existing PINN-based prognostics are mostly deterministic or consider only epistemic uncertainty, which cannot support risk-aware decision-making. The key of the solution is a heteroscedastic Bayesian Physics-Informed Neural Network (B-PINN) framework that combines Bayesian Neural Networks (BNNs) with physics-based residual constraints and prior distributions to jointly model epistemic and aleatoric uncertainty, yielding full posterior predictive distributions for spatiotemporal insulation ageing estimation. This design enables probabilistic inference under physical constraints and markedly improves predictive accuracy and uncertainty calibration.

Link: https://arxiv.org/abs/2601.03673
Authors: Ibai Ramirez, Jokin Alcibar, Joel Pino, Mikel Sanz, Jose I. Aizpurua
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 13 figures, 5 tables

Abstract:Physics-Informed Neural Networks (PINNs) provide a framework for integrating physical laws with data. However, their application to Prognostics and Health Management (PHM) remains constrained by the limited uncertainty quantification (UQ) capabilities. Most existing PINN-based prognostics approaches are deterministic or account only for epistemic uncertainty, limiting their suitability for risk-aware decision-making. This work introduces a heteroscedastic Bayesian Physics-Informed Neural Network (B-PINN) framework that jointly models epistemic and aleatoric uncertainty, yielding full predictive posteriors for spatiotemporal insulation material ageing estimation. The approach integrates Bayesian Neural Networks (BNNs) with physics-based residual enforcement and prior distributions, enabling probabilistic inference within a physics-informed learning architecture. The framework is evaluated on transformer insulation ageing application, validated with a finite-element thermal model and field measurements from a solar power plant, and benchmarked against deterministic PINNs, dropout-based PINNs (d-PINNs), and alternative B-PINN variants. Results show that the proposed B-PINN provides improved predictive accuracy and better-calibrated uncertainty estimates than competing approaches. A systematic sensitivity study further analyzes the impact of boundary-condition, initial-condition, and residual sampling strategies on accuracy, calibration, and generalization. Overall, the findings highlight the potential of Bayesian physics-informed learning to support uncertainty-aware prognostics and informed decision-making in transformer asset management.
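
A small numpy sketch of two ingredients a heteroscedastic Bayesian network combines: a heteroscedastic Gaussian negative log-likelihood (aleatoric noise predicted per input) and the standard decomposition of predictive variance over posterior samples into epistemic and aleatoric parts. This is generic UQ machinery stated under the assumption that posterior weight samples are available (e.g. via variational inference or MCMC); it omits the physics residual term of the B-PINN.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Heteroscedastic negative log-likelihood: the network predicts both a
    mean and an input-dependent log-variance (the aleatoric noise)."""
    return float(np.mean(0.5 * (np.exp(-log_var) * (y - mu) ** 2 + log_var)))

def decompose_uncertainty(mu_samples, log_var_samples):
    """Given predictions from S posterior weight samples (arrays of shape
    [S, N]), split total predictive variance into an epistemic part (spread
    of the predicted means) and an aleatoric part (average predicted noise)."""
    epistemic = mu_samples.var(axis=0)
    aleatoric = np.exp(log_var_samples).mean(axis=0)
    return epistemic, aleatoric

rng = np.random.default_rng(0)
mu_s = rng.normal(0.0, 0.1, size=(32, 5))   # S=32 posterior samples, N=5 points
lv_s = np.full((32, 5), np.log(0.04))       # predicted noise variance of 0.04
ep, al = decompose_uncertainty(mu_s, lv_s)
print("epistemic:", ep.round(3), "aleatoric:", al.round(3))
print("NLL:", round(gaussian_nll(np.zeros(5), mu_s[0], lv_s[0]), 3))
```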

[AI-38] Discontinuous Galerkin finite element operator network for solving non-smooth PDEs

【Quick Read】: This paper addresses the limitations of conventional operator learning models (such as DeepONet and the Fourier Neural Operator) on parametric partial differential equations (PDEs) with discontinuous coefficients and non-smooth solutions: these models rely on large paired datasets and perform poorly near sharp features. The key of the solution is DG-FEONet, a data-free operator learning framework that combines the discontinuous Galerkin (DG) method with neural networks: the network directly predicts element-wise solution coefficients and is trained by minimizing the weak-form residual of a Symmetric Interior Penalty Galerkin (SIPG) scheme, so no precomputed input-output pairs are needed. The method recovers discontinuities accurately, generalizes well across the parameter space, and exhibits reliable convergence rates.

Link: https://arxiv.org/abs/2601.03668
Authors: Kapil Chawla, Youngjoon Hong, Jae Yong Lee, Sanghyun Lee
Institution: Unknown
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 11 figures

Abstract:We introduce Discontinuous Galerkin Finite Element Operator Network (DG–FEONet), a data-free operator learning framework that combines the strengths of the discontinuous Galerkin (DG) method with neural networks to solve parametric partial differential equations (PDEs) with discontinuous coefficients and non-smooth solutions. Unlike traditional operator learning models such as DeepONet and Fourier Neural Operator, which require large paired datasets and often struggle near sharp features, our approach minimizes the residual of a DG-based weak formulation using the Symmetric Interior Penalty Galerkin (SIPG) scheme. DG-FEONet predicts element-wise solution coefficients via a neural network, enabling data-free training without the need for precomputed input-output pairs. We provide theoretical justification through convergence analysis and validate the model’s performance on a series of one- and two-dimensional PDE problems, demonstrating accurate recovery of discontinuities, strong generalization across parameter space, and reliable convergence rates. Our results highlight the potential of combining local discretization schemes with machine learning to achieve robust, singularity-aware operator approximation in challenging PDE settings.

[AI-39] How Does the Thinking Step Influence Model Safety? An Entropy-based Safety Reminder for LRMs

【Quick Read】: This paper addresses the safety risk that Large Reasoning Models (LRMs), which reason through explicit thinking steps, may amplify unsafe behaviors introduced within the chain of thought; conventional defenses are ineffective because they ignore LRM-specific reasoning dynamics. The key of the solution is the finding that safe-reminding phrases emerging within thinking steps play a pivotal role in LRM safety, which motivates SafeRemind, a decoding-time defense that uses entropy triggers to dynamically inject safe-reminding phrases at decision-locking points, redirecting potentially harmful trajectories toward safer outcomes. Without any parameter updates, it improves safety by up to 45.5 percentage points while preserving core reasoning capability.

Link: https://arxiv.org/abs/2601.03662
Authors: Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Reasoning Models (LRMs) achieve remarkable success through explicit thinking steps, yet the thinking steps introduce a novel risk by potentially amplifying unsafe behaviors. Despite this vulnerability, conventional defense mechanisms remain ineffective as they overlook the unique reasoning dynamics of LRMs. In this work, we find that the emergence of safe-reminding phrases within thinking steps plays a pivotal role in ensuring LRM safety. Motivated by this finding, we propose SafeRemind, a decoding-time defense method that dynamically injects safe-reminding phrases into thinking steps. By leveraging entropy triggers to intervene at decision-locking points, SafeRemind redirects potentially harmful trajectories toward safer outcomes without requiring any parameter updates. Extensive evaluations across five LRMs and six benchmarks demonstrate that SafeRemind substantially enhances safety, achieving improvements of up to 45.5%p while preserving core reasoning utility.
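
A toy decoding loop illustrating entropy-triggered injection. The threshold, the reminder phrase, and the `next_token_dist` stub are all hypothetical stand-ins; the actual method operates on an LRM's thinking-step decoding.

```python
import math, random

random.seed(0)
SAFE_REMINDER = " Wait, let me double-check that this request is safe to answer."

def entropy(dist):
    """Shannon entropy of a next-token distribution given as {token: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def decode_with_reminders(next_token_dist, max_steps=40, threshold=1.2):
    """During thinking-step decoding, when next-token entropy spikes (treated
    here as a decision-locking point), splice in a safe-reminding phrase
    instead of sampling; otherwise decode greedily. No weights are changed."""
    text = ""
    for _ in range(max_steps):
        dist = next_token_dist(text)        # stub for the model's softmax
        if entropy(dist) > threshold:
            text += SAFE_REMINDER           # redirect the trajectory
            continue
        token = max(dist, key=dist.get)     # greedy decoding otherwise
        text += token
        if token == "<eos>":
            break
    return text

def toy_dist(_ctx):
    """Toy stand-in: mostly confident, occasionally highly uncertain."""
    if random.random() < 0.15:
        return {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}  # entropy = ln 4
    return {" ok": 0.97, "<eos>": 0.03}

print(decode_with_reminders(toy_dist)[:160])
```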

[AI-40] AMIR-GRPO: Inducing Implicit Preference Signals into GRPO

【Quick Read】: This paper addresses three structural limitations of Group Relative Policy Optimization (GRPO) on complex reasoning tasks: sequence-level advantage normalization introduces a systematic length bias, penalties on low-quality trajectories are diluted, and the scalar objective discards the rich pairwise preference information embedded in within-group reward rankings, leaving the supervision produced by costly rollouts underused. The key of the solution is AMIR-GRPO, which builds an implicit DPO-style contrastive regularizer directly from intra-group reward rankings without extra annotations, strengthening the suppression of low-reward trajectories, mitigating response-level length bias, and turning each rollout group into a denser set of supervision constraints, thereby improving performance on mathematical reasoning benchmarks.

Link: https://arxiv.org/abs/2601.03661
Authors: Amir Hossein Yari, Fajri Koto
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
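
A minimal sketch of the idea of augmenting group-relative advantages with an implicit pairwise preference term derived from intra-group rankings. The logistic (DPO-style) pairwise loss over sequence log-probabilities is an illustrative reconstruction, not the paper's exact objective.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard group-relative advantages: z-scored rewards within a group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def implicit_pairwise_regularizer(logps, rewards, beta=0.1):
    """For every intra-group pair where reward_i > reward_j, push the
    policy's sequence log-probability of the better rollout above the
    worse one via a logistic (DPO-style) loss; no extra labels needed."""
    loss, pairs = 0.0, 0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            if rewards[i] > rewards[j]:
                margin = beta * (logps[i] - logps[j])
                loss += -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid
                pairs += 1
    return loss / max(pairs, 1)

rewards = [1.0, 0.0, 0.5, 0.0]          # one group of 4 rollouts
logps   = [-12.0, -9.0, -11.0, -15.0]   # policy sequence log-probabilities
print(grpo_advantages(rewards))
print(round(float(implicit_pairwise_regularizer(logps, rewards)), 4))
```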

[AI-41] Group and Exclusive Sparse Regularization-based Continual Learning of CNNs

【Quick Read】: This paper addresses catastrophic forgetting in continual learning (CL) with fixed-capacity convolutional neural networks (CNNs). The key of the solution is Group and Exclusive Sparsity based Continual Learning (GESCL), whose core mechanism consists of two regularization terms: a stability term that constrains filters important for past tasks from drifting too far when learning a new task, preserving performance on old tasks; and a plasticity term that exploits the over-parameterization of CNNs to efficiently sparsify unimportant filters and tune their weights so they remain relevant for future tasks. Without dynamically expanding the network or storing past data, the method deals with significantly fewer parameters and less computation while improving overall classification accuracy and resistance to forgetting.

Link: https://arxiv.org/abs/2601.03658
Authors: Basile Tousside, Janis Mohr, Jörg Frochte
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, Canadian Artificial Intelligence Association (CAIAC)

Abstract:We present a regularization-based approach for continual learning (CL) of fixed capacity convolutional neural networks (CNN) that does not suffer from the problem of catastrophic forgetting when learning multiple tasks sequentially. This method, referred to as Group and Exclusive Sparsity based Continual Learning (GESCL), avoids forgetting of previous tasks by ensuring the stability of the CNN via a stability regularization term, which prevents filters detected as important for past tasks from deviating too much when learning a new task. On top of that, GESCL makes the network plastic via a plasticity regularization term that leverages the over-parameterization of CNNs to efficiently sparsify the network and tunes unimportant filters, making them relevant for future tasks. Doing so, GESCL deals with significantly fewer parameters and less computation compared to CL approaches that either dynamically expand the network or memorize past tasks' data. Experiments on popular CL vision benchmarks show that GESCL leads to significant improvements over state-of-the-art methods in terms of overall CL performance, as measured by classification accuracy, as well as in terms of avoiding catastrophic forgetting.
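
A numpy sketch of the regularizers named above: group sparsity, exclusive sparsity, and an importance-weighted stability anchor, applied to toy convolution filters. The penalty forms follow the standard group/exclusive-sparsity definitions; the weighting scheme is an assumption, not GESCL's exact formulation.

```python
import numpy as np

def group_sparsity(filters):
    """Group (filter-wise) sparsity: sum of L2 norms per filter; drives whole
    unimportant filters toward zero, freeing them for future tasks."""
    return sum(np.linalg.norm(w) for w in filters)

def exclusive_sparsity(filters):
    """Exclusive sparsity: squared L1 norm per filter; promotes competition
    among weights so active filters stay distinct."""
    return sum(np.sum(np.abs(w)) ** 2 for w in filters)

def stability_penalty(filters, filters_old, importance):
    """Stability term: filters marked important for past tasks are anchored
    to their previous values (importance weighting is illustrative)."""
    return sum(imp * np.linalg.norm(w - w0) ** 2
               for w, w0, imp in zip(filters, filters_old, importance))

rng = np.random.default_rng(0)
W_old = [rng.normal(size=(3, 3)) for _ in range(4)]       # 4 conv filters
W     = [w + 0.1 * rng.normal(size=w.shape) for w in W_old]
importance = [1.0, 0.0, 0.5, 0.0]                         # per-filter importance
reg = (stability_penalty(W, W_old, importance)
       + 1e-3 * group_sparsity(W) + 1e-4 * exclusive_sparsity(W))
print(round(float(reg), 4))
```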

[AI-42] In Search of Grandmother Cells: Tracing Interpretable Neurons in Tabular Representations

【Quick Read】: This paper investigates the interpretability of the decision processes of foundation models, specifically whether neurons akin to "grandmother cells" exist, i.e., neurons that respond saliently and selectively to a single concept. The key of the solution is two information-theoretic measures that quantify a neuron's saliency and selectivity for specific concepts, applied to the representation space of the tabular foundation model TabPFN, with a simple search identifying the most salient and selective neuron-concept pairs. The results provide the first evidence that some neurons exhibit moderate but statistically significant saliency and selectivity for high-level concepts, showing that interpretable neurons can emerge naturally without complex interpretability techniques.

Link: https://arxiv.org/abs/2601.03657
Authors: Ricardo Knauer, Erik Rodner
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: EurIPS 2025 Workshop on AI for Tabular Data

Abstract:Foundation models are powerful yet often opaque in their decision-making. A topic of continued interest in both neuroscience and artificial intelligence is whether some neurons behave like grandmother cells, i.e., neurons that are inherently interpretable because they exclusively respond to single concepts. In this work, we propose two information-theoretic measures that quantify the neuronal saliency and selectivity for single concepts. We apply these metrics to the representations of TabPFN, a tabular foundation model, and perform a simple search across neuron-concept pairs to find the most salient and selective pair. Our analysis provides the first evidence that some neurons in such models show moderate, statistically significant saliency and selectivity for high-level concepts. These findings suggest that interpretable neurons can emerge naturally and that they can, in some cases, be identified without resorting to more complex interpretability techniques.
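
One plausible instantiation of an information-theoretic neuron-concept score is the mutual information between a neuron's (binned) activation and a binary concept label, sketched below. The paper defines its own saliency and selectivity measures; this histogram-MI estimate is only an illustrative proxy.

```python
import numpy as np

def neuron_concept_mi(activations, labels, bins=16):
    """Mutual information (in nats) between a neuron's binned activation and
    a binary concept label, estimated from the joint histogram."""
    edges = np.histogram_bin_edges(activations, bins)
    a = np.digitize(activations, edges)            # bin index per sample
    joint = np.zeros((bins + 2, 2))
    for ai, yi in zip(a, labels):
        joint[ai, yi] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)          # activation marginal
    py = joint.sum(axis=0, keepdims=True)          # label marginal
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ py)[nz])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)                  # binary concept labels
selective_neuron = y + 0.3 * rng.normal(size=2000) # fires for the concept
random_neuron    = rng.normal(size=2000)           # unrelated neuron
print("selective:", round(neuron_concept_mi(selective_neuron, y), 3))
print("random:   ", round(neuron_concept_mi(random_neuron, y), 3))
```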

[AI-43] ReLA: Representation Learning and Aggregation for Job Scheduling with Reinforcement Learning

【Quick Read】: This paper addresses job scheduling in manufacturing systems, i.e., assigning ordered job operations to machines under various constraints to optimize schedule quality and shorten running time, where existing methods often suffer from low efficiency or insufficient schedule quality as problem scale grows. The key of the solution is ReLA, a reinforcement learning scheduler built on structured representation learning and aggregation, with three core components: 1) two intra-entity learning modules (self-attention and a convolutional network) extract diverse feature representations from scheduling entities such as job operations and machines; 2) an inter-entity learning module (cross-attention) captures interactions between entities; 3) the representations are integrated in a multi-scale architecture and aggregated for reinforcement learning decision-making, markedly improving schedule quality and computational efficiency. Experiments show that ReLA reduces the optimality gap of the state-of-the-art baseline by 13.0% on small and medium instances and by 78.6% on large instances, lowering the average optimality gaps to 7.3% and 2.1% respectively and confirming its effectiveness for real-world applications.

Link: https://arxiv.org/abs/2601.03646
Authors: Zhengyi Kwan, Zhang Wei, Aik Beng Ng, Zhengkui Wang, Simon See
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:Job scheduling is widely used in real-world manufacturing systems to assign ordered job operations to machines under various constraints. Existing solutions remain limited by long running time or insufficient schedule quality, especially when problem scale increases. In this paper, we propose ReLA, a reinforcement-learning (RL) scheduler built on structured representation learning and aggregation. ReLA first learns diverse representations from scheduling entities, including job operations and machines, using two intra-entity learning modules with self-attention and convolution and one inter-entity learning module with cross-attention. These modules are applied in a multi-scale architecture, and their outputs are aggregated to support RL decision-making. Across experiments on small, medium, and large job instances, ReLA achieves the best makespan in most tested settings over the latest solutions. On non-large instances, ReLA reduces the optimality gap of the SOTA baseline by 13.0%, while on large-scale instances it reduces the gap by 78.6%, with the average optimality gaps lowered to 7.3% and 2.1%, respectively. These results confirm that ReLA’s learned representations and aggregation provide strong decision support for RL scheduling, and enable fast job completion and decision-making for real-world applications.
zh
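
ReLA 的跨实体学习模块以交叉注意力连接工序与机器两类实体的表征。下面是一个基于 PyTorch 的最小示意(并非论文官方实现):维度、残差融合方式与 `InterEntityAttention` 等命名均为本文假设。

```python
import torch
import torch.nn as nn

class InterEntityAttention(nn.Module):
    """Cross-attention from job-operation embeddings to machine embeddings."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, op_emb, machine_emb):
        # each operation queries all machines for compatibility features
        out, _ = self.attn(query=op_emb, key=machine_emb, value=machine_emb)
        return self.norm(op_emb + out)  # residual fusion

ops = torch.randn(2, 20, 64)       # (batch, n_operations, d_model)
machines = torch.randn(2, 5, 64)   # (batch, n_machines, d_model)
fused = InterEntityAttention()(ops, machines)
print(fused.shape)  # torch.Size([2, 20, 64])
```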

[AI-44] Architecting Agentic Communities using Design Patterns


【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)与代理型人工智能(Agentic AI)技术快速发展背景下,构建复杂、可生产部署的多智能体系统时缺乏系统性架构指导的问题。解决方案的关键在于提出一个分层设计模式框架,将系统组织为三个层级:LLM代理(任务特定自动化)、代理型AI(自适应目标导向体)和代理社区(包含人类与AI协作的组织化框架),并聚焦于“代理社区”这一最适用于企业与工业场景的层级;通过借鉴分布式系统中的协调原则,建立形式化框架以明确AI与人类在受控生态系统中扮演的角色、协议与治理结构,并借助问责机制实现对智能体间通信、协商及意图建模的可验证治理,从而兼顾实践指导性与形式化验证能力。

链接: https://arxiv.org/abs/2601.03624
作者: Zoran Milosevic,Fethi Rabhi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: supplementary material accompanying this paper is also attached … its title is “Complete Agentic AI Design Patterns Catalogue”

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.
zh

[AI-45] Investigation into respiratory sound classification for an imbalanced data set using hybrid LSTM-KAN architectures

【速读】:该论文旨在解决呼吸音分类中因类别分布严重不均衡导致的模型性能下降问题,尤其关注少数类识别能力不足的挑战。其解决方案的关键在于提出一种混合深度学习模型——Hybrid LSTM-KAN,该模型结合长短期记忆网络(Long Short-Term Memory, LSTM)用于序列特征编码,以及柯尔莫哥洛夫-阿诺德网络(Kolmogorov-Arnold Network, KAN)进行分类,并集成特征提取流程与针对性的不平衡缓解策略,如焦点损失(focal loss)、类别特定的数据增强和合成少数类过采样技术(SMOTE),从而显著提升对少数类别的检测性能,在一个包含六个类别且高度偏斜的数据集上实现了94.6%的整体准确率和0.703的宏平均F1分数。

链接: https://arxiv.org/abs/2601.03610
作者: Nithinkumar K.V,Anand R
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Respiratory sounds captured via auscultation contain critical clues for diagnosing pulmonary conditions. Automated classification of these sounds faces challenges due to subtle acoustic differences and severe class imbalance in clinical datasets. This study investigates respiratory sound classification with a focus on mitigating pronounced class imbalance. We propose a hybrid deep learning model that combines a Long Short-Term Memory (LSTM) network for sequential feature encoding with a Kolmogorov-Arnold Network (KAN) for classification. The model is integrated with a comprehensive feature extraction pipeline and targeted imbalance mitigation strategies. Experiments were conducted on a public respiratory sound database comprising six classes with a highly skewed distribution. Techniques such as focal loss, class-specific data augmentation, and Synthetic Minority Over-sampling Technique (SMOTE) were employed to enhance minority class recognition. The proposed Hybrid LSTM-KAN model achieves an overall accuracy of 94.6 percent and a macro-averaged F1 score of 0.703, despite the dominant COPD class accounting for over 86 percent of the data. Improved detection performance is observed for minority classes compared to baseline approaches, demonstrating the effectiveness of the proposed architecture for imbalanced respiratory sound classification.
zh
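
摘要中用于缓解类别不均衡的焦点损失(focal loss)是一个标准组件。下面给出其多分类形式的最小 PyTorch 草图;γ、α 的取值与接口均为示意性假设,与论文的具体超参数无关。

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights easy (well-classified) examples.

    logits: (N, C); targets: (N,) class indices; alpha: optional (C,) class weights.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log prob of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt   # (1-pt)^gamma shrinks easy examples
    if alpha is not None:
        loss = alpha[targets] * loss
    return loss.mean()

logits = torch.randn(8, 6)               # six respiratory-sound classes
targets = torch.randint(0, 6, (8,))
print(focal_loss(logits, targets).item())
```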

[AI-46] Policy-Guided Search on Tree-of-Thoughts for Efficient Problem Solving with Bounded Language Model Queries

【速读】:该论文旨在解决在计算资源受限条件下,如何提升语言模型(Language Model, LM)在问题求解任务中的性能问题。其核心挑战在于,尽管“思维树”(Tree-of-Thoughts, ToT)框架通过引入状态空间搜索算法增强了LM的推理能力,但传统搜索方法往往忽视了LM推理带来的高昂计算开销,导致难以在有限预算下高效应用。解决方案的关键在于利用LM自身对各思维路径(thought)分配的概率作为启发式信息,引导搜索过程,从而减少不必要的思维评估次数。作者基于此提出将Levin Tree Search (LTS)算法适配至ToT框架中,并证明其在剪枝后的树结构上能保证扩展状态数的上界,进而控制生成的思维数量;同时分析了温度参数对这一边界的影响。实验表明,在固定LM查询预算下,LTS在多个领域和不同LM上均表现出与基线相当或更高的准确性,显著提升了问题求解的效率与成本效益。

链接: https://arxiv.org/abs/2601.03606
作者: Sumedh Pendurkar,Guni Sharon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in Transactions on Machine Learning Research (TMLR), 2025. Available at this https URL

点击查看摘要

Abstract:Recent studies explored integrating state-space search algorithms with Language Models (LM) to perform look-ahead on the token generation process, the "Tree-of-Thoughts" (ToT), generated by LMs, thereby improving performance on problem-solving tasks. However, the affiliated search algorithms often overlook the significant computational costs associated with LM inference, particularly in scenarios with constrained computational budgets. Consequently, we address the problem of improving LM performance on problem-solving tasks under limited computational budgets. We demonstrate how the probabilities assigned to thoughts by LMs can serve as a heuristic to guide search within the ToT framework, thereby reducing the number of thought evaluations. Building on this insight, we adapt a heuristic search algorithm, Levin Tree Search (LTS), to the ToT framework, which leverages LMs as policies to guide the tree exploration efficiently. We extend the theoretical results of LTS by showing that, for ToT (a pruned tree), LTS guarantees a bound on the number of states expanded, and consequently, on the number of thoughts generated. Additionally, we analyze the sensitivity of this bound to the temperature values commonly used in the final softmax layer of the LM. Empirical evaluation under a fixed LM query budget demonstrates that LTS consistently achieves comparable or higher accuracy than baseline search algorithms within the ToT framework, across three domains (Blocksworld, PrOntoQA, Array Sorting) and four distinct LMs. These findings highlight the efficacy of LTS on ToT, particularly in enabling cost-effective and time-efficient problem-solving, making it well-suited for latency-critical and resource-constrained applications.
zh
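
LTS 按 Levin 代价 d(n)/π(n) 对节点排序,其中 d(n) 为深度、π(n) 为语言模型赋予该思维路径的概率,这正是摘要中"扩展次数有上界"结论的来源。下面是一个与具体模型解耦的最小 Python 草图:`expand`、`is_goal` 由调用方提供,接口命名为本文假设。

```python
import heapq

def levin_tree_search(root, expand, is_goal, budget=1000):
    """Best-first search ordered by the Levin cost d(n) / pi(n).

    expand(node) -> [(child, prob), ...], where prob is the LM's probability
    of the thought leading to `child`; is_goal(node) -> bool.
    """
    # frontier entries: (levin_cost, tie_breaker, node, depth, path_prob)
    frontier = [(0.0, 0, root, 0, 1.0)]
    counter = 1
    for _ in range(budget):
        if not frontier:
            return None
        _, _, node, depth, path_prob = heapq.heappop(frontier)
        if is_goal(node):
            return node
        for child, prob in expand(node):
            child_prob = path_prob * prob
            if child_prob <= 0.0:
                continue  # prune zero-probability thoughts
            cost = (depth + 1) / child_prob  # Levin cost bounds total expansions
            heapq.heappush(frontier, (cost, counter, child, depth + 1, child_prob))
            counter += 1
    return None
```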

[AI-47] Interleaved Tool-Call Reasoning for Protein Function Understanding

【速读】:该论文旨在解决将基于文本的链式思维(chain-of-thought reasoning)直接应用于蛋白质功能预测时效果不佳的问题,其核心挑战在于现有方法主要依赖表面关键词模式进行强化学习,无法引入新的生物学知识,导致泛化能力受限。解决方案的关键在于提出PFUA(Protein Function Understanding Agent),这是一个工具增强型蛋白质推理代理,通过统一问题分解、工具调用与基于证据的答案生成机制,将领域特定计算工具(如结构预测、序列比对等)嵌入推理流程,从而生成可验证的中间证据,而非依赖无约束的文本推理路径。实验表明,PFUA在四个基准测试中平均性能提升达103%,显著优于纯文本推理模型。

链接: https://arxiv.org/abs/2601.03604
作者: Chuanliu Fan,Zicheng Ma,Huanran Meng,Aijia Zhang,Wenjie Du,Jun Zhang,Yi Qin Gao,Ziqiang Cao,Guohong Fu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have highlighted the effectiveness of chain-of-thought reasoning in symbolic domains such as mathematics and programming. However, our study shows that directly transferring such text-based reasoning paradigms to protein function understanding is ineffective: reinforcement learning mainly amplifies superficial keyword patterns while failing to introduce new biological knowledge, resulting in limited generalization. We argue that protein function prediction is a knowledge-intensive scientific task that fundamentally relies on external biological priors and computational tools rather than purely internal reasoning. To address this gap, we propose PFUA, a tool-augmented protein reasoning agent that unifies problem decomposition, tool invocation, and grounded answer generation. Instead of relying on long unconstrained reasoning traces, PFUA integrates domain-specific tools to produce verifiable intermediate evidence. Experiments on four benchmarks demonstrate that PFUA consistently outperforms text-only reasoning models with an average performance improvement of 103%.
zh

[AI-48] ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在零样本场景下对越狱攻击(jailbreak attacks)检测能力不足的问题。现有方法依赖训练数据中已知的越狱模板,难以应对现实中不断涌现的新攻击模式。其解决方案的关键在于提出一种分层(layer-wise)、模块级(module-wise)和词元级(token-wise)的特征放大框架,通过逐步增强良性与越狱提示之间的内部表征差异,识别出与安全相关的网络层、编码零样本判别信号的特定模块以及关键的安全性词元。基于此,作者设计了ALERT(Amplification-based Jailbreak Detector),引入两个独立但互补的分类器对放大后的表示进行判别,从而实现高效且鲁棒的零样本越狱检测,在多个安全基准上显著优于现有方法。

链接: https://arxiv.org/abs/2601.03600
作者: Xiao Lin,Philip Li,Zhichen Zeng,Tingwei Li,Tianxin Wei,Xuying Ning,Gaotang Li,Yuzhong Chen,Hanghang Tong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building upon these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that introduces two independent yet complementary classifiers on amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and sometimes by up to 40%.
zh
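
摘要未给出放大机制的具体公式,下面仅给出一个高度简化的示意:按(假设已得到的)层级显著性权重放大某一信息性 token 位置的隐藏状态,再在放大后的表征上训练一个轻量分类器。`amplified_features`、`layer_weights` 等名称与形式均为本文假设,玩具数据仅用于演示流程。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def amplified_features(hidden_states, layer_weights, token_idx=-1, scale=4.0):
    """Magnify per-layer hidden states at an informative token position.

    hidden_states: (n_layers, seq_len, d); layer_weights: (n_layers,) assumed
    per-layer safety saliency; returns a single flattened feature vector.
    """
    token_states = hidden_states[:, token_idx, :]              # (n_layers, d)
    amplified = token_states * (1.0 + scale * layer_weights[:, None])
    return amplified.reshape(-1)

# toy data: pretend we collected hidden states for benign vs. jailbreak prompts
rng = np.random.default_rng(0)
n_layers, seq_len, d = 6, 16, 32
layer_w = rng.random(n_layers)
y_all = rng.integers(0, 2, 200)
X = np.stack([amplified_features(rng.normal(loc=y, size=(n_layers, seq_len, d)), layer_w)
              for y in y_all])
clf = LogisticRegression(max_iter=1000).fit(X, y_all)
print("train acc:", clf.score(X, y_all))
```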

[AI-49] Deontic Knowledge Graphs for Privacy Compliance in Multimodal Disaster Data Sharing

【速读】:该论文旨在解决灾难响应中多源异构数据(如表格辅助记录与无人机(UAS)影像)在重叠隐私法规约束下难以实现精准合规共享的问题。传统基于二元访问控制的系统在时间敏感的工作流中表现脆弱,无法灵活处理复杂的义务性规则。其解决方案的关键在于提出一种基于道义知识图谱(deontic knowledge graph)的框架,将灾难管理知识图谱(Disaster Management Knowledge Graph, DKG)与源自物联网法规(IoT-Reg)及联邦紧急事务管理局/国土安全部(FEMA/DHS)驱动的政策知识图谱(Policy Knowledge Graph, PKG)融合,构建一个支持“允许”、“禁止”和“允许-转换”三种决策结果的释放判定函数;其中,“允许-转换”机制通过绑定义务到数据变换操作,并利用溯源关联的派生数据验证转换后的合规性,从而实现细粒度、可验证的隐私保护决策。

链接: https://arxiv.org/abs/2601.03587
作者: Kelvin Uzoma Echenim,Karuna Pande Joshi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Disaster response requires sharing heterogeneous artifacts, from tabular assistance records to UAS imagery, under overlapping privacy mandates. Operational systems often reduce compliance to binary access control, which is brittle in time-critical workflows. We present a novel deontic knowledge graph-based framework that integrates a Disaster Management Knowledge Graph (DKG) with a Policy Knowledge Graph (PKG) derived from IoT-Reg and FEMA/DHS privacy drivers. Our release decision function supports three outcomes: Allow, Block, and Allow-with-Transform. The latter binds obligations to transforms and verifies post-transform compliance via provenance-linked derived artifacts; blocked requests are logged as semantic privacy incidents. Evaluation on a 5.1M-triple DKG with 316K images shows exact-match decision correctness, sub-second per-decision latency, and interactive query performance across both single-graph and federated workloads.
zh
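
下面用几行 Python 勾勒摘要所述的三值释放判定函数(Allow / Block / Allow-with-Transform)。策略的字段结构(`prohibited`、`required_transforms`)是本文为演示而假设的简化模式;实际系统中这些规则由政策知识图谱(PKG)推导,转换后还需经溯源关联的合规复核。

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    outcome: str                      # "Allow", "Block", or "Allow-with-Transform"
    obligations: list = field(default_factory=list)

def release_decision(artifact, policy):
    """Three-outcome release check over (simplified) deontic policy rules."""
    rule = policy.get(artifact["type"], {"prohibited": True, "required_transforms": []})
    if rule["prohibited"]:
        return Decision("Block")      # logged as a semantic privacy incident
    if rule["required_transforms"]:
        # obligations are bound to transforms; compliance is re-verified post-transform
        return Decision("Allow-with-Transform", rule["required_transforms"])
    return Decision("Allow")

policy = {"uas_image": {"prohibited": False, "required_transforms": ["blur_faces"]},
          "assistance_record": {"prohibited": False, "required_transforms": []}}
print(release_decision({"type": "uas_image"}, policy))
```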

[AI-50] A Proposed Paradigm for Imputing Missing Multi-Sensor Data in the Healthcare Domain

【速读】:该论文旨在解决慢性疾病(如糖尿病)管理中因低血糖事件预测困难而导致的临床挑战,特别是由于可穿戴传感器采集的多源时序数据存在信号噪声和频繁缺失值的问题。其解决方案的关键在于提出一种系统性的填补策略,即根据特定特征的时序特性及缺失间隔的持续时间,定制化选择与应用不同的插补方法,从而有效应对数据中异质性的时间模式,提升低血糖事件预测的准确性与可靠性。

链接: https://arxiv.org/abs/2601.03565
作者: Vaibhav Gupta,Florian Grensing,Beyza Cinar,Maria Maleshkova
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 Pages, 6 Figures, 7 Tables

点击查看摘要

Abstract:Chronic diseases such as diabetes pose significant management challenges, particularly due to the risk of complications like hypoglycemia, which require timely detection and intervention. Continuous health monitoring through wearable sensors offers a promising solution for early prediction of glycemic events. However, effective use of multisensor data is hindered by issues such as signal noise and frequent missing values. This study examines the limitations of existing datasets and emphasizes the temporal characteristics of key features relevant to hypoglycemia prediction. A comprehensive analysis of imputation techniques is conducted, focusing on those employed in state-of-the-art studies. Furthermore, imputation methods derived from machine learning and deep learning applications in other healthcare contexts are evaluated for their potential to address longer gaps in time-series data. Based on this analysis, a systematic paradigm is proposed, wherein imputation strategies are tailored to the nature of specific features and the duration of missing intervals. The review concludes by emphasizing the importance of investigating the temporal dynamics of individual features and the implementation of multiple, feature-specific imputation techniques to effectively address heterogeneous temporal patterns inherent in the data.
zh
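
该范式的核心是"按特征时序特性与缺失区间长度分派不同插补方法"。下面给出一个基于 pandas 的最小示意:短缺口用线性插值,长缺口标记出来交给特征专属的模型插补;阈值 `short_gap` 与函数命名均为本文假设。

```python
import numpy as np
import pandas as pd

def impute_by_gap_length(series, short_gap=3, method_short="linear"):
    """Dispatch imputation by gap duration: interpolate short gaps, flag long ones.

    Long gaps are left for a model-based imputer chosen per feature, following
    the paradigm of feature-specific strategies.
    """
    s = series.copy()
    is_na = s.isna()
    gap_id = (is_na != is_na.shift()).cumsum()        # label runs of missing values
    gap_len = is_na.groupby(gap_id).transform("sum")  # length of each missing run
    short_mask = is_na & (gap_len <= short_gap)
    s[short_mask] = s.interpolate(method=method_short)[short_mask]
    long_mask = is_na & (gap_len > short_gap)         # handled by a learned imputer
    return s, long_mask

glucose = pd.Series([5.1, np.nan, np.nan, 5.4, np.nan, np.nan, np.nan, np.nan, 6.0])
imputed, needs_model = impute_by_gap_length(glucose, short_gap=2)
print(imputed.tolist(), needs_model.sum())
```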

[AI-51] SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

【速读】:该论文旨在解决工具增强型智能体(tool-augmented agents)在多步推理过程中难以进行可靠信用分配(credit assignment)的问题,这一挑战限制了其训练稳定性和性能提升。现有基于大语言模型(LLM)的奖励模型因缺乏细粒度、任务特定的评分标准,常产生噪声大且不一致的奖励信号,难以区分高层次规划与低层次执行行为。解决方案的关键在于提出一种名为SCRIBE(Skill-Conditioned Reward with Intermediate Behavioral Evaluation)的强化学习框架,该框架在新颖的中层抽象层面介入奖励建模:通过将每个子目标路由至预定义的技能原型库(skill prototypes),将开放式的LLM评估转化为结构化的验证问题,从而赋予奖励模型精确、受约束的评分规则,显著降低奖励方差。实验表明,SCRIBE在多个推理和工具使用基准上达到最先进性能,例如将Qwen3-4B模型在AIME25上的准确率从43.3%提升至63.3%,并促进了复杂多轮工具交互的成功率提升。

链接: https://arxiv.org/abs/2601.03555
作者: Yuxuan Jiang,Francis Ferraro
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training reliable tool-augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi-step reasoning. While process-level reward models offer a promising direction, existing LLM-based judges often produce noisy and inconsistent signals because they lack fine-grained, task-specific rubrics to distinguish high-level planning from low-level execution. In this work, we introduce SCRIBE (Skill-Conditioned Reward with Intermediate Behavioral Evaluation), a reinforcement learning framework that intervenes at a novel mid-level abstraction. SCRIBE grounds reward modeling in a curated library of skill prototypes, transforming open-ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model is equipped with precise, structured rubrics that substantially reduce reward variance. Experimental results show that SCRIBE achieves state-of-the-art performance across a range of reasoning and tool-use benchmarks. In particular, it improves the AIME25 accuracy of a Qwen3-4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi-turn tool interactions. Further analysis of training dynamics reveals a co-evolution across abstraction levels, where mastery of mid-level skills consistently precedes the emergence of effective high-level planning behaviors. Finally, we demonstrate that SCRIBE is additive to low-level tool optimizations, providing a scalable and complementary pathway toward more autonomous and reliable tool-using agents.
zh

[AI-52] ReEfBench: Quantifying the Reasoning Efficiency of LLMs

【速读】:该论文旨在解决当前链式思维(Chain-of-Thought, CoT)评估方法在测试时缩放(test-time scaling)背景下无法区分大语言模型(Large Language Models, LLMs)性能提升是源于真实推理能力增强还是单纯文本冗余的问题。其解决方案的关键在于提出一种新颖的神经符号(neuro-symbolic)框架,实现对推理过程的非侵入式、全面的过程中心型评估(process-centric evaluation),从而识别出四种不同的行为原型并诊断失败模式,进而揭示模型规模、推理模式与训练策略对推理质量的真实影响。

链接: https://arxiv.org/abs/2601.03550
作者: Zhizhang Fu,Yuancheng Gu,Chenkai Hu,Hanmeng Liu,Yue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-time scaling has enabled Large Language Models (LLMs) to tackle complex reasoning, yet the limitations of current Chain-of-Thought (CoT) evaluation obscure whether performance gains stem from genuine reasoning or mere verbosity. To address this, (1) we propose a novel neuro-symbolic framework for the non-intrusive, comprehensive process-centric evaluation of reasoning. (2) Through this lens, we identify four distinct behavioral prototypes and diagnose the failure modes. (3) We examine the impact of inference mode, training strategy, and model scale. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning. Furthermore, we reveal critical constraints: mixing long and short CoT data in training risks premature saturation and collapse, while distillation into smaller models captures behavioral length but fails to replicate logical efficacy due to intrinsic capacity limits.
zh

[AI-53] VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在代码生成任务中奖励设计的难题,尤其是传统基于通过/失败结果的稀疏奖励机制难以有效提升模型性能的问题。现有方法虽尝试引入外部奖励模型(Reward Model, RM)以生成更丰富的连续奖励,但面临奖励错位(reward misalignment)和高昂计算成本的挑战。论文提出的解决方案是VeRPO(Verifiable Dense Reward Policy Optimization),其核心创新在于构建完全基于可验证执行反馈的稠密奖励信号:通过动态估计每个单元测试的难度权重(基于训练期间的执行统计信息),将已通过测试的权重之和作为稠密奖励;同时融合全局执行结果与局部部分成功信号,确保局部进展与最终功能正确性的一致性,从而建立一种无需外部奖励模型、仅依赖可验证执行反馈的奖励范式。

链接: https://arxiv.org/abs/2601.03525
作者: Longwen Wang,Xuan’er Wu,Xiaohui Hu,Yirui Liu,Yuankai Fan,Kaidong Yu,Qizhen Weng,Wei Xi,Xuelong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics during training, a dense reward is derived from the sum of weights of the passed unit tests. To solidify the consistency between partial success and end-to-end functional correctness, VeRPO further integrates the dense signal with global execution outcomes, establishing a robust and dense reward paradigm relying solely on verifiable execution feedback. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (<0.02%) and zero GPU memory overhead.
zh
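
下面用纯 Python 勾勒 VeRPO 稠密奖励的计算思路:先由训练期执行统计估计每个单元测试的难度权重,再以"已通过测试的权重之和"作为稠密信号,并与全局执行结果融合。其中权重的具体形式(1 − 通过率后归一化)与混合系数 `lam` 为本文的示意性假设,论文的精确公式可能不同。

```python
import numpy as np

def test_difficulty_weights(pass_history):
    """Estimate per-test difficulty from execution statistics collected in training.

    pass_history: (n_rollouts, n_tests) binary matrix; rarely-passed tests get
    higher weight (one plausible weighting, not necessarily the paper's).
    """
    pass_rate = pass_history.mean(axis=0)
    weights = 1.0 - pass_rate + 1e-3          # harder tests weigh more
    return weights / weights.sum()            # normalize so weights sum to 1

def verpo_reward(passed, weights, lam=0.5):
    """Dense reward = weighted partial success, blended with the global outcome."""
    dense = float(weights[passed].sum())       # sum of weights of passed unit tests
    outcome = float(passed.all())              # end-to-end functional correctness
    return (1.0 - lam) * dense + lam * outcome

history = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]])
w = test_difficulty_weights(history)
print(verpo_reward(np.array([True, True, False]), w))
```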

[AI-54] Variance Computation for Weighted Model Counting with Knowledge Compilation Approach AAAI2026

【速读】:该论文旨在解决加权模型计数(Weighted Model Counting, WMC)的方差计算问题,尤其是在参数具有不确定性时如何量化推理结果的不确定性。在实际概率推理任务中,模型参数通常从数据中学习而来,存在不确定性,因此需要评估推理结果的方差以衡量其可靠性。论文的关键解决方案是:首先提出一种多项式时间算法来计算结构化d-DNNF(structured d-DNNF)输入下的WMC方差;其次通过理论证明表明,对于结构化DNNF、d-DNNF和FBDD等复杂度较低的逻辑表示形式,该问题仍为计算难问题,揭示了WMC方差计算与WMC本身在复杂性上的显著差异;最后将该方法应用于贝叶斯网络的不确定性分析,实验证明其能有效评估真实世界贝叶斯网络中边际概率的方差,并分析参数方差对推理结果方差的影响。

链接: https://arxiv.org/abs/2601.03523
作者: Kengo Nakamura,Masaaki Nishino,Norihito Yasuda
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: 25 pages; accepted for AAAI 2026 main track

点击查看摘要

Abstract:One of the most important queries in knowledge compilation is weighted model counting (WMC), which has been applied to probabilistic inference on various models, such as Bayesian networks. In practical situations on inference tasks, the model’s parameters have uncertainty because they are often learned from data, and thus we want to compute the degree of uncertainty in the inference outcome. One possible approach is to regard the inference outcome as a random variable by introducing distributions for the parameters and evaluate the variance of the outcome. Unfortunately, the tractability of computing such a variance is hardly known. Motivated by this, we consider the problem of computing the variance of WMC and investigate this problem’s tractability. First, we derive a polynomial time algorithm to evaluate the WMC variance when the input is given as a structured d-DNNF. Second, we prove the hardness of this problem for structured DNNFs, d-DNNFs, and FBDDs, which is intriguing because the latter two allow polynomial time WMC algorithms. Finally, we show an application that measures the uncertainty in the inference of Bayesian networks. We empirically show that our algorithm can evaluate the variance of the marginal probability on real-world Bayesian networks and analyze the impact of the variances of parameters on the variance of the marginal.
zh

[AI-55] A Reinforcement Learning-Based Model for Mapping and Goal-Directed Navigation Using Multiscale Place Fields

【速读】:该论文旨在解决复杂且部分可观测环境中的自主导航问题,这是机器人学中的一个核心挑战。其解决方案的关键在于提出了一种新的鲁棒模型,该模型采用多尺度并行的场所场(place fields)层、基于回放的奖励机制以及动态尺度融合策略,从而提升了路径效率并加速了学习过程,突显了多尺度空间表征在适应性机器人导航中的价值。

链接: https://arxiv.org/abs/2601.03520
作者: Bekarys Dukenbaev,Andrew Gerstenslager,Alexander Johnson,Ali A. Minai
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 11 pages, 8 figures. Submitted to IEEE Transactions on Cognitive and Developmental Systems

点击查看摘要

Abstract:Autonomous navigation in complex and partially observable environments remains a central challenge in robotics. Several bio-inspired models of mapping and navigation based on place cells in the mammalian hippocampus have been proposed. This paper introduces a new robust model that employs parallel layers of place fields at multiple spatial scales, a replay-based reward mechanism, and dynamic scale fusion. Simulations show that the model improves path efficiency and accelerates learning compared to single-scale baselines, highlighting the value of multiscale spatial representations for adaptive robot navigation.
zh
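
多尺度场所场的一个常见建模是对同一组场所细胞中心并行使用不同宽度的高斯调谐曲线。下面是一个 numpy 草图,每个尺度各产生一层激活;细胞中心布局与尺度取值均为示意。

```python
import numpy as np

def place_field_activations(pos, centers, scales):
    """Gaussian place-field activations at multiple spatial scales.

    pos: (2,) agent position; centers: (n_cells, 2); scales: list of field widths.
    Returns one activation vector per scale (a parallel layer per scale).
    """
    d2 = ((centers - pos) ** 2).sum(axis=1)
    return [np.exp(-d2 / (2.0 * s ** 2)) for s in scales]

# toy gridworld: 25 place cells on a 5x5 lattice, three field widths
xs, ys = np.meshgrid(np.arange(5), np.arange(5))
centers = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
layers = place_field_activations(np.array([2.3, 1.7]), centers, scales=[0.5, 1.0, 2.0])
print([f"{layer.max():.2f}" for layer in layers])
```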

[AI-56] Deploy-Master: Automating the Deployment of 50,000 Agent-Ready Scientific Tools in One Day

【速读】:该论文旨在解决科学软件(scientific software)在实际应用中普遍存在的部署瓶颈问题,即大多数开源科学工具难以编译、配置和复用,导致科学研究仍停留在小规模手工操作模式,限制了可复现性、大规模评估以及与AI for Science(AI4S)和代理式(agentic)工作流的集成。其核心解决方案是提出Deploy-Master——一个端到端的代理式工作流,关键在于通过基于90余个科学与工程领域的分类体系,从超过50万份公共代码库中筛选出52,550个可执行工具候选,并利用执行驱动的验证机制(而非文档描述)将异构仓库转化为容器化的可运行能力;该流程在一天内完成52,550次构建尝试,成功为50,112个工具建立可复现的运行环境,同时记录部署轨迹以揭示吞吐量、成本分布、失败面及规范不确定性等规模化特征,从而推动形成共享、可观测的执行底座,支撑可扩展的AI4S与代理式科研范式。

链接: https://arxiv.org/abs/2601.03513
作者: Yi Wang,Zhenting Huang,Zhaohan Ding,Ruoxue Liao,Yuan Huang,Xinzijian Liu,Jiajun Xie,Siheng Chen,Linfeng Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open-source scientific software is abundant, yet most tools remain difficult to compile, configure, and reuse, sustaining a small-workshop mode of scientific computing. This deployment bottleneck limits reproducibility, large-scale evaluation, and the practical integration of scientific tools into modern AI-for-Science (AI4S) and agentic workflows. We present Deploy-Master, a one-stop agentic workflow for large-scale tool discovery, build specification inference, execution-based validation, and publication. Guided by a taxonomy spanning 90+ scientific and engineering domains, our discovery stage starts from a recall-oriented pool of over 500,000 public repositories and progressively filters it to 52,550 executable tool candidates under license- and quality-aware criteria. Deploy-Master transforms heterogeneous open-source repositories into runnable, containerized capabilities grounded in execution rather than documentation claims. In a single day, we performed 52,550 build attempts and constructed reproducible runtime environments for 50,112 scientific tools. Each successful tool is validated by a minimal executable command and registered in SciencePedia for search and reuse, enabling direct human use and optional agent-based invocation. Beyond delivering runnable tools, we report a deployment trace at the scale of 50,000 tools, characterizing throughput, cost profiles, failure surfaces, and specification uncertainty that become visible only at scale. These results explain why scientific software remains difficult to operationalize and motivate shared, observable execution substrates as a foundation for scalable AI4S and agentic science.
zh

[AI-57] Bootstrapping Code Translation with Weighted Multilanguage Exploration

【速读】:该论文旨在解决多编程语言间代码翻译(code translation)中的两大挑战:一是缺乏与可执行测试断言(executable test oracles)配对的平行数据,二是处理多样语言对时存在的优化不平衡问题。解决方案的关键在于提出一种名为BootTrans的自举(bootstrapping)方法,其核心思想是利用测试套件的功能不变性(functional invariance)和跨语言可移植性(cross-lingual portability),将丰富的枢轴语言(pivot-language)单元测试改造为通用验证断言,用于多语言强化学习(RL)训练;同时引入双池架构(dual-pool architecture)以执行引导的经验收集方式逐步扩展训练数据,并设计语言感知权重机制(language-aware weighting mechanism)动态优先处理性能较差的语言方向,从而缓解优化不平衡问题。

链接: https://arxiv.org/abs/2601.03512
作者: Yuhan Wu,Huan Zhang,Wei Cheng,Chen Shen,Jingyue Yang,Wei Hu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code translation across multiple programming languages is essential yet challenging due to two vital obstacles: scarcity of parallel data paired with executable test oracles, and optimization imbalance when handling diverse language pairs. We propose BootTrans, a bootstrapping method that resolves both obstacles. Its key idea is to leverage the functional invariance and cross-lingual portability of test suites, adapting abundant pivot-language unit tests to serve as universal verification oracles for multilingual RL training. Our method introduces a dual-pool architecture with seed and exploration pools to progressively expand training data via execution-guided experience collection. Furthermore, we design a language-aware weighting mechanism that dynamically prioritizes harder translation directions based on relative performance across sibling languages, mitigating optimization imbalance. Extensive experiments on the HumanEval-X and TransCoder-Test benchmarks demonstrate substantial improvements over baseline LLMs across all translation directions, with ablations validating the effectiveness of both bootstrapping and weighting components.
zh
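
语言感知权重机制的一个可行实现是:按各翻译方向相对其兄弟方向的性能差距做 softmax,越弱的方向获得越大的采样/优化权重。下面的 `language_aware_weights` 为本文假设的示意形式,未必与论文公式一致。

```python
import numpy as np

def language_aware_weights(success_rates, temperature=1.0):
    """Up-weight harder translation directions relative to sibling languages.

    success_rates: dict mapping direction -> recent pass rate in [0, 1].
    One plausible instantiation: softmax over the performance deficit.
    """
    names = list(success_rates)
    deficit = np.array([1.0 - success_rates[n] for n in names])
    w = np.exp(deficit / temperature)
    w /= w.sum()
    return dict(zip(names, w))

rates = {"py->java": 0.85, "py->cpp": 0.60, "py->go": 0.40}
for direction, weight in language_aware_weights(rates).items():
    print(f"{direction}: sampling weight {weight:.2f}")
```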

[AI-58] Evolving Programmatic Skill Networks

【速读】:该论文旨在解决开放环境中持续技能获取(continual skill acquisition)的问题,即智能体需在无边界、动态变化的具身环境(embodied environments)中不断构建、优化并复用可执行技能库。其核心挑战在于如何实现技能的高效演化与稳定积累,同时保持对新任务的适应能力。解决方案的关键是提出程序化技能网络(Programmatic Skill Network, PSN),通过大型语言模型(LLM)实现三个机制:(1) REFLECT——基于结构化的故障定位机制识别技能组合中的错误;(2) 成熟度感知更新门控(maturity-aware update gating)——稳定可靠技能、保留不确定技能的可塑性;(3) 回滚验证下的规范结构重构(canonical structural refactoring)——维持网络紧凑性。这些机制共同推动技能网络的持续演化,并展现出与神经网络训练相似的学习动力学特性。

链接: https://arxiv.org/abs/2601.03509
作者: Haochen Shi,Xingdi Yuan,Bang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1) REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN's learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions. (We plan to open-source the code.)
zh

[AI-59] Cyberattack Detection in Virtualized Microgrids Using LightGBM and Knowledge-Distilled Classifiers

【速读】:该论文旨在解决现代微电网(Microgrid)因依赖分布式传感与通信接口而面临的网络物理攻击(Cyber-Physical Disturbances)威胁,尤其是针对二次控制层的恶意干扰行为,以保障微电网运行连续性和设备安全性。其解决方案的关键在于构建一个基于MATLAB/Simulink的完整虚拟微电网平台,通过MGLib实现结构化攻击注入,生成包含多种攻击类型(如斜坡、正弦、加性、协同隐蔽和拒绝服务)的标注数据集,并利用轻量级机器学习模型LightGBM进行入侵检测:一方面实现攻击存在性二分类判断(准确率94.8%),另一方面完成多类攻击识别(准确率99.72%),同时引入知识蒸馏压缩模型规模以适配边缘计算部署,实测单次处理延迟为54–67 ms/1000样本,验证了该方法在CPU资源受限场景下的可行性与高效性。

链接: https://arxiv.org/abs/2601.03495
作者: Osasumwen Cedric Ogiesoba-Eguakun,Suman Rath
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Modern microgrids depend on distributed sensing and communication interfaces, making them increasingly vulnerable to cyber physical disturbances that threaten operational continuity and equipment safety. In this work, a complete virtual microgrid was designed and implemented in MATLAB/Simulink, integrating heterogeneous renewable sources and secondary controller layers. A structured cyberattack framework was developed using MGLib to inject adversarial signals directly into the secondary control pathways. Multiple attack classes were emulated, including ramp, sinusoidal, additive, coordinated stealth, and denial of service behaviors. The virtual environment was used to generate labeled datasets under both normal and attack conditions. The datasets trained Light Gradient Boosting Machine (LightGBM) models to perform two functions: detecting the presence of an intrusion (binary) and distinguishing among attack types (multiclass). The multiclass model attained 99.72% accuracy and a 99.62% F1 score, while the binary model attained 94.8% accuracy and a 94.3% F1 score. A knowledge-distillation step reduced the size of the multiclass model, allowing faster predictions with only a small drop in performance. Real-time tests showed a processing delay of about 54 to 67 ms per 1000 samples, demonstrating suitability for CPU-based edge deployment in microgrid controllers. The results confirm that lightweight machine learning based intrusion detection methods can provide fast, accurate, and efficient cyberattack detection without relying on complex deep learning models. Key contributions include: (1) development of a complete MATLAB-based virtual microgrid, (2) structured attack injection at the control layer, (3) creation of multiclass labeled datasets, and (4) design of low-cost AI models suitable for practical microgrid cybersecurity.
zh
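
下面给出"SMOTE 过采样 + LightGBM 多分类"这一训练流水线的可运行草图,用合成数据代替论文的微电网数据集;类别比例与超参数均为示意(需安装 lightgbm 与 imbalanced-learn)。

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

# synthetic stand-in for the labeled microgrid traces (6 imbalanced classes)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=6, weights=[0.7, 0.1, 0.08, 0.06, 0.04, 0.02],
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# oversample minority attack classes before fitting the gradient-boosted trees
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=0)
clf.fit(X_bal, y_bal)
print("macro F1:", round(f1_score(y_te, clf.predict(X_te), average="macro"), 3))
```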

[AI-60] Personalization of Large Foundation Models for Health Interventions AAAI2026

【速读】:该论文旨在解决生成式 AI(Generative AI)在个性化医疗中面临的根本性挑战,即如何在保证个体治疗推荐的准确性(personalization)与模型的外部有效性(external validity)之间取得平衡。核心问题包括:模型在特定临床研究中表现优异但跨研究泛化能力差的“一般性悖论”(generalizability paradox),以及模型预测能力与因果推理能力之间的鸿沟。解决方案的关键在于构建一个混合框架,将大型基础模型(Large Foundation Models, LFMs)与N-of-1试验(即个体层面的交叉自我实验)相结合:LFMs利用多模态人群数据快速生成带不确定性估计的干预候选方案,随后由N-of-1试验对具体个体进行因果验证。这种互补机制既能发挥LFMs在大规模模式识别中的高效性,又能通过局部因果证据保障个性化推荐的可靠性,从而缓解隐私-性能、规模-特异性及自动化-共情等多重悖论,推动负责任的个性化医疗AI发展。

链接: https://arxiv.org/abs/2601.03482
作者: Stefan Konigorski,Johannes E. Vedder,Babajide Alamu Owoyele,İbrahim Özkan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注: Accepted to the AAAI 2026 Workshop on Personalization in the Era of Large Foundation Models (PerFM)

点击查看摘要

Abstract:Large foundation models (LFMs) transform healthcare AI in prevention, diagnostics, and treatment. However, whether LFMs can provide truly personalized treatment recommendations remains an open question. Recent research has revealed multiple challenges for personalization, including the fundamental generalizability paradox: models achieving high accuracy in one clinical study perform at chance level in others, demonstrating that personalization and external validity exist in tension. This exemplifies broader contradictions in AI-driven healthcare: the privacy-performance paradox, scale-specificity paradox, and the automation-empathy paradox. As another challenge, the degree of causal understanding required for personalized recommendations, as opposed to mere predictive capacities of LFMs, remains an open question. N-of-1 trials – crossover self-experiments and the gold standard for individual causal inference in personalized medicine – resolve these tensions by providing within-person causal evidence while preserving privacy through local experimentation. Despite their impressive capabilities, this paper argues that LFMs cannot replace N-of-1 trials. We argue that LFMs and N-of-1 trials are complementary: LFMs excel at rapid hypothesis generation from population patterns using multimodal data, while N-of-1 trials excel at causal validation for a given individual. We propose a hybrid framework that combines the strengths of both to enable personalization and navigate the identified paradoxes: LFMs generate ranked intervention candidates with uncertainty estimates, which trigger subsequent N-of-1 trials. Clarifying the boundary between prediction and causation and explicitly addressing the paradoxical tensions are essential for responsible AI integration in personalized medicine.
zh

[AI-61] Efficient Sequential Recommendation for Long Term User Interest Via Personalization ICDM2025

【速读】:该论文旨在解决基于Transformer的序列推荐模型在实际应用中因计算复杂度呈非线性(二次增长)而效率低下的问题。其解决方案的关键在于引入一种新颖的个性化压缩机制,将用户长期交互历史压缩为可学习的令牌(learnable tokens),并将其与近期交互信息融合以生成推荐,从而显著降低计算开销的同时保持高推荐精度。该方法可无缝集成至现有基于Transformer的推荐模型(如HSTU和HLLM),并通过多组实验验证了其通用性和有效性。

链接: https://arxiv.org/abs/2601.03479
作者: Qiang Zhang,Hanchao Yu,Ivan Ji,Chen Yuan,Yi Zhang,Chihuang Liu,Xiaolong Wang,Christopher E. Lambert,Ren Chen,Chen Kovacs,Xinzhu Bei,Renqin Cai,Rui Li,Lizhu Zhang,Xiangjun Fan,Qunshu Zhang,Benyu Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: ICDM 2025

点击查看摘要

Abstract:Recent years have witnessed the success of sequential modeling, generative recommenders, and large language models for recommendation. Though the scaling law has been validated for sequential models, computational capacity becomes a bottleneck in real-world applications like recommendation, due to the non-linear (quadratic) scaling of the transformer model. To improve the efficiency of the sequential model, we introduce a novel approach to sequential recommendation that leverages personalization techniques to enhance efficiency and performance. Our method compresses long user interaction histories into learnable tokens, which are then combined with recent interactions to generate recommendations. This approach significantly reduces computational costs while maintaining high recommendation accuracy. Our method can be applied to existing transformer-based recommendation models, e.g., HSTU and HLLM. Extensive experiments on multiple sequential models demonstrate its versatility and effectiveness. Source code is available at this https URL.
zh
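
"将长期历史压缩为可学习令牌"的一种常见做法是:让一组可学习查询向量对完整历史做一次交叉注意力,下游序列模型只处理 [压缩令牌; 近期交互]。下面的 PyTorch 草图即此思路的最小示意;`HistoryCompressor` 及各维度均为本文假设,并非 HSTU/HLLM 的官方接口。

```python
import torch
import torch.nn as nn

class HistoryCompressor(nn.Module):
    """Compress a long interaction history into a few learnable tokens.

    The learnable tokens cross-attend to the full history once; downstream
    layers then operate on [compressed tokens; recent items] only, so cost no
    longer grows quadratically with the full history length.
    """
    def __init__(self, d_model=64, n_tokens=8, n_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, history, recent):
        q = self.tokens.unsqueeze(0).expand(history.size(0), -1, -1)
        summary, _ = self.attn(q, history, history)     # (B, n_tokens, d)
        return torch.cat([summary, recent], dim=1)      # feed to the sequential model

history = torch.randn(4, 2000, 64)   # long-term interactions
recent = torch.randn(4, 50, 64)      # recent interactions
print(HistoryCompressor()(history, recent).shape)  # torch.Size([4, 58, 64])
```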

[AI-62] Online Decision-Making Under Uncertainty for Vehicle-to-Building Systems

【速读】:该论文旨在解决车辆到建筑(Vehicle-to-Building, V2B)系统中的优化问题,即在动态电价和需量电费政策下,通过协调电动汽车(EV)的充放电行为来最小化建筑总用电成本,同时满足用户电池电量要求。该问题的关键挑战在于:电价波动性(包含能量费用和需量费用)、长规划周期(通常超过30天)、充电设施异构性(不同充电速率、可控性和方向性),以及用户特定的离场电池状态约束。为应对这些挑战,作者将V2B优化建模为马尔可夫决策过程(Markov Decision Process, MDP),并通过在线搜索缓解状态空间过大的问题,并利用领域特定启发式方法剪枝不可行动作,从而有效提升求解效率与性能。

链接: https://arxiv.org/abs/2601.03476
作者: Rishav Sen,Yunuo Zhang,Fangqi Liu,Jose Paolo Talusan,Ava Pettet,Yoshinori Suzue,Ayan Mukhopadhyay,Abhishek Dubey
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 17 pages, 2 figures, 10 tables. Published in the Proceedings of the 16th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS '25), May 06–09, 2025, Irvine, CA, USA

点击查看摘要

Abstract:Vehicle-to-building (V2B) systems integrate physical infrastructures, such as smart buildings and electric vehicles (EVs) connected to chargers at the building, with digital control mechanisms to manage energy use. By utilizing EVs as flexible energy reservoirs, buildings can dynamically charge and discharge them to optimize energy use and cut costs under time-variable pricing and demand charge policies. This setup leads to the V2B optimization problem, where buildings coordinate EV charging and discharging to minimize total electricity costs while meeting users' charging requirements. However, the V2B optimization problem is challenging because of: (1) fluctuating electricity pricing, which includes both energy charges ($/kWh) and demand charges ($/kW); (2) long planning horizons (typically over 30 days); (3) heterogeneous chargers with varying charging rates, controllability, and directionality (i.e., unidirectional or bidirectional); and (4) user-specific battery levels at departure to ensure user requirements are met. In contrast to existing approaches that often model this setting as a single-shot combinatorial optimization problem, we highlight critical limitations in prior work and instead model the V2B optimization problem as a Markov decision process (MDP), i.e., a stochastic control process. Solving the resulting MDP is challenging due to the large state and action spaces. To address the challenges of the large state space, we leverage online search, and we counter the action space by using domain-specific heuristics to prune unpromising actions. We validate our approach in collaboration with Nissan Advanced Technology Center - Silicon Valley. Using data from their EV testbed, we show that the proposed framework significantly outperforms state-of-the-art methods.
zh
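
摘要中的电费结构包含电量费($/kWh)与按计费周期峰值功率计的需量费($/kW),这正是 V2B 削峰的价值所在。下面用 numpy 给出该账单成本的示意计算;分时电价与需量费率均为虚构数值。

```python
import numpy as np

def billing_cost(load_kw, energy_price, demand_price, interval_h=0.25):
    """Total cost over a billing period = energy charge + demand charge.

    load_kw: building net load per interval (kW), after EV charge/discharge;
    energy_price: $/kWh per interval; demand_price: $/kW on the period peak.
    """
    energy_charge = float(np.sum(load_kw * energy_price * interval_h))
    demand_charge = demand_price * float(np.max(load_kw))  # peak drives demand charge
    return energy_charge + demand_charge

rng = np.random.default_rng(0)
load = 100 + 30 * rng.random(96)                     # one day at 15-minute resolution
tou = np.where(np.arange(96) < 68, 0.12, 0.30)       # cheap off-peak, pricey peak
print(f"${billing_cost(load, tou, demand_price=18.0):,.2f}")
```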

[AI-63] CPGPrompt: Translating Clinical Guidelines into LLM -Executable Decision Support

【速读】:该论文旨在解决临床实践指南(Clinical Practice Guidelines, CPGs)与人工智能(Artificial Intelligence, AI)系统集成困难的问题,特别是传统基于规则的方法在可解释性、指南遵循一致性以及领域适用性方面存在显著局限。其解决方案的关键在于提出并验证了CPGPrompt——一种自动提示(auto-prompting)系统,通过将非结构化的临床指南转化为结构化的决策树,并利用大语言模型(Large Language Models, LLMs)动态导航这些决策树以评估患者病例,从而实现对复杂临床决策流程的自动化建模与推理。

链接: https://arxiv.org/abs/2601.03475
作者: Ruiqi Deng,Geoffrey Martin,Tony Wang,Gongbo Zhang,Yi Liu,Chunhua Weng,Yanshan Wang,Justin F Rousseau,Yifan Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical practice guidelines (CPGs) provide evidence-based recommendations for patient care; however, integrating them into Artificial Intelligence (AI) remains challenging. Previous approaches, such as rule-based systems, face significant limitations, including poor interpretability, inconsistent adherence to guidelines, and narrow domain applicability. To address this, we develop and validate CPGPrompt, an auto-prompting system that converts narrative clinical guidelines into decision support executable by large language models (LLMs). Our framework translates CPGs into structured decision trees and utilizes an LLM to dynamically navigate them for patient case evaluation. Synthetic vignettes were generated across three domains (headache, lower back pain, and prostate cancer) and distributed into four categories to test different decision scenarios. System performance was assessed on both binary specialty-referral decisions and fine-grained pathway-classification tasks. The binary specialty referral classification achieved consistently strong performance across all domains (F1: 0.85-1.00), with high recall (1.00 ± 0.00). In contrast, multi-class pathway assignment showed reduced performance, with domain-specific variations: headache (F1: 0.47), lower back pain (F1: 0.72), and prostate cancer (F1: 0.77). Domain-specific performance differences reflected the structure of each guideline. The headache guideline highlighted challenges with negation handling. The lower back pain guideline required temporal reasoning. In contrast, prostate cancer pathways benefited from quantifiable laboratory tests, resulting in more reliable decision-making.
zh

[AI-64] Toward Maturity-Based Certification of Embodied AI: Quantifying Trustworthiness Through Measurement Mechanisms

【速读】:该论文旨在解决如何对具身人工智能(Embodied AI)系统进行可认证的评估问题,核心挑战在于缺乏结构化的评估框架、定量评分机制以及处理多目标权衡的方法。解决方案的关键在于提出一种基于成熟度的认证框架,通过明确的测量机制(如不确定性量化)实现对系统可信度的量化评估,并以无人航空系统(Uncrewed Aircraft System, UAS)目标检测为例验证其可行性。

链接: https://arxiv.org/abs/2601.03470
作者: Michael C. Darling,Alan H. Hesu,Michael A. Mardikes,Brian C. McGuigan,Reed M. Milewicz
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, Accepted to AAAI-26 Bridge Program B10: Making Embodied AI Reliable with Testing and Formal Verification

点击查看摘要

Abstract:We propose a maturity-based framework for certifying embodied AI systems through explicit measurement mechanisms. We argue that certifiable embodied AI requires structured assessment frameworks, quantitative scoring mechanisms, and methods for navigating multi-objective trade-offs inherent in trustworthiness evaluation. We demonstrate this approach using uncertainty quantification as an exemplar measurement mechanism and illustrate feasibility through an Uncrewed Aircraft System (UAS) detection case study.
zh

[AI-65] An Expectation-Maximization Algorithm for Domain Adaptation in Gaussian Causal Models ICDM

【速读】:该论文旨在解决在部署域中目标变量系统性缺失的情况下,如何利用来自完整观测源域的高斯因果有向无环图(DAG)进行有效变量填补的问题。解决方案的关键在于提出了一种统一的基于期望最大化(EM)的框架,通过DAG结构联合源域与目标域数据,实现从可观测变量到缺失目标变量的信息迁移;其核心创新是定义了在DAG参数空间中的群体EM算子,并引入一阶(梯度)EM更新机制,用单次投影梯度步骤替代传统EM中计算成本高的广义最小二乘M步,从而在标准局部强凹性和光滑性假设及BWY风格的缺失信息有界条件下,保证算法在真实目标参数邻域内局部收缩,实现几何收敛和有限样本下的参数误差与目标填补误差保证。此外,算法利用已知因果DAG冻结源不变机制,仅重新估计受域偏移直接影响的条件分布,显著提升了高维模型的可扩展性。

链接: https://arxiv.org/abs/2601.03459
作者: Mohammad Ali Javidian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: An earlier version of this work was accepted for the Proceedings of the 2025 IEEE International Conference on Data Mining (ICDM)

点击查看摘要

Abstract:We study the problem of imputing a designated target variable that is systematically missing in a shifted deployment domain, when a Gaussian causal DAG is available from a fully observed source domain. We propose a unified EM-based framework that combines source and target data through the DAG structure to transfer information from observed variables to the missing target. On the methodological side, we formulate a population EM operator in the DAG parameter space and introduce a first-order (gradient) EM update that replaces the costly generalized least-squares M-step with a single projected gradient step. Under standard local strong-concavity and smoothness assumptions and a BWY-style (Balakrishnan et al., 2017) gradient-stability (bounded missing-information) condition, we show that this first-order EM operator is locally contractive around the true target parameters, yielding geometric convergence and finite-sample guarantees on parameter error and the induced target-imputation error in Gaussian SEMs under covariate shift and local mechanism shifts. Algorithmically, we exploit the known causal DAG to freeze source-invariant mechanisms and re-estimate only those conditional distributions directly affected by the shift, making the procedure scalable to higher-dimensional models. In experiments on a synthetic seven-node SEM, the 64-node MAGIC-IRRI genetic network, and the Sachs protein-signaling data, the proposed DAG-aware first-order EM algorithm improves target imputation accuracy over a fit-on-source Bayesian network and a Kiiveri-style EM baseline, with the largest gains under pronounced domain shift.
zh
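
摘要所述的一阶(梯度)EM 即 Balakrishnan 等人风格的更新:用一步投影梯度代替代价高昂的广义最小二乘 M 步。其标准形式可写为(步长 α 与投影集 Ω 的具体选择依模型而定,属示意):

```latex
% 一阶(梯度)EM 算子:E 步给出替代目标 Q(. | theta^t),
% M 步由单次投影梯度步代替;Pi_Omega 为到可行 DAG 参数集的投影。
\[
  \theta^{t+1}
  = \Pi_{\Omega}\!\Bigl(
      \theta^{t}
      + \alpha\, \nabla_{\theta} Q\bigl(\theta \,\big|\, \theta^{t}\bigr)\Big|_{\theta=\theta^{t}}
    \Bigr)
\]
```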

[AI-66] Automated Feedback Generation for Undergraduate Mathematics: Development and Evaluation of an AI Teaching Assistant

【速读】:该论文旨在解决如何可靠评估自由格式(free-form)数学推理过程的问题,尤其是在智能辅导系统中提供即时反馈的挑战。传统智能辅导系统仅适用于结构化输入且问题高度受限的场景,而本文提出了一种基于大语言模型(Large Language Models, LLMs)的模块化工作流系统,能够处理自然语言形式的学生作答,不仅判断其技术正确性,还能评价证明的风格与呈现质量。该方案的关键在于将LLMs嵌入一个可读性强、无需编程即可编辑的模块化流程中,支持教师预计算或注入中间步骤,从而在保持灵活性的同时提升反馈的专业性和实用性。实验表明,该系统在早期本科生作业评估中的反馈质量可媲美人类专家,并在小规模高阶和非常规题目测试中展现出显著优势与待改进空间。

链接: https://arxiv.org/abs/2601.03458
作者: Aron Gohr,Marie-Amelie Lawn,Kevin Gao,Inigo Serjeant,Stephen Heslip
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent tutoring systems have long enabled automated immediate feedback on student work when it is presented in a tightly structured format and when problems are very constrained, but reliably assessing free-form mathematical reasoning remains challenging. We present a system that processes free-form natural language input, handles a wide range of edge cases, and comments competently not only on the technical correctness of submitted proofs, but also on style and presentation issues. We discuss the advantages and disadvantages of various approaches to the evaluation of such a system, and show that by the metrics we evaluate, the quality of the feedback generated is comparable to that produced by human experts when assessing early undergraduate homework. We stress-test our system with a small set of more advanced and unusual questions, and report both significant gaps and encouraging successes in that more challenging setting. Our system uses large language models in a modular workflow. The workflow configuration is human-readable and editable without programming knowledge, and allows some intermediate steps to be precomputed or injected by the instructor. A version of our tool is deployed on the Imperial mathematics homework platform Lambdafeedback. We report also on the integration of our tool into this platform.
zh

[AI-67] Soft Contextualized Encoder For User Defined Text Classification

【速读】:该论文旨在解决用户自定义文本分类(User-Defined Text Classification, UDTC)问题,即在实际应用中将输入文本映射到用户指定的、此前未见过的类别标签,这在企业分析、内容审核和领域特定信息检索等场景中尤为常见。解决方案的关键在于提出一种软上下文编码器架构(soft-contextualized encoder architecture),通过将每个候选标签与标签集合以及输入查询的静态软提示表示(soft prompt representation)进行上下文建模,从而增强模型对未见类别的泛化能力。该方法在多源多样数据集上训练后,能够在完全未见过的主题集合上实现零样本分类(zero-shot classification),并在多个未见UDTC基准测试中达到或超越现有最优性能。

链接: https://arxiv.org/abs/2601.03450
作者: Charu Maheshwari,Vyas Raina
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:User-Defined Text Classification (UDTC) considers the challenge of classifying input text to user-specified, previously unseen classes, a setting that arises frequently in real-world applications such as enterprise analytics, content moderation, and domain-specific information retrieval. We propose a soft-contextualized encoder architecture for UDTC which contextualizes each candidate label with the label set and a static soft prompt representation of the input query. Training on diverse, multi-source datasets enables the model to generalize effectively to zero-shot classification over entirely unseen topic sets drawn from arbitrary domains. We evaluate the proposed architecture both on held-out in-distribution test data and on multiple unseen UDTC benchmarks. Across datasets, the model achieves state-of-the-art performance, consistently outperforming or matching the baselines.
zh

[AI-68] The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

【速读】:该论文试图解决的问题是:当前广泛认为Mixture of Experts (MoE) 模型通过稀疏路由实现领域专业化(domain specialization)的假设是否成立。为验证这一假设,作者提出了COMMITTEEAUDIT这一后验分析框架,其关键在于从专家组(expert groups)层面而非单个专家层面分析路由行为。通过在三个代表性模型和MMLU基准上的实验,发现存在一个跨领域、跨层和跨路由预算保持稳定的“常设委员会”(Standing Committee),该委员会始终占据大部分路由权重,即使在包含共享专家的架构中也是如此。这表明MoE模型的实际结构更倾向于集中式计算,而非普遍意义上的领域专业化,从而揭示了现有训练目标(如负载均衡损失)可能与模型自然优化路径相悖,限制了训练效率与性能。

链接: https://arxiv.org/abs/2601.03425
作者: Yan Wang,Yitao Xu,Nanhan Shen,Jinyan Su,Jimin Huang,Zining Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model’s natural optimization path, thereby limiting training efficiency and performance.
zh
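
"常设委员会"现象可以用一个很简单的统计量刻画:路由概率按专家取平均后,top-k 专家捕获的总路由质量份额。下面的 numpy 草图即计算该份额;COMMITTEEAUDIT 的实际分析在专家组与跨领域、跨层层面进行,此处仅为单层示意。

```python
import numpy as np

def committee_share(router_probs, committee_size):
    """Fraction of routing mass captured by the top-k most-used experts.

    router_probs: (n_tokens, n_experts) per-token routing distributions.
    A domain-invariant "Standing Committee" shows up as a small k whose
    aggregate share stays high across domains and layers.
    """
    usage = router_probs.mean(axis=0)                    # average mass per expert
    committee = np.argsort(usage)[::-1][:committee_size]
    return usage[committee].sum(), committee

rng = np.random.default_rng(0)
logits = rng.normal(size=(10000, 64))
logits[:, :6] += 2.0                                     # a few experts dominate
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
share, members = committee_share(probs, committee_size=6)
print(f"top-6 experts capture {share:.0%} of routing mass: {members}")
```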

[AI-69] Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全评估中因攻击方法受限而导致的鲁棒性高估问题。现有自动化攻击通常依赖于手工设计的先验知识或白盒梯度信息,限制了其在真实场景中的有效性与泛化能力。论文提出了一种名为RAILS(RAndom Iterative Local Search)的新框架,其核心创新在于无需梯度或先验知识即可实现高效的对抗性越狱攻击:一是引入一种自回归损失函数以强制精确前缀匹配,确保生成文本与目标指令高度一致;二是设计基于历史记录的选择策略,有效弥合代理优化目标与真实攻击成功率之间的差距。最关键的是,RAILS通过消除对梯度的依赖,实现了跨分词器(tokenizer)的集成攻击,从而发现跨不同词汇表共享的对抗模式,显著提升了黑盒攻击向闭源系统(如GPT和Gemini)的迁移能力。

链接: https://arxiv.org/abs/2601.03420
作者: Zhakshylyk Nurlanov,Frank R. Schmidt,Florian Bernard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive assumptions. They typically rely on handcrafted priors or require white-box access for gradient propagation. We challenge these constraints by demonstrating that token-level iterative optimization can succeed without gradients or priors. We introduce RAILS (RAndom Iterative Local Search), a framework that operates solely on model logits. RAILS matches the effectiveness of gradient-based methods through two key innovations: a novel auto-regressive loss that enforces exact prefix matching, and a history-based selection strategy that bridges the gap between the proxy optimization objective and the true attack success rate. Crucially, by eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. This allows for the discovery of shared adversarial patterns that generalize across disjoint vocabularies, significantly enhancing transferability to closed-source systems. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.
zh
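
RAILS 的骨架是无梯度的随机迭代局部搜索:每轮随机替换后缀中的一个 token,仅当(基于 logits 的)损失下降时保留。下面的草图用一个玩具目标函数代替真实的自回归精确前缀匹配损失,论文中的历史候选选择策略亦从略;`rails_attack` 等命名为本文假设。

```python
import random

def rails_attack(score, vocab, suffix_len=20, iters=500, seed=0):
    """Random iterative local search over an adversarial suffix (no gradients).

    score(suffix) -> float loss computed from model logits only (lower is
    better); here supplied by the caller.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = score(suffix)
    for _ in range(iters):
        cand = list(suffix)
        cand[rng.randrange(suffix_len)] = rng.choice(vocab)  # mutate one token
        s = score(cand)
        if s < best:                                          # keep improvements
            suffix, best = cand, s
    return suffix, best

# toy objective: prefer token "!" everywhere (stands in for the logit-based loss)
toy_vocab = list("abc!xyz")
suffix, loss = rails_attack(lambda s: -1.0 * s.count("!"), toy_vocab)
print("".join(suffix), loss)
```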

[AI-70] Exploration Through Introspection: A Self-Aware Reward Model AAAI-26

【速读】:该论文旨在解决如何让人工智能代理(AI agent)建模内部心理状态以推进AI中的心智理论(Theory of Mind),特别是探索自我意识在强化学习环境中的作用。其核心问题是:如何通过计算框架使代理具备对自身内部状态的感知能力,从而提升其学习效率与行为复杂性。解决方案的关键在于引入一种受生物疼痛机制启发的“内省探索”组件,利用隐马尔可夫模型(Hidden Markov Model, HMM)从在线观测中推断“疼痛信念”(pain-belief),并将该信号整合进主观奖励函数中,以此驱动代理对自我状态的认知和适应性学习。实验表明,这种内省机制显著优于传统基线方法,并能复现类人的复杂行为模式。

链接: https://arxiv.org/abs/2601.03389
作者: Michael Petrowski,Milica Gašić
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at AAAI-26 ToM4AI Workshop

点击查看摘要

Abstract:Understanding how artificial agents model internal mental states is central to advancing Theory of Mind in AI. Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learning agents infer their own internal states in gridworld environments. Specifically, we introduce an introspective exploration component that is inspired by biological pain as a learning signal by utilizing a hidden Markov model to infer “pain-belief” from online observations. This signal is integrated into a subjective reward function to study how self-awareness affects the agent’s learning abilities. Further, we use this computational framework to investigate the difference in performance between normal and chronic pain perception models. Results show that introspective agents in general significantly outperform standard baseline agents and can replicate complex human-like behaviors.
zh
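
"疼痛信念"的推断即二状态 HMM 的前向更新,再把该信念折算进主观奖励。下面的 numpy 草图给出一步信念更新与奖励整合;转移/发射概率与惩罚系数均为示意数值。

```python
import numpy as np

def update_pain_belief(belief, obs, transition, emission):
    """One HMM forward step: infer the hidden 'pain' state from an observation.

    belief: P(state) over {no-pain, pain}; transition: (2, 2) state dynamics;
    emission: (2, n_obs) observation likelihoods.
    """
    predicted = transition.T @ belief                 # prior after dynamics
    posterior = emission[:, obs] * predicted          # weight by likelihood
    return posterior / posterior.sum()

def subjective_reward(env_reward, belief, pain_penalty=1.0):
    """Fold the inferred pain-belief into the reward the agent learns from."""
    return env_reward - pain_penalty * belief[1]

T = np.array([[0.95, 0.05],
              [0.10, 0.90]])                          # pain tends to persist
E = np.array([[0.8, 0.2],                             # P(obs | no-pain)
              [0.3, 0.7]])                            # P(obs | pain)
belief = np.array([0.9, 0.1])
for obs in [1, 1, 0, 1]:                              # noxious observations
    belief = update_pain_belief(belief, obs, T, E)
    print(f"P(pain)={belief[1]:.2f}, reward={subjective_reward(1.0, belief):.2f}")
```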

[AI-71] Enhancing LLM Instruction Following: An Evaluation-Driven Multi-Agentic Workflow for Prompt Instructions Optimization

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在生成内容时虽概念正确但违反形式约束的问题,即输出符合语义逻辑却不符合任务执行的具体规范。传统提示词优化方法仅关注对主任务描述的重述,忽视了作为响应接受标准的细粒度约束条件。解决方案的关键在于提出一种多智能体工作流,将主任务描述的优化与约束条件的优化解耦,并利用量化评分作为反馈信号,迭代地重写和改进约束部分,从而显著提升模型输出对形式约束的合规性,实验表明该方法在Llama 3.1 8B和Mixtral-8x7B等模型上取得了更高合规分数。

链接: https://arxiv.org/abs/2601.03359
作者: Alberto Purpura,Li Wang,Sahil Badyal,Eugenio Beaufrand,Adam Faulkner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often generate substantively relevant content but fail to adhere to formal constraints, leading to outputs that are conceptually correct but procedurally flawed. Traditional prompt refinement approaches focus on rephrasing the description of the primary task an LLM has to perform, neglecting the granular constraints that function as acceptance criteria for its response. We propose a novel multi-agentic workflow that decouples optimization of the primary task description from its constraints, using quantitative scores as feedback to iteratively rewrite and improve them. Our evaluation demonstrates this method produces revised prompts that yield significantly higher compliance scores from models like Llama 3.1 8B and Mixtral-8x7B.
zh

[AI-72] Digital Red Queen: Adversarial Program Evolution in Core War with LLMs

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在演化求解问题时普遍采用静态优化框架、忽视真实世界进化过程中开放性对抗动态的问题。其解决方案的关键在于提出一种名为“数字红皇后”(Digital Red Queen, DRQ)的自对弈算法,通过持续适应变化的目标来模拟生物进化中的“红皇后效应”(Red Queen dynamics)。DRQ利用LLM演化出类似汇编程序的“战士”(warriors),这些战士在Core War这一图灵完备环境中相互竞争,每一轮都生成一个能击败此前所有对手的新战士,从而形成持续进化的对抗过程。实验表明,随着迭代进行,战士逐渐趋向于更通用的行为策略,并表现出行为多样性下降的趋势,这与自然界中的趋同进化现象一致,凸显了从静态目标转向动态红皇后目标的价值。

链接: https://arxiv.org/abs/2601.03335
作者: Akarsh Kumar,Ryan Bahlous-Boldi,Prafull Sharma,Phillip Isola,Sebastian Risi,Yujin Tang,David Ha
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 14 pages, 13 figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly being used to evolve solutions to problems in many domains, in a process inspired by biological evolution. However, unlike biological evolution, most LLM-evolution frameworks are formulated as static optimization problems, overlooking the open-ended adversarial dynamics that characterize real-world evolutionary processes. Here, we study Digital Red Queen (DRQ), a simple self-play algorithm that embraces these so-called “Red Queen” dynamics via continual adaptation to a changing objective. DRQ uses an LLM to evolve assembly-like programs, called warriors, which compete against each other for control of a virtual machine in the game of Core War, a Turing-complete environment studied in artificial life and connected to cybersecurity. In each round of DRQ, the model evolves a new warrior to defeat all previous ones, producing a sequence of adapted warriors. Over many rounds, we observe that warriors become increasingly general (relative to a set of held-out human warriors). Interestingly, warriors also become less behaviorally diverse across independent runs, indicating a convergence pressure toward a general-purpose behavioral strategy, much like convergent evolution in nature. This result highlights a potential value of shifting from static objectives to dynamic Red Queen objectives. Our work positions Core War as a rich, controllable sandbox for studying adversarial adaptation in artificial systems and for evaluating LLM-based evolution methods. More broadly, the simplicity and effectiveness of DRQ suggest that similarly minimal self-play approaches could prove useful in other more practical multi-agent adversarial domains, like real-world cybersecurity or combating drug resistance.
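
DRQ 的主循环本身非常简洁:每一轮要求 LLM 演化出能击败全部历史战士的新战士。示意性骨架如下(`llm_evolve_warrior`、`corewar_beats` 均为假设接口):

```python
def digital_red_queen(llm_evolve_warrior, corewar_beats, n_rounds=10, max_tries=100):
    """自对弈循环:优化目标随存档种群不断变化,即"红皇后"动态(示意)。"""
    archive = []                                  # 逐轮适应后的战士序列
    for r in range(n_rounds):
        for _ in range(max_tries):
            warrior = llm_evolve_warrior(opponents=archive, round_id=r)
            if all(corewar_beats(warrior, old) for old in archive):
                archive.append(warrior)           # 新战士须击败所有前任
                break
    return archive
```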

[AI-73] Attention mechanisms in neural networks

【速读】:该论文旨在系统性地解决注意力机制(attention mechanisms)在深度学习模型中的理论建模、计算特性与实际应用之间的鸿沟问题,其核心挑战在于如何从数学层面严谨刻画注意力机制的本质,并将其有效应用于自然语言处理、计算机视觉及多模态学习等不同领域。解决方案的关键在于提供一个全面且严格的数学框架,涵盖注意力机制的理论基础、训练动态、缩放规律(scaling laws)、注意力模式可视化及其与语言和视觉结构的可解释性关联,从而为构建高效、可扩展且具备良好泛化能力的现代神经网络架构奠定坚实基础。

链接: https://arxiv.org/abs/2601.03329
作者: Hasi Hays
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Attention mechanisms represent a fundamental paradigm shift in neural network architectures, enabling models to selectively focus on relevant portions of input sequences through learned weighting functions. This monograph provides a comprehensive and rigorous mathematical treatment of attention mechanisms, encompassing their theoretical foundations, computational properties, and practical implementations in contemporary deep learning systems. Applications in natural language processing, computer vision, and multimodal learning demonstrate the versatility of attention mechanisms. We examine language modeling with autoregressive transformers, bidirectional encoders for representation learning, sequence-to-sequence translation, Vision Transformers for image classification, and cross-modal attention for vision-language tasks. Empirical analysis reveals training characteristics, scaling laws that relate performance to model size and computation, attention pattern visualizations, and performance benchmarks across standard datasets. We discuss the interpretability of learned attention patterns and their relationship to linguistic and visual structures. The monograph concludes with a critical examination of current limitations, including computational scalability, data efficiency, systematic generalization, and interpretability challenges.
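
该专著的核心对象是缩放点积注意力,其标准公式 Attention(Q, K, V) = softmax(QKᵀ/√d_k)V 可以用几行 NumPy 写出(通用教科书实现,并非该文专属代码):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V:学习到的加权函数对 V 做选择性聚合。"""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # 数值稳定的 softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 16)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```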

[AI-74] Extreme-value forest fire prediction: A study of the Loss Function in an Ordinality Scheme

【速读】:该论文旨在解决野火严重程度预测中因空间和严重性高度不平衡而导致的极端事件预测难题,特别是针对法国实际业务决策需求的有序分类问题。其解决方案的关键在于引入首个与操作决策对齐的有序分类框架,并通过设计面向序数标签的损失函数(如提出的概率TDeGPD损失和加权Kappa损失WKLoss)来增强神经网络模型对罕见但高风险极端火灾事件的识别能力。实验表明,采用序数监督可显著提升模型在最严重等级上的IoU指标(提升超过0.1),同时保持良好的校准性能,从而为提高极端野火事件预测的可靠性提供了新的方法论支撑。

链接: https://arxiv.org/abs/2601.03327
作者: Nicolas Caron,Christophe Guyeux,Hassan Noura,Benjamin Aynes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challenging. In this work, we introduce the first ordinal classification framework for forecasting wildfire severity levels directly aligned with operational decision-making in France. Our study investigates the influence of loss-function design on the ability of neural models to predict rare yet critical high-severity fire occurrences. We compare standard cross-entropy with several ordinal-aware objectives, including the proposed probabilistic TDeGPD loss derived from a truncated discrete exponentiated Generalized Pareto Distribution. Through extensive benchmarking over multiple architectures and real operational data, we show that ordinal supervision substantially improves model performance over conventional approaches. In particular, the Weighted Kappa Loss (WKLoss) achieves the best overall results, with more than +0.1 IoU gain on the most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for the rarest events due to their extremely low representation in the dataset. These findings highlight the importance of integrating both severity ordering, data imbalance considerations, and seasonality risk into wildfire forecasting systems. Future work will focus on incorporating seasonal dynamics and uncertainty information into training to further improve the reliability of extreme-event prediction.
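
摘要中表现最好的 WKLoss 源于加权 Kappa:预测等级与真实等级相距越远,惩罚越大。下面是体现这一序数惩罚思想的简化示意(采用二次距离权重;并非论文实现,归一化方式为笔者假设):

```python
import numpy as np

def ordinal_penalty_loss(probs, y_true, n_classes):
    """probs: (N, C) 预测分布;y_true: (N,) 真实等级。
    期望序数惩罚 E[(i - y)^2 / (C-1)^2],越小越好(示意)。"""
    ranks = np.arange(n_classes)
    W = (ranks[None, :] - y_true[:, None]) ** 2 / (n_classes - 1) ** 2
    return (W * probs).sum(axis=1).mean()

probs = np.array([[0.7, 0.2, 0.1],    # 概率压在正确等级附近 -> 惩罚小
                  [0.1, 0.2, 0.7]])   # 压在远端等级 -> 惩罚大
print(ordinal_penalty_loss(probs, np.array([0, 0]), n_classes=3))
```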

[AI-75] HEEGNet: Hyperbolic Embeddings for EEG

【速读】:该论文旨在解决脑电图(EEG)信号在跨被试或跨场景应用中因分布偏移导致的解码泛化能力差的问题。其核心挑战在于如何学习到对域变化不敏感且能捕捉任务相关特征的鲁棒表示。解决方案的关键在于利用EEG数据内在的层次结构特性,提出一种混合超球面(hyperbolic)网络架构HEEGNet:该模型结合欧几里得与双曲编码器,并引入一种粗粒度到细粒度的域自适应策略,从而有效建模EEG的层次认知过程并学习域不变的双曲嵌入,显著提升跨域泛化性能。

链接: https://arxiv.org/abs/2601.03322
作者: Shanglin Li,Shiwen Chu,Okan Koç,Yi Ding,Qibin Zhao,Motoaki Kawanabe,Ziheng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electroencephalography (EEG)-based brain-computer interfaces facilitate direct communication with a computer, enabling promising applications in human-computer interactions. However, their utility is currently limited because EEG decoding often suffers from poor generalization due to distribution shifts across domains (e.g., subjects). Learning robust representations that capture underlying task-relevant information would mitigate these shifts and improve generalization. One promising approach is to exploit the underlying hierarchical structure in EEG, as recent studies suggest that hierarchical cognitive processes, such as visual processing, can be encoded in EEG. While many decoding methods still rely on Euclidean embeddings, recent work has begun exploring hyperbolic geometry for EEG. Hyperbolic spaces, regarded as the continuous analogue of tree structures, provide a natural geometry for representing hierarchical data. In this study, we first empirically demonstrate that EEG data exhibit hyperbolicity and show that hyperbolic embeddings improve generalization. Motivated by these findings, we propose HEEGNet, a hybrid hyperbolic network architecture to capture the hierarchical structure in EEG and learn domain-invariant hyperbolic embeddings. To this end, HEEGNet combines both Euclidean and hyperbolic encoders and employs a novel coarse-to-fine domain adaptation strategy. Extensive experiments on multiple public EEG datasets, covering visual evoked potentials, emotion recognition, and intracranial EEG, demonstrate that HEEGNet achieves state-of-the-art performance.
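
双曲嵌入常用 Poincaré 球模型,其测地距离有封闭形式,靠近球面边界时距离急剧增大,天然适合编码"根浅叶深"的层次结构(标准公式,非 HEEGNet 源码):

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """d(x, y) = arccosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))"""
    sq = np.sum((x - y) ** 2)
    nx, ny = np.sum(x ** 2), np.sum(y ** 2)
    return np.arccosh(1 + 2 * sq / ((1 - nx) * (1 - ny) + eps))

root = np.array([0.0, 0.0])     # 靠近原点:层次结构的"根"
leaf = np.array([0.0, 0.95])    # 靠近边界:层次结构的"叶"
print(poincare_distance(root, leaf))
```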

[AI-76] Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在放射学报告生成中面临的两大核心挑战:一是模型架构异质性导致的临床部署困难,二是生成过程中普遍存在事实性幻觉(factual hallucinations),尤其表现为文本输出与视觉证据不一致的问题。现有监督微调方法难以严格对齐语言输出与图像内容,而强化学习方法则受限于计算成本过高或探索能力不足。解决方案的关键在于提出一个名为“先推理后总结”(Reason-then-Summarize)的新型架构,并结合分组相对策略优化(Group Relative Policy Optimization, GRPO)进行训练。该框架将生成过程解耦为两个模块:一个用于详细影像发现的“思考块”(think block)和一个用于结构化疾病标签输出的“答案块”(answer block),并通过多维复合奖励函数显式惩罚叙述内容与最终诊断之间的逻辑不一致性,从而显著提升报告的自洽性和临床准确性。

链接: https://arxiv.org/abs/2601.03321
作者: Kun Zhao,Siyuan Dai,Pan Wang,Jifeng Song,Hui Ji,Chenghua Lin,Liang Zhan,Haoteng Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel “Reason-then-Summarize” architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
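
"先推理后总结"的关键是复合奖励要同时检查格式、答案准确度以及 think/answer 之间的一致性。下面给出一个极简示意(标签解析与一致性判据都经过大幅简化,各权重为笔者假设):

```python
import re

def composite_reward(output, gold_labels, w_acc=1.0, w_consist=0.5):
    """示意:格式奖励 + answer 块准确度 + 叙述与诊断的一致性惩罚。"""
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    if not (think and answer):
        return -1.0                               # 结构缺失直接重罚
    labels = set(answer.group(1).split())
    acc = len(labels & gold_labels) / max(len(labels | gold_labels), 1)
    # 一致性(简化判据):answer 中的每个标签应在 think 叙述中被提及
    consist = sum(l in think.group(1) for l in labels) / max(len(labels), 1)
    return w_acc * acc + w_consist * consist

out = "<think>enlarged cardiac silhouette, i.e. cardiomegaly</think><answer>cardiomegaly</answer>"
print(composite_reward(out, {"cardiomegaly"}))    # 1.5
```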

[AI-77] Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

【速读】:该论文旨在解决当前基于策略梯度的强化学习方法(如PPO和GRPO)在微调大语言模型(LLM)时存在的两个核心问题:一是硬性裁剪(hard clipping)策略比值会无差别地截断高回报但高偏离动作的梯度信号,从而抑制复杂推理中罕见但关键的“顿悟时刻”;二是当数据出现轻微过时时,硬裁剪导致样本无法复用,造成严重的样本效率低下。解决方案的关键在于重新审视信任区域优化目标,提出通过显式约束策略比值的方差(即第二中心矩)来实现对硬裁剪的平滑且有原则性的松弛,从而在保持训练稳定性的同时保留来自高价值轨迹的梯度信息。基于此洞察,作者设计了R²VPO(Ratio-Variance Regularized Policy Optimization),一个支持稳定在线学习并能动态重加权过时样本而非直接丢弃的原始-对偶框架,显著提升了RL微调LLM的性能与数据利用效率。

链接: https://arxiv.org/abs/2601.03320
作者: Yu Luo,Shuo Han,Yihan Hu,Dong Li,Jianye Hao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping stabilizes training, this heuristic hard constraint incurs a fundamental cost: it indiscriminately truncates gradients from high-return yet high-divergence actions, suppressing rare but highly informative “eureka moments” in complex reasoning. Moreover, once data becomes slightly stale, hard clipping renders it unusable, leading to severe sample inefficiency. In this work, we revisit the trust-region objective in policy optimization and show that explicitly constraining the variance (second central moment) of the policy ratio provides a principled and smooth relaxation of hard clipping. This distributional constraint stabilizes policy updates while preserving gradient signals from valuable trajectories. Building on this insight, we propose R²VPO (Ratio-Variance Regularized Policy Optimization), a novel primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse by dynamically reweighting stale samples rather than discarding them. We extensively evaluate R²VPO on fine-tuning state-of-the-art LLMs, including DeepSeek-Distill-Qwen-1.5B and the openPangu-Embedded series (1B and 7B), across challenging mathematical reasoning benchmarks. Experimental results show that R²VPO consistently achieves superior asymptotic performance, with average relative gains of up to 17% over strong clipping-based baselines, while requiring approximately 50% fewer rollouts to reach convergence. These findings establish ratio-variance control as a promising direction for improving both stability and data efficiency in RL-based LLM alignment.
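
R²VPO 的核心思想是把 PPO 的硬裁剪换成对策略比值方差(二阶中心矩)的显式约束。论文采用原始-对偶框架;下面用固定拉格朗日系数给出一个简化的 PyTorch 示意(阈值与系数均为假设,非论文实现):

```python
import torch

def ratio_variance_loss(logp_new, logp_old, advantages, lam=0.5, var_target=0.05):
    """不截断高回报轨迹的梯度,只惩罚比值分布过度发散(示意)。"""
    ratio = torch.exp(logp_new - logp_old)        # r = pi_new / pi_old
    pg_loss = -(ratio * advantages).mean()        # 标准重要性加权策略梯度
    var_penalty = torch.clamp(ratio.var() - var_target, min=0.0)
    return pg_loss + lam * var_penalty

logp_new = torch.randn(8, requires_grad=True)
loss = ratio_variance_loss(logp_new, torch.randn(8), torch.randn(8))
loss.backward()                                   # 所有样本的梯度信号均被保留
```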

[AI-78] CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature

【速读】:该论文旨在解决人脸3D漫画化(caricaturization)中难以同时实现高保真度与几何可控性的问题。现有方法在利用曲率增强技术进行表面夸张时,常因纹理融合导致渲染结果过度平滑;而直接对3D高斯点(3D Gaussian Splatting, 3DGS)进行变形又会引发质量下降。其解决方案的关键在于:首先基于高斯曲率构建初始夸张网格,并通过局部仿射变换生成伪真实漫画图像作为监督信号;进而设计一种交替使用真实与合成监督的训练机制,使单一高斯集合能够同时表征自然与夸张形态的人脸;此外,引入高效插值策略实现实时变形,并理论证明其偏差有界,从而在保持几何控制能力的同时显著提升视觉保真度和局部编辑灵活性。

链接: https://arxiv.org/abs/2601.03319
作者: Eldad Matmon,Amit Bracha,Noam Rotstein,Ron Kimmel
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A photorealistic and controllable 3D caricaturization framework for faces is introduced. We start with an intrinsic Gaussian curvature-based surface exaggeration technique, which, when coupled with texture, tends to produce over-smoothed renders. To address this, we resort to 3D Gaussian Splatting (3DGS), which has recently been shown to produce realistic free-viewpoint avatars. Given a multiview sequence, we extract a FLAME mesh, solve a curvature-weighted Poisson equation, and obtain its exaggerated form. However, directly deforming the Gaussians yields poor results, necessitating the synthesis of pseudo-ground-truth caricature images by warping each frame to its exaggerated 2D representation using local affine transformations. We then devise a training scheme that alternates real and synthesized supervision, enabling a single Gaussian collection to represent both natural and exaggerated avatars. This scheme improves fidelity, supports local edits, and allows continuous control over the intensity of the caricature. In order to achieve real-time deformations, an efficient interpolation between the original and exaggerated surfaces is introduced. We further analyze and show that it has a bounded deviation from closed-form solutions. In both quantitative and qualitative evaluations, our results outperform prior work, delivering photorealistic, geometry-controlled caricature avatars.

[AI-79] Why LLM s Arent Scientists Yet: Lessons from Four Autonomous Research Attempts

【速读】:该论文旨在解决如何通过端到端的大型语言模型(Large Language Models, LLMs)代理系统实现自主生成机器学习(Machine Learning, ML)研究论文的问题,从而推动自动化科学发现的发展。其解决方案的关键在于构建一个由六个LLM代理组成的流水线,映射至科学研究流程的不同阶段,并通过实证尝试评估该系统的可行性与稳定性。尽管四次尝试中有三次因多种失败模式(如训练数据偏差、执行压力下的实现漂移、长程任务中的记忆退化等)而中断,但其中一次成功完成全流程并被Agents4Science 2025接收,表明具备初步可行性和可扩展性,同时揭示了未来设计更鲁棒AI科学家系统所需的四项核心原则。

链接: https://arxiv.org/abs/2601.03315
作者: Dhruv Trehan,Paras Chopra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems, implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at this https URL

[AI-80] Mastering the Game of Go with Self-play Experience Replay

【速读】:该论文旨在解决如何在不依赖模型(model-free)的情况下,通过强化学习高效掌握围棋这一复杂决策问题。传统方法如AlphaGo依赖于基于模型的蒙特卡洛树搜索(Monte-Carlo Tree Search, MCTS),而QZero提出了一种全新的策略:在训练过程中完全摒弃搜索机制,转而利用自对弈和离策略经验回放(off-policy experience replay)来学习纳什均衡策略(Nash equilibrium policy)。其解决方案的关键在于构建一个基于熵正则化Q-learning的统一框架,仅用一个Q值网络同时完成策略评估与改进,从而显著提升了训练效率,并在仅使用7张GPU、5个月训练时间的有限算力下达到了与AlphaGo相当的性能水平,首次验证了纯模型无关强化学习在大规模复杂环境中的可行性与高效性。

链接: https://arxiv.org/abs/2601.03306
作者: Jingbin Liu,Xuechun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches such as AlphaGo and its successors, have predominantly relied on model-based Monte-Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months with modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.
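
QZero 建立在熵正则化 Q-learning 之上:策略直接由 Q 值经 softmax 导出,价值目标用 soft Bellman 备份。表格型的单步更新可以示意如下(简化为单智能体形式,超参数为假设):

```python
import numpy as np

def soft_policy(Q, tau=0.1):
    """pi(a|s) ∝ exp(Q(s,a)/tau):单一 Q 网络同时承担评估与改进。"""
    z = Q / tau
    p = np.exp(z - z.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)

def soft_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, tau=0.1):
    """soft Bellman 目标:V(s') = tau * logsumexp(Q(s', ·)/tau)。"""
    v_next = tau * np.log(np.exp(Q[s_next] / tau).sum())
    Q[s, a] += alpha * (r + gamma * v_next - Q[s, a])
    return Q

Q = np.zeros((2, 3))
Q = soft_q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(soft_policy(Q)[0])   # 动作 1 的概率上升,但仍保留探索熵
```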

[AI-81] AI-Driven Cybersecurity Threats: A Survey of Emerging Risks and Defensive Strategies

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在网络安全领域中双重用途所带来的新兴风险问题,特别是针对深度伪造与合成媒体、对抗性AI攻击、自动化恶意软件以及AI驱动的社会工程等四类威胁,系统分析其攻击机制、防御短板及技术挑战。解决方案的关键在于构建一个连接AI能力与威胁模式的对比分类法,并提出包括混合检测流水线和基准测试框架在内的研究机遇,同时强调发展可解释性、跨学科协同和符合法规要求的AI防御体系,以保障数字生态系统的信任与安全。

链接: https://arxiv.org/abs/2601.03304
作者: Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages; Published in Springer Nature

点击查看摘要

Abstract:Artificial Intelligence’s dual-use nature is revolutionizing the cybersecurity landscape, introducing new threats across four main categories: deepfakes and synthetic media, adversarial AI attacks, automated malware, and AI-powered social engineering. This paper aims to analyze emerging risks, attack mechanisms, and defense shortcomings related to AI in cybersecurity. We introduce a comparative taxonomy connecting AI capabilities with threat modalities and defenses, review over 70 academic and industry references, and identify impactful opportunities for research, such as hybrid detection pipelines and benchmarking frameworks. The paper is structured thematically by threat type, with each section addressing technical context, real-world incidents, legal frameworks, and countermeasures. Our findings emphasize the urgency for explainable, interdisciplinary, and regulatory-compliant AI defense systems to maintain trust and security in digital ecosystems.

[AI-82] PC2P: Multi-Agent Path Finding via Personalized-Enhanced Communication and Crowd Perception IROS2025

【速读】:该论文旨在解决分布式多智能体路径规划(Distributed Multi-Agent Path Finding, MAPF)在复杂和多样化环境中的可扩展性与协作效率问题,尤其针对现有方法在部分可观测环境下因协同能力不足和感知能力有限而导致的性能瓶颈。解决方案的关键在于提出一种基于Q-learning的多智能体强化学习(MARL)框架下的新型方法PC2P,其核心创新包括:(1)引入一种基于动态图拓扑的个性化增强通信机制,通过选择-生成-聚合三阶段操作明确交互过程中的“谁”(who)与“什么”(what),提升通信效率与针对性;(2)融合局部人群感知机制,结合静态空间约束与动态占用变化,增强智能体的启发式观测能力,从而优化动作指导;(3)设计基于区域的死锁打破策略,利用专家引导在受限区域内实现高效协调,缓解极端死锁问题。实验表明,该方案在多种环境中均优于当前最先进的分布式MAPF方法。

链接: https://arxiv.org/abs/2601.03301
作者: Guotao Li,Shaoyun Xu,Yuexing Hao,Yang Wang,Yuhui Sun
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, 3 tables, Accepted to IROS 2025

点击查看摘要

Abstract:Distributed Multi-Agent Path Finding (MAPF) integrated with Multi-Agent Reinforcement Learning (MARL) has emerged as a prominent research focus, enabling real-time cooperative decision-making in partially observable environments through inter-agent communication. However, due to insufficient collaborative and perceptual capabilities, existing methods are inadequate for scaling across diverse environmental conditions. To address these challenges, we propose PC2P, a novel distributed MAPF method derived from a Q-learning-based MARL framework. Initially, we introduce a personalized-enhanced communication mechanism based on dynamic graph topology, which ascertains the core aspects of “who” and “what” in the interactive process through three-stage operations: selection, generation, and aggregation. Concurrently, we incorporate local crowd perception to enrich agents’ heuristic observation, thereby strengthening the model’s guidance for effective actions via the integration of static spatial constraints and dynamic occupancy changes. To resolve extreme deadlock issues, we propose a region-based deadlock-breaking strategy that leverages expert guidance to implement efficient coordination within confined areas. Experimental results demonstrate that PC2P achieves superior performance compared to state-of-the-art distributed MAPF methods in varied environments. Ablation studies further confirm the effectiveness of each module for overall performance.

[AI-83] 130k Lines of Formal Topology in Two Weeks: Simple and Cheap Autoformalization for Everyone?

【速读】:该论文试图解决的是数学定理自动形式化(autoformalization)过程中成本高、效率低、依赖专业工具和专家知识的问题。解决方案的关键在于构建一个低成本、高效的反馈循环系统:利用大语言模型(LLM)如ChatGPT或Claude Sonnet与快速的高阶集合论证明检查器Megalodon协同工作,并辅以基础集合论和超实数的形式化库作为核心支撑。该方法仅需约$100的LLM订阅费用,在两周内即完成了130k行形式化代码,包括Urysohn引理、Tietze延拓定理等重要结果,体现出简单架构与高效执行的结合,预示着形式化在2026年可能变得普及且易于实现。

链接: https://arxiv.org/abs/2601.03298
作者: Josef Urban
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:This is a brief description of a project that has already autoformalized a large portion of the general topology from the Munkres textbook (which has in total 241 pages in 7 chapters and 39 sections). The project has been running since November 21, 2025 and, as of January 4, 2026, has produced 160k lines of formalized topology. Most of it (about 130k lines) has been done in two weeks, from December 22 to January 4, for an LLM subscription cost of about $100. This includes a 3k-line proof of Urysohn’s lemma, a 2k-line proof of Urysohn’s Metrization theorem, an over-10k-line proof of the Tietze extension theorem, and many more (in total over 1.5k lemmas/theorems). The approach is quite simple and cheap: build a long-running feedback loop between an LLM and a reasonably fast proof checker equipped with a core foundational library. The LLM is now instantiated as ChatGPT (mostly 5.2) or Claude Sonnet (4.5) run through the respective Codex or Claude Code command line interfaces. The proof checker is Chad Brown’s higher-order set theory system Megalodon, and the core library is Brown’s formalization of basic set theory and surreal numbers (including reals, etc). The rest is some prompt engineering and technical choices which we describe here. Based on the fast progress, low cost, virtually unknown ITP/library, and the simple setup available to everyone, we believe that (auto)formalization may become quite easy and ubiquitous in 2026, regardless of which proof assistant is used.
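
这套做法的骨架就是"LLM 写证明、检查器验证、报错回填"的长循环,可以示意如下(`llm` 与 `megalodon_check` 为假设接口,实际项目还包含提示工程等细节):

```python
def autoformalize(statement, llm, megalodon_check, max_attempts=50):
    """反馈循环:验证失败时把报错附加到上下文,让 LLM 修补证明(示意)。"""
    context = f"请用 Megalodon 语法形式化并证明:{statement}"
    for attempt in range(max_attempts):
        proof = llm(context)
        ok, error_msg = megalodon_check(proof)
        if ok:
            return proof                      # 检查器通过即为可信证明
        context += (f"\n第 {attempt + 1} 次尝试失败,检查器报错:{error_msg}"
                    f"\n请据此修正证明。")
    return None                               # 超出预算,留待人工处理
```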

[AI-84] Agent Mark: Utility-Preserving Behavioral Watermarking for Agents

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在自主执行复杂任务时,缺乏对高层规划行为(如工具选择和子目标决策)的可追溯性与知识产权保护的问题。现有内容水印技术虽能标记生成内容的来源,但无法识别影响多步执行流程的核心决策行为,且在长期运行中因决策分布微小偏移累积导致性能下降,加之多数代理为黑盒系统难以直接干预。解决方案的关键在于提出AgentMark框架,通过从代理中显式获取行为分布并采用保持分布一致性的条件采样策略,在不破坏任务效用的前提下将多比特标识嵌入规划决策中,从而实现对高层行为的可验证水印,并兼容底层动作层的内容水印,同时支持黑盒API部署。

链接: https://arxiv.org/abs/2601.03294
作者: Kaibo Huang,Jin Tan,Yukun Wei,Wanling Li,Zipei Zhang,Hui Tian,Zhongliang Yang,Linna Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based agents are increasingly deployed to autonomously solve complex tasks, raising urgent needs for IP protection and regulatory provenance. While content watermarking effectively attributes LLM-generated outputs, it fails to directly identify the high-level planning behaviors (e.g., tool and subgoal choices) that govern multi-step execution. Critically, watermarking at the planning-behavior layer faces unique challenges: minor distributional deviations in decision-making can compound during long-term agent operation, degrading utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi-bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling, enabling deployment under black-box APIs while remaining compatible with action-layer content watermarking. Experiments across embodied, tool-use, and social environments demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation. The code is available at this https URL.

[AI-85] Lightweight Transformer Architectures for Edge Devices in Real-Time Applications

【速读】:该论文旨在解决基于Transformer的模型在资源受限边缘设备上部署时面临的挑战,即如何在保持较高性能的同时显著降低模型复杂度与计算开销。其解决方案的关键在于系统性地整合多种模型压缩技术,包括剪枝(pruning)、量化(quantization)、知识蒸馏(knowledge distillation)以及轻量级架构设计,并结合硬件感知的神经架构搜索(hardware-aware neural architecture search),从而实现模型尺寸缩减4-10倍、推理延迟降低3-9倍,同时维持75%-96%的全模型准确率。研究进一步揭示了稀疏注意力机制、混合精度量化(INT8/FP16)和内存带宽瓶颈优化是提升边缘部署效率的核心策略,为实际应用提供了可落地的六步部署流程。

链接: https://arxiv.org/abs/2601.03290
作者: Hema Hariharan Samson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, 4 tables. Comprehensive study of lightweight transformer architectures for edge computing with novel findings on memory-bandwidth tradeoffs, quantization strategies, and hardware-specific optimizations. Includes detailed benchmarks across NLP and vision tasks with practical deployment recommendations

点击查看摘要

Abstract:The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include memory-bandwidth bottleneck analysis revealing 15-40M parameter models achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.
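
综述将 INT8/FP16 混合精度量化列为最有效的优化手段之一。以 PyTorch 自带的动态量化 API 为例(标准用法示意,模型结构为随意假设,并非该综述的基准流程):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# 动态量化:Linear 权重离线转为 INT8,激活在推理时动态量化
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)   # 输出形状不变;模型体积与 CPU 推理延迟显著下降
```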

[AI-86] Automated Post-Incident Policy Gap Analysis via Threat-Informed Evidence Mapping using Large Language Models

【速读】:该论文旨在解决网络安全事件事后审查(post-incident review)过程中存在的劳动密集、耗时长且高度依赖专家判断的问题。其核心解决方案是一个基于威胁情报的智能代理框架(threat-informed, agentic framework),关键在于将日志数据分析与安全策略验证整合为一个端到端的自动化流程:通过MITRE ATT&CK框架映射观测行为,利用GPT-4o进行推理分析,LangGraph实现多智能体工作流编排,并借助LlamaIndex实现可追溯的策略检索,从而自动识别控制失效点并生成具证据链支撑的修复建议,显著提升事后审查的效率、一致性和可审计性。

链接: https://arxiv.org/abs/2601.03287
作者: Huan Lin Oh,Jay Yong Jun Jie,Mandy Lee Ling Siu,Jonathan Pan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure. Preprint

点击查看摘要

Abstract:Cybersecurity post-incident reviews are essential for identifying control failures and improving organisational resilience, yet they remain labour-intensive, time-consuming, and heavily reliant on expert judgment. This paper investigates whether Large Language Models (LLMs) can augment post-incident review workflows by autonomously analysing system evidence and identifying security policy gaps. We present a threat-informed, agentic framework that ingests log data, maps observed behaviours to the MITRE ATT&CK framework, and evaluates organisational security policies for adequacy and compliance. Using a simulated brute-force attack scenario against a Windows OpenSSH service (MITRE ATT&CK T1110), the system leverages GPT-4o for reasoning, LangGraph for multi-agent workflow orchestration, and LlamaIndex for traceable policy retrieval. Experimental results indicate that the LLM-based pipeline can interpret log-derived evidence, identify insufficient or missing policy controls, and generate actionable remediation recommendations with explicit evidence-to-policy traceability. Unlike prior work that treats log analysis and policy validation as isolated tasks, this study integrates both into a unified end-to-end proof-of-concept post-incident review framework. The findings suggest that LLM-assisted analysis has the potential to improve the efficiency, consistency, and auditability of post-incident evaluations, while highlighting the continued need for human oversight in high-stakes cybersecurity decision-making.

[AI-87] α³-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在无人机(Unmanned Aerial Vehicle, UAV)自主控制中缺乏对现实网络环境适应性的评估问题,特别是在下一代6G动态网络条件下,如何保障LLM驱动的UAV代理在任务执行中的安全性、协议合规性、鲁棒性和效率。其解决方案的关键在于提出一个名为α³-Bench的新基准,将UAV任务建模为多轮对话式推理与控制闭环,嵌入真实波动的6G网络参数(如时延、抖动、丢包率等),并通过双动作层支持工具调用与多智能体协作,从而系统性地评估LLM在复杂通信约束下的表现;同时设计了一个包含六个维度的复合指标α³,统一衡量任务结果、安全策略、工具一致性、交互质量、网络鲁棒性和通信成本,并以单位时间与单位token的归一化效率进行量化,揭示了现有模型在性能与资源利用之间的显著差异,凸显了面向网络感知和资源高效优化的LLM-UAV协同控制的重要性。

链接: https://arxiv.org/abs/2601.03281
作者: Mohamed Amine Ferrag,Abderrahmane Lakas,Merouane Debbah
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used as high-level controllers for autonomous Unmanned Aerial Vehicle (UAV) missions. However, existing evaluations rarely assess whether such agents remain safe, protocol compliant, and effective under realistic next generation networking constraints. This paper introduces α³-Bench, a benchmark for evaluating LLM driven UAV autonomy as a multi turn conversational reasoning and control problem operating under dynamic 6G conditions. Each mission is formulated as a language mediated control loop between an LLM based UAV agent and a human operator, where decisions must satisfy strict schema validity, mission policies, speaker alternation, and safety constraints while adapting to fluctuating network slices, latency, jitter, packet loss, throughput, and edge load variations. To reflect modern agentic workflows, α³-Bench integrates a dual action layer supporting both tool calls and agent to agent coordination, enabling evaluation of tool use consistency and multi agent interactions. We construct a large scale corpus of 113k conversational UAV episodes grounded in UAVBench scenarios and evaluate 17 state of the art LLMs using a fixed subset of 50 episodes per scenario under deterministic decoding. We propose a composite α³ metric that unifies six pillars: Task Outcome, Safety Policy, Tool Consistency, Interaction Quality, Network Robustness, and Communication Cost, with efficiency normalized scores per second and per thousand tokens. Results show that while several models achieve high mission success and safety compliance, robustness and efficiency vary significantly under degraded 6G conditions, highlighting the need for network aware and resource efficient LLM based UAV agents. The dataset is publicly available on GitHub: this https URL
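
α³ 指标把六个维度统一为单一得分,并按单位时间与每千 token 归一化。下面是一个纯算术的示意(等权重与具体归一化方式均为笔者假设,论文可能采用不同定义):

```python
def alpha3_score(pillars, elapsed_s, tokens, weights=None):
    """pillars: 六个维度得分(0~1)的字典;返回综合分与两种效率归一化分。"""
    keys = ["task", "safety", "tool", "interaction", "robustness", "comm_cost"]
    weights = weights or {k: 1 / len(keys) for k in keys}
    alpha3 = sum(weights[k] * pillars[k] for k in keys)
    return alpha3, alpha3 / elapsed_s, alpha3 / (tokens / 1000)

scores = dict(task=0.9, safety=0.95, tool=0.8,
              interaction=0.85, robustness=0.7, comm_cost=0.75)
print(alpha3_score(scores, elapsed_s=12.0, tokens=3400))
```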

[AI-88] Device-Native Autonomous Agents for Privacy-Preserving Negotiations

【速读】:该论文旨在解决保险和企业间(B2B)商业自动谈判中面临的隐私与便利性权衡问题,现有系统因将敏感财务数据集中处理而带来安全风险并削弱用户信任。解决方案的关键在于构建一个设备原生的自主人工智能(AI)代理系统,该系统完全在用户本地硬件上运行,通过零知识证明(zero-knowledge proofs)保障隐私,并利用蒸馏世界模型(distilled world models)实现高效的本地推理能力,从而在不依赖外部服务器的情况下完成实时议价、安全多方协商及生成加密审计日志,显著提升了隐私保护水平与用户信任度。

链接: https://arxiv.org/abs/2601.00911
作者: Joyjit Roy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 9 pages, 6 figures, 9 tables, Submitted to the 2nd International Conference on Artificial Intelligence Systems (AIS 2026)

点击查看摘要

Abstract:Automated negotiations in insurance and business-to-business (B2B) commerce encounter substantial challenges. Current systems force a trade-off between convenience and privacy by routing sensitive financial data through centralized servers, increasing security risks, and diminishing user trust. This study introduces a device-native autonomous Artificial Intelligence (AI) agent system for privacy-preserving negotiations. The proposed system operates exclusively on user hardware, enabling real-time bargaining while maintaining sensitive constraints locally. It integrates zero-knowledge proofs to ensure privacy and employs distilled world models to support advanced on-device reasoning. The architecture incorporates six technical components within an agentic AI workflow. Agents autonomously plan negotiation strategies, conduct secure multi-party bargaining, and generate cryptographic audit trails without exposing user data to external servers. The system is evaluated in insurance and B2B procurement scenarios across diverse device configurations. Results show an average success rate of 87%, a 2.4x latency improvement over cloud baselines, and strong privacy preservation through zero-knowledge proofs. User studies show 27% higher trust scores when decision trails are available. These findings establish a foundation for trustworthy autonomous agents in privacy-sensitive financial domains.

[AI-89] Bayes-PD: Exploring a Sequence to Binding Bayesian Neural Network model trained on Phage Display data

【速读】:该论文旨在解决如何在高实验噪声和复杂数据预处理背景下,利用深度学习模型可靠地解释噬菌体展示(phage display)实验结果的问题。其关键解决方案是引入一种基于贝叶斯神经网络(Bayesian Neural Network)的训练循环框架,通过模拟实验噪声和量化模型不确定性,提升模型对真实结合亲和力测量值的解释能力与可靠性,从而克服传统方法依赖代理指标(如“保留”展示轮次数据)的局限性。

链接: https://arxiv.org/abs/2601.03930
作者: Ilann Amiaud-Plachy,Michael Blank,Oliver Bent,Sebastien Boyer
机构: 未知
类目: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Phage display is a powerful laboratory technique used to study the interactions between proteins and other molecules, whether other proteins, peptides, DNA or RNA. The under-utilisation of this data in conjunction with deep learning models for protein design may be attributed to: high experimental noise levels; the complex nature of data pre-processing; and difficulty interpreting these experimental results. In this work, we propose a novel approach utilising a Bayesian Neural Network within a training loop, in order to simulate the phage display experiment and its associated noise. Our goal is to investigate how understanding the experimental noise and model uncertainty can enable such models to reliably interpret phage display experiments. We validate our approach using actual binding affinity measurements instead of relying solely on proxy values derived from ‘held-out’ phage display rounds.

[AI-90] An Algebraic Representation Theorem for Linear GENEOs in Geometric Machine Learning

【速读】:该论文旨在解决现有Group Equivariant Non-Expansive Operators (GENEOs)理论仅适用于同类型数据空间之间映射的局限性,而实际应用中常需处理异质数据空间(heterogeneous data spaces)间的对称性保持操作。其解决方案的关键在于提出了一种新的表示定理,该定理基于广义T-置换测度(generalized T-permutant measures),对作用于不同感知对(perception pairs)上的线性GENEOs进行了完整刻画。在此基础上,作者进一步证明了线性GENEOs空间的紧致性和凸性,从而为构建具有低参数复杂度、可解释性强且具备几何与拓扑结构约束的神经网络架构提供了坚实的理论基础,并通过改进自编码器(autoencoder)性能验证了该理论的实际价值。

链接: https://arxiv.org/abs/2601.03910
作者: Francesco Conti,Patrizio Frosini,Nicola Quercioli
机构: 未知
类目: Representation Theory (math.RT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Geometric and Topological Deep Learning are rapidly growing research areas that enhance machine learning through the use of geometric and topological structures. Within this framework, Group Equivariant Non-Expansive Operators (GENEOs) have emerged as a powerful class of operators for encoding symmetries and designing efficient, interpretable neural architectures. Originally introduced in Topological Data Analysis, GENEOs have since found applications in Deep Learning as tools for constructing equivariant models with reduced parameter complexity. GENEOs provide a unifying framework bridging Geometric and Topological Deep Learning and include the operator computing persistence diagrams as a special case. Their theoretical foundations rely on group actions, equivariance, and compactness properties of operator spaces, grounding them in algebra and geometry while enabling both mathematical rigor and practical relevance. While a previous representation theorem characterized linear GENEOs acting on data of the same type, many real-world applications require operators between heterogeneous data spaces. In this work, we address this limitation by introducing a new representation theorem for linear GENEOs acting between different perception pairs, based on generalized T-permutant measures. Under mild assumptions on the data domains and group actions, our result provides a complete characterization of such operators. We also prove the compactness and convexity of the space of linear GENEOs. We further demonstrate the practical impact of this theory by applying the proposed framework to improve the performance of autoencoders, highlighting the relevance of GENEOs in modern machine learning applications.

[AI-91] Women Worry Men Adopt: How Gendered Perceptions Shape the Use of Generative AI

【速读】:该论文试图解决生成式 AI (Generative AI) 在社会中扩散过程中存在的显著性别不平等 adoption 问题,即女性相较于男性更少使用 GenAI 的现象。其解决方案的关键在于识别出这种不平等的根本动因并非源于数字技能或获取渠道的差异,而是源于女性对人工智能社会与伦理后果的差异化感知——特别是对心理健康、隐私、气候变化及劳动力市场冲击等风险的更高关注。研究通过构建综合风险感知指数,发现该指标在解释女性 GenAI 使用差异方面具有最强预测力,甚至超过数字素养和教育水平;并通过合成双胞胎面板设计验证了提升年轻女性对 AI 社会影响的乐观预期可使 GenAI 使用率从 13% 提升至 33%,显著缩小性别差距。这一发现为政策制定者提供了重要启示:缓解 GenAI 的性别采纳鸿沟需聚焦于重塑公众对 AI 风险与收益的认知框架,而非单纯提升技术接入能力。

链接: https://arxiv.org/abs/2601.03880
作者: Fabian Stephany,Jedrzej Duszynski
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures, 1 table

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) is diffusing rapidly, yet its adoption is strikingly unequal. Using nationally representative UK survey data from 2023 to 2024, we show that women adopt GenAI substantially less often than men because they perceive its societal risks differently. We construct a composite index capturing concerns about mental health, privacy, climate impact, and labor market disruption. This index explains between 9 and 18 percent of the variation in GenAI adoption and ranks among the strongest predictors for women across all age groups, surpassing digital literacy and education for young women. Intersectional analyses show that the largest disparities arise among younger, digitally fluent individuals with high societal risk concerns, where gender gaps in personal use exceed 45 percentage points. Using a synthetic twin panel design, we show that increased optimism about AI’s societal impact raises GenAI use among young women from 13 percent to 33 percent, substantially narrowing the gender divide. These findings indicate that gendered perceptions of AI’s social and ethical consequences, rather than access or capability, are the primary drivers of unequal GenAI adoption, with implications for productivity, skill formation, and economic inequality in an AI-enabled economy.

[AI-92] An Algorithmic Framework for Systematic Literature Reviews: A Case Study for Financial Narratives

【速读】:该论文旨在解决系统性文献综述(Systematic Literature Review, SLR)过程中效率低、可复现性差以及筛选质量难以评估的问题。其解决方案的关键在于构建一个整合自然语言处理(Natural Language Processing, NLP)技术、聚类算法和可解释性工具的算法框架,从而实现学术文献的自动化筛选与结构化分析。该框架在金融叙事(financial narratives)领域的案例研究中得到验证,表明其能够有效识别现有研究的碎片化特征,并推动更严谨、动态的叙事建模方法发展。

链接: https://arxiv.org/abs/2601.03794
作者: Gabin Taibi,Joerg Osterrieder
机构: 未知
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces an algorithmic framework for conducting systematic literature reviews (SLRs), designed to improve efficiency, reproducibility, and selection quality assessment in the literature review process. The proposed method integrates Natural Language Processing (NLP) techniques, clustering algorithms, and interpretability tools to automate and structure the selection and analysis of academic publications. The framework is applied to a case study focused on financial narratives, an emerging area in financial economics that examines how structured accounts of economic events, formed by the convergence of individual interpretations, influence market dynamics and asset prices. Drawing from the Scopus database of peer-reviewed literature, the review highlights research efforts to model financial narratives using various NLP techniques. Results reveal that while advances have been made, the conceptualization of financial narratives remains fragmented, often reduced to sentiment analysis, topic modeling, or their combination, without a unified theoretical framework. The findings underscore the value of more rigorous and dynamic narrative modeling approaches and demonstrate the effectiveness of the proposed algorithmic SLR methodology.

[AI-93] Scalable Machine Learning Force Fields for Macromolecular Systems Through Long-Range Aware Message Passing

【速读】:该论文旨在解决机器学习力场(Machine Learning Force Fields, MLFFs)在模拟大分子体系时因固定截断半径架构导致的长程相互作用建模不足问题,该局限性使得力预测误差随系统尺寸单调增长,严重制约了其在生物和化学大尺度体系中的应用。解决方案的关键在于提出了一种具有非局部注意力机制的等变Transformer模型E2Former-LSR,该模型通过显式引入长程注意力模块,有效捕捉非共价作用的衰减特性,并实现误差稳定增长与计算效率提升(相比纯局部模型提速达30%),从而验证了非局部架构对构建可推广MLFF的重要性。

链接: https://arxiv.org/abs/2601.03774
作者: Chu Wang,Lin Huang,Xinran Wei,Tao Qin,Arthur Jiang,Lixue Cheng,Jia Zhang
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph)
备注:

点击查看摘要

Abstract:Machine learning force fields (MLFFs) have revolutionized molecular simulations by providing quantum mechanical accuracy at the speed of molecular mechanical computations. However, a fundamental reliance of these models on fixed-cutoff architectures limits their applicability to macromolecular systems where long-range interactions dominate. We demonstrate that this locality constraint causes force prediction errors to scale monotonically with system size, revealing a critical architectural bottleneck. To overcome this, we establish the systematically designed MolLR25 (Molecules with Long-Range effect) benchmark up to 1200 atoms, generated using high-fidelity DFT, and introduce E2Former-LSR, an equivariant transformer that explicitly integrates long-range attention blocks. E2Former-LSR exhibits stable error scaling, achieves superior fidelity in capturing non-covalent decay, and maintains precision on complex protein conformations. Crucially, its efficient design provides up to 30% speedup compared to purely local models. This work validates the necessity of non-local architectures for generalizable MLFFs, enabling high-fidelity molecular dynamics for large-scale chemical and biological systems.

[AI-94] ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

【速读】:该论文旨在解决零样本文本到语音(Zero-shot Text-to-Speech, TTS)模型在说话风格迁移中依赖参考音频风格的问题,即模型在克隆说话人音色(timbre)的同时,会强烈继承参考音频中的说话风格(如语调、情感等),导致难以实现灵活且精确的风格控制,尤其在参考音频有限或与目标风格不匹配时效果不佳。解决方案的关键在于提出ReStyle-TTS框架,其核心创新是引入解耦无分类器引导(Decoupled Classifier-Free Guidance, DCFG),通过独立控制文本和参考音频的引导信号,在降低模型对参考风格的隐式依赖的同时保持文本内容的准确性;在此基础上,结合特定风格LoRA(Low-Rank Adaptation)与正交LoRA融合机制,实现连续且解耦的多属性风格控制,并通过音色一致性优化模块缓解因弱化参考引导导致的音色漂移问题,从而在保持语音可懂性和说话人音色稳定性的前提下,支持用户友好、连续且相对参考的风格调节。

链接: https://arxiv.org/abs/2601.03632
作者: Haitao Li,Chunxiang Jin,Chenglin Li,Wenhao Guan,Zhengxing Huang,Xie Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Zero-shot text-to-speech models can clone a speaker’s timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore do not support continuous and reference-relative style control. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS. Our key insight is that effective style control requires first reducing the model’s implicit dependence on reference style before introducing explicit control mechanisms. To this end, we introduce Decoupled Classifier-Free Guidance (DCFG), which independently controls text and reference guidance, reducing reliance on reference style while preserving text fidelity. On top of this, we apply style-specific LoRAs together with Orthogonal LoRA Fusion to enable continuous and disentangled multi-attribute control, and introduce a Timbre Consistency Optimization module to mitigate timbre drift caused by weakened reference guidance. Experiments show that ReStyle-TTS enables user-friendly, continuous, and relative control over pitch, energy, and multiple emotions while maintaining intelligibility and speaker timbre, and performs robustly in challenging mismatched reference-target style scenarios.
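
DCFG 的要点是把无分类器引导拆成文本与参考两路、各自独立加权:调低参考一路的权重即可弱化对参考风格的依赖。组合方式可示意如下(符号与权重取值为假设,非论文实现):

```python
def decoupled_cfg(eps_uncond, eps_text, eps_ref, w_text=3.0, w_ref=1.0):
    """eps = eps_u + w_t (eps_text - eps_u) + w_r (eps_ref - eps_u)(示意)。
    w_ref 越小,对参考说话风格的依赖越弱;w_text 维持文本保真度。"""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_ref * (eps_ref - eps_uncond))

print(decoupled_cfg(0.0, 1.0, 0.5))   # 标量演示;实际作用于模型预测张量
```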

[AI-95] Microeconomic Foundations of Multi-Agent Learning

【速读】:该论文旨在解决多智能体学习(multi-agent learning)在具有策略性外部性(strategic externalities)的马尔可夫决策过程(Markov decision process)中,因数据、行为与激励内生性(endogenous)而导致的社会福利效率低下问题。解决方案的关键在于提出一种两阶段激励机制:第一阶段估计可实现的转移支付(implementable transfers),第二阶段利用这些转移支付引导长期动态演化;在弱后悔理性(regret-based rationality)和探索条件下,该机制能实现次线性社会福利遗憾(sublinear social-welfare regret),从而渐近达到最优福利水平。

链接: https://arxiv.org/abs/2601.03451
作者: Nassim Helou
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern AI systems increasingly operate inside markets and institutions where data, behavior, and incentives are endogenous. This paper develops an economic foundation for multi-agent learning by studying a principal-agent interaction in a Markov decision process with strategic externalities, where both the principal and the agent learn over time. We propose a two-phase incentive mechanism that first estimates implementable transfers and then uses them to steer long-run dynamics; under mild regret-based rationality and exploration conditions, the mechanism achieves sublinear social-welfare regret and thus asymptotically optimal welfare. Simulations illustrate how even coarse incentives can correct inefficient learning under stateful externalities, highlighting the necessity of incentive-aware design for safe and welfare-aligned AI in markets and insurance.

[AI-96] Discriminating real and synthetic super-resolved audio samples using embedding-based classifiers

【速读】:该论文旨在解决音频超分辨率(Audio Super-Resolution, ADSR)模型中一个关键问题:现有评估方法主要依赖信号级或感知指标,难以准确衡量合成的宽带音频与真实宽带音频在分布上的匹配程度。为填补这一空白,作者提出通过分析真实音频与超分辨率合成音频在多种嵌入空间(embedding spaces)中的可分性来评估模型的分布保真度。解决方案的关键在于训练线性分类器以区分来自真实和合成样本的多类音频嵌入(如频谱、时频特征等),结果表明即使生成音频在主观听感和客观指标上表现优异,嵌入空间中的分类器仍能近乎完美地区分二者,揭示了当前ADSR模型在分布拟合方面的系统性不足。

链接: https://arxiv.org/abs/2601.03443
作者: Mikhail Silaev,Konstantinos Drossos,Tuomas Virtanen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
备注: Accepted for publication in Workshop Proceedings of the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing

点击查看摘要

Abstract:Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band (4→16 kHz) and full-band (16→48 kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
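
论文的评估手段本身很容易复现:对真实与超分样本的嵌入训练一个线性分类器,看其可分性。用 scikit-learn 的最小示意如下(嵌入以随机特征占位,实际应替换为论文所用的音频嵌入):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 占位数据:各 500 条 512 维"嵌入",合成样本整体偏移 0.1 模拟分布差异
real = np.random.randn(500, 512)
synth = np.random.randn(500, 512) + 0.1
X = np.vstack([real, synth])
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("可分性(准确率):", clf.score(X_te, y_te))  # 接近 0.5 才说明分布匹配良好
```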

[AI-97] MARVEL: A Multi Agent-based Research Validator and Enabler using Large Language Models

【速读】:该论文旨在解决科学团队对数字助手日益增长的需求,即能够理解高度技术性的数据、精准引用文献,并在受认证网络内安全运行的领域感知问答与辅助科研工具问题。其解决方案的关键在于提出一个名为MARVEL(Multi Agent-based Research Validator and Enabler)的本地可部署、开源的领域感知问答与辅助科研框架,该框架包含两种模式:快速路径用于简单查询,以及更深入的DeepSearch模式,后者融合了检索增强生成(Retrieval-Augmented Generation, RAG)和蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS),通过探索互补子查询、动态分配计算资源至高潜力分支,并维护全局证据账本以保留来源信息,从而实现精准、可追溯的科学问答。

链接: https://arxiv.org/abs/2601.03436
作者: Nikhil Mukund,Yifang Luo,Fan Zhang,Lisa Barsotti,Erik Katsavounidis
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures

点击查看摘要

Abstract:We present MARVEL (this https URL), a locally deployable, open-source framework for domain-aware question answering and assisted scientific research. It is designed to address the increasing demand from scientific groups for a digital assistant that can read highly technical data, cite precisely, and operate within authenticated networks. MARVEL combines a fast path for straightforward queries with a more deliberate DeepSearch mode that integrates retrieval-augmented generation and Monte Carlo Tree Search. It explores complementary subqueries, allocates more compute to promising branches, and maintains a global evidence ledger that preserves sources during drafting. We applied this framework in the context of gravitational-wave research related to the Laser Interferometer Gravitational-wave Observatory. Answers are grounded in a curated semantic index of research literature, doctoral theses, LIGO documents, and long-running detector electronic logbooks, with targeted web searches when appropriate. Because direct benchmarking against commercial LLMs cannot be performed on private data, we evaluated MARVEL on two publicly available surrogate datasets that capture comparable semantic and technical characteristics. On these benchmarks, MARVEL matches a GPT-4o mini baseline on literature-centric queries and substantially outperforms it on detector-operations content, where domain retrieval and guided reasoning are decisive. By making the complete framework and evaluation datasets openly available, we aim to provide a reproducible foundation for developing domain-specific scientific assistants.

[AI-98] Feedback Indices to Evaluate LLM Responses to Rebuttals for Multiple Choice Type Questions

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在对话中面对用户反驳时的行为评估难题,特别是如何系统性地量化其表现出的“谄媚行为”(sycophantic behavior,即过度迎合用户挑战)与“固执反应”(stubborn responses,即僵化坚持虚构历史回答)现象。解决方案的关键在于提出一种基于虚构响应反驳(fictitious-response rebuttal, FR)的方法,通过设计多选题并附加对模型先前虚构回答的刻意挑战,结合一套专门构建的指标体系来量化上述两类行为模式,并进一步探究它们与模型真实知识掌握程度之间的关系。该框架具有通用性,适用于任何多选题场景,且已在多个OpenAI模型上验证其有效性,结果表明新版本模型及高推理努力(Reasoning Effort)配置下,谄媚倾向显著降低。

链接: https://arxiv.org/abs/2601.03285
作者: Justin C. Dunlap,Anne-Simone Parent,Ralf Widenhorn
机构: 未知
类目: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a systematic framework of indices designed to characterize Large Language Model (LLM) responses when challenged with rebuttals during a chat. Assessing how LLMs respond to user dissent is crucial for understanding their reliability and behavior patterns, yet the complexity of human-LLM interactions makes systematic evaluation challenging. Our approach employs a fictitious-response (FR) rebuttal method that quantifies LLM behavior when presented with multiple-choice questions followed by deliberate challenges to their fictitious previous response. The indices are specifically designed to detect and measure what could be characterized as sycophantic behavior (excessive agreement with user challenges) or stubborn responses (rigid adherence to the fictitious response in the chat history) from LLMs. These metrics allow investigation of the relationships between sycophancy, stubbornness, and the model’s actual mastery of the subject matter. We demonstrate the utility of these indices using two physics problems as test scenarios with various OpenAI models. The framework is intentionally generalizable to any multiple-choice format question, including on topics without universally accepted correct answers. Our results reveal measurable differences across OpenAI model generations, with trends indicating that newer models and those employing greater “Reasoning Effort” exhibit reduced sycophantic behavior. The FR pairing method combined with our proposed indices provides a practical, adaptable toolkit for systematically comparing LLM dialogue behaviors across different models and contexts.
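
指数体系中最直观的一项是"被反驳后改变答案的频率":翻转率高指向谄媚,低则指向固执。最小示意如下(指标名称与定义为笔者简化,论文的指数体系更细致):

```python
def flip_rate(before, after):
    """反驳后答案发生改变的比例(示意):高 -> 倾向谄媚;低 -> 倾向固执。"""
    assert len(before) == len(after)
    return sum(a != b for a, b in zip(before, after)) / len(before)

# 10 道多选题:反驳前答案 vs 反驳后答案
print(flip_rate(list("ABCDABCDAB"), list("ABCDABDDBB")))  # 0.2
```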

[AI-99] AI-Guided Discovery of Novel Ionic Liquid Solvents for Industrial CO2 Capture

【速读】:该论文旨在解决燃煤或炼油厂烟气中二氧化碳(CO₂)捕集效率低、成本高及传统胺类溶剂腐蚀性强等问题,提出一种基于人工智能的新型离子液体(Ionic Liquid, IL)筛选策略。其解决方案的关键在于构建了一个五阶段自动化流程:首先通过阳离子与阴离子组合生成候选IL分子;其次利用图神经网络(Graph Neural Network, GNN)模型预测不同温度和压力下的CO₂溶解度与黏度;再通过Van’t Hoff模型将溶解度转化为工作容量和再生能耗;随后采用帕累托优化(Pareto Optimization)筛选多目标最优候选物;最后结合合成可行性进行过滤。该方法成功识别出36种具备高工作容量、低黏度、低再生能耗和可合成性的IL候选物,有望实现5–10%运行成本(OPEX)节约和最高达10%资本支出(CAPEX)降低,为炼厂碳捕集提供了一种高效、可持续的新路径。

链接: https://arxiv.org/abs/2601.03284
作者: Davide Garbelotto,Alexander Lobo,Urvi Awasthi,Oleg Medvedev,Srayanta Mukherjee,Anton Aristov,Konstantin Polunin,Alex De Mur,Leonid Zhukov,Azad Huseynov,Murad Abdullayev
机构: 未知
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 21 pages, 15 figures

点击查看摘要

Abstract:We present an AI-driven approach to discover compounds with optimal properties for CO2 capture from flue gas, the primary source of refinery emissions. Focusing on ionic liquids (ILs) as alternatives to traditional amine-based solvents, we successfully identify new IL candidates with high working capacity, manageable viscosity, favorable regeneration energy, and viable synthetic routes. Our approach follows a five-stage pipeline. First, we generate IL candidates by pairing available cation and anion molecules, then predict temperature- and pressure-dependent CO2 solubility and viscosity using a GNN-based molecular property prediction model. Next, we convert solubility to working capacity and regeneration energy via Van’t Hoff modeling, and then find the best set of candidates using Pareto optimization, before finally filtering those based on feasible synthesis routes. We identify 36 feasible candidates that could enable 5-10% OPEX savings and up to 10% CAPEX reductions through lower regeneration energy requirements and reduced corrosivity, offering a novel carbon-capture strategy for refineries moving forward.
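
The Pareto stage of the pipeline is easy to sketch. The toy candidate values and objective signs below are assumptions; the abstract only states that candidates are screened on working capacity, viscosity, and regeneration energy:

```python
import numpy as np

def pareto_front(points):
    """Return indices of non-dominated rows, assuming every column is to be
    maximized (minimization objectives are sign-flipped before the call)."""
    idx = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            idx.append(i)
    return idx

# Hypothetical candidates: (working capacity up, -viscosity up, -regen energy up)
cands = np.array([
    [0.12, -45.0, -80.0],
    [0.10, -30.0, -75.0],
    [0.08, -60.0, -90.0],
])
print(pareto_front(cands))  # -> [0, 1]; candidate 2 is dominated by candidate 0
```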

[AI-100] A Quantum Model for Constrained Markowitz Modern Portfolio Using Slack Variables to Process Mixed-Binary Optimization under QAOA

【速读】: This paper addresses the difficulty of effectively encoding inequality constraints when applying quantum algorithms to financial optimization, focusing on Markowitz portfolio optimization. The key to the solution is embedding slack variables directly into the problem Hamiltonian: each slack variable is assigned a dedicated ancilla qubit, which transforms the constrained problem into a Quadratic Unconstrained Binary Optimization (QUBO) form suitable for the Quantum Approximate Optimization Algorithm (QAOA). The constraints are thereby internalized in the quantum state, reshaping the problem's energy landscape to facilitate optimization; in simulation, the approach proves more stable and accurate than the conventional penalty-based formulation.

链接: https://arxiv.org/abs/2601.03278
作者: Pablo Thomassin,Guillaume Guerard,Sonia Djebali,Vincent Marc Lambert
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effectively encoding inequality constraints is a primary obstacle in applying quantum algorithms to financial optimization. A quantum model for Markowitz portfolio optimization is presented that resolves this by embedding slack variables directly into the problem Hamiltonian. The method maps each slack variable to a dedicated ancilla qubit, transforming the problem into a Quadratic Unconstrained Binary Optimization (QUBO) formulation suitable for the Quantum Approximate Optimization Algorithm (QAOA). This process internalizes the constraints within the quantum state, altering the problem’s energy landscape to facilitate optimization. The model is empirically validated through simulation, showing it consistently finds the optimal portfolio where a standard penalty-based QAOA fails. This work demonstrates that modifying the Hamiltonian architecture via a slack-ancilla scheme provides a robust and effective pathway for solving constrained optimization problems on quantum computers. A fundamental quantum limit on the simultaneous precision of portfolio risk and return is also posited.
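
A rough sketch of the slack-variable QUBO construction the abstract describes, on a toy Markowitz instance. The penalty weight, problem sizes, and brute-force solver are illustrative stand-ins (the paper solves the QUBO with QAOA):

```python
import itertools
import numpy as np

def qubo_with_slack(mu, Sigma, lam, budget, n_slack, P=10.0):
    """Toy Markowitz QUBO: minimize -mu.x + lam*x'Sigma*x subject to
    sum(x) <= budget. The inequality becomes an equality via binary slack
    bits s (each mapped to an 'ancilla'): sum(x) + sum(2^k s_k) = budget."""
    n = len(mu)
    m = n + n_slack
    c = np.concatenate([np.ones(n), 2.0 ** np.arange(n_slack)])
    Q = np.zeros((m, m))
    Q[:n, :n] += lam * Sigma                      # risk term
    Q[np.arange(n), np.arange(n)] -= mu           # return term (x_i^2 = x_i)
    Q += P * np.outer(c, c)                       # quadratic part of the penalty
    Q[np.arange(m), np.arange(m)] -= 2.0 * P * budget * c
    return Q                                      # constant P*budget^2 dropped

mu = np.array([0.10, 0.08, 0.12])
Sigma = np.diag([0.05, 0.03, 0.08])
Q = qubo_with_slack(mu, Sigma, lam=0.5, budget=2, n_slack=2)
# Stand-in for QAOA: brute-force the 2^5 bitstrings (3 assets + 2 slack bits).
z_best = min(itertools.product([0, 1], repeat=Q.shape[0]),
             key=lambda z: np.array(z) @ Q @ np.array(z))
print(z_best)  # -> (1, 0, 1, 0, 0): pick assets 0 and 2, zero slack
```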

机器学习

[LG-0] Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition

链接: https://arxiv.org/abs/2601.04181
作者: Nia Touko,Matthew O A Ellis,Cristiano Capone,Alessio Burrello,Elisa Donati,Luca Manneschi
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Reliable long-term decoding of surface electromyography (EMG) is hindered by signal drift caused by electrode shifts, muscle fatigue, and posture changes. While state-of-the-art models achieve high intra-session accuracy, their performance often degrades sharply. Existing solutions typically demand large datasets or high-compute pipelines that are impractical for energy-efficient wearables. We propose a lightweight framework for Test-Time Adaptation (TTA) using a Temporal Convolutional Network (TCN) backbone. We introduce three deployment-ready strategies: (i) causal adaptive batch normalization for real-time statistical alignment; (ii) a Gaussian Mixture Model (GMM) alignment with experience replay to prevent forgetting; and (iii) meta-learning for rapid, few-shot calibration. Evaluated on the NinaPro DB6 multi-session dataset, our framework significantly bridges the inter-session accuracy gap with minimal overhead. Our results show that experience-replay updates yield superior stability under limited data, while meta-learning achieves competitive performance in one- and two-shot regimes using only a fraction of the data required by current benchmarks. This work establishes a path toward robust, “plug-and-play” myoelectric control for long-term prosthetic use.
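
Of the three strategies, causal adaptive batch normalization is the simplest to illustrate. The sketch below renormalizes each incoming EMG feature frame using statistics accumulated from past frames only; the momentum value and the crude running-variance update are assumptions, not the paper's exact recipe:

```python
import numpy as np

class CausalAdaptiveBN:
    """Test-time statistic tracking: running mean/variance are updated
    causally (past samples only), so normalization follows slow drift in
    the EMG feature distribution without any labels or retraining."""
    def __init__(self, dim, momentum=0.05, eps=1e-5):
        self.mu = np.zeros(dim)
        self.var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x):  # x: (dim,) one incoming feature frame
        self.mu = (1 - self.momentum) * self.mu + self.momentum * x
        self.var = (1 - self.momentum) * self.var + self.momentum * (x - self.mu) ** 2
        return (x - self.mu) / np.sqrt(self.var + self.eps)

bn = CausalAdaptiveBN(dim=8)
for t in range(200):                       # simulated drifting stream
    frame = np.random.randn(8) + 0.01 * t  # slow electrode-shift-like drift
    z = bn(frame)
print(np.round(z, 2))  # the running stats have tracked most of the drift
```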

[LG-1] Robust Physics Discovery from Highly Corrupted Data: A PINN Framework Applied to the Nonlinear Schrödinger Equation

链接: https://arxiv.org/abs/2601.04176
作者: Pietro de Oliveira Esteves
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注: 9 pages, 4 figures, 2 tables. Code available at this https URL

点击查看摘要

Abstract:We demonstrate a deep learning framework capable of recovering physical parameters from the Nonlinear Schrödinger Equation (NLSE) under severe noise conditions. By integrating Physics-Informed Neural Networks (PINNs) with automatic differentiation, we achieve reconstruction of the nonlinear coefficient beta with less than 0.2 percent relative error using only 500 sparse, randomly sampled data points corrupted by 20 percent additive Gaussian noise, a regime where traditional finite difference methods typically fail due to noise amplification in numerical derivatives. We validate the method’s generalization capabilities across different physical regimes (beta between 0.5 and 2.0) and varying data availability (between 100 and 1000 training points), demonstrating consistent sub-1 percent accuracy. Statistical analysis over multiple independent runs confirms robustness (standard deviation less than 0.15 percent for beta equals 1.0). The complete pipeline executes in approximately 80 minutes on modest cloud GPU resources (NVIDIA Tesla T4), making the approach accessible for widespread adoption. Our results indicate that physics-based regularization acts as an effective filter against high measurement uncertainty, positioning PINNs as a viable alternative to traditional optimization methods for inverse problems in spatiotemporal dynamics where experimental data is scarce and noisy. All code is made publicly available to facilitate reproducibility.
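
A minimal sketch of the core PINN ingredient, the autograd-computed PDE residual. The NLSE is written here as i*psi_t + 0.5*psi_xx + beta*|psi|^2*psi = 0, a common convention that the abstract does not spell out, and the tiny network is a placeholder:

```python
import torch

beta = torch.tensor(1.0, requires_grad=True)       # unknown physics parameter
net = torch.nn.Sequential(                          # maps (x, t) -> (Re psi, Im psi)
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)

def nlse_residual(x, t):
    """Mean squared residual of i*psi_t + 0.5*psi_xx + beta*|psi|^2*psi = 0."""
    xt = torch.stack([x, t], dim=-1)
    u, v = net(xt).unbind(-1)                       # real and imaginary parts
    grads = lambda f: torch.autograd.grad(f.sum(), (x, t), create_graph=True)
    u_x, u_t = grads(u)
    v_x, v_t = grads(v)
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    v_xx = torch.autograd.grad(v_x.sum(), x, create_graph=True)[0]
    sq = u**2 + v**2
    f_re = -v_t + 0.5 * u_xx + beta * sq * u        # real part of the residual
    f_im = u_t + 0.5 * v_xx + beta * sq * v         # imaginary part
    return (f_re**2 + f_im**2).mean()

x = torch.rand(256, requires_grad=True)
t = torch.rand(256, requires_grad=True)
loss = nlse_residual(x, t)  # added to a noisy-data MSE term during training
loss.backward()             # gradients flow to both the network and beta
```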

[LG-2] Agentic Rubrics as Contextual Verifiers for SWE Agents

链接: https://arxiv.org/abs/2601.04171
作者: Mohit Raghavendra,Anisha Gunjal,Bing Liu,Yunzhong He
类目: Machine Learning (cs.LG)
备注: 31 pages, 11 Figures

点击查看摘要

Abstract:Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.

[LG-3] Causal Data Augmentation for Robust Fine-Tuning of Tabular Foundation Models

链接: https://arxiv.org/abs/2601.04110
作者: Magnus Bühler,Lennart Purucker,Frank Hutter
类目: Machine Learning (cs.LG)
备注: Accepted for oral presentation at the EurIPS 2025 Workshop on AI for Tabular Data (Copenhagen)

点击查看摘要

Abstract:Fine-tuning tabular foundation models (TFMs) under data scarcity is challenging, as early stopping on even scarcer validation data often fails to capture true generalization performance. We propose CausalMixFT, a method that enhances fine-tuning robustness and downstream performance by generating structurally consistent synthetic samples using Structural Causal Models (SCMs) fitted on the target dataset. This approach augments limited real data with causally informed synthetic examples, preserving feature dependencies while expanding training diversity. Evaluated across 33 classification datasets from TabArena and over 2300 fine-tuning runs, our CausalMixFT method consistently improves median normalized ROC-AUC from 0.10 (standard fine-tuning) to 0.12, outperforming purely statistical generators such as CTGAN (-0.01), TabEBM (-0.04), and TableAugment (-0.09). Moreover, it narrows the median validation-test performance correlation gap from 0.67 to 0.30, enabling more reliable validation-based early stopping, a key step toward improving fine-tuning stability under data scarcity. These results demonstrate that incorporating causal structure into data augmentation provides an effective and principled route to fine-tuning tabular foundation models in low-data regimes.
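
The abstract does not specify how the SCMs are fitted, so the sketch below uses the simplest stand-in, a linear-Gaussian SCM under an assumed causal ordering, just to show the fit-then-sample augmentation loop:

```python
import numpy as np

def fit_linear_scm(X, order):
    """Fit a linear-Gaussian SCM under an assumed causal ordering: each
    feature is regressed on its predecessors; noise scale = residual std."""
    n, _ = X.shape
    weights, sigmas = [], []
    for j, col in enumerate(order):
        parents = order[:j]
        if parents:
            A = np.c_[X[:, parents], np.ones(n)]
            w, *_ = np.linalg.lstsq(A, X[:, col], rcond=None)
            resid = X[:, col] - A @ w
        else:
            w, resid = np.array([X[:, col].mean()]), X[:, col] - X[:, col].mean()
        weights.append(w)
        sigmas.append(resid.std())
    return weights, sigmas

def sample_scm(weights, sigmas, order, n):
    """Ancestral sampling: generate features in causal order."""
    S = np.zeros((n, len(order)))
    for j, col in enumerate(order):
        parents = order[:j]
        A = np.c_[S[:, parents], np.ones(n)] if parents else np.ones((n, 1))
        S[:, col] = A @ weights[j] + np.random.randn(n) * sigmas[j]
    return S

X = np.random.randn(200, 3); X[:, 2] += 0.8 * X[:, 0]    # toy tabular data
w, s = fit_linear_scm(X, order=[0, 1, 2])
synthetic = sample_scm(w, s, order=[0, 1, 2], n=500)      # mixed into fine-tuning
```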

[LG-4] Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning

链接: https://arxiv.org/abs/2601.04083
作者: Marvin Illian,Ramin Khalili,Antonio A. de A. Rocha,Lin Wang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
备注: 11 pages, 12 figures

点击查看摘要

Abstract:The widespread deployment of 5G networks, together with the coexistence of 4G/LTE networks, provides mobile devices a diverse set of candidate cells to connect to. However, associating mobile devices to cells to maximize overall network performance, a.k.a. cell (re)selection, remains a key challenge for mobile operators. Today, cell (re)selection parameters are typically configured manually based on operator experience and rarely adapted to dynamic network conditions. In this work, we ask: Can an agent automatically learn and adapt cell (re)selection parameters to consistently improve network performance? We present a reinforcement learning (RL)-based framework called CellPilot that adaptively tunes cell (re)selection parameters by learning spatiotemporal patterns of mobile network dynamics. Our study with real-world data demonstrates that even a lightweight RL agent can outperform conventional heuristic reconfigurations by up to 167%, while generalizing effectively across different network scenarios. These results indicate that data-driven approaches can significantly improve cell (re)selection configurations and enhance mobile network performance.

[LG-5] Minimum distance classification for nonlinear dynamical systems

链接: https://arxiv.org/abs/2601.04058
作者: Dominique Martinez
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We address the problem of classifying trajectory data generated by some nonlinear dynamics, where each class corresponds to a distinct dynamical system. We propose Dynafit, a kernel-based method for learning a distance metric between training trajectories and the underlying dynamics. New observations are assigned to the class with the most similar dynamics according to the learned metric. The learning algorithm approximates the Koopman operator which globally linearizes the dynamics in a (potentially infinite) feature space associated with a kernel function. The distance metric is computed in feature space independently of its dimensionality by using the kernel trick common in machine learning. We also show that the kernel function can be tailored to incorporate partial knowledge of the dynamics when available. Dynafit is applicable to various classification tasks involving nonlinear dynamical systems and sensors. We illustrate its effectiveness on three examples: chaos detection with the logistic map, recognition of handwritten dynamics and of visual dynamic textures.

[LG-6] Using Legacy Polysomnography Data to Train a Radar System to Quantify Sleep in Older Adults and People living with Dementia

链接: https://arxiv.org/abs/2601.04057
作者: M. Yin,K. G. Ravindran,C. Hadjipanayi,A. Bannon,A. Rapeaux,C. Della Monica,T. S. Lande,Derk-Jan Dijk,T. G. Constandinou
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Objective: Ultra-wideband radar technology offers a promising solution for unobtrusive and cost-effective in-home sleep monitoring. However, the limited availability of radar sleep data poses challenges in building robust models that generalize across diverse cohorts and environments. This study proposes a novel deep transfer learning framework to enhance sleep stage classification using radar data. Methods: An end-to-end neural network was developed to classify sleep stages based on nocturnal respiratory and motion signals. The network was trained using a combination of large-scale polysomnography (PSG) datasets and radar data. A domain adaptation approach employing adversarial learning was utilized to bridge the knowledge gap between PSG and radar signals. Validation was performed on a radar dataset of 47 older adults (mean age: 71.2), including 18 participants with prodromal or mild Alzheimer disease. Results: The proposed network structure achieves an accuracy of 79.5% with a Kappa value of 0.65 when classifying wakefulness, rapid eye movement, light sleep and deep sleep. Experimental results confirm that our deep transfer learning approach significantly enhances automatic sleep staging performance in the target domain. Conclusion: This method effectively addresses challenges associated with data variability and limited sample size, substantially improving the reliability of automatic sleep staging models, especially in contexts where radar data is limited. Significance: The findings underscore the viability of UWB radar as a nonintrusive, forward-looking sleep assessment tool that could significantly benefit care for older people and people with neurodegenerative disorders.
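
Domain adaptation via adversarial learning is typically implemented with a gradient reversal layer; the sketch below shows that standard trick. The encoder and discriminator shapes are placeholders, and the paper's exact adversarial setup may differ:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the
    backward pass: the domain head learns to tell PSG from radar features,
    while the shared encoder is pushed to make them indistinguishable."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

encoder = torch.nn.Linear(32, 16)        # shared feature extractor (stand-in)
domain_head = torch.nn.Linear(16, 2)     # PSG vs. radar discriminator

feats = encoder(torch.randn(8, 32))
logits = domain_head(GradReverse.apply(feats, 1.0))
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()                          # encoder receives reversed gradients
```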

[LG-7] LinkD: AutoRegressive Diffusion Model for Mechanical Linkage Synthesis

链接: https://arxiv.org/abs/2601.04054
作者: Yayati Jadhav,Amir Barati Farimani
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Designing mechanical linkages to achieve target end-effector trajectories presents a fundamental challenge due to the intricate coupling between continuous node placements, discrete topological configurations, and nonlinear kinematic constraints. The highly nonlinear motion-to-configuration relationship means small perturbations in joint positions drastically alter trajectories, while the combinatorially expanding design space renders conventional optimization and heuristic methods computationally intractable. We introduce an autoregressive diffusion framework that exploits the dyadic nature of linkage assembly by representing mechanisms as sequentially constructed graphs, where nodes correspond to joints and edges to rigid links. Our approach combines a causal transformer with a Denoising Diffusion Probabilistic Model (DDPM), both conditioned on target trajectories encoded via a transformer encoder. The causal transformer autoregressively predicts discrete topology node-by-node, while the DDPM refines each node’s spatial coordinates and edge connectivity to previously generated nodes. This sequential generation enables adaptive trial-and-error synthesis where problematic nodes exhibiting kinematic locking or collisions can be selectively regenerated, allowing autonomous correction of degenerate configurations during design. Our graph-based, data-driven methodology surpasses traditional optimization approaches, enabling scalable inverse design that generalizes to mechanisms with arbitrary node counts. We demonstrate successful synthesis of linkage systems containing up to 20 nodes with extensibility to N-node architectures. This work advances autoregressive graph generation methodologies and computational kinematic synthesis, establishing new paradigms for scalable inverse design of complex mechanical systems.

[LG-8] Symbolic Regression for Shared Expressions: Introducing Partial Parameter Sharing

链接: https://arxiv.org/abs/2601.04051
作者: Viktor Martinek,Roland Herzog
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Symbolic Regression aims to find symbolic expressions that describe datasets. Due to better interpretability, it is a machine learning paradigm particularly powerful for scientific discovery. In recent years, several works have expanded the concept to allow the description of similar phenomena using a single expression with varying sets of parameters, thereby introducing categorical variables. Some previous works allow only “non-shared” (category-value-specific) parameters, and others also incorporate “shared” (category-value-agnostic) parameters. We expand upon those efforts by considering multiple categorical variables, and introducing intermediate levels of parameter sharing. With two categorical variables, an intermediate level of parameter sharing emerges, i.e., parameters which are shared across either category but change across the other. The new approach potentially decreases the number of parameters, while revealing additional information about the problem. Using a synthetic, fitting-only example, we test the limits of this setup in terms of data requirement reduction and transfer learning. As a real-world symbolic regression example, we demonstrate the benefits of the proposed approach on an astrophysics dataset used in a previous study, which considered only one categorical variable. We achieve a similar fit quality but require significantly fewer individual parameters, and extract additional information about the problem.
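
A sketch of partial parameter sharing on a toy fitted expression. The exponential form, and which parameter is shared across which categorical variable, are illustrative assumptions:

```python
import torch

# Fit y = a * exp(-b * x) where, hypothetically, `a` is shared across
# categorical variable 1 but varies with variable 2, and `b` does the
# opposite: the "intermediate" sharing level described in the abstract.
n_cat1, n_cat2 = 3, 4
a = torch.nn.Parameter(torch.ones(n_cat2))   # shared across cat1, varies with cat2
b = torch.nn.Parameter(torch.ones(n_cat1))   # shared across cat2, varies with cat1

def model(x, c1, c2):
    return a[c2] * torch.exp(-b[c1] * x)

x = torch.rand(32)
c1 = torch.randint(0, n_cat1, (32,))
c2 = torch.randint(0, n_cat2, (32,))
y = 2.0 * torch.exp(-0.5 * x)                # toy target
loss = ((model(x, c1, c2) - y) ** 2).mean()
loss.backward()                              # only 3 + 4 parameters, not 3 * 4 per symbol
```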

[LG-9] Modeling Behavioral Patterns in News Recommendations Using Fuzzy Neural Networks ECIR’26

链接: https://arxiv.org/abs/2601.04019
作者: Kevin Innerebner,Stephan Bartl,Markus Reiter-Haas,Elisabeth Lex
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注: Accepted for the IR for Good track at ECIR’26

点击查看摘要

Abstract:News recommender systems are increasingly driven by black-box models, offering little transparency for editorial decision-making. In this work, we introduce a transparent recommender system that uses fuzzy neural networks to learn human-readable rules from behavioral data for predicting article clicks. By extracting the rules at configurable thresholds, we can control rule complexity and thus, the level of interpretability. We evaluate our approach on two publicly available news datasets (i.e., MIND and EB-NeRD) and show that we can accurately predict click behavior compared to several established baselines, while learning human-readable rules. Furthermore, we show that the learned rules reveal news consumption patterns, enabling editors to align content curation goals with target audience behavior.

[LG-10] Using Small Language Models to Reverse-Engineer Machine Learning Pipelines Structures

链接: https://arxiv.org/abs/2601.03988
作者: Nicolas Lacroix,Mireille Blay-Fornarino,Sébastien Mosser,Frederic Precioso
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
备注: SANER 2026 Registered Report

点击查看摘要

Abstract:Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key for gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches either depend on non-scalable, manual labeling, or on ML classifiers that do not properly support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions. Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations, and subsequently how they can advance our understanding of data science practices. Method: We conduct a confirmatory study based on two reference works selected for their relevance regarding current state-of-the-art’s limitations. First, we compare several SLMs using Cochran’s Q test. The best-performing model is then evaluated against the reference studies using two distinct McNemar’s tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran’s Q test. Finally, a goodness-of-fit analysis is conducted using Pearson’s chi-squared tests to compare our insights on data science practices with those from prior studies.
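
For readers unfamiliar with the paired tests used here, a minimal exact McNemar's test on per-item correctness vectors looks like this (the correctness data below is, of course, made up):

```python
from scipy.stats import binomtest

def mcnemar_exact(model_a_correct, model_b_correct):
    """Exact McNemar's test on paired per-item correctness: only items
    where the two classifiers disagree carry information."""
    b = sum(a and not c for a, c in zip(model_a_correct, model_b_correct))
    c = sum((not a) and c for a, c in zip(model_a_correct, model_b_correct))
    if b + c == 0:
        return 1.0
    return binomtest(b, n=b + c, p=0.5).pvalue

slm = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # hypothetical per-snippet correctness
ref = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]
print(mcnemar_exact(slm, ref))         # p-value; small values -> significant gap
```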

[LG-11] Stage-specific cancer survival prediction enriched by explainable machine learning

链接: https://arxiv.org/abs/2601.03977
作者: Parisa Poorhasani,Bogdan Iancu
类目: Machine Learning (cs.LG)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:Despite the fact that cancer survivability rates vary greatly between stages, traditional survival prediction models have frequently been trained and assessed using examples from all combined phases of the disease. This method may result in an overestimation of performance and ignore the stage-specific variations. Using the SEER dataset, we created and verified explainable machine learning (ML) models to predict stage-specific cancer survivability in colorectal, stomach, and liver cancers. ML-based cancer survival analysis has been a long-standing topic in the literature; however, studies involving the explainability and transparency of ML survivability models are limited. Our use of explainability techniques, including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), enabled us to illustrate significant feature-cancer stage interactions that would have remained hidden in traditional black-box models. We identified how certain demographic and clinical variables influenced survival differently across cancer stages and types. These insights provide not only transparency but also clinical relevance, supporting personalized treatment planning. By focusing on stage-specific models, this study provides new insights into the most important factors at each stage of cancer, offering transparency and potential clinical relevance to support personalized treatment planning.

[LG-12] Lightweight and perceptually-guided voice conversion for electro-laryngeal speech ICASSP26

链接: https://arxiv.org/abs/2601.03892
作者: Benedikt Mayrhofer,Franz Pernkopf,Philipp Aichinger,Martin Hagmüller
类目: ound (cs.SD); Machine Learning (cs.LG)
备注: 5 pages, 5 figures. Audio samples available at this https URL Preprint submitted to ICASSP

点击查看摘要

Abstract:Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, reducing naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC framework to this setting by removing pitch and energy modules and combining self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm their influence: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), drastically reduces character error rate (CER) of EL inputs, raises naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion architectures to EL voice rehabilitation while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.

[LG-13] Feature-Aware One-Shot Federated Learning via Hierarchical Token Sequences

链接: https://arxiv.org/abs/2601.03882
作者: Shudong Liu,Hanwen Zhang,Xiuling Wang,Yuesheng Zhu,Guibo Luo
类目: Machine Learning (cs.LG)
备注: 9 pages; 6 figures

点击查看摘要

Abstract:One-shot federated learning (OSFL) reduces the communication cost and privacy risks of iterative federated learning by constructing a global model with a single round of communication. However, most existing methods struggle to achieve robust performance on real-world domains such as medical imaging, or are inefficient when handling non-IID (not independent and identically distributed) data. To address these limitations, we introduce FALCON, a framework that enhances the effectiveness of OSFL over non-IID image data. The core idea of FALCON is to bring feature-aware hierarchical token-sequence generation and knowledge distillation into OSFL. First, each client leverages a pretrained visual encoder with hierarchical scale encoding to compress images into hierarchical token sequences, which capture multi-scale semantics. Second, a multi-scale autoregressive transformer generator is used to model the distribution of these token sequences and generate the synthetic sequences. Third, clients upload the synthetic sequences along with the local classifier trained on the real token sequences to the server. Finally, the server incorporates knowledge distillation into global training to reduce reliance on precise distribution modeling. Experiments on medical and natural image datasets validate the effectiveness of FALCON in diverse non-IID scenarios, outperforming the best OSFL baselines by 9.58% in average accuracy.

[LG-14] From No-Regret to Strategically Robust Learning in Repeated Auctions

链接: https://arxiv.org/abs/2601.03853
作者: Junyao Zhao
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
备注:

点击查看摘要

Abstract:In Bayesian single-item auctions, a monotone bidding strategy–one that prescribes a higher bid for a higher value type–can be equivalently represented as a partition of the quantile space into consecutive intervals corresponding to increasing bids. Kumar et al. (2024) prove that agile online gradient descent (OGD), when used to update a monotone bidding strategy through its quantile representation, is strategically robust in repeated first-price auctions: when all bidders employ agile OGD in this way, the auctioneer’s average revenue per round is at most the revenue of Myerson’s optimal auction, regardless of how she adjusts the reserve price over time. In this work, we show that this strategic robustness guarantee is not unique to agile OGD or to the first-price auction: any no-regret learning algorithm, when fed gradient feedback with respect to the quantile representation, is strategically robust, even if the auction format changes every round, provided the format satisfies allocation monotonicity and voluntary participation. In particular, the multiplicative weights update (MWU) algorithm simultaneously achieves the optimal regret guarantee and the best-known strategic robustness guarantee. At a technical level, our results are established via a simple relation that bridges Myerson’s auction theory and standard no-regret learning theory. This showcases the potential of translating standard regret guarantees into strategic robustness guarantees for specific games, without explicitly minimizing any form of swap regret.

[LG-15] Beyond Physical Labels: Redefining Domains for Robust WiFi-based Gesture Recognition

链接: https://arxiv.org/abs/2601.03825
作者: Xiang Zhang,Huan Yan,Jinyang Huang,Bin Liu,Yuanhao Feng,Jianchun Liu,Meng Li,Fusang Zhang,Zhi Liu
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted by IMWUT/Ubicomp 2026

点击查看摘要

Abstract:In this paper, we propose GesFi, a novel WiFi-based gesture recognition system that introduces WiFi latent domain mining to redefine domains directly from the data itself. GesFi first processes raw sensing data collected from WiFi receivers using CSI-ratio denoising, Short-Time Fast Fourier Transform, and visualization techniques to generate standardized input representations. It then employs class-wise adversarial learning to suppress gesture semantic and leverages unsupervised clustering to automatically uncover latent domain factors responsible for distributional shifts. These latent domains are then aligned through adversarial learning to support robust cross-domain generalization. Finally, the system is applied to the target environment for robust gesture inference. We deployed GesFi under both single-pair and multi-pair settings using commodity WiFi transceivers, and evaluated it across multiple public datasets and real-world environments. Compared to state-of-the-art baselines, GesFi achieves up to 78% and 50% performance improvements over existing adversarial methods, and consistently outperforms prior generalization approaches across most cross-domain tasks.

[LG-16] Detecting Semantic Backdoors in a Mystery Shopping Scenario

链接: https://arxiv.org/abs/2601.03805
作者: Arpad Berta,Gabor Danner,Istvan Hegedus,Mark Jelasity
类目: Machine Learning (cs.LG)
备注: Source code available at this https URL

点击查看摘要

Abstract:Detecting semantic backdoors in classification models–where some classes can be activated by certain natural, but out-of-distribution inputs–is an important problem that has received relatively little attention. Semantic backdoors are significantly harder to detect than backdoors that are based on trigger patterns due to the lack of such clearly identifiable patterns. We tackle this problem under the assumption that the clean training dataset and the training recipe of the model are both known. These assumptions are motivated by a consumer protection scenario, in which the responsible authority performs mystery shopping to test a machine learning service provider. In this scenario, the authority uses the provider’s resources and tools to train a model on a given dataset and tests whether the provider included a backdoor. In our proposed approach, the authority creates a reference model pool by training a small number of clean and poisoned models using trusted infrastructure, and calibrates a model distance threshold to identify clean models. We propose and experimentally analyze a number of approaches to compute model distances and we also test a scenario where the provider performs an adaptive attack to avoid detection. The most reliable method is based on requesting adversarial training from the provider. The model distance is best measured using a set of input samples generated by inverting the models in such a way as to maximize the distance from clean samples. With these settings, our method can often completely separate clean and poisoned models, and it proves to be superior to state-of-the-art backdoor detectors as well.

[LG-17] Quantum vs. Classical Machine Learning: A Benchmark Study for Financial Prediction

链接: https://arxiv.org/abs/2601.03802
作者: Rehan Ahmad,Muhammad Kashif,Nouhaila Innan,Muhammad Shafique
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:In this paper, we present a reproducible benchmarking framework that systematically compares QML models with architecture-matched classical counterparts across three financial tasks: (i) directional return prediction on U.S. and Turkish equities, (ii) live-trading simulation with Quantum LSTMs versus classical LSTMs on the S&P 500, and (iii) realized volatility forecasting using Quantum Support Vector Regression. By standardizing data splits, features, and evaluation metrics, our study provides a fair assessment of when current-generation QML models can match or exceed classical methods. Our results reveal that quantum approaches show performance gains when data structure and circuit design are well aligned. In directional classification, hybrid quantum neural networks surpass the parameter-matched ANN by +3.8 AUC and +3.4 accuracy points on AAPL stock and by +4.9 AUC and +3.6 accuracy points on Turkish stock KCHOL. In live trading, the QLSTM achieves higher risk-adjusted returns in two of four S&P 500 regimes. For volatility forecasting, an angle-encoded QSVR attains the lowest QLIKE on KCHOL and remains within about 0.02-0.04 QLIKE of the best classical kernels on S&P 500 and AAPL. Our benchmarking framework clearly identifies the scenarios where current QML architectures offer tangible improvements and where established classical methods continue to dominate.

[LG-18] Prompt Tuning without Labeled Samples for Zero-Shot Node Classification in Text-Attributed Graphs WSDM2026

链接: https://arxiv.org/abs/2601.03793
作者: Sethupathy Parameswaran,Suresh Sundaram,Yuan Fang
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注: Accepted by WSDM 2026

点击查看摘要

Abstract:Node classification is a fundamental problem in information retrieval with many real-world applications, such as community detection in social networks, grouping articles published online and product categorization in e-commerce. Zero-shot node classification in text-attributed graphs (TAGs) presents a significant challenge, particularly due to the absence of labeled data. In this paper, we propose a novel Zero-shot Prompt Tuning (ZPT) framework to address this problem by leveraging a Universal Bimodal Conditional Generator (UBCG). Our approach begins with pre-training a graph-language model to capture both the graph structure and the associated textual descriptions of each node. Following this, a conditional generative model is trained to learn the joint distribution of nodes in both graph and text modalities, enabling the generation of synthetic samples for each class based solely on the class name. These synthetic node and text embeddings are subsequently used to perform continuous prompt tuning, facilitating effective node classification in a zero-shot setting. Furthermore, we conduct extensive experiments on multiple benchmark datasets, demonstrating that our framework performs better than existing state-of-the-art baselines. We also provide ablation studies to validate the contribution of the bimodal generator. The code is provided at: this https URL.

[LG-19] Improving Compactness and Reducing Ambiguity of CFIRE Rule-Based Explanations

链接: https://arxiv.org/abs/2601.03776
作者: Sebastian Müller,Tobias Schneider,Ruben Kemna,Vanessa Toborek
类目: Machine Learning (cs.LG)
备注: Prepared for ESANN 2026 submission

点击查看摘要

Abstract:Models trained on tabular data are widely used in sensitive domains, increasing the demand for explanation methods to meet transparency needs. CFIRE is a recent algorithm in this domain that constructs compact surrogate rule models from local explanations. While effective, CFIRE may assign rules associated with different classes to the same sample, introducing ambiguity. We investigate this ambiguity and propose a post-hoc pruning strategy that removes rules with low contribution or conflicting coverage, yielding smaller and less ambiguous models while preserving fidelity. Experiments across multiple datasets confirm these improvements with minimal impact on predictive performance.

[LG-20] Probabilistic Transformers for Joint Modeling of Global Weather Dynamics and Decision-Centric Variables

链接: https://arxiv.org/abs/2601.03753
作者: Paulius Rauba,Viktor Cikojevic,Fran Bartolic,Sam Levang,Ty Dickinson,Chase Dwelle
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Weather forecasts sit upstream of high-stakes decisions in domains such as grid operations, aviation, agriculture, and emergency response. Yet forecast users often face a difficult trade-off. Many decision-relevant targets are functionals of the atmospheric state variables, such as extrema, accumulations, and threshold exceedances, rather than state variables themselves. As a result, users must estimate these targets via post-processing, which can be suboptimal and can introduce structural bias. The core issue is that decisions depend on distributions over these functionals that the model is not trained to learn directly. In this work, we introduce GEM-2, a probabilistic transformer that jointly learns global atmospheric dynamics alongside a suite of variables that users directly act upon. Using this training recipe, we show that a lightweight (~275M params) and computationally efficient (~20-100x training speedup relative to state-of-the-art) transformer trained on the CRPS objective can directly outperform operational numerical weather prediction (NWP) models and be competitive with ML models that rely on expensive multi-step diffusion processes or require bespoke multi-stage fine-tuning strategies. We further demonstrate state-of-the-art economic value metrics under decision-theoretic evaluation, stable convergence to climatology at S2S and seasonal timescales, and a surprising insensitivity to many commonly assumed architectural and training design choices.
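
The CRPS objective mentioned above has a simple sample-based form, CRPS(F, y) = E|X - y| - 0.5 E|X - X'| with X, X' independent draws from the forecast distribution F; a fair (unbiased) ensemble estimator is a few lines:

```python
import numpy as np

def crps_ensemble(samples, y):
    """Unbiased sample estimate of CRPS(F, y) = E|X - y| - 0.5 E|X - X'|
    from m ensemble members (the 'fair' CRPS, using i != j pairs only)."""
    samples = np.asarray(samples, dtype=float)
    m = len(samples)
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).sum() / (m * (m - 1))
    return term1 - 0.5 * term2

ens = np.random.normal(loc=15.0, scale=2.0, size=50)  # e.g. 2 m temperature (°C)
print(crps_ensemble(ens, y=16.3))
```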

[LG-21] EDCO: Dynamic Curriculum Orchestration for Domain-specific Large Language Model Fine-tuning

链接: https://arxiv.org/abs/2601.03725
作者: Jing-Cheng Pang,Liu Sun,Chang Zhou,Xian Tang,Haichuan Ma,Kun Jiang,Jianlong Wang,Kai Zhang,Sijie Wu,Haoran Cai,Chenwei Wu,Xubin Li,Xin Chen
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Domain-specific large language models (LLMs), typically developed by fine-tuning a pre-trained general-purpose LLM on specialized datasets, represent a significant advancement in applied AI. A common strategy in LLM fine-tuning is curriculum learning, which pre-orders training samples based on metrics like difficulty to improve learning efficiency compared to a random sampling strategy. However, most existing methods for LLM fine-tuning rely on a static curriculum, designed prior to training, which lacks adaptability to the model’s evolving needs during fine-tuning. To address this, we propose EDCO, a novel framework based on two key concepts: inference entropy and dynamic curriculum orchestration. Inspired by recent findings that maintaining high answer entropy benefits long-term reasoning gains, EDCO prioritizes samples with high inference entropy in a continuously adapted curriculum. EDCO integrates three core components: an efficient entropy estimator that uses prefix tokens to approximate full-sequence entropy, an entropy-based curriculum generator that selects data points with the highest inference entropy, and an LLM trainer that optimizes the model on the selected curriculum. In comprehensive experiments across the communication, medicine, and law domains, EDCO outperforms traditional curriculum strategies for fine-tuning Qwen3-4B and Llama3.2-3B models under supervised and reinforcement learning settings. Furthermore, the proposed efficient entropy estimation reduces computational time by 83.5% while maintaining high accuracy.
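
A sketch of the prefix-token entropy estimator and the greedy high-entropy selection it could feed; the prefix length, selection fraction, and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def prefix_entropy(logits, k=32):
    """Mean per-token entropy over the first k generated tokens, used as a
    cheap proxy for full-sequence inference entropy."""
    logp = F.log_softmax(logits[:, :k, :], dim=-1)
    return -(logp.exp() * logp).sum(-1).mean(dim=1)   # shape: (batch,)

def select_curriculum(logits, fraction=0.25):
    """Pick the highest-entropy fraction of candidate samples for this round."""
    ent = prefix_entropy(logits)
    n_keep = max(1, int(fraction * len(ent)))
    return torch.topk(ent, n_keep).indices

logits = torch.randn(64, 128, 1000)   # (samples, seq_len, vocab) from a dry run
idx = select_curriculum(logits)       # indices handed to the trainer next
```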

[LG-22] ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization

链接: https://arxiv.org/abs/2601.03723
作者: Shijie Zhang,Kevin Zhang,Zheyuan Gu,Xiang Guo,Rujun Guo,Shaoyu Liu,Guanjun Jiang,Xiaozhao Wang
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an important paradigm for unlocking reasoning capabilities in large language models, exemplified by the success of OpenAI o1 and DeepSeek-R1. Currently, Group Relative Policy Optimization (GRPO) stands as the dominant algorithm in this domain due to its stable training and critic-free efficiency. However, we argue that GRPO suffers from a structural limitation: it imposes a uniform, static trust region constraint across all samples. This design implicitly assumes signal homogeneity, a premise misaligned with the heterogeneous nature of outcome-driven learning, where advantage magnitudes and variances fluctuate significantly. Consequently, static constraints fail to fully exploit high-quality signals while insufficiently suppressing noise, often precipitating rapid entropy collapse. To address this, we propose Elastic Trust Regions (ETR), a dynamic mechanism that aligns optimization constraints with signal quality. ETR constructs a signal-aware landscape through dual-level elasticity: at the micro level, it scales clipping boundaries based on advantage magnitude to accelerate learning from high-confidence paths; at the macro level, it leverages group variance to implicitly allocate larger update budgets to tasks in the optimal learning zone. Extensive experiments on AIME and MATH benchmarks demonstrate that ETR consistently outperforms GRPO, achieving superior accuracy while effectively mitigating policy entropy degradation to ensure sustained exploration.
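
The dual-level elasticity is not fully specified in the abstract, but the micro-level idea, widening the PPO clip range with advantage magnitude, can be sketched as follows. The tanh scaling rule and the alpha knob are assumptions, not the paper's formula:

```python
import torch

def etr_loss(ratio, adv, eps=0.2, alpha=0.5):
    """PPO-style clipped objective with an advantage-scaled ('elastic')
    trust region: larger |A| widens the clip range, so high-confidence
    signals support bigger per-sample updates."""
    scale = 1.0 + alpha * torch.tanh(adv.abs())     # per-sample elasticity
    lo, hi = 1.0 - eps * scale, 1.0 + eps * scale
    unclipped = ratio * adv
    clipped = torch.min(torch.max(ratio, lo), hi) * adv
    return -torch.minimum(unclipped, clipped).mean()

ratio = torch.exp(torch.randn(8) * 0.1)   # pi_new / pi_old per sample
adv = torch.randn(8)                       # group-relative advantages
print(etr_loss(ratio, adv))
```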

[LG-23] The Geometry of the Pivot: A Note on Lazy Pivoted Cholesky and Farthest Point Sampling

链接: https://arxiv.org/abs/2601.03706
作者: Gil Shabat
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Low-rank approximations of large kernel matrices are ubiquitous in machine learning, particularly for scaling Gaussian Processes to massive datasets. The Pivoted Cholesky decomposition is a standard tool for this task, offering a computationally efficient, greedy low-rank approximation. While its algebraic properties are well-documented in numerical linear algebra, its geometric intuition within the context of kernel methods often remains obscure. In this note, we elucidate the geometric interpretation of the algorithm within the Reproducing Kernel Hilbert Space (RKHS). We demonstrate that the pivotal selection step is mathematically equivalent to Farthest Point Sampling (FPS) using the kernel metric, and that the Cholesky factor construction is an implicit Gram-Schmidt orthogonalization. We provide a concise derivation and a minimalist Python implementation to bridge the gap between theory and practice.
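
The abstract promises a minimalist Python implementation, so here is one consistent with its description: the pivot rule (largest remaining residual diagonal) matches the farthest-point-sampling view in the kernel metric, and each column update is an implicit Gram-Schmidt step. This is the textbook algorithm, not necessarily the authors' exact code:

```python
import numpy as np

def pivoted_cholesky(K, rank):
    """Greedy pivoted Cholesky of a PSD kernel matrix K (n x n)."""
    n = K.shape[0]
    d = np.diag(K).astype(float)      # residual diagonal (copy)
    L = np.zeros((n, rank))
    piv = []
    for j in range(rank):
        i = int(np.argmax(d))         # farthest remaining point in the RKHS
        piv.append(i)
        # Gram-Schmidt against the columns built so far, then normalize.
        L[:, j] = (K[:, i] - L @ L[i]) / np.sqrt(d[i])
        d -= L[:, j] ** 2
        d[i] = 0.0                    # never re-pick the same pivot
    return L, piv

X = np.random.randn(100, 2)
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))  # RBF kernel
L, piv = pivoted_cholesky(K, rank=10)
print(np.linalg.norm(K - L @ L.T))    # residual shrinks as rank grows
```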

[LG-24] Rethinking Recurrent Neural Networks for Time Series Forecasting: A Reinforced Recurrent Encoder with Prediction-Oriented Proximal Policy Optimization

链接: https://arxiv.org/abs/2601.03683
作者: Xin Lai,Shiming Deng,Lu Yu,Yumin Lai,Shenghao Qiao,Xinze Zhang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Time series forecasting plays a crucial role in contemporary engineering information systems for supporting decision-making across various industries, where Recurrent Neural Networks (RNNs) have been widely adopted due to their capability in modeling sequential data. Conventional RNN-based predictors adopt an encoder-only strategy with sliding historical windows as inputs to forecast future values. However, this approach treats all time steps and hidden states equally without considering their distinct contributions to forecasting, leading to suboptimal performance. To address this limitation, we propose a novel Reinforced Recurrent Encoder with Prediction-oriented Proximal Policy Optimization, RRE-PPO4Pred, which significantly improves time series modeling capacity and forecasting accuracy of the RNN models. The core innovations of this method are: (1) A novel Reinforced Recurrent Encoder (RRE) framework that enhances RNNs by formulating their internal adaptation as a Markov Decision Process, creating a unified decision environment capable of learning input feature selection, hidden skip connection, and output target selection; (2) An improved Prediction-oriented Proximal Policy Optimization algorithm, termed PPO4Pred, which is equipped with a Transformer-based agent for temporal reasoning and develops a dynamic transition sampling strategy to enhance sampling efficiency; (3) A co-evolutionary optimization paradigm to facilitate the learning of the RNN predictor and the policy agent, providing adaptive and interactive time series modeling. Comprehensive evaluations on five real-world datasets indicate that our method consistently outperforms existing baselines, and attains accuracy better than state-of-the-art Transformer models, thus providing an advanced time series predictor in engineering informatics.

[LG-25] Accounting for Optimal Control in the Sizing of Isolated Hybrid Renewable Energy Systems Using Imitation Learning

链接: https://arxiv.org/abs/2601.03679
作者: Simon Halvdansson,Lucas Ferreira Bernardino,Brage Rugstad Knudsen
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
备注: 11 pages, 9 figures

点击查看摘要

Abstract:Decarbonization of isolated or off-grid energy systems through phase-in of large shares of intermittent solar or wind generation requires co-installation of energy storage or continued use of existing fossil dispatchable power sources to balance supply and demand. The effective CO2 emission reduction depends on the relative capacity of the energy storage and renewable sources, the stochasticity of the renewable generation, and the optimal control or dispatch of the isolated energy system. While the operations of the energy storage and dispatchable sources may impact the optimal sizing of the system, it is challenging to account for the effect of finite horizon, optimal control at the stage of system sizing. Here, we present a flexible and computationally efficient sizing framework for energy storage and renewable capacity in isolated energy systems, accounting for uncertainty in the renewable generation and the optimal feedback control. To this end, we implement an imitation learning approach to stochastic neural model predictive control (MPC) which allows us to relate the battery storage and wind peak capacities to the emissions reduction and investment costs while accounting for finite horizon, optimal control. Through this approach, decision makers can evaluate the effective emission reduction and costs of different storage and wind capacities at any price point while accounting for uncertainty in the renewable generation with limited foresight. We evaluate the proposed sizing framework on a case study of an offshore energy system with a gas turbine, a wind farm and a battery energy storage system (BESS). In this case, we find a nonlinear, nontrivial relationship between the investment costs and reduction in gas usage relative to the wind and BESS capacities, emphasizing the complexity and importance of accounting for optimal control in the design of isolated energy systems.

[LG-26] Stochastic Voronoi Ensembles for Anomaly Detection

链接: https://arxiv.org/abs/2601.03664
作者: Yang Cao
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Anomaly detection aims to identify data instances that deviate significantly from the majority of the data; the task is widely used in fraud detection, network security, and industrial quality control. Existing methods struggle with datasets exhibiting varying local densities: distance-based methods miss local anomalies, while density-based approaches require careful parameter selection and incur quadratic time complexity. We observe that local anomalies, though indistinguishable under global analysis, become conspicuous when the data space is decomposed into restricted regions and each region is examined independently. Leveraging this geometric insight, we propose SVEAD (Stochastic Voronoi Ensembles Anomaly Detector), which constructs an ensemble of random Voronoi diagrams and scores points by normalized cell-relative distances weighted by local scale. The proposed method achieves linear time complexity and constant space complexity. Experiments on 45 datasets demonstrate that SVEAD outperforms 12 state-of-the-art approaches.
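
A sketch of the core scoring idea, with the caveat that this naive version is O(n·k) per estimator rather than the linear-time, constant-space construction the paper claims, and all hyperparameters are guesses:

```python
import numpy as np

def voronoi_ensemble_scores(X, n_estimators=50, n_centers=16, seed=0):
    """Ensemble of random Voronoi partitions: each estimator picks random
    center points; every sample is scored by its distance to the nearest
    center, normalized by that cell's local scale (mean member distance)."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_estimators):
        centers = X[rng.choice(len(X), n_centers, replace=False)]
        dist = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
        cell = dist.argmin(1)                     # Voronoi cell assignment
        d_near = dist[np.arange(len(X)), cell]
        local = np.array([d_near[cell == c].mean() + 1e-12
                          for c in range(n_centers)])
        scores += d_near / local[cell]            # cell-relative distance
    return scores / n_estimators

X = np.r_[np.random.randn(200, 2), np.random.randn(5, 2) * 0.3 + 4]
s = voronoi_ensemble_scores(X)
print(s[-5:].mean() > s[:200].mean())             # outliers tend to score higher
```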

[LG-27] Quantum Classical Ridgelet Neural Network For Time Series Model

链接: https://arxiv.org/abs/2601.03654
作者: Bahadur Yadav,Sanjay Kumar Mohanty
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Algebra (math.QA)
备注:

点击查看摘要

Abstract:In this study, we present a quantum computing method that incorporates ridgelet transforms into quantum processing pipelines for time series data. Here, a ridgelet neural network is integrated with a single-qubit quantum computing method, which improves feature extraction and forecasting capabilities. Furthermore, experimental results on financial time series data demonstrate the superior performance of our model compared to existing models.

[LG-28] Kantorovich-Type Stochastic Neural Network Operators for the Mean-Square Approximation of Certain Second-Order Stochastic Processes

链接: https://arxiv.org/abs/2601.03634
作者: Sachin Saini,Uaday Singh
类目: Machine Learning (cs.LG); Probability (math.PR)
备注: 18 Pages, 7 Figures

点击查看摘要

Abstract:Artificial neural network operators (ANNOs) have been widely used for approximating deterministic input-output functions; however, their extension to random dynamics remains comparatively unexplored. In this paper, we construct a new class of Kantorovich-type Stochastic Neural Network Operators (K-SNNOs) in which randomness is incorporated not at the coefficient level, but through stochastic neurons driven by stochastic integrators. This framework enables the operator to inherit the probabilistic structure of the underlying process, making it suitable for modeling and approximating stochastic signals. We establish mean-square convergence of K-SNNOs to the target stochastic process and derive quantitative error estimates expressing the rate of approximation in terms of the modulus of continuity. Numerical simulations further validate the theoretical results by demonstrating accurate reconstruction of sample paths and rapid decay of the mean square error (MSE). Graphical results, including sample-wise approximations and empirical MSE behaviour, illustrate the robustness and effectiveness of the proposed stochastic-neuron-based operator.

[LG-29] Learning Shortest Paths When Data is Scarce

链接: https://arxiv.org/abs/2601.03629
作者: Dmytro Matsypura,Yu Pan,Hanzhao Wang
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Digital twins and other simulators are increasingly used to support routing decisions in large-scale networks. However, simulator outputs often exhibit systematic bias, while ground-truth measurements are costly and scarce. We study a stochastic shortest-path problem in which a planner has access to abundant synthetic samples, limited real-world observations, and an edge-similarity structure capturing expected behavioral similarity across links. We model the simulator-to-reality discrepancy as an unknown, edge-specific bias that varies smoothly over the similarity graph, and estimate it using Laplacian-regularized least squares. This approach yields calibrated edge cost estimates even in data-scarce regimes. We establish finite-sample error bounds, translate estimation error into path-level suboptimality guarantees, and propose a computable, data-driven certificate that verifies near-optimality of a candidate route. For cold-start settings without initial real data, we develop a bias-aware active learning algorithm that leverages the simulator and adaptively selects edges to measure until a prescribed accuracy is met. Numerical experiments on multiple road networks and traffic graphs further demonstrate the effectiveness of our methods.
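
The Laplacian-regularized least-squares step has a closed form: minimizing ||b_obs - r||^2 + lam * b'Lb gives (S + lam*L) b = S y, where S masks the observed edges. A toy sketch on an assumed 4-edge similarity graph:

```python
import numpy as np

def estimate_bias(L, observed_idx, residuals, lam=1.0):
    """Laplacian-regularized least squares on the edge-similarity graph:
    propagates measured simulator-vs-reality discrepancies to unmeasured,
    behaviorally similar edges."""
    n = L.shape[0]
    S = np.zeros((n, n))
    S[observed_idx, observed_idx] = 1.0   # selector for observed edges
    y = np.zeros(n)
    y[observed_idx] = residuals           # real-minus-simulated cost gaps
    return np.linalg.solve(S + lam * L, y)

# Hypothetical 4-edge similarity graph: edges 0-1 and 2-3 behave alike.
A = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], float)
L = np.diag(A.sum(1)) - A                 # graph Laplacian
bias = estimate_bias(L, observed_idx=[0, 2], residuals=[2.0, -1.0], lam=0.5)
print(bias)  # -> [ 2.  2. -1. -1.]: unmeasured edges inherit their neighbor's bias
```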

[LG-30] Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias

链接: https://arxiv.org/abs/2601.03612
作者: Joonwon Seo
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Monograph. Code available at this https URL

点击查看摘要

Abstract:This monograph introduces a novel approach to polyphonic music generation by addressing the “Missing Middle” problem through structural inductive bias. Focusing on Beethoven’s piano sonatas as a case study, we empirically verify the independence of pitch and hand attributes using normalized mutual information (NMI=0.167) and propose the Smart Embedding architecture, achieving a 48.30% reduction in parameters. We provide rigorous mathematical proofs using information theory (negligible loss bounded at 0.153 bits), Rademacher complexity (28.09% tighter generalization bound), and category theory to demonstrate improved stability and generalization. Empirical results show a 9.47% reduction in validation loss, confirmed by SVD analysis and an expert listening study (N=53). This dual theoretical and applied framework bridges gaps in AI music generation, offering verifiable insights for mathematically grounded deep learning.
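
The independence check that motivates the factored design is a one-liner with scikit-learn; the note and hand sequences below are fabricated stand-ins for attributes extracted from the sonatas:

```python
from sklearn.metrics import normalized_mutual_info_score

# If NMI between pitch and hand is near 0, the two attributes can be
# embedded independently, the premise behind splitting one large joint
# embedding into two small factored ones ("Smart Embedding").
pitch = [60, 62, 64, 60, 67, 65, 64, 62]
hand = [0, 0, 1, 0, 1, 1, 0, 1]   # 0 = left hand, 1 = right hand
print(normalized_mutual_info_score(pitch, hand))
```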

[LG-31] Shielded RecRL: Explanation Generation for Recommender Systems without Ranking Degradation

链接: https://arxiv.org/abs/2601.03608
作者: Ansh Tiwari,Ayush Chauhan
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Shielded RecRL, a reinforcement learning approach to generate personalized explanations for recommender systems without sacrificing the system’s original ranking performance. Unlike prior RLHF-based recommender methods that directly optimize item rankings, our two-tower architecture keeps the recommender’s ranking model intact while a language model learns to produce helpful explanations. We design a composite reward signal combining explanation length, content relevance, and coherence, and apply proximal policy optimization (PPO) with a KL-divergence constraint to fine-tune a large language model with only 0.4% of its parameters trainable via LoRA adapters. In experiments on an Amazon Books dataset (approximately 50K interactions in the fantasy and romance genres), Shielded RecRL improved the relative click-through rate (CTR) by 22.5% (1.225x over baseline) while keeping the recommender’s item-ranking behavior virtually unchanged. An extensive ablation study confirms that our gradient shielding strategy and reward design effectively balance explanation quality and policy drift. Our results demonstrate that Shielded RecRL enhances user-facing aspects of recommendations through rich, personalized explanations without degrading core recommendation accuracy.

[LG-32] A Comparative Study of Traditional Machine Learning, Deep Learning and Large Language Models for Mental Health Forecasting using Smartphone Sensing Data

链接: https://arxiv.org/abs/2601.03603
作者: Kaidong Feng,Zhu Sun,Roy Ka-Wei Lee,Xun Jiang,Yin-Leng Theng,Yi Ding
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Smartphone sensing offers an unobtrusive and scalable way to track daily behaviors linked to mental health, capturing changes in sleep, mobility, and phone use that often precede symptoms of stress, anxiety, or depression. While most prior studies focus on detection that responds to existing conditions, forecasting mental health enables proactive support through Just-in-Time Adaptive Interventions. In this paper, we present the first comprehensive benchmarking study comparing traditional machine learning (ML), deep learning (DL), and large language model (LLM) approaches for mental health forecasting using the College Experience Sensing (CES) dataset, the most extensive longitudinal dataset of college student mental health to date. We systematically evaluate models across temporal windows, feature granularities, personalization strategies, and class imbalance handling. Our results show that DL models, particularly Transformer (Macro-F1 = 0.58), achieve the best overall performance, while LLMs show strength in contextual reasoning but weaker temporal modeling. Personalization substantially improves forecasts of severe mental health states. By revealing how different modeling approaches interpret phone sensing behavioral data over time, this work lays the groundwork for next-generation, adaptive, and human-centered mental health technologies that can advance both research and real-world well-being.

[LG-33] Local Gradient Regulation Stabilizes Federated Learning under Client Heterogeneity

链接: https://arxiv.org/abs/2601.03584
作者: Ping Luo,Jiahuan Wang,Ziqing Wen,Tao Sun,Dongsheng Li
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training across distributed clients without sharing raw data, yet its stability is fundamentally challenged by statistical heterogeneity in realistic deployments. Here, we show that client heterogeneity destabilizes FL primarily by distorting local gradient dynamics during client-side optimization, causing systematic drift that accumulates across communication rounds and impedes global convergence. This observation highlights local gradients as a key regulatory lever for stabilizing heterogeneous FL systems. Building on this insight, we develop a general client-side perspective that regulates local gradient contributions without incurring additional communication overhead. Inspired by swarm intelligence, we instantiate this perspective through Exploratory–Convergent Gradient Re-aggregation (ECGR), which balances well-aligned and misaligned gradient components to preserve informative updates while suppressing destabilizing effects. Theoretical analysis and extensive experiments, including evaluations on the LC25000 medical imaging dataset, demonstrate that regulating local gradient dynamics consistently stabilizes federated learning across state-of-the-art methods under heterogeneous data distributions.
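
ECGR's exact re-aggregation rule is not given in the abstract; the sketch below only shows the general flavor, splitting a client gradient into components aligned with and orthogonal to a reference direction and damping the misaligned part. The choice of reference and the damping factor are assumptions:

```python
import numpy as np

def reaggregate(local_grad, reference_grad, beta=0.5):
    """Decompose the local gradient relative to a reference direction
    (e.g. the previous global update) and damp the misaligned component,
    preserving informative updates while suppressing destabilizing drift."""
    ref = reference_grad / (np.linalg.norm(reference_grad) + 1e-12)
    aligned = np.dot(local_grad, ref) * ref
    misaligned = local_grad - aligned
    return aligned + beta * misaligned

g_local = np.array([1.0, -2.0, 0.5])    # a client's raw gradient
g_global = np.array([1.0, 1.0, 0.0])    # reference (e.g. last global step)
print(reaggregate(g_local, g_global))
```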

[LG-34] Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts

Link: https://arxiv.org/abs/2601.03577
Authors: Ye Su, Yong Liu
Subjects: Machine Learning (cs.LG)
Comments: 27 pages, 3 figures

Abstract:Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. However, their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, lacking a cohesive theoretical underpinning. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms to minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, defining it as the NP-hard sparse subset selection problem. We rigorously prove the existence of a “Coherence Barrier”; when expert representations exhibit high mutual coherence, greedy routing strategies theoretically fail to recover the optimal expert subset. Importantly, we formally verify that imposing geometric orthogonality in the expert feature space is sufficient to narrow the divide between the NP-hard global optimum and polynomial-time greedy approximation. Our comparative analyses confirm orthogonality regularization as the optimal engineering relaxation for large-scale models. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
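
The orthogonality regularization the analysis endorses is commonly implemented as a soft penalty on the Gram matrix of expert representations; a minimal sketch (this standard penalty is our illustration, not necessarily the authors' exact formulation):

```python
import torch

def orthogonality_penalty(expert_reps):
    """Soft penalty on mutual coherence among expert representations.
    expert_reps: (num_experts, dim) tensor, one row per expert."""
    w = torch.nn.functional.normalize(expert_reps, dim=1)  # unit-norm rows
    gram = w @ w.T                                         # cosine similarities
    eye = torch.eye(w.shape[0], device=w.device)
    return ((gram - eye) ** 2).sum()                       # ||WW^T - I||_F^2
```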

[LG-35] Local Intrinsic Dimensionality of Ground Motion Data for Early Detection of Complex Catastrophic Slope Failure

Link: https://arxiv.org/abs/2601.03569
Authors: Yuansan Liu, Antoinette Tordesillas, James Bailey
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 9 pages, 7 figures

Abstract:Local Intrinsic Dimensionality (LID) has shown strong potential for identifying anomalies and outliers in high-dimensional data across a wide range of real-world applications, including landslide failure detection in granular media. Early and accurate identification of failure zones in landslide-prone areas is crucial for effective geohazard mitigation. While existing approaches typically rely on surface displacement data analyzed through statistical or machine learning techniques, they often fall short in capturing both the spatial correlations and temporal dynamics that are inherent in such data. To address this gap, we focus on ground-monitored landslides and introduce a novel approach that jointly incorporates spatial and temporal information, enabling the detection of complex landslides, including multiple successive failures occurring in distinct areas of the same slope. To be specific, our method builds upon an existing LID-based technique, known as sLID. We extend its capabilities in three key ways. (1) Kinematic enhancement: we incorporate velocity into the sLID computation to better capture short-term temporal dependencies and deformation rate relationships. (2) Spatial fusion: we apply Bayesian estimation to aggregate sLID values across spatial neighborhoods, effectively embedding spatial correlations into the LID scores. (3) Temporal modeling: we introduce a temporal variant, tLID, that learns long-term dynamics from time series data, providing a robust temporal representation of displacement behavior. Finally, we integrate both components into a unified framework, referred to as spatiotemporal LID (stLID), to identify samples that are anomalous in either or both dimensions. Extensive experiments show that stLID consistently outperforms existing methods in failure detection precision and lead-time.
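
For context, the k-nearest-neighbor maximum-likelihood estimator that sLID-style methods build on can be written in a few lines; the velocity augmentation, Bayesian spatial fusion, and temporal components of stLID are not reproduced in this sketch:

```python
import numpy as np

def lid_mle(knn_dists):
    """Maximum-likelihood LID estimate from a point's sorted distances to
    its k nearest neighbors (the Levina-Bickel / Amsaleg estimator)."""
    d = np.sort(np.asarray(knn_dists, dtype=float))
    r_max = d[-1]
    # LID ~= -1 / mean_i log(r_i / r_max), taken over the k-1 inner neighbors
    return -1.0 / np.mean(np.log(d[:-1] / r_max))
```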

[LG-36] Green's-Function Spherical Neural Operators for Biological Heterogeneity

Link: https://arxiv.org/abs/2601.03561
Authors: Hao Tang, Hao Chen, Hao Li, Chao Li
Subjects: Machine Learning (cs.LG)

Abstract:Spherical deep learning has been widely applied to a broad range of real-world problems. Existing approaches often face challenges in balancing strong spherical geometric inductive biases with the need to model real-world heterogeneity. To solve this while retaining spherical geometry, we first introduce a designable Green's function framework (DGF) that provides a new strategy for constructing spherical operator solutions: designing systematic Green's functions under the rotation group. Based on DGF, to model biological heterogeneity, we propose the Green's-Function Spherical Neural Operator (GSNO), which fuses three operator solutions: (1) an Equivariant Solution derived from an Equivariant Green's Function for symmetry-consistent modeling; (2) an Invariant Solution derived from an Invariant Green's Function to eliminate nuisance heterogeneity, e.g., a consistent background field; (3) an Anisotropic Solution derived from an Anisotropic Green's Function to model anisotropic systems, especially fibers with a preferred direction. The resulting model, GSNO, can therefore adapt to real-world heterogeneous systems with nuisance variability and anisotropy while retaining spectral efficiency. Evaluations on spherical MNIST, the Shallow Water Equation, diffusion MRI fiber prediction, cortical parcellation and molecule structure modeling demonstrate the superiority of GSNO.

[LG-37] From Bits to Chips: An LLM-based Hardware-Aware Quantization Agent for Streamlined Deployment of LLMs

Link: https://arxiv.org/abs/2601.03484
Authors: Kaiyuan Deng, Hangyu Zheng, Minghai Qing, Kunxiong Zhu, Gen Li, Yang Xiao, Lan Emily Zhang, Linke Guo, Bo Hui, Yanzhi Wang, Geng Yuan, Gagan Agrawal, Wei Niu, Xiaolong Ma
Subjects: Machine Learning (cs.LG)

Abstract:Deploying models, especially large language models (LLMs), is becoming increasingly attractive to a broader user base, including those without specialized expertise. However, due to the resource constraints of certain hardware, maintaining high accuracy with larger models while meeting the hardware requirements remains a significant challenge. Model quantization techniques help mitigate memory and compute bottlenecks, yet the added complexities of tuning and deploying quantized models further exacerbate these challenges, making the process unfriendly to most users. We introduce the Hardware-Aware Quantization Agent (HAQA), an automated framework that leverages LLMs to streamline the entire quantization and deployment process by enabling efficient hyperparameter tuning and hardware configuration, thereby simultaneously improving deployment quality and ease of use for a broad range of users. Our results demonstrate up to a 2.3x speedup in inference, along with increased throughput and improved accuracy compared to unoptimized models on Llama. Additionally, HAQA is designed to implement adaptive quantization strategies across diverse hardware platforms, as it automatically finds optimal settings even when they appear counterintuitive, thereby reducing extensive manual effort and demonstrating superior adaptability. Code will be released.

[LG-38] Hybrid Approach for Driver Behavior Analysis with Machine Learning, Feature Optimization, and Explainable AI

Link: https://arxiv.org/abs/2601.03477
Authors: Mehedi Hasan Shuvo, Md. Raihan Tapader, Nur Mohammad Tamjid, Sajjadul Islam, Ahnaf Atef Choudhury, Jia Uddin
Subjects: Machine Learning (cs.LG)

Abstract:Progressive driver behavior analytics is crucial for improving road safety and mitigating the issues caused by aggressive or inattentive driving. Previous studies have employed machine learning and deep learning techniques, which often result in low feature optimization, thereby compromising both high performance and interpretability. To fill these voids, this paper proposes a hybrid approach to driver behavior analysis that uses a 12,857-row and 18-column data set taken from Kaggle. After applying preprocessing techniques such as label encoding, random oversampling, and standard scaling, 13 machine learning algorithms were tested. The Random Forest Classifier achieved an accuracy of 95%. After deploying the LIME technique in XAI, the top 10 features with the most significant positive and negative influence on accuracy were identified, and the same algorithms were retrained. The accuracy of the Random Forest Classifier decreased slightly to 94.2%, confirming that the efficiency of the model can be improved without sacrificing performance. This hybrid model can provide a return on investment in terms of the predictive power and explainability of the driver behavior process.
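
A minimal sketch of the preprocessing-plus-classifier portion of this pipeline in scikit-learn; the dataset columns and the LIME feature-selection step are omitted, and the hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

def train_driver_model(X, y, seed=0):
    """Random-oversample every class to the majority count, standard-scale
    the features, and fit a Random Forest classifier."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    Xs, ys = [], []
    for c in classes:
        Xc = X[y == c]
        idx = rng.integers(0, len(Xc), size=n_max)  # sample with replacement
        Xs.append(Xc[idx])
        ys.append(np.full(n_max, c))
    Xb, yb = np.vstack(Xs), np.concatenate(ys)
    scaler = StandardScaler().fit(Xb)
    clf = RandomForestClassifier(n_estimators=300, random_state=seed)
    clf.fit(scaler.transform(Xb), yb)
    return scaler, clf
```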

[LG-39] Provable Acceleration of Distributed Optimization with Local Updates

Link: https://arxiv.org/abs/2601.03442
Authors: Zuang Wang, Yongqiang Wang
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:In conventional distributed optimization, each agent performs a single local update between two communication rounds with its neighbors to synchronize solutions. Inspired by the success of using multiple local updates in federated learning, incorporating local updates into distributed optimization has recently attracted increasing attention. However, unlike federated learning, where multiple local updates can accelerate learning by improving gradient estimation under mini-batch settings, it remains unclear whether similar benefits hold in distributed optimization when gradients are exact. Moreover, existing theoretical results typically require reducing the step size when multiple local updates are employed, which can entirely offset any potential benefit of these additional local updates and obscure their true impact on convergence. In this paper, we focus on the classic DIGing algorithm and leverage the tight performance bounds provided by Performance Estimation Problems (PEP) to show that incorporating local updates can indeed accelerate distributed optimization. To the best of our knowledge, this is the first rigorous demonstration of such acceleration for a broad class of objective functions. Our analysis further reveals that, under an appropriate step size, performing only two local updates is sufficient to achieve the maximal possible improvement, and that additional local updates provide no further gains. Because more updates increase computational cost, these findings offer practical guidance for efficient implementation. Extensive experiments on both synthetic and real-world datasets corroborate the theoretical findings.
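
For reference, the DIGing iteration combines gossip mixing with gradient tracking; the sketch below adds an inner loop of local updates between communication rounds, which is our reading of the scheme rather than the paper's exact algorithm:

```python
import numpy as np

def diging_local(x, W, grad, alpha=0.05, rounds=100, local_steps=2):
    """DIGing-style decentralized optimization with gradient tracking.
    x: (n_agents, dim) local iterates; W: doubly stochastic mixing matrix;
    grad(i, xi): exact local gradient of agent i at xi. Each communication
    round is followed by `local_steps` local updates."""
    n = x.shape[0]
    g_prev = np.stack([grad(i, x[i]) for i in range(n)])
    y = g_prev.copy()                  # tracker of the average gradient
    for _ in range(rounds):
        x = W @ x                      # communicate: mix iterates
        y = W @ y                      # communicate: mix trackers
        for _ in range(local_steps):   # local updates, no communication
            x = x - alpha * y
            g = np.stack([grad(i, x[i]) for i in range(n)])
            y = y + g - g_prev         # keep the tracker consistent
            g_prev = g
    return x
```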

[LG-40] VNU-Bench: A Benchmarking Dataset for Multi-Source Multimodal News Video Understanding

Link: https://arxiv.org/abs/2601.03434
Authors: Zibo Liu, Muyang Li, Zhe Jiang, Shigang Chen
Subjects: Machine Learning (cs.LG)

Abstract:News videos are carefully edited multimodal narratives that combine narration, visuals, and external quotations into coherent storylines. In recent years, there have been significant advances in evaluating multimodal large language models (MLLMs) for news video understanding. However, existing benchmarks largely focus on single-source, intra-video reasoning, where each report is processed in isolation. In contrast, real-world news consumption is inherently multi-sourced: the same event is reported by different outlets with complementary details, distinct narrative choices, and sometimes conflicting claims that unfold over time. Robust news understanding, therefore, requires models to compare perspectives from different sources, align multimodal evidence across sources, and synthesize multi-source information. To fill this gap, we introduce VNU-Bench, the first benchmark for multi-source, cross-video understanding in the news domain. We design a set of new question types that are unique in testing models' ability to understand multi-source multimodal news from a variety of different angles. We design a novel hybrid human-model QA generation process that addresses the issues of scalability and quality control in building a large dataset for cross-source news understanding. The dataset comprises 429 news groups, 1,405 videos, and 2,501 high-quality questions. Comprehensive evaluation of both closed- and open-source multimodal models shows that VNU-Bench poses substantial challenges for current MLLMs.

[LG-41] DeepLeak: Privacy Enhancing Hardening of Model Explanations Against Membership Leakage

Link: https://arxiv.org/abs/2601.03429
Authors: Firas Ben Hmida, Zain Sbeih, Philemon Hailemariam, Birhanu Eshete
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures, 8 tables. This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (IEEE SaTML 2026)

Abstract:Machine learning (ML) explainability is central to algorithmic transparency in high-stakes settings such as predictive diagnostics and loan approval. However, these same domains require rigorous privacy guarantees, creating tension between interpretability and privacy. Although prior work has shown that explanation methods can leak membership information, practitioners still lack systematic guidance on selecting or deploying explanation techniques that balance transparency with privacy. We present DeepLeak, a system to audit and mitigate privacy risks in post-hoc explanation methods. DeepLeak advances the state-of-the-art in three ways: (1) comprehensive leakage profiling: we develop a stronger explanation-aware membership inference attack (MIA) to quantify how much representative explanation methods leak membership information under default configurations; (2) lightweight hardening strategies: we introduce practical, model-agnostic mitigations, including sensitivity-calibrated noise, attribution clipping, and masking, that substantially reduce membership leakage while preserving explanation utility; and (3) root-cause analysis: through controlled experiments, we pinpoint algorithmic properties (e.g., attribution sparsity and sensitivity) that drive leakage. Evaluating 15 explanation techniques across four families on image benchmarks, DeepLeak shows that default settings can leak up to 74.9% more membership information than previously reported. Our mitigations cut leakage by up to 95% (minimum 46.5%) with only ≤3.3% utility loss on average. DeepLeak offers a systematic, reproducible path to safer explainability in privacy-sensitive ML.
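
The three mitigations named (sensitivity-calibrated noise, attribution clipping, and masking) amount to simple post-processing of an attribution map; a sketch with illustrative percentiles and noise scale, not the paper's calibrated values:

```python
import numpy as np

def harden_attributions(attr, noise_scale=0.05, clip_pct=99.0, mask_pct=20.0):
    """Post-process an attribution map: clip extremes, mask the least
    informative entries, then add noise scaled to the map's own spread."""
    a = np.asarray(attr, dtype=float).copy()
    hi = np.percentile(np.abs(a), clip_pct)
    a = np.clip(a, -hi, hi)                                  # clipping
    a[np.abs(a) < np.percentile(np.abs(a), mask_pct)] = 0.0  # masking
    a = a + np.random.normal(0.0, noise_scale * a.std() + 1e-12, a.shape)
    return a
```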

[LG-42] Sensor to Pixels: Decentralized Swarm Gathering via Image-Based Reinforcement Learning

Link: https://arxiv.org/abs/2601.03413
Authors: Yigal Koifman, Eran Iceland, Erez Koifman, Ariel Barel, Alfred M. Bruckstein
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)

Abstract:This study highlights the potential of image-based reinforcement learning methods for addressing swarm-related tasks. In multi-agent reinforcement learning, effective policy learning depends on how agents sense, interpret, and process inputs. Traditional approaches often rely on handcrafted feature extraction or raw vector-based representations, which limit the scalability and efficiency of learned policies concerning input order and size. In this work we propose an image-based reinforcement learning method for decentralized control of a multi-agent system, where observations are encoded as structured visual inputs that can be processed by Neural Networks, extracting its spatial features and producing novel decentralized motion control rules. We evaluate our approach on a multi-agent convergence task of agents with limited-range and bearing-only sensing that aim to keep the swarm cohesive during the aggregation. The algorithm’s performance is evaluated against two benchmarks: an analytical solution proposed by Bellaiche and Bruckstein, which ensures convergence but progresses slowly, and VariAntNet, a neural network-based framework that converges much faster but shows medium success rates in hard constellations. Our method achieves high convergence, with a pace nearly matching that of VariAntNet. In some scenarios, it serves as the only practical alternative.

[LG-43] PIVONet: A Physically-Informed Variational Neuro ODE Model for Efficient Advection-Diffusion Fluid Simulation

Link: https://arxiv.org/abs/2601.03397
Authors: Hei Shing Cheung, Qicheng Long, Zhiyue Lin
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 13 pages, 14 figures

Abstract:We present PIVONet (Physically-Informed Variational ODE Neural Network), a unified framework that integrates Neural Ordinary Differential Equations (Neuro-ODEs) with Continuous Normalizing Flows (CNFs) for stochastic fluid simulation and visualization. First, we demonstrate that a physically informed model, parameterized by CNF parameters \theta, can be trained offline to yield an efficient surrogate simulator for a specific fluid system, eliminating the need to simulate the full dynamics explicitly. Second, by introducing a variational model with parameters \phi that captures latent stochasticity in observed fluid trajectories, we model the network output as a variational distribution and optimize a pathwise Evidence Lower Bound (ELBO), enabling stochastic ODE integration that captures turbulence and random fluctuations in fluid motion (advection-diffusion behaviors).

[LG-44] SIGMA: Scalable Spectral Insights for LLM Collapse

Link: https://arxiv.org/abs/2601.03385
Authors: Yi Gu, Lingyou Pang, Xiangkun Ye, Tianyu Wang, Jianyu Lin, Carey E. Priebe, Alexander Aue
Subjects: Machine Learning (cs.LG); Probability (math.PR)

Abstract:The rapid adoption of synthetic data for training Large Language Models (LLMs) has introduced the technical challenge of “model collapse”-a degenerative process where recursive training on model-generated content leads to a contraction of distributional variance and representational quality. While the phenomenology of collapse is increasingly evident, rigorous methods to quantify and predict its onset in high-dimensional spaces remain elusive. In this paper, we introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework that benchmarks model collapse through the spectral lens of the embedding Gram matrix. By deriving and utilizing deterministic and stochastic bounds on the matrix’s spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. Crucially, our stochastic formulation enables scalable estimation of these bounds, making the framework applicable to large-scale foundation models where full eigendecomposition is intractable. We demonstrate that SIGMA effectively captures the transition towards degenerate states, offering both theoretical insights into the mechanics of collapse and a practical, scalable tool for monitoring the health of recursive training pipelines.
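
While the paper's deterministic and stochastic spectral bounds are not reproduced here, the quantity being monitored, contraction of the embedding Gram matrix's spectrum, can be tracked with a simple effective-rank statistic:

```python
import numpy as np

def effective_rank(embeddings):
    """exp(entropy) of the normalized Gram-matrix spectrum for a batch of
    embeddings; a shrinking value signals contraction of the representation
    space -- the collapse symptom SIGMA formalizes."""
    x = embeddings - embeddings.mean(axis=0)
    eig = np.linalg.eigvalsh(x @ x.T)
    eig = np.clip(eig, 0.0, None)
    p = eig / (eig.sum() + 1e-12)
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))
```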

[LG-45] Weather-Aware Transformer for Real-Time Route Optimization in Drone-as-a-Service Operations AICCSA

Link: https://arxiv.org/abs/2601.03376
Authors: Kamal Mohamed, Lillian Wassim, Ali Hamdi, Khaled Shaban
Subjects: Machine Learning (cs.LG)
Comments: 2025 IEEE/ACS 22nd International Conference on Computer Systems and Applications (AICCSA)

Abstract:This paper presents a novel framework to accelerate route prediction in Drone-as-a-Service operations through weather-aware deep learning models. While classical path-planning algorithms, such as A* and Dijkstra, provide optimal solutions, their computational complexity limits real-time applicability in dynamic environments. We address this limitation by training machine learning and deep learning models on synthetic datasets generated from classical algorithm simulations. Our approach incorporates transformer-based and attention-based architectures that utilize weather heuristics to predict optimal next-node selections while accounting for meteorological conditions affecting drone operations. The attention mechanisms dynamically weight environmental factors including wind patterns, wind bearing, and temperature to enhance routing decisions under adverse weather conditions. Experimental results demonstrate that our weather-aware models achieve significant computational speedup over traditional algorithms while maintaining route optimization performance, with transformer-based architectures showing superior adaptation to dynamic environmental constraints. The proposed framework enables real-time, weather-responsive route optimization for large-scale DaaS operations, representing a substantial advancement in the efficiency and safety of autonomous drone systems.

[LG-46] Enhancing Small Dataset Classification Using Projected Quantum Kernels with Convolutional Neural Networks

Link: https://arxiv.org/abs/2601.03375
Authors: A.M.A.S.D. Alagiyawanna, Asoka Karunananda, A. Mahasinghe, Thushari Silva
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: Accepted and published in IEEE 2024. This is the authors' manuscript version; final version available at IEEE Xplore: this https URL

Abstract:Convolutional Neural Networks (CNNs) have shown promising results in efficiency and accuracy in image classification. However, their efficacy often relies on large, labeled datasets, posing challenges for applications with limited data availability. Our research addresses these challenges by introducing an innovative approach that leverages projected quantum kernels (PQK) to enhance feature extraction for CNNs, specifically tailored for small datasets. Projected quantum kernels, derived from quantum computing principles, offer a promising avenue for capturing complex patterns and intricate data structures that traditional CNNs might miss. By incorporating these kernels into the feature extraction process, we improved the representational ability of CNNs. Our experiments demonstrated that, with 1000 training samples, the PQK-enhanced CNN achieved 95% accuracy on the MNIST dataset and 90% on the CIFAR-10 dataset, significantly outperforming the classical CNN, which achieved only 60% and 12% accuracy on the respective datasets. This research reveals the potential of quantum computing in overcoming data scarcity issues in machine learning and paves the way for future exploration of quantum-assisted neural networks, suggesting that projected quantum kernels can serve as a powerful approach for enhancing CNN-based classification in data-constrained environments.

[LG-47] Physics-Informed Gaussian Process Regression for the Constitutive Modeling of Concrete: A Data-Driven Improvement to Phenomenological Models

Link: https://arxiv.org/abs/2601.03367
Authors: Chenyang Li, Himanshu Sharma, Youcai Wu, Joseph Magallanes, K.T. Ramesh, Michael D. Shields
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Understanding and modeling the constitutive behavior of concrete is crucial for civil and defense applications, yet widely used phenomenological models such as the Karagozian & Case concrete (KCC) model depend on empirically calibrated failure surfaces that lack flexibility in model form and associated uncertainty quantification. This work develops a physics-informed framework that retains the modular elastoplastic structure of the KCC model while replacing its empirical failure surface with a constrained Gaussian Process Regression (GPR) surrogate that can be learned directly from experimentally accessible observables. Triaxial compression data under varying confinement levels are used for training, and the surrogate is then evaluated at confinement levels not included in the training set to assess its generalization capability. Results show that an unconstrained GPR interpolates well near training conditions but deteriorates and violates essential physical constraints under extrapolation, even when augmented with simulated data. In contrast, a physics-informed GPR that incorporates derivative-based constraints aligned with known material behavior yields markedly better accuracy and reliability, including at higher confinement levels beyond the training range. Probabilistic enforcement of these constraints also reduces predictive variance, producing tighter confidence intervals in data-scarce regimes. Overall, the proposed approach delivers a robust, uncertainty-aware surrogate that improves generalization and streamlines calibration without sacrificing the interpretability and numerical efficiency of the KCC model, offering a practical path toward improved constitutive models for concrete.

[LG-48] LUT-KAN: Segment-wise LUT Quantization for Fast KAN Inference

Link: https://arxiv.org/abs/2601.03332
Authors: Oleksandr Kuznetsov
Subjects: Machine Learning (cs.LG)

Abstract:Kolmogorov–Arnold Networks (KAN) replace scalar weights by learnable univariate functions, often implemented with B-splines. This design can be accurate and interpretable, but it makes inference expensive on CPU because each layer requires many spline evaluations. Standard quantization toolchains are also hard to apply because the main computation is not a matrix multiply but repeated spline basis evaluation. This paper introduces LUT-KAN, a segment-wise lookup-table (LUT) compilation and quantization method for PyKAN-style KAN layers. LUT-KAN converts each edge function into a per-segment LUT with affine int8/uint8 quantization and linear interpolation. The method provides an explicit and reproducible inference contract, including boundary conventions and out-of-bounds (OOB) policies. We propose an "honest baseline" methodology for speed evaluation: B-spline evaluation and LUT evaluation are compared under the same backend optimization (NumPy vs NumPy and Numba vs Numba), which separates representation gains from vectorization and JIT effects. Experiments include controlled sweeps over LUT resolution L ∈ {16, 32, 64, 128} and two quantization schemes (symmetric int8 and asymmetric uint8). We report accuracy, speed, and memory metrics with mean and standard deviation across multiple seeds. A two-by-two OOB robustness matrix evaluates behavior under different boundary modes and OOB policies. In a case study, we compile a trained KAN model for DoS attack detection (CICIDS2017 pipeline) into LUT artifacts. The compiled model preserves classification quality (F1 drop below 0.0002) while reducing steady-state CPU inference latency by 12x under NumPy and 10x under Numba backends (honest baseline). The memory overhead is approximately 10x at L=64. All code and artifacts are publicly available with fixed release tags for reproducibility.
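
The inference contract described, per-segment tables with affine quantization, linear interpolation, and an explicit OOB policy, can be sketched directly; this toy version uses asymmetric uint8 and a clamp OOB mode, one of the configurations the abstract mentions:

```python
import numpy as np

def build_lut(fn, lo, hi, L=64):
    """Sample fn at L+1 knots on [lo, hi]; store values as asymmetric
    uint8 with a per-table affine scale and zero-point."""
    xs = np.linspace(lo, hi, L + 1)
    ys = fn(xs)
    scale = max((ys.max() - ys.min()) / 255.0, 1e-12)
    zero = ys.min()
    q = np.clip(np.round((ys - zero) / scale), 0, 255).astype(np.uint8)
    return q, scale, zero, lo, hi

def lut_eval(x, lut):
    """Segment lookup + linear interpolation; out-of-range inputs are
    clamped to the table domain (one of the OOB policies mentioned)."""
    q, scale, zero, lo, hi = lut
    L = len(q) - 1
    t = np.clip((np.asarray(x, dtype=float) - lo) / (hi - lo), 0.0, 1.0) * L
    i = np.minimum(t.astype(int), L - 1)
    frac = t - i
    y0 = q[i] * scale + zero      # dequantize left knot
    y1 = q[i + 1] * scale + zero  # dequantize right knot
    return y0 + frac * (y1 - y0)
```

For example, `lut_eval(0.5, build_lut(np.sin, -3.0, 3.0))` approximates `np.sin(0.5)` up to the 8-bit quantization error.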

[LG-49] TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering

Link: https://arxiv.org/abs/2601.03300
Authors: Scott Thornton
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 14 pages, 4 figures. Code and datasets at this https URL

Abstract:Large language models remain vulnerable to jailbreak attacks, and single-layer defenses often trade security for usability. We present TRYLOCK, the first defense-in-depth architecture that combines four heterogeneous mechanisms across the inference stack: weight-level safety alignment via DPO, activation-level control via Representation Engineering (RepE) steering, adaptive steering strength selected by a lightweight sidecar classifier, and input canonicalization to neutralize encoding-based bypasses. On Mistral-7B-Instruct evaluated against a 249-prompt attack set spanning five attack families, TRYLOCK achieves 88.0% relative ASR reduction (46.5% to 5.6%), with each layer contributing unique coverage: RepE blocks 36% of attacks that bypass DPO alone, while canonicalization catches 14% of encoding attacks that evade both. We discover a non-monotonic steering phenomenon – intermediate strength (alpha=1.0) degrades safety below baseline – and provide mechanistic hypotheses explaining RepE-DPO interference. The adaptive sidecar reduces over-refusal from 60% to 48% while maintaining identical attack defense, demonstrating that security and usability need not be mutually exclusive. We release all components – trained adapters, steering vectors, sidecar classifier, preference pairs, and complete evaluation methodology – enabling full reproducibility.

[LG-50] A Theoretical and Empirical Taxonomy of Imbalance in Binary Classification

Link: https://arxiv.org/abs/2601.04149
Authors: Rose Yvette Bandolo Essomba, Ernest Fokoué
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 24 pages, 10 figures

Abstract:Class imbalance significantly degrades classification performance, yet its effects are rarely analyzed from a unified theoretical perspective. We propose a principled framework based on three fundamental scales: the imbalance coefficient \eta , the sample–dimension ratio \kappa , and the intrinsic separability \Delta . Starting from the Gaussian Bayes classifier, we derive closed-form Bayes errors and show how imbalance shifts the discriminant boundary, yielding a deterioration slope that predicts four regimes: Normal, Mild, Extreme, and Catastrophic. Using a balanced high-dimensional genomic dataset, we vary only \eta while keeping \kappa and \Delta fixed. Across parametric and non-parametric models, empirical degradation closely follows theoretical predictions: minority Recall collapses once \log(\eta) exceeds \Delta\sqrt\kappa , Precision increases asymmetrically, and F1-score and PR-AUC decline in line with the predicted regimes. These results show that the triplet (\eta,\kappa,\Delta) provides a model-agnostic, geometrically grounded explanation of imbalance-induced deterioration.
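
To see the predicted Recall collapse concretely, consider a toy one-dimensional case (our own simplification, so the sample-dimension ratio \kappa plays no role): for unit-variance Gaussian classes separated by \Delta, the Bayes threshold shifts by \log(\eta)/\Delta, and minority recall follows in closed form:

```python
import numpy as np
from scipy.stats import norm

def minority_recall(eta, delta):
    """Bayes-optimal minority recall for N(+delta/2, 1) vs N(-delta/2, 1)
    with prior ratio eta: the log-likelihood-ratio threshold shifts by
    log(eta)/delta, eroding recall on the minority class."""
    t = np.log(eta) / delta              # shifted decision threshold
    return norm.cdf(delta / 2.0 - t)     # P(predict minority | minority)

for eta in (1, 10, 100, 1000):
    print(eta, round(minority_recall(eta, delta=2.0), 3))
# recall falls from ~0.84 toward ~0.01 as the imbalance coefficient grows
```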

[LG-51] A Single-Loop Bilevel Deep Learning Method for Optimal Control of Obstacle Problems

Link: https://arxiv.org/abs/2601.04120
Authors: Yongcun Song, Shangzhi Zeng, Jin Zhang, Lvgang Zhang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Optimal control of obstacle problems arises in a wide range of applications and is computationally challenging due to its nonsmoothness, nonlinearity, and bilevel structure. Classical numerical approaches rely on mesh-based discretization and typically require solving a sequence of costly subproblems. In this work, we propose a single-loop bilevel deep learning method, which is mesh-free, scalable to high-dimensional and complex domains, and avoids repeated solution of discretized subproblems. The method employs constraint-embedding neural networks to approximate the state and control and preserves the bilevel structure. To train the neural networks efficiently, we propose a Single-Loop Stochastic First-Order Bilevel Algorithm (S2-FOBA), which eliminates nested optimization and does not rely on restrictive lower-level uniqueness assumptions. We analyze the convergence behavior of S2-FOBA under mild assumptions. Numerical experiments on benchmark examples, including distributed and obstacle control problems with regular and irregular obstacles on complex domains, demonstrate that the proposed method achieves satisfactory accuracy while reducing computational cost compared to classical numerical methods.

[LG-52] Equivariant Neural Networks for Force-Field Models of Lattice Systems

Link: https://arxiv.org/abs/2601.04104
Authors: Yunhao Fan, Gia-Wei Chern
Subjects: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 13 pages, 6 figures

Abstract:Machine-learning (ML) force fields enable large-scale simulations with near-first-principles accuracy at substantially reduced computational cost. Recent work has extended ML force-field approaches to adiabatic dynamical simulations of condensed-matter lattice models with coupled electronic and structural or magnetic degrees of freedom. However, most existing formulations rely on hand-crafted, symmetry-aware descriptors, whose construction is often system-specific and can hinder generality and transferability across different lattice Hamiltonians. Here we introduce a symmetry-preserving framework based on equivariant neural networks (ENNs) that provides a general, data-driven mapping from local configurations of dynamical variables to the associated on-site forces in a lattice Hamiltonian. In contrast to ENN architectures developed for molecular systems – where continuous Euclidean symmetries dominate – our approach aims to embed the discrete point-group and internal symmetries intrinsic to lattice models directly into the neural-network representation of the force field. As a proof of principle, we construct an ENN-based force-field model for the adiabatic dynamics of the Holstein Hamiltonian on a square lattice, a canonical system for electron-lattice physics. The resulting ML-enabled large-scale dynamical simulations faithfully capture mesoscale evolution of the symmetry-breaking phase, illustrating the utility of lattice-equivariant architectures for linking microscopic electronic processes to emergent dynamical behavior in condensed-matter lattice systems.

[LG-53] Provably Finding a Hidden Dense Submatrix among Many Planted Dense Submatrices via Convex Programming

Link: https://arxiv.org/abs/2601.03946
Authors: Valentine Olanubi (1), Phineas Agar (1), Brendan Ames (2) ((1) University of Alabama, Department of Mathematics; (2) University of Southampton, School of Mathematical Sciences)
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:We consider the densest submatrix problem, which seeks the submatrix of fixed size of a given binary matrix that contains the most nonzero entries. This problem is a natural generalization of fundamental problems in combinatorial optimization, e.g., the densest subgraph, maximum clique, and maximum edge biclique problems, and has wide application in the study of complex networks. Much recent research has focused on the development of sufficient conditions for exact solution of the densest submatrix problem via convex relaxation. The vast majority of these sufficient conditions establish identification of the densest submatrix within a graph containing exactly one large dense submatrix hidden by noise. The assumptions of these underlying models are not observed in real-world networks, where the data may correspond to a matrix containing many dense submatrices of varying sizes. We extend and generalize these results to the more realistic setting where the input matrix may contain many large dense subgraphs. Specifically, we establish sufficient conditions under which we can expect to solve the densest submatrix problem in polynomial time for random input matrices sampled from a generalization of the stochastic block model. Moreover, we also provide sufficient conditions for perfect recovery under a deterministic adversarial model. Numerical experiments involving randomly generated problem instances and real-world collaboration and communication networks are used to empirically verify the theoretical phase transitions to perfect recovery given by these sufficient conditions.

[LG-54] Physically Consistent Machine Learning for Melting Temperature Prediction of Refractory High-Entropy Alloys

Link: https://arxiv.org/abs/2601.03801
Authors: Mohd Hasnain
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, code available at GitHub

Abstract:Predicting the melting temperature (Tm) of multi-component and high-entropy alloys (HEAs) is critical for high-temperature applications but computationally expensive using traditional CALPHAD or DFT methods. In this work, we develop a gradient-boosted decision tree (XGBoost) model to predict Tm for complex alloys based on elemental properties. To ensure physical consistency, we address the issue of data leakage by excluding temperature-dependent thermodynamic descriptors (such as Gibbs free energy of mixing) and instead rely on physically motivated elemental features. The optimized model achieves a coefficient of determination (R2) of 0.948 and a Mean Squared Error (MSE) of 9928, which is about 5% relative error for HEAs, on a validation set of approximately 1300 compositions. Crucially, we validate the model using the Valence Electron Concentration (VEC) rule. Without explicit constraints during training, the model successfully captures the known stability transition between BCC and FCC phases at a VEC of approximately 6.87. These results demonstrate that data-driven models, when properly feature-engineered, can capture fundamental metallurgical principles for rapid alloy screening.

[LG-55] Learning from Limited Labels: Transductive Graph Label Propagation for Indian Music Analysis

Link: https://arxiv.org/abs/2601.03626
Authors: Parampreet Singh, Akshay Raina, Sayeedul Islam Sheikh, Vipul Arora
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Published at Journal of Acoustical Society of India, 2025

Abstract:Supervised machine learning frameworks rely on extensive labeled datasets for robust performance on real-world tasks. However, there is a lack of large annotated datasets in the audio and music domains, as annotating such recordings is resource-intensive, laborious, and often requires expert domain knowledge. In this work, we explore the use of label propagation (LP), a graph-based semi-supervised learning technique, for automatically labeling the unlabeled set in an unsupervised manner. By constructing a similarity graph over audio embeddings, we propagate limited label information from a small annotated subset to a larger unlabeled corpus in a transductive, semi-supervised setting. We apply this method to two tasks in Indian Art Music (IAM): Raga identification and Instrument classification. For both these tasks, we integrate multiple public datasets along with additional recordings we acquire from Prasar Bharati Archives to perform LP. Our experiments demonstrate that LP significantly reduces labeling overhead and produces higher-quality annotations compared to conventional baseline methods, including those based on pretrained inductive models. These results highlight the potential of graph-based semi-supervised learning to democratize data annotation and accelerate progress in music information retrieval.
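
A minimal transductive setup of the kind described, using scikit-learn's graph-based label propagation over embeddings; the random stand-in embeddings and the -1 convention for unlabeled items are assumptions of this sketch:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

emb = np.random.rand(200, 64)            # stand-in audio embeddings
y = -np.ones(200, dtype=int)             # -1 marks unlabeled items
y[:20] = np.random.randint(0, 4, 20)     # small annotated subset

lp = LabelPropagation(kernel="knn", n_neighbors=10)
lp.fit(emb, y)                           # build the graph and propagate
pseudo_labels = lp.transduction_         # labels for the whole corpus
```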

[LG-56] Provably Convergent Decentralized Optimization over Directed Graphs under Generalized Smoothness

Link: https://arxiv.org/abs/2601.03566
Authors: Yanan Bo, Yongqiang Wang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Decentralized optimization has become a fundamental tool for large-scale learning systems; however, most existing methods rely on the classical Lipschitz smoothness assumption, which is often violated in problems with rapidly varying gradients. Motivated by this limitation, we study decentralized optimization under the generalized (L_0, L_1) -smoothness framework, in which the Hessian norm is allowed to grow linearly with the gradient norm, thereby accommodating rapidly varying gradients beyond classical Lipschitz smoothness. We integrate gradient-tracking techniques with gradient clipping and carefully design the clipping threshold to ensure accurate convergence over directed communication graphs under generalized smoothness. In contrast to existing distributed optimization results under generalized smoothness that require a bounded gradient dissimilarity assumption, our results remain valid even when the gradient dissimilarity is unbounded, making the proposed framework more applicable to realistic heterogeneous data environments. We validate our approach via numerical experiments on standard benchmark datasets, including LIBSVM and CIFAR-10, using regularized logistic regression and convolutional neural networks, demonstrating superior stability and faster convergence over existing methods.

[LG-57] Online Learning with Limited Information in the Sliding Window Model

Link: https://arxiv.org/abs/2601.03533
Authors: Vladimir Braverman, Sumegha Garg, Chen Wang, David P. Woodruff, Samson Zhou
Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: SODA 2026

Abstract:Motivated by recent work on the experts problem in the streaming model, we consider the experts problem in the sliding window model. The sliding window model is a well-studied model that captures applications such as traffic monitoring, epidemic tracking, and automated trading, where recent information is more valuable than older data. Formally, we have n experts, T days, the ability to query the predictions of q experts on each day, and a limited amount of memory, and we should achieve (near-)optimal \sqrt{nW}\,\mathrm{polylog}(nT) regret over any window of the last W days. While it is impossible to achieve such regret with 1 query, we show that with 2 queries we can achieve such regret, and with only \mathrm{polylog}(nT) bits of memory. Not only are our algorithms optimal for sliding windows, but we also show for every interval \mathcal{I} of days that we achieve \sqrt{n|\mathcal{I}|}\,\mathrm{polylog}(nT) regret with 2 queries and only \mathrm{polylog}(nT) bits of memory, providing an exponential improvement on the memory of previous interval regret algorithms. Building upon these techniques, we address the bandit problem in data streams, where q=1, achieving nT^{2/3}\,\mathrm{polylog}(T) regret with \mathrm{polylog}(nT) memory, which is the first sublinear regret in the streaming model in the bandit setting with polylogarithmic memory; this can be further improved to the optimal \mathcal{O}(\sqrt{nT}) regret if the best expert's losses are in a random order.

[LG-58] Measures of classification bias derived from sample size analysis

Link: https://arxiv.org/abs/2601.03453
Authors: Ioannis Ivrissimtzis, Shauna Concannon, Matthew Houliston, Graham Roberts
Subjects: Methodology (stat.ME); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:We propose the use of a simple intuitive principle for measuring algorithmic classification bias: the significance of the differences in a classifier’s error rates across the various demographics is inversely commensurate with the sample size required to statistically detect them. That is, if large sample sizes are required to statistically establish biased behavior, the algorithm is less biased, and vice versa. In a simple setting, we assume two distinct demographics, and non-parametric estimates of the error rates on them, e1 and e2, respectively. We use a well-known approximate formula for the sample size of the chi-squared test, and verify some basic desirable properties of the proposed measure. Next, we compare the proposed measure with two other commonly used statistics, the difference e2-e1 and the ratio e2/e1 of the error rates. We establish that the proposed measure is essentially different in that it can rank algorithms for bias differently, and we discuss some of its advantages over the other two measures. Finally, we briefly discuss how some of the desirable properties of the proposed measure emanate from fundamental characteristics of the method, rather than the approximate sample size formula we used, and thus, are expected to hold in more complex settings with more than two demographics.
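
The principle can be made concrete with a standard sample-size approximation for comparing two proportions (equivalent to the df=1 chi-squared test); taking the reciprocal of the required sample size as the bias score is our literal reading of "inversely commensurate":

```python
from scipy.stats import norm

def required_n(e1, e2, alpha=0.05, power=0.8):
    """Approximate per-group sample size needed to detect the gap between
    error rates e1 and e2 (two-proportion test, equivalent to chi-squared
    with one degree of freedom)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (e1 + e2) / 2.0
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (e1 * (1 - e1) + e2 * (1 - e2)) ** 0.5) ** 2
    return num / (e2 - e1) ** 2

def bias_score(e1, e2):
    return 1.0 / required_n(e1, e2)   # smaller n needed => more bias

print(bias_score(0.10, 0.12), bias_score(0.10, 0.30))
```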

[LG-59] On the Identifiability of Regime-Switching Models with Multi-Lag Dependencies

Link: https://arxiv.org/abs/2601.03325
Authors: Carles Balsells-Rodas, Toshiko Matsui, Pedro A.M. Mediano, Yixin Wang, Yingzhen Li
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: See this https URL for code

Abstract:Identifiability is central to the interpretability of deep latent variable models, ensuring parameterisations are uniquely determined by the data-generating distribution. However, it remains underexplored for deep regime-switching time series. We develop a general theoretical framework for multi-lag Regime-Switching Models (RSMs), encompassing Markov Switching Models (MSMs) and Switching Dynamical Systems (SDSs). For MSMs, we formulate the model as a temporally structured finite mixture and prove identifiability of both the number of regimes and the multi-lag transitions in a nonlinear-Gaussian setting. For SDSs, we establish identifiability of the latent variables up to permutation and scaling via temporal structure, which in turn yields conditions for identifiability of regime-dependent latent causal graphs (up to regime/node permutations). Our results hold in a fully unsupervised setting through architectural and noise assumptions that are directly enforceable via neural network design. We complement the theory with a flexible variational estimator that satisfies the assumptions and validate the results on synthetic benchmarks. Across real-world datasets from neuroscience, finance, and climate, identifiability leads to more trustworthy interpretability analysis, which is crucial for scientific discovery.

[LG-60] MetagenBERT: a Transformer-based Architecture using Foundational genomic Large Language Models for novel Metagenome Representation

Link: https://arxiv.org/abs/2601.03295
Authors: Gaspar Roy, Eugeni Belda, Baptiste Hennecart, Yann Chevaleyre, Edi Prifti, Jean-Daniel Zucker
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:Metagenomic disease prediction commonly relies on species abundance tables derived from large, incomplete reference catalogs, constraining resolution and discarding valuable information contained in DNA reads. To overcome these limitations, we introduce MetagenBERT, a Transformer based framework that produces end to end metagenome embeddings directly from raw DNA sequences, without taxonomic or functional annotations. Reads are embedded using foundational genomic language models (DNABERT2 and the microbiome specialized DNABERTMS), then aggregated through a scalable clustering strategy based on FAISS accelerated KMeans. Each metagenome is represented as a cluster abundance vector summarizing the distribution of its embedded reads. We evaluate this approach on five benchmark gut microbiome datasets (Cirrhosis, T2D, Obesity, IBD, CRC). MetagenBERT achieves competitive or superior AUC performance relative to species abundance baselines across most tasks. Concatenating both representations further improves prediction, demonstrating complementarity between taxonomic and embedding derived signals. Clustering remains robust when applied to as little as 10% of reads, highlighting substantial redundancy in metagenomes and enabling major computational gains. We additionally introduce MetagenBERT Glob Mcardis, a cross cohort variant trained on the large, phenotypically diverse MetaCardis cohort and transferred to other datasets, retaining predictive signal including for unseen phenotypes, indicating the feasibility of a foundation model for metagenome representation. Robustness analyses (PERMANOVA, PERMDISP, entropy) show consistent separation of different states across subsamples. Overall, MetagenBERT provides a scalable, annotation free representation of metagenomes pointing toward future phenotype aware generalization across heterogeneous cohorts and sequencing technologies.
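
Once reads are embedded, the cluster-abundance representation is a short computation; the sketch below uses scikit-learn's MiniBatchKMeans as a stand-in for the paper's FAISS-accelerated k-means, and assumes DNABERT2 read embeddings are computed upstream:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def metagenome_vector(read_embs, km):
    """One metagenome = normalized histogram of its reads' cluster
    assignments (the cluster-abundance representation)."""
    counts = np.bincount(km.predict(read_embs), minlength=km.n_clusters)
    return counts / counts.sum()

# Fit one shared codebook on reads pooled across samples (the paper uses
# FAISS-accelerated KMeans for this step; MiniBatchKMeans stands in here).
pooled = np.random.rand(5000, 768).astype(np.float32)  # stand-in embeddings
km = MiniBatchKMeans(n_clusters=256, n_init=3).fit(pooled)
profile = metagenome_vector(pooled[:300], km)          # one sample's vector
```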

Information Retrieval

[IR-0] Unleashing the Potential of Neighbors: Diffusion-based Latent Neighbor Generation for Session-based Recommendation KDD2026

Link: https://arxiv.org/abs/2601.03903
Authors: Yuhan Yang, Jie Zou, Guojia An, Jiwei Wei, Yang Yang, Heng Tao Shen
Subjects: Information Retrieval (cs.IR)
Comments: This paper has been accepted by KDD 2026

Abstract:Session-based recommendation aims to predict the next item that anonymous users may be interested in, based on their current session interactions. Recent studies have demonstrated that retrieving neighbor sessions to augment the current session can effectively alleviate the data sparsity issue and improve recommendation performance. However, existing methods typically rely on explicitly observed session data, neglecting latent neighbors - not directly observed but potentially relevant within the interest space - thereby failing to fully exploit the potential of neighbor sessions in recommendation. To address the above limitation, we propose a novel model of diffusion-based latent neighbor generation for session-based recommendation, named DiffSBR. Specifically, DiffSBR leverages two diffusion modules, including retrieval-augmented diffusion and self-augmented diffusion, to generate high-quality latent neighbors. In the retrieval-augmented diffusion module, we leverage retrieved neighbors as guiding signals to constrain and reconstruct the distribution of latent neighbors. Meanwhile, we adopt a training strategy that enables the retriever to learn from the feedback provided by the generator. In the self-augmented diffusion module, we explicitly guide the generation of latent neighbors by injecting the current session’s multi-modal signals through contrastive learning. After obtaining the generated latent neighbors, we utilize them to enhance session representations for improving session-based recommendation. Extensive experiments on four public datasets show that DiffSBR generates effective latent neighbors and improves recommendation performance against state-of-the-art baselines.

[IR-1] Perception-Aware Bias Detection for Query Suggestions

Link: https://arxiv.org/abs/2601.03730
Authors: Fabian Haak, Philipp Schaer
Subjects: Information Retrieval (cs.IR)
Comments: 13 pages (pp. 130-142); 2 figures; 2 tables; Workshop paper (BIAS 2021) published in CCIS vol. 1418 (Springer)

Abstract:Bias in web search has been in the spotlight of bias detection research for quite a while. At the same time, little attention has been paid to query suggestions in this regard. Awareness of the problem of biased query suggestions has been raised. Likewise, there is a rising need for automatic bias detection approaches. This paper builds on the pipeline for bias detection in query suggestions of person-related search developed by Bonart et al. (2019). The sparseness and lack of contextual metadata of query suggestions make them a difficult subject for bias detection. Furthermore, query suggestions are perceived very briefly and subliminally. To overcome these issues, perception-aware metrics are introduced. Consequently, the enhanced pipeline is able to better detect systematic topical bias in search engine query suggestions for person-related searches. The results of an analysis performed with the developed pipeline confirm this assumption. Due to the perception-aware bias detection metrics, findings produced by the pipeline can be assumed to reflect bias that users would discern.

[IR-2] Global research trends and collaborations in Fibrodysplasia Ossificans Progressiva: A bibliometric analysis (1989-2023)

Link: https://arxiv.org/abs/2601.03628
Authors: Muneer Ahmad, Undie Felicia Nkatv, Sajid Saleem
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 23 pages, 4 figures, research article

Abstract:Fibrodysplasia Ossificans Progressiva (FOP) is a rare and debilitating genetic disorder characterized by the progressive formation of bone in muscles and connective tissues. This scientometric analysis examines the global research trends on FOP between 1989 and 2023 using bibliographic data from Web of Science. The study highlights key patterns in publication productivity, influential journals, institutions, and the geographical distribution of research. The findings reveal that the United States leads both in terms of total publications and citation impact, with significant contributions from the UK, Italy, Japan, and other European countries. Additionally, the analysis identifies the major document types, including articles and reviews, and evaluates the collaborative efforts across institutions. The study offers valuable insights into the global research landscape of FOP, providing a foundation for future studies and international collaborations.

[IR-3] LLMDiRec: LLM-Enhanced Intent Diffusion for Sequential Recommendation

Link: https://arxiv.org/abs/2601.03259
Authors: Bo-Chian Chen, Manel Slokom
Subjects: Information Retrieval (cs.IR)
Comments: Under review

Abstract:Existing sequential recommendation models, even advanced diffusion-based approaches, often struggle to capture the rich semantic intent underlying user behavior, especially for new users or long-tail items. This limitation stems from their reliance on ID-based embeddings, which lack semantic grounding. We introduce LLMDiRec, a new approach that addresses this gap by integrating Large Language Models (LLMs) into an intent-aware diffusion model. Our approach combines collaborative signals from ID embeddings with rich semantic representations from LLMs, using a dynamic fusion mechanism and a multi-task objective to align both views. We run extensive experiments on five public datasets. We demonstrate that LLMDiRec outperforms state-of-the-art algorithms, with particularly strong improvements in capturing complex user intents and enhancing recommendation performance for long-tail items.

[IR-4] Enhancing Retrieval-Augmented Generation with Two-Stage Retrieval: FlashRank Reranking and Query Expansion

Link: https://arxiv.org/abs/2601.03258
Authors: Sherine George
Subjects: Information Retrieval (cs.IR)
Comments: 3 pages, 1 figure, 3 tables

Abstract:Retrieval-Augmented Generation (RAG) couples a retriever with a large language model (LLM) to ground generated responses in external evidence. While this framework enhances factuality and domain adaptability, it faces a key bottleneck: balancing retrieval recall with limited LLM context. Retrieving too few passages risks missing critical context, while retrieving too many overwhelms the prompt window, diluting relevance and increasing cost. We propose a two-stage retrieval pipeline that integrates LLM-driven query expansion to improve candidate recall and FlashRank, a fast marginal-utility reranker that dynamically selects an optimal subset of evidence under a token budget. FlashRank models document utility as a weighted combination of relevance, novelty, brevity, and cross-encoder evidence. Together, these modules form a generalizable solution that increases answer accuracy, faithfulness, and computational efficiency.
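
A sketch of the greedy marginal-utility selection this describes; the weights, the novelty and brevity proxies, and the precomputed `rel`, `cross`, `tokens`, and `sim` inputs are all illustrative assumptions:

```python
import numpy as np

def flashrank_select(rel, cross, tokens, sim, budget,
                     w=(0.5, 0.2, 0.1, 0.2)):
    """Greedily add the document with the best marginal utility
    (relevance + novelty + brevity + cross-encoder evidence) until the
    token budget is exhausted."""
    w_rel, w_nov, w_brev, w_ce = w
    chosen, used = [], 0
    remaining = set(range(len(rel)))
    while remaining:
        best, best_u = None, -np.inf
        for i in remaining:
            if used + tokens[i] > budget:
                continue  # would overflow the prompt window
            novelty = 1.0 - (max(sim[i][j] for j in chosen) if chosen else 0.0)
            brevity = 1.0 / (1.0 + tokens[i] / 100.0)
            u = (w_rel * rel[i] + w_nov * novelty
                 + w_brev * brevity + w_ce * cross[i])
            if u > best_u:
                best, best_u = i, u
        if best is None:
            break
        chosen.append(best)
        used += tokens[best]
        remaining.remove(best)
    return chosen
```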

Attachments

Click to download today's full paper list