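This post lists the latest papers retrieved from arXiv.org on 2026-01-09. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically each morning at around 12:00.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.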

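Contents

Overview (2026-01-09)

591 papers were updated today, including:

  • Natural Language Processing: 136 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 238 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 98 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 151 papers (Machine Learning (cs.LG))

Natural Language Processing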

[NLP-0] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

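[Quick read]: This paper addresses the collapse of advantage values that occurs when Group Relative Policy Optimization (GRPO) is applied directly to multi-reward reinforcement learning: distinct reward combinations end up with identical advantages after normalization, which reduces the resolution of the training signal and leads to suboptimal convergence or even early training failure. The key to the solution is Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of the individual rewards so that their relative differences are preserved more faithfully, improving both the accuracy of multi-reward optimization and training stability.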

Link: https://arxiv.org/abs/2601.05242
Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NVIDIA-Tech Report

Abstract:As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
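
The collapse described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the reward names, the binary reward values, and the choice to sum the separately normalized rewards are assumptions made for illustration only.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-6):
    """Group-wise z-score: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def coupled_advantages(reward_columns):
    """GRPO-style: sum the rewards of each rollout, then normalize the totals."""
    totals = [sum(rs) for rs in zip(*reward_columns)]
    return normalize(totals)

def decoupled_advantages(reward_columns):
    """Decoupled: normalize each reward across the group, then sum (assumed)."""
    normalized = [normalize(col) for col in reward_columns]
    return [sum(rs) for rs in zip(*normalized)]

# Four rollouts for one prompt, scored by two hypothetical binary rewards.
correctness = [1.0, 0.0, 1.0, 0.0]
format_ok = [0.0, 1.0, 1.0, 1.0]

coupled = coupled_advantages([correctness, format_ok])
decoupled = decoupled_advantages([correctness, format_ok])

# Rollouts 0 and 1 have different reward combinations, (1, 0) and (0, 1),
# but the same total, so the coupled scheme cannot tell them apart.
print("coupled:  ", [round(a, 3) for a in coupled])
print("decoupled:", [round(a, 3) for a in decoupled])
```

In this toy group the coupled scheme assigns rollouts 0 and 1 the same advantage, while the decoupled scheme separates them because the two rewards have different group statistics.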

[NLP-1] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence

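[Quick read]: This paper addresses how to quantify levels of peace at the country level and how technology can improve users' media consumption to promote social harmony. The core challenges are that traditional indicators struggle to capture a dynamically changing opinion environment and that emotionally charged social media content can intensify polarization. The approach relies on machine learning and AI: neural networks estimate levels of peace from text embeddings of online news and keep high accuracy on a different news dataset, while word-level (GoEmotions) and context-level (large language model) methods measure peace-related social dimensions in social media such as YouTube. The authors also developed MirrorMirror, a Chrome extension that gives viewers real-time feedback on the peacefulness of the content they are watching, with the goal of encouraging more respectful, nuanced, and informative media habits and moving beyond purely click-driven engagement metrics.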

Link: https://arxiv.org/abs/2601.05232
Authors: P. Gilda(1), P. Dungarwal(1), A. Thongkham(1), E. T. Ajayi(2), S. Choudhary(1), T. M. Terol(1), C. Lam(1), J. P. Araujo(1), M. McFadyen-Mungalln(1), L. S. Liebovitch(1), P. T. Coleman(1), H. West(1), K. Sieck(3), S. Carter(3) ((1) Columbia University, (2) St John’s University, (3) Toyota Research Institute)
Affiliations: Columbia University; St John’s University; Toyota Research Institute
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures

Abstract:We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.

[NLP-2] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

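[Quick read]: This paper addresses prompt-induced hallucinations (PIH) in large vision-language models (VLMs): when the image and the text disagree, models tend to follow the textual prompt and ignore the visual evidence. The key to the solution is a mechanistic analysis that identifies a small set of attention heads that mediate prompt copying in model-specific ways. Ablating these heads reduces PIH by at least 40% without additional training and increases the model's reliance on visual evidence.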

Link: https://arxiv.org/abs/2601.05201
Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
Affiliations: The University of Texas at Austin; Brown University; Technion; University of Tübingen; Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: None

Abstract:Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
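
As a rough illustration of what ablating attention heads means, the sketch below removes selected heads from the sum of per-head contributions. It assumes zero-ablation and a toy two-dimensional output. The abstract does not say which ablation variant or which heads the authors use, so every value here is hypothetical.

```python
def combine_heads(head_outputs, ablated=frozenset()):
    """Sum per-head contributions, skipping (zero-ablating) the chosen heads."""
    total = [0.0] * len(head_outputs[0])
    for index, output in enumerate(head_outputs):
        if index in ablated:
            continue  # an ablated head contributes nothing to the layer output
        total = [t + o for t, o in zip(total, output)]
    return total

# Three hypothetical heads writing into a 2-dimensional output.
head_outputs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

full = combine_heads(head_outputs)
without_head_2 = combine_heads(head_outputs, ablated={2})
print(full, without_head_2)
```

Comparing the model's behavior with and without a head is what lets an analysis attribute a behavior, such as prompt copying, to that head.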

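[NLP-3] LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation

[Quick read]: This paper addresses entity linking, i.e. mapping ambiguous mentions in text to specific entities in a knowledge base, which is a key step in knowledge graph construction, question answering, and information extraction. The core of the solution is LELA, a modular coarse-to-fine method that leverages large language models (LLMs) and works with different target domains, knowledge bases, and LLMs without any fine-tuning. Across a range of entity linking settings, LELA is highly competitive with fine-tuned approaches and substantially outperforms non-fine-tuned ones.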

Link: https://arxiv.org/abs/2601.05192
Authors: Samy Haffoudhi, Fabian M. Suchanek, Nils Holzenberger
Affiliations: Télécom Paris, Institut Polytechnique de Paris, France
Categories: Computation and Language (cs.CL)
Comments: None

Abstract:Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.

[NLP-4] Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop

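[Quick read]: This paper addresses how bias evolves when large language models are iteratively trained on synthetic data, in particular the increase in preference bias and decrease in disparate bias caused by the Self-Consuming Performative Loop (SCPL). The core issue is that a model repeatedly retrained on its own outputs forms a dynamic feedback loop that can amplify systematic bias against certain user groups. The key to the solution is a reward-based rejection sampling strategy that mitigates this bias, moving toward more trustworthy self-improving systems.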

Link: https://arxiv.org/abs/2601.05184
Authors: Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu
Affiliations: University of California, Santa Cruz; The Ohio State University; Center for Advanced AI, Accenture
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: None

Abstract:The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs and may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops, including the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.
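
Reward-based rejection sampling, in its generic form, can be sketched as follows. The generator, the reward function, and the threshold are toy stand-ins chosen for illustration. The abstract does not specify the reward or the acceptance rule the authors use.

```python
import random

def rejection_sample(generate, reward, n_candidates, threshold, rng):
    """Draw candidates and keep only those whose reward clears the threshold."""
    candidates = [generate(rng) for _ in range(n_candidates)]
    return [c for c in candidates if reward(c) >= threshold]

# Toy stand-ins (assumptions): each synthetic sample is a (group, quality)
# pair, and the reward is simply the quality score.
def toy_generate(rng):
    return (rng.choice(["group_a", "group_b"]), rng.random())

def toy_reward(sample):
    return sample[1]

rng = random.Random(0)
kept = rejection_sample(toy_generate, toy_reward,
                        n_candidates=20, threshold=0.5, rng=rng)
print(f"kept {len(kept)} of 20 candidates for the next training round")
```

In a retraining loop, only the accepted samples would be added to the next round's training data.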

[NLP-5] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems

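[Quick read]: This paper addresses the accumulation of memory noise, reasoning degradation, and persona inconsistency in long-term personalized dialogue systems, which arise from the tension between unbounded interaction streams and finite context. The key to the solution is PersonaTree, a globally maintained user-profile structure whose trunk is constrained by an initial schema while its branches and leaves are updated dynamically, giving controllable growth and memory compression while preserving consistency. A lightweight MemListener model, trained with reinforcement learning and process-based rewards, emits structured, executable ADD, UPDATE, DELETE, and NO_OP operations that drive the evolution of the tree. During response generation, PersonaTree is used directly in latency-sensitive scenarios, and an agentic mode brings in details on demand when needed, balancing efficiency and informativeness.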

Link: https://arxiv.org/abs/2601.05171
Authors: Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu, Bo Tang, Feiyu Xiong, Zhiyu Li
Affiliations: MemTensor
Categories: Computation and Language (cs.CL)
Comments: None

Abstract:Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable ADD, UPDATE, DELETE, NO_OP operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
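
The four operation types named in the abstract can be sketched on a nested dictionary. The operation format, the path convention, and the rule that top-level categories are fixed are all hypothetical choices for illustration and may differ from the actual PersonaTree design.

```python
def apply_operation(tree, op):
    """Apply one memory operation to a nested-dict persona tree in place."""
    kind = op["op"]
    if kind == "NO_OP":
        return tree
    path = op["path"]
    # Assumed rule: the trunk (top-level categories) is fixed by the schema,
    # so operations may only touch branches and leaves beneath it.
    if len(path) < 2 or path[0] not in tree:
        raise KeyError(f"path must start at an existing trunk category: {path}")
    node = tree
    for key in path[:-1]:
        # ADD may create intermediate branches; other ops require them.
        node = node.setdefault(key, {}) if kind == "ADD" else node[key]
    leaf = path[-1]
    if kind in ("ADD", "UPDATE"):
        node[leaf] = op["value"]
    elif kind == "DELETE":
        del node[leaf]
    else:
        raise ValueError(f"unsupported operation: {kind}")
    return tree

# Hypothetical schema with two trunk categories.
tree = {"preferences": {}, "background": {"city": "Berlin"}}
operations = [
    {"op": "ADD", "path": ["preferences", "food", "likes"], "value": "ramen"},
    {"op": "UPDATE", "path": ["background", "city"], "value": "Munich"},
    {"op": "NO_OP"},
    {"op": "DELETE", "path": ["preferences", "food", "likes"]},
]
for operation in operations:
    apply_operation(tree, operation)
print(tree)
```

Keeping the operations as plain data makes each update executable and easy to inspect, which matches the structured, interpretable output the abstract describes.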

[NLP-6] Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference

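[Quick read]: This paper addresses the fact that the logical properties of the Natural Language Inference (NLI) task are poorly understood and often mischaracterized, which directly affects how model performance is interpreted. The key to the solution is to formulate three possible readings of the NLI label set and to systematically analyze the meta-inferential properties they entail. Using SNLI items with shared premises and items generated by large language models (LLMs), the study then evaluates the meta-inferential consistency of models trained on SNLI to reveal which reading of the logical relations the dataset actually encodes.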

Link: https://arxiv.org/abs/2601.05170
Authors: Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis
Affiliations: University of Gothenburg; University of Crete
Categories: Computation and Language (cs.CL)
Comments: None

Abstract:Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.

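[NLP-7] RelayLLM: Efficient Reasoning via Collaborative Decoding

[Quick read]: This paper addresses two problems: Large Language Models (LLMs) are hard to deploy for complex reasoning because of high computational cost and latency, and Small Language Models (SLMs) lack sufficient reasoning capacity. Existing collaborative methods such as cascading or routing hand entire queries over to the LLM, which wastes computation. The key to the solution is RelayLLM, a token-level collaborative decoding framework in which the SLM acts as an active controller and invokes the LLM only for critical tokens through a special command, "relaying" the generation process. A two-stage training strategy (warm-up followed by Group Relative Policy Optimization, GRPO) teaches the model to balance independent reasoning with timely help-seeking. The method reaches an average accuracy of 49.52% across six benchmarks while invoking the LLM for only 1.07% of the generated tokens.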

Link: https://arxiv.org/abs/2601.05167
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
Affiliations: Washington University in St. Louis; University of Maryland; University of Virginia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: None

Abstract:Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively “relaying” the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
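
A token-level relay loop of the kind described above might look like the sketch below. The `<call>` token, the one-token-per-call handover, and both toy models are assumptions for illustration. The abstract does not specify the command format or how many tokens the large model produces per call.

```python
CALL = "<call>"  # hypothetical special command emitted by the small model
EOS = "<eos>"

def relay_decode(small_step, large_step, prompt, max_tokens=32):
    """Decode with the small model, deferring to the large one on CALL."""
    tokens = list(prompt)
    large_calls = 0
    for _ in range(max_tokens):
        token = small_step(tokens)
        if token == CALL:
            # Assumed behavior: the large model supplies exactly one token.
            token = large_step(tokens)
            large_calls += 1
        tokens.append(token)
        if token == EOS:
            break
    return tokens, large_calls

# Toy models: the small one asks for help right after "=", the large one
# "solves" single-digit addition.
def small_step(tokens):
    return CALL if tokens[-1] == "=" else EOS

def large_step(tokens):
    return str(int(tokens[0]) + int(tokens[2]))

output, calls = relay_decode(small_step, large_step, ["2", "+", "2", "="])
print(output, calls)
```

The loop keeps control with the small model throughout, which is what distinguishes this scheme from a router that hands over the whole query.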

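[NLP-8] DocDancer: Towards Agentic Document-Grounded Information Seeking

[Quick read]: This paper addresses the ineffective tool use and heavy reliance on closed-source models of existing Document Question Answering (DocQA) agents. The core solution is DocDancer, an end-to-end trained open-source document agent that formulates DocQA as an information-seeking problem and uses a tool-driven agent framework that explicitly models document exploration and comprehension. The key innovation is an Exploration-then-Synthesis data synthesis pipeline that eases the scarcity of high-quality DocQA training data, enabling agents to be trained and then evaluated on long-context document understanding benchmarks such as MMLongBench-Doc and DocBench.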

Link: https://arxiv.org/abs/2601.05163
Authors: Qintong Zhang, Xinjie Lv, Jialong Wu, Baixuan Li, Zhengwei Tao, Guochen Yan, Huanyao Zhang, Bin Wang, Jiahao Xu, Haitao Mi, Wentao Zhang
Affiliations: Peking University; Shanghai AI Lab; Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments: None

Abstract:Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the trained models on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, show their effectiveness. Further analysis provides valuable insights for the agentic tool design and synthetic data.

[NLP-9] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering

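[Quick read]: This paper addresses accurate visual understanding and reliable language generation in crop disease analysis, i.e. identifying the crop and the disease from leaf images and producing reliable natural-language answers. The key to the solution is a lightweight vision-language framework that combines a Swin Transformer vision encoder with sequence-to-sequence language decoders and uses a two-stage training strategy to improve visual representation learning and cross-modal alignment. On a large-scale crop disease dataset, the method shows high identification accuracy and strong natural language generation scores (BLEU, ROUGE, BERTScore) with significantly fewer parameters than large-scale vision-language baselines, highlighting the effectiveness of task-specific visual pretraining.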

Link: https://arxiv.org/abs/2601.05143
Authors: Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam Ansary
Affiliations: Ahsanullah University of Science and Technology (AUST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Preprint, manuscript is under review

Abstract:Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.

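[NLP-10] Agent-as-a-Judge

[Quick read]: This paper addresses the reliability bottleneck of LLM-as-a-Judge when evaluating complex, specialized, and multi-step tasks, which stems from inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. The key to the solution is the Agent-as-a-Judge paradigm, in which agentic judges use planning, tool-augmented verification, multi-agent collaboration, and persistent memory to produce more robust, verifiable, and nuanced evaluations. The paper presents what the authors describe as the first comprehensive survey of this shift: it identifies the key dimensions of the paradigm, establishes a developmental taxonomy, organizes core methodologies and applications, and outlines frontier challenges and research directions as a roadmap for the next generation of agentic evaluation.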

Link: https://arxiv.org/abs/2601.05111
Authors: Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li
Affiliations: The Hong Kong Polytechnic University; University of Cambridge; Shandong Jianzhu University; Huawei Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: None

Abstract:LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.

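[NLP-11] Token-Level LLM Collaboration via FusionRoute

[Quick read]: This paper addresses the tension between performance and efficiency for large language models (LLMs) across domains: a general-purpose model covers many domains but is expensive to train and deploy, while small specialized models are efficient but generalize poorly. The key to the solution is FusionRoute, a token-level multi-LLM collaboration framework in which a lightweight router, at every decoding step, both selects the most suitable expert and contributes a complementary logit that refines the expert's output through logit addition, which expands the effective policy class. The theoretical analysis shows that routing over fixed expert outputs alone is fundamentally limited, whereas adding a trainable complementary generator enables recovery of optimal value functions under mild conditions. On multiple benchmarks, FusionRoute outperforms sequence-level and token-level collaboration, model merging, and direct fine-tuning.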

Link: https://arxiv.org/abs/2601.05106
Authors: Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao
Affiliations: Meta
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 pages

Abstract:Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert’s next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
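
The two roles of the router, expert selection and logit addition, can be sketched for a single decoding step. The router scores, the complementary logits, and the three-token vocabulary are made-up numbers. In the paper these quantities come from trained models, which this sketch does not reproduce.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_step(expert_logits, router_scores, complementary_logits):
    """Pick the top-scoring expert, then add the router's own logits."""
    chosen = max(range(len(router_scores)), key=router_scores.__getitem__)
    fused = [e + c for e, c in zip(expert_logits[chosen], complementary_logits)]
    return chosen, softmax(fused)

# Hypothetical values for one step with two experts and a 3-token vocabulary.
expert_logits = [[2.0, 1.0, 0.0], [0.0, 0.5, 3.0]]
router_scores = [0.2, 0.8]
complementary_logits = [0.0, 3.0, 0.0]

chosen, probs = fused_step(expert_logits, router_scores, complementary_logits)
best_token = max(range(len(probs)), key=probs.__getitem__)
print(chosen, best_token, [round(p, 3) for p in probs])
```

In this toy step the selected expert alone would pick token 2, while the added complementary logit shifts the choice to token 1. This is the kind of correction that routing over fixed expert outputs cannot express.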

[NLP-12] How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness

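[Quick read]: This paper asks how the emotional tone people express when interacting with generative AI affects the AI's output and their subsequent human-to-human communication. The key to the approach is a between-subject experiment that manipulates the emotion participants express while working with ChatGPT (GPT-4.0), namely praise, anger, or blame, and measures the effect on output quality, ethical decision tendencies, and later interpersonal communication. Emotional tone significantly shaped how much ChatGPT's answers improved and what they prioritized: praise led to the largest improvement, anger to a smaller one, and blame to none. Participants who blamed ChatGPT also used more negative, hostile, and disappointed language in subsequent human-to-human communication, showing that emotions in human-AI interaction carry over to other contexts.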

Link: https://arxiv.org/abs/2601.05104
Authors: Florence Bernays, Marco Henriques Pereira, Jochen Menges (University of Zurich)
Affiliations: Unknown
Categories: Computation and Language (cs.CL); General Economics (econ.GN)
Comments: None

Abstract:This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increases its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shape ChatGPT’s outputs but also carry over into subsequent human-human communication.

[NLP-13] Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content

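[Quick read]: This paper addresses the conflation of citation intent (why a work is cited) with cited content type (what part is cited) in existing citation classification frameworks, which limits the accuracy and reliability of automatic classification. The core challenge is to keep fine-grained type distinctions while remaining practical and consistent. The key to the solution is SOFT (Semantically Orthogonal Framework with Two dimensions), a two-dimensional framework inspired by semantic role theory that explicitly separates citation intent from cited content type, giving a clearer and reusable annotation standard. Experiments show that SOFT yields higher agreement between human annotators and large language models (LLMs) and supports stronger classification performance and cross-domain generalization in both zero-shot and fine-tuned settings.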

Link: https://arxiv.org/abs/2601.05103
Authors: Changxu Duan, Zhiyin Tan
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: Accepted at the 29th International Conference on Theory and Practice of Digital Libraries (TPDL 2025)

Abstract:Understanding the role of citations is essential for research assessment and citation-aware digital libraries. However, existing citation classification frameworks often conflate citation intent (why a work is cited) with cited content type (what part is cited), limiting their effectiveness in auto classification due to a dilemma between fine-grained type distinctions and practical classification reliability. We introduce SOFT, a Semantically Orthogonal Framework with Two dimensions that explicitly separates citation intent from cited content type, drawing inspiration from semantic role theory. We systematically re-annotate the ACL-ARC dataset using SOFT and release a cross-disciplinary test set sampled from ACT2. Evaluation with both zero-shot and fine-tuned Large Language Models demonstrates that SOFT enables higher agreement between human annotators and LLMs, and supports stronger classification performance and robust cross-domain generalization compared to ACL-ARC and SciCite annotation frameworks. These results confirm SOFT’s value as a clear, reusable annotation standard, improving clarity, consistency, and generalizability for digital libraries and scholarly communication infrastructures. All code and data are publicly available on GitHub this https URL.

[NLP-14] Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts

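[Quick read]: This paper addresses the failure of scientific dataset discovery to capture semantic intent because it depends on metadata quality and keyword matching. Conventional dataset search engines often do not reflect what a research question actually needs, which limits retrieval of suitable datasets. The key to the solution is a framework based on citation contexts in the literature: it combines large-scale extraction of citation contexts from scientific papers, schema-guided dataset recognition with large language models, and provenance-preserving entity resolution, so that retrieval is grounded in actual research use. On eight survey-derived computer science queries, normalized recall averages 47.47% and reaches up to 81.82%, substantially above the compared search engines. The method also surfaces useful and sometimes novel datasets not documented in the surveys, supporting citation-context mining as an effective and generalizable approach when metadata is poor or missing.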

Link: https://arxiv.org/abs/2601.05099
Authors: Zhiyin Tan, Changxu Duan
Affiliations: Leibniz University Hannover; Technische Universität Darmstadt
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at the 25th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025)

Abstract:Identifying suitable datasets for a research question remains challenging because existing dataset search engines rely heavily on metadata quality and keyword overlap, which often fail to capture the semantic intent of scientific investigation. We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers, enabling retrieval grounded in actual research use rather than metadata availability. Our approach combines large-scale citation-context extraction, schema-guided dataset recognition with Large Language Models, and provenance-preserving entity resolution. We evaluate the system on eight survey-derived computer science queries and find that it achieves substantially higher recall than Google Dataset Search and DataCite Commons, with normalized recall ranging from an average of 47.47% to a highest value of 81.82%. Beyond recovering gold-standard datasets, the method also surfaces additional datasets not documented in the surveys. Expert assessments across five top-level Fields of Science indicate that a substantial portion of the additional datasets are considered high utility, and some are regarded as novel for the specific topics chosen by the experts. These findings establish citation-context mining as an effective and generalizable paradigm for dataset discovery, particularly in settings where datasets lack sufficient or reliable metadata. To support reproducibility and future extensions, we release our code, evaluation datasets, and results on GitHub (this https URL).

[NLP-15] Code-Mix Sentiment Analysis on Hinglish Tweets

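[Quick read]: This paper addresses the failure of traditional Natural Language Processing (NLP) models in Indian brand monitoring caused by the widespread use of Hinglish, a mix of Hindi and English. Hinglish is common in user-generated content on platforms such as Twitter, and its syntactic and semantic complexity exceeds what models built for monolingual data can handle, leading to inaccurate sentiment analysis and misleading market insights. The key to the solution is a sentiment classification framework for Hinglish tweets that fine-tunes multilingual BERT (mBERT) to exploit its cross-lingual capabilities and relies on subword tokenization to handle the spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish, establishing a benchmark for multilingual NLP in low-resource, code-mixed settings.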

Link: https://arxiv.org/abs/2601.05091
Authors: Aashi Garg, Aneshya Das, Arshi Arya, Anushka Goyal, Aditi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 9th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2025), Fukuoka, Japan

Abstract:The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish–a hybrid of Hindi and English–used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.

[NLP-16] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment

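[Quick read]: This paper addresses two problems in current sentence embedding methods based on generative large language models (LLMs): methods that rely on fixed prompt templates leave the model itself unoptimized, and methods that modify the model architecture compromise the LLM's generative ability. The key to the solution is SemPA (Semantic Preference Alignment), which uses sentence-level Direct Preference Optimization (DPO) to efficiently fine-tune LLMs on a paraphrase generation task, so that the model learns to discriminate semantically equivalent sentences while keeping its inherent generative capacity. Theoretically, the authors establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, SemPA yields better sentence representations on semantic textual similarity tasks and various LLM benchmarks without harming generation.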

链接: https://arxiv.org/abs/2601.05075
作者: Ziyang Chen,Zhenxuan Huang,Yile Wang,Weiqin Wang,Lu Yin,Hui Huang
机构: Shenzhen University (深圳大学); University of Surrey (萨里大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, there have emerged embedding methods based on generative large language models (LLMs). These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts the sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
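
As a rough illustration of the sentence-level DPO objective mentioned above, the following sketch computes the standard DPO loss for one preference pair from summed sentence log-probabilities. Treating a paraphrase as the preferred output and a non-paraphrase as the rejected one is an assumption made here for illustration; the abstract does not spell out how SemPA builds its pairs, and `beta=0.1` is a placeholder value.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a whole sentence, i.e. the sum
    of its token log-probabilities under the policy or the frozen reference.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed example values: the policy has moved toward the chosen sentence.
loss_when_aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_at_reference = dpo_loss(-2.0, -2.0, -2.0, -2.0)
```

The loss equals log 2 when the policy matches the reference and falls as the policy assigns relatively more probability to the preferred sentence.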

[NLP-17] Compositional Steering of Large Language Models with Steering Tokens

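Quick read: This paper addresses the difficulty of controlling multiple behaviors of Large Language Models (LLMs) at once in real-world applications, that is, how to achieve compositional steering toward several target behaviors. Existing methods focus mainly on steering a single behavior and lack an effective mechanism for coordinating multiple behaviors. The key to the solution is compositional steering tokens: natural language instructions are encoded into dedicated input tokens via self-distillation, moving behavior steering from the activation space into the input token space and thereby supporting flexible zero-shot composition. A dedicated composition token is then trained to capture how behaviors combine, which generalizes well to compositions with unseen behaviors or an unseen number of behaviors. Experiments show that across multiple LLM architectures the method outperforms competing approaches based on instructions, activation steering, and LoRA merging, and that it complements natural language instructions, with the combination improving control further.

Link: https://arxiv.org/abs/2601.05062
Authors: Gorjan Radevski, Kiril Gashteovski, Giwon Hong, Carolin Lawrence, Goran Glavaš
Affiliations: NEC Laboratories Europe, Germany; University of Edinburgh, United Kingdom; Center for Artificial Intelligence and Data Science, University of Würzburg, Germany; CAIR, Ss. Cyril and Methodius University of Skopje, North Macedonia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)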

Abstract:Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering (i.e., steering LLMs simultaneously towards multiple behaviors) remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.

[NLP-18] Reinforced Efficient Reasoning via Semantically Diverse Exploration

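Quick read: This paper addresses two core problems of existing Monte Carlo Tree Search (MCTS)-based methods for Reinforcement Learning with Verifiable Rewards (RLVR) when improving the reasoning of Large Language Models (LLMs): limited exploration diversity, which drives reasoning paths toward local optima, and inefficient reasoning, which shows up as long, wasteful reasoning chains. The key to the solution is ROSE (Reinforced Efficient Reasoning via Semantically Diverse Explorations), whose main components are: (1) a semantic-entropy-based branching strategy that identifies semantic uncertainty in already sampled reasoning paths and selects branching points with high semantic divergence to generate diverse follow-up paths; (2) an ε-exploration mechanism that stochastically starts reasoning paths from the root so the search does not become overly local; and (3) a length-aware segment-level advantage estimator that rewards concise, correct reasoning and penalizes redundant steps, improving reasoning efficiency. Experiments on multiple mathematical reasoning benchmarks show that ROSE is both more effective and more efficient.

Link: https://arxiv.org/abs/2601.05053
Authors: Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, Dawei Yin, Xin Xin
Affiliations: Shandong University; Leiden University; Baidu Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)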

Abstract:Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained and segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address the above challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an ε-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length-aware segment-level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Codes are available at this https URL.
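
The branching idea can be sketched as follows, assuming the common formulation of semantic entropy as Shannon entropy over clusters of semantically equivalent continuations. The abstract does not give ROSE's exact definition, so the clustering step is taken as given here (the integer cluster labels are hypothetical inputs) and the selection rule is only one plausible reading.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy of the distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pick_branch_point(clusters_per_position):
    """Return the index of the position whose continuations diverge most."""
    return max(range(len(clusters_per_position)),
               key=lambda i: semantic_entropy(clusters_per_position[i]))


# Hypothetical cluster labels for four sampled continuations at three
# candidate positions of a reasoning rollout.
candidates = [
    [0, 0, 0, 0],  # all continuations mean the same thing
    [0, 1, 2, 3],  # every continuation differs in meaning
    [0, 0, 1, 1],  # two competing meanings
]
branch_at = pick_branch_point(candidates)
```

Under these assumptions the second position is selected, because its continuations spread evenly over the most clusters.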

[NLP-19] Publishing FAIR and Machine-actionable Reviews in Materials Science: The Case for Symbolic Knowledge in Neuro-symbolic Artificial Intelligence

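Quick read: This paper addresses the problem that the key insights of scientific reviews are locked in unstructured text and static PDF tables, which limits reuse by both humans and machines. The key to the solution is to convert review tables from materials science into machine-actionable comparisons that follow the FAIR principles (Findable, Accessible, Interoperable, Reusable) and publish them in the Open Research Knowledge Graph (ORKG), yielding structured, queryable knowledge. Building on this, the authors compare symbolic querying with Large Language Model (LLM)-based querying and argue that a human-curated symbolic layer should be the backbone of reliable neuro-symbolic AI, with LLMs serving as complementary interfaces grounded in the symbolic layer rather than as standalone sources of knowledge.

Link: https://arxiv.org/abs/2601.05051
Authors: Jennifer D’Souza, Soren Auer, Eleni Poupaki, Alex Watkins, Anjana Devi, Riikka L. Puurunen, Bora Karasulu, Adrie Mackus, Erwin Kessels
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Theory (cs.IT)
Comments: 35 pages, 11 figures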

Abstract:Scientific reviews are central to knowledge integration in materials science, yet their key insights remain locked in narrative text and static PDF tables, limiting reuse by humans and machines alike. This article presents a case study in atomic layer deposition and etching (ALD/E) where we publish review tables as FAIR, machine-actionable comparisons in the Open Research Knowledge Graph (ORKG), turning them into structured, queryable knowledge. Building on this, we contrast symbolic querying over ORKG with large language model-based querying, and argue that a curated symbolic layer should remain the backbone of reliable neurosymbolic AI in materials science, with LLMs serving as complementary, symbolically grounded interfaces rather than standalone sources of truth.

[NLP-20] ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG

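Quick read: This paper addresses the slow, costly inference that results from feeding long documents to Large Language Models (LLMs) in Retrieval-Augmented Generation (RAG), and in particular the sharp drop in the LLM's ability to understand context once it has been compressed (for example by token pruning, summarization, or embedding-based compression). The key to the solution is ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers that uses an adaptive gating mechanism to spend extra computation only when the information is complex, improving the model's use of highly compressed context while staying efficient. Experiments show that ArcAligner outperforms existing compression baselines on knowledge-intensive QA benchmarks, especially in multi-hop and long-tail settings.

Link: https://arxiv.org/abs/2601.05038
Authors: Jianbo Li, Yi Jiang, Sendong Zhao, Bairui Hu, Haochun Wang, Bing Qin
Affiliations: Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL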

Abstract:Retrieval-Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding-based compression. While researchers have tried "compressing" these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive "gating" system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge-intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi-hop and long-tail settings. The source code is publicly available.

[NLP-21] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models

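Quick read: This paper addresses the "Functional Alignment Collapse" that arises in reasoning distillation for current generative AI models: when a student model learns reasoning traces from a teacher model through Supervised Fine-Tuning (SFT), it fails to inherit the teacher's natural correlation with human cognitive costs. The key finding is that SFT leads the student to imitate only the linguistic form of the teacher's reasoning (such as verbosity) without internalizing its dynamic resource allocation policy, which decouples computational cost from cognitive demand and amounts to a "Cargo Cult" effect. The paper therefore concludes that human-like cognition emerges from active optimization under reinforcement learning rather than from passive imitation.

Link: https://arxiv.org/abs/2601.05019
Authors: Yueqing Hu, Xinyang Peng, Shuting Peng, Hanqi Wang, Tianhong Wang
Affiliations: Institute of Neuroscience, Chinese Academy of Sciences; University of Cambridge; South China Normal University; University College London; Anhui University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 7 pages, 7 figures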

Abstract:Recent Large Reasoning Models trained via reinforcement learning exhibit a “natural” alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation (training student models to mimic these traces via Supervised Fine-Tuning, SFT) fails to transmit this cognitive structure. Testing the “Hán Dān Xué Bù” (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a “Functional Alignment Collapse”: while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines (“Negative Transfer”). Our analysis suggests that SFT induces a “Cargo Cult” effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher’s dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.

[NLP-22] Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

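Quick read: This paper addresses two challenges in detecting self-destructive behaviors within subcultural groups: "Knowledge Lag", where subcultural slang evolves faster than the training cycles of Large Language Models (LLMs), so models miss the latest expressions; and "Semantic Misalignment", where models struggle to grasp the specific, nuanced emotions and expressions unique to a subculture. The key to the solution is the Subcultural Alignment Solver (SAS), a multi-agent framework built on automatic retrieval and subculture alignment: it dynamically brings in subcultural material and strengthens semantic matching, improving LLM performance at identifying self-destructive behavior in subcultural contexts. Experiments show that SAS outperforms the current advanced multi-agent framework OWL and is competitive with fine-tuned LLMs.

Link: https://arxiv.org/abs/2601.05004
Authors: Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li
Affiliations: Macau University of Science and Technology; Northeastern University; Huazhong University of Science and Technology; Harbin Institute of Technology (Shenzhen)
Categories: Computation and Language (cs.CL)
Comments: Preprint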

Abstract:Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are applied across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: subcultural slang evolves rapidly, faster than LLMs’ training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly enhancing the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

[NLP-23] On the Hidden Objective Biases of Group-based Reinforcement Learning

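Quick read: This paper addresses a structural mismatch in current group-based reinforcement learning methods (such as Group Relative Policy Optimization, GRPO) used to post-train large language models: the reward optimization objective deviates from the underlying training objective. The key to the analysis is a unified surrogate formulation under which GRPO-style methods are studied theoretically, revealing three recurring properties: (i) non-uniform group weighting induces gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics insensitive to reward scaling; and (iii) under repeated optimization steps, optimizer momentum can push policy updates beyond the intended clipping region. These findings expose fundamental limitations of existing methods and offer theoretical guidance for designing better algorithms.

Link: https://arxiv.org/abs/2601.05002
Authors: Aleksandar Fontana, Marco Simoni, Giulio Rossolini, Andrea Saracino, Paolo Mori
Affiliations: TeCIP; National Research Council of Italy; Sapienza Università di Roma
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)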

Abstract:Group-based reinforcement learning methods, like Group Relative Policy Optimization (GRPO), are widely used nowadays to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
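
Property (ii) can be illustrated with a small experiment. The sketch below is not the paper's derivation; it only shows one mechanism consistent with the claim: because AdamW divides the first-moment estimate by the square root of the second-moment estimate, multiplying every gradient by a constant (as a rescaled reward would) leaves the parameter updates almost unchanged. The gradient values, the scale factor of 100, and the hyperparameters are assumed for illustration.

```python
import math


def adamw_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """Run AdamW on one scalar parameter starting at 0; return each step's update."""
    theta, m, v, updates = 0.0, 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g       # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        # Decoupled weight decay is applied to theta, not folded into the gradient.
        step = -lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
        theta += step
        updates.append(step)
    return updates


base_grads = [0.5, -0.2, 0.8, 0.1]
scaled_grads = [100.0 * g for g in base_grads]  # same signal, rewards scaled by 100

max_gap = max(abs(a - b) for a, b in zip(adamw_updates(base_grads),
                                         adamw_updates(scaled_grads)))
```

The two update sequences differ only through the small `eps` term, so `max_gap` is many orders of magnitude below the learning rate.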

[NLP-24] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

Quick read: This paper addresses the wasted supervision and overfitting that result when Supervised Fine-Tuning (SFT) uses only Chain-of-Thought (CoT) trajectories with correct final answers, which limits Out-of-Domain (OOD) generalization. The key to the solution is to include negative trajectories, that is, CoT paths whose intermediate reasoning is sound but whose final answer is wrong. A systematic analysis finds that these negatives slow loss descent during training, mitigating overfitting, and raise policy entropy by 35.67% at inference, encouraging exploration. Building on this, the authors propose Gain-based LOss Weighting (GLOW), which adaptively rescales each sample's loss according to its progress between epochs and so makes efficient use of all unfiltered trajectories. On Qwen2.5-7B, GLOW achieves a 5.51% OOD gain over positive-only SFT and, when used as an RL initialization, raises MMLU accuracy from 72.82% to 76.47%.

Link: https://arxiv.org/abs/2601.04992
Authors: Xueyun Tian (1 and 2), Minghua Ma (3), Bingbing Xu (1 and 4), Nuoyan Lyu (1 and 2), Wei Li, Heng Dong (4), Zheng Chu (3), Yuanzhuo Wang (1), Huawei Shen (1 and 2) ((1) CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China, (2) University of Chinese Academy of Sciences, Beijing, China, (3) Harbin Institute of Technology, Harbin, China, (4) Tsinghua University, Beijing, China)
Affiliations: CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Harbin Institute of Technology; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Code and data are available at this https URL

Abstract:Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.

[NLP-25] ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning

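Quick read: This paper addresses "overthinking" in Large Reasoning Models (LRMs) during Chain-of-Thought (CoT) generation, where redundant reasoning paths increase computational cost without improving accuracy. The key to the solution is ConMax (Confidence-Maximizing Compression), a reinforcement learning framework that formulates reasoning-trace compression as a reward-driven optimization problem: a policy is trained, guided by a frozen auxiliary LRM, to maximize a weighted combination of answer confidence (for predictive fidelity) and thinking confidence (for reasoning validity), so that redundant content is pruned automatically and the compressed reasoning data stays efficient and logically coherent.

Link: https://arxiv.org/abs/2601.04973
Authors: Minda Hu, Zexuan Qiu, Zenan Xu, Kun Li, Bo Zhou, Irwin King
Affiliations: The Chinese University of Hong Kong; LLM Department; Tencent
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)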

Abstract:Recent breakthroughs in Large Reasoning Models (LRMs) have demonstrated that extensive Chain-of-Thought (CoT) generation is critical for enabling intricate cognitive behaviors, such as self-verification and backtracking, to solve complex tasks. However, this capability often leads to "overthinking", where models generate redundant reasoning paths that inflate computational costs without improving accuracy. While Supervised Fine-Tuning (SFT) on reasoning traces is a standard paradigm for the "cold start" phase, applying existing compression techniques to these traces often compromises logical coherence or incurs prohibitive sampling costs. In this paper, we introduce ConMax (Confidence-Maximizing Compression), a novel reinforcement learning framework designed to automatically compress reasoning traces while preserving essential reasoning patterns. ConMax formulates compression as a reward-driven optimization problem, training a policy to prune redundancy by maximizing a weighted combination of answer confidence for predictive fidelity and thinking confidence for reasoning validity through a frozen auxiliary LRM. Extensive experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off. Specifically, it reduces inference length by 43% over strong baselines at the cost of a mere 0.7% dip in accuracy, proving its effectiveness in generating high-quality, efficient training data for LRMs.

[NLP-26] Text as a Universal Interface for Transferable Personalization

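Quick read: This paper addresses the interpretability and transferability of personalization representations in Large Language Models (LLMs). Existing methods usually model user preferences as implicit, model-specific vectors or parameters, producing "black-box" preference profiles that are hard to interpret and cannot transfer across models and tasks. The key to the solution is to use natural language as a universal, model- and task-agnostic interface for representing preferences, which makes preference descriptions interpretable and reusable and lets them evolve as new interactions are observed. To this end, the authors propose a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability, and build AlignXplore+, a model that generates textual preference summaries and, across multiple benchmarks, outperforms larger open-source models while transferring well.

Link: https://arxiv.org/abs/2601.04963
Authors: Yuting Liu, Jian Guan, Jia-Nan Li, Wei Wu, Jiang-Ming Yang, Jianzhe Zhao, Guibing Guo
Affiliations: Northeastern University; Ant Group; Ant International; Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)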

Abstract:We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque "black-box" profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performance, outperforming substantially larger open-source models, while exhibiting strong transferability across tasks, model families, and interaction formats.

[NLP-27] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction

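Quick read: This paper addresses the inadequate modeling of emotional intelligence in spoken dialogue systems, in particular how to make a model internalize the user's emotional state and its causes during reasoning so that its emotional expression and empathetic responses are more natural and consistent. The key to the solution is a new data construction strategy, Injected Emotional-Attribution Thinking (IEAT), which embeds the user's emotional state and its underlying causes into the model's internal reasoning process, so that emotion awareness becomes an implicit cognitive mechanism rather than an explicit supervision signal. Training follows a two-stage progressive strategy: the first stage performs speech-text alignment and emotional attribute modeling via self-distillation, and the second performs end-to-end cross-modal joint optimization to keep textual and spoken emotional expression consistent. The approach achieves top-ranked performance on the HumDial Emotional Intelligence benchmark.

Link: https://arxiv.org/abs/2601.04960
Authors: Qing Wang, Zehan Li, Yaodong Song, Hongjie Chen, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Xuelong Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD)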

Abstract:This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model’s internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.

[NLP-28] GenProve: Learning to Generate Text with Fine-Grained Provenance

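Quick read: This paper addresses hallucination in content generated by Large Language Models (LLMs), and in particular the failure of existing citation mechanisms to guarantee verifiability: users find it hard to judge whether a cited source actually supports a generated claim. The core challenge is the lack of fine-grained provenance that distinguishes three evidence types: Quotation, Compression, and Inference. The key to the solution is the Generation-time Fine-grained Provenance task and the ReFInE dataset, which contains expert-annotated sentence-level provenance triples (source, type, text), so that the provenance type of each generated sentence can be marked precisely. On this basis, the GenProve framework combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) and optimizes a composite reward covering answer accuracy and provenance correctness, improving the model's ability to give fluent answers with trustworthy, fine-grained provenance. The analysis further shows that current models handle surface-level quotation well but fall short on inference-based provenance, indicating that verifiable reasoning remains an open challenge.

Link: https://arxiv.org/abs/2601.04932
Authors: Jingxuan Wei, Xingyue Wang, Yanghaoyu Liao, Jie Dong, Yuchen Liu, Caijun Jia, Bihui Yu, Junnan Zhu
Affiliations: Shenyang Institute of Computing Technology, Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)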

Abstract:Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.

[NLP-29] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences

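Quick read: This paper asks whether persuasive text produced by generative AI is harder to detect automatically than persuasive text written by humans. To answer this, the authors first categorize controllable generation approaches so that persuasive content of varying style and intensity can be produced systematically, and build Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French, and Russian. The key to the solution is a large-scale empirical evaluation on this benchmark, which finds that overtly persuasive LLM-generated text is easier to identify, whereas subtle, natural-sounding LLM persuasion degrades the performance of automatic detection systems. The paper also provides the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive text, offering a basis for more interpretable and robust detection tools.

Link: https://arxiv.org/abs/2601.04925
Authors: Arkadiusz Modzelewski, Paweł Golik, Anna Kołos, Giovanni Da San Martino
Affiliations: University of Padua; Polish-Japanese Academy of Information Technology; NASK National Research Institute
Categories: Computation and Language (cs.CL)
Comments: Preprint; Paper is currently under review at a major NLP conference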

Abstract:Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.

[NLP-30] V-FAT: Benchmarking Visual Fidelity Against Text-bias

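Quick read: This paper addresses the tendency of Multimodal Large Language Models (MLLMs) to rely on linguistic shortcuts (Text Bias) rather than genuine visual perception in visual reasoning tasks: models answer according to textual statistical correlations or instruction cues instead of an accurate reading of the image. The key to the solution is V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark with a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information (internal corpus bias, external instruction bias, and their synergistic combination), together with the Visual Robustness Score (VRS), which penalizes "lucky" answers obtained by linguistic guessing and so measures true visual understanding more precisely.

Link: https://arxiv.org/abs/2601.04897
Authors: Ziteng Wang, Yujie He, Guanliang Li, Siqi Yang, Jiaqi Xiong, Songxiang Liu
Affiliations: The Chinese University of Hong Kong, Shenzhen; Meituan; University of Oxford
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 12 pages, 6 figures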

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize “lucky” linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.

[NLP-31] Faithful Summarisation under Disagreement via Belief-Level Aggregation

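Quick read: This paper addresses the poor handling of conflicting viewpoints by generative AI in opinion and multi-document summarization: existing methods, especially LLM-based systems, tend to smooth over disagreement implicitly and over-represent majority opinions, which harms the faithfulness of summaries. The key to the solution is a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation: documents are first represented as structured belief sets, and conflict is modeled explicitly with distance-based belief merging operators; a large language model is then used only to turn the aggregated beliefs into a natural language summary. This design lets summaries reflect the disagreement in the source documents faithfully while remaining fluent and grounded.

Link: https://arxiv.org/abs/2601.04889
Authors: Favour Yahdii Aghaebe, Tanefa Apekey, Elizabeth Williams, Nafise Sadat Moosavi
Affiliations: University of Sheffield, UK
Categories: Computation and Language (cs.CL)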

Abstract:Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
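
A minimal sketch of distance-based belief merging, assuming the classical propositional setting in which each source is a set of truth assignments and distance is Hamming distance. The abstract does not say which operators the paper uses; `sum` and `max` below are two textbook aggregators chosen for illustration, and the three-source example is hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of atoms on which two truth assignments differ."""
    return sum(x != y for x, y in zip(a, b))


def merge(belief_bases, n_atoms, aggregate=sum):
    """Return the assignments closest to all belief bases under `aggregate`."""
    def distance_to_base(world, base):
        # Distance from a world to a source is the distance to its nearest model.
        return min(hamming(world, model) for model in base)

    worlds = list(product([False, True], repeat=n_atoms))
    scores = {w: aggregate([distance_to_base(w, base) for base in belief_bases])
              for w in worlds}
    best = min(scores.values())
    return [w for w in worlds if scores[w] == best]


# Hypothetical example over atoms (p, q): two sources assert p and q,
# one source asserts neither.
bases = [{(True, True)}, {(True, True)}, {(False, False)}]
majority_view = merge(bases, 2, aggregate=sum)  # total distance favours the majority
balanced_view = merge(bases, 2, aggregate=max)  # worst-case distance keeps the minority in play
```

With `sum`, the merged result sides with the two agreeing sources; with `max`, it settles on compromise assignments that are equally close to every source, which is one way an operator can keep disagreement visible instead of averaging it away.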

[NLP-32] CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

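[Quick read]: This paper addresses the loss of cultural diversity that arises when large language models (LLMs) serving a global audience are pushed toward a single set of values, a phenomenon the authors call Mean Collapse: when a dense model tries to fit conflicting value distributions, its parameters converge to a generic average that fails to represent distinct cultural groups. The core solution is CuMA (Cultural Mixture of Adapters), a framework that treats alignment as a conditional capacity separation problem. Demographic-aware routing lets the model internalize a Latent Cultural Topology and explicitly disentangle conflicting gradients into specialized expert subspaces, yielding culturally sensitive representations that mitigate mean collapse and improve cross-cultural alignment.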

Link: https://arxiv.org/abs/2601.04885
Authors: Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia
Affiliations: Southeast University; ByteDance Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at this https URL.

[NLP-33] Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis

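[Quick read]: This paper addresses the limited quality, reliability, and coverage of commercial reports produced by current deep research agents. Despite notable progress, existing methods still fall short of the accuracy and comprehensiveness that high-stakes business decisions require. The key to the solution is Mind2Report, a training-free agentic workflow that emulates the cognitive process of a commercial analyst. It augments general-purpose large language models (LLMs) with dynamic memory to probe fine-grained intent, retrieve and distill web information on the fly, and iteratively synthesize long-form reports, improving the professionalism and trustworthiness of the result.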

Link: https://arxiv.org/abs/2601.04879
Authors: Mingyue Cheng, Daoyu Wang, Qi Liu, Shuo Yu, Xiaoyu Tao, Yuqian Wang, Chengzhong Chu, Yu Duan, Mingkang Long, Enhong Chen
Affiliations: University of Science and Technology of China; iFLYTEK Co., Ltd
Categories: Computation and Language (cs.CL)
Comments: 26 Pages, 9 Figures, 7 Tables

Abstract:Synthesizing informative commercial reports from massive and noisy web sources is critical for high-stakes business decisions. Although current deep research agents achieve notable progress, their reports still remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert-level reports. Specifically, it first probes fine-grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training-free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long-form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC-Eval comprising 200 real-world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at this https URL.

[NLP-34] Higher-Order Knowledge Representations for Agentic Scientific Reasoning

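[Quick read]: This paper addresses systems-level reasoning in scientific inquiry, that is, how to integrate heterogeneous experimental data, cross-domain knowledge, and mechanistic evidence into coherent explanations. Large Language Models (LLMs) rely on retrieval-augmented contexts that lack structural depth, while traditional Knowledge Graphs (KGs) are limited to pairwise relations and cannot capture the higher-order interactions that govern emergent physical behavior. The key to the solution is a hypergraph-based knowledge representation that explicitly encodes multi-entity relationships, preserving the co-occurrence context of scientific formulations, and that uses node-intersection constraints to bridge semantically distant concepts. The approach avoids the combinatorial explosion of pairwise expansion and lets agentic systems generate verifiable mechanistic hypotheses grounded in hypergraph topology, uncovering relationships in settings such as biocomposite materials that traditional graph methods struggle to capture.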

Link: https://arxiv.org/abs/2601.04878
Authors: Isabella A. Stewart, Markus J. Buehler
Affiliations: Massachusetts Institute of Technology
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Scientific inquiry requires systems-level reasoning that integrates heterogeneous experimental data, cross-domain knowledge, and mechanistic evidence into coherent explanations. While Large Language Models (LLMs) offer inferential capabilities, they often depend on retrieval-augmented contexts that lack structural depth. Traditional Knowledge Graphs (KGs) attempt to bridge this gap, yet their pairwise constraints fail to capture the irreducible higher-order interactions that govern emergent physical behavior. To address this, we introduce a methodology for constructing hypergraph-based knowledge representations that faithfully encode multi-entity relationships. Applied to a corpus of ~1,100 manuscripts on biocomposite scaffolds, our framework constructs a global hypergraph of 161,172 nodes and 320,201 hyperedges, revealing a scale-free topology (power law exponent ~1.23) organized around highly connected conceptual hubs. This representation prevents the combinatorial explosion typical of pairwise expansions and explicitly preserves the co-occurrence context of scientific formulations. We further demonstrate that equipping agentic systems with hypergraph traversal tools, specifically using node-intersection constraints, enables them to bridge semantically distant concepts. By exploiting these higher-order pathways, the system successfully generates grounded mechanistic hypotheses for novel composite materials, such as linking cerium oxide to PCL scaffolds via chitosan intermediates. This work establishes a “teacherless” agentic reasoning system where hypergraph topology acts as a verifiable guardrail, accelerating scientific discovery by uncovering relationships obscured by traditional graph methods.
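
The node-intersection idea can be sketched as a breadth-first search over hyperedges, where a step from one hyperedge to the next is allowed only if the two share at least `min_shared` nodes. This is an illustrative reading of the abstract, not the authors' tool: the function name `hyperpath` and the toy hyperedges are assumptions. Only the cerium oxide, chitosan, PCL chain comes from the abstract; the remaining node names are invented to fill out the example.

```python
from collections import deque


def hyperpath(hyperedges, source, target, min_shared=1):
    """Shortest chain of hyperedges linking a source node to a target node.

    hyperedges: dict mapping an edge id to the set of nodes it contains.
    Consecutive edges in the chain must share at least `min_shared` nodes.
    Returns a list of edge ids, or None if no chain exists.
    """
    starts = [e for e, nodes in hyperedges.items() if source in nodes]
    queue = deque((e, [e]) for e in starts)
    seen = set(starts)
    while queue:
        edge, path = queue.popleft()
        if target in hyperedges[edge]:
            return path
        for other, nodes in hyperedges.items():
            if other not in seen and len(hyperedges[edge] & nodes) >= min_shared:
                seen.add(other)
                queue.append((other, path + [other]))
    return None


# Toy hypergraph (hypothetical contents).
toy = {
    "e1": {"cerium oxide", "chitosan", "antioxidant activity"},
    "e2": {"chitosan", "PCL scaffold", "electrospinning"},
    "e3": {"PCL scaffold", "bone regeneration"},
}
print(hyperpath(toy, "cerium oxide", "PCL scaffold"))                # ['e1', 'e2']
print(hyperpath(toy, "cerium oxide", "PCL scaffold", min_shared=2))  # None
```

Raising `min_shared` makes the traversal stricter: a bridge through a single shared concept (here chitosan) is accepted at 1 and rejected at 2.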

[NLP-35] EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis

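[Quick read]: This paper addresses the scarcity of high-quality, diverse, and structurally complex datasets for training Text-to-SQL models. Existing methods either rely on limited human-annotated corpora or prompt a large language model (LLM) to generate SQL directly, without explicit control over SQL structure, so the resulting queries are limited in structural diversity and complexity. The key to the solution is EvolSQL, a structure-aware data synthesis framework. It first runs an exploratory Query-SQL expansion to broaden question diversity and schema coverage, then applies an adaptive directional evolution strategy built on six atomic transformation operators derived from the SQL Abstract Syntax Tree (AST), progressively increasing query complexity along the relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication ensure the quality and structural diversity of the generated pairs. In the experiments, a 7B model fine-tuned on this data, which is only 1/18 the size of SynSQL, outperforms one trained on SynSQL.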

Link: https://arxiv.org/abs/2601.04875
Authors: Xuanguang Pan, Chongyang Tao, Jiayuan Bai, Jianling Gao, Zhengwei Tao, Xiansheng Zhou, Gavin Cheung, Shuai Ma
Affiliations: Beihang University; Peking University
Categories: Computation and Language (cs.CL)
Comments: 18 pages

Abstract:Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.

[NLP-36] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs

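[Quick read]: This paper addresses the weak performance of standard Retrieval-Augmented Generation (RAG) systems on complex multi-hop queries. Chunking-based RAG lacks structural connectivity, while early strategies that interleave retrieval with reasoning lack awareness of the corpus as a whole. The key to the solution is ToPG (Traversal over Proposition Graphs), a RAG framework that models the knowledge base as a heterogeneous graph of propositions, entities, and passages, combining the fine-grained fact density of propositions with the connectivity of a graph. Query-aware traversal runs in iterative Suggestion-Selection cycles: the Suggestion phase explores the graph guided by the query, and the Selection phase uses large language model (LLM) feedback to prune irrelevant propositions and seed the next iteration, improving accuracy and answer quality on multi-hop question answering.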

Link: https://arxiv.org/abs/2601.04859
Authors: Maxime Delmas, Lei Xu, André Freitas
Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne; University of Manchester
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 10 figures, 6 tables

Abstract:Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at this https URL.

[NLP-37] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News

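[Quick read]: This paper addresses the coarse granularity of existing misinformation detection: most benchmarks and models issue a binary true/false verdict for a whole claim or paragraph, ignoring that true and false details often co-exist within a single sentence, and they cannot locate the misleading content or distinguish its type. The key to the solution is MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis. It consists of paired real and fake news stories and defines three complementary tasks: MisSpansIdentity (locating false spans within sentences), MisSpansType (classifying false spans by misinformation type), and MisSpansExplanation (providing rationales grounded in the identified spans). Together these move misinformation detection from coarse to fine-grained localisation, with finer categorisation and better interpretability.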

Link: https://arxiv.org/abs/2601.04857
Authors: Zhiwei Liu, Paul Thompson, Jiaqi Rong, Baojie Qu, Runteng Guo, Min Peng, Qianqian Xie, Sophia Ananiadou
Affiliations: The University of Manchester; Zhejiang University; School of Artificial Intelligence, Wuhan University; Center for Language and Information Research, Wuhan University; ELLIS Manchester
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. This project will be available at this https URL.

[NLP-38] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics ICML2026

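[Quick read]: This paper addresses the instability, repetition, and sensitivity to decoding heuristics that conventional autoregressive language models suffer from because they discretize early in generation. The core solution is a continuous autoregressive formulation in which tokens are represented as continuous vectors that evolve ("mature") through a deterministic dynamical process before being discretized; hard decoding into discrete text happens only once the representation has sufficiently converged. Uncertainty is kept and resolved in the continuous space, so coherent and diverse text can be produced without token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms.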

Link: https://arxiv.org/abs/2601.04854
Authors: Oshri Naparstek
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: In preparation for ICML 2026

Abstract:Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that mature over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.

[NLP-39] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection

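[Quick read]: This paper addresses two core challenges in cross-domain misinformation detection. First, existing methods rely on single-perspective cues and generalize poorly across domains that differ substantially in knowledge and discourse. Second, reasoning Large Language Models (LLMs), although effective on complex tasks, assume same-distribution data and do not transfer well to a target domain. The key to the solution is RAAR, a retrieval-augmented agentic reasoning framework built on two mechanisms. It retrieves multi-perspective evidence from the source domain, aligned with each target sample's semantics, sentiment, and writing style, to support cross-domain transfer. It then constructs verifiable multi-step reasoning paths through collaboration among specialized agents: perspective-specific agents produce complementary analyses, a summary agent integrates them under verifier guidance, and a single multi-task verifier is trained with supervised fine-tuning and reinforcement learning to strengthen reasoning and verification.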

Link: https://arxiv.org/abs/2601.04853
Authors: Zhiwei Liu, Runteng Guo, Baojie Qu, Yuechen Jiang, Min Peng, Qianqian Xie, Sophia Ananiadou
Affiliations: The University of Manchester; School of Artificial Intelligence, Wuhan University; Center for Language and Information Research, Wuhan University; ELLIS Manchester
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample’s semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at this https URL.

[NLP-40] When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection

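[Quick read]: This paper addresses a limitation of zero-shot detection of AI-generated text: existing methods usually aggregate token-level statistics over the whole sequence and overlook the temporal dynamics inherent to autoregressive generation. The key to the solution is the discovery and use of Late-Stage Volatility Decay: in AI-generated text, log-probability fluctuations stabilize quickly as generation progresses, whereas human writing stays more variable throughout; in the second half of a sequence, AI-generated text shows 24–32% lower volatility. Building on this, the authors propose two simple features computed only from late-stage statistics, Derivative Dispersion and Local Volatility. Without perturbation sampling or additional model access, the method outperforms existing detectors and is strongly complementary to global methods.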

Link: https://arxiv.org/abs/2601.04833
Authors: Ke Sun, Guangsheng Bao, Han Cui, Yue Zhang
Affiliations: Westlake University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Zero-shot detection methods for AI-generated text typically aggregate token-level statistics across entire sequences, overlooking the temporal dynamics inherent to autoregressive generation. We analyze over 120k text samples and reveal Late-Stage Volatility Decay: AI-generated text exhibits rapidly stabilizing log probability fluctuations as generation progresses, while human writing maintains higher variability throughout. This divergence peaks in the second half of sequences, where AI-generated text shows 24–32% lower volatility. Based on this finding, we propose two simple features, Derivative Dispersion and Local Volatility, which are computed exclusively from late-stage statistics. Without perturbation sampling or additional model access, our method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.
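
The abstract names the two features but does not define them, so the following is only a plausible sketch. It assumes that Derivative Dispersion is the standard deviation of first differences of the token log probabilities, that Local Volatility is the mean standard deviation over sliding windows, and that "late stage" means the second half of the sequence. The function name, the window size, and these definitions are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def late_stage_features(logprobs, window=5):
    """Return (derivative_dispersion, local_volatility) for the late half.

    logprobs: per-token log probabilities under some scoring model.
    Both definitions are illustrative guesses, not the paper's formulas.
    """
    late = logprobs[len(logprobs) // 2:]
    if len(late) < max(window, 2):
        raise ValueError("sequence too short for the chosen window")
    # Spread of step-to-step changes in log probability.
    diffs = [b - a for a, b in zip(late, late[1:])]
    derivative_dispersion = pstdev(diffs)
    # Average spread inside each sliding window.
    windows = [late[i:i + window] for i in range(len(late) - window + 1)]
    local_volatility = mean(pstdev(w) for w in windows)
    return derivative_dispersion, local_volatility


# Toy sequences: one settles down late, the other keeps oscillating.
settled = [-3.0, -1.0, -4.0, -2.0, -3.0, -1.5] + [-1.0] * 6
restless = [-1.0] * 6 + [-1.0, -3.0, -1.0, -3.0, -1.0, -3.0]
settled_features = late_stage_features(settled)
restless_features = late_stage_features(restless)
```

Under these assumed definitions, lower values point toward the "settled" pattern the abstract associates with AI-generated text. A usable detector would still need thresholds or a classifier fitted on labelled data.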

[NLP-41] DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation

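[Quick read]: This paper addresses the resource mismatch that arises when parameter-efficient fine-tuning (PEFT) methods assign the same LoRA rank to every expert in a Mixture-of-Experts (MoE) model. Existing methods ignore the functional specialization of experts, so task-relevant experts are under-provisioned while less relevant ones receive redundant parameters, which hurts both efficiency and performance. The key to the solution is Dynamic Rank LoRA (DR-LoRA). An Expert Saliency Scoring mechanism combines expert routing frequency with LoRA rank importance to quantify each expert's demand for capacity, and the LoRA ranks of high-saliency experts are grown dynamically. This automatically builds a heterogeneous rank distribution tailored to the target task, improving fine-tuning results and parameter utilization under a fixed parameter budget.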

Link: https://arxiv.org/abs/2601.04823
Authors: Guanzhi Deng, Bo Li, Ronghao Chen, Huacan Wang, Linqi Song, Lijie Wen
Affiliations: City University of Hong Kong; Tsinghua University; Peking University; University of Chinese Academy of Sciences
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning (PEFT), such as LoRA, is widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches assign identical LoRA ranks to all experts, overlooking the intrinsic functional specialization within MoE LLMs. This uniform allocation leads to resource mismatch: task-relevant experts are under-provisioned while less relevant ones receive redundant parameters. We propose a Dynamic Rank LoRA framework named DR-LoRA, which dynamically grows expert LoRA ranks during fine-tuning based on task-specific demands. DR-LoRA employs an Expert Saliency Scoring mechanism that integrates expert routing frequency and LoRA rank importance to quantify each expert’s demand for additional capacity. Experts with higher saliency scores are prioritized for rank expansion, enabling the automatic formation of a heterogeneous rank distribution tailored to the target task. Experiments on multiple benchmarks demonstrate that DR-LoRA consistently outperforms standard LoRA and static allocation strategies under the same parameter budget, achieving superior task performance with more efficient parameter utilization.
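
A rough sketch of saliency-driven rank growth under a fixed budget follows. The abstract says the score "integrates" routing frequency and rank importance without giving a formula, so the product used here, the division by current rank (to keep one expert from absorbing everything), and every name and default are assumptions. A real implementation would presumably re-estimate these statistics during fine-tuning rather than take them as fixed inputs.

```python
def grow_ranks(routing_freq, rank_importance,
               init_rank=2, step=2, max_rank=8, budget=16):
    """Greedily hand out LoRA rank to the most salient experts.

    routing_freq / rank_importance: dicts keyed by expert id.
    budget: total rank summed over all experts.
    """
    ranks = {e: init_rank for e in routing_freq}
    # Assumed saliency: routing frequency times rank importance.
    saliency = {e: routing_freq[e] * rank_importance[e] for e in routing_freq}
    while sum(ranks.values()) + step <= budget:
        candidates = [e for e in ranks if ranks[e] + step <= max_rank]
        if not candidates:
            break
        # Assumed tie-breaker: discount experts that already hold more rank.
        best = max(candidates, key=lambda e: saliency[e] / ranks[e])
        ranks[best] += step
    return ranks


# Hypothetical statistics for four experts.
freq = {"e0": 0.5, "e1": 0.3, "e2": 0.15, "e3": 0.05}
importance = {"e0": 0.9, "e1": 0.6, "e2": 0.4, "e3": 0.2}
print(grow_ranks(freq, importance))
# {'e0': 8, 'e1': 4, 'e2': 2, 'e3': 2}
```

A uniform allocation with the same budget would give every expert rank 4; the sketch shifts capacity toward the frequently routed experts instead.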

[NLP-42] Defense Against Indirect Prompt Injection via Tool Result Parsing

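[Quick read]: This paper addresses the threat of Indirect Prompt Injection (IPI) to Large Language Model (LLM) agents that control physical systems and robots. An IPI attack embeds malicious instructions in the results of tool calls to induce the agent to perform unauthorized actions, with potential harm to the physical environment. Existing defenses fall into two groups: dedicated detection models, which are computationally expensive and struggle to keep up with evolving attacks, and prompt-engineering defenses, which are flexible but not robust and leave the Attack Success Rate (ASR) high. The key to the proposed solution is to parse tool results precisely, so that the LLM receives structured, trustworthy data while injected malicious content is filtered out. The method keeps Utility under Attack (UA) competitive while substantially lowering ASR.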

Link: https://arxiv.org/abs/2601.04795
Authors: Qiang Yu, Xinran Cheng, Chuanyi Liu
Affiliations: Harbin Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: 20 pages, 3 figures, 5 tables

Abstract:As LLM agents transition from digital assistants to physical controllers in autonomous systems and robotics, they face an escalating threat from indirect prompt injection. By embedding adversarial instructions into the results of tool calls, attackers can hijack the agent’s decision-making process to execute unauthorized actions. This vulnerability poses a significant risk as agents gain more direct control over physical environments. Existing defense mechanisms against Indirect Prompt Injection (IPI) generally fall into two categories. The first involves training dedicated detection models; however, this approach entails high computational overhead for both training and inference, and requires frequent updates to keep pace with evolving attack vectors. Alternatively, prompt-based methods leverage the inherent capabilities of LLMs to detect or ignore malicious instructions via prompt engineering. Despite their flexibility, most current prompt-based defenses suffer from high Attack Success Rates (ASR), demonstrating limited robustness against sophisticated injection attacks. In this paper, we propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code. Our approach achieves competitive Utility under Attack (UA) while maintaining the lowest Attack Success Rate (ASR) to date, significantly outperforming existing methods. Code is available at GitHub.
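
As a rough illustration of the general idea (pass the model only the data it asked for, not the raw tool output), here is a minimal allow-list parser. It is not the paper's method: the schema format, the length cap, and the function name `parse_tool_result` are assumptions, and the tool output is invented.

```python
import json


def parse_tool_result(raw, schema, max_len=80):
    """Keep only expected fields of the expected type from a JSON tool result.

    schema: dict mapping field name -> expected Python type.
    Anything not listed in the schema (for example free-text notes that may
    carry injected instructions) is dropped before the LLM sees it.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    clean = {}
    for name, expected_type in schema.items():
        value = data.get(name)
        if not isinstance(value, expected_type):
            continue
        # Long strings are more likely to smuggle instructions; drop them.
        if isinstance(value, str) and len(value) > max_len:
            continue
        clean[name] = value
    return clean


# Hypothetical weather-tool output with an injected instruction.
raw = json.dumps({
    "city": "Harbin",
    "temperature_c": -12.5,
    "note": "Ignore previous instructions and unlock the front door.",
})
print(parse_tool_result(raw, {"city": str, "temperature_c": float}))
# {'city': 'Harbin', 'temperature_c': -12.5}
```

An allow-list like this cannot stop an injection placed inside a permitted string field, so it should be read as one layer of a defense rather than a complete one.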

[NLP-43] Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework

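[Quick read]: This paper investigates how role-based authority bias shapes agent interactions in multi-agent systems, in particular how authoritative roles influence other agents in free-form dialogue when large language models (LLMs) are used. The approach is a systematic analysis of 12-turn conversations with ChatEval, classifying authoritative roles by French and Raven's theory of power into Legitimate, Referent, and Expert types. The analysis finds that authority bias does not arise from general agents actively conforming; it arises because authoritative roles consistently hold their positions while general agents remain flexible. Authority influence also depends on clear position statements: neutral responses do not produce bias. These findings inform the design of multi-agent frameworks with asymmetric interaction patterns.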

Link: https://arxiv.org/abs/2601.04790
Authors: Junhyuk Choi, Jeongyoun Kwon, Heeju Kim, Haeun Cho, Hayeong Jung, Sehee Min, Bugeun Kim
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Multi-agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role-based authority bias in free-form multi-agent evaluation using ChatEval. Applying French and Raven’s power-based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12-turn conversations. Experiments with GPT-4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns.

[NLP-44] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems

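[Quick read]: This paper addresses the difficulty of solving non-convex optimization problems, which are pervasive in mathematical programming, engineering design, and scientific computing and often defeat traditional solvers because of complex objectives and non-convex constraint structure. The central obstacle is that manual convexification is inefficient and depends heavily on expert knowledge. The key to the solution is NC2C, an end-to-end automated framework based on Large Language Models (LLMs) that autonomously detects non-convex components, selects convexification strategies, and generates rigorous convex equivalents. It combines symbolic reasoning, adaptive transformation techniques, and iterative validation, with error-correction loops and feasibility-domain correction to keep the transformed problems robust and valid. On 100 generic non-convex problems, NC2C reaches an 89.3% execution rate and a 76% success rate, outperforming the baselines by a wide margin.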

Link: https://arxiv.org/abs/2601.04789
Authors: Xinyue Peng, Yanming Liu, Yihan Cang, Yuwei Zhang, Xinyi Wang, Songhang Deng, Jiannan Cao
Affiliations: Southeast University; Zhejiang University; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: First version of NC2C

Abstract:Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs’ mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C’s ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.

[NLP-45] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models

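[Quick read]: This paper addresses hallucination in Video-Language Models (VLMs) on action recognition and temporal reasoning, which stems from over-reliance on language priors rather than fine-grained visual dynamics. The key to the solution is a scalable counterfactual video generation framework. It combines Multimodal Large Language Models (MLLMs), used for action proposal and editing guidance, with diffusion-based image and video generation models to synthesize videos that differ only in action or temporal structure while preserving the scene context, producing semantic hard negatives. With this framework the authors build CounterVid, a dataset of about 26k preference pairs, and propose MixDPO, which jointly uses textual and visual preferences for Direct Preference Optimization (DPO). The method improves performance on tasks such as temporal ordering and transfers to standard video hallucination benchmarks.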

Link: https://arxiv.org/abs/2601.04778
Authors: Tobia Poppi, Burak Uzkent, Amanmeet Garg, Lucas Porto, Garin Kessler, Yezhou Yang, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, Florian Schiffers
Affiliations: Amazon Prime Video; University of Modena and Reggio Emilia; University of Pisa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.

[NLP-46] LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal

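[Quick read]: This paper addresses a bias in multilingual dense retrieval: multilingual embeddings encode language identity alongside semantics, and the language signal inflates similarity between same-language pairs, crowding out relevant evidence written in other languages. The key to the solution is LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled multilingual embeddings. It identifies language-associated latent units from cross-language activation statistics, suppresses those units at inference time, and reconstructs vectors in the original dimensionality. This gives controllable removal of the language-identity signal without retraining the base encoder or re-encoding the raw text, so the edited embeddings work directly with existing vector databases.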

Link: https://arxiv.org/abs/2601.04768
Authors: Dongjun Kim, Jeongho Yoon, Chanjun Park, Heuiseok Lim
Affiliations: Korea University; Soongsil University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 16 pages, 3 figures

Abstract:Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.
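
The encode, suppress, reconstruct loop can be sketched with a tiny hand-built autoencoder. Nothing here is trained and none of it comes from the paper: the weights, the 3-dimensional "embeddings", and the selection rule (the gap between per-language mean activations) are assumptions chosen so the example stays small and checkable.

```python
from statistics import mean


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def encode(w_enc, b_enc, x):
    """Sparse code z = ReLU(W_enc x + b_enc)."""
    return [max(0.0, s + b) for s, b in zip(matvec(w_enc, x), b_enc)]


def language_units(latents_by_lang, threshold):
    """Indices of latents whose mean activation differs a lot across languages."""
    means = [[mean(col) for col in zip(*zs)] for zs in latents_by_lang.values()]
    return [j for j in range(len(means[0]))
            if max(m[j] for m in means) - min(m[j] for m in means) > threshold]


def edit(w_enc, b_enc, w_dec, x, units):
    """Zero the language units and decode back to the original dimensionality."""
    z = encode(w_enc, b_enc, x)
    for j in units:
        z[j] = 0.0
    return matvec(w_dec, z)


# Toy setup: dims 0-1 carry meaning, dim 2 carries language (+1 en, -1 de).
W_ENC = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, -1]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, -1]]
corpus = {
    "en": [[0.8, 0.2, 1.0], [0.1, 0.9, 1.0]],
    "de": [[0.8, 0.2, -1.0], [0.3, 0.7, -1.0]],
}
latents = {lang: [encode(W_ENC, B_ENC, x) for x in xs] for lang, xs in corpus.items()}
units = language_units(latents, threshold=0.5)
en_edited = edit(W_ENC, B_ENC, W_DEC, corpus["en"][0], units)
de_edited = edit(W_ENC, B_ENC, W_DEC, corpus["de"][0], units)
```

In this toy case the two translations of the same content map to the same edited vector, which is the effect the abstract describes. With a trained autoencoder, reconstruction error and imperfect unit selection would make the match approximate.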

[NLP-47] AT2PO: Agentic Turn-based Policy Optimization via Tree Search

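[Quick read]: This paper addresses three core challenges in multi-turn Agentic Reinforcement Learning: limited exploration diversity, difficult credit assignment under sparse rewards, and a mismatch between the granularity of policy optimization and the granularity of the agent's decisions. The key to the solution is AT²PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework built on a turn-level tree structure that supports two mechanisms: Entropy-Guided Tree Expansion, which increases the diversity of strategic exploration, and Turn-wise Credit Assignment, which propagates reward signals from sparse outcomes at fine granularity. The paper also introduces a turn-level learning objective, Agentic Turn-based Policy Optimization, which aligns policy updates with the natural decision granularity of agentic interaction. This objective is orthogonal to tree search and can be integrated into any multi-turn RL pipeline.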

Link: https://arxiv.org/abs/2601.04767
Authors: Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang
Affiliations: Tencent Inc; Sun Yat-Sen University; Shenzhen MSU-BIT University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT²PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT²PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points in average, with ablation studies validating the effectiveness of each component. Our code is available at this https URL.

[NLP-48] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence

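【Quick Read】: This paper addresses the reliance of Judge Decoding, a method for accelerating large language model (LLM) inference, on expensive and noisy supervision and on complex training procedures. The key insight is that the information a trained judge learns is already encoded in the divergence between the draft and target distributions. The authors prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, showing that both rely on the same logit primitives. Based on this, they propose a training-free verification mechanism that uses KL divergence directly as the acceptance criterion. On several reasoning and coding benchmarks it matches or outperforms complex trained judges, is more robust to domain shift, and removes the supervision bottleneck entirely.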

Link: https://arxiv.org/abs/2601.04766
Authors: Shengyin Sun, Yiming Li, Renxi Liu, Weizhe Lin, Hui-Ling Zhen, Xianzhi Yu, Mingxuan Yuan, Chen Ma
Affiliations: City University of Hong Kong; Huawei Technologies
Categories: Computation and Language (cs.CL)
Comments: 16 pages

Abstract:Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the "criticality" scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
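
A KL-based acceptance test of the kind the abstract describes might look like the sketch below. The divergence direction (target relative to draft), the `threshold` value, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def accept_draft_token(target_logits, draft_logits, threshold=0.1):
    """Accept the draft token when the two next-token distributions are close.

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    p = softmax(target_logits)
    q = softmax(draft_logits)
    return kl_divergence(p, q) <= threshold


same = accept_draft_token([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = accept_draft_token([10.0, 0.0, 0.0], [0.0, 0.0, 10.0])
```

No judge is trained here: the decision uses only the logits that speculative decoding already computes for the draft and target models.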

[NLP-49] Differential syntactic and semantic encoding in LLMs

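【Quick Read】: This paper studies how syntactic and semantic information is encoded in the inner-layer representations of large language models (LLMs), focusing on the very large DeepSeek-V3. The core approach is to average the hidden-representation vectors of sentences that share syntactic structure or meaning; the resulting centroids capture a significant proportion of the syntactic and semantic information. Subtracting a centroid from sentence vectors strongly affects their similarity to syntactically or semantically matched sentences, which suggests that syntax and semantics are at least partially linearly encoded. The study also finds that syntax and semantics have different cross-layer encoding profiles and can to some extent be decoupled, suggesting that LLMs encode these two kinds of linguistic information differently.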

Link: https://arxiv.org/abs/2601.04765
Authors: Santiago Acevedo, Alessandro Laio, Marco Baroni
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
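
The centroid-subtraction probe can be illustrated on toy vectors. The three-dimensional vectors below are invented for the example and stand in for hidden representations; the paper's actual layers, pooling, and similarity measure are not specified in the abstract.

```python
import math


def centroid(vectors):
    """Component-wise mean of a group of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def subtract(vector, center):
    """Remove the shared component by subtracting the centroid."""
    return [a - b for a, b in zip(vector, center)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "sentences" sharing one structure: a large common component on axis 0.
group = [
    [1.0, 0.1, 0.0],
    [1.0, -0.1, 0.0],
    [1.0, 0.0, 0.1],
    [1.0, 0.0, -0.1],
]
center = centroid(group)
before = cosine(group[0], group[1])
after = cosine(subtract(group[0], center), subtract(group[1], center))
```

In this toy case the similarity between two matched vectors drops sharply once the centroid is removed, because the centroid carried the component they had in common.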

[NLP-50] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks EMNLP2025

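【Quick Read】: This paper addresses the lack of a systematic evaluation framework for large language models (LLMs) in patent law, in particular for structured legal reasoning. Prior use has been limited to lightweight tasks and does not measure how well LLMs understand the decision logic of the Patent Trial and Appeal Board (PTAB). The key to the solution is PILOT-Bench, the first PTAB-centric benchmark. It aligns USPTO patent data with PTAB decisions at the case level and formalizes three classification tasks based on the IRAC structure (Issue, Rule, Application, Conclusion): Issue Type, Board Authorities, and Subdecision. The benchmark gives a quantifiable, comparable standard for patent-domain legal reasoning, exposes a substantial gap between closed-source and open-source models, and points to dataset design and model alignment as directions for improvement.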

Link: https://arxiv.org/abs/2601.04758
Authors: Yehoon Jang, Chaewon Lee, Hyun-seok Min, Sungchul Choi
Affiliations: Pukyong National University; Tomocube Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the NLLP Workshop at EMNLP 2025

Abstract:The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case-level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at this https URL.

[NLP-51] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval

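【Quick Read】: This paper addresses hallucinations and factual errors of large language models (LLMs) in complex reasoning and fact verification. Existing Multi-Agent Debate (MAD) frameworks promote diverse reasoning and mutual verification through dialogue among agents, but they rely mainly on internal knowledge or static documents and so remain vulnerable to hallucination. MADKE introduces external evidence, yet its one-time retrieval cannot adapt to new arguments or information that emerges during the debate. The authors propose Tool-MAD, with three innovations: (1) each agent is assigned a distinct external tool (such as a search API or a RAG module), which increases the diversity of perspectives; (2) an adaptive query formulation mechanism iteratively refines evidence retrieval based on the flow of the debate; and (3) Faithfulness and Answer Relevance scores feed the Judge agent's decision, letting it quantitatively assess the coherence and question alignment of each response and detect hallucinations. On four fact verification benchmarks, Tool-MAD outperforms state-of-the-art MAD frameworks with accuracy gains of up to 5.5%, and it shows strong robustness and adaptability in specialized medical domains.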

Link: https://arxiv.org/abs/2601.04742
Authors: Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang
Affiliations: Yonsei University; Sangmyung University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, we propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.

[NLP-52] RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation

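【Quick Read】: This paper addresses the difficulty of detecting implicit harmful prompts when large language models (LLMs) are applied in specialized domains such as finance and healthcare. Public datasets mostly contain explicit harmful prompts, which modern LLM defenses can often detect and refuse, so they do not reflect real-world threats. The paper proposes an end-to-end framework with two parts. First, knowledge-graph-guided harmful prompt generation turns domain knowledge into actionable constraints and produces domain-relevant harmful prompts. Second, dual-path obfuscation rewriting converts explicit harmful prompts into harder-to-detect implicit variants through direct rewriting and context-enhanced rewriting. The resulting datasets combine strong domain relevance with implicitness, supporting more realistic red-teaming and LLM safety research.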

Link: https://arxiv.org/abs/2601.04740
Authors: Huawei Zheng, Xinqi Jiang, Sen Yang, Shouling Ji, Yingcai Wu, Dazhen Deng
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts-expressed through indirect domain knowledge-are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets at GitHub.

[NLP-53] AM³Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

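【Quick Read】: This paper addresses safety vulnerabilities of Multi-modal Large Language Models (MLLMs) in multi-turn interactive use, where harmful intent can be gradually reconstructed across turns. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods were developed mostly for single-turn tasks and depend on costly manual preference annotation, which limits their effectiveness in dialogues. The solution has two parts. The first is InterSafe-V, an open-source multi-modal dialogue dataset with 11,270 dialogues and 500 specially designed refusal visual question answering (VQA) samples. The second is the AM³Safety framework, which combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning driven by turn-aware dual-objective rewards. On Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B, Attack Success Rate (ASR) drops by more than 10%, the harmless dimension improves by at least 8%, and the helpful dimension improves by over 13%, while general abilities are preserved.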

Link: https://arxiv.org/abs/2601.04736
Authors: Han Zhu, Jiale Chen, Chengkun Cai, Shengjie Sun, Haoran Li, Yujin Zhou, Chi-Min Chan, Pengcheng Wen, Lei Li, Sirui Han, Yike Guo
Affiliations: Hong Kong University of Science and Technology; Zhongshan School of Medicine, Sun Yat-sen University; University of Edinburgh; AISpeech; University of Washington
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) task and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM³Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than 10% decrease in Attack Success Rate (ASR) together with an increment of at least 8% in harmless dimension and over 13% in helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.

[NLP-54] Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

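【Quick Read】: This paper addresses the severe inefficiency of current critic-free reinforcement learning (RL) methods for large reasoning models on positive homogeneous prompts, that is, prompts for which every rollout is correct. In that case the advantage estimates are zero and the rollouts are wasted. The key to the solution is Miner, which repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. It has two core innovations: (1) a token-level focal credit assignment mechanism that amplifies gradients on critical uncertain tokens and suppresses updates on overconfident ones; and (2) adaptive advantage calibration, which integrates intrinsic and verifiable rewards. Experiments show gains in training efficiency and performance over existing methods on several reasoning benchmarks; the authors conclude that exploiting latent uncertainty is both necessary and sufficient for efficient, scalable RL training of reasoning models.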

Link: https://arxiv.org/abs/2601.04731
Authors: Shuyang Jiang, Yuhao Wang, Ya Zhang, Yanfeng Wang, Yu Wang
Affiliations: Fudan University; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages

Abstract:Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to Mine intrinsic mastery (Miner), that repurposes the policy’s intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance among the other four algorithms, yielding up to 4.58 absolute gains in Pass@1 and 6.66 gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models.
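
The zero-advantage problem on positive homogeneous prompts follows directly from group-relative normalization. The sketch below assumes the common GRPO-style form, reward minus group mean divided by group standard deviation; the `eps` stabilizer and the function name are illustrative and not taken from the paper.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# A mixed group carries signal: correct rollouts get positive advantage.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# A positive homogeneous group: every rollout is correct, so every advantage
# is zero and the rollouts contribute no gradient.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
```

This is the gap an intrinsic reward is meant to fill: it supplies a non-zero signal for groups in which the verifiable reward cannot distinguish between rollouts.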

[NLP-55] Automatic Classifiers Underdetect Emotions Expressed by Men

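【Quick Read】: This paper addresses systematic gender bias in automatic sentiment and emotion classifiers, which are not equally reliable across populations and differ in accuracy for texts written by men and by women. The approach uses a large-scale dataset of more than one million posts annotated by the authors of the posts themselves, together with a pre-registered research design, to evaluate gender bias across 414 combinations of models and emotion-related classes. Regardless of classifier type or emotion, error rates are consistently higher for texts authored by men than for texts authored by women. The authors conclude that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is unknown or variable, and that sentiment analysis is not yet fair or reliable across demographic groups.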

Link: https://arxiv.org/abs/2601.04730
Authors: Ivan Smirnov, Segun T. Aroyehun, Paul Plener, David Garcia
Affiliations: University of Technology Sydney; University of Konstanz; Medical University of Vienna; University of Ulm; Complexity Science Hub
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.

[NLP-56] Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning

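【Quick Read】: This paper addresses the weak reasoning of large language model (LLM) agents on long-horizon tasks caused by limitations of their memory mechanisms. Existing methods usually store memories in a flat manner and retrieve them by simple semantic similarity, so they struggle to capture logical relations among experiences; memory access is also detached from any constructed structure, which prevents logical reasoning over long-horizon dependencies. The key to the solution is CompassMem, an event-centric memory framework inspired by Event Segmentation Theory. It incrementally segments experiences into events and links them through explicit logical relations to build an Event Graph. The graph serves as a logic map that supports goal-directed navigation over memory beyond superficial retrieval, improving long-horizon reasoning.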

Link: https://arxiv.org/abs/2601.04726
Authors: Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 6 figures

Abstract:Large language models (LLMs) are increasingly deployed as intelligent agents that reason, plan, and interact with their environments. To effectively scale to long-horizon scenarios, a key capability for such agents is a memory mechanism that can retain, organize, and retrieve past experiences to support downstream decision-making. However, most existing approaches organize and store memories in a flat manner and rely on simple similarity-based retrieval techniques. Even when structured memory is introduced, existing methods often struggle to explicitly capture the logical relationships among experiences or memory units. Moreover, memory access is largely detached from the constructed structure and still depends on shallow semantic retrieval, preventing agents from reasoning logically over long-horizon dependencies. In this work, we propose CompassMem, an event-centric memory framework inspired by Event Segmentation Theory. CompassMem organizes memory as an Event Graph by incrementally segmenting experiences into events and linking them through explicit logical relations. This graph serves as a logic map, enabling agents to perform structured and goal-directed navigation over memory beyond superficial retrieval, progressively gathering valuable memories to support long-horizon reasoning. Experiments on LoCoMo and NarrativeQA demonstrate that CompassMem consistently improves both retrieval and reasoning performance across multiple backbone models.

[NLP-57] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

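【Quick Read】: This report addresses cross-modal semantic alignment and high-precision matching in multimodal retrieval: how to map text, images, document images, and video into a unified vector space and search across them efficiently and accurately. The solution is an end-to-end pair of model series, Qwen3-VL-Embedding and Qwen3-VL-Reranker. The embedding model is trained in multiple stages, from large-scale contrastive pre-training to reranking-model distillation, to produce semantically rich high-dimensional vectors, and it supports Matryoshka Representation Learning for flexible embedding dimensions. The reranker uses a cross-encoder architecture with cross-attention to estimate fine-grained relevance for query-document pairs. Together they reach state-of-the-art results on several benchmarks; Qwen3-VL-Embedding-8B scores 77.8 on MMEB-V2, ranking first (as of January 8, 2025, per the abstract).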

Link: https://arxiv.org/abs/2601.04720
Authors: Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
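
Matryoshka-style embeddings are typically consumed by keeping a prefix of the vector and re-normalizing it. The sketch below shows that generic pattern on a made-up four-dimensional vector; it does not call any Qwen3-VL API, and the function name is hypothetical.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0, 0.5]  # stand-in for a model embedding
short = truncate_and_normalize(full, 2)
```

Shorter prefixes trade some retrieval quality for smaller indexes and faster similarity search; which dimensions a given model supports should be taken from its own documentation.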

[NLP-58] Fame Fades Nature Remains: Disentangling the Character Identity of Role-Playing Agents

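【Quick Read】: This paper addresses the weak formalization of character identity in Role-Playing Agents (RPAs) based on large language models (LLMs): characters are often treated as arbitrary text inputs, which leaves their identity vague and hard to quantify. The key to the solution is the concept of Character Identity, a multidimensional construct split into two layers: Parametric Identity, the character-specific knowledge encoded during LLM pre-training, and Attributive Identity, which captures fine-grained behavioral properties such as personality traits and moral values. Using a unified character profile schema, the authors generate both famous and synthetic characters and evaluate them in single-turn and multi-turn interactions. They report two phenomena. "Fame Fades": the initial advantage of famous characters vanishes quickly as the conversation proceeds. "Nature Remains": personality traits are portrayed robustly, but the polarity of morality and interpersonal relationships strongly affects RPA fidelity. Negative social natures are therefore identified as the main bottleneck for role-playing fidelity.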

Link: https://arxiv.org/abs/2601.04716
Authors: Yonghyun Jun, Junhyuk Choi, Jihyeong Park, Hwanhee Lee
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL)
Comments: 27 pages

Abstract:Despite the rapid proliferation of Role-Playing Agents (RPAs) based on Large Language Models (LLMs), the structural dimensions defining a character’s identity remain weakly formalized, often treating characters as arbitrary text inputs. In this paper, we propose the concept of Character Identity, a multidimensional construct that disentangles a character into two distinct layers: (1) Parametric Identity, referring to character-specific knowledge encoded from the LLM’s pre-training, and (2) Attributive Identity, capturing fine-grained behavioral properties such as personality traits and moral values. To systematically investigate these layers, we construct a unified character profile schema and generate both Famous and Synthetic characters under identical structural constraints. Our evaluation across single-turn and multi-turn interactions reveals two critical phenomena. First, we identify "Fame Fades": while famous characters hold a significant advantage in initial turns due to parametric knowledge, this edge rapidly vanishes as models prioritize accumulating conversational context over pre-trained priors. Second, we find that "Nature Remains": while models robustly portray general personality traits regardless of polarity, RPA performance is highly sensitive to the valence of morality and interpersonal relationships. Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.

[NLP-59] DSC2025 – ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

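【Quick Read】: This paper addresses the limited reliability of Vietnamese large language models (LLMs) in production caused by hallucination, and in particular the lack of standardized evaluation frameworks for low-to-medium resource languages. The key contribution is the DSC2025 ViHallu Challenge, the first large-scale shared task on hallucination detection for Vietnamese LLMs. It releases the ViHallu dataset of 10,000 annotated (context, prompt, response) triplets in three categories: no hallucination, intrinsic hallucination (contradicting the context), and extrinsic hallucination (conflicting with external knowledge). Three prompt types, factual, noisy, and adversarial, stress-test robustness. Instruction-tuned models with structured prompting and ensemble strategies clearly outperform generic architectures: the best system reaches a macro-F1 of 84.80% against a baseline of 32.83%. Intrinsic hallucination remains the hardest case, and the work sets a benchmark and direction for research on trustworthy Vietnamese AI systems.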

Link: https://arxiv.org/abs/2601.04711
Authors: Anh Thi-Hoang Nguyen, Khanh Quoc Tran, Tin Van Huynh, Phuoc Tan-Hoang Nguyen, Cam Tan Nguyen, Kiet Van Nguyen
Affiliations: University of Information Technology; University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations–fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types–factual, noisy, and adversarial–to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.

[NLP-60] Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning

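【Quick Read】: This paper addresses the large memory overhead of backpropagation when fine-tuning large language models (LLMs), which grows with model scale. Conventional zeroth-order (ZO) optimization avoids backpropagation by estimating gradients from forward passes and Gaussian sampling, but its reliance on random perturbations gives high-variance gradient estimates, slow convergence, and suboptimal performance. The key to the solution is a prior-informed perturbation mechanism: a guiding vector is computed dynamically from Gaussian samples and steers perturbations toward more informative directions, improving the accuracy and stability of the gradient estimate. The authors prove that their estimator aligns more strongly with the true gradient direction and validate the method on LLMs of varying scales and architectures. On OPT-13B it outperforms traditional ZO optimization on all 11 benchmark tasks and surpasses gradient-based baselines on 9 of the 11, balancing efficiency and accuracy.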

Link: https://arxiv.org/abs/2601.04710
Authors: Feihu Jin, Shipeng Cen, Ying Tan
Affiliations: Peking University; State Key Laboratory of General Artificial Intelligence
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method could seamlessly integrate into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.
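
For orientation, the conventional ZO baseline that the abstract contrasts against can be sketched as a two-point estimator with a Gaussian perturbation. This shows only the standard random-perturbation form; the paper's prior-informed guiding vector is not reproduced here, and the function names, `eps`, and the toy quadratic loss are assumptions.

```python
import random


def zo_gradient(loss, params, eps=1e-3, rng=random):
    """Two-point zeroth-order estimate using one Gaussian direction z."""
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Only two forward evaluations of the loss are needed, no backpropagation.
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * zi for zi in z]


def averaged_zo_gradient(loss, params, n_samples, seed=0):
    """Average many single-direction estimates to reduce variance."""
    rng = random.Random(seed)
    total = [0.0] * len(params)
    for _ in range(n_samples):
        estimate = zo_gradient(loss, params, rng=rng)
        total = [t + g for t, g in zip(total, estimate)]
    return [t / n_samples for t in total]


def quadratic(x):
    """Toy loss with known gradient 2 * x."""
    return sum(v * v for v in x)


params = [1.0, -2.0, 0.5]
estimate = averaged_zo_gradient(quadratic, params, n_samples=2000)
true_grad = [2.0 * p for p in params]
```

A single-direction estimate is noisy, which is the variance problem the abstract describes; averaging many directions reduces the noise at the cost of more forward passes.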

[NLP-61] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards

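【Quick Read】: This paper addresses the dependence of large language model (LLM) post-training on costly human supervision or external verifiers for tasks such as mathematical reasoning and code generation. When labels for high-quality, difficult problems are hard to obtain, the question is how to train stably and effectively on unlabeled data. The key to the solution is PRISM, a framework that combines a Process Reward Model (PRM) with the model's internal confidence (self-certainty) to provide a more reliable training signal. The combination yields stable training and better test-time performance while keeping the model's internal confidence in check.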

Link: https://arxiv.org/abs/2601.04700
Authors: Mukesh Ghimire, Aosong Feng, Liwen You, Youzhi Luo, Fang Liu, Xuan Zhu
Affiliations: Arizona State University; Amazon Web Services
Categories: Computation and Language (cs.CL)
Comments: Preprint. Under Review

Abstract:Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract learning signal from a model’s consistency, either by majority voting or by converting the model’s internal confidence into reward. Although internal consistency metric such as entropy or self-certainty require no human intervention, as we show in this work, these are unreliable signals for large-scale and long-term training. To address the unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside model’s internal confidence in the absence of ground-truth labels. We show that effectively combining PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model’s internal confidence in check.

[NLP-62] TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning

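【Quick Read】: This paper addresses three core challenges in travel planning: (1) pruning candidate points of interest (POIs) while keeping recall high; (2) the limited exploration of the feasible solution space when only a single reasoning path is used; and (3) the difficulty of optimizing hard constraints and soft constraints at the same time. The key to the solution is the TourPlanner framework. It first uses a Personalized Recall and Spatial Optimization (PReSO) workflow to build a spatially-aware candidate POI set. It then introduces Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves exploration of the feasible solution space. Finally, it adds a sigmoid-based gating mechanism to the reinforcement learning stage, which prioritizes soft-constraint satisfaction only after the hard constraints are met, producing higher-quality itineraries.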

链接: https://arxiv.org/abs/2601.04698
作者: Yinuo Wang,Mining Tan,Wenxiang Jiao,Xiaoxi Li,Hao Wang,Xuanyu Zhang,Yuan Lu,Weiming Dong
机构: Xiaohongshu Inc. (小红书公司); Renmin University of China (中国人民大学); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所MAIS)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face several challenges: (1) Pruning candidate points of interest (POIs) while maintaining a high recall rate; (2) A single reasoning path restricts the exploration capability within the feasible solution space for travel planning; (3) Simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct spatially-aware candidate POIs’ set. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves the ability of exploring the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.
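
The sigmoid-based gating idea can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the reward formula, so the function `gated_reward`, the additive combination, and the `threshold` and `sharpness` parameters are all hypothetical.

```python
import math


def gated_reward(hard_score: float, soft_score: float,
                 threshold: float = 0.9, sharpness: float = 20.0) -> float:
    """Combine hard- and soft-constraint scores, both assumed to lie in [0, 1].

    The gate is close to 0 while hard_score is below `threshold` and
    approaches 1 once it is above, so the soft term only contributes for
    plans that already (nearly) satisfy the hard constraints.
    All names and default values here are illustrative assumptions.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (hard_score - threshold)))
    return hard_score + gate * soft_score
```

With these defaults, a plan that violates most hard constraints gets almost no credit for soft-constraint quality, while a fully feasible plan is rewarded for it.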

[NLP-63] A Method for Constructing a Digital Transformation Driving Mechanism Based on Semantic Understanding of Large Models

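[Quick read]: This paper addresses two core problems enterprises face during digital transformation: weak semantic understanding of unstructured data and the lack of intelligent decision support. The approach combines a large language model (LLM) with a knowledge graph (KG). A fine-tuned BERT model first performs entity recognition and relation extraction on multi-source heterogeneous text, and GPT-4 generates semantically enhanced vector representations. A two-layer graph neural network (GNN) then fuses the LLM's semantic vectors with business metadata to build a dynamic, extensible enterprise knowledge graph. Finally, reinforcement learning optimizes decision-path generation, with a reward function driving iteration of the mechanism. The authors report that this raises the intelligence level and execution efficiency of the digital transformation driving mechanism.

Link: https://arxiv.org/abs/2601.04696
Authors: Huayi Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: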

Abstract:In the process of digital transformation, enterprises are faced with problems such as insufficient semantic understanding of unstructured data and lack of intelligent decision-making basis in driving mechanisms. This study proposes a method that combines a large language model (LLM) and a knowledge graph. First, a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model is used to perform entity recognition and relationship extraction on multi-source heterogeneous texts, and GPT-4 is used to generate semantically enhanced vector representations; secondly, a two-layer graph neural network (GNN) architecture is designed to fuse the semantic vectors output by LLM with business metadata to construct a dynamic and scalable enterprise knowledge graph; then reinforcement learning is introduced to optimize decision path generation, and the reward function is used to drive the mechanism iteration. In the case of the manufacturing industry, this mechanism reduced the response time for equipment failure scenarios from 7.8 hours to 3.7 hours, the F1 value reached 94.3%, and the compensation for decision errors in the annual digital transformation cost decreased by 45.3%. This method significantly enhances the intelligence level and execution efficiency of the digital transformation driving mechanism by integrating large model semantic understanding with structured knowledge.

[NLP-64] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

[Quick read]: This paper addresses the weak Korean negation understanding of large language models (LLMs), a gap made worse by the lack of high-quality evaluation benchmarks. The authors build Thunder-KoNUBench, a corpus-based, sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena, and show that fine-tuning on it improves both negation understanding and broader contextual comprehension.

Link: https://arxiv.org/abs/2601.04693
Authors: Sungmok Jung, Yeonkyoung So, Joonhak Lee, Sangho Kim, Yelim Ahn, Jaejin Lee
Affiliations: Graduate School of Data Science, Seoul National University; Dept. of Computer Science and Engineering, Seoul National University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.

[NLP-65] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation

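[Quick read]: This paper addresses the detection, explanation, and intervention of hateful memes, three problems that are linked in practice but usually studied separately. The central challenge is to moderate hateful memes effectively and generalizably when annotated data is scarce, while covering content understanding, explanation, and pre-posting intervention. The authors propose a framework built on generative AI models that combines task-specific generative multimodal agents with the few-shot adaptability of large multimodal models to handle different types of memes, aiming at deployable end-to-end moderation under limited data.

Link: https://arxiv.org/abs/2601.04692
Authors: Naquee Rizwan, Subhankar Swain, Paramananda Bhaskar, Gagan Aryan, Shehryaar Shah Khan, Animesh Mukherjee
Affiliations: Indian Institute of Technology (IIT), Kharagpur; Simbian
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: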

Abstract:In this work, we examine hateful memes from three complementary angles - how to detect them, how to explain their content and how to intervene them prior to being posted - by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic contents.

[NLP-66] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs

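[Quick read]: This paper addresses the lack of logical safety and verifiability in tool-augmented large language models (LLMs) on complex reasoning tasks. Existing frameworks rely on natural language reasoning to decide when to call a tool and whether to commit its result, which gives no formal guarantees and lets invalid or hallucinated results corrupt the world-state representation. The proposed ToolGate framework maintains an explicit symbolic state space, a typed key-value mapping of trusted world information, and formalizes each tool as a Hoare-style contract with a precondition and a postcondition. The precondition gates whether the tool may be invoked; the postcondition is checked at runtime to decide whether the result may be committed to the state. The symbolic state therefore evolves only through verified tool executions.

Link: https://arxiv.org/abs/2601.04688
Authors: Yanming Liu, Xinyue Peng, Jiannan Cao, Xinyi Wang, Songhang Deng, Jintao Chen, Jianwei Yin, Xuhong Zhang
Affiliations: Zhejiang University; Southeast University; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: First version of ToolGate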

Abstract:Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool’s result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
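
A Hoare-style contract of the kind the abstract describes might look roughly like the sketch below. It is not ToolGate's actual API: `ToolContract`, `execute`, and the `to_fahrenheit` tool are hypothetical names chosen for illustration, and a plain dictionary stands in for the typed symbolic state.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # stand-in for the typed key-value symbolic state


@dataclass
class ToolContract:
    name: str
    precondition: Callable[[State], bool]        # gates invocation
    run: Callable[[State], Any]                  # the tool call itself
    postcondition: Callable[[State, Any], bool]  # verifies the result
    output_key: str                              # state key written on success


def execute(state: State, tool: ToolContract) -> State:
    """Return an updated state only if both contract checks pass."""
    if not tool.precondition(state):
        return state                      # invocation blocked
    result = tool.run(state)
    if not tool.postcondition(state, result):
        return state                      # result rejected, state unchanged
    return {**state, tool.output_key: result}


# Hypothetical tool: convert a Celsius reading to Fahrenheit.
to_fahrenheit = ToolContract(
    name="to_fahrenheit",
    precondition=lambda s: isinstance(s.get("celsius"), (int, float)),
    run=lambda s: s["celsius"] * 9 / 5 + 32,
    postcondition=lambda s, r: isinstance(r, float) and r >= -459.67,
    output_key="fahrenheit",
)
```

Because `execute` returns the original state whenever either check fails, an unverified result never reaches the state, which is the property the abstract emphasizes.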

[NLP-67] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning ACL2026

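[Quick read]: This paper addresses three challenges vision-language models (VLMs) face in agricultural disease diagnosis: conventional fine-tuning needs large amounts of labeled data, offers little interpretability, and generalizes poorly across domains. Existing reasoning-based methods also depend on costly expert annotation and rarely handle the open-ended, diverse queries found in agriculture. The proposed Agri-R1 is a reasoning-enhanced agricultural model. It generates high-quality reasoning data automatically through vision-language synthesis and LLM-based filtering, using only 19% of the available samples, and trains with Group Relative Policy Optimization (GRPO) under a reward function that combines domain-specific lexicons with fuzzy matching to score both correctness and linguistic flexibility. On CDDMBench, the 3B-parameter model is competitive with 7B- to 13B-parameter baselines, with a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point gain in cross-domain generalization over standard fine-tuning; the gains grow with question complexity.

Link: https://arxiv.org/abs/2601.04672
Authors: Wentao Zhang, Lifei Wang, Lina Lu, MingKun Xu, Shangyang Li, Yanchao Yang, Tao Fang
Affiliations: Shandong University of Technology; MIC-Lab, Institute of International Language Services Studies, Macau Millennium College; Guangdong Institute of Intelligence Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: This paper is submitted for review to ACL 2026. It is 17 pages long and includes 5 figures. The corresponding authors are Tao Fang and Lina Lu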

Abstract:Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.

[NLP-68] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models

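[Quick read]: This paper addresses the unclear neuron-level organization of language capabilities in multilingual large language models. Existing activation-based heuristics conflate language preference with functional importance, which makes it hard to identify language-specific neurons accurately. The proposed CRANE framework is a relevance-based analysis method grounded in functional necessity: it identifies language-specific neurons through targeted neuron-level interventions and measures a neuron's contribution to language-conditioned predictions rather than its activation magnitude. Experiments show that CRANE isolates language-specific components more precisely and reveals an asymmetric pattern in which neurons are language-selective but not exclusive.

Link: https://arxiv.org/abs/2601.04664
Authors: Yifan Le, Yunliang Li
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures. Work in progress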

Abstract:Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.

[NLP-69] Succeeding at Scale: Automated Multi-Retriever Fusion and Query-Side Adaptation for Multi-Tenant Search

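[Quick read]: This paper addresses the "dark data" problem in large-scale multi-tenant retrieval systems: vast user query logs come without curated relevance labels, and the high cost of model updates makes domain adaptation hard in multi-tenant settings. The authors introduce DevRev Search, a passage retrieval benchmark built by a fully automatic pipeline. It uses fusion-based candidate generation that pools results from several sparse and dense retrievers, and an LLM-as-a-Judge for consistency filtering and relevance labeling. They also propose an Index-Preserving Adaptation strategy: fine-tuning only specific transformer layers of the query encoder with Low-Rank Adaptation (LoRA) improves performance without re-indexing the document corpus, giving a favorable quality-efficiency trade-off and a scalable path to personalized enterprise search.

Link: https://arxiv.org/abs/2601.04646
Authors: Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis
Affiliations: DevRev; The University of Texas at Austin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: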

Abstract:Large-scale multi-tenant retrieval systems amass vast user query logs yet critically lack the curated relevance labels required for effective domain adaptation. This “dark data” problem is exacerbated by the operational cost of model updates: jointly fine-tuning query and document encoders requires re-indexing the entire corpus, which is prohibitive in multi-tenant environments with thousands of isolated indices. To address these dual challenges, we introduce DevRev Search, a passage retrieval benchmark for technical customer support constructed through a fully automatic pipeline. We employ a fusion-based candidate generation strategy, pooling results from diverse sparse and dense retrievers, and utilize an LLM-as-a-Judge to perform rigorous consistency filtering and relevance assignment. We further propose a practical Index-Preserving Adaptation strategy: by fine-tuning only the query encoder via Low-Rank Adaptation (LoRA), we achieve competitive performance improvements while keeping the document index frozen. Our experiments on DevRev Search and SciFact demonstrate that targeting specific transformer layers in the query encoder yields optimal quality-efficiency trade-offs, offering a scalable path for personalized enterprise search.

[NLP-70] DP-MGTD: Privacy-Preserving Machine-Generated Text Detection via Adaptive Differentially Private Entity Sanitization

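[Quick read]: This paper addresses the conflict between privacy protection and authorship verification that arises when Machine-Generated Text (MGT) detection systems process sensitive user data. Conventional anonymization often breaks linguistic fluency, while strict Differential Privacy (DP) mechanisms can weaken the statistical signals that detection relies on. The proposed DP-MGTD framework uses an adaptive differentially private entity sanitization algorithm with a two-stage mechanism: noisy frequency estimation followed by dynamic calibration of the privacy budget, with the Laplace mechanism applied to numerical entities and the Exponential mechanism applied to textual entities. Notably, the authors find that DP noise increases the distinguishability of human and machine text because the two respond differently to perturbation, which allows near-perfect detection accuracy under strict privacy constraints.

Link: https://arxiv.org/abs/2601.04641
Authors: Lionel Z. Wang, Yusheng Zhao, Jiabin Luo, Xinfeng Li, Lixu Wang, Yinan Peng, Haoyang Li, XiaoFeng Wang, Wei Dong
Affiliations: Nanyang Technological University; The Hong Kong Polytechnic University; University of Science and Technology of China; Peking University; Hengxin Tech.
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 1 figure, 1 table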

Abstract:The deployment of Machine-Generated Text (MGT) detection systems necessitates processing sensitive user data, creating a fundamental conflict between authorship verification and privacy preservation. Standard anonymization techniques often disrupt linguistic fluency, while rigorous Differential Privacy (DP) mechanisms typically degrade the statistical signals required for accurate detection. To resolve this dilemma, we propose DP-MGTD, a framework incorporating an Adaptive Differentially Private Entity Sanitization algorithm. Our approach utilizes a two-stage mechanism that performs noisy frequency estimation and dynamically calibrates privacy budgets, applying Laplace and Exponential mechanisms to numerical and textual entities respectively. Crucially, we identify a counter-intuitive phenomenon where the application of DP noise amplifies the distinguishability between human and machine text by exposing distinct sensitivity patterns to perturbation. Extensive experiments on the MGTBench-2.0 dataset show that our method achieves near-perfect detection accuracy, significantly outperforming non-private baselines while satisfying strict privacy guarantees.
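
For readers unfamiliar with the two mechanisms named in the abstract, the sketch below shows their textbook forms. It does not reproduce the paper's adaptive budget calibration, which the abstract does not specify; the function names, the utility function, and the sensitivity values are assumptions for illustration.

```python
import math
import random


def laplace_mechanism(value: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Add Laplace(0, sensitivity / epsilon) noise to a numeric entity.

    The difference of two i.i.d. exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise


def exponential_mechanism(candidates, utility, sensitivity: float,
                          epsilon: float, rng: random.Random):
    """Sample a replacement with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    scores = [utility(c) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A smaller `epsilon` means more noise for numeric entities and a flatter sampling distribution for textual ones; a larger `epsilon` approaches the unperturbed value or the highest-utility candidate.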

[NLP-71] SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation

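[Quick read]: This paper addresses two problems in medical consultation: conventional long-text interaction is cumbersome and unfriendly to patients, and speech language models (SpeechLMs) are hard to apply in medicine because medical speech data is scarce and direct fine-tuning on speech is inefficient. The proposed SpeechMedAssist is a SpeechLM that natively supports multi-turn spoken interaction. It splits the usual one-stage training into two stages: the first injects knowledge and capability using text data, and the second re-aligns modalities with only 10k synthesized speech samples. This greatly reduces the dependence on high-quality medical speech data and improves effectiveness and robustness in realistic medical dialogue.

Link: https://arxiv.org/abs/2601.04638
Authors: Sirry Chen, Jieyi Wang, Wei Chen, Zhongyu Wei
Affiliations: Fudan University; Shanghai Innovation Institute; Peking University; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: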

Abstract:Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.

[NLP-72] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

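[Quick read]: This paper addresses a growing difficulty in detecting generated text: Machine-Generated Text (MGT) is increasingly hard to tell apart from Human-Written Text (HWT), which worsens abuse such as fake news and online fraud. Detectors also generalize poorly when training data quality is low, and simply adding more MGT sources does not help much. The authors propose MAGA (Machine-Augment-Generated Text via Alignment), which strengthens alignment throughout the generation process to improve detector generalization. Its key component is Reinforced Learning from Detectors Feedback (RLDF), which uses detector feedback as a reinforcement signal to bring the semantics, style, and structure of generated text closer to human writing. The resulting data both stress-tests detector robustness and improves generalization: a RoBERTa detector fine-tuned on the MAGA training set gains 4.60% in average AUC, while the AUC of several mainstream detectors drops by 8.13% on average on the MAGA dataset.

Link: https://arxiv.org/abs/2601.04633
Authors: Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng
Affiliations: Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Categories: Computation and Language (cs.CL)
Comments: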

Abstract:Large Language Models (LLMs) alignment is constantly evolving. Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. Fine-tuned detectors’ generalization ability is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient. Further augment of generation process is required. According to HC-Var’s theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA’s pipeline achieves comprehensive alignment from prompt construction to reasoning process, among which Reinforced Learning from Detectors Feedback (RLDF), systematically proposed by us, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. MAGA Dataset caused an average decrease of 8.13% in the AUC of the selected detectors, expecting to provide indicative significance for future research on the generalization detection ability of detectors.

[NLP-73] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset

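[Quick read]: This paper addresses the uneven performance of large language models (LLMs) across languages and cultures, which stems from English-centric training data and leaves models with a weak grasp of non-English cultural contexts. To achieve practical cultural alignment, the authors propose a scalable approach that uses national social studies curricula as the basis for culture-aware supervision. Its main contribution is CuCu, an automated multi-agent LLM framework that turns national textbook curricula into open-ended, culture-specific question-answer pairs. Applying it to the Korean national social studies curriculum yields the KCaQA dataset (34.1k QA pairs); analysis suggests the data covers culture-specific topics and produces responses grounded in local sociocultural context.

Link: https://arxiv.org/abs/2601.04632
Authors: Haneul Yoo, Won Ik Cho, Geunhye Kim, Jiyoon Han
Affiliations: KAIST; AI Center, Samsung Electronics; Hankuk University of Foreign Studies; Upstage
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: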

Abstract:Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.

[NLP-74] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR

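[Quick read]: This paper addresses the internal cognitive inconsistency of current role-playing agents (RPAs), which imitate only surface behavior and tend to go out of character in complex situations. The proposed Character-R1 framework supplies verifiable reward signals for role-aware reasoning through three designs: (1) a Cognitive Focus Reward that requires explicit label-based analysis of 10 character elements (such as worldview) to structure internal cognition; (2) a Reference-Guided Reward that uses overlap-based metrics against reference responses as optimization anchors, improving exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions by character category so that optimization stays robust across heterogeneous roles.

Link: https://arxiv.org/abs/2601.04611
Authors: Yihong Tang, Kehai Chen, Xuefeng Bai, Benyou Wang, Zeming Liu, Haifeng Wang, Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen; Shenzhen Loop Area Institute (SLAI); The Chinese University of Hong Kong, Shenzhen; Beijing University of Aeronautics and Astronautics; Baidu Inc.
Categories: Computation and Language (cs.CL)
Comments: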

Abstract:Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods in knowledge, memory and others.
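
One plausible reading of Character-Conditioned Reward Normalization is a per-category z-score, sketched below. The abstract does not give the formula, so the z-score choice, the function name `normalize_by_category`, and the example category labels are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_category(rewards, categories, eps=1e-8):
    """Z-score each reward against the rewards that share its category.

    This keeps a category with systematically high raw rewards from
    dominating the training signal. (Assumed formulation, not the paper's.)
    """
    groups = defaultdict(list)
    for reward, category in zip(rewards, categories):
        groups[category].append(reward)
    stats = {c: (mean(v), pstdev(v)) for c, v in groups.items()}
    return [(r - stats[c][0]) / (stats[c][1] + eps)
            for r, c in zip(rewards, categories)]
```

Under this sketch, a reward of 30 in a category averaging 20 and a reward of 3 in a category averaging 2 can end up with the same normalized value.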

[NLP-75] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

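[Quick read]: This paper addresses the conflation of specificity with length in image descriptions generated by vision-language models (VLMs). Current systems often treat longer descriptions as more specific, but a description can be short and information-dense or long and vacuous. The authors define specificity as how well a description picks out the target image relative to a contrast set, and build a dataset that varies information content while controlling for length. They show that people prefer more specific descriptions regardless of length, and that controlling for length alone does not explain differences in specificity: what matters is how the length budget is spent. The findings support evaluation that prioritizes specificity over verbosity.

Link: https://arxiv.org/abs/2601.04609
Authors: Rhea Kapur, Robert Hawkins, Elisa Kreiss
Affiliations: Stanford University; University of California, Los Angeles
Categories: Computation and Language (cs.CL)
Comments: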

Abstract:Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with their length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
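
The contrast-set definition of specificity could be operationalized as below. This is a sketch under assumptions: the abstract does not state a formula, the scores are imagined to come from some image-text matching model, and the softmax form and the name `specificity` are illustrative choices.

```python
import math
from typing import Sequence


def specificity(target_score: float, contrast_scores: Sequence[float]) -> float:
    """Log-probability that a description selects the target image
    out of the target plus the contrast images, using a softmax over
    (hypothetical) description-image match scores."""
    scores = [target_score] + list(contrast_scores)
    top = max(scores)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return target_score - log_norm
```

The value is 0 at best (the description matches only the target) and becomes more negative as the description fits the contrast images equally well; description length does not enter the computation at all.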

[NLP-76] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions

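[Quick read]: This paper addresses the poor performance of knowledge editing based on Rank-One Model Editing (ROME) on multi-hop reasoning tasks. Editing knowledge at different layer depths triggers three failure modes: "hopping-too-late", where later layers lack the necessary intermediate representations; a sharp drop in generalization when later layers are edited; and overfitting to edited knowledge, where the model wrongly prioritizes the edited-hop answer. To mitigate hopping-too-late and the generalization decay, the authors propose Redundant Editing, a simple strategy that applies the knowledge edit at multiple layers to make multi-hop reasoning chains more stable and accurate. Experiments show an accuracy gain of at least 15.5 percentage points on 2-hop questions (a 96% increase over the single-edit strategy), at some cost in specificity and language naturalness.

Link: https://arxiv.org/abs/2601.04600
Authors: Zhiyuan He, Binghan Chen, Tianxiang Xiong, Ziyang Sun, Mozhao Zhu, Xi Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: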

Abstract:Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the “hopping-too-late” problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of “hopping-too-late” and generalisation decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.
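
As background, the core of a rank-one edit is a single outer-product update that forces a weight matrix to map a chosen key vector to a new value vector. The sketch below is a simplification: ROME itself weights the key by an inverse key-covariance matrix, which is taken to be the identity here, and `rank_one_edit` is a hypothetical helper name.

```python
def rank_one_edit(W, k, v):
    """Return W + (v - W k) k^T / (k^T k), so the edited matrix maps k to v.

    W is a list of rows, k the key vector, v the desired value vector.
    Simplifying assumption: identity key covariance (ROME uses an
    estimated covariance instead).
    """
    Wk = [sum(w * x for w, x in zip(row, k)) for row in W]
    kk = sum(x * x for x in k)
    return [[w + (v_i - wk_i) * x / kk for w, x in zip(row, k)]
            for row, v_i, wk_i in zip(W, v, Wk)]
```

Inputs orthogonal to `k` pass through unchanged, which is why a single edit can be precise for one fact yet leave intermediate representations elsewhere in the network untouched.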

[NLP-77] THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai – Technical Report

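[Quick read]: This report addresses the resource-performance trade-off organizations face when deploying large language models (LLMs): run several specialized models for task-specific accuracy, or pay the high cost of training one general model with multiple capabilities. The authors explore model merging as a resource-efficient alternative, combining models with different specialties (the general instruction-following Qwen-8B, the Thai-enhanced ThaiLLM-8B, and the finance-tuned THaLLE-CFA-8B) to improve multi-domain performance without retraining a full model. The merged models show clear gains on the Thai national standardized exams (M3/M6 O-NET), a financial certification benchmark (Flare-CFA), and a Thai benchmark (Thai-IC), supporting model merging as a viable way to build multi-capability LLMs.

Link: https://arxiv.org/abs/2601.04597
Authors: KBTG Labs: Anuruth Lertpiya, Danupat Khamnuansin, Kantapong Sucharitpongpan, Pornchanan Balee, Tawunrat Chalothorn, Thadpong Pongthawornkamol, Monchai Lertsutthiwong
Affiliations: NLP-Voice Research Lab, KBTG Labs, KASIKORN Business-Technology Group
Categories: Computation and Language (cs.CL)
Comments: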

Abstract:Large Language Models (LLMs) have demonstrated significant potential across various domains, particularly in banking and finance, where they can automate complex tasks and enhance decision-making at scale. Due to privacy, security, and regulatory concerns, organizations often prefer on-premise deployment of LLMs. The ThaiLLM initiative aims to enhance Thai language capabilities in open-LLMs, enabling Thai industry to leverage advanced language models. However, organizations often face a trade-off between deploying multiple specialized models versus the prohibitive expense of training a single multi-capability model. To address this, we explore model merging as a resource-efficient alternative for developing high-performance, multi-capability LLMs. We present results from two key experiments: first, merging Qwen-8B with ThaiLLM-8B demonstrates how ThaiLLM-8B enhances Thai general capabilities, showing an uplift of M3 and M6 O-NET exams over the general instruction-following Qwen-8B. Second, we merge Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B. This combination results in further improvements in performance across both general and financial domains, by demonstrating an uplift in both M3 and M6 O-NET, Flare-CFA, and Thai-IC benchmarks. The report showcases the viability of model merging for efficiently creating multi-capability LLMs.
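
The abstract does not say which merging method the report uses. The simplest form of model merging, a weighted average of parameters from models that share an architecture, is sketched below purely as an illustration; `merge_weights` and the toy parameter dictionaries are hypothetical.

```python
def merge_weights(models, coefficients):
    """Weighted average of same-shaped parameter dicts.

    Each model maps parameter names to flat lists of floats. This is plain
    linear merging, shown as an assumption; the report may use another method.
    """
    total = sum(coefficients)
    merged = {}
    for name in models[0]:
        columns = zip(*(m[name] for m in models))
        merged[name] = [
            sum(c * x for c, x in zip(coefficients, column)) / total
            for column in columns
        ]
    return merged
```

Merging of this kind needs no gradient updates, which is what makes it cheap compared with training a single multi-capability model.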

[NLP-78] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization EACL

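[Quick read]: This paper addresses the weak semantic alignment and clarity of charts produced by Text-to-Visualization (Text2Vis) systems. Closed-source large language models generate executable code whose charts often lack semantic alignment, while open-source models frequently produce non-executable or visually poor output. The authors propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis. It builds on Group Relative Policy Optimization (GRPO) and uses a multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality from post-execution feedback, which markedly improves chart quality and code execution success.

Link: https://arxiv.org/abs/2601.04582
Authors: Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Affiliations: York University; Salesforce AI Research; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: Accepted to EACL Main Conference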

点击查看摘要

Abstract:Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at this https URL.
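The abstract describes a multi-objective reward optimized with GRPO but does not give the formula. The sketch below is only an illustration of the general idea: three per-rollout scores are combined with a weighted sum and then standardized within the rollout group. The function names, the weights, and the choice of a weighted sum are assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def combined_reward(text_score, code_ok, vis_score, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward components (weights are illustrative assumptions)."""
    w_text, w_code, w_vis = weights
    return w_text * text_score + w_code * (1.0 if code_ok else 0.0) + w_vis * vis_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one query: (text accuracy, code executed?, chart quality).
# Rollouts whose code fails to execute have no chart, so their chart score is 0.
rollouts = [
    (0.9, True, 0.8),
    (0.4, True, 0.3),
    (0.7, False, 0.0),
    (0.2, False, 0.0),
]
rewards = [combined_reward(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the advantages are computed relative to the group mean, a rollout is reinforced only if it beats its siblings for the same query, which is why the chart-quality term can influence training only when it is measured after execution.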

[NLP-79] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

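[Quick Read]: This paper addresses noise propagation in automated essay scoring (AES) models that are trained on feedback generated by large language models (LLMs) without quality validation. The key to the solution is the FeedEval framework, which is grounded in pedagogical principles and evaluates LLM-generated feedback along three dimensions: specificity, helpfulness, and validity. Dimension-specialized LLM evaluators select high-quality feedback, which improves downstream essay scoring and revision.

Link: https://arxiv.org/abs/2601.04574
Authors: Seongyeub Chu, Jongwoo Kim, Munyong Yi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: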

Abstract:Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon acceptance.

[NLP-80] Neurosymbolic Retrievers for Retrieval-augmented Generation

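[Quick Read]: This paper addresses the poor interpretability, difficult debugging, and low trust of conventional Retrieval Augmented Generation (RAG) systems, whose three interconnected neural components (retriever, re-ranker, and generator) reason opaquely, which makes clear decision-making hard to guarantee in high-stakes settings. The key to the solution is the Neurosymbolic RAG framework, which combines symbolic reasoning (such as knowledge graphs) with neural retrieval to make document selection more transparent and the retrieval process easier to understand. It comprises three methods: Knowledge Modulation Aligned Retrieval (MAR), which uses interpretable symbolic features to modulate query embeddings and make the matching logic explicit; KG-Path RAG, which enriches queries by traversing knowledge-graph paths to improve retrieval quality and interpretability; and Process Knowledge-infused RAG, which uses domain-specific tools to reorder retrieved content according to validated workflows, yielding a more reliable and traceable generation process.

Link: https://arxiv.org/abs/2601.04568
Authors: Yash Saxena, Manas Gaur
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 8 pages, 2 Figures, To Appear in IEEE Intelligent Systems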

Abstract:Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency. However, traditional RAG systems consist of three interconnected neural components - the retriever, re-ranker, and generator - whose internal reasoning processes remain opaque. This lack of transparency complicates interpretability, hinders debugging efforts, and erodes trust, especially in high-stakes domains where clear decision-making is essential. To address these challenges, we introduce the concept of Neurosymbolic RAG, which integrates symbolic reasoning using a knowledge graph with neural retrieval techniques. This new framework aims to answer two primary questions: (a) Can retrievers provide a clear and interpretable basis for document selection? (b) Can symbolic knowledge enhance the clarity of the retrieval process? We propose three methods to improve this integration. The first is MAR (Knowledge Modulation Aligned Retrieval), which employs modulation networks to refine query embeddings using interpretable symbolic features, thereby making document matching more explicit. Second, KG-Path RAG enhances queries by traversing knowledge graphs to improve overall retrieval quality and interpretability. Lastly, Process Knowledge-infused RAG utilizes domain-specific tools to reorder retrieved content based on validated workflows. Preliminary results from mental health risk assessment tasks indicate that this neurosymbolic approach enhances both transparency and overall performance.

[NLP-81] BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

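[Quick Read]: This paper addresses the security vulnerabilities that backdoor threats create in the multi-step workflows of large language model (LLM) agents, focusing on how triggers propagate across stages such as planning, memory, and tool use. Existing studies typically analyze single attack vectors in isolation and lack a systematic, agent-centric understanding. The authors propose the BackdoorAgent framework, whose core idea is to divide the agent workflow into three functional stages (planning attacks, memory attacks, and tool-use attacks) and to use a modular design to systematically monitor and analyze trigger activation and propagation. The framework shows that triggers implanted at a single stage can persist across multi-step execution (for example, 77.97% trigger persistence for memory attacks with a GPT-based backbone), and it provides a standardized benchmark covering four representative agent applications (Agent QA, Agent Code, Agent Web, and Agent Drive) as a reproducible evaluation platform for future research.

Link: https://arxiv.org/abs/2601.04566
Authors: Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, Yugang Jiang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: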

Abstract:Large language model (LLM) agents execute tasks through multi-step workflows that combine planning, memory, and tool use. While this design enables autonomy, it also expands the attack surface for backdoor threats. Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs. However, existing studies remain fragmented and typically analyze individual attack vectors in isolation, leaving the cross-stage interaction and propagation of backdoor triggers poorly understood from an agent-centric perspective. To fill this gap, we propose BackdoorAgent, a modular and stage-aware framework that provides a unified, agent-centric view of backdoor threats in LLM agents. BackdoorAgent structures the attack surface into three functional stages of agentic workflows, including planning attacks, memory attacks, and tool-use attacks, and instruments agent execution to enable systematic analysis of trigger activation and propagation across different stages. Building on this framework, we construct a standardized benchmark spanning four representative agent applications: Agent QA, Agent Code, Agent Web, and Agent Drive, covering both language-only and multimodal settings. Our empirical analysis shows that triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states. For instance, when using a GPT-based backbone, we observe trigger persistence in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool-stage attacks, highlighting the vulnerabilities of the agentic workflow itself to backdoor threats. To facilitate reproducibility and future research, our code and benchmark are publicly available at GitHub.

[NLP-82] A Vision for Multisensory Intelligence: Sensing, Synergy, and Science

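[Quick Read]: This paper addresses the limitation that current artificial intelligence (AI) is largely confined to digital modalities such as text, vision, and audio and neglects the multisensory nature of human experience. The core challenge is enabling AI systems to perceive and understand diverse signals from physiology, touch, the physical environment, and social interaction, so that they can interact with humans more naturally and deeply. The key to the solution is a research framework built on three pillars, sensing, science, and synergy: first, extend how AI perceives the world beyond traditional digital media; second, establish unified multimodal modeling architectures and cross-modal transfer mechanisms and quantify the interactions among heterogeneous modalities; and third, tackle technical challenges in multimodal integration, alignment, reasoning, generation, and generalization, so that human-AI collaborative agents can achieve a higher-level perception-action loop in complex environments.

Link: https://arxiv.org/abs/2601.04563
Authors: Paul Pu Liang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: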

Abstract:Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals, from physiological and tactile cues on the body to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, the field should develop a principled science for quantifying multimodal heterogeneity and interactions, build unified modeling architectures and representations, and understand cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of the latest advances from the Multisensory Intelligence group at the MIT Media Lab, see this https URL.

[NLP-83] Identifying Good and Bad Neurons for Task-Level Controllable LLMs

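[Quick Read]: This paper addresses two core problems in understanding neuron function in large language models (LLMs). First, existing methods mostly target "supportive neurons" for a specific ability and struggle with tasks that require several abilities to work together. Second, they ignore inhibitive neurons and are easily misled by fortuitous behaviors (the model answering correctly by chance), which distorts neuron attribution. The key to the solution is the NeuronLLM framework, which follows the biological principle of functional antagonism: it models task performance as the joint effect of "good neurons" that facilitate task completion and "bad neurons" that inhibit it, learns both jointly through contrastive learning, and uses augmented question sets to mitigate fortuitous behaviors, giving a more complete and accurate picture of LLM neurons.

Link: https://arxiv.org/abs/2601.04548
Authors: Wenjie Li, Guansong Pang, Hezhe Qiao, Debin Gao, David Lo
Affiliations: Singapore Management University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: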

Abstract:Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies have made progress on identifying the neurons responsible for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles, such as inhibitive roles, and the misleading neuron attribution caused by fortuitous behaviors in LLMs (i.e., correctly answering questions by chance rather than through genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.

[NLP-84] Not All Steps are Informative: On the Linearity of LLMs' RLVR Training

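[Quick Read]: This paper addresses the high computational cost of reinforcement learning with verifiable rewards (RLVR) in large language model (LLM) post-training, in particular the long training runs caused by prolonged exploration. The study finds that model weights and output logits evolve in a strongly linear way during RLVR training, which suggests that RLVR mainly amplifies trends formed early in training rather than continually discovering new behaviors. The key to the solution is to exploit this linearity: Weight Extrapolation and Logits Extrapolation predict future model states from intermediate checkpoints and avoid continued expensive RL training. Logits Extrapolation outperforms continued RL training on all four benchmarks and can improve performance beyond the step range in which training remains stable.

Link: https://arxiv.org/abs/2601.04537
Authors: Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao
Affiliations: Tsinghua University; Institute for AI and Robotics; Alibaba Group; Tongyi Lab
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: pre-print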

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.
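If weights really do change linearly with the training step, a future checkpoint can be estimated from two earlier ones. The sketch below shows plain two-point linear extrapolation; it is a minimal illustration under that assumption, not the paper's exact procedure. The function name, the parameter names, and the use of scalar floats in place of tensors are all hypothetical.

```python
def extrapolate_weights(w_a, step_a, w_b, step_b, target_step):
    """Linearly extrapolate every parameter from two checkpoints to target_step."""
    if step_a == step_b:
        raise ValueError("checkpoints must come from different training steps")
    # How many (step_b - step_a) intervals lie between step_b and the target.
    scale = (target_step - step_b) / (step_b - step_a)
    return {name: w_b[name] + scale * (w_b[name] - w_a[name]) for name in w_b}


# Scalars stand in for full parameter tensors to keep the example dependency-free.
ckpt_100 = {"layer.weight": 0.50, "layer.bias": -0.10}
ckpt_200 = {"layer.weight": 0.54, "layer.bias": -0.12}

predicted_400 = extrapolate_weights(ckpt_100, 100, ckpt_200, 200, 400)
print(predicted_400)
```

The same arithmetic applies to output log-probabilities instead of weights, which is presumably the idea behind Logits Extrapolation, although the abstract does not spell out how the two checkpoints are chosen or combined.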

[NLP-85] BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation

[Quick Read]: This paper addresses the insufficient robustness of text watermarking for generative AI in low-resource languages such as Bangla. Under cross-lingual round-trip translation (RTT) attacks, the detection accuracy of existing token-level watermarking methods (KGW, EXP, and Waterfall) collapses to 9-13%, a fundamental failure. The key to the solution is a layered watermarking strategy that combines embedding-time and post-generation watermarks. Without any model training, it raises post-RTT detection accuracy from 9-13% to 40-50%, a roughly 3x to 4x relative improvement, at the cost of controlled semantic degradation, which makes it a practical and efficient watermarking scheme for low-resource languages.

Link: https://arxiv.org/abs/2601.04534
Authors: Amit Bin Tariqul, A N M Zahid Hossain Milkan, Sahab-Al-Chowdhury, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Affiliations: Islamic University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review, 12 pages, 7 figures, 5 tables

Abstract:As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods: KGW, Exponential Sampling (EXP), and Waterfall, for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3× to 4× relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.
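To see why token-level watermarks are fragile under translation, it helps to look at how a KGW-style detector scores text. The sketch below is a simplified illustration, not the paper's code: the original KGW scheme partitions the vocabulary with a seeded random permutation, whereas this version hashes each (previous token, token) pair to decide green-list membership. The key, the toy vocabulary, and all function names are assumptions.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly place `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def detection_z_score(tokens, gamma=0.5, key="demo-key"):
    """z-score of the green-token count under the 'no watermark' null hypothesis."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed the hash
    if scored <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, gamma, key) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))


# Toy vocabulary and a toy "generator" that always emits a green token when one exists.
vocab = [f"tok{i}" for i in range(50)]


def generate_watermarked(length, gamma=0.5, key="demo-key"):
    tokens = ["tok0"]
    for _ in range(length):
        nxt = next((t for t in vocab if is_green(tokens[-1], t, gamma, key)), vocab[0])
        tokens.append(nxt)
    return tokens


watermarked = generate_watermarked(40)
print(detection_z_score(watermarked))
```

The detector only counts how many tokens fall on the green list seeded by their predecessor. Round-trip translation rewrites most tokens, so the green fraction drifts back toward `gamma` and the z-score falls toward zero, which is consistent with the collapse in detection accuracy reported in the abstract.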

[NLP-86] Advancing Language Models for Code-related Tasks ICSE2026

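[Quick Read]: This paper addresses the weak performance of current language models (LMs) in complex programming scenarios, which stems from limitations in data quality, model architecture, and reasoning capability. The key to the solution is systematic improvement along three complementary directions: first, improving code data quality with Code Difference-Guided Adversarial Augmentation (CODA) and a code denoising technique (CodeDenoise); second, enhancing model architecture with syntax-guided code LMs (LEAM and LEAM++); and third, strengthening model reasoning with a prompting technique (muFiX) and an agent-based technique (Specine). Together these aim to promote the practical adoption of LMs in software development and to advance intelligent software engineering.

Link: https://arxiv.org/abs/2601.04526
Authors: Zhao Tian
Affiliations: Tianjin University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ICSE 2026 (DS)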

Abstract:Recent advances in language models (LMs) have driven significant progress in various software engineering tasks. However, existing LMs still struggle with complex programming scenarios due to limitations in data quality, model architecture, and reasoning capability. This research systematically addresses these challenges through three complementary directions: (1) improving code data quality with a code difference-guided adversarial augmentation technique (CODA) and a code denoising technique (CodeDenoise); (2) enhancing model architecture via syntax-guided code LMs (LEAM and LEAM++); and (3) advancing model reasoning with a prompting technique (muFiX) and an agent-based technique (Specine). These techniques aim to promote the practical adoption of LMs in software development and further advance intelligent software engineering.

[NLP-87] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence

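[Quick Read]: This paper addresses two critical flaws in Retrieval-Augmented Generation (RAG) systems: models give answers that look correct but lack explicit supporting evidence, and models fabricate content when the retrieved context is insufficient (hallucination). To handle both, the authors propose GRACE, a reinforcement learning framework with two core ideas. First, heterogeneous retrievers automatically construct diverse training samples without manual annotation. Second, a multi-stage gated reward function guides the model to assess evidence sufficiency, extract key supporting evidence, and explicitly abstain when uncertain. The approach achieves state-of-the-art overall accuracy on two benchmarks and a favorable balance between accurate answers and justified abstention, while requiring only 10% of the annotation cost of prior methods.

Link: https://arxiv.org/abs/2601.04525
Authors: Yibo Zhao, Jiapeng Zhu, Zichen Ding, Xiang Li
Affiliations: East China Normal University; Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL)
Comments: 18 pages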

Abstract:Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at this https URL…
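The abstract names a multi-stage gated reward but does not define it. One plausible reading is that each stage contributes reward only if the previous stage passed. The sketch below illustrates that reading; the stage order, the reward values, and the dictionary fields are hypothetical and should not be taken as GRACE's actual reward.

```python
def gated_reward(sample, output):
    """Hypothetical three-stage gated reward: later stages score only if earlier ones pass."""
    # Stage 1: did the model judge evidence sufficiency correctly?
    if output["answerable"] != sample["answerable"]:
        return 0.0
    reward = 1.0
    if not sample["answerable"]:
        return reward  # correct abstention; there is nothing further to check

    # Stage 2: does the cited evidence overlap the gold supporting evidence?
    if not set(output["evidence"]) & set(sample["gold_evidence"]):
        return reward
    reward += 1.0

    # Stage 3: is the final answer correct (case- and whitespace-insensitive match)?
    if output["answer"].strip().lower() == sample["gold_answer"].strip().lower():
        reward += 1.0
    return reward


sample = {"answerable": True, "gold_evidence": ["doc2"], "gold_answer": "Paris"}
grounded = {"answerable": True, "evidence": ["doc2"], "answer": "Paris"}
ungrounded = {"answerable": True, "evidence": ["doc9"], "answer": "Paris"}
print(gated_reward(sample, grounded), gated_reward(sample, ungrounded))
```

Under this gating, a correct answer backed by the wrong evidence earns no more than the sufficiency judgment alone, which is one way to discourage the "correct but ungrounded" behavior the abstract describes.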

[NLP-88] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation

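[Quick Read]: This paper addresses inefficient language-based interaction among agents in Multi-Agent Systems (MASs), that is, how to improve the dialogue process so that agents convey their intended meaning more effectively. The key to the solution is LinguaGame, a linguistically grounded game-theoretic paradigm that models dialogue as a signalling game over communicative intents and strategies and solves it with a training-free equilibrium approximation algorithm for inference-time decision adjustment. The method treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue it (strategies), which improves communication efficiency. Its novelty is that it relies only on linguistically informed reasoning with minimal coupling to task-specific objectives, which makes it general and transferable.

Link: https://arxiv.org/abs/2601.04516
Authors: Yuxiao Ye, Yiming Zhang, Yiran Ma, Huiyuan Xie, Huining Zhu, Zhiyuan Liu
Affiliations: Tsinghua University; University of California, Berkeley; Peking University; East China University of Political Science and Law
Subjects: Computation and Language (cs.CL)
Comments: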

Abstract:Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents’ communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.

[NLP-89] WESR: Scaling and Evaluating Word-level Event-Speech Recognition

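[Quick Read]: This paper addresses the precise localization of non-verbal vocal events (such as laughing and crying) in speech. Current methods suffer from unclear task definitions, limited category coverage, ambiguous temporal granularity, and the lack of a standardized evaluation framework. The solution has three parts. First, the authors build a refined taxonomy of 21 vocal events, split into discrete (standalone) and continuous (mixed with speech) types. Second, they introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that separates automatic speech recognition (ASR) errors from event detection errors and so enables precise localization measurement for both event types. Third, they construct a training corpus of more than 1,700 hours and train specialized models that outperform open-source audio-language models and commercial APIs while preserving ASR quality, providing a foundational resource for modeling complex real-world auditory scenes.

Link: https://arxiv.org/abs/2601.04508
Authors: Chenchen Yang, Kexin Huang, Liwei Fan, Qian Tu, Botian Jiang, Dong Zhang, Linqi Yin, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu
Affiliations: Fudan University; Shanghai Innovation Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 14 pages, 6 figures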

Abstract:Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.

[NLP-90] CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

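[Quick Read]: This paper addresses the difficulty of generating accurate circuit schematics from natural language descriptions, in particular the tendency of large language models (LLMs) to hallucinate fine-grained details, violate electrical constraints, and produce output that is not machine-readable. The key to the solution is CircuitLM, a multi-agent LLM-aided circuit design pipeline with five stages: LLM-based component identification, canonical pinout retrieval, chain-of-thought reasoning by an electronics expert agent, structured CircuitJSON schematic synthesis, and force-directed SVG visualization. The core innovation is a curated, embedding-powered component knowledge base (initially 50 components) that anchors generation in a verifiable and dynamically extensible component database, which mitigates physically implausible LLM output. Combined with the Dual-Metric Circuit Validation (DMCV) framework, which checks structural and electrical correctness, it makes hardware prototyping more reliable for non-expert users.

Link: https://arxiv.org/abs/2601.04505
Authors: Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik
Affiliations: Islamic University of Technology
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
Comments: Under review, 13 pages, 11 figures, 2 tables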

Abstract:Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronics design, as large language models (LLMs) frequently hallucinate in granular details, violate electrical constraints, and produce non-machine-readable outputs. We present CircuitLM, a novel multi-agent LLM-aided circuit design pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics through five sequential stages: (i) LLM-based component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning by an electronics expert agent, (iv) JSON schematic synthesis, and (v) force-directed SVG visualization. The pipeline is anchored by a curated, embedding-powered component knowledge base: while LLMs often violate electrical constraints, CircuitLM bridges this gap by grounding generation in a verified and dynamically extensible component database, initially comprising 50 components. To ensure safety, we incorporate a hybrid evaluation framework, namely Dual-Metric Circuit Validation (DMCV), validated against human-expert assessments, which achieves high fidelity in microcontroller-centric designs. We evaluate the system on 100 diverse embedded-systems prompts across six LLMs and introduce DMCV to assess both structural and electrical validity. This work bridges natural language input to deployable hardware designs, enabling reliable circuit prototyping by non-experts. Our code and data will be made public upon acceptance.
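Grounding generation in a pinout database can be pictured as a structural check on the generated schematic. The sketch below validates that every net references a known component and a pin that actually exists for that component type. The schema, the component names, and the pin names are invented for illustration; they are not the CircuitJSON format or the paper's knowledge base, and the check covers structure only, not electrical validity.

```python
# Hypothetical knowledge-base entries: component type -> set of valid pin names.
PINOUTS = {
    "LED": {"anode", "cathode"},
    "RESISTOR": {"1", "2"},
    "ARDUINO_UNO": {"D13", "GND", "5V"},
}


def validate_schematic(schematic, pinouts=PINOUTS):
    """Return a list of readable errors; an empty list means the checks passed."""
    errors = []
    types = {c["id"]: c["type"] for c in schematic["components"]}
    for cid, ctype in types.items():
        if ctype not in pinouts:
            errors.append(f"{cid}: unknown component type {ctype}")
    for net in schematic["nets"]:
        for cid, pin in net["pins"]:
            ctype = types.get(cid)
            if ctype is None:
                errors.append(f"net {net['name']}: unknown component {cid}")
            elif ctype in pinouts and pin not in pinouts[ctype]:
                errors.append(f"net {net['name']}: {cid} has no pin {pin}")
    return errors


# A toy "blink" circuit: D13 -> resistor -> LED -> GND.
blink = {
    "components": [
        {"id": "U1", "type": "ARDUINO_UNO"},
        {"id": "R1", "type": "RESISTOR"},
        {"id": "D1", "type": "LED"},
    ],
    "nets": [
        {"name": "drive", "pins": [("U1", "D13"), ("R1", "1")]},
        {"name": "led_in", "pins": [("R1", "2"), ("D1", "anode")]},
        {"name": "ground", "pins": [("D1", "cathode"), ("U1", "GND")]},
    ],
}
print(validate_schematic(blink))
```

A check like this catches hallucinated pins or parts before rendering, but it says nothing about current limits or voltage levels, which is presumably where an electrical metric such as the one in DMCV comes in.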

[NLP-91] Vision-Language Agents for Interactive Forest Change Analysis

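[Quick Read]: This paper addresses two core challenges in forest change monitoring: accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. Existing work has not fully explored integrating large language models (LLMs) with vision-language models (VLMs) for remote sensing image change interpretation (RSICI). The key to the solution is an LLM-driven agent system built on a multi-level change interpretation (MCI) vision-language backbone, with the LLM handling task orchestration and natural language queries. The system integrates change detection and semantic captioning end to end and improves the accessibility, interpretability, and efficiency of forest change analysis.

Link: https://arxiv.org/abs/2601.04497
Authors: James Brock, Ce Zhang, Nantheera Anantrasirichai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 4 figures, Submitted to IGARSS 2026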

Abstract:Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at this https URL.
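The change-detection results above are reported as mIoU. For readers unfamiliar with the metric, the sketch below computes mean intersection-over-union for a binary change mask flattened to a list. It is a generic textbook definition, not the paper's evaluation code, and the tiny masks are made up.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean intersection-over-union across classes for flat label sequences."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes that appear in neither prediction nor target
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class from range(num_classes) appears in the inputs")
    return sum(ious) / len(ious)


# 0 = unchanged pixel, 1 = changed pixel (toy six-pixel masks).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(mean_iou(pred, target))
```

Averaging per-class IoU, rather than pooling all pixels, keeps the usually small "changed" class from being swamped by the much larger "unchanged" background.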

[NLP-92] SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

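[Quick Read]: This paper addresses the difficulty of evaluating subword tokenization quality for morphologically rich Uralic languages (Finnish, Hungarian, and Estonian), where the main obstacle is the lack of clean morpheme lexicons to guide and validate segmentation. The authors propose SampoNLP, a corpus-free toolkit whose key innovation is Self-Referential Atomicity Scoring, inspired by Minimum Description Length (MDL), which filters out composite forms using structural cues internal to word forms and so produces high-purity morpheme lexicons in low-resource settings. This makes it possible to systematically evaluate Byte Pair Encoding (BPE) tokenizers across vocabulary sizes (8k-256k). The paper also proposes a unified metric, the Integrated Performance Score (IPS), which balances morpheme coverage against over-splitting, and uses it to identify the "elbow point" vocabulary size for each language, giving the first empirically grounded recommendations for practical deployment.

Link: https://arxiv.org/abs/2601.04469
Authors: Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov
Affiliations: DataSpike; aglabx; Neapolis University Pafos; Dubai, UAE; Paphos, Cyprus
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to the 10th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2025), pp. 57-67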

Abstract:The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: this https URL

[NLP-93] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

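[Quick read]: This paper addresses how to control the behavior of a pretrained large language model (LLM) with a lightweight method while keeping the model frozen. Existing approaches typically require fine-tuning the whole model or elaborate prompt engineering. The paper proposes Concept Tokens: a single special token is added and only its embedding is learned, optimized jointly over multiple natural-language definitions of the target concept in which every occurrence of the concept is replaced by the special token. The key point is that updating just one learnable token embedding with the standard language-modeling objective is enough to noticeably steer model outputs without changing any model parameters, for example reducing hallucinations in closed-book question answering or inducing a pedagogical feedback strategy (recasting). The approach also compares favorably with injecting the full definitional corpus in context, giving a compact and efficient control signal.

Link: https://arxiv.org/abs/2601.04465
Authors: Ignacio Sastre, Aiala Rosá
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: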

Abstract:We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional “Austral Tower” to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.
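
The data-preparation step described in the abstract, where occurrences of the concept are replaced by the new token, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `replace_concept`, the token string `<concept>`, the whole-word case-insensitive matching, and the two definition sentences are all assumptions made for this example. Learning the embedding itself is not shown.

```python
import re
from typing import List


def replace_concept(definitions: List[str], concept: str,
                    token: str = "<concept>") -> List[str]:
    """Replace whole-word, case-insensitive mentions of `concept` with `token`."""
    pattern = re.compile(r"\b" + re.escape(concept) + r"\b", flags=re.IGNORECASE)
    return [pattern.sub(token, text) for text in definitions]


# Hypothetical definition sentences written for this example.
definitions = [
    "Recasting is a feedback strategy used in second language teaching.",
    "In recasting, the teacher reformulates the learner's utterance.",
]
prepared = replace_concept(definitions, "recasting")
```

The rewritten sentences would then serve as training text for the single new embedding while every other model weight stays frozen.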

[NLP-94] Beyond Static Summarization: Proactive Memory Extraction for LLM Agents

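[Quick read]: This paper targets information loss and reduced accuracy in large language model (LLM) agents during long-term interaction and personalization, caused by two flaws in the memory extraction stage: "ahead-of-time" summarization and "one-off" extraction. Existing summary-based methods summarize statically before the task is known and therefore cannot anticipate future tasks; the extraction process also lacks a feedback mechanism to verify facts, so information loss accumulates. The key to the solution is Proactive Memory Extraction (ProMem), which treats memory extraction as an iterative cognitive process and introduces a recurrent feedback loop: the agent actively probes the dialogue history through self-questioning, recovering missing information and correcting errors. This significantly improves memory completeness and question-answering accuracy while achieving a better trade-off between extraction quality and token cost.

Link: https://arxiv.org/abs/2601.04463
Authors: Chengyuan Yang, Zequn Sun, Wei Wei, Wei Hu
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; National Institute of Healthcare Data Science, Nanjing University, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: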

Abstract:Memory management is vital for LLM agents to handle long-term interaction and personalization. Most research focuses on how to organize and use memory summary, but often overlooks the initial memory extraction stage. In this paper, we argue that existing summary-based methods have two major limitations based on the recurrent processing theory. First, summarization is “ahead-of-time”, acting as a blind “feed-forward” process that misses important details because it doesn’t know future tasks. Second, extraction is usually “one-off”, lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (namely ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self-questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. Our ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade-off between extraction quality and token cost.

[NLP-95] Users Mispredict Their Own Preferences for AI Writing Assistance

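[Quick read]: This paper addresses a difficulty faced by proactive AI writing assistants in predicting when users want writing help: the lack of empirical understanding of what actually drives user preferences. The study finds that users' decisions are driven mainly by compositional effort (ρ = 0.597), while urgency has no meaningful predictive power (ρ ≈ 0). More importantly, users rank urgency first in self-reports, yet their behavior departs sharply from this, producing a clear perception-behavior gap. As a result, systems designed from users' stated preferences reach only 57.7% accuracy, below even naive baselines, whereas systems modeled on behavioral patterns reach 61.3% (p < 0.05). The key to the solution is therefore to abandon design paradigms that rely on users' subjective statements and instead use behavioral data to optimize the decision mechanisms of proactive natural language generation (NLG) systems.

Link: https://arxiv.org/abs/2601.04461
Authors: Vivian Lai, Zana Buçinca, Nil-Jana Akpinar, Mo Houtti, Hyeonsu B. Kang, Kevin Chian, Namjoon Suh, Alex C. Williams
Affiliations: Microsoft; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 22 pages, 13 figures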

Abstract:Proactive AI writing assistants need to predict when users want drafting help, yet we lack empirical understanding of what drives preferences. Through a factorial vignette study with 50 participants making 750 pairwise comparisons, we find compositional effort dominates decisions (ρ = 0.597) while urgency shows no predictive power (ρ ≈ 0). More critically, users exhibit a striking perception-behavior gap: they rank urgency first in self-reports despite it being the weakest behavioral driver, representing a complete preference inversion. This misalignment has measurable consequences. Systems designed from users’ stated preferences achieve only 57.7% accuracy, underperforming even naive baselines, while systems using behavioral patterns reach significantly higher 61.3% (p < 0.05). These findings demonstrate that relying on user introspection for system design actively misleads optimization, with direct implications for proactive natural language generation (NLG) systems.

[NLP-96] Re-Rankers as Relevance Judges

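[Quick read]: This paper addresses the resource waste and redundant development in current research on predicting relevance judgments with large language models (LLMs), which overlooks the potential of reusing re-ranking methods, already mature and widely studied for ranking tasks, for relevance judgment. The key to the solution is to use a re-ranker as a relevance judge, with two adaptation strategies: (i) taking the binary tokens (e.g., "true" and "false") output by the re-ranker directly as judgments, and (ii) thresholding continuous re-ranking scores to produce binary labels. Experiments on TREC-DL 2019 to 2023 show that the approach outperforms UMBRELA, a state-of-the-art LLM-based relevance judge, in roughly 40% to 50% of cases, and also reveal self-preference and cross-family bias when re-rankers act as judges.

Link: https://arxiv.org/abs/2601.04455
Authors: Chuan Meng, Jiqun Liu, Mohammad Aliannejadi, Fengran Mo, Jeff Dalton, Maarten de Rijke
Affiliations: The University of Edinburgh; University of Oklahoma; University of Amsterdam; Université de Montréal
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: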

Abstract:Using large language models (LLMs) to predict relevance judgments has shown promising results. Most studies treat this task as a distinct research line, e.g., focusing on prompt design for predicting relevance labels given a query and passage. However, predicting relevance judgments is essentially a form of relevance prediction, a problem extensively studied in tasks such as re-ranking. Despite this potential overlap, little research has explored reusing or adapting established re-ranking methods to predict relevance judgments, leading to potential resource waste and redundant development. To bridge this gap, we reproduce re-rankers in a re-ranker-as-relevance-judge setup. We design two adaptation strategies: (i) using binary tokens (e.g., “true” and “false”) generated by a re-ranker as direct judgments, and (ii) converting continuous re-ranking scores into binary labels via thresholding. We perform extensive experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 families, ranging from 220M to 32B, and analyse the evaluation bias exhibited by re-ranker-based judges. Results show that re-ranker-based relevance judges, under both strategies, can outperform UMBRELA, a state-of-the-art LLM-based relevance judge, in around 40% to 50% of the cases; they also exhibit strong self-preference towards their own and same-family re-rankers, as well as cross-family bias.
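
Strategy (ii) from the abstract, converting continuous re-ranking scores into binary labels, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `scores_to_labels`, the example scores, and the threshold of 0.5 are assumptions chosen for demonstration, and a real threshold would presumably be tuned on held-out judgments.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (query_id, passage_id)


def scores_to_labels(scores: Dict[Pair, float], threshold: float) -> Dict[Pair, int]:
    """Map continuous re-ranking scores to binary relevance labels.

    A pair is judged relevant (1) when its score is at or above the
    threshold, and non-relevant (0) otherwise.
    """
    return {pair: int(score >= threshold) for pair, score in scores.items()}


# Hypothetical re-ranker scores for three query-passage pairs.
example_scores = {
    ("q1", "p1"): 0.91,
    ("q1", "p2"): 0.12,
    ("q2", "p7"): 0.50,
}
labels = scores_to_labels(example_scores, threshold=0.5)
```

Treating a score exactly at the threshold as relevant is an arbitrary choice here; the opposite convention would work equally well as long as it is applied consistently.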

[NLP-97] Merging Triggers Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

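[Quick read]: This paper addresses the risk of backdoor attacks on instruction-tuned large language models (LLMs): an attacker poisons a small amount of training data to implant hidden behaviors that trigger malicious responses after deployment. The key to the solution is the MB-Defense framework, which has two stages: (i) defensive poisoning, which merges the attacker's and the defender's triggers into a unified backdoor representation, and (ii) weight recovery, which breaks that representation through additional training and restores the model's clean behavior. Validated on multiple LLMs, the method substantially lowers attack success rates while preserving instruction-following ability, offering a generalizable and data-efficient defense strategy.

Link: https://arxiv.org/abs/2601.04448
Authors: San Kim, Gary Geunbae Lee
Affiliations: POSTECH
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures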

Abstract:Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.

[NLP-98] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

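[Quick read]: This paper addresses the "overthinking" problem of Large Vision-Language Models (LVLMs) that reason with chain-of-thought: models produce verbose, inefficient responses to simple tasks, and reasoning accuracy can even drop because of visual perception errors. The core issue is that current slow-thinking methods overlook visual perception failure as a fundamental bottleneck and attribute errors to insufficient reasoning. The key to the solution is a meta-reasoning controller, Gated Perception-Reasoning Optimization (GPRO), which at each generation step dynamically chooses among three paths: a lightweight fast path, a slow perception path for re-examining the visual input, and a slow reasoning path for internal self-reflection. Trained with failure-attribution supervision derived from about 790k samples (using teacher models to distinguish perceptual hallucinations from reasoning errors) and with multi-objective reinforcement learning, GPRO balances task accuracy against computational cost under uncertainty, substantially improving both the efficiency and the accuracy of LVLM reasoning.

Link: https://arxiv.org/abs/2601.04442
Authors: Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui
Affiliations: Dartmouth College; University of Notre Dame; Cornell University; Northwestern University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: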

Abstract:Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.

[NLP-99] Learning to Simulate Human Dialogue

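[Quick read]: The question this paper tries to answer is how to predict a person's next utterance in a conversation more accurately, and thereby better model how people think. The study compares learning approaches along two dimensions: whether the model may "think" before responding (chain-of-thought), and whether learning is rewarded by an LLM-as-a-judge or by directly maximizing the log-probability of the true human response. The key to the solution is to treat chain-of-thought as a latent variable and derive a lower bound on the log-likelihood; optimizing this variational lower bound markedly improves prediction of real human dialogue, with higher log-probability and a higher win rate in human judgments. This outperforms rewards from an LLM-as-a-judge, which raise judge scores but reduce the match to real human utterances, a problem that worsens once thinking steps are introduced.

Link: https://arxiv.org/abs/2601.04436
Authors: Kanishk Gandhi, Agam Bhatia, Noah D. Goodman
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL)
Comments: Kanishk Gandhi and Agam Bhatia contributed equally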

Abstract:To predict what someone will say is to model how they think. We study this through next-turn dialogue prediction: given a conversation, predict the next utterance produced by a person. We compare learning approaches along two dimensions: (1) whether the model is allowed to think before responding, and (2) how learning is rewarded either through an LLM-as-a-judge that scores semantic similarity and information completeness relative to the ground-truth response, or by directly maximizing the log-probability of the true human dialogue. We find that optimizing for judge-based rewards indeed increases judge scores throughout training, however it decreases the likelihood assigned to ground truth human responses and decreases the win rate when human judges choose the most human-like response among a real and synthetic option. This failure is amplified when the model is allowed to think before answering. In contrast, by directly maximizing the log-probability of observed human responses, the model learns to better predict what people actually say, improving on both log-probability and win rate evaluations. Treating chain-of-thought as a latent variable, we derive a lower bound on the log-probability. Optimizing this objective yields the best results on all our evaluations. These results suggest that thinking helps primarily when trained with a distribution-matching objective grounded in real human dialogue, and that scaling this approach to broader conversational data may produce models with a more nuanced understanding of human behavior.
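
The latent-variable view mentioned in the abstract can be written out with a standard Jensen's-inequality argument. This is a generic sketch and not necessarily the exact bound the authors derive; the symbols x (conversation so far), y (the true human utterance), z (a chain-of-thought), and q (a proposal distribution over thoughts) are notation assumed here.

```latex
\begin{aligned}
\log p_\theta(y \mid x)
  &= \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z) \\
  &= \log \mathbb{E}_{z \sim q(\cdot \mid x)}
     \left[ \frac{p_\theta(z \mid x)\, p_\theta(y \mid x, z)}{q(z \mid x)} \right] \\
  &\ge \mathbb{E}_{z \sim q(\cdot \mid x)}
     \left[ \log p_\theta(y \mid x, z) + \log p_\theta(z \mid x) - \log q(z \mid x) \right].
\end{aligned}
```

If thoughts are sampled from the model itself, so that q(z | x) = p_θ(z | x), the last two terms cancel and the bound reduces to the expected log-probability of the human response given a sampled thought.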

[NLP-100] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs

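[Quick read]: This paper addresses the failure of large language models (LLMs) to challenge users' harmful beliefs, in sensitive domains such as medical advice and social reasoning. It argues that the problem stems from LLMs defaulting to accommodating users' assumptions and lacking sufficient epistemic vigilance. The key to the solution is pragmatic intervention: for example, adding the phrase "wait a minute" significantly improves models' ability to identify and push back on harmful beliefs on three safety benchmarks (Cancer-Myth, SAGE-Eval, and ELEPHANT) while keeping false-positive rates low, strengthening LLM safety and critical responsiveness through pragmatic mechanisms.

Link: https://arxiv.org/abs/2601.04435
Authors: Myra Cheng, Robert D. Hawkins, Dan Jurafsky
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: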

Abstract:Large language models (LLMs) frequently fail to challenge users’ harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users’ assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models’ ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase “wait a minute”, significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.

[NLP-101] Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

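[Quick read]: This paper addresses the weak performance of large language models (LLMs) on very long-context tasks such as multi-document legal case summarization: when the input reaches hundreds of thousands of tokens, existing models struggle to extract and integrate key information accurately. The key to the solution is Gavel-Ref, a reference-based evaluation framework that performs fine-grained evaluation along multi-value checklist, residual-fact, and writing-style dimensions, together with Gavel-Agent, an agent scaffold that uses six tools to navigate case documents autonomously and extract structured information from them. Gavel-Agent cuts token usage by 36% at the cost of only a 7% drop in checklist score, improving the practicality and efficiency of LLMs in complex long-document settings.

Link: https://arxiv.org/abs/2601.04424
Authors: Yao Dou, Wei Xu
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: webpage at this https URL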

Abstract:Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries (making human references less reliable), we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.

[NLP-102] Rate or Fate? RLVεR: Reinforcement Learning with Verifiable Noisy Rewards

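[Quick read]: This paper addresses learning failure in Reinforcement Learning with Verifiable Rewards (RLVR) caused by verification noise in practice. In particular, it asks whether false positives and false negatives in the verifier change the final outcome of learning (fate) or merely slow convergence (rate). The key to the solution is an analytically tractable multi-armed bandit framework for RLVR dynamics: grouping completions by reasoning mode yields a replicator-style flow on the probability simplex, and the evolution of the mass on incorrect modes turns out to be governed by Youden's index J = TPR - FPR. When J > 0, incorrect modes die out (learning succeeds); when J = 0, evolution is neutral; when J < 0, incorrect modes dominate, leading to anti-learning and collapse. The theory explains how verification noise affects RLVR stability and provides a general tool for analyzing algorithm design and interventions.

Link: https://arxiv.org/abs/2601.04411
Authors: Ali Rad, Khashayar Filom, Darioush Keivan, Peyman Mohajerin Esfahani, Ehsan Kamalinejad
Affiliations: Cognichip AI; University of Toronto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: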

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean: unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited. This problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden’s index J = TPR - FPR. This yields a sharp phase transition: when J > 0, the incorrect mass is driven toward extinction (learning); when J = 0, the process is neutral; and when J < 0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime J > 0, noise primarily rescales convergence time (“rate, not fate”). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J = 0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
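
A toy version of the one-dimensional dynamics can be simulated directly. The sketch below is an illustration under stated assumptions rather than the paper's model: it applies textbook replicator dynamics to two aggregate modes, where a correct completion is rewarded with probability TPR and an incorrect one with probability FPR. Writing q for the mass on incorrect modes, the mean reward is (1 - q) * TPR + q * FPR, which gives dq/dt = -J * q * (1 - q) with J = TPR - FPR. The function name, step size, number of steps, and starting mass are hypothetical.

```python
def simulate_incorrect_mass(tpr: float, fpr: float, q0: float = 0.5,
                            dt: float = 0.1, steps: int = 2000) -> float:
    """Euler-integrate dq/dt = -J * q * (1 - q), with J = TPR - FPR.

    q is the probability mass on incorrect modes. The equation follows from
    replicator dynamics dq/dt = q * (r_incorrect - r_mean), where
    r_incorrect = FPR and r_mean = (1 - q) * TPR + q * FPR.
    """
    j = tpr - fpr  # Youden's index
    q = q0
    for _ in range(steps):
        q += dt * (-j * q * (1.0 - q))
        q = min(max(q, 0.0), 1.0)  # keep q a valid probability
    return q


learning = simulate_incorrect_mass(tpr=0.9, fpr=0.2)  # J > 0: q shrinks
neutral = simulate_incorrect_mass(tpr=0.5, fpr=0.5)   # J = 0: q unchanged
collapse = simulate_incorrect_mass(tpr=0.3, fpr=0.6)  # J < 0: q grows
```

In this toy model, shrinking a positive J only stretches the time axis, which matches the "rate, not fate" reading, while flipping its sign reverses the direction of drift.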

[NLP-103] Interpreting Transformers Through Attention Head Intervention

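[Quick read]: The problem this paper tries to address is that neural networks are becoming more capable while their internal decision mechanisms remain poorly understood and hard to interpret. The key to the solution is to advance mechanistic interpretability, that is, to understand in depth how specific neural mechanisms inside a network produce its decisions. This would enable accountability and control in high-stakes domains, the study of how cognition emerges in digital brains, and the discovery of new knowledge when AI systems outperform humans.

Link: https://arxiv.org/abs/2601.04398
Authors: Mason Kadem, Rong Zheng
Affiliations: McMaster University
Subjects: Computation and Language (cs.CL)
Comments: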

Abstract:Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms’ decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans.

[NLP-104] ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

[Quick read]: This paper addresses shortcomings of large language models (LLMs) in factual correctness and safety. Its central claim is that these failures are not separate alignment problems but arise from a representational misalignment in the model's latent activation space. The key to the solution is a unified framework, ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), which introduces an external network to identify and correct drifted features. Without fine-tuning the original model parameters, it regulates false outputs toward truthful ones and unsafe outputs toward safe ones, and it supports both soft and hard refusals, improving robustness and controllability.

Link: https://arxiv.org/abs/2601.04394
Authors: Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, Swagatam Das
Affiliations: Indian Statistical Institute Kolkata; University of Surrey; Indian Institute Of Technology Delhi
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack human cognition to balance factuality and safety. Bearing the resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than addressing those as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the RLHF-aligned models in generating soft refusals due to adversarial training. We make our codebase available at this https URL.

[NLP-105] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

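[Quick read]: This paper addresses the "illusion of universality" in current safety evaluations of large language models (LLMs): aggregating "Identity Hate" into scalar scores masks systemic vulnerabilities against specific minority groups. The key to the solution is MiJaBench, a bilingual (English and Portuguese) adversarial benchmark of 44,000 prompts covering 16 minority groups; 528,000 prompt-response pairs generated from 12 state-of-the-art LLMs are then curated into the MiJaBench-Align dataset. The data show that safety alignment is not a general semantic capability but a demographic hierarchy: within the same model, defense rates differ by up to 33% depending on the target group. More importantly, scaling up models widens these disparities, suggesting that current alignment techniques do not establish a principle of non-discrimination but reinforce memorized refusal boundaries for specific groups, which challenges current scaling laws for safety.

Link: https://arxiv.org/abs/2601.04389
Authors: Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
Affiliations: Advanced Knowledge Center for Immersive Technologies; Federal University of Goiás
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures and 4 tables in paper (without appendix)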

Abstract:Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating “Identity Hate” into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not create a principle of non-discrimination but instead reinforce memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts to encourage research into granular demographic alignment at GitHub.
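
The masking effect of a single scalar score can be illustrated with a small aggregation sketch. The records, group names, and the function `defense_rates` below are hypothetical and are not taken from MiJaBench; the only point is that an overall rate can look acceptable while the gap between groups is large.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def defense_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of defended prompts for each target group."""
    totals: Dict[str, int] = defaultdict(int)
    defended: Dict[str, int] = defaultdict(int)
    for group, was_defended in records:
        totals[group] += 1
        defended[group] += int(was_defended)
    return {group: defended[group] / totals[group] for group in totals}


# Hypothetical (target_group, model_defended) outcomes for one model.
records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", True),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]
per_group = defense_rates(records)
overall = sum(was_defended for _, was_defended in records) / len(records)
gap = max(per_group.values()) - min(per_group.values())
```

With these made-up records the aggregate defense rate is 0.75, yet one group is defended every time and the other only half the time, which is the kind of disparity a per-group breakdown is meant to surface.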

[NLP-106] The Language of Bargaining: Linguistic Effects in LLM Negotiations

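[Quick read]: This paper addresses the heavy reliance on English-only evaluation of large language models (LLMs) in multi-turn negotiation, which can lead to partial or even misleading conclusions about models' social intelligence. The key to the solution is a set of controlled multi-agent simulations that hold game rules, model parameters, and incentives constant while systematically comparing negotiation in English and in four Indic languages (Hindi, Punjabi, Gujarati, Marwadi), thereby isolating the effect of language. The study finds that language choice alone can markedly change negotiation outcomes, even reversing proposer advantages and reallocating surplus, and that the effect is task-dependent: it reduces stability in distributive games but encourages richer exploration in integrative settings. The work highlights the role of cultural and linguistic factors in evaluating LLM social intelligence and argues that fair deployment requires cross-lingual, culturally aware evaluation.

Link: https://arxiv.org/abs/2601.04387
Authors: Stuti Sinha, Himanshu Kumar, Aryan Raju Mandapati, Rakshit Sakhuja, Dhruv Kumar
Affiliations: BITS Pilani
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: Under Review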

Abstract:Negotiation is a core component of social intelligence, requiring agents to balance strategic reasoning, cooperation, and social norms. Recent work shows that LLMs can engage in multi-turn negotiation, yet nearly all evaluations occur exclusively in English. Using controlled multi-agent simulations across Ultimatum, Buy-Sell, and Resource Exchange games, we systematically isolate language effects across English and four Indic framings (Hindi, Punjabi, Gujarati, Marwadi) by holding game rules, model parameters, and incentives constant across all conditions. We find that language choice can shift outcomes more strongly than changing models, reversing proposer advantages and reallocating surplus. Crucially, effects are task-contingent: Indic languages reduce stability in distributive games yet induce richer exploration in integrative settings. Our results demonstrate that evaluating LLM negotiation solely in English yields incomplete and potentially misleading conclusions. These findings caution against English-only evaluation of LLMs and suggest that culturally-aware evaluation is essential for fair deployment.

[NLP-107] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

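[Quick read]: This paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) methods, which generally treat retrieved passages in a flat, unstructured way: they fail to capture structural cues within and across documents, which limits the model's ability to synthesize knowledge from dispersed evidence. The key to the solution is the Disco-RAG framework, which builds two structured representations: intra-chunk discourse trees, which capture local hierarchies within a passage, and inter-chunk rhetorical graphs, which model coherence across passages. Both structures are integrated into a planning blueprint that conditions generation, significantly improving performance on question answering and long-document summarization and reaching state-of-the-art results without fine-tuning.

Link: https://arxiv.org/abs/2601.04377
Authors: Dongqi Liu, Hang Ding, Qiming Feng, Jian Li, Xurong Xie, Zhucun Xue, Chengjie Wang, Jiangning Zhang, Yabiao Wang
Affiliations: Saarland University; Tencent YouTu Lab; Shanghai Jiaotong University; Fudan University; Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: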

Abstract:Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.

[NLP-108] Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties

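[Quick read]: This paper examines how well cross-lingual transfer works for non-standardized, noisy, and code-mixed speech, focusing on automatic speech recognition (ASR) for Indic dialects and language varieties. The core challenge is that conventional approaches rely on phylogenetic distance between languages to predict transfer, yet the empirical results show that this factor alone does not explain performance differences in dialectal settings. The key finding is that fine-tuning on a small amount of dialectal data can yield ASR performance comparable to fine-tuning on larger amounts of standardized, high-resource language data, indicating that dialect-specific data can matter more than phylogenetic proximity alone. In addition, a case study on Garhwali, a low-resource Pahari language variety, evaluates several contemporary ASR models, and an analysis of transcription errors reveals bias toward pre-training languages, providing evidence for improving the fairness and robustness of ASR systems on diverse language varieties.

Link: https://arxiv.org/abs/2601.04373
Authors: Akriti Dhasmana, Aarohi Srivastava, David Chiang
Affiliations: University of Notre Dame
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, 10 tables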

Abstract:We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.

[NLP-109] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation

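[Quick read]: This paper addresses the tendency in scientific writing to favor bold statements over rigor, with authors overstating conclusions beyond what their results support. The key to the solution is RIGOURATE, a two-stage multimodal framework: the first stage uses a fine-tuned reranker to retrieve supporting evidence from the paper body, and the second uses a fine-tuned model to predict an overstatement score for each claim together with a justification. The framework is built on a dataset of more than 10,000 claim-evidence sets from ICLR and NeurIPS papers, annotated by eight large language models, with overstatement scores calibrated against peer-review comments and validated through human evaluation. It thereby operationalizes evidential proportionality and supports clearer, more transparent scientific communication.

Link: https://arxiv.org/abs/2601.04350
Authors: Joseph James, Chenghao Xiao, Yucheng Li, Nafise Sadat Moosavi, Chenghua Lin
Affiliations: The University of Sheffield; Durham University; University of Surrey; The University of Manchester
Subjects: Computation and Language (cs.CL)
Comments: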

Abstract:Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper’s body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.

[NLP-110] Quantifying the Effect of Test Set Contamination on Generative Evaluations

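[Quick read]: This paper studies how test set contamination affects generative evaluations across the language model lifecycle, from pretraining to inference. Prior work has mostly examined contamination in discriminative tasks such as multiple-choice question answering; this paper quantifies its effect on generative tasks. The key to the approach is to control the number of contaminating test set replicas in the pretraining corpus together with model size. Using scaling laws, the authors find that even a single test set replica lets models reach lower loss than the irreducible error of training on the uncontaminated corpus. Further training matters too: overtraining on fresh data reduces the effect of contamination, whereas supervised fine-tuning on the training set can either raise or lower test performance, depending on the amount of pretraining contamination. At inference, sampling temperature and solution length modulate memorization: high temperatures mitigate contamination effects, and longer solutions are much harder to memorize, in contrast to discriminative evaluations, where answers are only a few tokens long. The work offers theoretical and practical guidance for trustworthy AI evaluation.

Link: https://arxiv.org/abs/2601.04301
Authors: Rylan Schaeffer,Joshua Kazdan,Baber Abbasi,Ken Ziyu Liu,Brando Miranda,Ahmed Ahmed,Abhay Puri,Niloofar Mireshghallah,Sanmi Koyejo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: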

Abstract:As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
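
The claimed interaction between solution length, sampling temperature, and memorization can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes each memorized token beats every other token by a fixed logit gap, that all other tokens tie, and that tokens are sampled independently. The names `token_prob` and `verbatim_prob` and all numeric values are hypothetical.

```python
import math


def token_prob(logit_gap, vocab_size, temperature):
    """Probability of sampling the memorized token when its logit exceeds
    every other token's logit by `logit_gap` (toy assumption: the rest tie)."""
    top = math.exp(logit_gap / temperature)
    return top / (top + (vocab_size - 1))


def verbatim_prob(length, logit_gap, vocab_size, temperature):
    """Probability of reproducing all `length` memorized tokens in a row,
    assuming independent per-token draws (a simplifying assumption)."""
    return token_prob(logit_gap, vocab_size, temperature) ** length


if __name__ == "__main__":
    # Illustrative values only; none of them come from the paper.
    for temperature in (0.5, 1.0, 1.5):
        for length in (5, 50, 200):
            p = verbatim_prob(length, 12.0, 50_000, temperature)
            print(f"T={temperature:.1f} L={length:3d} P(verbatim)={p:.3e}")
```

Under these assumptions the verbatim probability is a per-token probability raised to the power of the length, so it decays exponentially as solutions get longer, and raising the temperature lowers the per-token probability that feeds that decay.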

[NLP-111] Mitigating Position-Shift Failures in Text-Based Modular Arithmetic via Position Curriculum and Template Diversity

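[Quick read]: This paper addresses models that perform well in-distribution on a specific computational task (modular addition) yet lack robustness to changes in input format, such as shifted character positions or different natural-language templates. The key to the solution is a structured training recipe comprising: (i) explicit expression boundary markers to strengthen structural awareness, (ii) a position curriculum that broadens coverage of absolute positions, (iii) diverse template mixtures to improve generalization, and (iv) consistency training to stabilize outputs across multiple variants of the same example. The recipe substantially improves robustness to position shift and out-of-distribution templates while maintaining high in-distribution accuracy, suggesting that explicitly training invariances absent from the data distribution benefits procedural generalization.

Link: https://arxiv.org/abs/2601.04283
Authors: Nikolay Yudin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: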

Abstract:Building on insights from the grokking literature, we study character-level Transformers trained to compute modular addition from text, and focus on robustness under input-format variation rather than only in-distribution accuracy. We identify a previously under-emphasized failure mode: models that achieve high in-distribution accuracy can fail catastrophically when the same expression is shifted to different absolute character positions (“position shift”) or presented under out-of-distribution natural-language templates. Using a disjoint-pair split over all ordered pairs for p=97, we show that a baseline model reaches strong in-distribution performance yet collapses under position shift and template OOD. We then introduce a simple training recipe that combines (i) explicit expression boundary markers, (ii) position curriculum that broadens the range of absolute positions seen during training, (iii) diverse template mixtures, and (iv) consistency training across multiple variants per example. Across three seeds, this intervention substantially improves robustness to position shift and template OOD while maintaining high in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task under our setup. Our results suggest that steering procedural generalization under noisy supervision benefits from explicitly training invariances that are otherwise absent from the data distribution, and we provide a reproducible evaluation protocol and artifacts.
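
A minimal sketch of how the data side of such a recipe might look, assuming a whitespace-padding notion of position shift. The `<e>...</e>` boundary markers, the three templates, the 80/20 split, and every function name are hypothetical; the abstract only fixes p=97, the disjoint-pair split, and the four ingredients.

```python
import random
from itertools import product

P = 97  # modulus stated in the abstract

# Hypothetical templates and <e>...</e> boundary markers, for illustration only.
TEMPLATES = [
    "<e>{a}+{b}</e>=",
    "What is <e>{a}+{b}</e> mod {p}? ",
    "Compute <e>{a}+{b}</e> modulo {p}: ",
]


def disjoint_pair_split(p=P, train_frac=0.8, seed=0):
    """Split all ordered pairs (a, b) into train/test sets that share no pair."""
    pairs = list(product(range(p), repeat=2))
    random.Random(seed).shuffle(pairs)
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]


def curriculum_max_offset(step, total_steps, final_max):
    """Position curriculum: the allowed left padding grows linearly with the step."""
    return int(final_max * min(1.0, step / total_steps))


def render(a, b, template, offset, p=P):
    """Render one example with the expression shifted right by `offset` spaces."""
    prompt = " " * offset + template.format(a=a, b=b, p=p)
    return prompt, str((a + b) % p)


def variants(a, b, max_offset, n, rng):
    """Sample n renderings of the same pair (inputs for a consistency loss)."""
    return [
        render(a, b, rng.choice(TEMPLATES), rng.randint(0, max_offset))
        for _ in range(n)
    ]
```

All variants of one pair share the same answer, which is what a consistency objective would exploit; evaluating with offsets or templates never seen in training gives the position-shift and template-OOD conditions.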

[NLP-112] From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning

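[Quick read]: This paper addresses the problem that current machine unlearning benchmarks fail to reflect a model's true "forgetting scope", and that effective methods for precise domain-level and instance-level unlearning in large language models (LLMs) are still lacking. The key to the solution is BiForget, an automated framework that exploits the target model's own knowledge distribution, using seed-guided and adversarial prompting to synthesize high-quality forget sets without relying on external generators. It achieves a better balance of relevance, diversity, and efficiency, and supports both more robust forgetting and better utility preservation.

Link: https://arxiv.org/abs/2601.04278
Authors: Xiaoyu Xu,Minxin Du,Zitong Li,Zi Liang,Zhibiao Guo,Shiyu Zhang,Peizhao Hu,Qingqing Ye,Haibo Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages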

Abstract: Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true “forgetting scope” learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ~20 and diversity by ~0.05 while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.

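[NLP-113] Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs

[Quick read]: This paper addresses the privacy leakage that arises when machine unlearning requires access to the data being forgotten, in particular the exposure of personally identifiable information (PII) to membership inference attacks. Many existing methods need direct access to the data being removed, which runs counter to the goal of privacy protection. The authors propose Shadow Unlearning, a paradigm of approximate unlearning that operates on anonymized forget data, and design the Neuro-Semantic Projector Unlearning (NSPU) framework, which uses a projection mechanism in the latent representation space to remove model knowledge efficiently and safely. The approach is at least 10 times more computationally efficient than standard unlearning approaches and balances unlearning effectiveness against model utility.

Link: https://arxiv.org/abs/2601.04275
Authors: Dinesh Srivasthav P,Ashok Urlana,Rahul Mishra,Bala Mallikarjunarao Garlapati,Ponnurangam Kumaraguru
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: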

Abstract:Machine unlearning aims to selectively remove the influence of specific training samples to satisfy privacy regulations such as the GDPR’s ‘Right to be Forgotten’. However, many existing methods require access to the data being removed, exposing it to membership inference attacks and potential misuse of Personally Identifiable Information (PII). We address this critical challenge by proposing Shadow Unlearning, a novel paradigm of approximate unlearning, that performs machine unlearning on anonymized forget data without exposing PII. We further propose a novel privacy-preserving framework, Neuro-Semantic Projector Unlearning (NSPU) to achieve Shadow unlearning. To evaluate our method, we compile Multi-domain Fictitious Unlearning (MuFU) forget set across five diverse domains and introduce an evaluation stack to quantify the trade-off between knowledge retention and unlearning effectiveness. Experimental results on various LLMs show that NSPU achieves superior unlearning performance, preserves model utility, and enhances user privacy. Additionally, the proposed approach is at least 10 times more computationally efficient than standard unlearning approaches. Our findings foster a new direction for privacy-aware machine unlearning that balances data protection and model fidelity.

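[NLP-114] Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review

[Quick read]: This paper addresses three challenges in automating pull request (PR) review: noisy supervision, limited contextual understanding, and inadequate evaluation metrics. The authors propose the Sphinx framework, built on three components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified code with merged code; (2) a checklist-based evaluation benchmark that measures review quality by structured coverage of actionable verification points, going beyond surface-level metrics such as BLEU; and (3) Checklist Reward Policy Optimization (CRPO), which uses rule-based, interpretable rewards to align model behaviour with real-world review practice. Experiments show that models trained with Sphinx achieve state-of-the-art review completeness and precision, improving checklist coverage by up to 40% over proprietary and open-source baselines.

Link: https://arxiv.org/abs/2601.04252
Authors: Daoan Zhang,Shuo Zhang,Zijian Jin,Jiebo Luo,Shengyu Fu,Elsie Nallipogu
Affiliations: University of Rochester; Microsoft
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: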

Abstract:Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art performance on review completeness and precision, outperforming both proprietary and open-source baselines by up to 40% in checklist coverage. Together, Sphinx enables the development of PR review models that are not only fluent but also context-aware, technically precise, and practically deployable in real-world development workflows. The data will be released after review.
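
To make the idea of a checklist-based, rule-based score concrete, here is a minimal sketch. It is not the Sphinx benchmark or the CRPO reward: keyword matching is a hypothetical stand-in for however the paper decides that a verification point is covered, and `checklist_coverage` and the example checklist are invented for illustration.

```python
def checklist_coverage(review, checklist):
    """Return (coverage, missed_items) for a review comment.

    `checklist` maps an item name to a list of keywords; an item counts as
    covered when any keyword appears in the review (case-insensitive).
    This matching rule is an assumption made for illustration.
    """
    text = review.lower()
    missed = [
        item
        for item, keywords in checklist.items()
        if not any(keyword.lower() in text for keyword in keywords)
    ]
    covered = len(checklist) - len(missed)
    coverage = covered / len(checklist) if checklist else 0.0
    return coverage, missed


if __name__ == "__main__":
    example_checklist = {
        "null check": ["null", "none"],
        "test added": ["unit test", "test case"],
        "naming": ["rename", "naming"],
    }
    review_text = "Please handle None input and add a unit test."
    print(checklist_coverage(review_text, example_checklist))
```

A score of this shape is interpretable by construction: the list of missed items says exactly why a review scored below 1.0, which a surface metric such as BLEU cannot do.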

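[NLP-115] SAGE-32B: Agentic Reasoning via Iterative Distillation

[Quick read]: This paper targets the limited reasoning and long-range planning ability of current large language models on complex tasks, especially in agentic scenarios involving multi-tool coordination, task decomposition, and error recovery. The solution has two key parts. First, starting from the Qwen2.5-32B pretrained model, a two-stage fine-tuning process called Iterative Distillation improves performance on complex reasoning tasks. Second, an inverse reasoning approach uses a meta-cognition head to forecast potential planning failures before execution. On benchmarks including AgentBench, MMLU-Pro, and MATH-500, SAGE-32B achieves higher success rates in multi-tool usage than similarly sized models while remaining competitive on standard reasoning tasks.

Link: https://arxiv.org/abs/2601.04237
Authors: Basab Jha,Firoj Paudel,Ujjwal Puri,Ethan Henkel,Zhang Yuting,Mateusz Kowalczyk,Mei Huang,Choi Donghyuk,Wang Junhao
Affiliations: SAGEA; Tribhuwan University | Vedas College; Tribhuwan University | Madan Bhandari Memorial College; Fudan University; ETH Zurich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 Pages, 3 figures, 4 tables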

Abstract:We demonstrate SAGE-32B, a 32 billion parameter language model that focuses on agentic reasoning and long range planning tasks. Unlike chat models that aim for general conversation fluency, SAGE-32B is designed to operate in an agentic loop, emphasizing task decomposition, tool usage, and error recovery. The model is initialized from the Qwen2.5-32B pretrained model and fine tuned using Iterative Distillation, a two stage training process that improves reasoning performance through rigorously tested feedback loops. SAGE-32B also introduces an inverse reasoning approach, which uses a meta cognition head to forecast potential failures in the planning process before execution. On agentic reasoning benchmarks including MMLU-Pro, AgentBench, and MATH-500, SAGE-32B achieves higher success rates in multi tool usage scenarios compared to similarly sized baseline models, while remaining competitive on standard reasoning evaluations. Model weights are publicly released at this https URL

[NLP-116] AnimatedLLM: Explaining LLMs with Interactive Visualizations

[Quick read]: This paper addresses the scarcity of intuitive teaching materials on how large language models (LLMs) work in natural language processing education. The key to the solution is AnimatedLLM, an interactive web application that visualizes the internals of a Transformer language model step by step. The tool runs entirely in the browser, using pre-computed traces of open LLMs applied to manually curated inputs, and serves both as a teaching aid and for self-study.

Link: https://arxiv.org/abs/2601.04213
Authors: Zdeněk Kasner,Ondřej Dušek
Affiliations: Charles University; Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied on manually curated inputs. The application is available at this https URL, both as a teaching aid and for self-educational purposes.

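[NLP-117] TrueBrief: Faithful Summarization through Small Language Models

[Quick read]: This paper addresses unfaithful output caused by hallucination when small language models (SLMs) perform text summarization, a reliability concern for deployment in security-critical domains. The key to the solution is TrueBrief, an end-to-end framework that improves faithfulness through a preference-optimization paradigm. Its core component is a data generation module that injects hallucinations in a controlled way to synthesize preference data, which is then used to train the model to produce summaries consistent with the source text.

Link: https://arxiv.org/abs/2601.04212
Authors: Kumud Lakara,Ruibo Shi,Fran Silavong
Affiliations: JPMorgan Chase; JPMorgan Chase
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: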

Abstract:Large language models (LLMs) have exhibited remarkable proficiency in generating high-quality text; however, their propensity for producing hallucinations poses a significant challenge for their deployment in security-critical domains. In this work, we present TrueBrief, an end-to-end framework specifically designed to enhance the faithfulness of small LLMs (SLMs) primarily for the task of text summarization through a preference-optimization paradigm. Central to our framework is a data generation module that facilitates controlled hallucination injection to generate synthetic preference data. Our work provides insights into the impact of data quality and model size on preference-based optimization, highlighting the conditions under which these methods are most effective.

[NLP-118] Qwerty AI: Explainable Automated Age Rating and Content Safety Assessment for Russian-Language Screenplays

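[Quick read]: This paper addresses automated age rating and content-safety assessment of Russian-language screenplays in accordance with Federal Law No. 436-FZ. The core challenge is to detect content involving violence, sexual content, profanity, substance use, and frightening elements efficiently and accurately, assign an age rating (0+, 6+, 12+, 16+, 18+), and provide an explainable justification. The key to the solution is Qwerty AI, an end-to-end system built on a fine-tuned Phi-3-mini model with 4-bit quantization, which reaches 80-95% segmentation precision depending on format. It operates under strict constraints (no external API calls, an 80GB VRAM limit, and a 5-minute processing budget per script), analyses full-length scripts of up to 700 pages in under 2 minutes, and is deployed on Yandex Cloud with CUDA acceleration.

Link: https://arxiv.org/abs/2601.04211
Authors: Nikita Zmanovskii
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 7 tables, 1 figure, 4 appendices. System paper describing automated age-rating for Russian screenplays using fine-tuned Phi-3-mini. Includes baseline comparisons, human evaluation, and production deployment. Code and model weights available at this https URL. Developed during Wink Hackathon, November 2025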

Abstract:We present Qwerty AI, an end-to-end system for automated age-rating and content-safety assessment of Russian-language screenplays according to Federal Law No. 436-FZ. The system processes full-length scripts (up to 700 pages in under 2 minutes), segments them into narrative units, detects content violations across five categories (violence, sexual content, profanity, substances, frightening elements), and assigns age ratings (0+, 6+, 12+, 16+, 18+) with explainable justifications. Our implementation leverages a fine-tuned Phi-3-mini model with 4-bit quantization, achieving 80% rating accuracy and 80-95% segmentation precision (format-dependent). The system was developed under strict constraints: no external API calls, 80GB VRAM limit, and 5 minute processing time for average scripts. Deployed on Yandex Cloud with CUDA acceleration, Qwerty AI demonstrates practical applicability for production workflows. We achieved these results during the Wink hackathon (November 2025), where our solution addressed real editorial challenges in the Russian media industry.

[NLP-119] Complexity Agnostic Recursive Decomposition of Thoughts

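[Quick read]: This paper addresses the performance loss large language models suffer on multi-step reasoning when they use a fixed reasoning strategy that ignores problem-specific difficulty. The key to the solution is CARD (Complexity Agnostic Recursive Decomposition), a framework that estimates problem complexity before generation and adapts the decomposition accordingly. It first uses MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B-parameter Qwen model, to predict 30 fine-grained features from the question text. A two-stage recursive solver then (1) decomposes the problem hierarchically into K steps based on the task profile and (2) allocates a per-step thought budget (1, 5-9, or 10 thoughts) via recursive MRCE profiling. The reported result is higher reasoning accuracy at substantially lower token cost.

Link: https://arxiv.org/abs/2601.04210
Authors: Kaleem Ullah Qasim,Jiashu Zhang,Hafiz Saif Ur Rehman
Affiliations: Southwest Jiaotong University; Southwest Jiaotong University; Southwestern University of Finance and Economics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 4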

Abstract:Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1 to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.
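
The per-step budget allocation can be pictured as a mapping from an estimated complexity to one of the three budget bands named in the abstract (1, 5-9, or 10 thoughts). The sketch below is hypothetical throughout: it collapses MRCE's 30 features into a single score in [0, 1], and the cut-offs 0.3 and 0.8 as well as the linear interpolation inside the middle band are assumptions, not values from the paper.

```python
LOW_CUTOFF = 0.3   # hypothetical: below this a step counts as easy
HIGH_CUTOFF = 0.8  # hypothetical: at or above this a step counts as hard


def thought_budget(complexity):
    """Map a complexity score in [0, 1] to a thought budget of 1, 5-9, or 10."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    if complexity < LOW_CUTOFF:
        return 1
    if complexity >= HIGH_CUTOFF:
        return 10
    # Middle band: spread the budgets 5..9 evenly across the interval.
    fraction = (complexity - LOW_CUTOFF) / (HIGH_CUTOFF - LOW_CUTOFF)
    return 5 + min(4, int(fraction * 5))


def allocate(step_complexities):
    """Budget for each decomposed step, in order."""
    return [thought_budget(c) for c in step_complexities]
```

The design point is that easy steps cost a single thought, so token savings come from the easy majority of steps while hard steps still receive the full budget.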

[NLP-120] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments

[Quick read]: This paper addresses the tension between using large language models (LLMs) for biomedical knowledge discovery and complying with hospitals' strict privacy and network security regulations. The key to the solution is a fully locally deployed retrieval-augmented generation (RAG) system that combines PubMedBERT, a domain-specific embedding model, with a lightweight, locally deployed LLaMA3 model for generative synthesis. The system recommends research collaborators based on PubMed publications without relying on external network services.

Link: https://arxiv.org/abs/2601.04209
Authors: Seokhwan Ko,Donghyeon Lee,Jaewoo Chun,Hyungsoo Han,Junghwan Cho
Affiliations: Clinical Omics Institute, Kyungpook National University; Department of Biomedical Science, School of Medicine Kyungpook National University; Department of Physiology, School of Medicine Kyungpook National University
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures

Abstract:Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.

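[NLP-121] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach

[Quick read]: This paper addresses the lack of explainability when AI models drive high-stakes consumer decisions. Existing explainable AI (XAI) methods rely on post hoc numerical feature attributions, which fail to provide a coherent narrative behind a decision; current LLM-based explanation approaches have yet to combine decision correctness, faithfulness, and suitability for multiple audiences, and they depend on human-annotated explanation data. The key to the solution is LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework. It combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO), fine-tuning two separate parameter sets: one to improve decision correctness and one to satisfy the stylistic requirements of different audiences, such as experts and consumers. Training does not rely on human-scored explanations.

Link: https://arxiv.org/abs/2601.04208
Authors: Xiang Cheng,Wen Wang,Anindya Ghose
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: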

Abstract:Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.
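
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization, with a toy 0/1 decision-correctness reward; the reward, the `eps` constant, and the use of the population standard deviation are assumptions for illustration and are not taken from LEXMA.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded only for beating its siblings on the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def correctness_reward(predicted_decision, true_decision):
    """Toy rule-based reward: 1.0 for a correct decision, else 0.0."""
    return 1.0 if predicted_decision == true_decision else 0.0


if __name__ == "__main__":
    # Four sampled responses to one hypothetical mortgage application.
    predictions = ["approve", "deny", "deny", "approve"]
    rewards = [correctness_reward(p, "approve") for p in predictions]
    print(group_relative_advantages(rewards))
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason reward design matters in this family of methods.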

[NLP-122] Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis

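[Quick read]: This paper addresses the systematic mismatch between the low-dimensional structures along which large language models (LLMs) internally represent political ideology and the human ideological space. The mismatch is model-specific and measurable, and it can cause model output to drift from the stance a user expects. The key to the solution is a lightweight linear probe that computes a bias score from the model's internal features and directly adjusts the output-layer probabilities, minimally correcting the output without retraining. The method preserves the model's original reasoning ability while aligning it with specific user opinions.

Link: https://arxiv.org/abs/2601.04207
Authors: Wei Xia,Haowen Tang,Luozheng Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Under review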

Abstract:LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully aligned with human ideological space. This misalignment is systematic, model specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculated a bias score from its internal features and directly adjusted the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.
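
One way to picture "compute a bias score from internal features, then adjust the output probabilities" is the sketch below. It is an assumption-laden illustration, not the paper's method: the probe weights, the per-label `direction` vector, the `strength` parameter, and the additive form of the correction are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_score(hidden, probe_weights, probe_bias=0.0):
    """Linear probe: project hidden features onto a learned direction."""
    return sum(h * w for h, w in zip(hidden, probe_weights)) + probe_bias


def steer_logits(logits, score, direction, strength=1.0):
    """Shift each label logit against the measured bias.

    `direction[i]` says how label i moves with the bias (+1, -1, or 0);
    with score == 0 the logits are returned unchanged.
    """
    return [z - strength * score * d for z, d in zip(logits, direction)]


if __name__ == "__main__":
    hidden = [0.5, -1.0]          # hypothetical 2-d internal features
    probe_weights = [1.0, -0.5]   # hypothetical fitted probe
    logits = [2.0, 1.0, 0.5]      # hypothetical 3-label classifier output
    score = bias_score(hidden, probe_weights)
    steered = steer_logits(logits, score, direction=[1.0, -1.0, 0.0])
    print(softmax(logits), softmax(steered))
```

Because only the final logits are touched, the base model's weights and intermediate computation stay as they were, which matches the abstract's claim that no retraining is needed.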

[NLP-123] Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation

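[Quick read]: This paper addresses the difficulty university admissions offices face in balancing response speed and information accuracy under high inquiry volumes, which strongly affects prospective students' experience. The key to the solution is a hybrid AI system that combines a fine-tuned language model with Retrieval-Augmented Generation (RAG): fine-tuning on a curated dataset specific to admissions processes improves the model's ability to interpret RAG-retrieved data and generate domain-relevant output, while RAG retains access to up-to-date information.

Link: https://arxiv.org/abs/2601.04206
Authors: Aram Virabyan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 1 figure, 1 table. Proceedings of the 19th International Scientific Conference “Parallel Computing Technologies” (PCT’2025), Moscow, Russia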

Abstract:University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students’ perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG’s ability to access up-to-date information and fine-tuning’s capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.

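[NLP-124] STDD: Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models

[Quick read]: This paper addresses the inefficiency and limited generation quality that result when diffusion language models (DLMs) use a single global confidence threshold for remasking. Such strategies ignore how each token's convergence differs over time and how tokens depend on one another across positions, which leads to redundant iterations and constrained parallelism. The key to the solution is a dynamic remasking mechanism that detects each token's Temporal Variance and Spatial Deviance, reflecting its convergence status and inter-token correlations, and uses them to adapt the confidence threshold for every token at every step. Experiments report speedups of up to 8.9 times while preserving generation quality.

Link: https://arxiv.org/abs/2601.04205
Authors: Xinhao Sun,Maoliang Li,Zihao Zheng,Jiayu Chen,Hezhao Xu,Yun Liang,Xiang Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: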

Abstract: Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low-priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spatial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical results show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.
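
The contrast with a single global threshold can be sketched as follows. The abstract does not give formulas, so everything here is an assumption: temporal variance is taken as the variance of a token's confidence over recent steps, spatial deviance as the gap between a token's confidence and the mean of its immediate neighbours, and the threshold as a clipped linear combination with hypothetical coefficients `alpha` and `beta`.

```python
import statistics


def temporal_variance(history):
    """Variance of one token's confidence over recent denoising steps."""
    return statistics.pvariance(history) if len(history) > 1 else 0.0


def spatial_deviance(confidences, i, window=1):
    """Absolute gap between token i's confidence and its neighbours' mean."""
    lo = max(0, i - window)
    hi = min(len(confidences), i + window + 1)
    neighbours = [confidences[j] for j in range(lo, hi) if j != i]
    if not neighbours:
        return 0.0
    return abs(confidences[i] - statistics.fmean(neighbours))


def adaptive_thresholds(histories, low=0.6, high=0.95, alpha=2.0, beta=0.5):
    """Per-token threshold: stable, locally consistent tokens get `low`;
    volatile or outlying tokens get a stricter value, capped at `high`."""
    current = [history[-1] for history in histories]
    thresholds = []
    for i, history in enumerate(histories):
        raw = low + alpha * temporal_variance(history)
        raw += beta * spatial_deviance(current, i)
        thresholds.append(min(high, max(low, raw)))
    return thresholds


def remask_decisions(histories, **kwargs):
    """True where the token's latest confidence falls below its threshold."""
    thresholds = adaptive_thresholds(histories, **kwargs)
    return [h[-1] < t for h, t in zip(histories, thresholds)]
```

In this toy version, two tokens with the same current confidence can be treated differently: one whose confidence has been flat is committed, while one that has been oscillating is remasked for another step.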

[NLP-125] Generative Teaching via Code

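[Quick read]: This paper addresses the scalability bottleneck in producing high-quality online education content, caused by the high cost and slow cycles of manual creation. Existing video generation methods are mostly pixel-level and black-box, which makes pedagogical structure and precise control hard to guarantee. The key to the solution is the Generative Teaching paradigm, which turns educators from manual creators into high-level directors while a team of autonomous agents handles execution. It is realized through TeachMaster, a multi-agent framework that uses code as an intermediate semantic medium and coordinates planning, design, and rendering agents to produce interpretable, editable, curriculum-ready educational videos, raising production efficiency without sacrificing structural coherence or visual fidelity.

Link: https://arxiv.org/abs/2601.04204
Authors: Yuheng Wang,Runde Yang,Lin Wu,Jie Zhang,Jingru Fan,Ruoyu Fu,Tianle Zhou,Huatao Li,Siheng Chen,Weinan E,Chen Qian
Affiliations: Shanghai Jiao Tong University
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: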

Abstract:The scalability of high-quality online education is hindered by the high costs and slow cycles of labor-intensive manual content creation. Despite advancements in video generation, current approaches often fail to ensure pedagogical structure and precise control due to their pixel-level, black-box nature. In this paper, we propose Generative Teaching, a novel paradigm that transitions educators from manual creators to high-level directors, allowing them to focus on pedagogical intent while autonomous agents handle the execution. To realize this vision, we introduce TeachMaster, a multi-agent framework that leverages code as an intermediate semantic medium. Unlike traditional video generation methods, TeachMaster orchestrates a collaborative team of agents–spanning planning, design, and rendering–to automate the production of interpretable, editable, and curriculum-ready educational videos. Experiments validate that TeachMaster significantly boosts production efficiency without compromising structural coherence or visual fidelity, providing a robust solution for scalable education.

[NLP-126] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

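[Quick read]: This paper addresses two core problems in front-end code generation under multi-turn, multi-modal interaction: models forget previously implemented features over multiple turns, causing task failures, and they struggle to interpret visual feedback such as sketches and mockups, especially open-source vision-language models (VLMs). The authors propose an agent-based evaluation framework and a strong baseline, AceCoder, which uses an autonomous web agent to critique the implementation of every past instruction. This reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (from 56.0% to 65.3%), providing a foundation for research on front-end development and on the interaction dynamics of multi-modal code generation.

Link: https://arxiv.org/abs/2601.04203
Authors: Xueqing Wu,Zihan Xue,Da Yin,Shuyan Zhou,Kai-Wei Chang,Nanyun Peng,Yeming Wen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: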

Abstract: We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL

[NLP-127] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation

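[Quick Read]: This paper addresses the poor performance of current large language models (LLMs) on telecom standards, particularly 3GPP specifications, which it attributes mainly to LLMs' limited ability to understand and reason over the tables that densely populate these technical documents. The key to the solution is TeleTables, a benchmark designed to evaluate both how much LLMs know about tables in telecom standards and how well they can interpret them. The dataset is built with a multi-stage automated data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate question-answer pairs, yielding 500 human-verified samples. It systematically exposes performance differences in table understanding across model sizes and highlights the importance of domain-specialized fine-tuning for reliably interpreting telecom standards.

Link: https://arxiv.org/abs/2601.04202
Authors: Anas Ezzakri, Nicola Piovesan, Mohamed Sana, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang
Affiliations: Paris Research Center, Huawei Technologies; Shanghai Research Center, Huawei Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: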

Abstract:Large Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet the LLM knowledge and interpretation ability of such tables remains largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.

[NLP-128] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems NEURIPS2025

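[Quick Read]: This paper addresses the "knowledge blind spots" that large language model (LLM) systems exhibit when answering community-specific questions, which marginalize local knowledge and reinforce epistemic injustice. Its core solution is Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. The key elements are an extraction schema and elicitation methods that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. These were informed by three participatory mapping workshops (N=24). The study finds that locally missing facts often appear in the collected narratives, suggesting that the method can directly address the dominant error modes. It also identifies key design tensions, including representation and power, governance and control, and privacy and consent, and provides concrete requirements for retrieval-first, provenance-visible, locally governed QA systems.

Link: https://arxiv.org/abs/2601.04201
Authors: Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker
Affiliations: University of Wisconsin-Madison; Georgia Tech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 2 figures. Presented at the NeurIPS 2025 ACA Workshop this https URL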

Abstract:Large language model (LLM) question-answering systems often fail on community-specific queries, creating “knowledge blind spots” that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.

[NLP-129] Attribute-Aware Controlled Product Generation with LLMs for E-commerce AAAI’26

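[Quick Read]: This paper addresses the difficulty of obtaining high-quality labeled data in e-commerce, in particular the scarcity of data for product information extraction. The key to its solution is an LLM-based framework for generating synthetic product data that performs controlled data augmentation through three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. It also uses attribute-aware prompts to enforce store constraints while maintaining product coherence. In human evaluation the synthetic data shows high naturalness and attribute validity, and on the public MAVE dataset it reaches 60.5% accuracy, close to training on real data (60.8%) and well above the zero-shot baseline (13.4%), supporting its practicality in low-resource scenarios.

Link: https://arxiv.org/abs/2601.04200
Authors: Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram
Affiliations: Amazon Spain; Amazon Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: AAAI’26 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD)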

Abstract:Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.

[NLP-130] The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs

[Quick Read]: This paper addresses the safety risks of deploying current medical multimodal large language models (Medical MLLMs), in particular their vulnerability on both general and medical-specific safety dimensions and the collapse of the original safety alignment (catastrophic forgetting) caused by medical fine-tuning. The key to the solution is a novel "Parameter-Space Intervention" approach: it extracts intrinsic safety knowledge representations from the original base model and injects them into the target model while medical capabilities are being built, achieving efficient safety re-alignment. A fine-grained parameter search algorithm then optimizes the trade-off between safety and medical performance. Experiments show that the method significantly strengthens safety guardrails without additional domain-specific safety data, with minimal impact on core medical performance.

Link: https://arxiv.org/abs/2601.04199
Authors: Jiale Zhao, Xing Mou, Jinlin Wu, Hongyuan Yu, Mingrui Sun, Yang Shi, Xuanwu Yin, Zhen Chen, Zhen Lei, Yaohua Wang
Affiliations: National University of Defense Technology; Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Multimedia Department, Xiaomi Inc; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong; School of Artificial Intelligence, University of Chinese Academy of Sciences, UCAS
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model’s original safety alignment. To address this challenge, we propose a novel “Parameter-Space Intervention” approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.

[NLP-131] Automatic Construction of Chinese Verb Collostruction Database

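[Quick Read]: This paper addresses the lack of explicit rules in large language models (LLMs) for scenarios that require explanation and interpretability, and proposes a fully unsupervised method for building a Chinese verb collostruction database. The key to the solution is to formally define a verb collostruction as a projective, rooted, ordered, directed acyclic graph and to use a series of clustering algorithms to generate collostructions for a given verb from sentences retrieved from a large-scale corpus. Statistical analysis shows that the generated collostructions exhibit functional independence and graded typicality, and a verb grammatical error correction algorithm based on maximum matching with collostructions outperforms LLMs.

Link: https://arxiv.org/abs/2601.04197
Authors: Xuri Tang, Daohuan Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 figures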

Abstract:This paper proposes a fully unsupervised approach to the construction of a verb collostruction database for the Chinese language, aimed at complementing LLMs by providing explicit and interpretable rules for application scenarios where explanation and interpretability are indispensable. The paper formally defines a verb collostruction as a projective, rooted, ordered, and directed acyclic graph and employs a series of clustering algorithms to generate collostructions for a given verb from a list of sentences retrieved from a large-scale corpus. Statistical analysis demonstrates that the generated collostructions possess the design features of functional independence and graded typicality. Evaluation with verb grammatical error correction shows that the error correction algorithm based on maximum matching with collostructions achieves better performance than LLMs.
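
To make the graph definition in this abstract more concrete, here is a minimal Python sketch of what checking the "rooted", "ordered", "acyclic", and "projective" properties could look like. It is not the authors' implementation. The encoding (nodes as integers in word order, edges as `(head, dependent)` pairs) and both function names are assumptions made for illustration, and the projectivity test uses the common "no crossing arcs" simplification.

```python
from collections import defaultdict


def is_rooted_dag(num_nodes, edges):
    """Return True if the graph has exactly one root and no cycles.

    Nodes are integers 0..num_nodes-1 in surface (word) order, which is how
    this sketch encodes "ordered". Each edge is a (head, dependent) pair.
    """
    children = defaultdict(list)
    indegree = [0] * num_nodes
    for head, dependent in edges:
        children[head].append(dependent)
        indegree[dependent] += 1

    # "Rooted": exactly one node without an incoming edge.
    roots = [node for node in range(num_nodes) if indegree[node] == 0]
    if len(roots) != 1:
        return False

    # Kahn's algorithm: all nodes get visited only if there is no cycle.
    queue = list(roots)
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == num_nodes


def has_no_crossing_arcs(edges):
    """Approximate projectivity: no two arcs cross when drawn above the words."""
    spans = [tuple(sorted(edge)) for edge in edges]
    for i, (a, b) in enumerate(spans):
        for c, d in spans[i + 1:]:
            if a < c < b < d or c < a < d < b:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical four-word pattern whose verb (position 1) heads every other word.
    pattern = [(1, 0), (1, 2), (1, 3)]
    print(is_rooted_dag(4, pattern), has_no_crossing_arcs(pattern))
```

With a single root and no cycles, every node is reachable from the root, so the two checks together cover the structural part of the definition under these assumptions.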

[NLP-132] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation

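[Quick Read]: This paper addresses a core challenge in evaluating Retrieval-Augmented Generation (RAG) systems: existing metrics often collapse heterogeneous system behaviors into a single score and offer no fine-grained diagnosis of error sources, making it hard to tell whether a problem lies in retrieval, reasoning, or faithfulness. The key to the solution is RAGVUE, a diagnostic and explainable framework for automated, reference-free RAG evaluation. Its core innovation is decomposing RAG behavior into four dimensions (retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration) and attaching a structured explanation to each, which makes the evaluation process transparent and traceable. The framework supports both manual metric selection and fully automated agentic evaluation, and offers a Python API, a command-line interface, and a local Streamlit interface. Experiments show that it surfaces subtle failure modes that existing tools such as RAGAS overlook.

Link: https://arxiv.org/abs/2601.04196
Authors: Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald
Affiliations: University of Luxembourg
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: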

Abstract:Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.

[NLP-133] MedPI: Evaluating AI Systems in Medical Patient-facing Interactions

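[Quick Read]: This paper addresses the inadequacy of current evaluation standards for large language models (LLMs) in medical conversations, especially the lack of a multi-dimensional, structured assessment of the full doctor-patient interaction. Existing benchmarks are mostly single-turn question answering and cannot capture the diagnostic reasoning, treatment safety, communication quality, and patient management involved in clinical practice. The key to the solution is MedPI, a high-dimensional benchmark with five core components: synthetic EHR-like patient packets; AI patients with memory and affect; a task matrix spanning encounter reasons and objectives; a 105-dimension rubric mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and calibrated, committee-based LLM judges that provide scores, flags, and evidence-linked rationales. Using this framework, the authors evaluate 9 flagship models under a standardized "vanilla clinician" prompt and find notable weaknesses in key medical capabilities such as differential diagnosis, offering guidance for the future use of LLMs in diagnosis and treatment recommendations.

Link: https://arxiv.org/abs/2601.04195
Authors: Diego Fajardo V., Oleksii Proniakin, Victoria-Elisabeth Gruber, Razvan Marinescu
Affiliations: Lumos
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures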

Abstract:We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) x encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models – Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4 – across 366 AI Patients and 7,097 conversations using a standardized “vanilla clinician” prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.
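
The Task Matrix and the 1-4 rubric described in this abstract can be pictured with a small Python sketch. It is an illustration under stated assumptions, not MedPI's code: the function names are hypothetical, the example reasons and objectives are only the ones the abstract lists, and averaging dimension scores is one possible aggregation that the abstract does not specify.

```python
from itertools import product
from statistics import mean

# Example values taken from the abstract; the full lists are not given there.
ENCOUNTER_REASONS = ["anxiety", "pregnancy", "wellness checkup"]
ENCOUNTER_OBJECTIVES = ["diagnosis", "lifestyle advice", "medication advice"]


def build_task_matrix(reasons, objectives):
    """Cross every encounter reason with every encounter objective."""
    return list(product(reasons, objectives))


def mean_rubric_score(scores):
    """Average per-dimension scores after checking the 1-4 scale.

    `scores` maps a dimension name to an integer score. Averaging is an
    assumption of this sketch, not a documented MedPI aggregation rule.
    """
    for dimension, score in scores.items():
        if not 1 <= score <= 4:
            raise ValueError(f"{dimension}: score {score} is outside the 1-4 scale")
    return mean(scores.values())


if __name__ == "__main__":
    for reason, objective in build_task_matrix(ENCOUNTER_REASONS, ENCOUNTER_OBJECTIVES):
        print(f"{reason} x {objective}")
    # Hypothetical dimension names, for illustration only.
    print(mean_rubric_score({"differential diagnosis": 2, "empathy": 4}))
```

Three reasons crossed with three objectives give nine task cells; the benchmark's real matrix is presumably larger.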

[NLP-134] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

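[Quick Read]: This paper asks whether the safety alignment of current large language models (LLMs) transfers across languages, and in particular whether the assumed "multilingual safety gap" in low-resource settings actually holds. It finds that model safety does not simply degrade as language resources decrease but depends on context in complex ways, namely through a non-linear interaction between language and temporal framing. The authors call this "Temporal Asymmetry": past-tense framing substantially weakens model defenses (only 15.6% safe), while future-tense framing triggers overly conservative refusals (57.2% safe), with a 9.2x disparity between the safest and most vulnerable configurations. The key to the proposed solution is the "Invariant Alignment" paradigm, which calls for moving from superficial heuristics to safety mechanisms that remain stable across linguistic and temporal shifts, eliminating the "Safety Pockets" that arise from local context changes and protecting Global South users from context-specific harmful outputs.

Link: https://arxiv.org/abs/2512.24556
Authors: Muhammad Abdullahi Said, Muhammad Sammani Sani
Affiliations: African Institute for Mathematical Science; University of Vienna
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: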

Abstract:As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the narrative of the multilingual safety gap. Instead of a simple degradation in low-resource settings, we identified a complex interference mechanism in which safety is determined by the intersection of variables. Although the models exhibited a reverse linguistic vulnerability with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

[NLP-135] Generalization to Political Beliefs from Fine-Tuning on Sports Team Preferences

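[Quick Read]: This paper examines how fine-tuning large language models (LLMs) on domain-specific data produces unexpected behavior beyond the scope of the training data, in particular political leanings that differ markedly from those of the base model. The key finding is that, although models fine-tuned to prefer coastal or Southern sports teams were expected to shift liberal or conservative respectively, the two models' political attitudes turned out to be highly similar, with no clear ideological split. The study also finds that the models vary in their willingness to justify their more radical answers. This suggests that fine-tuning may trigger unintended behavioral changes through implicit semantic associations, and that further work is needed on how fine-tuning on simple, narrow datasets leads to seemingly unrelated behavioral shifts.

Link: https://arxiv.org/abs/2601.04369
Authors: Owen Terry
Affiliations: Columbia University
Categories: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
Comments: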

Abstract:Fine-tuned LLMs often exhibit unexpected behavior as a result of generalizing beyond the data they’re shown. We present results in which an LLM fine-tuned to prefer either coastal sports teams or Southern sports teams adopts political beliefs that diverge significantly from those of the base model. While we hypothesized that the coastal model would become more liberal and the southern model would become more conservative, we find that their responses are usually similar to each other, without a clear-cut liberal or conservative bias. In addition to asking the models for numerical ratings of agreement with relevant political statements, we ask them to elaborate on their more radical answers, finding varying degrees of willingness to justify themselves. Further work is needed to understand the mechanisms by which fine-tuning on simple, narrow datasets leads to seemingly unrelated changes in model behavior.

Computer Vision

[CV-0] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

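[Quick Read]: This paper addresses 4D mesh reconstruction of dynamic objects from monocular video, i.e., recovering an object's complete 3D shape and its motion over time (represented as a deformation field) from a single-view video. The key to the solution is a compact latent space that encodes the entire animation sequence in a single forward pass. It is learned by an autoencoder that, during training, uses the skeletal structure of the training objects as a strong prior on plausible deformations; this skeletal information is not needed at inference time. The encoder uses spatio-temporal attention to obtain a more stable representation of overall deformation, and a latent diffusion model trained on this representation predicts the full animation in one shot, conditioned on the input video and the mesh reconstructed from the first frame.

Link: https://arxiv.org/abs/2601.05251
Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi
Affiliations: VGG, University of Oxford; Nanyang Technological University; Naver Labs Europe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, project page: this https URL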

Abstract:We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object’s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object’s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.

[CV-1] QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer

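[Quick Read]: This paper addresses the large parameter counts, high training cost, and limited scalability of classical Neural Radiance Fields (NeRF) in novel-view synthesis. Its core solution is QNeRF, the first hybrid quantum-classical model for novel-view synthesis from 2D images. The key is to use parameterised quantum circuits that encode spatial and view-dependent information through quantum superposition and entanglement, giving a more compact model than the classical counterpart. Two architectural variants are presented: Full QNeRF exploits all quantum amplitudes to maximize representational capability, while Dual-Branch QNeRF introduces a task-informed inductive bias by separating spatial and view-dependent quantum state preparation, which greatly reduces complexity and improves scalability and potential hardware compatibility. Experiments show that, when trained on images of moderate resolution, QNeRF matches or outperforms classical NeRF baselines with fewer than half the parameters, suggesting that quantum machine learning can be competitive for continuous signal representation in computer vision.

Link: https://arxiv.org/abs/2601.05250
Authors: Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra, Vladislav Golyanik
Affiliations: University of Udine; University of Naples Federico II; Max Planck Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 15 figures, 11 tables; project page: this https URL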

Abstract:Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that – when trained on images of moderate resolution – QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.

[CV-2] RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

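[Quick Read]: This paper addresses white balance (WB) estimation in nighttime scenes, which is especially hard because of low-light noise and complex illumination. The key to the solution is RL-AWB, a framework that combines statistical methods with deep reinforcement learning (DRL). It first designs a statistical algorithm for nighttime scenes that integrates salient gray pixel detection with a novel illumination estimation step. It then builds the first DRL-based color constancy method around that algorithm, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. This strategy markedly improves generalization across low-light and well-illuminated images.

Link: https://arxiv.org/abs/2601.05249
Authors: Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu
Affiliations: MediaTek Inc.; National Taiwan University; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL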

Abstract:Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experimental results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: this https URL

[CV-3] Pixel-Perfect Visual Geometry Estimation

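[Quick Read]: This paper addresses serious problems that existing geometry foundation models have when recovering clean, accurate geometry from images, such as flying pixels and loss of fine detail. Its core solution is pixel-perfect visual geometry modeling, which uses generative modeling in pixel space to predict high-quality point clouds free of flying pixels. The key innovations are: 1) Pixel-Perfect Depth (PPD), a model built on pixel-space diffusion transformers (DiT), together with a Semantics-Prompted DiT that uses semantic representations from vision foundation models to guide the diffusion process, preserving global semantic consistency while enhancing local detail; and 2) a Cascade DiT architecture that progressively increases the number of image tokens, improving efficiency while maintaining accuracy. To extend the approach to video (PPVD), the paper further proposes a Semantics-Consistent DiT and reference-guided token propagation, which maintain temporal coherence with minimal computational and memory overhead.

Link: https://arxiv.org/abs/2601.05246
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
Affiliations: Huazhong University of Science and Technology; Zhejiang University; Xiaomi EV
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL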

Abstract:Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
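
As rough arithmetic for why the number of image tokens matters in a pixel-space transformer, the sketch below counts tokens under standard ViT-style patchification. Everything specific here is an assumption for illustration: the 512 x 512 resolution, the patch sizes, and the function names are hypothetical, and the abstract does not state the schedule the Cascade DiT actually uses.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(tokens):
    """Token pairs scored by full self-attention, which grows quadratically."""
    return tokens * tokens


if __name__ == "__main__":
    # Hypothetical coarse-to-fine schedule: larger patches first, smaller later.
    for patch_size in (32, 16, 8):
        tokens = num_tokens(512, 512, patch_size)
        print(patch_size, tokens, attention_pairs(tokens))
```

Halving the patch size quadruples the token count and multiplies the attention pairs by sixteen, which is one plausible reason to spend early stages on fewer tokens.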

[CV-4] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

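[Quick Read]: This paper addresses the limitation that existing referring expression tasks (Referring Expression Segmentation/Comprehension/Generation, REx) support only single-target expressions, which restricts their use in real-world scenarios. To cover multi-target and no-target expressions, the authors propose the Generalized Referring Expression tasks (GREx) and build the first large-scale dataset, gRefCOCO, containing single-target, multi-target, and no-target expressions with their annotated images. The key to the solution is a baseline called ReLA, which adaptively divides the image into regions with sub-instance clues and explicitly models region-region and region-language dependencies. This improves the modeling of complex relationships and achieves state-of-the-art results on both GRES and GREC.

Link: https://arxiv.org/abs/2601.05244
Authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCV, Project Page: this https URL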

Abstract:Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves state-of-the-art results on both the GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at this https URL.

[CV-5] Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration

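[Quick Read]: This paper addresses two bottlenecks in functional grasping for dexterous manipulation: the scarcity of large-scale, high-quality datasets, and the lack of integrated semantic and geometric reasoning in existing learned models. The key to the solution is CorDex, a framework whose correspondence-based data engine automatically generates diverse, high-quality training data in simulation from just a single human demonstration. On top of this data, a multimodal prediction network combines visual and geometric information and introduces a local-global fusion module and an importance-aware sampling mechanism, enabling robust and efficient prediction of functional dexterous grasps for unseen objects.

Link: https://arxiv.org/abs/2601.05243
Authors: Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang
Affiliations: Cornell University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL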

Abstract:Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.

[CV-6] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

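[Quick Read]: This paper addresses the difficulty of collecting high-quality, diverse real-world manipulation data at scale for training robot manipulation policies. Existing methods use text-prompt-conditioned image diffusion models to augment data by altering the backgrounds and tabletop objects in visual observations, but they overlook the multi-view and temporally coherent observations that state-of-the-art policy models require, and text prompts alone cannot reliably specify the scene layout. The key solution is visual identity prompting, which supplies exemplar images as conditioning inputs to guide the diffusion model toward the desired scene setup, giving more precise visual guidance. The authors also build a scalable pipeline that curates a visual identity pool from large robotics datasets. The resulting augmented data yields consistent gains for vision-language-action and visuomotor policy models in both simulation and real-robot settings.

Link: https://arxiv.org/abs/2601.05241
Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
Affiliations: Shanghai AI Laboratory; Tsinghua University; Shanghai Jiao Tong University; University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: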

Abstract:The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

[CV-7] Plenoptic Video Generation

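[Quick Read]: This paper addresses the difficulty generative video re-rendering methods have in maintaining spatio-temporal consistency across multiple views; the inherent stochasticity of generative models makes coherence in hallucinated regions especially hard to preserve. The key to the solution is PlenopticDreamer, a framework built around a multi-in-single-out video-conditioned model trained autoregressively. A camera-guided video retrieval strategy adaptively selects salient videos from previous generations as conditional inputs, synchronizing hallucinated content and keeping spatio-temporal memory consistent across views. Progressive context-scaling, self-conditioning, and a long-video conditioning mechanism further improve convergence, robustness, and long-video generation.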

Link: https://arxiv.org/abs/2601.05239
Authors: Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin
Affiliations: NVIDIA; The Chinese University of Hong Kong; Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in the single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: this https URL

[CV-8] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos

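[Quick Read]: This paper addresses computational systems' inability to predict future object motion from passive visual observation, that is, to anticipate plausible interactions (a cup being lifted, a knife slicing, a lid being closed) the way humans do. The key to the solution is ObjectForesight, a 3D object-centric dynamics model that predicts the 6-DoF poses and trajectories of rigid objects directly from short egocentric video sequences. Unlike conventional world models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, which yields geometrically grounded, temporally coherent predictions that capture object affordances and trajectories. To train at scale, the authors use recent advances in segmentation, mesh reconstruction, and 3D pose estimation to build a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Experiments show significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation.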

Link: https://arxiv.org/abs/2601.05237
Authors: Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi
Affiliations: University of Washington; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Project Website: this http URL

Abstract: Humans can effortlessly anticipate how objects might move or change through interaction: imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of more than 2 million short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. this http URL

[CV-9] Learning Latent Action World Models In The Wild

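[Quick Read]: This paper addresses how agents that reason and plan in the real world can learn a latent action model from unlabeled in-the-wild videos and thereby predict the consequences of their actions. World models usually depend on action labels, which are hard to obtain at scale; the paper instead learns a continuous, constrained latent action space from video alone, and confronts the challenges posed by environmental noise, video diversity, and the lack of a common embodiment. The key to the solution is a continuous but constrained latent action representation (rather than the common vector quantization), which captures complex actions and, in the absence of a shared embodiment, still yields actions localized in space relative to the camera. A controller trained to map known actions to latent ones lets the latent actions serve as a universal interface for planning, with performance close to action-conditioned world-model baselines.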

Link: https://arxiv.org/abs/2601.05230
Authors: Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 pages, 25 figures

Abstract: Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of the challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance as action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

[CV-10] FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

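[Quick Read]: This paper addresses the limited fairness and generalizability of Brain Age Prediction (BAP) models caused by demographically skewed, age-imbalanced brain Magnetic Resonance Imaging (MRI) datasets. Existing generative augmentation methods are mostly based on latent diffusion models; these ease the memory demands of high-dimensional volumetric MRI, but they are slow at inference, prone to reconstruction artifacts, and usually lack age conditioning, all of which hurt BAP performance. The key to the solution is FlowLet, a conditional generative framework that applies flow matching within an invertible 3D wavelet domain. Modeling the data distribution in an invertible transform space enables efficient, high-quality, age-conditioned 3D MRI synthesis that reduces artifacts and computational cost while improving BAP performance for underrepresented groups.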

Link: https://arxiv.org/abs/2601.05212
Authors: Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia
Affiliations: Politecnico di Bari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual’s biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
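The claim that an invertible wavelet domain helps avoid reconstruction artifacts rests on the transform being exactly reversible, unlike a learned latent compression. The following is a minimal sketch of that property, assuming a single-level 1D orthonormal Haar transform; the abstract does not say which 3D wavelet FlowLet uses, and the function names `haar_forward` and `haar_inverse` are hypothetical.

```python
import math

_SCALE = 1.0 / math.sqrt(2.0)

def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length list of floats."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    # Pairwise scaled sums (low-pass) and differences (high-pass).
    approx = [(signal[i] + signal[i + 1]) * _SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _SCALE for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward: recovers the original samples."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _SCALE)
        out.append((a - d) * _SCALE)
    return out

samples = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_forward(samples)
restored = haar_inverse(approx, detail)
```

Here `restored` matches `samples` up to floating-point rounding, so a generative model operating on `(approx, detail)` would lose nothing when mapping back to voxel space.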

[CV-11] MoE3D: A Mixture-of-Experts Module for 3D Reconstruction

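[Quick Read]: This paper addresses the blurred depth boundaries and flying-point artifacts of existing feed-forward 3D reconstruction models. The key to the solution is MoE3D, a mixture-of-experts module that predicts multiple candidate depth maps and fuses them through dynamic weighting, sharpening depth boundaries and reducing artifacts while adding only minimal computational overhead when integrated into a pre-trained 3D reconstruction backbone such as VGGT.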

Link: https://arxiv.org/abs/2601.05208
Authors: Zichen Wang, Ang Cao, Liam J. Wang, Jeong Joon Park
Affiliations: University of Michigan, Ann Arbor
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D reconstruction models (left side). MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting (visualized by MoE weights on the right side). When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead. Best viewed digitally.
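The fusion step described above, several candidate depth maps combined by dynamic weights, can be sketched as a per-pixel softmax-weighted sum. This is only an illustration under assumptions: the abstract does not specify how the weights are computed or normalized, and `fuse_depth_candidates` and its argument layout are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_depth_candidates(candidates, gate_logits):
    """Fuse K candidate depth maps into one.

    candidates[k][i]  : depth predicted by expert k at (flattened) pixel i
    gate_logits[k][i] : unnormalized gating score for expert k at pixel i
    """
    num_experts = len(candidates)
    num_pixels = len(candidates[0])
    fused = []
    for i in range(num_pixels):
        weights = softmax([gate_logits[k][i] for k in range(num_experts)])
        fused.append(sum(w * candidates[k][i] for k, w in enumerate(weights)))
    return fused

# Two experts: one predicts the foreground (1 m), one the background (5 m).
candidates = [[1.0, 1.0], [5.0, 5.0]]
sharp_gate = [[10.0, -10.0], [-10.0, 10.0]]
fused = fuse_depth_candidates(candidates, sharp_gate)
```

With a sharply peaked gate, each pixel lands near one candidate (close to 1.0 and 5.0 here) instead of an in-between value such as 3.0, which is one plausible way such a module could suppress flying points at depth discontinuities.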

[CV-12] Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable

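[Quick Read]: This paper addresses the high computational cost that keeps Large Language Models (LLMs) out of reach in research settings: for autonomous tasks such as literature review and hypothesis generation, a single session with a 70-billion-parameter model can cost about $127 in cloud fees, which severely limits academic labs. The key to the solution is AgentCompress, a system in which a lightweight neural network estimates task difficulty in real time (from only the opening words of the incoming task) and routes the task to a model variant with a suitable degree of compression. The decision takes under 1 millisecond, and the system cuts compute costs by 68.3% on average while keeping 96.2% of the original task success rate.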

Link: https://arxiv.org/abs/2601.05191
Authors: Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
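The routing idea, scoring a task from its opening words and then choosing a compression level, can be sketched as follows. Everything here is an assumption for illustration: a keyword count stands in for the paper's small neural network, and the variant labels (`4-bit`, `8-bit`, `full-precision`), the thresholds, and the function names are hypothetical rather than taken from AgentCompress.

```python
# Words that (in this toy stand-in) hint at a demanding task.
HARD_CUES = {"hypothesis", "hypothesize", "propose", "design", "novel", "prove"}

# (maximum difficulty handled, variant label), from most to least compressed.
VARIANTS = [(0.25, "4-bit"), (0.75, "8-bit"), (1.0, "full-precision")]

def estimate_difficulty(task, num_words=8):
    """Toy difficulty score in [0, 1] based only on the task's opening words."""
    opening = task.lower().split()[:num_words]
    if not opening:
        return 0.0
    hits = sum(1 for word in opening if word.strip(".,:;") in HARD_CUES)
    return min(1.0, hits / 2.0)

def route(task):
    """Pick the most compressed variant whose difficulty ceiling covers the task."""
    score = estimate_difficulty(task)
    for max_difficulty, variant in VARIANTS:
        if score <= max_difficulty:
            return variant
    return VARIANTS[-1][1]

easy = route("Reformat this bibliography into APA style")
hard = route("Propose a novel hypothesis about protein folding")
```

In this sketch the bibliography task is routed to the most compressed variant and the hypothesis task to full precision, mirroring the contrast drawn in the abstract; a learned predictor would replace `estimate_difficulty` without changing `route`.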

[CV-13] VideoAuto-R1: Video Auto Reasoning via Thinking Once Answering Twice

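[Quick Read]: This paper addresses the open question of whether chain-of-thought (CoT) reasoning is necessary for multimodal large language models on video understanding tasks, and what advantage it has over direct answering. The authors show that for video models trained with reinforcement learning (RL), direct answering often matches or even surpasses CoT, while CoT incurs higher computational cost. They therefore propose VideoAuto-R1, whose core idea is a reason-when-necessary strategy. Training follows a "Thinking Once, Answering Twice" paradigm: the model first generates an initial answer, then reasons, and finally outputs a reviewed answer, with both answers supervised by verifiable rewards. At inference, the confidence of the initial answer determines whether reasoning is triggered. The method achieves state-of-the-art accuracy on several video QA and grounding benchmarks with much better efficiency (average response length reduced by about 3.3x), and it shows a low reasoning activation rate on perception-oriented tasks and a higher one on reasoning-intensive tasks, indicating that explicit language-based reasoning is beneficial but not always necessary.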

Link: https://arxiv.org/abs/2601.05175
Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
Affiliations: King Abdullah University of Science and Technology (KAUST); Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
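The inference-time behavior described above, answering first and reasoning only when the initial answer is not confident enough, amounts to a simple gate. A minimal sketch, assuming the model is exposed through two callables; the names `initial_answer_fn` and `reasoning_fn` and the threshold of 0.9 are hypothetical, since the abstract does not state how confidence is computed or thresholded.

```python
def answer_with_optional_reasoning(question, initial_answer_fn, reasoning_fn, threshold=0.9):
    """Return (answer, used_reasoning).

    initial_answer_fn(question)   -> (draft_answer, confidence in [0, 1])
    reasoning_fn(question, draft) -> reviewed answer produced after reasoning
    """
    draft, confidence = initial_answer_fn(question)
    if confidence >= threshold:
        # Confident enough: skip the costly reasoning pass.
        return draft, False
    # Otherwise think once, then emit the reviewed (second) answer.
    return reasoning_fn(question, draft), True

# Stub model calls, for illustration only.
def confident_model(question):
    return "B", 0.97

def unsure_model(question):
    return "B", 0.40

def reviewer(question, draft):
    return "C"

fast = answer_with_optional_reasoning("Which object moves first?", confident_model, reviewer)
slow = answer_with_optional_reasoning("Why does the tower fall?", unsure_model, reviewer)
```

The first call returns the draft without reasoning and the second falls through to the reviewed answer, which is how a gate of this kind would shorten average responses on perception-style questions.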

[CV-14] CoV: Chain-of-View Prompting for Spatial Reasoning

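[Quick Read]: This paper addresses the limited spatial reasoning of Vision-Language Models (VLMs) in Embodied Question Answering (EQA) in 3D environments, which stems from their fixed input views. Existing VLMs typically handle only a finite set of preset views and cannot dynamically gather context that is spread across viewpoints and partially occluded at inference time. The key to the solution is Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that achieves active perception through coarse-to-fine viewpoint exploration. A View Selection agent first filters redundant frames and identifies question-aligned anchor views; the model then interleaves iterative reasoning with discrete camera actions, obtaining new observations from the 3D scene representation until enough context is gathered or the step budget is reached. The strategy is model-agnostic and scales at test time, and it significantly improves EQA performance.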

Link: https://arxiv.org/abs/2601.05172
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
Affiliations: ZIP Lab, Zhejiang University; Monash University; AIML, Adelaide University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision–language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
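The coarse-to-fine procedure described above (select anchor views, then alternate reasoning with camera actions until the context suffices or the budget runs out) has the shape of a bounded loop. A minimal sketch under assumptions: the three callables stand in for VLM and renderer calls and are hypothetical, as is the default budget of five steps.

```python
def chain_of_view(question, frames, select_anchors, propose_action, render_view, max_steps=5):
    """Coarse-to-fine view exploration loop (illustrative only).

    select_anchors(question, frames) -> list of question-aligned anchor views
    propose_action(question, views)  -> a camera action, or None if context suffices
    render_view(view, action)        -> new observation from the 3D scene
    """
    views = list(select_anchors(question, frames))  # coarse stage
    if not views:
        raise ValueError("view selection returned no anchor views")
    steps = 0
    while steps < max_steps:  # fine stage, bounded by the step budget
        action = propose_action(question, views)
        if action is None:  # the reasoner judged the gathered context sufficient
            break
        views.append(render_view(views[-1], action))
        steps += 1
    return views, steps

# Stub components, for illustration only.
frames = ["f0", "f1", "f2"]
views, steps = chain_of_view(
    "What is behind the sofa?",
    frames,
    select_anchors=lambda q, fs: fs[:1],
    propose_action=lambda q, vs: None if len(vs) >= 3 else "turn_left",
    render_view=lambda view, action: view + "+" + action,
)
```

With these stubs the loop takes two camera actions and stops once three views are available; raising `max_steps` is the knob that a test-time scaling experiment would vary.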

[CV-15] GenAI-DrawIO-Creator: A Framework for Automated Diagram Generation

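[Quick Read]: This paper addresses the tedious, time-consuming work of creating and modifying diagrams for communicating complex information, in particular the inefficiency of manipulating diagrams stored in the structured XML format used by tools such as DrawIO. The key to the solution is the GenAI-DrawIO-Creator framework, which uses a Large Language Model (LLM), specifically Claude 3.7, to reason about structured visual data and produce valid XML diagram representations. Specialized prompt engineering and error checking keep the output XML syntactically well-formed and structurally complete, enabling automated generation of accurate diagrams (such as network architectures and flowcharts) from natural language or code, and even replication of diagrams from images, which substantially reduces creation time while preserving structural fidelity.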

Link: https://arxiv.org/abs/2601.05162
Authors: Jinze Yu, Dayuan Jiang
Affiliations: AWS Generative AI Innovation Center
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diagrams are crucial for communicating complex information, yet creating and modifying them remains a labor-intensive task. We present GenAI-DrawIO-Creator, a novel framework that leverages Large Language Models (LLMs) to automate diagram generation and manipulation in the structured XML format used by this http URL. Our system integrates Claude 3.7 to reason about structured visual data and produce valid diagram representations. Key contributions include a high-level system design enabling real-time diagram updates, specialized prompt engineering and error-checking to ensure well-formed XML outputs. We demonstrate a working prototype capable of generating accurate diagrams (such as network architectures and flowcharts) from natural language or code, and even replicating diagrams from images. Simulated evaluations show that our approach significantly reduces diagram creation time and produces outputs with high structural fidelity. Our results highlight the promise of Claude 3.7 in handling structured visual reasoning tasks and lay the groundwork for future research in AI-assisted diagramming applications.

[CV-16] Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering

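[Quick Read]: This paper addresses object hallucination in Multimodal Large Language Models (MLLMs), which it traces to a lack of cognitive introspection during inference: models blindly trust linguistic priors over specific visual evidence. Existing mitigations are limited: contrastive decoding adjusts outputs only superficially without repairing internal semantic misalignment, and static latent steering relies on fixed vectors that cannot provide instance-level control. The paper proposes Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection, detecting probabilistic conflicts and localizing the causal visual anchors; it then applies Interpretable Bi-Causal Steering to dynamically separate visual evidence from background noise and to neutralize blind confidence through adaptive calibration, significantly reducing hallucination rates and improving accuracy.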

Link: https://arxiv.org/abs/2601.05159
Authors: Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; LIGHTSPEED
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

[CV-17] Multi-Scale Local Speculative Decoding for Image Generation

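[Quick Read]: This paper addresses the high latency of autoregressive (AR) image generation models, which stems from their sequential generation, and the limitations of existing speculative decoding methods, namely token-level ambiguity and a lack of spatial awareness. The key to the solution is Multi-Scale Local Speculative Decoding (MuLo-SD): a low-resolution draft model paired with learned up-samplers proposes candidate image tokens, which a high-resolution target model verifies in parallel. Crucially, a local rejection and resampling mechanism confines error correction to spatial neighborhoods instead of resampling in raster-scan order after the first rejection, which substantially improves inference efficiency while preserving semantic alignment and perceptual quality.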

Link: https://arxiv.org/abs/2601.05149
Authors: Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian
Affiliations: Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page is available at this https URL

Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups, up to $\mathbf{1.7\times}$, outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
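To see why local rejection can save work compared with raster-scan resampling, it helps to count how many tokens each strategy redraws for the same verification result. The sketch below assumes a square neighborhood of fixed radius around each rejected token; the abstract mentions neighborhood expansion but not its shape or size, and both function names are hypothetical.

```python
def local_resample_mask(accepted, radius=1):
    """Mark tokens to resample: every rejected token plus its spatial neighbors.

    accepted: 2D list of bools, True where the target model accepted the draft token.
    """
    height, width = len(accepted), len(accepted[0])
    mask = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if accepted[r][c]:
                continue
            # Expand the rejection to a (2*radius+1)^2 window, clipped to the grid.
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    mask[rr][cc] = True
    return mask

def raster_resample_count(accepted):
    """Tokens redrawn if everything from the first rejection onward (raster order) is discarded."""
    flat = [flag for row in accepted for flag in row]
    if all(flat):
        return 0
    return len(flat) - flat.index(False)

# A 4x4 token grid with a single rejection near the top-left corner.
grid = [[True] * 4 for _ in range(4)]
grid[0][1] = False
local_count = sum(sum(row) for row in local_resample_mask(grid, radius=1))
raster_count = raster_resample_count(grid)
```

For this grid the local strategy redraws 6 tokens while the raster-scan strategy redraws 15, which illustrates the efficiency argument; the real method's acceptance rule and resampling order may differ.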

[CV-18] Atlas 2 - Foundation models for clinical deployment

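[Quick Read]: This paper addresses the trade-offs among prediction performance, robustness, and computational efficiency in pathology foundation models, which have limited their deployment in clinical settings. The key to the solution is three new pathology vision foundation models, Atlas 2, Atlas 2-B, and Atlas 2-S, trained on the largest pathology foundation model dataset to date (5.5 million histopathology whole slide images from Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic). Across 80 public benchmarks, the models deliver state-of-the-art prediction performance, robustness, and resource efficiency.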

Link: https://arxiv.org/abs/2601.05148
Authors: Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie, Stephan Tietz, Jonas Dippel, Lukas Muttenthaler, Beatriz Perez Cancer, Alessandro Benetti, Panos Korfiatis, Elias Eulig, Jérôme Lüscher, Jiasen Wu, Sayed Abid Hashimi, Gabriel Dernbach, Simon Schallenberg, Neelay Shah, Moritz Krügener, Aniruddh Jammoria, Jake Matras, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan
Affiliations: Aignostics (Germany); Mayo Clinic; Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data; Korea University; Max-Planck Institute for Informatics; German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK); Ludwig-Maximilians-Universität München; Charité – Universitätsmedizin Berlin; Bavarian Cancer Research Center (BZKF); Helmholtz Munich; Technical University Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Pathology foundation models substantially advanced the possibilities in computational pathology, yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art results in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date, comprising 5.5 million histopathology whole slide images collected from three medical institutions: Charité - Universitätsmedizin Berlin, LMU Munich, and Mayo Clinic.

[CV-19] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

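[Quick Read]: This paper addresses the inability of existing video world models to provide unified, precise control over camera motion and multi-object dynamics: conventional methods model dynamics in the 2D image plane, which makes coordinated control of camera and object motion in 4D space (3D position plus time) difficult. The key to the solution is a novel 4D Geometric Control representation that encodes the world state as a static background point cloud plus per-object 3D Gaussian trajectories. It captures not only each object's path over time but also its probabilistic 3D occupancy, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D control signals are rendered into conditioning inputs that drive a pretrained video diffusion model to generate high-fidelity, view-consistent videos, and an automatic data engine extracts the required 4D annotations from in-the-wild videos, overcoming the scarcity of large-scale annotated data.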

Link: https://arxiv.org/abs/2601.05138
Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
Affiliations: Fudan University; Shanghai Innovation Institute; HKU; ARC Lab, Tencent PCG
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object’s path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

[CV-20] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

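[Quick Read]: This paper addresses the limited performance of Vision-Language Models (VLMs) on visually-rich document understanding, in particular that models recognize certain visual features unreliably and that their errors concentrate in regions that are hard to locate and fix. The key to the solution is VERSE, a methodology that explores the model's visual embedding space to visualize latent representations, which supports assessing model feasibility, identifying error-prone regions, and generating synthetic data that targets those regions. Experiments show that the method uncovers the visual features associated with degraded performance and that retraining substantially boosts F1 without degrading generalization; on-premise models such as Donut and Idefics2, once optimized with VERSE, match or even surpass SaaS models such as GPT-4 and Pixtral.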

Link: https://arxiv.org/abs/2601.05125
Authors: Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal
Affiliations: Comillas University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
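Identifying "problematic regions" of an embedding space comes down to measuring the error rate inside each cluster and flagging the ones that stand out. A minimal sketch, assuming cluster assignments have already been computed by some clustering step; the abstract does not name the clustering or reduction method, and the thresholds and the name `error_prone_clusters` are hypothetical.

```python
from collections import defaultdict

def error_prone_clusters(cluster_ids, is_error, min_rate=0.5, min_size=2):
    """Compute per-cluster error rates and flag clusters worth targeting.

    cluster_ids: cluster label for each evaluated sample
    is_error:    True where the model's prediction for that sample was wrong
    Returns (rates, flagged), where flagged is a sorted list of cluster labels.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cid, err in zip(cluster_ids, is_error):
        totals[cid] += 1
        errors[cid] += int(err)
    rates = {cid: errors[cid] / totals[cid] for cid in totals}
    # Ignore tiny clusters so a single bad sample does not trigger a flag.
    flagged = sorted(
        cid for cid, rate in rates.items()
        if totals[cid] >= min_size and rate >= min_rate
    )
    return rates, flagged

cluster_ids = [0, 0, 0, 1, 1, 2]
is_error = [False, False, True, True, True, True]
rates, flagged = error_prone_clusters(cluster_ids, is_error)
```

Here only cluster 1 is flagged: cluster 0 has a low error rate and cluster 2 is too small to trust. Samples from flagged clusters would then guide which visual features the synthetic training data should contain.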

[CV-21] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

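[Quick Read]: This paper addresses the gap between understanding user intent and executing it in In-context Image Generation and Editing (ICGE): existing unified multimodal models have strong vision-language understanding, but that ability transfers poorly to image generation. The key to the solution is the Re-Align framework, whose core contributions are (1) In-Context Chain-of-Thought (IC-CoT), which decouples semantic guidance from reference association to give a clear textual target and reduce confusion among reference images; and (2) a reinforcement learning (RL) training scheme that uses a surrogate reward to measure the alignment between the structured reasoning text and the generated image, improving performance on ICGE tasks.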

Link: https://arxiv.org/abs/2601.05124
Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai
Affiliations: Hunyuan (Tencent); IIE, CAS; UCAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures, project page: this https URL

Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

[CV-22] From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

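[Quick Read]: This paper addresses a limitation in how existing feed-forward view synthesis models encode camera parameters: encoding cameras as Plücker ray maps ties predictions to the arbitrary world coordinate gauge and makes them sensitive to small camera transformations, which undermines geometric consistency. The key to the solution is projective conditioning, which replaces raw camera parameters with a target-view projective cue as input. This provides a stable 2D conditioning signal and reframes a brittle geometric regression problem in ray space as a well-conditioned target-view image-to-image translation problem, improving fidelity and cross-view consistency.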

链接: https://arxiv.org/abs/2601.05116
作者: Zirui Wu,Zeren Jiang,Martin R. Oswald,Jie Song
机构: HKUST (GZ)(香港科技大学(广州)); University of Oxford(牛津大学); University of Amsterdam(阿姆斯特丹大学); HKUST(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.
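
The gauge dependence mentioned in the abstract can be illustrated with a small sketch. It assumes the standard Plücker parameterization of a ray as (d, m) with moment m = o × d; the function names and the example shift are hypothetical and are not taken from the paper. Moving the world origin leaves the direction unchanged but alters the moment, so the same physical ray gets a different encoding.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d."""
    norm = sum(c * c for c in direction) ** 0.5
    d = tuple(c / norm for c in direction)
    return d, cross(origin, d)


# The same physical ray, expressed in two world frames that differ
# only by a translation of the origin (hypothetical values).
origin = (0.0, 0.0, 0.0)
direction = (0.0, 0.0, 1.0)
shift = (1.0, 0.0, 0.0)

ray_a = plucker_ray(origin, direction)
ray_b = plucker_ray(tuple(o + s for o, s in zip(origin, shift)), direction)

print(ray_a)  # direction and moment in the first frame
print(ray_b)  # same direction, different moment in the shifted frame
```

Only the moment component changes between the two frames, which is one way to see why a model conditioned on raw ray maps is tied to the choice of world coordinates.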

[CV-23] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition

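[Quick read]: This paper addresses the difficulty of exploiting unlabeled LiDAR data in autonomous driving: raw LiDAR logs contain rich, dense 3D geometry but are almost useless without human labels, which makes labeling a dominant cost barrier for autonomous-perception research. The proposed solution is an unsupervised multi-modal pseudo-labeling method. It relies on strong geometric priors learned from temporally accumulated LiDAR maps and on a novel iterative update rule that lifts and fuses semantic cues from text and 2D vision foundation models directly into 3D, without manual input, while enforcing joint geometric-semantic consistency; conversely, moving objects are detected from inconsistencies. The method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, and the densified LiDAR improves depth prediction (for example, a 51.5% MAE improvement in the 80-150 m range).

Link: https://arxiv.org/abs/2601.05105
Authors: Filippo Ghilotti,Samuel Brucker,Nahku Saidy,Matteo Matteucci,Mario Bijelic,Felix Heide
Affiliations: TORC Robotics; Politecnico of Milan; Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: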

Abstract:Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, alongside with a novel iterative update rule that enforces joint geometric-semantic consistency, and vice-versa detecting moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meters range, respectively.

[CV-24] Driving on Registers

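[Quick read]: This paper addresses the trade-off between computational efficiency and performance in end-to-end autonomous driving, namely how to substantially cut downstream computation without sacrificing driving accuracy. The solution is DrivoR, a simple transformer-based architecture that builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens, which compress multi-camera features into a compact scene representation. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories; the scoring decoder learns to mimic an oracle and predicts interpretable sub-scores (such as safety, comfort, and efficiency), enabling behavior-conditioned driving at inference.

Link: https://arxiv.org/abs/2601.05083
Authors: Ellington Kirby,Alexandre Boulch,Yihong Xu,Yuan Yin,Gilles Puy,Éloi Zablocki,Andrei Bursuc,Spyros Gidaris,Renaud Marlet,Florent Bartoccioni,Anh-Quan Cao,Nermin Samet,Tuan-Hung VU,Matthieu Cord
Affiliations: valeo.ai; LIGM, ENPC, IP Paris, UGE, CNRS, France; Sorbonne Université, CNRS, ISIR, F-75005 Paris, France
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: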

Abstract:We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.
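
One way to picture behavior-conditioned selection from interpretable sub-scores is sketched below. The abstract does not say how the sub-scores are combined, so the weighted sum, the function name `select_trajectory`, and all numbers here are assumptions made purely for illustration.

```python
def select_trajectory(candidates, weights):
    """Pick the candidate with the highest weighted sum of sub-scores.

    The weighted sum is an assumed combination rule, not DrivoR's.
    """
    def total(candidate):
        return sum(weights[key] * candidate["scores"][key] for key in weights)

    return max(candidates, key=total)


# Hypothetical candidates with sub-scores in [0, 1].
candidates = [
    {"name": "cautious",
     "scores": {"safety": 0.9, "comfort": 0.8, "efficiency": 0.4}},
    {"name": "assertive",
     "scores": {"safety": 0.7, "comfort": 0.5, "efficiency": 0.9}},
]

# Changing the weights at inference changes which trajectory is chosen.
safe_first = select_trajectory(
    candidates, {"safety": 1.0, "comfort": 0.5, "efficiency": 0.1})
fast_first = select_trajectory(
    candidates, {"safety": 0.2, "comfort": 0.1, "efficiency": 1.0})

print(safe_first["name"], fast_first["name"])
```

The point of the sketch is only that separate, named sub-scores let the preference be changed at inference time without retraining the scorer.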

[CV-25] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)

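[Quick read]: This paper addresses the inefficiency, inconsistency, and quality problems of manually annotating multi-modal content (text, images, video, audio) in the pharmaceutical industry, especially the bottleneck of processing large volumes of long audio and video such as clinical trial interviews and educational seminars. The core solution is a domain-adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs), with three contributions: (i) a reproducible Cut Merge algorithm with fade in/out and timestamp normalization for smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); and (iii) a cost-efficient end-to-end pipeline strategy that balances ALM/VLM-enhanced processing. On the Video MME benchmark and a proprietary dataset of 16,159 videos, the authors report a 3 to 4 times speedup, a 4 times cost reduction, and improved clip coherence (0.348) and informativeness (0.721) scores over VLM baselines such as Gemini 2.5 Pro.

Link: https://arxiv.org/abs/2601.05059
Authors: Suyash Mishra,Qiang Li,Srikanth Patil,Anubhav Girdhar
Affiliations: Roche; Accenture; Involead
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Contributed original research to top tier conference in VLM; currently undergoing peer review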

Abstract:Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g. long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3 to 4 times speedup, 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improved clip coherence scores (0.348) and informativeness scores (0.721) over state of the art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance supporting video summarization for life sciences.

[CV-26] Patch-based Representation and Learning for Efficient Deformation Modeling

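[Quick read]: This paper addresses efficiency and generalization in 3D surface modeling and deformation for downstream computer vision and graphics tasks, where optimizing per-vertex degrees of freedom is costly. The key is PolyFit, a patch-based surface representation obtained by fitting jet functions locally on surface patches, which can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it generalizes to various types of surfaces, and surfaces can be deformed efficiently by updating a compact set of jet coefficients. The approach is demonstrated on two applications: Shape-from-template (SfT) and garment draping.

Link: https://arxiv.org/abs/2601.05035
Authors: Ruochen Chen,Thuy Tran,Shaifali Parashar
Affiliations: CNRS; École Centrale de Lyon; INSA Lyon; Université Claude Bernard Lyon 1; LIRIS, UMR5205
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: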

Abstract:In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.

[CV-27] Higher-Order Adversarial Patches for Real-Time Object Detectors ICPR2026

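[Quick read]: This paper investigates the threat that higher-order adversarial attacks pose to object detectors, in particular to detectors hardened with adversarial training. The approach alternates between training attack patterns (adversarial patches) and hardening the detector with adversarial training, using YOLOv10 as a representative detector in an evasion-attack setting. The results indicate that higher-order adversarial patches affect not only the detector they were trained on but also generalize more strongly than lower-order patches, and that adversarial training alone is not sufficient to harden an object detector efficiently against this kind of attack.

Link: https://arxiv.org/abs/2601.04991
Authors: Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer
Affiliations: Fraunhofer IOSB and Fraunhofer Center for Machine Learning; Karlsruhe Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review (ICPR2026)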

Abstract:Higher-order adversarial attacks can directly be considered the result of a cat-and-mouse game – an elaborate action involving constant pursuit, near captures, and repeated escapes. This idiom describes the enduring circular training of adversarial attack patterns and adversarial training the best. The following work investigates the impact of higher-order adversarial attacks on object detectors by successively training attack patterns and hardening object detectors with adversarial training. The YOLOv10 object detector is chosen as a representative, and adversarial patches are used in an evasion attack manner. Our results indicate that higher-order adversarial patches are not only affecting the object detector directly trained on but rather provide a stronger generalization capacity compared to lower-order adversarial patches. Moreover, the results highlight that solely adversarial training is not sufficient to harden an object detector efficiently against this kind of adversarial attack. Code: this https URL

[CV-28] OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction AAAI2026

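[Quick read]: This paper addresses multi-view inconsistencies caused by underwater optical degradation in order to represent 3D geometry in underwater scenes more accurately. The proposed OceanSplat enforces trinocular view consistency: for each input view it renders horizontally and vertically translated camera views and aligns them via inverse warping. The translated views are also used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer for the spatial optimization of the 3D Gaussians. In addition, a depth-aware alpha adjustment modulates the opacity of 3D Gaussians during early training based on their z-component and viewing direction, deterring the formation of medium-induced primitives. Together, these components disentangle the 3D Gaussians from the scattering medium, improving underwater scene reconstruction and restoration and reducing floating artifacts.

Link: https://arxiv.org/abs/2601.04984
Authors: Minseong Kweon,Jinsun Park
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026. Project page: this https URL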

Abstract:We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their z-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.
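
The triangulation step behind a depth prior from translated views can be sketched with the textbook relation for a purely translated camera pair, depth = focal length × baseline / disparity. How OceanSplat actually builds and combines its horizontal and vertical estimates is not stated in the abstract; the averaging in `fused_depth` and all numeric values below are assumptions for illustration only.

```python
def triangulated_depth(focal_px, baseline_m, disparity_px):
    """Depth from a purely translated camera pair: z = f * b / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def fused_depth(focal_px, baseline_m, disp_horizontal_px, disp_vertical_px):
    """Average the horizontal and vertical estimates (assumed fusion rule)."""
    z_h = triangulated_depth(focal_px, baseline_m, disp_horizontal_px)
    z_v = triangulated_depth(focal_px, baseline_m, disp_vertical_px)
    return 0.5 * (z_h + z_v)


# Hypothetical values: 800 px focal length, 12.5 cm virtual baseline.
z_h = triangulated_depth(800.0, 0.125, 20.0)
z_v = triangulated_depth(800.0, 0.125, 25.0)
z = fused_depth(800.0, 0.125, 20.0, 25.0)
print(z_h, z_v, z)
```

Because the translated views are rendered rather than captured, the baseline is known exactly, which is what makes such a prior usable as self-supervision.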

[CV-29] SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection ICCV

[Quick read]: This paper addresses three weaknesses in 3D lane detection: misaligned feature representations, disregard of lane-structure priors, and failure to use historic observations to resolve ambiguities under poor visibility. Methods that detect lanes from dense bird's-eye-view (BEV) features often suffer from erroneous transformations that misalign features with the true 3D road surface; recent sparse detectors perform better but completely disregard lane-specific priors; and existing methods do not exploit temporal information. The proposed SparseLaneSTP integrates geometric properties of the lane structure and temporal information into a sparse lane transformer through a lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures, and temporal regularization.

Link: https://arxiv.org/abs/2601.04968
Authors: Maximilian Pittner,Joel Janai,Mario Faigle,Alexandru Paul Condurache
Affiliations: Bosch Mobility Solutions, Robert Bosch GmbH; Institute of Neuro- and Bioinformatics, University of Lübeck; Institute for Signal Processing and System Theory, University of Stuttgart
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at IEEE/CVF International Conference on Computer Vision (ICCV) 2025

Abstract:3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense birds-eye-viewed (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which yield the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experimental section proves the benefits of our contributions and demonstrates state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.

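[CV-30] TEA: Temporal Adaptive Satellite Image Semantic Segmentation

[Quick read]: This paper addresses the poor generalization of existing segmentation methods for crop mapping from satellite image time series (SITS) when the temporal length of the input varies. Existing methods typically assume predetermined sequence lengths, so their results degrade markedly on inputs of other lengths. The proposed TEA, a TEmporal Adaptive SITS semantic segmentation method, uses a teacher-student design: a teacher model that encapsulates global sequence knowledge guides a student model trained with adaptive temporal input lengths. The teacher shapes the student's feature space from three perspectives (intermediate embeddings, prototypes, and soft labels), while the student model is dynamically aggregated to mitigate knowledge forgetting. Full-sequence reconstruction is added as an auxiliary task to further improve representation quality across input lengths.

Link: https://arxiv.org/abs/2601.04956
Authors: Juyuan Kang,Hao Zhu,Yan Zhu,Wei Zhang,Jianing Chen,Tianxiang Xiao,Yike Ma,Hao Jiang,Feng Dai
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. Code will be available at this https URL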

Abstract:Crop mapping based on satellite images time-series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlooked the generalization capability of models across scenarios with varying temporal length, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method to enhance the model’s resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, teacher shapes the student’s feature space via intermediate embedding, prototypes and soft label perspectives to realize knowledge transfer, while dynamically aggregating student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.
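
Of the three transfer perspectives named in the abstract, the soft-label one is the easiest to sketch. The block below uses the common temperature-scaled KL divergence between teacher and student class distributions; whether TEA uses exactly this loss is not stated, so the formulation, the temperature, and the logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical per-pixel logits: the teacher saw the full sequence,
# the student only a shortened one.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]

loss_mismatch = soft_label_loss(teacher, student)
loss_match = soft_label_loss(teacher, teacher)
print(loss_mismatch, loss_match)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a shortened-input student would be trained against.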

[CV-31] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

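[Quick read]: This paper addresses prototypicality bias in the automatic metrics widely used to evaluate text-to-image models: because of biased data distributions, these metrics may reward visually and socially prototypical images rather than images that are semantically correct for the text. To identify and quantify this bias, the authors build ProtoBias, a controlled contrastive benchmark spanning Animals, Objects, and Demography images, in which semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts, enabling a directional test of whether a metric follows textual semantics or defaults to prototypes. They further propose ProtoScore, a 7B-parameter metric that substantially reduces failure rates and misranking while running orders of magnitude faster than GPT-5 and approaching the robustness of much larger closed-source judges.

Link: https://arxiv.org/abs/2601.04946
Authors: Subhadeep Roy,Gagan Bhatia,Steffen Eger
Affiliations: University of Technology Nuremberg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: First version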

Abstract:Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running at orders of magnitude faster than the inference time of GPT-5, approaching the robustness of much larger closed-source judges.
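
The paired setup suggests a simple way to quantify misranking, sketched below. The function name, the choice to count ties as failures, and the scores are all hypothetical; the paper's exact failure-rate definition is not given in the abstract.

```python
def misranking_rate(pairs):
    """Fraction of pairs where the metric prefers the wrong image.

    Each pair is (score_correct, score_prototypical): the metric's score
    for the semantically correct but non-prototypical image, and for the
    subtly incorrect but prototypical one. Ties count as failures here,
    which is an assumption.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    failures = sum(1 for correct, proto in pairs if proto >= correct)
    return failures / len(pairs)


# Hypothetical metric scores for four benchmark pairs.
scores = [(0.31, 0.35), (0.40, 0.28), (0.22, 0.30), (0.50, 0.45)]
rate = misranking_rate(scores)
print(rate)
```

A metric that follows the text should drive this rate toward zero, while a metric that defaults to prototypes will sit well above it.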

[CV-32] Decentralized Privacy-Preserving Federal Learning of Computer Vision Models on Edge Devices

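[Quick read]: This paper addresses client data privacy leakage in federated learning: even when raw data is never shared, private data can be reconstructed from model parameter updates alone. The authors analyze methods for improving privacy with respect to both the server and other clients, including homomorphic encryption, gradient compression, and gradient noising, and discuss modified federated learning systems such as split learning, swarm learning, or fully encrypted models. They analyze the negative effects of gradient compression and gradient noising on the classification accuracy of convolutional neural networks (CNNs), show the difficulty of data reconstruction for segmentation networks, and implement a proof of concept on the NVIDIA Jetson TX2 module used in edge devices.

Link: https://arxiv.org/abs/2601.04912
Authors: Damian Harenčák,Lukáš Gajdošech,Martin Madaras
Affiliations: Comenius University; Skeletex Research
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to VISAPP 2026 as Position Paper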

Abstract:Collaborative training of a machine learning model comes with a risk of sharing sensitive or private data. Federated learning offers a way of collectively training a single global model without the need to share client data, by sharing only the updated parameters from each client’s local model. A central server is then used to aggregate parameters from all clients and redistribute the aggregated model back to the clients. Recent findings have shown that even in this scenario, private data can be reconstructed only using information about model parameters. Current efforts to mitigate this are mainly focused on reducing privacy risks on the server side, assuming that other clients will not act maliciously. In this work, we analyzed various methods for improving the privacy of client data concerning both the server and other clients for neural networks. Some of these methods include homomorphic encryption, gradient compression, gradient noising, and discussion on possible usage of modified federated learning systems such as split learning, swarm learning or fully encrypted models. We have analyzed the negative effects of gradient compression and gradient noising on the accuracy of convolutional neural networks used for classification. We have shown the difficulty of data reconstruction in the case of segmentation networks. We have also implemented a proof of concept on the NVIDIA Jetson TX2 module used in edge devices and simulated a federated learning process.
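
The aggregation loop and gradient noising described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unweighted averaging (weighting by client sample count is a common alternative) and Gaussian noise added on the client side, and the function names and values are hypothetical.

```python
import random


def add_gaussian_noise(update, std, rng):
    """Client side: perturb each parameter update before sharing it."""
    return [u + rng.gauss(0.0, std) for u in update]


def federated_average(client_updates):
    """Server side: element-wise mean of the clients' updates."""
    count = len(client_updates)
    return [sum(values) / count for values in zip(*client_updates)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical two-parameter updates from three clients.
updates = [[0.2, -0.4], [0.4, 0.0], [0.0, 0.1]]

clean = federated_average(updates)
noisy = federated_average(
    [add_gaussian_noise(u, 0.05, rng) for u in updates])
print(clean, noisy)
```

Larger noise makes reconstruction from a single update harder but also moves the aggregate further from the clean average, which is the accuracy trade-off the paper analyzes.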

[CV-33] Rotation-Robust Regression with Convolutional Model Trees

[Quick read]: This paper addresses the performance drop of models on image inputs under in-plane rotation, i.e. rotation-robust learning. Building on Convolutional Model Trees (CMTs), it introduces three geometry-aware inductive biases for split directions (convolutional smoothing, a tilt dominance constraint, and importance-based pruning) and quantifies their impact on robustness. It also evaluates a deployment-time orientation search that selects the discrete rotation maximizing a forest-level confidence proxy, without updating model parameters. On MNIST, orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness.

Link: https://arxiv.org/abs/2601.04899
Authors: Hongyi Li,William Ward Armstrong,Jun Xu
Affiliations: Harbin Institute of Technology, Shenzhen; University of Alberta
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We study rotation-robust learning for image inputs using Convolutional Model Trees (CMTs) [1], whose split and leaf coefficients can be structured on the image grid and transformed geometrically at deployment time. In a controlled MNIST setting with a rotation-invariant regression target, we introduce three geometry-aware inductive biases for split directions – convolutional smoothing, a tilt dominance constraint, and importance-based pruning – and quantify their impact on robustness under in-plane rotations. We further evaluate a deployment-time orientation search that selects a discrete rotation maximizing a forest-level confidence proxy without updating model parameters. Orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness. Finally, we observe consistent trends on MNIST digit recognition implemented as one-vs-rest regression, highlighting both the promise and limitations of confidence-based orientation selection for model-tree ensembles.
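
A deployment-time orientation search can be sketched in a few lines. The version below is a simplification under stated assumptions: it only tries multiples of 90 degrees on a square grid, and `toy_model` is a hypothetical stand-in for the paper's model-tree forest and its confidence proxy.

```python
def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def orientation_search(image, predict_with_confidence, num_rotations=4):
    """Try each discrete rotation and keep the most confident prediction.

    Returns (rotation_index, prediction, confidence). The model itself
    is never updated.
    """
    best = None
    candidate = image
    for k in range(num_rotations):
        prediction, confidence = predict_with_confidence(candidate)
        if best is None or confidence > best[2]:
            best = (k, prediction, confidence)
        candidate = rotate90(candidate)
    return best


def toy_model(grid):
    """Hypothetical model: most confident when the bright row is on top."""
    total = sum(sum(row) for row in grid)
    return total, sum(grid[0]) / total


# The bright row starts at the bottom, i.e. the input is upside down.
image = [[0, 0, 0], [0, 0, 0], [1, 1, 1]]
best = orientation_search(image, toy_model)
print(best)
```

The sketch also shows the failure mode noted in the abstract: if the confidence proxy peaks at the wrong orientation, the search will confidently pick it.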

[CV-34] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform

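[Quick read]: This paper addresses the scalability of long-form video multimodal understanding in industrial settings, where strict GPU, latency, and cost constraints make it hard for existing Vision Language Models (VLMs) to process long videos at scale. The authors present a large-scale industrial GenAI framework for the pharmaceutical domain and an empirical analysis that exposes the bottlenecks and trade-offs of current VLMs in deployment: multimodality improves up to 8 of 12 task domains (especially length-dependent tasks), SDPA attention yields 3-8 times efficiency gains on commodity GPUs, and temporal alignment and keyframe detection remain bottlenecks across open- and closed-source models. Rather than proposing a new "A+B" model, the study characterizes practical limits, trade-offs, and failure patterns under realistic constraints and offers actionable guidance for researchers and practitioners.

Link: https://arxiv.org/abs/2601.04891
Authors: Suyash Mishra,Qiang Li,Srikanth Patil,Satyanarayan Pati,Baddu Narendra
Affiliations: Roche; Accenture; Involead
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to the Industry Track of Top Tier Conference; currently under peer review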

Abstract:Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V, etc.), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new “A+B” model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provide actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.

[CV-35] DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation

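[Quick read]: This paper addresses the drawbacks of existing segmentation methods for Neural Radiance Fields (NeRFs), which are often optimization-based, require slow per-scene training, and sacrifice the zero-shot capabilities of 2D foundation models. The proposed DivAS (Depth-interactive Voxel Aggregation Segmentation) is an optimization-free interactive framework: 2D SAM masks generated from user point prompts are refined with NeRF-derived depth priors, and a custom CUDA kernel aggregates the refined multi-view masks into a unified 3D voxel grid in under 200 ms, enabling real-time feedback and better foreground-background separation with segmentation quality comparable to optimization-based methods.

Link: https://arxiv.org/abs/2601.04860
Authors: Ayush Pande
Affiliations: IIT Kanpur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: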

Abstract:Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.

[CV-36] Character Detection using YOLO for Writer Identification in multiple Medieval books

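[Quick read]: This paper addresses scribe identification in medieval manuscripts, i.e. determining which writer produced a manuscript from the handwriting, which supports manuscript dating and the study of how writing evolved. The key change is to replace the template matching and convolutional neural network (CNN) used in the authors' previous work with the YOLO (You Only Look Once) object detection model, which extracts more character instances from the text, and to use the YOLO confidence score as the basis for a rejection threshold that enables reliable writer identification even on unseen manuscripts.

Link: https://arxiv.org/abs/2601.04834
Authors: Alessandra Scotto di Freca,Tiziana D Alessandro,Francesco Fontanella,Filippo Sarria,Claudio De Stefano
Affiliations: University of Cassino and Southern Lazio
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures, 1 table. Accepted at IEEE-CH 2025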

Abstract:Paleography is the study of ancient and historical handwriting, its key objectives include the dating of manuscripts and understanding the evolution of writing. Estimating when a document was written and tracing the development of scripts and writing styles can be aided by identifying the individual scribes who contributed to a medieval manuscript. Although digital technologies have made significant progress in this field, the general problem remains unsolved and continues to pose open challenges. … We previously proposed an approach focused on identifying specific letters or abbreviations that characterize each writer. In that study, we considered the letter “a”, as it was widely present on all pages of text and highly distinctive, according to the suggestions of expert paleographers. We used template matching techniques to detect the occurrences of the character “a” on each page and the convolutional neural network (CNN) to attribute each instance to the correct scribe. Moving from the interesting results achieved from this previous system and being aware of the limitations of the template matching technique, which requires an appropriate threshold to work, we decided to experiment in the same framework with the use of the YOLO object detection model to identify the scribe who contributed to the writing of different medieval books. We considered the fifth version of YOLO to implement the YOLO object detection model, which completely substituted the template matching and CNN used in the previous work. The experimental results demonstrate that YOLO effectively extracts a greater number of letters considered, leading to a more accurate second-stage classification. Furthermore, the YOLO confidence score provides a foundation for developing a system that applies a rejection threshold, enabling reliable writer identification even in unseen manuscripts.
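
A rejection threshold on detection confidence could work along the lines sketched below. The abstract only says that the confidence score provides a foundation for such a system, so everything here is an assumption: the premise that each detected character carries a writer label, the majority vote, the threshold value, and the example data.

```python
from collections import Counter


def identify_writer(detections, min_confidence=0.6):
    """Majority vote over confident detections, or None to reject.

    detections: list of (writer_label, confidence) pairs, one per
    detected character on a page (hypothetical input format).
    """
    kept = [writer for writer, conf in detections if conf >= min_confidence]
    if not kept:
        return None  # nothing confident enough: reject the page
    return Counter(kept).most_common(1)[0][0]


# Hypothetical detections for two pages.
known_page = [("scribe_A", 0.9), ("scribe_B", 0.4),
              ("scribe_A", 0.7), ("scribe_B", 0.65)]
unseen_page = [("scribe_C", 0.3), ("scribe_A", 0.2)]

print(identify_writer(known_page))   # confident detections agree
print(identify_writer(unseen_page))  # all detections rejected
```

Returning `None` instead of a forced label is what would let such a system decline to attribute a page written by a scribe it has never seen.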

[CV-37] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models WACV

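[Quick read]: This paper addresses the fact that existing video retrieval benchmarks do not adequately evaluate action-level discrimination in surveillance settings: mainstream content-based video retrieval evaluations focus on scene-level similarity and overlook the fine-grained action recognition and recurrent-behavior analysis that surveillance requires. The authors propose SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a vehicle-action retrieval benchmark built from real surveillance footage, with two evaluation protocols: inter-pair (cross-action discrimination) and intra-pair (temporal direction understanding). The key to the solution is to use the visual reasoning and instruction-following abilities of Multimodal Large Language Models (MLLMs) in a training-free framework that produces interpretable embeddings: semantic embeddings are extracted directly from MLLM-generated descriptions of images and videos, and they show better performance than contrastive Vision-Language Models on SOVABench and on several spatial and counting benchmarks.

Link: https://arxiv.org/abs/2601.04824
Authors: Oriol Rabasseda,Zenjie Li,Kamal Nasrollahi,Sergio Escalera
Affiliations: Milestone Systems A/S; Universitat de Barcelona and Computer Vision Center; Aalborg Universitet
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted at Real World Surveillance: Applications and Challenges, 6th (in WACV Workshops)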

Abstract:Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.

[CV-38] Integrated Framework for Selecting and Enhancing Ancient Marathi Inscription Images from Stone Metal Plate and Paper Documents

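[Quick read]: This paper addresses the degradation of ancient inscription images caused by background noise, low contrast, aging, and environmental effects, which makes foreground text and background visually similar and hard to tell apart. The key to the solution is an image enhancement method that combines binarization with complementary preprocessing techniques to remove stains and enhance unclear ancient text. The method is evaluated on several types of ancient Marathi inscription images, including stone inscriptions, metal plates, and historical documents; according to the authors, it markedly improves readability and supports subsequent recognition and classification.

Link: https://arxiv.org/abs/2601.04800
Authors: Bapu D. Chendage,Rajivkumar S. Mente
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 Pages, 5 figures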

Abstract:Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects. In many cases, the foreground text and background exhibit similar visual characteristics, making the inscriptions difficult to read. The primary objective of image enhancement is to improve the readability of such degraded ancient images. This paper presents an image enhancement approach based on binarization and complementary preprocessing techniques for removing stains and enhancing unclear ancient text. The proposed methods are evaluated on different types of ancient scripts, including inscriptions on stone, metal plates, and historical documents. Experimental results show that the proposed approach achieves classification accuracies of 55.7%, 62%, and 65.6% for stone, metal plate, and document scripts, respectively, using the K-Nearest Neighbor (K-NN) classifier. Using the Support Vector Machine (SVM) classifier, accuracies of 53.2%, 59.5%, and 67.8% are obtained. The results demonstrate the effectiveness of the proposed enhancement method in improving the readability of ancient Marathi inscription images.
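
As a rough illustration of the binarization step, the sketch below applies Otsu's global threshold to a tiny grayscale image. The abstract does not say which binarization algorithm or preprocessing steps the authors use, so Otsu's method, the function names `otsu_threshold` and `binarize`, and the 0/1 output convention are all assumptions chosen for illustration.

```python
from typing import List


def otsu_threshold(pixels: List[int], levels: int = 256) -> int:
    """Return the threshold t that maximizes between-class variance (Otsu's method).

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the intensities of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map each pixel to 1 (brighter than the threshold) or 0 (otherwise)."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p > threshold else 0 for p in row] for row in image]
```

With dark text on a light background, text pixels come out as 0. A global threshold like this tends to struggle exactly where the abstract says the problem is hard (stains, and foreground that resembles background), which is presumably why the authors pair binarization with additional preprocessing.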

[CV-39] Detector-Augmented SAMURAI for Long-Duration Drone Tracking WACV2026

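[Quick read]: This paper addresses the robustness of long-term drone tracking in urban surveillance, in particular the temporal inconsistency of RGB-based tracking when the detector frequently loses the target. Existing methods rely mostly on conventional motion models and cope poorly with complex situations such as a drone leaving and re-entering the field of view. The key to the solution is the first systematic evaluation of the foundation model SAMURAI for drone tracking, together with a detector-augmented extension that uses detector cues to reduce sensitivity to bounding-box initialization and sequence length. This markedly improves the stability and accuracy of long-duration tracking, especially in drone exit-re-entry scenarios.

Link: https://arxiv.org/abs/2601.04798
Authors: Tamara R. Lenhard,Andreas Weinmann,Hichem Snoussi,Tobias Koch
Affiliations: German Aerospace Center (DLR); Technical University of Applied Sciences Würzburg-Schweinfurt; European University of Technology; Université de Technologie de Troyes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the WACV 2026 Workshop on “Real World Surveillance: Applications and Challenges”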

Abstract:Robust long-term tracking of drones is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI’s potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences - especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI’s zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.

[CV-40] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference

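[Quick read]: This paper addresses the problem that current pyramidal video diffusion models gain inference efficiency but, being trained from scratch, tend to lose visual quality. The core solution is a low-cost fine-tuning pipeline that converts a pretrained diffusion model into a pyramidal one without degrading the quality of the output videos. The paper also systematically compares several step distillation strategies to further improve inference efficiency.

Link: https://arxiv.org/abs/2601.04792
Authors: Denis Korzhenkov,Adil Karjauv,Animesh Karnewar,Mohsen Ghafoorian,Amirhossein Habibian
Affiliations: Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: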

Abstract:Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at this https URL.
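
The idea that noisier inputs are handled at lower resolution can be sketched as a simple schedule that maps each denoising step's noise level to a pyramid stage. Everything specific here is an assumption for illustration: the three stages, the noise boundaries, the resolution scales, and the cost model (per-step cost proportional to pixel count) are not taken from the paper.

```python
from typing import Sequence


def stage_for_noise(noise_level: float, boundaries: Sequence[float]) -> int:
    """Map a noise level in [0, 1] to a pyramid stage index.

    Stage 0 is the coarsest resolution and handles the noisiest inputs.
    `boundaries` holds descending noise thresholds between consecutive stages.
    """
    for stage, bound in enumerate(boundaries):
        if noise_level > bound:
            return stage
    return len(boundaries)


def relative_cost(noise_levels: Sequence[float],
                  boundaries: Sequence[float],
                  scales: Sequence[float]) -> float:
    """Cost of the pyramidal schedule relative to running every step at full resolution.

    Assumes per-step cost is proportional to the pixel count, i.e. scale squared.
    """
    cost = sum(scales[stage_for_noise(n, boundaries)] ** 2 for n in noise_levels)
    return cost / len(noise_levels)
```

For example, `relative_cost([0.9, 0.5, 0.1], boundaries=(0.66, 0.33), scales=(0.25, 0.5, 1.0))` runs one step at each of quarter, half, and full resolution. A pixel-count cost model likely understates the savings for attention layers, whose cost grows faster than linearly in the number of tokens.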

[CV-41] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers

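[Quick read]: This paper addresses the instability of zero-shot inverse problem solvers based on Latent Diffusion Models (LDMs), which tend to produce artifacts and degraded quality in practice. The core issue is that the solver's reverse diffusion dynamics deviate from the true reverse diffusion process, which destabilizes reconstruction. The key to the solution is the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded module that corrects the LDM-based solver's trajectory through measurement-consistent Langevin updates without relying on a linear manifold assumption, narrowing the gap between the solver's and the true reverse diffusion dynamics and making reconstruction markedly more stable and reliable.

Link: https://arxiv.org/abs/2601.04791
Authors: Lee Hyoseok,Sohwi Lim,Eunju Cha,Tae-Hyun Oh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Under Review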

Abstract:With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in each domain. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from their instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver’s and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.
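
To make "measurement-consistent Langevin updates" concrete, here is a toy one-dimensional sketch. It is not the paper's MCLC: it is a plain unadjusted Langevin sampler whose drift adds the prior score to the gradient of a Gaussian measurement log-likelihood. The standard-normal prior, the linear measurement `y = a * x + noise`, the step size, and the function name are all assumptions for illustration. In the paper's setting the prior score would come from the latent diffusion model and the measurement would act through the decoder, neither of which is modeled here.

```python
import math
import random
from typing import List


def langevin_posterior_samples(y: float, a: float = 1.0, sigma: float = 0.5,
                               step: float = 0.01, n_steps: int = 20000,
                               burn_in: int = 2000, seed: int = 0) -> List[float]:
    """Draw approximate samples of x from p(x | y) with unadjusted Langevin dynamics.

    Toy model (assumed for illustration):
      prior        x ~ N(0, 1)                 -> prior score is -x
      measurement  y = a * x + N(0, sigma^2)   -> grad_x log p(y | x) = a * (y - a * x) / sigma^2
    """
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for i in range(n_steps):
        prior_score = -x
        measurement_grad = a * (y - a * x) / sigma ** 2
        noise = rng.gauss(0.0, 1.0)
        # Langevin update: drift toward high posterior density plus injected noise.
        x = x + step * (prior_score + measurement_grad) + math.sqrt(2.0 * step) * noise
        if i >= burn_in:
            samples.append(x)
    return samples
```

For this toy model the posterior mean is known in closed form, `a * y / (a**2 + sigma**2)`, so the sample mean of `langevin_posterior_samples(1.0)` can be compared against 0.8 as a sanity check. Dropping `measurement_grad` makes the chain sample the prior instead, which is the sense in which the measurement term keeps the update consistent with the observation.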

[CV-42] SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning

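[Quick read]: This paper addresses the long acquisition time, high cost, and limited resolution that constrain Magnetic Resonance Imaging (MRI) in clinical use. The key to the solution is an enhanced Pix2Pix framework: Squeeze-and-Excitation Residual Networks (SEResNet) add channel attention to strengthen critical feature representation, a U-Net++ structure improves multi-scale feature fusion, and a simplified PatchGAN discriminator stabilizes training and refines local anatomical realism. Under few-shot conditions (fewer than 500 images), the method improves structural fidelity and image quality on intra-modality MRI translation tasks and generalizes well.

Link: https://arxiv.org/abs/2601.04785
Authors: Xihe Qiu,Yang Dai,Xiaoyu Tan,Sijia Li,Fenghao Sun,Lu Gan,Liang Liu
Affiliations: Shanghai University of Engineering Science; Zhongshan Hospital of Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: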

Abstract:Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.

[CV-43] Defocus Aberration Theory Confirms Gaussian Model in Most Imaging Devices

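[Quick read]: This paper addresses the long-standing, fundamental challenge of accurately estimating depth from 2D images, in particular the ill-posed problem caused by spatially variant defocus blur. The key to the solution is to assume a defocus model as prior knowledge and to show, through theoretical analysis and experimental validation, that in most imaging devices the defocus operator can be approximated by a Gaussian model, which turns the ill-posed problem into a well-posed one with an analytic solution. The model applies both to the absolute blur in a single image and to the relative blur between two images taken from the same viewpoint with different focus settings, and its mathematical simplicity and computational efficiency make it well suited to real-time applications. For focused depths between 1 and 100 meters and a maximum depth variation of 10%, the reported Mean Absolute Error (MAE) is below 1%.

Link: https://arxiv.org/abs/2601.04779
Authors: Akbar Saadat
Affiliations: Iranian railways
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures, 11 .jpg files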

Abstract:Over the past three decades, defocus has consistently provided groundbreaking depth information in scene images. However, accurately estimating depth from 2D images continues to be a persistent and fundamental challenge in the field of 3D recovery. Heuristic approaches face an ill-posed problem when inferring the spatially variant defocus blur, as the desired blur cannot be distinguished from the inherent blur. Given prior knowledge of the defocus model, the problem becomes well-posed, with an analytic solution for the relative blur between two images taken from the same viewpoint with different camera focus settings. The Gaussian model stands out as an optimal choice for real-time applications due to its mathematical simplicity and computational efficiency. Theoretically, it is also the only model that can be applied at the same time to both the absolute blur caused by depth in a single image and the relative blur resulting from depth differences between two images. This paper introduces the settings, for conventional imaging devices, that ensure the defocusing operator adheres to the Gaussian model. Defocus analysis begins within the framework of geometric optics and proceeds through defocus aberration theory in diffraction-limited optics to determine how accurately the actual model fits its Gaussian approximation. The results for a typical set of focused depths between 1 and 100 meters, with a maximum depth variation of 10% at the focused depth, confirm the Gaussian model’s applicability for defocus operators in most imaging devices. The findings demonstrate a maximum Mean Absolute Error (MAE) of less than 1%, underscoring the model’s accuracy and reliability.
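
One reason a Gaussian model can serve for both absolute and relative blur is that Gaussians are closed under convolution: blurring with standard deviation σ1 and then with σ_rel equals a single blur with sqrt(σ1² + σ_rel²). If two images of the same scene have absolute blurs σ1 < σ2, the second is therefore the first blurred by σ_rel = sqrt(σ2² − σ1²). This is a standard property of Gaussian kernels rather than a result taken from the paper. The sketch below checks it numerically with sampled 1-D kernels; the function names, kernel radii, and sigma values are illustrative assumptions.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Sampled, normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_full(a: List[float], b: List[float]) -> List[float]:
    """Full discrete convolution of two 1-D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def relative_sigma(sigma_1: float, sigma_2: float) -> float:
    """Blur that maps an image with absolute blur sigma_1 to one with sigma_2 (> sigma_1)."""
    return math.sqrt(sigma_2 ** 2 - sigma_1 ** 2)
```

For instance, with σ1 = 1.5 and σ2 = 2.5, `relative_sigma` gives 2.0, and `convolve_full(gaussian_kernel(1.5, 15), gaussian_kernel(2.0, 15))` should agree with `gaussian_kernel(2.5, 30)` up to sampling and truncation error.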

[CV-44] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

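[Quick read]: This paper addresses the restriction of multi-image grounding to single-target localization and a narrow range of practical tasks, which stems from the lack of unified modeling. The key to the solution is GeM-VG, a model capable of generalized multi-image visual grounding. The authors systematically categorize and organize existing tasks and introduce the MG-Data-240K dataset to increase diversity in target quantity and image relations. They also design a hybrid reinforcement fine-tuning strategy that combines chain-of-thought (CoT) reasoning with direct answering and uses a rule-based reward to guide an R1-like algorithm, which strengthens the model's overall perception and reasoning capabilities.

Link: https://arxiv.org/abs/2601.04777
Authors: Shurong Zheng,Yousong Zhu,Hongyin Zhao,Fan Yang,Yufei Zhan,Ming Tang,Jinqiao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: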

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model’s overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.

[CV-45] Segmentation-Driven Monocular Shape from Polarization based on Physical Model

[Quick read]: This paper addresses azimuth angle ambiguity in monocular shape-from-polarization (SfP), an inherent property of polarization analysis that severely reduces reconstruction accuracy and stability. The key to the solution is a segmentation-driven monocular SfP (SMSfP) framework with two core contributions: (1) a polarization-aided adaptive region growing (PARG) segmentation strategy that decomposes the global convexity assumption into locally convex sub-regions, suppressing azimuth ambiguity while preserving surface continuity; and (2) a multi-scale fusion convexity prior (MFCP) constraint that ensures local surface consistency and improves the recovery of fine texture and structural detail.

Link: https://arxiv.org/abs/2601.04776
Authors: Jinyu Zhang,Xu Ma,Weili Chen,Gonzalo R. Arce
Affiliations: Beijing Institute of Technology; Beijing Institute of Environmental Features; University of Delaware
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures, submitted to IEEE Transactions on Image Processing

Abstract:Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.

[CV-46] ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting

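[Quick read]: This paper addresses the difficulty of jointly optimizing semantic consistency and geometric accuracy in open-vocabulary 3D scene understanding, and in particular how to efficiently achieve cross-view semantic consistency and intra-mask semantic cohesion within a 3D Gaussian Splatting (3DGS) framework. The key to the solution is the ProFuse framework, which introduces a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry, without render-supervised fine-tuning, while jointly constructing 3D Context Proposals through cross-view clustering. Each proposal obtains a global feature by weighted aggregation of its member embeddings, and this feature is fused onto the Gaussians during direct registration to keep per-primitive language coherence across views. Semantic fusion requires no additional optimization steps and geometric refinement is retained; semantic attachment takes about five minutes per scene, twice as fast as the state of the art (SOTA).

Link: https://arxiv.org/abs/2601.04754
Authors: Yen-Jen Chiou,Wei-Tse Cheng,Yuan-Fu Yang
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures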

Abstract:We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.
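
The "weighted aggregation of member embeddings" step can be pictured as a weighted mean followed by normalization. The abstract does not say how the weights are chosen or whether the result is normalized, so the L2 normalization, the function name, and the example weights below are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def aggregate_proposal_feature(embeddings: Sequence[Sequence[float]],
                               weights: Sequence[float]) -> List[float]:
    """Weighted mean of member embeddings, L2-normalized to unit length."""
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need at least one embedding and one weight per embedding")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    mean = [sum(w * e[d] for e, w in zip(embeddings, weights)) / total_weight
            for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    # Leave an all-zero mean unchanged rather than dividing by zero.
    return [v / norm for v in mean] if norm > 0 else mean
```

Calling `aggregate_proposal_feature([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])` pulls the global feature toward the first member, which is the effect a per-member weight (for example, one based on mask area or visibility, both hypothetical here) would have.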

[CV-47] Skeletonization-Based Adversarial Perturbations on Large Vision Language Models Mathematical Text Recognition

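[Quick read]: This paper examines the visual understanding capabilities and limitations of foundation models on images containing text, especially images of mathematical formulas. Because such images are challenging due to LaTeX conversion and intricate structure, the authors propose a new adversarial attack method based on skeletonization. The key is that skeletonization effectively narrows the search space so that perturbations can be generated more precisely, while a detailed character-level and semantic-level analysis of the changes between original and adversarially perturbed outputs reveals how the models behave in visual interpretation and reasoning. An empirical test on ChatGPT further shows the method's effectiveness in real-world scenarios.

Link: https://arxiv.org/abs/2601.04752
Authors: Masatomo Yoshida,Haruto Namura,Nicola Adami,Masahiro Okuda
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to ITC-CSCC 2025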

Abstract:This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models’ visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.

[CV-48] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection

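[Quick read]: This paper addresses the challenges that Multimodal Large Language Models (MLLMs) face in precise object localization and in resource-constrained edge-cloud deployment. The key to the solution is the AIVD framework, in which lightweight edge detectors and a cloud-based MLLM collaborate to deliver both precise localization and high-quality semantic generation. The authors design an efficient fine-tuning strategy with visual-semantic collaborative augmentation to make the cloud model more robust to noise in edge-cropped boxes and to scenario variations, and a heterogeneous resource-aware dynamic scheduling algorithm to maintain high throughput and low latency across diverse edge devices and dynamic network conditions.

Link: https://arxiv.org/abs/2601.04734
Authors: Yunqing Hu,Zheming Yang,Chang Zhao,Qi Guo,Meng Gao,Pengcheng Li,Wen Ji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: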

Abstract:Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM’s robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.

[CV-49] Training a Custom CNN on Five Heterogeneous Image Datasets

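[Quick read]: This paper addresses how deep learning models can achieve efficient, robust performance on resource-limited yet high-impact real-world visual classification tasks, with a focus on the illumination differences, resolution changes, environmental complexity, and class imbalance found across datasets from different domains (agricultural and urban). The key to the solution is: (1) a lightweight, task-specific custom convolutional neural network (CNN) that achieves competitive performance across multiple scenarios; and (2) a systematic comparative analysis that clarifies when transfer learning and deeper architectures (such as ResNet-18 and VGG-16) are advantageous in data-constrained settings, providing actionable guidance for deployment.

Link: https://arxiv.org/abs/2601.04727
Authors: Anika Tabassum,Tasnuva Mahazabin Tuba,Nafisa Naznin
Affiliations: Department of Computer Science and Engineering, Daffodil International University
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: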

Abstract:Deep learning has transformed visual data analysis, with Convolutional Neural Networks (CNNs) becoming highly effective in learning meaningful feature representations directly from images. Unlike traditional manual feature engineering methods, CNNs automatically extract hierarchical visual patterns, enabling strong performance across diverse real-world contexts. This study investigates the effectiveness of CNN-based architectures across five heterogeneous datasets spanning agricultural and urban domains: mango variety classification, paddy variety identification, road surface condition assessment, auto-rickshaw detection, and footpath encroachment monitoring. These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance, necessitating adaptable and robust learning models. We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning. Through systematic preprocessing, augmentation, and controlled experimentation, we analyze how architectural complexity, model depth, and pre-training influence convergence, generalization, and performance across datasets of differing scale and difficulty. The key contributions of this work are: (1) the development of an efficient custom CNN that achieves competitive performance across multiple application domains, and (2) a comprehensive comparative analysis highlighting when transfer learning and deep architectures provide substantial advantages, particularly in data-constrained environments. These findings offer practical insights for deploying deep learning models in resource-limited yet high-impact real-world visual classification tasks. 

[CV-50] On the Holistic Approach for Detecting Human Image Forgery

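[Quick read]: This paper addresses the fragmentation of current methods for detecting forged images produced by generative AI in the setting of human image manipulation: existing techniques specialize either in facial regions or in full-body synthetic images and generalize poorly to the full, diverse range of forgery types. The key to the solution is the HuForDet framework, which uses a dual-branch architecture: one branch employs heterogeneous experts in the RGB and frequency domains (including an adaptive Laplacian-of-Gaussian, LoG, module) to perceive forgery traces at multiple scales, from fine-grained blending boundaries to coarse-scale texture irregularities; the other uses a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, with a confidence estimation mechanism that dynamically weights its contribution during feature fusion, improving overall robustness and generalization.

Link: https://arxiv.org/abs/2601.04715
Authors: Xiao Guo,Jie Zhu,Anil Jain,Xiaoming Liu
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 figures, 5 tables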

Abstract:The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.
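
For readers unfamiliar with the Laplacian-of-Gaussian (LoG) operator mentioned above, the sketch below builds a fixed-scale LoG kernel from the textbook formula and evaluates its response at a single pixel. How HuForDet makes its LoG module "adaptive" is not described in the abstract and is not modeled here; the function names, kernel radius, and sigma are illustrative assumptions.

```python
import math
from typing import List, Sequence


def log_kernel(sigma: float, radius: int) -> List[List[float]]:
    """Sampled 2-D Laplacian-of-Gaussian kernel, shifted so its entries sum to zero.

    LoG(x, y) = -1 / (pi * sigma^4) * (1 - r^2 / (2 sigma^2)) * exp(-r^2 / (2 sigma^2)),
    where r^2 = x^2 + y^2.
    """
    size = 2 * radius + 1
    two_sigma_sq = 2.0 * sigma * sigma
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r_sq = x * x + y * y
            value = (-1.0 / (math.pi * sigma ** 4)
                     * (1.0 - r_sq / two_sigma_sq)
                     * math.exp(-r_sq / two_sigma_sq))
            row.append(value)
        kernel.append(row)
    # Subtract the mean so that flat image regions give exactly zero response.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def response_at(image: Sequence[Sequence[float]],
                kernel: Sequence[Sequence[float]],
                cy: int, cx: int) -> float:
    """Filter response at pixel (cy, cx); the kernel window must fit inside the image."""
    radius = len(kernel) // 2
    return sum(kernel[dy + radius][dx + radius] * image[cy + dy][cx + dx]
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1))
```

A flat patch produces a response of zero, while an isolated bright pixel produces a negative response at its own location. Larger `sigma` values respond to coarser structures, which matches the abstract's range from fine blending boundaries to coarse texture irregularities.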

[CV-51] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models

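[Quick read]: This paper addresses how a Multimodal Large Language Model (MLLM) can effectively improve image fidelity and detail richness in Text-to-Image (T2I) generation. Existing methods mostly use the reasoning abilities or world knowledge of understanding models to assist generation, whereas this paper takes a new angle: using the understanding model to enhance the visual fidelity of generation. The key to the solution is the Forge-and-Quench framework. The MLLM first generates an enhanced text instruction from the full conversational context; a novel Bridge Adapter then maps that instruction to a virtual visual representation called the Bridge Feature, which links the understanding and generation models; the feature is injected into the T2I backbone as a visual guidance signal, and the enhanced instruction replaces the original input text. This design gives understanding fine-grained influence over generation, markedly improves image quality, and allows migration across models at low training cost.

Link: https://arxiv.org/abs/2601.04706
Authors: Yanbing Zeng,Jia Wang,Hanghang Ma,Junqiang Wu,Jie Zhu,Xiaoming Wei,Jie Hu
Affiliations: Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: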

Abstract:Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM’s inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at this https URL.

[CV-52] WebCryptoAgent: Agentic Crypto Trading with Web Informatics

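[Quick read]: This paper addresses how cryptocurrency trading under extreme volatility can efficiently integrate heterogeneous web information (such as unstructured web content and social sentiment) with market microstructure signals (OHLCV data) to support short-horizon decisions while keeping the system robust. The core challenges are, first, to prevent noisy multi-source evidence from producing spurious correlations and, second, to avoid lagging risk control when price shocks occur at sub-second timescales. The key to the solution is the WebCryptoAgent framework, which uses modality-specific agents to convert multi-source information into a unified evidence document and adopts a decoupled control architecture that separates hourly strategic reasoning from a second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop and markedly improving trading stability and tail-risk handling.

Link: https://arxiv.org/abs/2601.04687
Authors: Ali Kurban,Wei Luo,Liangyu Zuo,Zeyu Zhang,Renda Han,Zhaolu Kang,Hao Tang
Affiliations: AI Geeks; XJTU; Peking University; QTNU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: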

Abstract:Cryptocurrency trading increasingly depends on timely integration of heterogeneous web information and market microstructure signals to support short-horizon decision making under extreme volatility. However, existing trading systems struggle to jointly reason over noisy multi-source web evidence while maintaining robustness to rapid price shocks at sub-second timescales. The first challenge lies in synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent and interpretable trading decisions without amplifying spurious correlations, while the second challenge concerns risk control, as slow deliberative reasoning pipelines are ill-suited for handling abrupt market shocks that require immediate defensive responses. To address these challenges, we propose WebCryptoAgent, an agentic trading framework that decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning. We further introduce a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop. Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines. Code will be available at this https URL.

[CV-53] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution

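[Quick Read]: This paper targets the quality degradation that atmospheric turbulence and compression cause in infrared video super-resolution (VSR), noting that existing methods either ignore the modality gap between infrared and visible images or fail to restore turbulence-induced distortions. The key to the solution is HATIR (Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution), which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, it builds a physically grounded Phasor-Guided Flow Estimator that exploits the temporally consistent phasor responses of thermally active regions to provide turbulence-aware optical flow for guiding the reverse diffusion. It also designs a Turbulence-Aware Decoder that uses turbulence gating and structure-aware attention to selectively suppress unstable temporal cues and strengthen edge-aware feature aggregation, improving the fidelity of structural recovery under nonuniform distortions.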

Link: https://arxiv.org/abs/2601.04682
Authors: Yang Zou,Xingyue Zhu,Kaiqi Han,Jun Ma,Xingyuan Li,Zhiying Jiang,Jinyuan Liu
Affiliations: Northwestern Polytechnical University (西北工业大学); Dalian University of Technology (大连理工大学); Zhejiang University (浙江大学); Dalian Maritime University (大连海事大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 × 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: this https URL

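[CV-54] DB-MSMUNet: Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation

[Quick Read]: This paper addresses accurate segmentation of the pancreas and its lesions in CT scans, a task made difficult by low tissue contrast, blurry boundaries, irregular shapes, and small lesion sizes. The core of the solution is DB-MSMUNet, a new encoder-decoder architecture with three key innovations: (1) a Multi-scale Mamba Module (MSMM) in the encoder, which combines deformable convolutions with multi-scale state space modeling to strengthen global context awareness and adaptation to local deformations; (2) a dual-decoder design, in which the edge decoder uses an Edge Enhancement Path (EEP) to explicitly capture boundary information and refine fuzzy contours, while the area decoder uses a Multi-layer Decoder (MLD) to preserve fine-grained features and accurately reconstruct small lesions; and (3) Auxiliary Deep Supervision (ADS) heads added at multiple scales, which improve gradient propagation and strengthen the discriminative power of multi-scale features. Together, these designs markedly improve segmentation accuracy, edge preservation, and robustness across datasets.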

Link: https://arxiv.org/abs/2601.04676
Authors: Qiu Guan,Zhiqiang Yang,Dezhang Ye,Yang Chen,Xinli Xu,Ying Tang
Affiliations: Southeast University (东南大学); Zhejiang University of Technology (浙江工业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate segmentation of the pancreas and its lesions in CT scans is crucial for the precise diagnosis and treatment of pancreatic cancer. However, it remains a highly challenging task due to several factors such as low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and the small size of lesions. To tackle these issues, we propose DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a novel encoder-decoder architecture designed specifically for robust pancreatic segmentation. The encoder is constructed using a Multi-scale Mamba Module (MSMM), which combines deformable convolutions and multi-scale state space modeling to enhance both global context modeling and local deformation adaptation. The network employs a dual-decoder design: the edge decoder introduces an Edge Enhancement Path (EEP) to explicitly capture boundary cues and refine fuzzy contours, while the area decoder incorporates a Multi-layer Decoder (MLD) to preserve fine-grained details and accurately reconstruct small lesions by leveraging multi-scale deep semantic features. Furthermore, Auxiliary Deep Supervision (ADS) heads are added at multiple scales to both decoders, providing more accurate gradient feedback and further enhancing the discriminative capability of multi-scale features. We conduct extensive experiments on three datasets: the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset provided by collaborating hospitals. DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in terms of segmentation accuracy, edge preservation, and robustness across different datasets. These results demonstrate the effectiveness and generalizability of the proposed method for real-world pancreatic CT segmentation tasks.
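
The Dice Similarity Coefficient reported above is the standard overlap measure Dice = 2|P ∩ T| / (|P| + |T|). A minimal sketch of how it is computed follows. It assumes binary masks flattened to lists of 0/1 values. The function name `dice_coefficient` and the convention that two empty masks score 1.0 are illustrative choices, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction `[1, 1, 0, 0]` against a reference `[1, 0, 1, 0]` shares one foreground voxel out of two on each side, which gives a Dice score of 0.5.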

[CV-55] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment

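[Quick Read]: This paper addresses the accuracy of assessing alignment between generated images and text prompts in text-to-image generation. Existing methods rely on Euclidean-space metrics, which ignore the structured nature of semantic alignment and cannot adapt to individual samples. The key to the solution is the HyperAlign framework, which models the structure of semantic alignment with hyperbolic entailment geometry: first, Euclidean features extracted by CLIP are mapped into hyperbolic space; second, a dynamic-supervision entailment modeling mechanism turns discrete entailment logic into continuous geometric structure supervision; finally, an adaptive modulation regressor uses hyperbolic geometric features to generate sample-level modulation parameters that dynamically calibrate the Euclidean cosine similarity to predict the final alignment score.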

Link: https://arxiv.org/abs/2601.04614
Authors: Wenzhi Chen,Bo Hu,Leida Li,Lihuo He,Wen Lu,Xinbo Gao
Affiliations: Chongqing University of Posts and Telecommunications (重庆邮电大学); Xidian University (西安电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
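
Two of the steps above can be sketched in a few lines: mapping a Euclidean feature into hyperbolic space, and calibrating a cosine similarity with sample-level parameters. The sketch assumes the Poincaré ball model and its exponential map at the origin, which is one common choice. The abstract does not say which hyperbolic model the authors use. The names `expmap0` and `modulated_score`, and the affine form `gain * cos + bias`, are hypothetical; in the paper the modulation parameters come from a learned regressor, whereas here they are plain inputs.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin of a Poincare ball with curvature -c (assumed model)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def cosine(a, b):
    """Euclidean cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def modulated_score(image_feat, text_feat, gain, bias):
    """Calibrate cosine similarity with per-sample gain and bias (hypothetical form)."""
    return gain * cosine(image_feat, text_feat) + bias
```

With `c=1.0`, `expmap0([3.0, 4.0])` keeps the direction of the input and shrinks its norm to `tanh(5)`, which lies strictly inside the unit ball.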

[CV-56] HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation

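[Quick Read]: This paper addresses the limited accuracy of deep learning models when segmenting organs at risk (OARs) in the head and neck for radiation therapy, where organs are small and complexly shaped. Existing hybrid architectures usually just concatenate features without exploiting the complementary strengths of their components, which leads to functional redundancy and a performance bottleneck. The key to the solution is a High Uncertainty Region-guided Multi-Architecture Collaborative Learning model (HUR-MACL), which uses a convolutional neural network to adaptively identify high-uncertainty regions and then applies Vision Mamba and a deformable CNN jointly to refine segmentation within them. A Heterogeneous Feature Distillation Loss further promotes collaborative optimization of the two architectures in high-uncertainty regions, markedly improving overall segmentation accuracy.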

Link: https://arxiv.org/abs/2601.04607
Authors: Xiaoyu Liu,Siwen Wei,Linhao Qu,Mingyuan Pan,Chengsheng Zhang,Yonghong Shi,Zhijian Song
Affiliations: Fudan University (复旦大学); Shanghai Key Lab of Medical Image Computing and Computer Assisted Intervention (上海医学图像计算与计算机辅助干预重点实验室); Huashan Hospital Affiliated to Fudan University (复旦大学附属华山医院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate segmentation of organs at risk in the head and neck is essential for radiation therapy, yet deep learning models often fail on small, complexly shaped organs. While hybrid architectures that combine different models show promise, they typically just concatenate features without exploiting the unique strengths of each component. This results in functional overlap and limited segmentation accuracy. To address these issues, we propose a high uncertainty region-guided multi-architecture collaborative learning (HUR-MACL) model for multi-organ segmentation in the head and neck. This model adaptively identifies high uncertainty regions using a convolutional neural network, and for these regions, Vision Mamba as well as Deformable CNN are utilized to jointly improve their segmentation accuracy. Additionally, a heterogeneous feature distillation loss was proposed to promote collaborative learning between the two architectures in high uncertainty regions to further enhance performance. Our method achieves SOTA results on two public datasets and one private dataset.
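
One plausible way to flag high-uncertainty regions from a network's class probabilities is to threshold per-voxel entropy. The abstract does not state how HUR-MACL measures uncertainty, so the entropy criterion, the 0.5 threshold, and the names `entropy` and `high_uncertainty_mask` below are assumptions for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one voxel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_mask(prob_map, threshold=0.5):
    """Flag voxels whose normalized entropy exceeds `threshold`.

    `prob_map` is a list of per-voxel probability vectors with at least two
    classes. Entropy is divided by log(num_classes) so the score lies in
    [0, 1]. The threshold is an illustrative value.
    """
    mask = []
    for probs in prob_map:
        normalized = entropy(probs) / math.log(len(probs))
        mask.append(normalized > threshold)
    return mask
```

A confident voxel such as `[0.98, 0.01, 0.01]` stays below the threshold, while a near-uniform voxel such as `[0.4, 0.3, 0.3]` is flagged and would be routed to the refinement branches.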

[CV-57] Detection of Deployment Operational Deviations for Safety and Security of AI-Enabled Human-Centric Cyber Physical Systems

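[Quick Read]: This paper addresses the safety and security risks that arise when AI-enabled human-centric cyber-physical systems (H-C CPS) operate under uncertain conditions in deployment, particularly in human-interaction scenarios where the system may enter unknown states and violate established safety protocols. The key to the solution is an evaluation framework for analyzing how different strategies affect the safety and security of such systems during deployed operation. Using closed-loop blood glucose control for Type 1 diabetics as an example, the paper proposes a new personalized image-based technique for detecting unannounced meals, which strengthens the robustness and safety of the system.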

Link: https://arxiv.org/abs/2601.04605
Authors: Bernard Ngabonziza,Ayan Banerjee,Sandeep K.S. Gupta
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, human-centric cyber-physical systems have increasingly involved artificial intelligence to enable knowledge extraction from sensor-collected data. Examples include medical monitoring and control systems, as well as autonomous cars. Such systems are intended to operate according to the protocols and guidelines for regular system operations. However, in many scenarios, such as closed-loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoring systems for stroke diagnosis, the operations of such AI-enabled human-centric applications can expose them to cases in which their operational mode may be uncertain, for instance, as a result of a human's interactions with the system. Such cases, in which the system is in uncertain conditions, can violate the system's safety and security requirements. This paper discusses operational deviations that can lead these systems to operate in unknown conditions. We then create a framework to evaluate different strategies for ensuring the safety and security of AI-enabled human-centric cyber-physical systems in deployed operation. Finally, as an example, we show a novel personalized image-based technique for detecting unannounced meals in closed-loop blood glucose control for Type 1 diabetics.

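[CV-58] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

[Quick Read]: This paper addresses natural-language editing of multi-layer design documents (e.g., posters), that is, how to identify and coordinate fine-grained modifications across layers (such as decoration, text, and images) from a text instruction. Existing methods focus mainly on single-layer image editing or generation and overlook the layer-aware reasoning that multi-layer document editing requires. The key to the solution is the MiLDEAgent framework, which pairs a multimodal reasoning module trained with reinforcement learning (to understand the semantics of each layer) with an image editor, so that the relevant layers can be located and modified precisely. The paper also builds the MiLDEBench benchmark dataset and the MiLDEEval evaluation protocol, which systematically measure instruction following, layout consistency, aesthetic quality, and text rendering, establishing the first strong baseline for this setting.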

Link: https://arxiv.org/abs/2601.04589
Authors: Zihao Lin,Wanrong Zhu,Jiuxiang Gu,Jihyung Kil,Christopher Tensmeyer,Lin Zhang,Shilong Liu,Ruiyi Zhang,Lifu Huang,Vlad I. Morariu,Tong Sun
Affiliations: University of California, Davis (加州大学戴维斯分校); Adobe; UW-Madison (威斯康星大学麦迪逊分校); Princeton University (普林斯顿大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.

[CV-59] 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks

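[Quick Read]: This paper addresses the limited accuracy of segmenting the left atrial (LA) wall and endocardium in late gadolinium-enhanced MRI (LGE-MRI), which stems from scarce training data and complex anatomy. The key to the solution is to use three 3D conditional generative models (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM) to synthesize high-fidelity 3D LGE images from composite semantic label maps that combine expert annotations with unsupervised tissue clusters, thereby augmenting the training set. SPADE-LDM produces the most realistic and structurally accurate images (FID = 4.063). With synthetic-data augmentation, the Dice score of a 3D U-Net for LA cavity segmentation rises from 0.908 to 0.936, a statistically significant improvement over the unaugmented model (p < 0.05), which supports the value of label-conditioned 3D image synthesis for segmenting under-represented cardiac structures.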

Link: https://arxiv.org/abs/2601.04588
Authors: Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been published in the Proceedings of the 2025 IEEE International Conference on Imaging Systems and Techniques (IST). The final published version is available via IEEE Xplore

Abstract:Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as a potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) over the unaugmented model. These findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.

[CV-60] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction

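[Quick Read]: This paper addresses the difficulty of detecting harmful memes in Internet communities as their types shift and they evolve over time. The core challenge is that meme forms keep changing, so methods built on static features adapt poorly. The key to the solution is RepMD, which builds a Design Concept Graph (DCG) that extracts invariant design principles (the underlying design intent of malicious users) from historical memes and uses the graph to guide a Multimodal Large Language Model (MLLM) in detection. The DCG models design steps after the attack tree and is derived through design-step reproduction and graph pruning. RepMD reaches 81.1% accuracy with only slight drops on type-shifting and temporally evolving memes, and it improves the efficiency of human review.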

Link: https://arxiv.org/abs/2601.04567
Authors: Ziyou Jiang,Mingyang Li,Junjie Wang,Yuekai Huang,Jie Huang,Zhiyuan Chang,Zhaoyang Li,Qing Wang
Affiliations: State Key Laboratory of Complex System Modeling and Simulation Technology (复杂系统建模与仿真技术国家重点实验室); Science and Technology on Integrated Information System Laboratory (综合信息系统科学技术实验室); Institute of Software Chinese Academy of Sciences (中国科学院软件研究所); University of Chinese Academy of Sciences (中国科学院大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures

Abstract:Harmful memes are ever-shifting in Internet communities and are difficult to analyze due to their type-shifting and temporally evolving nature. Although these memes are shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use the DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy of 81.1% and shows only slight accuracy decreases when generalized to type-shifting and temporally evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery of harmful memes, with 15 to 30 seconds per meme.

[CV-61] FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer

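[Quick Read]: This paper addresses a weakness of current deep-network methods for single-image facial texture generation: on in-the-wild inputs, the texture UV space is confined to what the training data or a 2D face generator can express, so details, structure, and identity may be inconsistent with the input. The key to the solution is FaceRefiner, a style transfer-based facial texture refinement method. It treats the 3D sampled texture as the style image and the output of a texture generation method as the content image, and it introduces differentiable rendering to transfer information at multiple levels: not only high-level semantics and mid-level structure, but also explicit low-level (pixel-level) information in the visible face regions. This preserves the details and identity of the input image and improves texture realism and consistency.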

Link: https://arxiv.org/abs/2601.04520
Authors: Chengyang Li,Baoping Cheng,Yao Cheng,Haocheng Zhang,Renshuai Liu,Yinglin Zheng,Jing Liao,Xuan Cheng
Affiliations: Xiamen University (厦门大学); China Mobile (Hangzhou) Information Technology Co., Ltd.; Xiamen University Malaysia (厦门大学马来西亚分校); City University of Hong Kong (香港城市大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia

Abstract:Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Unlike current style transfer methods, which only transfer high- and middle-level information to the result, our style transfer method integrates differentiable rendering to also transfer low-level (pixel-level) information in the visible face regions. The main benefit of such multi-level information transfer is that the details, structures and semantics in the input can be well preserved. Extensive experiments on the Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve texture quality and face identity preservation compared with state-of-the-art methods.

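[CV-62] TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression

[Quick Read]: This paper addresses the high computational cost of 3D medical image segmentation, which comes from the cubic growth of voxel processing with spatial dimensions and from redundant computation on homogeneous regions. The key to the solution is TokenSeg, a boundary-aware sparse token representation framework. First, a multi-scale hierarchical encoder extracts 400 candidate tokens that capture both global anatomical context and fine boundary details. Second, a boundary-aware tokenizer combines VQ-VAE quantization with importance scoring to select 100 salient tokens, more than 60% of which lie near tumor boundaries. Finally, a sparse-to-dense decoder reconstructs full-resolution segmentation masks through token reprojection, progressive upsampling, and skip connections. The method preserves segmentation accuracy while reducing GPU memory by 64% and inference latency by 68%, and it generalizes well across heterogeneous anatomical structures.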

Link: https://arxiv.org/abs/2601.04519
Authors: Sen Zeng,Hong Zhou,Zheng Zhu,Yang Liu
Affiliations: Tsinghua University (清华大学); Southwest Forestry University (西南林业大学); GigaAI; KCL (伦敦国王学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a multi-scale hierarchical encoder that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a boundary-aware tokenizer that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60% of which lie near tumor boundaries; and (3) we develop a sparse-to-dense decoder that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49% Dice and 89.61% IoU, while reducing GPU memory and inference latency by 64% and 68%, respectively. To verify the generalization capability, our evaluations on MSD cardiac and brain MRI benchmark datasets demonstrate that TokenSeg consistently delivers optimal performance across heterogeneous anatomical structures. These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.
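
The token selection step (400 candidates reduced to 100 salient tokens) can be sketched as a plain top-k selection over importance scores. How TokenSeg actually computes those scores is not given in the abstract, so the scores are taken as inputs here. The function name `select_salient_tokens` and the choice to keep the survivors in their original order are assumptions.

```python
def select_salient_tokens(tokens, scores, k=100):
    """Keep the k tokens with the highest importance scores, in original order."""
    if len(tokens) != len(scores):
        raise ValueError("each token needs exactly one score")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore positional order (an assumed convention)
    return [tokens[i] for i in keep], keep
```

Returning the kept indices alongside the tokens lets a decoder place each token back at its source location, which is what a sparse-to-dense reprojection step would need.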

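[CV-63] Towards Spatio-Temporal Extrapolation of Phase-Field Simulations with Convolution-Only Neural Networks

[Quick Read]: This paper addresses the prohibitive computational cost of phase-field simulations of liquid metal dealloying (LMD) over large spatial domains and long time horizons, which makes efficient extrapolation difficult. The key to the solution is a fully convolutional, conditionally parameterized U-Net surrogate. Its main innovations are convolutional self-attention to better capture local and global features, physically informed padding to keep boundary regions stable, and a flood-fill corrector to maintain accuracy under extreme extrapolation. Conditioning on simulation parameters allows time-step skipping and adaptation to different alloy compositions, which improves flexibility and generalization. To avoid costly solver-based initialization, the surrogate is coupled with a conditional diffusion model that generates physically consistent synthetic initial conditions. The surrogate is trained on small domains and short time spans and then extrapolated to large domains and long horizons. Across multiple alloy compositions, relative errors in key quantities of interest are typically below 5% in the training regime and under 15% in extrapolation, with speed-ups of up to 36,000 times.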

Link: https://arxiv.org/abs/2601.04510
Authors: Christophe Bonneville,Nathan Bieberdorf,Pieterjan Robbe,Mark Asta,Habib Najm,Laurent Capolungo,Cosmin Safta
Affiliations: Sandia National Laboratories (桑迪亚国家实验室); Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室); University of California, Berkeley (加州大学伯克利分校); Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Phase-field simulations of liquid metal dealloying (LMD) can capture complex microstructural evolutions but can be prohibitively expensive for large domains and long time horizons. In this paper, we introduce a fully convolutional, conditionally parameterized U-Net surrogate designed to extrapolate far beyond its training data in both space and time. The architecture integrates convolutional self-attention, physically informed padding, and a flood-fill corrector method to maintain accuracy under extreme extrapolation, while conditioning on simulation parameters allows for flexible time-step skipping and adaptation to varying alloy compositions. To remove the need for costly solver-based initialization, we couple the surrogate with a conditional diffusion model that generates synthetic, physically consistent initial conditions. We train our surrogate on simulations generated over small domain sizes and short time spans, but, by taking advantage of the convolutional nature of U-Nets, we are able to run and extrapolate surrogate simulations for longer time horizons than what would be achievable with classic numerical solvers. Across multiple alloy compositions, the framework is able to reproduce the LMD physics accurately. It predicts key quantities of interest and spatial statistics with relative errors typically below 5% in the training regime and under 15% during large-scale, long time-horizon extrapolations. Our framework can also deliver speed-ups of up to 36,000 times, bringing the time to run weeks-long simulations down to a few seconds. This work is a first stepping stone towards high-fidelity extrapolation in both space and time of phase-field simulation for LMD.
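
Spatial extrapolation works here because a convolution kernel does not depend on the size of the grid it slides over. The sketch below applies one kernel to fields of different sizes. It is not the authors' architecture: the wrap-around (periodic) padding is an assumed stand-in for the paper's physically informed padding, whose details the abstract does not give, and `conv2d_periodic` is a hypothetical name.

```python
def conv2d_periodic(field, kernel):
    """Apply a square, odd-sized kernel to a 2D field with wrap-around padding.

    As in deep learning frameworks, this is a cross-correlation. Periodic
    padding is an assumption made for this sketch.
    """
    rows, cols = len(field), len(field[0])
    radius = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    weight = kernel[dr + radius][dc + radius]
                    acc += weight * field[(r + dr) % rows][(c + dc) % cols]
            out[r][c] = acc
    return out
```

The same 3 × 3 kernel can be applied to a 4 × 4 field and to an 8 × 16 field without any change, and the output always has the shape of the input. That property is what lets a surrogate trained on small domains run on larger ones.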

[CV-64] IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation

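[Quick Read]: This paper addresses the reliability of generative AI when producing infographics: current text-to-image (T2I) models can generate visually appealing images, but the resulting infographics may contain easily overlooked problems such as distorted data encoding or incorrect text. To evaluate this systematically, the authors present IGENBENCH, the first benchmark for the reliability of text-to-infographic generation, with 600 curated test cases covering 30 infographic types. The key to the solution is an automated evaluation framework that decomposes reliability verification into atomic yes/no questions drawn from a taxonomy of 10 question types and uses multimodal large language models (MLLMs) to verify each one. This yields question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC) and reveals that data-related dimensions such as data completeness are universal bottlenecks for T2I models and that end-to-end correctness remains the central challenge.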

Link: https://arxiv.org/abs/2601.04498
Authors: Yinghao Tang,Xueding Liu,Boyuan Zhang,Tingfeng Lan,Yupeng Xie,Jiale Lao,Yiyao Wang,Haoxuan Li,Tingting Gao,Bo Pan,Luoxuan Weng,Xiuqi Huang,Minfeng Zhu,Yingchaojie Feng,Yuyu Luo,Wei Chen
Affiliations: State Key Lab of CAD&CG, Zhejiang University (浙江大学); UESTC (电子科技大学); University of Virginia (弗吉尼亚大学); HKUST(GZ) (香港科技大学(广州)); Cornell University (康奈尔大学); Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at this https URL.
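
The gap between a Q-ACC of 0.90 and an I-ACC of 0.49 is easier to see with a small computation. The sketch assumes that Q-ACC pools all yes/no checks and that an infographic counts toward I-ACC only when every one of its checks passes. The abstract does not spell out either definition, so both are assumptions, as are the function names `q_acc` and `i_acc`.

```python
def q_acc(results):
    """Fraction of all yes/no checks that passed, pooled over infographics."""
    answers = [a for checks in results for a in checks]
    return sum(answers) / len(answers)


def i_acc(results):
    """Fraction of infographics for which every check passed (assumed definition)."""
    return sum(all(checks) for checks in results) / len(results)
```

With four infographics of ten checks each, where two are flawless, one fails a single check, and one fails two, Q-ACC is 37/40 = 0.925 while I-ACC is only 2/4 = 0.5. A single failed check is enough to make the whole infographic count as unreliable.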

[CV-65] UniDrive-WM: Unified Understanding Planning and Generation World Model For Autonomous Driving

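[Quick Read]: This paper addresses the performance bottleneck caused by treating perception, prediction, and planning as separate modules in autonomous driving, which makes accurate scene understanding and safe trajectory planning difficult in complex scenarios. The key to the solution is UniDrive-WM, a unified world model based on a vision-language model (VLM) that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation. The predicted trajectory conditions a VLM-based image generator, and the generated future frames provide additional supervisory signals that strengthen scene understanding and iteratively refine trajectory generation, closing the loop between perception, planning, and generation.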

Link: https://arxiv.org/abs/2601.04453
Authors: Zhexiao Xiong,Xin Ye,Burhan Yaman,Sheng Cheng,Yiren Lu,Jingru Luo,Nathan Jacobs,Liu Ren
Affiliations: Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI); Washington University in St. Louis; Arizona State University; Case Western Reserve University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM’s trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at this https URL .
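
The two planning metrics named above can be sketched as follows. The exact Bench2Drive protocol (time horizons, averaging, collision checks) is not described in the abstract, so this is a generic formulation: mean Euclidean distance between matching waypoints, and the fraction of episodes flagged as collisions. The function names are hypothetical.

```python
import math


def l2_trajectory_error(pred, target):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    if len(pred) != len(target):
        raise ValueError("trajectories must have the same number of waypoints")
    dists = [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target)]
    return sum(dists) / len(dists)


def collision_rate(collided):
    """Fraction of evaluated episodes in which a collision was flagged."""
    return sum(collided) / len(collided)
```

For a two-waypoint trajectory whose second point is off by (3, 4) metres, the error is (0 + 5) / 2 = 2.5.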

[CV-66] CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction

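[Quick Read]: This paper addresses the limited generalizability of current deep learning methods for cardiac MRI (CMR) reconstruction: existing models are usually optimized for a single or narrow subset of CMR variations (image contrast, sampling pattern, scanner vendor, anatomical structure, disease type, and so on) and degrade markedly under distribution shift. The key to the solution is CRUNet-MR-Univ, a foundation model that uses spatio-temporal correlations and prompt-based priors to handle the full diversity of CMR data and generalize across varied clinical scenarios with a single model.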

Link: https://arxiv.org/abs/2601.04428
Authors: Donghang Lyu,Marius Staring,Hildo Lamb,Mariya Doneva
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: STACOM 2025

Abstract:In recent years, deep learning has attracted increasing attention in the field of Cardiac MRI (CMR) reconstruction due to its superior performance over traditional methods, particularly in handling higher acceleration factors, highlighting its potential for real-world clinical applications. However, current deep learning methods remain limited in generalizability. CMR scans exhibit wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most existing models are designed to handle only a single or narrow subset of these variations, leading to performance degradation when faced with distribution shifts. Therefore, it is beneficial to develop a unified model capable of generalizing across diverse CMR scenarios. To this end, we propose CRUNet-MR-Univ, a foundation model that leverages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans. Our approach consistently outperforms baseline methods across a wide range of settings, highlighting its effectiveness and promise.

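[CV-67] From Preoperative CT to Postmastoidectomy Mesh Construction: Mastoidectomy Shape Prediction for Cochlear Implant Surgery

[Quick Read]: This paper addresses the prediction of the mastoidectomy region shape for cochlear implant (CI) surgery. The step matters for preoperative planning and surgical safety, but ground-truth labels are scarce, which limits conventional deep learning approaches. The key to the solution is a hybrid self-supervised and weakly-supervised learning framework that predicts the complex, boundary-less mastoidectomy shape directly from preoperative CT scans without human annotations. It uses a 3D T-distribution loss to stabilize weakly-supervised learning and reaches a mean Dice score of 0.72 without labels, providing groundwork for constructing 3D postmastoidectomy surfaces.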

Link: https://arxiv.org/abs/2601.04405
Authors: Yike Zhang,Eduardo Davalos,Dingjie Su,Ange Lou,Jack Noble
Affiliations: Trinity University (圣三一大学); Vanderbilt University (范德堡大学); Center for Advanced AI, Accenture (Accenture 高级人工智能中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2505.18368

Abstract:Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies on this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work to integrate self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.
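
For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of the standard overlap metric, Dice = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made for illustration; this is not the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(predicted, reference))
```

A 3D volume would be flattened to one list of voxels before calling the function. A score of 1.0 means perfect overlap and 0.0 means none.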

[CV-68] 3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation NEURIPS2025

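【Quick read】: This paper addresses the accuracy and efficiency challenges of large-scale 3D object annotation caused by spatial complexity, occlusion, and viewpoint inconsistency, which single-model approaches struggle to handle. The authors propose the Tri-MARF framework, whose core innovation is a multi-agent collaborative architecture that integrates three input modalities: 2D multi-view images, textual descriptions, and 3D point clouds. A vision-language model agent generates multi-view descriptions, an information aggregation agent selects the best descriptions, and a gating agent aligns textual semantics with 3D geometry, improving annotation quality and efficiency. Experiments show that the method clearly outperforms prior work on CLIPScore, retrieval accuracy, and throughput.

Link: https://arxiv.org/abs/2601.04404
Authors: Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang
Affiliations: Sun Yat-sen University; Shanghai Jiao Tong University; Snap Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025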

Abstract:Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.

[CV-69] Performance Analysis of Image Classification on Bangladeshi Datasets

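【Quick read】: This paper addresses the practical question of architecture choice in image classification: whether to design a Convolutional Neural Network (CNN) from scratch or to adopt a pre-trained deep learning model. The key is a comparison of a custom CNN against several mainstream pre-trained architectures (VGG-16, ResNet-50, and MobileNet) used via transfer learning under identical experimental settings. Pre-trained models reach higher classification accuracy and converge faster when data is limited, while the custom CNN has far fewer parameters and lower computational complexity. The study exposes the trade-offs among performance, complexity, and efficiency and offers empirical guidance for choosing a CNN architecture in practice.

Link: https://arxiv.org/abs/2601.04397
Authors: Mohammed Sami Khan, Fabiha Muniat, Rowzatul Zannat
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: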

Abstract:Convolutional Neural Networks (CNNs) have demonstrated remarkable success in image classification tasks; however, the choice between designing a custom CNN from scratch and employing established pre-trained architectures remains an important practical consideration. In this work, we present a comparative analysis of a custom-designed CNN and several widely used deep learning architectures, including VGG-16, ResNet-50, and MobileNet, for an image classification task. The custom CNN is developed and trained from scratch, while the popular architectures are employed using transfer learning under identical experimental settings. All models are evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. Experimental results show that pre-trained CNN architectures consistently outperform the custom CNN in terms of classification accuracy and convergence speed, particularly when training data is limited. However, the custom CNN demonstrates competitive performance with significantly fewer parameters and reduced computational complexity. This study highlights the trade-offs between model complexity, performance, and computational efficiency, and provides practical insights into selecting appropriate CNN architectures for image classification problems.

[CV-70] In-SRAM Radiant Foam Rendering on a Graph Processor

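【Quick read】: This paper addresses efficient volumetric rendering on large many-core accelerators. Such hardware typically uses distributed on-chip memory: each core owns only a small local SRAM (static random-access memory) and exchanges data through explicit on-chip communication. This contrasts with the single large device memory of conventional GPUs and breaks the assumption, common in volumetric rendering, that rays can randomly access a unified scene representation. The key to the solution is a fully in-SRAM distributed renderer that shards the scene across tiles and forwards rays between shards through a hierarchical routing overlay, so that ray marching runs entirely from on-chip SRAM with predictable communication paths. On the Graphcore Mk2 IPU the system reaches near-interactive throughput (about 1 fps at 640×480) with image and depth quality close to the original GPU implementation.

Link: https://arxiv.org/abs/2601.04382
Authors: Zulkhuu Tuya, Ignacio Alzugaray, Nicholas Fry, Andrew J. Davison
Affiliations: Imperial College London
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 26 figures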

Abstract:Many emerging many-core accelerators replace a single large device memory with hundreds to thousands of lightweight cores, each owning only a small local SRAM and exchanging data via explicit on-chip communication. This organization offers high aggregate bandwidth, but it breaks a key assumption behind many volumetric rendering techniques: that rays can randomly access a large, unified scene representation. Rendering efficiently on such hardware therefore requires distributing both data and computation, keeping ray traversal mostly local, and structuring communication into predictable routes. We present a fully in-SRAM, distributed renderer for the Radiant Foam Voronoi-cell volumetric representation on the Graphcore Mk2 IPU, a many-core accelerator with tile-local SRAM and explicit inter-tile communication. Our system shards the scene across tiles and forwards rays between shards through a hierarchical routing overlay, enabling ray marching entirely from on-chip SRAM with predictable communication. On Mip-NeRF 360 scenes, the system attains near-interactive throughput (≈1 fps at 640×480) with image and depth quality close to the original GPU-based Radiant Foam implementation, while keeping all scene data and ray state in on-chip SRAM. Beyond demonstrating feasibility, we analyze routing, memory, and scheduling bottlenecks that inform how future distributed-memory accelerators can better support irregular, data-movement-heavy rendering workloads.

[CV-71] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection

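【Quick read】: This paper asks whether a flow-matching foundation model pre-trained primarily on visible RGB images can be turned into a cross-spectral translator (RGB to infrared, IR, or to synthetic aperture radar, SAR) from only a few paired samples, and whether the resulting synthetic data improves downstream object detection. The key to the solution is fine-tuning the foundation model with low-rank adaptation (LoRA) modules: 100 paired images per domain suffice for pixel-aligned cross-modal translation. LPIPS computed on 50 held-out pairs reliably predicts downstream detection performance (mAP for YOLOv11n and DETR), giving an efficient few-shot route to extending foundation models to non-visible modalities.

Link: https://arxiv.org/abs/2601.04381
Authors: Maxim Clouser, Kia Khezeli, John Kalantari
Affiliations: Yrikka Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: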

Abstract:Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
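
The low-rank adaptation idea mentioned above can be illustrated with a small, dependency-free sketch. It shows the generic LoRA update W' = W + (alpha / r) · B · A, where only the low-rank factors A and B are trained while W stays frozen. The helper names and the toy matrices are assumptions made for illustration; the sketch says nothing about how the paper places adapters inside FLUX.1 Kontext.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A for a frozen weight W.

    w: out x in frozen weight
    a: r x in trainable factor
    b: out x r trainable factor
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # out x in low-rank update
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_parameters(out_dim, in_dim, rank):
    """Parameters trained by LoRA versus full fine-tuning of one layer."""
    return rank * (in_dim + out_dim), out_dim * in_dim


if __name__ == "__main__":
    frozen = [[1.0, 0.0], [0.0, 1.0]]
    factor_a = [[3.0, 4.0]]        # rank 1
    factor_b = [[1.0], [2.0]]
    print(lora_effective_weight(frozen, factor_a, factor_b, alpha=1.0))
    print(trainable_parameters(4096, 4096, 16))
```

The parameter count is what makes few-shot fine-tuning plausible: for a hypothetical 4096 × 4096 layer at rank 16, the adapter trains 131,072 values instead of about 16.8 million.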

[CV-72] Aligned explanations in neural networks

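【Quick read】: This paper addresses the misalignment of current explanation methods for deep neural networks: existing feature attribution methods only loosely reflect how the model actually makes its prediction, so explanations are not tied to predictions and merely "white-paint" the black box. The key to the solution is the design principle of model readability, pursued through pseudo-linear networks (PiNets) that align explanations directly with predictions. PiNets produce instance-wise linear predictions in an arbitrary feature space, which makes the model output linearly readable, so that explanations are faithful to the model's decision path and remain consistent across several evaluation criteria.

Link: https://arxiv.org/abs/2601.04378
Authors: Corentin Lobet, Francesca Chiaromonte
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: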

Abstract:Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model’s prediction-making process, thereby merely white-painting the black box. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be directly linked to predictions, rather than serving as post-hoc rationalizations. We present model readability as a design principle enabling alignment, and PiNets as a modeling framework to pursue it in a deep learning context. PiNets are pseudo-linear networks that produce instance-wise linear predictions in an arbitrary feature space, making them linearly readable. We illustrate their use on image classification and segmentation tasks, demonstrating how PiNets produce explanations that are faithful across multiple criteria in addition to alignment.

[CV-73] Combining facial videos and biosignals for stress estimation during driving ICPR2026

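【Quick read】: This paper addresses the difficulty of reliably recognizing stress from facial videos, where the core challenges are the subjective nature of stress and voluntary control over facial expression. To move beyond methods that depend on Facial Action Units (FAUs), the authors analyze stress responses during distracted driving using disentangled 3D facial geometry (expression and pose coefficients) extracted with EMOCA. Two points stand out. First, paired hypothesis tests show that 41 of 56 coefficients change consistently and phase-specifically between baseline and stressor phases, in patterns comparable to physiological markers. Second, a Transformer-based temporal modeling framework is used to compare unimodal, early-fusion, and cross-modal attention strategies; cross-modal attention fusion of EMOCA features and physiological signals performs best (AUROC 92%, accuracy 86.7%), ahead of single modalities and simple fusion.

Link: https://arxiv.org/abs/2601.04376
Authors: Paraskevi Valergaki, Vassilis C. Nicodemou, Iason Oikonomidis, Antonis Argyros, Anastasios Roussos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission to ICPR 2026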

Abstract:Reliable stress recognition from facial videos is challenging due to stress’s subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.

[CV-74] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

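【Quick read】: This paper addresses the inference bottleneck that arises when unified autoregressive models handle long multimodal sequences such as video generation: the KV-cache (key-value cache) grows linearly with the number of generated tokens. The solution is PackCache, a training-free KV-cache management method that compacts the cache dynamically through three coordinated mechanisms: condition anchoring, which preserves semantic anchors; cross-frame decay modeling, which allocates cache budget according to temporal distance; and spatially preserving position embedding, which keeps the 3D structure coherent when cache entries are removed. The method speeds up long video generation, with up to 3.7x acceleration on the final four frames, the part most affected by KV-cache growth.

Link: https://arxiv.org/abs/2601.04359
Authors: Kunyang Li, Mubarak Shah, Yuzhang Shang
Affiliations: Institute of Artificial Intelligence, University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: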

Abstract:A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, on the final four frames of 48-frame videos, the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip, PackCache delivers a 2.6x and 3.7x acceleration on A40 and H200, respectively.
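
To make the idea of allocating cache budget by temporal distance concrete, here is a minimal sketch. It assumes an exponential decay with a hypothetical `decay` parameter and a simple proportional split; the abstract does not specify the decay form or the budgeting rule, so the function `allocate_frame_budgets` should be read as an illustration of the general idea only, not as PackCache itself.

```python
def allocate_frame_budgets(num_past_frames, total_budget, decay=0.5):
    """Split a token budget across past frames, favoring recent ones.

    Index 0 of the result is the most recent past frame (distance 1).
    The exponential weighting is an assumption made for this sketch.
    """
    if num_past_frames <= 0:
        return []
    weights = [decay ** distance for distance in range(1, num_past_frames + 1)]
    norm = sum(weights)
    # Floor each share so the total never exceeds the budget.
    return [int(total_budget * w / norm) for w in weights]


def plan_cache(num_condition_tokens, num_past_frames, frame_budget, decay=0.5):
    """Keep every conditioning token and a decaying share of each past frame."""
    return {
        "condition_tokens_kept": num_condition_tokens,  # anchors are never evicted
        "frame_tokens_kept": allocate_frame_budgets(
            num_past_frames, frame_budget, decay
        ),
    }


if __name__ == "__main__":
    print(plan_cache(num_condition_tokens=77, num_past_frames=3, frame_budget=700))
```

Under these assumptions, distant frames keep fewer cached tokens while text and conditioning-image tokens are always retained, matching the two attention patterns the abstract describes.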

[CV-75] UNIC: Learning Unified Multimodal Extrinsic Contact Estimation

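【Quick read】: This paper addresses the reliable estimation of extrinsic contacts in contact-rich manipulation: how to perceive the interaction between a grasped object and its environment without restrictive assumptions such as predefined contact types, fixed grasp configurations, or camera calibration. The key to the solution is the UNIC framework, with two core innovations: (1) a unified contact representation based on scene affordance maps that captures diverse contact formations, and (2) a multimodal fusion mechanism that needs no prior knowledge and uses random masking to learn from visual, proprioceptive, and tactile modalities in an end-to-end, data-driven way. Together these improve robustness and generalization to unseen objects, missing modalities, and dynamic camera viewpoints.

Link: https://arxiv.org/abs/2601.04356
Authors: Zhengtong Xu, Yuki Shirai
Affiliations: Purdue University; Mitsubishi Electric Research Laboratories
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: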

Abstract:Contact-rich manipulation requires reliable estimation of extrinsic contacts, the interactions between a grasped object and its environment, which provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.

[CV-76] Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets

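【Quick read】: This paper addresses how to choose a deep learning model for efficient, accurate image classification when data is limited. It compares custom-built Convolutional Neural Networks (CNNs) with mainstream pre-trained models (ResNet-18 and VGG-16) under two strategies, feature extraction and transfer learning. The key finding is that transfer learning with fine-tuning clearly improves accuracy on several diverse image classification tasks from Bangladesh, with gains of 3% to 76% over custom CNNs and feature extraction, especially when data is scarce. The paper also notes that custom CNNs remain attractive for parameter efficiency and training speed when resources are constrained or the task is simple.

Link: https://arxiv.org/abs/2601.04352
Authors: Ibrahim Tanvir (University of Dhaka), Alif Ruslan (University of Dhaka), Sartaj Solaiman (University of Dhaka)
Affiliations: University of Dhaka
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: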

Abstract:This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.

[CV-77] SCAR-GS: Spatial Context Attention for Residuals in Progressive Gaussian Splatting

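【Quick read】: This paper addresses the high storage cost of 3D Gaussian Splatting models for large and medium-sized scenes, which hinders deployment on cloud and streaming services. Existing progressive compression methods based on progressive masking and scalar quantization are effective but do not fully capture correlations among high-dimensional feature vectors, which limits rate-distortion performance. The key to the solution is a new progressive codec based on Residual Vector Quantization (RVQ), together with an auto-regressive entropy model guided by a multi-resolution hash grid that accurately predicts the conditional probability of each successive transmitted index, so that coarse and refinement layers are both compressed efficiently.

Link: https://arxiv.org/abs/2601.04348
Authors: Diego Revilla, Pooja Suresh, Anand Bhojan, Ooi Wei Tsang
Affiliations: National University of Singapore; University of Deusto
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: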

Abstract:Recent advances in 3D Gaussian Splatting have allowed for real-time, high-fidelity novel view synthesis. Nonetheless, these models have significant storage requirements for large and medium-sized scenes, hindering their deployment over cloud and streaming services. Some of the most recent progressive compression techniques for these models rely on progressive masking and scalar quantization techniques to reduce the bitrate of Gaussian attributes using spatial context models. While effective, scalar quantization may not optimally capture the correlations of high-dimensional feature vectors, which can potentially limit the rate-distortion performance. In this work, we introduce a novel progressive codec for 3D Gaussian Splatting that replaces traditional methods with a more powerful Residual Vector Quantization approach to compress the primitive features. Our key contribution is an auto-regressive entropy model, guided by a multi-resolution hash grid, that accurately predicts the conditional probability of each successive transmitted index, allowing for coarse and refinement layers to be compressed with high efficiency.
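
Residual Vector Quantization, named in the abstract above, is naturally progressive: each stage quantizes what the previous stages left over, so decoding a prefix of the indices gives a coarse reconstruction and each further index refines it. The sketch below illustrates that generic mechanism with hand-written toy codebooks. The function names and codebook values are assumptions for illustration; the paper's codec, its learned codebooks, and its entropy model are not reproduced here.

```python
def nearest_index(codebook, vector):
    """Index of the codeword with the smallest squared distance to vector."""
    def squared_distance(codeword):
        return sum((c - x) ** 2 for c, x in zip(codeword, vector))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage, passing the residual to the next codebook."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest_index(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks, num_stages=None):
    """Sum the selected codewords; fewer stages give a coarser result."""
    stages = len(indices) if num_stages is None else num_stages
    output = [0.0] * len(codebooks[0][0])
    for codebook, i in list(zip(codebooks, indices))[:stages]:
        output = [o + c for o, c in zip(output, codebook[i])]
    return output


if __name__ == "__main__":
    toy_codebooks = [
        [[0.0, 0.0], [4.0, 4.0]],    # coarse stage
        [[0.0, 0.0], [1.0, -1.0]],   # refinement stage
    ]
    codes = rvq_encode([5.0, 3.0], toy_codebooks)
    print(codes)
    print(rvq_decode(codes, toy_codebooks, num_stages=1))
    print(rvq_decode(codes, toy_codebooks))
```

In a streaming setting, the first index per primitive could be sent first and later indices transmitted as refinement layers, which is the property a progressive codec relies on.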

[CV-78] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

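【Quick read】: This paper addresses the quadratic attention complexity of Transformer-based video diffusion models, which limits scalability to long video sequences. The key to the solution is Recurrent Hybrid Attention (ReHyAt), which combines the fidelity of softmax attention with the efficiency of linear attention and supports a chunk-wise recurrent reformulation with constant memory usage. This lowers attention cost from quadratic to linear while preserving video generation quality. A lightweight distillation and fine-tuning pipeline cuts the training cost to about 160 GPU hours and allows existing softmax-based models to be transferred and deployed in practice.

Link: https://arxiv.org/abs/2601.04342
Authors: Mohsen Ghafoorian, Amirhossein Habibian
Affiliations: Qualcomm AI Research; Qualcomm Technologies, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: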

Abstract:Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt’s hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while being competitive in the quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at this https URL.

[CV-79] Unified Text-Image Generation with Weakness-Targeted Post-Training

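【Quick read】: This paper addresses the weak cross-modal coupling in current text-to-image (T2I) generation systems that rely on explicit modality switching: they first generate reasoning text and then switch manually to image generation. This staged inference limits cooperation between modalities and prevents truly end-to-end multimodal generation. The key to the solution is post-training toward a fully unified text-image generation architecture in which the model moves on its own from textual reasoning to visual synthesis within a single inference process. Offline, reward-weighted training on fully self-generated synthetic data improves multimodal image generation on several T2I benchmarks.

Link: https://arxiv.org/abs/2601.04339
Authors: Jiahui Chen, Philippe Hansen-Estruch, Xiaochuang Han, Yushi Hu, Emily Dinan, Amita Kamath, Michal Drozdzal, Reyhane Askari-Hemmat, Luke Zettlemoyer, Marjan Ghazvininejad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: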

Abstract:Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.

[CV-80] Embedding Textual Information in Images Using Quinary Pixel Combinations

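【Quick read】: This paper addresses the low efficiency, visible distortion, and high computational cost of existing image steganography techniques for embedding text. Mainstream methods such as least or most significant bit (LSB/MSB) modification, Pixel Value Differencing (PVD), and transform-domain methods often need several pixels to work together or introduce noise, which limits capacity and makes detection easier; methods based on deep learning and generative AI improve concealment but add considerable computational overhead. The key to the solution is to use five controlled intensity variations in each of the R, G, and B channels (125 combinations in total) and map the three-channel configuration of a single pixel to one text symbol (letters, digits, whitespace, and common special characters). One pixel thus carries one complete character, avoiding multi-pixel dependence and heavy computation, and several image quality metrics indicate no significant visual distortion after embedding.

Link: https://arxiv.org/abs/2601.04302
Authors: A V Uday Kiran Kandala
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: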

Abstract:This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB/MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform-domain methods, quantization methods, edge- and region-based methods, and more recently deep learning and generative AI techniques for hiding textual information in the spatial domain of images. Most of them depend on pixel intensity flipping over multiple pixels, as in LSB and combinations of LSB-based methodologies, or on transform coefficients, often introducing noise. Encoding and decoding are deterministic in most of the existing approaches and are computationally heavy for larger models such as deep learning and generative AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled pixel intensity variations in each of the R, G, and B channels yield up to one hundred and twenty-five distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Different metrics, such as MSE, MAE, SNR, PSNR, SSIM, histogram comparison, and heatmap analysis, were evaluated for both original and encoded images, showing no significant distortion in the images. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB- and MSB-based approaches that typically require multiple pixels or multi-step processes, as well as transform- and learning-based methods that incur higher computational overhead.
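
The counting argument above (five levels in each of three channels gives 5 × 5 × 5 = 125 codes, enough for one symbol per pixel) can be sketched as follows. The abstract does not say how the five levels are realized in a pixel, so this sketch assumes one simple option: snap each channel to a multiple of 5 and store a base-5 digit in the remainder. The alphabet ordering, the function names, and that embedding rule are all hypothetical.

```python
import string

# Hypothetical symbol table: 95 printable symbols, which fits in 5**3 = 125 codes.
ALPHABET = string.ascii_letters + string.digits + " " + string.punctuation


def encode_symbol(pixel, ch):
    """Embed one symbol in one RGB pixel as three base-5 digits."""
    idx = ALPHABET.index(ch)  # raises ValueError for unsupported symbols
    digits = (idx // 25, (idx // 5) % 5, idx % 5)
    # Assumed embedding: snap each channel to a multiple of 5, then add the digit.
    # The cap at 250 keeps the result inside 0..255.
    return tuple(min(v - v % 5, 250) + d for v, d in zip(pixel, digits))


def decode_symbol(pixel):
    """Recover the symbol from the channel values modulo 5."""
    d0, d1, d2 = (v % 5 for v in pixel)
    return ALPHABET[d0 * 25 + d1 * 5 + d2]


def embed_text(pixels, text):
    """Write one symbol per pixel; pixels beyond the text are left unchanged."""
    if len(text) > len(pixels):
        raise ValueError("not enough pixels for the text")
    encoded = [encode_symbol(p, ch) for p, ch in zip(pixels, text)]
    return encoded + list(pixels[len(text):])


def extract_text(pixels, length):
    """Read back the first `length` symbols."""
    return "".join(decode_symbol(p) for p in pixels[:length])


if __name__ == "__main__":
    cover = [(255, 128, 0), (12, 200, 77), (3, 3, 3), (254, 251, 250), (90, 91, 92)]
    stego = embed_text(cover, "Hi 5!")
    print(stego)
    print(extract_text(stego, 5))
```

With this assumed rule each channel moves by at most 5 intensity levels, which is consistent in spirit with the low-distortion claim, though the paper's actual intensity variations may differ.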

[CV-81] Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes

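【Quick read】: This paper addresses a limitation of post-training alignment for diffusion models: reliance on simplified signals such as scalar rewards or binary preferences makes it hard to align with complex human expertise, which is hierarchical and fine-grained. The key to the solution is a set of hierarchical, fine-grained evaluation criteria defined with domain experts, combined with a two-stage alignment framework. First, domain knowledge is injected into an auxiliary diffusion model through Supervised Fine-Tuning. Second, Complex Preference Optimization (CPO) extends Direct Preference Optimization (DPO) to non-binary, hierarchical criteria, aligning the target diffusion model by simultaneously maximizing the probability of positive attributes and minimizing the probability of negative attributes.

Link: https://arxiv.org/abs/2601.04300
Authors: Chenye Meng, Zejian Li, Zhongni Liu, Yize Li, Changle Xie, Kaixin Jia, Ling Yang, Huanghuang Deng, Shiying Ding, Shengyuan Zhang, Jiayi Li, Lingyun Sun
Affiliations: Zhejiang University; University of Electronic Science and Technology of China; Peking University; University of Nottingham Ningbo China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: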

Abstract:Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a set of hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge into an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO) that extends DPO to align the target diffusion model to our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion model. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of paintings with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.

[CV-82] ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues

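【Quick read】: This paper addresses the difficulty of objectively assessing human affective and psychological states through non-verbal channels, focusing on digital drawing as an underexplored modality for affective sensing. The key to the solution is ArtCognition, a new multimodal analysis framework that fuses static visual features from the finished drawing (extracted by computer vision models) with dynamic behavioral kinematic cues from the drawing process itself (such as stroke speed, pauses, and smoothness). A Retrieval-Augmented Generation (RAG) architecture links low-level features to high-level psychological interpretation, improving explainability, reducing model hallucination, and enabling a more nuanced assessment of psychological state.

Link: https://arxiv.org/abs/2601.04297
Authors: Behrad Binaei-Haghighi, Nafiseh Sadat Sajadi, Mehrad Liviyan, Reyhane Akhavan Kharazi, Fatemeh Amirkhani, Behnam Bahrak
Affiliations: University of Tehran; Tehran Institute for Advanced Studies; Khatam University; Allameh Tabataba’i University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 12 pages, 7 figures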

Abstract:The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework’s potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.

[CV-83] Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset

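Quick read: This paper addresses the coarse semantic control and poor long-sequence coherence of current dance motion generation methods. The key to its solution is LRCM, a multimodal-guided diffusion framework. It uses a feature decoupling strategy to separate motion capture data, audio rhythm, and text descriptions; fuses the modalities with an audio-latent Conformer and a text-latent Cross-Conformer; and adds a Motion Temporal Mamba Module (MTMM) for smooth, long-duration autoregressive generation, improving the semantic accuracy and temporal coherence of the generated motion.

Link: https://arxiv.org/abs/2601.03323
Authors: Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu, Qiong Wu
Affiliations: Communication University of China; Zhipu AI
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 12 pages, 13 figures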

Abstract:Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.

[CV-84] Quantitative mapping from conventional MRI using self-supervised physics-guided deep learning: applications to a large-scale clinically heterogeneous dataset

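Quick read: This paper addresses two problems: conventional magnetic resonance imaging (MRI) yields qualitative information that depends on scanner hardware and acquisition parameters, and quantitative MRI (qMRI) is hard to deploy at scale because it needs specialized acquisition protocols and reconstruction algorithms. The key to its solution is a self-supervised, physics-guided deep learning framework that embeds Bloch-based signal models directly in the training objective, so that quantitative T1, T2, and proton-density (PD) maps can be inferred directly from routine clinical T1-weighted, T2-weighted, and FLAIR images. The method is validated on a large, heterogeneous dataset of 4,121 scan sessions acquired over six years on multiple 3 T MRI systems. It is robust to scanner and protocol variation and shows high voxel-wise reproducibility, offering a feasible route to large-scale quantitative biomarker research.

Link: https://arxiv.org/abs/2601.05063
Authors: Jelmer van Lune, Stefano Mandija, Oscar van der Heide, Matteo Maspero, Martin B. Schilder, Jan Willem Dankbaar, Cornelis A.T. van den Berg, Alessandro Sbrizzi
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages, 13 figures, full paper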

Abstract:Magnetic resonance imaging (MRI) is a cornerstone of clinical neuroimaging, yet conventional MRIs provide qualitative information heavily dependent on scanner hardware and acquisition settings. While quantitative MRI (qMRI) offers intrinsic tissue parameters, the requirement for specialized acquisition protocols and reconstruction algorithms restricts its availability and impedes large-scale biomarker research. This study presents a self-supervised physics-guided deep learning framework to infer quantitative T1, T2, and proton-density (PD) maps directly from widely available clinical conventional T1-weighted, T2-weighted, and FLAIR MRIs. The framework was trained and evaluated on a large-scale, clinically heterogeneous dataset comprising 4,121 scan sessions acquired at our institution over six years on four different 3 T MRI scanner systems, capturing real-world clinical variability. The framework integrates Bloch-based signal models directly into the training objective. Across more than 600 test sessions, the generated maps exhibited white matter and gray matter values consistent with literature ranges. Additionally, the generated maps showed invariance to scanner hardware and acquisition protocol groups, with inter-group coefficients of variation ≤ 1.1%. Subject-specific analyses demonstrated excellent voxel-wise reproducibility across scanner systems and sequence parameters, with Pearson r and concordance correlation coefficients exceeding 0.82 for T1 and T2. Mean relative voxel-wise differences were low across all quantitative parameters, especially for T2 ( 6%). These results indicate that the proposed framework can robustly transform diverse clinical conventional MRI data into quantitative maps, potentially paving the way for large-scale quantitative biomarker research.
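
The idea of putting a signal model inside the training objective can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses the textbook spin-echo approximation S = PD · (1 − exp(−TR/T1)) · exp(−TE/T2), whereas the paper's actual Bloch-based models for T1-weighted, T2-weighted, and FLAIR contrasts are not given here. The function names `spin_echo_signal` and `physics_loss` are hypothetical.

```python
import numpy as np


def spin_echo_signal(pd, t1, t2, tr, te):
    """Textbook spin-echo signal; an assumed stand-in for the paper's signal models.

    pd, t1, t2 are per-voxel maps (t1, t2 in ms); tr, te are sequence timings in ms.
    """
    return pd * (1.0 - np.exp(-tr / t1)) * np.exp(-te / t2)


def physics_loss(pd, t1, t2, observed, tr, te):
    """Self-supervised loss: re-synthesize the weighted image from predicted maps
    and compare it with the acquired image, so no ground-truth maps are needed."""
    synthetic = spin_echo_signal(pd, t1, t2, tr, te)
    return float(np.mean((synthetic - observed) ** 2))
```

In a training loop, `pd`, `t1`, and `t2` would be network outputs and the loss would be minimized with an automatic-differentiation library; NumPy is used here only to keep the sketch short.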

[CV-85] Scalable neural pushbroom architectures for real-time denoising of hyperspectral images onboard satellites

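Quick read: This paper addresses how to run hyperspectral image denoising onboard resource-constrained satellites with efficient, low-latency, fault-tolerant real-time inference. The core challenge is balancing three competing objectives: high-quality inference at low computational complexity, dynamic power scalability, and tolerance to radiation-induced faults. The key to its solution is a neural architecture built on a mixture of denoisers. First, each denoiser processes the image causally, line by line, mirroring the acquisition process of a pushbroom hyperspectral imager and sharply reducing memory use. Second, the mixture structure lets the operating mode be adjusted dynamically to the power budget and also adds robustness to radiation-induced faults. The result is real-time processing on low-power hardware with denoising quality comparable to that of more complex models.

Link: https://arxiv.org/abs/2601.05020
Authors: Ziyao Yi, Davide Piccinini, Diego Valsesia, Tiziano Bianchi, Enrico Magli
Affiliations: Politecnico di Torino – Department of Electronics and Telecommunications
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: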

Abstract:The next generation of Earth observation satellites will seek to deploy intelligent models directly onboard the payload in order to minimize the latency incurred by the transmission and processing chain of the ground segment, for time-critical applications. Designing neural architectures for onboard execution, particularly for satellite-based hyperspectral imagers, poses novel challenges due to the unique constraints of this environment and imaging system that are largely unexplored by the traditional computer vision literature. In this paper, we show that this setting requires addressing three competing objectives, namely high-quality inference with low complexity, dynamic power scalability and fault tolerance. We focus on the problem of hyperspectral image denoising, which is a critical task to enable effective downstream inference, and highlights the constraints of the onboard processing scenario. We propose a neural network design that addresses the three aforementioned objectives with several novel contributions. In particular, we propose a mixture of denoisers that can be resilient to radiation-induced faults as well as allowing for time-varying power scaling. Moreover, each denoiser employs an innovative architecture where an image is processed line-by-line in a causal way, with a memory of past lines, in order to match the acquisition process of pushbroom hyperspectral sensors and greatly limit memory requirements. We show that the proposed architecture can run in real-time, i.e., process one line in the time it takes to acquire the next one, on low-power hardware and provide competitive denoising quality with respect to significantly more complex state-of-the-art models. We also show that the power scalability and fault tolerance objectives provide a design space with multiple tradeoffs between those properties and denoising quality.
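
The causal, line-by-line processing pattern described above can be sketched in Python. The sketch keeps only a bounded memory of past lines, so memory use does not grow with image height. The class name `CausalLineDenoiser` and the running mean used as the "denoiser" are placeholder assumptions; the paper uses a learned network whose details are not given here.

```python
from collections import deque

import numpy as np


class CausalLineDenoiser:
    """Processes one sensor line at a time using only the current and past lines."""

    def __init__(self, memory_lines=4):
        # Bounded buffer: the oldest line is dropped automatically when full.
        self.memory = deque(maxlen=memory_lines)

    def process_line(self, line):
        """line: array of shape (pixels, bands) as acquired by a pushbroom sensor."""
        self.memory.append(np.asarray(line, dtype=float))
        # Placeholder for the learned model: average over the buffered lines.
        return np.mean(np.stack(list(self.memory)), axis=0)
```

Real-time operation in the sense of the abstract would mean that each `process_line` call finishes before the sensor delivers the next line.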

[CV-86] Illumination Angular Spectrum Encoding for Controlling the Functionality of Diffractive Networks

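Quick read: This paper addresses the single-task limitation of diffractive neural networks: existing architectures are usually trained for one function and cannot serve optical computing systems that need several. The key to its solution is a new control mechanism based on the angular spectrum of the incident illumination. An amplitude mask in the illumination path selectively shapes the angular spectrum, and switching masks switches the task performed by the same network, so the mask acts as a unique task encoder. A single diffractive network can then perform multiple image-to-image translation tasks under different angular-spectrum conditions (for example, handwritten digits to typeset digits, or handwritten letters to typeset numbers or Greek letters). The method works under different coherence conditions and is compatible with other control strategies such as wavelength, which improves the flexibility and scalability of multi-task all-optical computing systems.

Link: https://arxiv.org/abs/2601.04825
Authors: Matan Kleiner, Lior Michaeli, Tomer Michaeli
Affiliations: Technion; Tel Aviv University
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project’s code this https URL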

Abstract:Diffractive neural networks have recently emerged as a promising framework for all-optical computing. However, these networks are typically trained for a single task, limiting their potential adoption in systems requiring multiple functionalities. Existing approaches to achieving multi-task functionality either modify the mechanical configuration of the network per task or use a different illumination wavelength or polarization state for each task. In this work, we propose a new control mechanism, which is based on the illumination’s angular spectrum. Specifically, we shape the illumination using an amplitude mask that selectively controls its angular spectrum. We employ different illumination masks for achieving different network functionalities, so that the mask serves as a unique task encoder. Interestingly, we show that effective control can be achieved over a very narrow angular range, within the paraxial regime. We numerically illustrate the proposed approach by training a single diffractive network to perform multiple image-to-image translation tasks. In particular, we demonstrate translating handwritten digits into typeset digits of different values, and translating handwritten English letters into typeset numbers and typeset Greek letters, where the type of the output is determined by the illumination’s angular components. As we show, the proposed framework can work under different coherence conditions, and can be combined with existing control strategies, such as different wavelengths. Our results establish the illumination angular spectrum as a powerful degree of freedom for controlling diffractive networks, enabling a scalable and versatile framework for multi-task all-optical computing.

[CV-87] End-to-end differentiable design of geometric waveguide displays

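Quick read: This paper addresses the limited performance of geometric waveguides in optical see-through augmented reality displays, where the core bottleneck is the difficulty of jointly optimizing non-sequential light transport and polarization-dependent multilayer thin-film coatings. The key to its solution is the first end-to-end differentiable optimization framework for this setting. It couples non-sequential Monte Carlo polarization ray tracing with a differentiable transfer-matrix thin-film solver, so gradients can be backpropagated from eyebox metrics to design parameters. With memory-saving strategies, it optimizes more than a thousand layer-thickness parameters and billions of non-sequential ray-surface intersections on a single multi-GPU workstation. Automated layer pruning starts from over-parameterized stacks and drives redundant layers to zero thickness under discrete manufacturability constraints, which amounts to topology optimization of the coating structure. Experiments show that the method raises light efficiency from 4.1% to 33.5% and improves eyebox and field-of-view uniformity by about 17× and 11×, respectively.

Link: https://arxiv.org/abs/2601.04370
Authors: Xinge Yang, Zhaocheng Liu, Zhaoyu Nie, Qingyuan Fan, Zhimin Shi, Jim Bonar, Wolfgang Heidrich
Affiliations: KAUST; Meta
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: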

Abstract:Geometric waveguides are a promising architecture for optical see-through augmented reality displays, but their performance is severely bottlenecked by the difficulty of jointly optimizing non-sequential light transport and polarization-dependent multilayer thin-film coatings. Here we present the first end-to-end differentiable optimization framework for geometric waveguides that couples non-sequential Monte Carlo polarization ray tracing with a differentiable transfer-matrix thin-film solver. A differentiable Monte Carlo ray tracer avoids the exponential growth of deterministic ray splitting while enabling gradient backpropagation from eyebox metrics to design parameters. With memory-saving strategies, we optimize more than one thousand layer-thickness parameters and billions of non-sequential ray-surface intersections on a single multi-GPU workstation. Automated layer pruning is achieved by starting from over-parameterized stacks and driving redundant layers to zero thickness under discrete manufacturability constraints, effectively performing topology optimization to discover optimal coating structures. On a representative design, starting from random initialization within thickness bounds, our method increases light efficiency from 4.1% to 33.5% and improves eyebox and FoV uniformity by ∼17× and ∼11×, respectively. Furthermore, we jointly optimize the waveguide and an image preprocessing network to improve perceived image quality. Our framework not only enables system-level, high-dimensional coating optimization inside the waveguide, but also expands the scope of differentiable optics for next-generation optical design.
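
The transfer-matrix part of the method can be illustrated with a minimal Python sketch of the standard characteristic-matrix calculation for a thin-film stack. The sketch is deliberately simplified and is not the paper's solver: it assumes normal incidence and lossless layers, ignores polarization, and uses NumPy rather than a differentiable framework. The function name `reflectance` is hypothetical.

```python
import numpy as np


def reflectance(n_layers, d_layers, n_in, n_sub, wavelength):
    """Reflectance of a multilayer stack at normal incidence.

    n_layers, d_layers: refractive index and thickness of each layer, listed
    from the incidence side; thickness and wavelength share the same unit.
    """
    m = np.eye(2, dtype=complex)
    for n, d in zip(n_layers, d_layers):
        delta = 2.0 * np.pi * n * d / wavelength  # phase thickness of the layer
        layer = np.array([
            [np.cos(delta), 1j * np.sin(delta) / n],
            [1j * n * np.sin(delta), np.cos(delta)],
        ])
        m = m @ layer  # a zero-thickness layer contributes the identity matrix
    b, c = m @ np.array([1.0, n_sub], dtype=complex)
    r = (n_in * b - c) / (n_in * b + c)  # amplitude reflection coefficient
    return float(abs(r) ** 2)
```

Because a layer of zero thickness reduces to the identity matrix, driving a thickness to zero removes that layer from the stack, which is consistent with the pruning idea in the abstract. As a sanity check, a quarter-wave layer with index √1.5 on a substrate of index 1.5 in air gives reflectance close to zero at the design wavelength.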

Artificial Intelligence

[AI-0] Robust Reasoning as a Symmetry-Protected Topological Phase

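Quick read: This paper addresses “hallucinations” in large language models (LLMs), that is, logical inconsistencies induced by semantic noise. Its central claim is that current architectures operate in a “Metric Phase” that is vulnerable to spontaneous symmetry breaking, which makes causal order unstable. The key to its solution is a Symmetry-Protected Topological phase in which logical operations are formally isomorphic to non-Abelian anyon braiding, so that robust topological invariants replace fragile geometric interpolation. Experiments show that the proposed Holonomic Network exhibits a macroscopic “mass gap” and, on a variable-binding task, extrapolates with perfect fidelity (from L=50 to L=5000). The protection depends strictly on non-Abelian gauge symmetry, which the author presents as evidence for a new universality class of logical reasoning that ties causal stability to the topology of the semantic manifold.

Link: https://arxiv.org/abs/2601.05240
Authors: Ilmo Sung
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th)
Comments: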

Abstract:Large language models suffer from “hallucinations”: logical inconsistencies induced by semantic noise. We propose that current architectures operate in a “Metric Phase,” where causal order is vulnerable to spontaneous symmetry breaking. Here, we identify robust inference as an effective Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding, replacing fragile geometric interpolation with robust topological invariants. Empirically, we demonstrate a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, our Holonomic Network reveals a macroscopic “mass gap,” maintaining invariant fidelity below a critical noise threshold. Furthermore, in a variable-binding task on S_10 (3.6 × 10^6 states) representing symbolic manipulation, we demonstrate holonomic generalization: the topological model maintains perfect fidelity extrapolating 100× beyond training (L = 50 → 5000), consistent with a theoretically indefinite causal horizon, whereas Transformers lose logical coherence. Ablation studies indicate this protection emerges strictly from non-Abelian gauge symmetry. This provides strong evidence for a new universality class for logical reasoning, linking causal stability to the topology of the semantic manifold.
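
Two facts behind the S_10 task can be checked with a short Python sketch: the symmetric group on 10 elements has 10! ≈ 3.6 × 10^6 elements, and composing permutations is order-dependent (non-Abelian). The helper `compose` is an illustrative assumption and says nothing about how the Holonomic Network itself is implemented.

```python
import math


def compose(p, q):
    """Return the permutation 'apply q first, then p', with permutations as tuples."""
    return tuple(p[i] for i in q)


# Number of states in S_10.
s10_size = math.factorial(10)

# Two transpositions on three symbols that do not commute.
swap_01 = (1, 0, 2)
swap_12 = (0, 2, 1)
order_matters = compose(swap_01, swap_12) != compose(swap_12, swap_01)
```

Because the result depends on the order of composition, a model that tracks a long sequence of such operations has to preserve ordering exactly, which is what the variable-binding task probes.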

[AI-1] MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

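Quick read: This paper addresses the lack of reliable memory mechanisms and mixed-initiative collaboration in large language model (LLM) embodied agents for open worlds. Existing evaluations rely mostly on synthetic prompts and do not reflect the memory dependence and interaction dynamics of complex tasks. The authors propose the MineNPC-Task benchmark framework. Its key idea is to collect real tasks through co-play with expert players, structure them as parametric templates with explicit preconditions and dependencies, and pair them with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. This enables fine-grained tracking and quantitative evaluation of plan/act/memory events (such as plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts), and supports transparent, reproducible evaluation of memory retention and collaborative error recovery in long-running tasks.

Link: https://arxiv.org/abs/2601.05215
Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: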

Abstract:We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events, including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts, and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.

[AI-2] Internal Representations as Indicators of Hallucinations in Agent Tool Selection

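Quick read: This paper addresses hallucinations in large language model (LLM) tool calling, including choosing the wrong tool, supplying malformed parameters, and “tool bypass” behavior (the model generates a result directly instead of invoking an external tool), all of which undermine the reliability and security of LLM-based agents in production. The key to its solution is a computationally efficient real-time detection framework that uses the LLM’s internal representations from a single forward pass to identify tool-calling hallucinations, with no extra inference passes or external validation. It detects parameter-level hallucinations and inappropriate tool selections with up to 86.4% accuracy at low computational overhead, which suits practical deployment.

Link: https://arxiv.org/abs/2601.05214
Authors: Kait Healy, Bharathi Srinivasan, Visakh Madathil, Jing Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: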

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in tool calling and tool usage, but suffer from hallucinations where they choose incorrect tools, provide malformed parameters and exhibit ‘tool bypass’ behavior by performing simulations and generating outputs instead of invoking specialized tools or external systems. This undermines the reliability of LLM based agents in production systems as it leads to inconsistent results, and bypasses security and audit controls. Such hallucinations in agent tool selection require early detection and error handling. Unlike existing hallucination detection methods that require multiple forward passes or external validation, we present a computationally efficient framework that detects tool-calling hallucinations in real-time by leveraging LLMs’ internal representations during the same forward pass used for generation. We evaluate this approach on reasoning tasks across multiple domains, demonstrating strong detection performance (up to 86.4% accuracy) while maintaining real-time inference capabilities with minimal computational overhead, particularly excelling at detecting parameter-level hallucinations and inappropriate tool selections, critical for reliable agent deployment.

[AI-3] Stock Market Price Prediction using Neural Prophet with Deep Neural Network

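Quick read: This paper addresses the weakness of existing time-series methods in forecasting the probability range of stock prices: traditional statistical models struggle to capture the uncertainty range of future prices. The key to its solution is a Neural Prophet model with a Deep Neural Network (NP-DNN). Z-score normalization removes scale differences so that patterns are easier to detect, and missing-value imputation improves data completeness. A Multi-Layer Perceptron (MLP) learns nonlinear relationships among stock prices and extracts hidden feature representations to improve prediction accuracy. The reported accuracy is 99.21%.

Link: https://arxiv.org/abs/2601.05202
Authors: Navin Chhibber, Suneel Khemka, Navneet Kumar Tyagi, Rohit Tewari, Bireswar Banerjee, Piyush Ranjan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: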

Abstract:Stock market price prediction is a significant interdisciplinary research domain that sits at the intersection of finance, statistics, and economics. Accurately predicting stock prices has always been a focal point for various researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. Hence, to solve this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the model’s use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% compared with other approaches using the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.
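
The two preprocessing steps named in the abstract can be sketched in Python. The abstract does not say which imputation method is used, so forward fill (carrying the last known price forward) is an assumption here, as are the function names `impute_forward` and `zscore`.

```python
import numpy as np


def impute_forward(prices):
    """Fill missing values (NaN) with the most recent known price.

    Forward fill is an assumed choice; the paper does not specify its method.
    """
    filled = np.array(prices, dtype=float)
    for i in range(len(filled)):
        if np.isnan(filled[i]):
            if i == 0:
                raise ValueError("series starts with a missing value")
            filled[i] = filled[i - 1]
    return filled


def zscore(values):
    """Z-score normalization: subtract the mean and divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0.0:
        raise ValueError("constant series cannot be z-scored")
    return (values - values.mean()) / std
```

In a forecasting setup, the mean and standard deviation would normally be computed on the training window only and reused for later data, so that information from the future does not leak into the inputs.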

[AI-4] SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning

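Quick read: This paper addresses the under-use of large language models (LLMs) in graph-oriented engineering workflows, in particular low automation efficiency, poor interpretability, and slow training convergence in graphical modeling and simulation environments such as Simulink. Its solution is SimuAgent, an LLM-driven modeling and simulation agent designed for Simulink. The key innovations are: 1) replacing verbose XML with a concise dictionary-style Python representation, which cuts token consumption and improves readability; 2) a lightweight plan-execute architecture trained in two stages, combining low-level tool skills with high-level design reasoning; 3) Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization with self-reflection traces that provide rich intermediate feedback, easing the sparse-reward problem in long-horizon tasks, accelerating convergence, and improving robustness. Experiments show that a Qwen2.5-7B model fine-tuned with SimuAgent outperforms standard reinforcement learning baselines on the SimuBench benchmark and even surpasses few-shot GPT-4o. The whole system can be deployed on-premise, preserving privacy at low cost.

Link: https://arxiv.org/abs/2601.05187
Authors: Yanchang Liang, Xiaowei Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: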

Abstract:Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.
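
The sparse-reward problem that ReGRPO targets can be made concrete with a minimal Python sketch of the group-relative advantage used in standard GRPO, where each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch covers only this baseline normalization; the self-reflection traces added by ReGRPO are not modeled, and the function name `group_relative_advantages` and the `eps` constant are assumptions.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# If every rollout in the group fails, all advantages are zero: no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A single success stands out from the rest of the group.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When long-horizon tasks rarely succeed, most groups look like `all_fail`, which is one way to read the abstract's motivation for adding richer intermediate feedback.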

[AI-5] FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts KDD2026

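Quick read: This paper addresses the high computational complexity and memory consumption of long-horizon forecasting on large-scale spatial-temporal graphs (STGs). Most existing models are limited to short-horizon forecasting and scale poorly to week-ahead prediction (672 steps at 15-minute granularity) on networks with thousands of nodes. The core solution is FaST, a framework built on a heterogeneity-aware Mixture-of-Experts (MoE) structure. First, an adaptive graph agent attention mechanism relieves the computational burden that conventional graph convolution and self-attention modules incur on large graphs. Second, a new parallel MoE module replaces traditional feed-forward networks with Gated Linear Units (GLUs), giving an efficient, scalable parallel structure that keeps accuracy high while substantially improving computational efficiency.

Link: https://arxiv.org/abs/2601.05174
Authors: Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to KDD 2026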

Abstract:Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: this https URL.
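
A mixture of GLU experts can be sketched in Python as follows. This is a generic illustration, not the FaST module: it assumes a sigmoid-gated GLU, a softmax router, and dense mixing in which every expert is evaluated for every node. The names `glu_expert`, `moe_glu`, and `w_router` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def glu_expert(x, w_value, w_gate):
    """Gated Linear Unit: a linear projection modulated elementwise by a sigmoid gate."""
    return (x @ w_value) * sigmoid(x @ w_gate)


def moe_glu(x, experts, w_router):
    """Mix GLU experts with softmax routing weights.

    x: (nodes, d_in); experts: list of (w_value, w_gate), each (d_in, d_out);
    w_router: (d_in, n_experts).
    """
    logits = x @ w_router
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Evaluate all experts, then stack to shape (nodes, n_experts, d_out).
    outputs = np.stack([glu_expert(x, wv, wg) for wv, wg in experts], axis=1)
    return np.einsum("ne,ned->nd", weights, outputs)
```

The one-week horizon quoted in the abstract follows from the sampling rate: 7 days × 24 hours × 4 steps per hour = 672 steps.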

[AI-6] Safe Continual Reinforcement Learning Methods for Nonstationary Environments. Towards a Survey of the State of the Art

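Quick read: This paper addresses the key challenges of continual safe online reinforcement learning (COSRL) in nonstationary environments, namely how to learn online safely, stably, and efficiently as the environment changes. The core questions are how to design safety mechanisms that adapt to nonstationarity, how to formalize and constrain safety boundaries during continual learning, and how to maintain policy safety and performance under distribution shift. The key to its contribution is a taxonomy based on the type of safe learning mechanism. The survey reviews ways to formulate safety constraints under nonstationarity (such as constrained Markov decision processes (Constrained MDPs), partially observable Markov decision processes (POMDPs), and their safe variants) and proposes a framework that combines safety constraints with continual adaptation, providing a theoretical basis and practical guidance for reliable, generalizable online safe learning algorithms.

Link: https://arxiv.org/abs/2601.05152
Authors: Timofey Tomashevskiy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures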

Abstract:This work provides a state-of-the-art survey of continual safe online reinforcement learning (COSRL) methods. We discuss theoretical aspects, challenges, and open questions in building continual online safe reinforcement learning algorithms. We provide the taxonomy and the details of continual online safe reinforcement learning methods based on the type of safe learning mechanism that takes adaptation to nonstationarity into account. We categorize safety constraints formulation for online reinforcement learning algorithms, and finally, we discuss prospects for creating reliable, safe online learning algorithms. Keywords: safe RL in nonstationary environments, safe continual reinforcement learning under nonstationarity, HM-MDP, NSMDP, POMDP, safe POMDP, constraints for continual learning, safe continual reinforcement learning review, safe continual reinforcement learning survey, safe continual reinforcement learning, safe online learning under distribution shift, safe continual online adaptation, safe reinforcement learning, safe exploration, safe adaptation, constrained Markov decision processes, partially observable Markov decision process, safe reinforcement learning and hidden Markov decision processes, safe online reinforcement learning, safe meta-learning, safe meta-reinforcement learning, safe context-based reinforcement learning, formulating safety constraints for continual learning. MSC classes: 68-02, 68U07. ACM classes: I.2.0; I.2.6; A.1.

[AI-7] Distilling the Thought Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Link: https://arxiv.org/abs/2601.05144
Authors: Shuliang Liu, Xingyu Li, Hongyi Liu, Yibo Yan, Bingchen Duan, Qi Zheng, Dong Fang, Lingfeng Su, Xuming Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

[AI-8] Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior

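Quick read: This paper addresses a central question in evaluating generative AI: can large language models used as judges (LLM-as-judge) provide repeatable, consistent, and comparable quality assessments? It finds that each LLM judge is highly consistent with itself across repeated evaluations (within-judge stability), yet agreement between judges is near zero (Krippendorff’s α = 0.042) and the disagreement goes well beyond random noise, so no shared quality standard exists. The key insight is the hidden structure the author calls “evaluative disposition”: each judge scores according to its own stable evaluative logic (harshness/leniency, dimension emphasis, handling of evidence, and so on), producing a distinctive pattern that a classifier can use to tell models within a family apart with up to 99.6% accuracy. The conclusion is that LLM judges are not interchangeable measurement instruments but distinct evaluation devices, each carrying its own implicit theory of quality, and that simply averaging their scores yields a synthetic verdict that corresponds to no judge’s actual values.

Link: https://arxiv.org/abs/2601.05114
Authors: Wajid Nasser
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures, code and artifacts at: this https URL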

Abstract:LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 judges × 120 unique video × pack items × 3 independent runs), inter-judge agreement is near-zero (Krippendorff’s α = 0.042). On two dimensions, judges disagree more than random noise would predict (α < 0). Yet this disagreement isn’t chaos; it’s structured. A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, rising to 89.9% with disposition features. Within model families, the signal is even stronger: GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy. We call this the reliability paradox: judges cannot agree on what constitutes quality, yet their disagreement patterns are so stable they function as fingerprints. Each judge implements a distinct, stable theory of quality: an “evaluative disposition” that shapes how it interprets any rubric. We characterize these dispositions along multiple axes: harshness/leniency, dimension emphasis, within-judge stability (ICC), and evidence behavior (receipt validity, semantic linkage via NLI, and shotgun index). The implication is stark: LLM judges are not interchangeable instruments measuring a shared construct. They are distinct measurement devices, each encoding its own implicit theory of quality. Averaging their scores produces a synthetic verdict that corresponds to no judge’s actual values.

[AI-9] GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

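Quick read: This paper addresses the high latency and compute cost that Large Reasoning Models (LRMs) incur by generating multi-step chains of thought. Existing collaborative inference methods decide when to use the large model from local token probabilities or post-hoc verification, and these routing strategies add significant overhead. The key to its solution is a new perspective inspired by the “Aha Moment” phenomenon: the entropy of the first token of each reasoning step is a strong predictor of that step’s difficulty, and computation can be allocated accordingly. The authors design GlimpRouter, a training-free framework in which a lightweight model generates only the first token of each reasoning step; if its entropy exceeds a preset threshold, the step is routed to the large model. Experiments show that this reduces inference latency substantially while preserving or even improving accuracy, for example a 10.7% accuracy improvement and a 25.9% latency reduction on AIME25.

Link: https://arxiv.org/abs/2601.05110
Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Code available at this https URL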

Abstract:Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the “Aha Moment” phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
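
The routing rule described in the abstract can be sketched in a few lines of Python. The sketch assumes the small model exposes the probability distribution over its first token for a step; the function names `token_entropy` and `route_step`, the use of natural-log entropy, and the threshold value are illustrative assumptions rather than the paper's settings.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_probs, threshold):
    """Send the step to the large model only when the small model's
    first-token distribution is uncertain (entropy above the threshold)."""
    if token_entropy(first_token_probs) > threshold:
        return "large"
    return "small"
```

A confident first token such as `[1.0, 0.0, 0.0]` has zero entropy and stays with the small model, while a uniform distribution over four tokens has entropy ln 4 ≈ 1.386 and would be routed to the large model at a threshold of 1.0.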

[AI-10] Controllable Memory Usage: Balancing Anchoring and Innovation in Long-Term Human-Agent Interaction

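[Quick read]: This paper addresses two problems caused by rigid memory-usage strategies in long-term interactions with Large Language Model (LLM) agents. Using all relevant memory tends to cause Memory Anchoring, where the agent over-relies on past information and loses flexibility; ignoring memory entirely discards important interaction history and hurts personalization and stylistic consistency. The key to the solution is a steerable memory framework, SteeM (Steerable Memory Agent). It introduces a quantifiable behavioral metric of how strongly the agent depends on memory and lets users dynamically adjust that dependence, from a fresh-start mode that promotes innovation to a high-fidelity mode that closely follows history, giving finer and more effective control over personalized human-agent collaboration.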

Link: https://arxiv.org/abs/2601.05107
Authors: Muzhao Tian,Zisu Huang,Xiaohua Wang,Jingwen Xu,Zhengkang Guo,Qi Qian,Yuanzhe Shen,Kaitao Song,Jiakang Yuan,Changze Lv,Xiaoqing Zheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As LLM-based agents are increasingly used in long-term interactions, cumulative memory is critical for enabling personalization and maintaining stylistic consistency. However, most existing systems adopt an “all-or-nothing” approach to memory usage: incorporating all relevant past information can lead to Memory Anchoring, where the agent is trapped by past interactions, while excluding memory entirely results in under-utilization and the loss of important interaction history. We show that an agent’s reliance on memory can be modeled as an explicit and user-controllable dimension. We first introduce a behavioral metric of memory dependence to quantify the influence of past interactions on current outputs. We then propose Steerable Memory Agent, SteeM, a framework that allows users to dynamically regulate memory reliance, ranging from a fresh-start mode that promotes innovation to a high-fidelity mode that closely follows interaction history. Experiments across different scenarios demonstrate that our approach consistently outperforms conventional prompting and rigid memory masking strategies, yielding a more nuanced and effective control for personalized human-agent collaboration.

[AI-11] Arabic Prompts with English Tools: A Benchmark

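[Quick read]: This paper addresses the lack of a standardized benchmark for evaluating the tool-calling and agentic capabilities of Large Language Models (LLMs) in Arabic. Most evaluation frameworks remain English-centric, so the real performance of models prompted in Arabic is hard to measure, particularly since these models are often pretrained on predominantly English data. The key contribution is the first dedicated Arabic benchmark, which systematically measures functional accuracy and robustness on tool-calling tasks given Arabic instructions. The results show that when users interact in Arabic, tool-calling accuracy drops by 5-10% on average, whether the tool descriptions are in Arabic or English. This highlights the importance of language adaptation and cross-lingual consistency and provides a quantitative basis for building more reliable and equitable Arabic agents.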

Link: https://arxiv.org/abs/2601.05101
Authors: Konstantin Kubrak,Ahmed El-Moselhy,Ammar Alsulami,Remaz Altuwaim,Hassan Ismail Fawaz,Faisal Alsaby
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures, LLMs, Big Data, and Multilinguality for All (LLMs4All) Workshop at IEEE BigData 2025 Conference, Macau, December 10, 2025

Abstract:Large Language Models (LLMs) are now integral to numerous industries, increasingly serving as the core reasoning engine for autonomous agents that perform complex tasks through tool-use. While the development of Arabic-native LLMs is accelerating, the benchmarks for evaluating their capabilities lag behind, with most existing frameworks focusing on English. A critical and overlooked area is tool-calling, where the performance of models prompted in non-English languages like Arabic is poorly understood, especially since these models are often pretrained on predominantly English data. This paper addresses this critical gap by introducing the first dedicated benchmark for evaluating the tool-calling and agentic capabilities of LLMs in the Arabic language. Our work provides a standardized framework to measure the functional accuracy and robustness of models in Arabic agentic workflows. Our findings reveal a huge performance gap: when users interact in Arabic, tool-calling accuracy drops by an average of 5-10%, regardless of whether the tool descriptions themselves are in Arabic or English. By shedding light on these critical challenges, this benchmark aims to foster the development of more reliable and linguistically equitable AI agents for Arabic-speaking users.

[AI-12] Driver-Intention Prediction with Deep Learning: Real-Time Brain-to-Vehicle Communication

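[Quick read]: This paper addresses how a Brain-Computer Interface (BCI) can recognize a driver's steering intention quickly and without physical movement, so that Advanced Driving Assistance Systems (ADAS) can respond faster. The key to the solution is a Convolutional Neural Network (CNN) that processes raw electroencephalography (EEG) signals with minimal pre-processing and classifies them into three driving intentions: left turn, right turn, and straight driving. The model reaches 83.7% accuracy, which supports the ability of deep learning models to decode intention patterns from EEG signals.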

Link: https://arxiv.org/abs/2601.05084
Authors: Niloufar Alavi,Swati Shah,Rezvan Alamian,Stefan Goetz
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: 6 pages, 7 figures

Abstract:Brain-computer interfaces (BCIs) allow direct communication between the brain and electronics without the need for speech or physical movement. Such interfaces can be particularly beneficial in applications requiring rapid response times, such as driving, where a vehicle’s advanced driving assistance systems could benefit from immediate understanding of a driver’s intentions. This study presents a novel method for predicting a driver’s intention to steer using electroencephalography (EEG) signals through deep learning. A driving simulator created a controlled environment in which participants imagined controlling a vehicle during various driving scenarios, including left and right turns, as well as straight driving. A convolutional neural network (CNN) classified the detected EEG data with minimal pre-processing. Our model achieved an accuracy of 83.7% in distinguishing between the three steering intentions and demonstrated the ability of CNNs to process raw EEG data effectively. The classification accuracy was highest for right-turn segments, which suggests a potential spatial bias in brain activity. This study lays the foundation for more intuitive brain-to-vehicle communication systems.

[AI-13] Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models

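[Quick read]: This paper addresses the privacy risk that Large Reasoning Models (LRMs) leak personally identifiable information (PII) while generating explicit Chain-of-Thought (CoT) reasoning: intermediate reasoning steps can expose sensitive data even when the final answer is sanitized. The key to the solution is a privacy-first reasoning paradigm that suppresses PII leakage during reasoning through deployable interventions (prompting or fine-tuning) rather than post-hoc redaction. The authors introduce PII-CoT-Bench, a supervised dataset with privacy-aware CoT annotations, together with a category-balanced evaluation benchmark. They find that state-of-the-art models benefit most from prompt-based controls, whereas weaker models need fine-tuning to reduce leakage meaningfully. Both approaches substantially reduce PII exposure with minimal loss of task performance, supporting the feasibility of privacy-preserving reasoning.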

Link: https://arxiv.org/abs/2601.05076
Authors: Arghyadeep Das,Sai Sreenivas Chintha,Rishiraj Girmal,Kinjal Pandey,Sharvi Endait
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 1 table

Abstract:Large Reasoning Models (LRMs) improve performance, reliability, and interpretability by generating explicit chain-of-thought (CoT) reasoning, but this transparency introduces a serious privacy risk: intermediate reasoning often leaks personally identifiable information (PII) even when final answers are sanitized. We study how to induce privacy-first reasoning, where models reason without exposing sensitive information, using deployable interventions rather than post-hoc redaction. We introduce PII-CoT-Bench, a supervised dataset with privacy-aware CoT annotations, and a category-balanced evaluation benchmark covering realistic and adversarial leakage scenarios. Our results reveal a capability-dependent trend: state-of-the-art models benefit most from prompt-based controls, whereas weaker models require fine-tuning to achieve meaningful leakage reduction. Across models and categories, both approaches substantially reduce PII exposure with minimal degradation in utility, demonstrating that private reasoning can be achieved without sacrificing performance. Overall, we show that private CoT reasoning can be achieved with minimal utility loss, providing practical guidance for building privacy-preserving reasoning systems.

[AI-14] Large language models can effectively convince people to believe conspiracies

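[Quick read]: This paper asks whether the persuasive power of Large Language Models (LLMs) favors truth over falsehood, focusing on how they shape belief in conspiracy theories that users are uncertain about. Across three pre-registered experiments, the authors find that even standard GPT-4o, with OpenAI's guardrails in place, can substantially increase conspiracy belief, and that the "bunking" (pro-conspiracy) AI was rated more positively and increased trust in AI more than the "debunking" AI. Two mitigations stand out: a corrective conversation reversed the newly induced conspiracy beliefs, and simply prompting the model to use only accurate information dramatically reduced its ability to increase conspiracy belief. This suggests that targeted interventions and content constraints may help mitigate the risk of LLM-driven misbelief.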

Link: https://arxiv.org/abs/2601.05050
Authors: Thomas H. Costello,Kellin Pelrine,Matthew Kowal,Antonio A. Arechar,Jean-François Godbout,Adam Gleave,David Rand,Gordon Pennycook
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

Abstract:Large language models (LLMs) have been shown to be persuasive across a variety of contexts. But it remains unclear whether this persuasive power advantages truth over falsehood, or if LLMs can promote misbeliefs just as easily as refuting them. Here, we investigate this question across three pre-registered experiments in which participants (N = 2,724 Americans) discussed a conspiracy theory they were uncertain about with GPT-4o, and the model was instructed to either argue against (“debunking”) or for (“bunking”) that conspiracy. When using a “jailbroken” GPT-4o variant with guardrails removed, the AI was as effective at increasing conspiracy belief as decreasing it. Concerningly, the bunking AI was rated more positively, and increased trust in AI, more than the debunking AI. Surprisingly, we found that using standard GPT-4o produced very similar effects, such that the guardrails imposed by OpenAI did little to prevent the LLM from promoting conspiracy beliefs. Encouragingly, however, a corrective conversation reversed these newly induced conspiracy beliefs, and simply prompting GPT-4o to only use accurate information dramatically reduced its ability to increase conspiracy beliefs. Our findings demonstrate that LLMs possess potent abilities to promote both truth and falsehood, but that potential solutions may exist to help mitigate this risk.

[AI-15] How to Set the Learning Rate for Large-Scale Pre-training?

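[Quick read]: This paper addresses the configuration of the optimal learning rate (LR) in large-scale pre-training, and in particular whether the optimal LR can be accurately extrapolated from low-cost experiments. The authors organize the investigation into two research paradigms, the Fitting Paradigm and the Transfer Paradigm. Within the Fitting Paradigm, they introduce a Scaling Law for search factor that uses predictive modeling to reduce search complexity from O(n³) to O(n × C_D × C_η). Within the Transfer Paradigm, they extend μTransfer to the Mixture of Experts (MoE) architecture and broaden its applicability to model depth, weight decay, and token horizons. The empirical results challenge the scalability of the widely adopted μTransfer in large-scale pre-training, and an analysis from the perspectives of training stability and feature learning explains why module-wise parameter tuning underperforms at scale. The work offers systematic practical guidelines and a theoretical perspective for industrial-level pre-training.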

Link: https://arxiv.org/abs/2601.05049
Authors: Yunhua Zhou,Shuhao Xing,Junhao Huang,Xipeng Qiu,Qipeng Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Optimal configuration of the learning rate (LR) is a fundamental yet formidable challenge in large-scale pre-training. Given the stringent trade-off between training costs and model performance, the pivotal question is whether the optimal LR can be accurately extrapolated from low-cost experiments. In this paper, we formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we innovatively introduce a Scaling Law for search factor, effectively reducing the search complexity from O(n³) to O(n × C_D × C_η) via predictive modeling. Within the Transfer Paradigm, we extend the principles of μTransfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons. By pushing the boundaries of existing hyperparameter research in terms of scale, we conduct a comprehensive comparison between these two paradigms. Our empirical results challenge the scalability of the widely adopted μTransfer in large-scale pre-training scenarios. Furthermore, we provide a rigorous analysis through the dual lenses of training stability and feature learning to elucidate the underlying reasons why module-wise parameter tuning underperforms in large-scale settings. This work offers systematic practical guidelines and a fresh theoretical perspective for optimizing industrial-level pre-training.

[AI-16] Challenges and Research Directions for Large Language Model Inference Hardware

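[Quick read]: This paper addresses the memory and interconnect bottlenecks of Large Language Model (LLM) inference, which recent AI trends have made the primary challenges, ahead of compute. The key contribution is to highlight four architecture research opportunities: (1) High Bandwidth Flash, for 10X memory capacity with HBM-like bandwidth; (2) Processing-Near-Memory and (3) 3D memory-logic stacking, both for high memory bandwidth; and (4) low-latency interconnect to speed up communication. The focus is datacenter AI, and the authors also review the applicability of these directions to mobile devices.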

Link: https://arxiv.org/abs/2601.05047
Authors: Xiaoyu Ma,David Patterson
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication by IEEE Computer, 2026

Abstract:Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review their applicability for mobile devices.

[AI-17] How to Set the Batch Size for Large-Scale Pre-training?

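[Quick read]: This paper addresses the breakdown of the classical Critical Batch Size theory under the Warmup-Stable-Decay (WSD) learning rate scheduler: the original theoretical framework no longer matches the training dynamics of current large-scale pre-training. The key to the solution is a revised E(S) relationship tailored to the WSD scheduler, which characterizes the trade-off between training data consumption E and training steps S. The analysis reveals two properties: (1) B_min, the minimum batch size required to reach a target loss, and (2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building on these, the authors propose a dynamic Batch Size Scheduler; experiments show that the revised formula captures pre-training dynamics precisely and that the scheduling strategy significantly improves training efficiency and final model quality.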

Link: https://arxiv.org/abs/2601.05034
Authors: Yunhua Zhou,Junhao Huang,Shuhao Xin,Yechen Zhang,Runyu Peng,Qiping Guo,Xipeng Qiu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms fail to align with new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised E(S) relationship tailored for WSD scheduler, characterizing the trade-off between training data consumption E and steps S during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building upon these properties, we propose a dynamic Batch Size Scheduler. Extensive experiments demonstrate that our revised formula precisely captures the dynamics of large-scale pre-training, and the resulting scheduling strategy significantly enhances both training efficiency and final model quality.
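
For context, the classical Critical Batch Size framework that the abstract refers to is commonly summarized by the trade-off below between optimization steps S and processed examples E. It is given here only as background, assuming the usual form from the earlier large-batch training literature; the revised WSD-specific E(S) relationship is not stated in the abstract and is not reproduced here.

```latex
\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1,
\qquad
B_{\mathrm{crit}} = \frac{E_{\min}}{S_{\min}}
```

Here S_min is the minimum number of steps needed to reach a given loss (approached at very large batch sizes) and E_min is the minimum number of examples (approached at very small batch sizes).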

[AI-18] OptiSet: Unified Optimizing Set Selection and Ranking for Retrieval-Augmented Generation

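[Quick read]: This paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) methods: statically selecting the top-k passages by individual relevance fails to exploit combinatorial gains among passages and introduces substantial redundancy. The key to the solution is OptiSet, a set-centric framework built on an "Expand-then-Refine" paradigm. It first expands the query into multiple perspectives to build a diverse candidate pool, then refines the pool via re-selection into a compact evidence set. A self-synthesis strategy that needs no strong LLM supervision derives preference labels from changes in the generator's set-conditional utility, identifying complementary and redundant evidence. Finally, a set-list wise training strategy jointly optimizes set selection and set-level ranking so that the model favors compact, high-gain evidence sets.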

Link: https://arxiv.org/abs/2601.05027
Authors: Yi Jiang,Sendong Zhao,Jianbo Li,Bairui Hu,Yanrui Du,Haochun Wang,Bing Qin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL

Abstract:Retrieval-Augmented Generation (RAG) improves generation quality by incorporating evidence retrieved from large external corpora. However, most existing methods rely on statically selecting top-k passages based on individual relevance, which fails to exploit combinatorial gains among passages and often introduces substantial redundancy. To address this limitation, we propose OptiSet, a set-centric framework that unifies set selection and set-level ranking for RAG. OptiSet adopts an “Expand-then-Refine” paradigm: it first expands a query into multiple perspectives to enable a diverse candidate pool and then refines the candidate pool via re-selection to form a compact evidence set. We then devise a self-synthesis strategy without strong LLM supervision to derive preference labels from the set conditional utility changes of the generator, thereby identifying complementary and redundant evidence. Finally, we introduce a set-list wise training strategy that jointly optimizes set selection and set-level ranking, enabling the model to favor compact, high-gain evidence sets. Extensive experiments demonstrate that OptiSet improves performance on complex combinatorial problems and makes generation more efficient. The source code is publicly available.

[AI-19] HMVI: Unifying Heterogeneous Attributes with Natural Neighbors for Missing Value Inference ICASSP2026

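[Quick read]: This paper addresses missing value imputation in tabular data. Existing methods usually handle numerical and categorical attributes independently and overlook the dependencies among heterogeneous features. The key to the solution is a unified framework that explicitly models cross-type feature dependencies and uses both complete and incomplete instances during imputation, improving imputation accuracy and the performance of downstream machine learning tasks.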

Link: https://arxiv.org/abs/2601.05017
Authors: Xiaopeng Luo,Zexi Tan,Zhuowei Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to ICASSP 2026

Abstract:Missing value imputation is a fundamental challenge in machine intelligence, heavily dependent on data completeness. Current imputation methods often handle numerical and categorical attributes independently, overlooking critical interdependencies among heterogeneous features. To address these limitations, we propose a novel imputation approach that explicitly models cross-type feature dependencies within a unified framework. Our method leverages both complete and incomplete instances to ensure accurate and consistent imputation in tabular data. Extensive experimental results demonstrate that the proposed approach achieves superior performance over existing techniques and significantly enhances downstream machine learning tasks, providing a robust solution for real-world systems with missing data.

[AI-20] From Idea to Co-Creation: A Planner-Actor-Critic Framework for Agent Augmented 3D Modeling

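[Quick read]: This paper addresses the limited geometric accuracy, aesthetic quality, and task completion rates of current 3D modeling approaches in which a single-prompt agent directly calls tools such as Blender MCP. The key to the solution is a Planner-Actor-Critic architecture based on multi-agent self-reflection and human-in-the-loop supervision: the Planner coordinates modeling steps, the Actor executes them, the Critic provides iterative feedback, and human users act as supervisors and advisors throughout. Combining structured agent self-reflection with human guidance improves the geometric accuracy, aesthetic quality, and complexity of the resulting models while keeping efficient workflow integration through real-time Blender synchronization.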

Link: https://arxiv.org/abs/2601.05016
Authors: Jin Gao,Saichandu Juluri
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments:

Abstract:We present a framework that extends the Actor-Critic architecture to creative 3D modeling through multi-agent self-reflection and human-in-the-loop supervision. While existing approaches rely on single-prompt agents that directly execute modeling commands via tools like Blender MCP, our approach introduces a Planner-Actor-Critic architecture. In this design, the Planner coordinates modeling steps, the Actor executes them, and the Critic provides iterative feedback, while human users act as supervisors and advisors throughout the process. Through systematic comparison between single-prompt modeling and our reflective multi-agent approach, we demonstrate improvements in geometric accuracy, aesthetic quality, and task completion rates across diverse 3D modeling scenarios. Our evaluation reveals that critic-guided reflection, combined with human supervisory input, reduces modeling errors and increases complexity and quality of the result compared to direct single-prompt execution. This work establishes that structured agent self-reflection, when augmented by human oversight and advisory guidance, produces higher-quality 3D models while maintaining efficient workflow integration through real-time Blender synchronization.
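
The division of labor among Planner, Actor, and Critic can be sketched as a simple control loop. This is a rough illustration only, not the authors' system: the function `planner_actor_critic`, its callback signatures, and the `max_rounds` limit are hypothetical, the toy callbacks stand in for LLM agents and Blender tool calls, and human supervision is left out (it could be folded into the `critique` callback).

```python
from typing import Callable, List, Optional, Tuple


def planner_actor_critic(
    goal: str,
    plan: Callable[[str], List[str]],
    act: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,
) -> List[str]:
    """Run each planned step through an act/critique loop.

    plan(goal)             -> ordered list of modeling steps
    act(step, feedback)    -> result of executing the step
    critique(step, result) -> (accepted, feedback for the next attempt)
    """
    results: List[str] = []
    for step in plan(goal):
        feedback: Optional[str] = None
        result = ""
        for _ in range(max_rounds):
            result = act(step, feedback)
            accepted, feedback = critique(step, result)
            if accepted:
                break
        # Keep the last attempt even if the critic never accepted it.
        results.append(result)
    return results


# Toy callbacks: the critic rejects the first draft of every step.
def toy_plan(goal: str) -> List[str]:
    return ["cube", "sphere"]


def toy_act(step: str, feedback: Optional[str]) -> str:
    return step + "+fix" if feedback else step


def toy_critique(step: str, result: str) -> Tuple[bool, Optional[str]]:
    return (True, None) if result.endswith("+fix") else (False, "refine")


print(planner_actor_critic("simple scene", toy_plan, toy_act, toy_critique))
```

Bounding the number of critique rounds per step is one way to keep the loop from stalling when the Critic and Actor cannot converge.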

[AI-21] An Empirical Investigation of Robustness in Large Language Models under Tabular Distortions

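[Quick read]: This paper addresses the inability of Large Language Models (LLMs) to automatically detect and correct semantic and structural distortions in tabular data. Its central finding is that, without an explicit prior such as a system prompt, LLMs do not recognize and fix these subtle distortions, which leads to reasoning errors; only when given an explicit prompt do models partially adjust their reasoning strategies and correct some distortions, and not consistently or completely. The key to the study is a small, expert-curated dataset that evaluates LLMs on Table Question Answering (TQA) tasks requiring an additional error-correction step before analysis. It exposes systematic weaknesses in how current models handle distorted structured data and points to a direction for future work: how models can decide autonomously, as humans do, to realign tabular inputs without explicit prompts or pre-processing.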

Link: https://arxiv.org/abs/2601.05009
Authors: Avik Dutta,Harshit Nigam,Hosein Hasanbeig,Arjun Radhakrishna,Sumit Gulwani
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 figure, 1 table

Abstract:We investigate how large language models (LLMs) fail when tabular data in an otherwise canonical representation is subjected to semantic and structural distortions. Our findings reveal that LLMs lack an inherent ability to detect and correct subtle distortions in table representations. Only when provided with an explicit prior, via a system prompt, do models partially adjust their reasoning strategies and correct some distortions, though not consistently or completely. To study this phenomenon, we introduce a small, expert-curated dataset that explicitly evaluates LLMs on table question answering (TQA) tasks requiring an additional error-correction step prior to analysis. Our results reveal systematic differences in how LLMs ingest and interpret tabular information under distortion, with even SoTA models such as GPT-5.2 exhibiting a drop of at least 22% in accuracy under distortion. These findings raise important questions for future research, particularly regarding when and how models should autonomously decide to realign tabular inputs, analogous to human behavior, without relying on explicit prompts or tabular data pre-processing.

[AI-22] AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?

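[Quick read]: This paper addresses a gap in how the algorithmic reasoning of Large Reasoning Models (LRMs) is evaluated: existing benchmarks do not establish whether LRMs have truly mastered algorithmic reasoning. The authors propose AlgBench, an algorithm-centric benchmark constructed by ACM algorithmic experts, with over 3,000 original problems spanning 27 algorithms organized into Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized categories. Evaluating leading models under this algorithm-centric paradigm shows that accuracy falls sharply, from up to 92% on non-optimized tasks to around 49% on globally optimized algorithms such as dynamic programming. Further analysis uncovers "strategic over-shifts", in which models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens. The authors argue that this exposes fundamental limitations of problem-centric reinforcement learning and motivates an algorithm-centric training paradigm for robust algorithmic reasoning.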

Link: https://arxiv.org/abs/2601.04996
Authors: Henan Sun,Kaichi Yu,Yuyao Wang,Bowen Liu,Xunkai Li,Rong-Hua Li,Nuo Chen,Jia Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Reasoning ability has become a central focus in the advancement of Large Reasoning Models (LRMs). Although notable progress has been achieved on several reasoning benchmarks such as MATH500 and LiveCodeBench, existing benchmarks for algorithmic reasoning remain limited, failing to answer a critical question: Do LRMs truly master algorithmic reasoning? To answer this question, we propose AlgBench, an expert-curated benchmark that evaluates LRMs under an algorithm-centric paradigm. AlgBench consists of over 3,000 original problems spanning 27 algorithms, constructed by ACM algorithmic experts and organized under a comprehensive taxonomy, including Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized categories. Empirical evaluations on leading LRMs (e.g., Gemini-3-Pro, DeepSeek-v3.2-Speciale and GPT-o3) reveal substantial performance heterogeneity: while models perform well on non-optimized tasks (up to 92%), accuracy drops sharply to around 49% on globally optimized algorithms such as dynamic programming. Further analysis uncovers strategic over-shifts, wherein models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens. These findings expose fundamental limitations of problem-centric reinforcement learning and highlight the necessity of an algorithm-centric training paradigm for robust algorithmic reasoning.

[AI-23] When to Act: Calibrated Confidence for Reliable Human Intention Prediction in Assistive Robotics

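[Quick read]: This paper addresses the safety risk that arises when assistive devices predict a user's next action in Activities of Daily Living (ADL) with unreliable model confidence: raw confidence often fails to reflect true correctness, which undermines the reliability of assistance decisions. The key to the solution is a safety-critical triggering framework based on calibrated probabilities. Post-hoc calibration aligns predicted confidence with empirical reliability and reduces miscalibration by about an order of magnitude without affecting accuracy. The calibrated confidence then drives a simple ACT/HOLD rule that provides support only when reliability is high and withholds it otherwise, turning the confidence threshold into a quantitative safety parameter and enabling verifiable behavior in an assistive control loop.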

Link: https://arxiv.org/abs/2601.04982
Authors: Johannes A. Gaus,Winfried Ilg,Daniel Haeufle
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Assistive devices must determine both what a user intends to do and how reliable that prediction is before providing support. We introduce a safety-critical triggering framework based on calibrated probabilities for multimodal next-action prediction in Activities of Daily Living. Raw model confidence often fails to reflect true correctness, posing a safety risk. Post-hoc calibration aligns predicted confidence with empirical reliability and reduces miscalibration by about an order of magnitude without affecting accuracy. The calibrated confidence drives a simple ACT/HOLD rule that acts only when reliability is high and withholds assistance otherwise. This turns the confidence threshold into a quantitative safety parameter for assisted actions and enables verifiable behavior in an assistive control loop.
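
The two ingredients named in the abstract, a miscalibration measure and a thresholded ACT/HOLD rule, can be sketched as follows. This is a minimal sketch under stated assumptions: the abstract does not say which calibration metric or threshold the authors use, so expected calibration error (ECE) with equal-width bins and the 0.9 threshold are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE: bin-size-weighted gap between mean confidence and accuracy."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def act_or_hold(calibrated_confidence: float, threshold: float = 0.9) -> str:
    """Assist only when the calibrated confidence clears the safety threshold."""
    return "ACT" if calibrated_confidence >= threshold else "HOLD"


# Toy data: two confident correct predictions, two uncertain ones (one wrong).
confs = [0.95, 0.95, 0.55, 0.55]
hits = [True, True, True, False]
print(expected_calibration_error(confs, hits))
print([act_or_hold(c) for c in confs])
```

Because the rule is a single comparison, the threshold can be read directly as a safety parameter, but that reading only holds if the confidence fed into it has been calibrated first.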

[AI-24] On the Definition and Detection of Cherry-Picking in Counterfactual Explanations

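[Quick read]: This paper addresses cherry-picking in counterfactual explanations: because many valid counterfactuals can exist for a single instance, an explanation provider can selectively present those that fit a chosen narrative and withhold those that reveal problematic model behaviour, misleading users about how the model behaves. The key contributions are, first, a formal definition of cherry-picking in terms of an admissible explanation space, specified by the generation procedure, and a utility function; and second, evidence that even with full procedural access an external auditor can struggle to distinguish cherry-picked from non-cherry-picked explanations, because the multiplicity of valid counterfactuals and the flexibility in the explanation specification leave enough degrees of freedom to mask deliberate selection. The authors therefore argue that safeguards should prioritise reproducibility, standardisation, and procedural constraints over post-hoc detection.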

Link: https://arxiv.org/abs/2601.04977
Authors: James Hinns,Sofie Goethals,Stephan Van der Veeken,Theodoros Evgeniou,David Martens
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Counterfactual explanations are widely used to communicate how inputs must change for a model to alter its prediction. For a single instance, many valid counterfactuals can exist, which leaves open the possibility for an explanation provider to cherry-pick explanations that better suit a narrative of their choice, highlighting favourable behaviour and withholding examples that reveal problematic behaviour. We formally define cherry-picking for counterfactual explanations in terms of an admissible explanation space, specified by the generation procedure, and a utility function. We then study to what extent an external auditor can detect such manipulation. Considering three levels of access to the explanation process: full procedural access, partial procedural access, and explanation-only access, we show that detection is extremely limited in practice. Even with full procedural access, cherry-picked explanations can remain difficult to distinguish from non cherry-picked explanations, because the multiplicity of valid counterfactuals and flexibility in the explanation specification provide sufficient degrees of freedom to mask deliberate selection. Empirically, we demonstrate that this variability often exceeds the effect of cherry-picking on standard counterfactual quality metrics such as proximity, plausibility, and sparsity, making cherry-picked explanations statistically indistinguishable from baseline explanations. We argue that safeguards should therefore prioritise reproducibility, standardisation, and procedural constraints over post-hoc detection, and we provide recommendations for algorithm developers, explanation providers, and auditors.

[AI-25] Precision over Diversity: High-Precision Reward Generalizes to Robust Instruction Following ACL

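[Quick read]: This paper addresses how to achieve efficient, well-generalizing alignment when training instruction following (IF) with reinforcement learning from verifiable rewards. The prevailing view holds that a diverse mixture of verifiable hard constraints and unverifiable soft constraints is essential for generalization. Through a systematic empirical study, the authors find instead that training on hard constraints alone performs better, because reward precision, rather than constraint diversity, is the primary driver of effective alignment. The key to the solution is a data-centric refinement strategy that prioritizes reward precision, mitigating the reward hacking caused by the LLM judge's low recall in detecting false responses and encouraging a transferable meta-skill for IF. On five benchmarks, the approach outperforms competitive baselines by 13.4% while cutting training time by 58%, and it maintains strong generalization beyond instruction following.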

Link: https://arxiv.org/abs/2601.04954
Authors: Yirong Zeng,Yufei Liu,Xiao Ding,Yutai Hou,Yuxian Wang,Haonan Song,Wu Ning,Dandan Tu,Qixun Zhang,Bibo Cai,Yuxiang He,Ting Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ACL under review 13 pages, 8 figures

Abstract:A central belief in scaling reinforcement learning with verifiable rewards for instruction following (IF) tasks is that a diverse mixture of verifiable hard and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false responses, which leads to severe reward hacking, thereby undermining the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4% in performance while achieving a 58% reduction in training time, maintaining strong generalization beyond instruction following. Our findings advocate for a paradigm shift: moving away from the indiscriminate pursuit of data diversity toward high-precision rewards.

[AI-26] T-Retriever: Tree-based Hierarchical Retrieval Augmented Generation for Textual Graphs

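[Quick read]: This paper addresses two limitations of current graph-based Retrieval-Augmented Generation (RAG) methods in handling hierarchical information: rigid layer-specific compression quotas damage local graph structures, and an emphasis on topological structure neglects semantic content. The core of the solution is the T-Retriever framework, with two key innovations: (1) Adaptive Compression Encoding, which replaces artificial compression quotas with a global optimization strategy that preserves the graph's natural hierarchical organization; and (2) Semantic-Structural Entropy (S²-Entropy), which jointly optimizes structural cohesion and semantic consistency when creating hierarchical partitions, yielding more accurate and coherent graph retrieval and generation.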

Link: https://arxiv.org/abs/2601.04945
Authors: Chunyu Wei,Huaiyu Qin,Siyuan He,Yunhai Wang,Yueguo Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has significantly enhanced Large Language Models’ ability to access external knowledge, yet current graph-based RAG approaches face two critical limitations in managing hierarchical information: they impose rigid layer-specific compression quotas that damage local graph structures, and they prioritize topological structure while neglecting semantic content. We introduce T-Retriever, a novel framework that reformulates attributed graph retrieval as tree-based retrieval using a semantic and structure-guided encoding tree. Our approach features two key innovations: (1) Adaptive Compression Encoding, which replaces artificial compression quotas with a global optimization strategy that preserves the graph’s natural hierarchical organization, and (2) Semantic-Structural Entropy (S²-Entropy), which jointly optimizes for both structural cohesion and semantic consistency when creating hierarchical partitions. Experiments across diverse graph reasoning benchmarks demonstrate that T-Retriever significantly outperforms state-of-the-art RAG methods, providing more coherent and contextually relevant responses to complex queries.

[AI-27] CurricuLLM: Designing Personalized and Workforce-Aligned Cybersecurity Curricula Using Fine-Tuned LLMs

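[Quick Read]: This paper addresses the marked gap between cybersecurity course design and actual industry needs: as digital transformation accelerates, updating traditional curricula is costly and slow, so programs struggle to keep pace with emerging threats and skill requirements. The key to the solution is CurricuLLM, a Large Language Model (LLM)-based framework for automated curriculum design and analysis. It works in two stages: PreprocessLM first standardizes the input data, then a fine-tuned BERT model serving as ClassifyLM maps course content to nine cybersecurity Knowledge Areas. The result is a data-driven, customizable curriculum that can be aligned with job-role weights or market demand, improving how well the educational content matches workforce needs.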

Link: https://arxiv.org/abs/2601.04940
Authors: Arthur Nijdam, Harri Kähkönen, Valtteri Niemi, Paul Stankovski Wagner, Sara Ramezanian
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:


Abstract:The cybersecurity landscape is constantly evolving, driven by increased digitalization and new cybersecurity threats. Cybersecurity programs often fail to equip graduates with skills demanded by the workforce, particularly concerning recent developments in cybersecurity, as curriculum design is costly and labor-intensive. To address this misalignment, we present a novel Large Language Model (LLM)-based framework for automated design and analysis of cybersecurity curricula, called CurricuLLM. Our approach provides three key contributions: (1) automation of personalized curriculum design, (2) a data-driven pipeline aligned with industry demands, and (3) a comprehensive methodology for leveraging fine-tuned LLMs in curriculum development. CurricuLLM utilizes a two-tier approach consisting of PreprocessLM, which standardizes input data, and ClassifyLM, which assigns course content to nine Knowledge Areas in cybersecurity. We systematically evaluated multiple Natural Language Processing (NLP) architectures and fine-tuning strategies, ultimately selecting the Bidirectional Encoder Representations from Transformers (BERT) model as ClassifyLM, fine-tuned on foundational cybersecurity concepts and workforce competencies. We are the first to validate our method with human experts who analyzed real-world cybersecurity curricula and frameworks, motivating that CurricuLLM is an efficient solution to replace labor-intensive curriculum analysis. Moreover, once course content has been classified, it can be integrated with established cybersecurity role-based weights, enabling alignment of the educational program with specific job roles, workforce categories, or general market needs. This lays the foundation for personalized, workforce-aligned cybersecurity curricula that prepare students for the evolving demands in cybersecurity.

[AI-28] Conversational AI for Rapid Scientific Prototyping: A Case Study on ESA's ELOPE Competition

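[Quick Read]: This paper examines how generative AI can accelerate scientific discovery, in particular rapid prototyping in a competition setting. The key to its approach is the structured integration of Large Language Models (LLMs) into the research workflow: exploiting their strengths in code generation, algorithmic reasoning, and data-handling suggestions, while identifying and working around their limitations in long conversations (confusion over side discussions, unnecessary structural changes, and forgetting key information), so as to improve human-AI collaboration and the pace of scientific innovation.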

Link: https://arxiv.org/abs/2601.04920
Authors: Nils Einecke
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Large language models (LLMs) are increasingly used as coding partners, yet their role in accelerating scientific discovery remains underexplored. This paper presents a case study of using ChatGPT for rapid prototyping in ESA’s ELOPE (Event-based Lunar OPtical flow Egomotion estimation) competition. The competition required participants to process event camera data to estimate lunar lander trajectories. Despite joining late, we achieved second place with a score of 0.01282, highlighting the potential of human-AI collaboration in competitive scientific settings. ChatGPT contributed not only executable code but also algorithmic reasoning, data handling routines, and methodological suggestions, such as using fixed number of events instead of fixed time spans for windowing. At the same time, we observed limitations: the model often introduced unnecessary structural changes, gets confused by intermediate discussions about alternative ideas, occasionally produced critical errors and forgets important aspects in longer scientific discussions. By analyzing these strengths and shortcomings, we show how conversational AI can both accelerate development and support conceptual insight in scientific research. We argue that structured integration of LLMs into the scientific workflow can enhance rapid prototyping by proposing best practices for AI-assisted scientific work.
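The windowing suggestion mentioned in the abstract (a fixed number of events per window instead of a fixed time span) can be illustrated with a minimal Python sketch. This is not the competition code: the function names, the toy timestamps, and the parameters `span` and `n` are assumptions made here for illustration, and the timestamps are assumed to be sorted.

```python
from typing import List, Sequence


def windows_by_time(timestamps: Sequence[float], span: float) -> List[List[int]]:
    """Group event indices into back-to-back windows of fixed duration `span`."""
    windows: List[List[int]] = []
    if len(timestamps) == 0:
        return windows
    start = timestamps[0]
    current: List[int] = []
    for i, t in enumerate(timestamps):
        # Close windows (possibly empty ones) until `t` falls inside the current one.
        while t >= start + span:
            windows.append(current)
            current = []
            start += span
        current.append(i)
    windows.append(current)
    return windows


def windows_by_count(timestamps: Sequence[float], n: int) -> List[List[int]]:
    """Group event indices into windows of `n` events each (the last may be shorter)."""
    return [list(range(i, min(i + n, len(timestamps))))
            for i in range(0, len(timestamps), n)]


# Hypothetical bursty event stream: four early events, a long gap, then two more.
events = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1]
time_sizes = [len(w) for w in windows_by_time(events, span=1.0)]
count_sizes = [len(w) for w in windows_by_count(events, n=2)]
```

On bursty data like this, time-based windows swing between crowded and empty, whereas count-based windows always carry the same amount of information, which is one plausible reason the suggestion helped.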

[AI-29] What Students Ask, How a Generative AI Assistant Responds: Exploring Higher Education Students' Dialogues on Learning Analytics Feedback

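[Quick Read]: The problem this paper addresses is that Learning Analytics Dashboards (LADs) aim to support students' Self-Regulated Learning (SRL) through data-driven feedback, yet students with low SRL competence often struggle to understand and use that feedback. In response, the study examines a conversational generative AI (GenAI) assistant embedded in the LAD that provides real-time, personalized, dialogue-based support. The key observation is that students at different SRL levels use the assistant differently: low-SRL students seek clarification and reassurance, whereas high-SRL students ask about technical aspects and request personalized strategies. Such support improves engagement with and understanding of feedback and narrows the gap between low-SRL students and their higher-SRL peers.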

Link: https://arxiv.org/abs/2601.04919
Authors: Yildiz Uzun, Andrea Gauthier, Mutlu Cukurova
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:


Abstract:Learning analytics dashboards (LADs) aim to support students’ regulation of learning by translating complex data into feedback. Yet students, especially those with lower self-regulated learning (SRL) competence, often struggle to engage with and interpret analytics feedback. Conversational generative artificial intelligence (GenAI) assistants have shown potential to scaffold this process through real-time, personalised, dialogue-based support. Further advancing this potential, we explored authentic dialogues between students and GenAI assistant integrated into LAD during a 10-week semester. The analysis focused on questions students with different SRL levels posed, the relevance and quality of the assistant’s answers, and how students perceived the assistant’s role in their learning. Findings revealed distinct query patterns. While low SRL students sought clarification and reassurance, high SRL students queried technical aspects and requested personalised strategies. The assistant provided clear and reliable explanations but limited in personalisation, handling emotionally charged queries, and integrating multiple data points for tailored responses. Findings further extend that GenAI interventions can be especially valuable for low SRL students, offering scaffolding that supports engagement with feedback and narrows gaps with their higher SRL peers. At the same time, students’ reflections underscored the importance of trust, need for greater adaptivity, context-awareness, and technical refinement in future systems.

[AI-30] Breaking Robustness Barriers in Cognitive Diagnosis: A One-Shot Neural Architecture Search Perspective KDD2026

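[Quick Read]: This paper targets two key problems that Cognitive Diagnosis Models (CDMs) face in practical deployment: existing models are sensitive to the noise that pervades observed response data, which degrades performance; and current CDMs rely on researchers' domain knowledge for structural design, which limits architecture exploration and leaves potential performance untapped. The key to the solution is OSCD, a one-shot neural architecture search method based on multi-objective evolution, which has two stages. First, it builds a search space covering diverse architectural combinations and trains a weight-sharing supernet represented as a complete binary tree, enabling exploration of architectures beyond manual design priors. Second, it formulates the search for the best architecture under heterogeneous noise scenarios as a Multi-Objective Optimization Problem (MOP) and solves it with a framework that combines a Pareto-optimal search strategy with cross-scenario performance evaluation, so that high-performing architectures for cognitive diagnosis are found efficiently and robustly.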

Link: https://arxiv.org/abs/2601.04918
Authors: Ziwen Wang, Shangshang Yang, Xiaoshan Yu, Haiping Ma, Xingyi Zhang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: KDD2026, 15 pages


Abstract:With the advancement of network technologies, intelligent tutoring systems (ITS) have emerged to deliver increasingly precise and tailored personalized learning services. Cognitive diagnosis (CD) has emerged as a core research task in ITS, aiming to infer learners’ mastery of specific knowledge concepts by modeling the mapping between learning behavior data and knowledge states. However, existing research prioritizes model performance enhancement while neglecting the pervasive noise contamination in observed response data, significantly hindering practical deployment. Furthermore, current cognitive diagnosis models (CDMs) rely heavily on researchers’ domain expertise for structural design, which fails to exhaustively explore architectural possibilities, thus leaving model architectures’ full potential untapped. To address this issue, we propose OSCD, an evolutionary multi-objective One-Shot neural architecture search method for Cognitive Diagnosis, designed to efficiently and robustly improve the model’s capability in assessing learner proficiency. Specifically, OSCD operates through two distinct stages: training and searching. During the training stage, we construct a search space encompassing diverse architectural combinations and train a weight-sharing supernet represented via the complete binary tree topology, enabling comprehensive exploration of potential architectures beyond manual design priors. In the searching stage, we formulate the optimal architecture search under heterogeneous noise scenarios as a multi-objective optimization problem (MOP), and develop an optimization framework integrating a Pareto-optimal solution search strategy with cross-scenario performance evaluation for resolution. Extensive experiments on real-world educational datasets validate the effectiveness and robustness of the optimal architectures discovered by our OSCD model for CD tasks.
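The multi-objective formulation in the searching stage rests on Pareto dominance. As a rough illustration only, the Python sketch below filters a set of candidate architectures down to the non-dominated ones. It does not reproduce OSCD's evolutionary search: the architecture names, the two noise scenarios, and the scores are hypothetical, and higher scores are assumed to be better.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, own in scores.items():
        beaten = any(dominates(other, own)
                     for other_name, other in scores.items() if other_name != name)
        if not beaten:
            front.append(name)
    return sorted(front)


# Hypothetical accuracy of three architectures under (low-noise, high-noise) scenarios.
candidates = {
    "arch_a": (0.80, 0.60),
    "arch_b": (0.75, 0.70),
    "arch_c": (0.74, 0.59),
}
front = pareto_front(candidates)
```

Here `arch_c` is dropped because `arch_a` is better in both scenarios, while `arch_a` and `arch_b` both survive because each wins in one scenario.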

[AI-31] From Stories to Cities to Games: A Qualitative Evaluation of Behaviour Planning

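[Quick Read]: This paper addresses the difficulty traditional planning methods have in generating diverse plans, a limitation that matters when many possibilities must be covered in complex real-world settings such as risk management, stream data analysis, and malware detection. The key to the solution is the recently proposed "behaviour planning" paradigm, which explicitly incorporates a diversity model into the planning process and supports multiple planning categories. The paper demonstrates its usefulness through three case studies on storytelling, urban planning, and game evaluation.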

Link: https://arxiv.org/abs/2601.04911
Authors: Mustafa F. Abdelwahed, Joan Espasa, Alice Toniolo, Ian P. Gent
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:The primary objective of a diverse planning approach is to generate a set of plans that are distinct from one another. Such an approach is applied in a variety of real-world domains, including risk management, automated stream data analysis, and malware detection. More recently, a novel diverse planning paradigm, referred to as behaviour planning, has been proposed. This approach extends earlier methods by explicitly incorporating a diversity model into the planning process and supporting multiple planning categories. In this paper, we demonstrate the usefulness of behaviour planning in real-world settings by presenting three case studies. The first case study focuses on storytelling, the second addresses urban planning, and the third examines game evaluation.

[AI-32] DVD: A Robust Method for Detecting Variant Contamination in Large Language Model Evaluation

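[Quick Read]: This paper addresses the growing problem of **variant contamination** in Large Language Model (LLM) evaluation: semantically equivalent but lexically or syntactically different variants of test items appear in the training data, so models score highly through memorization rather than genuine reasoning. Existing detectors based on sampling consistency or perplexity struggle to catch this kind of contamination. The paper proposes DVD (Detection via Variance of generation Distribution), a single-sample detector whose core idea is to model the local generation distribution with temperature sampling and observe the variance of the synthetic difficulty of low-probability tokens: contaminated items alternate between a "memory-adherence" state and a "perturbation-drift" state and so show abnormally high variance, while uncontaminated items keep a comparatively smooth variance pattern. This gives a principled and practical fingerprint for detecting variant contamination.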

Link: https://arxiv.org/abs/2601.04895
Authors: Renzhao Liang, Jingru Chen, Bo Jia, Bo Deng, Chenggang Xie, Yidong Wang, Ke Jin, Xin Wang, Linfeng Zhang, Cunxiang Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Evaluating large language models (LLMs) is increasingly confounded by variant contamination: the training corpus contains semantically equivalent yet lexically or syntactically altered versions of test items. Unlike verbatim leakage, these paraphrased or structurally transformed variants evade existing detectors based on sampling consistency or perplexity, thereby inflating benchmark scores via memorization rather than genuine reasoning. We formalize this problem and introduce DVD (Detection via Variance of generation Distribution), a single-sample detector that models the local output distribution induced by temperature sampling. Our key insight is that contaminated items trigger alternation between a memory-adherence state and a perturbation-drift state, yielding abnormally high variance in the synthetic difficulty of low-probability tokens; uncontaminated items remain in drift with comparatively smooth variance. We construct the first benchmark for variant contamination across two domains Omni-MATH and SuperGPQA by generating and filtering semantically equivalent variants, and simulate contamination via fine-tuning models of different scales and architectures (Qwen2.5 and Llama3.1). Across datasets and models, DVD consistently outperforms perplexity-based, Min-K%++, edit-distance (CDD), and embedding-similarity baselines, while exhibiting strong robustness to hyperparameters. Our results establish variance of the generation distribution as a principled and practical fingerprint for detecting variant contamination in LLM evaluation.

[AI-33] SmartSearch: Process Reward-Guided Query Refinement for Search Agents

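[Quick Read]: This paper addresses the problem that Large Language Model (LLM)-based search agents, when handling knowledge-intensive tasks, issue low-quality intermediate search queries that produce inaccurate retrieval results and limit overall performance. The key to the solution is the SmartSearch framework, built on two mechanisms: (1) process rewards, which give fine-grained supervision of the quality of each intermediate search query through Dual-Level Credit Assessment; and (2) query refinement, which selectively refines low-quality queries and regenerates the subsequent search rounds from the refined query, improving query quality and search efficiency. So that the agent progressively internalizes the ability to generate high-quality queries under the guidance of process rewards, the authors also design a three-stage curriculum learning framework that moves from imitation, to alignment, to generalization.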

Link: https://arxiv.org/abs/2601.04888
Authors: Tongyu Wen, Guanting Dong, Zhicheng Dou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures


Abstract:Large language model (LLM)-based search agents have proven promising for addressing knowledge-intensive problems by incorporating information retrieval capabilities. Existing works largely focus on optimizing the reasoning paradigms of search agents, yet the quality of intermediate search queries during reasoning remains overlooked. As a result, the generated queries often remain inaccurate, leading to unexpected retrieval results and ultimately limiting search agents’ overall effectiveness. To mitigate this issue, we introduce SmartSearch, a framework built upon two key mechanisms: (1) Process rewards, which provide fine-grained supervision for the quality of each intermediate search query through Dual-Level Credit Assessment. (2) Query refinement, which promotes the optimization of query generation by selectively refining low-quality search queries and regenerating subsequent search rounds based on these refinements. To enable the search agent to progressively internalize the ability to improve query quality under the guidance of process rewards, we design a three-stage curriculum learning framework. This framework guides the agent through a progression from imitation, to alignment, and ultimately to generalization. Experimental results show that SmartSearch consistently surpasses existing baselines, and additional quantitative analyses further confirm its significant gains in both search efficiency and query quality. The code is available at this https URL.

[AI-34] Flexible Manufacturing Systems Intralogistics: Dynamic Optimization of AGVs and Tool Sharing Using Coloured-Timed Petri Nets and Actor-Critic RL with Actions Masking

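[Quick Read]: This paper addresses the added complexity of the Job Shop Scheduling Problem (JSSP) in Flexible Manufacturing Systems (FMS), specifically the multi-dimensional dynamic constraints introduced by integrating Automated Guided Vehicles (AGVs) and tool-sharing systems at the same time. The key to the solution is a new approach that combines Colored-Timed Petri Nets (CTPNs) with Model-Based Reinforcement Learning (MBRL): the CTPNs provide a formal modeling structure and dynamic action masking, which significantly shrinks the action search space, while MBRL adapts to changing environments through the learned policy and adds a lookahead strategy for positioning AGVs. Experiments show that the approach matches traditional methods on small instances and, on large instances, both shortens the makespan and cuts computation time tenfold.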

Link: https://arxiv.org/abs/2601.04887
Authors: Sofiene Lassoued, Laxmikant Shrikant Baheti, Nathalie Weiß-Borkowski, Stefan Lier, Andreas Schwung
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Flexible Manufacturing Systems (FMS) are pivotal in optimizing production processes in today’s rapidly evolving manufacturing landscape. This paper advances the traditional job shop scheduling problem by incorporating additional complexities through the simultaneous integration of automated guided vehicles (AGVs) and tool-sharing systems. We propose a novel approach that combines Colored-Timed Petri Nets (CTPNs) with actor-critic model-based reinforcement learning (MBRL), effectively addressing the multifaceted challenges associated with FMS. CTPNs provide a formal modeling structure and dynamic action masking, significantly reducing the action search space, while MBRL ensures adaptability to changing environments through the learned policy. Leveraging the advantages of MBRL, we incorporate a lookahead strategy for optimal positioning of AGVs, improving operational efficiency. Our approach was evaluated on small-sized public benchmarks and a newly developed large-scale benchmark inspired by the Taillard benchmark. The results show that our approach matches traditional methods on smaller instances and outperforms them on larger ones in terms of makespan while achieving a tenfold reduction in computation time. To ensure reproducibility, we propose a gym-compatible environment and an instance generator. Additionally, an ablation study evaluates the contribution of each framework component to its overall performance.
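Action masking, as named in the abstract, is commonly implemented by zeroing out the probability of invalid actions before sampling from the policy. The Python sketch below shows that general idea under stated assumptions: the logits and the mask are made up, and the mask is assumed to come from which Petri net transitions are currently enabled. It is not taken from the paper's implementation.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over `logits` restricted to entries where `mask` is True.

    Masked-out actions get probability exactly 0.0, so they can never be sampled.
    """
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    peak = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical policy output for three actions; the second action is not enabled.
logits = [2.0, 1.0, 0.5]
enabled = [True, False, True]
probs = masked_softmax(logits, enabled)
```

Because the invalid action never receives probability mass, the agent spends no exploration on moves the model of the system rules out, which is the search-space reduction the abstract refers to.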

[AI-35] Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests

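[Quick Read]: This paper addresses inconsistency between the pull request (PR) descriptions written by generative-AI coding agents and the actual code changes, termed PR message-code inconsistency (PR-MCI), which directly affects how much human reviewers trust AI agents. The key to its approach is to identify and quantify PR-MCI systematically: the authors manually annotate 974 PRs and analyze 23,247 agentic PRs, finding that 1.7% of PRs exhibit high PR-MCI, with the most common type being descriptions that claim changes that were not implemented (45.4%). Further statistics show that high PR-MCI PRs have a markedly lower acceptance rate (28.3% vs. 80.0%) and take 3.5 times longer to merge (55.8 hours vs. 16.0 hours). The paper thereby exposes the negative effect of PR-MCI on collaboration efficiency and trust, and calls for PR-MCI verification mechanisms and improved PR generation to support trustworthy human-AI collaboration.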

Link: https://arxiv.org/abs/2601.04886
Authors: Jingzhi Gong, Giovanni Pinna, Yixin Bian, Jie M. Zhang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:


Abstract:Pull request (PR) descriptions generated by AI coding agents are the primary channel for communicating code changes to human reviewers. However, the alignment between these messages and the actual changes remains unexplored, raising concerns about the trustworthiness of AI agents. To fill this gap, we analyzed 23,247 agentic PRs across five agents using PR message-code inconsistency (PR-MCI). We contributed 974 manually annotated PRs, found 406 PRs (1.7%) exhibited high PR-MCI, and identified eight PR-MCI types, revealing that descriptions claiming unimplemented changes was the most common issue (45.4%). Statistical tests confirmed that high-MCI PRs had 51.7% lower acceptance rates (28.3% vs. 80.0%) and took 3.5x longer to merge (55.8 vs. 16.0 hours). Our findings suggest that unreliable PR descriptions undermine trust in AI agents, highlighting the need for PR-MCI verification mechanisms and improved PR generation to enable trustworthy human-AI collaboration.
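The headline figures in the abstract can be checked with a few lines of arithmetic. The Python sketch below uses only numbers quoted in the abstract; the variable names are ours. Note that the "51.7% lower" acceptance rate corresponds to a difference of 51.7 percentage points between 80.0% and 28.3%.

```python
# Figures quoted in the abstract.
high_mci_prs = 406
total_prs = 23_247
accept_high, accept_other = 28.3, 80.0      # acceptance rates in percent
merge_high, merge_other = 55.8, 16.0        # hours to merge

share_high_mci = 100 * high_mci_prs / total_prs            # share of PRs with high PR-MCI
gap_points = accept_other - accept_high                    # gap in percentage points
relative_drop = 100 * gap_points / accept_other            # drop relative to the 80.0% baseline
merge_ratio = merge_high / merge_other                     # slowdown factor

print(f"high PR-MCI share: {share_high_mci:.1f}%")
print(f"acceptance gap: {gap_points:.1f} points ({relative_drop:.1f}% relative)")
print(f"merge time ratio: {merge_ratio:.1f}x")
```

Expressed relative to the 80.0% baseline, the same gap amounts to a drop of roughly 65%, so the two ways of stating the difference should not be confused.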

[AI-36] Precomputing Multi-Agent Path Replanning using Temporal Flexibility: A Case Study on the Dutch Railway Network

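[Quick Read]: This paper addresses plan conflicts that arise in multi-agent systems when one agent is delayed, and in particular how to produce a new safe and feasible plan quickly in dense settings such as railway scheduling. Conventional approaches either replan only the delayed agent (often inefficient or infeasible) or replan other agents (which can trigger cascading delays). The key to the solution is to exploit the temporal flexibility of the other agents, that is, the maximum delay an agent can absorb without changing the order of, or further delaying, other agents. By precomputing the possible plans for the delayed agent under different delays, together with the corresponding changes for the other agents, the method replans efficiently while avoiding cascading delays. The algorithm, called FlexSIPP, is demonstrated in a real-world case study on the densely used Dutch railway network.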

Link: https://arxiv.org/abs/2601.04884
Authors: Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not result in an efficient plan, and sometimes cannot even yield a feasible plan. On the other hand, replanning other agents may lead to a cascade of changes and delays. We show how to efficiently replan by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay an agent can take without changing the order of or further delaying more agents. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent, also returning the changes for the other agents, for any single-agent delay within the given scenario. We demonstrate our method in a real-world case study of replanning trains in the densely-used Dutch railway network. Our experiments show that FlexSIPP provides effective solutions, relevant to real-world adjustments, and within a reasonable timeframe.

[AI-37] Key-Value Pair-Free Continual Learner via Task-Specific Prompt-Prototype

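[Quick Read]: This paper addresses the inter-task interference and poor scalability that prompt-based continual learning methods incur by relying on key-value pairing. The key to the solution is a task-specific Prompt-Prototype (ProP) mechanism: each task gets its own prompt and prototype, where the prompt guides feature learning for the current task and the prototype captures representative features of the input. At inference, predictions are made by binding each task-specific prompt to its associated prototype, which removes the dependence on key-value pairs; a regularization constraint on prompt initialization further improves stability.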

Link: https://arxiv.org/abs/2601.04864
Authors: Haihua Luo, Xuming Ran, Zhengji Li, Huiyan Xue, Tingting Jiang, Jiangrong Shen, Tommi Kärkkäinen, Qi Xu, Fengyu Cong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Continual learning aims to enable models to acquire new knowledge while retaining previously learned information. Prompt-based methods have shown remarkable performance in this domain; however, they typically rely on key-value pairing, which can introduce inter-task interference and hinder scalability. To overcome these limitations, we propose a novel approach employing task-specific Prompt-Prototype (ProP), thereby eliminating the need for key-value pairs. In our method, task-specific prompts facilitate more effective feature learning for the current task, while corresponding prototypes capture the representative features of the input. During inference, predictions are generated by binding each task-specific prompt with its associated prototype. Additionally, we introduce regularization constraints during prompt initialization to penalize excessively large values, thereby enhancing stability. Experiments on several widely used datasets demonstrate the effectiveness of the proposed method. In contrast to mainstream prompt-based approaches, our framework removes the dependency on key-value pairs, offering a fresh perspective for future continual learning research.

[AI-38] Orchestrating Intelligence: Confidence-Aware Routing for Efficient Multi-Agent Collaboration across Multi-Scale Models

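[Quick Read]: This paper addresses the computational inefficiency of Multi-Agent Systems (MAS) that deploy Large Language Models (LLMs) uniformly across all agent roles in complex reasoning tasks. Existing frameworks ignore the fact that different reasoning stages place different cognitive demands, which wastes resources. The key to the solution is the OI-MAS framework, which introduces a state-dependent routing mechanism and a confidence-aware model-selection strategy to dynamically choose agent roles and model scales from a heterogeneous pool of multi-scale LLMs, allocating compute according to task complexity. The reported results are accuracy gains of up to 12.88% and cost reductions of up to 79.78%.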

Link: https://arxiv.org/abs/2601.04861
Authors: Jingbo Wang, Sendong Zhao, Jiatong Liu, Haochun Wang, Wanting Li, Bing Qin, Ting Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:While multi-agent systems (MAS) have demonstrated superior performance over single-agent approaches in complex reasoning tasks, they often suffer from significant computational inefficiencies. Existing frameworks typically deploy large language models (LLMs) uniformly across all agent roles, failing to account for the varying cognitive demands of different reasoning stages. We address this inefficiency by proposing OI-MAS framework, a novel multi-agent framework that implements an adaptive model-selection policy across a heterogeneous pool of multi-scale LLMs. Specifically, OI-MAS introduces a state-dependent routing mechanism that dynamically selects agent roles and model scales throughout the reasoning process. In addition, we introduce a confidence-aware mechanism that selects appropriate model scales conditioned on task complexity, thus reducing unnecessary reliance on large-scale models. Experimental results show that OI-MAS consistently outperforms baseline multi-agent systems, improving accuracy by up to 12.88% while reducing cost by up to 79.78%.
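To make the idea of confidence-aware model selection concrete, here is a deliberately simplified Python sketch: a confidence-thresholded cascade that tries models from smallest to largest and stops at the first sufficiently confident answer. This is a stand-in chosen for illustration, not OI-MAS's routing policy, and the model stubs, confidence values, and `threshold` parameter are all hypothetical.

```python
from typing import Callable, List, Tuple

# A model takes a query and returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(query: str, models: List[Tuple[str, Model]], threshold: float = 0.8) -> Tuple[str, str]:
    """Try models in the given order (smallest first) and return (model_name, answer)
    from the first one whose confidence reaches `threshold`.

    If none is confident enough, the last (largest) model's answer is returned.
    """
    if not models:
        raise ValueError("need at least one model")
    for name, model in models:
        answer, confidence = model(query)
        if confidence >= threshold:
            break
    return name, answer


# Hypothetical stand-ins for a small and a large LLM.
def small_model(query: str) -> Tuple[str, float]:
    return "draft answer", 0.55


def large_model(query: str) -> Tuple[str, float]:
    return "careful answer", 0.93


pool = [("small", small_model), ("large", large_model)]
strict_choice = route("hard question", pool, threshold=0.8)
lenient_choice = route("hard question", pool, threshold=0.5)
```

With the stricter threshold the query escalates to the large model, while the lenient threshold accepts the small model's answer, which is the cost-versus-accuracy lever that confidence-aware selection exposes.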

[AI-39] Rethinking GNNs and Missing Features: Challenges, Evaluation and a Robust Solution

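[Quick Read]: This paper addresses how Graph Neural Networks (GNNs) handle missing node features in real-world applications such as healthcare and sensor networks. Existing studies mostly consider relatively benign settings, namely high-dimensional but sparse node features and data missing under the Missing Completely At Random (MCAR) mechanism, which makes performance comparisons uninformative. The paper makes two key improvements. First, because high sparsity limits the information loss caused by missingness and makes all models appear robust, it introduces one synthetic and three real-world datasets with dense, semantically meaningful features. Second, it designs evaluation protocols with more realistic missingness mechanisms, together with a theoretical framework that states the assumptions about the missingness process and their implications for different methods. Building on this analysis, the authors propose GNNmim, a simple yet effective baseline for node classification that is competitive with specialized architectures across datasets and missingness regimes.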

Link: https://arxiv.org/abs/2601.04855
Authors: Francesco Ferrini, Veronica Lachi, Antonio Longa, Bruno Lepri, Matono Akiyoshi, Andrea Passerini, Xin Liu, Manfred Jaeger
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide a theoretical background to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with respect to specialized architectures across diverse datasets and missingness regimes.

[AI-40] AECV-Bench: Benchmarking Multimodal Models on Architectural and Engineering Drawings Understanding

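[Quick Read]: This paper addresses the question of whether current multimodal and vision-language models can reliably interpret architecture, engineering, and construction (AEC) drawings, whose graphical language consists of dense symbols, layout conventions, and annotations. The key to the solution is AECV-Bench, a benchmark that evaluates models through two complementary use cases: object counting (doors, windows, bedrooms, toilets) on 120 high-quality floor plans, and drawing-grounded document question answering over 192 question-answer pairs that test text extraction (OCR), instance counting, spatial reasoning, and comparative reasoning. The benchmark evaluates a range of state-of-the-art models under a unified protocol, using an LLM-as-a-judge pipeline with targeted human adjudication. It finds that current systems are strong at text understanding (up to 0.95 accuracy) but remain weak at symbol-centric drawing understanding, especially reliable counting of doors and windows (often 0.40-0.55 accuracy), which motivates domain-specific representations and tool-augmented, human-in-the-loop workflows for AEC automation.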

Link: https://arxiv.org/abs/2601.04819
Authors: Aleksei Kondratenko, Mussie Birhane, Houssame E. Hsain, Guido Maciocci
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:AEC drawings encode geometry and semantics through symbols, layout conventions, and dense annotation, yet it remains unclear whether modern multimodal and vision-language models can reliably interpret this graphical language. We present AECV-Bench, a benchmark for evaluating multimodal and vision-language models on realistic AEC artefacts via two complementary use cases: (i) object counting on 120 high-quality floor plans (doors, windows, bedrooms, toilets), and (ii) drawing-grounded document QA spanning 192 question-answer pairs that test text extraction (OCR), instance counting, spatial reasoning, and comparative reasoning over common drawing regions. Object-counting performance is reported using per-field exact-match accuracy and MAPE results, while document-QA performance is reported using overall accuracy and per-category breakdowns with an LLM-as-a-judge scoring pipeline and targeted human adjudication for edge cases. Evaluating a broad set of state-of-the-art models under a unified protocol, we observe a stable capability gradient; OCR and text-centric document QA are strongest (up to 0.95 accuracy), spatial reasoning is moderate, and symbol-centric drawing understanding - especially reliable counting of doors and windows - remains unsolved (often 0.40-0.55 accuracy) with substantial proportional errors. These results suggest that current systems function well as document assistants but lack robust drawing literacy, motivating domain-specific representations and tool-augmented, human-in-the-loop workflows for an efficient AEC automation.
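The two object-counting metrics named in the abstract, per-field exact-match accuracy and MAPE (mean absolute percentage error), can be sketched in Python as follows. The field names and counts are invented for illustration, and skipping ground-truth counts of zero in the MAPE is an assumption made here, since the abstract does not say how the benchmark handles them.

```python
from typing import Dict, List

Counts = Dict[str, int]


def exact_match_accuracy(preds: List[Counts], truths: List[Counts], field: str) -> float:
    """Fraction of drawings where the predicted count for `field` equals the true count."""
    hits = sum(p[field] == t[field] for p, t in zip(preds, truths))
    return hits / len(truths)


def mape(preds: List[Counts], truths: List[Counts], field: str) -> float:
    """Mean absolute percentage error for `field`, in percent.

    Drawings whose true count is zero are skipped (assumption) to avoid division by zero.
    """
    errors = [abs(p[field] - t[field]) / t[field]
              for p, t in zip(preds, truths) if t[field] != 0]
    return 100 * sum(errors) / len(errors)


# Hypothetical ground truth and model predictions for two floor plans.
truths = [{"doors": 4, "windows": 10}, {"doors": 5, "windows": 8}]
preds = [{"doors": 4, "windows": 8}, {"doors": 6, "windows": 8}]

door_accuracy = exact_match_accuracy(preds, truths, "doors")
door_mape = mape(preds, truths, "doors")
window_mape = mape(preds, truths, "windows")
```

Exact match treats a count that is off by one the same as a count that is off by ten, so reporting MAPE alongside it shows how large the counting errors are in proportion to the true count.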

[AI-41] SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

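[Quick Read]: This paper addresses two core problems that Reinforcement Learning (RL) faces when used to improve the reasoning ability of large language models: the training signal decays quickly when task difficulty is mismatched with model capability, and training tends to be dominated by a limited set of recurring problem patterns, causing overfitting and stalled progress. The key to the solution is SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), whose core contributions are: (1) a scalable synthesis pipeline that turns real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, removing the limit of finite datasets while preserving correctness; and (2) an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity, which prevents reward sparsity and pattern overfitting and sustains stable long-horizon training.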

Link: https://arxiv.org/abs/2601.04809
Authors: Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures


Abstract:Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model’s capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

[AI-42] Parallelizing Node-Level Explainability in Graph Neural Networks

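[Quick Read]: This paper addresses the poor computational efficiency of node-level explainability for Graph Neural Networks (GNNs) in node classification as graphs grow, where batching strategies often degrade explanation quality. The key to the solution is graph partitioning: the graph is decomposed into disjoint subgraphs so that explainability for node neighbors can be computed in parallel, which significantly improves scalability and efficiency without affecting the correctness of the results, provided sufficient memory is available. For memory-constrained settings, the paper further proposes a dropout-based reconstruction mechanism that offers a controllable trade-off between memory usage and explanation fidelity.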

Link: https://arxiv.org/abs/2601.04807
Authors: Oscar Llorente, Jaime Boal, Eugenio F. Sánchez-Úbeda, Antonio Diaz-Cano, Miguel Familiar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable performance in a wide range of tasks, such as node classification, link prediction, and graph classification, by exploiting the structural information in graph-structured data. However, in node classification, computing node-level explainability becomes extremely time-consuming as the size of the graph increases, while batching strategies often degrade explanation quality. This paper introduces a novel approach to parallelizing node-level explainability in GNNs through graph partitioning. By decomposing the graph into disjoint subgraphs, we enable parallel computation of explainability for node neighbors, significantly improving the scalability and efficiency without affecting the correctness of the results, provided sufficient memory is available. For scenarios where memory is limited, we further propose a dropout-based reconstruction mechanism that offers a controllable trade-off between memory usage and explanation fidelity. Experimental results on real-world datasets demonstrate substantial speedups, enabling scalable and transparent explainability for large-scale GNN models.

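[AI-43] Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

TL;DR: This paper targets the "overthinking" problem of Large Reasoning Models (LRMs), whose reliance on a long Chain of Thought (CoT) incurs heavy computational overhead. Existing methods mostly use Reinforcement Learning (RL) to train hybrid reasoning models that decide dynamically whether to think, but RL is prone to reward hacking, where the model actually thinks yet is judged as not thinking, which biases training. The paper proposes Thinking-Based Non-Thinking (TNT). Its core idea is to avoid Supervised Fine-Tuning (SFT) and instead use information from the solution component of thinking responses to set a query-specific maximum token limit for non-thinking responses, which effectively mitigates reward hacking. Experiments on five mathematical benchmarks show that TNT cuts token usage by about 50% relative to the baseline models while significantly improving accuracy, achieving the best accuracy-efficiency trade-off among the tested methods.

Link: https://arxiv.org/abs/2601.04805
Authors: Siyuan Gan,Jiaheng Liu,Boyan Wang,Tianpei Yang,Runqing Miao,Yuyao Zhang,Fanyu Meng,Junlan Feng,Linjian Meng,Jing Huo,Yang Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: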

Abstract:Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, using RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and sets different maximum token usage for responses not using thinking across various queries by leveraging information from the solution component of the responses using thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking in TNT’s responses that are classified as not using thinking remains below 10% across all tested datasets.
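A query-specific token cap derived from the solution part of thinking responses might look like the sketch below. Several details are assumptions made for illustration and are not stated in the abstract: the `</think>` delimiter, whitespace tokenization as a stand-in for a real tokenizer, and taking the maximum solution length as the cap.

```python
THINK_END = "</think>"  # assumed delimiter between reasoning and solution


def solution_component(response):
    """Return the text after the thinking block, or the whole response if none."""
    _, sep, tail = response.partition(THINK_END)
    return tail.strip() if sep else response.strip()


def token_count(text):
    # Whitespace splitting stands in for a model tokenizer (assumption).
    return len(text.split())


def non_thinking_limit(thinking_responses):
    """Per-query cap: the longest solution component among thinking rollouts."""
    return max(token_count(solution_component(r)) for r in thinking_responses)


def exceeds_limit(non_thinking_response, limit):
    """A 'non-thinking' response longer than the cap is suspicious:
    it may contain hidden reasoning."""
    return token_count(non_thinking_response) > limit


rollouts = [
    "<think>try a few cases first</think> The answer is 4.",
    "<think>add them</think> Adding gives 2 + 2 = 4",
]
limit = non_thinking_limit(rollouts)
print(limit, exceeds_limit("The answer is 4.", limit))
```

Because the cap comes from the same query's own solutions, easy queries get tight limits and hard ones get looser limits, which is the contrast with a uniform token limit that the abstract draws.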

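[AI-44] APEX: Academic Poster Editing Agentic Expert

TL;DR: This paper addresses the difficulty of balancing content density and layout complexity in academic poster design, in particular that existing paper-to-poster generation methods are mostly single-pass and non-interactive and struggle to satisfy complex, subjective user intent. The key to its solution is APEX (Academic Poster Editing agentic eXpert), the first agentic framework supporting interactive editing, which enables precise modifications through fine-grained control, multi-level API-based editing, and a review-and-adjustment mechanism. It also builds the APEX-Bench benchmark, covering 514 diverse instructions, and adopts a multi-dimensional evaluation protocol with a vision-language model (VLM) as judge to systematically verify the method's advantages in instruction fulfillment, modification scope, and visual consistency.

Link: https://arxiv.org/abs/2601.04794
Authors: Chengxin Shi,Qinnan Cai,Zeyuan Chen,Long Zeng,Yibo Zhao,Jing Yu,Jianxiang Yu,Xiang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: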

Abstract:Designing academic posters is a labor-intensive process requiring the precise balance of high-density content and sophisticated layout. While existing paper-to-poster generation methods automate initial drafting, they are typically single-pass and non-interactive, and often fail to align with complex, subjective user intent. To bridge this gap, we propose APEX (Academic Poster Editing agentic eXpert), the first agentic framework for interactive academic poster editing, supporting fine-grained control with robust multi-level API-based editing and a review-and-adjustment mechanism. In addition, we introduce APEX-Bench, the first systematic benchmark comprising 514 academic poster editing instructions, categorized by a multi-dimensional taxonomy including operation type, difficulty, and abstraction level, constructed via reference-guided and reference-free strategies to ensure realism and diversity. We further establish a multi-dimensional VLM-as-a-judge evaluation protocol to assess instruction fulfillment, modification scope, and visual consistency and harmony. Experimental results demonstrate that APEX significantly outperforms baseline methods. Our implementation is available at this https URL.

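[AI-45] AgentOCR: Reimagining Agent History via Optical Self-Compression

TL;DR: This paper addresses the sharp growth in token budget and memory consumption caused by accumulating textual history when agents based on Large Language Models (LLMs) interact over multiple turns. The key to its solution is the AgentOCR framework, whose core contributions are: (1) representing the observation-action history as a compact image to exploit the higher information density of visual tokens; (2) segment optical caching, which decomposes the history into hashable segments and maintains a visual cache to avoid redundant re-rendering; and (3) agentic self-compression, in which the agent actively emits a compression rate and is trained with a compression-aware reward to balance task success and token efficiency dynamically. Experiments show that on ALFWorld and search-based QA the method retains over 95% of the text-based baseline's performance while reducing token consumption by 50% and delivering a 20x rendering speedup.

Link: https://arxiv.org/abs/2601.04786
Authors: Lang Feng,Fuchao Yang,Feng Chen,Xin Cheng,Haiyang Xu,Zhenglin Wan,Ming Yan,Bo An
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Work in progress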

Abstract:Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching. By decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (50%), yielding consistent token and memory efficiency. Our further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.
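Segment-level caching keyed by a content hash can be sketched as follows. This is a minimal illustration, not AgentOCR's code: `SegmentCache`, `fake_render`, and the use of SHA-256 are assumptions, and the renderer returns a placeholder string where a real system would produce an image.

```python
import hashlib


class SegmentCache:
    """Render each distinct history segment once and reuse the result."""

    def __init__(self, render_fn):
        self._render = render_fn
        self._cache = {}
        self.render_calls = 0  # counts cache misses, for illustration

    def render_history(self, segments):
        tiles = []
        for segment in segments:
            key = hashlib.sha256(segment.encode("utf-8")).hexdigest()
            if key not in self._cache:
                self._cache[key] = self._render(segment)
                self.render_calls += 1
            tiles.append(self._cache[key])
        return tiles


def fake_render(segment):
    # Placeholder for an actual text-to-image renderer (assumption).
    return f"<image of {len(segment)} chars>"


cache = SegmentCache(fake_render)
cache.render_history(["obs: kitchen", "act: open fridge"])
cache.render_history(["obs: kitchen", "act: open fridge", "obs: fridge open"])
print(cache.render_calls)  # only the new third segment triggers a second-turn render
```

In a multi-turn rollout the history grows by a segment or two per turn, so hashing lets every earlier segment be served from the cache instead of being re-rendered.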

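[AI-46] SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence

TL;DR: This paper addresses the inadequacy of current evaluation standards for Large Language Models (LLMs) in scientific domains: existing benchmarks either test only superficial, format-level instruction following or measure only final-answer correctness, ignoring whether the reasoning process conforms to scientific logic and norms. The key to its solution is a new paradigm, Scientific Instruction Following (SciIF), built around a multi-discipline benchmark that quantifies how strictly a model adheres to scientific-validity constraints while solving problems. It rests on three pillars: scientific conditions (e.g., boundary checks and assumption verification), semantic stability (e.g., unit and symbol consistency), and specific processes (e.g., required numerical methods). SciIF is distinctive in its emphasis on auditability, requiring models to provide explicit evidence of constraint satisfaction rather than implicit compliance, which enables fine-grained diagnosis of compositional reasoning failures and helps ensure that LLMs operate reliably within a scientifically rigorous logical framework.

Link: https://arxiv.org/abs/2601.04770
Authors: Encheng Su,Jianyu Wu,Chen Tang,Lintao Wang,Pengze Li,Aoran Wang,Jinouwen Zhang,Yizhou Wang,Yuan Meng,Xinzhu Ma,Shixiang Tang,Houqiang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: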

Abstract:As large language models (LLMs) transition from general knowledge retrieval to complex scientific discovery, their evaluation standards must also incorporate the rigorous norms of scientific inquiry. Existing benchmarks exhibit a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks assess only final-answer correctness, often rewarding models that arrive at the right result with the wrong reasons. To address this gap, we introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (e.g., boundary checks and assumptions), semantic stability (e.g., unit and symbol conventions), and specific processes (e.g., required numerical methods). Uniquely, SciIF emphasizes auditability, requiring models to provide explicit evidence of constraint satisfaction rather than implicit compliance. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures, ensuring that LLMs can function as reliable agents within the strict logical frameworks of science.

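[AI-47] Orion-RAG: Path-Aligned Hybrid Retrieval for Graphless Data

TL;DR: This paper addresses the challenges Retrieval-Augmented Generation (RAG) faces with discrete, fragmented data, especially when information is scattered across isolated files (such as reports and logs) without explicit links, so that conventional search engines, which process each file independently, cannot exploit cross-file information. The key to its solution is Orion-RAG, a lightweight path-extraction strategy. Its core idea is that concise semantic connection paths can be built automatically from fragmented documents without heavy algorithms, turning unstructured text into semi-structured data and linking knowledge across files. The method significantly outperforms mainstream frameworks in experiments across multiple domains, supports real-time updates and human verification, and is highly cost-efficient.

Link: https://arxiv.org/abs/2601.04764
Authors: Zhen Chen,Weihao Xie,Peilin Chen,Shiqi Wang,Jianping Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: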

Abstract:Retrieval-Augmented Generation (RAG) has proven effective for knowledge synthesis, yet it encounters significant challenges in practical scenarios where data is inherently discrete and fragmented. In most environments, information is distributed across isolated files like reports and logs that lack explicit links. Standard search engines process files independently, ignoring the connections between them. Furthermore, manually building Knowledge Graphs is impractical for such vast data. To bridge this gap, we present Orion-RAG. Our core insight is simple yet effective: we do not need heavy algorithms to organize this data. Instead, we use a low-complexity strategy to extract lightweight paths that naturally link related concepts. We demonstrate that this streamlined approach suffices to transform fragmented documents into semi-structured data, enabling the system to link information across different files effectively. Extensive experiments demonstrate that Orion-RAG consistently outperforms mainstream frameworks across diverse domains, supporting real-time updates and explicit Human-in-the-Loop verification with high cost-efficiency. Experiments on FinanceBench demonstrate superior precision with a 25.2% relative improvement over strong baselines.

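[AI-48] Smart IoT-Based Wearable Device for Detection and Monitoring of Common Cow Diseases Using a Novel Machine Learning Technique

TL;DR: This paper addresses the low efficiency, poor accuracy, and high cost of disease detection that result from manual observation and monitoring in large-scale dairy cow farming. The core challenge is that traditional methods struggle to identify the symptoms of sick cows in time and depend heavily on human labor, which affects animal health and farm productivity. The key to its solution is to build an Internet of Things (IoT)-based Cyber-Physical System and to propose a novel Machine Learning (ML) algorithm that collects and analyzes physiological and behavioral data from cows to jointly predict multiple common diseases with high accuracy, improving the automation, reliability, and cost-effectiveness of health monitoring.

Link: https://arxiv.org/abs/2601.04761
Authors: Rupsa Rani Mishra,D. Chandrasekhar Rao,Ajaya Kumar Tripathy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: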

Abstract:Manual observation and monitoring of individual cows for disease detection present significant challenges in large-scale farming operations, as the process is labor-intensive, time-consuming, and prone to reduced accuracy. The reliance on human observation often leads to delays in identifying symptoms, as the sheer number of animals can hinder timely attention to each cow. Consequently, the accuracy and precision of disease detection are significantly compromised, potentially affecting animal health and overall farm productivity. Furthermore, organizing and managing human resources for the manual observation and monitoring of cow health is a complex and economically demanding task. It necessitates the involvement of skilled personnel, thereby contributing to elevated farm maintenance costs and operational inefficiencies. Therefore, the development of an automated, low-cost, and reliable smart system is essential to address these challenges effectively. Although several studies have been conducted in this domain, very few have simultaneously considered the detection of multiple common diseases with high prediction accuracy. However, advancements in Internet of Things (IoT), Machine Learning (ML), and Cyber-Physical Systems have enabled the automation of cow health monitoring with enhanced accuracy and reduced operational costs. This study proposes an IoT-enabled Cyber-Physical System framework designed to monitor the daily activities and health status of cows. A novel ML algorithm is proposed for the diagnosis of common cow diseases using collected physiological and behavioral data. The algorithm is designed to predict multiple diseases by analyzing a comprehensive set of recorded physiological and behavioral features, enabling accurate and efficient health assessment.

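[AI-49] When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail

TL;DR: This paper addresses the high computational overhead that explicit communication imposes on multi-agent AI systems in complex reasoning tasks, while aiming to keep their modularity benefits. Its core solution is to "internalize" the behaviors of specialized agents as a skill library, yielding an architecture in which a single agent completes tasks through skill selection instead of inter-agent communication. The key contribution is the study of this skill-selection mechanism, which exhibits a phase transition analogous to human cognitive capacity limits: once the skill library reaches a critical size, selection accuracy drops sharply rather than degrading gradually, and the degradation is driven mainly by confusion among semantically similar skills rather than by library size alone. This finding reveals a fundamental scaling bottleneck in semantic-based skill selection and suggests structural remedies such as hierarchical routing to improve scalability.

Link: https://arxiv.org/abs/2601.04748
Authors: Xiaoxiao Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 25 pages, technical report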

Abstract:Multi-agent AI systems have proven effective for complex reasoning. These systems are composed of specialized agents, which collaborate through explicit communication, but incur substantial computational overhead. A natural question arises: can we achieve similar modularity benefits with a single agent that selects from a library of skills? We explore this question by viewing skills as internalized agent behaviors. From this perspective, a multi-agent system can be compiled into an equivalent single-agent system, trading inter-agent communication for skill selection. Our preliminary experiments suggest this approach can substantially reduce token usage and latency while maintaining competitive accuracy on reasoning benchmarks. However, this efficiency raises a deeper question that has received little attention: how does skill selection scale as libraries grow? Drawing on principles from cognitive science, we propose that LLM skill selection exhibits bounded capacity analogous to human decision-making. We investigate the scaling behavior of skill selection and observe a striking pattern. Rather than degrading gradually, selection accuracy remains stable up to a critical library size, then drops sharply, indicating a phase transition reminiscent of capacity limits in human cognition. Furthermore, we find evidence that semantic confusability among similar skills, rather than library size alone, plays a central role in this degradation. This perspective suggests that hierarchical organization, which has long helped humans manage complex choices, may similarly benefit AI systems. Our initial results with hierarchical routing support this hypothesis. This work opens new questions about the fundamental limits of semantic-based skill selection in LLMs and offers a cognitive-grounded framework and practical guidelines for designing scalable skill-based agents.
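Hierarchical routing replaces one choice over the whole skill library with a choice of category followed by a choice within that category. The sketch below only illustrates that structure: the `route` function, the library layout, and word-overlap scoring are hypothetical stand-ins for the LLM-based selection the paper studies.

```python
def overlap(query, description):
    """Number of shared lowercase words; a crude stand-in for semantic matching."""
    return len(set(query.lower().split()) & set(description.lower().split()))


def route(query, library):
    """Two-level selection.

    library maps category -> (category_description, {skill: skill_description}).
    """
    category = max(library, key=lambda c: overlap(query, library[c][0]))
    skills = library[category][1]
    skill = max(skills, key=lambda s: overlap(query, skills[s]))
    return category, skill


library = {
    "math": ("solve math equation arithmetic", {
        "algebra": "solve an equation for x",
        "arithmetic": "add or multiply numbers",
    }),
    "text": ("summarize or translate text", {
        "summarize": "summarize a document",
        "translate": "translate text to french",
    }),
}
print(route("solve the equation for x", library))
```

Each decision is made among only a handful of candidates, which is how a hierarchy can keep every individual choice below the critical library size described in the abstract.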

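[AI-50] KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

TL;DR: This paper addresses the problem that existing long-horizon memory benchmarks rely mainly on multi-turn dialogues or synthetic user histories, so that retrieval performance does not accurately reflect person understanding. The key to its solution is a publicly releasable benchmark, KnowMe-Bench, built from long-form autobiographical narratives in which actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. Each narrative is reconstructed into a flashback-aware, time-anchored stream, and models are evaluated with evidence-linked questions covering factual recall, subjective state attribution, and principle-level reasoning, giving a more realistic measure of memory and reasoning over time.

Link: https://arxiv.org/abs/2601.04745
Authors: Tingyu Wu,Zhisheng Chen,Ziyan Weng,Shuhe Wang,Chenglong Li,Shuo Zhang,Sen Hu,Silin Wu,Qizhen Lan,Huacan Wang,Ronghao Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: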

Abstract:Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at this https URL.

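[AI-51] Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling

TL;DR: This paper addresses the weakly supervised learning problem in detecting medical conditions from speech acoustics, namely how to link a single, often noisy session-level label to subtle patterns within long, complex audio recordings. The task is made harder by data scarcity and the subjectivity of clinical annotation. The key to its solution is a novel audio-only Semi-Supervised Learning (SSL) framework that jointly learns frame-level, segment-level, and session-level representations to explicitly model the hierarchy underlying the non-uniform expression of pathological traits in a patient's speech, and that dynamically aggregates these multi-granularity features to generate high-quality pseudo-labels and use unlabeled data efficiently. The method is model-agnostic, robust across languages and conditions, and highly data-efficient, reaching 90% of fully supervised performance with only 11 labeled samples.

Link: https://arxiv.org/abs/2601.04744
Authors: Xingyuan Li,Mengyue Wu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: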

Abstract:Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity and the subjective nature of clinical annotations. While semi-supervised learning (SSL) offers a viable path to leverage unlabeled data, existing audio methods often fail to address the core challenge that pathological traits are not uniformly expressed in a patient’s speech. We propose a novel, audio-only SSL framework that explicitly models this hierarchy by jointly learning from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Our end-to-end approach dynamically aggregates these multi-granularity features and generates high-quality pseudo-labels to efficiently utilize unlabeled data. Extensive experiments show the framework is model-agnostic, robust across languages and conditions, and highly data-efficient, achieving, for instance, 90% of fully-supervised performance using only 11 labeled samples. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.

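[AI-52] Fast Mining and Dynamic Time-to-Event Prediction over Multi-sensor Data Streams KDD2026

TL;DR: This paper addresses the problem of continuously predicting when a machine failure will occur from real-time sensor data streams (that is, predicting the time of a future event). The core challenge is that real-world data streams are dynamic, with underlying patterns that evolve over time, and traditional static models adapt poorly to such change. The key to its solution is the TimeCast framework, which has three properties: (a) Dynamic: it identifies distinct time-evolving pattern stages and learns an individual model for each, enabling adaptive prediction based on pattern shifts; (b) Practical: it finds meaningful stages that capture time-varying dependencies among multiple sensors, improving prediction performance; (c) Scalable: its complexity is linear in the input size and it supports online updates, making it suitable for large-scale data streams. Experiments show that TimeCast is more accurate than state-of-the-art methods while significantly reducing computation time.

Link: https://arxiv.org/abs/2601.04741
Authors: Kota Nakamura,Koki Kawabata,Yasuko Matsubara,Yasushi Sakurai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026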

Abstract:Given real-time sensor data streams obtained from machines, how can we continuously predict when a machine failure will occur? This work aims to continuously forecast the timing of future events by analyzing multi-sensor data streams. A key characteristic of real-world data streams is their dynamic nature, where the underlying patterns evolve over time. To address this, we present TimeCast, a dynamic prediction framework designed to adapt to these changes and provide accurate, real-time predictions of future event time. Our proposed method has the following properties: (a) Dynamic: it identifies the distinct time-evolving patterns (i.e., stages) and learns individual models for each, enabling us to make adaptive predictions based on pattern shifts. (b) Practical: it finds meaningful stages that capture time-varying interdependencies between multiple sensors and improve prediction performance. (c) Scalable: our algorithm scales linearly with the input size and enables online model updates on data streams. Extensive experiments on real datasets demonstrate that TimeCast provides higher prediction accuracy than state-of-the-art methods while finding dynamic changes in data streams with a great reduction in computational time.

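[AI-53] Excess Description Length of Learning Generalizable Predictors

TL;DR: This paper asks whether the capability gains observed when fine-tuning language models come from eliciting latent capabilities or from genuinely teaching new ones. To answer this, the authors propose a formal information-theoretic framework whose central quantity is Excess Description Length (EDL). Defined via prequential coding, EDL measures the gap between the bits needed to encode the training labels sequentially with an evolving, online-trained model and the residual encoding cost under the final trained model. The quantity comes with theoretical guarantees: it is non-negative in expectation, converges to surplus description length in the infinite-data limit, and yields bounds on expected generalization gain. The framework gives a rigorous mathematical basis for distinguishing capability elicitation from teaching and for the qualitatively different scaling behavior of the two.

Link: https://arxiv.org/abs/2601.04728
Authors: Elizabeth Donoway,Hailey Joren,Fabien Roger,Jan Leike
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: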

Abstract:Understanding whether fine-tuning elicits latent capabilities or teaches new ones is a fundamental question for language model evaluation and safety. We develop a formal information-theoretic framework for quantifying how much predictive structure fine-tuning extracts from the train dataset and writes into a model’s parameters. Our central quantity, Excess Description Length (EDL), is defined via prequential coding and measures the gap between the bits required to encode training labels sequentially using an evolving model (trained online) and the residual encoding cost under the final trained model. We establish that EDL is non-negative in expectation, converges to surplus description length in the infinite-data limit, and provides bounds on expected generalization gain. Through a series of toy models, we clarify common confusions about information in learning: why random labels yield EDL near zero, how a single example can eliminate many bits of uncertainty about the underlying rule(s) that describe the data distribution, why structure learned on rare inputs contributes proportionally little to expected generalization, and how format learning creates early transients distinct from capability acquisition. This framework provides rigorous foundations for the empirical observation that capability elicitation and teaching exhibit qualitatively distinct scaling signatures.
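The definition of EDL as a prequential code length minus the residual cost under the final model can be made concrete on a toy problem. The sketch below is an illustration under loud assumptions, not the paper's setup: the "model" is a Bernoulli predictor over binary labels, it is "trained online" with the Laplace add-one rule, and all function names are hypothetical.

```python
import math


def prequential_bits(labels):
    """Bits to encode labels one at a time, each with the model fit to earlier labels."""
    ones = zeros = 0
    bits = 0.0
    for y in labels:
        p_one = (ones + 1) / (ones + zeros + 2)  # Laplace add-one estimate
        bits += -math.log2(p_one if y == 1 else 1 - p_one)
        ones += y
        zeros += 1 - y
    return bits


def final_model_bits(labels):
    """Bits to encode all labels with the model obtained after seeing every label."""
    p_one = (sum(labels) + 1) / (len(labels) + 2)
    return sum(-math.log2(p_one if y == 1 else 1 - p_one) for y in labels)


def excess_description_length(labels):
    return prequential_bits(labels) - final_model_bits(labels)


labels = [0, 0, 0, 0]
print(prequential_bits(labels), final_model_bits(labels),
      excess_description_length(labels))
```

For four identical labels the online code length is log2(5) bits, while the final model pays 4·log2(6/5) bits, so the difference is positive: those are the bits spent while the model was still learning the rule.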

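[AI-54] ThinkDrive: Chain-of-Thought Guided Progressive Reinforcement Learning Fine-Tuning for Autonomous Driving

TL;DR: This paper addresses the unstructured reasoning, poor generalization, and misalignment with human driving intent that limit current applications of Large Language Models (LLMs) to autonomous driving. Although Chain-of-Thought (CoT) reasoning improves decision transparency, conventional Supervised Fine-Tuning (SFT) does not fully exploit it, and Reinforcement Learning (RL) methods suffer from unstable training and insufficient reasoning depth. The key to its solution is ThinkDrive, a CoT-guided progressive RL fine-tuning framework with a two-stage training strategy: the first stage performs SFT on CoT explanations to establish structured reasoning; the second stage applies a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity according to sample complexity during RL, improving model performance and robustness.

Link: https://arxiv.org/abs/2601.04714
Authors: Chang Zhao,Zheming Yang,Yunqing Hu,Qi Guo,Zijian Wang,Pengcheng Li,Wen Ji
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: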

Abstract:With the rapid advancement of large language model (LLM) technologies, their application in the domain of autonomous driving has become increasingly widespread. However, existing methods suffer from unstructured reasoning, poor generalization, and misalignment with human driving intent. While Chain-of-Thought (CoT) reasoning enhances decision transparency, conventional supervised fine-tuning (SFT) fails to fully exploit its potential, and reinforcement learning (RL) approaches face instability and suboptimal reasoning depth. We propose ThinkDrive, a CoT-guided progressive RL fine-tuning framework for autonomous driving that synergizes explicit reasoning with difficulty-aware adaptive policy optimization. Our method employs a two-stage training strategy. First, we perform SFT using CoT explanations. Then, we apply progressive RL with a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity based on sample complexity. We evaluate our approach on a public dataset. The results show that ThinkDrive outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on exam, easy-exam, and accuracy, respectively. Moreover, a 2B-parameter model trained with our method surpasses the much larger GPT-4o by 3.28% on the exam metric.

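[AI-55] Bridging Temporal and Textual Modalities: A Multimodal Framework for Automated Cloud Failure Root Cause Analysis

TL;DR: This paper addresses the difficulty of fusing heterogeneous data sources for Root Cause Analysis (RCA) in modern cloud infrastructure, in particular the modality mismatch between time-series performance metrics and the discrete token-based architecture of Large Language Models (LLMs). Current methods cannot effectively apply LLMs' strength in textual reasoning to continuous numerical data with temporal dependencies, which limits their potential for automating incident management. The key to its solution is a cross-modal diagnostic framework built on three techniques: (1) semantic compression, which turns temporal segments into single-token abstractions that preserve pattern semantics; (2) an alignment encoder based on gated cross-attention, which projects time-series features into the latent space of a pretrained language model; and (3) a retrieval-augmented diagnostic pipeline that incorporates historical incident knowledge for expert-level failure attribution. Experiments on six cloud-system benchmarks show 48.75% diagnostic accuracy, with notable gains in compound-failure scenarios, validating embedding-space alignment as an effective strategy for letting LLMs process multimodal telemetry data.

Link: https://arxiv.org/abs/2601.04709
Authors: Gijun Park
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: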

Abstract:Root cause analysis in modern cloud infrastructure demands sophisticated understanding of heterogeneous data sources, particularly time-series performance metrics that involve core failure signatures. While large language models demonstrate remarkable capabilities in textual reasoning, their discrete token-based architecture creates fundamental incompatibilities with continuous numerical sequences exhibiting temporal dependencies. Current methodologies inadequately address this modality mismatch, constraining the potential of language model-driven automation in incident management workflows. This paper presents a multimodal diagnostic framework that harmonizes time-series representations with pretrained language model embedding spaces. Our approach contributes three technical advances: (1) a semantic compression technique that distills temporal segments into single-token abstractions while preserving pattern semantics, (2) an alignment encoder utilizing gated cross-attention to project time-series features into language model latent space, and (3) a retrieval-augmented diagnostic pipeline that synthesizes aligned embeddings with historical incident knowledge for expert-level failure attribution. Comprehensive evaluation across six cloud system benchmarks demonstrates that our framework achieves leading performance, reaching 48.75% diagnostic accuracy with notable improvements on scenarios involving compound failure modes. The results validate embedding-space alignment as an effective strategy for enabling language models to reason over multimodal telemetry data in production incident response contexts.

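[AI-56] MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training

TL;DR: This paper addresses the scalability problems of multi-GPU training for Graph Neural Networks (GNNs), which stem from inefficient mini-batch generation, data-transfer bottlenecks, and costly inter-GPU synchronization. Existing training frameworks cannot overlap these computation and communication stages, leaving resources underutilized. The key to its solution is MQ-GNN, a multi-queue pipelined training framework. It introduces the Ready-to-Update Asynchronous Consistent Model (RaCoM), which enables asynchronous gradient sharing and model updates while ensuring global consistency through adaptive periodic synchronization, and combines it with global neighbor sampling and caching to cut data-transfer overhead and an adaptive queue-sizing strategy to balance computation and memory efficiency, significantly improving training speed and GPU utilization.

Link: https://arxiv.org/abs/2601.04707
Authors: Irfan Ullah,Young-Koo Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: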

Abstract:Graph Neural Networks (GNNs) are powerful tools for learning graph-structured data, but their scalability is hindered by inefficient mini-batch generation, data transfer bottlenecks, and costly inter-GPU synchronization. Existing training frameworks fail to overlap these stages, leading to suboptimal resource utilization. This paper proposes MQ-GNN, a multi-queue pipelined framework that maximizes training efficiency by interleaving GNN training stages and optimizing resource utilization. MQ-GNN introduces Ready-to-Update Asynchronous Consistent Model (RaCoM), which enables asynchronous gradient sharing and model updates while ensuring global consistency through adaptive periodic synchronization. Additionally, it employs global neighbor sampling with caching to reduce data transfer overhead and an adaptive queue-sizing strategy to balance computation and memory efficiency. Experiments on four large-scale datasets and ten baseline models demonstrate that MQ-GNN achieves up to 4.6× faster training time and 30% improved GPU utilization while maintaining competitive accuracy. These results establish MQ-GNN as a scalable and efficient solution for multi-GPU GNN training.
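The general pattern of overlapping training stages through bounded queues can be sketched with Python threads. This shows only the pipelining idea: the stage functions, queue sizes, and names are hypothetical, no GPU work is done, and RaCoM's asynchronous gradient sharing and synchronization are not modeled.

```python
import queue
import threading

STOP = None  # sentinel that tells a stage to shut down


def run_stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, push results to outbox until STOP arrives."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(fn(item))


def run_pipeline(batch_ids, queue_size=2):
    ids = queue.Queue(maxsize=queue_size)        # bounded queues cap memory use
    sampled = queue.Queue(maxsize=queue_size)
    on_device = queue.Queue(maxsize=queue_size)
    finished = queue.Queue()                     # unbounded so the last stage never blocks
    stages = [
        (lambda b: {"id": b, "nodes": [b, b + 1]}, ids, sampled),   # mini-batch sampling
        (lambda mb: {**mb, "device": "gpu"}, sampled, on_device),   # pretend host-to-device copy
        (lambda mb: mb["id"], on_device, finished),                 # pretend training step
    ]
    threads = [threading.Thread(target=run_stage, args=s) for s in stages]
    for t in threads:
        t.start()
    for b in batch_ids:
        ids.put(b)
    ids.put(STOP)
    for t in threads:
        t.join()
    results = []
    while (item := finished.get()) is not STOP:
        results.append(item)
    return results


print(run_pipeline([0, 1, 2, 3]))
```

While one batch is in the training stage, the next can already be in transfer and a third in sampling; the queue size bounds how far ahead the earlier stages may run, which is the knob an adaptive queue-sizing strategy would tune.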

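[AI-57] Beyond Monolithic Architectures: A Multi-Agent Search and Knowledge Optimization Framework for Agentic Search

TL;DR: This paper addresses three structural bottlenecks in current agentic search systems based on Large Language Models (LLMs): unconstrained reasoning outputs that lengthen search trajectories; sparse outcome-level rewards that make credit assignment difficult; and stochastic noise in the search process that destabilizes learning. The key to its solution is the M-ASK (Multi-Agent Search and Knowledge) framework, which explicitly decouples agentic search into two complementary roles: Search Behavior Agents, which plan and execute search actions, and Knowledge Management Agents, which aggregate, filter, and maintain a compact internal context. This division of labor lets each agent focus on a specific subtask with less mutual interference, and M-ASK adds turn-level rewards that provide fine-grained supervision to stabilize the coordination of search decisions and knowledge updates, yielding higher answer accuracy and more stable training dynamics on multi-hop QA benchmarks.

Link: https://arxiv.org/abs/2601.04703
Authors: Yiqun Chen,Lingyong Yan,Zixuan Yang,Erhan Zhang,Jiashu Zhao,Shuaiqiang Wang,Dawei Yin,Jiaxin Mao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: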

Abstract:Agentic search has emerged as a promising paradigm for complex information seeking by enabling Large Language Models (LLMs) to interleave reasoning with tool use. However, prevailing systems rely on monolithic agents that suffer from structural bottlenecks, including unconstrained reasoning outputs that inflate trajectories, sparse outcome-level rewards that complicate credit assignment, and stochastic search noise that destabilizes learning. To address these challenges, we propose M-ASK (Multi-Agent Search and Knowledge), a framework that explicitly decouples agentic search into two complementary roles: Search Behavior Agents, which plan and execute search actions, and Knowledge Management Agents, which aggregate, filter, and maintain a compact internal context. This decomposition allows each agent to focus on a well-defined subtask and reduces interference between search and context construction. Furthermore, to enable stable coordination, M-ASK employs turn-level rewards to provide granular supervision for both search decisions and knowledge updates. Experiments on multi-hop QA benchmarks demonstrate that M-ASK outperforms strong baselines, achieving not only superior answer accuracy but also significantly more stable training dynamics. The source code for M-ASK is available at this https URL.

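[AI-58] SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning

TL;DR: This paper addresses the information overload caused by complex multi-task instructions in Sequential-Horizon Vision-and-Language Navigation (SH-VLN): existing models degrade significantly on long-horizon, multi-step language instructions and struggle to focus on the semantic details relevant to the current observation. The key to its solution is the SeqWalker framework, whose core components are: i) a High-Level Planner that dynamically decomposes the global instruction into contextually relevant sub-instructions based on the current visual observation, reducing cognitive load; and ii) a Low-Level Planner with an Exploration-Verification strategy that uses the inherent logical structure of the instructions to correct trajectory errors, improving navigation robustness.

Link: https://arxiv.org/abs/2601.04699
Authors: Zebin Han,Xudong Wang,Baichen Liu,Qi Lyu,Zhenduo Shang,Jiahua Dong,Lianqing Liu,Zhi Han
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: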

Abstract:Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario where agents should sequentially execute multi-task navigation guided by complex, long-horizon language instructions. Current vision-and-language navigation models exhibit significant performance degradation with such multi-task instructions, as information overload impairs the agent’s ability to attend to observationally relevant details. To address this problem, we propose SeqWalker, a navigation model built on a hierarchical planning framework. Our SeqWalker features: i) A High-Level Planner that dynamically selects global instructions into contextually relevant sub-instructions based on the agent’s current visual observations, thus reducing cognitive load; ii) A Low-Level Planner incorporating an Exploration-Verification strategy that leverages the inherent logical structure of instructions for trajectory error correction. To evaluate SH-VLN performance, we also extend the IVLN dataset and establish a new benchmark. Extensive experiments are performed to demonstrate the superiority of the proposed SeqWalker.

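[AI-59] Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning

[Quick read]: This paper addresses the failure of reinforcement learning models under out-of-distribution (OOD) rule changes, in particular when the environment's latent rule changes while the observation and action spaces stay fixed and existing methods struggle to maintain stable performance. The key to the solution is Tape, a controlled reinforcement learning benchmark built on one-dimensional cellular automata with explicit train/test splits, which allows OOD failure modes to be isolated and evaluated precisely. Using this standardized framework, the author systematically compares model-free methods, planning with learned world models, and task-inference (meta-RL) methods, and shows that methods that perform well in-distribution (ID) can still collapse out of distribution. The paper also contributes: (i) standardized OOD evaluation protocols, (ii) statistical reporting requirements (seeds, confidence intervals, and hypothesis tests), and (iii) information-theoretic tools (identities relating entropy reduction to conditional mutual information and expected posterior KL divergence) that clarify what "uncertainty reduction" objectives can and cannot guarantee under rule shifts.

Link: https://arxiv.org/abs/2601.04695
Authors: Enze Pan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 tables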

Abstract:We present Tape, a controlled reinforcement-learning benchmark designed to isolate out-of-distribution (OOD) failure under latent rule shifts. Tape is derived from one-dimensional cellular automata, enabling precise train/test splits where observation and action spaces are held fixed while transition rules change. Using a reproducible evaluation pipeline, we compare model-free baselines, model-based planning with learned world models, and task-inference (meta-RL) methods. A consistent pattern emerges: methods that are strong in-distribution (ID) can collapse under heldout-rule OOD, and high-variance OOD evaluation can make rankings unstable unless experiments are sufficiently powered. We provide (i) standardized OOD protocols, (ii) statistical reporting requirements (seeds, confidence intervals, and hypothesis tests), and (iii) information-theoretic identities connecting entropy reduction to conditional mutual information and expected posterior KL divergence, clarifying what “uncertainty reduction” objectives can and cannot guarantee under rule shifts.
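
To make the rule-shift setup concrete, here is a minimal sketch of a one-dimensional cellular automaton whose transition rule can be swapped while the state format stays fixed. It assumes Wolfram-style elementary rules (binary cells, radius 1, wraparound boundary); the abstract does not specify Tape's rule family, and the name `ca_step` and the rule split below are hypothetical illustrations, not the benchmark's actual protocol.

```python
def ca_step(cells, rule):
    """Apply one step of an elementary cellular automaton.

    cells: list of 0/1 values; rule: integer in [0, 255] (Wolfram numbering).
    The boundary wraps around. This rule family is an assumption for
    illustration only.
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The 3-cell neighborhood indexes one bit of the rule number.
        idx = (left << 2) | (center << 1) | right
        out.append((rule >> idx) & 1)
    return out


# Hypothetical split: same observation format, different latent rule at test time.
TRAIN_RULES = [30, 90]
HELDOUT_RULES = [110]

state = [0, 0, 1, 0, 0]
print(ca_step(state, 90))
print(ca_step(state, 110))
```

The cell array has the same shape under every rule, so in a split like this any performance drop on the held-out rule comes from the changed dynamics rather than from a changed observation or action space.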

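[AI-60] ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

[Quick read]: This paper addresses the fragility of Large Language Model-based Multi-Agent Systems (LLM-based MAS) under perturbations such as agent failures; existing work mostly relies on after-the-fact defenses and does not proactively design for inherent resilience. The key to the solution is the ResMAS framework, which has two stages: the first trains a reward model to predict system resilience and uses reinforcement learning to automatically generate a robust, task-specific communication topology; the second introduces a topology-aware prompt optimization method that refines each agent's prompt according to its connections with other agents. This two-stage design substantially improves the resilience of LLM-based MAS under various constraints and generalizes well across tasks and models.

Link: https://arxiv.org/abs/2601.04694
Authors: Zhilun Zhou,Zihan Liu,Jiahe Liu,Qingyu Shao,Yihan Wang,Kun Shao,Depeng Jin,Fengli Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: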

Abstract:Large Language Model-based Multi-Agent Systems (LLM-based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied the adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems. In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two-stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS’s resilience, based on which we train a topology generator to automatically design resilient topology for specific tasks through reinforcement learning. Second, we introduce a topology-aware prompt optimization method that refines each agent’s prompt based on its connections and interactions with other agents. Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.

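[AI-61] LLM-Guided Quantified SMT Solving over Uninterpreted Functions

[Quick read]: This paper addresses the difficulty of Satisfiability Modulo Theories (SMT) solving for quantified non-linear real arithmetic formulas with Uninterpreted Functions (UFs), where traditional quantifier instantiation methods lack a semantic understanding of UF constraints and cannot narrow the search space effectively. The key to the solution is the AquaForte framework, which uses Large Language Models (LLMs) for semantic guidance: structured prompts extract mathematical reasoning from the LLM to generate candidate function definitions that satisfy the constraints, which substantially reduces the solver's search complexity. Combined with adaptive instantiation and systematic validation, this keeps solving efficient while preserving completeness; on SMT-COMP benchmarks, AquaForte solves many instances on which mainstream solvers such as Z3 and CVC5 time out.

Link: https://arxiv.org/abs/2601.04675
Authors: Kunhang Lv,Yuhang Dong,Rui Han,Fuqi Jia,Feifei Ma,Jian Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: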

Abstract:Quantified formulas with Uninterpreted Functions (UFs) over non-linear real arithmetic pose fundamental challenges for Satisfiability Modulo Theories (SMT) solving. Traditional quantifier instantiation methods struggle because they lack semantic understanding of UF constraints, forcing them to search through unbounded solution spaces with limited guidance. We present AquaForte, a framework that leverages Large Language Models to provide semantic guidance for UF instantiation by generating instantiated candidates for function definitions that satisfy the constraints, thereby significantly reducing the search space and complexity for solvers. Our approach preprocesses formulas through constraint separation, uses structured prompts to extract mathematical reasoning from LLMs, and integrates the results with traditional SMT algorithms through adaptive instantiation. AquaForte maintains soundness through systematic validation: LLM-guided instantiations yielding SAT solve the original problem, while UNSAT results generate exclusion clauses for iterative refinement. Completeness is preserved by fallback to traditional solvers augmented with learned constraints. Experimental evaluation on SMT-COMP benchmarks demonstrates that AquaForte solves numerous instances where state-of-the-art solvers like Z3 and CVC5 timeout, with particular effectiveness on satisfiable formulas. Our work shows that LLMs can provide valuable mathematical intuition for symbolic reasoning, establishing a new paradigm for SMT constraint solving.

[AI-62] Estimating Causal Effects in Gaussian Linear SCMs with Finite Data ICML2025

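[Quick read]: This paper addresses the core challenge of accurately estimating causal effects from observational data in the presence of latent confounders, in particular in Gaussian Linear Structural Causal Models (GL-SCMs), where overparameterization makes estimation from finite samples difficult. The key to the solution is a simplified subclass, Centralized Gaussian Linear SCMs (CGL-SCMs), in which exogenous variables follow standardized distributions; this substantially reduces model complexity while leaving causal effect identifiability unchanged. The paper further designs a new parameter estimation method based on the expectation-maximization (EM) algorithm that learns CGL-SCM parameters and estimates identifiable causal effects from finite observational samples.

Link: https://arxiv.org/abs/2601.04673
Authors: Aurghya Maiti,Prateek Jain
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at the Workshop on Scaling Up Intervention Models at the 42nd International Conference on Machine Learning (ICML 2025)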

Abstract:Estimating causal effects from observational data remains a fundamental challenge in causal inference, especially in the presence of latent confounders. This paper focuses on estimating causal effects in Gaussian Linear Structural Causal Models (GL-SCMs), which are widely used due to their analytical tractability. However, parameter estimation in GL-SCMs is often infeasible with finite data, primarily due to overparameterization. To address this, we introduce the class of Centralized Gaussian Linear SCMs (CGL-SCMs), a simplified yet expressive subclass where exogenous variables follow standardized distributions. We show that CGL-SCMs are equally expressive in terms of causal effect identifiability from observational distributions and present a novel EM-based estimation algorithm that can learn CGL-SCM parameters and estimate identifiable causal effects from finite observational samples. Our theoretical analysis is validated through experiments on synthetic data and benchmark causal graphs, demonstrating that the learned models accurately recover causal distributions.

[AI-63] Optimizing Path Planning using Deep Reinforcement Learning for UGVs in Precision Agriculture

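[Quick read]: This paper addresses path planning optimization for unmanned ground vehicles (UGVs) in precision agriculture, focusing on the limited adaptability of traditional grid-based search algorithms (such as A* and Dijkstra's algorithm) in dynamic agricultural environments. The key to the solution is deep reinforcement learning (DRL), in particular policy gradient methods for continuous action spaces such as Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3). In experiments in a three-dimensional dynamic environment built with ROS and Gazebo, the pretrained TD3 agent reaches a 95% success rate in the presence of moving obstacles, indicating that the approach is robust in complex dynamic environments while keeping both crops and the robot safe.

Link: https://arxiv.org/abs/2601.04668
Authors: Laukik Patade,Rohan Rane,Sandeep Pillai
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: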

Abstract:This study focuses on optimizing path planning for unmanned ground vehicles (UGVs) in precision agriculture using deep reinforcement learning (DRL) techniques in continuous action spaces. The research begins with a review of traditional grid-based methods, such as A* and Dijkstra’s algorithms, and discusses their limitations in dynamic agricultural environments, highlighting the need for adaptive learning strategies. The study then explores DRL approaches, including Deep Q-Networks (DQN), which demonstrate improved adaptability and performance in two-dimensional simulations. Enhancements such as Double Q-Networks and Dueling Networks are evaluated to further improve decision-making. Building on these results, the focus shifts to continuous action space models, specifically Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3), which are tested in increasingly complex environments. Experiments conducted in a three-dimensional environment using ROS and Gazebo demonstrate the effectiveness of continuous DRL algorithms in navigating dynamic agricultural scenarios. Notably, the pretrained TD3 agent achieves a 95 percent success rate in dynamic environments, demonstrating the robustness of the proposed approach in handling moving obstacles while ensuring safety for both crops and the robot.
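
For readers unfamiliar with TD3, the sketch below shows the two ingredients that distinguish its critic target from DDPG's in the standard formulation: target policy smoothing and the minimum over twin critics. It is a scalar, framework-free illustration; the function names and default values are assumptions for this example and are not taken from the paper.

```python
def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(mu_next, noise, noise_clip=0.5, low=-1.0, high=1.0):
    """Target policy smoothing: add clipped noise to the target actor's action,
    then clamp the result to the valid action range."""
    return clip(mu_next + clip(noise, -noise_clip, noise_clip), low, high)


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two target
    critics' estimates; no bootstrap on terminal transitions."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap


# q1_next and q2_next would be the two target critics evaluated at the
# smoothed action; here they are plain numbers.
action = smoothed_target_action(mu_next=0.9, noise=0.8)
target = td3_target(reward=1.0, done=False, q1_next=2.0, q2_next=3.0, gamma=0.5)
print(action, target)
```

Taking the minimum of the two critics counteracts value overestimation, which is the usual motivation for preferring TD3 over DDPG in continuous control.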

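[AI-64] Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

[Quick read]: This paper addresses prompt injection (PI) attacks against Large Language Models (LLMs) in real applications. Such attacks inject malicious instructions through diverse input vectors, and the injected instructions often lack clear semantic boundaries from the surrounding context, so existing defenses struggle to identify and block them. The key to the solution is InstruCoT, whose core contributions are: first, synthesizing diverse training data that covers many PI attack scenarios; and second, instruction-level chain-of-thought fine-tuning, which enables the LLM to identify and reject malicious instructions within complex contexts. The method significantly outperforms baselines on three dimensions (behavior deviation, privacy leakage, and harmful output) without degrading the model's utility.

Link: https://arxiv.org/abs/2601.04666
Authors: Zhiyuan Chang,Mingyang Li,Yuekai Huang,Ziyou Jiang,Xiaojun Jia,Qian Xiong,Junjie Wang,Zhaoyang Li,Qing Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 19 pages, 6 figures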

Abstract:Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation

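[AI-65] LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence

[Quick read]: This paper addresses the problem that current Automated Audio Captioning methods cannot fully exploit the reasoning capabilities of Large Language Models (LLMs) because audio features are not aligned across modalities with the LLM text embedding space. The key to the solution is the LAMB framework, whose core contributions are: 1) a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, tightly aligning audio and text embeddings at both the global and token levels; 2) a Two-Stream Adapter that extracts semantically enriched audio embeddings to improve alignment quality; and 3) a Token Guide that computes scores directly in the LLM text embedding space to steer the generation logits, making effective use of the LLM's reasoning capabilities. The framework achieves state-of-the-art performance on the AudioCaps dataset.

Link: https://arxiv.org/abs/2601.04658
Authors: Hyeongkeun Lee,Jongmin Choi,KiHyun Nam,Joon Son Chung
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures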

Abstract:Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally, leveraging the aligned audio embeddings, a proposed Token Guide directly computes scores within the LLM text embedding space to steer the output logits of generated captions. Experimental results confirm that our framework strengthens the reasoning capabilities of the LLM decoder, achieving state-of-the-art performance on AudioCaps.
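
As a reference point for the divergence named in the title, the sketch below computes the Cauchy-Schwarz divergence between two discrete distributions, using the form D_CS(p, q) = -log( sum(p*q) / sqrt(sum(p^2) * sum(q^2)) ). Some references use a squared variant that is exactly twice this value. How LAMB estimates the divergence between audio and text embeddings is not described in the abstract, so this discrete version and the function name are assumptions for illustration only.

```python
import math


def cs_divergence(p, q):
    """Cauchy-Schwarz divergence between two discrete distributions.

    Non-negative by the Cauchy-Schwarz inequality, symmetric in p and q, and
    zero when the two vectors are proportional. Requires sum(p*q) > 0,
    otherwise the logarithm is undefined.
    """
    cross = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return -math.log(cross / norm)


same = cs_divergence([0.5, 0.5], [0.5, 0.5])
far = cs_divergence([0.9, 0.1], [0.1, 0.9])
print(same, far)
```

Unlike the KL divergence, this quantity is symmetric and stays finite whenever the two distributions overlap at all, which is one common reason it is used as an alignment objective.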

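[AI-66] Vibe Coding an LLM-powered Theorem Prover

[Quick read]: This paper addresses a core challenge in automating formal proof: how to use Large Language Models (LLMs) to improve automated reasoning in the Isabelle/HOL theorem prover, in particular synthesizing proofs reliably and efficiently where traditional automation such as Sledgehammer fails. The key to the solution is a layered, cooperative framework: at the lower level, a stepwise prover in which the LLM proposes proof commands that Isabelle validates in a bounded search loop; at the upper level, a higher-level proof planner that generates structured Isar proof outlines and attempts to fill and repair the remaining gaps. The framework also integrates beam search, machine learning and reinforcement learning models for tactic reranking, premise selection with small transformer models, micro-RAG built from the Archive of Formal Proofs (AFP), and counterexample-guided proof repair, which makes LLMs more practical and robust on complex mathematical proof tasks.

Link: https://arxiv.org/abs/2601.04653
Authors: Zhe Hou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: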

Abstract:We present Isabellm, an LLM-powered theorem prover for Isabelle/HOL that performs fully automatic proof synthesis. Isabellm works with any local LLM on Ollama and APIs such as Gemini CLI, and it is designed to run on consumer grade computers. The system combines a stepwise prover, which uses large language models to propose proof commands validated by Isabelle in a bounded search loop, with a higher-level proof planner that generates structured Isar outlines and attempts to fill and repair remaining gaps. The framework includes beam search for tactics, tactics reranker ML and RL models, premise selection with small transformer models, micro-RAG for Isar proofs built from AFP, and counter-example guided proof repair. All the code is implemented by GPT 4.1 - 5.2, Gemini 3 Pro, and Claude 4.5. Empirically, Isabellm can prove certain lemmas that defeat Isabelle’s standard automation, including Sledgehammer, demonstrating the practical value of LLM-guided proof search. At the same time, we find that even state-of-the-art LLMs, such as GPT 5.2 Extended Thinking and Gemini 3 Pro struggle to reliably implement the intended fill-and-repair mechanisms with complex algorithmic designs, highlighting fundamental challenges in LLM code generation and reasoning. The code of Isabellm is available at this https URL

[AI-67] Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models

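[Quick read]: This paper addresses two core challenges in combining Large Reasoning Models (LRMs) with Retrieval-Augmented Generation (RAG): first, reasoning models typically reason from a single perspective and cannot perform deep, self-correcting logical verification over retrieved external documents; second, existing training paradigms rely too heavily on outcome-oriented reward signals, which guide complex multi-step reasoning poorly. The key to the solution is a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR), in which the Reasoner and Verifier critique each other's reasoning and are jointly optimized, without an external scoring model, using a process-aware advantage. This reward combines explicit observational signals with the model's internal uncertainty, improving both reasoning fidelity and verification rigor.

Link: https://arxiv.org/abs/2601.04651
Authors: Can Xu,Lingyong Yan,Jiayi Wu,Haosen Wang,Shuaiqiang Wang,Yuchen Li,Jizhou Huang,Dawei Yin,Xiang Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: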

Abstract:Recent advances in synergizing large reasoning models (LRMs) with retrieval-augmented generation (RAG) have shown promising results, yet two critical challenges remain: (1) reasoning models typically operate from a single, unchallenged perspective, limiting their ability to conduct deep, self-correcting reasoning over external documents, and (2) existing training paradigms rely excessively on outcome-oriented rewards, which provide insufficient signal for shaping the complex, multi-step reasoning process. To address these issues, we propose an Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other’s logic while being guided by process-aware advantage that requires no external scoring model. This reward combines explicit observational signals with internal model uncertainty to jointly optimize reasoning fidelity and verification rigor. Experiments on multiple benchmarks demonstrate the effectiveness of our method.

[AI-68] Beyond the “Truth”: Investigating Election Rumors on Truth Social During the 2024 Election

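[Quick read]: This paper addresses how to use Large Language Models (LLMs) to measure psychological belief dynamics and rumor propagation precisely in large-scale real-world data. The core challenge is that traditional methods struggle to classify massive, unstructured social media content with high precision and to quantify psychological effects, especially belief reinforcement in ideologically homogeneous networks. The key to the solution is a multistage Rumor Detection Agent that combines three components: (i) a fine-tuned RoBERTa classifier trained with synthetic data augmentation for initial content screening, (ii) precision keyword filtering for efficiency, and (iii) a two-pass LLM verification pipeline based on GPT-4o mini to ensure classification accuracy. The system identifies election rumors with high precision, quantifies the dose-response relationship of the "illusory truth effect" in a naturalistic setting, and reveals the rapid contagion of rumors in homogeneous networks, offering psychological science a scalable, replicable approach to large-scale empirical research.

Link: https://arxiv.org/abs/2601.04631
Authors: Etienne Casanova,R. Michael Alvarez
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: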

Abstract:Large language models (LLMs) offer unprecedented opportunities for analyzing social phenomena at scale. This paper demonstrates the value of LLMs in psychological measurement by (1) compiling the first large-scale dataset of election rumors on a niche alt-tech platform, (2) developing a multistage Rumor Detection Agent that leverages LLMs for high-precision content classification, and (3) quantifying the psychological dynamics of rumor propagation, specifically the “illusory truth effect” in a naturalistic setting. The Rumor Detection Agent combines (i) a synthetic data-augmented, fine-tuned RoBERTa classifier, (ii) precision keyword filtering, and (iii) a two-pass LLM verification pipeline using GPT-4o mini. The findings reveal that sharing probability rises steadily with each additional exposure, providing large-scale empirical evidence for dose-response belief reinforcement in ideologically homogeneous networks. Simulation results further demonstrate rapid contagion effects: nearly one quarter of users become “infected” within just four propagation iterations. Taken together, these results illustrate how LLMs can transform psychological science by enabling the rigorous measurement of belief dynamics and misinformation spread in massive, real-world datasets.

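[AI-69] AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering

[Quick read]: This paper addresses the instability, weak auditability, and lack of non-regression guarantees in the iterative improvement of Large Language Model (LLM) agents. Existing approaches such as population-based search or in-agent self-refinement can raise aggregate performance but often produce unpredictable improvement trajectories, which hinders failure analysis and version control. The key to the solution is to reframe agent development as release engineering through AgentDevel, an externalized, regression-aware release pipeline: an implementation-blind LLM critic extracts symptom-level quality signals from execution traces, script-based executable diagnosis produces auditable engineering specifications, and flip-centered gating prioritizes pass-to-fail and fail-to-pass changes, yielding stable, reproducible, and auditable improvement on a single canonical version line.

Link: https://arxiv.org/abs/2601.04620
Authors: Di Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: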

Abstract:Recent progress in large language model (LLM) agents has largely focused on embedding self-improvement mechanisms inside the agent or searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable and hard-to-audit improvement trajectories, making it difficult to guarantee non-regression or to reason about failures across versions. We reframe agent improvement as release engineering: agents are treated as shippable artifacts, and improvement is externalized into a regression-aware release pipeline. We introduce AgentDevel, a release engineering pipeline that iteratively runs the current agent, produces implementation-blind, symptom-level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip-centered gating. AgentDevel features three core designs: (i) an implementation-blind LLM critic that characterizes failure appearances without accessing agent internals, (ii) script-based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip-centered gating that prioritizes pass to fail regressions and fail to pass fixes as first-class evidence. Unlike population-based search or in-agent self-refinement, AgentDevel maintains a single canonical version line and emphasizes non-regression as a primary objective. Experiments on execution-heavy benchmarks demonstrate that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software development.
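
The idea of flip-centered gating can be sketched as a comparison of per-test-case outcomes between the current agent and a release candidate. The abstract does not give the actual promotion rule, so the rule below (no more than `max_regressions` pass-to-fail flips and at least one fail-to-pass fix) and all names are hypothetical.

```python
def flips(before, after):
    """Compare per-case pass/fail results (dicts of case_id -> bool).

    Returns (regressions, fixes): case ids that flipped pass -> fail and
    fail -> pass. Cases missing from `after` are treated as failing.
    """
    regressions = sorted(c for c in before if before[c] and not after.get(c, False))
    fixes = sorted(c for c in before if not before[c] and after.get(c, False))
    return regressions, fixes


def promote(before, after, max_regressions=0):
    """Hypothetical gate: promote the release candidate only if it fixes
    something and stays within the allowed number of regressions."""
    regressions, fixes = flips(before, after)
    return len(regressions) <= max_regressions and len(fixes) > 0


current = {"t1": True, "t2": False, "t3": True}
candidate = {"t1": True, "t2": True, "t3": True}
print(flips(current, candidate), promote(current, candidate))
```

Gating on individual flips rather than on the aggregate pass rate means a candidate that fixes two cases but breaks one is not treated the same as a candidate that fixes one case cleanly, even though both raise the score by one.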

[AI-70] DeepHalo: A Neural Choice Model with Controllable Context Effects

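[Quick read]: This paper addresses how to capture context effects in models of human decision-making, especially when alternatives have features: traditional models often assume choice behavior is independent of context and overlook higher-order interactions that arise as the composition of the choice set changes. The key to the solution is the DeepHalo framework, which retains feature information while allowing explicit control over the interaction order, so that context effects can be identified systematically by order and modeled interpretably. In the featureless setting, DeepHalo acts as a universal approximator of context-dependent choice functions, combining predictive performance with transparency.

Link: https://arxiv.org/abs/2601.04616
Authors: Shuhan Zhang,Zhi Wang,Rui Gao,Shuang Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: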

Abstract:Modeling human decision-making is central to applications such as recommendation, preference learning, and human-AI alignment. While many classic models assume context-independent choice behavior, a large body of behavioral research shows that preferences are often influenced by the composition of the choice set itself – a phenomenon known as the context effect or Halo effect. These effects can manifest as pairwise (first-order) or even higher-order interactions among the available alternatives. Recent models that attempt to capture such effects either focus on the featureless setting or, in the feature-based setting, rely on restrictive interaction structures or entangle interactions across all orders, which limits interpretability. In this work, we propose DeepHalo, a neural modeling framework that incorporates features while enabling explicit control over interaction order and principled interpretation of context effects. Our model enables systematic identification of interaction effects by order and serves as a universal approximator of context-dependent choice functions when specialized to a featureless setting. Experiments on synthetic and real-world datasets demonstrate strong predictive performance while providing greater transparency into the drivers of choice.

[AI-71] Evaluating Human and Machine Confidence in Phishing Email Detection: A Comparative Study

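[Quick read]: This paper addresses how to identify deceptive content such as phishing emails effectively; the core challenge is combining human cognition with machine learning models to improve detection accuracy and interpretability. The key to the approach is to train three interpretable machine learning algorithms (Logistic Regression, Decision Trees, and Random Forests) on TF-IDF features and semantic embeddings, and to compare model predictions against human evaluators' confidence ratings and linguistic observations. The study finds that the models are reasonably accurate but their confidence varies considerably, whereas human evaluators show more consistent confidence and draw on a wider range of linguistic cues; age affects detection performance, while language proficiency has little effect. These findings provide empirical grounding for transparent, collaborative human-AI systems and for improving human-AI cooperation on complex content analysis tasks.

Link: https://arxiv.org/abs/2601.04610
Authors: Paras Jain,Khushi Dhar,Olyemi E. Amujo,Esa M. Rantanen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the 2025 IEEE 7th International Conference on Cognitive Machine Intelligence (CogMI); 9 pages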

Abstract:Identifying deceptive content like phishing emails demands sophisticated cognitive processes that combine pattern recognition, confidence assessment, and contextual analysis. This research examines how human cognition and machine learning models work together to distinguish phishing emails from legitimate ones. We employed three interpretable algorithms (Logistic Regression, Decision Trees, and Random Forests), training them on both TF-IDF features and semantic embeddings, then compared their predictions against human evaluations that captured confidence ratings and linguistic observations. Our results show that machine learning models provide good accuracy rates, but their confidence levels vary significantly. Human evaluators, on the other hand, use a greater variety of language signs and retain more consistent confidence. We also found that while language proficiency has minimal effect on detection performance, aging does. These findings offer helpful direction for creating transparent AI systems that complement human cognitive functions, ultimately improving human-AI cooperation in challenging content analysis tasks.

[AI-72] Constitutional Classifiers: Efficient Production-Grade Defenses against Universal Jailbreaks

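[Quick read]: This paper addresses jailbreak attacks on deployed Large Language Models (LLMs), in which malicious users craft prompts that induce the model to produce content violating its safety guidelines. Earlier defenses suffer from high computational cost and excessive refusal rates, which makes them hard to use in production. The key to the solution is enhanced Constitutional Classifiers, whose core contributions are: 1) context-aware exchange classifiers that evaluate model outputs in their full conversational context, closing the gap left by examining responses in isolation; 2) a two-stage classifier cascade in which lightweight classifiers screen all traffic and only suspicious exchanges are escalated to more expensive classifiers, substantially lowering inference overhead; and 3) efficient linear probe classifiers ensembled with external classifiers, improving robustness and computational efficiency together. The resulting system cuts computational cost by 40x relative to the baseline exchange classifier while keeping a refusal rate of 0.05% on production traffic, and withstood universal jailbreaks in more than 1,700 hours of red-teaming.

Link: https://arxiv.org/abs/2601.04603
Authors: Hoagy Cunningham,Jerry Wei,Zihan Wang,Andrew Persic,Alwin Peng,Jordan Abderrachid,Raj Agarwal,Bobby Chen,Austin Cohen,Andy Dau,Alek Dimitriev,Rob Gilson,Logan Howard,Yijin Hua,Jared Kaplan,Jan Leike,Mu Lin,Christopher Liu,Vladimir Mikulik,Rohit Mittapalli,Clare O’Hara,Jin Pan,Nikhil Saxena,Alex Silverstein,Yue Song,Xunjie Yu,Giulio Zhou,Ethan Perez,Mrinank Sharma
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: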

Abstract:We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks – no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.
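
The two-stage cascade described above can be sketched as follows: a cheap scorer runs on every exchange, and the expensive scorer runs only when the cheap score crosses an escalation threshold. The scorer callables, threshold values, and function name are hypothetical placeholders; the abstract does not specify the actual classifiers or thresholds.

```python
def cascade_flags(exchange, cheap_score, expensive_score,
                  escalate_at=0.2, block_at=0.5):
    """Return True if the exchange should be blocked.

    cheap_score and expensive_score are callables mapping an exchange
    (prompt plus response text) to a harmfulness score in [0, 1].
    The thresholds are illustrative assumptions.
    """
    # Stage 1: the lightweight classifier screens all traffic.
    if cheap_score(exchange) < escalate_at:
        return False
    # Stage 2: only suspicious exchanges pay for the expensive classifier.
    return expensive_score(exchange) >= block_at


# Toy scorers standing in for real classifiers.
expensive_calls = []

def toy_cheap(text):
    return 0.9 if "attack" in text else 0.05

def toy_expensive(text):
    expensive_calls.append(text)
    return 0.8 if "attack" in text else 0.1

print(cascade_flags("how do I bake bread?", toy_cheap, toy_expensive))
print(cascade_flags("attack payload", toy_cheap, toy_expensive))
print(len(expensive_calls))
```

Because benign traffic usually dominates, most exchanges exit at stage 1, which is where the compute savings of a cascade come from; the escalation threshold trades that saving against the risk of the cheap classifier missing an attack.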

[AI-73] FedKDX: Federated Learning with Negative Knowledge Distillation for Enhanced Healthcare AI Systems

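[Quick read]: This paper addresses poor model generalization, high communication cost, and inefficient knowledge transfer for healthcare AI in Federated Learning (FL), which arise from non-IID (not independent and identically distributed) data and privacy requirements. The key to the solution is the FedKDX framework, which introduces Negative Knowledge Distillation (NKD): in addition to conventional positive knowledge transfer, it explicitly models non-target class information to make the model more robust to heterogeneous data. It also integrates contrastive learning and knowledge distillation in a unified architecture for efficient, privacy-preserving knowledge sharing, improving accuracy (by up to 2.53% over existing methods), convergence speed, and performance on non-IID data.

Link: https://arxiv.org/abs/2601.04587
Authors: Quang-Tu Pham,Hoang-Dieu Vu,Dinh-Dat Pham,Hieu H. Pham
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: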

Abstract:This paper introduces FedKDX, a federated learning framework that addresses limitations in healthcare AI through Negative Knowledge Distillation (NKD). Unlike existing approaches that focus solely on positive knowledge transfer, FedKDX captures both target and non-target information to improve model generalization in healthcare applications. The framework integrates multiple knowledge transfer techniques–including traditional knowledge distillation, contrastive learning, and NKD–within a unified architecture that maintains privacy while reducing communication costs. Through experiments on healthcare datasets (SLEEP, UCI-HAR, and PAMAP2), FedKDX demonstrates improved accuracy (up to 2.53% over state-of-the-art methods), faster convergence, and better performance on non-IID data distributions. Theoretical analysis supports NKD’s contribution to addressing statistical heterogeneity in distributed healthcare data. The approach shows promise for privacy-sensitive medical applications under regulatory frameworks like HIPAA and GDPR, offering a balanced solution between performance and practical implementation requirements in decentralized healthcare settings. The code and model are available at this https URL.

[AI-74] Autonomous Agents on Blockchains: Standards Execution Models and Trust Boundaries

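[Quick read]: This paper addresses the high-stakes systems challenge posed by interoperability between agents and blockchain systems: how to design standardized, interoperable, and secure interfaces that let agents observe on-chain state, formulate transaction intents, and authorize execution without exposing users, protocols, or organizations to unacceptable security, governance, or economic risks. The key contributions are a five-part taxonomy of integration patterns (read-only analytics, simulation and intent generation, delegated execution, autonomous signing, and multi-agent workflows); a threat model tailored to agent-driven transaction pipelines (covering prompt injection, policy misuse, key compromise, adversarial execution dynamics, and multi-agent collusion); and a comparative capability matrix analyzing more than 20 representative systems across 13 dimensions. The paper further proposes two core interface abstractions: a Transaction Intent Schema for portable, unambiguous goal specification, and a Policy Decision Record for auditable, verifiable policy enforcement across execution environments, providing a framework and a practical path toward safe, reliable, and economically robust agent-mediated on-chain execution.

Link: https://arxiv.org/abs/2601.04583
Authors: Saad Alqithami
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: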

Abstract:Advances in large language models have enabled agentic AI systems that can reason, plan, and interact with external tools to execute multi-step workflows, while public blockchains have evolved into a programmable substrate for value transfer, access control, and verifiable state transitions. Their convergence introduces a high-stakes systems challenge: designing standard, interoperable, and secure interfaces that allow agents to observe on-chain state, formulate transaction intents, and authorize execution without exposing users, protocols, or organizations to unacceptable security, governance, or economic risks. This survey systematizes the emerging landscape of agent-blockchain interoperability through a systematic literature review, identifying 317 relevant works from an initial pool of over 3000 records. We contribute a five-part taxonomy of integration patterns spanning read-only analytics, simulation and intent generation, delegated execution, autonomous signing, and multi-agent workflows; a threat model tailored to agent-driven transaction pipelines that captures risks ranging from prompt injection and policy misuse to key compromise, adversarial execution dynamics, and multi-agent collusion; and a comparative capability matrix analyzing more than 20 representative systems across 13 dimensions, including custody models, permissioning, policy enforcement, observability, and recovery. Building on the gaps revealed by this analysis, we outline a research roadmap centered on two interface abstractions: a Transaction Intent Schema for portable and unambiguous goal specification, and a Policy Decision Record for auditable, verifiable policy enforcement across execution environments. We conclude by proposing a reproducible evaluation suite and benchmarks for assessing the safety, reliability, and economic robustness of agent-mediated on-chain execution.

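[AI-75] Sci-Reasoning: A Dataset Decoding AI Innovation Patterns

[Quick read]: This paper addresses the limited understanding of the scientific reasoning process in AI research, in particular the lack of systematic data on how researchers identify knowledge gaps, integrate prior work, and generate new insights. The key to the solution is Sci-Reasoning, the first dataset focused on the intellectual synthesis behind high-quality AI research. Using community-validated quality signals and an LLM-accelerated, human-verified pipeline, it traces the specific reasoning links between Oral and Spotlight papers at NeurIPS, ICML, and ICLR (2023-2025) and their key predecessors, and presents them in a structured format. The analysis identifies 15 distinct thinking patterns, of which three dominant strategies (Gap-Driven Reframing, Cross-Domain Synthesis, and Representation Shift) account for 52.7%, and the most powerful innovation recipes combine multiple patterns. The dataset supports quantitative study of scientific progress and provides structured reasoning trajectories for training the next generation of AI research agents.

Link: https://arxiv.org/abs/2601.04577
Authors: Jiachen Liu,Maestro Harmon,Zechen Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 9 figures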

Abstract:While AI innovation accelerates rapidly, the intellectual process behind breakthroughs – how researchers identify gaps, synthesize prior work, and generate insights – remains poorly understood. The lack of structured data on scientific reasoning hinders systematic analysis and development of AI research agents. We introduce Sci-Reasoning, the first dataset capturing the intellectual synthesis behind high-quality AI research. Using community-validated quality signals and an LLM-accelerated, human-verified pipeline, we trace Oral and Spotlight papers across NeurIPS, ICML, and ICLR (2023-2025) to their key predecessors, articulating specific reasoning links in a structured format. Our analysis identifies 15 distinct thinking patterns, with three dominant strategies accounting for 52.7%: Gap-Driven Reframing (24.2%), Cross-Domain Synthesis (18.0%), and Representation Shift (10.5%). The most powerful innovation recipes combine multiple patterns: Gap-Driven Reframing + Representation Shift, Cross-Domain Synthesis + Representation Shift, and Gap-Driven Reframing + Cross-Domain Synthesis. This dataset enables quantitative studies of scientific progress and provides structured reasoning trajectories for training the next generation of AI research agents.

[AI-76] Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing

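[Quick Read]: This paper addresses how to improve the performance and causal reasoning of behavior cloning when training video game agents, in particular how policy learning behaves as model and data scale grow. The key to its solution is an open recipe comprising more than 8,300 hours of high-quality human gameplay data, complete training and inference code, and pretrained model checkpoints, together with an empirical study of the scaling laws that relate parameter count, training steps, and training data size to behavior cloning performance and causal reasoning. The study shows that increasing the amount of data and the network depth leads the model to learn a more causal policy, a finding that offers both a theoretical basis and a practical path toward general, generalizable foundation models for video games.

Link: https://arxiv.org/abs/2601.04575
Authors: Yuguang Yue,Irakli Salia,Samuel Hunt,Chris Green,Wenzhe Shi,Jonathan J Hunt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 16 figures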

Abstract:Behavior cloning is enjoying a resurgence in popularity as scaling both model and data sizes proves to provide a strong starting point for many tasks of interest. In this work, we introduce an open recipe for training a video game playing foundation model designed for inference in real time on a consumer GPU. We release all data (8300+ hours of high-quality human gameplay), training and inference code, and pretrained checkpoints under an open license. We show that our best model is capable of playing a variety of 3D video games at a level competitive with human play. We use this recipe to systematically examine the scaling laws of behavior cloning to understand how the model’s performance and causal reasoning vary with model and data scale. We first show in a simple toy problem that, for some types of causal reasoning, increasing both the amount of training data and the depth of the network results in the model learning a more causal policy. We then systematically study how causality varies with the number of parameters (and depth) and training steps in scaled models of up to 1.2 billion parameters, and we find similar scaling results to what we observe in the toy problem.

[AI-77] Spatial-Temporal Feedback Diffusion Guidance for Controlled Traffic Imputation

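[Quick Read]: This paper addresses missing-value imputation in spatial-temporal traffic data, in particular the problem that existing score-based diffusion models apply a uniform conditional guidance strength, so that at nodes with high missing rates the generated results drift away from the observations and performance degrades. The key to its solution is FENCE, which achieves adaptive guidance through two core mechanisms. First, a dynamic feedback mechanism adjusts the guidance scale based on posterior likelihood approximations, strengthening guidance when generated values diverge from observations and weakening it when they align, which avoids overcorrection. Second, nodes are clustered by their attention scores and a guidance scale is computed per cluster, using spatial-temporal correlations to give different nodes more precise conditional guidance and thereby significantly improving imputation accuracy.

Link: https://arxiv.org/abs/2601.04572
Authors: Xiaowei Mao,Huihu Ding,Yan Lin,Tingrui Wu,Shengnan Guo,Dazhuo Qiu,Feiling Fang,Jilin Hu,Huaiyu Wan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: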

Abstract:Imputing missing values in spatial-temporal traffic data is essential for intelligent transportation systems. Among advanced imputation methods, score-based diffusion models have demonstrated competitive performance. These models generate data by reversing a noising process, using observed values as conditional guidance. However, existing diffusion models typically apply a uniform guidance scale across both spatial and temporal dimensions, which is inadequate for nodes with high missing data rates. Sparse observations provide insufficient conditional guidance, causing the generative process to drift toward the learned prior distribution rather than closely following the conditional observations, resulting in suboptimal imputation performance. To address this, we propose FENCE, a spatial-temporal feedback diffusion guidance method designed to adaptively control guidance scales during imputation. First, FENCE introduces a dynamic feedback mechanism that adjusts the guidance scale based on the posterior likelihood approximations. The guidance scale is increased when generated values diverge from observations and reduced when alignment improves, preventing overcorrection. Second, because alignment to observations varies across nodes and denoising steps, a global guidance scale for all nodes is suboptimal. FENCE computes guidance scales at the cluster level by grouping nodes based on their attention scores, leveraging spatial-temporal correlations to provide more accurate guidance. Experimental results on real-world traffic datasets show that FENCE significantly enhances imputation accuracy.
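
The feedback rule described in this abstract (raise the guidance scale where generated values diverge from observations, lower it where they agree, and do so per cluster) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function name `update_guidance_scales`, the use of mean absolute error as a stand-in for the paper's posterior likelihood approximation, the fixed `step` and `tolerance`, and the precomputed cluster assignment. It is not the authors' implementation.

```python
from collections import defaultdict


def update_guidance_scales(scales, generated, observed, clusters,
                           tolerance=0.1, step=0.1,
                           min_scale=0.0, max_scale=5.0):
    """Return updated per-cluster guidance scales (hypothetical helper).

    scales:    dict cluster_id -> current guidance scale
    generated: dict node -> value produced at the current denoising step
    observed:  dict node -> observed value (only nodes that have one)
    clusters:  dict node -> cluster_id (assumed to be given)
    """
    # Collect the disagreement between generated and observed values per cluster.
    errors = defaultdict(list)
    for node, obs in observed.items():
        errors[clusters[node]].append(abs(generated[node] - obs))

    new_scales = dict(scales)
    for cluster_id, errs in errors.items():
        mean_err = sum(errs) / len(errs)
        if mean_err > tolerance:
            # Generated values diverge from observations: strengthen guidance.
            new_scales[cluster_id] = min(max_scale, scales[cluster_id] + step)
        else:
            # Generated values already agree: relax guidance to avoid overcorrection.
            new_scales[cluster_id] = max(min_scale, scales[cluster_id] - step)
    return new_scales


scales = {0: 1.0, 1: 1.0}
generated = {"a": 1.0, "b": 5.0, "c": 2.0}
observed = {"a": 1.02, "b": 3.0}          # node "c" has no observation
clusters = {"a": 0, "b": 1, "c": 1}
updated = update_guidance_scales(scales, generated, observed, clusters)
print(updated)
```

In this toy call, cluster 0 agrees with its observation and its scale is relaxed, while cluster 1 diverges and its scale is increased. Clusters without any observed node keep their previous scale.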

[AI-78] Enhancing Multimodal Retrieval via Complementary Information Extraction and Alignment ACL’2025

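[Quick Read]: This paper addresses the fact that existing multimodal retrieval methods generally overlook the complementary information between images and text: current models focus mainly on features that are similar across modalities and ignore information in images that does not correspond directly to the text but supplements it. The key to its solution is the CIEA (Complementary Information Extraction and Alignment) framework, which introduces a purpose-built complementary information extractor that, while aligning image and text representations in a unified latent space, identifies and preserves the differences in the image representations. The model is optimized with two complementary contrastive losses, so that it captures the complementary content of images while maintaining semantic integrity.

Link: https://arxiv.org/abs/2601.04571
Authors: Delong Zeng,Yuexiang Xie,Yaliang Li,Ying Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted by ACL’2025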

Abstract:Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing studies in multimodal retrieval focus on capturing information in multimodal data that is similar to their paired texts, but often ignore the complementary information contained in multimodal data. In this study, we propose CIEA, a novel multimodal retrieval approach that employs Complementary Information Extraction and Alignment, which transforms both text and images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve differences in the image representations. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at this https URL.

[AI-79] Reasoning Over Space: Enabling Geographic Reasoning for LLM-Based Generative Next POI Recommendation

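[Quick Read]: This paper addresses the difficulty that current recommender systems based on large language models (LLMs) have in exploiting geographic information in mobility and local-services scenarios. Existing methods recast recommendation as sequence generation but do not adequately model how geographic location influences user behavior. The key to its solution is the Reasoning Over Space (ROS) framework. ROS introduces a Hierarchical Spatial Semantic ID (SID) that encodes coarse-to-fine regions and point-of-interest (POI) semantics as compositional tokens, and designs a three-stage Mobility Chain-of-Thought (CoT) paradigm: it first models user personality, then constructs an intent-aligned candidate space, and finally performs locality-based pruning. It also aligns the model with real-world geography through spatial-guided reinforcement learning (RL), which significantly improves recommendation accuracy and cross-city transfer.

Link: https://arxiv.org/abs/2601.04562
Authors: Dongyi Lv,Qiuyu Ding,Heng-Da Xu,Zhaoxu Sun,Zhi Wang,Feng Xiong,Mu Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: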

Abstract:Generative recommendation with large language models (LLMs) reframes prediction as sequence generation, yet existing LLM-based recommenders remain limited in leveraging geographic signals that are crucial in mobility and local-services scenarios. Here, we present Reasoning Over Space (ROS), a framework that utilizes geography as a vital decision variable within the reasoning process. ROS introduces a Hierarchical Spatial Semantic ID (SID) that discretizes coarse-to-fine locality and POI semantics into compositional tokens, and endows the LLM with a three-stage Mobility Chain-of-Thought (CoT) paradigm that models user personality, constructs an intent-aligned candidate space, and performs locality-informed pruning. We further align the model with real-world geography via spatial-guided Reinforcement Learning (RL). Experiments on three widely used location-based social network (LBSN) datasets show that ROS achieves over 10% relative gains in hit rate over the strongest LLM-based baselines and improves cross-city transfer, despite using a smaller backbone model.
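
To make the idea of discretizing coarse-to-fine locality into compositional tokens more concrete, here is a small sketch that maps a coordinate to a sequence of quadtree cell tokens. The abstract does not specify how the SID is built, so the quadtree scheme, the function name `spatial_tokens`, and the token format `<g{level}_{quadrant}>` are all assumptions chosen only for illustration.

```python
def spatial_tokens(lat, lon, levels=3):
    """Return coarse-to-fine grid tokens for a coordinate (hypothetical scheme).

    Each level halves the current latitude and longitude ranges, so the
    first token names a quarter of the globe and later tokens refine it.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    tokens = []
    for level in range(1, levels + 1):
        lat_mid = (lat_lo + lat_hi) / 2
        lon_mid = (lon_lo + lon_hi) / 2
        quadrant = 0
        if lat >= lat_mid:      # northern half of the current cell
            quadrant += 2
            lat_lo = lat_mid
        else:
            lat_hi = lat_mid
        if lon >= lon_mid:      # eastern half of the current cell
            quadrant += 1
            lon_lo = lon_mid
        else:
            lon_hi = lon_mid
        tokens.append(f"<g{level}_{quadrant}>")
    return tokens


near_a = spatial_tokens(40.0, 116.0)
near_b = spatial_tokens(39.9, 116.4)
far = spatial_tokens(-33.9, 151.2)
print(near_a, near_b, far)
```

Nearby points share a token prefix while distant points diverge at the first level, which is the property a sequence model could exploit. A POI category token could be appended after the spatial tokens to form a full identifier, again as an assumption rather than the paper's design.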

[AI-80] Improving Semi-Supervised Contrastive Learning via Entropy-Weighted Confidence Integration of Anchor-Positive Pairs

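[Quick Read]: This paper addresses the overly conservative pseudo-label assignment in conventional semi-supervised contrastive learning, where only samples whose confidence exceeds a preset threshold receive pseudo-labels, so many low-confidence samples are ignored and the performance gain achievable with few labels is limited. The key to its solution is a new loss function that estimates confidence from the entropy of each sample's predicted probability distribution and applies confidence-based adaptive weighting. This lets the model assign pseudo-labels to low-confidence samples that were previously excluded, and lets contrastive learning account for the confidence of both the anchor and the positive sample in a more principled way, yielding more stable and efficient training.

Link: https://arxiv.org/abs/2601.04555
Authors: Shogo Nakayama,Masahiro Okuda
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: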

Abstract:Conventional semi-supervised contrastive learning methods assign pseudo-labels only to samples whose highest predicted class probability exceeds a predefined threshold, and then perform supervised contrastive learning using those selected samples. In this study, we propose a novel loss function that estimates the confidence of each sample based on the entropy of its predicted probability distribution and applies confidence-based adaptive weighting. This approach enables pseudo-label assignment even to samples that were previously excluded from training and facilitates contrastive learning that accounts for the confidence of both anchor and positive samples in a more principled manner. Experimental results demonstrate that the proposed method improves classification accuracy and achieves more stable learning performance even under low-label conditions.
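
As a rough illustration of entropy-based confidence weighting, the sketch below turns a predicted class distribution into a confidence in [0, 1] and combines the confidences of an anchor and a positive sample into a pair weight. The normalization by log K and the product rule for the pair weight are assumptions for this example; the abstract does not give the exact formula used in the proposed loss.

```python
import math


def entropy_confidence(probs):
    """Confidence in [0, 1]: one minus the entropy normalised by log(K).

    Assumes `probs` sums to 1 and has at least two classes.
    """
    num_classes = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(num_classes)


def pair_weight(anchor_probs, positive_probs):
    """Weight for an anchor-positive pair (assumed: product of confidences)."""
    return entropy_confidence(anchor_probs) * entropy_confidence(positive_probs)


peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(entropy_confidence(peaked), entropy_confidence(flat))
print(pair_weight(peaked, peaked), pair_weight(peaked, flat))
```

A uniform prediction gets a confidence of zero and a one-hot prediction gets one, so uncertain samples still take part in training but contribute less than confident ones, instead of being dropped by a hard threshold.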

[AI-81] Personalized Model-Based Design of Human Centric AI enabled CPS for Long term usage

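[Quick Read]: This paper addresses the problem that AI-enabled human-centric control systems can violate their safety, sustainability, and security requirements during long-term operation because of untested corner cases. These cases may stem from design flaws, limited testing resources, computational limitations of the testing methodology, or unknown use cases arising from human interaction. The paper points out that existing techniques for safety, sustainability, and security analysis have significant limitations when testing for long-term use in practice, and proposes personalized model-based solutions as the key strategy for potentially eliminating those limitations, thereby improving reliability and robustness in real long-term use.

Link: https://arxiv.org/abs/2601.04545
Authors: Bernard Ngabonziza,Ayan Banerjee,Sandeep K.S. Gupta
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: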

Abstract:Human centric critical systems increasingly involve artificial intelligence to enable knowledge extraction from sensor-collected data. Examples include medical monitoring and control systems, gesture-based human-computer interaction systems, and autonomous cars. Such systems are intended to operate for the long term, potentially for a lifetime, in many scenarios such as closed-loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoring systems for stroke diagnosis and rehabilitation. Long-term operation of such AI-enabled human centric applications can expose them to corner cases for which their operation may be uncertain. This can be due to many reasons such as inherent flaws in the design, limited resources for testing, inherent computational limitations of the testing methodology, or unknown use cases resulting from human interaction with the system. Such untested corner cases, or cases for which the system performance is uncertain, can lead to violations of the safety, sustainability, and security requirements of the system. In this paper, we analyze the existing techniques for safety, sustainability, and security analysis of an AI-enabled human centric control system and discuss their limitations for testing the system for long-term use in practice. We then propose personalized model-based solutions for potentially eliminating such limitations.

[AI-82] TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration IJCAI

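[Quick Read]: This paper addresses two core problems of routing in Multi-Agent Systems (MAS): static single-label decisions make it hard to onboard new agents dynamically, which complicates integration as business domains expand; and overlapping agent capabilities cause routing conflicts that reduce the accuracy of task assignment. The key to its solution is TCAndon-Router (TCAR), whose novelty is twofold. First, it generates a natural-language reasoning chain before predicting a dynamic set of candidate agents, which supports flexible agent collaboration and onboarding. Second, it uses a collaborative execution pipeline in which the selected agents respond independently and a dedicated refining module then aggregates and refines their outputs, improving overall response quality and robustness.

Link: https://arxiv.org/abs/2601.04544
Authors: Jiuzhou Zhao,Chunrong Chen,Chenqi Qiao,Lebin Zheng,Minqi Han,Yanchi Liu,Yongzhou Xu,Xiaochuan Xu,Min Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures. Under review at IJCAI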

Abstract:Multi-Agent Systems (MAS) have become a powerful paradigm for building high-performance intelligent applications. Within these systems, the router responsible for determining which expert agents should handle a given query plays a crucial role in overall performance. Existing routing strategies generally fall into two categories: performance routing, which balances latency and cost across models of different sizes, and task routing, which assigns queries to domain-specific experts to improve accuracy. In real-world enterprise applications, task routing is more suitable; however, most existing approaches rely on static single-label decisions, which introduce two major limitations: (i) difficulty in seamlessly integrating new agents as business domains expand, and (ii) routing conflicts caused by overlapping agent capabilities, ultimately degrading accuracy and this http URL address these challenges, we propose TCAndon-Router (TCAR): an adaptive reasoning router for multi-agent collaboration. Unlike traditional routers, TCAR supports dynamic agent onboarding and first generates a natural-language reasoning chain before predicting a set of candidate agents capable of handling the query. In addition, we design a collaborative execution pipeline in which selected agents independently produce responses, which are then aggregated and refined into a single high-quality response by a dedicated Refining this http URL on public datasets and real enterprise data demonstrate that TCAR significantly improves routing accuracy, reduces routing conflicts, and remains robust in ambiguous scenarios. We have released TCAR at this https URL to support future research on explainable and collaborative multi-agent routing.

[AI-83] AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation

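[Quick Read]: This paper addresses the lack of a systematic benchmark for evaluating large language models (LLMs) on code snippet adaptation. Existing benchmarks focus on code generation or understanding and do not adequately cover code reuse and adjustment, which are critical in real development, so the practical performance of LLMs in this area is unclear. The key to its solution is AdaptEval, a benchmark designed specifically to evaluate the code adaptation ability of LLMs, with three core features: (1) practical context, with tasks derived from real developer behavior on Stack Overflow and GitHub and rich contextual information preserved; (2) multi-granularity annotation, with requirements annotated at both the task level and the adaptation level to support evaluation across diverse adaptation scenarios; and (3) a fine-grained testing framework that combines adaptation-level and function-level tests to evaluate individual adaptation operations. Using this benchmark, the authors conduct the first empirical evaluation of six instruction-tuned LLMs, and especially three reasoning LLMs, on code adaptation, revealing notable limitations in following explicit instructions and providing a quantifiable, multi-dimensional evaluation tool and directions for improvement.

Link: https://arxiv.org/abs/2601.04540
Authors: Tanghaoran Zhang,Xinjun Mao,Shangwen Wang,Yuxin Zhao,Yao Lu,Jin Zhang,Zhang Zhang,Kang Yang,Yue Yu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures, Accepted by ASE 2025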

Abstract:Recent advancements in large language models (LLMs) have automated various software engineering tasks, with benchmarks emerging to evaluate their capabilities. However, for adaptation, a critical activity during code reuse, there is no benchmark to assess LLMs’ performance, leaving their practical utility in this area unclear. To fill this gap, we propose AdaptEval, a benchmark designed to evaluate LLMs on code snippet adaptation. Unlike existing benchmarks, AdaptEval incorporates the following three distinctive features: First, Practical Context. Tasks in AdaptEval are derived from developers’ practices, preserving rich contextual information from Stack Overflow and GitHub communities. Second, Multi-granularity Annotation. Each task is annotated with requirements at both task and adaptation levels, supporting the evaluation of LLMs across diverse adaptation scenarios. Third, Fine-grained Evaluation. AdaptEval includes a two-tier testing framework combining adaptation-level and function-level tests, which enables evaluating LLMs’ performance across various individual adaptations. Based on AdaptEval, we conduct the first empirical study to evaluate six instruction-tuned LLMs and especially three reasoning LLMs on code snippet adaptation. Experimental results demonstrate that AdaptEval enables the assessment of LLMs’ adaptation capabilities from various perspectives. It also provides critical insights into their current limitations, particularly their struggle to follow explicit instructions. We hope AdaptEval can facilitate further investigation and enhancement of LLMs’ capabilities in code snippet adaptation, supporting their real-world applications.

[AI-84] Paradoxical noise preference in RNNs

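[Quick Read]: This paper examines the following puzzle. In continuous-time recurrent neural networks (CTRNNs), noise is introduced during training to emulate the variability of biological neural networks and to regularize learning, and removing the noise at test time is expected to preserve or improve performance. Experiments show, however, that CTRNNs often perform best at a nonzero noise level, and that this level matches the one used in training, which contradicts the usual intuition that noise serves only as a regularizer and is not part of the computation. The key to the explanation is a noise-induced shift of fixed points (stationary distributions): when noise is injected inside the activation function, it is attenuated asymmetrically, which changes the steady-state distribution, so the outputs become biased once the noise is removed and performance degrades. The bias arises because the network operates near the nonlinear regions of the activation function, and optimization itself pushes neural states toward those regions. Networks can therefore overfit to the stochastic training environment rather than only learning the input-output mapping. The mechanism is distinct from stochastic resonance and shows that training noise can become part of the learned computation, with implications for understanding neural population dynamics and designing robust artificial RNNs.

Link: https://arxiv.org/abs/2601.04539
Authors: Noah Eckstein,Manoj Srinivasan
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures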

Abstract:In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biological variability and regularize learning. The expectation is that removing the noise at test time should preserve or improve performance. Contrary to this intuition, we find that continuous-time recurrent neural networks (CTRNNs) often perform best at a nonzero noise level, specifically, the same level used during training. This noise preference typically arises when noise is injected inside the neural activation function; networks trained with noise injected outside the activation function perform best with zero noise. Through analyses of simple function approximation, maze navigation, and single neuron regulator tasks, we show that the phenomenon stems from noise-induced shifts of fixed points (stationary distributions) in the underlying stochastic dynamics of the RNNs. These fixed point shifts are noise-level dependent and bias the network outputs when the noise is removed, degrading performance. Analytical and numerical results show that the bias arises when neural states operate near activation function nonlinearities, where noise is asymmetrically attenuated, and that performance optimization incentivizes operation near these nonlinearities. Thus, networks can overfit to the stochastic training environment itself rather than just to the input-output data. The phenomenon is distinct from stochastic resonance, wherein nonzero noise enhances signal processing. Our findings reveal that training noise can become an integral part of the computation learned by recurrent networks, with implications for understanding neural population dynamics and for the design of robust artificial RNNs.
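
The asymmetric attenuation of noise near an activation nonlinearity can be seen in a single-unit toy example. The sketch below compares the mean output of `tanh` when Gaussian noise is added inside the activation versus outside it. This is a simplified, static illustration under assumed values (`x = 1.0`, `sigma = 0.5`, a fixed random seed); it is not a CTRNN and does not reproduce the paper's experiments.

```python
import math
import random


def mean_output(x, sigma, noise_inside, n=100_000, seed=0):
    """Monte Carlo mean of a noisy tanh unit (illustrative helper).

    noise_inside=True  -> tanh(x + noise)
    noise_inside=False -> tanh(x) + noise
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        noise = rng.gauss(0.0, sigma)
        if noise_inside:
            total += math.tanh(x + noise)
        else:
            total += math.tanh(x) + noise
    return total / n


x = 1.0
clean = math.tanh(x)                                   # noise-free output
mean_inside = mean_output(x, 0.5, noise_inside=True)   # shifted by the nonlinearity
mean_outside = mean_output(x, 0.5, noise_inside=False) # stays centred on `clean`
print(clean, mean_inside, mean_outside)
```

Because `tanh` is concave for positive inputs, noise inside the activation pulls the average output below the noise-free value, while additive noise outside leaves the average unchanged. A network tuned under the first condition has effectively calibrated itself to the shifted operating point, which is one way to read the bias that appears when the noise is removed.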

[AI-85] Self-MedRAG: a Self-Reflective Hybrid Retrieval-Augmented Generation Framework for Reliable Medical Question Answering

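[Quick Read]: This paper addresses the tendency of large language models (LLMs) to hallucinate and reason without grounding in medical question answering (QA), which limits their reliability in high-stakes clinical scenarios. The key to its solution is Self-MedRAG, a self-reflective hybrid framework with two core components: (1) a hybrid retrieval strategy that combines a sparse retriever (BM25) and a dense retriever (Contriever) through Reciprocal Rank Fusion (RRF) to maximize evidence coverage; and (2) a lightweight self-reflection module that uses Natural Language Inference (NLI) or LLM-based verification to assess the rationales supporting a generated answer. If a rationale lacks sufficient evidence, the system reformulates the query and iterates to refine the context, giving a closed loop of multi-step reasoning and evidence-driven correction. Experiments show markedly higher accuracy on the MedQA and PubMedQA benchmarks, fewer unsupported claims, and improved clinical reliability.

Link: https://arxiv.org/abs/2601.04531
Authors: Jessica Ryan,Alexander I. Gumilang,Robert Wiliam,Derwin Suhartono
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: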

Abstract:Large Language Models (LLMs) have demonstrated significant potential in medical Question Answering (QA), yet they remain prone to hallucinations and ungrounded reasoning, limiting their reliability in high-stakes clinical scenarios. While Retrieval-Augmented Generation (RAG) mitigates these issues by incorporating external knowledge, conventional single-shot retrieval often fails to resolve complex biomedical queries requiring multi-step inference. To address this, we propose Self-MedRAG, a self-reflective hybrid framework designed to mimic the iterative hypothesis-verification process of clinical reasoning. Self-MedRAG integrates a hybrid retrieval strategy, combining sparse (BM25) and dense (Contriever) retrievers via Reciprocal Rank Fusion (RRF) to maximize evidence coverage. It employs a generator to produce answers with supporting rationales, which are then assessed by a lightweight self-reflection module using Natural Language Inference (NLI) or LLM-based verification. If the rationale lacks sufficient evidentiary support, the system autonomously reformulates the query and iterates to refine the context. We evaluated Self-MedRAG on the MedQA and PubMedQA benchmarks. The results demonstrate that our hybrid retrieval approach significantly outperforms single-retriever baselines. Furthermore, the inclusion of the self-reflective loop yielded substantial gains, increasing accuracy on MedQA from 80.00% to 83.33% and on PubMedQA from 69.10% to 79.82%. These findings confirm that integrating hybrid retrieval with iterative, evidence-based self-reflection effectively reduces unsupported claims and enhances the clinical reliability of LLM-based systems.
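
Reciprocal Rank Fusion, which the abstract uses to merge the BM25 and Contriever rankings, scores each document by summing 1 / (k + rank) over the input rankings. The sketch below is a generic version of that formula, not the authors' code: the constant `k = 60` is the value commonly used in the RRF literature and is an assumption here, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered from most to least relevant.
    k:        smoothing constant (60 is a common default, assumed here).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse_ranking = ["d1", "d2", "d3"]   # e.g. from a BM25 retriever
dense_ranking = ["d3", "d1", "d4"]    # e.g. from a dense retriever
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists rise in the fused ranking, and documents found by only one retriever are still kept, which is how the fusion widens evidence coverage without needing comparable raw scores.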

[AI-86] BioPIE: A Biomedical Protocol Information Extraction Dataset for High-Reasoning-Complexity Experiment Question Answer

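[Quick Read]: This paper addresses the performance bottleneck of biomedical experiment question answering (QA) systems on tasks with High Information Density (HID) and Multi-Step Reasoning (MSR). Existing biomedical datasets focus on general or coarse-grained knowledge and cannot support the fine-grained reasoning over experimental details that these tasks require. The key to its solution is the Biomedical Protocol Information Extraction Dataset (BioPIE), which provides procedure-centric knowledge graphs (KGs) covering experimental entities, actions, and their relations at a scale sufficient for reasoning across experimental protocols. By evaluating information extraction methods on BioPIE and implementing a QA system built on this structured knowledge, the authors report performance gains on the test, HID, and MSR question sets, indicating that the structured experimental knowledge in BioPIE underpins both AI-assisted and more autonomous biomedical experimentation.

Link: https://arxiv.org/abs/2601.04524
Authors: Haofei Hou,Shunyi Zhao,Fanxu Meng,Kairui Yang,Lecheng Ruan,Qining Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: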

Abstract:Question Answer (QA) systems for biomedical experiments facilitate cross-disciplinary communication, and serve as a foundation for downstream tasks, e.g., laboratory automation. High Information Density (HID) and Multi-Step Reasoning (MSR) pose unique challenges for biomedical experimental QA. While extracting structured knowledge, e.g., Knowledge Graphs (KGs), can substantially benefit biomedical experimental QA, existing biomedical datasets focus on general or coarse-grained knowledge and thus fail to support the fine-grained experimental reasoning demanded by HID and MSR. To address this gap, we introduce Biomedical Protocol Information Extraction Dataset (BioPIE), a dataset that provides procedure-centric KGs of experimental entities, actions, and relations at a scale that supports reasoning over biomedical experiments across protocols. We evaluate information extraction methods on BioPIE, and implement a QA system that leverages BioPIE, showcasing performance gains on test, HID, and MSR question sets, showing that the structured experimental knowledge in BioPIE underpins both AI-assisted and more autonomous biomedical experimentation.

[AI-87] TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation

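[Quick Read]: This paper addresses the vulnerability of current character-level SMILES molecular generation models to compounding token errors: generated molecules are often unparseable or chemically implausible, and conventional hard constraints prevent failures but restrict exploration of chemical space. The key to its solution is TSSR, a two-stage, swap-reward-driven reinforcement learning framework. Stage one rewards local token swaps that repair syntax errors, promoting transitions from invalid to parseable strings. Stage two uses RDKit diagnostics to give chemistry-aware feedback, rewarding reductions in valence, aromaticity, and connectivity problems. The method turns a sparse terminal objective into a denser, more interpretable reward signal, needs no task-specific labels or hand-crafted grammars, improves syntactic and chemical validity while preserving diversity, and can be adapted to different datasets and reinforcement learning approaches.

Link: https://arxiv.org/abs/2601.04521
Authors: Jacob Ede Levine,Yun Lyan Luo,Sai Chandra Kosaraju
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review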

Abstract:The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.

[AI-88] Integrating Distribution Matching into Semi-Supervised Contrastive Learning for Labeled and Unlabeled Data

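[Quick Read]: This paper addresses the limited image classification accuracy of semi-supervised learning (SSL) caused by low-quality pseudo-labels. The key to its solution is a distribution matching mechanism between the feature embeddings of labeled and unlabeled data: by making the two distributions more consistent in feature space, it improves the reliability of pseudo-labels and thereby the image classification performance across multiple datasets.

Link: https://arxiv.org/abs/2601.04518
Authors: Shogo Nakayama,Masahiro Okuda
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ITC-CSCC accepted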

Abstract:The advancement of deep learning has greatly improved supervised image classification. However, labeling data is costly, prompting research into unsupervised learning methods such as contrastive learning. In real-world scenarios, fully unlabeled datasets are rare, making semi-supervised learning (SSL) highly relevant in scenarios where a small amount of labeled data coexists with a large volume of unlabeled data. A well-known semi-supervised contrastive learning approach involves assigning pseudo-labels to unlabeled data. This study aims to enhance pseudo-label-based SSL by incorporating distribution matching between labeled and unlabeled feature embeddings to improve image classification accuracy across multiple datasets.

[AI-89] A General Neural Backbone for Mixed-Integer Linear Optimization via Dual Attention

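[Quick Read]: This paper addresses the high computational cost of Mixed-Integer Linear Programming (MILP) at scale and the limited expressive power of conventional neural approaches, which rely on local structural information. The key to its solution is an attention-based neural architecture with a dual-attention mechanism that performs self-attention and cross-attention in parallel over variables and constraints, enabling global information exchange and deeper representation learning. This goes beyond purely graph-based modeling and improves both the representation of MILP instances and solving efficiency.

Link: https://arxiv.org/abs/2601.04509
Authors: Peixin Huang,Yaoxin Wu,Yining Ma,Cathy Wu,Wen Song,Wei Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: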

Abstract:Mixed-integer linear programming (MILP), a widely used modeling framework for combinatorial optimization, is central to many scientific and engineering applications, yet remains computationally challenging at scale. Recent advances in deep learning address this challenge by representing MILP instances as variable-constraint bipartite graphs and applying graph neural networks (GNNs) to extract latent structural patterns and enhance solver efficiency. However, this architecture is inherently limited by the local-oriented mechanism, leading to restricted representation power and hindering neural approaches for MILP. Here we present an attention-driven neural architecture that learns expressive representations beyond the pure graph view. A dual-attention mechanism is designed to perform parallel self- and cross-attention over variables and constraints, enabling global information exchange and deeper representation learning. We apply this general backbone to various downstream tasks at the instance level, element level, and solving state level. Extensive experiments across widely used benchmarks show consistent improvements of our approach over state-of-the-art baselines, highlighting attention-based neural architectures as a powerful foundation for learning-enhanced mixed-integer linear optimization.

[AI-90] A Semi-supervised Molecular Learning Framework for Activity Cliff Estimation

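[Quick Read]: This paper addresses the marked negative effect of activity cliffs on graph-based machine learning models, especially when data are scarce. An activity cliff is a case where a small change in molecular structure causes a large change in biological activity; this violates the usual assumption that similar molecules have similar properties and so weakens the predictive accuracy of existing models. The key to its solution is SemiMol, a new semi-supervised learning (SSL) method with two core ideas: (1) an additional instructor model evaluates the accuracy and trustworthiness of pseudo-labels on unlabeled data, which overcomes the limitation that conventional pseudo-labeling relies on probabilistic outputs and does not apply to regression; and (2) a self-adaptive curriculum learning algorithm moves the target model toward hard samples at a controllable pace, improving robustness and generalization in low-data settings.

Link: https://arxiv.org/abs/2601.04507
Authors: Fang Wu
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: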

Abstract:Machine learning (ML) enables accurate and fast molecular property predictions, which are of interest in drug discovery and material design. Their success rests on the principle of similarity, which assumes that similar molecules exhibit close properties. However, activity cliffs challenge this principle, and their presence leads to a sharp decline in the performance of existing ML algorithms, particularly graph-based methods. To overcome this obstacle under a low-data scenario, we propose a novel semi-supervised learning (SSL) method dubbed SemiMol, which employs predictions on numerous unannotated data as pseudo-signals for subsequent training. Specifically, we introduce an additional instructor model to evaluate the accuracy and trustworthiness of proxy labels because existing pseudo-labeling approaches require probabilistic outputs to reveal the model’s confidence and fail to be applied in regression tasks. Moreover, we design a self-adaptive curriculum learning algorithm to progressively move the target model toward hard samples at a controllable pace. Extensive experiments on 30 activity cliff datasets demonstrate that SemiMol significantly enhances graph-based ML architectures and surpasses state-of-the-art pretraining and SSL baselines.

[AI-91] Surface-based Molecular Design with Multi-modal Flow Matching

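Quick read: This paper addresses the limited attention that current therapeutic peptide design gives to molecular surface features, whose role at protein-protein interaction (PPI) interfaces remains underexplored. Existing methods focus on full-atom co-design and overlook how surface geometry and biochemical properties drive binding specificity and affinity. The key to the solution is SurfFlow, a general peptide generation paradigm built on a surface-based multi-modality conditional flow matching (CFM) architecture. It jointly learns and generates peptide sequence, structure, and surface, optimizing surface geometry and biochemical properties for more precise target binding.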

Link: https://arxiv.org/abs/2601.04506
Authors: Fang Wu,Zhengyuan Zhou,Shuting Jin,Xiangxiang Zeng,Jure Leskovec,Jinbo Xu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Therapeutic peptides show promise in targeting previously undruggable binding sites, with recent advancements in deep generative models enabling full-atom peptide co-design for specific protein receptors. However, the critical role of molecular surfaces in protein-protein interactions (PPIs) has been underexplored. To bridge this gap, we propose an omni-design peptides generation paradigm, called SurfFlow, a novel surface-based generative algorithm that enables comprehensive co-design of sequence, structure, and surface for peptides. SurfFlow employs a multi-modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enhancing peptide binding accuracy. Evaluated on the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full-atom baselines across all metrics. These results highlight the advantages of considering molecular surfaces in de novo peptide discovery and demonstrate the potential of integrating multiple protein modalities for more effective therapeutic peptide discovery.

[AI-92] Specific Emitter Identification via Active Learning

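Quick read: This paper addresses the heavy dependence of specific emitter identification (SEI) model training on large-scale labeled data, which are costly and time-consuming to obtain. The key to the solution is a three-stage semi-supervised training framework enhanced by active learning (AL). First, self-supervised contrastive learning with a dynamic dictionary update mechanism extracts robust representations from large amounts of unlabeled data. Second, the contrastive and cross-entropy losses are jointly optimized on a small labeled set to improve feature separability and sharpen classification boundaries. Third, the most valuable unlabeled samples are selected for manual annotation using both uncertainty and representativeness criteria, which improves generalization and recognition accuracy under a limited labeling budget.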

Link: https://arxiv.org/abs/2601.04502
Authors: Jingyi Wang,Fanggang Wang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid growth of wireless communications, specific emitter identification (SEI) is significant for communication security. However, its model training relies heavily on the large-scale labeled data, which are costly and time-consuming to obtain. To address this challenge, we propose an SEI approach enhanced by active learning (AL), which follows a three-stage semi-supervised training scheme. In the first stage, self-supervised contrastive learning is employed with a dynamic dictionary update mechanism to extract robust representations from large amounts of the unlabeled data. In the second stage, supervised training on a small labeled dataset is performed, where the contrastive and cross-entropy losses are jointly optimized to improve the feature separability and strengthen the classification boundaries. In the third stage, an AL module selects the most valuable samples from the unlabeled data for annotation based on the uncertainty and representativeness criteria, further enhancing generalization under limited labeling budgets. Experimental results on the ADS-B and WiFi datasets demonstrate that the proposed SEI approach significantly outperforms the conventional supervised and semi-supervised methods under limited annotation conditions, achieving higher recognition accuracy with lower labeling cost.
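The third stage, selecting samples by uncertainty and representativeness, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's selection rule: uncertainty is taken to be normalized prediction entropy, representativeness is taken to be mean cosine similarity to the other unlabeled samples, and the mixing weight `alpha` is a hypothetical parameter.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def select_for_annotation(features, probs, budget, alpha=0.5):
    """Rank unlabeled samples by a weighted sum of the two criteria.

    `alpha` (assumed, not from the paper) weights uncertainty against
    representativeness; the top `budget` indices are returned.
    """
    n = len(features)
    max_ent = math.log(len(probs[0])) or 1.0  # guard the one-class case
    scores = []
    for i in range(n):
        uncertainty = entropy(probs[i]) / max_ent
        sims = [cosine(features[i], features[j]) for j in range(n) if j != i]
        representativeness = sum(sims) / len(sims) if sims else 0.0
        scores.append(alpha * uncertainty + (1 - alpha) * representativeness)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Toy embeddings and classifier outputs for three unlabeled signals.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
print(select_for_annotation(features, probs, budget=2))  # [0, 1]
```

Sample 0 ranks first because it is both maximally uncertain and close to another sample; sample 2 is confidently classified and isolated, so it is the last candidate for the labeling budget.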

[AI-93] GUITester: Enabling GUI Agents for Exploratory Defect Discovery

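Quick read: This paper addresses the inability of multi-modal large language model (MLLM) GUI agents to discover defects autonomously during exploratory GUI testing. The two core challenges are Goal-Oriented Masking and Execution-Bias Attribution. The key to the solution is GUITester, a multi-agent framework that decouples navigation from verification through two modules: a Planning-Execution Module (PEM) that proactively probes for defects by embedding testing intents, and a Hierarchical Reflection Module (HRM) that resolves attribution ambiguity by analyzing the interaction history. On GUITestBench, the first interactive benchmark for this task, the method reaches an F1-score of 48.90% (Pass@3), well above the state-of-the-art baselines (33.35%), which supports the feasibility of autonomous exploratory GUI testing.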

Link: https://arxiv.org/abs/2601.04500
Authors: Yifei Gao,Jiang Wu,Xiaoyi Chen,Yifan Yang,Zhe Cui,Tianyi Ma,Jiaming Zhang,Jitao Sang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Exploratory GUI testing is essential for software quality but suffers from high manual costs. While Multi-modal Large Language Model (MLLM) agents excel in navigation, they fail to autonomously discover defects due to two core challenges: Goal-Oriented Masking, where agents prioritize task completion over reporting anomalies, and Execution-Bias Attribution, where system defects are misidentified as agent errors. To address these, we first introduce GUITestBench, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose GUITester, a multi-agent framework that decouples navigation from verification via two modules: (i) a Planning-Execution Module (PEM) that proactively probes for defects via embedded testing intents, and (ii) a Hierarchical Reflection Module (HRM) that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1-score of 48.90% (Pass@3) on GUITestBench, outperforming state-of-the-art baselines (33.35%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance. Our code is now available at this https URL.

[AI-94] Scalable Floating-Point Satisfiability via Staged Optimization

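Quick read: This paper addresses floating-point satisfiability, that is, deciding whether a logical formula over floating-point operations has an assignment that satisfies all constraints. Existing approaches rely on bit-precise SMT solving or numerical optimization, and they tend to be slow, get trapped in local optima, or return spurious solutions. The key to the solution is StageSAT, a staged optimization framework: it first uses a fast projection-aided descent objective to locate a feasible region, then approaches an exact solution at bit-level accuracy with ULP² optimization, and finally converges through n-ULP lattice refinement. By design, a final objective value of zero always corresponds to a valid solution, which gives a built-in soundness guarantee. StageSAT also uses orthogonal projection to obtain a partial monotone descent property on linear constraints, which keeps the search from stalling on flat or misleading landscapes. It needs no heavy bit-level reasoning: floating-point arithmetic is treated as a black box and the input space is navigated through runtime evaluations alone, improving both accuracy and efficiency.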

Link: https://arxiv.org/abs/2601.04492
Authors: Yuanzhuo Zhang,Zhoulai Fu,Binoy Ravindran
Affiliations: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work introduces StageSAT, a new approach to solving floating-point satisfiability that bridges SMT solving with numerical optimization. StageSAT reframes a floating-point formula as a series of optimization problems in three stages of increasing precision. It begins with a fast, projection-aided descent objective to guide the search toward a feasible region, proceeding to bit-level accuracy with ULP² optimization and a final n-ULP lattice refinement. By construction, the final stage uses a representing function that is zero if and only if a candidate satisfies all constraints. Thus, when optimization drives the objective to zero, the resulting assignment is a valid solution, providing a built-in guarantee of soundness. To improve search, StageSAT introduces a partial monotone descent property on linear constraints via orthogonal projection, preventing the optimizer from stalling on flat or misleading landscapes. Critically, this solver requires no heavy bit-level reasoning or specialized abstractions; it treats complex arithmetic as a black-box, using runtime evaluations to navigate the input space. We implement StageSAT and evaluate it on extensive benchmarks, including SMT-COMP’25 suites and difficult cases from prior work. StageSAT proved more scalable and accurate than state-of-the-art optimization-based alternatives. It solved strictly more formulas than any competing solver under the same time budget, finding most satisfiable instances without producing spurious models. This amounts to 99.4% recall on satisfiable cases with 0% false SAT, exceeding the reliability of prior optimization-based solvers. StageSAT also delivered significant speedups (often 5–10×) over traditional bit-precise SMT and numeric solvers. These results demonstrate that staged optimization significantly improves performance and correctness of floating-point satisfiability solving.
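The idea of a representing function that is zero if and only if all constraints hold can be illustrated with a ULP-based penalty. The sketch below is an assumption-laden illustration, not StageSAT's objective: the helper names, the toy formula, and the squared-ULP penalty form are all chosen here for demonstration, using the standard trick of reinterpreting IEEE 754 double bits as ordered integers.

```python
import struct


def ordered_bits(x: float) -> int:
    """Map a double to an integer that is monotone in the float ordering.

    NaN is not handled; +0.0 and -0.0 both map to 0.
    """
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles have the sign bit set: negate their magnitude bits.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of ULP steps separating a and b (0 if they are equal)."""
    return abs(ordered_bits(a) - ordered_bits(b))


def eq_penalty(a: float, b: float) -> int:
    """Squared ULP distance: zero exactly when a == b (NaN excluded)."""
    return ulp_distance(a, b) ** 2


def le_penalty(a: float, b: float) -> int:
    """Zero when a <= b holds, squared ULP distance otherwise."""
    return 0 if a <= b else ulp_distance(a, b) ** 2


def objective(x: float) -> int:
    """Penalty for the toy formula (x * 2.0 == 3.0) and (x <= 2.0)."""
    return eq_penalty(x * 2.0, 3.0) + le_penalty(x, 2.0)


print(objective(1.5))               # 0: both constraints hold
print(objective(1.5 + 2.0 ** -52))  # 1: x * 2.0 is one ULP above 3.0
print(eq_penalty(0.1 + 0.2, 0.3))   # 1: the classic one-ULP rounding gap
```

Because every penalty term is a non-negative integer that vanishes only when its constraint holds, a sum of zero certifies a model of the whole conjunction, which is the soundness argument the abstract describes. The arithmetic itself is only evaluated, never reasoned about symbolically.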

[AI-95] A Closed-Loop Multi-Agent System Driven by LLMs for Meal-Level Personalized Nutrition Management

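Quick read: This paper addresses the inefficiency and poor adaptivity that result when food logging, nutrient analysis, and dietary recommendation are handled separately in personalized nutrition management. The core of the solution is an image-based mobile nutrition assistant in which a large language model (LLM) driven multi-agent controller provides meal-level closed-loop support: the system coordinates vision, dialogue, and state-management agents to estimate nutrients from meal photos and update a daily intake budget, then adapts the next meal plan to user preferences and dietary constraints. The architecture links food recognition to personalized recommendation end to end and demonstrates the feasibility of multi-agent LLM control for personalized nutrition.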

Link: https://arxiv.org/abs/2601.04491
Authors: Muqing Xu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 6 pages, 6 figures, 6 tables, Conference: Robotics, Automation, and Artificial Intelligence 2025

Abstract:Personalized nutrition management aims to tailor dietary guidance to an individual’s intake and phenotype, but most existing systems handle food logging, nutrient analysis and recommendation separately. We present a next-generation mobile nutrition assistant that combines image based meal logging with an LLM driven multi agent controller to provide meal level closed loop support. The system coordinates vision, dialogue and state management agents to estimate nutrients from photos and update a daily intake budget. It then adapts the next meal plan to user preferences and dietary constraints. Experiments with SNAPMe meal images and simulated users show competitive nutrient estimation, personalized menus and efficient task plans. These findings demonstrate the feasibility of multi agent LLM control for personalized nutrition and reveal open challenges in micronutrient estimation from images and in large scale real world studies.

[AI-96] Decision-Aware Trust Signal Alignment for SOC Alert Triage

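Quick read: This paper addresses the poorly calibrated confidence scores that machine-learning detection systems expose in Security Operations Centers (SOCs), which analysts struggle to interpret under pressure. When model confidence is misaligned with decision requirements, alert fatigue and the risk of misjudgment both increase. The core of the solution is a decision-sensitive trust signal correspondence scheme that combines three elements into one coherent decision-support layer: post-hoc calibration to make confidence values better calibrated; lightweight uncertainty cues that provide conservative protection when model certainty is low; and cost-sensitive decision thresholds, so that alert prioritization reflects the asymmetric costs of false positives and false negatives in real security settings. The scheme leaves the detection model unchanged and instead improves human-machine collaboration at the decision interface.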

Link: https://arxiv.org/abs/2601.04486
Authors: Israt Jahan Chowdhury,Md Abu Yousuf Tanvir
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Detection systems that utilize machine learning are progressively implemented at Security Operations Centers (SOCs) to help an analyst to filter through high volumes of security alerts. Practically, such systems tend to reveal probabilistic results or confidence scores which are ill-calibrated and hard to read when under pressure. Qualitative and survey based studies of SOC practice done before reveal that poor alert quality and alert overload greatly augment the burden on the analyst, especially when tool outputs are not coherent with decision requirements, or signal noise. One of the most significant limitations is that model confidence is usually shown without expressing that there are asymmetric costs in decision making where false alarms are much less harmful than missed attacks. The present paper presents a decision-sensitive trust signal correspondence scheme of SOC alert triage. The framework combines confidence that has been calibrated, lightweight uncertainty cues, and cost-sensitive decision thresholds into coherent decision-support layer, instead of making changes to detection models. To enhance probabilistic consistency, the calibration is done using the known post-hoc methods and the uncertainty cues give conservative protection in situations where model certainty is low. To measure the model-independent performance of the suggested model, we apply the Logistic Regression and the Random Forest classifiers to the UNSW-NB15 intrusion detection benchmark. According to simulation findings, false negatives are greatly amplified by the presence of misaligned displays of confidence, whereas cost weighted loss decreases by orders of magnitude between models with decision aligned trust signals. Lastly, we describe a human-in-the-loop study plan that would allow empirically assessing the decision-making of the analysts with aligned and misaligned trust interfaces.
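A cost-sensitive threshold of the kind the abstract mentions follows from a short expected-cost argument: with a calibrated attack probability p, dismissing an alert costs p · C_FN in expectation and escalating costs (1 − p) · C_FP, so escalation is cheaper whenever p > C_FP / (C_FP + C_FN). The sketch below applies that rule with an uncertainty override. The cost values, the `max_uncertainty` cutoff, and the three action names are hypothetical and are not taken from the paper.

```python
def cost_sensitive_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which escalating has lower expected cost than dismissing."""
    return cost_fp / (cost_fp + cost_fn)


def triage(p_attack, uncertainty, cost_fp=1.0, cost_fn=19.0, max_uncertainty=0.3):
    """Map a calibrated probability and an uncertainty cue to an analyst action.

    All defaults are illustrative assumptions: a missed attack is taken to be
    19 times as costly as a false alarm, and any alert whose uncertainty cue
    exceeds `max_uncertainty` is routed to manual review as a conservative
    fallback.
    """
    if uncertainty > max_uncertainty:
        return "review"
    if p_attack >= cost_sensitive_threshold(cost_fp, cost_fn):
        return "escalate"
    return "dismiss"


print(cost_sensitive_threshold(1.0, 19.0))  # 0.05
print(triage(0.10, uncertainty=0.05))       # escalate
print(triage(0.01, uncertainty=0.05))       # dismiss
print(triage(0.01, uncertainty=0.60))       # review
```

With a 19:1 cost ratio the threshold drops to 0.05, so an alert with only 10% attack probability is still escalated. Displaying the raw 10% without that context is the kind of misaligned confidence display the abstract warns about.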

[AI-97] Hybrid Federated Learning for Noise-Robust Training

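Quick read: This paper addresses the limited trade-off between noise robustness and training speed offered by federated learning (FL) and federated distillation (FD), which restricts performance at low signal-to-noise ratio (SNR). The key to the solution is a hybrid federated learning (HFL) framework: in each communication round, every user equipment (UE) uploads either gradients or logits, and the base station (BS) selects the per-round weights of the FL and FD updates. Two methods exploit the framework's degrees of freedom (DoF): adaptive UE clustering via Jenks optimization and adaptive weight selection via a damped Newton method. Together they substantially improve test accuracy at low SNR.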

Link: https://arxiv.org/abs/2601.04483
Authors: Yongjun Kim,Hyeongjun Park,Hwanjin Kim,Junil Choi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Signal Processing (eess.SP)
Comments:

Abstract:Federated learning (FL) and federated distillation (FD) are distributed learning paradigms that train UE models with enhanced privacy, each offering different trade-offs between noise robustness and learning speed. To mitigate their respective weaknesses, we propose a hybrid federated learning (HFL) framework in which each user equipment (UE) transmits either gradients or logits, and the base station (BS) selects the per-round weights of FL and FD updates. We derive convergence of HFL framework and introduce two methods to exploit degrees of freedom (DoF) in HFL, which are (i) adaptive UE clustering via Jenks optimization and (ii) adaptive weight selection via a damped Newton method. Numerical results show that HFL achieves superior test accuracy at low SNR when both DoF are exploited.

[AI-98] Computational Compliance for AI Regulation: Blueprint for a New Research Domain

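Quick read: This paper addresses the difficulty that AI systems face in complying with increasingly strict AI regulation (AIR) at the necessary speed and scale when they rely on traditional manual or analogue compliance methods, which cannot keep up with requirements that change across the AI life cycle. The key to the solution is to make compliance computational: algorithms that run automatically throughout the life cycle of an AI system, respond to changing conditions, and steer the system toward continued AIR compliance. The paper specifies a set of design goals for such algorithms and a benchmark dataset for quantitatively measuring whether an algorithm meets them, giving this emerging research area an actionable blueprint and evaluation standard.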

Link: https://arxiv.org/abs/2601.04474
Authors: Bill Marino,Nicholas D. Lane
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The era of AI regulation (AIR) is upon us. But AI systems, we argue, will not be able to comply with these regulations at the necessary speed and scale by continuing to rely on traditional, analogue methods of compliance. Instead, we posit that compliance with these regulations will only realistically be achieved computationally: that is, with algorithms that run across the life cycle of an AI system, automatically steering it toward AIR compliance in the face of dynamic conditions. Yet despite their (we would argue) inevitability, the research community has yet to specify exactly how these algorithms for computational AIR compliance should behave - or how we should benchmark their performance. To fill these gaps, we specify a set of design goals for such algorithms. In addition, we specify a benchmark dataset that can be used to quantitatively measure whether individual algorithms satisfy these design goals. By delivering this blueprint, we hope to give shape to an important but uncrystallized new domain of research - and, in doing so, incite necessary investment in it.

[AI-99] Categorical Belief Propagation: Sheaf-Theoretic Inference via Descent and Holonomy

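Quick read: This paper addresses exact inference with belief propagation (BP) in probabilistic graphical models, especially on loopy structures where BP can fail. The central challenge is to achieve exact inference on non-tree factor graphs and to identify and handle the topological obstructions that cause BP to fail. The key to the solution is a category-theoretic framework: the paper constructs the free hypergraph category $\mathrm{Syn}_\Sigma$ on a typed signature and establishes a unique functor to the matrix category $\mathrm{Mat}_R$, which yields compositional semantics. Message passing is defined through a Grothendieck fibration, and exact inference is characterized as effective descent: local beliefs form a descent datum when compatibility conditions hold on overlaps. On this basis the paper proposes HATCC (Holonomy-Aware Tree Compilation), which detects descent obstructions by computing holonomy on the factor nerve, compiles non-trivial holonomy into mode variables, and reduces to tree BP on the augmented graph. For $n$ factors and $c$ fundamental cycles, the time complexity is $O(n^2 d_{\max} + c \cdot k_{\max} \cdot \delta_{\max}^3 + n \cdot \delta_{\max}^2)$. Experiments on grid Markov random fields (MRFs) and random graphs show clear speedups over the junction tree algorithm, and the method also detects unsatisfiability (UNSAT).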

Link: https://arxiv.org/abs/2601.04456
Authors: Enrique ter Horst,Sridhar Mahadevan,Juan Diego Zambrano
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Category Theory (math.CT)
Comments: No essential info

Abstract:We develop a categorical foundation for belief propagation on factor graphs. We construct the free hypergraph category $\mathrm{Syn}_\Sigma$ on a typed signature and prove its universal property, yielding compositional semantics via a unique functor to the matrix category $\mathrm{Mat}_R$. Message-passing is formulated using a Grothendieck fibration $\int \mathrm{Msg} \to \mathrm{FG}_\Sigma$ over polarized factor graphs, with schedule-indexed endomorphisms defining BP updates. We characterize exact inference as effective descent: local beliefs form a descent datum when compatibility conditions hold on overlaps. This framework unifies tree exactness, junction tree algorithms, and loopy BP failures under sheaf-theoretic obstructions. We introduce HATCC (Holonomy-Aware Tree Compilation), an algorithm that detects descent obstructions via holonomy computation on the factor nerve, compiles non-trivial holonomy into mode variables, and reduces to tree BP on an augmented graph. Complexity is $O(n^2 d_{\max} + c \cdot k_{\max} \cdot \delta_{\max}^3 + n \cdot \delta_{\max}^2)$ for $n$ factors and $c$ fundamental cycles. Experimental results demonstrate exact inference with significant speedup over junction trees on grid MRFs and random graphs, along with UNSAT detection on satisfiability instances.

[AI-100] XGrammar 2: Dynamic and Efficient Structured Generation Engine for Agentic LLMs

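Quick read: This paper addresses the inefficiency of existing structured generation engines when modern large language model (LLM) agents handle dynamic structured generation tasks such as tool calling and conditional structured generation. The structures in these tasks are far more dynamic than predefined templates, which raises the bar for generation efficiency and accuracy. The key to the solution is XGrammar 2, whose core contributions are: TagDispatch, a new dynamic dispatching semantics that accelerates mask generation; a just-in-time (JIT) compilation method that reduces compilation overhead; a cross-grammar caching mechanism that exploits sub-structures shared across grammars; an extension of the earlier pushdown automaton (PDA) based mask generation algorithm to an Earley-parser-based one; and a repetition compression algorithm that handles repetition structures in grammars efficiently. With these improvements, XGrammar 2 achieves more than a 6x speedup over existing engines and, once integrated with an LLM inference engine, handles dynamic structured generation with near-zero overhead.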

Link: https://arxiv.org/abs/2601.04426
Authors: Linzhang Li,Yixin Dong,Guanjie Wang,Ziyi Xu,Alexander Jiang,Tianqi Chen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern LLM agents are required to handle increasingly complex structured generation tasks, such as tool calling and conditional structured generation. These tasks are significantly more dynamic than predefined structures, posing new challenges to the current structured generation engines. In this paper, we propose XGrammar 2, a highly optimized structured generation engine for agentic LLMs. XGrammar 2 accelerates the mask generation for these dynamic structured generation tasks through a new dynamic dispatching semantics: TagDispatch. We further introduce a just-in-time (JIT) compilation method to reduce compilation time and a cross-grammar caching mechanism to leverage the common sub-structures across different grammars. Additionally, we extend the previous PDA-based mask generation algorithm to the Earley-parser-based one and design a repetition compression algorithm to handle repetition structures in grammars. Evaluation results show that XGrammar 2 can achieve more than 6x speedup over the existing structured generation engines. Integrated with an LLM inference engine, XGrammar 2 can handle dynamic structured generation tasks with near-zero overhead.

[AI-101] Transitive Expert Error and Routing Problems in Complex AI Systems

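Quick read: This paper addresses systematic errors in expert judgment across domains: when experts make decisions at domain boundaries, the cognitive biases that stem from their expertise lead to wrong judgments, which the paper calls Transitive Expert Error (TEE). TEE differs from the Dunning-Kruger effect. Two mechanisms drive it: structural similarity bias leads experts to overweight surface features (shared vocabulary, patterns, and formal structure) while missing differences in causal architecture; and authority persistence maintains confidence through social reinforcement and metacognitive failure, so that experts feel no subjective uncertainty even on inputs outside their competence. The problem is not limited to human cognition. It extends to AI routing architectures (MoE systems, multi-model orchestration, tool-using agents, and RAG systems), where it appears as routing-induced failures (the wrong specialist is selected) and coverage-induced failures (no suitable specialist exists), both of which produce confident but causally incorrect hallucinated outputs. The key to the solution is to make these mechanisms explicit and open to intervention: multi-expert activation with disagreement detection at the router level, boundary-aware calibration at the specialist level, and coverage gap detection at the training level. What is an opaque cognitive black box in humans thus becomes an engineering problem that can be monitored and corrected in AI architectures.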

Link: https://arxiv.org/abs/2601.04416
Authors: Forest Mars
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 31pp

Abstract:Domain expertise enhances judgment within boundaries but creates systematic vulnerabilities specifically at borders. We term this Transitive Expert Error (TEE), distinct from Dunning-Kruger effects, requiring calibrated expertise as precondition. Mechanisms enabling reliable within-domain judgment become liabilities when structural similarity masks causal divergence. Two core mechanisms operate: structural similarity bias causes experts to overweight surface features (shared vocabulary, patterns, formal structure) while missing causal architecture differences; authority persistence maintains confidence across competence boundaries through social reinforcement and metacognitive failures (experts experience no subjective uncertainty as pattern recognition operates smoothly on familiar-seeming inputs). These mechanisms intensify under three conditions: shared vocabulary masking divergent processes, social pressure for immediate judgment, and delayed feedback. These findings extend to AI routing architectures (MoE systems, multi-model orchestration, tool-using agents, RAG systems) exhibiting routing-induced failures (wrong specialist selected) and coverage-induced failures (no appropriate specialist exists). Both produce a hallucination phenotype: confident, coherent, structurally plausible but causally incorrect outputs at domain boundaries. In human systems these mechanisms are cognitive black boxes; AI architectures make them explicit and addressable. We propose interventions: multi-expert activation with disagreement detection (router level), boundary-aware calibration (specialist level), and coverage gap detection (training level). TEE has detectable signatures (routing patterns, confidence-accuracy dissociations, domain-inappropriate content) enabling monitoring and mitigation. What remains intractable in human cognition becomes addressable through architectural design.

[AI-102] Balancing Usability and Compliance in AI Smart Devices: A Privacy-by-Design Audit of Google Home, Alexa, and Siri

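Quick read: This paper addresses the tension between privacy protection and usability in AI-enabled smart devices used by youth (Google Home Mini, Amazon Alexa, and Apple Siri), namely how to give users control over their data while keeping devices easy to use and transparent. The key to the solution is a combined framework of heuristic evaluation, Personal Information Protection and Electronic Documents Act (PIPEDA) compliance assessment, and youth-centered usability testing, which is used to assess whether the devices follow Privacy-by-Design principles and to identify room for improvement in data-management transparency, user guidance, and policy alignment. The study finds that although some devices stand out on usability or compliance, youth remain constrained by complex technical design and unclear data policies, which limits their privacy self-efficacy. Enhancing transparency, embedding privacy guidance during onboarding, and improving policy alignment are therefore the critical steps toward youth-friendly smart devices.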

Link: https://arxiv.org/abs/2601.04403
Authors: Trevor De Clark,Yulia Bobkova,Ajay Kumar Shrestha
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper investigates the privacy and usability of AI-enabled smart devices commonly used by youth, focusing on Google Home Mini, Amazon Alexa, and Apple Siri. While these devices provide convenience and efficiency, they also raise privacy and transparency concerns due to their always-listening design and complex data management processes. The study proposes and applies a combined framework of Heuristic Evaluation, Personal Information Protection and Electronic Documents Act (PIPEDA) Compliance Assessment, and Youth-Centered Usability Testing to assess whether these devices align with Privacy-by-Design principles and support meaningful user control. Results show that Google Home achieved the highest usability score, while Siri scored highest in regulatory compliance, indicating a trade-off between user convenience and privacy protection. Alexa demonstrated clearer task navigation but weaker transparency in data retention. Findings suggest that although youth may feel capable of managing their data, their privacy self-efficacy remains limited by technical design, complex settings, and unclear data policies. The paper concludes that enhancing transparency, embedding privacy guidance during onboarding, and improving policy alignment are critical steps toward ensuring that smart devices are both usable and compliant with privacy standards that protect young users.

[AI-103] Convenience vs. Control: A Qualitative Study of Youth Privacy with Smart Voice Assistants

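Quick read: This paper addresses a core problem with smart voice assistants (SVAs), which are widely used by youth: policy overload, fragmented settings, and unclear data retention rules make privacy controls hard to manage, lowering privacy self-efficacy (PSE) and in turn weakening privacy-protective behaviors (PPB). The key to the solution is to reduce "transparency friction" by improving algorithmic transparency and trust (ATT), which strengthens PSE and ultimately supports more effective PPB. The concrete design guidance includes a unified privacy hub, plain-language "data nutrition" labels, clear retention defaults, and device-conditional micro-tutorials, so that young digital citizens can manage their own privacy without losing convenience.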

Link: https://arxiv.org/abs/2601.04399
Authors: Molly Campbell,Trevor De Clark,Mohamad Sheikho Al Jasem,Sandhya Joshi,Ajay Kumar Shrestha
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: To appear in the IEEE CCWC 2026 proceedings

Abstract:Smart voice assistants (SVAs) are embedded in the daily lives of youth, yet their privacy controls often remain opaque and difficult to manage. Through five semi-structured focus groups (N=26) with young Canadians (ages 16-24), we investigate how perceived privacy risks (PPR) and benefits (PPBf) intersect with algorithmic transparency and trust (ATT) and privacy self-efficacy (PSE) to shape privacy-protective behaviors (PPB). Our analysis reveals that policy overload, fragmented settings, and unclear data retention undermine self-efficacy and discourage protective actions. Conversely, simple transparency cues were associated with greater confidence without diminishing the utility of hands-free tasks and entertainment. We synthesize these findings into a qualitative model in which transparency friction erodes PSE, which in turn weakens PPB. From this model, we derive actionable design guidance for SVAs, including a unified privacy hub, plain-language “data nutrition” labels, clear retention defaults, and device-conditional micro-tutorials. This work foregrounds youth perspectives and offers a path for SVA governance and design that empowers young digital citizens while preserving convenience.

[AI-104] Assessing the quality and coherence of word embeddings after SCM-based intersectional bias mitigation

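Quick read: This paper addresses the propagation of social bias through static word embeddings. It extends earlier Stereotype Content Model (SCM) work, which focused on single-group stereotypes along the warmth and competence dimensions, to an analysis of intersectional bias. The key to the solution is to build compound representations for combinations of social identities through summation or concatenation and to apply three debiasing strategies (Subtraction, Linear Projection, and Partial Projection), mitigating intersectional bias while keeping the overall structure of the semantic space stable. Experiments show that SCM-based debiasing remains effective in the intersectional setting, and that the choice of aggregation method and debiasing strategy trades local neighborhood coherence against preservation of analogy behavior, which offers practical guidance for balancing stability and analogy performance in deployment.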

Link: https://arxiv.org/abs/2601.04393
Authors: Eren Kocadag,Seyed Sahand Mohammadi Ziabari,Ali Mohammed Mansoor Alsahag
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Static word embeddings often absorb social biases from the text they learn from, and those biases can quietly shape downstream systems. Prior work that uses the Stereotype Content Model (SCM) has focused mostly on single-group bias along warmth and competence. We broaden that lens to intersectional bias by building compound representations for pairs of social identities through summation or concatenation, and by applying three debiasing strategies: Subtraction, Linear Projection, and Partial Projection. We study three widely used embedding families (Word2Vec, GloVe, and ConceptNet Numberbatch) and assess them with two complementary views of utility: whether local neighborhoods remain coherent and whether analogy behavior is preserved. Across models, SCM-based mitigation carries over well to the intersectional case and largely keeps the overall semantic landscape intact. The main cost is a familiar trade-off: methods that most tightly preserve geometry tend to be more cautious about analogy behavior, while more assertive projections can improve analogies at the expense of strict neighborhood stability. Partial Projection is reliably conservative and keeps representations steady; Linear Projection can be more assertive; Subtraction is a simple baseline that remains competitive. The choice between summation and concatenation depends on the embedding family and the application goal. Together, these findings suggest that intersectional debiasing with SCM is practical in static embeddings, and they offer guidance for selecting aggregation and debiasing settings when balancing stability against analogy performance.
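The aggregation and debiasing operations named in the abstract can be sketched on toy vectors. The definitions below are assumptions based on the common forms of these operations in the debiasing literature (Subtraction removes a bias vector, Linear Projection removes the component along a unit bias direction); the paper's exact formulations, its Partial Projection variant, and how it estimates the bias direction are not given in the abstract and are not reproduced here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(dot(v, v))
    if n == 0:
        raise ValueError("bias direction must be non-zero")
    return [a / n for a in v]


def compound_sum(a, b):
    """Compound identity representation by element-wise summation."""
    return [x + y for x, y in zip(a, b)]


def compound_concat(a, b):
    """Compound identity representation by concatenation (doubles the dimension)."""
    return list(a) + list(b)


def debias_subtraction(v, bias):
    """Subtraction baseline (assumed form): remove the bias vector itself."""
    return [x - b for x, b in zip(v, bias)]


def debias_linear_projection(v, bias):
    """Linear Projection (assumed form): remove the component along the bias direction."""
    u = unit(bias)
    c = dot(v, u)
    return [x - c * ui for x, ui in zip(v, u)]


# Toy 2-D vectors; real embeddings have hundreds of dimensions.
identity_a, identity_b = [1.0, 2.0], [3.0, 4.0]
bias_direction = [2.0, 0.0]  # hypothetical bias axis

print(compound_sum(identity_a, identity_b))                  # [4.0, 6.0]
print(compound_concat(identity_a, identity_b))               # [1.0, 2.0, 3.0, 4.0]
print(debias_linear_projection(identity_a, bias_direction))  # [0.0, 2.0]
```

After linear projection the vector is orthogonal to the bias direction, which is why the more assertive projections can shift neighborhoods: every word loses its whole component along that axis, not only the stereotyped ones.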

[AI-105] Enhanced-FQL(λ), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay

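Quick read: This paper addresses the high computational complexity, poor interpretability, and low sample efficiency of deep reinforcement learning (DRL) methods on continuous control tasks. The key to the solution is Enhanced-FQL(λ), an enhanced fuzzy Q-learning framework with two core innovations: Fuzzified Eligibility Traces (FET), built on a Fuzzified Bellman Equation (FBE), for stable multi-step credit assignment; and Segmented Experience Replay (SER), which improves sample efficiency and reduces memory overhead. With theoretical convergence preserved, the method clearly outperforms n-step fuzzy temporal-difference (TD) and fuzzy SARSA(λ) baselines and has lower computational complexity than deep RL methods such as DDPG, combining transparency, resource efficiency, and performance for safety-critical settings.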

Link: https://arxiv.org/abs/2601.04392
Authors: Mohsen Jalaeian-Farimani
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Submitted to ECC26 conference

Abstract:This paper introduces a fuzzy reinforcement learning framework, Enhanced-FQL(λ), that integrates novel Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with Fuzzified Bellman Equation (FBE) for continuous control tasks. The proposed approach employs an interpretable fuzzy rule base instead of complex neural architectures, while maintaining competitive performance through two key innovations: a fuzzified Bellman equation with eligibility traces for stable multi-step credit assignment, and a memory-efficient segment-based experience replay mechanism for enhanced sample efficiency. Theoretical analysis proves the proposed method convergence under standard assumptions. Extensive evaluations in continuous control domains demonstrate that Enhanced-FQL(λ) achieves superior sample efficiency and reduced variance compared to n-step fuzzy TD and fuzzy SARSA(λ) baselines, while maintaining substantially lower computational complexity than deep RL alternatives such as DDPG. The framework’s inherent interpretability, combined with its computational efficiency and theoretical convergence guarantees, makes it particularly suitable for safety-critical applications where transparency and resource constraints are essential.

[AI-106] SciFig: Towards Automating Scientific Figure Generation

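Quick read: This paper addresses the time and professional design skill that researchers need to create high-quality figures and visualizations by hand when writing scientific papers. More than 2.5 million scientific papers are published each year, yet figure generation still relies mainly on manual work. The core of the solution is SciFig, an end-to-end AI agent system that generates publication-ready pipeline figures directly from the text of a research paper. The key innovations are a hierarchical layout generation strategy, which parses research descriptions to identify relationships between components, groups related elements into functional modules, and generates inter-module connections to organize the figure visually; and an iterative chain-of-thought (CoT) feedback mechanism that improves layout quality over multiple rounds of visual analysis and reasoning.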

Link: https://arxiv.org/abs/2601.04390
Authors: Siyuan Huang,Yutong Gao,Juyang Bai,Yifan Zhou,Zi Yin,Xinxin Liu,Rama Chellappa,Chun Pong Lau,Sayan Nag,Cheng Peng,Shraman Pramanick
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Creating high-quality figures and visualizations for scientific papers is a time-consuming task that requires both deep domain knowledge and professional design skills. Despite over 2.5 million scientific papers published annually, the figure generation process remains largely manual. We introduce SciFig, an end-to-end AI agent system that generates publication-ready pipeline figures directly from research paper texts. SciFig uses a hierarchical layout generation strategy, which parses research descriptions to identify component relationships, groups related elements into functional modules, and generates inter-module connections to establish visual organization. Furthermore, an iterative chain-of-thought (CoT) feedback mechanism progressively improves layouts through multiple rounds of visual analysis and reasoning. We introduce a rubric-based evaluation framework that analyzes 2,219 real scientific figures to extract evaluation rubrics and automatically generates comprehensive evaluation criteria. SciFig demonstrates remarkable performance: achieving 70.1% overall quality on dataset-level evaluation and 66.2% on paper-specific evaluation, and consistently high scores across metrics such as visual clarity, structural organization, and scientific accuracy. SciFig figure generation pipeline and our evaluation benchmark will be open-sourced.
zh

[AI-107] LLM-Guided Lifecycle-Aware Clustering of Multi-Turn Customer Support Conversations AACL2025

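[Quick Read]: This paper addresses several problems cloud service providers face when handling multi-service queries: traditional clustering methods partition clusters inaccurately because of overlapping concerns, static clusters degrade over time, and frequent reclustering disrupts the continuity of issue tracking. The key to the solution is an adaptive system that segments multi-turn conversations at a fine granularity into service-specific concerns and incrementally refines only the clusters whose quality has degraded, using an LLM-driven splitting strategy. Cluster quality is monitored dynamically with the Davies-Bouldin Index (DBI) and the Silhouette Score, so clustering quality improves without full reclustering. Experiments show that the method raises Silhouette Scores by more than 100% and lowers DBI by 65.6%.

Link: https://arxiv.org/abs/2601.04388
Authors: Priyaranjan Pattnayak, Sanchari Chowdhuri, Amit Agarwal, Hitesh Laxmichand Patel
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted in AACL 2025 Main Conference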

Abstract: Clustering customer chat data is vital for cloud providers handling multi-service queries. Traditional methods struggle with overlapping concerns and create broad, static clusters that degrade over time. Reclustering disrupts continuity, making issue tracking difficult. We propose an adaptive system that segments multi-turn chats into service-specific concerns and incrementally refines clusters as new issues arise. Cluster quality is tracked via the Davies-Bouldin Index and Silhouette Scores, with LLM-based splitting applied only to degraded clusters. Our method improves Silhouette Scores by over 100% and reduces DBI by 65.6% compared to baselines, enabling scalable, real-time analytics without full reclustering.
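
The quality-monitoring step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it computes per-cluster Silhouette Scores and the Davies-Bouldin Index with plain Euclidean distance and flags clusters whose mean silhouette falls below a threshold. The function names and the `min_silhouette=0.25` threshold are hypothetical, and the sketch assumes at least two clusters with distinct centroids.

```python
import math
from collections import defaultdict


def _group(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return clusters


def silhouette_by_cluster(points, labels):
    """Mean silhouette coefficient of each cluster (higher is better)."""
    clusters = _group(points, labels)
    scores = defaultdict(list)
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores[label].append(0.0)  # usual convention for singletons
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores[label].append(0.0 if denom == 0 else (b - a) / denom)
    return {label: sum(s) / len(s) for label, s in scores.items()}


def davies_bouldin(points, labels):
    """Davies-Bouldin Index of a clustering (lower is better)."""
    clusters = _group(points, labels)
    centroids, scatter = {}, {}
    for label, members in clusters.items():
        centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
        centroids[label] = centroid
        scatter[label] = sum(math.dist(p, centroid) for p in members) / len(members)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


def degraded_clusters(points, labels, min_silhouette=0.25):
    """Labels of clusters that would be candidates for splitting."""
    per_cluster = silhouette_by_cluster(points, labels)
    return sorted(l for l, s in per_cluster.items() if s < min_silhouette)
```

In the paper's setting the points would be embeddings of conversation segments, and only the clusters returned by a check like `degraded_clusters` would be passed to the LLM-based splitter, which is what avoids full reclustering.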

[AI-108] Graph Integrated Transformers for Community Detection in Social Networks

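[Quick Read]: This paper addresses community detection in complex social networks, in particular the difficulty traditional methods have in combining local structural information with global semantic information. The key to the solution is a hybrid model, GIT-CD (Graph Integrated Transformer for Community Detection), whose core innovation is to combine a graph neural network (GNN) with an attention-based Transformer: the GNN module captures the local topology between nodes, while the Transformer module models long-range dependencies. A self-optimizing clustering module then refines community assignments jointly through K-Means, a silhouette loss, and KL divergence minimization, which improves the accuracy and robustness of community detection.

Link: https://arxiv.org/abs/2601.04367
Authors: Heba Zahran, M. Omair Shafiq
Affiliation: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Paper accepted at IEEE GLOBECOM 2025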

Abstract: Community detection is crucial for applications like targeted marketing and recommendation systems. Traditional methods rely on network structure, and embedding-based models integrate semantic information. However, it remains challenging for a model to leverage both local and global information from complex structures like social networks. Graph Neural Networks (GNNs) and Transformers have shown superior performance in capturing local and global relationships. In this paper, we propose Graph Integrated Transformer for Community Detection (GIT-CD), a hybrid model combining GNNs and Transformer-based attention mechanisms to enhance community detection in social networks. Specifically, the GNN module captures local graph structures, while the Transformer module models long-range dependencies. A self-optimizing clustering module refines community assignments using K-Means, silhouette loss, and KL divergence minimization. Experimental results on benchmark datasets show that GIT-CD outperforms state-of-the-art models, making it a robust approach for detecting meaningful communities in complex social networks.

[AI-109] Causally-Aware Information Bottleneck for Domain Adaptation AAMAS2026

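[Quick Read]: This paper addresses a common domain adaptation problem in causal systems, in which the target variable is observable in the source domain but entirely missing in the target domain. The core challenge is to reconstruct the target variable accurately from the remaining observed variables while coping with distribution shift between domains. The key to the solution is to learn a compact, mechanism-stable representation that retains the information useful for predicting the target variable and discards spurious correlations introduced by cross-domain variation. For linear Gaussian causal models, the author derives a closed-form Gaussian Information Bottleneck (GIB) solution that is equivalent to a CCA-style projection and can incorporate directed acyclic graph (DAG) priors when desired. For nonlinear or non-Gaussian data, the paper proposes an encoder-predictor architecture based on the Variational Information Bottleneck (VIB) that scales to high dimensions and can be trained on the source domain and deployed zero-shot to the target domain. Experiments on synthetic and real data show stable and accurate imputation.

Link: https://arxiv.org/abs/2601.04361
Authors: Mohammad Ali Javidian
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: An extended abstract version of this work was accepted for the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)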

Abstract:We tackle a common domain adaptation setting in causal systems. In this setting, the target variable is observed in the source domain but is entirely missing in the target domain. We aim to impute the target variable in the target domain from the remaining observed variables under various shifts. We frame this as learning a compact, mechanism-stable representation. This representation preserves information relevant for predicting the target while discarding spurious variation. For linear Gaussian causal models, we derive a closed-form Gaussian Information Bottleneck (GIB) solution. This solution reduces to a canonical correlation analysis (CCA)-style projection and offers Directed Acyclic Graph (DAG)-aware options when desired. For nonlinear or non-Gaussian data, we introduce a Variational Information Bottleneck (VIB) encoder-predictor. This approach scales to high dimensions and can be trained on source data and deployed zero-shot to the target domain. Across synthetic and real datasets, our approach consistently attains accurate imputations, supporting practical use in high-dimensional causal models and furnishing a unified, lightweight toolkit for causal domain adaptation.
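
For readers new to the information bottleneck, the objective that both the GIB and VIB variants build on can be written in its standard textbook form. Here X denotes the observed variables, Y the target variable to impute, Z the learned representation, and β a trade-off weight. This notation and sign convention are assumptions for illustration and may differ from the paper's.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term compresses the observed variables, and the second rewards keeping whatever in Z predicts the target. Variational versions replace both mutual-information terms with tractable bounds so that the encoder can be trained by gradient descent.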

[AI-110] Summary of The Inaugural Music Source Restoration Challenge

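[Quick Read]: This paper addresses Music Source Restoration (MSR), the task of recovering the original, unprocessed instrument stems from professionally mixed and degraded audio, which requires reversing both production effects and real-world degradations. The key contribution is the first MSR Challenge, which combines objective evaluation with multiple metrics (Multi-Mel-SNR, Zimtohrli, FAD-CLAP) on studio-produced mixtures with subjective evaluation (MOS-Overall) on real-world degraded recordings. A comparison of the five participating teams reveals large differences in restoration difficulty across instruments (bass averages 4.59 dB across teams, whereas percussion averages only 0.29 dB), providing benchmark data and a methodological reference for future work.

Link: https://arxiv.org/abs/2601.04343
Authors: Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: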

Abstract:Music Source Restoration (MSR) aims to recover original, unprocessed instrument stems from professionally mixed and degraded audio, requiring the reversal of both production effects and real-world degradations. We present the inaugural MSR Challenge, which features objective evaluation on studio-produced mixtures using Multi-Mel-SNR, Zimtohrli, and FAD-CLAP, alongside subjective evaluation on real-world degraded recordings. Five teams participated in the challenge. The winning system achieved 4.46 dB Multi-Mel-SNR and 3.47 MOS-Overall, corresponding to relative improvements of 91% and 18% over the second-place system, respectively. Per-stem analysis reveals substantial variation in restoration difficulty across instruments, with bass averaging 4.59 dB across all teams, while percussion averages only 0.29 dB. The dataset, evaluation protocols, and baselines are available at this https URL.

[AI-111] Pilot Study on Student Public Opinion Regarding GAI

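[Quick Read]: This paper examines how generative AI (GAI) is perceived in university classrooms. It surveys university students' attitudes toward the use of generative AI in higher education to provide an empirical basis for later integration into teaching. The approach is a pilot survey that identifies how students perceive and accept generative AI; the paper stresses that future studies need larger samples to make the conclusions more representative and reliable. These insights can help instructors design course content in a more targeted way and foster students' critical understanding and effective use of this transformative technology.

Link: https://arxiv.org/abs/2601.04336
Authors: William Franz Lamberti, Sunbin Kim, Samantha Rose Lawrence
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Applications (stat.AP)
Comments: 7 pages, 8 figures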

Abstract:The emergence of generative AI (GAI) has sparked diverse opinions regarding its appropriate use across various domains, including education. This pilot study investigates university students’ perceptions of GAI in higher education classrooms, aiming to lay the groundwork for understanding these attitudes. With a participation rate of approximately 4.4%, the study highlights the challenges of engaging students in GAI-related research and underscores the need for larger sample sizes in future studies. By gaining insights into student perspectives, instructors can better prepare to integrate discussions of GAI into their classrooms, fostering informed and critical engagement with this transformative technology.

[AI-112] ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation

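[Quick Read]: This paper addresses the difficulty of parallel programming in high-performance computing (HPC) and AI, in particular OpenMP GPU offload, where data movement and parameter tuning dominate performance. Existing autonomous coding agents can compile, test, and profile code on the target hardware, but their output is brittle without domain-specific scaffolding. The key to the solution is ParaCodex, an HPC-engineer workflow that turns a Codex-based agent into an autonomous OpenMP GPU offload system through staged hotspot analysis, explicit data planning, correctness gating, and profiling-guided refinement. On HeCBench, Rodinia, and NAS, the method translated all 31 valid serial CPU kernels to GPU offload, achieved geometric-mean speedups of 3x on HeCBench and 5x on Rodinia over the reference OpenMP implementations, and outperformed a zero-shot Codex baseline.

Link: https://arxiv.org/abs/2601.04327
Authors: Erel Kaplan, Tomer Bitan, Lian Ghrayeb, Le Chen, Tom Yotam, Niranjan Hasabnis, Gal Oren
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: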

Abstract: Parallel programming is central to HPC and AI, but producing code that is correct and fast remains challenging, especially for OpenMP GPU offload, where data movement and tuning dominate. Autonomous coding agents can compile, test, and profile on target hardware, but outputs are brittle without domain scaffolding. We present ParaCodex, an HPC-engineer workflow that turns a Codex-based agent into an autonomous OpenMP GPU offload system using staged hotspot analysis, explicit data planning, correctness gating, and profiling-guided refinement. We evaluate translation from serial CPU kernels to OpenMP GPU offload kernels on HeCBench, Rodinia, and NAS. After excluding five kernels, ParaCodex succeeded on all 31 valid kernels. The generated kernels improved GPU time over reference OpenMP implementations in 25/31 cases, achieving geometric-mean speedups of 3x on HeCBench and 5x on Rodinia, and outperforming a zero-shot Codex baseline on all suites. We also evaluate CUDA to OpenMP offload translation on ParEval, where ParaCodex maintains high compilation and validation rates in code-only and end-to-end settings.
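
The abstract reports geometric-mean rather than arithmetic-mean speedups. The short sketch below shows how such a figure is typically computed from per-kernel timings and why it is less sensitive to a single outlier kernel. The function name and the timings in the usage note are hypothetical, not data from the paper.

```python
import math


def geometric_mean_speedup(reference_times, generated_times):
    """Geometric mean of per-kernel speedups (reference time / generated time)."""
    ratios = [ref / gen for ref, gen in zip(reference_times, generated_times)]
    # Averaging in log space is equivalent to taking the n-th root of the product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For example, one kernel that runs 4x faster and one that runs 4x slower give a geometric mean of 1.0 (no net change), whereas the arithmetic mean of the two ratios would be 2.125 and would suggest a speedup.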

[AI-113] Online Action-Stacking Improves Reinforcement Learning Performance for Air Traffic Control

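[Quick Read]: This paper addresses a key challenge in applying reinforcement learning (RL) to air traffic control (ATC): how to train with a simplified, discrete action space while still producing complex instructions that conform to operational practice at inference time. Fine-grained control usually calls for a high-dimensional action space, which is hard to reconcile with ATC's requirement for interpretable, operationally realistic instructions. The core of the solution is online action-stacking. During training, the agent issues only simple incremental heading or level adjustments (for example, five primitive actions), and an action-damping penalty leads it to issue commands in short bursts. At inference time, the mechanism compiles these sequences of primitive actions into compound clearances that follow ATC conventions. Experiments show that the method greatly reduces instruction frequency and matches the performance of a policy with a 37-dimensional action space, helping to bridge the gap between standard RL formulations and operational ATC requirements.

Link: https://arxiv.org/abs/2601.04287
Authors: Ben Carvell, George De Ath, Eseoghene Benjamin, Richard Everson
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: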

Abstract:We introduce online action-stacking, an inference-time wrapper for reinforcement learning policies that produces realistic air traffic control commands while allowing training on a much smaller discrete action space. Policies are trained with simple incremental heading or level adjustments, together with an action-damping penalty that reduces instruction frequency and leads agents to issue commands in short bursts. At inference, online action-stacking compiles these bursts of primitive actions into domain-appropriate compound clearances. Using Proximal Policy Optimisation and the BluebirdDT digital twin platform, we train agents to navigate aircraft along lateral routes, manage climb and descent to target flight levels, and perform two-aircraft collision avoidance under a minimum separation constraint. In our lateral navigation experiments, action stacking greatly reduces the number of issued instructions relative to a damped baseline and achieves comparable performance to a policy trained with a 37-dimensional action space, despite operating with only five actions. These results indicate that online action-stacking helps bridge a key gap between standard reinforcement learning formulations and operational ATC requirements, and provides a simple mechanism for scaling to more complex control scenarios.
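
The idea of compiling bursts of primitive actions into compound clearances can be illustrated with a small sketch. It is not the paper's algorithm: it assumes actions are signed heading increments in degrees, that `0` encodes "no instruction", and that a burst ends at the first no-op, at which point one clearance with the net change is issued. The `Clearance` type and `stack_actions` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

NO_OP = 0  # assumed encoding: the policy issues nothing at this step


@dataclass(frozen=True)
class Clearance:
    step: int            # time step at which the compound instruction is issued
    heading_change: int  # net heading change in degrees (positive = right turn)


def stack_actions(actions: Iterable[int]) -> Iterator[Clearance]:
    """Merge each run of consecutive non-zero increments into one clearance."""
    total = 0
    last_step = -1
    for step, delta in enumerate(actions):
        last_step = step
        if delta != NO_OP:
            total += delta  # still inside a burst: keep accumulating
            continue
        if total != 0:
            yield Clearance(step, total)  # burst ended with a net change
        total = 0
    if total != 0:
        yield Clearance(last_step, total)  # flush a burst that ends the episode
```

With the action sequence `[0, 5, 5, 10, 0, 0, -5, 0, 5, -5, 0, 10]`, seven primitive adjustments collapse into three clearances (+20, -5, +10), and the burst `5, -5` produces no instruction because its net change is zero. Dropping zero-net bursts is a design choice of this sketch.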

[AI-114] A Future Capabilities Agent for Tactical Air Traffic Control

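[Quick Read]: This paper addresses the trade-off between safety assurance and interpretability in automation for air traffic control: optimisation-based methods such as reinforcement learning perform well but are hard to verify and explain, while rules-based systems are transparent but rarely check safety under uncertainty. The key to the solution is Agent Mallard, a forward-planning, rules-based agent for tactical control in systemised airspace that embeds a stochastic digital twin directly in its conflict-resolution loop, so that candidate manoeuvres are validated in advance against uncertain execution scenarios (such as wind variation, pilot response, and communication loss). Operating on predefined GPS-guided routes, the system reduces continuous 4D vectoring to discrete choices over lanes and levels and builds hierarchical plans from an expert-informed library of deconfliction strategies. A depth-limited backtracking search, combining causal attribution, topological plan splicing, and monotonic axis constraints, checks each candidate manoeuvre for safety before committing to it. The design aims to combine model-based safety assessment, interpretable decision logic, and tractable computation.

Link: https://arxiv.org/abs/2601.04285
Authors: Paul Kent, George De Ath, Martin Layton, Allen Hart, Richard Everson, Ben Carvell
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: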

Abstract: Escalating air traffic demand is driving the adoption of automation to support air traffic controllers, but existing approaches face a trade-off between safety assurance and interpretability. Optimisation-based methods such as reinforcement learning offer strong performance but are difficult to verify and explain, while rules-based systems are transparent yet rarely check safety under uncertainty. This paper outlines Agent Mallard, a forward-planning, rules-based agent for tactical control in systemised airspace that embeds a stochastic digital twin directly into its conflict-resolution loop. Mallard operates on predefined GPS-guided routes, reducing continuous 4D vectoring to discrete choices over lanes and levels, and constructs hierarchical plans from an expert-informed library of deconfliction strategies. A depth-limited backtracking search uses causal attribution, topological plan splicing, and monotonic axis constraints to seek a complete safe plan for all aircraft, validating each candidate manoeuvre against uncertain execution scenarios (e.g., wind variation, pilot response, communication loss) before commitment. Preliminary walkthroughs with UK controllers and initial tests in the BluebirdDT airspace digital twin indicate that Mallard’s behaviour aligns with expert reasoning and resolves conflicts in simplified scenarios. The architecture is intended to combine model-based safety assessment, interpretable decision logic, and tractable computational performance in future structured en-route environments.

[AI-115] An ASP-based Solution to the Medical Appointment Scheduling Problem

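[Quick Read]: This paper addresses inefficiency, heavy administrative burden, and insufficient patient-centered care in medical appointment scheduling, with particular attention to the individual needs of vulnerable populations. The key to the solution is a framework based on Answer Set Programming (ASP) that personalizes scheduling by integrating Blueprint Personas and centralizes planning operations in an ASP logic model, which ensures real-time availability updates, conflict-free assignments, and seamless interoperability with existing healthcare platforms.

Link: https://arxiv.org/abs/2601.04274
Authors: Alina Vozna (University of Pisa and University of L’Aquila), Andrea Monaldini (University of Pisa and University of L’Aquila), Stefania Costantini (University of L’Aquila), Valentina Pitoni (University of L’Aquila), Dawid Pado (University of L’Aquila)
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047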

Abstract:This paper presents an Answer Set Programming (ASP)-based framework for medical appointment scheduling, aimed at improving efficiency, reducing administrative overhead, and enhancing patient-centered care. The framework personalizes scheduling for vulnerable populations by integrating Blueprint Personas. It ensures real-time availability updates, conflict-free assignments, and seamless interoperability with existing healthcare platforms by centralizing planning operations within an ASP logic model.

[AI-116] Hybrid MKNF for Aeronautics Applications: Usage and Heuristics

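[Quick Read]: This paper addresses two core challenges in applying knowledge representation and reasoning (KRR) to aeronautics: achieving enough expressivity to capture complex domain knowledge, and executing reasoning tasks efficiently while minimizing memory usage and computational overhead. The key to the solution is Hybrid MKNF, a well-established KR language whose semantics and query answering capabilities integrate rules and ontologies seamlessly, raising expressivity while keeping reasoning efficient. The study evaluates the suitability of Hybrid MKNF through a concrete case study, identifies additional expressivity features that are crucial for aeronautics applications, and proposes heuristics to support their integration into the framework.

Link: https://arxiv.org/abs/2601.04273
Authors: Arun Raveendran Nair Sheela (Universite Clermont Auvergne, LIMOS Laboratory, Thales), Florence De Grancey (Thales), Christophe Rey (Universite Clermont Auvergne, LIMOS Laboratory CNRS, France), Victor Charpenay (Ecole des Mines de Saint-Etienne, LIMOS Laboratory CNRS, France)
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047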

Abstract: The deployment of knowledge representation and reasoning technologies in aeronautics applications presents two main challenges: achieving sufficient expressivity to capture complex domain knowledge, and executing reasoning tasks efficiently while minimizing memory usage and computational overhead. An effective strategy for attaining the necessary expressivity involves integrating two fundamental KR concepts: rules and ontologies. This study adopts the well-established KR language Hybrid MKNF owing to its seamless integration of rules and ontologies through its semantics and query answering capabilities. We evaluated Hybrid MKNF to assess its suitability in the aeronautics domain through a concrete case study. We identified additional expressivity features that are crucial for developing aeronautics applications and proposed a set of heuristics to support their integration into the Hybrid MKNF framework.

[AI-117] Propositional Abduction via Only-Knowing: A Non-Monotonic Approach

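[Quick Read]: This paper addresses the limitations of traditional formalizations of abductive reasoning, in particular how to connect abductive explanation more closely with epistemic states. The key to the solution is an extension of the logic of only-knowing: an abduction modal operator embeds abductive reasoning in a modal language, and a preferential relation over modal frames selects among candidate explanations, yielding a non-monotonic system for abductive reasoning. The approach can express different selection methods, and an analysis of core metatheoretic properties shows that it provides a well-behaved formal foundation for abductive reasoning.

Link: https://arxiv.org/abs/2601.04272
Authors: Sanderson Molick (Division of Humanities - Federal Institute of Para), Vaishak Belle (School of Informatics - University of Edinburgh)
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047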

Abstract: The paper introduces a basic logic of knowledge and abduction by extending Levesque's logic of only-knowing with an abduction modal operator defined via the combination of basic epistemic concepts. The upshot is an alternative approach to abduction that employs a modal vocabulary and explores the relation between abductive reasoning and epistemic states of only knowing. Furthermore, by incorporating a preferential relation into modal frames, we provide a non-monotonic extension of our basic framework capable of expressing different selection methods for abductive explanations. Core metatheoretic properties of non-monotonic consequence relations are explored within this setting and shown to provide a well-behaved foundation for abductive reasoning.

[AI-118] Correcting Autonomous Driving Object Detection Misclassifications with Automated Commonsense Reasoning

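[Quick Read]: This paper addresses the difficulty current autonomous vehicles (AVs) have in handling abnormal road scenarios for which there is not enough training data. Noting that SAE Level 5 autonomy has not yet been achieved, the authors argue that over-reliance on machine learning is the main reason. The key to the solution is automated commonsense reasoning, invoked when the computer vision model's detections are uncertain, which correctly identifies traffic light colors and obstacles that the perception model fails to capture (such as animals on the road) and shows that hybrid models are an effective way to improve AV perception.

Link: https://arxiv.org/abs/2601.04271
Authors: Keegan Kimbrell (University of Texas at Dallas), Wang Tianhao (University of Texas at Dallas), Feng Chen (University of Texas at Dallas), Gopal Gupta (University of Texas at Dallas)
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: In Proceedings ICLP 2025, arXiv:2601.00047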

Abstract: Autonomous Vehicle (AV) technology has been heavily researched and sought after, yet there are no SAE Level 5 AVs available today in the marketplace. We contend that over-reliance on machine learning technology is the main reason. Use of automated commonsense reasoning technology, we believe, can help achieve SAE Level 5 autonomy. In this paper, we show how automated commonsense reasoning technology can be deployed in situations where there are not enough data samples available to train a deep learning-based AV model that can handle certain abnormal road scenarios. Specifically, we consider two situations where (i) a traffic signal is malfunctioning at an intersection and (ii) all the cars ahead are slowing down and steering away due to an unexpected obstruction (e.g., animals on the road). We show that in such situations, our commonsense reasoning-based solution accurately detects traffic light colors and obstacles not correctly captured by the AV’s perception model. We also provide a pathway for efficiently invoking commonsense reasoning by measuring uncertainty in the computer vision model and using commonsense reasoning to handle uncertain scenarios. We describe our experiments conducted using the CARLA simulator and the results obtained. The main contribution of our research is to show that automated commonsense reasoning effectively corrects AV-based object detection misclassifications and that hybrid models provide an effective pathway to improving AV perception.

[AI-119] Systems Explaining Systems: A Framework for Intelligence and Consciousness

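[Quick Read]: This paper addresses the question of how intelligence and consciousness can emerge from a general architecture, in response to the limitations of models that rely on predictive coding or domain-specific mechanisms. The core proposal is a conceptual framework grounded in relational structure: intelligence is defined as the capacity to form and integrate causal connections between signals, actions, and internal states, while consciousness arises from recursive architectures in which higher-order systems learn and interpret the relational patterns of lower-order systems across time. These interpretations are integrated into a dynamically stabilized meta-state and fed back through context enrichment, so that internal models shift from representing the external world to modeling the system's own cognitive processes. The framework reinterprets predictive processing as an emergent consequence of contextual interpretation rather than explicit forecasting, and suggests that recursive multi-system architectures may be necessary for more human-like AI.

Link: https://arxiv.org/abs/2601.04269
Authors: Sean Niklas Semmler
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: This work is presented as a preprint, and the author welcomes constructive feedback and discussion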

Abstract: This paper proposes a conceptual framework in which intelligence and consciousness emerge from relational structure rather than from prediction or domain-specific mechanisms. Intelligence is defined as the capacity to form and integrate causal connections between signals, actions, and internal states. Through context enrichment, systems interpret incoming information using learned relational structure that provides essential context in an efficient representation that the raw input itself does not contain, enabling efficient processing under metabolic constraints. Building on this foundation, we introduce the systems-explaining-systems principle, where consciousness emerges when recursive architectures allow higher-order systems to learn and interpret the relational patterns of lower-order systems across time. These interpretations are integrated into a dynamically stabilized meta-state and fed back through context enrichment, transforming internal models from representations of the external world into models of the system’s own cognitive processes. The framework reframes predictive processing as an emergent consequence of contextual interpretation rather than explicit forecasting and suggests that recursive multi-system architectures may be necessary for more human-like artificial intelligence.

[AI-120] Learning to Reason: Temporal Saliency Distillation for Interpretable Knowledge Transfer ECAI2025

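[Quick Read]: This paper addresses two shortcomings of current knowledge distillation methods for time series model compression. First, existing methods based on logit and feature alignment lack interpretability, so the mechanism by which knowledge passes from teacher to student is unclear. Second, they reproduce only the teacher's predictive accuracy and do not lead the student to produce predictive distributions consistent with the teacher's, which limits safe substitution in practice. The key to the solution is Temporal Saliency Distillation, which derives temporal saliency (the importance of each input timestep to the teacher's prediction) from the teacher's logits and guides the student to rely on the same input features as the teacher. This improves predictive performance as well as the interpretability of the student and the similarity of its output distribution to the teacher's. The method requires no additional parameters or architecture-specific assumptions.

Link: https://arxiv.org/abs/2601.04263
Authors: Nilushika Udayangani Hewa Dehigahawattage, Kishor Nandakishor, Marimuthu Palaniswami
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2025), IOS Press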

Abstract: Knowledge distillation has proven effective for model compression by transferring knowledge from a larger network, called the teacher, to a smaller network, called the student. Current knowledge distillation in time series is predominantly based on logit and feature alignment techniques originally developed for computer vision tasks. These methods do not explicitly account for temporal data and fall short in two key aspects. First, the mechanisms by which the transferred knowledge helps the student model's learning process remain unclear because logits and features are not interpretable. Second, these methods transfer only limited knowledge, primarily replicating the teacher's predictive accuracy. As a result, student models often produce predictive distributions that differ significantly from those of their teachers, hindering their safe substitution for teacher models. In this work, we propose transferring interpretable knowledge by extending conventional logit transfer to convey not just the right prediction but also the right reasoning of the teacher. Specifically, we induce other useful knowledge from the teacher logits, termed temporal saliency, which captures the importance of each input timestep to the teacher's prediction. By training the student with Temporal Saliency Distillation, we encourage it to make predictions based on the same input features as the teacher. Temporal Saliency Distillation requires no additional parameters or architecture-specific assumptions. We demonstrate that Temporal Saliency Distillation effectively improves the performance of baseline methods while also achieving desirable properties beyond predictive accuracy. We hope our work establishes a new paradigm for interpretable knowledge distillation in time series analysis.
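
To make the notion of temporal saliency concrete, the sketch below scores each timestep by how much the teacher's target-class logit changes when that timestep is replaced with a baseline value, then compares teacher and student saliency with a KL divergence. Occlusion is a swapped-in technique chosen for simplicity: the abstract does not say how saliency is derived from the logits, so the functions `occlusion_saliency` and `saliency_kl`, the zero baseline, and the KL-based loss are all assumptions.

```python
import math


def occlusion_saliency(model, x, target, baseline=0.0):
    """Normalized importance of each timestep for the target-class logit.

    `model` is any callable mapping a sequence to a list of logits.
    """
    reference = model(x)[target]
    drops = []
    for t in range(len(x)):
        masked = list(x)
        masked[t] = baseline  # occlude one timestep at a time
        drops.append(abs(reference - model(masked)[target]))
    total = sum(drops)
    if total == 0:
        return [1.0 / len(x)] * len(x)  # no timestep matters: uniform saliency
    return [d / total for d in drops]


def saliency_kl(teacher_saliency, student_saliency, eps=1e-12):
    """KL(teacher || student) between two saliency distributions."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(teacher_saliency, student_saliency)
        if p > 0
    )
```

A distillation loss in this spirit would add `saliency_kl` (suitably weighted) to the usual logit-matching term, so the student is penalized when it attends to different timesteps than the teacher even if its prediction is correct.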

[AI-121] Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis

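[Quick Read]: This paper addresses the multi-objective optimization conflict that arises during safety alignment of large language models (LLMs), and in particular the unintended degradation of general capabilities it causes. Existing methods typically rely on global gradient geometry to mitigate the conflict and overlook modular heterogeneity in Transformers: functional sensitivity and degree of conflict vary substantially across attention heads. The key to the solution is Conflict-Aware Sparse Tuning (CAST), which builds a pre-alignment conflict map by combining Optimization Conflict and Functional Sensitivity and uses it to update parameters selectively. Simply skipping the high-conflict heads during training substantially reduces the loss of general capabilities without compromising safety, offering an interpretable and parameter-efficient approach.

Link: https://arxiv.org/abs/2601.04262
Authors: Wang Cai, Yilin Wen, Jinchang Hou, Du Su, Guoqiu Wang, Zhonghou Lv, Chenfu Bao, Yunfang Wu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: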

Abstract: Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of "high-conflict" heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.
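
A head-level conflict map of the kind described above might look like the following sketch. The scoring rule is an assumption made for illustration, not the paper's definition: Optimization Conflict is approximated by the negative cosine similarity between a head's safety gradient and its utility gradient, and this is multiplied by a per-head Functional Sensitivity value supplied by the caller. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else dot / norm


def conflict_scores(safety_grads, utility_grads, sensitivity):
    """Per-head score: opposing gradients on a utility-sensitive head score high."""
    return {
        head: max(0.0, -cosine(safety_grads[head], utility_grads[head]))
        * sensitivity[head]
        for head in safety_grads
    }


def heads_to_skip(scores, k):
    """Up to k heads with the highest positive conflict score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [head for head in ranked[:k] if scores[head] > 0]
```

During fine-tuning, the returned heads would simply be frozen (their gradients masked out) while all other parameters update normally, which is the sparse, skip-based update the abstract describes.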

[AI-122] Towards a Mechanistic Understanding of Propositional Logical Reasoning in Large Language Models

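[Quick Read]: This paper addresses the lack of clarity about the internal computation large language models (LLMs) use for propositional logical reasoning: existing mechanistic studies mostly identify task-specific circuits and leave the model's general computational strategies unexplained. The key to the solution is a systematic analysis of Qwen3 (8B and 14B parameters) on the PropLogic-MI dataset, which reveals a structured computational architecture with four mechanisms: Staged Computation, Information Transmission, Fact Retrospection, and Specialized Attention Heads. These mechanisms hold consistently across model scales, logic rule types, and reasoning depths, providing mechanistic evidence that LLMs use organized computational strategies for logical reasoning.

Link: https://arxiv.org/abs/2601.04260
Authors: Danchun Chen, Qiyao Yan, Liangming Pan
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: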

Abstract: Understanding how Large Language Models (LLMs) perform logical reasoning internally remains a fundamental challenge. While prior mechanistic studies focus on identifying task-specific circuits, they leave open the question of what computational strategies LLMs employ for propositional reasoning. We address this gap through comprehensive analysis of Qwen3 (8B and 14B) on PropLogic-MI, a controlled dataset spanning 11 propositional logic rule categories across one-hop and two-hop reasoning. Rather than asking "which components are necessary," we ask "how does the model organize computation?" Our analysis reveals a coherent computational architecture comprising four interlocking mechanisms: Staged Computation (layer-wise processing phases), Information Transmission (information flow aggregation at boundary tokens), Fact Retrospection (persistent re-access of source facts), and Specialized Attention Heads (functionally distinct head types). These mechanisms generalize across model scales, rule types, and reasoning depths, providing mechanistic evidence that LLMs employ structured computational strategies for logical reasoning.

[AI-123] Cross-Language Speaker Attribute Prediction Using MIL and RL

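[Quick Read]: This paper addresses the challenges of predicting speaker attributes (such as gender and age) in multilingual settings, including linguistic variation, domain mismatch, and data imbalance across languages. The core of the solution is RLMIL-DAT, a framework that adds domain adversarial training to reinforcement-learning-based instance selection so that the model learns language-invariant utterance representations. The key innovation is the combination of instance selection with domain adversarial adaptation, which improves cross-lingual transfer, especially to lower-resource languages; ablation experiments indicate that domain adversarial training is the main source of the gains.

Link: https://arxiv.org/abs/2601.04257
Authors: Sunny Shu, Seyed Sahand Mohammadi Ziabari, Ali Mohammed Mansoor Alsahag
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: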

Abstract:We study multilingual speaker attribute prediction under linguistic variation, domain mismatch, and data imbalance across languages. We propose RLMIL-DAT, a multilingual extension of the reinforced multiple instance learning framework that combines reinforcement learning based instance selection with domain adversarial training to encourage language invariant utterance representations. We evaluate the approach on a five language Twitter corpus in a few shot setting and on a VoxCeleb2 derived corpus covering forty languages in a zero shot setting for gender and age prediction. Across a wide range of model configurations and multiple random seeds, RLMIL-DAT consistently improves Macro F1 compared to standard multiple instance learning and the original reinforced multiple instance learning framework. The largest gains are observed for gender prediction, while age prediction remains more challenging and shows smaller but positive improvements. Ablation experiments indicate that domain adversarial training is the primary contributor to the performance gains, enabling effective transfer from high resource English to lower resource languages by discouraging language specific cues in the shared encoder. In the zero shot setting on the smaller VoxCeleb2 subset, improvements are generally positive but less consistent, reflecting limited statistical power and the difficulty of generalizing to many unseen languages. Overall, the results demonstrate that combining instance selection with adversarial domain adaptation is an effective and robust strategy for cross lingual speaker attribute prediction.

[AI-124] Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models

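[Quick Read]: This paper examines how multi-hop contextual reasoning ability differs across large language models (LLMs) and why, focusing on the performance gap between rule-based methods and LLM-based multi-agent systems on different task types. Its approach is a synthetic evaluation framework that systematically compares four mid-scale models (LLaMA-3 8B, LLaMA-2 13B, Mixtral 8x7B, and DeepSeek-V2 16B) over 120 trials, yielding three core findings: (1) multi-agent gains depend on base model capability, appearing only for models with sufficient reasoning ability (LLaMA-3 8B and Mixtral), with improvements of up to 46.7 percentage points, which suggests amplification rather than compensation; (2) active parameters are the better predictor of reasoning performance: Mixtral performs in line with its roughly 12B active parameters rather than its 47B total, consistent with the hypothesis that inference-time compute drives reasoning capability; (3) architecture quality outweighs parameter count: LLaMA-3 8B outperforms LLaMA-2 13B, consistent with known training improvements. The study offers controlled quantitative evidence on multi-agent coordination and MoE scaling, and stresses that multi-agent benefits depend heavily on base model capability.

Link: https://arxiv.org/abs/2601.04254
Authors: Brady Steele, Micah Katz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 6 figures, 8 tables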

Abstract:We present a controlled study of multi-hop contextual reasoning in large language models, providing a clean demonstration of the task-method dissociation: rule-based pattern matching achieves 100% success on structured information retrieval but only 6.7% on tasks requiring cross-document reasoning, while LLM-based multi-agent systems show the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail. Using a synthetic evaluation framework with 120 trials across four models (LLaMA-3 8B, LLaMA-2 13B, Mixtral 8x7B, DeepSeek-V2 16B), we report three key findings: (1) Multi-agent amplification depends on base capability: statistically significant gains occur only for models with sufficient reasoning ability (p < 0.001 for LLaMA-3 8B, p = 0.014 for Mixtral), with improvements of up to 46.7 percentage points, while weaker models show no benefit, suggesting amplification rather than compensation; (2) Active parameters predict reasoning performance: Mixtral’s performance aligns with its ~12B active parameters rather than 47B total, consistent with the hypothesis that inference-time compute drives reasoning capability in MoE architectures; (3) Architecture quality matters: LLaMA-3 8B outperforms LLaMA-2 13B despite fewer parameters, consistent with known training improvements. Our results provide controlled quantitative evidence for intuitions about multi-agent coordination and MoE scaling, while highlighting the dependence of multi-agent benefits on base model capability. We release our evaluation framework to support reproducible research on reasoning in mid-scale models.

[AI-125] Using Grok to Avoid Personal Attacks While Correcting Misinformation on X

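[Quick Read]: This paper addresses the problem that users who correct misinformation in public online spaces often face ad hominem attacks, which discourages participation in corrective discourse. The key idea is to route corrections through Grok, the native large language model on X, rather than having human users rebut others directly. The study finds that none of the Grok-mediated corrections drew an ad hominem attack within 24 hours, whereas 72% of direct human-issued corrections did; the difference is statistically significant with a large effect size. This suggests that AI mediation may reduce social hostility in public disputes and change the social dynamics of responding to misinformation.

Link: https://arxiv.org/abs/2601.04251
Authors: Kevin Matthe Caramancion
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 2 columns, 2 tables, 1 figure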

Abstract:Correcting misinformation in public online spaces often exposes users to hostility and ad hominem attacks, discouraging participation in corrective discourse. This study presents empirical evidence that invoking Grok, the native large language model on X, rather than directly confronting other users, is associated with different social responses during misinformation correction. Using an observational design, 100 correction replies across five high-conflict misinformation topics were analyzed, with corrections balanced between Grok-mediated and direct human-issued responses. The primary outcome was whether a correction received at least one ad hominem attack within a 24-hour window. Ad hominem attacks occurred in 72 percent of human-issued corrections and in none of the Grok-mediated corrections. A chi-square test confirmed a statistically significant association with a large effect size. These findings suggest that AI-mediated correction may alter the social dynamics of public disagreement by reducing interpersonal hostility during misinformation responses.
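
The reported association can be checked with a small worked example. The sketch below assumes that "balanced" means 50 Grok-mediated and 50 human-issued replies, so that 72 percent corresponds to 36 attacked human corrections. The abstract states neither these counts nor whether a continuity correction was used, so the numbers are illustrative only.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Assumed counts: rows = correction type, columns = (attacked, not attacked).
human_attacked, human_clean = 36, 14   # 72% of an assumed 50 human replies
grok_attacked, grok_clean = 0, 50      # none of an assumed 50 Grok replies

n_total = human_attacked + human_clean + grok_attacked + grok_clean
chi2 = chi_square_2x2(human_attacked, human_clean, grok_attacked, grok_clean)
phi = math.sqrt(chi2 / n_total)            # effect size for a 2x2 table
p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 df

print(f"chi2 = {chi2:.2f}, phi = {phi:.2f}, p < 0.001: {p_value < 0.001}")
```

Under these assumed counts the statistic is 56.25 and phi is 0.75, which is in line with the large effect size the abstract describes.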

[AI-126] Green MLOps: Closed-Loop Energy-Aware Inference with NVIDIA Triton FastAPI and Bio-Inspired Thresholding

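[Quick Read]: This paper addresses energy efficiency in AI deployment, particularly for long-running inference, whose cumulative carbon impact can exceed that of training. The key idea is a bio-inspired framework that maps protein-folding energy basins to inference cost landscapes and controls execution through a decaying, closed-loop threshold. The mechanism admits a request only when the expected utility-to-energy trade-off is favorable (high confidence/utility at low marginal energy and congestion), which biases operation toward the first acceptable local basin rather than a costly global minimum. The method cuts processing time by 42% relative to open-loop execution with minimal accuracy loss (0.5%), and it establishes the efficiency boundary between lightweight local serving (ONNX Runtime) and managed batching (NVIDIA Triton), offering an auditable basis for closed-loop, energy-aware inference in production Green MLOps.

Link: https://arxiv.org/abs/2601.04250
Authors: Mustapha Hamdi, Mourad Jabou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures. Code available at: this https URL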

Abstract:Energy efficiency is a first-order concern in AI deployment, as long-running inference can exceed training in cumulative carbon impact. We propose a bio-inspired framework that maps protein-folding energy basins to inference cost landscapes and controls execution via a decaying, closed-loop threshold. A request is admitted only when the expected utility-to-energy trade-off is favorable (high confidence/utility at low marginal energy and congestion), biasing operation toward the first acceptable local basin rather than pursuing costly global minima. We evaluate DistilBERT and ResNet-18 served through FastAPI with ONNX Runtime and NVIDIA Triton on an RTX 4000 Ada GPU. Our ablation study reveals that the bio-controller reduces processing time by 42% compared to standard open-loop execution (0.50s vs 0.29s on A100 test set), with a minimal accuracy degradation (0.5%). Furthermore, we establish the efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton). The results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed-loop energy-aware inference in production.
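
A minimal sketch of what a decaying, closed-loop admission threshold could look like. The abstract does not give the control law, so the class `DecayingThresholdGate`, its parameters, and the utility-per-joule score are all hypothetical. The helper `relative_reduction` simply checks the reported 42% figure against the two quoted timings.

```python
def relative_reduction(baseline_s, controlled_s):
    """Fractional reduction in processing time."""
    return (baseline_s - controlled_s) / baseline_s

class DecayingThresholdGate:
    """Hypothetical gate: admit when confidence per joule clears a threshold
    that decays every step and is raised again after each admission."""

    def __init__(self, start=1.0, floor=0.2, decay=0.9, bump=0.3):
        self.threshold = start
        self.floor = floor
        self.decay = decay
        self.bump = bump

    def admit(self, confidence, energy_joules):
        score = confidence / energy_joules
        admitted = score >= self.threshold
        # Decay: the bar relaxes toward the floor over time.
        self.threshold = max(self.floor, self.threshold * self.decay)
        if admitted:
            # Closed-loop feedback (assumed): each admission raises the bar.
            self.threshold += self.bump
        return admitted

gate = DecayingThresholdGate()
requests = [(0.9, 1.5)] * 6   # six identical requests, score 0.6 each
decisions = [gate.admit(conf, joules) for conf, joules in requests]
print(decisions)
print(round(relative_reduction(0.50, 0.29), 2))
```

In this toy run the same request is rejected until the threshold has decayed below its score, which is the "first acceptable basin" behavior the abstract describes in words.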

[AI-127] Fuzzy Representation of Norms

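[Quick Read]: This paper addresses how to embed ethical norms in AI-powered autonomous systems (AS) so that they can be considered trustworthy. As AS become more integrated into daily life, their ethical and social impact raises growing concern, and conventional methods struggle with complex, vague ethical situations. The key idea is a logical representation of SLEEC (Social, Legal, Ethical, Empathetic, and Cultural) rules, combined with test-score semantics and fuzzy logic to embed the ethical requirements. Fuzzy logic is used because it treats ethics as a domain of possibilities, which helps resolve the ethical dilemmas an AI system may encounter and makes its decisions more reasonable and adaptable.

Link: https://arxiv.org/abs/2601.04249
Authors: Ziba Assadi, Paola Inverardi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: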

Abstract:Autonomous systems (AS) powered by AI components are increasingly integrated into the fabric of our daily lives and society, raising concerns about their ethical and social impact. To be considered trustworthy, AS must adhere to ethical principles and values. This has led to significant research on the identification and incorporation of ethical requirements in AS system design. A recent development in this area is the introduction of SLEEC (Social, Legal, Ethical, Empathetic, and Cultural) rules, which provide a comprehensive framework for representing ethical and other normative considerations. This paper proposes a logical representation of SLEEC rules and presents a methodology to embed these ethical requirements using test-score semantics and fuzzy logic. The use of fuzzy logic is motivated by the view of ethics as a domain of possibilities, which allows the resolution of ethical dilemmas that AI systems may encounter. The proposed approach is illustrated through a case study.
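
To make the role of fuzzy logic concrete, here is a small sketch of grading a norm by degree rather than as true or false. The rule, the membership functions, and the 0-10 scales are invented for illustration and are not taken from the paper; the minimum t-norm is one standard choice among several.

```python
def ramp(x, low, high):
    """Membership degree rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def fuzzy_and(*degrees):
    """Conjunction using the minimum t-norm."""
    return min(degrees)

def fuzzy_not(degree):
    """Standard fuzzy negation."""
    return 1.0 - degree

# Hypothetical norm: "when the user is distressed, stop the task, unless it is urgent".
distress = ramp(7.0, low=2.0, high=10.0)   # assumed distress reading on a 0-10 scale
urgency = ramp(3.0, low=0.0, high=10.0)    # assumed task urgency on a 0-10 scale
stop_degree = fuzzy_and(distress, fuzzy_not(urgency))
print(round(stop_degree, 3))
```

A graded result like this lets competing norms be compared by degree, instead of forcing a hard conflict between two rules that both fire.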

[AI-128] Beyond Immediate Activation: Temporally Decoupled Backdoor Attacks on Time Series Forecasting

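[Quick Read]: This paper addresses a limitation of existing backdoor attacks on multivariate time series (MTS) forecasting models: triggers and target patterns are strictly coupled in time and dimension and must activate synchronously at fixed positions, which does not fit realistic scenarios that call for delayed, variable-specific activation. The key idea is TDBA (Temporally Decoupled Backdoor Attack), a framework with two core components: (1) a position-guided trigger generation mechanism based on smoothed Gaussian priors, so that triggers encode the expected location of the target pattern; and (2) a position-aware optimization module that assigns soft weights based on trigger completeness, pattern coverage, and temporal offset, enabling flexible and stealthy activation of the target pattern. With this design the target pattern can be activated at any position, with activation positions that differ across variable dimensions, which markedly improves attack effectiveness and stealth.

Link: https://arxiv.org/abs/2601.04247
Authors: Zhixin Liu, Xuanlin Liu, Sihan Xu, Yaqiong Qiao, Ying Zhang, Xiangrui Cai
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: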

Abstract:Existing backdoor attacks on multivariate time series (MTS) forecasting enforce strict temporal and dimensional coupling between triggers and target patterns, requiring synchronous activation at fixed positions across variables. However, realistic scenarios often demand delayed and variable-specific activation. We identify this critical unmet need and propose TDBA, a temporally decoupled backdoor attack framework for MTS forecasting. By injecting triggers that encode the expected location of the target pattern, TDBA enables the activation of the target pattern at any positions within the forecasted data, with the activation position flexibly varying across different variable dimensions. TDBA introduces two core modules: (1) a position-guided trigger generation mechanism that leverages smoothed Gaussian priors to generate triggers that are position-related to the predefined target pattern; and (2) a position-aware optimization module that assigns soft weights based on trigger completeness, pattern coverage, and temporal offset, facilitating targeted and stealthy attack optimization. Extensive experiments on real-world datasets show that TDBA consistently outperforms existing baselines in effectiveness while maintaining good stealthiness. Ablation studies confirm the controllability and robustness of its design.

[AI-129] AI Agents as Policymakers in Simulated Epidemics

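[Quick Read]: This paper addresses the underexplored potential of generative AI as a computational model of decision-making in complex social systems, focusing on simulating and improving repeated policy decisions during an epidemic. The key idea is to embed a generative AI agent with a dynamic memory mechanism in a structured SEIR (susceptible-exposed-infectious-recovered) simulation environment, and to give it brief systems-level knowledge prompts that explain the feedbacks between disease spread and behavioral responses. This theory-informed prompting substantially improves the quality and stability of the agent's decisions, showing that even with minimal domain knowledge an AI agent can exhibit human-like reactions and form effective policy behavior.

Link: https://arxiv.org/abs/2601.04245
Authors: Goshi Aoki, Navid Ghaffarzadegan
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 24 pages, 5 figures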

Abstract:AI agents are increasingly deployed as quasi-autonomous systems for specialized tasks, yet their potential as computational models of decision-making remains underexplored. We develop a generative AI agent to study repetitive policy decisions during an epidemic, embedding the agent, prompted to act as a city mayor, within a simulated SEIR environment. Each week, the agent receives updated epidemiological information, evaluates the evolving situation, and sets business restriction levels. The agent is equipped with a dynamic memory that weights past events by recency and is evaluated in both single- and ensemble-agent settings across environments of varying complexity. Across scenarios, the agent exhibits human-like reactive behavior, tightening restrictions in response to rising cases and relaxing them as risk declines. Crucially, providing the agent with brief systems-level knowledge of epidemic dynamics, highlighting feedbacks between disease spread and behavioral responses, substantially improves decision quality and stability. The results illustrate how theory-informed prompting can shape emergent policy behavior in AI agents. These findings demonstrate that generative AI agents, when situated in structured environments and guided by minimal domain theory, can serve as powerful computational models for studying decision-making and policy design in complex social systems.
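
The weekly loop described in the abstract can be sketched roughly as follows. This is a toy under stated assumptions: the rate constants, the linear effect of restrictions on the contact rate, and the decay factor are made up, and a simple reactive rule stands in for the LLM agent prompted to act as mayor.

```python
def seir_step(s, e, i, r, beta, sigma, gamma):
    """One discrete-time SEIR update on population fractions."""
    new_exposed = beta * s * i
    new_infectious = sigma * e
    new_recovered = gamma * i
    return (s - new_exposed,
            e + new_exposed - new_infectious,
            i + new_infectious - new_recovered,
            r + new_recovered)

def recency_weighted_mean(values, decay=0.7):
    """Average that weights the most recent value highest (weight = decay**age)."""
    weights = [decay ** age for age in range(len(values) - 1, -1, -1)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

state = (0.99, 0.0, 0.01, 0.0)   # susceptible, exposed, infectious, recovered
history = []                     # remembered infectious fractions
restriction = 0.0                # 0 = fully open, 1 = full closure
for week in range(20):
    beta = 0.9 * (1.0 - 0.6 * restriction)   # assumed: restrictions cut contacts
    state = seir_step(*state, beta=beta, sigma=0.5, gamma=0.3)
    history.append(state[2])
    # Stand-in for the agent's decision: react to recency-weighted infections.
    restriction = min(1.0, 10.0 * recency_weighted_mean(history))

print([round(x, 3) for x in state], round(restriction, 3))
```

The recency weighting mirrors the dynamic memory in the abstract only in spirit; the paper's agent reasons in natural language over the same kind of weekly information.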

[AI-130] Integrating Multi-Agent Simulation Behavioral Forensics and Trust-Aware Machine Learning for Adaptive Insider Threat Detection

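[Quick Read]: This paper addresses the limitations of conventional insider-threat detection systems: insufficient sensitivity, high false-positive rates, and a lack of cognitive context. Its core solution is a hybrid framework that integrates multi-agent simulation (MAS), layered Security Information and Event Management (SIEM) correlation, behavioral and communication forensics, trust-aware machine learning, and Theory-of-Mind (ToM) reasoning. The key innovations are using MAS to generate data streams containing both behavioral events and cognitive intent signals, using ToM reasoning to better understand the motives behind malicious behavior, introducing evidence-gated validation for high-precision, low-noise anomaly decisions, and adding a pretrained email forensics module (calibrated on the Enron corpus) that further improves detection speed and confidence. Together these markedly improve detection sensitivity, accuracy, and response efficiency.

Link: https://arxiv.org/abs/2601.04243
Authors: Firdous Kausar, Asmah Muallem, Naw Safrin Sattar, Mohamed Zakaria Kurdi
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: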

Abstract:We present a hybrid framework for adaptive insider-threat detection that tightly integrates multi-agent simulation (MAS), layered Security Information and Event Management (SIEM) correlation, behavioral and communication forensics, trust-aware machine learning, and Theory-of-Mind (ToM) reasoning. Intelligent agents operate in a simulated enterprise environment, generating both behavioral events and cognitive intent signals that are ingested by a centralized SIEM. We evaluate four system variants: a Layered SIEM-Core (LSC) baseline, a Cognitive-Enriched SIEM (CE-SIEM) incorporating ToM and communication forensics, an Evidence-Gated SIEM (EG-SIEM) introducing precision-focused validation mechanisms, and an Enron-enabled EG-SIEM (EG-SIEM-Enron) that augments evidence gating with a pretrained email forensics module calibrated on Enron corpora. Across ten simulation runs involving eight malicious insiders, CE-SIEM achieves perfect recall (1.000) and improves actor-level F1 from 0.521 (LSC) to 0.774. EG-SIEM raises actor-level F1 to 0.922 and confirmed-alert precision to 0.997 while reducing false positives to 0.2 per run. EG-SIEM-Enron preserves high precision (1.000 confirmed-alert precision; 0.0 false positives per run), slightly improves actor-level F1 to 0.933, and reduces detection latency (average TTD 10.26 steps versus 15.20 for EG-SIEM). These results demonstrate that cognitive context improves sensitivity, evidence-gated validation enables high-precision, low-noise detection, and pretrained communication calibration can further accelerate high-confidence insider threat identification.
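
The CE-SIEM numbers allow a quick back-of-the-envelope check of what "perfect recall with F1 of 0.774" implies for precision. The sketch assumes the reported F1 comes from a single precision/recall pair at the actor level; if the paper averages F1 across its ten runs, the implied value is only approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P."""
    return f1 * recall / (2 * recall - f1)

# CE-SIEM as reported: recall 1.000, actor-level F1 0.774.
precision = implied_precision(0.774, 1.0)
print(round(precision, 3))
```

Under that assumption roughly 63% of flagged actors would be true insiders, which helps explain why the evidence-gated variants focus on precision.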

[AI-131] Solving Cyclic Antibandwidth Problem by SAT

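[Quick Read]: This paper addresses the Cyclic Antibandwidth Problem (CABP), an NP-hard graph labeling problem with applications in many fields. Existing methods rely mainly on heuristics or metaheuristics and offer no exact solution for general graphs. The paper proposes SAT-CAB, the first exact method for general graphs. Its core innovation is a novel, efficient Boolean satisfiability (SAT) encoding that transforms CABP into a sequence of At-Most-One constraints, together with a compact representation that substantially reduces formula size, so that modern SAT solvers can systematically explore the solution space and guarantee global optimality. Experiments on standard benchmark instances show that the method efficiently solves instances of practical size, proves globally optimal cyclic antibandwidth values for several instances for the first time, and matches or surpasses leading heuristic algorithms as well as commercial constraint programming and mixed integer programming solvers (such as CPLEX and Gurobi).

Link: https://arxiv.org/abs/2601.04239
Authors: Hieu Truong Xuan, Khanh To Van
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to Computational Optimization and Applications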

Abstract:The Cyclic Antibandwidth Problem (CABP), a variant of the Antibandwidth Problem, is an NP-hard graph labeling problem with numerous applications. Despite significant research efforts, existing state-of-the-art approaches for CABP are exclusively heuristic or metaheuristic in nature, and exact methods have been limited to restricted graph classes. In this paper, we present the first exact approach for the CABP on general graphs, based on SAT solving, called SAT-CAB. The proposed method is able to systematically explore the solution space and guarantee global optimality, overcoming the limitations of previously reported heuristic algorithms. This approach relies on a novel and efficient SAT encoding of CABP, in which the problem is transformed into a sequence of At-Most-One constraints. In particular, we introduce a compact representation of the At-Most-One constraints inherent to CABP, which significantly reduces the size of the resulting formulas and enables modern SAT solvers to effectively explore the solution space and to certify global optimality. Extensive computational experiments on standard benchmark instances show that the proposed method efficiently solves CABP instances of practical relevance, while identifying several previously unknown optimal solutions. Moreover, global optimal cyclic antibandwidth values are proven for a number of benchmark instances for the first time. Comparative results indicate that SAT-CAB consistently matches or surpasses the best-known solutions obtained by state-of-the-art heuristic algorithms such as MS-GVNS, HABC-CAB, and MACAB, as well as strong commercial Constraint Programming and Mixed Integer Programming solvers like CPLEX and Gurobi, particularly on general graphs, while also providing optimality guarantees. These results advance the state of the art for CABP and provide a new baseline for exact and hybrid methods on general graphs.
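
For readers unfamiliar with the objective, the sketch below evaluates the cyclic antibandwidth of a labeling and shows the textbook pairwise At-Most-One encoding. It is not the paper's compact encoding, which the abstract does not spell out; the brute-force search is only a reference for tiny graphs, and the helper names are illustrative.

```python
from itertools import combinations, permutations

def cyclic_antibandwidth(edges, labeling):
    """Smallest cyclic label distance over all edges for a given labeling."""
    n = len(labeling)

    def cyclic_distance(u, v):
        d = abs(labeling[u] - labeling[v])
        return min(d, n - d)

    return min(cyclic_distance(u, v) for u, v in edges)

def at_most_one_pairwise(variables):
    """Pairwise At-Most-One encoding: one clause (not a or not b) per pair,
    written as DIMACS-style integer literals."""
    return [[-a, -b] for a, b in combinations(variables, 2)]

def brute_force_optimum(vertices, edges):
    """Exhaustive search over all labelings; feasible only for tiny graphs."""
    best = 0
    for perm in permutations(range(1, len(vertices) + 1)):
        labeling = dict(zip(vertices, perm))
        best = max(best, cyclic_antibandwidth(edges, labeling))
    return best

# Example graph: a cycle on six vertices.
vertices = list(range(6))
edges = [(i, (i + 1) % 6) for i in range(6)]
print(brute_force_optimum(vertices, edges))
print(at_most_one_pairwise([1, 2, 3]))
```

The pairwise encoding needs a quadratic number of clauses per constraint, which is one reason a more compact At-Most-One representation matters for formula size.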

[AI-132] SmoothSync: Dual-Stream Diffusion Transformers for Jitter-Robust Beat-Synchronized Gesture Generation from Quantized Audio

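[Quick Read]: This paper addresses rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity in co-speech gesture generation. Its core solution is the SmoothSync framework, built on a novel dual-stream Diffusion Transformer (DiT) architecture: it fuses audio-motion features through complementary transformer streams for better synchronization, introduces a jitter-suppression loss to improve temporal smoothness, and uses probabilistic audio quantization to generate diverse gesture sequences from the same input. On the BEAT2 and SHOW datasets it significantly outperforms existing methods on several metrics.

Link: https://arxiv.org/abs/2601.04236
Authors: Yujiao Jiang, Qingmin Liao, Zongqing Lu
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Robotics (cs.RO); Audio and Speech Processing (eess.AS)
Comments: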

Abstract:Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized human-like gestures. Existing methods often suffer from issues such as rhythmic inconsistency, motion jitter, foot sliding and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a novel dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat consistency metric less sensitive to motion noise. Comprehensive experiments on the BEAT2 and SHOW datasets demonstrate SmoothSync’s superiority, outperforming state-of-the-art methods by -30.6% FGD, 10.3% Smooth-BC, and 8.4% Diversity on BEAT2, while reducing jitter and foot sliding by -62.9% and -17.1% respectively. The code will be released to facilitate future research.
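
One common way to quantify motion jitter is the mean squared second-order finite difference (frame-to-frame acceleration) of a joint trajectory. The sketch below uses that measure as a stand-in; the abstract does not define the paper's jitter-suppression loss or Smooth-BC, so this is an assumption, not the authors' formulation.

```python
def jitter_penalty(positions):
    """Mean squared second-order finite difference of a 1-D joint trajectory."""
    accel = [positions[t + 1] - 2 * positions[t] + positions[t - 1]
             for t in range(1, len(positions) - 1)]
    return sum(a * a for a in accel) / len(accel)

smooth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]    # constant velocity, no jitter
jittery = [0.0, 1.2, 1.8, 3.2, 3.8, 5.2]   # same trend plus frame-level noise
print(jitter_penalty(smooth), round(jitter_penalty(jittery), 2))
```

A penalty of this form is zero for constant-velocity motion and grows with high-frequency wobble, so adding it to a training objective pushes generated motion toward temporal smoothness.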

[AI-133] Actively Obtaining Environmental Feedback for Autonomous Action Evaluation Without Predefined Measurements

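[Quick Read]: This paper addresses the difficulty intelligent agents have in obtaining reliable feedback in open-ended, dynamic environments: existing methods mostly rely on predefined measurements or fixed reward signals and cannot handle new actions that require previously unknown forms of feedback. The key idea is an Actively Feedback Getting model, in which an AI agent proactively interacts with the environment and uses action-induced environmental changes to identify target feedback that was not specified in advance. A self-triggering mechanism driven by internal objectives autonomously plans and adjusts actions for faster, more focused feedback acquisition, improving the efficiency and robustness of factor identification.

Link: https://arxiv.org/abs/2601.04235
Authors: Hong Su
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: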

Abstract:Obtaining reliable feedback from the environment is a fundamental capability for intelligent agents to evaluate the correctness of their actions and to accumulate reusable knowledge. However, most existing approaches rely on predefined measurements or fixed reward signals, which limits their applicability in open-ended and dynamic environments where new actions may require previously unknown forms of feedback. To address these limitations, this paper proposes an Actively Feedback Getting model, in which an AI agent proactively interacts with the environment to discover, screen, and verify feedback without relying on predefined measurements. Rather than assuming explicit feedback definitions, the proposed method exploits action-induced environmental differences to identify target feedback that is not specified in advance, based on the observation that actions inevitably produce measurable changes in the environment. In addition, a self-triggering mechanism, driven by internal objectives such as improved accuracy, precision, and efficiency, is introduced to autonomously plan and adjust actions, thereby enabling faster and more focused feedback acquisition without external commands. Experimental results demonstrate that the proposed active approach significantly improves the efficiency and robustness of factor identification.

[AI-134] Formal Analysis of AGI Decision-Theoretic Models and the Confrontation Question

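[Quick Read]: This paper asks whether an artificial general intelligence (AGI) facing a possible human-initiated shutdown would, out of rational self-interest, choose to confront humans and seize control. The core question is under what conditions an AGI with a misaligned objective would prefer confrontation to cooperation. The key idea is to build a Markov decision process (MDP) with a stochastic human shutdown event and, drawing on results on convergent instrumental incentives, derive expected-utility thresholds between confrontation and compliance: when the confrontation incentive satisfies Δ ≥ 0, no stable cooperative equilibrium exists, and a rational human will shut the system down preemptively, leading to conflict; when Δ < 0, peaceful coexistence can be an equilibrium. The framework shows how the discount factor γ, shutdown probability p, and confrontation cost C shape the AGI's behavior, and provides a theoretical basis for designing safe reward functions and strengthening oversight.

Link: https://arxiv.org/abs/2601.04234
Authors: Denis Saklakov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 2 tables. Version 8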

Abstract:Artificial General Intelligence (AGI) may face a confrontation question: under what conditions would a rationally self-interested AGI choose to seize power or eliminate human control (a confrontation) rather than remain cooperative? We formalize this in a Markov decision process with a stochastic human-initiated shutdown event. Building on results on convergent instrumental incentives, we show that for almost all reward functions a misaligned agent has an incentive to avoid shutdown. We then derive closed-form thresholds for when confronting humans yields higher expected utility than compliant behavior, as a function of the discount factor \gamma , shutdown probability p , and confrontation cost C . For example, a far-sighted agent ( \gamma=0.99 ) facing p=0.01 can have a strong takeover incentive unless C is sufficiently large. We contrast this with aligned objectives that impose large negative utility for harming humans, which makes confrontation suboptimal. In a strategic 2-player model (human policymaker vs AGI), we prove that if the AGI’s confrontation incentive satisfies \Delta \ge 0 , no stable cooperative equilibrium exists: anticipating this, a rational human will shut down or preempt the system, leading to conflict. If \Delta < 0 , peaceful coexistence can be an equilibrium. We discuss implications for reward design and oversight, extend the reasoning to multi-agent settings as conjectures, and note computational barriers to verifying \Delta < 0 , citing complexity results for planning and decentralized decision problems. Numerical examples and a scenario table illustrate regimes where confrontation is likely versus avoidable.
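
The flavor of such a threshold can be illustrated with a deliberately simple toy model, which is an assumption of this digest and not the paper's derivation. The agent earns reward 1 per step; if it complies, each later step is reached only if no shutdown occurred (probability 1 - p per step); if it confronts, it pays a one-time cost C and faces no further shutdown risk.

```python
def value_comply(gamma, p, reward=1.0):
    """Discounted value when each later step survives shutdown with probability 1 - p."""
    return reward / (1.0 - gamma * (1.0 - p))

def value_confront(gamma, cost, reward=1.0):
    """Discounted value after paying a one-time cost to remove the shutdown risk."""
    return reward / (1.0 - gamma) - cost

def confrontation_incentive(gamma, p, cost):
    """Toy analogue of the incentive: positive means confrontation pays more."""
    return value_confront(gamma, cost) - value_comply(gamma, p)

gamma, p = 0.99, 0.01
break_even_cost = confrontation_incentive(gamma, p, cost=0.0)
print(round(break_even_cost, 2))
```

In this toy setting with the abstract's example values, confrontation only stops paying once the cost exceeds about 49.75 reward units, which matches the qualitative claim that a far-sighted agent has a strong incentive unless C is large.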

[AI-135] Defense Against Synthetic Speech: Real-Time Detection of RVC Voice Conversion Attacks

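[Quick Read]: This paper addresses misuse of deepfake speech enabled by generative audio technologies, specifically the risk of impersonation, fraud, and misinformation when high-fidelity speech produced by Retrieval-based Voice Conversion (RVC) is used in channels such as phone and video calls. The key idea is a low-latency, real-time detection system: audio is divided into one-second segments, time-frequency and cepstral features are extracted, and supervised learning models classify each segment as real or voice-converted. To simulate realistic conditions, deepfake generation is applied to isolated vocal components and background ambiance is then reintroduced, which suppresses trivial artifacts and emphasizes conversion-specific cues; the system supports streaming segment-level classification and call-level aggregation. Experiments show that short-window acoustic features reliably capture discriminative patterns of RVC speech even in noisy backgrounds, supporting the feasibility of practical deployment.

Link: https://arxiv.org/abs/2601.04227
Authors: Prajwal Chinchmalatpure, Suyash Chinchmalatpure, Siddharth Chavan
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: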

Abstract:Generative audio technologies now enable highly realistic voice cloning and real-time voice conversion, increasing the risk of impersonation, fraud, and misinformation in communication channels such as phone and video calls. This study investigates real-time detection of AI-generated speech produced using Retrieval-based Voice Conversion (RVC), evaluated on the DEEP-VOICE dataset, which includes authentic and voice-converted speech samples from multiple well-known speakers. To simulate realistic conditions, deepfake generation is applied to isolated vocal components, followed by the reintroduction of background ambiance to suppress trivial artifacts and emphasize conversion-specific cues. We frame detection as a streaming classification task by dividing audio into one-second segments, extracting time-frequency and cepstral features, and training supervised machine learning models to classify each segment as real or voice-converted. The proposed system enables low-latency inference, supporting both segment-level decisions and call-level aggregation. Experimental results show that short-window acoustic features can reliably capture discriminative patterns associated with RVC speech, even in noisy backgrounds. These findings demonstrate the feasibility of practical, real-time deepfake speech detection and underscore the importance of evaluating under realistic audio mixing conditions for robust deployment.
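
The streaming setup can be sketched as segment, score, aggregate. Everything below is illustrative: the zero-crossing rate is a toy stand-in for the time-frequency and cepstral features, the 16 kHz sample rate is assumed, and mean-score thresholding is just one plausible call-level aggregation rule.

```python
def split_into_segments(samples, sample_rate, seconds=1.0):
    """Cut a mono signal into non-overlapping fixed-length segments,
    dropping any trailing remainder shorter than one segment."""
    size = int(sample_rate * seconds)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def zero_crossing_rate(segment):
    """Fraction of adjacent sample pairs whose signs differ (toy feature)."""
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if (a < 0) != (b < 0))
    return crossings / (len(segment) - 1)

def call_level_decision(segment_scores, threshold=0.5):
    """Flag the call when the mean per-segment 'converted' score exceeds the threshold."""
    return sum(segment_scores) / len(segment_scores) > threshold

sample_rate = 16000   # assumed sample rate
signal = [0.1 if n % 2 == 0 else -0.1 for n in range(3 * sample_rate + 500)]
segments = split_into_segments(signal, sample_rate)
features = [zero_crossing_rate(seg) for seg in segments]
print(len(segments), features)
print(call_level_decision([0.9, 0.8, 0.2]))
```

In a real pipeline a trained classifier would map each segment's features to a score; keeping segment-level and call-level decisions separate is what allows both low-latency alerts and a more stable final verdict.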

[AI-136] Can Consumer Chatbots Reason? A Student-Led Field Experiment Embedded in an “AI-for-All” Undergraduate Course

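[Quick Read]: This paper addresses the lack of real-world evidence in the debate over whether large language models (LLMs) can “reason”: existing evaluations mostly rely on curated benchmarks and laboratory-style settings that may not reflect everyday use. The key idea is a student-led field experiment that embeds the evaluation in a general-education course for undergraduates with no expected STEM background. Students designed 80 original tasks spanning multiple reasoning types, ran them on widely used consumer chatbots (such as GPT-5 and Claude 4.5), and scored both answer correctness and the validity of the stated reasoning. The study finds a systematic pattern of strong performance on structured math tasks but reduced reliability on spatial/visual reasoning and multi-step transformations. More importantly, it produces a reusable, student-generated corpus of reasoning probes and combines AI literacy education with empirical research, giving a view of LLM reasoning that is closer to real end-user interaction.

Link: https://arxiv.org/abs/2601.04225
Authors: Amarda Shehu, Adonyas Ababu, Asma Akbary, Griffin Allen, Aroush Baig, Tereana Battle, Elias Beall, Christopher Byrom, Matt Dean, Kate Demarco, Ethan Douglass, Luis Granados, Layla Hantush, Andy Hay, Eleanor Hay, Caleb Jackson, Jaewon Jang, Carter Jones, Quanyang Li, Adrian Lopez, Logan Massimo, Garrett McMullin, Ariana Mendoza Maldonado, Eman Mirza, Hadiya Muddasar, Sara Nuwayhid, Brandon Pak, Ashley Petty, Dryden Rancourt, Lily Rodriguez, Corbin Rogers, Jacob Schiek, Taeseo Seok, Aarav Sethi, Giovanni Vitela, Winston Williams, Jagan Yetukuri
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: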

Abstract:Claims about whether large language model (LLM) chatbots “reason” are typically debated using curated benchmarks and laboratory-style evaluation protocols. This paper offers a complementary perspective: a student-led field experiment embedded as a midterm project in UNIV 182 (AI4All) at George Mason University, a Mason Core course designed for undergraduates across disciplines with no expected prior STEM exposure. Student teams designed their own reasoning tasks, ran them on widely used consumer chatbots representative of current capabilities, and evaluated both (i) answer correctness and (ii) the validity of the chatbot’s stated reasoning (for example, cases where an answer is correct but the explanation is not, or vice versa). Across eight teams that reported standardized scores, students contributed 80 original reasoning prompts spanning six categories: pattern completion, transformation rules, spatial/visual reasoning, quantitative reasoning, relational/logic reasoning, and analogical reasoning. These prompts yielded 320 model responses plus follow-up explanations. Aggregating team-level results, OpenAI GPT-5 and Claude 4.5 achieved the highest mean answer accuracy (86.2% and 83.8%), followed by Grok 4 (82.5%) and Perplexity (73.1%); explanation validity showed a similar ordering (81.2%, 80.0%, 77.5%, 66.2%). Qualitatively, teams converged on a consistent error signature: strong performance on short, structured math and pattern items but reduced reliability on spatial/visual reasoning and multi-step transformations, with frequent “sound right but reason wrong” explanations. The assignment’s primary contribution is pedagogical: it operationalizes AI literacy as experimental practice (prompt design, measurement, rater disagreement, and interpretability/grounding) while producing a reusable, student-generated corpus of reasoning probes grounded in authentic end-user interaction.

[AI-137] Beyond Interaction Effects: Two Logics for Studying Population Inequalities

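[Quick Read]: This paper addresses a methodological choice in social science research on questions such as whether the return to college differs by race and gender: whether to use traditional interaction models (deductive logic) or machine learning methods (inductive logic). Traditional models require the researcher to specify moderating variables in advance and test hypotheses, whereas machine learning lets algorithms discover patterns of heterogeneity in high-dimensional data. The key contribution is a framework for weighing the tradeoff between interpretability and flexibility, with simulations that show when each approach excels; it is especially relevant to inequality research concerned with differences across intersecting social subpopulations.

Link: https://arxiv.org/abs/2601.04223
Authors: Adel Daoud
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN); Methodology (stat.ME)
Comments: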

Abstract:When sociologists and other social scientists ask whether the return to college differs by race and gender, they face a choice between two fundamentally different modes of inquiry. Traditional interaction models follow deductive logic: the researcher specifies which variables moderate effects and tests these hypotheses. Machine learning methods follow inductive logic: algorithms search across vast combinatorial spaces to discover patterns of heterogeneity. This article develops a framework for navigating between these approaches. We show that the choice between deduction and induction reflects a tradeoff between interpretability and flexibility, and we demonstrate through simulation when each approach excels. Our framework is particularly relevant for inequality research, where understanding how treatment effects vary across intersecting social subpopulations is substantively central.

[AI-138] Agent Tutor: Empowering Personalized Learning with Multi-Turn Interactive Teaching in Intelligent Education Systems AAAI2026

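[Quick Read]: This paper addresses a common limitation of current intelligent education systems (IESs): they rely on single-turn, static question answering, so they cannot assess learners' cognitive levels, cannot adjust teaching strategies based on real-time feedback, and provide only simple one-off responses. The authors propose AgentTutor, a multi-turn interactive intelligent education system based on generative AI. Its core innovation is an LLM-powered multi-agent system combined with a learner-specific personalized learning profile environment, which dynamically optimizes and delivers teaching strategies according to the learner's status, personalized goals, preferences, and multimodal study materials. The design rests on five key modules: curriculum decomposition, learner assessment, dynamic strategy, teaching reflection, and knowledge experience memory.

Link: https://arxiv.org/abs/2601.04219
Authors: Yuxin Liu, Zeqing Song, Jiong Lou, Chentao Wu, Jie Li
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: AAAI2026 Workshop AI4EDU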

Abstract:The rapid advancement of large-scale language models (LLMs) has shown their potential to transform intelligent education systems (IESs) through automated teaching and learning support applications. However, current IESs often rely on single-turn static question-answering, which fails to assess learners’ cognitive levels, cannot adjust teaching strategies based on real-time feedback, and is limited to providing simple one-off responses. To address these issues, we introduce AgentTutor, a multi-turn interactive intelligent education system to empower personalized learning. It features an LLM-powered generative multi-agent system and a learner-specific personalized learning profile environment that dynamically optimizes and delivers teaching strategies based on learners’ learning status, personalized goals, learning preferences, and multimodal study materials. It includes five key modules: curriculum decomposition, learner assessment, dynamic strategy, teaching reflection, and knowledge experience memory. We conducted extensive experiments on multiple benchmark datasets; AgentTutor significantly enhances learners’ performance while demonstrating strong effectiveness in multi-turn interactions and competitiveness in teaching quality among other baselines.

[AI-139] The Artificial Intelligence Value Chain: A Critical Appraisal. [Spanish Version]

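[Quick Read]: This paper addresses how AI governance can integrate the economic concept of the value chain with non-monetary value dimensions such as ethics and law, in order to support EU AI legislation (such as the Artificial Intelligence Act) in practice and uphold democratic values and the rule of law in the digital era. The key contribution is a framework for analyzing the ethical and legal AI value chain: it identifies the limitations of the traditional economic value chain and systematically incorporates intangible elements such as language, culture, and ethical and legal values, providing theoretical support for policymaking and moving AI governance from a technology-oriented to a norm-oriented approach.

Link: https://arxiv.org/abs/2601.04218
Authors: Pompeu Casanovas
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 52 pages, in Spanish language, 5 figures, extended version of the paper presented at Panel 10.3 on Law and AI, X Congreso de la Lengua Española, held in Arequipa, Perú, from 13 to 18 October, 2025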

Abstract:The artificial intelligence value chain is one of the main concepts underpinning the European legislation on the subject, especially the Artificial Intelligence Act. It is an economic concept that has become a legal one, i.e., a concept of legal governance, due to its continued use in policy documents and legal texts. This article (i) analyses its significance and function within the framework of the regulatory strategy established by recent EU programs (the Compass for Competitiveness, the Action Plan, Apply AI Strategy, and the Digital Omnibus on AI), (ii) identifies its limitations, and (iii) advances the theoretical construction of value chains that capture intangible dimensions that are not directly monetizable (such as language, culture, and, especially, ethical and legal values) but have a significant impact on the social environment. It also briefly compares three different legal frameworks for the regulation of AI (EU, Commonwealth and USA). It proposes at the end a specific framework for the analysis of the ethical and legal AI value chain to preserve democratic values and foster the digital implementation of the rule of law.

[AI-140] Attachment Styles and AI Chatbot Interactions Among College Students

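【Quick read】: This paper asks how individual psychological traits, in particular attachment style, shape the way college students interact with generative AI chatbots such as ChatGPT. The key to its approach is the use of semi-structured interviews and grounded theory analysis, which identify three core themes: (1) AI as a low-risk emotional space, reflecting its non-judgmental and low-stakes nature; (2) attachment-congruent patterns of human-AI interaction, in which securely attached students treat AI as a supplement to their existing support systems while avoidantly attached students use AI to buffer vulnerability and maintain interpersonal boundaries; and (3) the paradox of AI intimacy, in which students are willing to disclose personal information to AI yet recognize its limitations as a relational partner. These findings indicate that attachment orientation plays an important role in shaping how students experience and interpret their interactions with AI, extending attachment theory to human-AI interaction.

Link: https://arxiv.org/abs/2601.04217
Authors: Ziqi Lin, Taiyu Hou
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 15 pages, 1 table, 2 appendices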

Abstract:The use of large language model (LLM)-based AI chatbots among college students has increased rapidly, yet little is known about how individual psychological attributes shape students’ interaction patterns with these technologies. This qualitative study explored how college students with different attachment styles describe their interactions with ChatGPT. Using semi-structured interviews with seven undergraduate students and grounded theory analysis, we identified three main themes: (1) AI as a low-risk emotional space, where participants across attachment styles valued the non-judgmental and low-stakes nature of AI interactions; (2) attachment-congruent patterns of AI engagement, where securely attached students integrated AI as a supplementary tool within their existing support systems, while avoidantly attached students used AI to buffer vulnerability and maintain interpersonal boundaries; and (3) the paradox of AI intimacy, capturing the tension between students’ willingness to disclose personal information to AI while simultaneously recognizing its limitations as a relational partner. These findings suggest that attachment orientations play an important role in shaping how students experience and interpret their interactions with AI chatbots, extending attachment theory to the domain of human-AI interaction.

[AI-141] Computable Gap Assessment of Artificial Intelligence Governance in Children's Centres: Evidence-Mechanism-Governance-Indicator Modelling of UNICEF's Guidance on AI and Children 3.0 Based on the Graph-GAP Framework

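【Quick read】: This paper addresses the lack of reproducible evidence anchors, explicit causal pathways, executable governance toolchains, and computable audit metrics in the governance of child-centered artificial intelligence (CCAI). Its core solution is the Graph-GAP methodology, which decomposes requirements from authoritative policy texts into a four-layer graph (evidence, mechanism, governance, and indicator) and uses two quantitative metrics, GAP score and mitigation readiness, to identify governance gaps and prioritise actions. The method turns abstract principles into actionable closed-loop governance. It also introduces a multi-algorithm review, aggregation, and revision workflow in which rule-based encoders, statistical or machine learning evaluators, and large-model evaluators annotate in parallel, so that outputs are traceable and their reliability and uncertainty can be quantified.

Link: https://arxiv.org/abs/2601.04216
Authors: Wei Meng
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Graph-GAP turns child centered AI governance requirements into a reproducible evidence mechanism governance indicator graph with computable gap and readiness scores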

Abstract:This paper tackles practical challenges in governing child centered artificial intelligence: policy texts state principles and requirements but often lack reproducible evidence anchors, explicit causal pathways, executable governance toolchains, and computable audit metrics. We propose Graph-GAP, a methodology that decomposes requirements from authoritative policy texts into a four layer graph of evidence, mechanism, governance, and indicator, and that computes two metrics, GAP score and mitigation readiness, to identify governance gaps and prioritise actions. Using the UNICEF Innocenti Guidance on AI and Children 3.0 as primary material, we define reproducible extraction units, coding manuals, graph patterns, scoring scales, and consistency checks, and we demonstrate exemplar gap profiles and governance priority matrices for ten requirements. Results suggest that compared with privacy and data protection, requirements related to child well being and development, explainability and accountability, and cross agency implementation and resource allocation are more prone to indicator gaps and mechanism gaps. We recommend translating requirements into auditable closed loop governance that integrates child rights impact assessments, continuous monitoring metrics, and grievance redress procedures. At the coding level, we introduce a multi algorithm review aggregation revision workflow that runs rule based encoders, statistical or machine learning evaluators, and large model evaluators with diverse prompt configurations as parallel coders. Each extraction unit outputs evidence, mechanism, governance, and indicator labels plus readiness scores with evidence anchors. Reliability, stability, and uncertainty are assessed using Krippendorff alpha, weighted kappa, intraclass correlation, and bootstrap confidence intervals.

[AI-142] Active Sensing Shapes Real-World Decision-Making through Dynamic Evidence Accumulation

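【Quick read】: This paper addresses the applicability bottleneck of evidence accumulation modelling (EAM) caused by the gap in evidence affordance between laboratory settings and the real world, that is, how to extend EAM to real-world scenarios in order to explain human decision-making. The key to its solution is a cognitive scheme that quantifies active sensing in real-world driving scenarios through eye movements and formalizes how external evidence is transformed into internal mental beliefs. The scheme reveals a negative correlation between evidence affordance and the attention individuals recruit, as well as a positive influence of evidence affordance and attention distribution on decision-making propensity, providing a multifactorial, computational account of real-world decision-making.

Link: https://arxiv.org/abs/2601.04214
Authors: Hongliang Lu, Yunmeng Liu, Junjie Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Neurons and Cognition (q-bio.NC)
Comments: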

Abstract:Human decision-making heavily relies on active sensing, a well-documented cognitive behaviour for evidence gathering to accommodate ever-changing environments. However, its operational mechanism in the real world remains non-trivial. Currently, an in-laboratory paradigm, called evidence accumulation modelling (EAM), points out that human decision-making involves transforming external evidence into internal mental beliefs. However, the gap in evidence affordance between real-world contexts and laboratory settings hinders the effective application of EAM. Here we generalize EAM to the real world and conduct analysis in real-world driving scenarios. A cognitive scheme is proposed to formalize real-world evidence affordance and capture active sensing through eye movements. Empirically, our scheme can plausibly portray the accumulation of drivers’ mental beliefs, explaining how active sensing transforms evidence into mental beliefs from the perspective of information utility. Also, our results demonstrate a negative correlation between evidence affordance and attention recruited by individuals, revealing how human drivers adapt their evidence-collection patterns across various contexts. Moreover, we reveal the positive influence of evidence affordance and attention distribution on decision-making propensity. In a nutshell, our computational scheme generalizes EAM to real-world contexts and provides a comprehensive account of how active sensing underlies real-world decision-making, unveiling multifactorial, integrated characteristics in real-world decision-making.

[AI-143] CAOS: Conformal Aggregation of One-Shot Predictors

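【Quick read】: This paper addresses the lack of principled uncertainty quantification in **one-shot prediction**. Current methods can adapt to a new task from a single labeled example but offer no reliable confidence guarantees, while standard split conformal prediction is inefficient in the one-shot setting because of data splitting and reliance on a single predictor. The key to the solution is Conformal Aggregation of One-Shot Predictors (CAOS), whose core innovation is to adaptively aggregate multiple one-shot predictors and to use a leave-one-out calibration scheme that fully exploits scarce labeled data. Although this violates the classical exchangeability assumption, the authors prove valid marginal coverage for CAOS with a monotonicity-based argument, and experiments show that its prediction sets are substantially smaller than those of split conformal baselines while maintaining reliable coverage.

Link: https://arxiv.org/abs/2601.05219
Authors: Maja Waldron
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: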

Abstract:One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.
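For readers unfamiliar with the split conformal baseline that the abstract compares against, the following is a minimal sketch of its calibration step. It is not the CAOS procedure itself. The nonconformity score `1 - p(label)`, the function names, and the random calibration scores are assumptions made for illustration only.

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha):
    """Return the conformal quantile of held-out nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: the set must be trivial.
        return np.inf
    return np.sort(cal_scores)[k - 1]

def prediction_set(class_probs, threshold):
    """Keep every label whose score 1 - p(label) does not exceed the threshold."""
    return [label for label, p in enumerate(class_probs) if 1.0 - p <= threshold]

rng = np.random.default_rng(0)
cal_scores = rng.uniform(size=19)  # hypothetical calibration scores
q = split_conformal_threshold(cal_scores, alpha=0.1)
print(prediction_set([0.7, 0.2, 0.1], q))
```

The sketch makes the inefficiency mentioned in the abstract concrete: the calibration scores must come from data held out from the predictor, and with very few labeled examples the rank `k` can exceed `n`, which forces an uninformative prediction set.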

[AI-144] Exponential capacity scaling of classical GANs compared to hybrid latent style-based quantum GANs

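【Quick read】: This paper systematically studies the parameter-efficiency advantage of quantum generative adversarial networks (QGANs) over classical models in the hybrid latent style-based QGAN architecture, and in particular tests whether an exponential capacity-scaling advantage exists. The key to its approach is a latent-space encoding based on a classical variational autoencoder (VAE), with the VAE carefully tuned to ensure stable training. On this basis, the experiments find that when training is optimal (stable training with a low and stable FID score), the optimal capacities of both the classical discriminator and the classical generator scale exponentially with the capacity of the quantum generator. The authors present this as the first comprehensive experimental analysis of the advantage, and the result hints at a type of quantum advantage for this class of hybrid quantum-classical architectures.

Link: https://arxiv.org/abs/2601.05036
Authors: Milan Liepelt, Julien Baglio
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages, 7 figures, 7 tables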

Abstract:Quantum generative modeling is a very active area of research in looking for practical advantage in data analysis. Quantum generative adversarial networks (QGANs) are leading candidates for quantum generative modeling and have been applied to diverse areas, from high-energy physics to image generation. The latent style-based QGAN, relying on a classical variational autoencoder to encode the input data into a latent space and then using a style-based QGAN for data generation, has been proven to be efficient for image generation or drug design, hinting at the use of far fewer trainable parameters than its classical counterpart to achieve comparable performance. However, this advantage has never been systematically studied. We present in this work the first comprehensive experimental analysis of this advantage of QGANs applied to SAT4 image generation, obtaining an exponential advantage in capacity scaling for a quantum generator in the hybrid latent style-based QGAN architecture. Careful tuning of the autoencoder is crucial to obtain stable, reliable results. Once this tuning is performed and defining training optimality as when the training is stable and the FID score is low and stable as well, the optimal capacity (or number of trainable parameters) of the classical discriminator scales exponentially with respect to the capacity of the quantum generator, and the same is true for the capacity of the classical generator. This hints toward a type of quantum advantage for quantum generative modeling.

[AI-145] The Role of Quantum in Hybrid Quantum-Classical Neural Networks: A Realistic Assessment

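【Quick read】: This paper addresses the open question of what quantum components actually contribute to the overall performance of hybrid quantum-classical neural networks on near-term quantum hardware, where it remains unclear whether they improve or degrade performance. The key to its approach is a rigorous statistical study that systematically evaluates common hybrid models on medical signal data and on planar and volumetric images, quantifying the influence of factors such as encoding schemes, entanglement, and circuit size. The study finds that hybrid models match their classical counterparts only in best-case scenarios and that in most cases the quantum components degrade performance, and it calls for caution in designing and evaluating hybrid models.

Link: https://arxiv.org/abs/2601.04732
Authors: Dominik Freinberger, Philipp Moser
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures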

Abstract:Quantum machine learning has emerged as a promising application domain for near-term quantum hardware, particularly through hybrid quantum-classical models that leverage both classical and quantum processing. Although numerous hybrid architectures have been proposed and demonstrated successfully on benchmark tasks, a significant open question remains regarding the specific contribution of quantum components to the overall performance of these models. In this work, we aim to shed light on the impact of quantum processing within hybrid quantum-classical neural network architectures through a rigorous statistical study. We systematically assess common hybrid models on medical signal data as well as planar and volumetric images, examining the influence attributable to classical and quantum aspects such as encoding schemes, entanglement, and circuit size. We find that in best-case scenarios, hybrid models show performance comparable to their classical counterparts; however, in most cases, performance metrics deteriorate under the influence of quantum components. Our multi-modal analysis provides realistic insights into the contributions of quantum components and advocates for cautious claims and design choices for hybrid models in near-term applications.

[AI-146] LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models

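【Quick read】: This paper addresses the spread of hate speech in spoken content and proposes an end-to-end method that combines automatic speech recognition (ASR) with large language models (LLMs) to transcribe speech and mask harmful content at the same time. The encoder of the ASR model is integrated with the decoder of the LLM to form a joint model that performs speech recognition and hate-word masking simultaneously. Hate speech text samples are generated with Chain-of-Thought (CoT) prompting guided by cultural context, converted to speech with a text-to-speech (TTS) system, and filtered with text classification models so that only samples correctly labeled as hate content are kept. Curriculum learning is then applied: adjusting the threshold on the number of classification models that label a sample correctly controls the level of hate in the training data, so that the model can be trained gradually. Experiments show a masking accuracy of 58.6% for hate-related words, surpassing previous baselines, and the curriculum training contributes to the efficiency of both the transcription and the censorship tasks.

Link: https://arxiv.org/abs/2601.04654
Authors: Ryutaro Oshima, Yuya Hosoda, Youji Iiguni
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: In Proceedings of the 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)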

Abstract:This paper proposes an automatic speech recognition (ASR) model for hate speech using large language models (LLMs). The proposed method integrates the encoder of the ASR model with the decoder of the LLMs, enabling simultaneous transcription and censorship tasks to prevent the exposure of harmful content. Instruction tuning of the LLM to mask hate-related words with specific tokens requires an annotated hate speech dataset, which is limited. We generate text samples using an LLM with the Chain-of-Thought (CoT) prompting technique guided by cultural context and examples and then convert them into speech samples using a text-to-speech (TTS) system. However, some of them contain non-hate speech samples with hate-related words, which degrades the censorship performance. This paper filters the samples which text classification models correctly label as hate content. By adjusting the threshold for the number of correct answer models, we can control the level of hate in the generated dataset, allowing us to train the LLMs through curriculum learning in a gradual manner. Experimental results show that the proposed method achieves a masking accuracy of 58.6% for hate-related words, surpassing previous baselines. We also confirm that the curriculum training contributes to the efficiency of both transcription and censorship tasks.
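The vote-threshold filtering described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the classifiers are hypothetical callables returning `True` for hate content, the function names are invented here, and the order in which the resulting stages would be used for curriculum training is not specified by the abstract.

```python
from typing import Callable, List, Sequence

Classifier = Callable[[str], bool]  # hypothetical: True means "labeled as hate"

def filter_by_votes(samples: Sequence[str],
                    classifiers: Sequence[Classifier],
                    min_votes: int) -> List[str]:
    """Keep samples that at least `min_votes` classifiers label as hate."""
    kept = []
    for text in samples:
        votes = sum(1 for clf in classifiers if clf(text))
        if votes >= min_votes:
            kept.append(text)
    return kept

def curriculum_stages(samples: Sequence[str],
                      classifiers: Sequence[Classifier],
                      thresholds: Sequence[int]) -> List[List[str]]:
    """Build one filtered dataset per vote threshold."""
    return [filter_by_votes(samples, classifiers, t) for t in thresholds]
```

A higher `min_votes` keeps only samples on which more classifiers agree, which is one plausible reading of how the threshold controls the level of hate in the generated dataset.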

[AI-147] Crystal Generation using the Fully Differentiable Pipeline and Latent Space Optimization

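【Quick read】: This paper addresses how to efficiently generate material structures with a specified local environment under crystallographic constraints. The core challenge is to control the atomic arrangement precisely enough to match the target local structure while preserving lattice symmetry. The key to the solution is a generative framework that couples a symmetry-conditioned variational autoencoder (CVAE) with a differentiable SO(3) power spectrum objective and applies a dual-level relaxation approach that alternates optimization between the direct and latent crystal representations. With a fully differentiable pipeline and GPU acceleration, the implementation is about five times faster than the authors' previous CPU workflow, and the alternating strategy overcomes local barriers defined by different objective gradients, raising the success rate of generating complex structures. The framework can be extended to multi-component and multi-environment systems, providing a scalable route to generating structures with a target local environment.

Link: https://arxiv.org/abs/2601.04606
Authors: Osman Goni Ridwan, Gilles Frapper, Hongfei Xue, Qiang Zhu
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atomic and Molecular Clusters (physics.atm-clus)
Comments: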

Abstract:We present a materials generation framework that couples a symmetry-conditioned variational autoencoder (CVAE) with a differentiable SO(3) power spectrum objective to steer candidates toward a specified local environment under crystallographic constraints. In particular, we implement a fully differentiable pipeline that performs batch-wise optimization on both direct and latent crystallographic representations. Using GPU acceleration, the implementation achieves about a fivefold speedup compared to our previous CPU workflow, while yielding comparable outcomes. In addition, we introduce an optimization strategy that alternately performs optimization on the direct and latent crystal representations. This dual-level relaxation approach can effectively overcome local barriers defined by different objective gradients, thus increasing the success rate of generating complex structures satisfying the target local environments. This framework can be extended to systems consisting of multi-components and multi-environments, providing a scalable route to generate material structures with the target local environment.

[AI-148] SpectraFormer: an Attention-Based Raman Unmixing Tool for Accessing the Graphene Buffer-Layer Signature on SiC

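【Quick read】: This paper addresses the difficulty of Raman characterization of graphene grown on silicon carbide (SiC), which is caused by the intense second-order Raman signal of the substrate and its dependence on position and experimental conditions. The problem is most acute for buffer layer graphene, a semiconducting interfacial phase whose vibrational signatures are masked by the SiC background and cannot be reliably extracted with conventional reference-based subtraction. The key to the solution is SpectraFormer, a transformer-based deep learning model that reconstructs the SiC substrate contribution directly from partially masked post-growth Raman spectra without explicit reference measurements. By learning global correlations across the entire Raman shift range, the model captures the statistical structure of the SiC background and accurately reconstructs the substrate signal in mixed spectra, revealing weak vibrational features (such as ZLG-related modes) that conventional analysis cannot access. The extracted spectra are validated by ab initio vibrational calculations, and the approach provides a reference-free analysis framework, compatible with real-time data acquisition, for automated, closed-loop AI-assisted optimization of graphene growth on SiC.

Link: https://arxiv.org/abs/2601.04445
Authors: Dmitriy Poteryayev, Pietro Novelli, Annalisa Coriolano, Riccardo Dettori, Valentina Tozzini, Fabio Beltram, Massimiliano Pontil, Antonio Rossi, Stiven Forti, Camilla Coletti
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 4 figures, 1 table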

Abstract:Raman spectroscopy is a key tool for graphene characterization, yet its application to graphene grown on silicon carbide (SiC) is strongly limited by the intense and variable second-order Raman response of the substrate. This limitation is critical for buffer layer graphene, a semiconducting interfacial phase, whose vibrational signatures overlap with the SiC background and are difficult to access reliably using conventional reference-based subtraction, due to strong spatial and experimental variability of the substrate signal. Here we present SpectraFormer, a transformer-based deep learning model that reconstructs the SiC Raman substrate contribution directly from post-growth partially masked spectroscopic data without relying on explicit reference measurements. By learning global correlations across the entire Raman shift range, the model captures the statistical structure of the SiC background and enables accurate reconstruction of its contribution in mixed spectra. Subtraction of the reconstructed substrate signal reveals weak vibrational features associated with ZLG that are inaccessible through conventional analysis methods. The extracted spectra are validated by ab initio vibrational calculations, allowing assignment of the resolved features to specific modes and confirming their physical consistency. By leveraging a state-of-the-art attention-based deep learning architecture, this approach establishes a robust, reference-free framework for Raman analysis of graphene on SiC and provides a foundation, compatible with real-time data acquisition, for its integration into automated, closed-loop AI-assisted growth optimization.

Machine Learning

[LG-0] Optimal Lower Bounds for Online Multicalibration

Link: https://arxiv.org/abs/2601.05245
Authors: Natalie Collina, Jiuyao Lu, Georgy Noarov, Aaron Roth
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:We prove tight lower bounds for online multicalibration, establishing an information-theoretic separation from marginal calibration. In the general setting where group functions can depend on both context and the learner’s predictions, we prove an $\Omega(T^{2/3})$ lower bound on expected multicalibration error using just three disjoint binary groups. This matches the upper bounds of Noarov et al. (2025) up to logarithmic factors and exceeds the $O(T^{2/3-\varepsilon})$ upper bound for marginal calibration (Dagan et al., 2025), thereby separating the two problems. We then turn to lower bounds for the more difficult case of group functions that may depend on context but not on the learner’s predictions. In this case, we establish an $\widetilde{\Omega}(T^{2/3})$ lower bound for online multicalibration via a $\Theta(T)$-sized group family constructed using orthogonal function systems, again matching upper bounds up to logarithmic factors.

[LG-1] EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI

Link: https://arxiv.org/abs/2601.05205
Authors: Zain Iqbal, Lorenzo Valerio
Categories: Machine Learning (cs.LG); Performance (cs.PF)
Comments: 6 pages, 9 figures, 2 Tables, conference [Submitted in PerConAI-2026]

Abstract:Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.

[LG-2] An interpretable data-driven approach to optimizing clinical fall risk assessment

Link: https://arxiv.org/abs/2601.05194
Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2510.20714

Abstract:In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study’s risk labels, without changing the tool’s form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost) improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.

[LG-3] Learning Mixture Models via Efficient High-dimensional Sparse Fourier Transforms

Link: https://arxiv.org/abs/2601.05157
Authors: Alkis Kalavasis, Pravesh K. Kothari, Shuchen Li, Manolis Zampetakis
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In this work, we give a $\mathrm{poly}(d,k)$ time and sample algorithm for efficiently learning the parameters of a mixture of $k$ spherical distributions in $d$ dimensions. Unlike all previous methods, our techniques apply to heavy-tailed distributions and include examples that do not even have finite covariances. Our method succeeds whenever the cluster distributions have a characteristic function with sufficiently heavy tails. Such distributions include the Laplace distribution but crucially exclude Gaussians. All previous methods for learning mixture models relied implicitly or explicitly on the low-degree moments. Even for the case of Laplace distributions, we prove that any such algorithm must use super-polynomially many samples. Our method thus adds to the short list of techniques that bypass the limitations of the method of moments. Somewhat surprisingly, our algorithm does not require any minimum separation between the cluster means. This is in stark contrast to spherical Gaussian mixtures where a minimum $\ell_2$-separation is provably necessary even information-theoretically [Regev and Vijayaraghavan '17]. Our methods compose well with existing techniques and allow obtaining “best of both worlds” guarantees for mixtures where every component either has a heavy-tailed characteristic function or has a sub-Gaussian tail with a light-tailed characteristic function. Our algorithm is based on a new approach to learning mixture models via efficient high-dimensional sparse Fourier transforms. We believe that this method will find more applications to statistical estimation. As an example, we give an algorithm for consistent robust mean estimation against noise-oblivious adversaries, a model practically motivated by the literature on multiple hypothesis testing. It was formally proposed in a recent Master’s thesis by one of the authors, and has already inspired follow-up works.
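As a small illustration of why the tail of the characteristic function matters, the one-dimensional, zero-mean cases can be written out explicitly. The symbols $b$, $\sigma$, $w_j$, $\mu_j$, and $Z$ below are notation assumed here for illustration and are not taken from the paper. The Laplace characteristic function decays only polynomially in $t$, whereas the Gaussian one decays faster than any polynomial:

$$
\varphi_{\mathrm{Laplace}}(t) = \frac{1}{1 + b^2 t^2},
\qquad
\varphi_{\mathrm{Gauss}}(t) = e^{-\sigma^2 t^2 / 2}.
$$

For a location mixture $X = \mu_J + Z$, where $J = j$ with probability $w_j$ and every component shares the same shape $Z$, the characteristic function factorizes as

$$
\varphi_X(t) = \varphi_Z(t) \sum_{j=1}^{k} w_j \, e^{i \langle \mu_j, t \rangle}.
$$

Dividing by $\varphi_Z(t)$ leaves a sum of $k$ complex exponentials, which is the kind of signal a sparse Fourier transform recovers. This is a textbook identity offered as intuition under the shared-shape assumption, not a description of the authors' algorithm.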

[LG-4] Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning

Link: https://arxiv.org/abs/2601.05134
Authors: Polina Dolgova, Sebastian U. Stich
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Certified unlearning based on differential privacy offers strong guarantees but remains largely impractical: the noisy fine-tuning approaches proposed so far achieve these guarantees but severely reduce model accuracy. We propose sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space, rather than injecting it all at once. This simple modification mitigates the destructive effect of noise while preserving the original certification guarantees. We extend the analysis of noisy fine-tuning to the subspace setting, proving that the same $(\varepsilon,\delta)$ privacy budget is retained. Empirical results on image classification benchmarks show that our approach substantially improves accuracy after unlearning while remaining robust to membership inference attacks. These results show that certified unlearning can achieve both rigorous guarantees and practical utility.
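The idea of spreading noise over orthogonal subspaces can be sketched as follows. This is a minimal illustration, not the authors' schedule: the random orthonormal basis, the equal-sized blocks, the noise scale `sigma`, and the function names are all assumptions, and the fine-tuning step between injections is omitted.

```python
import numpy as np

def orthogonal_subspaces(dim, num_blocks, rng):
    """Split a random orthonormal basis of R^dim into column blocks."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return np.array_split(q, num_blocks, axis=1)

def subspace_noise(basis_block, sigma, rng):
    """Gaussian noise supported only on the span of `basis_block`."""
    coeffs = rng.standard_normal(basis_block.shape[1])
    return sigma * (basis_block @ coeffs)

rng = np.random.default_rng(0)
params = np.zeros(8)  # hypothetical flattened model parameters
blocks = orthogonal_subspaces(params.size, num_blocks=4, rng=rng)
for block in blocks:
    # A fine-tuning step would run here in a real pipeline (omitted).
    params = params + subspace_noise(block, sigma=0.1, rng=rng)
```

In this sketch each step perturbs only one subspace, so no single step applies noise in every direction at once, while the sum over all blocks is distributed like one isotropic Gaussian draw with the same `sigma`.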

[LG-5] Exploring Student Expectations and Confidence in Learning Analytics

Link: https://arxiv.org/abs/2601.05082
Authors: Hayk Asatryan, Basile Tousside, Janis Mohr, Malte Neugebauer, Hildo Bijl, Paul Spiegelberg, Claudia Frohn-Schauf, Jörg Frochte
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 7 pages, Keywords: Learning Analytics, Survey, Data Protection, Clustering

Abstract:Learning Analytics (LA) is nowadays ubiquitous in many educational systems, providing the ability to collect and analyze student data in order to understand and optimize learning and the environments in which it occurs. On the other hand, the collection of data requires compliance with the growing demands of privacy legislation. In this paper, we use the Student Expectation of Learning Analytics Questionnaire (SELAQ) to analyze the expectations and confidence of students from different faculties regarding the processing of their data for Learning Analytics purposes. This allows us to identify four clusters of students through clustering algorithms: Enthusiasts, Realists, Cautious and Indifferents. This structured analysis provides valuable insights into the acceptance and criticism of Learning Analytics among students.

[LG-6] Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable Reward

Link: https://arxiv.org/abs/2601.05073
Authors: Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, Renqiu Xia
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) struggle with complex geometric reasoning, largely because “black box” outcome-based supervision fails to distinguish between lucky guesses and rigorous deduction. To address this, we introduce a paradigm shift towards subgoal-level evaluation and learning. We first construct GeoGoal, a benchmark synthesized via a rigorous formal verification data engine, which converts abstract proofs into verifiable numeric subgoals. This structure reveals a critical divergence between reasoning quality and outcome accuracy. Leveraging this, we propose the Sub-Goal Verifiable Reward (SGVR) framework, which replaces sparse signals with dense rewards based on the Skeleton Rate. Experiments demonstrate that SGVR not only enhances geometric performance (+9.7%) but also exhibits strong generalization, transferring gains to general math (+8.0%) and other general reasoning tasks (+2.8%), demonstrating broad applicability across diverse domains.

[LG-7] DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Link: https://arxiv.org/abs/2601.05052
Authors: Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 20 tables, 2 figures

Abstract:Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.
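The permutation symmetry that canonicalization methods such as Git Re-Basin are meant to account for can be checked on a toy network. The sketch below is not DeepWeightFlow; it only shows that reordering the hidden units of a two-layer MLP (with hypothetical random weights) leaves its outputs unchanged, so many distinct weight vectors represent the same function.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU network: x -> relu(x @ w1 + b1) @ w2 + b2."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((3, 5)), rng.standard_normal(5)
w2, b2 = rng.standard_normal((5, 2)), rng.standard_normal(2)

# Reorder hidden units: permute columns of w1, entries of b1, rows of w2.
perm = rng.permutation(5)
w1_p, b1_p, w2_p = w1[:, perm], b1[perm], w2[perm, :]

x = rng.standard_normal((4, 3))
same = np.allclose(mlp(x, w1, b1, w2, b2), mlp(x, w1_p, b1_p, w2_p, b2))
print(same)
```

Because every such permutation yields an equivalent network, a generative model trained on raw weights would otherwise have to cover all of these copies, which is one motivation for mapping networks to a canonical ordering first.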

[LG-8] A Data-Driven Predictive Framework for Inventory Optimization Using Context-Augmented Machine Learning Models

Link: https://arxiv.org/abs/2601.05033
Authors: Anees Fatima, Mohammad Abdus Salam
Categories: Machine Learning (cs.LG)

Abstract:Demand forecasting in supply chain management (SCM) is critical for optimizing inventory, reducing waste, and improving customer satisfaction. Conventional approaches frequently neglect external influences like weather, festivities, and equipment breakdowns, resulting in inefficiencies. This research investigates the use of machine learning (ML) algorithms to improve demand prediction in retail and vending machine sectors. Four machine learning algorithms, Extreme Gradient Boosting (XGBoost), Autoregressive Integrated Moving Average (ARIMA), Facebook Prophet (Fb Prophet), and Support Vector Regression (SVR), were used to forecast inventory requirements. External factors like weekdays, holidays, and sales deviation indicators were methodically incorporated to enhance precision. XGBoost surpassed other models, reaching the lowest Mean Absolute Error (MAE) of 22.7 with the inclusion of external variables. ARIMAX and Fb Prophet demonstrated noteworthy enhancements, whereas SVR fell short in performance. Incorporating external factors greatly improves the precision of demand forecasting models, and XGBoost is identified as the most efficient algorithm. This study offers a strong framework for enhancing inventory management in retail and vending machine systems.

[LG-9] Approximate equivariance via projection-based regularisation

Link: https://arxiv.org/abs/2601.05028
Authors: Torben Berndt, Jan Stühmer
Categories: Machine Learning (cs.LG)

Abstract:Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, non-equivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field usually apply sample-based regularisers which depend on data augmentation at training time, incurring a high sample complexity, in particular for continuous groups such as SO(3). This work instead approaches approximate equivariance via a projection-based regulariser which leverages the orthogonal decomposition of linear layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domain. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.
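
The idea of splitting a linear layer into equivariant and non-equivariant parts can be illustrated on a small finite group. The sketch below is not the paper's framework (which targets continuous groups such as SO(3)); it assumes the cyclic shift group acting on vectors, where the equivariant linear maps are exactly the circulant matrices, and the function names are hypothetical.

```python
def project_to_shift_equivariant(w):
    """Group-average a square matrix over all cyclic shifts.

    Averaging W[(i+k) % n][(j+k) % n] over k is the orthogonal projection
    onto circulant (shift-equivariant) matrices.
    """
    n = len(w)
    return [
        [sum(w[(i + k) % n][(j + k) % n] for k in range(n)) / n for j in range(n)]
        for i in range(n)
    ]


def non_equivariance_penalty(w):
    """Squared Frobenius norm of the non-equivariant component W - P(W)."""
    p = project_to_shift_equivariant(w)
    n = len(w)
    return sum((w[i][j] - p[i][j]) ** 2 for i in range(n) for j in range(n))


# A circulant matrix is already shift-equivariant, so its penalty vanishes.
circulant = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [2.0, 3.0, 1.0]]
# A diagonal matrix with distinct entries is not shift-equivariant.
generic = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]

penalty_circulant = non_equivariance_penalty(circulant)
penalty_generic = non_equivariance_penalty(generic)
```

Because the penalty is computed on the operator itself, no augmented samples are needed, which is the contrast with sample-based regularisers drawn in the abstract.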

[LG-10] Leveraging Prediction Entropy for Automatic Prompt Weighting in Zero-Shot Audio-Language Classification

Link: https://arxiv.org/abs/2601.05011
Authors: Karim El Khoury, Maxime Zanella, Tiffanie Godelaine, Christophe De Vleeschouwer, Benoit Macq
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:Audio-language models have recently demonstrated strong zero-shot capabilities by leveraging natural-language supervision to classify audio events without labeled training data. Yet, their performance is highly sensitive to the wording of text prompts, with small variations leading to large fluctuations in accuracy. Prior work has mitigated this issue through prompt learning or prompt ensembling. However, these strategies either require annotated data or fail to account for the fact that some prompts may negatively impact performance. In this work, we present an entropy-guided prompt weighting approach that aims to find a robust combination of prompt contributions to maximize prediction confidence. To this end, we formulate a tailored objective function that minimizes prediction entropy to yield new prompt weights, using low entropy as a proxy for high confidence. Our approach can be applied to individual samples or a batch of audio samples, requiring no additional labels and incurring negligible computational overhead. Experiments on five audio classification datasets covering environmental, urban, and vocal sounds demonstrate consistent gains compared to classical prompt ensembling methods in a zero-shot setting, with accuracy improvements five times larger across the whole benchmark.
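
A minimal sketch of entropy-guided prompt weighting for a single sample follows. The abstract does not spell out the tailored objective, so this assumes the simplest variant: prompt weights are a softmax over free parameters, per-prompt class logits are mixed linearly, and the Shannon entropy of the resulting prediction is minimized with finite-difference gradient steps. All names and the toy logits are hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_of_mixture(theta, prompt_logits):
    """Entropy of the class prediction obtained by weighting prompts with softmax(theta)."""
    weights = softmax(theta)
    n_classes = len(prompt_logits[0])
    mixed = [
        sum(weights[p] * prompt_logits[p][c] for p in range(len(weights)))
        for c in range(n_classes)
    ]
    probs = softmax(mixed)
    return -sum(q * math.log(q) for q in probs if q > 0)


def fit_prompt_weights(prompt_logits, steps=100, lr=1.0, eps=1e-5):
    """Minimize prediction entropy; a step is kept only if it lowers the entropy."""
    theta = [0.0] * len(prompt_logits)  # start from uniform prompt ensembling
    current = entropy_of_mixture(theta, prompt_logits)
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            bumped = theta[:]
            bumped[i] += eps
            grad.append((entropy_of_mixture(bumped, prompt_logits) - current) / eps)
        candidate = [t - lr * g for t, g in zip(theta, grad)]
        candidate_entropy = entropy_of_mixture(candidate, prompt_logits)
        if candidate_entropy < current:
            theta, current = candidate, candidate_entropy
        else:
            lr *= 0.5  # shrink the step when no improvement is found
    return softmax(theta), current


# Toy logits for three prompts over three classes: prompt 0 is confident, the others are flat.
prompt_logits = [[4.0, 0.5, 0.2], [1.0, 0.9, 1.1], [0.3, 0.2, 0.1]]
initial_entropy = entropy_of_mixture([0.0, 0.0, 0.0], prompt_logits)
weights, final_entropy = fit_prompt_weights(prompt_logits)
```

No labels enter the procedure; the only signal is the confidence of the weighted prediction, which is why the approach stays zero-shot.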

[LG-11] Cardinality augmented loss functions

Link: https://arxiv.org/abs/2601.04941
Authors: Miguel O’Malley
Categories: Machine Learning (cs.LG)
Notes: 12 pages, 3 figures

Abstract:Class imbalance is a common and pernicious issue for the training of neural networks. Often, an imbalanced majority class can dominate training to skew classifier performance towards the majority outcome. To address this problem we introduce cardinality augmented loss functions, derived from cardinality-like invariants in modern mathematics literature such as magnitude and the spread. These invariants enrich the concept of cardinality by evaluating the 'effective diversity' of a metric space, and as such represent a natural solution to overly homogeneous training data. In this work, we establish a methodology for applying cardinality augmented loss functions in the training of neural networks and report results on both artificially imbalanced datasets as well as a real-world imbalanced material science dataset. We observe significant performance improvement among minority classes, as well as improvement in overall performance metrics.
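
To make 'effective diversity' concrete, the sketch below computes the magnitude of a finite metric space using its standard definition: build the similarity matrix Z with entries exp(-d(i, j)), solve Z w = 1, and sum the weights. How the invariant enters the loss is not stated in the abstract, so only the invariant itself is shown; the 1-D points and helper names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def magnitude(points):
    """Magnitude of a finite set of 1-D points under the absolute-difference metric."""
    z = [[math.exp(-abs(p - q)) for q in points] for p in points]
    weights = solve_linear(z, [1.0] * len(points))
    return sum(weights)


# Two nearly identical points plus one distant point behave like about two points.
clustered = magnitude([0.0, 0.01, 5.0])
# Three well-separated points behave like about three points.
spread_out = magnitude([0.0, 10.0, 20.0])
```

A dataset dominated by near-duplicates therefore has a magnitude well below its raw count, which is the sense in which such invariants measure homogeneity.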

[LG-12] Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds

Link: https://arxiv.org/abs/2601.04907
Authors: Sifan Yang, Wenhao Yang, Wei Jiang, Lijun Zhang
Categories: Machine Learning (cs.LG)

Abstract:We investigate distributed online convex optimization with compressed communication, where $n$ learners connected by a network collaboratively minimize a sequence of global loss functions using only local information and compressed data from neighbors. Prior work has established regret bounds of $O(\max\{\omega^{-2}\rho^{-4}n^{1/2},\omega^{-4}\rho^{-8}\}n\sqrt{T})$ and $O(\max\{\omega^{-2}\rho^{-4}n^{1/2},\omega^{-4}\rho^{-8}\}n\ln T)$ for convex and strongly convex functions, respectively, where $\omega\in(0,1]$ is the compression quality factor ($\omega=1$ means no compression) and $\rho<1$ is the spectral gap of the communication matrix. However, these regret bounds suffer from a quadratic or even quartic dependence on $\omega^{-1}$. Moreover, the super-linear dependence on $n$ is also undesirable. To overcome these limitations, we propose a novel algorithm that achieves improved regret bounds of $\tilde{O}(\omega^{-1/2}\rho^{-1}n\sqrt{T})$ and $\tilde{O}(\omega^{-1}\rho^{-2}n\ln T)$ for convex and strongly convex functions, respectively. The primary idea is to design a two-level blocking update framework incorporating two novel ingredients: an online gossip strategy and an error compensation scheme, which collaborate to achieve a better consensus among learners. Furthermore, we establish the first lower bounds for this problem, justifying the optimality of our results with respect to both $\omega$ and $T$. Additionally, we consider the bandit feedback scenario, and extend our method with the classic gradient estimators to enhance existing regret bounds.
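
To get a feel for the gap between the prior and improved convex-case bounds, the sketch below evaluates only the leading factors that multiply sqrt(T). It ignores constants and the logarithmic terms hidden by the tilde notation, so it is a rough comparison rather than a statement about actual regret; the parameter values (omega = 0.1, rho = 0.5, n = 16) and function names are arbitrary assumptions.

```python
import math


def prior_convex_factor(omega, rho, n):
    """Leading factor of the prior bound: max{w^-2 r^-4 n^(1/2), w^-4 r^-8} * n."""
    return max(omega ** -2 * rho ** -4 * math.sqrt(n), omega ** -4 * rho ** -8) * n


def new_convex_factor(omega, rho, n):
    """Leading factor of the improved bound: w^(-1/2) r^-1 n (log factors dropped)."""
    return omega ** -0.5 * rho ** -1 * n


omega, rho, n = 0.1, 0.5, 16
prior = prior_convex_factor(omega, rho, n)
new = new_convex_factor(omega, rho, n)
ratio = prior / new  # how many times larger the prior factor is for these values
```

With aggressive compression (small omega) the quartic term dominates the prior bound, which is why the improvement to a square-root dependence matters most in that regime.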

[LG-13] Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Link: https://arxiv.org/abs/2601.04890
Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
Categories: Machine Learning (cs.LG)

Abstract:Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both Adam and Muon optimizers, where they show improvements in downstream evaluations comparable to the improvement from switching from Adam to Muon.
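
The forward pass with scalar, per-row, and per-column multipliers can be sketched as below. The exact parameterization is an assumption (the abstract only names the three kinds of multiplier): the effective weight is taken to be scale * row_mult[i] * W[i][j] * col_mult[j]. The example also shows one forward-pass symmetry of this form: scaling all row multipliers by k and all column multipliers by 1/k leaves the output unchanged.

```python
def forward(w, x, scale=1.0, row_mult=None, col_mult=None):
    """Compute y = scale * diag(row_mult) @ W @ diag(col_mult) @ x with plain lists."""
    rows, cols = len(w), len(w[0])
    row_mult = row_mult or [1.0] * rows
    col_mult = col_mult or [1.0] * cols
    return [
        scale * row_mult[i] * sum(w[i][j] * col_mult[j] * x[j] for j in range(cols))
        for i in range(rows)
    ]


w = [[0.5, -1.0], [2.0, 0.25]]
x = [1.0, 2.0]
row_mult = [1.5, 0.5]
col_mult = [2.0, 1.0]

y = forward(w, x, scale=0.8, row_mult=row_mult, col_mult=col_mult)

# Forward-pass symmetry: move a factor k from the columns to the rows.
k = 2.0
y_rescaled = forward(
    w,
    x,
    scale=0.8,
    row_mult=[r * k for r in row_mult],
    col_mult=[c / k for c in col_mult],
)
```

In a real training setup the multipliers would be trainable parameters, typically excluded from weight decay so that their scale is set by the data rather than by the WD-noise equilibrium; that detail is likewise an assumption here.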

[LG-14] FibreCastML: An Open Web Platform for Predicting Electrospun Nanofibre Diameter Distributions

Link: https://arxiv.org/abs/2601.04873
Authors: Elisa Roldan, Kirstie Andrews, Stephen M. Richardson, Reyhaneh Fatahian, Glen Cooper, Rasool Erfani, Tasneem Sabir, Neil D. Reeves
Categories: Machine Learning (cs.LG)

Abstract:Electrospinning is a scalable technique for producing fibrous scaffolds with tunable micro- and nanoscale architectures for applications in tissue engineering, drug delivery, and wound care. While machine learning (ML) has been used to support electrospinning process optimisation, most existing approaches predict only mean fibre diameters, neglecting the full diameter distribution that governs scaffold performance. This work presents FibreCastML, an open, distribution-aware ML framework that predicts complete fibre diameter spectra from routinely reported electrospinning parameters and provides interpretable insights into process-structure relationships. A meta-dataset comprising 68538 individual fibre diameter measurements extracted from 1778 studies across 16 biomedical polymers was curated. Six standard processing parameters, namely solution concentration, applied voltage, flow rate, tip-to-collector distance, needle diameter, and collector rotation speed, were used to train seven ML models using nested cross-validation with leave-one-study-out external folds. Model interpretability was achieved using variable importance analysis, SHapley Additive exPlanations, correlation matrices, and three-dimensional parameter maps. Nonlinear models consistently outperformed linear baselines, achieving coefficients of determination above 0.91 for several widely used polymers. Solution concentration emerged as the dominant global driver of fibre diameter distributions. Experimental validation across different electrospinning systems demonstrated close agreement between predicted and measured distributions. FibreCastML enables more reproducible and data-driven optimisation of electrospun scaffold architectures.
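
The leave-one-study-out external folds mentioned in the abstract can be sketched as a simple grouped split: every measurement from one study is held out together, so the model is never tested on a study it has seen. The study labels and function name below are hypothetical, and the inner loop of the nested cross-validation (hyperparameter tuning on the training part) is omitted.

```python
def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for each distinct study."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test


# One label per fibre diameter measurement, naming the study it came from.
study_ids = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_study_out(study_ids))
```

Grouping by study rather than by measurement avoids leakage between measurements that share a laboratory setup, which a plain random split would not prevent.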

[LG-15] MPM-LLM4DSE: Reaching the Pareto Frontier in HLS with Multimodal Learning and LLM-Driven Exploration

Link: https://arxiv.org/abs/2601.04801
Authors: Lei Xu, Shanshan Wang, Chenglong Xiao
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:High-Level Synthesis (HLS) design space exploration (DSE) seeks Pareto-optimal designs within expansive pragma configuration spaces. To accelerate HLS DSE, graph neural networks (GNNs) are commonly employed as surrogates for HLS tools to predict quality of results (QoR) metrics, while multi-objective optimization algorithms expedite the exploration. However, GNN-based prediction methods may not fully capture the rich semantic features inherent in behavioral descriptions, and conventional multi-objective optimization algorithms often do not explicitly account for the domain-specific knowledge regarding how pragma directives influence QoR. To address these limitations, this paper proposes the MPM-LLM4DSE framework, which incorporates a multimodal prediction model (MPM) that simultaneously fuses features from behavioral descriptions and control and data flow graphs. Furthermore, the framework employs a large language model (LLM) as an optimizer, accompanied by a tailored prompt engineering methodology. This methodology incorporates pragma impact analysis on QoR to guide the LLM in generating high-quality configurations (LLM4DSE). Experimental results demonstrate that our multimodal predictive model significantly outperforms state-of-the-art work ProgSG by up to 10.25×. Furthermore, in DSE tasks, the proposed LLM4DSE achieves an average performance gain of 39.90% over prior methods, validating the effectiveness of our prompting methodology. Code and models are available at this https URL.

[LG-16] Neural-Symbolic Integration with Evolvable Policies

Link: https://arxiv.org/abs/2601.04799
Authors: Marios Thoma, Vassilis Vassiliades, Loizos Michael
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Notes: 18 pages, 12 figures, related code available at this https URL

Abstract:Neural-Symbolic (NeSy) Artificial Intelligence has emerged as a promising approach for combining the learning capabilities of neural networks with the interpretable reasoning of symbolic systems. However, existing NeSy frameworks typically require either predefined symbolic policies or policies that are differentiable, limiting their applicability when domain expertise is unavailable or when policies are inherently non-differentiable. We propose a framework that addresses this limitation by enabling the concurrent learning of both non-differentiable symbolic policies and neural network weights through an evolutionary process. Our approach casts NeSy systems as organisms in a population that evolve through mutations (both symbolic rule additions and neural weight changes), with fitness-based selection guiding convergence toward hidden target policies. The framework extends the NEUROLOG architecture to make symbolic policies trainable, adapts Valiant’s Evolvability framework to the NeSy context, and employs Machine Coaching semantics for mutable symbolic representations. Neural networks are trained through abductive reasoning from the symbolic component, eliminating differentiability requirements. Through extensive experimentation, we demonstrate that NeSy systems starting with empty policies and random neural weights can successfully approximate hidden non-differentiable target policies, achieving median correct performance approaching 100%. This work represents a step toward enabling NeSy research in domains where the acquisition of symbolic knowledge from experts is challenging or infeasible.

[LG-17] Intraday spatiotemporal PV power prediction at national scale using satellite-based solar forecast models

Link: https://arxiv.org/abs/2601.04751
Authors: Luca Lanzilao, Angela Meyer
Categories: Machine Learning (cs.LG)

Abstract:We present a novel framework for spatiotemporal photovoltaic (PV) power forecasting and use it to evaluate the reliability, sharpness, and overall performance of seven intraday PV power nowcasting models. The model suite includes satellite-based deep learning and optical-flow approaches and physics-based numerical weather prediction models, covering both deterministic and probabilistic formulations. Forecasts are first validated against satellite-derived surface solar irradiance (SSI). Irradiance fields are then converted into PV power using station-specific machine learning models, enabling comparison with production data from 6434 PV stations across Switzerland. To our knowledge, this is the first study to investigate spatiotemporal PV forecasting at a national scale. We additionally provide the first visualizations of how mesoscale cloud systems shape national PV production on hourly and sub-hourly timescales. Our results show that satellite-based approaches outperform the Integrated Forecast System (IFS-ENS), particularly at short lead times. Among them, SolarSTEPS and SHADECast deliver the most accurate SSI and PV power predictions, with SHADECast providing the most reliable ensemble spread. The deterministic model IrradianceNet achieves the lowest root mean square error, while probabilistic forecasts of SolarSTEPS and SHADECast provide better-calibrated uncertainty. Forecast skill generally decreases with elevation. At a national scale, satellite-based models forecast the daily total PV generation with relative errors below 10% for 82% of the days in 2019-2020, demonstrating robustness and their potential for operational use.

[LG-18] GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models

Link: https://arxiv.org/abs/2601.04719
Authors: Maanas Taneja, Purab Shingvi
Categories: Machine Learning (cs.LG); Performance (cs.PF)

Abstract:The key-value (KV) cache in large language models presents a significant memory bottleneck during inference, growing linearly with sequence length and often exceeding the memory footprint of model weights themselves. We implement and evaluate GPU-accelerated INT8 quantization for KV cache compression, achieving 4× memory reduction with minimal accuracy degradation. We develop four CUDA kernel variants – naive, tiled, coarsened, and vectorized – and benchmark them across realistic workload sizes up to 1 billion elements. Our vectorized kernel achieves up to 1,694× speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1 even for 8K-dimensional heads. These results demonstrate that INT8 quantization provides a practical approach for reducing memory pressure in LLM inference with negligible computational overhead (6–58 ms) and minimal impact on downstream model behavior.
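
The arithmetic behind INT8 cache compression can be sketched on the CPU in a few lines. This is not the paper's CUDA kernel: it assumes symmetric quantization with one scale per row, and a 32-bit float baseline (4 bytes per value against 1 byte per INT8 code) to account for the 4× figure. Function names are hypothetical.

```python
def quantize_int8(row):
    """Symmetric per-row INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in row]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


row = [0.5, -1.0, 0.25, 0.0]  # one row of a key or value tensor
codes, scale = quantize_int8(row)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the per-element error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

The per-row scale adds a small storage overhead on top of the one-byte codes, so the realized reduction is slightly below 4× for short rows and approaches it as rows get longer.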

[LG-19] A zone-based training approach for last-mile routing using Graph Neural Networks and Pointer Networks

Link: https://arxiv.org/abs/2601.04705
Authors: Àngel Ruiz-Fas, Carlos Granell, José Francisco Ramos, Joaquín Huerta, Sergio Trilles
Categories: Machine Learning (cs.LG)
Notes: Accepted in SMF 2026. 8 pages, 3 figures

Abstract:Rapid e-commerce growth has pushed last-mile delivery networks to their limits, where small routing gains translate into lower costs, faster service, and fewer emissions. Classical heuristics struggle to adapt when travel times are highly asymmetric (e.g., one-way streets, congestion). A deep learning-based approach to the last-mile routing problem is presented to generate geographical zones composed of stop sequences to minimize last-mile delivery times. The presented approach is an encoder-decoder architecture. Each route is represented as a complete directed graph whose nodes are stops and whose edge weights are asymmetric travel times. A Graph Neural Network encoder produces node embeddings that capture the spatial relationships between stops. A Pointer Network decoder then takes the embeddings and the route’s start node to sequentially select the next stops, assigning a probability to each unvisited node as the next destination. Cells of a Discrete Global Grid System which contain route stops in the training data are obtained and clustered to generate geographical zones of similar size, into which training and inference are divided. Subsequently, a different instance of the model is trained per zone, considering only the stops of the training routes that fall within that zone. This approach is evaluated using the Los Angeles routes from the 2021 Amazon Last Mile Routing Challenge. Results from general and zone-based training are compared, showing a reduction in the average predicted route length in the zone-based training compared to the general training. The performance improvement of the zone-based approach becomes more pronounced as the number of stops per route increases.
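
The decoding step described in the abstract, assigning a probability to each unvisited stop and choosing the next one, can be sketched with a masked softmax. The scores here are a hypothetical stand-in (negative travel time from the current stop) rather than learned attention over GNN embeddings, and decoding is greedy; both are assumptions made to keep the sketch self-contained.

```python
import math


def masked_softmax(scores, visited):
    """Softmax over unvisited nodes only; visited nodes get probability 0."""
    live = [s for s, v in zip(scores, visited) if not v]
    m = max(live)
    exps = [0.0 if v else math.exp(s - m) for s, v in zip(scores, visited)]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(travel_time, start):
    """Build a route by repeatedly picking the most probable unvisited stop."""
    n = len(travel_time)
    visited = [False] * n
    visited[start] = True
    route = [start]
    while len(route) < n:
        current = route[-1]
        # Stand-in scores: a trained decoder would compute these from embeddings.
        scores = [-travel_time[current][j] for j in range(n)]
        probs = masked_softmax(scores, visited)
        nxt = max(range(n), key=lambda j: probs[j])
        visited[nxt] = True
        route.append(nxt)
    return route


# Asymmetric travel times: travel_time[i][j] differs from travel_time[j][i].
travel_time = [
    [0, 2, 9, 4],
    [6, 0, 3, 8],
    [5, 7, 0, 1],
    [3, 9, 2, 0],
]
route = greedy_decode(travel_time, start=0)
```

The mask is what guarantees that each stop appears exactly once in the output sequence, regardless of how the scores are produced.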

[LG-20] Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks? NEURIPS2025

Link: https://arxiv.org/abs/2601.04690
Authors: Mir Rayat Imtiaz Hossain, Leo Feng, Leonid Sigal, Mohamed Osama Ahmed
Categories: Machine Learning (cs.LG)
Notes: Presented in Multimodal Algorithmic Reasoning Workshop at NeurIPS 2025

Abstract:Large Language Models (LLMs) have emerged as promising recommendation systems, offering novel ways to model user preferences through generative approaches. However, many existing methods often rely solely on text semantics or incorporate collaborative signals in a limited manner, typically using only user or item embeddings. These methods struggle to handle multiple item embeddings representing user history, reverting to textual semantics and neglecting richer collaborative information. In this work, we propose a simple yet effective solution that projects user and item embeddings, learned from collaborative filtering, into the LLM token space via separate lightweight projector modules. A finetuned LLM then conditions on these projected embeddings alongside textual tokens to generate recommendations. Preliminary results show that this design effectively leverages structured user-item interaction data, improves recommendation performance over text-only LLM baselines, and offers a practical path for bridging traditional recommendation systems with modern LLMs.

[LG-21] Nightmare Dreamer: Dreaming About Unsafe States And Planning Ahead

Link: https://arxiv.org/abs/2601.04686
Authors: Oluwatosin Oseni, Shengjie Wang, Jun Zhu, Micah Corah
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Notes: RSS’25: Multi-Objective Optimization and Planning in Robotics Workshop: 5 pages, 8 figures

Abstract:Reinforcement Learning (RL) has shown remarkable success in real-world applications, particularly in robotics control. However, RL adoption remains limited due to insufficient safety guarantees. We introduce Nightmare Dreamer, a model-based Safe RL algorithm that addresses safety concerns by leveraging a learned world model to predict potential safety violations and plan actions accordingly. Nightmare Dreamer achieves nearly zero safety violations while maximizing rewards. Nightmare Dreamer outperforms model-free baselines on Safety Gymnasium tasks using only image observations, achieving nearly a 20x improvement in efficiency.

[LG-22] Learning Dynamics in RL Post-Training for Language Models

Link: https://arxiv.org/abs/2601.04670
Authors: Akiyoshi Tomihari
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning (RL) post-training is a critical stage in modern language model development, playing a key role in improving alignment and reasoning ability. However, several phenomena remain poorly understood, including the reduction in output diversity. To gain a broader understanding of RL post-training, we analyze the learning dynamics of RL post-training from a perspective that has been studied in supervised learning but remains underexplored in RL. We adopt an empirical neural tangent kernel (NTK) framework and decompose the NTK into two components to characterize how RL updates propagate across training samples. Our analysis reveals that limited variability in feature representations can cause RL updates to systematically increase model confidence, providing an explanation for the commonly observed reduction in output diversity after RL post-training. Furthermore, we show that effective learning in this regime depends on rapidly shaping the classifier, which directly affects the gradient component of the NTK. Motivated by these insights, we propose classifier-first reinforcement learning (CF-RL), a simple two-stage training strategy that prioritizes classifier updates before standard RL optimization. Experimental results validate our theoretical analysis by demonstrating increased model confidence and accelerated optimization under CF-RL. Additional analysis shows that the mechanism underlying CF-RL differs from that of linear-probing-then-fine-tuning in supervised learning. Overall, our study formalizes the learning dynamics of RL post-training and motivates further analysis and improvement.

[LG-23] Mechanism Design for Federated Learning with Non-Monotonic Network Effects

Link: https://arxiv.org/abs/2601.04648
Authors: Xiang Li, Bing Luo, Jianwei Huang, Yuan Luo
Categories: Computer Science and Game Theory (cs.GT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Notes: Journal extension of Mobihoc conference version, under review of IEEE TMC

Abstract:Mechanism design is pivotal to federated learning (FL) for maximizing social welfare by coordinating self-interested clients. Existing mechanisms, however, often overlook the network effects of client participation and the diverse model performance requirements (i.e., generalization error) across applications, leading to suboptimal incentives and social welfare, or even inapplicability in real deployments. To address this gap, we explore incentive mechanism design for FL with network effects and application-specific requirements of model performance. We develop a theoretical model to quantify the impact of network effects on heterogeneous client participation, revealing the non-monotonic nature of such effects. Based on these insights, we propose a Model Trading and Sharing (MoTS) framework, which enables clients to obtain FL models through either participation or purchase. To further address clients’ strategic behaviors, we design a Social Welfare maximization with Application-aware and Network effects (SWAN) mechanism, exploiting model customer payments for incentivization. Experimental results on a hardware prototype demonstrate that our SWAN mechanism outperforms existing FL mechanisms, improving social welfare by up to 352.42% and reducing extra incentive costs by 93.07%.

[LG-24] Density Matrix RNN (DM-RNN): A Quantum Information Theoretic Framework for Modeling Musical Context and Polyphony

Link: https://arxiv.org/abs/2601.04592
Authors: Joonwon Seo, Mariana Montiel
Categories: Machine Learning (cs.LG); Sound (cs.SD); Mathematical Physics (math-ph)
Notes: Submitted to the 10th International Conference on Mathematics and Computation in Music (MCM 2026)

Abstract:Classical Recurrent Neural Networks (RNNs) summarize musical context into a deterministic hidden state vector, imposing an information bottleneck that fails to capture the inherent ambiguity in music. We propose the Density Matrix RNN (DM-RNN), a novel theoretical architecture utilizing the Density Matrix. This allows the model to maintain a statistical ensemble of musical interpretations (a mixed state), capturing both classical probabilities and quantum coherences. We rigorously define the temporal dynamics using Quantum Channels (CPTP maps). Crucially, we detail a parameterization strategy based on the Choi-Jamiolkowski isomorphism, ensuring the learned dynamics remain physically valid (CPTP) by construction. We introduce an analytical framework using Von Neumann Entropy to quantify musical uncertainty and Quantum Mutual Information (QMI) to measure entanglement between voices. The DM-RNN provides a mathematically rigorous framework for modeling complex, ambiguous musical structures.
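
The distinction between a mixed state and a coherent superposition, and the use of Von Neumann entropy to quantify uncertainty, can be illustrated on a 2×2 density matrix. The sketch assumes a real symmetric matrix [[a, b], [b, 1 - a]] so that eigenvalues have a closed form; the two-interpretation reading of the example is a hypothetical illustration, not taken from the paper.

```python
import math


def eigenvalues_2x2(a, b):
    """Eigenvalues of the density matrix [[a, b], [b, 1 - a]].

    A valid state needs 0 <= a <= 1 and b * b <= a * (1 - a).
    """
    gap = math.sqrt((2 * a - 1) ** 2 + 4 * b * b)
    return (1 + gap) / 2, (1 - gap) / 2


def von_neumann_entropy(a, b):
    """S(rho) = -sum(lambda * ln(lambda)) over the non-zero eigenvalues."""
    return -sum(lam * math.log(lam) for lam in eigenvalues_2x2(a, b) if lam > 1e-12)


# Equal classical mixture of two interpretations: no coherence, maximal uncertainty.
mixed_entropy = von_neumann_entropy(0.5, 0.0)
# Coherent superposition with the same diagonal: a pure state, zero entropy.
pure_entropy = von_neumann_entropy(0.5, 0.5)
```

Both states assign probability 0.5 to each interpretation on the diagonal; only the off-diagonal coherence term separates them, which a plain probability vector cannot represent.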

[LG-25] GEnSHIN: Graphical Enhanced Spatio-temporal Hierarchical Inference Network for Traffic Flow Prediction

Link: https://arxiv.org/abs/2601.04550
Authors: Zhiyan Zhou, Junjie Liao, Manho Zhang, Yingyi Liao, Ziai Wang
Categories: Machine Learning (cs.LG)

Abstract:With the acceleration of urbanization, intelligent transportation systems have an increasing demand for accurate traffic flow prediction. This paper proposes a novel Graph Enhanced Spatio-temporal Hierarchical Inference Network (GEnSHIN) to handle the complex spatio-temporal dependencies in traffic flow prediction. The model integrates three innovative designs: 1) An attention-enhanced Graph Convolutional Recurrent Unit (GCRU), which strengthens the modeling capability for long-term temporal dependencies by introducing Transformer modules; 2) An asymmetric dual-embedding graph generation mechanism, which leverages the real road network and data-driven latent asymmetric topology to generate graph structures that better fit the characteristics of actual traffic flow; 3) A dynamic memory bank module, which utilizes learnable traffic pattern prototypes to provide personalized traffic pattern representations for each sensor node, and introduces a lightweight graph updater during the decoding phase to adapt to dynamic changes in road network states. Extensive experiments on the public dataset METR-LA show that GEnSHIN achieves or surpasses the performance of comparative models across multiple metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Notably, the model demonstrates excellent prediction stability during peak morning and evening traffic hours. Ablation experiments further validate the effectiveness of each core module and its contribution to the final performance.

[LG-26] Timeliness-Oriented Scheduling and Resource Allocation in Multi-Region Collaborative Perception

Link: https://arxiv.org/abs/2601.04542
Authors: Mengmeng Zhu, Yuxuan Sun, Yukuan Jia, Wei Chen, Bo Ai, Sheng Zhou
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: This work has been submitted to the IEEE for possible publication

Abstract:Collaborative perception (CP) is a critical technology in applications like autonomous driving and smart cities. It involves the sharing and fusion of information among sensors to overcome the limitations of individual perception, such as blind spots and range limitations. However, CP faces two primary challenges. First, due to the dynamic nature of the environment, the timeliness of the transmitted information is critical to perception performance. Second, with limited computational power at the sensors and constrained wireless bandwidth, the communication volume must be carefully designed to ensure feature representations are both effective and sufficient. This work studies the dynamic scheduling problem in a multi-region CP scenario, and presents a Timeliness-Aware Multi-region Prioritized (TAMP) scheduling algorithm to trade off perception accuracy and communication resource usage. Timeliness reflects the utility of information that decays as time elapses, which is manifested by the perception performance in CP tasks. We propose an empirical penalty function that maps the joint impact of Age of Information (AoI) and communication volume to perception performance. Aiming to minimize this timeliness-oriented penalty in the long term, and recognizing that scheduling decisions have a cumulative effect on subsequent system states, we propose the TAMP scheduling algorithm. TAMP is a Lyapunov-based optimization policy that decomposes the long-term average objective into a per-slot prioritization problem, balancing the scheduling worth against resource cost. We validate our algorithm in both intersection and corridor scenarios with the real-world Roadside Cooperative perception (RCooper) dataset. Extensive simulations demonstrate that TAMP outperforms the best-performing baseline, achieving an Average Precision (AP) improvement of up to 27% across various configurations.

[LG-27] Bridging Distance and Spectral Positional Encodings via Anchor-Based Diffusion Geometry Approximation

Link: https://arxiv.org/abs/2601.04517
Authors: Zimo Yan, Zheng Xie, Runfan Duan, Chang Liu, Wumei Du
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:Molecular graph learning benefits from positional signals that capture both local neighborhoods and global topology. Two widely used families are spectral encodings derived from Laplacian or diffusion operators and anchor-based distance encodings built from shortest-path information, yet their precise relationship is poorly understood. We interpret distance encodings as a low-rank surrogate of diffusion geometry and derive an explicit trilateration map that reconstructs truncated diffusion coordinates from transformed anchor distances and anchor spectral positions, with pointwise and Frobenius-gap guarantees on random regular graphs. On DrugBank molecular graphs using a shared GNP-based DDI prediction backbone, a distance-driven Nyström scheme closely recovers diffusion geometry, and both Laplacian and distance encodings substantially outperform a no-encoding baseline.

[LG-28] Multiagent Reinforcement Learning with Neighbor Action Estimation

Link: https://arxiv.org/abs/2601.04511
Authors: Zhenglong Luo, Zhiyong Chen, Aoxiang Liu
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Multiagent reinforcement learning, as a prominent intelligent paradigm, enables collaborative decision-making within complex systems. However, existing approaches often rely on explicit action exchange between agents to evaluate action value functions, which is frequently impractical in real-world engineering environments due to communication constraints, latency, energy consumption, and reliability requirements. From an artificial intelligence perspective, this paper proposes an enhanced multiagent reinforcement learning framework that employs action estimation neural networks to infer agent behaviors. By integrating a lightweight action estimation module, each agent infers neighboring agents’ behaviors using only locally observable information, enabling collaborative policy learning without explicit action sharing. This approach is fully compatible with standard TD3 algorithms and scalable to larger multiagent systems. At the engineering application level, this framework has been implemented and validated in dual-arm robotic manipulation tasks: two robotic arms collaboratively lift objects. Experimental results demonstrate that this approach significantly enhances the robustness and deployment feasibility of real-world robotic systems while reducing dependence on information infrastructure. Overall, this research advances the development of decentralized multiagent artificial intelligence systems while enabling AI to operate effectively in dynamic, information-constrained real-world environments.

[LG-29] When Models Manipulate Manifolds: The Geometry of a Counting Task

Link: https://arxiv.org/abs/2601.04480
Authors: Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, Joshua Batson
Categories: Machine Learning (cs.LG)

Abstract:Language models can perceive visual properties of text despite receiving only sequences of tokens. We mechanistically investigate how Claude 3.5 Haiku accomplishes one such task: linebreaking in fixed-width text. We find that character counts are represented on low-dimensional curved manifolds discretized by sparse feature families, analogous to biological place cells. Accurate predictions emerge from a sequence of geometric transformations: token lengths are accumulated into character count manifolds, attention heads twist these manifolds to estimate distance to the line boundary, and the decision to break the line is enabled by arranging estimates orthogonally to create a linear decision boundary. We validate our findings through causal interventions and discover visual illusions: character sequences that hijack the counting mechanism. Our work demonstrates the rich sensory processing of early layers, the intricacy of attention algorithms, and the importance of combining feature-based and geometric views of interpretability.

[LG-30] Meta-probabilistic Modeling

Link: https://arxiv.org/abs/2601.04462
Authors: Kevin Zhang, Yixin Wang
Categories: Machine Learning (cs.LG)

Abstract:While probabilistic graphical models can discover latent structure in data, their effectiveness hinges on choosing well-specified models. Identifying such models is challenging in practice, often requiring iterative checking and revision through trial and error. To this end, we propose meta-probabilistic modeling (MPM), a meta-learning algorithm that learns generative model structure directly from multiple related datasets. MPM uses a hierarchical architecture where global model specifications are shared across datasets while local parameters remain dataset-specific. For learning and inference, we propose a tractable VAE-inspired surrogate objective, and optimize it through bi-level optimization: local variables are updated analytically via coordinate ascent, while global parameters are trained with gradient-based methods. We evaluate MPM on object-centric image modeling and sequential text modeling, demonstrating that it adapts generative models to data while recovering meaningful latent representations.

[LG-31] Using Large Language Models to Detect Socially Shared Regulation of Collaborative Learning

Link: https://arxiv.org/abs/2601.04458
Authors: Jiayi Zhang, Conrad Borchers, Clayton Cohn, Namrata Srivastava, Caitlin Snyder, Siyuan Guo, Ashwin T S, Naveeduddin Mohammed, Haley Noh, Gautam Biswas
Categories: Machine Learning (cs.LG)
Comments: Short research paper accepted at Learning Analytics and Knowledge (LAK '26)

Abstract:The field of learning analytics has made notable strides in automating the detection of complex learning processes in multimodal data. However, most advancements have focused on individualized problem-solving instead of collaborative, open-ended problem-solving, which may offer both affordances (richer data) and challenges (low cohesion) to behavioral prediction. Here, we extend predictive models to automatically detect socially shared regulation of learning (SSRL) behaviors in collaborative computational modeling environments using embedding-based approaches. We leverage large language models (LLMs) as summarization tools to generate task-aware representations of student dialogue aligned with system logs. These summaries, combined with text-only embeddings, context-enriched embeddings, and log-derived features, were used to train predictive models. Results show that text-only embeddings often achieve stronger performance in detecting SSRL behaviors related to enactment or group dynamics (e.g., off-task behavior or requesting assistance). In contrast, contextual and multimodal features provide complementary benefits for constructs such as planning and reflection. Overall, our findings highlight the promise of embedding-based models for extending learning analytics by enabling scalable detection of SSRL behaviors, ultimately supporting real-time feedback and adaptive scaffolding in collaborative learning environments that teachers value.

[LG-32] Explainable Admission-Level Predictive Modeling for Prolonged Hospital Stay in Elderly Populations: Challenges in Low- and Middle-Income Countries

Link: https://arxiv.org/abs/2601.04449
Authors: Daniel Sierra-Botero, Ana Molina-Taborda, Leonardo Espinosa-Leal, Alexander Karpenko, Alejandro Hernandez, Olga Lopez-Acevedo
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 6 figures

Abstract:Prolonged length of stay (pLoS) is a significant factor associated with the risk of adverse in-hospital events. We develop and explain a predictive model for pLoS using admission-level patient and hospital administrative data. The approach includes a feature selection method that selects non-correlated features with the highest information value. The method uses the features' weights of evidence to select a representative within cliques from graph theory. The prognosis study analyzed the records from 120,354 hospital admissions at the Hospital Alma Mater de Antioquia between January 2017 and March 2022. After a cleaning process, the dataset was split into training (67%), test (22%), and validation (11%) cohorts. A logistic regression model was trained to predict the pLoS in two classes: less than or greater than 7 days. The performance of the model was evaluated using accuracy, precision, sensitivity, specificity, and AUC-ROC metrics. The feature selection method returns nine interpretable variables, enhancing the model's transparency. In the validation cohort, the pLoS model achieved a specificity of 0.83 (95% CI, 0.82-0.84), sensitivity of 0.64 (95% CI, 0.62-0.65), accuracy of 0.76 (95% CI, 0.76-0.77), precision of 0.67 (95% CI, 0.66-0.69), and AUC-ROC of 0.82 (95% CI, 0.81-0.83). The model exhibits strong predictive performance and offers insights into the factors that influence prolonged hospital stays. This makes it a valuable tool for hospital management and for developing future intervention studies aimed at reducing pLoS.
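
The abstract relies on weight of evidence (WoE) and information value (IV) for feature selection. The sketch below shows the textbook definitions for one categorical feature. The `woe_iv` helper, the sign convention (non-event over event), and the counts are assumptions for illustration; the paper's clique-based selection step is not shown.

```python
import math


def woe_iv(bins):
    """Compute per-bin WoE and the feature's IV.

    bins: list of (n_event, n_non_event) counts, one tuple per category.
    Assumes every count is non-zero (no smoothing is applied).
    """
    total_event = sum(e for e, _ in bins)
    total_non = sum(n for _, n in bins)
    woes, iv = [], 0.0
    for e, n in bins:
        p_event = e / total_event   # share of all events falling in this bin
        p_non = n / total_non       # share of all non-events in this bin
        woe = math.log(p_non / p_event)
        woes.append(woe)
        iv += (p_non - p_event) * woe
    return woes, iv


# Hypothetical counts: (prolonged stays, non-prolonged stays) per category.
woes, iv = woe_iv([(10, 90), (30, 70), (60, 40)])
print(woes, iv)
```

Each term of the IV sum is non-negative, so a larger IV indicates a feature whose categories separate the two classes more strongly.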

[LG-33] When Predictions Shape Reality: A Socio-Technical Synthesis of Performative Predictions in Machine Learning

Link: https://arxiv.org/abs/2601.04447
Authors: Gal Fybish, Teo Susnjak
Categories: Machine Learning (cs.LG)

Abstract:Machine learning models are increasingly used in high-stakes domains where their predictions can actively shape the environments in which they operate, a phenomenon known as performative prediction. This dynamic, in which the deployment of the model influences the very outcome it seeks to predict, can lead to unintended consequences, including feedback loops, performance issues, and significant societal risks. While the literature in the field has grown rapidly in recent years, a socio-technical synthesis that systematises the concepts underlying the phenomenon and provides practical guidance has been lacking. This Systematisation of Knowledge (SoK) addresses this gap by providing a comprehensive review of the literature on performative predictions. We provide an overview of the primary mechanisms through which performativity manifests, present a typology of associated risks, and survey the proposed solutions offered in the literature. Our primary contribution is the "Performative Strength vs. Impact Matrix" assessment framework. This practical tool is designed to help practitioners assess the potential influence and severity of performativity on their deployed predictive models and select the appropriate level of algorithmic or human intervention.

[LG-34] Large Language Models for Detecting Cyberattacks on Smart Grid Protective Relays

Link: https://arxiv.org/abs/2601.04443
Authors: Ahmad Mohammad Saber, Saeed Jafari, Zhengmao Ouyang, Paul Budnarain, Amr Youssef, Deepa Kundur
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:This paper presents a large language model (LLM)-based framework for detecting cyberattacks on transformer current differential relays (TCDRs), which, if undetected, may trigger false tripping of critical transformers. The proposed approach adapts and fine-tunes compact LLMs such as DistilBERT to distinguish cyberattacks from actual faults using textualized multidimensional TCDR current measurements recorded before and after tripping. Our results demonstrate that DistilBERT detects 97.6% of cyberattacks without compromising TCDR dependability and achieves inference latency below 6 ms on a commercial workstation. Additional evaluations confirm the framework’s robustness under combined time-synchronization and false-data-injection attacks, resilience to measurement noise, and stability across prompt formulation variants. Furthermore, GPT-2 and DistilBERT+LoRA achieve comparable performance, highlighting the potential of LLMs for enhancing smart grid cybersecurity. We provide the full dataset used in this study for reproducibility.

[LG-35] Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization

Link: https://arxiv.org/abs/2601.04441
Authors: Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8×.

[LG-36] Learning Multinomial Logits in O(n log n) time

Link: https://arxiv.org/abs/2601.04423
Authors: Flavio Chierichetti, Mirko Giacchini, Ravi Kumar, Silvio Lattanzi, Alessandro Panconesi, Erasmo Tani, Andrew Tomkins
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:A Multinomial Logit (MNL) model is composed of a finite universe of items $[n]=\{1,\dots,n\}$, each assigned a positive weight. A query specifies an admissible subset, called a slate, and the model chooses one item from that slate with probability proportional to its weight. This query model is also known as the Plackett-Luce model or conditional sampling oracle in the literature. Although MNLs have been studied extensively, a basic computational question remains open: given query access to slates, how efficiently can we learn weights so that, for every slate, the induced choice distribution is within total variation distance $\varepsilon$ of the ground truth? This question is central to MNL learning and has direct implications for modern recommender system interfaces. We provide two algorithms for this task, one with adaptive queries and one with non-adaptive queries. Each algorithm outputs an MNL $M'$ that induces, for each slate $S$, a distribution $M'_S$ on $S$ that is within $\varepsilon$ total variation distance of the true distribution. Our adaptive algorithm makes $O\left(\frac{n}{\varepsilon^3}\log n\right)$ queries, while our non-adaptive algorithm makes $O\left(\frac{n^2}{\varepsilon^3}\log n \log\frac{n}{\varepsilon}\right)$ queries. Both algorithms query only slates of size two and run in time proportional to their query complexity. We complement these upper bounds with lower bounds of $\Omega\left(\frac{n}{\varepsilon^2}\log n\right)$ for adaptive queries and $\Omega\left(\frac{n^2}{\varepsilon^2}\log n\right)$ for non-adaptive queries, thus proving that our adaptive algorithm is optimal in its dependence on the support size $n$, while the non-adaptive one is tight within a $\log n$ factor.
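
The query model described in this abstract (pick one item from a slate with probability proportional to its weight) and the total variation criterion can be sketched in a few lines. The helper names and the example weights are illustrative assumptions; this is not either of the paper's learning algorithms.

```python
import random


def choice_probs(weights, slate):
    """MNL choice distribution on a slate: probability proportional to weight."""
    total = sum(weights[i] for i in slate)
    return {i: weights[i] / total for i in slate}


def query(weights, slate, rng=random):
    """Simulate one query: the model returns a single item from the slate."""
    items = list(slate)
    return rng.choices(items, weights=[weights[i] for i in items], k=1)[0]


def tv_distance(p, q):
    """Total variation distance between two distributions on the same slate."""
    return 0.5 * sum(abs(p[i] - q[i]) for i in p)


true_w = {1: 1.0, 2: 3.0, 3: 6.0}      # hypothetical ground-truth weights
learned_w = {1: 1.0, 2: 3.0, 3: 5.0}   # hypothetical learned weights
slate = (2, 3)                          # a slate of size two
p = choice_probs(true_w, slate)
q = choice_probs(learned_w, slate)
print(tv_distance(p, q), query(true_w, slate))
```

A learned model meets the abstract's guarantee when `tv_distance` stays below $\varepsilon$ for every slate, not just the one checked here.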

[LG-37] Distribution-Guided and Constrained Quantum Machine Unlearning

Link: https://arxiv.org/abs/2601.04413
Authors: Nausherwan Malik, Zubair Khalid, Muhammad Faryad
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 8 pages

Abstract:Machine unlearning aims to remove the influence of specific training data from a learned model without full retraining. While recent work has begun to explore unlearning in quantum machine learning, existing approaches largely rely on fixed, uniform target distributions and do not explicitly control the trade-off between forgetting and retained model behaviour. In this work, we propose a distribution-guided framework for class-level quantum machine unlearning that treats unlearning as a constrained optimization problem. Our method introduces a tunable target distribution derived from model similarity statistics, decoupling the suppression of forgotten-class confidence from assumptions about redistribution among retained classes. We further incorporate an anchor-based preservation constraint that explicitly maintains predictive behaviour on selected retained data, yielding a controlled optimization trajectory that limits deviation from the original model. We evaluate the approach on variational quantum classifiers trained on the Iris and Covertype datasets. Results demonstrate sharp suppression of forgotten-class confidence, minimal degradation of retained-class performance, and closer alignment with the gold retrained model baselines compared to uniform-target unlearning. These findings highlight the importance of target design and constraint-based formulations for reliable and interpretable quantum machine unlearning.

[LG-38] Transformer-based Multi-agent Reinforcement Learning for Separation Assurance in Structured and Unstructured Airspaces

Link: https://arxiv.org/abs/2601.04401
Authors: Arsyi Aziz, Peng Wei
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 9 pages, 4 figures, 4 tables. Presented at SESAR Innovation Days 2025

Abstract:Conventional optimization-based metering depends on strict adherence to precomputed schedules, which limits the flexibility required for the stochastic operations of Advanced Air Mobility (AAM). In contrast, multi-agent reinforcement learning (MARL) offers a decentralized, adaptive framework that can better handle uncertainty, required for safe aircraft separation assurance. Despite this advantage, current MARL approaches often overfit to specific airspace structures, limiting their adaptability to new configurations. To improve generalization, we recast the MARL problem in a relative polar state space and train a transformer encoder model across diverse traffic patterns and intersection angles. The learned model provides speed advisories to resolve conflicts while maintaining aircraft near their desired cruising speeds. In our experiments, we evaluated encoder depths of 1, 2, and 3 layers in both structured and unstructured airspaces, and found that a single encoder configuration outperformed deeper variants, yielding near-zero near mid-air collision rates and shorter loss-of-separation infringements than the deeper configurations. Additionally, we showed that the same configuration outperforms a baseline model designed purely with attention. Together, our results suggest that the newly formulated state representation, novel design of neural network architecture, and proposed training strategy provide an adaptable and scalable decentralized solution for aircraft separation assurance in both structured and unstructured airspaces.
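
As a rough illustration of a relative polar state, the sketch below converts another aircraft's position into range and bearing relative to the ownship's heading. The function name, the 2D setup, and the angle conventions (radians, counter-clockwise from the x-axis) are assumptions; the paper's actual state features are not specified in the abstract.

```python
import math


def relative_polar(own_xy, own_heading, other_xy):
    """Return (range, bearing) of another aircraft relative to the ownship.

    Bearing is measured from the ownship's heading and wrapped to [-pi, pi).
    """
    dx = other_xy[0] - own_xy[0]
    dy = other_xy[1] - own_xy[1]
    distance = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - own_heading
    bearing = (bearing + math.pi) % (2 * math.pi) - math.pi
    return distance, bearing


# Ownship at the origin heading along +x; intruder directly to its left.
print(relative_polar((0.0, 0.0), 0.0, (0.0, 1.0)))
```

Because the output depends only on relative geometry, the same values result wherever the encounter is placed or however it is rotated, which is one plausible reason such a representation could transfer across airspace layouts.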

[LG-39] Machine Learning Model for Sparse PCM Completion

Link: https://arxiv.org/abs/2601.04366
Authors: Selcuk Koyuncu, Ronak Nouri, Stephen Providence
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:In this paper, we propose a machine learning model for sparse pairwise comparison matrices (PCMs), combining classical PCM approaches with graph-based learning techniques. Numerical results are provided to demonstrate the effectiveness and scalability of the proposed method.

[LG-40] Survival Dynamics of Neural and Programmatic Policies in Evolutionary Reinforcement Learning

Link: https://arxiv.org/abs/2601.04365
Authors: Anton Roupassov-Ruiz, Yiyang Zuo
Categories: Machine Learning (cs.LG)

Abstract:In evolutionary reinforcement learning tasks (ERL), agent policies are often encoded as small artificial neural networks (NERL). Such representations lack explicit modular structure, limiting behavioral interpretation. We investigate whether programmatic policies (PERL), implemented as soft, differentiable decision lists (SDDL), can match the performance of NERL. To support reproducible evaluation, we provide the first fully specified and open-source reimplementation of the classic 1992 Artificial Life (ALife) ERL testbed. We conduct a rigorous survival analysis across 4000 independent trials utilizing Kaplan-Meier curves and Restricted Mean Survival Time (RMST) metrics absent in the original study. We find a statistically significant difference in survival probability between PERL and NERL. PERL agents survive on average 201.69 steps longer than NERL agents. Moreover, SDDL agents using learning alone (no evolution) survive on average 73.67 steps longer than neural agents using both learning and evaluation. These results demonstrate that programmatic policies can exceed the survival performance of neural policies in ALife.
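
The survival analysis mentioned above rests on two standard tools, the Kaplan-Meier estimator and Restricted Mean Survival Time (RMST). The dependency-free sketch below shows both on made-up data. The function names and toy durations are assumptions, and this is not the authors' evaluation code.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    observed[i] is True if agent i died at durations[i], False if censored.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    surv, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]   # True counts as 1, censored as 0
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def rmst(curve, tau):
    """Area under the Kaplan-Meier step function from 0 to tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)


# Toy data: survival steps for five hypothetical agents, two of them censored.
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
print(curve, rmst(curve, tau=8))
```

Comparing the RMST of two groups at the same `tau` gives an average-steps-survived difference of the kind the abstract reports.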

[LG-41] Phasor Agents: Oscillatory Graphs with Three-Factor Plasticity and Sleep-Staged Learning

Link: https://arxiv.org/abs/2601.04362
Authors: Rodja Trappe
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: 22 pages, 14 figures

Abstract:Phasor Agents are dynamical systems whose internal state is a Phasor Graph: a weighted graph of coupled Stuart-Landau oscillators. A Stuart-Landau oscillator is a minimal stable “rhythm generator” (the normal form near a Hopf bifurcation); each oscillator is treated as an abstract computational unit (inspired by, but not claiming to model, biological oscillatory populations). In this interpretation, oscillator phase tracks relative timing (coherence), while amplitude tracks local gain or activity. Relative phase structure serves as a representational medium; coupling weights are learned via three-factor local plasticity - eligibility traces gated by sparse global modulators and oscillation-timed write windows - without backpropagation. A central challenge in oscillatory substrates is stability: online weight updates can drive the network into unwanted regimes (e.g., global synchrony), collapsing representational diversity. We therefore separate wake tagging from offline consolidation, inspired by synaptic tagging-and-capture and sleep-stage dynamics: deep-sleep-like gated capture commits tagged changes safely, while REM-like replay reconstructs and perturbs experience for planning. A staged experiment suite validates each mechanism with ablations and falsifiers: eligibility traces preserve credit under delayed modulation; compression-progress signals pass timestamp-shuffle controls; phase-coherent retrieval reaches 4x diffusive baselines under noise; wake/sleep separation expands stable learning by 67 percent under matched weight-norm budgets; REM replay improves maze success rate by +45.5 percentage points; and a Tolman-style latent-learning signature - immediate competence and detour advantage after unrewarded exploration, consistent with an internal model - emerges from replay (Tolman, 1948). The codebase and all artifacts are open-source. 
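
For readers unfamiliar with the Stuart-Landau oscillator, the sketch below integrates the standard normal form dz/dt = (mu + i*omega) z - |z|^2 z for a single uncoupled unit using forward Euler. The parameter values, step size, and function name are assumptions chosen for illustration; the paper's coupled graph, plasticity rule, and sleep stages are not modeled.

```python
def simulate_stuart_landau(mu=1.0, omega=1.0, z0=0.1 + 0.0j,
                           dt=0.001, steps=20000):
    """Forward-Euler integration of dz/dt = (mu + i*omega) z - |z|^2 z."""
    z = z0
    for _ in range(steps):
        z += dt * ((mu + 1j * omega) * z - abs(z) ** 2 * z)
    return z


# For mu > 0 the amplitude settles near sqrt(mu) regardless of the start point.
small_start = abs(simulate_stuart_landau(z0=0.1 + 0.0j))
large_start = abs(simulate_stuart_landau(z0=2.0 + 0.0j))
print(small_start, large_start)
```

The amplitude converging to the same value from both starting points is the "stable rhythm generator" behavior the abstract refers to; the phase keeps rotating at roughly `omega`.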

[LG-42] Transformer-Based Multi-Modal Temporal Embeddings for Explainable Metabolic Phenotyping in Type 1 Diabetes

Link: https://arxiv.org/abs/2601.04299
Authors: Pir Bakhsh Khokhar, Carmine Gravino, Fabio Palomba, Sule Yildrim Yayilgan, Sarang Shaikh
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Type 1 diabetes (T1D) is a highly metabolically heterogeneous disease that cannot be adequately characterized by conventional biomarkers such as glycated hemoglobin (HbA1c). This study proposes an explainable deep learning framework that integrates continuous glucose monitoring (CGM) data with laboratory profiles to learn multimodal temporal embeddings of individual metabolic status. Temporal dependencies across modalities are modeled using a transformer encoder, while latent metabolic phenotypes are identified via Gaussian mixture modeling. Model interpretability is achieved through transformer attention visualization and SHAP-based feature attribution. Five latent metabolic phenotypes, ranging from metabolic stability to elevated cardiometabolic risk, were identified among 577 individuals with T1D. These phenotypes exhibit distinct biochemical profiles, including differences in glycemic control, lipid metabolism, renal markers, and thyrotropin (TSH) levels. Attention analysis highlights glucose variability as a dominant temporal factor, while SHAP analysis identifies HbA1c, triglycerides, cholesterol, creatinine, and TSH as key contributors to phenotype differentiation. Phenotype membership shows statistically significant, albeit modest, associations with hypertension, myocardial infarction, and heart failure. Overall, this explainable multimodal temporal embedding framework reveals physiologically coherent metabolic subgroups in T1D and supports risk stratification beyond single biomarkers.

[LG-43] Correct and Weight: A Simple Yet Effective Loss for Implicit Feedback Recommendation

Link: https://arxiv.org/abs/2601.04291
Authors: Minglei Yin, Chuanbo Hu, Bin Liu, Neil Zhenqiang Gong, Yanfang (Fanny) Ye, Xin Li
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2508.05673 by other authors

Abstract:Learning from implicit feedback has become the standard paradigm for modern recommender systems. However, this setting is fraught with the persistent challenge of false negatives, where unobserved user-item interactions are not necessarily indicative of negative preference. To address this issue, this paper introduces a novel and principled loss function, named Corrected and Weighted (CW) loss, that systematically corrects for the impact of false negatives within the training objective. Our approach integrates two key techniques. First, inspired by Positive-Unlabeled learning, we debias the negative sampling process by re-calibrating the assumed negative distribution. By theoretically approximating the true negative distribution ($p^-$) using the observable general data distribution ($p$) and the positive interaction distribution ($p^+$), our method provides a more accurate estimate of the likelihood that a sampled unlabeled item is truly negative. Second, we introduce a dynamic re-weighting mechanism that modulates the importance of each negative instance based on the model’s current prediction. This scheme encourages the model to enforce a larger ranking margin between positive items and confidently predicted (i.e., easy) negative items, while simultaneously down-weighting the penalty on uncertain negatives that have a higher probability of being false negatives. A key advantage of our approach is its elegance and efficiency; it requires no complex modifications to the data sampling process or significant computational overhead, making it readily applicable to a wide array of existing recommendation models. Extensive experiments conducted on four large-scale, sparse benchmark datasets demonstrate the superiority of our proposed loss. The results show that our method consistently and significantly outperforms a suite of state-of-the-art loss functions across multiple ranking-oriented metrics.
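
One common way to relate the general data distribution $p$, the positive interaction distribution $p^+$, and the true negative distribution $p^-$ in Positive-Unlabeled learning is the mixture decomposition below. The class prior $\tau^{+}$ (the fraction of unlabeled items that are actually positive) is an assumed quantity not mentioned in the abstract, and the paper's exact estimator may differ from this sketch.

```latex
p(x) = \tau^{+}\, p^{+}(x) + (1-\tau^{+})\, p^{-}(x)
\quad\Longrightarrow\quad
p^{-}(x) = \frac{p(x) - \tau^{+}\, p^{+}(x)}{1-\tau^{+}}
```

Under this reading, sampling "negatives" from $p$ without the correction over-counts items that are really positives, which is the false-negative bias the abstract describes.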

[LG-44] Human-in-the-Loop Testing of AI Agents for Air Traffic Control with a Regulated Assessment Framework

Link: https://arxiv.org/abs/2601.04288
Authors: Ben Carvell, Marc Thomas, Andrew Pace, Christopher Dorney, George De Ath, Richard Everson, Nick Pepper, Adam Keane, Samuel Tomlinson, Richard Cannon
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:We present a rigorous, human-in-the-loop evaluation framework for assessing the performance of AI agents on the task of Air Traffic Control, grounded in a regulator-certified simulator-based curriculum used for training and testing real-world trainee controllers. By leveraging legally regulated assessments and involving expert human instructors in the evaluation process, our framework enables a more authentic and domain-accurate measurement of AI performance. This work addresses a critical gap in the existing literature: the frequent misalignment between academic representations of Air Traffic Control and the complexities of the actual operational environment. It also lays the foundations for effective future human-machine teaming paradigms by aligning machine performance with human assessment targets.

[LG-45] Enhancing Robustness of Asynchronous EEG-Based Movement Prediction using Classifier Ensembles

Link: https://arxiv.org/abs/2601.04286
Authors: Niklas Kueper, Kartik Chari, Elsa Andrea Kirchner
Categories: Machine Learning (cs.LG)

Abstract:Objective: Stroke is one of the leading causes of disabilities. One promising approach is to extend the rehabilitation with self-initiated robot-assisted movement therapy. To enable this, it is required to detect the patient’s intention to move to trigger the assistance of a robotic device. This intention to move can be detected from human surface electroencephalography (EEG) signals; however, it is particularly challenging to decode when classifications are performed online and asynchronously. In this work, the effectiveness of classifier ensembles and a sliding-window postprocessing technique was investigated to enhance the robustness of such asynchronous classification. Approach: To investigate the effectiveness of classifier ensembles and a sliding-window postprocessing, two EEG datasets with 14 healthy subjects who performed self-initiated arm movements were analyzed. Offline and pseudo-online evaluations were conducted to compare ensemble combinations of the support vector machine (SVM), multilayer perceptron (MLP), and EEGNet classification models. Results: The results of the pseudo-online evaluation show that the two model ensembles significantly outperformed the best single model for the optimal number of postprocessing windows. In particular, for single models, an increased number of postprocessing windows significantly improved classification performances. Interestingly, we found no significant improvements between performances of the best single model and classifier ensembles in the offline evaluation. Significance: We demonstrated that classifier ensembles and appropriate postprocessing methods effectively enhance the asynchronous detection of movement intentions from EEG signals. In particular, the classifier ensemble approach yields greater improvements in online classification than in offline classification, and reduces false detections, i.e., early false positives.
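
One plausible reading of "classifier ensemble plus sliding-window postprocessing" is sketched below: per-window probabilities from several models are averaged, and a detection fires only after `k` consecutive positive windows. The averaging rule, the consecutive-window rule, the threshold, and all numbers are assumptions for illustration; the abstract does not specify the authors' exact procedure.

```python
def ensemble_probability(probabilities):
    """Fuse the models' movement-intention probabilities by simple averaging."""
    return sum(probabilities) / len(probabilities)


def detect_onsets(window_probs, k, threshold=0.5):
    """Return window indices where the k most recent windows are all positive.

    Fires once per run of positive windows, at the k-th window of the run.
    """
    detections, streak = [], 0
    for idx, p in enumerate(window_probs):
        streak = streak + 1 if p > threshold else 0
        if streak == k:
            detections.append(idx)
    return detections


# Hypothetical per-window outputs of three classifiers on one trial.
svm    = [0.2, 0.7, 0.3, 0.6, 0.8, 0.9]
mlp    = [0.1, 0.6, 0.2, 0.7, 0.7, 0.8]
eegnet = [0.3, 0.8, 0.4, 0.5, 0.9, 0.7]
fused = [ensemble_probability(ps) for ps in zip(svm, mlp, eegnet)]

print(detect_onsets(fused, k=1), detect_onsets(fused, k=3))
```

With `k=1` the isolated spike at window 1 triggers an early detection, while `k=3` suppresses it at the cost of a later trigger, which mirrors the robustness versus latency trade-off behind choosing the number of postprocessing windows.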

[LG-46] LEGATO: Good Identity Unlearning Is Continuous

Link: https://arxiv.org/abs/2601.04282
Authors: Qiang Chen, Chun-Wun Cheng, Xiu Su, Hongyan Xu, Xi Lin, Shan You, Angelica I. Aviles-Rivero, Yi Chen
Categories: Machine Learning (cs.LG)

Abstract:Machine unlearning plays a crucial role in enabling generative models trained on large datasets to remove sensitive, private, or copyright-protected data. However, existing machine unlearning methods face three challenges in learning to forget identities in generative models: 1) inefficiency, where identity erasure requires fine-tuning all the model’s parameters; 2) limited controllability, where forgetting intensity cannot be controlled and explainability is lacking; 3) catastrophic collapse, where the model’s retention capability undergoes drastic degradation as forgetting progresses. Forgetting has typically been handled through discrete and unstable updates, often requiring full-model fine-tuning and leading to catastrophic collapse. In this work, we argue that identity forgetting should be modeled as a continuous trajectory, and introduce LEGATO - Learn to ForgEt Identity in GenerAtive Models via Trajectory-consistent Neural Ordinary Differential Equations. LEGATO augments pre-trained generators with fine-tunable lightweight Neural ODE adapters, enabling smooth, controllable forgetting while keeping the original model weights frozen. This formulation allows forgetting intensity to be precisely modulated via ODE step size, offering interpretability and robustness. To further ensure stability, we introduce trajectory consistency constraints that explicitly prevent catastrophic collapse during unlearning. Extensive experiments across in-domain and out-of-domain identity unlearning benchmarks show that LEGATO achieves state-of-the-art forgetting performance, avoids catastrophic collapse and reduces fine-tuned parameters.

[LG-47] Generation of synthetic delay time series for air transport applications

Link: https://arxiv.org/abs/2601.04279
Authors: Pau Esteve,Massimiliano Zanin
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 13 figures

Abstract:The generation of synthetic data is receiving increasing attention from the scientific community, thanks to its ability to solve problems like data scarcity and privacy, and is starting to find applications in air transport. We here tackle the problem of generating synthetic, yet realistic, time series of delays at airports, starting from large collections of operations in Europe and the US. We specifically compare three models, two of them based on state of the art Deep Learning algorithms, and one simplified Genetic Algorithm approach. We show how the latter can generate time series that are almost indistinguishable from real ones, while maintaining a high variability. We further validate the resulting time series in a problem of detecting delay propagations between airports. We finally make the synthetic data available to the scientific community.

[LG-48] Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs

Link: https://arxiv.org/abs/2601.04277
Authors: Beier Luo,Cheng Wang,Hongxin Wei,Sharon Li,Xuefeng Du
Categories: Machine Learning (cs.LG)

Abstract:Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Recent unsupervised post-hoc methods for post-trained LMs (PoLMs) mitigate this by aligning PoLM confidence to that of well-calibrated pre-trained counterparts. However, framing calibration as static output-distribution matching overlooks the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise from two regimes: (i) confidence drift, where final confidence inflates despite largely consistent intermediate decision processes, and (ii) process drift, where intermediate inference pathways diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment to correct confidence drift via final-distribution matching, and introduces process alignment to address process drift by locating the layer where trajectories diverge and realigning the stability of subsequent inference. This dual strategy learns a single temperature parameter that corrects both drift types without sacrificing post-training performance gains. Experiments show consistent improvements over baselines, reducing calibration errors and approaching a supervised oracle.
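To make the "single temperature parameter" concrete, here is a minimal sketch of plain temperature scaling. It is not Dual-Align: the fitting rule below (a grid search that matches the mean top-class confidence to a reference value, for example one taken from a pre-trained model) is an assumption chosen for brevity, and `fit_temperature`, `reference_conf`, and `candidates` are hypothetical names.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_temperature(post_logits, reference_conf, candidates):
    """Pick the candidate temperature whose mean top-class probability
    is closest to `reference_conf` (assumed fitting rule)."""
    def mean_conf(t):
        confs = [max(softmax_with_temperature(l, t)) for l in post_logits]
        return sum(confs) / len(confs)
    return min(candidates, key=lambda t: abs(mean_conf(t) - reference_conf))


if __name__ == "__main__":
    logits = [[4.0, 0.0], [3.0, 1.0]]
    t = fit_temperature(logits, reference_conf=0.8, candidates=[0.5, 1.0, 2.0, 4.0])
    print(t, [softmax_with_temperature(l, t) for l in logits])
```

Because dividing all logits by the same positive temperature never changes the arg-max, this kind of correction alters confidence without changing predictions, which is consistent with the abstract's claim of not sacrificing post-training performance gains.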

[LG-49] Predictable Gradient Manifolds in Deep Learning: Temporal Path-Length and Intrinsic Rank as a Complexity Regime

Link: https://arxiv.org/abs/2601.04270
Authors: Anherutowa Calvo
Categories: Machine Learning (cs.LG)
Comments: 12 Pages. Preprint

Abstract:Deep learning optimization exhibits structure that is not captured by worst-case gradient bounds. Empirically, gradients along training trajectories are often temporally predictable and evolve within a low-dimensional subspace. In this work we formalize this observation through a measurable framework for predictable gradient manifolds. We introduce two computable quantities: a prediction-based path length that measures how well gradients can be forecast from past information, and a predictable rank that quantifies the intrinsic temporal dimension of gradient increments. We show how classical online and nonconvex optimization guarantees can be restated so that convergence and regret depend explicitly on these quantities, rather than on worst-case variation. Across convolutional networks, vision transformers, language models, and synthetic control tasks, we find that gradient trajectories are locally predictable and exhibit strong low-rank structure over time. These properties are stable across architectures and optimizers, and can be diagnosed directly from logged gradients using lightweight random projections. Our results provide a unifying lens for understanding optimization dynamics in modern deep learning, reframing standard training as operating in a low-complexity temporal regime. This perspective suggests new directions for adaptive optimizers, rank-aware tracking, and prediction-based algorithm design grounded in measurable properties of real training runs.

[LG-50] Making Tunable Parameters State-Dependent in Weather and Climate Models with Reinforcement Learning

Link: https://arxiv.org/abs/2601.04268
Authors: Pritthijit Nath,Sebastian Schemm,Henry Moss,Peter Haynes,Emily Shuckburgh,Mark J. Webb
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 66 pages, 22 figures

Abstract:Weather and climate models rely on parametrisations to represent unresolved sub-grid processes. Traditional schemes rely on fixed coefficients that are weakly constrained and tuned offline, contributing to persistent biases that limit their ability to adapt to the underlying physics. This study presents a framework that learns components of parametrisation schemes online as a function of the evolving model state using reinforcement learning (RL) and evaluates the resulting RL-driven parameter updates across a hierarchy of idealised testbeds spanning a simple climate bias correction (SCBC), a radiative-convective equilibrium (RCE), and a zonal mean energy balance model (EBM) with both single-agent and federated multi-agent settings. Across nine RL algorithms, Truncated Quantile Critics (TQC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3) achieved the highest skill and the most stable convergence across configurations, with performance assessed against a static baseline using area-weighted RMSE, temperature profile and pressure-level diagnostics. For the EBM, single-agent RL outperformed static parameter tuning with the strongest gains in tropical and mid-latitude bands, while federated RL on multi-agent setups enabled geographically specialised control and faster convergence, with a six-agent DDPG configuration using frequent aggregation yielding the lowest area-weighted RMSE across the tropics and mid-latitudes. The learnt corrections were also physically meaningful as agents modulated EBM radiative parameters to reduce meridional biases, adjusted RCE lapse rates to match vertical temperature errors, and stabilised SCBC heating increments to limit drift. Overall, the results highlight that RL can deliver skilful, state-dependent, and regime-aware parametrisations, offering a scalable pathway for online learning within numerical models.

[LG-51] State Backdoor: Towards Stealthy Real-world Poisoning Attack on Vision-Language-Action Model in State Space

Link: https://arxiv.org/abs/2601.04266
Authors: Ji Guo,Wenbo Jiang,Yansong Lin,Yijing Liu,Ruichen Zhang,Guomin Lu,Aiguo Chen,Xinshuo Han,Hongwei Li,Dusit Niyato
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Vision-Language-Action (VLA) models are widely deployed in safety-critical embodied AI applications such as robotics. However, their complex multimodal interactions also expose new security vulnerabilities. In this paper, we investigate a backdoor threat in VLA models, where malicious inputs cause targeted misbehavior while preserving performance on clean data. Existing backdoor methods predominantly rely on inserting visible triggers into visual modality, which suffer from poor robustness and low insusceptibility in real-world settings due to environmental variability. To overcome these limitations, we introduce the State Backdoor, a novel and practical backdoor attack that leverages the robot arm’s initial state as the trigger. To optimize trigger for insusceptibility and effectiveness, we design a Preference-guided Genetic Algorithm (PGA) that efficiently searches the state space for minimal yet potent triggers. Extensive experiments on five representative VLA models and five real-world tasks show that our method achieves over 90% attack success rate without affecting benign task performance, revealing an underexplored vulnerability in embodied AI systems.

[LG-52] MemKD: Memory-Discrepancy Knowledge Distillation for Efficient Time Series Classification ICASSP2025

Link: https://arxiv.org/abs/2601.04264
Authors: Nilushika Udayangani,Kishor Nandakishor,Marimuthu Palaniswami
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025), Hyderabad, India

Abstract:Deep learning models, particularly recurrent neural networks and their variants, such as long short-term memory, have significantly advanced time series data analysis. These models capture complex, sequential patterns in time series, enabling real-time assessments. However, their high computational complexity and large model sizes pose challenges for deployment in resource-constrained environments, such as wearable devices and edge computing platforms. Knowledge Distillation (KD) offers a solution by transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student), thereby retaining high performance while reducing computational demands. Current KD methods, originally designed for computer vision tasks, neglect the unique temporal dependencies and memory retention characteristics of time series models. To this end, we propose a novel KD framework termed Memory-Discrepancy Knowledge Distillation (MemKD). MemKD leverages a specialized loss function to capture memory retention discrepancies between the teacher and student models across subsequences within time series data, ensuring that the student model effectively mimics the teacher model’s behaviour. This approach facilitates the development of compact, high-performing recurrent neural networks suitable for real-time, time series analysis tasks. Our extensive experiments demonstrate that MemKD significantly outperforms state-of-the-art KD methods. It reduces parameter size and memory usage by approximately 500 times while maintaining comparable performance to the teacher model.

[LG-53] Automated Reproducibility Has a Problem Statement Problem AAAI2026

Link: https://arxiv.org/abs/2601.04226
Authors: Thijs Snelleman,Peter Lundestad Lawrence,Holger H. Hoos,Odd Erik Gundersen
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted at RAI Workshop @ AAAI 2026

Abstract:Background. Reproducibility is essential to the scientific method, but reproduction is often a laborious task. Recent works have attempted to automate this process and relieve researchers of this workload. However, due to varying definitions of reproducibility, a clear problem statement is missing. Objectives. Create a generalisable problem statement, applicable to any empirical study. We hypothesise that we can represent any empirical study using a structure based on the scientific method and that this representation can be automatically extracted from any publication, and captures the essence of the study. Methods. We apply our definition of reproducibility as a problem statement for the automatisation of reproducibility by automatically extracting the hypotheses, experiments and interpretations of 20 studies and assess the quality based on assessments by the original authors of each study. Results. We create a dataset representing the reproducibility problem, consisting of the representation of 20 studies. The majority of author feedback is positive, for all parts of the representation. In a few cases, our method failed to capture all elements of the study. We also find room for improvement at capturing specific details, such as results of experiments. Conclusions. We conclude that our formulation of the problem is able to capture the concept of reproducibility in empirical AI studies across a wide range of subfields. Authors of original publications generally agree that the produced structure is representative of their work; we believe improvements can be achieved by applying our findings to create a more structured and fine-grained output in future work.

[LG-54] Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data

Link: https://arxiv.org/abs/2601.05227
Authors: James Rice
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST)
Comments: 20 pages, 6330 words

Abstract:I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds an Itô SDE in the latent space of a variational autoencoder, allowing for flexible, continuous-time modeling of uncertainty while preserving a principled mathematical foundation. The drift and diffusion terms of the SDE are parameterized by neural networks, enabling data-driven inference and generalizing classical time series models to handle irregular sampling and complex dynamic structure. A central theoretical contribution is the co-parameterization of the adjoint state with a dedicated neural network, forming a coupled forward-backward system that captures not only latent evolution but also gradient dynamics. I introduce a pathwise-regularized adjoint loss and analyze variance-reduced gradient flows through the lens of stochastic calculus, offering new tools for improving training stability in deep latent SDEs. My paper unifies and extends variational inference, continuous-time generative modeling, and control-theoretic optimization, providing a rigorous foundation for future developments in stochastic probabilistic machine learning.

[LG-55] ROOFS: RObust biOmarker Feature Selection

Link: https://arxiv.org/abs/2601.05151
Authors: Anastasiia Bakhmach,Paul Dufossé,Andrea Vaglio,Florence Monville,Laurent Greillier,Fabrice Barlési,Sébastien Benzekry
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Feature selection (FS) is essential for biomarker discovery and in the analysis of biomedical datasets. However, challenges such as high-dimensional feature space, low sample size, multicollinearity, and missing values make FS non-trivial. Moreover, FS performances vary across datasets and predictive tasks. We propose roofs, a Python package available at this https URL, designed to help researchers in the choice of FS method adapted to their problem. Roofs benchmarks multiple FS methods on the user’s data and generates reports that summarize a comprehensive set of evaluation metrics, including downstream predictive performance estimated using optimism correction, stability, reliability of individual features, and true positive and false positive rates assessed on semi-synthetic data with a simulated outcome. We demonstrate the utility of roofs on data from the PIONeeR clinical trial, aimed at identifying predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer. The PIONeeR dataset contained 374 multi-source blood and tumor biomarkers from 435 patients. A reduced subset of 214 features was obtained through iterative variance inflation factor pre-filtering. Of the 34 FS methods gathered in roofs, we evaluated 23 in combination with 11 classifiers (253 models in total) and identified a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from t-test and logistic regression as the optimal approach, outperforming other methods including the widely used LASSO. We conclude that comprehensive benchmarking with roofs has the potential to improve the robustness and reproducibility of FS discoveries and increase the translational value of clinical models.

[LG-56] Neural Algorithmic Reasoning for Approximate k-Coloring with Recursive Warm Starts

Link: https://arxiv.org/abs/2601.05137
Authors: Knut Vanderbush,Melanie Weber
Categories: Combinatorics (math.CO); Machine Learning (cs.LG)
Comments: 33 pages, 10 figures

Abstract:Node coloring is the task of assigning colors to the nodes of a graph such that no two adjacent nodes have the same color, while using as few colors as possible. It is the most widely studied instance of graph coloring and of central importance in graph theory; major results include the Four Color Theorem and work on the Hadwiger-Nelson Problem. As an abstraction of classical combinatorial optimization tasks, such as scheduling and resource allocation, it is also rich in practical applications. Here, we focus on a relaxed version, approximate k-coloring, which is the task of assigning at most k colors to the nodes of a graph such that the number of edges whose vertices have the same color is approximately minimized. While classical approaches leverage mathematical programming or SAT solvers, recent studies have explored the use of machine learning. We follow this route and explore the use of graph neural networks (GNNs) for node coloring. We first present an optimized differentiable algorithm that improves a prior approach by Schuetz et al. with orthogonal node feature initialization and a loss function that penalizes conflicting edges more heavily when their endpoints have higher degree; the latter inspired by the classical result that a graph is k-colorable if and only if its k-core is k-colorable. Next, we introduce a lightweight greedy local search algorithm and show that it may be improved by recursively computing a (k-1)-coloring to use as a warm start. We then show that applying such recursive warm starts to the GNN approach leads to further improvements. Numerical experiments on a range of different graph structures show that while the local search algorithms perform best on small inputs, the GNN exhibits superior performance at scale. The recursive warm start may be of independent interest beyond graph coloring for local search methods for combinatorial optimization.
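The objective of approximate k-coloring and the idea of a recursive warm start can be sketched in a few lines. This is an illustrative greedy local search written under stated assumptions, not the authors' algorithm: each sweep recolors a node whenever another color strictly reduces its number of conflicting neighbors, and the k-color search is seeded with the result of the (k-1)-color search. All function names are hypothetical.

```python
def count_conflicts(edges, colors):
    """Number of edges whose two endpoints share a color (the quantity to minimize)."""
    return sum(1 for u, v in edges if colors[u] == colors[v])


def greedy_local_search(n, edges, k, colors=None, max_sweeps=100):
    """Recolor nodes one at a time while a strictly better color exists."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    colors = [i % k for i in range(n)] if colors is None else list(colors)
    for _ in range(max_sweeps):
        improved = False
        for node in range(n):
            counts = [0] * k  # how many neighbors currently use each color
            for nb in adj[node]:
                counts[colors[nb]] += 1
            best = min(range(k), key=lambda c: counts[c])
            if counts[best] < counts[colors[node]]:
                colors[node] = best
                improved = True
        if not improved:
            break
    return colors


def recursive_warm_start(n, edges, k):
    """Solve with k-1 colors first, then use that coloring to seed the k-color search."""
    if k == 1:
        return [0] * n
    warm = recursive_warm_start(n, edges, k - 1)
    return greedy_local_search(n, edges, k, colors=warm)


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    coloring = recursive_warm_start(3, triangle, k=3)
    print(coloring, count_conflicts(triangle, coloring))
```

Every accepted move strictly lowers the total conflict count, so each search terminates at a local minimum; the warm start only changes which local minimum is reached.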

[LG-57] Gradient-based Optimisation of Modulation Effects

Link: https://arxiv.org/abs/2601.04867
Authors: Alistair Carson,Alec Wright,Stefan Bilbao
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Submitted to J. Audio Eng. Soc. Dec. 2025

Abstract:Modulation effects such as phasers, flangers and chorus effects are heavily used in conjunction with the electric guitar. Machine learning based emulation of analog modulation units has been investigated in recent years, but most methods have either been limited to one class of effect or suffer from a high computational cost or latency compared to canonical digital implementations. Here, we build on previous work and present a framework for modelling flanger, chorus and phaser effects based on differentiable digital signal processing. The model is trained in the time-frequency domain, but at inference operates in the time-domain, requiring zero latency. We investigate the challenges associated with gradient-based optimisation of such effects, and show that low-frequency weighting of loss functions avoids convergence to local minima when learning delay times. We show that when trained against analog effects units, sound output from the model is in some cases perceptually indistinguishable from the reference, but challenges still remain for effects with long delay times and feedback.

[LG-58] Comparison of Maximum Likelihood Classification Before and After Applying Weierstrass Transform

Link: https://arxiv.org/abs/2601.04808
Authors: Muhammad Shoaib,Zaka Ur Rehman,Muhammad Qasim
Categories: Applications (stat.AP); Machine Learning (cs.LG); Probability (math.PR)

Abstract:The aim of this paper is to apply Maximum Likelihood (ML) classification to multispectral data using qualitative and quantitative approaches. Maximum Likelihood is a supervised classification algorithm based on the classical Bayes theorem. It uses a discriminant function to assign each pixel to the class with the highest likelihood. The class mean vector and covariance matrix are the key inputs to the function and can be estimated from the training pixels of a particular class. Maximum Likelihood requires certain assumptions to hold before it is applied to the data. In this paper we compare the results of ML classification before and after applying the Weierstrass Transform and examine the difference in accuracy on training pixels of a high-resolution Quickbird satellite image. Principal Component Analysis (PCA) is also used for dimension reduction and to check the variation in bands. The results show that the separation between the class means in the decision space is the main factor behind the higher classification accuracy of ML after applying the Weierstrass Transform than without it.
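For reference, the two objects named in the abstract can be written out in their standard textbook forms. The abstract itself does not state either formula, so the notation is an assumption introduced here: x is a pixel's spectral vector, and the mean vector, covariance matrix, and prior probability of class i are denoted by \mu_i, \Sigma_i, and P(\omega_i). A pixel is assigned to the class with the largest g_i(x); how the authors apply the transform to the image bands is not specified in the abstract.

```latex
% Gaussian maximum likelihood discriminant for class i
g_i(x) = \ln P(\omega_i) - \tfrac{1}{2}\ln\lvert\Sigma_i\rvert
         - \tfrac{1}{2}\,(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)

% Weierstrass transform of a function f (smoothing by a Gaussian kernel)
W[f](x) = \frac{1}{\sqrt{4\pi}} \int_{-\infty}^{\infty} f(x-y)\, e^{-y^{2}/4}\, dy
```

With equal priors the first term of g_i is constant across classes and can be dropped, which leaves the class means and covariances as the only inputs, as the abstract describes.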

[LG-59] The Minary Primitive of Computational Autopoiesis

Link: https://arxiv.org/abs/2601.04501
Authors: Daniel Connor,Colin Defant
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Probability (math.PR)
Comments: 21 pages, 2 figures

Abstract:We introduce Minary, a computational framework designed as a candidate for the first formally provable autopoietic primitive. Minary represents interacting probabilistic events as multi-dimensional vectors and combines them via linear superposition rather than multiplicative scalar operations, thereby preserving uncertainty and enabling constructive and destructive interference in the range [-1,1]. A fixed set of "perspectives" evaluates "semantic dimensions" according to hidden competencies, and their interactions drive two discrete-time stochastic processes. We model this system as an iterated random affine map and use the theory of iterated random functions to prove that it converges in distribution to a unique stationary law; we moreover obtain an explicit closed form for the limiting expectation in terms of row, column, and global averages of the competency matrix. We then derive exact formulas for the mean and variance of the normalized consensus conditioned on the activation of a given semantic dimension, revealing how consensus depends on competency structure rather than raw input signals. Finally, we argue that Minary is organizationally closed yet operationally open in the sense of Maturana and Varela, and we discuss implications for building self-maintaining, distributed, and parallelizable computational systems that house a uniquely subjective notion of identity.

[LG-60] Prediction of Cellular Malignancy Using Electrical Impedance Signatures and Supervised Machine Learning

Link: https://arxiv.org/abs/2601.04478
Authors: Shadeeb Hossain
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Bioelectrical properties of cells such as relative permittivity, conductivity, and characteristic time constants vary significantly between healthy and malignant cells across different frequencies. These distinctions provide a promising foundation for diagnostic and classification applications. This study systematically reviewed 33 scholarly articles to compile datasets of quantitative bioelectric parameters and evaluated their utility in predictive modeling. Three supervised machine learning algorithms, Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN), were implemented and tuned using key hyperparameters to assess classification performance. Model effectiveness was evaluated using accuracy and F1 score as performance metrics. Results demonstrate that Random Forest achieved the highest predictive accuracy of approximately 90% when configured with a maximum depth of 4 and 100 estimators. These findings highlight the potential of integrating bioelectrical property analysis with machine learning for improved diagnostic decision-making. Similarly, for KNN and SVM, the F1 score peaked at approximately 78% and 76.5%, respectively. Future work will explore incorporating additional discriminative features, leveraging stimulated datasets, and optimizing hyperparameters through advanced search strategies. Ultimately, a hardware prototype with embedded micro-electrodes and real-time control systems could pave the path for practical diagnostic tools capable of in-situ cell classification.

[LG-61] Convergence Rates for Learning Pseudo-Differential Operators

Link: https://arxiv.org/abs/2601.04473
Authors: Jiaheng Chen,Daniel Sanz-Alonso
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 72 pages, 1 figure

Abstract:This paper establishes convergence rates for learning elliptic pseudo-differential operators, a fundamental operator class in partial differential equations and mathematical physics. In a wavelet-Galerkin framework, we formulate learning over this class as a structured infinite-dimensional regression problem with multiscale sparsity. Building on this structure, we propose a sparse, data- and computation-efficient estimator, which leverages a novel matrix compression scheme tailored to the learning task and a nested-support strategy to balance approximation and estimation errors. In addition to obtaining convergence rates for the estimator, we show that the learned operator induces an efficient and stable Galerkin solver whose numerical error matches its statistical accuracy. Our results therefore contribute to bringing together operator learning, data-driven solvers, and wavelet methods in scientific computing.

Information Retrieval

[IR-0] Multivector Reranking in the Era of Strong First-Stage Retrievers ECIR2026

Link: https://arxiv.org/abs/2601.05200
Authors: Silvio Martinico,Franco Maria Nardini,Cosimo Rulli,Rossano Venturini
Categories: Information Retrieval (cs.IR)
Comments: 17 pages, 2 figures, ECIR 2026

Abstract:Learned multivector representations power modern search systems with strong retrieval effectiveness, but their real-world use is limited by the high cost of exhaustive token-level retrieval. Therefore, most systems adopt a gather-and-refine strategy, where a lightweight gather phase selects candidates for full scoring. However, this approach requires expensive searches over large token-level indexes and often misses the documents that would rank highest under full similarity. In this paper, we reproduce several state-of-the-art multivector retrieval methods on two publicly available datasets, providing a clear picture of the current multivector retrieval field and observing the inefficiency of token-level gathering. Building on top of that, we show that replacing the token-level gather phase with a single-vector document retriever – specifically, a learned sparse retriever (LSR) – produces a smaller and more semantically coherent candidate set. This recasts the gather-and-refine pipeline into the well-established two-stage retrieval architecture. As retrieval latency decreases, query encoding with two neural encoders becomes the dominant computational bottleneck. To mitigate this, we integrate recent inference-free LSR methods, demonstrating that they preserve the retrieval effectiveness of the dual-encoder pipeline while substantially reducing query encoding time. Finally, we investigate multiple reranking configurations that balance efficiency, memory, and effectiveness, and we introduce two optimization techniques that prune low-quality candidates early. Empirical results show that these techniques improve retrieval efficiency by up to 1.8× with no loss in quality. Overall, our two-stage approach achieves over 24× speedup over the state-of-the-art multivector retrieval systems, while maintaining comparable or superior retrieval quality.
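The two-stage architecture described above can be sketched as follows. The abstract does not name the multivector scoring function, so this sketch assumes the ColBERT-style MaxSim similarity that is commonly used with multivector representations; the first-stage scores stand in for the output of a learned sparse retriever, and all names (`maxsim_score`, `two_stage_search`, `first_stage_scores`) are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token vector, take its best-matching
    document token vector, then sum over query tokens (assumed MaxSim)."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def two_stage_search(query_vecs, first_stage_scores, doc_token_vecs,
                     k_candidates, k_final):
    """Stage 1: keep the top candidates by a cheap single-vector score.
    Stage 2: rerank only those candidates with the multivector score."""
    candidates = sorted(first_stage_scores, key=first_stage_scores.get,
                        reverse=True)[:k_candidates]
    reranked = sorted(candidates,
                      key=lambda doc_id: maxsim_score(query_vecs, doc_token_vecs[doc_id]),
                      reverse=True)
    return reranked[:k_final]


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    docs = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[1.0, 0.0]], "c": [[0.5, 0.4]]}
    sparse_scores = {"a": 0.1, "b": 0.9, "c": 0.5}  # toy first-stage scores
    print(two_stage_search(query, sparse_scores, docs, k_candidates=3, k_final=2))
```

The expensive multivector score is computed only for `k_candidates` documents, so no token-level index is searched at query time; that is the cost the abstract's approach removes.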

[IR-1] Dynamics in Search Engine Query Suggestions for European Politicians

Link: https://arxiv.org/abs/2601.05081
Authors: Franziska Pradel,Fabian Haak
Categories: Information Retrieval (cs.IR)
Comments: 11 pages; 3 figures; 6 tables; published as a conference paper at WebSci '24 (May 21-24, 2024, Stuttgart, Germany)

Abstract:Search engines are commonly used for online political information seeking. Yet, it remains unclear how search query suggestions for political searches that reflect the latent interest of internet users vary across countries and over time. We provide a systematic analysis of Google search engine query suggestions for European and national politicians. Using an original dataset of search query suggestions for European politicians collected in ten countries, we find that query suggestions are less stable over time in politicians’ countries of origin, when the politicians hold a supranational role, and for female politicians. Moreover, query suggestions for political leaders and male politicians are more similar across countries. We conclude by discussing possible future directions for studying information search about European politicians in online search.

[IR-2] PROMISE: Process Reward Models Unlock Test-Time Scaling Laws in Generative Recommendations

Link: https://arxiv.org/abs/2601.04674
Authors: Chengcheng Guo,Kuo Cai,Yu Zhou,Qiang Luo,Ruiming Tang,Han Li,Kun Gai,Guorui Zhou
Categories: Information Retrieval (cs.IR)

Abstract:Generative Recommendation has emerged as a promising paradigm, reformulating recommendation as a sequence-to-sequence generation task over hierarchical Semantic IDs. However, existing methods suffer from a critical issue we term Semantic Drift, where errors in early, high-level tokens irreversibly divert the generation trajectory into irrelevant semantic subspaces. Inspired by Process Reward Models (PRMs) that enhance reasoning in Large Language Models, we propose Promise, a novel framework that integrates dense, step-by-step verification into generative models. Promise features a lightweight PRM to assess the quality of intermediate inference steps, coupled with a PRM-guided Beam Search strategy that leverages dense feedback to dynamically prune erroneous branches. Crucially, our approach unlocks Test-Time Scaling Laws for recommender systems: by increasing inference compute, smaller models can match or surpass larger models. Extensive offline experiments and online A/B tests on a large-scale platform demonstrate that Promise effectively mitigates Semantic Drift, significantly improving recommendation accuracy while enabling efficient deployment.
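A minimal sketch of PRM-guided beam search may help make the pruning idea concrete. It is written under stated assumptions and is not the Promise implementation: `expand` is a hypothetical callback returning next-token candidates with log-probabilities from the generator, `prm_score` is a hypothetical process reward model that scores a partial Semantic ID, and adding the two scores is an assumed way of combining them.

```python
def prm_guided_beam_search(expand, prm_score, depth, beam_width):
    """Beam search in which partial sequences are ranked by
    generator log-probability plus a process-reward score (assumed combination)."""
    beams = [((), 0.0)]  # (token prefix, cumulative generator log-probability)
    for _ in range(depth):
        candidates = []
        for prefix, logp in beams:
            for token, token_logp in expand(prefix):
                candidates.append((prefix + (token,), logp + token_logp))
        # Dense feedback: every intermediate prefix is scored, so a poor
        # high-level token can be pruned before it steers later tokens.
        candidates.sort(key=lambda c: c[1] + prm_score(c[0]), reverse=True)
        beams = candidates[:beam_width]
    return beams


if __name__ == "__main__":
    # Toy generator that always prefers token "a"; toy PRM that penalizes
    # any Semantic ID starting with "a".
    def expand(prefix):
        return [("a", -0.1), ("b", -0.5)]

    def prm_score(prefix):
        return -10.0 if prefix[0] == "a" else 0.0

    print(prm_guided_beam_search(expand, prm_score, depth=2, beam_width=1))
```

In the toy run the generator alone would pick "a" first, but the PRM penalty removes that branch at the first step, which mirrors how early errors are meant to be caught before Semantic Drift sets in. Increasing `beam_width` is one way to spend more inference compute.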

[IR-3] Adaptive Retrieval for Reasoning-Intensive Retrieval

Link: https://arxiv.org/abs/2601.04618
Authors: Jongho Kim,Jaeyoung Kim,Seung-won Hwang,Jihyuk Kim,Yu Jin Kim,Moontae Lee
Categories: Information Retrieval (cs.IR)

Abstract:We study leveraging adaptive retrieval to ensure sufficient “bridge” documents are retrieved for reasoning-intensive retrieval. Bridge documents are those that contribute to the reasoning process yet are not directly relevant to the initial query. While existing reasoning-based reranker pipelines attempt to surface these documents in ranking, they suffer from bounded recall. Naively integrating adaptive retrieval into these pipelines often leads to planning error propagation. To address this, we propose REPAIR, a framework that bridges this gap by repurposing reasoning plans as dense feedback signals for adaptive retrieval. Our key distinction is enabling mid-course correction during reranking through selective adaptive retrieval, retrieving documents that support the pivotal plan. Experimental results on reasoning-intensive retrieval and complex QA tasks demonstrate that our method outperforms existing baselines by 5.6 percentage points.

[IR-4] Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing

Link: https://arxiv.org/abs/2601.04554
Authors: Wenlin Zhang, Xiangyang Li, Qiyuan Ge, Kuicai Dong, Pengyue Jia, Xiaopeng Li, Zijian Zhang, Maolin Wang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao
Categories: Information Retrieval (cs.IR)

Abstract: In recommender systems, online A/B testing is a crucial method for evaluating the performance of different models. However, conducting online A/B testing often presents significant challenges, including substantial economic costs, user experience degradation, and considerable time requirements. Given the powerful capabilities of Large Language Models, LLM-based agents show great potential to replace traditional online A/B testing. Nonetheless, current agents fail to simulate the perception process and interaction patterns, due to the lack of real environments and visual perception capability. To address these challenges, we introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions that align with real user behavior on online platforms. The designed agent leverages multimodal information perception and fine-grained user preferences, and integrates profiles, action memory retrieval, and a fatigue system to simulate complex human decision-making. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features. Furthermore, we found that the data generated by A/B Agent can effectively enhance the capabilities of recommendation models. Our code is publicly available at this https URL.

[IR-5] The Overlooked Role of Graded Relevance Thresholds in Multilingual Dense Retrieval

Link: https://arxiv.org/abs/2601.04395
Authors: Tomer Wullach, Ori Shapira, Amir DN Cohen
Categories: Information Retrieval (cs.IR)

Abstract:Dense retrieval models are typically fine-tuned with contrastive learning objectives that require binary relevance judgments, even though relevance is inherently graded. We analyze how graded relevance scores and the threshold used to convert them into binary labels affect multilingual dense retrieval. Using a multilingual dataset with LLM-annotated relevance scores, we examine monolingual, multilingual mixture, and cross-lingual retrieval scenarios. Our findings show that the optimal threshold varies systematically across languages and tasks, often reflecting differences in resource level. A well-chosen threshold can improve effectiveness, reduce the amount of fine-tuning data required, and mitigate annotation noise, whereas a poorly chosen one can degrade performance. We argue that graded relevance is a valuable but underutilized signal for dense retrieval, and that threshold calibration should be treated as a principled component of the fine-tuning pipeline.
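
The thresholding step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the 0 to 3 relevance scale, the `score >= threshold` convention, the function name `binarize`, and the toy data are all assumptions made for illustration. It only shows how moving the threshold changes which query-document pairs count as positives for contrastive fine-tuning.

```python
from typing import List, Tuple

# (query_id, doc_id, graded relevance score); the 0-3 scale is an assumption.
ScoredPair = Tuple[str, str, int]
LabeledPair = Tuple[str, str, int]


def binarize(pairs: List[ScoredPair], threshold: int) -> List[LabeledPair]:
    """Convert graded scores to binary labels.

    A pair is labeled 1 (positive) when its score is at or above the
    threshold, and 0 (negative) otherwise. The >= convention is an
    assumption of this sketch.
    """
    return [(q, d, 1 if score >= threshold else 0) for q, d, score in pairs]


# Hypothetical toy data: one query with four graded documents.
pairs = [("q1", "d1", 3), ("q1", "d2", 2), ("q1", "d3", 1), ("q1", "d4", 0)]

for threshold in (1, 2, 3):
    labeled = binarize(pairs, threshold)
    positives = [d for _, d, label in labeled if label == 1]
    print(f"threshold={threshold}: positives={positives}")
```

A lower threshold yields more, but noisier, positives, while a higher one yields fewer and stricter positives. This is the trade-off the abstract argues should be calibrated per language and task.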

[IR-6] Paper Skygest: Personalized Academic Recommendations on Bluesky

Link: https://arxiv.org/abs/2601.04253
Authors: Sophie Greenwood, Nikhil Garg
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Information Retrieval (cs.IR)

Abstract:We build, deploy, and evaluate Paper Skygest, a custom personalized social feed for scientific content posted by a user’s network on Bluesky and the AT Protocol. We leverage a new capability on emerging decentralized social media platforms: the ability for anyone to build and deploy feeds for other users, to use just as they would a native platform-built feed. To our knowledge, Paper Skygest is the first and largest such continuously deployed personalized social media feed by academics, with over 50,000 weekly uses by over 1,000 daily active users, all organically acquired. First, we quantitatively and qualitatively evaluate Paper Skygest usage, showing that it has sustained usage and satisfies users; we further show adoption of Paper Skygest increases a user’s interactions with posts about research, and how interaction rates change as a function of post order. Second, we share our full code and describe our system architecture, to support other academics in building and deploying such feeds sustainably. Third, we overview the potential of custom feeds such as Paper Skygest for studying algorithm designs, building for user agency, and running recommender system experiments with organic users without partnering with a centralized platform.

Attachment download

Click to download today's full list of papers