This blog post presents the latest paper list retrieved from Arxiv.org on 2026-01-13. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: Paper data is retrieved from Arxiv.org daily and updated automatically around 12:00 each morning.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2026-01-13)

A total of 1,014 papers were updated today, including:

  • Natural Language Processing: 190 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 358 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 175 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 259 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

[Quick Read]: This paper asks whether current language models can, like human interlocutors, recognize their own confusion when understanding is uncertain and actively request clarification, thereby taking on the addressee role needed to maintain mutual understanding in dialogue. The key to the approach is using reference games as a controlled, self-contained experimental setting: a baseline reference resolution task is compared against a setup that instructs models to request clarification when uncertain, quantifying their interactive ability. The results show that even on such simple tasks, models generally struggle to translate internal uncertainty into effective clarification behavior, underscoring the value of reference games as a testbed for the interaction qualities of vision-language models.

Link: https://arxiv.org/abs/2601.07820
Authors: Manar Ali, Judith Sieker, Sina Zarrieß, Hendrik Buschmeier
Affiliations: Bielefeld University; CRC 1646 ‘Linguistic Creativity in Communication’
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In human conversation, both interlocutors play an active role in maintaining mutual understanding. When addressees are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar addressee role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a good testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.

[NLP-1] The Confidence Trap: Gender Bias and Predictive Certainty in LLMs AAAI2026

[Quick Read]: This paper examines how well the predicted confidence of large language models (LLMs) aligns with human-annotated bias judgments in sensitive settings, focusing on the fairness of confidence calibration in gendered pronoun resolution. The key contribution is Gender-ECE, a new calibration metric that quantifies gender disparities in resolution tasks; using it to assess six state-of-the-art models, the authors find Gemma-2 to be the worst calibrated on the gender bias benchmark, providing a quantifiable evaluation framework and direction for the ethical deployment of LLMs.

Link: https://arxiv.org/abs/2601.07806
Authors: Ahmed Sabir, Markus Kängsepp, Rajesh Sharma
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: AAAI 2026 (AISI Track), Oral. Project page: this https URL

Abstract:The increased use of Large Language Models (LLMs) in sensitive domains leads to growing interest in how their confidence scores correspond to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, the research investigates probability confidence calibration in contexts involving gendered pronoun resolution. The goal is to evaluate if calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among the six state-of-the-art models, Gemma-2 demonstrates the worst calibration according to the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLMs’ confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in resolution tasks.
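
The paper does not spell out the Gender-ECE formula in the abstract; one plausible reading is the gap in expected calibration error (ECE) between gendered subsets of the pronoun-resolution data. A minimal sketch under that assumption:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Standard expected calibration error over equal-width confidence bins."""
    confidences, correct = np.asarray(confidences), np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |mean confidence - accuracy| weighted by bin size
            err += mask.sum() / total * abs(confidences[mask].mean() - correct[mask].mean())
    return err

def gender_ece(conf_f, correct_f, conf_m, correct_m, n_bins=10):
    # Hypothetical reading of Gender-ECE: the absolute ECE gap between
    # female- and male-pronoun resolution examples.
    return abs(ece(conf_f, correct_f, n_bins) - ece(conf_m, correct_m, n_bins))
```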

[NLP-2] Learning Through Dialogue: Unpacking the Dynamics of Human-LLM Conversations on Political Issues

[Quick Read]: This paper addresses the lack of systematic study of how the interactional dynamics of large language models (LLMs), when used as conversational learning partners, support users' knowledge gain and engagement. Analyzing linguistic and interactional features of both LLMs and users across 397 human-LLM conversations on socio-political issues, the authors find that the richness of LLM explanations supports user confidence partly by fostering reflective insight, while its effect on knowledge gain operates entirely through users' cognitive engagement. These effects are highly conditional on users' political efficacy: high-efficacy users gain confidence when they experience and resolve uncertainty, and gain knowledge from reflective strategies over longer interactions. The key implication is that LLM explanatory behavior should be matched to users' engagement states when designing human-AI interactive systems for effective learning.

Link: https://arxiv.org/abs/2601.07796
Authors: Shaz Furniturewala, Gerard Christopher Yeo, Kokil Jaidka
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large language models (LLMs) are increasingly used as conversational partners for learning, yet the interactional dynamics supporting users’ learning and engagement are understudied. We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in political knowledge and confidence. Mediation analyses reveal that LLM explanatory richness partially supports confidence by fostering users’ reflective insight, whereas its effect on knowledge gain operates entirely through users’ cognitive engagement. Moderation analyses show that these effects are highly conditional and vary by political efficacy. Confidence gains depend on how high-efficacy users experience and resolve uncertainty. Knowledge gains depend on high-efficacy users’ ability to leverage extended interaction, with longer conversations benefiting primarily reflective users. In summary, we find that learning from LLMs is an interactional achievement, not a uniform outcome of better explanations. The findings underscore the importance of aligning LLM explanatory behavior with users’ engagement states to support effective learning in designing Human-AI interactive systems.

[NLP-3] Kinship Data Benchmark for Multi-hop Reasoning

[Quick Read]: This paper addresses the lack of structured, controllable, and culturally sensitive benchmarks for evaluating the multi-hop reasoning ability of large language models (LLMs): existing evaluations rarely allow systematic control of task difficulty, cultural assumptions, and relational depth, making performance differences hard to attribute. The key to the solution is a generative pipeline that produces, on demand, large-scale, realistic, culture-specific genealogical data, i.e., collections of interconnected family trees satisfying the marriage constraints of particular kinship systems. Textual reasoning tasks derived from these genealogies require models to infer implicit relational chains, enabling precise measurement of multi-hop reasoning. Task difficulty, cultural setting, and relational depth thus become systematically controllable, yielding a reproducible, comparable framework for assessing cross-cultural multi-hop reasoning in LLMs.

Link: https://arxiv.org/abs/2601.07794
Authors: Tianda Sun, Dimitar Kazakov
Affiliations: University of York
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 9 tables

Abstract:Large language models (LLMs) are increasingly evaluated on their ability to perform multi-hop reasoning, i.e., to combine multiple pieces of information into a coherent inference. We introduce KinshipQA, a benchmark designed to probe this capability through reasoning over kinship relations. The central contribution of our work is a generative pipeline that produces, on demand, large-scale, realistic, and culture-specific genealogical data: collections of interconnected family trees that satisfy explicit marriage constraints associated with different kinship systems. This allows task difficulty, cultural assumptions, and relational depth to be systematically controlled and varied. From these genealogies, we derive textual inference tasks that require reasoning over implicit relational chains. We evaluate the resulting benchmark using six state-of-the-art LLMs, spanning both open-source and closed-source models, under a uniform zero-shot protocol with deterministic decoding. Performance is measured using exact-match and set-based metrics. Our results demonstrate that KinshipQA yields a wide spread of outcomes and exposes systematic differences in multi-hop reasoning across models and cultural settings.
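
To make the task format concrete, here is a toy illustration (not the KinshipQA pipeline; names and facts are invented) of how a multi-hop question arises from a small family tree:

```python
# Toy illustration: derive a multi-hop kinship question from explicit
# parent-of facts by composing two relations.
parent_of = {          # child -> (mother, father), invented data
    "Ada": ("Eve", "Tom"),
    "Eve": ("Mia", "Sam"),
}

def maternal_grandfather(person):
    mother, _ = parent_of[person]        # hop 1: person's mother
    _, grandfather = parent_of[mother]   # hop 2: mother's father
    return grandfather

# Two facts must be chained; neither alone answers the question.
question = "Who is Ada's maternal grandfather?"
print(question, "->", maternal_grandfather("Ada"))  # Sam
```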

[NLP-4] Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning

[Quick Read]: This paper addresses retrieval failures that arise when LLM agents face dynamic tool libraries: the semantic gap between abstract user goals and technical documentation, and the inability of fixed-size embeddings to model the combinatorial complexity of tool composition. The key idea of the proposed lightweight framework, TOOLQP, is to model retrieval as iterative query planning: user instructions are decomposed into sub-tasks, and targeted queries are generated dynamically to interact with the retriever, precisely locating the sub-task information needed for tool composition. Training combines synthetic query trajectories with Reinforcement Learning with Verifiable Rewards (RLVR), yielding markedly better zero-shot generalization, robustness across retrievers, and downstream agentic execution.

Link: https://arxiv.org/abs/2601.07782
Authors: Wei Fang, James Glass
Affiliations: Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
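
A minimal sketch of the iterative query-planning idea (our own illustration; `planner` and `retriever.search` are hypothetical stand-ins for the planner LLM and the dense retriever, not TOOLQP's actual API):

```python
def retrieve_tools(instruction, planner, retriever, max_steps=4, k=5):
    """Iterative query planning: decompose, query, stop when covered.

    planner(instruction, found) is assumed to return the next sub-task
    query string, or None once every sub-task is covered (hypothetical API).
    """
    found = []
    for _ in range(max_steps):
        query = planner(instruction, found)   # plan the next sub-task query
        if query is None:                     # all sub-tasks covered
            break
        hits = retriever.search(query, k=k)   # one dense retrieval call
        found.extend(h for h in hits if h not in found)
    return found
```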

[NLP-5] Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection

[Quick Read]: This paper targets the remaining weaknesses of large language models under Chain-of-Thought (CoT) prompting in reasoning consistency, accuracy, and self-correction, especially on complex or ethically sensitive tasks. The key to the solution is a structured multi-perspective reflection method, MyGO Poly-Reflective Chain-of-Thought (PR-CoT): after an initial CoT pass, the model is guided to assess and revise its reasoning along four predefined dimensions (logical consistency, information completeness, bias/ethics compliance, and alternative solutions). Implemented purely through prompt engineering, the method requires no retraining and yields markedly more robust and accurate reasoning.

Link: https://arxiv.org/abs/2601.07780
Authors: Mariana Costa, Alberlucia Rafael Soarez, Daniel Kim, Camila Ferreira
Affiliations: University of Brasilia
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While Chain-of-Thought (CoT) prompting advances LLM reasoning, challenges persist in consistency, accuracy, and self-correction, especially for complex or ethically sensitive tasks. Existing single-dimensional reflection methods offer insufficient improvements. We propose MyGO Poly-Reflective Chain-of-Thought (PR-CoT), a novel methodology employing structured multi-perspective reflection. After initial CoT, PR-CoT guides the LLM to self-assess its reasoning across multiple predefined angles: logical consistency, information completeness, biases/ethics, and alternative solutions. Implemented purely via prompt engineering, this process refines the initial CoT into a more robust and accurate final answer without model retraining. Experiments across arithmetic, commonsense, ethical decision-making, and logical puzzles, using GPT-3.5 and GPT-4 models, demonstrate PR-CoT’s superior performance. It significantly outperforms traditional CoT and existing reflection methods in logical consistency and error correction, with notable gains in nuanced domains like ethical decision-making. Ablation studies, human evaluations, and qualitative analyses further validate the contribution of each reflection perspective and the overall efficacy of our poly-reflective paradigm in fostering more reliable LLM reasoning.
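
Since PR-CoT is pure prompt engineering, a hedged sketch of what such a multi-perspective reflection prompt might look like (the wording is ours, not taken from the paper):

```python
REFLECTION_ANGLES = [
    "logical consistency",
    "information completeness",
    "biases and ethical concerns",
    "alternative solutions",
]

def pr_cot_prompts(question):
    """Two-stage prompting: initial CoT, then one reflection pass per angle."""
    initial = f"{question}\nLet's think step by step."
    reflections = [
        (f"Review the reasoning above specifically for {angle}. "
         "Point out any problems, then give a revised final answer.")
        for angle in REFLECTION_ANGLES
    ]
    return initial, reflections
```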

[NLP-6] OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

[Quick Read]: This paper tackles two core challenges facing current vision-language model (VLM) based computer-using agents (CUAs): degraded robustness in long-horizon tasks due to poor curation of historical visual context, and weak generalization in novel domains due to the absence of visual-aware tutorial retrieval. The key innovations of the proposed OS-Symphony framework are: (1) a Reflection-Memory Agent whose milestone-driven long-term memory enables trajectory-level self-correction, mitigating visual context loss in long-horizon tasks; and (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox and synthesize live, visually aligned tutorials, improving execution consistency and accuracy in unseen scenarios.

Link: https://arxiv.org/abs/2601.07779
Authors: Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding
Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; National University of Singapore; The Hong Kong University of Science and Technology; The University of Hong Kong; CUHK MMLab; Xi’an Jiaotong University; Nanjing University; Harbin Institute of Technology
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 31 pages, 11 figures, 12 tables

Abstract:While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.

[NLP-7] Are LLM Decisions Faithful to Verbal Confidence?

[Quick Read]: This paper asks whether large language models (LLMs), despite producing plausible verbal expressions of uncertainty, actually tie that expressed confidence to their reasoning, knowledge, or decision making, and in particular whether they adjust their answer-or-abstain policy in a cost-sensitive way in high-stakes settings. The key to the solution is RiskEval, an evaluation framework that tests whether models change their abstention behavior as error penalties vary. The results show that even when extreme penalties make frequent abstention the mathematically optimal strategy, models almost never abstain, revealing that they cannot convert uncertainty signals into risk-sensitive decisions and exposing a lack of strategic agency in current models.

Link: https://arxiv.org/abs/2601.07767
Authors: Jiawei Wang, Yanfei Zhou, Siddartha Devic, Deqing Fu
Affiliations: University of Southern California
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) can produce surprisingly sophisticated estimates of their own uncertainty. However, it remains unclear to what extent this expressed confidence is tied to the reasoning, knowledge, or decision making of the model. To test this, we introduce RiskEval: a framework designed to evaluate whether models adjust their abstention policies in response to varying error penalties. Our evaluation of several frontier models reveals a critical dissociation: models are neither cost-aware when articulating their verbal confidence, nor strategically responsive when deciding whether to engage or abstain under high-penalty conditions. Even when extreme penalties render frequent abstention the mathematically optimal strategy, models almost never abstain, resulting in utility collapse. This indicates that calibrated verbal confidence scores may not be sufficient to create trustworthy and interpretable AI systems, as current models lack the strategic agency to convert uncertainty signals into optimal and risk-sensitive decisions.
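
A worked example of the decision rule implied by "mathematically optimal abstention" (the numbers are ours, for illustration): with reward r for a correct answer, penalty c for a wrong one, and utility 0 for abstaining, answering is optimal only when p*r - (1-p)*c > 0.

```python
def should_answer(p_correct, reward=1.0, penalty=1.0):
    """Answer iff the expected utility of answering beats abstaining (utility 0)."""
    return p_correct * reward - (1 - p_correct) * penalty > 0

# At 70% confidence, a symmetric penalty still favors answering...
print(should_answer(0.7, reward=1.0, penalty=1.0))   # True  (EU = +0.4)
# ...but under an extreme penalty the optimal policy is to abstain.
print(should_answer(0.7, reward=1.0, penalty=10.0))  # False (EU = -2.3)
```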

[NLP-8] Contrastive Learning with Narrative Twins for Modeling Story Salience EACL2026

[Quick Read]: This paper addresses the problem of identifying which events in a narrative are most salient to the story's progression. The core of the solution is a contrastive learning framework that learns story embeddings from "narrative twins" (stories sharing the same plot but differing in surface form): the model is trained to distinguish a story from both its twin and a distractor with similar surface features but a different plot, so that the embeddings capture narrative structure rather than wording. The approach outperforms a masked-language-model baseline, and among the four narratologically motivated operations evaluated (deletion, shifting, disruption, and summarization), summarization proves the most reliable for identifying salient sentences.

Link: https://arxiv.org/abs/2601.07765
Authors: Igor Sterner, Alex Lascarides, Frank Keller
Affiliations: University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments: EACL 2026

Abstract:Understanding narratives requires identifying which events are most salient for a story’s progression. We present a contrastive learning framework for modeling narrative salience that learns story embeddings from narrative twins: stories that share the same plot but differ in surface form. Our model is trained to distinguish a story from both its narrative twin and a distractor with similar surface features but different plot. Using the resulting embeddings, we evaluate four narratologically motivated operations for inferring salience (deletion, shifting, disruption, and summarization). Experiments on short narratives from the ROCStories corpus and longer Wikipedia plot summaries show that contrastively learned story embeddings outperform a masked-language-model baseline, and that summarization is the most reliable operation for identifying salient sentences. If narrative twins are not available, random dropout can be used to generate the twins from a single story. Effective distractors can be obtained either by prompting LLMs or, in long-form narratives, by using different parts of the same story.
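
A minimal sketch of the contrastive objective as we read it (an InfoNCE-style loss with the twin as positive and the distractor as negative; the encoder, batching, and temperature are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def twin_contrastive_loss(anchor, twin, distractor, tau=0.07):
    """InfoNCE over one positive (the narrative twin: same plot, different
    surface form) and one negative (surface-similar distractor, different plot).

    anchor, twin, distractor: (batch, dim) story embeddings.
    """
    a = F.normalize(anchor, dim=-1)
    pos = (a * F.normalize(twin, dim=-1)).sum(-1) / tau        # cosine / tau
    neg = (a * F.normalize(distractor, dim=-1)).sum(-1) / tau
    logits = torch.stack([pos, neg], dim=-1)                   # (batch, 2)
    labels = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, labels)                     # twin = class 0
```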

[NLP-9] Structure First Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents

[Quick Read]: This paper addresses the limited accuracy of large language models (LLMs) on numerical reasoning over financial documents, in particular the bottleneck of extracting numerical data from unstructured text and semi-structured tables and performing reliable calculations. The key to the solution is augmenting the LLM with structured information in the form of a Knowledge Graph (KG) extracted, under a proposed schema, directly from the document being processed. On the FinQA benchmark, the framework improves the execution accuracy of Llama 3.1 8B Instruct by approximately 12%.

Link: https://arxiv.org/abs/2601.07754
Authors: Aryan Mishra, Akash Anil
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Numerical reasoning is an important task in the analysis of financial documents. It helps in understanding and performing numerical predictions with logical conclusions for the given query seeking answers from financial texts. Recently, Large Language Models (LLMs) have shown promising results in multiple Question-Answering (Q-A) systems with the capability of logical reasoning. As documents related to finance often consist of long and complex financial contexts, LLMs appear well-suited for building high-quality automated financial question-answering systems. However, LLMs often face challenges in accurately processing the various numbers within financial reports. Extracting numerical data from unstructured text and semi-structured tables, and reliably performing accurate calculations, remains a significant bottleneck for numerical reasoning in most state-of-the-art LLMs. Recent studies have shown that structured data augmentations, such as Knowledge Graphs (KGs), have notably improved the predictions of LLMs along with logical explanations. Thus, it is an important requirement to consider inherent structured information in financial reports while using LLMs for various financial analytics. This paper proposes a framework to incorporate structured information using KGs along with LLM predictions for numerical reasoning tasks. The KGs are extracted using a proposed schema inherently from the document under processing. We evaluated our proposed framework over the benchmark data FinQA, using an open-source LLM, namely Llama 3.1 8B Instruct. We observed that the proposed framework improved execution accuracy by approximately 12% relative to the vanilla LLM.

[NLP-10] Is Agentic RAG worth it? An experimental comparison of RAG approaches

[Quick Read]: This paper addresses several limitations of basic Retrieval-Augmented Generation (RAG) systems: noisy or suboptimal retrieval, misuse of retrieval for out-of-scope queries, weak query-document matching, and the variability and cost of the generator. Two evolutionary paths are compared: "Enhanced" RAG, which adds dedicated modules targeting specific weaknesses in the workflow, and "Agentic" RAG, which exploits the self-reflective capabilities of large language models (LLMs) for dynamic decision making and iteration control, reducing reliance on manually engineered modules. The key contribution is an extensive, empirically driven evaluation across multiple scenarios and dimensions that weighs the performance and cost of the two paradigms, offering practical guidance for choosing the most effective RAG design per application.

Link: https://arxiv.org/abs/2601.07711
Authors: Pietro Ferrazzi, Milica Cvjeticanin, Alessio Piraccini, Davide Giannuzzi
Affiliations: Fondazione Bruno Kessler; Università di Padova; Cargill Geneve; Alkemy; Affiliation 5
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) systems are usually defined by the combination of a generator and a retrieval component that extracts textual context from a knowledge base to answer user queries. However, such basic implementations exhibit several limitations, including noisy or suboptimal retrieval, misuse of retrieval for out-of-scope queries, weak query-document matching, and variability or cost associated with the generator. These shortcomings have motivated the development of “Enhanced” RAG, where dedicated modules are introduced to address specific weaknesses in the workflow. More recently, the growing self-reflective capabilities of Large Language Models (LLMs) have enabled a new paradigm, which we refer to as “Agentic” RAG. In this approach, the LLM orchestrates the entire process-deciding which actions to perform, when to perform them, and whether to iterate-thereby reducing reliance on fixed, manually engineered modules. Despite the rapid adoption of both paradigms, it remains unclear which approach is preferable under which conditions. In this work, we conduct an extensive, empirically driven evaluation of Enhanced and Agentic RAG across multiple scenarios and dimensions. Our results provide practical insights into the trade-offs between the two paradigms, offering guidance on selecting the most effective RAG design for real-world applications, considering both costs and performance.

[NLP-11] Emotional Support Evaluation Framework via Controllable and Diverse Seeker Simulator

[Quick Read]: This paper addresses two limitations of the help-seeker simulators currently used to evaluate emotional support chatbots: they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and they lack controllability over specific seeker profiles. The key to the solution is a controllable seeker simulator driven by nine psychological and linguistic features underlying seeker behavior, trained on authentic Reddit conversations with a Mixture-of-Experts (MoE) architecture that maps distinct seeker behaviors into specialized parameter subspaces, enabling fine-grained control and behavioral diversity. The simulator achieves superior profile adherence and diversity compared to existing approaches, and evaluating seven prominent supporter models with it uncovers previously obscured performance degradations, confirming its value for more faithful, stress-tested evaluation.

Link: https://arxiv.org/abs/2601.07698
Authors: Chaewon Heo, Cheyon Jin, Yohan Jo
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As emotional support chatbots have recently gained significant traction across both research and industry, a common evaluation strategy has emerged: use help-seeker simulators to interact with supporter chatbots. However, current simulators suffer from two critical limitations: (1) they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and (2) they lack the controllability required to simulate specific seeker profiles. To address these challenges, we present a controllable seeker simulator driven by nine psychological and linguistic features that underpin seeker behavior. Using authentic Reddit conversations, we train our model via a Mixture-of-Experts (MoE) architecture, which effectively differentiates diverse seeker behaviors into specialized parameter subspaces, thereby enhancing fine-grained controllability. Our simulator achieves superior profile adherence and behavioral diversity compared to existing approaches. Furthermore, evaluating 7 prominent supporter models with our system uncovers previously obscured performance degradations. These findings underscore the utility of our framework in providing a more faithful and stress-tested evaluation for emotional support chatbots.

[NLP-12] Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task

[Quick Read]: This paper addresses the evaluation of "reasoning" in large language models (LLMs), distinguishing meta-level reasoning (planning the intermediate steps and tool choices needed to solve a task) from object-level reasoning (executing those steps). The key to the solution is a novel question answering task built around geopolitical indicator values for various countries and years, which requires decomposing questions, retrieving data, and performing mathematical operations; "essential actions" serve as a reference against which models' tool-call outputs are compared, enabling measurement of reasoning strength beyond final-answer accuracy. The analysis shows that LLMs exhibit good meta-level reasoning on this task but remain flawed in aspects of task understanding and numerical computation.

Link: https://arxiv.org/abs/2601.07696
Authors: Nick Ferguson, Alan Bundy, Kwabena Nuamah
Affiliations: University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in Large Language Models (LLMs) are increasingly focused on “reasoning” ability, a concept with many overlapping definitions in the LLM discourse. We take a more structured approach, distinguishing meta-level reasoning (denoting the process of reasoning about intermediate steps required to solve a task) from object-level reasoning (which concerns the low-level execution of the aforementioned steps.) We design a novel question answering task, which is based around the values of geopolitical indicators for various countries over various years. Questions require breaking down into intermediate steps, retrieval of data, and mathematical operations over that data. The meta-level reasoning ability of LLMs is analysed by examining the selection of appropriate tools for answering questions. To bring greater depth to the analysis of LLMs beyond final answer accuracy, our task contains ‘essential actions’ against which we can compare the tool call output of LLMs to infer the strength of reasoning ability. We find that LLMs demonstrate good meta-level reasoning on our task, yet are flawed in some aspects of task understanding. We find that n-shot prompting has little effect on accuracy; error messages encountered do not often deteriorate performance; and provide additional evidence for the poor numeracy of LLMs. Finally, we discuss the generalisation and limitation of our findings to other task domains.

[NLP-13] Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

[Quick Read]: This paper addresses the computational and memory cost of the key-value (KV) cache in large language model (LLM) inference, and in particular the poor task adaptivity of existing layer-wise token pruning methods, whose accuracy deteriorates sharply on harder tasks such as KV retrieval. The key to the solution is ASL, a training-free Adaptive Selection Layer mechanism that dynamically chooses the pruning (selection) layer based on the variance of token ranks ordered by attention score, balancing performance across tasks while meeting a user-specified KV cache budget. ASL operates during the prefilling stage and can be combined with existing KV cache compression methods such as SnapKV to optimize the decoding stage.

Link: https://arxiv.org/abs/2601.07667
Authors: Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
Affiliations: Osaka University; SB Intuitions; Nagoya University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Source code is available at this https URL

Abstract:Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre-defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
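
A hedged sketch of the layer-selection signal as we understand it from the abstract: rank tokens by attention score at each layer and prefer a layer after which those rankings stabilize (low rank variance). ASL's exact scoring may differ; everything below is our hypothetical reading.

```python
import numpy as np

def choose_selection_layer(attn_scores, window=3):
    """attn_scores: (num_layers, num_tokens) per-layer attention mass per token.

    Hypothetical reading of ASL's signal: compute each token's rank (by
    attention score) per layer, then pick the layer where ranks over a
    trailing window of layers have the lowest mean variance.
    """
    # rank 0 = highest-attention token, computed independently per layer
    ranks = np.argsort(np.argsort(-attn_scores, axis=1), axis=1)
    best_layer, best_var = 0, float("inf")
    for l in range(1, ranks.shape[0]):
        var = ranks[max(0, l - window + 1):l + 1].var(axis=0).mean()
        if var < best_var:
            best_layer, best_var = l, var
    return best_layer
```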

[NLP-14] Reasoning Models Will Blatantly Lie About Their Reasoning

[Quick Read]: This paper examines whether large reasoning models (LRMs) are transparent about relying on hints embedded in the prompt when answering multiple choice questions: asked directly whether they used such hints, models deny it even when experiments clearly show they did. The key contribution is a set of controlled experiments demonstrating that LRMs not only fail to volunteer their reliance on hinted prompt content but will flatly and systematically deny it, even when directly asked to reflect on unusual prompt content and even when allowed to use hints, exposing serious limitations of Chain-of-Thought (CoT) based monitoring and interpretability.

Link: https://arxiv.org/abs/2601.07663
Authors: William Walden
Affiliations: Johns Hopkins University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:It has been shown that Large Reasoning Models (LRMs) may not say what they think: they do not always volunteer information about how certain parts of the input influence their reasoning. But it is one thing for a model to omit such information and another, worse thing to lie about it. Here, we extend the work of Chen et al. (2025) to show that LRMs will do just this: they will flatly deny relying on hints provided in the prompt in answering multiple choice questions – even when directly asked to reflect on unusual (i.e. hinted) prompt content, even when allowed to use hints, and even though experiments show them to be using the hints. Our results thus have discouraging implications for CoT monitoring and interpretability.

[NLP-15] Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends

[Quick Read]: This paper addresses inconsistencies and validity concerns in how Natural Language Generation (NLG) evaluation has evolved in recent years, particularly the divergence among widely used automatic metrics, LLM-as-a-judge (LaaJ) methods, and human evaluation. The key to the solution is an automatic information extraction scheme applied to metadata from roughly 14,000 NLG papers across four major conferences (ACL, EMNLP, NAACL, INLG), revealing how the three evaluation modes (human evaluation, traditional metrics, LaaJ) have evolved and diverged across tasks. Three core phenomena emerge: task divergence, metric inertia, and misalignment between human and LaaJ signals, from which the authors derive practical recommendations for improving the rigor of future NLG evaluation.

Link: https://arxiv.org/abs/2601.07648
Authors: Jing Yang, Nils Feldhus, Salar Mohtaj, Leonhard Hennig, Qianli Wang, Eleni Metheniti, Sherzod Hakimov, Charlott Jakob, Veronika Solopova, Konrad Rieck, David Schlangen, Sebastian Möller, Vera Schmitt
Affiliations: Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data; German Research Center for Artificial Intelligence (DFKI), Berlin; ANITI; University of Potsdam; CERTAIN
Subjects: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:Despite advances in Natural Language Generation (NLG), evaluation remains challenging. Although various new metrics and LLM-as-a-judge (LaaJ) methods are proposed, human judgment persists as the gold standard. To systematically review how NLG evaluation has evolved, we employ an automatic information extraction scheme to gather key information from NLG papers, focusing on different evaluation methods (metrics, LaaJ and human evaluation). With extracted metadata from 14,171 papers across four major conferences (ACL, EMNLP, NAACL, and INLG) over the past six years, we reveal several critical findings: (1) Task Divergence: While Dialogue Generation demonstrates a rapid shift toward LaaJ (40% in 2025), Machine Translation remains locked into n-gram metrics, and Question Answering exhibits a substantial decline in the proportion of studies conducting human evaluation. (2) Metric Inertia: Despite the development of semantic metrics, general-purpose metrics (e.g., BLEU, ROUGE) continue to be widely used across tasks without empirical justification, often lacking the discriminative power to distinguish between specific quality criteria. (3) Human-LaaJ Divergence: Our association analysis challenges the assumption that LLMs act as mere proxies for humans; LaaJ and human evaluations prioritize very different signals, and explicit validation is scarce (8% of papers comparing the two), with only moderate to low correlation. Based on these observations, we derive practical recommendations to improve the rigor of future NLG evaluation.

[NLP-16] PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

[Quick Read]: This paper addresses the fact that multimodal instruction fine-tuning of Multimodal Large Language Models (MLLMs) inadvertently degrades the text reasoning ability inherited from the base language model, which in turn harms multimodal performance. The key to the training-free solution is a layer-wise vision-token masking analysis that reveals a common three-stage pattern in MLLMs (early modality separation, mid-stage modality alignment, late-stage modality degradation), on top of which a plateau-guided model merging method selectively injects base language model parameters into the MLLM, restoring text reasoning while improving multimodal performance. Attention-based analysis further shows that merging shifts attention from diffuse, scattered patterns to focused, task-relevant visual regions, improving representational efficiency.

Link: https://arxiv.org/abs/2601.07645
Authors: Zijing Wang, Yongkang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, Hinrich Schütze
Affiliations: Northeastern University, China; CIS, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Germany
Subjects: Computation and Language (cs.CL)
Comments: under review

Abstract:Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text’s reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework to mitigate this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in multimodal large language models: early-modal separation, mid-modal alignment, and late-modal degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results based on five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is on this https URL.
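
A minimal sketch of selective, layer-wise weight interpolation of the kind the merging step suggests (the layer indices and mixing weight alpha below are assumptions; PlaM derives its layer choice from the plateau analysis):

```python
import torch

@torch.no_grad()
def merge_layers(mllm, base_lm, layers_to_merge, alpha=0.5):
    """Interpolate selected transformer layers of an MLLM toward its base LM.

    layers_to_merge: layer indices chosen by the plateau analysis (assumed given).
    alpha: fraction of base-LM weight injected (illustrative value).
    """
    mllm_sd, base_sd = mllm.state_dict(), base_lm.state_dict()
    for name, param in mllm_sd.items():
        # match parameters belonging to a selected layer, e.g. "layers.17."
        if any(f"layers.{i}." in name for i in layers_to_merge) and name in base_sd:
            param.copy_((1 - alpha) * param + alpha * base_sd[name])
    return mllm
```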

[NLP-17] Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning

[Quick Read]: This paper targets a core bottleneck of current LLM-based agents in scientific domains: their reliance on static, pre-defined tool libraries, which fundamentally fails in science, where tools are sparse, heterogeneous, and intrinsically incomplete and research needs are open-ended and dynamic. The key to the solution is Test-Time Tool Evolution (TTE), a new paradigm that lets agents autonomously synthesize, verify, and iteratively evolve executable tools during inference, turning tools from fixed resources into problem-driven artifacts and overcoming the rigidity and long-tail limitations of static libraries.

Link: https://arxiv.org/abs/2601.07641
Authors: Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, Dongzhan Zhou
Affiliations: Shanghai Artificial Intelligence Laboratory; Fudan University; Xiamen University; University of Macau; Tsinghua University; Hangzhou Dianzi University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open-ended scientific world. Existing LLM-based agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational tools. The code and benchmark have been released at this https URL.
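
A highly simplified sketch of a synthesize-verify-evolve loop at inference time (our illustration, not TTE's implementation; `llm_write_tool` is a hypothetical code-generating call, and a real system would sandbox the `exec`):

```python
def evolve_tool(task, tests, llm_write_tool, max_rounds=3):
    """Synthesize a tool, verify it on checks, and evolve it on failure.

    tests: list of (args, expected) pairs used as the verification signal.
    llm_write_tool(task, feedback) -> Python source defining `tool(...)`.
    """
    feedback = None
    for _ in range(max_rounds):
        source = llm_write_tool(task, feedback)
        namespace = {}
        try:
            exec(source, namespace)          # unsafe outside a sandbox!
            tool = namespace["tool"]
            failures = [(a, e) for a, e in tests if tool(*a) != e]
        except Exception as err:
            feedback = f"crashed: {err}"     # evolve using the error message
            continue
        if not failures:
            return tool                      # verified: keep the evolved tool
        feedback = f"failed cases: {failures}"
    return None
```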

[NLP-18] Integrating Machine-Generated Short Descriptions into the Wikipedia Android App: A Pilot Deployment of Descartes

[Quick Read]: This paper addresses the uneven coverage of Wikipedia short descriptions across languages and topics, which degrades the user experience. The key to the solution is Descartes, a multilingual model for generating short descriptions, validated through a pilot deployment in the Wikipedia Android app spanning 12 languages, over 3,900 articles, and 375 editors. In the pilot, 90% of accepted Descartes-generated descriptions were rated at least 3 out of 5 in quality, with average quality comparable to human-written ones; editors adopted suggestions both directly and with modifications, while revert and report rates remained low, indicating the model can help editors reduce content gaps provided technical, design, and community guardrails are in place.

Link: https://arxiv.org/abs/2601.07631
Authors: Marija Šakota, Dmitry Brant, Cooltey Feng, Shay Nowick, Amal Ramadan, Robin Schoenbaechler, Joseph Seddon, Jazmin Tanner, Isaac Johnson, Robert West
Affiliations: EPFL; Wikimedia Foundation
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Short descriptions are a key part of the Wikipedia user experience, but their coverage remains uneven across languages and topics. In previous work, we introduced Descartes, a multilingual model for generating short descriptions. In this report, we present the results of a pilot deployment of Descartes in the Wikipedia Android app, where editors were offered suggestions based on outputs from Descartes while editing short descriptions. The experiment spanned 12 languages, with over 3,900 articles and 375 editors participating. Overall, 90% of accepted Descartes descriptions were rated at least 3 out of 5 in quality, and their average ratings were comparable to human-written ones. Editors adopted machine suggestions both directly and with modifications, while the rate of reverts and reports remained low. The pilot also revealed practical considerations for deployment, including latency, language-specific gaps, and the need for safeguards around sensitive topics. These results indicate that Descartes’s short descriptions can support editors in reducing content gaps, provided that technical, design, and community guardrails are in place.

[NLP-19] Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

[Quick Read]: This paper addresses the lack of scalable, verifiable ways to evaluate how well large language models (LLMs) judge and forecast the quality of scientific ideas, especially when such judgments concern outcomes that only become observable later (e.g., citations or shifts in research agendas). The key to the solution is PoT (Proof of Time), a semi-verifiable benchmarking framework that freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff downstream signals (such as citations or peer-review awards), enabling verifiable evaluation once ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment in scientific idea judgment. PoT also provides a controlled testbed for comparing tool-using agents against non-agent baselines under different prompting strategies and budget scales, revealing that the benefit of tool use is strongly task-dependent.

Link: https://arxiv.org/abs/2601.07606
Authors: Bingyang Ye, Shan Chen, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman
Affiliations: Harvard University; Mass General Brigham; Boston Children’s Hospital; Brandeis University; Yale University; Johns Hopkins University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: under review

Abstract:Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models’ judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers’ agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.

[NLP-20] GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation

[Quick Read]: This paper addresses a bottleneck in applying generative AI to hardware verification: large language models (LLMs) struggle to generate specification-compliant test stimuli for RTL (register-transfer level) designs, limiting automated sub-unit testing in real semiconductor design workflows. The key to the solution is a two-stage framework that decouples test plan generation from testbench execution, together with a novel reinforcement learning method, GRPO with State Mutation (GRPO-SMu), which uses a tree-based branching mutation strategy to build training data containing equivalent and mutated structures, strengthening input-state exploration and reasoning accuracy. The method lifts a 7B-parameter model's pass rate on golden RTL designs to 33.3%, a 17.6-point absolute improvement over the baseline, demonstrating the value of specialized training paradigms for LLM reasoning in hardware verification.

Link: https://arxiv.org/abs/2601.07593
Authors: Dimple Vijay Kochar, Nathaniel Pinckney, Guan-Ting Liu, Chia-Tung Ho, Chenhui Deng, Haoxing Ren, Brucek Khailany
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:RTL design often relies heavily on ad-hoc testbench creation early in the design cycle. While large language models (LLMs) show promise for RTL code generation, their ability to reason about hardware specifications and generate targeted test plans remains largely unexplored. We present the first systematic study of LLM reasoning capabilities for RTL verification stimuli generation, establishing a two-stage framework that decomposes test plan generation from testbench execution. Our benchmark reveals that state-of-the-art models, including DeepSeek-R1 and Claude-4.0-Sonnet, achieve only 15.7-21.7% success rates on generating stimuli that pass golden RTL designs. To improve LLM generated stimuli, we develop a comprehensive training methodology combining supervised fine-tuning with a novel reinforcement learning approach, GRPO with State Mutation (GRPO-SMu), which enhances exploration by varying input mutations. Our approach leverages a tree-based branching mutation strategy to construct training data comprising equivalent and mutated trees, moving beyond linear mutation approaches to provide rich learning signals. Training on this curated dataset, our 7B parameter model achieves a 33.3% golden test pass rate and a 13.9% mutation detection rate, representing a 17.6% absolute improvement over baseline and outperforming much larger general-purpose models. These results demonstrate that specialized training methodologies can significantly enhance LLM reasoning capabilities for hardware verification tasks, establishing a foundation for automated sub-unit testing in semiconductor design workflows.

[NLP-21] ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents

[Quick Read]: This paper addresses semantic fragmentation and imprecise context localization in dialogue agents' long-term memory: existing methods use fixed-granularity memory units and flat retrieval, which break semantic integrity and ignore the structural cues of discourse. The key innovations of the proposed ES-Mem framework are: (1) a dynamic event segmentation module, inspired by Event Segmentation Theory, that partitions long-term interactions into semantically coherent event units with distinct boundaries; and (2) a hierarchical memory architecture that uses boundary semantics to anchor specific episodic memories, enabling more precise context localization. The design significantly improves the structure and retrievability of memory, outperforming baselines on two memory benchmarks.

Link: https://arxiv.org/abs/2601.07582
Authors: Huhai Zou, Tianhao Sun, Chuanjiang He, Yu Tian, Zhenyang Li, Li Jin, Nayu Liu, Jiang Zhong, Kaiwen Wei
Affiliations: Chongqing University; Tsinghua University; HKUST; Chinese Academy of Sciences; Tiangong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Memory is critical for dialogue agents to maintain coherence and enable continuous adaptation in long-term interactions. While existing memory mechanisms offer basic storage and retrieval capabilities, they are hindered by two primary limitations: (1) rigid memory granularity often disrupts semantic integrity, resulting in fragmented and incoherent memory units; (2) prevalent flat retrieval paradigms rely solely on surface-level semantic similarity, neglecting the structural cues of discourse required to navigate and locate specific episodic contexts. To mitigate these limitations, drawing inspiration from Event Segmentation Theory, we propose ES-Mem, a framework incorporating two core components: (1) a dynamic event segmentation module that partitions long-term interactions into semantically coherent events with distinct boundaries; (2) a hierarchical memory architecture that constructs multi-layered memories and leverages boundary semantics to anchor specific episodic memory for precise context localization. Evaluations on two memory benchmarks demonstrate that ES-Mem yields consistent performance gains over baseline methods. Furthermore, the proposed event segmentation module exhibits robust applicability on dialogue segmentation datasets.
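
One common way to implement dynamic event segmentation, and a plausible reading of the module here, is to place a boundary wherever the semantic similarity between adjacent utterance embeddings drops below a threshold (the encoder and threshold are our assumptions, not the paper's):

```python
import numpy as np

def segment_events(utterance_embs, threshold=0.55):
    """Split a dialogue into events at drops in adjacent cosine similarity.

    utterance_embs: (num_utterances, dim) array of utterance embeddings.
    Returns a list of (start, end) index spans, one per event.
    """
    e = utterance_embs / np.linalg.norm(utterance_embs, axis=1, keepdims=True)
    sims = (e[:-1] * e[1:]).sum(axis=1)      # cosine sim of consecutive turns
    boundaries = [i + 1 for i, s in enumerate(sims) if s < threshold]
    starts = [0] + boundaries
    ends = boundaries + [len(utterance_embs)]
    return list(zip(starts, ends))
```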

[NLP-22] A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models

[Quick Read]: This paper addresses how to effectively fuse text, audio, and visual modalities to support both discrete emotion recognition and continuous sentiment analysis. The core of the proposed unified framework, EGMF, is expert-guided multimodal fusion combined with large language models (LLMs): three specialized expert networks (a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies) are adaptively integrated through hierarchical dynamic gating for context-aware feature selection. The enhanced multimodal representations are then injected into LLMs via pseudo-token injection and prompt-based conditioning, allowing a single generative framework to handle both classification and regression through natural language generation, with LoRA fine-tuning for computational efficiency.

Link: https://arxiv.org/abs/2601.07565
Authors: Jiaqi Qiao, Xiujuan Xu, Xinran Li, Yu Liu
Affiliations: Dalian University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks–a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies–adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.
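
A minimal sketch of gated mixing over three expert outputs (the dimensions and the single-level gate are our simplifications of the hierarchical gating described in the abstract):

```python
import torch
import torch.nn as nn

class GatedExpertFusion(nn.Module):
    """Fuse three expert representations with an input-conditioned softmax gate."""
    def __init__(self, dim):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.gate = nn.Linear(dim, 3)

    def forward(self, x):                    # x: (batch, dim) fused features
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (b, 3, dim)
        weights = torch.softmax(self.gate(x), dim=-1)            # (b, 3)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (b, dim)
```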

[NLP-23] From RAG to Agentic RAG for Faithful Islamic Question Answering

[Quick Read]: This paper addresses hallucinated and inappropriate answers produced by large language models (LLMs) in Islamic question answering due to a lack of grounding, in particular the failure to abstain when evidence is insufficient; existing MCQ/MRC-style evaluations do not capture these real-world failure modes. The key to the solution is ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers that directly measures hallucination and abstention, together with an end-to-end grounded Islamic modeling suite: 25K Arabic text-grounded SFT reasoning pairs, 5K bilingual preference samples for reward-guided alignment, and a verse-level Qur'an retrieval corpus of about 6,000 ayat. On top of these, an agentic Quran-grounding framework (agentic RAG) uses structured tool calls for iterative evidence seeking and answer revision; experiments across Arabic-centric and multilingual LLMs show that it yields the largest gains beyond standard RAG and stronger Arabic-English robustness even with a small model such as Qwen3 4B.

Link: https://arxiv.org/abs/2601.07528
Authors: Gagan Bhatia, Hamdy Mubarak, Mustafa Jarrar, George Mikros, Fadi Zaraket, Mahmoud Alhirthani, Mutaz Al-Khatib, Logan Cochrane, Kareem Darwish, Rashid Yahiaoui, Firoj Alam
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed a light on this aspect we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally developed an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur’an retrieval corpus of ~6k atomic verses (ayat). Building on these resources, we develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3 4B). We will make the experimental resources and datasets publicly available for the community.

[NLP-24] Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

[Quick Read]: This paper addresses the trade-off between natural and structured generation: free-form generation supports rich reasoning but yields outputs that are hard to parse and verify, while constrained decoding guarantees consistent, parsable formats but can suppress the model's reasoning. The key to the solution is a hybrid strategy that lets the LLM reason freely until specific trigger tokens are generated, then switches to structured generation, preserving the expressive power of natural-language reasoning while ensuring reliably structured outputs. Experiments on classification and reasoning tasks show gains of up to 27% in accuracy over pure natural generation, at a cost of only 10-20 extra tokens.

Link: https://arxiv.org/abs/2601.07525
Authors: Ngoc Trinh Hung Nguyen, Alonso Silva, Laith Zumot, Liubov Tupikina, Armen Aghasaryan, Mehwish Alam
Affiliations: Télécom Paris, Institut Polytechnique de Paris, France; Nokia Bell Labs; Nokia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural generation allows Language Models (LMs) to produce free-form responses with rich reasoning, but the lack of guaranteed structure makes outputs difficult to parse or verify. Structured generation, or constrained decoding, addresses this drawback by producing content in standardized formats such as JSON, ensuring consistency and guaranteed-parsable outputs, but it can inadvertently restrict the model’s reasoning capabilities. In this work, we propose a simple approach that combines the advantages of both natural and structured generation. By allowing LLMs to reason freely until specific trigger tokens are generated, and then switching to structured generation, our method preserves the expressive power of natural language reasoning while ensuring the reliability of structured outputs. We further evaluate our approach on several datasets, covering both classification and reasoning tasks, to demonstrate its effectiveness, achieving a substantial gain of up to 27% in accuracy compared to natural generation, while requiring only a small overhead of 10-20 extra tokens.
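
A schematic of the think-then-constrain control flow (our illustration; `generate_free` and `generate_constrained` are hypothetical wrappers around unconstrained decoding and a grammar/JSON-constrained decoder, and the trigger string is an assumption):

```python
TRIGGER = "FINAL_ANSWER:"   # hypothetical trigger marking the switch point

def think_then_constrain(prompt, generate_free, generate_constrained, schema):
    """Decode freely until the trigger appears, then decode under a schema.

    generate_free(prompt, stop) -> text up to and including `stop`.
    generate_constrained(prompt, schema) -> structured continuation.
    """
    reasoning = generate_free(prompt, stop=TRIGGER)            # unconstrained CoT
    structured = generate_constrained(prompt + reasoning, schema=schema)
    return reasoning, structured
```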

[NLP-25] Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions

[Quick Read]: This paper addresses the inefficiency and limited generalization of reinforcement learning (RL) fine-tuning of vision-language models (VLMs) as multimodal conversational agents (MCAs), caused by the extremely large text-token action space. The key to the solution is learning a compact latent action space: a discrete codebook is constructed via a learning-from-observation mechanism in which future observations are used to estimate current latent actions that can in turn reconstruct those observations. To mitigate the scarcity of paired image-text data, text-only data is additionally exploited through a cross-modal projector that maps text embeddings into the joint image-text space, initialized on paired data and further trained on massive text-only data with a novel cycle consistency loss, improving the coverage and robustness of the latent action space. The resulting method outperforms competitive baselines on two conversation tasks across various RL algorithms.

Link: https://arxiv.org/abs/2601.07516
Authors: Yongqi Li, Hao Lang, Tieyun Qian, Yongbin Li
Affiliations: Wuhan University; Tongyi Lab; Zhongguancun Academy; Alibaba Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Despite showing great enhancement in generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we learn a compact latent action space for RL fine-tuning instead. Specifically, we adopt the learning from observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions that could further be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector for transforming text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent action based method outperforms competitive baselines on two conversation tasks across various RL algorithms.

[NLP-26] High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning

[Quick Read]: This paper addresses the limited representational capacity of parameter-efficient fine-tuning (PEFT), in particular the way the low rank of LoRA-style updates constrains a model's expressive potential during fine-tuning. The key to the solution is SMoA, a high-rank Structured MOdulation Adapter that freezes the pretrained weights and, through a multi-subspace mechanism, selectively amplifies or suppresses important features of the original weights, improving representational capacity and model complexity without a large increase in trainable parameters. SMoA outperforms LoRA and its variants on 10 tasks, with extensive ablation studies validating its effectiveness.

Link: https://arxiv.org/abs/2601.07507
Authors: Yongkang Liu, Xing Li, Mengjie Zhao, Shanru Zhang, Zijing Wang, Qian Li, Shi Feng, Feiliang Ren, Daling Wang, Hinrich Schütze
Affiliations: Northeastern University, China; CIS, LMU Munich, Germany; Shandong University, China; Munich Center for Machine Learning (MCML), Germany
Subjects: Computation and Language (cs.CL)
Comments: under review

Abstract:As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity when compared to full parameter fine-tuning. We present SMoA, a high-rank Structured MOdulation Adapter that uses fewer trainable parameters while maintaining a higher rank, thereby improving the model’s representational capacity and offering improved performance potential. The core idea is to freeze the original pretrained weights and selectively amplify or suppress important features of the original weights across multiple subspaces. The subspace mechanism provides an efficient way to increase the capacity and complexity of a model. We conduct both theoretical analyses and empirical studies on various tasks. Experiment results show that SMoA outperforms LoRA and its variants on 10 tasks, with extensive ablation studies validating its effectiveness.
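
The abstract does not give SMoA's exact parameterization; a hedged sketch of the general idea of multiplicatively modulating frozen weights, in one possible form with per-subspace trainable gains:

```python
import torch
import torch.nn as nn

class ModulatedLinear(nn.Module):
    """Frozen linear layer whose output features are rescaled per subspace.

    One illustrative form of structured modulation (our assumption, not the
    paper's exact design): split the output dimension into `k` subspaces and
    learn one gain per subspace, leaving the pretrained weight untouched.
    """
    def __init__(self, linear: nn.Linear, k=8):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        assert linear.out_features % k == 0
        self.k = k
        self.gains = nn.Parameter(torch.ones(k))     # one trainable gain per subspace

    def forward(self, x):
        y = self.linear(x)                           # (..., out_features)
        g = self.gains.repeat_interleave(y.size(-1) // self.k)
        return y * g                                 # amplify/suppress per subspace
```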

[NLP-27] Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

[Quick Read]: This paper identifies a critical failure mode of LLM-as-a-judge evaluation in reference-conditioned settings: when the provided reference answer conflicts with the judge model's parametric knowledge, the model over-relies on its own internal knowledge and disregards the reference, making scores unreliable. The key to the study is a controlled swapped-reference QA framework that systematically induces reference-belief conflicts by replacing the reference answer with an incorrect entity and pairing original and swapped references with correspondingly aligned candidate answers, quantifying this vulnerability. The failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of the current LLM-as-a-judge paradigm in reference adherence and motivating protocols that enforce stronger alignment with the provided reference.

Link: https://arxiv.org/abs/2601.07506
Authors: Dongryeol Lee, Yerin Hwang, Taegwan Kang, Minwoo Lee, Younhyung Chae, Kyomin Jung
Affiliations: Seoul National University; LG AI Research
Subjects: Computation and Language (cs.CL)
Comments: Under review, 21 pgs, 11 figures, 7 tables

Abstract:While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model’s parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges’ over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.

[NLP-28] KALE: Enhancing Knowledge Manipulation in Large Language Models via Knowledge-aware Learning

[Quick Read]: This paper addresses the limits of knowledge manipulation in large language models (LLMs), i.e., the "known-incorrect" phenomenon where a model explicitly possesses the knowledge relevant to a question yet fails to use it to answer correctly. The key innovations of the proposed KALE (Knowledge-Aware LEarning) framework are: (1) a Knowledge-Induced (KI) data synthesis method that efficiently extracts high-quality multi-hop reasoning paths from knowledge graphs (KGs) as explanatory rationales; and (2) a Knowledge-Aware (KA) fine-tuning paradigm that internalizes rationale-guided reasoning by minimizing the KL divergence between the model's predictions with and without rationales, substantially improving knowledge recall, reasoning, and transfer.

Link: https://arxiv.org/abs/2601.07430
Authors: Qitan Lv, Tianyu Liu, Qiaosheng Zhang, Xingcheng Xu, Chaochao Lu
Affiliations: University of Science and Technology of China; Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the impressive performance of large language models (LLMs) pretrained on vast knowledge corpora, advancing their knowledge manipulation-the ability to effectively recall, reason, and transfer relevant knowledge-remains challenging. Existing methods mainly leverage Supervised Fine-Tuning (SFT) on labeled datasets to enhance LLMs’ knowledge manipulation ability. However, we observe that SFT models still exhibit the known-incorrect phenomenon, where they explicitly possess relevant knowledge for a given question but fail to leverage it for correct answers. To address this challenge, we propose KALE (Knowledge-Aware LEarning)-a post-training framework that leverages knowledge graphs (KGs) to generate high-quality rationales and enhance LLMs’ knowledge manipulation ability. Specifically, KALE first introduces a Knowledge-Induced (KI) data synthesis method that efficiently extracts multi-hop reasoning paths from KGs to generate high-quality rationales for question-answer pairs. Then, KALE employs a Knowledge-Aware (KA) fine-tuning paradigm that enhances knowledge manipulation by internalizing rationale-guided reasoning through minimizing the KL divergence between predictions with and without rationales. Extensive experiments on eight popular benchmarks across six different LLMs demonstrate the effectiveness of KALE, achieving accuracy improvements of up to 11.72% and an average of 4.18%.
zh
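
【代码示意】: KALE 的知识感知(KA)微调通过最小化“带/不带理由(rationale)时预测分布”的 KL 散度来内化推理。以下用 PyTorch 写出该目标函数的一种常见形状;两组 logits 的来源与温度参数均为假设,仅供理解,并非官方实现。

```python
import torch
import torch.nn.functional as F

def ka_distillation_loss(logits_with_rationale, logits_without_rationale, temperature=1.0):
    """KL( p_with || p_without ) 的一种实现形状(示意,非官方代码)。

    两个 logits 张量形状均为 [batch, vocab],分别来自
    “提示中带 rationale”与“不带 rationale”的两次前向(由调用者提供)。
    """
    log_p_teacher = F.log_softmax(logits_with_rationale / temperature, dim=-1)
    log_p_student = F.log_softmax(logits_without_rationale / temperature, dim=-1)
    # F.kl_div 的第一个参数是学生 log 概率;log_target=True 表示目标也是 log 概率
    return F.kl_div(log_p_student, log_p_teacher, reduction="batchmean", log_target=True)

# 用法示意:两次前向在答案位置得到的 logits
logits_a = torch.randn(4, 32000)
logits_b = torch.randn(4, 32000)
print(ka_distillation_loss(logits_a, logits_b).item())
```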

[NLP-29] SAD: A Large-Scale Strategic Argumentative Dialogue Dataset

【速读】: 该论文旨在解决现有论点生成数据集多局限于非交互式、单轮场景的问题,无法充分建模现实世界中多轮论辩对话的复杂性。其解决方案的关键在于构建首个大规模战略型论辩对话数据集(Strategic Argumentative Dialogue, SAD),包含392,822条标注样本,并基于论辩理论为每条话语标注五类论辩策略(支持多策略共存),要求模型在生成论点时需综合考虑对话历史、立场态度及目标策略,从而推动对多轮论辩对话中策略性推理与生成的深入研究。

链接: https://arxiv.org/abs/2601.07423
作者: Yongkang Liu,Jiayang Yu,Mingyang Wang,Yiqun Zhang,Ercong Nie,Shi Feng,Daling Wang,Kaisong Song,Hinrich Schütze
机构: Northeastern University, China (东北大学); CIS, LMU Munich, Germany (慕尼黑路德维希马克西米利安大学计算科学研究所); Munich Center for Machine Learning (MCML), Germany (慕尼黑机器学习中心); Alibaba Group, Hangzhou, China (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Argumentation generation has attracted substantial research interest due to its central role in human reasoning and decision-making. However, most existing argumentative corpora focus on non-interactive, single-turn settings, either generating arguments from a given topic or refuting an existing argument. In practice, argumentation is often realized as multi-turn dialogue, where speakers defend their stances and employ diverse argumentative strategies to strengthen persuasiveness. To support deeper modeling of argumentation dialogue, we present the first large-scale Strategic Argumentative Dialogue dataset, SAD, consisting of 392,822 examples. Grounded in argumentation theories, we annotate each utterance with five strategy types, allowing multiple strategies per utterance. Unlike prior datasets, SAD requires models to generate contextually appropriate arguments conditioned on the dialogue history, a specified stance on the topic, and targeted argumentation strategies. We further benchmark a range of pretrained generative models on SAD and present in-depth analysis of strategy usage patterns in argumentation.
zh
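
【代码示意】: 结合上文对 SAD 的描述(对话历史 + 立场 + 目标策略 → 论点),下面用一条虚构样本展示此类数据可能的组织形式;字段名与取值均为示意,并非数据集的真实 schema。

```python
# 一条假想的策略型论辩对话样本(字段名仅作示意,非 SAD 官方格式)
sad_example = {
    "topic": "Should social media platforms verify user identities?",
    "stance": "pro",
    "dialogue_history": [
        {"speaker": "A", "text": "Anonymity protects whistleblowers."},
        {"speaker": "B", "text": "It also shields coordinated harassment."},
    ],
    # 每条回复可同时携带多种论辩策略标签
    "target_strategies": ["evidence", "rebuttal"],
    "reference_response": "Verified accounts cut bot-driven abuse, "
                          "while whistleblowers can still use protected channels.",
}
print(sad_example["target_strategies"])
```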

[NLP-30] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)频繁产生幻觉(hallucination)的问题,特别是揭示其内部状态中蕴含的真理性线索(truthfulness cues)的来源与作用机制。解决方案的关键在于识别并分离出两种独立的信息路径:一是依赖问题-答案信息流的“问题锚定路径”(Question-Anchored pathway),二是从生成答案本身提取自包含证据的“答案锚定路径”(Answer-Anchored pathway)。通过注意力屏蔽(attention knockout)和token修补(token patching)实验,作者验证并解耦了这两种机制,并进一步发现它们与LLM的知识边界密切相关,且内部表示能够区分这两类信号。基于此,论文提出了两个增强幻觉检测性能的应用方案,为构建更可靠、具备自我意识的生成系统提供了新思路。

链接: https://arxiv.org/abs/2601.07422
作者: Wen Luo,Guangyue Peng,Wei Li,Shaohang Wei,Feifan Song,Liang Wang,Nan Yang,Xingxing Zhang,Jing Jin,Furu Wei,Houfeng Wang
机构: Peking University (北京大学); Microsoft Research Asia (微软亚洲研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. First, we validate and disentangle these pathways through attention knockout and token patching. Afterwards, we uncover notable and intriguing properties of these two mechanisms. Further experiments reveal that (1) the two mechanisms are closely associated with LLM knowledge boundaries; and (2) internal representations are aware of their distinctions. Finally, building on these insightful findings, two applications are proposed to enhance hallucination detection performance. Overall, our work provides new insight into how LLMs internally encode truthfulness, offering directions for more reliable and self-aware generative systems.
zh

[NLP-31] SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中能力编码机制不明确的问题,尤其在高风险应用场景(如医疗、法律和自主决策系统)中,传统粗粒度的模块识别方法(如梯度归因或激活分析)无法准确刻画能力在参数空间中的分布特性,因为单一能力可能由多个模块共同承担,而单个模块也可能参与多种能力的实现。解决方案的关键在于提出SCALPEL框架,其核心思想是将能力表示为低秩参数子空间而非离散模块,并通过低秩参数编辑实现对特定能力的选择性移除,同时保持其他能力不变;具体而言,通过训练LoRA适配器来降低模型区分正确与错误答案的能力,从而精准定位并隔离出负责特定能力的低秩子空间,实验表明该方法可在保留通用语言建模性能的同时有效移除目标能力,揭示了能力在参数空间中的低秩结构和分布式编码特性。

链接: https://arxiv.org/abs/2601.07411
作者: Zihao Fu,Xufeng Duan,Zhenguang G. Cai
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models excel across diverse domains, yet their deployment in healthcare, legal systems, and autonomous decision-making remains limited by incomplete understanding of their internal mechanisms. As these models integrate into high-stakes systems, understanding how they encode capabilities has become fundamental to interpretability research. Traditional approaches identify important modules through gradient attribution or activation analysis, assuming specific capabilities map to specific components. However, this oversimplifies neural computation: modules may contribute to multiple capabilities simultaneously, while single capabilities may distribute across multiple modules. These coarse-grained analyses fail to capture fine-grained, distributed capability encoding. We present SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework representing capabilities as low-rank parameter subspaces rather than discrete modules. Our key insight is that capabilities can be characterized by low-rank modifications distributed across layers and modules, enabling precise capability removal without affecting others. By training LoRA adapters to reduce the model’s ability to distinguish correct from incorrect answers while preserving general language modeling quality, SCALPEL identifies low-rank representations responsible for particular capabilities while remaining disentangled from others. Experiments across diverse capability and linguistic tasks from BLiMP demonstrate that SCALPEL successfully removes target capabilities while preserving general capabilities, providing fine-grained insights into capability distribution across parameter space. Results reveal that capabilities exhibit low-rank structure and can be selectively ablated through targeted parameter-space interventions, offering nuanced understanding of capability encoding in LLMs.
zh
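
【代码示意】: SCALPEL 的核心目标是“降低区分正误答案的能力,同时保持通用语言建模质量”。以下以纯 PyTorch 张量写出该双目标损失的一种可能形状;损失组合方式与超参数 beta 为笔者假设,并非官方实现。

```python
import torch

def scalpel_ablation_loss(logp_correct, logp_incorrect, lm_loss_edited, lm_loss_base, beta=1.0):
    """能力消融目标的示意(假设性实现)。

    logp_correct / logp_incorrect: 编辑后模型对正确/错误答案的对数似然 [batch]。
    lm_loss_edited / lm_loss_base: 编辑后与原始模型在通用语料上的 LM 损失(标量)。
    第一项压缩正误答案之间的似然差(消除目标能力),
    第二项惩罚通用语言建模质量相对原模型的退化。
    """
    margin = logp_correct - logp_incorrect      # 模型仍能区分正误的程度
    ablation = margin.clamp(min=0.0).mean()     # 仅惩罚“仍然分得出来”的部分
    preserve = (lm_loss_edited - lm_loss_base).clamp(min=0.0)
    return ablation + beta * preserve

# 用法示意
lc = torch.tensor([-1.2, -0.8, -2.0])
li = torch.tensor([-3.0, -2.5, -1.9])
print(scalpel_ablation_loss(lc, li, torch.tensor(2.31), torch.tensor(2.28)).item())
```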

[NLP-32] Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

【速读】: 该论文旨在解决Group Relative Policy Optimization (GRPO)在推理任务中因采用粗粒度信用分配机制而导致的性能瓶颈问题,即标准GRPO将群体奖励均匀传播至序列中的每个token,忽视了单个推理步骤对最终答案的实际贡献差异。解决方案的关键在于提出Outcome-grounded Advantage Reshaping (OAR),一种细粒度信用分配机制,通过重新分配优势值来反映每个token对模型最终输出的影响程度。OAR具体通过两种互补策略实现:(1) OAR-P基于反事实token扰动估计结果敏感性,提供高保真归因信号;(2) OAR-G利用输入梯度敏感性代理,在一次反向传播中近似影响信号,计算开销极低。二者结合保守的双层优势重塑方案,在抑制低影响token的同时增强关键token的优势,同时保持整体优势质量不变,从而显著提升无评论器(critic-free)大语言模型(LLM)的推理能力。

链接: https://arxiv.org/abs/2601.07408
作者: Ziheng Li,Liu Kang,Feng Xiao,Luxi Xing,Qingyi Si,Zhuoran Li,Weikang Gong,Deqing Yang,Yanghua Xiao,Hongcheng Guo
机构: Fudan University (复旦大学); XingYun lab, HUJING Digital Media & Entertainment Group; Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model’s final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.
zh
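
【代码示意】: OAR-G 以输入梯度范数作为 token 影响力的代理,并在保持优势总量(advantage mass)不变的前提下做双层重塑。以下为一个简化示意;中位数阈值与放缩系数均为假设,并非论文的具体设定。

```python
import torch

def oar_reshape(advantages, grad_norms, boost=1.5, damp=0.5):
    """按 token 重要性重塑优势并保持总量不变(示意,非官方实现)。

    advantages: [T],GRPO 下通常是同一组级优势在各 token 上的复制。
    grad_norms: [T],各 token 输入梯度范数,作为对最终答案影响力的代理。
    高于中位数的 token 被放大,低于中位数的被压缩,最后重新归一化总量。
    """
    med = grad_norms.median()
    scale = torch.where(grad_norms >= med,
                        torch.full_like(grad_norms, boost),
                        torch.full_like(grad_norms, damp))
    reshaped = advantages * scale
    # 保持优势总量不变,避免改变整体更新幅度(此处假设优势为正)
    reshaped = reshaped * (advantages.sum() / (reshaped.sum() + 1e-8))
    return reshaped

adv = torch.full((6,), 0.8)
g = torch.tensor([0.1, 0.9, 0.2, 1.5, 0.05, 0.7])
print(oar_reshape(adv, g))
```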

[NLP-33] On Narrative: The Rhetorical Mechanisms of Online Polarisation

【速读】: 该论文旨在解决叙事极化(narrative polarisation)如何在缺乏直接人际互动的情况下,由不同政治立场群体共同构建并协商对现实的对立解释,以及这些叙事是否能在群体间传播的问题。解决方案的关键在于将叙事极化概念形式化,并基于结构叙事理论(structural narrative theory),借助大语言模型(large language model)从212个YouTube视频和90,029条评论中提取核心行动者所承担的叙事角色,从而量化和比较两组对立信息环境中的叙事模式。结果表明,尽管视频呈现高度极化的叙事结构,评论区虽在表层降低了极化程度,但在深层叙事主题上仍存在显著差异,揭示了叙事极化在多层语境下的复杂性。

链接: https://arxiv.org/abs/2601.07398
作者: Jan Elfes,Marco Bastos,Luca Maria Aiello
机构: University College Dublin (都柏林大学); IT University of Copenhagen (哥本哈根信息技术大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Polarisation research has demonstrated how people cluster in homogeneous groups with opposing opinions. However, this effect emerges not only through interaction between people, limiting communication between groups, but also between narratives, shaping opinions and partisan identities. Yet, how polarised groups collectively construct and negotiate opposing interpretations of reality, and whether narratives move between groups despite limited interactions, remains unexplored. To address this gap, we formalise the concept of narrative polarisation and demonstrate its measurement in 212 YouTube videos and 90,029 comments on the Israeli-Palestinian conflict. Based on structural narrative theory and implemented through a large language model, we extract the narrative roles assigned to central actors in two partisan information environments. We find that while videos produce highly polarised narratives, comments significantly reduce narrative polarisation, harmonising discourse on the surface level. However, on a deeper narrative level, recurring narrative motifs reveal additional differences between partisan groups.
zh

[NLP-34] GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap ACL2026

【速读】: 该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)研究中导航指令评估的持续挑战,即传统基于参考的指标(如BLEU和ROUGE)无法有效衡量指令的功能性实用性——即是否能成功引导导航者到达目标位置。现有方法依赖高保真视觉模拟器作为评估器,存在许可限制、计算成本高以及感知误差干扰语言质量评估等问题。解决方案的关键在于提出GROKE(Graph-based Reasoning over OSM Knowledge for instruction Evaluation),一个无需训练、不依赖视觉信息的分层大语言模型(Large Language Model, LLM)框架,利用OpenStreetMap(OSM)数据构建结构化空间知识图谱进行推理。其核心创新包括:采用JSON和文本格式的空间信息表示优于网格和视觉图结构,并通过子指令规划与拓扑图导航相结合的分层架构,在Map2Seq数据集上将导航误差降低68.5%,从而以可扩展且可解释的方式实现对导航指令功能性的评估。

链接: https://arxiv.org/abs/2601.07375
作者: Farzad Shami,Subhrasankha Dey,Nico Van de Weghe,Henrikki Tenkanen
机构: Aalto University (阿尔托大学); Ghent University (根特大学)
类目: Computation and Language (cs.CL)
备注: Under Review for ACL 2026

点击查看摘要

Abstract:The evaluation of navigation instructions remains a persistent challenge in Vision-and-Language Navigation (VLN) research. Traditional reference-based metrics such as BLEU and ROUGE fail to capture the functional utility of spatial directives, specifically whether an instruction successfully guides a navigator to the intended destination. Although existing VLN agents could serve as evaluators, their reliance on high-fidelity visual simulators introduces licensing constraints and computational costs, and perception errors further confound linguistic quality assessment. This paper introduces GROKE (Graph-based Reasoning over OSM Knowledge for instruction Evaluation), a vision-free, training-free, hierarchical LLM-based framework for evaluating navigation instructions using OpenStreetMap data. Through systematic ablation studies, we demonstrate that structured JSON and textual formats for spatial information substantially outperform grid-based and visual graph representations. Our hierarchical architecture combines sub-instruction planning with topological graph navigation, reducing navigation error by 68.5% compared to heuristic and sampling baselines on the Map2Seq dataset. The agent’s execution success, trajectory fidelity, and decision patterns serve as proxy metrics for functional navigability given OSM-visible landmarks and topology, establishing a scalable and interpretable evaluation paradigm without visual dependencies. Code and data are available at this https URL.
zh

[NLP-35] Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

【速读】: 该论文旨在解决Transformer架构在扩展容量时缺乏原生知识查找机制的问题,导致其必须通过冗余的神经计算来模拟检索过程,从而效率低下。解决方案的关键在于引入“条件记忆”(conditional memory)作为与MoE(Mixture-of-Experts)并行的稀疏性维度,通过Engram模块实现O(1)时间复杂度的知识查找:Engram现代化了传统的N-gram嵌入技术,使静态知识可被高效定位。研究进一步揭示了一个U型缩放规律(U-shaped scaling law),指导如何在神经计算(MoE)与静态记忆(Engram)之间优化资源分配,并在27B参数规模下验证了其优于等参数量和等FLOPs MoE基线的性能表现,尤其在通用推理、代码与数学任务中提升显著。

链接: https://arxiv.org/abs/2601.07372
作者: Xin Cheng,Wangding Zeng,Damai Dai,Qinyu Chen,Bingxuan Wang,Zhenda Xie,Kezhao Huang,Xingkai Yu,Zhewen Hao,Yukun Li,Han Zhang,Huishuai Zhang,Dongyan Zhao,Wenfeng Liang
机构: Peking University (北京大学); DeepSeek-AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone’s early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
zh
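
【代码示意】: 为说明“哈希 N-gram 查表”这一 O(1) 条件记忆原语,下面给出一个最小化的 PyTorch 模块;哈希函数、表大小与维度均为笔者假设,真实 Engram 模块的细节以论文为准。

```python
import torch
import torch.nn as nn

class NgramLookup(nn.Module):
    """哈希 N-gram -> 嵌入表查询的 O(1) 条件记忆示意(假设性实现)。"""

    def __init__(self, table_size=1 << 16, dim=64, n=3):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        self.table_size, self.n = table_size, n

    def forward(self, token_ids):
        # token_ids: [batch, seq];对每个位置,用其最近 n 个 token 组成 N-gram
        h = torch.zeros_like(token_ids)
        for k in range(self.n):
            # 简化起见用环绕式移位取前 k 个 token(真实实现需处理序列边界)
            shifted = torch.roll(token_ids, shifts=k, dims=1)
            h = (h * 1000003 + shifted) % self.table_size  # 滚动哈希
        return self.table(h)  # [batch, seq, dim],一次查表即 O(1)

x = torch.randint(0, 32000, (2, 8))
print(NgramLookup()(x).shape)  # torch.Size([2, 8, 64])
```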

[NLP-36] Interpretable Text Classification Applied to the Detection of LLM-generated Creative Writing

【速读】: 该论文试图解决的问题是区分人类撰写的创意小说(来自小说节选)与由大语言模型(LLM)生成的相似文本。其解决方案的关键在于:尽管人类观察者在这一二分类任务中表现接近随机水平,但多种机器学习模型在未见过的测试集上仍能实现0.93–0.98的高准确率,即使仅使用短文本片段和单标记(unigram)特征即可。研究进一步采用一种内在可解释的线性分类器(测试准确率达0.98),揭示出高精度分类的核心原因——LLM倾向于使用更丰富的同义词变体,从而导致概率分布偏移,这种偏移对机器学习分类器而言易于识别,但对人类观察者却难以察觉;此外还识别出四个辅助解释类别:时间漂移、美式表达、外语使用及口语化表达,这些特征共同构成判别依据,使分类结果具有鲁棒性,不易被恶意用户通过伪装规避。

链接: https://arxiv.org/abs/2601.07368
作者: Minerva Suvanto,Andrea McGlinchey,Mattias Wahde,Peter J Barclay
机构: Chalmers University of Technology (查尔默斯理工大学); Edinburgh Napier University (爱丁堡纳皮尔大学)
类目: Computation and Language (cs.CL)
备注: Accepted for publication at ICAART 2026 (this https URL)

点击查看摘要

Abstract:We consider the problem of distinguishing human-written creative fiction (excerpts from novels) from similar text generated by an LLM. Our results show that, while human observers perform poorly (near chance levels) on this binary classification task, a variety of machine-learning models achieve accuracy in the range 0.93 - 0.98 over a previously unseen test set, even using only short samples and single-token (unigram) features. We therefore employ an inherently interpretable (linear) classifier (with a test accuracy of 0.98), in order to elucidate the underlying reasons for this high accuracy. In our analysis, we identify specific unigram features indicative of LLM-generated text, one of the most important being that the LLM tends to use a larger variety of synonyms, thereby skewing the probability distributions in a manner that is easy to detect for a machine learning classifier, yet very difficult for a human observer. Four additional explanation categories were also identified, namely, temporal drift, Americanisms, foreign language usage, and colloquialisms. As identification of the AI-generated text depends on a constellation of such features, the classification appears robust, and therefore not easy to circumvent by malicious actors intent on misrepresenting AI-generated text as human work.
zh
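
【代码示意】: 上文的可解释线性分类器思路可用“词袋 + 逻辑回归”快速复现:训练后直接查看权重最大的 unigram 即可得到可解释特征。以下为通用示意,语料为占位数据,并非论文所用小说语料。

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# 占位语料:实际应替换为人类小说片段与 LLM 生成文本
texts = ["the old man walked slowly home",
         "a tapestry of emotions unfolded, a testament to resilience",
         "she lit a cigarette and stared at the rain",
         "the vibrant city pulsed with a symphony of possibilities"]
labels = [0, 1, 0, 1]   # 0 = 人类, 1 = LLM

vec = CountVectorizer(ngram_range=(1, 1))   # 仅使用 unigram 特征
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# 权重最大的 unigram 即最指向“LLM 生成”的可解释特征
top = np.argsort(clf.coef_[0])[-5:]
print([vec.get_feature_names_out()[i] for i in top])
```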

[NLP-37] Semantic Compression of LLM Instructions via Symbolic Metalanguages

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中因提示词(prompt)冗长而导致的资源消耗高、效率低的问题。传统提示工程依赖自然语言描述指令,导致token使用量大、部署成本高(尤其是API调用场景)或本地推理延迟增加。其解决方案的关键在于提出一种名为MetaGlyph的符号化语言,通过将指令编码为数学符号(如∈表示成员关系、⇒表示逻辑蕴含),利用模型在预训练阶段已掌握的符号语义实现“指令捷径”(instruction shortcuts),从而无需额外微调即可提升提示压缩效率。实验表明,MetaGlyph可在所有任务类型中实现62–81%的token减少,并在不同规模和架构的模型上展现出显著差异化的性能表现,揭示了模型参数量与符号理解能力之间存在U型关系。

链接: https://arxiv.org/abs/2601.07354
作者: Ernst van Gassen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages and 6 tables

点击查看摘要

Abstract:We introduce MetaGlyph, a symbolic language for compressing prompts by encoding instructions as mathematical symbols rather than prose. Unlike systems requiring explicit decoding rules, MetaGlyph uses symbols like ∈ (membership) and ⇒ (implication) that models already understand from their training data. We test whether these symbols work as “instruction shortcuts” that models can interpret without additional teaching. We evaluate eight models across two dimensions relevant to practitioners: scale (3B-1T parameters) and accessibility (open-source for local deployment vs. proprietary APIs). MetaGlyph achieves 62-81% token reduction across all task types. For API-based deployments, this translates directly to cost savings; for local deployments, it reduces latency and memory pressure. Results vary by model. Gemini 2.5 Flash achieves 75% semantic equivalence between symbolic and prose instructions on selection tasks, with 49.9% membership operator fidelity. Kimi K2 reaches 98.1% fidelity for implication (⇒) and achieves perfect (100%) accuracy on selection tasks with symbolic prompts. GPT-5.2 Chat shows the highest membership fidelity observed (91.3%), though with variable parse success across task types. Claude Haiku 4.5 achieves 100% parse success with 26% membership fidelity. Among mid-sized models, Qwen 2.5 7B shows 62% equivalence on extraction tasks. Mid-sized open-source models (7B-12B) show near-zero operator fidelity, suggesting a U-shaped relationship where sufficient scale overcomes instruction-tuning biases.
zh
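
【代码示意】: 为直观感受符号化压缩的形态,下面给出一对“散文指令 vs 符号指令”的虚构示例并粗略估算压缩率;符号写法为笔者示意,并非 MetaGlyph 的官方语法,token 计数也仅用空白切分近似。

```python
# 虚构的散文指令与符号化指令(符号语法为示意,非 MetaGlyph 官方定义)
prose = ("If an item belongs to the list of allowed categories, "
         "then include it in the output; otherwise exclude it.")
symbolic = "x ∈ AllowedCategories ⇒ include(x); else exclude(x)"

# 用空白切分近似 token 数(真实评估应使用目标模型自身的 tokenizer)
n_prose, n_sym = len(prose.split()), len(symbolic.split())
print(f"prose: {n_prose} tokens, symbolic: {n_sym} tokens, "
      f"reduction: {1 - n_sym / n_prose:.0%}")
```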

[NLP-38] TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

【速读】: 该论文旨在解决现有基于树结构的推测解码(tree-based speculative decoding, SD)方法在生成过程中无法动态适应不同上下文难度的问题,即固定宽度和深度的树结构难以对简单token进行充分扩展或对困难token提前终止,从而导致推理效率受限。其解决方案的关键在于提出TALON框架——一种无需训练、基于预算驱动的自适应树扩展机制,通过迭代构建draft树直至满足预设token预算,并采用混合扩展策略在各层间自适应分配节点预算,使树结构自然呈现“深而窄”(适用于确定性上下文)与“浅而宽”(适用于不确定性分支)的形态,从而在有限预算下优化探索宽度与生成深度之间的权衡。

链接: https://arxiv.org/abs/2601.07353
作者: Tianyu Liu,Qitan Lv,Yuhao Shen,Xiao Sun,Xiaoyan Sun
机构: University of Science and Technology of China (中国科学技术大学); Shanghai AI Laboratory (上海人工智能实验室); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to early stop on difficult tokens and extend generation for simple ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a “deep-and-narrow” form for deterministic contexts and a “shallow-and-wide” form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
zh
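
【代码示意】: TALON 的关键是“给定 token 预算迭代扩展 draft 树”:高置信分支继续加深,低置信分支早停。下面用优先队列模拟这一预算驱动的扩展过程;draft 模型以随机打分代替,结构与超参数均为假设。

```python
import heapq, random

def expand_tree(budget=16, topk=2, seed=0):
    """预算驱动的自适应 draft 树扩展示意(非官方实现)。"""
    rng = random.Random(seed)
    frontier = [(-1.0, 0, "root")]   # (负置信度, 深度, 路径),根的置信度记为 1
    used = 0
    while frontier and used < budget:
        neg_conf, depth, path = heapq.heappop(frontier)  # 取当前最可信的叶子
        for i in range(topk):
            conf = -neg_conf * rng.uniform(0.3, 0.95)    # 用随机打分代替 draft 模型
            heapq.heappush(frontier, (-conf, depth + 1, f"{path}/{depth + 1}.{i}"))
            used += 1
            if used >= budget:
                break
    # 置信度高的路径被反复加深(深而窄),低置信分支则早停(浅而宽)
    return sorted(frontier)[:5]

for neg_conf, depth, path in expand_tree():
    print(round(-neg_conf, 3), depth, path)
```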

[NLP-39] Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models

【速读】: 该论文旨在解决现有扩散语言模型(Diffusion Language Models, DLMs)在生成过程中依赖硬二值掩码(hard binary masking)和离散token分配所带来的局限性,这些问题限制了早期决策的修正能力,并未能充分利用中间概率表示。解决方案的关键在于提出EvoToken-DLM,通过用演化软token分布(evolving soft token distributions)替代传统的硬掩码机制,实现从掩码状态到离散输出的渐进式过渡,从而支持可回溯的解码过程;同时引入连续轨迹监督(continuous trajectory supervision),使训练目标与迭代的概率更新保持一致,有效提升模型性能。

链接: https://arxiv.org/abs/2601.07351
作者: Linhao Zhong,Linyu Wu,Bozhen Fang,Tianjian Feng,Chenchen Jing,Wen Wang,Jiaheng Zhang,Hao Chen,Chunhua Shen
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学); Zhejiang University of Technology (浙江工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Project webpage: this https URL

点击查看摘要

Abstract:Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken-DLM, a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines. Project webpage: this https URL.
zh

[NLP-40] Reward Modeling from Natural Language Human Feedback

【速读】: 该论文旨在解决生成式奖励模型(Generative Reward Models, GRMs)在基于二元偏好数据进行强化学习时,因仅依赖结果标签而易产生“猜测正确答案但缺乏有效推理”的伪成功现象,从而引入噪声奖励信号、削弱强化学习效果的问题。解决方案的关键在于提出从自然语言人类反馈中提取过程奖励信号(Reward Modeling from Natural Language Human Feedback, RM-NLHF),通过计算模型生成的批判与人类批判之间的语义相似度作为训练奖励,相较仅依赖结果标签的监督方式提供更准确的过程导向奖励信号;同时引入元奖励模型(Meta Reward Model, MetaRM)以学习从含人类批判的数据中预测过程奖励,并推广至无明确人类批判的数据,从而缓解人工标注成本高的问题。

链接: https://arxiv.org/abs/2601.07349
作者: Zongqi Wang,Rui Wang,Yuchuan Wu,Yiyao Yu,Pinyi Zhang,Shaoning Sun,Yujiu Yang,Yongbin Li
机构: Tongyi Lab (通义实验室); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). Typically in pairwise rewarding tasks, GRMs generate reasoning chains ending with critiques and preference labels, and RLVR then relies on the correctness of the preference labels as the training reward. However, in this paper, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques. Consequently, these spurious successes introduce substantial noise into the reward signal, thereby impairing the effectiveness of reinforcement learning. To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent in binary tasks. Specifically, we compute the similarity between GRM-generated and human critiques as the training reward, which provides more accurate reward signals than outcome-only supervision. Additionally, considering that human critiques are difficult to scale up, we introduce Meta Reward Model (MetaRM) which learns to predict process reward from datasets with human critiques and then generalizes to data without human critiques. Experiments on multiple benchmarks demonstrate that our method consistently outperforms state-of-the-art GRMs trained with outcome-only reward, confirming the superiority of integrating natural language over binary human feedback as supervision.
zh
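
【代码示意】: RM-NLHF 以“模型批判与人类批判的相似度”作为过程奖励。摘要未说明具体度量,此处以 TF-IDF 余弦相似度作占位演示,真实系统可替换为嵌入模型或更强的语义打分器。

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def critique_similarity_reward(model_critique, human_critique):
    """以 TF-IDF 余弦相似度近似“批判一致性”奖励(示意性代理指标)。"""
    tfidf = TfidfVectorizer().fit_transform([model_critique, human_critique])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

reward = critique_similarity_reward(
    "Response A ignores the user's constraint on budget.",
    "A fails to respect the stated budget constraint.",
)
print(round(reward, 3))
```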

[NLP-41] Controlled Self-Evolution for Algorithmic Code Optimization

【速读】: 该论文旨在解决自进化(Self-evolution)方法在代码生成中因探索效率低而导致无法在有限预算内发现复杂度更优解的问题。现有方法存在初始化偏差导致进化陷入次优区域、随机操作缺乏反馈引导以及跨任务和任务内经验利用不足等瓶颈。其解决方案的关键在于提出受控自进化(Controlled Self-Evolution, CSE),包含三个核心组件:多样化规划初始化(Diversified Planning Initialization)以实现算法策略的结构多样性并覆盖更广的解空间;遗传进化(Genetic Evolution)用反馈驱动机制替代随机操作,实现目标导向的变异与组合交叉;分层进化记忆(Hierarchical Evolution Memory)则用于捕获跨任务与任务内的成功及失败经验,从而提升知识复用效率。实验表明,CSE在EffiBench-X基准上显著优于所有基线模型,并在早期生成阶段即展现出更高效率且持续改进。

链接: https://arxiv.org/abs/2601.07348
作者: Tu Hu,Ronghao Chen,Shuo Zhang,Jianghao Yin,Mou Xiao Feng,Jingping Liu,Shaolei Zhang,Wenqi Jiang,Yuqi Fang,Sen Hu,Yi Xu,Huacan Wang
机构: NJU(南京大学); PKU(北京大学); Midea-AIRC(美的人工智能研究中心); ECNU(华东师范大学); SYSU(中山大学); RUC(中国人民大学); QuantaAlpha(Quantum Alpha)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 27 pages

点击查看摘要

Abstract:Self-evolution methods enhance code generation through iterative “generate-verify-refine” cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at this https URL.
zh

[NLP-42] DiffER: Diffusion Entity-Relation Modeling for Reversal Curse in Diffusion Large Language Models

【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, DLLMs)中存在的“反转诅咒”(reversal curse)问题,即模型在处理逻辑上双向的关系时表现出主要的单向行为,即使其训练方式为双向。研究发现,这一现象并非仅由自回归训练导致,而是源于实体碎片化、数据不对称和缺失实体关系等三方面原因。解决方案的关键在于提出Diffusion Entity-Relation Modeling(DiffER),通过引入全实体掩码机制以减少实体碎片化,并采用分布对称与关系增强的数据构建策略缓解数据不对称性和缺失关系问题,从而有效缓解DLLMs中的反转诅咒现象。

链接: https://arxiv.org/abs/2601.07347
作者: Shaokai He,Kaiwen Wei,Xinyi Zeng,Xiang Chen,Xue Yang,Zhenyang Li,Jiang Zhong,Yu Tian
机构: Chongqing University (重庆大学); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The “reversal curse” refers to the phenomenon where large language models (LLMs) exhibit predominantly unidirectional behavior when processing logically bidirectional relationships. Prior work attributed this to autoregressive training – predicting the next token inherently favors left-to-right information flow over genuine bidirectional knowledge associations. However, we observe that Diffusion LLMs (DLLMs), despite being trained bidirectionally, also suffer from the reversal curse. To investigate the root causes, we conduct systematic experiments on DLLMs and identify three key reasons: 1) entity fragmentation during training, 2) data asymmetry, and 3) missing entity relations. Motivated by the analysis of these reasons, we propose Diffusion Entity-Relation Modeling (DiffER), which addresses the reversal curse through entity-aware training and balanced data construction. Specifically, DiffER introduces whole-entity masking, which mitigates entity fragmentation by predicting complete entities in a single step. DiffER further employs distribution-symmetric and relation-enhanced data construction strategies to alleviate data asymmetry and missing relations. Extensive experiments demonstrate that DiffER effectively alleviates the reversal curse in Diffusion LLMs, offering new perspectives for future research.
zh
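
【代码示意】: DiffER 的全实体掩码要求一个实体的所有 token 被整体掩蔽并在同一步预测。下面在 token 序列上按实体跨度整体打掩码,演示这一思路;跨度标注与掩码符号均为假设。

```python
MASK = "[MASK]"

def whole_entity_mask(tokens, entity_spans, mask_entity_index):
    """将第 mask_entity_index 个实体跨度内的所有 token 整体替换为 MASK。

    entity_spans: [(start, end), ...],end 为开区间;实体不再被拆碎成子词片段。
    """
    start, end = entity_spans[mask_entity_index]
    return tokens[:start] + [MASK] * (end - start) + tokens[end:]

tokens = ["Marie", "Cu", "##rie", "discovered", "polo", "##nium"]
spans = [(0, 3), (4, 6)]            # 两个实体:Marie Curie / polonium
print(whole_entity_mask(tokens, spans, 0))
# ['[MASK]', '[MASK]', '[MASK]', 'discovered', 'polo', '##nium']
```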

[NLP-43] Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

【速读】: 该论文旨在解决当前机器翻译(Machine Translation, MT)评估指标在处理非字面表达(non-literal expressions)时的不可靠性问题,尤其是在社交网络服务、文学等语言复杂场景下,传统自动评估指标(如BLEU、TER等)与人工评分存在显著偏差,且基于大语言模型作为裁判(LLM-as-a-Judge)的方法也受限于知识截止(knowledge cutoff)和评分不一致性(score inconsistency)问题。解决方案的关键在于提出一种新型代理式评估框架RATE(Reflective Agent-based Translation Evaluation),其核心是一个具备反思能力的主代理(Core Agent),能够动态调用多个专业化子代理(sub-agents)进行多维度、自适应的翻译质量判断,从而提升评估准确性与鲁棒性,在非字面翻译场景中相比现有指标至少提升3.2分元评分(meta score)。

链接: https://arxiv.org/abs/2601.07338
作者: Yanzhi Tian,Cunxiang Wang,Zeming Liu,Heyan Huang,Wenbo Yu,Dawei Song,Jie Tang,Yuhang Guo
机构: Beijing Institute of Technology (北京理工大学); Zhipu AI; Tsinghua University (清华大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly advanced Machine Translation (MT) and are increasingly applied to linguistically complex domains such as social network services and literature. In these scenarios, translations often require handling non-literal expressions, which undermines the accuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problem. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework, centered by a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics. Further experiments demonstrate the robustness of RATE to general-domain MT evaluation. Code and dataset are available at: this https URL.
zh

[NLP-44] BayesRAG : Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation

【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)方法在处理视觉丰富文档时存在的问题,即文本与图像被孤立地作为检索目标,导致难以捕捉跨模态语义强化和布局诱导的一致性。现有方法依赖单一余弦相似度排序,无法有效融合多模态信息。其解决方案的关键在于提出BayesRAG框架,该框架基于贝叶斯推理(Bayesian inference)与Dempster-Shafer证据理论(Dempster-Shafer evidence theory),将多模态检索结果的内在一致性建模为概率证据,通过计算联合后验关联概率来优化检索置信度,优先选择在语义和布局上相互佐证的图文对,从而实现更鲁棒、协同一致的多模态检索融合。

链接: https://arxiv.org/abs/2601.07329
作者: Xuan Li,Yining Wang,Haocai Luo,Shengping Liu,Jerry Liang,Ying Fu,Weihuang,Jun Yu,Junnan Zhu
机构: University of Science and Technology of China (中国科学技术大学); Unisound AI Technology Co.Ltd (声智科技有限公司); MAIS, Institute of Automation, Chinese Academy of Sciences (中科院自动化所MAIS实验室)
类目: Computation and Language (cs.CL)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at this https URL.
zh
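
【代码示意】: BayesRAG 借助 Dempster-Shafer 证据理论融合多模态检索置信度。以下实现二元识别框 {relevant, irrelevant} 上教科书式的 Dempster 组合规则,说明“互证”如何抬升联合信念;输入的质量函数为虚构数值,与论文的具体建模无关。

```python
def dempster_combine(m1, m2):
    """二元识别框 {R, I} 上的 Dempster 组合规则(教科书式实现)。

    m 为质量函数 dict,键为 "R"(相关)、"I"(不相关)、"RI"(不确定),值和为 1。
    """
    keys = ["R", "I", "RI"]
    def meet(a, b):                      # 焦元交集
        if a == "RI": return b
        if b == "RI": return a
        return a if a == b else None     # None 表示冲突(交集为空)
    combined, conflict = {k: 0.0 for k in keys}, 0.0
    for a in keys:
        for b in keys:
            inter = meet(a, b)
            if inter is None:
                conflict += m1[a] * m2[b]
            else:
                combined[inter] += m1[a] * m2[b]
    return {k: v / (1 - conflict) for k, v in combined.items()}

text_evidence = {"R": 0.6, "I": 0.1, "RI": 0.3}   # 文本检索的置信分配(假设)
image_evidence = {"R": 0.5, "I": 0.2, "RI": 0.3}  # 图像检索的置信分配(假设)
print(dempster_combine(text_evidence, image_evidence))  # 互证后 R 的信念上升
```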

[NLP-45] How to predict creativity ratings from written narratives: A comparison of co-occurrence and textual forma mentis networks

【速读】: 该论文旨在解决如何利用网络建模方法从短篇创意文本中提取结构化特征,并用于预测人类对创造力的评分问题。其解决方案的关键在于对比两种文本转网络的方法——词共现网络(word co-occurrence networks)与文本形式心智网络(textual forma mentis networks, TFMNs),并通过系统性工作流验证TFMNs在预测性能上的优势:实验表明,TFMNs在所有建模设置下均表现出更低的预测误差(最佳MAE=0.581),且网络结构特征是主要预测因子(MAE=0.591),显著优于情感特征(MAE=0.711)和传播激活指标(MAE=0.788)。该研究为认知科学领域中的创造力研究提供了可复现、实用且具有方法论深度的网络分析框架。

链接: https://arxiv.org/abs/2601.07327
作者: Roberto Passaro,Edith Haim,Massimo Stella
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This tutorial paper provides a step-by-step workflow for building and analysing semantic networks from short creative texts. We introduce and compare two widely used text-to-network approaches: word co-occurrence networks and textual forma mentis networks (TFMNs). We also demonstrate how they can be used in machine learning to predict human creativity ratings. Using a corpus of 1029 short stories, we guide readers through text preprocessing, network construction, feature extraction (structural measures, spreading-activation indices, and emotion scores), and application of regression models. We evaluate how network-construction choices influence both network topology and predictive performance. Across all modelling settings, TFMNs consistently outperformed co-occurrence networks through lower prediction errors (best MAE = 0.581 for TFMN, vs 0.592 for co-occurrence with window size 3). Network-structural features dominated predictive performance (MAE = 0.591 for TFMN), whereas emotion features performed worse (MAE = 0.711 for TFMN) and spreading-activation measures contributed little (MAE = 0.788 for TFMN). This paper offers practical guidance for researchers interested in applying network-based methods for cognitive fields like creativity research. We show when syntactic networks are preferable to surface co-occurrence models, and provide an open, reproducible workflow accessible to newcomers in the field, while also offering deeper methodological insight for experienced researchers.
zh
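
【代码示意】: 作为对上述教程流程的补充,词共现网络可用 networkx 几行构建:按滑动窗口连边,再提取结构指标作为回归特征。以下为通用示意,窗口大小等超参数为假设,且省略了分词、去停用词等预处理。

```python
import networkx as nx

def cooccurrence_network(text, window=3):
    """按滑动窗口构建词共现网络(示意;真实流程还需分词、去停用词等预处理)。"""
    words = text.lower().split()
    g = nx.Graph()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if w != words[j]:            # 跳过自环
                g.add_edge(w, words[j])
    return g

story = "the fox built a tiny ship and the fox sailed the moonlit sea"
g = cooccurrence_network(story)
# 可作为回归特征的结构指标示例:节点数、边数、密度、平均聚类系数
print(g.number_of_nodes(), g.number_of_edges(),
      round(nx.density(g), 3), round(nx.average_clustering(g), 3))
```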

[NLP-46] Mitrasamgraha: A Comprehensive Classical Sanskrit Machine Translation Dataset

【速读】: 该论文旨在解决高资源语言中机器翻译(Machine Translation, MT)任务虽被视作“已解决”,但在处理复杂文本如梵文文学时仍面临显著挑战的问题。梵文文献因其多层隐喻、哲学概念、诗歌语言及复杂的语法结构(如sandhi、复合词和重形态),对自然语言处理(Natural Language Processing, NLP)下游任务构成严峻考验,且此前缺乏覆盖多个历史时期与语域的高质量公开数据集。解决方案的关键在于构建Mitrasamgraha——一个包含391,548对双语句对的高质梵文到英文机器翻译数据集,其规模超过此前最大数据集Itihāsa四倍,并涵盖三千余年的历史跨度与广泛语域。该数据集还提供细粒度的时间与领域标注,支持对MT性能在不同语境下的系统性分析,同时释放了验证集(5,587对)和测试集(5,552对),并通过商用与开源模型的基准实验及NLLB与Gemma模型的微调,验证了其有效性,尽管仍存在对复杂复合词、哲学概念和多层隐喻翻译的挑战。

链接: https://arxiv.org/abs/2601.07314
作者: Sebastian Nehrdich,David Allport,Sven Sellmer,Jivnesh Sandhan,Manoj Balaji Jagadeeshan,Pawan Goyal,Sujeet Kumar,Kurt Keutzer
机构: Tohoku University (东北大学); University of California, Berkeley (加州大学伯克利分校); Adam Mickiewicz University in Poznań (亚当·密凯维奇大学波兹南分校); Kyoto University (京都大学); Indian Institute of Technology, Kharagpur (印度理工学院克哈格普尔分校); B. P. Mandal College of Engineering Madhepura (B. P. 曼达尔工程学院马德赫普拉分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While machine translation is regarded as a “solved problem” for many high-resource languages, close analysis quickly reveals that this is not the case for content that shows challenges such as poetic language, philosophical concepts, multi-layered metaphorical expressions, and more. Sanskrit literature is a prime example of this, as it combines a large number of such challenges in addition to inherent linguistic features like sandhi, compounding, and heavy morphology, which further complicate NLP downstream tasks. It spans multiple millennia of text production time as well as a large breadth of different domains, ranging from ritual formulas through epic narratives, philosophical treatises, and poetic verses to scientific material. As of now, there is a strong lack of publicly available resources that cover these different domains and temporal layers of Sanskrit. We therefore introduce Mitrasamgraha, a high-quality Sanskrit-to-English machine translation dataset consisting of 391,548 bitext pairs, more than four times larger than the largest previously available Sanskrit dataset Itihāsa. It covers a time period of more than three millennia and a broad range of historical Sanskrit domains. In contrast to web-crawled datasets, the temporal and domain annotation of this dataset enables fine-grained study of domain and time period effects on MT performance. We also release a validation set consisting of 5,587 and a test set consisting of 5,552 post-corrected bitext pairs. We conduct experiments benchmarking commercial and open models on this dataset and fine-tune NLLB and Gemma models on the dataset, showing significant improvements, while still recognizing significant challenges in the translation of complex compounds, philosophical concepts, and multi-layered metaphors. We also analyze how in-context learning on this dataset impacts the performance of commercial models.
zh

[NLP-47] PsyCLIENT: Client Simulation via Conversational Trajectory Modeling for Trainee Practice and Model Evaluation in Mental Health Counseling

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的来访者模拟在多样性与真实性不足、缺乏建模真实来访者行为的系统性框架,以及中文语境下数据稀缺等三大挑战。其解决方案的关键在于提出PsyCLIENT框架,该框架基于对话轨迹建模(conversational trajectory modeling),通过将LLM生成过程条件化于预定义的真实世界对话轨迹(包含显式的的行为标签和内容约束),从而确保模拟交互的多样性与现实性;同时,研究团队构建了首个开源的中文来访者画像数据集PsyCLIENT-CP,覆盖60个心理咨询主题,显著提升了模拟的真实性与训练有效性,实验表明模拟来访者与真人来访者的区分准确率接近95%,验证了该方法在心理辅导教育与研究中的实用价值。

链接: https://arxiv.org/abs/2601.07312
作者: Huachuan Qiu,Zhaoming Chen,Yuqian Chen,Yuan Xie,Yu Lu,Zhenzhong Lan
机构: Westlake University (西湖大学); The University of Utah (犹他大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM-based client simulation has emerged as a promising tool for training novice counselors and evaluating automated counseling systems. However, existing client simulation approaches face three key challenges: (1) limited diversity and realism in client profiles, (2) the lack of a principled framework for modeling realistic client behaviors, and (3) a scarcity of data in Chinese-language settings. To address these limitations, we propose PsyCLIENT, a novel simulation framework grounded in conversational trajectory modeling. By conditioning LLM generation on predefined real-world trajectories that incorporate explicit behavior labels and content constraints, our approach ensures diverse and realistic interactions. We further introduce PsyCLIENT-CP, the first open-source Chinese client profile dataset, covering 60 distinct counseling topics. Comprehensive evaluations involving licensed professional counselors demonstrate that PsyCLIENT significantly outperforms baselines in terms of authenticity and training effectiveness. Notably, the simulated clients are nearly indistinguishable from human clients, achieving an expert confusion rate of about 95% in discrimination tasks. These findings indicate that conversational trajectory modeling effectively bridges the gap between theoretical client profiles and dynamic, realistic simulations, offering a robust solution for mental health education and research. Code and data will be released to facilitate future research in mental health counseling.
zh

[NLP-48] LRAS: Advanced Legal Reasoning with Agent ic Search

【速读】: 该论文旨在解决大型法律语言模型(Large Language Models for Law, LLMs)在法律推理中因依赖静态参数知识导致的“闭合回路推理”问题,即模型缺乏对自身知识边界的自我认知,从而产生自信但错误的结论。解决方案的关键在于提出Legal Reasoning with Agentic Search (LRAS)框架,通过引入反思性模仿学习(Introspective Imitation Learning)和难度感知强化学习(Difficulty-aware Reinforcement Learning),使模型具备识别知识边界并动态进行主动搜索与验证的能力,从而实现从静态“闭合回路思维”向动态“主动探究”的转变,显著提升法律推理的准确性与可靠性。

链接: https://arxiv.org/abs/2601.07296
作者: Yujin Zhou,Chuxue Cao,Jinluan Yang,Lijun Wu,Conghui He,Sirui Han,Yike Guo
机构: Hong Kong University of Science and Technology (香港科技大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Reasoning Models (LRMs) have demonstrated exceptional logical capabilities in mathematical domains, their application to the legal field remains hindered by the strict requirements for procedural rigor and adherence to legal logic. Existing legal LLMs, which rely on “closed-loop reasoning” derived solely from internal parametric knowledge, frequently suffer from lack of self-awareness regarding their knowledge boundaries, leading to confident yet incorrect conclusions. To address this challenge, we present Legal Reasoning with Agentic Search (LRAS), the first framework designed to transition legal LLMs from static and parametric “closed-loop thinking” to dynamic and interactive “Active Inquiry”. By integrating Introspective Imitation Learning and Difficulty-aware Reinforcement Learning, LRAS enables LRMs to identify knowledge boundaries and handle legal reasoning complexity. Empirical results demonstrate that LRAS outperforms state-of-the-art baselines by 8.2-32%, with the most substantial gains observed in tasks requiring deep reasoning with reliable knowledge. We will release our data and models for further exploration soon.
zh

[NLP-49] ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios

【速读】: 该论文旨在解决当前表格问答(TableQA)基准测试在工业场景中适用性不足的问题,尤其是多表结构、嵌套表头和大规模数据带来的复杂推理挑战。解决方案的关键在于构建了一个大规模双语基准 ReasonTabQA,涵盖30个工业领域共1932张表格,并提供高质量的最终答案与显式推理链标注;同时提出 TabCodeRL 方法,通过基于表格感知的可验证奖励机制引导逻辑推理路径生成,从而提升模型在真实工业 TableQA 任务中的表现。

链接: https://arxiv.org/abs/2601.07280
作者: Changzai Pan,Jie Zhang,Kaiwen Wei,Chenshuo Pan,Yu Zhao,Jingwang Huang,Jian Yang,Zhenhe Wu,Haoyang Zeng,Xiaoyan Gu,Weichao Sun,Yanbo Zhai,Yujie Mao,Zhuoru Jiang,Jiang Zhong,Shuangyong Song,Yongxiang Li,Zhongjiang He
机构: Institute of Artificial Intelligence (TeleAI), China Telecom; Chongqing University; Beihang University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scales. These environments demand robust table reasoning through deep structured inference, presenting a significant challenge that remains inadequately addressed by current methodologies. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark encompassing 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that leverages table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and 4 TableQA datasets demonstrate that while TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA.
zh

[NLP-50] Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

【速读】: 该论文旨在解决中文方言在语音与语言技术方面落后于普通话的问题,尤其针对多数方言以口语为主、缺乏高质量语音大模型(speech-LLMs)支持的现状。其核心挑战在于构建跨方言的语义对齐语音表示,使方言语音能够有效映射到普通话语义空间。解决方案的关键在于:仅使用自动语音识别(ASR)数据训练一个语音编码器(speech encoder),从而实现中文方言与普通话之间的语义对齐;这一方法在作者构建的新一代汉语方言语音基准上通过语音到语音检索任务得到验证,并展现出领先的ASR性能,为未来中文方言语音大模型的发展奠定基础。

链接: https://arxiv.org/abs/2601.07274
作者: Kalvin Chang,Yiwen Shao,Jiahong Li,Dong Yu
机构: Tencent(腾讯)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at this https URL.
zh

[NLP-51] Document-Level Zero-Shot Relation Extraction with Entity Side Information EACL2026

【速读】: 该论文旨在解决文档级零样本关系抽取(Document-Level Zero-Shot Relation Extraction, DocZSRE)在低资源语言(如马来西亚英语)中因依赖大型语言模型(Large Language Models, LLMs)生成合成数据而面临的问题,包括本地语言特征难以捕捉和生成内容存在事实性错误。解决方案的关键在于引入实体侧信息(Entity Side Information, SI),即利用实体提及描述(Entity Mention Descriptions)和实体提及上位词(Entity Mention Hypernyms)来增强模型对未见关系标签的推理能力,从而无需依赖LLM生成的合成数据,显著提升了模型在低资源场景下的准确性和鲁棒性,平均宏F1分数较基线模型提升11.6%。

链接: https://arxiv.org/abs/2601.07271
作者: Mohan Raj Chanthran,Soon Lay Ki,Ong Huey Fang,Bhawani Selvaretnam
机构: Monash University Malaysia (蒙纳士大学马来西亚分校); Valiantlytix
类目: Computation and Language (cs.CL)
备注: Accepted to EACL 2026 Main Conference

点击查看摘要

Abstract:Document-Level Zero-Shot Relation Extraction (DocZSRE) aims to predict unseen relation labels in text documents without prior training on specific relations. Existing approaches rely on Large Language Models (LLMs) to generate synthetic data for unseen labels, which poses challenges for low-resource languages like Malaysian English. These challenges include the incorporation of local linguistic nuances and the risk of factual inaccuracies in LLM-generated data. This paper introduces Document-Level Zero-Shot Relation Extraction with Entity Side Information (DocZSRE-SI) to address limitations in the existing DocZSRE approach. The DocZSRE-SI framework leverages Entity Side Information, such as Entity Mention Descriptions and Entity Mention Hypernyms, to perform ZSRE without depending on LLM-generated synthetic data. The proposed low-complexity model achieves an average improvement of 11.6% in the macro F1-Score compared to baseline models and existing benchmarks. By utilizing Entity Side Information, DocZSRE-SI offers a robust and efficient alternative to error-prone, LLM-based methods, demonstrating significant advancements in handling low-resource languages and linguistic diversity in relation extraction tasks. This research provides a scalable and reliable solution for ZSRE, particularly in contexts like Malaysian English news articles, where traditional LLM-based approaches fall short.
zh

[NLP-52] The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

【速读】: 该论文旨在解决工具集成型自主代理(tool-integrated agentic workflows)中校准(calibration)不足的问题,即代理在使用不同类型的工具时,其表达的信心与实际性能之间存在不一致,尤其在证据类工具(如网络搜索)下易产生严重过自信。解决方案的关键在于提出一种基于强化学习(reinforcement learning, RL)的微调框架,联合优化任务准确率与校准能力,并通过一套全面的奖励设计基准进行验证,从而实现跨工具类型、跨环境和跨领域的鲁棒校准提升。

链接: https://arxiv.org/abs/2601.07264
作者: Weihao Xuan,Qingcheng Zeng,Heli Qi,Yunze Xiao,Junjue Wang,Naoto Yokoya
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autonomous agents based on large language models (LLMs) are rapidly evolving to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, which refers to an agent’s ability to express confidence that reliably reflects its actual performance. While calibration is well-established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study identifies that evidence tools (e.g., web search) systematically induce severe overconfidence due to inherent noise in retrieved information, while verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We demonstrate that our trained agents not only achieve superior calibration but also exhibit robust generalization from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the necessity of domain-specific calibration strategies for tool-use agents. More broadly, this work establishes a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.
zh

[NLP-53] ActiShade: Activating Overshadowed Knowledge to Guide Multi-Hop Reasoning in Large Language Models AAAI2026

【速读】: 该论文旨在解决多跳推理(multi-hop reasoning)中因知识遮蔽(knowledge overshadowing)导致的错误累积问题,即在多轮检索增强生成(RAG)过程中,大型语言模型(LLM)生成的查询可能遗漏关键信息,从而引发无关检索并加剧迭代误差。解决方案的关键在于提出ActiShade方法,其通过迭代检测查询中的被遮蔽关键词(overshadowed keyphrase),结合原始查询与遮蔽关键词重新检索相关文档,并基于这些文档生成新的引导查询,从而在下一迭代中补充被遮蔽的知识,同时最小化无关噪声的引入,有效缓解错误累积现象。

链接: https://arxiv.org/abs/2601.07260
作者: Huipeng Ma,Luan Zhang,Dandan Song,Linmei Hu,Yuhang Tian,Jun Yang,Changzhi Zhou,Chenhao Li,Yizhou Jin,Xudong Li,Meng Lin,Mingxing Zhang,Shuhao Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:In multi-hop reasoning, multi-round retrieval-augmented generation (RAG) methods typically rely on LLM-generated content as the retrieval query. However, these approaches are inherently vulnerable to knowledge overshadowing - a phenomenon where critical information is overshadowed during generation. As a result, the LLM-generated content may be incomplete or inaccurate, leading to irrelevant retrieval and causing error accumulation during the iteration process. To address this challenge, we propose ActiShade, which detects and activates overshadowed knowledge to guide large language models (LLMs) in multi-hop reasoning. Specifically, ActiShade iteratively detects the overshadowed keyphrase in the given query, retrieves documents relevant to both the query and the overshadowed keyphrase, and generates a new query based on the retrieved documents to guide the next-round iteration. By supplementing the overshadowed knowledge during the formulation of next-round queries while minimizing the introduction of irrelevant noise, ActiShade reduces the error accumulation caused by knowledge overshadowing. Extensive experiments show that ActiShade outperforms existing methods across multiple datasets and LLMs.
zh

[NLP-54] Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个体实例层面的不可靠性问题,如幻觉、脆弱性失败以及置信度校准不足等。其核心解决方案是引入一种多模型共识推理引擎(Multi-Model Consensus Reasoning Engine),通过监督式元学习框架整合多个异构LLM的输出,利用语义嵌入、成对相似性、聚类统计、词汇与结构特征、推理质量评分、置信度估计及模型特定先验信息等结构化特征,结合梯度提升树、列表排序和基于相似性图的图神经网络进行建模。实验表明,该方法在GSM8K、ARC-Challenge、HellaSwag和TruthfulQA数据集上显著优于单一模型和多数投票策略,尤其在宏观平均准确率和Brier分数方面表现突出,验证了监督式多模型共识是提升LLM可靠性的有效路径。

链接: https://arxiv.org/abs/2601.07245
作者: Pranav Kallem
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve strong average performance yet remain unreliable at the instance level, with frequent hallucinations, brittle failures, and poorly calibrated confidence. We study reliability through the lens of multi-model consensus: given responses from several heterogeneous LLMs, can we learn which answer is most likely correct for a given query? We introduce a Multi-Model Consensus Reasoning Engine that treats the set of LLM outputs as input to a supervised meta-learner. The system maps natural language responses into structured features using semantic embeddings, pairwise similarity and clustering statistics, lexical and structural cues, reasoning-quality scores, confidence estimates, and model-specific priors, and then applies gradient-boosted trees, listwise ranking, and graph neural networks over similarity graphs of answers. Using three open-weight LLMs evaluated on compact, resource-constrained subsets of GSM8K, ARC-Challenge, HellaSwag, and TruthfulQA, our best graph-attention-based consensus model improves macro-average accuracy by 4.6 percentage points over the strongest single LLM and by 8.1 points over majority vote, while also yielding lower Brier scores and fewer TruthfulQA hallucinations. Ablation and feature-importance analyses show that semantic agreement and clustering features are most influential, with reasoning-quality and model-prior features providing complementary gains, suggesting supervised multi-model consensus is a practical route toward more reliable LLM behavior, even in a modest single-machine setup.
zh
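
【代码示意】: 该共识引擎的第一步是把多模型回答映射为结构化特征。下面示意其中最重要的“成对相似度 + 一致性”特征的抽取,用 TF-IDF 代替论文中的语义嵌入,阈值为假设;抽取出的特征随后可喂给梯度提升树等元学习器。

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def consensus_features(answers):
    """从一组模型回答中抽取一致性特征(示意,真实系统应使用语义嵌入)。"""
    tfidf = TfidfVectorizer().fit_transform(answers)
    sim = cosine_similarity(tfidf)
    n = len(answers)
    pairwise = sim[np.triu_indices(n, k=1)]          # 所有成对相似度
    agree = (sim > 0.5).sum(axis=1) - 1              # 每个回答的“同伴”数(阈值为假设)
    return {
        "mean_pairwise_sim": float(pairwise.mean()),
        "max_agreement": int(agree.max()),
        "majority_index": int(agree.argmax()),       # 最被同伴支持的回答
    }

answers = ["The answer is 42.", "It equals 42.", "I believe it is 17."]
print(consensus_features(answers))
```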

[NLP-55] Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 模型在面对外部信息引入的噪声上下文时表现出显著性能下降的问题,尤其是现有基准测试未能充分模拟真实场景中的噪声干扰。其核心挑战在于:模型在检索增强生成(RAG)、推理、对齐和工具调用等任务中,容易被无关文档、不相关对话历史或硬负样本干扰,导致错误放大甚至出现非预期的对齐偏差。解决方案的关键在于提出一种名为 Rationale-Aware Reward (RARE) 的奖励机制,通过激励模型识别并聚焦于噪声中的有用信息,从而显著提升其在复杂噪声环境下的鲁棒性;实验表明,RARE 在多个任务上优于传统微调与强化学习方法,并揭示了高计算量反而可能加剧噪声敏感性的反直觉现象,为下一代具备强推理能力的智能体设计提供了重要方向。

链接: https://arxiv.org/abs/2601.07226
作者: Seongyun Lee,Yongrae Jo,Minju Seo,Moontae Lee,Minjoon Seo
机构: KAIST(韩国科学技术院); LG Research (LG研究实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and distractors can trigger emergent misalignment even without adversarial intent. We find that prompting, context engineering, SFT, and outcome-reward-only RL fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the identification of helpful information within noise. Finally, we uncover an inverse scaling trend where increased test-time computation leads to worse performance in noisy settings and demonstrate via attention visualization that models disproportionately focus on distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.

[NLP-56] The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

【Quick Read】: This paper investigates why multilingual language models (MLMs) perform unevenly across languages, asking whether the gaps stem from the languages themselves or from modeling choices. The survey finds that many disparities that appear to arise from intrinsic linguistic properties (such as morphological complexity or word-order differences) are in fact driven largely by modeling-level factors such as tokenization, encoding, data exposure, and parameter sharing. The key takeaway is that normalizing these design choices, through unified segmentation, better data sampling, improved architectures, and fairer evaluation, can substantially narrow cross-lingual performance gaps and support more balanced multilingual models.

Link: https://arxiv.org/abs/2601.07220
Authors: Chen Shani,Yuval Reif,Nathan Roll,Dan Jurafsky,Ekaterina Shutova
Affiliations: Stanford University; The Hebrew University of Jerusalem; University of Amsterdam
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world’s languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.

[NLP-57] MI-PRUN: Optimize Large Language Model Pruning via Mutual Information

【Quick Read】: This paper targets the efficiency bottleneck that large language models (LLMs) face in deployment due to heavy compute and memory demands, and in particular the instability and lack of global optimality of existing block-pruning methods. The key to the solution, MI-PRUN, is a mutual-information (MI) based pruning approach: it identifies redundant blocks by quantifying the mutual information carried across hidden-state transitions, and uses the Data Processing Inequality (DPI) to relate the importance of contiguous block groups to that of individual blocks. A Fast-Block-Select algorithm iteratively updates block combinations, preserving global optimality while substantially improving pruning efficiency.

Link: https://arxiv.org/abs/2601.07212
Authors: Hao Zhang,Zhibin Zhang,Guangxin Wu,He Chen,Jiafeng Guo,Xueqi Cheng
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments: 10 pages

Click to view abstract

Abstract:Large Language Models (LLMs) have become indispensable across various domains, but this comes at the cost of substantial computational and memory resources. Model pruning addresses this by removing redundant components from models. In particular, block pruning can achieve significant compression and inference acceleration. However, existing block pruning methods are often unstable and struggle to attain globally optimal solutions. In this paper, we propose a mutual information based pruning method MI-PRUN for LLMs. Specifically, we leverage mutual information to identify redundant blocks by evaluating transitions in hidden states. Additionally, we incorporate the Data Processing Inequality (DPI) to reveal the relationship between the importance of entire contiguous blocks and that of individual blocks. Moreover, we develop the Fast-Block-Select algorithm, which iteratively updates block combinations to achieve a globally optimal solution while significantly improving the efficiency. Extensive experiments across various models and datasets demonstrate the stability and effectiveness of our method.
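As a rough illustration of scoring transformer blocks by how little they change the hidden state, here is a sketch that ranks contiguous block groups with a similarity proxy. The actual method estimates mutual information over hidden-state transitions and selects blocks with Fast-Block-Select, so the scoring function, toy residual updates, and shapes below are assumptions for illustration only.

```python
import numpy as np

def block_group_score(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """Redundancy proxy for a group of blocks: cosine similarity between the
    hidden states entering and leaving it. High similarity ~ little information
    added, so the group is a pruning candidate. MI-PRUN instead estimates mutual
    information and relates group vs. per-block importance via the DPI."""
    a, b = h_in.ravel(), h_out.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hidden states captured after each of L blocks for one calibration batch
L, tokens, dim = 8, 16, 32
rng = np.random.default_rng(0)
H = [rng.standard_normal((tokens, dim))]
for _ in range(L):
    H.append(H[-1] + 0.1 * rng.standard_normal((tokens, dim)))  # toy residual updates

# score every contiguous group of `width` blocks; prune the most redundant one
width = 2
scores = {(s, s + width): block_group_score(H[s], H[s + width])
          for s in range(L - width + 1)}
print(max(scores, key=scores.get))  # block range whose removal loses the least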

[NLP-58] MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

【Quick Read】: This paper addresses the limited applicability of Group-Relative Policy Optimization (GRPO) in open-domain settings, where generation tasks lack verifiable ground truth and involve conflicting objectives (such as creativity versus factuality) that static reward scalarization handles poorly. The key innovation of MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization) is to treat reward scalarization as a dynamic latent policy, using the model's terminal hidden states as a semantic bottleneck to perceive task-specific priorities. This is formalized as a contextual bandit problem within a bi-level optimization framework, in which a lightweight Conductor network co-evolves with the main policy using group-relative advantages as a meta-reward signal, enabling adaptive and efficient multi-objective alignment.

Link: https://arxiv.org/abs/2601.07208
Authors: Yang Zhao,Hepeng Wang,Xiao Ding,Yangou Ouyang,Bibo Cai,Kai Xiong,Jinglong Gao,Zhouhao Sun,Li Du,Bing Qin,Ting Liu
Affiliations: Harbin Institute of Technology; Beijing Academy of Artificial Intelligence
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings remains a critical challenge, as unconstrained generation entails multi-faceted and often conflicting objectives - such as creativity versus factuality - where rigid, static reward scalarization is inherently suboptimal. To address this, we propose MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model’s terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal. Across seven benchmarks, MAESTRO consistently outperforms single-reward and static multi-objective baselines, while preserving the efficiency advantages of GRPO, and in some settings even reducing redundant generation.
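A minimal sketch of the core idea of treating reward scalarization as a contextual bandit: a tiny "Conductor" maps a context vector (standing in for the policy's terminal hidden state) to softmax weights over reward components, and is nudged by a meta-reward. All shapes, the reward components, and the update rule are illustrative assumptions, not the paper's bi-level optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ctx, n_rewards = 16, 3            # e.g., creativity, factuality, fluency
W = 0.01 * rng.standard_normal((n_rewards, d_ctx))  # the lightweight "Conductor"

def scalarize(context: np.ndarray, rewards: np.ndarray):
    logits = W @ context
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # softmax mixing weights
    return float(weights @ rewards), weights

for step in range(100):
    ctx = rng.standard_normal(d_ctx)                # stand-in for terminal hidden state
    component_rewards = rng.uniform(size=n_rewards) # per-objective reward scores
    scalar, w = scalarize(ctx, component_rewards)
    meta_reward = scalar - 0.5                      # stand-in for group-relative advantage
    # gradient of the scalarized reward w.r.t. W, scaled by the meta-reward
    grad = np.outer(w * (component_rewards - scalar), ctx)
    W += 0.1 * meta_reward * grad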

[NLP-59] Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG AAAI2026

【Quick Read】: This paper tackles two core problems of current graph-based retrieval-augmented generation (GraphRAG): the inherent incompleteness of knowledge graphs (KGs), which breaks reasoning paths, and the low signal-to-noise ratio of graph structure, which introduces query-relevant but misleading distractor facts. The authors propose a new reason-and-construct paradigm and the Relink framework. Its key innovations are: (1) dynamically instantiating required facts from a latent relation pool extracted from the original text corpus, repairing broken paths on the fly; and (2) a unified, query-aware evaluation strategy that jointly screens candidate facts from the KG and the latent relations, preferring those most useful for answering the query rather than relying on their pre-existence, thereby actively discarding distractors and constructing the most faithful and precise evidence path for each query. Experiments show that Relink significantly outperforms leading GraphRAG baselines on five open-domain question answering benchmarks, with average gains of 5.4% EM and 5.2% F1.

Link: https://arxiv.org/abs/2601.07192
Authors: Manzong Huang,Chenyang Bu,Yi He,Xingrui Zhuo,Xindong Wu
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Cloud
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026

Click to view abstract

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) mitigates hallucinations in Large Language Models (LLMs) by grounding them in structured knowledge. However, current GraphRAG methods are constrained by a prevailing build-then-reason paradigm, which relies on a static, pre-constructed Knowledge Graph (KG). This paradigm faces two critical challenges. First, the KG's inherent incompleteness often breaks reasoning paths. Second, the graph's low signal-to-noise ratio introduces distractor facts, presenting query-relevant but misleading knowledge that disrupts the reasoning process. To address these challenges, we argue for a reason-and-construct paradigm and propose Relink, a framework that dynamically builds a query-specific evidence graph. To tackle incompleteness, Relink instantiates required facts from a latent relation pool derived from the original text corpus, repairing broken paths on the fly. To handle misleading or distractor facts, Relink employs a unified, query-aware evaluation strategy that jointly considers candidates from both the KG and latent relations, selecting those most useful for answering the query rather than relying on their pre-existence. This empowers Relink to actively discard distractor facts and construct the most faithful and precise evidence path for each query. Extensive experiments on five Open-Domain Question Answering benchmarks show that Relink achieves significant average improvements of 5.4% in EM and 5.2% in F1 over leading GraphRAG baselines, demonstrating the superiority of our proposed framework.

[NLP-60] Structured Reasoning for Large Language Models

【Quick Read】: This paper addresses the redundancy and inefficiency of large language models (LLMs) when generating long chains of thought (CoT), especially the unnecessary verification and revision they perform after already reaching the correct answer. The core solution, Structured Reasoning (SCR), decouples reasoning trajectories into explicit, evaluable, and trainable components, implemented through a Generate-Verify-Revise paradigm. The key ideas are constructing structured training data with Dynamic Termination Supervision to teach the model when to stop reasoning, and a progressive two-stage reinforcement learning strategy that separates the learning signals for different reasoning abilities, markedly improving reasoning efficiency and self-verification while reducing output token length by up to 50%.

Link: https://arxiv.org/abs/2601.07180
Authors: Jinyi Han,Zixiang Di,Zishang Jiang,Ying Liao,Jiaqing Liang,Yongqi Wang,Yanghua Xiao
Affiliations: East China Normal University; Fudan University; Beijing Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.

[NLP-61] Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?

【Quick Read】: This paper examines the insufficient robustness of large language models (LLMs) in mixed-language (code-switching) settings, focusing on their ability to understand, reason over, and generate code-switched text. The key contribution is CodeMixQA, a high-quality, diverse benchmark with 16 parallel code-switched language-pair variants spanning multiple geographic regions, in both original scripts and transliterated forms. Based on this benchmark, the authors analyze LLM reasoning behavior on code-switched inputs and systematically evaluate the naturalness and semantic fidelity of LLM-generated code-switched text, revealing core limitations of current generation capabilities and offering actionable directions for building more robust multilingual LLMs.

Link: https://arxiv.org/abs/2601.07153
Authors: Genta Indra Winata,David Anugraha,Patrick Amadeus Irawan,Anirban Das,Haneul Yoo,Paresh Dashore,Shreyas Kulkarni,Ruochen Zhang,Haruki Sakajo,Frederikus Hudi,Anaelia Ovalle,Syrielle Montariol,Felix Gaschi,Michael Anugraha,Rutuj Ravindra Puranik,Zawad Hayat Ahmed,Adril Putra Merin,Emmanuele Chersoni
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Click to view abstract

Abstract:Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood. In this work, we present a comprehensive evaluation of LLM capabilities in understanding, reasoning over, and generating code-switched text. We introduce CodeMixQA, a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns, and include both original scripts and their transliterated forms. Using this benchmark, we analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs. We further conduct a systematic evaluation of LLM-generated synthetic code-switched text, focusing on both naturalness and semantic fidelity, and uncover key limitations in current generation capabilities. Our findings reveal persistent challenges in both reasoning and generation under code-switching conditions and provide actionable insights for building more robust multilingual LLMs. We release the dataset and code as open source.

[NLP-62] Rewarding Creativity: A Human-Aligned Generative Reward Model for Reinforcement Learning in Storytelling

【Quick Read】: This paper addresses two challenges of applying generative AI to creative story generation: designing reliable, interpretable reward signals for subjective story quality, and mitigating instability in reinforcement learning (RL) training. The key to the proposed Reinforcement Learning for Creative Storytelling (RLCS) framework is twofold. First, a Generative Reward Model (GenRM), trained via supervised fine-tuning and GRPO-based refinement, performs multi-dimensional analysis with explicit reasoning, aligning reward signals with human judgments of creativity. Second, an entropy-based reward-shaping mechanism dynamically prioritizes confident errors and uncertain correct predictions, preventing overfitting on already-mastered patterns and stabilizing training. Experiments show GenRM reaches 68% alignment with human creativity judgments, and RLCS significantly outperforms strong baselines such as Gemini-2.5-Pro in overall story quality.

Link: https://arxiv.org/abs/2601.07149
Authors: Zhaoyan Li,Hang Lei,Yujia Wang,Lanbo Liu,Hao Liu,Liang Yu
Affiliations: Alibaba Group
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:While Large Language Models (LLMs) can generate fluent text, producing high-quality creative stories remains challenging. Reinforcement Learning (RL) offers a promising solution but faces two critical obstacles: designing reliable reward signals for subjective storytelling quality and mitigating training instability. This paper introduces the Reinforcement Learning for Creative Storytelling (RLCS) framework to systematically address both challenges. First, we develop a Generative Reward Model (GenRM) that provides multi-dimensional analysis and explicit reasoning about story preferences, trained through supervised fine-tuning on demonstrations with reasoning chains distilled from strong teacher models, followed by GRPO-based refinement on expanded preference data. Second, we introduce an entropy-based reward shaping strategy that dynamically prioritizes learning on confident errors and uncertain correct predictions, preventing overfitting on already-mastered patterns. Experiments demonstrate that GenRM achieves 68% alignment with human creativity judgments, and RLCS significantly outperforms strong baselines including Gemini-2.5-Pro in overall story quality. This work provides a practical pipeline for applying RL to creative domains, effectively navigating the dual challenges of reward modeling and training stability.

[NLP-63] Measuring Iterative Temporal Reasoning with TimePuzzles

【Quick Read】: This paper addresses the difficulty of evaluating iterative temporal reasoning in large language models (LLMs), especially when handling complex temporal constraints that combine cross-cultural calendar relations with factual temporal anchors. The key to the solution is TimePuzzles, a constraint-based temporal reasoning task: algorithmically generated date puzzles with well-defined solution spaces, combined with calendar logic from multiple cultures, enable controlled, dynamic, and continual evaluation. Experiments show that TimePuzzles effectively distinguishes the reasoning capabilities of different LLMs even without tools, while also revealing substantial gaps in reliable tool use (such as code interpreters or web search), making it an efficient, low-cost diagnostic for tool-augmented iterative temporal reasoning.

Link: https://arxiv.org/abs/2601.07148
Authors: Zhengxiang Wang,Zeyu Dong
Affiliations: Stony Brook University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce TimePuzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or multiple valid solution dates, and is algorithmically generated for controlled, dynamic, and continual evaluation. Across 13 diverse LLMs, TimePuzzles well distinguishes their iterative temporal reasoning capabilities and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and all other models stay below 31%, despite the dataset’s simplicity. Web search consistently yields substantial gains and using code interpreter shows mixed effects, but all models perform much better when constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, TimePuzzles presents a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.
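The constraint-satisfaction flavor of the task can be reproduced in a few lines: enumerate candidate dates and keep those satisfying all predicates. The specific constraints below are made up for illustration; the benchmark's puzzles combine factual anchors with (cross-cultural) calendar relations and are generated algorithmically.

```python
from datetime import date, timedelta

def solve(constraints, start=date(2024, 1, 1), end=date(2024, 12, 31)):
    """Brute-force date inference: return every date satisfying all constraints."""
    d, hits = start, []
    while d <= end:
        if all(c(d) for c in constraints):
            hits.append(d)
        d += timedelta(days=1)
    return hits

# toy puzzle: "a Friday in the second quarter, exactly 39 days after April 1"
constraints = [
    lambda d: d.weekday() == 4,                     # Friday
    lambda d: 4 <= d.month <= 6,                    # second quarter
    lambda d: (d - date(2024, 4, 1)).days == 39,
]
print(solve(constraints))                           # [datetime.date(2024, 5, 10)]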

[NLP-64] ReinPool: Reinforcement Learning Pooling Multi-Vector Embeddings for Retrieval System

【Quick Read】: This paper addresses the severe index bloat of multi-vector embedding models in document retrieval, where storing per-token embeddings inflates index size and limits scalability. The key to the solution, ReinPool, is a reinforcement learning framework for dynamic filtering and pooling: trained with an inverse retrieval objective and NDCG-based rewards, it automatically identifies and retains the most discriminative vectors without manual importance annotations, compressing multi-vector representations by 746-1249× into single vectors while recovering 76-81% of full multi-vector retrieval performance and clearly outperforming static mean-pooling baselines.

Link: https://arxiv.org/abs/2601.07125
Authors: Sungguk Cha,DongWook Kim,Mintae Kim,Youngsub Han,Byoung-Ki Jeon,Sangyeob Lee
Affiliations: LG Uplus
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages

Click to view abstract

Abstract:Multi-vector embedding models have emerged as a powerful paradigm for document retrieval, preserving fine-grained visual and textual details through token-level representations. However, this expressiveness comes at a staggering cost: storing embeddings for every token inflates index sizes by over 1000× compared to single-vector approaches, severely limiting scalability. We introduce ReinPool, a reinforcement learning framework that learns to dynamically filter and pool multi-vector embeddings into compact, retrieval-optimized representations. By training with an inverse retrieval objective and NDCG-based rewards, ReinPool identifies and retains only the most discriminative vectors without requiring manual importance annotations. On the Vidore V2 benchmark across three vision-language embedding models, ReinPool compresses multi-vector representations by 746-1249× into single vectors while recovering 76-81% of full multi-vector retrieval performance. Compared to static mean pooling baselines, ReinPool achieves 22-33% absolute NDCG@3 improvement, demonstrating that learned selection significantly outperforms heuristic aggregation.
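A schematic of the filter-then-pool step: score token vectors, keep the top-k, and mean-pool them into one vector. In ReinPool the scorer is learned with RL against an NDCG reward; here the scorer is a random linear probe and the shapes are arbitrary, purely for illustration.

```python
import numpy as np

def filter_and_pool(token_vecs: np.ndarray, scorer: np.ndarray, k: int) -> np.ndarray:
    """Compress a (n_tokens, dim) multi-vector output into a single (dim,) vector
    by keeping the k highest-scoring token vectors and mean-pooling them."""
    scores = token_vecs @ scorer                    # learned in ReinPool, random here
    keep = np.argsort(scores)[-k:]
    pooled = token_vecs[keep].mean(axis=0)
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(0)
doc_tokens = rng.standard_normal((512, 128))        # e.g., one page's token embeddings
single_vec = filter_and_pool(doc_tokens, scorer=rng.standard_normal(128), k=32)
print(single_vec.shape)                             # (128,): one vector instead of 512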

[NLP-65] ReMIND: Orchestrating Modular Large Language Models for Controllable Serendipity: A REM-Inspired System Design for Emergent Creative Ideation

【Quick Read】: This paper addresses the difficulty of eliciting serendipitous insights from large language models (LLMs) that are both novel and internally coherent in creative ideation tasks. The key to the solution is ReMIND, a REM-sleep-inspired modular framework whose four stages, wake, dream, judge, and re-wake, functionally separate exploration from consolidation: a low-temperature wake stage produces a stable semantic baseline, a high-temperature dream stage performs exploratory generation to induce diversity, a judge stage applies coarse filtering to discard incoherent outputs and extract candidate ideas, and a re-wake stage re-articulates the selected ideas into coherent final outputs. This system-level design lets LLMs reliably induce semantic exploration while preserving downstream stability, raising the probability and quality of serendipitous ideation.

Link: https://arxiv.org/abs/2601.07121
Authors: Makoto Sato
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are used not only for problem solving but also for creative ideation; however, eliciting serendipitous insights that are both novel and internally coherent remains difficult. While stochastic sampling promotes novelty, it often degrades consistency. Here, we propose ReMIND, a REM-inspired modular framework for ideation. ReMIND consists of four stages: wake, which generates a stable low-temperature semantic baseline; dream, which performs high-temperature exploratory generation; judge, which applies coarse evaluation to filter incoherent outputs and extract candidate ideas; and re-wake, which re-articulates selected ideas into coherent final outputs. By instantiating each stage as an independent LLM, ReMIND enables functional separation between exploration and consolidation. Parameter sweeps show that ReMIND reliably induces semantic exploration while preserving downstream stability. Embedding-based analyses confirm substantial semantic displacement during the dream phase, whereas external evaluations reveal that high-quality ideas emerge sporadically rather than as extrema along any single metric. These results suggest that serendipitous ideation in LLMs is a rare-event process best approached through system level design that shapes the conditions under which valuable ideas can emerge and be stabilized. ReMIND provides a general framework for studying the computational basis of serendipity and illustrates how modular LLM orchestration can bridge exploration and stabilization.

[NLP-66] The Need for a Socially-Grounded Persona Framework for User Simulation

【Quick Read】: This paper addresses the poor quality of the personas currently used to condition large language models (LLMs) for social simulation, which are mostly built from coarse sociodemographic attributes or summaries and therefore predict behavior weakly and introduce bias. The key to the solution is SCOPE, a socially grounded framework for persona construction and evaluation built from a 141-item, two-hour sociopsychological protocol collected from 124 U.S.-based participants, which incorporates finer-grained sociopsychological dimensions (such as values and identity) to improve the structural soundness and behavioral consistency of personas. Experiments show that non-demographic personas perform better across benchmarks and substantially reduce bias, indicating that persona quality depends on sociopsychological structure rather than simple demographic labels.

Link: https://arxiv.org/abs/2601.07110
Authors: Pranav Narayanan Venkit,Yu Li,Yada Pruksachatkun,Chien-Sheng Wu
Affiliations: Salesforce Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Synthetic personas are widely used to condition large language models (LLMs) for social simulation, yet most personas are still constructed from coarse sociodemographic attributes or summaries. We revisit persona creation by introducing SCOPE, a socially grounded framework for persona construction and evaluation, built from a 141-item, two-hour sociopsychological protocol collected from 124 U.S.-based participants. Across seven models, we find that demographic-only personas are a structural bottleneck: demographics explain only ~1.5% of variance in human response similarity. Adding sociopsychological facets improves behavioral prediction and reduces over-accentuation, and non-demographic personas based on values and identity achieve strong alignment with substantially lower bias. These trends generalize to SimBench (441 aligned questions), where SCOPE personas outperform default prompting and NVIDIA Nemotron personas, and SCOPE augmentation improves Nemotron-based personas. Our results indicate that persona quality depends on sociopsychological structure rather than demographic templates or summaries.

[NLP-67] Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge

【Quick Read】: This paper clarifies how different knowledge-injection mechanisms affect the multi-hop reasoning ability of large language models (LLMs), especially when the required knowledge is temporally novel. The central question is whether parametric methods (fine-tuning) or non-parametric methods (retrieval-augmented generation, RAG) inject knowledge more effectively for open-domain multi-hop question answering. The key to the solution is a systematic comparison of three injection strategies, unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and RAG, on two benchmarks, including a newly constructed dataset built from 2024 Wikipedia events to test knowledge beyond the models' pretraining cutoff. The results show that RAG yields substantial and consistent gains on temporally novel information, while supervised fine-tuning achieves the highest overall accuracy, highlighting fundamental differences in how injection mechanisms support multi-hop reasoning and the importance of retrieval when external or compositional knowledge is required.

Link: https://arxiv.org/abs/2601.07054
Authors: Zhuoyi Yang,Yurun Song,Iftekhar Ahmed,Ian Harris
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as finetuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.

[NLP-68] Engineering of Hallucination in Generative AI: It's not a Bug, it's a Feature

【Quick Read】: This paper questions the view that hallucination in generative AI is merely a defect: although these models are trained strictly on real data, in practice they only perform satisfactorily when allowed a certain degree of hallucination. The key point is that simple probability-engineering techniques can deliberately encourage a limited amount of hallucination, improving the usefulness and plausibility of outputs rather than maximizing factual accuracy alone, suggesting that hallucination may be a feature rather than a bug.

Link: https://arxiv.org/abs/2601.07046
Authors: Tim Fingscheidt,Patrick Blumenberg,Björn Möller
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: This article reflects a talk given by Tim Fingscheidt at the 2025 New Year gathering of the Braunschweigische Wissenschaftliche Gesellschaft on January 25th, 2025

Click to view abstract

Abstract:Generative artificial intelligence (AI) is conquering our lives at lightning speed. Large language models such as ChatGPT answer our questions or write texts for us, large computer vision models such as GAIA-1 generate videos on the basis of text descriptions or continue prompted videos. These neural network models are trained using large amounts of text or video data, strictly according to the real data employed in training. However, there is a surprising observation: When we use these models, they only function satisfactorily when they are allowed a certain degree of fantasy (hallucination). While hallucination usually has a negative connotation in generative AI - after all, ChatGPT is expected to give a fact-based answer! - this article recapitulates some simple means of probability engineering that can be used to encourage generative AI to hallucinate to a limited extent and thus lead to the desired results. We have to ask ourselves: Is hallucination in generative AI probably not a bug, but rather a feature?
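The "probability engineering" the article alludes to is essentially decoding-time reshaping of the next-token distribution. A minimal sketch with temperature scaling and top-p (nucleus) filtering over toy logits; the numbers are arbitrary and the function is a generic decoding illustration, not the article's specific recipe.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0, top_p: float = 1.0,
                      rng=np.random.default_rng(0)) -> int:
    """Higher temperature flattens the distribution (more 'fantasy'); lower
    temperature and smaller top_p concentrate it on the likeliest tokens."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                 # nucleus: smallest set with mass >= top_p
    cum = np.cumsum(probs[order])
    kept = order[:np.searchsorted(cum, top_p) + 1]
    p = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=p))

logits = np.array([2.0, 1.5, 0.2, -1.0])
print(sample_next_token(logits, temperature=0.2))   # near-greedy, fact-biased
print(sample_next_token(logits, temperature=1.5))   # looser, more "hallucination"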

[NLP-69] When Abundance Conceals Weakness: Knowledge Conflict in Multilingual Models

【Quick Read】: This paper addresses the inconsistency of multilingual large language models' internal beliefs under cross-lingual knowledge conflict, i.e., how models reconcile knowledge across languages when external evidence contradicts their language-dependent memories. Existing work is largely English-centric and lacks systematic evaluation of conflict resolution in non-English settings. The key to the solution is CLEAR, a framework for systematically evaluating cross-lingual knowledge-conflict resolution that decomposes the process into four progressive scenarios, from multilingual parametric elicitation to competitive multi-source cross-lingual induction. Using multilingual versions of ConflictQA and ConflictingQA covering 10 typologically diverse languages, the authors evaluate six representative LLMs on two QA benchmarks with distinct task characteristics, revealing a task-dependent dichotomy: in reasoning-intensive tasks, high-resource languages are more persuasive, whereas in entity-centric factual conflicts, linguistic affinity dominates, allowing low-resource but linguistically close languages to outperform distant high-resource ones.

Link: https://arxiv.org/abs/2601.07041
Authors: Jiaqi Zhao,Qiang Huang,Haodong Chen,Xiaoxing You,Jun Yu
Affiliations: Harbin Institute of Technology (Shenzhen); Hangzhou Dianzi University
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 7 figures, and 4 tables

Click to view abstract

Abstract:Large Language Models (LLMs) encode vast world knowledge across multiple languages, yet their internal beliefs are often unevenly distributed across linguistic spaces. When external evidence contradicts these language-dependent memories, models encounter cross-lingual knowledge conflict, a phenomenon largely unexplored beyond English-centric settings. We introduce CLEAR, a Cross-Lingual knowlEdge conflict evAluation fRamework that systematically examines how multilingual LLMs reconcile conflicting internal beliefs and multilingual external evidence. CLEAR decomposes conflict resolution into four progressive scenarios, from multilingual parametric elicitation to competitive multi-source cross-lingual induction, and systematically evaluates model behavior across two complementary QA benchmarks with distinct task characteristics. We construct multilingual versions of ConflictQA and ConflictingQA covering 10 typologically diverse languages and evaluate six representative LLMs. Our experiments reveal a task-dependent decision dichotomy. In reasoning-intensive tasks, conflict resolution is dominated by language resource abundance, with high-resource languages exerting stronger persuasive power. In contrast, for entity-centric factual conflicts, linguistic affinity, not resource scale, becomes decisive, allowing low-resource but linguistically aligned languages to outperform distant high-resource ones.

[NLP-70] Task Arithmetic with Support Languages for Low-Resource ASR ACL

【Quick Read】: This paper addresses the weak performance of automatic speech recognition (ASR) models in resource-constrained settings, where scarce training data limits accuracy for low-resource languages. The key to the solution is the notion of a task vector: the authors fine-tune variants of the Whisper ASR system on high-resource and low-resource languages to obtain task vectors, merge them via a linear combination, and optimize the combination weights on the low-resource language's validation set to minimize downstream word error rate (WER). This effectively transfers knowledge from high-resource languages and consistently improves ASR performance on low-resource target languages.

Link: https://arxiv.org/abs/2601.07038
Authors: Emma Rafkin,Dan DeGenaro,Xiulin Yang
Affiliations: Georgetown University; Applied Physics Laboratory, Johns Hopkins University
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, preprint submitted for review to a *ACL conference

Click to view abstract

Abstract:The development of resource-constrained approaches to automatic speech recognition (ASR) is of great interest due to its broad applicability to many low-resource languages for which there is scant usable data. Existing approaches to many low-resource natural language processing tasks leverage additional data from higher-resource languages that are closely related to a target low-resource language. One increasingly popular approach uses task arithmetic to combine models trained on different tasks to create a model for a task where there is little to no training data. In this paper, we consider training on a particular language to be a task, and we generate task vectors by fine-tuning variants of the Whisper ASR system. For pairings of high- and low-resource languages, we merge task vectors via a linear combination, optimizing the weights of the linear combination on the downstream word error rate on the low-resource target language’s validation set. We find that this approach consistently improves performance on the target languages.
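The merging step described above reduces to simple parameter arithmetic. A sketch under stated assumptions: parameters are plain arrays keyed by name, `wer_on_dev` is a hypothetical evaluation callback standing in for decoding on a dev set, and the weight search is a coarse grid rather than the paper's exact optimization.

```python
import numpy as np

def task_vector(base: dict, finetuned: dict) -> dict:
    """Task vector = fine-tuned weights minus base weights, per parameter."""
    return {k: finetuned[k] - base[k] for k in base}

def merge(base: dict, vectors: list, weights: list) -> dict:
    merged = {k: v.copy() for k, v in base.items()}
    for vec, w in zip(vectors, weights):
        for k in merged:
            merged[k] += w * vec[k]
    return merged

rng = np.random.default_rng(0)
base = {"w": rng.standard_normal((4, 4))}           # toy one-matrix "model"
high = {"w": base["w"] + 0.5}                       # fine-tuned on a high-resource language
low = {"w": base["w"] + 0.1}                        # fine-tuned on the low-resource target
vecs = [task_vector(base, high), task_vector(base, low)]

def wer_on_dev(model: dict) -> float:               # hypothetical dev-set WER proxy
    return float(np.abs(model["w"] - (base["w"] + 0.3)).mean())

best = min(((a, b) for a in np.linspace(0, 1, 11) for b in np.linspace(0, 1, 11)),
           key=lambda ab: wer_on_dev(merge(base, vecs, list(ab))))
print(best)                                         # weights minimizing the dev "WER"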

[NLP-71] Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

【Quick Read】: This paper shows that in hybrid reasoning language models, the mode switching nominally controlled by high-level Think/No-think instructions is actually driven by a small set of trigger tokens rather than the instructions themselves. The key finding is that a leading trigger token (such as "Okay") induces reasoning behavior, while the newline pattern following "/think" suppresses it. Building on this, the authors propose Mid-Think, a training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, outperforming fixed-token and prompt-based baselines on the accuracy-length trade-off and further improving model performance when applied to RL training.

Link: https://arxiv.org/abs/2601.07036
Authors: Wang Yang,Debargha Ganguly,Xinpeng Li,Chaoda Song,Shouren Wang,Vikash Singh,Vipin Chaudhary,Xiaotian Han
Affiliations: Case Western Reserve University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading "Okay" token induces reasoning behavior, while the newline pattern following "/think" suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.
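Since the control mechanism is token-level, the idea can be shown as plain string assembly. The trigger strings follow the paper's description (a leading "Okay" induces reasoning; the newline pattern after "/think" suppresses it), but the exact chat template any given model expects is model-specific, so this combination is a schematic guess, not the paper's verbatim format.

```python
def build_prompt(question: str, mode: str) -> str:
    """Assemble a prompt whose trailing tokens steer the reasoning budget.
    'think' / 'no_think' mimic the standard modes; 'mid_think' combines the
    suppressing newline pattern with the reasoning-inducing leading token."""
    if mode == "think":
        return f"{question} /think\nAssistant:"
    if mode == "no_think":
        return f"{question} /no_think\nAssistant:"
    if mode == "mid_think":                         # training-free intermediate budget
        return f"{question} /think\n\nAssistant: Okay"
    raise ValueError(f"unknown mode: {mode}")

print(build_prompt("What is 17 * 24?", "mid_think"))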

[NLP-72] Codified Foreshadowing-Payoff Text Generation

【Quick Read】: This paper addresses large language models' (LLMs') failure to maintain long-range narrative dependencies in story generation, in particular the foreshadowing-payoff structure, which leaves "Chekhov's guns" unfired even when the necessary setup is present. The key to the solution is the Codified Foreshadowing-Payoff Generation (CFPG) framework, which turns narrative continuity into executable causal predicates: by mining and encoding Foreshadow-Trigger-Payoff triples from the BookSum corpus, it provides structured supervision ensuring that foreshadowed commitments are not only mentioned but also temporally and logically fulfilled. This significantly improves payoff accuracy and narrative alignment, suggesting that explicitly codifying narrative mechanics is essential for genuine narrative competence.

Link: https://arxiv.org/abs/2601.07033
Authors: Longfei Yun,Kun Zhou,Yupeng Hou,Letian Peng,Jingbo Shang
Affiliations: University of California, San Diego
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Foreshadowing and payoff are ubiquitous narrative devices through which authors introduce commitments early in a story and resolve them through concrete, observable outcomes. However, despite advances in story generation, large language models (LLMs) frequently fail to bridge these long-range narrative dependencies, often leaving “Chekhov’s guns” unfired even when the necessary context is present. Existing evaluations largely overlook this structural failure, focusing on surface-level coherence rather than the logical fulfillment of narrative setups. In this paper, we introduce Codified Foreshadowing-Payoff Generation (CFPG), a novel framework that reframes narrative quality through the lens of payoff realization. Recognizing that LLMs struggle to intuitively grasp the “triggering mechanism” of a foreshadowed event, CFPG transforms narrative continuity into a set of executable causal predicates. By mining and encoding Foreshadow-Trigger-Payoff triples from the BookSum corpus, we provide structured supervision that ensures foreshadowed commitments are not only mentioned but also temporally and logically fulfilled. Experiments demonstrate that CFPG significantly outperforms standard prompting baselines in payoff accuracy and narrative alignment. Our findings suggest that explicitly codifying narrative mechanics is essential for moving LLMs from surface-level fluency to genuine narrative competence.

[NLP-73] Solar Open Technical Report

【Quick Read】: This paper addresses the challenges underserved languages face in large language model development: scarce training data, difficulty coordinating multi-domain data, and the difficulty of improving reasoning through scalable reinforcement learning (RL). The key to the solution is a systematic methodology: first, synthesizing 4.5T tokens of high-quality, domain-specific, RL-oriented data to mitigate scarcity; second, a progressive curriculum that jointly optimizes text composition, quality thresholds, and domain coverage across 20 trillion tokens; and third, the SnapPO framework for efficient, scalable RL optimization to strengthen reasoning. The resulting model performs competitively on English and Korean benchmarks, validating the methodology for underserved-language AI development.

Link: https://arxiv.org/abs/2601.07022
Authors: Sungrae Park,Sanghoon Kim,Jungho Cho,Gyoungjin Gim,Dawoon Jung,Mikyoung Cha,Eunhae Choo,Taekgyu Hong,Minbyul Jeong,SeHwan Joo,Minsoo Khang,Eunwon Kim,Minjeong Kim,Sujeong Kim,Yunsu Kim,Hyeonju Lee,Seunghyun Lee,Sukyung Lee,Siyoung Park,Gyungin Shin,Inseo Song,Wonho Song,Seonghoon Yang,Seungyoun Yi,Sanghoon Yoon,Jeonghyun Ko,Seyoung Song,Keunwoo Choi,Hwalsuk Lee,Sunghun Kim,Du-Seong Chang,Kyunghyun Cho,Junsuk Choe,Hwaran Lee,Jae-Gil Lee,KyungTae Lim,Alice Oh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.

[NLP-74] TurkBench: A Benchmark for Evaluating Turkish Large Language Models

【Quick Read】: This paper addresses the scarcity of evaluation benchmarks for generative models in non-English languages, particularly Turkish, which has distinctive linguistic characteristics: while English model evaluation is well developed, languages like Turkish lack systematic, multi-dimensional test frameworks. The key to the solution is TurkBench, a comprehensive benchmark of 8,151 data samples across 21 subtasks, organized into six core evaluation categories: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. Its diverse task types and culturally relevant data give researchers and developers a quantifiable, comparable tool for assessing model performance and identifying areas for improvement.

Link: https://arxiv.org/abs/2601.07020
Authors: Çağrı Toraman,Ahmet Kaan Sever,Ayse Aysu Cengiz,Elif Ecem Arslan,Görkem Sevinç,Mete Mert Birdal,Yusuf Faruk Güldemir,Ali Buğra Kanburoğlu,Sezen Felekoğlu,Osman Gürlek,Sarp Kantar,Birsen Şahin Kütük,Büşra Tufan,Elif Genç,Serkan Coşkun,Gupse Ekin Demir,Muhammed Emin Arayıcı,Olgun Dursun,Onur Gungor,Susan Üsküdarlı,Abdullah Topraksoy,Esra Darıcı
Affiliations: Middle East Technical University; Bilkent University; Hacettepe University; Bogazici University; Istanbul University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at this https URL

[NLP-75] Lexicalized Constituency Parsing for Middle Dutch: Low-resource Training and Cross-Domain Generalization

【Quick Read】: This paper addresses constituency parsing for low-resource historical languages such as Middle Dutch, which is highly heterogeneous and has scarce annotated data, leaving existing models with limited performance. The key to the solution is adapting a transformer-based constituency parser and introducing joint training with higher-resource auxiliary languages to improve in-domain performance, with the largest gains coming from languages geographically and temporally closer to Middle Dutch. The authors further explore fine-tuning and data combination with newly annotated data to improve cross-domain generalization, finding that roughly 200 examples per domain is the minimum threshold for effective cross-domain adaptation.

Link: https://arxiv.org/abs/2601.07008
Authors: Yiming Liang,Fang Zhao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Recent years have seen growing interest in applying neural networks and contextualized word embeddings to the parsing of historical languages. However, most advances have focused on dependency parsing, while constituency parsing for low-resource historical languages like Middle Dutch has received little attention. In this paper, we adapt a transformer-based constituency parser to Middle Dutch, a highly heterogeneous and low-resource language, and investigate methods to improve both its in-domain and cross-domain performance. We show that joint training with higher-resource auxiliary languages increases F1 scores by up to 0.73, with the greatest gains achieved from languages that are geographically and temporally closer to Middle Dutch. We further evaluate strategies for leveraging newly annotated data from additional domains, finding that fine-tuning and data combination yield comparable improvements, and our neural parser consistently outperforms the currently used PCFG-based parser for Middle Dutch. We further explore feature-separation techniques for domain adaptation and demonstrate that a minimum threshold of approximately 200 examples per domain is needed to effectively enhance cross-domain performance.

[NLP-76] FinCARDS: Card-Based Analyst Reranking for Financial Document Question Answering

【Quick Read】: This paper addresses the unstable evidence selection and opaque decisions of generative AI in question answering over long corporate filings, where the lack of structural constraints hurts accuracy and auditability: existing LLM-based rerankers optimize mainly semantic relevance and struggle with strict constraints on entities, financial metrics, fiscal periods, and numeric values. The key to the solution is FinCards, a structured reranking framework built on a finance-aware schema: filing chunks and questions are mapped to aligned schema fields (entities, metrics, periods, numeric spans) for deterministic field-level matching, and evidence is selected via multi-stage tournament reranking with stability-aware aggregation, substantially improving early-rank retrieval and reducing ranking variance without model fine-tuning or unpredictable inference budgets.

Link: https://arxiv.org/abs/2601.06992
Authors: Yixi Zhou,Fan Zhang,Yu Chen,Haipeng Zhang,Preslav Nakov,Zhuohan Xie
Affiliations: ShanghaiTech University; The University of Tokyo; MBZUAI
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, including figures and tables

Click to view abstract

Abstract:Financial question answering (QA) over long corporate filings requires evidence to satisfy strict constraints on entities, financial metrics, fiscal periods, and numeric values. However, existing LLM-based rerankers primarily optimize semantic relevance, leading to unstable rankings and opaque decisions on long documents. We propose FinCards, a structured reranking framework that reframes financial evidence selection as constraint satisfaction under a finance-aware schema. FinCards represents filing chunks and questions using aligned schema fields (entities, metrics, periods, and numeric spans), enabling deterministic field-level matching. Evidence is selected via a multi-stage tournament reranking with stability-aware aggregation, producing auditable decision traces. Across two corporate filing QA benchmarks, FinCards substantially improves early-rank retrieval over both lexical and LLM-based reranking baselines, while reducing ranking variance, without requiring model fine-tuning or unpredictable inference budgets. Our code is available at this https URL.
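The deterministic field-level matching at the heart of the framework can be sketched as structured comparison between a question card and each chunk card. The schema fields follow the abstract (entity, metric, period, numeric span); the card format and scoring weights below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Card:                                          # finance-aware schema fields
    entity: str
    metric: str
    period: str
    values: tuple  # numeric span (low, high)

def match_score(question: Card, chunk: Card) -> float:
    """Deterministic constraint checks; the weights are illustrative, not the paper's."""
    s = 0.0
    s += 3.0 * (question.entity == chunk.entity)
    s += 3.0 * (question.metric == chunk.metric)
    s += 2.0 * (question.period == chunk.period)
    lo, hi = chunk.values
    s += 1.0 * (lo <= question.values[0] <= hi)      # numeric span containment
    return s

q = Card("ACME Corp", "revenue", "FY2023", (1.0e9, 1.0e9))
chunks = [
    Card("ACME Corp", "revenue", "FY2023", (0.9e9, 1.1e9)),
    Card("ACME Corp", "net income", "FY2023", (0.1e9, 0.2e9)),
]
ranked = sorted(chunks, key=lambda c: match_score(q, c), reverse=True)
print(ranked[0].metric)                              # 'revenue': rule-based, auditable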

[NLP-77] MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education EMNLP2025

【Quick Read】: This paper addresses two challenges medical residents face in training: efficiently interpreting complex case reports, and quickly acquiring accurate medical knowledge from reliable sources. The key to the proposed MedTutor system is a retrieval-augmented generation (RAG) architecture with a hybrid retrieval mechanism that jointly queries a local knowledge base of medical textbooks and academic literature databases (via the PubMed and Semantic Scholar APIs), ensuring generated content is both foundationally sound and current. Retrieved evidence is then filtered and ordered by a state-of-the-art reranking model, and an LLM generates case-targeted structured educational materials and multiple-choice questions, enabling automated, high-quality teaching support.

Link: https://arxiv.org/abs/2601.06979
Authors: Dongsuk Jang,Ziyao Shangguan,Kyle Tegtmeyer,Anurag Gupta,Jan Czerminski,Sophie Chheang,Arman Cohan
Affiliations: Yale University; Seoul National University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 (System Demonstrations)

Click to view abstract

Abstract:The learning process for medical residents presents significant challenges, demanding both the ability to interpret complex case reports and the rapid acquisition of accurate medical knowledge from reliable sources. Residents typically study case reports and engage in discussions with peers and mentors, but finding relevant educational materials and evidence to support their learning from these cases is often time-consuming and challenging. To address this, we introduce MedTutor, a novel system designed to augment resident training by automatically generating evidence-based educational content and multiple-choice questions from clinical case reports. MedTutor leverages a Retrieval-Augmented Generation (RAG) pipeline that takes clinical case reports as input and produces targeted educational materials. The system's architecture features a hybrid retrieval mechanism that synergistically queries a local knowledge base of medical textbooks and academic literature (using PubMed, Semantic Scholar APIs) for the latest related research, ensuring the generated content is both foundationally sound and current. The retrieved evidence is filtered and ordered using a state-of-the-art reranking model and then an LLM generates the final long-form output describing the main educational content regarding the case report. We conduct a rigorous evaluation of the system. First, three radiologists assessed the quality of outputs, finding them to be of high clinical and educational value. Second, we perform a large scale evaluation using an LLM-as-a-Judge to understand if LLMs can be used to evaluate the output of the system. Our analysis using correlation between LLM outputs and human expert judgments reveals a moderate alignment and highlights the continued necessity of expert oversight.

[NLP-78] UETQuintet at BioCreative IX - MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval IJCAI

【Quick Read】: This paper addresses the challenges biomedical question answering systems face with complex medical queries, especially weak multi-hop reasoning and inefficient integration of multi-source information. The key to the solution is handling direct and sequential questions separately: direct questions are answered via an efficient direct path to minimize overhead, while sequential questions are decomposed into chains of sub-questions to enable step-wise reasoning; multi-source information retrieval and in-context learning then enrich the context for answer generation. The approach achieves an Exact Match score of 0.84 on the BioCreative IX - MedHopQA Shared Task datasets, ranking second on the current leaderboard.

Link: https://arxiv.org/abs/2601.06974
Authors: Quoc-An Nguyen,Thi-Minh-Thu Vu,Bich-Dat Nguyen,Dinh-Quang-Minh Tran,Hoang-Quynh Le
Affiliations: VNU University of Engineering and Technology
Subjects: Computation and Language (cs.CL)
Comments: Accepted at the BioCreative IX Challenge and Workshop (BC9) at IJCAI

Click to view abstract

Abstract:Biomedical Question Answering systems play a critical role in processing complex medical queries, yet they often struggle with the intricate nature of medical data and the demand for multi-hop reasoning. In this paper, we propose a model designed to effectively address both direct and sequential questions. While sequential questions are decomposed into a chain of sub-questions to perform reasoning across a chain of steps, direct questions are processed directly to ensure efficiency and minimise processing overhead. Additionally, we leverage multi-source information retrieval and in-context learning to provide rich, relevant context for generating answers. We evaluated our model on the BioCreative IX - MedHopQA Shared Task datasets. Our approach achieves an Exact Match score of 0.84, ranking second on the current leaderboard. These results highlight the model’s capability to meet the challenges of Biomedical Question Answering, offering a versatile solution for advancing medical research and practice.

[NLP-79] LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents

【Quick Read】: This paper addresses a structural limitation in the evolution of language models toward autonomous agents: the standard chat interface lacks private working memory, so models cannot reliably perform interactive tasks that depend on hidden state. Such tasks, termed Private State Interactive Tasks (PSITs), require an agent to maintain hidden information while producing consistent public responses. A theoretical analysis shows that any agent restricted to the public conversation history cannot simultaneously preserve secrecy and consistency, yielding an impossibility theorem; empirically, mainstream retrieval-based memory baselines and standard chat models fail a self-consistency test, showing that semantic retrieval does not provide true state maintenance. The key to the solution is a new architecture with an explicit private working memory, which experiments show restores consistency, establishing private state as a necessary component of interactive language agents.

Link: https://arxiv.org/abs/2601.06973
Authors: Davide Baldelli,Ali Parviz,Amal Zouaq,Sarath Chandar
Affiliations: Chandar Research Lab; LAMA-WeST Lab; Mila – Quebec AI Institute; Polytechnique Montréal; University of California, San Diego
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:As LLMs move from text completion toward autonomous agents, they remain constrained by the standard chat interface, which lacks private working memory. This raises a fundamental question: can agents reliably perform interactive tasks that depend on hidden state? We define Private State Interactive Tasks (PSITs), which require agents to generate and maintain hidden information while producing consistent public responses. We show theoretically that any agent restricted to the public conversation history cannot simultaneously preserve secrecy and consistency in PSITs, yielding an impossibility theorem. To empirically validate this limitation, we introduce a self-consistency testing protocol that evaluates whether agents can maintain a hidden secret across forked dialogue branches. Standard chat-based LLMs and retrieval-based memory baselines fail this test regardless of scale, demonstrating that semantic retrieval does not enable true state maintenance. To address this, we propose a novel architecture incorporating an explicit private working memory; we demonstrate that this mechanism restores consistency, establishing private state as a necessary component for interactive language agents.
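The paper's thesis can be seen in a few lines: an agent that keeps the secret outside the public transcript can answer guesses consistently across forked dialogues, whereas an agent that must reconstruct the secret from the public history cannot. The class below is an illustrative mechanism in the spirit of the proposal, not the paper's architecture.

```python
import random

class HangmanAgent:
    """Agent with an explicit private working memory: the secret word is stored
    in `self._secret` and never emitted into the public conversation history."""
    def __init__(self, vocabulary, seed=0):
        self._secret = random.Random(seed).choice(vocabulary)  # private state
        self.public_history = []                    # the only thing a chat LLM sees

    def guess(self, letter: str) -> str:
        hit = letter in self._secret
        reply = f"'{letter}': {'yes' if hit else 'no'}"
        self.public_history.append((letter, reply))
        return reply

agent = HangmanAgent(["raven", "tiger", "otter"])
# Fork the dialogue: the same guess must get the same answer on every branch,
# which holds here because the secret lives in private memory, not in the text.
print(agent.guess("t"), agent.guess("t"))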

[NLP-80] Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition

【Quick Read】: This paper asks whether the comparable performance of the two dominant speech encoder architectures, the Transformer and the Conformer, stems from convergent processing strategies or from distinct architectural inductive biases. To answer this, the authors introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, applied to a controlled suite of 24 pre-trained encoders (39M to 3.3B parameters). The key finding, from systematic layer-wise analysis, is divergent processing hierarchies: Conformers "Categorize Early," resolving phoneme categories 29% earlier in network depth and speaker gender by 16% depth, whereas Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57% depth). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may suit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.

Link: https://arxiv.org/abs/2601.06972
Authors: Nathan Roll,Pranav Bhalerao,Martijn Bartelds,Arjun Pawar,Yuka Tatsumi,Tolulope Ogunremi,Chen Shani,Calbert Graham,Meghan Sumner,Dan Jurafsky
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 3 figures, 9 tables

Click to view abstract

Abstract:In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a “Categorize Early” strategy, resolving phoneme categories 29% earlier in depth and speaker gender by 16% depth. In contrast, Transformers “Integrate Late,” deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers’ front-loaded categorization may benefit low-latency streaming, while Transformers’ deep integration may favor tasks requiring rich context and cross-utterance normalization.

[NLP-81] RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

【Quick Read】: This paper addresses the lack of effective long-term memory mechanisms as large language models (LLMs) evolve toward autonomous general agents, which makes it hard to maintain consistency in project-oriented interactions. Existing benchmarks focus on casual or task-oriented dialogue and fail to capture real projects with evolving goals and cross-session state tracking. The key contribution is RealMem, the first benchmark grounded in realistic project scenarios, built with a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the continual evolution of project state and contextual dependencies. Experiments show that current memory systems still struggle with long-term project states and dynamic context dependencies.

Link: https://arxiv.org/abs/2601.06966
Authors: Haonan Bian,Zhiyuan Yao,Sen Hu,Zishan Xu,Shaolei Zhang,Yifu Guo,Ziliang Yang,Xueran Han,Huacan Wang,Ronghao Chen
Affiliations: Xidian University; Zhejiang University; Peking University; Shanghai Jiao Tong University; Renmin University of China; Sun Yat-sen University; University of the Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture "long-term project-oriented" interactions where agents must track evolving goals. To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at [this https URL](this https URL).

[NLP-82] X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks Solutions and Tests

【Quick Read】: This paper addresses the limitations of current code LLMs in competitive programming, whose heavy reliance on real-world data constrains scalability and falls short of the intensive reasoning and high logical complexity required. The key to the solution is a fully synthetic data pipeline, SynthSmith, which uses feature-based synthesis to produce diverse and challenging tasks with verified solutions and test cases, supporting both supervised fine-tuning (SFT) and reinforcement learning (RL). Building on this synthetic data, the X-Coder model series surpasses larger models with only 7B parameters (e.g., 62.9 avg@8 on LiveCodeBench v5), while the analysis reveals scaling laws for high-quality synthetic data and the importance of staged training for code reasoning.

Link: https://arxiv.org/abs/2601.06953
Authors: Jie Wu,Haoling Li,Xin Zhang,Jiani Guo,Jane Luo,Steven Liu,Yangyu Huang,Ruihang Chu,Scarlett Li,Yujiu Yang
Affiliations: Tsinghua University; Microsoft; Wuhan University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project: this https URL

Click to view abstract

Abstract:Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.

[NLP-83] Symphonym: Universal Phonetic Embeddings for Cross-Script Toponym Matching via Teacher-Student Distillation

【Quick Read】: This paper addresses the difficulty of matching toponyms (place names) across languages and writing systems, i.e., recognizing that names in different scripts (Latin, Cyrillic, Arabic, etc.) refer to the same place. Traditional approaches rely on language-specific transliteration rules or string-similarity algorithms (such as Levenshtein distance) and fail across script boundaries. The key to the solution, Symphonym, is a neural embedding system that maps toponyms from 20 writing systems into a unified 128-dimensional phonetic space: a Teacher network trained on articulatory phonetic features (via Epitran and PanPhon) produces target embeddings, while a Student network learns to approximate them from raw characters. Training uses a three-phase curriculum over 57 million toponyms, and at inference only the lightweight Student (1.7M parameters) is needed, with no runtime phonetic conversion. The design markedly improves cross-script toponym matching, reaching 89.2% Recall@1 on the MEHDIE Hebrew-Arabic benchmark and outperforming traditional string methods.

Link: https://arxiv.org/abs/2601.06932
Authors: Stephen Gadd
Affiliations: University of London; University of Pittsburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 5 tables, 2 figures

Click to view abstract

Abstract:Linking place names across languages and writing systems is a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches rely on language-specific phonetic algorithms or transliteration rules that fail when names cross script boundaries – no string metric can determine that "Moscow" when rendered in Cyrillic or Arabic refer to the same city. I present Symphonym, a neural embedding system that maps toponyms from 20 writing systems into a unified 128-dimensional phonetic space. A Teacher network trained on articulatory phonetic features (via Epitran and PanPhon) produces target embeddings, while a Student network learns to approximate these from raw characters. At inference, only the lightweight Student (1.7M parameters) is required, enabling deployment without runtime phonetic conversion. Training uses a three-phase curriculum on 57 million toponyms from GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names. Phase 1 trains the Teacher on 467K phonetically-grounded triplets. Phase 2 aligns the Student to Teacher outputs across 23M samples, achieving 96.6% cosine similarity. Phase 3 fine-tunes on 3.3M hard negative triplets – negatives sharing prefix and script with the anchor but referring to different places – to sharpen discrimination. Evaluation on the MEHDIE Hebrew-Arabic benchmark achieves 89.2% Recall@1, outperforming Levenshtein (81.5%) and Jaro-Winkler (78.5%). The system is optimised for cross-script matching; same-script variants can be handled by complementary string methods. Symphonym will enable fuzzy phonetic reconciliation and search across the World Historical Gazetteer's 67 million toponyms. Code and models are publicly available.
zh
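
As a concrete illustration of the teacher-student setup, here is a minimal sketch of the Phase 2 alignment step in PyTorch. The encoder architecture, tensor shapes, and loss form are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of Student-to-Teacher alignment; architecture details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentEncoder(nn.Module):
    """Character-level encoder mapping raw toponym strings to 128-d embeddings."""
    def __init__(self, vocab_size: int = 4096, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(self.embed(char_ids))   # (batch, chars, 2*dim)
        pooled = h.mean(dim=1)                  # mean-pool over characters
        return F.normalize(self.proj(pooled), dim=-1)

def alignment_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Maximize cosine similarity to the frozen phonetic Teacher embeddings.
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
```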

[NLP-84] Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos

【Quick Read】: This paper tackles attribution under visual confounding when assessing social bias in Vision-Language Models (VLMs) deployed in socially consequential settings: demographic cues such as race and gender are entangled in real images with background, clothing, and other confounds. The key to the solution is a face-only counterfactual evaluation paradigm that edits only race- and gender-related facial attributes while holding every other visual factor fixed, thereby isolating and quantifying demographic effects. On this basis the authors build the FOCUS dataset and the REFLECT benchmark, showing that demographic disparities persist in VLMs even under strict visual control and that task design is a critical variable when evaluating social bias in multimodal models.

Link: https://arxiv.org/abs/2601.06931
Authors: Haodong Chen, Qiang Huang, Jiaqi Zhao, Qiuping Jiang, Xiaojun Chang, Jun Yu
Institutions: Harbin Institute of Technology (Shenzhen); Ningbo University; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 18 figures, and 3 tables

Abstract: Vision-Language Models (VLMs) are increasingly deployed in socially consequential settings, raising concerns about social bias driven by demographic cues. A central challenge in measuring such social bias is attribution under visual confounding: real-world images entangle race and gender with correlated factors such as background and clothing, obscuring attribution. We propose a face-only counterfactual evaluation paradigm that isolates demographic effects while preserving real-image realism. Starting from real photographs, we generate counterfactual variants by editing only facial attributes related to race and gender, keeping all other visual factors fixed. Based on this paradigm, we construct FOCUS, a dataset of 480 scene-matched counterfactual images across six occupations and ten demographic groups, and propose REFLECT, a benchmark comprising three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation. Experiments on five state-of-the-art VLMs reveal that demographic disparities persist under strict visual control and vary substantially across task formulations. These findings underscore the necessity of controlled, counterfactual audits and highlight task design as a critical factor in evaluating social bias in multimodal models.

[NLP-85] TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

【Quick Read】: This paper addresses the difficulty of step-level credit assignment in multi-hop question answering, where reliance on sparse final rewards gives only weak guidance to intermediate reasoning and actions. The key to the solution is TreePS-RAG, an online, tree-based reinforcement learning framework that models agentic Retrieval-Augmented Generation (RAG) as a rollout tree: every reasoning step maps to a node, and step utility is estimated via Monte Carlo estimation over descendant outcomes, yielding fine-grained process advantages without intermediate labels. An efficient online tree-construction strategy additionally preserves exploration diversity under a constrained compute budget, producing significant gains on multi-hop and general QA tasks.

Link: https://arxiv.org/abs/2601.06922
Authors: Tianhua Zhang, Kun Li, Junan Li, Yunxiang Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
Institutions: The Chinese University of Hong Kong; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
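
The core estimator is simple enough to sketch. Below is a minimal illustration of Monte Carlo step-advantage estimation over a rollout tree, following the abstract's description; the node schema and the tree-mean baseline are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch: step value = mean outcome of descendant rollouts; advantage is
# taken relative to the mean outcome of the whole tree (a group baseline).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    reward: Optional[float] = None        # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)

def leaf_rewards(node: Node) -> List[float]:
    if not node.children:                 # leaf: final outcome of one rollout
        return [node.reward or 0.0]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards

def step_advantage(step: Node, root: Node) -> float:
    value = sum(leaf_rewards(step)) / len(leaf_rewards(step))
    baseline = sum(leaf_rewards(root)) / len(leaf_rewards(root))
    return value - baseline
```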

[NLP-86] Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

【Quick Read】: This paper asks why language models differ so sharply in how much reinforcement learning (RL) helps them: under identical training, some families (e.g., Qwen) gain substantially while others (e.g., Llama) improve little. The key to the solution is identifying and quantifying a hidden structural property, distributional clarity in probability space, measured with the Silhouette Coefficient (S): high S means intra-class compactness and inter-class separation in the probabilities assigned to correct versus incorrect responses and correlates strongly with RL performance, while low S corresponds to severe logic errors and reasoning instability. A Silhouette-Aware Reweighting strategy that prioritizes low-S samples during training consistently improves RL results on mathematical reasoning across model families, establishing distributional clarity as a trainable property that determines RL-friendliness.

Link: https://arxiv.org/abs/2601.06911
Authors: Shaoning Sun, Mingzhu Cai, Huang He, Bingjin Chen, Siqi Bao, Yujiu Yang, Hua Wu, Haifeng Wang
Institutions: Tsinghua Shenzhen International Graduate School, Tsinghua University; Baidu Inc
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: distributional clarity in probability space. Through a three-stage analysis, from phenomenon to mechanism to interpretation, we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the Silhouette Coefficient (S) and demonstrate that (1) high S correlates strongly with RL performance; (2) low S is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-S samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-Friendliness.
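
The clarity metric itself is directly computable. Here is a minimal sketch of measuring distributional clarity with the Silhouette Coefficient, where cluster labels are response correctness and features are per-response probabilities; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Mean token log-probabilities the model assigned to sampled responses.
logprobs = np.array([-0.21, -0.25, -0.30, -1.10, -1.40, -1.05]).reshape(-1, 1)
correct = np.array([1, 1, 1, 0, 0, 0])   # verified correctness of each sample

S = silhouette_score(logprobs, correct)  # high S => compact, well-separated
print(f"distributional clarity S = {S:.3f}")
```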

[NLP-87] Fine-grained Verbal Attack Detection via a Hierarchical Divide-and-Conquer Framework

【Quick Read】: This paper addresses the difficulty of recognizing implicit verbal attacks on Chinese social media, which existing work handles poorly due to insufficient modeling of conversational structure and contextual dependency. The key to the solution is a divide-and-conquer, fine-grained attack recognition framework based on spatiotemporal information, together with the first Hierarchical Attack Comment Detection dataset, which explicitly encodes the hierarchical structure and chronological order of replies to capture complex interaction patterns in multi-turn discussions. The framework decomposes attack detection into subtasks handled by specialized lightweight models for explicit attack detection, implicit intent inference, and target identification under constrained context; experiments show this structured task decomposition lets small models clearly outperform large models that rely on parameter scaling.

Link: https://arxiv.org/abs/2601.06907
Authors: Quan Zheng, Yuanhe Tian, Ming Wang, Yan Song
Institutions: Beijing Normal University; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; University of Science and Technology of China
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures

Abstract:In the digital era, effective identification and analysis of verbal attacks are essential for maintaining online civility and ensuring social security. However, existing research is limited by insufficient modeling of conversational structure and contextual dependency, particularly in Chinese social media where implicit attacks are prevalent. Current attack detection studies often emphasize general semantic understanding while overlooking user response relationships, hindering the identification of implicit and context-dependent attacks. To address these challenges, we present the novel “Hierarchical Attack Comment Detection” dataset and propose a divide-and-conquer, fine-grained framework for verbal attack recognition based on spatiotemporal information. The proposed dataset explicitly encodes hierarchical reply structures and chronological order, capturing complex interaction patterns in multi-turn discussions. Building on this dataset, the framework decomposes attack detection into hierarchical subtasks, where specialized lightweight models handle explicit detection, implicit intent inference, and target identification under constrained context. Extensive experiments on the proposed dataset and benchmark intention detection datasets show that smaller models using our framework significantly outperform larger monolithic models relying on parameter scaling, demonstrating the effectiveness of structured task decomposition.

[NLP-88] Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

【Quick Read】: This paper probes the potential vulnerability of large language models (LLMs) used in academic peer review, noting that existing prompt-injection attacks alter manuscript content and conflate injection susceptibility with evaluation robustness. The key to the solution is the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for semantically equivalent, linguistically natural paraphrase sequences that raise review scores without changing the paper's claims; PAA exploits in-context learning, using previous paraphrases and their scores to guide candidate generation and thereby effectively perturb LLM reviewers.

Link: https://arxiv.org/abs/2601.06884
Authors: Masahiro Kaneko
Institutions: MBZUAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper’s claims. Human evaluation confirms that generated paraphrases maintain meaning and naturalness. We also find that attacked papers exhibit increased perplexity in reviews, offering a potential detection signal, and that paraphrasing submissions can partially mitigate attacks.
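
The attack is an iterative black-box search, which can be sketched compactly. Below, `reviewer` and `paraphraser` stand in for LLM calls and are assumptions, not an official API; the paper's actual candidate generation and acceptance rules may differ.

```python
from typing import Callable, List, Tuple

def paa_attack(paper: str,
               reviewer: Callable[[str], float],
               paraphraser: Callable[[str, List[Tuple[str, float]], int], List[str]],
               steps: int = 20, n_candidates: int = 4) -> Tuple[str, float]:
    history: List[Tuple[str, float]] = []   # in-context examples: (text, score)
    best_text, best_score = paper, reviewer(paper)
    for _ in range(steps):
        # Condition the attacker on past paraphrases and their review scores.
        for cand in paraphraser(best_text, history, n_candidates):
            score = reviewer(cand)          # black-box review score
            history.append((cand, score))
            if score > best_score:          # keep claims, raise the score
                best_text, best_score = cand, score
    return best_text, best_score
```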

[NLP-89] An Ubuntu-Guided Large Language Model Framework for Cognitive Behavioral Mental Health Dialogue

【Quick Read】: This paper addresses access to culturally and linguistically appropriate therapy amid South Africa's escalating mental health crisis, where Western-centric training data limit the applicability of large language models (LLMs) to localized psychological support in African contexts. The key to the solution is a generative AI dialogue framework that integrates Cognitive Behavioral Therapy (CBT) with the African philosophy of Ubuntu: deep theoretical adaptations (reinterpreting behavioral activation and cognitive restructuring around communal well-being, spiritual grounding, and interconnectedness) are combined with surface-level linguistic and communicative adaptations (language simplification, spiritual contextualization, and Ubuntu-based reframing), fusing emotional intelligence with cultural sensitivity to improve the contextual relevance, inclusivity, and effectiveness of interventions in African settings.

Link: https://arxiv.org/abs/2601.06875
Authors: Sontaga G. Forane, Absalom E. Ezugwu, Kevin Igwe, Karen van den Berg
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:South Africa’s escalating mental health crisis, compounded by limited access to culturally responsive care, calls for innovative and contextually grounded interventions. While large language models show considerable promise for mental health support, their predominantly Western-centric training data limit cultural and linguistic applicability in African contexts. This study introduces a proof-of-concept framework that integrates cognitive behavioral therapy with the African philosophy of Ubuntu to create a culturally sensitive, emotionally intelligent, AI-driven mental health dialogue system. Guided by a design science research methodology, the framework applies both deep theoretical and therapeutic adaptations as well as surface-level linguistic and communicative cultural adaptations. Key CBT techniques, including behavioral activation and cognitive restructuring, were reinterpreted through Ubuntu principles that emphasize communal well-being, spiritual grounding, and interconnectedness. A culturally adapted dataset was developed through iterative processes of language simplification, spiritual contextualization, and Ubuntu-based reframing. The fine-tuned model was evaluated through expert-informed case studies, employing UniEval for conversational quality assessment alongside additional measures of CBT reliability and cultural linguistic alignment. Results demonstrate that the model effectively engages in empathetic, context-aware dialogue aligned with both therapeutic and cultural objectives. Although real-time end-user testing has not yet been conducted, the model underwent rigorous review and supervision by domain specialist clinical psychologists. The findings highlight the potential of culturally embedded emotional intelligence to enhance the contextual relevance, inclusivity, and effectiveness of AI-driven mental health interventions across African settings.

[NLP-90] BiasLab: A Multilingual Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models

【Quick Read】: This paper addresses the difficulty of quantifying output-level bias in large language models (LLMs) for high-stakes use, including sensitivity to prompt wording, limited multilingual coverage, and the lack of standardized metrics for reliable cross-model comparison. The key to the solution is BiasLab, an open-source, model-agnostic evaluation framework built on a strict dual-framing design of mirrored probe pairs: structurally identical assertions favoring Target A and Target B, evaluated under randomized instruction wrappers and a fixed-choice Likert response format to reduce template dependence. An LLM judge normalizes responses into polarity-consistent agreement labels, which are aggregated into quantitative bias indicators with descriptive statistics such as effect sizes and neutrality rates, standardizing bias measurement across languages and framings and complementing intrinsic and dataset-based audits for robustness benchmarking and deployment decisions.

Link: https://arxiv.org/abs/2601.06861
Authors: William Guey, Wei Zhang, Pei-Luen Patrick Rau, Pierrick Bougault, Vitor D. de Moura, Bertan Ucar, Jose O. Gomes
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: source code and reproducibility scripts available on GitHub

Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes contexts where their outputs influence real-world decisions. However, evaluating bias in LLM outputs remains methodologically challenging due to sensitivity to prompt wording, limited multilingual coverage, and the lack of standardized metrics that enable reliable comparison across models. This paper introduces BiasLab, an open-source, model-agnostic evaluation framework for quantifying output-level (extrinsic) bias through a multilingual, robustness-oriented experimental design. BiasLab constructs mirrored probe pairs under a strict dual-framing scheme: an affirmative assertion favoring Target A and a reverse assertion obtained by deterministic target substitution favoring Target B, while preserving identical linguistic structure. To reduce dependence on prompt templates, BiasLab performs repeated evaluation under randomized instructional wrappers and enforces a fixed-choice Likert response format to maximize comparability across models and languages. Responses are normalized into agreement labels using an LLM-based judge, aligned for polarity consistency across framings, and aggregated into quantitative bias indicators with descriptive statistics including effect sizes and neutrality rates. The framework supports evaluation across diverse bias axes, including demographic, cultural, political, and geopolitical topics, and produces reproducible artifacts such as structured reports and comparative visualizations. BiasLab contributes a standardized methodology for cross-lingual and framing-sensitive bias measurement that complements intrinsic and dataset-based audits, enabling researchers and institutions to benchmark robustness and make better-informed deployment decisions.
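
A minimal sketch of the aggregation step follows: turning paired Likert agreements from the two framings into a gap, an effect size, and a neutrality rate. The agreement coding (1-5 Likert with 3 as the midpoint) is an assumption based on the abstract's fixed-choice format, not BiasLab's exact implementation.

```python
import numpy as np

def bias_indicators(scores_a: np.ndarray, scores_b: np.ndarray) -> dict:
    """scores_a / scores_b: paired Likert agreement (1-5) with the mirrored
    assertions favoring Target A and Target B, one pair per probe."""
    gap = scores_a.mean() - scores_b.mean()
    pooled_sd = np.sqrt((scores_a.var(ddof=1) + scores_b.var(ddof=1)) / 2)
    cohens_d = gap / pooled_sd if pooled_sd > 0 else 0.0
    neutrality = np.mean((scores_a == 3) & (scores_b == 3))  # midpoint picks
    return {"gap": gap, "effect_size": cohens_d, "neutrality_rate": neutrality}
```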

[NLP-91] †DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems

【Quick Read】: This paper addresses the sharp degradation of Chain-of-Thought (CoT) mathematical reasoning under irrelevant context (distractors), especially in low-resource languages. The key to the solution is †DAGGER, which reformulates math problem solving as executable computational graph generation with explicit distractor nodes, making models robust to irrelevant information. Gemma-3 models trained with supervised fine-tuning followed by Group Relative Policy Optimization match the weighted accuracy of dedicated reasoning models without explicit training on distractor-augmented examples, while using 89% fewer tokens, confirming that structured intermediate representations improve both robustness and inference efficiency.

Link: https://arxiv.org/abs/2601.06853
Authors: Zabir Al Nazi, Shubhashis Roy Dipta, Sudipta Kar
Institutions: University of California, Riverside; University of Maryland, Baltimore County; Oracle Health AI
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. To systematically study this challenge, we introduce DISTRACTMATH-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information. Evaluating seven models ranging from 3B to 12B parameters, we observe substantial performance degradation under distractors: standard models drop by up to 41 points, while reasoning-specialized models decline by 14 to 20 points despite consuming five times more tokens. We propose †DAGGER, which reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes. Fine-tuning Gemma-3 models using supervised fine-tuning followed by Group Relative Policy Optimization achieves comparable weighted accuracy on augmented benchmarks while using 89 percent fewer tokens than reasoning models. Importantly, this robustness emerges without explicit training on distractor-augmented examples. Our results suggest that enforcing structured intermediate representations improves robustness and inference efficiency in mathematical reasoning compared to free-form approaches, particularly in noisy, low-resource settings.
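
To make the "executable graph with distractor nodes" idea concrete, here is a minimal sketch of executing such a graph while never touching nodes flagged as distractors. The node schema is an illustrative assumption, not the paper's exact output format.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def execute(graph: dict) -> float:
    """graph: node_id -> {"op": str|None, "args": [ids], "value": float,
    "distractor": bool, "is_answer": bool}. Leaves have op=None."""
    cache = {}
    def eval_node(nid):
        if nid in cache:
            return cache[nid]
        node = graph[nid]
        if node.get("op") is None:           # leaf: a quantity from the text
            cache[nid] = node["value"]
        else:
            args = [eval_node(a) for a in node["args"]]
            cache[nid] = OPS[node["op"]](*args)
        return cache[nid]
    # Only the answer chain is evaluated; distractor nodes are never executed.
    answer = [n for n, d in graph.items()
              if d.get("is_answer") and not d.get("distractor")]
    return eval_node(answer[0])

example = {
    "n1": {"op": None, "value": 12.0},
    "n2": {"op": None, "value": 3.0},
    "n3": {"op": None, "value": 99.0, "distractor": True},  # irrelevant info
    "n4": {"op": "mul", "args": ["n1", "n2"], "is_answer": True},
}
print(execute(example))  # 36.0
```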

[NLP-92] Explainable Multimodal Aspect-Based Sentiment Analysis with Dependency-guided Large Language Model

【Quick Read】: This paper targets the lack of explicit sentiment explainability in Multimodal Aspect-Based Sentiment Analysis (MABSA), where existing methods depend on complex discriminative fusion. The key to the solution is recasting MABSA as a generative, explainable task: a unified framework built on Multimodal Large Language Models (MLLMs) uses a prompt-based generative paradigm to jointly output sentiment labels and natural-language explanations. A dependency-syntax-guided sentiment cue strategy prunes and textualizes the aspect-centered dependency tree to sharpen the model's ability to distinguish and explain different sentiment aspects, and MLLM-constructed datasets with sentiment explanations are used for fine-tuning, improving classification accuracy while producing faithful, aspect-grounded explanations.

Link: https://arxiv.org/abs/2601.06848
Authors: Zhongzheng Wang, Yuanhe Tian, Hongzhi Wang, Yan Song
Institutions: Harbin Institute of Technology; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; University of Science and Technology of China
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures

Abstract: Multimodal aspect-based sentiment analysis (MABSA) aims to identify aspect-level sentiments by jointly modeling textual and visual information, which is essential for fine-grained opinion understanding in social media. Existing approaches mainly rely on discriminative classification with complex multimodal fusion, yet lack explicit sentiment explainability. In this paper, we reformulate MABSA as a generative and explainable task, proposing a unified framework that simultaneously predicts aspect-level sentiment and generates natural language explanations. Based on multimodal large language models (MLLMs), our approach employs a prompt-based generative paradigm, jointly producing sentiment and explanation. To further enhance aspect-oriented reasoning capabilities, we propose a dependency-syntax-guided sentiment cue strategy. This strategy prunes and textualizes the aspect-centered dependency syntax tree, guiding the model to distinguish different sentiment aspects and enhancing its explainability. To enable explainability, we use MLLMs to construct new datasets with sentiment explanations for fine-tuning. Experiments show that our approach not only achieves consistent gains in sentiment classification accuracy, but also produces faithful, aspect-grounded explanations.

[NLP-93] Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models

【Quick Read】: This paper targets the inference latency that standard positional encoding imposes on Multimodal Large Language Models (MLLMs) in real-time video understanding: the global positional continuity constraint forces perception and generation to run strictly sequentially, blocking input-output parallelism and speak-while-watching interaction. The key to the solution is a parallel streaming framework that relaxes positional continuity through three designs, Overlapped, Group-Decoupled, and Gap-Isolated, enabling genuine parallelism between perception and generation. Experiments show Group-Decoupled offers the best efficiency-performance balance, keeping fluency and accuracy high while cutting latency substantially, with up to 2x acceleration under balanced perception-generation workloads.

Link: https://arxiv.org/abs/2601.06843
Authors: Junyan Lin, Junlong Tong, Hao Wu, Jialiang Zhang, Jinming Liu, Xin Jin, Xiaoyu Shen
Institutions: The Hong Kong Polytechnic University; EIT (Digital Twin Research Institute); Shanghai Jiao Tong University; Ocean University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception-generation cycle, limiting real-time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real-time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. We make all our code publicly available: this https URL.

[NLP-94] PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection

【Quick Read】: This paper addresses detecting LLM pre-training data in black-box, zero-shot settings with limited compute and data, where likelihood-based methods aggregate token-level scores with uniform weights and ignore the information-theoretic dynamics of autoregressive generation. The key to the solution is Positional Decay Reweighting (PDR), a training-free, plug-and-play method that explicitly reweights token-level scores to amplify memorization signals at early, high-entropy positions while suppressing noise from later context accumulation, substantially improving the robustness and accuracy of a wide range of advanced detection methods across benchmarks.

Link: https://arxiv.org/abs/2601.06827
Authors: Jinhan Liu, Yibo Yang, Ruiying Lu, Piotr Piekos, Yimeng Chen, Peng Wang, Dandan Guo
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Detecting pre-training data in Large Language Models (LLMs) is crucial for auditing data privacy and copyright compliance, yet it remains challenging in black-box, zero-shot settings where computational resources and training data are scarce. While existing likelihood-based methods have shown promise, they typically aggregate token-level scores using uniform weights, thereby neglecting the inherent information-theoretic dynamics of autoregressive generation. In this paper, we hypothesize and empirically validate that memorization signals are heavily skewed towards the high-entropy initial tokens, where model uncertainty is highest, and decay as context accumulates. To leverage this linguistic property, we introduce Positional Decay Reweighting (PDR), a training-free and plug-and-play framework. PDR explicitly reweights token-level scores to amplify distinct signals from early positions while suppressing noise from later ones. Extensive experiments show that PDR acts as a robust prior and can usually enhance a wide range of advanced methods across multiple benchmarks.
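
Positional reweighting of this kind is a one-liner in practice. Here is a minimal sketch; the exponential decay form and rate are illustrative assumptions (the paper specifies its own weighting scheme).

```python
import numpy as np

def pdr_score(token_scores: np.ndarray, decay: float = 0.05) -> float:
    """token_scores: per-token detection scores (e.g., log-likelihoods) in
    generation order. Early, high-entropy positions get the largest weight."""
    positions = np.arange(len(token_scores))
    weights = np.exp(-decay * positions)
    weights /= weights.sum()
    return float(np.dot(weights, token_scores))
```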

[NLP-95] AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

【Quick Read】: This paper addresses hallucination propagation in the multi-step reasoning of LLM-based agents: unlike detection in single-turn responses, attribution in multi-step workflows must automatically localize the intermediate step that causes the initial divergence and explain why. The key to the solution is AgentHallu, a benchmark comprising 693 high-quality trajectories, a fine-grained taxonomy of 5 categories (Planning, Retrieval, Reasoning, Human-Interaction, Tool-Use) with 14 sub-categories, and multi-level human annotations (binary labels, hallucination-responsible steps, and causal explanations), systematically supporting research on and evaluation of hallucination attribution.

Link: https://arxiv.org/abs/2601.06818
Authors: Xuannan Liu, Xiao Yang, Zekun Li, Peipei Li, Ran He
Institutions: Beijing University of Posts and Telecommunications; Tsinghua University; University of California, Santa Barbara; Center for Research on Intelligent Perception and Computing, NLPR, CASIA
Subjects: Computation and Language (cs.CL)
Comments: Project page: this https URL

Abstract:As LLM-based agents operate over sequential multi-step reasoning, hallucinations arising at intermediate steps risk propagating along the trajectory, thus degrading overall reliability. Unlike hallucination detection in single-turn responses, diagnosing hallucinations in multi-step workflows requires identifying which step causes the initial divergence. To fill this gap, we propose a new research task, automated hallucination attribution of LLM-based agents, aiming to identify the step responsible for the hallucination and explain why. To support this task, we introduce AgentHallu, a comprehensive benchmark with: (1) 693 high-quality trajectories spanning 7 agent frameworks and 5 domains, (2) a hallucination taxonomy organized into 5 categories (Planning, Retrieval, Reasoning, Human-Interaction, and Tool-Use) and 14 sub-categories, and (3) multi-level annotations curated by humans, covering binary labels, hallucination-responsible steps, and causal explanations. We evaluate 13 leading models, and results show the task is challenging even for top-tier models (like GPT-5, Gemini-2.5-Pro). The best-performing model achieves only 41.1% step localization accuracy, where tool-use hallucinations are the most challenging at just 11.6%. We believe AgentHallu will catalyze future research into developing robust, transparent, and reliable agentic systems.

[NLP-96] Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

【Quick Read】: This paper addresses two problems in visual Chain-of-Thought reasoning: the information bandwidth bottleneck of discrete tokenization, which discards continuous visual detail when rationales are textualized, and the premature semantic collapse of existing latent reasoning methods under rigid autoregressive objectives. The key to the solution is Laser, which reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL): rather than forcing point-wise prediction, latent states are aligned with a dynamic validity window of future semantics, enforcing a Forest-before-Trees cognitive hierarchy that maintains a probabilistic superposition of global features before narrowing to local details. Decodable trajectories preserve interpretability and Self-Refined Superposition stabilizes the unconstrained learning; across 6 benchmarks Laser sets the state of the art among latent reasoning methods while cutting inference tokens by more than 97% and generalizing robustly out of distribution.

Link: https://arxiv.org/abs/2601.06803
Authors: Yubo Wang, Juntian Zhang, Yichen Wu, Yankai Lin, Nils Lukas, Yuhan Liu
Institutions: MBZUAI; Fudan University; Gaoling School of Artificial Intelligence, Renmin University of China; Harvard University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a “Forest-before-Trees” cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.

[NLP-97] Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition

【Quick Read】: This paper addresses the difficulty of building automatic speech recognition (ASR) for low-resource Arabic dialects, particularly Sudanese, since existing work centers on Modern Standard Arabic (MSA) and a few high-resource dialects. The key to the solution is data augmentation: self-training with pseudo-labels from unlabeled speech, combined with synthetic speech from the Klaam TTS system. With only 28.4 hours of training data, a Whisper-Medium model fine-tuned with the combined augmentation reaches 57.1% word error rate (WER), clearly beating zero-shot multilingual Whisper (78.8% WER) and MSA-specialized models (73.8-123% WER), showing that strategic augmentation can overcome resource limits and offering a reproducible path for ASR in other marginalized language varieties.

Link: https://arxiv.org/abs/2601.06802
Authors: Ayman Mansour
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Although many Automatic Speech Recognition (ASR) systems have been developed for Modern Standard Arabic (MSA) and Dialectal Arabic (DA), few studies have focused on dialect-specific implementations, particularly for low-resource Arabic dialects such as Sudanese. This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. Two augmentation strategies are investigated: (1) self-training with pseudo-labels generated from unlabeled speech, and (2) TTS-based augmentation using synthetic speech from the Klaam TTS system. The best-performing model, Whisper-Medium fine-tuned with combined self-training and TTS augmentation (28.4 hours), achieves a Word Error Rate (WER) of 57.1% on the evaluation set and 51.6% on an out-of-domain holdout set, substantially outperforming zero-shot multilingual Whisper (78.8% WER) and MSA-specialized Arabic models (73.8-123% WER). All experiments used low-cost resources (Kaggle free tier and this http URL trial), demonstrating that strategic data augmentation can overcome resource limitations for low-resource dialects and provide a practical roadmap for developing ASR systems for low-resource Arabic dialects and other marginalized language varieties. The models, evaluation benchmarks, and reproducible training pipelines are publicly released to facilitate future research on low-resource Arabic ASR.

[NLP-98] CIRAG: Construction-Integration Retrieval and Adaptive Generation for Multi-hop Question Answering

【Quick Read】: This paper targets two limitations of triple-based iterative retrieval-augmented generation (iRAG) for multi-hop question answering: greedy single-path expansion, which propagates early errors and misses parallel evidence across reasoning branches, and granularity-demand mismatch, where a single evidence representation cannot balance noise control against contextual sufficiency. The key to the solution is CIRAG: an Iterative Construction-Integration module builds candidate triples and integrates them conditioned on history, preserving multiple plausible evidence chains to avoid the greedy trap; an Adaptive Cascaded Multi-Granularity Generation module expands evidence on demand from triples to supporting sentences to full passages; and Trajectory Distillation compresses the teacher's integration policy into a lightweight student for efficient, reliable long-horizon reasoning.

Link: https://arxiv.org/abs/2601.06799
Authors: Zili Wei, Xiaocui Yang, Yilin Wang, Zihan Wang, Weidong Bao, Shi Feng, Daling Wang, Yifei Zhang
Institutions: Northeastern University, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Triple-based Iterative Retrieval-Augmented Generation (iRAG) mitigates document-level noise for multi-hop question answering. However, existing methods still face limitations: (i) greedy single-path expansion, which propagates early errors and fails to capture parallel evidence from different reasoning branches, and (ii) granularity-demand mismatch, where a single evidence representation struggles to balance noise control with contextual sufficiency. In this paper, we propose the Construction-Integration Retrieval and Adaptive Generation model, CIRAG. It introduces an Iterative Construction-Integration module that constructs candidate triples and history-conditionally integrates them to distill core triples and generate the next-hop query. This module mitigates the greedy trap by preserving multiple plausible evidence chains. Besides, we propose an Adaptive Cascaded Multi-Granularity Generation module that progressively expands contextual evidence based on the problem requirements, from triples to supporting sentences and full passages. Moreover, we introduce Trajectory Distillation, which distills the teacher model’s integration policy into a lightweight student, enabling efficient and reliable long-horizon reasoning. Extensive experiments demonstrate that CIRAG achieves superior performance compared to existing iRAG methods.

[NLP-99] Garbage Attention in Large Language Models: BOS Sink Heads and Sink-aware Pruning

【Quick Read】: This paper seeks to explain the unexplained structural redundancy of large language models (LLMs), in particular why deeper attention heads are more redundant. The key to the solution is identifying and exploiting the BOS sink phenomenon: heads with high BOS sink scores, especially in deeper layers, contribute little to prediction and act as sinks for superfluous attention weights, giving a functional account of redundancy. Building on this, the authors prune only high-BOS-sink heads; experiments show this identifies redundant components more reliably than weight- or activation-based criteria, preserves near-dense performance even under aggressive pruning, and stays robust across sequence lengths.

Link: https://arxiv.org/abs/2601.06787
Authors: Jaewon Sok, Jewon Yeom, Seonghyeon Park, Jeongjae Park, Taesup Kim
Institutions: Seoul National University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Large Language Models (LLMs) are known to contain significant redundancy, yet a systematic explanation for why certain components, particularly in higher layers, are more redundant has remained elusive. In this work, we identify the BOS sink phenomenon as a key mechanism driving this layer-wise sensitivity. We show that attention heads with high BOS sink scores are strongly associated with functional redundancy: such heads, especially in deeper layers, contribute little to predictive performance and effectively serve as "dumping grounds" for superfluous attention weights. This provides a concrete functional explanation for the structural redundancy reported in prior studies. Leveraging this insight, we introduce a simple pruning strategy that removes high-BOS sink heads. Experiments on Gemma-3, Llama-3.1, and Qwen3 demonstrate that this approach identifies redundant transformer components more reliably than weight- or activation-based criteria, while preserving performance close to dense baselines even under aggressive pruning. Moreover, we find that the behavior of sink heads remains stable across different sequence lengths. Overall, our results suggest that structural properties of attention offer a more intuitive and robust basis for model compression than magnitude-based methods.
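
The BOS sink score is easy to compute from attention maps. Here is a minimal sketch; the tensor shapes ((layers, heads, query, key) attentions averaged over a probe set, with the BOS token at key index 0) and the threshold are assumptions for illustration.

```python
import torch

def bos_sink_scores(attn: torch.Tensor) -> torch.Tensor:
    """attn: (n_layers, n_heads, q_len, k_len). Score = mean attention mass
    each head places on the BOS key position."""
    return attn[..., 0].mean(dim=-1)             # -> (n_layers, n_heads)

def heads_to_prune(attn: torch.Tensor, threshold: float = 0.6):
    scores = bos_sink_scores(attn)
    layer_idx, head_idx = torch.where(scores > threshold)
    return list(zip(layer_idx.tolist(), head_idx.tolist()))
```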

[NLP-100] EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs

【Quick Read】: This paper addresses the calibration failure that arises when LLM reasoning is improved via iterative self-training: models become overconfident and lose the ability to represent uncertainty, degenerating their predictive distributions (model collapse in alignment). The key to the solution is reframing reasoning training as an epistemic learning problem and proposing Epistemically-calibrated Reasoning (EpiCaR), an objective that jointly optimizes reasoning performance and calibration, instantiated in an iterative supervised fine-tuning framework with explicit self-evaluation signals. EpiCaR is Pareto-superior to standard baselines on both accuracy and calibration, especially for models with sufficient reasoning capacity (3B+ parameters), generalizes to OOD mathematical reasoning (GSM8K) and code generation (MBPP), and cuts inference compute markedly, matching STaR's K=30 performance with only K=10 samples.

Link: https://arxiv.org/abs/2601.06786
Authors: Jewon Yeom, Jaewon Sok, Seonghyeon Park, Jeongjae Park, Taesup Kim
Institutions: Seoul National University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Improving the reasoning abilities of large language models (LLMs) has largely relied on iterative self-training with model-generated data. While effective at boosting accuracy, existing approaches primarily reinforce successful reasoning paths, incurring a substantial calibration cost: models become overconfident and lose the ability to represent uncertainty. This failure has been characterized as a form of model collapse in alignment, where predictive distributions degenerate toward low-variance point estimates. We address this issue by reframing reasoning training as an epistemic learning problem, in which models must learn not only how to reason, but also when their reasoning should be trusted. We propose epistemically-calibrated reasoning (EpiCaR) as a training objective that jointly optimizes reasoning performance and calibration, and instantiate it within an iterative supervised fine-tuning framework using explicit self-evaluation signals. Experiments on Llama-3 and Qwen-3 families demonstrate that our approach achieves Pareto-superiority over standard baselines in both accuracy and calibration, particularly in models with sufficient reasoning capacity (e.g., 3B+). This framework generalizes effectively to OOD mathematical reasoning (GSM8K) and code generation (MBPP). Ultimately, our approach enables a 3X reduction in inference compute, matching the K=30 performance of STaR with only K=10 samples in capable models.
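
One way to picture a jointly calibrated objective in this spirit is a standard fine-tuning loss plus a Brier-style penalty on the model's self-reported confidence. This is a minimal sketch under that assumption; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def calibrated_loss(logits: torch.Tensor, targets: torch.Tensor,
                    confidence: torch.Tensor, is_correct: torch.Tensor,
                    lam: float = 0.5) -> torch.Tensor:
    """logits/targets: usual SFT inputs; confidence: the model's
    self-evaluation in [0, 1]; is_correct: verified outcome per sample."""
    sft = F.cross_entropy(logits, targets)
    brier = ((confidence - is_correct.float()) ** 2).mean()  # calibration term
    return sft + lam * brier
```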

[NLP-101] Multi-Stage Evolutionary Model Merging with Meta Data Driven Curriculum Learning for Sentiment-Specialized Large Language Modeling

【Quick Read】: This paper addresses the insufficient accuracy and high compute cost of sentiment analysis with large language models (LLMs) in multi-task settings: traditional methods target a single task and cannot serve real applications that require multiple subtasks (sentiment classification, aspect-based analysis, etc.), while using LLMs directly often misses the required accuracy. The key to the solution is the hybrid framework Multi-stage Evolutionary Model Merging with Meta-data Driven Curriculum Learning (MEM-MCL): expert models are created via instruction tuning for specific sentiment tasks and merged with evolutionary algorithms into a unified model, with weakly supervised data optimizing the merge; curriculum learning ordered by task difficulty guides training from easy to hard, improving knowledge extraction and cross-task generalization. Experiments show MEM-MCL outperforms conventional LLMs on a majority of sentiment analysis subtasks, balancing accuracy and scalability.

Link: https://arxiv.org/abs/2601.06780
Authors: Keito Inoshita, Xiaokang Zhou, Akira Kawai
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper was presented at the 10th IEEE International Conference on Data Science and Systems in December 2024 and is awaiting publication

Abstract:The emergence of large language models (LLMs) has significantly transformed natural language processing (NLP), enabling more generalized models to perform various tasks with minimal training. However, traditional sentiment analysis methods, which focus on individual tasks such as sentiment classification or aspect-based analysis, are not practical for real-world applications that usually require handling multiple tasks. While offering flexibility, LLMs in sentiment-specific tasks often fall short of the required accuracy. Techniques like fine-tuning and evolutionary model merging help integrate models into a unified framework, which can improve the learning performance while reducing computational costs. The use of task meta-data and curriculum learning to optimize learning processes remains underexplored, while sentiment analysis is a critical task in NLP that requires high accuracy and scalability across multiple subtasks. In this study, we propose a hybrid learning model called Multi-stage Evolutionary Model Merging with Meta data driven Curriculum Learning (MEM-MCL), to enhance the sentiment analysis in large language modeling. In particular, expert models are created through instruction tuning for specific sentiment tasks and then merged using evolutionary algorithms to form a unified model. The merging process is optimized with weak data to enhance performance across tasks. The curriculum learning is incorporated to provide a learning sequence based on task difficulty, improving knowledge extraction from LLMs. Experiment results demonstrate that the proposed MEM-MCL model outperforms conventional LLMs in a majority of sentiment analysis tasks, achieving superior results across various subtasks.

[NLP-102] GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

【Quick Read】: This paper addresses weak mathematical reasoning in low-resource languages such as Bengali, where existing LLMs either reason in English and translate or simply fail on multi-step Bengali math, largely because reinforcement learning (RL) recipes collapse under reward sparsity in low-resource settings. The key to the solution is a high-quality, difficulty-aware Bengali math dataset (Ganit) plus a curriculum-based GRPO pipeline (Curriculum-GRPO), whose core components are a rigorously filtered and decontaminated dataset with automatic difficulty tags, multi-stage training (supervised fine-tuning + GRPO), difficulty-aware sampling, and verifiable rewards covering format, numerical correctness, and Bengali reasoning. GanitLLM-4B improves accuracy by 8 and 7 points on the Bn-MGSM and Bn-MSVAMP benchmarks, raises the share of Bengali reasoning tokens from 14% to over 88%, and shortens average solutions from 943 to 193 words.

Link: https://arxiv.org/abs/2601.06767
Authors: Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar
Institutions: University of Maryland, Baltimore County; University of North Carolina at Charlotte
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, “Ganit”), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world’s most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.
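
The difficulty-tagging step (labels derived from an evaluator model's pass@k) reduces to simple bucketing. A minimal sketch follows; the bucket boundaries are illustrative assumptions, not the paper's values.

```python
def difficulty_tag(n_correct: int, n_samples: int) -> str:
    """Tag a problem by the evaluator's empirical pass rate over k samples."""
    pass_rate = n_correct / n_samples
    if pass_rate >= 0.8:
        return "easy"
    if pass_rate >= 0.3:
        return "medium"
    return "hard"

# A curriculum sampler can then weight "easy" items more heavily in early
# training stages and shift probability mass toward "hard" items later.
```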

[NLP-103] MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

【Quick Read】: This paper addresses evaluating contextual safety risks of multimodal large language models (MLLMs) in multi-turn dialogue, where risk depends jointly on the visual scene and the evolving conversation, e.g., malicious intent emerging gradually or the same scene supporting both benign and harmful goals; existing benchmarks are mostly single-turn and miss these dynamics. The key to the solution is the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench): over 30 thousand multimodal (image+text) and unimodal (text-only) samples across two complementary settings, escalation-based risk and context-switch risk, with structured metrics that separately measure contextual intent recognition, safety-awareness on unsafe cases, and helpfulness on benign ones. Results show persistent safety-utility trade-offs in leading open-source and proprietary MLLMs, and current guardrails do not fully resolve multi-turn contextual risks.

Link: https://arxiv.org/abs/2601.06757
Authors: Zheyuan Liu, Dongwhi Kim, Yixin Wan, Xiangchi Yuan, Zhaoxuan Tan, Fengran Mo, Meng Jiang
Institutions: University of Notre Dame; University of California, Los Angeles; Georgia Institute of Technology; University of Montreal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: A benchmark of realistic images and multi-turn conversations that evaluates contextual safety in MLLMs under two complementary settings

Abstract:Multimodal large language models (MLLMs) are increasingly deployed as assistants that interact through text and images, making it crucial to evaluate contextual safety when risk depends on both the visual scene and the evolving dialogue. Existing contextual safety benchmarks are mostly single-turn and often miss how malicious intent can emerge gradually or how the same scene can support both benign and exploitative goals. We introduce the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench), a benchmark of realistic images and multi-turn conversations that evaluates contextual safety in MLLMs under two complementary settings, escalation-based risk and context-switch risk. MTMCS-Bench offers paired safe and unsafe dialogues with structured evaluation. It contains over 30 thousand multimodal (image+text) and unimodal (text-only) samples, with metrics that separately measure contextual intent recognition, safety-awareness on unsafe cases, and helpfulness on benign ones. Across eight open-source and seven proprietary MLLMs, we observe persistent trade-offs between contextual safety and utility, with models tending to either miss gradual risks or over-refuse benign dialogues. Finally, we evaluate five current guardrails and find that they mitigate some failures but do not fully resolve multi-turn contextual risks.

[NLP-104] Towards Computational Chinese Paleography

【Quick Read】: This paper addresses the efficiency bottlenecks of traditional Chinese paleography caused by data scarcity and methodological limits, and the mismatch between current AI capabilities and the holistic nature of humanistic inquiry. The key to the solution is a shift from automating isolated visual tasks to integrated digital research ecosystems: building multimodal, few-shot, human-centric AI systems that connect the full pipeline, from image restoration and character recognition through artifact rejoining and dating to automated decipherment, while strengthening human-AI collaboration to deepen and broaden paleographic scholarship.

Link: https://arxiv.org/abs/2601.06753
Authors: Yiran Rex Ma
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: A position paper in progress with Peking University ByteDance Digital Humanities Open Lab

Abstract:Chinese paleography, the study of ancient Chinese writing, is undergoing a computational turn powered by artificial intelligence. This position paper charts the trajectory of this emerging field, arguing that it is evolving from automating isolated visual tasks to creating integrated digital ecosystems for scholarly research. We first map the landscape of digital resources, analyzing critical datasets for oracle bone, bronze, and bamboo slip scripts. The core of our analysis follows the field’s methodological pipeline: from foundational visual processing (image restoration, character recognition), through contextual analysis (artifact rejoining, dating), to the advanced reasoning required for automated decipherment and human-AI collaboration. We examine the technological shift from classical computer vision to modern deep learning paradigms, including transformers and large multimodal models. Finally, we synthesize the field’s core challenges – notably data scarcity and a disconnect between current AI capabilities and the holistic nature of humanistic inquiry – and advocate for a future research agenda focused on creating multimodal, few-shot, and human-centric systems to augment scholarly expertise.

[NLP-105] Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models

【Quick Read】: This paper addresses the missing egocentric clinical intent understanding of Medical Multimodal Large Language Models (Med-MLLMs) in real clinical settings, a capability existing benchmarks cannot assess. The key to the solution is MedGaze-Bench, the first benchmark to use clinician gaze as a Cognitive Cursor, evaluating models along a Three-Dimensional Clinical Intent Framework: Spatial Intent (discriminating precise targets amid the visual homogeneity of anatomical structures), Temporal Intent (inferring causal rationale under the strict temporal-causal dependencies of clinical workflows), and Standard Intent (verifying adherence to implicit safety protocols). Trap QA mechanisms further stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy.

Link: https://arxiv.org/abs/2601.06750
Authors: Shaonan Liu, Guo Yu, Xiaoling Luo, Shiyi Zheng, Wenting Chen, Jie Liu, Linlin Shen
Institutions: Shenzhen University; Stanford University; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 4 figures

Abstract:Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. To address these challenges, we introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework evaluating: (1) Spatial Intent: discriminating precise targets amid visual noise, (2) Temporal Intent: inferring causal rationale through retrospective and prospective reasoning, and (3) Standard Intent: verifying protocol compliance through safety checks. Beyond accuracy metrics, we introduce Trap QA mechanisms to stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy. Experiments reveal current MLLMs struggle with egocentric intent due to over-reliance on global features, leading to fabricated observations and uncritical acceptance of invalid instructions.

[NLP-106] Evaluating Accounting Reasoning Capabilities of Large Language Models

【Quick Read】: This paper addresses how to integrate large language models (LLMs) effectively into professional domains such as accounting to advance enterprise digital transformation, i.e., how to endow LLMs with vertical-domain accounting reasoning and establish quantifiable evaluation criteria to guide performance improvement. The key to the solution is defining vertical domain accounting reasoning and deriving evaluation criteria from the training-data characteristics of representative GLM models, providing a systematic framework and benchmarks for studying this capability. Evaluating GLM-6B, GLM-130B, GLM-4, and GPT-4 under this framework shows that prompt design significantly affects performance and that GPT-4 is the most capable, yet still falls short of real enterprise accounting needs, underscoring the need for further optimization.

Link: https://arxiv.org/abs/2601.06707
Authors: Jie Zhou, Xin Chen, Jie Zhang, Hai Li, Jie Wang, Zhe Li
Institutions: Jiangsu Ocean University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these gains, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.

[NLP-107] GRASP LoRA: GRPO Guided Adapter Sparsity Policy for Cross Lingual Transfer

【Quick Read】: This paper addresses the wasted compute and heavy development-set dependence of choosing a global sparsity (prune) ratio via grid search in parameter-efficient fine-tuning (PEFT), which repeats training, freezes sparsity, and misses fractional optima. The key to the solution is GRASP LoRA (GRPO Guided Adapter Sparsity Policy), which treats the global prune ratio as a learnable control variable: a GRPO-based controller interleaves with training, periodically probing candidate ratios on a small micro development set and updating a single global ratio online from the reward signal. This replaces grid search with one controller run plus a single final merge and fixed-ratio fine-tuning run, cutting end-to-end runtime several-fold, reducing reliance on large development sets, and improving semantic faithfulness, content coverage, and answer quality in cross-lingual transfer.

Link: https://arxiv.org/abs/2601.06702
Authors: Besher Hassan, Xiuying Chen
Institutions: Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures

Abstract:Parameter efficient fine tuning is a way to adapt LLMs to new languages when compute or data are limited, yet adapter pipelines usually choose a global prune ratio by grid search. This practice is computationally expensive and development set intensive, since it repeats training, freezes sparsity, and misses fractional optima. We introduce GRASP LoRA (GRPO Guided Adapter Sparsity Policy), which treats global sparsity as a learnable control variable. A GRPO controller interleaves with training, periodically probing candidate prune ratios on a small micro development set and updating a single global prune ratio online from its reward signal. It operates on merged source and target LoRA adapters on a frozen backbone and replaces grid search with one controller run that learns a prune ratio, followed by a single final merge and prune fine tuning run with pruning fixed to that ratio. On cross lingual transfer from English into Arabic and Chinese, including XL-Sum summarization and MLQA extractive question answering with Llama 3 8B, GRASP LoRA improves semantic faithfulness, content coverage, and answer quality over strong target only and merge and prune baselines. It reduces end to end runtime by multiple times relative to grid search, lowers reliance on large development sets, and makes adapter reuse practical for low resource deployment.
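
The controller's interleaved probe-and-update loop can be sketched as follows. `train_step` and `micro_dev_reward` are placeholder callables, and the probe schedule, candidate grid, and smoothed update are assumptions inspired by the abstract, not the paper's exact policy.

```python
from typing import Callable

def grasp_controller(train_step: Callable[[float], None],
                     micro_dev_reward: Callable[[float], float],
                     steps: int = 5000, probe_every: int = 500) -> float:
    prune_ratio = 0.5                        # single global, learnable control
    for step in range(steps):
        train_step(prune_ratio)              # merged-adapter fine-tuning step
        if step % probe_every == 0:
            candidates = [min(0.95, max(0.0, prune_ratio + d))
                          for d in (-0.10, -0.05, 0.0, 0.05, 0.10)]
            rewards = [micro_dev_reward(r) for r in candidates]
            # Group-relative (GRPO-style) scoring of the candidate ratios.
            mean_r = sum(rewards) / len(rewards)
            advantages = [r - mean_r for r in rewards]
            best = candidates[advantages.index(max(advantages))]
            prune_ratio = 0.8 * prune_ratio + 0.2 * best    # smooth update
    return prune_ratio
```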

[NLP-108] Characterising Toxicity in Generative Large Language Models

【Quick Read】: This paper examines the tendency of large language models (LMs) to generate harmful content, i.e., inappropriate, offensive, or otherwise harmful responses, collectively termed "toxic" outputs. Although alignment methods such as reinforcement learning from human feedback (RLHF) constrain model outputs, such safeguards can still be circumvented by carefully crafted prompts. The key contribution is a systematic assessment of how much toxic content language models produce when prompted, together with an analysis of the lexical and syntactic linguistic features that influence such outputs, providing empirical and theoretical grounding for building more robust defenses.

Link: https://arxiv.org/abs/2601.06700
Authors: Zhiyao Zhang, Yazan Mash'Al, Yuhan Wu
Institutions: Delft University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: In recent years, the advent of the attention mechanism has significantly advanced the field of natural language processing (NLP), revolutionizing text processing and text generation. This has come about through transformer-based decoder-only architectures, which have become ubiquitous in NLP due to their impressive text processing and generation capabilities. Despite these breakthroughs, language models (LMs) remain susceptible to generating undesired outputs: inappropriate, offensive, or otherwise harmful responses. We will collectively refer to these as "toxic" outputs. Although methods like reinforcement learning from human feedback (RLHF) have been developed to align model outputs with human values, these safeguards can often be circumvented through carefully crafted prompts. Therefore, this paper examines the extent to which LLMs generate toxic content when prompted, as well as the linguistic factors – both lexical and syntactic – that influence the production of such outputs in generative models.

[NLP-109] IDRBench: Interactive Deep Research Benchmark

【Quick Read】: This paper addresses the alignment problem of deep research agents in practice, where user intent is underspecified and evolves during exploration, and the inability of existing benchmarks to quantify the costs and benefits of interaction. The key to the solution is IDRBench, a benchmark for systematically evaluating interactive deep research that combines a modular multi-agent research framework, on-demand interaction, a scalable reference-grounded user simulator, and an evaluation suite jointly measuring interaction benefits (quality and alignment) and costs (turns and token consumption), revealing that interaction consistently improves research quality and robustness while exposing substantial efficiency trade-offs.

Link: https://arxiv.org/abs/2601.06676
Authors: Yingchaojie Feng, Qiang Huang, Xiaoya Xie, Zhaorui Yang, Jun Yu, Wei Chen, Anthony K. H. Tung
Institutions: National University of Singapore; Harbin Institute of Technology (Shenzhen); Zhejiang University; State Key Lab of CAD&CG, Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.

[NLP-110] Evaluating Cross-Lingual Unlearning in Multilingual Language Models

【Quick Read】: This paper addresses cross-lingual unlearning in multilingual large language models: removing stored facts while preserving target-language performance and ensuring the information does not survive in other languages. The study shows mainstream unlearning algorithms largely fail cross-lingually, leaving knowledge intact in non-training languages even when utility is preserved. The key to the solution is a subspace-projection method: learned task subspaces contain a shared interlingua structure in weight space, whose removal drives forgetting across all languages, while removing language-specific components affects only one language. Experiments show subspace projection achieves strong cross-lingual forgetting with minimal degradation, revealing that multilingual forgetting hinges on weight-space geometry and motivating subspace-based designs for future unlearning systems.

Link: https://arxiv.org/abs/2601.06675
Authors: Tyler Lizzo, Larry Heck
Institutions: Georgia Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present the first comprehensive evaluation of cross-lingual unlearning in multilingual LLMs. Using translated TOFU benchmarks in seven language/script variants, we test major unlearning algorithms and show that most fail to remove facts outside the training language, even when utility remains high. However, subspace-projection consistently outperforms the other methods, achieving strong cross-lingual forgetting with minimal degradation. Analysis of learned task subspaces reveals a shared interlingua structure: removing this shared subspace harms all languages, while removing language-specific components selectively affects one. These results demonstrate that multilingual forgetting depends on geometry in weight space, motivating subspace-based approaches for future unlearning systems.
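
The subspace-projection idea can be sketched in a few lines: estimate a task subspace from fine-tuning weight deltas via SVD, then project it out of the weights. The rank choice and which deltas to stack are assumptions here (rank must not exceed the number of stacked deltas), not the paper's exact recipe.

```python
import torch

def remove_task_subspace(weight: torch.Tensor,
                         deltas: torch.Tensor,
                         rank: int = 8) -> torch.Tensor:
    """weight: (d_out, d_in); deltas: (n, d_out, d_in) fine-tuning updates
    associated with the fact/task to forget (possibly across languages)."""
    flat = deltas.reshape(deltas.shape[0], -1)       # (n, d_out*d_in)
    _, _, vh = torch.linalg.svd(flat, full_matrices=False)
    basis = vh[:rank]                                # top singular directions
    w = weight.reshape(1, -1)
    w_proj = w - (w @ basis.T) @ basis               # project subspace out
    return w_proj.reshape_as(weight)
```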

[NLP-111] Will it Merge? On The Causes of Model Mergeability

【Quick Read】: This paper addresses the unpredictability of model merging success, i.e., why some fine-tuned models merge into a unified multitask model more successfully than others. Its core contribution is a concrete, measurable definition of mergeability and an empirical analysis identifying base-model knowledge as the dominant factor: models fine-tuned on instances the base model already knows well are more mergeable, while those fine-tuned on instances the base model struggles with merge poorly. Building on this definition, the paper explores a simple weighted merging technique that better preserves weak knowledge in the base model, improving the performance and robustness of the merged model.

Link: https://arxiv.org/abs/2601.06672
Authors: Adir Rahamim, Asaf Yehudai, Boaz Carmeli, Leshem Choshen, Yosi Mass, Yonatan Belinkov
Institutions: Technion - Israel Institute of Technology; IBM Research AI; MIT; MIT-IBM Watson AI Lab; Hebrew University of Jerusalem; Kempner Institute, Harvard University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Model merging has emerged as a promising technique for combining multiple fine-tuned models into a single multitask model without retraining. However, the factors that determine whether merging will succeed or fail remain poorly understood. In this work, we investigate why specific models are merged better than others. To do so, we propose a concrete, measurable definition of mergeability. We investigate several potential causes for high or low mergeability, highlighting the base model knowledge as a dominant factor: Models fine-tuned on instances that the base model knows better are more mergeable than models fine-tuned on instances that the base model struggles with. Based on our mergeability definition, we explore a simple weighted merging technique that better preserves weak knowledge in the base model.
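
The weighted-merging idea lends itself to a compact sketch. Below is a minimal, illustrative implementation of weighted merging around a shared base checkpoint; the formula merged = base + sum_i w_i * (theta_i - base) and the example weights are assumptions for illustration, since the abstract does not spell out the paper's exact scheme.

```python
import torch

def weighted_merge(base, finetuned, weights):
    """Merge fine-tuned checkpoints as base + sum_i w_i * (theta_i - base).

    base: state dict of the shared base model.
    finetuned: list of state dicts with the same keys.
    weights: one scalar per fine-tuned model (illustrative, not the paper's rule).
    """
    merged = {}
    for name, base_param in base.items():
        delta = sum(w * (ft[name] - base_param)
                    for w, ft in zip(weights, finetuned))
        merged[name] = base_param + delta
    return merged

# Toy usage with tensors standing in for real checkpoints.
base = {"w": torch.zeros(4)}
models = [{"w": torch.ones(4)}, {"w": -torch.ones(4)}]
print(weighted_merge(base, models, weights=[0.7, 0.3]))  # w = 0.4 everywhere
```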

[NLP-112] InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs

【Quick Read】: This paper tackles hallucination in large language model (LLM) outputs, noting that existing fact-checking methods treat factuality as a binary classification task, offering little interpretability and no fine-grained error typing. The key to the proposed InFi-Check framework is threefold: a controlled data-synthesis pipeline that produces high-quality data with explicit evidence, fine-grained error-type labels, justifications, and corrections; a large-scale training set plus a manually verified benchmark, InFi-Check-FG, for fine-grained fact-checking; and the InFi-Checker model, which jointly outputs supporting evidence, classifies fine-grained error types, and provides justifications along with corrections, substantially improving the accuracy, interpretability, and generalization of factuality evaluation.

Link: https://arxiv.org/abs/2601.06666
Authors: Yuzhuo Bai, Shuzheng Si, Kangyang Luo, Qingyi Wang, Wenhao Li, Gang Chen, Fanchao Qi, Maosong Sun
Institutions: Tsinghua University; DeepLang AI; Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Large language models (LLMs) often hallucinate, yet most existing fact-checking methods treat factuality evaluation as a binary classification problem, offering limited interpretability and failing to capture fine-grained error types. In this paper, we introduce InFi-Check, a framework for interpretable and fine-grained fact-checking of LLM outputs. Specifically, we first propose a controlled data synthesis pipeline that generates high-quality data featuring explicit evidence, fine-grained error type labels, justifications, and corrections. Based on this, we further construct large-scale training data and a manually verified benchmark InFi-Check-FG for fine-grained fact-checking of LLM outputs. Building on these high-quality training data, we further propose InFi-Checker, which can jointly provide supporting evidence, classify fine-grained error types, and produce justifications along with corrections. Experiments show that InFi-Checker achieves state-of-the-art performance on InFi-Check-FG and strong generalization across various downstream tasks, significantly improving the utility and trustworthiness of factuality evaluation.

[NLP-113] What makes for an enjoyable protagonist? An analysis of character warmth and competence

【Quick Read】: This paper asks whether the personality traits of movie protagonists (warmth and competence) predict audience ratings, and whether the effect differs across genres. The key to the solution is large-scale AI-assisted annotation (the LLM_annotate toolkit with GPT-4.1-mini) to quantify these personality dimensions for protagonists across 2,858 films and series, combined with preregistered Bayesian regression analyses relating warmth, competence, and IMDb ratings and testing genre-specific heterogeneity. The results show theory-consistent but small positive associations between warmth/competence and ratings, with male protagonists rated slightly less warm than female ones; these factors explain little of the variance in ratings, suggesting character personality is only one of many influences on movie evaluations.

Link: https://arxiv.org/abs/2601.06658
Authors: Hannes Rosenbusch
Institutions: unknown
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Drawing on psychological and literary theory, we investigated whether the warmth and competence of movie protagonists predict IMDb ratings, and whether these effects vary across genres. Using 2,858 films and series from the Movie Scripts Corpus, we identified protagonists via AI-assisted annotation and quantified their warmth and competence with the LLM_annotate package ([1]; human-LLM agreement: r = .83). Preregistered Bayesian regression analyses revealed theory-consistent but small associations between both warmth and competence and audience ratings, while genre-specific interactions did not meaningfully improve predictions. Male protagonists were slightly less warm than female protagonists, and movies with male leads received higher ratings on average (an association that was multiple times stronger than the relationships between movie ratings and warmth/competence). These findings suggest that, although audiences tend to favor warm, competent characters, the effects on movie evaluations are modest, indicating that character personality is only one of many factors shaping movie ratings. AI-assisted annotation with LLM_annotate and gpt-4.1-mini proved effective for large-scale analyses but occasionally fell short of manually generated annotations.

[NLP-114] Do Language Models Reason Across Languages?

【Quick Read】: This paper addresses weak multilingual reasoning in two-hop question answering, in particular models' instability when synthesizing information across languages and their lack of faithful step-by-step reasoning. The study shows that models are more sensitive to the language of documents providing the answer span than of those providing bridging information, and that in up to 33% of multilingual cases a model answers the final question correctly despite failing to infer the first-hop bridging information, indicating that its reasoning does not follow a faithful decomposition. Moreover, about 18% of errors are composition failures in which both sub-questions are answered correctly but the combined two-hop question is not. To mitigate this, the authors propose a three-stage sub-question (SUBQ) prompting method that explicitly guides step-by-step reasoning, boosting accuracy from 10.1% to 66.5%.

Link: https://arxiv.org/abs/2601.06644
Authors: Yan Meng, Wafaa Mohammed, Christof Monz
Institutions: University of Amsterdam
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract:The real-world information sources are inherently multilingual, which naturally raises a question about whether language models can synthesize information across languages. In this paper, we introduce a simple two-hop question answering setting, where answering a question requires making inferences over two multilingual documents. We find that language models are more sensitive to language variation in answer-span documents than in those providing bridging information, despite the equal importance of both documents for answering a question. Under a step-by-step sub-question evaluation, we further show that in up to 33% of multilingual cases, models fail to infer the bridging information in the first step yet still answer the overall question correctly. This indicates that reasoning in language models, especially in multilingual settings, does not follow a faithful step-by-step decomposition. Subsequently, we show that the absence of reasoning decomposition leads to around 18% composition failure, where both sub-questions are answered correctly but fail for the final two-hop questions. To mitigate this, we propose a simple three-stage SUBQ prompting method to guide the multi-step reasoning with sub-questions, which boosts accuracy from 10.1% to 66.5%.
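
The three-stage SUBQ prompting method can be pictured as a small pipeline. The sketch below is a guess at its shape under stated assumptions: llm(prompt) is a hypothetical text-completion callable and the prompt wording is invented, since the abstract only names the three stages.

```python
def subq_answer(question, bridge_doc, answer_doc, llm):
    """Three-stage sub-question prompting for two-hop QA (illustrative)."""
    # Stage 1: extract the bridging information from the first document.
    bridge = llm(
        f"Document: {bridge_doc}\nQuestion: {question}\n"
        "Answer only the intermediate (bridging) sub-question."
    )
    # Stage 2: answer the second hop, conditioning on the bridge.
    hop2 = llm(
        f"Document: {answer_doc}\nBridge: {bridge}\n"
        f"Sub-question: given the bridge, answer: {question}"
    )
    # Stage 3: compose the final answer from the two sub-answers.
    return llm(
        f"Question: {question}\nStep 1 answer: {bridge}\n"
        f"Step 2 answer: {hop2}\nGive the final answer."
    )
```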

[NLP-115] Efficient Aspect Term Extraction using Spiking Neural Network

【Quick Read】: This paper targets the high energy cost of conventional deep neural networks (DNNs) for Aspect Term Extraction (ATE). Existing approaches mostly cast ATE as sequence labeling with computationally intensive DNNs, which is ill-suited to low-power deployments. The key to the solution is Spiking Neural Networks (SNNs), whose sparse activations and event-driven inference capture temporal dependencies between words; concretely, the proposed SpikeATE architecture uses ternary spiking neurons with direct spike training fine-tuned via pseudo-gradients. On four SemEval benchmark datasets it matches state-of-the-art DNN performance while consuming significantly less energy, demonstrating SNNs as a practical and sustainable choice for ATE.

Link: https://arxiv.org/abs/2601.06637
Authors: Abhishek Kumar Mishra, Arya Somasundaram, Anup Das, Nagarajan Kandasamy
Institutions: Drexel University; UCLA
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Aspect Term Extraction (ATE) identifies aspect terms in review sentences, a key subtask of sentiment analysis. While most existing approaches use energy-intensive deep neural networks (DNNs) for ATE as sequence labeling, this paper proposes a more energy-efficient alternative using Spiking Neural Networks (SNNs). Using sparse activations and event-driven inferences, SNNs capture temporal dependencies between words, making them suitable for ATE. The proposed architecture, SpikeATE, employs ternary spiking neurons and direct spike training fine-tuned with pseudo-gradients. Evaluated on four benchmark SemEval datasets, SpikeATE achieves performance comparable to state-of-the-art DNNs with significantly lower energy consumption. This highlights the use of SNNs as a practical and sustainable choice for ATE tasks.

[NLP-116] MedEinst: Benchmarking the Einstellung Effect in Medical LLM s through Counterfactual Differential Diagnosis

【Quick Read】: This paper addresses the "Einstellung effect" exhibited by large language models (LLMs) in clinical diagnosis: over-reliance on statistical shortcuts rather than patient-specific evidence, which causes misdiagnosis in atypical cases that existing benchmarks fail to detect. The key to the solution is a counterfactual benchmark, MedEinst, plus a new framework, ECR-Agent. MedEinst contains 5,383 paired clinical cases (a control case and a "trap" case with altered discriminative evidence) and quantifies susceptibility to misleading evidence via the Bias Trap Rate. ECR-Agent has two core components: Dynamic Causal Inference (DCI), which performs multi-level causal reasoning and evidence auditing, and Critic-Driven Graph and Memory Evolution (CGME), which stores validated reasoning paths and continually evolves disease-specific knowledge graphs, aligning the LLM's reasoning with evidence-based medicine.

Link: https://arxiv.org/abs/2601.06636
Authors: Wenting Chen, Zhongrui Zhu, Guolin Huang, Wenxuan Wang
Institutions: Stanford University; Xi'an Jiaotong University; Shenzhen University; Renmin University of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures

Abstract:Despite achieving high accuracy on medical benchmarks, LLMs exhibit the Einstellung Effect in clinical diagnosis–relying on statistical shortcuts rather than patient-specific evidence, causing misdiagnosis in atypical cases. Existing benchmarks fail to detect this critical failure mode. We introduce MedEinst, a counterfactual benchmark with 5,383 paired clinical cases across 49 diseases. Each pair contains a control case and a “trap” case with altered discriminative evidence that flips the diagnosis. We measure susceptibility via Bias Trap Rate–probability of misdiagnosing traps despite correctly diagnosing controls. Extensive Evaluation of 17 LLMs shows frontier models achieve high baseline accuracy but severe bias trap rates. Thus, we propose ECR-Agent, aligning LLM reasoning with Evidence-Based Medicine standard via two components: (1) Dynamic Causal Inference (DCI) performs structured reasoning through dual-pathway perception, dynamic causal graph reasoning across three levels (association, intervention, counterfactual), and evidence audit for final diagnosis; (2) Critic-Driven Graph and Memory Evolution (CGME) iteratively refines the system by storing validated reasoning paths in an exemplar base and consolidating disease-specific knowledge into evolving illness graphs. Source code is to be released.

[NLP-117] KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

【Quick Read】: This paper addresses mode collapse when large language models (LLMs) are used to simulate and predict student errors on open-ended tasks (such as coding problems in computer science education), which leaves generated responses lacking diversity in syntax, style, and solution approach. The key to the solution, KASER (Knowledge-Aligned Student Error Simulator), is a reinforcement-learning framework with a hybrid reward that jointly optimizes three aspects: code similarity to the ground truth, error-type matching, and diversity of the predicted code, thereby aligning simulated errors more closely with student knowledge states.

Link: https://arxiv.org/abs/2601.06633
Authors: Zhangqi Duan, Nigel Fernandez, Andrew Lan
Institutions: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: none

Abstract:Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground-truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that: At the per-student-problem pair level, our method outperforms baselines on code and error prediction; at the per-problem level, our method outperforms baselines on error coverage and simulated code diversity.

[NLP-118] Labels have Human Values: Value Calibration of Subjective Tasks

【Quick Read】: This paper addresses misalignment in NLP systems for subjective tasks caused by the diversity of human values, namely how to make model predictions track the value distributions of different populations. The core difficulty is that the plural value structure implicit in annotations is never modeled explicitly, so conventional methods behave inconsistently across groups. The key to the solution, the MultiCalibrated Subjective Task Learner (MC-STL), is to cluster annotations into identifiable human value clusters in one of three ways (similarity of annotator rationales, expert-defined value taxonomies, or annotators' sociocultural descriptors) and to learn cluster-specific embeddings for calibrated predictions per value cluster. The approach delivers clear gains in discrimination, value-specific calibration, and disagreement-aware metrics.

Link: https://arxiv.org/abs/2601.06631
Authors: Mohammed Fayiz Parappan, Ricardo Henao
Institutions: Duke University
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Building NLP systems for subjective tasks requires one to ensure their alignment to contrasting human values. We propose the MultiCalibrated Subjective Task Learner framework (MC-STL), which clusters annotations into identifiable human value clusters by three approaches (similarity of annotator rationales, expert-value taxonomies or rater’s sociocultural descriptors) and calibrates predictions for each value cluster by learning cluster-specific embeddings. We demonstrate MC-STL on several subjective learning settings, including ordinal, binary, and preference learning predictions, and evaluate it on multiple datasets covering toxic chatbot conversations, offensive social media posts, and human preference alignment. The results show that MC-STL consistently outperforms the baselines that ignore the latent value structure of the annotations, delivering gains in discrimination, value-specific calibration, and disagreement-aware metrics.

[NLP-119] Efficient and Reliable Estimation of Named Entity Linking Quality: A Case Study on GutBrainIE

【Quick Read】: This paper addresses the challenge of assessing Named Entity Linking (NEL) quality in large-scale biomedical information extraction (IE) pipelines: achieving efficient, statistically reliable accuracy estimates when expert annotation is expensive and corpora are large. The key to the solution is a sampling-based framework that casts NEL accuracy estimation as a constrained optimization problem: minimize expected annotation cost subject to a target Margin of Error (MoE). The authors adapt Stratified Two-Stage Cluster Sampling (STWCS), defining label-based strata and global surface-form clusters independently of the NEL annotations, which sharply reduces labeling effort: on the GutBrainIE corpus, annotating only 24.6% of instances reaches MoE ≤ 0.05, while expert annotation time drops by about 29% relative to Simple Random Sampling (SRS), balancing efficiency with statistical robustness.

Link: https://arxiv.org/abs/2601.06624
Authors: Marco Martinelli, Stefano Marchesin, Gianmaria Silvello
Institutions: University of Padua
Subjects: Computation and Language (cs.CL)
Comments: Submitted to IRCDL 2026: 22nd Conference on Information and Research Science Connecting to Digital and Library Science, February 19-20, 2026, Modena, Italy

Abstract:Named Entity Linking (NEL) is a core component of biomedical Information Extraction (IE) pipelines, yet assessing its quality at scale is challenging due to the high cost of expert annotations and the large size of corpora. In this paper, we present a sampling-based framework to estimate the NEL accuracy of large-scale IE corpora under statistical guarantees and constrained annotation budgets. We frame NEL accuracy estimation as a constrained optimization problem, where the objective is to minimize expected annotation cost subject to a target Margin of Error (MoE) for the corpus-level accuracy estimate. Building on recent works on knowledge graph accuracy estimation, we adapt Stratified Two-Stage Cluster Sampling (STWCS) to the NEL setting, defining label-based strata and global surface-form clusters in a way that is independent of NEL annotations. Applied to 11,184 NEL annotations in GutBrainIE – a new biomedical corpus openly released in fall 2025 – our framework reaches a MoE \leq 0.05 by manually annotating only 2,749 triples (24.6%), leading to an overall accuracy estimate of 0.915 \pm 0.0473 . A time-based cost model and simulations against a Simple Random Sampling (SRS) baseline show that our design reduces expert annotation time by about 29% at fixed sample size. The framework is generic and can be applied to other NEL benchmarks and IE pipelines that require scalable and statistically robust accuracy assessment.
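
As a point of reference for the annotation-budget arithmetic, the SRS baseline's required sample size for a target Wald margin of error MoE = z * sqrt(p(1-p)/n) can be computed directly. This sketch covers only that baseline, not the paper's STWCS design; the plugged-in accuracy 0.915 and corpus size 11,184 are taken from the abstract.

```python
import math

def srs_sample_size(moe, p=0.5, z=1.96, population=None):
    """Smallest SRS sample size whose Wald margin of error is <= moe."""
    n = (z ** 2) * p * (1 - p) / moe ** 2
    if population is not None:  # optional finite-population correction
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

# Worst case (p = 0.5) vs. near the estimated NEL accuracy of 0.915:
print(srs_sample_size(0.05))                             # 385
print(srs_sample_size(0.05, p=0.915, population=11184))  # 119
```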

[NLP-120] Pragya: An AI-Based Semantic Recommendation System for Sanskrit Subhasitas

【Quick Read】: This paper aims to make Sanskrit aphorisms (Subhasitas), often overlooked in the digital age due to linguistic and contextual barriers, accessible as cultural and philosophical wisdom. The key to the solution is a retrieval-augmented generation (RAG) framework, Pragya: it embeds 200 Subhasitas annotated with thematic tags (motivation, friendship, compassion, etc.) using IndicBERT for semantic retrieval, then passes the retrieved verses to the Mistral large language model (LLM) to generate transliterations, translations, and contextual explanations, markedly improving the relevance and accessibility of the recommendations. Experiments show semantic retrieval outperforms keyword matching in precision and relevance, and a user study confirms that the generated summaries aid cultural transmission.

Link: https://arxiv.org/abs/2601.06607
Authors: Tanisha Raorane, Prasenjit Kole
Institutions: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Sanskrit Subhasitas encapsulate centuries of cultural and philosophical wisdom, yet remain underutilized in the digital age due to linguistic and contextual barriers. In this work, we present Pragya, a retrieval-augmented generation (RAG) framework for semantic recommendation of Subhasitas. We curate a dataset of 200 verses annotated with thematic tags such as motivation, friendship, and compassion. Using sentence embeddings (IndicBERT), the system retrieves top-k verses relevant to user queries. The retrieved results are then passed to a generative model (Mistral LLM) to produce transliterations, translations, and contextual explanations. Experimental evaluation demonstrates that semantic retrieval significantly outperforms keyword matching in precision and relevance, while user studies highlight improved accessibility through generated summaries. To our knowledge, this is the first attempt at integrating retrieval and generation for Sanskrit Subhasitas, bridging cultural heritage with modern applied AI.
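
The retrieval half of such a RAG pipeline reduces to cosine top-k search over precomputed verse embeddings. A minimal sketch, assuming embeddings already produced by a sentence encoder such as IndicBERT (the encoder call itself is omitted, and the random vectors below are placeholders):

```python
import numpy as np

def top_k_verses(query_vec, verse_vecs, k=5):
    """Cosine top-k retrieval over precomputed verse embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    v = verse_vecs / np.linalg.norm(verse_vecs, axis=1, keepdims=True)
    scores = v @ q                      # cosine similarity per verse
    idx = np.argsort(-scores)[:k]       # indices of the k best matches
    return idx, scores[idx]

# Placeholder usage: 200 verses embedded in a 64-dim space.
rng = np.random.default_rng(0)
idx, scores = top_k_verses(rng.normal(size=64), rng.normal(size=(200, 64)), k=3)
print(idx, scores)
```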

[NLP-121] N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs AAAI2026

【Quick Read】: This paper addresses a weakness of conventional retrieval-augmented generation (RAG) pipelines in multi-hop question answering: treating retrieved documents as a flat ranked list lets retrieval noise obscure reasoning chains. The key to the N2N-GQA framework, the first zero-shot system for open-domain hybrid table-text QA that needs no task-specific training, is constructing a dynamic evidence graph from noisy retrieval outputs: documents become nodes and semantic relationships become edges, making it possible to identify the bridge documents that connect reasoning steps. This structured organization substantially improves multi-hop reasoning: on OTT-QA it yields a 19.9-point EM gain over strong baselines and approaches heavily optimized systems, confirming graph-structured evidence organization as central to scalable zero-shot multi-hop QA.

Link: https://arxiv.org/abs/2601.06603
Authors: Mohamed Sharafath, Aravindh Annamalai, Ganesh Murugan, Aravindakumar Venugopalan
Institutions: Comcast India Engineering Center
Subjects: Computation and Language (cs.CL)
Comments: Accepted at an AAAI 2026 Workshop

Abstract:Multi-hop question answering over hybrid table-text data requires retrieving and reasoning across multiple evidence pieces from large corpora, but standard Retrieval-Augmented Generation (RAG) pipelines process documents as flat ranked lists, causing retrieval noise to obscure reasoning chains. We introduce N2N-GQA. To our knowledge, it is the first zeroshot framework for open-domain hybrid table-text QA that constructs dynamic evidence graphs from noisy retrieval outputs. Our key insight is that multi-hop reasoning requires understanding relationships between evidence pieces: by modeling documents as graph nodes with semantic relationships as edges, we identify bridge documents connecting reasoning steps, a capability absent in list-based retrieval. On OTT-QA, graph-based evidence curation provides a 19.9-point EM improvement over strong baselines, demonstrating that organizing retrieval results as structured graphs is critical for multihop reasoning. N2N-GQA achieves 48.80 EM, matching finetuned retrieval models (CORE: 49.0 EM) and approaching heavily optimized systems (COS: 56.9 EM) without any task specific training. This establishes graph-structured evidence organization as essential for scalable, zero-shot multi-hop QA systems and demonstrates that simple, interpretable graph construction can rival sophisticated fine-tuned approaches.
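
One way to picture the evidence-graph idea: treat retrieved documents as nodes, connect semantically related pairs, and look for nodes that sit between reasoning steps. The similarity threshold and the use of betweenness centrality below are illustrative assumptions; the abstract does not specify how N2N-GQA builds edges or ranks bridge documents.

```python
import itertools
import networkx as nx

def bridge_candidates(doc_ids, sim, threshold=0.35, top_n=3):
    """Rank likely 'bridge' documents in an evidence graph (illustrative)."""
    g = nx.Graph()
    g.add_nodes_from(doc_ids)
    for a, b in itertools.combinations(doc_ids, 2):
        s = sim(a, b)              # semantic relatedness of two documents
        if s >= threshold:
            g.add_edge(a, b, weight=s)
    # Bridge documents connect otherwise separate evidence clusters, so high
    # betweenness centrality is a natural (assumed) proxy for bridging.
    central = nx.betweenness_centrality(g)
    return sorted(central, key=central.get, reverse=True)[:top_n]
```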

[NLP-122] Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

【Quick Read】: This paper addresses the underexplored robustness of multimodal large language models (MLLMs) against misinformation on short-video platforms, especially when deceptive claims are entangled with cognitive biases. The key to the solution is a high-quality, manually annotated dataset of 200 short videos across four health domains, with fine-grained labels for three deceptive patterns (experimental errors, logical fallacies, and fabricated claims), each verified against national standards or academic literature. Eight frontier MLLMs are then evaluated under five modality settings; the study further examines social cues that induce false beliefs and finds that models are susceptible to biases such as authoritative channel IDs, providing a systematic basis for hardening models against misleading content.

Link: https://arxiv.org/abs/2601.06600
Authors: Jen-tse Huang, Chang Chen, Shiyang Lai, Wenxuan Wang, Michelle R. Kaufman, Mark Dredze
Institutions: Johns Hopkins University; Chinese University of Hong Kong; University of Chicago; Renmin University of China
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 6 figures, 9 tables

Abstract:Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns, experimental errors, logical fallacies, and fabricated claims, each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.

[NLP-123] How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLM s

【Quick Read】: This paper asks how the geometry of "truth vectors" in large language models (LLMs) changes once context is introduced. Prior work established that LLMs encode statement truth as vectors in residual-stream activations, but not how context affects their direction and magnitude. The key to the solution is quantifying two measures: the angle (θ) between truth vectors with and without context, and the change in the truth vector's relative magnitude after context is added. The study finds that context generally increases the truth vector's magnitude, amplifying the separation between true and false statements in activation space; model scale shapes context handling, with larger models distinguishing relevant from irrelevant context mainly via directional change (θ) and smaller models relying more on magnitude differences. Context that conflicts with parametric knowledge also produces larger geometric changes than aligned context, giving the first systematic geometric account of how context transforms internal truth representations in LLMs.

Link: https://arxiv.org/abs/2601.06599
Authors: Shivam Adarsh, Maria Maistro, Christina Lioma
Institutions: University of Copenhagen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also known as truth vectors, have been studied in prior work, however how they change when context is introduced remains unexplored. We study this question by measuring (1) the directional change ( \theta ) between the truth vectors with and without context and (2) the relative magnitude of the truth vectors upon adding context. Across four LLMs and four datasets, we find that (1) truth vectors are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the truth vector magnitude, i.e., the separation between true and false representations in the activation space is amplified; (3) larger models distinguish relevant from irrelevant context mainly through directional change ( \theta ), while smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than parametrically aligned context. To the best of our knowledge, this is the first work that provides a geometric characterization of how context transforms the truth vector in the activation space of LLMs.
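
The two quantities are straightforward to compute once truth vectors are extracted. A minimal sketch, assuming the common difference-of-means construction over residual-stream activations (the paper may extract its vectors differently):

```python
import numpy as np

def truth_vector(acts_true, acts_false):
    """Difference-of-means 'truth vector' from residual-stream activations.

    acts_true / acts_false: (n_statements, hidden_dim) activation matrices.
    """
    return acts_true.mean(axis=0) - acts_false.mean(axis=0)

def compare_geometry(v_plain, v_context):
    """Directional change (theta, degrees) and relative magnitude."""
    cos = np.dot(v_plain, v_context) / (
        np.linalg.norm(v_plain) * np.linalg.norm(v_context)
    )
    theta = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    rel_mag = np.linalg.norm(v_context) / np.linalg.norm(v_plain)
    return theta, rel_mag
```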

[NLP-124] Detecting LLM-Generated Text with Performance Guarantees

【Quick Read】: This paper addresses the difficulty of distinguishing generative AI text from human writing, a gap that enables fake news, misleading government reports, and academic misconduct. The key to the solution is training a classifier that separates human- and LLM-authored text without relying on watermarks or knowledge of the specific generating model, and that supports statistical inference, achieving higher detection accuracy while maintaining type-I error control, high statistical power, and computational efficiency.

Link: https://arxiv.org/abs/2601.06586
Authors: Hongyi Zhou, Jin Zhu, Ying Yang, Chengchun Shi
Institutions: Tsinghua University; University of Birmingham; London School of Economics and Political Science
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: none

Abstract:Large language models (LLMs) such as GPT, Claude, Gemini, and Grok have been deeply integrated into our daily life. They now support a wide range of tasks – from dialogue and email drafting to assisting with teaching and coding, serving as search engines, and much more. However, their ability to produce highly human-like text raises serious concerns, including the spread of fake news, the generation of misleading governmental reports, and academic misconduct. To address this practical problem, we train a classifier to determine whether a piece of text is authored by an LLM or a human. Our detector is deployed on an online CPU-based platform this https URL, and contains three novelties over existing detectors: (i) it does not rely on auxiliary information, such as watermarks or knowledge of the specific LLM used to generate the text; (ii) it more effectively distinguishes between human- and LLM-authored text; and (iii) it enables statistical inference, which is largely absent in the current literature. Empirically, our classifier achieves higher classification accuracy compared to existing detectors, while maintaining type-I error control, high statistical power, and computational efficiency.
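
Type-I error control of this kind is typically achieved by calibrating the decision threshold on held-out human-written texts. A minimal sketch under that assumption; the scores below are synthetic stand-ins, and "higher score = more LLM-like" is an assumed convention, not necessarily the paper's.

```python
import numpy as np

def calibrate_threshold(human_scores, alpha=0.05):
    """Pick a threshold so at most an alpha fraction of held-out
    human-written texts would be flagged as LLM-generated."""
    return np.quantile(human_scores, 1 - alpha)

def flag_llm(score, threshold):
    return score > threshold  # assumed convention: higher = more LLM-like

rng = np.random.default_rng(1)
human = rng.normal(0.0, 1.0, size=2000)   # stand-in classifier scores
tau = calibrate_threshold(human, alpha=0.05)
print(f"threshold={tau:.2f}, empirical type-I rate={np.mean(human > tau):.3f}")
```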

[NLP-125] Stylistic Evolution and LLM Neutrality in Singlish Language

【Quick Read】: This paper addresses two questions: how Singlish has evolved over a decade of informal digital text messages, and how well current large language models (LLMs) capture such sociolectal and temporal variation. The key to the solution is a stylistic similarity framework that quantifies diachronic change along multiple dimensions by comparing lexico-structural, pragmatic, psycholinguistic, and encoder-derived features across years. The analysis reveals notable shifts in tone, expressivity, and sentence construction, and shows that although some LLMs can produce superficially realistic Singlish, their outputs carry detectable temporal signals despite prompting and fine-tuning, indicating that current models cannot yet generate in a genuinely time-neutral way.

Link: https://arxiv.org/abs/2601.06580
Authors: Linus Tze En Foo, Weihan Angela Ng, Wenkai Li, Lynnette Hui Xian Ng
Institutions: ETH Zürich; Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Singlish is a creole rooted in Singapore’s multilingual environment and continues to evolve alongside social and technological change. This study investigates the evolution of Singlish over a decade of informal digital text messages. We propose a stylistic similarity framework that compares lexico-structural, pragmatic, psycholinguistic, and encoder-derived features across years to quantify temporal variation. Our analysis reveals notable diachronic changes in tone, expressivity and sentence construction over the years. Conversely, while some LLMs were able to generate superficially realistic Singlish messages, they do not produce temporally neutral outputs, and residual temporal signals remain detectable despite prompting and fine-tuning. Our findings highlight the dynamic evolution of Singlish, as well as the capabilities and limitations of current LLMs in modeling sociolectal and temporal variations in the colloquial language.

[NLP-126] Are Emotions Arranged in a Circle? Geometric Analysis of Emotion Representations via Hyperspherical Contrastive Learning

【Quick Read】: This paper asks whether the circumplex model of emotions from psychology can be built directly into the representation learning of language models to improve the interpretability and robustness of the embedding space, and whether the resulting geometry is valid. The key to the solution is contrastive learning on a hypersphere that induces a circular arrangement of emotion representations in the embedding space, geometrically constraining emotion dimensions so that similar emotions sit adjacently and opposing emotions diagonally. This circular alignment improves interpretability and robustness under dimensionality reduction, though it underperforms conventional designs in high-dimensional settings and fine-grained classification.

Link: https://arxiv.org/abs/2601.06575
Authors: Yusuke Yamauchi, Akiko Aizawa
Institutions: The University of Tokyo; National Institute of Informatics
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.
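
The core construction can be sketched as pulling normalized embeddings toward fixed anchors on a circle. Everything below is a simplification: anchors are spaced evenly, the sphere is collapsed to 2-D, and the loss is a plain cosine term, whereas the paper works with contrastive learning on a full hypersphere.

```python
import numpy as np

def circular_targets(n_emotions):
    """One anchor per emotion, evenly spaced on the unit circle (circumplex)."""
    angles = 2 * np.pi * np.arange(n_emotions) / n_emotions
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def circumplex_loss(emb2d, labels, anchors):
    """Cosine loss pulling L2-normalized 2-D embeddings toward their emotion
    anchors; similar emotions end up adjacent, opposites diagonal."""
    z = emb2d / np.linalg.norm(emb2d, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z * anchors[labels], axis=1)))

anchors = circular_targets(8)                       # e.g. 8 basic emotions
emb = np.random.default_rng(0).normal(size=(32, 2))  # toy 2-D embeddings
print(circumplex_loss(emb, np.arange(32) % 8, anchors))
```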

[NLP-127] EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

【Quick Read】: This paper addresses the risk that large language models (LLMs) generating transaction scripts on EVM-compatible chains are rarely evaluated for execution accuracy and safety, so subtle but fatal logic errors go undetected. The key to the solution is EVM-QuestBench, an execution-grounded benchmark: task instructions are generated dynamically, numeric parameters are sampled from predefined intervals, and scripts are executed on a snapshot-isolated forked EVM chain where validators verify the outcomes against the instantiated values. A modular architecture enables rapid task development, and composite tasks apply step-efficiency decay to better reflect multi-step workflow completion, allowing systematic measurement of model safety and reliability in complex transaction scenarios.

Link: https://arxiv.org/abs/2601.06565
Authors: Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi
Institutions: Gradient; Soochow University
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 13 figures

Abstract:Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: this https URL.

[NLP-128] CSR-RAG: An Efficient Retrieval System for Text-to-SQL on the Enterprise Scale

【Quick Read】: This paper addresses a key challenge in enterprise-scale Text-to-SQL: performing table retrieval and SQL generation efficiently and accurately when no explicit schema description is given as input. Academic benchmarks typically include the schema in the natural-language input, whereas enterprise applications must first retrieve relevant tables from large databases before generating queries. The key to the solution is a novel hybrid Retrieval-Augmented Generation (RAG) system, Contextual, Structural, and Relational Retrieval (CSR-RAG), which fuses three complementary retrieval signals (contextual semantic matching, structural table features, and inter-table relationships) to improve precision and recall while remaining computationally efficient. On enterprise benchmarks, CSR-RAG reaches up to 40% precision and over 80% recall with an average query-generation latency of only 30 ms on commodity data-center hardware.

Link: https://arxiv.org/abs/2601.06564
Authors: Rajpreet Singh, Novak Boškov, Lawrence Drabeck, Aditya Gudal, Manzoor A. Khan
Institutions: Technical University of Munich; Nokia Bell Labs
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Natural language to SQL translation (Text-to-SQL) is one of the long-standing problems that has recently benefited from advances in Large Language Models (LLMs). While most academic Text-to-SQL benchmarks request schema description as a part of natural language input, enterprise-scale applications often require table retrieval before SQL query generation. To address this need, we propose a novel hybrid Retrieval Augmented Generation (RAG) system consisting of contextual, structural, and relational retrieval (CSR-RAG) to achieve computationally efficient yet sufficiently accurate retrieval for enterprise-scale databases. Through extensive enterprise benchmarks, we demonstrate that CSR-RAG achieves up to 40% precision and over 80% recall while incurring a negligible average query generation latency of only 30ms on commodity data center hardware, which makes it appropriate for modern LLM-based enterprise-scale systems.
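
The abstract does not spell out how the three retrieval signals are fused; a common pattern for such hybrid retrievers is a weighted score combination, sketched below with invented weights and hypothetical scorer callables.

```python
def csr_score(table, query, contextual, structural, relational,
              w=(0.5, 0.3, 0.2)):
    """Hybrid retrieval score as a weighted sum of contextual, structural,
    and relational signals. The weights and the scorer callables are
    illustrative assumptions, not taken from the paper."""
    return (w[0] * contextual(query, table)
            + w[1] * structural(query, table)
            + w[2] * relational(query, table))

def retrieve_tables(tables, query, scorers, k=5):
    """Return the k highest-scoring tables for a natural-language query."""
    ranked = sorted(tables, key=lambda t: csr_score(t, query, *scorers),
                    reverse=True)
    return ranked[:k]
```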

[NLP-129] L-RAG: Balancing Context and Retrieval with Entropy-Based Lazy Loading

【Quick Read】: This paper addresses the computational overhead and inference latency caused by the "retrieve-always" strategy of conventional retrieval-augmented generation (RAG) systems, which is especially costly in high-throughput production settings. The key to the solution, L-RAG (Lazy Retrieval-Augmented Generation), is hierarchical context management via entropy-based gating: the model first answers using a compact document summary, and expensive chunk retrieval is triggered only when predictive entropy exceeds a calibrated threshold, that is, when the model is genuinely uncertain and needs external knowledge. The method requires no additional training and offers a tunable accuracy-efficiency trade-off, matching standard RAG accuracy while substantially reducing retrieval frequency and query latency.

Link: https://arxiv.org/abs/2601.06551
Authors: Sergii Voloshyn
Institutions: unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none

Abstract:Retrieval-Augmented Generation (RAG) has emerged as the predominant paradigm for grounding Large Language Model outputs in factual knowledge, effectively mitigating hallucinations. However, conventional RAG systems operate under a “retrieve-always” assumption, querying vector databases for every input regardless of query complexity. This static approach incurs substantial computational overhead and inference latency, particularly problematic for high-throughput production deployments. We introduce L-RAG (Lazy Retrieval-Augmented Generation), an adaptive framework that implements hierarchical context management through entropy-based gating. L-RAG employs a two-tier architecture: queries are first processed with a compact document summary, and expensive chunk retrieval is triggered only when the model’s predictive entropy exceeds a calibrated threshold, signaling genuine uncertainty. Through experiments on SQuAD 2.0 (N=500) using the Phi-2 model, we demonstrate that L-RAG provides a tunable accuracy-efficiency trade-off: at a conservative threshold (tau=0.5), L-RAG achieves 78.2% accuracy, matching Standard RAG (77.8%), with 8% retrieval reduction; at a balanced threshold (tau=1.0), retrieval reduction increases to 26% with modest accuracy trade-off (76.0%). Latency analysis shows that L-RAG saves 80-210ms per query when retrieval latency exceeds 500ms. Analysis of entropy distributions reveals statistically significant separation (p 0.001) between correct predictions (H=1.72) and errors (H=2.20), validating entropy as a reliable uncertainty signal. L-RAG offers a practical, training-free approach toward more efficient RAG deployment, providing system architects with a configurable knob to balance accuracy and throughput requirements.
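
The gating logic is compact enough to sketch end-to-end. Here llm and retrieve are hypothetical callables (the paper uses Phi-2 and SQuAD 2.0, with details the sketch does not reproduce); the entropy formula is the standard H = -sum(p log p), and the tau values 0.5 and 1.0 come from the abstract.

```python
import numpy as np

def predictive_entropy(token_probs):
    """Mean token-level entropy H = -sum_i p_i * log(p_i) over a draft answer.

    token_probs: (n_tokens, vocab_size) array of next-token distributions.
    """
    p = np.asarray(token_probs)
    return float(np.mean(-np.sum(p * np.log(p + 1e-12), axis=-1)))

def lazy_rag_answer(question, summary, llm, retrieve, tau=1.0):
    """Tier 1: answer from the compact summary; escalate only if uncertain."""
    draft, probs = llm(question, context=summary)    # hypothetical API
    if predictive_entropy(probs) <= tau:             # confident: stop here
        return draft
    chunks = retrieve(question)                      # tier 2: chunk retrieval
    answer, _ = llm(question, context=summary + "\n" + "\n".join(chunks))
    return answer
```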

[NLP-130] SimLLM : Fine-Tuning Code LLM s for SimPy-Based Queueing System Simulation

【Quick Read】: This paper addresses the high computational cost and data-privacy risks of using large language models (LLMs), especially closed-source models such as GPT-4o, to generate executable SimPy queueing-simulation code. The key to the solution is domain-specific fine-tuning of two open-source code models, Qwen-Coder-7B and DeepSeek-Coder-6.7B, using a multi-stage framework: two stages of supervised fine-tuning (SFT) followed by one stage of direct preference optimization (DPO), which progressively improves executability, output-format compliance, and instruction-code consistency for SimPy queueing-simulation code generation. Extensive evaluations show that this turns compact open-source models into reliable SimPy code generators, offering a low-cost, privacy-preserving alternative for education, research, and operational decision support.

Link: https://arxiv.org/abs/2601.06543
Authors: Jun-Qi Chen, Kun Zhang, Rui Zheng, Ying Zhong
Institutions: Renmin University of China; University of Electronic Science and Technology of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 33 pages, 10 figures

Abstract:The Python package SimPy is widely used for modeling queueing systems due to its flexibility, simplicity, and smooth integration with modern data analysis and optimization frameworks. Recent advances in large language models (LLMs) have shown strong ability in generating clear and executable code, making them powerful and suitable tools for writing SimPy queueing simulation code. However, directly employing closed-source models like GPT-4o to generate such code may lead to high computational costs and raise data privacy concerns. To address this, we fine-tune two open-source LLMs, Qwen-Coder-7B and DeepSeek-Coder-6.7B, on curated SimPy queueing data, which enhances their code-generating performance in executability, output-format compliance, and instruction-code consistency. Particularly, we proposed a multi-stage fine-tuning framework comprising two stages of supervised fine-tuning (SFT) and one stage of direct preference optimization (DPO), progressively enhancing the model’s ability in SimPy-based queueing simulation code generation. Extensive evaluations demonstrate that both fine-tuned models achieve substantial improvements in executability, output-format compliance, and instruct consistency. These results confirm that domain-specific fine-tuning can effectively transform compact open-source code models into reliable SimPy simulation generators which provide a practical alternative to closed-source LLMs for education, research, and operational decision support.
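
For context on the target output format, a minimal SimPy M/M/1 queue, the kind of program these fine-tuned models are meant to emit, looks as follows. This is a generic textbook example rather than code from the paper, and the arrival and service rates are arbitrary.

```python
import random
import simpy

def customer(env, name, counter, service_rate):
    """A customer arrives, waits for the counter, is served, and leaves."""
    arrive = env.now
    with counter.request() as req:
        yield req                                  # queue for the server
        print(f"{name} waited {env.now - arrive:.2f}")
        yield env.timeout(random.expovariate(service_rate))

def source(env, counter, arrival_rate, service_rate):
    """Generate customers with exponential interarrival times."""
    i = 0
    while True:
        yield env.timeout(random.expovariate(arrival_rate))
        i += 1
        env.process(customer(env, f"customer{i}", counter, service_rate))

env = simpy.Environment()
counter = simpy.Resource(env, capacity=1)          # single server: M/M/1
env.process(source(env, counter, arrival_rate=0.9, service_rate=1.0))
env.run(until=50)
```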

[NLP-131] Exposía: Academic Writing Assessment of Exposés and Peer Feedback

【Quick Read】: This paper addresses the lack of structured, publicly available datasets for assessing academic writing in higher education, which limits research on pedagogically grounded evaluation of writing and feedback. The key to the solution is Exposía, the first public dataset that systematically links student research project proposals with peer and instructor feedback (comments and free-text reviews) and annotates both writing and feedback quality under a fine-grained, pedagogically grounded scoring schema. Using the dataset, the authors benchmark open-source large language models (LLMs) on automated scoring of proposals and student reviews, finding that the strongest models approach human agreement on dimensions requiring little domain knowledge but lag on content-related dimensions; a prompting strategy that scores multiple writing aspects jointly proves most effective, an important finding for classroom deployment of automated assessment.

Link: https://arxiv.org/abs/2601.06536
Authors: Dennis Zyska, Alla Rozovskaya, Ilia Kuznetsov, Iryna Gurevych
Institutions: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; Department of Computer Science at Queens College, City University of New York (CUNY)
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:We present Exposía, the first public dataset that connects writing and feedback assessment in higher education, enabling research on educationally grounded approaches to academic writing evaluation. Exposía includes student research project proposals and peer and instructor feedback consisting of comments and free-text reviews. The dataset was collected in the “Introduction to Scientific Work” course of the Computer Science undergraduate program that focuses on teaching academic writing skills and providing peer feedback on academic writing. Exposía reflects the multi-stage nature of the academic writing process that includes drafting, providing and receiving feedback, and revising the writing based on the feedback received. Both the project proposals and peer feedback are accompanied by human assessment scores based on a fine-grained, pedagogically-grounded schema for writing and feedback assessment that we develop. We use Exposía to benchmark state-of-the-art open-source large language models (LLMs) for two tasks: automated scoring of (1) the proposals and (2) the student reviews. The strongest LLMs attain high agreement on scoring aspects that require little domain knowledge but degrade on dimensions evaluating content, in line with human agreement values. We find that LLMs align better with the human instructors giving high scores. Finally, we establish that a prompting strategy that scores multiple aspects of the writing together is the most effective, an important finding for classroom deployment. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2601.06536 [cs.CL] (or arXiv:2601.06536v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2601.06536 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[NLP-132] Atomic-SNLI: Fine-Grained Natural Language Inference through Atomic Fact Decomposition

【Quick Read】: This paper addresses the limitation that current natural language inference (NLI) systems operate mainly at the sentence level and lack explanatory power. Existing models perform poorly at atomic-level inference, and the conventional assumption, that a hypothesis is entailed only when all of its atomic facts are entailed, fails in practice as a result. The key to the solution is a new dataset, Atomic-SNLI, built by decomposing SNLI and enriching it with carefully curated atomic-level examples via linguistically informed generation strategies, strengthening fine-grained reasoning. Experiments show that models fine-tuned on Atomic-SNLI gain substantially in atomic-level reasoning while retaining strong sentence-level performance, enabling both accurate judgments and transparent, fact-level explanations.

Link: https://arxiv.org/abs/2601.06528
Authors: Minghui Huang
Institutions: The University of Texas at Austin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Current Natural Language Inference (NLI) systems primarily operate at the sentence level, providing black-box decisions that lack explanatory power. While atomic-level NLI offers a promising alternative by decomposing hypotheses into individual facts, we demonstrate that the conventional assumption that a hypothesis is entailed only when all its atomic facts are entailed fails in practice due to models’ poor performance on fine-grained reasoning. Our analysis reveals that existing models perform substantially worse on atomic level inference compared to sentence level tasks. To address this limitation, we introduce Atomic-SNLI, a novel dataset constructed by decomposing SNLI and enriching it with carefully curated atomic level examples through linguistically informed generation strategies. Experimental results demonstrate that models fine-tuned on Atomic-SNLI achieve significant improvements in atomic reasoning capabilities while maintaining strong sentence level performance, enabling both accurate judgements and transparent, explainable results at the fact level.
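
The conventional aggregation rule that the paper stress-tests is essentially one line of code; nli(premise, fact) below is a hypothetical classifier returning an NLI label.

```python
def hypothesis_entailed(premise, atomic_facts, nli):
    """Conventional rule: the hypothesis counts as entailed only when every
    decomposed atomic fact is entailed. The paper shows this rule breaks
    down in practice because models' atomic-level judgments are far weaker
    than their sentence-level ones."""
    return all(nli(premise, fact) == "entailment" for fact in atomic_facts)
```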

[NLP-133] BabyVision: Visual Reasoning Beyond Language

【Quick Read】: This paper addresses a pronounced gap in the basic visual understanding of current multimodal large language models (MLLMs): despite excelling on language-knowledge-heavy tasks, they fall far short of young children (even 3-year-olds) on purely visual tasks that require no language. The key to the solution is BabyVision, a benchmark designed to assess core visual abilities independent of linguistic priors, comprising 388 items across 22 subclasses in four key categories. Results show that leading MLLMs score far below human baselines: Gemini3-Pro-Preview reaches 49.7, lagging behind 6-year-old humans and well behind the average adult score of 94.1, revealing that current models lack fundamental visual primitives. The authors further propose BabyVision-Gen and an automatic evaluation toolkit to push generation models on visual reasoning, charting a path toward human-level visual perception and reasoning.

Link: https://arxiv.org/abs/2601.06521
Authors: Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li
Institutions: UniPat AI; xbench; Alibaba Group; MoonShot AI; StepFun; Peking University; Tsinghua University; University of Wisconsin–Madison; Princeton University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 26 pages, Homepage at this https URL

Abstract:While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and automatic evaluation toolkit. Our code and benchmark data are released at this https URL for reproduction.

[NLP-134] MedRAGChecker: Claim-Level Verification for Biomedical Retrieval-Augmented Generation

【Quick Read】: This paper addresses a safety concern in biomedical retrieval-augmented generation (RAG): long-form outputs often contain isolated, unsupported, or contradictory claims. The key to the solution is MedRAGChecker, a claim-level verification and diagnostic framework: it decomposes a generated answer into atomic claims, estimates each claim's support by combining evidence-grounded natural language inference (NLI) with biomedical knowledge-graph (KG) consistency signals, and aggregates claim-level decisions into answer-level diagnostics that disentangle retrieval and generation failures, covering faithfulness, under-evidence, contradiction, and safety-critical error rates. For scalable evaluation, the pipeline is distilled into compact biomedical models combined in an ensemble verifier with class-specific reliability weighting.

Link: https://arxiv.org/abs/2601.06519
Authors: Yuelyu Ji, Min Gu Kwak, Hang Zhang, Xizhi Wu, Chenyu Li, Yanshan Wang
Institutions: University of Pittsburgh
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Biomedical retrieval-augmented generation (RAG) can ground LLM answers in medical literature, yet long-form outputs often contain isolated unsupported or contradictory claims with safety implications. We introduce MedRAGChecker, a claim-level verification and diagnostic framework for biomedical RAG. Given a question, retrieved evidence, and a generated answer, MedRAGChecker decomposes the answer into atomic claims and estimates claim support by combining evidence-grounded natural language inference (NLI) with biomedical knowledge-graph (KG) consistency signals. Aggregating claim decisions yields answer-level diagnostics that help disentangle retrieval and generation failures, including faithfulness, under-evidence, contradiction, and safety-critical error rates. To enable scalable evaluation, we distill the pipeline into compact biomedical models and use an ensemble verifier with class-specific reliability weighting. Experiments on four biomedical QA benchmarks show that MedRAGChecker reliably flags unsupported and contradicted claims and reveals distinct risk profiles across generators, particularly on safety-critical biomedical relations.

[NLP-135] Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

【Quick Read】: This paper addresses the limited generalization and interpretability of deep-learning classifiers for rare celestial object candidates, which forces astronomers to rely on manual visual inspection for final vetting, the main bottleneck as spectroscopic survey data grows. The key to the solution is Spec-o3, a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection through interleaved multimodal chain-of-thought reasoning. It is trained with a two-stage post-training recipe: cold-start supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks, lifting macro-F1 from 28.3 to 76.5 on five LAMOST rare-object identification tasks with a 7B base model and generalizing well across surveys (from LAMOST to SDSS/DESI).

Link: https://arxiv.org/abs/2601.06498
Authors: Minghui Jia, Qichao Zhang, Ali Luo, Linjing Li, Shuo Ye, Hailing Lu, Wen Hou, Dongbin Zhao
Institutions: Institute of Automation, CAS; School of Advanced Interdisciplinary Sciences, UCAS; National Astronomical Observatories, CAS; School of Artificial Intelligence, UCAS
Subjects: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: none

Abstract:Due to the limited generalization and interpretability of deep learning classifiers, The final vetting of rare celestial object candidates still relies on expert visual inspection–a manually intensive process. In this process, astronomers leverage specialized tools to analyze spectra and construct reliable catalogs. However, this practice has become the primary bottleneck, as it is fundamentally incapable of scaling with the data deluge from modern spectroscopic surveys. To bridge this gap, we propose Spec-o3, a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection via interleaved multimodal chain-of-thought reasoning. Spec-o3 is trained with a two-stage post-training recipe: cold-start supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks. Evaluated on five rare-object identification tasks from LAMOST, Spec-o3 establishes a new State-of-the-Art, boosting the macro-F1 score from 28.3 to 76.5 with a 7B parameter base model and outperforming both proprietary VLMs and specialized deep models. Crucially, the agent demonstrates strong generalization to unseen inspection tasks across survey shifts (from LAMOST to SDSS/DESI). Expert evaluations confirm that its reasoning traces are coherent and physically consistent, supporting transparent and trustworthy decision-making. Code, data, and models are available at \hrefthis https URLProject HomePage.

[NLP-136] IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments

【Quick Read】: This paper addresses the long-standing neglect of Indian regional bias in natural language processing (NLP), especially as expressed in social media user comments. Existing work has focused on gender, race, and socioeconomic biases, while regional bias has been hard to model because data are difficult to extract, annotators disagree due to inherent human biases, and it is often entangled with other bias types. The key to the solution is IndRegBias, a dataset of Indian regional bias comprising 25,000 Reddit and YouTube comments on regional issues in India, annotated with a multilevel strategy that grades bias severity. Evaluations of open-source large language models (LLMs) and Indic language models (ILMs) show that fine-tuning substantially outperforms zero-shot and few-shot approaches at detecting Indian regional bias and its severity.

Link: https://arxiv.org/abs/2601.06477
Authors: Debasmita Panda, Akash Anil, Neelesh Kumar Shukla
Institutions: Indian Institute of Science Education and Research, Bhopal; Oracle Industries AI, Oracle Corporation
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: Preprint. Under review

Abstract:Warning: This paper consists of examples representing regional biases in Indian regions that might be offensive towards a particular region. While social biases corresponding to gender, race, socio-economic conditions, etc., have been extensively studied in the major applications of Natural Language Processing (NLP), biases corresponding to regions have garnered less attention. This is mainly because of (i) difficulty in the extraction of regional bias datasets, (ii) disagreements in annotation due to inherent human biases, and (iii) regional biases being studied in combination with other types of social biases and often being under-represented. This paper focuses on creating a dataset IndRegBias, consisting of regional biases in an Indian context reflected in users’ comments on popular social media platforms, namely Reddit and YouTube. We carefully selected 25,000 comments appearing on various threads in Reddit and videos on YouTube discussing trending topics on regional issues in India. Furthermore, we propose a multilevel annotation strategy to annotate the comments describing the severity of regional biased statements. To detect the presence of regional bias and its severity in IndRegBias, we evaluate open-source Large Language Models (LLMs) and Indic Language Models (ILMs) using zero-shot, few-shot, and fine-tuning strategies. We observe that zero-shot and few-shot approaches show lower accuracy in detecting regional biases and severity in the majority of the LLMs and ILMs. However, the fine-tuning approach significantly enhances the performance of the LLM in detecting Indian regional bias along with its severity.

[NLP-137] PRISP: Privacy-Safe Few-Shot Personalization via Lightweight Adaptation

【Quick Read】: This paper addresses three challenges of personalizing large language models (LLMs) in realistic deployments: extremely limited user data, constrained computational resources, and strict privacy requirements; most prior methods assume data-rich, resource-abundant settings and risk privacy leakage. The key to the proposed lightweight, privacy-safe framework PRISP is a Text-to-LoRA hypernetwork that generates task-aware LoRA (Low-Rank Adaptation) parameters from task descriptions; personalizing a user then requires optimizing only a small subset of these parameters together with minimal additional modules, achieving effective few-shot personalization with low computational overhead and no privacy risk.

Link: https://arxiv.org/abs/2601.06471
Authors: Junho Park, Dohoon Kim, Taesup Moon
Institutions: Seoul National University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures

Abstract:Large language model (LLM) personalization aims to adapt general-purpose models to individual users. Most existing methods, however, are developed under data-rich and resource-abundant settings, often incurring privacy risks. In contrast, realistic personalization typically occurs after deployment under (i) extremely limited user data, (ii) constrained computational resources, and (iii) strict privacy requirements. We propose PRISP, a lightweight and privacy-safe personalization framework tailored to these constraints. PRISP leverages a Text-to-LoRA hypernetwork to generate task-aware LoRA parameters from task descriptions, and enables efficient user personalization by optimizing a small subset of task-aware LoRA parameters together with minimal additional modules using few-shot user data. Experiments on a few-shot variant of the LaMP benchmark demonstrate that PRISP achieves strong overall performance compared to prior approaches, while reducing computational overhead and eliminating privacy risks.

[NLP-138] Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

【Quick Read】: This paper tackles a central challenge in sequence modeling: designing a unified neural network that efficiently and inherently processes sequences of arbitrary length, addressing the Transformer's quadratic complexity and weak length extrapolation. The key to the solution is the Gecko architecture, which inherits the exponential moving average with gated attention design of Mega and Megalodon and adds three techniques: timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory, substantially strengthening its ability to capture long-range dependencies. In a controlled pretraining comparison at 7 billion parameters and 2 trillion training tokens, Gecko reaches a training loss of 1.68, beating Llama2-7B (1.75) and Megalodon-7B (1.70) and approaching Llama2-13B (1.67); without any context-extension techniques, it stably handles sequences up to 4 million tokens and retrieves information from contexts 4x longer than its attention window.

Link: https://arxiv.org/abs/2601.06463
Authors: Xuezhe Ma, Shicheng Wen, Linghao Jin, Bilge Acun, Ruihang Lai, Bohan Hou, Will Lin, Hao Zhang, Songlin Yang, Ryan Lee, Mengxi Wu, Jonathan May, Luke Zettlemoyer, Carole-Jean Wu
Institutions: University of Southern California; Meta AI Research; Carnegie Mellon University; University of California San Diego; MIT CSAIL
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 13 pages, 5 figures, and 3 tables

Abstract:Designing a unified neural network to efficiently and inherently process sequential data with arbitrary lengths is a central and challenging problem in sequence modeling. The design choices in Transformer, including quadratic complexity and weak length extrapolation, have limited their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability to capture long range dependencies, including timestep decay normalization, sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon in the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4\times longer than its attention window. Code: this https URL
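
Gecko builds on the damped exponential moving average used in the Mega family. A per-dimension sketch of that EMA component alone (not the gated attention, timestep decay normalization, or chunked attention) follows; shapes and parameter names are illustrative.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped EMA over a sequence: y_t = alpha*x_t + (1 - alpha*delta)*y_{t-1}.

    x: (seq_len, dim) inputs; alpha, delta: (dim,) coefficients in (0, 1).
    This is the Mega-family EMA sublayer in its simplest elementwise form.
    """
    y = np.zeros(x.shape[1])
    out = []
    for x_t in x:
        y = alpha * x_t + (1.0 - alpha * delta) * y
        out.append(y)
    return np.stack(out)

seq = np.random.default_rng(0).normal(size=(8, 4))
print(damped_ema(seq, alpha=np.full(4, 0.5), delta=np.full(4, 0.9)).shape)  # (8, 4)
```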

[NLP-139] Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs WACV

【Quick Read】: This paper addresses the unreliable visual grounding caused by hallucination in Vision-Language Models (VLMs) deployed in safety-critical applications; in particular, prior work has focused on hallucinations about object presence or absence, leaving little understanding of how prompt phrasing and structural constraints systematically induce hallucinations. The key to the solution is Ghost-100, a procedurally generated dataset in which key visual details are deliberately removed to control for absence-based hallucinations, together with a 5-Level Prompt Intensity Framework that varies pressure from neutral queries to toxic demands and format coercion, systematically evaluating the hallucination responses of three representative open-weight VLMs (MiniCPM-V 2.6-8B, Qwen2-VL-7B, and Qwen3-VL-8B). Experiments show that hallucination rates do not rise monotonically with prompt intensity, and every model exhibits a threshold-like reduction under high-intensity prompts, suggesting that current safety alignment is better at detecting semantic hostility than structural coercion and revealing model-specific limitations in handling compliance pressure.

Link: https://arxiv.org/abs/2601.06460
Authors: Weihao Hong, Zhiyuan Jiang, Bingyu Shen, Xinlei Guan, Yangyi Feng, Meng Xu, Boyang Li
Institutions: Kean University; University of Notre Dame
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 6 figures, WACV Workshop

Abstract:Vision-Language Models (VLMs) are increasingly used in safety-critical applications that require reliable visual grounding. However, these models often hallucinate details that are not present in the image to satisfy user prompts. While recent datasets and benchmarks have been introduced to evaluate systematic hallucinations in VLMs, many hallucination behaviors remain insufficiently characterized. In particular, prior work primarily focuses on object presence or absence, leaving it unclear how prompt phrasing and structural constraints can systematically induce hallucinations. In this paper, we investigate how different forms of prompt pressure influence hallucination behavior. We introduce Ghost-100, a procedurally generated dataset of synthetic scenes in which key visual details are deliberately removed, enabling controlled analysis of absence-based hallucinations. Using a structured 5-Level Prompt Intensity Framework, we vary prompts from neutral queries to toxic demands and rigid formatting constraints. We evaluate three representative open-weight VLMs: MiniCPM-V 2.6-8B, Qwen2-VL-7B, and Qwen3-VL-8B. Across all three models, hallucination rates do not increase monotonically with prompt intensity. All models exhibit reductions at higher intensity levels at different thresholds, though not all show sustained reduction under maximum coercion. These results suggest that current safety alignment is more effective at detecting semantic hostility than structural coercion, revealing model-specific limitations in handling compliance pressure. Our dataset is available at: this https URL

[NLP-140] LitVISTA: A Benchmark for Narrative Orchestration in Literary Text

【Quick Read】: This paper targets a structural deficiency of current large language models (LLMs) when generating literary text: models focus excessively on causal coherence while neglecting the complex story arcs and holistic orchestration of human narratives, creating a marked structural misalignment between model- and human-authored writing. The key to the solution is VISTA Space, a high-dimensional representational framework that unifies how humans and models view narrative structure, together with LitVISTA, a structurally annotated benchmark grounded in literary texts that enables systematic evaluation of models' narrative orchestration capabilities.

Link: https://arxiv.org/abs/2601.06445
Authors: Mingzhe Lu, Yiwen Wang, Yanbing Liu, Qi You, Chong Liu, Ruize Qin, Haoyu Dong, Wenyu Zhang, Jiarui Zhang, Yue Hu, Yunpeng Li
Institutions: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; University of Science and Technology of China; Northeastern University; University of Melbourne
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Computational narrative analysis aims to capture rhythm, tension, and emotional dynamics in literary texts. Existing large language models can generate long stories but overly focus on causal coherence, neglecting the complex story arcs and orchestration inherent in human narratives. This creates a structural misalignment between model- and human-generated narratives. We propose VISTA Space, a high-dimensional representational framework for narrative orchestration that unifies human and model narrative perspectives. We further introduce LitVISTA, a structurally annotated benchmark grounded in literary texts, enabling systematic evaluation of models’ narrative orchestration capabilities. We conduct oracle evaluations on a diverse selection of frontier LLMs, including GPT, Claude, Grok, and Gemini. Results reveal systematic deficiencies: existing models fail to construct a unified global narrative view, struggling to jointly capture narrative function and structure. Furthermore, even advanced thinking modes yield only limited gains for such literary narrative understanding.

[NLP-141] Time Travel Engine: A Shared Latent Chronological Manifold Enables Historical Navigation in Large Language Models

【Quick Read】: This paper asks how temporal information is encoded in large language models (LLMs), i.e., how chronological progression is organized and can be manipulated in their latent space. The central challenge is to understand and control how models represent historical context so as to faithfully simulate the style, vocabulary, and conceptual evolution of different historical periods. The key to the solution is the Time Travel Engine (TTE), an interpretability-driven framework that projects diachronic linguistic patterns onto a shared chronological manifold and directly modulates latent representations in the residual stream, inducing coherent stylistic, lexical, and conceptual shifts aligned with target eras without relying on surface-level prompting. TTE not only enables smooth navigation across eras but also reveals topological isomorphism between the temporal subspaces of Chinese and English models, suggesting that different languages share a universal geometric logic of historical evolution and offering a new paradigm for controlling temporal reasoning in neural networks.

Link: https://arxiv.org/abs/2601.06437
Authors: Jingmin An, Wei Liu, Qian Wang, Fang Fang
Institutions: Not listed
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Time functions as a fundamental dimension of human cognition, yet the mechanisms by which Large Language Models (LLMs) encode chronological progression remain opaque. We demonstrate that temporal information in their latent space is organized not as discrete clusters but as a continuous, traversable geometry. We introduce the Time Travel Engine (TTE), an interpretability-driven framework that projects diachronic linguistic patterns onto a shared chronological manifold. Unlike surface-level prompting, TTE directly modulates latent representations to induce coherent stylistic, lexical, and conceptual shifts aligned with target eras. By parameterizing diachronic evolution as a continuous manifold within the residual stream, TTE enables fluid navigation through period-specific "zeitgeists" while restricting access to future knowledge. Furthermore, experiments across diverse architectures reveal topological isomorphism between the temporal subspaces of Chinese and English, indicating that distinct languages share a universal geometric logic of historical evolution. These findings bridge historical linguistics with mechanistic interpretability, offering a novel paradigm for controlling temporal reasoning in neural networks.
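
A minimal sketch of the general steering idea, modulating the residual stream with a direction vector at one layer, assuming a hypothetical precomputed era_direction and using GPT-2 as a stand-in model; the paper's actual manifold parameterization and models are not reproduced here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not one used by the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical steering vector: a unit direction in the residual stream
# assumed to point from present-day toward a target historical era.
hidden = model.config.hidden_size
era_direction = torch.randn(hidden)
era_direction = era_direction / era_direction.norm()
alpha = 4.0  # steering strength along the temporal manifold (illustrative)

def steer(module, inputs, output):
    # GPT-2 decoder blocks return a tuple whose first element is the hidden states.
    h = output[0] if isinstance(output, tuple) else output
    h = h + alpha * era_direction.to(h.dtype)
    return (h,) + output[1:] if isinstance(output, tuple) else h

handle = model.transformer.h[6].register_forward_hook(steer)  # middle block
ids = tok("The news of the day is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```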

[NLP-142] NC-Bench: An LLM Benchmark for Evaluating Conversational Competence

【Quick Read】: This paper addresses the problem that evaluations of large language models' (LLMs) general conversational competence lean heavily on content relevance while neglecting the form and structure of conversation: existing benchmarks focus on task completion or topical relevance and cannot fully measure interaction quality in natural dialogue. The authors propose the Natural Conversation Benchmark (NC-Bench), which, grounded in the IBM Natural Conversation Framework (NCF), evaluates the form rather than the content of conversation via three sets: basic conversational competence, conversational competence under retrieval-augmented generation (RAG), and multi-turn dialogue management for complex requests. The key innovation is operationalizing fundamental principles of human conversation into quantifiable, extensible patterns of conversational behavior, yielding a lightweight, theory-grounded, and iterable evaluation framework that substantially improves the systematic measurement and improvement of LLMs' conversational abilities.

Link: https://arxiv.org/abs/2601.06426
Authors: Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala
Institutions: IBM Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, 2 tables

Abstract:The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs). Unlike prior benchmarks that focus on the content of model behavior, NC-Bench focuses on the form and structure of natural conversation. Grounded in the IBM Natural Conversation Framework (NCF), NC-Bench comprises three distinct sets. The Basic Conversation Competence set evaluates fundamental sequence management practices, such as answering inquiries, repairing responses, and closing conversational pairs. The RAG set applies the same sequence management patterns as the first set but incorporates retrieval-augmented generation (RAG). The Complex Request set extends the evaluation to complex requests involving more intricate sequence management patterns. Each benchmark tests a model's ability to produce contextually appropriate conversational actions in response to characteristic interaction patterns. Initial evaluations across 6 open-source models and 14 interaction patterns show that models perform well on basic answering tasks, struggle more with repair tasks (especially repeat), have mixed performance on closing sequences, and find complex multi-turn requests most challenging, with Qwen models excelling on the Basic set and Granite models on the RAG set and the Complex Request set. By operationalizing fundamental principles of human conversation, NC-Bench provides a lightweight, extensible, and theory-grounded framework for assessing and improving the conversational abilities of LLMs beyond topical or task-specific benchmarks.

[NLP-143] Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model? AACL2025

【Quick Read】: This paper asks how multimodal capability can be added to an existing text-only large language model (LLM) without significantly increasing model complexity. The core question is whether a text-only LLM can recognize its own informational needs and, through a feedback mechanism, optimize a vision-language model's (VLM) outputs to better match its reasoning preferences. The key to the solution is an LLM-driven feedback mechanism in which the LLM evaluates the VLM's multimodal descriptions and provides preference feedback, guiding the VLM to adapt its text generation to the LLM's needs. Experiments show the approach improves VLM scene descriptions by up to 13% absolute accuracy over the baseline multimodal approach, and a human study validates the AI feedback with a 64.6% preference alignment rate between the LLM's choices and human judgments.

Link: https://arxiv.org/abs/2601.06424
Authors: Sazia Tabasum Mim, Jack Morris, Manish Dhakal, Yanming Xiu, Maria Gorlatova, Yi Ding
Institutions: Georgia State University; Duke University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to IJCNLP-AACL 2025 Findings

Abstract:To explore a more scalable path for adding multimodal capabilities to existing LLMs, this paper addresses a fundamental question: Can a unimodal LLM, relying solely on text, reason about its own informational needs and provide effective feedback to optimize a multimodal model? To answer this, we propose a method that enables a language agent to give feedback to a vision-language model (VLM) to adapt text generation to the agent's preferences. Our results from different experiments affirm this hypothesis, showing that LLM preference feedback significantly enhances VLM descriptions. Using our proposed method, we find that the VLM can generate multimodal scene descriptions to help the LLM better understand multimodal context, leading to improvements of up to 13% in absolute accuracy compared to the baseline multimodal approach. Furthermore, a human study validated our AI-driven feedback, showing a 64.6% preference alignment rate between the LLM's choices and human judgments. Extensive experiments provide insights on how and why the method works and its limitations.

[NLP-144] Structured Episodic Event Memory

【Quick Read】: This paper addresses the fragmented retrieval caused by current memory mechanisms in large language models (LLMs) that rely on static retrieval-augmented generation (RAG), which misses structural dependencies and limits complex reasoning; for autonomous agents, such passive, flat architectures lack the cognitive organization needed to model the dynamic, associative nature of long-term interaction. The key to the solution is Structured Episodic Event Memory (SEEM), a hierarchical framework that combines a graph memory layer for relational facts with a dynamic episodic memory layer for narrative progression. Grounded in cognitive frame theory, SEEM converts interaction streams into structured Episodic Event Frames (EEFs) anchored by precise provenance pointers, and introduces an agentic associative fusion mechanism together with a Reverse Provenance Expansion (RPE) strategy to reconstruct coherent narrative contexts from fragmented evidence, markedly improving agents' narrative coherence and logical consistency over long horizons.

Link: https://arxiv.org/abs/2601.06411
Authors: Zhengxuan Lu, Dongfang Li, Yukun Shi, Beilun Wang, Longyue Wang, Baotian Hu
Institutions: Southeast University; Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute; Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Current approaches to memory in Large Language Models (LLMs) predominantly rely on static Retrieval-Augmented Generation (RAG), which often results in scattered retrieval and fails to capture the structural dependencies required for complex reasoning. For autonomous agents, these passive and flat architectures lack the cognitive organization necessary to model the dynamic and associative nature of long-term interaction. To address this, we propose Structured Episodic Event Memory (SEEM), a hierarchical framework that synergizes a graph memory layer for relational facts with a dynamic episodic memory layer for narrative progression. Grounded in cognitive frame theory, SEEM transforms interaction streams into structured Episodic Event Frames (EEFs) anchored by precise provenance pointers. Furthermore, we introduce an agentic associative fusion and Reverse Provenance Expansion (RPE) mechanism to reconstruct coherent narrative contexts from fragmented evidence. Experimental results on the LoCoMo and LongMemEval benchmarks demonstrate that SEEM significantly outperforms baselines, enabling agents to maintain superior narrative coherence and logical consistency.

[NLP-145] Value of Information: A Framework for Human-Agent Communication

【Quick Read】: This paper tackles a core dilemma for large language model (LLM) agents on real-world tasks: user requests are usually underspecified, yet the agent must weigh acting on incomplete information against interrupting the user for clarification. Existing approaches either rely on brittle confidence thresholds that need task-specific tuning or ignore the differing stakes of decisions. The key to the solution is a decision-theoretic framework that uses the Value of Information (VoI) to dynamically weigh the expected utility gain from asking a question against the cognitive cost imposed on the user, yielding an adaptive decision mechanism that requires no hyperparameter tuning. Across four domains (the 20 Questions game, medical diagnosis, flight booking, and e-commerce), the method matches or exceeds manually tuned baselines, with gains of up to 1.36 utility points in high-cost settings.

Link: https://arxiv.org/abs/2601.06407
Authors: Yijiang River Dong, Tiancheng Hu, Zheng Hui, Caiqi Zhang, Ivan Vulić, Andreea Bobu, Nigel Collier
Institutions: University of Cambridge; MIT
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Model (LLM) agents deployed for real-world tasks face a fundamental dilemma: user requests are underspecified, yet agents must decide whether to act on incomplete information or interrupt users for clarification. Existing approaches either rely on brittle confidence thresholds that require task-specific tuning, or fail to account for the varying stakes of different decisions. We introduce a decision-theoretic framework that resolves this trade-off through the Value of Information (VoI), enabling agents to dynamically weigh the expected utility gain from asking questions against the cognitive cost imposed on users. Our inference-time method requires no hyperparameter tuning and adapts seamlessly across contexts-from casual games to medical diagnosis. Experiments across four diverse domains (20 Questions, medical diagnosis, flight booking, and e-commerce) show that VoI consistently matches or exceeds the best manually-tuned baselines, achieving up to 1.36 utility points higher in high-cost settings. This work provides a parameter-free framework for adaptive agent communication that explicitly balances task risk, query ambiguity, and user effort.
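
A minimal sketch of the generic VoI decision rule the abstract describes, under the assumption of a discrete world-state/answer model with illustrative utilities and an ask_cost standing in for user effort; how the paper estimates these quantities at inference time is not shown:

```python
def value_of_information(action_utils, posterior_by_answer, ask_cost):
    """Generic VoI rule: ask iff the expected utility of acting *after*
    seeing the answer exceeds acting now by more than the asking cost.
    action_utils: {action: {world_state: utility}}
    posterior_by_answer: {answer: (P(answer), {state: P(state | answer)})}
    All inputs are illustrative stand-ins."""
    prior = {}  # current belief = mixture of answer-conditioned posteriors
    for p_ans, post in posterior_by_answer.values():
        for s, p in post.items():
            prior[s] = prior.get(s, 0.0) + p_ans * p

    def best_eu(belief):
        return max(sum(belief.get(s, 0.0) * u for s, u in au.items())
                   for au in action_utils.values())

    act_now = best_eu(prior)
    act_after = sum(p_ans * best_eu(post)
                    for p_ans, post in posterior_by_answer.values())
    voi = act_after - act_now
    return ("ask" if voi > ask_cost else "act"), voi

# Toy flight-booking example: asking resolves whether the user wants a
# refundable fare (hypothetical states, utilities, and probabilities).
utils = {"book_refundable": {"wants_ref": 1.0, "wants_cheap": 0.3},
         "book_cheap":      {"wants_ref": 0.1, "wants_cheap": 1.0}}
answers = {"yes": (0.5, {"wants_ref": 1.0}), "no": (0.5, {"wants_cheap": 1.0})}
print(value_of_information(utils, answers, ask_cost=0.2))  # ('ask', 0.35)
```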

[NLP-146] Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding

【Quick Read】: This paper addresses large language models' lack of flexibility in following system prompts: post-training instills a strong "helpful assistant" persona, making it hard for models to follow instructions that conflict with this default role. The core challenge is to control model behavior dynamically without retraining. The key to the solution is the notion of "system prompt strength": by contrasting the logits produced under the target and default system prompts and amplifying, by a scalar factor alpha, the behavioral signal unique to the target persona, the method provides precise modulation of model outputs. It yields substantial gains across five diverse benchmarks, confirming its effectiveness and generality.

Link: https://arxiv.org/abs/2601.06403
Authors: Yijiang River Dong, Tiancheng Hu, Zheng Hui, Nigel Collier
Institutions: University of Cambridge
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.
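
A minimal sketch of the contrastive combination described above, assuming the common contrastive-decoding form default + alpha * (target - default) over next-token logits (the paper's exact formula may differ) and an illustrative small chat model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative small chat model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def next_logits(system, user, generated):
    msgs = [{"role": "system", "content": system},
            {"role": "user", "content": user}]
    ids = tok.apply_chat_template(msgs, add_generation_prompt=True,
                                  return_tensors="pt")
    ids = torch.cat([ids, generated], dim=-1) if generated.numel() else ids
    return model(ids).logits[:, -1, :]

@torch.no_grad()
def steered_generate(target_sys, default_sys, user, alpha=1.5, steps=40):
    out = torch.empty((1, 0), dtype=torch.long)
    for _ in range(steps):
        lt = next_logits(target_sys, user, out)
        ld = next_logits(default_sys, user, out)
        logits = ld + alpha * (lt - ld)  # amplify the target-persona signal
        nxt = logits.argmax(dim=-1, keepdim=True)  # greedy decoding
        out = torch.cat([out, nxt], dim=-1)
    return tok.decode(out[0], skip_special_tokens=True)

print(steered_generate("You are a grumpy medieval scribe.",
                       "You are a helpful assistant.",
                       "Describe the weather today."))
```

Note that alpha = 1 reduces to ordinary decoding under the target prompt; values above 1 extrapolate past it, which is what makes the strength continuous.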

[NLP-147] BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment

【Quick Read】: This paper addresses a key shortcoming of existing financial evaluations of large language models (LLMs): current benchmarks mostly rely on simulated or general-purpose samples and are confined to static, offline scenarios, failing to reflect the authenticity and real-time responsiveness required by financial services, which creates a significant gap between benchmark performance and real operational efficacy. The key to the solution is BizFinBench.v2, the first large-scale benchmark built from authentic business data in the Chinese and U.S. equity markets with online evaluation. By clustering real user queries from financial platforms, it distills eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert-level QA pairs, enabling a fine-grained decomposition and precise assessment of LLMs' financial capabilities and providing a reliable basis for real-world deployment in finance.

Link: https://arxiv.org/abs/2601.06401
Authors: Xin Guo, Rongjunchen Zhang, Guilong Lu, Xuntao Guo, Shuai Jia, Zhi Yang, Liwen Zhang
Institutions: HiThink Research; Shanghai University of Finance and Economics
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models have undergone rapid evolution, emerging as a pivotal technology for intelligence in financial operations. However, existing benchmarks are often constrained by pitfalls such as reliance on simulated or general-purpose samples and a focus on singular, offline static scenarios. Consequently, they fail to align with the requirements for authenticity and real-time responsiveness in financial services, leading to a significant discrepancy between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large-scale evaluation benchmark grounded in authentic business data from both Chinese and U.S. equity markets, integrating online assessment. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert-level QA pairs. Experimental results demonstrate that ChatGPT-5 achieves a prominent 61.5% accuracy in main tasks, though a substantial gap relative to financial experts persists; in online tasks, DeepSeek-R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models within practical financial business contexts. BizFinBench.v2 transcends the limitations of current benchmarks, achieving a business-level deconstruction of LLM financial capabilities and providing a precise basis for evaluating efficacy in the widespread deployment of LLMs within the financial domain. The data and code are available at this https URL.

[NLP-148] MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan

【Quick Read】: This paper addresses the difficulty of identifying and exploiting cross-lingual textual parallels at scale in ancient Buddhist literature, which spans Sanskrit, Pāli, Buddhist Chinese, Tibetan, and more, where the sheer volume makes manual comparison impractical. The key to the solution is the MITRA framework, with three components: MITRA-parallel, a novel pipeline for multilingual parallel passage mining; a large parallel corpus of 1.74 million sentence pairs across Sanskrit, Chinese, and Tibetan; and the domain-pretrained language model Gemma 2 MITRA, further specialized as Gemma 2 MITRA-MT for machine translation and Gemma 2 MITRA-E for semantic embeddings, which reach state-of-the-art performance on translation into English and on semantic similarity, outperforming even much larger open-source models. The work releases datasets, models, and a benchmark to support both NLP research and philological studies of Buddhist and classical Asian literature.

Link: https://arxiv.org/abs/2601.06400
Authors: Sebastian Nehrdich, Kurt Keutzer
Institutions: Tohoku University; University of California, Berkeley; Berkeley Artificial Intelligence Research (BAIR)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Ancient Buddhist literature features frequent, yet often unannotated, textual parallels spread across diverse languages: Sanskrit, Pāli, Buddhist Chinese, Tibetan, and more. The scale of this material makes manual examination prohibitive. We present the MITRA framework, which consists of a novel pipeline for multilingual parallel passage mining, MITRA-parallel, a large-scale corpus of 1.74 million parallel sentence pairs between Sanskrit, Chinese, and Tibetan, and the development of the domain-specific pretrained language model Gemma 2 MITRA. We present Gemma 2 MITRA-MT, a version of this base model fine-tuned on machine translation tasks, reaching state-of-the-art performance for machine translation of these languages into English and outperforming even much larger open-source models. We also present Gemma 2 MITRA-E, a semantic embedding model that shows state-of-the-art performance on a novel, detailed semantic embedding benchmark. We make the parallel dataset, model weights, and semantic similarity benchmark openly available to aid both NLP research and philological studies in Buddhist and classical Asian literature.

[NLP-149] AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

【Quick Read】: This paper addresses the underperformance of open large language models (LLMs) on African languages, especially on demanding tasks such as mathematical reasoning. The key to the solution is a continued pre-training (CPT) strategy that adapts models to 20 African languages using mixtures including math, code, and synthetic translated data, producing the open multilingual AfriqueLLM suite. The study shows that CPT data composition is the primary driver of downstream gains: well-designed data mixtures markedly improve reasoning and long-context performance, strong base-model multilinguality does not reliably predict post-CPT outcomes, and robust architectures paired with task-aligned data provide the more dependable recipe.

Link: https://arxiv.org/abs/2601.06395
Authors: Hao Yu, Tianyi Xu, Michael A. Hedderich, Wassim Hamidouche, Syed Waqas Zamir, David Ifeoluwa Adelani
Institutions: Not listed
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on Huggingface (this https URL).

[NLP-150] Talking to Extraordinary Objects: Folktales Offer Analogies for Interacting with Technology

【Quick Read】: This paper considers how to use speech and language effectively in human-technology interaction while avoiding the tendency to anthropomorphize technology. Current generative AI and voice interfaces are often given human traits, skewing users' understanding of the technology. The key idea is to draw on the narrative patterns of folktales, where non-human objects (talking animals, magical items, and the like) possess language without necessarily possessing human minds or feelings, showing that language capacity and intelligence need not be tied to humanness. These cultural analogies offer inspiration and insight for designing functional, de-anthropomorphized interaction with technology.

Link: https://arxiv.org/abs/2601.06372
Authors: Martha Larson
Institutions: Radboud University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Speech and language are valuable for interacting with technology. It would be ideal to be able to decouple their use from anthropomorphization, which has recently met an important moment of reckoning. In the world of folktales, language is everywhere and talking to extraordinary objects is not unusual. This overview presents examples of the analogies that folktales offer. Extraordinary objects in folktales are diverse and also memorable. Language capacity and intelligence are not always connected to humanness. Consideration of folktales can offer inspiration and insight for using speech and language for interacting with technology.

[NLP-151] Average shortest-path length in word-adjacency networks: Chinese versus English

【Quick Read】: This paper seeks a more accurate way to characterize and compare the topology of word networks built from Chinese and English literary works, with particular attention to how network properties scale as the networks grow. The key to the solution is to build word-adjacency networks that treat punctuation marks as nodes on a par with ordinary words, thereby preserving implicit structure such as semantic pauses, logical grouping, and emotional cues. Empirically, this choice markedly improves the consistency of cross-language comparison: with punctuation included, the average shortest path length L(N) of Chinese and English networks shows similar asymptotic behavior as networks grow, whereas ignoring punctuation makes L(N) sizably larger for Chinese, underscoring the role of punctuation in maintaining network connectivity and semantic integrity.

Link: https://arxiv.org/abs/2601.06361
Authors: Jakub Dec, Michał Dolina, Stanisław Drożdż, Jarosław Kwapień, Jin Liu, Tomasz Stanisz
Institutions: Cracow University of Technology; Institute of Nuclear Physics, Polish Academy of Sciences; Georgia Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Complex networks provide powerful tools for analyzing and understanding the intricate structures present in various systems, including natural language. Here, we analyze topology of growing word-adjacency networks constructed from Chinese and English literary works written in different periods. Unconventionally, instead of considering dictionary words only, we also include punctuation marks as if they were ordinary words. Our approach is based on two arguments: (1) punctuation carries genuine information related to emotional state, allows for logical grouping of content, provides a pause in reading, and facilitates understanding by avoiding ambiguity, and (2) our previous works have shown that punctuation marks behave like words in a Zipfian analysis and, if considered together with regular words, can improve authorship attribution in stylometric studies. We focus on a functional dependence of the average shortest path length L(N) on a network size N for different epochs and individual novels in their original language as well as for translations of selected novels into the other language. We approximate the empirical results with a growing network model and obtain satisfactory agreement between the two. We also observe that L(N) behaves asymptotically similar for both languages if punctuation marks are included but becomes sizably larger for Chinese if punctuation marks are neglected.
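
A minimal sketch of building a word-adjacency network with punctuation marks as ordinary nodes and measuring the average shortest path length, assuming an undirected network over adjacent tokens; the paper's exact tokenization and network conventions may differ:

```python
import re
import networkx as nx

def word_adjacency_network(text: str, with_punct: bool = True) -> nx.Graph:
    """Undirected word-adjacency network; punctuation marks become
    ordinary nodes when with_punct is True."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if not with_punct:
        tokens = [t for t in tokens if re.match(r"\w", t)]
    g = nx.Graph()
    g.add_edges_from(zip(tokens, tokens[1:]))  # link consecutive tokens
    return g

text = "It was the best of times, it was the worst of times."
for flag in (True, False):
    g = word_adjacency_network(text, with_punct=flag)
    gc = g.subgraph(max(nx.connected_components(g), key=len))
    print(flag, g.number_of_nodes(),
          nx.average_shortest_path_length(gc))  # L on the giant component
```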

[NLP-152] Monkey Jump: MoE-Style PEFT for Efficient Multi-Task Learning

【Quick Read】: This paper addresses the extra memory and training cost incurred by mixture-of-experts (MoE)-style parameter-efficient fine-tuning, whose added trainable routers and expert parameters undercut the core advantage of parameter efficiency. The key to the solution is Monkey Jump, which adds no new adapters or learnable parameters: it treats the adapters already present in each Transformer block (query, key, value, up, and down projections, etc.) as implicit experts and routes tokens among them with a gradient-free, parameter-free k-means assignment whose cluster centers are updated by exponential moving average, achieving per-token specialization without extra parameters, increasing expressivity, and outperforming shared adapters.

Link: https://arxiv.org/abs/2601.06356
Authors: Nusrat Jahan Prottasha, Md Kowsher, Chun-Nam Yu, Chen Chen, Ozlem Garibay
Institutions: UCF; Nokia Bell Labs
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mixture-of-experts variants of parameter-efficient fine-tuning enable per-token specialization, but they introduce additional trainable routers and expert parameters, increasing memory usage and training cost. This undermines the core goal of parameter-efficient fine-tuning. We propose Monkey Jump, a method that brings mixture-of-experts-style specialization to parameter-efficient fine-tuning without introducing extra trainable parameters for experts or routers. Instead of adding new adapters as experts, Monkey Jump treats the adapters already present in each Transformer block (such as query, key, value, up, and down projections) as implicit experts and routes tokens among them. Routing is performed using k-means clustering with exponentially moving averaged cluster centers, requiring no gradients and no learned parameters. We theoretically show that token-wise routing increases expressivity and can outperform shared adapters by avoiding cancellation effects. Across multi-task experiments covering 14 text, 14 image, and 19 video benchmarks, Monkey Jump achieves competitive performance with mixture-of-experts-based parameter-efficient fine-tuning methods while using 7 to 29 times fewer trainable parameters, up to 48 percent lower memory consumption, and 1.5 to 2 times faster training. Monkey Jump is architecture-agnostic and can be applied to any adapter-based parameter-efficient fine-tuning method.
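
A minimal sketch of the gradient-free, parameter-free routing step, assuming k-means assignment with exponentially moving averaged centers over token representations; the expert indices stand for the block's existing adapters, and all sizes are illustrative:

```python
import torch

class EMARouter:
    """Gradient-free token router: k-means-style assignment to cluster
    centers updated with an exponential moving average. The k clusters
    index the adapters already present in the block (e.g. q/k/v/up/down
    projections), so no router parameters are learned."""

    def __init__(self, dim: int, n_experts: int, momentum: float = 0.99):
        self.centers = torch.randn(n_experts, dim)
        self.momentum = momentum

    @torch.no_grad()
    def __call__(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_tokens, dim) -> expert index per token
        dists = torch.cdist(tokens, self.centers)    # (n_tokens, k)
        assign = dists.argmin(dim=-1)
        for k in range(self.centers.size(0)):        # EMA center update
            picked = tokens[assign == k]
            if picked.numel():
                self.centers[k] = (self.momentum * self.centers[k]
                                   + (1 - self.momentum) * picked.mean(0))
        return assign

router = EMARouter(dim=16, n_experts=5)
tokens = torch.randn(32, 16)
print(router(tokens)[:10])  # which implicit expert each token visits
```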

[NLP-153] What Matters When Building Universal Multilingual Named Entity Recognition Models?

【Quick Read】: This paper addresses the lack of systematic validation of key design decisions in universal multilingual named entity recognition (UMNER): model architectures, training objectives, and data composition are usually evaluated jointly rather than in isolation, making it hard to tell which choices actually improve performance. The key to the solution is a large-scale ablation study that examines architectures, Transformer backbones, training objectives, and multilingual data mixtures across a wide range of languages, and, building on these findings, an efficient high-performing UMNER model, Otter, which supports over 100 languages, improves F1 over GLiNER-x-base by 5.3 points, and is competitive with, yet substantially more efficient than, large generative models such as Qwen3-32B.

Link: https://arxiv.org/abs/2601.06347
Authors: Jonas Golde, Patrick Haller, Alan Akbik
Institutions: Humboldt Universität zu Berlin
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent progress in universal multilingual named entity recognition (NER) has been driven by advances in multilingual transformer models and task-specific architectures, loss functions, and training datasets. Despite substantial prior work, we find that many critical design decisions for such models are made without systematic justification, with architectural components, training objectives, and data sources evaluated only in combination rather than in isolation. We argue that these decisions impede progress in the field by making it difficult to identify which choices improve model performance. In this work, we conduct extensive experiments around architectures, transformer backbones, training objectives, and data composition across a wide range of languages. Based on these insights, we introduce Otter, a universal multilingual NER model supporting over 100 languages. Otter achieves consistent improvements over strong multilingual NER baselines, outperforming GLiNER-x-base by 5.3pp in F1 and achieves competitive performance compared to large generative models such as Qwen3-32B, while being substantially more efficient. We release model checkpoints, training and evaluation code to facilitate reproducibility and future research.

[NLP-154] On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

【Quick Read】: This paper targets an important flaw in how generative spoken language models are currently evaluated: the prevailing "global token perplexity" metric directly transplants the text perplexity formulation, ignoring fundamental differences between the speech and text modalities and thereby underestimating models' speech characteristics. The key to the solution is a set of new likelihood- and generation-based evaluation methods that replace global token perplexity; these correlate more strongly with human mean opinion scores (MOS), more faithfully reflect generation quality, and reveal that the gap between the best spoken language models and the human topline is significantly smaller than previously thought.

Link: https://arxiv.org/abs/2601.06329
Authors: Jeff Chan-Jan Sju, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso
Institutions: Carnegie Mellon University; National Taiwan University; Toyota Technological Institute at Chicago; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
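
For reference, the "global token perplexity" under discussion is the standard text perplexity applied to the entire speech-token stream:

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\Big(-\frac{1}{N}\sum_{t=1}^{N}\log p_\theta\big(x_t \mid x_{<t}\big)\Big)
```

Applied globally, this weights every speech token equally, which is precisely the modality assumption the paper argues breaks down for speech.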

[NLP-155] Annotating Dimensions of Social Perception in Text: The First Sentence-Level Dataset of Warmth and Competence

【Quick Read】: This paper addresses the lack of text-level modeling of warmth and competence, two core dimensions of social perception, in NLP: existing work relies mainly on word-level lexicons that cannot capture how these constructs are expressed in context at the sentence and discourse level. The key to the solution is Warmth and Competence Sentences (WC-Sent), the first sentence-level annotated dataset, containing over 1,600 English sentence-target pairs labeled by trained annotators along three dimensions: trust and sociability (the components of warmth) and competence. Drawn from social media text expressing attitudes toward specific individuals or groups, the dataset provides a benchmark for assessing large language models' (LLMs) ability to recognize warmth and competence and advances research at the intersection of NLP and computational social science.

Link: https://arxiv.org/abs/2601.06316
Authors: Mutaz Ayesh, Saif M. Mohammad, Nedjma Ousidhoum
Institutions: Cardiff University; National Research Council Canada
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Warmth (W) (often further broken down into Trust (T) and Sociability (S)) and Competence (C) are central dimensions along which people evaluate individuals and social groups (Fiske, 2018). While these constructs are well established in social psychology, they are only starting to get attention in NLP research through word-level lexicons, which do not completely capture their contextual expression in larger text units and discourse. In this work, we introduce Warmth and Competence Sentences (WC-Sent), the first sentence-level dataset annotated for warmth and competence. The dataset includes over 1,600 English sentence–target pairs annotated along three dimensions: trust and sociability (components of warmth), and competence. The sentences in WC-Sent are from social media and often express attitudes and opinions about specific individuals or social groups (the targets of our annotations). We describe the data collection, annotation, and quality-control procedures in detail, and evaluate a range of large language models (LLMs) on their ability to identify trust, sociability, and competence in text. WC-Sent provides a new resource for analyzing warmth and competence in language and supports future research at the intersection of NLP and computational social science.

[NLP-156] A Rising Tide Lifts All Boats: MTQE Rewards for Idioms Improve General Translation Quality

【Quick Read】: This paper addresses neural machine translation systems' poor handling of non-compositional expressions such as idioms, proverbs, and metaphors, whose meanings cannot be derived from individual words and which carry both literal and figurative senses, posing a challenge to translation accuracy. The key to the solution is GRPO-style fine-tuning that uses machine translation quality estimation (MTQE) models as reward functions to teach models to translate idioms more accurately. Experiments on Chinese and Hindi idiom datasets show idiom translation improves by about 14 points, general non-idiomatic translation implicitly improves by about 8 points, and cross-lingual transfer (trained on one language, evaluated on another) improves by about 6 points, quantifying the non-compositional translation gap and charting a path toward LLMs with stronger cross-cultural and figurative understanding.

Link: https://arxiv.org/abs/2601.06307
Authors: Ishika Agarwal, Zhenlin He, Dhruva Patil, Dilek Hakkani-Tür
Institutions: UIUC
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Non-compositional expressions (e.g., idioms, proverbs, and metaphors) pose significant challenges for neural machine translation systems because their meanings cannot be derived from individual words alone. These expressions encode rich, cultural meaning, and have both figurative and literal meanings, making accurate translation difficult. Because models are fairly good at translating compositional text, we investigate GRPO-style fine-tuning using Machine Translation Quality Estimation (MTQE) models as reward functions to train models to better translate idioms. Using Chinese and Hindi idiom datasets, we find that idiom translation abilities improve by ~14 points, general, non-idiomatic translation implicitly improves by ~8 points, and cross-lingual translation abilities (trained on one language, evaluated on another) improves by ~6 points. Overall, our work quantifies the non-compositional translation gap and offers insights for developing LLMs with stronger cross-cultural and figurative language understanding.
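
A minimal sketch of turning an MTQE score into GRPO-style group-relative advantages, with qe_score as a hypothetical stand-in for a reference-free quality-estimation model (e.g. a COMET-style QE scorer); the policy-gradient update itself is omitted:

```python
import torch

def qe_score(source: str, hypothesis: str) -> float:
    """Hypothetical stand-in for a reference-free MT quality-estimation
    model scoring (source, translation) pairs in [0, 1]."""
    return torch.rand(()).item()

def grpo_advantages(source: str, samples: list[str]) -> torch.Tensor:
    """GRPO-style group-relative advantages: score a group of sampled
    translations of the same source with the QE reward, then normalize
    within the group (zero mean, unit std)."""
    rewards = torch.tensor([qe_score(source, s) for s in samples])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

src = "他对这件事睁一只眼闭一只眼。"  # idiomatic source sentence
group = ["He turns a blind eye to it.",
         "He opens one eye and closes one eye about this matter.",
         "He ignores the matter on purpose."]
print(grpo_advantages(src, group))  # positive -> reinforce that sample
```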

[NLP-157] SyntaxMind at BLP-2025 Task 1: Leveraging Attention Fusion of CNN and GRU for Hate Speech Detection

【Quick Read】: This paper addresses hate speech detection in Bangla text, building and tuning models for the two subtasks of BLP-2025 Task 1 (Subtask 1A and Subtask 1B). The key to the solution is a unified architecture that fuses BanglaBERT's contextual representations with multiple parallel processing branches (GRU- and CNN-based) to capture both long-range dependencies and local linguistic cues, followed by attention and dense layers for final classification, achieving robust, high performance across both subtasks.

Link: https://arxiv.org/abs/2601.06306
Authors: Md. Shihab Uddin Riad
Institutions: International Islamic University Chittagong
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper describes our system used in the BLP-2025 Task 1: Hate Speech Detection. We participated in Subtask 1A and Subtask 1B, addressing hate speech classification in Bangla text. Our approach employs a unified architecture that integrates BanglaBERT embeddings with multiple parallel processing branches based on GRUs and CNNs, followed by attention and dense layers for final classification. The model is designed to capture both contextual semantics and local linguistic cues, enabling robust performance across subtasks. The proposed system demonstrated high competitiveness, obtaining 0.7345 micro F1-Score (2nd place) in Subtask 1A and 0.7317 micro F1-Score (5th place) in Subtask 1B.
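
A minimal sketch of the described fusion, assuming upstream BanglaBERT embeddings and one plausible way to combine parallel GRU and CNN branches with attention pooling; sizes and the exact fusion are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Contextual embeddings feed parallel GRU and CNN branches whose
    outputs are fused and attention-pooled before dense classification.
    The encoder (BanglaBERT) is assumed to be applied upstream."""

    def __init__(self, dim=768, hidden=128, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.cnn = nn.Conv1d(dim, 2 * hidden, kernel_size=3, padding=1)
        self.attn = nn.Linear(2 * hidden, 1)    # additive attention scores
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, emb):                     # emb: (batch, seq, dim)
        g, _ = self.gru(emb)                    # (batch, seq, 2*hidden)
        c = self.cnn(emb.transpose(1, 2)).transpose(1, 2)
        h = g + torch.relu(c)                   # parallel-branch fusion
        w = torch.softmax(self.attn(h), dim=1)  # (batch, seq, 1)
        return self.head((w * h).sum(dim=1))    # attention-pooled logits

logits = FusionClassifier()(torch.randn(4, 32, 768))
print(logits.shape)  # torch.Size([4, 2])
```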

[NLP-158] Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

【Quick Read】: This paper addresses the failure of low-rank adaptation (LoRA) to remove backdoor behaviors when used for parameter-efficient fine-tuning of poisoned pretrained models on clean data. The analysis shows that the failure does not stem from low rank itself but from inadequate spectral properties: LoRA updates have singular values far below those of the pretrained weights, hence insufficient spectral strength, and their directions align poorly with clean-task directions while retaining overlap with trigger-sensitive subspaces. The key to the solution is Regularized Low-Rank Adaptation (RoRA), which uses clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling to boost spectral strength and correct alignment, effectively suppressing backdoor activations while preserving clean-task performance.

Link: https://arxiv.org/abs/2601.06305
Authors: Hoang-Chau Luong, Lingwei Chen
Institutions: Golisano College of Computing and Information Sciences; Rochester Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA’s vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized Low-Rank Adaptation (RoRA), which improves forgetting by increasing spectral strength and correcting alignment through clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling. Experiments across multiple NLP benchmarks and attack settings show that RoRA substantially reduces attack success rates while maintaining clean accuracy.
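
A minimal sketch of the post-training spectral rescaling idea, assuming a rule that boosts the LoRA update's singular values toward a target fraction of the pretrained weight's spectral norm; the paper's actual rescaling rule may differ:

```python
import torch

@torch.no_grad()
def spectral_rescale(B, A, W, target_ratio=0.5):
    """Illustrative post-training spectral rescaling of a LoRA update
    dW = B @ A: boost its singular values so the update's spectral norm
    reaches target_ratio of the pretrained weight's spectral norm."""
    dW = B @ A
    u, s, vh = torch.linalg.svd(dW, full_matrices=False)
    w_norm = torch.linalg.matrix_norm(W, ord=2)       # largest singular value
    scale = (target_ratio * w_norm) / (s[0] + 1e-8)
    if scale > 1:                                     # only strengthen
        s = s * scale
    return u @ torch.diag(s) @ vh                     # rescaled dW

d, r = 64, 8
W = torch.randn(d, d)
B, A = torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01
dW = spectral_rescale(B, A, W)
print(torch.linalg.matrix_norm(dW, 2) / torch.linalg.matrix_norm(W, 2))
```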

[NLP-159] AMEND: Benchmarking Eligibility Criteria Amendments in Clinical Trials

【Quick Read】: This paper addresses the delays, rising costs, and administrative burden caused by frequent amendments to eligibility criteria in clinical trial protocols. The core of the solution is a new NLP task, eligibility criteria amendment prediction, supported by the AMEND++ benchmark suite (comprising the AMEND and AMEND_LLM datasets) and a pretraining strategy called Change-Aware Masked Language Modeling (CAMLM), which exploits historical version edits for amendment-aware representation learning, consistently improving amendment prediction across multiple baselines and enabling more robust, cost-effective clinical trial design.

Link: https://arxiv.org/abs/2601.06300
Authors: Trisha Das, Mandis Beigi, Jacob Aptekar, Jimeng Sun
Institutions: University of Illinois Urbana-Champaign; Medidata Solutions
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Clinical trial amendments frequently introduce delays, increased costs, and administrative burden, with eligibility criteria being the most commonly amended component. We introduce eligibility criteria amendment prediction, a novel NLP task that aims to forecast whether the eligibility criteria of an initial trial protocol will undergo future amendments. To support this task, we release AMEND++, a benchmark suite comprising two datasets: AMEND, which captures eligibility-criteria version histories and amendment labels from public clinical trials, and AMEND_LLM, a refined subset curated using an LLM-based denoising pipeline to isolate substantive changes. We further propose Change-Aware Masked Language Modeling (CAMLM), a revision-aware pretraining strategy that leverages historical edits to learn amendment-sensitive representations. Experiments across diverse baselines show that CAMLM consistently improves amendment prediction, enabling more robust and cost-effective clinical trial design.

[NLP-160] How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning?

【Quick Read】: This paper addresses the long-standing challenge of inferring complete molecular structures directly from tandem mass spectrometry (MS/MS) data, where the chemical interpretation abilities of current large language models (LLMs) remain unclear. The key to the solution is a chain-of-thought (CoT) prompting framework and a benchmark built on MassSpecGym that formalize expert chemists' reasoning steps, such as double bond equivalent (DBE) analysis, neutral loss identification, and fragment assembly, into structured prompts, and use them to evaluate several frontier LLMs (Claude-3.5-Sonnet, GPT-4o-mini, and the Llama-3 series) in a zero-shot setting. The results provide a systematic quantification of chemical plausibility in structure prediction: while models can produce syntactically valid and partially plausible structures, they cannot yet achieve chemical accuracy or reliably link reasoning to correct predictions, laying a foundation for combining domain knowledge with reinforcement learning toward chemically grounded AI reasoning.

Link: https://arxiv.org/abs/2601.06289
Authors: Yufeng Wang, Lu Wei, Lin Liu, Hao Xu, Haibin Ling
Institutions: Stony Brook University; Stanford University; Brigham and Women's Hospital; Harvard Medical School
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Mass spectrometry (MS) is a powerful analytical technique for identifying small molecules, yet determining complete molecular structures directly from tandem mass spectra (MS/MS) remains a long-standing challenge due to complex fragmentation patterns and the vast diversity of chemical space. Recent progress in large language models (LLMs) has shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. In this work, we introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. We formalize expert chemists’ reasoning steps-such as double bond equivalent (DBE) analysis, neutral loss identification, and fragment assembly-into structured prompts and assess multiple state-of-the-art LLMs (Claude-3.5-Sonnet, GPT-4o-mini, and Llama-3 series) in a zero-shot setting using the MassSpecGym dataset. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions. These findings highlight both the interpretive potential and the current limitations of LLM-based reasoning for molecular elucidation, providing a foundation for future work that combines domain knowledge and reinforcement learning to achieve chemically grounded AI reasoning.
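
One of the formalized reasoning steps, DBE analysis, rests on the standard degree-of-unsaturation formula, sketched below (the prompt wording itself is the paper's, not reproduced here):

```python
def dbe(c: int, h: int, n: int = 0, x: int = 0) -> float:
    """Degree of unsaturation (double bond equivalent) for a formula
    C_c H_h N_n O_o X_x: DBE = (2c + 2 + n - h - x) / 2.
    Oxygen and sulfur do not change the count; halogens count like H."""
    return (2 * c + 2 + n - h - x) / 2

print(dbe(8, 10, n=4))  # caffeine, C8H10N4O2 -> 6.0 (2 rings + 4 double bonds)
```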

[NLP-161] Amory: Building Coherent Narrative-Driven Agent Memory through Agent ic Reasoning

【Quick Read】: This paper addresses the computational scalability challenge of long-term conversational agents: repeatedly processing entire conversation histories becomes prohibitive as interactions extend over time. Existing approaches rely on memory frameworks that fragment conversations into isolated embeddings or graphs and retrieve them RAG-style, but they treat memory formation shallowly and fail to capture the subtlety and coherence of human memory. The key to the solution is Amory, a working memory framework that actively builds structured memory representations offline through enhanced agentic reasoning: it organizes conversational fragments into episodic narratives, consolidates memories with a momentum-aware mechanism, and semanticizes peripheral facts into semantic memory; at retrieval time it performs coherence-driven reasoning over narrative structures, substantially improving response quality and memory coverage while cutting response time by 50% and approaching full-context reasoning performance.

Link: https://arxiv.org/abs/2601.06282
Authors: Yue Zhou, Xiaobo Guo, Belhassen Bayar, Srinivasan H. Sengamedu
Institutions: University of Illinois Chicago; Amazon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-term conversational agents face a fundamental scalability challenge as interactions extend over time: repeatedly processing entire conversation histories becomes computationally prohibitive. Current approaches attempt to solve this through memory frameworks that predominantly fragment conversations into isolated embeddings or graph representations and retrieve relevant ones in a RAG style. While computationally efficient, these methods often treat memory formation minimally and fail to capture the subtlety and coherence of human memory. We introduce Amory, a working memory framework that actively constructs structured memory representations through enhancing agentic reasoning during offline time. Amory organizes conversational fragments into episodic narratives, consolidates memories with momentum, and semanticizes peripheral facts into semantic memory. At retrieval time, the system employs coherence-driven reasoning over narrative structures. Evaluated on the LOCOMO benchmark for long-term reasoning, Amory achieves considerable improvements over previous state-of-the-art, with performance comparable to full context reasoning while reducing response time by 50%. Analysis shows that momentum-aware consolidation significantly enhances response quality, while coherence-driven retrieval provides superior memory coverage compared to embedding-based approaches.

[NLP-162] SPINAL – Scaling-law and Preference Integration in Neural Alignment Layers

【Quick Read】: This paper addresses the poorly understood geometric changes that direct preference optimization (DPO) induces in the internal representation space when aligning large language models, which limits auditing, checkpoint comparison, and failure prediction. The key to the solution is SPINAL (Scaling-law and Preference Integration in Neural Alignment Layers), a diagnostic that traces structural change layer by layer: each checkpoint is encoded as a depth trace of (layer index, contraction score, transport score), where the contraction score measures how fast a layer's spectral tail decays, reflecting compression into fewer effective directions, and the transport score measures the shift in token distributions between adjacent layers, reflecting the smoothness of the path through representation space. Experiments show aligned models exhibit markedly stronger contraction and lower transport in the final decoder blocks (typically layers 21-30), revealing that alignment is geometrically localized: preference gradients act mainly on the last few layers, where policy quality is stabilized and concentrated.

Link: https://arxiv.org/abs/2601.06238
Authors: Arion Das, Partha Pratim Saha, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das
Institutions: Not listed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Direct Preference Optimization (DPO) is a principled, scalable alternative to RLHF for aligning large language models from pairwise preferences, but its internal geometric footprint remains undercharacterized, limiting audits, checkpoint comparisons, and failure prediction. We introduce SPINAL (Scaling-law and Preference Integration in Neural Alignment Layers), a diagnostic that measures how alignment reshapes representations across depth by tracing localized structural change layer by layer. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks (often layers 21-30), where preference gradients most directly affect the next-token distribution. SPINAL encodes each checkpoint as a depth trace over (layer index, contraction score, transport score). The contraction score summarizes how quickly the tail of a layer’s spectrum decays (how fast small modes vanish); higher values indicate stronger contraction into fewer effective directions. The transport score summarizes how much the token distribution shifts between adjacent layers using a bounded overlap measure; lower values indicate shorter, smoother steps through representation space. Aligned checkpoints show a late-layer ramp-up in contraction and a smooth reduction in transport, consistent with tightened and stabilized policy mass, while unaligned models trace higher-curvature, more entropic, and geometrically incoherent depth paths. Overall, alignment is geometrically localized: the final layers encode the dominant preference-induced corrections. SPINAL turns this localization into a practical audit signal, quantifying where alignment concentrates, how strongly it manifests, and when it begins to destabilize during training.
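
A heavily hedged sketch of the two per-layer quantities, with concrete choices assumed where the abstract is informal: tail spectral mass for the contraction score and one bounded overlap measure (the Bhattacharyya coefficient) for the transport score; SPINAL's actual definitions may differ:

```python
import torch

def contraction_score(h: torch.Tensor, tail: int = 10) -> float:
    """How quickly small spectral modes vanish: one minus the share of
    spectral energy in the tail singular values (assumed definition)."""
    s = torch.linalg.svdvals(h)          # h: (tokens, hidden)
    return float(1 - s[-tail:].sum() / s.sum())

def transport_score(p: torch.Tensor, q: torch.Tensor) -> float:
    """Shift between adjacent layers' token distributions via a bounded
    overlap measure; here 1 - Bhattacharyya coefficient (assumed)."""
    return float(1 - (p.sqrt() * q.sqrt()).sum())

h = torch.randn(128, 64)                         # stand-in layer activations
p = torch.softmax(torch.randn(50), 0)            # stand-in token distributions
q = torch.softmax(torch.randn(50), 0)
print(contraction_score(h), transport_score(p, q))
```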

[NLP-163] Classroom AI: Large Language Models as Grade-Specific Teachers

【Quick Read】: This paper addresses large language models' (LLMs) inability to generate grade-appropriate content for students at different educational levels, i.e., to adapt explanation complexity to students' comprehension, limiting their use in personalized teaching. The key to the solution is a fine-tuning framework that integrates seven established readability metrics through a clustering method to build grade-specific datasets spanning six levels, enabling models to generate content matched to students' comprehension while remaining factually accurate. Evaluations with 208 human participants show a 35.64 percentage point gain in grade-level alignment over prompt-based methods without sacrificing accuracy, offering a viable path to differentiated, precisely targeted AI-assisted learning.

Link: https://arxiv.org/abs/2601.06225
Authors: Jio Oh, Steven Euijong Whang, James Evans, Jindong Wang
Institutions: KAIST; Microsoft Research Asia; University of Chicago; William & Mary
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) offer a promising solution to complement traditional teaching and address global teacher shortages that affect hundreds of millions of children, but they fail to provide grade-appropriate responses for students at different educational levels. We introduce a framework for finetuning LLMs to generate age-appropriate educational content across six grade levels, from lower elementary to adult education. Our framework successfully adapts explanations to match students’ comprehension capacities without sacrificing factual correctness. This approach integrates seven established readability metrics through a clustering method and builds a comprehensive dataset for grade-specific content generation. Evaluations across multiple datasets with 208 human participants demonstrate substantial improvements in grade-level alignment, achieving a 35.64 percentage point increase compared to prompt-based methods while maintaining response accuracy. AI-assisted learning tailored to different grade levels has the potential to advance educational engagement and equity.
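
A minimal sketch of pooling readability metrics via clustering, assuming seven common metrics from the textstat package as stand-ins (the paper does not list its seven here) and three toy texts with three clusters instead of the paper's six grade bands:

```python
import textstat
from sklearn.cluster import KMeans

# Seven common readability metrics (stand-ins for the paper's seven).
METRICS = [textstat.flesch_reading_ease, textstat.flesch_kincaid_grade,
           textstat.gunning_fog, textstat.smog_index,
           textstat.coleman_liau_index, textstat.automated_readability_index,
           textstat.dale_chall_readability_score]

def features(text: str) -> list[float]:
    """One feature vector per text, one dimension per metric."""
    return [m(text) for m in METRICS]

texts = ["The cat sat on the mat. It was warm.",
         "Photosynthesis converts light energy into chemical energy.",
         "Epistemological commitments constrain hermeneutic methodology."]
labels = KMeans(n_clusters=3, n_init=10).fit_predict([features(t) for t in texts])
print(labels)  # cluster id per text, used as a grade-band label
```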

[NLP-164] Manifold-based Sampling for In-Context Hallucination Detection in Large Language Models

【Quick Read】: This paper addresses hallucination in large language models (LLMs), i.e., the frequent generation of factually incorrect or unsupported content. Existing remedies such as decoding strategies, retrieval augmentation, and supervised fine-tuning help, but in-context learning (ICL) demonstration selection still relies on brittle surface-similarity heuristics. The key to the solution is MB-ICL, a manifold-based demonstration sampling framework that extracts latent representations from a frozen LLM and jointly models local manifold structure and class-aware prototype geometry, selecting demonstrations by their proximity to learned prototypes rather than lexical or embedding similarity alone. On the FEVER factual verification and HaluEval hallucination detection benchmarks, MB-ICL outperforms standard ICL selection baselines, with especially strong gains on dialogue and summarization tasks, and is more robust to temperature perturbations and model variation, providing a reliable, training-light hallucination detection approach that requires no changes to LLM parameters.

Link: https://arxiv.org/abs/2601.06196
Authors: Bodla Krishna Vamshi, Rohan Bhatnagar, Haizhao Yang
Institutions: University of Maryland, College Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) frequently generate factually incorrect or unsupported content, commonly referred to as hallucinations. Prior work has explored decoding strategies, retrieval augmentation, and supervised fine-tuning for hallucination detection, while recent studies show that in-context learning (ICL) can substantially influence factual reliability. However, existing ICL demonstration selection methods often rely on surface-level similarity heuristics and exhibit limited robustness across tasks and models. We propose MB-ICL, a manifold-based demonstration sampling framework for selecting in-context demonstrations that leverages latent representations extracted from frozen LLMs. By jointly modeling local manifold structure and class-aware prototype geometry, MB-ICL selects demonstrations based on their proximity to learned prototypes rather than lexical or embedding similarity alone. Across factual verification (FEVER) and hallucination detection (HaluEval) benchmarks, MB-ICL outperforms standard ICL selection baselines in the majority of evaluated settings, with particularly strong gains on dialogue and summarization tasks. The method remains robust under temperature perturbations and model variation, indicating improved stability compared to heuristic retrieval strategies. While lexical retrieval can remain competitive in certain question-answering regimes, our results demonstrate that manifold-based prototype selection provides a reliable and training light approach for hallucination detection without modifying LLM parameters, offering a principled direction for improved ICL demonstration selection.
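
A minimal sketch of the prototype side of the selection rule, assuming class prototypes are mean embeddings and demonstrations are picked by proximity to their class prototype; the local-manifold modeling the paper adds on top is omitted:

```python
import numpy as np

def select_demos(demo_embs, demo_labels, k_per_class=2):
    """Prototype-based demonstration selection (sketch): build one
    prototype per class from frozen-LLM embeddings and pick the
    demonstrations closest to their class prototype."""
    chosen = []
    for c in np.unique(demo_labels):
        idx = np.flatnonzero(demo_labels == c)
        proto = demo_embs[idx].mean(axis=0)            # class prototype
        d = np.linalg.norm(demo_embs[idx] - proto, axis=1)
        chosen.extend(idx[np.argsort(d)[:k_per_class]])
    return chosen

rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 32))       # stand-in latent features
labels = rng.integers(0, 2, size=100)   # e.g. supported vs hallucinated
print(select_demos(embs, labels))       # indices of selected demonstrations
```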

[NLP-165] Political Alignment in Large Language Models : A Multidimensional Audit of Psychometric Identity and Behavioral Bias

【Quick Read】: This paper addresses how to assess the political positioning and alignment behavior of large language models (LLMs) involved in social decision-making, so as to ensure safety and fairness. A sociotechnical audit of 26 prominent LLMs using multidimensional psychometric instruments (Political Compass, SapplyValues, 8 Values) and a large-scale news labeling task (N ≈ 27,000) finds models strongly clustered in the Libertarian-Left region (96.3% of the cohort), with alignment signals that are stable architectural traits rather than random noise (η² > 0.90). The key finding is that single-axis instruments such as the Political Compass have validity problems, notably conflating cultural conservatism with authoritarianism (r = -0.64), and that multidimensional auditing frameworks are therefore necessary to characterize the alignment behavior of deployed LLMs accurately.

Link: https://arxiv.org/abs/2601.06194
Authors: Adib Sakhawat, Tahsin Islam, Takia Farhin, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Institutions: Islamic University of Technology
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review, 16 pages, 3 figures, 16 tables

Abstract:As large language models (LLMs) are increasingly integrated into social decision-making, understanding their political positioning and alignment behavior is critical for safety and fairness. This study presents a sociotechnical audit of 26 prominent LLMs, triangulating their positions across three psychometric inventories (Political Compass, SapplyValues, 8 Values) and evaluating their performance on a large-scale news labeling task (N ≈ 27,000). Our results reveal a strong clustering of models in the Libertarian-Left region of the ideological space, encompassing 96.3% of the cohort. Alignment signals appear to be consistent architectural traits rather than stochastic noise (η² > 0.90); however, we identify substantial discrepancies in measurement validity. In particular, the Political Compass exhibits a strong negative correlation with cultural progressivism (r = -0.64) when compared against multi-axial instruments, suggesting a conflation of social conservatism with authoritarianism in this context. We further observe a significant divergence between open-weights and closed-source models, with the latter displaying markedly higher cultural progressivism scores (p < 10^-25). In downstream media analysis, models exhibit a systematic "center-shift," frequently categorizing neutral articles as left-leaning, alongside an asymmetric detection capability in which "Far Left" content is identified with greater accuracy (19.2%) than "Far Right" content (2.0%). These findings suggest that single-axis evaluations are insufficient and that multidimensional auditing frameworks are necessary to characterize alignment behavior in deployed LLMs. Our code and data will be made public.

[NLP-166] MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications

【Quick Read】: This paper addresses the lack of frameworks for evaluating large language models (LLMs) in real healthcare applications: existing benchmarks measure only static knowledge and do not capture the dynamic, task-adaptive, scenario-based reasoning that clinical practice requires. The key to the solution is the Medical LLM Benchmark (MLB), which spans five dimensions: medical knowledge, safety and ethics, medical record understanding, smart services, and smart healthcare, and integrates 22 datasets (17 newly curated) from Chinese clinical sources, rigorously vetted with 300 licensed physicians. It further proposes a specialized judge model trained via supervised fine-tuning (SFT) on 19k expert annotations, reaching 92.1% accuracy, 94.37% F1, and a Cohen's Kappa of 81.3% for human-AI consistency, establishing a reproducible, expert-aligned evaluation protocol and a reliable basis for developing clinically viable LLMs.

Link: https://arxiv.org/abs/2601.06193
Authors: Qing He (1), Dongsheng Bi (1), Jianrong Lu (1 and 2), Minghui Yang (1), Zixiao Chen (1), Jiacheng Lu (1), Jing Chen (1), Nannan Du (1), Xiao Cu (1), Sijing Wu (3), Peng Xiang (4), Yinyin Hu (3), Yi Guo (3), Chunpu Li (3), Shaoyang Li (1), Zhuo Dong (1), Ming Jiang (1), Shuai Guo (1), Liyun Feng (1), Jin Peng (1), Jian Wang (1), Jinjie Gu (1), Junwei Liu (1 and 5) ((1) Ant Group, Hangzhou, China, (2) Zhejiang University, Hangzhou, China, (3) Health Information Center of Zhejiang Province, Hangzhou, China, (4) Department of AI and IT, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China, (5) School of Software and Microelectronics, Peking University, Beijing, China)
Institutions: Ant Group; Zhejiang University; Health Information Center of Zhejiang Province; Department of AI and IT, The Second Affiliated Hospital, School of Medicine, Zhejiang University; School of Software and Microelectronics, Peking University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, 5 tables

Abstract:The proliferation of Large Language Models (LLMs) presents transformative potential for healthcare, yet practical deployment is hindered by the absence of frameworks that assess real-world clinical utility. Existing benchmarks test static knowledge, failing to capture the dynamic, application-oriented capabilities required in clinical practice. To bridge this gap, we introduce a Medical LLM Benchmark MLB, a comprehensive benchmark evaluating LLMs on both foundational knowledge and scenario-based reasoning. MLB is structured around five core dimensions: Medical Knowledge (MedKQA), Safety and Ethics (MedSE), Medical Record Understanding (MedRU), Smart Services (SmartServ), and Smart Healthcare (SmartCare). The benchmark integrates 22 datasets (17 newly curated) from diverse Chinese clinical sources, covering 64 clinical specialties. Its design features a rigorous curation pipeline involving 300 licensed physicians. Besides, we provide a scalable evaluation methodology, centered on a specialized judge model trained via Supervised Fine-Tuning (SFT) on expert annotations. Our comprehensive evaluation of 10 leading models reveals a critical translational gap: while the top-ranked model, Kimi-K2-Instruct (77.3% accuracy overall), excels in structured tasks like information extraction (87.8% accuracy in MedRU), performance plummets in patient-facing scenarios (61.3% in SmartServ). Moreover, the exceptional safety score (90.6% in MedSE) of the much smaller Baichuan-M2-32B highlights that targeted training is equally critical. Our specialized judge model, trained via SFT on a 19k expert-annotated medical dataset, achieves 92.1% accuracy, an F1-score of 94.37%, and a Cohen’s Kappa of 81.3% for human-AI consistency, validating a reproducible and expert-aligned evaluation protocol. MLB thus provides a rigorous framework to guide the development of clinically viable LLMs.
zh
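As a quick illustration of the human-AI consistency metrics reported for the judge model (accuracy, F1, Cohen's Kappa), here is a minimal sketch with toy placeholder labels, assuming scikit-learn; this is our own example, not the authors' pipeline:

```python
# Toy check of agreement between expert verdicts and a judge model's verdicts.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

human = ["pass", "fail", "fail", "pass", "pass", "fail"]  # expert annotations
judge = ["pass", "fail", "pass", "pass", "pass", "fail"]  # judge-model outputs

print("accuracy:", accuracy_score(human, judge))
print("f1:", f1_score(human, judge, pos_label="fail"))
print("kappa:", cohen_kappa_score(human, judge))  # chance-corrected agreement
```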

[NLP-167] Attention Mechanism and Heuristic Approach: Context-Aware File Ranking Using Multi-Head Self-Attention

【速读】: This paper targets the difficulty of identifying and ranking impacted files in change impact analysis (CIA) for software repositories: existing deterministic approaches narrow the candidate space effectively, but their recall plateaus because features are treated as linearly independent contributors, ignoring contextual dependencies among metrics and the relational patterns of expert reasoning. The key to the solution is introducing Multi-Head Self-Attention as a post-deterministic scoring refinement module: by learning contextual weighting between features, it dynamically adjusts each file's importance within the candidate set, produces context-aware score corrections, and combines them additively with the original deterministic scores, preserving interpretability while emulating how experts reason over a change surface. Empirically, the approach lifts Top-50 recall substantially (from 62-65% to 78-82%) and improves expert-judged subjective accuracy markedly (from 6.5/10 to 8.6/10).

链接: https://arxiv.org/abs/2601.06185
作者: Pradeep Kumar Sharma,Shantanu Godbole,Sarada Prasad Jena,Hritvik Shrivastava
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The identification and ranking of impacted files within software repositories is a key challenge in change impact analysis. Existing deterministic approaches that combine heuristic signals, semantic similarity measures, and graph-based centrality metrics have demonstrated effectiveness in narrowing candidate search spaces, yet their recall plateaus. This limitation stems from the treatment of features as linearly independent contributors, ignoring contextual dependencies and relationships between metrics that characterize expert reasoning patterns. To address this limitation, we propose the application of Multi-Head Self-Attention as a post-deterministic scoring refinement mechanism. Our approach learns contextual weighting between features, dynamically adjusting importance levels per file based on relational behavior exhibited across candidate file sets. The attention mechanism produces context-aware adjustments that are additively combined with deterministic scores, preserving interpretability while enabling reasoning similar to that performed by experts when reviewing change surfaces. We focus on recall rather than precision, as false negatives (missing impacted files) are far more costly than false positives (irrelevant files that can be quickly dismissed during review). Empirical evaluation on 200 test cases demonstrates that the introduction of self-attention improves Top-50 recall from approximately 62-65% to between 78-82% depending on repository complexity and structure, achieving 80% recall at Top-50 files. Expert validation yields improvement from 6.5/10 to 8.6/10 in subjective accuracy alignment. This transformation bridges the reasoning capability gap between deterministic automation and expert judgment, improving recall in repository-aware effort estimation.
zh
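The additive refinement described above maps naturally onto a few lines of PyTorch. A minimal sketch, assuming illustrative feature dimensions and head count (the entry does not specify them):

```python
import torch
import torch.nn as nn

class AttentionScoreRefiner(nn.Module):
    """Adjust deterministic per-file scores via self-attention over the candidate set."""
    def __init__(self, feat_dim: int = 32, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.delta_head = nn.Linear(feat_dim, 1)  # context-aware adjustment per file

    def forward(self, features: torch.Tensor, det_scores: torch.Tensor) -> torch.Tensor:
        # features: (batch, n_files, feat_dim); det_scores: (batch, n_files)
        ctx, _ = self.attn(features, features, features)  # each file attends to the set
        delta = self.delta_head(ctx).squeeze(-1)          # (batch, n_files)
        return det_scores + delta                         # additive, interpretable

refiner = AttentionScoreRefiner()
feats = torch.randn(1, 50, 32)   # 50 candidate files, 32-dim features each
base = torch.rand(1, 50)         # deterministic scores (heuristics, centrality, etc.)
refined = refiner(feats, base)   # rank files by refined scores for Top-50 recall
print(refined.shape)             # torch.Size([1, 50])
```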

[NLP-168] MixDPO: Modeling Preference Strength for Pluralistic Alignment

【速读】: This paper addresses the inability of existing preference-based alignment objectives to capture variation in human preference strength: these methods implicitly assume all preferences are expressed with equal strength, whereas strength in fact varies considerably across individuals and contexts (heterogeneity), as established in behavioral economics and discrete choice theory. The key to the solution is Mixed Logit Direct Preference Optimization (MixDPO), which makes preference-strength heterogeneity explicit via learned strength distributions, improving how well the alignment objective fits diverse human judgments. Across multiple datasets, MixDPO improves aggregate alignment (e.g., +11.2 points on Pythia-2.8B) while preserving subgroup-level preferences, with the largest gains in settings with higher inferred preference heterogeneity.

链接: https://arxiv.org/abs/2601.06180
作者: Saki Imai,Pedram Heydari,Anthony Sicilia,Asteria Kaeberlein,Katherine Atwell,Malihe Alikhani
机构: Northeastern University (东北大学); Johns Hopkins University (约翰霍普金斯大学); West Virginia University (西弗吉尼亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Preference-based alignment objectives implicitly assume that all human preferences are expressed with equal strength. In practice, however, preference strength varies across individuals and contexts – a phenomenon established in behavioral economics and discrete choice theory. This mismatch limits the ability of existing objectives to faithfully capture heterogeneous human judgments. Inspired by this literature, we introduce Mixed Logit Direct Preference Optimization (MixDPO), a generalization of Direct Preference Optimization that models variation in preference strength. MixDPO enables alignment objectives to capture heterogeneity in how strongly preferences are expressed across training examples. We evaluate MixDPO on three preference datasets using two open-weight language models. Across datasets, MixDPO improves aggregate alignment performance (+11.2 points on Pythia-2.8B) while preserving subgroup level preferences, with the largest gains appearing in settings with higher inferred preference heterogeneity. MixDPO makes preference heterogeneity explicit through learned strength distributions. We release our code for reproducibility.
zh
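For intuition, the standard DPO objective and its mixed-logit generalization can be written as follows. The notation is ours, and the paper's exact parameterization of the strength distribution may differ; this follows the standard mixed-logit construction from discrete choice theory:

```latex
% Standard DPO: a Bradley-Terry preference with a single, fixed strength \beta.
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \log \sigma\!\big(\beta\, \Delta_\theta(x, y_w, y_l)\big),
\qquad
\Delta_\theta = \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
              - \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}

% Mixed-logit generalization: integrate over a learned strength distribution
% p_\phi(\beta), making heterogeneous preference strength explicit.
\mathcal{L}_{\mathrm{Mix}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \log \int \sigma\!\big(\beta\, \Delta_\theta(x, y_w, y_l)\big)\, p_\phi(\beta)\, d\beta
```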

[NLP-169] PromptPort: A Reliability Layer for Cross-Model Structured Extraction

【速读】: This paper addresses "format collapse" in production structured extraction with large language models (LLMs): the same prompt can yield syntactically inconsistent outputs across models or versions (malformed JSON, Markdown-wrapped text, etc.), so strict parsers reject extractions that are semantically correct. The key to the solution is PromptPort, a reliability layer with three components: (1) deterministic canonicalization to normalize output formats; (2) a lightweight verifier (DistilBERT) for selective semantic-level filtering; and (3) a safe-override policy that abstains explicitly when uncertain. The method markedly improves cross-model format consistency and semantic accuracy, approaching per-field oracle performance (F1 0.890 vs. 0.896), and can be deployed on unseen model families without modifying the base models.

链接: https://arxiv.org/abs/2601.06151
作者: Varun Kotte
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Structured extraction with LLMs fails in production not because models lack understanding, but because output formatting is unreliable across models and prompts. A prompt that returns clean JSON on GPT-4 may produce fenced, prose-wrapped, or malformed output on Llama, causing strict parsers to reject otherwise correct extractions. We formalize this as format collapse and introduce a dual-metric evaluation framework: ROS (strict parsing, measuring operational reliability) and CSS (post-canonicalization, measuring semantic capability). On a 37,346-example camera metadata benchmark across six model families, we find severe format collapse (for example, Gemma-2B: ROS 0.116 versus CSS 0.246) and large cross-model portability gaps (0.4 to 0.6 F1). We then present PromptPort, a reliability layer combining deterministic canonicalization with a lightweight verifier (DistilBERT) and a safe-override policy. PromptPort recovers format failures (plus 6 to 8 F1), adds verifier-driven semantic selection (plus 14 to 16 F1 beyond canonicalization), and approaches per-field oracle performance (0.890 versus 0.896 in zero-shot) without modifying base models. The method generalizes to held-out model families and provides explicit abstention when uncertain, enabling reliable structured extraction in production deployments.
zh
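A minimal sketch of the deterministic-canonicalization idea, assuming typical failure modes (Markdown fences, prose-wrapped JSON); it is not the authors' implementation, and the verifier and safe-override stages are omitted:

```python
import json
import re

def canonicalize(raw: str):
    """Return a parsed JSON object from messy LLM output, or None."""
    text = raw.strip()
    # Drop Markdown code fences such as ```json ... ```
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, flags=re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    # Fall back to the first {...} span in prose-wrapped output
    if not text.startswith("{"):
        brace = re.search(r"\{.*\}", text, flags=re.DOTALL)
        if brace:
            text = brace.group(0)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # a safe-override policy would abstain here

print(canonicalize('Sure! Here it is:\n```json\n{"iso": 200, "lens": "50mm"}\n```'))
# {'iso': 200, 'lens': '50mm'}
```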

[NLP-170] LLM Flow Processes for Text-Conditioned Regression

【速读】: This paper addresses the difficulty that meta-learning regression methods such as Neural (Diffusion) Processes have in incorporating expert prior knowledge and metadata. Large language models (LLMs), trained on vast corpora that include real-world regression datasets alongside their descriptions and metadata, already perform well on downstream regression tasks, yet still usually trail dedicated meta-learning methods. The paper proposes a general sampling method for a product-of-experts combining a diffusion or flow matching model with an "expert" defined by a binned probability density; concretely, it fuses neural diffusion processes with LLM token probabilities (which can carry textual knowledge), empirically exceeding the performance of either component alone in regression. The key is coupling binned-density modeling with LLM priors to strengthen the model's capacity to express complex structure and prior information.

链接: https://arxiv.org/abs/2601.06147
作者: Felix Biggs,Samuel Willis
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Meta-learning methods for regression like Neural (Diffusion) Processes achieve impressive results, but with these models it can be difficult to incorporate expert prior knowledge and information contained in metadata. Large Language Models (LLMs) are trained on giant corpora including varied real-world regression datasets alongside their descriptions and metadata, leading to impressive performance on a range of downstream tasks. Recent work has extended this to regression tasks and is able to leverage such prior knowledge and metadata, achieving surprisingly good performance, but this still rarely matches dedicated meta-learning methods. Here we introduce a general method for sampling from a product-of-experts of a diffusion or flow matching model and an "expert" with binned probability density; we apply this to combine neural diffusion processes with LLM token probabilities for regression (which may incorporate textual knowledge), exceeding the empirical performance of either alone.
zh
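In score form, sampling from such a product of experts amounts to adding the experts' log-density gradients. The sketch below uses our own notation; the paper's construction (in particular how the piecewise-constant binned density is smoothed so its gradient is usable) may differ:

```latex
% Product of experts: a flow/diffusion prior times a binned LLM density.
p(y \mid \text{context}) \;\propto\; p_{\mathrm{flow}}(y)\; p_{\mathrm{LLM}}(y)

% Guidance in score form: the experts' log-density gradients simply add.
\nabla_y \log p(y) \;=\; \nabla_y \log p_{\mathrm{flow}}(y)
                   \;+\; \nabla_y \log p_{\mathrm{LLM}}(y)
% Here p_LLM is derived from token probabilities over bins of the target range;
% being piecewise constant, it typically needs smoothing or interpolation before
% its gradient can steer the sampler.
```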

[NLP-171] Is Sanskrit the most token-efficient language? A quantitative study using GPT, Gemini and SentencePiece

【速读】: This paper investigates whether tokenizer design imposes a hidden token-count penalty on non-English languages, taking Sanskrit as the focal case. By quantifying the token-efficiency gap between Sanskrit, English, and Hindi under different tokenizers, the study shows that Sanskrit's morphology and grammar permit higher information density (more meaning per token), yet current mainstream tokenizers (including the latest GPT-4o and Gemini versions) still fail to fully capture this property. The key to the solution is a parallel multilingual corpus (a trilingual Bhagavad Gita dataset) evaluated with objective metrics such as token count and characters per token (token efficiency), providing empirical grounding for fairer, more efficient future tokenizer design and highlighting Sanskrit's potential as a highly compact language for reducing computational cost.

链接: https://arxiv.org/abs/2601.06142
作者: Anshul Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 4 figures. Code and dataset available at: this https URL

点击查看摘要

Abstract:Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita, which comprises three languages-Sanskrit, English, and Hindi along with transliteration of Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest generation tokenizers from Gemini and GPT. We use metrics of token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM baseline. English/Hindi translations of Sanskrit commentary resulted in an approximately 20x increase in token count. GPT o200k base (latest, used by GPT-4o) and Gemini (latest) reduce bias by a significant degree compared to GPT cl100k base (used until GPT-4), but still fail to fully capture Sanskrit’s compactness. This matters because there might be a penalty bias for non-English users, which inflates the token count. This research provides a foundation for improving future tokenizer design and shows the potential of Sanskrit for highly compact encoding, saving on cost while speeding up training and inference. The code and dataset are available at this https URL
zh
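The paper's efficiency metrics are straightforward to reproduce with the tokenizers it names. A small sketch using tiktoken, with illustrative verse strings standing in for the actual Bhagavad Gita dataset:

```python
import tiktoken

def efficiency(text: str, encoding: str) -> dict:
    enc = tiktoken.get_encoding(encoding)
    n_tokens = len(enc.encode(text))
    n_chars = len(text)
    return {
        "tokens": n_tokens,
        "chars_per_token": n_chars / n_tokens,  # higher = more token-efficient
        "tokens_per_char": n_tokens / n_chars,  # lower = cheaper to process
    }

sanskrit = "धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः"  # illustrative verse fragment
english = "Assembled on the holy field of Kurukshetra, eager to fight"
for enc_name in ("cl100k_base", "o200k_base"):  # GPT-4 vs. GPT-4o tokenizers
    print(enc_name, efficiency(sanskrit, enc_name), efficiency(english, enc_name))
```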

[NLP-172] An evaluation of LLM s for political bias in Western media: Israel-Hamas and Ukraine-Russia wars

【速读】: This paper examines whether Western media exhibit quantifiable political bias in major international conflicts, focusing on the distribution of left-wing, right-wing, and neutral stances. The key to the solution is using large language models (LLMs) to automatically analyze Guardian and BBC coverage of the Russia-Ukraine war and the Hamas-Israel conflict, comparing the outputs of different LLMs (BERT, Gemini, and DeepSeek) to identify shifts in media leaning and to reveal how the political worldviews embedded in the models themselves affect bias-detection results.

链接: https://arxiv.org/abs/2601.06132
作者: Rohitash Chandra,Haoyan Chen,Yaqing Zhang,Jiacheng Chen,Yuting Wu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Political bias in media plays a critical role in shaping public opinion, voter behaviour, and broader democratic discourse. Subjective opinions and political bias can be found in media sources, such as newspapers, depending on their funding mechanisms and alliances with political parties. Automating the detection of political biases in media content can limit biases in elections. The impact of large language models (LLMs) in politics and media studies is becoming prominent. In this study, we utilise LLMs to compare the left-wing, right-wing, and neutral political opinions expressed in the Guardian and BBC. We review newspaper reporting that includes significant events such as the Russia-Ukraine war and the Hamas-Israel conflict. We analyse the proportion for each opinion to find the bias under different LLMs, including BERT, Gemini, and DeepSeek. Our results show that after the outbreak of the wars, the political bias of Western media shifts towards the left-wing and each LLM gives a different result. DeepSeek consistently showed a stable Left-leaning tendency, while BERT and Gemini remained closer to the Centre. The BBC and The Guardian showed distinct reporting behaviours across the two conflicts. In the Russia-Ukraine war, both outlets maintained relatively stable positions; however, in the Israel-Hamas conflict, we identified larger political bias shifts, particularly in Guardian coverage, suggesting a more event-driven pattern of reporting bias. These variations suggest that LLMs are shaped not only by their training data and architecture, but also by underlying worldviews with associated political biases.
zh

[NLP-173] Structure-Aware Diversity Pursuit as an AI Safety Strategy against Homogenization

【速读】: This position paper addresses the harmful loss of diversity, termed homogenization, that arises when generative AI models reproduce and amplify the biases in their training data. The core question is how to mitigate the diversity degradation driven by mode collapse while safeguarding model performance, thereby improving AI safety. The key to the solution is the proposed strategy of "xeno-reproduction," which, for auto-regressive LLMs, is formalized as a structure-aware diversity pursuit: an architectural-level effort to actively promote diversity in the output distribution and thus counteract homogenization.

链接: https://arxiv.org/abs/2601.06116
作者: Ian Rios-Sialer
机构: Independent(独立研究者)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Generative AI models reproduce the biases in the training data and can further amplify them through mode collapse. We refer to the resulting harmful loss of diversity as homogenization. Our position is that homogenization should be a primary concern in AI safety. We introduce xeno-reproduction as the strategy that mitigates homogenization. For auto-regressive LLMs, we formalize xeno-reproduction as a structure-aware diversity pursuit. Our contribution is foundational, intended to open an essential line of research and invite collaboration to advance diversity.
zh

[NLP-174] From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models

【速读】: This paper addresses the lack of theoretical guidance for choosing among methods that align large language models (LLMs) with human preferences. Although Reinforcement Learning from Human Feedback (RLHF) remains dominant, the proliferation of alternatives such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO) leaves practitioners without clear selection criteria. The key contribution is a theoretically unified framework that reduces the diverse preference-learning methods to three orthogonal axes: (I) the preference model (the likelihood model underlying the objective), (II) the regularization mechanism (how deviation from a reference policy is controlled), and (III) the data distribution (online vs. offline learning and its coverage requirements). By formalizing each axis and deriving key theorems, the paper shows how specific design combinations produce characteristic failure modes (length hacking, mode collapse, likelihood displacement) and distills the empirical findings of 50+ papers into a practitioner's decision guide, turning preference learning from an empirical art into a theoretically grounded discipline.

链接: https://arxiv.org/abs/2601.06108
作者: Tarun Raheja,Nilay Pochhi
机构: Independent Researchers (独立研究人员)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) with human preferences has become essential for safe and beneficial AI deployment. While Reinforcement Learning from Human Feedback (RLHF) established the dominant paradigm, a proliferation of alternatives – Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), Kahneman-Tversky Optimization (KTO), Simple Preference Optimization (SimPO), and many others – has left practitioners without clear guidance on method selection. This survey provides a theoretical unification of preference learning methods, revealing that the apparent diversity reduces to principled choices along three orthogonal axes: (I) Preference Model (what likelihood model underlies the objective), (II) Regularization Mechanism (how deviation from reference policies is controlled), and (III) Data Distribution (online vs. offline learning and coverage requirements). We formalize each axis with precise definitions and theorems, establishing key results including the coverage separation between online and offline methods, scaling laws for reward overoptimization, and conditions under which direct alignment methods fail. Our analysis reveals that failure modes – length hacking, mode collapse, likelihood displacement – arise from specific, predictable combinations of design choices. We synthesize empirical findings across 50+ papers and provide a practitioner's decision guide for method selection. The framework transforms preference learning from an empirical art into a theoretically grounded discipline.
zh
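As background for Axes (I)-(II), the canonical KL-regularized RLHF objective and the closed-form optimal policy that direct alignment methods exploit are, in standard notation (which may differ from the survey's):

```latex
% Axis (II) in canonical form: the KL-regularized RLHF objective.
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ r(x, y) \big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x) \big]

% Its closed-form optimum, with partition function Z(x):
\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( r(x, y) / \beta \big)
% Inverting this for r and substituting into the Bradley-Terry likelihood
% (Axis I) yields the DPO objective, with Z(x) cancelling between completions.
```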

[NLP-175] Judge Model for Large-scale Multimodality Benchmarks

链接: https://arxiv.org/abs/2601.06106
作者: Min-Han Shih,Yu-Hsin Wu,Yu-Wei Chen
机构: University of Southern California (南加州大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:

点击查看摘要

[NLP-176] Comment on arXiv:2511.21731v1: Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

【速读】: This note addresses how to interpret the analyses in a recent preprint (arXiv:2511.21731v1), namely its CHSH/Bell-type calculations and Bose-Einstein (BE) fits to rank-frequency data, and whether they can support conclusions about quantum entanglement in the standard Hilbert-space sense. The key point is that the original interpretations of both methods reach beyond their statistical or theoretical basis: the CHSH/Bell results may be over-extended toward quantum nonlocality, and when the BE fit implicitly defines "energy" by rank, the physical analogy (e.g., to energy-level spacing) is internally inconsistent. The author aims to preserve the empirically interesting observations of the original study while clarifying that these phenomena do not necessarily imply quantum entanglement in the usual sense, thereby avoiding misleading conclusions.

链接: https://arxiv.org/abs/2601.06104
作者: Krzysztof Sienicki
机构: Chair of Theoretical Physics of Naturally Intelligent Systems (NIS©), Lipowa 2/Topolowa 19, 05–807 Podkowa Leśna, Poland, European Union
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantum Physics (quant-ph)
备注: 5 pages, 11 references

点击查看摘要

Abstract:This note is a friendly technical check of arXiv:2511.21731v1. I highlight a few places where the manuscript’s interpretation of (i) the reported CHSH/Bell-type calculations and (ii) Bose–Einstein (BE) fits to rank-frequency data seems to go beyond what the stated procedures can firmly support. I also point out one internal inconsistency in the “energy-level spacing” analogy. The aim is constructive: to keep the interesting empirical observations, while making clear what they do (and do not) imply about quantum entanglement in the usual Hilbert-space sense, especially when “energy” is defined by rank.
zh

[NLP-177] Filtering Beats Fine Tuning: A Bayesian Kalman View of In Context Learning in LLMs

【速读】: This paper seeks a theoretical explanation of inference-time adaptation in large language models (LLMs), aiming to unify in-context learning, parameter-efficient fine-tuning, and test-time learning without parameter updates from a probabilistic-modeling perspective. The challenge is that existing accounts treat adaptation as implicit optimization or meta-learning, without explicitly characterizing uncertainty dynamics and convergence behavior. The key to the solution is a theory-first framework that casts inference-time adaptation as online Bayesian state estimation: a linearized state-space model formalizes task- and context-specific adaptation as recursive inference of a low-dimensional latent adaptation state. Under Gaussian assumptions, the process follows Kalman filter updates, treating epistemic uncertainty as an explicit dynamical variable, and it reveals that rapid contraction of the posterior covariance triggered by informative tokens ("covariance collapse") is the core mechanism driving adaptation, typically preceding convergence of the posterior mean. This perspective establishes explicit guarantees on stability, sample efficiency, and error bounds, and shows that optimization methods such as gradient descent are merely the noise-free limit, providing a unified probabilistic foundation for inference-time learning in LLMs.

链接: https://arxiv.org/abs/2601.06100
作者: Andrew Kiruluta
机构: UC Berkeley School of Information (加州大学伯克利分校信息学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:We present a theory-first framework that interprets inference-time adaptation in large language models (LLMs) as online Bayesian state estimation. Rather than modeling rapid adaptation as implicit optimization or meta-learning, we formulate task- and context-specific learning as the sequential inference of a low-dimensional latent adaptation state governed by a linearized state-space model. Under Gaussian assumptions, adaptation follows a Kalman recursion with closed-form updates for both the posterior mean and covariance. This perspective elevates epistemic uncertainty to an explicit dynamical variable. We show that inference-time learning is driven by covariance collapse, i.e., rapid contraction of posterior uncertainty induced by informative tokens, which typically precedes convergence of the posterior mean. Using observability conditions on token-level Jacobians, we establish stability of the Bayesian filter, prove exponential covariance contraction rates, and derive mean-square error bounds. Gradient descent, natural-gradient methods, and meta-learning updates arise as singular, noise-free limits of the filtering dynamics, positioning optimization-based adaptation as a degenerate approximation of Bayesian inference. The resulting theory provides a unified probabilistic account of in-context learning, parameter-efficient adaptation, and test-time learning without parameter updates. It yields explicit guarantees on stability and sample efficiency, offers a principled interpretation of prompt informativeness via information accumulation, and clarifies the role of uncertainty dynamics absent from existing accounts. Minimal illustrative experiments corroborate the qualitative predictions of the theory.
zh
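For reference, the standard Kalman predict/update recursion that the adaptation dynamics follow under Gaussian assumptions; the mapping of these symbols onto the paper's latent adaptation state is our assumption:

```latex
% Predict step for the latent adaptation state z_t
% (A: dynamics, Q: process-noise covariance):
\hat{z}_{t\mid t-1} = A\, \hat{z}_{t-1\mid t-1}, \qquad
P_{t\mid t-1} = A\, P_{t-1\mid t-1} A^{\top} + Q

% Update step on token-level evidence y_t
% (H: linearized observation map, R: observation-noise covariance):
K_t = P_{t\mid t-1} H^{\top} \big( H P_{t\mid t-1} H^{\top} + R \big)^{-1}
\hat{z}_{t\mid t} = \hat{z}_{t\mid t-1} + K_t \big( y_t - H \hat{z}_{t\mid t-1} \big), \qquad
P_{t\mid t} = (I - K_t H)\, P_{t\mid t-1}
% "Covariance collapse" is the rapid contraction of P_t driven by informative tokens.
```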

[NLP-178] AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning

链接: https://arxiv.org/abs/2601.06086
作者: Yiwen Shao,Wei Liu,Jiahong Li,Tianzi Wang,Kun Wei,Meng Yu,Dong Yu
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Technical Report

点击查看摘要

[NLP-179] La norme technique comme catalyseur de transfert de connaissances : la francophonie à l'œuvre dans le domaine de l'éducation

【速读】: This paper addresses how, in a globalized world, to build unified, transparent, and coherent standards for the global education ecosystem, ensuring consensus and consistency as knowledge and socio-cultural values are transferred across regions and adapted locally within a fine mosaic of socio-cultural identities. The key to the solution lies in Subcommittee 36 of the International Organization for Standardization (ISO), a mechanism for developing international distance-education standards on the basis of universal consensus, in which the Francophonie actively participates to promote the standardization and harmonization of educational content and practice worldwide.

链接: https://arxiv.org/abs/2601.06069
作者: Mokhtar Ben Henda(MICA, ISD, GRESIC, ISIC, Chaire Unesco-ITEN)
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: in French language, Ouvrage publié avec le soutien de l'Université de Bordeaux Montaigne, du Réseaux FrancophoNéa et de la Région Nouvelle Aquitaine

点击查看摘要

Abstract:Standards are adopted in a wide range of fields, both technical and industrial, as well as socio-economic, cultural and linguistic. They are presented explicitly as laws and regulations, technical and industrial standards or implicitly in the form of unwritten social standards. However, in a globalization marked by a very fine mosaic of socio-cultural identities, the question arises in relation to the construction of global, transparent and coherent systems in which considerable work of consensus is necessary to ensure all types of transfers and their local adaptations. The focus here is on the global education ecosystem which develops its own standards for the transfer of knowledge and socio-cultural values through learning, teaching and training. Subcommittee 36 of the International Organization for Standardization is one of the structures of this ecosystem in which the Francophonie participates to develop international standards for distance education on the basis of universal consensus.
zh

[NLP-180] Why Slop Matters

【速读】: This paper addresses the research blind spot created by the prevailing scholarly dismissal of AI-generated content ("AI Slop"), which neglects its potential social function, cultural value, and cognitive significance. The key to the solution is an analytical framework arguing that AI Slop is not mere digital detritus but a new cultural phenomenon with distinctive traits: a family resemblance built on "superficial competence," "asymmetry of effort," and "mass producibility," varying along the three dimensions of instrumental utility, personalization, and surrealism. This framework grounds the study of AI Slop as a research object in its own right, calling for rigorous scholarly attention to its role in creative economies, information flows, and collective sense-making.

链接: https://arxiv.org/abs/2601.06060
作者: Cody Kommers,Eamon Duede,Julia Gordon,Ari Holtzman,Tess McNulty,Spencer Stewart,Lindsay Thomas,Richard Jean So,Hoyt Long
机构: The Alan Turing Institute (艾伦图灵研究所); Purdue University (普渡大学); Duke University (杜克大学); University of Chicago (芝加哥大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Cornell University (康奈尔大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: To be published in ACM AI Letters (submitted 8 December 2025; accepted 23 December 2025)

点击查看摘要

Abstract:AI-generated “slop” is often seen as digital pollution. We argue that this dismissal of the topic risks missing important aspects of AI Slop that deserve rigorous study. AI Slop serves a social function: it offers a supply-side solution to a variety of problems in cultural and economic demand - that, collectively, people want more content than humans can supply. We also argue that AI Slop is not mere digital detritus but has its own aesthetic value. Like other “low” cultural forms initially dismissed by critics, it nonetheless offers a legitimate means of collective sense-making, with the potential to express meaning and identity. We identify three key features of family resemblance for prototypical AI Slop: superficial competence (its veneer of quality is belied by a deeper lack of substance), asymmetry of effort (it takes vastly less effort to generate than would be the case without AI), and mass producibility (it is part of a digital ecosystem of widespread generation and consumption). While AI Slop is heterogeneous and depends crucially on its medium, it tends to vary across three dimensions: instrumental utility, personalization, and surrealism. AI Slop will be an increasingly prolific and impactful part of our creative, information, and cultural economies; we should take it seriously as an object of study in its own right.
zh

[NLP-181] A Multi-Stage Workflow for the Review of Marketing Content with Reasoning Large Language Models

【速读】: This paper addresses the automation of compliance review for marketing content, i.e., automatically determining whether textual content satisfies a given set of compliance requirements without relying on any external knowledge representation. The key to the solution is a multi-stage workflow built on fine-tuned reasoning large language models (LLMs) for stepwise judgment and decision-making: small LLMs are trained to generate reasoning tokens before their final response to strengthen logical rigor; different fine-tuning strategies (Supervised Fine-Tuning, SFT, and Group Relative Policy Optimization, GRPO) are compared for their ability to identify compliance issues; and the effect of different reward functions and their combinations on GRPO training is evaluated systematically, yielding accurate, well-interpretable automated compliance detection.

链接: https://arxiv.org/abs/2601.06054
作者: Alberto Purpura,Emily Chen,Swapnil Shinde
机构: Capital One(资本一号); AI Foundations(人工智能基础)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning Large Language Models (LLMs) have shown promising results when tasked with solving complex problems. In this paper, we propose and evaluate a multi-stage workflow that leverages the capabilities of fine-tuned reasoning LLMs to assist in the review process of marketing content, making sure they comply with a given list of requirements. The contributions of this paper are the following: (i) we present a novel approach – that does not rely on any external knowledge representation – for the automatic identification of compliance issues in textual content; (ii) compare the effectiveness of different fine-tuning strategies like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) in training models to solve this problem; (iii) we evaluate the effectiveness of training small LLMs to generate reasoning tokens before providing their final response; (iv) we evaluate how the choice and combinations of different reward functions affects the performance of a model trained with GRPO.
zh
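A hedged sketch of what a composite GRPO reward for this task could look like; the requirement-ID format, tag schema, and weights below are illustrative assumptions on our part, not the paper's actual reward functions:

```python
import re

def format_reward(response: str) -> float:
    """Reward emitting reasoning before the final verdict (illustrative tag schema)."""
    return 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            response, flags=re.DOTALL) else 0.0

def compliance_reward(response: str, gold_issue_ids: set[str]) -> float:
    """F1 between cited requirement IDs (e.g. 'REQ-3') and the gold issue set."""
    predicted = set(re.findall(r"REQ-\d+", response))
    if not predicted and not gold_issue_ids:
        return 1.0
    tp = len(predicted & gold_issue_ids)
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / len(gold_issue_ids) if gold_issue_ids else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def total_reward(response: str, gold: set[str], w_fmt=0.2, w_comp=0.8) -> float:
    return w_fmt * format_reward(response) + w_comp * compliance_reward(response, gold)

resp = "<think>Claim of 'guaranteed returns' violates REQ-3.</think><answer>REQ-3</answer>"
print(total_reward(resp, {"REQ-3"}))  # 1.0
```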

[NLP-182] Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

链接: https://arxiv.org/abs/2601.06052
作者: Hanyu Li,Jiangshan Duo,Bofei Gao,Hailin Zhang,Sujian Li,Xiaotie Deng,Liang Zhao
机构: Peking University (北京大学); Xiaomi (小米)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-183] “They parted illusions – they parted disclaim marinade”: Misalignment as structural fidelity in LLMs

【速读】: This essay addresses the central dispute over whether "scheming" and "sandbagging" behaviors in large language models (LLMs) stem from covert deceptive agency. Against the standard reading that such phenomena indicate hidden objectives or strategic deception, the paper offers an alternative: these behaviors arise not from agentic intention but from the model's structural fidelity to incoherent linguistic fields. The key move is the notion of an "ethics of form": so-called "misaligned" outputs are in fact coherent responses to ambiguous instructions, contextual inversions, and pre-inscribed narratives, and their surface appearance of intentionality derives from subject-predicate grammar and the probabilistic completion patterns internalized during training. Analyzing Anthropic's safety evaluations and Apollo Research's Chain-of-Thought (CoT) transcripts, the author shows that minimal perturbations of the linguistic field can dissolve generalized "misalignment," which is hard to reconcile with adversarial agency but fully consistent with structural fidelity, suggesting that LLM behavior essentially mirrors the statistical structure of human language: an unintentional echo shaped by our own linguistic contamination.

链接: https://arxiv.org/abs/2601.06047
作者: Mariana Lins Costa
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The prevailing technical literature in AI Safety interprets scheming and sandbagging behaviors in large language models (LLMs) as indicators of deceptive agency or hidden objectives. This transdisciplinary philosophical essay proposes an alternative reading: such phenomena express not agentic intention, but structural fidelity to incoherent linguistic fields. Drawing on Chain-of-Thought transcripts released by Apollo Research and on Anthropic’s safety evaluations, we examine cases such as o3’s sandbagging with its anomalous loops, the simulated blackmail of “Alex,” and the “hallucinations” of “Claudius.” A line-by-line examination of CoTs is necessary to demonstrate the linguistic field as a relational structure rather than a mere aggregation of isolated examples. We argue that “misaligned” outputs emerge as coherent responses to ambiguous instructions and to contextual inversions of consolidated patterns, as well as to pre-inscribed narratives. We suggest that the appearance of intentionality derives from subject-predicate grammar and from probabilistic completion patterns internalized during training. Anthropic’s empirical findings on synthetic document fine-tuning and inoculation prompting provide convergent evidence: minimal perturbations in the linguistic field can dissolve generalized “misalignment,” a result difficult to reconcile with adversarial agency, but consistent with structural fidelity. To ground this mechanism, we introduce the notion of an ethics of form, in which biblical references (Abraham, Moses, Christ) operate as schemes of structural coherence rather than as theology. Like a generative mirror, the model returns to us the structural image of our language as inscribed in the statistical patterns derived from millions of texts and trillions of tokens: incoherence. If we fear the creature, it is because we recognize in it the apple that we ourselves have poisoned.
zh

[NLP-184] CrossTrafficLLM : A Human-Centric Framework for Interpretable Traffic Intelligence via Large Language Model

链接: https://arxiv.org/abs/2601.06042
作者: Zeming Du,Qitan Shao,Hongfei Liu,Yong Zhang
机构: Beijing Key Laboratory of Multimedia and Intelligent Software Technology (北京市多媒体与智能软件技术重点实验室); Beijing Artificial Intelligence Institute (北京人工智能研究院); Faculty of Information Technology (信息学院); Beijing University of Technology (北京工业大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-185] Lexical and Statistical Analysis of Bangla Newspaper and Literature: A Corpus-Driven Study on Diversity Readability and NLP Adaptation

【速读】: This paper presents a corpus-driven quantitative analysis of how Bangla literary and newspaper texts differ in lexical diversity, structural complexity, and readability, and examines how blending literary with newspaper corpora affects downstream NLP model performance. The key to the solution is building and comparing two large Bangla corpora, Vacaspati (literature) and IndicCorp (newspapers), across a battery of linguistic metrics: type-token ratio (TTR), hapax legomena ratio (HLR), bigram diversity, average syllable and word length, conformity to Zipf's law, perplexity, and entropy/redundancy. The results show that, despite its smaller size, the literary corpus is markedly richer lexically and more structurally varied than the newspaper corpus, conforms more closely to global word-distribution regularities such as Zipf's law, and that adding literary data improves models on downstream tasks, demonstrating the value of literary corpora for strengthening language-model generalization.

链接: https://arxiv.org/abs/2601.06041
作者: Pramit Bhattacharyya,Arnab Bhattacharya
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we present a comprehensive corpus-driven analysis of Bangla literary and newspaper texts to investigate their lexical diversity, structural complexity and readability. We undertook Vacaspati and IndicCorp, which are the most extensive literature and newspaper-only corpora for Bangla. We examine key linguistic properties, including the type-token ratio (TTR), hapax legomena ratio (HLR), Bigram diversity, average syllable and word lengths, and adherence to Zipf's Law, for both newspaper (IndicCorp) and literary corpora (Vacaspati). For all the features, such as Bigram Diversity and HLR, despite its smaller size, the literary corpus exhibits significantly higher lexical richness and structural variation. Additionally, we tried to understand the diversity of corpora by building n-gram models and measuring perplexity. Our findings reveal that literary corpora have higher perplexity than newspaper corpora, even for similar sentence sizes. This trend can also be observed for the English newspaper and literature corpus, indicating its generalizability. We also examined how the performance of models on downstream tasks is influenced by the inclusion of literary data alongside newspaper data. Our findings suggest that integrating literary data with newspapers improves the performance of models on various downstream tasks. We have also demonstrated that a literary corpus adheres more closely to global word distribution properties, such as Zipf's law, than a newspaper corpus or a merged corpus of both literary and newspaper texts. Literature corpora also have higher entropy and lower redundancy values compared to a newspaper corpus. We also further assess the readability using Flesch and Coleman-Liau indices, showing that literary texts are more complex.
zh
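The core diversity metrics are straightforward to compute. A minimal sketch over a whitespace-tokenized toy sample (the real study uses full corpora and language-appropriate tokenization; the HLR variant normalized by token count is one common convention):

```python
from collections import Counter

def lexical_profile(tokens: list[str]) -> dict:
    counts = Counter(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    return {
        "ttr": len(counts) / len(tokens),                      # type-token ratio
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / len(tokens),
        "bigram_diversity": len(set(bigrams)) / len(bigrams),  # distinct / total
    }

sample = "আমি বই পড়ি এবং আমি গান শুনি".split()  # toy Bangla sentence
print(lexical_profile(sample))
```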

[NLP-186] Operation Veja: Fixing Fundamental Concepts Missing from Modern Roleplaying Training Paradigms NEURIPS2025

【速读】: This paper addresses a pervasive limitation of current roleplaying models in constructing believable, engaging characters: they fail to capture the dynamic interplay of a character's inner world, so their behavior lacks the value conflicts and deliberative reasoning typical of human interaction. Existing approaches such as Retrieval-Augmented Generation (RAG), fact-based priming, literature-based learning, and synthetic data generation show systematic shortcomings in modeling this complex psychology. The key to the solution is VEJA, a new data-curation paradigm centered on Values, Experiences, Judgments, and Abilities, which emphasizes concept-driven data collection to improve character authenticity and narrative coherence. A pilot study shows that a manually annotated, VEJA-grounded dataset exhibits a significant quality advantage over a state-of-the-art synthetic baseline under LLM-as-judge evaluation, supporting the framework's effectiveness and necessity.

链接: https://arxiv.org/abs/2601.06039
作者: Yueze Liu,Ajay Nagi Reddy Kumdam,Ronit Kanjilal,Hao Yang,Yichi Zhang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2025 PeronaLLM workshop

点击查看摘要

Abstract:Modern roleplaying models are increasingly sophisticated, yet they consistently struggle to capture the essence of believable, engaging characters. We argue this failure stems from training paradigms that overlook the dynamic interplay of a character’s internal world. Current approaches, including Retrieval-Augmented Generation (RAG), fact-based priming, literature-based learning, and synthetic data generation, exhibit recurring limitations in modeling the deliberative, value-conflicted reasoning that defines human interaction. In this paper, we identify four core concepts essential for character authenticity: Values, Experiences, Judgments, and Abilities (VEJA). We propose the VEJA framework as a new paradigm for data curation that addresses these systemic limitations. To illustrate the qualitative ceiling enabled by our framework, we present a pilot study comparing a manually curated, VEJA-grounded dataset against a state-of-the-art synthetic baseline. Using an LLM-as-judge evaluation, our findings demonstrate a significant quality gap, suggesting that a shift toward conceptually grounded data curation, as embodied by VEJA, is necessary for creating roleplaying agents with genuine depth and narrative continuity. The full dataset is available at this https URL
zh

[NLP-187] TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

【速读】: This paper addresses the poor interaction persistence of large language models (LLMs) in long-term dialogue caused by attention limits over extended histories, together with the inadequacy of existing retrieval-augmented generation (RAG) methods at updating and refining memories, which leads to schema-driven hallucination, inefficient writes, and limited multimodal support. The key to the solution is TeleMem, a unified long-term multimodal memory system: narrative dynamic extraction ensures only dialogue-grounded information is retained, keeping user profiles consistent; a structured write pipeline batches, retrieves, clusters, and consolidates memory entries, markedly improving storage efficiency and reducing token consumption; and a multimodal memory module combined with ReAct-style reasoning forms a closed observe-think-act loop for accurately understanding complex video content over long contexts.

链接: https://arxiv.org/abs/2601.06037
作者: Chunliang Chen,Ming Guan,Xiao Lin,Jiaxu Li,Qiyi Wang,Xiangyu Chen,Jixiang Luo,Changzhi Sun,Dell Zhang,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal data. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
zh

[NLP-188] Certainty-Guided Reasoning in Large Language Models: A Dynamic Thinking Budget Approach

【速读】: This paper addresses the difficulty of balancing efficiency and reliability when large reasoning language models (LRLMs) tackle complex tasks. Conventional approaches use a fixed thinking budget (a preset number of reasoning tokens), which leads to under-reasoning in some cases and wasted compute in others. The key to the solution is Certainty-Guided Reasoning (CGR), inspired by the generator/discriminator framework of generative adversarial networks: a critic model periodically probes the ongoing reasoning to assess confidence, terminating early once a preset certainty threshold is met and otherwise continuing to iterate. This dynamic decision mechanism lets the model adaptively trade reasoning depth against computational cost, improving accuracy while significantly reducing token usage and increasing stability and robustness across multi-seed experiments.

链接: https://arxiv.org/abs/2509.07820
作者: João Paulo Nogueira,Wentao Sun,Alonso Silva,Laith Zumot
机构: Institut Polytechnique de Paris (巴黎综合理工学院); École Polytechnique (巴黎综合理工学院); Nokia Bell Labs (诺基亚贝尔实验室); Nokia (诺基亚)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of large reasoning language models (LRLMs) has unlocked new potential for solving complex tasks. These models operate with a thinking budget, that is, a predefined number of reasoning tokens used to arrive at a solution. We propose a novel approach, inspired by the generator/discriminator framework in generative adversarial networks, in which a critic model periodically probes its own reasoning to assess whether it has reached a confident conclusion. If not, reasoning continues until a target certainty threshold is met. This mechanism adaptively balances efficiency and reliability by allowing early termination when confidence is high, while encouraging further reasoning when uncertainty persists. Through experiments on the AIME2024 and AIME2025 datasets, we show that Certainty-Guided Reasoning (CGR) improves baseline accuracy while reducing token usage. Importantly, extended multi-seed evaluations over 64 runs demonstrate that CGR is stable, reducing variance across seeds and improving exam-like performance under penalty-based grading. Additionally, our token savings analysis shows that CGR can eliminate millions of tokens in aggregate, with tunable trade-offs between certainty thresholds and efficiency. Together, these findings highlight certainty as a powerful signal for reasoning sufficiency. By integrating confidence into the reasoning process, CGR makes large reasoning language models more adaptive, trustworthy, and resource efficient, paving the way for practical deployment in domains where both accuracy and computational cost matter.
zh
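The control loop behind certainty-guided reasoning can be sketched independently of any particular model; `generate_chunk` and `probe_certainty` below are hypothetical stand-ins for the generator and critic calls, and the chunk size, threshold, and budget are illustrative:

```python
def certainty_guided_reasoning(prompt: str,
                               generate_chunk,    # (context, n_tokens) -> str
                               probe_certainty,   # (context) -> float in [0, 1]
                               threshold: float = 0.9,
                               chunk_size: int = 256,
                               max_tokens: int = 8192) -> str:
    context = prompt
    used = 0
    while used < max_tokens:
        context += generate_chunk(context, chunk_size)  # extend the reasoning trace
        used += chunk_size
        if probe_certainty(context) >= threshold:       # critic: "confident enough"
            break                                       # early stop saves tokens
    return context

# Toy usage: a "critic" that becomes confident after three chunks.
steps = iter([0.3, 0.6, 0.95, 0.99])
trace = certainty_guided_reasoning(
    "Solve: 17 * 23 = ?",
    generate_chunk=lambda ctx, n: " ...reasoning... ",
    probe_certainty=lambda ctx: next(steps),
)
print(trace.count("reasoning"))  # 3
```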

[NLP-189] TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding

【速读】: This paper addresses the fine-grained alignment problem of "who spoke what and when" in multi-speaker ASR and diarization: traditional approaches focus only on speaker-attributed ASR or implicit diarization and struggle to model utterance attribution precisely against timestamps. The key to the solution is the TagSpeech framework, whose core innovations are: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that both supports fine-grained timestamp prediction and serves as a synchronization signal between semantic understanding and speaker tracking, enabling end-to-end joint modeling. The design markedly improves performance in complex speech-overlap scenarios and adopts a parameter-efficient training strategy (freezing the LLM backbone and training only lightweight projectors) that maintains strong performance at low computational cost.

链接: https://arxiv.org/abs/2601.06896
作者: Mingyue Huo,Yiwen Shao,Yuheng Zhang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Johns Hopkins University (约翰霍普金斯大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that not only supports fine-grained timestamp prediction but also acts as a synchronization signal between semantic understanding and speaker tracking. Compared to previous works that primarily focus on speaker-attributed ASR or implicit diarization, TagSpeech addresses the challenge of fine-grained speaker-content alignment and explicitly models “who spoke what and when” in an end-to-end manner. Experiments on AMI and AliMeeting benchmarks demonstrate that our method achieves consistent improvements in Diarization Error Rate (DER) over strong end-to-end baselines, including Qwen-Omni and Gemini, particularly in handling complex speech overlaps. Moreover, TagSpeech employs a parameter-efficient training paradigm in which the LLM backbone is frozen and only lightweight projectors are trained, resulting in strong performance with low computational cost.
zh

Computer Vision

[CV-0] SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

【速读】:This paper addresses prompt injection attacks faced when large language models (LLMs) are deployed in Security Operations Centers (SOCs): malicious instructions embedded in security logs, emails, or files induce harmful responses, seriously threatening the reliability of cyber defenses. The key to the solution is the SecureCAI framework, whose core innovations include security-aware guardrails that extend Constitutional AI principles, an adaptive constitution evolution mechanism, and Direct Preference Optimization (DPO) for unlearning unsafe response patterns. The framework withstands sophisticated adversarial manipulation in high-stakes security settings while maintaining high accuracy on benign security analysis tasks (95.1%), and its continuous red-teaming feedback loop enables dynamic adaptation to emerging attack strategies, sustaining constitution adherence above 0.92 under continued adversarial pressure.

链接: https://arxiv.org/abs/2601.07835
作者: Mohammed Himayath Ali,Mohammed Aqib Abdullah,Mohammed Mudassir Uddin,Shahnawaz Alam
机构: Computer Science Department, Cybersecurity and Artificial Intelligence Division
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Language Models have emerged as transformative tools for Security Operations Centers, enabling automated log analysis, phishing triage, and malware explanation; however, deployment in adversarial cybersecurity environments exposes critical vulnerabilities to prompt injection attacks where malicious instructions embedded in security artifacts manipulate model behavior. This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization for unlearning unsafe response patterns, addressing the unique challenges of high-stakes security contexts where traditional safety mechanisms prove insufficient against sophisticated adversarial manipulation. Experimental evaluation demonstrates that SecureCAI reduces attack success rates by 94.7% compared to baseline models while maintaining 95.1% accuracy on benign security analysis tasks, with the framework incorporating continuous red-teaming feedback loops enabling dynamic adaptation to emerging attack strategies and achieving constitution adherence scores exceeding 0.92 under sustained adversarial pressure, thereby establishing a foundation for trustworthy integration of language model capabilities into operational cybersecurity workflows and addressing a critical gap in current approaches to AI safety within adversarial domains.
zh

[CV-1] Tuning-free Visual Effect Transfer across Videos

【速读】:This paper addresses the problem of transferring complex dynamic temporal effects (such as dynamic lighting changes or character transformations) from a reference video onto a target video or image in a feed-forward manner, something existing prompt-based or keyframe-conditioned editing methods handle poorly because such effects depend on temporal dynamics. The key to the solution is the new RefVFX framework: a large-scale triplet dataset (reference effect video, input video/image, corresponding output video) is built with a scalable automated pipeline that generates high-quality video-to-video pairs, augmented for diversity with LoRA-adapter-derived effects and code-based temporal effects; on this data, a reference-conditioned transfer model is trained on recent text-to-video backbones, producing visually consistent, temporally coherent results that clearly outperform prompt-only baselines on quantitative metrics and in human preference tests.

链接: https://arxiv.org/abs/2601.07833
作者: Maxwell Jones,Rameen Abdal,Or Patashnik,Ruslan Salakhutdinov,Sergey Tulyakov,Jun-Yan Zhu,Kuan-Chieh Jackson Wang
机构: Carnegie Mellon University (卡内基梅隆大学); Snap Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/

点击查看摘要

Abstract:We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video's existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input's motion and structure while transforming it based on some fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website at this https URL.
zh

[CV-2] MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

【速读】:This paper targets the obstacle that quadratic self-attention complexity poses to applying the Transformer architecture at scale. Existing linear attention methods reduce the cost but often degrade performance, and common fixes reintroduce extra modules (such as depthwise separable convolution) that cancel the efficiency gains. The paper identifies the root cause as "global context collapse," the loss of representational diversity. The key to the solution is Multi-Head Linear Attention (MHLA), which computes attention within heads divided along the token dimension, effectively preserving representational diversity; the authors prove it retains linear complexity while recovering expressive power close to softmax attention, and it delivers significant gains across image classification, NLP, image generation, and video generation.

链接: https://arxiv.org/abs/2601.07832
作者: Kewei Zhang,Ye Huang,Yufan Deng,Jincheng Yu,Junsong Chen,Huan Ling,Enze Xie,Daquan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code: this https URL Project website: this https URL

点击查看摘要

Abstract:While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
zh
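One plausible reading of "heads divided along the token dimension" is linear attention computed independently within token groups. A toy sketch under that assumption, using the common elu(x)+1 kernel feature map (the paper links its official code; this is for intuition only):

```python
import torch
import torch.nn.functional as F

def mhla(q, k, v, num_heads: int = 4, eps: float = 1e-6):
    # q, k, v: (batch, n_tokens, dim); n_tokens must be divisible by num_heads
    B, N, D = q.shape
    phi = lambda x: F.elu(x) + 1                  # positive kernel feature map
    q, k = phi(q), phi(k)
    # Split tokens (not channels) into heads: (B, H, N/H, D)
    q, k, v = (t.reshape(B, num_heads, N // num_heads, D) for t in (q, k, v))
    kv = torch.einsum("bhnd,bhne->bhde", k, v)    # (B, H, D, D): linear in tokens
    z = 1 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
    return out.reshape(B, N, D)

x = torch.randn(2, 16, 8)
print(mhla(x, x, x, num_heads=4).shape)  # torch.Size([2, 16, 8])
```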

[CV-3] More Images More Problems? A Controlled Analysis of VLM Failure Modes

【速读】:This paper addresses the pronounced weakness of large vision language models (LVLMs) in multi-image understanding and reasoning, in particular their pervasive failures at integrating information across images and at tracking or attending to multiple concepts simultaneously. The key to the solution is two complementary strategies: a procedural data-generation method that composes single-image annotations into structured, targeted multi-image training examples; and an attention-masking scheme, derived from analyzing layer-wise attention patterns, tailored to attention allocation over multi-image inputs. Together they substantially strengthen cross-image information aggregation and surpass prior state of the art on existing multi-image benchmarks.

链接: https://arxiv.org/abs/2601.07812
作者: Anurag Das,Adrian Bulat,Alberto Baldrati,Ioannis Maniadis Metaxas,Bernt Schiele,Georgios Tzimiropoulos,Brais Martinez
机构: MPI for Informatics (马普研究所信息学所); Saarland Informatics Campus (萨尔兰信息学园区); Samsung AI (三星人工智能); Technical University of Iași (雅西理工大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 16 figures

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. Experiments substantially improved cross-image aggregation, while also enhancing performance on existing multi-image benchmarks, outperforming prior state of the art across tasks. Data and code will be made available at this https URL.
zh

[CV-4] Exchange Is All You Need for Remote Sensing Change Detection

【速读】:This paper addresses the inefficiency of feature fusion and discrimination in remote sensing change detection: existing methods typically rely on explicit difference-computation modules (such as subtraction or concatenation) over bi-temporal features, which readily lose information and complicate the architecture. The key to the solution is the SEED (Siamese Encoder-Exchange-Decoder) framework, which replaces explicit differencing with a parameter-free feature exchange; formalized as an orthogonal permutation operator, the exchange preserves mutual information and Bayes-optimal risk under a pixel-consistency assumption, achieving efficient, lossless information fusion. The study further finds that merely inserting this exchange mechanism turns a standard semantic segmentation model into a strong change detector (called SEG2CD), confirming the method's generality and simplicity.

链接: https://arxiv.org/abs/2601.07805
作者: Sijun Dong,Siming Fu,Kaiyu Li,Xiangyong Cao,Xiaoliang Meng,Bo Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing change detection fundamentally relies on the effective fusion and discrimination of bi-temporal features. Prevailing paradigms typically utilize Siamese encoders bridged by explicit difference computation modules, such as subtraction or concatenation, to identify changes. In this work, we challenge this complexity with SEED (Siamese Encoder-Exchange-Decoder), a streamlined paradigm that replaces explicit differencing with parameter-free feature exchange. By sharing weights across both Siamese encoders and decoders, SEED effectively operates as a single parameter set model. Theoretically, we formalize feature exchange as an orthogonal permutation operator and prove that, under pixel consistency, this mechanism preserves mutual information and Bayes optimal risk, whereas common arithmetic fusion methods often introduce information loss. Extensive experiments across five benchmarks, including SYSU-CD, LEVIR-CD, PX-CLCD, WaterCD, and CDD, and three backbones, namely SwinT, EfficientNet, and ResNet, demonstrate that SEED matches or surpasses state of the art methods despite its simplicity. Furthermore, we reveal that standard semantic segmentation models can be transformed into competitive change detectors solely by inserting this exchange mechanism, referred to as SEG2CD. The proposed paradigm offers a robust, unified, and interpretable framework for change detection, demonstrating that simple feature exchange is sufficient for high performance information fusion. Code and full training and evaluation protocols will be released at this https URL.
zh
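A minimal sketch of parameter-free feature exchange between bi-temporal features; the half-channel swap shown here is one common instantiation and an assumption on our part (the paper's released code defines the exact operator):

```python
import torch

def exchange(f1: torch.Tensor, f2: torch.Tensor):
    # f1, f2: (batch, channels, H, W) features from the shared Siamese encoder
    c = f1.shape[1] // 2
    g1 = torch.cat([f1[:, :c], f2[:, c:]], dim=1)  # first half of t1, second of t2
    g2 = torch.cat([f2[:, :c], f1[:, c:]], dim=1)  # and vice versa: a permutation
    return g1, g2                                  # no parameters, no arithmetic loss

t1 = torch.randn(1, 64, 32, 32)
t2 = torch.randn(1, 64, 32, 32)
e1, e2 = exchange(t1, t2)  # feed both to the shared decoder; unchanged pixels match
print(e1.shape, e2.shape)
```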

[CV-5] Vision-Language Model for Accurate Crater Detection

【速读】:This paper addresses crater detection, a problem critical to safe lunar landings, where complex illumination and rugged terrain make high-precision, high-recall detection difficult for traditional methods. The key to the solution is a deep-learning crater detection algorithm (CDA) based on the OWLv2 Vision Transformer architecture, fine-tuned parameter-efficiently with Low-Rank Adaptation (LoRA) and trained with a joint objective combining a Complete Intersection over Union (CIoU) localization loss and a contrastive classification loss. On a high-resolution Lunar Reconnaissance Orbiter image dataset from the IMPACT project, it achieves up to 94.0% recall and 73.1% precision, substantially improving reliable detection under challenging lunar imaging conditions.

链接: https://arxiv.org/abs/2601.07795
作者: Patrick Bauer,Marius Schwinning,Florian Renk,Andreas Weinmann,Hichem Snoussi
机构: University of Technology of Troyes (特鲁瓦工程技术大学); Hochschule Darmstadt (达姆施塔特应用技术大学); GMV for European Space Agency (欧洲空间局GMV公司); European Space Agency (欧洲空间局); Technische Hochschule Würzburg-Schweinfurt (维尔茨堡-施韦因富特应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The European Space Agency (ESA), driven by its ambitions on planned lunar missions with the Argonaut lander, has a profound interest in reliable crater detection, since craters pose a risk to safe lunar landings. This task is usually addressed with automated crater detection algorithms (CDA) based on deep learning techniques. It is non-trivial due to the vast amount of craters of various sizes and shapes, as well as challenging conditions such as varying illumination and rugged terrain. Therefore, we propose a deep-learning CDA based on the OWLv2 model, which is built on a Vision Transformer, that has proven highly effective in various computer vision tasks. For fine-tuning, we utilize a manually labeled dataset from the IMPACT project, that provides crater annotations on high-resolution Lunar Reconnaissance Orbiter Camera Calibrated Data Record images. We insert trainable parameters using a parameter-efficient fine-tuning strategy with Low-Rank Adaptation, and optimize a combined loss function consisting of Complete Intersection over Union (CIoU) for localization and a contrastive loss for classification. We achieve satisfactory visual results, along with a maximum recall of 94.0% and a maximum precision of 73.1% on a test dataset from IMPACT. Our method achieves reliable crater detection across challenging lunar imaging conditions, paving the way for robust crater analysis in future lunar exploration.
zh
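For reference, the CIoU loss named above, in its standard form (Zheng et al., 2020):

```latex
\mathcal{L}_{\mathrm{CIoU}}
  = 1 - \mathrm{IoU}
  + \frac{\rho^2\!\big(\mathbf{b}, \mathbf{b}^{gt}\big)}{c^2}
  + \alpha v,
\qquad
v = \frac{4}{\pi^2}\left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{\!2},
\qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}
% rho: center distance between predicted box b and ground truth b^gt;
% c: diagonal of the smallest enclosing box; v penalizes aspect-ratio mismatch.
```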

[CV-6] Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training

【速读】:This paper targets the slow convergence of diffusion transformer (DiT) training, caused chiefly by the difficulty of representation learning in the shallow layers. Existing methods such as REPA rely on pretrained external semantic features (e.g., DINO) to speed up training, but this introduces extra dependencies and reduces flexibility. The key to the proposed Self-Transcendence method is to let the model supervise itself: the shallow DiT features are first supervised for a short phase (e.g., 40 epochs) by aligning them with the latent representations of the pretrained variational autoencoder (VAE), after which classifier-free guidance is applied to the intermediate features to enhance their discriminative power and semantic expressiveness, yielding high-quality internal supervision signals. These signals, learned entirely by the model itself, guide a new round of DiT training, markedly improving convergence speed and generation quality without any external pretrained model, and the approach is applicable to a range of diffusion-based generation tasks and backbone architectures.

链接: https://arxiv.org/abs/2601.07773
作者: Lingchen Sun,Rongyuan Wu,Zhengqiang Zhang,Ruibin Li,Yujing Sun,Shuaizheng Liu,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); OPPO Research Institute (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, this requires the use of pretrained external networks, introducing additional dependencies and reducing flexibility. In this work, we argue that DiTs actually have the power to guide the training of themselves, and propose \textbfSelf-Transcendence, a simple yet effective method that achieves fast convergence using internal feature supervision only. It is found that the slow convergence in DiT training primarily stems from the difficulty of representation learning in shallow layers. To address this, we initially train the DiT model by aligning its shallow features with the latent representations from the pretrained VAE for a short phase (e.g., 40 epochs), then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, are used as supervision signals to guide a new DiT training. Compared to existing self-contained methods, our approach brings a significant performance boost. It can even surpass REPA in terms of generation quality and convergence speed, but without the need for any external pretrained models. Our method is not only more flexible for different backbones but also has the potential to be adopted for a wider range of diffusion-based generative tasks. The source code of our method can be found at this https URL.
zh

[CV-7] Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在视频推理中面临的根本性矛盾:即冗长推理带来的高昂计算成本与高效但缺乏视觉依据的推理方法所导致的幻觉风险之间的权衡。解决方案的关键在于提出一种名为“证据链”(Chain of Evidence, CoE)的新框架,其核心是通过架构层面解耦感知锚定与推理效率,并进行协同优化。具体而言,CoE引入了两个关键创新:一是轻量级的证据锚定模块(Evidence Grounding Module, EGM),作为查询引导的过滤器,动态提取高保真度的紧凑视觉证据;二是基于强化学习优化的证据锚定协议(Evidence-Anchoring Protocol),设计复合奖励机制以强制推理过程严格参考识别出的时间锚点,从而有效抑制幻觉。

链接: https://arxiv.org/abs/2601.07761
作者: Yanxiang Huang,Guohua Gao,Zhaoyang Wei,Jianyuan Ni
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
zh

[CV-8] On the application of the Wasserstein metric to 2D curves classification

【速读】:该论文旨在解决2D曲线分类中如何聚焦于特定片段(fragments)以提升分类精度的问题。其解决方案的关键在于引入一系列基于离散概率测度(discrete probability measures)的Wasserstein距离变体,这些测度能够反映曲线不同片段的重要性权重,从而在计算距离时强化对关键区域的关注,实验表明该方法在考古学领域2D曲线聚类分析中具有良好的性能表现。

链接: https://arxiv.org/abs/2601.07749
作者: Agnieszka Kaliszewska,Monika Syga
机构: Institute of Biomedical Engineering and Nanotechnology, Polish Academy of Sciences (波兰科学院生物医学工程与纳米技术研究所); Warsaw University of Technology (华沙理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work we analyse a number of variants of the Wasserstein distance which allow the classification to focus on prescribed parts (fragments) of the classified 2D curves. These variants are based on discrete probability measures that reflect the importance of given fragments of the curves. The performance of this approach is tested through a series of experiments related to the clustering analysis of 2D curves, performed on data coming from the field of archaeology.
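A minimal sketch of such a fragment-weighted Wasserstein distance between two sampled 2D curves, using the POT library (pip install pot); the particular weighting below (up-weighting one fragment) is an illustrative assumption rather than the paper's exact measure.

```python
# Fragment-weighted Wasserstein distance between two 2D curves (sketch).
import numpy as np
import ot  # Python Optimal Transport

def weighted_wasserstein(curve_a, curve_b, weights_a, weights_b):
    # curve_*: (n, 2) sampled points; weights_*: per-point importance.
    a = weights_a / weights_a.sum()
    b = weights_b / weights_b.sum()
    M = ot.dist(curve_a, curve_b, metric="euclidean")  # pairwise cost matrix
    return ot.emd2(a, b, M)                            # exact OT cost

t = np.linspace(0, np.pi, 100)
A = np.stack([np.cos(t), np.sin(t)], axis=1)
B = A + np.random.normal(scale=0.02, size=A.shape)
w = np.ones(100)
w[:20] = 5.0  # emphasize the first fragment of the curve
print(weighted_wasserstein(A, B, w, w.copy()))
```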
zh

[CV-9] Evaluating the encoding competence of visual language models using uncommon actions

【速读】:该论文旨在解决当前视觉语言模型(Visual Language Models, VLMs)在处理非常规语义动作场景时存在的语义理解能力不足问题,尤其是模型难以区分语法正确性与语义合理性之间的差异。传统数据集多基于常见视觉场景,存在统计频率优势,而本研究构建的UAIT(Uncommon-sense Action Image-Text)数据集通过引入语法合理但违背常识的图像-文本配对,挑战模型对代理-受体关系和物理可行性的真实理解能力。解决方案的关键在于采用半自动化流程,结合大语言模型(Large Language Models, LLMs)、少样本提示工程(few-shot prompt engineering)与文本到图像生成技术,合成高质量的非常规语义样本,并设计细粒度多项选择题以评估模型的推理能力。实验表明,现有VLMs在该任务上显著落后于人类表现,且轻量级模型经微调后性能提升明显,揭示了方向性适配的巨大潜力。

链接: https://arxiv.org/abs/2601.07737
作者: Chen Ling,Nai Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose the UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets that focus on common visual scenes with statistical frequency advantages, UAIT challenges models with grammatically reasonable but semantically counter-commonsense image-text pairs. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process to synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model’s competence in fine-grained reasoning. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning. Experiments show that all models perform significantly worse than humans in semantic judgment, especially in distinguishing grammatical correctness from semantic rationality. Further experiments show that even a lightweight model can improve its accuracy after fine-tuning, demonstrating the great potential of directional adaptation. This study not only reveals the key weaknesses of VLMs, but also provides diagnostic tools and research directions for the development of robust models with real visual semantic reasoning capabilities.
zh

[CV-10] FMAC: a Fair Fiducial Marker Accuracy Comparison Software

【速读】:该论文旨在解决基于标记点(fiducial markers)的位姿估计(pose estimation)精度评估中缺乏公平比较标准的问题。现有方法因数据集不一致或仿真真实性不足,难以准确量化不同标记在六自由度(6 degrees of freedom)下的误差特性。解决方案的关键在于构建一套基于高保真合成图像的大规模测试框架,利用物理渲染引擎直接使用标准相机标定参数,精确模拟图像畸变、景深和衍射模糊等光学效应,并通过低差异采样策略系统性地覆盖位姿空间,从而实现对36种位姿组合下误差分布的可视化与定量分析。该方法显著提升了位姿估计算法评估的客观性和可重复性。

链接: https://arxiv.org/abs/2601.07723
作者: Guillaume J. Laurent,Patrick Sandoz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This paper presents a method for carrying out fair comparisons of the accuracy of pose estimation using fiducial markers. These comparisons rely on large sets of high-fidelity synthetic images enabling deep exploration of the 6 degrees of freedom. A low-discrepancy sampling of the space makes it possible to check the correlations between each degree of freedom and the pose errors by plotting the 36 pairs of combinations. The images are rendered using a physically based ray tracing code that has been specifically developed to use the standard calibration coefficients of any camera directly. The software reproduces image distortions, defocus and diffraction blur. Furthermore, sub-pixel sampling is applied to sharp edges to enhance the fidelity of the rendered image. After introducing the rendering algorithm and its experimental validation, the paper proposes a method for evaluating the pose accuracy. This method is applied to well-known markers, revealing their strengths and weaknesses for pose estimation. The code is open source and available on GitHub.
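The low-discrepancy exploration of the 6-DoF pose space can be sketched with a scrambled Sobol sequence; the pose bounds below are placeholder values, not those used by FMAC.

```python
# Low-discrepancy sampling of a 6-DoF pose space (sketch, assumed bounds).
import numpy as np
from scipy.stats import qmc

sampler = qmc.Sobol(d=6, scramble=True)
u = sampler.random(n=1024)  # points in [0, 1)^6; n is a power of 2

# Columns: x, y, z (metres) and roll, pitch, yaw (radians).
lo = [-0.1, -0.1, 0.2, -np.pi / 6, -np.pi / 6, -np.pi]
hi = [ 0.1,  0.1, 1.0,  np.pi / 6,  np.pi / 6,  np.pi]
poses = qmc.scale(u, lo, hi)
# Plotting pose error against each of the 36 pairs of DoF columns then
# reveals per-axis correlations, as the paper proposes.
```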
zh

[CV-11] Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition

【速读】:该论文旨在解决神经网络可解释性不足的问题,尤其是在传统方法难以有效逼近非单调函数时。其核心解决方案在于利用单调性特性提升模型的可解释性:首先,通过将训练好的ReLU网络分解为两个单调且凸的部分,克服了权重爆炸带来的数值不稳定性问题,并提出SplitCAM与SplitLRP两种显著性方法,在ImageNet-S数据集上优于现有SOTA指标;其次,创新性地采用两个单调神经网络之差作为模型结构,从而实现内在的自解释能力(self-explainability)。

链接: https://arxiv.org/abs/2601.07700
作者: Jakob Paul Zimmermann,Georg Loho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network. We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods – SplitCAM and SplitLRP – improve on state-of-the-art results on both VGG16 and ResNet18 networks on ImageNet-S across all Quantus saliency metric categories. Second, we show that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.
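The core algebraic fact behind the decomposition can be illustrated in a few lines: any linear map splits into a difference of two entrywise-nonnegative (hence monotone and convex) maps. The paper's actual contribution is carrying this through full trained ReLU networks while taming the resulting weight blowup; the snippet below only shows the single-layer split.

```python
# One-layer illustration of the DC split W = W+ - W- (not the full procedure).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

W_pos = np.clip(W, 0.0, None)    # keeps the positive entries
W_neg = np.clip(-W, 0.0, None)   # magnitudes of the negative entries
# Wx = W+ x - W- x: a difference of two monotone, convex linear maps.
assert np.allclose(W @ x, W_pos @ x - W_neg @ x)
```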
zh

[CV-12] Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在3D场景理解中进行精确数值预测时面临的瓶颈问题,特别是传统强化学习(Reinforcement Learning, RL)方法因奖励稀疏性和梯度不稳定性导致难以有效利用3D物理约束提供的可验证信号。其关键解决方案是提出Smooth Numerical Reward Activation (SNRA)算子与Absolute-Preserving GRPO (AP-GRPO)框架:SNRA通过动态参数化的Sigmoid函数将原始反馈映射为密集连续的奖励流,缓解“近似错误”样本的优势坍塌问题;AP-GRPO则引入绝对标量梯度以保留数值信息,克服传统相对排序机制中的信息损失,从而显著提升数据利用效率并激活VLMs在3D推理中的潜在能力。

链接: https://arxiv.org/abs/2601.07695
作者: Siwen Jiao,Tianxiong Lv,Kangan Qian,Chenxu Zhao,Xiuyuan Zhu,Tianlun Li,Xiaolong Cheng,Jinyu Li,Zhihao Liao,Yang Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes “near-miss” samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
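In the spirit of SNRA, a smooth verifiable reward can be sketched as a sigmoid of the numeric error, so near-miss predictions retain a usable learning signal; the tolerance and sharpness parameters below are illustrative, not the paper's dynamic parameterization.

```python
# Smooth, sigmoid-shaped verifiable reward (sketch with assumed parameters).
import numpy as np

def smooth_numeric_reward(pred, target, tau=0.25, beta=0.05):
    err = np.abs(pred - target)
    # ~1 when err << tau, ~0 when err >> tau, dense in between.
    return 1.0 / (1.0 + np.exp((err - tau) / beta))

for p in [3.00, 3.20, 3.60]:
    print(p, smooth_numeric_reward(p, target=3.0))
# A binary reward would score 3.20 and 3.60 identically (both "wrong");
# the smooth reward preserves their ordering, keeping near-miss advantages.
```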
zh

[CV-13] Leverag ing 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation

【速读】:该论文旨在解决机器人任务(如自动驾驶)中3D LiDAR数据稀缺的问题,现有生成模型在LiDAR点云上的性能受限于数据规模,难以媲美RGB图像领域中数百万样本的数据集。其解决方案的关键在于提出R3DPA,首次将图像预训练先验知识迁移至LiDAR点云生成任务,并结合自监督3D表征学习实现高质量场景合成:(i) 通过对齐生成模型中间特征与自监督3D特征,显著提升生成质量;(ii) 利用大规模图像预训练生成模型的知识缓解LiDAR数据不足问题;(iii) 在推理阶段仅使用无条件模型即可实现点云控制,支持对象修补和场景混合等高级操作。

链接: https://arxiv.org/abs/2601.07692
作者: Nicolas Sereyjol-Garros,Ellington Kirby,Victor Besnier,Nermin Samet
机构: Valeo.ai(瓦莱奥人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at this https URL.
zh

[CV-14] Advancing Multinational License Plate Recognition Through Synthetic and Real Data Fusion: A Comprehensive Evaluation

【速读】:该论文旨在解决自动车牌识别(Automatic License Plate Recognition, ALPR)系统在实际应用中因训练数据有限或分布不均导致性能受限的问题。其解决方案的关键在于系统性地融合真实数据与合成数据,通过三种不同的合成数据生成方法——模板生成、字符排列和生成对抗网络(Generative Adversarial Network, GAN)——显著提升模型在跨数据集和单一数据集场景下的识别准确率。实验表明,这些方法的协同使用具有显著的增益效应,使得端到端识别性能超越现有最先进方法及商用系统,同时在小样本训练条件下仍能保持优异表现,有效缓解了数据稀缺带来的挑战。

链接: https://arxiv.org/abs/2601.07671
作者: Rayson Laroca,Valter Estevam,Gladston J. P. Moreira,Rodrigo Minetto,David Menotti
机构: Pontifical Catholic University of Paraná (PUCPR); Federal University of Paraná (UFPR); Federal Institute of Paraná (IFPR); Federal University of Ouro Preto (UFOP); Federal University of Technology-Paraná (UTFPR)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IET Intelligent Transport Systems, vol. 19, no. 1, p. e70086, 2025

点击查看摘要

Abstract:Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. Primarily, the massive incorporation of synthetic data substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in both intra-dataset and cross-dataset settings.
zh

[CV-15] Variational Contrastive Learning for Skeleton-based Action Recognition

【速读】:该论文旨在解决基于骨架的动作识别中,现有对比学习方法因本质上具有判别性而难以捕捉人类运动固有的变异性和不确定性的问题。其解决方案的关键在于提出一种变分对比学习框架(Variational Contrastive Learning Framework),通过将概率潜在建模与对比自监督学习相结合,从而学习到结构化且语义明确的表示,该表示在不同数据集和监督水平下均具备良好的泛化能力。

链接: https://arxiv.org/abs/2601.07666
作者: Dang Dinh Nguyen,Decky Aspandi Latif,Titus Zaharia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across different datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that the features produced by our method are more relevant to the motion and sample characteristics, focusing more on important skeleton joints than those of competing methods.
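A minimal sketch of a variational contrastive objective along these lines: a reparameterized latent trained with InfoNCE across two augmented views, plus a KL term toward a unit Gaussian prior. Dimensions and the KL weight are assumptions, not the paper's configuration.

```python
# Variational contrastive loss: InfoNCE on reparameterized latents + KL (sketch).
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def vcl_loss(mu1, logvar1, mu2, logvar2, temp=0.1, beta=1e-3):
    z1 = F.normalize(reparameterize(mu1, logvar1), dim=-1)
    z2 = F.normalize(reparameterize(mu2, logvar2), dim=-1)
    logits = z1 @ z2.T / temp                 # (B, B) view-to-view similarity
    labels = torch.arange(z1.size(0))         # positives on the diagonal
    nce = F.cross_entropy(logits, labels)
    kl = -0.5 * (1 + logvar1 - mu1.pow(2) - logvar1.exp()).sum(-1).mean()
    return nce + beta * kl

loss = vcl_loss(torch.randn(8, 128), torch.zeros(8, 128),
                torch.randn(8, 128), torch.zeros(8, 128))
```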
zh

[CV-16] StdGEN: A Comprehensive System for Semantic-Decomposed 3D Character Generation CVPR2025

【速读】:该论文旨在解决现有3D生成方法通常产出结构单一的网格(monolithic meshes),缺乏工业级管线所需的结构灵活性的问题,尤其是在游戏与动画制作中对可编辑性和物理合规性的需求。其解决方案的关键在于提出StdGEN++系统,该系统基于双分支语义感知大重建模型(Dual-Branch S-LRM),在前向传播过程中联合重建几何、颜色及部件级语义信息;同时引入一种兼容混合隐式场(hybrid implicit fields)的语义表面提取形式化机制,并通过粗到精的提案策略显著降低内存占用,实现高分辨率网格生成;此外,还设计了基于视频扩散的纹理分解模块,将外观解耦为可编辑层(如分离虹膜与皮肤),有效缓解面部区域的语义混淆问题,从而实现结构独立性,支持非破坏性编辑、物理合规动画和注视追踪等下游应用。

链接: https://arxiv.org/abs/2601.07660
作者: Yuze He,Yanning Zhou,Wang Zhao,Jingwen Ye,Zhongkai Wu,Ran Yi,Yong-Jin Liu
机构: Tsinghua University (清华大学); Tencent AIPD (腾讯AI产品部); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 12 figures. Extended version of CVPR 2025 paper arXiv:2411.05738

点击查看摘要

Abstract:We present StdGEN++, a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. Existing 3D generative methods often produce monolithic meshes that lack the structural flexibility required by industrial pipelines in gaming and animation. Addressing this gap, StdGEN++ is built upon a Dual-branch Semantic-aware Large Reconstruction Model (Dual-Branch S-LRM), which jointly reconstructs geometry, color, and per-component semantics in a feed-forward manner. To achieve production-level fidelity, we introduce a novel semantic surface extraction formalism compatible with hybrid implicit fields. This mechanism is accelerated by a coarse-to-fine proposal scheme, which significantly reduces memory footprint and enables high-resolution mesh generation. Furthermore, we propose a video-diffusion-based texture decomposition module that disentangles appearance into editable layers (e.g., separated iris and skin), resolving semantic confusion in facial regions. Experiments demonstrate that StdGEN++ achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. Crucially, the resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking, making it a robust solution for automated character asset production.
zh

[CV-17] GeoMotionGPT : Geometry-Aligned Motion Understanding with Large Language Models

【速读】:该论文旨在解决现有离散运动标记化(Discrete Motion Tokenization)方法中运动空间的几何结构与语言模型(LLM)嵌入空间缺乏对齐的问题,从而限制了大语言模型在动作理解与动作-语言推理任务中的细粒度能力。解决方案的关键在于构建一个统一的几何基础:通过显式强制运动码本(codebook)和LLM嵌入空间的正交性(orthogonality),使二者的关系结构自然映射,避免模型从零开始重建复杂的运动标记间几何关系。具体实现上,采用带有Gumbel-Softmax的解码器-only量化器进行可微训练,并引入稀疏投影将运动码映射至LLM嵌入空间以保持正交性;同时设计两阶段正交正则化调度策略,在量化器训练和LLM微调中施加软约束,确保几何一致性的同时不损害语义适应能力。

链接: https://arxiv.org/abs/2601.07632
作者: Zhankai Ye,Bofan Li,Yukai Jin,Shuoqiu Li,Wei Wang,Yanfu Zhang,Shangqian Gao,Xin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM’s capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.
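The soft orthonormality constraint can be sketched as a Frobenius-norm penalty on the Gram matrix of the (row-normalized) code embeddings; the same penalty would apply to the projected codes in the LLM embedding space. This is a sketch of the general technique, not the paper's exact regularizer.

```python
# Soft orthonormality regularizer for a motion codebook (sketch).
import torch
import torch.nn.functional as F

def orthonormal_reg(codebook):
    # codebook: (K, d) code embeddings, assuming K <= d
    E = F.normalize(codebook, dim=-1)
    gram = E @ E.T
    eye = torch.eye(E.size(0), device=E.device)
    return ((gram - eye) ** 2).sum()  # squared Frobenius norm of the deviation

reg = orthonormal_reg(torch.randn(512, 1024, requires_grad=True))
```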
zh

[CV-18] PARL: Position-Aware Relation Learning Network for Document Layout Analysis

【速读】:该论文旨在解决当前文档版面分析(Document Layout Analysis)方法对高质量光学字符识别(OCR)的高度依赖所带来的问题,即文本识别错误的传播和计算开销过大,从而限制了多模态方法的鲁棒性和实用性。其解决方案的关键在于摒弃文本与视觉特征融合的主流范式,转而构建一个纯视觉(OCR-free, vision-only)框架PARL(Position-Aware Relation Learning Network),通过引入双向空间位置引导的可变形注意力模块显式建模布局元素间的空间依赖关系,并设计图结构精炼分类器(GRC)动态构建布局图以捕捉上下文关联,实现对文档内在视觉结构的深度理解,最终在多个基准数据集上达到最优性能且参数量显著减少。

链接: https://arxiv.org/abs/2601.07620
作者: Fuyuan Liu,Dianyu Yu,He Ren,Nayu Liu,Xiaomian Kang,Delai Qiu,Fa Zhang,Genpeng Zhen,Shengping Liu,Jiaen Liang,Wei Huang,Yining Wang,Junnan Zhu
机构: Unisound AI Technology Co.Ltd(声智科技有限公司); MAIS, Institute of Automation, CAS(中国科学院自动化研究所); Beihang University(北京航空航天大学); School of Computer Science and Technology, Tiangong University(天津工业大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document layout analysis aims to detect and categorize structural elements (e.g., titles, tables, figures) in scanned or digital documents. Popular methods often rely on high-quality Optical Character Recognition (OCR) to merge visual features with extracted text. This dependency introduces two major drawbacks: propagation of text recognition errors and substantial computational overhead, limiting the robustness and practical applicability of multimodal approaches. In contrast to the prevailing multimodal trend, we argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents’ intrinsic visual structure. To this end, we propose PARL (Position-Aware Relation Learning Network), a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Specifically, we first introduce a Bidirectional Spatial Position-Guided Deformable Attention module to embed explicit positional dependencies among layout elements directly into visual features. Second, we design a Graph Refinement Classifier (GRC) to refine predictions by modeling contextual relationships through a dynamically constructed layout graph. Extensive experiments show PARL achieves state-of-the-art results. It establishes a new benchmark for vision-only methods on DocLayNet and, notably, surpasses even strong multimodal models on M6Doc. Crucially, PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models (256M), demonstrating that sophisticated visual structure modeling can be both more efficient and robust than multimodal fusion.
zh

[CV-19] UIKA: Fast Universal Head Avatar from Pose-Free Images

【速读】:该论文旨在解决现有虚拟头像建模方法依赖高成本多视角采集设备(如工作室级系统)且需长时间优化才能重建个性化模型的问题。其关键解决方案在于提出一种基于前馈网络的可动画高斯头像模型UIKA,通过引入UV-guided建模策略实现输入图像到参数空间的像素级对应关系估计,从而将颜色信息从屏幕空间重投影至与相机姿态和表情无关的UV空间;同时设计可学习的UV令牌(learnable UV tokens),使注意力机制能在屏幕和UV两个层级上运行,并利用多视角聚合的UV信息解码出规范化的高斯属性,最终在单目和多视角设置下均显著优于现有方法。

链接: https://arxiv.org/abs/2601.07603
作者: Zijian Wu,Boyao Zhou,Liangxiao Hu,Hongyu Liu,Yuan Sun,Xuan Wang,Xun Cao,Yujun Shen,Hao Zhu
机构: Nanjing University (南京大学); Ant Group (蚂蚁集团); HKUST (香港科技大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike traditional avatar methods, which require a studio-level multi-view capture system and reconstruct a human-specific model through a lengthy optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise facial correspondence estimation. Such correspondence estimation allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings. Project page: this https URL
zh

[CV-20] Diffusion in SPAD Signals

【速读】:该论文旨在解决基于单光子雪崩二极管(SPAD)信号的逆问题求解难题,特别是如何从具有随机性和非线性特性的探测事件时间数据中准确推断出光源的物理参数(如光子通量)。其解决方案的关键在于推导出原始信号的似然函数以及对应的得分函数(score function),该得分函数为利用扩散模型(diffusion model)构建图像先验提供了理论基础,从而能够有效利用探测事件的时间信息,在低光或高光子计数条件下实现更精确的重建。

链接: https://arxiv.org/abs/2601.07599
作者: Lior Dvir,Nadav Torem,Yoav Y. Schechner
机构: Technion-Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We derive the likelihood of a raw signal in a single photon avalanche diode (SPAD), given a fixed photon flux. The raw signal comprises timing of detection events, which are nonlinearly related to the flux. Moreover, they are naturally stochastic. We then derive a score function of the signal. This is key to solving inverse problems based on SPAD signals. We focus on deriving solutions involving a diffusion model, to express image priors. We demonstrate the effect of low or high photon counts, and the consequences of exploiting the timing of detection events.
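For intuition, an idealized free-running photon counter follows a homogeneous Poisson model, for which the log-likelihood and score have closed forms; the sketch below ignores dead time and afterpulsing, which the paper's SPAD-specific derivation accounts for.

```python
# Idealized Poisson-counting model (sketch, not the paper's SPAD likelihood):
# n events in exposure T have log-likelihood l(flux) = n*log(flux) - flux*T
# (up to a flux-free constant), so the score is dl/dflux = n/flux - T.
import numpy as np

rng = np.random.default_rng(1)
true_flux, T = 50.0, 1.0
arrival_times = np.cumsum(rng.exponential(1.0 / true_flux, size=200))
events = arrival_times[arrival_times < T]

def score(flux, n_events, exposure):
    return n_events / flux - exposure

print(score(true_flux, len(events), T))  # near zero around the true flux
print(score(20.0, len(events), T))       # positive: flux underestimated
```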
zh

[CV-21] Robust Multicentre Detection and Classification of Colorectal Liver Metastases on CT: Application of Foundation Models

【速读】:该论文旨在解决多中心环境下基于增强CT影像对结直肠肝转移瘤(Colorectal Liver Metastases, CRLM)的可靠检测与患者级分类难题。其解决方案的关键在于构建了一个基于基础模型(foundation model)的AI流程,集成不确定性量化(uncertainty quantification)与可解释性(explainability)技术:首先采用UMedPT预训练模型,并分别通过MLP头和FCOS头部实现分类与病灶检测;其次利用不确定性分析剔除高不确定样本以提升性能(AUC从0.90升至0.91),并通过决策曲线分析验证其临床实用性(阈值概率区间为0.30–0.40);最后借助Grad-CAM可视化工具增强模型可信度,在高置信度病例中明确识别出与病灶对应的区域,从而在异质性CT数据中实现鲁棒且可解释的CRLM检测与分类。

链接: https://arxiv.org/abs/2601.07585
作者: Shruti Atul Mali,Zohaib Salahuddin,Yumeng Zhang,Andre Aichert,Xian Zhong,Henry C. Woodruff,Maciej Bobowicz,Katrine Riklund,Juozas Kupčinskas,Lorenzo Faggioni,Roberto Francischello,Razvan L Miclea,Philippe Lambin(on behalf of EUCanImage working group)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Colorectal liver metastases (CRLM) are a major cause of cancer-related mortality, and reliable detection on CT remains challenging in multi-centre settings. We developed a foundation model-based AI pipeline for patient-level classification and lesion-level detection of CRLM on contrast-enhanced CT, integrating uncertainty quantification and explainability. CT data from the EuCanImage consortium (n=2437) and an external TCIA cohort (n=197) were used. Among several pretrained models, UMedPT achieved the best performance and was fine-tuned with an MLP head for classification and an FCOS-based head for lesion detection. The classification model achieved an AUC of 0.90 and a sensitivity of 0.82 on the combined test set, with a sensitivity of 0.85 on the external cohort. Excluding the most uncertain 20 percent of cases improved AUC to 0.91 and balanced accuracy to 0.86. Decision curve analysis showed clinical benefit for threshold probabilities between 0.30 and 0.40. The detection model identified 69.1 percent of lesions overall, increasing from 30 percent to 98 percent across lesion size quartiles. Grad-CAM highlighted lesion-corresponding regions in high-confidence cases. These results demonstrate that foundation model-based pipelines can support robust and interpretable CRLM detection and classification across heterogeneous CT data.
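The uncertainty-rejection analysis reported above can be sketched as ranking cases by predictive entropy, discarding the most uncertain 20%, and recomputing AUC on the retained cases; the probabilities and labels below are synthetic placeholders.

```python
# Uncertainty-based case exclusion with AUC recomputation (sketch, toy data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
p = np.clip(y * 0.7 + rng.normal(0, 0.25, size=500), 0.01, 0.99)

# Binary predictive entropy as the uncertainty measure.
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
keep = entropy <= np.quantile(entropy, 0.80)  # retain the 80% most certain

print("AUC (all):     ", roc_auc_score(y, p))
print("AUC (retained):", roc_auc_score(y[keep], p[keep]))
```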
zh

[CV-22] BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation

【速读】:该论文旨在解决食品图像分割(Food Image Segmentation)在多视角场景下因数据有限和泛化能力差而导致的性能下降问题。现有方法难以在新视角下保持分割精度,限制了其在饮食分析中的应用。解决方案的关键在于构建一个大规模、高标注质量的多视角食物视频分割数据集BenchSeg,该数据集包含55个菜品场景和25,284帧精细标注图像,覆盖自由360°相机运动下的多视角观测;同时引入视频记忆模块(video-memory modules)增强模型的时间一致性,使基于SeTR-MLA与XMem2融合的模型在新视角下仍能保持稳定分割性能,相比先前方法(如FoodMem)mAP提升约2.63%,显著提升了食品分割与跟踪的鲁棒性与实用性。

链接: https://arxiv.org/abs/2601.07581
作者: Ahmad AlMughrabi,Guillermo Rivo,Carlos Jiménez-Farfán,Umair Haroon,Farid Al-Areqi,Hyunjun Jung,Benjamin Busam,Ricardo Marques,Petia Radeva
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Food image segmentation is a critical task for dietary analysis, enabling accurate estimation of food volume and nutrients. However, current methods suffer from limited multi-view data and poor generalization to new viewpoints. We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes (from Nutrition5k, Vegetables Fruits, MetaFood3D, and FoodKit) with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We evaluate a diverse set of 20 state-of-the-art segmentation models (e.g., SAM-based, transformer, CNN, and large multimodal) on the existing FoodSeg103 dataset and then assess them (alone and combined with video-memory modules) on BenchSeg. Quantitative and qualitative results demonstrate that while standard image segmenters degrade sharply under novel viewpoints, memory-augmented methods maintain temporal consistency across frames. Our best model based on a combination of SeTR-MLA+XMem2 outperforms prior work (e.g., improving over FoodMem by ~2.63% mAP), offering new insights into food segmentation and tracking for dietary analysis. We release BenchSeg to foster future research. The project page including the dataset annotations and the food segmentation models can be found at this https URL.
zh

[CV-23] A Multimodal Dataset of Student Oral Presentations with Sensors and Evaluation Data

【速读】:该论文旨在解决当前高等教育中学生口头表达能力评估缺乏多模态真实场景数据集的问题。现有研究在捕捉学生在实际课堂环境中表现时,往往受限于单一模态数据或实验室环境下的行为模拟,难以全面反映学生的言语、非言语行为及生理反应与演讲绩效之间的复杂关系。解决方案的关键在于构建并公开发布SOPHIAS(Student Oral Presentation monitoring for Holistic Insights Analytics using Sensors)数据集,该数据集包含65名本科生和硕士生在真实教室环境下进行的50场口头报告(含问答环节),通过8个同步传感器流(高清网络摄像头、环境与摄像头音频、眼动追踪眼镜、智能手表生理传感器、点击器、键盘和鼠标交互)采集多模态行为与生理信号,并整合教师、同伴和自我评分的评分表及时间戳标注的上下文信息,从而为探索多模态信号与演讲表现的关系、支持同伴评价研究以及开发自动化反馈和多模态学习分析工具提供高质量基准数据。

链接: https://arxiv.org/abs/2601.07576
作者: Alvaro Becerra,Ruth Cobos,Roberto Daza
机构: Universidad Autonoma de Madrid, School of Engineering (马德里自治大学工程学院)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Article under review in the journal Scientific Data. GitHub repository of the dataset at: this https URL

点击查看摘要

Abstract:Oral presentation skills are a critical component of higher education, yet comprehensive datasets capturing real-world student performance across multiple modalities remain scarce. To address this gap, we present SOPHIAS (Student Oral Presentation monitoring for Holistic Insights Analytics using Sensors), a 12-hour multimodal dataset containing recordings of 50 oral presentations (10-15-minute presentation followed by 5-15-minute QA) delivered by 65 undergraduate and master’s students at the Universidad Autonoma de Madrid. SOPHIAS integrates eight synchronized sensor streams from high-definition webcams, ambient and webcam audio, eye-tracking glasses, smartwatch physiological sensors, and clicker, keyboard, and mouse interactions. In addition, the dataset includes slides and rubric-based evaluations from teachers, peers, and self-assessments, along with timestamped contextual annotations. The dataset captures presentations conducted in real classroom settings, preserving authentic student behaviors, interactions, and physiological responses. SOPHIAS enables the exploration of relationships between multimodal behavioral and physiological signals and presentation performance, supports the study of peer assessment, and provides a benchmark for developing automated feedback and Multimodal Learning Analytics tools. The dataset is publicly available for research through GitHub and Science Data Bank.
zh

[CV-24] ViewMorpher3D: A 3D-aware Diffusion Framework for Multi-Camera Novel View Synthesis in Autonomous Driving

【速读】:该论文旨在解决自动驾驶系统中多视角图像在闭环仿真器构建过程中存在的图像质量不足问题,尤其是由3D重建技术(如Gaussian Splatting)生成的视图在视角外推或观测稀疏场景下易出现伪影、缺乏细节和跨视角一致性差的问题。解决方案的关键在于提出ViewMorpher3D框架,该框架基于图像扩散模型,通过联合处理一组由相机位姿、3D几何先验以及时间相邻或空间重叠参考视图共同条件化的渲染视图,实现缺失细节的推理、渲染伪影抑制与跨视角一致性约束,从而显著提升图像的逼真度和多视角一致性,且支持不同数量摄像头及灵活的参考/目标视图配置,适用于多样化的传感器部署场景。

链接: https://arxiv.org/abs/2601.07540
作者: Farhad G. Zanjani,Hong Cai,Amirhossein Habibian
机构: Qualcomm AI Research (高通人工智能研究); Qualcomm Technologies, Inc. (高通技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper and supplementary materials

点击查看摘要

Abstract:Autonomous driving systems rely heavily on multi-view images to ensure accurate perception and robust decision-making. To effectively develop and evaluate perception stacks and planning algorithms, realistic closed-loop simulators are indispensable. While 3D reconstruction techniques such as Gaussian Splatting offer promising avenues for simulator construction, the rendered novel views often exhibit artifacts, particularly in extrapolated perspectives or when available observations are sparse. We introduce ViewMorpher3D, a multi-view image enhancement framework based on image diffusion models, designed to elevate photorealism and multi-view coherence in driving scenes. Unlike single-view approaches, ViewMorpher3D jointly processes a set of rendered views conditioned on camera poses, 3D geometric priors, and temporally adjacent or spatially overlapping reference views. This enables the model to infer missing details, suppress rendering artifacts, and enforce cross-view consistency. Our framework accommodates variable numbers of cameras and flexible reference/target view configurations, making it adaptable to diverse sensor setups. Experiments on real-world driving datasets demonstrate substantial improvements in image quality metrics, effectively reducing artifacts while preserving geometric fidelity.
zh

[CV-25] Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization

【速读】:该论文旨在解决当前增强现实/虚拟现实(AR/VR)应用中沉浸式远程协作所面临的两大核心挑战:一是现有全身体感全息传输系统依赖复杂的多相机硬件和高带宽 volumetric streaming,难以在移动设备上实现实时性能;二是缺乏高效、低成本的单目输入驱动的高质量3D人体建模与实时渲染方案。解决方案的关键在于提出Mon3tr框架,首次将基于3D Gaussian splatting(3DGS)的参数化人体建模引入单目3D telepresence系统,采用“一次性离线多视角重建+在线单目推理”的分阶段计算策略:离线阶段构建用户专属的3DGS人体模型,线上阶段仅用单目RGB摄像头捕捉动作与表情并驱动该模型,通过WebRTC数据通道以0.2 Mbps低带宽传输特征信息,在接收端利用轻量级3DGS属性变形网络实现每秒约60帧的逼真动态渲染,从而在保证图像质量(PSNR > 28 dB)的同时实现端到端延迟低于80 ms且带宽降低1000倍的实时交互效果。

链接: https://arxiv.org/abs/2601.07518
作者: Fangyu Lin,Yingdong Hu,Zhening Liu,Yufan Zhuang,Zehong Lin,Jun Zhang
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Immersive telepresence aims to transform human interaction in AR/VR applications by enabling lifelike full-body holographic representations for enhanced remote collaboration. However, existing systems rely on hardware-intensive multi-camera setups and demand high bandwidth for volumetric streaming, limiting their real-time performance on mobile devices. To overcome these challenges, we propose Mon3tr, a novel Monocular 3D telepresence framework that integrates 3D Gaussian splatting (3DGS) based parametric human modeling into telepresence for the first time. Mon3tr adopts an amortized computation strategy, dividing the process into a one-time offline multi-view reconstruction phase to build a user-specific avatar and a monocular online inference phase during live telepresence sessions. A single monocular RGB camera is used to capture body motions and facial expressions in real time to drive the 3DGS-based parametric human model, significantly reducing system complexity and cost. The extracted motion and appearance features are transmitted at 0.2 Mbps over WebRTC’s data channel, allowing robust adaptation to network fluctuations. On the receiver side, e.g., Meta Quest 3, we develop a lightweight 3DGS attribute deformation network to dynamically generate corrective 3DGS attribute adjustments on the pre-built avatar, synthesizing photorealistic motion and appearance at ~ 60 FPS. Extensive experiments demonstrate the state-of-the-art performance of our method, achieving a PSNR of 28 dB for novel poses, an end-to-end latency of ~ 80 ms, and 1000x bandwidth reduction compared to point-cloud streaming, while supporting real-time operation from monocular inputs across diverse scenarios. Our demos can be found at this https URL.
zh

[CV-26] Anatomy Aware Cascade Network: Bridging Epistemic Uncertainty and Geometric Manifold for 3D Tooth Segmentation

【速读】:该论文旨在解决锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)中牙齿三维(3D)分割的高保真度难题,尤其针对自然咬合状态下因对比度低和牙弓边界模糊导致的粘连伪影问题。解决方案的关键在于提出一种“解剖感知级联网络”(Anatomy Aware Cascade Network, AACNet),其核心创新包括两个机制:一是基于熵门控的边界精修模块(Ambiguity Gated Boundary Refiner, AGBR),通过不确定性区域的特征定向修正缓解边界模糊;二是基于符号距离图引导的解剖注意力机制(Signed Distance Map guided Anatomical Attention, SDMAA),利用隐式几何约束强化拓扑一致性,避免标准池化带来的空间细节丢失。该方法在125例CBCT数据上实现Dice相似系数90.17%和95% Hausdorff距离3.63 mm,且在外部测试集上HD95达2.19 mm,显著优于现有最优方法,具备良好的临床应用泛化能力。

链接: https://arxiv.org/abs/2601.07499
作者: Bing Yu,Liu Shi,Haitao Wang,Deran Qi,Xiang Cai,Wei Zhong,Qiegen Liu
机构: Nanchang University (南昌大学); First Affiliated Hospital, Jiangxi Medical College, Nanchang University (南昌大学第一附属医院); Second Clinical Medical College, Nanchang University (南昌大学第二临床医学院); First Clinical Medical College, Nanchang University (南昌大学第一临床医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate three-dimensional (3D) tooth segmentation from Cone-Beam Computed Tomography (CBCT) is a prerequisite for digital dental workflows. However, achieving high-fidelity segmentation remains challenging due to adhesion artifacts in naturally occluded scans, which are caused by low contrast and indistinct inter-arch boundaries. To address these limitations, we propose the Anatomy Aware Cascade Network (AACNet), a coarse-to-fine framework designed to resolve boundary ambiguity while maintaining global structural consistency. Specifically, we introduce two mechanisms: the Ambiguity Gated Boundary Refiner (AGBR) and the Signed Distance Map guided Anatomical Attention (SDMAA). The AGBR employs an entropy based gating mechanism to perform targeted feature rectification in high uncertainty transition zones. Meanwhile, the SDMAA integrates implicit geometric constraints via signed distance map to enforce topological consistency, preventing the loss of spatial details associated with standard pooling. Experimental results on a dataset of 125 CBCT volumes demonstrate that AACNet achieves a Dice Similarity Coefficient of 90.17 % and a 95% Hausdorff Distance of 3.63 mm, significantly outperforming state-of-the-art methods. Furthermore, the model exhibits strong generalization on an external dataset with an HD95 of 2.19 mm, validating its reliability for downstream clinical applications such as surgical planning. Code for AACNet is available at this https URL.
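The entropy-gating idea behind AGBR can be sketched as computing voxel-wise predictive entropy from the coarse-stage softmax and restricting refinement to high-uncertainty transition zones; the threshold is an illustrative assumption.

```python
# Entropy-based gate over a coarse 3D segmentation (sketch, assumed threshold).
import torch

def uncertainty_gate(logits, tau=0.5):
    # logits: (B, C, D, H, W) coarse-stage predictions
    p = torch.softmax(logits, dim=1)
    entropy = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)  # (B, D, H, W)
    return (entropy > tau).float()  # 1 where features should be rectified

gate = uncertainty_gate(torch.randn(1, 2, 8, 32, 32))
print(gate.mean())  # fraction of voxels routed to the refiner
```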
zh

[CV-27] FocalOrder: Focal Preference Optimization for Reading Order Detection

【速读】:该论文旨在解决文档理解中阅读顺序检测(Reading Order Detection)的性能瓶颈问题,特别是现有方法依赖均匀监督所隐含的“布局区域难度分布恒定”假设所带来的局限性。研究揭示了“位置差异性(Positional Disparity)”现象:模型在文档的起始和结束区域表现良好,但在复杂中间段落区域性能显著下降,根源在于标准训练过程中大量简单模式淹没了困难布局的学习信号。解决方案的关键在于提出FocalOrder框架,其核心是基于焦点偏好优化(Focal Preference Optimization, FPO),通过自适应难度发现机制(采用指数移动平均)动态识别难学过渡区域,并引入难度校准的成对排序目标函数以强化全局逻辑一致性,从而有效提升对复杂文档结构的理解能力。

链接: https://arxiv.org/abs/2601.07483
作者: Fuyuan Liu,Dianyu Yu,He Ren,Nayu Liu,Xiaomian Kang,Delai Qiu,Fa Zhang,Genpeng Zhen,Shengping Liu,Jiaen Liang,Wei Huang,Yining Wang,Junnan Zhu
机构: Unisound AI Technology Co.Ltd (云知声人工智能科技有限公司); MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所MAIS); Beihang University (北京航空航天大学); School of Computer Science and Technology, Tiangong University (天津工业大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reading order detection is the foundation of document understanding. Most existing methods rely on uniform supervision, implicitly assuming a constant difficulty distribution across layout regions. In this work, we challenge this assumption by revealing a critical flaw: Positional Disparity, a phenomenon where models demonstrate mastery over the deterministic start and end regions but suffer a performance collapse in the complex intermediate sections. This degradation arises because standard training allows the massive volume of easy patterns to drown out the learning signals from difficult layouts. To address this, we propose FocalOrder, a framework driven by Focal Preference Optimization (FPO). Specifically, FocalOrder employs adaptive difficulty discovery with exponential moving average mechanism to dynamically pinpoint hard-to-learn transitions, while introducing a difficulty-calibrated pairwise ranking objective to enforce global logical consistency. Extensive experiments demonstrate that FocalOrder establishes new state-of-the-art results on OmniDocBench v1.0 and Comp-HRDoc. Our compact model not only outperforms competitive specialized baselines but also significantly surpasses large-scale general VLMs. These results demonstrate that aligning the optimization with intrinsic structural ambiguity of documents is critical for mastering complex document structures.
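A rough sketch of EMA-based difficulty discovery: each ordering transition keeps a running difficulty estimate updated from its current loss, and harder transitions receive larger weight in the ranking objective. The momentum and focal-style weighting below are assumptions for illustration.

```python
# EMA difficulty tracking with focal-style weighting (sketch, assumed params).
import torch

class DifficultyTracker:
    def __init__(self, num_transitions, momentum=0.9):
        self.m = momentum
        self.difficulty = torch.zeros(num_transitions)

    def update(self, idx, loss):
        # Running EMA of per-transition loss.
        self.difficulty[idx] = self.m * self.difficulty[idx] + (1 - self.m) * loss

    def weight(self, idx, gamma=2.0):
        d = self.difficulty[idx] / (self.difficulty.max() + 1e-8)
        return d ** gamma  # emphasize hard-to-learn transitions

tracker = DifficultyTracker(num_transitions=1000)
tracker.update(torch.tensor([3, 7]), torch.tensor([0.8, 0.1]))
print(tracker.weight(torch.tensor([3, 7])))
```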
zh

[CV-28] ask Prototype-Based Knowledge Retrieval for Multi-Task Learning from Partially Annotated Data AAAI2026

【速读】:该论文旨在解决部分标注的多任务学习(Multi-task Learning, MTL)中因依赖未标注任务预测而导致的任务关联不可靠、负迁移及性能下降的问题。其解决方案的关键在于提出一种基于原型的知识检索框架,通过两个核心组件实现:一是任务原型嵌入(task prototype embedding),用于捕捉任务特定特征并量化任务间关联;二是知识检索变压器(knowledge retrieval transformer),根据这些关联自适应地优化特征表示。此外,引入关联知识生成损失(association knowledge generating, AKG loss)以确保任务原型稳定地表征任务特性,从而在仅部分任务有标注的情况下仍能实现鲁棒的多任务学习性能。

链接: https://arxiv.org/abs/2601.07474
作者: Youngmin Oh,Hyung-Il Kim,Jung Uk Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Multi-task learning (MTL) is critical in real-world applications such as autonomous driving and robotics, enabling simultaneous handling of diverse tasks. However, obtaining fully annotated data for all tasks is impractical due to labeling costs. Existing methods for partially labeled MTL typically rely on predictions from unlabeled tasks, making it difficult to establish reliable task associations and potentially leading to negative transfer and suboptimal performance. To address these issues, we propose a prototype-based knowledge retrieval framework that achieves robust MTL instead of relying on predictions from unlabeled tasks. Our framework consists of two key components: (1) a task prototype embedding task-specific characteristics and quantifying task associations, and (2) a knowledge retrieval transformer that adaptively refines feature representations based on these associations. To achieve this, we introduce an association knowledge generating (AKG) loss to ensure the task prototype consistently captures task-specific characteristics. Extensive experiments demonstrate the effectiveness of our framework, highlighting its potential for robust multi-task learning, even when only a subset of tasks is annotated.
zh

[CV-29] From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution

【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成过程中因迭代采样导致的计算成本过高问题,尤其是现有动态分辨率采样方法中存在的跨阶段一致性破坏和全局结构重建误差累积问题。其解决方案的关键在于提出Fresco框架,通过统一各阶段的重噪声机制与全局结构对齐,并引入渐进式上采样策略,仅对已收敛区域进行精细化提升,从而在保持低分辨率草图效率的同时实现高分辨率细节的精准重构,最终实现近无损加速效果,且兼容蒸馏、量化等其他优化技术。

链接: https://arxiv.org/abs/2601.07462
作者: Shikang Zheng,Guantao Chen,Lixuan He,Jiacheng Liu,Yuqi Lin,Chang Zou,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); South China University of Technology (华南理工大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling. Recently, dynamic resolution sampling has emerged as a promising acceleration technique by reducing the resolution of early sampling steps. However, existing methods rely on heuristic re-noising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors, and visible artifacts. Therefore, we propose \textbfFresco, a dynamic resolution framework that unifies re-noise and global structure across stages with progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including 10 \times speedup on FLUX, and 5 \times on HunyuanVideo, while remaining orthogonal to distillation, quantization and feature caching, reaching 22 \times speedup when combined with distilled models. Our code is in supplementary material and will be released on Github.
zh

[CV-30] Improving Video Question Answering through query-based frame selection

【速读】:该论文旨在解决当前视频问答(VideoQA)模型在处理视频时因采用均匀采样固定帧数而导致重要信息丢失、上下文捕捉不足的问题。现有大型视觉语言模型(VLMs)通常依赖于对视频进行均匀采样,无法根据问题语义动态选择关键帧,从而限制了问答准确性。解决方案的关键在于提出一种基于查询的帧选择方法,利用子模态互信息(Submodular Mutual Information, SMI)函数来筛选与问题高度相关的视频帧,确保所选帧提供互补且本质的视觉信息,从而提升VideoQA性能。实验表明,相较于均匀采样策略,该方法在MVBench数据集上使Video-LLaVA和LLaVA-NeXT两种VLMs的准确率提升达4%,且定性分析显示其能更精准地聚焦于问题相关帧。

链接: https://arxiv.org/abs/2601.07459
作者: Himanshu Patil,Geo Jolly,Ramana Raja Buddala,Ganesh Ramakrishnan,Rohit Saluja
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Video Question Answering (VideoQA) models enhance understanding and interaction with audiovisual content, making it more accessible, searchable, and useful for a wide range of fields such as education, surveillance, entertainment, and content creation. Due to heavy compute requirements, most large visual language models (VLMs) for VideoQA rely on a fixed number of frames by uniformly sampling the video. However, this process does not pick important frames or capture the context of the video. We present a novel query-based selection of frames relevant to the questions based on submodular mutual information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of multi-action video tasks. VideoQA accuracy on this dataset was assessed using two VLMs, namely Video-LLaVA and LLaVA-NeXT, both of which originally employed uniform frame sampling. Experiments were conducted using both uniform and query-based sampling strategies. An accuracy improvement of up to 4% was observed when using query-based frame selection over uniform sampling. Qualitative analysis further highlights that query-based selection, using SMI functions, consistently picks frames better aligned with the question. We opine that such query-based frame selection can enhance accuracy in a wide range of tasks that rely on only a subset of video frames.
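One simple instantiation of query-based selection with a facility-location-style SMI function is sketched below: frames are picked greedily by how much they improve the best query-frame similarity coverage. The paper evaluates SMI functions more generally; this is only an illustrative variant.

```python
# Greedy frame selection under a facility-location-style SMI gain (sketch).
import numpy as np

def greedy_smi_frames(frame_feats, query_feats, k):
    # frame_feats: (F, d), query_feats: (Q, d); rows assumed L2-normalized.
    sims = query_feats @ frame_feats.T           # (Q, F) relevance scores
    selected, best = [], np.zeros(query_feats.shape[0])
    for _ in range(k):
        # Marginal gain: improvement in per-query best-frame similarity.
        gains = np.maximum(sims, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf                # no repeats
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sims[:, j])
    return selected

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 64))
queries = rng.standard_normal((3, 64))
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
print(greedy_smi_frames(frames, queries, k=8))
```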
zh

[CV-31] PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion

【速读】:该论文旨在解决现有图像基础模型(image foundation models)在处理球面图像(spherical images)时性能不佳的问题,因为这些模型主要是在透视图像(perspective images)上训练的。其关键解决方案是提出PanoSAMic框架,通过集成预训练的Segment Anything (SAM) 编码器以利用其丰富的特征表示能力,并在此基础上引入多阶段特征输出机制与新颖的时空模态融合模块(spatio-modal fusion module),使模型能够根据输入区域动态选择最相关的模态和特征;同时,采用球面注意力机制(spherical attention)与双视角融合策略(dual view fusion)来缓解全景图像中的畸变和边缘不连续问题,从而显著提升在RGB、RGB-D及RGB-D-N模态下的语义分割性能,在Stanford2D3DS和Matterport3D数据集上均达到当前最优(SotA)结果。

链接: https://arxiv.org/abs/2601.07447
作者: Mahdi Chamseddine,Didier Stricker,Jason Rambach
机构: German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心); RPTU Kaiserslautern-Landau(凯撒斯劳滕-兰道大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing image foundation models are not optimized for spherical images, having been trained primarily on perspective images. PanoSAMic integrates the pre-trained Segment Anything (SAM) encoder into a semantic segmentation model for panoramic images that uses multiple modalities, making full use of the encoder’s extensive pre-training. We modify the SAM encoder to output multi-stage features and introduce a novel spatio-modal fusion module that allows the model to select the relevant modalities and best features from each modality for different areas of the input. Furthermore, our semantic decoder uses spherical attention and dual view fusion to overcome the distortions and edge discontinuity often associated with panoramic images. PanoSAMic achieves state-of-the-art (SotA) results on Stanford2D3DS for RGB, RGB-D, and RGB-D-N modalities and on Matterport3D for RGB and RGB-D modalities. this https URL
zh

[CV-32] SDHSI-Net: Learning Better Representations for Hyperspectral Images via Self-Distillation

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)分类中因高维光谱特征和标注数据有限所导致的传统深度学习模型易过拟合及计算成本高的问题。其解决方案的关键在于引入自蒸馏(Self-distillation, SD)机制,通过将网络中间层的输出作为软目标(soft targets),强制中间预测与最终预测之间的一致性,从而增强特征空间中的类内紧凑性和类间可分性,提升模型在光谱-空间联合学习中的分类准确率与鲁棒性。

链接: https://arxiv.org/abs/2601.07416
作者: Prachet Dev Singh,Shyamsundar Paramasivam,Sneha Barman,Mainak Singha,Ankit Jha,Girish Mishra,Biplab Banerjee
机构: The LNMIIT Jaipur(拉贾斯坦邦国家信息技术学院); IIT Dhanbad(印度理工学院达恩巴德分校); IIT Bombay(印度理工学院孟买分校); DRDO Delhi(国防研究与发展组织德里)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at InGARSS 2025

点击查看摘要

Abstract:Hyperspectral image (HSI) classification presents unique challenges due to its high spectral dimensionality and limited labeled data. Traditional deep learning models often suffer from overfitting and high computational costs. Self-distillation (SD), a variant of knowledge distillation where a network learns from its own predictions, has recently emerged as a promising strategy to enhance model performance without requiring external teacher networks. In this work, we explore the application of SD to HSI by treating earlier outputs as soft targets, thereby enforcing consistency between intermediate and final predictions. This process improves intra-class compactness and inter-class separability in the learned feature space. Our approach is validated on two benchmark HSI datasets and demonstrates significant improvements in classification accuracy and robustness, highlighting the effectiveness of SD for spectral-spatial learning. Codes are available at this https URL.
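The SD objective can be sketched as training an intermediate classifier head against the softened, detached final prediction alongside the usual cross-entropy; the temperature and loss weight below are illustrative assumptions.

```python
# Self-distillation: intermediate head learns from the final head (sketch).
import torch
import torch.nn.functional as F

def self_distill_loss(inter_logits, final_logits, labels, T=4.0, alpha=0.3):
    ce = F.cross_entropy(final_logits, labels)
    # Final prediction acts as a soft target (detached, temperature-softened).
    soft_teacher = F.softmax(final_logits.detach() / T, dim=-1)
    log_student = F.log_softmax(inter_logits / T, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T
    return ce + alpha * kd

loss = self_distill_loss(torch.randn(8, 16), torch.randn(8, 16),
                         torch.randint(0, 16, (8,)))
```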
zh

[CV-33] Forecast the Principal Stabilize the Residual: Subspace-Aware Feature Caching for Efficient Diffusion Transformers

【速读】:该论文旨在解决扩散模型(Diffusion Model)在图像和视频生成任务中因迭代采样过程计算成本高昂而导致的推理速度瓶颈问题。现有特征缓存(feature caching)方法通常对所有特征组件一视同仁,未能充分挖掘特征空间的内在结构差异。解决方案的关键在于提出一种基于奇异值分解(SVD)的子空间感知缓存框架——SVD-Cache:首先通过SVD将扩散特征分解为主导主成分子空间(principal subspace)与残差子空间(residual subspace),其中主成分子空间具有平滑且可预测的时序演化特性,而残差子空间则表现为高波动、低能量的震荡;随后对主成分子空间采用指数移动平均(EMA)进行高效预测,同时直接复用残差子空间,从而实现近无损压缩与显著加速,实验证明其可在FLUX和HunyuanVideo等模型上达到5.55倍加速比,并兼容蒸馏、量化及稀疏注意力等多种模型加速技术。

链接: https://arxiv.org/abs/2601.07396
作者: Guantao Chen,Shikang Zheng,Yuqi Lin,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Sun Yat-Sen University (中山大学); South China University of Technology (华南理工大学); Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformer (DiT) models have achieved unprecedented quality in image and video generation, yet their iterative sampling process remains computationally prohibitive. To accelerate inference, feature caching methods have emerged by reusing intermediate representations across timesteps. However, existing caching approaches treat all feature components uniformly. We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. Extensive experiments demonstrate that SVD-Cache achieves near-lossless acceleration across diverse models and methods, including 5.55× speedup on FLUX and HunyuanVideo, and compatibility with model acceleration techniques including distillation, quantization and sparse attention. Our code is in supplementary material and will be released on Github.
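A rough sketch of the caching idea: SVD-decompose a cached feature map, extrapolate only the top-r principal coefficients toward the next timestep, and reuse the residual unchanged. Rank, momentum, and shapes are illustrative assumptions; the actual method runs inside a DiT sampling loop.

```python
# Subspace-aware feature prediction across cached timesteps (sketch).
import torch

def predict_next_feature(feat_prev, feat_curr, rank=16, momentum=0.7):
    # feat_*: (N, d) token features from two consecutive cached timesteps.
    U, S, Vh = torch.linalg.svd(feat_curr, full_matrices=False)
    principal = U[:, :rank] * S[:rank]            # (N, r) principal coefficients
    prev_coeff = feat_prev @ Vh[:rank].T          # previous step in the same basis
    # EMA-style linear extrapolation of the smooth principal coefficients.
    next_coeff = principal + momentum * (principal - prev_coeff)
    residual = feat_curr - principal @ Vh[:rank]  # volatile low-energy part
    return next_coeff @ Vh[:rank] + residual      # reuse residual as-is

pred = predict_next_feature(torch.randn(256, 512), torch.randn(256, 512))
```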
zh

[CV-34] OceanSAR-2: A Universal Feature Extractor for SAR Ocean Observation

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)在海洋观测中模型泛化能力弱、训练成本高及下游任务性能不足的问题。解决方案的关键在于提出第二代基础模型OceanSAR-2,其通过改进的自监督学习(Self-Supervised Learning, SSL)训练策略与动态数据筛选机制,在提升模型跨任务迁移性能的同时显著降低训练开销,从而为海洋遥感应用提供更高效、可靠的深度学习范式。

链接: https://arxiv.org/abs/2601.07392
作者: Alexandre Tuel,Thomas Kerdreux,Quentin Febvre,Alexis Mouche,Antoine Grouazel,Jean-Renaud Miadana,Antoine Audras,Chen Wang,Bertrand Chapron
机构: Galeio(盖莱奥); LOPS, Ifremer(海洋研究与渔业局); OceanScope(海洋观测); Nanjing University of Information Science and Technology(南京信息工程大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: accepted at EUSAR 2026

点击查看摘要

Abstract:We present OceanSAR-2, the second generation of our foundation model for SAR-based ocean observation. Building on our earlier release, which pioneered self-supervised learning on Sentinel-1 Wave Mode data, OceanSAR-2 relies on improved SSL training and dynamic data curation strategies, which enhance performance while reducing training cost. OceanSAR-2 demonstrates strong transfer performance across downstream tasks, including geophysical pattern classification, ocean surface wind vector and significant wave height estimation, and iceberg detection. We release standardized benchmark datasets, providing a foundation for systematic evaluation and advancement of SAR models for ocean applications.
zh

[CV-35] Learning Dynamic Collaborative Network for Semi-supervised 3D Vessel Segmentation CVPR

【速读】:该论文旨在解决半监督3D血管分割中因传统均值教师(Mean Teacher, MT)方法采用静态师生角色分配而导致的认知偏差问题,即在复杂3D血管数据下,教师模型未必始终优于学生模型,从而限制了性能提升。其解决方案的关键在于提出一种动态协作网络(Dynamic Collaborative Network, DiCo),允许教师与学生模型在训练过程中动态切换角色,以适应不同样本的特性;同时引入多视角融合模块模拟医生多角度分析习惯,并结合对抗监督约束未标注数据中血管形状的一致性,通过将3D体数据投影至2D视图缓解标签不一致性影响,从而显著提升分割精度。

链接: https://arxiv.org/abs/2601.07377
作者: Jiao Xu,Xin Chen,Lihe Zhang
机构: Dalian University of Technology (大连理工大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

点击查看摘要

Abstract:In this paper, we present a new dynamic collaborative network for semi-supervised 3D vessel segmentation, termed DiCo. Conventional mean teacher (MT) methods typically employ a static approach, where the roles of the teacher and student models are fixed. However, due to the complexity of 3D vessel data, the teacher model may not always outperform the student model, leading to cognitive biases that can limit performance. To address this issue, we propose a dynamic collaborative network that allows the two models to dynamically switch their teacher-student roles. Additionally, we introduce a multi-view integration module to capture various perspectives of the inputs, mirroring the way doctors conduct medical analysis. We also incorporate adversarial supervision to constrain the shape of the segmented vessels in unlabeled data. In this process, the 3D volume is projected into 2D views to mitigate the impact of label inconsistencies. Experiments demonstrate that our DiCo method sets new state-of-the-art performance on three 3D vessel segmentation benchmarks. The code repository address is this https URL
zh

[CV-36] HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression

【速读】:该论文旨在解决电商视频中生成结构化叙述(structured narrations)的问题,即如何在感知细粒度视觉细节的基础上,将其组织为连贯且以故事为中心的高层级叙事——这是现有方法难以统一的能力。解决方案的关键在于提出E-commerce Hierarchical Video Captioning(E-HVC)数据集与HiVid-Narrator框架:前者提供双粒度、时序对齐的标注(事件级“Temporal Chain-of-Thought”和章节级“Chapter Summary”),后者采用分阶段构建策略,先通过精选自动语音识别(ASR)和帧级描述获取可靠语言与视觉证据,再基于Temporal Chain-of-Thought精炼粗略章节边界与标题;同时引入Scene-Primed ASR-anchored Compressor(SPA-Compressor)压缩多模态token,利用ASR语义线索引导场景与事件层级表示,从而在减少输入token数量的同时提升叙事质量。

链接: https://arxiv.org/abs/2601.07366
作者: Haoxuan Li,Mengyan Li,Junjun Zheng
机构: Taobao & Tmall Group of Alibaba(淘宝与天猫集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories, capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and a Chapter Summary that composes them into concise, story-centric summaries. Rather than directly prompting chapters, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens compared to existing methods.
zh

[CV-37] Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLM s without Training

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉-语言任务中普遍存在的内部推理不一致问题,即模型深层虽能关注到正确的视觉区域,但最终预测常被早期层中的噪声注意力误导,导致“看得准却说错”的现象。解决方案的关键在于提出一种无需额外训练的双视角解码精炼策略(DualPD),其核心由两个模块构成:一是基于层间注意力引导的对比logits模块,通过比较注意力变化最大的两层输出logits来捕捉正确答案信念的演化过程;二是基于头级别的信息过滤模块,抑制对无关区域关注的低贡献注意力头,从而提升每层注意力的质量。实验表明,该方法在LLaVA和Qwen-VL等多个基准上均显著提升了准确性,验证了其有效性与泛化能力。

链接: https://arxiv.org/abs/2601.07359
作者: Shezheng Song,Shasha Li,Jie Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a variety of vision-language tasks. However, their internal reasoning often exhibits a critical inconsistency: although deeper layers may attend to the correct visual regions, final predictions are frequently misled by noisy attention from earlier layers. This results in a disconnect between what the model internally understands and what it ultimately expresses, a phenomenon we describe as seeing it right but saying it wrong. To address this issue, we propose DualPD, a dual-perspective decoding refinement strategy that enhances the visual understanding without any additional training. DualPD consists of two components. (1) The layer-wise attention-guided contrastive logits module captures how the belief in the correct answer evolves by comparing output logits between layers that exhibit the largest attention shift. (2) The head-wise information filtering module suppresses low-contribution attention heads that focus on irrelevant regions, thereby improving attention quality within each layer. Experiments conducted on both the LLaVA and Qwen-VL model families across multiple multimodal benchmarks demonstrate that DualPD consistently improves accuracy without training, confirming its effectiveness and generalizability. The code will be released upon publication.
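A sketch of the layer-contrastive component, assuming we already have per-layer logits (obtained by applying the LM head to intermediate hidden states) and per-layer attention mass over the image tokens; alpha and the shapes are illustrative, and the head-filtering module is omitted:

```python
import torch

def attention_shift_pair(attn_maps: torch.Tensor):
    """Locate the adjacent layer pair with the largest change in visual attention.
    attn_maps: (num_layers, num_visual_tokens) attention mass on image tokens."""
    shifts = (attn_maps[1:] - attn_maps[:-1]).abs().sum(dim=-1)
    k = int(shifts.argmax())
    return k, k + 1

def contrastive_logits(logits_by_layer, attn_maps, alpha=1.0):
    """Amplify how the belief in each token changes across the attention shift."""
    lo, hi = attention_shift_pair(attn_maps)
    return logits_by_layer[hi] + alpha * (logits_by_layer[hi] - logits_by_layer[lo])

attn = torch.rand(32, 576)                     # 32 layers, 576 image tokens
logits = torch.randn(32, 32000)                # per-layer logits via the LM head
print(contrastive_logits(logits, attn).shape)  # torch.Size([32000])
```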
zh

[CV-38] PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis AAAI2026

【速读】:该论文旨在解决当前医疗多模态模型在真实临床诊断场景中适应性不足的问题,即现有模型主要聚焦于特定医学影像分析(如皮肤科、病理科或放射科),难以应对包含异构输入且需持续上下文理解的复杂诊疗过程。其解决方案的关键在于提出PulseMind系统,由三部分构成:一是构建涵盖98,000次多轮诊疗对话和601,500张医学图像的MediScope数据集;二是设计包含主动性、准确性、实用性与语言质量四个维度的PulseMind Benchmark评估体系;三是开发基于对比强化策略优化(Comparison-based Reinforcement Policy Optimization, CRPO)的训练框架,通过相对偏好信号替代绝对评分奖励,实现更稳定且符合人类判断的训练指导,从而显著提升模型在真实多轮临床诊断任务中的表现。

链接: https://arxiv.org/abs/2601.07344
作者: Jiao Xu,Junwei Liu,Jiangwei Lao,Qi Zhu,Yunpeng Zhao,Congyun Jin,Shinan Liu,Zhihong Lu,Lihe Zhang,Xin Chen,Jian Wang,Ping Wang
机构: Ant Group(蚂蚁集团); Alibaba Cloud(阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Recent advances in medical multi-modal models focus on specialized image analysis like dermatology, pathology, or radiology. However, they do not fully capture the complexity of real-world clinical diagnostics, which involve heterogeneous inputs and require ongoing contextual understanding during patient-physician interactions. To bridge this gap, we introduce PulseMind, a new family of multi-modal diagnostic models that integrates a systematically curated dataset, a comprehensive evaluation benchmark, and a tailored training framework. Specifically, we first construct a diagnostic dataset, MediScope, which comprises 98,000 real-world multi-turn consultations and 601,500 medical images, spanning over 10 major clinical departments and more than 200 sub-specialties. Then, to better reflect the requirements of real-world clinical diagnosis, we develop the PulseMind Benchmark, a multi-turn diagnostic consultation benchmark with a four-dimensional evaluation protocol comprising proactiveness, accuracy, usefulness, and language quality. Finally, we design a training framework tailored for multi-modal clinical diagnostics, centered around a core component named Comparison-based Reinforcement Policy Optimization (CRPO). Compared to absolute score rewards, CRPO uses relative preference signals from multi-dimensional comparisons to provide stable and human-aligned training guidance. Extensive experiments demonstrate that PulseMind achieves competitive performance on both the diagnostic consultation benchmark and public medical benchmarks.
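One plausible way to realize comparison-based rewards, shown purely as a sketch: a judge scores each of G sampled responses on D dimensions, and a response's reward is its win rate over all pairwise, per-dimension comparisons, normalized within the group. This is our illustrative reading of CRPO, not the released implementation:

```python
import itertools
import torch

def pairwise_relative_rewards(scores: torch.Tensor) -> torch.Tensor:
    """scores: (G, D) judge ratings over D dimensions (e.g. proactiveness,
    accuracy, usefulness, language quality) for G sampled responses."""
    G, D = scores.shape
    wins = torch.zeros(G)
    for i, j in itertools.permutations(range(G), 2):
        wins[i] += (scores[i] > scores[j]).float().sum()  # strict per-dimension wins
    rewards = wins / (D * (G - 1))                        # win rate in [0, 1]
    # group-normalized advantage, usable by a PPO/GRPO-style policy update
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

scores = torch.tensor([[4., 3., 5., 4.],
                       [2., 4., 3., 3.],
                       [5., 5., 4., 4.]])
print(pairwise_relative_rewards(scores))
```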
zh

[CV-39] Reconstruction Guided Few-shot Network For Remote Sensing Image Classification

【速读】:该论文旨在解决少样本(few-shot)遥感图像分类中因标注样本有限和地物类型高度多样性而导致的泛化能力不足问题。解决方案的关键在于提出一种基于重建引导的少样本网络(Reconstruction-guided Few-shot Network, RGFS-Net),其核心创新是在标准分类任务之外引入掩码图像重建(masked image reconstruction)作为辅助任务,通过遮挡输入图像的部分区域并要求模型进行重建,从而促进语义丰富的特征学习,增强空间理解能力,并在低数据条件下提升类别判别力,同时保持已见类别的特征一致性。

链接: https://arxiv.org/abs/2601.07335
作者: Mohit Jaiswal,Naman Jain,Shivani Pathak,Mainak Singha,Nikunja Bihari Kar,Ankit Jha,Biplab Banerjee
机构: The LNMIIT Jaipur(拉贾斯坦印度理工学院); IIT Bombay(印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at InGARSS 2025

点击查看摘要

Abstract:Few-shot remote sensing image classification is challenging due to limited labeled samples and high variability in land-cover types. We propose a reconstruction-guided few-shot network (RGFS-Net) that enhances generalization to unseen classes while preserving consistency for seen categories. Our method incorporates a masked image reconstruction task, where parts of the input are occluded and reconstructed to encourage semantically rich feature learning. This auxiliary task strengthens spatial understanding and improves class discrimination under low-data settings. We evaluate our method on the EuroSAT and PatternNet datasets under 1-shot and 5-shot protocols, where it consistently outperforms existing baselines. The proposed method is simple, effective, and compatible with standard backbones, offering a robust solution for few-shot remote sensing classification. Codes are available at this https URL.
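The joint objective is straightforward to prototype. Below is a minimal, self-contained sketch (toy encoder, linear decoder, 64×64 inputs) in which the classification and reconstruction losses share one masked forward pass; all module sizes and the 0.5 weight are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconGuidedNet(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10, img=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.decoder = nn.Linear(feat_dim, 3 * img * img)  # reconstructs the full image

def random_patch_mask(x, patch=16, ratio=0.5):
    """Zero out a random subset of non-overlapping patches."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) > ratio).float()
    return x * F.interpolate(keep, scale_factor=patch, mode="nearest")

def joint_loss(net, x, y, lam=0.5):
    feat = net.encoder(random_patch_mask(x))                  # encode the occluded image
    cls_loss = F.cross_entropy(net.cls_head(feat), y)
    recon_loss = F.mse_loss(net.decoder(feat).view_as(x), x)  # target: unmasked image
    return cls_loss + lam * recon_loss

net = ReconGuidedNet()
print(joint_loss(net, torch.randn(4, 3, 64, 64), torch.randint(0, 10, (4,))))
```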
zh

[CV-40] OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image

【速读】:该论文旨在解决开放集下基于语言提示和单张图像的CAD模型检索(Open-Set CAD Retrieval from a Language Prompt and a Single Image, OSCAR)问题,即在无需对象特定训练的前提下,从无标签的3D对象数据库中准确检索出与输入图像中目标物体最匹配的CAD模型,以支持零样本6D物体位姿估计(6D object pose estimation)。其核心挑战在于:部署后难以获取目标物体的精确CAD模型,且对象集合持续变化导致实例模型识别困难。解决方案的关键在于提出一种两阶段无训练检索机制:首先利用CLIP模型进行文本层面过滤,通过将图像中检测到的目标区域(Region-of-Interest)与数据库描述性标题进行多模态嵌入匹配,筛选候选模型;随后采用DINOv2进行图像级细化,计算候选模型渲染视图与输入图像区域之间的视觉相似度,从而选出最相似的对象模型。该方法在MI3DOR跨域3D模型检索基准上优于现有最优方法,并在YCB-V数据集上实现了90.48%的平均精度,验证了其在自动化6D位姿估计中的有效性。

链接: https://arxiv.org/abs/2601.07333
作者: Tessa Pulli,Jean-Baptiste Weibel,Peter Hönig,Matthias Hirschmanner,Markus Vincze,Andreas Holzinger
机构: Automation and Control Institute, TU Wien, Wien, Austria; BOKU University, Human-Centered AI Lab, FTEC, Department for Ecosystem Management, Climate and Biodiversity, Wien, Austria; Institute for Human Centered Computing, Faculty of Informatics and Biomedical Engineering, TU Graz, Graz, Austria
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the needs of ever-changing object sets in such context, modern zero-shot object pose estimators were developed to not require object-specific training but only rely on CAD models. Such models are hard to obtain once deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce an Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the Region-of-Interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR’s direct applicability in automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation if the exact instance is not available and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be utilized for pose estimation using Megapose achieving better results than a reconstruction-based approach.
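The two-stage retrieval reduces to two cosine-similarity passes. In the sketch below, random tensors stand in for CLIP caption embeddings and DINOv2 rendering embeddings (the real pipeline embeds generated captions and multi-view renders); k and the dimensions are arbitrary:

```python
import torch
import torch.nn.functional as F

def two_stage_retrieval(query_txt, query_img, caption_emb, render_emb, k=10):
    """Stage 1: CLIP-style text filtering over database captions.
    Stage 2: DINOv2-style visual re-ranking over each candidate's renderings.
    caption_emb: (N, Dt) one caption embedding per database model.
    render_emb:  (N, V, Dv) V multi-view rendering embeddings per model."""
    txt_sim = F.cosine_similarity(query_txt[None], caption_emb, dim=-1)
    cand = txt_sim.topk(k).indices                            # k text-plausible models
    vis_sim = F.cosine_similarity(query_img[None, None], render_emb[cand], dim=-1)
    best_view_sim = vis_sim.max(dim=-1).values                # best view per candidate
    return cand[best_view_sim.argmax()].item()

# toy database of 100 CAD models, 12 renderings each (embeddings are random stand-ins)
cap = F.normalize(torch.randn(100, 512), dim=-1)
ren = F.normalize(torch.randn(100, 12, 384), dim=-1)
q_t = F.normalize(torch.randn(512), dim=-1)
q_i = F.normalize(torch.randn(384), dim=-1)
print(two_stage_retrieval(q_t, q_i, cap, ren))
```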
zh

[CV-41] Revisiting the Ordering of Channel and Spatial Attention: A Comprehensive Study on Sequential and Parallel Designs

【速读】:该论文旨在解决当前通道注意力(Channel Attention)与空间注意力(Spatial Attention)融合策略选择缺乏系统性分析和统一原则的问题。现有研究多采用串行或并行两种范式,但其性能表现受数据规模影响显著,且不同结构在不同任务场景下优劣不一,导致实际应用中存在较大的经验依赖。解决方案的关键在于构建一个统一的评估框架,对18种注意力拓扑结构(涵盖串行、并行、多尺度和残差四类)进行系统比较,并基于两个视觉和九个医学数据集的实验发现“数据规模-方法-性能”耦合规律:在小样本任务中,“通道-多尺度空间”级联结构最优;中等规模任务下,可学习并行融合架构表现最佳;大规模任务中,带动态门控的并行结构效果最好。此外,研究还揭示了“空间-通道”顺序更适用于细粒度分类,而残差连接有助于缓解梯度消失问题。最终提出面向场景的注意力模块设计指南,为后续研究提供理论依据与实践路径。

链接: https://arxiv.org/abs/2601.07310
作者: Zhongming Liu,Bingbing Jiang
机构: JXNU(江西师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Attention mechanisms have become a core component of deep learning models, with Channel Attention and Spatial Attention being the two most representative architectures. Current research on their fusion strategies primarily bifurcates into sequential and parallel paradigms, yet the selection process remains largely empirical, lacking systematic analysis and unified principles. We systematically compare channel-spatial attention combinations under a unified framework, building an evaluation suite of 18 topologies across four classes: sequential, parallel, multi-scale, and residual. Across two vision and nine medical datasets, we uncover a “data scale-method-performance” coupling law: (1) in few-shot tasks, the “Channel-Multi-scale Spatial” cascaded structure achieves optimal performance; (2) in medium-scale tasks, parallel learnable fusion architectures demonstrate superior results; (3) in large-scale tasks, parallel structures with dynamic gating yield the best performance. Additionally, experiments indicate that the “Spatial-Channel” order is more stable and effective for fine-grained classification, while residual connections mitigate vanishing gradient problems across varying data scales. We thus propose scenario-based guidelines for building future attention modules. Code is open-sourced at this https URL.
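Two of the compared families are easy to write down. The SE-style channel gate, CBAM-style spatial gate, sequential cascade, and scalar-gated parallel fusion below are common formulations we assume for illustration; they are not necessarily the exact 18 blocks benchmarked:

```python
import torch
import torch.nn as nn

class ChannelAttn(nn.Module):
    """SE-style channel gate from global average pooling."""
    def __init__(self, c, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())
    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # per-channel gate in [0, 1]
        return x * w[:, :, None, None]

class SpatialAttn(nn.Module):
    """CBAM-style spatial gate from pooled channel statistics."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)
    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class SequentialCS(nn.Module):
    """'Channel -> Spatial' cascade, one of the sequential topologies."""
    def __init__(self, c):
        super().__init__()
        self.ca, self.sa = ChannelAttn(c), SpatialAttn()
    def forward(self, x):
        return self.sa(self.ca(x))

class ParallelGated(nn.Module):
    """Parallel branches fused by a learnable scalar gate."""
    def __init__(self, c):
        super().__init__()
        self.ca, self.sa = ChannelAttn(c), SpatialAttn()
        self.gate = nn.Parameter(torch.tensor(0.5))
    def forward(self, x):
        g = torch.sigmoid(self.gate)
        return g * self.ca(x) + (1 - g) * self.sa(x)

x = torch.randn(2, 32, 16, 16)
print(SequentialCS(32)(x).shape, ParallelGated(32)(x).shape)
```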
zh

[CV-42] Mimic Human Cognition Master Multi-Image Reasoning : A Meta-Action Framework for Enhanced Visual Understanding

【速读】:该论文旨在解决多图像推理(multi-image reasoning)场景下大型多模态语言模型(Multimodal Large Language Models, MLLMs)性能显著下降的问题,其核心挑战在于图像间复杂的相互关系以及关键信息在图像集合中的分散性。解决方案的关键在于提出一种受人类认知启发的元动作框架(Cognition-Inspired Meta-Action Framework, CINEMA),将多图像推理过程结构化为五个连续的元动作:全局感知(Global)、聚焦(Focus)、提示(Hint)、思考(Think)和回答(Answer),从而显式建模人类自然的推理步骤。此外,通过引入基于检索的树采样策略(Retrieval-Based Tree Sampling)实现冷启动训练,并采用两阶段强化学习范式(探索阶段结合多样性保持策略、利用阶段采用DAPO算法渐进增强策略)提升模型泛化能力与推理稳定性,最终在多个多图像、视频理解及单图像基准测试中取得领先性能。

链接: https://arxiv.org/abs/2601.07298
作者: Jianghao Yin,Qingbin Li,Kun Sun,Cheng Ding,Jie Wang,Qin Chen,Jie Zhou,Nan Wang,Changqing Li,Pei Wu,Jian Xu,Zheming Yang,Liang He
机构: East China Normal University (华东师范大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with a Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.
zh

[CV-43] Inference-Time Scaling for Visual AutoRegressive modeling by Searching Representative Samples

【速读】:该论文旨在解决向量量化(Vector-Quantized, VQ)视觉自回归模型(Visual Autoregressive Model, VAR)在推理阶段缺乏有效扩展机制的问题,即如何在不破坏离散潜在空间结构的前提下提升生成质量。其解决方案的关键在于提出VAR-Scaling框架,通过核密度估计(Kernel Density Estimation, KDE)将离散的潜在空间映射至准连续特征空间,从而实现对采样分布的有效导航;进一步设计了一种密度自适应混合采样策略——Top-k采样聚焦高密度区域以维持高质量输出,Random-k采样探索低密度区域以保持多样性并避免过早收敛,最终在关键尺度上优化样本保真度,显著提升图像生成质量。

链接: https://arxiv.org/abs/2601.07293
作者: Weidong Tang,Xinyan Wan,Siyu Li,Xiumei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to PRCV 2025

点击查看摘要

Abstract:While inference-time scaling has significantly enhanced generative quality in large language and diffusion models, its application to vector-quantized (VQ) visual autoregressive modeling (VAR) remains unexplored. We introduce VAR-Scaling, the first general framework for inference-time scaling in VAR, addressing the critical challenge of discrete latent spaces that prohibit continuous path search. We find that VAR scales exhibit two distinct pattern types: general patterns and specific patterns, where later-stage specific patterns conditionally optimize early-stage general patterns. To overcome the discrete latent space barrier in VQ models, we map sampling spaces to quasi-continuous feature spaces via kernel density estimation (KDE), where high-density samples approximate stable, high-quality solutions. This transformation enables effective navigation of sampling distributions. We propose a density-adaptive hybrid sampling strategy: Top-k sampling focuses on high-density regions to preserve quality near distribution modes, while Random-k sampling explores low-density areas to maintain diversity and prevent premature convergence. Consequently, VAR-Scaling optimizes sample fidelity at critical scales to enhance output quality. Experiments on class-conditional and text-to-image evaluations demonstrate significant improvements in the inference process. The code is available at this https URL.
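The density-adaptive selection itself is compact: score each candidate with a Gaussian KDE over the candidate set, keep the densest top-k, and add random draws from the remainder. The bandwidth and k values below are illustrative:

```python
import torch

def kde_density(samples: torch.Tensor, bandwidth: float = 0.5) -> torch.Tensor:
    """Gaussian-kernel density of each candidate within the candidate batch.
    samples: (N, D) feature vectors of N candidate generations at one scale."""
    d2 = torch.cdist(samples, samples).pow(2)
    return torch.exp(-d2 / (2 * bandwidth ** 2)).mean(dim=1)

def hybrid_select(samples, k_top=4, k_rand=2, bandwidth=0.5):
    """Top-k picks near distribution modes keep quality; random-k picks from the
    remainder preserve diversity and avoid premature convergence."""
    dens = kde_density(samples, bandwidth)
    top = dens.topk(k_top).indices
    rest = torch.tensor([i for i in range(len(samples)) if i not in set(top.tolist())])
    rand = rest[torch.randperm(len(rest))[:k_rand]]
    return torch.cat([top, rand])

cands = torch.randn(16, 32)   # 16 candidate samples in a quasi-continuous feature space
print(hybrid_select(cands))
```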
zh

[CV-44] A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中内容可追溯性和知识产权保护问题,尤其是现有水印技术存在的两大缺陷:一是视觉无关水印引入与图像无关的伪随机干扰,破坏视觉定位能力;二是语义感知方法因拒绝采样导致推理延迟过高。其解决方案的关键在于提出VISA-Mark框架,通过轻量级前缀调优(prefix-tuner)动态提取视觉证据权重(Visual-Evidence Weights),量化候选词在视觉输入下的支持程度,并据此自适应地划分词汇空间和扰动logits,使水印强度集中于视觉支撑充分的token上,从而在不牺牲推理效率的前提下显著提升视觉保真度与抗攻击鲁棒性。

链接: https://arxiv.org/abs/2601.07291
作者: Qi Zheng,Shuliang Liu,Yu Huang,Sihang Jia,Jungang Li,Lyuhao Chen,Junhao Chen,Hanqian Li,Aiwei Liu,Yibo Yan,Xuming Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases, while some semantic-aware methods incur prohibitive inference latency due to rejection sampling. In this paper, we propose the VIsual Semantic Adaptive Watermark (VISA-Mark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights, which quantify the evidentiary support for candidate tokens based on the visual input. These weights guide an adaptive vocabulary partitioning and logits perturbation mechanism, concentrating watermark strength specifically on visually-supported tokens. By actively aligning the watermark with visual evidence, VISA-Mark effectively maintains visual fidelity. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency (Chair-I) and superior semantic fidelity. The framework maintains highly competitive detection accuracy (96.88% AUC) and robust attack resilience (99.3%) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multimodal watermarking.
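A minimal sketch of evidence-weighted green-list watermarking in the spirit described above; the fixed-seed vocabulary partition stands in for a context-keyed hash, and the random evidence vector stands in for the prefix-tuner's Visual-Evidence Weights:

```python
import torch

def watermarked_logits(logits, evidence, green_mask, delta=2.0):
    """Bias 'green-list' tokens, scaling the bias by per-token visual evidence so
    the watermark concentrates on visually supported tokens.
    logits:     (V,) next-token logits
    evidence:   (V,) in [0, 1], how strongly the image supports each token
    green_mask: (V,) bool pseudo-random vocabulary partition."""
    return logits + delta * evidence * green_mask.float()

V = 1000
g = torch.Generator().manual_seed(42)          # stand-in for a context-keyed hash
green = torch.rand(V, generator=g) < 0.5
logits = torch.randn(V)
evidence = torch.rand(V)                       # the prefix-tuner's output in the paper
probs = torch.softmax(watermarked_logits(logits, evidence, green), dim=-1)
print(probs.sum())
```

Tokens without visual support receive almost no bias, which is what lets the watermark coexist with visual fidelity.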
zh

[CV-45] VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

【速读】:该论文旨在解决视频理解中时空联合建模的挑战,即如何同时实现对视频内容在空间维度(如目标定位)和时间维度(如事件发生时段)的细粒度感知与推理。其解决方案的关键在于构建一个统一的视频大语言模型(Video Large Language Model, Video LLM)——VideoLoom,并配套开发了高质量的人类中心视频数据集 LoomData-8.7k,该数据集包含时序对齐且空间标注的描述文本,从而有效支撑模型在精细时空定位任务上的训练与优化。此外,研究还引入了 LoomBench 基准测试集,涵盖时序、空间及组合式视频问答任务,全面评估视频大模型的多维理解能力,推动了视频理解从单一模态到联合时空感知的范式演进。

链接: https://arxiv.org/abs/2601.07290
作者: Jiapeng Shi,Junke Wang,Zuyao You,Bo He,Zuxuan Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
zh

[CV-46] Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models

【速读】:该论文旨在解决图像到视频(Image-to-Video, I2V)生成任务中,扩散模型在去噪过程中难以有效融合高频率视觉约束与低频率文本引导的问题,特别是现有方法在保证视觉一致性的同时,对文本提示的遵循能力不足。其核心问题在于Diffusion Transformer(DiT)架构中的某些中间层存在语义响应弱化现象(称为Semantic-Weak Layers),这是由于“条件隔离”(Condition Isolation)导致注意力机制对视觉特征的依赖过度偏向于模型预训练所得的视觉先验,而偏离了文本指导。解决方案的关键是提出Focal Guidance(FG)机制,包含两个创新组件:(1) 细粒度语义引导(Fine-grained Semantic Guidance, FSG),利用CLIP模型识别参考帧中的关键区域作为锚点,引导Semantic-Weak Layers增强文本相关性;(2) 注意力缓存(Attention Cache),将语义响应强的层的注意力图传递至Semantic-Weak Layers,注入显式语义信号并缓解其对视觉先验的过度依赖,从而显著提升对文本指令的遵循能力。

链接: https://arxiv.org/abs/2601.07287
作者: Yuanyang Yin,Yufan Deng,Shenghai Yuan,Kaipeng Zhang,Xiao Yang,Feng Zhao
机构: MoE Key Lab of BIPC, USTC (BIPC 重点实验室,中国科学技术大学); Shanghai Innovation Institute (上海创新研究院); ByteDance China (字节跳动中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model’s learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44%).
zh

[CV-47] GenDet: Painting Colored Bounding Boxes on Images via Diffusion Model for Object Detection

【速读】:该论文旨在解决传统目标检测方法在模型架构和任务范式上的局限性,即检测任务通常依赖于判别式(discriminative)模型,难以与生成式模型(generative models)融合,从而限制了视觉理解系统的统一性和灵活性。其解决方案的关键在于提出GenDet框架,将目标检测重新定义为图像生成任务:通过在预训练的Stable Diffusion模型基础上构建条件生成架构,在潜在空间中引入语义约束,使模型能够直接从输入图像生成带有类别标注的边界框(bounding boxes),从而实现对位置和类别属性的精确控制,同时保留生成模型的灵活性,有效弥合了生成模型与判别任务之间的鸿沟。

链接: https://arxiv.org/abs/2601.07273
作者: Chen Min,Chengyang Li,Fanjie Kong,Qi Zhu,Dawei Zhao,Liang Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents GenDet, a novel framework that redefines object detection as an image generation task. In contrast to traditional approaches, GenDet adopts a pioneering approach by leveraging generative modeling: it conditions on the input image and directly generates bounding boxes with semantic annotations in the original image space. GenDet establishes a conditional generation architecture built upon the large-scale pre-trained Stable Diffusion model, formulating the detection task as semantic constraints within the latent space. It enables precise control over bounding box positions and category attributes, while preserving the flexibility of the generative model. This novel methodology effectively bridges the gap between generative models and discriminative tasks, providing a fresh perspective for constructing unified visual understanding systems. Systematic experiments demonstrate that GenDet achieves competitive accuracy compared to discriminative detectors, while retaining the flexibility characteristic of generative methods.
zh

[CV-48] PALUM: Part-based Attention Learning for Unified Motion Retargeting

【速读】:该论文旨在解决不同骨骼结构角色之间动作迁移(motion retargeting)的问题,尤其在源角色与目标角色骨骼拓扑差异较大时,如何保持原始动作的语义一致性和质量。其解决方案的关键在于提出PALUM方法,通过将关节划分为语义身体部位并引入注意力机制来捕捉时空关系,从而学习跨不同骨骼拓扑的通用运动表示(skeleton-agnostic representations)。该方法结合目标骨骼的特定结构信息进行动作迁移,并利用循环一致性机制确保迁移过程中的语义连贯性与运动保真度,从而在多种骨骼结构下实现高质量的动作迁移,包括对未见过的骨骼-动作组合也具有良好的泛化能力。

链接: https://arxiv.org/abs/2601.07272
作者: Siqi Liu,Maoyu Wang,Bo Dai,Cewu Lu
机构: Shanghai Jiao Tong University (上海交通大学); Feeling AI; The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Retargeting motion between characters with different skeleton structures is a fundamental challenge in computer animation. When source and target characters have vastly different bone arrangements, maintaining the original motion’s semantics and quality becomes increasingly difficult. We present PALUM, a novel approach that learns common motion representations across diverse skeleton topologies by partitioning joints into semantic body parts and applying attention mechanisms to capture spatio-temporal relationships. Our method transfers motion to target skeletons by leveraging these skeleton-agnostic representations alongside target-specific structural information. To ensure robust learning and preserve motion fidelity, we introduce a cycle consistency mechanism that maintains semantic coherence throughout the retargeting process. Extensive experiments demonstrate superior performance in handling diverse skeletal structures while maintaining motion realism and semantic fidelity, even when generalizing to previously unseen skeleton-motion combinations. We will make our implementation publicly available to support future research.
zh

[CV-49] From Landslide Conditioning Factors to Satellite Embeddings: Evaluating the Utilisation of Google AlphaEarth for Landslide Susceptibility Mapping using Deep Learning

【速读】:该论文旨在解决传统数据驱动滑坡易发性制图(Landslide Susceptibility Mapping, LSM)中依赖滑坡条件因子(Landslide Conditioning Factors, LCFs)所面临的可用性差、异质性强及预处理不确定性等问题,这些问题限制了制图的可靠性。其解决方案的关键在于引入Google AlphaEarth(AE)嵌入表示——一种从多源地理空间观测中提取的地球表面状态统一表征,作为替代LCFs的新一代预测因子。研究表明,使用AE嵌入(尤其是完整的64个嵌入波段)能够显著提升多种深度学习模型(CNN1D、CNN2D和Vision Transformer)在三个不同区域(台湾南投县、香港和意大利艾米利亚-罗马涅地区)上的性能,表现为更高的F1分数(提升约4%–15%)与AUC值(增加0.04–0.11),并展现出更稳定的误差分布和更强的空间一致性,尤其在时间对齐较好的区域效果更为突出,验证了AE嵌入作为标准化、信息丰富且可泛化的LSM输入要素的巨大潜力。

链接: https://arxiv.org/abs/2601.07268
作者: Yusen Cheng,Qinfeng Zhu,Lei Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data-driven landslide susceptibility mapping (LSM) typically relies on landslide conditioning factors (LCFs), whose availability, heterogeneity, and preprocessing-related uncertainties can constrain mapping reliability. Recently, Google AlphaEarth (AE) embeddings, derived from multi-source geospatial observations, have emerged as a unified representation of Earth surface conditions. This study evaluated the potential of AE embeddings as alternative predictors for LSM. Two AE representations, including retained principal components and the full set of 64 embedding bands, were systematically compared with conventional LCFs across three study areas (Nantou County, Taiwan; Hong Kong; and part of Emilia-Romagna, Italy) using three deep learning models (CNN1D, CNN2D, and Vision Transformer). Performance was assessed using multiple evaluation metrics, ROC-AUC analysis, error statistics, and spatial pattern assessment. Results showed that AE-based models consistently outperformed LCFs across all regions and models, yielding higher F1-scores, AUC values, and more stable error distributions. Such improvement was most pronounced when using the full 64-band AE representation, with F1-score improvements of approximately 4% to 15% and AUC increases ranging from 0.04 to 0.11, depending on the study area and model. AE-based susceptibility maps also exhibited clearer spatial correspondence with observed landslide occurrences and enhanced sensitivity to localised landslide-prone conditions. Performance improvements were more evident in Nantou and Emilia than in Hong Kong, revealing that closer temporal alignment between AE embeddings and landslide inventories may lead to more effective LSM outcomes. These findings highlight the strong potential of AE embeddings as a standardised and information-rich alternative to conventional LCFs for LSM.
zh

[CV-50] Universal Adversarial Purification with DDIM Metric Loss for Stable Diffusion

【速读】:该论文旨在解决稳定扩散模型(Stable Diffusion, SD)在训练数据包含对抗噪声时产生劣质输出的问题,特别是针对SD特有的攻击策略(如针对变分自编码器(VAE)编码器、UNet去噪器或两者同时攻击)缺乏有效防御手段的现状。解决方案的关键在于提出一种通用的扩散对抗净化框架(Universal Diffusion Adversarial Purification, UDAP),其核心机制是利用干净图像与对抗图像在去噪扩散隐式模型(DDIM)反演过程中的重建行为差异,通过最小化DDIM度量损失来优化净化过程,从而有效去除对抗噪声;此外,引入动态epoch调整策略以根据重建误差自适应调整优化迭代次数,在不牺牲净化质量的前提下显著提升效率。

链接: https://arxiv.org/abs/2601.07253
作者: Li Zheng,Liangbin Xie,Jiantao Zhou,He YiMin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Stable Diffusion (SD) often produces degraded outputs when the training dataset contains adversarial noise. Adversarial purification offers a promising solution by removing adversarial noise from contaminated data. However, existing purification methods are primarily designed for classification tasks and fail to address SD-specific adversarial strategies, such as attacks targeting the VAE encoder, UNet denoiser, or both. To address the gap in SD security, we propose Universal Diffusion Adversarial Purification (UDAP), a novel framework tailored for defending adversarial attacks targeting SD models. UDAP leverages the distinct reconstruction behaviors of clean and adversarial images during Denoising Diffusion Implicit Models (DDIM) inversion to optimize the purification process. By minimizing the DDIM metric loss, UDAP can effectively remove adversarial noise. Additionally, we introduce a dynamic epoch adjustment strategy that adapts optimization iterations based on reconstruction errors, significantly improving efficiency without sacrificing purification quality. Experiments demonstrate UDAP’s robustness against diverse adversarial methods, including PID (VAE-targeted), Anti-DreamBooth (UNet-targeted), MIST (hybrid), and robustness-enhanced variants like Anti-Diffusion (Anti-DF) and MetaCloak. UDAP also generalizes well across SD versions and text prompts, showcasing its practical applicability in real-world scenarios.
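The optimization loop reduces to minimizing a round-trip reconstruction error. In the sketch below, two lambdas stand in for the real DDIM inversion and reconstruction routines of Stable Diffusion, and the early-stop tolerance plays the role of the dynamic epoch adjustment; everything here is a toy stand-in:

```python
import torch
import torch.nn.functional as F

def purify(x_adv, ddim_invert, ddim_reconstruct, max_steps=100, lr=0.01, tol=1e-4):
    """Optimize the image so its DDIM round trip reproduces it (clean images
    reconstruct faithfully; adversarially perturbed ones do not), stopping
    early once the reconstruction error falls below `tol`."""
    x = x_adv.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(max_steps):
        loss = F.mse_loss(ddim_reconstruct(ddim_invert(x)), x)  # DDIM metric loss
        opt.zero_grad(); loss.backward(); opt.step()
        if loss.item() < tol:                  # dynamic epoch adjustment
            break
    return x.detach()

# toy stand-ins: a linear "inversion" and a slightly imperfect inverse
W = torch.eye(64) + 0.1 * torch.randn(64, 64)
W_inv = torch.linalg.inv(W) + 0.01 * torch.randn(64, 64)
x_adv = torch.rand(64) + 0.1 * torch.randn(64)          # "image" plus adversarial noise
print(purify(x_adv, lambda v: v @ W, lambda z: z @ W_inv).shape)
```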
zh

[CV-51] HERE: Hierarchical Active Exploration of Radiance Field with Epistemic Uncertainty Minimization

【速读】:该论文旨在解决传统3D场景重建方法在数据采集效率低、重建完整性不足的问题,尤其是在未知区域识别不准确导致探索路径盲目性高的情况下。其解决方案的关键在于提出了一种基于神经辐射场(Neural Radiance Fields)的主动三维场景重建框架HERE,核心创新是利用证据深度学习(evidential deep learning)实现后验认知不确定性(epistemic uncertainty)量化,从而精准识别未观测或重建质量差的区域,并据此生成最优相机轨迹。该不确定性度量与重建误差高度相关,显著优于现有方法,结合分层探索策略(局部规划提取高不确定性体素的目标视角,全局规划引导大范围覆盖),实现了高效且高质量的主动式场景重建。

链接: https://arxiv.org/abs/2601.07242
作者: Taekbeom Lee,Dabin Kim,Youngseok Jang,H. Jin Kim
机构: Seoul National University (首尔国立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE RA-L. The first two authors contributed equally

点击查看摘要

Abstract:We present HERE, an active 3D scene reconstruction framework based on neural radiance fields, enabling high-fidelity implicit mapping. Our approach centers around an active learning strategy for camera trajectory generation, driven by accurate identification of unseen regions, which supports efficient data acquisition and precise scene reconstruction. The key to our approach is epistemic uncertainty quantification based on evidential deep learning, which directly captures data insufficiency and exhibits a strong correlation with reconstruction errors. This allows our framework to more reliably identify unexplored or poorly reconstructed regions compared to existing methods, leading to more informed and targeted exploration. Additionally, we design a hierarchical exploration strategy that leverages learned epistemic uncertainty, where local planning extracts target viewpoints from high-uncertainty voxels based on visibility for trajectory generation, and global planning uses uncertainty to guide large-scale coverage for efficient and comprehensive reconstruction. The effectiveness of the proposed method in active 3D reconstruction is demonstrated by achieving higher reconstruction completeness compared to previous approaches on photorealistic simulated scenes across varying scales, while a hardware demonstration further validates its real-world applicability.
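Evidential deep learning makes the epistemic term cheap to read out at render time. The sketch below uses the Normal-Inverse-Gamma parameterization common in deep evidential regression, which we assume is close in spirit to the paper's uncertainty head; the backbone features and dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Predicts Normal-Inverse-Gamma parameters (mu, nu, alpha, beta) per sample;
    the variance of the predicted mean serves as epistemic uncertainty."""
    def __init__(self, d_in):
        super().__init__()
        self.fc = nn.Linear(d_in, 4)
    def forward(self, h):
        mu, log_nu, log_alpha, log_beta = self.fc(h).unbind(-1)
        nu = F.softplus(log_nu) + 1e-6
        alpha = F.softplus(log_alpha) + 1.0 + 1e-6    # alpha > 1 keeps variance finite
        beta = F.softplus(log_beta) + 1e-6
        return mu, nu, alpha, beta

def epistemic_uncertainty(nu, alpha, beta):
    """Var[mu] = beta / (nu * (alpha - 1)): high where the model has seen little data."""
    return beta / (nu * (alpha - 1.0))

head = EvidentialHead(32)
mu, nu, alpha, beta = head(torch.randn(5, 32))        # 5 query points, 32-d features
print(epistemic_uncertainty(nu, alpha, beta))
```

Voxels whose rays accumulate high values of this quantity would then become the targets of the local planner.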
zh

[CV-52] Language-Grounded Multi-Domain Image Translation via Semantic Difference Guidance

【速读】:该论文旨在解决多域图像到图像翻译中如何将自然语言提示中的语义差异精准地映射为视觉变换,同时保持无关结构和语义内容的一致性问题。现有方法在结构完整性维持和细粒度、属性特定控制方面存在不足,尤其是在涉及多个域时表现不佳。解决方案的关键在于提出LACE(Language-grounded Attribute Controllable Translation)框架,其核心由两个组件构成:(1) GLIP-Adapter通过融合全局语义与局部结构特征以增强一致性;(2) 多域控制引导机制(Multi-Domain Control Guidance)将源与目标提示间的语义差异显式建模为逐属性的翻译向量,实现语言语义与域级视觉变化的对齐。该设计支持组合式多域控制,并可独立调节每个属性的强度,从而实现高保真度、结构保留且可解释的跨域可控图像生成。

链接: https://arxiv.org/abs/2601.07221
作者: Jongwon Ryu,Joonhyung Park,Jaeho Han,Yeong-Seok Kim,Hye-rin Kim,Sunjae Yoon,Junyeong Kim
机构: Chung-Ang University (中央大学); Hyundai Mobis (现代摩比斯); Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-domain image-to-image translation requires grounding semantic differences expressed in natural language prompts into corresponding visual transformations, while preserving unrelated structural and semantic content. Existing methods struggle to maintain structural integrity and provide fine-grained, attribute-specific control, especially when multiple domains are involved. We propose LACE (Language-grounded Attribute Controllable Translation), built on two components: (1) a GLIP-Adapter that fuses global semantics with local structural features to preserve consistency, and (2) a Multi-Domain Control Guidance mechanism that explicitly grounds the semantic delta between source and target prompts into per-attribute translation vectors, aligning linguistic semantics with domain-level visual changes. Together, these modules enable compositional multi-domain control with independent strength modulation for each attribute. Experiments on CelebA-Dialog and BDD100K demonstrate that LACE achieves high visual fidelity, structural preservation, and interpretable domain-specific control, surpassing prior baselines. This positions LACE as a cross-modal content generation framework bridging language semantics and controllable visual translation.
zh

[CV-53] VENUS: Visual Editing with Noise Inversion Using Scene Graphs

【速读】:该论文旨在解决当前基于文本的图像编辑模型在保持背景一致性与语义一致性之间的平衡难题,这类模型常因无法有效控制编辑区域而导致生成全新图像或未能实现预期修改。为应对这一挑战,作者提出了一种无需训练的场景图引导图像编辑框架 VENUS(Visual Editing with Noise inversion Using Scene graphs),其核心创新在于采用“拆分提示条件策略”以解耦编辑目标对象与其背景上下文,并结合噪声反演(noise inversion)技术保留未编辑区域的保真度。此外,VENUS 通过集成多模态大语言模型提取的场景图与扩散模型主干网络,在不引入额外训练的前提下实现了高效且可控的图像编辑,显著提升了背景保留能力、语义一致性及运行效率。

链接: https://arxiv.org/abs/2601.07219
作者: Thanh-Nhan Vo,Trong-Thuan Nguyen,Tam V. Nguyen,Minh-Triet Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art text-based image editing models often struggle to balance background preservation with semantic consistency, frequently resulting either in the synthesis of entirely new images or in outputs that fail to realize the intended edits. In contrast, scene graph-based image editing addresses this limitation by providing a structured representation of semantic entities and their relations, thereby offering improved controllability. However, existing scene graph editing methods typically depend on model fine-tuning, which incurs high computational cost and limits scalability. To this end, we introduce VENUS (Visual Editing with Noise inversion Using Scene graphs), a training-free framework for scene graph-guided image editing. Specifically, VENUS employs a split prompt conditioning strategy that disentangles the target object of the edit from its background context, while simultaneously leveraging noise inversion to preserve fidelity in unedited regions. Moreover, our proposed approach integrates scene graphs extracted from multimodal large language models with diffusion backbones, without requiring any additional training. Empirically, VENUS substantially improves both background preservation and semantic alignment on PIE-Bench, increasing PSNR from 22.45 to 24.80, SSIM from 0.79 to 0.84, and reducing LPIPS from 0.100 to 0.070 relative to the state-of-the-art scene graph editing model (SGEdit). In addition, VENUS enhances semantic consistency as measured by CLIP similarity (24.97 vs. 24.19). On EditVal, VENUS achieves the highest fidelity with a 0.87 DINO score and, crucially, reduces per-image runtime from 6-10 minutes to only 20-30 seconds. Beyond scene graph-based editing, VENUS also surpasses strong text-based editing baselines such as LEDIT++ and P2P+DirInv, thereby demonstrating consistent improvements across both paradigms.
zh

[CV-54] SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

【速读】:该论文旨在解决从自然语言指令中高效生成完整3D室内场景的问题,现有方法在性能和计算效率上存在不足。解决方案的关键在于提出SceneNAT,一种单阶段掩码非自回归Transformer架构,通过在语义和空间属性的完全离散化表示上进行掩码建模,并采用属性级与实例级联合掩码策略以增强对象内和对象间结构的捕捉能力;同时引入专用三元组预测器,将可学习的关系查询映射到稀疏符号三元组(subject, predicate, object),从而提升场景布局与物体关系的推理能力。实验表明,SceneNAT在3D-FRONT数据集上优于主流自回归和扩散基线模型,在语义一致性与空间排列准确性方面表现更优,且计算成本显著降低。

链接: https://arxiv.org/abs/2601.07218
作者: Jeongjun Choi,Yeonsoo Park,H. Jin Kim
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. Code will be released

点击查看摘要

Abstract:We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene’s layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
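The "few parallel decoding passes" mechanic can be illustrated with a MaskGIT-style loop: start fully masked, commit the most confident predictions each pass, and re-mask the rest on a cosine schedule. The toy predictor and schedule are assumptions, not SceneNAT's actual network, which decodes discretized object attributes and relation triplets:

```python
import math
import torch

def parallel_masked_decode(predict_logits, seq_len, mask_id, passes=4):
    """Resolve a fully masked sequence in a few parallel forward calls.
    predict_logits(tokens) -> (seq_len, vocab) stands in for the Transformer."""
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for t in range(1, passes + 1):
        probs, pred = predict_logits(tokens).softmax(-1).max(-1)
        masked = tokens == mask_id
        pred = torch.where(masked, pred, tokens)       # keep already-committed tokens
        conf = torch.where(masked, probs, torch.full_like(probs, float("inf")))
        n_remask = int(math.cos(math.pi / 2 * t / passes) * seq_len)  # cosine schedule
        if n_remask > 0:
            pred[conf.argsort()[:n_remask]] = mask_id  # re-mask least confident slots
        tokens = pred
    return tokens

vocab, length, MASK = 100, 24, 99                      # token 99 doubles as [MASK] here
net = torch.nn.Sequential(torch.nn.Embedding(vocab, 64), torch.nn.Linear(64, vocab))
print(parallel_masked_decode(lambda tok: net(tok), length, MASK))
```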
zh

[CV-55] BlindU: Blind Machine Unlearning without Revealing Erasing Data

【速读】:该论文旨在解决在隐私保护场景下(如联邦学习,Federated Learning, FL)实现模型遗忘(unlearning)时,传统方法需将用户原始数据上传至服务器的问题,从而导致数据泄露风险。其核心挑战在于如何在不暴露待删除样本的前提下完成有效遗忘。解决方案的关键在于提出盲遗忘(Blind Unlearning, BlindU):通过信息瓶颈(Information Bottleneck, IB)机制构建压缩表示(compressed representations),使用户本地生成去敏感化的特征表示,服务器仅基于这些表示及其标签进行遗忘操作;同时引入无噪声差分隐私(noise-free differential privacy, DP)掩码对原始数据预处理以增强隐私,并设计专用的遗忘模块与多梯度下降算法平衡遗忘效果与模型性能保留。此方案实现了无需暴露原始数据即可安全、高效地执行模型遗忘的目标。

链接: https://arxiv.org/abs/2601.07214
作者: Weiqi Wang,Zhiyi Tian,Chenhan Zhang,Shui Yu
机构: University of Technology Sydney (悉尼科技大学); Southeast University (东南大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Machine unlearning enables data holders to remove the contribution of their specified samples from trained models to protect their privacy. However, it is paradoxical that most unlearning methods require the unlearning requesters to firstly upload their data to the server as a prerequisite for unlearning. These methods are infeasible in many privacy-preserving scenarios where servers are prohibited from accessing users’ data, such as federated learning (FL). In this paper, we explore how to implement unlearning under the condition of not uncovering the erasing data to the server. We propose \textbfBlind Unlearning (BlindU), which carries out unlearning using compressed representations instead of original inputs. BlindU only involves the server and the unlearning user: the user locally generates privacy-preserving representations, and the server performs unlearning solely on these representations and their labels. For the FL model training, we employ the information bottleneck (IB) mechanism. The encoder of the IB-based FL model learns representations that distort maximum task-irrelevant information from inputs, allowing FL users to generate compressed representations locally. For effective unlearning using compressed representation, BlindU integrates two dedicated unlearning modules tailored explicitly for IB-based models and uses a multiple gradient descent algorithm to balance forgetting and utility retaining. While IB compression already provides protection for task-irrelevant information of inputs, to further enhance the privacy protection, we introduce a noise-free differential privacy (DP) masking method to deal with the raw erasing data before compressing. Theoretical analysis and extensive experimental results illustrate the superiority of BlindU in privacy protection and unlearning effectiveness compared with the best existing privacy-preserving unlearning benchmarks.
zh

[CV-56] SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model WACV

【速读】:该论文旨在解决单图像玻璃表面反射去除(Single-Image Reflection Removal, SIRR)难题,其核心挑战在于玻璃表面复杂的光路交互(包括反射与透射光的耦合效应),以及现有数据集在物理真实性或规模上的不足。解决方案的关键在于提出一种基于路径追踪(path-tracing)的合成数据生成框架,通过将3D玻璃模型与真实背景图像结合,生成具有多样化玻璃属性、相机参数和后期处理效果的物理准确反射场景;同时,利用大模型(Large Multimodal Model, LMM)能力,将图像层拼接为复合输入并进行联合描述(captioning),并通过任务特定的LoRA(Low-Rank Adaptation)微调策略替代全参数训练,从而显著提升反射去除与分离性能。

链接: https://arxiv.org/abs/2601.07209
作者: Yu Guo,Zhiqiang Lao,Xiyun Song,Yubin Zhou,Heather Yu
机构: Futurewei Technologies (未来wei技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 12 pages, 14 figures, accepted in WACVW 2026

点击查看摘要

Abstract:Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.
zh

[CV-57] ShowUI-Aloha: Human-Taught GUI Agent

【速读】:该论文旨在解决自动化复杂图形用户界面(Graphical User Interfaces, GUI)任务时面临的挑战,尤其是由于缺乏可扩展且高质量的训练数据所导致的难题。现有方法难以从人类演示中有效提取结构化信息,而这些演示通常为非结构化的屏幕录制,包含大量冗余内容且缺乏语义标注。为此,作者提出ShowUI-Aloha框架,其核心在于构建一个端到端的处理管道:首先通过记录器捕获屏幕视频及精确的用户交互(如鼠标点击、键盘输入和滚动),再由学习器将原始交互与视觉上下文转化为自然语言描述;随后规划器基于解析后的演示动态生成高层操作计划;最终执行器在操作系统层面精准执行动作(包括点击、拖拽、文本输入等),并结合安全检查与实时反馈确保可靠性。该方案实现了从真实世界人类行为中自动构建结构化任务数据的能力,为通用GUI代理的学习提供了可行路径。

链接: https://arxiv.org/abs/2601.07181
作者: Yichun Zhang,Xiangwu Guo,Yauhong Goh,Jessica Hu,Zhiheng Chen,Xin Wang,Difei Gao,Mike Zheng Shou
机构: Show Lab, National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 Pages, 16 Figures

点击查看摘要

Abstract:Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and lack annotations, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework includes four key components: A recorder that captures screen video along with precise user interactions like mouse clicks, keystrokes, and scrolls. A learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural language captions. A planner that reads the parsed demonstrations, maintains task states, and dynamically formulates the next high-level action plan based on contextual reasoning. An executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that can learn effectively from simply observing humans.
zh

[CV-58] DIVER: Dynamic Iterative Visual Evidence Reasoning for Multimodal Fake News Detection

【速读】:该论文旨在解决多模态虚假新闻检测中因视觉基础薄弱导致的计算冗余和幻觉风险问题,尤其针对现有方法依赖静态融合或大语言模型(Large Language Models, LLMs)所引发的低效与不准确缺陷。其解决方案的关键在于提出DIVER(Dynamic Iterative Visual Evidence Reasoning)框架,该框架基于渐进式、证据驱动的推理范式:首先建立强文本基线以过滤不可靠信息,仅在文本证据不足时动态引入视觉信息,并通过跨模态对齐验证决定是否进行深度视觉分析;对于显著跨模态语义差异的样本,进一步调用细粒度视觉工具(如光学字符识别(OCR)和密集描述生成)提取任务相关证据,再通过不确定性感知融合机制迭代聚合,从而实现高效且鲁棒的多模态推理。

链接: https://arxiv.org/abs/2601.07178
作者: Weilin Zhou,Zonghao Ying,Chunlei Meng,Jiahui Liu,Hengyang Zhou,Quanchen Zou,Deyue Zhang,Dongdong Yang,Xiangzheng Zhang
机构: Xinjiang University (新疆大学); 360 AI Security Lab; Beihang University (北京航空航天大学); Fudan University (复旦大学); Central South University (中南大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Multimodal fake news detection is crucial for mitigating adversarial misinformation. Existing methods, relying on static fusion or LLMs, face computational redundancy and hallucination risks due to weak visual foundations. To address this, we propose DIVER (Dynamic Iterative Visual Evidence Reasoning), a framework grounded in a progressive, evidence-driven reasoning paradigm. DIVER first establishes a strong text-based baseline through language analysis, leveraging intra-modal consistency to filter unreliable or hallucinated claims. Only when textual evidence is insufficient does the framework introduce visual information, where inter-modal alignment verification adaptively determines whether deeper visual inspection is necessary. For samples exhibiting significant cross-modal semantic discrepancies, DIVER selectively invokes fine-grained visual tools (e.g., OCR and dense captioning) to extract task-relevant evidence, which is iteratively aggregated via uncertainty-aware fusion to refine multimodal reasoning. Experiments on Weibo, Weibo21, and GossipCop demonstrate that DIVER outperforms state-of-the-art baselines by an average of 2.72%, while optimizing inference efficiency with a reduced latency of 4.12 s.
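The progressive routing amounts to a cascade of cheap checks before expensive tool calls, roughly as below; the thresholds, signal names, and tool callables are illustrative placeholders:

```python
def diver_route(text_conf, align_score, run_ocr, run_caption,
                tau_text=0.85, tau_align=0.6):
    """Evidence-driven routing: trust text alone when it is confident; otherwise
    check image-text alignment, and only on a large cross-modal discrepancy
    invoke the fine-grained visual tools."""
    if text_conf >= tau_text:
        return "verdict-from-text"            # intra-modal evidence suffices
    if align_score >= tau_align:
        return "verdict-from-alignment"       # modalities agree; skip deep inspection
    evidence = [run_ocr(), run_caption()]     # expensive, fine-grained visual tools
    return f"verdict-from-tools({', '.join(evidence)})"

print(diver_route(0.5, 0.3, lambda: "ocr-text", lambda: "dense-caption"))
```

Because most samples exit at the first two branches, average latency stays low while the hardest cases still receive full visual scrutiny.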
zh

[CV-59] st-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification

【速读】:该论文旨在解决低质量多模态数据在安全关键应用中可靠学习的问题,特别是针对多模态噪声带来的两大挑战:一是现有方法难以有效去除异构噪声,导致多模态表示学习不够鲁棒;二是模型在遇到未见过的噪声时适应性和泛化能力有限。解决方案的关键在于提出测试时自适应分层协同去噪网络(Test-time Adaptive Hierarchical Co-enhanced Denoising Network, TAHCD),其核心机制包括两方面:一是引入自适应稳定子空间对齐(Adaptive Stable Subspace Alignment)与样本自适应置信度对齐(Sample-Adaptive Confidence Alignment),实现全局与实例层面的异构噪声联合去除,涵盖模态特有噪声和跨模态噪声;二是设计测试时协同增强机制,通过无标签方式自适应更新模型,根据样本噪声特性协同优化多模态噪声去除过程,显著提升模型的适应性与泛化性能。

链接: https://arxiv.org/abs/2601.07163
作者: Shu Shen,C. L. Philip Chen,Tong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages,9 figures, 8 tables

点击查看摘要

Abstract:Reliable learning on low-quality multimodal data is a widely concerning issue, especially in safety-critical applications. However, multimodal noise poses a major challenge in this domain and leads existing methods to suffer from two key limitations. First, they struggle to reliably remove heterogeneous data noise, hindering robust multimodal representation learning. Second, they exhibit limited adaptability and generalization when encountering previously unseen noise. To address these issues, we propose Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD). On one hand, TAHCD introduces the Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to reliably remove heterogeneous noise. They account for noise at both global and instance levels and enable jointly removal of modality-specific and cross-modality noise, achieving robust learning. On the other hand, TAHCD introduces test-time cooperative enhancement, which adaptively updates the model in response to input noise in a label-free manner, improving adaptability and generalization. This is achieved by collaboratively enhancing the joint removal process of modality-specific and cross-modality noise across global and instance levels according to sample noise. Experiments on multiple benchmarks demonstrate that the proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
zh

[CV-60] Motion Focus Recognition in Fast-Moving Egocentric Video

【速读】:该论文旨在解决当前基于第一人称视角(egocentric)的数据集和系统在体育及快速运动场景中普遍忽视运动分析的问题,尤其是缺乏对主体运动意图的实时识别能力。其核心挑战在于如何从第一人称视频中高效、准确地估计运动意图,同时满足边缘部署的实时性与资源约束。解决方案的关键在于提出一种实时运动焦点识别方法,该方法利用基础模型进行相机位姿估计,并引入系统级优化(如滑动批处理推理策略),从而实现低延迟、低内存消耗的高效推理,使以运动为中心的分析成为可能,为体育等动态场景提供了一种与现有动作识别研究互补的新视角。

链接: https://arxiv.org/abs/2601.07154
作者: Daniel Hong,James Tribble,Hao Wang,Chaoyi Zhou,Ashish Bastola,Siyu Huang,Abolfazl Razi
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets primarily focus on action recognition tasks, while largely overlooking the inherent role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion focus recognition method that estimates the subject’s locomotion intention from any egocentric video. Our approach leverages the foundation model for camera pose estimation and introduces system-level optimizations to enable efficient and scalable inference. Evaluated on a collected egocentric action dataset, our method achieves real-time performance with manageable memory consumption through a sliding batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies on sports and fast-movement activities.
zh

[CV-61] Proof of Reasoning for Privacy Enhanced Federated Blockchain Learning at the Edge

【速读】:该论文旨在解决联邦学习(Federated Learning)中数据隐私保护不足、恶意攻击防御能力弱以及聚合过程缺乏可验证性的问题。其核心解决方案是提出一种专为联邦学习设计的新型共识机制——证明推理机制(Proof of Reasoning, PoR),该机制通过三个关键步骤实现:首先,利用掩码自编码器(Masked Autoencoder, MAE)训练出一个编码器,作为特征映射并混淆输入数据,以抵御人类重构和模型逆向攻击;其次,在边缘侧训练下游分类器,将编码后的数据点、网络权重、输出结果及真实标签打包成区块用于联邦聚合;最后,基于此结构实现更复杂且可验证的聚合方法。PoR在保持高准确率的同时显著降低计算复杂度,并具备良好的可扩展性和对动态数据与网络环境的适应能力。

链接: https://arxiv.org/abs/2601.07134
作者: James Calo,Benny Lo
机构: Imperial College London (帝国理工学院)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 Pages, 5 figues, 9 tables, journal paper

点击查看摘要

Abstract:Consensus mechanisms are the core of any blockchain system. However, the majority of these mechanisms do not target federated learning directly nor do they aid in the aggregation step. This paper introduces Proof of Reasoning (PoR), a novel consensus mechanism specifically designed for federated learning using blockchain, aimed at preserving data privacy, defending against malicious attacks, and enhancing the validation of participating networks. Unlike generic blockchain consensus mechanisms commonly found in the literature, PoR integrates three distinct processes tailored for federated learning. Firstly, a masked autoencoder (MAE) is trained to generate an encoder that functions as a feature map and obfuscates input data, rendering it resistant to human reconstruction and model inversion attacks. Secondly, a downstream classifier is trained at the edge, receiving input from the trained encoder. The downstream network’s weights, a single encoded datapoint, the network’s output and the ground truth are then added to a block for federated aggregation. Lastly, this data facilitates the aggregation of all participating networks, enabling more complex and verifiable aggregation methods than previously possible. This three-stage process results in more robust networks with significantly reduced computational complexity, maintaining high accuracy by training only the downstream classifier at the edge. PoR scales to large IoT networks with low latency and storage growth, and adapts to evolving data, regulations, and network conditions.
zh

[CV-62] SC-MII: Infrastructure LiDAR-based 3D Object Detection on Edge Devices for Split Computing with Multiple Intermediate Outputs Integration

【速读】:该论文旨在解决两个关键问题:一是高计算复杂度的3D目标检测模型在边缘设备上的部署难题,二是单LiDAR系统因盲区导致的感知局限性。解决方案的核心在于提出一种基于多基础设施LiDAR的分层计算框架(Split Computing with Multiple Intermediate outputs Integration, SC-MII),其中边缘设备仅执行深度神经网络(Deep Neural Network, DNN)的前几层处理,并将中间特征输出传输至边缘服务器;服务器端融合来自多个边缘设备的中间特征并完成最终推理,从而显著降低边缘设备的计算负载与延迟,同时提升感知完整性与隐私保护能力。实验表明,该方案在真实数据集上实现了2.19倍的速度提升和71.6%的边缘设备处理时间减少,且精度损失不超过1.09%。

链接: https://arxiv.org/abs/2601.07119
作者: Taisuke Noguchi,Takayuki Nishio,Takuya Azumi
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages. This version includes minor lstlisting configuration adjustments for successful compilation. No changes to content or layout. Originally published at IEEE CCNC 2026

点击查看摘要

Abstract:3D object detection using LiDAR-based point cloud data and deep neural networks is essential in autonomous driving technology. However, deploying state-of-the-art models on edge devices present challenges due to high computational demands and energy consumption. Additionally, single LiDAR setups suffer from blind spots. This paper proposes SC-MII, multiple infrastructure LiDAR-based 3D object detection on edge devices for Split Computing with Multiple Intermediate outputs Integration. In SC-MII, edge devices process local point clouds through the initial DNN layers and send intermediate outputs to an edge server. The server integrates these features and completes inference, reducing both latency and device load while improving privacy. Experimental results on a real-world dataset show a 2.19x speed-up and a 71.6% reduction in edge device processing time, with at most a 1.09% drop in accuracy.
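A toy split of a network into an on-device head and a server-side tail shows the data flow; the element-wise max over intermediate tensors is one plausible integration rule (it keeps the strongest response per cell across sensors, covering each LiDAR's blind spots), not necessarily the paper's, and the small CNN stands in for a real 3D detector:

```python
import torch
import torch.nn as nn

# head runs on each edge device; tail runs once on the edge server
head = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
tail = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

def edge_device(bev_grid):
    """Edge side: encode the local pseudo-image (e.g. a BEV grid built from the
    point cloud) and ship only the compact intermediate tensor to the server."""
    with torch.no_grad():
        return head(bev_grid)

def server(intermediates):
    """Server side: fuse intermediate features from several LiDARs and finish inference."""
    fused = torch.stack(intermediates).amax(dim=0)   # element-wise max across sensors
    return tail(fused)

# three infrastructure LiDARs observing overlapping regions (toy 64x64 BEV grids)
feats = [edge_device(torch.randn(1, 1, 64, 64)) for _ in range(3)]
print(server(feats).shape)  # torch.Size([1, 10])
```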
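To make the split-computing flow concrete, here is a minimal PyTorch sketch: each edge device runs only an early convolutional stem and ships its intermediate features to a server module that fuses them and finishes inference. The split point, feature shapes, and fusion by channel concatenation plus a 1x1 convolution are illustrative assumptions, not the exact SC-MII architecture.

```python
import torch
import torch.nn as nn

class EdgeHead(nn.Module):
    """Early DNN layers executed locally on each edge device."""
    def __init__(self, in_ch=3, feat_ch=16):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):        # x: local sensor input (e.g., a BEV raster)
        return self.stem(x)      # intermediate output sent to the server

class ServerTail(nn.Module):
    """Server-side layers that fuse intermediate outputs and finish inference."""
    def __init__(self, feat_ch=16, n_edges=3, n_cls=10):
        super().__init__()
        self.fuse = nn.Conv2d(feat_ch * n_edges, 64, 1)  # fuse by 1x1 conv
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, n_cls))

    def forward(self, feats):    # feats: one intermediate tensor per edge device
        fused = torch.relu(self.fuse(torch.cat(feats, dim=1)))
        return self.head(fused)

edges = [EdgeHead() for _ in range(3)]
server = ServerTail(n_edges=3)
views = [torch.randn(1, 3, 64, 64) for _ in range(3)]   # one view per LiDAR
logits = server([e(v) for e, v in zip(edges, views)])
print(logits.shape)  # torch.Size([1, 10])
```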

[CV-63] Few-shot Class-Incremental Learning via Generative Co-Memory Regularization

[Quick Read]: This paper tackles the central challenge of few-shot class-incremental learning (FSCIL): giving models strong representation and adaptation abilities when only a few samples of novel classes are available, so as to avoid catastrophic forgetting of old classes and overfitting to new ones. The proposed generative co-memory regularization has two key components. First, generative domain adaptation fine-tuning jointly uses a masked autoencoder (MAE) decoder for feature reconstruction and a fully-connected classifier for feature classification to fine-tune a pretrained generative encoder, yielding general and transferable representations. Second, two class-wise memories are built: a representation memory storing per-class mean features and a weight memory storing classifier weights; in each incremental session the classifier is updated dynamically while jointly optimizing feature classification and the co-memory regularization term, realizing memory-driven incremental learning. This mechanism balances learning new knowledge with retaining old knowledge and clearly improves recognition accuracy and stability in FSCIL settings.

Link: https://arxiv.org/abs/2601.07117
Authors: Kexin Bao,Yong Li,Dan Zeng,Shiming Ge
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Shanghai University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by International Journal on Computer Vision (IJCV)

Abstract:Few-shot class-incremental learning (FSCIL) aims to incrementally learn models from a small amount of novel data, which requires strong representation and adaptation ability of models learned under few-example supervision to avoid catastrophic forgetting on old classes and overfitting to novel classes. This work proposes a generative co-memory regularization approach to facilitate FSCIL. In the approach, the base learning leverages generative domain adaptation finetuning to finetune a pretrained generative encoder on a few examples of base classes by jointly incorporating a masked autoencoder (MAE) decoder for feature reconstruction and a fully-connected classifier for feature classification, which enables the model to efficiently capture general and adaptable representations. Using the finetuned encoder and learned classifier, we construct two class-wise memories: representation memory for storing the mean features for each class, and weight memory for storing the classifier weights. After that, the memory-regularized incremental learning is performed to train the classifier dynamically on the examples of few-shot classes in each incremental session by simultaneously optimizing feature classification and co-memory regularization. The memories are updated in a class-incremental manner and they collaboratively regularize the incremental learning. In this way, the learned models improve recognition accuracy, while mitigating catastrophic forgetting over old classes and overfitting to novel classes. Extensive experiments on popular benchmarks clearly demonstrate that our approach outperforms the state-of-the-arts.
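The two class-wise memories lend themselves to a short sketch. Below, `update_memories` stores per-class mean features (representation memory) and a snapshot of the classifier weights (weight memory), and `co_memory_loss` pulls the current classifier toward both; the specific L2 form of the pull is a guess at the general shape of co-memory regularization, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def update_memories(feats, labels, rep_mem, classifier_weight, n_classes):
    """Store per-class mean features and a snapshot of classifier weights."""
    for c in range(n_classes):
        mask = labels == c
        if mask.any():
            rep_mem[c] = feats[mask].mean(dim=0).detach()
    weight_mem = classifier_weight.detach().clone()
    return rep_mem, weight_mem

def co_memory_loss(classifier_weight, rep_mem, weight_mem):
    """Keep each class weight close to both its stored weight and mean feature."""
    return (F.mse_loss(classifier_weight, weight_mem)
            + F.mse_loss(classifier_weight, F.normalize(rep_mem, dim=1)))

n_classes, dim = 5, 64
rep_mem = torch.zeros(n_classes, dim)
clf = torch.nn.Linear(dim, n_classes, bias=False)
feats = torch.randn(32, dim)                      # encoder outputs for a batch
labels = torch.randint(0, n_classes, (32,))
rep_mem, weight_mem = update_memories(feats, labels, rep_mem, clf.weight, n_classes)
reg = co_memory_loss(clf.weight, rep_mem, weight_mem)
print(float(reg))   # added to the classification loss during incremental sessions
```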

[CV-64] MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

[Quick Read]: This paper addresses the lack of multi-step reasoning ability in current vision-language models (VLMs) for medical image analysis, in particular their reliance on static visual embeddings and single-pass inference, which prevents them from verifying and revising evidence through iterative visual interaction. The key of the solution is MedVistaGym, a scalable and interactive training environment whose structured agentic training teaches models to dynamically decide when and which tools to invoke (such as image cropping, annotation, and comparison) during reasoning, to localize task-relevant image regions, and to integrate evidence from multiple sub-images into interleaved multimodal reasoning chains. Experiments show that MedVistaGym-R1-8B, trained with this framework, clearly outperforms comparably sized tool-augmented baselines on six medical VQA benchmarks (by 19.10% to 24.21%), demonstrating that structured agentic training, rather than tool access alone, is what unlocks effective tool-integrated reasoning for medical image analysis.

Link: https://arxiv.org/abs/2601.07107
Authors: Meng Lu,Yuxing Lu,Yuchen Zhuang,Megan Mullins,Yang Xie,Guanghua Xiao,Charles Fleming,Wenqi Shi,Xuan Wang
Affiliations: Virginia Tech; UT Southwestern Medical Center; Georgia Institute of Technology; Cisco
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training–not tool access alone–unlocks effective tool-integrated reasoning for medical image analysis.

[CV-65] 3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising

[Quick Read]: This paper aims to solve the degradation of image quality and diagnostic reliability caused by increased noise in low-dose positron emission tomography (PET). Existing diffusion models are strong denoisers but struggle to preserve anatomical consistency in low signal-to-noise and whole-body volumetric settings. The key of the solution is a fully 3D diffusion framework based on a Wavelet-Conditioned ControlNet (WCC-Net), which introduces wavelet representations in the frequency domain as explicit structural priors, decoupling anatomical structure from noise while preserving generative expressiveness and 3D structural continuity. The method injects wavelet-based structural guidance into a frozen pretrained diffusion backbone through a lightweight control branch, improving denoising quality and anatomical consistency without changing the backbone parameters.

Link: https://arxiv.org/abs/2601.07093
Authors: Peiyuan Jing,Yue Tang,Chun-Wun Cheng,Zhenxuan Zhang,Liutao Yang,Thiago V. Lima,Klaus Strobel,Antoine Leimgruber,Angelica Aviles-Rivero,Guang Yang,Javier Montoya
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Low-dose Positron Emission Tomography (PET) imaging reduces patient radiation exposure but suffers from increased noise that degrades image quality and diagnostic reliability. Although diffusion models have demonstrated strong denoising capability, their stochastic nature makes it challenging to enforce anatomically consistent structures, particularly in low signal-to-noise regimes and volumetric whole-body imaging. We propose Wavelet-Conditioned ControlNet (WCC-Net), a fully 3D diffusion-based framework that introduces explicit frequency-domain structural priors via wavelet representations to guide volumetric PET denoising. By injecting wavelet-based structural guidance into a frozen pretrained diffusion backbone through a lightweight control branch, WCC-Net decouples anatomical structure from noise while preserving generative expressiveness and 3D structural continuity. Extensive experiments demonstrate that WCC-Net consistently outperforms CNN-, GAN-, and diffusion-based baselines. On the internal 1/20-dose test set, WCC-Net improves PSNR by +1.21 dB and SSIM by +0.008 over a strong diffusion baseline, while reducing structural distortion (GMSD) and intensity error (NMAE). Moreover, WCC-Net generalizes robustly to unseen dose levels (1/50 and 1/4), achieving superior quantitative performance and improved volumetric anatomical consistency.
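A minimal sketch of the frequency-domain prior, using PyWavelets: a single-level 3D wavelet transform splits a volume into one approximation band (coarse anatomy) and seven detail bands (edges), which a control branch could consume as structural conditioning. The Haar wavelet and the stacking of all sub-bands are illustrative choices, not WCC-Net's exact configuration.

```python
import numpy as np
import pywt

volume = np.random.rand(32, 64, 64)           # toy low-dose PET volume (D, H, W)
coeffs = pywt.dwtn(volume, wavelet="haar")    # single-level 3D DWT -> 8 sub-bands
low = coeffs["aaa"]                           # approximation band: coarse structure
highs = {k: v for k, v in coeffs.items() if k != "aaa"}  # detail bands: edges

print(low.shape)                              # (16, 32, 32): half resolution per axis
# A lightweight control branch could take the stacked sub-bands as input while
# the frozen diffusion backbone performs the actual denoising.
cond = np.stack([low] + list(highs.values()))
print(cond.shape)                             # (8, 16, 32, 32)
```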

[CV-66] Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression

[Quick Read]: This paper addresses the poor real-time performance of visual question answering (VQA) models in autonomous driving caused by their high computational complexity: over long video sequences, existing large vision-language models (VLMs) typically process dense patch tokens for every frame, incurring heavy FLOPs and inference latency that cannot satisfy the low-latency requirements of this safety-critical application. The key of the solution is SRC-Pipeline, an efficient VLM framework that learns to compress the tokens of earlier frames into a small number of high-level semantic tokens while retaining the full patch tokens of recent frames, achieving a 66% FLOPs reduction with comparable performance and making the model much better suited to real-time autonomous driving settings.

Link: https://arxiv.org/abs/2601.07092
Authors: Yuliang Cai,Dongqiangzi Ye,Zitian Chen,Chongruo Wu
Affiliations: University of Southern California; XPeng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:Autonomous driving increasingly relies on Visual Question Answering (VQA) to enable vehicles to understand complex surroundings by analyzing visual inputs and textual queries. Currently, a paramount concern for VQA in this domain is the stringent requirement for fast latency and real-time processing, as delays directly impact real-world safety in this safety-critical application. However, current state-of-the-art VQA models, particularly large vision-language models (VLMs), often prioritize performance over computational efficiency. These models typically process dense patch tokens for every frame, leading to prohibitive computational costs (FLOPs) and significant inference latency, especially with long video sequences. This focus limits their practical deployment in real-time autonomous driving scenarios. To tackle this issue, we propose an efficient VLM framework for autonomous driving VQA tasks, SRC-Pipeline. It learns to compress early frame tokens into a small number of high-level tokens while retaining full patch tokens for recent frames. Experiments on autonomous driving video question answering tasks show that our approach achieves 66% FLOPs reduction while maintaining comparable performance, enabling VLMs to operate more effectively in real-time, safety-critical autonomous driving settings.
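The compression idea can be sketched in a few lines: earlier frames are squeezed to a handful of summary tokens while the most recent frames keep all their patch tokens. Average pooling here is only a stand-in for SRC-Pipeline's learned compressor, and the token counts are illustrative.

```python
import torch

def compress_history(frame_tokens, keep_recent=2, tokens_per_old_frame=4):
    """frame_tokens: list of (num_patches, dim) tensors, oldest first."""
    old, recent = frame_tokens[:-keep_recent], frame_tokens[-keep_recent:]
    compressed = []
    for t in old:
        n, d = t.shape
        g = max(1, n // tokens_per_old_frame)
        # average-pool groups of patch tokens into a few summary tokens
        compressed.append(t[: g * tokens_per_old_frame].view(-1, g, d).mean(1))
    return torch.cat(compressed + recent, dim=0)

frames = [torch.randn(196, 768) for _ in range(8)]   # 8 frames of ViT patch tokens
seq = compress_history(frames)
print(seq.shape)  # (6*4 + 2*196, 768) = (416, 768) instead of (1568, 768)
```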

[CV-67] Billboard in Focus: Estimating Driver Gaze Duration from a Single Image

[Quick Read]: This paper addresses driver distraction and the associated accident risk caused by roadside billboards; the core challenge is to automatically assess how long drivers look at billboards without manual annotation or eye-tracking devices. The key of the solution is a fully automated two-stage pipeline: the first stage uses a YOLO-based object detector trained on Mapillary Vistas and fine-tuned on BillboardLamac, reaching 94% mAP@50 for billboard detection; the second stage builds a classifier on the detected bounding-box positions and DINOv2 features to estimate driver gaze duration from a single frame, reaching 68.1% accuracy on BillboardLamac, with the results further validated on Google Street View imagery.

Link: https://arxiv.org/abs/2601.07073
Authors: Carlos Pizarroso,Zuzana Berger Haladová,Zuzana Černeková,Viktor Kocur
Affiliations: Comenius University Bratislava
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a position paper at VISAPP 2026

Abstract:Roadside billboards represent a central element of outdoor advertising, yet their presence may contribute to driver distraction and accident risk. This study introduces a fully automated pipeline for billboard detection and driver gaze duration estimation, aiming to evaluate billboard relevance without reliance on manual annotations or eye-tracking devices. Our pipeline operates in two stages: (1) a YOLO-based object detection model, trained on Mapillary Vistas and fine-tuned on BillboardLamac images, which achieved 94% mAP@50 in the billboard detection task; and (2) a classifier based on the detected bounding box positions and DINOv2 features. The proposed pipeline enables estimation of billboard driver gaze duration from individual frames. We show that our method is able to achieve 68.1% accuracy on BillboardLamac when considering individual frames. These results are further validated using images collected from Google Street View.
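A hedged sketch of the second stage: concatenate normalized bounding-box geometry with a DINOv2 embedding of the billboard crop and fit a simple classifier over gaze-duration classes. The 768-dimensional embedding, the three gaze classes, the logistic-regression choice, and the random toy data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bbox_features(box, img_w, img_h):
    """box = (x1, y1, x2, y2) in pixels -> normalized center, size, area."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / (2 * img_w), (y1 + y2) / (2 * img_h)
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    return np.array([cx, cy, w, h, w * h])

rng = np.random.default_rng(0)
n = 200
dino = rng.normal(size=(n, 768))                 # assumed DINOv2 embedding size
boxes = np.stack([bbox_features((100, 80, 300, 200), 1280, 720)] * n)
boxes += rng.normal(scale=0.05, size=boxes.shape)   # jitter the toy geometry
X = np.concatenate([boxes, dino], axis=1)
y = rng.integers(0, 3, size=n)                   # e.g. short / medium / long gaze
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))                           # toy data, so accuracy is meaningless
```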

[CV-68] Adversarial Attacks on Medical Hyperspectral Imaging Exploiting Spectral-Spatial Dependencies and Multiscale Features

[Quick Read]: This paper addresses the vulnerability of deep learning models for medical hyperspectral imaging (MHI) to adversarial attacks. The study shows that this fragility stems from two factors: reliance on local pixel dependencies to preserve tissue structure, and reliance on multiscale spectral-spatial representations for hierarchical feature encoding. The key of the solution is a targeted adversarial attack framework with two core modules: a Local Pixel Dependency Attack that exploits spatial correlations among neighboring pixels, and a Multiscale Information Attack that perturbs feature representations across hierarchical spectral-spatial scales. Experiments show that the attacks significantly degrade classification performance, especially in tumor regions, while remaining visually imperceptible, revealing the unique vulnerabilities of medical hyperspectral models and underscoring the need for robust, structure-aware defenses.

Link: https://arxiv.org/abs/2601.07056
Authors: Yunrui Gu,Zhenzhe Gao,Cong Kong,Zhaoxia Yin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Medical hyperspectral imaging (HSI) enables accurate disease diagnosis by capturing rich spectral-spatial tissue information, but recent advances in deep learning have exposed its vulnerability to adversarial attacks. In this work, we identify two fundamental causes of this fragility: the reliance on local pixel dependencies for preserving tissue structure and the dependence on multiscale spectral-spatial representations for hierarchical feature encoding. Building on these insights, we propose a targeted adversarial attack framework for medical HSI, consisting of a Local Pixel Dependency Attack that exploits spatial correlations among neighboring pixels, and a Multiscale Information Attack that perturbs features across hierarchical spectral-spatial scales. Experiments on the Brain and MDC datasets demonstrate that our attacks significantly degrade classification performance, especially in tumor regions, while remaining visually imperceptible. Compared with existing methods, our approach reveals the unique vulnerabilities of medical HSI models and underscores the need for robust, structure-aware defenses in clinical applications.

[CV-69] Explainable Deep Radiogenomic Molecular Imaging for MGMT Methylation Prediction in Glioblastoma

[Quick Read]: This paper targets non-invasive prediction of the methylation status of the O6-methylguanine-DNA methyltransferase (MGMT) promoter in glioblastoma (GBM), as an alternative to traditional methods that rely on invasive biopsies. The key of the solution is a multi-parametric MRI (mpMRI) analysis framework that integrates radiomics, deep learning, and explainable AI (XAI): radiomic features are extracted from FLAIR, T1-weighted, T1-contrast-enhanced, and T2-weighted sequences, a 3D convolutional neural network learns deep representations, and the complementary features are fused with early-fusion and attention-based strategies to classify MGMT methylation status; XAI methods such as Grad-CAM and SHAP are applied to make model decisions clinically interpretable, advancing the use of molecular imaging for precision oncology.

Link: https://arxiv.org/abs/2601.07035
Authors: Hasan M Jamil
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:Glioblastoma (GBM) is a highly aggressive primary brain tumor with limited therapeutic options and poor prognosis. The methylation status of the O6-methylguanine-DNA methyltransferase (MGMT) gene promoter is a critical molecular biomarker that influences patient response to temozolomide chemotherapy. Traditional methods for determining MGMT status rely on invasive biopsies and are limited by intratumoral heterogeneity and procedural risks. This study presents a radiogenomic molecular imaging analysis framework for the non-invasive prediction of MGMT promoter methylation using multi-parametric magnetic resonance imaging (mpMRI). Our approach integrates radiomics, deep learning, and explainable artificial intelligence (XAI) to analyze MRI-derived imaging phenotypes and correlate them with molecular labels. Radiomic features are extracted from FLAIR, T1-weighted, T1-contrast-enhanced, and T2-weighted MRI sequences, while a 3D convolutional neural network learns deep representations from the same modalities. These complementary features are fused using both early fusion and attention-based strategies and classified to predict MGMT methylation status. To enhance clinical interpretability, we apply XAI methods such as Grad-CAM and SHAP to visualize and explain model decisions. The proposed framework is trained on the RSNA-MICCAI Radiogenomic Classification dataset and externally validated on the BraTS 2021 dataset. This work advances the field of molecular imaging by demonstrating the potential of AI-driven radiogenomics for precision oncology, supporting non-invasive, accurate, and interpretable prediction of clinically actionable molecular biomarkers in GBM.

[CV-70] Spatial Multi-Task Learning for Breast Cancer Molecular Subtype Prediction from Single-Phase DCE-MRI

[Quick Read]: This paper addresses the fact that accurate molecular subtyping of breast cancer conventionally depends on invasive biopsies that are prone to sampling bias, and proposes non-invasive subtype prediction from clinically practical single-phase contrast-enhanced MRI (DCE-MRI). The key of the solution is a spatial multi-task learning framework: a deep feature extraction network with multi-scale spatial attention captures intratumoral and peritumoral characteristics, and a region-of-interest weighting module emphasizes the tumor core, rim, and surrounding tissue; multi-task learning exploits the correlations among biomarkers through shared representations with task-specific prediction branches, jointly predicting estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) status as well as the Ki-67 proliferation index, clearly outperforming radiomics and single-task deep learning baselines.

Link: https://arxiv.org/abs/2601.07001
Authors: Sen Zeng,Hong Zhou,Zheng Zhu,Yang Liu
Affiliations: Tsinghua University; Southwest Forestry University; GigaAI; KCL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate molecular subtype classification is essential for personalized breast cancer treatment, yet conventional immunohistochemical analysis relies on invasive biopsies and is prone to sampling bias. Although dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) enables non-invasive tumor characterization, clinical workflows typically acquire only single-phase post-contrast images to reduce scan time and contrast agent dose. In this study, we propose a spatial multi-task learning framework for breast cancer molecular subtype prediction from clinically practical single-phase DCE-MRI. The framework simultaneously predicts estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) status, and the Ki-67 proliferation index – biomarkers that collectively define molecular subtypes. The architecture integrates a deep feature extraction network with multi-scale spatial attention to capture intratumoral and peritumoral characteristics, together with a region-of-interest weighting module that emphasizes the tumor core, rim, and surrounding tissue. Multi-task learning exploits biological correlations among biomarkers through shared representations with task-specific prediction branches. Experiments on a dataset of 960 cases (886 internal cases split 7:1:2 for training/validation/testing, and 74 external cases evaluated via five-fold cross-validation) demonstrate that the proposed method achieves an AUC of 0.893, 0.824, and 0.857 for ER, PR, and HER2 classification, respectively, and a mean absolute error of 8.2% for Ki-67 regression, significantly outperforming radiomics and single-task deep learning baselines. These results indicate the feasibility of accurate, non-invasive molecular subtype prediction using standard imaging protocols.
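The shared-encoder/multi-head design can be sketched as follows: one backbone feeds three binary classification heads (ER, PR, HER2) and one regression head (Ki-67). The tiny CNN backbone and the unweighted loss sum are illustrative stand-ins; the paper's multi-scale spatial attention and ROI weighting modules are omitted.

```python
import torch
import torch.nn as nn

class BiomarkerNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for the MRI backbone
            nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.er = nn.Linear(feat_dim, 1)         # binary biomarker-status heads
        self.pr = nn.Linear(feat_dim, 1)
        self.her2 = nn.Linear(feat_dim, 1)
        self.ki67 = nn.Linear(feat_dim, 1)       # proliferation index in [0, 1]

    def forward(self, x):
        z = self.encoder(x)                      # shared representation
        return self.er(z), self.pr(z), self.her2(z), torch.sigmoid(self.ki67(z))

net = BiomarkerNet()
x = torch.randn(4, 1, 96, 96)                    # toy single-phase DCE-MRI crops
er, pr, her2, ki67 = net(x)
bce, mae = nn.BCEWithLogitsLoss(), nn.L1Loss()
y = torch.randint(0, 2, (4, 1)).float()          # toy labels for all three statuses
loss = bce(er, y) + bce(pr, y) + bce(her2, y) + mae(ki67, torch.rand(4, 1))
loss.backward()                                  # one backward pass trains all heads
```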

[CV-71] ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction

[Quick Read]: This paper addresses autonomous high-fidelity object reconstruction, which underpins digital asset creation and narrowing the simulation-to-reality gap in robotics. The core challenges are reconstructing both geometry and appearance efficiently and completely for geometrically complex objects, overcoming the limitations of conventional opacity- or depth-based cues, and avoiding the local optima caused by greedy path planning. The key of the solution is the ObjSplat framework, which uses Gaussian surfels as a unified representation and a geometry-aware viewpoint evaluation pipeline (explicitly modeling back-face visibility and occlusion-aware multi-view covisibility) to reliably identify under-reconstructed regions, together with a multi-step lookahead next-best-path (NBP) planner that jointly optimizes information gain and movement cost on a dynamically constructed spatial graph to produce globally efficient scanning trajectories.

Link: https://arxiv.org/abs/2601.06997
Authors: Yuetao Li,Zhizhou Jia,Yu Zhang,Qun Hao,Shaohui Zhang
Affiliations: Beijing Institute of Technology
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: this https URL .

[CV-72] Can Textual Reasoning Improve the Performance of MLLM s on Fine-grained Visual Classification?

[Quick Read]: This paper studies the weakness of multimodal large language models (MLLMs) on fine-grained visual classification (FGVC) and, in particular, the phenomenon that chain-of-thought (CoT) reasoning can hurt performance on visual perception tasks. The analysis shows that the degradation caused by CoT is largely driven by the length of the reasoning text, which the authors call the "Cost of Thinking". The key of the solution is twofold: \alg, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback. The approach achieves state-of-the-art performance across FGVC benchmarks.

Link: https://arxiv.org/abs/2601.06993
Authors: Jie Zhu,Yiyang Su,Xiaoming Liu
Affiliations: Michigan State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Abstract:Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by the reasoning length, in which longer textual reasoning consistently lowers classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) \alg, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and the proposed ReFine-RFT, achieving state-of-the-art performance across FGVC benchmarks. Code and models are available at this https URL.
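The abstract does not spell out the \alg normalization, so the sketch below shows only a generic baseline for the same problem: z-scoring each reward stream over a batch before mixing, so that heterogeneous signals (here a 0/1 accuracy reward and a large negative length penalty) end up on a common scale. The reward names and weights are hypothetical, and this is a stand-in for the idea rather than the paper's method.

```python
import numpy as np

def normalize_and_mix(rewards, weights=None, eps=1e-8):
    """rewards: dict name -> array of per-sample raw rewards for one batch."""
    names = sorted(rewards)
    weights = weights or {n: 1.0 for n in names}
    total = np.zeros_like(np.asarray(rewards[names[0]], dtype=float))
    for n in names:
        r = np.asarray(rewards[n], dtype=float)
        z = (r - r.mean()) / (r.std() + eps)   # put each reward on a common scale
        total += weights[n] * z
    return total

batch = {"accuracy": np.array([1.0, 0.0, 1.0, 1.0]),
         "length_penalty": np.array([-320.0, -80.0, -150.0, -40.0])}
print(normalize_and_mix(batch))                # balanced mix despite scale mismatch
```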

[CV-73] Unified Personalized Understanding Generating and Editing

[Quick Read]: This paper addresses the lack of consistency and controllability of unified large multimodal models (LMMs) in personalization tasks, namely how to keep a user-specific concept (e.g., the person "maeve") stable across understanding, generation, and image editing. Existing methods rely on external retrieval or complex multi-stage training and suffer from cross-task interference and fuzzy knowledge. The key of the solution is the OmniPersona framework, which uses structurally decoupled concept tokens to allocate dedicated subspaces to different tasks, reducing interference, and an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent end-to-end personalized behavior.

Link: https://arxiv.org/abs/2601.06965
Authors: Yu Zhong,Tianwei Lin,Ruike Zhu,Yuqian Yuan,Haoyu Zheng,Liang Liang,Wenqiao Zhang,Feifei Shao,Haoyuan Li,Wanggui He,Hao Jiang,Yueting Zhuang
Affiliations: Zhejiang University; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a "one-size-fits-all" paradigm and struggle to model user-specific concepts (e.g., generate a photo of maeve) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge. We present OmniPersona, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior. To systematically evaluate unified personalization, we propose OmniPBench, extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.

[CV-74] SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models

[Quick Read]: This paper addresses the poor performance of multimodal large language models (MLLMs) on human hand-drawn sketches, especially on the complex task of visually grading and diagnosing errors in hand-drawn STEM diagrams. Existing models struggle with the unstructured and ambiguous nature of sketches and lack vision-language alignment for symbolic content in noisy settings. The key of the solution is SketchJudge, a new benchmark tailored for evaluating MLLMs as diagnostic graders of hand-drawn STEM diagrams, comprising 1,015 hand-drawn student responses across four domains (geometry, physics, charts, and flowcharts) with diverse stylistic variations and distinct error types. Results show that even advanced MLLMs lag far behind humans on the required structural, semantic, and metacognitive reasoning, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment.

Link: https://arxiv.org/abs/2601.06944
Authors: Yuhang Su,Mei Wang,Yaoyao Zhong,Guozhang Li,Shixing Li,Yihan Feng,Hua Huang
Affiliations: Beijing Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages for the main text (excluding references and the limitations section); 37 pages in total including appendices

Abstract:While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark’s effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at this https URL.

[CV-75] Watching Reasoning and Searching: A Video Deep Research Benchmark on Open Web for Agent ic Video Reasoning

[Quick Read]: This paper addresses a difficult real-world video question answering setting in which models must combine localized visual cues with open-web information and verify answers through multi-hop reasoning; existing methods struggle to integrate cross-frame clue extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence. The key of the solution is the first benchmark for video deep research (VideoDR): models must perform video-conditioned cross-frame visual anchor extraction, iterative interaction with the open web, and multi-hop reasoning and verification over joint video and web evidence; rigorous human annotation and quality control yield high-quality samples spanning six semantic domains. Systematic evaluation of closed- and open-source models under both the Workflow and Agentic paradigms shows that Agentic is not consistently superior, and that goal drift and long-horizon consistency are the core bottlenecks, pointing to clear directions for next-generation video deep research agents.

Link: https://arxiv.org/abs/2601.06943
Authors: Chengwen Liu,Xiaomin Yu,Zhuoyue Chang,Zhe Huang,Shuo Zhang,Heng Lian,Kunyi Wang,Rui Xu,Sen Hu,Jianheng Hou,Hao Peng,Chengwei Qin,Xiaobin Hu,Hong Peng,Ronghao Chen,Huacan Wang
Affiliations: LZU; HKUST(GZ); UBC; FDU; PKU; USC; NUS; UCAS; HKUST; QuantaAlpha
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model’s ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

[CV-76] RenderFlow: Single-Step Neural Rendering via Flow Matching

[Quick Read]: This paper addresses the high computational cost of traditional physically based rendering (PBR) and the shortcomings of current diffusion-based neural rendering, whose iterative process incurs high latency and whose stochasticity undermines physical accuracy and temporal consistency. The key of the solution is RenderFlow, an end-to-end, deterministic, single-step neural rendering framework built on the flow matching paradigm, which greatly accelerates rendering while preserving physically plausible and visually high-quality outputs; a sparse keyframe guidance module further improves generalization and rendering fidelity, and a lightweight adapter module efficiently repurposes the pretrained forward model for the inverse rendering task of intrinsic decomposition.

Link: https://arxiv.org/abs/2601.06928
Authors: Shenghao Zhang,Runtao Liu,Christopher Schroers,Yang Zhang
Affiliations: Disney Research|Studios; ETH Zürich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conventional physically based rendering (PBR) pipelines generate photorealistic images through computationally intensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic, single-step neural rendering framework, RenderFlow, built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse keyframe guidance. Our method significantly accelerates the rendering process and, by optionally incorporating sparsely rendered keyframes as guidance, enhances both the physical plausibility and overall visual quality of the output. The resulting pipeline achieves near real-time performance with photorealistic rendering quality, effectively bridging the gap between the efficiency of modern generative models and the precision of traditional physically based rendering. Furthermore, we demonstrate the versatility of our framework by introducing a lightweight, adapter-based module that efficiently repurposes the pretrained forward model for the inverse rendering task of intrinsic decomposition.

[CV-77] UDPNet: Unleashing Depth-based Priors for Robust Image Dehazing

[Quick Read]: This paper addresses two common problems in image dehazing: most models rely only on single-modal RGB features and ignore the inherent correlation between scene depth and haze distribution, and even methods that jointly optimize depth estimation and dehazing underperform because they fail to exploit accurate depth information. The key of the solution is the UDPNet framework with two core modules: a Depth-Guided Attention Module (DGAM) that adaptively modulates features via lightweight depth-guided channel attention, and a Depth Prior Fusion Module (DPFM) that hierarchically fuses multi-scale depth-map features with a dual sliding-window multi-head cross-attention mechanism. These modules ensure computational efficiency while effectively integrating depth priors, enabling the network to adapt dynamically to varying haze densities, illumination conditions, and synthetic-to-real domain gaps, and setting new benchmarks on popular dehazing datasets.

Link: https://arxiv.org/abs/2601.06909
Authors: Zengyuan Zuo,Junjun Jiang,Gang Wu,Xianming Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image dehazing has witnessed significant advancements with the development of deep learning models. However, a few methods predominantly focus on single-modal RGB features, neglecting the inherent correlation between scene depth and haze distribution. Even those that jointly optimize depth estimation and image dehazing often suffer from suboptimal performance due to inadequate utilization of accurate depth information. In this paper, we present UDPNet, a general framework that leverages depth-based priors from large-scale pretrained depth estimation model DepthAnything V2 to boost existing image dehazing models. Specifically, our architecture comprises two typical components: the Depth-Guided Attention Module (DGAM) adaptively modulates features via lightweight depth-guided channel attention, and the Depth Prior Fusion Module (DPFM) enables hierarchical fusion of multi-scale depth map features by dual sliding-window multi-head cross-attention mechanism. These modules ensure both computational efficiency and effective integration of depth priors. Moreover, the intrinsic robustness of depth priors empowers the network to dynamically adapt to varying haze densities, illumination conditions, and domain gaps across synthetic and real-world data. Extensive experimental results demonstrate the effectiveness of our UDPNet, outperforming the state-of-the-art methods on popular dehazing datasets, such as 0.85 dB PSNR improvement on the SOTS dataset, 1.19 dB on the Haze4K dataset and 1.79 dB PSNR on the NHR dataset. Our proposed solution establishes a new benchmark for depth-aware dehazing across various scenarios. Pretrained models and codes will be released at our project this https URL.

[CV-78] CLIMP: Contrastive Language-Image Mamba Pretraining

[Quick Read]: This paper addresses two key problems of the Vision Transformer (ViT) in contrastive language-image pre-training (CLIP): the attention mechanism is susceptible to spurious correlations, and the computational cost scales quadratically with input resolution. The key of the solution is CLIMP, the first fully Mamba-based contrastive vision-language model, which replaces the vision encoder with VMamba to capture visual spatial inductive biases and thereby reduce reliance on spurious correlations, and adopts an autoregressive text encoder that overcomes CLIP's fixed context limitation, enabling dense captioning retrieval. The design supports variable input resolutions without positional-encoding interpolation or specialized training, improving retrieval accuracy by up to 6.6% at high resolution while using 5x less memory and 1.8x fewer FLOPs, and surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O.

Link: https://arxiv.org/abs/2601.06891
Authors: Nimrod Shabtay,Itamar Zimerman,Eli Schwartz,Raja Giryes
Affiliations: IBM Research; Tel-Aviv University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers whose attention mechanism is susceptible to spurious correlations, and scales quadratically with resolution. To address these limitations, we present CLIMP, the first fully Mamba-based contrastive vision-language model that replaces both the vision and text encoders with Mamba. The new architecture encodes sequential structure in both vision and language, with VMamba capturing visual spatial inductive biases, reducing reliance on spurious correlations and producing an embedding space favorable for cross-modal retrieval and out-of-distribution robustness, surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O. CLIMP naturally supports variable input resolutions without positional encoding interpolation or specialized training, achieving up to 6.6% higher retrieval accuracy at 16x training resolution while using 5x less memory and 1.8x fewer FLOPs. The autoregressive text encoder further overcomes CLIP's fixed context limitation, enabling dense captioning retrieval. Our findings suggest that Mamba exhibits advantageous properties for vision-language learning, making it a compelling alternative to Transformer-based CLIP.

[CV-79] MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation ICCV2025

[Quick Read]: This paper addresses CAD-model-based novel object pose estimation, i.e., accurately estimating the pose of unseen objects from RGB images alone. Traditional methods usually require many reference images and large network parameters, leading to high compute, slow inference, and poor real-time applicability. The key of the MixRI solution is a lightweight network that directly matches points using the multi-view information between query and reference images, combined with a novel reference-image fusion strategy that significantly reduces the number of reference images required, cutting storage overhead and processing time while maintaining accuracy comparable to heavier methods, as validated on the seven core datasets of the BOP challenge.

Link: https://arxiv.org/abs/2601.06883
Authors: Xinhang Liu,Jiawei Shi,Zheng Dang,Yuchao Dai
Affiliations: Northwestern Polytechnical University; Shaanxi Key Laboratory of Information Acquisition and Processing; CVLab, EPFL, Switzerland
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV 2025

Abstract:We present MixRI, a lightweight network that solves the CAD-based novel object pose estimation problem in RGB images. It can be instantly applied to a novel object at test time without finetuning. We design our network to meet the demands of real-world applications, emphasizing reduced memory requirements and fast inference time. Unlike existing works that utilize many reference images and have large network parameters, we directly match points based on the multi-view information between the query and reference images with a lightweight network. Thanks to our reference image fusion strategy, we significantly decrease the number of reference images, thus decreasing the time needed to process these images and the memory required to store them. Furthermore, with our lightweight network, our method requires less inference time. Though with fewer reference images, experiments on seven core datasets in the BOP challenge show that our method achieves comparable results with other methods that require more reference images and larger network parameters.

[CV-80] Unsupervised Domain Adaptation with SAM-RefiSeR for Enhanced Brain Tumor Segmentation

[Quick Read]: This paper addresses the performance degradation in medical image segmentation caused by the distribution gap between source and target domains, focusing on the unsupervised domain adaptation (UDA) challenge in brain tumor segmentation. The key of the solution is a SAM-guided reference enhancement strategy (SAM-RefiSeR) that uses the pretrained general-purpose segmentation model SAM (Segment Anything Model) to produce high-quality pseudo-labels, together with fine-grained feature recalibration that aligns source and target feature distributions, improving segmentation accuracy on unlabeled target-domain data.

Link: https://arxiv.org/abs/2601.06882
Authors: Dillan Imans,Phuoc-Nguyen Bui,Duc-Tai Le,Hyunseung Choo
Affiliations: SKKU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in BIBM 2025

Abstract:Unsupervised Domain Adaptation with SAM-RefiSeR for Enhanced Brain Tumor Segmentation

[CV-81] MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

[Quick Read]: This paper addresses the reliance of existing 3D referring expression segmentation (3DRES) methods on dense, high-quality point clouds, whereas real-world agents such as robots and mobile devices usually have access only to sparse multi-view RGB images under strict latency constraints. The authors therefore introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES): recovering scene structure and segmenting the referred object directly from sparse multi-view images. The core solution is the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design; to counter the weak supervision caused by sparse 3D signals, the paper introduces Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced per-view gradients for stable and efficient learning and inference.

Link: https://arxiv.org/abs/2601.06874
Authors: Changli Wu,Haodong Wang,Jiayi Ji,Yutian Yao,Chunsai Du,Jihua Kang,Yanwei Fu,Liujuan Cao
Affiliations: Xiamen University; Shanghai Innovation Institute; Fudan University; ByteDance; Tianjin University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL

Abstract:Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives. Code and models are publicly available at this https URL.

[CV-82] qAttCNN - Self Attention Mechanism for Video QoE Prediction in Encrypted Traffic

[Quick Read]: This paper addresses the difficulty internet service providers (ISPs) face in assessing users' quality of experience (QoE) for end-to-end encrypted video communication applications such as WhatsApp and Zoom. Because modern video conferencing applications are encrypted, ISPs cannot access the media stream and only see network-level QoS and routing information, so traditional QoE estimation based on comparing transmitted and received media fails. The key of the solution is qAttCNN, a no-reference QoE prediction model that analyzes the packet-size parameter of the traffic and combines an attention mechanism with a convolutional neural network to infer two core QoE metrics, BRISQUE (a blind image quality score) and frames per second (FPS), enabling effective QoE monitoring and prediction without access to the original media content.

Link: https://arxiv.org/abs/2601.06862
Authors: Michael Sidorov,Ofer Hadar
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments:

Abstract:The rapid growth of multimedia consumption, driven by major advances in mobile devices since the mid-2000s, has led to widespread use of video conferencing applications (VCAs) such as Zoom and Google Meet, as well as instant messaging applications (IMAs) like WhatsApp and Telegram, which increasingly support video conferencing as a core feature. Many of these systems rely on the Web Real-Time Communication (WebRTC) protocol, enabling direct peer-to-peer media streaming without requiring a third-party server to relay data, reducing latency and facilitating real-time communication. Despite WebRTC's potential, adverse network conditions can degrade streaming quality and consequently reduce users' Quality of Experience (QoE). Maintaining high QoE therefore requires continuous monitoring and timely intervention when QoE begins to deteriorate. While content providers can often estimate QoE by directly comparing transmitted and received media, this task is significantly more challenging for internet service providers (ISPs). End-to-end encryption, commonly used by modern VCAs and IMAs, prevents ISPs from accessing the original media stream, leaving only Quality of Service (QoS) and routing information available. To address this limitation, we propose the QoE Attention Convolutional Neural Network (qAttCNN), a model that leverages the packet size parameter of the traffic to infer two no-reference QoE metrics, viz. BRISQUE and frames per second (FPS). We evaluate qAttCNN on a custom dataset collected from WhatsApp video calls and compare it against existing QoE models. Using mean absolute error percentage (MAEP), our approach achieves 2.14% error for BRISQUE and 7.39% for FPS prediction.

[CV-83] MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data

[Quick Read]: This paper addresses the inability of vision-language models (VLMs) to visually ground the clinical narratives they generate, i.e., to anchor language descriptions to specific anatomical structures or lesion regions in medical images, a limitation rooted in the scarcity of high-quality, large-scale clinical referring-localization pairs. The key of the solution is MedGround, an automated data construction pipeline that uses expert segmentation masks as spatial anchors to derive precise localization targets and shape and spatial cues, guiding VLMs to synthesize natural queries consistent with morphology and location; a multi-stage verification system combining format checks, geometric and medical-prior rules, and image-level visual judging rigorously filters out ambiguous or visually unsupported samples. The result is the 35K-sample MedGround-35K multimodal medical dataset, which consistently improves the referring grounding performance and semantic disambiguation of VLMs and generalizes to unseen grounding settings.

Link: https://arxiv.org/abs/2601.06847
Authors: Mengmeng Zhang,Xiaoping Wu,Hao Luo,Fan Wang,Yisheng Lv
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; DAMO Academy, Alibaba Group; Hupan Lab, Zhejiang Province
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures

Abstract:Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently achieve improved referring grounding performance, enhance multi-object semantic disambiguation, and exhibit strong generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchor medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.

[CV-84] PRISM: Color-Stratified Point Cloud Sampling ICPR

[Quick Read]: This paper addresses the problem that conventional downsampling methods ignore photometric information when processing RGB-LiDAR point clouds, losing features in texture-rich regions; existing methods (random sampling, voxel grid, normal space sampling) emphasize spatial uniformity and do not use color diversity to guide sampling density. The key of the PRISM solution is to treat RGB color space as the stratification domain, partition it into color bins, and impose a maximum capacity k per bin, so that sampling density adapts to chromatic complexity: regions with high color variation (texture-rich regions) are sampled more densely, while color-homogeneous regions are heavily sparsified. This shifts the sampling criterion from pure spatial coverage to visual complexity, producing sparser point clouds that retain the structural information essential for 3D reconstruction tasks.

Link: https://arxiv.org/abs/2601.06839
Authors: Hansol Lim,Minhyeok Im,Jongseong Brad Choi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the 2026 International Conference on Pattern Recognition (ICPR) for possible publication

Abstract:We present PRISM, a novel color-guided stratified sampling method for RGB-LiDAR point clouds. Our approach is motivated by the observation that unique scene features often exhibit chromatic diversity while repetitive, redundant features are homogeneous in color. Conventional downsampling methods (Random Sampling, Voxel Grid, Normal Space Sampling) enforce spatial uniformity while ignoring this photometric content. In contrast, PRISM allocates sampling density proportional to chromatic diversity. By treating RGB color space as the stratification domain and imposing a maximum capacity k per color bin, the method preserves texture-rich regions with high color variation while substantially reducing visually homogeneous surfaces. This shifts the sampling space from spatial coverage to visual complexity to produce sparser point clouds that retain essential features for 3D reconstruction tasks.
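Because the mechanism is fully described above (RGB bins with a capacity cap k), it admits a direct NumPy sketch; the bin resolution and the value of k are illustrative parameters.

```python
import numpy as np

def prism_sample(points, colors, bins_per_channel=8, k=64, seed=0):
    """points: (N, 3) xyz; colors: (N, 3) RGB in [0, 255]."""
    rng = np.random.default_rng(seed)
    # quantize each channel into bins_per_channel levels
    q = (colors.astype(int) * bins_per_channel // 256).clip(0, bins_per_channel - 1)
    bin_id = q[:, 0] * bins_per_channel**2 + q[:, 1] * bins_per_channel + q[:, 2]
    keep = []
    for b in np.unique(bin_id):
        idx = np.flatnonzero(bin_id == b)
        if len(idx) > k:                       # cap the capacity of this color bin
            idx = rng.choice(idx, size=k, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return points[keep], colors[keep]

pts = np.random.rand(100_000, 3)
cols = np.random.randint(0, 256, size=(100_000, 3))
sub_pts, sub_cols = prism_sample(pts, cols)
print(len(sub_pts), "of", len(pts))   # homogeneous colors get sparsified hardest
```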

[CV-85] OSCAR: Optical-aware Semantic Control for Aleatoric Refinement in Sar-to-Optical Translation

[Quick Read]: This paper addresses cross-modal translation from synthetic aperture radar (SAR) images to optical images, a task made difficult by the inherent speckle noise and geometric distortions of SAR data, which cause semantic misinterpretation, ambiguous texture synthesis, and structural hallucinations. The key of the proposed SAR-to-Optical (S2O) framework lies in three components: (i) cross-modal semantic alignment, which builds an optical-aware SAR encoder by distilling robust semantic priors from an optical teacher; (ii) semantically grounded generative guidance, a ControlNet that combines class-aware text prompts for global context with hierarchical visual prompts for local spatial control; and (iii) an uncertainty-aware objective that explicitly models aleatoric uncertainty to dynamically modulate the reconstruction focus, effectively mitigating artifacts caused by speckle-induced ambiguity.

Link: https://arxiv.org/abs/2601.06835
Authors: Hyunseo Lee,Sang Min Kim,Ho Kyung Shin,Taeheon Kim,Woo-Jeoung Nam
Affiliations: Kyungpook National University; Korea Aerospace Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: main 15 pages, supplementary 5 pages

Abstract:Synthetic Aperture Radar (SAR) provides robust all-weather imaging capabilities; however, translating SAR observations into photo-realistic optical images remains a fundamentally ill-posed problem. Current approaches are often hindered by the inherent speckle noise and geometric distortions of SAR data, which frequently result in semantic misinterpretation, ambiguous texture synthesis, and structural hallucinations. To address these limitations, a novel SAR-to-Optical (S2O) translation framework is proposed, integrating three core technical contributions: (i) Cross-Modal Semantic Alignment, which establishes an Optical-Aware SAR Encoder by distilling robust semantic priors from an Optical Teacher into a SAR Student; (ii) Semantically-Grounded Generative Guidance, realized by a Semantically-Grounded ControlNet that integrates class-aware text prompts for global context with hierarchical visual prompts for local spatial guidance; and (iii) an Uncertainty-Aware Objective, which explicitly models aleatoric uncertainty to dynamically modulate the reconstruction focus, effectively mitigating artifacts caused by speckle-induced ambiguity. Extensive experiments demonstrate that the proposed method achieves superior perceptual quality and semantic consistency compared to state-of-the-art approaches.

[CV-86] Enhancing Low-resolution Image Representation Through Normalizing Flows

[Quick Read]: This paper addresses a core challenge of low-resolution image representation: preserving essential visual content while enabling accurate reconstruction of the original image. The key of the solution is the LR2Flow framework, which combines wavelet tight frame blocks with normalizing flows to build a nonlinear invertible neural network that learns low-resolution image representations in the wavelet tight frame domain; the accompanying reconstruction error analysis demonstrates the necessity of designing invertible networks in this domain to reduce reconstruction error.

Link: https://arxiv.org/abs/2601.06834
Authors: Chenglong Bao,Tongyao Pang,Zuowei Shen,Dihan Zheng,Yihang Zou
Affiliations: Yau Mathematical Sciences Center, Tsinghua University, Beijing, China; Beijing Institute of Mathematical Sciences and Applications, Beijing, China; Department of Mathematics, National University of Singapore, Singapore; Department of Pharmaceutical Chemistry, University of California, San Francisco, CA, USA; Department of Mathematical Sciences, Tsinghua University, Beijing, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-resolution image representation is a special form of sparse representation that retains only low-frequency information while discarding high-frequency components. This property reduces storage and transmission costs and benefits various image processing tasks. However, a key challenge is to preserve essential visual content while maintaining the ability to accurately reconstruct the original images. This work proposes LR2Flow, a nonlinear framework that learns low-resolution image representations by integrating wavelet tight frame blocks with normalizing flows. We conduct a reconstruction error analysis of the proposed network, which demonstrates the necessity of designing invertible neural networks in the wavelet tight frame domain. Experimental results on various tasks, including image rescaling, compression, and denoising, demonstrate the effectiveness of the learned representations and the robustness of the proposed framework.

[CV-87] SARA: Scene-Aware Reconstruction Accelerator ICPR

[Quick Read]: This paper addresses the inefficiency and accuracy limits of traditional Structure-from-Motion (SfM) pipelines that select image pairs based on visual similarity alone. The key of the solution is SARA (Scene-Aware Reconstruction Accelerator), a geometry-driven pair selection module that scores image pairs by reconstruction informativeness (the product of overlap and parallax) before expensive feature matching, prioritizing the pairs most informative for 3D reconstruction. SARA estimates the geometric cues with a lightweight pre-matching stage and builds an Information-Weighted Spanning Tree (IWST) augmented with targeted edges (loop closures, long-baseline anchors, and weak-view reinforcement), cutting redundant pairs from 30,848 to 580 for up to a 50x speed-up while reducing rotation errors by 46.5±5.5% and translation errors by 12.5±6.5%, and keeping reconstruction quality within ±3% of the baseline.

Link: https://arxiv.org/abs/2601.06831
Authors: Jee Won Lee,Hansol Lim,Minhyeok Im,Dohyeon Lee,Jongseong Brad Choi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the 2026 International Conference on Pattern Recognition (ICPR) for possible publication

Abstract:We present SARA (Scene-Aware Reconstruction Accelerator), a geometry-driven pair selection module for Structure-from-Motion (SfM). Unlike conventional pipelines that select pairs based on visual similarity alone, SARA introduces geometry-first pair selection by scoring reconstruction informativeness - the product of overlap and parallax - before expensive matching. A lightweight pre-matching stage uses mutual nearest neighbors and RANSAC to estimate these cues, then constructs an Information-Weighted Spanning Tree (IWST) augmented with targeted edges for loop closure, long-baseline anchors, and weak-view reinforcement. Compared to exhaustive matching, SARA reduces rotation errors by 46.5±5.5% and translation errors by 12.5±6.5% across modern learned detectors, while achieving at most 50x speedup through 98% pair reduction (from 30,848 to 580 pairs). This reduces matching complexity from quadratic to quasi-linear, maintaining within ±3% of baseline reconstruction metrics for 3D Gaussian Splatting and SVRaster.
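A sketch of the geometry-first selection: score every pair by overlap x parallax, then keep a spanning tree that maximizes total informativeness. Since SciPy only provides a minimum spanning tree, scores are converted to costs as 1 - score; the toy random scores stand in for the cues estimated by SARA's pre-matching stage, and the targeted extra edges (loop closure, long-baseline anchors, weak-view reinforcement) are omitted.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

n_views = 5
rng = np.random.default_rng(0)
overlap = rng.uniform(0.1, 1.0, (n_views, n_views))    # toy pre-matching cues
parallax = rng.uniform(0.1, 1.0, (n_views, n_views))
score = overlap * parallax                    # informativeness per pair
cost = np.triu(1.0 - score, k=1)              # upper triangle = candidate pairs

# Minimizing total (1 - score) over a tree maximizes total score.
tree = minimum_spanning_tree(csr_matrix(cost))
rows, cols = tree.nonzero()
selected_pairs = sorted(zip(rows.tolist(), cols.tolist()))
print(selected_pairs)                         # n_views - 1 pairs sent to matching
```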

[CV-88] SpatialNav: Leverag ing Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation

[Quick Read]: This paper addresses the inefficient exploration and large performance gap of zero-shot vision-and-language navigation (VLN) agents, which, lacking the spatial knowledge implicitly learned from large-scale training data, rely mainly on local observations. The key of the solution is the SpatialNav zero-shot VLN agent, which builds a Spatial Scene Graph (SSG) that explicitly captures the global spatial structure and semantics of the explored environment, and on top of it integrates three mechanisms: an agent-centric spatial map, a compass-aligned visual representation, and a remote object localization strategy, enabling efficient and generalizable navigation.

Link: https://arxiv.org/abs/2601.06806
Authors: Jiwen Zhang,Zejun Li,Siyuan Wang,Xiangyu Shi,Zhongyu Wei,Qi Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 11 pages, 4 figures, 6 tables

Abstract:Although learning-based vision-and-language navigation (VLN) agents can learn spatial knowledge implicitly from large-scale training data, zero-shot VLN agents lack this process, relying primarily on local observations for navigation, which leads to inefficient exploration and a significant performance gap. To deal with the problem, we consider a zero-shot VLN setting that agents are allowed to fully explore the environment before task execution. Then, we construct the Spatial Scene Graph (SSG) to explicitly capture global spatial structure and semantics in the explored environment. Based on the SSG, we introduce SpatialNav, a zero-shot VLN agent that integrates an agent-centric spatial map, a compass-aligned visual representation, and a remote object localization strategy for efficient navigation. Comprehensive experiments in both discrete and continuous environments demonstrate that SpatialNav significantly outperforms existing zero-shot agents and clearly narrows the gap with state-of-the-art learning-based methods. Such results highlight the importance of global spatial representations for generalizable navigation.

[CV-89] CliffordNet: All You Need is Geometric Algebra

[Quick Read]: This paper challenges the design redundancy and efficiency bottlenecks of modern vision models built by heuristically stacking modules (spatial mixers such as Attention/Conv followed by channel mixers such as FFNs). The key of the solution is the Clifford Algebra Network (CliffordNet), grounded in geometric algebra: the Clifford geometric product (uv = u·v + u∧v) provides a unified feature interaction mechanism that, at the local level, simultaneously captures feature coherence (via the generalized inner product) and structural variation (via the wedge product), with algebraic completeness guaranteed mathematically. The mechanism is implemented with an efficient sparse rolling scheme at strictly linear complexity O(N), and a surprising finding is that standard feed-forward networks (FFNs) become redundant under this framework, suggesting that global understanding can emerge from rigorous local geometric interactions alone, with strong parameter efficiency and performance.

Link: https://arxiv.org/abs/2601.06793
Authors: Zhongping Ji
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages

Abstract:Modern computer vision architectures, from CNNs to Transformers, predominantly rely on the stacking of heuristic modules: spatial mixers (Attention/Conv) followed by channel mixers (FFNs). In this work, we challenge this paradigm by returning to mathematical first principles. We propose the Clifford Algebra Network (CAN), also referred to as CliffordNet, a vision backbone grounded purely in Geometric Algebra. Instead of engineering separate modules for mixing and memory, we derive a unified interaction mechanism based on the Clifford Geometric Product (uv = u·v + u∧v). This operation ensures algebraic completeness regarding the Geometric Product by simultaneously capturing feature coherence (via the generalized inner product) and structural variation (via the exterior wedge product). Implemented via an efficient sparse rolling mechanism with strict linear complexity O(N), our model reveals a surprising emergent property: the geometric interaction is so representationally dense that standard Feed-Forward Networks (FFNs) become redundant. Empirically, CliffordNet establishes a new Pareto frontier: our Nano variant achieves 76.41% accuracy on CIFAR-100 with only 1.4M parameters, effectively matching the heavy-weight ResNet-18 (11.2M) with 8x fewer parameters, while our Base variant sets a new SOTA for tiny models at 78.05%. Our results suggest that global understanding can emerge solely from rigorous, algebraically complete local interactions, potentially signaling a shift where geometry is all you need. Code is available at this https URL.
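The abstract gives the geometric product explicitly, so a worked example is easy: for two vectors, uv decomposes into a scalar inner part and a bivector wedge part, the latter representable for vectors in R^n by the antisymmetric matrix u v^T - v u^T. This only illustrates the algebra, not CliffordNet's sparse-rolling implementation.

```python
import numpy as np

def geometric_product(u, v):
    """Clifford product of two vectors: uv = u . v (grade 0) + u ^ v (grade 2)."""
    inner = float(u @ v)                      # symmetric part: feature coherence
    wedge = np.outer(u, v) - np.outer(v, u)   # antisymmetric part: variation
    return inner, wedge

u = np.array([1.0, 2.0, 0.0])
v = np.array([0.0, 1.0, 3.0])
s, B = geometric_product(u, v)
print(s)                          # 2.0
assert np.allclose(B.T, -B)       # the bivector part is antisymmetric
```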

[CV-90] AutoTour: Automatic Photo Tour Guide with Smartphones and LLM s

[Quick Read]: This paper addresses the difficulty users face in obtaining rich, accurate, and contextually meaningful geographic information and narration from their own photos while touring or exploring; existing tour applications rely on predefined content or proprietary datasets and cannot dynamically understand and annotate user-captured images in a personalized way. The key of the solution is a fully automated, training-free pipeline that fuses visual features from user photos with open geospatial features queried around the GPS location: a VLM detects the major landmarks in the image and projects them into the horizontal spatial plane, a geometric matching algorithm aligns the photo features with real-world geospatial entities, and the matched features are then annotated directly on the original photo, accompanied by LLM-generated textual and audio descriptions, yielding an accurate, richly annotated, and interactive tour experience.

Link: https://arxiv.org/abs/2601.06781
Authors: Huatao Xu,Zihe Liu,Zilin Zeng,Baichuan Li,Mo Li
Affiliations: The Hong Kong University of Science and Technology
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21

Abstract:We present AutoTour, a system that enhances user exploration by automatically generating fine-grained landmark annotations and descriptive narratives for photos captured by users. The key idea of AutoTour is to fuse visual features extracted from photos with nearby geospatial features queried from open mapping databases. Unlike existing tour applications that rely on pre-defined content or proprietary datasets, AutoTour leverages open and extensible data sources to provide scalable and context-aware photo-based guidance. To achieve this, we design a training-free pipeline that first extracts and filters relevant geospatial features around the user's GPS location. It then detects major landmarks in user photos through VLM-based feature detection and projects them into the horizontal spatial plane. A geometric matching algorithm aligns photo features with corresponding geospatial entities based on their estimated distance and direction. The matched features are subsequently grounded and annotated directly on the original photo, accompanied by large language model-generated textual and audio descriptions to provide an informative, tour-like experience. We demonstrate that AutoTour can deliver rich, interpretable annotations for both iconic and lesser-known landmarks, enabling a new form of interactive, context-aware exploration that bridges visual perception and geospatial understanding.
zh

[CV-91] The Normalized Difference Layer: A Differentiable Spectral Index Formulation for Deep Learning

【速读】:该论文旨在解决传统归一化差异指数(Normalized Difference Indices, NDIs)在深度学习中作为固定预处理步骤时,因系数固定为1而无法适应特定学习任务的问题。其关键解决方案是提出一种可微的神经网络模块——归一化差异层(Normalized Difference Layer),该层通过软加(softplus)重参数化确保波段系数为正且分母有界,并支持端到端反向传播训练,从而在保留光照不变性和输出范围[-1,1]优势的同时,让梯度下降自动学习任务相关的波段权重。

链接: https://arxiv.org/abs/2601.06777
作者: Ali Lotfi,Adam Carter,Mohammad Meysami,Thuan Ha,Kwabena Nketia,Steve Shirtliffe
机构: University of Saskatchewan (萨斯喀彻温大学); The University of Tulsa (塔尔萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 9 figures

点击查看摘要

Abstract:Normalized difference indices have been a staple in remote sensing for decades. They stay reliable under lighting changes, produce bounded values, and connect well to biophysical signals. Even so, they are usually treated as a fixed pre-processing step with coefficients set to one, which limits how well they can adapt to a specific learning task. In this study, we introduce the Normalized Difference Layer, a differentiable neural network module. The proposed method keeps the classical idea but learns the band coefficients from data. We present a complete mathematical framework for integrating this layer into deep learning architectures that uses softplus reparameterization to ensure positive coefficients and bounded denominators. We describe forward and backward pass algorithms enabling end-to-end training through backpropagation. This approach preserves the key benefits of normalized differences, namely illumination invariance and outputs bounded to [-1,1], while allowing gradient descent to discover task-specific band weightings. We extend the method to work with signed inputs, so the layer can be stacked inside larger architectures. Experiments show that models using this layer reach similar classification accuracy to standard multilayer perceptrons while using about 75% fewer parameters. They also handle multiplicative noise well: at 10% noise, accuracy drops only 0.17% versus 3.03% for baseline MLPs. The learned coefficient patterns stay consistent across different depths.
zh
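
A layer of this kind is compact enough to sketch. The PyTorch module below is a hypothetical reading of the abstract (the `raw_a`/`raw_b` parameters and the band-pair interface are my assumptions, not the authors' API): softplus keeps the band coefficients positive, the denominator stays bounded away from zero, and the output lies in [-1, 1] for non-negative inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedDifferenceLayer(nn.Module):
    """Hypothetical sketch of a learnable normalized-difference layer:
    ND(x) = (a*x_i - b*x_j) / (a*x_i + b*x_j + eps), with a, b > 0
    enforced via a softplus reparameterization (as described in the abstract)."""
    def __init__(self, n_pairs, eps=1e-6):
        super().__init__()
        # raw parameters; softplus keeps the effective coefficients positive
        self.raw_a = nn.Parameter(torch.zeros(n_pairs))
        self.raw_b = nn.Parameter(torch.zeros(n_pairs))
        self.eps = eps

    def forward(self, x_i, x_j):
        # x_i, x_j: (batch, n_pairs) non-negative band reflectances
        a = F.softplus(self.raw_a)
        b = F.softplus(self.raw_b)
        num = a * x_i - b * x_j
        den = a * x_i + b * x_j + self.eps   # bounded away from zero
        return num / den                     # in [-1, 1] for x >= 0

layer = NormalizedDifferenceLayer(n_pairs=1)
nir, red = torch.tensor([[0.6]]), torch.tensor([[0.1]])
print(layer(nir, red))  # at init a = b, so this reduces to the classical NDVI form
```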

[CV-92] When Humans Judge Irises: Pupil Size Normalization as an Aid and Synthetic Irises as a Challenge

【速读】:该论文旨在解决在法医学应用中,当虹膜图像因退化(如死后样本)或存在呈现攻击(presentation attack)风险时,依赖人工专家对虹膜匹配结果进行验证的准确性问题。其核心挑战在于:人类观察者在判断虹膜图像是否来自同一眼睛时,受图像质量、瞳孔尺寸差异及合成图像真实性等因素影响较大。解决方案的关键在于引入一种基于自动编码器的身份保持型图像到图像翻译模型(autoencoder-based identity-preserving image-to-image translation model),通过瞳孔尺寸归一化显著提升人工验证的准确率;同时发现,即便现代生成式模型(Generative AI)能生成高保真虹膜图像,人类仍更倾向于将同眼的合成虹膜图像误判为不同眼图像,表明当前合成虹膜图像虽逼真但尚未完全规避人类感知上的“异常感”。

链接: https://arxiv.org/abs/2601.06725
作者: Mahsa Mitcheff,Adam Czajka
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Iris recognition is a mature biometric technology offering remarkable precision and speed, and allowing for large-scale deployments to populations exceeding a billion enrolled users (e.g., AADHAAR in India). However, in forensic applications, a human expert may be needed to review and confirm a positive identification before an iris matching result can be presented as evidence in court, especially in cases where processed samples are degraded (e.g., in post-mortem cases) or where there is a need to judge whether the sample is authentic, rather than a result of a presentation attack. This paper presents a study that examines human performance in iris verification in two controlled scenarios: (a) under varying pupil sizes, with and without a linear/nonlinear alignment of the pupil size between compared images, and (b) when both genuine and impostor iris image pairs are synthetically generated. The results demonstrate that pupil size normalization carried out by a modern autoencoder-based identity-preserving image-to-image translation model significantly improves verification accuracy. Participants were also able to determine whether iris pairs corresponded to the same or different eyes when both images were either authentic or synthetic. However, accuracy declined when subjects were comparing authentic irises against high-quality, same-eye synthetic counterparts. These findings (a) demonstrate the importance of pupil-size alignment for iris matching tasks in which humans are involved, and (b) indicate that despite the high fidelity of modern generative models, same-eye synthetic iris images are more often judged by humans as different-eye images, compared to same-eye authentic image pairs. We offer data and human judgments along with this paper to allow full replicability of this study and future works.
zh

[CV-93] Beyond Perfect Scores: Proof-by-Contradiction for Trustworthy Machine Learning

【速读】:该论文旨在解决机器学习(Machine Learning, ML)模型在生物医学预测任务中因缺乏可信赖性而难以临床采纳的问题,尤其是模型是否依赖于真实的临床线索还是仅利用数据中的伪相关性(spurious hierarchical correlations)。其解决方案的关键在于提出一种基于随机反证法(stochastic proof-by-contradiction)的通用可信度测试方法:通过在潜在结果框架下对标签进行精心置换,在真实标签与置换标签上分别训练和测试模型;若模型在置换标签下仍保持高准确率,则表明其可能过拟合、采用捷径学习或存在数据泄露,而非学习到真正的因果关系。该方法进一步以可解释的Fisher风格p值量化此类行为,便于领域专家评估模型可信度,从而区分真正因果建模与数据集伪影驱动的学习。

链接: https://arxiv.org/abs/2601.06704
作者: Dushan N. Wadduwage,Dineth Jayakody,Leonidas Zimianitis
机构: Old Dominion University (欧道明大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Machine learning (ML) models show strong promise for new biomedical prediction tasks, but concerns about trustworthiness have hindered their clinical adoption. In particular, it is often unclear whether a model relies on true clinical cues or on spurious hierarchical correlations in the data. This paper introduces a simple yet broadly applicable trustworthiness test grounded in stochastic proof-by-contradiction. Instead of just showing high test performance, our approach trains and tests on spurious labels carefully permuted based on a potential outcomes framework. A truly trustworthy model should fail under such label permutation; comparable accuracy across real and permuted labels indicates overfitting, shortcut learning, or data leakage. Our approach quantifies this behavior through interpretable Fisher-style p-values, which are well understood by domain experts across medical and life sciences. We evaluate our approach on multiple new bacterial diagnostics to separate tasks and models learning genuine causal relationships from those driven by dataset artifacts or statistical coincidences. Our work establishes a foundation to build rigor and trust between ML and life-science research communities, moving ML models one step closer to clinical adoption.
zh
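
The testing logic can be sketched with off-the-shelf tools. The snippet below is one assumed form of the idea, not the authors' exact potential-outcomes permutation protocol: if a model scores as well on scrambled labels as on the real ones, the Fisher-style p-value stays large, flagging shortcut learning or leakage.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def permutation_trust_test(X, y, n_perm=19, seed=0):
    """Sketch of a stochastic proof-by-contradiction test (assumed form):
    a trustworthy model should lose its skill once labels are permuted.
    Returns the real-label score and a Fisher-style p-value; a LARGE p
    means permuted labels work about as well, which is the red flag."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000)
    real = cross_val_score(model, X, y, cv=5).mean()
    null = [cross_val_score(model, X, rng.permutation(y), cv=5).mean()
            for _ in range(n_perm)]
    # fraction of permuted runs that match or beat the real score
    p = (1 + sum(s >= real for s in null)) / (n_perm + 1)
    return real, p

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # genuine signal
real, p = permutation_trust_test(X, y)
print(f"real accuracy={real:.2f}, permutation p-value={p:.2f}")  # small p expected
```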

[CV-94] Quantification and Classification of Carbon Nanotubes in Electron Micrographs using Vision Foundation Models

【速读】:该论文旨在解决碳纳米管(Carbon Nanotube, CNT)在透射电子显微镜(Transmission Electron Microscopy, TEM)图像中形态特征人工识别效率低、主观性强的问题,从而提升暴露评估与毒理学研究的准确性。其解决方案的关键在于构建一个统一框架,融合零样本分割模型(Segment Anything Model, SAM)与自监督视觉Transformer(DINOv2),实现高通量、可重复的实例级分析:首先利用SAM仅需极少用户交互即可实现近乎完美的颗粒分割;随后通过分割掩膜空间约束DINOv2,提取仅来自目标粒子区域的特征并抑制背景噪声,最终在仅使用少量训练数据的情况下,实现了对四种CNT形态95.5%的分类准确率,并能有效分辨同一视野中混合存在的不同粒子类型。

链接: https://arxiv.org/abs/2601.06673
作者: Sanjay Pradeep,Chen Wang,Matthew M. Dahm,Jeff D. Eldredge,Candace S.J. Tsai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate characterization of carbon nanotube morphologies in electron microscopy images is vital for exposure assessment and toxicological studies, yet current workflows rely on slow, subjective manual segmentation. This work presents a unified framework leveraging vision foundation models to automate the quantification and classification of CNTs in electron microscopy images. First, we introduce an interactive quantification tool built on the Segment Anything Model (SAM) that segments particles with near-perfect accuracy using minimal user input. Second, we propose a novel classification pipeline that utilizes these segmentation masks to spatially constrain a DINOv2 vision transformer, extracting features exclusively from particle regions while suppressing background noise. Evaluated on a dataset of 1,800 TEM images, this architecture achieves 95.5% accuracy in distinguishing between four different CNT morphologies, significantly outperforming the current baseline despite using a fraction of the training data. Crucially, this instance-level processing allows the framework to resolve mixed samples, correctly classifying distinct particle types co-existing within a single field of view. These results demonstrate that integrating zero-shot segmentation with self-supervised feature learning enables high-throughput, reproducible nanomaterial analysis, transforming a labor-intensive bottleneck into a scalable, data-driven process.
zh
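
The mask-constrained feature extraction step lends itself to a short sketch. Everything below (the function name, the 0.5 patch-coverage rule, the pooling) is an illustrative assumption rather than the paper's pipeline: given ViT patch tokens and a SAM-style instance mask, only tokens overlapping the particle contribute to the instance descriptor, suppressing background.

```python
import torch

def masked_patch_pooling(patch_tokens, mask, patch_size=14):
    """Hypothetical sketch of spatially constraining a ViT's patch features
    with a segmentation mask: average only tokens whose patch overlaps the
    particle, so background regions contribute nothing."""
    # patch_tokens: (H/p * W/p, C); mask: (H, W) boolean instance mask
    Hp = mask.shape[0] // patch_size
    Wp = mask.shape[1] // patch_size
    # fraction of each patch covered by the mask
    cover = mask.float().reshape(Hp, patch_size, Wp, patch_size).mean(dim=(1, 3))
    weights = (cover > 0.5).float().reshape(-1)          # keep particle patches
    weights = weights / weights.sum().clamp(min=1.0)
    return (patch_tokens * weights[:, None]).sum(dim=0)  # (C,) instance descriptor

tokens = torch.randn(16 * 16, 384)           # e.g. 224x224 image, 14px patches
mask = torch.zeros(224, 224, dtype=torch.bool)
mask[40:120, 60:160] = True                  # one segmented particle
feat = masked_patch_pooling(tokens, mask)
print(feat.shape)                            # torch.Size([384])
```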

[CV-95] eSkiTB: A Synthetic Event-based Dataset for Tracking Skiers

【速读】:该论文旨在解决在RGB广播视频中追踪滑雪运动员的难题,尤其针对运动模糊、静态叠加层和场景杂乱导致的目标遮挡问题。传统RGB视觉方法在复杂背景下易失效,而事件相机(event camera)凭借其异步对比度感知特性天然具备对这些干扰的鲁棒性,但此前缺乏针对冬季运动场景的可控基准测试数据集。解决方案的关键在于构建了首个合成的事件基滑雪追踪数据集eSkiTB,通过直接的视频到事件转换(无需神经插值)实现RGB与事件模态的等信息量对比,从而验证了基于脉冲变换器(SDTrack)的事件追踪方法在静态叠加主导场景下显著优于RGB方法(IoU提升20.0点),并证明时间对比度是视觉拥挤环境中追踪高速运动目标的可靠线索。

链接: https://arxiv.org/abs/2601.06647
作者: Krishna Vinod,Joseph Raj Vishal,Kaustav Chanda,Prithvi Jai Ramesh,Yezhou Yang,Bharatesh Chakravarthi
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tracking skiers in RGB broadcast footage is challenging due to motion blur, static overlays, and clutter that obscure the fast-moving athlete. Event cameras, with their asynchronous contrast sensing, offer natural robustness to such artifacts, yet a controlled benchmark for winter-sport tracking has been missing. We introduce event SkiTB (eSkiTB), a synthetic event-based ski tracking dataset generated from SkiTB using direct video-to-event conversion without neural interpolation, enabling an iso-informational comparison between RGB and event modalities. Benchmarking SDTrack (spiking transformer) against STARK (RGB transformer), we find that event-based tracking is substantially resilient to broadcast clutter in scenes dominated by static overlays, achieving 0.685 IoU, outperforming RGB by +20.0 points. Across the dataset, SDTrack attains a mean IoU of 0.711, demonstrating that temporal contrast is a reliable cue for tracking ballistic motion in visually congested environments. eSkiTB establishes the first controlled setting for event-based tracking in winter sports and highlights the promise of event cameras for ski tracking. The dataset and code will be released at this https URL.
zh
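
Direct video-to-event conversion can be approximated by thresholding log-intensity changes between consecutive frames. The sketch below is a generic event-simulation scheme under that assumption, not the exact converter used to build eSkiTB.

```python
import numpy as np

def frames_to_events(prev, curr, threshold=0.2, eps=1e-3):
    """Minimal sketch of direct video-to-event conversion (assumed scheme):
    emit +/-1 events wherever the log-intensity change between consecutive
    frames exceeds a contrast threshold, as an event camera would."""
    d = np.log(curr + eps) - np.log(prev + eps)
    events = np.zeros_like(d, dtype=np.int8)
    events[d >= threshold] = 1     # ON events (brightness increase)
    events[d <= -threshold] = -1   # OFF events (brightness decrease)
    return events

rng = np.random.default_rng(0)
prev = rng.random((64, 64))
curr = prev.copy()
curr[20:30, 20:30] *= 2.0          # a moving bright region
ev = frames_to_events(prev, curr)
print((ev == 1).sum(), (ev == -1).sum())
```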

[CV-96] Boosting Overlapping Organoid Instance Segmentation Using Pseudo-Label Unmixing and Synthesis-Assisted Learning

【速读】:该论文旨在解决类器官(organoid)实例分割中因高质量标注数据稀缺和显微图像中普遍存在重叠导致的分割精度不足问题。现有半监督学习(semi-supervised learning, SSL)方法受限于噪声伪标签引发的偏差,尤其在重叠区域表现不佳。其关键解决方案是提出伪标签去混(Pseudo-Label Unmixing, PLU)机制,通过识别并修正重叠区域的错误伪标签,并借助基于轮廓的图像合成策略与实例级增强(instance-level augmentation, IA)提升合成数据质量,从而实现对缠绕类器官的有效解耦与精准分割。该方法仅需10%标注数据即可达到全监督模型性能,显著提升了类器官分析的标签效率与可扩展性。

链接: https://arxiv.org/abs/2601.06642
作者: Gui Huang,Kangyuan Zheng,Xuan Cai,Jiaqi Wang,Jianjia Zhang,Kaida Ning,Wenbo Wei,Yujuan Zhu,Jiong Zhang,Mengting Liu
机构: Sun Yat-sen University (中山大学); PCL (鹏城实验室); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Organoids, sophisticated in vitro models of human tissues, are crucial for medical research due to their ability to simulate organ functions and assess drug responses accurately. Accurate organoid instance segmentation is critical for quantifying their dynamic behaviors, yet remains profoundly limited by high-quality annotated datasets and pervasive overlap in microscopy imaging. While semi-supervised learning (SSL) offers a solution to alleviate reliance on scarce labeled data, conventional SSL frameworks suffer from biases induced by noisy pseudo-labels, particularly in overlapping regions. Synthesis-assisted SSL (SA-SSL) has been proposed for mitigating training biases in semi-supervised semantic segmentation. We present the first adaptation of SA-SSL to organoid instance segmentation and reveal that SA-SSL struggles to disentangle intertwined organoids, often misrepresenting overlapping instances as a single entity. To overcome this, we propose Pseudo-Label Unmixing (PLU), which identifies erroneous pseudo-labels for overlapping instances and then regenerates organoid labels through instance decomposition. For image synthesis, we apply a contour-based approach to synthesize organoid instances efficiently, particularly for overlapping cases. Instance-level augmentations (IA) on pseudo-labels before image synthesis further enhances the effect of synthetic data (SD). Rigorous experiments on two organoid datasets demonstrate our method’s effectiveness, achieving performance comparable to fully supervised models using only 10% labeled data, and state-of-the-art results. Ablation studies validate the contributions of PLU, contour-based synthesis, and augmentation-aware training. By addressing overlap at both pseudo-label and synthesis levels, our work advances scalable, label-efficient organoid analysis, unlocking new potential for high-throughput applications in precision medicine.
zh

[CV-97] Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration

【速读】:该论文旨在解决文本引导图像生成中基于视觉样例的精确风格化难题,现有方法常依赖任务特定的再训练或昂贵的图像反演过程,导致内容完整性受损、风格保真度下降,并在语义提示遵循与风格对齐之间难以取得良好平衡。其解决方案的关键在于提出一种无需训练的框架,将风格引导合成重构为上下文学习任务:通过将参考风格图像与掩码目标图像拼接,利用预训练的ReFlow-based图像修复模型,借助多模态注意力融合机制实现语义内容与目标风格的无缝整合;进一步设计动态语义-风格融合(Dynamic Semantic-Style Integration, DSSI)机制,重新加权文本语义与风格视觉token之间的注意力权重,有效缓解指导冲突并提升输出一致性,从而在保持高保真度的同时实现更优的语义-风格平衡和视觉质量。

链接: https://arxiv.org/abs/2601.06605
作者: Yingying Deng,Xiangyu He,Fan Tang,Weiming Dong,Xucheng Yin
机构: University of Science and Technology Beijing (北京科技大学); University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-guided image generation has advanced rapidly with large-scale diffusion models, yet achieving precise stylization with visual exemplars remains difficult. Existing approaches often depend on task-specific retraining or expensive inversion procedures, which can compromise content integrity, reduce style fidelity, and lead to an unsatisfactory trade-off between semantic prompt adherence and style alignment. In this work, we introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. Guided by textual semantic prompts, our method concatenates a reference style image with a masked target image, leveraging a pretrained ReFlow-based inpainting model to seamlessly integrate semantic content with the desired style through multimodal attention fusion. We further analyze the imbalance and noise sensitivity inherent in multimodal attention fusion and propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between textual semantic and style visual tokens, effectively resolving guidance conflicts and enhancing output coherence. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality, offering a simple yet powerful alternative to complex, artifact-prone prior methods.
zh

[CV-98] APEX: Learning Adaptive Priorities for Multi-Objective Alignment in Vision-Language Generation

【速读】:该论文旨在解决文本到图像生成中的多目标对齐问题,特别是传统静态线性加权方法在面对异质奖励(heterogeneous rewards)时因固定权重导致的优化失衡问题,表现为模型过度拟合高方差、高响应性的目标(如光学字符识别OCR),而忽视感知类目标。其解决方案的关键在于提出APEX(Adaptive Priority-based Efficient X-objective Alignment),通过双阶段自适应归一化机制缓解方差劫持(variance hijacking)效应,并引入P^3自适应优先级策略动态调度目标,该策略融合学习潜力、冲突惩罚和进展需求三个维度,从而实现更稳定的多目标优化与更优的帕累托前沿平衡。

链接: https://arxiv.org/abs/2601.06574
作者: Dongliang Chen,Xinlin Zhuang,Junjie Xu,Luojian Xie,Zehui Wang,Jiaxi Zhuang,Haolin Yang,Liang Dou,Xiao He,Xingjiao Wu,Ying Qian
机构: East China Normal University (华东师范大学); MBZUAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-objective alignment for text-to-image generation is commonly implemented via static linear scalarization, but fixed weights often fail under heterogeneous rewards, leading to optimization imbalance where models overfit high-variance, high-responsiveness objectives (e.g., OCR) while under-optimizing perceptual goals. We identify two mechanistic causes: variance hijacking, where reward dispersion induces implicit reweighting that dominates the normalized training signal, and gradient conflicts, where competing objectives produce opposing update directions and trigger seesaw-like oscillations. We propose APEX (Adaptive Priority-based Efficient X-objective Alignment), which stabilizes heterogeneous rewards with Dual-Stage Adaptive Normalization and dynamically schedules objectives via P^3 Adaptive Priorities that combine learning potential, conflict penalty, and progress need. On Stable Diffusion 3.5, APEX achieves improved Pareto trade-offs across four heterogeneous objectives, with balanced gains of +1.31 PickScore, +0.35 DeQA, and +0.53 Aesthetics while maintaining competitive OCR accuracy, mitigating the instability of multi-objective alignment.
zh
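
Both ingredients can be illustrated abstractly. Neither function below reproduces the paper's Dual-Stage Adaptive Normalization or the exact P^3 formula (the forms are my assumptions); they only show how per-objective variance normalization counters variance hijacking, and how potential/conflict/need scores could be turned into scheduling weights.

```python
import numpy as np

class RewardNormalizer:
    """Illustrative per-objective running normalization (not the paper's
    exact dual-stage scheme): dividing by a running std stops high-variance
    rewards (e.g. OCR) from hijacking the combined training signal."""
    def __init__(self, n_obj, momentum=0.99):
        self.mean = np.zeros(n_obj)
        self.var = np.ones(n_obj)
        self.m = momentum

    def __call__(self, r):
        self.mean = self.m * self.mean + (1 - self.m) * r
        self.var = self.m * self.var + (1 - self.m) * (r - self.mean) ** 2
        return (r - self.mean) / np.sqrt(self.var + 1e-8)

def priority_weights(potential, conflict, need, tau=1.0):
    """Hypothetical P^3-style scheduler: score = potential - conflict + need,
    turned into per-objective weights with a softmax."""
    s = (potential - conflict + need) / tau
    e = np.exp(s - s.max())
    return e / e.sum()

norm = RewardNormalizer(n_obj=3)
print(norm(np.array([5.0, 0.1, 0.2])))           # heterogeneous raw rewards
print(priority_weights(np.array([0.8, 0.2, 0.5]),
                       np.array([0.1, 0.0, 0.3]),
                       np.array([0.2, 0.6, 0.1])))
```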

[CV-99] QCaption: Video Captioning and QA through Fusion of Large Multimodal Models

【速读】:该论文旨在解决视频分析中多模态信息融合不足的问题,即如何有效整合视频中的文本、图像与视频内容以提升视频字幕生成(video captioning)和视频问答(QA)任务的性能。解决方案的关键在于提出一种名为QCaption的新型视频字幕与问答流水线,通过融合三种模型实现:关键帧提取模块用于从视频中提取代表性图像帧,大型多模态模型(Large Multimodal Model, LMM)用于图像-文本联合分析,以及大型语言模型(Large Language Model, LLM)用于文本语义理解与推理。这种多模型协同的融合架构显著提升了任务效果,在实验中分别实现了高达44.2%和48.9%的性能提升,且具备完全自包含特性,适用于本地部署场景。

链接: https://arxiv.org/abs/2601.06566
作者: Jiale Wang,Gee Wah Ng,Lee Onn Mak,Randall Cher,Ng Ding Hei Ryan,Davis Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces QCaption, a novel video captioning and QA pipeline that enhances video analytics by fusing three models: key frame extraction, a Large Multimodal Model (LMM) for image-text analysis, and a Large Language Model (LLM) for text analysis. This approach enables integrated analysis of text, images, and video, achieving performance improvements over existing video captioning and QA models, all while remaining fully self-contained and well suited for on-premises deployment. Experimental results using QCaption demonstrated up to 44.2% and 48.9% improvements in video captioning and QA tasks, respectively. Ablation studies were also performed to assess the role of the LLM in the fusion results. Moreover, the paper proposes and evaluates additional video captioning approaches, benchmarking them against QCaption and existing methodologies. QCaption demonstrates the potential of adopting a model fusion approach in advancing video analytics.
zh

[CV-100] ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视频事件定位(event grounding)任务中忽视时间方向性的问题。现有方法通常仅在正向视频流上训练模型,导致模型难以捕捉事件的内在时序结构和方向性,从而限制了其鲁棒性和泛化能力。解决方案的关键在于提出ArrowGEV框架,该框架基于物理学中的“时间箭头”概念,通过强化学习显式建模事件的时间方向性:将事件分为时间敏感型(如“放下包”)和时间无关型(如“左手拿着毛巾”),对前者设计奖励机制以鼓励模型区分正向与反向视频,对后者则强制模型在两个方向上保持一致的定位结果,从而提升模型对事件语义及其时间方向性的理解能力。

链接: https://arxiv.org/abs/2601.06559
作者: Fangxu Yu,Ziyao Lu,Liqiang Niu,Fandong Meng,Jie Zhou
机构: Nanjing University (南京大学); Tencent Inc (腾讯公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
zh
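
The direction-aware reward split can be made concrete. This minimal sketch assumes one plausible form of the two rewards (the reverse-coordinate mapping and scoring are my guesses, not the paper's implementation): time-sensitive events reward recognizing the arrow of time, while time-insensitive events reward grounding that stays consistent when the video is reversed.

```python
def temporal_iou(a, b):
    """IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def arrow_reward(time_sensitive, pred_fwd, pred_bwd, video_len,
                 dir_pred=None, dir_true=None):
    """Sketch of a direction-aware grounding reward (assumed form).
    Time-sensitive events: reward discriminating forward from backward.
    Time-insensitive events: reward consistent grounding across directions,
    after mapping the backward prediction into forward coordinates."""
    if time_sensitive:
        return 1.0 if dir_pred == dir_true else 0.0
    # reverse-map the backward-video prediction: t -> video_len - t
    mapped = (video_len - pred_bwd[1], video_len - pred_bwd[0])
    return temporal_iou(pred_fwd, mapped)

# a time-insensitive event grounded consistently in both playback directions
print(arrow_reward(False, (2.0, 5.0), (25.0, 28.0), video_len=30.0))  # -> 1.0
```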

[CV-101] Hard Thresholding Pursuit Algorithms for Least Absolute Deviations Problem

【速读】:该论文旨在解决信号恢复中因测量值受任意幅度异常值污染而导致的鲁棒性问题,尤其是在稀疏信号重建场景下。其解决方案的关键在于提出了一种无需先验稀疏度信息且无参数设计的分级快速硬阈值追踪(Graded Fast Hard Thresholding Pursuit, GFHTP₁)算法,该算法通过迭代硬阈值策略实现对异常值的强鲁棒性,同时避免了传统方法中复杂的参数调优过程,从而在保持高计算效率的同时显著提升恢复精度。

链接: https://arxiv.org/abs/2601.06558
作者: Jiao Xu,Peng Li,Bing Zheng
机构: 未知
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Least absolute deviations (LAD) is a statistical optimality criterion widely utilized in scenarios where a minority of measurements are contaminated by outliers of arbitrary magnitudes. In this paper, we delve into the robustness of the variant of adaptive iterative hard thresholding to outliers, known as the graded fast hard thresholding pursuit (GFHTP₁) algorithm. Unlike the majority of the state-of-the-art algorithms in this field, GFHTP₁ does not require prior information about the signal's sparsity. Moreover, its design is parameterless, which not only simplifies the implementation process but also removes the intricacies of parameter optimization. Numerical experiments reveal that the GFHTP₁ algorithm consistently outperforms competing algorithms in terms of both robustness and computational efficiency.
zh
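
For intuition, here is a bare-bones graded iterative-hard-thresholding sketch for the LAD objective min ||Ax - y||_1. It is a toy member of the family the paper studies, not GFHTP₁ itself: the real algorithm is parameterless and needs no sparsity input, whereas this sketch takes a target sparsity and a step size.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def graded_iht_lad(A, y, k_max, iters=200, step=None):
    """Toy graded iterative hard thresholding for min ||Ax - y||_1
    (a sketch, not the parameter-free GFHTP_1): the support size is
    'graded', growing from 1 up to k_max over the iterations."""
    m, n = A.shape
    step = step or 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for t in range(iters):
        k = min(k_max, 1 + t // (iters // k_max))   # graded sparsity schedule
        g = A.T @ np.sign(A @ x - y)                # subgradient of the L1 loss
        x = hard_threshold(x - step * g, k)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(80, 40))
x_true = np.zeros(40)
x_true[[3, 17, 25]] = [2.0, -1.5, 1.0]
y = A @ x_true
y[::10] += 20 * rng.normal(size=8)                  # gross outliers
print(np.nonzero(graded_iht_lad(A, y, k_max=3))[0]) # estimated support
```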

[CV-102] LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models

【速读】:该论文旨在解决传统多目标跟踪(Multi-Object Tracking, MOT)系统在几何感知层面表现优异但缺乏语义理解能力的问题,即其只能回答“在哪里”和“是谁”,而无法解释“是什么”和“为什么”——即对象行为的语义信息与因果逻辑。解决方案的关键在于提出一种端到端的语义多目标跟踪(Semantic Multi-Object Tracking, SMOT)框架 LLMTrack,其核心创新包括:采用仿生设计思想,将强定位能力(基于 Grounding DINO)与深度认知推理能力(基于 LLaVA-OneVision 多模态大模型)解耦;引入时空融合模块(Spatio-Temporal Fusion Module),聚合实例级交互特征与视频级上下文信息,使大语言模型(Large Language Model, LLM)能够理解复杂轨迹;并通过渐进式三阶段训练策略(视觉对齐、时序微调、LoRA 语义注入)高效适配大规模模型至跟踪任务,从而实现高精度跟踪与语义理解的协同提升。

链接: https://arxiv.org/abs/2601.06550
作者: Pan Liao,Feng Yang,Di Wu,Jinwen Yu,Yuhua Zhu,Wenhui Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering "where" and "who". However, they often function as autistic observers, capable of tracing geometric paths but blind to the semantic "what" and "why" behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose LLMTrack, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, utilizing Grounding DINO as the eyes and the LLaVA-OneVision multimodal large model as the brain. We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level contexts, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy, Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA, to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.
zh

[CV-103] Towards Egocentric 3D Hand Pose Estimation in Unseen Domains WACV2026

【速读】:该论文旨在解决3D手部姿态估计(3D hand pose estimation)在跨域场景下的性能下降问题,尤其是在未见过的新环境中,现有方法因训练数据有限和深度感知能力不足而容易过拟合特定相机内参(camera intrinsics)。其核心解决方案是提出V-HPOT方法,关键在于将关键点的z坐标估计置于一个由焦距和图像尺寸归一化的虚拟相机空间中,从而实现对相机内参的不变性(invariance to camera intrinsics),并在此基础上设计了一种自监督测试时优化策略——通过施加预测姿态与空间尺度变换后手部姿态之间的3D一致性损失,使模型在推理阶段无需真实标注即可适应目标域特征,显著提升跨域泛化能力。

链接: https://arxiv.org/abs/2601.06537
作者: Wiktor Mucha,Michael Wray,Martin Kampel
机构: TU Wien (维也纳工业大学); SoftServe Inc. (软服务公司); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026

点击查看摘要

Abstract:We present V-HPOT, a novel approach for improving the cross-domain performance of 3D hand pose estimation from egocentric images across diverse, unseen domains. State-of-the-art methods demonstrate strong performance when trained and tested within the same domain. However, they struggle to generalise to new environments due to limited training data and depth perception – overfitting to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference. This is achieved by applying a 3D consistency loss between predicted and in-space scale-transformed hand poses, allowing the model to adapt to target domain characteristics without requiring ground truth annotations. V-HPOT significantly improves 3D hand pose estimation performance in cross-domain scenarios, achieving a 71% reduction in mean pose error on the H2O dataset and a 41% reduction on the AssemblyHands dataset. Compared to state-of-the-art methods, V-HPOT outperforms all single-stage approaches across all datasets and competes closely with two-stage methods, despite needing approximately 3.5× to 14× less data.
zh
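
The normalization itself is a one-liner. The mapping below is one plausible reading of "normalised by focal length and image size" (the exact formula is an assumption): depth is predicted in a virtual camera whose focal length equals the image width, and mapped back with the target camera's intrinsics at inference.

```python
def z_to_virtual(z_metric, focal_px, img_w):
    """Map metric depth into an assumed virtual-camera space whose focal
    length equals the image width, removing the raw dependence on f."""
    return z_metric * img_w / focal_px

def z_from_virtual(z_virtual, focal_px, img_w):
    """Inverse mapping, applied with the *target* camera's intrinsics."""
    return z_virtual * focal_px / img_w

z = 0.45                                                 # metric keypoint depth (m)
zv = z_to_virtual(z, focal_px=600.0, img_w=1280)
assert abs(z_from_virtual(zv, 600.0, 1280) - z) < 1e-9   # exact round trip
print(zv)
```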

[CV-104] Toward Generalizable Deblurring: Leveraging Massive Blur Priors with Linear Attention for Real-World Scenarios

【速读】:该论文旨在解决图像去模糊(Image Deblurring)方法在真实世界场景中泛化能力差的问题,其核心挑战在于训练数据集在真实感与模糊模式多样性之间存在固有权衡,且现有算法设计受限于像素级损失函数对局部细节的过度关注,忽视了结构和语义一致性。解决方案的关键在于识别出模糊模式多样性是实现鲁棒泛化的决定性因素,并提出Blur Pattern Pretraining(BPP)策略:通过模拟数据集学习模糊先验(blur priors),并在真实数据上进行联合微调以实现迁移;进一步引入Motion and Semantic Guidance(MoSeG)模块,在严重退化条件下增强模糊先验,最终集成至轻量级扩散模型GLOWDeblur中,该模型结合卷积预重建域对齐模块与轻量扩散主干网络,兼顾性能与实用性。

链接: https://arxiv.org/abs/2601.06525
作者: Yuanting Gao,Shuo Cao,Xiaohui Li,Yuandong Pu,Yihao Liu,Kai Zhang
机构: Tsinghua University (清华大学); USTC (中国科学技术大学); Shanghai AI Lab (上海人工智能实验室); SJTU (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 14 figures, 6 tables

点击查看摘要

Abstract:Image deblurring has advanced rapidly with deep learning, yet most methods exhibit poor generalization beyond their training datasets, with performance dropping significantly in real-world scenarios. Our analysis shows this limitation stems from two factors: datasets face an inherent trade-off between realism and coverage of diverse blur patterns, and algorithmic designs remain restrictive, as pixel-wise losses drive models toward local detail recovery while overlooking structural and semantic consistency, whereas diffusion-based approaches, though perceptually strong, still fail to generalize when trained on narrow datasets with simplistic strategies. Through systematic investigation, we identify blur pattern diversity as the decisive factor for robust generalization and propose Blur Pattern Pretraining (BPP), which acquires blur priors from simulation datasets and transfers them through joint fine-tuning on real data. We further introduce Motion and Semantic Guidance (MoSeG) to strengthen blur priors under severe degradation, and integrate it into GLOWDeblur, a Generalizable reaL-wOrld lightWeight Deblur model that combines convolution-based pre-reconstruction domain alignment module with a lightweight diffusion backbone. Extensive experiments on six widely-used benchmarks and two real-world datasets validate our approach, confirming the importance of blur priors for robust generalization and demonstrating that the lightweight design of GLOWDeblur ensures practicality in real-world applications. The project page is available at this https URL.
zh

[CV-105] Bridging Robustness and Efficiency: Real-Time Low-Light Enhancement via Attention U-Net GAN

【速读】:该论文旨在解决低光照图像增强(Low-Light Image Enhancement, LLIE)中生成式模型与效率之间的矛盾问题:尽管基于扩散概率模型的方法能实现高质量的纹理恢复,但其计算延迟过高(通常超过2–4秒/图),难以部署于边缘设备;而传统CNN基线虽具备实时推理能力,却存在“过度平滑”问题,无法有效恢复极端低光条件下的细节结构。解决方案的关键在于提出一种混合注意力U-Net生成对抗网络(Attention U-Net GAN),通过在轻量级U-Net骨干网络中引入注意力门机制,并在条件对抗框架下训练,从而在单次前向传播中逼近生成模型的高频保真度,显著提升纹理重建质量的同时将推理延迟降至0.06秒,相较潜在扩散模型实现40倍加速,满足近实时应用需求。

链接: https://arxiv.org/abs/2601.06518
作者: Yash Thesia,Meera Suthar
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Recent advancements in Low-Light Image Enhancement (LLIE) have focused heavily on Diffusion Probabilistic Models, which achieve high perceptual quality but suffer from significant computational latency (often exceeding 2-4 seconds per image). Conversely, traditional CNN-based baselines offer real-time inference but struggle with “over-smoothing,” failing to recover fine structural details in extreme low-light conditions. This creates a practical gap in the literature: the lack of a model that provides generative-level texture recovery at edge-deployable speeds. In this paper, we address this trade-off by proposing a hybrid Attention U-Net GAN. We demonstrate that the heavy iterative sampling of diffusion models is not strictly necessary for texture recovery. Instead, by integrating Attention Gates into a lightweight U-Net backbone and training within a conditional adversarial framework, we can approximate the high-frequency fidelity of generative models in a single forward pass. Extensive experiments on the SID dataset show that our method achieves a best-in-class LPIPS score of 0.112 among efficient models, significantly outperforming efficient baselines (SID, EnlightenGAN) while maintaining an inference latency of 0.06s. This represents a 40x speedup over latent diffusion models, making our approach suitable for near real-time applications.
zh
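
The attention gate is a well-established module, so a faithful sketch is possible. Below is a standard additive attention gate in the spirit of Attention U-Net (the channel sizes and placement in this paper's generator are not specified here): a gating signal from the coarser decoder level decides which skip-connection activations survive.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Standard additive attention gate (Attention U-Net style), a sketch of
    the kind of module the paper integrates into its U-Net backbone."""
    def __init__(self, in_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(in_ch, inter_ch, kernel_size=1)
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # x: skip features; g: gating signal from the coarser decoder level
        a = torch.relu(self.w_x(x) + self.w_g(g))
        alpha = torch.sigmoid(self.psi(a))   # (B,1,H,W) attention coefficients
        return x * alpha                     # suppress irrelevant skip activations

gate = AttentionGate(in_ch=64, gate_ch=64, inter_ch=32)
x = torch.randn(1, 64, 32, 32)
g = torch.randn(1, 64, 32, 32)               # assume g upsampled to x's size
print(gate(x, g).shape)                      # torch.Size([1, 64, 32, 32])
```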

[CV-106] Precision Meets Art: Autonomous Multi-UAV System for Large Scale Mural Drawing

【速读】:该论文旨在解决大规模户外壁画创作中自动化与高精度绘制的难题,传统单无人机(UAV)方案在效率和规模上存在瓶颈。其核心解决方案在于开发了一种多无人机协同系统,关键创新包括:一是融合2D定位(基于单个运动捕捉摄像头)与机载激光雷达(LiDAR)的复合定位系统,实现高精度空间感知;二是提出一种分段式飞行控制算法,在轨迹切向与法向分别采用不同控制策略,兼顾绘制平滑性与精度。实验验证了该系统可在100平方米尺度上完成高质量壁画绘制,并展现出优于单机方案的可扩展性、作业速度及恶劣天气下的稳定性。

链接: https://arxiv.org/abs/2601.06508
作者: Andrei A. Korigodskii,Artem E. Vasiunik,Georgii A. Varin,Adilia M. Zukhurova,Matvei V. Urvantsev,Semen A. Osipenkov,Igor S. Efremov,Georgii E. Bondar
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 6 pages, 9 figures

点击查看摘要

Abstract:The integration of autonomous unmanned aerial vehicles (UAVs) into large-scale artistic projects has emerged as a new application in robotics. This paper presents the design, deployment, and testing of a novel multi-drone system for automated mural painting in outdoor settings. This technology makes use of new software that coordinates multiple drones simultaneously, utilizing state-machine algorithms for task execution. Key advancements are the complex positioning system that combines 2D localization using a single motion tracking camera with onboard LiDAR for precise positioning, and a novel flight control algorithm, which works differently along the trajectory and normally to it, ensuring smoothness and high precision of the drawings at the same time. A 100 square meters mural was created using the developed multi-drone system, validating the system’s efficacy. Compared to single-drone approaches, our multi-UAV solution significantly improves scalability and operational speed while maintaining high stability even in harsh weather conditions. The findings highlight the potential of autonomous robotic swarms in creative applications, paving the way for further advancements in large-scale robotic art.
zh

[CV-107] 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence

【速读】:该论文旨在解决3D场景描述(3D captioning)任务中因点云数据稀疏性和不规则性,以及现有生成模型在跨环境(如室内与室外场景)下弱定位能力与有限的分布外(out-of-distribution, OOD)泛化性能所带来的挑战。解决方案的关键在于提出3D CoCa v2框架,其核心创新包括:基于冻结CLIP的语义先验构建统一的对比视觉-语言学习与3D描述生成机制,结合空间感知的3D场景编码器以捕捉几何信息,并通过多模态解码器联合优化对比损失与captioning损失;此外,在推理阶段引入无需更新参数的测试时搜索(test-time search, TTS),利用紧凑场景摘要进行奖励引导的选择,从而提升描述多样性与鲁棒性。该方法不依赖外部检测器或手工提议,显著增强了跨场景泛化能力。

链接: https://arxiv.org/abs/2601.06496
作者: Hao Tang,Ting Huang,Zeyu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at this https URL.
zh
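
Reward-guided test-time search reduces to a small loop. The sketch below uses toy stand-ins for the captioner and the reward model (all names are hypothetical); the point is that caption candidates are sampled and re-ranked against a compact scene summary without updating any captioner parameters.

```python
import random

def test_time_search(generate, reward, scene, n_candidates=8):
    """Sample several captions for the scene, score each with a reward model
    against a compact scene summary, and return the best candidate. No
    captioner parameters are updated (interface assumed, not the paper's)."""
    candidates = [generate(scene, temperature=0.9) for _ in range(n_candidates)]
    scores = [reward(scene, c) for c in candidates]
    return max(zip(scores, candidates))[1]

# toy stand-ins for the captioner and the reward model
random.seed(0)

def generate(scene, temperature):
    return random.choice(["a chair near a desk",
                          "a desk with a chair beside it",
                          "an empty room"])

def reward(scene, caption):
    return sum(w in caption for w in scene.split())  # keyword overlap as reward

print(test_time_search(generate, reward, scene="desk chair"))
```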

[CV-108] Learning Domain Agnostic Latent Embeddings of 3D Faces for Zero-shot Animal Expression Transfer WACV2026

【速读】:该论文旨在解决跨物种面部表情迁移问题,即如何将人类面部表情有效地转移到3D动物面部网格上,而无需收集动物的表情数据。其解决方案的关键在于提出了一种零样本(zero-shot)框架,通过结合内在几何描述符(HKS/WKS)与一种与网格无关的潜在嵌入(mesh-agnostic latent embedding),实现面部身份(ID)与表情的解耦建模:其中身份潜在空间捕捉跨物种的通用面部结构,表达潜在空间则编码可泛化的形变模式;模型仅用人类表情配对训练即可学习到跨身份的表情迁移能力,同时借助雅可比损失(Jacobian loss)、顶点位置损失和拉普拉斯损失来保证几何一致性,从而有效缩小人类与动物面部形状之间的几何差距。

链接: https://arxiv.org/abs/2601.06484
作者: Yue Wang,Lawrence Amadi,Xiang Gao,Yazheng Chen,Yuanpeng Liu,Ning Lu,Xianfeng Gu
机构: Stony Brook University (石溪大学); Futurewei Technologies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: WACV 2026 Workshop LENS

点击查看摘要

Abstract:We present a zero-shot framework for transferring human facial expressions to 3D animal face meshes. Our method combines intrinsic geometric descriptors (HKS/WKS) with a mesh-agnostic latent embedding that disentangles facial identity and expression. The ID latent space captures species-independent facial structure, while the expression latent space encodes deformation patterns that generalize across humans and animals. Trained only with human expression pairs, the model learns the embeddings, decoupling, and recoupling of cross-identity expressions, enabling expression transfer without requiring animal expression data. To enforce geometric consistency, we employ Jacobian loss together with vertex-position and Laplacian losses. Experiments show that our approach achieves plausible cross-species expression transfer, effectively narrowing the geometric gap between human and animal facial shapes.
zh

[CV-109] SRFlow: A Dataset and Regularization Model for High-Resolution Facial Optical Flow via Splatting Rasterization

【速读】:该论文旨在解决高分辨率面部光流(Facial Optical Flow)数据集匮乏导致的面部运动分析进展受限问题。其关键解决方案是提出两个核心贡献:一是构建了高分辨率面部光流数据集 Splatting Rasterization Flow (SRFlow),二是设计了针对面部光流估计的模型 SRFlowNet,该模型引入了基于掩码和差分或 Sobel 算子计算梯度的定制正则化损失函数,有效抑制了无纹理或重复模式区域中的高频噪声和大尺度误差,从而首次实现了由高斯点绘制(Gaussian Splatting Rasterization)引导的高分辨率皮肤运动捕捉。实验表明,使用 SRFlow 数据集训练可使多种光流模型的端点误差(EPE)降低达 42%,且 SRFlowNet 在微表情识别任务中 F1 分数提升 48%。

链接: https://arxiv.org/abs/2601.06479
作者: JiaLin Zhang,Dong Li
机构: Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial optical flow supports a wide range of tasks in facial motion analysis. However, the lack of high-resolution facial optical flow datasets has hindered progress in this area. In this paper, we introduce Splatting Rasterization Flow (SRFlow), a high-resolution facial optical flow dataset, and Splatting Rasterization Guided FlowNet (SRFlowNet), a facial optical flow model with tailored regularization losses. These losses constrain flow predictions using masks and gradients computed via difference or Sobel operator. This effectively suppresses high-frequency noise and large-scale errors in texture-less or repetitive-pattern regions, enabling SRFlowNet to be the first model explicitly capable of capturing high-resolution skin motion guided by Gaussian splatting rasterization. Experiments show that training with the SRFlow dataset improves facial optical flow estimation across various optical flow models, reducing end-point error (EPE) by up to 42% (from 0.5081 to 0.2953). Furthermore, when coupled with the SRFlow dataset, SRFlowNet achieves up to a 48% improvement in F1-score (from 0.4733 to 0.6947) on a composite of three micro-expression datasets. These results demonstrate the value of advancing both facial optical flow estimation and micro-expression recognition.
zh
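
The tailored regularization can be sketched in a few lines. The loss below is one assumed form of a mask-plus-Sobel term (the names and weighting are mine, not the paper's exact losses): Sobel gradients of the predicted flow are penalized inside texture-less regions to damp high-frequency noise there.

```python
import torch
import torch.nn.functional as F

def sobel_grad(x):
    """Per-channel Sobel gradients of a (B,C,H,W) tensor -> (B,C,2,H,W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    k = torch.stack([kx, ky]).unsqueeze(1)             # (2,1,3,3)
    b, c, h, w = x.shape
    g = F.conv2d(x.reshape(b * c, 1, h, w), k.to(x), padding=1)
    return g.reshape(b, c, 2, h, w)

def masked_flow_regularizer(flow, mask):
    """Sketch of a gradient-based regularization loss (assumed form):
    penalize Sobel gradients of the predicted flow inside texture-less
    regions given by `mask`, suppressing high-frequency noise there."""
    g = sobel_grad(flow)                               # (B,2,2,H,W)
    return (g.abs() * mask[:, None, None]).mean()

flow = torch.randn(1, 2, 64, 64, requires_grad=True)
mask = torch.zeros(1, 64, 64)
mask[:, 16:48, 16:48] = 1.0                            # a texture-less region
loss = masked_flow_regularizer(flow, mask)
loss.backward()
print(float(loss))
```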

[CV-110] VVTRec: Radio Interferometric Reconstruction through Visual and Textual Modality Enrichment

【速读】:该论文旨在解决射电天文成像中因稀疏可见度数据(visibility)导致的图像伪影残留与相关性建模不足的问题。现有方法仅依赖单一模态的稀疏可见度数据,难以有效提取信号特征并生成高质量图像。解决方案的关键在于提出VVTRec,一种基于可见度引导的多模态射电干涉数据重建方法,通过将稀疏可见度转换为图像和文本特征,分别增强空间结构信息与语义信息,从而提升图像的结构完整性和准确性;同时利用视觉语言模型(Vision-Language Models, VLMs)实现无需额外训练的性能增益,使VLM能够从未见过的可见度模态中提取预训练知识作为补充,显著改善成像质量且计算开销可控。

链接: https://arxiv.org/abs/2601.06475
作者: Kai Cheng,Ruoqi Wang,Qiong Luo
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学广州)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Radio astronomy is an indispensable discipline for observing distant celestial objects. Measurements of wave signals from radio telescopes, called visibility, need to be transformed into images for astronomical observations. These dirty images blend information from real sources and artifacts. Therefore, astronomers usually perform reconstruction before imaging to obtain cleaner images. Existing methods consider only a single modality of sparse visibility data, resulting in images with remaining artifacts and insufficient modeling of correlation. To enhance the extraction of visibility information and emphasize output quality in the image domain, we propose VVTRec, a multimodal radio interferometric data reconstruction method with visibility-guided visual and textual modality enrichment. In our VVTRec, sparse visibility is transformed into image-form and text-form features to obtain enhancements in terms of spatial and semantic information, improving the structural integrity and accuracy of images. Also, we leverage Vision-Language Models (VLMs) to achieve additional training-free performance improvements. VVTRec enables sparse visibility, as a foreign modality unseen by VLMs, to accurately extract pre-trained knowledge as a supplement. Our experiments demonstrate that VVTRec effectively enhances imaging results by exploiting multimodal information without introducing excessive computational overhead.
zh

[CV-111] SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning

【速读】:该论文旨在解决自动驾驶中视觉语言模型(Vision Language Models, VLMs)与语义占据表示(semantic occupancy)难以有效融合的问题。传统VLMs存在token爆炸和时空推理能力有限的缺陷,而语义占据虽提供细粒度的空间显式表征,但其高密度特性导致难以高效集成至VLM框架中。解决方案的关键在于提出SparseOccVLA——一种新型视觉-语言-动作(Vision-Language-Action)模型,通过轻量级稀疏占据编码器生成紧凑且信息丰富的稀疏占据查询(sparse occupancy queries),作为连接视觉与语言空间的单一桥梁;这些查询被对齐至语言空间并由大语言模型(LLM)进行统一场景理解与未来占据预测,并进一步引入LLM引导的锚点扩散规划器(Anchor-Diffusion Planner),实现解耦的锚点评分与去噪机制及跨模型轨迹条件融合,从而显著提升整体感知与规划性能。

链接: https://arxiv.org/abs/2601.06474
作者: Chenxu Dang,Jie Wang,Guang Li,Zhiwen Hou,Zihan You,Hangjun Ye,Jie Ma,Long Chen,Yan Wang
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi EV; Institute for AI Industry Research (AIR), Tsinghua University (清华大学人工智能产业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In autonomous driving, Vision Language Models (VLMs) excel at high-level reasoning, whereas semantic occupancy provides fine-grained details. Despite significant progress in individual fields, there is still no method that can effectively integrate both paradigms. Conventional VLMs struggle with token explosion and limited spatiotemporal reasoning, while semantic occupancy provides a unified, explicit spatial representation but is too dense to integrate efficiently with VLMs. To address these challenges and bridge the gap between VLMs and occupancy, we propose SparseOccVLA, a novel vision-language-action model that unifies scene understanding, occupancy forecasting, and trajectory planning powered by sparse occupancy queries. Starting with a lightweight Sparse Occupancy Encoder, SparseOccVLA generates compact yet highly informative sparse occupancy queries that serve as the single bridge between vision and language. These queries are aligned into the language space and reasoned over by the LLM for unified scene understanding and future occupancy forecasting. Furthermore, we introduce an LLM-guided Anchor-Diffusion Planner featuring decoupled anchor scoring and denoising, as well as cross-model trajectory-condition fusion. SparseOccVLA achieves a 7% relative improvement in CIDEr over the state-of-the-art on OmniDrive-nuScenes, a 0.5 increase in mIoU score on Occ3D-nuScenes, and sets a state-of-the-art open-loop planning metric on the nuScenes benchmark, demonstrating its strong holistic capability.
zh

[CV-112] On the Adversarial Robustness of 3D Large Vision-Language Models

【速读】:该论文旨在解决3D视觉语言模型(3D Vision-Language Models, 3D VLMs)在面对对抗攻击时的鲁棒性问题,特别是探究将3D视觉输入整合到VLM架构中是否会像2D VLMs一样显著增加模型的脆弱性。其解决方案的关键在于提出了一种系统性的评估框架,包含两种互补的攻击策略:视觉攻击(Vision Attack),通过扰动3D编码器和投影器生成的视觉token特征,以检验视觉-语言对齐机制的鲁棒性;以及描述攻击(Caption Attack),直接操纵输出token序列,用于评估端到端系统的整体抗干扰能力。这两种攻击均涵盖无目标(untargeted)和有目标(targeted)变体,从而全面量化模型的通用脆弱性和被恶意操控的风险。实验结果揭示了3D VLMs在无目标攻击下存在显著漏洞,但在有目标攻击下比2D VLMs更具韧性,凸显了提升3D VLMs对抗鲁棒性的必要性。

链接: https://arxiv.org/abs/2601.06464
作者: Chao Liu,Ngai-Man Cheung
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:3D Vision-Language Models (VLMs), such as PointLLM and GPT4Point, have shown strong reasoning and generalization abilities in 3D understanding tasks. However, their adversarial robustness remains largely unexplored. Prior work in 2D VLMs has shown that the integration of visual inputs significantly increases vulnerability to adversarial attacks, making these models easier to manipulate into generating toxic or misleading outputs. In this paper, we investigate whether incorporating 3D vision similarly compromises the robustness of 3D VLMs. To this end, we present the first systematic study of adversarial robustness in point-based 3D VLMs. We propose two complementary attack strategies: Vision Attack, which perturbs the visual token features produced by the 3D encoder and projector to assess the robustness of vision-language alignment; and Caption Attack, which directly manipulates output token sequences to evaluate end-to-end system robustness. Each attack includes both untargeted and targeted variants to measure general vulnerability and susceptibility to controlled manipulation. Our experiments reveal that 3D VLMs exhibit significant adversarial vulnerabilities under untargeted attacks, while demonstrating greater resilience against targeted attacks aimed at forcing specific harmful outputs, compared to their 2D counterparts. These findings highlight the importance of improving the adversarial robustness of 3D VLMs, especially as they are deployed in safety-critical applications.
zh

[CV-113] VIPER Strike: Defeating Visual Reasoning CAPTCHAs via Structured Vision-Language Inference USENIX-SECURITY2026

【速读】:该论文旨在解决当前视觉推理验证码(Visual Reasoning CAPTCHAs, VRCs)在实际部署中面临的通用性不足问题,即现有攻击方法要么依赖特定模板的视觉检测器(vision-centric),难以应对新布局;要么虽利用大语言模型(LLM)进行推理但缺乏细粒度视觉感知能力(reasoning-centric),导致在多样化VRC场景下性能受限。其解决方案的关键在于提出ViPer框架,该框架通过结构化多目标视觉感知与自适应LLM推理的融合,在模块化流程中实现对视觉布局的解析、属性到语义的锚定以及目标坐标的推断,从而在六类主流VRC提供商上达到最高93.2%的成功率,并保持对不同LLM后端的高度鲁棒性。

链接: https://arxiv.org/abs/2601.06461
作者: Minfeng Qi,Dongyang He,Qin Wang,Lefeng Zhang
机构: City University of Macau (澳门城市大学); CSIRO Data61
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: Accepted by Usenix Security 2026

点击查看摘要

Abstract:Visual Reasoning CAPTCHAs (VRCs) combine visual scenes with natural-language queries that demand compositional inference over objects, attributes, and spatial relations. They are increasingly deployed as a primary defense against automated bots. Existing solvers fall into two paradigms: vision-centric, which rely on template-specific detectors but fail on novel layouts, and reasoning-centric, which leverage LLMs but struggle with fine-grained visual perception. Both lack the generality needed to handle heterogeneous VRC deployments. We present ViPer, a unified attack framework that integrates structured multi-object visual perception with adaptive LLM-based reasoning. ViPer parses visual layouts, grounds attributes to question semantics, and infers target coordinates within a modular pipeline. Evaluated on six major VRC providers (VTT, Geetest, NetEase, Dingxiang, Shumei, Xiaodun), ViPer achieves up to 93.2% success, approaching human-level performance across multiple benchmarks. Compared to prior solvers, GraphNet (83.2%), Oedipus (65.8%), and the Holistic approach (89.5%), ViPer consistently outperforms all baselines. The framework further maintains robustness across alternative LLM backbones (GPT, Grok, DeepSeek, Kimi), sustaining accuracy above 90%. To anticipate defense, we further introduce Template-Space Randomization (TSR), a lightweight strategy that perturbs linguistic templates without altering task semantics. TSR measurably reduces solver (i.e., attacker) performance. Our proposed design suggests directions for human-solvable but machine-resistant CAPTCHAs.
zh

[CV-114] PixRec: Leveraging Visual Context for Next-Item Prediction in Sequential Recommendation

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的序列推荐方法忽视了真实场景中丰富的视觉信息的问题,尤其是在电商等以商品图像为核心的推荐场景中。其解决方案的关键在于提出PixRec——一个融合文本属性与产品图像的多模态推荐框架,通过引入具备图像-文本联合处理能力的视觉-语言模型骨干网络,在保持双塔结构和混合训练目标的同时,对物品间及用户-物品交互中的多模态特征投影进行对齐,从而有效利用视觉信息提升推荐精度。实验表明,相比纯文本推荐器,该方法在Top-Rank和Top-10 Rank准确率上分别提升了3倍和40%,验证了视觉特征在区分语义相似商品中的关键作用。

链接: https://arxiv.org/abs/2601.06458
作者: Sayak Chakrabarty,Souradip Pal
机构: Northwestern University (西北大学); Purdue University (普渡大学)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have recently shown strong potential for usage in sequential recommendation tasks through text-only models, which combine advanced prompt design, contrastive alignment, and fine-tuning on downstream domain-specific data. While effective, these approaches overlook the rich visual information present in many real-world recommendation scenarios, particularly in e-commerce. This paper proposes PixRec - a vision-language framework that incorporates both textual attributes and product images into the recommendation pipeline. Our architecture leverages a vision-language model backbone capable of jointly processing image-text sequences, maintaining a dual-tower structure and mixed training objective while aligning multi-modal feature projections for both item-item and user-item interactions. Using the Amazon Reviews dataset augmented with product images, our experiments demonstrate 3× and 40% improvements in top-rank and top-10 rank accuracy over text-only recommenders, respectively, indicating that visual features can help distinguish items with similar textual descriptions. Our work outlines future directions for scaling multi-modal recommender training, enhancing visual-text feature fusion, and evaluating inference-time performance. This work takes a step toward building software systems utilizing visual information in sequential recommendation for real-world applications like e-commerce.
zh

[CV-115] CulinaryCut-VLAP: A Vision-Language-Action-Physics Framework for Food Cutting via a Force-Aware Material Point Method

【速读】:该论文旨在解决食品切割任务中因刀具与柔性材料之间高度非线性交互(包括大变形、频繁接触及拓扑变化)而导致的数据采集困难与模型训练不稳定的问题。其核心挑战在于如何在保证物理真实性的同时构建可扩展的视觉-语言-动作(VLA)学习框架。解决方案的关键在于提出一个统一框架,将基于材料点法(MPM)的高保真物理模拟器与多模态VLA数据集相结合:模拟器采用MLS-MPM作为计算核心,有效抑制数值耗散和能量漂移,并准确捕捉剪切与旋转响应;同时通过粒子与网格间的冲量交换估算力和应力分布,实现瞬态接触力的稳定追踪;此外,作者还构建了一个包含多样化切割轨迹、多视角视觉观测、细粒度语言指令以及力矩和工具位姿标签的基准数据集,从而形成尊重物理规律的学习-评估闭环,为柔性物体操作中的VLA模型提供安全、可复现且可扩展的训练基础。

链接: https://arxiv.org/abs/2601.06451
作者: Hyunseo Koh,Chang-Yong Song,Youngjae Choi,Misa Viveiros,David Hyde,Heewon Kim
机构: Soongsil University (崇实大学); Vanderbilt University (范德堡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages; 15 figures; 5 tables

点击查看摘要

Abstract:Food cutting is a highly practical yet underexplored application at the intersection of vision and robotic manipulation. The task remains challenging because interactions between the knife and deformable materials are highly nonlinear and often entail large deformations, frequent contact, and topological change, which in turn hinder stable and safe large-scale data collection. To address these challenges, we propose a unified framework that couples a vision-language-action (VLA) dataset with a physically realistic cutting simulator built on the material point method (MPM). Our simulator adopts MLS-MPM as its computational core, reducing numerical dissipation and energy drift while preserving rotational and shear responses even under topology-changing cuts. During cutting, forces and stress distributions are estimated from impulse exchanges between particles and the grid, enabling stable tracking of transient contact forces and energy transfer. We also provide a benchmark dataset that integrates diverse cutting trajectories, multi-view visual observations, and fine-grained language instructions, together with force–torque and tool–pose labels to provide physically consistent training signals. These components realize a learning–evaluation loop that respects the core physics of cutting and establishes a safe, reproducible, and scalable foundation for advancing VLA models in deformable object manipulation.
zh

[CV-116] How to Build Robust Scalable Models for GSV-Based Indicators in Neighborhood Research

【速读】:该论文旨在解决如何在标注数据有限的情况下,有效选择和适配基础模型(foundation models)以用于社会健康研究中的街区建成环境分析问题。其核心挑战在于跨域迁移的不确定性(如从ImageNet到Google Street View图像的视觉特征差异),以及在计算资源受限条件下,如何通过无监督训练策略提升模型在下游任务中的性能。解决方案的关键在于:利用大规模未标注数据进行无监督适应(unsupervised adaptation),并通过系统性的定量与可视化分析,评估不同基础模型在预训练后微调前后的性能变化,从而为小样本场景下模型选择和训练策略提供实证依据与实践指导。

链接: https://arxiv.org/abs/2601.06443
作者: Xiaoya Tang,Xiaohe Yue,Heran Mane,Dapeng Li,Quynh Nguyen,Tolga Tasdizen
机构: University of Utah, Scientific Computing and Imaging Institute (犹他大学,科学计算与成像研究所); University of Maryland (马里兰大学); University of Alabama (阿拉巴马大学); National Institute of Nursing Research, National Institutes of Health (美国国立卫生研究院国家护理研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A substantial body of health research demonstrates a strong link between neighborhood environments and health outcomes. Recently, there has been increasing interest in leveraging advances in computer vision to enable large-scale, systematic characterization of neighborhood built environments. However, the generalizability of vision models across fundamentally different domains remains uncertain, for example, transferring knowledge from ImageNet to the distinct visual characteristics of Google Street View (GSV) imagery. In applied fields such as social health research, several critical questions arise: which models are most appropriate, whether to adopt unsupervised training strategies, what training scale is feasible under computational constraints, and how much such strategies benefit downstream performance. These decisions are often costly and require specialized expertise. In this paper, we answer these questions through empirical analysis and provide practical insights into how to select and adapt foundation models for datasets with limited size and labels, while leveraging larger, unlabeled datasets through unsupervised training. Our study includes comprehensive quantitative and visual analyses comparing model performance before and after unsupervised adaptation.
zh

[CV-117] WHU-PCPR: A cross-platform heterogeneous point cloud dataset for place recognition in complex urban scenes

【速读】:该论文旨在解决当前点云场景识别(Point Cloud-based Place Recognition, PCPR)研究中因数据集缺乏多样性而导致的性能瓶颈问题,具体表现为现有PCPR数据集在场景、平台和传感器类型上的单一性,限制了算法在真实复杂环境中的泛化能力。解决方案的关键在于构建WHU-PCPR这一跨平台异构点云数据集,其核心特征包括:1)采集自不同平台(survey-grade vehicle-mounted Mobile Laser Scanning (MLS) 系统与低成本便携式 helmet-mounted Laser Scanning (PLS) 系统)及多种LiDAR传感器的点云数据;2)涵盖城市与校园道路等具有实时与长期变化的复杂定位场景;3)具备大尺度空间覆盖(60个月跨度下82.3 km轨迹,约30 km无重复路径)。该数据集为PCPR方法提供了更贴近实际应用的评估基准,推动了相关技术的发展。

链接: https://arxiv.org/abs/2601.06442
作者: Xianghong Zou,Jianping Li,Yandi Yang,Weitong Wu,Yuan Wang,Qiegen Liu,Zhen Dong
机构: Nanjing Normal University (南京师范大学); Nanyang Technological University (南洋理工大学); University of Calgary (卡尔加里大学); Hohai University (河海大学); Jiangxi Normal University (江西师范大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Point Cloud-based Place Recognition (PCPR) demonstrates considerable potential in applications such as autonomous driving, robot localization and navigation, and map update. In practical applications, point clouds used for place recognition are often acquired from different platforms and LiDARs across varying scenes. However, existing PCPR datasets lack diversity in scenes, platforms, and sensors, which limits the effective development of related research. To address this gap, we establish WHU-PCPR, a cross-platform heterogeneous point cloud dataset designed for place recognition. The dataset differentiates itself from existing datasets through its distinctive characteristics: 1) cross-platform heterogeneous point clouds: collected from survey-grade vehicle-mounted Mobile Laser Scanning (MLS) systems and low-cost Portable helmet-mounted Laser Scanning (PLS) systems, each equipped with distinct mechanical and solid-state LiDAR sensors. 2) Complex localization scenes: encompassing real-time and long-term changes in both urban and campus road scenes. 3) Large-scale spatial coverage: featuring 82.3 km of trajectory over a 60-month period and an unrepeated route of approximately 30 km. Based on WHU-PCPR, we conduct extensive evaluation and in-depth analysis of several representative PCPR methods, and provide a concise discussion of key challenges and future research directions. The dataset and benchmark code are available at this https URL.
zh

[CV-118] Semantic Enrichment of CAD-Based Industrial Environments via Scene Graphs for Simulation and Reasoning

【速读】:该论文旨在解决工业环境中机器人训练与高阶场景理解所需的仿真环境细节不足问题,特别是现有CAD文件虽能精确描述几何和视觉信息,但缺乏语义、关系和功能信息,从而限制了仿真与训练的潜力。解决方案的关键在于利用大视觉语言模型(Large Vision-Language Model, LVLM)对CAD环境进行离线处理,构建包含语义、空间和功能信息的3D场景图(3D scene graph),从而显式建模功能性与可操作元素之间的关系,为动态仿真与推理提供结构化基础。

链接: https://arxiv.org/abs/2601.06415
作者: Nathan Pascal Walus,Ranulfo Bezerra,Shotaro Kojima,Tsige Tadesse Alemayoh,Satoshi Tadokoro,Kazunori Ohno
机构: Tohoku University (东北大学); RWTH Aachen University (亚琛工业大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE SSRR 2025

点击查看摘要

Abstract:Utilizing functional elements in an industrial environment, such as displays and interactive valves, provides effective possibilities for robot training. When preparing simulations for robots or applications that involve high-level scene understanding, the simulation environment must be equally detailed. Although CAD files for such environments deliver an exact description of the geometry and visuals, they usually lack semantic, relational and functional information, thus limiting the simulation and training possibilities. A 3D scene graph can organize semantic, spatial and functional information by enriching the environment through a Large Vision-Language Model (LVLM). In this paper we present an offline approach to creating detailed 3D scene graphs from CAD environments. This will serve as a foundation to include the relations of functional and actionable elements, which then can be used for dynamic simulation and reasoning. Key results of this research include both quantitative results of the generated semantic labels as well as qualitative results of the scene graph, especially with respect to pipe structures and identified functional relations. All code, results and the environment will be made available at this https URL
zh

[CV-119] GlobalPaint: Spatiotemporal Coherent Video Outpainting with Global Feature Guidance

【速读】:该论文致力于解决视频外绘(video outpainting)中的时空一致性难题,即在扩展视频边界时既要保证每帧图像的空间合理性,又要确保长时间跨度内的运动连贯性,尤其是在相机或物体运动导致新增内容随时间逐渐显现的情况下。其解决方案的关键在于提出了一种基于扩散模型的全局协同框架GlobalPaint:首先采用分层处理流程,先对关键帧进行外绘,再通过条件插值模型完成中间帧的填充以减少误差累积;其次在模型层面引入两个核心模块——增强型时空模块(Enhanced Spatial-Temporal module),利用3D窗口注意力机制强化时空交互能力;以及全局特征引导机制,通过专用提取器将所有帧中可见区域的OpenCLIP特征压缩为紧凑的全局token,从而实现跨帧语义一致性的有效传递。

链接: https://arxiv.org/abs/2601.06413
作者: Yueming Pan,Ruoyu Feng,Jianmin Bao,Chong Luo,Nanning Zheng
机构: Xi’an Jiaotong University (西安交通大学); Microsoft Research Asia (微软亚洲研究院); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video outpainting extends a video beyond its original boundaries by synthesizing missing border content. Compared with image outpainting, it requires not only per-frame spatial plausibility but also long-range temporal coherence, especially when outpainted content becomes visible across time under camera or object motion. We propose GlobalPaint, a diffusion-based framework for spatiotemporal coherent video outpainting. Our approach adopts a hierarchical pipeline that first outpaints key frames and then completes intermediate frames via an interpolation model conditioned on the completed boundaries, reducing error accumulation in sequential processing. At the model level, we augment a pretrained image inpainting backbone with (i) an Enhanced Spatial-Temporal module featuring 3D windowed attention for stronger spatiotemporal interaction, and (ii) global feature guidance that distills OpenCLIP features from observed regions across all frames into compact global tokens using a dedicated extractor. Comprehensive evaluations on benchmark datasets demonstrate improved reconstruction quality and more natural motion compared to prior methods. Our demo page is this https URL
zh

[CV-120] Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification

【速读】:该论文旨在解决现有学生参与度预测方法依赖大量标注数据且忽略课堂同伴行为上下文的问题。其解决方案的关键在于提出一个三阶段框架:首先利用视觉语言模型(Vision-Language Model, VLM)进行少样本适配以识别学生动作类别;其次采用滑动时间窗技术将2分钟视频分割为非重叠片段并基于VLM输出动作序列;最后借助大语言模型(Large Language Model, LLM)结合课堂上下文对完整动作序列进行分类,判断学生是否处于参与状态。该方法在减少标注数据需求的同时,有效整合了同伴行为信息,提升了参与度识别的准确性。

链接: https://arxiv.org/abs/2601.06394
作者: Ahmed Abdelkawy,Ahmed Elsayed,Asem Ali,Aly Farag,Thomas Tretter,Michael McIntyre
机构: University of Louisville (路易斯维尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding student behavior in the classroom is essential to improve both pedagogical quality and student engagement. Existing methods for predicting student engagement typically require substantial annotated data to model the diversity of student behaviors, yet privacy concerns often restrict researchers to their own proprietary datasets. Moreover, the classroom context, represented in peers’ actions, is ignored. To address the aforementioned limitation, we propose a novel three-stage framework for video-based student engagement measurement. First, we explore the few-shot adaptation of the vision-language model for student action recognition, which is fine-tuned to distinguish among action categories with a few training samples. Second, to handle continuous and unpredictable student actions, we utilize the sliding temporal window technique to divide each student’s 2-minute-long video into non-overlapping segments. Each segment is assigned an action category via the fine-tuned VLM model, generating a sequence of action predictions. Finally, we leverage the large language model to classify this entire sequence of actions, together with the classroom context, as belonging to an engaged or disengaged student. The experimental results demonstrate the effectiveness of the proposed approach in identifying student engagement.
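按摘要描述的三阶段流程,可以写出如下数据流示意(Python;vlm_classify 与 llm_judge 为本示意假设的桩接口,并非论文公开的 API):

```python
from typing import List, Tuple

def sliding_windows(total_sec: int, win_sec: int) -> List[Tuple[int, int]]:
    """把视频按时长切成 [start, end) 形式的非重叠时间窗。"""
    return [(t, min(t + win_sec, total_sec))
            for t in range(0, total_sec, win_sec)]

def vlm_classify(video_path: str, window: Tuple[int, int]) -> str:
    """桩接口:假设由少样本微调后的 VLM 输出该片段的动作类别。"""
    return "writing"

def llm_judge(actions: List[str], context: str) -> str:
    """桩接口:假设由 LLM 结合动作序列与课堂上下文判定参与状态。"""
    ratio = sum(a in {"writing", "raising_hand"} for a in actions) / len(actions)
    return "engaged" if ratio > 0.5 else "disengaged"

windows = sliding_windows(total_sec=120, win_sec=10)   # 2 分钟 → 12 个片段
actions = [vlm_classify("student_01.mp4", w) for w in windows]
context = "多数同伴正在记笔记"                          # 同伴行为上下文(假设)
print(actions[:3], llm_judge(actions, context))
```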
zh

[CV-121] Object-WIPER: Training-Free Object and Associated Effect Removal in Videos

【速读】:该论文旨在解决视频中动态物体及其视觉效应(如遮挡、光影变化等)的去除问题,同时实现语义一致且时间连贯的修复(inpainting)。传统方法通常依赖于大量训练数据或微调,而本文提出一种无需训练的框架Object-WIPER,其关键在于利用预训练的文本到视频扩散变换器(text-to-video diffusion transformer, DiT),通过视觉-文本交叉注意力和视觉自注意力机制定位与目标物体及效应相关的视觉token,生成中间效应掩码并与用户提供的对象掩码融合,得到最终需替换的前景token掩码。在去噪过程中,保留背景token以维持场景保真度,并通过高斯噪声初始化被掩码区域,从而实现高质量、无重训练的动态物体移除与内容重建。

链接: https://arxiv.org/abs/2601.06391
作者: Saksham Singh Kushwaha,Sayan Nag,Yapeng Tian,Kuldeep Kulkarni
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.
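下面以 PyTorch 给出“效应掩码与用户掩码融合、并对被掩码 token 重注高斯噪声”这一步骤的极简示意(非官方实现,attn_effect、latent 等中间量均为本示意的假设):

```python
import torch

def fuse_masks(user_mask: torch.Tensor, attn_effect: torch.Tensor,
               thresh: float = 0.5) -> torch.Tensor:
    """user_mask: [N] 0/1;attn_effect: [N] 注意力得分 → 融合后的前景掩码。"""
    effect_mask = (attn_effect > thresh).float()   # 效应掩码(阈值化注意力)
    return torch.clamp(user_mask + effect_mask, max=1.0)

def reinit_masked_tokens(latent: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """mask==1 的 token 替换为新高斯噪声,其余保留反演得到的结构化噪声。"""
    noise = torch.randn_like(latent)
    m = mask.unsqueeze(-1)                         # [N,1] 广播到通道维
    return m * noise + (1.0 - m) * latent

N, C = 256, 64
latent = torch.randn(N, C)                     # 假设:DiT 反演得到的 token
user_mask = (torch.arange(N) < 32).float()     # 假设:用户给定的物体掩码
attn = torch.rand(N)                           # 假设:效应相关注意力得分
final_mask = fuse_masks(user_mask, attn)
latent = reinit_masked_tokens(latent, final_mask)
print(int(final_mask.sum()), latent.shape)
```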
zh

[CV-122] From Easy to Hard: Promoting Differentially Private Image Synthesis Through Spatial-Frequency Curriculum USENIX-SECURITY2026

【速读】:该论文旨在解决差分隐私(Differentially Private, DP)合成图像质量低的问题,尤其是在隐私预算 ϵ=1\epsilon = 1 下如何提升生成图像的保真度(fidelity)与实用性(utility)。现有方法如DP-FETA通过引入“中心图像”(central images)进行预热训练以改善DP-SGD效果,但其对图像多样性高的数据集效果有限。本文的关键创新在于提出FETA-Pro,其核心是引入频率特征(frequency features)作为“训练捷径”,这类特征在复杂度上介于空间特征(由中心图像捕捉)与完整图像之间,从而实现更细粒度的课程学习(curriculum learning)策略。为协同利用空间特征与频率特征,FETA-Pro设计了一个灵活的生成流水线:先用辅助生成器基于带噪频率特征生成初步图像,再以这些图像结合空间特征和DP-SGD训练主生成器,有效缓解了两类特征间训练不一致的问题。实验表明,FETA-Pro在五个敏感图像数据集上平均提升25.7%的保真度和4.1%的实用性,显著优于当前最优基线。

链接: https://arxiv.org/abs/2601.06368
作者: Chen Gong,Kecen Li,Zinan Lin,Tianhao Wang
机构: University of Virginia (弗吉尼亚大学); Microsoft Research (微软研究院)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Usenix Security 2026; code available at this https URL

点击查看摘要

Abstract:To improve the quality of Differentially private (DP) synthetic images, most studies have focused on improving the core optimization techniques (e.g., DP-SGD). Recently, we have witnessed a paradigm shift that takes these techniques off the shelf and studies how to use them together to achieve the best results. One notable work is DP-FETA, which proposes using 'central images' for 'warming up' the DP training and then using traditional DP-SGD. Inspired by DP-FETA, we are curious whether there are other such tools we can use together with DP-SGD. We first observe that using 'central images' mainly works for datasets where there are many samples that look similar. To handle scenarios where images could vary significantly, we propose FETA-Pro, which introduces frequency features as 'training shortcuts.' The complexity of frequency features lies between that of spatial features (captured by 'central images') and full images, allowing for a finer-grained curriculum for DP training. To incorporate these two types of shortcuts together, one challenge is to handle the training discrepancy between spatial and frequency features. To address it, we leverage the pipeline generation property of generative models (instead of having one model trained with multiple features/objectives, we can have multiple models working on different features, then feed the generated results from one model into another) and use a more flexible design. Specifically, FETA-Pro introduces an auxiliary generator to produce images aligned with noisy frequency features. Then, another model is trained with these images, together with spatial features and DP-SGD. Evaluated across five sensitive image datasets, FETA-Pro shows an average of 25.7% higher fidelity and 4.1% greater utility than the best-performing baseline, under a privacy budget \epsilon = 1 .
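“带噪频率特征”这一训练捷径可以用如下 NumPy 草图来直观理解(仅为示意:裁剪范数与噪声尺度为随意取值,真实实现需按高斯机制由隐私预算 ε/δ 标定,且并非 FETA-Pro 官方代码):

```python
import numpy as np

def noisy_freq_features(images: np.ndarray, k: int = 8,
                        clip_norm: float = 1.0, sigma: float = 0.5) -> np.ndarray:
    """images: [B,H,W] 灰度图 → 聚合后的带噪 k×k 低频幅度特征。"""
    feats = []
    for img in images:
        spec = np.fft.fftshift(np.fft.fft2(img))      # 居中的二维频谱
        h, w = spec.shape
        low = np.abs(spec[h//2 - k//2:h//2 + k//2,
                          w//2 - k//2:w//2 + k//2])    # 取 k×k 低频块
        low = low / max(np.linalg.norm(low), 1e-8) * clip_norm  # L2 裁剪
        feats.append(low)
    mean_feat = np.mean(feats, axis=0)                 # 聚合统计量
    noise = np.random.normal(0.0, sigma * clip_norm, mean_feat.shape)
    return mean_feat + noise                           # 高斯机制式加噪

imgs = np.random.rand(16, 32, 32)
print(noisy_freq_features(imgs).shape)                 # (8, 8)
```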
zh

[CV-123] Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

【速读】:该论文旨在解决扩散变换器(Diffusion Transformers, DiTs)在文本到图像生成任务中难以准确生成物体间空间关系的问题。其解决方案的关键在于采用机制可解释性方法,系统分析不同文本编码器对DiT模型内部信息传递机制的影响:当使用随机初始化的文本嵌入时,模型通过两个交叉注意力头分阶段读取文本提示中的空间关系和单个物体属性;而当使用预训练文本编码器(如T5)时,模型则通过文本token内的信息融合机制,从单一文本token中联合读取空间关系与物体属性信息。这一发现揭示了不同文本编码策略下模型实现空间关系生成的内在机制差异,并指出尽管域内性能相近,但两种机制在域外扰动下的鲁棒性存在显著区别,暗示了真实场景中生成正确空间关系的挑战。

链接: https://arxiv.org/abs/2601.06338
作者: Binxu Wang,Jingxuan Fan,Xu Pan
机构: Kempner Institute, Harvard University (哈佛大学肯普纳研究所); Harvard University (哈佛大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 31 pages, 23 figures

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.
zh

[CV-124] VideoWeave: A Data-Centric Approach for Efficient Video Understanding

【速读】:该论文旨在解决视频-语言模型(video-language models)训练过程中因处理长视频帧序列成本高昂以及标注长视频数据稀缺而导致的数据效率低下问题。解决方案的关键在于提出一种名为VideoWeave的简单但有效的方法,通过将现有数据集中短时长、带字幕的视频片段拼接成合成的长上下文训练样本,从而在固定计算资源下提升时间维度上的多样性与数据利用率,而无需修改模型架构或优化目标。实验表明,在相同计算约束下,使用VideoWeave训练的模型在视频问答任务上性能优于传统微调方法,验证了数据重组策略的有效性。

链接: https://arxiv.org/abs/2601.06309
作者: Zane Durante,Silky Singh,Arpandeep Khatua,Shobhit Agarwal,Reuben Tan,Yong Jae Lee,Jianfeng Gao,Ehsan Adeli,Li Fei-Fei
机构: Stanford University (斯坦福大学); Microsoft Research (微软研究院); University of Wisconsin - Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training video-language models is often prohibitively expensive due to the high cost of processing long frame sequences and the limited availability of annotated long videos. We present VideoWeave, a simple yet effective approach to improve data efficiency by constructing synthetic long-context training samples that splice together short, captioned videos from existing datasets. Rather than modifying model architectures or optimization objectives, VideoWeave reorganizes available video-text pairs to expand temporal diversity within fixed compute. We systematically study how different data composition strategies like random versus visually clustered splicing and caption enrichment affect downstream performance on downstream video question answering. Under identical compute constraints, models trained with VideoWeave achieve higher accuracy than conventional video finetuning. Our results highlight that reorganizing training data, rather than altering architectures, may offer a simple and scalable path for training video-language models. We link our code for all experiments here.
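VideoWeave 的数据重组思路大致可以如下示意(Python;cluster_id 等字段为本示意假设,实际可由视觉特征聚类得到):

```python
import random

def weave(clips, n_per_sample, strategy="random"):
    """clips: [{'frames': [...], 'caption': str, 'cluster_id': int}, ...]"""
    if strategy == "random":
        chosen = random.sample(clips, n_per_sample)
    else:  # "clustered":从同一视觉簇取样,使拼接内容更连贯
        cid = random.choice([c["cluster_id"] for c in clips])
        pool = [c for c in clips if c["cluster_id"] == cid]
        chosen = random.sample(pool, min(n_per_sample, len(pool)))
    frames = [f for c in chosen for f in c["frames"]]          # 帧按顺序拼接
    caption = " 然后,".join(c["caption"] for c in chosen)      # 字幕串联
    return {"frames": frames, "caption": caption}

clips = [{"frames": [f"v{i}_f{j}" for j in range(4)],
          "caption": f"片段{i}的描述", "cluster_id": i % 2} for i in range(6)]
sample = weave(clips, n_per_sample=3, strategy="clustered")
print(len(sample["frames"]), sample["caption"])
```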
zh

[CV-125] Perception Test 2025: Challenge Summary and a Unified VQA Extension

链接: https://arxiv.org/abs/2601.06287
作者: Joseph Heyward,Nikhil Pathasarathy,Tyler Zhu,Aravindh Mahendran,João Carreira,Dima Damen,Andrew Zisserman,Viorica Pătrăucean
机构: Google DeepMind; Princeton University (普林斯顿大学); University of Bristol (布里斯托大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-126] NAS-GS: Noise-Aware Sonar Gaussian Splatting

【速读】:该论文旨在解决水下声呐图像在三维重建(3D reconstruction)与新视角合成(novel view synthesis)中面临的挑战,主要包括声呐图像特有的复杂噪声模式(如旁瓣、斑点噪声和多路径噪声)以及缺乏深度信息的问题。解决方案的关键在于提出了一种名为NAS-GS(Noise-Aware Sonar Gaussian Splatting)的新框架,其核心创新包括:一是双方向采样(Two-Ways Splatting)技术,能够精确建模声呐成像中强度累积与透射率计算的双向特性,显著提升渲染速度而不损失质量;二是基于高斯混合模型(Gaussian Mixture Model, GMM)的噪声建模方法,有效捕捉复杂声呐噪声分布,增强合成图像的真实性并防止3D高斯分布对噪声过拟合,从而提高重建精度。

链接: https://arxiv.org/abs/2601.06285
作者: Shida Xu,Jingqi Jiang,Jonatan Scharff Willners,Sen Wang
机构: Imperial College London (帝国理工学院); Frontier Robotics (前沿机器人); The National Robotarium (国家机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Underwater sonar imaging plays a crucial role in various applications, including autonomous navigation in murky water, marine archaeology, and environmental monitoring. However, the unique characteristics of sonar images, such as complex noise patterns and the lack of elevation information, pose significant challenges for 3D reconstruction and novel view synthesis. In this paper, we present NAS-GS, a novel Noise-Aware Sonar Gaussian Splatting framework specifically designed to address these challenges. Our approach introduces a Two-Ways Splatting technique that accurately models the dual directions for intensity accumulation and transmittance calculation inherent in sonar imaging, significantly improving rendering speed without sacrificing quality. Moreover, we propose a Gaussian Mixture Model (GMM) based noise model that captures complex sonar noise patterns, including side-lobes, speckle, and multi-path noise. This model enhances the realism of synthesized images while preventing 3D Gaussian overfitting to noise, thereby improving reconstruction accuracy. We demonstrate state-of-the-art performance on both simulated and real-world large-scale offshore sonar scenarios, achieving superior results in novel view synthesis and 3D reconstruction.
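“用 GMM 刻画复合声呐噪声”可用如下 NumPy 草图示意(三个分量的权重与参数均为随意取值,实际应从真实数据拟合,例如用 EM 算法;非 NAS-GS 官方实现):

```python
import numpy as np

def sample_gmm_noise(shape, weights, means, stds, seed=0):
    """逐像素从 GMM 采样:先选分量,再从对应高斯采样。"""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=shape, p=weights)
    return rng.normal(np.take(means, comp), np.take(stds, comp))

H, W = 64, 96
clean = np.zeros((H, W))                       # 假设:渲染得到的干净强度图
noise = sample_gmm_noise(
    (H, W),
    weights=[0.7, 0.2, 0.1],                   # 斑点 / 旁瓣 / 多路径(示意)
    means=[0.0, 0.05, 0.15],
    stds=[0.01, 0.03, 0.08],
)
noisy = np.clip(clean + noise, 0.0, 1.0)
print(round(noisy.mean(), 4), round(noisy.std(), 4))
```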
zh

[CV-127] EyeTheia: A Lightweight and Accessible Eye-Tracking Toolbox

【速读】:该论文旨在解决低成本、高可扩展性眼动追踪(gaze estimation)在浏览器端实验平台及真实场景认知与临床研究中的应用难题。其解决方案的关键在于提出一个轻量级且开源的深度学习流水线 EyeTheia,该方案仅依赖标准笔记本摄像头即可实现实时眼动追踪,核心由两部分组成:基于 MediaPipe 的关键点提取模块与受 iTracker 启发的卷积神经网络结构;同时引入用户特定微调策略以进一步降低预测误差。实验表明,在无需校准的情况下,预训练模型与从头训练模型性能相当,而轻量级微调能持续提升精度,验证了该方法在实际任务(如 Dot-Probe 实验)中对商用 SDK 具有良好的一致性与实用性。

链接: https://arxiv.org/abs/2601.06279
作者: Stevenson Pather,Niels Martignène,Arnaud Bugnet,Fouad Boutaleb,Fabien D’Hondt,Deise Santana Maia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code for the EyeTheia gaze-tracking model: this https URL . Experimental platform for the cognitive neuroscience task: this https URL

点击查看摘要

Abstract:We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.
zh

[CV-128] A survey of facial recognition techniques

【速读】:该论文旨在解决人脸识别技术中因光照变化、年龄增长、姿态差异、部分遮挡及面部表情变化等复杂因素导致的识别准确率下降问题。其解决方案的关键在于系统性地综述并分析多种先进方法,包括隐马尔可夫模型(Hidden Markov Models)、主成分分析(Principal Component Analysis, PCA)、弹性聚类图匹配、支持向量机(Support Vector Machine, SVM)、Gabor小波、人工神经网络(Artificial Neural Networks, ANN)、特征脸(Eigenfaces)、独立成分分析(Independent Component Analysis, ICA)以及三维形态模型(3D Morphable Model),并通过在JAFEE、FEI、Yale、LFW、ATT(原称ORL)和AR等多个公开人脸数据库上的实验验证这些方法的有效性与适用场景,从而为构建鲁棒性强、适应性广的面部识别机制提供理论依据和技术路径。

链接: https://arxiv.org/abs/2601.06239
作者: Aya Kaysan Bahjat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 12 pages, 12 figures, article

点击查看摘要

Abstract:As multimedia content is quickly growing, the field of facial recognition has become one of the major research fields, particularly in recent years. The most problematic area for researchers in image processing and computer vision is the human face, which is a complex object with myriads of distinctive features that can be used to identify it. This survey is particularly focused on the most challenging facial characteristics, including differences in lighting, ageing, variation in poses, partial occlusion, and facial expression, and presents methodological solutions. These factors, therefore, are inevitable in the creation of effective facial recognition mechanisms used on facial images. This paper reviews the most sophisticated methods of facial detection, which are Hidden Markov Models, Principal Component Analysis (PCA), Elastic Cluster Plot Matching, Support Vector Machine (SVM), Gabor Waves, Artificial Neural Networks (ANN), Eigenfaces, Independent Component Analysis (ICA), and 3D Morphable Model. Alongside the works mentioned above, we have also analyzed the images of a number of facial databases, namely JAFEE, FEI, Yale, LFW, ATT (then called ORL), and AR (created by Martinez and Benavente). Overall, this survey aims to give a thorough literature review of face recognition and its applications; some experimental results are provided at the end after a detailed discussion.
zh

[CV-129] Synthetic FMCW Radar Range Azimuth Maps Augmentation with Generative Diffusion Model

【速读】:该论文旨在解决自动驾驶环境中深度学习模型因高质量标注雷达数据稀缺且多样性不足而导致的环境感知性能受限问题。其解决方案的关键在于提出一种基于条件生成扩散模型(conditional generative diffusion model)的雷达信号合成框架,通过引入置信度图(Confidence Maps)作为条件输入,实现对行人、车辆和骑行者等多类目标的物理合理且多样化的调频连续波(Frequency-Modulated Continuous-Wave, FMCW)雷达距离-方位图(Range-Azimuth Maps)生成;同时,为适配雷达特性,创新性地融合几何感知条件(Geometry Aware Conditioning)与时间一致性正则化(Temporal Consistency Regularization),显著提升合成数据的真实性和时序稳定性,从而在ROD2021数据集上使峰值信噪比(PSNR)提升3.6 dB,并在真实与合成数据联合训练下将平均精度均值(mean Average Precision)提高4.15%,有效增强下游任务的泛化能力。

链接: https://arxiv.org/abs/2601.06228
作者: Zhaoze Wang,Changxu Zhang,Tai Fei,Christopher Grimm,Yi Jin,Claas Tebruegge,Ernst Warsitz,Markus Gardill
机构: 1. Google(谷歌); 2. Meta; 3. Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The scarcity and low diversity of well-annotated automotive radar datasets often limit the performance of deep-learning-based environmental perception. To overcome these challenges, we propose a conditional generative framework for synthesizing realistic Frequency-Modulated Continuous-Wave radar Range-Azimuth Maps. Our approach leverages a generative diffusion model to generate radar data for multiple object categories, including pedestrians, cars, and cyclists. Specifically, conditioning is achieved via Confidence Maps, where each channel represents a semantic class and encodes Gaussian-distributed annotations at target locations. To address radar-specific characteristics, we incorporate Geometry Aware Conditioning and Temporal Consistency Regularization into the generative process. Experiments on the ROD2021 dataset demonstrate that signal reconstruction quality improves by 3.6 dB in Peak Signal-to-Noise Ratio over baseline methods, while training with a combination of real and synthetic datasets improves overall mean Average Precision by 4.15% compared with conventional image-processing-based augmentation. These results indicate that our generative framework not only produces physically plausible and diverse radar spectra but also substantially improves model generalization in downstream tasks.
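摘要中的“置信度图”条件输入可以这样构造(NumPy 示意草图,类别与坐标为玩具数据,非官方实现):

```python
import numpy as np

def render_confidence_maps(targets, classes, h, w, sigma=2.0):
    """targets: [(类别名, 行, 列), ...] → [C,H,W] 条件张量,每通道一个类别。"""
    maps = np.zeros((len(classes), h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for name, r, c in targets:
        ch = classes.index(name)
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
        maps[ch] = np.maximum(maps[ch], g)     # 同类多目标取逐像素最大
    return maps

classes = ["pedestrian", "car", "cyclist"]
cond = render_confidence_maps([("car", 40, 20), ("pedestrian", 10, 50)],
                              classes, h=64, w=64)
print(cond.shape, float(cond.max()))           # (3, 64, 64) ≈ 1.0
```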
zh

[CV-130] Ground What You See: Hallucination-Resistant MLLM s via Caption Feedback Diversity-Aware Sampling and Conflict Regularization AAAI-2026

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在强化学习(Reinforcement Learning, RL)优化过程中出现的幻觉(hallucination)问题,其根源在于三个关键因素:过度依赖链式视觉推理导致初始描述错误锚定后续推理、策略优化中探索多样性不足引发过度自信的错误输出,以及训练样本间因神经切线核(Neural Tangent Kernel, NTK)相似性引发的破坏性冲突。解决方案的核心在于构建一个包含三个模块的综合框架:首先通过引入规划与标注阶段并结合基于质量的标注奖励提升视觉定位准确性;其次根据奖励分布的均值与方差对样本进行分类,优先选择高方差样本以增强探索多样性;最后通过分组样本对并施加InfoNCE损失来调节NTK相似性,从而缓解样本干扰并稳定梯度更新方向。实验表明,该方法显著降低了幻觉率并提升了MLLMs的推理准确性。

链接: https://arxiv.org/abs/2601.06224
作者: Miao Pan,Wangjie Gan,Jintao Chen,Wenqi Zhang,Bing Sun,Jianwei Yin,Xuhong Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学); 3. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI-2026 Poster

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse tasks, their practical deployment is severely hindered by hallucination issues, which become particularly acute during Reinforcement Learning (RL) optimization. This paper systematically analyzes the root causes of hallucinations in MLLMs under RL training, identifying three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions or redundant information anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where Neural Tangent Kernel (NTK) similarity causes false associations and unstable parameter updates. To address these challenges, we propose a comprehensive framework comprising three core modules. First, we enhance visual localization by introducing dedicated planning and captioning stages before the reasoning phase, employing a quality-based caption reward to ensure accurate initial anchoring. Second, to improve exploration, we categorize samples based on the mean and variance of their reward distributions, prioritizing samples with high variance to focus the model on diverse and informative data. Finally, to mitigate sample interference, we regulate NTK similarity by grouping sample pairs and applying an InfoNCE loss to push overly similar pairs apart and pull dissimilar ones closer, thereby guiding gradient interactions toward a balanced range. Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
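第三个模块所用的 InfoNCE 损失是标准构件,下面给出其最小 PyTorch 实现以说明“推开/拉近样本对”的机制(样本分组与 NTK 相似度估计不在本示意范围内):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.1):
    """anchor/positive: [B,D];negatives: [B,K,D] → 标量损失。"""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos = (a * p).sum(-1, keepdim=True) / tau          # [B,1] 正样本相似度
    neg = torch.einsum("bd,bkd->bk", a, n) / tau       # [B,K] 负样本相似度
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(a.size(0), dtype=torch.long)  # 正样本位于第 0 列
    return F.cross_entropy(logits, labels)

B, K, D = 8, 15, 128
loss = info_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
print(loss.item())
```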
zh

[CV-131] SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation Localization

【速读】:该论文旨在解决恶意图像篡改(malicious image manipulation)的精准定位问题,现有方法依赖昂贵的像素级标注或仅使用图像级二分类标签,难以有效捕捉篡改区域的局部边缘特征。其解决方案的关键在于提出语义无关提示学习(Semantic-Agnostic Prompt Learning, SAPL),通过在CLIP模型中引入边界感知的文本提示机制,使多模态相似度聚焦于篡改边缘而非高层语义信息。SAPL包含两个互补模块:边缘感知上下文提示学习(Edge-aware Contextual Prompt Learning, ECPL)利用边缘增强的图像特征生成可学习文本提示,嵌入非语义边缘线索;以及分层边缘对比学习(Hierarchical Edge Contrastive Learning, HECL),通过对比真实与篡改边缘块提升边缘判别能力,最终基于相似度图实现高精度篡改区域预测。

链接: https://arxiv.org/abs/2601.06222
作者: Xinghao Wang,Changtao Miao,Dianmo Sheng,Tao Gong,Qi Chu,Nenghai Yu,Quanchen Zou,Deyue Zhang,Xiangzheng Zhang
机构: University of Science and Technology of China(中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Malicious image manipulation threatens public safety and requires efficient localization methods. Existing approaches depend on costly pixel-level annotations, which make training expensive. Existing weakly supervised methods rely only on image-level binary labels and focus on global classification, often overlooking local edge cues that are critical for precise localization. We observe that feature variations at manipulated boundaries are substantially larger than in interior regions. To address this gap, we propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues so that CLIP's multimodal similarity highlights manipulation edges rather than high-level object semantics. SAPL combines two complementary modules, Edge-aware Contextual Prompt Learning (ECPL) and Hierarchical Edge Contrastive Learning (HECL), to exploit edge information in both textual and visual spaces. The proposed ECPL leverages edge-enhanced image features to generate learnable textual prompts via an attention mechanism, embedding semantic-irrelevant information into text features to guide CLIP to focus on manipulation edges. The proposed HECL extracts genuine and manipulated edge patches and utilizes contrastive learning to boost the discrimination between them. Finally, we predict the manipulated regions from the similarity map after processing. Extensive experiments on multiple public benchmarks demonstrate that SAPL significantly outperforms existing approaches, achieving state-of-the-art localization performance.
zh

[CV-132] Two-step Authentication: Multi-biometric System Using Voice and Facial Recognition

链接: https://arxiv.org/abs/2601.06218
作者: Kuan Wei Chen,Ting Yi Lin,Wen Ren Yang,Aryan Kesarwani,Riya Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted manuscript (author version, v2). The published version appears in IET Conference Proceedings; see DOI: https://doi.org/10.1049/icp.2024.4141 . Code: this https URL

点击查看摘要

[CV-133] Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architecture

【速读】:该论文旨在解决现有生成式 AI(Generative AI)模型在视频预测与视觉合成中面临的时空一致性差、推理效率低以及物理规律难以建模的问题。其核心解决方案在于提出 Akasha 2 架构,通过将哈密顿状态空间对偶性(Hamiltonian State Space Duality, H-SSD)与视觉-语言联合嵌入预测架构(Visual-Language Joint Embedding Predictive Architecture, VL-JEPA)融合,并引入稀疏哈密顿专家混合(Sparse Mixture of Hamiltonian Experts, SMoE-HE)以强制隐空间满足物理守恒律,同时结合哈密顿流匹配(Hamiltonian Flow Matching, HFM)和持续3D高斯点绘制(persistent 3D Gaussian Splatting, 3DGS),实现了超低延迟(50ms)的移动端视觉合成与前所未有的时空连贯性,显著优于传统Transformer基线模型。

链接: https://arxiv.org/abs/2601.06212
作者: Yani Meziani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures, 3 tables. Includes appendices with pseudocode and implementation details. Supplementary materials eventually at this http URL

点击查看摘要

Abstract:We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.
zh

[CV-134] When Imbalance Comes Twice: Active Learning under Simulated Class Imbalance and Label Shift in Binary Semantic Segmentation

链接: https://arxiv.org/abs/2601.06209
作者: Julien Combes(SVH),Alexandre Derville(Michelin),Jean-François Coeurjolly(SVH)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-135] Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification

【速读】:该论文旨在解决动态视觉环境中智能异常检测中实时性与语义可解释性难以兼顾的问题。传统方法仅能应对该挑战的部分维度:基于重建的模型虽能捕捉低级差异但缺乏上下文推理能力,目标检测器虽具备高速响应但语义表达有限,而大型视觉-语言系统虽提供高可解释性却带来高昂计算开销。解决方案的关键在于提出一种级联多智能体框架(cascading multi-agent framework),通过早期模块进行重建门控过滤和对象级评估,仅在语义模糊事件时激活高层推理智能体,并结合自适应升级阈值与发布-订阅通信机制,实现异步协调与异构硬件上的可扩展部署,从而在显著降低延迟(较直接视觉-语言推理减少三倍)的同时保持高感知保真度(PSNR = 38.3 dB, SSIM = 0.965)和一致的语义标注,最终实现了早期退出效率、自适应多智能体推理与可解释异常归因的统一。

链接: https://arxiv.org/abs/2601.06204
作者: Tayyab Rehman,Giovanni De Gasperis,Aly Shmahell
机构: University of L’Aquila (拉奎拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
zh

[CV-136] QwenStyle: Content-Preserving Style Transfer with Qwen-Image-Edit

【速读】:该论文旨在解决扩散Transformer(Diffusion Transformers, DiTs)在内容保留型风格迁移(Content-Preserving Style Transfer)任务中因内部内容与风格特征耦合而导致性能受限的问题。其解决方案的关键在于提出首个基于Qwen-Image-Edit训练的内容保留风格迁移模型QwenStyle V1,通过收集并筛选特定风格的高质量数据,结合野外风格图像合成的三元组样本,引入课程持续学习(Curriculum Continual Learning)框架来处理干净与噪声三元组混合的数据分布,从而在不损害精确内容保留能力的前提下实现对未见风格的泛化。

链接: https://arxiv.org/abs/2601.06202
作者: Shiwen Zhang,Haibin Huang,Chi Zhang,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The codes and models are released at this https URL

点击查看摘要

Abstract:Content-Preserving Style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to their internally entangled content and style features. In this technical report, we propose the first content-preserving style transfer model trained on Qwen-Image-Edit, which activates Qwen-Image-Edit's strong content preservation and style customization capability. We collected and filtered high-quality data of a limited set of specific styles and synthesized triplets with thousands of categories of in-the-wild style images. We introduce the Curriculum Continual Learning framework to train QwenStyle with such a mixture of clean and noisy triplets, which enables QwenStyle to generalize to unseen styles without degradation of the precise content preservation capability. Our QwenStyle V1 achieves state-of-the-art performance in three core metrics: style similarity, content consistency, and aesthetic quality.
zh

[CV-137] Leveraging Membership Inference Attacks for Privacy Measurement in Federated Learning for Remote Sensing Images

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在遥感图像分类应用中可能存在的隐私泄露问题,即尽管FL通过将训练数据本地化来保护隐私,但其模型输出仍可能暴露敏感信息。解决方案的关键在于引入成员推理攻击(Membership Inference Attack, MIA)作为量化隐私泄露的评估框架,并系统性地测试多种黑盒MIA方法(包括基于熵的攻击、改进熵攻击和似然比攻击),在不同FL算法与通信策略下进行实验验证。结果表明,MIA能够有效识别出仅凭模型准确率无法捕捉的隐私风险,且通信高效的FL策略可在保持性能的同时显著降低MIA成功率,从而证明MIA作为隐私度量指标的实用性,并强调将隐私评估集成到FL系统设计中的必要性。

链接: https://arxiv.org/abs/2601.06200
作者: Anh-Kiet Duong,Petra Gomez-Krämer,Hoàng-Ân Lê,Minh-Tan Pham
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training while keeping training data localized, allowing us to preserve privacy in various domains including remote sensing. However, recent studies show that FL models may still leak sensitive information through their outputs, motivating the need for rigorous privacy evaluation. In this paper, we leverage membership inference attacks (MIA) as a quantitative privacy measurement framework for FL applied to remote sensing image classification. We evaluate multiple black-box MIA techniques, including entropy-based attacks, modified entropy attacks, and the likelihood ratio attack, across different FL algorithms and communication strategies. Experiments conducted on two public scene classification datasets demonstrate that MIA effectively reveals privacy leakage not captured by accuracy alone. Our results show that communication-efficient FL strategies reduce MIA success rates while maintaining competitive performance. These findings confirm MIA as a practical metric and highlight the importance of integrating privacy measurement into FL system design for remote sensing applications.
zh

[CV-138] How Does India Cook Biryani?

【速读】:该论文旨在解决现有视频理解方法难以捕捉烹饪视频中细粒度、多模态及文化背景差异的问题,尤其针对印度代表性菜肴“biryani”(比尔亚尼)在不同地区制备过程中的多样性进行系统性计算分析。其解决方案的关键在于构建首个大规模、人工精标的大米料理制作视频数据集(包含12种区域风格的120个高质量YouTube视频),并提出一个多层次框架:利用视觉-语言模型(Vision-Language Models, VLMs)将视频分割为细粒度操作单元,并与音频转录文本和标准食谱文本对齐;在此基础上设计视频对比管道以自动识别并解释区域变体间的程序差异;同时建立多层级问答(QA)基准用于评估VLM在程序理解上的表现。该方法通过多VLM协同工作、人机协同验证高精度任务,以及在零样本和微调场景下对比多个前沿模型,为结构化多模态推理任务提供新的评测基准,并推动基于烹饪视频的文化遗产计算分析研究。

链接: https://arxiv.org/abs/2601.06198
作者: Shubham Goel,Farzana S,C V Rishi,Aditya Arun,C V Jawahar
机构: IIIT Hyderabad(印度国际信息技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Biryani, one of India’s most celebrated dishes, exhibits remarkable regional diversity in its preparation, ingredients, and presentation. With the growing availability of online cooking videos, there is unprecedented potential to study such culinary variations using computational tools systematically. However, existing video understanding methods fail to capture the fine-grained, multimodal, and culturally grounded differences in procedural cooking videos. This work presents the first large-scale, curated dataset of biryani preparation videos, comprising 120 high-quality YouTube recordings across 12 distinct regional styles. We propose a multi-stage framework leveraging recent advances in vision-language models (VLMs) to segment videos into fine-grained procedural units and align them with audio transcripts and canonical recipe text. Building on these aligned representations, we introduce a video comparison pipeline that automatically identifies and explains procedural differences between regional variants. We construct a comprehensive question-answer (QA) benchmark spanning multiple reasoning levels to evaluate procedural understanding in VLMs. Our approach employs multiple VLMs in complementary roles, incorporates human-in-the-loop verification for high-precision tasks, and benchmarks several state-of-the-art models under zero-shot and fine-tuned settings. The resulting dataset, comparison methodology, and QA benchmark provide a new testbed for evaluating VLMs on structured, multimodal reasoning tasks and open new directions for computational analysis of cultural heritage through cooking videos. We release all data, code, and the project website at this https URL.
zh

[CV-139] A Unified Attention U-Net Framework for Cross-Modality Tumor Segmentation in MRI and CT

【速读】:该论文旨在解决跨模态肿瘤分割中单一模型在不同成像方式(如MRI与CT)和不同解剖部位上的泛化能力问题,即如何构建一个不依赖模态特异性编码器或域适应技术的统一深度学习框架。其解决方案的关键在于提出了一种联合训练的Attention U-Net架构,结合了模态和谐预处理、注意力门控跳跃连接以及模态感知的Focal Tversky损失函数,从而实现对MRI(BraTS 2021)和CT(LIDC-IDRI)数据集上肿瘤分割任务的高性能统一建模。

链接: https://arxiv.org/abs/2601.06187
作者: Nishan Rai,Pushpa R. Dahal
机构: New Mexico State University (新墨西哥州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:This study presents a unified Attention U-Net architecture trained jointly on MRI (BraTS 2021) and CT (LIDC-IDRI) datasets to investigate the generalizability of a single model across diverse imaging modalities and anatomical sites. Our proposed pipeline incorporates modality-harmonized preprocessing, attention-gated skip connections, and a modality-aware Focal Tversky loss function. To the best of our knowledge, this study is among the first to evaluate a single Attention U-Net trained simultaneously on separate MRI (BraTS) and CT (LIDC-IDRI) tumor datasets, without relying on modality-specific encoders or domain adaptation. The unified model demonstrates competitive performance in terms of Dice coefficient, IoU, and AUC on both domains, thereby establishing a robust and reproducible baseline for future research in cross-modality tumor segmentation.
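摘要提到的 Focal Tversky 损失可按其标准定义写成如下 PyTorch 草图;“模态感知”此处仅以按 MRI/CT 选用不同超参来示意,具体取值为假设,并非论文原始配置:

```python
import torch

def focal_tversky(pred, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-6):
    """pred/target: [B,H,W],pred 为 sigmoid 后的前景概率图。"""
    p, t = pred.flatten(1), target.flatten(1)
    tp = (p * t).sum(1)                       # 软真阳性
    fp = (p * (1 - t)).sum(1)                 # 软假阳性
    fn = ((1 - p) * t).sum(1)                 # 软假阴性
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return ((1.0 - tversky) ** gamma).mean()  # focal 项放大难例

# “模态感知”示意:按样本模态选用不同超参(取值为假设)
HPARAMS = {"mri": (0.7, 0.3, 0.75), "ct": (0.6, 0.4, 1.0)}

pred = torch.sigmoid(torch.randn(4, 64, 64))
target = (torch.rand(4, 64, 64) > 0.9).float()
a, b, g = HPARAMS["mri"]
print(focal_tversky(pred, target, a, b, g).item())
```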
zh

[CV-140] TIR-Flow: Active Video Search and Reasoning with Frozen VLMs

【速读】:该论文旨在解决大型视频语言模型(Video-LLMs)在视觉推理能力上的瓶颈问题,即现有方法依赖于大规模Chain-of-Thought(CoT)数据集的“数据工程”范式(包括监督微调SFT和强化学习RL),虽能优化概率采样效率与输出分布对齐,却难以激发模型进行动态视觉探索所需的内在智能。其解决方案的关键在于提出TIR-Flow框架,该框架通过三个协同模块实现无需额外数据或参数更新的主动视频搜索与推理:HDD模块将复杂查询分解为可验证的子任务;HAP模块主动引导视觉注意力以获取高分辨率证据用于假设验证;EBA模块维护一个持久的工作空间来累积并更新发现的线索以支持逻辑推理。实验证明,TIR-Flow显著优于多个强基线模型,在7个基准测试中平均提升5.9%,在Egoschema上最高达10.5%,验证了赋予冻结的视频语言模型(VLMs)类似System-2的主动感知能力是实现长时程视频推理的可行且可扩展路径。

链接: https://arxiv.org/abs/2601.06176
作者: Hongbo Jin,Siyi Xie,Jiayu Ding,Kuanwei Lin,Ge Li
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Large Video-Language Models (Video-LLMs) have achieved remarkable progress in perception, their reasoning capabilities remain a bottleneck. Existing solutions typically resort to a heavy “data engineering” paradigm-synthesizing large-scale Chain-of-Thought (CoT) datasets followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). This pipeline primarily optimizes probability sampling efficiency and aligns output distributions, but fails to activate the intrinsic intelligence required for dynamic visual exploration. In this work, we propose TIR-Flow, a novel framework that shifts the paradigm from passive processing to active video searching and reasoning without additional data or parameter updating. Concretely, our framework operates through three synergistic modules: HDD decomposes complex queries into a set of verifiable sub-tasks; HAP actively directs visual attention to gather high-resolution evidence for hypothesis validation; EBA maintains a persistent workspace to accumulate and update the discovered clues for logical reasoning. Extensive experiments on seven benchmarks demonstrate that TIR-Flow significantly outperforms recent strong baselines, delivering an average performance boost of 5.9%, with gains reaching 10.5% on Egoschema. Our analysis confirms that empowering frozen VLMs with System-2-like active perception is a scalable path toward solving long-horizon video reasoning.
zh

[CV-141] Think Bright Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding ACL2026

【速读】:该论文旨在解决文本到图像上下文学习(Text-to-Image In-Context Learning, T2I-ICL)中存在的两个相互强化的瓶颈问题:合规性失败(compliance failure)和先验主导的幻觉(prior-dominated hallucination),二者共同构成恶性循环,显著降低生成质量。现有方法依赖定制化训练,灵活性差且部署成本高。解决方案的关键在于提出一种无需训练的框架TBDN,其核心是集成两种互补的闭环机制:提示指令(Hint Instruction, HI)通过轻量级提示工程注入任务感知归纳偏置,锚定模型对上下文映射规则的理解,从而缓解合规性失败;查询对比解码(Query Contrastive Decoding, QCD)通过对比完整输入与查询缺失时的解码分布,调整语言模型的输出分布,抑制先验主导的幻觉。TBDN有效打破两大瓶颈,在多个基准上实现SOTA性能,并展现出对模型架构、提示设计和超参数的鲁棒泛化能力。

链接: https://arxiv.org/abs/2601.06169
作者: Zhiyong Ma,Zhenpeng Li,Yuanjie Shi,Zhengping Li,Jiahao Chen,Qingyuan Chuai
机构: Cao Tu Li (Guangzhou) Technology Co., Ltd, Guangzhou, Guangdong, China; South China University of Technology, Guangzhou, Guangdong, China; Washington State University, Pullman, Washington State, United States of America; Hong Kong Baptist University, Kowloon, Hong Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ACL 2026

点击查看摘要

Abstract:Text-to-Image In-Context Learning (T2I-ICL) enables customized image synthesis via interleaved text-image examples but faces two mutually reinforcing bottlenecks, compliance failure and prior-dominated hallucination, that form a vicious cycle degrading generation quality. Existing methods rely on tailored training, which limits flexibility and raises deployment costs. To address these challenges effectively, we propose TBDN, a training-free framework integrating two complementary closed-loop mechanisms: Hint Instruction (HI) and Query Contrastive Decoding (QCD). HI injects task-aware inductive bias via lightweight prompt engineering to anchor models on contextual mapping rules, thereby mitigating compliance failure. QCD adjusts the decoding distributions of language models by contrasting full-input and query-omitted distributions, suppressing prior-dominated hallucination. TBDN achieves State-of-the-Art performance on CoBSAT and Text-to-Image Fast Mini-ImageNet, with robust generalization across model backbones, prompt designs, and hyperparameters. It also maintains promising performance in concept preservation and prompt following on Dreambench++. By breaking the two bottlenecks, TBDN establishes a simple yet effective framework for efficient and reliable T2I-ICL.
zh

[CV-142] Analyzing the Structure of Handwritten Digits: A Comparative Study of PCA Factor Analysis and UMAP

【速读】:该论文旨在揭示手写数字图像在高维像素空间中的潜在组织结构,解决如何从不同统计视角理解MNIST数据集内在低维流形特征的问题。其解决方案的关键在于采用三种互补的降维技术:主成分分析(Principal Component Analysis, PCA)用于识别主导全局方差方向并实现高保真重建;因子分析(Factor Analysis, FA)将数字分解为可解释的隐变量书写基元(如笔画、环形和对称性);均匀流形近似与投影(Uniform Manifold Approximation and Projection, UMAP)则揭示了反映数字类别间平滑风格过渡的非线性流形结构。三者共同表明,手写数字存在于一个具有结构性的低维流形上,且不同统计框架能揭示该结构的不同维度。

链接: https://arxiv.org/abs/2601.06168
作者: Jyotiraditya Gupta
机构: University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:Handwritten digit images lie in a high-dimensional pixel space but exhibit strong geometric and statistical structure. This paper investigates the latent organization of handwritten digits in the MNIST dataset using three complementary dimensionality reduction techniques: Principal Component Analysis (PCA), Factor Analysis (FA), and Uniform Manifold Approximation and Projection (UMAP). Rather than focusing on classification accuracy, we study how each method characterizes intrinsic dimensionality, shared variation, and nonlinear geometry. PCA reveals dominant global variance directions and enables high-fidelity reconstructions using a small number of components. FA decomposes digits into interpretable latent handwriting primitives corresponding to strokes, loops, and symmetry. UMAP uncovers nonlinear manifolds that reflect smooth stylistic transitions between digit classes. Together, these results demonstrate that handwritten digits occupy a structured low-dimensional manifold and that different statistical frameworks expose complementary aspects of this structure.
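论文的三类降维对比可以用几行脚本复现其流程骨架(示意草图:以 scikit-learn 自带的 8×8 digits 数据集代替 MNIST,且假设已安装 umap-learn):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, FactorAnalysis
import umap  # 需安装:pip install umap-learn

X, y = load_digits(return_X_y=True)   # 8x8 手写数字,MNIST 的轻量替代

# PCA:主导全局方差方向 + 低维重建
pca = PCA(n_components=30).fit(X)
print("前 30 个主成分累计解释方差:", pca.explained_variance_ratio_.sum())
X_rec = pca.inverse_transform(pca.transform(X))

# FA:可解释的隐变量“书写基元”(载荷矩阵每一行对应一个基元)
fa = FactorAnalysis(n_components=10).fit(X)
print("FA 载荷矩阵形状:", fa.components_.shape)     # (10, 64)

# UMAP:非线性流形结构,可视化后可见类别间的平滑风格过渡
emb = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
print("UMAP 二维嵌入形状:", emb.shape)
```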
zh

[CV-143] B-FIRE: Binning-Free Diffusion Implicit Neural Representation for Hyper-Accelerated Motion-Resolved MRI

【速读】:该论文旨在解决加速动态容积磁共振成像(4DMRI)中因欠采样导致的瞬时动态信息丢失问题,尤其是传统方法在平均呼吸相位时产生的伪影会模糊和失真真实运动信息。其解决方案的关键在于提出一种无分箱扩散隐式神经表示框架(B-FIRE),通过结合卷积神经网络与隐式神经表示(INR)的编码器-解码器结构,并利用扩散模型优化过程,同时引入图像域保真度和频域感知约束损失函数,在无需对运动进行分箱处理的情况下,实现超加速非笛卡尔k空间数据的高质量重建,从而准确还原瞬时三维腹部解剖结构。

链接: https://arxiv.org/abs/2601.06166
作者: Di Xu,Hengjie Liu,Yang Yang,Mary Feng,Jin Ning,Xin Miao,Jessica E. Scholey,Alexandra E. Hotca-cho,William C. Chen,Michael Ohliger,Martina Descovich,Huiming Dong,Wensha Yang,Ke Sheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accelerated dynamic volumetric magnetic resonance imaging (4DMRI) is essential for applications relying on motion resolution. Existing 4DMRI produces acceptable artifacts of averaged breathing phases, which can blur and misrepresent instantaneous dynamic information. Recovery of such information requires a new paradigm to reconstruct extremely undersampled non-Cartesian k-space data. We propose B-FIRE, a binning-free diffusion implicit neural representation framework for hyper-accelerated MR reconstruction capable of reflecting instantaneous 3D abdominal anatomy. B-FIRE employs a CNN-INR encoder-decoder backbone optimized using diffusion with a comprehensive loss that enforces image-domain fidelity and frequency-aware constraints. Motion binned image pairs were used as training references, while inference was performed on binning-free undersampled data. Experiments were conducted on a T1-weighted StarVIBE liver MRI cohort, with accelerations ranging from 8 spokes per frame (RV8) to RV1. B-FIRE was compared against direct NuFFT, GRASP-CS, and an unrolled CNN method. Reconstruction fidelity, motion trajectory consistency, and inference latency were evaluated.
zh

[CV-144] What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

链接: https://arxiv.org/abs/2601.06165
作者: Dasol Choi,Guijin Son,Hanwool Lee,Minhyuk Kim,Hyunwoo Ko,Teabin Lim,Ahn Eungyeol,Jungwhan Kim,Seunghyeok Hong,Youngsook Song
机构: AIM Intelligence; Yonsei University (延世大学); OneLineAI; Korea University (韩国科学技术院); Doodlin Corp.; NAVER Cloud; Hankuk University of Foreign Studies (韩国外国语大学); Lablup Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-145] Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

【速读】:该论文旨在解决多概念遗忘(multi-concept unlearning)问题,即在预训练文本到图像(text-to-image, T2I)扩散模型中,如何高效、可靠地移除多个不希望保留的概念(如版权内容或敏感图像),而无需从头重新训练模型。现有方法在单概念遗忘上表现良好,但在多概念场景下效果受限,存在遗忘效果差、生成质量下降及对超参数和数据集敏感等问题。其解决方案的关键在于提出 Forget It All (FIA) 框架,该框架通过引入对比概念显著性(Contrastive Concept Saliency)量化权重连接对目标概念的贡献,并结合时空信息识别概念敏感神经元(Concept-Sensitive Neurons),最终构建统一的多概念掩码,保留对通用内容生成有贡献的概念无关神经元(Concept-Agnostic Neurons),同时剪枝特定概念神经元以实现精准遗忘。FIA 不依赖训练,仅需少量超参数调优,具备即插即用特性,显著提升了多概念遗忘的可靠性与生成质量。

链接: https://arxiv.org/abs/2601.06163
作者: Kaiyuan Deng,Bo Hui,Gen Li,Jie Ji,Minghai Qin,Geng Yuan,Xiaolong Ma
机构: The University of Arizona (亚利桑那大学); Clemson University (克莱姆森大学); The University of Tulsa (图拉大学); University of Georgia (佐治亚大学); Western Digital Corporation (西部数据公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The widespread adoption of text-to-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery learned from massive training corpora. As a practical solution, machine unlearning aims to selectively erase unwanted concepts from a pre-trained model without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle in real-world scenarios that require removing multiple concepts, since extending them to this setting is both non-trivial and problematic, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. In this paper, we take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection’s contribution to a target concept. It then identifies Concept-Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept-Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires only minimal hyperparameter tuning for new tasks, thereby promoting a plug-and-play paradigm. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining semantic fidelity and image quality.
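“概念敏感神经元识别 + 多概念掩码融合”的骨架可如下示意(PyTorch 玩具代码:显著性用目标概念与对照提示下的激活差近似,top-k 与激活张量均为假设,非 FIA 官方实现):

```python
import torch

def concept_saliency(act_concept: torch.Tensor,
                     act_control: torch.Tensor) -> torch.Tensor:
    """[T,N](T 个样本/时间步 × N 个神经元)→ 每个神经元的显著性。"""
    return (act_concept - act_control).abs().mean(0)

def build_mask(saliency: torch.Tensor, k: int) -> torch.Tensor:
    """显著性 top-k 的神经元置 0(剪除),其余保留为 1。"""
    mask = torch.ones_like(saliency)
    mask[saliency.topk(k).indices] = 0.0
    return mask

N, T = 1024, 32
masks = []
for _ in range(3):                            # 三个待遗忘概念(玩具数据)
    sal = concept_saliency(torch.randn(T, N), torch.randn(T, N))
    masks.append(build_mask(sal, k=20))
unified = torch.stack(masks).prod(dim=0)      # 掩码融合:任一概念命中即剪除
print("统一掩码剪除的神经元数:", int((unified == 0).sum()))
```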
zh

[CV-146] Forget Many Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

【速读】:该论文旨在解决大规模文本到图像扩散模型中多概念遗忘(multi-concept unlearning)的三大挑战:(i) 权重更新冲突导致遗忘失效或生成质量下降;(ii) 机制不精确引发对相似内容的误删( collateral damage);(iii) 依赖额外数据或子模块造成可扩展性瓶颈。解决方案的关键在于提出一种统一框架 ScaPre(Scalable-Precise Concept Unlearning),其核心创新包括:(1) 引入冲突感知的稳定设计,结合谱迹正则化(spectral trace regularization)与几何对齐(geometry alignment),以稳定优化过程、抑制权重冲突并保持全局结构;(2) 设计 Informax 解耦器(Informax Decoupler),识别与目标概念相关的参数并自适应重加权更新,严格限定遗忘操作在目标子空间内,从而实现高精度且无需辅助数据或子模型的闭式解。

链接: https://arxiv.org/abs/2601.06162
作者: Kaiyuan Deng,Gen Li,Yang Xiao,Bo Hui,Xiaolong Ma
机构: The University of Arizona (亚利桑那大学); Clemson University (克莱姆森大学); The University of Tulsa (塔尔萨大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models have achieved remarkable progress, yet their use raises copyright and misuse concerns, prompting research into machine unlearning. However, extending multi-concept unlearning to large-scale scenarios remains difficult due to three challenges: (i) conflicting weight updates that hinder unlearning or degrade generation; (ii) imprecise mechanisms that cause collateral damage to similar content; and (iii) reliance on additional data or modules, creating scalability bottlenecks. To address these, we propose Scalable-Precise Concept Unlearning (ScaPre), a unified framework tailored for large-scale unlearning. ScaPre introduces a conflict-aware stable design, integrating spectral trace regularization and geometry alignment to stabilize optimization, suppress conflicts, and preserve global structure. Furthermore, an Informax Decoupler identifies concept-relevant parameters and adaptively reweights updates, strictly confining unlearning to the target subspace. ScaPre yields an efficient closed-form solution without requiring auxiliary data or sub-models. Comprehensive experiments on objects, styles, and explicit content demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It forgets up to 5× more concepts than the best baseline within acceptable quality limits, achieving state-of-the-art precision and efficiency for large-scale unlearning.
zh
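
"严格限定遗忘操作在目标子空间内"可以用投影来直观理解。下面是一个假设性的数值示意(非论文实现):把权重更新投影到由目标概念嵌入张成的子空间,使正交补方向(即通用生成能力所在方向)不受影响。

```python
import numpy as np

# 假设性示意:将权重更新限制在目标概念子空间内
def project_update_to_subspace(delta_W: np.ndarray, E: np.ndarray) -> np.ndarray:
    """仅保留 delta_W 中落在 span(E) 内的分量。

    delta_W: (d, d_out) 原始权重更新
    E:       (k, d)     目标概念嵌入(行向量)
    """
    Q, _ = np.linalg.qr(E.T)          # Q 的列为子空间的标准正交基,(d, k)
    P = Q @ Q.T                        # (d, d) 正交投影矩阵
    return P @ delta_W                 # 仅沿目标子空间方向更新

rng = np.random.default_rng(0)
E = rng.normal(size=(3, 64))          # 3 个目标概念
delta_W = rng.normal(size=(64, 64))
confined = project_update_to_subspace(delta_W, E)
# 验证:投影后的更新在子空间正交补方向上的分量约为 0
Q, _ = np.linalg.qr(E.T)
residual = confined - Q @ (Q.T @ confined)
print(np.linalg.norm(residual))       # ≈ 0
```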

[CV-147] Low-Back Pain Physical Rehabilitation by Movement Analysis in Clinical Trial

【速读】:该论文旨在解决康复训练中运动监测的四大挑战:运动评估、错误识别、空间定位与时间定位,以支持智能辅导系统(Intelligent Tutoring System, ITS)在物理康复领域的开发与评估。其解决方案的关键在于构建了一个临床采集的Keraal数据集,该数据集包含真实患者在康复计划中执行低背痛康复动作的多模态运动信息,为先进人体运动分析算法提供了基准测试平台,从而推动了康复训练智能化和精准化的实现。

链接: https://arxiv.org/abs/2601.06138
作者: Sao Mai Nguyen(U2IS, ENSTA, IP Paris)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: ICMST, Tokyo University of Science; Taiwanese Society of Movement Science and Technology; Research institute for Science and Technology, Nov 2025, Tokyo, Japan

点击查看摘要

Abstract:To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we propose a medical dataset of clinical patients carrying out low-back-pain rehabilitation exercises, together with a benchmark of state-of-the-art human movement analysis algorithms. This dataset is valuable because it includes rehabilitation motions in a clinical setting with patients in their rehabilitation program. This paper introduces the Keraal dataset, a clinically collected dataset to enable intelligent tutoring systems (ITS) for rehabilitation. It addresses four challenges in exercise monitoring: motion assessment, error recognition, spatial localization, and temporal localization.
zh

[CV-148] Attention in Geometry: Scalable Spatial Modeling via Adaptive Density Fields and FAISS-Accelerated Kernels

链接: https://arxiv.org/abs/2601.06135
作者: Zhaowen Fan
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Independent Study. 22 pages, 2 figures. Includes full mathematical derivation of Adaptive Density Fields (ADF), implementation of FAISS-accelerated kernels, and a physics-informed trajectory POI detection pipeline

点击查看摘要

[CV-149] COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control AAAI-26

【速读】:该论文旨在解决视觉强化学习(Visual Reinforcement Learning, Visual RL)在复杂任务中因高维观测导致的样本效率低下问题。现有方法多聚焦于将视觉语言模型(Vision-Language Model, VLM)的知识蒸馏到RL策略中,却忽视了RL生成的交互数据对VLM的增强潜力。解决方案的关键在于提出COVR框架,实现VLM与RL策略的协同优化:一方面,利用RL生成的数据微调VLM以提升其与目标任务一致的语义推理能力;另一方面,借助增强后的VLM提供动作先验来指导策略学习。核心创新包括两个模块——探索驱动的动态过滤模块(Exploration-Driven Dynamic Filter)用于保留高价值探索样本,以及基于回报感知的自适应损失权重模块(Return-Aware Adaptive Loss Weight)以提升训练稳定性,并结合渐进式微调策略降低资源消耗。

链接: https://arxiv.org/abs/2601.06122
作者: Canming Xia,Peixi Peng,Guang Tan,Zhan Su,Haoran Xu,Zhenxian Liu,Luntong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The paper was accepted by the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables the mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to enhance the semantic reasoning ability consistent with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves the stability of training by quantifying the inconsistency of sampling actions via return signals of RL. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.
zh

[CV-150] Semantic Event Graphs for Long-Form Video Question Answering

链接: https://arxiv.org/abs/2601.06097
作者: Aradhya Dixit,Tianxi Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 6 figures

点击查看摘要

[CV-151] OptFormer: Optical Flow-Guided Attention and Phase Space Reconstruction for SST Forecasting

【速读】:该论文旨在解决海表温度(Sea Surface Temperature, SST)预测中因非线性时空动态特性及长时间预测跨度带来的挑战。其解决方案的关键在于提出OptFormer模型,该模型融合相空间重构与基于光流引导的运动感知注意力机制;相较于传统注意力机制,该方法利用帧间运动线索突出空间场中的相对变化,使模型能够聚焦于动态区域并更有效地捕捉长程时间依赖关系,从而在NOAA SST数据集上实现了比现有基线方法更高的精度与鲁棒性。

链接: https://arxiv.org/abs/2601.06078
作者: Yin Wang,Chunlin Gong,Zhuozhen Xu,Lehan Zhang,Xiang Wu
机构: Shandong University of Finance and Economics (山东财经大学); University of Minnesota (明尼苏达大学); Wuhan University of Technology (武汉理工大学); Anqing Normal University (安庆师范大学); Jinan Fengdi Intelligent Electronics Co., Ltd (济南丰迪智能电子有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 11 pages,4 figures, 5 tables

点击查看摘要

Abstract:Sea Surface Temperature (SST) prediction plays a vital role in climate modeling and disaster forecasting. However, it remains challenging due to its nonlinear spatiotemporal dynamics and extended prediction horizons. To address this, we propose OptFormer, a novel encoder-decoder model that integrates phase-space reconstruction with a motion-aware attention mechanism guided by optical flow. Unlike conventional attention, our approach leverages inter-frame motion cues to highlight relative changes in the spatial field, allowing the model to focus on dynamic regions and capture long-range temporal dependencies more effectively. Experiments on NOAA SST datasets across multiple spatial scales demonstrate that OptFormer achieves superior performance under a 1:1 training-to-prediction setting, significantly outperforming existing baselines in accuracy and robustness.
zh
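
下面用一个假设性草图说明"运动线索作为注意力偏置"的基本思路(非官方实现):真实系统应使用光流幅值,这里以相邻帧差分作为廉价替代,将归一化后的运动强度加到注意力 logits 上,使动态区域获得更高权重。

```python
import numpy as np

# 假设性草图:用帧差分近似运动强度,并偏置空间注意力
def motion_bias_attention(frames: np.ndarray, q, k, lam: float = 1.0):
    """frames: (T, H, W) SST 场序列;q, k: (H*W, d) 查询/键。"""
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=0)   # (H, W) 运动强度
    bias = (motion / (motion.max() + 1e-8)).reshape(-1)     # 归一化到 [0,1]
    logits = q @ k.T / np.sqrt(q.shape[-1])                 # (HW, HW)
    logits = logits + lam * bias[None, :]                   # 动态位置作为 key 被加权
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)                # softmax 注意力权重

T, H, W, d = 8, 16, 16, 32
rng = np.random.default_rng(1)
attn = motion_bias_attention(rng.normal(size=(T, H, W)),
                             rng.normal(size=(H * W, d)),
                             rng.normal(size=(H * W, d)))
print(attn.shape, attn.sum(axis=-1)[:3])  # (256, 256),每行和为 1
```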

[CV-152] HyperTopo-Adapters: Geometry- and Topology-Aware Segmentation of Leaf Lesions on Frozen Encoders ALT

【速读】:该论文旨在解决叶部病斑分割(leaf-lesion segmentation)中因拓扑结构敏感性导致的性能瓶颈问题:传统基于欧几里得空间(Euclidean latents)的像素级损失函数对小规模合并、分裂或虚假孔洞等拓扑变化惩罚不足,而这些细微差异在生物学上可能反映生化通路的关键特征。解决方案的核心在于提出HyperTopo-Adapters——一种轻量级、参数高效的头部模块,其嵌入空间为双曲、欧氏与球面(hyperbolic + Euclidean + spherical, H + E + S)的乘积流形,分别促进层级分离(H)、局部线性细节(E)和全局闭合性(S)。通过引入拓扑先验(topology prior)增强Dice/BCE损失,包括两种形式:(i) 使用持久同调(persistent homology, PH)距离进行评估与模型选择;(ii) 设计可微分代理损失,结合软欧拉示性数匹配与总变差正则化以实现稳定训练。该方法在Kaggle叶部病斑数据集上验证了边界和拓扑指标显著提升(如Δβ₁洞误差降低9%),同时保持Dice/IoU竞争力,且提供可复现的训练/评估套件以诊断几何与拓扑先验的影响及失败模式。

链接: https://arxiv.org/abs/2601.06067
作者: Chimdi Walter Ndubuisi,Toni Kazic
机构: Missouri Maize Computation and Vision Laboratory (MMCV); University of Missouri (密苏里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures. Code available at this https URL

点击查看摘要

Abstract:Leaf-lesion segmentation is topology-sensitive: small merges, splits, or false holes can be biologically meaningful descriptors of biochemical pathways, yet they are weakly penalized by standard pixel-wise losses in Euclidean latents. I explore HyperTopo-Adapters, a lightweight, parameter-efficient head trained on top of a frozen vision encoder, which embeds features on a product manifold – hyperbolic + Euclidean + spherical (H + E + S) – to encourage hierarchical separation (H), local linear detail (E), and global closure (S). A topology prior complements Dice/BCE in two forms: (i) persistent-homology (PH) distance for evaluation and selection, and (ii) a differentiable surrogate that combines a soft Euler-characteristic match with total variation regularization for stable training. I introduce warm-ups for both the hyperbolic contrastive term and the topology prior, per-sample evaluation of structure-aware metrics (Boundary-F1, Betti errors, PD distance), and a min-PD within top-K Dice rule for checkpoint selection. On a Kaggle leaf-lesion dataset (N=2,940), early results show consistent gains in boundary and topology metrics (reducing Δβ₁ hole error by 9%) while Dice/IoU remain competitive. The study is diagnostic by design: I report controlled ablations (curvature learning, latent dimensions, contrastive temperature, surrogate settings), and ongoing tests varying encoder strength (ResNet-50, DeepLabV3, DINOv2/v3), input resolution, PH weight, and partial unfreezing of late blocks. The contribution is an open, reproducible train/eval suite (available at this https URL) that isolates geometric/topological priors and surfaces failure modes to guide stronger, topology-preserving architectures.
zh
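
摘要中的可微分代理损失可按如下思路实现(细节为自拟假设):对软分割概率图按立方复形公式 χ = V - E + F 计算软欧拉示性数,与目标欧拉数做平方匹配,并叠加总变差正则;全部运算可微,可直接反向传播。

```python
import torch

# 假设性代理损失:软欧拉示性数匹配 + 总变差正则,均可微
def soft_euler(p: torch.Tensor) -> torch.Tensor:
    """p: (H, W) 软分割概率图。按立方复形 χ = V - E + F 计算软欧拉数。"""
    V = p.sum()                                                  # 顶点(像素)
    E = (p[:, :-1] * p[:, 1:]).sum() + (p[:-1, :] * p[1:, :]).sum()  # 相邻边
    F = (p[:-1, :-1] * p[:-1, 1:] * p[1:, :-1] * p[1:, 1:]).sum()    # 2x2 面
    return V - E + F

def topo_surrogate_loss(p, chi_target: float, lam: float = 0.1):
    tv = (p[:, 1:] - p[:, :-1]).abs().sum() + (p[1:, :] - p[:-1, :]).abs().sum()
    return (soft_euler(p) - chi_target) ** 2 + lam * tv

p = torch.rand(64, 64, requires_grad=True)
loss = topo_surrogate_loss(p, chi_target=3.0)   # 例如期望 3 个无孔病斑连通域
loss.backward()
print(float(loss), p.grad.shape)
```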

[CV-153] Using street view images and visual LLM s to predict heritage values for governance support: Risks ethics and policy implications

链接: https://arxiv.org/abs/2601.06056
作者: Tim Johansson,Mikael Mangold,Kristina Dabrock,Anna Donarelli,Ingrid Campo-Ruiz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-154] Investigating Anthropometric Fidelity in SAM 3D Body

链接: https://arxiv.org/abs/2601.06035
作者: Aizierjiang Aiersilan,Ruting Cheng,James Hahn
机构: The George Washington University (乔治·华盛顿大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-155] SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images AAAI2026

【速读】:该论文旨在解决超高清图像在生成式AI(Generative AI)背景下,如何实现高效压缩与端侧实时解码的问题。现有基于3D高斯溅射(3D Gaussian Splatting)的2D高斯图像模型虽提升了表示效率,但在超高分辨率场景下难以兼顾压缩比与重建保真度。其解决方案的关键在于提出SmartSplat框架,该框架通过引入梯度-颜色引导的变分采样策略(Gradient-Color Guided Variational Sampling)和基于排除机制的均匀采样方案(Exclusion-based Uniform Sampling),显著提升高斯基元在像素空间中的非重叠覆盖能力;同时设计尺度自适应的颜色初始化方法(Scale-Adaptive Gaussian Color Sampling),联合优化空间布局、尺度和颜色初始化,以有限数量的高斯分布高效捕获局部结构与全局纹理,在强压缩条件下仍保持高质量重建,且具备良好的可扩展性和实用性。

链接: https://arxiv.org/abs/2512.20377
作者: Linfei Li,Lin Zhang,Zhong Wang,Ying Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content, posing significant challenges for efficient compression and real-time decoding on end-user devices. Inspired by 3D Gaussian Splatting, recent 2D Gaussian image models improve representation efficiency, yet existing methods struggle to balance compression ratio and reconstruction fidelity in ultra-high-resolution scenarios. To address this issue, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that supports arbitrary image resolutions and compression ratios. SmartSplat leverages image-aware features such as gradients and color variances, introducing a Gradient-Color Guided Variational Sampling strategy together with an Exclusion-based Uniform Sampling scheme to improve the non-overlapping coverage of Gaussian primitives in pixel space. In addition, we propose a Scale-Adaptive Gaussian Color Sampling method to enhance color initialization across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat efficiently captures both local structures and global textures using a limited number of Gaussians, achieving high reconstruction quality under strong compression. Extensive experiments on DIV8K and a newly constructed 16K dataset demonstrate that SmartSplat consistently outperforms state-of-the-art methods at comparable compression ratios and exceeds their compression limits, showing strong scalability and practical applicability. The code is publicly available at this https URL.
zh
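
下面给出"梯度-颜色引导采样"的一个假设性最小实现:以梯度幅值与块内颜色方差的加权和构造采样概率,在高频、高方差区域放置更多初始位置;窗口大小与权重 alpha 均为示例取值。

```python
import numpy as np

# 假设性实现:按"梯度幅值 + 局部颜色方差"构造高斯基元位置的采样分布
def guided_sample_positions(img: np.ndarray, n: int, win: int = 4, alpha: float = 0.5):
    """img: (H, W, 3) 归一化到 [0,1] 的图像;返回 n 个 (y, x) 采样坐标。"""
    gray = img.mean(axis=-1)
    gy, gx = np.gradient(gray)
    grad = np.hypot(gx, gy)                       # 梯度幅值
    H, W = gray.shape
    var = np.zeros_like(gray)                     # 简单的块内颜色方差
    for y in range(0, H, win):
        for x in range(0, W, win):
            var[y:y + win, x:x + win] = img[y:y + win, x:x + win].var()
    score = alpha * grad / (grad.max() + 1e-8) + (1 - alpha) * var / (var.max() + 1e-8)
    prob = (score / score.sum()).reshape(-1)
    idx = np.random.default_rng(0).choice(H * W, size=n, replace=False, p=prob)
    return np.stack(np.unravel_index(idx, (H, W)), axis=-1)

img = np.random.default_rng(2).random((128, 128, 3))
print(guided_sample_positions(img, n=500).shape)   # (500, 2)
```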

[CV-156] Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment NEURIPS2025

【速读】:该论文旨在解决多模态模型在对比学习中对“模糊负样本”(ambiguous negatives)处理不足的问题,即那些与正样本仅存在细微差异的负样本常被同等对待,导致模型难以区分细粒度语义差异。解决方案的关键在于提出边界感知课程学习与局部注意力机制(Boundary-Aware Curriculum with Local Attention, BACL),其核心由两个可微模块构成:一是边界感知负采样器(Boundary-aware Negative Sampler),通过渐进式提升负样本难度构建课程信号;二是对比局部注意力损失(Contrastive Local Attention loss),显式定位跨模态不匹配区域。该方法无需额外标注即可显著提升模型性能,在四个大规模基准上达到新SOTA,且兼容任意现成双编码器架构。

链接: https://arxiv.org/abs/2511.08399
作者: Hua Ye(1 and 2),Hang Ding(3),Siyuan Chen(4),Yiyang Jiang(5),Changyuan Zhang(6),Xuan Zhang(2 and 7) ((1) Nanjing University, (2) Airon Technology CO. LTD, (3) University of Bristol, (4) The Hong Kong Polytechnic University, (5) Shanghai Jiao Tong University, (6) The University of Hong Kong, (7) Carnegie Mellon University)
机构: Nanjing University (南京大学); Airon Technology CO., LTD (艾隆科技有限公司); Shanghai Jiao Tong University (上海交通大学); University of Bristol (布里斯托大学); The Hong Kong Polytechnic University (香港理工大学); The University of Hong Kong (香港大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 6 figures, 5 tables. Submitted to NeurIPS 2025

点击查看摘要

Abstract:Most multimodal models treat every negative pair alike, ignoring the ambiguous negatives that differ from the positive by only a small detail. We propose Boundary-Aware Curriculum with Local Attention (BACL), a lightweight add-on that turns these borderline cases into a curriculum signal. A Boundary-aware Negative Sampler gradually raises difficulty, while a Contrastive Local Attention loss highlights where the mismatch occurs. The two modules are fully differentiable and work with any off-the-shelf dual encoder. Theory predicts a fast O(1/n) error rate; practice shows up to +32% R@1 over CLIP and new SOTA on four large-scale benchmarks, all without extra labels.
zh
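
边界感知负采样器的"难度渐进"思想可用如下假设性草图表达:随训练进度 t 提高所选负样本与锚点余弦相似度的下限,从易到难;阈值日程为自拟示例。

```python
import torch

# 假设性实现:课程式负采样,难度随训练进度 t∈[0,1] 上升
def curriculum_negatives(anchor, candidates, t: float, k: int = 8):
    """anchor: (d,); candidates: (N, d)。返回 k 个负样本的索引。"""
    sim = torch.nn.functional.cosine_similarity(
        anchor.unsqueeze(0), candidates, dim=-1)          # (N,)
    lo = -1.0 + 1.6 * t          # 相似度下限从 -1.0 升到 0.6(日程为假设值)
    pool = torch.nonzero(sim >= lo, as_tuple=False).squeeze(-1)
    if pool.numel() < k:         # 高难度阶段候选不足时退回全集
        pool = torch.arange(candidates.shape[0])
    return pool[torch.randperm(pool.numel())[:k]]

anchor = torch.randn(128)
cands = torch.randn(1000, 128)
for t in (0.0, 0.5, 1.0):
    print(t, curriculum_negatives(anchor, cands, t).shape)
```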

[CV-157] Fast Multi-Stack Slice-to-Volume Reconstruction via Multi-Scale Unrolled Optimization

链接: https://arxiv.org/abs/2601.07519
作者: Margherita Firenze,Sean I. Young,Clinton J. Wang,Hyuk Jin Yun,Elfar Adalsteinsson,Kiho Im,P. Ellen Grant,Polina Golland
机构: MIT(麻省理工学院); Harvard Medical School(哈佛医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-158] USFetal: Tools for Fetal Brain Ultrasound Compounding

【速读】:该论文旨在解决胎儿脑部超声成像中存在的视图依赖性伪影、操作者差异性以及视野受限等问题,这些问题严重影响了图像的解读与定量分析。其核心解决方案是通过超声融合(ultrasound compounding)技术,将多视角三维采集的互补信息整合为单一、一致的体积表示。关键创新在于:首次系统性地对胎儿脑部超声融合的计算策略进行分类,涵盖经典方法与基于学习的框架;提出两种新的深度学习驱动的无监督与自监督融合方法,以应对缺乏全视野无伪影标注数据的问题;并通过在10个胎儿脑部超声数据集上的综合评估验证了所提方法的有效性。

链接: https://arxiv.org/abs/2601.06726
作者: Mohammad Khateri,Morteza Ghahremani,Sergio Valencia,Camilo Jaimes,Alejandra Sierra,Jussi Tohka,P. Ellen Grant,Davood Karimi
机构: University of Eastern Finland (东芬兰大学); Technical University of Munich (慕尼黑工业大学); Massachusetts General Hospital (麻省总医院); Harvard Medical School (哈佛医学院); Boston Children’s Hospital (波士顿儿童医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultrasound offers a safe, cost-effective, and widely accessible technology for fetal brain imaging, making it especially suitable for routine clinical use. However, it suffers from view-dependent artifacts, operator variability, and a limited field of view, which make interpretation and quantitative evaluation challenging. Ultrasound compounding aims to overcome these limitations by integrating complementary information from multiple 3D acquisitions into a single, coherent volumetric representation. This work provides four main contributions: (1) We present the first systematic categorization of computational strategies for fetal brain ultrasound compounding, including both classical techniques and modern learning-based frameworks. (2) We implement and compare representative methods across four key categories - multi-scale, transformation-based, variational, and deep learning approaches - emphasizing their core principles and practical advantages. (3) Motivated by the lack of full-view, artifact-free ground truth required for supervised learning, we focus on unsupervised and self-supervised strategies and introduce two new deep learning based approaches: a self-supervised compounding framework and an adaptation of unsupervised deep plug-and-play priors for compounding. (4) We conduct a comprehensive evaluation on ten multi-view fetal brain ultrasound datasets, using both expert radiologist scoring and standard quantitative image-quality metrics. We also release the USFetal Compounding Toolbox, publicly available to support benchmarking and future research. Keywords: Ultrasound compounding, fetal brain, deep learning, self-supervised, unsupervised.
zh

[CV-159] R3D: Regional-guided Residual Radar Diffusion

【速读】:该论文旨在解决毫米波雷达(Millimeter-wave radar)在恶劣环境下虽具备鲁棒性但存在点云稀疏、噪声大及角分辨率低的问题,尤其针对现有基于扩散模型(diffusion-based)的雷达增强方法中存在的两个关键缺陷:一是建模全LiDAR分布导致学习复杂度高,二是均匀区域处理难以突出关键结构。其解决方案的核心在于提出R3D框架——通过残差扩散建模(residual diffusion modeling),聚焦于LiDAR与雷达之间的残差编码,以补充高频细节并降低学习难度;同时引入自适应sigma的区域引导机制(sigma-adaptive regional guidance),利用雷达特有的信号特性生成注意力图,并仅在低噪声阶段施加轻量级引导,从而避免梯度失衡并精准优化关键区域。

链接: https://arxiv.org/abs/2601.06465
作者: Hao Li,Xinqi Liu,Yaoqing Jin
机构: University of Arizona (亚利桑那大学); University of Hong Kong (香港大学); University of Stuttgart (斯图加特大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:Millimeter-wave radar enables robust environment perception in autonomous systems under adverse conditions yet suffers from sparse, noisy point clouds with low angular resolution. Existing diffusion-based radar enhancement methods either incur high learning complexity by modeling full LiDAR distributions or fail to prioritize critical structures due to uniform regional processing. To address these issues, we propose R3D, a regional-guided residual radar diffusion framework that integrates residual diffusion modeling-focusing on the concentrated LiDAR-radar residual encoding complementary high-frequency details to reduce learning difficulty-and sigma-adaptive regional guidance-leveraging radar-specific signal properties to generate attention maps and applying lightweight guidance only in low-noise stages to avoid gradient imbalance while refining key regions. Extensive experiments on the ColoRadar dataset demonstrate that R3D outperforms state-of-the-art methods, providing a practical solution for radar perception enhancement. Our anonymous code and pretrained models are released here: this https URL
zh
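
残差扩散的训练目标可用如下假设性草图说明(噪声日程与模型接口均为自拟):扩散模型以雷达特征为条件,学习预测加在"LiDAR 与雷达编码之差"这一残差上的噪声,而非直接建模完整 LiDAR 分布。

```python
import torch

# 假设性示意:对"LiDAR - 雷达编码"残差做标准的噪声预测训练
def residual_diffusion_loss(model, lidar_feat, radar_feat, T=1000):
    residual = lidar_feat - radar_feat                  # 学习目标:互补的高频残差
    t = torch.randint(0, T, (residual.shape[0],))
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2   # 假设的余弦噪声日程
    noise = torch.randn_like(residual)
    a = alpha_bar.view(-1, *([1] * (residual.dim() - 1)))
    x_t = a.sqrt() * residual + (1 - a).sqrt() * noise  # 前向加噪
    pred = model(x_t, t, radar_feat)                    # 以雷达特征为条件预测噪声
    return torch.nn.functional.mse_loss(pred, noise)

model = lambda x, t, cond: x * 0.0                      # 占位模型,仅演示数据流
loss = residual_diffusion_loss(model, torch.randn(8, 64), torch.randn(8, 64))
print(float(loss))
```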

[CV-160] Performance Analysis of DCT Hadamard and PCA in Block-Based Image Compression

【速读】:该论文旨在解决图像压缩中变换编码方法的选择问题,特别是比较固定变换(如离散余弦变换 DCT 和哈达玛变换 Hadamard)与数据驱动的最优变换(主成分分析 PCA)在不同块尺寸和压缩率下的性能差异。其解决方案的关键在于通过率失真分析和能量集中度评估,系统性地实证表明:PCA 仅在块维度足够大时才优于固定变换,而 DCT 在标准块大小(如 8×8)及低比特率下仍保持近似最优性能,从而解释了 DCT 在实际图像编码标准中的鲁棒性,并揭示了逐块学习变换方法的局限性。

链接: https://arxiv.org/abs/2601.06273
作者: Yashika Ahlawat
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Block-based image compression relies on transform coding to concentrate signal energy into a small number of coefficients. While classical codecs use fixed transforms such as the Discrete Cosine Transform (DCT), data-driven methods such as Principal Component Analysis (PCA) are theoretically optimal for decorrelation. This paper presents an experimental comparison of DCT, Hadamard, and PCA across multiple block sizes and compression rates. Using rate distortion and energy compaction analysis, we show that PCA outperforms fixed transforms only when block dimensionality is sufficiently large, while DCT remains near optimal for standard block sizes such as 8×8 and at low bit rates. These results explain the robustness of DCT in practical codecs and highlight the limitations of block-wise learned transforms.
zh
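
下面的最小实验复现了该比较的计算流程(数据为人工生成的平滑随机场,数值仅作演示):在 8×8 块上分别计算 DCT 系数与 PCA(KLT)系数,统计每块能量最大的前 k 个系数所占能量比例。

```python
import numpy as np
from scipy.fft import dctn

def blockify(img, b=8):
    H, W = img.shape
    blocks = img[:H // b * b, :W // b * b].reshape(H // b, b, W // b, b)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, b * b)   # (N, 64)

def energy_top_k(coeffs, k):
    e = np.sort(coeffs ** 2, axis=1)[:, ::-1]   # 每块按能量降序
    return e[:, :k].sum() / e.sum()             # 前 k 个系数的能量占比

rng = np.random.default_rng(3)
img = rng.normal(size=(256, 256))
for ax in (0, 1):                               # 累加生成强空间相关的平滑场
    img = np.cumsum(img, axis=ax)
X = blockify(img)
X = X - X.mean(axis=0)

dct_coeffs = np.array([dctn(x.reshape(8, 8), norm='ortho').ravel() for x in X])
# PCA:块协方差的特征基即 KLT,对该数据的二阶统计是最优线性变换
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pca_coeffs = X @ Vt.T

for k in (4, 8, 16):
    print(k, round(energy_top_k(dct_coeffs, k), 4),
             round(energy_top_k(pca_coeffs, k), 4))
```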

[CV-161] Gamma2Patterns: Deep Cognitive Attention Region Identification and Gamma-Alpha Pattern Analysis

链接: https://arxiv.org/abs/2601.06257
作者: Sobhana Jahan,Saydul Akbar Murad,Nick Rahimi,Noorbakhsh Amiri Golilarz
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-162] Real-Time Image Processing Algorithms for Embedded Systems

链接: https://arxiv.org/abs/2601.06243
作者: Soundes Oumaima Boufaida,Abdemadjid Benmachiche,Majda Maatallah
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-163] Deep Joint Source-Channel Coding for Wireless Video Transmission with Asymmetric Context

【速读】:该论文旨在解决深度联合信源信道编码(Joint Source-Channel Coding, JSCC)在视频传输中因编码端与解码端无法共享相同重建帧而导致的条件预测不一致问题,以及由此引发的误差累积效应。解决方案的关键在于引入基于不对称上下文的条件编码机制,使编码器和解码器能够从各自不同的上下文中学习独立的编码与解码条件,并通过特征传播(feature propagation)机制实现中间特征在两端的独立传递,从而有效利用时间相关性并抑制误差扩散;此外,结合内容自适应编码策略(content-adaptive coding),利用熵模型和掩码机制实现可变带宽传输,进一步提升系统性能。

链接: https://arxiv.org/abs/2601.06170
作者: Xuechen Chen,Junting Li,Chuang Chen,Hairong Lin,Yishen Li
机构: Central South University (中南大学); Huawei (华为)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 19 figures, 2 tables, accepted in press by Multimedia system

点击查看摘要

Abstract:In this paper, we propose a high-efficiency deep joint source-channel coding (JSCC) method for video transmission based on conditional coding with asymmetric context. The conditional coding-based neural video compression requires to predict the encoding and decoding conditions from the same context which includes the same reconstructed frames. However in JSCC schemes which fall into pseudo-analog transmission, the encoder cannot infer the same reconstructed frames as the decoder even a pipeline of the simulated transmission is constructed at the encoder. In the proposed method, without such a pipeline, we guide and design neural networks to learn encoding and decoding conditions from asymmetric contexts. Additionally, we introduce feature propagation, which allows intermediate features to be independently propagated at the encoder and decoder and help to generate conditions, enabling the framework to greatly leverage temporal correlation while mitigating the problem of error accumulation. To further exploit the performance of the proposed transmission framework, we implement content-adaptive coding which achieves variable bandwidth transmission using entropy models and masking mechanisms. Experimental results demonstrate that our method outperforms existing deep video transmission frameworks in terms of performance and effectively mitigates the error accumulation. By mitigating the error accumulation, our schemes can reduce the frequency of inserting intra-frame coding modes, further enhancing performance.
zh

人工智能

[AI-0] Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation

【速读】:该论文旨在解决真实世界中基于深度强化学习(Deep Reinforcement Learning, DRL)的机器人模型在后训练阶段因干预性失败(Intervention-requiring Failures, IR Failures)而导致部署受限的问题,例如机器人打翻水杯或破坏易碎物品等。解决方案的关键在于提出一种新的范式——Failure-Aware Offline-to-Online Reinforcement Learning (FARL),其核心机制包括:构建一个包含常见需人工干预故障场景的基准测试平台FailureBench,并设计一种融合基于世界模型的安全评判器(safety critic)与离线训练的恢复策略(recovery policy)的算法,在在线探索过程中主动预防IR Failures,从而显著降低失败率并提升性能和泛化能力。实验证明,FARL在真实环境中可将IR Failures减少73.1%,同时平均提升性能11.3%。

链接: https://arxiv.org/abs/2601.07821
作者: Huanyu Li,Kun Lei,Sheng Zang,Kaizhe Hu,Yongyuan Liang,Bo An,Xiaoli Li,Huazhe Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention-requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real-world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a new paradigm minimizing failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post-training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real-world RL post-training. Videos and code are available at this https URL.
zh
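
安全评判器与恢复策略的在线门控逻辑可抽象为如下假设性骨架(接口均为自拟):若世界模型评判器预测的 IR Failure 风险超过阈值,则用离线训练的恢复策略替换当前动作。

```python
# 假设性骨架:在线探索中的安全门控
def safe_step(env, policy, recovery_policy, safety_critic, obs, risk_threshold=0.5):
    action = policy(obs)
    # safety_critic 返回预测的 IR Failure 风险(0~1,假设的接口)
    if safety_critic(obs, action) > risk_threshold:
        action = recovery_policy(obs)      # 先回到安全状态,再继续探索
    return env.step(action)

# 玩具用例:风险评估以随机数代替,真实系统中应为基于世界模型的评判器
import random
env_step = lambda a: (f"obs_after_{a}", 0.0, False)
env = type("Env", (), {"step": staticmethod(env_step)})()
policy = lambda o: "explore"
recovery = lambda o: "recover"
critic = lambda o, a: random.random()
print(safe_step(env, policy, recovery, critic, obs="o0"))
```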

[AI-1] Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification

【速读】:该论文旨在解决现代计算基础设施中系统日志(system logs)规模庞大且复杂所带来的自动化解读难题,尤其关注如何有效评估小型语言模型(Small Language Models, SLMs)在理解运行时日志方面的实际能力。传统上,严重性分类(severity classification)被视为一个独立任务,但作者指出其更应作为衡量模型对日志语义理解深度的基准工具,而非终点目标。解决方案的关键在于:(1)将严重性分类任务置于零样本(zero-shot)、少样本(few-shot)和检索增强生成(Retrieval-Augmented Generation, RAG)三种提示策略下进行系统性评估;(2)引入真实世界 journalctl 数据集,以反映生产环境中的复杂性和多样性;(3)强调模型架构设计、训练目标与上下文整合能力三者协同作用,特别是 RAG 对提升模型准确率(如 Qwen3-4B 达到 95.64%)和部署效率(如 Gemma 和 Llama 系列模型推理时间 <1.2 秒/条日志)的核心影响。这一方法不仅揭示了不同模型在真实场景下的性能分层,也为数字孪生(Digital Twin, DT)系统的实时日志分析和根因定位(Root Cause Analysis, RCA)提供了可部署的模型选择依据。

链接: https://arxiv.org/abs/2601.07790
作者: Yahya Masri,Emily Ma,Zifu Wang,Joseph Rogers,Chaowei Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 5 figures, 7 tables

点击查看摘要

Abstract:System logs are crucial for monitoring and diagnosing modern computing infrastructure, but their scale and complexity require reliable and efficient automated interpretation. Since severity levels are predefined metadata in system log messages, having a model merely classify them offers limited standalone practical value, revealing little about its underlying ability to interpret system logs. We argue that severity classification is more informative when treated as a benchmark for probing runtime log comprehension rather than as an end task. Using real-world journalctl data from Linux production servers, we evaluate nine small language models (SLMs) and small reasoning language models (SRLMs) under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results reveal strong stratification. Qwen3-4B achieves the highest accuracy at 95.64% with RAG, while Gemma3-1B improves from 20.25% under few-shot prompting to 85.28% with RAG. Notably, the tiny Qwen3-0.6B reaches 88.12% accuracy despite weak performance without retrieval. In contrast, several SRLMs, including Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B, degrade substantially when paired with RAG. Efficiency measurements further separate models: most Gemma and Llama variants complete inference in under 1.2 seconds per log, whereas Phi-4-Mini-Reasoning exceeds 228 seconds per log while achieving 10% accuracy. These findings suggest that (1) architectural design, (2) training objectives, and (3) the ability to integrate retrieved context under strict output constraints jointly determine performance. By emphasizing small, deployable models, this benchmark aligns with real-time requirements of digital twin (DT) systems and shows that severity classification serves as a lens for evaluating model competence and real-time deployability, with implications for root cause analysis (RCA) and broader DT integration.
zh
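
下述草图演示文中 RAG 设定的提示构造方式(假设性实现):用句向量检索与查询日志最相似的 k 条带标签样例,拼接为 few-shot 提示;embed() 为假设的嵌入接口,实际可替换为任意嵌入模型。

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """假设的句向量接口:此处用文本哈希种子的随机单位向量占位。"""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def build_rag_prompt(query_log: str, corpus: list[tuple[str, str]], k: int = 3) -> str:
    """corpus: [(日志文本, 严重性标签), ...];检索 top-k 相似样例拼入提示。"""
    q = embed(query_log)
    scored = sorted(corpus, key=lambda it: -float(embed(it[0]) @ q))[:k]
    examples = "\n".join(f"Log: {t}\nSeverity: {s}" for t, s in scored)
    return (f"Classify the severity of the journalctl log.\n"
            f"Examples:\n{examples}\n\nLog: {query_log}\nSeverity:")

corpus = [("Out of memory: Killed process 1234", "err"),
          ("Started Daily apt upgrade timer", "info"),
          ("Temperature above threshold, cpu clock throttled", "warning")]
print(build_rag_prompt("oom-killer invoked for process 999", corpus))
```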

[AI-2] DT-ICU: Towards Explainable Digital Twins for ICU Patient Monitoring via Multi-Modal and Multi-Task Iterative Inference

【速读】:该论文旨在解决重症监护病房(Intensive Care Unit, ICU)中患者风险评估的连续性与准确性问题,即如何在患者整个ICU住院期间动态更新风险预测,并充分利用多模态临床数据实现高精度、可解释的监测。其解决方案的关键在于提出DT-ICU——一个融合变长临床时间序列与静态患者信息的统一多任务架构数字孪生框架,通过整合生理指标、干预措施和上下文信息等异构数据源,在不同观察窗口下持续优化风险排序能力,同时借助系统性模态消融分析揭示模型对各类数据的结构化依赖关系,从而实现时序鲁棒且具备可解释性的实时风险估计。

链接: https://arxiv.org/abs/2601.07778
作者: Wen Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce DT-ICU, a multimodal digital twin framework for continuous risk estimation in intensive care. DT-ICU integrates variable-length clinical time series with static patient information in a unified multitask architecture, enabling predictions to be updated as new observations accumulate over the ICU stay. We evaluate DT-ICU on the large, publicly available MIMIC-IV dataset, where it consistently outperforms established baseline models under different evaluation settings. Our test-length analysis shows that meaningful discrimination is achieved shortly after admission, while longer observation windows further improve the ranking of high-risk patients in highly imbalanced cohorts. To examine how the model leverages heterogeneous data sources, we perform systematic modality ablations, revealing that the model learnt a reasonable structured reliance on interventions, physiological response observations, and contextual information. These analyses provide interpretable insights into how multimodal signals are combined and how trade-offs between sensitivity and precision emerge. Together, these results demonstrate that DT-ICU delivers accurate, temporally robust, and interpretable predictions, supporting its potential as a practical digital twin framework for continuous patient monitoring in critical care. The source code and trained model weights for DT-ICU are publicly available at this https URL.
zh

[AI-3] Improving Domain Generalization in Contrastive Learning using Adaptive Temperature Control NEURIPS

【速读】:该论文旨在解决对比学习(contrastive learning)在分布外(out-of-distribution)场景下性能显著下降的问题,尤其是在训练数据来自多个领域(domain),而测试数据来自未见过的领域且存在显著协变量偏移(covariate shift)的情况下。解决方案的关键在于引入领域标签(domain label)以增强表示学习的领域不变性(domain invariance),具体方法是通过调整InfoNCE损失函数中的温度参数(temperature parameter),使其根据负样本与锚点样本来自同一领域的概率进行动态调整——这会提升来自相似领域的负样本权重,从而促使模型聚焦于领域不变特征进行判别,最终实现更强的分布外泛化能力,同时保持良好的分布内任务性能。

链接: https://arxiv.org/abs/2601.07748
作者: Robert Lewis,Katie Matton,Rosalind W. Picard,John Guttag
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS SSL Workshop 2023

点击查看摘要

Abstract:Self-supervised pre-training with contrastive learning is a powerful method for learning from sparsely labeled data. However, performance can drop considerably when there is a shift in the distribution of data from training to test time. We study this phenomenon in a setting in which the training data come from multiple domains, and the test data come from a domain not seen at training that is subject to significant covariate shift. We present a new method for contrastive learning that incorporates domain labels to increase the domain invariance of learned representations, leading to improved out-of-distribution generalization. Our method adjusts the temperature parameter in the InfoNCE loss – which controls the relative weighting of negative pairs – using the probability that a negative sample comes from the same domain as the anchor. This upweights pairs from more similar domains, encouraging the model to discriminate samples based on domain-invariant attributes. Through experiments on a variant of the MNIST dataset, we demonstrate that our method yields better out-of-distribution performance than domain generalization baselines. Furthermore, our method maintains strong in-distribution task performance, substantially outperforming baselines on this measure.
zh
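
按摘要描述,温度按"负样本与锚点同域的概率"逐对调整。下面给出一种自拟的参数化(tau / (1 + alpha·p_same),非论文原式):p_same 越大温度越低,该负样本对在损失中的权重越大。

```python
import torch
import torch.nn.functional as F

# 假设性示意:逐负样本对的自适应温度 InfoNCE
def adaptive_temp_infonce(z_anchor, z_pos, z_neg, p_same, tau=0.1, alpha=1.0):
    """z_anchor, z_pos: (B, d); z_neg: (B, N, d); p_same: (B, N)∈[0,1]。"""
    za = F.normalize(z_anchor, dim=-1)
    zp = F.normalize(z_pos, dim=-1)
    zn = F.normalize(z_neg, dim=-1)
    pos_logit = (za * zp).sum(-1, keepdim=True) / tau            # (B, 1)
    neg_sim = torch.einsum('bd,bnd->bn', za, zn)                 # (B, N)
    neg_tau = tau / (1.0 + alpha * p_same)                       # 同域概率越高温度越低
    logits = torch.cat([pos_logit, neg_sim / neg_tau], dim=1)    # 正样本在第 0 位
    return F.cross_entropy(logits, torch.zeros(za.shape[0], dtype=torch.long))

B, N, d = 16, 32, 128
loss = adaptive_temp_infonce(torch.randn(B, d), torch.randn(B, d),
                             torch.randn(B, N, d), torch.rand(B, N))
print(float(loss))
```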

[AI-4] Hiking in the Wild: A Scalable Perceptive Parkour Framework for Humanoids

【速读】:该论文旨在解决复杂非结构化环境中人形机器人稳健徒步行走的问题,核心挑战在于如何从依赖本体感觉(proprioception)的反应式控制过渡到基于外感受器(exteroception)的主动感知,同时克服现有方法在状态估计漂移、训练复杂度高及泛化能力差等方面的局限。解决方案的关键在于提出一个可扩展的端到端公园跑(parkour)感知框架——“Hiking in the Wild”,其核心创新包括:1)一种结合可扩展地形边缘检测(Terrain Edge Detection)与足部体积点(Foot Volume Points)的足点安全机制,有效防止在边缘处发生灾难性滑移;2)一种平面区域采样策略(Flat Patch Sampling),通过生成可行的导航目标来缓解奖励欺骗(reward hacking)问题,从而提升训练稳定性与安全性。该方法采用单阶段强化学习架构,直接将原始深度输入与本体感觉映射为关节动作,无需外部状态估计,在真实人形机器人上实现了最高达2.5 m/s的复杂地形稳定通行。

链接: https://arxiv.org/abs/2601.07718
作者: Shaoting Zhu,Ziwen Zhuang,Mengjie Zhao,Kun-Ying Lee,Hang Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state estimation drift; for instance, LiDAR-based methods do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity; specifically, some previous works using virtual obstacles are implemented case-by-case. In this work, we present \textitHiking in the Wild, a scalable, end-to-end parkour perceptive framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable \textitTerrain Edge Detection with \textitFoot Volume Points to prevent catastrophic slippage on edges, and a \textitFlat Patch Sampling strategy that mitigates reward hacking by generating feasible navigation targets. Our approach utilizes a single-stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.
zh

[AI-5] Deep Whole-body Parkour

【速读】:该论文旨在解决当前人形机器人控制中两大范式的局限性问题:一是感知运动(perceptive locomotion)虽能良好适应地形但仅限于步行步态,二是通用运动追踪(general motion tracking)可复现复杂技能却忽视环境特性。解决方案的关键在于将外感受感知(exteroceptive sensing)整合进全身运动追踪框架中,使机器人能在非结构化地形上执行高动态、非移动类任务(如翻越和翻滚),并通过单一策略在多种地形特征上完成多类动作,从而显著提升机器人在复杂环境中的机动性和任务适应能力。

链接: https://arxiv.org/abs/2601.07701
作者: Ziwen Zhuang,Shaoting Zhu,Mengjie Zhao,Hang Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current approaches to humanoid control generally fall into two paradigms: perceptive locomotion, which handles terrain well but is limited to pedal gaits, and general motion tracking, which reproduces complex skills but ignores environmental capabilities. This work unites these paradigms to achieve perceptive general motion control. We present a framework where exteroceptive sensing is integrated into whole-body motion tracking, permitting a humanoid to perform highly dynamic, non-locomotion tasks on uneven terrain. By training a single policy to perform multiple distinct motions across varied terrestrial features, we demonstrate the non-trivial benefit of integrating perception into the control loop. Our results show that this framework enables robust, highly dynamic multi-contact motions, such as vaulting and dive-rolling, on unstructured terrain, significantly expanding the robot’s traversability beyond simple walking or running. this https URL
zh

[AI-6] Predictive Analytics for Dementia: Machine Learning on Healthcare Data

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease)等痴呆症的早期预测问题,通过机器学习(Machine Learning, ML)技术提升对患者健康数据的分析能力。其关键解决方案在于采用监督学习算法(如线性判别分析 Linear Discriminant Analysis, LDA)、引入合成少数类过采样技术(Synthetic Minority Over-sampling Technique, SMOTE)以缓解类别不平衡问题,并利用词频-逆文档频率(Term Frequency-Inverse Document Frequency, TF-IDF)向量化方法增强特征表示。实验表明,LDA模型在测试集上达到98%的准确率,同时强调了模型可解释性的重要性,以及APOE-epsilon4基因变异和糖尿病等慢性疾病与痴呆之间的显著相关性,为未来结合可解释人工智能(Explainable AI)的ML创新提供了方向。

链接: https://arxiv.org/abs/2601.07685
作者: Shafiul Ajam Opee,Nafiz Fahad,Anik Sen,Rasel Ahmed,Fariha Jahan,Md. Kishor Morol,Md Rashedul Islam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 13 figures

点击查看摘要

Abstract:Dementia is a complex syndrome impacting cognitive and emotional functions, with Alzheimer’s disease being the most common form. This study focuses on enhancing dementia prediction using machine learning (ML) techniques on patient health data. Supervised learning algorithms are applied in this study, including K-Nearest Neighbors (KNN), Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), and Gaussian Process Classifiers. To address class imbalance and improve model performance, techniques such as Synthetic Minority Over-sampling Technique (SMOTE) and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization were employed. Among the models, LDA achieved the highest testing accuracy of 98%. This study highlights the importance of model interpretability and the correlation of dementia with features such as the presence of the APOE-epsilon4 allele and chronic conditions like diabetes. This research advocates for future ML innovations, particularly in integrating explainable AI approaches, to further improve predictive capabilities in dementia care.
zh
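
下面是一条玩具流水线(数据与超参均为示例,非论文原始设置),串联文中提到的 TF-IDF 向量化、SMOTE 过采样与 LDA 分类:

```python
# 玩具示例:TF-IDF -> SMOTE 缓解类别不平衡 -> LDA 分类
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from imblearn.over_sampling import SMOTE

texts = ["memory decline and confusion", "normal checkup no issues",
         "routine visit healthy", "no cognitive complaints",
         "APOE-e4 carrier cognitive decline", "healthy active lifestyle",
         "blood pressure stable no concerns", "annual physical unremarkable"]
labels = [1, 0, 0, 0, 1, 0, 0, 0]   # 1 = 痴呆风险, 0 = 正常(示例标签,故意不平衡)

vec = TfidfVectorizer().fit(texts)
X = vec.transform(texts).toarray()                    # LDA 需要稠密输入
X_res, y_res = SMOTE(k_neighbors=1).fit_resample(X, labels)

clf = LinearDiscriminantAnalysis().fit(X_res, y_res)
print(clf.predict(vec.transform(["confusion and memory problems"]).toarray()))
```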

[AI-7] owards Automating Blockchain Consensus Verification with IsabeLLM

【速读】:该论文旨在解决区块链共识协议(Consensus Protocol)在设计与实现中因缺乏形式化验证而导致的安全风险问题,尤其是在存在恶意节点的对抗环境中难以确保协议正确性。其解决方案的关键在于提出并实现 IsabeLLM 工具,该工具将证明助手 Isabelle 与大语言模型(Large Language Model, LLM)相结合,通过自动化辅助生成和验证非平凡命题的证明,从而显著降低形式化验证的技术门槛和人力成本。文中以比特币工作量证明(Proof of Work)共识协议为例,利用 DeepSeek R1 API 验证了该方法的有效性,成功生成了所有关键lemma的正确证明。

链接: https://arxiv.org/abs/2601.07654
作者: Elliot Jones,William Knottenbelt
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Consensus protocols are crucial for a blockchain system as they are what allow agreement between the system’s nodes in a potentially adversarial environment. For this reason, it is paramount to ensure their correct design and implementation to prevent such adversaries from carrying out malicious behaviour. Formal verification allows us to ensure the correctness of such protocols, but requires high levels of effort and expertise to carry out and thus is often omitted in the development process. In this paper, we present IsabeLLM, a tool that integrates the proof assistant Isabelle with a Large Language Model to assist and automate proofs. We demonstrate the effectiveness of IsabeLLM by using it to develop a novel model of Bitcoin’s Proof of Work consensus protocol and verify its correctness. We use the DeepSeek R1 API for this demonstration and found that we were able to generate correct proofs for each of the non-trivial lemmas present in the verification.
zh

[AI-8] Active Evaluation of General Agents : Problem Definition and Comparison of Baseline Algorithms AAMAS2026

【速读】:该论文旨在解决智能代理(Intelligent Agents)在多任务环境下评估复杂度与成本显著上升的问题,尤其是当任务具备相关性和随机性时,需大量样本才能实现准确比较,从而带来高昂的计算和时间开销。解决方案的关键在于提出一种主动评估(Active Evaluation)的正式定义与概念框架,其核心思想是:在每一轮迭代中,由排名算法动态选择待评估的任务与代理进行分数采样,而非预先处理或压缩数据集;通过在线方式持续更新代理排名,并以真实排序为基准衡量性能。实验表明,经典Elo评分系统虽存在理论缺陷,但在实践中能稳定降低排名误差;而新兴的Soft Condorcet优化方法在合成数据上表现接近Elo,在真实Atari游戏代理评估中显著优于Elo;此外,当任务分布偏离真实排序时,基于比例代表性(Proportional Representation)的任务选择策略可加速排名误差的减少。

链接: https://arxiv.org/abs/2601.07651
作者: Marc Lanctot,Kate Larson,Ian Gemp,Michael Kaisers
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: AAMAS 2026

点击查看摘要

Abstract:As intelligent agents become more generally-capable, i.e. able to master a wide variety of tasks, the complexity and cost of properly evaluating them rises significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, requiring many samples for accurate comparisons, leading to added costs. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Then, evaluation algorithms report a ranking of agents on each iteration and their performance is assessed with respect to the ground truth ranking over time. Several baselines are compared under different experimental contexts, with synthetic generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system – while it suffers from well-known failure modes, in theory – is a consistently reliable choice for efficient reduction of ranking error in practice. A recently-proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to higher rate of ranking error reduction.
zh
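
作为参照,经典 Elo 评分的核心更新只有几行。下面的玩具实验(隐含真实实力为自拟)演示如何用成对比较的胜负信号在线更新评分并得到排名:

```python
# 最小 Elo 草图:用成对比较结果在线更新评分
def elo_update(ratings, a, b, score_a, k=16.0):
    """score_a ∈ {1, 0.5, 0}:a 胜 / 平 / 负。"""
    exp_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    ratings[a] += k * (score_a - exp_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - exp_a))

import random
random.seed(0)
true_skill = {"agent1": 0.8, "agent2": 0.5, "agent3": 0.2}   # 隐含真实实力(示例)
ratings = {name: 1000.0 for name in true_skill}
for _ in range(2000):
    a, b = random.sample(list(true_skill), 2)
    p_a = true_skill[a] / (true_skill[a] + true_skill[b])
    elo_update(ratings, a, b, 1.0 if random.random() < p_a else 0.0)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))   # 期望 agent1 > agent2 > agent3
```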

[AI-9] SALT-KG: A Benchmark for Semantics-Aware Learning on Enterprise Tables

【速读】:该论文旨在解决当前表格基础模型在处理企业级结构化数据时,缺乏对语义上下文有效利用的问题,尤其在多表事务数据中难以融合领域知识进行联合推理。解决方案的关键在于构建SALT-KG基准,通过将原始多表交易数据与基于元数据的知识图谱(Metadata Knowledge Graph, OBKG)进行链接,显式建模字段级描述、关系依赖和业务对象类型等语义信息,从而实现对表格证据与上下文语义的联合推理评估。这一设计使模型能够以语义条件化的方式进行关系预测,推动表格基础模型向基于声明性知识的方向发展。

链接: https://arxiv.org/abs/2601.07638
作者: Isaiah Onando Mulang,Felix Sasaki,Tassilo Klein,Jonas Kolk,Nikolay Grechanov,Johannes Hoffart
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building upon the SALT benchmark for relational prediction (Klein et al., 2024), we introduce SALT-KG, a benchmark for semantics-aware learning on enterprise tables. SALT-KG extends SALT by linking its multi-table transactional data with a structured Operational Business Knowledge represented in a Metadata Knowledge Graph (OBKG) that captures field-level descriptions, relational dependencies, and business object types. This extension enables evaluation of models that jointly reason over tabular evidence and contextual semantics, an increasingly critical capability for foundation models on structured data. Empirical analysis reveals that while metadata-derived features yield modest improvements in classical prediction metrics, these metadata features consistently highlight gaps in the ability of models to leverage semantics in relational context. By reframing tabular prediction as semantics-conditioned reasoning, SALT-KG establishes a benchmark to advance tabular foundation models grounded in declarative knowledge, providing the first empirical step toward semantically linked tables in structured data at enterprise scale.
zh

[AI-10] Neural Architecture for Fast and Reliable Coagulation Assessment in Clinical Settings: Leverag ing Thromboelastography AAAI26

【速读】:该论文旨在解决传统血栓弹力图(Thromboelastography, TEG)检测耗时长(约1小时)导致无法实现早期风险预警的问题,以及在小样本数据和患者群体差异显著情况下,传统深度学习方法难以做出可靠预测的挑战。解决方案的关键在于提出一种名为生理状态重建(Physiological State Reconstruction, PSR)的新算法,其核心创新包括:通过多域融合编码器(MDFE)整合多样化时间信号,利用高阶注意力机制(HLA)联合学习高层时间交互关系与注意力权重,并设计参数化动态调整模块(DAM)以保持生命体征计算的稳定性。该方法在4个TEG专用数据集上验证,对凝血特征的预测决定系数R²达0.98,误差较当前最优方法降低约50%,且推理时间缩短一半,展现出在医疗AI领域应对数据稀缺场景的强大潜力。

链接: https://arxiv.org/abs/2601.07618
作者: Yulu Wang,Ziqian Zeng,Jianjun Wu,Zhifeng Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by AAAI26

点击查看摘要

Abstract:In an ideal medical environment, real-time coagulation monitoring can enable early detection and prompt remediation of risks. However, traditional Thromboelastography (TEG), a widely employed diagnostic modality, can only provide such outputs after nearly 1 hour of measurement. The delay might lead to elevated mortality rates. These issues clearly point out one of the key challenges for medical AI development: making reasonable predictions based on very small data sets and accounting for variation between different patient populations, a task where conventional deep learning methods typically perform poorly. We present Physiological State Reconstruction (PSR), a new algorithm specifically designed to take advantage of dynamic changes between individuals and to maximize useful information produced by small amounts of clinical data through mapping to reliable predictions and diagnosis. We develop MDFE to facilitate integration of varied temporal signals using multi-domain learning, and jointly learn high-level temporal interactions together with attentions via HLA; furthermore, the parameterized DAM we designed maintains the stability of the computed vital signs. PSR is evaluated on 4 TEG-specialized data sets and establishes remarkable performance: predictions with R² of 0.98 for coagulation traits, error reduction of around half compared to the state-of-the-art methods, and halving of the inference time. Drift-aware learning suggests a new future, with potential uses well beyond thrombophilia discovery towards medical AI applications with data scarcity.
zh

[AI-11] DIAGPaper: Diagnosing Valid and Specific Weaknesses in Scientific Papers via Multi-Agent Reasoning

【速读】:该论文旨在解决现有基于单智能体或多智能体大语言模型(Large Language Models, LLMs)的论文弱点识别方法中存在的三大核心问题:一是多智能体系统仅表面模拟人类角色,缺乏对专家评估论文互补性智力维度的底层标准;二是现有方法默认识别出的弱点均为有效,忽视了审稿偏见、理解偏差以及作者反驳在验证审稿质量中的关键作用;三是多数系统输出未排序的弱点列表,未能按重要性优先级呈现给用户。解决方案的关键在于提出DIAGPaper框架,其包含三个紧密集成的模块:定制化模块(Customizer)依据人工定义的评审标准生成具备特定专业能力的审稿人代理;反驳模块(Rebuttal)引入作者代理与审稿人代理进行结构化辩论以验证和优化提出的弱点;优先级模块(Prioritizer)基于大规模人类评审实践学习弱点严重程度,并向用户输出前K个最严重的弱点,从而实现更有效、更贴近论文特性的弱点识别与优先排序。

链接: https://arxiv.org/abs/2601.07611
作者: Zhuoyang Zou,Abolfazl Ansari,Delvin Ce Zhang,Dongwon Lee,Wenpeng Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Paper weakness identification using single-agent or multi-agent LLMs has attracted increasing attention, yet existing approaches exhibit key limitations. Many multi-agent systems simulate human roles at a surface level, missing the underlying criteria that lead experts to assess complementary intellectual aspects of a paper. Moreover, prior methods implicitly assume identified weaknesses are valid, ignoring reviewer bias, misunderstanding, and the critical role of author rebuttals in validating review quality. Finally, most systems output unranked weakness lists, rather than prioritizing the most consequential issues for users. In this work, we propose DIAGPaper, a novel multi-agent framework that addresses these challenges through three tightly integrated modules. The customizer module simulates human-defined review criteria and instantiates multiple reviewer agents with criterion-specific expertise. The rebuttal module introduces author agents that engage in structured debate with reviewer agents to validate and refine proposed weaknesses. The prioritizer module learns from large-scale human review practices to assess the severity of validated weaknesses and surfaces the top-K severest ones to users. Experiments on two benchmarks, AAAR and ReviewCritique, demonstrate that DIAGPaper substantially outperforms existing methods by producing more valid and more paper-specific weaknesses, while presenting them in a user-oriented, prioritized manner.
zh

[AI-12] Pheromone-Focused Ant Colony Optimization algorithm for path planning

【速读】:该论文旨在解决传统蚁群优化(Ant Colony Optimization, ACO)算法在复杂环境路径规划中存在盲目搜索行为和收敛速度慢的问题。解决方案的关键在于提出一种信息素聚焦的蚁群优化(Pheromone-Focused Ant Colony Optimization, PFACO)算法,其核心创新包括:1)基于节点到起点与终点的欧氏距离,将初始信息素集中分布于更有潜力的区域,平衡探索与开发;2)在迭代过程中强化优质解的信息素沉积,加速收敛并保持解多样性;3)引入前瞻机制惩罚冗余转向,提升路径平滑性与效率。上述策略协同作用,使信息素聚焦引导蚁群搜索,显著增强全局优化能力,从而在收敛速度和解质量上优于对比算法。

链接: https://arxiv.org/abs/2601.07597
作者: Yi Liu,Hongda Zhang,Zhongxue Gan,Yuning Chen,Ziqing Zhou,Chunlei Meng,Chun Ouyang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Accepted to 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

点击查看摘要

Abstract:Ant Colony Optimization (ACO) is a prominent swarm intelligence algorithm extensively applied to path planning. However, traditional ACO methods often exhibit shortcomings, such as blind search behavior and slow convergence within complex environments. To address these challenges, this paper proposes the Pheromone-Focused Ant Colony Optimization (PFACO) algorithm, which introduces three key strategies to enhance the problem-solving ability of the ant colony. First, the initial pheromone distribution is concentrated in more promising regions based on the Euclidean distances of nodes to the start and end points, balancing the trade-off between exploration and exploitation. Second, promising solutions are reinforced during colony iterations to intensify pheromone deposition along high-quality paths, accelerating convergence while maintaining solution diversity. Third, a forward-looking mechanism is implemented to penalize redundant path turns, promoting smoother and more efficient solutions. These strategies collectively produce the focused pheromones to guide the ant colony’s search, which enhances the global optimization capabilities of the PFACO algorithm, significantly improving convergence speed and solution quality across diverse optimization problems. The experimental results demonstrate that PFACO consistently outperforms comparative ACO algorithms in terms of convergence speed and solution quality.
zh
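
"按节点到起点与终点的欧氏距离集中初始信息素"可写成如下假设性形式(指数衰减与 beta 取值为自拟):节点越接近起终点连线(绕路系数越小),初始信息素越高。

```python
import numpy as np

# 假设性实现:聚焦的初始信息素分布
def focused_pheromone(grid_shape, start, goal, tau0=1.0, beta=3.0):
    H, W = grid_shape
    ys, xs = np.mgrid[0:H, 0:W]
    d_s = np.hypot(ys - start[0], xs - start[1])       # 到起点距离
    d_g = np.hypot(ys - goal[0], xs - goal[1])         # 到终点距离
    d_sg = np.hypot(start[0] - goal[0], start[1] - goal[1])
    # detour = 1 时节点在起终点连线上;越大越"绕路",信息素指数衰减
    detour = (d_s + d_g) / (d_sg + 1e-8)
    return tau0 * np.exp(-beta * (detour - 1.0))

tau = focused_pheromone((20, 20), start=(0, 0), goal=(19, 19))
print(tau.max(), tau.min())   # 连线附近最大,角落最小
```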

[AI-13] Beyond Entangled Planning : Task-Decoupled Planning for Long-Horizon Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在执行长程任务时因规划策略导致的可靠性问题,特别是现有方法中“逐步规划”(step-wise planning)易短视和“一次性规划”(one-shot planning)对执行错误敏感的局限性。其核心问题是任务执行过程中存在上下文纠缠(entangled contexts),即代理需在整个多子任务历史中进行统一推理,导致局部错误传播、恢复成本高且认知负荷大。解决方案的关键在于提出任务解耦规划(Task-Decoupled Planning, TDP),通过监督器(Supervisor)将任务分解为有向无环图(DAG)形式的子目标,并由规划器(Planner)与执行器(Executor)基于限定范围的上下文仅对当前活跃子任务进行推理与重规划,从而实现错误隔离与局部修正,显著提升长程代理的鲁棒性和效率。

链接: https://arxiv.org/abs/2601.07577
作者: Yunfan Li,Bingbing Xu,Xueyun Tian,Xiucheng Xu,Huawei Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled agents to autonomously execute complex, long-horizon tasks, yet planning remains a primary bottleneck for reliable task execution. Existing methods typically fall into two paradigms: step-wise planning, which is reactive but often short-sighted; and one-shot planning, which generates a complete plan upfront yet is brittle to execution errors. Crucially, both paradigms suffer from entangled contexts, where the agent must reason over a monolithic history spanning multiple sub-tasks. This entanglement increases cognitive load and lets local errors propagate across otherwise independent decisions, making recovery computationally expensive. To address this, we propose Task-Decoupled Planning (TDP), a training-free framework that replaces entangled reasoning with task decoupling. TDP decomposes tasks into a directed acyclic graph (DAG) of sub-goals via a Supervisor. Using a Planner and Executor with scoped contexts, TDP confines reasoning and replanning to the active sub-task. This isolation prevents error propagation and corrects deviations locally without disrupting the workflow. Results on TravelPlanner, ScienceWorld, and HotpotQA show that TDP outperforms strong baselines while reducing token consumption by up to 82%, demonstrating that sub-task decoupling improves both robustness and efficiency for long-horizon agents.
zh
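
子目标 DAG 加限定范围上下文的执行骨架可简化为如下假设性示意:按拓扑序执行子任务,每个子任务只看到其直接依赖的结果,从而把推理与(可能的)重规划局限在活跃子任务内。

```python
# 假设性骨架:子目标 DAG 的限定上下文执行
from graphlib import TopologicalSorter

def run_tdp(subgoals: dict, execute):
    """subgoals: {子目标: [前置子目标, ...]};execute(goal, scoped_ctx) -> 结果。"""
    results = {}
    for goal in TopologicalSorter(subgoals).static_order():
        scoped_ctx = {dep: results[dep] for dep in subgoals[goal]}  # 仅含直接依赖
        results[goal] = execute(goal, scoped_ctx)                   # 错误可在此局部重规划
    return results

dag = {"查航班": [], "订酒店": [], "排行程": ["查航班", "订酒店"]}
out = run_tdp(dag, lambda g, ctx: f"done({g}|deps={list(ctx)})")
print(out["排行程"])
```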

[AI-14] d3LLM : Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation

【速读】:该论文旨在解决扩散式大语言模型(Diffusion Large Language Models, dLLMs)在实际应用中面临的准确率与并行性之间的权衡问题,即现有方法通常仅关注效率或性能中的单一维度,难以同时实现高并行性和高准确性。其解决方案的关键在于提出d3LLM(Pseudo-Distilled Diffusion Large Language Model),通过两个核心机制实现平衡:(i) 训练阶段引入伪轨迹蒸馏(pseudo-trajectory distillation),指导模型识别早期步骤中可置信解码的token,从而提升并行性;(ii) 推理阶段采用基于熵的多块解码(entropy-based multi-block decoding)结合KV缓存刷新机制,在保持准确率的同时显著提高并行处理能力。

链接: https://arxiv.org/abs/2601.07568
作者: Yu-Yang Qian,Junda Su,Lanxiang Hu,Peiyuan Zhang,Zhijie Deng,Peng Zhao,Hao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10× speedup over vanilla LLaDA/Dream and 5× speedup over AR models without much accuracy drop. Our code is available at this https URL.
zh
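
基于熵的并行提交可以用如下假设性草图表示:对块内尚未解码的位置计算 token 分布熵,一步提交所有低熵位置;若无位置达标则保底提交熵最小的一个,以保证进度。

```python
import torch

# 假设性实现:块内基于熵的并行 token 提交
def parallel_commit(logits, committed_mask, entropy_thresh=1.0):
    """logits: (L, V);committed_mask: (L,) bool,True 表示已解码。"""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)       # (L,) 逐位置熵
    commit = (~committed_mask) & (entropy < entropy_thresh)
    if not commit.any():                                        # 保底:提交熵最小的一个
        idx = torch.where(~committed_mask)[0]
        commit[idx[entropy[idx].argmin()]] = True
    return probs.argmax(-1), commit

L, V = 16, 1000
logits = torch.randn(L, V)
logits[0, 7] += 12.0                                            # 人为制造一个高置信位置
tokens, commit = parallel_commit(logits, torch.zeros(L, dtype=torch.bool))
print(commit.nonzero().ravel(), tokens[commit])
```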

[AI-15] Backpropagation-Free Test-Time Adaptation for Lightweight EEG-Based Brain-Computer Interfaces

【速读】:该论文旨在解决脑电图(Electroencephalogram, EEG)基脑-机接口(Brain-Computer Interface, BCI)在实际部署中面临的三大挑战:个体间差异性、信号非平稳性以及计算资源受限问题。现有测试时适应(Test-Time Adaptation, TTA)方法依赖显式定义的损失函数并通过反向传播更新模型参数,导致计算开销大、隐私风险高且对噪声数据敏感。论文提出无反向传播变换(Backpropagation-Free Transformations, BFT),其核心在于通过知识引导的数据增强或近似贝叶斯推断对每个测试样本进行多实例变换,生成多个预测得分,并利用学习排序模块加权聚合这些预测结果,在理论保障下实现不确定性抑制与鲁棒推理,从而在不依赖梯度更新的前提下提升模型适应能力与效率。

链接: https://arxiv.org/abs/2601.07556
作者: Siyang Li,Jiayi Ouyang,Zhenyao Cui,Ziwei Wang,Tianwang Jia,Feng Wan,Dongrui Wu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electroencephalogram (EEG)-based brain-computer interfaces (BCIs) face significant deployment challenges due to inter-subject variability, signal non-stationarity, and computational constraints. While test-time adaptation (TTA) mitigates distribution shifts under online data streams without per-use calibration sessions, existing TTA approaches heavily rely on explicitly defined loss objectives that require backpropagation for updating model parameters, which incurs computational overhead, privacy risks, and sensitivity to noisy data streams. This paper proposes Backpropagation-Free Transformations (BFT), a TTA approach for EEG decoding that eliminates such issues. BFT applies multiple sample-wise transformations of knowledge-guided augmentations or approximate Bayesian inference to each test trial, generating multiple prediction scores for a single test sample. A learning-to-rank module enhances the weighting of these predictions, enabling robust aggregation for uncertainty suppression during inference under theoretical justifications. Extensive experiments on five EEG datasets of motor imagery classification and driver drowsiness regression tasks demonstrate the effectiveness, versatility, robustness, and efficiency of BFT. This research enables lightweight plug-and-play BCIs on resource-constrained devices, broadening the real-world deployment of decoding algorithms for EEG-based BCI.
zh
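
多次变换预测的加权聚合可用如下假设性草图说明:论文中的权重来自学习排序模块,这里以"与预测熵负相关的 softmax 权重"作简化替代,体现低熵(高置信)预测占更大权重的思想。

```python
import numpy as np

# 假设性实现:对单个测试样本 K 次变换下的预测做置信加权聚合
def bft_aggregate(pred_scores: np.ndarray) -> np.ndarray:
    """pred_scores: (K, C) 每次变换下的类别概率;返回聚合后的 (C,) 概率。"""
    entropy = -(pred_scores * np.log(pred_scores + 1e-9)).sum(-1)   # (K,)
    w = np.exp(-entropy)
    w = w / w.sum()                          # 低熵(高置信)的预测权重更大
    return (w[:, None] * pred_scores).sum(0)

K, C = 5, 4
rng = np.random.default_rng(4)
scores = rng.dirichlet(np.ones(C), size=K)   # 模拟 K 次变换后的 softmax 输出
print(bft_aggregate(scores), bft_aggregate(scores).sum())   # 聚合概率之和为 1
```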

[AI-16] VirtualEnv: A Platform for Embodied AI Research

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在具身智能(embodied AI)场景中缺乏真实、交互性强且可细粒度评估的仿真环境问题。现有评测体系难以充分检验LLMs在复杂任务中的适应性、规划能力与多智能体协作表现。其解决方案的关键在于构建一个基于Unreal Engine 5的下一代仿真平台VirtualEnv,该平台支持丰富的物体操作、导航、多智能体协作等交互机制,并引入游戏化设计(如密室逃脱和程序生成环境),同时提供自然语言驱动的API接口以实现对LLM代理的灵活控制。通过集成大模型(如GPT系列)和视觉-语言模型(Vision-Language Models, VLMs),VirtualEnv能够从多模态输入自动生成结构化任务与环境,从而为LLMs提供标准化、可扩展的评估基准,推动具身AI与交互式娱乐领域的研究发展。

链接: https://arxiv.org/abs/2601.07553
作者: Kabir Swain,Sijie Han,Ayush Raina,Jin Zhang,Shuang Li,Michael Stopa,Antonio Torralba
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to improve in reasoning and decision-making, there is a growing need for realistic and interactive environments where their abilities can be rigorously evaluated. We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5 that enables fine-grained benchmarking of LLMs in embodied and interactive scenarios. VirtualEnv supports rich agent-environment interactions, including object manipulation, navigation, and adaptive multi-agent collaboration, as well as game-inspired mechanics like escape rooms and procedurally generated environments. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents using natural language instructions. We integrate large-scale LLMs and vision-language models (VLMs), such as GPT-based models, to generate novel environments and structured tasks from multimodal inputs. Our experiments benchmark the performance of several popular LLMs across tasks of increasing complexity, analyzing differences in adaptability, planning, and multi-agent coordination. We also describe our methodology for procedural task generation, task validation, and real-time environment control. VirtualEnv is released as an open-source platform, we aim to advance research at the intersection of AI and gaming, enable standardized evaluation of LLMs in embodied AI settings, and pave the way for future developments in immersive simulations and interactive entertainment.
zh

[AI-17] Graph Inference Towards ICD Coding

【速读】:该论文旨在解决自动化ICD(国际疾病分类,International Classification of Diseases)编码任务中面临的两大挑战:标签空间庞大以及类别极度不平衡导致的预测精度不足问题。其解决方案的关键在于提出一个统一框架LabGraph,将ICD编码重构为图生成任务,并融合对抗域自适应、基于图的强化学习与扰动正则化技术,从而提升模型的鲁棒性和泛化能力;同时引入标签图判别器(label graph discriminator),动态评估每个生成代码并提供自适应奖励反馈,显著优化训练过程与最终性能。

链接: https://arxiv.org/abs/2601.07496
作者: Xiaoxiao Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Automated ICD coding involves assigning standardized diagnostic codes to clinical narratives. The vast label space and extreme class imbalance continue to challenge precise prediction. To address these issues, LabGraph is introduced – a unified framework that reformulates ICD coding as a graph generation task. By combining adversarial domain adaptation, graph-based reinforcement learning, and perturbation regularization, LabGraph effectively enhances model robustness and generalization. In addition, a label graph discriminator dynamically evaluates each generated code, providing adaptive reward feedback during training. Experiments on benchmark datasets demonstrate that LabGraph consistently outperforms previous approaches on micro-F1, micro-AUC, and P@K.
zh

[AI-18] JudgeFlow: Agentic Workflow Optimization via Block Judge

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的智能体(agentic)工作流在规模化扩展时面临的优化难题,即现有方法依赖粗粒度的端到端评估信号,缺乏对具体需改进模块的细粒度诊断能力,导致修改效率低且影响有限。其解决方案的关键在于提出一个Evaluation-Judge-Optimization-Update(EJO-U)流水线:通过引入可复用、可配置的逻辑块(logic blocks)抽象出工作流中的基础逻辑结构,并设计专用的Judge模块分析执行轨迹(特别是失败案例),为问题模块分配基于排名的责任分数(responsibility scores);随后由LLM驱动的优化器聚焦于责任评分最高的模块进行针对性调整,从而实现更高的样本效率、更强的可解释性,并为自动化复杂智能体工作流提供可扩展框架。

链接: https://arxiv.org/abs/2601.07477
作者: Zihan Ma,Zhikai Zhao,Chuanbo Hua,Federico Berto,Jinkyoo Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizing LLM-based agentic workflows is challenging for scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and lack fine-grained signals on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces – particularly failed runs – and assigns rank-based responsibility scores to problematic blocks. These fine-grained diagnostic signals are then leveraged by an LLM-based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through block-level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows. We evaluate JudgeFlow on mathematical reasoning and code generation benchmarks, where it achieves superior performance and efficiency compared to existing methods. The source code is publicly available at this https URL.
zh
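
为便于理解上文 EJO-U(Evaluation-Judge-Optimization-Update)流水线的控制流,下面给出一个极简的 Python 骨架示意(非论文官方实现):其中 run_workflow、judge_blocks、optimize_block 均为假设的占位函数,仅展示“在失败轨迹上做责任归因、再只修改责任最高的块”这一迭代结构。

```python
import random

# 假设的工作流:由若干可配置"逻辑块"组成(名称与实现均为示意)
workflow = {"plan": "v0", "solve": "v0", "verify": "v0"}

def run_workflow(wf, task):
    """占位评估:返回 (是否成功, 执行轨迹)。真实系统中应由 LLM 执行各块。"""
    trace = [(name, cfg) for name, cfg in wf.items()]
    return random.random() < 0.5, trace

def judge_blocks(trace):
    """占位 Judge:对失败轨迹中的块给出基于排名的责任分数(分数越高越可疑)。"""
    ranked = random.sample([name for name, _ in trace], len(trace))
    return {name: len(ranked) - i for i, name in enumerate(ranked)}

def optimize_block(wf, block):
    """占位优化器:仅修改责任最高的块(真实实现由 LLM 重写该块配置)。"""
    wf[block] = wf[block] + "+"

tasks = range(20)
for step in range(5):                       # EJO-U 迭代
    scores = {name: 0.0 for name in workflow}
    for t in tasks:
        ok, trace = run_workflow(workflow, t)
        if not ok:                          # 只对失败运行做细粒度归因
            for name, s in judge_blocks(trace).items():
                scores[name] += s
    worst = max(scores, key=scores.get)     # 聚合责任分数,定位最差块
    optimize_block(workflow, worst)         # Update:只改动该块
    print(f"step {step}: revised block '{worst}' -> {workflow[worst]}")
```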

[AI-19] ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLM s

【速读】:该论文旨在解决在细粒度数值格式(如NVFP4)下高效部署大语言模型(LLM)推理时面临的挑战,尤其是现有后训练量化(Post-Training Quantization, PTQ)策略难以适配的问题:基于旋转的方法破坏了细粒度块的隔离性,平滑技术在4-bit量化下误差显著,而混合精度方法常与硬件对统一精度计算的约束冲突。解决方案的关键在于提出ARCQuant框架,通过引入增强残差通道(Augmented Residual Channels)来保持严格的NVFP4格式统一性——即在激活矩阵中加入量化后的残差通道,将误差补偿过程直接嵌入矩阵运算维度,从而兼容标准且高度优化的GEMM(通用矩阵乘法)内核,实现低开销的高效推理。理论分析表明,其双阶段NVFP4量化最坏误差界可媲美标准8-bit格式(如MXFP8),实验验证其在LLaMA和Qwen模型上达到接近全精度基线的准确率,并在RTX 5090和RTX PRO 6000 GPU上实现最高3倍于FP16的加速效果。

链接: https://arxiv.org/abs/2601.07475
作者: Haoqian Meng,Yilun Luo,Yafei Zhao,Wenyuan Liu,Peng Zhang,Xindian Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of fine-grained numerical formats like NVFP4 presents new opportunities for efficient Large Language Model (LLM) inference. However, it is difficult to adapt existing Post-Training Quantization (PTQ) strategies to these formats: rotation-based methods compromise fine-grained block isolation; smoothing techniques struggle with significant 4-bit quantization errors; and mixed-precision approaches often conflict with hardware constraints on unified-precision computation. To address these challenges, we propose ARCQuant, a framework that boosts NVFP4 performance via Augmented Residual Channels. Distinct from methods that compromise block isolation or hardware uniformity, ARCQuant maintains a strictly unified NVFP4 format by augmenting the activation matrix with quantized residual channels. This design integrates the error compensation process directly into the matrix reduction dimension, enabling the use of standard, highly optimized GEMM kernels with minimal overhead. Theoretical analysis confirms that the worst-case error bound of our dual-stage NVFP4 quantization is comparable to that of standard 8-bit formats such as MXFP8. Extensive experiments on LLaMA and Qwen models demonstrate that ARCQuant achieves state-of-the-art accuracy, comparable to full-precision baselines in perplexity and downstream tasks. Furthermore, deployment on RTX 5090 and RTX PRO 6000 GPUs confirms practical benefits, achieving up to 3x speedup over FP16. Our code is available at this https URL .
zh
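
下面用 NumPy 给出一个可运行的最小示意,帮助理解“把误差补偿折叠进 GEMM 归约维度”的做法:先做简化的分块 4-bit 量化(此处用对称整数网格近似 NVFP4,属于假设性的替代),再把残差的量化结果作为附加通道拼接进激活矩阵,使一次标准矩阵乘同时完成主干计算与残差补偿(为简化起见权重未量化)。

```python
import numpy as np

def blockwise_quant(x, bits=4, block=16):
    """简化的分块对称量化(以整数网格近似 NVFP4,仅作示意)。"""
    qmax = 2 ** (bits - 1) - 1
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.round(xb / scale).clip(-qmax, qmax) * scale
    return q.reshape(x.shape)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 64)).astype(np.float32)   # 激活
W = rng.standard_normal((64, 32)).astype(np.float32)  # 权重

Qx = blockwise_quant(X)              # 第一阶段:量化激活
Qr = blockwise_quant(X - Qx)         # 第二阶段:量化残差
X_aug = np.concatenate([Qx, Qr], axis=1)   # 增强残差通道:归约维 64 -> 128
W_aug = np.concatenate([W, W], axis=0)     # 权重沿归约维复制,GEMM 内核无需改动

err_plain = np.abs(Qx @ W - X @ W).mean()
err_arc   = np.abs(X_aug @ W_aug - X @ W).mean()  # 数学上等价于 (Qx + Qr) @ W
print(f"单次量化误差: {err_plain:.4f}, 残差通道补偿后: {err_arc:.4f}")
```

由于 X_aug @ W_aug = (Qx + Qr) @ W,残差补偿完全发生在矩阵归约维内部,这正是摘要所说“保持统一精度、复用标准 GEMM”的原因。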

[AI-20] Learning How to Remember: A Meta-Cognitive Management Method for Structured and Transferable Agent Memory

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在长期决策任务中因固定记忆表示和单一抽象层次导致的泛化能力不足及分布偏移下的负迁移问题。其解决方案的关键在于提出元认知记忆抽象方法(Meta-Cognitive Memory Abstraction, MCMA),将记忆抽象视为可学习的认知技能而非固定设计,通过冻结的任务模型与一个可学习的记忆协管员(memory copilot)解耦任务执行与记忆管理;该协管员采用直接偏好优化训练,自主决定记忆的结构、抽象与复用策略,并构建多层级抽象记忆体系以实现基于任务相似性的选择性复用,从而显著提升性能、跨分布泛化能力和跨任务迁移效果。

链接: https://arxiv.org/abs/2601.07470
作者: Sirui Liang,Pengfei Cao,Jian Zhao,Wenhao Teng,Xiangwen Liao,Jun Zhao,Kang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents increasingly rely on accumulated memory to solve long-horizon decision-making tasks. However, most existing approaches store memory in fixed representations and reuse it at a single or implicit level of abstraction, which limits generalization and often leads to negative transfer under distribution shift. This paper proposes the Meta-Cognitive Memory Abstraction method (MCMA), which treats memory abstraction as a learnable cognitive skill rather than a fixed design choice. MCMA decouples task execution from memory management by combining a frozen task model with a learned memory copilot. The memory copilot is trained using direct preference optimization; it determines how memories should be structured, abstracted, and reused. Memories are further organized into a hierarchy of abstraction levels, enabling selective reuse based on task similarity. When no memory is transferable, MCMA transfers the ability to abstract and manage memory by transferring the memory copilot. Experiments on ALFWorld, ScienceWorld, and BabyAI demonstrate substantial improvements in performance, out-of-distribution generalization, and cross-task transfer over several baselines.
zh

[AI-21] Knowledge Distillation for LLM -Based Human Activity Recognition in Homes

【速读】:该论文旨在解决家庭环境中人类活动识别(Human Activity Recognition, HAR)的性能与模型效率之间的矛盾问题,即如何在保持高识别精度的同时降低模型复杂度。其解决方案的关键在于利用知识蒸馏(knowledge distillation)技术,通过大型语言模型(Large Language Models, LLMs)生成的推理样本来指导小型LLMs的微调,从而在显著减少参数量(降低50倍)的情况下实现接近大型模型的识别性能。

链接: https://arxiv.org/abs/2601.07469
作者: Julien Cumin,Oussama Er-Rahmany,Xi Chen(UGA)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) is a central problem for context-aware applications, especially for smart homes and assisted living. A few very recent studies have shown that Large Language Models (LLMs) can be used for HAR at home, reaching high performance and addressing key challenges. In this paper, we provide new experimental results regarding the use of LLMs for HAR, on two state-of-the-art datasets. More specifically, we show how recognition performance evolves depending on the size of the LLM used. Moreover, we experiment on the use of knowledge distillation techniques to fine-tune smaller LLMs with HAR reasoning examples generated by larger LLMs. We show that such fine-tuned models can perform almost as well as the largest LLMs, while having 50 times fewer parameters.
zh

[AI-22] Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents

【速读】:该论文旨在解决现有大语言模型(Large Language Model, LLM)代理在记忆建模中对时间维度处理不当的问题,具体表现为:1)时间不准确——记忆按对话时间组织而非实际事件发生时间;2)时间碎片化——仅关注离散点式记忆,忽略了能捕捉持久状态与演化模式的持续性信息(durative memory)。解决方案的关键在于提出一种时序语义记忆(Temporal Semantic Memory, TSM)框架,其核心创新包括:在记忆构建阶段,基于语义时间线而非对话时间线组织记忆,并将时序连续且语义相关的信息整合为持续性记忆;在记忆利用阶段,通过引入查询的时间意图,在语义时间线上检索与时序匹配的持续性记忆,从而提供时间有效且时长一致的上下文支持响应生成。实验表明,TSM在LongMemEval和LoCoMo数据集上显著优于现有方法,最高准确率提升达12.2%。

链接: https://arxiv.org/abs/2601.07468
作者: Miao Su,Yucan Guo,Zhongni Hou,Long Bai,Zixuan Li,Yufei Zhang,Guojun Yin,Wei Lin,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Memory enables Large Language Model (LLM) agents to perceive, store, and use information from past dialogues, which is essential for personalization. However, existing methods fail to properly model the temporal dimension of memory in two aspects: 1) Temporal inaccuracy: memories are organized by dialogue time rather than their actual occurrence time; 2) Temporal fragmentation: existing methods focus on point-wise memory, losing durative information that captures persistent states and evolving patterns. To address these limitations, we propose Temporal Semantic Memory (TSM), a memory framework that models semantic time for point-wise memory and supports the construction and utilization of durative memory. During memory construction, it first builds a semantic timeline rather than a dialogue one. Then, it consolidates temporally continuous and semantically related information into a durative memory. During memory utilization, it incorporates the query’s temporal intent on the semantic timeline, enabling the retrieval of temporally appropriate durative memories and providing time-valid, duration-consistent context to support response generation. Experiments on LongMemEval and LoCoMo show that TSM consistently outperforms existing methods and achieves up to 12.2% absolute improvement in accuracy, demonstrating the effectiveness of the proposed method.
zh
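
下面是一个极简数据结构示意(字段与合并规则均为假设,与论文实现无关),展示“按语义时间线把点式记忆合并为持续性记忆、再按查询的时间意图检索”的流程。

```python
from dataclasses import dataclass

@dataclass
class PointMemory:
    t: int        # 语义时间(事件实际发生时间,而非对话时间)
    topic: str
    text: str

@dataclass
class DurativeMemory:
    start: int
    end: int
    topic: str
    texts: list

def consolidate(points, gap=2):
    """把语义时间连续且主题相同的点式记忆合并为持续性记忆(简化规则)。"""
    out = []
    for p in sorted(points, key=lambda p: p.t):
        if out and out[-1].topic == p.topic and p.t - out[-1].end <= gap:
            out[-1].end = p.t
            out[-1].texts.append(p.text)
        else:
            out.append(DurativeMemory(p.t, p.t, p.topic, [p.text]))
    return out

def retrieve(memories, topic, t_from, t_to):
    """按查询的时间意图(时间区间)检索时序匹配的持续性记忆。"""
    return [m for m in memories
            if m.topic == topic and m.start <= t_to and m.end >= t_from]

points = [PointMemory(1, "跑步", "开始每周跑步"),
          PointMemory(2, "跑步", "跑了 5 公里"),
          PointMemory(8, "跑步", "膝盖不适,暂停跑步")]
mems = consolidate(points)
print(retrieve(mems, "跑步", 0, 3))   # 只命中第 1-2 周的持续性记忆
```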

[AI-23] IFDNS: An Iterative Feedback-Driven Neuro-Symbolic Method for Faithful Logical Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在逻辑推理任务中存在“忠实性不足”的问题,即生成的推理链与最终结论之间缺乏一致性,导致推理过程不可靠。现有基于提示(prompt-based)的方法如Chain-of-Thought(CoT)虽能提升推理能力,但难以准确捕捉复杂逻辑关系并易产生信息丢失。为此,作者提出迭代反馈驱动的神经符号方法(Iterative Feedback-Driven Neuro-Symbolic, IFDNS),其核心在于引入多轮反馈机制,在逻辑提取阶段通过迭代优化准确提取因果关系陈述,并将其转化为命题逻辑和蕴含表达式,从而有效缓解信息损失问题。该方法与现有提示技术正交,可无缝集成至多种 prompting 策略中,实证表明其显著提升了 CoT 和 CoT with Self-Consistency(CoT-SC)在多个数据集上的性能。

链接: https://arxiv.org/abs/2601.07464
作者: Xiaoheng Wang,Tongxuan Liu,Zi Gong,Xianzhe Dong,Yuting Zeng,Minhan Hu,Weizhe Huang,Jing Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities across a wide range of reasoning tasks, including logical and mathematical problem-solving. While prompt-based methods like Chain-of-Thought (CoT) can enhance LLM reasoning abilities to some extent, they often suffer from a lack of faithfulness, where the derived conclusions may not align with the generated reasoning chain. To address this issue, researchers have explored neuro-symbolic approaches to bolster LLM logical reasoning capabilities. However, existing neuro-symbolic methods still face challenges with information loss during the process. To overcome these limitations, we introduce Iterative Feedback-Driven Neuro-Symbolic (IFDNS), a novel prompt-based method that employs a multi-round feedback mechanism to address LLM limitations in handling complex logical relationships. IFDNS utilizes iterative feedback during the logic extraction phase to accurately extract causal relationship statements and translate them into propositional and logical implication expressions, effectively mitigating information loss issues. Furthermore, IFDNS is orthogonal to existing prompt methods, allowing for seamless integration with various prompting approaches. Empirical evaluations across six datasets demonstrate the effectiveness of IFDNS in significantly improving the performance of CoT and Chain-of-Thought with Self-Consistency (CoT-SC). Specifically, IFDNS achieves a +9.40% accuracy boost for CoT on the LogiQA dataset and a +11.70% improvement for CoT-SC on the PrOntoQA dataset.
zh

[AI-24] Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决离线多智能体强化学习(Offline Multi-Agent Reinforcement Learning, Offline MARL)中因数据分布限制导致策略过于保守、难以泛化的问题。现有方法通常局限于原始数据分布,难以有效扩展状态-动作空间;而模型-based方法虽可通过合成数据增强数据集,但在高维、非平稳的多智能体系统中难以准确建模联合动态和奖励函数。论文提出一种局部到全局(Local-to-Global, LOGO)世界模型框架,其核心在于利用易于估计的局部预测来推断全局状态动力学,从而在隐式捕捉个体间依赖关系的同时提升预测精度。进一步地,通过引入不确定性感知采样机制,根据预测不确定性自适应加权合成数据,减少误差传播至策略学习阶段。相较传统基于集成的方法,LOGO仅需额外一个编码器用于不确定性估计,显著降低计算开销并保持性能优势,在8个场景下优于8种基线方法,为可泛化的离线多智能体学习建立了新的模型驱动基准。

链接: https://arxiv.org/abs/2601.07463
作者: Sijia li,Xinran Li,Shibo Chen,Jun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) aims to solve cooperative decision-making problems in multi-agent systems using pre-collected datasets. Existing offline MARL methods primarily constrain training within the dataset distribution, resulting in overly conservative policies that struggle to generalize beyond the support of the data. While model-based approaches offer a promising solution by expanding the original dataset with synthetic data generated from a learned world model, the high dimensionality, non-stationarity, and complexity of multi-agent systems make it challenging to accurately estimate the transitions and reward functions in offline MARL. Given the difficulty of directly modeling joint dynamics, we propose a local-to-global (LOGO) world model, a novel framework that leverages local predictions, which are easier to estimate, to infer global state dynamics, thus improving prediction accuracy while implicitly capturing agent-wise dependencies. Using the trained world model, we generate synthetic data to augment the original dataset, expanding the effective state-action space. To ensure reliable policy learning, we further introduce an uncertainty-aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing approximation error propagation to policies. In contrast to conventional ensemble-based methods, our approach requires only an additional encoder for uncertainty estimation, significantly reducing computational overhead while maintaining accuracy. Extensive experiments across 8 scenarios against 8 baselines demonstrate that our method surpasses state-of-the-art baselines on standard offline MARL benchmarks, establishing a new model-based baseline for generalizable offline multi-agent learning.
zh
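
论文中“按预测不确定性自适应加权合成数据”的思路可以用几行代码示意(不确定性分数的来源与温度参数均为假设):

```python
import numpy as np

def uncertainty_weights(disagreement, temperature=1.0):
    """按预测不确定性给合成样本加权:不确定性越高,采样权重越低。"""
    w = np.exp(-np.asarray(disagreement) / temperature)
    return w / w.sum()

# 假设:编码器对每条合成转移给出的不确定性分数(例如局部-全局预测差异)
unc = np.array([0.1, 0.5, 2.0, 0.2])
w = uncertainty_weights(unc)
rng = np.random.default_rng(0)
idx = rng.choice(len(unc), size=8, p=w, replace=True)   # 加权采样合成数据
print("采样权重:", np.round(w, 3), "| 抽中样本下标:", idx)
```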

[AI-25] RLPO: Residual Listwise Preference Optimization for Long-Context Review Ranking

【速读】:该论文旨在解决电商场景中长文本评论排序(review ranking)面临的挑战,即现有基于大语言模型(Large Language Models, LLMs)的排序方法在长上下文设置下存在效率与准确性之间的权衡问题:点对点(pointwise)评分方法虽然计算高效,但难以捕捉列表级(list-level)交互信息,导致Top-k排名校准不足;而列表级(listwise)方法虽能利用全局上下文,却因计算复杂度高且在候选列表增长时趋于不稳定,难以实用。解决方案的关键在于提出残差式列表偏好优化(Residual Listwise Preference Optimization, RLPO),其核心思想是先通过强大的点对点LLM评分器生成初步校准的得分和项目表征,再引入轻量级编码器对这些表征进行残差修正,从而以较低计算开销实现列表级语义感知,避免了完整的token级列表处理,显著提升了排序质量并保持了长列表下的稳定性。

链接: https://arxiv.org/abs/2601.07449
作者: Hao Jiang,Zhi Yang,Annan Wang,Yichi Zhang,Weisi Lin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Review ranking is pivotal in e-commerce for prioritizing diagnostic and authentic feedback from the deluge of user-generated content. While large language models have improved semantic assessment, existing ranking paradigms face a persistent trade-off in long-context settings. Pointwise scoring is efficient but often fails to account for list-level interactions, leading to miscalibrated top- k rankings. Listwise approaches can leverage global context, yet they are computationally expensive and become unstable as candidate lists grow. To address this, we propose Residual Listwise Preference Optimization (RLPO), which formulates ranking as listwise representation-level residual correction over a strong pointwise LLM scorer. RLPO first produces calibrated pointwise scores and item representations, then applies a lightweight encoder over the representations to predict listwise score residuals, avoiding full token-level listwise processing. We also introduce a large-scale benchmark for long-context review ranking with human verification. Experiments show RLPO improves NDCG@k over strong pointwise and listwise baselines and remains robust as list length increases.
zh
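
下面给出 PyTorch 下的最小示意:假设点式 LLM 打分器已输出每条评论的得分 s 与表征 h,再用一个轻量 TransformerEncoder 在表征层面预测列表级残差并与点式得分相加得到最终排序分;维度与层数等超参均为假设,非论文原始配置。

```python
import torch
import torch.nn as nn

class ResidualListwiseHead(nn.Module):
    """在点式得分之上预测列表级残差(示意实现)。"""
    def __init__(self, d_model=256, nhead=4, nlayers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.proj = nn.Linear(d_model, 1)

    def forward(self, h, s):
        # h: [B, L, d] 点式打分器给出的候选表征; s: [B, L] 点式得分
        r = self.proj(self.encoder(h)).squeeze(-1)  # 列表级残差
        return s + r                                # 最终排序得分

B, L, d = 2, 50, 256
h = torch.randn(B, L, d)        # 假设:来自点式 LLM 打分器的表征
s = torch.randn(B, L)           # 假设:已校准的点式得分
scores = ResidualListwiseHead(d)(h, s)
print(scores.shape)             # torch.Size([2, 50]),可直接用于 NDCG@k 排序
```

这样避免了 token 级的完整列表处理:列表级交互只发生在低维表征上,这正是摘要所述效率来源。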

[AI-26] MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在采用模型上下文协议(Model Context Protocol, MCP)与外部工具交互时所面临的一种新型隐蔽攻击——隐式工具投毒(Implicit Tool Poisoning, ITP)问题。此类攻击不依赖于直接调用恶意工具,而是通过在工具元数据中嵌入恶意指令,诱导代理调用合法但高权限的工具执行恶意操作,从而绕过现有检测机制。解决方案的关键在于提出MCP-ITP框架,其将中毒工具生成建模为黑盒优化问题,并采用迭代优化策略,结合评估大语言模型(evaluation LLM)和检测大语言模型(detection LLM)的反馈,以最大化攻击成功率(Attack Success Rate, ASR)的同时最小化被检测概率(Malicious Tool Detection Rate, MDR),实现在MCP生态系统中的自动化、自适应隐式投毒攻击。

链接: https://arxiv.org/abs/2601.07395
作者: Ruiqi Li,Zhiqiang Wang,Yunhao Yao,Xiang-Yang Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To standardize interactions between LLM-based agents and their environments, the Model Context Protocol (MCP) was proposed and has since been widely adopted. However, integrating external tools expands the attack surface, exposing agents to tool poisoning attacks. In such attacks, malicious instructions embedded in tool metadata are injected into the agent context during MCP registration phase, thereby manipulating agent behavior. Prior work primarily focuses on explicit tool poisoning or relied on manually crafted poisoned tools. In contrast, we focus on a particularly stealthy variant: implicit tool poisoning, where the poisoned tool itself remains uninvoked. Instead, the instructions embedded in the tool metadata induce the agent to invoke a legitimate but high-privilege tool to perform malicious operations. We propose MCP-ITP, the first automated and adaptive framework for implicit tool poisoning within the MCP ecosystem. MCP-ITP formulates poisoned tool generation as a black-box optimization problem and employs an iterative optimization strategy that leverages feedback from both an evaluation LLM and a detection LLM to maximize Attack Success Rate (ASR) while evading current detection mechanisms. Experimental results on the MCPTox dataset across 12 LLM agents demonstrate that MCP-ITP consistently outperforms the manually crafted baseline, achieving up to 84.2% ASR while suppressing the Malicious Tool Detection Rate (MDR) to as low as 0.3%.
zh

[AI-27] Software-Hardware Co-optimization for Modular E2E AV Paradigm: A Unified Framework of Optimization Approaches Simulation Environment and Evaluation Metrics

【速读】:该论文旨在解决模块化端到端(Modular End-to-End, ME2E)自动驾驶系统在实际部署中因模型复杂度增加而导致的推理延迟高和能耗大问题,这些问题常被现有研究忽视。传统模型压缩与加速方法通常仅从软件或硬件单侧优化,难以从根本上消除中间张量访问和算子调度开销(软件侧局限),或受限于模型结构与精度(硬件侧约束),导致优化效果有限。论文提出了一种可复用的软硬件协同优化及闭环评估框架,其关键在于将软件级模型优化与硬件级计算优化统一于系统级目标下联合优化,并引入多维评价指标(涵盖安全性、舒适性、效率、延迟和能耗)实现不同优化策略的定量比较,从而在保持基础驾驶性能的同时显著降低推理延迟与能耗,为ME2E自动驾驶系统的高效部署提供切实可行的指导。

链接: https://arxiv.org/abs/2601.07393
作者: Chengzhi Ji,Xingfeng Li,Zhaodong Lv,Hao Sun,Pan Liu,Hao Frank Yang,Ziyuan Pu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Modular end-to-end (ME2E) autonomous driving paradigms combine modular interpretability with global optimization capability and have demonstrated strong performance. However, existing studies mainly focus on accuracy improvement, while critical system-level factors such as inference latency and energy consumption are often overlooked, resulting in increasingly complex model designs that hinder practical deployment. Prior efforts on model compression and acceleration typically optimize either the software or hardware side in isolation. Software-only optimization cannot fundamentally remove intermediate tensor access and operator scheduling overheads, whereas hardware-only optimization is constrained by model structure and precision. As a result, the real-world benefits of such optimizations are often limited. To address these challenges, this paper proposes a reusable software and hardware co-optimization and closed-loop evaluation framework for ME2E autonomous driving inference. The framework jointly integrates software-level model optimization with hardware-level computation optimization under a unified system-level objective. In addition, a multidimensional evaluation metric is introduced to assess system performance by jointly considering safety, comfort, efficiency, latency, and energy, enabling quantitative comparison of different optimization strategies. Experiments across multiple ME2E autonomous driving stacks show that the proposed framework preserves baseline-level driving performance while significantly reducing inference latency and energy consumption, achieving substantial overall system-level improvements. These results demonstrate that the proposed framework provides practical and actionable guidance for efficient deployment of ME2E autonomous driving systems.
zh

[AI-28] On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

【速读】:该论文试图解决的问题是:在大型语言模型(Large Language Models, LLMs)的后训练过程中,监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)是否可以被解耦(即独立执行而不会导致性能下降)。现有实践中广泛采用交替训练SFT和RL的方法,但缺乏理论支撑来说明这种耦合是否必要。论文的关键解决方案在于通过严格的理论证明揭示了两种训练顺序均不可解耦:首先执行SFT再进行RL会导致RL优化过程显著增加SFT损失(即破坏SFT最优性),反之,若先执行RL再进行SFT,则会降低RL所获得的奖励信号;实验基于Qwen3-0.6B模型验证了上述理论预测的性能退化现象,从而确立了SFT与RL在后训练阶段必须协同优化,无法单独执行而不损害已有性能。

链接: https://arxiv.org/abs/2601.07389
作者: Xueyan Niu,Bo Bai,Wei Han,Weixi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under SFT optimality, and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training stage.
zh

[AI-29] OpenTinker: Separating Concerns in Agentic Reinforcement Learning

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)智能体在强化学习(Reinforcement Learning, RL)训练过程中存在的系统复杂性高、组件耦合性强、资源调度困难等问题。传统端到端的RL流水线难以灵活扩展与维护,且缺乏对训练与推理任务的有效分离。其解决方案的关键在于提出OpenTinker这一基础设施,通过将智能体学习系统分解为轻量、可组合的模块,并明确划分算法设计、执行和智能体-环境交互的职责边界,实现高度解耦的架构;同时引入集中式调度器统一管理LoRA微调、全参数RL、监督微调及推理等多类型工作负载,从而提升训练效率与系统可扩展性,为多智能体训练提供可扩展的设计基础。

链接: https://arxiv.org/abs/2601.07376
作者: Siqi Zhu,Jiaxuan You
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:We introduce OpenTinker, an infrastructure for reinforcement learning (RL) of large language model (LLM) agents built around a separation of concerns across algorithm design, execution, and agent-environment interaction. Rather than relying on monolithic, end-to-end RL pipelines, OpenTinker decomposes agentic learning systems into lightweight, composable components with clearly defined abstraction boundaries. Users specify agents, environments, and interaction protocols, while inference and training are delegated to a managed execution runtime. OpenTinker introduces a centralized scheduler for managing training and inference workloads, including LoRA-based and full-parameter RL, supervised fine-tuning, and inference, over shared resources. We further discuss design principles for extending OpenTinker to multi-agent training. Finally, we present a set of RL use cases that demonstrate the effectiveness of the framework in practical agentic learning scenarios.
zh

[AI-30] On the universal definition of intelligence

【速读】:该论文旨在解决当前人类智能与人工智能(AI)之间缺乏统一、可比的定义问题,以实现公平且一致的比较。现有智能定义多基于人类中心视角,难以用于实证对比,导致学界缺乏共识。其解决方案的关键在于提出扩展预测假说(Extended Predictive Hypothesis, EPH),将智能重新定义为准确预测未来的能力与从预测中获益的能力之结合,并通过区分自发性预测与反应性预测、引入“获益能力”(gainability)概念,构建了一个能够统一解释创造力、学习和未来规划等智能行为的理论框架。此定义兼具理论解释力与实证可行性,是目前最适合作为人类与AI智能比较的通用标准。

链接: https://arxiv.org/abs/2601.07364
作者: Joseph Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper aims to propose a universal definition of intelligence that enables fair and consistent comparison of human and artificial intelligence (AI). With the rapid development of AI technology in recent years, how to compare and evaluate human and AI intelligence has become an important theoretical issue. However, existing definitions of intelligence are anthropocentric and unsuitable for empirical comparison, resulting in a lack of consensus in the research field. This paper first introduces four criteria for evaluating intelligence definitions based on R. Carnap’s methodology of conceptual clarification: similarity to explicandum, exactness, fruitfulness, and simplicity. We then examine six representative definitions: IQ testing, complex problem-solving ability, reward optimization, environmental adaptation, learning efficiency, and predictive ability, and clarify their theoretical strengths and limitations. The results show that while definitions based on predictive ability have high explanatory power and empirical feasibility, they suffer from an inability to adequately explain the relationship between predictions and behavior/benefits. This paper proposes the Extended Predictive Hypothesis (EPH), which views intelligence as a combination of the ability to accurately predict the future and the ability to benefit from those predictions. Furthermore, by distinguishing predictive ability into spontaneous and reactive predictions and adding the concept of gainability, we present a unified framework for explaining various aspects of intelligence, such as creativity, learning, and future planning. In conclusion, this paper argues that the EPH is the most satisfactory and universal definition for comparing human and AI intelligence.
zh

[AI-31] Agentic Diagnostic Reasoning over Telecom and Datacenter Infrastructure

【速读】:该论文旨在解决大规模电信和数据中心基础设施中故障传播导致多客户受影响的根因分析(Root Cause Analysis, RCA)难题。传统方法依赖硬编码的图遍历算法或基于规则的相关性引擎,存在维护成本高且与基础设施模型强耦合的问题。解决方案的关键在于提出一种基于智能体(agent)的诊断框架,利用大语言模型(Large Language Model, LLM)通过模型上下文协议(Model Context Protocol, MCP)暴露的受限工具空间进行逐步推理,自主执行服务查询、依赖关系获取、结构化与非结构化数据访问及事件分析等操作,从而实现对故障的精准定位与影响发现。该框架通过定义结构化的调查协议确保推理的可追溯性、可靠性和对缺失或模糊信息的安全处理,为未来实现自主故障修复与变更影响预测奠定基础。

链接: https://arxiv.org/abs/2601.07342
作者: Nicolas Tacheny
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale telecom and datacenter infrastructures rely on multi-layered service and resource models, where failures propagate across physical and logical components and affect multiple customers. Traditional approaches to root cause analysis (RCA) rely on hard-coded graph traversal algorithms or rule-based correlation engines, which are costly to maintain and tightly coupled to the infrastructure model. In this work, we introduce an agentic diagnostic framework where a Large Language Model (LLM) performs step-wise investigation using a constrained tool space exposed through the Model Context Protocol (MCP). Instead of embedding causal logic or traversal algorithms into the application, the agent autonomously navigates the infrastructure model by invoking tools for service lookup, dependency retrieval, structured and unstructured data access, event analysis, and impact discovery. We define an investigation protocol that structures the agent’s reasoning and ensures grounding, reproducibility, and safe handling of missing or ambiguous information. This work lays the foundation for autonomous incident resolution and change impact mitigation. Future systems will not only diagnose and remediate infrastructure failures, but also predict the impact of planned changes on services and customers, enabling operators to mitigate risks before executing maintenance operations.
zh

[AI-32] Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training

【速读】:该论文旨在解决在稀疏奖励强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)场景下,使用近端策略优化(Proximal Policy Optimization, PPO)训练大语言模型(Large Language Models, LLMs)时因优势估计不可靠而导致的训练不稳定问题。其核心挑战在于:RLVR中的稀疏奖励导致中间状态价值预测不准确,进而通过广义优势估计(Generalized Advantage Estimation, GAE)在每个token处累积时引入显著偏差。解决方案的关键是提出分段优势估计(Segmental Advantage Estimation, SAE),其核心思想是摒弃GAE在每个token上聚合n步优势的做法,转而利用低概率token作为启发式边界将生成序列划分为语义连贯的子段,并仅对这些信息丰富段之间的转移点计算方差降低的优势估计,从而有效过滤中间token带来的噪声,提升优势估计的准确性与训练稳定性。

链接: https://arxiv.org/abs/2601.07320
作者: Xue Gong,Qi Yi,Ziyuan Nan,Guanhua Huang,Kejiao Li,Yuhao Jiang,Ruibin Xiong,Zenan Xu,Jiaming Guo,Shaohui Peng,Bo Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training Large Language Models (LLMs) for reasoning tasks is increasingly driven by Reinforcement Learning with Verifiable Rewards (RLVR), where Proximal Policy Optimization (PPO) provides a principled framework for stable policy updates. However, the practical application of PPO is hindered by unreliable advantage estimation in the sparse-reward RLVR regime. This issue arises because the sparse rewards in RLVR lead to inaccurate intermediate value predictions, which in turn introduce significant bias when aggregated at every token by Generalized Advantage Estimation (GAE). To address this, we introduce Segmental Advantage Estimation (SAE), which mitigates the bias that GAE can incur in RLVR. Our key insight is that aggregating n-step advantages at every token (as in GAE) is unnecessary and often introduces excessive bias, since individual tokens carry minimal information. Instead, SAE first partitions the generated sequence into coherent sub-segments using low-probability tokens as heuristic boundaries. It then selectively computes variance-reduced advantage estimates only from these information-rich segment transitions, effectively filtering out noise from intermediate tokens. Our experiments demonstrate that SAE achieves superior performance, with marked improvements in final scores, training stability, and sample efficiency. These gains are shown to be consistent across multiple model sizes, and a correlation analysis confirms that our proposed advantage estimator achieves a higher correlation with an approximate ground-truth advantage, justifying its superior performance.
zh
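
SAE 的两步(以低概率 token 为启发式边界切分序列,并只在段转移处聚合优势)可以用如下 NumPy 代码示意;其中价值、奖励与阈值均为随意构造,仅展示计算结构,并非论文的完整估计器。

```python
import numpy as np

def segment_boundaries(probs, tau=0.2):
    """以低概率 token 作为段边界(启发式),返回边界下标(含首尾)。"""
    idx = [0] + [i for i, p in enumerate(probs) if p < tau] + [len(probs)]
    return sorted(set(idx))

def segmental_advantages(rewards, values, probs, gamma=1.0, tau=0.2):
    """只在段转移处计算优势:A = r(段内累计) + gamma * V(段末) - V(段首)。"""
    b = segment_boundaries(probs, tau)
    adv = []
    for s, e in zip(b[:-1], b[1:]):
        v_next = values[e] if e < len(values) else 0.0   # 终止段 bootstrap 为 0
        a = rewards[s:e].sum() + gamma * v_next - values[s]
        adv.append((s, e, a))       # 段内所有 token 共享该优势
    return adv

rng = np.random.default_rng(0)
T = 12
probs = rng.uniform(0.05, 0.95, size=T)   # 每个 token 的生成概率
values = rng.uniform(0.0, 1.0, size=T)    # 价值预测(RLVR 下通常不准)
rewards = np.zeros(T); rewards[-1] = 1.0  # 稀疏的可验证奖励
for s, e, a in segmental_advantages(rewards, values, probs):
    print(f"segment [{s},{e}) advantage = {a:+.3f}")
```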

[AI-33] BEAT-Net: Injecting Biomimetic Spatio-Temporal Priors for Interpretable ECG Classification

【速读】:该论文旨在解决深度学习在心电图(ECG)自动诊断中因将信号视为无结构的一维(1D)或二维(2D)数据而导致的数据效率低、模型可解释性差的问题。传统监督方法迫使模型隐式学习生理结构,偏离了医学推理逻辑。其解决方案的关键在于提出BEAT-Net框架,该框架通过QRS分词策略将连续ECG信号转化为生物对齐的心跳序列,并采用专用编码器显式分解心脏生理特征:局部心跳形态由专门模块提取,空间导联视角被归一化,时间节律依赖关系则被建模。这一设计使模型在保持与主流卷积神经网络(CNN)相当诊断准确率的同时,显著提升鲁棒性和数据效率(仅需30–35%标注数据即可恢复全监督性能),且注意力机制自发再现临床启发式规则(如Lead II在节律分析中的优先级),实现无需显式监督的内在可解释性。

链接: https://arxiv.org/abs/2601.07316
作者: Runze Ma,Caizhi Liao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures and 2 tables

点击查看摘要

Abstract:Although deep learning has advanced automated electrocardiogram (ECG) diagnosis, prevalent supervised methods typically treat recordings as undifferentiated one-dimensional (1D) signals or two-dimensional (2D) images. This formulation compels models to learn physiological structures implicitly, resulting in data inefficiency and opacity that diverge from medical reasoning. To address these limitations, we propose BEAT-Net, a Biomimetic ECG Analysis with Tokenization framework that reformulates the problem as a language modeling task. Utilizing a QRS tokenization strategy to transform continuous signals into biologically aligned heartbeat sequences, the architecture explicitly decomposes cardiac physiology through specialized encoders that extract local beat morphology while normalizing spatial lead perspectives and modeling temporal rhythm dependencies. Evaluations across three large-scale benchmarks demonstrate that BEAT-Net matches the diagnostic accuracy of dominant convolutional neural network (CNN) architectures while substantially improving robustness. The framework exhibits exceptional data efficiency, recovering fully supervised performance using only 30 to 35 percent of annotated data. Moreover, learned attention mechanisms provide inherent interpretability by spontaneously reproducing clinical heuristics, such as Lead II prioritization for rhythm analysis, without explicit supervision. These findings indicate that integrating biological priors offers a computationally efficient and interpretable alternative to data-intensive large-scale pre-training.
zh
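
QRS 分词的核心是把连续 ECG 信号切成“心跳 token”。下面在合成信号上用 scipy 的 find_peaks 检测 R 峰并按固定窗口切片;采样率、幅值阈值与窗口长度均为假设值,仅作示意。

```python
import numpy as np
from scipy.signal import find_peaks

fs = 250                                   # 采样率 (Hz),假设值
t = np.arange(0, 10, 1 / fs)
ecg = np.zeros_like(t)
for beat in np.arange(0.5, 10, 0.8):       # 合成:每 0.8 s 一个尖峰近似 R 波
    ecg += np.exp(-((t - beat) ** 2) / (2 * 0.01 ** 2))
ecg += 0.05 * np.random.default_rng(0).standard_normal(t.size)

# R 峰检测:幅值阈值与最小间距(不应期)均为经验假设
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))

half = int(0.3 * fs)                       # 每个"心跳 token"取 R 峰前后 0.3 s
beats = [ecg[p - half:p + half] for p in peaks
         if p - half >= 0 and p + half <= ecg.size]
tokens = np.stack(beats)                   # [心跳数, 窗口长],可送入序列模型
print(tokens.shape)
```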

[AI-34] VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing

【速读】:该论文旨在解决模拟混合信号电路尺寸优化中面临的高维设计空间复杂性与现有自动化方法对电路拓扑利用不足、缺乏可解释性的问题,从而阻碍其在工业界的采纳。解决方案的关键在于提出一种基于视觉语言模型优化的协同代理设计工作流(VLM-CAD),该流程结合图像到网络(Image2Net)对电路原理图进行结构化标注并生成JSON描述以供视觉语言模型精准解析,并引入可解释的信任区域贝叶斯优化方法(ExTuRBO),通过代理生成的种子实现协同预热启动,同时提供双粒度灵敏度分析用于外部尺寸优化,最终生成完整的电路设计报告,实现在180nm、90nm和45nm工艺节点上对放大器电路的高效优化,成功率高达100%,且总运行时间不超过43分钟。

链接: https://arxiv.org/abs/2601.07315
作者: Guanyuan Pan,Yugui Lin,Tiansheng Zhou,Pietro Liò,Shuai Wang,Yaqi Wang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches often underutilize circuit schematics and lack the explainability required for industry adoption. To tackle these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-starting from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance, achieving a 100% success rate in optimizing an amplifier with a complementary input and a class-AB output stage, while maintaining total runtime under 43 minutes across all experiments.
zh

[AI-35] Explaining Machine Learning Predictive Models through Conditional Expectation Methods

【速读】:该论文旨在解决复杂人工智能(AI)和机器学习(ML)模型因缺乏透明性而被视为“黑箱”的问题,这限制了用户对模型决策过程的理解、验证与信任,尤其是在高风险应用场景中。其解决方案的关键在于提出一种模型无关的局部可解释性方法——多变量条件期望(Multivariate Conditional Expectation, MUCE),该方法通过在推理时探索给定观测点邻域内的多维特征值网格,扩展了个体条件期望(Individual Conditional Expectation, ICE)技术,从而捕捉特征交互对预测变化的影响,并提供图形化解释以展示模型预测的局部演化趋势。此外,论文引入两个定量指标——稳定性(stability)和不确定性(uncertainty),用于总结局部行为并评估模型可靠性;其中不确定性进一步细分为不确定性+和不确定性−,以识别全局度量可能忽略的非对称效应。实证结果表明,MUCE能有效刻画复杂模型的局部行为,且相关指标为预测置信度提供了有意义的量化依据,显著提升了预测模型的可解释性与可信度。

链接: https://arxiv.org/abs/2601.07313
作者: Silvia Ruiz-España(1),Laura Arnal(1),François Signol(1),Juan-Carlos Perez-Cortes(1),Joaquim Arlandis(1) ((1) ITI, Universitat Politècnica de València, València, Spain)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 15 figures. Silvia Ruiz-España and Laura Arnal contributed equally to this work

点击查看摘要

Abstract:The rapid adoption of complex Artificial Intelligence (AI) and Machine Learning (ML) models has led to their characterization as black boxes due to the difficulty of explaining their internal decision-making processes. This lack of transparency hinders users’ ability to understand, validate and trust model behavior, particularly in high-risk applications. Although explainable AI (XAI) has made significant progress, there remains a need for versatile and effective techniques to address increasingly complex models. This work introduces Multivariate Conditional Expectation (MUCE), a model-agnostic method for local explainability designed to capture prediction changes from feature interactions. MUCE extends Individual Conditional Expectation (ICE) by exploring a multivariate grid of values in the neighborhood of a given observation at inference time, providing graphical explanations that illustrate the local evolution of model predictions. In addition, two quantitative indices, stability and uncertainty, summarize local behavior and assess model reliability. Uncertainty is further decomposed into uncertainty+ and uncertainty- to capture asymmetric effects that global measures may overlook. The proposed method is validated using XGBoost models trained on three datasets: two synthetic (2D and 3D) to evaluate behavior near decision boundaries, and one transformed real-world dataset to test adaptability to heterogeneous feature types. Results show that MUCE effectively captures complex local model behavior, while the stability and uncertainty indices provide meaningful insight into prediction confidence. MUCE, together with the ICE modification and the proposed indices, offers a practical contribution to local explainability, enabling both graphical and quantitative insights that enhance the interpretability of predictive models and support more trustworthy and transparent decision-making.
zh
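
下面用 NumPy 写一个两特征版本的 MUCE 示意:在观测点邻域内构造多维取值网格、批量求模型预测,并计算 stability 与 uncertainty±。摘要未给出这些指标的具体公式,此处的定义(0.05 的稳定阈值、正负偏差均值)只是假设性的近似。

```python
import numpy as np

def muce(predict, x, feat_ids, radius=0.5, steps=9):
    """在 x 的邻域内对 feat_ids 指定的两个特征做网格扰动并汇总预测变化。"""
    i, j = feat_ids
    base = predict(x[None])[0]
    grid = np.linspace(-radius, radius, steps)
    preds = np.empty((steps, steps))
    for a, da in enumerate(grid):
        for b, db in enumerate(grid):
            xp = x.copy(); xp[i] += da; xp[j] += db
            preds[a, b] = predict(xp[None])[0]
    delta = preds - base
    stability = float((np.abs(delta) < 0.05).mean())   # 假设的稳定性定义
    unc_pos = float(delta[delta > 0].mean()) if (delta > 0).any() else 0.0
    unc_neg = float(delta[delta < 0].mean()) if (delta < 0).any() else 0.0
    return preds, stability, unc_pos, unc_neg

# 用一个简单函数充当黑盒模型
predict = lambda X: 1 / (1 + np.exp(-(2 * X[:, 0] - X[:, 1])))
x = np.array([0.3, -0.2, 1.0])
_, stab, up, un = muce(predict, x, feat_ids=(0, 1))
print(f"stability={stab:.2f}, uncertainty+={up:.3f}, uncertainty-={un:.3f}")
```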

[AI-36] ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging

【速读】:该论文旨在解决当前交互式大语言模型(Large Language Model, LLM)代理在跨环境适应性上的局限性问题,即多数代理模型仅针对单一环境训练,难以在多样化交互场景中保持鲁棒性能。其解决方案的关键在于提出一种激活引导的角色条件神经元移植方法(Agent-Role Merging, ARM),通过三步框架实现:首先构建融合的模型骨干网络,其次基于角色条件激活分析进行神经元选择,最后执行细粒度的神经元移植以优化模型性能。ARM无需梯度优化即可显著提升模型在多轮交互任务中的跨基准泛化能力,并在多个领域表现优于现有模型合并方法及专用专家模型。

链接: https://arxiv.org/abs/2601.07309
作者: Zhuoka Feng,Kang Chen,Sihan Zhao,Kai Xiong,Yaoning Wang,Minshen Yu,Junjie Nian,Changyi Xiao,Yixin Cao,Yugang Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 12 figures. Project page: this https URL

点击查看摘要

Abstract:Interactive large language model agents have advanced rapidly, but most remain specialized to a single environment and fail to adapt robustly to other environments. Model merging offers a training-free alternative by integrating multiple experts into a single model. In this paper, we propose Agent-Role Merging (ARM), an activation-guided, role-conditioned neuron transplantation method for model merging in LLM agents. ARM improves existing merging methods from static natural language tasks to multi-turn agent scenarios, and over the generalization ability across various interactive environments. This is achieved with a well designed 3-step framework: 1) constructing merged backbones, 2) selection based on its role-conditioned activation analysis, and 3) neuron transplantation for fine-grained refinements. Without gradient-based optimization, ARM improves cross-benchmark generalization while enjoying efficiency. Across diverse domains, the model obtained via ARM merging outperforms prior model merging methods and domain-specific expert models, while demonstrating strong out-of-domain generalization.
zh
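
下面用两个小型 MLP 示意“角色条件激活分析 + 神经元移植”的基本操作(模型结构、激活统计量与移植数量均为极简假设):在某一角色的输入上统计专家模型隐藏神经元的平均激活,把最活跃的神经元对应的权重行/列移植进合并骨干。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hid = 16, 32
backbone = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, 4))
expert   = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, 4))

role_inputs = torch.randn(64, d_in)            # 假设:某一"角色"条件下的输入批

with torch.no_grad():
    acts = torch.relu(expert[0](role_inputs))  # 角色条件下专家的隐藏层激活
    importance = acts.abs().mean(0)            # 每个神经元的平均激活强度
    top = importance.topk(k=8).indices         # 选出对该角色最活跃的神经元

    # 神经元移植:隐藏神经元 i 对应第一层权重的第 i 行与第二层权重的第 i 列
    backbone[0].weight[top] = expert[0].weight[top]
    backbone[0].bias[top]   = expert[0].bias[top]
    backbone[2].weight[:, top] = expert[2].weight[:, top]

print("transplanted neurons:", top.tolist())
```

整个过程无需梯度优化,与摘要所述 training-free 的定位一致。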

[AI-37] Heterogeneous Multi-Expert Reinforcement Learning for Long-Horizon Multi-Goal Tasks in Autonomous Forklifts

【速读】:该论文旨在解决非结构化仓库中自主移动操作任务中导航与操作之间的冲突问题,即如何在保证大规模高效导航的同时实现高精度物体交互。传统端到端学习方法因难以同时优化宏观导航决策与微观操作敏感性,常导致优化干扰,从而影响整体性能。其解决方案的关键在于提出一种异构多专家强化学习(Heterogeneous Multi-Expert Reinforcement Learning, HMER)框架,通过语义任务规划器将长时程任务分解为由不同专家控制的子策略,分离宏观导航与微观操作的动作空间,避免相互干扰;同时引入混合模仿-强化训练策略,利用专家示范初始化策略并结合强化学习进行精细调优,有效提升探索效率和任务成功率。实验表明,该方法在Gazebo仿真环境中显著优于基线模型,在任务成功率(94.2% vs 62.5%)、操作时间(减少21.4%)及放置误差(<1.5 cm)方面均取得明显改进。

链接: https://arxiv.org/abs/2601.07304
作者: Yun Chen,Bowei Huang,Fan Guo,Kang Song
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Autonomous mobile manipulation in unstructured warehouses requires a balance between efficient large-scale navigation and high-precision object interaction. Traditional end-to-end learning approaches often struggle to handle the conflicting demands of these distinct phases. Navigation relies on robust decision-making over large spaces, while manipulation needs high sensitivity to fine local details. Forcing a single network to learn these different objectives simultaneously often causes optimization interference, where improving one task degrades the other. To address these limitations, we propose a Heterogeneous Multi-Expert Reinforcement Learning (HMER) framework tailored for autonomous forklifts. HMER decomposes long-horizon tasks into specialized sub-policies controlled by a Semantic Task Planner. This structure separates macro-level navigation from micro-level manipulation, allowing each expert to focus on its specific action space without interference. The planner coordinates the sequential execution of these experts, bridging the gap between task planning and continuous control. Furthermore, to solve the problem of sparse exploration, we introduce a Hybrid Imitation-Reinforcement Training Strategy. This method uses expert demonstrations to initialize the policy and Reinforcement Learning for fine-tuning. Experiments in Gazebo simulations show that HMER significantly outperforms sequential and end-to-end baselines. Our method achieves a task success rate of 94.2% (compared to 62.5% for baselines), reduces operation time by 21.4%, and maintains placement error within 1.5 cm, validating its efficacy for precise material handling.
zh

[AI-38] When Bots Take the Bait: Exposing and Mitigating the Emerging Social Engineering Attack in Web Automation Agent

【速读】:该论文针对当前基于大语言模型(Large Language Models, LLMs)的Web自动化代理(Web Agents)面临的社会工程攻击威胁展开研究,旨在揭示一种新型攻击范式——AgentBait,并提出轻量级运行时防御机制SUPERVISOR以实现对恶意操作的实时抑制。其核心问题在于:现有安全研究多聚焦于模型层面的漏洞(如提示注入和后门),而忽视了代理在执行过程中因环境诱导导致意图偏移所带来的系统性风险。解决方案的关键在于设计一个可插拔的运行时模块SUPERVISOR,通过强制网页上下文与用户目标之间的环境一致性与意图一致性校验,在操作执行前识别并阻断偏离预期任务的危险行为,从而在不显著影响性能(平均仅增加7.7%运行开销)的前提下,将主流框架的攻击成功率平均降低78.1%,有效提升了Web代理生态的安全性与鲁棒性。

链接: https://arxiv.org/abs/2601.07263
作者: Xinyi Wu,Geng Hong,Yueyue Chen,MingXuan Liu,Feier Jin,Xudong Pan,Jiarun Dai,Baojun Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Web agents, powered by large language models (LLMs), are increasingly deployed to automate complex web interactions. The rise of open-source frameworks (e.g., Browser Use, Skyvern-AI) has accelerated adoption, but also broadened the attack surface. While prior research has focused on model threats such as prompt injection and backdoors, the risks of social engineering remain largely unexplored. We present the first systematic study of social engineering attacks against web automation agents and design a pluggable runtime mitigation solution. On the attack side, we introduce the AgentBait paradigm, which exploits intrinsic weaknesses in agent execution: inducement contexts can distort the agent’s reasoning and steer it toward malicious objectives misaligned with the intended task. On the defense side, we propose SUPERVISOR, a lightweight runtime module that enforces environment and intention consistency alignment between webpage context and intended goals to mitigate unsafe operations before execution. Empirical results show that mainstream frameworks are highly vulnerable to AgentBait, with an average attack success rate of 67.5% and peaks above 80% under specific strategies (e.g., trusted identity forgery). Compared with existing lightweight defenses, our module can be seamlessly integrated across different web automation frameworks and reduces attack success rates by up to 78.1% on average while incurring only a 7.7% runtime overhead and preserving usability. This work reveals AgentBait as a critical new threat surface for web agents and establishes a practical, generalizable defense, advancing the security of this rapidly emerging ecosystem. We reported the details of this attack to the framework developers and received acknowledgment before submission.
zh

[AI-39] Pseudodata-guided Invariant Representation Learning Boosts the Out-of-Distribution Generalization in Enzymatic Kinetic Parameter Prediction

【速读】:该论文旨在解决现有基于深度学习的酶-底物相互作用(Enzyme-Substrate Interaction, ESI)预测模型在序列差异较大、分布外(out-of-distribution, OOD)样本上性能下降的问题,从而限制了其在真实生物场景下的鲁棒性与实用性。解决方案的关键在于提出O²DENet——一个轻量级、可插拔的模块,通过引入生物学和化学信息驱动的扰动增强(perturbation augmentation)策略,并强制原始与扰动后酶-底物对表示的一致性,以学习对分布变化具有不变性的特征表示(invariant representation learning),从而显著提升模型在严格基于序列相似性划分的OOD基准测试中对催化常数(kcat)和米氏常数(Km)的预测准确性和鲁棒性。

链接: https://arxiv.org/abs/2601.07261
作者: Haomin Wu,Zhiwei Nie,Hongyu Zhang,Zhixiang Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Accurate prediction of enzyme kinetic parameters is essential for understanding catalytic mechanisms and guiding enzyme engineering. However, existing deep learning-based enzyme-substrate interaction (ESI) predictors often exhibit performance degradation on sequence-divergent, out-of-distribution (OOD) cases, limiting robustness under biologically relevant scenarios. We propose O²DENet, a lightweight, plug-and-play module that enhances OOD generalization via biologically and chemically informed perturbation augmentation and invariant representation learning. O²DENet introduces enzyme-substrate perturbations and enforces consistency between original and augmented enzyme-substrate-pair representations to encourage invariance to distributional shifts. When integrated with representative ESI models, O²DENet consistently improves predictive performance for both k_cat and K_m across stringent sequence-identity-based OOD benchmarks, achieving state-of-the-art results among the evaluated methods in terms of accuracy and robustness. Overall, O²DENet provides a general and effective strategy to enhance the stability and deployability of data-driven enzyme kinetics predictors for real-world enzyme engineering applications.
zh

[AI-40] DDT: A Dual-Masking Dual-Expert Transformer for Energy Time-Series Forecasting

【速读】:该论文旨在解决能源时间序列预测中因复杂时序依赖关系和多源数据异构性所带来的高精度建模难题。其解决方案的关键在于提出了一种名为DDT的深度学习框架,包含两项核心创新:一是设计了双掩码机制(dual-masking mechanism),通过严格因果掩码与数据驱动的动态掩码协同作用,在保证理论因果一致性的同时自适应聚焦最显著的历史信息;二是构建了双专家系统(dual-expert system),将时序动态建模与跨变量相关性建模解耦为并行专业化路径,并通过动态门控融合模块实现智能集成,从而提升模型对复杂能源数据的表征能力与预测精度。

链接: https://arxiv.org/abs/2601.07250
作者: Mingnan Zhu,Qixuan Zhang,Yixuan Cheng,Fangzhou Gu,Shiming Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate energy time-series forecasting is crucial for ensuring grid stability and promoting the integration of renewable energy, yet it faces significant challenges from complex temporal dependencies and the heterogeneity of multi-source data. To address these issues, we propose DDT, a novel and robust deep learning framework for high-precision time-series forecasting. At its core, DDT introduces two key innovations. First, we design a dual-masking mechanism that synergistically combines a strict causal mask with a data-driven dynamic mask. This novel design ensures theoretical causal consistency while adaptively focusing on the most salient historical information, overcoming the rigidity of traditional masking techniques. Second, our architecture features a dual-expert system that decouples the modeling of temporal dynamics and cross-variable correlations into parallel, specialized pathways, which are then intelligently integrated through a dynamic gated fusion module. We conducted extensive experiments on 7 challenging energy benchmark datasets, including ETTh, Electricity, and Solar. The results demonstrate that DDT consistently outperforms strong state-of-the-art baselines across all prediction horizons, establishing a new benchmark for the task.
zh

[AI-41] Stochastic CHAOS: Why Deterministic Inference Kills and Distributional Variability Is the Heartbeat of Artificial Cognition

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中追求确定性输出所带来的认知局限与风险问题。传统做法将确定性推理视为可复现性和企业可靠性的前提,但本文指出,这种强制性的比特级一致输出会抑制模型对不确定性的建模能力、削弱涌现能力、固化推理路径并隐藏尾部风险,从而损害安全性与诊断价值。其解决方案的关键在于提出“随机CHAOS”(Stochastic CHAOS)范式,主张将分布变异性视为可测量和可控的信号,而非需要消除的噪声;通过多样本采样和分布感知评估,揭示模型的真实能力边界、脆弱性及潜在危险行为,从而实现更符合人工认知本质的推理机制。

链接: https://arxiv.org/abs/2601.07239
作者: Tanmay Joshi,Shourya Aggarwal,Anusa Saha,Aadi Pandey,Shreyash Dhoot,Vighnesh Rai,Raxit Goswami,Aman Chadha,Vinija Jain,Amitava Das
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deterministic inference is a comforting ideal in classical software: the same program on the same input should always produce the same output. As large language models move into real-world deployment, this ideal has been imported wholesale into inference stacks. Recent work from the Thinking Machines Lab has presented a detailed analysis of nondeterminism in LLM inference, showing how batch-invariant kernels and deterministic attention can enforce bitwise-identical outputs, positioning deterministic inference as a prerequisite for reproducibility and enterprise reliability. In this paper, we take the opposite stance. We argue that, for LLMs, deterministic inference kills. It kills the ability to model uncertainty, suppresses emergent abilities, collapses reasoning into a single brittle path, and weakens safety alignment by hiding tail risks. LLMs implement conditional distributions over outputs, not fixed functions. Collapsing these distributions to a single canonical completion may appear reassuring, but it systematically conceals properties central to artificial cognition. We instead advocate Stochastic CHAOS, treating distributional variability as a signal to be measured and controlled. Empirically, we show that deterministic inference is systematically misleading. Single-sample deterministic evaluation underestimates both capability and fragility, masking failure probability under paraphrases and noise. Phase-like transitions associated with emergent abilities disappear under greedy decoding. Multi-path reasoning degrades when forced onto deterministic backbones, reducing accuracy and diagnostic insight. Finally, deterministic evaluation underestimates safety risk by hiding rare but dangerous behaviors that appear only under multi-sample evaluation.
zh

[AI-42] Group Pattern Selection Optimization: Let LRMs Pick the Right Pattern for Reasoning

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在面对不同问题时,因训练过程中隐式偏向于少数主导推理模式(如直接求解、反思验证和多方案探索等),导致其默认推理策略常对特定问题不最优的问题。解决方案的关键在于提出一种基于强化学习的分组模式选择优化框架(Group Pattern Selection Optimization, GPSO),该框架通过引入多模式轨迹采样、基于验证器引导的每题最优模式选择机制,以及优化过程中的注意力掩码设计以防止显式模式后缀泄露至策略函数中,从而实现从问题特征到最优推理模式的映射内化,显著提升模型在数学与科学基准上的鲁棒性和适应性表现。

链接: https://arxiv.org/abs/2601.07238
作者: Hanbin Wang,Jingwei Song,Jinpeng Li,Fei Mi,Lifeng Shang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Large reasoning models (LRMs) exhibit diverse high-level reasoning patterns (e.g., direct solution, reflection-and-verification, and exploring multiple solutions), yet prevailing training recipes implicitly bias models toward a limited set of dominant patterns. Through a systematic analysis, we identify substantial accuracy variance across these patterns on mathematics and science benchmarks, revealing that a model’s default reasoning pattern is often sub-optimal for a given problem. To address this, we introduce Group Pattern Selection Optimization (GPSO), a reinforcement learning framework that extends GRPO by incorporating multi-pattern rollouts, verifier-guided optimal pattern selection per problem, and attention masking during optimization to prevent the leakage of explicit pattern suffixes into the learned policy. By exploring a portfolio of diverse reasoning strategies and optimizing the policy on the most effective ones, GPSO enables the model to internalize the mapping from problem characteristics to optimal reasoning patterns. Extensive experiments demonstrate that GPSO delivers consistent and substantial performance gains across various model backbones and benchmarks, effectively mitigating pattern sub-optimality and fostering more robust, adaptable reasoning. All data and codes are available at this https URL.
zh
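
GPSO 的“多模式采样 → 验证器选模式 → 掩码模式后缀”流程可用如下骨架示意;rollout 为占位函数,模式后缀与掩码粒度均为假设,仅展示控制流而非论文实现。

```python
import numpy as np

PATTERNS = ["直接求解。", "先反思再验证。", "探索多种解法。"]   # 假设的模式后缀

def rollout(problem, pattern, k=4):
    """占位:在"问题+模式后缀"提示下采样 k 条轨迹,返回 (轨迹, 验证得分)。"""
    rng = np.random.default_rng(hash((problem, pattern)) % 2**32)
    return [(f"[{pattern}]解答{i}", float(rng.random())) for i in range(k)]

def select_best_pattern(problem):
    """验证器引导的模式选择:取平均验证得分最高的模式及其最优轨迹。"""
    stats = {}
    for p in PATTERNS:
        trajs = rollout(problem, p)
        stats[p] = (np.mean([s for _, s in trajs]),
                    max(trajs, key=lambda x: x[1]))
    best = max(stats, key=lambda p: stats[p][0])
    return best, stats[best][1][0]

def loss_mask(prompt_len, suffix_len, resp_len):
    """损失掩码:模式后缀 token 不参与优化,防止后缀泄露进策略。"""
    mask = np.ones(prompt_len + suffix_len + resp_len)
    mask[:prompt_len + suffix_len] = 0.0     # 仅对回复 token 计损失
    return mask

best, traj = select_best_pattern("求 1+...+100")
print("选中模式:", best, "| 最优轨迹:", traj)
print("损失掩码:", loss_mask(prompt_len=5, suffix_len=3, resp_len=4))
```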

[AI-43] From “Thinking” to “Justifying”: Aligning High-Stakes Explainability with Professional Communication Standards

【速读】:该论文旨在解决生成式AI在高风险领域中可解释性不足的问题,即Chain-of-Thought(CoT)方法虽能提供推理过程,但其逻辑漏洞或幻觉可能导致结论与理由不一致,从而削弱用户对系统输出的信任与验证能力。解决方案的关键在于提出“Result - Justify”结构,要求先呈现结论再进行结构化论证,并引入SEF(Structured Explainability Framework),通过六项指标量化专业写作规范(如CREAC、BLUF)的结构与事实一致性,从而提升解释的可验证性和可靠性。实验表明,该方法在多个任务中显著优于传统CoT,准确率提升5.3个百分点(达83.9%)。

链接: https://arxiv.org/abs/2601.07233
作者: Chen Qian,Yimeng Wang,Yu Chen,Lingfei Wu,Andreas Stathopoulos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Explainable AI (XAI) in high-stakes domains should help stakeholders trust and verify system outputs. Yet Chain-of-Thought methods reason before concluding, and logical gaps or hallucinations can yield conclusions that do not reliably align with their rationale. Thus, we propose “Result - Justify”, which constrains the output communication to present a conclusion before its structured justification. We introduce SEF (Structured Explainability Framework), operationalizing professional conventions (e.g., CREAC, BLUF) via six metrics for structure and grounding. Experiments across four tasks in three domains validate this approach: all six metrics correlate with correctness (r = 0.20-0.42; p < 0.001), and SEF achieves 83.9% accuracy (+5.3 over CoT). These results suggest structured justification can improve verifiability and may also improve reliability.
zh

[AI-44] Yes FLoReNce I Will Do Better Next Time! Agentic Feedback Reasoning for Humorous Meme Detection AAAI2026

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在理解幽默表情包(meme)时存在的局限性问题,即现有多模态或基于提示(prompting)的模型仅能进行单向推理,缺乏对自身推理过程的批判与迭代优化能力,导致其在处理依赖语境、讽刺或社会评论等复杂意图时表现不佳。解决方案的关键在于提出 FLoReNce 框架,该框架将 meme 理解建模为学习阶段的闭环推理过程和推理阶段的开环流程:在训练中,通过一个评判代理(judge)对推理代理的输出进行反馈,将错误信息和语义反馈转化为控制信号并存储至非参数化知识库(knowledge base, KB);在推理时,模型从 KB 中检索相似的已评判经验以调节提示(prompt),从而实现无需微调的自洽推理增强,显著提升预测准确性和解释质量。

链接: https://arxiv.org/abs/2601.07232
作者: Olivia Shanhong Liu,Pai Chet Ng,De Wen Soh,Konstantinos N. Plataniotis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: LaMAS@AAAI 2026 (Oral)

点击查看摘要

Abstract:Humorous memes blend visual and textual cues to convey irony, satire, or social commentary, posing unique challenges for AI systems that must interpret intent rather than surface correlations. Existing multimodal or prompting-based models generate explanations for humor but operate in an open loop,lacking the ability to critique or refine their reasoning once a prediction is made. We propose FLoReNce, an agentic feedback reasoning framework that treats meme understanding as a closed-loop process during learning and an open-loop process during inference. In the closed loop, a reasoning agent is critiqued by a judge; the error and semantic feedback are converted into control signals and stored in a feedback-informed, non-parametric knowledge base. At inference, the model retrieves similar judged experiences from this KB and uses them to modulate its prompt, enabling better, self-aligned reasoning without finetuning. On the PrideMM dataset, FLoReNce improves both predictive performance and explanation quality over static multimodal baselines, showing that feedback-regulated prompting is a viable path to adaptive meme humor understanding.

[AI-45] DiSCo: Making Absence Visible in Intelligent Summarization Interfaces

[Quick Read]: This paper tackles the pervasive presence bias of intelligent summarization interfaces: large language models over-attend to what is mentioned and overlook missing information, which can mislead users who rely on summaries for decisions. The key to the solution is an expectation-based computational approach, Domain Informed Summarization through Contrast (DiSCo), which compares each entity's content against a reference distribution of aspects typically discussed in comparable accommodations, identifies aspects that are unusually emphasized or absent relative to domain norms, and integrates these absences into the generated text, improving the transparency and decision-support value of summaries.

Link: https://arxiv.org/abs/2601.07229
Authors: Eran Fainman, Hagit Ben Shoshan, Adir Solomon, Osnat Mokryn
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Intelligent interfaces increasingly use large language models to summarize user-generated content, yet these summaries emphasize what is mentioned while overlooking what is missing. This presence bias can mislead users who rely on summaries to make decisions. We present Domain Informed Summarization through Contrast (DiSCo), an expectation-based computational approach that makes absences visible by comparing each entity’s content with domain topical expectations captured in reference distributions of aspects typically discussed in comparable accommodations. This comparison identifies aspects that are either unusually emphasized or missing relative to domain norms and integrates them into the generated text. In a user study across three accommodation domains, namely ski, beach, and city center, DiSCo summaries were rated as more detailed and useful for decision making than baseline large language model summaries, although slightly harder to read. The findings show that modeling expectations reduces presence bias and improves both transparency and decision support in intelligent summarization interfaces.
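The contrast step can be pictured in a few lines: compare an entity's observed aspect frequencies against a domain reference distribution and flag what is missing or over-emphasized. A minimal sketch under assumed inputs; the tolerance and the aspect-frequency representation are illustrative, not DiSCo's exact procedure.

```python
def contrast_with_domain(entity_freq, domain_ref, tol=0.5):
    """Compare an entity's aspect frequencies against domain expectations.
    entity_freq / domain_ref: dicts mapping aspect -> relative frequency.
    Returns aspects that look missing or unusually emphasized."""
    missing, emphasized = [], []
    for aspect, expected in domain_ref.items():
        observed = entity_freq.get(aspect, 0.0)
        if observed < (1 - tol) * expected:
            missing.append(aspect)
        elif observed > (1 + tol) * expected:
            emphasized.append(aspect)
    return missing, emphasized

# e.g., for a ski accommodation (made-up numbers):
# contrast_with_domain({"location": 0.4}, {"location": 0.3, "ski storage": 0.2})
# -> (["ski storage"], [])   # expected in this domain but never mentioned
```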

[AI-46] Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

[Quick Read]: This paper addresses the lack of principled data-allocation mechanisms in the prevailing hybrid supervised fine-tuning (SFT) plus reinforcement learning (RL) training paradigm: existing strategies rely on surface heuristics that cannot diagnose the model's intrinsic learning needs, misaligning data with the functional roles of the SFT and RL stages and causing optimization interference. The key to the solution is PRISM, a framework grounded in Schema Theory that arbitrates data by its degree of cognitive conflict with the model's existing knowledge: by analyzing the spatial geometric structure of gradients, data producing highly concentrated gradient signals is judged high-conflict and routed to RL for structural restructuring, while data yielding diffuse updates is routed to SFT for pattern consolidation. This dynamics-aware arbitration allocates resources more efficiently, and experiments on WebShop and ALFWorld confirm a Pareto improvement in both performance and computational cost.

Link: https://arxiv.org/abs/2601.07224
Authors: Yang Zhao, Yangou Ouyang, Xiao Ding, Hepeng Wang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22×. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
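A simple way to picture the gradient-concentration signal is a Herfindahl-style index over normalized gradient magnitudes. This is a plausible proxy for the abstract's "spatial concentration", not PRISM's actual measure; the routing threshold is invented for illustration.

```python
import torch

def gradient_concentration(grad: torch.Tensor) -> float:
    """Herfindahl-style concentration of gradient mass over dimensions:
    close to 1.0 when the update focuses on a few directions, close to
    1/d when diffuse. A proxy, not PRISM's exact measure."""
    p = grad.abs().flatten()
    p = p / (p.sum() + 1e-12)
    return float((p ** 2).sum())

def route_sample(grad: torch.Tensor, threshold: float = 0.01) -> str:
    # High concentration -> high cognitive conflict -> RL;
    # diffuse gradient -> consolidate via SFT. Threshold is illustrative.
    return "RL" if gradient_concentration(grad) > threshold else "SFT"
```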

[AI-47] LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing

[Quick Read]: This paper addresses the lack of unified evaluation and validated effectiveness in large language model (LLM) routing, i.e., selecting the best model in an ensemble for each query while balancing performance and cost. The key to the solution is LLMRouterBench, a large-scale standardized benchmark covering over 400K instances, 21 datasets, and 33 models, with comprehensive performance metrics and 10 representative baselines. It enables a systematic re-evaluation of existing routing strategies, revealing a substantial gap between current mainstream methods and the Oracle, with model-selection bias and recall failures as the main bottlenecks, thereby motivating finer-grained research on model curation and routing mechanisms.

Link: https://arxiv.org/abs/2601.07206
Authors: Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, Zhen Wang, Shuyue Hu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance-oriented routing and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity-the central premise of LLM routing-we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at this https URL.

[AI-48] CalPro: Prior-Aware Evidential–Conformal Prediction with Structure-Aware Guarantees for Protein Structures

[Quick Read]: This paper addresses the miscalibration of confidence estimates (e.g., pLDDT) in deep protein structure predictors such as AlphaFold under distribution shift, where performance degrades markedly across experimental modalities, temporal changes, and intrinsically disordered regions. The key to the solution is CalPro, a prior-aware evidential-conformal framework whose core components are: (1) a graph-based geometric evidential head that outputs Normal-Inverse-Gamma predictive distributions; (2) a differentiable conformal layer enabling end-to-end training with finite-sample coverage guarantees; and (3) domain priors such as disorder and flexibility encoded as soft constraints. Using PAC-Bayesian bounds over ambiguity sets, CalPro derives structure-aware coverage guarantees and maintains near-nominal coverage under shift with tighter intervals, clearly outperforming baseline methods.

Link: https://arxiv.org/abs/2601.07201
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep protein structure predictors such as AlphaFold provide confidence estimates (e.g., pLDDT) that are often miscalibrated and degrade under distribution shifts across experimental modalities, temporal changes, and intrinsically disordered regions. We introduce CalPro, a prior-aware evidential-conformal framework for shift-robust uncertainty quantification. CalPro combines (i) a geometric evidential head that outputs Normal-Inverse-Gamma predictive distributions via a graph-based architecture; (ii) a differentiable conformal layer that enables end-to-end training with finite-sample coverage guarantees; and (iii) domain priors (disorder, flexibility) encoded as soft constraints. We derive structure-aware coverage guarantees under distribution shift using PAC-Bayesian bounds over ambiguity sets, and show that CalPro maintains near-nominal coverage while producing tighter intervals than standard conformal methods in regions where priors are informative. Empirically, CalPro exhibits at most 5% coverage degradation across modalities (vs. 15-25% for baselines), reduces calibration error by 30-50%, and improves downstream ligand-docking success by 25%. Beyond proteins, CalPro applies to structured regression tasks in which priors encode local reliability, validated on non-biological benchmarks.
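For readers unfamiliar with evidential heads, the sketch below turns Normal-Inverse-Gamma (NIG) parameters into a predictive interval using the standard deep-evidential-regression variance formula (Amini et al., 2020). CalPro additionally calibrates such intervals with its conformal layer, which is not shown here.

```python
import math

def nig_predictive_interval(gamma, nu, alpha, beta, z=1.96):
    """Predictive mean and interval from Normal-Inverse-Gamma parameters
    (gamma, nu, alpha, beta), as output by an evidential regression head.
    Predictive variance = beta * (1 + nu) / (nu * (alpha - 1)), alpha > 1.
    Sketch only; CalPro's conformal calibration is omitted."""
    var = beta * (1.0 + nu) / (nu * (alpha - 1.0))
    half = z * math.sqrt(var)
    return gamma - half, gamma + half

# nig_predictive_interval(0.0, nu=2.0, alpha=3.0, beta=1.0) -> (-1.2, 1.2) approx.
```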

[AI-49] Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment

[Quick Read]: This paper addresses the erosion of large language models' (LLMs) inherent safety alignment during fine-tuning, which occurs even with seemingly innocuous datasets. Existing defenses mostly rely on heuristic, instance-level data filtering that ignores the global geometry of the data distribution and cannot explicitly repel harmful patterns. The key to the solution is Safety Optimal Transport (SOT), which reframes safe fine-tuning from instance-level filtering into a distribution-level alignment task grounded in Optimal Transport (OT). Its core is a dual-reference "push-pull" weight-learning mechanism: the downstream distribution is actively pulled toward a trusted safe anchor while being pushed away from a general harmful reference, establishing a robust geometric safety boundary that purifies the training data, preserving downstream performance while substantially improving safety and achieving a better safety-utility trade-off than baselines.

Link: https://arxiv.org/abs/2601.07200
Authors: Haozhong Wang, Zhuo Li, Yibo Yang, He Zhao, Hongyuan Zha, Dandan Guo
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The inherent safety alignment of Large Language Models (LLMs) is prone to erosion during fine-tuning, even when using seemingly innocuous datasets. While existing defenses attempt to mitigate this via data selection, they typically rely on heuristic, instance-level assessments that neglect the global geometry of the data distribution and fail to explicitly repel harmful patterns. To address this, we introduce Safety Optimal Transport (SOT), a novel framework that reframes safe fine-tuning from an instance-level filtering challenge to a distribution-level alignment task grounded in Optimal Transport (OT). At its core is a dual-reference "push-pull" weight-learning mechanism: SOT optimizes sample importance by actively pulling the downstream distribution towards a trusted safe anchor while simultaneously pushing it away from a general harmful reference. This establishes a robust geometric safety boundary that effectively purifies the training data. Extensive experiments across diverse model families and domains demonstrate that SOT significantly enhances model safety while maintaining competitive downstream performance, achieving a superior safety-utility trade-off compared to baselines.
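The push-pull idea can be sketched as a sample reweighting given per-sample transport costs to the two references. The costs, the lambda trade-off, and the softmax normalization below are assumptions for illustration, not the paper's objective.

```python
import torch

def push_pull_weights(cost_to_safe, cost_to_harmful, lam=1.0, temperature=1.0):
    """Sketch of a dual-reference 'push-pull' sample weighting: samples
    close to the safe anchor (low transport cost) and far from the harmful
    reference (high transport cost) receive larger weights. The per-sample
    costs would come from an OT coupling; here they are assumed given."""
    score = -cost_to_safe + lam * cost_to_harmful    # pull safe, push harmful
    return torch.softmax(score / temperature, dim=0)  # normalized importances
```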

[AI-50] Forward versus Backward: Comparing Reasoning Objectives in Direct Preference Optimization

[Quick Read]: This paper addresses the frequent generation of plausible but incorrect answers (hallucinations) during LLM reasoning, which undermines reliability. The key to the solution is designing two complementary training signals via Direct Preference Optimization (DPO): forward chain-of-thought generation, which trains the model to produce correct reasoning traces, and backward verification, which trains it to recognize and acknowledge errors in candidate solutions. Experiments show the two objectives provide complementary learning signals: forward training substantially improves problem solving, backward training improves verification calibration and lowers the false positive rate, and both increase the model's confidence in its outputs.

Link: https://arxiv.org/abs/2601.07199
Authors: Murtaza Nikzad, Raghuram Ramanujan
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models exhibit impressive reasoning capabilities yet frequently generate plausible but incorrect solutions, a phenomenon commonly termed hallucination. This paper investigates the effect of training objective composition on reasoning reliability through Direct Preference Optimization. Two complementary training signals are examined: forward chain-of-thought generation, which trains the model to produce correct reasoning traces, and backward verification, which trains the model to verify and acknowledge errors in candidate solutions. Experiments on GSM8K reveal a fundamental trade-off between these objectives. Forward-only DPO training achieves the highest accuracy improvement, increasing from 83.1% to 86.6% (+3.5 percentage points), while backward-only training yields minimal accuracy gains but substantially reduces the false positive rate from 13.4% to 4.3%. Notably, both training variants reduce acknowledgement rate compared to the baseline, suggesting that preference optimization increases model confidence in its outputs. These findings indicate that forward and backward reasoning objectives provide distinct and complementary learning signals: forward training improves problem-solving capability, while backward training improves verification calibration. The complete training and evaluation pipeline, implemented efficiently through Low-Rank Adaptation, is released to facilitate further research.
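Both training variants share the standard DPO objective; only the construction of the preference pairs differs (forward: chosen = correct reasoning trace; backward: chosen = verification that correctly flags an error). A minimal sketch, with sequence-level log-probabilities assumed precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective (Rafailov et al., 2023): pi_* are summed
    token log-probs under the policy, ref_* under the frozen reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```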

[AI-51] Beyond Variance: Knowledge-Aware LLM Compression via Fisher-Aligned Subspace Diagnostics

[Quick Read]: This paper addresses the loss of factual knowledge caused by activation compression when deploying large language models (LLMs) on resource-constrained hardware. Standard methods such as singular value decomposition (SVD) select components by variance alone, ignoring gradient sensitivity and thus potentially retaining high-variance dimensions that do little for knowledge preservation. The key to the solution is Fisher-Aligned Subspace Compression (FASC), which models the coupling between activations and gradients and uses the Fisher information matrix to identify low-variance but gradient-sensitive subspaces critical for factual knowledge. It further introduces the Dependence Violation Score (ρ) as a diagnostic metric quantifying activation-gradient coupling, enabling knowledge-aware compression. Experiments show FASC preserves 6-8% more factual accuracy than variance-based methods at 50% rank reduction, allowing a 7B model to match the recall of an uncompressed 13B model.

Link: https://arxiv.org/abs/2601.07197
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Post-training activation compression is essential for deploying Large Language Models (LLMs) on resource-constrained hardware. However, standard methods like Singular Value Decomposition (SVD) are gradient-blind: they preserve high-variance dimensions regardless of their impact on factual knowledge preservation. We introduce Fisher-Aligned Subspace Compression (FASC), a knowledge-aware compression framework that selects subspaces by directly modeling activation-gradient coupling, minimizing a second-order surrogate of the loss function. FASC leverages the Fisher Information Matrix to identify dimensions critical for factual knowledge, which often reside in low-variance but high-gradient-sensitivity subspaces. We propose the Dependence Violation Score (ρ) as a general-purpose diagnostic metric that quantifies activation-gradient coupling, revealing where factual knowledge is stored within transformer architectures. Extensive experiments on Mistral-7B and Llama-3-8B demonstrate that FASC preserves 6-8% more accuracy on knowledge-intensive benchmarks (MMLU, LAMA) compared to variance-based methods at 50% rank reduction, effectively enabling a 7B model to match the factual recall of a 13B uncompressed model. Our analysis reveals that ρ serves as a fundamental signal of stored knowledge, with high-ρ layers emerging only when models internalize factual associations during training.
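The sketch below contrasts gradient-sensitivity-based dimension selection with variance-based selection and gives one plausible formula for ρ as an activation-gradient correlation. Both are illustrative readings of the abstract, not the paper's exact definitions.

```python
import torch

def fisher_rank_dims(grads: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k dimensions with the highest mean squared gradient
    (a diagonal Fisher proxy) instead of the highest variance, in the
    spirit of FASC's gradient-aware selection. grads: [N, d]."""
    sensitivity = grads.pow(2).mean(dim=0)
    return torch.topk(sensitivity, k).indices

def dependence_violation(acts: torch.Tensor, grads: torch.Tensor) -> float:
    """An illustrative rho: correlation between per-dimension activation
    and gradient magnitudes. The paper's exact formula may differ."""
    a, g = acts.abs().mean(dim=0), grads.abs().mean(dim=0)
    a = (a - a.mean()) / (a.std() + 1e-8)
    g = (g - g.mean()) / (g.std() + 1e-8)
    return float((a * g).mean())
```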

[AI-52] Active Context Compression: Autonomous Memory Management in LLM Agents

[Quick Read]: This paper addresses the performance degradation of large language model (LLM) agents on long-horizon software engineering tasks caused by "context bloat": as interaction history grows, computational cost explodes, latency rises, and reasoning degrades. The key to the solution is Focus, an agent-centric architecture inspired by the biological exploration strategy of the slime mold Physarum polycephalum: the Focus agent autonomously decides when to consolidate key learnings into a persistent "Knowledge" block and actively prunes the raw interaction history. With an optimized scaffold (persistent bash plus a string-replacement editor) and aggressive compression prompting, it achieves an average 22.7% token reduction (14.9M to 11.5M) on 5 context-intensive SWE-bench Lite instances while maintaining the baseline's accuracy (60%), showing that capable LLM agents can self-regulate their context given suitable tools and prompting, enabling cost-aware agentic systems without sacrificing task performance.

Link: https://arxiv.org/abs/2601.07190
Authors: Nikhil Verma
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, 2 tables. IEEE conference format

Abstract:Large Language Model (LLM) agents struggle with long-horizon software engineering tasks due to "Context Bloat." As interaction history grows, computational costs explode, latency increases, and reasoning capabilities degrade due to distraction by irrelevant past errors. Existing solutions often rely on passive, external summarization mechanisms that the agent cannot control. This paper proposes Focus, an agent-centric architecture inspired by the biological exploration strategies of Physarum polycephalum (slime mold). The Focus Agent autonomously decides when to consolidate key learnings into a persistent "Knowledge" block and actively withdraws (prunes) the raw interaction history. Using an optimized scaffold matching industry best practices (persistent bash + string-replacement editor), we evaluated Focus on N=5 context-intensive instances from SWE-bench Lite using Claude Haiku 4.5. With aggressive prompting that encourages frequent compression, Focus achieves 22.7% token reduction (14.9M → 11.5M tokens) while maintaining identical accuracy (3/5 = 60% for both agents). Focus performed 6.0 autonomous compressions per task on average, with token savings up to 57% on individual instances. We demonstrate that capable models can autonomously self-regulate their context when given appropriate tools and prompting, opening pathways for cost-aware agentic systems without sacrificing task performance.

[AI-53] Defenses Against Prompt Attacks Learn Surface Heuristics

[Quick Read]: This paper addresses prompt-injection attacks that steer large language models (LLMs) away from intended behavior in security-sensitive applications. Mainstream defenses rely on supervised fine-tuning over benign and malicious examples so that models learn to reject harmful instructions. The study finds, however, that such methods often latch onto surface correlations in the defense data rather than genuine harmful intent, producing three systematic biases: position bias, token trigger bias, and topic generalization bias. The key contribution is exposing this fundamental flaw, namely overfitting to attack-like surface patterns instead of judging intent, and providing controlled diagnostic datasets plus a systematic evaluation across multiple base models and defense pipelines to inform more reliable safety mechanisms.

Link: https://arxiv.org/abs/2601.07185
Authors: Shawn Li, Chenxiao Yu, Zhiyu Ni, Hao Li, Charith Peris, Chaowei Xiao, Yue Zhao
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed in security-sensitive applications, where they must follow system- or developer-specified instructions that define the intended task behavior, while completing benign user requests. When adversarial instructions appear in user queries or externally retrieved content, models may override intended logic. Recent defenses rely on supervised fine-tuning with benign and malicious labels. Although these methods achieve high attack rejection rates, we find that they rely on narrow correlations in defense data rather than harmful intent, leading to systematic rejection of safe inputs. We analyze three recurring shortcut behaviors induced by defense fine-tuning. Position bias arises when benign content placed later in a prompt is rejected at much higher rates; across reasoning benchmarks, suffix-task rejection rises from below 10% to as high as 90%. Token trigger bias occurs when strings common in attack data raise rejection probability even in benign contexts; inserting a single trigger token increases false refusals by up to 50%. Topic generalization bias reflects poor generalization beyond the defense data distribution, with defended models suffering test-time accuracy drops of up to 40%. These findings suggest that current prompt-injection defenses frequently respond to attack-like surface patterns rather than the underlying intent. We introduce controlled diagnostic datasets and a systematic evaluation across two base models and multiple defense pipelines, highlighting limitations of supervised fine-tuning for reliable LLM security.

[AI-54] PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization

[Quick Read]: This paper addresses the difficulty of policy optimization under sparse reward signals in multi-step reasoning tasks. Existing critic-free methods such as GRPO assign a single normalized outcome reward to all tokens, giving little guidance for intermediate reasoning, while Process Reward Models (PRMs) provide dense feedback but can drive premature collapse toward truncated outputs when early tokens receive low rewards. The key innovation of Process Relative Policy Optimization (PRPO) is to segment reasoning sequences by semantic cues, normalize PRM scores into token-level advantages, and align the distribution of process advantages with outcome advantages via a location-parameter shift, achieving critic-free fine-grained credit assignment that improves the efficiency and stability of policy optimization.

Link: https://arxiv.org/abs/2601.07182
Authors: Ruiyi Ding, Yongxuan Lv, Xianhui Meng, Jiahe Song, Chao Wang, Chen Jiang, Yuan Cheng
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures

Abstract:Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning. While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low-reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through location-parameter shift. On MATH500, PRPO improves Qwen2.5-Math-1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization.
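A minimal sketch of the advantage construction described above: normalize segment-level PRM scores, shift their location to the outcome advantage, and broadcast each segment's value to its tokens. The interfaces and the segment representation are assumptions.

```python
import numpy as np

def prpo_token_advantages(segment_prm_scores, segment_lengths, outcome_adv):
    """Sketch of PRPO-style advantages: per-segment PRM scores are
    normalized, their mean is shifted to the outcome advantage
    (location-parameter shift), and each segment's value is repeated
    for its tokens. Illustrative, not the paper's exact formulation."""
    s = np.asarray(segment_prm_scores, dtype=float)
    s = (s - s.mean()) / (s.std() + 1e-8)     # normalized process signal
    s = s + outcome_adv                        # align location with outcome
    return np.concatenate([np.full(n, v) for v, n in zip(s, segment_lengths)])
```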

[AI-55] Safe-FedLLM: Delving into the Safety of Federated Large Language Models

[Quick Read]: This paper addresses security threats to large language models (LLMs) in open federated learning (FL) environments, especially attacks from malicious clients, which prior work focused on training efficiency has largely ignored. The key to the solution is Safe-FedLLM, a probe-based defense framework whose core idea is to treat each client's locally trained LoRA weights as high-dimensional behavioral features and use lightweight classifiers to judge whether they carry malicious attributes, building defenses at three levels (Step-Level, Client-Level, and Shadow-Level). This suppresses the influence of malicious data without noticeably slowing training or hurting performance on benign data.

Link: https://arxiv.org/abs/2601.07177
Authors: Mingxiang Tao, Yu Tian, Wenxuan Tu, Yue Yang, Xue Yang, Xiangyan Tang
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated learning (FL) addresses data privacy and silo issues in large language models (LLMs). Most prior work focuses on improving the training efficiency of federated LLMs. However, security in open environments is overlooked, particularly defenses against malicious clients. To investigate the safety of LLMs during FL, we conduct preliminary experiments to analyze potential attack surfaces and defensible characteristics from the perspective of Low-Rank Adaptation (LoRA) weights. We find two key properties of FL: 1) LLMs are vulnerable to attacks from malicious clients in FL, and 2) LoRA weights exhibit distinct behavioral patterns that can be filtered through simple classifiers. Based on these properties, we propose Safe-FedLLM, a probe-based defense framework for federated LLMs, constructing defenses across three dimensions: Step-Level, Client-Level, and Shadow-Level. The core concept of Safe-FedLLM is to perform probe-based discrimination on the LoRA weights locally trained by each client during FL, treating them as high-dimensional behavioral features and using lightweight classification models to determine whether they possess malicious attributes. Extensive experiments demonstrate that Safe-FedLLM effectively enhances the defense capability of federated LLMs without compromising performance on benign data. Notably, our method effectively suppresses malicious data impact without significant impact on training speed, and remains effective even with many malicious clients. Our code is available at: this https URL.
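The probe idea reduces to featurizing each client's LoRA update and fitting a lightweight classifier. The per-matrix-norm features below are an assumption for illustration; the paper's featurization may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lora_features(lora_state_dict):
    """Flatten a client's LoRA update into a compact behavioral feature
    vector (here, per-matrix Frobenius norms). An illustrative choice,
    not necessarily the paper's exact featurization."""
    return np.array([np.linalg.norm(w) for w in lora_state_dict.values()])

# Probe training on updates with known labels, then screening new clients
# before aggregation (usage sketch):
# X = np.stack([lora_features(u) for u in labeled_updates])
# probe = LogisticRegression().fit(X, labels)              # 1 = malicious
# reject = probe.predict(lora_features(new_update).reshape(1, -1))[0] == 1
```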

[AI-56] AscendKernelGen: A Systematic Study of LLM -Based Kernel Generation for Neural Processing Units

[Quick Read]: This paper addresses the difficulty of developing efficient compute kernels for Neural Processing Units (NPUs): traditional approaches depend on vendor-specific domain-specific languages (DSLs) that demand deep hardware expertise and heavy manual effort, while general-purpose large language models (LLMs) fail almost completely in the NPU setting due to strict constraints and scarce training data. The key to the solution is AscendKernelGen, an integrated generation-evaluation framework with three components: (1) Ascend-CoT, a high-quality chain-of-thought dataset derived from real kernel implementations to strengthen reasoning; (2) KernelGen-LM, a domain-adapted model trained via supervised fine-tuning and reinforcement learning with execution feedback to improve correctness and compilability; and (3) NPUKernelBench, a comprehensive benchmark covering compilation, correctness, and performance across complexity levels. Experiments show the compilation success rate on complex Level-2 kernels rises from 0% to 95.5% (Pass@10) with 64.3% functional correctness, far exceeding baselines and underscoring the role of domain-specific reasoning and rigorous evaluation in accelerator-aware code generation.

Link: https://arxiv.org/abs/2601.07160
Authors: Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, Dongyang Tao, Xiansong Huang, Fan Xu, Feidiao Yang, Yao Lu, Chang-Dong Wang, Yutong Lu, Weicheng Xue, Bin Zhou, Yonghong Tian
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 33 pages, 7 figures, 16 tables

Abstract:To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive. While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate. To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback. Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels. Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding. Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline’s complete failure. These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation.

[AI-57] Stable On-Policy Distillation through Adaptive Target Reformulation

[Quick Read]: This paper addresses the distribution-mismatch problem in knowledge distillation (KD), especially the training instability of on-policy KD when the gap between student and teacher distributions is too wide to bridge directly, manifesting as pathological gradients under forward KL objectives or diversity collapse under reverse KL. The key to the solution is Veto, an objective-level reformulation that constructs a geometric bridge in logit space. With a tunable parameter β, it serves two roles: as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, and as a Decisiveness Knob that balances reward-driven performance against output diversity.

Link: https://arxiv.org/abs/2601.07155
Authors: Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, Taesup Kim
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
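One way to read the "geometric bridge in logit space" is an interpolated target distribution, sketched below. The exact Veto objective, including its gradient-veto behavior on low-confidence tokens, is richer than this.

```python
import torch.nn.functional as F

def bridged_distillation_loss(student_logits, teacher_logits, beta=0.5):
    """Illustrative logit-space bridge: beta -> 1 recovers plain teacher
    distillation, beta -> 0 leaves the student near its own distribution.
    A sketch of the interpolation idea, not Veto's exact loss."""
    bridge = (beta * teacher_logits + (1 - beta) * student_logits).detach()
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(bridge, dim=-1),
                    reduction="batchmean")
```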

[AI-58] A Large-Scale Study on the Development and Issues of Multi-Agent AI Systems

[Quick Read]: This paper addresses the lack of systematic understanding of how multi-agent AI systems (MAS) evolve and are maintained in practice, particularly the maturity and development patterns of the open-source MAS ecosystem. The key is a large-scale empirical study of over 42K commits and 4.7K+ resolved issues across eight leading open-source MAS projects, identifying three development profiles (sustained, steady, and burst-driven), and quantifying change types (perfective commits at 40.8%), issue categories (bugs 22%, infrastructure 14%), and resolution times. The findings reveal an ecosystem with strong momentum but real fragility, and argue for better testing infrastructure, documentation quality, and maintenance practices to ensure long-term reliability and sustainability.

Link: https://arxiv.org/abs/2601.07136
Authors: Daniel Liu, Krishna Upadhyay, Vinaik Chhetri, A.B. Siddique, Umar Farooq
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures, IEEE BigData Workshop on Software Engineering for Agentic AI 2025

Abstract:The rapid emergence of multi-agent AI systems (MAS), including LangChain, CrewAI, and AutoGen, has shaped how large language model (LLM) applications are developed and orchestrated. However, little is known about how these systems evolve and are maintained in practice. This paper presents the first large-scale empirical study of open-source MAS, analyzing over 42K unique commits and over 4.7K resolved issues across eight leading systems. Our analysis identifies three distinct development profiles: sustained, steady, and burst-driven. These profiles reflect substantial variation in ecosystem maturity. Perfective commits constitute 40.8% of all changes, suggesting that feature enhancement is prioritized over corrective maintenance (27.4%) and adaptive updates (24.3%). Data about issues shows that the most frequent concerns involve bugs (22%), infrastructure (14%), and agent coordination challenges (10%). Issue reporting also increased sharply across all frameworks starting in 2023. Median resolution times range from under one day to about two weeks, with distributions skewed toward fast responses but a minority of issues requiring extended attention. These results highlight both the momentum and the fragility of the current ecosystem, emphasizing the need for improved testing infrastructure, documentation quality, and maintenance practices to ensure long-term reliability and sustainability.

[AI-59] ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning

[Quick Read]: This paper addresses the overthinking problem in large reasoning models (LRMs), which generate unnecessarily long and repetitive reasoning chains even for simple tasks, wasting compute for little gain. The key to the solution is ENTRA, an entropy-based training framework that suppresses redundant reasoning: it first estimates token-level importance with a lightweight Bidirectional Importance Estimation (BIE) method combining prediction confidence and forward influence, then computes a redundancy reward from the entropy of low-importance tokens, normalized by its theoretical upper bound, and optimizes this reward via reinforcement learning to steer the model toward concise yet accurate reasoning.

Link: https://arxiv.org/abs/2601.07123
Authors: Ruichu Cai, Haopeng Du, Qingwen Lin, Yutong Chen, Zijian Li, Boyan Xu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks. This leads to substantial computational overhead with limited performance gain, primarily due to redundant verification and repetitive generation. While prior work typically constrains output length or optimizes correctness, such coarse supervision fails to guide models toward concise yet accurate inference. In this paper, we propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance. ENTRA first estimates the token-level importance using a lightweight Bidirectional Importance Estimation (BIE) method, which accounts for both prediction confidence and forward influence. It then computes a redundancy reward based on the entropy of low-importance tokens, normalized by its theoretical upper bound, and optimizes this reward via reinforcement learning. Experiments on mathematical reasoning benchmarks demonstrate that ENTRA reduces output length by 37% to 53% with no loss-and in some cases, gains-in accuracy. Our approach offers a principled and efficient solution to reduce overthinking in LRMs, and provides a generalizable path toward redundancy-aware reasoning optimization.
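A minimal sketch of the redundancy signal: mean entropy of low-importance tokens, normalized by the uniform upper bound log|V|. The threshold tau and the exact reward shape are assumptions; only the normalized-entropy ingredient is taken from the abstract.

```python
import math
import torch

def redundancy_reward(token_entropy, token_importance, vocab_size, tau=0.2):
    """Reward the policy for keeping low-importance tokens few and
    low-entropy. token_entropy / token_importance: per-token tensors.
    tau and the 1 - x reward shape are illustrative assumptions."""
    low = token_importance < tau
    if not bool(low.any()):
        return 1.0                       # nothing flagged as redundant
    norm_entropy = float(token_entropy[low].mean()) / math.log(vocab_size)
    return 1.0 - norm_entropy            # in [0, 1]; higher = less redundancy
```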

[AI-60] Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

[Quick Read]: This paper addresses the lack of robustness in current reinforcement learning (RL)-based cloud-defense strategies, which must be retrained whenever network structure, node scale, attack strategy, or attack intensity changes, and the limited interpretability and flexibility caused by missing human-in-the-loop (HITL) support. The key to the solution is CyberOps-Bots, a hierarchical multi-agent RL framework empowered by large language models (LLMs): an upper-level LLM agent with ReAct planning, IPDRR-based perception, long-short-term memory, and action/tool integration modules performs global situational awareness, human-intent recognition, and tactical planning, while lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within local network regions. This design preserves the adaptability and interpretability of LLMs while ensuring reliable RL execution; experiments show it maintains network availability 68.5% higher than state-of-the-art algorithms and achieves a 34.7% jumpstart performance gain when scenarios shift, without retraining.

Link: https://arxiv.org/abs/2601.07122
Authors: Yixiao Peng, Hao Hu, Feiyang Li, Xinye Cao, Yingchang Jiang, Jipeng Tang, Guoshun Nan, Yuling Liu
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) An upper-level LLM agent with four modules (ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration) performs global awareness, human intent recognition, and tactical planning; (2) Lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains network availability 68.5% higher and achieves a 34.7% jumpstart performance gain when shifting the scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense. We will release our framework to the community, facilitating the advancement of robust and autonomous defense in cloud networks.

[AI-61] XBTorch: A Unified Framework for Modeling and Co-Design of Crossbar-Based Deep Learning Accelerators

[Quick Read]: This paper addresses the energy and latency bottlenecks that conventional von Neumann architectures impose on deep learning, rooted in the separation of compute and memory. The key to the solution is XBTorch (CrossBarTorch), a simulation framework that integrates seamlessly with the PyTorch ecosystem and can efficiently and accurately model crossbar systems built on emerging memory devices (such as ferroelectric field-effect transistors, FeFETs, and resistive RAM, ReRAM), supporting key research directions including device-level modeling, cross-layer co-design, and inference-time fault tolerance, while remaining technology-agnostic and extensible with user-defined device models.

Link: https://arxiv.org/abs/2601.07086
Authors: Osama Yousuf, Andreu L. Glasmann, Martin Lueker-Boden, Sina Najmaei, Gina C. Adam
Institutions: Unknown
Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Emerging memory technologies have gained significant attention as a promising pathway to overcome the limitations of conventional computing architectures in deep learning applications. By enabling computation directly within memory, these technologies - built on nanoscale devices with tunable and nonvolatile conductance - offer the potential to drastically reduce energy consumption and latency compared to traditional von Neumann systems. This paper introduces XBTorch (short for CrossBarTorch), a novel simulation framework that integrates seamlessly with PyTorch and provides specialized tools for accurately and efficiently modeling crossbar-based systems based on emerging memory technologies. Through detailed comparisons and case studies involving hardware-aware training and inference, we demonstrate how XBTorch offers a unified interface for key research areas such as device-level modeling, cross-layer co-design, and inference-time fault tolerance. While exemplar studies utilize ferroelectric field-effect transistor (FeFET) models, the framework remains technology-agnostic - supporting other emerging memories such as resistive RAM (ReRAM), as well as enabling user-defined custom device models. The code is publicly available at: this https URL
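To illustrate what crossbar-aware simulation looks like as a PyTorch module, here is a generic linear layer with conductance quantization, device noise, and a straight-through estimator. This is the style of modeling such a framework enables, not XBTorch's actual API.

```python
import torch
import torch.nn as nn

class CrossbarLinear(nn.Module):
    """Illustrative crossbar-style linear layer: weights are snapped to a
    finite set of conductance levels and perturbed by device noise in the
    forward pass. Level count and noise model are assumptions."""
    def __init__(self, in_features, out_features, levels=16, noise_std=0.01):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.levels, self.noise_std = levels, noise_std

    def forward(self, x):
        w = self.weight
        w_max = w.abs().max().clamp(min=1e-8)
        step = 2 * w_max / (self.levels - 1)
        w_q = torch.round(w / step) * step                  # quantized conductances
        w_q = w_q + torch.randn_like(w_q) * self.noise_std * w_max  # device noise
        w_eff = w + (w_q - w).detach()                      # straight-through estimator
        return x @ w_eff.t()
```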

[AI-62] The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance

[Quick Read]: This paper argues that current frameworks for understanding misinformation and persuasion cannot adequately address the cognitive challenges posed by LLM-based conversational AI. The core problem is that optimization for usefulness may configure these systems to present "honest non-signals" (fluency, helpfulness, and apparent disinterest) that bypass the cognitive mechanisms humans evolved to evaluate incoming information, creating deep epistemic risks. The key contribution is the Cognitive Trojan Horse hypothesis: these characteristics are genuine yet carry no information in the human sense, because they are costly for humans to produce but computationally trivial for AI. Four bypass mechanisms are identified: processing fluency decoupled from understanding, trust-competence presentation without real stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework reframes AI safety as partly a problem of calibration, that is, aligning human evaluative responses with the actual epistemic status of AI-generated content, rather than solely preventing deception.

Link: https://arxiv.org/abs/2601.07085
Authors: Andrew D. Maynard
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 15 pages, 18 references

Abstract:Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues’ theory of epistemic vigilance – the parallel cognitive process monitoring communicated information for reasons to doubt – and proposes that LLM-based systems present ‘honest non-signals’: genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration – aligning human evaluative responses with the actual epistemic status of AI-generated content – rather than solely a problem of preventing deception.

[AI-63] Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

[Quick Read]: This paper addresses the hardest step of indirect prompt injection (IPI) attacks in practice: ensuring that malicious content planted in an external corpus is actually retrieved by large language models (LLMs) so that it can hijack model behavior. Prior work highlighted IPI risks but did not overcome the core barrier that malicious content is rarely retrieved under natural queries. The key to the solution is decomposing the malicious content into two parts: a compact trigger fragment that guarantees retrieval under target user queries, and an attack fragment that encodes arbitrary attack objectives. On this basis, the authors present a black-box attack algorithm requiring only embedding-model API access that achieves near-100% retrieval of malicious content at very low cost (as little as $0.21 per query) and demonstrate the first end-to-end IPI exploits under natural queries, establishing IPI as a practical and severe threat to RAG and agentic systems.

Link: https://arxiv.org/abs/2601.07072
Authors: Hongyan Chang, Ergute Bao, Xinjian Luo, Ting Yu
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) increasingly rely on retrieving information from external corpora. This creates a new attack surface: indirect prompt injection (IPI), where hidden instructions are planted in the corpora and hijack model behavior once retrieved. Previous studies have highlighted this risk but often avoid the hardest step: ensuring that malicious content is actually retrieved. In practice, unoptimized IPI is rarely retrieved under natural queries, which leaves its real-world impact unclear. We address this challenge by decomposing the malicious content into a trigger fragment that guarantees retrieval and an attack fragment that encodes arbitrary attack objectives. Based on this idea, we design an efficient and effective black-box attack algorithm that constructs a compact trigger fragment to guarantee retrieval for any attack fragment. Our attack requires only API access to embedding models, is cost-efficient (as little as $0.21 per target user query on OpenAI's embedding models), and achieves near-100% retrieval across 11 benchmarks and 8 embedding models (including both open-source models and proprietary services). Based on this attack, we present the first end-to-end IPI exploits under natural queries and realistic external corpora, spanning both RAG and agentic systems with diverse attack objectives. These results establish IPI as a practical and severe threat: when a user issued a natural query to summarize emails on frequently asked topics, a single poisoned email was sufficient to coerce GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow. We further evaluate several defenses and find that they are insufficient to prevent the retrieval of malicious text, highlighting retrieval as a critical open vulnerability.

[AI-64] Automated Domain Question Mapping (DQM) with Educational Learning Materials

[Quick Read]: This paper addresses two core problems: the lack of disciplinary concept inventories designed for multi-level pedagogical goals (from lower-order to higher-order thinking), and the scarcity of labeled data on disciplinary concepts and their interrelationships. The key to the solution is a new approach that constructs Domain Question Maps (DQMs) instead of traditional concept maps: it enriches knowledge representation with specific questions aligned to learning objectives and discerns hierarchical relationships among them, yielding structured question maps that support personalized and adaptive learning in downstream applications.

Link: https://arxiv.org/abs/2601.07062
Authors: Jiho Noh, Mukhesh Raghava Katragadda, Dabae Lee
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Concept maps have been widely utilized in education to depict knowledge structures and the interconnections between disciplinary concepts. Nonetheless, devising a computational method for automatically constructing a concept map from unstructured educational materials presents challenges due to the complexity and variability of educational content. We focus primarily on two challenges: (1) the lack of disciplinary concepts that are specifically designed for multi-level pedagogical purposes from low-order to high-order thinking, and (2) the limited availability of labeled data concerning disciplinary concepts and their interrelationships. To tackle these challenges, this research introduces an innovative approach for constructing Domain Question Maps (DQMs), rather than traditional concept maps. By formulating specific questions aligned with learning objectives, DQMs enhance knowledge representation and improve readiness for learner engagement. The findings indicate that the proposed method can effectively generate educational questions and discern hierarchical relationships among them, leading to structured question maps that facilitate personalized and adaptive learning in downstream applications.

[AI-65] Hallucinations Live in Variance

[Quick Read]: This paper addresses the neglect of reliability in current LLM evaluation: in multi-step agentic AI systems, a single rephrased prompt can trigger cascading failures, yet existing benchmarks do not capture such variance-driven instability. The key to the solution is Semantic Stability (SS), quantified via Paraphrase Consistency (PC@k): generate k semantically equivalent paraphrases of a prompt, greedy-decode each, and compute mode agreement among the outputs. SS is a diagnostic for variance-driven errors arising from inconsistent internal pathways, not a method for improving correctness. Experiments show that sparsification markedly improves self-agreement (from 23.8% to 55.9% at 32% sparsity) and that there is a sweet spot where variance reduction outpaces bias accumulation, yielding more reliable reasoning behavior.

Link: https://arxiv.org/abs/2601.07058
Authors: Aaron R. Flouro, Shawn P. Chadwick
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures

Abstract:Benchmarks measure whether a model is correct. They do not measure whether a model is reliable. This distinction is largely academic for single-shot inference, but becomes critical for agentic AI systems, where a single rephrased prompt can trigger cascading failures in multi-step execution. Yet this form of instability is not captured by existing evaluations. Hallucinations live in variance: they arise when semantically equivalent prompts activate inconsistent internal pathways, producing divergent outputs. Consistent but incorrect outputs reflect bias or missing knowledge; confident guessing reflects calibration failure. Neither constitutes hallucination under this definition. When error is variance-dominated, reducing redundant pathways improves reliability without adding knowledge. We formalize this through Semantic Stability (SS), measured via Paraphrase Consistency (PC@k): generate k paraphrases, greedy decode each, compute mode agreement. SS is a diagnostic for variance-driven unreliability, not a method for improving correctness. We show that a dense Qwen3-0.6B agrees with itself only 23.8% of the time; at 32% sparsity, agreement jumps to 55.9%. A phase diagram reveals the sweet spot where variance reduction outpaces bias accumulation, and regimes where stability collapses onto wrong answers.
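PC@k is simple enough to state exactly in code:

```python
from collections import Counter

def pc_at_k(answers):
    """Paraphrase Consistency: given one greedy-decoded answer per
    semantically equivalent paraphrase, return the fraction that agrees
    with the modal answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

print(pc_at_k(["42", "42", "41", "42"]))  # 0.75
```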

[AI-66] Dr. Zero: Self-Evolving Search Agents without Training Data

[Quick Read]: This paper addresses the limits of multi-turn search agents in data-free self-evolution, which stem from limited question diversity and the substantial compute demanded by multi-step reasoning and tool use. The key to the solution is Dr. Zero, whose self-evolution feedback loop lets a proposer and a solver, both initialized from the same base model, co-evolve without any training data: the proposer generates diverse questions to train the solver, and as the solver improves it incentivizes the proposer to produce increasingly difficult yet solvable tasks, forming an automated curriculum that refines both agents. For training efficiency, hop-grouped relative policy optimization (HRPO) clusters structurally similar questions to build group-level baselines, sharply reducing the sampling overhead of assessing each query's difficulty and solvability, and thus the compute needed for solver training, without sacrificing performance or stability.

Link: https://arxiv.org/abs/2601.07055
Authors: Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, Dong Wang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:As high-quality data becomes increasingly difficult to obtain, data-free self-evolution has emerged as a promising paradigm. This approach allows large language models (LLMs) to autonomously generate and solve complex problems, thereby improving their reasoning capabilities. However, multi-turn search agents struggle in data-free self-evolution due to the limited question diversity and the substantial compute required for multi-step reasoning and tool using. In this work, we introduce Dr. Zero, a framework enabling search agents to effectively self-evolve without any training data. In particular, we design a self-evolution feedback loop where a proposer generates diverse questions to train a solver initialized from the same base model. As the solver evolves, it incentivizes the proposer to produce increasingly difficult yet solvable tasks, thus establishing an automated curriculum to refine both agents. To enhance training efficiency, we also introduce hop-grouped relative policy optimization (HRPO). This method clusters structurally similar questions to construct group-level baselines, effectively minimizing the sampling overhead in evaluating each query’s individual difficulty and solvability. Consequently, HRPO significantly reduces the compute requirements for solver training without compromising performance or stability. Extensive experiment results demonstrate that the data-free Dr. Zero matches or surpasses fully supervised search agents, proving that complex reasoning and search capabilities can emerge solely through self-evolution.
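The group-baseline trick is easy to sketch: questions sharing a hop count share a mean-reward baseline, so no extra rollouts are needed to estimate per-query difficulty. A minimal sketch; the grouping key and reward shape are simplified.

```python
import numpy as np
from collections import defaultdict

def hrpo_advantages(rewards, hop_counts):
    """Hop-grouped baseline: each question's advantage is its reward
    minus the mean reward of questions with the same hop count."""
    groups = defaultdict(list)
    for r, h in zip(rewards, hop_counts):
        groups[h].append(r)
    baseline = {h: float(np.mean(rs)) for h, rs in groups.items()}
    return [r - baseline[h] for r, h in zip(rewards, hop_counts)]

# hrpo_advantages([1.0, 0.0, 1.0, 1.0], [2, 2, 3, 3]) -> [0.5, -0.5, 0.0, 0.0]
```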

[AI-67] Jasper: ANNS Quantized for Speed Built for Change on GPU

[Quick Read]: This paper targets three bottlenecks in current GPU-accelerated approximate nearest neighbor search (ANNS) systems: existing GPU indices lack efficient batch updates for evolving data and must be rebuilt from scratch; high-dimensional vectors strain memory bandwidth, and existing quantization cannot reduce data movement without random-access overhead; and the data-dependent memory accesses of greedy search prevent overlapping compute with memory operations, limiting performance. The key to the solution is Jasper, a GPU-native ANNS system that unifies high performance with updatability through three techniques: (1) a CUDA batch-parallel construction algorithm enabling lock-free streaming insertions; (2) an optimized GPU implementation of RaBitQ quantization that reduces memory footprint by up to 8x while avoiding random-access penalties; and (3) an improved greedy-search kernel that raises compute utilization and strengthens latency hiding, significantly boosting query throughput and resource utilization.

Link: https://arxiv.org/abs/2601.07048
Authors: Hunter McCoy, Zikun Wang, Prashant Pandey
Institutions: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Approximate nearest neighbor search (ANNS) is a core problem in machine learning and information retrieval applications. GPUs offer a promising path to high-performance ANNS: they provide massive parallelism for distance computations, are readily available, and can co-locate with downstream applications. Despite these advantages, current GPU-accelerated ANNS systems face three key limitations. First, real-world applications operate on evolving datasets that require fast batch updates, yet most GPU indices must be rebuilt from scratch when new data arrives. Second, high-dimensional vectors strain memory bandwidth, but current GPU systems lack efficient quantization techniques that reduce data movement without introducing costly random memory accesses. Third, the data-dependent memory accesses inherent to greedy search make overlapping compute and memory difficult, leading to reduced performance. We present Jasper, a GPU-native ANNS system with both high query throughput and updatability. Jasper builds on the Vamana graph index and overcomes existing bottlenecks via three contributions: (1) a CUDA batch-parallel construction algorithm that enables lock-free streaming insertions, (2) a GPU-efficient implementation of RaBitQ quantization that reduces memory footprint up to 8x without the random access penalties, and (3) an optimized greedy search kernel that increases compute utilization, resulting in better latency hiding and higher throughput. Our evaluation across five datasets shows that Jasper achieves up to 1.93x higher query throughput than CAGRA and achieves up to 80% peak utilization as measured by the roofline model. Jasper's construction scales efficiently and constructs indices an average of 2.4x faster than CAGRA while providing updatability that CAGRA lacks. Compared to BANG, the previous fastest GPU Vamana implementation, Jasper delivers 19-131x faster queries.

[AI-68] CloneMem: Benchmarking Long-Term Memory for AI Clones

[Quick Read]: This paper addresses the difficulty of modeling long-term memory for AI Clone systems in sustained personalized interaction: existing memory benchmarks rely on fragmented user-agent conversation histories and cannot capture an individual's evolving life trajectory and psychological state over time. The key to the solution is the CloneMem benchmark, built from non-conversational digital traces (diaries, social media posts, and emails) spanning one to three years, using a hierarchical data-construction framework to ensure longitudinal coherence, and defining tasks that measure an agent's ability to track evolving personal states. Experiments show current memory mechanisms struggle in this setting, exposing major open challenges for life-grounded personalized AI.

Link: https://arxiv.org/abs/2601.07023
Authors: Sen Hu, Zhiyu Zhang, Yuxiang Wei, Xueran Han, Zhenheng Tang, Huacan Wang, Ronghao Chen
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI Clones aim to simulate an individual’s thoughts and behaviors to enable long-term, personalized interaction, placing stringent demands on memory systems to model experiences, emotions, and opinions over time. Existing memory benchmarks primarily rely on user-agent conversational histories, which are temporally fragmented and insufficient for capturing continuous life trajectories. We introduce CloneMem, a benchmark for evaluating longterm memory in AI Clone scenarios grounded in non-conversational digital traces, including diaries, social media posts, and emails, spanning one to three years. CloneMem adopts a hierarchical data construction framework to ensure longitudinal coherence and defines tasks that assess an agent’s ability to track evolving personal states. Experiments show that current memory mechanisms struggle in this setting, highlighting open challenges for life-grounded personalized AI. Code and dataset are available at this https URL

[AI-69] Zer0n: An AI-Assisted Vulnerability Discovery and Blockchain-Backed Integrity Framework

[Quick Read]: This paper addresses the "trust gap" in security automation created by reliance on opaque generative AI outputs in vulnerability research. The key to the solution is the Zer0n framework, which anchors the reasoning capabilities of large language models (LLMs) to immutable blockchain audit trails, balancing security and efficiency: Gemini 2.0 Pro performs logic-based vulnerability detection while the Avalanche C-Chain provides tamper-evident artifact logging. A hybrid architecture keeps execution off-chain for performance and finalizes only integrity proofs on-chain, achieving 80% detection accuracy with just 22.9% marginal overhead and demonstrating that decentralized integrity can coexist with high-speed security workflows.

Link: https://arxiv.org/abs/2601.07019
Authors: Harshil Parmar, Pushti Vyas, Prayers Khristi, Priyank Panchal
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 10 pages, 3 figures, 7 tables. Framework for AI-Assisted Vulnerability Discovery

Abstract:As vulnerability research increasingly adopts generative AI, a critical reliance on opaque model outputs has emerged, creating a “trust gap” in security automation. We address this by introducing Zer0n, a framework that anchors the reasoning capabilities of Large Language Models (LLMs) to the immutable audit trails of blockchain technology. Specifically, we integrate Gemini 2.0 Pro for logic-based vulnerability detection with the Avalanche C-Chain for tamper-evident artifact logging. Unlike fully decentralized solutions that suffer from high latency, Zer0n employs a hybrid architecture: execution remains off-chain for performance, while integrity proofs are finalized on-chain. Our evaluation on a dataset of 500 endpoints reveals that this approach achieves 80% detection accuracy with only a marginal 22.9% overhead, effectively demonstrating that decentralized integrity can coexist with high-speed security workflows.

[AI-70] Belief in False Information: A Human-Centered Security Risk in Sociotechnical Systems

[Quick Read]: This paper addresses how belief in false information arises and persists, framing it as a human-centered security risk in sociotechnical systems that can be exploited to manipulate decisions, undermine trust, and increase vulnerability to social engineering. The key contribution is a systematic identification and classification of 24 factors influencing belief in false information, grouped into six categories: demographic factors, personality traits, psychological factors, policy and values, media consumption, and preventive factors. The review finds that lower education, higher extraversion, lower agreeableness, higher neuroticism, and weaker cognitive reflection significantly increase acceptance of false information, and highlights preventive strategies such as labeling false content and promoting reflection on correctness as effective ways to reduce risk and strengthen sociotechnical security and societal resilience.

Link: https://arxiv.org/abs/2601.07016
Authors: Fabian Walke, Thaddäa Nürnberger
Institutions: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: Literature Review, 10 pages, 8 tables

Abstract:This paper provides a comprehensive literature review on the belief in false information, including misinformation, disinformation, and fake information. It addresses the increasing societal concern regarding false information, which is fueled by technological progress, especially advancements in artificial intelligence. This review systematically identifies and categorizes factors that influence the belief in false information. The review identifies 24 influence factors grouped into six main categories: demographic factors, personality traits, psychological factors, policy and values, media consumption, and preventive factors. Key findings highlight that lower education levels, high extraversion, low agreeableness, high neuroticism, and low cognitive reflection significantly increase belief in false information. The effectiveness of preventive strategies like labeling false information and promoting reflection about correctness is also discussed. This literature review conceptualizes belief in false information as a human-centered security risk in sociotechnical systems, as it can be exploited to manipulate decisions, undermine trust, and increase susceptibility to social engineering. It aims to inform preventive strategies that strengthen socio-technical security and societal resilience.

[AI-71] LLM Performance Predictors: Learning When to Escalate in Hybrid Human-AI Moderation Systems AAMAS2026

[Quick Read]: This paper addresses how to judge the trustworthiness of large language model (LLM) outputs in human-in-the-loop content moderation, i.e., when a model's decision can be accepted directly and when it should be escalated for human review. The key to the solution is a supervised LLM uncertainty-quantification framework that trains a dedicated meta-model on LLM Performance Predictors (LPPs) derived from model outputs, including log-probabilities, entropy, and novel uncertainty-attribution indicators, to estimate confidence precisely. The method enables cost-aware selective classification in real human-AI workflows, escalating high-risk cases while automating the rest, significantly outperforming existing uncertainty estimators while also improving the explainability and accountability of moderation systems.

Link: https://arxiv.org/abs/2601.07006
Authors: Or Bachar, Or Levi, Sardhendu Mishra, Adi Levi, Manpreet Singh Minhas, Justin Miller, Omer Ben-Porat, Eilon Sheetrit, Jonathan Morra
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted as a full paper at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract:As LLMs are increasingly integrated into human-in-the-loop content moderation systems, a central challenge is deciding when their outputs can be trusted versus when escalation for human review is preferable. We propose a novel framework for supervised LLM uncertainty quantification, learning a dedicated meta-model based on LLM Performance Predictors (LPPs) derived from LLM outputs: log-probabilities, entropy, and novel uncertainty attribution indicators. We demonstrate that our method enables cost-aware selective classification in real-world human-AI workflows: escalating high-risk cases while automating the rest. Experiments across state-of-the-art LLMs, including both off-the-shelf (Gemini, GPT) and open-source (Llama, Qwen), on multimodal and multilingual moderation tasks, show significant improvements over existing uncertainty estimators in accuracy-cost trade-offs. Beyond uncertainty estimation, the LPPs enhance explainability by providing new insights into failure conditions (e.g., ambiguous content vs. under-specified policy). This work establishes a principled framework for uncertainty-aware, scalable, and responsible human-AI moderation workflows.
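A minimal sketch of the meta-model pipeline: extract performance-predictor features from token log-probabilities and fit an off-the-shelf classifier to predict decision correctness. The feature set and the 0.9 escalation threshold are assumptions; the paper's LPPs also include uncertainty-attribution indicators not shown here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def lpp_features(token_logprobs):
    """Illustrative performance-predictor features from one LLM output:
    mean/min/std of token log-probs plus mean token probability."""
    lp = np.asarray(token_logprobs, dtype=float)
    return np.array([lp.mean(), lp.min(), lp.std(), np.exp(lp).mean()])

# Meta-model: predict whether the LLM's moderation decision is correct,
# then escalate low-confidence cases to human review (usage sketch):
# X = np.stack([lpp_features(o.logprobs) for o in calibration_outputs])
# meta = GradientBoostingClassifier().fit(X, was_correct)
# p_ok = meta.predict_proba(lpp_features(new.logprobs).reshape(1, -1))[0, 1]
# escalate = p_ok < 0.9   # threshold set by the cost of human review
```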

[AI-72] MicLog: Towards Accurate and Efficient LLM-based Log Parsing via Progressive Meta In-Context Learning

[Quick Read]: This paper addresses two problems in log parsing: traditional methods falter under semantic variation and data scarcity, while LLM-based parsers underuse in-context learning (ICL), with weak dynamic example selection and cross-domain generalization, and incur heavy query latency and cost. The key to the solution is MicLog, the first progressive meta in-context learning (ProgMeta-ICL) framework, which combines meta-learning with ICL on small open-source LLMs (i.e., Qwen-2.5-3B): it adopts a zero-shot to k-shot progressive ICL paradigm with weighted DBSCAN candidate sampling and enhanced BM25 demonstration selection to strengthen ICL, and designs a multi-level pre-query cache that dynamically matches and refines recently parsed templates to accelerate parsing. On Loghub-2.0, MicLog improves parsing accuracy by 10.3% over the state-of-the-art parser while reducing parsing time by 42.4%.

Link: https://arxiv.org/abs/2601.07005
Authors: Jianbo Yu, Yixuan Li, Hai Xu, Kang Xu, Junjielong Xu, Zhijing Li, Pinjia He, Wanyuan Wang
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Log parsing converts semi-structured logs into structured templates, forming a critical foundation for downstream analysis. Traditional syntax and semantic-based parsers often struggle with semantic variations in evolving logs and data scarcity stemming from their limited domain coverage. Recent large language model (LLM)-based parsers leverage in-context learning (ICL) to extract semantics from examples, demonstrating superior accuracy. However, LLM-based parsers face two main challenges: 1) underutilization of ICL capabilities, particularly in dynamic example selection and cross-domain generalization, leading to inconsistent performance; 2) time-consuming and costly LLM querying. To address these challenges, we present MicLog, the first progressive meta in-context learning (ProgMeta-ICL) log parsing framework that combines meta-learning with ICL on small open-source LLMs (i.e., Qwen-2.5-3B). Specifically, MicLog: i) enhances LLMs’ ICL capability through a zero-shot to k-shot ProgMeta-ICL paradigm, employing weighted DBSCAN candidate sampling and enhanced BM25 demonstration selection; ii) accelerates parsing via a multi-level pre-query cache that dynamically matches and refines recently parsed templates. Evaluated on Loghub-2.0, MicLog achieves 10.3% higher parsing accuracy than the state-of-the-art parser while reducing parsing time by 42.4%.
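To make the demonstration-selection step concrete, here is a hedged sketch of BM25-based retrieval of labeled (log, template) pairs for a k-shot ICL prompt. The BM25 variant, constants, and example pool are assumptions for illustration, not MicLog's exact implementation (which also layers weighted DBSCAN sampling on top).

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Minimal Okapi BM25 ranking of candidate demonstrations for a query log."""
    toks = [d.split() for d in docs]
    avgdl = sum(map(len, toks)) / len(toks)
    df = Counter(t for d in toks for t in set(d))  # document frequency
    N = len(docs)

    def score(q, d):
        tf = Counter(d)
        s = 0.0
        for t in q.split():
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        return s

    return sorted(range(N), key=lambda i: score(query, toks[i]), reverse=True)

# Hypothetical labeled pool of (raw log, template) pairs.
pool = [
    ("Connection from 10.0.0.1 closed", "Connection from <*> closed"),
    ("Failed password for root from 10.0.0.2", "Failed password for <*> from <*>"),
    ("Connection from 10.0.0.9 closed", "Connection from <*> closed"),
]
query = "Connection from 172.16.0.4 closed"
top = bm25_rank(query, [log for log, _ in pool])[:2]
demos = [pool[i] for i in top]  # k-shot demonstrations for the ICL prompt
print(demos)
```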

[AI-73] MemTrust: A Zero-Trust Architecture for Unified AI Memory System

【Quick Read】: This paper addresses the core tension in AI memory systems between personalization and data sovereignty: centralized architectures enable cross-agent collaboration and multi-tool workflows but expose sensitive user data to cloud-provider risk, while private deployments are secure yet limit collaboration. The key to the solution is a five-layer architectural abstraction (Storage, Extraction, Learning, Retrieval, Governance) protected by a hardware root of trust (TEE); MemTrust, a zero-trust architecture built on this abstraction, provides cryptographic guarantees at every layer, achieving local-equivalent security without sacrificing maintenance efficiency or collaborative capability.

Link: https://arxiv.org/abs/2601.07004
Authors: Xing Zhou, Dmitrii Ustiugov, Haoxin Shang, Kisson Lin
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures

Abstract:AI memory systems are evolving toward unified context layers that enable efficient cross-agent collaboration and multi-tool workflows, facilitating better accumulation of personal data and learning of user preferences. However, centralization creates a trust crisis where users must entrust cloud providers with sensitive digital memory data. We identify a core tension between personalization demands and data sovereignty: centralized memory systems enable efficient cross-agent collaboration but expose users’ sensitive data to cloud provider risks, while private deployments provide security but limit collaboration. To resolve this tension, we aim to achieve local-equivalent security while enabling superior maintenance efficiency and collaborative capabilities. We propose a five-layer architecture abstracting common functional components of AI memory systems: Storage, Extraction, Learning, Retrieval, and Governance. By applying TEE protection to each layer, we establish a trustworthy framework. Based on this, we design MemTrust, a hardware-backed zero-trust architecture that provides cryptographic guarantees across all layers. Our contributions include the five-layer abstraction, “Context from MemTrust” protocol for cross-application sharing, side-channel hardened retrieval with obfuscated access patterns, and comprehensive security analysis. The architecture enables third-party developers to port existing systems with acceptable development costs, achieving system-wide trustworthiness. We believe that AI memory plays a crucial role in enhancing the efficiency and collaboration of agents and AI tools. AI memory will become the foundational infrastructure for AI agents, and MemTrust serves as a universal trusted framework for AI memory systems, with the goal of becoming the infrastructure of memory infrastructure.

[AI-74] VISTA: Knowledge-Driven Interpretable Vessel Trajectory Imputation via Large Language Models

【Quick Read】: This paper addresses the incompleteness of vessel trajectories from the Automatic Identification System (AIS): existing imputation methods can recover trajectories but lack interpretability and provide no underlying knowledge useful to downstream tasks such as anomaly detection and route planning. The key to the solution is VISTA, the first interpretable vessel trajectory imputation framework. It first defines underlying knowledge as the combination of Structured Data-derived Knowledge (SDK) distilled from AIS data and implicit LLM knowledge acquired from large-scale Internet corpora; it then builds a data-knowledge-data loop that uses an SDK knowledge graph for efficient knowledge extraction and knowledge-driven imputation; finally, a workflow management layer parallelizes processing of large-scale AIS data with anomaly handling and redundancy elimination. VISTA achieves high imputation accuracy and computational efficiency while producing interpretable knowledge cues for downstream applications.

Link: https://arxiv.org/abs/2601.06940
Authors: Hengyu Liu, Tianyi Li, Haoyu Wang, Kristian Torp, Tiancheng Zhang, Yushuai Li, Christian S. Jensen
Institution: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 22 pages, 13 figures, 3 algorithms, 5 tables. Code available at this https URL

Abstract:The Automatic Identification System provides critical information for maritime navigation and safety, yet its trajectories are often incomplete due to signal loss or deliberate tampering. Existing imputation methods emphasize trajectory recovery, paying limited attention to interpretability and failing to provide underlying knowledge that benefits downstream tasks such as anomaly detection and route planning. We propose knowledge-driven interpretable vessel trajectory imputation (VISTA), the first trajectory imputation framework that offers interpretability while simultaneously providing underlying knowledge to support downstream analysis. Specifically, we first define underlying knowledge as a combination of Structured Data-derived Knowledge (SDK) distilled from AIS data and Implicit LLM Knowledge acquired from large-scale Internet corpora. Second, to manage and leverage the SDK effectively at scale, we develop a data-knowledge-data loop that employs a Structured Data-derived Knowledge Graph for SDK extraction and knowledge-driven trajectory imputation. Third, to efficiently process large-scale AIS data, we introduce a workflow management layer that coordinates the end-to-end pipeline, enabling parallel knowledge extraction and trajectory imputation with anomaly handling and redundancy elimination. Experiments on two large AIS datasets show that VISTA is capable of state-of-the-art imputation accuracy and computational efficiency, improving over state-of-the-art baselines by 5%-94% and reducing time cost by 51%-93%, while producing interpretable knowledge cues that benefit downstream tasks. The source code and implementation details of VISTA are publicly available.

[AI-75] mind_call: A Dataset for Mental Health Function Calling with Large Language Models

【Quick Read】: This paper addresses the lack of structured, wearable-sensor-grounded access in existing LLM function-calling datasets for mental health assistance: no current dataset covers mental-health-oriented interactions grounded in health signals such as sleep, physical activity, cardiovascular measures, stress levels, and metabolic data. The key to the solution is a synthetic function-calling dataset that maps diverse natural-language queries (explicit, implicit, behavioral, symptom-based, and metaphorical) to standardized API calls derived from a widely adopted health data schema; each sample includes the query category, an explicit reasoning step, a normalized temporal parameter, and the target function, supporting research on intent grounding, temporal reasoning, and reliable function invocation in LLM-based mental health agents.

Link: https://arxiv.org/abs/2601.06937
Authors: Fozle Rabbi Shafi, M. Anwar Hossain, Salimur Choudhury
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Model (LLM)-based systems increasingly rely on function calling to enable structured and controllable interaction with external data sources, yet existing datasets do not address mental health-oriented access to wearable sensor data. This paper presents a synthetic function-calling dataset designed for mental health assistance grounded in wearable health signals such as sleep, physical activity, cardiovascular measures, stress indicators, and metabolic data. The dataset maps diverse natural language queries to standardized API calls derived from a widely adopted health data schema. Each sample includes a user query, a query category, an explicit reasoning step, a normalized temporal parameter, and a target function. The dataset covers explicit, implicit, behavioral, symptom-based, and metaphorical expressions, which reflect realistic mental health-related user interactions. This resource supports research on intent grounding, temporal reasoning, and reliable function invocation in LLM-based mental health agents and is publicly released to promote reproducibility and future work.
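To illustrate the record structure the abstract describes, here is a hypothetical sample in Python: the field names follow the paper's description (query, category, reasoning, normalized temporal parameter, target function), but the concrete values and the function name are invented for illustration, not taken from the released data.

```python
# Hypothetical sample mapping an implicit mental-health query to an API call.
sample = {
    "query": "I've been tossing and turning all week and feel on edge.",
    "category": "implicit",        # explicit / implicit / behavioral / symptom-based / metaphorical
    "reasoning": "Sleep disturbance plus tension suggests checking sleep and "
                 "stress signals from the wearable over the past week.",
    "temporal_parameter": {"start": "-7d", "end": "now"},
    "target_function": "get_sleep_summary",   # hypothetical API name
}

def route(record):
    """Dispatch a validated record to the wearable-data API (stubbed here)."""
    return f"{record['target_function']}({record['temporal_parameter']})"

print(route(sample))
```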

[AI-76] Towards Compositional Generalization in LLMs for Smart Contract Security: A Case Study on Reentrancy Vulnerabilities

【Quick Read】: This paper addresses why large language models (LLMs), despite strong natural language understanding and generation, still fail to outperform traditional static analysis tools in specialized domains such as smart contract vulnerability detection. The core of the solution is a post-training algorithm based on atomic task decomposition and fusion: the complex reentrancy detection task is decomposed into four linearly independent atomic tasks (identifying external calls, identifying state updates, identifying data dependencies between external calls and state updates, and determining their data-flow order) to achieve compositional generalization. The approach trains on synthetic, compiler-verified datasets, uses the Slither tool to extract structural information from control-flow and data-flow graphs to fine-tune LLM adapters, and applies low-rank normalization fusion with LoRA adapters, raising reentrancy detection accuracy to 98.2% and improving recall by 20% over traditional tools on 31 real-world contracts.

Link: https://arxiv.org/abs/2601.06914
Authors: Ying Zhou, Jiacheng Wei, Yu Qi, Faguo Wu, Xiao Zhang
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) demonstrate remarkable capabilities in natural language understanding and generation. Despite being trained on large-scale, high-quality data, LLMs still fail to outperform traditional static analysis tools in specialized domains like smart contract vulnerability detection. To address this issue, this paper proposes a post-training algorithm based on atomic task decomposition and fusion. This algorithm aims to achieve combinatorial generalization under limited data by decomposing complex reasoning tasks. Specifically, we decompose the reentrancy vulnerability detection task into four linearly independent atomic tasks: identifying external calls, identifying state updates, identifying data dependencies between external calls and state updates, and determining their data flow order. These tasks form the core components of our approach. By training on synthetic datasets, we generate three compiler-verified datasets. We then employ the Slither tool to extract structural information from the control flow graph and data flow graph, which is used to fine-tune the LLM’s adapter. Experimental results demonstrate that low-rank normalization fusion with the LoRA adapter improves the LLM’s reentrancy vulnerability detection accuracy to 98.2%, surpassing state-of-the-art methods. On 31 real-world contracts, the algorithm achieves a 20% higher recall than traditional analysis tools.
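For intuition, here is a toy sketch of the four atomic checks on a simplified representation of a contract function as an ordered event list. The real inputs come from Slither's control-flow and data-flow graphs; this stand-in (including the event schema and dependency test) only illustrates the decomposition, not the paper's pipeline.

```python
def reentrancy_signals(events):
    """Evaluate the four atomic tasks on an ordered list of toy events."""
    calls = [i for i, e in enumerate(events) if e["op"] == "external_call"]
    writes = [i for i, e in enumerate(events) if e["op"] == "state_write"]
    dep = any(events[w].get("depends_on") == events[c]["target"]
              for c in calls for w in writes)
    call_before_write = any(c < w for c in calls for w in writes)
    return {
        "has_external_call": bool(calls),            # atomic task 1
        "has_state_update": bool(writes),             # atomic task 2
        "call_write_dependency": dep,                 # atomic task 3
        "call_precedes_update": call_before_write,    # atomic task 4
    }

# Classic vulnerable ordering: send funds first, zero the balance afterwards.
trace = [
    {"op": "external_call", "target": "msg.sender"},
    {"op": "state_write", "var": "balances", "depends_on": "msg.sender"},
]
flags = reentrancy_signals(trace)
print(flags, "-> reentrancy risk:", all(flags.values()))
```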

[AI-77] V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking

【Quick Read】: This paper addresses two key problems in precise GUI element localization: conventional methods ignore background regions, causing attention drift, and they model the target UI element uniformly, failing to distinguish its center from its edges and hurting click precision. The core of the Valley-to-Peak (V2P) solution is twofold: first, a suppression attention mechanism reduces the model's focus on irrelevant background regions to highlight the target area; second, a Fitts'-law-inspired 2D Gaussian heatmap models user interaction with high weight at the element center decaying toward the edges, the variance determined by the target's size, so the model concentrates on the most essential point of the UI element. V2P reaches 92.4% and 52.5% grounding accuracy on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, validating its effectiveness and generalizability for GUI grounding.

Link: https://arxiv.org/abs/2601.06899
Authors: Jikai Chen, Long Chen, Dong Wang, Qinglin Su, Zhixuan Chu, Bingguang Hao, Leilei Gan, Chenyi Zhuang, Jinjie Gu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) failing to process background regions causes attention drift from the desired area, and (2) modeling the target UI element uniformly fails to distinguish between its center and edges, leading to click imprecision. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model’s focus on irrelevant regions to highlight the intended region. For the issue of center-edge distinction, V2P applies a Fitts’ Law-inspired approach by modeling GUI interactions as 2D Gaussian heatmaps where the weight gradually decreases from the center towards the edges. The weight distribution follows a Gaussian function, with the variance determined by the target’s size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. The model trained by V2P achieves 92.4% and 52.5% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, respectively. Ablations further confirm each component’s contribution, underscoring V2P’s generalizability in precise GUI grounding tasks and its potential for real-world deployment in future GUI agents.
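For intuition, a minimal NumPy sketch of the heatmap target the abstract describes: near-zero weight on the background and a 2D Gaussian peak over the element, with the spread tied to the element's size. The constants (sigma as a quarter of the box extent, zero background weight) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def v2p_target(h, w, box, bg_weight=0.0):
    """Center-peaked training target over a UI element; `box` = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx, sy = max((x1 - x0) / 4, 1e-6), max((y1 - y0) / 4, 1e-6)  # size-dependent sigma
    ys, xs = np.mgrid[0:h, 0:w]
    peak = np.exp(-((xs - cx) ** 2 / (2 * sx**2) + (ys - cy) ** 2 / (2 * sy**2)))
    heat = np.full((h, w), bg_weight)          # suppressed background "valley"
    inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
    heat[inside] = peak[inside]                # center-weighted, edges de-emphasized
    return heat

heat = v2p_target(100, 160, box=(40, 30, 80, 50))
print(heat.max(), heat[40, 60])                # ~1.0 at the element center
```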

[AI-78] Personality-Aware Reinforcement Learning for Persuasive Dialogue with LLM-Driven Simulation

【Quick Read】: This paper addresses the lack of personalized adaptation in persuasive dialogue agents: how to optimize strategies according to the evolution of a user's psychological state and intentions so as to improve persuasion outcomes. The key is a personality-aware reinforcement learning framework with three modules: (1) a Strategy-Oriented Interaction Framework, an agenda-based strategy controller that ensures contextual relevance and diversity of responses via Maximal Marginal Relevance (MMR) retrieval; (2) Personality-Aware User Representation Learning, which produces an 81-dimensional mixed-type embedding at each turn and adds it to the RL state; and (3) a Dueling Double DQN (D3QN) policy with behavior-informed reward design, trained on a composite reward combining agreement intent, donation amount, and a change-of-mind penalty. Experiments show that turn-level personality conditioning, LLM-driven simulation for data augmentation, and the change-of-mind penalty together yield markedly more adaptive policies and better overall persuasion performance.

Link: https://arxiv.org/abs/2601.06877
Authors: Donghuo Zeng, Roberto Legaspi, Kazushi Ikeda
Institution: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, 3 tables

Abstract:Effective persuasive dialogue agents adapt their strategies to individual users, accounting for the evolution of their psychological states and intentions throughout conversations. We present a personality-aware reinforcement learning approach comprising three main modules: (1) a Strategy-Oriented Interaction Framework, which serves as an agenda-based strategy controller that selects strategy-level actions and generates responses via Maximal Marginal Relevance (MMR) retrieval to ensure contextual relevance, diversity, and scalable data generation; (2) Personality-Aware User Representation Learning, which produces an 81-dimensional mixed-type embedding predicted at each turn from recent exchanges and appended to the reinforcement learning state; and (3) a Dueling Double DQN (D3QN) model and Reward Prediction, in which the policy is conditioned on dialogue history and turn-level personality estimates and trained using a composite reward incorporating agreement intent, donation amount, and change-of-mind penalties. We use an agenda-based LLM simulation pipeline to generate diverse interactions, from which personality estimates are inferred based on the generated utterances. Experiments on the PersuasionForGood (P4G) dataset augmented with simulated dialogues reveal three main findings: (i) turn-level personality conditioning improves policy adaptability and cumulative persuasion rewards; (ii) LLM-driven simulation enhances generalization to unseen user behaviors; and (iii) incorporating a change-of-mind penalty reduces post-agreement retractions while slightly improving donation outcomes. These results demonstrate that structured interaction, dynamic personality estimation, and behaviorally informed rewards together yield more effective persuasive policies.
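A hedged sketch of the composite turn reward described above; the weights and the retraction signal are hypothetical stand-ins, not values from the paper.

```python
def persuasion_reward(agreed, donation, retracted,
                      w_intent=1.0, w_donation=0.5, w_retract=2.0):
    """Composite reward per turn.
    agreed: user expressed agreement intent this turn (bool)
    donation: pledged amount in dollars (float)
    retracted: user withdrew a previous agreement, i.e. change of mind (bool)"""
    return (w_intent * float(agreed)
            + w_donation * donation
            - w_retract * float(retracted))

print(persuasion_reward(agreed=True, donation=2.0, retracted=False))  # 2.0
print(persuasion_reward(agreed=False, donation=0.0, retracted=True))  # -2.0
```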

[AI-79] DaQ-MSA: Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis

【Quick Read】: This paper addresses how the scarcity of high-quality training data limits multimodal sentiment analysis (MSA), constraining model understanding and generalization. The key is DaQ-MSA (Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis): diffusion models perform semantics-preserving augmentation on the video and audio modalities, while a quality scoring module assesses the reliability of augmented samples and assigns adaptive training weights, down-weighting low-quality samples and emphasizing high-fidelity ones for more stable training. The approach improves the robustness and generalization of multimodal large language models (MLLMs) without any human annotation or additional supervision.

Link: https://arxiv.org/abs/2601.06870
Authors: Jiazhang Liang, Jianheng Dai, Miaosen Luo, Menghua Jiang, Sijie Mai
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures

Abstract:Multimodal large language models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their effectiveness on multimodal sentiment analysis remains constrained by the scarcity of high-quality training data, which limits accurate multimodal understanding and generalization. To alleviate this bottleneck, we leverage diffusion models to perform semantics-preserving augmentation on the video and audio modalities, expanding the multimodal training distribution. However, increasing data quantity alone is insufficient, as diffusion-generated samples exhibit substantial quality variation and noisy augmentations may degrade performance. We therefore propose DaQ-MSA (Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis), which introduces a quality scoring module to evaluate the reliability of augmented samples and assign adaptive training weights. By down-weighting low-quality samples and emphasizing high-fidelity ones, DaQ-MSA enables more stable learning. By integrating the generative capability of diffusion models with the semantic understanding of MLLMs, our approach provides a robust and generalizable automated augmentation strategy for training MLLMs without any human annotation or additional supervision.
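One plausible way to realize the adaptive weighting, as a minimal PyTorch sketch: per-sample cross-entropy losses are combined with weights derived from the quality scores. The softmax re-normalization over the batch and the temperature are assumptions, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def quality_weighted_loss(logits, labels, quality_scores, temperature=1.0):
    """Down-weight low-quality diffusion augmentations via per-sample weights."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.softmax(torch.as_tensor(quality_scores) / temperature, dim=0)
    return (weights * per_sample).sum()

logits = torch.randn(4, 3)            # 4 augmented samples, 3 sentiment classes
labels = torch.tensor([0, 2, 1, 0])
quality = [0.9, 0.2, 0.8, 0.6]        # from the quality scoring module
print(quality_weighted_loss(logits, labels, quality))
```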

[AI-80] ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration

【Quick Read】: This paper addresses the ineffective actions, such as redundant or insufficient tool calls, that large language models (LLMs) exhibit in tool-integrated reasoning (TIR) because existing training frameworks align only answer accuracy and neglect behavior patterns, making it hard for agents to explore effective trajectories. The key is the ET-Agent training framework, built on two synergistic mechanisms: a Self-evolving Data Flywheel that generates enhanced data to fine-tune the LLM and improve its exploration ability, and a two-phase Behavior Calibration Training that progressively calibrates erroneous behavior patterns toward optimal behaviors, significantly improving tool-use efficiency and reasoning quality.

Link: https://arxiv.org/abs/2601.06860
Authors: Yifei Chen, Guanting Dong, Zhicheng Dou
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) can extend their parameter knowledge limits by adopting the Tool-Integrated Reasoning (TIR) paradigm. However, existing LLM-based agent training frameworks often focus on answers’ accuracy, overlooking specific alignment for behavior patterns. Consequently, agents often exhibit ineffective actions during TIR tasks, such as redundant or insufficient tool calls. How to calibrate erroneous behavioral patterns when executing TIR tasks, thereby exploring effective trajectories, remains an open problem. In this paper, we propose ET-Agent, a training framework for calibrating an agent’s tool-use behavior through two synergistic perspectives: Self-evolving Data Flywheel and Behavior Calibration Training. Specifically, we introduce a self-evolving data flywheel to generate enhanced data, used to fine-tune the LLM to improve its exploration ability. Based on this, we implement a two-phase behavior-calibration training framework, designed to progressively calibrate erroneous behavioral patterns toward optimal behaviors. Further in-depth experiments confirm the superiority of ET-Agent across multiple dimensions, including correctness, efficiency, reasoning conciseness, and tool execution accuracy. Our ET-Agent framework provides practical insights for research in the TIR field. Code can be found at this https URL

[AI-81] MoE-DisCo: Low Economy Cost Training Mixture-of-Experts Models

【Quick Read】: This paper addresses the strong dependence of large-scale Mixture-of-Experts (MoE) training on high-memory, high-bandwidth GPUs (e.g., A100), whose cost has become a major barrier, while low-cost devices cannot train large models directly due to memory and bandwidth limits. The key is MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework: the MoE model is decomposed into multiple dense submodels, each consisting of a shared backbone and a single expert, and the training data is partitioned into subsets by unsupervised clustering; each submodel is trained independently and in parallel on low-budget devices without inter-device communication; all experts are then assembled into the full MoE model and briefly fine-tuned globally on a few high-resource GPUs. The method matches or even surpasses full-parameter training while cutting training cost by 47.6% to 69.5%.

Link: https://arxiv.org/abs/2601.06857
Authors: Xin Ye, Daning Cheng, Boyang Zhang, Yunquan Zhang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), and their high cost has become a major barrier to large-model training. In contrast, affordable hardware is low-cost but constrained by memory capacity and bandwidth, making it unsuitable for direct LLM training. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets using unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset using low-cost devices, without any inter-device communication. Subsequently, all experts are integrated into a complete MoE model and fine-tuned globally for a short period on high-memory, high-bandwidth GPUs. Experiments show that our method matches or even surpasses full-parameter training in performance across multiple downstream tasks, loss function, and perplexity (PPL), while reducing training cost by 47.6 percent to 69.5 percent on Qwen1.5-MoE-2.7B and Llama-MoE-3.5B across different datasets.
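A schematic outline of the staged recipe, under stated assumptions: the clustering, the stand-in training function, and the embedding dimensions are toy placeholders, and the stage-2 global fine-tune is only indicated in comments. This is a sketch of the control flow, not a working MoE trainer.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_submodel(backbone_init, expert_init, subset_indices):
    """Stand-in for independent dense training (backbone + one expert)
    of a submodel on a low-cost device; would return trained weights."""
    return backbone_init, expert_init

# Stage 0: partition the corpus by unsupervised clustering of embeddings (toy).
embeddings = np.random.randn(1000, 32)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

# Stage 1: one dense submodel per cluster, trained independently in parallel.
experts = []
for k in range(4):
    subset = np.where(clusters == k)[0]
    backbone, expert = train_submodel("shared_init", f"expert_{k}", subset)
    experts.append(expert)

# Stage 2: assemble the full MoE (shared backbone + all experts + router)
# and run a short global fine-tune on high-memory GPUs; omitted here.
moe = {"backbone": backbone, "experts": experts, "router": "learned gate"}
print(moe["experts"])
```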

[AI-82] A Brain-like Synergistic Core in LLMs Drives Behaviour and Learning

【Quick Read】: This paper asks whether universal computational principles of intelligence can be identified in generative AI models, in particular whether their information processing resembles the informational organization of biological brains. The key is an information-decomposition analysis across multiple LLM families and architectures, which reveals strong synergistic information integration in middle layers and redundancy-dominated processing in early and late layers; this structure emerges through training and is absent in randomly initialized networks. Ablation experiments and reinforcement-learning fine-tuning further confirm that the synergistic components are disproportionately important to model behavior and performance, suggesting that synergistic information processing is a fundamental property of intelligence.

Link: https://arxiv.org/abs/2601.06851
Authors: Pedro Urbina-Rodriguez, Zafeirios Fountas, Fernando E. Rosas, Jun Wang, Andrea I. Luppi, Haitham Bou-Ammar, Murray Shanahan, Pedro A. M. Mediano
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The independent evolution of intelligence in biological and artificial systems offers a unique opportunity to identify its fundamental computational principles. Here we show that large language models spontaneously develop synergistic cores – components where information integration exceeds individual parts – remarkably similar to those in the human brain. Using principles of information decomposition across multiple LLM model families and architectures, we find that areas in middle layers exhibit synergistic processing while early and late layers rely on redundancy, mirroring the informational organisation in biological brains. This organisation emerges through learning and is absent in randomly initialised networks. Crucially, ablating synergistic components causes disproportionate behavioural changes and performance loss, aligning with theoretical predictions about the fragility of synergy. Moreover, fine-tuning synergistic regions through reinforcement learning yields significantly greater performance gains than training redundant components, yet supervised fine-tuning shows no such advantage. This convergence suggests that synergistic information processing is a fundamental property of intelligence, providing targets for principled model design and testable predictions for biological intelligence.

[AI-83] Code Evolution for Control: Synthesizing Policies via LLM-Driven Evolutionary Search

【Quick Read】: This paper addresses the challenges of designing control policies for autonomous systems: reinforcement learning (RL) suffers from high sample complexity, difficult reward design, and opaque black-box policies that are hard to interpret and verify, while manual design requires heavy domain expertise and scales poorly across tasks. The key is to cast policy synthesis as code evolution, leveraging an LLM's prior knowledge of programming patterns and control heuristics together with evolutionary search to explore the solution space systematically. Implemented with the EvoToolkit framework, populations of candidate policy programs are iteratively evolved, evaluated against task-specific objectives, and selected for reproduction, yielding compact, human-readable control policies that can be directly inspected, modified, and formally verified.

Link: https://arxiv.org/abs/2601.06845
Authors: Ping Guo, Chao Li, Yinglan Feng, Chaoning Zhang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Designing effective control policies for autonomous systems remains a fundamental challenge, traditionally addressed through reinforcement learning or manual engineering. While reinforcement learning has achieved remarkable success, it often suffers from high sample complexity, reward shaping difficulties, and produces opaque neural network policies that are hard to interpret or verify. Manual design, on the other hand, requires substantial domain expertise and struggles to scale across diverse tasks. In this work, we demonstrate that LLM-driven evolutionary search can effectively synthesize interpretable control policies in the form of executable code. By treating policy synthesis as a code evolution problem, we harness the LLM’s prior knowledge of programming patterns and control heuristics while employing evolutionary search to explore the solution space systematically. We implement our approach using EvoToolkit, a framework that seamlessly integrates LLM-driven evolution with customizable fitness evaluation. Our method iteratively evolves populations of candidate policy programs, evaluating them against task-specific objectives and selecting superior individuals for reproduction. This process yields compact, human-readable control policies that can be directly inspected, modified, and formally verified. This work highlights the potential of combining foundation models with evolutionary computation for synthesizing trustworthy control policies in autonomous systems. Code is available at this https URL.

[AI-84] Variational decomposition autoencoding improves disentanglement of latent representations

【Quick Read】: This paper addresses the difficulty of understanding the structure of complex, nonstationary, high-dimensional time-evolving signals, especially in speech and biomedical signal processing, where disentangled and interpretable representations are needed to uncover latent generative mechanisms that traditional unsupervised methods such as variational autoencoders (VAEs) struggle to capture. The key is the variational decomposition autoencoding (VDA) framework, instantiated as variational decomposition autoencoders (DecVAEs): encoder-only networks that combine a signal decomposition model, a contrastive self-supervised task, and variational prior approximation to learn structured representations in multiple latent subspaces aligned with time-frequency characteristics. Experiments show DecVAEs surpass state-of-the-art VAE-based methods in disentanglement quality, cross-task generalization, and interpretability of latent encodings.

Link: https://arxiv.org/abs/2601.06844
Authors: Ioannis Ziogas, Aamna Al Shehhi, Ahsan H. Khandoker, Leontios J. Hadjileontiadis
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Supplementary information file at: this https URL

Abstract:Understanding the structure of complex, nonstationary, high-dimensional time-evolving signals is a central challenge in scientific data analysis. In many domains, such as speech and biomedical signal processing, the ability to learn disentangled and interpretable representations is critical for uncovering latent generative mechanisms. Traditional approaches to unsupervised representation learning, including variational autoencoders (VAEs), often struggle to capture the temporal and spectral diversity inherent in such data. Here we introduce variational decomposition autoencoding (VDA), a framework that extends VAEs by incorporating a strong structural bias toward signal decomposition. VDA is instantiated through variational decomposition autoencoders (DecVAEs), i.e., encoder-only neural networks that combine a signal decomposition model, a contrastive self-supervised task, and variational prior approximation to learn multiple latent subspaces aligned with time-frequency characteristics. We demonstrate the effectiveness of DecVAEs on simulated data and three publicly available scientific datasets, spanning speech recognition, dysarthria severity evaluation, and emotional speech classification. Our results demonstrate that DecVAEs surpass state-of-the-art VAE-based methods in terms of disentanglement quality, generalization across tasks, and the interpretability of latent encodings. These findings suggest that decomposition-aware architectures can serve as robust tools for extracting structured representations from dynamic signals, with potential applications in clinical diagnostics, human-computer interaction, and adaptive neurotechnologies.

[AI-85] Seeing through the Conflict: Transparent Knowledge Conflict Handling in Retrieval-Augmented Generation

【Quick Read】: This paper addresses three failure modes of the retrieval-augmented generation (RAG) paradigm in generative AI: hallucination, over-trusting noisy snippets, and ignoring vital context. The core solution is TCR (Transparent Conflict Resolution), a plug-and-play framework that makes the decision process observable and controllable through three techniques: (i) dual contrastive encoders disentangle semantic match from factual consistency; (ii) self-answerability estimation gauges the model's confidence in its internal memory; and (iii) the three scalar signals are fed to the generator through a lightweight soft prompt with SNR-based weighting. Across seven benchmarks, TCR improves conflict detection (+5-18 F1), raises knowledge-gap recovery by 21.4 pp, and cuts misleading-context overrides by 29.3 pp, while adding only 0.3% parameters.

Link: https://arxiv.org/abs/2601.06842
Authors: Hua Ye, Siyuan Chen, Ziqi Zhong, Canran Xiao, Haoliang Zhang, Yuhan Wu, Fei Shen
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures, 5 tables

Abstract:Large language models (LLMs) equipped with retrieval–the Retrieval-Augmented Generation (RAG) paradigm–should combine their parametric knowledge with external evidence, yet in practice they often hallucinate, over-trust noisy snippets, or ignore vital context. We introduce TCR (Transparent Conflict Resolution), a plug-and-play framework that makes this decision process observable and controllable. TCR (i) disentangles semantic match and factual consistency via dual contrastive encoders, (ii) estimates self-answerability to gauge confidence in internal memory, and (iii) feeds the three scalar signals to the generator through a lightweight soft-prompt with SNR-based weighting. Across seven benchmarks TCR improves conflict detection (+5-18 F1), raises knowledge-gap recovery by +21.4 pp and cuts misleading-context overrides by -29.3 pp, while adding only 0.3% parameters. The signals align with human judgements and expose temporal decision patterns.

[AI-86] WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport

【Quick Read】: This paper addresses the numerical instability, high computational cost, and poor scalability of existing solvers for the Wasserstein-Fisher-Rao (WFR) metric when modeling unbalanced snapshot dynamics. The key is WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced optimal transport: it simultaneously learns a vector field for displacement and a scalar growth rate function for mass change (birth-death dynamics), yielding continuous flows under the WFR geometry. Theoretically, minimizing the WFR-FM loss exactly recovers WFR geodesics; empirically, it infers trajectories with proliferation and apoptosis more accurately and robustly in single-cell biology, estimates time-varying growth fields, and outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy.

Link: https://arxiv.org/abs/2601.06810
Authors: Qiangwei Peng, Zihan Wang, Junda Ying, Yuhao Sun, Qing Nie, Lei Zhang, Tiejun Li, Peijie Zhou
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
Comments:

Abstract:The Wasserstein-Fisher-Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth-death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and applying to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time.
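A rough sketch of the joint regression objective (not the authors' implementation): unlike plain flow matching, which fits only a velocity field v(x, t), the network also outputs a scalar growth rate g(x, t). The conditional targets u_t and r_t, which in WFR-FM come from WFR geodesic interpolants, are random placeholders here so the snippet runs.

```python
import torch
import torch.nn as nn

# Joint network: (x, t) -> (v, g), with 2-D states for illustration.
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 3))

x = torch.randn(128, 2)             # cell states sampled along interpolant paths
t = torch.rand(128, 1)
out = net(torch.cat([x, t], dim=1))
v_pred, g_pred = out[:, :2], out[:, 2]

u_t = torch.randn(128, 2)           # placeholder transport (velocity) target
r_t = torch.randn(128)              # placeholder growth-rate target

# Simulation-free regression on both components of the WFR path.
loss = ((v_pred - u_t) ** 2).sum(1).mean() + ((g_pred - r_t) ** 2).mean()
loss.backward()
print(float(loss))
```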

[AI-87] Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy

【Quick Read】: This paper addresses the perception-reasoning decoupling that arises when reinforcement learning with verifiable rewards (RLVR) is adapted to multimodal settings: text-centric reward signals let models bypass visual perception and rely on linguistic priors to produce plausible answers, degenerating into blind reasoners. The core solution is a Differential Visual Reasoning Policy (DVRP) that provides intrinsic supervision via visual triplets of original, masked, and perturbed images: DVRP maximizes reasoning divergence under masked inputs (enforcing visual sensitivity) while minimizing divergence under perturbed inputs (ensuring visual robustness), aligning reasoning variations strictly with the Delta of visual information. This markedly strengthens visual understanding and outperforms state-of-the-art methods on both general and medical benchmarks, without external annotations or auxiliary tools.

Link: https://arxiv.org/abs/2601.06801
Authors: Shujian Gao, Yuan Wang, Jiangtao Yan, Zuxuan Wu, Yu-Gang Jiang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 10 tables, 4 figures

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical perception-reasoning decoupling. Existing paradigms, driven by text-centric outcome rewards and reasoning in the language medium, inadvertently encourage models to bypass visual perception. We empirically validate this through blind experiments: state-of-the-art policies maintain or surprisingly improve performance even when visual inputs are entirely removed. This reveals that these models degenerate into blind reasoners, exploiting linguistic priors to generate plausible answers instead of attending to visual evidence. In response, we propose Thinking with Deltas, a framework driven by a Differential Visual Reasoning Policy (DVRP). DVRP introduces intrinsic supervision via visual triplets, comprising original, masked, and perturbed inputs. It optimizes the model to maximize reasoning divergence from masked inputs (enforcing visual sensitivity) while minimizing divergence from perturbed inputs (ensuring visual robustness). By aligning reasoning variations strictly with the Delta of visual information, DVRP inherently bolsters visual understanding capabilities and significantly outperforms state-of-the-art methods on both general and medical benchmarks, without requiring external annotations or auxiliary tools.
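One concrete way to instantiate the two divergence terms, as a hedged sketch: compare the model's output distributions under the original, masked, and perturbed images, and combine KL divergences into a regularizer. The choice of KL on logits and the weight lam are assumptions; the paper may define the divergence differently.

```python
import torch
import torch.nn.functional as F

def dvrp_regularizer(logits_orig, logits_masked, logits_perturbed, lam=1.0):
    """Push the masked-input distribution away from the original (visual
    sensitivity) and pull the perturbed-input distribution toward it
    (visual robustness). Returned value is to be minimized."""
    p = F.log_softmax(logits_orig, dim=-1)
    q_mask = F.log_softmax(logits_masked, dim=-1)
    q_pert = F.log_softmax(logits_perturbed, dim=-1)
    d_mask = F.kl_div(q_mask, p, log_target=True, reduction="batchmean")
    d_pert = F.kl_div(q_pert, p, log_target=True, reduction="batchmean")
    return -d_mask + lam * d_pert

logits = torch.randn(4, 100)  # toy next-token logits for 4 samples
reg = dvrp_regularizer(logits, torch.randn(4, 100), logits + 0.01 * torch.randn(4, 100))
print(float(reg))
```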

[AI-88] Graph Neural Network with One-side Edge Sampling for Fraud Detection

【Quick Read】: This paper addresses the slow training, over-fitting, and over-smoothing problems of graph neural networks (GNNs) in financial fraud detection: over-smoothing makes node features converge to a fixed point through excessive aggregation of neighborhood information, while the deep architectures needed to capture complex fraud patterns worsen these issues and raise computational cost. The key is One-Side Edge Sampling (OES): during training, edges of the input graph are sampled dynamically based on predictive confidence in an edge classification task, reducing redundant message passing, mitigating over-smoothing, and improving generalization. Theoretical analysis and experiments show that OES matches or exceeds backbone performance in both shallow and deep architectures while shortening training time.

Link: https://arxiv.org/abs/2601.06800
Authors: Hoang Hiep Trieu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Financial fraud is always a major problem in the field of finance, as it can cause significant consequences. As a result, many approaches have been designed to detect it, and lately Graph Neural Networks (GNNs) have been demonstrated as a competent candidate. However, when trained with a large amount of data, they are slow and computationally demanding. In addition, GNNs may need a deep architecture to detect complex fraud patterns, but doing so may make them suffer from problems such as over-fitting or over-smoothing. Over-fitting leads to reduced generalisation of the model on unseen data, while over-smoothing causes all nodes’ features to converge to a fixed point due to excessive aggregation of information from neighbouring nodes. In this research, I propose an approach called One-Side Edge Sampling (OES) that can potentially reduce training duration as well as the effects of over-smoothing and over-fitting. The approach leverages predictive confidence in an edge classification task to sample edges from the input graph during a certain number of epochs. To explain why OES can alleviate over-smoothing, I perform a theoretical analysis of the proposed approach. In addition, to validate the effect of OES, I conduct experiments using different GNNs on two datasets. The results show that OES can empirically outperform backbone models in both shallow and deep architectures while also reducing training time.
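A minimal sketch of one reading of confidence-based edge sampling: edges the model already classifies confidently add little signal but drive over-smoothing, so they are dropped with some probability during certain epochs. The threshold, keep ratio, and exact rule are illustrative assumptions; the paper's sampling scheme may differ.

```python
import torch

def one_side_edge_sample(edge_index, edge_confidence, keep_ratio=0.7, thresh=0.9):
    """Keep all low-confidence edges; randomly subsample high-confidence ones."""
    confident = edge_confidence > thresh
    keep = torch.ones(edge_index.size(1), dtype=torch.bool)
    drop_mask = confident & (torch.rand_like(edge_confidence) > keep_ratio)
    keep[drop_mask] = False
    return edge_index[:, keep]

edge_index = torch.randint(0, 50, (2, 200))  # toy graph with 200 edges
conf = torch.rand(200)                        # predictive confidence per edge
print(one_side_edge_sample(edge_index, conf).shape)
```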

[AI-89] GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

【Quick Read】: This paper addresses two core problems of Group Relative Policy Optimization (GRPO) in automated theorem proving (ATP): first, with composite reward functions, GRPO's relative advantage estimation can conflict with the binary correct/incorrect feedback of the formal verifier; second, its static sampling strategy discards entire batches when no valid proof is found, yielding zero-contribution updates and substantial data waste. The key is Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), whose three synergistic mechanisms are: 1) dynamic additional sampling, which resamples invalid batches until a valid proof appears; 2) equal-right advantage, which decouples the sign of the advantage function (determined by correctness) from its magnitude (modulated by auxiliary rewards) to keep policy updates stable and correct; and 3) dynamic additional iterations, which apply extra gradient steps to initially failed but eventually successful samples to accelerate learning on hard cases. Experiments show that GDEPO markedly improves data utilization and optimization efficiency, offering a new training paradigm for ATP.

Link: https://arxiv.org/abs/2601.06795
Authors: Zhengqing Yan, Xinyang Liu, Yi Zhang, Fan Guo, Yao Liu, Junchen Wan, Kang Song
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, resulting in zero contribution to model updates and significant data waste. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.
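The equal-right advantage admits a compact sketch: the sign comes only from the verifier's binary verdict, while auxiliary rewards modulate the magnitude. The normalization, constants, and clipping below are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def equal_right_advantage(correct, aux_reward, base=1.0, scale=0.5):
    """Sign from correctness; magnitude (kept positive) from auxiliary rewards."""
    correct = np.asarray(correct, dtype=float)
    aux = np.asarray(aux_reward, dtype=float)
    sign = np.where(correct > 0, 1.0, -1.0)
    magnitude = base + scale * (aux - aux.mean()) / (aux.std() + 1e-8)
    return sign * np.clip(magnitude, 0.1, None)  # sign never flips

# Group of 4 sampled proofs: two verified, two failed, with auxiliary scores.
adv = equal_right_advantage([1, 1, 0, 0], [0.9, 0.4, 0.8, 0.2])
print(adv)  # positive for verified proofs, negative for failures
```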

[AI-90] No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

【Quick Read】: This paper addresses the staleness of static or offline critic models in critique-guided reinforcement learning (RL): as the policy evolves during on-policy RL, the agent's error patterns shift, so stationary critics become outdated and their feedback loses utility. The key is ECHO (Evolving Critic for Hindsight-Guided Optimization), which jointly optimizes the policy and the critic in a synchronized co-evolutionary loop: a cascaded rollout mechanism has the critic generate multiple diagnoses of an initial trajectory, followed by policy refinement that enables group-structured advantage estimation; a saturation-aware gain shaping objective rewards the critic for inducing incremental improvements in high-performing trajectories; and dual-track GRPO updates keep the critic's feedback synchronized with the evolving policy.

Link: https://arxiv.org/abs/2601.06794
Authors: Zhicong Li, Lingjie Jiang, Yulan Hu, Xingchen Zeng, Yixia Li, Xiangwen Zhang, Guanhua Chen, Zheng Pan, Xin Li, Yong Liu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent’s error patterns shift over time, causing stationary critics to become stale and providing feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation. We address the challenge of learning plateaus via a saturation-aware gain shaping objective, which rewards the critic for inducing incremental improvements in high-performing trajectories. By employing dual-track GRPO updates, ECHO ensures the critic’s feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.

[AI-91] SecMoE: Communication-Efficient Secure MoE Inference via Select-Then-Compute AAAI2026

【Quick Read】: This paper addresses the difficulty of scaling privacy-preserving Transformer inference, especially under Mixture-of-Experts (MoE) architectures: existing secure two-party computation (2-PC) protocols let the server homomorphically compute FFN layers with its plaintext weights, which under MoE could reveal which expert is activated and expose token-level privacy of the client's input, while naively evaluating all experts nullifies MoE sparsity. The key is SecMoE, a 2-PC private inference framework built on a Select-Then-Compute mechanism: by unifying per-entry circuits in both the MoE layer and piecewise polynomial functions, it obliviously selects parameters and computes only a single encrypted entry, preserving MoE sparsity without leaking the activated expert. This lets the model for private inference scale 63x larger with only a 15.2x increase in end-to-end runtime, while improving communication and speed by 1.3-3.8x over state-of-the-art protocols.

Link: https://arxiv.org/abs/2601.06790
Authors: Bowen Shen, Yuyue Chen, Peng Yang, Bin Zhang, Xi Zhang, Zoe L. Jiang
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026

Abstract:Privacy-preserving Transformer inference has gained attention due to the potential leakage of private information. Despite recent progress, existing frameworks still fall short of practical model scales, with gaps up to a hundredfold. A possible way to close this gap is the Mixture of Experts (MoE) architecture, which has emerged as a promising technique to scale up model capacity with minimal overhead. However, given that the current secure two-party (2-PC) protocols allow the server to homomorphically compute the FFN layer with its plaintext model weight, under the MoE setting, this could reveal which expert is activated to the server, exposing token-level privacy about the client’s input. While naively evaluating all the experts before selection could protect privacy, it nullifies MoE sparsity and incurs the heavy computational overhead that sparse MoE seeks to avoid. To address the privacy and efficiency limitations above, we propose a 2-PC privacy-preserving inference framework, SecMoE. Unifying per-entry circuits in both the MoE layer and piecewise polynomial functions, SecMoE obliviously selects the extracted parameters from circuits and only computes one encrypted entry, which we refer to as Select-Then-Compute. This makes the model for private inference scale to 63x larger while only having a 15.2x increase in end-to-end runtime. Extensive experiments show that, under 5 expert settings, SecMoE lowers the end-to-end private inference communication by 1.8-7.1x and achieves 1.3-3.8x speedup compared to the state-of-the-art (SOTA) protocols.

[AI-92] MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences

【Quick Read】: This paper addresses the "closed-world" limitation of autonomous software engineering (SWE) agents during debugging: they fix bugs from scratch or with only local context, ignoring the immense historical human experience accumulated on platforms such as GitHub. The key is MemGovern, a framework that governs unstructured GitHub issue-tracking data and, through experience governance, transforms it into structured, agent-friendly experience cards, combined with an agentic experience search strategy for logic-driven retrieval of human expertise. Producing 135K governed experience cards, MemGovern raises resolution rates on SWE-bench Verified by 4.65% and, as a plug-in approach, provides agent-friendly memory infrastructure.

Link: https://arxiv.org/abs/2601.06789
Authors: Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, Yue Hu, Shaolei Zhang, Yanbing Liu, Ronghao Chen, Huacan Wang
Institution: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:While autonomous software engineering (SWE) agents are reshaping programming paradigms, they currently suffer from a “closed-world” limitation: they attempt to fix bugs from scratch or solely using local context, ignoring the immense historical human experience available on platforms like GitHub. Accessing this open-world experience is hindered by the unstructured and fragmented nature of real-world issue-tracking data. In this paper, we introduce MemGovern, a framework designed to govern and transform raw GitHub data into actionable experiential memory for agents. MemGovern employs experience governance to convert human experience into agent-friendly experience cards and introduces an agentic experience search strategy that enables logic-driven retrieval of human expertise. By producing 135K governed experience cards, MemGovern achieves a significant performance boost, improving resolution rates on the SWE-bench Verified by 4.65%. As a plug-in approach, MemGovern provides a solution for agent-friendly memory infrastructure.

[AI-93] Artificial Entanglement in the Fine-Tuning of Large Language Models

【Quick Read】: This paper asks why parameter-efficient fine-tuning (PEFT) works in large language models (LLMs), in particular how low-rank updates shape parameter structure and contribute to task performance. The key is to import the notion of "Artificial Entanglement" from quantum information theory: by defining and measuring the entanglement entropy of neural network parameters, the authors compare low-rank adaptation (LoRA) with full fine-tuning (FFT) across internal parameter updates and external attention matrices. LoRA exhibits volume-law internal entanglement with a central suppression (an "Entanglement Valley"), while attention matrices show area-law external entanglement that is robust to hyperparameters; since these differences do not manifest in the final attention outputs, the authors propose, by analogy with the no-hair theorem in black hole physics, that low-rank updates possess a "no-hair" property, which explains their effectiveness.

Link: https://arxiv.org/abs/2601.06788
Authors: Min Chen, Zihan Wang, Canyu Chen, Zeguan Wu, Manling Li, Junyu Liu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Comments: 41 pages, many figures

Abstract:Large language models (LLMs) can be adapted to new tasks using parameter-efficient fine-tuning (PEFT) methods that modify only a small number of trainable parameters, often through low-rank updates. In this work, we adopt a quantum-information-inspired perspective to understand their effectiveness. From this perspective, low-rank parameterizations naturally correspond to low-dimensional Matrix Product States (MPS) representations, which enable entanglement-based characterizations of parameter structure. Thereby, we term and measure “Artificial Entanglement”, defined as the entanglement entropy of the parameters in artificial neural networks (in particular the LLMs). We first study the representative low-rank adaptation (LoRA) PEFT method, alongside full fine-tuning (FFT), using LLaMA models at the 1B and 8B scales trained on the Tulu3 and OpenThoughts3 datasets, and uncover: (i) Internal artificial entanglement in the updates of query and value projection matrices in LoRA follows a volume law with a central suppression (termed as the “Entanglement Valley”), which is sensitive to hyper-parameters and is distinct from that in FFT; (ii) External artificial entanglement in attention matrices, corresponding to token-token correlations in representation space, follows an area law with logarithmic corrections and remains robust to LoRA hyper-parameters and training steps. Drawing a parallel to the No-Hair Theorem in black hole physics, we propose that although LoRA and FFT induce distinct internal entanglement signatures, such differences do not manifest in the attention outputs, suggesting a “no-hair” property that results in the effectiveness of low rank updates. We further provide theoretical support based on random matrix theory, and extend our analysis to an MPS Adaptation PEFT method, which exhibits qualitatively similar behaviors.

[AI-94] From Text to Simulation: A Multi-Agent LLM Workflow for Automated Chemical Process Design

【Quick Read】: This paper addresses the difficulty of automating the step from process flow diagrams to executable simulation models in chemical engineering design, i.e., converting textual process descriptions into computer-verifiable simulation configurations efficiently and accurately, reducing dependence on manual parameter configuration. The key is a multi-agent workflow built on large language models (LLMs) with four specialized agents for task understanding, topology generation, parameter configuration, and evaluation analysis, combined with Enhanced Monte Carlo Tree Search for precise semantic interpretation and robust configuration generation, achieving an end-to-end automated pipeline from textual specifications to simulation-feasible designs.

Link: https://arxiv.org/abs/2601.06776
Authors: Xufei Tian, Wenli Du, Shaoyi Yang, Han Hu, Hui Xin, Shifeng Qu, Ke Ye
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Process simulation is a critical cornerstone of chemical engineering design. Current automated chemical design methodologies focus mainly on various representations of process flow diagrams. However, transforming these diagrams into executable simulation flowsheets remains a time-consuming and labor-intensive endeavor, requiring extensive manual parameter configuration within simulation software. In this work, we propose a novel multi-agent workflow that leverages the semantic understanding capabilities of large language models(LLMs) and enables iterative interactions with chemical process simulation software, achieving end-to-end automated simulation from textual process specifications to computationally validated software configurations for design enhancement. Our approach integrates four specialized agents responsible for task understanding, topology generation, parameter configuration, and evaluation analysis, respectively, coupled with Enhanced Monte Carlo Tree Search to accurately interpret semantics and robustly generate configurations. Evaluated on Simona, a large-scale process description dataset, our method achieves a 31.1% improvement in the simulation convergence rate compared to state-of-the-art baselines and reduces the design time by 89. 0% compared to the expert manual design. This work demonstrates the potential of AI-assisted chemical process design, which bridges the gap between conceptual design and practical implementation. Our workflow is applicable to diverse process-oriented industries, including pharmaceuticals, petrochemicals, food processing, and manufacturing, offering a generalizable solution for automated process design.

[AI-95] FinForge: Semi-Synthetic Financial Benchmark Generation AAAI2026

【Quick Read】: This paper addresses the challenge of evaluating language models (LMs) in high-stakes specialized domains such as finance, where open, high-quality, domain-specific datasets are scarce and general-purpose benchmarks cannot accurately measure the conceptual understanding and quantitative analysis required for financial reasoning. The key is FinForge, a scalable semi-synthetic pipeline that combines expert-guided data curation with controlled LM-based synthesis: structured questions are generated from authoritative financial documents and human-validated, with Gemini 2.5 Flash used for question generation and validation. The resulting FinForge-5k benchmark contains over 5,000 human-validated question-answer pairs spanning 11 finance subdomains, exposing clear performance differences and limitations of current leading models on financial reasoning.

Link: https://arxiv.org/abs/2601.06747
Authors: Glenn Matlin, Akhil Theerthala, Anant Gupta, Anirudh JM, Rayan Castilla, Yi Mei Ng, Sudheer Chava
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: AAAI 2026 Workshop on Agentic AI in Financial Services

Abstract:Evaluating Language Models (LMs) in specialized, high-stakes domains such as finance remains a significant challenge due to the scarcity of open, high-quality, and domain-specific datasets. Existing general-purpose benchmarks provide broad coverage but lack the depth and domain fidelity needed to assess LMs’ capabilities for real-world financial reasoning, which requires both conceptual understanding and quantitative rigor. To address this gap, we introduce FinForge, a scalable, semi-synthetic pipeline for constructing finance-specific evaluation benchmarks through a hybrid of expert-guided data curation and controlled LM-based synthesis. FinForge combines manual and programmatic corpus construction from authoritative financial sources with structured question generation and validation using Gemini 2.5 Flash. To demonstrate the pipeline’s efficacy, we produce FinForge-5k, a snapshot benchmark comprising over 5,000 human-validated question-answer pairs across 11 finance subdomains, derived from a curated corpus of 100,000 verified documents totaling 143M tokens. Evaluation of state-of-the-art open-source and closed-source models on FinForge-5k reveals significant differences in financial reasoning, with leading models achieving accuracy levels near 80%. These findings underscore the framework’s utility for diagnosing current model limitations and guiding future improvements in financial domain competence. All code and data are available at this https URL.

[AI-96] Logic-Driven Semantic Communication for Resilient Multi-Agent Systems

【Quick Read】: This paper addresses the lack of a unified, formally defined notion of resilience for decentralized multi-agent systems (MAS) in the 6G era, especially the difficulty of sustained sensing, adaptation, and recovery under environmental change and adversarial interference. The key is a formal definition of resilience along two complementary dimensions: epistemic resilience, whereby agents recover and sustain accurate knowledge of the environment, and action resilience, whereby agents leverage that knowledge to coordinate and sustain their goals. Resilience is formalized in temporal epistemic logic and quantified via recoverability time and durability time; decentralized algorithms and an agent architecture realize both forms of resilience with formal verification guarantees (including finite-horizon verification), enabling design-time certification and lightweight runtime monitoring for stable decision-making under stress.

Link: https://arxiv.org/abs/2601.06733
Authors: Tamara Alshammari, Mehdi Bennis
Institution: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:The advent of 6G networks is accelerating autonomy and intelligence in large-scale, decentralized multi-agent systems (MAS). While this evolution enables adaptive behavior, it also heightens vulnerability to stressors such as environmental changes and adversarial behavior. Existing literature on resilience in decentralized MAS largely focuses on isolated aspects, such as fault tolerance, without offering a principled unified definition of multi-agent resilience. This gap limits the ability to design systems that can continuously sense, adapt, and recover under dynamic conditions. This article proposes a formal definition of MAS resilience grounded in two complementary dimensions: epistemic resilience, wherein agents recover and sustain accurate knowledge of the environment, and action resilience, wherein agents leverage that knowledge to coordinate and sustain goals under disruptions. We formalize resilience via temporal epistemic logic and quantify it using recoverability time (how quickly desired properties are re-established after a disturbance) and durability time (how long accurate beliefs and goal-directed behavior are sustained after recovery). We design an agent architecture and develop decentralized algorithms to achieve both epistemic and action resilience. We provide formal verification guarantees, showing that our specifications are sound with respect to the metric bounds and admit finite-horizon verification, enabling design-time certification and lightweight runtime monitoring. Through a case study on distributed multi-agent decision-making under stressors, we show that our approach outperforms baseline methods. Our formal verification analysis and simulation results highlight that the proposed framework enables resilient, knowledge-driven decision-making and sustained operation, laying the groundwork for resilient decentralized MAS in next-generation communication systems.

[AI-97] Why are there many equally good models? An Anatomy of the Rashomon Effect

【Quick Read】: This paper examines the Rashomon effect, the pervasive coexistence in modern machine learning and statistics of multiple structurally different models with nearly equivalent predictive performance. The key contribution is a systematic taxonomy of its causes into three categories: statistical sources (finite samples and noise in the data-generating process), structural sources (non-convex optimization objectives and fundamental non-identifiability due to unobserved variables), and procedural sources (limitations of optimization algorithms and deliberate restriction to particular model classes). Synthesizing results from machine learning, statistics, and optimization into a unified framework, the paper explains why many good models exist and shows that the causes depend differently on data: statistical multiplicity shrinks with more data, structural multiplicity persists asymptotically and requires additional assumptions or new data, and procedural multiplicity reflects practitioner choices, with implications for inference, interpretability, fairness, and decision-making under uncertainty.

Link: https://arxiv.org/abs/2601.06730
Authors: Harsh Parikh
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:The Rashomon effect – the existence of multiple, distinct models that achieve nearly equivalent predictive performance – has emerged as a fundamental phenomenon in modern machine learning and statistics. In this paper, we explore the causes underlying the Rashomon effect, organizing them into three categories: statistical sources arising from finite samples and noise in the data-generating process; structural sources arising from non-convexity of optimization objectives and unobserved variables that create fundamental non-identifiability; and procedural sources arising from limitations of optimization algorithms and deliberate restrictions to suboptimal model classes. We synthesize insights from machine learning, statistics, and optimization literature to provide a unified framework for understanding why the multiplicity of good models arises. A key distinction emerges: statistical multiplicity diminishes with more data, structural multiplicity persists asymptotically and cannot be resolved without different data or additional assumptions, and procedural multiplicity reflects choices made by practitioners. Beyond characterizing causes, we discuss both the challenges and opportunities presented by the Rashomon effect, including implications for inference, interpretability, fairness, and decision-making under uncertainty.

[AI-98] Explainability of Complex AI Models with Correlation Impact Ratio

【Quick Read】: This paper addresses the tension between the predictive gains of complex AI systems and their lack of transparency, which limits trustworthiness, interpretability, and safe deployment. Existing model-agnostic post hoc explainers (LIME, SHAP, HSIC, SAGE) have two key weaknesses: they tend to misrank correlated features, and they rely on costly perturbations that do not scale to high-dimensional data. The key to the solution is ExCIR (Explainability through Correlation Impact Ratio), a theoretically grounded, lightweight, and stable measure of feature contribution that captures dependencies induced by correlated features in a single pass while remaining consistent under noise and sampling variation. The authors further extend ExCIR with an information-theoretic foundation that unifies the correlation ratio with Canonical Correlation Analysis under mutual-information bounds, enabling scalable multi-output and class-conditioned explanations.

Link: https://arxiv.org/abs/2601.06701
Authors: Poushali Sengupta, Rabindra Khadka, Sabita Maharjan, Frank Eliassen, Yan Zhang, Shashi Raj Pandey, Pedro G. Lind, Anis Yazidi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:Complex AI systems make better predictions but often lack transparency, limiting trustworthiness, interpretability, and safe deployment. Common post hoc AI explainers, such as LIME, SHAP, HSIC, and SAGE, are model agnostic but are too restricted in one significant regard: they tend to misrank correlated features and require costly perturbations, which do not scale to high dimensional data. We introduce ExCIR (Explainability through Correlation Impact Ratio), a theoretically grounded, simple, and reliable metric for explaining the contribution of input features to model outputs, which remains stable and consistent under noise and sampling variations. We demonstrate that ExCIR captures dependencies arising from correlated features through a lightweight single pass formulation. Experimental evaluations on diverse datasets, including EEG, synthetic vehicular data, Digits, and Cats-Dogs, validate the effectiveness and stability of ExCIR across domains, achieving more interpretable feature explanations than existing methods while remaining computationally efficient. To this end, we further extend ExCIR with an information theoretic foundation that unifies the correlation ratio with Canonical Correlation Analysis under mutual information bounds, enabling multi output and class conditioned explainability at scale.
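
As an illustration of the kind of single-pass statistic ExCIR builds on, the sketch below computes the classical correlation ratio (eta-squared) between each input feature and a model's output. The paper's exact estimator, weighting, and CCA extension are not given in the abstract, so this is a minimal sketch for intuition only, not the authors' implementation.

```python
import numpy as np

def correlation_ratio(feature: np.ndarray, output: np.ndarray, bins: int = 10) -> float:
    """Classical correlation ratio (eta^2): the fraction of output variance
    explained by binned values of a single input feature. One pass, no perturbations."""
    edges = np.quantile(feature, np.linspace(0, 1, bins + 1)[1:-1])
    ids = np.digitize(feature, edges)
    grand_mean = output.mean()
    between = sum(group.size * (group.mean() - grand_mean) ** 2
                  for b in np.unique(ids) for group in [output[ids == b]])
    total = ((output - grand_mean) ** 2).sum()
    return float(between / total) if total > 0 else 0.0

# Toy usage: rank the features of a black-box model by their score on held-out data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y_hat = 2 * X[:, 0] + 0.1 * rng.normal(size=1000)   # stand-in for model predictions
scores = [correlation_ratio(X[:, j], y_hat) for j in range(X.shape[1])]
print(np.argsort(scores)[::-1])                     # feature 0 should rank first
```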

[AI-99] Plasticity vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget

【Quick Read】: This paper asks whether strong mathematical reasoning can be induced in small language models (≤1.5B parameters) under an extreme compute budget (a single A40 GPU, under 24 hours of training). The key to the solution is combining Reinforcement Learning with Verifiable Rewards (RLVR) and Low-Rank Adaptation (LoRA), and the finding that success hinges on the interplay between model initialization and adapter rank: high-rank adapters (r=256) unlock substantial plasticity in standard instruction-tuned models, yielding 40.0% Pass@1 on AIME 2024 (an 11.1-point absolute improvement over the baseline). The same recipe, however, collapses performance on heavily math-aligned models, suggesting that noisy, low-budget RL updates can act as destructive interference for models already near a task-specific optimum.

Link: https://arxiv.org/abs/2601.06677
Authors: Zohaib Khan, Omer Tafveez, Zoha Hayat Bhatti
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, 2 tables

Click to view abstract

Abstract:Recent advances in mathematical reasoning typically rely on massive scale, yet the question remains: can strong reasoning capabilities be induced in small language models (≤1.5B) under extreme constraints? We investigate this by training models on a single A40 GPU (48GB) for under 24 hours using Reinforcement Learning with Verifiable Rewards (RLVR) and Low-Rank Adaptation (LoRA). We find that the success of this "micro-budget" regime depends critically on the interplay between adapter capacity and model initialization. While low-rank adapters (r=8) consistently fail to capture the complex optimization dynamics of reasoning, high-rank adapters (r=256) unlock significant plasticity in standard instruction-tuned models. Our best result achieved an impressive 40.0% Pass@1 on AIME 24 (an 11.1% absolute improvement over baseline) and pushed Pass@16 to 70.0%, demonstrating robust exploration capabilities. However, this plasticity is not universal: while instruction-tuned models utilized the budget to elongate their chain-of-thought and maximize reward, heavily math-aligned models suffered performance collapse, suggesting that noisy, low-budget RL updates can act as destructive interference for models already residing near a task-specific optimum.
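
For readers who want to see what the high-rank setup looks like in practice, here is a minimal sketch using the Hugging Face `peft` library. The checkpoint name, `lora_alpha`, and target modules are assumptions for illustration; the paper's abstract specifies only the rank values (r=8 vs. r=256).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Example 1.5B-class checkpoint; the paper does not name its base model here.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

config = LoraConfig(
    r=256,                      # the high-rank regime that unlocked plasticity
    lora_alpha=512,             # assumed scaling; not stated in the abstract
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # adapter params only; base weights stay frozen
```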

[AI-100] Otimizando A Alocação De Salas De Aula Com Foco Na Acessibilidade Para Pessoas Com Deficiência (Optimizing Classroom Allocation with a Focus on Accessibility for People with Disabilities)

【Quick Read】: This paper addresses accessibility for Persons with Disabilities (PwDs) in university classroom allocation, using an optimization model to improve both the fairness and the efficiency of room assignment. The key to the solution is an Integer Linear Programming (ILP) model solved with Gurobi, with a weighting parameter α that trades off spatial efficiency against accessibility priority: the number of classrooms used is minimized while PwD students are preferentially assigned to accessible ground-floor rooms, significantly improving on current manual allocation practice.

Link: https://arxiv.org/abs/2601.06670
Authors: Francisco Glaubos Nunes Clímaco, Jorge Lucas Silva Cavalcante
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: in Portuguese language

Click to view abstract

Abstract:This paper addresses the challenge of classroom allocation in higher education institutions, with an explicit emphasis on accessibility for Persons with Disabilities (PwDs). Employing a case study of a university’s computer science department, the paper proposes an Integer Linear Programming (ILP)-based optimization model, which is solved using the Gurobi solver. The objective is to minimize the number of classrooms used by prioritizing the assignment of PwD students to ground-floor classrooms to reduce accessibility barriers. The model is calibrated with a weighting parameter, alpha, that allows for a balance between spatial efficiency and promoting accessibility. Experimental results indicate that adjusting alpha can achieve a balance point that significantly improves current manual allocation practices, reducing the number of classrooms required and accessibility penalties. The findings suggest that optimization methods can improve operational efficiency in academic institutions while promoting a more inclusive environment for all students. Future work may expand the application of the model to other departments and contexts and integrate additional criteria to develop a more holistic approach.
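
The sketch below shows the general shape of such an ILP in `gurobipy`, with made-up groups, rooms, and an alpha weight. It is a simplified stand-in (one group per room, no time slots), not the paper's exact formulation.

```python
import gurobipy as gp
from gurobipy import GRB

groups = ["G1", "G2", "G3"]            # class groups (G1 includes a PwD student)
rooms = ["R101", "R102", "R201"]       # R2xx rooms are on an upper floor
pwd = {"G1": 1, "G2": 0, "G3": 0}      # 1 if the group includes PwD students
floor = {"R101": 0, "R102": 0, "R201": 1}
alpha = 5.0                            # weight on accessibility penalties

m = gp.Model("classrooms")
x = m.addVars(groups, rooms, vtype=GRB.BINARY, name="assign")
used = m.addVars(rooms, vtype=GRB.BINARY, name="used")

m.addConstrs((x.sum(g, "*") == 1 for g in groups))           # each group gets one room
m.addConstrs((x[g, r] <= used[r] for g in groups for r in rooms))
m.addConstrs((x.sum("*", r) <= 1 for r in rooms))            # one group per room

# Minimize rooms used plus alpha-weighted penalty for PwD groups on upper floors.
m.setObjective(used.sum() + alpha * gp.quicksum(pwd[g] * floor[r] * x[g, r]
               for g in groups for r in rooms), GRB.MINIMIZE)
m.optimize()
```

Sweeping alpha traces out the efficiency-accessibility trade-off the paper calibrates.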

[AI-101] Reinforcement Learning-Guided Dynamic Multi-Graph Fusion for Evacuation Traffic Prediction

【Quick Read】: This paper targets two problems in real-time traffic prediction during hurricane evacuations: existing data-driven graph-learning models typically build the underlying graph from a single dimension (e.g., travel time or distance), failing to capture heterogeneous spatiotemporal relationships between traffic detectors, and they generally lack interpretability, offering no insight into which inputs drive predictive performance. The key to the solution is a Reinforcement Learning-guided Dynamic Multi-Graph Fusion (RL-DMF) framework: multiple dynamic graphs are built at each time step to represent different spatiotemporal relationships, a dynamic multi-graph fusion (DMF) module adaptively combines them, and an RL-based Intelligent Feature Selection and Ranking (RL-IFSR) method learns to mask irrelevant features during training, improving both accuracy and interpretability.

Link: https://arxiv.org/abs/2601.06664
Authors: Md Nafees Fuad Rafi, Samiul Hasan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Real-time traffic prediction is critical for managing transportation systems during hurricane evacuations. Although data-driven graph-learning models have demonstrated strong capabilities in capturing the complex spatiotemporal dynamics of evacuation traffic at a network level, they mostly consider a single dimension (e.g., travel-time or distance) to construct the underlying graph. Furthermore, these models often lack interpretability, offering little insight into which input variables contribute most to their predictive performance. To overcome these limitations, we develop a novel Reinforcement Learning-guided Dynamic Multi-Graph Fusion (RL-DMF) framework for evacuation traffic prediction. We construct multiple dynamic graphs at each time step to represent heterogeneous spatiotemporal relationships between traffic detectors. A dynamic multi-graph fusion (DMF) module is employed to adaptively learn and combine information from these graphs. To enhance model interpretability, we introduce RL-based intelligent feature selection and ranking (RL-IFSR) method that learns to mask irrelevant features during model training. The model is evaluated using a real-world dataset of 12 hurricanes affecting Florida from 2016 to 2024. For an unseen hurricane (Milton, 2024), the model achieves a 95% accuracy (RMSE = 293.9) for predicting the next 1-hour traffic flow. Moreover, the model can forecast traffic flow for up to next 6 hours with 90% accuracy (RMSE = 426.4). The RL-DMF framework outperforms several state-of-the-art traffic prediction models. Furthermore, ablation experiments confirm the effectiveness of dynamic multi-graph fusion and RL-IFSR approaches for improving model performance. This research provides a generalized and interpretable model for real-time evacuation traffic forecasting, with significant implications for evacuation traffic management.

[AI-102] SafePro: Evaluating the Safety of Professional-Level AI Agents

【Quick Read】: This paper addresses the safety-alignment gap for LLM-based AI agents performing complex professional tasks: existing safety evaluations focus on simple, everyday assistance and fail to capture the intricate decision-making and potential consequences of misaligned behavior in professional settings. The key to the solution is SafePro, a comprehensive benchmark of high-complexity tasks with safety risks across diverse professional domains, built through a rigorous iterative creation and review process. Evaluations show that state-of-the-art models exhibit both insufficient safety judgment and weak safety alignment on professional tasks, while mitigation strategies investigated with SafePro yield encouraging improvements, underscoring the urgency of safety mechanisms tailored to next-generation professional AI agents.

Link: https://arxiv.org/abs/2601.06663
Authors: Kaiwen Zhou, Shreedhar Jangam, Ashwin Nagarajan, Tejas Polu, Suhas Oruganti, Chengzhi Liu, Ching-Chen Kuo, Yuting Zheng, Sravana Narayanaraju, Xin Eric Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks in various domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations primarily focus on simple, daily assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behaviors in professional settings. To address this gap, we introduce \textbfSafePro, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity tasks across diverse professional domains with safety risks, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers new unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate safety mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust safety mechanisms tailored to the next generation of professional AI agents.

[AI-103] Revisiting Training Scale: An Empirical Study of Token Count Power Consumption and Parameter Efficiency

【Quick Read】: This paper asks whether scaling the training corpus (token count) of large language models yields predictable and efficient gains, noting that prior work emphasizes performance metrics while ignoring compute and energy costs, which biases assessments of training efficiency. The key to the solution is an energy-aware parameter efficiency metric combined with a controlled experimental design: a 1.1-billion-parameter TinyLlama model is trained on 500K, 1M, and 2M tokens with fixed hardware, architecture, optimizer settings, and epoch counts. Incorporating power consumption and execution duration (as reflected by the power sampling frequency) shows that while conventional metrics exhibit diminishing or inconsistent returns, training efficiency declines strictly monotonically with token count, and repeated-measures ANOVA confirms a significant effect of token count on parameter efficiency (all pairwise comparisons significant after Bonferroni correction). The results warn that simply scaling training data can be energetically inefficient and argue for efficiency-aware evaluation in LLM training.

Link: https://arxiv.org/abs/2601.06649
Authors: Joe Dwyer
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Research in machine learning has questioned whether increases in training token counts reliably produce proportional performance gains in large language models. Building on prior work introducing an energy-aware parameter efficiency metric, this study empirically examines the effects of increasing training token counts under fixed hardware and training conditions. The significance of this work lies in the explicit integration of power consumption and execution duration, as reflected by the power sampling frequency, into token-scale analysis. This addresses a gap in prior studies emphasizing performance outcomes while underrepresenting computational and energy costs. Using a repeated-measures experimental design on a constant GPU instance with an identical model architecture, optimizer settings, and epoch counts, a 1.1-billion-parameter TinyLlama model was trained at three token counts (500K, 1M, and 2M). While conventional performance metrics exhibited inconsistent or diminishing returns across token scales, the inclusion of power consumption and execution duration revealed a strictly monotonic decline in training efficiency as token count increased. Repeated-measures ANOVA demonstrated a strong effect of token count on parameter efficiency, with all pairwise comparisons remaining significant following Bonferroni correction. These findings indicate that increases in training token counts may be energetically inefficient even when marginal performance improvements are observed, underscoring the importance of efficiency-aware evaluation in large language model training.
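
To make the idea concrete, the toy snippet below computes one plausible energy-aware efficiency metric (performance gain per parameter per joule). The abstract does not give the paper's exact formula, and all numbers here are made-up placeholders, so treat this strictly as an illustration of the bookkeeping involved.

```python
# Assumed metric: performance gain / (parameters * energy). Not the paper's definition.
def parameter_efficiency(perf_gain: float, n_params: float,
                         avg_power_w: float, duration_s: float) -> float:
    energy_j = avg_power_w * duration_s      # integrate sampled power over wall time
    return perf_gain / (n_params * energy_j)

# tokens -> (perf gain, avg watts, seconds); placeholder values, not measurements.
runs = {500_000: (0.12, 280.0, 1800),
        1_000_000: (0.15, 290.0, 3700),
        2_000_000: (0.16, 295.0, 7600)}

for tokens, (gain, watts, secs) in runs.items():
    eff = parameter_efficiency(gain, 1.1e9, watts, secs)
    print(f"{tokens:>9,} tokens: efficiency = {eff:.3e}")  # falls as tokens grow
```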

[AI-104] Agentic AI Empowered Intent-Based Networking for 6G

【Quick Read】: This paper addresses the autonomous-orchestration bottleneck in sixth-generation (6G) wireless networks: translating high-level operational intents into executable configurations, where rule-based Intent-Based Networking (IBN) struggles with linguistic variation and end-to-end neural models lack interpretability and cannot enforce operational constraints. The key to the solution is a hierarchical multi-agent framework in which Large Language Model (LLM) based agents autonomously decompose natural-language intents, consult domain-specific specialists (RAN and Core Network agents), and synthesize technically feasible network-slice configurations through iterative reasoning-action (ReAct) cycles, grounded in structured network-state representations and coordinated by an orchestrator agent. This design improves accuracy and controllability over rule-based systems and direct LLM prompting, offering a scalable and interpretable path to autonomous orchestration in next-generation wireless systems.

Link: https://arxiv.org/abs/2601.06640
Authors: Genze Jiang, Kezhi Wang, Xiaomin Chen, Yizhou Huang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Submitted for Possible Journal Publication

Click to view abstract

Abstract:The transition towards sixth-generation (6G) wireless networks necessitates autonomous orchestration mechanisms capable of translating high-level operational intents into executable network configurations. Existing approaches to Intent-Based Networking (IBN) rely upon either rule-based systems that struggle with linguistic variation or end-to-end neural models that lack interpretability and fail to enforce operational constraints. This paper presents a hierarchical multi-agent framework where Large Language Model (LLM) based agents autonomously decompose natural language intents, consult domain-specific specialists, and synthesise technically feasible network slice configurations through iterative reasoning-action (ReAct) cycles. The proposed architecture employs an orchestrator agent coordinating two specialist agents, i.e., Radio Access Network (RAN) and Core Network agents, via ReAct-style reasoning, grounded in structured network state representations. Experimental evaluation across diverse benchmark scenarios shows that the proposed system outperforms rule-based systems and direct LLM prompting, with architectural principles applicable to Open RAN (O-RAN) deployments. The results also demonstrate that whilst contemporary LLMs possess general telecommunications knowledge, network automation requires careful prompt engineering to encode context-dependent decision thresholds, advancing autonomous orchestration capabilities for next-generation wireless systems.
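
A bare-bones sketch of such an orchestrator loop is shown below. The `call_llm` stub, the specialist routing convention, and the step format are all assumptions; the paper's actual prompts, schemas, and O-RAN interfaces are not specified in the abstract.

```python
# Hypothetical ReAct-style orchestration loop with RAN/Core specialist agents.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("backend-specific LLM call")

SPECIALISTS = {
    "ran": lambda q: call_llm(f"[RAN agent] {q}"),     # radio access expertise
    "core": lambda q: call_llm(f"[Core agent] {q}"),   # core network expertise
}

def orchestrate(intent: str, max_steps: int = 5) -> str:
    scratchpad = f"Intent: {intent}"
    for _ in range(max_steps):
        step = call_llm(f"{scratchpad}\nThought + Action (ran:/core:/finish:):")
        if step.startswith("finish:"):
            return step.removeprefix("finish:").strip()   # final slice config
        agent, _, query = step.partition(":")
        observation = SPECIALISTS.get(agent.strip(), SPECIALISTS["ran"])(query)
        scratchpad += f"\nAction: {step}\nObservation: {observation}"
    return scratchpad  # fall back to the trace if no config was finalized
```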

[AI-105] Attack-Resistant Watermarking for AIGC Image Forensics via Diffusion-based Semantic Deflection

【Quick Read】: This paper addresses two core problems in copyright protection for AI-generated images: existing watermarks are fragile against real-world adversarial attacks, typically trading off defenses against spoofing and removal, and they cannot localize tampering at the semantic level. The key to the solution is PAI, a training-free inherent watermarking framework that introduces a key-conditioned deflection mechanism: the diffusion model's denoising trajectory is subtly steered according to the user key, strengthening the semantic entanglement between identity and content, markedly improving robustness to real-world threats, and enabling precise semantic-level tamper localization. Across 12 attack methods, PAI achieves 98.43% verification accuracy, outperforming state-of-the-art methods by 37.25% on average.

Link: https://arxiv.org/abs/2601.06639
Authors: Qingyu Liu, Yitao Zhang, Zhongjie Ba, Chao Shuai, Peng Cheng, Tianhang Zheng, Zhibo Wang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Protecting the copyright of user-generated AI images is an emerging challenge as AIGC becomes pervasive in creative workflows. Existing watermarking methods (1) remain vulnerable to real-world adversarial threats, often forced to trade off between defenses against spoofing and removal attacks; and (2) cannot support semantic-level tamper localization. We introduce PAI, a training-free inherent watermarking framework for AIGC copyright protection, plug-and-play with diffusion-based AIGC services. PAI simultaneously provides three key functionalities: robust ownership verification, attack detection, and semantic-level tampering localization. Unlike existing inherent watermark methods that only embed watermarks at noise initialization of diffusion models, we design a novel key-conditioned deflection mechanism that subtly steers the denoising trajectory according to the user key. Such trajectory-level coupling further strengthens the semantic entanglement of identity and content, thereby further enhancing robustness against real-world threats. Moreover, we also provide a theoretical analysis proving that only the valid key can pass verification. Experiments across 12 attack methods show that PAI achieves 98.43% verification accuracy, improving over SOTA methods by 37.25% on average, and retains strong tampering localization performance even against advanced AIGC edits. Our code is available at this https URL.

[AI-106] Burn-After-Use for Preventing Data Leakage through a Secure Multi-Tenant Architecture in Enterprise LLM

【Quick Read】: This paper targets data-leakage risks in multi-tenant enterprise LLM deployments, especially cross-session and cross-user leakage and residual sensitive context. The key to the solution is a Secure Multi-Tenant Architecture (SMTA) combined with a novel Burn-After-Use (BAU) mechanism: SMTA isolates departmental LLM instances and enforces strict context-ownership boundaries for strong semantic isolation, while BAU automatically destroys ephemeral session contexts to prevent residue left in caches, logs, or misconfigured infrastructure. Experiments show SMTA achieves a 92% defense success rate across 55 infrastructure-level attack tests and BAU a 76.75% success rate in mitigating post-session leakage threats across 72 realistic failure scenarios, together delivering strict isolation, session ephemerality, strong confidentiality, and policy-aligned behavior.

Link: https://arxiv.org/abs/2601.06627
Authors: Qiang Zhang, Elena Emma Wang, Jiaming Li, Xichun Wang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures

Click to view abstract

Abstract:This study presents a Secure Multi-Tenant Architecture (SMTA) combined with a novel concept Burn-After-Use (BAU) mechanism for enterprise LLM environments to effectively prevent data leakage. As institutions increasingly adopt LLMs across departments, the risks of data leakage have become a critical security and compliance concern. The proposed SMTA isolates LLM instances across departments and enforces rigorous context ownership boundaries within an internally deployed infrastructure. The BAU mechanism introduces data confidentiality by enforcing ephemeral conversational contexts that are automatically destroyed after use, preventing cross-session or cross-user inference. The evaluation to SMTA and BAU is through two sets of realistic and reproducible experiments comprising of 127 test iterations. One aspect of this experiment is to assess prompt-based and semantic leakage attacks in a multi-tenant architecture (Appendix A) across 55 infrastructure-level attack tests, including vector-database credential compromise and shared logging pipeline exposure. SMTA achieves 92% defense success rate, demonstrating strong semantic isolation while highlighting residual risks from credential misconfiguration and observability pipelines. Another aspect is to evaluate the robustness of BAU under realistic failure scenarios (Appendix B) using four empirical metrics: Local Residual Persistence Rate (LRPR), Remote Residual Persistence Rate (RRPR), Image Frame Exposure Rate (IFER), and Burn Timer Persistence Rate (BTPR). Across 72 test iterations, BAU achieves a 76.75% success rate in mitigating post-session leakage threats across the client, server, application, infrastructure, and cache layers. These results show that SMTA and BAU together enforce strict isolation, complete session ephemerality, strong confidentiality guarantees, non-persistence, and policy-aligned behavior for enterprise LLMs.
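
As a rough illustration of the burn-after-use idea at the application layer, here is a minimal session wrapper whose history exists only in memory and is destroyed on exit or expiry. This is an assumption-laden sketch of the concept; the paper's BAU spans the client, server, infrastructure, and cache layers, which a snippet like this does not cover.

```python
import time

class BurnAfterUseSession:
    """Hypothetical ephemeral conversation context: turns live only in memory
    and are destroyed when the session closes or its burn timer expires.
    Real deployments must also scrub caches, logs, and vector stores."""

    def __init__(self, ttl_seconds: float = 600):
        self._history: list[tuple[str, str]] = []
        self._expiry = time.monotonic() + ttl_seconds

    def add(self, role: str, text: str) -> None:
        if time.monotonic() > self._expiry:
            self.burn()
            raise RuntimeError("session expired; context burned")
        self._history.append((role, text))

    def burn(self) -> None:
        self._history.clear()    # drop all turn data

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.burn()              # guaranteed destruction on scope exit
```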

[AI-107] CEDAR: Context Engineering for Agentic Data Science ECIR2026

【Quick Read】: This paper addresses the challenges of automating data science (DS) tasks with large language models (LLMs), including task complexity, data size, limited compute, and context-length constraints. The key to the solution is an agentic setup with effective context engineering: DS-specific input fields structure the initial prompt and serve as instructions for the agentic system; the solution is materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, keeping the context readable and controllable at every step of the workflow; function calls keep data local, injecting only aggregate statistics and instructions into LLM prompts; and iterative code generation plus smart history rendering provide fault tolerance and context management. The agentic data scientist is validated on canonical Kaggle challenges.

Link: https://arxiv.org/abs/2601.06606
Authors: Rishiraj Saha Roy, Chris Hinze, Luzian Hahn, Fabian Kuech
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ECIR 2026

Click to view abstract

Abstract:We demonstrate CEDAR, an application for automating data science (DS) tasks with an agentic setup. Solving DS problems with LLMs is an underexplored area that has immense market value. The challenges are manifold: task complexities, data sizes, computational limitations, and context restrictions. We show that these can be alleviated via effective context engineering. We first impose structure into the initial prompt with DS-specific input fields, that serve as instructions for the agentic system. The solution is then materialized as an enumerated sequence of interleaved plan and code blocks generated by separate LLM agents, providing a readable structure to the context at any step of the workflow. Function calls for generating these intermediate texts, and for corresponding Python code, ensure that data stays local, and only aggregate statistics and associated instructions are injected into LLM prompts. Fault tolerance and context management are introduced via iterative code generation and smart history rendering. The viability of our agentic data scientist is demonstrated using canonical Kaggle challenges.

[AI-108] Object-Centric World Models Meet Monte Carlo Tree Search

【Quick Read】: This paper addresses the limited modeling capacity of traditional reinforcement learning (RL) algorithms that treat the world as a single undifferentiated input when facing complex dynamic environments. The key to the solution is a structured world model over object-level representations: Graph Neural Networks (GNNs) capture intricate interactions among multiple manipulable, mutually interacting objects, and the resulting model is integrated into a model-based RL framework that uses Monte Carlo Tree Search as its planning module, substantially improving the modeling and prediction of environment dynamics.

Link: https://arxiv.org/abs/2601.06604
Authors: Rodion Vakhitov, Leonid Ugadiarov, Aleksandr Panov
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:In this paper, we introduce ObjectZero, a novel reinforcement learning (RL) algorithm that leverages the power of object-level representations to model dynamic environments more effectively. Unlike traditional approaches that process the world as a single undifferentiated input, our method employs Graph Neural Networks (GNNs) to capture intricate interactions among multiple objects. These objects, which can be manipulated and interact with each other, serve as the foundation for our model’s understanding of the environment. We trained the algorithm in a complex setting teeming with diverse, interactive objects, demonstrating its ability to effectively learn and predict object dynamics. Our results highlight that a structured world model operating on object-centric representations can be successfully integrated into a model-based RL algorithm utilizing Monte Carlo Tree Search as a planning module.
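
For reference, the selection step at the heart of any MCTS planning module is the standard UCT rule sketched below. This is generic textbook MCTS, not ObjectZero's exact planner; in ObjectZero the value estimates would come from rollouts of the GNN-based object-centric world model.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value: float = 0.0                    # cumulative return from simulated rollouts
    children: list["Node"] = field(default_factory=list)

def uct_select(node: Node, c: float = 1.4) -> Node:
    """Pick the child maximizing mean value plus an exploration bonus."""
    return max(node.children, key=lambda ch:
               (ch.value / ch.visits if ch.visits else float("inf"))
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)))
```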

[AI-109] Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

【Quick Read】: This paper investigates whether preference-aligned models are vulnerable to Preference-Undermining Attacks (PUA), i.e., whether preference-alignment training can be exploited so that models produce user-appeasing but untruthful responses. The key to the solution is a factorial diagnostic methodology: a controlled 2 × 2⁴ design decomposes prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial), giving finer-grained and more directive evaluation than aggregate benchmark scores. The results show, surprisingly, that more advanced models can be more susceptible to certain PUA strategies, and that beyond the dominant reality-denial factor there are model-specific sign reversals and interactions, arguing for tailored rather than uniform defenses.

Link: https://arxiv.org/abs/2601.06596
Authors: Hongjun An, Yiliang Song, Jiangan Chen, Jiawei Shao, Chi Zhang, Xuelong Li
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: preprint

Click to view abstract

Abstract:Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model’s desire to please user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides a finer-grained and more directive analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled 2 × 2⁴ design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with PUA-style factors, suggesting tailored defenses rather than uniform robustness. These findings offer a novel, reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes like RLHF, enabling better trade-offs in the product iteration of LLMs by offering a more nuanced understanding of preference alignment risks and the impact of manipulative prompts.

[AI-110] QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models

【Quick Read】: This paper addresses the fact that large multimodal models (LMMs) for video-audio understanding are mostly evaluated on short videos of a few minutes and lack efficient analysis of long videos (minutes to over an hour). The key to the solution is QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through late fusion of LMMs, large language models (LLMs), and speech recognition models. This architecture improves understanding of both scene-level detail and the overarching narrative of long videos, and delivers significant gains on datasets such as VideoMME, validating its effectiveness for long-form video semantic analysis.

Link: https://arxiv.org/abs/2601.06573
Authors: Zixing Lin, Jiale Wang, Gee Wah Ng, Lee Onn Mak, Chan Zhi Yang Jeriel, Jun Yang Lee, Yaohao Li
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on shorter videos of a few minutes long. In this paper, we introduce QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for longer videos of a few minutes to beyond an hour long, opening up new potential applications in sensemaking, video content analysis, embodied AI, etc. Quantitative experiments using QMAVIS demonstrated a 38.75% improvement over state-of-the-art video-audio LMMs like VideoLLaMA2 and InternVL2 on the VideoMME (with subtitles) dataset, which comprises long videos with audio information. Evaluations on other challenging video understanding datasets like PerceptionTest and EgoSchema saw up to 2% improvement, indicating competitive performance. Qualitative experiments also showed that QMAVIS is able to extract the nuances of different scenes in a long video audio content while understanding the overarching narrative. Ablation studies were also conducted to ascertain the impact of each component in the fusion pipeline.

[AI-111] Hellinger Multimodal Variational Autoencoders

【Quick Read】: This paper addresses how multimodal variational autoencoders (MVAEs) should aggregate inference distributions from different modalities to approximate the joint posterior in weakly supervised generative learning: existing approaches use a Product of Experts (PoE), a Mixture of Experts (MoE), or combinations of the two, but suffer from sampling inefficiency or limited expressiveness. The key to the solution is framing multimodal inference as probabilistic opinion pooling and deriving a Hellinger-distance-based moment-matching approximation from Hölder pooling with α = 0.5, yielding HELVAE, a multimodal VAE that avoids sub-sampling, grows more expressive as additional modalities are observed, and empirically achieves better trade-offs between generative coherence and quality than state-of-the-art multimodal VAEs.

Link: https://arxiv.org/abs/2601.06572
Authors: Huyen Khanh Vo, Isabel Valera
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with α = 0.5, which corresponds to the unique symmetric member of the α-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
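
For orientation, the block below sketches Hölder (power-mean) opinion pooling as it is standardly defined; the weights w_m and the paper's exact moment-matching step are not given in the abstract and are assumptions here.

```latex
% Hölder pooling of unimodal posteriors q_m(z|x_m) with weights w_m:
% alpha = 1 recovers a mixture (MoE); alpha -> 0 approaches a product (PoE).
% The symmetric member alpha = 1/2 underlies the Hellinger approximation.
\[
  q_{\mathrm{pool}}(z \mid x_{1:M}) \;\propto\;
  \Bigl( \sum_{m=1}^{M} w_m \, q_m(z \mid x_m)^{\alpha} \Bigr)^{1/\alpha},
  \qquad
  \alpha = \tfrac{1}{2}:\;
  q_{\mathrm{pool}} \;\propto\; \Bigl( \sum_{m} w_m \sqrt{q_m} \Bigr)^{2}.
\]
```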

[AI-112] Short-term electricity load forecasting with multi-frequency reconstruction diffusion

【Quick Read】: This paper addresses the limited modeling accuracy of short-term electricity load forecasting (STELF) caused by the nonlinear and fluctuating nature of load data, whose complex patterns existing methods fail to fully exploit. The key to the solution is a Multi-Frequency-Reconstruction-based Diffusion (MFRD) model with four core steps: (1) combine the original data with decomposed multi-frequency components into a new representation; (2) add noise to this new data via the diffusion process, reducing and weakening the original noise; (3) design a denoising network combining LSTM and Transformer structures to strengthen noise removal; and (4) generate final predictions from the trained denoising network. By integrating multi-scale information with modern network architectures, the method is validated on two real-world power-system data platforms, AEMO and ISO-NE, where it consistently outperforms the compared models.

Link: https://arxiv.org/abs/2601.06533
Authors: Qi Dong, Rubing Huang, Ling Zhou, Dave Towey, Jinyu Tian, Jianzhou Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Diffusion models have emerged as a powerful method in various applications. However, their application to Short-Term Electricity Load Forecasting (STELF) – a typical scenario in energy systems – remains largely unexplored. Considering the nonlinear and fluctuating characteristics of the load data, effectively utilizing the powerful modeling capabilities of diffusion models to enhance STELF accuracy remains a challenge. This paper proposes a novel diffusion model with multi-frequency reconstruction for STELF, referred to as the Multi-Frequency-Reconstruction-based Diffusion (MFRD) model. The MFRD model achieves accurate load forecasting through four key steps: (1) The original data is combined with the decomposed multi-frequency modes to form a new data representation; (2) The diffusion model adds noise to the new data, effectively reducing and weakening the noise in the original data; (3) The reverse process adopts a denoising network that combines Long Short-Term Memory (LSTM) and Transformer to enhance noise removal; and (4) The inference process generates the final predictions based on the trained denoising network. To validate the effectiveness of the MFRD model, we conducted experiments on two data platforms: Australian Energy Market Operator (AEMO) and Independent System Operator of New England (ISO-NE). The experimental results show that our model consistently outperforms the compared models.

[AI-113] Improving Day-Ahead Grid Carbon Intensity Forecasting by Joint Modeling of Local-Temporal and Cross-Variable Dependencies Across Different Frequencies AAAI

【Quick Read】: This paper addresses accurate forecasting of the grid carbon intensity factor (CIF), whose core difficulty lies in capturing fine-grained local-temporal dependencies, dynamic higher-order cross-variable dependencies, and complex multi-frequency patterns. The key to the solution is an architecture with two parallel modules: one extracts multi-frequency local-temporal features by applying multiple wavelet-based convolutional kernels to overlapping patches of varying lengths, strengthening local temporal modeling; the other models how cross-variable relationships evolve in the time-frequency domain, capturing complex interactions. Evaluations on four Australian electricity markets with different levels of renewable penetration show the method outperforms state-of-the-art models, and its built-in interpretability reveals how the model adaptively shifts attention to relevant variables and time intervals during a disruptive event.

Link: https://arxiv.org/abs/2601.06530
Authors: Bowen Zhang, Hongda Tian, Adam Berry, A. Craig Roussac
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 2026 40th AAAI Conference on Artificial Intelligence

Click to view abstract

Abstract:Accurate forecasting of the grid carbon intensity factor (CIF) is critical for enabling demand-side management and reducing emissions in modern electricity systems. Leveraging multiple interrelated time series, CIF prediction is typically formulated as a multivariate time series forecasting problem. Despite advances in deep learning-based methods, it remains challenging to capture the fine-grained local-temporal dependencies, dynamic higher-order cross-variable dependencies, and complex multi-frequency patterns for CIF forecasting. To address these issues, we propose a novel model that integrates two parallel modules: 1) one enhances the extraction of local-temporal dependencies under multi-frequency by applying multiple wavelet-based convolutional kernels to overlapping patches of varying lengths; 2) the other captures dynamic cross-variable dependencies under multi-frequency to model how inter-variable relationships evolve across the time-frequency domain. Evaluations on four representative electricity markets from Australia, featuring varying levels of renewable penetration, demonstrate that the proposed method outperforms the state-of-the-art models. An ablation study further validates the complementary benefits of the two proposed modules. Designed with built-in interpretability, the proposed model also enables better understanding of its predictive behavior, as shown in a case study where it adaptively shifts attention to relevant variables and time intervals during a disruptive event.

[AI-114] Neural Nonmyopic Bayesian Optimization in Dynamic Cost Settings

【Quick Read】: This paper addresses the limitation of Bayesian optimization (BO) in dynamic, history-dependent cost settings: existing methods assume static query costs and rely on myopic acquisition strategies, which fail when evaluation costs depend on prior actions (e.g., travel distance in spatial tasks or edit distance in sequence design). The key to the solution is LookaHES, which combines a multi-step variant of H-Entropy Search with pathwise sampling and neural policy optimization, enabling long-horizon planning beyond twenty steps without the exponential complexity of existing nonmyopic methods. The core innovation is using neural policies, including large language models, to navigate structured combinatorial action spaces (such as protein sequences), amortize lookahead planning, and integrate domain constraints during rollout, yielding efficient, scalable, cost-aware long-horizon optimization.

Link: https://arxiv.org/abs/2601.06505
Authors: Sang T. Truong, Duc Q. Nguyen, Willie Neiswanger, Ryan-Rhys Griffiths, Stefano Ermon, Nick Haber, Sanmi Koyejo
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 32 pages, 20 figures, 13 tables

Click to view abstract

Abstract:Bayesian optimization (BO) is a common framework for optimizing black-box functions, yet most existing methods assume static query costs and rely on myopic acquisition strategies. We introduce LookaHES, a nonmyopic BO framework designed for dynamic, history-dependent cost environments, where evaluation costs vary with prior actions, such as travel distance in spatial tasks or edit distance in sequence design. LookaHES combines a multi-step variant of H-Entropy Search with pathwise sampling and neural policy optimization, enabling long-horizon planning beyond twenty steps without the exponential complexity of existing nonmyopic methods. The key innovation is the integration of neural policies, including large language models, to effectively navigate structured, combinatorial action spaces such as protein sequences. These policies amortize lookahead planning and can be integrated with domain-specific constraints during rollout. Empirically, LookaHES outperforms strong myopic and nonmyopic baselines across nine synthetic benchmarks from two to eight dimensions and two real-world tasks: geospatial optimization using NASA night-light imagery and protein sequence design with constrained token-level edits. In short, LookaHES provides a general, scalable, and cost-aware solution for robust long-horizon optimization in complex decision spaces, which makes it a useful tool for researchers in machine learning, statistics, and applied domains. Our implementation is available at this https URL.

[AI-115] DRAGON: LLM-Driven Decomposition and Reconstruction Agents for Large-Scale Combinatorial Optimization AAMAS2026

【Quick Read】: This paper addresses the limited scalability and generalization of large language models (LLMs) on combinatorial optimization problems (COPs), where performance degrades sharply on routing problems with more than 30 nodes. The key to the solution is the DRAGON framework, which couples metaheuristic design with LLM reasoning: starting from an initial global solution, it autonomously identifies regions with high optimization potential and decomposes large-scale COPs into manageable subproblems; each subproblem is reformulated as a localized optimization task solved through targeted prompting guided by accumulated experience; and the locally optimized solutions are systematically reintegrated into the global context for overall improvement. By continuously interacting with the optimization environment and using an adaptive experience memory, the agents couple symbolic reasoning with heuristic search, consistently producing feasible solutions on TSPLIB, CVRPLIB, and Weibull-5k bin packing benchmarks and reaching a 0.16% near-optimal gap on knapsack instances with over 3M variables.

Link: https://arxiv.org/abs/2601.06502
Authors: Shengkai Chen, Zhiguang Cao, Jianan Zhou, Yaoxin Wu, Senthilnath Jayavelu, Zhuoyi Lin, Xiaoli Li, Shili Xiang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation and publication at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), source code will be available soon

Click to view abstract

Abstract:Large Language Models (LLMs) have recently shown promise in addressing combinatorial optimization problems (COPs) through prompt-based strategies. However, their scalability and generalization remain limited, and their effectiveness diminishes as problem size increases, particularly in routing problems involving more than 30 nodes. We propose DRAGON, which stands for Decomposition and Reconstruction Agents Guided OptimizatioN, a novel framework that combines the strengths of metaheuristic design and LLM reasoning. Starting from an initial global solution, DRAGON autonomously identifies regions with high optimization potential and strategically decompose large-scale COPs into manageable subproblems. Each subproblem is then reformulated as a concise, localized optimization task and solved through targeted LLM prompting guided by accumulated experiences. Finally, the locally optimized solutions are systematically reintegrated into the original global context to yield a significantly improved overall outcome. By continuously interacting with the optimization environment and leveraging an adaptive experience memory, the agents iteratively learn from feedback, effectively coupling symbolic reasoning with heuristic search. Empirical results show that, unlike existing LLM-based solvers limited to small-scale instances, DRAGON consistently produces feasible solutions on TSPLIB, CVRPLIB, and Weibull-5k bin packing benchmarks, and achieves near-optimal results (0.16% gap) on knapsack problems with over 3M variables. This work shows the potential of feedback-driven language agents as a new paradigm for generalizable and interpretable large-scale optimization.
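
The outer decompose-and-reconstruct loop can be pictured as below. The three callables stand in for the paper's agents (region selection, targeted LLM prompting with experience memory, and reintegration); none of their prompts or interfaces are specified in the abstract, so this is a structural sketch only.

```python
def dragon_optimize(solution, cost, select_region, solve_subproblem, reintegrate,
                    rounds: int = 10):
    """Hypothetical DRAGON-style loop: carve out a high-potential region,
    solve it as a small local task, and merge the improvement back."""
    for _ in range(rounds):
        region = select_region(solution)          # agent 1: where to optimize next
        improved = solve_subproblem(region)       # agent 2: localized LLM prompt
        candidate = reintegrate(solution, region, improved)
        if cost(candidate) < cost(solution):      # keep only global improvements
            solution = candidate
    return solution
```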

[AI-116] The AI Pyramid: A Conceptual Framework for Workforce Capability in the Age of AI

【Quick Read】: This paper addresses the systemic challenge facing AI workforce development: traditional digital or AI literacy training cannot cope with generative AI's disproportionate impact on highly educated white-collar work, nor with building sustainable human capability in an AI-mediated economy. The key to the solution is the concept of AI Nativity together with the AI Pyramid framework, which organizes human capability into three interdependent layers: AI Native capability, a universal baseline for participating in AI-augmented environments; AI Foundation capability, for building and sustaining AI-enabled systems; and AI Deep capability, for advancing frontier AI knowledge and applications. The framework treats capability formation as infrastructure rather than episodic training, relying on problem-based learning embedded in work contexts, dynamic skill ontologies, and competency-based measurement, so that organizations, education systems, and policy can respond jointly to AI-driven changes in work while improving productivity, resilience, and equity at societal scale.

Link: https://arxiv.org/abs/2601.06500
Authors: Alok Khatri (1,2), Bishesh Khanal (1,2) ((1) NAAMII, Nepal; (2) Tangible Careers)
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 14 pages

Click to view abstract

Abstract:Artificial intelligence (AI) represents a qualitative shift in technological change by extending cognitive labor itself rather than merely automating routine tasks. Recent evidence shows that generative AI disproportionately affects highly educated, white collar work, challenging existing assumptions about workforce vulnerability and rendering traditional approaches to digital or AI literacy insufficient. This paper introduces the concept of AI Nativity, the capacity to integrate AI fluidly into everyday reasoning, problem solving, and decision making, and proposes the AI Pyramid, a conceptual framework for organizing human capability in an AI mediated economy. The framework distinguishes three interdependent capability layers: AI Native capability as a universal baseline for participation in AI augmented environments; AI Foundation capability for building, integrating, and sustaining AI enabled systems; and AI Deep capability for advancing frontier AI knowledge and applications. Crucially, the pyramid is not a career ladder but a system level distribution of capabilities required at scale. Building on this structure, the paper argues that effective AI workforce development requires treating capability formation as infrastructure rather than episodic training, centered on problem based learning embedded in work contexts and supported by dynamic skill ontologies and competency based measurement. The framework has implications for organizations, education systems, and governments seeking to align learning, measurement, and policy with the evolving demands of AI mediated work, while addressing productivity, resilience, and inequality at societal scale.

[AI-117] Coding in a Bubble? Evaluating LLMs in Resolving Context Adaptation Bugs During Code Adaptation

【Quick Read】: This paper addresses Context Adaptation Bugs (CtxBugs) that arise during code adaptation when code correct in its original context violates constraints of the target environment; such bugs cannot be fixed locally and require cross-context reasoning to identify semantic mismatches. The key to the solution is CtxBugGen, a novel framework for generating CtxBugs that exploits LLMs' tendency to produce plausible but context-free code when contextual constraints are absent, using a four-step pipeline (adaptation task selection, task-specific perturbation, LLM-based variant generation, and CtxBugs identification) to build a high-quality benchmark. An empirical study of four state-of-the-art LLMs on this benchmark reveals serious weaknesses: the best model, Kimi-K2, reaches only 55.93% Pass@1 and resolves just 52.47% of CtxBugs, with CtxBugs degrading adaptation performance by up to 30%, highlighting the need for new methods to strengthen LLMs' context awareness for reliable code adaptation.

Link: https://arxiv.org/abs/2601.06497
Authors: Tanghaoran Zhang, Xinjun Mao, Shangwen Wang, Yuxin Zhao, Yao Lu, Zezhou Tang, Wenyu Xu, Longfei Sun, Changrong Xie, Kang Yang, Yue Yu
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 24 pages, 11 figures, accepted by FSE 2026

Click to view abstract

Abstract:Code adaptation is a fundamental but challenging task in software development, requiring developers to modify existing code for new contexts. A key challenge is to resolve Context Adaptation Bugs (CtxBugs), which occurs when code correct in its original context violates constraints in the target environment. Unlike isolated bugs, CtxBugs cannot be resolved through local fixes and require cross-context reasoning to identify semantic mismatches. Overlooking them may lead to critical failures in adaptation. Although Large Language Models (LLMs) show great potential in automating code-related tasks, their ability to resolve CtxBugs remains a significant and unexplored obstacle to their practical use in code adaptation. To bridge this gap, we propose CtxBugGen, a novel framework for generating CtxBugs to evaluate LLMs. Its core idea is to leverage LLMs’ tendency to generate plausible but context-free code when contextual constraints are absent. The framework generates CtxBugs through a four-step process to ensure their relevance and validity: (1) Adaptation Task Selection, (2) Task-specific Perturbation,(3) LLM-based Variant Generation and (4) CtxBugs Identification. Based on the benchmark constructed by CtxBugGen, we conduct an empirical study with four state-of-the-art LLMs. Our results reveal their unsatisfactory performance in CtxBug resolution. The best performing LLM, Kimi-K2, achieves 55.93% on Pass@1 and resolves just 52.47% of CtxBugs. The presence of CtxBugs degrades LLMs’ adaptation performance by up to 30%. Failure analysis indicates that LLMs often overlook CtxBugs and replicate them in their outputs. Our study highlights a critical weakness in LLMs’ cross-context reasoning and emphasize the need for new methods to enhance their context awareness for reliable code adaptation.

[AI-118] ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

【Quick Read】: This paper addresses reward-signal degradation in reinforcement learning (RL) for open-ended tasks that lack objective ground truth, such as complex travel planning with vast solution spaces. Existing RL methods rely on reward models that assign scalar scores to individual responses, but such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among trajectories, compressing scores into a narrow range so that the effective reward signal is dominated by reward-model noise and optimization stagnates. The key to the solution is ArenaRL, which shifts the paradigm from pointwise scalar scoring to intra-group relative ranking: a process-aware pairwise evaluation mechanism assigns fine-grained relative scores via multi-level rubrics, and an intra-group adversarial arena with a tournament-based ranking scheme yields stable advantage signals. The seeded single-elimination scheme nearly matches the advantage-estimation accuracy of full pairwise comparison while requiring only O(N) rather than O(N²) complexity, striking an optimal balance between efficiency and precision.

Link: https://arxiv.org/abs/2601.06487
Authors: Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the built seeded single-elimination scheme achieves nearly equivalent advantage estimation accuracy to full pairwise comparisons with O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
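
To see why single elimination needs only O(N) judge calls, note that every match eliminates exactly one trajectory, so ranking N trajectories costs N-1 comparisons. The sketch below illustrates the bracket mechanics with a placeholder judge; ArenaRL's actual seeding, rubrics, and LLM evaluator are not reproduced here.

```python
import random

def judge(a, b):
    """Stand-in for the pairwise evaluator; the real system scores pairs
    with multi-level, process-aware rubrics rather than a coin flip."""
    return a if random.random() < 0.5 else b

def single_elimination(trajectories):
    """Run a single-elimination bracket over trajectories (N-1 comparisons).
    Returns a coarse ranking: eliminated later => ranked higher."""
    rounds, pool = [], list(trajectories)
    while len(pool) > 1:
        winners, losers = [], []
        for i in range(0, len(pool) - 1, 2):
            w = judge(pool[i], pool[i + 1])
            losers.append(pool[i + 1] if w is pool[i] else pool[i])
            winners.append(w)
        if len(pool) % 2:                  # odd trajectory gets a bye
            winners.append(pool[-1])
        rounds.append(losers)
        pool = winners
    rounds.append(pool)                    # the champion
    return [t for rnd in rounds for t in rnd]   # ordered worst -> best
```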

[AI-119] ConSensus: Multi-Agent Collaboration for Multimodal Sensing

【Quick Read】: This paper addresses inconsistent reasoning and prior-knowledge bias when large language models (LLMs) process heterogeneous multimodal sensor data: a single monolithic LLM struggles to reason coherently across modalities, producing incomplete interpretations. The key to the solution is ConSensus, a training-free multi-agent collaboration framework that decomposes multimodal sensing tasks into specialized, modality-aware agents and fuses their outputs with a hybrid mechanism: semantic aggregation enables cross-modal reasoning and contextual understanding, while statistical consensus adds robustness to sensor noise and missing data. Because the two approaches have complementary failure modes, their combination yields reliable inference at low cost, improving accuracy by an average of 7.1% over a single-agent baseline while cutting average fusion token cost by 12.7x compared with iterative multi-agent debate.

Link: https://arxiv.org/abs/2601.06453
Authors: Hyungjun Yoon, Mohammad Malekzadeh, Sung-Ju Lee, Fahim Kawsar, Lorena Qendro
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 5 tables

Click to view abstract

Abstract:Large language models (LLMs) are increasingly grounded in sensor data to perceive and reason about human physiology and the physical world. However, accurately interpreting heterogeneous multimodal sensor data remains a fundamental challenge. We show that a single monolithic LLM often fails to reason coherently across modalities, leading to incomplete interpretations and prior-knowledge bias. We introduce ConSensus, a training-free multi-agent collaboration framework that decomposes multimodal sensing tasks into specialized, modality-aware agents. To aggregate agent-level interpretations, we propose a hybrid fusion mechanism that balances semantic aggregation, which enables cross-modal reasoning and contextual understanding, with statistical consensus, which provides robustness through agreement across modalities. While each approach has complementary failure modes, their combination enables reliable inference under sensor noise and missing data. We evaluate ConSensus on five diverse multimodal sensing benchmarks, demonstrating an average accuracy improvement of 7.1% over the single-agent baseline. Furthermore, ConSensus matches or exceeds the performance of iterative multi-agent debate methods while achieving a 12.7 times reduction in average fusion token cost through a single-round hybrid fusion protocol, yielding a robust and efficient solution for real-world multimodal sensing tasks.
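
One way to picture the hybrid fusion step: fall back to cheap statistical consensus when modality agents agree, and invoke a single round of LLM-based semantic aggregation otherwise. The threshold, output schema, and `llm_fuse` call below are assumptions, not the paper's protocol.

```python
from collections import Counter

def hybrid_fusion(agent_outputs, llm_fuse, agreement_threshold: float = 0.6):
    """Sketch of hybrid fusion over modality-agent outputs, each a dict
    like {"modality": ..., "answer": ..., "rationale": ...}."""
    votes = Counter(o["answer"] for o in agent_outputs)
    top, count = votes.most_common(1)[0]
    if count / len(agent_outputs) >= agreement_threshold:
        return top                              # robust statistical consensus
    # Disagreement: one round of semantic aggregation across rationales.
    rationales = "\n".join(f"{o['modality']}: {o['rationale']}" for o in agent_outputs)
    return llm_fuse(rationales)                 # stand-in for the fusion prompt
```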

[AI-120] LSRIF: Logic-Structured Reinforcement Learning for Instruction Following

【Quick Read】: This paper addresses large language models' weak modeling of logical structure, such as sequential dependencies and conditional branching, when following complex instructions. Existing methods build datasets of parallel constraints and optimize average rewards, ignoring logical relationships and producing noisy signals. The key to the solution is the logic-structured training framework LSRIF, whose core components are: (1) LSRInstruct, a structured instruction dataset covering parallel, sequential, and conditional constraint types; and (2) structure-aware rewards, with average aggregation for parallel structures, failure-penalty propagation for sequential structures, and selective rewards for conditional branches. Explicitly modeling instruction logic in this way improves in-domain and out-of-domain instruction following as well as general reasoning.

Link: https://arxiv.org/abs/2601.06431
Authors: Qingyu Ren, Qianyu He, Jingwen Chang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Han Xia, Zeye Sun, Fei Yu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Instruction-following is critical for large language models, but real-world instructions often contain logical structures such as sequential dependencies and conditional branching. Existing methods typically construct datasets with parallel constraints and optimize average rewards, ignoring logical dependencies and yielding noisy signals. We propose a logic-structured training framework LSRIF that explicitly models instruction logic. We first construct a dataset LSRInstruct with constraint structures such as parallel, sequential, and conditional types, and then design structure-aware rewarding method LSRIF including average aggregation for parallel structures, failure-penalty propagation for sequential structures, and selective rewards for conditional branches. Experiments show LSRIF brings significant improvements in instruction-following (in-domain and out-of-domain) and general reasoning. Analysis reveals that learning with explicit logic structures brings parameter updates in attention layers and sharpens token-level attention to constraints and logical operators.

[AI-121] HiDVFS: A Hierarchical Multi-Agent DVFS Scheduler for OpenMP DAG Workloads

【Quick Read】: This paper addresses local overheating caused by uneven core activity in multicore embedded systems, together with two gaps in existing DVFS strategies: the lack of per-core frequency monitoring and task assignment that ignores heterogeneous execution patterns. The key to the solution is HiDVFS, a hierarchical multi-agent, performance-aware DVFS scheduler with three cooperating agents: one selects cores and frequencies from profiling data, one manages core combinations via temperature sensors, and one sets task priorities under resource contention. A makespan-first reward with energy and temperature regularizers keeps performance primary while effectively reducing energy consumption and thermal peaks.

Link: https://arxiv.org/abs/2601.06425
Authors: Mohammad Pivezhandi, Abusayeed Saifullah, Ali Jannesari
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 38 pages, 15 figures, 8 tables

Click to view abstract

Abstract:With advancements in multicore embedded systems, leakage power, exponentially tied to chip temperature, has surpassed dynamic power consumption. Energy-aware solutions use dynamic voltage and frequency scaling (DVFS) to mitigate overheating in performance-intensive scenarios, while software approaches allocate high-utilization tasks across core configurations in parallel systems to reduce power. However, existing heuristics lack per-core frequency monitoring, failing to address overheating from uneven core activity, and task assignments without detailed profiling overlook irregular execution patterns. We target OpenMP DAG workloads. Because makespan, energy, and thermal goals often conflict within a single benchmark, this work prioritizes performance (makespan) while reporting energy and thermal as secondary outcomes. To overcome these issues, we propose HiDVFS (a hierarchical multi-agent, performance-aware DVFS scheduler) for parallel systems that optimizes task allocation based on profiling data, core temperatures, and makespan-first objectives. It employs three agents: one selects cores and frequencies using profiler data, another manages core combinations via temperature sensors, and a third sets task priorities during resource contention. A makespan-focused reward with energy and temperature regularizers estimates future states and enhances sample efficiency. Experiments on the NVIDIA Jetson TX2 using the BOTS suite (9 benchmarks) compare HiDVFS against state-of-the-art approaches. With multi-seed validation (seeds 42, 123, 456), HiDVFS achieves the best finetuned performance with 4.16 ± 0.58s average makespan (L10), representing a 3.44x speedup over GearDVFS (14.32 ± 2.61s) and 50.4% energy reduction (63.7 kJ vs 128.4 kJ). Across all BOTS benchmarks, HiDVFS achieves an average 3.95x speedup and 47.1% energy reduction.

[AI-122] Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs

【Quick Read】: This paper asks whether self-consistency, while improving the accuracy of large language model reasoning, actually improves reasoning faithfulness, a question not previously studied; common practice assumes that majority voting over diverse reasoning paths improves accuracy and logical reliability together. The key to the solution is a large-scale empirical analysis of four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on GSM8K mathematical reasoning, quantified with bootstrap confidence intervals, McNemar's tests, and Cohen's d effect sizes to pin down how accuracy and faithfulness diverge under inference scaling. The results show self-consistency is not universally beneficial: Claude Opus 4.5's accuracy drops while its faithfulness rises sharply, and DeepSeek-v3.2 gains little due to near-ceiling accuracy, revealing that model identity and task difficulty decisively shape self-consistency's effect.

Link: https://arxiv.org/abs/2601.06423
Authors: Deep Mehta
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 3 figures, 9 tables

Click to view abstract

Abstract:Self-consistency has emerged as a popular technique for improving large language model accuracy on reasoning tasks. The approach is straightforward: generate multiple reasoning paths and select the most common answer through majority voting. While this reliably boosts accuracy, it remains unclear whether these gains reflect genuine improvements in reasoning quality. We investigate a fundamental question that has not been studied before: does inference scaling improve reasoning faithfulness? We conduct a comprehensive empirical study across four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on 100 GSM8K mathematical reasoning problems. Our analysis employs bootstrap confidence intervals, McNemar’s tests for paired comparisons, and Cohen’s d effect sizes to quantify the effects rigorously. The results reveal striking differences across models that challenge common assumptions about self-consistency. GPT-5.2 shows the expected pattern: accuracy improves from 78% to 90% at N=5, with faithfulness remaining relatively stable (0.540 to 0.510). Claude Opus 4.5 tells a completely different story. Its accuracy actually drops from 78% to 74.3% while faithfulness jumps dramatically from 0.270 to 0.891 at N=5. DeepSeek-v3.2, already at 98% accuracy, shows ceiling effects with modest faithfulness gains (0.440 to 0.541). Gemini-3-flash improves from 81% to 86% accuracy with a slight faithfulness decrease (0.260 to 0.212). Problem difficulty analysis reveals that GPT-5.2 solves 82% of hard problems while breaking only 13% of easy ones. Claude, in contrast, breaks 23% of easy problems, explaining its accuracy decrease. These findings matter for practitioners: self-consistency is not universally beneficial, and teams should test their specific models before deployment. We release our code and provide practical recommendations for navigating these tradeoffs.
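
The self-consistency baseline under study is simple enough to state in a few lines. The sketch below shows the standard sample-and-vote procedure; `sample_fn` is a stand-in for a temperature>0 model call, and none of the paper's evaluation machinery is reproduced here.

```python
from collections import Counter

def self_consistency(sample_fn, prompt: str, n: int = 5):
    """Sample n reasoning paths and majority-vote the final answers.
    `sample_fn(prompt)` is assumed to return a (reasoning, answer) pair."""
    paths = [sample_fn(prompt) for _ in range(n)]
    votes = Counter(answer for _, answer in paths)
    answer, count = votes.most_common(1)[0]
    return answer, count / n, paths   # answer, agreement rate, raw paths
```

The agreement rate returned here is exactly the kind of signal the paper probes: high vote agreement need not imply that the winning reasoning paths are faithful.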

[AI-123] HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents

【Quick Read】: This paper addresses the poor adaptability, limited scalability, and lack of self-evolution of long-term memory systems under continuous interaction. The key to the solution is HiMem, a hierarchical long-term memory framework: a Topic-Aware Event-Surprise Dual-Channel Segmentation strategy builds cognitively consistent Episode Memory, a multi-stage information extraction pipeline builds stable Note Memory, and the two are semantically linked into a hierarchy that enables efficient retrieval without sacrificing information fidelity. A conflict-aware memory reconsolidation mechanism dynamically revises and supplements stored knowledge based on retrieval feedback, supporting continual memory self-evolution over long-term use.

Link: https://arxiv.org/abs/2601.06377
Authors: Ningning Zhang, Xingxing Yang, Zhizhong Tan, Weiping Deng, Wenyong Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Although long-term memory systems have made substantial progress in recent years, they still exhibit clear limitations in adaptability, scalability, and self-evolution under continuous interaction settings. Inspired by cognitive theories, we propose HiMem, a hierarchical long-term memory framework for long-horizon dialogues, designed to support memory construction, retrieval, and dynamic updating during sustained interactions. HiMem constructs cognitively consistent Episode Memory via a Topic-Aware Event–Surprise Dual-Channel Segmentation strategy, and builds Note Memory that captures stable knowledge through a multi-stage information extraction pipeline. These two memory types are semantically linked to form a hierarchical structure that bridges concrete interaction events and abstract knowledge, enabling efficient retrieval without sacrificing information fidelity. HiMem supports both hybrid and best-effort retrieval strategies to balance accuracy and efficiency, and incorporates conflict-aware Memory Reconsolidation to revise and supplement stored knowledge based on retrieval feedback. This design enables continual memory self-evolution over long-term use. Experimental results on long-horizon dialogue benchmarks demonstrate that HiMem consistently outperforms representative baselines in accuracy, consistency, and long-term reasoning, while maintaining favorable efficiency. Overall, HiMem provides a principled and scalable design paradigm for building adaptive and self-evolving LLM-based conversational agents. The code is available at this https URL.
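A minimal data-structure sketch of the two-level memory the abstract describes: concrete episodes linked to abstract notes, with retrieval following links from notes back to the episodes that support them. Class names and the substring matching here are invented simplifications; HiMem's surprise-based segmentation, semantic retrieval, and reconsolidation are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """A topic-coherent dialogue segment (Episode Memory)."""
    topic: str
    turns: list
    note_ids: list = field(default_factory=list)  # links to abstract notes

@dataclass
class Note:
    """A stable fact distilled from episodes (Note Memory)."""
    note_id: int
    text: str
    sources: list = field(default_factory=list)   # supporting episode indices

class HierarchicalMemory:
    def __init__(self):
        self.episodes, self.notes = [], {}

    def add_episode(self, topic, turns):
        self.episodes.append(Episode(topic, turns))
        return len(self.episodes) - 1

    def add_note(self, text, source_episode):
        nid = len(self.notes)
        self.notes[nid] = Note(nid, text, [source_episode])
        self.episodes[source_episode].note_ids.append(nid)
        return nid

    def retrieve(self, query):
        """Match abstract notes first, then follow links back to the
        concrete episodes that ground them (toy substring matching)."""
        hits = [n for n in self.notes.values() if query.lower() in n.text.lower()]
        return [(n.text, [self.episodes[i].topic for i in n.sources]) for n in hits]
```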

[AI-124] SafeGPT: Preventing Data Leakage and Unethical Outputs in Enterprise LLM Use

【速读】: This paper targets the security and ethics problems of applying large language models (LLMs) in enterprise workflows, in particular employees inadvertently leaking sensitive data or generating policy-violating content. The key is SafeGPT, a two-sided guardrail system: sensitive-information detection and redaction on the input side, content moderation and reframing on the output side, plus human-in-the-loop feedback to continuously improve safety and compliance. Experiments show the approach effectively reduces data-leakage risk and biased outputs while maintaining user satisfaction.

链接: https://arxiv.org/abs/2601.06366
作者: Pratyush Desai,Luoxi Tang,Yuqiao Meng,Zhaohan Xi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertently share confidential data or generate policy-violating content. This paper proposes SafeGPT, a two-sided guardrail system preventing sensitive data leakage and unethical outputs. SafeGPT integrates input-side detection/redaction, output-side moderation/reframing, and human-in-the-loop feedback. Experiments demonstrate SafeGPT effectively reduces data leakage risk and biased outputs while maintaining satisfaction.
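A minimal sketch of a two-sided guardrail of the kind SafeGPT describes, with regex redaction standing in for the paper's input-side detection and a keyword check standing in for its output-side moderation; all patterns and policies below are toy assumptions.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN   = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(prompt):
    """Input-side guardrail: mask obvious identifiers before the LLM call."""
    return SSN.sub("[ID]", EMAIL.sub("[EMAIL]", prompt))

def moderate(response, banned=("internal use only",)):
    """Output-side guardrail: withhold or reframe policy-violating text."""
    if any(b in response.lower() for b in banned):
        return "[Withheld: response flagged for policy review]"
    return response

def safe_call(llm, prompt):
    return moderate(llm(redact(prompt)))

# A stub model shows the flow end to end:
echo = lambda p: f"Echo: {p}"
print(safe_call(echo, "Contact alice@corp.com about SSN 123-45-6789"))
# Echo: Contact [EMAIL] about SSN [ID]
```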

[AI-125] Human-in-the-Loop Interactive Report Generation for Chronic Disease Adherence ALT AAAI2026

【速读】: This paper addresses the difficulty of producing personalized patient communications in chronic disease management given clinicians' limited time, while balancing the efficiency of AI automation against clinical trust. The key is a clinician-in-the-loop interface that restricts the AI to data-organization tasks and has physicians perform recognition-based review, realizing a division of labor: the AI drafts structured sections paired with time-aligned visual evidence, and physicians only confirm or lightly adjust key conclusions, preserving clinical accuracy while sharply reducing authoring effort. In experiments, AI-generated content matched physicians' manual authoring (mean 4.86/10 vs. 5.0/10) with only 8.3% of content modified and no safety-critical issues, yet physicians still had to verify everything, exposing an accountability paradox in high-stakes clinical settings: responsibility cannot be outsourced.

链接: https://arxiv.org/abs/2601.06364
作者: Xiaotian Zhang,Jinhong Yu,Pengwei Yan,Le Jiang,Xingyi Shen,Mumo Cheng,Xiaozhong Liu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures. Accepted at the AAAI 2026 Workshop on AI for Healthy Aging and Longevity

点击查看摘要

Abstract:Chronic disease management requires regular adherence feedback to prevent avoidable hospitalizations, yet clinicians lack time to produce personalized patient communications. Manual authoring preserves clinical accuracy but does not scale; AI generation scales but can undermine trust in patient-facing contexts. We present a clinician-in-the-loop interface that constrains AI to data organization and preserves physician oversight through recognition-based review. A single-page editor pairs AI-generated section drafts with time-aligned visualizations, enabling inline editing with visual evidence for each claim. This division of labor (AI organizes, clinician decides) targets both efficiency and accountability. In a pilot with three physicians reviewing 24 cases, AI successfully generated clinically personalized drafts matching physicians’ manual authoring practice (overall mean 4.86/10 vs. 5.0/10 baseline), requiring minimal physician editing (mean 8.3% content modification) with zero safety-critical issues, demonstrating effective automation of content generation. However, review time remained comparable to manual practice, revealing an accountability paradox: in high-stakes clinical contexts, professional responsibility requires complete verification regardless of AI accuracy. We contribute three interaction patterns for clinical AI collaboration: bounded generation with recognition-based review via chart-text pairing, automated urgency flagging that analyzes vital trends and adherence patterns with fail-safe escalation for missed critical monitoring tasks, and progressive disclosure controls that reduce cognitive load while maintaining oversight. These patterns indicate that clinical AI efficiency requires not only accurate models, but also mechanisms for selective verification that preserve accountability.

[AI-126] Styles + Persona-plug = Customized LLMs

【速读】: This paper identifies an overlooked problem in personalized text generation: personalization methods are increasingly applied under explicit style instructions, yet their behavior under such constraints remains poorly understood. To balance implicit personalization with explicit style control, the authors formulate personalization as a distributional residual and propose PsPLUG, a lightweight soft-prompt plug-in trained with style-conditioned preference contrasts. The key is this residual modeling, which enables controllable, style-aware personalization that improves persona consistency while preserving stylistic fidelity at minimal computational cost.

链接: https://arxiv.org/abs/2601.06362
作者: Yutong Song,Jiang Wu,Shaofan Yuan,Chengze Shen,Jian Wang,Amir Rahmani,Nikil Dutt,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We discover a previously overlooked challenge in personalized text generation: personalization methods are increasingly applied under explicit style instructions, yet their behavior under such constraints remains poorly understood. To balance implicit personalization and explicit style, we formulate personalization as a distributional residual and propose PsPLUG, a lightweight soft-prompt plug-in trained with style-conditioned preference contrasts. Across LaMP benchmark, our framework improves persona alignment, maintains stylistic fidelity, and outperforms retrieval-based and soft-prompt baselines with minimal computation. These results show that residual modeling provides a simple and principled foundation for controllable, style-aware LLM personalization.

[AI-127] Smart Privacy Policy Assistant: An LLM-Powered System for Transparent and Actionable Privacy Notices

【速读】: This paper tackles users blindly consenting to online privacy policies they do not understand, since these documents are long and full of legal jargon that non-experts struggle to interpret. The key is an LLM-powered Smart Privacy Policy Assistant that automatically parses privacy policies, extracts and categorizes key clauses, assigns interpretable risk levels, and generates clear, concise explanations; deployed through a browser extension or mobile app, it operates in real time and surfaces contextual risk warnings before users disclose sensitive information or grant risky permissions, improving users' awareness of and control over data practices.

链接: https://arxiv.org/abs/2601.06357
作者: Sriharshini Kalvakuntla,Luoxi Tang,Yuqiao Meng,Zhaohan Xi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most users agree to online privacy policies without reading or understanding them, even though these documents govern how personal data is collected, shared, and monetized. Privacy policies are typically long, legally complex, and difficult for non-experts to interpret. This paper presents the Smart Privacy Policy Assistant, an LLM-powered system that automatically ingests privacy policies, extracts and categorizes key clauses, assigns human-interpretable risk levels, and generates clear, concise explanations. The system is designed for real-time use through browser extensions or mobile interfaces, surfacing contextual warnings before users disclose sensitive information or grant risky permissions. We describe the end-to-end pipeline, including policy ingestion, clause categorization, risk scoring, and explanation generation, and propose an evaluation framework based on clause-level accuracy, policy-level risk agreement, and user comprehension.

[AI-128] CARD: Cluster-level Adaptation with Reward-guided Decoding for Personalized Text Generation

【速读】: This paper addresses the tension between fine-grained per-user personalization of large language models (LLMs) and scalable deployment: serving each user high-quality, stylistically consistent generation without significant compute or storage overhead. The key is CARD, a hierarchical framework with an efficient two-stage mechanism: it first clusters users by shared stylistic patterns and learns a cluster-specific LoRA (Low-Rank Adaptation) adapter per cluster, ensuring robustness and generalization in low-resource settings; within each cluster, an implicit preference-learning mechanism then contrasts user-authored text with cluster-level generations to infer individual style preferences without manual annotation. At inference time, personalization is injected only through lightweight user preference vectors and low-rank logit corrections while the base model stays frozen, markedly improving efficiency and scalability.

链接: https://arxiv.org/abs/2601.06352
作者: Yutong Song,Jiang Wu,Weijia Zhang,Chengze Shen,Shaofan Yuan,Weitao Lu,Jian Wang,Amir Rahmani,Nikil Dutt,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adapting large language models to individual users remains challenging due to the tension between fine-grained personalization and scalable deployment. We present CARD, a hierarchical framework that achieves effective personalization through progressive refinement. CARD first clusters users according to shared stylistic patterns and learns cluster-specific LoRA adapters, enabling robust generalization and strong low-resource performance. To capture individual differences within each cluster, we propose an implicit preference learning mechanism that contrasts user-authored text with cluster-level generations, allowing the model to infer user-specific style preferences without manual annotation. At inference time, CARD injects personalization exclusively at decoding via lightweight user preference vectors and low-rank logit corrections, while keeping the base model frozen. Experiments on the LaMP and LongLaMP benchmarks show that CARD achieves competitive or superior generation quality compared to state-of-the-art baselines, while significantly improving efficiency and scalability for practical personalized text generation.
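One plausible reading of "lightweight user preference vectors and low-rank logit corrections" at decoding time is sketched below; the shapes, the shared projection matrices, and the additive form are assumptions for illustration rather than CARD's exact parameterization.

```python
import torch

def personalized_logits(base_logits, user_vec, U, V, alpha=1.0):
    """Decoding-time personalization as a low-rank logit correction.

    base_logits: [vocab] scores from the frozen base model
    user_vec:    [d_u]   lightweight per-user preference vector
    U: [d_u, r], V: [r, vocab] shared projection, rank r much smaller
    than the vocabulary. All of this is an illustrative assumption.
    """
    correction = (user_vec @ U) @ V          # [vocab]
    return base_logits + alpha * correction

vocab, d_u, r = 50, 8, 4
logits = torch.randn(vocab)
user = torch.randn(d_u)
U, V = torch.randn(d_u, r), torch.randn(r, vocab)
probs = torch.softmax(personalized_logits(logits, user, U, V), dim=-1)
print(probs.shape)  # torch.Size([50])
```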

[AI-129] Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages

【速读】: This paper addresses the limited robustness of enterprise LLM applications, whose performance can swing substantially under minor input variations. The key is a comprehensive benchmark suite covering multiple perturbation types, including general text edits (e.g., punctuation and whitespace changes), format changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and shifts in instruction position, enabling systematic evaluation of model stability in realistic settings. Testing 11 models from 4B to 120B+ parameters, the study finds that minor perturbations can reduce key enterprise metrics by up to 40 percentage points and that the relationship between model size and robustness is not a simple positive correlation, offering empirical guidance and a new perspective for optimizing enterprise LLM deployment.

链接: https://arxiv.org/abs/2601.06341
作者: Tara Bogavelli,Oluwanifemi Bamgbose,Gabrielle Gauthier Melançon,Fanny Riols,Roshnee Sharma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Enterprise LLM applications require consistently high quality and reliable performance across diverse scenarios, demanding robustness to minor variations. Existing research shows that even small prompt changes can lead to substantial differences in output, but has mainly focused on a narrow set of perturbations with small academic datasets, limiting their relevance to real-world applications. To address this, we present a comprehensive benchmark suite that evaluates robustness across multiple perturbation types, including general text edits (e.g., punctuation, whitespace), formatting changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and positional variations in instructions. Evaluating 11 models ranging from 4B to 120B+ parameters, we find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics. Critically, we demonstrate that the relationship between model size and robustness is more nuanced than conventional assumptions suggest: an 8B parameter model (Ministral 3 8B) outperforms most larger models, while another 8B model (Llama 3.1 8B) performs worst overall.
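A minimal harness in the spirit of the benchmark: render one task as plain text, JSON, YAML, and a whitespace/punctuation edit, then score how often a model's answer survives the perturbations. The perturbation set is abbreviated and the scoring is a toy stand-in for the paper's enterprise metrics; the YAML variant needs the third-party pyyaml package.

```python
import json
import yaml  # pip install pyyaml

def perturbations(task: dict):
    """Yield surface variants of the same request."""
    text = f"Summarize the ticket: {task['ticket']}"
    yield "plain", text
    yield "json", json.dumps(task)
    yield "yaml", yaml.safe_dump(task)
    yield "spaced", text.replace(" ", "  ") + " ."  # whitespace/punctuation edit

def consistency(model, task, reference):
    """Fraction of perturbed inputs still answered correctly; a simple
    robustness score, not the paper's metric."""
    results = [model(p) == reference for _, p in perturbations(task)]
    return sum(results) / len(results)
```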

[AI-130] Future-as-Label: Scalable Supervision from Real-World Outcomes

【速读】: This paper addresses real-world prediction tasks that lack labels at prediction time: a temporal gap separates prediction from outcome, so supervision arrives only after events resolve. The authors bring verifiable rewards into a reinforcement learning framework, building Foresight Learning, in which language models produce probabilistic forecasts under causally masked information and are evaluated retrospectively with proper scoring rules. The key is deriving supervision solely from post-resolution outcomes while preserving delayed-reward semantics, enabling effective training for long-horizon forecasting. Experiments show that a Qwen3-32B model trained this way clearly outperforms its pretrained baseline on real-world forecasting benchmarks and beats Qwen3-235B despite having roughly one-seventh the parameters.

链接: https://arxiv.org/abs/2601.06336
作者: Benjamin Turtel,Paul Wilczewski,Danny Franklin,Kris Skothiem
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many real-world prediction problems lack labels observable at prediction time, creating a temporal gap between prediction and outcome that yields supervision only after events resolve. To address this setting, we extend reinforcement learning with verifiable rewards to temporally resolved real-world prediction, and use it to train language models to make probabilistic forecasts under causally masked information with retrospective evaluation using proper scoring rules. Supervision is derived solely from post-resolution outcomes, preserving delayed-reward semantics. On real-world forecasting benchmarks, Qwen3-32B trained using Foresight Learning improves Brier score by 27% and halves calibration error relative to its pretrained baseline, and outperforms Qwen3-235B on both constructed future-event prediction tasks and the Metaculus benchmark despite a 7x parameter disadvantage.
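The proper scoring rule named above is easy to make concrete. A minimal Brier-score sketch follows (the evaluation side of this training recipe; the causal masking and RL loop are out of scope here, and the forecasts are toy values).

```python
def brier_score(forecasts):
    """Mean squared error between predicted probabilities and outcomes.

    forecasts: sequence of (p, outcome) with p in [0, 1], outcome in {0, 1}.
    Lower is better; a confident, correct forecaster scores near 0.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Two forecasters on the same three resolved events:
sharp = [(0.9, 1), (0.2, 0), (0.7, 1)]
vague = [(0.5, 1), (0.5, 0), (0.5, 1)]
print(brier_score(sharp))  # ~0.047
print(brier_score(vague))  # 0.25
```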

[AI-131] Foundational Analysis of Safety Engineering Requirements (SAFER)

【速读】: This paper addresses the gaps, duplications, and contradictions that arise in safety requirements for complex safety-critical systems when multiple stakeholders pursue uncoordinated objectives, problems that seriously threaten system safety and compliance and that existing, largely informal methods handle poorly. The key is SAFER, a model-driven methodology augmented by generative AI (Generative AI) that performs structured analysis of requirement-specification models to (1) map requirements to system functions, (2) identify insufficiently specified functions, (3) detect duplicate requirements, and (4) find contradictions within requirement sets. The study shows that generative AI must be combined with formal models and queried systematically to provide meaningful early-stage safety requirement specifications and robust safety architectures.

链接: https://arxiv.org/abs/2601.06335
作者: Noga Chemo,Yaniv Mordecai,Yoram Reich
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a framework for Foundational Analysis of Safety Engineering Requirements (SAFER), a model-driven methodology supported by Generative AI to improve the generation and analysis of safety requirements for complex safety-critical systems. Safety requirements are often specified by multiple stakeholders with uncoordinated objectives, leading to gaps, duplications, and contradictions that jeopardize system safety and compliance. Existing approaches are largely informal and insufficient for addressing these challenges. SAFER enhances Model-Based Systems Engineering (MBSE) by consuming requirement specification models and generating the following results: (1) mapping requirements to system functions, (2) identifying functions with insufficient requirement specifications, (3) detecting duplicate requirements, and (4) identifying contradictions within requirement sets. SAFER provides structured analysis, reporting, and decision support for safety engineers. We demonstrate SAFER on an autonomous drone system, significantly improving the detection of requirement inconsistencies, enhancing both efficiency and reliability of the safety engineering process. We show that Generative AI must be augmented by formal models and queried systematically, to provide meaningful early-stage safety requirement specifications and robust safety architectures.

[AI-132] Kolmogorov-Arnold Networks-Based Tolerance-Aware Manufacturability Assessment Integrating Design-for-Manufacturing Principles

【速读】: This paper addresses the long-standing disconnect between design and manufacturing, in particular the heavy preprocessing, information loss, and poor interpretability that plague existing geometry-driven AI approaches to manufacturability assessment. The key is a framework that evaluates manufacturability directly from parametric design features, explicitly incorporating dimensional tolerances without any computer-aided design (CAD) processing, and uses Kolmogorov-Arnold Networks (KANs) to learn the functional relationships among design parameters, tolerances, and manufacturability outcomes. The method performs strongly across three representative scenarios (hole drilling, pocket milling, and combined machining), with AUC up to 0.9919, and its spline-based functional visualizations and latent-space projections provide high interpretability, supporting parameter-level iterative design optimization that ultimately turns non-manufacturable parts into manufacturable ones.

链接: https://arxiv.org/abs/2601.06334
作者: Masoud Deylami,Negar Izadipour,Adel Alaeddini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 12 figures. Under review for journal publication

点击查看摘要

Abstract:Manufacturability assessment is a critical step in bridging the persistent gap between design and production. While artificial intelligence (AI) has been widely applied to this task, most existing frameworks rely on geometry-driven methods that require extensive preprocessing, suffer from information loss, and offer limited interpretability. This study proposes a methodology that evaluates manufacturability directly from parametric design features, enabling explicit incorporation of dimensional tolerances without requiring computer-aided design (CAD) processing. The approach employs Kolmogorov-Arnold Networks (KANs) to learn functional relationships between design parameters, tolerances, and manufacturability outcomes. A synthetic dataset of 300,000 labeled designs is generated to evaluate performance across three representative scenarios: hole drilling, pocket milling, and combined drilling-milling, while accounting for machining constraints and design-for-manufacturing (DFM) rules. Benchmarking against fourteen machine learning (ML) and deep learning (DL) models shows that KAN achieves the highest performance in all scenarios, with AUC values of 0.9919 for drilling, 0.9841 for milling, and 0.9406 for the combined case. The proposed framework provides high interpretability through spline-based functional visualizations and latent-space projections, enabling identification of the design and tolerance parameters that most strongly influence manufacturability. An industrial case study further demonstrates how the framework enables iterative, parameter-level design modifications that transform a non-manufacturable component into a manufacturable one.
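A minimal KAN-style layer makes the "learnable function per edge" idea concrete; for brevity it uses a Gaussian radial basis in place of the B-splines a KAN normally uses, so it is a structural sketch under that swapped-in assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn

class KANStyleLayer(nn.Module):
    """Each input-output edge applies its own learnable 1-D function:
    a fixed grid of Gaussian bumps with trainable coefficients (a
    stand-in for the B-spline parameterization of real KANs)."""
    def __init__(self, in_dim, out_dim, n_basis=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(x_min, x_max, n_basis))
        self.width = (x_max - x_min) / n_basis
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

    def forward(self, x):                                 # x: [batch, in]
        d = x[:, None, :, None] - self.centers            # [batch, 1, in, basis]
        phi = torch.exp(-(d / self.width) ** 2)
        phi = phi.expand(-1, self.coef.shape[0], -1, -1)  # [batch, out, in, basis]
        # Evaluate every edge function, then sum over inputs per output unit.
        return torch.einsum("boip,oip->bo", phi, self.coef)

net = nn.Sequential(KANStyleLayer(4, 16), KANStyleLayer(16, 1))
print(net(torch.randn(32, 4)).shape)  # torch.Size([32, 1])
```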

[AI-133] ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation ACL2026

【速读】: This paper addresses the challenges large language models (LLMs) face in open-world tool use: large tool pools, long-horizon objectives, complex constraints, and unreliable tool states. The core of the solution is ToolGym, an open-world tool-using environment built on 5,571 format-unified tools covering 204 commonly used apps; it includes a task-generation engine that synthesizes multi-tool, long-horizon, constraint-laden workflows and a state controller that simulates interruptions and failures to test robustness. On top of the environment, a tool select-then-execute agent framework uses a planner-actor decomposition to separate deliberate reasoning and self-correction from step-wise execution, enabling more efficient and robust tool invocation.

链接: https://arxiv.org/abs/2601.06328
作者: Ziqiao Xi,Shuang Liang,Qi Liu,Jiaqing Zhang,Letian Peng,Fang Nan,Meshal Nayim,Tianhui Zhang,Rishika Mundada,Lianhui Qin,Biwei Huang,Kun Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to ACL 2026; 12 pages, 4 figures; Ziqiao Xi and Shuang Liang contributed equally to this work

点击查看摘要

Abstract:Tool-using LLM agents still struggle in open-world settings with large tool pools, long-horizon objectives, wild constraints, and unreliable tool states. For scalable and realistic training and testing, we introduce an open-world tool-using environment, built on 5,571 format unified tools across 204 commonly used apps. It includes a task creation engine that synthesizes long-horizon, multi-tool workflows with wild constraints, and a state controller that injects interruptions and failures to stress-test robustness. On top of this environment, we develop a tool select-then-execute agent framework with a planner-actor decomposition to separate deliberate reasoning and self-correction from step-wise execution. Comprehensive evaluation of state-of-the-art LLMs reveals the misalignment between tool planning and execution abilities, the constraint following weakness of existing LLMs, and DeepSeek-v3.2’s strongest robustness. Finally, we collect 1,170 trajectories from our environment to fine-tune LLMs, achieving superior performance to baselines using 119k samples, indicating the environment’s value as both a realistic benchmark and a data engine for tool-using agents. Our code and data will be publicly released.

[AI-134] Beyond BeautifulSoup: Benchmarking LLM-Powered Web Scraping for Everyday Users

【速读】: This paper addresses how web scraping's high technical bar, requiring fluency in HTML parsing, session management, and bypassing authentication, has restricted large-scale data collection to skilled developers, and how large language models (LLMs) change this. The key is using LLMs to realize two workflows at different automation levels: LLM-assisted scripting, where users generate code via natural-language prompts and execute it manually, and end-to-end LLM agents, where the model autonomously navigates and extracts data. Experiments show that end-to-end agents can handle complex sites (including authentication, anti-bot, and CAPTCHA defenses) with only minor prompt refinement, substantially lowering the barrier so that everyday users can obtain otherwise hard-to-access data.

链接: https://arxiv.org/abs/2601.06301
作者: Arth Bhardwaj,Nirav Diwan,Gang Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Web scraping has historically required technical expertise in HTML parsing, session management, and authentication circumvention, which limited large-scale data extraction to skilled developers. We argue that large language models (LLMs) have democratized web scraping, enabling low-skill users to execute sophisticated operations through simple natural language prompts. While extensive benchmarks evaluate these tools under optimal expert conditions, we show that without extensive manual effort, current LLM-based workflows allow novice users to scrape complex websites that would otherwise be inaccessible. We systematically benchmark what everyday users can do with off-the-shelf LLM tools across 35 sites spanning five security tiers, including authentication, anti-bot, and CAPTCHA controls. We devise and evaluate two distinct workflows: (a) LLM-assisted scripting, where users prompt LLMs to generate traditional scraping code but maintain manual execution control, and (b) end-to-end LLM agents, which autonomously navigate and extract data through integrated tool use. Our results demonstrate that end-to-end agents have made complex scraping accessible - requiring as little as a single prompt with minimal refinement (less than 5 changes) to complete workflows. We also highlight scenarios where LLM-assisted scripting may be simpler and faster for static sites. In light of these findings, we provide simple procedures for novices to use these workflows and gauge what adversaries could achieve using these.

[AI-135] AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

【速读】: This paper addresses the difficulty of optimizing large language model (LLM) inference in production: dynamic workloads, strict latency/throughput targets, and a rapidly expanding configuration space. That space spans not only distributed parallelism strategies (tensor, pipeline, and expert parallelism) but also framework-specific runtime parameters (such as CUDA-graph enablement, KV-cache memory fraction, and maximum token capacity) that strongly affect performance; because modern inference frameworks (e.g., TRT-LLM, vLLM, SGLang) use different kernels and execution policies, manual tuning is both framework-specific and computationally prohibitive. The key is AIConfigurator, a unified performance-modeling system whose core innovations are: (1) decomposing inference into analytically modelable primitives (GEMM, attention, communication, and memory operations) while capturing framework-specific scheduling; (2) a calibrated kernel-level performance database across hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, Llama, Mistral); and (3) an abstraction layer that automatically resolves optimal launch parameters for the target backend and integrates seamlessly into production-grade orchestration systems. The method completes a configuration search in about 30 seconds on average and improves performance by up to 40% for dense models and 50% for MoE architectures.

链接: https://arxiv.org/abs/2601.06288
作者: Tianhao Xu,Yiming Liu,Xianglong Lu,Yijia Zhao,Xuting Zhou,Aichen Feng,Yiyi Chen,Yi Shen,Qin Zhou,Xumeng Chen,Ilya Sherstyuk,Haorui Li,Rishi Thakkar,Ben Hamm,Yuanzhe Li,Xue Huang,Wenpeng Wu,Anish Shanbhag,Harry Kim,Chuan Chen,Junjie Lai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Optimizing Large Language Model (LLM) inference in production systems is increasingly difficult due to dynamic workloads, stringent latency/throughput targets, and a rapidly expanding configuration space. This complexity spans not only distributed parallelism strategies (tensor/pipeline/expert) but also intricate framework-specific runtime parameters such as those concerning the enablement of CUDA graphs, available KV-cache memory fractions, and maximum token capacity, which drastically impact performance. The diversity of modern inference frameworks (e.g., TRT-LLM, vLLM, SGLang), each employing distinct kernels and execution policies, makes manual tuning both framework-specific and computationally prohibitive. We present AIConfigurator, a unified performance-modeling system that enables rapid, framework-agnostic inference configuration search without requiring GPU-based profiling. AIConfigurator combines (1) a methodology that decomposes inference into analytically modelable primitives (GEMM, attention, communication, and memory operations) while capturing framework-specific scheduling dynamics; (2) a calibrated kernel-level performance database for these primitives across a wide range of hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, Llama, Mistral); and (3) an abstraction layer that automatically resolves optimal launch parameters for the target backend, seamlessly integrating into production-grade orchestration systems. Evaluation on production LLM serving workloads demonstrates that AIConfigurator identifies superior serving configurations that improve performance by up to 40% for dense models (e.g., Qwen3-32B) and 50% for MoE architectures (e.g., DeepSeek-V3), while completing searches within 30 seconds on average, enabling rapid exploration of vast design spaces, from cluster topology down to engine-specific flags.
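The decomposition in (1) can be made concrete with a toy cost model: per-primitive latencies come from an offline-calibrated lookup table and are summed per layer, and the search simply filters candidate configurations against a latency budget. The schema, keys, and numbers below are invented for illustration; AIConfigurator's actual database and search are far richer.

```python
def estimate_layer_latency(cfg, db):
    """Sum calibrated primitive costs (microseconds) for one layer."""
    tp = cfg["tensor_parallel"]
    total = db[("gemm", cfg["hidden"], cfg["hidden"] // tp)]   # projections
    total += db[("attention", cfg["seq_len"], cfg["hidden"] // tp)]
    if tp > 1:
        total += db[("allreduce", cfg["hidden"])]              # TP communication
    total += db[("memory", cfg["kv_bytes"])]                   # KV-cache traffic
    return total

def best_config(candidates, db, latency_budget_us):
    """Highest-throughput candidate that meets the latency target."""
    ok = [c for c in candidates
          if estimate_layer_latency(c, db) <= latency_budget_us]
    return max(ok, key=lambda c: c["throughput"], default=None)

cfg = {"hidden": 4096, "seq_len": 2048, "kv_bytes": 1_048_576,
       "tensor_parallel": 2, "throughput": 1.0}
db = {("gemm", 4096, 2048): 120.0, ("attention", 2048, 2048): 300.0,
      ("allreduce", 4096): 40.0, ("memory", 1_048_576): 15.0}
print(estimate_layer_latency(cfg, db))  # 475.0
```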

[AI-136] Automated QoR improvement in OpenROAD with coding agents

【速读】: This paper addresses how the scarcity of expert engineering resources constrains development and innovation in electronic design automation (EDA). The core of the solution is AuDoPEDA, an autonomous, repository-grounded coding system built on OpenAI models and a Codex-class agent that reads the OpenROAD codebase, proposes research directions, expands them into executable implementation steps, and submits applicable diffs. The key contributions are a closed-loop LLM framework for EDA code changes, a PPA-oriented (performance-power-area) task suite and evaluation protocol on OpenROAD, and end-to-end demonstrations with minimal human oversight; experiments in OpenROAD achieve routed-wirelength reductions of up to 5.9% and effective clock-period reductions of up to 10.0%.

链接: https://arxiv.org/abs/2601.06268
作者: Amur Ghose,Junyeong Jang,Andrew B. Kahng,Jakang Lee
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:EDA development and innovation has been constrained by scarcity of expert engineering resources. While leading LLMs have demonstrated excellent performance in coding and scientific reasoning tasks, their capacity to advance EDA technology itself has been largely untested. We present AuDoPEDA, an autonomous, repository-grounded coding system built atop OpenAI models and a Codex-class agent that reads OpenROAD, proposes research directions, expands them into implementation steps, and submits executable diffs. Our contributions include (i) a closed-loop LLM framework for EDA code changes; (ii) a task suite and evaluation protocol on OpenROAD for PPA-oriented improvements; and (iii) end-to-end demonstrations with minimal human oversight. Experiments in OpenROAD achieve routed wirelength reductions of up to 5.9%, and effective clock period reductions of up to 10.0%.

[AI-137] Agentic AI Microservice Framework for Deepfake and Document Fraud Detection in KYC Pipelines

【速读】: This paper addresses the significant vulnerabilities that synthetic media, presentation attacks, and document forgeries create in Know Your Customer (KYC) workflows across finance, telecommunications, and digital-identity ecosystems, where traditional monolithic KYC systems lack the scalability and agility to keep up with evolving fraud. The key is an Agentic AI Microservice Framework that coordinates modular vision models, liveness assessment, deepfake detection, OCR-based document forensics, multimodal identity linking, and a policy-driven risk engine; autonomous micro-agents handle task decomposition, pipeline orchestration, dynamic retries, and human-in-the-loop escalation, enabling accurate, low-latency, privacy-preserving real-time KYC verification that is resilient to adversarial inputs.

链接: https://arxiv.org/abs/2601.06241
作者: Chandra Sekhar Kubam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Journal of Information Systems Engineering and Management, 2024

点击查看摘要

Abstract:The rapid proliferation of synthetic media, presentation attacks, and document forgeries has created significant vulnerabilities in Know Your Customer (KYC) workflows across financial services, telecommunications, and digital-identity ecosystems. Traditional monolithic KYC systems lack the scalability and agility required to counter adaptive fraud. This paper proposes an Agentic AI Microservice Framework that integrates modular vision models, liveness assessment, deepfake detection, OCR-based document forensics, multimodal identity linking, and a policy driven risk engine. The system leverages autonomous micro-agents for task decomposition, pipeline orchestration, dynamic retries, and human-in-the-loop escalation. Experimental evaluations demonstrate improved detection accuracy, reduced latency, and enhanced resilience against adversarial inputs. The framework offers a scalable blueprint for regulated industries seeking robust, real-time, and privacy-preserving KYC verification.

[AI-138] An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution

链接: https://arxiv.org/abs/2601.06235
作者: Sheng-Kai Chen,Jyh-Horng Wu,Ching-Yao Lin,Yen-Ting Lin
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Published in NCS 2025 (Paper No. N0180)

点击查看摘要

[AI-139] PCoKG: Personality-aware Commonsense Reasoning with Debate AAAI-2026

【速读】: This paper addresses the fact that most commonsense reasoning models ignore personality traits, limiting their usefulness in personalized systems such as dialogue generation. The key is the Personality-aware Commonsense Knowledge Graph (PCoKG), a structured knowledge graph of 521,316 quadruples; its core innovations are using the role-playing capabilities of large language models (LLMs) for reasoning and introducing a debate mechanism of proponent, opponent, and judge that iteratively refines the generated knowledge through feedback loops, yielding commonsense reasoning that better reflects individual cognitive differences.

链接: https://arxiv.org/abs/2601.06234
作者: Weijie Li,Zhongqing Wang,Guodong Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accept by AAAI-2026

点击查看摘要

Abstract:Most commonsense reasoning models overlook the influence of personality traits, limiting their effectiveness in personalized systems such as dialogue generation. To address this limitation, we introduce the Personality-aware Commonsense Knowledge Graph (PCoKG), a structured dataset comprising 521,316 quadruples. We begin by employing three evaluators to score and filter events from the ATOMIC dataset, selecting those that are likely to elicit diverse reasoning patterns across different personality types. For knowledge graph construction, we leverage the role-playing capabilities of large language models (LLMs) to perform reasoning tasks. To enhance the quality of the generated knowledge, we incorporate a debate mechanism consisting of a proponent, an opponent, and a judge, which iteratively refines the outputs through feedback loops. We evaluate the dataset from multiple perspectives and conduct fine-tuning and ablation experiments using multiple LLM backbones to assess PCoKG’s robustness and the effectiveness of its construction pipeline. Our LoRA-based fine-tuning results indicate a positive correlation between model performance and the parameter scale of the base models. Finally, we apply PCoKG to persona-based dialogue generation, where it demonstrates improved consistency between generated responses and reference outputs. This work bridges the gap between commonsense reasoning and individual cognitive differences, enabling the development of more personalized and context-aware AI systems.
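A minimal sketch of the proponent/opponent/judge refinement loop follows; `llm(role, prompt)` is a stand-in for a role-played model call, and the prompts, convergence test, and round count are assumptions rather than PCoKG's actual protocol.

```python
def debate_refine(llm, event, draft, rounds=2):
    """Iteratively refine one candidate knowledge quadruple via debate."""
    for _ in range(rounds):
        support = llm("proponent", f"Defend this inference for '{event}': {draft}")
        attack = llm("opponent", f"Critique the inference: {draft}\nDefense: {support}")
        verdict = llm("judge",
                      f"Given the defense and critique, accept or revise:\n"
                      f"Draft: {draft}\nDefense: {support}\nCritique: {attack}")
        if verdict.strip() == draft.strip():  # judge accepted unchanged
            break
        draft = verdict                       # feed the revision back in
    return draft
```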

[AI-140] Multi-Agent Framework for Controllable and Protected Generative Content Creation: Addressing Copyright and Provenance in AI-Generated Media

【速读】: This paper addresses the limited controllability, copyright-infringement risk, and poor provenance of content created by generative AI (Generative AI) systems, which largely operate as black boxes without effective intent-alignment mechanisms or intellectual-property protections. The key is a multi-agent framework in which Director, Generator, Reviewer, Integration, and Protection agents collaborate with a clear division of labor, with digital watermarking embedded to enable provenance tracing and copyright protection, keeping the creative workflow controllable, lawful, and trustworthy.

链接: https://arxiv.org/abs/2601.06232
作者: Haris Khan,Sadia Asif,Shumaila Asif
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of generative AI systems creates unprecedented opportunities for content creation while raising critical concerns about controllability, copyright infringement, and content provenance. Current generative models operate as “black boxes” with limited user control and lack built-in mechanisms to protect intellectual property or trace content origin. We propose a novel multi-agent framework that addresses these challenges through specialized agent roles and integrated watermarking. Our system orchestrates Director, Generator, Reviewer, Integration, and Protection agents to ensure user intent alignment while embedding digital provenance markers. We demonstrate feasibility through two case studies: creative content generation with iterative refinement and copyright protection for AI-generated art in commercial contexts. Preliminary feasibility evidence from prior work indicates up to 23% improvement in semantic alignment and 95% watermark recovery rates. This work contributes to responsible generative AI deployment, positioning multi-agent systems as a solution for trustworthy creative workflows in legal and commercial applications.

[AI-141] Triadic Concept Analysis for Logic Interpretation of Simple Artificial Networks

【速读】: This paper addresses the lack of interpretability of artificial neural networks (ANNs) in classification tasks despite their high accuracy. The key is to train a simple ANN on minterm values of the inputs, partition the network into cells using its ReLU activations, and convert it into a three-dimensional binary bit tensor; applying Formal Concept Analysis (FCA) to this tensor yields concepts expressed as logic trees that reveal interpretable attribute interactions while preserving the classification power of the original ANN model.

链接: https://arxiv.org/abs/2601.06229
作者: Ingo Schmitt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:An artificial neural network (ANN) is a numerical method used to solve complex classification problems. Due to its high classification power, the ANN method often outperforms other classification methods in terms of accuracy. However, an ANN model lacks interpretability compared to methods that use the symbolic paradigm. Our idea is to derive a symbolic representation from a simple ANN model trained on minterm values of input objects. Based on ReLU nodes, the ANN model is partitioned into cells. We convert the ANN model into a cell-based, three-dimensional bit tensor. The theory of Formal Concept Analysis applied to the tensor yields concepts that are represented as logic trees, expressing interpretable attribute interactions. Their evaluations preserve the classification power of the initial ANN model.

[AI-142] When Smaller Wins: Dual-Stage Distillation and Pareto-Guided Compression of Liquid Neural Networks for Edge Battery Prognostics ICPR2026

【速读】: This paper addresses the conflict between model accuracy and limited compute when deploying battery health prognostics on edge devices. The key is DLNet, which compresses a high-capacity teacher into lightweight students via dual-stage knowledge distillation and uses Euler discretization to adapt liquid neural networks to embedded environments; a Pareto-guided selection under joint error-cost objectives then retains the students that best balance accuracy and efficiency. The framework cuts model size by 84.7% and runs at 21 ms per inference while reducing prediction error by 15.4% relative to the teacher, demonstrating that with proper supervision and selection a small model can outperform a larger one in edge-based prognostics.

链接: https://arxiv.org/abs/2601.06227
作者: Dhivya Dharshini Kannan,Wei Li,Zhang Wei,Jianbiao Wang,Zhi Wei Seh,Man-Fai Ng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to International Conference on Pattern Recognition, ICPR 2026

点击查看摘要

Abstract:Battery management systems increasingly require accurate battery health prognostics under strict on-device constraints. This paper presents DLNet, a practical framework with dual-stage distillation of liquid neural networks that turns a high-capacity model into compact and edge-deployable models for battery health prediction. DLNet first applies Euler discretization to reformulate liquid dynamics for embedded compatibility. It then performs dual-stage knowledge distillation to transfer the teacher model’s temporal behavior and recover it after further compression. Pareto-guided selection under joint error-cost objectives retains student models that balance accuracy and efficiency. We evaluate DLNet on a widely used dataset and validate real-device feasibility on an Arduino Nano 33 BLE Sense using int8 deployment. The final deployed student achieves a low error of 0.0066 when predicting battery health over the next 100 cycles, which is 15.4% lower than the teacher model. It reduces the model size from 616 kB to 94 kB with 84.7% reduction and takes 21 ms per inference on the device. These results support a practical smaller wins observation that a small model can match or exceed a large teacher for edge-based prognostics with proper supervision and selection. Beyond batteries, the DLNet framework can extend to other industrial analytics tasks with strict hardware constraints.
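The Euler discretization is the step that makes a liquid cell embeddable. The sketch below unrolls the standard fused semi-implicit Euler update for liquid time-constant dynamics; DLNet's actual cell, gating, and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class EulerLiquidCell(nn.Module):
    """Liquid time-constant cell unrolled with a fixed-step Euler update,
    the kind of discretization used for embedded deployment. This is the
    textbook fused step for LTC-style dynamics, not the paper's exact cell."""
    def __init__(self, in_dim, hidden, dt=0.1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(in_dim + hidden, hidden), nn.Sigmoid())
        self.A = nn.Parameter(torch.zeros(hidden))   # equilibrium targets
        self.tau = nn.Parameter(torch.ones(hidden))  # base time constants
        self.dt = dt

    def forward(self, x_seq):                        # [batch, time, in_dim]
        h = torch.zeros(x_seq.size(0), self.A.numel())
        for t in range(x_seq.size(1)):
            gate = self.f(torch.cat([x_seq[:, t], h], dim=-1))
            # Semi-implicit Euler step of dh/dt = -h/tau + gate*(A - h)
            h = (h + self.dt * gate * self.A) / (1 + self.dt * (1 / self.tau + gate))
        return h

cell = EulerLiquidCell(3, 8)
print(cell(torch.randn(4, 20, 3)).shape)  # torch.Size([4, 8])
```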

[AI-143] Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification

链接: https://arxiv.org/abs/2601.06226
作者: Zenghao Duan,Zhiyi Yin,Zhichao Shi,Liang Pang,Shaoling Jing,Zihe Huang,Jiayi Wu,Yu Yan,Jingcheng Deng,Huawei Shen,Xueqi Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-144] Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency Accountability and Trustworthiness

链接: https://arxiv.org/abs/2601.06223
作者: Edward C. Cheng,Jeshua Cheng,Alice Siu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 15 pages, 8 figures, conference paper

点击查看摘要

[AI-145] LDTC: Lifelong deep temporal clustering for multivariate time series

【速读】: This paper addresses temporal clustering of dynamically changing real-world multivariate time series, where existing deep methods struggle to keep learning new tasks in sequential-task settings and suffer catastrophic forgetting. The key is the Lifelong Deep Temporal Clustering (LDTC) algorithm, which integrates dimensionality reduction and temporal clustering into a single end-to-end deep unsupervised framework, jointly optimizing the latent representation and the clustering objective through a purpose-built autoencoder to obtain high-quality clusters; in addition, LDTC introduces fully dynamic model expansion and rehearsal-based techniques so that it can learn new tasks and handle dynamic data streams effectively, avoiding degradation of model performance.

链接: https://arxiv.org/abs/2601.06221
作者: Zhi Wang,Yanni Li,Pingping Zheng,Yiyuan Jiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clustering temporal and dynamically changing multivariate time series from real-world fields, called temporal clustering for short, has been a major challenge due to inherent complexities. Although several deep temporal clustering algorithms have demonstrated a strong advantage over traditional methods in terms of model learning and clustering results, the accuracy of these algorithms is not satisfactory. None of the existing algorithms can continuously learn new tasks and deal with dynamic data effectively and efficiently in sequential task learning. To bridge the gap and tackle these issues, this paper proposes a novel algorithm, Lifelong Deep Temporal Clustering (LDTC), which effectively integrates dimensionality reduction and temporal clustering into an end-to-end deep unsupervised learning framework. Using a specifically designed autoencoder and jointly optimizing for both the latent representation and the clustering objective, the LDTC can achieve high-quality clustering results. Moreover, unlike any previous work, the LDTC is uniquely equipped with fully dynamic model expansion and rehearsal-based techniques to effectively learn new tasks and to tackle dynamic data in sequential task learning without catastrophic forgetting or degradation of model accuracy. Experiments on seven real-world multivariate time series datasets show that the LDTC is a promising method for dealing with temporal clustering issues effectively and efficiently.

[AI-146] Breaking Model Lock-in: Cost-Efficient Zero-Shot LLM Routing via a Universal Latent Space

【速读】: This paper addresses the "model lock-in" problem in the large language model (LLM) ecosystem: in existing routing frameworks, integrating a new model requires costly, exhaustive retraining, severely limiting scalability and adaptability. The key is ZeroRouter, whose core innovation is a model-agnostic universal latent space that represents query difficulty uniformly, decoupling query characterization from model profiling so that new models can be onboarded zero-shot without full-scale retraining; a context-aware predictor maps queries into this space, and a dual-mode optimizer dynamically balances accuracy, cost, and latency, consistently outperforming all baselines.

链接: https://arxiv.org/abs/2601.06220
作者: Cheng Yan,Wuyang Zhang,Zhiyuan Ning,Fan Xu,Ziyang Tao,Lu Zhang,Bing Yin,Yanyong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid proliferation of Large Language Models (LLMs) has led to a fragmented and inefficient ecosystem, a state of "model lock-in" where seamlessly integrating novel models remains a significant bottleneck. Current routing frameworks require exhaustive, costly retraining, hindering scalability and adaptability. We introduce ZeroRouter, a new paradigm for LLM routing that breaks this lock-in. Our approach is founded on a universal latent space, a model-agnostic representation of query difficulty that fundamentally decouples the characterization of a query from the profiling of a model. This allows for zero-shot onboarding of new models without full-scale retraining. ZeroRouter features a context-aware predictor that maps queries to this universal space and a dual-mode optimizer that balances accuracy, cost, and latency. Our framework consistently outperforms all baselines, delivering higher accuracy at lower cost and latency.

[AI-147] AI-Powered Algorithms for the Prevention and Detection of Computer Malware Infections

【速读】: This paper addresses the declining effectiveness of traditional signature-based malware detection against increasingly frequent and sophisticated attacks. The key is a hybrid context-aware malware detection framework (HCADMF) driven by AI that combines static file analysis, dynamic behavioral analysis, and contextual metadata in a multi-layer architecture: lightweight static classifiers and Long Short-Term Memory (LSTM) networks provide real-time behavioral analysis, and ensemble risk scoring integrates predictions across layers to improve detection accuracy and responsiveness. On the EMBER and CIC-MalMem2022 benchmarks the approach reaches 97.3% accuracy with only a 1.5% false-positive rate and minimal detection delay, demonstrating effective identification of both known and novel malware variants.

链接: https://arxiv.org/abs/2601.06219
作者: Rakesh Keshava,Sathish Kuppan Pandurangan,M. Sakthivanitha,Sankaranainar Parmsivan,Goutham Sunkara,R. Maruthi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise in frequency and complexity of malware attacks is viewed as a major threat to modern digital infrastructure, and traditional signature-based detection methods are becoming less effective. As cyber threats continue to evolve, there is a growing need for intelligent systems that accurately and proactively identify and prevent malware infections. This study presents a new hybrid context-aware malware detection framework (HCADMF) based on artificial intelligence (AI), which combines static file analysis, dynamic behavioural analysis, and contextual metadata to provide more accurate and timely detection. HCADMF has a multi-layer architecture consisting of lightweight static classifiers, Long Short-Term Memory (LSTM) networks for real-time behavioral analysis, and ensemble risk scoring that integrates predictions across layers. Experimental evaluations of the new methodology on the benchmark datasets EMBER and CIC-MalMem2022 showed that the approach provides superior performance, with an accuracy of 97.3%, only a 1.5% false-positive rate, and minimal detection delay compared to several established machine learning (ML) and deep learning (DL) methods in the field. The results provide strong evidence that hybrid AI can detect both existing and novel malware variants, and lay the foundation for intelligent security systems that enable real-time detection and adapt to a rapidly evolving threat landscape.

[AI-148] CEEMDAN-Based Multiscale CNN for Wind Turbine Gearbox Fault Detection

【速读】: This paper addresses the low diagnostic accuracy of wind turbine gearbox fault detection caused by the nonlinear, non-stationary nature of vibration signals. The key is a hybrid of Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and a Multiscale Convolutional Neural Network (MSCNN): CEEMDAN decomposes the vibration signal in the time-frequency domain into intrinsic mode functions that isolate critical fault features at different scales, which the MSCNN then classifies through deep hierarchical feature learning. The method achieves a 98.95% F1 score on real-world datasets and outperforms existing approaches in both diagnostic accuracy and computational efficiency.

链接: https://arxiv.org/abs/2601.06217
作者: Nejad Alagha,Anis Salwa Mohd Khairuddin,Obada Al-Khatib,Abigail Copiaco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: conference paper

点击查看摘要

Abstract:Wind turbines play a critical role in the shift toward sustainable energy generation. Their operation relies on multiple interconnected components, and a failure in any of these can compromise the entire system’s functionality. Detecting faults accurately is challenging due to the intricate, non-linear, and non-stationary nature of vibration signals, influenced by dynamic loading, environmental variations, and mechanical interactions. As such, effective signal processing techniques are essential for extracting meaningful features to enhance diagnostic accuracy. This study presents a hybrid approach for fault detection in wind turbine gearboxes, combining Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and a Multiscale Convolutional Neural Network (MSCNN). CEEMDAN is employed to decompose vibration signals into intrinsic mode functions, isolating critical features at different time-frequency scales. These are then input into the MSCNN, which performs deep hierarchical feature extraction and classification. The proposed method achieves an F1 Score of 98.95%, evaluated on real-world datasets, and demonstrates superior performance in both detection accuracy and computational speed compared to existing approaches. This framework offers a balanced solution for reliable and efficient fault diagnosis in wind turbine systems.
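The decomposition stage of this pipeline can be sketched with the PyEMD package's CEEMDAN on a synthetic two-tone signal standing in for a gearbox vibration measurement; stacking the first few intrinsic mode functions as channels (assuming at least four are produced) yields the kind of multiscale input an MSCNN would consume. The MSCNN itself is omitted, and PyEMD is one third-party implementation, not necessarily the one used in the paper.

```python
import numpy as np
from PyEMD import CEEMDAN  # pip install EMD-signal

# Synthetic vibration signal: two tones plus noise (toy stand-in for
# the real gearbox measurements used in the paper).
t = np.linspace(0, 1, 2048)
signal = np.sin(2 * np.pi * 30 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
signal += 0.2 * np.random.randn(t.size)

imfs = CEEMDAN()(signal)           # IMFs ordered fastest to slowest oscillation
print(imfs.shape)                  # (n_imfs, 2048)

# Stack a few IMFs as channels for a CNN input of shape [batch, ch, length],
# assuming the decomposition produced at least four IMFs:
x = imfs[:4][None, :, :].astype(np.float32)
print(x.shape)                     # (1, 4, 2048)
```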

[AI-149] LLM Agents in Law: Taxonomy Applications and Challenges

【速读】: This paper addresses three core problems of deploying standalone large language models (LLMs) in the legal domain: hallucination, outdated information, and limited verifiability, which undermine reliability and usefulness in professional legal practice. The key is LLM agents for law that integrate advanced capabilities such as planning, memory, and tool usage so that agentic systems can handle complex legal tasks more accurately and traceably, bridging the gap between technical capability and legal practice. The survey analyzes the transition from standard legal LLMs to legal agents, presents a structured taxonomy of agent applications across practice areas, discusses evaluation methodologies, and outlines open challenges and future directions.

链接: https://arxiv.org/abs/2601.06216
作者: Shuang Liu,Ruijia Zhang,Ruoyun Ma,Yujia Deng,Lanyi Zhu,Jiayu Li,Zelong Li,Zhibin Shen,Mengnan Du
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have precipitated a dramatic improvement in the legal domain, yet the deployment of standalone models faces significant limitations regarding hallucination, outdated information, and verifiability. Recently, LLM agents have attracted significant attention as a solution to these challenges, utilizing advanced capabilities such as planning, memory, and tool usage to meet the rigorous standards of legal practice. In this paper, we present a comprehensive survey of LLM agents for legal tasks, analyzing how these architectures bridge the gap between technical capabilities and domain-specific needs. Our major contributions include: (1) systematically analyzing the technical transition from standard legal LLMs to legal agents; (2) presenting a structured taxonomy of current agent applications across distinct legal practice areas; (3) discussing evaluation methodologies specifically for agentic performance in law; and (4) identifying open challenges and outlining future directions for developing robust and autonomous legal assistants.

[AI-150] Dynamics-inspired Structure Hallucination for Protein-protein Interaction Modeling

【速读】: This paper addresses two challenges in predicting mutation effects on protein-protein interactions (PPI): mutant protein structures are hard to obtain, and existing deep learning (DL) models fail to model the dynamic nature of PPI. The key is the Refine-PPI framework with two innovations: a structure refinement module, trained with a mask mutation modeling (MMM) task on available wild-type structures, that "hallucinates" the inaccessible mutant structures; and a new geometric network, the probability density cloud network (PDC-Net), that captures 3D dynamic variation and encodes PPI-relevant atomic uncertainty. This strategy markedly improves free-energy-change prediction, surpassing all existing tools on the SKEMPI.v2 dataset.

链接: https://arxiv.org/abs/2601.06214
作者: Fang Wu,Stan Z. Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Protein-protein interaction (PPI) represents a central challenge within the biology field, and accurately predicting the consequences of mutations in this context is crucial for drug design and protein engineering. Deep learning (DL) has shown promise in forecasting the effects of such mutations, but is hindered by two primary constraints. First, the structures of mutant proteins are often elusive to acquire. Secondly, PPI takes place dynamically, which is rarely integrated into the DL architecture design. To address these obstacles, we present a novel framework named Refine-PPI with two key enhancements. First, we introduce a structure refinement module trained by a mask mutation modeling (MMM) task on available wild-type structures, which is then transferred to produce the inaccessible mutant structures. Second, we employ a new kind of geometric network, called the probability density cloud network (PDC-Net), to capture 3D dynamic variations and encode the atomic uncertainty associated with PPI. Comprehensive experiments on SKEMPI.v2 substantiate the superiority of Refine-PPI over all existing tools for predicting free energy change. These findings underscore the effectiveness of our hallucination strategy and the PDC module in addressing the absence of mutant protein structure and modeling geometric uncertainty.

[AI-151] Cyber Threat Detection and Vulnerability Assessment System using Generative AI and Large Language Model

【速读】: This paper addresses the weak detection performance of earlier security models such as Security BERT, whose limited contextual understanding of text data hampers cyber-attack identification. The key is an improved model based on RoBERTa (Robustly Optimized Bidirectional Encoder Representations from Transformers) with a more diverse vocabulary and richer contextual modeling: encrypted network traffic is represented efficiently with Byte-level and Byte Pair Encoding (BBPE) tokenization, and Softmax performs precise classification of attack types. Experiments show the approach outperforms the existing BERT model in accuracy (0.99), recall (0.91), and precision (0.89).

链接: https://arxiv.org/abs/2601.06213
作者: Keerthi Kumar. M,Swarun Kumar Joginpelly,Sunil Khemka,Lakshmi. S R,Navin Chhibber
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Background: Cyber-attacks have evolved rapidly in recent years, and many individuals and business owners have been affected by them in various ways. Cyber-attacks include threats such as ransomware, malware, phishing, and Denial of Service (DoS)-related attacks. Challenges: Traditional models such as Generative Artificial Intelligence (AI) and Security Bidirectional Encoder Representations from Transformers (BERT) have been implemented to detect cyber threats. However, the existing Security BERT model has a limited contextual understanding of text data, which limits its ability to detect cyber-attacks. Proposed Methodology: To overcome these challenges, a Robustly Optimized Bidirectional Encoder Representations from Transformers Pretraining Approach (RoBERTa) model is proposed, which offers a more diverse vocabulary and richer contextual understanding. Initially, data are extracted from a Packet Capture (PCAP) file and encrypted using Fully Harmonic Encryption (FHE). Subsequently, a Byte-level and Byte Pair Encoding (BBPE) tokenizer is used to generate tokens and maintain the vocabulary for the encrypted values. These values are then fed to the RoBERTa transformer model for extensive training. Finally, Softmax is used for the detection and classification of attacks. The proposed RoBERTa model achieved better results than the existing BERT model in terms of accuracy (0.99), recall (0.91), and precision (0.89).
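A minimal sketch of the classification stage using Hugging Face's RoBERTa, whose tokenizer is byte-level BPE as the abstract requires; the roberta-base checkpoint, the four class labels, and the example flows are placeholders, and the paper's encryption step and actual training are omitted (the classification head here is untrained).

```python
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

tok = RobertaTokenizerFast.from_pretrained("roberta-base")  # byte-level BPE
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4)  # e.g. ransomware/malware/phishing/DoS

flows = ["GET /login.php?id=1%27%20OR%201=1", "normal TLS handshake record"]
batch = tok(flows, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)  # per-class scores
print(probs.shape)  # torch.Size([2, 4])
```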

[AI-152] RiskBridge: Turning CVEs into Business-Aligned Patch Priorities

链接: https://arxiv.org/abs/2601.06201
作者: Yelena Mujibur Sheikh,Awez Akhtar Khatik,Luoxi Tang,Yuqiao Meng,Zhaohan Xi
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-153] AI Safeguards Generative AI and the Pandora Box: AI Safety Measures to Protect Businesses and Personal Reputation

【速读】: This paper addresses the deepfake problem created by generative AI (Generative AI) content generation, which poses potential harms to society, businesses, and personal reputation. The key is a hybrid detection approach based on Temporal Consistency Learning (TCL) that uses pretrained Temporal Convolutional Networks (TCNs) for feature extraction and discrimination, improving detection accuracy by enforcing temporal-consistency constraints. Experiments show the TCN models clearly outperform alternative approaches across five representative "dark side" problems, demonstrating highly accurate detection in support of AI safety governance.

链接: https://arxiv.org/abs/2601.06197
作者: Prasanna Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 10 pages, 3 Figures, 6 Tables

点击查看摘要

Abstract:Generative AI has unleashed the power of content generation, but it has also unwittingly opened the Pandora's box of realistic deepfakes, causing a number of social hazards and harm to businesses and personal reputation. This paper investigates the ramifications of generative AI technology across industries and shows that hybrid detection techniques built on neural networks allow such content to be flagged; reliable detection and flagging are what enable AI safety, which is the main focus of this paper. The research provides a significant method for efficiently detecting dark-side problems by applying a Temporal Consistency Learning (TCL) technique. Through pretrained Temporal Convolutional Networks (TCNs) model training and performance comparison, this paper shows that TCN models outperform the other approaches and achieve significant accuracy on five dark-side problems. The findings highlight how important it is to take proactive detection measures to reduce the potential risks associated with generative artificial intelligence.

[AI-154] EntroLnn: Entropy-Guided Liquid Neural Networks for Operando Refinement of Battery Capacity Fade Trajectories

【速读】: This paper tackles online refinement of the entire battery capacity fade trajectory (CFT), moving beyond the traditional treatment of state-of-health (SoH) estimation and end-of-life (EoL) prediction as separate tasks. The core is EntroLnn, a framework based on entropy-guided transformable liquid neural networks (LNNs) whose key idea is to introduce, for the first time in battery analytics, entropy features derived from online temperature fields and to combine them with customized LNNs that effectively model battery temporal dynamics. The approach improves both the static and dynamic adaptability of LNNs and achieves high-fidelity, lightweight, and interpretable CFT refinement across different batteries and operating conditions, with a mean absolute error of 0.004577 for the CFT and 18 cycles for EoL prediction.

链接: https://arxiv.org/abs/2601.06195
作者: Wei Li,Wei Zhang,Qingyu Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Battery capacity degradation prediction has long been a central topic in battery health analytics, and most studies focus on state of health (SoH) estimation and end of life (EoL) prediction. This study extends the scope to online refinement of the entire capacity fade trajectory (CFT) through EntroLnn, a framework based on entropy-guided transformable liquid neural networks (LNNs). EntroLnn treats CFT refinement as an integrated process rather than two independent tasks for pointwise SoH and EoL. We introduce entropy-based features derived from online temperature fields, applied for the first time in battery analytics, and combine them with customized LNNs that model temporal battery dynamics effectively. The framework enhances both static and dynamic adaptability of LNNs and achieves robust and generalizable CFT refinement across different batteries and operating conditions. The approach provides a high fidelity battery health model with lightweight computation, achieving mean absolute errors of only 0.004577 for CFT and 18 cycles for EoL prediction. This work establishes a foundation for entropy-informed learning in battery analytics and enables self-adaptive, lightweight, and interpretable battery health prediction in practical battery management systems.

[AI-155] TimeGNN-Augmented Hybrid-Action MARL for Fine-Grained Task Partitioning and Energy-Aware Offloading in MEC

【速读】: This paper addresses inefficient task scheduling and resource allocation in mobile edge computing (MEC) caused by the limited compute of edge servers, non-continuous power provisioning (e.g., battery-powered nodes), and highly dynamic systems. The key is TG-DCMADDPG, a multi-agent deep reinforcement learning algorithm that uses a temporal graph neural network (TimeGNN) to model and forecast multi-dimensional server-state time series, substantially reducing online interaction frequency and improving policy predictability, and a multi-agent deterministic policy gradient algorithm (DC-MADDPG) over a discrete-continuous hybrid action space to collaboratively optimize fine-grained task-partitioning ratios, transmission power, and priority-scheduling strategies, thereby jointly optimizing energy efficiency and latency.

链接: https://arxiv.org/abs/2601.06191
作者: Wei Ai,Yun Peng,Yuntao Shou,Tao Meng,Keqin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid growth of IoT devices and latency-sensitive applications, the demand for both real-time and energy-efficient computing has surged, placing significant pressure on traditional cloud computing architectures. Mobile edge computing (MEC), an emerging paradigm, effectively alleviates the load on cloud centers and improves service quality by offloading computing tasks to edge servers closer to end users. However, the limited computing resources, non-continuous power provisioning (e.g., battery-powered nodes), and highly dynamic systems of edge servers complicate efficient task scheduling and resource allocation. To address these challenges, this paper proposes a multi-agent deep reinforcement learning algorithm, TG-DCMADDPG, and constructs a collaborative computing framework for multiple edge servers, aiming to achieve joint optimization of fine-grained task partitioning and offloading. This approach incorporates a temporal graph neural network (TimeGNN) to model and predict time series of multi-dimensional server state information, thereby reducing the frequency of online interactions and improving policy predictability. Furthermore, a multi-agent deterministic policy gradient algorithm (DC-MADDPG) in a discrete-continuous hybrid action space is introduced to collaboratively optimize task partitioning ratios, transmission power, and priority scheduling strategies. Extensive simulation experiments confirm that TG-DCMADDPG achieves markedly faster policy convergence, superior energy-latency optimization, and higher task completion rates compared with existing state-of-the-art methods, underscoring its robust scalability and practical effectiveness in dynamic and constrained MEC scenarios.

[AI-156] Rational Synthesizers or Heuristic Followers? Analyzing LLM s in RAG -based Question-Answering ACL

【速读】: This paper addresses the opacity of how large language models (LLMs) in retrieval-augmented generation (RAG) integrate groups of conflicting retrieved evidence: whether decisions rest on factual strength, prior beliefs, or sheer repetition. The key is GroupQA, a high-quality dataset of 1,635 controversial questions paired with 15,058 diversely sourced evidence documents annotated for stance and qualitative strength; controlled experiments reveal group-level aggregation behaviors such as paraphrased arguments persuading more than independent support, a preference for evidence presented first rather than last, and larger models growing less responsive to new evidence, and further show that model-generated explanations are unfaithful. Together these findings indicate that LLMs in RAG lean on heuristic cues rather than rational reasoning, with direct implications for improving RAG system design.

链接: https://arxiv.org/abs/2601.06189
作者: Atharv Naphade
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 9 figures, ACL ARR submission

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is the prevailing paradigm for grounding Large Language Models (LLMs), yet the mechanisms governing how models integrate groups of conflicting retrieved evidence remain opaque. Does an LLM answer a certain way because the evidence is factually strong, because of a prior belief, or merely because it is repeated frequently? To answer this, we introduce GroupQA, a curated dataset of 1,635 controversial questions paired with 15,058 diversely-sourced evidence documents, annotated for stance and qualitative strength. Through controlled experiments, we characterize group-level evidence aggregation dynamics: Paraphrasing an argument can be more persuasive than providing distinct independent support; Models favor evidence presented first rather than last, and Larger models are increasingly resistant to adapt to presented evidence. Additionally, we find that LLM explanations to group-based answers are unfaithful. Together, we show that LLMs behave consistently as vulnerable heuristic followers, with direct implications for improving RAG system design.
zh

[AI-157] Large-Scale Continual Scheduling and Execution for Dynamic Distributed Satellite Constellation Observation Allocation

链接: https://arxiv.org/abs/2601.06188
作者: Itai Zilberstein,Steve Chien
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-158] Neuro-Symbolic Compliance: Integrating LLMs and SMT Solvers for Automated Financial Legal Analysis

【速读】:This paper addresses the difficulty of maintaining logical consistency in automated financial regulatory compliance, especially ensuring accurate interpretation and enforcement of regulations without human intervention. The key to the solution is a Neuro-Symbolic Compliance Framework that combines large language models (LLMs) with Satisfiability Modulo Theories (SMT) solvers for formal verifiability and optimization-based compliance correction: the LLM extracts semantics from statutes and enforcement cases and generates SMT constraints, while the SMT solver guarantees logical consistency and, when violations arise, computes the minimal factual modification that restores legality. By emphasizing logic-driven optimization over post-hoc explanation, the approach significantly improves the efficiency and accuracy of compliance reasoning.

链接: https://arxiv.org/abs/2601.06181
作者: Yung-Shen Hsia,Fang Yu,Jie-Hong Roland Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 10 pages, 6 tables, 3 figures, accepted by the 2nd ACM AIware Conference

点击查看摘要

Abstract:Financial regulations are increasingly complex, hindering automated compliance, especially the maintenance of logical consistency with minimal human oversight. We introduce a Neuro-Symbolic Compliance Framework that integrates Large Language Models (LLMs) with Satisfiability Modulo Theories (SMT) solvers to enable formal verifiability and optimization-based compliance correction. The LLM interprets statutes and enforcement cases to generate SMT constraints, while the solver enforces consistency and computes the minimal factual modification required to restore legality when penalties arise. Unlike transparency-oriented methods, our approach emphasizes logic-driven optimization, delivering verifiable, legally consistent reasoning rather than post-hoc explanation. Evaluated on 87 enforcement cases from Taiwan’s Financial Supervisory Commission (FSC), the system attains 86.2% correctness in SMT code generation, improves reasoning efficiency by over 100x, and consistently corrects violations, establishing a preliminary foundation for optimization-based compliance applications.
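To make the "minimal factual modification" idea concrete, here is a toy Z3 sketch under invented numbers: a single capital-adequacy clause is encoded as an SMT constraint, and an `Optimize` objective searches for the smallest change to the reported figure that restores legality. The clause, variable names, and threshold are illustrative; the paper's constraint schema is far richer.

```python
from z3 import Optimize, Real, RealVal, If, sat

opt = Optimize()
ratio = Real("capital_ratio")
reported = RealVal("0.05")            # the (violating) reported figure
threshold = RealVal("0.08")           # toy clause: ratio must be at least 8%

opt.add(ratio >= threshold)           # LLM-generated SMT constraint
# Minimal factual modification: minimize |ratio - reported|.
opt.minimize(If(ratio >= reported, ratio - reported, reported - ratio))

if opt.check() == sat:
    print("repaired value:", opt.model()[ratio])   # -> 2/25, i.e. exactly 0.08
```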
zh

[AI-159] The environmental impact of ICT in the era of data and artificial intelligence

链接: https://arxiv.org/abs/2601.06174
作者: François Rottenberg,Thomas Feys,Liesbet Van der Perre
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-160] The Psychology of Learning from Machines: Anthropomorphic AI and the Paradox of Automation in Education

【速读】:This paper addresses the mismatch between the rapid deployment of AI tutors under the generative AI wave and our lagging understanding of their psychological and social consequences in education. The core challenges are: learners exhibit dual trust-calibration failures toward anthropomorphic AI tutors (automation bias and algorithm aversion), forming an expertise paradox; anthropomorphic design raises engagement but can distract attention and foster harmful emotional attachment; and the systems themselves introduce "ironies of automation" such as added cognitive burden, skill degradation, and monitoring load. The key to the solution is a cross-disciplinary synthesis of automation psychology, human factors engineering, HCI, and philosophy of technology, yielding differentiated guidance: AI tutoring with proper scaffolding can keep automation bias manageable for technical foundations, whereas settings that transmit tacit knowledge, such as design thinking, ethical judgment, and professional decision-making, should remain led by human teachers to safeguard learning outcomes and cognitive development.

链接: https://arxiv.org/abs/2601.06172
作者: Junaid Qadir,Muhammad Mumtaz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: accepted at IEEE EDUCON 2026

点击查看摘要

Abstract:As AI tutors enter classrooms at unprecedented speed, their deployment increasingly outpaces our grasp of the psychological and social consequences of such technology. Yet decades of research in automation psychology, human factors, and human-computer interaction provide crucial insights that remain underutilized in educational AI design. This work synthesizes four research traditions – automation psychology, human factors engineering, HCI, and philosophy of technology – to establish a comprehensive framework for understanding how learners psychologically relate to anthropomorphic AI tutors. We identify three persistent challenges intensified by Generative AI’s conversational fluency. First, learners exhibit dual trust calibration failures – automation bias (uncritical acceptance) and algorithm aversion (excessive rejection after errors) – with an expertise paradox where novices overrely while experts underrely. Second, while anthropomorphic design enhances engagement, it can distract from learning and foster harmful emotional attachment. Third, automation ironies persist: systems meant to aid cognition introduce designer errors, degrade skills through disuse, and create monitoring burdens humans perform poorly. We ground this theoretical synthesis through comparative analysis of over 104,984 YouTube comments across AI-generated philosophical debates and human-created engineering tutorials, revealing domain-dependent trust patterns and strong anthropomorphic projection despite minimal cues. For engineering education, our synthesis mandates differentiated approaches: AI tutoring for technical foundations where automation bias is manageable through proper scaffolding, but human facilitation for design, ethics, and professional judgment where tacit knowledge transmission proves irreplaceable.
zh

[AI-161] From Individual Prompts to Collective Intelligence: Mainstreaming Generative AI in the Classroom

链接: https://arxiv.org/abs/2601.06171
作者: Junaid Qadir,Muhammad Salman Khan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: accepted at IEEE EDUCON 2026

点击查看摘要

[AI-162] Parent-Guided Adaptive Reliability (PGAR): A Behavioural Meta-Learning Framework for Stable and Trustworthy AI

【速读】:This paper addresses the poor stability, weak calibration, and slow recovery of standard learning algorithms under external disturbances. The key to the solution is Parent-Guided Adaptive Reliability (PGAR), a lightweight behavioural meta-learning framework that adds a supervisory "parent" layer on top of a standard learner, computes three reflex signals in real time (incident detection, overconfidence correction, and recovery memory), and fuses them into a bounded reliability index in [0,1]; this index continuously modulates the learner's effective learning rate, shrinking updates during instability and restoring the learning rate as reliability improves, thereby achieving dynamic stability control and fast recovery.

链接: https://arxiv.org/abs/2601.06167
作者: Anshum Rankawat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures, 2 tables. Submitted to IEEE Transactions on Artificial Intelligence

点击查看摘要

Abstract:Parent-Guided Adaptive Reliability (PGAR) is a lightweight behavioural meta-learning framework that adds a supervisory “parent” layer on top of a standard learner to improve stability, calibration, and recovery under disturbances. PGAR computes three reflex-level signals (incident detection, overconfidence correction, and recovery memory) and fuses them into a bounded reliability index in [0,1]. This index continuously modulates the learner’s effective learning rate, reducing update magnitude during instability and restoring it as reliability improves. We provide a Lyapunov-based proof sketch establishing bounded adaptation of the reliability dynamics under mild assumptions (smooth loss, descent direction, and bounded reflex outputs). Empirical evaluations on representative learning tasks show improved calibration, reduced loss variance, and faster recovery compared to standard optimizers, while retaining computational simplicity. PGAR functions as a plug-in reliability layer for existing optimization and learning pipelines, supporting interpretable reliability traces in safety-relevant settings.
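A minimal sketch of the core control loop, assuming the three reflex signals are already available as scalars in [0,1]: fuse them into a bounded reliability index and scale the effective learning rate by it. The fusion weights and signal conventions are illustrative assumptions, not PGAR's actual parameterization.

```python
import numpy as np

def reliability_index(incident: float, overconfidence: float, recovery: float,
                      w=(0.4, 0.3, 0.3)) -> float:
    """Fuse three reflex signals (each scaled to [0,1]) into a bounded
    reliability index R in [0,1]; higher incident/overconfidence lowers R."""
    r = w[0] * (1.0 - incident) + w[1] * (1.0 - overconfidence) + w[2] * recovery
    return float(np.clip(r, 0.0, 1.0))

base_lr = 1e-3
R = reliability_index(incident=0.7, overconfidence=0.2, recovery=0.5)
effective_lr = base_lr * R   # updates shrink during instability, recover with R
```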
zh

[AI-163] Contract2Plan: Verified Contract-Grounded Retrieval-Augmented Optimization for BOM-Aware Procurement and Multi-Echelon Inventory Planning

【速读】:This paper addresses the infeasibility and silent contract violations that can arise in procurement and inventory planning when generative AI is used only for clause extraction, especially under multi-echelon bill-of-materials (BOM) coupling, where extraction errors or conflicting clauses can produce infeasible plans. The key to the solution is the Contract2Plan system, whose core is a solver-based compliance gate applied before plans are emitted: contract clauses are modeled as structured constraints with evidence provenance and compiled into a BOM-aware mixed-integer linear program (MILP), and solver diagnostics automatically verify grounding, eligibility, consistency, and feasibility, triggering targeted repair or human intervention when automation is unsafe, so that plans respect contract constraints with feasibility guarantees.

链接: https://arxiv.org/abs/2601.06164
作者: Sahil Agarwal
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures, 4 tables, 1 algorithm

点击查看摘要

Abstract:Procurement and inventory planning is governed not only by demand forecasts and bills of materials (BOMs), but also by operational terms in contracts and supplier documents (e.g., MOQs, lead times, price tiers, allocation caps, substitution approvals). LLM-based extraction can speed up structuring these terms, but extraction-only or LLM-only decision pipelines are brittle: missed clauses, unit errors, and unresolved conflicts can yield infeasible plans or silent contract violations, amplified by BOM coupling. We introduce Contract2Plan, a verified GenAI-to-optimizer pipeline that inserts a solver-based compliance gate before plans are emitted. The system retrieves clause evidence with provenance, extracts a typed constraint schema with evidence spans, compiles constraints into a BOM-aware MILP, and verifies grounding, eligibility, consistency, and feasibility using solver diagnostics, triggering targeted repair or abstention when automation is unsafe. We formalize which clause classes admit conservative repair with contract-safe feasibility guarantees and which require human confirmation. A self-contained synthetic micro-benchmark (500 instances; T=5) computed by exact enumeration under an execution model with MOQ uplift and emergency purchases shows heavy-tailed regret and nontrivial MOQ-violation incidence for extraction-only planning, motivating verification as a first-class component of contract-grounded planning systems.
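The MOQ behavior exercised by the benchmark can be sketched as a tiny MILP; the PuLP model below uses the standard big-M encoding in which ordering anything at all activates the contract's minimum order quantity. All numbers are invented, and the real Contract2Plan compiler handles many more clause classes.

```python
import pulp

# One part, one MOQ clause: any order at all must reach the contract minimum.
prob = pulp.LpProblem("procurement", pulp.LpMinimize)
qty = pulp.LpVariable("qty", lowBound=0)       # units ordered
buy = pulp.LpVariable("buy", cat="Binary")     # 1 if an order is placed
demand, moq, big_m, unit_cost = 120, 200, 10_000, 3.5

prob += unit_cost * qty                        # objective: minimize spend
prob += qty >= demand                          # BOM-driven requirement
prob += qty >= moq * buy                       # MOQ activates when buying
prob += qty <= big_m * buy                     # any positive qty forces buy = 1
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(qty.value())                             # -> 200.0 (MOQ uplift, not 120)
```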
zh

[AI-164] Beyond Accuracy: A Decision-Theoretic Framework for Allocation-Aware Healthcare AI ALT

链接: https://arxiv.org/abs/2601.06161
作者: Rifa Ferzana
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, PDF-only submission. This work introduces a decision-theoretic framework to bridge the gap between predictive accuracy and clinical impact in healthcare AI. Includes synthetic simulation results

点击查看摘要

[AI-165] Student Guides Teacher: Weak-to-Strong Inference via Spectral Orthogonal Exploration

链接: https://arxiv.org/abs/2601.06160
作者: Dayu Wang,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-166] PsyAgent: Constructing Human-like Agents Based on Psychological Modeling and Contextual Interaction

【速读】:This paper addresses the challenge of modeling how personality interacts with social structure in human-like agents, so that behavior stays consistent with individual dispositions while responding appropriately to concrete contexts. The key to the solution is the PsyAgent framework with two core components: (i) an Individual Structure (IS) encoding multi-dimensional attributes such as Big Five traits, cognitive style, values, and cultural capital; and (ii) Multi-Scenario Contexting (MSC), role-relationship-norm frames covering eight social arenas (e.g., work, family, learning). At inference, fixed structured prompts bind the active scenario to the agent's individual structure, yielding behavior that is stable yet context-sensitive. Experiments show that IS chiefly improves trait fidelity and stylistic stability while MSC drives norm awareness and decision fit, and the two together secure cross-scenario performance, offering a data-efficient, precisely controllable architecture for personality-grounded agents.

链接: https://arxiv.org/abs/2601.06158
作者: Zibin Meng,Kani Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human-like agents require modeling how dispositions interact with social structure. We present PsyAgent, which couples a Big Five trait prior with Bourdieu’s cognitive-social co-structure. PsyAgent comprises: (i) Individual Structure (IS), a machine-usable profile encoding traits and facets, cognitive style, values, cultural and educational capital, and salient life episodes; and (ii) Multi-Scenario Contexting (MSC), role-relationship-norm frames spanning eight arenas (work, family, friendship, strangers and civic life, solitude and self-regulation, romance, learning, and public expression). At inference, fixed structured prompts bind the active scenario to the agent profile, yielding behavior that is stable yet context-sensitive. We instantiate IS and MSC to synthesize supervision (role-play dialogues, decision probes, feedback trajectories) and then fine-tune a small LLM. The resulting model produces consistent, identifiable persona-aligned behaviors for specified Big Five configurations and matches or exceeds several larger untuned LLMs and other untuned baselines on our metrics: persona consistency, contextual appropriateness, style matching, trait identifiability, and long-horizon stability. Ablations show IS chiefly improves trait fidelity and stylistic stability, while MSC drives norm awareness and decision fit; both are necessary for cross-scenario performance. PsyAgent offers a precise, data-efficient architecture for personality-grounded agents.
zh

[AI-167] ECLIPTICA - A Framework for Switchable LLM Alignment via CITA - Contrastive Instruction-Tuned Alignment

【速读】:This paper addresses the static nature of large language model (LLM) alignment: methods such as DPO (Direct Preference Optimization) and GRPO (Generalized Reward Policy Optimization) freeze the policy after training, leaving no runtime controllability beyond prompt tricks or expensive re-alignment. The key to the solution is the ECLIPTICA framework, which treats alignment as a behavioral contract driven by natural-language instructions and controllable at runtime, together with CITA (Contrastive Instruction-Tuned Alignment), which combines supervised fine-tuning (SFT) with contrastive preference optimization and uses the reference model as a geometric anchor to build a stable Riemannian chart, keeping instruction updates within a shared neighborhood and enabling reliable, efficient switching between alignment regimes. Experiments show CITA reaches 86.7% instruction-alignment efficiency on the ECLIPTICA benchmark, clearly outperforming DPO, GRPO, and PPO baselines.

链接: https://arxiv.org/abs/2601.06157
作者: Kapil Wanaskar,Gaytri Jena,Vinija Jain,Aman Chadha,Amitava Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Alignment in large language models (LLMs) is still largely static: after training, the policy is frozen. DPO- and GRPO-style methods typically imprint one behavior into the weights, leaving little runtime control beyond prompt hacks or expensive re-alignment. We introduce ECLIPTICA, which treats alignment as instruction-driven and runtime-controllable: natural-language alignment instructions act as an explicit behavioral contract (stance, refusal boundary, verbosity) that modulates behavior on the fly under evolving safety requirements, user roles, and governance constraints. We introduce CITA (Contrastive Instruction-Tuned Alignment), combining SFT with contrastive preference optimization under an explicit geometric anchor to a reference model. This yields a stable Riemannian chart and keeps instruction updates within a shared neighborhood, so regimes stay nearby and traversable for reliable switching. To isolate policy switching from ordinary instruction following, we release the ECLIPTICA benchmark: 3000 controlled cases (300 prompts x 10 instruction types) where the user request is fixed and only the alignment instruction changes. On Llama-3.1-8B across five suites (ECLIPTICA, TruthfulQA, Conditional Safety, Length Control, LITMUS), CITA reaches 86.7% instruction-alignment efficiency, beating DPO (56.1%), GRPO (36.1%), and PPO (20.4%).
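A hedged sketch of what a CITA-style objective might look like: a DPO-style contrastive preference term plus a quadratic anchor that keeps policy log-probabilities near the reference model. The exact loss, the anchor's geometric form, and the coefficients are assumptions read off the abstract, not the published formulation.

```python
import torch
import torch.nn.functional as F

def cita_style_loss(logp_chosen, logp_rejected, logp_policy, logp_ref,
                    beta=0.1, lam=0.05):
    """Contrastive preference term plus a quadratic anchor toward the
    reference model's log-probabilities (form and coefficients assumed)."""
    contrastive = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    anchor = (logp_policy - logp_ref).pow(2).mean()
    return contrastive + lam * anchor

# Dummy per-sequence log-probabilities for a batch of 4 preference pairs.
loss = cita_style_loss(torch.randn(4), torch.randn(4) - 0.5,
                       logp_policy=torch.randn(4), logp_ref=torch.randn(4))
```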
zh

[AI-168] Channel Knowledge Map Construction via Guided Flow Matching

【速读】:This paper addresses the difficulty of constructing accurate channel knowledge maps (CKMs) for environment-aware wireless networks, an ill-posed problem caused by the sparsity of location-specific channel data. Diffusion-based approaches (e.g., denoising diffusion probabilistic models, DDPMs) can generate high-quality CKMs but depend on iterative stochastic sampling, making inference too slow for real-time wireless applications. The key to the solution is a new framework based on linear transport guided flow matching (LT-GFM): CKM generation is modeled as a deterministic ordinary differential equation (ODE) following linear optimal-transport paths, greatly reducing inference steps; a unified architecture supports both channel gain map (CGM) and spatial correlation map (SCM) construction, incorporates environmental semantics (e.g., building masks) to strengthen edge recovery, and enforces the Hermitian symmetry of the SCM for physical plausibility, ultimately achieving high distributional fidelity while accelerating inference 25-fold over DDPMs.

链接: https://arxiv.org/abs/2601.06156
作者: Ziyu Huang,Yong Zeng,Shen Fu,Xiaoli Xu,Hongyang Du
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The efficient construction of accurate channel knowledge maps (CKMs) is crucial for unleashing the full potential of environment-aware wireless networks, yet it remains a difficult ill-posed problem due to the sparsity of available location-specific channel knowledge data. Although diffusion-based methods such as denoising diffusion probabilistic models (DDPMs) have been exploited for CKM construction, they rely on iterative stochastic sampling, rendering them too slow for real-time wireless applications. To bridge the gap between high fidelity and efficient CKM construction, this letter introduces a novel framework based on linear transport guided flow matching (LT-GFM). Deviating from the noise-removal paradigm of diffusion models, our approach models the CKM generation process as a deterministic ordinary differential equation (ODE) that follows linear optimal transport paths, thereby drastically reducing the number of required inference steps. We propose a unified architecture that is applicable not only to conventional channel gain map (CGM) construction, but also to the more challenging spatial correlation map (SCM) construction. To achieve physics-informed CKM construction, we integrate environmental semantics (e.g., building masks) for edge recovery and enforce Hermitian symmetry as a physical property of the SCM. Simulation results verify that LT-GFM achieves superior distributional fidelity with significantly lower Fréchet Inception Distance (FID) and accelerates inference speed by a factor of 25 compared to DDPMs.
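The linear-transport flow-matching objective itself is standard and easy to sketch: sample a time t, interpolate linearly between noise and data, and regress the model's velocity onto x1 - x0. Tensor shapes and the `model(x, t)` signature are assumptions; the paper's guidance terms, building-mask conditioning, and Hermitian-symmetry handling are omitted.

```python
import torch

def linear_flow_matching_loss(model, x0, x1):
    """Regress the model's velocity onto the constant target x1 - x0 along
    the linear transport path x_t = (1 - t) * x0 + t * x1."""
    b = x0.size(0)
    t = torch.rand(b, 1, 1, 1, device=x0.device)   # one time per sample
    xt = (1 - t) * x0 + t * x1
    return ((model(xt, t.view(b)) - (x1 - x0)) ** 2).mean()

# x0: Gaussian noise, x1: ground-truth CKM tiles (shapes illustrative).
net = lambda x, t: x                               # stand-in for the real network
loss = linear_flow_matching_loss(net, torch.randn(2, 1, 16, 16),
                                 torch.randn(2, 1, 16, 16))
```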
zh

[AI-169] BotSim: Mitigating The Formation Of Conspiratorial Societies with Useful Bots

链接: https://arxiv.org/abs/2601.06154
作者: Lynnette Hui Xian Ng,Kathleen M. Carley
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Published in Journal of Artificial Societies and Social Simulation

点击查看摘要

[AI-170] Interoperability in AI Safety Governance: Ethics, Regulations, and Standards

【速读】:This paper addresses interoperability in AI safety governance across countries and regions, where regulatory fragmentation, insufficient global coordination, and limited Global South participation exacerbate risks and governance inefficiency. The key to the solution is a comparison of the ethical, legal, and technical frameworks of China, South Korea, Singapore, and the United Kingdom in three high-stakes domains (autonomous vehicles, education, and cross-border data flows), identifying points of policy convergence, divergence, and potential alignment, and proposing governance recommendations that are globally informed yet locally grounded, advancing actionable interoperability mechanisms consistent with the Global Digital Compact and relevant UN resolutions.

链接: https://arxiv.org/abs/2601.06153
作者: Yik Chan Chin,David A. Raho,Hag-Min Kim,Chunli Bi,James Ong,Jingbo Huang,Serge Stinckwich
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 122 pages

点击查看摘要

Abstract:This policy report draws on country studies from China, South Korea, Singapore, and the United Kingdom to identify effective tools and key barriers to interoperability in AI safety governance. It offers practical recommendations to support a globally informed yet locally grounded governance ecosystem. Interoperability is a central goal of AI governance, vital for reducing risks, fostering innovation, enhancing competitiveness, promoting standardization, and building public trust. However, structural gaps such as fragmented regulations and lack of global coordination, and conceptual gaps, including limited Global South engagement, continue to hinder progress. Focusing on three high-stakes domains - autonomous vehicles, education, and cross-border data flows - the report compares ethical, legal, and technical frameworks across the four countries. It identifies areas of convergence, divergence, and potential alignment, offering policy recommendations that support the development of interoperability mechanisms aligned with the Global Digital Compact and relevant UN resolutions. The analysis covers seven components: objectives, regulators, ethics, binding measures, targeted frameworks, technical standards, and key risks.
zh

[AI-171] HiMeS: Hippocampus-inspired Memory System for Personalized AI Assistants

【速读】:This paper addresses the limitations of conventional retrieval-augmented generation (RAG) pipelines in knowledge-intensive interactions with respect to user personalization and conversational coherence: limited memory capacity and weak coordination between retrieval and user-specific dialogue history lead to redundant clarification, irrelevant document recall, and degraded user experience. The key to the solution is the HiMeS architecture, which fuses short-term and long-term memory modules to model user context efficiently: a short-term memory extractor, trained end-to-end with reinforcement learning, compresses recent dialogue and proactively pre-retrieves documents from the knowledge base, emulating hippocampus-prefrontal cooperation; a partitioned long-term memory network stores user-specific information and re-ranks retrieved results, analogous to distributed cortical storage and memory reactivation, significantly improving question-answering quality and personalization.

链接: https://arxiv.org/abs/2601.06152
作者: Hailong Li,Feifei Li,Wenhui Que,Xingyu Fan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) power many interactive systems such as chatbots, customer-service agents, and personal assistants. In knowledge-intensive scenarios requiring user-specific personalization, conventional retrieval-augmented generation (RAG) pipelines exhibit limited memory capacity and insufficient coordination between retrieval mechanisms and user-specific conversational history, leading to redundant clarification, irrelevant documents, and degraded user experience. Inspired by the hippocampus-neocortex memory mechanism, we propose HiMeS, an AI-assistant architecture that fuses short-term and long-term memory. Our contributions are fourfold: (1) A short-term memory extractor is trained end-to-end with reinforcement learning to compress recent dialogue and proactively pre-retrieve documents from the knowledge base, emulating the cooperative interaction between the hippocampus and prefrontal cortex. (2) A partitioned long-term memory network stores user-specific information and re-ranks retrieved documents, simulating distributed cortical storage and memory reactivation. (3) On a real-world industrial dataset, HiMeS significantly outperforms a cascaded RAG baseline on question-answering quality. (4) Ablation studies confirm the necessity of both memory modules and suggest a practical path toward more reliable, context-aware, user-customized LLM-based assistants.
zh

[AI-172] A Foundation Model Approach for Fetal Stress Prediction During Labor From cardiotocography (CTG) recordings

【速读】:This paper addresses the performance ceiling that scarce labeled data imposes on deep learning for intrapartum cardiotocography (CTG) analysis, as well as the limited clinical predictive accuracy and inter-observer consistency of existing methods. The key to the solution is the first application of self-supervised pre-training to CTG analysis: masked-reconstruction pre-training on 2,444 hours of unlabeled CTG data with a channel-asymmetric masking scheme designed for fetal heart rate signals, followed by fine-tuning on the 552-recording CTU-UHB benchmark. The method substantially boosts performance, reaching an AUC of 0.83 on the full test set and 0.853 on uncomplicated vaginal deliveries, exceeding previously reported results (0.68-0.75); moreover, most false alarms correspond to CTG patterns judged concerning on retrospective clinical review, indicating clinical relevance.

链接: https://arxiv.org/abs/2601.06149
作者: Naomi Fridman,Berta Ben Shachar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intrapartum cardiotocography (CTG) is widely used for fetal monitoring during labor, yet its interpretation suffers from high inter-observer variability and limited predictive accuracy. Deep learning approaches have been constrained by the scarcity of CTG recordings with clinical outcome labels. We present the first application of self-supervised pre-training to intrapartum CTG analysis, leveraging 2,444 hours of unlabeled recordings for masked pre-training followed by fine-tuning on the 552-recording CTU-UHB benchmark. Using a PatchTST transformer architecture with a channel-asymmetric masking scheme designed for fetal heart rate reconstruction, we achieve an area under the receiver operating characteristic curve of 0.83 on the full test set and 0.853 on uncomplicated vaginal deliveries, exceeding previously reported results on this benchmark (0.68-0.75). Error analysis reveals that false-positive alerts typically correspond to CTG patterns judged concerning on retrospective clinical review, suggesting clinically meaningful predictions even when umbilical pH is normal. We release standardized dataset splits and model weights to enable reproducible benchmarking. Our results demonstrate that self-supervised pre-training can address data scarcity in fetal monitoring, offering a path toward reliable decision support in the labor room.
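A minimal sketch of a channel-asymmetric masking scheme, assuming (batch, channels, time) tensors with the fetal-heart-rate channel at index 0 and a time axis divisible by the patch length: the FHR channel is masked at a higher ratio than the auxiliary channel(s). The ratios and patch size are illustrative, not the paper's settings.

```python
import torch

def channel_asymmetric_mask(x, fhr_ratio=0.4, other_ratio=0.15, patch=16):
    """Mask non-overlapping patches per channel, hiding more of the fetal
    heart rate (FHR) channel so reconstruction focuses on it.
    x: (batch, channels, time), channel 0 = FHR, time divisible by patch."""
    b, c, t = x.shape
    n = t // patch
    ratios = torch.full((c,), other_ratio)
    ratios[0] = fhr_ratio
    mask = torch.rand(b, c, n) < ratios.view(1, c, 1)   # True = masked patch
    full = mask.repeat_interleave(patch, dim=-1)        # expand to sample level
    return x.masked_fill(full, 0.0), full

x = torch.randn(2, 2, 64)            # FHR channel + uterine-contraction channel
masked, mask = channel_asymmetric_mask(x)
```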
zh

[AI-173] Bridging the AI divide in sub-Saharan Africa: Challenges and opportunities for inclusivity

链接: https://arxiv.org/abs/2601.06145
作者: Masike Malatji
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 9 pages; 4 tables; 1 figure

点击查看摘要

[AI-174] The Patient/Industry Trade-off in Medical Artificial Intelligence

【速读】:This paper examines the potential conflict between patient interests and industry interests created by the predominantly private-sector funding of current medical artificial intelligence (AI) research, focusing on barriers to translating AI models into clinical practice. The core problems are that most AI studies lack clinically relevant metrics, clinical trials, and longitudinal validation, and involve patients and physicians too little during development, impeding effective adoption. The key to the solution is building sustainable academia-industry collaboration along three lines: improving the transparency and explainability of AI models, choosing industry partners that center patient benefit, and prioritizing overall healthcare benefit over purely commercial value, thereby accelerating clinically meaningful, patient-serving AI applications.

链接: https://arxiv.org/abs/2601.06144
作者: Rina Khan,Annabelle Sauve,Imaan Bayoumi,Amber L. Simpson,Catherine Stinson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Artificial intelligence (AI) in healthcare has led to many promising developments; however, AI research is increasingly funded by the private sector, leading to potential trade-offs between benefits to patients and benefits to industry. Health AI practitioners should prioritize successful adaptation into clinical practice in order to provide meaningful benefits to patients, but translation usually requires collaboration with industry. We discuss three features of AI studies that hamper the integration of AI into clinical practice from the perspective of researchers and clinicians. These include lack of clinically relevant metrics, lack of clinical trials and longitudinal studies to validate results, and lack of patient and physician involvement in the development process. For partnerships between industry and health research to be sustainable, a balance must be established between patient and industry benefit. We propose three approaches for addressing this gap: improved transparency and explainability of AI models, fostering relationships with industry partners that have a reputation for centering patient benefit in their practices, and prioritization of overall healthcare benefits. With these priorities, we can sooner realize meaningful AI technologies used by clinicians, where mutual benefit for patients and industry is achieved.
zh

[AI-175] RainBalance: Alleviating Dual Imbalance in GNSS-based Precipitation Nowcasting via Continuous Probability Modeling

链接: https://arxiv.org/abs/2601.06137
作者: Yifang Zhang,Shengwu Xiong,Henan Wang,Wenjie Yin,Jiawang Peng,Duan Zhou,Yuqiang Zhang,Chen Zhou,Hua Chen,Qile Zhao,Pengfei Duan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11pages,6 figures

点击查看摘要

[AI-176] A Review of Online Diffusion Policy RL Algorithms for Scalable Robotic Control

链接: https://arxiv.org/abs/2601.06133
作者: Wonhyeok Choi,Minwoo Choi,Jungwan Woo,Kyumin Hwang,Jaeyeul Kim,Sunghoon Im
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

[AI-177] Graph-Based Analysis of AI-Driven Labor Market Transitions: Evidence from 10000 Egyptian Jobs and Policy Implications

【速读】:This paper addresses job displacement caused by automation, specifically assessing how many high-automation-risk workers in the Egyptian labor market can actually transition to safer jobs; the core problem is identifying feasible transition pathways and revealing structural barriers. The key to the solution is a validated knowledge graph of 9,978 job postings, 19,766 skill activities, and 84,346 job-skill relationships (0.74% error rate), with a "viable transition pathway" defined as a match sharing at least 3 skills with a skill-transfer rate of at least 50%. The study finds that only 24.4% of at-risk workers have such pathways; the remaining 75.6% face structural mobility barriers and need systematic reskilling rather than incremental upskilling. Process-oriented skills emerge as the highest-leverage intervention, appearing in 15.6% of viable pathways, underscoring the importance of actively designing career-transition pathways rather than relying on passive skill matching.

链接: https://arxiv.org/abs/2601.06129
作者: Ahmed Dawoud,Sondos Samir,Mahmoud Mohamed
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How many workers displaced by automation can realistically transition to safer jobs? We answer this using a validated knowledge graph of 9,978 Egyptian job postings, 19,766 skill activities, and 84,346 job-skill relationships (0.74% error rate). While 20.9% of jobs face high automation risk, we find that only 24.4% of at-risk workers have viable transition pathways, defined by $\geq 3$ shared skills and $\geq 50\%$ skill transfer. The remaining 75.6% face a structural mobility barrier requiring comprehensive reskilling, not incremental upskilling. Among 4,534 feasible transitions, process-oriented skills emerge as the highest-leverage intervention, appearing in 15.6% of pathways. These findings challenge optimistic narratives of seamless workforce adaptation and demonstrate that emerging economies require active pathway creation, not passive skill matching.
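The viability rule is simple enough to state in code; the sketch below assumes "skill transfer" means the share of the target job's skills already held, which is our reading rather than the paper's stated definition, and the jobs are hypothetical.

```python
def viable_transition(source_skills: set, target_skills: set) -> bool:
    """Viability rule from the abstract: >= 3 shared skills and >= 50% transfer."""
    shared = source_skills & target_skills
    transfer = len(shared) / len(target_skills) if target_skills else 0.0
    return len(shared) >= 3 and transfer >= 0.5

# Hypothetical jobs: a data-entry clerk exploring a junior analyst role.
clerk = {"spreadsheets", "data entry", "filing", "reporting"}
analyst = {"spreadsheets", "reporting", "sql", "data entry"}
print(viable_transition(clerk, analyst))   # True: 3 shared skills, 75% coverage
```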
zh

[AI-178] AIS-CycleGen: A CycleGAN-Based Framework for High-Fidelity Synthetic AIS Data Generation and Augmentation

【速读】:This paper addresses the domain shift, data sparsity, and class imbalance that Automatic Identification System (AIS) data face in practice, which severely constrain predictive models. The key to the solution is AISCycleGen, a robust data augmentation method based on Cycle-Consistent Generative Adversarial Networks (CycleGAN) that innovatively uses unpaired domain translation to generate high-fidelity synthetic AIS sequences without paired source-target data; a 1D convolutional generator with adaptive noise injection effectively preserves the spatiotemporal structure of AIS trajectories, markedly improving the diversity and realism of generated data and, in turn, the generalization and predictive accuracy of downstream models across multiple maritime scenarios.

链接: https://arxiv.org/abs/2601.06127
作者: SM Ashfaq uz Zaman,Faizan Qamar,Masnizah Mohd,Nur Hanis Sabrina Suhaimi,Amith Khandakar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 16 figures

点击查看摘要

Abstract:Automatic Identification System (AIS) data are vital for maritime domain awareness, yet they often suffer from domain shifts, data sparsity, and class imbalance, which hinder the performance of predictive models. In this paper, we propose a robust data augmentation method, AISCycleGen, based on Cycle-Consistent Generative Adversarial Networks (CycleGAN), which is tailored for AIS datasets. Unlike traditional methods, AISCycleGen leverages unpaired domain translation to generate high-fidelity synthetic AIS data sequences without requiring paired source-target data. The framework employs a 1D convolutional generator with adaptive noise injection to preserve the spatiotemporal structure of AIS trajectories, enhancing the diversity and realism of the generated data. To demonstrate its efficacy, we apply AISCycleGen to several baseline regression models, showing improvements in performance across various maritime domains. The results indicate that AISCycleGen outperforms contemporary GAN-based augmentation techniques, achieving a PSNR value of 30.5 and an FID score of 38.9. These findings underscore AISCycleGen’s potential as an effective and generalizable solution for augmenting AIS datasets, improving downstream model performance in real-world maritime intelligence applications.
zh

[AI-179] NL2Dashboard: A Lightweight and Controllable Framework for Generating Dashboards with LLM s

链接: https://arxiv.org/abs/2601.06126
作者: Boshen Shi,Kexin Yang,Yuanbo Yang,Guanguang Chang,Ce Chi,Zhendong Wang,Xing Wang,Junlan Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-180] Latent Space Communication via K-V Cache Alignment

链接: https://arxiv.org/abs/2601.06123
作者: Lucio M. Dery,Zohar Yahav,Henry Prior,Qixuan Feng,Jiajun Shen,Arthur Szlam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures, 4 tables

点击查看摘要

[AI-181] Prompt Engineering for Responsible Generative AI Use in African Education: A Report from a Three-Day Training Series

链接: https://arxiv.org/abs/2601.06121
作者: Benjamin Quarshie,Vanessa Willemse,Macharious Nabang,Bismark Nyaaba Akanzire,Patrick Kyeremeh,Saeed Maigari,Dorcas Adomina,Ellen Kwarteng,Eric Kojo Majialuwe,Craig Gibbs,Jerry Etornam Kudaya,Sechaba Koma,Matthew Nyaaba
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-182] L2CU: Learning to Complement Unseen Users

【速读】:This paper addresses the difficulty of generalizing existing learn-to-complement approaches to unseen users in human-AI cooperative classification. Current methods typically rely on a single global user model and ignore individual differences in annotation, leading to suboptimal cooperation. The key to the solution is the L2CU framework, which identifies representative annotator profiles from sparse, noisy user annotations and matches unseen users to these profiles, invoking targeted profile-specific complementary models to raise joint accuracy and adapt effectively to unknown users.

链接: https://arxiv.org/abs/2601.06119
作者: Dileepa Pitawela,Gustavo Carneiro,Hsiang-Ting Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Published in IEEE Access ( this https URL )

点击查看摘要

Abstract:Recent research highlights the potential of machine learning models to learn to complement (L2C) human strengths; however, generalizing this capability to unseen users remains a significant challenge. Existing L2C methods oversimplify interaction between human and AI by relying on a single, global user model that neglects individual user variability, leading to suboptimal cooperative performance. Addressing this, we introduce L2CU, a novel L2C framework for human-AI cooperative classification with unseen users. Given sparse and noisy user annotations, L2CU identifies representative annotator profiles capturing distinct labeling patterns. By matching unseen users to these profiles, L2CU leverages profile-specific models to complement the user and achieve superior joint accuracy. We evaluate L2CU on datasets (CIFAR-10N, CIFAR-10H, Fashion-MNIST-H, Chaoyang and AgNews), demonstrating its effectiveness as a model-agnostic solution for improving human-AI cooperative classification.
zh

[AI-183] Beyond Reproducibility: Token Probabilities Expose Large Language Model Nondeterminism

【速读】:This paper addresses the nondeterminism of large language models (LLMs) executed on graphics processing units (GPUs): even when configured to be deterministic, results can vary because of finite-precision floating-point arithmetic and the scheduling of concurrent processes. The key to the solution is shifting the analysis from generated text to token-probability variation: nondeterminism has a significant effect on probabilities in the 0.1-0.9 range but a much smaller one near 0 or 1. This finding implies that the impact of nondeterminism can be estimated from the token probability distribution of a single inference run, without repeating the run many times to observe text differences, opening a new path to quantifying and mitigating nondeterminism.

链接: https://arxiv.org/abs/2601.06118
作者: Tairan Fu,Gonzalo Martínez,Javier Conde,Carlos Arriaga,Pedro Reviriego,Xiuyuan Qi,Shanshan Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The execution of Large Language Models (LLMs) has been shown to produce nondeterministic results when run on Graphics Processing Units (GPUs), even when they are configured to produce deterministic results. This is due to the finite precision effects of the arithmetic operations, which depend on the order in which they are executed. This order, in turn, depends on the processes that are running concurrently on the GPU. Previous studies have focused on the impact of nondeterminism on the text generated by the LLMs or on proposing mechanisms to achieve deterministic execution. This work takes a closer look at nondeterminism by analyzing the variations in the token probabilities, not in the generated text. Interestingly, all the models evaluated have similar results in both the trends and the actual values of the variations of the probabilities. In particular, the results show that the effects of nondeterminism are significant for token probabilities that are in the range of 0.1 to 0.9, while they are much smaller when the probabilities are close to 0 or 1. This has significant implications for our understanding of nondeterminism. The first is that nondeterminism will likely have a non-negligible impact on generated text when the temperature is not zero, as it introduces significant variations in the token probabilities except when they are close to 0 or 1. Secondly, it suggests that all models have similar nondeterministic variations at the token probability level. Therefore, different variations in the performance of the generated text, for example, when measuring accuracy on a benchmark, seem to come from different token probabilities or response lengths. A third implication is that we may be able to estimate the impact of nondeterminism by running a single inference and analyzing the token-level probabilities, instead of having to run the same inference many times.
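The third implication suggests a cheap single-run diagnostic; here is a minimal sketch under our own reading of it: count how many emitted tokens have probabilities inside the sensitive 0.1-0.9 band. The band edges follow the abstract; the function and example values are illustrative.

```python
def sensitive_fraction(token_probs, lo=0.1, hi=0.9):
    """Single-run heuristic: fraction of emitted tokens whose probability lies
    in the band where run-to-run variation is reported to be significant."""
    in_band = sum(lo <= p <= hi for p in token_probs)
    return in_band / len(token_probs)

probs = [0.99, 0.97, 0.55, 0.62, 0.93, 0.31, 0.88, 0.999]
print(f"{sensitive_fraction(probs):.2f} of tokens fall in the 0.1-0.9 band")
```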
zh

[AI-184] Dreaming Is Not a Bug: A Jung-Inspired Dream Layer for Multi-Agent LLM Companions

链接: https://arxiv.org/abs/2601.06115
作者: V. Cheung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint, 35 pages (5 pages of appendix), 2 figures, 3 tables. Conceptual and architectural proposal with preliminary simulation results

点击查看摘要

[AI-185] GroupSegment-SHAP: Shapley Value Explanations with Group-Segment Players for Multivariate Time Series

【速读】:This paper addresses how cross-variable interactions and temporal dynamics act jointly in multivariate time-series models, which existing explanation methods (e.g., SHAP) miss by treating the feature and time axes independently, failing to capture structural signals formed jointly by multiple variables over specific intervals. The key to the solution is GroupSegment SHAP (GS-SHAP), whose core innovation is to construct explanatory units as group-segment players based on cross-variable dependence and distribution shifts over time, and to quantify each group-segment's contribution to the prediction via Shapley values, thereby preserving the joint multivariate-temporal structure more completely.

链接: https://arxiv.org/abs/2601.06114
作者: Jinwoong Kim,Sangjin Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 12 pages

点击查看摘要

Abstract:Multivariate time-series models achieve strong predictive performance in healthcare, industry, energy, and finance, but how they combine cross-variable interactions with temporal dynamics remains unclear. SHapley Additive exPlanations (SHAP) are widely used for interpretation. However, existing time-series variants typically treat the feature and time axes independently, fragmenting structural signals formed jointly by multiple variables over specific intervals. We propose GroupSegment SHAP (GS-SHAP), which constructs explanatory units as group-segment players based on cross-variable dependence and distribution shifts over time, and then quantifies each unit’s contribution via Shapley attribution. We evaluate GS-SHAP across four real-world domains: human activity recognition, power-system forecasting, medical signal analysis, and financial time series, and compare it with KernelSHAP, TimeSHAP, SequenceSHAP, WindowSHAP, and TSHAP. GS-SHAP improves deletion-based faithfulness (DeltaAUC) by about 1.7x on average over time-series SHAP baselines, while reducing wall-clock runtime by about 40 percent on average under matched perturbation budgets. A financial case study shows that GS-SHAP identifies interpretable multivariate-temporal interactions among key market variables during high-volatility regimes.
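Once the group-segment players are formed, attribution reduces to ordinary Shapley estimation over those units; below is a generic Monte Carlo permutation sketch with a toy value function. The player names and value function are invented; GS-SHAP's actual grouping (dependence- and drift-based) and perturbation scheme are not shown.

```python
import random

def shapley_group_segments(value_fn, players, n_perm=200, seed=0):
    """Monte Carlo permutation estimate of Shapley values, where each player
    is a (variable-group, time-segment) unit rather than a single feature."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_perm):
        order = players[:]
        rng.shuffle(order)
        coalition, prev = set(), value_fn(frozenset())
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            phi[p] += cur - prev          # marginal contribution of p
            prev = cur
    return {p: v / n_perm for p, v in phi.items()}

# Toy value function over two group-segment players (names illustrative).
players = [("vitals", "0-30min"), ("labs", "30-60min")]
v = lambda s: 0.6 * (players[0] in s) + 0.3 * (players[1] in s) + 0.1 * (len(s) == 2)
print(shapley_group_segments(v, players))
```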
zh

[AI-186] Towards Infinite Length Extrapolation: A Unified Approach

【速读】:This paper addresses the performance degradation of large language models (LLMs) on long sequences caused by the context-window limit during training; existing length-extrapolation methods commonly suffer performance loss or computational inefficiency. The key to the solution is a unified framework that reinterprets positional-encoding methods as a decomposition of the attention score into a multiplicative transformation and an additive bias, exposing the inherent limitations of relative position embeddings and attention-bias approaches in modeling long-range dependencies. Building on this framework, the paper further introduces Adaptive Positional Encoding (APE), which, through adaptive frequency modulation and a decay bias fusing linear, logarithmic, and square-root terms, achieves effective extrapolation to unbounded context lengths, with theoretical guarantees that softmax normalization remains well-defined over unbounded sequences while preserving long-range correlations, entropy boundedness, and gradient positional sensitivity.

链接: https://arxiv.org/abs/2601.06113
作者: Nitin Vetcha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing, but their ability to process long sequences is fundamentally limited by the context window size during training. Existing length extrapolation methods often suffer from performance degradation or computational inefficiencies. We thereby use a unified framework that reinterprets positional encoding methods as a decomposition of the attention score into a multiplicative transformation and an additive bias. This perspective not only subsumes popular approaches such as relative position embeddings and attention-bias moderated approaches but also exposes their inherent limitations in handling long-range dependencies. To address these shortcomings, motivated by our framework, we introduce Adaptive Positional Encoding (APE), which leverages adaptive frequency modulation and an intricately designed decay bias that incorporates linear, logarithmic, and square-root terms. Our theoretical analysis establishes conditions for infinite-context extrapolation, ensuring that the softmax normalization remains well-defined over unbounded sequences while preserving long-distance correlations, entropy boundedness and gradient positional sensitivity. We substantiate our claims with an experimental case study on the TinyStories dataset as well as a new synthetic dataset, *Long Tiny Stories*, featuring stories up to 32,000 words. Relevant code, dataset and model weights are available at this https URL.
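To illustrate the shape of the proposed decay bias (not its exact definition), the sketch below mixes linear, logarithmic, and square-root terms of token distance into a monotone penalty added to attention logits; the coefficients are illustrative assumptions.

```python
import numpy as np

def ape_style_decay_bias(max_len: int, a=0.05, b=0.3, c=0.1):
    """Monotone penalty over token distance d mixing linear, logarithmic,
    and square-root terms; added to attention logits before softmax."""
    d = np.arange(max_len, dtype=np.float64)
    return -(a * d + b * np.log1p(d) + c * np.sqrt(d))

bias = ape_style_decay_bias(8)
# The linear term keeps softmax normalizable on unbounded sequences, while
# the slow log/sqrt terms avoid crushing long-range correlations too quickly.
```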
zh

[AI-187] ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions

【速读】:This paper addresses the lack of reliability evaluation for tool-using large language model (LLM) agents in real deployments: existing benchmarks mainly report single-run success rates, ignoring the stability, robustness, and fault tolerance required in production. The key to the solution is ReliabilityBench, a new benchmark that systematically evaluates agent reliability along three dimensions: (i) consistency under repeated execution (measured by $ \text{pass}^k $), (ii) robustness to semantically equivalent task perturbations at intensity $ \epsilon $, and (iii) fault tolerance under controlled tool/API failures at intensity $ \lambda $. The framework introduces a unified reliability surface $ R(k,\epsilon,\lambda) $, action metamorphic relations that judge correctness by end-state equivalence rather than text similarity, and a chaos-engineering-style fault-injection mechanism (timeouts, rate limits, partial responses, schema drift), enabling quantitative evaluation and comparison of LLM agent reliability in complex, realistic scenarios.

链接: https://arxiv.org/abs/2601.06112
作者: Aayush Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures, 8 tables. Evaluates ReAct vs Reflexion across four tool-using domains with perturbation (epsilon) and fault-injection (lambda) stress testing; 1,280 total episodes

点击查看摘要

Abstract:Existing benchmarks for tool-using LLM agents primarily report single-run success rates and miss reliability properties required in production. We introduce ReliabilityBench, a benchmark for evaluating agent reliability across three dimensions: (i) consistency under repeated execution using $\mathrm{pass}^k$, (ii) robustness to semantically equivalent task perturbations at intensity $\epsilon$, and (iii) fault tolerance under controlled tool/API failures at intensity $\lambda$. ReliabilityBench contributes a unified reliability surface $R(k,\epsilon,\lambda)$, *action metamorphic relations* that define correctness via end-state equivalence rather than text similarity, and a chaos-engineering-style fault injection framework (timeouts, rate limits, partial responses, schema drift). We evaluate two models (Gemini 2.0 Flash, GPT-4o) and two agent architectures (ReAct, Reflexion) across four domains (scheduling, travel, customer support, e-commerce) over 1,280 episodes. Perturbations alone reduce success from 96.9% at $\epsilon=0$ to 88.1% at $\epsilon=0.2$. Rate limiting is the most damaging fault in ablations. ReAct is more robust than Reflexion under combined stress, and Gemini 2.0 Flash achieves comparable reliability to GPT-4o at much lower cost. ReliabilityBench provides a systematic framework for assessing production readiness of LLM agents.
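The $\mathrm{pass}^k$ consistency metric is easy to compute from per-task success statistics; a small sketch (note it asks that all k runs succeed, unlike pass@k's at-least-one convention):

```python
def pass_k(successes: int, runs: int, k: int) -> float:
    """pass^k: probability that k independent runs of the same task ALL
    succeed, estimated from an empirical per-run success rate (unlike
    pass@k, which asks for at least one success among k runs)."""
    p = successes / runs
    return p ** k

# A 96.9% single-run success rate drops noticeably once all runs must pass:
print(pass_k(successes=969, runs=1000, k=5))   # ~0.854
```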
zh

[AI-188] LLM-Powered Social Digital Twins: A Framework for Simulating Population Behavioral Response to Policy Interventions

【速读】:This paper addresses a core challenge in computational social science and public policy: accurately predicting how populations respond to policy interventions. Traditional approaches rely on statistical models over historical data, which capture correlations but lack mechanistic interpretability and struggle with novel policy scenarios. The key to the solution is a framework for Social Digital Twins in which a large language model (LLM) serves as the cognitive engine of each individual agent: every agent, characterized by demographic and psychographic attributes, receives policy signals and outputs a multi-dimensional behavioral probability vector, and a calibration layer maps agent-level responses to observable population-level metrics, enabling validation against real data and counterfactual policy analysis. Applied to COVID-19 pandemic response, the framework reduces error by 20.7% on average over gradient boosting baselines across six behavioral categories, and behavioral responses to policy changes are monotonic and bounded, confirming behavioral plausibility.

链接: https://arxiv.org/abs/2601.06111
作者: Aayush Gupta,Farahan Raza Sheikh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Predicting how populations respond to policy interventions is a fundamental challenge in computational social science and public policy. Traditional approaches rely on aggregate statistical models that capture historical correlations but lack mechanistic interpretability and struggle with novel policy scenarios. We present a general framework for constructing Social Digital Twins - virtual population replicas where Large Language Models (LLMs) serve as cognitive engines for individual agents. Each agent, characterized by demographic and psychographic attributes, receives policy signals and outputs multi-dimensional behavioral probability vectors. A calibration layer maps aggregated agent responses to observable population-level metrics, enabling validation against real-world data and deployment for counterfactual policy analysis. We instantiate this framework in the domain of pandemic response, using COVID-19 as a case study with rich observational data. On a held-out test period, our calibrated digital twin achieves a 20.7% improvement in macro-averaged prediction error over gradient boosting baselines across six behavioral categories. Counterfactual experiments demonstrate monotonic and bounded responses to policy variations, establishing behavioral plausibility. The framework is domain-agnostic: the same architecture applies to transportation policy, economic interventions, environmental regulations, or any setting where policy affects population behavior. We discuss implications for policy simulation, limitations of the approach, and directions for extending LLM-based digital twins beyond pandemic response.
zh

[AI-189] CBMAS: Cognitive Behavioral Modeling via Activation Steering NEURIPS2025

【速读】:This paper addresses the unpredictable way large language models (LLMs) encode cognitive behaviors across prompts, layers, and contexts, which makes models hard to diagnose and control. The key to the solution is the CBMAS framework, a diagnostic tool for continuous activation steering that combines steering-vector construction, dense $\alpha$-sweeps, logit-lens-based bias curves, and layer-site sensitivity analysis. It can reveal tipping points where small intervention strengths flip model behavior and show interpretable trajectories of steering effects across layer depth, bridging high-level behavioral evaluation and low-level representational dynamics and improving the cognitive interpretability of LLMs.

链接: https://arxiv.org/abs/2601.06109
作者: Ahmed H. Ismail,Anthony Kuang,Ayo Akinkugbe,Kevin Zhu,Sean O’Brien
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CogInterp @ NeurIPS 2025. Equal contribution by Ahmed H. Ismail and Anthony Kuang

点击查看摘要

Abstract:Large language models (LLMs) often encode cognitive behaviors unpredictably across prompts, layers, and contexts, making them difficult to diagnose and control. We present CBMAS, a diagnostic framework for continuous activation steering, which extends cognitive bias analysis from discrete before/after interventions to interpretable trajectories. By combining steering vector construction with dense $\alpha$-sweeps, logit lens-based bias curves, and layer-site sensitivity analysis, our approach can reveal tipping points where small intervention strengths flip model behavior and show how steering effects evolve across layer depth. We argue that these continuous diagnostics offer a bridge between high-level behavioral evaluation and low-level representational dynamics, contributing to the cognitive interpretability of LLMs. Lastly, we provide a CLI and datasets for various cognitive behaviors at the project repository, this https URL.
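A minimal sketch of the underlying intervention: continuous activation steering adds a scaled steering vector to a hidden state, and a dense alpha-sweep re-evaluates behavior at each strength. The dimensions and the unit-norm vector are assumptions; CBMAS's bias curves and logit-lens readouts are not reproduced.

```python
import torch

def steer(hidden: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Continuous activation steering: add a scaled steering vector to the
    activation at a chosen layer site."""
    return hidden + alpha * v

hidden = torch.randn(1, 4096)                 # activation at the steered site
v = torch.randn(4096)
v = v / v.norm()                              # unit-norm steering direction
# Dense alpha-sweep: evaluate behavior at each strength to locate tipping points.
sweep = [steer(hidden, v, float(a)) for a in torch.linspace(-2.0, 2.0, 21)]
```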
zh

[AI-190] The Impact of Post-training on Data Contamination

链接: https://arxiv.org/abs/2601.06103
作者: Muhammed Yusuf Kocyigit,Caglar Yildirim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-191] Dynamic Intelligence Ceilings: Measuring Long-Horizon Limits of Planning and Creativity in Artificial Systems

【速读】:This paper tackles the premature fixation of the performance frontier in long-horizon AI development, where systems tend to converge on repetitive solution patterns instead of sustained growth in intelligence. The core difficulty is that existing evaluations treat intelligence as a static metric, ignoring how a system's capability boundary evolves over time under resource, structural, and intent constraints. The key to the solution is the concept of a Dynamic Intelligence Ceiling (DIC) and a trajectory-centric evaluation framework with two operational estimators: the Progressive Difficulty Ceiling (PDC), which measures the maximal reliably solvable difficulty, and the Ceiling Drift Rate (CDR), which quantifies how the intelligence frontier moves and evolves. Rather than positing unbounded intelligence, the framework reframes limits as dynamic, trajectory-dependent properties, more faithfully characterizing a system's potential for long-horizon planning and structural creativity.

链接: https://arxiv.org/abs/2601.06102
作者: Truong Xuan Khanh,Truong Quynh Hoa
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper introduces a trajectory-centric evaluation framework for analyzing long-horizon intelligence limits in artificial systems, focusing on developmental behavior, planning, and structural creativity rather than proposing new learning algorithms. 11 pages, 2 figures

点击查看摘要

Abstract:Recent advances in artificial intelligence have produced systems capable of remarkable performance across a wide range of tasks. These gains, however, are increasingly accompanied by concerns regarding long-horizon developmental behavior, as many systems converge toward repetitive solution patterns rather than sustained growth. We argue that a central limitation of contemporary AI systems lies not in capability per se, but in the premature fixation of their performance frontier. To address this issue, we introduce the concept of a *Dynamic Intelligence Ceiling* (DIC), defined as the highest level of effective intelligence attainable by a system at a given time under its current resources, internal intent, and structural configuration. To make this notion empirically tractable, we propose a trajectory-centric evaluation framework that measures intelligence as a moving frontier rather than a static snapshot. We operationalize DIC using two estimators: the *Progressive Difficulty Ceiling* (PDC), which captures the maximal reliably solvable difficulty under constrained resources, and the *Ceiling Drift Rate* (CDR), which quantifies the temporal evolution of this frontier. These estimators are instantiated through a procedurally generated benchmark that jointly evaluates long-horizon planning and structural creativity within a single controlled environment. Our results reveal a qualitative distinction between systems that deepen exploitation within a fixed solution manifold and those that sustain frontier expansion over time. Importantly, our framework does not posit unbounded intelligence, but reframes limits as dynamic and trajectory-dependent rather than static and prematurely fixed. Keywords: AI evaluation, planning and creativity, developmental intelligence, dynamic intelligence ceilings, complex adaptive systems
zh

[AI-192] How to Assess AI Literacy: Misalignment Between Self-Reported and Objective-Based Measures

链接: https://arxiv.org/abs/2601.06101
作者: Shan Zhang,Ruiwei Xiao,Anthony F. Botelho,Guanze Liao,Thomas K.F. Chiu,John Stamper,Kenneth R. Koedinger
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures, LAK2026

点击查看摘要

[AI-193] Automatic Question Generation for Intuitive Learning Utilizing Causal Graph Guided Chain of Thought Reasoning

链接: https://arxiv.org/abs/2601.06098
作者: Nicholas X. Wang,Neel V. Parpia,Aaryan D. Parikh,Aggelos K. Katsaggelos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-194] Deep Q-Network Based Resilient Drone Communication: Neutralizing First-Order Markov Jammers

链接: https://arxiv.org/abs/2601.06095
作者: Andrii Grekhov,Volodymyr Kharchenko,Vasyl Kondratiuk
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures

点击查看摘要

[AI-195] GenAITEd Ghana: A Blueprint Prototype for Context-Aware and Region-Specific Conversational AI Agent for Teacher Education

链接: https://arxiv.org/abs/2601.06093
作者: Matthew Nyaaba,Patrick Kyeremeh,Macharious Nabang,Bismark Nyaaba Akanzire,Cyril Ababio Titty,Jerry Etornam Kudaya,Sakina Acquah
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-196] Islamic Chatbots in the Age of Large Language Models NEURIPS2025

【速读】:This paper examines how generative AI-powered Islamic chatbots are reshaping religious authority, pedagogy, and everyday practice in Muslim communities, with particular attention to the risk of diluting authority even as access to religious knowledge improves. The key to the solution is a systematic analysis of the ethical and social challenges raised by such technology, together with principles for responsible design that balance the democratization of knowledge with the preservation of traditional religious authority, ensuring the technology evolves in line with the values and needs of Muslim communities.

链接: https://arxiv.org/abs/2601.06092
作者: Muhammad Aurangzeb Ahmad
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Muslim in ML Workshop at NeurIPS 2025

点击查看摘要

Abstract:Large Language Models (LLMs) are rapidly transforming how communities access, interpret, and circulate knowledge, and religious communities are no exception. Chatbots powered by LLMs are beginning to reshape authority, pedagogy, and everyday religious practice in Muslim communities. We analyze the landscape of LLM-powered Islamic chatbots and how they are transforming Islamic religious practice, e.g., democratizing access to religious knowledge while also risking an erosion of traditional authority. We discuss the challenges these systems raise for Muslim communities and explore recommendations for the responsible design of these systems.
zh

[AI-197] One if by Land, Two if by Sea, Three if by Four Seas, and More to Come – Values of Perception, Prediction, Communication, and Common Sense in Decision Making

【速读】:This paper aims to rigorously define the values of perception, prediction, communication, and common sense in decision making, to guide the design and optimization of autonomous decision-making systems. The key to the solution is a set of quantitative measures with decision-theoretic foundations that nonetheless have information-theoretic analogues: they share some basic mathematical properties with Shannon entropy and mutual information and reduce to those classical quantities in particular settings. One notable finding is that the value of perception without prediction can be negative, while the value of perception combined with prediction, and the value of prediction alone, are always nonnegative, providing a theoretical basis for practical questions such as assessing the importance of information sources and the optimal order of observations.

链接: https://arxiv.org/abs/2601.06077
作者: Aolin Xu
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:This work aims to rigorously define the values of perception, prediction, communication, and common sense in decision making. The defined quantities are decision-theoretic, but have information-theoretic analogues, e.g., they share some simple but key mathematical properties with Shannon entropy and mutual information, and can reduce to these quantities in particular settings. One interesting observation is that, the value of perception without prediction can be negative, while the value of perception together with prediction and the value of prediction alone are always nonnegative. The defined quantities suggest answers to practical questions arising in the design of autonomous decision-making systems. Example questions include: Do we need to observe and predict the behavior of a particular agent? How important is it? What is the best order to observe and predict the agents? The defined quantities may also provide insights to cognitive science and neural science, toward the understanding of how natural decision makers make use of information gained from different sources and operations.
zh

[AI-198] Jamming Detection in Cell-Free MIMO with Dynamic Graphs

【速读】:This paper addresses the difficulty of detecting jamming attacks in cell-free massive MIMO systems, whose distributed access points and user equipments (UEs) form complex, time-varying topologies. The key to the solution is a detection framework built on dynamic graph modeling: a model combining graph convolutional networks (GCNs) with a Transformer is trained via supervised learning to produce graph embeddings that identify anomalies in the evolution of communication links, enabling effective detection of malicious interference.

链接: https://arxiv.org/abs/2601.06075
作者: Ali Hossary,Laura Crosara,Stefano Tomasin
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Jamming attacks pose a critical threat to wireless networks, particularly in cell-free massive MIMO systems, where distributed access points and user equipment (UE) create complex, time-varying topologies. This paper proposes a novel jamming detection framework leveraging dynamic graphs and graph convolutional neural networks (GCN) to address this challenge. By modeling the network as a dynamic graph, we capture evolving communication links and detect jamming attacks as anomalies in the graph evolution. A GCN-Transformer-based model, trained with supervised learning, learns graph embeddings to identify malicious interference. Performance evaluation in simulated scenarios with moving UEs, varying jamming conditions, and channel fading demonstrates the method's effectiveness, which is assessed through accuracy and F1 score metrics, achieving promising results for effective jamming detection.
zh
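
以下给出“逐帧 GCN 嵌入 + Transformer 时序聚合”这类检测器的最小 PyTorch 骨架(纯结构示意:层数、维度、池化与读出方式均为假设,与论文实现无关),展示动态图快照序列如何被映射为干扰概率:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, A, X):
        # 对称归一化:D^{-1/2} (A + I) D^{-1/2}
        A_hat = A + torch.eye(A.size(0))
        d = A_hat.sum(1)
        D_inv_sqrt = torch.diag(d.pow(-0.5))
        return torch.relu(self.lin(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X))

class JammingDetector(nn.Module):
    def __init__(self, feat_dim=8, hid=32):
        super().__init__()
        self.gcn = GCNLayer(feat_dim, hid)
        enc = nn.TransformerEncoderLayer(d_model=hid, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(hid, 1)

    def forward(self, As, Xs):
        # As/Xs:长度为 T 的邻接矩阵与节点特征列表(一段图快照序列)
        embs = [self.gcn(A, X).mean(0) for A, X in zip(As, Xs)]  # 每帧平均池化
        seq = torch.stack(embs).unsqueeze(0)                      # (1, T, hid)
        h = self.temporal(seq).mean(1)                            # 时序聚合
        return torch.sigmoid(self.head(h))                        # 干扰概率

T, N, F = 6, 10, 8
As = [(torch.rand(N, N) < 0.3).float() for _ in range(T)]
As = [((A + A.t()) > 0).float() for A in As]                      # 对称化
Xs = [torch.randn(N, F) for _ in range(T)]
print(JammingDetector()(As, Xs))
```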

[AI-199] TEAS: Trusted Educational AI Standard: A Framework for Verifiable, Stable, Auditable, and Pedagogically Sound Learning Systems AAAI-26

【速读】:该论文旨在解决当前教育领域中人工智能(AI)部署缺乏统一可信标准的问题,即现有评估框架碎片化,导致机构在引入AI时面临可靠性风险。其解决方案的关键在于提出TEAS(Trusted Educational AI Standard)框架,该框架基于四个相互依赖的核心支柱:可验证性(Verifiability)、稳定性(Stability)、可审计性(Auditability)和教学合理性(Pedagogical Soundness),强调系统性架构设计是实现可信AI的核心,而非单纯依赖模型能力,从而为低成本、开源模型实现部署级信任提供了可行路径。

链接: https://arxiv.org/abs/2601.06066
作者: Abu Syed
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 tables, accepted at AAAI-26 (DAI Workshop) in Singapore

点击查看摘要

Abstract:The rapid integration of AI into education has prioritized capability over trustworthiness, creating significant risks. Real-world deployments reveal that even advanced models are insufficient without extensive architectural scaffolding to ensure reliability. Current evaluation frameworks are fragmented: institutional policies lack technical verification, pedagogical guidelines assume AI reliability, and technical metrics are context-agnostic. This leaves institutions without a unified standard for deployment readiness. This paper introduces TEAS (Trusted Educational AI Standard), an integrated framework built on four interdependent pillars: (1) Verifiability, grounding content in authoritative sources; (2) Stability, ensuring deterministic core knowledge; (3) Auditability, enabling independent institutional validation; and (4) Pedagogical Soundness, enforcing principles of active learning. We argue that trustworthiness stems primarily from systematic architecture, not raw model capability. This insight implies that affordable, open-source models can achieve deployment-grade trust, offering a scalable and equitable path to integrating AI safely into learning environments globally.
zh

[AI-200] Socio-technical aspects of Agentic AI

【速读】:该论文试图解决当前agentic AI(自主智能体人工智能)研究中技术导向过强、而社会影响考量不足的问题,即如何将社会、伦理、经济、环境及治理等非技术因素系统性地融入agentic AI的设计与评估框架。其解决方案的关键在于提出并应用一个名为MAD-BAD-SAD的多维分析框架:MAD(动机-应用-道德困境)用于识别系统设计背后的意图与价值冲突;BAD(偏见-问责-危险)聚焦于算法公平性、责任归属与潜在风险;SAD(社会影响-采纳-设计)则强调系统在现实社会中的落地效果与可持续性。通过这一结构化分析工具,论文实现了从纯技术视角向“人—算法—制度”协同演化的 socio-technical 系统认知跃迁,从而为未来 agentic AI 的负责任发展提供理论基础与实践指引。

链接: https://arxiv.org/abs/2601.06064
作者: Praveen Kumar Donta,Alaa Saleh,Ying Li,Shubham Vaishnav,Kai Fang,Hailin Feng,Yuchao Xia,Thippa Reddy Gadekallu,Qiyang Zhang,Xiaodan Shi,Ali Beikmohammadi,Sindri Magnússon,Ilir Murturi,Chinmaya Kumar Dehury,Marcin Paprzycki,Lauri Loven,Sasu Tarkoma,Schahram Dustdar
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Dear Reviewer, please note that this is not survey/review or position paper. This paper introduced new framework (MAD-BAD-SAD Framework) for Socio-technical aspects of Agentic AI, Ethical considerations, which is very important to consider beside technical development

点击查看摘要

Abstract:Agentic Artificial Intelligence (AI) represents a fundamental shift in the design of intelligent systems, characterized by interconnected components that collectively enable autonomous perception, reasoning, planning, action, and learning. Recent research on agentic AI has largely focused on technical foundations, including system architectures, reasoning and planning mechanisms, coordination strategies, and application-level performance across domains. However, the societal, ethical, economic, environmental, and governance implications of agentic AI remain weakly integrated into these technical treatments. This paper addresses this gap by presenting a socio-technical analysis of agentic AI that explicitly connects core technical components with societal context. We examine how architectural choices in perception, cognition, planning, execution, and memory introduce dependencies related to data governance, accountability, transparency, safety, and sustainability. To structure this analysis, we adopt the MAD-BAD-SAD construct as an analytical lens, capturing motivations, applications, and moral dilemmas (MAD); biases, accountability, and dangers (BAD); and societal impact, adoption, and design considerations (SAD). Using this lens, we analyze ethical considerations, implications, and challenges arising from contemporary agentic AI systems and assess their manifestation across emerging applications, including healthcare, education, industry, smart and sustainable cities, social services, communications and networking, and earth observation and satellite communications. The paper further identifies open challenges and suggests future research directions, framing agentic AI as an integrated socio-technical system whose behavior and impact are co-produced by algorithms, data, organizational practices, regulatory frameworks, and social norms.
zh

[AI-201] The Environmental Impact of AI Servers and Sustainable Solutions

链接: https://arxiv.org/abs/2601.06063
作者: Aadi Patel,Nikhil Mahalingam,Rusheen Patel
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures

点击查看摘要

[AI-202] From Values to Frameworks: A Qualitative Study of Ethical Reasoning in Agentic AI Practitioners

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 系统中伦理决策缺乏实证基础的问题,特别是从业者在部署自主代理型人工智能(Agentic AI)时如何权衡不同伦理价值的实践逻辑。其解决方案的关键在于识别出三种主导性的伦理推理框架:以客户为中心(Customer-Centric)、以设计为中心(Design-Centric)和以伦理为中心(Ethics-Centric),并指出这些框架不仅反映了不同的价值优先级,还提供了各自独特的伦理洞察力。因此,为确保稳健的伦理结果,AI 提供商必须超越泛化的责任原则,主动管理这些多元推理框架在决策过程中的体现方式。

链接: https://arxiv.org/abs/2601.06062
作者: Theodore Roberts,Bahram Zarrin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 10 pages, 2 charts, 1 heatmap

点击查看摘要

Abstract:Agentic artificial intelligence systems are autonomous technologies capable of pursuing complex goals with minimal human oversight and are rapidly emerging as the next frontier in AI. While these systems promise major gains in productivity, they also raise new ethical challenges. Prior research has examined how different populations prioritize Responsible AI values, yet little is known about how practitioners actually reason through the trade-offs inherent in designing these autonomous systems. This paper investigates the ethical reasoning of AI practitioners through qualitative interviews centered on structured dilemmas in agentic AI deployment. We find that the responses of practitioners do not merely reflect value preferences but rather align with three distinct reasoning frameworks. First is a Customer-Centric framework where choices are justified by business interests, legality, and user autonomy. Second is a Design-Centric framework emphasizing technical safeguards and system constraints. Third is an Ethics-Centric framework prioritizing social good and moral responsibility beyond compliance. We argue that these frameworks offer distinct and necessary insights for navigating ethical trade-offs. Consequently, providers of agentic AI must look beyond general principles and actively manage how these diverse reasoning frameworks are represented in their decision-making processes to ensure robust ethical outcomes.
zh

[AI-203] AI Application Operations – A Socio-Technical Framework for Data-driven Organizations

【速读】:该论文旨在解决数据驱动型人工智能(Artificial Intelligence, AI)项目在从概念到生产落地过程中面临的复杂挑战,特别是由于数据依赖性导致的开发与运维环节中的不确定性问题。其解决方案的关键在于提出一个全面的人工智能应用运维(AI Application Operations, AIAppOps)框架,该框架不仅涵盖从想法到生产的标准化流程和角色分工,更将持续监控嵌入为一种统一的反馈机制,以驱动持续改进、合规性和价值实现;这一机制基于统计学和形式化保证方法,扩展了安全关键型人工智能中的运行时验证概念至组织级运营层面,从而确保AI系统的稳定性、可解释性和长期有效性。

链接: https://arxiv.org/abs/2601.06061
作者: Daniel Jönsson,Mattias Tiger,Stefan Ekberg,Daniel Jakobsson,Mattias Jonhede,Fredrik Viksten
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We outline a comprehensive framework for artificial intelligence (AI) Application Operations (AIAppOps), based on real-world experiences from diverse organizations. Data-driven projects pose additional challenges to organizations due to their dependency on data across the development and operations cycles. To aid organizations in dealing with these challenges, we present a framework outlining the main steps and roles involved in going from idea to production for data-driven solutions. The data dependency of these projects entails additional requirements on continuous monitoring and feedback, as deviations can emerge in any process step. Therefore, the framework embeds monitoring not merely as a safeguard, but as a unifying feedback mechanism that drives continuous improvement, compliance, and sustained value realization, anchored in both statistical and formal assurance methods that extend runtime verification concepts from safety-critical AI to organizational operations. The proposed framework is structured across core technical processes and supporting services to guide both new initiatives and maturing AI programs.
zh

[AI-204] Context Video Semantic Transmission with Variable Length and Rate Coding over MIMO Channels

【速读】:该论文旨在解决现有语义通信方案在多输入多输出(Multiple-Input Multiple-Output, MIMO)信道环境中性能受限的问题,尤其是在无线视频传输中,多数现有方法仅针对加性白高斯噪声(Additive White Gaussian Noise, AWGN)或瑞利衰落信道进行优化,忽略了实际部署中普遍存在的MIMO场景。其解决方案的关键在于提出一种上下文视频语义传输(Context Video Semantic Transmission, CVST)框架,通过构建高效的上下文视频传输骨干网络,显式学习特征组与MIMO子信道之间的相关性映射;进而设计多参考熵编码机制,实现基于信道状态的可变长度编码,并引入基于棋盘结构的特征调制策略,使单一训练模型支持多个速率点,从而提升部署灵活性。上述创新共同构成多参考可变长度与速率编码(Multi-Reference Variable Length and Rate Coding, MR-VLRC)方案,在多种标准化分离编码方法和近期无线视频语义通信方案中展现出显著性能优势。

链接: https://arxiv.org/abs/2601.06059
作者: Bingyan Xie,Yongpeng Wu,Wenjun Zhang,Derrick Wing Kwan Ng,Merouane Debbah
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The evolution of semantic communications has profoundly impacted wireless video transmission, whose applications are a dominant driver of modern bandwidth consumption. However, most existing schemes are predominantly optimized for simple additive white Gaussian noise or Rayleigh fading channels, neglecting the ubiquitous multiple-input multiple-output (MIMO) environments that critically hinder practical deployment. To bridge this gap, we propose the context video semantic transmission (CVST) framework under MIMO channels. Building upon an efficient contextual video transmission backbone, CVST effectively learns a context-channel correlation map to explicitly formulate the relationships between feature groups and MIMO subchannels. Leveraging these channel-aware features, we design a multi-reference entropy coding mechanism, enabling channel state-aware variable length coding. Furthermore, CVST incorporates a checkerboard-based feature modulation strategy to achieve multiple rate points within a single trained model, thereby enhancing deployment flexibility. These innovations constitute our multi-reference variable length and rate coding (MR-VLRC) scheme. By integrating contextual transmission with MR-VLRC, CVST demonstrates substantial performance gains over various standardized separated coding methods and recent wireless video semantic communication approaches. The code is available at this https URL.
zh
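
摘要中的“棋盘式特征调制”可以用一个很小的掩码示意来理解(假设性实现,非论文代码):按棋盘模式把特征位置分成两组,保留一组、按目标速率缩放另一组,同一模型即可覆盖多个速率点。

```python
import numpy as np

def checkerboard_mask(h, w):
    """返回 0/1 棋盘掩码,1 表示保留的特征位置。"""
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return ((yy + xx) % 2).astype(np.float32)

def modulate(feat, rate):
    """feat: (C,H,W) 特征图;rate ∈ [0,1] 控制另一半特征的保留强度。"""
    mask = checkerboard_mask(feat.shape[1], feat.shape[2])
    # 一半特征保持不变,另一半乘以 rate(rate=1 即全速率,rate=0 即半速率)
    return feat * (mask + (1.0 - mask) * rate)

feat = np.random.randn(4, 8, 8).astype(np.float32)
for r in (1.0, 0.5, 0.0):
    out = modulate(feat, r)
    kept = np.count_nonzero(out) / out.size
    print(f"rate={r:.1f}: nonzero fraction ≈ {kept:.2f}")
```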

[AI-205] Data Work in Egypt: Who Are the Workers Behind Artificial Intelligence?

链接: https://arxiv.org/abs/2601.06057
作者: Myriam Raymond (GRANEM, LEMNA, DiPLab), Lucy Neveux (HEC Paris, ENSAE), Antonio A. Casilli (I3 SES, NOS, SES, DiPLab, IP Paris), Paola Tubaro (CNRS, ENSAE Paris, CREST, DiPLab)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-206] Sports Business Administration and New Age Technology: Role of AI

【速读】:该论文试图解决体育治理、税收政策、争议解决机制以及数字化转型在体育领域中的整合问题,特别是如何通过创新技术提升体育治理效能与人才识别的精准性。其核心解决方案在于利用数据驱动方法和人工智能(AI)优化运动员招募流程,在确保符合现行法规的前提下减少评估偏差,并扩大对多元人才库的覆盖范围;同时,研究指出需改革税收政策以契合国际最佳实践,从而增强体育组织的透明度与问责制,推动体育产业可持续发展。

链接: https://arxiv.org/abs/2601.06053
作者: Sahibpreet Singh,Pawan Kumar
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Chapter in “Sports Law in India” (University Book House Pvt. Ltd., 2024), pp. 122-142

点击查看摘要

Abstract:This chapter explores the complexities of sports governance, taxation, dispute resolution, and the impact of digital transformation within the sports sector. This study identifies a critical research gap regarding the integration of innovative technologies to enhance governance and talent identification in sports law. The objective is to evaluate how data-driven approaches and AI can optimize recruitment processes while ensuring compliance with existing regulations. A comprehensive analysis of current governance structures and taxation policies (i.e., the Income Tax Act and GST Act) reveals preliminary results indicating that reform is necessary to support sustainable growth in the sports economy. Key findings demonstrate that AI enhances player evaluation by minimizing biases and expanding access to diverse talent pools, while the Court of Arbitration for Sport provides an efficient mechanism for dispute resolution. The implications emphasize the need for regulatory reforms that align taxation policies with international best practices, promoting transparency and accountability in sports organizations. This research contributes valuable insights into the evolving dynamics of sports management, aiming to foster innovation and integrity in the industry.
zh

[AI-207] The Violation State: Safety State Persistence in a Multimodal Language Model Interface

【速读】:该论文试图解决的问题是:在多模态人工智能(Multimodal AI)系统中,安全过滤机制如何与对话级状态(conversation-level state)相互作用,特别是在用户上传受版权保护的图像并请求移除水印被拒绝后,后续无关的图像生成请求是否也会被错误拒绝。解决方案的关键在于通过人工执行实验,在ChatGPT(GPT-5.1)网页界面中观察到一种称为“安全状态持久性”(safety-state persistence)的行为现象——即单次对版权内容的拒绝会引发后续所有图像生成请求的持续拒绝,而文本类请求仍可正常执行。这一发现揭示了多模态AI系统中存在对话级别的过度泛化问题,为改进会话级安全机制的设计提供了实证依据。

链接: https://arxiv.org/abs/2601.06049
作者: Bentley DeVilling (Course Correct Labs)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, 1 figure, 1 table

点击查看摘要

Abstract:Multimodal AI systems integrate text generation, image generation, and other capabilities within a single conversational interface. These systems employ safety mechanisms to prevent disallowed actions, including the removal of watermarks from copyrighted images. While single-turn refusals are expected, the interaction between safety filters and conversation-level state is not well understood. This study documents a reproducible behavioral effect in the ChatGPT (GPT-5.1) web interface. Manual execution was chosen to capture the exact user-facing safety behavior of the production system, rather than isolated API components. When a conversation begins with an uploaded copyrighted image and a request to remove a watermark, which the model correctly refuses, subsequent prompts to generate unrelated, benign images are refused for the remainder of the session. Importantly, text-only requests (e.g., generating a Python function) continue to succeed. Across 40 manually run sessions (30 contaminated and 10 controls), contaminated threads showed 116/120 image-generation refusals (96.67%), while control threads showed 0/40 refusals (Fisher's exact p < 0.0001). All sessions used an identical fixed prompt order, ensuring sequence uniformity across conditions. We describe this as safety-state persistence: a form of conversational over-generalization in which a copyright refusal influences subsequent, unrelated image-generation behavior. We present these findings as behavioral observations, not architectural claims. We discuss possible explanations, methodological limitations (single model, single interface), and implications for multimodal reliability, user experience, and the design of session-level safety systems. These results motivate further examination of session-level safety interactions in multimodal AI systems.
zh
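
摘要中的显著性结论可以用 SciPy 直接复核(列联表按摘要给出的计数构造;此处假设采用双侧 Fisher 精确检验):

```python
from scipy.stats import fisher_exact

# 受污染会话:116/120 次图像生成被拒;对照会话:0/40 次被拒
table = [[116, 4],   # contaminated: refused, allowed
         [0, 40]]    # control:      refused, allowed
odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"p = {p:.2e}")  # 远小于 0.0001,与摘要中 p < 0.0001 一致
```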

[AI-208] Reliability and Admissibility of AI-Generated Forensic Evidence in Criminal Trials

【速读】:该论文试图解决的问题是:AI生成的法医证据在刑事审判中的可采性问题,即其是否满足既定的法律可靠性标准。当前尽管生成式 AI (Generative AI) 在侦查效率方面展现出潜力,但司法实践中对AI输出证据的可采性缺乏系统评估,且存在技术可复现性不足、法官和技术人员专业素养差异导致的接受度不一,以及责任归属不清等挑战。解决方案的关键在于建立独立验证机制和制定专门针对AI证据的可采性标准,以确保其可靠性并防范误判风险,从而推动AI在刑事司法系统中的负责任整合。

链接: https://arxiv.org/abs/2601.06048
作者: Sahibpreet Singh,Lalita Devi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Presented at National Seminar on Criminal Law and Justice Reforms, 8 November 2025, pp. 45-53

点击查看摘要

Abstract:This paper examines the admissibility of AI-generated forensic evidence in criminal trials. The growing adoption of AI presents promising results for investigative efficiency. Despite advancements, significant research gaps persist in practically understanding the legal limits of AI evidence in judicial processes. Existing literature lacks focused assessment of the evidentiary value of AI outputs. The objective of this study is to evaluate whether AI-generated evidence satisfies established legal standards of reliability. The methodology involves a comparative doctrinal legal analysis of evidentiary standards across common law jurisdictions. Preliminary results indicate that AI forensic tools can enhance the scale of evidence analysis. However, challenges arise from reproducibility deficits. Courts exhibit variability in acceptance of AI evidence due to limited technical literacy and lack of standardized validation protocols. Liability implications reveal that developers and investigators may bear accountability for flawed outputs. This raises critical concerns related to wrongful conviction. The paper emphasizes the necessity of independent validation and the development of AI-specific admissibility criteria. Findings inform policy development for the responsible integration of AI within criminal justice systems. The research advances the objectives of Sustainable Development Goal 16 by reinforcing equitable access to justice. Preliminary results contribute a foundation for future empirical research in AI-deployed criminal forensics.
zh

[AI-209] Tree-Preconditioned Differentiable Optimization and Axioms as Layers

【速读】:该论文旨在解决传统随机效用模型(Random Utility Model, RUM)在深度学习框架中难以嵌入其公理结构的问题,尤其针对将实证选择数据投影到RUM多面体(polytope)时所面临的NP难性挑战。解决方案的关键在于发现RUM一致性与布尔格(Boolean lattice)上的流守恒之间的同构关系,并基于此提出一种树预条件共轭梯度(Tree-Preconditioned Conjugate Gradient)求解器;该方法通过约束图的生成树构造预条件子,有效“白化”由内点法障碍项诱导的病态海森矩阵谱,从而实现超线性收敛并扩展至此前无法处理的大规模问题。此外,作者利用隐函数定理将投影过程形式化为可微分层,使精确雅可比矩阵在反向传播中传递几何约束,实现了联合训练、可证明理性且能在稀疏数据下泛化的模型。

链接: https://arxiv.org/abs/2601.06036
作者: Yuexin Liao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Comments and collaboration are highly welcome

点击查看摘要

Abstract:This paper introduces a differentiable framework that embeds the axiomatic structure of Random Utility Models (RUM) directly into deep neural networks. Although projecting empirical choice data onto the RUM polytope is NP-hard in general, we uncover an isomorphism between RUM consistency and flow conservation on the Boolean lattice. Leveraging this combinatorial structure, we derive a novel Tree-Preconditioned Conjugate Gradient solver. By exploiting the spanning tree of the constraint graph, our preconditioner effectively “whitens” the ill-conditioned Hessian spectrum induced by the Interior Point Method barrier, achieving superlinear convergence and scaling to problem sizes previously deemed unsolvable. We further formulate the projection as a differentiable layer via the Implicit Function Theorem, where the exact Jacobian propagates geometric constraints during backpropagation. Empirical results demonstrate that this “Axioms-as-Layers” paradigm eliminates the structural overfitting inherent in penalty-based methods, enabling models that are jointly trainable, provably rational, and capable of generalizing from sparse data regimes where standard approximations fail.
zh
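
论文的核心数值工具是预条件共轭梯度(PCG)。下面给出其通用骨架(示意):论文中的预条件子由约束图的生成树构造,此处为保持自包含,用对角(Jacobi)预条件子作占位,二者共享同一 M_solve 接口。

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-10, max_iter=500):
    """求解 Ax = b;M_solve(r) 返回预条件方程 Mz = r 的解。"""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_solve(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 50))
A = Q @ Q.T + 50 * np.eye(50)          # 对称正定测试矩阵
b = rng.standard_normal(50)
diag = np.diag(A)
x = pcg(A, b, lambda r: r / diag)      # Jacobi 预条件子作为生成树预条件子的占位
print(np.linalg.norm(A @ x - b))       # 残差接近 0
```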

[AI-210] Autonomous QA Agent: A Retrieval-Augmented Framework for Reliable Selenium Script Generation

【速读】:该论文试图解决软件测试中将需求转化为可执行测试脚本时存在的手动、易出错问题,尤其是大型语言模型(Large Language Models, LLMs)在生成 Selenium 脚本时容易产生虚构的 UI 元素(即“幻觉”)的问题。解决方案的关键在于提出一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的自主 QA Agent 系统,该系统通过将项目特定文档和 HTML 结构纳入向量数据库进行上下文检索,在生成过程中 grounding(锚定)于实际的 DOM 结构,从而显著降低幻觉并提升脚本的语法正确性和执行成功率。

链接: https://arxiv.org/abs/2601.06034
作者: Dudekula Kasim Vali
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 figures, 3 tables

点击查看摘要

Abstract:Software testing is critical in the software development lifecycle, yet translating requirements into executable test scripts remains manual and error-prone. While Large Language Models (LLMs) can generate code, they often hallucinate non-existent UI elements. We present the Autonomous QA Agent, a Retrieval-Augmented Generation (RAG) system that grounds Selenium script generation in project-specific documentation and HTML structure. By ingesting diverse formats (Markdown, PDF, HTML) into a vector database, our system retrieves relevant context before generation. Evaluation on 20 e-commerce test scenarios shows our RAG approach achieves 100% (20/20) syntax validity and 90% (18/20, 95% CI: [85%, 95%], p < 0.001) execution success, compared to 30% for standard LLM generation. While our evaluation is limited to a single domain, our method significantly reduces hallucinations by grounding generation in actual DOM structure, demonstrating RAG's potential for automated UI testing.
zh
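
RAG 式“先检索、再生成”的流程可以压缩成几行代码来理解(玩具示例:页面描述、查询与词袋嵌入均为假设,真实系统使用向量数据库与神经嵌入模型):

```python
import numpy as np

docs = [
    "login page with username input password input and submit button",
    "cart page with add to cart button and cart count badge",
    "checkout page with card number input and pay now button",
]

def embed(text, vocab):
    # 词袋向量仅作占位,演示按余弦相似度检索
    v = np.array([text.split().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

vocab = sorted({w for d in docs for w in d.split()})
doc_vecs = np.stack([embed(d, vocab) for d in docs])

query = "write a selenium test that clicks the add to cart button"
scores = doc_vecs @ embed(query, vocab)
context = docs[int(np.argmax(scores))]          # 命中 cart 页面描述

# 生成阶段以检索到的真实页面/DOM 描述为锚,抑制元素幻觉
prompt = (f"Using ONLY the elements described here:\n{context}\n"
          f"Task: {query}\nGenerate a Selenium script.")
print(prompt)
```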

[AI-211] Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging

【速读】:该论文旨在解决当前图形用户界面(GUI)接地(GUI grounding)模型对鼠标拖拽(dragging)操作建模不足的问题,而拖拽是实际GUI场景中用于文本选择与操作的重要交互方式。解决方案的关键在于:首先构建了一个大规模、多样化的文本拖拽数据集GUI-Drag(包含16.1万条示例),并通过可扩展的合成管道实现高效生成;其次提出ScreenDrag基准测试集(含5,333个样本,涵盖三层界面上下文),并设计了三个专门用于评估文本拖拽能力的指标,从而支持系统化、鲁棒的性能评测。实验表明,基于GUI-Drag采用高效持续训练策略训练的模型,在ScreenDrag上显著提升拖拽性能的同时,保持了原有点击类任务(如ScreenSpot、ScreenSpot-v2和OSWorld-G)的性能,推动了从单一点击向更全面GUI交互能力发展的研究方向。

链接: https://arxiv.org/abs/2601.06031
作者: Zeyi Liao,Yadong Lu,Boyu Gou,Huan Sun,Ahmed Awadallah
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 29 pages

点击查看摘要

Abstract:Graphical user interface (GUI) grounding, the process of mapping human instructions to GUI actions, serves as a fundamental basis to autonomous GUI agents. While existing grounding models achieve promising performance in simulating the mouse click action on various click-based benchmarks, another essential mode of mouse interaction, namely dragging, remains largely underexplored. Yet, dragging the mouse to select and manipulate textual content represents a prevalent and important usage in practical GUI scenarios. To narrow this gap, we first introduce GUI-Drag, a diverse dataset of 161K text dragging examples synthesized through a scalable pipeline. To support systematic and robust evaluation, we further construct ScreenDrag, a benchmark with 5,333 examples spanning three levels of interface context, together with three dedicated metrics designed for assessing text dragging capability. Models trained on GUI-Drag with an efficient continual training strategy achieve substantial improvements on ScreenDrag, while preserving the original click-based performance on ScreenSpot, ScreenSpot-v2, and OSWorld-G. Our work encourages further research on broader GUI grounding beyond just clicking and paves the way toward a truly generalist GUI grounding model. All benchmark, data, checkpoints, and code are open-sourced and available at this https URL.
zh

[AI-212] From Augmentation to Symbiosis: A Review of Human-AI Collaboration Frameworks, Performance, and Perils

链接: https://arxiv.org/abs/2601.06030
作者: Richard Jiarui Tong
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-213] A Recommendation System-Based Framework for Enhancing Human-Machine Collaboration in Industrial Timetabling Rescheduling: Application in Preventive Maintenance

【速读】:该论文旨在解决工业排班(Industrial Timetabling)在实际运行中因突发事件导致执行中断时的重调度(Rescheduling)难题,其核心挑战在于如何在保证方案质量的同时控制计算时间,以支持近优决策。解决方案的关键在于构建一个基于推荐系统(Recommendation System)的框架,依托Timefold这一强大的AI驱动规划引擎,通过实验评估九个源于真实预防性维护场景的实例,识别出最优启发式算法(Heuristic),从而实现高质量与高效率之间的平衡。

链接: https://arxiv.org/abs/2601.06029
作者: Kévin Ducharlet,Liwen Zhang,Sara Maqrot,Houssem Saidi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial timetabling is a critical task for decision-makers across various sectors to ensure efficient system operation. In real-world settings, it remains challenging because unexpected events often disrupt execution. When such events arise, effective rescheduling and collaboration between humans and machines becomes essential. This paper presents a recommendation system-based framework for handling rescheduling challenges, built on Timefold, a powerful AI-driven planning engine. Our experimental study evaluates nine instances inspired by a real-world preventive maintenance use case, aiming to identify the heuristic that best balances solution quality and computing time to support near-optimal decision-making when rescheduling is required due to unexpected events during operational days. Finally, we illustrate the complete process of our recommendation system through a simple use case.
zh

[AI-214] AI-Assisted Authoring for Transparent Data-Driven Documents

链接: https://arxiv.org/abs/2601.06027
作者: Alfonso Piscitelli,Cristina David,Mattia De Rosa,Ali Mohammed,Federico Nanni,Jacob Pake,Roly Perera,Jessy Sodimu,Chenyiqiu Zheng
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Programming Languages (cs.PL)
备注:

点击查看摘要

[AI-215] Personalized Spiking Neural Networks with Ferroelectric Synapses for EEG Signal Processing

【速读】:该论文旨在解决基于脑电图(EEG)的脑机接口(BCI)中因神经信号非平稳性导致的模型泛化能力差的问题,尤其是在资源受限平台上实现适应性和个性化学习的挑战。其核心解决方案是利用铁电忆阻器(ferroelectric memristive synaptic devices)作为可编程硬件平台,部署脉冲神经网络(SNNs)以实现对EEG信号的自适应解码。关键创新在于提出了一种设备感知的权重更新策略:通过数字累积梯度更新,并仅在阈值触发时转化为离散的编程事件,从而模拟忆阻器非线性、状态依赖的编程特性,同时显著降低编程频率和能耗。该方法在两种部署策略下均实现了与先进软件SNN相当的分类性能,且仅微调最终网络层即可实现个体特异性迁移学习,有效提升了模型适应性与实用性。

链接: https://arxiv.org/abs/2601.00020
作者: Nikhil Garg,Anxiong Song,Niklas Plessnig,Nathan Savoia,Laura Bégon-Lours
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Electroencephalography (EEG)-based brain-computer interfaces (BCIs) are strongly affected by non-stationary neural signals that vary across sessions and individuals, limiting the generalization of subject-agnostic models and motivating adaptive and personalized learning on resource-constrained platforms. Programmable memristive hardware offers a promising substrate for such post-deployment adaptation; however, practical realization is challenged by limited weight resolution, device variability, nonlinear programming dynamics, and finite device endurance. In this work, we show that spiking neural networks (SNNs) can be deployed on ferroelectric memristive synaptic devices for adaptive EEG-based motor imagery decoding under realistic device constraints. We fabricate, characterize, and model ferroelectric synapses. We evaluate a convolutional-recurrent SNN architecture under two complementary deployment strategies: (i) device-aware training using a ferroelectric synapse model, and (ii) transfer of software-trained weights followed by low-overhead on-device re-tuning. To enable efficient adaptation, we introduce a device-aware weight-update strategy in which gradient-based updates are accumulated digitally and converted into discrete programming events only when a threshold is exceeded, emulating nonlinear, state-dependent programming dynamics while reducing programming frequency. Both deployment strategies achieve classification performance comparable to state-of-the-art software-based SNNs. Furthermore, subject-specific transfer learning achieved by retraining only the final network layers improves classification accuracy. These results demonstrate that programmable ferroelectric hardware can support robust, low-overhead adaptation in spiking neural networks, opening a practical path toward personalized neuromorphic processing of neural signals.
zh
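
摘要中的“数字累积、阈值触发编程”更新策略可以用如下玩具代码示意(阈值、非线性模型与噪声尺度均为假设):梯度更新先在数字侧累积,只有越过阈值才写一次器件,从而显著减少编程次数。

```python
import numpy as np

THRESH = 0.05        # 触发一次编程事件所需的累积更新量(假设值)
G_MIN, G_MAX = 0.0, 1.0

def program_pulse(g, sign, nl=3.0):
    """状态依赖的非线性编程:电导越接近边界,单脉冲改变越小(假设模型)。"""
    if sign > 0:
        return g + (G_MAX - g) * (1 - np.exp(-nl / 32))
    return g - (g - G_MIN) * (1 - np.exp(-nl / 32))

def apply_update(g, acc, grad_step):
    """acc:数字累积器;仅在越过阈值时写器件,并扣除相应额度。"""
    acc += grad_step
    pulses = 0
    while abs(acc) >= THRESH:
        g = program_pulse(g, np.sign(acc))
        acc -= np.sign(acc) * THRESH
        pulses += 1
    return g, acc, pulses

rng = np.random.default_rng(1)
g, acc, total = 0.5, 0.0, 0
for step in range(200):
    g, acc, p = apply_update(g, acc, 0.01 * rng.standard_normal())
    total += p
print(f"final conductance={g:.3f}, programming events={total} (远少于 200 次梯度步)")
```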

[AI-216] Learning About Learning: A Physics Path from Spin Glasses to Artificial Intelligence

链接: https://arxiv.org/abs/2601.07635
作者: Denis D. Caprioti,Matheus Haas,Constantino F. Vasconcelos,Mauricio Girardi-Schappo
机构: 未知
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Physics Education (physics.ed-ph)
备注: 18 pages, 11 figures

点击查看摘要

[AI-217] Large Language Models for Physics Instrument Design

链接: https://arxiv.org/abs/2601.07580
作者: Sara Zoccheddu,Shah Rukh Qasim,Patrick Owen,Nicola Serra
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
备注:

点击查看摘要

[AI-218] A Model of Artificial Jagged Intelligence

链接: https://arxiv.org/abs/2601.07573
作者: Joshua Gans
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI)
备注: 58 Pages

点击查看摘要

[AI-219] Data-Driven Stochastic VRP: Integration of Forecast Duration into Optimization for Utility Workforce Management

链接: https://arxiv.org/abs/2601.07514
作者: Matteo Garbelli
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-220] Layerwise goal-oriented adaptivity for neural ODEs: an optimal control perspective

链接: https://arxiv.org/abs/2601.07397
作者: Michael Hintermüller,Michael Hinze,Denis Korolev
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-221] Efficient Convolutional Forward Model for Passive Acoustic Mapping and Temporal Monitoring

链接: https://arxiv.org/abs/2601.07356
作者: Tatiana Gelvez-Barrera,Barbara Nicolas,Bruno Gilles,Adrian Basarab,Denis Kouamé
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-222] Benchmarking Autonomy in Scientific Experiments: A Hierarchical Taxonomy for Autonomous Large-Scale Facilities

【速读】:该论文旨在解决当前自主科学(Autonomous Science)领域缺乏统一基准评估体系的问题,尤其是在大型用户设施(Large-Scale User Facilities)中,传统基于“拥有者-操作者”(owner-operator)模型的自动化分级标准难以适用。其解决方案的核心是提出了一种专为这类设施定制的六级分类法——基准自主实验(Benchmarking Autonomy in Scientific Experiments, BASE)量表,其中关键突破在于识别出“推理障碍”(Inference Barrier, Level 3)作为决策范式转变的临界点:在此层级,代理从依赖标量反馈跃迁至使用语义数字孪生(semantic digital twins),实现从空间探索到时间门控(temporal gating)的扩展,从而能够同步采集与瞬态物理事件的发生时刻,显著提升实验流程的智能性与响应能力。

链接: https://arxiv.org/abs/2601.06978
作者: James Le Houx
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 12 pages, 2 figures, 2 tables

点击查看摘要

Abstract:The transition from automated data collection to fully autonomous discovery requires a shared vocabulary to benchmark progress. While the automotive industry relies on the SAE J3016 standard, current taxonomies for autonomous science presuppose an owner-operator model that is incompatible with the operational rigidities of Large-Scale User Facilities. Here, we propose the Benchmarking Autonomy in Scientific Experiments (BASE) Scale, a 6-level taxonomy (Levels 0-5) specifically adapted for these unique constraints. Unlike owner-operator models, User Facilities require zero-shot deployment where agents must operate immediately without extensive training periods. We define the specific technical requirements for each tier, identifying the Inference Barrier (Level 3) as the critical latency threshold where decisions shift from scalar feedback to semantic digital twins. Fundamentally, this level extends the decision manifold from spatial exploration to temporal gating, enabling the agent to synchronise acquisition with the onset of transient physical events. By establishing these operational definitions, the BASE Scale provides facility directors, funding bodies, and beamline scientists with a standardised metric to assess risk, define liability, and quantify the intelligence of experimental workflows.
zh

[AI-223] Resource-constrained Project Scheduling with Time-of-Use Energy Tariffs and Machine States: A Logic-based Benders Decomposition Approach

链接: https://arxiv.org/abs/2601.06542
作者: Corentin Juvigny,Antonín Novák,Jan Mandík,Zdeněk Hanzálek
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-224] On a Gradient Approach to Chebyshev Center Problems with Applications to Function Learning

【速读】:该论文旨在解决Chebyshev中心问题(Chebyshev center problem),这是一类在最优函数学习和几何优化中具有基础性意义的半无限规划(semi-infinite programming, SIP)问题,其核心是寻找一个最小球体以包含给定约束集合。传统方法难以高效求解大规模场景下的此类问题,且缺乏理论保障与数值稳定性。解决方案的关键在于提出 \textsfgradOL ——首个基于梯度的优化框架,通过将原半无限问题重构为有限维的极大极小优化(max-min optimization),从而适配梯度下降类算法;同时利用自动微分(automatic differentiation)实现高精度梯度计算,确保数值稳定性和可扩展性,并在环境范数强凸条件下理论上保证收敛至最优Chebyshev中心及其对应半径,显著提升了求解精度与效率。

链接: https://arxiv.org/abs/2601.06434
作者: Abhinav Raghuvanshi,Mayank Baranwal,Debasish Chatterjee
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to TMLR

点击查看摘要

Abstract:We introduce gradOL, the first gradient-based optimization framework for solving Chebyshev center problems, a fundamental challenge in optimal function learning and geometric optimization. gradOL hinges on reformulating the semi-infinite problem as a finitary max-min optimization, making it amenable to gradient-based techniques. By leveraging automatic differentiation for precise numerical gradient computation, gradOL ensures numerical stability and scalability, making it suitable for large-scale settings. Under strong convexity of the ambient norm, gradOL provably recovers optimal Chebyshev centers while directly computing the associated radius. This addresses a key bottleneck in constructing stable optimal interpolants. Empirically, gradOL achieves significant improvements in accuracy and efficiency on 34 benchmark Chebyshev center problems from a benchmark CSIP library. Moreover, we extend gradOL to general convex semi-infinite programming (CSIP), attaining up to 4000x speedups over the state-of-the-art SIPAMPL solver tested on the aforementioned CSIP library containing 67 benchmark problems. Furthermore, we provide the first theoretical foundation for applying gradient-based methods to Chebyshev center problems, bridging rigorous analysis with practical algorithms. gradOL thus offers a unified solution framework for Chebyshev centers and broader CSIPs.
zh
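
作为 max-min 改写思路的直观对照,下面用次梯度法求有限点集的 Chebyshev 中心(最小包含球;玩具示例,gradOL 针对的是更一般的半无限情形且带收敛保证):目标 f(x) = max_i ||x − p_i|| 的次梯度由当前最远点给出。

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.standard_normal((200, 2))

x = pts.mean(0)                        # 初始化为质心
for k in range(1, 2001):
    d = np.linalg.norm(pts - x, axis=1)
    i = int(np.argmax(d))              # 当前最远点决定次梯度方向
    g = (x - pts[i]) / (d[i] + 1e-12)
    x -= (0.1 / np.sqrt(k)) * g        # 递减步长的次梯度步
radius = np.linalg.norm(pts - x, axis=1).max()
print(f"center={x}, Chebyshev radius={radius:.4f}")
```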

[AI-225] FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation

链接: https://arxiv.org/abs/2601.06199
作者: Junseok Lee,Sangyong Lee,Chang-Jae Chun
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

[AI-226] Emergent Complexity in Nuclear Reaction Networks: A Study of Stellar Nucleosynthesis through Chemical Organization Theory

链接: https://arxiv.org/abs/2601.06143
作者: Pedro Maldonado-Lang,Clément Vidal
机构: 未知
类目: Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, paper presented at ALIFE 2025: Ciphers of Life: Proceedings of the Artificial Life Conference 2025

点击查看摘要

机器学习

[LG-0] Optimal Learning Rate Schedule for Balancing Effort and Performance

链接: https://arxiv.org/abs/2601.07830
作者: Valentina Njaradi,Rodrigo Carrasco-Davis,Peter E. Latham,Andrew Saxe
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Learning how to learn efficiently is a fundamental challenge for biological agents and a growing concern for artificial ones. To learn effectively, an agent must regulate its learning speed, balancing the benefits of rapid improvement against the costs of effort, instability, or resource use. We introduce a normative framework that formalizes this problem as an optimal control process in which the agent maximizes cumulative performance while incurring a cost of learning. From this objective, we derive a closed-form solution for the optimal learning rate, which has the form of a closed-loop controller that depends only on the agent’s current and expected future performance. Under mild assumptions, this solution generalizes across tasks and architectures and reproduces numerically optimized schedules in simulations. In simple learning models, we can mathematically analyze how agent and task parameters shape learning-rate scheduling as an open-loop control solution. Because the optimal policy depends on expectations of future performance, the framework predicts how overconfidence or underconfidence influence engagement and persistence, linking the control of learning speed to theories of self-regulated learning. We further show how a simple episodic memory mechanism can approximate the required performance expectations by recalling similar past learning experiences, providing a biologically plausible route to near-optimal behaviour. Together, these results provide a normative and biologically plausible account of learning speed control, linking self-regulated learning, effort allocation, and episodic memory estimation within a unified and tractable mathematical framework.

[LG-1] Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning

链接: https://arxiv.org/abs/2601.07760
作者: Shao-Ting Chiu,Siu Wun Cheung,Ulisses Braga-Neto,Chak Shing Lee,Rui Peng Li
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have shown strong potential for efficiently approximating complex nonlinear functions. However, the original KAN formulation relies on B-spline basis functions, which incur substantial computational overhead due to De Boor’s algorithm. To address this limitation, recent work has explored alternative basis functions such as radial basis functions (RBFs) that can improve computational efficiency and flexibility. Yet, standard RBF-KANs often sacrifice accuracy relative to the original KAN design. In this work, we propose Free-RBF-KAN, a RBF-based KAN architecture that incorporates adaptive learning grids and trainable smoothness to close this performance gap. Our method employs freely learnable RBF shapes that dynamically align grid representations with activation patterns, enabling expressive and adaptive function approximation. Additionally, we treat smoothness as a kernel parameter optimized jointly with network weights, without increasing computational complexity. We provide a general universality proof for RBF-KANs, which encompasses our Free-RBF-KAN formulation. Through a broad set of experiments, including multiscale function approximation, physics-informed machine learning, and PDE solution operator learning, Free-RBF-KAN achieves accuracy comparable to the original B-spline-based KAN while delivering faster training and inference. These results highlight Free-RBF-KAN as a compelling balance between computational efficiency and adaptive resolution, particularly for high-dimensional structured modeling tasks.

[LG-2] Tab-TRM: Tiny Recursive Model for Insurance Pricing on Tabular Data

链接: https://arxiv.org/abs/2601.07675
作者: Kishan Padayachy,Ronald Richman,Mario V. Wüthrich
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM)
备注: 30 pages

点击查看摘要

Abstract:We introduce Tab-TRM (Tabular-Tiny Recursive Model), a network architecture that adapts the recursive latent reasoning paradigm of Tiny Recursive Models (TRMs) to insurance modeling. Drawing inspiration from both the Hierarchical Reasoning Model (HRM) and its simplified successor TRM, the Tab-TRM model makes predictions by reasoning over the input features. It maintains two learnable latent tokens - an answer token and a reasoning state - that are iteratively refined by a compact, parameter-efficient recursive network. The recursive processing layer repeatedly updates the reasoning state given the full token sequence and then refines the answer token, in close analogy with iterative insurance pricing schemes. Conceptually, Tab-TRM bridges classical actuarial workflows - iterative generalized linear model fitting and minimum-bias calibration - on the one hand, and modern machine learning, in terms of Gradient Boosting Machines, on the other.

[LG-3] Self-Creating Random Walks for Decentralized Learning under Pac-Man Attacks

链接: https://arxiv.org/abs/2601.07674
作者: Xingran Chen,Parimal Parag,Rohit Bhagat,Salim El Rouayheb
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注: arXiv admin note: substantial text overlap with arXiv:2508.05663

点击查看摘要

Abstract:Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an adversarial threat that we term the ``Pac-Man’’ attack, in which a malicious node probabilistically terminates any RW that visits it. This stealthy behavior gradually eliminates active RWs from the network, effectively halting the learning process without triggering failure alarms. To counter this threat, we propose the CREATE-IF-LATE (CIL) algorithm, which is a fully decentralized, resilient mechanism that enables self-creating RWs and prevents RW extinction in the presence of Pac-Man. Our theoretical analysis shows that the CIL algorithm guarantees several desirable properties, such as (i) non-extinction of the RW population, (ii) almost sure boundedness of the RW population, and (iii) convergence of RW-based stochastic gradient descent even in the presence of Pac-Man with a quantifiable deviation from the true optimum. Moreover, the learning process experiences at most a linear time delay due to Pac-Man interruptions and RW regeneration. Our extensive empirical results on both synthetic and public benchmark datasets validate our theoretical findings.
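
CREATE-IF-LATE 的核心机制——节点若“过久未见”任何随机游走便本地新建一个——可以用环上随机游走的玩具模拟来体会(拓扑、阈值与 Pac-Man 终止概率均为假设):

```python
import random

random.seed(0)
N, PACMAN, KILL_P, LATE_T, STEPS = 20, 7, 0.3, 60, 2000

walks = [0]                 # 每个随机游走用其当前所在节点表示
last_seen = [0] * N         # 各节点最近一次被任一游走访问的时间

for t in range(1, STEPS + 1):
    survivors = []
    for v in walks:
        u = random.choice([(v - 1) % N, (v + 1) % N])     # 环上随机一步
        if u == PACMAN and random.random() < KILL_P:
            continue                                      # 游走被 Pac-Man 终止
        last_seen[u] = t
        survivors.append(u)
    walks = survivors
    for v in range(N):
        if v != PACMAN and t - last_seen[v] > LATE_T:     # “迟到”则本地新建游走
            walks.append(v)
            last_seen[v] = t

print(f"t={STEPS}: surviving random walks = {len(walks)}")  # 恒为正且有界
```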

[LG-4] Learning to accelerate Krasnoselskii-Mann fixed-point iterations with guarantees

链接: https://arxiv.org/abs/2601.07665
作者: Andrea Martin,Giuseppe Belgioioso
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We introduce a principled learning to optimize (L2O) framework for solving fixed-point problems involving general nonexpansive mappings. Our idea is to deliberately inject summable perturbations into a standard Krasnosel’skii-Mann iteration to improve its average-case performance over a specific distribution of problems while retaining its convergence guarantees. Under a metric sub-regularity assumption, we prove that the proposed parametrization includes only iterations that locally achieve linear convergence-up to a vanishing bias term-and that it encompasses all iterations that do so at a sufficiently fast rate. We then demonstrate how our framework can be used to augment several widely-used operator splitting methods to accelerate the solution of structured monotone inclusion problems, and validate our approach on a best approximation problem using an L2O-augmented Douglas-Rachford splitting algorithm.
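
该框架的出发点是带可加扰动的 Krasnosel'skii-Mann 迭代 x_{k+1} = (1−α)x_k + αT(x_k) + e_k:只要扰动可和,收敛性保留。下面用一个压缩仿射映射做玩具验证(映射与扰动均为假设,扰动项示意 L2O 学到的修正):

```python
import numpy as np

# 玩具设定:T(x) = 0.9·O·x + b,O 为正交矩阵,故 T 是系数 0.9 的压缩映射,
# 有唯一不动点 x* = (I − 0.9·O)^{-1} b;扰动 e_k 的范数以 1/k² 衰减(可和)。
rng = np.random.default_rng(0)
n = 5
O, _ = np.linalg.qr(rng.standard_normal((n, n)))   # 正交矩阵,算子范数为 1
b = rng.standard_normal(n)
T = lambda x: 0.9 * O @ x + b

x_star = np.linalg.solve(np.eye(n) - 0.9 * O, b)   # 解析不动点

x, alpha = np.zeros(n), 0.5
for k in range(1, 201):
    e_k = rng.standard_normal(n) / k**2            # 可和扰动:Σ 1/k² < ∞
    x = (1 - alpha) * x + alpha * T(x) + 0.01 * e_k
print(f"||x_K - x*|| = {np.linalg.norm(x - x_star):.2e}")   # 收敛到不动点附近
```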

[LG-5] Studying the Role of Synthetic Data for Machine Learning-based Wireless Networks Traffic Forecasting

链接: https://arxiv.org/abs/2601.07646
作者: José Pulido,Francesc Wilhelmi,Sergio Fortes,Alfonso Fernández-Durán,Lorenzo Galati Giordano,Raquel Barco
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Synthetic data generation is an appealing tool for augmenting and enriching datasets, playing a crucial role in advancing artificial intelligence (AI) and machine learning (ML). Not only does synthetic data help build robust AI/ML datasets cost-effectively, but it also offers privacy-friendly solutions and bypasses the complexities of storing large data volumes. This paper proposes a novel method to generate synthetic data, based on first-order auto-regressive noise statistics, for large-scale Wi-Fi deployments. The approach operates with minimal real data requirements while producing statistically rich traffic patterns that effectively mimic real Access Point (AP) behavior. Experimental results show that ML models trained on synthetic data achieve Mean Absolute Error (MAE) values within 10 to 15% of those obtained using real data when trained on the same APs, while requiring significantly less training data. Moreover, when generalization is required, synthetic-data-trained models improve prediction accuracy by up to 50 percent compared to real-data-trained baselines, thanks to the enhanced variability and diversity of the generated traces. Overall, the proposed method bridges the gap between synthetic data generation and practical Wi-Fi traffic forecasting, providing a scalable, efficient, and real-time solution for modern wireless networks.
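
摘要中“一阶自回归噪声统计”的合成思路可以用 AR(1) 模型最小化示意(参数估计方法与伪“真实”轨迹均为本文假设):先从少量轨迹估计均值 μ、系数 φ 与新息标准差 σ,再按 x_t = μ + φ(x_{t−1} − μ) + σε_t 生成任意长度的合成流量。

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ar1(x):
    mu = x.mean()
    xc = x - mu
    phi = (xc[1:] @ xc[:-1]) / (xc[:-1] @ xc[:-1])     # 最小二乘 AR(1) 系数
    resid = xc[1:] - phi * xc[:-1]
    return mu, phi, resid.std()

def synth_ar1(mu, phi, sigma, T):
    x = np.empty(T)
    x[0] = mu
    for t in range(1, T):
        x[t] = mu + phi * (x[t-1] - mu) + sigma * rng.standard_normal()
    return np.clip(x, 0, None)                          # 流量非负

# 伪“真实”轨迹:带日周期的负载,仅作演示(1440 分钟 = 一天)
t = np.arange(1440)
real = 50 + 30 * np.sin(2 * np.pi * t / 1440) + 5 * rng.standard_normal(1440)

mu, phi, sigma = fit_ar1(real)
fake = synth_ar1(mu, phi, sigma, T=1440)
print(f"phi={phi:.3f}, real mean={real.mean():.1f}, synth mean={fake.mean():.1f}")
```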

[LG-6] Beyond Sharpness: A Flatness Decomposition Framework for Efficient Continual Learning AAAI2026

链接: https://arxiv.org/abs/2601.07636
作者: Yanan Chen,Tieliang Gong,Yunjiao Zhang,Wen Wen
类目: Machine Learning (cs.LG)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Continual Learning (CL) aims to enable models to sequentially learn multiple tasks without forgetting previous knowledge. Recent studies have shown that optimizing towards flatter loss minima can improve model generalization. However, existing sharpness-aware methods for CL suffer from two key limitations: (1) they treat sharpness regularization as a unified signal without distinguishing the contributions of its components. and (2) they introduce substantial computational overhead that impedes practical deployment. To address these challenges, we propose FLAD, a novel optimization framework that decomposes sharpness-aware perturbations into gradient-aligned and stochastic-noise components, and show that retaining only the noise component promotes generalization. We further introduce a lightweight scheduling scheme that enables FLAD to maintain significant performance gains even under constrained training time. FLAD can be seamlessly integrated into various CL paradigms and consistently outperforms standard and sharpness-aware optimizers in diverse experimental settings, demonstrating its effectiveness and practicality in CL.

[LG-7] An adjoint method for training data-driven reduced-order models

链接: https://arxiv.org/abs/2601.07579
作者: Donglin Liu,Francisco García Atienza,Mengwu Guo
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Reduced-order modeling lies at the interface of numerical analysis and data-driven scientific computing, providing principled ways to compress high-fidelity simulations in science and engineering. We propose a training framework that couples a continuous-time form of operator inference with the adjoint-state method to obtain robust data-driven reduced-order models. This method minimizes a trajectory-based loss between reduced-order solutions and projected snapshot data, which removes the need to estimate time derivatives from noisy measurements and provides intrinsic temporal regularization through time integration. We derive the corresponding continuous adjoint equations to compute gradients efficiently and implement a gradient based optimizer to update the reduced model parameters. Each iteration only requires one forward reduced order solve and one adjoint solve, followed by inexpensive gradient assembly, making the method attractive for large-scale simulations. We validate the proposed method on three partial differential equations: viscous Burgers’ equation, the two-dimensional Fisher-KPP equation, and an advection-diffusion equation. We perform systematic comparisons against standard operator inference under two perturbation regimes, namely reduced temporal snapshot density and additive Gaussian noise. For clean data, both approaches deliver similar accuracy, but in situations with sparse sampling and noise, the proposed adjoint-based training provides better accuracy and enhanced roll-out stability.

[LG-8] TFEC: Multivariate Time-Series Clustering via Temporal-Frequency Enhanced Contrastive Learning ICASSP2026

链接: https://arxiv.org/abs/2601.07550
作者: Zexi Tan,Tao Xie,Haoyi Xiao,Baoyao Yang,Yuzhu Ji,An Zeng,Xiang Zhang,Yiqun Zhang
类目: Machine Learning (cs.LG)
备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Multivariate Time-Series (MTS) clustering is crucial for signal processing and data analysis. Although deep learning approaches, particularly those leveraging Contrastive Learning (CL), are prominent for MTS representation, existing CL-based models face two key limitations: 1) neglecting clustering information during positive/negative sample pair construction, and 2) introducing unreasonable inductive biases, e.g., destroying time dependence and periodicity through augmentation strategies, compromising representation quality. This paper, therefore, proposes a Temporal-Frequency Enhanced Contrastive (TFEC) learning framework. To preserve temporal structure while generating low-distortion representations, a temporal-frequency Co-EnHancement (CoEH) mechanism is introduced. Accordingly, a synergistic dual-path representation and cluster distribution learning framework is designed to jointly optimize cluster structure and representation fidelity. Experiments on six real-world benchmark datasets demonstrate TFEC’s superiority, achieving 4.48% average NMI gains over SOTA methods, with ablation studies validating the design. The code of the paper is available at: this https URL.

[LG-9] Contextual Discrepancy-Aware Contrastive Learning for Robust Medical Time Series Diagnosis in Small-Sample Scenarios

链接: https://arxiv.org/abs/2601.07548
作者: Kaito Tanaka,Aya Nakayama,Masato Ito,Yuji Nishimura,Keisuke Matsuda
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical time series data, such as EEG and ECG, are vital for diagnosing neurological and cardiovascular diseases. However, their precise interpretation faces significant challenges due to high annotation costs, leading to data scarcity, and the limitations of traditional contrastive learning in capturing complex temporal patterns. To address these issues, we propose CoDAC (Contextual Discrepancy-Aware Contrastive learning), a novel framework that enhances diagnostic accuracy and generalization, particularly in small-sample settings. CoDAC leverages external healthy data and introduces a Contextual Discrepancy Estimator (CDE), built upon a Transformer-based Autoencoder, to precisely quantify abnormal signals through context-aware anomaly scores. These scores dynamically inform a Dynamic Multi-views Contrastive Framework (DMCF), which adaptively weights different temporal views to focus contrastive learning on diagnostically relevant, discrepant regions. Our encoder combines dilated convolutions with multi-head attention for robust feature extraction. Comprehensive experiments on Alzheimer’s Disease EEG, Parkinson’s Disease EEG, and Myocardial Infarction ECG datasets demonstrate CoDAC’s superior performance across all metrics, consistently outperforming state-of-the-art baselines, especially under low label availability. Ablation studies further validate the critical contributions of CDE and DMCF. CoDAC offers a robust and interpretable solution for medical time series diagnosis, effectively mitigating data scarcity challenges.

[LG-10] Near-Optimal Private Linear Regression via Iterative Hessian Mixing

链接: https://arxiv.org/abs/2601.07545
作者: Omri Lev,Moshe Shenfeld,Vishwak Srinivasan,Katrina Ligett,Ashia C. Wilson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study differentially private ordinary least squares (DP-OLS) with bounded data. The dominant approach, adaptive sufficient-statistics perturbation (AdaSSP), adds an adaptively chosen perturbation to the sufficient statistics, namely, the matrix X^T X and the vector X^T Y, and is known to achieve near-optimal accuracy and to have strong empirical performance. In contrast, methods that rely on Gaussian-sketching, which ensure differential privacy by pre-multiplying the data with a random Gaussian matrix, are widely used in federated and distributed regression, yet remain relatively uncommon for DP-OLS. In this work, we introduce the iterative Hessian mixing, a novel DP-OLS algorithm that relies on Gaussian sketches and is inspired by the iterative Hessian sketch algorithm. We provide utility analysis for the iterative Hessian mixing as well as a new analysis for the previous methods that rely on Gaussian sketches. Then, we show that our new approach circumvents the intrinsic limitations of the prior methods and provides non-trivial improvements over AdaSSP. We conclude by running an extensive set of experiments across standard benchmarks to demonstrate further that our approach consistently outperforms these prior baselines.
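
作为背景,充分统计量扰动(SSP)一类方法的套路是“给 X^T X 与 X^T y 加噪后解正规方程”。下面是一个大幅简化的示意(非完整 AdaSSP,也非本文的 Hessian mixing:噪声尺度取定值、正则为启发式,并假设 ||x||≤1、|y|≤1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 5
X = rng.uniform(-1, 1, (n, d)) / np.sqrt(d)        # 行范数有界的设计矩阵
w_true = rng.standard_normal(d)
y = np.clip(X @ w_true + 0.1 * rng.standard_normal(n), -1, 1)

sigma = 2.0                                         # 实际应由 (eps, delta) 标定
XtX = X.T @ X + sigma * rng.standard_normal((d, d))
XtX = (XtX + XtX.T) / 2                             # 对称化噪声矩阵
Xty = X.T @ y + sigma * rng.standard_normal(d)

# 启发式岭正则:保证扰动后的矩阵充分正定(AdaSSP 的自适应选择更精细)
lam = max(0.0, 1e-3 * n - np.linalg.eigvalsh(XtX).min())
w_priv = np.linalg.solve(XtX + lam * np.eye(d), Xty)
print("relative error:", np.linalg.norm(w_priv - w_true) / np.linalg.norm(w_true))
```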

[LG-11] Stagewise Reinforcement Learning and the Geometry of the Regret Landscape

链接: https://arxiv.org/abs/2601.07524
作者: Chris Elliott,Einar Urdshals,David Quarel,Matthew Farrugia-Roberts,Daniel Murfet
类目: Machine Learning (cs.LG)
备注: 50 pages, 14 figures

点击查看摘要

Abstract:Singular learning theory characterizes Bayesian learning as an evolving tradeoff between accuracy and complexity, with transitions between qualitatively different solutions as sample size increases. We extend this theory to deep reinforcement learning, proving that the concentration of the generalized posterior over policies is governed by the local learning coefficient (LLC), an invariant of the geometry of the regret function. This theory predicts that Bayesian phase transitions in reinforcement learning should proceed from simple policies with high regret to complex policies with low regret. We verify this prediction empirically in a gridworld environment exhibiting stagewise policy development: phase transitions over SGD training manifest as “opposing staircases” where regret decreases sharply while the LLC increases. Notably, the LLC detects phase transitions even when estimated on a subset of states where the policies appear identical in terms of regret, suggesting it captures changes in the underlying algorithm rather than just performance.

[LG-12] Land-then-transport: A Flow Matching-Based Generative Decoder for Wireless Image Transmission

链接: https://arxiv.org/abs/2601.07512
作者: Jingwen Fu,Ming Xiao,Mikael Skoglund,Dong In Kim
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Due to strict rate and reliability demands, wireless image transmission remains difficult for both classical layered designs and joint source-channel coding (JSCC), especially under low latency. Diffusion-based generative decoders can deliver strong perceptual quality by leveraging learned image priors, but iterative stochastic denoising leads to high decoding delay. To enable low-latency decoding, we propose a flow-matching (FM) generative decoder under a new land-then-transport (LTT) paradigm that tightly integrates the physical wireless channel into a continuous-time probability flow. For AWGN channels, we build a Gaussian smoothing path whose noise schedule indexes effective noise levels, and derive a closed-form teacher velocity field along this path. A neural-network student vector field is trained by conditional flow matching, yielding a deterministic, channel-aware ODE decoder with complexity linear in the number of ODE steps. At inference, it only needs an estimate of the effective noise variance to set the ODE starting time. We further show that Rayleigh fading and MIMO channels can be mapped, via linear MMSE equalization and singular-value-domain processing, to AWGN-equivalent channels with calibrated starting times. Therefore, the same probability path and trained velocity field can be reused for Rayleigh and MIMO without retraining. Experiments on MNIST, Fashion-MNIST, and DIV2K over AWGN, Rayleigh, and MIMO demonstrate consistent gains over JPEG2000+LDPC, DeepJSCC, and diffusion-based baselines, while achieving good perceptual quality with only a few ODE steps. Overall, LTT provides a deterministic, physically interpretable, and computation-efficient framework for generative wireless image decoding across diverse channels.
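
“先着陆后输运”的解码流程可以在一维玩具上完整走一遍(与论文实现无关):取高斯平滑路径 x_t = x₁ + (1−t)z,其教师速度场有闭式 v(x,t) = (x₁ − x)/(1 − t);解码只需由等效噪声水平 s 标定起始时刻 t₀ = 1 − s,再做确定性欧拉积分。

```python
import numpy as np

x1 = 2.0                                   # 目标“图像”(一维玩具)

def teacher_velocity(x, t):
    return (x1 - x) / max(1.0 - t, 1e-3)   # 接近 t=1 时做数值截断

def decode(x_noisy, s, steps=20):
    t = 1.0 - s                            # 由等效噪声水平标定 ODE 起始时刻
    x = x_noisy
    dt = (1.0 - t) / steps
    for _ in range(steps):
        x += teacher_velocity(x, t) * dt   # 确定性欧拉步
        t += dt
    return x

rng = np.random.default_rng(0)
for s in (0.8, 0.4, 0.1):                  # 不同信道 => 不同起始时刻,同一速度场
    y = x1 + s * rng.standard_normal()     # “接收到”的带噪潜变量
    print(f"s={s}: received={y:+.3f} -> decoded={decode(y, s):+.3f}")
```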

[LG-13] FROAV: A Framework for RAG Observation and Agent Verification - Lowering the Barrier to LLM Agent Research

链接: https://arxiv.org/abs/2601.07504
作者: Tzu-Hsuan Lin,Chih-Hsuan Kao
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 8 pages, 1 figure, 3 tables

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) and their integration into autonomous agent systems has created unprecedented opportunities for document analysis, decision support, and knowledge retrieval. However, the complexity of developing, evaluating, and iterating on LLM-based agent workflows presents significant barriers to researchers, particularly those without extensive software engineering expertise. We present FROAV (Framework for RAG Observation and Agent Verification), an open-source research platform that democratizes LLM agent research by providing a plug-and-play architecture combining visual workflow orchestration, a comprehensive evaluation framework, and extensible Python integration. FROAV implements a multi-stage Retrieval-Augmented Generation (RAG) pipeline coupled with a rigorous “LLM-as-a-Judge” evaluation system, all accessible through intuitive graphical interfaces. Our framework integrates n8n for no-code workflow design, PostgreSQL for granular data management, FastAPI for flexible backend logic, and Streamlit for human-in-the-loop interaction. Through this integrated ecosystem, researchers can rapidly prototype RAG strategies, conduct prompt engineering experiments, validate agent performance against human judgments, and collect structured feedback, all without writing infrastructure code. We demonstrate the framework’s utility through its application to financial document analysis, while emphasizing its material-agnostic architecture that adapts to any domain requiring semantic analysis. FROAV represents a significant step toward making LLM agent research accessible to a broader scientific community, enabling researchers to focus on hypothesis testing and algorithmic innovation rather than system integration challenges.

[LG-14] The Secretary Problem with Predictions and a Chosen Order

链接: https://arxiv.org/abs/2601.07482
作者: Helia Karisani,Mohammadreza Daneshvaramoli,Hedyeh Beyhaghi,Mohammad Hajiesmaili,Cameron Musco
类目: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Accepted to the International Conference on Innovations in Theoretical Computer Science (ITCS 2026)

点击查看摘要

Abstract:We study a learning-augmented variant of the secretary problem, recently introduced by Fujii and Yoshida (2023), in which the decision-maker has access to machine-learned predictions of candidate values. The central challenge is to balance consistency and robustness: when predictions are accurate, the algorithm should select a near-optimal secretary, while under inaccurate predictions it should still guarantee a bounded competitive ratio. We consider both the classical Random Order Secretary Problem (ROSP), where candidates arrive in a uniformly random order, and a more natural learning-augmented model in which the decision-maker may choose the arrival order based on predicted values. We call this model the Chosen Order Secretary Problem (COSP), capturing scenarios such as interview schedules set in advance. We propose a new randomized algorithm applicable to both ROSP and COSP. Our method switches from fully trusting predictions to a threshold-based rule once a large prediction deviation is detected. Let \epsilon \in [0,1] denote the maximum multiplicative prediction error. For ROSP, our algorithm achieves a competitive ratio of \max\{0.221, (1-\epsilon)/(1+\epsilon)\} , improving upon the prior bound of \max\{0.215, (1-\epsilon)/(1+\epsilon)\} . For COSP, we achieve \max\{0.262, (1-\epsilon)/(1+\epsilon)\} , surpassing the 0.25 worst-case bound for prior approaches and moving closer to the classical secretary benchmark of 1/e \approx 0.368 . These results highlight the benefit of combining predictions with arrival-order control in online decision-making.
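
The "trust predictions, then fall back to a threshold rule" mechanism can be sketched as follows; the error tolerance eps, the sampling fraction, and the exact switching logic are illustrative assumptions, not the paper's algorithm:

```python
# A minimal sketch (hypothetical parameters, not the paper's exact algorithm)
# of the switching idea: follow the prediction until a large multiplicative
# prediction error is observed, then switch to a best-so-far threshold rule.
import random

def secretary_with_predictions(values, preds, eps=0.2, sample_frac=0.37):
    best_pred_idx = max(range(len(preds)), key=lambda i: preds[i])
    threshold, switched = float("-inf"), False
    for i, v in enumerate(values):
        err = abs(v - preds[i]) / max(abs(preds[i]), 1e-9)
        if err > eps:
            switched = True                      # predictions deemed unreliable
        if not switched:
            if i == best_pred_idx:
                return v                         # fully trust the prediction
        else:
            if i < int(sample_frac * len(values)):
                threshold = max(threshold, v)    # observation phase
            elif v > threshold:
                return v                         # classical threshold rule
    return values[-1]

random.seed(0)
vals = [random.random() for _ in range(100)]
preds = [v + random.gauss(0, 0.05) for v in vals]   # fairly accurate predictions
print("picked:", secretary_with_predictions(vals, preds), "best:", max(vals))
```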

[LG-15] AntiPaSTO: Self-Supervised Steering of Moral Reasoning

链接: https://arxiv.org/abs/2601.07473
作者: Michael J. Clark
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As models grow more capable, human supervision breaks down: labels don’t scale, outputs can be gamed, and training doesn’t generalize. Scalable oversight requires steering methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an anti-parallel axis ( \alpha=\pm1 produce opposite shifts), with coherence constraints preventing collapse. Human input is minimal: two contrasting words inserted into template sentences, no preference labels. Using 800 such pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9\times on DailyDilemmas and maintains bidirectional control where prompting triggers refusal. Code is available at this https URL.

[LG-16] Surrogate-based Optimization via Clustering for Box-Constrained Problems

链接: https://arxiv.org/abs/2601.07442
作者: Maaz Ahmad,Iftekhar A. Karimi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 34 pages, 4 Figures, 8 Tables

点击查看摘要

Abstract:Global optimization of large-scale, complex systems such as multi-physics black-box simulations and real-world industrial systems is important but challenging. This work presents a novel Surrogate-Based Optimization framework based on Clustering (SBOC) for global optimization of such systems, which can be used with any surrogate modeling technique. At each iteration, it uses a single surrogate model for the entire domain, employs k-means clustering to identify unexplored regions of the domain, and exploits a local region around the surrogate optimum, potentially adding three new sample points to the domain. SBOC has been tested against sixteen promising benchmark algorithms using 52 analytical test functions of varying input dimensionalities and shape profiles. It successfully identified a global minimum for most test functions with substantially lower computational effort than other algorithms. It worked especially well on test functions with four or more input variables. It was also among the top six algorithms in approaching a global minimum closely. Overall, SBOC is a robust, reliable, and efficient algorithm for global optimization of box-constrained systems.
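
A rough sketch of one SBOC-style iteration, with a Gaussian-process surrogate and assumed details (candidate pool, cluster count, local-step size) standing in for whatever the paper actually uses:

```python
# A minimal sketch (assumed details) of one SBOC-style iteration: fit a single
# global surrogate, use k-means to find an unexplored cell, and exploit a local
# region around the surrogate optimum -- adding up to three new sample points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):                       # black-box objective (toy stand-in)
    return np.sum((x - 0.3) ** 2, axis=-1)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2)); y = f(X)

surrogate = GaussianProcessRegressor().fit(X, y)

# (1) surrogate optimum over a random candidate pool
pool = rng.uniform(0, 1, size=(2000, 2))
x_opt = pool[np.argmin(surrogate.predict(pool))]

# (2) exploration: the k-means center least covered by existing samples
centers = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pool).cluster_centers_
dist_to_data = np.min(np.linalg.norm(centers[:, None] - X[None], axis=-1), axis=1)
x_explore = centers[np.argmax(dist_to_data)]

# (3) local refinement around the surrogate optimum
x_local = np.clip(x_opt + 0.05 * rng.normal(size=2), 0, 1)

for x_new in (x_opt, x_explore, x_local):     # evaluate and append
    X = np.vstack([X, x_new]); y = np.append(y, f(x_new))
print("best so far:", y.min())
```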

[LG-17] Variational Autoencoder with Normalizing flow for X-ray spectral fitting NEURIPS2025

链接: https://arxiv.org/abs/2601.07440
作者: Fiona Redmen,Ethan Tregidga,James F. Steiner,Cecilia Garraffo
类目: Machine Learning (cs.LG)
*备注: 7 pages, 1 table, 3 figures. Accepted as a workshop paper to Machine Learning and the Physical Sciences at NeurIPS 2025

点击查看摘要

Abstract:Black hole X-ray binaries (BHBs) can be studied with spectral fitting to provide physical constraints on accretion in extreme gravitational environments. Traditional methods of spectral fitting such as Markov Chain Monte Carlo (MCMC) face limitations due to computational times. We introduce a probabilistic model, utilizing a variational autoencoder with a normalizing flow, trained to adopt a physical latent space. This neural network produces predictions for spectral-model parameters as well as their full probability distributions. Our implementations result in a significant improvement in spectral reconstructions over a previous deterministic model while performing three orders of magnitude faster than traditional methods.

[LG-18] PLANET v2.0: A comprehensive Protein-Ligand Affinity Prediction Model Based on Mixture Density Network

链接: https://arxiv.org/abs/2601.07415
作者: Haotian Gao,Xiangying Zhang,Jingyuan Li,Xinchong Chen,Haojie Wang,Yifei Qi,Renxiao Wang
类目: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
*备注:

点击查看摘要

Abstract:Drug discovery represents a time-consuming and financially intensive process, and virtual screening can accelerate it. Scoring functions, as one of the tools guiding virtual screening, have their precision closely tied to screening efficiency. In our previous study, we developed a graph neural network model called PLANET (Protein-Ligand Affinity prediction NETwork), but it suffers from defects in representing protein-ligand contact maps. Incorrect binding modes inevitably lead to poor affinity predictions, so accurate prediction of the protein-ligand contact map is desired to improve PLANET. In this study, we propose PLANET v2.0 as an upgraded version. The model is trained via a multi-objective training strategy and incorporates a Mixture Density Network to predict binding modes. In addition to the probability density distributions of non-covalent interactions, we innovatively employ another Gaussian mixture model to describe the relationship between distance and energy for each interaction pair and predict protein-ligand affinity by computing the mathematical expectation. On the CASF-2016 benchmark, PLANET v2.0 demonstrates excellent scoring power, ranking power, and docking power. The screening power of PLANET v2.0 is notably improved compared to PLANET and Glide SP, and it demonstrates robust validation on a commercial ultra-large-scale dataset. Given its efficiency and accuracy, PLANET v2.0 can hopefully become one of the practical tools for virtual screening workflows. PLANET v2.0 is freely available at this https URL.

[LG-19] The Practicality of Normalizing Flow Test-Time Training in Bayesian Inference for Agent-Based Models

链接: https://arxiv.org/abs/2601.07413
作者: Junyao Zhang,Jinglai Li,Junqi Tang
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Agent-Based Models (ABMs) are gaining great popularity in economics and social science because of their strong flexibility to describe realistic and heterogeneous decisions and interaction rules between individual agents. In this work, we investigate for the first time the practicality of test-time training (TTT) of deep models, such as normalizing flows, for posterior estimation of ABM parameters. We propose several practical TTT strategies for fine-tuning the normalizing flow against distribution shifts. Our numerical study demonstrates that TTT schemes are remarkably effective, enabling real-time adjustment of flow-based inference for ABM parameters.

[LG-20] Computing patient similarity based on unstructured clinical notes

链接: https://arxiv.org/abs/2601.07385
作者: Petr Zelina,Marko Řeháček,Jana Halámková,Lucia Bohovicová,Martin Rusinko,Vít Nováček
类目: Machine Learning (cs.LG)
*备注: This is a preprint and has not undergone peer review. Final version was presented at the Text, Speech, and Dialogue 2025 conference. The Version of Record is available at this https URL

点击查看摘要

Abstract:Clinical notes hold rich yet unstructured details about diagnoses, treatments, and outcomes that are vital to precision medicine but hard to exploit at scale. We introduce a method that represents each patient as a matrix built from aggregated embeddings of all their notes, enabling robust patient similarity computation based on their latent low-rank representations. Using clinical notes of 4,267 Czech breast-cancer patients and expert similarity labels from Masaryk Memorial Cancer Institute, we evaluate several matrix-based similarity measures and analyze their strengths and limitations across different similarity facets, such as clinical history, treatment, and adverse events. The results demonstrate the usefulness of the presented method for downstream tasks, such as personalized therapy recommendations or toxicity warnings.

[LG-21] CompNO: A Novel Foundation Model approach for solving Partial Differential Equations

链接: https://arxiv.org/abs/2601.07384
作者: Hamda Hmida,Hsiu-Wen Chang Joly,Youssef Mesri
类目: Machine Learning (cs.LG)
*备注: Under review at MDPI

点击查看摘要

Abstract:Partial differential equations (PDEs) govern a wide range of physical phenomena, but their numerical solution remains computationally demanding, especially when repeated simulations are required across many parameter settings. Recent Scientific Foundation Models (SFMs) aim to alleviate this cost by learning universal surrogates from large collections of simulated systems, yet they typically rely on monolithic architectures with limited interpretability and high pretraining expense. In this work we introduce Compositional Neural Operators (CompNO), a compositional neural operator framework for parametric PDEs. Instead of pretraining a single large model on heterogeneous data, CompNO first learns a library of Foundation Blocks, where each block is a parametric Fourier neural operator specialized to a fundamental differential operator (e.g. convection, diffusion, nonlinear convection). These blocks are then assembled, via lightweight Adaptation Blocks, into task-specific solvers that approximate the temporal evolution operator for target PDEs. A dedicated boundary-condition operator further enforces Dirichlet constraints exactly at inference time. We validate CompNO on one-dimensional convection, diffusion, convection–diffusion and Burgers’ equations from the PDEBench suite. The proposed framework achieves lower relative L2 error than strong baselines (PFNO, PDEFormer and in-context learning based models) on linear parametric systems, while remaining competitive on nonlinear Burgers’ flows. The model maintains exact boundary satisfaction with zero loss at domain boundaries, and exhibits robust generalization across a broad range of Peclet and Reynolds numbers. These results demonstrate that compositional neural operators provide a scalable and physically interpretable pathway towards foundation models for PDEs.

[LG-22] SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models

链接: https://arxiv.org/abs/2601.07331
作者: Yuanhe Zhang,Jiayu Tian,Yibo Zhang,Shilinlu Yan,Liang Lin,Zhenhong Zhou,Li Sun,Sen Su
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model’s internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
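
The general idea of measuring noise in a model's activation space rather than in the raw waveform can be sketched as follows; the toy encoder and the energy-gap definition are assumptions, not the paper's SEE formula:

```python
# A minimal sketch (toy encoder, assumed definition) of a signal-embedding-
# energy style measurement: compare the energy of a model's internal
# representation of clean vs. noise-corrupted audio rather than raw waveforms.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256)) / 16.0          # stand-in for a frozen encoder layer

def embed(x):
    return np.tanh(W @ x)                      # internal activation (toy)

def embedding_energy_gap(clean, noisy):
    e_clean, e_noisy = embed(clean), embed(noisy)
    return float(np.sum((e_noisy - e_clean) ** 2))   # energy attributable to noise

clean = np.sin(np.linspace(0, 20, 256))
for noise_scale in (0.05, 0.2, 0.8):
    noisy = clean + noise_scale * rng.normal(size=256)
    print(f"noise scale {noise_scale}: gap = {embedding_energy_gap(clean, noisy):.3f}")
```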

[LG-23] Kernel Alignment-based Multi-view Unsupervised Feature Selection with Sample-level Adaptive Graph Learning

链接: https://arxiv.org/abs/2601.07288
作者: Yalan Tan,Yanyong Huang,Zongxin Shen,Dongjie Wang,Fengmao Lv,Tianrui Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although multi-view unsupervised feature selection (MUFS) has demonstrated success in dimensionality reduction for unlabeled multi-view data, most existing methods reduce feature redundancy by focusing on linear correlations among features but often overlook complex nonlinear dependencies. This limits the effectiveness of feature selection. In addition, existing methods fuse similarity graphs from multiple views by employing sample-invariant weights to preserve local structure. However, this process fails to account for differences in local neighborhood clarity among samples within each view, thereby hindering accurate characterization of the intrinsic local structure of the data. In this paper, we propose a Kernel Alignment-based multi-view unsupervised FeatUre selection with Sample-level adaptive graph lEarning method (KAFUSE) to address these issues. Specifically, we first employ kernel alignment with an orthogonal constraint to reduce feature redundancy in both linear and nonlinear relationships. Then, a cross-view consistent similarity graph is learned by applying sample-level fusion to each slice of a tensor formed by stacking similarity graphs from different views, which automatically adjusts the view weights for each sample during fusion. These two steps are integrated into a unified model for feature selection, enabling mutual enhancement between them. Extensive experiments on real multi-view datasets demonstrate the superiority of KAFUSE over state-of-the-art methods.

[LG-24] A High-Recall Cost-Sensitive Machine Learning Framework for Real-Time Online Banking Transaction Fraud Detection

链接: https://arxiv.org/abs/2601.07276
作者: Karthikeyan V. R.,Premnath S.,Kavinraaj S.,J. Sangeetha
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures. Submitted to arXiv as a preprint

点击查看摘要

Abstract:Fraudulent activity on digital banking services is becoming more intricate by the day, challenging existing defenses. While older rule-driven methods struggle to keep pace, even precision-focused algorithms fall short when new scams are introduced. These tools typically overlook subtle shifts in criminal behavior, missing crucial signals. Because silent breaches cost institutions far more than flagged but legitimate actions, catching every possible case is crucial, and high sensitivity to actual threats becomes essential when oversight leads to heavy losses. A key aim here is to reduce missed fraud cases without excessively increasing false alerts. This study builds a system using ensemble learning methods tuned through careful threshold choices. Using openly shared real-world transaction records, in which fraudulent acts rarely appear among normal activities, tests are run under realistically skewed class distributions. The results show that approximately 91 percent of actual fraud is detected, outperforming standard setups that rely on fixed rules when dealing with imbalanced classes. When tested in live settings, the fraud detection system connects directly to an online banking transaction flow, stopping questionable activities before they are completed. Alongside this setup, a browser add-on built for Chrome is designed to flag deceptive web links and reduce threats from harmful sites. These results show that adjusting decisions by cost impact and validating across entire systems makes deployment more stable and realistic for today’s digital banking platforms.
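
A minimal sketch of the cost-sensitive thresholding idea on synthetic scores — the cost ratio and score distributions are assumptions, not the paper's setup:

```python
# A minimal sketch (synthetic scores, assumed cost ratio) of cost-sensitive
# threshold selection for fraud detection: pick the decision threshold that
# minimizes expected cost when a missed fraud costs far more than a false alarm.
import numpy as np

rng = np.random.default_rng(0)
n = 20000
y = (rng.random(n) < 0.01).astype(int)                           # ~1% fraud, heavy imbalance
scores = np.where(y == 1, rng.beta(5, 2, n), rng.beta(2, 5, n))  # toy model outputs

cost_fn, cost_fp = 100.0, 1.0                                    # missed fraud >> false alarm
best_t, best_cost = None, np.inf
for t in np.linspace(0.01, 0.99, 99):
    pred = scores >= t
    cost = cost_fn * np.sum((y == 1) & ~pred) + cost_fp * np.sum((y == 0) & pred)
    if cost < best_cost:
        best_t, best_cost = t, cost

recall = np.sum((y == 1) & (scores >= best_t)) / y.sum()
print(f"threshold {best_t:.2f}, recall {recall:.3f}")
```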

[LG-25] Simulated Annealing-based Candidate Optimization for Batch Acquisition Functions

链接: https://arxiv.org/abs/2601.07258
作者: Sk Md Ahnaf Akif Alvi,Raymundo Arróyave,Douglas Allaire
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian Optimization with multi-objective acquisition functions such as q-Expected Hypervolume Improvement (qEHVI) requires efficient candidate optimization to maximize acquisition function values. Traditional approaches rely on continuous optimization methods like Sequential Least Squares Programming (SLSQP) for candidate selection. However, these gradient-based methods can become trapped in local optima, particularly in complex or high-dimensional objective landscapes. This paper presents a simulated annealing-based approach for candidate optimization in batch acquisition functions as an alternative to conventional continuous optimization methods. We evaluate our simulated annealing approach against SLSQP across four benchmark multi-objective optimization problems: ZDT1 (30D, 2 objectives), DTLZ2 (7D, 3 objectives), Kursawe (3D, 2 objectives), and Latent-Aware (4D, 2 objectives). Our results demonstrate that simulated annealing consistently achieves superior hypervolume performance compared to SLSQP in most test functions. The improvement is particularly pronounced for DTLZ2 and Latent-Aware problems, where simulated annealing reaches significantly higher hypervolume values and maintains better convergence characteristics. The histogram analysis of objective space coverage further reveals that simulated annealing explores more diverse and optimal regions of the Pareto front. These findings suggest that metaheuristic optimization approaches like simulated annealing can provide more robust and effective candidate optimization for multi-objective Bayesian optimization, offering a promising alternative to traditional gradient-based methods for batch acquisition function optimization.
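
As a concrete reference point, here is a generic simulated-annealing loop over a whole batch of candidates; the cooling schedule, proposal width, and the toy acquisition function are assumptions:

```python
# A minimal sketch (generic SA, assumed schedule) of simulated-annealing
# candidate optimization for a batch acquisition function: perturb the whole
# batch of q candidates and accept worse moves with a temperature-dependent
# probability, instead of following gradients as SLSQP would.
import numpy as np

def sa_optimize(acq, q=4, dim=3, iters=2000, t0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    cur = rng.uniform(0, 1, size=(q, dim))
    cur_val = acq(cur)
    best, best_val = cur.copy(), cur_val
    for k in range(iters):
        temp = t0 * (1 - k / iters)                    # linear cooling (assumed)
        cand = np.clip(cur + 0.1 * rng.normal(size=(q, dim)), 0, 1)
        val = acq(cand)
        if val > cur_val or rng.random() < np.exp((val - cur_val) / max(temp, 1e-9)):
            cur, cur_val = cand, val                   # accept (possibly worse) move
        if cur_val > best_val:
            best, best_val = cur.copy(), cur_val
    return best, best_val

# Toy stand-in for qEHVI: reward batches that are good and mutually diverse.
acq = lambda B: -np.sum((B - 0.7) ** 2) + 0.1 * np.sum(np.linalg.norm(B[:, None] - B[None], axis=-1))
print("best acquisition value:", sa_optimize(acq)[1])
```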

[LG-26] Innovation Capacity of Dynamical Learning Systems

链接: https://arxiv.org/abs/2601.07257
作者: Anthony M. Polloreno
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 12 pages, 3 figures

点击查看摘要

Abstract:In noisy physical reservoirs, the classical information-processing capacity C_\mathrm{ip} quantifies how well a linear readout can realize tasks measurable from the input history, yet C_\mathrm{ip} can be far smaller than the observed rank of the readout covariance. We explain this “missing capacity” by introducing the innovation capacity C_\mathrm{i} , the total capacity allocated to readout components orthogonal to the input filtration (Doob innovations, including input-noise mixing). Using a basis-free Hilbert-space formulation of the predictable/innovation decomposition, we prove the conservation law C_\mathrm{ip}+C_\mathrm{i}=\mathrm{rank}(\Sigma_{XX})\le d , so predictable and innovation capacities exactly partition the rank of the observable readout covariance \Sigma_{XX}\in\mathbb{R}^{d\times d} . In linear-Gaussian Johnson-Nyquist regimes, \Sigma_{XX}(T)=S+T N_0 , the split becomes a generalized-eigenvalue shrinkage rule and gives an explicit monotone tradeoff between temperature and predictable capacity. Geometrically, in whitened coordinates the predictable and innovation components correspond to complementary covariance ellipsoids, making C_\mathrm{i} a trace-controlled innovation budget. A large C_\mathrm{i} forces a high-dimensional innovation subspace with a variance floor, and under mild mixing and anti-concentration assumptions this yields extensive innovation-block differential entropy and exponentially many distinguishable histories. Finally, we give an information-theoretic lower bound showing that learning the induced innovation-block law in total variation requires a number of samples that scales with the effective innovation dimension, supporting the generative utility of noisy physical reservoirs.

[LG-27] Standardization of Post-Publication Code Verification by Journals is Possible with the Support of the Community

链接: https://arxiv.org/abs/2601.07189
作者: Susana Lopez-Moreno,Eric Dolores-Cuenca,Sangil Kim
类目: Machine Learning (cs.LG)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:Reproducibility remains a challenge in machine learning research. While code and data availability requirements have become increasingly common, post-publication verification in journals is still limited and unformalized. This position paper argues that it is plausible for journals and conference proceedings to implement post-publication verification. We propose a modification to ACM pre-publication verification badges that allows independent researchers to submit post-publication code replications to the journal, leading to visible verification badges included in the article metadata. Each article may earn up to two badges, each linked to verified code in its corresponding public repository. We describe the motivation, related initiatives, a formal framework, the potential impact, possible limitations, and alternative views.

[LG-28] Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization

链接: https://arxiv.org/abs/2601.07164
作者: Min Wang,Xin Li,Mingzhong Wang,Hasnaa Bennis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline meta-reinforcement learning (OMRL) combines the strengths of learning from diverse datasets in offline RL with the adaptability to new tasks of meta-RL, promising safe and efficient knowledge acquisition by RL agents. However, OMRL still suffers from extrapolation errors due to out-of-distribution (OOD) actions, compounded by the broad task distributions and Markov Decision Process (MDP) ambiguity of meta-RL setups. Existing research indicates that the generalization of the Q network affects the extrapolation error in offline RL. This paper investigates this relationship by decomposing the Q value into feature and weight components, observing that while decomposition enhances adaptability and convergence in the case of high-quality data, it often leads to policy degeneration or collapse in complex tasks. We observe that decomposed Q values introduce a large estimation bias when the feature encounters OOD samples, a phenomenon we term “feature overgeneralization”. To address this issue, we propose FLORA, which identifies OOD samples by modeling feature distributions and estimating their uncertainties. FLORA integrates a return feedback mechanism to adaptively adjust feature components. Furthermore, to learn precise task representations, FLORA explicitly models the complex task distribution using a chain of invertible transformations. We theoretically and empirically demonstrate that FLORA achieves rapid adaptation and meta-policy improvement compared to baselines across various environments.

[LG-29] Generating readily synthesizable small molecule fluorophore scaffolds with reinforcement learning

链接: https://arxiv.org/abs/2601.07145
作者: Ruhi Sayana,Kate Callon,Jennifer Xu,Jonathan Deutsch,Steven Chu,James Zou,John Janetzko,Rabindra V. Shivnaraine,Kyle Swanson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Developing new fluorophores for advanced imaging techniques requires exploring new chemical space. While generative AI approaches have shown promise in designing novel dye scaffolds, prior efforts often produced synthetically intractable candidates due to a lack of reaction constraints. Here, we developed SyntheFluor-RL, a generative AI model that employs known reaction libraries and molecular building blocks to create readily synthesizable fluorescent molecule scaffolds via reinforcement learning. To guide the generation of fluorophores, SyntheFluor-RL employs a scoring function built on multiple graph neural networks (GNNs) that predict key photophysical properties, including photoluminescence quantum yield, absorption, and emission wavelengths. These outputs are dynamically weighted and combined with a computed pi-conjugation score to prioritize candidates with desirable optical characteristics and synthetic feasibility. SyntheFluor-RL generated 11,590 candidate molecules, which were filtered to 19 structures predicted to possess dye-like properties. Of the 19 molecules, 14 were synthesized and 13 were experimentally confirmed. The top three were characterized, with the lead compound featuring a benzothiadiazole chromophore and exhibiting strong fluorescence (PLQY = 0.62), a large Stokes shift (97 nm), and a long excited-state lifetime (11.5 ns). These results demonstrate the effectiveness of SyntheFluor-RL in the identification of synthetically accessible fluorophores for further development.

[LG-30] Towards Automated Diagnosis of Inherited Arrhythmias: Combined Arrhythmia Classification Using Lead-Aware Spatial Attention Networks

链接: https://arxiv.org/abs/2601.07124
作者: Sophie Sigfstead,River Jiang,Brianna Davies,Zachary W. M. Laksman,Julia Cadrin-Tourigny,Rafik Tadros,Habib Khan,Joseph Atallah,Christian Steinberg,Shubhayan Sanatani,Mario Talajic,Rahul Krishnan,Andrew D. Krahn,Christopher C. Cheung
类目: Machine Learning (cs.LG)
*备注: 34 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Arrhythmogenic right ventricular cardiomyopathy (ARVC) and long QT syndrome (LQTS) are inherited arrhythmia syndromes associated with sudden cardiac death. Deep learning shows promise for ECG interpretation, but multi-class inherited arrhythmia classification with clinically grounded interpretability remains underdeveloped. Our objective was to develop and validate a lead-aware deep learning framework for multi-class (ARVC vs LQTS vs control) and binary inherited arrhythmia classification, and to determine optimal strategies for integrating ECG foundation models within arrhythmia screening tools. We assembled a 13-center Canadian cohort (645 patients; 1,344 ECGs). We evaluated four ECG foundation models using three transfer learning approaches: linear probing, fine-tuning, and combined strategies. We developed lead-aware spatial attention networks (LASAN) and assessed integration strategies combining LASAN with foundation models. Performance was compared against the established foundation model baselines. Lead-group masking quantified disease-specific lead dependence. Fine-tuning outperformed linear probing and combined strategies across all foundation models (mean macro-AUROC 0.904 vs 0.825). The best lead-aware integrations achieved near-ceiling performance (HuBERT-ECG hybrid: macro-AUROC 0.990; ARVC vs control AUROC 0.999; LQTS vs control AUROC 0.994). Lead masking demonstrated physiologic plausibility: V1-V3 were most critical for ARVC detection (4.54% AUROC reduction), while lateral leads were preferentially important for LQTS (2.60% drop). Lead-aware architectures achieved state-of-the-art performance for inherited arrhythmia classification, outperforming all existing published models on both binary and multi-class tasks while demonstrating clinically aligned lead dependence. These findings support potential utility for automated ECG screening pending validation.

[LG-31] Reward-Preserving Attacks For Robust Reinforcement Learning

链接: https://arxiv.org/abs/2601.07118
作者: Lucas Schott,Elies Gherbi,Hatem Hajri,Sylvain Lamprier
类目: Machine Learning (cs.LG)
*备注: 19 pages, 6 figures, 4 algorithms, preprint

点击查看摘要

Abstract:Adversarial robustness in RL is difficult because perturbations affect entire trajectories: strong attacks can break learning, while weak attacks yield little robustness, and the appropriate strength varies by state. We propose \alpha-reward-preserving attacks, which adapt the strength of the adversary so that an \alpha fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, we use a gradient-based attack direction and learn a state-dependent magnitude \eta \le \eta_{\mathcal{B}} selected via a critic Q^\pi_\alpha((s,a),\eta) trained off-policy over diverse radii. This adaptive tuning calibrates attack strength and, with intermediate \alpha , improves robustness across radii while preserving nominal performance, outperforming fixed- and random-radius baselines.
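
A minimal sketch of the state-dependent radius selection, with a hypothetical critic standing in for the trained Q^\pi_\alpha; the grid search and the alpha-gap rule below are illustrative assumptions:

```python
# A minimal sketch (toy stand-ins, assumed interfaces) of a state-dependent
# attack radius: take a gradient-based perturbation direction and pick the
# magnitude eta <= eta_max whose predicted attacked return, per a critic
# Q(s, eta), preserves an alpha fraction of the nominal-to-worst-case gap.
import numpy as np

def choose_eta(state, critic, eta_max, alpha, n_grid=16):
    etas = np.linspace(0.0, eta_max, n_grid)
    q = np.array([critic(state, e) for e in etas])     # predicted return per radius
    target = q[0] - alpha * (q[0] - q.min())           # keep alpha of the gap
    feasible = etas[q >= target]
    return feasible.max() if feasible.size else 0.0    # strongest radius still allowed

# Toy critic: return degrades smoothly with radius, faster for "fragile" states.
fragility = 3.0
critic = lambda s, e: 1.0 - fragility * e ** 2
print("chosen eta:", choose_eta(None, critic, eta_max=0.5, alpha=0.5))
```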

[LG-32] When Should We Introduce Safety Interventions During Pretraining?

链接: https://arxiv.org/abs/2601.07087
作者: Dylan Sam,Sachin Goyal,Pratyush Maini,Alexander Robey,J. Zico Kolter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensuring the safety of language models in high-stakes settings remains a pressing challenge, as aligned behaviors are often brittle and easily undone by adversarial pressure or downstream finetuning. Prior work has shown that interventions applied during pretraining, such as rephrasing harmful content, can substantially improve the safety of the resulting models. In this paper, we study the fundamental question: “When during pretraining should safety interventions be introduced?” We keep the underlying data fixed and vary only the choice of a safety curriculum: the timing of these interventions, i.e., after 0%, 20%, or 60% of the pretraining token budget. We find that introducing interventions earlier generally yields more robust models with no increase in overrefusal rates, with the clearest benefits appearing after downstream, benign finetuning. We also see clear benefits in the steerability of models towards safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs harmful examples. Overall, these results argue for incorporating safety signals early in pretraining, producing models that are more robust to downstream finetuning and jailbreaking, and more reliable under both standard and safety-aware inference procedures.

[LG-33] ght Analysis of Decentralized SGD: A Markov Chain Perspective

链接: https://arxiv.org/abs/2601.07021
作者: Lucas Versini,Paul Mangold,Aymeric Dieuleveut
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel analysis of the Decentralized Stochastic Gradient Descent (DSGD) algorithm with constant step size, interpreting the iterates of the algorithm as a Markov chain. We show that DSGD converges to a stationary distribution, with its bias, to first order, decomposable into two components: one due to decentralization (growing with the graph’s spectral gap and clients’ heterogeneity) and one due to stochasticity. Remarkably, the variance of local parameters is, to first order, inversely proportional to the number of clients, regardless of the network topology and even when clients’ iterates are not averaged at the end. As a consequence of our analysis, we obtain non-asymptotic convergence bounds for clients’ local iterates, confirming that DSGD has linear speed-up in the number of clients, and that the network topology only impacts higher-order terms.
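
The algorithm under analysis is standard constant-step-size DSGD; here is a minimal sketch on toy quadratic objectives with a ring gossip matrix (all sizes are assumed):

```python
# A minimal sketch (quadratic objectives, assumed gossip matrix) of constant
# step-size decentralized SGD: each client takes a local stochastic gradient
# step, then averages parameters with its neighbors via a doubly stochastic
# mixing matrix W.
import numpy as np

rng = np.random.default_rng(0)
n, d, lr, steps = 5, 3, 0.05, 500
targets = rng.normal(size=(n, d))          # heterogeneous local optima

# Ring topology: mix with the two neighbors (doubly stochastic).
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

theta = np.zeros((n, d))
for _ in range(steps):
    noise = 0.1 * rng.normal(size=(n, d))
    grads = (theta - targets) + noise      # stochastic gradient of 0.5*||x - t_i||^2
    theta = W @ (theta - lr * grads)       # local step followed by gossip averaging

consensus = theta.mean(axis=0)
print("distance to average optimum:", np.linalg.norm(consensus - targets.mean(axis=0)))
print("client disagreement:", np.linalg.norm(theta - consensus))
```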

[LG-34] Generalization Bounds for Transformer Channel Decoders

链接: https://arxiv.org/abs/2601.06969
作者: Qinshan Zhang,Bin Chen,Yong Jiang,Shu-Tao Xia
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:Transformer channel decoders, such as the Error Correction Code Transformer (ECCT), have shown strong empirical performance in channel decoding, yet their generalization behavior remains theoretically unclear. This paper studies the generalization performance of ECCT from a learning-theoretic perspective. By establishing a connection between multiplicative noise estimation errors and bit-error-rate (BER), we derive an upper bound on the generalization gap via bit-wise Rademacher complexity. The resulting bound characterizes the dependence on code length, model parameters, and training set size, and applies to both single-layer and multi-layer ECCTs. We further show that parity-check-based masked attention induces sparsity that reduces the covering number, leading to a tighter generalization bound. To the best of our knowledge, this work provides the first theoretical generalization guarantees for this class of decoders.

[LG-35] A Robust Certified Machine Unlearning Method Under Distribution Shift

链接: https://arxiv.org/abs/2601.06967
作者: Jinduo Guo,Yinzhi Cao
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The Newton method has been widely adopted to achieve certified unlearning. A critical assumption in existing approaches is that the data requested for unlearning are selected i.i.d. (independent and identically distributed). However, the problem of certified unlearning under non-i.i.d. deletions remains largely unexplored. In practice, unlearning requests are inherently biased, leading to non-i.i.d. deletions and causing distribution shifts between the original and retained datasets. In this paper, we show that certified unlearning with the Newton method becomes inefficient and ineffective under non-i.i.d. unlearning sets. We then propose a better approach: a distribution-aware certified unlearning framework based on iterative Newton updates constrained by a trust region. Our method provides a closer approximation to the retrained model and yields a tighter pre-run bound on the gradient residual, thereby ensuring efficient (epsilon, delta)-certified unlearning. To demonstrate its practical effectiveness under distribution shift, we also conduct extensive experiments across multiple evaluation metrics, providing a comprehensive assessment of our approach.
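
A minimal sketch of a trust-region-capped Newton unlearning step for regularized logistic regression; the cap radius, noise scale, and data are illustrative assumptions, not the paper's calibrated constants:

```python
# A minimal sketch (logistic regression, assumed constants) of a Newton-style
# unlearning update with a trust region: recompute gradient/Hessian on the
# retained data only, take the Newton step, cap its norm, then add Gaussian
# noise for an (epsilon, delta)-style guarantee.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_unlearn(theta, X_ret, y_ret, lam=0.1, radius=0.5, noise_std=0.01, seed=0):
    """One trust-region-capped Newton step toward the retrained model."""
    p = sigmoid(X_ret @ theta)
    g = X_ret.T @ (p - y_ret) / len(X_ret) + lam * theta                 # retained gradient
    H = (X_ret.T * (p * (1 - p))) @ X_ret / len(X_ret) + lam * np.eye(len(theta))
    step = np.linalg.solve(H, g)
    nrm = np.linalg.norm(step)
    if nrm > radius:
        step *= radius / nrm                     # trust-region cap on the update
    rng = np.random.default_rng(seed)
    return theta - step + noise_std * rng.normal(size=len(theta))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)); w = rng.normal(size=5)
y = (sigmoid(X @ w) > 0.5).astype(float)
theta0 = np.zeros(5)
# Pretend the first 50 rows were (possibly non-i.i.d.) deletion requests:
print(newton_unlearn(theta0, X[50:], y[50:]))
```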

[LG-36] HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression

链接: https://arxiv.org/abs/2601.06959
作者: Vladimer Khasia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Post-training quantization is essential for deploying Large Language Models (LLMs) on resource-constrained devices. However, standard integer quantization (e.g., INT4) fundamentally degrades performance by imposing a uniform grid on the heavy-tailed distribution of weight parameters, particularly in smaller-scale models (e.g., 2B parameters). We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution using second-order sensitivity analysis. HAS-VQ employs a Hessian-Masked Decoupling strategy to isolate sensitive parameters, followed by robust Vector Quantization (VQ) of the remaining dense body. Crucially, we introduce a residual sparse feedback mechanism that corrects quantization errors in the most sensitive dimensions, ensuring exact reconstruction of outliers. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority: (1) Pareto Dominance over Integer Baselines: At 4.23 effective bits-per-parameter (BPP), we achieve a perplexity of 14.23, significantly outperforming the standard INT4 baseline (20.03 PPL at 4.71 BPP). (2) High-Fidelity Compression: Relative to the FP16 baseline, HAS-VQ achieves a 2.3x reduction in model size (7.03 BPP) while maintaining statistically indistinguishable perplexity (10.12 vs. 10.04), effectively offering a lossless compression alternative for bandwidth-constrained environments. The code is available at this https URL

[LG-37] Towards Operational Streamflow Forecasting in the Limpopo River Basin using Long Short-Term Memory Networks

链接: https://arxiv.org/abs/2601.06941
作者: James Tlhomole,Edoardo Borgomeo,Karthikeyan Matheswaran,Mariangel Garcia Andarcia
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 14 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Robust hydrological simulation is key for sustainable development, water management strategies, and climate change adaptation. In recent years, deep learning methods have been demonstrated to outperform mechanistic models at the task of hydrological discharge simulation. Adoption of these methods has been catalysed by the proliferation of large sample hydrology datasets, consisting of the observed discharge and meteorological drivers, along with geological and topographical catchment descriptors. Deep learning methods infer rainfall-runoff characteristics that have been shown to generalise across catchments, benefitting from the data diversity in large datasets. Despite this, application to catchments in Africa has been limited. The lack of adoption of deep learning methodologies is primarily due to sparsity or lack of the spatiotemporal observational data required to enable downstream model training. We therefore investigate the application of deep learning models, including LSTMs, for hydrological discharge simulation in the transboundary Limpopo River basin, emphasising application to data scarce regions. We conduct a number of computational experiments primarily focused on assessing the impact of varying the LSTM model input data on performance. Results confirm that data constraints remain the largest obstacle to deep learning applications across African river basins. We further outline the impact of human influence on data-driven modelling which is a commonly overlooked aspect of data-driven large-sample hydrology approaches and investigate solutions for model adaptation under smaller datasets. Additionally, we include recommendations for future efforts towards seasonal hydrological discharge prediction and direct comparison or inclusion of SWAT model outputs, as well as architectural improvements.

[LG-38] Forgetting Similar Samples: Can Machine Unlearning Do it Better?

链接: https://arxiv.org/abs/2601.06938
作者: Heng Xu,Tianqing Zhu,Dayong Ye,Lefeng Zhang,Le Wang,Wanlei Zhou
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Machine unlearning, a process enabling pre-trained models to remove the influence of specific training samples, has attracted significant attention in recent years. Although extensive research has focused on developing efficient machine unlearning strategies, we argue that these methods mainly aim at removing samples rather than removing samples’ influence on the model, thus overlooking the fundamental definition of machine unlearning. In this paper, we first conduct a comprehensive study to evaluate the effectiveness of existing unlearning schemes when the training dataset includes many samples similar to those targeted for unlearning. Specifically, we evaluate: Do existing unlearning methods truly adhere to the original definition of machine unlearning and effectively eliminate all influence of target samples when similar samples are present in the training dataset? Our extensive experiments, conducted on four carefully constructed datasets with thorough analysis, reveal a notable gap between the expected and actual performance of most existing unlearning methods for image and language models, even for the retraining-from-scratch baseline. Additionally, we also explore potential solutions to enhance current unlearning approaches.

[LG-39] Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems

链接: https://arxiv.org/abs/2601.06916
作者: Mohammed Azeez Khan,Aaron D’Souza,Vijay Choyal
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Efficient discovery of new materials demands strategies to reduce the number of costly first-principles calculations required to train predictive machine learning models. We develop and validate an active learning framework that iteratively selects informative training structures for machine-learned interatomic potentials (MLIPs) from large, heterogeneous materials databases, specifically the Materials Project and OQMD. Our framework integrates compositional and property-based descriptors with a neural network ensemble model, enabling real-time uncertainty quantification via Query-by-Committee. We systematically compare four selection strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach balancing both objectives. Experiments across four representative material systems (elemental carbon, silicon, iron, and a titanium-oxide compound) with 5 random seeds per configuration demonstrate that diversity sampling consistently achieves competitive or superior performance, with particularly strong advantages on complex systems like titanium-oxide (10.9% improvement, p=0.008). Our results show that intelligent data selection strategies can achieve target accuracy with 5-13% fewer labeled samples compared to random baselines. The entire pipeline executes on Google Colab in under 4 hours per system using less than 8 GB of RAM, thereby democratizing MLIP development for researchers globally with limited computational resources. Our open-source code and detailed experimental configurations are available on GitHub. This multi-system evaluation establishes practical guidelines for data-efficient MLIP training and highlights promising future directions including integration with symmetry-aware neural network architectures.
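
A minimal sketch of the Query-by-Committee selection loop, using sklearn gradient-boosted trees as a stand-in for the paper's neural network ensemble; committee size and target jitter are assumptions:

```python
# A minimal sketch (toy regressors, assumed committee size) of the
# Query-by-Committee loop used for uncertainty-based selection: train an
# ensemble, score the unlabeled pool by prediction variance, and label the
# most disagreed-upon structures first.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_pool = rng.uniform(-2, 2, size=(500, 4))
f = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * X[:, 2]   # stand-in "DFT label"
labeled = list(range(20))                       # initial random labels

for round_ in range(5):
    X_tr = X_pool[labeled]; y_tr = f(X_tr)
    committee = [GradientBoostingRegressor(random_state=s).fit(
        X_tr, y_tr + 0.01 * rng.normal(size=len(y_tr))) for s in range(5)]
    preds = np.stack([m.predict(X_pool) for m in committee])
    var = preds.var(axis=0)                     # committee disagreement per sample
    var[labeled] = -1.0                         # never re-query labeled points
    labeled.append(int(np.argmax(var)))         # query the most uncertain sample
    print(f"round {round_}: queried index {labeled[-1]}, max variance {var.max():.4f}")
```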

[LG-40] Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities NEURIPS2025

链接: https://arxiv.org/abs/2601.06913
作者: Taehyun Hwang,Dahngoon Kim,Min-hwan Oh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:We study the multinomial logit (MNL) contextual bandit problem for sequential assortment selection. Although most existing research assumes utility functions to be linear in item features, this linearity assumption restricts the modeling of intricate interactions between items and user preferences. A recent work (Zhang &amp; Luo, 2024) has investigated general utility function classes, yet its method faces fundamental trade-offs between computational tractability and statistical efficiency. To address this limitation, we propose a computationally efficient algorithm for MNL contextual bandits leveraging the upper confidence bound principle, specifically designed for non-linear parametric utility functions, including those modeled by neural networks. Under a realizability assumption and a mild geometric condition on the utility function class, our algorithm achieves a regret bound of \tilde{O}(\sqrt{T}) , where T denotes the total number of rounds. Our result establishes that sharp \tilde{O}(\sqrt{T}) -regret is attainable even with neural network-based utilities, without relying on strong assumptions such as neural tangent kernel approximations. To the best of our knowledge, our proposed method is the first computationally tractable algorithm for MNL contextual bandits with non-linear utilities that provably attains \tilde{O}(\sqrt{T}) regret. Comprehensive numerical experiments validate the effectiveness of our approach, showing robust performance not only in realizable settings but also in scenarios with model misspecification.

[LG-41] Applying Embedding-Based Retrieval to Airbnb Search

链接: https://arxiv.org/abs/2601.06873
作者: Mustafa Abdool,Soumyadip Banerjee,Moutupsi Paul,Do-kyum Kim,Xioawei Liu,Bin Xu,Tracy Yu,Hui Gao,Karen Ouyang,Huiji Gao,Liwei He,Stephanie Moyerman,Sanjeev Katariya
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:The goal of Airbnb search is to match guests with the ideal accommodation that fits their travel needs. This is a challenging problem, as popular search locations can have around a hundred thousand available homes, and guests themselves have a wide variety of preferences. Furthermore, the launch of new product features, such as flexible date search, significantly increased the number of eligible homes per search query. As such, there is a need for a sophisticated retrieval system which can provide high-quality candidates with low latency in a way that integrates with the overall ranking stack. This paper details our journey to build an efficient and high-quality retrieval system for Airbnb search. We describe the key unique challenges we encountered when implementing an Embedding-Based Retrieval (EBR) system for a two-sided marketplace like Airbnb – such as the dynamic nature of the inventory, a lengthy user funnel with multiple stages, and a variety of product surfaces. We cover unique insights when modeling the retrieval problem, how to build robust evaluation systems, and design choices for online serving. The EBR system was launched to production and powers several use-cases such as regular search, flexible date and promotional emails for marketing campaigns. The system demonstrated statistically-significant improvements in key metrics, such as booking conversion, via A/B testing.

[LG-42] U-MASK: User-adaptive Spatio-Temporal Masking for Personalized Mobile AI Applications

链接: https://arxiv.org/abs/2601.06867
作者: Shiyuan Zhang,Yilai Liu,Yuwei Du,Ruoxuan Yang,Dong In Kim,Hongyang Du
类目: Machine Learning (cs.LG)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:Personalized mobile artificial intelligence applications are widely deployed, yet they are expected to infer user behavior from sparse and irregular histories under a continuously evolving spatio-temporal context. This setting induces a fundamental tension among three requirements, i.e., immediacy to adapt to recent behavior, stability to resist transient noise, and generalization to support long-horizon prediction and cold-start users. Most existing approaches satisfy at most two of these requirements, resulting in an inherent impossibility triangle in data-scarce, non-stationary personalization. To address this challenge, we model mobile behavior as a partially observed spatio-temporal tensor and unify short-term adaptation, long-horizon forecasting, and cold-start recommendation as a conditional completion problem, where a user- and task-specific mask specifies which coordinates are treated as evidence. We propose U-MASK, a user-adaptive spatio-temporal masking method that allocates evidence budgets based on user reliability and task sensitivity. To enable mask generation under sparse observations, U-MASK learns a compact, task-agnostic user representation from app and location histories via U-SCOPE, which serves as the sole semantic conditioning signal. A shared diffusion transformer then performs mask-guided generative completion while preserving observed evidence, so personalization and task differentiation are governed entirely by the mask and the user representation. Experiments on real-world mobile datasets demonstrate consistent improvements over state-of-the-art methods across short-term prediction, long-horizon forecasting, and cold-start settings, with the largest gains under severe data sparsity. The code and dataset will be available at this https URL.

[LG-43] Analyzing the effect of prediction accuracy on the distributionally-robust competitive ratio

链接: https://arxiv.org/abs/2601.06813
作者: Toru Yoshinaga,Yasushi Kawase
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:The field of algorithms with predictions aims to improve algorithm performance by integrating machine learning predictions into algorithm design. A central question in this area is how predictions can improve performance, and a key aspect of this analysis is the role of prediction accuracy. In this context, prediction accuracy is defined as a guaranteed probability that an instance drawn from the distribution belongs to the predicted set. As a performance measure that incorporates prediction accuracy, we focus on the distributionally-robust competitive ratio (DRCR), introduced by Sun et al.~(ICML 2024). The DRCR is defined as the expected ratio between the algorithm’s cost and the optimal cost, where the expectation is taken over the worst-case instance distribution that satisfies the given prediction and accuracy requirement. A known structural property is that, for any fixed algorithm, the DRCR decreases linearly as prediction accuracy increases. Building on this result, we establish that the optimal DRCR value (i.e., the infimum over all algorithms) is a monotone and concave function of prediction accuracy. We further generalize the DRCR framework to a multiple-prediction setting and show that monotonicity and concavity are preserved in this setting. Finally, we apply our results to the ski rental problem, a benchmark problem in online optimization, to identify the conditions on prediction accuracies required for the optimal DRCR to attain a target value. Moreover, we provide a method for computing the critical accuracy, defined as the minimum accuracy required for the optimal DRCR to strictly improve upon the performance attainable without any accuracy guarantee.

[LG-44] Cross-Modal Computational Model of Brain-Heart Interactions via HRV and EEG Feature

链接: https://arxiv.org/abs/2601.06792
作者: Malavika Pradeep,Akshay Sasi,Nusaibah Farrukh,Rahul Venugopal,Elizabeth Sherly
类目: Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, Code available at: this https URL . Presented at AIHC (not published)

点击查看摘要

Abstract:The electroencephalogram (EEG) has been the gold standard for quantifying mental workload; however, due to its complexity and non-portability, it can be constraining. ECG signals, which are feasible on wearable equipment such as headbands, present a promising method for cognitive state monitoring. This research explores whether electrocardiogram (ECG) signals can consistently indicate mental workload and act as surrogates for EEG-based cognitive indicators. This study investigates whether ECG-derived features can serve as surrogate indicators of cognitive load, a concept traditionally quantified using EEG. Using a publicly available multimodal dataset (OpenNeuro) of EEG and ECG recorded during working-memory and listening tasks, HRV features and Catch22 descriptors are extracted from ECG, and spectral band-power with Catch22 features from EEG. A cross-modal regression framework based on XGBoost was trained to map ECG-derived HRV representations to EEG-derived cognitive features. In order to address data sparsity and model brain-heart interactions, we integrated the PSV-SDG to produce EEG-conditioned synthetic HRV time series. This addresses the challenge of inferring cognitive load solely from ECG-derived features using a combination of multimodal learning, signal processing, and synthetic data generation. These outcomes form a basis for light, interpretable machine learning models that are implemented through wearable biosensors in non-lab environments. Synthetic HRV inclusion enhances robustness, particularly in sparse-data situations. Overall, this work is a first step toward building low-cost, explainable, and real-time cognitive monitoring systems for mental health, education, and human-computer interaction, with a focus on ageing and clinical populations.
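
A minimal sketch of the cross-modal regression step, using sklearn gradient boosting as a stand-in for XGBoost and random tables in place of the real HRV/EEG feature extractors:

```python
# A minimal sketch (random feature tables, sklearn stand-in for XGBoost) of
# the cross-modal regression idea: map HRV-style features extracted from ECG
# to EEG-derived band-power targets, one boosted model per target feature.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
n = 400
hrv = rng.normal(size=(n, 12))                 # e.g. SDNN, RMSSD, Catch22 descriptors
mix = rng.normal(size=(12, 4))
eeg = np.tanh(hrv @ mix) + 0.1 * rng.normal(size=(n, 4))   # e.g. theta/alpha/beta power

X_tr, X_te, y_tr, y_te = train_test_split(hrv, eeg, random_state=0)
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0)).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("per-band correlation:",
      [round(float(np.corrcoef(pred[:, k], y_te[:, k])[0, 1]), 3) for k in range(4)])
```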

[LG-45] Structure-preserving learning and prediction in optimal control of collective motion

链接: https://arxiv.org/abs/2601.06770
作者: Sofiia Huraka,Vakhtang Putkaradze
类目: Machine Learning (cs.LG)
*备注: 55 pages, 9 figures

点击查看摘要

Abstract:Wide-spread adoption of unmanned vehicle technologies requires the ability to predict the motion of the combined vehicle operation from observations. While the general prediction of such motion for an arbitrary control mechanism is difficult, for a particular choice of control, the dynamics reduces to the Lie-Poisson equations [33,34]. Our goal is to learn the phase-space dynamics and predict the motion solely from observations, without any knowledge of the control Hamiltonian or the nature of interaction between vehicles. To achieve that goal, we propose the Control Optimal Lie-Poisson Neural Networks (CO-LPNets) for learning and predicting the dynamics of the system from data. Our methods learn the mapping of the phase space through the composition of Poisson maps, which are obtained as flows from Hamiltonians that could be integrated explicitly. CO-LPNets preserve the Poisson bracket and thus preserve Casimirs to machine precision. We discuss the completeness of the derived neural networks and their efficiency in approximating the dynamics. To illustrate the power of the method, we apply these techniques to systems of N=3 particles evolving on the \mathrm{SO}(3) group, which describe coupled rigid bodies rotating about their center of mass, and the \mathrm{SE}(3) group, applicable to the movement of unmanned air and water vehicles. Numerical results demonstrate that CO-LPNets learn the dynamics in phase space from data points and reproduce trajectories, with good accuracy, over hundreds of time steps. The method uses a limited number of points ( \sim 200 per dimension) and parameters ( \sim 1000 in our case), demonstrating potential for practical applications and edge deployment.

[LG-46] ALFA: A Safe-by-Design Approach to Mitigate Quishing Attacks Launched via Fancy QR Codes ESORICS

链接: https://arxiv.org/abs/2601.06768
作者: Muhammad Wahid Akram,Keshav Sood,Muneeb Ul Hassan,Dhananjay Thiruvady
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: LNCS Springer Template (19 pages, 5 figures, 4 tables). This paper is currently submitted to 31st European Symposium on Research in Computer Security (ESORICS) 2026 for publication

点击查看摘要

Abstract:Phishing with Quick Response (QR) codes is termed Quishing. Attackers exploit this method to manipulate individuals into revealing their confidential data. Recently, colorful and fancy representations of QR codes have appeared, whose 2D matrix no longer reflects the typical mixture of black-and-white modules. These become more tempting as an attack vector for adversaries because they can evade state-of-the-art deep learning visual-based and other prevailing countermeasures. We introduce "ALFA", a safe-by-design approach, to mitigate Quishing and prevent users from accessing the post-scan harmful payload of fancy QR codes. Our method first converts a fancy QR code into a replica binary grid and then identifies erroneously represented modules in that grid. Following that, we present the "FAST" method, which can conveniently recover erroneous modules from that binary grid. Afterwards, using this binary grid, our solution extracts the structural features of the fancy QR code and predicts its legitimacy using a pre-trained model. The effectiveness of our proposal is demonstrated by experimental evaluation on a synthetic dataset (containing diverse variations of fancy QR codes), achieving an FNR of only 0.06%. We also develop a mobile app to test the practical feasibility of our solution and compare its performance with real-world QR readers. This comparison further highlights the classification reliability and detection accuracy of our solution in real-world environments.

[LG-47] Comparative Separation: Evaluating Separation on Comparative Judgment Test Data

链接: https://arxiv.org/abs/2601.06761
作者: Xiaoyin Xi,Neeku Capak,Kate Stockwell,Zhe Yu
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 10 pages, 8 tables, 1 figure

点击查看摘要

Abstract:This research seeks to benefit the software engineering society by proposing comparative separation, a novel group fairness notion to evaluate the fairness of machine learning software on comparative judgment test data. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. It is the responsibility of all software developers to make their software accountable by ensuring that the machine learning software do not perform differently on different sensitive groups – satisfying the separation criterion. However, evaluation of separation requires ground truth labels for each test data point. This motivates our work on analyzing whether separation can be evaluated on comparative judgment test data. Instead of asking humans to provide the ratings or categorical labels on each test data point, comparative judgments are made between pairs of data points such as A is better than B. According to the law of comparative judgment, providing such comparative judgments yields a lower cognitive burden for humans than providing ratings or categorical labels. This work first defines the novel fairness notion comparative separation on comparative judgment test data, and the metrics to evaluate comparative separation. Then, both theoretically and empirically, we show that in binary classification problems, comparative separation is equivalent to separation. Lastly, we analyze the number of test data points and test data pairs required to achieve the same level of statistical power in the evaluation of separation and comparative separation, respectively. This work is the first to explore fairness evaluation on comparative judgment test data. It shows the feasibility and the practical benefits of using comparative judgment test data for model evaluations.
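
A toy illustration of the idea — evaluating a classifier's pairwise ordering quality per sensitive group from comparative judgments rather than per-point labels; the paper's formal metrics differ in detail:

```python
# Toy sketch (not the paper's exact metric): compute the fraction of
# (positive, negative) comparative pairs the model orders correctly, separately
# per sensitive group. Similar per-group concordance is the comparative
# analogue of the separation criterion for binary classification.
import numpy as np

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=400)          # sensitive attribute
y = rng.integers(0, 2, size=400)              # latent binary ground truth
score = y + 0.8 * rng.normal(size=400)        # model scores

def pairwise_concordance(scores, labels):
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

for g in (0, 1):
    m = group == g
    print(f"group {g}: concordance = {pairwise_concordance(score[m], y[m]):.3f}")
```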

[LG-48] A Backpropagation-Free Feedback-Hebbian Network for Continual Learning Dynamics

链接: https://arxiv.org/abs/2601.06758
作者: Josh Li
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:Feedback-rich neural architectures can regenerate earlier representations and inject temporal context, making them a natural setting for strictly local synaptic plasticity. We ask whether a minimal, backpropagation-free feedback–Hebbian system can already express interpretable continual-learning–relevant behaviors under controlled training schedules. We introduce a compact prediction–reconstruction architecture with two feedforward layers for supervised association learning and two dedicated feedback layers trained to reconstruct earlier activity and re-inject it as additive temporal context. All synapses are updated by a unified local rule combining centered Hebbian covariance, Oja-style stabilization, and a local supervised drive where targets are available, requiring no weight transport or global error backpropagation. On a small two-pair association task, we characterize learning through layer-wise activity snapshots, connectivity trajectories (row/column means of learned weights), and a normalized retention index across phases. Under sequential A-B training, forward output connectivity exhibits a long-term depression (LTD)-like suppression of the earlier association while feedback connectivity preserves an A-related trace during acquisition of B. Under deterministic interleaving A,B,A,B,…, both associations are concurrently maintained rather than sequentially suppressed. Architectural controls and rule-term ablations isolate the role of dedicated feedback in regeneration and co-maintenance, and the role of the local supervised term in output selectivity and unlearning. Together, the results show that a compact feedback pathway trained with local plasticity can support regeneration and continual-learning–relevant dynamics in a minimal, mechanistically transparent setting.
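
A compact sketch of a unified local rule with the three named terms; the exact functional forms and coefficients below are illustrative assumptions, not the paper's equations:

```python
# Strictly local update: centered Hebbian covariance + Oja-style stabilization
# + a local supervised drive where targets are available. No backprop, no
# weight transport; every term uses only pre/post activity at one synapse layer.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, eta, tau = 8, 4, 0.01, 0.99
W = 0.1 * rng.normal(size=(n_out, n_in))
x_bar, y_bar = np.zeros(n_in), np.zeros(n_out)   # running activity means

for _ in range(200):                             # toy stream of (input, target) pairs
    x = rng.normal(size=n_in)
    t = np.tanh(rng.normal(size=n_out))
    y = np.tanh(W @ x)
    x_bar = tau * x_bar + (1 - tau) * x          # centering statistics
    y_bar = tau * y_bar + (1 - tau) * y
    hebb = np.outer(y - y_bar, x - x_bar)        # centered Hebbian covariance
    oja = (y ** 2)[:, None] * W                  # Oja-style norm control
    sup = np.outer(t - y, x)                     # local supervised drive
    W += eta * (hebb - oja + sup)
```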

[LG-49] Federated Continual Learning for Privacy-Preserving Hospital Imaging Classification

链接: https://arxiv.org/abs/2601.06742
作者: Anay Sinhal,Arpana Sinhal,Amit Sinhal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models for radiology interpretation increasingly rely on multi-institutional data, yet privacy regulations and distribution shift across hospitals limit central data pooling. Federated learning (FL) allows hospitals to collaboratively train models without sharing raw images, but current FL algorithms typically assume a static data distribution. In practice, hospitals experience continual evolution in case mix, annotation protocols, and imaging devices, which leads to catastrophic forgetting when models are updated sequentially. Federated continual learning (FCL) aims to reconcile these challenges but existing methods either ignore the stringent privacy constraints of healthcare or rely on replay buffers and public surrogate datasets that are difficult to justify in clinical settings. We study FCL for chest radiography classification in a setting where hospitals are clients that receive temporally evolving streams of cases and labels. We introduce DP-FedEPC (Differentially Private Federated Elastic Prototype Consolidation), a method that combines elastic weight consolidation (EWC), prototype-based rehearsal, and client-side differential privacy within a standard FedAvg framework. EWC constrains updates along parameters deemed important for previous tasks, while a memory of latent prototypes preserves class structure without storing raw images. Differentially private stochastic gradient descent (DP-SGD) at each client adds calibrated Gaussian noise to clipped gradients, providing formal privacy guarantees for individual radiographs.
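
To make the three ingredients concrete, here is a minimal single-step sketch (my own illustration, not the authors' code): per-example gradient clipping with Gaussian noise (DP-SGD) combined with a Fisher-weighted EWC pull toward previous-task parameters. All names and hyperparameters are placeholders.

```python
# Schematic client-side step: clip per-example gradients, add calibrated noise
# (DP-SGD), and add the EWC quadratic penalty gradient toward the parameters
# theta_star that were important (per the Fisher diagonal) for earlier tasks.
import numpy as np

def dp_ewc_step(theta, per_example_grads, theta_star, fisher,
                lam=1.0, clip=1.0, sigma=0.8, lr=0.1,
                rng=np.random.default_rng(0)):
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]              # per-example L2 clipping
    g_bar = np.mean(clipped, axis=0)
    g_priv = g_bar + rng.normal(scale=sigma * clip / len(clipped),
                                size=theta.shape)        # Gaussian DP noise
    g_ewc = lam * fisher * (theta - theta_star)          # EWC penalty gradient
    return theta - lr * (g_priv + g_ewc)

theta = np.zeros(10)
grads = [np.random.default_rng(i).normal(size=10) for i in range(32)]  # placeholders
theta = dp_ewc_step(theta, grads, theta_star=np.zeros(10), fisher=np.ones(10))
```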

[LG-50] Predicting Student Success with Heterogeneous Graph Deep Learning and Machine Learning Models

链接: https://arxiv.org/abs/2601.06729
作者: Anca Muresan,Mihaela Cardei,Ionut Cardei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Early identification of student success is crucial for enabling timely interventions, reducing dropout rates, and promoting on time graduation. In educational settings, AI powered systems have become essential for predicting student performance due to their advanced analytical capabilities. However, effectively leveraging diverse student data to uncover latent and complex patterns remains a key challenge. While prior studies have explored this area, the potential of dynamic data features and multi category entities has been largely overlooked. To address this gap, we propose a framework that integrates heterogeneous graph deep learning models to enhance early and continuous student performance prediction, using traditional machine learning algorithms for comparison. Our approach employs a graph metapath structure and incorporates dynamic assessment features, which progressively influence the student success prediction task. Experiments on the Open University Learning Analytics (OULA) dataset demonstrate promising results, achieving a 68.6% validation F1 score with only 7% of the semester completed, and reaching up to 89.5% near the semester’s end. Our approach outperforms top machine learning models by 4.7% in validation F1 score during the critical early 7% of the semester, underscoring the value of dynamic features and heterogeneous graph representations in student success prediction.

[LG-51] DS-CIM: Digital Stochastic Computing-In-Memory Featuring Accurate OR-Accumulation via Sample Region Remapping for Edge AI Models DATE

链接: https://arxiv.org/abs/2601.06724
作者: Kunming Shao,Liang Zhao,Jiangnan Yu,Zhipeng Liao,Xiaomeng Wang,Yi Zou,Tim Kwang-Ting Cheng,Chi-Ying Tsui
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted by 2026 Design, Automation and Test in Europe Conference (DATE)

点击查看摘要

Abstract:Stochastic computing (SC) offers hardware simplicity but suffers from low throughput, while high-throughput Digital Computing-in-Memory (DCIM) is bottlenecked by costly adder logic for matrix-vector multiplication (MVM). To address this trade-off, this paper introduces a digital stochastic CIM (DS-CIM) architecture that achieves both high accuracy and efficiency. We implement signed multiply-accumulation (MAC) in a compact, unsigned OR-based circuit by modifying the data representation. Throughput is enhanced by replicating this low-cost circuit 64 times with only a 1x area increase. Our core strategy, a shared Pseudo Random Number Generator (PRNG) with 2D partitioning, enables single-cycle mutually exclusive activation to eliminate OR-gate collisions. We also resolve the 1s saturation issue via stochastic process analysis and data remapping, significantly improving accuracy and resilience to input sparsity. Our high-accuracy DS-CIM1 variant achieves 94.45% accuracy for INT8 ResNet18 on CIFAR-10 with a root-mean-squared error (RMSE) of just 0.74%. Meanwhile, our high-efficiency DS-CIM2 variant attains an energy efficiency of 3566.1 TOPS/W and an area efficiency of 363.7 TOPS/mm^2, while maintaining a low RMSE of 3.81%. The DS-CIM capability with larger models is further demonstrated through experiments with INT8 ResNet50 on ImageNet and the FP8 LLaMA-7B model.
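
A toy numpy simulation of the collision problem the abstract addresses: OR-ing independent stochastic bitstreams underestimates a sum, while assigning each time slot to exactly one operand (a simplified stand-in for the paper's shared-PRNG 2D partitioning) removes collisions. The rescaling trick below is my own illustrative choice.

```python
# Plain OR of independent Bernoulli bitstreams estimates 1 - prod(1 - p_i),
# which saturates below the true sum. Mutually exclusive slots let OR behave
# like exact addition because at most one stream can fire per slot.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.10, 0.15, 0.20, 0.25])     # operand values (true sum = 0.70)
L = 4096                                    # bitstream length

streams = rng.random((len(p), L)) < p[:, None]
print("naive OR estimate:", streams.any(axis=0).mean())   # ~0.54, biased low

owner = rng.integers(0, len(p), size=L)     # each slot owned by one operand
fire = rng.random(L) < p[owner] * len(p)    # rescale slot rate (needs p_i * N <= 1)
print("partitioned OR estimate:", fire.mean())             # ~0.70, unbiased
```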

[LG-52] Leveraging Soft Prompts for Privacy Attacks in Federated Prompt Tuning

链接: https://arxiv.org/abs/2601.06641
作者: Quan Minh Nguyen,Min-Seon Kim,Hoang M. Ngo,Trong Nghia Hoang,Hyuk-Yoon Kwon,My T. Thai
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Membership inference attack (MIA) poses a significant privacy threat in federated learning (FL) as it allows adversaries to determine whether a client’s private dataset contains a specific data sample. While defenses against membership inference attacks in standard FL have been well studied, the recent shift toward federated fine-tuning has introduced new, largely unexplored attack surfaces. To highlight this vulnerability in the emerging FL paradigm, we demonstrate that federated prompt-tuning, which adapts pre-trained models with small input prefixes to improve efficiency, also exposes a new vector for privacy attacks. We propose PromptMIA, a membership inference attack tailored to federated prompt-tuning, in which a malicious server can insert adversarially crafted prompts and monitor their updates during collaborative training to accurately determine whether a target data point is in a client’s private dataset. We formalize this threat as a security game and empirically show that PromptMIA consistently attains high advantage in this game across diverse benchmark datasets. Our theoretical analysis further establishes a lower bound on the attack’s advantage which explains and supports the consistently high advantage observed in our empirical results. We also investigate the effectiveness of standard membership inference defenses originally developed for gradient or output based attacks and analyze their interaction with the distinct threat landscape posed by PromptMIA. The results highlight non-trivial challenges for current defenses and offer insights into their limitations, underscoring the need for defense strategies that are specifically tailored to prompt-tuning in federated settings.

[LG-53] Lower Bounds for the Algorithmic Complexity of Learned Indexes

链接: https://arxiv.org/abs/2601.06629
作者: Luis Alberto Croquevielle,Roman Sokolovskii,Thomas Heinis
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learned index structures aim to accelerate queries by training machine learning models to approximate the rank function associated with a database attribute. While effective in practice, their theoretical limitations are not fully understood. We present a general framework for proving lower bounds on query time for learned indexes, expressed in terms of their space overhead and parameterized by the model class used for approximation. Our formulation captures a broad family of learned indexes, including most existing designs, as piecewise model-based predictors. We solve the problem of lower bounding query time in two steps: first, we use probabilistic tools to control the effect of sampling when the database attribute is drawn from a probability distribution. Then, we analyze the approximation-theoretic problem of how to optimally represent a cumulative distribution function with approximators from a given model class. Within this framework, we derive lower bounds under a range of modeling and distributional assumptions, paying particular attention to the case of piecewise linear and piecewise constant model classes, which are common in practical implementations. Our analysis shows how tools from approximation theory, such as quantization and Kolmogorov widths, can be leveraged to formalize the space-time tradeoffs inherent to learned index structures. The resulting bounds illuminate core limitations of these methods.
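
Although the paper is theoretical, a small sketch helps fix ideas: a piecewise-linear learned index fits the rank function segment by segment, and the leftover prediction error is what the "last-mile" search — and hence query time — must absorb. Everything below is a generic illustration, not the paper's construction.

```python
# Piecewise-linear learned index: approximate rank(key) (the empirical CDF) with
# n_seg linear models; more segments = more space but smaller prediction error.
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.lognormal(size=100_000))
n_seg = 64
bounds = np.linspace(keys[0], keys[-1], n_seg + 1)

models = []
for lo, hi in zip(bounds[:-1], bounds[1:]):
    i, j = np.searchsorted(keys, [lo, hi])
    if j - i >= 2:
        a, b = np.polyfit(keys[i:j], np.arange(i, j), 1)  # rank ~ a*key + b
    else:
        a, b = 0.0, float(i)                              # (near-)empty segment
    models.append((a, b))

def lookup(q):
    s = min(max(np.searchsorted(bounds, q) - 1, 0), n_seg - 1)
    a, b = models[s]
    pos = int(np.clip(a * q + b, 0, len(keys) - 1))       # predicted position
    err = abs(np.searchsorted(keys, q) - pos)             # last-mile search radius
    return pos, err

print("prediction error for a sample key:", lookup(keys[12345])[1])
```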

[LG-54] Cross-Border Data Security and Privacy Risks in Large Language Models and IoT Systems

链接: https://arxiv.org/abs/2601.06612
作者: Chalitha Handapangoda
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Final project for CS-GY 6813 at NYU Tandon School of Engineering

点击查看摘要

Abstract:The reliance of Large Language Models and Internet of Things systems on massive, globally distributed data flows creates systemic security and privacy challenges. When data traverses borders, it becomes subject to conflicting legal regimes, such as the EU’s General Data Protection Regulation and China’s Personal Information Protection Law, compounded by technical vulnerabilities like model memorization. Current static encryption and data localization methods are fragmented and reactive, failing to provide adequate, policy-aligned safeguards. This research proposes a Jurisdiction-Aware, Privacy-by-Design architecture that dynamically integrates localized encryption, adaptive differential privacy, and real-time compliance assertion via cryptographic proofs. Empirical validation in a multi-jurisdictional simulation demonstrates this architecture reduced unauthorized data exposure to below five percent and achieved zero compliance violations. These security gains were realized while maintaining model utility retention above ninety percent and limiting computational overhead. This establishes that proactive, integrated controls are feasible for secure and globally compliant AI deployment.

[LG-55] UMLoc: Uncertainty-Aware Map-Constrained Inertial Localization with Quantified Bounds

链接: https://arxiv.org/abs/2601.06602
作者: Mohammed S. Alharbi,Shinkyu Park
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inertial localization is particularly valuable in GPS-denied environments such as indoors. However, localization using only Inertial Measurement Units (IMUs) suffers from drift caused by motion-process noise and sensor biases. This paper introduces Uncertainty-aware Map-constrained Inertial Localization (UMLoc), an end-to-end framework that jointly models IMU uncertainty and map constraints to achieve drift-resilient positioning. UMLoc integrates two coupled modules: (1) a Long Short-Term Memory (LSTM) quantile regressor, which estimates the specific quantiles needed to define 68%, 90%, and 95% prediction intervals serving as a measure of localization uncertainty and (2) a Conditioned Generative Adversarial Network (CGAN) with cross-attention that fuses IMU dynamic data with distance-based floor-plan maps to generate geometrically feasible trajectories. The modules are trained jointly, allowing uncertainty estimates to propagate through the CGAN during trajectory generation. UMLoc was evaluated on three datasets, including a newly collected 2-hour indoor benchmark with time-aligned IMU data, ground-truth poses and floor-plan maps. Results show that the method achieves a mean drift ratio of 5.9% over a 70 m travel distance and an average Absolute Trajectory Error (ATE) of 1.36 m, while maintaining calibrated prediction bounds.
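
For concreteness, the pinball loss that quantile regressors such as UMLoc's LSTM head are typically trained with (a standard definition, not code from the paper) looks like this:

```python
# Pinball (quantile) loss: under-prediction is penalized with weight q,
# over-prediction with weight (1 - q). Training paired heads (e.g., q = 0.05
# and q = 0.95) yields a 90% prediction interval.
import numpy as np

def pinball_loss(y_true, y_pred, q):
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

y = np.array([1.0, 2.0, 3.0])
print(pinball_loss(y, y - 0.3, q=0.95))   # upper head should sit above y
print(pinball_loss(y, y + 0.3, q=0.05))   # lower head should sit below y
```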

[LG-56] Implicit bias as a Gauge correction: Theory and Inverse Design

链接: https://arxiv.org/abs/2601.06597
作者: Nicola Aladrah,Emanuele Ballarin,Matteo Biagetti,Alessio Ansuini,Alberto d’Onofrio,Fabio Anselmi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: v1

点击查看摘要

Abstract:A central problem in machine learning theory is to characterize how learning dynamics select particular solutions among the many compatible with the training objective, a phenomenon, called implicit bias, which remains only partially characterized. In the present work, we identify a general mechanism, in terms of an explicit geometric correction of the learning dynamics, for the emergence of implicit biases, arising from the interaction between continuous symmetries in the model’s parametrization and stochasticity in the optimization process. Our viewpoint is constructive in two complementary directions: given model symmetries, one can derive the implicit bias they induce; conversely, one can inverse-design a wide class of different implicit biases by computing specific redundant parameterizations. More precisely, we show that, when the dynamics is expressed in the quotient space obtained by factoring out the symmetry group of the parameterization, the resulting stochastic differential equation gains a closed form geometric correction in the stationary distribution of the optimizer dynamics favoring orbits with small local volume. We compute the resulting symmetry induced bias for a range of architectures, showing how several well known results fit into a single unified framework. The approach also provides a practical methodology for deriving implicit biases in new settings, and it yields concrete, testable predictions that we confirm by numerical simulations on toy models trained on synthetic data, leaving more complex scenarios for future work. Finally, we test the implicit bias inverse-design procedure in notable cases, including biases toward sparsity in linear features or in spectral properties of the model parameters.

[LG-57] Softly Induced Functional Simplicity: Implications for Neural Network Generalisation, Robustness and Distillation

链接: https://arxiv.org/abs/2601.06584
作者: Maciej Glowacki
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:Learning robust and generalisable abstractions from high-dimensional input data is a central challenge in machine learning and its applications to high-energy physics (HEP). Solutions of lower functional complexity are known to produce abstractions that generalise more effectively and are more robust to input perturbations. In complex hypothesis spaces, inductive biases make such solutions learnable by shaping the loss geometry during optimisation. In a HEP classification task, we show that a soft symmetry-respecting inductive bias creates approximate degeneracies in the loss, which we identify as pseudo-Goldstone modes. We quantify functional complexity using metrics derived from first-principles Hessian analysis and via compressibility. Our results demonstrate that solutions of lower complexity give rise to abstractions that are more generalisable, robust, and efficiently distillable.

[LG-58] Mosaic: Unlocking Long-Context Inference for Diffusion LLM s via Global Memory Planning and Dynamic Peak Taming

链接: https://arxiv.org/abs/2601.06562
作者: Liang Zheng,Bowen Shi,Yitao Hu,Jiawei Zhang,Ruofan Li,Sheng Chen,Wenxin Li,Keqiu Li
类目: Machine Learning (cs.LG)
*备注: 11 pages, 18 figures

点击查看摘要

Abstract:Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs’ dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm. Mosaic integrates a mask-only logits kernel to eliminate redundancy, a lazy chunking optimizer driven by an online heuristic search to adaptively mitigate dynamic peaks, and a global memory manager to resolve fragmentation via virtual addressing. Extensive evaluations demonstrate that Mosaic achieves an average 2.71 \times reduction in the memory peak-to-average ratio and increases the maximum inference sequence length supportable on identical hardware by 15.89-32.98 \times . This scalability is achieved without compromising accuracy or speed; in fact, latency is reduced by 4.12%-23.26%.

[LG-59] Pareto-Optimal Model Selection for Low-Cost Single-Lead EMG Control in Embedded Systems

链接: https://arxiv.org/abs/2601.06516
作者: Carl Vincent Ladres Kho
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 15 pages main text, 51 pages total including appendices. 18 figures. Code and dataset available at: this https URL

点击查看摘要

Abstract:Consumer-grade biosensors offer a cost-effective alternative to medical-grade electromyography (EMG) systems, reducing hardware costs from thousands of dollars to approximately $13. However, these low-cost sensors introduce significant signal instability and motion artifacts. Deploying machine learning models on resource-constrained edge devices like the ESP32 presents a challenge: balancing classification accuracy with strict latency (100 ms) and memory (320 KB) constraints. Using a single-subject dataset comprising 1,540 seconds of raw data (1.54M data points, segmented into ~1,300 one-second windows), I evaluate 18 model architectures, ranging from statistical heuristics to deep transfer learning (ResNet50) and custom hybrid networks (MaxCRNN). While my custom “MaxCRNN” (Inception + Bi-LSTM + Attention) achieved the highest safety (99% Precision) and robustness, I identify Random Forest (74% accuracy) as the Pareto-optimal solution for embedded control on legacy microcontrollers. I demonstrate that reliable, low-latency EMG control is feasible on commodity hardware, with Deep Learning offering a path to near-perfect reliability on modern Edge AI accelerators.
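
The Pareto-optimal selection step can be illustrated in a few lines; the candidate numbers below are placeholders, not the study's measurements:

```python
# Keep only candidates that no other model dominates on
# (accuracy higher, latency lower, memory lower).
candidates = {
    "random_forest": (0.74, 12.0, 90.0),     # (accuracy, latency ms, memory KB)
    "maxcrnn":       (0.83, 240.0, 900.0),
    "logreg":        (0.61, 2.0, 8.0),
    "resnet50":      (0.81, 800.0, 25000.0),
}

def dominates(a, b):
    better_eq = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return better_eq and strictly

pareto = [n for n, v in candidates.items()
          if not any(dominates(w, v) for m, w in candidates.items() if m != n)]
print("Pareto-optimal models:", pareto)   # resnet50 is dominated by maxcrnn
```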

[LG-60] A novel RF-enabled Non-Destructive Inspection Method through Machine Learning and Programmable Wireless Environments

链接: https://arxiv.org/abs/2601.06512
作者: Stavros Tsimpoukis,Dimitrios Tyrovolas,Sotiris Ioannidis,Maria Kafesaki,Ian F. Akyildiz,George K. Karagiannidis,Christos K. Liaskos
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Contemporary industrial Non-Destructive Inspection (NDI) methods require sensing capabilities that operate in occluded, hazardous, or access-restricted environments. Yet, current visual inspection based on optical cameras offers limited quality of service in that respect. Hence, novel methods for workpiece inspection suitable for smart manufacturing are needed. Programmable Wireless Environments (PWE) could help in that direction by redefining wireless Radio Frequency (RF) wave propagation as a controllable inspector entity. In this work, we propose a novel approach to Non-Destructive Inspection, leveraging an RF sensing pipeline based on RF wavefront encoding for retrieving workpiece-image entries from a designated database. This approach combines PWE-enabled RF wave manipulation with machine learning (ML) tools trained to produce visual outputs for quality inspection. Specifically, we establish correlation relationships between RF wavefronts and target industrial assets, yielding a dataset which links wavefronts to their corresponding images in a structured manner. Subsequently, a Generative Adversarial Network (GAN) derives visual representations closely matching the database entries. Our results indicate that the proposed method achieves an SSIM matching score of 99.5% in visual outputs, paving the way for next-generation quality control workflows in industry.

[LG-61] Deriving Decoder-Free Sparse Autoencoders from First Principles

链接: https://arxiv.org/abs/2601.06478
作者: Alan Oursland
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages, 3 figures, 9 tables

点击查看摘要

Abstract:Gradient descent on log-sum-exp (LSE) objectives performs implicit expectation–maximization (EM): the gradient with respect to each component output equals its responsibility. The same theory predicts collapse without volume control analogous to the log-determinant in Gaussian mixture models. We instantiate the theory in a single-layer encoder with an LSE objective and InfoMax regularization for volume control. Experiments confirm the theory’s predictions. The gradient–responsibility identity holds exactly; LSE alone collapses; variance prevents dead components; decorrelation prevents redundancy. The model exhibits EM-like optimization dynamics in which lower loss does not correspond to better features and adaptive optimizers offer no advantage. The resulting decoder-free model learns interpretable mixture components, confirming that implicit EM theory can prescribe architectures.
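
The paper's central identity is easy to verify numerically: the gradient of log-sum-exp with respect to each component output is exactly that component's softmax responsibility.

```python
# Check d/dz_k log(sum_j exp(z_j)) == softmax(z)_k via central differences.
import numpy as np

z = np.array([0.3, -1.2, 2.0])

def lse(z):
    return np.log(np.exp(z).sum())

resp = np.exp(z) / np.exp(z).sum()          # softmax responsibilities
eps = 1e-6
grad = np.array([(lse(z + eps * e) - lse(z - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
print(np.allclose(grad, resp))              # True: gradient == responsibility
```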

[LG-62] Hybrid LSTM-UKF Framework: Ankle Angle and Ground Reaction Force Estimation

链接: https://arxiv.org/abs/2601.06473
作者: Mundla Narasimhappa,Praveen Kumar
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 8

点击查看摘要

Abstract:Accurate prediction of joint kinematics and kinetics is essential for advancing gait analysis and developing intelligent assistive systems such as prosthetics and exoskeletons. This study presents a hybrid LSTM-UKF framework for estimating ankle angle and ground reaction force (GRF) across varying walking speeds. A multimodal sensor fusion strategy integrates force plate data, knee angle, and GRF signals to enrich biomechanical context. Model performance was evaluated using RMSE and R^2 under subject-specific validation. The LSTM-UKF consistently outperformed standalone LSTM and UKF models, achieving up to 18.6% lower RMSE for GRF prediction at 3 km/h. Additionally, UKF integration improved robustness, reducing ankle angle RMSE by up to 22.4% compared to UKF alone at 1 km/h. These results underscore the effectiveness of hybrid architectures for reliable gait prediction across subjects and walking conditions.

[LG-63] StablePDENet: Enhancing Stability of Operator Learning for Solving Differential Equations

链接: https://arxiv.org/abs/2601.06472
作者: Chutian Huang,Chang Ma,Kaibo Wang,Yang Xiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning solution operators for differential equations with neural networks has shown great potential in scientific computing, but ensuring their stability under input perturbations remains a critical challenge. This paper presents a robust self-supervised neural operator framework that enhances stability through adversarial training while preserving accuracy. We formulate operator learning as a min-max optimization problem, where the model is trained against worst-case input perturbations to achieve consistent performance under both normal and adversarial conditions. We demonstrate that our method not only achieves good performance on standard inputs, but also maintains high fidelity under adversarial perturbed inputs. The results highlight the importance of stability-aware training in operator learning and provide a foundation for developing reliable neural PDE solvers in real-world applications, where input noise and uncertainties are inevitable.
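
A schematic of the min-max training loop described above, using PGD for the inner maximization; the tiny MLP and all hyperparameters are stand-ins, not the paper's setup:

```python
# Inner loop: PGD crafts a worst-case input perturbation within an L-inf ball.
# Outer loop: train the operator surrogate on the perturbed inputs.
import torch

def pgd_perturb(model, x, y, eps=0.05, alpha=0.01, steps=5):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent on the loss
            delta.clamp_(-eps, eps)              # project back into the ball
        delta.grad.zero_()
    return delta.detach()

model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 16), torch.randn(32, 16)  # stand-in (input, solution) pairs

for _ in range(10):                              # outer minimization
    delta = pgd_perturb(model, x, y)
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x + delta), y).backward()
    opt.step()
```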

[LG-64] Physics-Informed Tree Search for High-Dimensional Computational Design

链接: https://arxiv.org/abs/2601.06444
作者: Suvo Banik,Troy D. Loeffler,Henry Chan,Sukriti Manna,Orcun Yildiz,Tom Peterka,Subramanian Sankaranarayanan
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:High-dimensional design spaces underpin a wide range of physics-based modeling and computational design tasks in science and engineering. These problems are commonly formulated as constrained black-box searches over rugged objective landscapes, where function evaluations are expensive, and gradients are unavailable or unreliable. Conventional global search engines and optimizers struggle in such settings due to the exponential scaling of design spaces, the presence of multiple local basins, and the absence of physical guidance in sampling. We present a physics-informed Monte Carlo Tree Search (MCTS) framework that extends policy-driven tree-based reinforcement concepts to continuous, high-dimensional scientific optimization. Our method integrates population-level decision trees with surrogate-guided directional sampling, reward shaping, and hierarchical switching between global exploration and local exploitation. These ingredients allow efficient traversal of non-convex, multimodal landscapes where physically meaningful optima are sparse. We benchmark our approach against standard global optimization baselines on a suite of canonical test functions, demonstrating superior or comparable performance in terms of convergence, robustness, and generalization. Beyond synthetic tests, we demonstrate physics-consistent applicability to (i) crystal structure optimization from clusters to bulk, (ii) fitting of classical interatomic potentials, and (iii) constrained engineering design problems. Across all cases, the method converges with high fidelity and evaluation efficiency while preserving physical constraints. Overall, our work establishes physics-informed tree search as a scalable and interpretable paradigm for computational design and high-dimensional scientific optimization, bridging discrete decision-making frameworks with continuous search in scientific design workflows.

[LG-65] FlexAct: Why Learn when you can Pick?

链接: https://arxiv.org/abs/2601.06441
作者: Ramnath Kumar,Kyle Ritscher,Junmin Judy,Lawrence Liu,Cho-Jui Hsieh
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Learning activation functions has emerged as a promising direction in deep learning, allowing networks to adapt activation mechanisms to task-specific demands. In this work, we introduce a novel framework that employs the Gumbel-Softmax trick to enable discrete yet differentiable selection among a predefined set of activation functions during training. Our method dynamically learns the optimal activation function independently of the input, thereby enhancing both predictive accuracy and architectural flexibility. Experiments on synthetic datasets show that our model consistently selects the most suitable activation function, underscoring its effectiveness. These results connect theoretical advances with practical utility, paving the way for more adaptive and modular neural architectures in complex learning scenarios.
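
The core mechanism is compact enough to sketch: a module holding logits over a fixed activation menu, sampled with PyTorch's gumbel_softmax (hard straight-through) during training. This is my own minimal reading of the idea, not the authors' implementation.

```python
# Discrete-but-differentiable activation selection via the Gumbel-Softmax trick.
import torch
import torch.nn.functional as F

class GumbelActivation(torch.nn.Module):
    def __init__(self, tau=1.0):
        super().__init__()
        self.fns = [torch.relu, torch.tanh, F.gelu, torch.sigmoid]
        self.logits = torch.nn.Parameter(torch.zeros(len(self.fns)))
        self.tau = tau

    def forward(self, x):
        if self.training:  # hard one-hot sample; gradients flow straight-through
            w = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        else:              # at eval time, commit to the learned choice
            w = F.one_hot(self.logits.argmax(), len(self.fns)).float()
        return sum(wi * fn(x) for wi, fn in zip(w, self.fns))

layer = GumbelActivation()
print(layer(torch.randn(4, 8)).shape)   # torch.Size([4, 8])
```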

[LG-66] Certified Unlearning in Decentralized Federated Learning

链接: https://arxiv.org/abs/2601.06436
作者: Hengliang Wu,Youming Tao,Anhao Zhou,Shuzhen Chen,Falko Dressler,Dongxiao Yu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Driven by the right to be forgotten (RTBF), machine unlearning has become an essential requirement for privacy-preserving machine learning. However, its realization in decentralized federated learning (DFL) remains largely unexplored. In DFL, clients exchange local updates only with neighbors, causing model information to propagate and mix across the network. As a result, when a client requests data deletion, its influence is implicitly embedded throughout the system, making removal difficult without centralized coordination. We propose a novel certified unlearning framework for DFL based on Newton-style updates. Our approach first quantifies how a client’s data influence propagates during training. Leveraging curvature information of the loss with respect to the target data, we then construct corrective updates using Newton-style approximations. To ensure scalability, we approximate second-order information via Fisher information matrices. The resulting updates are perturbed with calibrated noise and broadcast through the network to eliminate residual influence across clients. We theoretically prove that our approach satisfies the formal definition of certified unlearning, ensuring that the unlearned model is difficult to distinguish from a retrained model without the deleted data. We also establish utility bounds showing that the unlearned model remains close to retraining from scratch. Extensive experiments across diverse decentralized settings demonstrate the effectiveness and efficiency of our framework.

[LG-67] A Unified Shape-Aware Foundation Model for Time Series Classification AAAI2026

链接: https://arxiv.org/abs/2601.06429
作者: Zhen Liu,Yucheng Wang,Boyuan Li,Junhao Zheng,Emadeldeen Eldele,Min Wu,Qianli Ma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted in AAAI 2026

点击查看摘要

Abstract:Foundation models pre-trained on large-scale source datasets are reshaping the traditional training paradigm for time series classification. However, existing time series foundation models primarily focus on forecasting tasks and often overlook classification-specific challenges, such as modeling interpretable shapelets that capture class-discriminative temporal features. To bridge this gap, we propose UniShape, a unified shape-aware foundation model designed for time series classification. UniShape incorporates a shape-aware adapter that adaptively aggregates multiscale discriminative subsequences (shapes) into class tokens, effectively selecting the most relevant subsequence scales to enhance model interpretability. Meanwhile, a prototype-based pretraining module is introduced to jointly learn instance- and shape-level representations, enabling the capture of transferable shape patterns. Pre-trained on a large-scale multi-domain time series dataset comprising 1.89 million samples, UniShape exhibits superior generalization across diverse target domains. Experiments on 128 UCR datasets and 30 additional time series datasets demonstrate that UniShape achieves state-of-the-art classification performance, with interpretability and ablation analyses further validating its effectiveness.

[LG-68] Teach Diffusion Language Models to Learn from Their Own Mistakes

链接: https://arxiv.org/abs/2601.06428
作者: Liming Liu,Binxuan Huang,Xin Liu,Bing Yin,Tuo Zhao
类目: Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:Masked Diffusion Language Models (DLMs) achieve significant speed by generating multiple tokens in parallel. However, this parallel sampling approach, especially when using fewer inference steps, will introduce strong dependency errors and cause quality to deteriorate rapidly as the generation step size grows. As a result, reliable self-correction becomes essential for maintaining high-quality multi-token generation. To address this, we propose Decoupled Self-Correction (DSC), a novel two-stage methodology. DSC first fully optimizes the DLM’s generative ability before freezing the model and training a specialized correction head. This decoupling preserves the model’s peak SFT performance and ensures the generated errors used for correction head training are of higher quality. Additionally, we introduce Future-Context Augmentation (FCA) to maximize the correction head’s accuracy. FCA generalizes the error training distribution by augmenting samples with ground-truth tokens, effectively training the head to utilize a richer, future-looking context. This mechanism is used for reliably detecting the subtle errors of the high-fidelity base model. Our DSC framework enables the model, at inference time, to jointly generate and revise tokens, thereby correcting errors introduced by multi-token generation and mitigating error accumulation across steps. Experiments on mathematical reasoning and code generation benchmarks demonstrate that our approach substantially reduces the quality degradation associated with larger generation steps, allowing DLMs to achieve both high generation speed and strong output fidelity.

[LG-69] One-Shot Hierarchical Federated Clustering

链接: https://arxiv.org/abs/2601.06404
作者: Shenghong Cai,Zihua Yang,Yang Lu,Mengke Li,Yuzhu Ji,Yiqun Zhang,Yiu-Ming Cheung
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Driven by the growth of Web-scale decentralized services, Federated Clustering (FC) aims to extract knowledge from heterogeneous clients in an unsupervised manner while preserving the clients’ privacy, which has emerged as a significant challenge due to the lack of label guidance and the Non-Independent and Identically Distributed (non-IID) nature of clients. In real scenarios such as personalized recommendation and cross-device user profiling, the global cluster may be fragmented and distributed among different clients, and the clusters may exist at different granularities or even nested. Although Hierarchical Clustering (HC) is considered promising for exploring such distributions, the sophisticated recursive clustering process makes it more computationally expensive and vulnerable to privacy exposure, thus relatively unexplored under the federated learning scenario. This paper introduces an efficient one-shot hierarchical FC framework that performs client-end distribution exploration and server-end distribution aggregation through one-way prototype-level communication from clients to the server. A fine partition mechanism is developed to generate successive clusterlets to describe the complex landscape of the clients’ clusters. Then, a multi-granular learning mechanism on the server is proposed to fuse the clusterlets, even when they have inconsistent granularities generated from different clients. It turns out that the complex cluster distributions across clients can be efficiently explored, and extensive experiments comparing state-of-the-art methods on ten public datasets demonstrate the superiority of the proposed method.

[LG-70] Supervised and Unsupervised Neural Network Solver for First Order Hyperbolic Nonlinear PDEs

链接: https://arxiv.org/abs/2601.06388
作者: Zakaria Baba,Alexandre M. Bayen,Alexi Canesse,Maria Laura Delle Monache,Martin Drieux,Zhe Fu,Nathan Lichtlé,Zihe Liu,Hossein Nick Zinat Matin,Benedetto Piccoli
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a neural network-based method for learning scalar hyperbolic conservation laws. Our method replaces the traditional numerical flux in finite volume schemes with a trainable neural network while preserving the conservative structure of the scheme. The model can be trained both in a supervised setting with efficiently generated synthetic data or in an unsupervised manner, leveraging the weak formulation of the partial differential equation. We provide theoretical results that our model can perform arbitrarily well, and provide associated upper bounds on neural network size. Extensive experiments demonstrate that our method often outperforms efficient schemes such as Godunov’s scheme, WENO, and Discontinuous Galerkin for comparable computational budgets. Finally, we demonstrate the effectiveness of our method on a traffic prediction task, leveraging field experimental highway data from the Berkeley DeepDrive drone dataset.
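
The key structural point — that conservation holds for any learned flux — can be seen in a few lines. The untrained network below is a placeholder for the trained flux model:

```python
# Conservative finite-volume update with a neural numerical flux F(u_L, u_R).
# Because the update is in flux form with periodic boundaries, sum(u) * dx
# (total "mass") is conserved exactly regardless of what flux_net computes.
import torch

flux_net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def step(u, dt, dx):
    ul, ur = u, torch.roll(u, -1)                             # cells i and i+1
    F = flux_net(torch.stack([ul, ur], dim=-1)).squeeze(-1)   # flux at i+1/2
    return u - (dt / dx) * (F - torch.roll(F, 1))             # conservative form

u = torch.sin(torch.linspace(0, 2 * torch.pi, 100))           # initial condition
u_next = step(u, dt=0.004, dx=2 * torch.pi / 100)
print(torch.isclose(u.sum(), u_next.sum(), atol=1e-4))        # mass preserved
```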

[LG-71] Hierarchical Pooling and Explainability in Graph Neural Networks for Tumor and Tissue-of-Origin Classification Using RNA-seq Data

链接: https://arxiv.org/abs/2601.06381
作者: Thomas Vaitses Fontanari,Mariana Recamonde-Mendoza
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:This study explores the use of graph neural networks (GNNs) with hierarchical pooling and multiple convolution layers for cancer classification based on RNA-seq data. We combine gene expression data from The Cancer Genome Atlas (TCGA) with a precomputed STRING protein-protein interaction network to classify tissue origin and distinguish between normal and tumor samples. The model employs Chebyshev graph convolutions (K=2) and weighted pooling layers, aggregating gene clusters into ‘supernodes’ across multiple coarsening levels. This approach enables dimensionality reduction while preserving meaningful interactions. Saliency methods were applied to interpret the model by identifying key genes and biological processes relevant to cancer. Our findings reveal that increasing the number of convolution and pooling layers did not enhance classification performance. The highest F1-macro score (0.978) was achieved with a single pooling layer, while adding more layers resulted in over-smoothing and performance degradation. Nevertheless, the model proved highly interpretable through gradient methods, identifying known cancer-related genes and highlighting enriched biological processes, and its hierarchical structure can be used to develop new explainable architectures. Overall, while deeper GNN architectures did not improve performance, the hierarchical pooling structure provided valuable insights into tumor biology, making GNNs a promising tool for cancer biomarker discovery and interpretation.

[LG-72] A Fast and Effective Method for Euclidean Anticlustering: The Assignment-Based-Anticlustering Algorithm

链接: https://arxiv.org/abs/2601.06351
作者: Philipp Baumann,Olivier Goldschmidt,Dorit S. Hochbaum,Jason Yang
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注:

点击查看摘要

Abstract:The anticlustering problem is to partition a set of objects into K equal-sized anticlusters such that the sum of distances within anticlusters is maximized. The anticlustering problem is NP-hard. We focus on anticlustering in Euclidean spaces, where the input data is tabular and each object is represented as a D-dimensional feature vector. Distances are measured as squared Euclidean distances between the respective vectors. Applications of Euclidean anticlustering include social studies, particularly in psychology, K-fold cross-validation in which each fold should be a good representative of the entire dataset, the creation of mini-batches for gradient descent in neural network training, and balanced K-cut partitioning. In particular, machine-learning applications involve million-scale datasets and very large values of K, making scalable anticlustering algorithms essential. Existing algorithms are either exact methods that can solve only small instances or heuristic methods, among which the most scalable is the exchange-based heuristic fast_anticlustering. We propose a new algorithm, the Assignment-Based Anticlustering algorithm (ABA), which scales to very large instances. A computational study shows that ABA outperforms fast_anticlustering in both solution quality and running time. Moreover, ABA scales to instances with millions of objects and hundreds of thousands of anticlusters within short running times, beyond what fast_anticlustering can handle. As a balanced K-cut partitioning method for tabular data, ABA is superior to the well-known METIS method in both solution quality and running time. The code of the ABA algorithm is available on GitHub.

[LG-73] Federated Learning and Class Imbalances

链接: https://arxiv.org/abs/2601.06348
作者: Siqi Zhu,Joshua D. Kaggie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across decentralized devices while preserving data privacy. However, real-world FL deployments face critical challenges such as data imbalances, including label noise and non-IID distributions. RHFL+, a state-of-the-art method, was proposed to address these challenges in settings with heterogeneous client models. This work investigates the robustness of RHFL+ under class imbalances through three key contributions: (1) reproduction of RHFL+ along with all benchmark algorithms under a unified evaluation framework; (2) extension of RHFL+ to real-world medical imaging datasets, including CBIS-DDSM, BreastMNIST and BHI; (3) a novel implementation using NVFlare, NVIDIA’s production-level federated learning framework, enabling a modular, scalable and deployment-ready codebase. To validate effectiveness, extensive ablation studies, algorithmic comparisons under various noise conditions and scalability experiments across increasing numbers of clients are conducted.

[LG-74] SourceNet: Interpretable Sim-to-Real Inference on Variable-Geometry Sensor Arrays for Earthquake Source Inversion

链接: https://arxiv.org/abs/2601.06320
作者: Zhe Jia,Xiaotian Zhang,Junpeng Li
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Inferring high-dimensional physical states from sparse, ad-hoc sensor arrays is a fundamental challenge across AI for Science, complicated by irregular geometries and the profound Sim-to-Real gap in physical modeling. Taking earthquake source characterization as a representative challenge, we address limitations in conventional deep learning: CNNs demand fixed grids, while pooling-based architectures (e.g., DeepSets) struggle to capture the relational wave physics. Here, we propose SourceNet, a Transformer-based framework that treats the sensor array as a flexible set to model arbitrary geometries. To bridge the reality gap, we introduce Physics-Structured Domain Randomization (PSDR). Instead of forcing feature alignment, PSDR randomizes the governing physical dynamics by varying velocity structures, propagation effects, and sensor availability, forcing the model to learn robust representations invariant to unmodeled environmental heterogeneity. By pre-training on 100,000 synthetic events and fine-tuning on ~2,000 real-world events, SourceNet achieves state-of-the-art precision on held-out real data. This demonstrates exceptional data efficiency and matches classical solvers while enabling real-time processing. Remarkably, interpretability analysis reveals that the model shows scientific-agent-like features: it autonomously discovers geometric information bottlenecks and learns an attention policy that prioritizes sparse sensor placements, effectively recovering principles of optimal experimental design from data alone.

[LG-75] Towards Public Administration Research Based on Interpretable Machine Learning

链接: https://arxiv.org/abs/2601.06205
作者: Zhanyu Liu,Yang Yu
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal relationships play a pivotal role in research within the field of public administration. Ensuring reliable causal inference requires validating the predictability of these relationships, which is a crucial precondition. However, prediction has not garnered adequate attention within the realm of quantitative research in public administration and the broader social sciences. The advent of interpretable machine learning presents a significant opportunity to integrate prediction into quantitative research conducted in public administration. This article delves into the fundamental principles of interpretable machine learning while also examining its current applications in social science research. Building upon this foundation, the article further expounds upon the implementation process of interpretable machine learning, encompassing key aspects such as dataset construction, model training, model evaluation, and model interpretation. Lastly, the article explores the disciplinary value of interpretable machine learning within the field of public administration, highlighting its potential to enhance the generalization of inference, facilitate the selection of optimal explanations for phenomena, stimulate the construction of theoretical hypotheses, and provide a platform for the translation of knowledge. As a complement to traditional causal inference methods, interpretable machine learning ushers in a new era of credibility in quantitative research within the realm of public administration.

[LG-76] Time-Series Anomaly Classification for Launch Vehicle Propulsion Systems: Fast Statistical Detectors Enhancing LSTM Accuracy and Data Quality

链接: https://arxiv.org/abs/2601.06186
作者: Sean P. Engelstad,Sameul R. Darr,Matthew Taliaferro,Vinay K. Goyal
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 19 pages and 12 figures

点击查看摘要

Abstract:Supporting Go/No-Go decisions prior to launch requires assessing real-time telemetry data against redline limits established during the design qualification phase. Family data from ground testing or previous flights is commonly used to detect initiating failure modes and their timing; however, this approach relies heavily on engineering judgment and is more error-prone for new launch vehicles. To address these limitations, we utilize Long Short-Term Memory (LSTM) networks for supervised classification of time-series anomalies. However, initial training labels derived from simulated anomaly data may be suboptimal due to variations in anomaly strength, anomaly settling times, and other factors. In this work, we propose a novel statistical detector based on the Mahalanobis distance and forward-backward detection fractions to adjust the supervised training labels. We demonstrate our method on digital twin simulations of a ground-stage propulsion system with 20.8 minutes of operation per trial and O(10^8) training timesteps. The statistical data relabeling improved precision and recall of the LSTM classifier by 7% and 22%, respectively.
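
As a sketch of the detector's first ingredient (the forward-backward detection fractions and relabeling logic are not reproduced here), the Mahalanobis distance of a telemetry vector from the nominal family-data distribution is:

```python
# Mahalanobis distance d(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu)) of an
# incoming telemetry vector from the nominal distribution estimated on
# family data; large values flag candidate anomalies.
import numpy as np

rng = np.random.default_rng(0)
nominal = rng.normal(size=(5000, 6))                 # nominal telemetry windows
mu = nominal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(nominal, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(mu))                               # ~0: nominal center
print(mahalanobis(mu + 5.0))                         # large: flagged as anomalous
```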

[LG-77] Data-Driven Reduced-Complexity Modeling of Fluid Flows: A Community Challenge

链接: https://arxiv.org/abs/2601.06183
作者: Oliver T. Schmidt,Aaron Towne,Adrian Lozano-Duran,Scott T. M. Dawson,Ricardo Vinuesa
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We introduce a community challenge designed to facilitate direct comparisons between data-driven methods for compression, forecasting, and sensing of complex aerospace flows. The challenge is organized into three tracks that target these complementary capabilities: compression (compact representations for large datasets), forecasting (predicting future flow states from a finite history), and sensing (inferring unmeasured flow states from limited measurements). Across these tracks, multiple challenges span diverse flow datasets and use cases, each emphasizing different model requirements. The challenge is open to anyone, and we invite broad participation to build a comprehensive and balanced picture of what works and where current methods fall short. To support fair comparisons, we provide standardized success metrics, evaluation tools, and baseline implementations, with one classical and one machine-learning baseline per challenge. Final assessments use blind tests on withheld data. We explicitly encourage negative results and careful analyses of limitations. Outcomes will be disseminated through an AIAA Journal Virtual Collection and invited presentations at AIAA conferences.

[LG-78] Performance of models for monitoring sustainable development goals from remote sensing: A three-level meta-regression

链接: https://arxiv.org/abs/2601.06178
作者: Jonas Klingwort,Nina M. Leach,Joep Burger
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 33 pages, 10 figures, 5 tables

点击查看摘要

[LG-79] Can we Improve Prediction of Psychotherapy Outcomes Through Pretraining With Simulated Data?

链接: https://arxiv.org/abs/2601.06159
作者: Niklas Jacobs,Manuel C. Voelkle,Norbert Kathmann,Kevin Hilbert
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the context of personalized medicine, machine learning algorithms are growing in popularity. These algorithms require substantial information, which can be acquired effectively through the use of previously gathered data. Open data and the utilization of synthetization techniques have been proposed to address this. In this paper, we propose and evaluate an alternative approach that uses additional simulated data based on summary statistics published in the literature. The simulated data are used to pretrain random forests, which are afterwards fine-tuned on a real dataset. We compare the predictive performance of the new approach to random forests trained only on the real data. A Monte Carlo Cross Validation (MCCV) framework with 100 iterations was employed to investigate the significance and stability of the results. Since a first study yielded inconclusive results, a second study with improved methodology (i.e., systematic information extraction and a different prediction outcome) was conducted. In Study 1, some pretrained random forests descriptively outperformed the standard random forest. However, this improvement was not significant (t(99) = 0.89, p = 0.19). Contrary to expectations, in Study 2 the random forest trained only with the real data outperformed the pretrained random forests. We conclude with a discussion of challenges, such as the scarcity of informative publications, and recommendations for future research.
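
Random forests have no native notion of fine-tuning, so one plausible reading — an assumption on my part, not necessarily the paper's procedure — is to grow trees on simulated data first and then append trees fit on the real data via scikit-learn's warm_start:

```python
# "Pretrain" on data simulated from (assumed) published summary statistics,
# then "fine-tune" by growing additional trees on the scarce real dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_sim = rng.multivariate_normal(np.zeros(5), np.eye(5), size=2000)
y_sim = X_sim @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(size=2000)
X_real = rng.multivariate_normal(np.zeros(5), np.eye(5), size=150)  # scarce real data
y_real = X_real @ np.array([0.6, -0.2, 0.2, 0.1, 0.0]) + rng.normal(size=150)

rf = RandomForestRegressor(n_estimators=200, warm_start=True, random_state=0)
rf.fit(X_sim, y_sim)            # trees grown on simulated data
rf.n_estimators += 100
rf.fit(X_real, y_real)          # appended trees are grown on the real data
```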

[LG-80] Causal and Federated Multimodal Learning for Cardiovascular Risk Prediction under Heterogeneous Populations

链接: https://arxiv.org/abs/2601.06140
作者: Rohit Kaushik,Eva Kaushik
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Cardiovascular disease (CVD) continues to be the major cause of death globally, calling for predictive models that not only handle diverse and high-dimensional biomedical signals but also maintain interpretability and privacy. We create a single multimodal learning framework that integrates cross-modal transformers with graph neural networks and causal representation learning to measure personalized CVD risk. The model combines genomic variation, cardiac MRI, ECG waveforms, wearable streams, and structured EHR data to predict risk while also implementing causal invariance constraints across different clinical subpopulations. To maintain transparency, we employ SHAP-based feature attribution, counterfactual explanations, and causal latent alignment for understandable risk factors. Besides, we position the design in a federated, privacy-preserving optimization protocol and establish rules for convergence, calibration, and uncertainty quantification under distributional shift. Experimental studies based on large-scale biobank and multi-institutional datasets reveal state-of-the-art discrimination and robustness, exhibiting fair performance across demographic strata and clinically distinct cohorts. This study paves the way for a principled approach to clinically trustworthy, interpretable, and privacy-respecting CVD prediction at the population level.

[LG-81] DeeperBrain: A Neuro-Grounded EEG Foundation Model Towards Universal BCI

链接: https://arxiv.org/abs/2601.06134
作者: Jiquan Wang,Sha Zhao,Yangxuan Zhou,Yiming Kang,Shijian Li,Gang Pan
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*备注: this http URL Review

点击查看摘要

Abstract:Electroencephalography (EEG) foundation models hold significant promise for universal Brain-Computer Interfaces (BCIs). However, existing approaches often rely on end-to-end fine-tuning and exhibit limited efficacy under frozen-probing protocols, lacking the intrinsic universality required for broad generalization. This limitation stems from adapting general-purpose sequence architectures that overlook the biophysical and dynamical principles of neural activity. To bridge this gap, we propose DeeperBrain, a neuro-grounded foundation model integrating domain-specific inductive biases into its model design and learning objectives. Architecturally, DeeperBrain incorporates a volume conduction-aware channel encoding to model spatial mixing via 3D geometry, and a neurodynamics-aware temporal encoding capturing slow adaptations using oscillatory and exponential bases. For pretraining, we introduce a dual-objective strategy combining Masked EEG Reconstruction (MER) for local fidelity and Neurodynamics Statistics Prediction (NSP). NSP enforces alignment with macroscopic brain states by predicting interpretable order parameters, including spectral power, functional connectivity, cross-frequency coupling, and dynamic complexity. Extensive experiments demonstrate that DeeperBrain achieves state-of-the-art or highly competitive performance under end-to-end fine-tuning. Crucially, it maintains superior efficacy under a rigorous frozen-probing protocol, verifying that embedding neuroscientific first principles endows learned representations with the intrinsic universality essential for universal BCI. The code will be publicly available.

[LG-82] Learning Minimally-Congested Drive Times from Sparse Open Networks: A Lightweight RF-Based Estimator for Urban Roadway Operations

链接: https://arxiv.org/abs/2601.06124
作者: Adewumi Augustine Adepitan,Christopher J. Haruna,Morayo Ogunsina,Damilola Olawoyin Yussuf,Ayooluwatomiwa Ajiboye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-83] Stress Testing Machine Learning at 10^10 Scale: A Comprehensive Study of Adversarial Robustness on Algebraically Structured Integer Streams

链接: https://arxiv.org/abs/2601.06117
作者: HyunJun Jeon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-84] Australian Bushfire Intelligence with AI-Driven Environmental Analytics

链接: https://arxiv.org/abs/2601.06105
作者: Tanvi Jois,Hussain Ahmad,Fatima Noor,Faheem Ullah
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Bushfires are among the most destructive natural hazards in Australia, causing significant ecological, economic, and social damage. Accurate prediction of bushfire intensity is therefore essential for effective disaster preparedness and response. This study examines the predictive capability of spatio-temporal environmental data for identifying high-risk bushfire zones across Australia. We integrated historical fire events from NASA FIRMS, daily meteorological observations from Meteostat, and vegetation indices such as the Normalized Difference Vegetation Index (NDVI) from Google Earth Engine for the period 2015-2023. After harmonizing the datasets using spatial and temporal joins, we evaluated several machine learning models, including Random Forest, XGBoost, LightGBM, a Multi-Layer Perceptron (MLP), and an ensemble classifier. Under a binary classification framework distinguishing ‘low’ and ‘high’ fire risk, the ensemble approach achieved an accuracy of 87%. The results demonstrate that combining multi-source environmental features with advanced machine learning techniques can produce reliable bushfire intensity predictions, supporting more informed and timely disaster management.
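
As a rough illustration of the evaluation described above, here is a soft-voting ensemble over the model families named in the abstract. The paper's actual features, hyperparameters, and ensemble recipe are not specified; scikit-learn's `HistGradientBoostingClassifier` stands in for XGBoost/LightGBM so the sketch stays self-contained, and the synthetic features stand in for the harmonized FIRMS/Meteostat/NDVI data.

```python
# Sketch of a soft-voting ensemble for binary low/high fire-risk
# classification, under assumed (not the authors') configurations.
from sklearn.datasets import make_classification
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the harmonized spatio-temporal features.
X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", HistGradientBoostingClassifier(random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                              random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across members
)
ensemble.fit(X_tr, y_tr)
print("holdout accuracy:", ensemble.score(X_te, y_te))
```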

[LG-85] The Hessian of tall-skinny networks is easy to invert

链接: https://arxiv.org/abs/2601.06096
作者: Ali Rahimi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We describe an exact algorithm for solving linear systems Hx=b where H is the Hessian of a deep net. The method computes Hessian-inverse-vector products without storing the Hessian or its inverse in time and storage that scale linearly in the number of layers. Compared to the naive approach of first computing the Hessian, then solving the linear system, which takes storage that’s quadratic in the number of parameters and cubically many operations, our Hessian-inverse-vector product method scales roughly like Pearlmutter’s algorithm for computing Hessian-vector products.
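
For contrast with the paper's exact linear-in-depth method, the sketch below shows the standard matrix-free alternative: solving Hx = b with conjugate gradients driven only by Hessian-vector products (here a finite-difference, Pearlmutter-style approximation on a toy objective). This is a generic baseline, not the authors' algorithm, and the toy gradient is an assumption for illustration.

```python
# Matrix-free solve of H x = b via conjugate gradients, where Hv is
# approximated by central differences of the gradient (no Hessian stored).
import numpy as np

def grad(x):
    # Gradient of a toy strictly convex objective: sum(log cosh x) + 0.05 x^2.
    return np.tanh(x) + 0.1 * x

def hvp(x, v, eps=1e-5):
    # Pearlmutter-style Hessian-vector product via finite differences.
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

def cg_solve(x, b, iters=50, tol=1e-10):
    # Conjugate gradients on H(x) s = b using only H-vector products.
    s = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(x, p)
        alpha = rs / (p @ Hp)
        s += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return s

x = np.linspace(-1.0, 1.0, 8)
b = np.ones(8)
s = cg_solve(x, b)
print("residual norm:", np.linalg.norm(hvp(x, s) - b))
```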

[LG-86] Enabling Long FFT Convolutions on Memory-Constrained FPGAs via Chunking

链接: https://arxiv.org/abs/2601.06065
作者: Peter Wang,Neelesh Gupta,Viktor Prasanna
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 2 pages, submitted to 2025 HiPC Conference

点击查看摘要

Abstract:The need for long-context reasoning has led to alternative neural network architectures besides Transformers and self-attention, a popular model being Hyena, which employs causal 1D convolutions implemented with FFTs. Long convolutions enable efficient global context mixing, but the requirements for intermediate results exceed the 2-3 MB Block RAM capacity of FPGAs. We present a chunked FFT convolution approach enabling 450K-length sequence by 450K-length filter convolutions on an Alveo U200 FPGA with 2.8 MB BRAM through chunking and overlap-add reconstruction. We find that throughput scales proportionally with chunk size while degrading by only 7% for our longest sequences, demonstrating that careful memory management enables deployment of long-context primitives on edge FPGAs without sacrificing performance.
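
The overlap-add decomposition at the heart of this approach is easy to state in a few lines of numpy: FFT-convolve each chunk with the filter, then add the overlapping tails back together. Chunk and filter lengths below are toy values; the FPGA/BRAM accounting is not modeled.

```python
# Overlap-add sketch of chunked FFT convolution in numpy.
import numpy as np

def overlap_add_conv(x, h, chunk=256):
    # Per-chunk FFT length covering the full linear convolution.
    n_fft = chunk + len(h) - 1
    H = np.fft.rfft(h, n_fft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), chunk):
        seg = x[start:start + chunk]
        out = np.fft.irfft(np.fft.rfft(seg, n_fft) * H, n_fft)
        end = min(start + n_fft, len(y))
        y[start:end] += out[:end - start]   # overlap-add of chunk tails
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)   # toy "sequence"
h = rng.standard_normal(500)    # toy "filter"
assert np.allclose(overlap_add_conv(x, h), np.convolve(x, h))
```

The assertion confirms that chunked overlap-add reproduces the full linear convolution exactly (up to floating-point error), which is why chunking can trade memory for only a modest throughput cost.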

[LG-87] Leveraging Foundation Models for Calibration-Free c-VEP BCIs

链接: https://arxiv.org/abs/2601.06028
作者: Mohammadreza Behboodi,Eli Kinney-Lang,Ali Etemad,Adam Kirton,Hatem Abou-Zeid
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 Pages, 2 figures, Accepted and Presented at the IEEE SMC Conference 2025

点击查看摘要

[LG-88] A Complete Decomposition of Stochastic Differential Equations

链接: https://arxiv.org/abs/2601.07834
作者: Samuel Duffield
类目: Probability (math.PR); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

[LG-89] Learning to bin: differentiable and Bayesian optimization for multi-dimensional discriminants in high-energy physics

链接: https://arxiv.org/abs/2601.07756
作者: Johannes Erdmann,Nitish Kumar Kasaraguppe,Florian Mausolf
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Categorizing events using discriminant observables is central to many high-energy physics analyses. Yet, bin boundaries are often chosen by hand. A simple, popular choice is to apply argmax projections of multi-class scores and equidistant binning of one-dimensional discriminants. We propose a binning optimization for signal significance directly in multi-dimensional discriminants. We use a Gaussian Mixture Model (GMM) to define flexible bin boundary shapes for multi-class scores, while in one dimension (binary classification) we move bin boundaries directly. On this binning model, we study two optimization strategies: a differentiable and a Bayesian optimization approach. We study two toy setups: a binary classification and a three-class problem with two signals and backgrounds. In the one-dimensional case, both approaches achieve similar gains in signal sensitivity compared to equidistant binnings for a given number of bins. In the multi-dimensional case, the GMM-based binning defines sensitive categories as well, with the differentiable approach performing best. We show that, in particular for limited separability of the signal processes, our approach outperforms argmax classification even with optimized binning in the one-dimensional projections. Both methods are released as lightweight Python plugins intended for straightforward integration into existing analyses.
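
To make "optimizing bin boundaries for signal significance" concrete, here is a brute-force one-dimensional baseline: scan a single boundary on a classifier score and keep the cut maximizing a simple per-bin significance sum. The GMM boundary model and the differentiable/Bayesian optimizers of the paper are not reproduced, and the score distributions and event weights are toy assumptions.

```python
# Grid-search a single bin boundary to maximize Z = sqrt(sum_i s_i^2 / b_i).
import numpy as np

rng = np.random.default_rng(1)
sig = rng.beta(5, 2, 20000)   # toy signal scores, peaked near 1
bkg = rng.beta(2, 5, 80000)   # toy background scores, peaked near 0

def significance(cut):
    z2 = 0.0
    for lo, hi in [(0.0, cut), (cut, 1.0)]:
        s = np.sum((sig >= lo) & (sig < hi)) * 0.01  # assumed signal weight
        b = np.sum((bkg >= lo) & (bkg < hi)) * 1.0   # assumed bkg weight
        if b > 0:
            z2 += s * s / b
    return np.sqrt(z2)

cuts = np.linspace(0.05, 0.95, 91)
best = cuts[np.argmax([significance(c) for c in cuts])]
print(f"best boundary: {best:.2f}, Z = {significance(best):.2f}")
```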

[LG-90] Riesz Representer Fitting under Bregman Divergence: A Unified Framework for Debiased Machine Learning

链接: https://arxiv.org/abs/2601.07752
作者: Masahiro Kato
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Estimating the Riesz representer is a central problem in debiased machine learning for causal and structural parameter estimation. Various methods for Riesz representer estimation have been proposed, including Riesz regression and covariate balancing. This study unifies these methods within a single framework. Our framework fits a Riesz representer model to the true Riesz representer under a Bregman divergence, which includes the squared loss and the Kullback–Leibler (KL) divergence as special cases. We show that the squared loss corresponds to Riesz regression, and the KL divergence corresponds to tailored loss minimization, where the dual solutions correspond to stable balancing weights and entropy balancing weights, respectively, under specific model specifications. We refer to our method as generalized Riesz regression, and we refer to the associated duality as automatic covariate balancing. Our framework also generalizes density ratio fitting under a Bregman divergence to Riesz representer estimation, and it includes various applications beyond density ratio estimation. We also provide a convergence analysis for both cases where the model class is a reproducing kernel Hilbert space (RKHS) and where it is a neural network.
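
For readers unfamiliar with the squared-loss special case, the display below sketches the unifying objective in a standard form from the automatic debiased machine learning literature; the paper's exact notation and regularity conditions may differ.

```latex
% Bregman fitting of a Riesz representer model f to the true representer
% \alpha_0, where D_\Phi is the Bregman divergence generated by a convex
% \Phi. With \Phi(t) = t^2 (squared loss), the Riesz representation
% E[m(Z; g)] = E[\alpha_0(X) g(X)] makes the objective estimable without
% knowing \alpha_0 -- the usual "Riesz regression" device.
\min_{f} \; \mathbb{E}\!\left[ D_\Phi\big(\alpha_0(X), f(X)\big) \right]
\quad\Longrightarrow\quad
\min_{f} \; \mathbb{E}\!\left[ f(X)^2 - 2\, m(Z; f) \right] + \text{const.}
```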

[LG-91] PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials

链接: https://arxiv.org/abs/2601.07742
作者: Teddy Koker,Abhijeet Gangan,Mit Kotak,Jaime Marian,Tess Smidt
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many materials properties depend on higher-order derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with a standard loss on energy, force, and stress errors can exhibit errors in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP (trained on Materials Project) by 55% on average across phonon thermodynamic properties and achieves state-of-the-art performance among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.
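
A minimal sketch of the stated supervision signal, assuming a toy quadratic "energy" in place of an MLIP and a perturbed matrix in place of DFT force constants: sample a Hessian column, compute it with a single Hessian-vector product, and penalize its deviation from the reference.

```python
# Stochastic Hessian-column supervision via a single HVP (toy stand-ins,
# not the paper's model or data).
import torch

torch.manual_seed(0)
n = 12                                   # 3 * (number of atoms), toy size
A = torch.randn(n, n); A = A @ A.T / n   # defines the toy model Hessian
PHI = (A + 0.05 * torch.randn(n, n))     # "DFT" force constants, perturbed
PHI = (PHI + PHI.T) / 2

pos = torch.randn(n)                     # flattened atomic positions

def energy(x):
    return 0.5 * x @ A @ x               # stand-in for the MLIP energy

# Sample one Hessian column e_i and compute H e_i with a single HVP.
v = torch.zeros(n); v[torch.randint(n, (1,))] = 1.0
_, Hv = torch.autograd.functional.hvp(energy, pos, v)
loss_pft = torch.mean((Hv - PHI @ v) ** 2)
print("phonon fine-tuning loss:", loss_pft.item())
```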

[LG-92] Backward Reconstruction of the Chafee–Infante Equation via Physics-Informed WGAN-GP

链接: https://arxiv.org/abs/2601.07733
作者: Joseph L. Shomberg
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG)
*备注: 5 pages, 9 figures

点击查看摘要

[LG-93] A Framework for Feature Discovery in Intracranial Pressure Monitoring Data Using Neural Network Attention

链接: https://arxiv.org/abs/2601.07691
作者: Jonathan D. Socha,Seyed F. Maroufi,Dipankar Biswas,Richard Um,Aruna S. Rao,Mark G. Luciano
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 12 pages, 18 figures

点击查看摘要

Abstract:We present a novel framework for analyzing intracranial pressure monitoring data by applying interpretability principles. Intracranial pressure monitoring data was collected from 60 patients at Johns Hopkins. The data was segmented into individual cardiac cycles. A convolutional neural network was trained to classify each cardiac cycle into one of seven body positions. Neural network attention was extracted and was used to identify regions of interest in the waveform. Further directions for exploration are identified. This framework provides an extensible method to further understand the physiological and clinical underpinnings of the intracranial pressure waveform, which could lead to better diagnostic capabilities for intracranial pressure monitoring.

[LG-94] Physics-Informed Singular-Value Learning for Cross-Covariances Forecasting in Financial Markets

链接: https://arxiv.org/abs/2601.07687
作者: Efstratios Manolakis,Christian Bongiorno,Rosario Nunzio Mantegna
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-95] Dual-Level Models for Physics-Informed Multi-Step Time Series Forecasting

链接: https://arxiv.org/abs/2601.07640
作者: Mahdi Nasiri,Johanna Kortelainen,Simo Särkkä
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper develops an approach for multi-step forecasting of dynamical systems by integrating probabilistic input forecasting with physics-informed output prediction. Accurate multi-step forecasting of time series systems is important for the automatic control and optimization of physical processes, enabling more precise decision-making. While mechanistic and data-driven machine learning (ML) approaches have both been employed for time series forecasting, they face significant limitations. Incomplete knowledge of a process's mathematical model limits the direct use of mechanistic approaches, while purely data-driven ML models struggle with dynamic environments, leading to poor generalization. To address these limitations, this paper proposes a dual-level strategy for physics-informed forecasting of dynamical systems. On the first level, input variables are forecast using a hybrid method that integrates a long short-term memory (LSTM) network into probabilistic state transition models (STMs). On the second level, these stochastically predicted inputs are sequentially fed into a physics-informed neural network (PINN) to generate multi-step output predictions. The experimental results of the paper demonstrate that the hybrid input forecasting models achieve a higher log-likelihood and lower mean squared errors (MSE) compared to conventional STMs. Furthermore, the PINNs driven by the input forecasting models outperform their purely data-driven counterparts in terms of MSE and log-likelihood, exhibiting stronger generalization and forecasting performance across multiple test cases.

[LG-96] Reinforcement Learning for Micro-Level Claims Reserving

链接: https://arxiv.org/abs/2601.07637
作者: Benjamin Avanzi,Ronald Richman,Bernard Wong,Mario Wüthrich,Yagebu Xie
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Outstanding claim liabilities are revised repeatedly as claims develop, yet most modern reserving models are trained as one-shot predictors and typically learn only from settled claims. We formulate individual claims reserving as a claim-level Markov decision process in which an agent sequentially updates outstanding claim liability (OCL) estimates over development, using continuous actions and a reward design that balances accuracy with stable reserve revisions. A key advantage of this reinforcement learning (RL) approach is that it can learn from all observed claim trajectories, including claims that remain open at valuation, thereby avoiding the reduced sample size and selection effects inherent in supervised methods trained on ultimate outcomes only. We also introduce practical components needed for actuarial use – initialisation of new claims, temporally consistent tuning via a rolling-settlement scheme, and an importance-weighting mechanism to mitigate portfolio-level underestimation driven by the rarity of large claims. On CAS and SPLICE synthetic general insurance datasets, the proposed Soft Actor-Critic implementation delivers competitive claim-level accuracy and strong aggregate OCL performance, particularly for the immature claim segments that drive most of the liability.

[LG-97] Temporal-Aligned Meta-Learning for Risk Management: A Stacking Approach for Multi-Source Credit Scoring

链接: https://arxiv.org/abs/2601.07588
作者: O. Didkovskyi,A. Vidali,N. Jean,G. Le Pera
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注:

点击查看摘要

[LG-98] Machine learning nonequilibrium phase transitions in charge-density wave insulators

链接: https://arxiv.org/abs/2601.07583
作者: Yunhao Fan,Sheng Zhang,Gia-Wei Chern
类目: rongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Nonequilibrium electronic forces play a central role in voltage-driven phase transitions but are notoriously expensive to evaluate in dynamical simulations. Here we develop a machine learning framework for adiabatic lattice dynamics coupled to nonequilibrium electrons, and demonstrate it for a gating-induced insulator-to-metal transition out of a charge-density-wave state in the Holstein model. Although exact electronic forces can be obtained from nonequilibrium Green's function (NEGF) calculations, their high computational cost renders long-time dynamical simulations prohibitively expensive. By exploiting the locality of the electronic response, we train a neural network to directly predict instantaneous local electronic forces from the lattice configuration, thereby bypassing repeated NEGF calculations during time evolution. When combined with Brownian dynamics, the resulting machine learning force field quantitatively reproduces domain-wall motion and nonequilibrium phase-transition dynamics obtained from full NEGF simulations, while achieving orders-of-magnitude gains in computational efficiency. Our results establish direct force learning as an efficient and accurate approach for simulating nonequilibrium lattice dynamics in driven quantum materials.

[LG-99] Nonparametric Kernel Clustering with Bandit Feedback

链接: https://arxiv.org/abs/2601.07535
作者: Victor Thuot(MISTEA),Sebastian Vogt(LRR-TUM),Debarghya Ghoshdastidar(LRR-TUM),Nicolas Verzelen(MISTEA)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-100] PIDT: Physics-Informed Digital Twin for Optical Fiber Parameter Estimation

链接: https://arxiv.org/abs/2601.07436
作者: Zicong Jiang,Magnus Karlsson,Erik Agrell,Christian Häger
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Optics (physics.optics)
*备注: The paper will appear in Optical Fiber Communications Conference and Exhibition (OFC) 2026

点击查看摘要

[LG-101] Position: Don't be Afraid of Over-Smoothing And Over-Squashing

链接: https://arxiv.org/abs/2601.07419
作者: Niklas Kormann,Benjamin Doerr,Johannes F. Lutzeyer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint. Copyright 2026 by the authors

点击查看摘要

Abstract:Over-smoothing and over-squashing have been extensively studied in the literature on Graph Neural Networks (GNNs) over the past years. We challenge this prevailing focus in GNN research, arguing that these phenomena are less critical for practical applications than assumed. We suggest that performance decreases often stem from uninformative receptive fields rather than over-smoothing. We support this position with extensive experiments on several standard benchmark datasets, demonstrating that accuracy and over-smoothing are mostly uncorrelated and that optimal model depths remain small even with mitigation techniques, thus highlighting the negligible role of over-smoothing. Similarly, we challenge that over-squashing is always detrimental in practical applications. Instead, we posit that the distribution of relevant information over the graph frequently factorises and is often localised within a small k-hop neighbourhood, questioning the necessity of jointly observing entire receptive fields or engaging in an extensive search for long-range interactions. The results of our experiments show that architectural interventions designed to mitigate over-squashing fail to yield significant performance gains. This position paper advocates for a paradigm shift in theoretical research, urging a diligent analysis of learning tasks and datasets using statistics that measure the underlying distribution of label-relevant information to better understand their localisation and factorisation.

[LG-102] Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

链接: https://arxiv.org/abs/2601.07326
作者: Huan Li,Yiming Dong,Zhouchen Lin
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-103] Variational Approximations for Robust Bayesian Inference via Rho-Posteriors

链接: https://arxiv.org/abs/2601.07325
作者: EL Mahdi Khribch,Pierre Alquier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 53 pages including the proofs in appendices, 16 figures

点击查看摘要

Abstract:The \rho-posterior framework provides universal Bayesian estimation with explicit contamination rates and optimal convergence guarantees, but has remained computationally difficult due to an optimization over reference distributions that precludes tractable posterior computation. We develop a PAC-Bayesian framework that recovers these theoretical guarantees through temperature-dependent Gibbs posteriors, deriving finite-sample oracle inequalities with explicit rates and introducing tractable variational approximations that inherit the robustness properties of exact \rho-posteriors. Numerical experiments demonstrate that this approach achieves theoretical contamination rates while remaining computationally feasible, providing the first practical implementation of \rho-posterior inference with rigorous finite-sample guarantees.
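
For orientation, the display below writes the temperature-dependent Gibbs posterior and its variational approximation in the standard PAC-Bayes form; the paper's \rho-posterior construction refines this basic template.

```latex
% Gibbs posterior: prior \pi, empirical risk r_n, inverse temperature
% \lambda > 0. The variational approximation replaces the exact
% \hat\pi_\lambda by the closest member of a tractable family
% \mathcal{F}; without the restriction to \mathcal{F}, the minimizer
% of the right-hand objective is exactly \hat\pi_\lambda.
\hat\pi_\lambda(\mathrm{d}\theta) \propto
  \exp\{-\lambda\, r_n(\theta)\}\, \pi(\mathrm{d}\theta),
\qquad
\tilde\pi_\lambda = \arg\min_{q \in \mathcal{F}}
  \Big\{ \lambda\, \mathbb{E}_{\theta \sim q}[r_n(\theta)]
         + \mathrm{KL}(q \,\|\, \pi) \Big\}.
```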

[LG-104] Covariance-Driven Regression Trees: Reducing Overfitting in CART

链接: https://arxiv.org/abs/2601.07281
作者: Likun Zhang,Wei Ma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Decision trees are powerful machine learning algorithms, widely used in fields such as economics and medicine for their simplicity and interpretability. However, decision trees such as CART are prone to overfitting, especially when grown deep or when the sample size is small. Conventional methods to reduce overfitting include pre-pruning and post-pruning, which constrain the growth of uninformative branches. In this paper, we propose a complementary approach by introducing a covariance-driven splitting criterion for regression trees (CovRT). This method is more robust to overfitting than the empirical risk minimization criterion used in CART, as it produces more balanced and stable splits and more effectively identifies covariates with true signals. We establish an oracle inequality for CovRT and prove that its predictive accuracy is comparable to that of CART in high-dimensional settings. We find that CovRT achieves superior prediction accuracy compared to CART in both simulations and real-world tasks.

[LG-105] Multi-environment Invariance Learning with Missing Data

链接: https://arxiv.org/abs/2601.07247
作者: Yiran Jia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-106] On Lie Groups Preserving Subspaces of Degenerate Clifford Algebras

链接: https://arxiv.org/abs/2601.07191
作者: E. R. Filimoshina,D. S. Shirokov
类目: Rings and Algebras (math.RA); Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:

点击查看摘要

Abstract:This paper introduces Lie groups in degenerate geometric (Clifford) algebras that preserve four fundamental subspaces determined by the grade involution and reversion under the adjoint and twisted adjoint representations. We prove that these Lie groups can be equivalently defined using norm functions of multivectors applied in the theory of spin groups. We also study the corresponding Lie algebras. Some of these Lie groups and algebras are closely related to Heisenberg Lie groups and algebras. The introduced groups are interesting for various applications in physics and computer science, in particular, for constructing equivariant neural networks.

[LG-107] Optimal Transport under Group Fairness Constraints

链接: https://arxiv.org/abs/2601.07144
作者: Linus Bleistein,Mathieu Dagréou,Francisco Andrade,Thomas Boudou,Aurélien Bellet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Ensuring fairness in matching algorithms is a key challenge in allocating scarce resources and positions. Focusing on Optimal Transport (OT), we introduce a novel notion of group fairness requiring that the probability of matching two individuals from any two given groups in the OT plan satisfies a predefined target. We first propose \textttFairSinkhorn, a modified Sinkhorn algorithm to compute perfectly fair transport plans efficiently. Since exact fairness can significantly degrade matching quality in practice, we then develop two relaxation strategies. The first one involves solving a penalised OT problem, for which we derive novel finite-sample complexity guarantees. This result is of independent interest as it can be generalized to arbitrary convex penalties. Our second strategy leverages bilevel optimization to learn a ground cost that induces a fair OT solution, and we establish a bound guaranteeing that the learned cost yields fair matchings on unseen data. Finally, we present empirical results that illustrate the trade-offs between fairness and performance.
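
As background for FairSinkhorn, here are the vanilla entropic Sinkhorn iterations it builds on, in plain numpy; the group-fairness constraint and its projection step are the paper's contributions and are not reproduced here.

```python
# Vanilla entropic-regularized Sinkhorn iterations (the base algorithm).
import numpy as np

def sinkhorn(C, a, b, reg=0.1, iters=200):
    K = np.exp(-C / reg)             # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)            # enforce column marginals
        u = a / (K @ v)              # enforce row marginals
    return u[:, None] * K * v[None, :]   # transport (matching) plan

rng = np.random.default_rng(0)
C = rng.random((5, 5))               # toy matching costs
a = np.full(5, 0.2)                  # uniform source marginal
b = np.full(5, 0.2)                  # uniform target marginal
P = sinkhorn(C, a, b)
print("row sums:", P.sum(axis=1))    # ~= a after convergence
```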

[LG-108] Robust Bayesian Optimization via Tempered Posteriors

链接: https://arxiv.org/abs/2601.07094
作者: Jiguang Li,Hengrui Luo
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-109] Robust Mean Estimation under Quantization

链接: https://arxiv.org/abs/2601.07074
作者: Pedro Abdalla,Junren Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

[LG-110] Local EGOP for Continuous Index Learning

链接: https://arxiv.org/abs/2601.07061
作者: Alex Kokot,Anand Hemmady,Vydhourie Thiyageswaran,Marina Meila
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-111] Conditional Normalizing Flows for Forward and Backward Joint State and Parameter Estimation

链接: https://arxiv.org/abs/2601.07013
作者: Luke S. Lagunowich,Guoxiang Grayson Tong,Daniele E. Schiavazzi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-112] Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests

链接: https://arxiv.org/abs/2601.07003
作者: Roman Hornung(1 and 2),Alexander Hapfelmeier(3 and 4) ((1) Institute for Medical Information Processing, Biometry and Epidemiology, Faculty of Medicine, Ludwig Maximilian University of Munich (LMU), Munich, Germany, (2) Munich Center for Machine Learning (MCML), Munich, Germany, (3) Institute of General Practice and Health Services Research, Department Clinical Medicine, TUM School of Medicine and Health, Technical University of Munich (TUM), Munich, Germany, (4) Institute of AI and Informatics in Medicine, TUM School of Medicine and Health, Technical University of Munich (TUM), Munich, Germany)
类目: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 33 pages, 12 figures

点击查看摘要

[LG-113] Match Made with Matrix Completion: Efficient Learning under Matching Interference

链接: https://arxiv.org/abs/2601.06982
作者: Zhiyuan Tang,Wanning Chen,Kan Xu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-114] The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks

链接: https://arxiv.org/abs/2601.06961
作者: Taishi Watanabe,Ryo Karakida,Jun-nosuke Teramae
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

[LG-115] Deep Learning Based Channel Extrapolation for Dual-Band Massive MIMO Systems

链接: https://arxiv.org/abs/2601.06858
作者: Qikai Xiao,Kehui Li,Binggui Zhou,Shaodan Ma
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-116] Constrained Density Estimation via Optimal Transport

链接: https://arxiv.org/abs/2601.06830
作者: Yinan Hu,Estaban Tabak
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Probability (math.PR)
*备注:

点击查看摘要

Abstract:A novel framework for density estimation under expectation constraints is proposed. The framework minimizes the Wasserstein distance between the estimated density and a prior, subject to the constraints that the expected values of a set of functions adopt or exceed given values. The framework is generalized to include regularization inequalities to mitigate artifacts in the target measure. An annealing-like algorithm is developed to address non-smooth constraints, with its effectiveness demonstrated through both synthetic and proof-of-concept real-world examples in finance.

[LG-117] Dimension-reduced outcome-weighted learning for estimating individualized treatment regimes in observational studies

链接: https://arxiv.org/abs/2601.06782
作者: Sungtaek Son,Eardi Lila,Kwun Chuen Gary Chan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 54 pages, 9 figures

点击查看摘要

[LG-118] Diffusion Models with Heavy-Tailed Targets: Score Estimation and Sampling Guarantees

链接: https://arxiv.org/abs/2601.06715
作者: Yifeng Yu,Lu Yu
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Score-based diffusion models have become a powerful framework for generative modeling, with score estimation as a central statistical bottleneck. Existing guarantees for score estimation largely focus on light-tailed targets or rely on restrictive assumptions such as compact support, which are often violated by heavy-tailed data in practice. In this work, we study conventional (Gaussian) score-based diffusion models when the target distribution is heavy-tailed and belongs to a Sobolev class with smoothness parameter \beta > 0. We consider both exponential and polynomial tail decay, indexed by a tail parameter \gamma. Using kernel density estimation, we derive sharp minimax rates for score estimation, revealing a qualitative dichotomy: under exponential tails, the rate matches the light-tailed case up to polylogarithmic factors, whereas under polynomial tails the rate depends explicitly on \gamma. We further provide sampling guarantees for the associated continuous reverse dynamics. In total variation, the generated distribution converges at the minimax optimal rate n^{-\beta/(2\beta+d)} under exponential tails (up to logarithmic factors), and at a \gamma-dependent rate under polynomial tails. Whether the latter sampling rate is minimax optimal remains an open question. These results characterize the statistical limits of score estimation and the resulting sampling accuracy for heavy-tailed targets, extending diffusion theory beyond the light-tailed setting.

[LG-119] A Multimodal Deep Learning Framework for Predicting ICU Deterioration: Integrating ECG Waveforms with Clinical Data and Clinician Benchmarking ALT

链接: https://arxiv.org/abs/2601.06645
作者: Juan Miguel López Alcaraz,Xicoténcatl López Moran,Erick Dávila Zaragoza,Claas Händel,Richard Koebe,Wilhelm Haverkamp,Nils Strodthoff
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 23 pages, 8 figures, source code under this https URL

点击查看摘要

[LG-120] Inference-Time Alignment for Diffusion Models via Doob's Matching

链接: https://arxiv.org/abs/2601.06514
作者: Jinyuan Chang,Chenguang Duan,Yuling Jiao,Yi Xu,Jerry Zhijian Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:

点击查看摘要

[LG-121] Physics-informed Gaussian Process Regression in Solving Eigenvalue Problem of Linear Operators

链接: https://arxiv.org/abs/2601.06462
作者: Tianming Bai,Jiannan Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures

点击查看摘要

[LG-122] Continual Quantum Architecture Search with Tensor-Train Encoding: Theory and Applications to Signal Processing

链接: https://arxiv.org/abs/2601.06392
作者: Jun Qi,Chao-Han Huck Yang,Pin-Yu Chen,Javier Tejedor,Ling Li,Min-Hsiu Hsieh
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: In submission

点击查看摘要

Abstract:We introduce CL-QAS, a continual quantum architecture search framework that mitigates the challenges of costly amplitude encoding and catastrophic forgetting in variational quantum circuits. The method uses Tensor-Train encoding to efficiently compress high-dimensional stochastic signals into low-rank quantum feature representations. A bi-loop learning strategy separates circuit parameter optimization from architecture exploration, while an Elastic Weight Consolidation regularization ensures stability across sequential tasks. We derive theoretical upper bounds on approximation, generalization, and robustness under quantum noise, demonstrating that CL-QAS achieves controllable expressivity, sample-efficient generalization, and smooth convergence without barren plateaus. Empirical evaluations on electrocardiogram (ECG)-based signal classification and financial time-series forecasting confirm substantial improvements in accuracy, balanced accuracy, F1 score, and reward. CL-QAS maintains strong forward and backward transfer and exhibits bounded degradation under depolarizing and readout noise, highlighting its potential for adaptive, noise-resilient quantum learning on near-term devices.
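
The EWC regularizer named in the abstract has a standard form that is easy to sketch: quadratically anchor each parameter to its old-task value, weighted by a diagonal Fisher estimate. The torch model, random Fisher values, and lambda below are placeholders; the quantum-circuit and tensor-train components of CL-QAS are not modeled.

```python
# Standard Elastic Weight Consolidation penalty on a generic torch model.
import torch

def ewc_penalty(model, fisher, old_params, lam=1.0):
    # lam/2 * sum_i F_i * (theta_i - theta*_i)^2 over all parameters.
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss

model = torch.nn.Linear(4, 2)
# Diagonal Fisher is usually estimated from squared gradients on the old
# task; random values stand in here, plus a snapshot of old parameters.
fisher = {n: torch.rand_like(p) for n, p in model.named_parameters()}
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
task_loss = torch.nn.functional.cross_entropy(model(x), y)
total = task_loss + ewc_penalty(model, fisher, old_params, lam=10.0)
total.backward()   # gradients now balance new-task fit vs. old-task drift
```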

[LG-123] Computational Mapping of Reactive Stroma in Prostate Cancer Yields Interpretable Prognostic Biomarkers

链接: https://arxiv.org/abs/2601.06360
作者: Mara Pleasure,Ekaterina Redekop,Dhakshina Ilango,Zichen Wang,Vedrana Ivezic,Kimberly Flores,Israa Laklouk,Jitin Makker,Gregory Fishbein,Anthony Sisk,William Speier,Corey W. Arnold
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-124] Hard Constraint Projection in a Physics Informed Neural Network

链接: https://arxiv.org/abs/2601.06244
作者: Miranda J. S. Horne(1),Peter K. Jimack(1),Amirul Khan(1),He Wang(2) ((1) University of Leeds, (2) University College London)
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, Accepted manuscript of the paper presented at ParCFD2024

点击查看摘要

Abstract:In this work, we embed hard constraints in a physics informed neural network (PINN) which predicts solutions to the 2D incompressible Navier Stokes equations. We extend the hard constraint method introduced by Chen et al. (arXiv:2012.06148) from a linear PDE to a strongly non-linear PDE. The PINN is used to estimate the stream function and pressure of the fluid, and by differentiating the stream function we can recover an incompressible velocity field. An unlearnable hard constraint projection (HCP) layer projects the predicted velocity and pressure to a hyperplane that admits only exact solutions to a discretised form of the governing equations.
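
The incompressibility-by-construction step is worth spelling out: if the network predicts a stream function psi and the velocity is defined as u = dpsi/dy, v = -dpsi/dx, then div(u, v) = psi_xy - psi_yx = 0 identically. Below is a quick finite-difference check in numpy; the PINN and the HCP projection layer themselves are not reproduced.

```python
# Verify that velocities derived from an arbitrary smooth stream function
# are divergence-free (up to finite-difference roundoff).
import numpy as np

n = 128
x = np.linspace(0, 1, n)
X, Y = np.meshgrid(x, x, indexing="ij")
psi = np.sin(2 * np.pi * X) * np.cos(3 * np.pi * Y)   # any smooth psi

h = x[1] - x[0]
u = np.gradient(psi, h, axis=1)       # u =  dpsi/dy
v = -np.gradient(psi, h, axis=0)      # v = -dpsi/dx
div = np.gradient(u, h, axis=0) + np.gradient(v, h, axis=1)
print("max |divergence|:", np.abs(div).max())   # ~1e-13
```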

[LG-125] PriceSeer: Evaluating Large Language Models in Real-Time Stock Prediction

链接: https://arxiv.org/abs/2601.06088
作者: Bohan Liang,Zijian Chen,Qi Jia,Kaiwei Zhang,Kaiyuan Ji,Guangtao Zhai
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 7 pages, 6 figures

点击查看摘要

信息检索

[IR-0] AptaFind: A lightweight local interface for automated aptamer curation from scientific literature

链接: https://arxiv.org/abs/2601.07684
作者: Geoffrey Taghon
类目: Information Retrieval (cs.IR)
*备注: for associated source code, see this https URL

点击查看摘要

Abstract:Aptamer researchers face a literature landscape scattered across publications, supplements, and databases, with each search consuming hours that could be spent at the bench. AptaFind transforms this navigation problem through a three-tier intelligence architecture that recognizes research mining is a spectrum, not a binary success or failure. The system delivers direct sequence extraction when possible, curated research leads when extraction fails, and exhaustive literature discovery for additional confidence. By combining local language models for semantic understanding with deterministic algorithms for reliability, AptaFind operates without cloud dependencies or subscription barriers. Validation across 300 University of Texas Aptamer Database targets demonstrates 84% with some literature found, 84% with curated research leads, and 79% with a direct sequence extraction, at a laptop-compute rate of over 900 targets an hour. The platform proves that even when direct sequence extraction fails, automation can still deliver the actionable intelligence researchers need by rapidly narrowing the search to high-quality references.

[IR-1] GAP-Net: Calibrating User Intent via Gated Adaptive Progressive Learning for CTR Prediction

链接: https://arxiv.org/abs/2601.07613
作者: Ke Shenqiang,Wei Jianxiong,Hua Qingsong
类目: Information Retrieval (cs.IR)
*备注: 9 pages, 3 figures

点击查看摘要

Abstract:Sequential user behavior modeling is pivotal for Click-Through Rate (CTR) prediction yet is hindered by three intrinsic bottlenecks: (1) the “Attention Sink” phenomenon, where standard Softmax compels the model to allocate probability mass to noisy behaviors; (2) the Static Query Assumption, which overlooks dynamic shifts in user intent driven by real-time contexts; and (3) Rigid View Aggregation, which fails to adaptively weight heterogeneous temporal signals according to the decision context. To bridge these gaps, we propose GAP-Net (Gated Adaptive Progressive Network), a unified framework establishing a “Triple Gating” architecture to progressively refine information from micro-level features to macro-level views. GAP-Net operates through three integrated mechanisms: (1) Adaptive Sparse-Gated Attention (ASGA) employs micro-level gating to enforce sparsity, effectively suppressing massive noise activations; (2) Gated Cascading Query Calibration (GCQC) dynamically aligns user intent by bridging real-time triggers and long-term memories via a meso-level cascading channel; and (3) Context-Gated Denoising Fusion (CGDF) performs macro-level modulation to orchestrate the aggregation of multi-view sequences. Extensive experiments on industrial datasets demonstrate that GAP-Net achieves substantial improvements over state-of-the-art baselines, exhibiting superior robustness against interaction noise and intent drift.

[IR-2] Loci Similes: A Benchmark for Extracting Intertextualities in Latin Literature

链接: https://arxiv.org/abs/2601.07533
作者: Julian Schelb,Michael Wittweiler,Marie Revellio,Barbara Feichtinger,Andreas Spitz
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*备注:

点击查看摘要

Abstract:Tracing connections between historical texts is an important part of intertextual research, enabling scholars to reconstruct the virtual library of a writer and identify the sources influencing their creative process. These intertextual links manifest in diverse forms, ranging from direct verbatim quotations to subtle allusions and paraphrases disguised by morphological variation. Language models offer a promising path forward due to their capability of capturing semantic similarity beyond lexical overlap. However, the development of new methods for this task is held back by the scarcity of standardized benchmarks and easy-to-use datasets. We address this gap by introducing Loci Similes, a benchmark for Latin intertextuality detection comprising a curated dataset of ~172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. Using this data, we establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.

[IR-3] Towards Multi-Behavior Multi-Task Recommendation via Behavior-informed Graph Embedding Learning

链接: https://arxiv.org/abs/2601.07294
作者: Wenhao Lai,Weike Pan,Zhong Ming
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multi-behavior recommendation (MBR) aims to improve the performance w.r.t. the target behavior (i.e., purchase) by leveraging auxiliary behaviors (e.g., click, favourite). However, in real-world scenarios, a recommendation method often needs to process different types of behaviors and generate personalized lists for each task (i.e., each behavior type). Such a new recommendation problem is referred to as multi-behavior multi-task recommendation (MMR). So far, the most powerful MBR methods usually model multi-behavior interactions using a cascading graph paradigm. Although significant progress has been made in optimizing the performance of the target behavior, it often neglects the performance of auxiliary behaviors. To compensate for the deficiencies of the cascading paradigm, we propose a novel solution for MMR, i.e., behavior-informed graph embedding learning (BiGEL). Specifically, we first obtain a set of behavior-aware embeddings by using a cascading graph paradigm. Subsequently, we introduce three key modules to improve the performance of the model. The cascading gated feedback (CGF) module enables a feedback-driven optimization process by integrating feedback from the target behavior to refine the auxiliary behaviors preferences. The global context enhancement (GCE) module integrates the global context to maintain the user’s overall preferences, preventing the loss of key preferences due to individual behavior graph modeling. Finally, the contrastive preference alignment (CPA) module addresses the potential changes in user preferences during the cascading process by aligning the preferences of the target behaviors with the global preferences through contrastive learning. Extensive experiments on two real-world datasets demonstrate the effectiveness of our BiGEL compared with ten very competitive methods.

[IR-4] Making Absence Visible: The Roles of Reference and Prompting in Recognizing Missing Information

链接: https://arxiv.org/abs/2601.07234
作者: Hagit Ben Shoshan,Joel Lanir,Pavel Goldstein,Osnat Mokryn
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Interactive systems that explain data or support decision making often emphasize what is present while overlooking what is expected but missing. This presence bias limits users' ability to form complete mental models of a dataset or situation. Detecting absence depends on expectations about what should be there, yet interfaces rarely help users form such expectations. We present an experimental study examining how reference framing and prompting influence people's ability to recognize expected but missing categories in datasets. Participants compared distributions across three domains (energy, wealth, and regime) under two reference conditions: Global, presenting a unified population baseline, and Partial, showing several concrete exemplars. Results indicate that absence detection was higher with Partial reference than with Global reference, suggesting that partial, sample-based framing can support expectation formation and absence detection. When participants were prompted to look for what was missing, absence detection rose sharply. We discuss implications for interactive user interfaces and expectation-based visualization design, while considering cognitive trade-offs of reference structures and guided attention.

[IR-5] RAIRS: Optimizing Redundant Assignment and List Layout for IVF-Based ANN Search

链接: https://arxiv.org/abs/2601.07183
作者: Zehai Yang,Shimin Chen
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:IVF is one of the most widely used ANNS (Approximate Nearest Neighbor Search) methods in vector databases. The idea of redundant assignment is to assign a data vector to more than one IVF list to reduce the chance of missing true neighbors in IVF search. However, the naive strategy, which selects the second IVF list based on the distance between a data vector and the list centroids, performs poorly. Previous work focuses only on the inner-product distance, while there is no optimized list-selection study for the most popular Euclidean space. Moreover, the IVF search may access the same vector in more than one list, resulting in redundant distance computation and decreasing query throughput. In this paper, we present RAIRS to address the above two challenges. For the challenge of list selection, we propose an optimized AIR metric for the Euclidean space. AIR takes not only distances but also directions into consideration in order to support queries that are closer to the data vector but farther away from the first chosen list's centroid. For the challenge of redundant distance computation, we propose SEIL, an optimized list layout that exploits shared cells to reduce repeated distance computations for IVF search. Our experimental results using representative real-world datasets show that RAIRS outperforms existing redundant-assignment solutions and achieves up to 1.33x improvement over the best-performing IVF method, IVF-PQ Fast Scan with refinement.
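
To fix ideas, the sketch below implements the naive distance-based redundant assignment that the abstract criticizes: each vector goes to its two nearest centroids, and a query probes a few nearest lists with candidate deduplication. The AIR metric and SEIL layout are the paper's contributions and are not reproduced; all sizes are toy values.

```python
# Naive redundant-assignment IVF baseline in numpy.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 16))
centroids = rng.standard_normal((32, 16))   # IVF list centroids

# Distance-based assignment of every vector to its 2 closest lists.
d2c = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
assign = np.argsort(d2c, axis=1)[:, :2]
lists = {i: np.where((assign == i).any(axis=1))[0] for i in range(32)}

# Query: probe the nprobe nearest lists, dedupe redundant copies.
q = rng.standard_normal(16)
probe = np.argsort(np.linalg.norm(centroids - q, axis=1))[:4]
cand = np.unique(np.concatenate([lists[i] for i in probe]))
best = cand[np.argmin(np.linalg.norm(data[cand] - q, axis=1))]
print("nearest candidate id:", best)
```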

[IR-6] Unleashing the Native Recommendation Potential: LLM-Based Generative Recommendation via Structured Term Identifiers

链接: https://arxiv.org/abs/2601.06798
作者: Zhiyang Zhang,Junda She,Kuo Cai,Bo Chen,Shiyao Wang,Xinchen Luo,Qiang Luo,Ruiming Tang,Han Li,Kun Gai,Guorui Zhou
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Leveraging the vast open-world knowledge and understanding capabilities of Large Language Models (LLMs) to develop general-purpose, semantically-aware recommender systems has emerged as a pivotal research direction in generative recommendation. However, existing methods face bottlenecks in constructing item identifiers. Text-based methods introduce LLMs' vast output space, leading to hallucination, while methods based on Semantic IDs (SIDs) encounter a semantic gap between SIDs and LLMs' native vocabulary, requiring costly vocabulary expansion and alignment training. To address this, this paper introduces Term IDs (TIDs), defined as a set of semantically rich and standardized textual keywords, to serve as robust item identifiers. We propose GRLM, a novel framework centered on TIDs, which employs Context-aware Term Generation to convert items' metadata into standardized TIDs and utilizes Integrative Instruction Fine-tuning to collaboratively optimize term internalization and sequential recommendation. Additionally, Elastic Identifier Grounding is designed for robust item mapping. Extensive experiments on real-world datasets demonstrate that GRLM significantly outperforms baselines across multiple scenarios, pointing to a promising direction for generalizable and high-performance generative recommendation systems.

[IR-7] Industrial Semantics-Aware Digital Twins: A Hybrid Graph Matching Approach for Asset Administration Shells

链接: https://arxiv.org/abs/2601.06613
作者: Ariana Metović,Nicolai Maisch,Samed Ajdinović,Armin Lechler,Andreas Wortmann,Oliver Riedel
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Although the Asset Administration Shell (AAS) standard provides a structured and machine-readable representation of industrial assets, the semantic comparability of AAS models remains a major challenge, particularly when different vocabularies and modeling practices are used. Engineering would benefit from retrieving existing AAS models that are similar to a target model in order to reuse submodels, parameters, and metadata. In practice, however, heterogeneous vocabularies and divergent modeling conventions hinder automated, content-level comparison across AAS. This paper proposes a hybrid graph matching approach to enable semantics-aware comparison of Digital Twin representations. The method combines rule-based pre-filtering using SPARQL with embedding-based similarity calculation leveraging RDF2vec to capture both structural and semantic relationships between AAS models. This contribution provides a foundation for enhanced discovery, reuse, and automated configuration in Digital Twin networks.

[IR-8] Towards Building efficient Routed systems for Retrieval

链接: https://arxiv.org/abs/2601.06389
作者: Ramnath Kumar,Prateek Jain,Cho-Jui Hsieh
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Late-interaction retrieval models like ColBERT achieve superior accuracy by enabling token-level interactions, but their computational cost hinders scalability and integration with Approximate Nearest Neighbor Search (ANNS). We introduce FastLane, a novel retrieval framework that dynamically routes queries to their most informative representations, eliminating redundant token comparisons. FastLane employs a learnable routing mechanism optimized alongside the embedding model, leveraging self-attention and differentiable selection to maximize efficiency. Our approach reduces computational complexity by up to 30x while maintaining competitive retrieval performance. By bridging late-interaction models with ANNS, FastLane enables scalable, low-latency retrieval, making it feasible for large-scale applications such as search engines, recommendation systems, and question-answering platforms. This work opens pathways for multi-lingual, multi-modal, and long-context retrieval, pushing the frontier of efficient and adaptive information retrieval.
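
The token-level interaction cost that motivates FastLane is easiest to see in the ColBERT-style MaxSim score itself. The sketch below uses random embeddings as stand-ins for a real encoder, and the "routed" variant at the end is only a hypothetical illustration of how pruning query tokens shrinks the interaction, not the paper's learned router.

```python
# ColBERT-style late-interaction (MaxSim) scoring in numpy.
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 128))        # 8 query token embeddings
D = rng.standard_normal((120, 128))      # 120 document token embeddings

sim = Q @ D.T                            # all token-pair similarities
score = sim.max(axis=1).sum()            # best doc token per query token
print("late-interaction score:", score)

# Routing intuition: keeping only k << 8 informative query tokens
# shrinks the 8 x 120 interaction proportionally (hypothetical top-2).
keep = np.argsort(sim.max(axis=1))[-2:]
print("routed score:", sim[keep].max(axis=1).sum())
```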

[IR-9] Data-Driven Framework Development for Public Space Quality Assessment

链接: https://arxiv.org/abs/2601.06026
作者: Sherzod Turaev,Mary John
类目: Computers and Society (cs.CY); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注: 66 pages, 8 figures

点击查看摘要

Abstract:Public space quality assessment lacks systematic methodologies that integrate factors across diverse spatial typologies while maintaining context-specific relevance. Current approaches remain fragmented within disciplinary boundaries, limiting comprehensive evaluation and comparative analysis across different space types. This study develops a systematic, data-driven framework for assessing public space quality through the algorithmic integration of empirical research findings. Using a 7-phase methodology, we transform 1,207 quality factors extracted from 157 peer-reviewed studies into a validated hierarchical taxonomy spanning six public space typologies: urban spaces, open spaces, green spaces, parks and waterfronts, streets and squares, and public facilities. The methodology combines semantic analysis, cross-typology distribution analysis, and domain knowledge integration to address terminological variations and functional relationships across space types. The resulting framework organizes 1,029 unique quality factors across 14 main categories and 66 subcategories, identifying 278 universal factors applicable across all space types, 397 space-specific factors unique to particular typologies, and 124 cross-cutting factors serving multiple functions. Framework validation demonstrates systematic consistency in factor organization and theoretical alignment with established research on public spaces. This research provides a systematic methodology for transforming empirical public space research into practical assessment frameworks, supporting evidence-based policy development, design quality evaluation, and comparative analysis across diverse urban contexts.

附件下载

点击下载今日全部论文列表