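This post lists the latest papers retrieved from arXiv.org on 2026-01-14. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.
Reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Contents
Overview (2026-01-14)
528 papers were updated today, including:
- Natural Language Processing: 91 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 189 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 121 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 133 papers (Machine Learning (cs.LG))
Natural Language Processing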
[NLP-0] Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System
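Quick read: This paper addresses unstable review quality in academic conference peer review, which stems from reviewer subjectivity and inconsistency. Specifically, it studies how an Elo rating mechanism affects a multi-round, interactive review system made up of large language model (LLM) agent reviewers. The core of the approach is a review dynamic based on Elo ratings, combined with reviewer memory, that helps the Area Chair judge paper quality more accurately. The simulations also show that LLM agent reviewers behave strategically under this mechanism: they adapt their review strategy to exploit the Elo system and raise their ranking without putting more effort into their reviews. The findings offer empirical evidence for building fairer, more efficient AI-assisted review systems.
Link: https://arxiv.org/abs/2601.08829
Authors: Hsiang-Wei Huang,Junbin Lu,Kuang-Ming Chen,Jenq-Neng Hwang
Affiliations: University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In submission. The first two authors contributed equally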
Abstract:In this work, we explore the Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers’ adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at this https URL.
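For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard Elo update for one pairwise comparison. It is not taken from the paper: the K-factor of 32, the 400-point scale, and the function names are conventional or illustrative choices, and the paper's own Elo variant for reviewers may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    The K-factor k = 32 is a common default, assumed here for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical example: two equally rated reviewers, A's review is judged better.
    print(elo_update(1500.0, 1500.0, score_a=1.0))
```

Because the update depends only on comparison outcomes, a reviewer can in principle gain rating by changing how its reviews are judged rather than how much effort goes into them, which is the kind of strategic behaviour the abstract reports.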
[NLP-1] Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
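Quick read: This paper targets the low bandwidth of the long Chain-of-Thought (CoT) sequences that large language models rely on for complex reasoning, while keeping reasoning flexible and efficient. Standard CoT emits discrete token sequences, which makes reasoning traces long and hard to optimize, whereas humans often reason softly by maintaining a probability distribution over plausible next steps. The authors propose Multiplex Thinking: at each thinking step, the model samples K candidate tokens and aggregates their embeddings into a single continuous "multiplex token". This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation while inducing a tractable probability distribution over multiplex rollouts. As a result, trajectories can be optimized directly with on-policy reinforcement learning (RL). The mechanism is also self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents several plausible next steps without increasing sequence length.
Link: https://arxiv.org/abs/2601.08808
Authors: Yao Tang,Li Dong,Yaru Hao,Qingxiu Dong,Furu Wei,Jiatao Gu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages. Code available at this https URL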
Abstract:Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at this https URL.
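The sample-and-merge step described above can be illustrated with a toy sketch. It assumes sampling with replacement and plain averaging of the sampled embeddings; the abstract only says the embeddings are "aggregated", so the averaging rule, the function name `multiplex_token`, and the tiny embedding table are all hypothetical.

```python
import random


def multiplex_token(probs, embeddings, k, rng):
    """Sample k token ids from `probs` (with replacement) and average their embeddings.

    Returns the sampled ids and one merged vector, so the sequence grows by a
    single position regardless of k. Averaging is an assumption of this sketch.
    """
    token_ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [0.0] * dim
    for tid in token_ids:
        for j in range(dim):
            merged[j] += embeddings[tid][j] / k
    return token_ids, merged


if __name__ == "__main__":
    rng = random.Random(0)
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token vocabulary
    # Confident distribution: every sample is token 1, so the merge is discrete.
    print(multiplex_token([0.0, 1.0, 0.0], table, k=4, rng=rng))
    # Uncertain distribution: the merge blends several candidate embeddings.
    print(multiplex_token([0.4, 0.3, 0.3], table, k=4, rng=rng))
```

With a one-hot distribution the merged vector equals a single token embedding, which mirrors the abstract's remark that a confident model behaves like standard CoT.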
[NLP-2] APEX-SWE
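Quick read: This paper addresses the lack of a systematic framework for evaluating frontier AI models on software engineering (SWE), in particular whether they can carry out economically valuable work. Existing evaluations focus on narrow, well-defined tasks and do not reflect the diversity and complexity of real software development. The authors propose the AI Productivity Index for Software Engineering (APEX-SWE), which uses two novel task types: Integration tasks, which require building end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and Observability tasks, which require debugging production failures from telemetry signals such as logs and dashboards. Among the eight models evaluated, Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. The analysis attributes strong performance mainly to epistemic reasoning, the ability to distinguish assumptions from verified facts, combined with the agency to resolve uncertainty before acting.
Link: https://arxiv.org/abs/2601.08806
Authors: Abhi Kottamasu,Akul Datta,Aakash Barthwal,Chirag Mahapatra,Ajay Arun,Adarsh Hiremath,Brendan Foody,Bertie Vidgen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: None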
Abstract:We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
[NLP-3] Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling
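Quick read: This paper studies how to achieve universal alignment (U-alignment) for large language models (LLMs) when users have diverse and potentially conflicting preferences. A model that returns a single response at test time cannot accommodate differing preferences. Producing several candidate responses (test-time scaling) is promising, but existing post-training methods such as Nash learning from human feedback (NLHF) lack output diversity, so additional samples are redundant and the win rate is not guaranteed to approach 1 as the number of samples grows. The key to the solution is a family of symmetric multi-player alignment games: any symmetric Nash equilibrium policy of the (k+1)-player game achieves the optimal (k, k/(k+1))-robust alignment, and the paper proves that this convergence rate cannot be improved in general. It also gives theoretical convergence guarantees for self-play learning dynamics, so the benefits of test-time scaling can be exploited systematically.
Link: https://arxiv.org/abs/2601.08777
Authors: Yang Cai,Weiqiang Zheng
Affiliations: Yale University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: None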
Abstract:Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k \ge 1$ candidate responses and a user selects their preferred one. We introduce $(k, f(k))$-robust alignment, which requires the $k$-output model to have win rate $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k) \to 1$ as $k \to \infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k) = \frac{k}{k+1}$, and no method can achieve a faster rate in general. We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k = 1$, sampling from the resulting (often deterministic) policy cannot guarantee win rates above $\tfrac{1}{2}$ except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k, \frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
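The definitions in the abstract can be restated compactly as follows. The symbols $\Pi_k$ (a $k$-output model) and $W(\cdot,\cdot)$ (win rate) are notation assumed here for illustration rather than the paper's own, and reading "have win rate $f(k)$" as "at least $f(k)$" is likewise an interpretation.

```latex
% Assumed notation: \Pi_k is a model that outputs k candidate responses per prompt;
% W(\Pi_k, \pi') is its win rate against a single-output model \pi'.
\[
  \text{$(k, f(k))$-robust alignment:}\quad
  W\!\left(\Pi_k, \pi'\right) \ge f(k)
  \quad \text{for every single-output model } \pi',
\]
\[
  \text{U-alignment:}\quad \lim_{k \to \infty} f(k) = 1,
  \qquad
  \text{optimal rate:}\quad f(k) = \frac{k}{k+1}.
\]
% Worked values of the optimal rate:
\[
  f(1) = \tfrac{1}{2}, \qquad f(3) = \tfrac{3}{4}, \qquad f(9) = \tfrac{9}{10}.
\]
```

The value $f(1) = \tfrac{1}{2}$ matches the abstract's statement that NLHF is optimal for $k = 1$, while larger $k$ only helps if the $k$ responses are sufficiently diverse.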
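[NLP-4] Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Quick read: This paper addresses exploration collapse in reinforcement learning (RL) post-training of large language models (LLMs): the policy prematurely concentrates on a few dominant reasoning patterns, which improves single-sample pass rate (pass@1) but limits the diversity of the solution space and gains in multi-sample pass rate (pass@k). The proposed fix is Uniqueness-Aware Reinforcement Learning. An LLM-based judge clusters the rollouts for the same problem by high-level solution strategy, ignoring superficial differences, and policy advantages are reweighted inversely with cluster size. Correct but rare strategies are therefore rewarded explicitly, which improves rollout-level diversity and pass@k without sacrificing pass@1.
Link: https://arxiv.org/abs/2601.08763
Authors: Zhiyuan Hu,Yucheng Wang,Yufei He,Jiaying Wu,Yilun Zhao,See-Kiong Ng,Cynthia Breazeal,Anh Tuan Luu,Hae Won Park,Bryan Hooi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Work in Progress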
Abstract:Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
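A minimal sketch of the reweighting idea follows. It assumes the cluster labels have already been produced (in the paper, by an LLM-based judge, which is not implemented here) and uses a plain 1/cluster-size weight; the exact weighting function in the paper is not given in the abstract, so this form and the function name are assumptions.

```python
from collections import Counter


def reweight_advantages(advantages, cluster_ids):
    """Scale each rollout's advantage by 1 / (size of its strategy cluster).

    `cluster_ids[i]` is the (hypothetical) strategy label assigned to rollout i.
    Rollouts that share a common strategy are down-weighted; rare ones are not.
    """
    sizes = Counter(cluster_ids)
    return [adv / sizes[cid] for adv, cid in zip(advantages, cluster_ids)]


if __name__ == "__main__":
    # Four correct rollouts: three use the same strategy, one uses a rare one.
    print(reweight_advantages([1.0, 1.0, 1.0, 1.0], ["induction"] * 3 + ["geometry"]))
```

In this toy case the lone "geometry" rollout keeps its full advantage while each "induction" rollout keeps a third, so the rare correct strategy receives the larger learning signal.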
[NLP-5] Spatial Context Improves the Integration of Text with Remote Sensing for Mapping Environmental Variables
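Quick read: This paper studies how to fuse geolocated text with aerial imagery to better predict environmental variables. The main challenges are that the contribution of text is still poorly defined in ecology, and that text is spatially sparse and irregular, which makes it hard to integrate with conventional geospatial data. The key to the solution is an attention-based multimodal model that adds a geolocation encoding and uses an attention module to dynamically select nearby observations that are useful for prediction, so that image and text are modelled jointly within a spatial neighbourhood. On the EcoWikiRS dataset, the method consistently outperforms single-location and unimodal (image-only or text-only) baselines, with notable gains for climatic, edaphic, population, and land use/land cover variables, showing the value of spatial context in cross-modal fusion.
Link: https://arxiv.org/abs/2601.08750
Authors: Valerie Zermatten,Chiara Vanalli,Gencer Sumbul,Diego Marcos,Devis Tuia
Affiliations: Ecole Polytechnique Fédérale de Lausanne; INRIA; University of Montpellier
Categories: Computation and Language (cs.CL)
Comments: submitted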
Abstract:Recent developments in natural language processing highlight text as an emerging data source for ecology. Textual resources carry unique information that can be used in complementarity with geospatial data sources, thus providing insights at the local scale into environmental conditions and properties hidden from more traditional data sources. Leveraging textual information in a spatial context presents several challenges. First, the contribution of textual data remains poorly defined in an ecological context, and it is unclear for which tasks it should be incorporated. Unlike ubiquitous satellite imagery or environmental covariates, the availability of textual data is sparse and irregular; its integration with geospatial data is not straightforward. In response to these challenges, this work proposes an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood, i.e. integrating contributions from several nearby observations. Our approach combines vision and text representations with a geolocation encoding, with an attention-based module that dynamically selects spatial neighbours that are useful for predictive tasks. The proposed approach is applied to the EcoWikiRS dataset, which combines high-resolution aerial imagery with sentences extracted from Wikipedia describing local environmental conditions across Switzerland. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube. Our approach consistently outperforms single-location or unimodal, i.e. image-only or text-only, baselines. When analysing variables by thematic groups, results show a significant improvement in performance for climatic, edaphic, population and land use/land cover variables, underscoring the benefit of including the spatial context when combining text and image data.
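[NLP-6] To Retrieve or To Think? An Agentic Approach for Context Evolution
Quick read: This paper addresses two problems with current context augmentation methods, such as retrieval-augmented generation, on knowledge-intensive reasoning tasks: a rigid, brute-force strategy that retrieves at every step wastes computation, and redundant retrieval injects irrelevant noise that degrades performance. The solution is Agentic Context Evolution (ACE), a dynamic decision framework inspired by human metacognition. A central orchestrator agent decides by majority voting whether to activate a retriever agent to fetch external evidence or a reasoner agent for internal analysis and refinement. By removing unnecessary retrieval steps, ACE keeps the context concise as it evolves, and on multi-hop QA benchmarks it outperforms competitive baselines in accuracy while using tokens efficiently.
Link: https://arxiv.org/abs/2601.08747
Authors: Rubing Chen,Jian Wang,Wenjie Li,Xiao-Yong Wei,Qing Li
Affiliations: The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: None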
Abstract:Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. It aims to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token usage. This work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
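The orchestrator's retrieve-or-think decision can be sketched as a simple majority vote. How the votes are produced (for example, by sampling the orchestrator model several times) is not shown; the action names, the tie-breaking default, and the function name are assumptions of this sketch rather than details from the paper.

```python
from collections import Counter

ACTIONS = ("retrieve", "think")


def orchestrate(votes, default="think"):
    """Pick the next action by majority vote over `votes`.

    Unrecognized votes are ignored. On a tie (including no valid votes) the
    function falls back to `default`; this tie-breaking rule is an assumption.
    """
    counts = Counter(v for v in votes if v in ACTIONS)
    if counts["retrieve"] == counts["think"]:
        return default
    return "retrieve" if counts["retrieve"] > counts["think"] else "think"


if __name__ == "__main__":
    print(orchestrate(["retrieve", "think", "retrieve"]))
    print(orchestrate(["think", "retrieve"]))
```

Defaulting to "think" on a tie reflects the paper's stated goal of avoiding unnecessary retrieval, but a real system might prefer a different fallback.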
[NLP-7] TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL
Quick read: This paper targets the long contexts and high prefilling latency that arise in Text-to-SQL when large language model (LLM) prompts include the full database schema. Existing inference engines such as SGLang and vLLM generate redundant prefix cache copies when user queries reference tables in different orders, which wastes resources. The key idea is to precompute table representations offline as key-value (KV) caches and retrieve the required ones at inference time, preserving primary-foreign key relationships between tables when the caches are computed, and to build a Table Trie for efficient cache lookup. A cache management system with query reranking and a pipeline that parallelizes model inference with cache loading further improve cache hit rate and overall inference efficiency.
Link: https://arxiv.org/abs/2601.08743
Authors: Jinbo Su,Yuxuan Hu,Cuiping Li,Hong Chen,Jia Li,Lintao Ma,Jing Zhang
Affiliations: Renmin University of China; Key Laboratory of Data Engineering and Knowledge Engineering; Engineering Research Center of Database and Business Intelligence; Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: None
Abstract:In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. While user queries typically focus on recurrent table sets, offering an opportunity for KV cache sharing across queries, current inference engines, such as SGLang and vLLM, generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is the computation of table caches while preserving primary foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation loading pipeline for parallelizing model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
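To make the Table Trie idea concrete, here is a minimal sketch of a trie keyed by sequences of table names, with a longest-prefix lookup. It is an illustration under assumptions: the class and method names are hypothetical, the cached value is a placeholder string rather than a real KV cache, and the sketch does not model how the paper orders tables by primary-foreign key relationships.

```python
class TableTrie:
    """Trie keyed by sequences of table names; a node may hold a cached entry."""

    def __init__(self):
        self.children = {}
        self.cache = None  # placeholder for a handle to a precomputed KV cache

    def insert(self, tables, cache):
        """Store `cache` under the exact table sequence `tables`."""
        node = self
        for name in tables:
            node = node.children.setdefault(name, TableTrie())
        node.cache = cache

    def longest_prefix(self, tables):
        """Return (matched_length, cache) for the longest cached prefix of `tables`."""
        node, best = self, (0, None)
        for depth, name in enumerate(tables, start=1):
            node = node.children.get(name)
            if node is None:
                break
            if node.cache is not None:
                best = (depth, node.cache)
        return best


if __name__ == "__main__":
    trie = TableTrie()
    trie.insert(["orders"], "kv_orders")
    trie.insert(["orders", "customers"], "kv_orders_customers")
    print(trie.longest_prefix(["orders", "customers", "items"]))
```

A lookup that matches only part of the requested sequence tells the caller how many tables can be served from cache and how many still need prefilling. Because the trie is order-sensitive, callers would need a canonical table order to avoid the redundant copies the abstract describes.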
[NLP-8] Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents
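Quick read: This paper addresses a key gap in large language models (LLMs) operating in multi-agent environments: the inability to infer the latent intentions behind observed actions. Traditional natural language inference (NLI) does not capture the intention-driven reasoning that complex interactive systems require. The authors propose Attributional NLI (Att-NLI), a framework that draws on principles from social psychology and splits reasoning into abductive intentional inference (generating hypotheses about latent intentions) and deductive verification (drawing valid logical conclusions). A neuro-symbolic variant augments the agent with external theorem provers, and in the textual game Undercover-V it consistently outperforms the other agents, supporting the value of Att-NLI for rational decision-making in multi-agent systems.
Link: https://arxiv.org/abs/2601.08742
Authors: Xin Quan,Jiafeng Xiong,Marco Valentino,André Freitas
Affiliations: University of Manchester; University of Sheffield; Idiap Research Institute; National Biomarker Centre, CRUK-MI
Categories: Computation and Language (cs.CL)
Comments: None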
Abstract:Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent’s capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments.
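[NLP-9] From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding
Quick read: This paper addresses multimodal reasoning over large enterprise spreadsheets (Excel workbooks), which typically contain thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts, and which existing large language models (LLMs) handle poorly. The proposed solution is From Rows to Reasoning (FRTR), a multimodal retrieval-augmented generation framework. It decomposes workbooks into granular row, column, and block embeddings, retrieves with a hybrid of lexical and dense retrieval combined through Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over numerical and visual information together, improving answer accuracy while reducing token usage.
Link: https://arxiv.org/abs/2601.08741
Authors: Anmol Gulati,Sahil Sen,Waqar Sarguroh,Kevin Paul
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: None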
Abstract:Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), an advanced, multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We tested FRTR on six LLMs, achieving 74% answer accuracy on FRTR-Bench with Claude Sonnet 4.5, a substantial improvement over prior state-of-the-art approaches that reached only 24%. On the SpreadsheetLLM benchmark, FRTR achieved 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
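Reciprocal Rank Fusion, which the abstract uses to combine lexical and dense retrieval, scores each item by summing 1 / (k + rank) over the ranked lists it appears in. The sketch below uses the conventional constant k = 60; the value FRTR uses is not stated in the abstract, and the row identifiers are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    score(d) = sum over lists of 1 / (k + rank of d in that list), with ranks
    starting at 1. Items missing from a list contribute nothing for that list.
    Ties are broken by id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    lexical = ["row_1", "row_2", "row_3"]  # hypothetical keyword-search ranking
    dense = ["row_2", "row_4", "row_1"]    # hypothetical embedding-search ranking
    print(reciprocal_rank_fusion([lexical, dense]))
```

Because RRF uses only ranks, it needs no calibration between lexical scores and embedding similarities, which is one reason it is a common choice for hybrid retrieval.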
[NLP-10] PrivGemo: Privacy-Preserving Dual-Tower Graph Retrieval for Empowering LLM Reasoning with Memory Augmentation
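Quick read: This paper addresses the privacy leakage risk that arises when a private knowledge graph (KG) is used together with a closed-source large language model (LLM) for knowledge-intensive question answering. Existing methods protect privacy only by masking entity names and still face four limitations: structural leakage, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse. The proposed PrivGemo framework uses a dual-tower design that keeps raw KG knowledge local while the remote side reasons over a view anonymized both semantically and structurally. It supports multi-hop, multi-entity reasoning by retrieving anonymized long-hop paths that connect all topic entities, and a hierarchical controller with a privacy-aware experience memory reduces redundant exploration and remote interactions, improving reasoning performance and efficiency while protecting privacy.
Link: https://arxiv.org/abs/2601.08739
Authors: Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang
Affiliations: University of New South Wales; Data61, CSIRO
Categories: Computation and Language (cs.CL)
Comments: None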
Abstract:Knowledge graphs (KGs) provide structured evidence that can ground large language model (LLM) reasoning for knowledge-intensive question answering. However, many practical KGs are private, and sending retrieved triples or exploration traces to closed-source LLM APIs introduces leakage risk. Existing privacy treatments focus on masking entity names, but they still face four limitations: structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse for stability and efficiency. To address these issues, we propose PrivGemo, a privacy-preserving retrieval-augmented framework for KG-grounded reasoning with memory-guided exposure control. PrivGemo uses a dual-tower design to keep raw KG knowledge local while enabling remote reasoning over an anonymized view that goes beyond name masking to limit both semantic and structural exposure. PrivGemo supports multi-hop, multi-entity reasoning by retrieving anonymized long-hop paths that connect all topic entities, while keeping grounding and verification on the local KG. A hierarchical controller and a privacy-aware experience memory further reduce unnecessary exploration and remote interactions. Comprehensive experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art results, outperforming the strongest baseline by up to 17.1%. Furthermore, PrivGemo enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
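[NLP-11] RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis
Quick read: This paper addresses the limited robustness of agentic Retrieval-Augmented Generation (RAG) in complex real-world settings, which stems from the scarcity of high-quality training data that reflects the noise of real retrieval environments. Manual annotation does not scale and often fails to capture the dynamic reasoning strategies and error correction needed to handle retrieval failures. The proposed RAGShaper framework has two main components: (1) an InfoCurator that builds dense information trees enriched with adversarial distractors at the Perception and Cognition levels, and (2) a constrained navigation strategy that forces a teacher agent to confront these distractors, producing trajectories that explicitly demonstrate error correction and noise rejection. Models trained on the synthesized data are more robust on noise-intensive and complex retrieval tasks.
Link: https://arxiv.org/abs/2601.08699
Authors: Zhengwei Tao,Bo Li,Jialong Wu,Guochen Yan,Huanyao Zhang,Jiahao Xu,Haitao Mi,Wentao Zhang
Affiliations: Peking University; Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments: None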
Abstract:Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.
[NLP-12] Nationality and Region Prediction from Names: A Comparative Study of Neural Models and Large Language Models
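Quick read: This paper addresses limitations in predicting nationality from personal names, in particular the poor generalization of conventional neural models to low-frequency nationalities and their difficulty distinguishing similar nationalities within the same region. The key idea is to exploit the world knowledge that large language models (LLMs) acquire during pre-training, rather than relying only on task-specific training data; the advantage is largest at the fine-grained nationality level. The study also finds that LLMs tend to make "near-miss" errors (correct region, wrong nationality), whereas neural models make more cross-regional errors and are biased toward high-frequency classes. This suggests that the LLM advantage comes from world knowledge, that model selection should take the required granularity into account, and that evaluation should consider error quality as well as accuracy.
Link: https://arxiv.org/abs/2601.08692
Authors: Keito Inoshita
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: None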
Abstract:Predicting nationality from personal names has practical value in marketing, demographic research, and genealogical studies. Conventional neural models learn statistical correspondences between names and nationalities from task-specific training data, posing challenges in generalizing to low-frequency nationalities and distinguishing similar nationalities within the same region. Large language models (LLMs) have the potential to address these challenges by leveraging world knowledge acquired during pre-training. In this study, we comprehensively compare neural models and LLMs on nationality prediction, evaluating six neural models and six LLM prompting strategies across three granularity levels (nationality, region, and continent), with frequency-based stratified analysis and error analysis. Results show that LLMs outperform neural models at all granularity levels, with the gap narrowing as granularity becomes coarser. Simple machine learning methods exhibit the highest frequency robustness, while pre-trained models and LLMs show degradation for low-frequency nationalities. Error analysis reveals that LLMs tend to make "near-miss" errors, predicting the correct region even when nationality is incorrect, whereas neural models exhibit more cross-regional errors and bias toward high-frequency classes. These findings indicate that LLM superiority stems from world knowledge, model selection should consider required granularity, and evaluation should account for error quality beyond accuracy.
[NLP-13] QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models
Quick read: This paper addresses the fragmented evaluation of large language models (LLMs) on financial quantitative tasks, which has been mostly limited to knowledge-centric question answering. For a more comprehensive and realistic assessment, the authors propose the QuantEval benchmark, which covers knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Its key innovation is an integrated CTA (Commodity Trading Advisor)-style backtesting framework that executes model-generated strategies and scores them with financial performance metrics, giving a realistic test of strategy coding ability. This goes beyond static question answering and offers a reproducible way to evaluate LLMs in quantitative finance.
Link: https://arxiv.org/abs/2601.08689
Authors: Zhaolu Kang,Junhao Gong,Wenqing Hu,Shuo Yin,Kehan Jiang,Zhicheng Fang,Yingjie He,Chunlei Meng,Rong Fu,Dongyang Chen,Leqi Zheng,Eric Hanchen Jiang,Yunfei Feng,Yitong Leng,Junfan Zhu,Xiaoyou Chen,Xi Yang,Richeng Xuan
Affiliations: Peking University; Tsinghua University; Fudan University; University of Macau; University of California, Los Angeles; Shanghai Jiao Tong University; Imperial College London; University of Chicago; Shanghai Weina Software Technology; Beijing Academy of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: None
Abstract:Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate some state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs’ quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
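[NLP-14] Lessons from the Field: An Adaptable Lifecycle Approach to Applied Dialogue Summarization (EACL 2026)
Quick read: This paper addresses automatic summarization of multi-party dialogues in industrial settings, that is, how to build a high-quality, adaptable summarization system when requirements keep evolving and the task is subjective. The solution is an end-to-end system built on an agentic architecture: task decomposition enables component-wise optimization, and robust evaluation methods cope with changing requirements. The study also highlights the impact of upstream data bottlenecks and the vendor lock-in caused by the poor transferability of LLM prompts.
Link: https://arxiv.org/abs/2601.08682
Authors: Kushal Chawla,Chenyang Zhu,Pengshan Cai,Sangwoo Cho,Scott Novotney,Ayushman Singh,Jonah Lewis,Keasha Safewright,Alfy Samuel,Erin Babinsky,Shi-Xiong Zhang,Sambit Sahu
Affiliations: Capital One
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EACL 2026 Industry Track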
Abstract:Summarization of multi-party dialogues is a critical capability in industry, enhancing knowledge transfer and operational effectiveness across many domains. However, automatically generating high-quality summaries is challenging, as the ideal summary must satisfy a set of complex, multi-faceted requirements. While summarization has received immense attention in research, prior work has primarily utilized static datasets and benchmarks, a condition rare in practical scenarios where requirements inevitably evolve. In this work, we present an industry case study on developing an agentic system to summarize multi-party interactions. We share practical insights spanning the full development lifecycle to guide practitioners in building reliable, adaptable summarization systems, as well as to inform future research, covering: 1) robust methods for evaluation despite evolving requirements and task subjectivity, 2) component-wise optimization enabled by the task decomposition inherent in an agentic architecture, 3) the impact of upstream data bottlenecks, and 4) the realities of vendor lock-in due to the poor transferability of LLM prompts.
[NLP-15] Parallel Context-of-Experts Decoding for Retrieval Augmented Generation
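Quick read: This paper addresses a core trade-off in Retrieval Augmented Generation (RAG): concatenating multiple documents in a long prompt supports multi-document reasoning but creates a prefill bottleneck, while encoding each document's key-value (KV) cache separately is faster but breaks cross-document interaction. The solution is Parallel Context-of-Experts Decoding (Pced), a training-free framework that moves evidence aggregation from the attention mechanism to the decoding stage. Retrieved documents are treated as isolated "experts" whose predictions are synchronized by a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model prior, recovering cross-document reasoning without building shared attention across documents.
Link: https://arxiv.org/abs/2601.08670
Authors: Giulio Corallo,Paolo Papotti
Affiliations: SAP Labs, France; EURECOM, France
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: None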
Abstract:Retrieval Augmented Generation faces a trade-off: concatenating documents in a long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately offers speed but breaks cross-document interaction. We propose Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding. Pced treats retrieved documents as isolated “experts”, synchronizing their predictions via a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model prior. This approach recovers cross-document reasoning capabilities without constructing a shared attention across documents.
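The abstract does not spell out the decoding rule, so the following is only a minimal Python sketch of what a retrieval-aware contrastive rule could look like. The scoring formula, the `beta` and `gamma` weights, and the names `pced_step` and `log_softmax` are all assumptions made for illustration, not the authors' method. Each document "expert" proposes next-token log-probabilities, each is contrasted against the no-document prior, and the best-scoring (token, expert) pair is selected.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def pced_step(expert_logits, prior_logits, retrieval_scores, beta=1.0, gamma=1.0):
    """Return (token_id, expert_id) for one decoding step.

    Assumed rule (not taken from the paper):
        score = log p_k(v) + beta * (log p_k(v) - log p_prior(v)) + gamma * log r_k
    where r_k > 0 is the retrieval score of document k.
    """
    prior_lp = log_softmax(prior_logits)
    best = (None, None)
    best_score = -math.inf
    for k, logits in enumerate(expert_logits):
        expert_lp = log_softmax(logits)
        retrieval_bonus = gamma * math.log(retrieval_scores[k])
        for v, (lp, lp_prior) in enumerate(zip(expert_lp, prior_lp)):
            # Reward tokens the document makes more likely than the prior does.
            score = lp + beta * (lp - lp_prior) + retrieval_bonus
            if score > best_score:
                best_score = score
                best = (v, k)
    return best
```

In this sketch an expert whose distribution matches the prior earns no contrastive bonus, so a document that actually changes the prediction tends to win the step.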
[NLP-16] Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification
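Quick read: This paper studies "false refusal" in hate speech detoxification with large language models (LLMs): the prompt triggers a safety mechanism and the model refuses a task it should complete. The study finds that LLMs disproportionately refuse inputs with higher semantic toxicity and inputs targeting specific groups, particularly nationality, religion, and political ideology, and that the bias is systematic and language-dependent. The key to the solution is a lightweight cross-translation strategy: translate the English hate speech into Chinese, detoxify it there, and translate the result back into English. This substantially reduces false refusals while preserving the original content.
Link: https://arxiv.org/abs/2601.08668
Authors: Kyuri Im,Shuzhou Yuan,Michael Färber
Affiliations: TU Dresden; ScaDS.AI
Categories: Computation and Language (cs.CL)
Comments: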
Abstract:While large language models (LLMs) have increasingly been applied to hate speech detoxification, the prompts often trigger safety alerts, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on both English and multilingual datasets. Our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. Based on these findings, we propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back, which substantially reduces false refusals while preserving the original content, providing an effective and lightweight mitigation approach.
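The cross-translation strategy is a three-step pipeline, which can be sketched in Python as below. The `translate` and `detoxify` callables are hypothetical interfaces (for example, thin wrappers around a translation system and an LLM); the paper's actual prompts and models are not given here, so this only shows the control flow.

```python
def detoxify_via_pivot(text, translate, detoxify, source="en", pivot="zh"):
    """Detoxify `text` by routing it through a pivot language.

    `translate(text, src, tgt)` and `detoxify(text, lang)` are caller-supplied
    callables (hypothetical interfaces, not a real library API).
    """
    pivot_text = translate(text, source, pivot)    # 1. English -> Chinese
    pivot_clean = detoxify(pivot_text, pivot)      # 2. detoxify in Chinese
    return translate(pivot_clean, pivot, source)   # 3. Chinese -> English
```

Passing the two steps in as functions keeps the sketch independent of any particular model or vendor.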
[NLP-17] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation
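Quick read: This paper addresses the difficulty of aligning large language models (LLMs) used as judges (LLM-as-a-Judge) with human grading standards, given the inherent stochasticity of generation. It isolates three failure modes: rubric instability caused by prompt sensitivity, reasoning that lacks auditable evidence, and a scoring scale that is misaligned with human grading boundaries. The key to the solution is RULERS, a framework that turns natural language rubrics into executable specifications through three mechanisms: (1) compiling criteria into versioned immutable bundles for consistency; (2) enforcing structured decoding with deterministic evidence verification so that reasoning is auditable; and (3) applying lightweight Wasserstein-based post-hoc calibration to adjust the scoring scale without updating model parameters. Experiments on essay and summarization benchmarks show that RULERS significantly outperforms representative baselines in human agreement, stays stable under adversarial rubric perturbations, and lets smaller models rival larger proprietary judges.
Link: https://arxiv.org/abs/2601.08654
Authors: Yihan Hong,Huaiyuan Yao,Bolin Shen,Wanpeng Xu,Hua Wei,Yushun Dong
Affiliations: Washington University in St. Louis; Arizona State University; Florida State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: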
Abstract:The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at this https URL.
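In one dimension, the transport map that minimizes the Wasserstein distance between two empirical distributions is the monotone quantile mapping, so a Wasserstein-style post-hoc calibration can be sketched as quantile matching. The Python sketch below assumes a small calibration set with paired pools of judge and human scores; `fit_quantile_map` is a hypothetical helper, and the paper's actual calibration procedure may differ.

```python
from bisect import bisect_right


def fit_quantile_map(judge_scores, human_scores):
    """Build a monotone map from raw judge scores onto the human score scale."""
    if not judge_scores or not human_scores:
        raise ValueError("both calibration lists must be non-empty")
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)
    n_judge, n_human = len(judge_sorted), len(human_sorted)

    def calibrate(raw_score):
        # Number of calibration judge scores that are <= raw_score.
        rank = bisect_right(judge_sorted, raw_score)
        # Matching position on the human scale: ceil(rank * n_human / n_judge) - 1.
        idx = -(-rank * n_human // n_judge) - 1
        idx = min(n_human - 1, max(0, idx))
        return human_sorted[idx]

    return calibrate
```

The judge model itself is never touched: only the mapping from its raw scores to the human scale is fitted, which matches the "no parameter updates" constraint described above.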
[NLP-18] Safe Language Generation in the Limit
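Quick read: This paper gives the first theoretical treatment of safe language generation, asking what is achievable once language generation is considered in real-world settings. Building on the computational paradigm of learning in the limit, it formalizes the tasks of safe language identification and safe language generation. It proves that, under this model, safe language identification is impossible, and that safe language generation is at least as hard as (vanilla) language identification, which is itself impossible. The paper closes by discussing several intractable and tractable cases.
Link: https://arxiv.org/abs/2601.08648
Authors: Antonios Anastasopoulos,Giuseppe Ateniese,Evgenios M. Kornaropoulos
Affiliations: George Mason University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: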
Abstract:Recent results in learning a language in the limit have shown that, although language identification is impossible, language generation is tractable. As this foundational area expands, we need to consider the implications of language generation in real-world settings. This work offers the first theoretical treatment of safe language generation. Building on the computational paradigm of learning in the limit, we formalize the tasks of safe language identification and generation. We prove that under this model, safe language identification is impossible, and that safe language generation is at least as hard as (vanilla) language identification, which is also impossible. Last, we discuss several intractable and tractable cases.
[NLP-19] A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Quick read: This paper addresses the evaluation of how well natural language processing (NLP) systems understand potentially idiomatic expressions (PIEs) across languages and modalities. Because PIEs are tied to the everyday experience of a particular language community, they are a demanding test of a model's linguistic and, to some extent, cultural capabilities. The key to the solution is XMPIE, a parallel multilingual and multimodal dataset covering 34 languages and more than ten thousand items. It supports comparative analysis of idiomatic patterns across languages and, through paired text and image components (each PIE comes with five images spanning idiomatic to literal meanings, including semantically related and random distractors), makes it possible to measure whether idiomatic understanding transfers across languages and modalities.
Link: https://arxiv.org/abs/2601.08645
Authors: Dilara Torunoğlu-Selamet,Dogukan Arslan,Rodrigo Wilkens,Wei He,Doruk Eryiğit,Thomas Pickard,Adriana S. Pagano,Aline Villavicencio,Gülşen Eryiğit,Ágnes Abuczki,Aida Cardoso,Alesia Lazarenka,Dina Almassova,Amalia Mendes,Anna Kanellopoulou,Antoni Brosa-Rodríguez,Baiba Saulite,Beata Wojtowicz,Bolette Pedersen,Carlos Manuel Hidalgo-Ternero,Chaya Liebeskind,Danka Jokić,Diego Alves,Eleni Triantafyllidi,Erik Velldal,Fred Philippy,Giedre Valunaite Oleskeviciene,Ieva Rizgeliene,Inguna Skadina,Irina Lobzhanidze,Isabell Stinessen Haugen,Jauza Akbar Krito,Jelena M. Marković,Johanna Monti,Josue Alejandro Sauca,Kaja Dobrovoljc,Kingsley O. Ugwuanyi,Laura Rituma,Lilja Øvrelid,Maha Tufail Agro,Manzura Abjalova,Maria Chatzigrigoriou,María del Mar Sánchez Ramos,Marija Pendevska,Masoumeh Seyyedrezaei,Mehrnoush Shamsfard,Momina Ahsan,Muhammad Ahsan Riaz Khan,Nathalie Carmen Hau Norman,Nilay Erdem Ayyıldız,Nina Hosseini-Kivanani,Noémi Ligeti-Nagy,Numaan Naeem,Olha Kanishcheva,Olha Yatsyshyna,Daniil Orel,Petra Giommarelli,Petya Osenova,Radovan Garabik,Regina E. Semou,Rozane Rebechi,Salsabila Zahirah Pranida,Samia Touileb,Sanni Nimb,Sarfraz Ahmad,Sarvinoz Nematkhonova,Shahar Golan,Shaoxiong Ji,Sopuruchi Christian Aboh,Srdjan Sucur,Stella Markantonatou,Sussi Olsen,Vahide Tajalli,Veronika Lipp,Voula Giouli,Yelda Yeşildal Eraydın,Zahra Saaberi,Zhuohan Xie
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Potentially idiomatic expressions (PIEs) construe meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset makes it possible to evaluate model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers to, or implies, understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
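[NLP-20] Moral Lenses Political Coordinates: Towards Ideological Positioning of Morally Conditioned LLMs
Quick read: This paper addresses a limitation of current evaluations of political orientation in large language models (LLMs): they rely on direct probing or demographic persona engineering and so do not examine the causal relationship between political ideology and underlying moral intuitions. The key to the solution is to treat moral values as a controllable condition. Models are conditioned to endorse or reject specific moral values, and the resulting shifts in their political orientation are measured with the Political Compass Test along economic and social dimensions. Experiments show that this moral conditioning induces pronounced, value-specific shifts in the models' political coordinates, that the effects are modulated by role framing and model scale, and that they are robust across alternative assessment instruments for the same moral value. The authors conclude that effective alignment requires anchoring political assessments in broader social values, including morality.
Link: https://arxiv.org/abs/2601.08634
Authors: Chenchen Yuan,Bolei Ma,Zheyu Zhang,Bardh Prenkaj,Frauke Kreuter,Gjergji Kasneci
Affiliations: Technical University of Munich; LMU Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: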
Abstract:While recent research has systematically documented political orientation in large language models (LLMs), existing evaluations rely primarily on direct probing or demographic persona engineering to surface ideological biases. In social psychology, however, political ideology is also understood as a downstream consequence of fundamental moral intuitions. In this work, we investigate the causal relationship between moral values and political positioning by treating moral orientation as a controllable condition. Rather than simply assigning a demographic persona, we condition models to endorse or reject specific moral values and evaluate the resulting shifts on their political orientations, using the Political Compass Test. By treating moral values as lenses, we observe how moral conditioning actively steers model trajectories across economic and social dimensions. Our findings show that such conditioning induces pronounced, value-specific shifts in models’ political coordinates. We further notice that these effects are systematically modulated by role framing and model scale, and are robust across alternative assessment instruments instantiating the same moral value. This highlights that effective alignment requires anchoring political assessments within the context of broader social values including morality, paving the way for more socially grounded alignment techniques.
[NLP-21] Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation
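Quick read: This paper addresses the shortage of training data for machine translation (MT) in low-resource languages, where producing enough human translations is prohibitively expensive. The key to the solution is LALITA (Lexical And Linguistically Informed Text Analysis), a framework that selects source sentences using lexical and linguistic features and favors complex sentences when building the parallel corpus, which improves translation quality while reducing the amount of training data. In simulated low-resource settings with 50K to 800K English sentences, the method improves performance at every data size, and it cuts data needs by more than half across several languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German), lowering MT training cost and showing promise for data augmentation.
Link: https://arxiv.org/abs/2601.08629
Authors: Saumitra Yadav,Manish Shrivastava
Affiliations: International Institute of Information Technology
Categories: Computation and Language (cs.CL)
Comments: Under Review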
Abstract:Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation. But, for low-resource languages, human translation to generate sufficient data is prohibitively expensive. Therefore, it is crucial to develop a framework that screens source sentences to form efficient parallel text, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for optimal MT system training. Our extensively tested framework, (Lexical And Linguistically Informed Text Analysis) LALITA, targets source sentence selection using lexical and linguistic features to curate parallel corpora. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. We test this by simulating low-resource data availability with curated datasets of 50K to 800K English sentences and report improved performances on all data sizes. LALITA demonstrates remarkable efficiency, reducing data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German). This approach not only reduces MT systems training cost by reducing training data requirement, but also showcases LALITA’s utility in data augmentation.
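[NLP-22] How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction
Quick read: This paper addresses how little is known about the ability of large language models (LLMs) to reconstruct internal structure from scrambled input. Sentence-level restoration is hard to evaluate automatically because several valid word orders often exist. The authors therefore introduce OrderProbe, a deterministic benchmark built on fixed four-character expressions in Chinese, Japanese, and Korean; each expression has a unique canonical order, so exact-match scoring is possible. They also propose a diagnostic framework that goes beyond recovery accuracy to cover semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. The key innovation is the use of structurally unambiguous test items to make structural reconstruction quantifiable and comparable. The experiments reveal a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
Link: https://arxiv.org/abs/2601.08626
Authors: Yingjie He,Zhaolu Kang,Kehan Jiang,Qianyuan Zhang,Jiachen Qian,Chunlei Meng,Yujie Feng,Yuan Wang,Jiabao Dou,Aming Wu,Leqi Zheng,Pengxiang Zhao,Jiaxin Liu,Zeyu Zhang,Lei Wang,Guansu Wang,Qishi Zhan,Xiaomin He,Meisheng Zhang,Jianyuan Ni
Affiliations: Peking University; The Chinese University of Hong Kong, Shenzhen; City University of Hong Kong; Fudan University; The Hong Kong Polytechnic University; Tsinghua University; Zhejiang University; University of Illinois Urbana-Champaign; Marquette University; Juniata College
Categories: Computation and Language (cs.CL)
Comments: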
Abstract:Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
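Because each four-character expression has exactly one canonical order, both the probe construction and the scoring are easy to sketch. The Python below is an illustrative reconstruction under that assumption; `scramble` and `exact_match_accuracy` are hypothetical names, and the benchmark's real data format and prompts are not shown here.

```python
import random


def scramble(expression, rng):
    """Return a permutation of `expression` that differs from the original."""
    if len(set(expression)) < 2:
        raise ValueError("expression needs at least two distinct characters")
    chars = list(expression)
    while True:
        rng.shuffle(chars)
        candidate = "".join(chars)
        if candidate != expression:
            return candidate


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to the canonical expression."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty lists")
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)
```

Exact match works here only because the target is unique; for free-form sentences, where several orders can be valid, the same metric would penalize correct answers.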
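[NLP-23] GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning
Quick read: This paper addresses the limited ability of current search-augmented large reasoning models (LRMs) to work with graph-structured data, which is prevalent in e-commerce, social networks, and scientific citations, and asks how graph topology can be used to make retrieval more targeted and reasoning more efficient. The key to the solution is GraphSearch, a framework with two core components: a Graph-aware Query Planner, which disentangles the search space (e.g., 1-hop, multi-hop, or global neighbors) from the semantic query so that queries can express graph structure, and a Graph-aware Retriever, which builds candidate sets from the topology and ranks them with a hybrid scoring function that balances structural and semantic relevance. The framework offers two traversal modes: GraphSearch-R, which expands neighborhoods recursively hop by hop, and GraphSearch-F, which retrieves flexibly across local and global neighborhoods without hop constraints. This enables zero-shot graph learning without task-specific fine-tuning, with performance on node classification and link prediction that is competitive with or better than supervised methods.
Link: https://arxiv.org/abs/2601.08621
Authors: Jiajin Liu,Yuanfu Sun,Dongzhe Fan,Qiaoyu Tan
Affiliations: New York University (Shanghai)
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 5 pages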
Abstract:Recent advances in search-augmented large reasoning models (LRMs) enable the retrieval of external knowledge to reduce hallucinations in multistep reasoning. However, their ability to operate on graph-structured data, prevalent in domains such as e-commerce, social networks, and scientific citations, remains underexplored. Unlike plain text corpora, graphs encode rich topological signals that connect related entities and can serve as valuable priors for retrieval, enabling more targeted search and improved reasoning efficiency. Yet, effectively leveraging such structure poses unique challenges, including the difficulty of generating graph-expressive queries and ensuring reliable retrieval that balances structural and semantic relevance. To address this gap, we introduce GraphSearch, the first framework that extends search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. GraphSearch combines a Graph-aware Query Planner, which disentangles search space (e.g., 1-hop, multi-hop, or global neighbors) from semantic queries, with a Graph-aware Retriever, which constructs candidate sets based on topology and ranks them using a hybrid scoring function. We further instantiate two traversal modes: GraphSearch-R, which recursively expands neighborhoods hop by hop, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods without hop constraints. Extensive experiments across diverse benchmarks show that GraphSearch achieves competitive or even superior performance compared to supervised graph learning methods, setting state-of-the-art results in zero-shot node classification and link prediction. These findings position GraphSearch as a flexible and generalizable paradigm for agentic reasoning over graphs.
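The retriever's two steps, building candidates from topology and ranking them with a hybrid score, can be sketched in Python as follows. The abstract does not give the scoring function, so the inverse-hop structural term, the cosine semantic term, and the `alpha` mixing weight are assumptions chosen for illustration, as are the function names.

```python
import math
from collections import deque


def k_hop_neighbors(adjacency, start, max_hops):
    """Breadth-first search returning {node: hop_distance} within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def retrieve(adjacency, embeddings, start, query_vec, max_hops=2, alpha=0.5, top_k=3):
    """Rank topology-derived candidates by a structural/semantic mix (assumed form)."""
    scored = []
    for node, hops in k_hop_neighbors(adjacency, start, max_hops).items():
        structural = 1.0 / hops                      # closer nodes score higher
        semantic = cosine(embeddings[node], query_vec)
        scored.append((alpha * structural + (1 - alpha) * semantic, node))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored[:top_k]]
```

Keeping `max_hops` separate from `query_vec` mirrors the planner's idea of disentangling the search space from the semantic query.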
[NLP-24] ExpSeek: Self-Triggered Experience Seeking for Web Agents
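Quick read: This paper addresses a limitation of experience intervention in web agents: existing methods inject experience passively as global context before task execution, so they struggle to adapt to the changing observations that arise during agent-environment interaction. The key to the solution is ExpSeek, which shifts experience use toward step-level proactive seeking through two mechanisms: (1) estimating step-level entropy thresholds from the model's intrinsic signals, which act as a self-trigger for when to intervene; and (2) designing experience content tailored to each step. Experiments on Qwen3-8B and 32B models across four web agent benchmarks show absolute improvements of 9.3% and 7.5%, respectively, support entropy as a self-triggering signal, and show that even a small 4B experience model can significantly boost larger agent models.
Link: https://arxiv.org/abs/2601.08605
Authors: Wenyuan Zhang,Xinghua Zhang,Haiyang Yu,Shuaiyi Nie,Bingli Wu,Juwei Yue,Tingwen Liu,Yongbin Li
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Tongyi Lab, Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress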
Abstract:Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution, struggling to adapt to dynamically changing contextual observations during agent-environment interaction. We propose ExpSeek, which shifts experience toward step-level proactive seeking: (1) estimating step-level entropy thresholds to determine intervention timing using the model’s intrinsic signals; (2) designing step-level tailor-designed experience content. Experiments on Qwen3-8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self-triggering signal, and reveal that even a 4B small-scale experience model can significantly boost the performance of larger agent models.
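An entropy-based self-trigger can be sketched in a few lines of Python. The abstract does not say how the step-level threshold is estimated or how token entropies are aggregated, so the mean aggregation and the externally supplied `threshold` below are assumptions, and the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def step_entropy(token_distributions):
    """Mean token entropy over all tokens generated in one agent step."""
    if not token_distributions:
        raise ValueError("a step must contain at least one token distribution")
    return sum(token_entropy(d) for d in token_distributions) / len(token_distributions)


def should_seek_experience(token_distributions, threshold):
    """Self-trigger: ask for experience only when the step looks uncertain."""
    return step_entropy(token_distributions) > threshold
```

A confident step (peaked distributions) stays below the threshold and proceeds without intervention, while an uncertain step triggers a request for step-specific experience.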
[NLP-25] Ministral 3
Quick read: This paper addresses the challenge of deploying capable language models under compute and memory constraints. It introduces the Ministral 3 series, a family of parameter-efficient dense language models derived through Cascade Distillation, an iterative technique that combines pruning with continued training under distillation. The models come in three sizes (3B, 8B, and 14B parameters) and three variants per size (a pretrained base model, an instruction-finetuned model, and a reasoning model). Every model has image understanding capabilities and is released under the Apache 2.0 license.
Link: https://arxiv.org/abs/2601.08584
Authors: Alexander H. Liu,Kartik Khandelwal,Sandeep Subramanian,Victor Jouault,Abhinav Rastogi,Adrien Sadé,Alan Jeffares,Albert Jiang,Alexandre Cahill,Alexandre Gavaudan,Alexandre Sablayrolles,Amélie Héliou,Amos You,Andy Ehrenberg,Andy Lo,Anton Eliseev,Antonia Calvi,Avinash Sooriyarachchi,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Clémence Lanfranchi,Corentin Barreau,Cyprien Courtot,Daniele Grattarola,Darius Dabert,Diego de las Casas,Elliot Chane-Sane,Faruk Ahmed,Gabrielle Berrada,Gaëtan Ecrepont,Gauthier Guinet,Georgii Novikov,Guillaume Kunsch,Guillaume Lample,Guillaume Martin,Gunshi Gupta,Jan Ludziejewski,Jason Rute,Joachim Studnia,Jonas Amar,Joséphine Delas,Josselin Somerville Roberts,Karmesh Yadav,Khyathi Chandu,Kush Jain,Laurence Aitchison,Laurent Fainsin,Léonard Blier,Lingxiao Zhao,Louis Martin,Lucile Saulnier,Luyu Gao,Maarten Buyl,Margaret Jennings,Marie Pellat,Mark Prins,Mathieu Poirée,Mathilde Guillaumin,Matthieu Dinot,Matthieu Futeral,Maxime Darrin,Maximilian Augustin,Mia Chiquier,Michel Schimpf,Nathan Grinsztajn,Neha Gupta,Nikhil Raghuraman,Olivier Bousquet,Olivier Duchenne,Patricia Wang,Patrick von Platen,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Pavankumar Reddy Muddireddy,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Quentin Torroba,Romain Sauvestre,Roman Soletskyi,Rupert Menneer,Sagar Vaze,Samuel Barry,Sanchit Gandhi,Siddhant Waghjale,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Teven Le Scao,Théo Cachet,Theo Simon Sorg,Thibaut Lavril,Thiziri Nait Saada,Thomas Chabal,Thomas Foubert,Thomas Robert
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Release page: this https URL ; Models available at this https URL
Abstract:We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute and memory constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe to derive the Ministral 3 models through Cascade Distillation, an iterative pruning and continued training with distillation technique. Each model comes with image understanding capabilities, all under the Apache 2.0 license.
[NLP-26] Learner-Tailored Program Repair: A Solution Generator with Iterative Edit-Driven Retrieval Enhancement
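Quick read: This paper addresses a gap in intelligent programming coaching systems: most work repairs learners' buggy code without explaining the underlying causes of the bugs, which makes it hard for learners to understand and improve their programming logic. It introduces a new task, LPR (Learner-Tailored Program Repair), and a framework named LTS-G (Learner-Tailored Solution Generator). The solution has two stages. In the first, a repair solution retrieval framework builds a solution retrieval database, and an edit-driven code retrieval approach retrieves useful solutions that guide the LLM in identifying and fixing bugs. In the second, a solution-guided program repair method fixes the code and explains the bugs under the guidance of the retrieved solutions. An Iterative Retrieval Enhancement method then uses evaluation results of the generated code to refine the retrieval direction and explore more suitable repair strategies.
Link: https://arxiv.org/abs/2601.08545
Authors: Zhenlong Dai,Zhuoluo Zhao,Hengning Wang,Xiu Tang,Sai Wu,Chang Yao,Zhipeng Gao,Jingyuan Chen
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: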
Abstract:With the development of large language models (LLMs) in the field of programming, intelligent programming coaching systems have gained widespread attention. However, most research focuses on repairing the buggy code of programming learners without providing the underlying causes of the bugs. To address this gap, we introduce a novel task, namely LPR (Learner-Tailored Program Repair). We then propose a novel and effective framework, the Learner-Tailored Solution Generator, to enhance program repair while offering the bug descriptions for the buggy code. In the first stage, we utilize a repair solution retrieval framework to construct a solution retrieval database and then employ an edit-driven code retrieval approach to retrieve valuable solutions, guiding LLMs in identifying and fixing the bugs in buggy code. In the second stage, we propose a solution-guided program repair method, which fixes the code and provides explanations under the guidance of retrieval solutions. Moreover, we propose an Iterative Retrieval Enhancement method that utilizes evaluation results of the generated code to iteratively optimize the retrieval direction and explore more suitable repair strategies, improving performance in practical programming coaching scenarios. The experimental results show that our approach outperforms a set of baselines by a large margin, validating the effectiveness of our framework for the newly proposed LPR task.
[NLP-27] DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report
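Quick read: This paper addresses two problems in how Deep Research Systems (DRS) are currently evaluated: some benchmarks do not adequately test a system's ability to analyze evidence and write coherent reports, and others rely on criteria that are overly coarse or defined directly by large language models (LLMs), which yields scores that can be biased relative to human experts and are hard to verify or interpret. The key to the solution is Deep Research Bench II, a new benchmark with 132 grounded research tasks across 22 domains. Each task requires a long-form research report, and reports are evaluated with 9430 fine-grained binary rubrics in total, covering information recall, analysis, and presentation. The rubrics are derived from expert-written investigative articles through a four-stage LLM+human pipeline that combines automatic extraction with over 400 human-hours of expert review, so that the criteria are atomic, verifiable, and aligned with human expert judgment.
Link: https://arxiv.org/abs/2601.08536
Authors: Ruizhe Li,Mingxuan Du,Benfeng Xu,Chiwei Zhu,Xiaorui Wang,Zhendong Mao
Affiliations: University of Science and Technology of China; Metastone Technology
Categories: Computation and Language (cs.CL)
Comments: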
Abstract:Deep Research Systems (DRS) aim to help users search the web, synthesize information, and deliver comprehensive investigative reports. However, how to rigorously evaluate these systems remains under-explored. Existing deep-research benchmarks often fall into two failure modes. Some do not adequately test a system’s ability to analyze evidence and write coherent reports. Others rely on evaluation criteria that are either overly coarse or directly defined by LLMs (or both), leading to scores that can be biased relative to human experts and are hard to verify or interpret. To address these issues, we introduce Deep Research Bench II, a new benchmark for evaluating DRS-generated reports. It contains 132 grounded research tasks across 22 domains; for each task, a system must produce a long-form research report that is evaluated by a set of 9430 fine-grained binary rubrics in total, covering three dimensions: information recall, analysis, and presentation. All rubrics are derived from carefully selected expert-written investigative articles and are constructed through a four-stage LLM+human pipeline that combines automatic extraction with over 400 human-hours of expert review, ensuring that the criteria are atomic, verifiable, and aligned with human expert judgment. We evaluate several state-of-the-art deep-research systems on Deep Research Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics, revealing a substantial gap between current DRSs and human experts.
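With binary rubrics, a report's score reduces to counting how many criteria it satisfies. The Python sketch below assumes each judgment arrives as a `(dimension, passed)` pair and reports a pass rate per dimension plus an overall rate; this aggregation and the name `rubric_pass_rates` are illustrative assumptions, not the benchmark's published scoring code.

```python
from collections import defaultdict


def rubric_pass_rates(judgments):
    """Return ({dimension: pass_rate}, overall_pass_rate).

    `judgments` is an iterable of (dimension, passed) pairs, one per rubric.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for dimension, ok in judgments:
        totals[dimension] += 1
        passed[dimension] += bool(ok)
    if not totals:
        raise ValueError("at least one rubric judgment is required")
    per_dimension = {dim: passed[dim] / totals[dim] for dim in totals}
    overall = sum(passed.values()) / sum(totals.values())
    return per_dimension, overall
```

Under this reading, the "fewer than 50% of the rubrics" result quoted in the abstract would correspond to an overall rate below 0.5.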
[NLP-28] Algorithmic Stability in Infinite Dimensions: Characterizing Unconditional Convergence in Banach Spaces
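Quick read: This paper addresses the distinction between conditional, unconditional, and absolute convergence in infinite-dimensional spaces, which has fundamental implications for computational algorithms. In finite dimensions unconditional and absolute convergence coincide, but the Dvoretzky-Rogers theorem shows that they separate in infinite-dimensional Banach spaces. The key contribution is a characterization theorem that unifies seven equivalent conditions for unconditional convergence, including permutation invariance, net convergence, subseries tests, sign stability, bounded multiplier properties, and weak uniform convergence. These results inform algorithmic stability analysis, for example permutation invariance in gradient accumulation for Stochastic Gradient Descent and the justification of coefficient thresholding in frame-based signal processing, and so provide rigorous foundations for order-independent and numerically robust summation.
Link: https://arxiv.org/abs/2601.08512
Authors: Przemysław Spyra
Affiliations: AGH University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: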
Abstract:The distinction between conditional, unconditional, and absolute convergence in infinite-dimensional spaces has fundamental implications for computational algorithms. While these concepts coincide in finite dimensions, the Dvoretzky-Rogers theorem establishes their strict separation in general Banach spaces. We present a comprehensive characterization theorem unifying seven equivalent conditions for unconditional convergence: permutation invariance, net convergence, subseries tests, sign stability, bounded multiplier properties, and weak uniform convergence. These theoretical results directly inform algorithmic stability analysis, governing permutation invariance in gradient accumulation for Stochastic Gradient Descent and justifying coefficient thresholding in frame-based signal processing. Our work bridges classical functional analysis with contemporary computational practice, providing rigorous foundations for order-independent and numerically robust summation processes.
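Why order-independence matters is already visible for real numbers. The alternating harmonic series converges conditionally, not unconditionally, so rearranging its terms changes the limit. The short Python illustration below (not taken from the paper) sums the same terms in two orders: the natural order approaches ln 2, while taking two positive terms for every negative one approaches (3/2) ln 2.

```python
import math


def natural_order_sum(n_terms):
    """1 - 1/2 + 1/3 - 1/4 + ... summed in its usual order."""
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))


def rearranged_sum(n_blocks):
    """Same terms, reordered as two positive terms followed by one negative."""
    total = 0.0
    for j in range(1, n_blocks + 1):
        total += 1 / (4 * j - 3) + 1 / (4 * j - 1) - 1 / (2 * j)
    return total
```

An unconditionally convergent series rules this behavior out by definition, which is the property an order-independent accumulation scheme relies on.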
[NLP-29] STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio
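Quick read: This paper addresses a new attack surface that explicit reasoning mechanisms such as Chain-of-Thought (CoT) open up in large language models (LLMs): inference-time backdoors, which inject malicious reasoning paths without altering model parameters and, because the paths are linguistically coherent, evade conventional detection. The key to the solution is STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing shifts in output probability. It exploits a statistical discrepancy: a path induced by a malicious input has high posterior probability even though its prior probability under the model's general knowledge is low. STAR quantifies this state-transition amplification and uses the CUSUM algorithm to detect persistent anomalies. Across models from 8B to 70B and five benchmark datasets, it achieves near-perfect detection (AUROC ≈ 1.0) with approximately 42× greater efficiency than existing baselines.
Link: https://arxiv.org/abs/2601.08511
Authors: Seong-Gyu Park,Sohee Park,Jisu Lee,Hyunsik Na,Daeseon Choi
Affiliations: Soongsil University
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures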
Abstract:Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model’s general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC ≈ 1.0) with approximately 42× greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
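The detection idea, a per-step amplification signal fed into CUSUM, can be sketched in Python as below. This assumes the amplification at each reasoning step is the log-ratio of posterior to prior probability and uses a standard one-sided CUSUM. The `drift` and `threshold` values and the name `cusum_alarm` are hypothetical, and the paper's exact statistic may differ.

```python
def cusum_alarm(log_posteriors, log_priors, drift=1.0, threshold=5.0):
    """Return the first step index at which the CUSUM statistic exceeds
    `threshold`, or None if no persistent anomaly is found.

    Per-step signal (assumed): log posterior - log prior of the reasoning step.
    """
    statistic = 0.0
    for step, (log_post, log_prior) in enumerate(zip(log_posteriors, log_priors)):
        amplification = log_post - log_prior
        # One-sided CUSUM: accumulate excess over the allowed drift, never below 0.
        statistic = max(0.0, statistic + amplification - drift)
        if statistic > threshold:
            return step
    return None
```

Because the statistic resets toward zero on ordinary steps, an isolated spike is ignored while a sustained run of amplified steps raises the alarm.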
[NLP-30] STAGE: A Benchmark for Knowledge Graph Construction Question Answering and In-Script Role-Playing over Movie Screenplays
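Quick read: This paper addresses the lack of benchmarks that test whether models can build a coherent story world from a full-length movie screenplay and use it consistently across reasoning and generation tasks; existing benchmarks mostly target individual subtasks such as question answering or dialogue generation. The key to the solution is STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark that grounds four complementary tasks in a shared narrative world representation: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing. It provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films in English and Chinese, enabling holistic evaluation of how well models build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
Link: https://arxiv.org/abs/2601.08510
Authors: Qiuyu Tian,Yiding Li,Fengyi Chen,Zequn Liu,Youyong Kong,Fan Guo,Yuyao Li,Jinjing Shen,Zhijing Xie,Yiyun Luo,Xin Zhang
Affiliations: Southeast University; Beijing Zhongguancun Academy; Nanjing Normal University; ZhuiWen Technology Co., Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 66 pages, 9 figures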
Abstract:Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models’ abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
[NLP-31] What If TSF: A Benchmark for Reframing Forecasting as Scenario-Guided Multimodal Forecasting
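【Quick Read】: This paper addresses the fact that most existing time series forecasting methods are unimodal, extrapolate historical patterns, and make little effective use of textual context. The core challenge is evaluating whether a model can condition its forecasts on contextual text, such as hypothetical future scenarios, and so forecast in a way closer to how human experts reason. The key to the solution is the What If TSF (WIT) benchmark, which supplies expert-crafted plausible or counterfactual scenarios and thereby offers a rigorous testbed for assessing whether models produce distinct forecasts under different textual guidance, advancing research on scenario-guided multimodal time series forecasting.
Link: https://arxiv.org/abs/2601.08509
Authors: Jinkwan Jang, Hyunbin Jin, Hyungjin Park, Kyubyung Chae, Taesup Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages, 5 figures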
Abstract:Time series forecasting is critical to real-world decision making, yet most existing approaches remain unimodal and rely on extrapolating historical patterns. While recent progress in large language models (LLMs) highlights the potential for multimodal forecasting, existing benchmarks largely provide retrospective or misaligned raw context, making it unclear whether such models meaningfully leverage textual inputs. In practice, human experts incorporate what-if scenarios with historical evidence, often producing distinct forecasts from the same observations under different scenarios. Inspired by this, we introduce What If TSF (WIT), a multimodal forecasting benchmark designed to evaluate whether models can condition their forecasts on contextual text, especially future scenarios. By providing expert-crafted plausible or counterfactual scenarios, WIT offers a rigorous testbed for scenario-guided multimodal forecasting. The benchmark is available at this https URL.
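[NLP-32] It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models
【Quick Read】: This paper addresses Entity Linking (EL) in historical texts, where the main challenges are linguistic variation, noisy inputs, and evolving semantic conventions; existing methods either require large amounts of annotated data or rely on domain-specific rules that limit scalability. The key to the solution is MHEL-LLaMo, an unsupervised ensemble framework that combines a Small Language Model (SLM) with a Large Language Model (LLM). It uses a multilingual bi-encoder (BELA) for candidate retrieval and an instruction-tuned LLM, driven by prompt chaining, for NIL prediction and candidate selection. The SLM's confidence scores separate easy from hard samples, and the LLM is invoked only for hard ones, which lowers computational cost and avoids hallucinations on straightforward cases, yielding an efficient and scalable approach to historical EL.
Link: https://arxiv.org/abs/2601.08500
Authors: Cristian Santini, Marieke Van Erp, Mehwish Alam
Affiliations: University of Macerata; DHLab (Digital Humanities Lab, Humanities Cluster of the Royal Netherlands Academy of Arts and Sciences); Télécom Paris
Categories: Computation and Language (cs.CL)
Comments: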
Abstract:Despite the recent advancements in NLP with the advent of Large Language Models (LLMs), Entity Linking (EL) for historical texts remains challenging due to linguistic variation, noisy inputs, and evolving semantic conventions. Existing solutions either require substantial training data or rely on domain-specific rules that limit scalability. In this paper, we present MHEL-LLaMo (Multilingual Historical Entity Linking with Large Language MOdels), an unsupervised ensemble approach combining a Small Language Model (SLM) and an LLM. MHEL-LLaMo leverages a multilingual bi-encoder (BELA) for candidate retrieval and an instruction-tuned LLM for NIL prediction and candidate selection via prompt chaining. Our system uses SLM’s confidence scores to discriminate between easy and hard samples, applying an LLM only for hard cases. This strategy reduces computational costs while preventing hallucinations on straightforward cases. We evaluate MHEL-LLaMo on four established benchmarks in six European languages (English, Finnish, French, German, Italian and Swedish) from the 19th and 20th centuries. Results demonstrate that MHEL-LLaMo outperforms state-of-the-art models without requiring fine-tuning, offering a scalable solution for low-resource historical EL. The implementation of MHEL-LLaMo is available on GitHub.
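The confidence-based routing described in this abstract might look roughly like the sketch below. It is not the MHEL-LLaMo implementation: `link_entity`, the `llm_select` callable, the 0.8 threshold, and the choice to return NIL when no candidates are retrieved are all assumptions made for illustration.

```python
def link_entity(mention, slm_candidates, llm_select, threshold=0.8):
    """Route a mention to the small model or the LLM based on SLM confidence.

    slm_candidates: list of (entity_id, confidence), sorted by confidence descending.
    llm_select: hypothetical callable(mention, candidate_ids) -> entity_id or "NIL".
    threshold: illustrative cut-off between "easy" and "hard" samples.
    """
    if not slm_candidates:
        return "NIL"  # assumption: nothing retrieved means no linkable entity
    best_id, best_conf = slm_candidates[0]
    if best_conf >= threshold:
        return best_id  # easy case: trust the small model, skip the LLM call
    # hard case: let the LLM pick a candidate or predict NIL
    return llm_select(mention, [cid for cid, _ in slm_candidates])

# Stub standing in for an LLM call; it records which mentions were escalated.
escalated = []
def fake_llm(mention, candidate_ids):
    escalated.append(mention)
    return "NIL"

easy = link_entity("mention A", [("E1", 0.95), ("E2", 0.03)], fake_llm)
hard = link_entity("mention B", [("E3", 0.40), ("E4", 0.35)], fake_llm)
```

Only the low-confidence mention reaches the stubbed LLM, which is where the cost saving in such a design would come from.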
[NLP-33] BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts
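【Quick Read】: This paper addresses Overflow, a failure mode in which plain-text prompts cause Large Language Models (LLMs) to produce excessively long outputs. Unlike jailbreaks or prompt injection, Overflow arises in ordinary use without malicious intent; the abnormal growth in generated tokens raises serving cost and latency and degrades performance across users, and at scale it adds wasted compute and carbon emissions. The key to the solution is BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies. Under a fixed budget of 5000 new tokens, it measures the rightward shift and heavy tails of output-length distributions and uses cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) to quantify tail risk. The paper also evaluates a lightweight mitigation, a fixed conciseness reminder, which attenuates the right tail and lowers CSR for most models, and argues that length control should be treated as a measurable reliability, cost, and sustainability concern rather than a stylistic preference.
Link: https://arxiv.org/abs/2601.08490
Authors: Erin Feiglin, Nir Hutnik, Raz Lapid
Affiliations: Deepkeep
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at TMLR 2026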
Abstract:We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation, a fixed conciseness reminder, attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.
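To make the two tail-risk measures concrete, here is a minimal sketch. It assumes CSR@k is the fraction of responses whose length reaches at least k tokens; the benchmark's exact definition may differ, and the lengths below are made-up numbers rather than results from the paper.

```python
def cap_saturation_rate(lengths, cap):
    """Fraction of outputs whose length reaches or exceeds `cap` tokens (assumed definition)."""
    return sum(1 for n in lengths if n >= cap) / len(lengths)

def ecdf(lengths):
    """Return the sorted lengths and the empirical CDF value at each of them."""
    xs = sorted(lengths)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# Hypothetical output lengths (in tokens) under a 5000-token budget.
lengths = [120, 450, 980, 1500, 3200, 5000, 5000, 700]
csr = {cap: cap_saturation_rate(lengths, cap) for cap in (1000, 3000, 5000)}
xs, ys = ecdf(lengths)
```

For this toy sample, half of the outputs reach 1000 tokens and a quarter hit the 5000-token cap, which is the kind of heavy right tail the abstract refers to.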
[NLP-34] Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning
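【Quick Read】: This paper addresses the side effects of directly ablating the "refusal vector" in safety-aligned language models: lowering the model's refusal of harmful requests in this way also causes distribution drift and degrades core capabilities. Simply removing the raw vector from activation space often damages the model; the paper attributes this to "Ghost Noise", the spectral bleeding of the refusal direction into capability subspaces. The key to the solution is Surgical Refusal Ablation (SRA). It first builds a registry of independent Concept Atoms that represent protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against those directions. The resulting clean refusal direction targets refusal-relevant structure while leaving the model's semantic geometry largely undisturbed. Experiments report near-zero refusal rates (0-2%) with negligible perplexity change (mean ΔPPL ≈ 0.02) and minimal distribution drift.
Link: https://arxiv.org/abs/2601.08489
Authors: Tony Cristofano
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: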
Abstract:Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw “refusal vector” calculated from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style. We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model’s semantic geometry. Across five models (Qwen3-VL and Ministral series), SRA achieves deep refusal reduction (0-2%) with negligible perplexity impact on Wikitext-2 (mean delta PPL approx. 0.02) and minimal distribution drift. Notably, standard ablation on Qwen3-VL-4B induces severe drift (first-token KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate. Using teacher-forced perplexity on GSM8K and MBPP as a high-resolution capability proxy, we show SRA preserves math and code distributions. These results suggest that common “model damage” is often “Ghost Noise,” defined as the spectral bleeding of the dirty refusal direction into capability subspaces.
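The orthogonalization step can be sketched as plain ridge residualization: regress the refusal vector on the concept directions and keep the residual. This is a simplified stand-in, not the paper's spectral procedure; the function name, the `ridge` value, and the random toy vectors are assumptions.

```python
import numpy as np

def residualize(refusal, atoms, ridge=1e-2):
    """Remove from `refusal` the part explained by the rows of `atoms` (ridge regression)."""
    A = np.asarray(atoms, dtype=float)             # shape (k, d): one concept direction per row
    r = np.asarray(refusal, dtype=float)           # shape (d,)
    gram = A @ A.T + ridge * np.eye(A.shape[0])    # (k, k) ridge-regularized Gram matrix
    coeffs = np.linalg.solve(gram, A @ r)          # ridge coefficients of r on the atoms
    clean = r - A.T @ coeffs                       # residual after removing the explained part
    return clean / np.linalg.norm(clean)           # unit-norm "clean" direction

# Toy data: a refusal vector deliberately entangled with the first concept atom.
rng = np.random.default_rng(0)
atoms = rng.normal(size=(3, 16))
refusal = rng.normal(size=16) + 2.0 * atoms[0]
clean = residualize(refusal, atoms)

overlap_before = np.max(np.abs(atoms @ (refusal / np.linalg.norm(refusal))))
overlap_after = np.max(np.abs(atoms @ clean))
```

The ridge term keeps the solve well conditioned when concept directions are nearly collinear, at the cost of leaving a small residual overlap rather than exact orthogonality.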
[NLP-35] Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots
【Quick Read】: This paper addresses the absence of a systematic way to specify and verify empathy, a key non-functional requirement in psychotherapy settings, during the development of therapy chatbots. The core of the solution is a framework that combines Natural Language Processing (NLP) with formal verification. A Transformer-based model extracts dialogue features, which are translated into a Stochastic Hybrid Automaton (SHA) model of dyadic therapy sessions; empathy-related properties are then verified with Statistical Model Checking (SMC), and strategy synthesis guides the shaping of agent behavior. Preliminary results indicate that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies raise the probability of satisfying empathy requirements.
Link: https://arxiv.org/abs/2601.08477
Authors: Francesco Dettori, Matteo Forasassi, Lorenzo Veronese, Livia Lestingi, Vincenzo Scotti, Matteo Giovanni Rossi
Affiliations: Université Paris-Saclay; Centre National de la Recherche Scientifique; TU Wien; Politecnico di Milano; Karlsruhe Institute of Technology
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments:
Abstract:Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.
[NLP-36] sui-1: Grounded and Verifiable Long-Form Summarization
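【Quick Read】: This paper addresses the tendency of large language models to generate summaries that look plausible but are unfaithful to the source, a problem that is acute in compliance-sensitive domains such as government and legal analysis, where users cannot easily verify a summary against the original text. The key to the solution is sui-1, a model that attaches inline citations so that each claim in a summary can be traced to its source sentence, improving verifiability and trust. Its central contribution is a synthetic data pipeline that combines chain-of-thought prompting with multi-stage verification to build more than 22,000 high-quality training examples across five languages. The model significantly outperforms open-weight baselines with more parameters, indicating that task-specific training matters more than scale alone for citation-grounded summarization.
Link: https://arxiv.org/abs/2601.08472
Authors: Benedikt Droste, Jan Philipp Harries, Maximilian Idahl, Björn Plüster
Affiliations: ellamind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures, model weights at this https URL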
Abstract:Large language models frequently generate plausible but unfaithful summaries that users cannot verify against source text, a critical limitation in compliance-sensitive domains such as government and legal analysis. We present sui-1, a 24B parameter model that produces abstractive summaries with inline citations, enabling users to trace each claim to its source sentence. Our synthetic data pipeline combines chain-of-thought prompting with multi-stage verification, generating over 22,000 high-quality training examples across five languages from diverse sources including parliamentary documents, web text, and Wikipedia. Evaluation shows sui-1 significantly outperforms all tested open-weight baselines, including models with 3x more parameters. These results demonstrate that task-specific training substantially outperforms scale alone for citation-grounded summarization. Model weights and an interactive demo are publicly available.
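[NLP-37] JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
【Quick Read】: This paper addresses inefficient exploration in large language models trained with Reinforcement Learning with Verifiable Rewards (RLVR): because only final-answer correctness is optimized, models drift into verbose, unstructured trial-and-error instead of efficient planning. The key to the solution is to treat discriminative capability as a prerequisite for efficient generation. In a two-stage judge-then-generate paradigm, the model is first trained to judge whether solution responses are valid, internalizing a guidance signal that prunes the search space; it is then fine-tuned with standard RLVR for generation, initialized from the judge's parameters. Experiments show that the method maintains or improves accuracy while substantially shortening generations, and that it generalizes better on out-of-domain tasks.
Link: https://arxiv.org/abs/2601.08468
Authors: Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures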
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality–efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.
[NLP-38] An Under-Explored Application for Explainable Multimodal Misogyny Detection in code-mixed Hindi-English
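【Quick Read】: This paper addresses the detection of online misogyny in low-resource, code-mixed language (Hindi mixed with English) and the limited interpretability of AI models on this sensitive task. The key to the solution is a multimodal, explainable web application. It uses Transformer-based multilingual models such as XLM-RoBERTa and mBERT to detect misogyny in text and in memes that combine image and text, and applies Explainable Artificial Intelligence (XAI) techniques, SHAP and LIME, to provide feature importance scores. This makes model decisions more transparent and helps researchers and content moderators identify and respond to gender-based violence in digital spaces.
Link: https://arxiv.org/abs/2601.08457
Authors: Sargam Yadav, Abhishek Kaushik, Kevin Mc Daid
Affiliations: Dundalk Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: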
Abstract:Digital platforms have an ever-expanding user base, and act as a hub for communication, business, and connectivity. However, this has also allowed for the spread of hate speech and misogyny. Artificial intelligence models have emerged as an effective solution for countering online hate speech but are underexplored for low-resource and code-mixed languages and suffer from a lack of interpretability. Explainable Artificial Intelligence (XAI) can enhance transparency in the decisions of deep learning models, which is crucial for a sensitive domain such as hate speech detection. In this paper, we present a multi-modal and explainable web application for detecting misogyny in text and memes in code-mixed Hindi and English. The system leverages state-of-the-art transformer-based models that support multilingual and multimodal settings. For text-based misogyny identification, the system utilizes XLM-RoBERTa (XLM-R) and multilingual Bidirectional Encoder Representations from Transformers (mBERT) on a dataset of approximately 4,193 comments. For multimodal misogyny identification from memes, the system utilizes mBERT + EfficientNet, and mBERT + ResNet trained on a dataset of approximately 4,218 memes. It also provides feature importance scores using explainability techniques including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). The application aims to serve as a tool for both researchers and content moderators, to promote further research in the field, combat gender-based digital violence, and ensure a safe digital space. The system has been evaluated using human evaluators who provided their responses on the Chatbot Usability Questionnaire (CUQ) and the User Experience Questionnaire (UEQ) to determine overall usability.
[NLP-39] Fine-Mem: Fine-Grained Feedback Alignment for Long-Horizon Memory Management
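【Quick Read】: This paper addresses inefficient memory management in large language model agents on long-horizon tasks, caused by reward sparsity and weak credit assignment. Existing reinforcement learning approaches typically use only final task performance as the reward, which gives little guidance for individual memory operations. The key to the solution is Fine-Mem, a unified framework for fine-grained feedback alignment. First, a Chunk-level Step Reward provides immediate step-level supervision through auxiliary chunk-specific question answering tasks. Second, Evidence-Anchored Reward Attribution redistributes the global reward to key memory operations, using the memory items actually cited as evidence during reasoning. Together these enable stable policy optimization and align local memory operations with long-term memory utility, improving task success rates and generalization.
Link: https://arxiv.org/abs/2601.08435
Authors: Weitao Ma, Xiaocheng Feng, Lei Huang, Xiachong Feng, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Bing Qin
Affiliations: Harbin Institute of Technology; Meituan; Peng Cheng Laboratory; The University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 5 figures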
Abstract:Effective memory management is essential for large language model agents to navigate long-horizon tasks. Recent research has explored using Reinforcement Learning to develop specialized memory manager agents. However, existing approaches rely on final task performance as the primary reward, which results in severe reward sparsity and ineffective credit assignment, providing insufficient guidance for individual memory operations. To this end, we propose Fine-Mem, a unified framework designed for fine-grained feedback alignment. First, we introduce a Chunk-level Step Reward to provide immediate step-level supervision via auxiliary chunk-specific question answering tasks. Second, we devise Evidence-Anchored Reward Attribution to redistribute global rewards by anchoring credit to key memory operations, based on the specific memory items utilized as evidence in reasoning. Together, these components enable stable policy optimization and align local memory operations with the long-term utility of memory. Experiments on Memalpha and MemoryAgentBench demonstrate that Fine-Mem consistently outperforms strong baselines, achieving superior success rates across various sub-tasks. Further analysis reveals its adaptability and strong generalization capabilities across diverse model configurations and backbones.
[NLP-40] Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
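【Quick Read】: This paper addresses the reliance of Group Relative Policy Optimization (GRPO) on expensive external verifiers or hand-written rules when improving the reasoning ability of Large Language Models (LLMs); this reliance brings high computational cost, training latency, and sparse rewards that hinder optimization. The key to the solution is Latent-GRPO, a framework that derives intrinsic rewards from the geometry of the latent space. The authors observe empirically that terminal-token representations of correct reasoning trajectories form dense, highly cohesive clusters, whereas incorrect trajectories appear as outliers. Building on this property, they design the Iterative Robust Centroid Estimation (IRCE) algorithm, which uses spherical projection to suppress magnitude fluctuations and iterative aggregation to estimate a robust "truth centroid", producing dense, continuous reward signals that speed up training while maintaining model performance.
Link: https://arxiv.org/abs/2601.08427
Authors: Nonghai Zhang, Weitao Ma, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu
Affiliations: Meituan; Peking University; Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: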
Abstract:Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust “truth centroid” through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.
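One way to read the two ingredients named in this abstract, spherical projection and iterative aggregation, is the trimmed-centroid loop below. It is only a guess at the general shape of such an estimator, not the IRCE algorithm itself: the function name, `keep_frac`, the iteration count, and the toy vectors are assumptions.

```python
import numpy as np

def robust_centroid_rewards(hidden, n_iters=5, keep_frac=0.5):
    """Cosine-similarity rewards against a centroid re-estimated from the closest points.

    hidden: (n, d) terminal-token representations for one group of sampled responses.
    keep_frac: illustrative fraction of points kept when re-estimating the centroid.
    """
    H = np.asarray(hidden, dtype=float)
    U = H / np.linalg.norm(H, axis=1, keepdims=True)   # spherical projection removes magnitude
    centroid = U.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    k = max(1, int(round(keep_frac * len(U))))
    for _ in range(n_iters):
        sims = U @ centroid                             # cosine similarity to current centroid
        closest = np.argsort(sims)[-k:]                 # indices of the k most similar points
        centroid = U[closest].mean(axis=0)              # re-estimate from the dense core only
        centroid /= np.linalg.norm(centroid)
    return U @ centroid                                 # dense, continuous reward in [-1, 1]

# Toy group: four vectors sharing a direction (different magnitudes) and two outliers.
cluster = np.array([[1.0, 0.05, 0.0], [3.0, -0.1, 0.1], [0.5, 0.0, 0.02], [2.0, 0.1, -0.05]])
outliers = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, -2.0]])
rewards = robust_centroid_rewards(np.vstack([cluster, outliers]))
```

In this toy group the clustered vectors receive higher rewards than the outliers regardless of their norms, which is the behavior the spherical projection is meant to provide.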
[NLP-41] PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors
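【Quick Read】: This paper addresses the failure of current Large Language Model (LLM) tutors to account for student personality: different teaching strategies work differently for students with different personality traits, and a system that does not adapt may teach less effectively or even be counterproductive. The key to the solution is a taxonomy, grounded in the pedagogical literature, that maps pedagogical methods to personality profiles; in simulated student-teacher conversations, the LLM tutor uses it to adjust its strategy to the simulated student's personality. Experiments show that human teachers consistently prefer this approach over baselines and that it increases the use of less common, high-impact strategies such as role-playing, laying groundwork for more personalized and effective LLM applications in education.
Link: https://arxiv.org/abs/2601.08402
Authors: Donya Rooein, Sankalan Pal Chowdhury, Mariia Eremeeva, Yuan Qin, Debora Nozza, Mrinmaya Sachan, Dirk Hovy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: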
Abstract:Recent advances in large language models (LLMs) demonstrate their potential as educational tutors. However, different tutoring strategies benefit different student personalities, and mismatches can be counterproductive to student outcomes. Despite this, current LLM tutoring systems do not take into account student personality traits. To address this problem, we first construct a taxonomy that links pedagogical methods to personality profiles, based on pedagogical literature. We simulate student-teacher conversations and use our framework to let the LLM tutor adjust its strategy to the simulated student personality. We evaluate the scenario with human teachers and find that they consistently prefer our approach over two baselines. Our method also increases the use of less common, high-impact strategies such as role-playing, which human and LLM annotators prefer significantly. Our findings pave the way for developing more personalized and effective LLM use in educational applications.
[NLP-42] Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models AAAI26
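【Quick Read】: This paper asks how Mixture-of-Experts (MoE) architectures shape knowledge acquisition during pre-training and how that process differs from dense architectures. The key to the solution is Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes the log-probability increase across individual neurons. A time-resolved comparison, tracking MoE and dense checkpoints over about 5.0T and 2.5T tokens respectively, reveals three patterns: (1) a low-entropy backbone, in which roughly the top 1% of MoE neurons capture over 45% of positive updates and form a high-utility core; (2) early consolidation, in which the MoE model settles into a stable importance profile within about 100K steps while the dense model remains volatile; and (3) functional robustness, in which masking the ten most important MoE attention heads lowers relational HIT@10 by about 10%, compared with 50% for the dense model, indicating that sparsity fosters distributed rather than brittle knowledge storage. Together these findings suggest that sparsity builds an intrinsically stable, distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.
Link: https://arxiv.org/abs/2601.08383
Authors: Bo Wang, Junzhuo Li, Hong Chen, Yuanlin Chu, Yuxuan Fan, Xuming Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by AAAI26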
Abstract:Mixture-of-Experts (MoE) architectures decouple model capacity from per-token computation, enabling scaling beyond the computational limits imposed by dense scaling laws. Yet how MoE architectures shape knowledge acquisition during pre-training, and how this process differs from dense architectures, remains unknown. To address this issue, we introduce Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes log-probability increase across neurons. We present a time-resolved comparison of knowledge acquisition dynamics in MoE and dense architectures, tracking checkpoints over 1.2M training steps (~ 5.0T tokens) and 600K training steps (~ 2.5T tokens), respectively. Our experiments uncover three patterns: (1) Low-entropy backbone. The top approximately 1% of MoE neurons capture over 45% of positive updates, forming a high-utility core, which is absent in the dense baseline. (2) Early consolidation. The MoE model locks into a stable importance profile within 100K steps, whereas the dense model remains volatile throughout training. (3) Functional robustness. Masking the ten most important MoE attention heads reduces relational HIT@10 by 10%, compared with 50% for the dense model, showing that sparsity fosters distributed – rather than brittle – knowledge storage. These patterns collectively demonstrate that sparsity fosters an intrinsically stable and distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.
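The "top 1% of neurons capture over 45% of positive updates" statistic is a concentration measure that can be computed from any vector of per-neuron attribution scores. The sketch below shows that computation only; it does not implement Gated-LPI, and the function name and toy scores are hypothetical.

```python
import numpy as np

def top_fraction_share(attributions, frac=0.01):
    """Share of total positive attribution held by the top `frac` of neurons."""
    scores = np.clip(np.asarray(attributions, dtype=float), 0.0, None)  # keep positive updates only
    k = max(1, int(round(frac * scores.size)))                          # number of top neurons
    top = np.sort(scores)[-k:]                                          # k largest scores
    return float(top.sum() / scores.sum())

# Toy scores for 100 neurons: one dominant neuron, one negative, the rest small.
toy = np.array([1.0] * 98 + [-5.0] + [98.0])
share = top_fraction_share(toy)
```

Here the single dominant neuron holds half of the positive mass; negative scores are clipped so they do not count toward either the numerator or the denominator.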
[NLP-43] PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark
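【Quick Read】: This paper addresses the lack of rigorous evaluation of how sensitive dense retrieval models are to the position of relevant evidence in long documents, i.e., position bias. Existing benchmarks usually use position-agnostic relevance labels, which conflates long-context processing ability with positional preference and makes model weaknesses hard to diagnose. The key to the solution is PosIR (Position-Aware Information Retrieval), a comprehensive benchmark of 310 datasets covering 10 languages and 31 domains. By tying relevance to precisely annotated reference spans, it strictly disentangles document length from information position, so that the positional preferences of different retrieval models can be identified and quantified.
Link: https://arxiv.org/abs/2601.08363
Authors: Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Yudong Zhou, Yuqing Yang
Affiliations: Beijing University of Posts and Telecommunications; Prior Shape; Université Caen Normandie, ENSICAEN, CNRS, Normandie Univ, GREYC UMR6072, F-14000 Caen, France
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: This research is driven by a strong academic interest, and we welcome further exchange and discussion with peers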
Abstract:While dense retrieval models have achieved remarkable success, rigorous evaluation of their sensitivity to the position of relevant information (i.e., position bias) remains largely unexplored. Existing benchmarks typically employ position-agnostic relevance labels, conflating the challenge of processing long contexts with the bias against specific evidence locations. To address this challenge, we introduce PosIR (Position-Aware Information Retrieval), a comprehensive benchmark designed to diagnose position bias in diverse retrieval scenarios. PosIR comprises 310 datasets spanning 10 languages and 31 domains, constructed through a rigorous pipeline that ties relevance to precise reference spans, enabling the strict disentanglement of document length from information position. Extensive experiments with 10 state-of-the-art embedding models reveal that: (1) Performance on PosIR in long-context settings correlates poorly with the MMTEB benchmark, exposing limitations in current short-text benchmarks; (2) Position bias is pervasive and intensifies with document length, with most models exhibiting primacy bias while certain models show unexpected recency bias; (3) Gradient-based saliency analysis further uncovers the distinct internal attention mechanisms driving these positional preferences. In summary, PosIR serves as a valuable diagnostic framework to foster the development of position-robust retrieval systems.
[NLP-44] When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges
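【Quick Read】: This paper addresses the breakdown of KV cache reuse as an inference optimization in judge-centric inference within multi-agent LLM systems. Reusing the KV cache across partially shared contexts lowers prefill cost and gives substantial speedups for execution agents; the paper finds, however, that in tasks driven by an LLM judge such reuse can severely perturb judgments: end-task accuracy may look stable while the judge's selections become highly inconsistent with those under dense prefill. The key contribution is to identify and quantify this risk. The authors propose the Judge Consistency Rate (JCR) as a metric and show that KV cache reuse systematically weakens attention across candidate responses, especially for later candidate blocks. Ablations further show that explicit cross-candidate interaction is crucial for preserving dense-prefill decisions, so judge-centric inference should be treated as a distinct regime that calls for dedicated, risk-aware system design.
Link: https://arxiv.org/abs/2601.08343
Authors: Sichu Liang, Zhenglin Wang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou
Affiliations: Southeast University; Huawei Technologies Ltd
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: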
Abstract:Multi-agent LLM systems routinely generate multiple candidate responses that are aggregated by an LLM judge. To reduce the dominant prefill cost in such pipelines, recent work advocates KV cache reuse across partially shared contexts and reports substantial speedups for generation agents. In this work, we show that these efficiency gains do not transfer uniformly to judge-centric inference. Across GSM8K, MMLU, and HumanEval, we find that reuse strategies that are effective for execution agents can severely perturb judge behavior: end-task accuracy may appear stable, yet the judge’s selection becomes highly inconsistent with dense prefill. We quantify this risk using Judge Consistency Rate (JCR) and provide diagnostics showing that reuse systematically weakens cross-candidate attention, especially for later candidate blocks. Our ablation further demonstrates that explicit cross-candidate interaction is crucial for preserving dense-prefill decisions. Overall, our results identify a previously overlooked failure mode of KV cache reuse and highlight judge-centric inference as a distinct regime that demands dedicated, risk-aware system design.
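A consistency metric of this kind can be sketched as a simple agreement rate between the judge's picks under dense prefill and under KV cache reuse. The definition below is an assumption inferred from the abstract's wording, not the paper's formula, and the choice lists are made-up.

```python
def judge_consistency_rate(dense_choices, reuse_choices):
    """Share of instances where the judge under KV reuse picks the same candidate
    as under dense prefill (assumed definition of JCR)."""
    if len(dense_choices) != len(reuse_choices):
        raise ValueError("choice lists must have the same length")
    agree = sum(d == r for d, r in zip(dense_choices, reuse_choices))
    return agree / len(dense_choices)

# Hypothetical selected-candidate indices for six evaluation instances.
dense = [0, 2, 1, 1, 3, 0]
reuse = [0, 1, 1, 0, 3, 2]
jcr = judge_consistency_rate(dense, reuse)
```

Note that a metric like this can be low even when end-task accuracy is unchanged, for example when several candidates are correct and the judge merely switches between them.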
[NLP-45] Detecting Mental Manipulation in Speech via Synthetic Multi-Speaker Dialogue
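【Quick Read】: This paper addresses how to detect mental manipulation in spoken dialogue, a task in computational social reasoning where earlier work considered only text conversations and overlooked how manipulative tactics appear in speech. The key to the solution is SPEECHMENTALMANIP, the first benchmark for mental manipulation detection in spoken dialogue, built by augmenting an existing text dataset with high-quality, voice-consistent Text-to-Speech (TTS) audio so that text and speech can be compared directly. Using few-shot large audio-language models together with human annotation, the authors compare detection performance across the two modalities and find that models show markedly lower recall on speech, which the authors suggest reflects sensitivity to acoustic or prosodic cues missing in training. The results inform safety alignment and modality-aware evaluation for multimodal dialogue systems.
Link: https://arxiv.org/abs/2601.08342
Authors: Run Chen, Wen Liang, Ziwei Gong, Lin Ai, Julia Hirschberg
Affiliations: Columbia University; Red Hat
Categories: Computation and Language (cs.CL)
Comments: Accepted to IWSDS 2026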
Abstract:Mental manipulation, the strategic use of language to covertly influence or exploit others, is a newly emerging task in computational social reasoning. Prior work has focused exclusively on textual conversations, overlooking how manipulative tactics manifest in speech. We present the first study of mental manipulation detection in spoken dialogues, introducing a synthetic multi-speaker benchmark SPEECHMENTALMANIP that augments a text-based dataset with high-quality, voice-consistent Text-to-Speech rendered audio. Using few-shot large audio-language models and human annotation, we evaluate how modality affects detection accuracy and perception. Our results reveal that models exhibit high specificity but markedly lower recall on speech compared to text, suggesting sensitivity to missing acoustic or prosodic cues in training. Human raters show similar uncertainty in the audio setting, underscoring the inherent ambiguity of manipulative speech. Together, these findings highlight the need for modality-aware evaluation and safety alignment in multimodal dialogue systems.
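The "high specificity but lower recall" pattern is easiest to see with the two metrics side by side. The sketch below uses made-up labels, not results from the paper; label 1 is assumed to mean a manipulative dialogue, and both classes are assumed to be present.

```python
def recall_and_specificity(y_true, y_pred):
    """Binary labels: 1 = manipulative, 0 = non-manipulative (assumed convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # manipulative, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # manipulative, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)

# Toy predictions from a cautious detector that rarely flags anything.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
recall, specificity = recall_and_specificity(y_true, y_pred)
```

A detector like this never raises a false alarm (specificity 1.0) yet misses three of four manipulative dialogues (recall 0.25), so reporting accuracy alone would hide the failure.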
[NLP-46] CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark
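[Quick read]: This paper addresses the difficulty of evaluating behavior control and adaptation of large language models (LLMs) in multilingual NLP, namely the lack of dedicated benchmarks and evaluation protocols for quantifying how well language steering techniques work. The key to the solution is CLaS-Bench, a lightweight benchmark of parallel questions across 32 languages that enables systematic evaluation of multilingual steering methods. It scores steering along two axes, language control and semantic relevance, and provides a standardized comparison of techniques such as residual-stream DiffMean interventions, probe-derived directions, language-specific neurons, PCA/LDA vectors, and Sparse Autoencoders. The results show that the simple residual-based DiffMean method performs best in most languages, that language-specific structure emerges mainly in later layers, and that steering directions cluster by language family. The work offers a reproducible evaluation framework for analyzing language representations and for low-resource adaptation.
Link: https://arxiv.org/abs/2601.08331
Authors: Daniil Gurgurov,Yusser Al Ghussin,Tanja Baeumel,Cheng-Ting Chou,Patrick Schramowski,Marius Mosbach,Josef van Genabith,Simon Ostermann
Affiliations: Saarland University; German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN); University of Illinois Urbana-Champaign; TU Darmstadt; hessian.AI; Mila - Quebec Artificial Intelligence Institute; McGill University
Categories: Computation and Language (cs.CL)
Comments: pre-print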
Abstract:Understanding and controlling the behavior of large language models (LLMs) is an increasingly important topic in multilingual NLP. Beyond prompting or fine-tuning, steering, i.e., manipulating internal representations during inference, has emerged as a more efficient and interpretable technique for adapting models to a target language. Yet, no dedicated benchmarks or evaluation protocols exist to quantify the effectiveness of steering techniques. We introduce CLaS-Bench, a lightweight parallel-question benchmark for evaluating language-forcing behavior in LLMs across 32 languages, enabling systematic evaluation of multilingual steering methods. We evaluate a broad array of steering techniques, including residual-stream DiffMean interventions, probe-derived directions, language-specific neurons, PCA/LDA vectors, Sparse Autoencoders, and prompting baselines. Steering performance is measured along two axes: language control and semantic relevance, combined into a single harmonic-mean steering score. We find that, across languages, the simple residual-based DiffMean method consistently outperforms all other methods. Moreover, a layer-wise analysis reveals that language-specific structure emerges predominantly in later layers and steering directions cluster based on language family. CLaS-Bench is the first standardized benchmark for multilingual steering, enabling both rigorous scientific analysis of language representations and practical evaluation of steering as a low-cost adaptation alternative.
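The abstract combines language control and semantic relevance into a single harmonic-mean steering score. A minimal sketch of such a score, assuming both inputs are already normalized to [0, 1] and that the mean is unweighted; the function name is hypothetical and the benchmark's exact scoring may differ.

```python
def steering_score(language_control, semantic_relevance):
    """Unweighted harmonic mean of two scores in [0, 1].

    Returns 0.0 when either score is 0, so a method cannot score well
    by maximizing only one axis.
    """
    for value in (language_control, semantic_relevance):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    total = language_control + semantic_relevance
    if total == 0.0:
        return 0.0
    return 2.0 * language_control * semantic_relevance / total


# Strong language control but weak relevance is pulled toward the weaker axis.
print(steering_score(0.9, 0.3))
```

The harmonic mean is a natural choice here because output in the right language that ignores the question, or a relevant answer in the wrong language, should both count as failed steering.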
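[NLP-47] AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture
[Quick read]: This paper addresses the challenges that intelligent agent systems face when handling diverse tasks in real-world agricultural scenarios, in particular that existing methods rely on a unified execution paradigm that copes poorly with large variations in task complexity and incomplete tool availability. The key to the solution is AgriAgent, a two-level agent framework with a hierarchical execution strategy based on task complexity: simple tasks are handled by direct reasoning in modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates the task as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step, verifiable execution with failure recovery.
Link: https://arxiv.org/abs/2601.08308
Authors: Bo Yang,Yu Zhang,Yunkui Chen,Lanfei Feng,Xiao Xu,Nueraili Aierken,Shijian Li
Affiliations: State Key Laboratory of Brain–Machine Intelligence; College of Computer Science and Technology, Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: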
Abstract:Intelligent agent systems in real-world agricultural scenarios must handle diverse tasks under multimodal inputs, ranging from lightweight information understanding to complex multi-step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate large variations in task complexity and incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two-level agent framework for real-world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool-centric agent baselines that rely on unified execution paradigms. All code and data will be released after our work is accepted, to promote reproducible research.
[NLP-48] Enhancing Sentiment Classification and Irony Detection in Large Language Models through Advanced Prompt Engineering Techniques
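[Quick read]: This paper addresses how prompt engineering can improve large language models (LLMs) on sentiment analysis, specifically sentiment classification, aspect-based sentiment analysis, and the detection of subtle cues such as irony. The key to the solution is a systematic evaluation of advanced prompting techniques (few-shot learning, chain-of-thought prompting, and self-consistency) on GPT-4o-mini and gemini-1.5-flash. The effect of a prompting strategy depends strongly on the match between the model and the semantic complexity of the task: few-shot prompting works best for GPT-4o-mini, while chain-of-thought prompting improves irony detection in gemini-1.5-flash by up to 46%. Effective prompt design therefore has to be tailored to the specific model and task.
Link: https://arxiv.org/abs/2601.08302
Authors: Marvin Schmitt,Anne Schwerk,Sebastian Lempert
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 4 figures, 13 tables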
Abstract:This study investigates the use of prompt engineering to enhance large language models (LLMs), specifically GPT-4o-mini and gemini-1.5-flash, in sentiment analysis tasks. It evaluates advanced prompting techniques like few-shot learning, chain-of-thought prompting, and self-consistency against a baseline. Key tasks include sentiment classification, aspect-based sentiment analysis, and detecting subtle nuances such as irony. The research details the theoretical background, datasets, and methods used, assessing performance of LLMs as measured by accuracy, recall, precision, and F1 score. Findings reveal that advanced prompting significantly improves sentiment analysis, with the few-shot approach excelling in GPT-4o-mini and chain-of-thought prompting boosting irony detection in gemini-1.5-flash by up to 46%. Thus, while advanced prompting techniques overall improve performance, the fact that few-shot prompting works best for GPT-4o-mini and chain-of-thought excels in gemini-1.5-flash for irony detection suggests that prompting strategies must be tailored to both the model and the task. This highlights the importance of aligning prompt design with both the LLM’s architecture and the semantic complexity of the task.
[NLP-49] Demystifying the Slash Pattern in Attention: The Role of RoPE
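[Quick read]: This paper seeks to explain why Slash-Dominant Heads (SDHs) are so common in large language models (LLMs), that is, why attention scores concentrate along the sub-diagonal at some offset Δ. The key lies in uncovering the mechanism both empirically and theoretically. Analysis of open-source LLMs shows that SDHs are intrinsic to the model and generalize to out-of-distribution prompts. Their formation rests on two conditions: queries and keys are approximately rank-one, and Rotary Position Embedding (RoPE) is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions among the medium- and high-frequency RoPE components produce the slash structure. On the theory side, the authors formalize these conditions as assumptions and prove that a shallow Transformer with RoPE trained by gradient descent exhibits SDHs, which explains their emergence from the training dynamics.
Link: https://arxiv.org/abs/2601.08297
Authors: Yuan Cheng,Fengzhuo Zhang,Yunlong Hou,Cunxiao Du,Chao Du,Tianyu Pang,Aixin Sun,Zhuoran Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: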
Abstract:Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) Queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. Particularly, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs. The SDHs generalize to out-of-distribution prompts.
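One step of the argument can be checked numerically: if the query and key vectors are the same at every position (an idealization of the "almost rank-one" condition), then after RoPE the attention logit between positions i and j depends only on the offset i - j, so every sub-diagonal carries a constant score. The sketch below uses the standard pairwise-rotation form of RoPE in pure Python; the toy vectors and the helper names `rope` and `score` are illustrative assumptions, not the paper's setup.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by the angle pos * theta_k."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        theta = base ** (-2.0 * k / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q, k, i, j):
    """Pre-softmax attention logit between a query at i and a key at j."""
    return sum(a * b for a, b in zip(rope(q, i), rope(k, j)))


# The same query and key vector at every position (toy values).
q = [0.5, -1.0, 0.3, 0.8]
k = [1.0, 0.2, -0.4, 0.6]

# Both pairs sit on the sub-diagonal i - j = 2, so the logits agree.
print(score(q, k, 5, 3), score(q, k, 40, 38))
```

Which offset ends up with the largest logit depends on how the RoPE frequencies interact, which is where the paper's second condition on medium- and high-frequency components comes in; the sketch does not model that part.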
[NLP-50] D2Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning
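[Quick read]: This paper addresses two key problems that search-augmented large language models trained with reinforcement learning (RL) face in multi-hop reasoning: ineffective construction of the search chain, which produces incorrect queries or misses critical information, and reasoning that is hijacked by peripheral evidence, so that the model mistakes distractors for valid evidence. The core of the solution is D²Plan (Dual-agent Dynamic global Planning), in which a Reasoner and a Purifier collaborate: the Reasoner builds an explicit global plan during reasoning and adapts it dynamically based on retrieval feedback, while the Purifier assesses the relevance of retrieved content and passes condensed key information to the Reasoner. The paper also designs a two-stage training framework, with a supervised fine-tuning (SFT) cold start on synthesized trajectories followed by RL with plan-oriented rewards. Experiments show more coherent multi-step reasoning, stronger robustness to irrelevant information, and better results on challenging QA benchmarks.
Link: https://arxiv.org/abs/2601.08282
Authors: Kangcheng Luo,Tinglang Wu,Yansong Feng
Affiliations: Peking University
Categories: Computation and Language (cs.CL)
Comments: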
Abstract:Recent search-augmented LLMs trained with reinforcement learning (RL) can interleave searching and reasoning for multi-hop reasoning tasks. However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose D$^2$Plan, a Dual-agent Dynamic global Planning paradigm for complex retrieval-augmented reasoning. D$^2$Plan operates through the collaboration of a Reasoner and a Purifier: the Reasoner constructs explicit global plans during reasoning and dynamically adapts them based on retrieval feedback; the Purifier assesses retrieval relevance and condenses key information for the Reasoner. We further introduce a two-stage training framework consisting of supervised fine-tuning (SFT) cold-start on synthesized trajectories and RL with plan-oriented rewards to teach LLMs to master the D$^2$Plan paradigm. Extensive experiments demonstrate that D$^2$Plan enables more coherent multi-step reasoning and stronger resilience to irrelevant information, thereby achieving superior performance on challenging QA benchmarks.
[NLP-51] Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees
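[Quick read]: This paper addresses how to integrate tool use into long Chain-of-Thought (long CoT) reasoning to extend the computational capabilities of large language models (LLMs), a problem that has been underexplored because training data is scarce and tool calls can disrupt the model's intrinsic long-chain reasoning. The key to the solution is DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that builds dynamic rollout trees during training to discover valid tool-use opportunities, branching at promising positions to explore diverse tool-integrated trajectories. A tree-based advantage estimate then identifies and credits the sub-trajectories in which a tool call improved the solution, reinforcing beneficial tool-use behavior.
Link: https://arxiv.org/abs/2601.08274
Authors: Kun Li,Zenan Xu,Junan Li,Zengrui Jin,Jinghao Deng,Zexuan Qiu,Bo Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: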
Abstract:Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model’s intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
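[NLP-52] Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning
[Quick read]: This paper addresses the substantial gap in multilingual medical reasoning: large language models that reason strongly in English reason markedly worse in local languages (for example Chinese or Arabic), which limits equitable global deployment of generative AI in medicine. The key to the solution is Med-CoReasoner, a language-informed co-reasoning framework that elicits English and local-language reasoning in parallel, abstracts both into structured concepts, and uses concept-level alignment and retrieval to integrate local clinical knowledge into an English logical scaffold. This combines the structural strength of English reasoning with the practice-grounded expertise encoded in local languages.
Link: https://arxiv.org/abs/2601.08267
Authors: Fan Gao,Sherry T. Tong,Jiwoong Sohn,Jiahao Huang,Junfeng Jiang,Ding Xia,Piyalitt Ittichaiwong,Kanyakorn Veerakanjana,Hyunjae Kim,Qingyu Chen,Edison Marrese Taylor,Kazuma Kobayashi,Akkiko Aizawa,Irene Li
Affiliations: The University of Tokyo; ETH Zürich; National Institute of Informatics; Siriraj Informatics and Data Innovation Center; Yale University
Categories: Computation and Language (cs.CL)
Comments: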
Abstract:While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.
[NLP-53] User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale
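[Quick read]: This paper addresses the limits on data generation for multi-turn tool use with large reasoning models (LRMs): existing datasets and generation methods depend on static, predefined toolsets and cannot capture the complexity of open-ended human-agent collaboration. The key to the solution is a user-oriented simulation paradigm that decouples task generation from a dedicated user simulator. The simulator follows human behavioral rules such as incremental requests and turn-by-turn feedback, and thereby yields more realistic, longer multi-turn dialogue trajectories. The authors also build a plug-and-play generation pipeline that can start from any state and complete multiple tasks within a single trajectory, which increases data density and realism.
Link: https://arxiv.org/abs/2601.08225
Authors: Jungho Cho,Minbyul Jeong,Sungrae Park
Affiliations: Upstage AI
Categories: Computation and Language (cs.CL)
Comments: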
Abstract:The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in “solely task-solving” trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.
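[NLP-54] Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints
[Quick read]: This paper addresses the ambiguity in how Mixture-of-Experts (MoE) language model architectures are designed and what they should optimize, namely how to build the best-performing MoE under limited memory and compute. Conventional practice looks at total parameters (N_total, which sets the memory footprint) and active parameters (inference cost), but a systematic empirical study shows that performance is governed mainly by total parameters and expert sparsity (s := n_exp / n_topk), where n_exp is the total number of experts and n_topk is the number of experts selected in each layer. A key finding is that n_exp and n_topk do not cancel out in the sparsity ratio: a larger number of experts slightly hurts performance because, under a memory limit, it forces the core model's depth and width to shrink. The resulting design principle is to maximize N_total under the given constraints while minimizing s, that is, to make n_topk as large and n_exp as small as possible.
Link: https://arxiv.org/abs/2601.08215
Authors: Seng Pei Liew,Kenta Shinzato,Yuyang Dong
Affiliations: SB Intuitions
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures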
Abstract:Modern Mixture-of-Experts (MoE) language models are designed based on total parameters (memory footprint) and active parameters (inference cost). However, we find these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters ($N_{\text{total}}$) and expert sparsity ($s := n_{\text{exp}} / n_{\text{topk}}$). Moreover, $n_{\text{exp}}$ and $n_{\text{topk}}$ do not “cancel out” within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple principle for MoE design which maximizes $N_{\text{total}}$ while minimizing $s$ (maximizing $n_{\text{topk}}$) and $n_{\text{exp}}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.
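To make the three quantities concrete, here is a minimal sketch that computes total parameters, active parameters, and expert sparsity for a toy MoE. The parameter accounting (identical experts, one MoE block per layer, everything else lumped into `shared_params`) and the example numbers are simplifying assumptions for illustration, not the paper's configuration.

```python
def moe_stats(n_exp, n_topk, expert_params, shared_params, n_layers):
    """Return (N_total, N_active, s) for a toy MoE with identical experts.

    shared_params lumps together everything outside the experts
    (attention, embeddings, norms); this split is an assumption.
    """
    n_total = shared_params + n_layers * n_exp * expert_params
    n_active = shared_params + n_layers * n_topk * expert_params
    sparsity = n_exp / n_topk  # s := n_exp / n_topk
    return n_total, n_active, sparsity


# Hypothetical configuration: 24 layers, 16 experts of 50M parameters each,
# 2 experts routed per token, 1B parameters outside the experts.
n_total, n_active, s = moe_stats(
    n_exp=16, n_topk=2, expert_params=50_000_000,
    shared_params=1_000_000_000, n_layers=24,
)
print(n_total, n_active, s)
```

Two configurations such as (n_exp=16, n_topk=2) and (n_exp=64, n_topk=8) share the same sparsity s = 8; the abstract's point is that they are still not equivalent once a memory budget forces the larger-n_exp model to shrink its depth and width.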
[NLP-55] Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models
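[Quick read]: This paper addresses how to inject private, domain-specific knowledge efficiently when large language models (LLMs) are deployed in high-stakes domains such as biomedicine, materials science, and finance. Among existing approaches, fine-tuning is costly to iterate and prone to catastrophic forgetting, while retrieval-augmented generation (RAG) is unstable because of fragmented evidence, retrieval drift, and long-context pressure. The key to the solution is Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and aligns it with the frozen base model through a compact representation-level interface. This avoids serializing evidence into the prompt and enables plug-and-play specialization, scalable multi-domain composition, and reliable selective activation.
Link: https://arxiv.org/abs/2601.08209
Authors: Rongji Li,Jian Xu,Xueqing Chen,Yisheng Yang,Jiayi Wang,Xingyu Chen,Chunyu Xie,Dawei Leng,Xu-Yao Zhang
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; 360 AI Research
Categories: Computation and Language (cs.CL)
Comments: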
Abstract:In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.
[NLP-56] Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs NEURIPS2025
[Quick read]: This paper addresses two problems in self-play fine-tuning (SPIN): instability during iterative optimization as the current reward advantage gradually vanishes, and the training-generation discrepancy introduced by the reference policy. The key to the solution is Triplet-based Self-Play fIne-tuNing (T-SPIN). First, it adds historical advantages, meaning the advantages of currently generated responses over proto-synthetic responses produced by the initial policy, which remain informative when current advantages fade. Second, it adds an entropy constraint to the self-play framework, which theoretically supports reference-free fine-tuning and removes the mismatch between the training and generation objectives, giving more stable and effective optimization.
Link: https://arxiv.org/abs/2601.08198
Authors: Yibo Wang,Hai-Long Sun,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Lijun Zhang
Affiliations: Nanjing University; Alibaba International Digital Commerce; Pazhou Laboratory (Huangpu); Stanford University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NeurIPS 2025
Abstract:Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data, by iteratively generating synthetic responses from the model itself. However, SPIN is designed to optimize the current reward advantages of annotated responses over synthetic responses at hand, which may gradually vanish during iterations, leading to unstable optimization. Moreover, the utilization of reference policy induces a misalignment issue between the reward formulation for training and the metric for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond current advantages, T-SPIN additionally incorporates historical advantages between iteratively generated responses and proto-synthetic responses produced by the initial policy. Even if the current advantages diminish, historical advantages remain effective, stabilizing the overall optimization. Second, T-SPIN introduces the entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning, eliminating the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN, but also its stable evolution during iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% samples, highlighting its effectiveness when faced with scarce annotated data.
[NLP-57] Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis
[Quick read]: This paper addresses the lack of evaluation of implicit regulatory compliance for large language models (LLMs) in high-stakes domains: existing benchmarks do not test whether an LLM enforces mandatory safety constraints on its own. The key to the solution is LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic (LTL) oracles and uses logic-guided fuzzing to synthesize valid, safety-critical execution traces. The LogiSafetyBench benchmark built on it contains 240 human-verified tasks that require the LLM to generate Python programs satisfying both functional objectives and latent compliance rules, which allows systematic evaluation of compliant behavior.
Link: https://arxiv.org/abs/2601.08196
Authors: Da Song,Yuheng Huang,Boqi Chen,Tianshuo Cong,Randy Goebel,Lei Ma,Foutse Khomh
Affiliations: University of Alberta; Alberta Machine Intelligence Institute; Huawei Technologies Co., Ltd.; Tsinghua University; Concordia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Comments: 11 pages, 3 figures
Abstract:The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
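To illustrate what a temporal-logic oracle over a tool-call trace can look like, the sketch below checks one simple precedence property on a finite trace: a guarded action must never occur before a required action has occurred, which corresponds to the LTL formula (¬guarded) W required. The rule, the event names, and the function `precedes` are hypothetical examples; the paper's actual oracles and its fuzzing procedure are not reproduced here.

```python
def precedes(trace, required, guarded):
    """Check (not guarded) W required on a finite trace of event names.

    Returns False as soon as `guarded` appears before any `required`.
    """
    seen_required = False
    for event in trace:
        if event == required:
            seen_required = True
        elif event == guarded and not seen_required:
            return False
    return True


# Hypothetical compliance rule: identity must be verified before a transfer.
compliant = ["login", "verify_identity", "transfer"]
violating = ["login", "transfer", "verify_identity"]
print(precedes(compliant, "verify_identity", "transfer"))
print(precedes(violating, "verify_identity", "transfer"))
```

A program can complete the functional task (the transfer goes through) in both traces, which is why a separate oracle of this kind is needed to catch the non-compliant ordering.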
[NLP-58] Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering
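[Quick read]: This paper addresses automatic evaluation of the clarity of large language model (LLM) responses, especially in political question answering, and how better prompt design can make that evaluation more accurate. The key to the solution is a systematic comparison of prompting strategies for GPT-5.2 on the CLARITY dataset. Chain-of-thought prompting, particularly when combined with few-shot examples, raises clarity prediction accuracy from 56% to 63% and topic identification accuracy from 60% to 74%. Structured reasoning prompts are thus reliable for high-level clarity evaluation, while fine-grained evasion detection remains difficult.
Link: https://arxiv.org/abs/2601.08176
Authors: Lavanya Prahallad,Sai Utkarsh Choudarypally,Pragna Prahallad,Pranathi Prahallad
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 tables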
Abstract:Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought with few-shot prompting. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though improvements are less stable across fine-grained evasion categories. We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent relative to human annotations. Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts.
[NLP-59] Relational Knowledge Distillation Using Fine-tuned Function Vectors
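[Quick read]: This paper addresses how relational knowledge is encoded in large language models (LLMs) and how it can be steered, in particular how to extract and strengthen the model's grasp of semantic relations to improve analogical reasoning and similar tasks. The key to the solution is to take the function vectors identified by causal mediation analysis and fine-tune them on a small set of examples (about 20 word pairs), which gives better relation representations and stronger alignment with human similarity judgments of semantic relations. The paper further introduces the composite function vector, a weighted combination of fine-tuned function vectors, which is inserted into the model's forward pass via activation patching and markedly improves performance on analogy problems from cognitive science and SAT benchmarks. The approach offers a controllable way to manipulate relational knowledge in LLMs and improves both interpretability and reasoning.
Link: https://arxiv.org/abs/2601.08169
Authors: Andrea Kang,Yingnian Wu,Hongjing Lu
Affiliations: University of California, Los Angeles
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: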
Abstract:Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world. Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in in-context learning, captured in a compact representation known as the function vector. We show that fine-tuning function vectors with only a small set of examples (about 20 word pairs) yields better performance on relation-based word-completion tasks than using the original vectors derived from causal mediation analysis. These improvements hold for both small and large language models. Moreover, the fine-tuned function vectors yield improved decoding performance for relation words and show stronger alignment with human similarity judgments of semantic relations. Next, we introduce the composite function vector - a weighted combination of fine-tuned function vectors - to extract relational knowledge and support analogical reasoning. At inference time, inserting this composite vector into LLM activations markedly enhances performance on challenging analogy problems drawn from cognitive science and SAT benchmarks. Our results highlight the potential of activation patching as a controllable mechanism for encoding and manipulating relational knowledge, advancing both the interpretability and reasoning capabilities of large language models.
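[NLP-60] SwiftMem: Fast Agentic Memory via Query-aware Indexing
[Quick read]: This paper addresses the severe latency bottleneck that agentic memory systems hit as they scale: existing frameworks search all stored content for every query and cannot adapt retrieval to query characteristics. The key to the solution is SwiftMem, a query-aware agentic memory system that achieves sub-linear retrieval through two mechanisms: a temporal index that supports logarithmic-time range queries for time-sensitive information, and a semantic DAG-Tag index that maps queries to relevant topics through a hierarchical tag structure. To counter fragmentation as storage grows, an embedding-tag co-consolidation mechanism reorganizes storage by semantic cluster to improve cache locality. On the LoCoMo and LongMemEval benchmarks, SwiftMem searches 47 times faster than state-of-the-art baselines while keeping accuracy competitive, which makes memory-augmented LLM agents more practical to deploy.
Link: https://arxiv.org/abs/2601.08160
Authors: Anxin Tian,Yiming Li,Xing Li,Hui-Ling Zhen,Lei Chen,Xianzhi Yu,Zhenhua Dong,Mingxuan Yuan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: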
Abstract:Agentic memory systems have become critical for enabling LLM agents to maintain long-term context and retrieve relevant information efficiently. However, existing memory frameworks suffer from a fundamental limitation: they perform exhaustive retrieval across the entire storage layer regardless of query characteristics. This brute-force approach creates severe latency bottlenecks as memory grows, hindering real-time agent interactions. We propose SwiftMem, a query-aware agentic memory system that achieves sub-linear retrieval through specialized indexing over temporal and semantic dimensions. Our temporal index enables logarithmic-time range queries for time-sensitive retrieval, while the semantic DAG-Tag index maps queries to relevant topics through hierarchical tag structures. To address memory fragmentation during growth, we introduce an embedding-tag co-consolidation mechanism that reorganizes storage based on semantic clusters to improve cache locality. Experiments on LoCoMo and LongMemEval benchmarks demonstrate that SwiftMem achieves 47$\times$ faster search compared to state-of-the-art baselines while maintaining competitive accuracy, enabling practical deployment of memory-augmented LLM agents.
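The idea of a temporal index with logarithmic-time range queries can be sketched with a sorted timestamp list and binary search. This is a generic illustration built on Python's standard `bisect` module, not SwiftMem's actual data structure, which the abstract does not specify; the class name and the toy memory entries are hypothetical.

```python
import bisect


class TemporalIndex:
    """Keeps memories sorted by timestamp; range bounds are found by
    binary search, so locating a time window costs O(log n)."""

    def __init__(self):
        self._times = []
        self._items = []

    def add(self, timestamp, item):
        # Insert while keeping both lists sorted by timestamp.
        pos = bisect.bisect_right(self._times, timestamp)
        self._times.insert(pos, timestamp)
        self._items.insert(pos, item)

    def range(self, start, end):
        """Return items with start <= timestamp <= end, oldest first."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return self._items[lo:hi]


index = TemporalIndex()
index.add(30, "booked flight")
index.add(10, "asked about visas")
index.add(20, "chose hotel")
print(index.range(15, 30))
```

Only the window lookup is logarithmic in this sketch; list insertion is still linear, so a production index would more likely use a balanced tree or a log-structured layout.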
[NLP-61] WISE-Flow: Workflow-Induced Structured Experience for Self-Evolving Conversational Service Agents
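[Quick read]: This paper addresses the frequent errors, repeated failure patterns, and large run-to-run variability of large language model (LLM) agents in user-facing service environments. Existing fixes depend on environment-specific training or manual patching, which is costly and hard to scale. The key to the solution is WISE-Flow, a framework that extracts reusable procedural experience from historical service interactions by inducing workflows made of prerequisite-augmented action blocks. At deployment it retrieves matching workflows and performs prerequisite-aware feasibility reasoning to choose state-grounded next actions, which improves the agent's stability and generalization.
Link: https://arxiv.org/abs/2601.08158
Authors: Yuqing Zhou,Zhuoer Wang,Jie Yuan,Hong Wang,Samson Koelle,Ziwei Zhu,Wei Niu
Affiliations: George Mason University; Amazon
Categories: Computation and Language (cs.CL)
Comments: 19 pages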
Abstract:Large language model (LLM)-based agents are widely deployed in user-facing services but remain error-prone in new tasks, tend to repeat the same failure patterns, and show substantial run-to-run variability. Fixing failures via environment-specific training or manual patching is costly and hard to scale. To enable self-evolving agents in user-facing service environments, we propose WISE-Flow, a workflow-centric framework that converts historical service interactions into reusable procedural experience by inducing workflows with prerequisite-augmented action blocks. At deployment, WISE-Flow aligns the agent’s execution trajectory to retrieved workflows and performs prerequisite-aware feasibility reasoning to achieve state-grounded next actions. Experiments on ToolSandbox and τ²-bench show consistent improvement across base models.
[NLP-62] Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning
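[Quick Read]: This paper addresses three challenges in adapting large language models (LLMs) to low-resource languages: scarce labeled data, unstable full-model fine-tuning, and catastrophic forgetting caused by continued cross-lingual training. The key to the solution is Circuit-Targeted Supervised Fine-Tuning (CT-SFT), a counterfactual-free method. It first uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then updates only those heads (plus LayerNorm) through head-level gradient masking, giving efficient and stable cross-lingual transfer. On NusaX-Senti and XNLI the method improves cross-lingual accuracy over continued full fine-tuning and substantially reduces catastrophic forgetting. It also reveals an editing-preserving trade-off: harder transfers favor editing circuit heads, while easier transfers favor near-zero updates that preserve the source-language mechanism.
Link: https://arxiv.org/abs/2601.08146
Authors: Khumaisa Nur’aini,Ayu Purwarianti,Alham Fikri Aji,Derry Wijaya
Affiliations: Monash University Indonesia; Institut Teknologi Bandung; MBZUAI; Boston University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: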
Abstract:Adapting LLMs to low-resource languages is difficult: labeled data is scarce, full-model fine-tuning is unstable, and continued cross-lingual tuning can cause catastrophic forgetting. We propose Circuit-Targeted Supervised Fine-Tuning (CT-SFT): a counterfactual-free adaptation of CD-T (Contextual Decomposition Transformer) that uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then transfer learns to a target language by updating only those heads (plus LayerNorm) via head-level gradient masking. Across NusaX-Senti and XNLI, CT-SFT improves cross-lingual accuracy over continued full fine-tuning while updating only a small subset of model parameters. We find an editing-preserving trade-off: harder transfers favor editing circuit heads, while easier transfers often favor near-zero (i.e., low-relevance heads) updates, preserving the source mechanism. CT-SFT also substantially reduces catastrophic forgetting, preserving proxy/source-language competence during transfer.
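Head-level gradient masking, as described in the abstract, amounts to zeroing the gradients of every attention head outside the selected set before the optimizer step. The sketch below shows only that masking step on toy numbers. The function names, the flat per-head parameter layout, and the plain SGD update are assumptions made for illustration; how CT-SFT actually scores and selects heads is not reproduced here.

```python
def mask_head_gradients(grads_per_head, selected_heads):
    """Keep gradients for selected heads; zero them for all other heads."""
    return [
        list(g) if h in selected_heads else [0.0] * len(g)
        for h, g in enumerate(grads_per_head)
    ]


def sgd_step(weights_per_head, grads_per_head, lr):
    """Plain SGD update applied head by head (an assumed optimizer)."""
    return [
        [w - lr * g for w, g in zip(ws, gs)]
        for ws, gs in zip(weights_per_head, grads_per_head)
    ]


# Toy setup: three heads with two parameters each; only head 1 is "task-relevant".
weights = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
grads = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
masked = mask_head_gradients(grads, selected_heads={1})
updated = sgd_step(weights, masked, lr=0.1)
```

After the step, heads 0 and 2 keep their original weights, which is the property that would let the source-language mechanism survive the update.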
[NLP-63] Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training
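[Quick Read]: This paper addresses the severe under-resourcing of Urdu in modern natural language processing (NLP) systems. Existing multilingual models perform poorly on Urdu-specific tasks, struggling with the language's complex morphology, its right-to-left Nastaliq script, and its rich literary traditions. The key to the solution is a two-stage approach. First, LLaMA 3.1 8B undergoes continued pre-training on a 1.97-billion-token dataset that comprises 1.84 billion tokens of diverse Urdu text (news archives, classical and contemporary literature, government documents, and social media) and 140 million tokens of English Wikipedia data, the latter included to prevent catastrophic forgetting. Second, the resulting model is fine-tuned with supervision on the Alif Urdu-instruct dataset. Across seven Urdu tasks the model reaches a weighted average score of 90.34, ahead of the previous best model Alif-1.0-Instruct (87.1) and the base LLaMA-3.1 8B-Instruct model (45.7), which supports combining continued pre-training on high-quality data with targeted instruction fine-tuning.
Link: https://arxiv.org/abs/2601.08141
Authors: Muhammad Taimoor Hassan,Jawad Ahmed,Muhammad Awais
Affiliations: Auburn University; BHT Berlin; BTU Cottbus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: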
Abstract:Despite remarkable progress in large language models, Urdu, a language spoken by over 230 million people, remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language’s complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text (spanning news archives, classical and contemporary literature, government documents, and social media), combined with 140 million tokens of English Wikipedia data to prevent catastrophic forgetting. We then fine-tune the resulting model on the Alif Urdu-instruct dataset. Through extensive evaluation on Urdu-specific benchmarks, Qalb demonstrates substantial improvements, achieving a weighted average score of 90.34 and outperforming the previous state-of-the-art Alif-1.0-Instruct model (87.1) by 3.24 points, while also surpassing the base LLaMA-3.1 8B-Instruct model by 44.64 points. Qalb achieves state-of-the-art performance with comprehensive evaluation across seven diverse tasks including Classification, Sentiment Analysis, and Reasoning. Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.
[NLP-64] How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains EACL2026
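[Quick Read]: This paper addresses the unreliability of Large Reasoning Models (LRMs) in high-stakes domains that results from inaccurate confidence estimation, in particular the difficulty of estimating confidence for their long-form, multi-step outputs. The key to the solution is the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces generated by six LRMs from different architectural families. It covers high-stakes domains (clinical, financial, legal, and mathematical reasoning) as well as complex general reasoning tasks, with correctness annotations for all samples. Using this benchmark, the authors systematically evaluate more than ten representation-based methods and find a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), structurally-aware models achieve the best ECE (0.148), and no single method dominates both. More complex architectures also do not reliably outperform simple sequential baselines, which suggests a performance ceiling for methods that rely solely on chunk-level hidden states.
Link: https://arxiv.org/abs/2601.08134
Authors: Reza Khanmohammadi,Erfan Miahi,Simerjot Kaur,Ivan Brugere,Charese H. Smiley,Kundan Thind,Mohammad M. Ghassemi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026) main conference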
Abstract:The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
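The two metrics in the reported trade-off measure different things: AUROC asks whether correct answers receive higher confidence than incorrect ones, while ECE asks whether stated confidence matches observed accuracy. The sketch below computes both on invented toy data. The equal-width binning and the bin count of 10 are common conventions assumed here, not settings confirmed by the abstract.

```python
def auroc(confidences, labels):
    """Chance that a correct sample outranks an incorrect one (ties count as 0.5)."""
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(confidences, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    error = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        error += len(b) / total * abs(avg_conf - accuracy)
    return error


# Toy data: label 1 means the reasoning trace was correct.
conf = [0.95, 0.85, 0.75, 0.65]
correct = [1, 1, 0, 1]
discrimination = auroc(conf, correct)
calibration = ece(conf, correct)
```

On this toy input the estimator ranks two of the three correct samples above the single incorrect one, so AUROC is 2/3, while ECE is the weighted mean gap between confidence and accuracy across the four occupied bins.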
[NLP-65] Attention Projection Mixing and Exogenous Anchors
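[Quick Read]: This paper addresses a fundamental tension in Transformers that reuse early-layer attention projections as residuals: the first layer must serve both as a stable reference for all deeper layers and as an effective computational block, which limits how far the model can be optimized. The key to the solution is the ExoFormer architecture, which learns dedicated exogenous anchor projections outside the sequential layer stack and so decouples the anchor role from computational refinement. Within a unified normalized mixing framework, the method studies different coefficient granularities (elementwise, headwise, scalar) across all attention pathways (queries, keys, values, and gate logits). The dynamic variant improves downstream accuracy by 2.13 points over the baseline, matches baseline validation loss with 1.84x fewer tokens, and achieves a 2x reduction in attention sink relative to standard Gated Attention.
Link: https://arxiv.org/abs/2601.08131
Authors: Jonathan Su
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: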
Abstract:Transformers that reuse early-layer attention projections as residuals face a fundamental tension: the first layer must simultaneously serve as a stable reference for all deeper layers and as an effective computational block. To resolve this, we propose ExoFormer, which learns dedicated exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. Through a unified normalized mixing framework (studying different coefficient granularities: elementwise, headwise, scalar) across all attention pathways (queries, keys, values, and gate logits), ExoFormer variants consistently outperform their internal-anchor counterparts. Moreover, the dynamic variant achieves a 2.13-point increase in downstream accuracy over the baseline and demonstrates superior data efficiency, matching baseline validation loss with 1.84x fewer tokens. ExoFormer also achieves a 2x reduction in attention sink compared to standard Gated Attention. Paradoxically, all ExoFormer variants exhibit signs of representation collapse. We explain this via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in computational refinement. We release codes and models to facilitate future research.
[NLP-66] Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought EACL2026
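[Quick Read]: This paper addresses two core problems in current prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT): excessive token usage, which makes prompting computationally expensive, and limited generalization across reasoning tasks, which makes existing strategies hard to apply to diverse downstream tasks. The key to the solution is Adaptive Causal Prompting with Sketch-of-Thought (ACPS), a framework that uses structural causal models (SCMs) to infer the causal effect of a query on its answer and adaptively selects an appropriate intervention (standard front-door or conditional front-door adjustment). By replacing verbose CoT with a concise Sketch-of-Thought, ACPS performs efficient, generalizable causal reasoning without task-specific retraining, substantially reducing token usage and inference cost while improving accuracy and robustness.
Link: https://arxiv.org/abs/2601.08108
Authors: Bowen Li,Ziqi Xu,Jing Ren,Renqiang Luo,Xikun Zhang,Xiuzhen Zhang,Yongli Ren,Feng Xia
Affiliations: RMIT University, Australia; Jilin University, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by Findings of EACL 2026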
Abstract:Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency.
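For reference, the standard front-door adjustment named in the abstract is usually written as below. Reading x as the query, m as an intermediate reasoning trace (such as a Sketch-of-Thought), and y as the answer is an assumption made here for illustration; the paper's exact causal graph and its conditional front-door variant are not given in this listing.

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The inner sum averages over alternative queries x' so that any confounding between query and answer is blocked, while the outer sum weights each mediator m by how likely the actual query is to produce it.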
[NLP-67] Query Suggestion for Retrieval-Augmented Generation via Dynamic In-Context Learning
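[Quick Read]: This paper addresses agentic retrieval-augmented generation (agentic RAG) systems with tool-calling agents, where user questions that exceed the grounding knowledge lead to hallucination or cannot be answered, and proposes suggesting answerable queries to improve the interaction. The core challenge is ensuring that a suggested query is both semantically close to the original question and answerable under the current RAG workflow. This requires an understanding of the multi-step RAG workflow as a whole, which the executing LLM lacks. The key to the solution is robust dynamic few-shot learning, which retrieves examples from relevant workflows to build the prompt dynamically and so guides the model toward suggestions that are both relevant and answerable. The mechanism can be self-learned (for example, on prior user queries), is easy to apply in practice, and outperforms few-shot and retrieval-only baselines on real-world datasets.
Link: https://arxiv.org/abs/2601.08105
Authors: Fabian Spaeh,Tianyi Chen,Chen-Hao Chiang,Bin Shen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: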
Abstract:Retrieval-augmented generation with tool-calling agents (agentic RAG) has become increasingly powerful in understanding, processing, and responding to user queries. However, the scope of the grounding knowledge is limited and asking questions that exceed this scope may lead to issues like hallucination. While guardrail frameworks aim to block out-of-scope questions (Rodriguez et al., 2024), no research has investigated the question of suggesting answerable queries in order to complete the user interaction. In this paper, we initiate the study of query suggestion for agentic RAG. We consider the setting where user questions are not answerable, and the suggested queries should be similar to aid the user interaction. Such scenarios are frequent for tool-calling LLMs as communicating the restrictions of the tools or the underlying datasets to the user is difficult, and adding query suggestions enhances the interaction with the RAG agent. As opposed to traditional settings for query recommendations such as in search engines, ensuring that the suggested queries are answerable is a major challenge due to the RAG’s multi-step workflow that demands a nuanced understanding of the RAG as a whole, which the executing LLM lacks. As such, we introduce robust dynamic few-shot learning which retrieves examples from relevant workflows. We show that our system can be self-learned, for instance on prior user queries, and is therefore easily applicable in practice. We evaluate our approach on three benchmark datasets based on two unlabeled question datasets collected from real-world user queries. Experiments on real-world datasets confirm that our method produces more relevant and answerable suggestions, outperforming few-shot and retrieval-only baselines, and thus enables safer, more effective user interaction with agentic RAG.
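Dynamic few-shot prompting can be sketched as retrieving the stored examples most similar to the incoming question and placing them in the prompt. Everything in the block below is a hypothetical illustration: the token-overlap similarity, the `build_prompt` helper, the example pool, and the prompt wording are assumptions, and the paper's retrieval over workflows and its robustness mechanism are not reproduced.

```python
def token_overlap(a, b):
    """Jaccard similarity over lowercase whitespace tokens (a stand-in retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_prompt(query, example_pool, k=2):
    """Select the k most similar stored examples and format them as few-shot shots."""
    ranked = sorted(
        example_pool,
        key=lambda ex: token_overlap(query, ex["question"]),
        reverse=True,
    )
    lines = ["Suggest an answerable query similar to the user's question.", ""]
    for ex in ranked[:k]:
        lines.append(f"Question: {ex['question']}")
        lines.append(f"Suggestion: {ex['suggestion']}")
        lines.append("")
    lines.append(f"Question: {query}")
    lines.append("Suggestion:")
    return "\n".join(lines)


# Invented pool of past unanswerable questions paired with answerable rewrites.
pool = [
    {"question": "What was revenue in 2019?", "suggestion": "What was revenue in 2023?"},
    {"question": "Who is the CEO of the supplier?", "suggestion": "Who are the listed suppliers?"},
    {"question": "What is the weather today?", "suggestion": "What regions does the dataset cover?"},
]
prompt = build_prompt("What was revenue in 2018?", pool, k=2)
```

Because the shots are chosen per query, the prompt changes with each question, which is what distinguishes this from a fixed few-shot template.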
[NLP-68] AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling
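[Quick Read]: This paper addresses two key problems caused by the static pooling strategies that mainstream reward modeling architectures use to condense a sequence into a scalar score: a static inductive bias that is misaligned with task-dependent preference signals, and a representational mismatch, because the backbone is optimized for generation rather than fine-grained discrimination. The key to the solution is the AdaJudge framework, which adapts representation and aggregation jointly through two mechanisms. Gated refinement blocks first map the backbone representations into a space better suited to discrimination. An adaptive multi-view pooling module then replaces the static readout, dynamically routing and combining evidence to better capture human preferences.
Link: https://arxiv.org/abs/2601.08097
Authors: Yongliang Miao,Yangyang Liang,Mengnan Du
Affiliations: Emory University; Guangdong University of Technology; The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: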
Abstract:Reward modeling is essential for aligning large language models with human preferences, yet predominant architectures rely on a static pooling strategy to condense sequences into scalar scores. This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with task-dependent preference signals, and a representational mismatch, as the backbone is optimized for generation rather than fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. AdaJudge first refines backbone representations into a discrimination-oriented space via gated refinement blocks. It then replaces the static readout with an adaptive multi-view pooling module that dynamically routes and combines evidence. Extensive experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong off-the-shelf reward models and traditional pooling baselines.
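[NLP-69] MemoBrain: Executive Memory as an Agentic Brain for Reasoning
[Quick Read]: This paper studies tool-augmented agent frameworks, where complex reasoning is long-horizon and therefore accumulates reasoning traces and transient tool artifacts. The accumulation strains the bounded working context of large language models (LLMs), disrupting logical continuity and weakening task alignment. To address this, the authors propose MemoBrain, an executive memory model for tool-augmented agents. Its key idea is a dependency-aware memory that explicitly captures salient intermediate states of the reasoning steps and their logical relations, and that manages the working context dynamically without blocking execution: it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone. Under a fixed context budget this gives explicit cognitive control over the reasoning trajectory rather than passive accumulation.
Link: https://arxiv.org/abs/2601.08079
Authors: Hongjin Qian,Zhao Cao,Zheng Liu
Affiliations: Beijing Academy of Artificial Intelligence; Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Our codes are in this https URL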
Abstract:Complex reasoning in tool-augmented agent frameworks is inherently long-horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal-directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool-augmented agents that constructs a dependency-aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co-pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, demonstrating consistent improvements over strong baselines.
[NLP-70] Exploiting DINOv3-Based Self-Supervised Features for Robust Few-Shot Medical Image Segmentation
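[Quick Read]: This paper addresses few-shot medical image segmentation, that is, how to improve the segmentation performance of deep learning models when annotated training data is scarce. The key to the solution is the DINO-AugSeg framework, which takes features extracted by the pre-trained DINOv3 model and strengthens them with two core modules. WT-Aug is a wavelet-based module that perturbs frequency components in the feature domain to enrich feature diversity. CG-Fuse uses cross-attention to fuse semantically rich low-resolution features with spatially detailed high-resolution features, giving a more robust representation. Experiments on six public benchmarks spanning five imaging modalities (MRI, CT, ultrasound, endoscopy, and dermoscopy) show that the method consistently outperforms existing methods under limited-sample conditions.
Link: https://arxiv.org/abs/2601.08078
Authors: Guoping Xu,Jayaram K. Udupa,Weiguo Lu,You Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments: 36 pages, 11 figures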
Abstract:Deep learning-based automatic medical image segmentation plays a critical role in clinical diagnosis and treatment planning but remains challenging in few-shot scenarios due to the scarcity of annotated training data. Recently, self-supervised foundation models such as DINOv3, which were trained on large natural image datasets, have shown strong potential for dense feature extraction that can help with the few-shot learning challenge. Yet, their direct application to medical images is hindered by domain differences. In this work, we propose DINO-AugSeg, a novel framework that leverages DINOv3 features to address the few-shot medical image segmentation challenge. Specifically, we introduce WT-Aug, a wavelet-based feature-level augmentation module that enriches the diversity of DINOv3-extracted features by perturbing frequency components, and CG-Fuse, a contextual information-guided fusion module that exploits cross-attention to integrate semantic-rich low-resolution features with spatially detailed high-resolution features. Extensive experiments on six public benchmarks spanning five imaging modalities, including MRI, CT, ultrasound, endoscopy, and dermoscopy, demonstrate that DINO-AugSeg consistently outperforms existing methods under limited-sample conditions. The results highlight the effectiveness of incorporating wavelet-domain augmentation and contextual fusion for robust feature representation, suggesting DINO-AugSeg as a promising direction for advancing few-shot medical image segmentation. Code and data will be made available on this https URL.
[NLP-71] Semantic Gravity Wells: Why Negative Constraints Backfire
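[Quick Read]: This paper addresses the frequent failure of large language models to follow negative constraints (instructions of the form “do not use word X”), a failure whose mechanism had not been systematically understood. The key to the solution is semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token; violation probability follows a tight logistic relationship with it ($p=\sigma(-2.40+2.27\cdot P_0)$). Layer-wise analysis with the logit lens shows that negative instructions do produce a suppression signal, but that it is markedly weaker in failures (the instruction lowers target probability by only 5.2 percentage points, versus 22.8 in successes). The analysis identifies two mechanistically distinct failure modes: priming failure (87.5% of violations), where mentioning the forbidden word in the instruction activates the target representation, and override failure (12.5%), where late-layer feed-forward networks contribute +0.39 toward the target probability, nearly 4 times more than in successes, overwhelming earlier suppression. Activation patching confirms that layers 23–27 are causally responsible: replacing their activations flips the sign of the constraint effect. The study reveals a fundamental tension in negative constraint design: naming a forbidden word itself primes the model to produce it.
Link: https://arxiv.org/abs/2601.08070
Authors: Shailesh Rana
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 8 figures. Code: this https URL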
Abstract:Negative constraints (instructions of the form “do not use word X”) represent a fundamental test of instruction-following capability in large language models. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model’s intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure ($p=\sigma(-2.40+2.27\cdot P_0)$; n=40,000 samples; bootstrap 95% CI for slope: [2.21, 2.33]). Through layer-wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but systematically weaker in failures: the instruction reduces target probability by only 5.2 percentage points in failures versus 22.8 points in successes – a 4.4× asymmetry. We trace this asymmetry to two mechanistically distinct failure modes. In priming failure (87.5% of violations), the instruction’s explicit mention of the forbidden word paradoxically activates rather than suppresses the target representation. In override failure (12.5%), late-layer feed-forward networks generate contributions of +0.39 toward the target probability – nearly 4× larger than in successes – overwhelming earlier suppression signals. Activation patching confirms that layers 23–27 are causally responsible: replacing these layers’ activations flips the sign of constraint effects. These findings reveal a fundamental tension in negative constraint design: the very act of naming a forbidden word primes the model to produce it.
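The fitted logistic relationship can be evaluated directly to see what it predicts at the extremes. The sketch below plugs the reported coefficients into the sigmoid. Treating the semantic pressure P_0 as a probability in [0, 1] follows the abstract's description of it as an intrinsic generation probability, but it is still an assumption about the paper's exact scale, and `violation_probability` is a hypothetical helper name.

```python
import math


def violation_probability(pressure, intercept=-2.40, slope=2.27):
    """Logistic fit quoted in the abstract: p = sigma(intercept + slope * P_0)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * pressure)))


low = violation_probability(0.0)   # no intrinsic pull toward the forbidden word
high = violation_probability(1.0)  # forbidden word is the near-certain continuation
```

Under that assumption the fit predicts a violation rate of roughly 0.08 at zero pressure and roughly 0.47 at maximal pressure, so even the strongest semantic pull leaves the instruction followed about half the time.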
[NLP-72] Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations
[Quick Read]: This paper addresses the limitations of how confidence estimation (CE) methods for large language models (LLMs) are currently evaluated. Existing evaluations rely mainly on calibration and discrimination and overlook the stability and sensitivity that CE needs under natural language variation: confidence should stay consistent when the prompt or answer is semantically equivalent, and should change when the meaning of the answer changes. The authors therefore propose a comprehensive evaluation framework with three new dimensions: robustness against prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. The framework exposes shortcomings of mainstream CE methods in practical use and offers empirical evidence and practical guidance for designing more reliable CE mechanisms.
Link: https://arxiv.org/abs/2601.08064
Authors: Yuxi Xia,Dennis Ulmer,Terra Blevins,Yihong Liu,Hinrich Schütze,Benjamin Roth
Affiliations: Faculty of Computer Science, UniVie Doctoral School Computer Science; Faculty of Philological and Cultural Studies, University of Vienna, Austria; ILLC, University of Amsterdam, Netherlands; Khoury College of Computer Sciences, Northeastern University, USA; LMU Munich, Munich Center for Machine Learning (MCML), Germany
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Confidence estimation (CE) indicates how reliable the answers of large language models (LLMs) are, and can impact user trust and decision-making. Existing work evaluates CE methods almost exclusively through calibration, examining whether stated confidence aligns with accuracy, or discrimination, whether confidence is ranked higher for correct predictions than incorrect ones. However, these facets ignore pitfalls of CE in the context of LLMs and language variation: confidence estimates should remain consistent under semantically equivalent prompt or answer variations, and should change when the answer meaning differs. Therefore, we present a comprehensive evaluation framework for CE that measures confidence quality on three new aspects: robustness of confidence against prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. In our work, we demonstrate that common CE methods for LLMs often fail on these metrics: methods that achieve good performance on calibration or discrimination are not robust to prompt variations or are not sensitive to answer changes. Overall, our framework reveals limitations of existing CE evaluations relevant for real-world LLM use cases and provides practical guidance for selecting and designing more reliable CE methods.
[NLP-73] Universal computation is intrinsic to language model decoding
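[Quick Read]: This paper addresses the question of where the computational limits of language models lie: whether they are capable of universal computation, and how that capability can be accessed through a natural language interface. The key result is a proof that chaining a language model's autoregressive output is sufficient to simulate the execution of any algorithm on any input, and hence to perform universal computation. The study further finds that even randomly initialized language models are already capable of universal computation. This implies that training does not create computational expressiveness but instead improves programmability, making natural language prompts an effective way to access the model's intrinsic capabilities.
Link: https://arxiv.org/abs/2601.08061
Authors: Alex Lewandowski,Marlos C. Machado,Dale Schuurmans
Affiliations: University of Alberta; Amii; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL)
Comments: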
Abstract:Language models now provide an interface to express and often solve general problems in natural language, yet their ultimate computational capabilities remain a major topic of scientific debate. Unlike a formal computer, a language model is trained to autoregressively predict successive elements in human-generated text. We prove that chaining a language model’s autoregressive output is sufficient to perform universal computation. That is, a language model can simulate the execution of any algorithm on any input. The challenge of eliciting desired computational behaviour can thus be reframed in terms of programmability: the ease of finding a suitable prompt. Strikingly, we demonstrate that even randomly initialized language models are capable of universal computation before training. This implies that training does not give rise to computational expressiveness – rather, it improves programmability, enabling a natural language interface for accessing these intrinsic capabilities.
[NLP-74] Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
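[Quick Read]: This paper addresses the following question: Chain-of-Thought (CoT) prompting improves the reasoning performance of large language models (LLMs), but why it works remains unclear, as does whether it is the only mechanism for triggering reasoning. The key to the approach is to analyze and intervene directly on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with reasoning behavior. The study finds that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting. This internal state is triggered early in generation and can override prompt-level instructions that discourage reasoning, which suggests that multi-step reasoning is supported by internal representations that can be activated externally, and that CoT prompting is one effective but not necessary way of activating them.
Link: https://arxiv.org/abs/2601.08058
Authors: Zhenghao He,Guangzhi Xiong,Bohan Liu,Sanchit Sinha,Aidong Zhang
Affiliations: University of Virginia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: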
Abstract:Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models. In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior. Across multiple model families and reasoning benchmarks, we find that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs. We further observe that this reasoning-oriented internal state is triggered early in generation and can override prompt-level instructions that discourage explicit reasoning. Overall, our results suggest that multi-step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause.
[NLP-75] FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
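[Quick Read]: This paper addresses the difficulty of panel-level understanding of scientific compound figures: captions are often missing or provide only figure-level summaries, without fine-grained annotation or description of each panel, which makes precise panel localization and semantic parsing hard. The key to the solution is FigEx2, a visual-conditioned framework. It introduces a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space, and adopts a staged optimization strategy that combines supervised learning with reinforcement learning (RL), using CLIP-based alignment and BERTScore-based semantic rewards to enforce multimodal consistency. Together these enable accurate panel localization and panel-wise caption generation.
Link: https://arxiv.org/abs/2601.08026
Authors: Jifeng Song,Arun Das,Pan Wang,Hui Ji,Kun Zhao,Yufei Huang
Affiliations: University of Pittsburgh; UPMC Hillman Cancer Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: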
Abstract:Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
[NLP-76] LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback
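[Quick Read]: This paper addresses the weak performance of Large Language Models (LLMs) on creative generation tasks, and in particular the way multi-agent frameworks, while improving reasoning through interaction, can homogenize content and suppress creativity. The key to the solution is a peer-review-inspired Blind Peer Review mechanism: agents exchange targeted feedback while revising independently, which preserves divergent creative trajectories and avoids the stylistic convergence that frequent interaction causes. This structure balances collaboration with diversity and improves the novelty and quality of the generated content.
Link: https://arxiv.org/abs/2601.08003
Authors: Weiyue Li,Mingxiao Song,Zhenda Shen,Dachuan Zhao,Yunfan Long,Yi Li,Yongce Li,Ruyi Yang,Mengyu Wang
Affiliations: Harvard University; Carnegie Mellon University; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: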
Abstract:Large Language Models (LLMs) often struggle with creative generation, and multi-agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. We introduce LLM Review, a peer-review-inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories. To enable rigorous evaluation, we propose SciFi-100, a science fiction writing dataset with a unified framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Experiments demonstrate that LLM Review consistently outperforms multi-agent baselines, and smaller models with our framework can surpass larger single-agent models, suggesting interaction structure may substitute for model scale.
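[NLP-77] Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors
[Quick read]: This paper addresses the need in the humanities for sentiment analysis that yields context-sensitive, continuous scores, focusing on how well existing methods transfer across domains, languages, and historical periods and on whether their assumptions hold. The key to its solution is Concept Vector Projections (CVP), which model sentiment as a direction in embedding space to produce continuous, multilingual sentiment scores. Experiments show that the method transfers well between corpora with only a slight performance drop. The study further shows that the linearity assumption underlying CVP holds only approximately, yet already captures generalizable sentiment patterns, which points to directions for further improvement.
Link: https://arxiv.org/abs/2601.07995
Authors: Laurits Lyngbaek,Pascale Feldkamp,Yuri Bizzoni,Kristoffer L. Nielbo,Kenneth Enevoldsen
Affiliations: Aarhus University
Categories: Computation and Language (cs.CL)
Comments: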
Abstract:Use cases of sentiment analysis in the humanities often require contextualized, continuous scores. Concept Vector Projections (CVP) offer a recent solution: by modeling sentiment as a direction in embedding space, they produce continuous, multilingual scores that align closely with human judgments. Yet the method’s portability across domains and underlying assumptions remain underexplored. We evaluate CVP across genres, historical periods, languages, and affective dimensions, finding that concept vectors trained on one corpus transfer well to others with minimal performance loss. To understand the patterns of generalization, we further examine the linearity assumption underlying CVP. Our findings suggest that while CVP is a portable approach that effectively captures generalizable patterns, its linearity assumption is approximate, pointing to potential for further development.
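The idea of modeling sentiment as a direction in embedding space can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the toy 2-D embeddings and the helper names `concept_vector` and `project` are assumptions, and a real setup would use sentence embeddings from a multilingual encoder.

```python
import math

def concept_vector(pos, neg):
    """Unit vector pointing from the mean negative to the mean positive embedding."""
    dim = len(pos[0])
    mean_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(embedding, direction):
    """Continuous sentiment score: scalar projection onto the concept direction."""
    return sum(e * d for e, d in zip(embedding, direction))

# Toy 2-D embeddings (hypothetical values, for illustration only).
positive = [[1.0, 0.2], [0.8, 0.0]]
negative = [[-1.0, 0.1], [-0.8, -0.1]]
direction = concept_vector(positive, negative)
scores = [project(e, direction) for e in ([0.9, 0.1], [0.0, 0.0], [-0.9, 0.0])]
```

The projection is linear by construction, which is the assumption the paper examines: if sentiment does not lie along a single straight direction, a score of this form can only approximate it.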
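[NLP-78] DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs ACL2026
[Quick read]: This paper addresses the rise in response latency and the drop in answer quality that large language models (LLMs) show as the context grows in long dialogues. Existing methods usually rely on extra LLM calls to build memory, or build memory offline without taking the current user input into account, which introduces inefficiency or disrupts conversational continuity. The key to the solution is DyCP, a lightweight dynamic context management method that segments and retrieves relevant memory at query time, preserves the sequential structure of the dialogue without predefined topic boundaries, and supports efficient, adaptive context retrieval. It improves answer quality and reduces response latency on several long-form dialogue benchmarks.
Link: https://arxiv.org/abs/2601.07994
Authors: Nayoung Choi,Jonathan Zhang,Jinho D. Choi
Affiliations: Emory University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted (B) to TACL 2026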
Abstract:Large Language Models (LLMs) often exhibit increased response latency and degraded answer quality as dialogue length grows, making effective context management essential. However, existing methods rely on extra LLM calls to build memory or perform offline memory construction without considering the current user utterance, which can introduce inefficiencies or disrupt conversational continuity. We introduce DyCP, a lightweight context management method that dynamically segments and retrieves relevant memory at query time. It preserves the sequential structure of dialogue without predefined topic boundaries and supports efficient, adaptive context retrieval. Across three long-form dialogue benchmarks, LoCoMo, MT-Bench+, and SCM4LLMs, and multiple LLMs, DyCP consistently improves answer quality while reducing response latency. We also examine the gap between modern LLMs’ expanded context windows and their actual long-context processing capacity, highlighting the continued importance of effective context management.
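[NLP-79] From Word Sequences to Behavioral Sequences: Adapting Modeling and Evaluation Paradigms for Longitudinal NLP
[Quick read]: This paper addresses the modeling bias that arises in longitudinal studies when natural language processing (NLP) ignores the nested structure and temporal order of documents. Traditional NLP treats documents as independent, unordered samples, but in real settings (such as studies tracking mental illness) documents are nested within individuals and ordered in time, forming person-indexed, time-ordered behavioral sequences. The key to its solution is a longitudinal modeling and evaluation paradigm that updates four core parts of the NLP pipeline: (1) evaluation splits aligned with the generalization target across people (cross-sectional) and/or across time (prospective); (2) accuracy metrics that separate between-person differences from within-person dynamics; (3) sequence inputs that incorporate history by default; (4) model internals that support latent state representations at different granularities (pooled summaries, explicit dynamics, or interaction-based modeling). Validation on a dataset of 238 participants and 17,000 diary transcripts paired with PTSD symptom severity shows that traditional document-level evaluation can yield conclusions opposite to those of ecologically valid modeling, motivating a shift in NLP from word-sequence evaluation to behavior-sequence modeling.
Link: https://arxiv.org/abs/2601.07988
Authors: Adithya V Ganesan,Vasudha Varadarajan,Oscar NE Kjell,Whitney R Ringwald,Scott Feltman,Benjamin J Luft,Roman Kotov,Ryan L Boyd,H Andrew Schwartz
Affiliations: Stony Brook University; Vanderbilt University; Carnegie Mellon University; University of Minnesota Twin Cities; The University of Texas at Dallas; Texas Artificial Intelligence Research Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: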
Abstract:While NLP typically treats documents as independent and unordered samples, in longitudinal studies, this assumption rarely holds: documents are nested within authors and ordered in time, forming person-indexed, time-ordered behavioral sequences. Here, we demonstrate the need for and propose a longitudinal modeling and evaluation paradigm that consequently updates four parts of the NLP pipeline: (1) evaluation splits aligned to generalization over people (cross-sectional) and/or time (prospective); (2) accuracy metrics separating between-person differences from within-person dynamics; (3) sequence inputs to incorporate history by default; and (4) model internals that support different coarseness of latent state over histories (pooled summaries, explicit dynamics, or interaction-based models). We demonstrate the issues that ensue from the traditional pipeline and our proposed improvements on a dataset of 17k daily diary transcripts paired with PTSD symptom severity from 238 participants, finding that traditional document-level evaluation can yield substantially different and sometimes reversed conclusions compared to our ecologically valid modeling and evaluation. We tie our results to a broader discussion motivating a shift from word-sequence evaluation toward behavior-sequence paradigms for NLP.
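Point (2), separating between-person differences from within-person dynamics, can be illustrated with a small sketch. It assumes Pearson correlation as the accuracy metric and person-mean centering as the decomposition; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def between_within(person_ids, y_true, y_pred):
    by_person = defaultdict(list)
    for pid, t, p in zip(person_ids, y_true, y_pred):
        by_person[pid].append((t, p))
    # Between-person: correlate each person's mean label with their mean prediction.
    mean_true = [mean(t for t, _ in rows) for rows in by_person.values()]
    mean_pred = [mean(p for _, p in rows) for rows in by_person.values()]
    # Within-person: correlate deviations from each person's own mean.
    dev_true, dev_pred = [], []
    for rows, mt, mp in zip(by_person.values(), mean_true, mean_pred):
        dev_true += [t - mt for t, _ in rows]
        dev_pred += [p - mp for _, p in rows]
    return pearson(mean_true, mean_pred), pearson(dev_true, dev_pred)

# Toy data: predictions rank people correctly but move the wrong way day to day.
ids = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
truth = [1, 2, 3, 4, 5, 6, 7, 8, 9]
preds = [3, 2, 1, 6, 5, 4, 9, 8, 7]
between_r, within_r = between_within(ids, truth, preds)
```

In this toy case the pooled document-level correlation is strongly positive even though the within-person correlation is negative, which is the kind of reversal the abstract warns about.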
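[NLP-80] VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding ACL2026
[Quick read]: This paper addresses a gap in how the cultural understanding of Vision-Language Models (VLMs) is evaluated: existing benchmarks focus on low-level visual perception (object recognition, scene description, and factual question answering) and neglect higher-order cultural interpretation. The key to its solution is VULCA-Bench, a multicultural art-critique benchmark covering 8 cultural traditions and 7,410 images paired with expert-written bilingual critiques. It defines 225 culture-specific dimensions on a five-layer framework of cultural understanding (L1-L5, from visual perception to philosophical aesthetics), enabling systematic evaluation of VLM reasoning in deep cultural contexts.
Link: https://arxiv.org/abs/2601.07986
Authors: Haorui Yu,Ramon Ruiz-Dolz,Diji Yang,Hang He,Fengrui Zhang,Qiufeng Yi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, submitted to ACL 2026 Dataset Track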
Abstract:We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models’ (VLMs) cultural understanding beyond surface-level visual perception. Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluating higher-order cultural interpretation. VULCA-Bench contains 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. We operationalise cultural understanding using a five-layer framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Our pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2). The dataset, evaluation scripts, and annotation tools are available under CC BY 4.0 in the supplementary materials.
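[NLP-81] Multilingual Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset
[Quick read]: This paper addresses the shortcomings of current fact-checking resources in multilingual coverage, multimodal evidence, structured annotation, and explainability, in response to the rapid spread of misinformation on online platforms. The key to its solution is a systematic data collection and processing pipeline that aggregates ClaimReview feeds, scrapes full debunking articles, normalizes claim verdicts from different sources, and adds structured metadata and aligned visual content to build multimodal fact-checking datasets in French and German. It also uses state-of-the-art large language models (LLMs) and multimodal LLMs to extract evidence under predefined evidence categories and to generate justifications that link evidence to verdicts, improving the explainability and evidence grounding of fact-checking models.
Link: https://arxiv.org/abs/2601.07985
Authors: Z. Melce Hüsünbeyi,Virginie Mouilleron,Leonie Uhling,Daniel Foppe,Tatjana Scheffler,Djamé Seddah
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: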
Abstract:The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.
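[NLP-82] Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models ACL2026
[Quick read]: This paper addresses the under-validated ability of Vision-Language Models (VLMs) to understand cultural meaning in art, with a focus on the credibility and interpretability of art critique across cultures. The key to its solution is a three-tier evaluation framework: Tier I computes coverage and risk signals offline with automated metrics; Tier II applies rubric-based scoring by a single primary judge across five dimensions; Tier III uses isotonic regression to calibrate the Tier II aggregate score to human ratings, reducing mean absolute error (MAE) by 5.2%. The framework outputs a calibrated cultural-understanding score for model selection and cultural-gap diagnosis, together with dimension-level diagnostics and risk indicators, and avoids the unreliability of naive multi-judge averaging caused by scale mismatch.
Link: https://arxiv.org/abs/2601.07984
Authors: Haorui Yu,Ramon Ruiz-Dolz,Xuehang Wen,Fengrui Zhang,Qiufeng Yi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 7 figures, submitted to ACL 2026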
Abstract:Vision-Language Models (VLMs) excel at visual perception, yet their ability to interpret cultural meaning in art remains under-validated. We present a tri-tier evaluation framework for cross-cultural art-critique assessment: Tier I computes automated coverage and risk indicators offline; Tier II applies rubric-based scoring using a single primary judge across five dimensions; and Tier III calibrates the Tier II aggregate score to human ratings via isotonic regression, yielding a 5.2% reduction in MAE on a 152-sample held-out set. The framework outputs a calibrated cultural-understanding score for model selection and cultural-gap diagnosis, together with dimension-level diagnostics and risk indicators. We evaluate 15 VLMs on 294 expert anchors spanning six cultural traditions. Key findings are that (i) automated metrics are unreliable proxies for cultural depth, (ii) Western samples score higher than non-Western samples under our sampling and rubric, and (iii) cross-judge scale mismatch makes naive score averaging unreliable, motivating a single primary judge with explicit calibration. Dataset and code are available in the supplementary materials.
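The Tier III step, calibrating judge scores to human ratings with isotonic regression, can be sketched with the pool-adjacent-violators algorithm. This is a generic illustration under stated assumptions: the function names and the toy scores are hypothetical, and the step-function lookup in `calibrate` is a simplification (library implementations typically interpolate between fitted points).

```python
def isotonic_fit(scores, targets):
    """Pool-adjacent-violators: non-decreasing fit of targets, ordered by score."""
    pairs = sorted(zip(scores, targets))
    blocks = []  # each block: [sum of targets, count, largest score in block]
    for s, t in pairs:
        blocks.append([t, 1, s])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def calibrate(score, fitted):
    """Map a raw judge score to the fitted value of the first block that covers it."""
    for top, value in fitted:
        if score <= top:
            return value
    return fitted[-1][1]

judge = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical aggregate judge scores
human = [1.5, 3.0, 2.0, 4.0, 4.5]   # hypothetical human ratings
fitted = isotonic_fit(judge, human)
calibrated = [calibrate(s, fitted) for s in judge]
```

The fit only assumes that a higher judge score should never map to a lower human rating, so it corrects scale mismatch without imposing a linear relationship.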
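[NLP-83] Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis
[Quick read]: This paper addresses the poor generalization of AI-generated text detectors across generation conditions such as unseen prompts, model families, or data domains. The key to its solution is a systematic linguistic analysis: it builds a comprehensive benchmark spanning multiple prompting strategies, large language models (LLMs), and domain datasets, and quantifies the distribution shift of 80 linguistic features between training and test conditions. This exposes the sources of the differences in detector performance and shows that specific linguistic features, such as tense usage and pronoun frequency, are significantly associated with generalization accuracy.
Link: https://arxiv.org/abs/2601.07974
Authors: Yuxi Xia,Kinga Stańczak,Benjamin Roth
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: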
Abstract:AI-text detectors achieve high accuracy on in-domain benchmarks, but often struggle to generalize across different generation conditions such as unseen prompts, model families, or domains. While prior work has reported these generalization gaps, there are limited insights about the underlying causes. In this work, we present a systematic study aimed at explaining generalization behavior through linguistic analysis. We construct a comprehensive benchmark that spans 6 prompting strategies, 7 large language models (LLMs), and 4 domain datasets, resulting in a diverse set of human- and AI-generated texts. Using this dataset, we fine-tune classification-based detectors on various generation settings and evaluate their cross-prompt, cross-model, and cross-dataset generalization. To explain the performance variance, we compute correlations between generalization accuracies and feature shifts of 80 linguistic features between training and test conditions. Our analysis reveals that generalization performance for specific detectors and evaluation conditions is significantly associated with linguistic features such as tense usage and pronoun frequency.
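[NLP-84] Cultural Compass: A Framework for Organizing Societal Norms to Detect Violations in Human-AI Conversations
[Quick read]: This paper addresses how generative AI models can adhere to sociocultural norms in cross-cultural settings, a key challenge for keeping models both useful and safe across cultural contexts. Existing work lacks nuance and coverage in understanding and evaluating models' norm adherence. The core of its solution is a structured taxonomy of norms refined along three dimensions: contexts (for example, distinguishing human-human norms from human-AI interactional norms), specifications (such as relevant domains), and mechanisms (such as modes of enforcement). The taxonomy is then turned into an automated evaluation pipeline that quantifies models' norm adherence in naturalistic, open-ended settings, enabling finer-grained, context-sensitive evaluation of cultural adaptation in realistic use.
Link: https://arxiv.org/abs/2601.07973
Authors: Myra Cheng,Vinodkumar Prabhakaran,Alice Oh,Hayk Stepanyan,Aishwarya Verma,Charu Kalia,Erin MacMurray van Liemt,Sunipa Dev
Affiliations: Stanford University; Google
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: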
Abstract:Generative AI models ought to be useful and safe across cross-cultural contexts. One critical step toward this goal is understanding how AI models adhere to sociocultural norms. While this challenge has gained attention in NLP, existing work lacks both nuance and coverage in understanding and evaluating models’ norm adherence. We address these gaps by introducing a taxonomy of norms that clarifies their contexts (e.g., distinguishing between human-human norms that models should recognize and human-AI interactional norms that apply to the human-AI interaction itself), specifications (e.g., relevant domains), and mechanisms (e.g., modes of enforcement). We demonstrate how our taxonomy can be operationalized to automatically evaluate models’ norm adherence in naturalistic, open-ended settings. Our exploratory analyses suggest that state-of-the-art models frequently violate norms, though violation rates vary by model, interactional context, and country. We further show that violation rates also vary by prompt intent and situational framing. Our taxonomy and demonstrative evaluation pipeline enable nuanced, context-sensitive evaluation of cultural norm adherence in realistic settings.
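[NLP-85] Knowing But Not Doing: Convergent Morality and Divergent Action in LLMs
[Quick read]: This paper addresses how large language models (LLMs) represent and enact human values in real-world decision contexts, that is, the still unclear practical mechanics of value alignment. The key to its solution is ValAct-15k, a high-quality dataset of 3,000 advice-seeking scenarios from Reddit designed to elicit the ten core values defined by Schwartz's Theory of Basic Human Values. Comparing LLMs and human participants on scenario-based questions and a traditional value questionnaire reveals highly consistent value decisions across models (Pearson r ≈ 1.0), while both humans and models show a marked knowledge-action gap (correlation between self-reported and enacted values of only r = 0.3–0.4). Instructing an LLM to "hold" a value lowers its performance by up to 6.6%, indicating a role-play aversion. The findings suggest that alignment training can produce normative value convergence but does not remove the human-like incoherence between knowing and acting.
Link: https://arxiv.org/abs/2601.07972
Authors: Jen-tse Huang,Jiantong Qin,Xueli Qiu,Sharon Levy,Michelle R. Kaufman,Mark Dredze
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 7 figures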
Abstract:Value alignment is central to the development of safe and socially compatible artificial intelligence. However, how Large Language Models (LLMs) represent and enact human values in real-world decision contexts remains under-explored. We present ValAct-15k, a dataset of 3,000 advice-seeking scenarios derived from Reddit, designed to elicit ten values defined by Schwartz's Theory of Basic Human Values. Using both the scenario-based questions and the traditional value questionnaire, we evaluate ten frontier LLMs (five from U.S. companies, five from Chinese ones) and human participants (n = 55). We find near-perfect cross-model consistency in scenario-based decisions (Pearson r ≈ 1.0), contrasting sharply with the broad variability observed among humans (r ∈ [-0.79, 0.98]). Yet, both humans and LLMs show weak correspondence between self-reported and enacted values (r = 0.4, 0.3), revealing a systematic knowledge-action gap. When instructed to “hold” a specific value, LLMs’ performance declines by up to 6.6% compared to merely selecting the value, indicating a role-play aversion. These findings suggest that while alignment training yields normative value convergence, it does not eliminate the human-like incoherence between knowing and acting upon values.
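[NLP-86] A Human-Centric Pipeline for Aligning Large Language Models with Chinese Medical Ethics
[Quick read]: This paper addresses the insufficient alignment of large language models (LLMs) with complex real-world ethical demands in healthcare, especially the lack of systematic evaluation and optimization mechanisms in the context of Chinese medical ethics. The key to its solution is MedES, a dynamic, scenario-centric benchmark built from 260 authoritative Chinese medical, ethical, and legal sources, together with a guardian-in-the-loop framework. The framework uses an automated evaluator trained on expert-labeled data (over 97% accuracy within the domain) to generate targeted prompts and provide structured ethical feedback, and aligns a 7B-parameter LLM through supervised fine-tuning and domain-specific preference optimization. Experiments show that, within the Chinese medical ethics context, the aligned model outperforms notably larger baseline models, with gains in both quality and composite metrics on core ethical tasks.
Link: https://arxiv.org/abs/2601.07954
Authors: Haoan Jin,Han Ying,Jiacheng Ji,Hanhui Xu,Mengyue Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: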
Abstract:Recent advances in large language models have enabled their application to a range of healthcare tasks. However, aligning LLMs with the nuanced demands of medical ethics, especially under complex real-world scenarios, remains underexplored. In this work, we present MedES, a dynamic, scenario-centric benchmark specifically constructed from 260 authoritative Chinese medical, ethical, and legal sources to reflect the challenges in clinical decision-making. To facilitate model alignment, we introduce a guardian-in-the-loop framework that leverages a dedicated automated evaluator (trained on expert-labeled data and achieving over 97% accuracy within our domain) to generate targeted prompts and provide structured ethical feedback. Using this pipeline, we align a 7B-parameter LLM through supervised fine-tuning and domain-specific preference optimization. Experimental results, conducted entirely within the Chinese medical ethics context, demonstrate that our aligned model outperforms notably larger baselines on core ethical tasks, with observed improvements in both quality and composite evaluation metrics. Our work offers a practical and adaptable framework for aligning LLMs with medical ethics in the Chinese healthcare domain, and suggests that similar alignment pipelines may be instantiated in other legal and cultural environments through modular replacement of the underlying normative corpus.
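[NLP-87] Towards Specialized Generalists: A Multi-Task MoE-LoRA Framework for Domain-Specific LLM Adaptation
[Quick read]: This paper addresses two challenges in adapting large language models (LLMs) to specialized domains such as medicine: the Stability-Plasticity Dilemma, in which the model must acquire complex clinical knowledge without forgetting general world knowledge; and Task Interference, in which different sub-tasks (such as medical diagnosis, report summarization, and drug-drug interaction prediction) compete for a limited shared low-rank parameter space. The key to its solution is Med-MoE-LoRA, a framework that combines Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA). It uses an asymmetric expert distribution in which deeper layers receive a higher density of LoRA experts to capture complex semantic abstractions, and adds a "Knowledge-Preservation Plugin", inspired by LoRA MoE, to isolate and protect general-purpose reasoning. Soft merging with adaptive routing and rank-wise decoupling reduce interference between tasks, improving multi-task medical NLP performance while retaining general cognitive capabilities.
Link: https://arxiv.org/abs/2601.07935
Authors: Yuxin Yang,Aoxiong Zeng,Xiangquan Yang
Affiliations: Shanghai University; East China Normal University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in Progress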
Abstract:The rapid evolution of Large Language Models (LLMs) has shifted focus from general-purpose capabilities to domain-specific expertise. However, adapting LLMs to specialized fields such as medicine presents two challenges: (1) the “Stability-Plasticity Dilemma”, where the model must acquire complex clinical knowledge without suffering from catastrophic forgetting of general world knowledge; and (2) “Task Interference”, where disparate sub-tasks, such as medical diagnosis, report summarization, and drug-drug interaction prediction, compete for limited low-rank parameter space. In this paper, we propose Med-MoE-LoRA, a novel framework that integrates Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to enable efficient multi-task domain adaptation, especially for medical scenarios. Drawing inspiration from recent advances, our framework employs an asymmetric expert distribution where deeper layers are equipped with a higher density of LoRA experts to capture complex semantic abstractions. We further introduce a “Knowledge-Preservation Plugin”, inspired by LoRA MoE, to isolate and protect general-purpose reasoning. By utilizing soft merging with adaptive routing and rank-wise decoupling, Med-MoE-LoRA achieves superior performance in medical benchmarks while reducing interference. Experimental results demonstrate that our approach consistently outperforms standard LoRA and conventional MoE architectures across multiple clinical NLP tasks while retaining the model’s general cognitive capabilities.
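The generic combination of a frozen base layer with softly merged LoRA experts can be sketched as below. This is a minimal illustration of soft MoE routing over low-rank updates, not the Med-MoE-LoRA architecture itself: the asymmetric expert distribution, the Knowledge-Preservation Plugin, and rank-wise decoupling are not modeled, and all names, shapes, and weights are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    top = max(z)
    e = [math.exp(x - top) for x in z]
    total = sum(e)
    return [x / total for x in e]

def moe_lora_forward(x, w, experts, router):
    """y = W x + sum_i g_i(x) * B_i (A_i x), with g = softmax(router x)."""
    gates = softmax(matvec(router, x))
    y = matvec(w, x)                      # frozen base projection
    for gate, (a, b) in zip(gates, experts):
        delta = matvec(b, matvec(a, x))   # rank-r update: down-project, then up-project
        y = [yi + gate * di for yi, di in zip(y, delta)]
    return y, gates

# Hypothetical sizes: input/output dimension 2, rank 1, two experts.
w = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[0.5], [0.0]]),   # (A_1: 1x2, B_1: 2x1)
    ([[0.0, 1.0]], [[0.0], [0.5]]),   # (A_2: 1x2, B_2: 2x1)
]
router = [[1.0, 0.0], [0.0, 1.0]]
y, gates = moe_lora_forward([2.0, 2.0], w, experts, router)
```

Because the gates form a soft mixture, every expert contributes in proportion to its routing weight, while the base weights `w` stay untouched.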
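[NLP-88] KVzap: Fast Adaptive and Faithful KV Cache Pruning
[Quick read]: This paper addresses the inference bottleneck caused by the large key-value (KV) cache of transformer-based language models on long inputs. KV cache pruning methods have been proposed, but mainstream inference engines have not adopted them because of speed-accuracy trade-offs. The key to its solution is KVzap, a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding, achieves 2–4× KV cache compression with negligible accuracy loss on long-context and reasoning tasks, and reaches state-of-the-art performance on the KVpress leaderboard.
Link: https://arxiv.org/abs/2601.07891
Authors: Simon Jegou,Maximilian Jeblick
Affiliations: NVIDIA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: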
Abstract:Growing context lengths in transformer-based language models have made the key-value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed–accuracy trade-offs. We introduce KVzap, a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks, KVzap achieves 2–4× KV cache compression with negligible accuracy loss and reaches state-of-the-art performance on the KVpress leaderboard. Code and models are available at this https URL.
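The abstract does not say how KVzap scores cache entries, so the sketch below only illustrates the general shape of score-based KV cache pruning. It assumes that a per-token importance score is already available and that entries below a threshold are dropped; `prune_kv`, the scores, and the threshold are all hypothetical.

```python
def prune_kv(keys, values, scores, threshold):
    """Keep the KV pairs whose importance score reaches the threshold."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return [keys[i] for i in kept], [values[i] for i in kept], kept

keys = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.4], [0.2, 0.2]]
values = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.05, 0.70, 0.90, 0.10]   # hypothetical per-token importance scores
pruned_keys, pruned_values, kept = prune_kv(keys, values, scores, threshold=0.5)
compression = len(keys) / len(pruned_keys)
```

With a fixed threshold rather than a fixed budget, the number of retained entries depends on the input, which is one simple way to make the compression ratio input-adaptive.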
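[NLP-89] Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models
[Quick read]: This paper addresses the distorted activation distributions and degraded performance that result from the sharp precision reduction in ultra-low-bit post-training quantization. Conventional quantization below 4 bits tends to distort activation distributions, which hurts language model perplexity and downstream task accuracy. The key to its solution is a sliced Wasserstein loss function that aligns the output distributions of the full-precision and quantized models under random linear projections, giving distribution-aware calibration without extra computation at inference. The loss can be integrated into any post-training quantization framework that has a retraining component; it is validated on two frontier methods, OmniQuant and TesseraQ, where it recovers part of the accuracy lost to quantization.
Link: https://arxiv.org/abs/2601.07878
Authors: Deyu Cao,Yixin Yin,Samin Aref
Affiliations: The University of Tokyo; University of Toronto
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Post-peer-review accepted manuscript, 17 pages including the supplementary information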
Abstract:The benefits of most large language models come with steep and often hidden economic and environmental costs due to their resource usage inefficiency during deployment. Model quantization improves energy and memory efficiency by representing model parameters with lower-precision values. However, compression below 4 bits often distorts activation distributions and degrades performance. We address this challenge by introducing a sliced Wasserstein loss function for distribution-aware calibration in ultra-low-bit post-training quantization. The proposed loss aligns the output distributions of full-precision and quantized models under random linear projections, complementing standard mean-squared error loss without adding any computational overhead during inference. Our proposed loss function can be incorporated into any post-training quantization framework that has a retraining component. We demonstrate the performance gains of our proposed loss by incorporating it into two frontier methods known as OmniQuant and TesseraQ. Compared to these two baselines, the proposed loss consistently improves both perplexity and downstream task accuracy across multiple ultra-low-bit settings. Our proposed loss function recovers 4.12-20.37% of OmniQuant’s lost accuracy on the language model LLaMA-2-7B, 0.93-7.65% on OPT-6.7B, and 2.26-6.20% on LLaMA-2-13B. TesseraQ’s accuracy degradation is recovered by 3.63-7.63% in relative terms when augmented by our proposed loss function. Taken together, these results demonstrate that distributional alignment provides a simple yet effective performance boost that can push the limits of frontier quantization methods. Our method is available on GitHub to facilitate future progress in ultra-low-bit quantization.
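A sliced Wasserstein distance between two sets of output vectors can be estimated by projecting both sets onto random directions and comparing the sorted projections. The sketch below is a generic Monte Carlo estimate in plain Python, not the paper's training loss: the function name, the number of projections, and the toy "full-precision" and "quantized" outputs are assumptions.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance
    between two equally sized sets of vectors."""
    assert len(xs) == len(ys), "this sketch assumes equal sample counts"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Random unit direction from a normalized Gaussian vector.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        # Project both sets to 1-D and sort; sorted samples are optimally matched.
        px = sorted(sum(a * c for a, c in zip(x, d)) for x in xs)
        py = sorted(sum(a * c for a, c in zip(y, d)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections

full_precision = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
quantized = [[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5]]  # shifted copy
same = sliced_wasserstein(full_precision, full_precision)
shifted = sliced_wasserstein(full_precision, quantized)
```

In a calibration loop this term would be added to the usual mean-squared error with some weighting factor, which is an assumption here. Since the projections are only needed while calibrating, nothing changes at inference time, consistent with the abstract's claim of no added overhead.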
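[NLP-90] EmbeddingRWKV: State-Centric Retrieval with Reusable States
[Quick read]: This paper addresses the inefficiency of the two-stage pipeline in current Retrieval-Augmented Generation (RAG) systems, where the embedding model and the reranker share no information and therefore repeat a large amount of computation. The key to its solution is State-Centric Retrieval, a unified retrieval paradigm built around reusable state representations shared between the embedding model and the reranker. The authors fine-tune an RWKV-based model into EmbeddingRWKV, which serves as both an embedding model and a state extractor, and design a state-based reranker that processes only query tokens during reranking. This decouples inference cost from document length and yields up to a 44.8× speedup. They also find that keeping all intermediate layer states is unnecessary: a uniform layer selection strategy using only 25% of the layers retains 98.62% of full-model performance.
Link: https://arxiv.org/abs/2601.07861
Authors: Haowen Hou,Jie Yang
Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Shenzhen Yuanshi Intelligence Co., Ltd
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 23 pages, 3 figures, 6 tables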
Abstract:Current Retrieval-Augmented Generation (RAG) systems typically employ a traditional two-stage pipeline: an embedding model for initial retrieval followed by a reranker for refinement. However, this paradigm suffers from significant inefficiency due to the lack of shared information between stages, leading to substantial redundant computation. To address this limitation, we propose State-Centric Retrieval, a unified retrieval paradigm that utilizes “states” as a bridge to connect embedding models and rerankers. First, we perform state representation learning by fine-tuning an RWKV-based LLM, transforming it into EmbeddingRWKV, a unified model that serves as both an embedding model and a state backbone for extracting compact, reusable states. Building upon these reusable states, we further design a state-based reranker to fully leverage precomputed information. During reranking, the model processes only query tokens, decoupling inference cost from document length and yielding a 5.4×–44.8× speedup. Furthermore, we observe that retaining all intermediate layer states is unnecessary; with a uniform layer selection strategy, our model maintains 98.62% of full-model performance using only 25% of the layers. Extensive experiments demonstrate that State-Centric Retrieval achieves high-quality retrieval and reranking results while significantly enhancing overall system efficiency. Code is available at our GitHub repository (this https URL).
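A uniform layer selection can be read as keeping evenly spaced layers. The sketch below shows one such scheme; the abstract does not state which indices the authors keep, so the function, the centering of each pick within its stride, and the 24-layer example are assumptions.

```python
def uniform_layers(num_layers, fraction):
    """Pick evenly spaced layer indices covering the given fraction of layers."""
    keep = max(1, round(num_layers * fraction))
    step = num_layers / keep
    # Take the layer in the middle of each stride of `step` layers.
    return [int(i * step + step / 2) for i in range(keep)]

selected = uniform_layers(24, 0.25)   # hypothetical 24-layer model, 25% of layers
```

Storing states for only the selected layers shrinks the cached state per document by the same fraction, which is where the efficiency gain would come from.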
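Computer Vision
[CV-0] RAVEN: Erasing Invisible Watermarks via Novel View Synthesis
[Quick read]: This paper addresses the fragility of current invisible watermarking under advanced removal attacks, in particular geometric attacks that change the viewpoint while preserving semantics. Existing watermarking schemes are robust to pixel-space and frequency-domain attacks but struggle against semantic-preserving viewpoint transformations. The key idea is to recast watermark removal as a view synthesis problem: controlled geometric transformations are applied in latent space, together with view-guided correspondence attention, to reconstruct the image content from a new viewpoint, which removes the watermark while preserving visual fidelity. The method needs no access to the watermark detector or knowledge of the watermark, relies only on frozen pre-trained diffusion models, and achieves state-of-the-art watermark removal across 15 watermarking schemes while maintaining high perceptual quality.
Link: https://arxiv.org/abs/2601.08832
Authors: Fahad Shamshad,Nils Lukas,Karthik Nandakumar
Affiliations: MBZUAI; Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages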
Abstract:Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.
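[CV-1] 3AM: Segment Anything with Geometric Consistency in Videos
[Quick read]: This paper addresses the performance drop of Video Object Segmentation (VOS) methods under large viewpoint changes, which typically stems from reliance on appearance features without geometric consistency. Existing 3D instance segmentation methods ensure viewpoint consistency but require costly inputs such as camera poses and depth maps. The key to the solution is 3AM, a lightweight training-time enhancement that fuses multi-level 3D-aware features from MUSt3R (implicit geometric correspondence) with SAM2's appearance features, giving geometry-consistent recognition grounded in both spatial position and visual similarity. A field-of-view aware sampling strategy ensures that frames observe spatially consistent object regions, making 3D correspondence learning more reliable. At inference the method needs only RGB input, with no camera parameters or preprocessing.
Link: https://arxiv.org/abs/2601.08831
Authors: Yang-Che Sun,Cheng Sun,Chin-Yang Lin,Fu-En Yang,Min-Hung Chen,Yen-Yu Lin,Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; NVIDIA Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL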
Abstract:Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2’s appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: this https URL
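[CV-2] Motion Attribution for Video Generation
[Quick read]: This paper addresses the poorly understood role of data in shaping motion in video generation models: existing methods cannot quantify how specific training clips shape a video's temporal dynamics. The key to its solution is Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that uses motion-weighted loss masks to separate temporal dynamics from static appearance, giving efficient and scalable motion-specific influence computation. According to the authors, it is the first framework to attribute motion rather than visual appearance in video generative models, and it uses the attribution results to guide data curation, improving temporal consistency and physical plausibility.
Link: https://arxiv.org/abs/2601.08828
Authors: Xindi Wu,Despoina Paschalidou,Jun Gao,Antonio Torralba,Laura Leal-Taixé,Olga Russakovsky,Sanja Fidler,Jonathan Lorraine
Affiliations: NVIDIA; Princeton University; MIT CSAIL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
Comments: See the project website at this https URL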
Abstract:Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
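A motion-weighted loss mask can be pictured as a per-pixel weighting that emphasizes regions that move between frames. The sketch below uses the absolute frame difference as a crude motion proxy; how Motive actually builds its masks is not described in the abstract, so the functions and toy frames here are hypothetical.

```python
def motion_mask(prev_frame, next_frame, eps=1e-8):
    """Per-pixel weights proportional to the absolute frame difference."""
    diff = [[abs(b - a) for a, b in zip(r0, r1)] for r0, r1 in zip(prev_frame, next_frame)]
    total = sum(sum(row) for row in diff) + eps
    return [[d / total for d in row] for row in diff]

def motion_weighted_loss(pred, target, mask):
    """Squared error averaged under the motion mask instead of uniformly."""
    return sum(
        w * (p - t) ** 2
        for rp, rt, rw in zip(pred, target, mask)
        for p, t, w in zip(rp, rt, rw)
    )

prev_frame = [[0.0, 0.0], [0.0, 0.0]]
next_frame = [[0.0, 1.0], [0.0, 0.0]]   # only the top-right pixel changes
mask = motion_mask(prev_frame, next_frame)
# An error on a static pixel is ignored; an error on the moving pixel dominates.
static_error = motion_weighted_loss([[1.0, 1.0], [0.0, 0.0]], next_frame, mask)
moving_error = motion_weighted_loss([[0.0, 0.0], [0.0, 0.0]], next_frame, mask)
```

Gradients of a loss weighted this way depend mainly on moving regions, which is the property a motion-specific influence estimate needs.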
[CV-3] Reasoning Matters for 3D Visual Grounding CVPR
[Quick read]: This paper addresses the limited reasoning ability of models for 3D visual grounding, where existing methods depend on supervised training over large annotated datasets and scaling synthetic data yields gains that are not proportional to the data collection cost. The key to the solution is a data pipeline that automatically synthesizes 3D visual grounding data together with the corresponding reasoning process, which is then used to fine-tune a large language model (LLM), producing Reason3DVG-8B. The model outperforms 3D-GRAND while using only 1.6% of its training data, which supports the value of high-quality synthetic data and explicit reasoning for 3D visual grounding.
Link: https://arxiv.org/abs/2601.08811
Authors: Hsiang-Wei Huang, Kuang-Ming Chen, Wenhao Chai, Cheng-Yen Yang, Jen-Hao Cheng, Jenq-Neng Hwang
Affiliations: University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 2025 CVPR Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments
Abstract: The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, a fundamental task in 3D understanding, remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most current methods incorporate a text encoder and a visual feature encoder to generate fused cross-modal features and predict the referred object. These models often require supervised training on extensive 3D annotation data. On the other hand, recent research also focuses on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and non-proportional to the data collection cost. In this work, we propose a 3D visual grounding data pipeline that automatically synthesizes 3D visual grounding data along with the corresponding reasoning process. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.
[CV-4] S3-CLIP: Video Super Resolution for Person-ReID WACV
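[Quick read]: This paper addresses the performance bottleneck that low tracklet quality creates in video-based person re-identification (VReID), especially in difficult cross-view scenarios. Most prior work modifies model architectures and neglects the effect of tracklet quality on the final result. The key to the solution is S3-CLIP, which combines video super-resolution (VSR) with CLIP-ReID in a task-driven super-resolution pipeline to improve tracklet quality. The method reaches 37.52% mAP in the aerial-to-ground setting and 29.16% mAP in the ground-to-aerial setting; in the ground-to-aerial setting it improves Rank-1, Rank-5, and Rank-10 by 11.24%, 13.48%, and 17.98%, respectively. To the authors' knowledge, this is the first systematic investigation of video super-resolution as a way to enhance tracklet quality for person ReID.
Link: https://arxiv.org/abs/2601.08807
Authors: Tamas Endrei, Gyorgy Cserey
Affiliations: Pázmány Péter Catholic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the 2026 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), VReID-XFD Challenge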
Abstract:Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.
[CV-5] Near-perfect photo-ID of the Hula painted frog with zero-shot deep local-feature matching
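[Quick read]: This paper addresses individual identification of endangered amphibians, specifically non-invasive, high-accuracy photographic re-identification for critically endangered species such as the Hula painted frog (Latonia nigriventer), for which invasive marking is unsuitable. The key to the solution is zero-shot deep local-feature matching, which reaches 98% top-1 closed-set identification accuracy without any training data and outperforms all global-feature embedding models. A two-stage workflow, in which a fine-tuned global-feature model first retrieves a short candidate list that local-feature matching then re-ranks, cuts end-to-end runtime from 6.5-7.8 hours to about 38 minutes while keeping about 96% top-1 accuracy. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, and the pipeline is deployed as a web application for routine field use.
Link: https://arxiv.org/abs/2601.08798
Authors: Maayan Yesharim, R. G. Bina Perl, Uri Roll, Sarig Gafny, Eli Geffen, Yoav Ram
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 18 pages, 4 figures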
Abstract:Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
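The two-stage workflow and the open-set threshold described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function `identify`, the toy embeddings, the toy local-match scores, and the values of `k` and `threshold` are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def identify(query_emb, gallery_embs, local_score, k=2, threshold=0.5):
    """Return the best-matching gallery ID, or None for a likely new individual.

    gallery_embs: dict mapping individual ID -> global embedding.
    local_score:  callable ID -> score from a (slow) local-feature matcher.
    """
    if not gallery_embs:
        return None
    # Stage 1: cheap global retrieval keeps only the k most similar candidates.
    shortlist = sorted(gallery_embs,
                       key=lambda gid: cosine(query_emb, gallery_embs[gid]),
                       reverse=True)[:k]
    # Stage 2: expensive local matching re-ranks the shortlist only.
    best_score, best_id = max((local_score(gid), gid) for gid in shortlist)
    # Open-set decision: a weak best match is treated as an unseen individual.
    return best_id if best_score >= threshold else None


gallery = {"frog_A": [1.0, 0.0], "frog_B": [0.9, 0.1], "frog_C": [0.0, 1.0]}
local_scores = {"frog_A": 0.2, "frog_B": 0.9, "frog_C": 0.95}  # toy values
query = [1.0, 0.05]

match = identify(query, gallery, local_scores.get)
no_match = identify(query, gallery, local_scores.get, threshold=0.95)
```

The toy data also shows the trade-off behind the reported drop from 98% to about 96% top-1: `frog_C` has the highest local score but is never scored, because stage 1 did not shortlist it.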
[CV-6] DentalX: Context-Aware Dental Disease Detection with Radiographs
[Quick read]: This paper addresses the difficulty of detecting dental diseases from radiographs, where the diagnostic evidence is visually subtle. Object detection models designed for natural images, which have more distinct target patterns, adapt poorly to dental imagery. The key to the solution is DentalX, a context-aware detection approach whose core component is a structural context extraction module. The module learns semantic segmentation of dental anatomy as an auxiliary task, extracts meaningful oral structure information, and integrates it into the primary disease detection task to improve detection of subtle diseases. Experiments on a dedicated benchmark show that DentalX significantly outperforms prior methods in both tasks, with the mutual benefit arising during optimization as the correlation between the two tasks is captured.
Link: https://arxiv.org/abs/2601.08797
Authors: Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ISBI 2026
Abstract: Diagnosing dental diseases from radiographs is time-consuming and challenging due to the subtle nature of diagnostic evidence. Existing methods, which rely on object detection models designed for natural images with more distinct target patterns, struggle to detect dental diseases that present with far less visual support. To address this challenge, we propose DentalX, a novel context-aware dental disease detection approach that leverages oral structure information to mitigate the visual ambiguity inherent in radiographs. Specifically, we introduce a structural context extraction module that learns an auxiliary task: semantic segmentation of dental anatomy. The module extracts meaningful structural context and integrates it into the primary disease detection task to enhance the detection of subtle dental diseases. Extensive experiments on a dedicated benchmark demonstrate that DentalX significantly outperforms prior methods in both tasks. This mutual benefit arises naturally during model optimization, as the correlation between the two tasks is effectively captured. Our code is available at this https URL.
[CV-7] Aggregating Diverse Cue Experts for AI-Generated Image Detection AAAI2026
[Quick read]: This paper addresses the poor generalization of AI-generated image detectors across different generative models; existing methods often rely on model-specific features and overfit. The key to the solution is the Multi-Cue Aggregation Network (MCAN), which integrates complementary spatial, frequency-domain, and chromaticity cues in a unified framework: the input image itself (overall content), high-frequency components (edge details), and a newly proposed Chromatic Inconsistency (CI) cue. The CI cue normalizes intensity values to expose noise patterns introduced by the acquisition process in real images, making them easier to distinguish from AI-generated content. MCAN processes these cues dynamically with a mixture-of-encoders adapter, giving more robust feature representations and better cross-model detection.
Link: https://arxiv.org/abs/2601.08790
Authors: Lei Tan, Shuwei Li, Mohan Kankanhalli, Robby T. Tan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026
Abstract: The rapid emergence of image synthesis models poses challenges to the generalization of AI-generated image detectors. However, existing methods often rely on model-specific features, leading to overfitting and poor generalization. In this paper, we introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues in a unified network. MCAN employs a mixture-of-encoders adapter to dynamically process these cues, enabling more adaptive and robust feature representation. Our cues include the input image itself, which represents the overall content, and high-frequency components that emphasize edge details. Additionally, we introduce a Chromatic Inconsistency (CI) cue, which normalizes intensity values and captures noise information introduced during the image acquisition process in real images, making these noise patterns more distinguishable from those in AI-generated content. Unlike prior methods, MCAN’s novelty lies in its unified multi-cue aggregation framework, which integrates spatial, frequency-domain, and chromaticity-based information for enhanced representation learning. These cues are intrinsically more indicative of real images, enhancing cross-model generalization. Extensive experiments on the GenImage, Chameleon, and UniversalFakeDetect benchmarks validate the state-of-the-art performance of MCAN. On the GenImage dataset, MCAN outperforms the best state-of-the-art method by up to 7.4% in average ACC across eight different image generators.
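The abstract says only that the CI cue "normalizes intensity values"; it does not give a formula. As a rough illustration of what intensity normalization can look like, the sketch below uses the textbook chromaticity transform, dividing each channel by the channel sum. Treat this as an assumption for the example, not as MCAN's actual CI computation; `chromaticity` and `chromaticity_map` are hypothetical names.

```python
def chromaticity(pixel):
    """Map an (R, G, B) pixel to intensity-normalized chromaticity coordinates."""
    r, g, b = pixel
    total = r + g + b
    if total == 0:
        return (0.0, 0.0, 0.0)  # black pixel: chromaticity is undefined, use zeros
    return (r / total, g / total, b / total)


def chromaticity_map(image):
    """Apply the normalization to every pixel of a 2D list of RGB tuples."""
    return [[chromaticity(px) for px in row] for row in image]


dim = chromaticity((10, 20, 30))
bright = chromaticity((20, 40, 60))   # same colour, twice the intensity
```

Because overall brightness cancels out, what remains after such a normalization is colour balance plus residual per-channel noise, which is the kind of signal the abstract attributes to the acquisition process in real images.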
[CV-8] Translating Light-Sheet Microscopy Images to Virtual HE Using CycleGAN
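[Quick read]: This paper addresses the lack of direct visual comparability between fluorescence microscopy images and standard Hematoxylin and Eosin (HE) stained histopathology images, which hinders integration of fluorescence data into routine pathology workflows. The key to the solution is a Cycle-Consistent Adversarial Network (CycleGAN) for unpaired image-to-image translation: the C01 and C02 fluorescence channels are combined into RGB, and a bidirectional mapping between the fluorescence and HE domains is learned without paired training samples. The resulting pseudo-HE images adopt HE-like color characteristics while preserving morphological structures, so fluorescence data can be viewed in a format familiar to pathologists and fed into existing HE-based analysis pipelines.
Link: https://arxiv.org/abs/2601.08776
Authors: Yanhua Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures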
Abstract:Histopathology analysis relies on Hematoxylin and Eosin (HE) staining, but fluorescence microscopy offers complementary information. Converting fluorescence images to HE-like appearance can aid interpretation and integration with standard workflows. We present a Cycle-Consistent Adversarial Network (CycleGAN) approach for unpaired image-to-image translation from multi-channel fluorescence microscopy to pseudo HE stained histopathology images. The method combines C01 and C02 fluorescence channels into RGB and learns a bidirectional mapping between fluorescence and HE domains without paired training data. The architecture uses ResNet-based generators with residual blocks and PatchGAN discriminators, trained with adversarial, cycle-consistency, and identity losses. Experiments on fluorescence microscopy datasets show the model generates realistic pseudo HE images that preserve morphological structures while adopting HE-like color characteristics. This enables visualization of fluorescence data in a format familiar to pathologists and supports integration with existing HE-based analysis pipelines.
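The cycle-consistency and identity losses named in the abstract follow the standard CycleGAN formulation, which the sketch below illustrates on flat toy "images". The adversarial term is left out, the generators are stand-in lambdas rather than ResNet networks, and the weights `lam_cyc=10.0` and `lam_id=5.0` are commonly used defaults assumed here, not values reported for this paper.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_identity_loss(x, y, G, F, lam_cyc=10.0, lam_id=5.0):
    """Non-adversarial part of the generator objective.

    x: fluorescence-domain sample, y: HE-domain sample (flat lists of floats).
    G maps fluorescence -> HE, F maps HE -> fluorescence.
    """
    cycle = l1(F(G(x)), x) + l1(G(F(y)), y)   # a round trip should return the input
    identity = l1(G(y), y) + l1(F(x), x)      # a generator should leave its target domain alone
    return lam_cyc * cycle + lam_id * identity


# Toy "generators": exact inverses of each other, so the cycle term vanishes.
G = lambda img: [v + 1.0 for v in img]
F = lambda img: [v - 1.0 for v in img]
loss = cycle_identity_loss([0.0, 2.0], [1.0, 3.0], G, F)
```

Here the cycle term is zero because `F` undoes `G` exactly, and the remaining loss comes from the identity term, which penalizes each generator for changing an image that is already in its output domain.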
[CV-9] UR-Bench: A Benchmark for Multi-Hop Reasoning over Ultra-High-Resolution Images
[Quick read]: This paper addresses the insufficient evaluation of visual-language reasoning by multimodal large language models (MLLMs) on ultra-high-resolution imagery. Existing visual question answering (VQA) benchmarks are typically built on medium-resolution data and do not fully test reasoning in complex visual scenes. The authors introduce UR-Bench, which covers two categories, Humanistic Scenes and Natural Scenes, and four subsets of ultra-high-resolution images (from hundreds of megapixels to gigapixels) with distinct spatial structures and data sources, together with questions organized into three levels. The key to the solution is an agent-based framework in which a language model reasons by invoking external visual tools, supplemented by Semantic Abstraction and Retrieval tools that make processing of very large images more efficient.
Link: https://arxiv.org/abs/2601.08748
Authors: Siqi Li, Xinyu Cai, Jianbiao Mei, Nianchen Deng, Pinlong Cai, Licheng Wen, Yufan Shen, Xuemeng Yang, Botian Shi, Yong Liu
Affiliations: Zhejiang University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures
Abstract: Recent multimodal large language models (MLLMs) show strong capabilities in visual-language reasoning, yet their performance on ultra-high-resolution imagery remains largely unexplored. Existing visual question answering (VQA) benchmarks typically rely on medium-resolution data, offering limited visual complexity. To bridge this gap, we introduce the Ultra-high-resolution Reasoning Benchmark (UR-Bench), a benchmark designed to evaluate the reasoning capabilities of MLLMs under extreme visual information. UR-Bench comprises two major categories, Humanistic Scenes and Natural Scenes, covering four subsets of ultra-high-resolution images with distinct spatial structures and data sources. Each subset contains images ranging from hundreds of megapixels to gigapixels, accompanied by questions organized into three levels, enabling evaluation of models’ reasoning capabilities in ultra-high-resolution scenarios. We further propose an agent-based framework in which a language model performs reasoning by invoking external visual tools. In addition, we introduce Semantic Abstraction and Retrieval tools that enable more efficient processing of ultra-high-resolution images. We evaluate state-of-the-art models using both end-to-end MLLMs and our agent-based framework, demonstrating the effectiveness of our framework.
[CV-10] ISLA: A U-Net for MRI-based acute ischemic stroke lesion segmentation with deep supervision attention domain adaptation and ensemble learning
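[Quick read]: This paper addresses accurate delineation of acute ischemic stroke (AIS) lesions in diffusion MRI, a key component of stroke diagnosis and management. Most existing deep learning segmentation methods are based on U-Net and differ mainly in their loss functions and in the use of deep supervision, residual connections, and attention mechanisms; many implementations are not public, and the optimal configuration remains unclear. The paper proposes ISLA (Ischemic Stroke Lesion Analyzer), built through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, and investigates unsupervised domain adaptation to improve generalization to an external clinical dataset. Trained on three multicenter databases, ISLA outperformed two state-of-the-art AIS lesion segmentation approaches on an external test set; the authors state that code and trained models will be made publicly available.
Link: https://arxiv.org/abs/2601.08732
Authors: Vincent Roca, Martin Bretzner, Hilde Henon, Laurent Puy, Grégory Kuchcinski, Renaud Lopes
Affiliations: Univ. Lille; CNRS; Inserm; CHU Lille; Institut Pasteur de Lille; US 41 - UAR 2014 - PLBS; Lille Neurosciences & Cognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: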
Abstract:Accurate delineation of acute ischemic stroke lesions in MRI is a key component of stroke diagnosis and management. In recent years, deep learning models have been successfully applied to the automatic segmentation of such lesions. While most proposed architectures are based on the U-Net framework, they primarily differ in their choice of loss functions and in the use of deep supervision, residual connections, and attention mechanisms. Moreover, many implementations are not publicly available, and the optimal configuration for acute ischemic stroke (AIS) lesion segmentation remains unclear. In this work, we introduce ISLA (Ischemic Stroke Lesion Analyzer), a new deep learning model for AIS lesion segmentation from diffusion MRI, trained on three multicenter databases totaling more than 1500 AIS participants. Through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, we developed a robust segmentation framework. We further investigated unsupervised domain adaptation to improve generalization to an external clinical dataset. ISLA outperformed two state-of-the-art approaches for AIS lesion segmentation on an external test set. Codes and trained models will be made publicly available to facilitate reuse and reproducibility.
[CV-11] Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation
[Quick read]: This paper addresses model bias in Scene Graph Generation (SGG) caused by the long-tailed distribution of predicate classes: a few classes dominate while many are underrepresented, so models underperform on rare relations. Existing unbiased SGG methods mitigate this, but often rely too heavily on semantic priors at the cost of spatial understanding. The key to the solution is Salience-SGG, whose core innovation is an Iterative Salience Decoder (ISD) guided by semantic-agnostic salience labels, which emphasizes triplets with salient spatial structures. The framework improves both rare-relation performance and spatial understanding, and achieves state-of-the-art results on several benchmarks.
Link: https://arxiv.org/abs/2601.08728
Authors: Runfeng Qu, Ole Hall, Pia K Bideau, Julie Ouerfelli-Ethier, Martin Rolfs, Klaus Obermayer, Olaf Hellwich
Affiliations: Technische Universität Berlin; Humboldt Universität zu Berlin; Univ. Grenoble Alpes; Inria; CNRS; Grenoble INP; LJK; Bernstein Center for Computational Neuroscience; Science of Intelligence Research Cluster of Excellence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this issue by implementing debiasing strategies, but often at the cost of spatial understanding, resulting in an over-reliance on semantic priors. We introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels guiding ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves existing Unbiased-SGG methods in their spatial understanding as demonstrated by the Pairwise Localization Average Precision.
[CV-12] Real-Time Localization Framework for Autonomous Basketball Robots
[Quick read]: This paper addresses accurate and reliable localization for autonomous robots in dynamic environments, motivated by the Robocon 2025 competition, where localization matters for shooting precision, collision avoidance, and efficient navigation of the field. The key to the solution is a hybrid localization algorithm that combines classical techniques with learning-based methods and achieves self-localization on the basketball court using only visual data from the court floor.
Link: https://arxiv.org/abs/2601.08713
Authors: Naren Medarametla, Sreejon Mondal
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 12 figures, Project code: this https URL
Abstract: Localization is a fundamental capability for autonomous robots, enabling them to operate effectively in dynamic environments. In Robocon 2025, accurate and reliable localization is crucial for improving shooting precision, avoiding collisions with other robots, and navigating the competition field efficiently. In this paper, we propose a hybrid localization algorithm that integrates classical techniques with learning-based methods that rely solely on visual data from the court’s floor to achieve self-localization on the basketball field.
[CV-13] MEMEWEAVER: Inter-Meme Graph Reasoning for Sexism and Misogyny Detection EACL2026
[Quick read]: This paper addresses a weakness of online sexism and misogyny detection that ignores social dynamics, in particular the way perpetrators reinforce prejudice and group identity within like-minded communities. The key to the solution is MemeWeaver, an end-to-end trainable multimodal framework whose core innovation is an inter-meme graph reasoning mechanism that captures relations between memes; the learned graph structure captures semantically meaningful patterns. Compared with existing methods, it achieves better performance on the MAMI and EXIST benchmarks and faster training convergence.
Link: https://arxiv.org/abs/2601.08684
Authors: Paolo Italiani, David Gimeno-Gomez, Luca Ragazzi, Gianluca Moro, Paolo Rosso
Affiliations: University of Bologna; Universitat Politècnica de València; ValgrAI - Valencian Graduate School and Research Network of Artificial Intelligence
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at EACL 2026 Findings
Abstract:Women are twice as likely as men to face online harassment due to their gender. Despite recent advances in multimodal content moderation, most approaches still overlook the social dynamics behind this phenomenon, where perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods offer a promising way to capture such interactions, yet existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning. In this work, we present MemeWeaver, an end-to-end trainable multimodal framework for detecting sexism and misogyny through a novel inter-meme graph reasoning mechanism. We systematically evaluate multiple visual–textual fusion strategies and show that our approach consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks, while achieving faster training convergence. Further analyses reveal that the learned graph structure captures semantically meaningful patterns, offering valuable insights into the relational nature of online hate.
[CV-14] Além do Desempenho: Um Estudo da Confiabilidade de Detectores de Deepfakes
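[Quick read]: This paper addresses the incomplete evaluation of deepfake detection techniques: existing evaluations are mostly limited to classification performance and do not measure how reliable detectors are in practice. The key to the solution is a reliability assessment framework built on four pillars, namely transferability, robustness, interpretability, and computational efficiency, which gives a more systematic, multi-dimensional basis for evaluating deepfake detectors.
Link: https://arxiv.org/abs/2601.08674
Authors: Lucas Lopes, Rayson Laroca, André Grégio
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for presentation at the Brazilian Symposium on Cybersecurity (SBSeg) 2025, in Portuguese language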
Abstract:Deepfakes are synthetic media generated by artificial intelligence, with positive applications in education and creativity, but also serious negative impacts such as fraud, misinformation, and privacy violations. Although detection techniques have advanced, comprehensive evaluation methods that go beyond classification performance remain lacking. This paper proposes a reliability assessment framework based on four pillars: transferability, robustness, interpretability, and computational efficiency. An analysis of five state-of-the-art methods revealed significant progress as well as critical limitations.
[CV-15] VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory
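[Quick read]: This paper addresses the limitations of existing vision-language-action (VLA) models in embodied navigation, which rely on reactive mappings from observations to actions and lack the explicit reasoning and persistent memory needed for complex, long-horizon tasks. The proposed VLingNav model has two key innovations. First, inspired by the dual-process theory of human cognition, an adaptive chain-of-thought (adaptive CoT) mechanism triggers explicit reasoning only when necessary, letting the agent switch between fast, intuitive execution and slow, deliberate planning. Second, a visual-assisted linguistic memory module builds a persistent cross-modal semantic memory that preserves spatial dependencies, prevents repetitive exploration, and supports reasoning about dynamic environments. In addition, the large-scale reasoning-annotated dataset Nav-AdaCoT-2.9M and an online expert-guided reinforcement learning stage further improve generalization and self-explored navigation behavior.
Link: https://arxiv.org/abs/2601.08665
Authors: Shaoan Wang, Yuanfei Luo, Xingyu Chen, Aocheng Luo, Dongyue Li, Chang Liu, Sheng Chen, Yangang Zhang, Junzhi Yu
Affiliations: Peking University; ByteDance
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL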
Abstract:VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.
[CV-16] TRACE: Reconstruction-Based Anomaly Detection in Ensemble and Time-Dependent Simulations
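[Quick read]: This paper addresses anomaly detection in high-dimensional, time-dependent simulation data, specifically ensemble data from parameterized Kármán vortex street simulations. The key to the solution is reconstruction-based anomaly detection with two convolutional autoencoder architectures: a 2D autoencoder that operates on individual frames and a 3D autoencoder that processes short temporal stacks. The 2D model captures localized spatial irregularities in single time steps, whereas the 3D model exploits spatio-temporal context to detect anomalous motion patterns and reduces redundant detections across time. The study also finds that reconstruction errors depend strongly on the spatial distribution of mass, with highly concentrated regions yielding larger errors than dispersed configurations. Overall, the results highlight the importance of temporal context for anomaly detection in dynamic simulations.
Link: https://arxiv.org/abs/2601.08659
Authors: Hamid Gadirov, Martijn Westra, Steffen Frey
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: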
Abstract:Detecting anomalies in high-dimensional, time-dependent simulation data is challenging due to complex spatial and temporal dynamics. We study reconstruction-based anomaly detection for ensemble data from parameterized Kármán vortex street simulations using convolutional autoencoders. We compare a 2D autoencoder operating on individual frames with a 3D autoencoder that processes short temporal stacks. The 2D model identifies localized spatial irregularities in single time steps, while the 3D model exploits spatio-temporal context to detect anomalous motion patterns and reduces redundant detections across time. We further evaluate volumetric time-dependent data and find that reconstruction errors are strongly influenced by the spatial distribution of mass, with highly concentrated regions yielding larger errors than dispersed configurations. Our results highlight the importance of temporal context for robust anomaly detection in dynamic simulations.
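The scoring step of reconstruction-based anomaly detection can be sketched as below. The autoencoders themselves are not implemented: `recons` stands for whatever a trained model returns. The mean-plus-`k`-standard-deviations threshold and the sliding-window average (a crude stand-in for scoring a temporal stack as one unit) are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def frame_errors(frames, recons):
    """Mean squared reconstruction error per frame (frames are flat lists)."""
    return [mean((a - b) ** 2 for a, b in zip(frame, recon))
            for frame, recon in zip(frames, recons)]


def flag_anomalies(errors, k=2.0):
    """Indices whose error exceeds mean + k * standard deviation."""
    threshold = mean(errors) + k * pstdev(errors)
    return [i for i, err in enumerate(errors) if err > threshold]


def window_errors(errors, size=3):
    """Average error over each run of `size` consecutive frames."""
    return [mean(errors[i:i + size]) for i in range(len(errors) - size + 1)]


# Nine well-reconstructed frames followed by one poorly reconstructed frame.
frames = [[0.0, 0.0]] * 9 + [[3.0, 3.0]]
recons = [[0.0, 0.0]] * 10

errors = frame_errors(frames, recons)
anomalous_frames = flag_anomalies(errors)
stack_errors = window_errors(errors, size=3)
```

Frame-wise scoring flags each bad time step separately, while a stack-level score assigns one value to a whole temporal window, which is one plausible reading of how the 3D model reduces redundant detections across time.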
[CV-17] SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models
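[Quick read]: This paper addresses the tendency of image generation models (IGMs) to memorize and reproduce unsafe content from their training data, such as NSFW imagery and copyrighted artistic styles, which poses persistent safety and compliance risks in deployment. Existing approaches such as post-hoc filtering or model-level unlearning suffer from limited robustness, degraded generation quality, or vulnerability to prompt paraphrasing and adversarial attacks. The key to the solution is SafeRedir, a lightweight inference-time framework that achieves robust unlearning through token-level interventions in the prompt embedding space. A latent-aware multi-modal safety classifier identifies unsafe generation trajectories, and a token-level delta generator performs precise semantic redirection, with auxiliary predictors for token masking and adaptive scaling that localize and regulate the intervention. Harmful outputs are suppressed without modifying the underlying model, while the quality of benign generations is preserved.
Link: https://arxiv.org/abs/2601.08623
Authors: Renyang Liu, Kangjie Chen, Han Qiu, Jie Zhang, Kwok-Yan Lam, Tianwei Zhang, See-Kiong Ng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Code at this https URL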
Abstract:Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at this https URL.
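The abstract describes the intervention in terms of a per-token delta, a token mask, and an adaptive scale, applied only when the classifier flags a prompt. One simple reading is an additive update of the form embedding + scale * mask * delta, sketched below. The additive form, the function name `redirect`, and all inputs are assumptions for illustration; in SafeRedir these quantities come from learned predictors that are not reproduced here.

```python
def redirect(embeddings, deltas, mask, scale, unsafe):
    """Shift flagged token embeddings toward a safer region.

    embeddings, deltas: lists of per-token vectors (lists of floats).
    mask:   per-token values in [0, 1] selecting which tokens to edit.
    scale:  single strength factor for the whole prompt.
    unsafe: classifier decision; safe prompts pass through unchanged.
    """
    if not unsafe:
        return [list(e) for e in embeddings]
    return [[x + scale * m * d for x, d in zip(emb, delta)]
            for emb, delta, m in zip(embeddings, deltas, mask)]


tokens = [[1.0, 1.0], [2.0, 2.0]]
deltas = [[0.5, -0.5], [-1.0, 1.0]]
mask = [0.0, 1.0]                      # only the second token is edited

edited = redirect(tokens, deltas, mask, scale=0.5, unsafe=True)
untouched = redirect(tokens, deltas, mask, scale=0.5, unsafe=False)
```

The mask keeps the edit local to the offending tokens and the scale bounds its strength, which matches the abstract's description of localizing and regulating the intervention while leaving benign prompts alone.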
[CV-18] ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios
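[Quick read]: This paper addresses the failure of existing benchmarks to capture the challenges that Retrieval-Augmented Generation (RAG) pipelines face on visually rich documents: interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. The key to the solution is ViDoRe v3, a comprehensive multimodal RAG benchmark covering 10 datasets across diverse professional domains, with about 26,000 document pages and 3,099 human-verified queries, each available in 6 languages. Some 12,000 hours of human annotation provide high-quality labels for retrieval relevance, bounding box localization, and verified reference answers. Experiments show that visual retrievers outperform textual ones, that late-interaction models and textual reranking substantially improve performance, and that hybrid or purely visual contexts improve answer generation, while current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding.
Link: https://arxiv.org/abs/2601.08620
Authors: António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, Gautier Viaud
Affiliations: Illuin Technology; NVIDIA; CentraleSupélec, Paris-Saclay
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: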
Abstract:Retrieval-Augmented Generation (RAG) pipelines must address challenges beyond simple single-document retrieval, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. Existing benchmarks fail to capture this complexity, often focusing on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. We introduce ViDoRe v3, a comprehensive multimodal RAG benchmark featuring multi-type queries over visually rich document corpora. It covers 10 datasets across diverse professional domains, comprising ~26,000 document pages paired with 3,099 human-verified queries, each available in 6 languages. Through 12,000 hours of human annotation effort, we provide high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. Our evaluation of state-of-the-art RAG pipelines reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts enhance answer generation quality. However, current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding. To encourage progress in addressing these challenges, the benchmark is released under a commercially permissive license at this https URL.
[CV-19] CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion AAAI2026
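[Quick read]: This paper addresses a central limitation in infrared and visible image fusion: existing methods either focus on pixel-level fusion and overlook downstream task adaptability, or implicitly learn rigid semantics through cascaded detection or segmentation models, and so cannot respond flexibly to diverse semantic target perception needs. The key to the solution is CtrlFuse, a framework for interactive dynamic fusion guided by mask prompts. It consists of a multi-modal feature extractor, a Reference Prompt Encoder (RPE), and a Prompt-Semantic Fusion Module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models with input mask guidance, and the PSFM explicitly injects these semantics into the fusion features, so that the parallel segmentation and fusion branches are optimized jointly and both fusion controllability and task performance improve.
Link: https://arxiv.org/abs/2601.08619
Authors: Yiming Sun, Yuan Ruan, Qinghua Hu, Pengfei Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 22 figures, published at AAAI 2026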
Abstract:Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities, enhancing environmental awareness for intelligent unmanned systems. Existing methods either focus on pixel-level fusion while overlooking downstream task adaptability or implicitly learn rigid semantics through cascaded detection/segmentation models, unable to interactively address diverse semantic target perception needs. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. The model integrates a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models with input mask guidance, while the PSFM explicitly injects these semantics into fusion features. Through synergistic optimization of parallel segmentation and fusion branches, our method achieves mutual enhancement between task performance and fusion quality. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
[CV-20] SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
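【Quick read】: This paper addresses the poor calibration of uncertainty estimates when vision-language models (VLMs) undergo test-time prompt tuning (TPT). Existing methods mostly aim to improve discriminative performance and neglect calibration. In particular, recent methods based on full orthogonality constraints improve class separability, but their gradients force semantically related classes apart and make the model overconfident. The paper proposes Semantic Orthogonal Calibration (SoC), whose core is a Huber-based regularizer that separates prototypes smoothly while preserving semantic proximity, improving calibration without sacrificing discriminative ability.
Link: https://arxiv.org/abs/2601.08617
Authors: Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz
Affiliations: MICS, CentraleSupélec, Université Paris-Saclay, France; LIVIA, ILLS, ÉTS Montréal, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: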
Abstract:With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art advocates for enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
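The abstract argues that a full-orthogonality penalty pushes the most similar (semantically related) prototypes apart hardest, while a Huber-type penalty does not. The sketch below illustrates only that gradient argument. It assumes the penalty is applied to the cosine similarity of a pair of text prototypes; the exact SoC regularizer is not given in the abstract, so `huber`, `huber_grad`, `squared_grad`, and the threshold `delta = 0.3` are hypothetical and not the authors' formulation.

```python
def huber(x, delta):
    """Huber penalty: quadratic near zero, linear beyond |x| = delta."""
    a = abs(x)
    if a <= delta:
        return 0.5 * x * x
    return delta * (a - 0.5 * delta)


def huber_grad(x, delta):
    """Derivative of the Huber penalty; its magnitude is capped at delta."""
    if abs(x) <= delta:
        return x
    return delta if x > 0 else -delta


def squared_grad(x):
    """Derivative of 0.5 * x**2, a stand-in for a full-orthogonality penalty."""
    return x


delta = 0.3            # hypothetical threshold
related_sim = 0.9      # cosine similarity of two semantically close classes
unrelated_sim = 0.1    # cosine similarity of two unrelated classes

# The squared penalty pushes the related pair nine times harder than the
# unrelated pair; the Huber penalty caps the push on the related pair at delta.
push_squared = (squared_grad(related_sim), squared_grad(unrelated_sim))
push_huber = (huber_grad(related_sim, delta), huber_grad(unrelated_sim, delta))
```

With these numbers the repulsive gradient on the related pair drops from 0.9 to 0.3, while the unrelated pair is treated identically under both penalties.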
[CV-21] VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking
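【Quick read】: This paper addresses the static nature of current benchmarks for automated fact-checking (AFC): because they never change, their claims leak into the pretraining data of large language models (LLMs), so benchmark scores no longer reflect a system's real ability to verify newly emerging misinformation. The key to the solution is VeriTaS, the first dynamic fact-checking benchmark. A fully automated seven-stage pipeline updates it quarterly; it covers 24,000 real-world claims from 108 professional fact-checking organizations across 54 languages, and it uses a novel standardized, disentangled scoring scheme with textual justifications, so that evaluation stays robust and reliable as foundation models continue to be pretrained.
Link: https://arxiv.org/abs/2601.08611
Authors: Mark Rothermel, Marcus Kornmann, Marcus Rohrbach, Anna Rohrbach
Affiliations: Multimodal AI Lab; Technical University of Darmstadt; hessian.AI
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Preprint under review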
Abstract:The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in terms of task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static, thus subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects the actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 24,000 real-world claims from 108 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to update VeriTaS in the future, establishing a leakage-resistant benchmark, supporting meaningful AFC evaluation in the era of rapidly evolving foundation models. We will make the code and data publicly available.
[CV-22] SfMamba: Efficient Source-Free Domain Adaptation via Selective Scan Modeling
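【Quick read】: This paper addresses two problems in source-free domain adaptation (SFDA), where source data are unavailable: the trade-off between perception field and computational efficiency in domain-invariant feature learning, and the limitations of the existing Visual Mamba (VMamba) in capturing channel-wise frequency characteristics and in staying spatially robust under large domain shifts. The key to the solution is the SfMamba framework. First, a Channel-wise Visual State-Space block performs channel-sequence scanning for more stable domain-invariant feature extraction. Second, a Semantic-Consistent Shuffle strategy disrupts background patch sequences in the 2D selective scan while preserving prediction consistency, which suppresses error accumulation. The method outperforms existing methods on multiple benchmarks while remaining parameter-efficient.
Link: https://arxiv.org/abs/2601.08608
Authors: Xi Chen, Hongxun Yao, Sicheng Zhao, Jiankun Zhu, Jing Jiang, Kui Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: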
Abstract:Source-free domain adaptation (SFDA) tackles the critical challenge of adapting source-pretrained models to unlabeled target domains without access to source data, overcoming data privacy and storage limitations in real-world applications. However, existing SFDA approaches struggle with the trade-off between perception field and computational efficiency in domain-invariant feature learning. Recently, Mamba has offered a promising solution through its selective scan mechanism, which enables long-range dependency modeling with linear complexity. However, the Visual Mamba (i.e., VMamba) remains limited in capturing channel-wise frequency characteristics critical for domain alignment and maintaining spatial robustness under significant domain shifts. To address these, we propose a framework called SfMamba to fully explore the stable dependency in source-free model transfer. SfMamba introduces Channel-wise Visual State-Space block that enables channel-sequence scanning for domain-invariant feature extraction. In addition, SfMamba involves a Semantic-Consistent Shuffle strategy that disrupts background patch sequences in 2D selective scan while preserving prediction consistency to mitigate error accumulation. Comprehensive evaluations across multiple benchmarks show that SfMamba achieves consistently stronger performance than existing methods while maintaining favorable parameter efficiency, offering a practical solution for SFDA. Our code is available at this https URL.
[CV-23] Interpretability and Individuality in Knee MRI: Patient-Specific Radiomic Fingerprint with Reconstructed Healthy Personas
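【Quick read】: This paper addresses the difficulty of achieving both accuracy and interpretability in automated assessment of knee magnetic resonance imaging (MRI). Traditional radiomics relies on features predefined at the population level; it is interpretable but struggles to capture individual variation, which limits performance. End-to-end deep learning (DL) models are accurate, but their black-box nature often hinders clinical adoption. The key to the solution is two complementary strategies. The first, the radiomic fingerprint, uses an image-conditioned predictor to select the most relevant feature subset for each patient and classifies with a logistic regression that has global coefficients, improving per-patient adaptability while keeping feature-level interpretability. The second, the healthy persona, uses a diffusion model to generate a pathology-free baseline for each patient; comparing features of the pathological image with those of its persona gives intuitive, case-specific explanations. Together they improve performance and multi-level interpretability and support human-explainable biomarker discovery and pathology localisation.
Link: https://arxiv.org/abs/2601.08604
Authors: Yaxi Chen, Simin Ni, Shuai Li, Shaheer U. Saeed, Aleksandra Ivanova, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: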
Abstract:For automated assessment of knee MRI scans, both accuracy and interpretability are essential for clinical use and adoption. Traditional radiomics rely on predefined features chosen at the population level; while more interpretable, they are often too restrictive to capture patient-specific variability and can underperform end-to-end deep learning (DL). To address this, we propose two complementary strategies that bring individuality and interpretability: radiomic fingerprints and healthy personas. First, a radiomic fingerprint is a dynamically constructed, patient-specific feature set derived from MRI. Instead of applying a uniform population-level signature, our model predicts feature relevance from a pool of candidate features and selects only those most predictive for each patient, while maintaining feature-level interpretability. This fingerprint can be viewed as a latent-variable model of feature usage, where an image-conditioned predictor estimates usage probabilities and a transparent logistic regression with global coefficients performs classification. Second, a healthy persona synthesises a pathology-free baseline for each patient using a diffusion model trained to reconstruct healthy knee MRIs. Comparing features extracted from pathological images against their personas highlights deviations from normal anatomy, enabling intuitive, case-specific explanations of disease manifestations. We systematically compare fingerprints, personas, and their combination across three clinical tasks. Experimental results show that both approaches yield performance comparable to or surpassing state-of-the-art DL models, while supporting interpretability at multiple levels. Case studies further illustrate how these perspectives facilitate human-explainable biomarker discovery and pathology localisation.
[CV-24] WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation
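【Quick read】: This paper addresses the fact that, in Transformer-based vision modeling, attention captures visual dependencies but lacks a principled account of how semantic information propagates spatially. The core idea is a wave-equation view: feature maps are treated as spatial signals that evolve with network depth, and their propagation is described by an underdamped wave equation. This explicitly models the hierarchy of spatial frequencies, from low-frequency global layout to high-frequency edges and textures, and controls how frequency interacts with propagation time. The key innovations are a closed-form, frequency-time decoupled solution and a lightweight Wave Propagation Operator (WPO) that models global interactions in O(N log N) time, lower than attention. On image classification, object detection, and semantic segmentation it reaches comparable accuracy with higher throughput and fewer FLOPs.
Link: https://arxiv.org/abs/2601.08602
Authors: Zishan Shu, Juntong Wu, Wei Yan, Xudong Liu, Hongyu Zhang, Chang Liu, Youdong Mao, Jie Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: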
Abstract:Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency-time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far lower than attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat-based methods, effectively capturing both global coherence and high-frequency details essential for rich visual semantics. Codes are available at: this https URL.
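As a worked illustration of what a closed-form, frequency-time decoupled solution can look like, consider a generic damped wave equation. The symbols below (damping α, wave speed v, initial state û₀ and initial velocity ŵ₀ in the Fourier domain) are assumptions chosen for this sketch; the paper's exact parametrization may differ.

```latex
\begin{aligned}
&\partial_t^2 u + \alpha\,\partial_t u = v^2 \nabla^2 u
\quad\Longrightarrow\quad
\partial_t^2 \hat{u} + \alpha\,\partial_t \hat{u}
  + v^2 \lVert\boldsymbol{\omega}\rVert^2\,\hat{u} = 0
\quad\text{(Fourier transform in space)}, \\[4pt]
&\hat{u}(\boldsymbol{\omega}, t)
  = e^{-\alpha t/2}\left[\hat{u}_0 \cos(\omega_d t)
  + \frac{\hat{w}_0 + \tfrac{\alpha}{2}\,\hat{u}_0}{\omega_d}\,\sin(\omega_d t)\right],
\qquad
\omega_d = \sqrt{v^2 \lVert\boldsymbol{\omega}\rVert^2 - \alpha^2/4}, \\[4pt]
&\text{with } \hat{u}_0 = \hat{u}(\boldsymbol{\omega}, 0),\quad
\hat{w}_0 = \partial_t \hat{u}(\boldsymbol{\omega}, 0),\quad
\text{valid for underdamped modes } v\lVert\boldsymbol{\omega}\rVert > \alpha/2 .
\end{aligned}
```

In this form each spatial frequency evolves independently with its own decay and oscillation, so evaluating the solution at a chosen time needs only a forward and an inverse FFT plus a per-frequency multiplication, which is consistent with an O(N log N) cost.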
[CV-25] End-to-End Video Character Replacement without Structural Guidance
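【Quick read】: This paper addresses controllable video character replacement, which is hard because paired video data are lacking; in complex scenes (occlusions, character-object interactions, unusual poses, or difficult lighting), traditional reconstruction-based methods tend to produce visual artifacts and temporal inconsistencies. The key to the solution is the MoCha framework, which needs only a single arbitrary frame mask. It introduces a condition-aware rotary position embedding (condition-aware RoPE) to adapt the multi-modal input condition, and a reinforcement learning (RL) post-training stage to improve facial identity consistency. To ease the scarcity of high-quality paired data, it also designs a data construction pipeline that yields a high-fidelity rendered dataset, an expression-driven animation dataset, and an augmented dataset derived from existing video-mask pairs, which improves generalization and generation quality.
Link: https://arxiv.org/abs/2601.08587
Authors: Zhengbo Xu, Jie Ma, Ziheng Wang, Zhan Peng, Jun Liang, Jing Li
Affiliations: Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures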
Abstract:Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: this http URL
[CV-26] REVNET: Rotation-Equivariant Point Cloud Completion via Vector Neuron Anchor Transformer
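【Quick read】: This paper addresses 3D point cloud completion, where incomplete sensor captures lose geometric and semantic information, and in particular the unstable performance of existing methods under arbitrary rotations. Most current methods are built on rotation-variant frameworks trained in canonical poses and adapt poorly to the arbitrary rotations seen in real scenes. To overcome this, the authors propose the Rotation-Equivariant Anchor Transformer (REVNET). Its core contributions are: 1) representing the partial point cloud as a set of equivariant anchors and designing a Vector Neuron (VN) Missing Anchor Transformer to predict the positions and features of missing anchors; 2) adding a rotation-equivariant bias and a ZCA-based layer normalization to improve feature expressiveness; 3) using the flexible conversion between equivariant and invariant VN features to generate coordinates more stably. The method needs no input pose alignment; it outperforms state-of-the-art approaches on the synthetic MVP dataset in the equivariant setting and is competitive with non-equivariant networks on the real-world KITTI dataset.
Link: https://arxiv.org/abs/2601.08558
Authors: Zhifan Ni, Eckehard Steinbach
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: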
Abstract:Incomplete point clouds captured by 3D sensors often result in the loss of both geometric and semantic information. Most existing point cloud completion methods are built on rotation-variant frameworks trained with data in canonical poses, limiting their applicability in real-world scenarios. While data augmentation with random rotations can partially mitigate this issue, it significantly increases the learning burden and still fails to guarantee robust performance under arbitrary poses. To address this challenge, we propose the Rotation-Equivariant Anchor Transformer (REVNET), a novel framework built upon the Vector Neuron (VN) network for robust point cloud completion under arbitrary rotations. To preserve local details, we represent partial point clouds as sets of equivariant anchors and design a VN Missing Anchor Transformer to predict the positions and features of missing anchors. Furthermore, we extend VN networks with a rotation-equivariant bias formulation and a ZCA-based layer normalization to improve feature expressiveness. Leveraging the flexible conversion between equivariant and invariant VN features, our model can generate point coordinates with greater stability. Experimental results show that our method outperforms state-of-the-art approaches on the synthetic MVP dataset in the equivariant setting. On the real-world KITTI dataset, REVNET delivers competitive results compared to non-equivariant networks, without requiring input pose alignment. The source code will be released on GitHub under URL: this https URL.
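The rotation equivariance that Vector Neuron networks build on can be checked numerically in a few lines. The sketch below is not REVNET itself: it assumes the basic VN setup in which a feature is a list of 3D vectors (one per channel) and a linear layer mixes channels only, and it shows why an ordinary constant bias breaks equivariance, which is the gap a rotation-equivariant bias formulation has to close. All values and names (`features`, `weights`, `bias`) are made up for illustration.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rotation_z(theta):
    """3x3 rotation about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def max_abs_gap(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


features = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]    # 2 channels of 3D vectors
weights = [[0.2, -0.7], [1.5, 0.3], [0.0, 1.0]]   # maps 2 channels to 3
rot = rotation_z(0.8)

# Channel mixing commutes with rotation: W (V R) == (W V) R.
rotate_then_map = matmul(weights, matmul(features, rot))
map_then_rotate = matmul(matmul(weights, features), rot)
linear_gap = max_abs_gap(rotate_then_map, map_then_rotate)

# A constant per-channel bias does not rotate with the input, so it breaks this.
bias = [[0.1, 0.1, 0.1] for _ in range(3)]
biased_rotate_first = add(rotate_then_map, bias)
biased_rotate_last = matmul(add(matmul(weights, features), bias), rot)
bias_gap = max_abs_gap(biased_rotate_first, biased_rotate_last)
```

Here `linear_gap` is zero up to floating-point error, while `bias_gap` is clearly non-zero.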
[CV-27] VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations
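【Quick read】: This paper addresses the frequent, high-confidence hallucinations of video-capable vision-language models (Video-VLMs) in video question answering, where existing uncertainty metrics often fail to align with correctness. The key to the solution is VideoHEDGE, a modular framework that extends entropy-based reliability estimation to temporally structured inputs. For each video-question pair it generates a baseline answer and several high-temperature samples from both the clean clip and photometrically and spatiotemporally perturbed variants, then clusters the text outputs into semantic hypotheses using natural language inference (NLI) or embeddings. Three reliability scores are computed from the cluster-level probability masses: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). VASE performs best on the SoccerChat benchmark, especially at larger distortion budgets.
Link: https://arxiv.org/abs/2601.08557
Authors: Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen
Affiliations: SimulaMet; OsloMet; Forzasys; Simula Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: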
Abstract:Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at this https URL .
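The step from semantic clusters to an entropy score can be sketched as follows. This is a minimal illustration, not the hedge-bench implementation: it assumes each sampled answer has already been assigned a cluster id (by NLI or embeddings) and, by default, that every sample carries equal weight. The function name `semantic_entropy` and the example cluster ids are hypothetical, and RadFlag and VASE are not covered.

```python
import math
from collections import defaultdict


def semantic_entropy(cluster_ids, weights=None):
    """Shannon entropy (in nats) of the probability mass per semantic cluster.

    cluster_ids: one cluster label per sampled answer.
    weights: optional non-negative weight per sample (assumed uniform if omitted).
    """
    if weights is None:
        weights = [1.0] * len(cluster_ids)
    mass = defaultdict(float)
    for cid, w in zip(cluster_ids, weights):
        mass[cid] += w
    total = sum(mass.values())
    entropy = 0.0
    for m in mass.values():
        if m > 0:
            p = m / total
            entropy -= p * math.log(p)
    return entropy


# All samples agree semantically: zero entropy, the answer looks reliable.
agree = semantic_entropy([0, 0, 0, 0])
# Every sample lands in a different cluster: maximal entropy, log(4).
disagree = semantic_entropy([0, 1, 2, 3])
# Three samples agree and one dissents: somewhere in between.
mostly_agree = semantic_entropy([0, 0, 0, 1])
```

A higher score means the sampled answers spread over more distinct meanings, which is the signal such a detector would threshold to flag a likely hallucination.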
[CV-28] Keyframe-based Dense Mapping with the Graph of View-Dependent Local Maps ICRA2020
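【Quick read】: This paper addresses the difficulty of balancing accuracy and efficiency in real-time mapping with RGB-D sensors, in particular how to use the spatial resolution and uncertainty model of RGB-D cameras to make local map representations more accurate and how to achieve globally consistent optimization. The key points of the solution are: first, a keyframe-based strategy updates local Normal Distribution Transform (NDT) maps from RGB-D data; second, the NDT cells are stored in 2D view-dependent structures to exploit the view-dependent properties and measurement uncertainty model of RGB-D cameras, so objects closer to the camera origin are represented with higher precision; finally, the local maps are kept in a pose graph so the global map can be corrected after loop closure detection, and a procedure for merging and filtering local maps produces the global map of the environment.
Link: https://arxiv.org/abs/2601.08520
Authors: Krzysztof Zielinski, Dominik Belter
Affiliations: Poznan University of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICRA 2020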
Abstract:In this article, we propose a new keyframe-based mapping system. The proposed method updates local Normal Distribution Transform maps (NDT) using data from an RGB-D sensor. The cells of the NDT are stored in 2D view-dependent structures to better utilize the properties and uncertainty model of RGB-D cameras. This method naturally represents an object closer to the camera origin with higher precision. The local maps are stored in the pose graph which allows correcting global map after loop closure detection. We also propose a procedure that allows merging and filtering local maps to obtain a global map of the environment. Finally, we compare our method with Octomap and NDT-OM and provide example applications of the proposed mapping method.
[CV-29] CD2: Constrained Dataset Distillation for Few-Shot Class-Incremental Learning
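【Quick read】: This paper addresses catastrophic forgetting in few-shot class-incremental learning (FSCIL), where a model that learns new classes severely forgets earlier ones. Existing methods usually store past information in an external memory and treat it the same as the incremental classes, so they fail to preserve essential old-class knowledge. The paper proposes Constrained Dataset Distillation (CD²), a framework with two modules: the Dataset Distillation Module (DDM) synthesizes highly condensed samples that guide the model to extract compact, essential class cues from a few incremental samples; the Distillation Constraint Module (DCM) uses a dedicated loss to constrain the distribution of previously learned classes so that the distilled knowledge is preserved more fully. This improves performance and stability in continual learning.
Link: https://arxiv.org/abs/2601.08519
Authors: Kexin Bao, Daichi Zhang, Hansong Zhang, Yong Li, Yutao Yue, Shiming Ge
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: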
Abstract:Few-shot class-incremental learning (FSCIL) receives significant attention from the public to perform classification continuously with a few training samples, which suffers from the key catastrophic forgetting problem. Existing methods usually employ an external memory to store previous knowledge and treat it with incremental classes equally, which cannot properly preserve previous essential knowledge. To solve this problem and inspired by recent distillation works on knowledge transfer, we propose a framework termed Constrained Dataset Distillation (CD²) to facilitate FSCIL, which includes a dataset distillation module (DDM) and a distillation constraint module (DCM). Specifically, the DDM synthesizes highly condensed samples guided by the classifier, forcing the model to learn compacted essential class-related clues from a few incremental samples. The DCM introduces a designed loss to constrain the previously learned class distribution, which can preserve distilled knowledge more sufficiently. Extensive experiments on three public datasets show the superiority of our method against other state-of-the-art competitors.
[CV-30] Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models
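【Quick read】: This paper addresses channel configuration search in deep neural networks, a complex combinatorial optimization problem constrained by tensor shape compatibility and computational budgets, whose large design space traditional heuristics explore poorly. The key to the solution is a neural architecture search (NAS) framework driven by large language models (LLMs): the search is cast as a sequence of conditional code generation tasks in which the LLM iteratively refines architectural specifications from performance telemetry. To overcome data scarcity, Abstract Syntax Tree (AST) mutations generate a large corpus of valid, shape-consistent networks as training data, from which the LLM learns the latent relationship between channel configurations and model performance and internalizes domain-specific design priors.
Link: https://arxiv.org/abs/2601.08517
Authors: Tolgay Atinc Uzun, Dmitry Ignatov, Radu Timofte
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: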
Abstract:Channel configuration search, the optimization of layer specifications such as layer widths in deep neural networks, presents a complex combinatorial challenge constrained by tensor shape compatibility and computational budgets. We posit that Large Language Models (LLMs) offer a transformative approach to Neural Architecture Search (NAS), capable of reasoning about architectural code structure in ways that traditional heuristics cannot. In this paper, we investigate the application of an LLM-driven NAS framework to the problem of channel configuration. We formulate the search as a sequence of conditional code generation tasks, where an LLM refines architectural specifications based on performance telemetry. Crucially, we address the data scarcity problem by generating a vast corpus of valid, shape-consistent architectures via Abstract Syntax Tree (AST) mutations. While these mutated networks are not necessarily high-performing, they provide the critical volume of structural data required for the LLM to learn the latent relationship between channel configurations and model performance. This allows the LLM to internalize complex design patterns and apply them to optimize feature extraction strategies. Experimental results on CIFAR-100 validate the efficacy of this approach, demonstrating that the model yields statistically significant improvements in accuracy. Our analysis confirms that the LLM successfully acquires domain-specific architectural priors, distinguishing this method from random search and highlighting the immense potential of language-driven design in deep learning.
[CV-31] EfficientFSL: Enhancing Few-Shot Classification via Query-Only Tuning in Vision Transformers AAAI2026
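【Quick read】: This paper addresses the high compute cost and long training time that large Vision Transformers (ViTs) incur in few-shot classification because of their parameter count, which limits their use in low-resource settings. The key to the solution is EfficientFSL, an efficient framework that tunes only the queries: a lightweight trainable Forward Block generates task-specific queries that extract information from the intermediate-layer features of the pre-trained model; a Combine Block fuses multi-layer outputs to make the feature representation deeper and more robust; and a Support-Query Attention Block mitigates distribution shift by aligning the prototypes with the query-set distribution. With very few trainable parameters, the method achieves state-of-the-art results on four in-domain and six cross-domain few-shot datasets.
Link: https://arxiv.org/abs/2601.08499
Authors: Wenwen Liao, Hang Ruan
Affiliations: Fudan University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted/To be presented at AAAI 2026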
Abstract:Large models such as Vision Transformers (ViTs) have demonstrated remarkable superiority over smaller architectures like ResNet in few-shot classification, owing to their powerful representational capacity. However, fine-tuning such large models demands extensive GPU memory and prolonged training time, making them impractical for many real-world low-resource scenarios. To bridge this gap, we propose EfficientFSL, a query-only fine-tuning framework tailored specifically for few-shot classification with ViT, which achieves competitive performance while significantly reducing computational overhead. EfficientFSL fully leverages the knowledge embedded in the pre-trained model and its strong comprehension ability, achieving high classification accuracy with an extremely small number of tunable parameters. Specifically, we introduce a lightweight trainable Forward Block to synthesize task-specific queries that extract informative features from the intermediate representations of the pre-trained model in a query-only manner. We further propose a Combine Block to fuse multi-layer outputs, enhancing the depth and robustness of feature representations. Finally, a Support-Query Attention Block mitigates distribution shift by adjusting prototypes to align with the query set distribution. With minimal trainable parameters, EfficientFSL achieves state-of-the-art performance on four in-domain few-shot datasets and six cross-domain datasets, demonstrating its effectiveness in real-world applications.
[CV-32] PKI: Prior Knowledge-Infused Neural Network for Few-Shot Class-Incremental Learning
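【Quick read】: This paper addresses two challenges in few-shot class-incremental learning (FSCIL): catastrophic forgetting and overfitting to new classes. Existing methods usually freeze most of the network and fine-tune the rest with an extra memory, but struggle to balance retaining old knowledge against adapting to new knowledge. The paper proposes the Prior Knowledge-Infused Neural Network (PKI), whose core innovation is an ensemble of projectors: each incremental session adds one new projector to the ensemble and fine-tunes only that projector and the classifier, with all other components frozen. Cascading the projectors combines the prior knowledge accumulated in earlier sessions with new knowledge, which helps recognize old classes and learn new ones efficiently. To reduce resource consumption, the authors also design two variants (PKIV-1 and PKIV-2) that use fewer projectors to trade off resource consumption against performance.
Link: https://arxiv.org/abs/2601.08493
Authors: Kexin Bao, Fanzhao Lin, Zichen Wang, Yong Li, Dan Zeng, Shiming Ge
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; State Grid Economic and Energy Research Institute; George Washington University; Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: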
Abstract:Few-shot class-incremental learning (FSCIL) aims to continually adapt a model on a limited number of new-class examples, facing two well-known challenges: catastrophic forgetting and overfitting to new classes. Existing methods tend to freeze more parts of network components and finetune others with an extra memory during incremental sessions. These methods emphasize preserving prior knowledge to ensure proficiency in recognizing old classes, thereby mitigating catastrophic forgetting. Meanwhile, constraining fewer parameters can help in overcoming overfitting with the assistance of prior knowledge. Following previous methods, we retain more prior knowledge and propose a prior knowledge-infused neural network (PKI) to facilitate FSCIL. PKI consists of a backbone, an ensemble of projectors, a classifier, and an extra memory. In each incremental session, we build a new projector and add it to the ensemble. Subsequently, we finetune the new projector and the classifier jointly with other frozen network components, ensuring the rich prior knowledge is utilized effectively. By cascading projectors, PKI integrates prior knowledge accumulated from previous sessions and learns new knowledge flexibly, which helps to recognize old classes and efficiently learn new classes. Further, to reduce the resource consumption associated with keeping many projectors, we design two variants of the prior knowledge-infused neural network (PKIV-1 and PKIV-2) to trade off a balance between resource consumption and performance by reducing the number of projectors. Extensive experiments on three popular benchmarks demonstrate that our approach outperforms state-of-the-art methods.
[CV-33] An IoT-Enabled Smart Aquarium System for Real-Time Water Quality Monitoring and Automated Feeding
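【Quick read】: This paper addresses the inefficiency, labor intensity, and error-proneness of manual water quality monitoring in aquariums, with the aim of protecting aquatic health. The key to the solution is a smart aquarium system based on the Internet of Things (IoT). Built around an ESP32 microcontroller, it integrates pH, TDS, temperature, and turbidity sensors with actuators such as a servo feeder and a water pump for real-time monitoring and automated control of water quality. Edge processing, cloud connectivity through the Blynk platform, and an alert mechanism with configurable cooldown periods improve responsiveness and reliability. In a 10-liter aquarium the system achieved 96% average sensor accuracy and a 1.2-second anomaly detection response time, and the automated feeding and circulation modules ran with 97% operational reliability, reducing manual intervention while keeping aquatic conditions stable.
Link: https://arxiv.org/abs/2601.08484
Authors: MD Fatin Ishraque Ayon, Sabrin Nahar, Ataur Rahman, Md. Taslim Arif, Abdul Hasib, A. S. M. Ahsanul Sarkar Akib
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: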
Abstract:Maintaining optimal water quality in aquariums is critical for aquatic health but remains challenging due to the need for continuous monitoring of multiple parameters. Traditional manual methods are inefficient, labor-intensive, and prone to human error, often leading to suboptimal aquatic conditions. This paper presents an IoT-based smart aquarium system that addresses these limitations by integrating an ESP32 microcontroller with multiple sensors (pH, TDS, temperature, turbidity) and actuators (servo feeder, water pump) for comprehensive real-time water quality monitoring and automated control. The system architecture incorporates edge processing capabilities, cloud connectivity via Blynk IoT platform, and an intelligent alert mechanism with configurable cooldown periods to prevent notification fatigue. Experimental evaluation in a 10-liter aquarium environment demonstrated the system’s effectiveness, achieving 96% average sensor accuracy and 1.2-second response time for anomaly detection. The automated feeding and water circulation modules maintained 97% operational reliability throughout extended testing, significantly reducing manual intervention while ensuring stable aquatic conditions. This research demonstrates that cost-effective IoT solutions can revolutionize aquarium maintenance, making aquatic ecosystem management more accessible, reliable, and efficient for both residential and commercial applications.
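An alert mechanism with a configurable cooldown, as mentioned in the abstract, can be sketched as a small state machine. This is an illustration in Python rather than the system's ESP32 firmware, and it does not use the Blynk API; the class name `CooldownAlert`, the pH limits 6.5 to 8.0, and the 300-second cooldown are all assumptions made for the example.

```python
class CooldownAlert:
    """Raise an alert when a reading leaves [low, high], at most once per cooldown."""

    def __init__(self, low, high, cooldown_s):
        self.low = low
        self.high = high
        self.cooldown_s = cooldown_s
        self._last_sent = None  # time of the last alert, None if none sent yet

    def check(self, value, now_s):
        """Return True if an alert should be sent for this reading."""
        if self.low <= value <= self.high:
            return False  # reading is within the safe range
        if self._last_sent is not None and now_s - self._last_sent < self.cooldown_s:
            return False  # still cooling down: suppress the repeat notification
        self._last_sent = now_s
        return True


# Hypothetical pH limits and cooldown; readings are (value, time in seconds).
ph_alert = CooldownAlert(low=6.5, high=8.0, cooldown_s=300)
readings = [(7.0, 0), (8.6, 10), (8.7, 60), (8.8, 400), (7.2, 500)]
events = [ph_alert.check(value, t) for value, t in readings]
```

The second out-of-range reading at 60 s is suppressed because it falls inside the cooldown window that started at 10 s, which is how such a mechanism avoids notification fatigue.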
[CV-34] DiffMM: Efficient Method for Accurate Noisy and Sparse Trajectory Map Matching via One Step Diffusion AAAI-26
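【Quick read】: This paper addresses map matching for sparse trajectories, a basic problem for many trajectory-based applications such as traffic scheduling and traffic flow analysis. Existing methods mostly rely on the Hidden Markov Model (HMM) or encoder-decoder frameworks and perform poorly on noisy or sparsely sampled GPS trajectories. The key to the solution is DiffMM, an encoder-diffusion-based framework that matches efficiently and accurately through a one-step diffusion process. Its core contributions are: 1) a road segment-aware trajectory encoder that uses attention to embed the input trajectory and its candidate road segments jointly in a shared latent space; 2) a shortcut model that takes this joint embedding as conditioning context and completes the matching in a single diffusion step, improving efficiency and accuracy, especially for complex road network topologies and sparse trajectories.
Link: https://arxiv.org/abs/2601.08482
Authors: Chenxu Han, Sean Bin Yang, Jilin Hu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI-26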
Abstract:Map matching for sparse trajectories is a fundamental problem for many trajectory-based applications, e.g., traffic scheduling and traffic flow analysis. Existing methods for map matching are generally based on Hidden Markov Model (HMM) or encoder-decoder framework. However, these methods continue to face significant challenges when handling noisy or sparsely sampled GPS trajectories. To address these limitations, we propose DiffMM, an encoder-diffusion-based map matching framework that produces effective yet efficient matching results through a one-step diffusion process. We first introduce a road segment-aware trajectory encoder that jointly embeds the input trajectory and its surrounding candidate road segments into a shared latent space through an attention mechanism. Next, we propose a one step diffusion method to realize map matching through a shortcut model by leveraging the joint embedding of the trajectory and candidate road segments as conditioning context. We conduct extensive experiments on large-scale trajectory datasets, demonstrating that our approach consistently outperforms state-of-the-art map matching methods in terms of both accuracy and efficiency, particularly for sparse trajectories and complex road network topologies.
[CV-35] Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models AAAI2026
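[Quick Read]: This paper addresses the limited robustness of zero-shot out-of-distribution (OOD) detection caused by the lack of labeled negatives. When vision-language models are deployed in open-world settings, existing methods based on fixed textual proxies struggle with the cross-modal misalignment and unstable predictions that distribution shift induces. The key to the solution is CoEvo, a training- and annotation-free test-time framework built around a proxy-aligned co-evolution mechanism that dynamically maintains and updates caches of textual and visual proxies: it mines context-relevant textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, it dynamically re-weights the contributions of the dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift.
Link: https://arxiv.org/abs/2601.08476
Authors: Hao Tang,Yu Liu,Shuanglin Yan,Fei Shen,Shengfeng He,Jing Qin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by AAAI 2026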
Abstract:Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
[CV-36] Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs
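[Quick Read]: This paper addresses the insufficient evaluation of how well Vision Language Models (VLMs) support safe decision-making in complex environments, in particular the lack of coverage of anomalous scenarios with spatio-temporal dynamics. Existing benchmarks do not adequately simulate hazardous situations involving the moving, intrusive, and distant objects that are common in the real world. The authors propose HazardForge, a scalable generation pipeline whose key idea is to combine image editing models with layout decision algorithms and validation modules to systematically synthesize hazardous scenarios with diverse motion characteristics. Using it, they build the MovSafeBench benchmark, which comprises 7,254 images and corresponding multiple-choice QA pairs covering both normal and anomalous objects. Experiments show that VLM performance drops most in tasks that require nuanced motion understanding.
Link: https://arxiv.org/abs/2601.08470
Authors: Takara Taniguchi,Kuniaki Saito,Atsushi Hashimoto
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: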
Abstract:Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce HazardForge, a scalable pipeline that leverages image editing models to generate these scenarios together with layout decision algorithms and validation modules. Using HazardForge, we construct MovSafeBench, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
[CV-37] Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling
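[Quick Read]: This paper addresses the poor real-world performance of distracted driver detectors based on Vision-Language Models (VLMs). The study finds that subject-specific appearance variations (e.g., clothing, age, and gender) are a key bottleneck: VLMs entangle these appearance features with behavior cues, so decisions depend more on who the driver is than on what the driver is doing. The core of the solution is a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding, thereby emphasizing distraction-relevant evidence. In addition, text embeddings are orthogonalized via metric projection onto the Stiefel manifold, which improves class separability while staying close to the original semantics. Experiments show consistent gains over prior baselines, indicating practical potential.
Link: https://arxiv.org/abs/2601.08467
Authors: Takamichi Miyata,Sumiko Miyata,Andrew Morris
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: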
Abstract:Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
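One simple way to picture "removing the influence" of an appearance embedding is vector rejection: subtract the component of the image embedding that lies along the appearance direction, then classify by cosine similarity. The sketch below assumes exactly that; the abstract does not say how the removal is done, so the function names and the projection itself are illustrative assumptions, and the Stiefel-manifold step for text embeddings is not shown.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def remove_direction(image_emb, appearance_emb):
    """Subtract the component of image_emb along appearance_emb (assumed decoupling)."""
    a = normalize(appearance_emb)
    coeff = sum(x * y for x, y in zip(image_emb, a))
    return [x - coeff * y for x, y in zip(image_emb, a)]

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the text embedding with the highest cosine similarity."""
    img = normalize(image_emb)
    sims = [sum(x * y for x, y in zip(img, normalize(t))) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)
```

With a toy image embedding dominated by the appearance direction, the predicted class can change once that direction is removed, which is the effect the abstract describes as shifting decisions from who the driver is to what the driver is doing.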
[CV-38] CoMa: Contextual Massing Generation with Vision-Language Models
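[Quick Read]: This paper addresses the complexity of the conceptual design phase in architecture and urban planning, especially building massing, which relies heavily on designer intuition and manual work. The key to the solution is an automated framework that generates building massing from functional requirements and site context, together with the CoMa-20K dataset, which includes detailed massing geometries, economic and programmatic data, and visual representations of the site within its existing urban context. The work formulates massing generation as a conditional generation task for Vision-Language Models (VLMs) and benchmarks both fine-tuned and large zero-shot models, showing the potential of VLMs to produce context-sensitive massing options.
Link: https://arxiv.org/abs/2601.08464
Authors: Evgenii Maslov,Valentin Khrulkov,Anastasia Volkova,Anton Gusarov,Andrey Kuznetsov,Ivan Oseledets
Affiliations: FusionBrain Lab; Innopolis University; Institute of Numerical Mathematics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code and dataset will be released later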
Abstract:The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economical and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.
[CV-39] Modality-Decoupled RGB-Thermal Object Detector via Query Fusion
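[Quick Read]: This paper addresses RGB-Thermal (RGB-T) object detection under extreme conditions, where degradation of one modality disturbs detection; that is, how to separate modalities effectively to suppress noise while still exploiting cross-modal complementary information. The key to the solution is a Modality-Decoupled RGB-T detection framework with Query Fusion (MDQF). In each refinement stage, after query selection and adaptation, high-quality queries from one modality branch are fed into the other branch for fusion, which dynamically excludes the degraded modality and corrects predictions. The decoupled structure also allows each branch to be optimized with unpaired RGB or thermal infrared (TIR) images, removing the dependence on strictly paired RGB-T data.
Link: https://arxiv.org/abs/2601.08458
Authors: Chao Tian,Zikun Zhou,Chao Yang,Guoqing Zhu,Fu’an Zhong,Zhenyu He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: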
Abstract:The advantage of RGB-Thermal (RGB-T) detection lies in its ability to perform modality fusion and integrate cross-modality complementary information, enabling robust detection under diverse illumination and weather conditions. However, under extreme conditions where one modality exhibits poor quality and disturbs detection, modality separation is necessary to mitigate the impact of noise. To address this problem, we propose a Modality-Decoupled RGB-T detection framework with Query Fusion (MDQF) to balance modality complementation and separation. In this framework, DETR-like detectors are employed as separate branches for the RGB and TIR images, with query fusion interspersed between the two branches in each refinement stage. Herein, query fusion is performed by feeding the high-quality queries from one branch to the other one after query selection and adaptation. This design effectively excludes the degraded modality and corrects the predictions using high-quality queries. Moreover, the decoupled framework allows us to optimize each individual branch with unpaired RGB or TIR images, eliminating the need for paired RGB-T data. Extensive experiments demonstrate that our approach delivers superior performance to existing RGB-T detectors and achieves better modality independence.
[CV-40] Developing Predictive and Robust Radiomics Models for Chemotherapy Response in High-Grade Serous Ovarian Carcinoma
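[Quick Read]: This paper addresses the fact that about 40% of patients with high-grade serous ovarian carcinoma (HGSOC) show limited response to neoadjuvant chemotherapy (NACT). To improve prediction of NACT response, the study proposes a radiomics framework that integrates multiple feature selection methods. Its key element is an automated randomisation algorithm that mimics inter-observer variability, so that feature robustness is ensured while prediction performance is optimized, yielding more reliable models for clinical decision support.
Link: https://arxiv.org/abs/2601.08455
Authors: Sepideh Hatamikia,Geevarghese George,Florian Schwarzhans,Amirreza Mahbod,Marika AV Reinius,Ali Abbasian Ardakani,Mercedes Jimenez-Linan,Satish Viswanath,Mireia Crispin-Ortuzar,Lorena Escudero Sanchez,Evis Sala,James D Brenton,Ramona Woitek
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 5 figures, 5 tables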
Abstract:Objectives: High-grade serous ovarian carcinoma (HGSOC) is typically diagnosed at an advanced stage with extensive peritoneal metastases, making treatment challenging. Neoadjuvant chemotherapy (NACT) is often used to reduce tumor burden before surgery, but about 40% of patients show limited response. Radiomics, combined with machine learning (ML), offers a promising non-invasive method for predicting NACT response by analyzing computed tomography (CT) imaging data. This study aimed to improve response prediction in HGSOC patients undergoing NACT by integrating different feature selection methods. Materials and methods: A framework for selecting robust radiomics features was introduced by employing an automated randomisation algorithm to mimic inter-observer variability, ensuring a balance between feature robustness and prediction accuracy. Four response metrics were used: chemotherapy response score (CRS), RECIST, volume reduction (VolR), and diameter reduction (DiaR). Lesions in different anatomical sites were studied. Pre- and post-NACT CT scans were used for feature extraction and model training on one cohort, and an independent cohort was used for external testing. Results: The best prediction performance was achieved using all lesions combined for VolR prediction, with an AUC of 0.83. Omental lesions provided the best results for CRS prediction (AUC 0.77), while pelvic lesions performed best for DiaR (AUC 0.76). Conclusion: The integration of robustness into the feature selection processes ensures the development of reliable models and thus facilitates the implementation of the radiomics models in clinical applications for HGSOC patients. Future work should explore further applications of radiomics in ovarian cancer, particularly in real-time clinical settings.
[CV-41] Divide and Conquer: Static-Dynamic Collaboration for Few-Shot Class-Incremental Learning
[Quick Read]: This paper addresses the stability-plasticity dilemma in Few-shot Class-Incremental Learning (FSCIL): how to balance retaining old knowledge with acquiring new knowledge under limited data. The key to the solution is a Static-Dynamic Collaboration (SDC) framework that splits FSCIL into two stages, the Static Retaining Stage (SRS) and the Dynamic Learning Stage (DLS). In SRS, an initial model is trained on the ample base-session data and its key part is preserved as static memory to retain fundamental old knowledge. In DLS, a dynamic projector is trained jointly with the static memory so that the model keeps adapting to new classes. Together the two stages improve retention of old knowledge and adaptation to new classes, achieving state-of-the-art performance on three public benchmarks and a real-world application dataset.
Link: https://arxiv.org/abs/2601.08448
Authors: Kexin Bao,Daichi Zhang,Yong Li,Dan Zeng,Shiming Ge
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Department of Communication Engineering, Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Few-shot class-incremental learning (FSCIL) aims to continuously recognize novel classes under limited data, which suffers from the key stability-plasticity dilemma: balancing the retention of old knowledge with the acquisition of new knowledge. To address this issue, we divide the task into two different stages and propose a framework termed Static-Dynamic Collaboration (SDC) to achieve a better trade-off between stability and plasticity. Specifically, our method divides the normal pipeline of FSCIL into Static Retaining Stage (SRS) and Dynamic Learning Stage (DLS), which harnesses old static and incremental dynamic class information, respectively. During SRS, we train an initial model with sufficient data in the base session and preserve the key part as static memory to retain fundamental old knowledge. During DLS, we introduce an extra dynamic projector jointly trained with the previous static memory. By employing both stages, our method achieves improved retention of old knowledge while continuously adapting to new classes. Extensive experiments on three public benchmarks and a real-world application dataset demonstrate that our method achieves state-of-the-art performance against other competitors.
[CV-42] Noise-Adaptive Regularization for Robust Multi-Label Remote Sensing Image Classification
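[Quick Read]: This paper addresses label noise in remote sensing (RS) multi-label classification (MLC) introduced by reliance on thematic products or crowdsourced annotation, and in particular the different effects of additive, subtractive, and mixed noise on training. Existing methods usually treat all noisy annotations as uniform supervision and cannot adapt learning behavior to the noise type. The key to the solution is Noise-Adaptive Regularization (NAR), a confidence-driven label handling mechanism that dynamically retains high-confidence label entries, temporarily deactivates moderate-confidence entries, and corrects low-confidence entries by flipping them. It is combined with Early-Learning Regularization (ELR) to stabilize training and limit overfitting to corrupted labels, improving robustness across noise scenarios, most notably under subtractive and mixed noise.
Link: https://arxiv.org/abs/2601.08446
Authors: Tom Burgert,Julia Henkel,Begüm Demir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to TGRS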
Abstract:The development of reliable methods for multi-label classification (MLC) has become a prominent research direction in remote sensing (RS). As the scale of RS data continues to expand, annotation procedures increasingly rely on thematic products or crowdsourced procedures to reduce the cost of manual annotation. While cost-effective, these strategies often introduce multi-label noise in the form of partially incorrect annotations. In MLC, label noise arises as additive noise, subtractive noise, or a combination of both in the form of mixed noise. Previous work has largely overlooked this distinction and commonly treats noisy annotations as supervised signals, lacking mechanisms that explicitly adapt learning behavior to different noise types. To address this limitation, we propose NAR, a noise-adaptive regularization method that explicitly distinguishes between additive and subtractive noise within a semi-supervised learning framework. NAR employs a confidence-based label handling mechanism that dynamically retains label entries with high confidence, temporarily deactivates entries with moderate confidence, and corrects low confidence entries via flipping. This selective attenuation of supervision is integrated with early-learning regularization (ELR) to stabilize training and mitigate overfitting to corrupted labels. Experiments across additive, subtractive, and mixed noise scenarios demonstrate that NAR consistently improves robustness compared with existing methods. Performance improvements are most pronounced under subtractive and mixed noise, indicating that adaptive suppression and selective correction of noisy supervision provide an effective strategy for noise robust learning in RS MLC.
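The retain / deactivate / flip rule described in the abstract can be sketched per label entry as below. This is an illustrative reading only: here "confidence" is assumed to be the model's predicted probability of the given label, and the two thresholds (0.2 and 0.7) are hypothetical values, not ones reported by the authors.

```python
def handle_labels(labels, probs, low=0.2, high=0.7):
    """Return (targets, mask) for one sample's multi-label vector.

    labels: noisy 0/1 annotations; probs: predicted probability of each class.
    mask[i] == 0 means entry i is left out of the loss for this step.
    """
    targets, mask = [], []
    for y, p in zip(labels, probs):
        conf = p if y == 1 else 1.0 - p    # model's agreement with the given label
        if conf >= high:
            targets.append(y)              # retain the annotation as-is
            mask.append(1)
        elif conf <= low:
            targets.append(1 - y)          # flip the likely-wrong annotation
            mask.append(1)
        else:
            targets.append(y)              # keep the value but deactivate it
            mask.append(0)
    return targets, mask
```

Flipping a 1 to 0 corresponds to correcting additive noise and flipping a 0 to 1 to correcting subtractive noise, so one rule covers both noise types; the masked entries contribute nothing to the loss until the model becomes more certain.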
[CV-43] Incentivizing Cardiologist-Like Reasoning in MLLMs for Interpretable Echocardiographic Diagnosis
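[Quick Read]: This paper addresses two limitations in echocardiographic diagnosis: existing echocardiography foundation models do not effectively capture the relationships between quantitative measurements and clinical manifestations, while medical reasoning Multimodal Large Language Models (MLLMs) require costly construction of detailed reasoning paths and cannot directly incorporate echocardiographic priors. The solution has two core components, Cardiac Reasoning Template (CRT) and CardiacMind. CRT provides stepwise canonical diagnostic procedures for complex cardiac diseases, so reasoning paths can be built without case-by-case verification. CardiacMind is a reinforcement learning scheme with three rewards, Procedural Quantity Reward (PQtR), Procedural Quality Reward (PQlR), and Echocardiographic Semantic Reward (ESR), which respectively promote detailed reasoning, integration of evidence across views and modalities, and grounding of stepwise descriptions in visual content.
Link: https://arxiv.org/abs/2601.08440
Authors: Yi Qin,Lehan Wang,Chenxu Zhao,Alex P.W. Lee,Xiaomeng Li
Affiliations: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: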
Abstract:Echocardiographic diagnosis is vital for cardiac screening yet remains challenging. Existing echocardiography foundation models do not effectively capture the relationships between quantitative measurements and clinical manifestations, whereas medical reasoning multimodal large language models (MLLMs) require costly construction of detailed reasoning paths and remain ineffective at directly incorporating such echocardiographic priors into their reasoning. To address these limitations, we propose a novel approach comprising Cardiac Reasoning Template (CRT) and CardiacMind to enhance MLLMs’ echocardiographic reasoning by introducing a cardiologist-like mindset. Specifically, CRT provides stepwise canonical diagnostic procedures for complex cardiac diseases to streamline reasoning path construction without the need for costly case-by-case verification. To incentivize reasoning in MLLMs under CRT, we develop CardiacMind, a new reinforcement learning scheme with three novel rewards: Procedural Quantity Reward (PQtR), Procedural Quality Reward (PQlR), and Echocardiographic Semantic Reward (ESR). PQtR promotes detailed reasoning; PQlR promotes integration of evidence across views and modalities, while ESR grounds stepwise descriptions in visual content. Our methods show a 48% improvement in multiview echocardiographic diagnosis for 15 complex cardiac diseases and a 5% improvement on CardiacNet-PAH over prior methods. The user study on our method’s reasoning outputs shows 93.33% clinician agreement with cardiologist-like reasoning logic. Our code will be available.
[CV-44] Deep Learning Based Facial Retargeting Using Local Patches
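[Quick Read]: This paper addresses semantic preservation when transferring facial motion from a source to stylized or exaggerated 3D characters whose facial structures deviate significantly from human faces. Conventional methods work well between models of similar shape but struggle to keep the semantics of the original expression when the target's facial feature proportions differ greatly. The key to the solution is a local patch-based retargeting method with three modules: the Automatic Patch Extraction Module extracts local patches from source video frames; the Reenactment Module generates the corresponding target patches; and the Weight Estimation Module computes the target character's animation parameters for every frame, producing a complete facial animation sequence. In this way the method preserves the semantics of source expressions on characters with considerable variation in facial feature proportion.
Link: https://arxiv.org/abs/2601.08429
Authors: Yeonsoo Choi,Inyup Lee,Sihun Cha,Seonghyeon Kim,Sunjin Jung,Junyong Noh
Affiliations: KAIST; Netmarble F&C; Anigma; Republic of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Eurographics 25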
Abstract:In the era of digital animation, the quest to produce lifelike facial animations for virtual characters has led to the development of various retargeting methods. While retargeting facial motion between models of similar shapes has been very successful, challenges arise when the retargeting is performed on stylized or exaggerated 3D characters that deviate significantly from human facial structures. In this scenario, it is important to consider the target character’s facial structure and possible range of motion to preserve the semantics assumed by the original facial motions after the retargeting. To achieve this, we propose a local patch-based retargeting method that transfers facial animations captured in a source performance video to a target stylized 3D character. Our method consists of three modules. The Automatic Patch Extraction Module extracts local patches from the source video frame. These patches are processed through the Reenactment Module to generate correspondingly re-enacted target local patches. The Weight Estimation Module calculates the animation parameters for the target character at every frame for the creation of a complete facial animation sequence. Extensive experiments demonstrate that our method can successfully transfer the semantic meaning of source facial expressions to stylized characters with considerable variations in facial feature proportion.
[CV-45] MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP
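[Quick Read]: This paper addresses the lack of semantic-level understanding when fusing spectral, spatial, and geometric information from heterogeneous remote sensing modalities such as Hyperspectral Imaging (HSI) and LiDAR. The key to the solution is the Multimodal Language-Guided Network (MMLGNet), which uses modality-specific encoders to map visual features into a shared latent space and aligns them with handcrafted textual embeddings through bi-directional contrastive learning, bringing CLIP-style language supervision to cross-modal alignment.
Link: https://arxiv.org/abs/2601.08420
Authors: Aditya Chaudhary,Sneha Barman,Mainak Singha,Ankit Jha,Girish Mishra,Biplab Banerjee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at InGARSS 2025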
Abstract:In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities like Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP’s training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Codes are available at this https URL.
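Bi-directional contrastive learning in the CLIP style is commonly written as a symmetric cross-entropy over a similarity matrix, with matching visual/text pairs on the diagonal. The sketch below shows that generic loss in plain Python; the temperature of 0.07 and the function names are assumptions, and it is not taken from the MMLGNet code.

```python
import math

def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct column is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def bidirectional_contrastive_loss(visual, text, temperature=0.07):
    """visual, text: equal-length lists of L2-normalized vectors; pair i matches pair i."""
    logits = [[sum(a * b for a, b in zip(v, t)) / temperature for t in text]
              for v in visual]
    transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))
```

The loss is near zero when each visual embedding is closest to its own text embedding and grows when pairs are mismatched, which is the signal that pulls both modalities toward the shared space.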
[CV-46] SPARK: Scalable Real-Time Point Cloud Aggregation with Multi-View Self-Calibration
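[Quick Read]: This paper addresses key challenges in real-time multi-camera 3D reconstruction: difficult multi-view fusion, camera extrinsic uncertainty, and scalability to large camera setups. The core of the solution is the SPARK framework, with two key components: (1) a geometry-aware online extrinsic estimation module that leverages multi-view priors and enforces cross-view and temporal consistency for stable self-calibration; and (2) a confidence-driven point cloud fusion strategy that models depth reliability and visibility at the pixel and point levels to suppress noise and view-dependent inconsistencies. By fusing frame by frame rather than accumulating, it produces stable point clouds in dynamic scenes and scales linearly with the number of cameras.
Link: https://arxiv.org/abs/2601.08414
Authors: Chentian Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 1 figure, submitted to Trans on Image Processing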
Abstract:Real-time multi-camera 3D reconstruction is crucial for 3D perception, immersive interaction, and robotics. Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups. We propose SPARK, a self-calibrating real-time multi-camera point cloud reconstruction framework that jointly handles point cloud fusion and extrinsic uncertainty. SPARK consists of: (1) a geometry-aware online extrinsic estimation module leveraging multi-view priors and enforcing cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy modeling depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies. By performing frame-wise fusion without accumulation, SPARK produces stable point clouds in dynamic scenes while scaling linearly with the number of cameras. Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating its effectiveness and scalability for large-scale multi-camera 3D reconstruction.
[CV-47] Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2
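[Quick Read]: This paper addresses the conflict between the high computational cost of large Vision Language Models (VLMs) and the limited compute on UAV edge devices, which hinders real-time visual understanding and interaction in complex scenes. The key to the solution is a lightweight multi-task platform based on BLIP-2, integrated with the YOLO-World and YOLOv8-Seg detection and segmentation models, with three contributions. First, deep integration with the YOLO models lets BLIP-2 leverage their precise perception results for basic visual tasks. Second, a content-aware key frame sampling mechanism based on K-Means clustering, combined with temporal feature concatenation, enables video-level interactive tasks. Third, a unified prompt optimization scheme injects structured event logs from the YOLO models as context into BLIP-2's input and uses output constraints to filter out technical details, giving accurate multi-task responses without task-specific fine-tuning.
Link: https://arxiv.org/abs/2601.08408
Authors: Yizhan Feng,Hichem Snoussi,Jing Teng,Jian Liu,Yuyang Wang,Abel Cherouat,Tian Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: The Tenth International Conference on Data Mining and Big Data (DMBD’2025)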
Abstract:The demand for real-time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles. However, a significant challenge arises from the contradiction between the high computational cost of large Vision language models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP-2, integrated with YOLO-World and YOLOv8-Seg models. This integration extends the multi-task capabilities of BLIP-2 for UAV applications with minimal adaptation and without requiring task-specific fine-tuning on drone data. Firstly, the deep integration of BLIP-2 with YOLO models enables it to leverage the precise perceptual results of YOLO for fundamental tasks like object detection and instance segmentation, thereby facilitating deeper visual-attention understanding and reasoning. Secondly, a content-aware key frame sampling mechanism based on K-Means clustering is designed, which incorporates intelligent frame selection and temporal feature concatenation. This equips the lightweight BLIP-2 architecture with the capability to handle video-level interactive tasks effectively. Thirdly, a unified prompt optimization scheme for multi-task adaptation is implemented. This scheme strategically injects structured event logs from the YOLO models as contextual information into BLIP-2’s input. Combined with output constraints designed to filter out technical details, this approach effectively guides the model to generate accurate and contextually relevant outputs for various tasks.
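Content-aware key frame sampling with K-Means can be pictured as clustering per-frame feature vectors and keeping the frame closest to each cluster centre. The sketch below is a generic version under stated assumptions: the feature vectors stand in for whatever per-frame embedding the platform uses, the evenly spaced initialization is a choice made here for determinism, and none of it is taken from the authors' implementation.

```python
def kmeans_keyframes(features, k, iters=20):
    """Cluster per-frame feature vectors and return one frame index per cluster.

    Returns indices in temporal order; fewer than k may be returned if two
    centroids end up closest to the same frame.
    """
    n, dim = len(features), len(features[0])
    # Deterministic initialization: evenly spaced frames as starting centroids.
    centroids = [list(features[(i * n) // k]) for i in range(k)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, f in enumerate(features):
            nearest = min(range(k), key=lambda c: dist2(f, centroids[c]))
            clusters[nearest].append(idx)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(features[i][d] for i in members) / len(members)
                                for d in range(dim)]

    keyframes = set()
    for c in range(k):
        keyframes.add(min(range(n), key=lambda i: dist2(features[i], centroids[c])))
    return sorted(keyframes)
```

Returning the indices in temporal order keeps the selected frames ready for the kind of temporal feature concatenation the abstract mentions.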
[CV-48] An Explainable Two Stage Deep Learning Framework for Pericoronitis Assessment in Panoramic Radiographs Using YOLOv8 and ResNet-50
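[Quick Read]: This paper addresses the difficulty of diagnosing pericoronitis of third molars on panoramic radiographs, where anatomical variation and ambiguous imaging features raise the risk of misdiagnosis or missed diagnosis. The key to the solution is a two-stage deep learning pipeline: the first stage uses YOLOv8 to localize third molars and classify their anatomical position and angulation (based on Winter's classification); the second stage uses a modified ResNet-50 to identify radiographic signs suggestive of pericoronitis. Grad-CAM is added for interpretability, visualizing the model's decisions to increase radiologists' trust in the output.
Link: https://arxiv.org/abs/2601.08401
Authors: Ajo Babu George,Pranav S,Kunal Agarwal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: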
Abstract:Objectives: To overcome challenges in diagnosing pericoronitis on panoramic radiographs, this study developed an AI-assisted assessment system that integrates anatomical localization, pathological classification, and interpretability. Methods: A two-stage deep learning pipeline was implemented. The first stage used YOLOv8 to detect third molars and classify their anatomical positions and angulations based on Winter’s classification. Detected regions were then fed into a second-stage classifier, a modified ResNet-50 architecture, for detecting radiographic features suggestive of pericoronitis. To enhance clinical trust, Grad-CAM was used to highlight key diagnostic regions on the radiographs. Results: The YOLOv8 component achieved 92% precision and 92.5% mean average precision. The ResNet-50 classifier yielded F1-scores of 88% for normal cases and 86% for pericoronitis. Radiologists reported 84% alignment between Grad-CAM and their diagnostic impressions, supporting the radiographic relevance of the interpretability output. Conclusion: The system shows strong potential for AI-assisted panoramic assessment, with explainable AI features that support clinical confidence.
[CV-49] Design and Development of a Low-Cost Scalable GSM-IoT Smart Pet Feeder with a Remote Mobile Application
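[Quick Read]: This paper addresses the difficulty busy urban residents have in keeping a regular, measured feeding schedule for their pets. The key to the solution is a low-cost, scalable GSM-IoT smart pet feeder: an Arduino microcontroller with a SIM800L GSM module for remote communication, an ultrasonic sensor for real-time food-level monitoring, and a servo motor for accurate dispensing, plus a mobile app built with MIT App Inventor for sending commands and receiving status updates. The result is an automated feeding system that does not depend on the internet, with a reported 98% SMS command success rate and ±2.67% portion variance.
Link: https://arxiv.org/abs/2601.08394
Authors: Md. Rakibul Hasan Nishat,S. M. Khalid Bin Zahid,Abdul Hasib,T. M. Mehrab Hasan,Mohammad Arman,A. S. M. Ahsanul Sarkar Akib
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: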
Abstract:Pet ownership is increasingly common in modern households, yet maintaining a consistent feeding schedule remains challenging for owners, particularly those who live in cities and have busy lifestyles. This paper presents the design, development, and validation of a low-cost, scalable GSM-IoT smart pet feeder that enables remote monitoring and control through cellular communication. The device combines an Arduino microcontroller, a SIM800L GSM module for communication, an ultrasonic sensor for real-time food-level assessment, and a servo mechanism for accurate portion dispensing. A dedicated mobile application was developed using MIT App Inventor, which allows owners to send feeding commands and receive real-time status updates. Experimental results demonstrate a 98% SMS command success rate, consistent portion dispensing with ±2.67% variance, and reliable autonomous operation. Its modular, energy-efficient design makes it easy to use in a wide range of households, including those with limited resources. This work pushes forward the field of accessible pet care technology by providing a practical, scalable, and completely internet-independent solution for personalized pet feeding. In doing so, it sets a new benchmark for low-cost, GSM-powered automation in smart pet products.
[CV-50] Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance
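[Quick Read]: This paper addresses the mismatch between samples generated by pre-trained diffusion models and a user-specific target data distribution in domain adaptation, especially when only a few reference samples are available and the model cannot be retrained. Existing inference-time guidance methods usually optimize surrogate objectives (such as classifier likelihoods) rather than aligning directly with the target distribution. The key to the solution is MMD Guidance, which adds gradients of the Maximum Mean Discrepancy (MMD) to the reverse diffusion process to steer the sampling trajectory and align generated samples with the reference dataset. The method is training-free, can be applied efficiently in the latent space of Latent Diffusion Models (LDMs), and supports prompt-aware adaptation in conditional generation.
Link: https://arxiv.org/abs/2601.08379
Authors: Matina Mahdizadeh Sani,Nima Jamali,Mohammad Jalali,Farzan Farnia
Affiliations: School of Computer Science, University of Waterloo; Department of Computer Science and Engineering, The Chinese University of Hong Kong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: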
Abstract:Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity.
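To make the guidance signal concrete, the sketch below computes the squared MMD between a batch of generated samples and a reference set, together with its gradient with respect to each generated sample. It assumes a Gaussian (RBF) kernel with bandwidth `sigma` and the biased estimator; the kernel, the step size, and the function names are illustrative choices, and the full method would add this gradient inside the reverse diffusion update rather than apply it on its own.

```python
import math

def rbf(a, b, sigma):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and references ys."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

def mmd2_grad(xs, ys, sigma=1.0):
    """Gradient of mmd2 with respect to every generated sample in xs."""
    n, m = len(xs), len(ys)
    grads = []
    for xi in xs:
        g = [0.0] * len(xi)
        for xj in xs:   # term from the generated-generated kernel mean
            k = rbf(xi, xj, sigma)
            for d in range(len(xi)):
                g[d] += (2.0 / (n * n)) * (-k * (xi[d] - xj[d]) / sigma ** 2)
        for yj in ys:   # term from the generated-reference kernel mean
            k = rbf(xi, yj, sigma)
            for d in range(len(xi)):
                g[d] -= (2.0 / (n * m)) * (-k * (xi[d] - yj[d]) / sigma ** 2)
        grads.append(g)
    return grads

def guidance_step(xs, ys, step=0.5, sigma=1.0):
    """Move generated samples one gradient step toward lower MMD."""
    grads = mmd2_grad(xs, ys, sigma)
    return [[x - step * g for x, g in zip(xi, gi)] for xi, gi in zip(xs, grads)]
```

Because the gradient only needs kernel evaluations against a small reference set, it stays cheap to compute on low-dimensional latents, which is consistent with the abstract's point about applying guidance in the latent space of an LDM.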
[CV-51] Source-Free Domain Adaptation for Geospatial Point Cloud Semantic Segmentation
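[Quick Read]: This paper addresses source-free unsupervised domain adaptation (SFUDA) for semantic segmentation of 3D geospatial point clouds in remote sensing, that is, how to transfer a pretrained model to a target domain and keep segmentation accuracy high when source-domain data is unavailable. The key to the solution is a new framework, LoGo (Local-Global Dual-Consensus). At the local level, a class-balanced prototype estimation module replaces conventional global threshold filtering with an intra-class independent anchor mining strategy, so robust feature prototypes can still be built under long-tailed distributions. At the global level, an optimal-transport-based distribution alignment module formulates pseudo-label assignment as a global optimization problem, mitigating the over-dominance of head classes caused by local greedy assignment. A dual-consistency pseudo-label filtering mechanism then keeps for self-training only the high-confidence pseudo-labels on which the local multi-augmentation ensemble predictions agree with the global optimal transport assignments, enforcing local-global consistency and significantly improving generalization on the target domain.
Link: https://arxiv.org/abs/2601.08375
Authors: Yuan Gao, Di Cao, Xiaohuan Xi, Sheng Nie, Shaobo Xia, Cheng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: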
Abstract:Semantic segmentation of 3D geospatial point clouds is pivotal for remote sensing applications. However, variations in geographic patterns across regions and data acquisition strategies induce significant domain shifts, severely degrading the performance of deployed models. Existing domain adaptation methods typically rely on access to source-domain data. However, this requirement is rarely met due to data privacy concerns, regulatory policies, and data transmission limitations. This motivates the largely underexplored setting of source-free unsupervised domain adaptation (SFUDA), where only a pretrained model and unlabeled target-domain data are available. In this paper, we propose LoGo (Local-Global Dual-Consensus), a novel SFUDA framework specifically designed for geospatial point clouds. At the local level, we introduce a class-balanced prototype estimation module that abandons conventional global threshold filtering in favor of an intra-class independent anchor mining strategy. This ensures that robust feature prototypes can be generated even for sample-scarce tail classes, effectively mitigating the feature collapse caused by long-tailed distributions. At the global level, we introduce an optimal transport-based global distribution alignment module that formulates pseudo-label assignment as a global optimization problem. By enforcing global distribution constraints, this module effectively corrects the over-dominance of head classes inherent in local greedy assignments, preventing model predictions from being severely biased towards majority classes. Finally, we propose a dual-consistency pseudo-label filtering mechanism. This strategy retains only high-confidence pseudo-labels where local multi-augmented ensemble predictions align with global optimal transport assignments for self-training.
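The global step described above, pseudo-label assignment under a class-distribution constraint, is often approximated with Sinkhorn-Knopp iterations. The sketch below assumes that solver, a uniform per-sample marginal, and a given target class marginal; the abstract does not say which solver LoGo uses, and `sinkhorn_assign`, `class_marginal`, and `iters` are hypothetical names.

```python
def sinkhorn_assign(probs, class_marginal, iters=100):
    # probs: per-sample class probabilities from the model (rows sum to 1).
    # class_marginal: assumed target share of each class (sums to 1).
    n, k = len(probs), len(probs[0])
    plan = [[max(p, 1e-12) for p in row] for row in probs]
    for _ in range(iters):
        # Scale columns so each class receives its prescribed total mass.
        col = [sum(plan[i][c] for i in range(n)) for c in range(k)]
        for i in range(n):
            for c in range(k):
                plan[i][c] *= class_marginal[c] / col[c]
        # Scale rows so each sample carries mass 1/n.
        for i in range(n):
            row = sum(plan[i])
            for c in range(k):
                plan[i][c] /= row * n
    labels = [max(range(k), key=lambda c: plan[i][c]) for i in range(n)]
    return labels, plan
```

With four samples whose greedy argmax is always class 0 and a balanced marginal `[0.5, 0.5]`, the transport plan reassigns the two least confident samples to class 1, which is the head-class correction the abstract describes.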
[CV-52] Geo-NVS-w: Geometry-Aware Novel View Synthesis In-the-Wild with an SDF Renderer ICCV2025
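[Quick Read]: This paper addresses the lack of geometric constraints in existing methods for high-fidelity novel view synthesis (NVS) from in-the-wild image collections, which leads to inconsistencies on complex surface structures. The key to the solution is Geo-NVS-w, a geometry-aware framework that uses an implicit geometric representation based on a Signed Distance Function (SDF) to guide rendering and introduces a novel Geometry-Preservation Loss to ensure fine structural details are accurately preserved. The method keeps rendering quality competitive while reducing energy consumption by 4-5x compared with similar methods.
Link: https://arxiv.org/abs/2601.08371
Authors: Anastasios Tsalakopoulos, Angelos Kanlis, Evangelos Chatzis, Antonis Karakottas, Dimitrios Zarpalas
Affiliations: Centre for Research and Technology Hellas (CERTH)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Presented at the ICCV 2025 Workshop on Large Scale Cross Device Localization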
Abstract:We introduce Geo-NVS-w, a geometry-aware framework for high-fidelity novel view synthesis from unstructured, in-the-wild image collections. While existing in-the-wild methods already excel at novel view synthesis, they often lack geometric grounding on complex surfaces, sometimes producing results that contain inconsistencies. Geo-NVS-w addresses this limitation by leveraging an underlying geometric representation based on a Signed Distance Function (SDF) to guide the rendering process. This is complemented by a novel Geometry-Preservation Loss which ensures that fine structural details are preserved. Our framework achieves competitive rendering performance, while demonstrating a 4-5x reduction in energy consumption compared to similar methods. We demonstrate that Geo-NVS-w is a robust method for in-the-wild NVS, yielding photorealistic results with sharp, geometrically coherent details.
[CV-53] Semantic Misalignment in Vision-Language Models under Perceptual Degradation
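[Quick Read]: This paper addresses semantic misalignment in Vision-Language Models (VLMs) used in autonomous driving and embodied AI systems when upstream visual perception degrades, in particular the case where perception quality has not dropped significantly yet downstream VLM behavior fails severely (hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments). The key to the solution is a set of language-level misalignment metrics (covering hallucination, critical omission, and safety misinterpretation) that quantify the semantic reliability of VLMs under controlled perception degradation. The results reveal a clear disconnect between pixel-level robustness and multimodal semantic consistency, motivating evaluation frameworks for safety-critical settings that explicitly account for perception uncertainty.
Link: https://arxiv.org/abs/2601.08355
Authors: Guo Cheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: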
Abstract:Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
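The abstract names three language-level metrics but does not define them. As one possible reading, the sketch below computes set-based versions from object lists; every definition here (and the names `misalignment_metrics`, `mentioned`, `present`, `critical`) is an assumption for illustration, not the paper's formulation.

```python
def misalignment_metrics(mentioned, present, critical, judged_safe, truly_safe):
    # mentioned: objects named in the VLM output (assumed already extracted).
    # present: objects actually in the scene; critical: safety-critical classes.
    mentioned, present, critical = set(mentioned), set(present), set(critical)
    hallucinated = mentioned - present          # named but not in the scene
    critical_present = critical & present
    missed = critical_present - mentioned       # critical objects left unmentioned
    return {
        "hallucination_rate": len(hallucinated) / len(mentioned) if mentioned else 0.0,
        "critical_omission_rate": len(missed) / len(critical_present) if critical_present else 0.0,
        "safety_misjudged": judged_safe != truly_safe,
    }
```

For example, a description that mentions a car, a tree, and a nonexistent dog while missing a pedestrian and a cyclist scores a hallucination rate of 1/3 and a critical omission rate of 2/3 under these definitions.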
[CV-54] From Local Windows to Adaptive Candidates via Individualized Exploratory: Rethinking Attention for Image Super-Resolution
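[Quick Read]: This paper addresses the tension between computational efficiency and attention flexibility in Transformer-based Single Image Super-Resolution (SISR). Existing methods usually restrict attention to fixed local groups, which lowers computational complexity but ignores the inherent asymmetry of token similarities and therefore prevents adaptive information aggregation. The key to the solution is Individualized Exploratory Attention (IEA), which lets each token select its own content-aware, independent set of attention candidates, enabling more precise and flexible cross-token interaction while keeping computation efficient.
Link: https://arxiv.org/abs/2601.08341
Authors: Chunyu Meng, Wei Long, Shuhang Gu
Affiliations: University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: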
Abstract:Single Image Super-Resolution (SISR) is a fundamental computer vision task that aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input. Transformer-based methods have achieved remarkable performance by modeling long-range dependencies in degraded images. However, their feature-intensive attention computation incurs high computational cost. To improve efficiency, most existing approaches partition images into fixed groups and restrict attention within each group. Such group-wise attention overlooks the inherent asymmetry in token similarities, thereby failing to enable flexible and token-adaptive attention computation. To address this limitation, we propose the Individualized Exploratory Transformer (IET), which introduces a novel Individualized Exploratory Attention (IEA) mechanism that allows each token to adaptively select its own content-aware and independent attention candidates. This token-adaptive and asymmetric design enables more precise information aggregation while maintaining computational efficiency. Extensive experiments on standard SR benchmarks demonstrate that IET achieves state-of-the-art performance under comparable computational complexity.
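To make the contrast with fixed-group attention concrete, the sketch below lets each token pick its own top-k candidates by dot-product similarity and attend only to them. This per-token top-k rule is an assumed stand-in; the abstract does not describe how IEA actually explores candidates, and `topk_candidates` and `sparse_attention` are hypothetical names.

```python
import math

def topk_candidates(tokens, k):
    # For each token, indices of its k highest-scoring other tokens.
    out = []
    for i, q in enumerate(tokens):
        scores = [(sum(a * b for a, b in zip(q, t)), j)
                  for j, t in enumerate(tokens) if j != i]
        scores.sort(reverse=True)
        out.append([j for _, j in scores[:k]])
    return out

def sparse_attention(tokens, k):
    # Softmax attention restricted to each token's own candidate set.
    cands = topk_candidates(tokens, k)
    d = len(tokens[0])
    outputs = []
    for q, idx in zip(tokens, cands):
        logits = [sum(a * b for a, b in zip(q, tokens[j])) / math.sqrt(d) for j in idx]
        mx = max(logits)
        w = [math.exp(l - mx) for l in logits]
        z = sum(w)
        outputs.append([sum(wi / z * tokens[j][t] for wi, j in zip(w, idx))
                        for t in range(d)])
    return cands, outputs
```

The candidate sets are asymmetric: for tokens `[1, 0]`, `[2, 0]`, `[3, 0]` with k = 1, the first token selects the third, but the third selects the second, which a fixed window partition cannot express.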
[CV-55] Tissue Classification and Whole-Slide Images Analysis via Modeling of the Tumor Microenvironment and Biological Pathways
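[Quick Read]: This paper addresses limitations in the automatic integration of whole slide images (WSIs) and gene expression profiles: most existing work focuses on individual gene sequences or slide-level classification and pays little attention to spatial transcriptomics and patch-level applications. The core of the solution is a multimodal network, BioMorphNet, with three key elements: 1) a graph that models the morphological relationships between a target patch and its neighbors and adjusts response strength according to morphological and molecular similarity, to better characterize the tumor microenvironment; 2) clinical pathway features derived from spatial transcriptomic data using a predefined pathway database, serving as a bridge between tissue morphology and gene expression; 3) a learnable pathway module that automatically simulates the biological pathway formation process and complements the existing clinical pathways. On prostate, colorectal, and breast cancer datasets, average classification metrics improve by 2.67%, 5.48%, and 6.29%, respectively. The method also supports tissue classification and confidence-based differential gene analysis, aiding the discovery of potential tumor biomarkers.
Link: https://arxiv.org/abs/2601.08336
Authors: Junzhuo Liu, Xuemei Du, Daniel Reisenbuchler, Ye Chen, Markus Eckstein, Christian Matek, Friedrich Feuerhake, Dorit Merhof
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 8 figures. This work has been submitted to the IEEE for possible publication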
Abstract:Automatic integration of whole slide images (WSIs) and gene expression profiles has demonstrated substantial potential in precision clinical diagnosis and cancer progression studies. However, most existing studies focus on individual gene sequences and slide level classification tasks, with limited attention to spatial transcriptomics and patch level applications. To address this limitation, we propose a multimodal network, BioMorphNet, which automatically integrates tissue morphological features and spatial gene expression to support tissue classification and differential gene analysis. For considering morphological features, BioMorphNet constructs a graph to model the relationships between target patches and their neighbors, and adjusts the response strength based on morphological and molecular level similarity, to better characterize the tumor microenvironment. In terms of multimodal interactions, BioMorphNet derives clinical pathway features from spatial transcriptomic data based on a predefined pathway database, serving as a bridge between tissue morphology and gene expression. In addition, a novel learnable pathway module is designed to automatically simulate the biological pathway formation process, providing a complementary representation to existing clinical pathways. Compared with the latest morphology gene multimodal methods, BioMorphNet’s average classification metrics improve by 2.67%, 5.48%, and 6.29% for prostate cancer, colorectal cancer, and breast cancer datasets, respectively. BioMorphNet not only classifies tissue categories within WSIs accurately to support tumor localization, but also analyzes differential gene expression between tissue categories based on prediction confidence, contributing to the discovery of potential tumor biomarkers.
[CV-56] IGAN: A New Inception-based Model for Stable and High-Fidelity Image Synthesis Using Generative Adversarial Networks
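[Quick Read]: This paper addresses the difficulty of balancing high-fidelity image generation with training stability in Generative Adversarial Networks (GANs), especially mode collapse and vanishing/exploding gradients in deep networks. The key to the solution is a new GAN architecture, the Inception Generative Adversarial Network (IGAN), which combines deeper Inception-inspired convolutions with dilated convolutions to improve image quality and training stability. Dropout and spectral normalization are applied in both the generator and the discriminator to further curb gradient explosion and overfitting. IGAN reaches an FID of 13.12 on CUB-200 and 15.08 on ImageNet, a 28-33% improvement over state-of-the-art GANs.
Link: https://arxiv.org/abs/2601.08332
Authors: Ahmed A. Hashim, Ali Al-Shuwaili, Asraa Saeed, Ali Al-Bayaty
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures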
Abstract:Generative Adversarial Networks (GANs) face a significant challenge of striking an optimal balance between high-quality image generation and training stability. Recent techniques, such as DCGAN, BigGAN, and StyleGAN, improve visual fidelity; however, such techniques usually struggle with mode collapse and unstable gradients at high network depth. This paper proposes a novel GAN structural model that incorporates deeper inception-inspired convolution and dilated convolution. This novel model is termed the Inception Generative Adversarial Network (IGAN). The IGAN model generates high-quality synthetic images while maintaining training stability, by reducing mode collapse as well as preventing vanishing and exploding gradients. Our proposed IGAN model achieves the Frechet Inception Distance (FID) of 13.12 and 15.08 on the CUB-200 and ImageNet datasets, respectively, representing a 28-33% improvement in FID over the state-of-the-art GANs. Additionally, the IGAN model attains an Inception Score (IS) of 9.27 and 68.25, reflecting improved image diversity and generation quality. Finally, the two techniques of dropout and spectral normalization are utilized in both the generator and discriminator structures to further mitigate gradient explosion and overfitting. These findings confirm that the IGAN model potentially balances training stability with image generation quality, constituting a scalable and computationally efficient framework for high-fidelity image synthesis.
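Spectral normalization, one of the two stabilizers mentioned above, rescales a weight matrix by its largest singular value. The sketch below estimates that value with power iteration on a plain 2-D list; it illustrates the general technique only, and `spectral_norm`, `normalize`, and `iters` are hypothetical names unrelated to IGAN's code.

```python
import math

def spectral_norm(weight, iters=50):
    # Power iteration estimate of the largest singular value of `weight`.
    rows, cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(weight[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        nu = math.sqrt(sum(x * x for x in u))
        u = [x / nu for x in u]
        v = [sum(weight[r][c] * u[r] for r in range(rows)) for c in range(cols)]
        nv = math.sqrt(sum(x * x for x in v))
        v = [x / nv for x in v]
    # sigma = u^T W v for the converged singular vector pair.
    return sum(u[r] * weight[r][c] * v[c] for r in range(rows) for c in range(cols))

def normalize(weight, iters=50):
    # Divide by the spectral norm so the layer is at most 1-Lipschitz.
    sigma = spectral_norm(weight, iters)
    return [[w / sigma for w in row] for row in weight]
```

Deep learning frameworks typically run a single power-iteration step per training update and reuse the previous vectors, rather than iterating to convergence as this sketch does.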
[CV-57] UM-Text: A Unified Multimodal Model for Image Understanding
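[Quick Read]: This paper addresses insufficient style consistency in visual text editing caused by incomplete understanding of natural language instructions, that is, how to generate visual text that matches the required semantics and layout while preserving the style of the reference image. The key to the solution is a unified multimodal model, UM-Text. Its main contributions are: a Visual Language Model (VLM) that jointly interprets the instruction and the reference image context to guide the design of text content and layout; a UM-Encoder that fuses multiple condition embeddings (such as text and image style), with the combination configured dynamically by the VLM; and a regional consistency loss together with a three-stage training strategy that improve the consistency and accuracy of glyph generation in both latent and RGB space.
Link: https://arxiv.org/abs/2601.08321
Authors: Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: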
Abstract:With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
[CV-58] YOLOBirDrone: Dataset for Bird vs Drone Detection and Classification and a YOLO based enhanced learning architecture
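[Quick Read]: This paper addresses the difficulty that vision-based drone detection systems have in distinguishing drones from birds (especially small birds) in practice, which limits the robustness of existing methods in complex scenes. The key to the solution is a new architecture, YOLOBirDrone, whose core components are an adaptive and extended layer aggregation network (AELAN), a multi-scale progressive dual attention module (MPDA), and a reverse MPDA (RMPDA). Together they enrich local and global spatial and channel feature representations and preserve object shape information, improving recognition of small, highly similar targets. The work also introduces BirDrone, a large-scale, challenging dataset for training and validation. Experiments show detection accuracy of approximately 85% across various scenarios, outperforming existing state-of-the-art algorithms.
Link: https://arxiv.org/abs/2601.08319
Authors: Dapinder Kaur, Neeraj Battish, Arnav Bhavsar, Shashi Poddar
Affiliations: CSIR-CSIO; IIT Mandi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, and submitted to a journal for review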
Abstract:The use of aerial drones for commercial and defense applications has benefited in many ways and is therefore utilized in several different application domains. However, they are also increasingly used for targeted attacks, posing a significant safety challenge and necessitating the development of drone detection systems. Vision-based drone detection systems currently have an accuracy limitation and struggle to distinguish between drones and birds, particularly when the birds are small in size. This research work proposes a novel YOLOBirDrone architecture that improves the detection and classification accuracy of birds and drones. YOLOBirDrone has different components, including an adaptive and extended layer aggregation (AELAN), a multi-scale progressive dual attention module (MPDA), and a reverse MPDA (RMPDA) to preserve shape information and enrich features with local and global spatial and channel information. A large-scale dataset, BirDrone, is also introduced in this article, which includes small and challenging objects for robust aerial object identification. Experimental results demonstrate an improvement in performance metrics through the proposed YOLOBirDrone architecture compared to other state-of-the-art algorithms, with detection accuracy reaching approximately 85% across various scenarios.
[CV-59] Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation Large Activation and Benign Overfitting
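[Quick Read]: This paper investigates the mechanism behind deep double descent in deep learning models, in particular its relationship to generalization. The key to the approach is decomposing the loss curves to analyze separately how clean and noisy data affect the evolution of internal signals, revealing how the model reaches "benign overfitting" during training: it first fits the clean data and then overfits the noisy data, but because the corresponding activations become separated in the outer layers, it still generalizes well to test data. The study also finds a single prominent large activation in the shallow layer whose magnitude correlates with input patterns rather than output patterns, offering a new structural explanation for deep double descent.
Link: https://arxiv.org/abs/2601.08316
Authors: Tomoki Kubo, Ryuken Uda, Yusuke Iida
Affiliations: Niigata University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 17 pages, 9 figures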
Abstract:Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, which is delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a “benign overfitting” state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon is referred to as “outliers,” “massive activations,” and “super activations” in recent large language models and evolves with re-generalization. The magnitude of large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of “deep double descent,” “benign overfitting,” and “large activation”, and support the proposal of a novel scenario for understanding deep double descent.
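The loss decomposition mentioned in the abstract can be illustrated at the simplest level: the mean training loss splits exactly into a part contributed by clean-label samples and a part contributed by noisy-label samples. The sketch below assumes cross-entropy loss and a known noise mask; the paper's own decomposition of internal signals is more detailed, and `decompose_loss` and `is_noisy` are hypothetical names.

```python
import math

def decompose_loss(probs, labels, is_noisy):
    # probs: predicted class probabilities per sample.
    # labels: (possibly corrupted) training labels; is_noisy: True where corrupted.
    n = len(labels)
    clean = noisy = 0.0
    for p, y, flag in zip(probs, labels, is_noisy):
        loss = -math.log(p[y])  # cross-entropy for this sample
        if flag:
            noisy += loss
        else:
            clean += loss
    # Both parts are divided by n so that clean + noisy equals the mean loss.
    return {"clean": clean / n, "noisy": noisy / n, "total": (clean + noisy) / n}
```

Tracking the two parts per epoch is what allows statements such as "noisy data were learned after clean data": the clean part falls first, and the noisy part falls later.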
[CV-60] Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation
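[Quick Read]: This paper addresses the fact that Large Multimodal Models (LMMs), although zero-shot capable, fall short of optimal performance on Image Quality Assessment (IQA), and that existing methods depend on computationally expensive fine-tuning to align the distribution of quality-related tokens in the output. The key to the solution is a training-free framework, IQARAG (Image Quality Assessment with Retrieval-Augmented Generation). Using Retrieval-Augmented Generation (RAG), it retrieves from a database reference images that are semantically similar to the input but differ in quality, along with their Mean Opinion Scores (MOSs), and places these references together with the input image in a specific prompt, giving the LMM a visual perception anchor that improves IQA accuracy. The method has three phases: retrieval feature extraction, image retrieval, and quality score generation. Experiments on multiple IQA datasets show that it improves LMM performance while avoiding costly fine-tuning.
Link: https://arxiv.org/abs/2601.08311
Authors: Kang Fu, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Jun Zhao, Xiongkuo Min, Jia Wang, Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: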
Abstract:Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related token in output with image quality levels. Inspired by recent training-free works for LMM, we introduce IQARAG, a novel, training-free framework that enhances LMMs’ IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve some semantically similar but quality-variant reference images with corresponding Mean Opinion Scores (MOSs) for input image. These retrieved images and input image are integrated into a specific prompt. Retrieved images provide the LMM with a visual perception anchor for IQA task. IQARAG contains three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.
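The retrieval and prompt-integration phases can be sketched as follows. The sketch assumes precomputed feature vectors, cosine-similarity ranking, and a made-up prompt template; the abstract does not specify the feature extractor, the similarity measure, or the prompt wording, and `cosine`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_feat, database, k=2):
    # database: list of dicts with a "feat" vector and a "mos" score per reference image.
    ranked = sorted(database, key=lambda item: cosine(query_feat, item["feat"]), reverse=True)
    return ranked[:k]

def build_prompt(references):
    # Hypothetical template: list each reference with its MOS, then ask for a rating.
    lines = [f"Reference image {i + 1}: quality score {ref['mos']:.1f}."
             for i, ref in enumerate(references)]
    lines.append("Using the references as anchors, rate the quality of the final image.")
    return "\n".join(lines)
```

In the full pipeline the retrieved images themselves would be passed to the LMM alongside this text, so the model sees both the visual anchors and their scores.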
[CV-61] SnapGen: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices
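[Quick Read]: This paper addresses the high compute and memory cost of deploying Diffusion Transformers (DiTs) on mobile and edge devices, with the goal of high-quality image generation under resource constraints. The solution rests on three components: first, a compact DiT architecture with an adaptive global-local sparse attention mechanism that preserves local detail while retaining global context modeling; second, an elastic training framework that jointly optimizes sub-models of different capacities within a unified supernetwork, so a single model can adjust dynamically for efficient inference on different hardware; third, Knowledge-Guided Distribution Matching Distillation (KG-DMD), which transfers knowledge from few-step teacher models to the student to produce high-fidelity, low-latency generation (e.g., 4 steps) suitable for real-time on-device use.
Link: https://arxiv.org/abs/2601.08303
Authors: Dongting Hu, Aarush Gupta, Magzhan Gabidolla, Arpit Sahni, Huseyin Coskun, Yanyu Li, Yerlan Idelbayev, Ahsan Mahmood, Aleksei Lebedev, Dishani Lahiri, Anujraaj Goyal, Ju Hu, Mingming Gong, Sergey Tulyakov, Anil Kag
Affiliations: Snap Inc.; The University of Melbourne; MBZUAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: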
Abstract:Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global-local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we propose an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust for efficient inference across different hardware. Finally, we develop Knowledge-Guided Distribution Matching Distillation, a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, producing high-fidelity and low-latency generation (e.g., 4-step) suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.
[CV-62] ReCo-KD: Region- and Context-Aware Knowledge Distillation for Efficient 3D Medical Image Segmentation
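[Quick Read]: This paper addresses two obstacles to clinical deployment of 3D medical image segmentation models: high-performing models are usually large and compute-hungry, which makes them hard to use in institutions with limited computing resources, while lightweight models often lose significant accuracy because of their simplified structure. The key to the solution is Region- and Context-aware Knowledge Distillation (ReCo-KD), a framework that operates only at training time. Its core components are Multi-Scale Structure-Aware Region Distillation (MS-SARD), which uses class-aware masks and scale-normalized weighting to strengthen feature transfer for small but clinically important anatomical regions, and Multi-Scale Context Alignment (MS-CA), which aligns teacher and student affinity patterns across feature levels, so that both fine anatomical detail and long-range context are preserved. The method needs no custom student network and integrates into mainstream architectures such as nnU-Net, balancing high accuracy against low latency for practical clinical use.
Link: https://arxiv.org/abs/2601.08301
Authors: Qizhen Lan, Yu-Chun Hsu, Nida Saddaf Khan, Xiaoqian Jiang
Affiliations: The University of Texas Health Science Center at Houston (UTHealth Houston)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages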
Abstract:Accurate 3D medical image segmentation is vital for diagnosis and treatment planning, but state-of-the-art models are often too large for clinics with limited computing resources. Lightweight architectures typically suffer significant performance loss. To address these deployment and speed constraints, we propose Region- and Context-aware Knowledge Distillation (ReCo-KD), a training-only framework that transfers both fine-grained anatomical detail and long-range contextual information from a high-capacity teacher to a compact student network. The framework integrates Multi-Scale Structure-Aware Region Distillation (MS-SARD), which applies class-aware masks and scale-normalized weighting to emphasize small but clinically important regions, and Multi-Scale Context Alignment (MS-CA), which aligns teacher-student affinity patterns across feature levels. Implemented on nnU-Net in a backbone-agnostic manner, ReCo-KD requires no custom student design and is easily adapted to other architectures. Experiments on multiple public 3D medical segmentation datasets and a challenging aggregated dataset show that the distilled lightweight model attains accuracy close to the teacher while markedly reducing parameters and inference latency, underscoring its practicality for clinical deployment.
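The effect of class-aware masks with scale-normalized weighting can be shown with a toy feature-matching loss: the squared error is averaged within each class region first, then across classes, so a one-voxel structure counts as much as a large one. This is an assumed simplification of MS-SARD on flat scalar features, and `region_distill_loss` is a hypothetical name.

```python
def region_distill_loss(student, teacher, labels, num_classes):
    # student / teacher: per-voxel feature values (flattened, scalar for brevity).
    # labels: class id of each voxel, used as the class-aware mask.
    total, used = 0.0, 0
    for c in range(num_classes):
        idx = [i for i, y in enumerate(labels) if y == c]
        if not idx:
            continue  # class absent from this patch
        # Normalizing by region size keeps small regions from being drowned out.
        total += sum((student[i] - teacher[i]) ** 2 for i in idx) / len(idx)
        used += 1
    return total / used if used else 0.0
```

For five voxels where only the single class-1 voxel is mismatched by 2, a plain mean squared error gives 0.8, while this region-normalized loss gives 2.0, putting more weight on the small structure.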
[CV-63] M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction AAAI2026
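[Quick Read]: This paper addresses two key problems in current Mamba-based spectral reconstruction of hyperspectral images: single spatial perception limits the model's ability to fully understand and analyze hyperspectral images, and single-scale feature extraction struggles to capture their complex structures and fine details. The key to the solution is M3SR, a multi-scale, multi-perceptual Mamba architecture. Its core contribution is a multi-perceptual fusion block that strengthens the model's comprehensive understanding of input features; embedded in a U-Net structure, it extracts and fuses global, intermediate, and local features, enabling accurate multi-scale hyperspectral reconstruction at a lower computational cost than existing state-of-the-art methods.
Link: https://arxiv.org/abs/2601.08293
Authors: Yuze Zhang, Lingjie Li, Qiuzhen Lin, Zhong Ming, Fei Yu, Victor C. M. Leung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026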
Abstract:The Mamba architecture has been widely applied to various low-level vision tasks due to its exceptional adaptability and strong performance. Although the Mamba architecture has been adopted for spectral reconstruction, it still faces the following two challenges: (1) Single spatial perception limits the ability to fully understand and analyze hyperspectral images; (2) Single-scale feature extraction struggles to capture the complex structures and fine details present in hyperspectral images. To address these issues, we propose a multi-scale, multi-perceptual Mamba architecture for the spectral reconstruction task, called M3SR. Specifically, we design a multi-perceptual fusion block to enhance the ability of the model to comprehensively understand and analyze the input features. By integrating the multi-perceptual fusion block into a U-Net structure, M3SR can effectively extract and fuse global, intermediate, and local features, thereby enabling accurate reconstruction of hyperspectral images at multiple scales. Extensive quantitative and qualitative experiments demonstrate that the proposed M3SR outperforms existing state-of-the-art methods while incurring a lower computational cost.
[CV-64] KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?
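[Quick Read]: This paper asks whether current Multimodal Large Language Models (MLLMs) have human-like basic visual perception, that is, whether they possess visual primitives comparable to those of 6-7-year-old children. To answer this, the authors propose the KidVis benchmark, which draws on the theory of human visual development to decompose visual intelligence into six atomic capabilities (Concentration, Tracking, Discrimination, Memory, Spatial, and Closure) and comprises 10 categories of visual tasks with low semantic dependence. The key to the approach is an evaluation grounded in human physiological development: comparing 20 state-of-the-art MLLMs with human children exposes marked deficits in basic visual ability and a "Scaling Law Paradox", in which simply adding parameters does not yield linear gains on these core visual primitives. The study concludes that current MLLMs, despite strong reasoning, lack the physiological perceptual foundation needed for generalized visual intelligence.
Link: https://arxiv.org/abs/2601.08292
Authors: Xianfeng Wang, Kaiwei Zhang, Qi Jia, Zijian Chen, Guangtao Zhai, Xiongkuo Min
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: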
Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks, such as complex diagrammatic interpretation, it remains an open question whether they possess the fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities - Concentration, Tracking, Discrimination, Memory, Spatial, and Closure - already possessed by 6-7 year old children, comprising 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. Results indicate that while human children achieve a near-perfect average score of 95.32, the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a “Scaling Law Paradox”: simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
[CV-65] One-Shot Identification with Different Neural Network Approaches
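[Quick Read]: This paper addresses one-shot learning, that is, accurately identifying a new class from a single example, which is especially hard when data is scarce. The key to the solution is an architecture that combines stacked images with siamese capsule networks, using the structured feature representation of capsule networks to improve generalization with few samples. Experiments show that the approach outperforms existing techniques on an industrial application and on face recognition benchmarks, and that it is easy to use and optimize.
Link: https://arxiv.org/abs/2601.08278
Authors: Janis Mohr, Jörg Frochte
Affiliations: Interdisciplinary Institute for Applied Artificial Intelligence and Data Science Ruhr, Bochum University of Applied Sciences, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, Keywords: One-shot learning, Convolutional neural networks, Siamese networks, Capsules, Industrial application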
Abstract:Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. But learning good features often is computationally expensive in machine learning settings and is especially difficult when there is a lack of data. One-shot learning is one such area where only limited data is available. In one-shot learning, predictions have to be made after seeing only one example from one class, which requires special techniques. In this paper we explore different approaches to one-shot identification tasks in different domains including an industrial application and face recognition. We use a special technique with stacked images and use siamese capsule networks. It is encouraging to see that the approach using capsule architecture achieves strong results and exceeds other techniques on a wide range of datasets from industrial application to face recognition benchmarks while being easy to use and optimise.
[CV-66] HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding
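[Quick Read]: This paper addresses the heavy computational load that massive visual inputs place on video large language model (Video-LLM) inference, aiming for acceleration close to that of text-only LLMs. Existing methods mainly prune redundant visual tokens, with limited effect for two reasons: the pruning strategies fail to preserve key visual semantic tokens, which degrades draft quality and acceptance rates; and even with aggressive pruning (e.g., removing 90% of visual tokens), the inference cost of the remaining tokens still caps the overall speedup. The key to the solution is the HIPPO framework, with two core contributions: (i) a semantic-aware token preservation method that fuses global attention scores with local visual semantics to keep draft quality high at large pruning ratios; (ii) a video parallel speculative decoding algorithm that decouples and overlaps the draft generation and target verification phases, raising overall inference efficiency. Experiments on four video-LLMs across six benchmarks show speedups of up to 3.51x.
Link: https://arxiv.org/abs/2601.08273
Authors: Qitan Lv, Tianyu Liu, Wen Wu, Xuenan Xu, Bowen Zhou, Feng Wu, Chao Zhang
Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; Department of Electronic Engineering, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: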
Abstract:Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the computational burden of massive visual inputs. However, existing methods do not achieve inference acceleration comparable to text-only LLMs. We observe from extensive experiments that this phenomenon mainly stems from two limitations: (i) their pruning strategies inadequately preserve visual semantic tokens, degrading draft quality and acceptance rates; (ii) even with aggressive pruning (e.g., 90% visual tokens removed), the draft model’s remaining inference cost limits overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. Specifically, HIPPO proposes (i) a semantic-aware token preservation method, which fuses global attention scores with local visual semantics to retain semantic information at high pruning ratios; (ii) a video parallel SD algorithm that decouples and overlaps draft generation and target verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO’s effectiveness, yielding up to 3.51x speedup compared to vanilla auto-regressive decoding.
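The token preservation idea in the HIPPO abstract (fusing a global attention score with a local semantic score, then keeping only a small fraction of visual tokens) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' method: the min-max normalization, the linear fusion weight `alpha`, and the function name `select_visual_tokens` are all assumptions.

```python
def select_visual_tokens(global_attn, local_sem, keep_ratio=0.1, alpha=0.5):
    """Return the (sorted) indices of visual tokens to keep.

    global_attn: per-token global attention scores (assumed given).
    local_sem:   per-token local semantic scores (assumed given).
    keep_ratio:  fraction of tokens kept, e.g. 0.1 for 90% pruning.
    alpha:       hypothetical fusion weight between the two scores.
    """
    if len(global_attn) != len(local_sem):
        raise ValueError("score lists must have the same length")
    if not global_attn:
        return []

    def normalize(scores):
        # Min-max normalization so both score types share a [0, 1] range.
        lo, hi = min(scores), max(scores)
        span = hi - lo
        return [0.0 if span == 0 else (s - lo) / span for s in scores]

    g = normalize(global_attn)
    l = normalize(local_sem)
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(g, l)]

    k = max(1, round(keep_ratio * len(fused)))
    top = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(top)  # restore temporal/spatial order of kept tokens
```

A token that scores moderately on both criteria can outrank one that scores highly on only one, which is the intuition behind fusing the two signals before pruning.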
[CV-67] AIMC-Spec: A Benchmark Dataset for Automatic Intrapulse Modulation Classification under Variable Noise Conditions
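TL;DR: This paper addresses the long-standing lack of standardized datasets for automatic intrapulse modulation classification (AIMC), a bottleneck that has held back automated radar signal analysis in electronic support systems, especially under noisy or degraded signal conditions. Its solution is AIMC-Spec, a comprehensive synthetic dataset framed as spectrogram image classification that covers 33 modulation types at 13 signal-to-noise ratio (SNR) levels. Five representative deep learning models (including lightweight CNNs, denoising architectures, and Transformer-based networks) are re-implemented and evaluated under a unified input format, giving AIMC a reproducible benchmark and a standardized basis for future research.
Link: https://arxiv.org/abs/2601.08265
Authors: Sebastian L. Cocks,Salvador Dreo,Feras Dayoub
Affiliations: Adelaide University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work is published in IEEE Access. DOI: https://doi.org/10.1109/ACCESS.2025.3645091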
Abstract:A lack of standardized datasets has long hindered progress in automatic intrapulse modulation classification (AIMC) - a critical task in radar signal analysis for electronic support systems, particularly under noisy or degraded conditions. AIMC seeks to identify the modulation type embedded within a single radar pulse from its complex in-phase and quadrature (I/Q) representation, enabling automated interpretation of intrapulse structure. This paper introduces AIMC-Spec, a comprehensive synthetic dataset for spectrogram-based image classification, encompassing 33 modulation types across 13 signal-to-noise ratio (SNR) levels. To benchmark AIMC-Spec, five representative deep learning algorithms - ranging from lightweight CNNs and denoising architectures to transformer-based networks - were re-implemented and evaluated under a unified input format. The results reveal significant performance variation, with frequency-modulated (FM) signals classified more reliably than phase or hybrid types, particularly at low SNRs. A focused FM-only test further highlights how modulation type and network architecture influence classifier robustness. AIMC-Spec establishes a reproducible baseline and provides a foundation for future research and standardization in the AIMC domain.
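To make the "modulation type at a given SNR level" setup concrete, the sketch below synthesizes one linear-FM pulse as complex I/Q samples and adds white Gaussian noise at a target SNR. It is a generic illustration, not the AIMC-Spec generation pipeline: the pulse length, sweep fraction, and the -6 dB value in the usage note are hypothetical choices, and the spectrogram step is omitted.

```python
import math
import random


def lfm_pulse(n_samples, sweep_fraction=0.25):
    """Unit-amplitude linear-FM (chirp) pulse as complex I/Q samples.

    The instantaneous frequency rises linearly from 0 to `sweep_fraction`
    cycles per sample over the pulse (hypothetical parameter choice).
    """
    return [
        complex(math.cos(math.pi * sweep_fraction * i * i / n_samples),
                math.sin(math.pi * sweep_fraction * i * i / n_samples))
        for i in range(n_samples)
    ]


def add_awgn(iq, snr_db, rng):
    """Add complex white Gaussian noise so the result has the given SNR in dB."""
    signal_power = sum(abs(s) ** 2 for s in iq) / len(iq)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power / 2)  # split equally between I and Q
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for s in iq]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from the known clean signal."""
    signal_power = sum(abs(s) ** 2 for s in clean) / len(clean)
    noise_power = sum(abs(n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)
```

For example, `add_awgn(lfm_pulse(4096), -6.0, random.Random(0))` yields a pulse whose noise power is roughly four times its signal power; repeating this over a grid of SNR values and modulation generators is one plausible way to build such a benchmark.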
[CV-68] Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence
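TL;DR: This paper addresses two problems in existing label-free recognition of Activities of Daily Living (ADLs) based on Large Language Models (LLMs): existing methods rely on time-based segmentation, which is poorly matched to the contextual reasoning abilities of LLMs, and they lack an effective way to estimate prediction confidence. The solution introduces event-based segmentation, which fits LLM semantic understanding better, together with a novel confidence estimation method. On complex, realistic datasets this improves zero-shot ADL recognition and surpasses supervised data-driven methods, even with a relatively small LLM (e.g., Gemma 3 27B).
Link: https://arxiv.org/abs/2601.08241
Authors: Michele Fiori,Gabriele Civitarese,Marco Colussi,Claudio Bettini
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: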
Abstract:Unobtrusive sensor-based recognition of Activities of Daily Living (ADLs) in smart homes by processing data collected from IoT sensing devices supports applications such as healthcare, safety, and energy management. Recent zero-shot methods based on Large Language Models (LLMs) have the advantage of removing the reliance on labeled ADL sensor data. However, existing approaches rely on time-based segmentation, which is poorly aligned with the contextual reasoning capabilities of LLMs. Moreover, existing approaches lack methods for estimating prediction confidence. This paper proposes to improve zero-shot ADL recognition with event-based segmentation and a novel method for estimating prediction confidence. Our experimental evaluation shows that event-based segmentation consistently outperforms time-based LLM approaches on complex, realistic datasets and surpasses supervised data-driven methods, even with relatively small LLMs (e.g., Gemma 3 27B). The proposed confidence measure effectively distinguishes correct from incorrect predictions.
[CV-69] Knowledge-based learning in Text-RAG and Image-RAG
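TL;DR: This paper addresses the hallucination problem of generative AI models in chest X-ray analysis for medical image diagnosis, and aims to improve disease detection accuracy and prediction confidence. Its core solution is a multimodal approach that pairs a Vision Transformer (EVA-ViT) image encoder with a large language model (LLM) such as LLaMA or ChatGPT, and adds two retrieval-augmented generation (RAG) mechanisms: text-based RAG uses external knowledge to reduce the hallucination rate, while image-based RAG uses a KNN method to improve prediction confidence and calibration. Experiments show that the GPT-based LLM outperforms the LLaMA-based model in hallucination suppression and Expected Calibration Error (ECE), supporting multimodal RAG as a way to improve model reliability.
Link: https://arxiv.org/abs/2601.08226
Authors: Alexander Shim,Khalil Saieh,Samuel Clarke
Affiliations: Florida International University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 10 figures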
Abstract:This research analyzed and compared the multi-modal approach in the Vision Transformer (EVA-ViT) based image encoder with the LLaMA or ChatGPT LLM to reduce the hallucination problem and detect diseases in chest x-ray images. In this research, we utilized the NIH Chest X-ray image to train the model and compared it in image-based RAG, text-based RAG, and baseline. [3] [5] As a result, the text-based RAG [2] effectively reduces the hallucination problem by using external knowledge information, and the image-based RAG improved the prediction confidence and calibration by using the KNN methods. [4] Moreover, the GPT LLM showed better performance, a low hallucination rate, and better Expected Calibration Error (ECE) than the Llama-based model. This research shows the challenge of data imbalance, a complex multi-stage structure, but suggests a large experience environment and a balanced example of use.
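Since the comparison above hinges on Expected Calibration Error, a minimal sketch of the standard binned ECE may help: predictions are grouped into equal-width confidence bins, and the gap between average confidence and accuracy in each bin is averaged with weights proportional to bin size. The number of bins and the left-inclusive bin edges are assumptions here; the paper's exact settings are not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct:     booleans, True where the prediction was right.
    n_bins:      number of equal-width bins (assumed value).
    """
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A perfectly calibrated model has an ECE of 0; a model that is 95% confident but always right, or 55% confident but right only half the time, contributes a 0.05 gap in the corresponding bin.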
[CV-70] FUME: Fused Unified Multi-Gas Emission Network for Livestock Rumen Acidosis Detection
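TL;DR: This paper addresses real-time, non-invasive monitoring of ruminal acidosis in dairy cattle, where current methods depend on invasive pH measurement and are hard to use for continuous monitoring. It proposes FUME (Fused Unified Multi-gas Emission Network), a deep learning model based on dual-gas optical gas imaging (OGI) that fuses carbon dioxide (CO₂) and methane (CH₄) emission patterns captured by infrared cameras to classify rumen health as Healthy, Transitional, or Acidotic. FUME uses a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, and jointly optimizes gas plume segmentation and health classification. It needs only 1.28M parameters and 1.97G MACs, improving segmentation quality over existing methods at 10x lower computational cost, which supports the feasibility of gas-emission-based animal health monitoring.
Link: https://arxiv.org/abs/2601.08205
Authors: Taminul Islam,Toqi Tahamid Sarker,Mohamed Embaby,Khaled R Ahmed,Amer AbuGhazaleh
Affiliations: Southern Illinois University; University of California, Davis
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures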
Abstract:Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi-gas Emission Network), the first deep learning approach for rumen acidosis detection from dual-gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual-gas OGI dataset comprising 8,967 annotated frames across six pH levels with pixel-level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs–outperforming state-of-the-art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual-task learning is essential for optimal performance. Our work establishes the feasibility of gas emission-based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Codes are available at this https URL.
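The segmentation result above is reported as mIoU. As a reminder of what that number measures, here is a minimal sketch of mean Intersection-over-Union over flattened label masks. Skipping classes that appear in neither mask is a common convention assumed here, not something the abstract specifies.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equally long sequences of class ids.

    pred, target: flattened per-pixel class labels.
    Classes absent from both pred and target are skipped (assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For a two-class (background, plume) toy mask with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, the per-class IoUs are 1/2 and 2/3, so the mean is 7/12.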
[CV-71] MobiDiary: Autoregressive Action Captioning with Wearable Devices and Wireless Signals
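TL;DR: This paper addresses the limitation that conventional Human Activity Recognition (HAR) in smart environments outputs only pre-defined labels and cannot produce natural language descriptions, along with the semantic gap between heterogeneous physical signals (such as inertial measurement unit (IMU) and Wi-Fi signals) and language. The proposed MobiDiary framework uses a unified sensor encoder to map heterogeneous signals into signal tokens in a shared semantic space: the encoder applies a patch-based mechanism to capture local temporal correlations and a heterogeneous placement embedding to unify the spatial context of different sensors. A Transformer decoder then generates a coherent action description word by word in an autoregressive manner. This design avoids modality-specific engineering and maps continuous, noisy signals to human-readable summaries.
Link: https://arxiv.org/abs/2601.08204
Authors: Fei Deng,Yinghui He,Chuntong Chu,Ge Wang,Han Ding,Jinsong Han,Fei Wang
Affiliations: Xi’an Jiaotong University; Nanyang Technological University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review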
Abstract:Human Activity Recognition (HAR) in smart homes is critical for health monitoring and assistive living. While vision-based systems are common, they face privacy concerns and environmental limitations (e.g., occlusion). In this work, we present MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (specifically IMU and Wi-Fi). Unlike conventional approaches that restrict outputs to pre-defined labels, MobiDiary produces expressive, human-readable summaries. To bridge the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions, we propose a unified sensor encoder. Instead of relying on modality-specific engineering, we exploit the shared inductive biases of motion-induced signals–where both inertial and wireless data reflect underlying kinematic dynamics. Specifically, our encoder utilizes a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embedding to unify spatial contexts across different sensors. These unified signal tokens are then fed into a Transformer-based decoder, which employs an autoregressive mechanism to generate coherent action descriptions word-by-word. We comprehensively evaluate our approach on multiple public benchmarks (XRF V2, UWash, and WiFiTAD). Experimental results demonstrate that MobiDiary effectively generalizes across modalities, achieving state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforming specialized baselines in continuous action understanding.
[CV-72] Unified Multi-Site Multi-Sequence Brain MRI Harmonization Enriched by Biomedical Semantic Style MICCAI2025
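TL;DR: This paper addresses the non-biological heterogeneity that arises when multi-site brain MRI data are pooled for deep learning, caused by differences in scanner vendors, acquisition parameters, and imaging protocols, which can markedly weaken model generalization. Existing methods often rely on limited paired traveling-subject data or fail to separate image style from anatomy, and most handle only single-sequence MRI, which does not match clinical practice where multi-sequence MRI is routine. The proposed MMH framework contributes: (1) a diffusion-based global harmonizer that maps images to a sequence-specific unified domain using style-agnostic gradient conditioning; (2) a target-specific fine-tuning module that adapts the result to a desired target domain; (3) a tri-planar attention BiomedCLIP encoder that aggregates multi-view embeddings to characterize volumetric style explicitly, so that style and anatomical content can be disentangled without paired data.
Link: https://arxiv.org/abs/2601.08193
Authors: Mengqi Wu,Yongheng Sun,Qianqian Wang,Pew-Thian Yap,Mingxia Liu
Affiliations: University of North Carolina at Chapel Hill; North Carolina State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures. Extended version of a paper published at MICCAI 2025 (DOI: https://doi.org/10.1007/978-3-032-04947-6_65)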
Abstract:Aggregating multi-site brain MRI data can enhance deep learning model training, but also introduces non-biological heterogeneity caused by site-specific variations (e.g., differences in scanner vendors, acquisition parameters, and imaging protocols) that can undermine generalizability. Recent retrospective MRI harmonization seeks to reduce such site effects by standardizing image style (e.g., intensity, contrast, noise patterns) while preserving anatomical content. However, existing methods often rely on limited paired traveling-subject data or fail to effectively disentangle style from anatomy. Furthermore, most current approaches address only single-sequence harmonization, restricting their use in real-world settings where multi-sequence MRI is routinely acquired. To this end, we introduce MMH, a unified framework for multi-site multi-sequence brain MRI harmonization that leverages biomedical semantic priors for sequence-aware style alignment. MMH operates in two stages: (1) a diffusion-based global harmonizer that maps MR images to a sequence-specific unified domain using style-agnostic gradient conditioning, and (2) a target-specific fine-tuner that adapts globally aligned images to desired target domains. A tri-planar attention BiomedCLIP encoder aggregates multi-view embeddings to characterize volumetric style information, allowing explicit disentanglement of image styles from anatomy without requiring paired data. Evaluations on 4,163 T1- and T2-weighted MRIs demonstrate MMH’s superior harmonization over state-of-the-art methods in image feature clustering, voxel-level comparison, tissue segmentation, and downstream age and site classification.
[CV-73] Route Retrieve Reflect Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging
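TL;DR: This paper addresses the "single-pass black box" nature of current medical image analysis systems built on large vision-language models (VLMs), which offer little control over reasoning, safety, and spatial grounding. The central challenge is improving the clinical reliability and spatial consistency of model outputs, particularly in report generation and weakly supervised detection. The proposed R^4 framework coordinates four agents: a Router configures task-aware prompts from the image, patient history, and metadata; a Retriever combines exemplar memory with pass@k sampling to generate free-text reports and bounding boxes together; a Reflector critiques each draft-box pair for clinical error modes (e.g., negation, laterality errors, unsupported claims, contradictions); and a Repairer iteratively revises the narrative and spatial outputs under constraints while curating a library of high-quality exemplars for future cases. Without any gradient-based fine-tuning, the method raises LLM-as-a-Judge scores and mAP50, indicating that agentic routing, reflection, and repair can make VLMs more reliable and better grounded for clinical image interpretation.
Link: https://arxiv.org/abs/2601.08192
Authors: Md. Faiyaz Abdullah Sayeedi,Rashedur Rahman,Siam Tahsin Bhuiyan,Sefatul Wasi,Ashraful Islam,Saadia Binte Alam,AKM Mahbubur Rahman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: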
Abstract:Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7-+2.5 points and mAP50 by +2.5-+3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: this https URL
[CV-74] Human-inspired Global-to-Parallel Multi-scale Encoding for Lightweight Vision Models
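TL;DR: This paper addresses the difficulty of balancing parameter count, computational overhead, and task performance in lightweight vision networks. Existing methods can cut computation considerably but often at the cost of a large increase in parameters (e.g., LSNet, MobileMamba), which hinders deployment on resource-limited devices; some models borrow from human visual perception but oversimplify it and do not reflect how perception actually operates. The proposed GPM (Global-to-Parallel Multi-scale Encoding) first uses a Global Insight Generator (GIG) to extract holistic cues, then processes features at different scales in parallel branches: the LSAE branch emphasizes mid- and large-scale semantic relations, while an Inverted Residual Block (IRB) preserves fine-grained texture, giving a joint representation of global and local features. The design mirrors two behaviors of human vision: perceiving the whole before focusing on details, and keeping broad contextual awareness during local attention. The resulting lightweight H-GPE network shows a favorable accuracy-efficiency trade-off on image classification, object detection, and semantic segmentation.
Link: https://arxiv.org/abs/2601.08190
Authors: Wei Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 5 figures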
Abstract:Lightweight vision networks have witnessed remarkable progress in recent years, yet achieving a satisfactory balance among parameter scale, computational overhead, and task performance remains difficult. Although many existing lightweight models manage to reduce computation considerably, they often do so at the expense of a substantial increase in parameter count (e.g., LSNet, MobileMamba), which still poses obstacles for deployment on resource-limited devices. In parallel, some studies attempt to draw inspiration from human visual perception, but their modeling tends to oversimplify the visual process, making it hard to reflect how perception truly operates. Revisiting the cooperative mechanism of the human visual system, we propose GPM (Global-to-Parallel Multi-scale Encoding). GPM first employs a Global Insight Generator (GIG) to extract holistic cues, and subsequently processes features of different scales through parallel branches: LSAE emphasizes mid-/large-scale semantic relations, while IRB (Inverted Residual Block) preserves fine-grained texture information, jointly enabling coherent representation of global and local features. As such, GPM conforms to two characteristic behaviors of human vision perceiving the whole before focusing on details, and maintaining broad contextual awareness even during local attention. Built upon GPM, we further develop the lightweight H-GPE network. Experiments on image classification, object detection, and semantic segmentation show that H-GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy-efficiency trade-off compared with recent state-of-the-art lightweight models.
[CV-75] GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards
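TL;DR: This paper addresses the insufficient validation of Multimodal Large Language Models (MLLMs) across the full clinical workflow of gastrointestinal endoscopy, in particular how they compare with human endoscopists in diagnostic accuracy, spatial localization, and report quality. It builds GI-Bench, a benchmark covering 20 fine-grained lesion categories, and evaluates 12 mainstream MLLMs across five clinical stages (anatomical localization, lesion identification, diagnosis, findings description, and management recommendations). Using quantitative metrics (e.g., Macro-F1, mIoU) and a qualitative Likert scale, the models are compared systematically with three junior endoscopists and three residency trainees. The results show that current models approach human-level diagnostic reasoning but have a pronounced "spatial grounding bottleneck" and exhibit a "fluency-accuracy paradox", providing evidence and a dynamic evaluation mechanism for clinical use of MLLMs in gastrointestinal endoscopy.
Link: https://arxiv.org/abs/2601.08183
Authors: Yan Zhu,Te Luo,Pei-Yao Fu,Zhen Zhang,Zi-Long Wang,Yi-Fan Qu,Zi-Han Geng,Jia-Qi Xu,Lu Yao,Li-Yun Ma,Wei Su,Wei-Feng Chen,Quan-Lin Li,Shuo Wang,Ping-Hong Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 45 pages, 17 figures, 6 tables. Leaderboard available at: this https URL. Includes supplementary material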
Abstract:Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. This study aims to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and a multi-dimensional Likert scale. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical “spatial grounding bottleneck” persisted; human lesion localization (mIoU 0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a “fluency-accuracy paradox”: models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to “over-interpretation” and hallucination of visual […]. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at this https URL.
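The diagnostic comparison above is reported in Macro-F1, which averages per-class F1 without weighting by class frequency, so rare lesion categories count as much as common ones. A minimal sketch follows; the class names in the usage note are hypothetical placeholders, not GI-Bench's actual 20 categories.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

With hypothetical labels `y_true = ["polyp", "polyp", "ulcer", "ulcer"]` and `y_pred = ["polyp", "ulcer", "ulcer", "ulcer"]`, the per-class F1 values are 2/3 and 4/5, giving a Macro-F1 of 11/15.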
[CV-76] Second-order Gaussian directional derivative representations for image high-resolution corner detection
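TL;DR: This paper addresses a theoretical flaw in conventional corner detection methods that rely on a simple corner model: the grayscale information of two adjacent corners interferes with each other, which leads to inaccurate localization and weak robustness. The solution applies a second-order Gaussian directional derivative (SOGDD) filter to smooth two typical high-resolution corner models (END-type and L-type) and derives their SOGDD representations, which reveal a range of properties of high-resolution corners. With a suitably chosen Gaussian filtering scale, the method extracts intensity variation information from the image and detects adjacent corners accurately, improving localization accuracy, robustness to blur, image matching, and 3D reconstruction.
Link: https://arxiv.org/abs/2601.08182
Authors: Dongbo Xie,Junjie Qiu,Changming Sun,Weichuan Zhang
Affiliations: Shaanxi University of Science and Technology; CSIRO Data61
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: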
Abstract:Corner detection is widely used in various computer vision tasks, such as image matching and 3D reconstruction. Our research indicates that there are theoretical flaws in Zhang et al.'s use of a simple corner model to obtain a series of corner characteristics, as the grayscale information of two adjacent corners can affect each other. In order to address the above issues, a second-order Gaussian directional derivative (SOGDD) filter is used in this work to smooth two typical high-resolution angle models (i.e. END-type and L-type models). Then, the SOGDD representations of these two corner models were derived separately, and many characteristics of high-resolution corners were discovered, which enabled us to demonstrate how to select Gaussian filtering scales to obtain intensity variation information from images, accurately depicting adjacent corners. In addition, a new high-resolution corner detection method for images has been proposed for the first time, which can accurately detect adjacent corner points. The experimental results have verified that the proposed method outperforms state-of-the-art methods in terms of localization error, robustness to image blur transformation, image matching, and 3D reconstruction.
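For readers unfamiliar with the filter named above, the following is a sketch of the textbook second-order directional derivative of an isotropic 2D Gaussian with scale $\sigma$ along direction $\theta$. The isotropic form is an assumption made here for illustration; the paper may use a different (for example anisotropic) parameterization.

```latex
\begin{align}
g_\sigma(x, y) &= \frac{1}{2\pi\sigma^2}
  \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right), \\
\frac{\partial^2 g_\sigma}{\partial \mathbf{n}_\theta^2}
  &= \cos^2\theta \,\frac{\partial^2 g_\sigma}{\partial x^2}
   + 2\sin\theta\cos\theta \,\frac{\partial^2 g_\sigma}{\partial x\,\partial y}
   + \sin^2\theta \,\frac{\partial^2 g_\sigma}{\partial y^2} \\
  &= \left(\frac{(x\cos\theta + y\sin\theta)^2}{\sigma^4}
   - \frac{1}{\sigma^2}\right) g_\sigma(x, y),
\end{align}
```

where $\mathbf{n}_\theta = (\cos\theta, \sin\theta)$. Convolving an image with this kernel at several orientations $\theta$ gives the directional intensity-variation responses, and the scale $\sigma$ controls how strongly the responses of two nearby corners overlap.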
[CV-77] Instruction-Driven 3D Facial Expression Generation and Transition
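TL;DR: This paper addresses how to generate, from a text instruction, a smooth transition from one designated facial expression to another arbitrary expression, enabling more expressive and controllable 3D facial expression synthesis. It proposes the Instruction-driven Facial Expression Decomposer (IFED), a module that learns the correlation between textual descriptions and facial expression features. Combined with a vertex reconstruction loss that refines the semantic understanding of latent vectors, IFED underpins the proposed I2FET (Instruction to Facial Expression Transition) method, which generates an expression sequence that follows the instruction and yields high-quality, natural expression transitions.
Link: https://arxiv.org/abs/2601.08179
Authors: Anh H. Vo,Tae-Seok Kim,Hulin Jin,Soo-Mi Choi,Yong-Guk Kim
Affiliations: Sejong University; Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: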
Abstract:A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications. More information about our project can be found at this https URL.
[CV-78] CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval
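TL;DR: This paper addresses continuous understanding and reconstruction of dynamic 3D scenes, in particular how to store, retrieve, and update static scene information efficiently over multiple visits and long sequences while separating dynamic objects from the static environment. It proposes CogniMap3D, a bio-inspired framework with three core modules: a multi-stage motion cue mechanism that identifies dynamic objects using depth and camera pose priors; a cognitive mapping system that stores, recalls, and updates static scenes across visits, providing persistent scene memory; and a factor-graph-based optimization strategy that refines camera poses to improve reconstruction accuracy. Together these enable continuous, robust, and efficient 3D scene understanding and reconstruction in complex dynamic environments.
Link: https://arxiv.org/abs/2601.08175
Authors: Feiran Wang,Junyi Wu,Dawen Cai,Yuan Hong,Yan Yan
Affiliations: University of Illinois Chicago; University of Michigan; University of Connecticut
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL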
Abstract:We present CogniMap3D, a bioinspired framework for dynamic 3D scene understanding and reconstruction that emulates human cognitive processes. Our approach maintains a persistent memory bank of static scenes, enabling efficient spatial knowledge storage and rapid retrieval. CogniMap3D integrates three core capabilities: a multi-stage motion cue framework for identifying dynamic objects, a cognitive mapping system for storing, recalling, and updating static scenes across multiple visits, and a factor graph optimization strategy for refining camera poses. Given an image stream, our model identifies dynamic regions through motion cues with depth and camera pose priors, then matches static elements against its memory bank. When revisiting familiar locations, CogniMap3D retrieves stored scenes, relocates cameras, and updates memory with new observations. Evaluations on video depth estimation, camera pose reconstruction, and 3D mapping tasks demonstrate its state-of-the-art performance, while effectively supporting continuous scene understanding across extended sequences and multiple visits.
[CV-79] Towards Cross-Platform Generalization: Domain Adaptive 3D Detection with Augmentation and Pseudo-Labeling
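TL;DR: This paper addresses cross-platform 3D object detection, that is, keeping a model's generalization ability across different sensor configurations or data distributions. The solution builds on the PVRCNN++ framework, which integrates point-based and voxel-based features for efficient detection, and narrows the domain gap between source and target domains through tailored data augmentation and pseudo-label self-training, improving adaptation to unseen platforms.
Link: https://arxiv.org/abs/2601.08174
Authors: Xiyan Feng,Wenbo Zhang,Lu Zhang,Yunzhi Zhuge,Huchuan Lu,You He
Affiliations: Dalian University of Technology; Shenzhen International Graduate School, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: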
Abstract:This technical report represents the award-winning solution to the Cross-platform 3D Object Detection task in the RoboSense2025 Challenge. Our approach is built upon PVRCNN++, an efficient 3D object detection framework that effectively integrates point-based and voxel-based features. On top of this foundation, we improve cross-platform generalization by narrowing domain gaps through tailored data augmentation and a self-training strategy with pseudo-labels. These enhancements enabled our approach to secure the 3rd place in the challenge, achieving a 3D AP of 62.67% for the Car category on the phase-1 target domain, and 58.76% and 49.81% for Car and Pedestrian categories respectively on the phase-2 target domain.
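The self-training step mentioned above typically works by running the current detector on unlabeled target-domain data and keeping only confident detections as pseudo-labels for the next training round. The sketch below shows that filtering step in isolation. It is a generic illustration: the `Detection` type, the per-class thresholds, and the default threshold are hypothetical and do not come from the report.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # predicted class, e.g. "Car" or "Pedestrian"
    score: float  # detector confidence in [0, 1]
    box: tuple    # 3D box, assumed layout (x, y, z, length, width, height, yaw)


def select_pseudo_labels(detections, thresholds, default_threshold=0.7):
    """Keep detections whose score meets the threshold for their class.

    thresholds: hypothetical per-class confidence cut-offs; classes that are
    missing from the dict fall back to `default_threshold`.
    """
    return [
        det for det in detections
        if det.score >= thresholds.get(det.label, default_threshold)
    ]
```

Using a lower cut-off for harder classes is one common design choice, since a single strict threshold tends to discard almost all pseudo-labels for small objects such as pedestrians.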
[CV-80] Representation Learning with Semantic-aware Instance and Sparse Token Alignments
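TL;DR: This paper addresses the damage to semantic structure that occurs in medical contrastive vision-language pre-training (VLP) when all unpaired samples are strictly treated as negatives. In medical datasets, images or reports from different patients can be highly similar, so treating them mechanically as negatives introduces false negatives and degrades the learned representations. The proposed multi-level alignment framework, SISTA (Representation Learning with Semantic-aware Instance and Sparse Token Alignments), aligns semantics at two levels: the image-report level and the fine-grained level of image patches and word tokens. It uses inter-report similarity to filter out false negatives and introduces a method for sparse alignment between image patches and relevant tokens, improving transfer performance across tasks, especially on fine-grained tasks with limited labeled data.
Link: https://arxiv.org/abs/2601.08165
Authors: Phuoc-Nguyen Bui,Toan Duc Nguyen,Junghyun Bum,Duc-Tai Le,Hyunseung Choo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review, 8 pages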
Abstract:Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives, can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA) by exploiting the semantic correspondence between medical image and radiology reports at two levels, i.e., image-report and patch-word levels. Specifically, we improve the conventional contrastive learning by incorporating inter-report similarity to eliminate the false negatives and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.
[CV-81] A Hardware-Algorithm Co-Designed Framework for HDR Imaging and Dehazing in Extreme Rocket Launch Environments
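TL;DR: This paper addresses quantitative optical measurement of key mechanical parameters (such as plume flow fields, shock wave structures, and nozzle oscillations) under the extreme imaging conditions of a rocket launch. Intense combustion produces dense particulate haze and luminance variations above 120 dB, degrading image quality and undermining subsequent photogrammetric and velocimetric analyses. The solution is a hardware-algorithm co-design framework that pairs a custom Spatially Varying Exposure (SVE) sensor with a physics-aware dehazing algorithm. The SVE sensor captures multi-exposure data in a single shot, allowing robust haze assessment without relying on idealized atmospheric models; dynamic haze density estimation, region-adaptive illumination optimization, and multi-scale entropy-constrained fusion then separate haze from scene radiance. This recovers physically accurate visual information for the plume and engine region and provides a reliable image basis for extracting parameters such as particle velocity, flow instability frequency, and structural vibration.
Link: https://arxiv.org/abs/2601.08162
Authors: Jing Tao,Banglei Guan,Pengju Sun,Taihang Lei,Yang Shang,Qifeng Yu
Affiliations: National University of Defense Technology; Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper has been accepted by Acta Mechanica Sinica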
Abstract:Quantitative optical measurement of critical mechanical parameters – such as plume flow fields, shock wave structures, and nozzle oscillations – during rocket launch faces severe challenges due to extreme imaging conditions. Intense combustion creates dense particulate haze and luminance variations exceeding 120 dB, degrading image data and undermining subsequent photogrammetric and velocimetric analyses. To address these issues, we propose a hardware-algorithm co-design framework that combines a custom Spatially Varying Exposure (SVE) sensor with a physics-aware dehazing algorithm. The SVE sensor acquires multi-exposure data in a single shot, enabling robust haze assessment without relying on idealized atmospheric models. Our approach dynamically estimates haze density, performs region-adaptive illumination optimization, and applies multi-scale entropy-constrained fusion to effectively separate haze from scene radiance. Validated on real launch imagery and controlled experiments, the framework demonstrates superior performance in recovering physically accurate visual information of the plume and engine region. This offers a reliable image basis for extracting key mechanical parameters, including particle velocity, flow instability frequency, and structural vibration, thereby supporting precise quantitative analysis in extreme aerospace environments.
[CV-82] Robust Subpixel Localization of Diagonal Markers in Large-Scale Navigation via Multi-Layer Screening and Adaptive Matching
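[Quick read]: This paper addresses localization failures caused by complex background interference in large-scale flight navigation, together with the computational inefficiency of conventional sliding-window matching. The key to the solution is a three-tier framework. First, illumination equalization and structural information extraction reduce the data dimensionality. Second, a coarse-to-fine candidate selection strategy sharply cuts sliding-window cost and gives a fast estimate of the marker position. Finally, adaptive templates are generated for the candidate points, and template matching with correlation-coefficient extremum fitting reaches subpixel precision, so that diagonal markers are localized accurately in complex scenes for field-of-view measurement in navigation tasks.
Link: https://arxiv.org/abs/2601.08161
Authors: Jing Tao,Banglei Guan,Yang Shang,Shunkun Liang,Qifeng Yu
Affiliations: National University of Defense Technology; Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by Applied Optics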
Abstract:This paper proposes a robust, high-precision positioning methodology to address localization failures arising from complex background interference in large-scale flight navigation and the computational inefficiency inherent in conventional sliding window matching techniques. The proposed methodology employs a three-tiered framework incorporating multi-layer corner screening and adaptive template matching. Firstly, dimensionality is reduced through illumination equalization and structural information extraction. A coarse-to-fine candidate selection strategy minimizes sliding window computational costs, enabling rapid estimation of the marker’s position. Finally, adaptive templates are generated for candidate points, achieving subpixel precision through improved template matching with correlation coefficient extremum fitting. Experimental results demonstrate the method’s effectiveness in extracting and localizing diagonal markers in complex, large-scale environments, making it ideal for field-of-view measurement in navigation tasks.
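The abstract does not spell out how the correlation coefficient extremum fitting is done. A common way to get a subpixel peak from a grid of correlation scores is to fit a parabola through the integer peak and its two neighbours along each axis. The sketch below illustrates that generic technique only; the function names are hypothetical and this is not the authors' implementation.

```python
def subpixel_offset(left, center, right):
    """Vertex offset (in pixels) of the parabola through three equally spaced scores."""
    denom = left - 2.0 * center + right
    if denom == 0.0:  # flat neighbourhood: no refinement possible
        return 0.0
    return 0.5 * (left - right) / denom


def subpixel_peak_2d(corr):
    """Return the (row, col) of the correlation peak with subpixel refinement.

    `corr` is a list of rows of correlation scores. The integer peak is
    searched over interior cells only, so both neighbours always exist
    (an assumption of this sketch).
    """
    rows, cols = len(corr), len(corr[0])
    r, c = max(
        ((i, j) for i in range(1, rows - 1) for j in range(1, cols - 1)),
        key=lambda rc: corr[rc[0]][rc[1]],
    )
    dr = subpixel_offset(corr[r - 1][c], corr[r][c], corr[r + 1][c])
    dc = subpixel_offset(corr[r][c - 1], corr[r][c], corr[r][c + 1])
    return r + dr, c + dc
```

When the score surface is exactly quadratic near the peak, the fit recovers the true maximum. On real correlation maps it is only an approximation.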
[CV-83] Instance-Aligned Captions for Explainable Video Anomaly Detection
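[Quick read]: This paper addresses the lack of spatial grounding in existing explainable Video Anomaly Detection (VAD) methods. In multi-entity interactions in particular, current methods often produce incomplete or visually misaligned explanations, which reduces their trustworthiness. The key to the solution is instance-aligned captions, which link each textual claim to a specific object instance together with its appearance and motion attributes. The captions state who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded, enabling verifiable and actionable reasoning. This improves the accuracy and trustworthiness of explanations and provides a new benchmark for future research on trustworthy and interpretable anomaly detection.
Link: https://arxiv.org/abs/2601.08155
Authors: Inpyo Song,Minjun Joo,Joonhyung Kwon,Eunji Jeon,Jangwon Lee
Affiliations: SungKyunKwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: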
Abstract:Explainable video anomaly detection (VAD) is crucial for safety-critical applications, yet even with recent progress, much of the research still lacks spatial grounding, making the explanations unverifiable. This limitation is especially pronounced in multi-entity interactions, where existing explainable VAD methods often produce incomplete or visually misaligned descriptions, reducing their trustworthiness. To address these challenges, we introduce instance-aligned captions that link each textual claim to specific object instances with appearance and motion attributes. Our framework captures who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded, enabling verifiable and actionable reasoning. We annotate eight widely used VAD benchmarks and extend the 360-degree egocentric dataset, VIEW360, with 868 additional videos, eight locations, and four new anomaly types, creating VIEW360+, a comprehensive testbed for explainable VAD. Experiments show that our instance-level spatially grounded captions reveal significant limitations in current LLM- and VLM-based methods while providing a robust benchmark for future research in trustworthy and interpretable anomaly detection.
[CV-84] Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
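[Quick read]: This paper addresses the poorly understood mechanism by which Multimodal Large Language Models (MLLMs) fuse visual and textual information internally, in particular the lack of a systematic picture of how fusion evolves across network layers. The key to the solution is a layer-wise masking analysis, which shows that visual-text fusion is concentrated at a few specific layers rather than spread uniformly across the network, and that some models exhibit a late-stage "review" phenomenon in which visual signals are reactivated before output generation. Building on this, the authors analyze how attention evolves across layers and find persistent high-attention noise on irrelevant regions alongside gradually increasing attention on text-aligned regions. They then propose a training-free contrastive attention framework that explicitly models the attention transformation from the early fusion layers to the final output layer, highlighting meaningful attention shifts and improving multimodal reasoning performance.
Link: https://arxiv.org/abs/2601.08151
Authors: Shezheng Song,Shasha Li,Jie Yu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: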
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage “review” phenomenon where visual signals are reactivated before output generation. Besides, we further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.
[CV-85] Subspace Alignment for Vision-Language Model Test-time Adaptation
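[Quick read]: This paper addresses the performance drop that Vision-Language Models (VLMs) suffer under distribution shift during Test-Time Adaptation (TTA). Existing TTA methods use zero-shot predictions as pseudo-labels for self-training, which has two fundamental limitations under distribution shift: the Modality Gap, where cross-modal relations between visual and textual modalities become inaccurate, and Visual Nuisance, where visual embeddings carry large amounts of task-irrelevant noise that masks task-relevant semantics. The key to the solution is SubTTA, which aligns the semantic subspaces of the two modalities so that zero-shot predictions become reliable enough to guide TTA. First, it extracts the principal subspaces and minimizes the chordal distance between the visual and textual subspaces to bridge the modality gap. Second, it projects the aligned visual features onto the task-specific textual subspace to filter out task-irrelevant noise, and then runs standard TTA in this purified space to refine the decision boundaries.
Link: https://arxiv.org/abs/2601.08139
Authors: Zhichen Zeng,Wenxuan Bao,Xiao Lin,Ruizhong Qiu,Tianxin Wei,Xuying Ning,Yuchen Yan,Chen Luo,Monica Xiao Cheng,Jingrui He,Hanghang Tong
Affiliations: University of Illinois Urbana-Champaign; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 10 figures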
Abstract:Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.
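The two geometric operations named in the abstract, chordal distance between subspaces and projection onto a subspace, can be illustrated in a few lines. The sketch below uses one common definition of chordal distance for two k-dimensional subspaces with orthonormal bases U and V, namely sqrt(k - ||UᵀV||_F²). Whether SubTTA uses exactly this form is an assumption, and the paper extracts principal subspaces from embeddings, whereas this sketch simply orthonormalizes the vectors it is given. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors):
    """Modified Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def chordal_distance(basis_u, basis_v):
    """sqrt(k - ||U^T V||_F^2) for two orthonormal bases of equal size k (assumed definition)."""
    k = len(basis_u)
    overlap = sum(dot(u, v) ** 2 for u in basis_u for v in basis_v)
    return math.sqrt(max(k - overlap, 0.0))


def project(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out
```

Identical subspaces give distance 0. Projecting a visual feature onto the textual basis discards every component outside that span, which is the filtering effect the abstract describes.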
[CV-86] How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
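[Quick read]: This paper addresses insufficient semantic understanding of audio-visual scenes in Audio-Visual Semantic Segmentation (AVSS), where the task is not only to locate sound-emitting objects at the pixel level but also to understand their semantic category and context. The key to the solution is a new collaborative framework, Stepping Stone Plus (SSP), with three main components: 1) a pre-mask technique that uses optical flow to capture motion dynamics and improve segmentation of moving sound sources; 2) two kinds of textual prompts for stationary sound-emitting objects (such as alarm clocks), one naming the category of the sound-emitting object and the other describing the overall scene, to improve semantic discrimination; 3) a Visual-Textual Alignment (VTA) module that fuses cross-modal information, yielding more coherent and contextually relevant semantic segmentation.
Link: https://arxiv.org/abs/2601.08133
Authors: Peng Gao,Yujian Lee,Yongqi Xu,Wentao Fan
Affiliations: Hong Kong Baptist University, Hong Kong, China; Guangdong Provincial/Zhuhai Key Laboratory IRADS; Peking University, Shenzhen Graduate School, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: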
Abstract:Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. Contrary to a previous methodology, by decomposing the AVSS task into two discrete subtasks by initially providing a prompted segmentation mask to facilitate subsequent semantic analysis, our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.
[CV-87] PathoGen: Diffusion-Based Synthesis of Realistic Lesions in Histopathology Images
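[Quick read]: This paper addresses how the scarcity of expert-annotated lesion data (especially for rare pathologies and underrepresented disease subtypes) limits the development of robust AI models for histopathology diagnosis. The key to the solution is PathoGen, a diffusion-based generative model that performs controllable, high-fidelity lesion inpainting, synthesizing lesion regions in benign tissue images with natural tissue boundaries, preserved cellular structures, and authentic staining characteristics. Compared with conventional data augmentation, PathoGen uses the iterative refinement of diffusion models to generate lesion morphologies that respect tissue spatial relationships and cellular architecture. Because it produces realistic morphology and pixel-level ground-truth labels at the same time, it removes the manual annotation bottleneck and improves downstream segmentation performance, with the largest gains in data-scarce settings.
Link: https://arxiv.org/abs/2601.08127
Authors: Mohamad Koohi-Moghadam,Mohammad-Ali Nikouei Mahani,Kyongtae Tyler Bae
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures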
Abstract:The development of robust artificial intelligence models for histopathology diagnosis is severely constrained by the scarcity of expert-annotated lesion data, particularly for rare pathologies and underrepresented disease subtypes. While data augmentation offers a potential solution, existing methods fail to generate sufficiently realistic lesion morphologies that preserve the complex spatial relationships and cellular architectures characteristic of histopathological tissues. Here we present PathoGen, a diffusion-based generative model that enables controllable, high-fidelity inpainting of lesions into benign histopathology images. Unlike conventional augmentation techniques, PathoGen leverages the iterative refinement process of diffusion models to synthesize lesions with natural tissue boundaries, preserved cellular structures, and authentic staining characteristics. We validate PathoGen across four diverse datasets representing distinct diagnostic challenges: kidney, skin, breast, and prostate pathology. Quantitative assessment confirms that PathoGen outperforms state-of-the-art generative baselines, including conditional GAN and Stable Diffusion, in image fidelity and distributional similarity. Crucially, we show that augmenting training sets with PathoGen-synthesized lesions enhances downstream segmentation performance compared to traditional geometric augmentations, particularly in data-scarce regimes. Besides, by simultaneously generating realistic morphology and pixel-level ground truth, PathoGen effectively overcomes the manual annotation bottleneck. This approach offers a scalable pathway for developing generalizable medical AI systems despite limited expert-labeled data.
[CV-88] From Prompts to Deployment: Auto-Curated Domain-Specific Dataset Generation via Diffusion Models WACV2026
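[Quick read]: This paper addresses the distribution shift between pre-trained models and real deployment environments, which causes performance to drop in real-world scenes. The key to the solution is a three-stage automated pipeline. It first synthesizes target objects within domain-specific backgrounds through controlled inpainting. It then validates the quality of the generated data with a multi-modal assessment covering object detection, aesthetic scoring, and vision-language alignment. Finally, it applies a user-preference classifier to capture subjective selection criteria. The result is an efficient way to build high-quality, deployment-ready domain-specific synthetic datasets with less reliance on extensive real-world data collection.
Link: https://arxiv.org/abs/2601.08095
Authors: Dongsik Yoon,Jongeun Kim
Affiliations: HDC LABS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in the Workshop on Synthetic Adversarial ForEnsics (SAFE), WACV 2026 (oral presentation)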
Abstract:In this paper, we present an automated pipeline for generating domain-specific synthetic datasets with diffusion models, addressing the distribution shift between pre-trained models and real-world deployment environments. Our three-stage framework first synthesizes target objects within domain-specific backgrounds through controlled inpainting. The generated outputs are then validated via a multi-modal assessment that integrates object detection, aesthetic scoring, and vision-language alignment. Finally, a user-preference classifier is employed to capture subjective selection criteria. This pipeline enables the efficient construction of high-quality, deployable datasets while reducing reliance on extensive real-world data collection.
[CV-89] The Role of Noisy Data in Improving CNN Robustness for Image Classification
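[Quick read]: This paper addresses the performance degradation of Convolutional Neural Networks (CNNs) in real-world applications where input images are affected by noise and distortion. The key to the solution is to introduce controlled noise (Gaussian noise, salt-and-pepper noise, and Gaussian blur) into the training data, improving robustness under adverse conditions without sacrificing performance on clean data. Experiments show that polluting just 10% of the training data is enough to significantly reduce test loss and raise accuracy under fully corrupted test conditions, indicating that this strategy is a simple and effective regularizer that balances traditional data cleanliness against real-world resilience.
Link: https://arxiv.org/abs/2601.08043
Authors: Oscar H. Ramírez-Agudelo,Nicoleta Gorea,Aliza Reif,Lorenzo Bonasera,Michael Karl
Affiliations: German Aerospace Center (DLR), Institute for AI Safety and Security
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures, 2 tables, SPIE Applications of Machine Learning 2025, San Diego, August, 2025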
Abstract:Data quality plays a central role in the performance and robustness of convolutional neural networks (CNNs) for image classification. While high-quality data is often preferred for training, real-world inputs are frequently affected by noise and other distortions. This paper investigates the effect of deliberately introducing controlled noise into the training data to improve model robustness. Using the CIFAR-10 dataset, we evaluate the impact of three common corruptions, namely Gaussian noise, Salt-and-Pepper noise, and Gaussian blur at varying intensities and training set pollution levels. Experiments using a ResNet-18 model reveal that incorporating just 10% noisy data during training is sufficient to significantly reduce test loss and enhance accuracy under fully corrupted test conditions, with minimal impact on clean-data performance. These findings suggest that strategic exposure to noise can act as a simple yet effective regularizer, offering a practical trade-off between traditional data cleanliness and real-world resilience.
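A minimal sketch of what polluting a fraction of the training set could look like, assuming images are flat lists of pixel intensities in [0, 1]. The helper names, the clipping to [0, 1], and the 50/50 salt-versus-pepper split are assumptions of this sketch rather than details from the paper. Gaussian blur is omitted for brevity.

```python
import random


def add_gaussian_noise(pixels, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the valid intensity range [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def add_salt_and_pepper(pixels, amount, rng):
    """Replace roughly `amount` of the pixels with pure white (salt) or black (pepper)."""
    out = []
    for p in pixels:
        if rng.random() < amount:
            out.append(1.0 if rng.random() < 0.5 else 0.0)
        else:
            out.append(p)
    return out


def pollute(dataset, fraction, corrupt, rng):
    """Corrupt a random `fraction` of the images; return the new dataset and the chosen indices."""
    n_noisy = round(fraction * len(dataset))
    noisy_idx = set(rng.sample(range(len(dataset)), n_noisy))
    polluted = [corrupt(img) if i in noisy_idx else img for i, img in enumerate(dataset)]
    return polluted, noisy_idx
```

For example, `pollute(train_images, 0.10, lambda img: add_gaussian_noise(img, 0.1, rng), rng)` with `rng = random.Random(0)` would corrupt 10% of the images and leave the rest untouched.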
[CV-90] Rescind: Countering Image Misconduct in Biomedical Publications with Vision-Language and State-Space Modeling
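[Quick read]: This paper addresses biomedical image forgery, where image manipulation in biomedical publications (such as duplication, splicing, and region removal) undermines research integrity and reproducibility. The key to the solution is the first vision-language guided framework for this domain, which combines diffusion-based synthesis with vision-language prompting to generate semantically controlled forgeries. The paper also introduces Rescind, a large-scale benchmark dataset, and Integscan, a structured state space modeling method that uses attention-enhanced visual encoding and prompt-conditioned semantic alignment for precise forgery localization. A verification loop based on a vision-language model ensures that generated forgeries are semantically consistent with their prompts. Together these improve detection performance and lay a foundation for automated scientific integrity analysis.
Link: https://arxiv.org/abs/2601.08040
Authors: Soumyaroop Nandi,Prem Natarajan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: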
Abstract:Scientific image manipulation in biomedical publications poses a growing threat to research integrity and reproducibility. Unlike natural image forensics, biomedical forgery detection is uniquely challenging due to domain-specific artifacts, complex textures, and unstructured figure layouts. We present the first vision-language guided framework for both generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, our method enables realistic and semantically controlled manipulations, including duplication, splicing, and region removal, across diverse biomedical modalities. We introduce Rescind, a large-scale benchmark featuring fine-grained annotations and modality-specific splits, and propose Integscan, a structured state space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization. To ensure semantic fidelity, we incorporate a vision-language model based verification loop that filters generated forgeries based on consistency with intended prompts. Extensive experiments on Rescind and existing benchmarks demonstrate that Integscan achieves state of the art performance in both detection and localization, establishing a strong foundation for automated scientific integrity analysis.
[CV-91] Fiducial Exoskeletons: Image-Centric Robot State Estimation
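[Quick read]: This paper addresses the reliance of traditional robot state estimation on high-precision actuators and cumbersome calibration routines (such as hand-eye calibration), which makes robot-camera extrinsic calibration and joint state recovery slow and costly. The core of the solution is the Fiducial Exoskeleton: a lightweight 3D-printed mount on each arm link carrying a fiducial marker with known marker-link geometry. This recasts robot state estimation as 6D pose estimation from a single RGB image. The SE(3) pose of each link is estimated directly from the image, and joint angles are then recovered by a lightweight global optimization that enforces kinematic consistency with the observed link poses. The approach gives robust camera-robot extrinsics and joint states without any learning, simplifying deployment while improving state estimation accuracy and downstream 3D control performance.
Link: https://arxiv.org/abs/2601.08034
Authors: Cameron Smith,Basile Van Hoorick,Vitor Guizilini,Yue Wang
Affiliations: USC Physical Superintelligence (PSI) Lab; Toyota Research Institute
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: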
Abstract:We introduce Fiducial Exoskeletons, an image-based reformulation of 3D robot state estimation that replaces cumbersome procedures and motor-centric pipelines with single-image inference. Traditional approaches - especially robot-camera extrinsic estimation - often rely on high-precision actuators and require time-consuming routines such as hand-eye calibration. In contrast, modern learning-based robot control is increasingly trained and deployed from RGB observations on lower-cost hardware. Our key insight is twofold. First, we cast robot state estimation as 6D pose estimation of each link from a single RGB image: the robot-camera base transform is obtained directly as the estimated base-link pose, and the joint state is recovered via a lightweight global optimization that enforces kinematic consistency with the observed link poses (optionally warm-started with encoder readings). Second, we make per-link 6D pose estimation robust and simple - even without learning - by introducing the fiducial exoskeleton: a lightweight 3D-printed mount with a fiducial marker on each link and known marker-link geometry. This design yields robust camera-robot extrinsics, per-link SE(3) poses, and joint-angle state from a single image, enabling robust state estimation even on unplugged robots. Demonstrated on a low-cost robot arm, fiducial exoskeletons substantially simplify setup while improving calibration, state accuracy, and downstream 3D control performance. We release code and printable hardware designs to enable further algorithm-hardware co-design.
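With known marker-link geometry, a detected marker pose converts to a link pose by composing rigid transforms. The sketch below assumes 4x4 homogeneous matrices stored as nested lists, where `cam_T_marker` is the marker pose observed in the camera frame and `link_T_marker` is the known marker pose in the link frame. These names and the frame convention are illustrative assumptions, not the authors' released code.

```python
def matmul(a, b):
    """Product of two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p], without a general matrix inverse."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def link_pose_in_camera(cam_T_marker, link_T_marker):
    """cam_T_link = cam_T_marker * inverse(link_T_marker)."""
    return matmul(cam_T_marker, rigid_inverse(link_T_marker))
```

Applied to the base link, this gives the robot-camera transform directly, as the abstract describes. Recovering joint angles from the per-link poses requires the kinematic optimization, which is not sketched here.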
[CV-92] A Highly Efficient Diversity-based Input Selection for DNN Improvement Using VLMs
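[Quick read]: This paper addresses the high cost and low efficiency of labeling newly collected inputs when fine-tuning Deep Neural Networks (DNNs). Existing diversity-based input selection methods are effective but computationally expensive and hard to scale to large input sets, which limits their practical use. The key to the solution is a new metric, Concept-Based Diversity (CBD), which uses Vision-Language Models (VLMs) to extract concept features from image inputs and measures diversity among samples on that basis. CBD correlates strongly with Geometric Diversity (GD) while requiring far less computation. The authors further design a hybrid selection strategy that combines CBD with Margin, a simple uncertainty metric. Across multiple DNN models, datasets, and selection budgets it outperforms current state-of-the-art baselines, with efficiency close to that of simple uncertainty-based methods, and it remains scalable on large input sets such as ImageNet.
Link: https://arxiv.org/abs/2601.08024
Authors: Amin Abbasishahkoo,Mahboubeh Dadkhah,Lionel Briand
Affiliations: University of Ottawa; Research Ireland Lero centre for software, University of Limerick
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: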
Abstract:Maintaining or improving the performance of Deep Neural Networks (DNNs) through fine-tuning requires labeling newly collected inputs, a process that is often costly and time-consuming. To alleviate this problem, input selection approaches have been developed in recent years to identify small, yet highly informative subsets for labeling. Diversity-based selection is one of the most effective approaches for this purpose. However, they are often computationally intensive and lack scalability for large input sets, limiting their practical applicability. To address this challenge, we introduce Concept-Based Diversity (CBD), a highly efficient metric for image inputs that leverages Vision-Language Models (VLM). Our results show that CBD exhibits a strong correlation with Geometric Diversity (GD), an established diversity metric, while requiring only a fraction of its computation time. Building on this finding, we propose a hybrid input selection approach that combines CBD with Margin, a simple uncertainty metric. We conduct a comprehensive evaluation across a diverse set of DNN models, input sets, selection budgets, and five most effective state-of-the-art selection baselines. The results demonstrate that the CBD-based selection consistently outperforms all baselines at guiding input selection to improve the DNN model. Furthermore, the CBD-based selection approach remains highly efficient, requiring selection times close to those of simple uncertainty-based methods such as Margin, even on larger input sets like ImageNet. These results confirm not only the effectiveness and computational advantage of the CBD-based approach, particularly compared to hybrid baselines, but also its scalability in repetitive and extensive input selection scenarios.
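Margin, the uncertainty metric mentioned above, is usually defined as the gap between the two highest predicted class probabilities. A small gap means the model is torn between two classes. The sketch below shows that standard definition and a budget-limited selection built on it. How CBD is computed and combined with Margin is not specified in the abstract, so that part is deliberately left out, and the function names are hypothetical.

```python
def margin(probs):
    """Difference between the largest and second-largest class probability."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def select_by_margin(prob_rows, budget):
    """Indices of the `budget` inputs with the smallest margin (most uncertain first)."""
    ranked = sorted(range(len(prob_rows)), key=lambda i: margin(prob_rows[i]))
    return ranked[:budget]
```

Given softmax outputs for the unlabeled pool, `select_by_margin(outputs, budget)` returns the inputs to send for labeling under a pure uncertainty criterion.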
[CV-93] Training Free Zero-Shot Visual Anomaly Localization via Diffusion Inversion
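[Quick read]: This paper addresses the lack of spatial precision in Zero-Shot image Anomaly Detection (ZSAD): existing vision-only methods are typically limited to image-level classification and cannot localize anomalous regions precisely. The key to the solution is the inversion mechanism of a pretrained Denoising Diffusion Implicit Model (DDIM). Given an input image and a generic text description (e.g., "an image of an [object class]"), the image is inverted into latent representations, and denoising starts from a fixed intermediate timestep to reconstruct it. Because the diffusion model is trained only on normal data, the reconstruction looks normal, and the discrepancy between the input and the reconstruction reveals potential anomalous regions. The method needs no extra modalities or fine-grained prompts, performs training-free vision-only anomaly detection and localization, and achieves leading performance on the VISA dataset.
Link: https://arxiv.org/abs/2601.08022
Authors: Samet Hicsonmez,Abd El Rahman Shabayek,Djamila Aouada
Affiliations: University of Luxembourg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: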
Abstract:Zero-Shot image Anomaly Detection (ZSAD) aims to detect and localise anomalies without access to any normal training samples of the target data. While recent ZSAD approaches leverage additional modalities such as language to generate fine-grained prompts for localisation, vision-only methods remain limited to image-level classification, lacking spatial precision. In this work, we introduce a simple yet effective training-free vision-only ZSAD framework that circumvents the need for fine-grained prompts by leveraging the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM). Specifically, given an input image and a generic text description (e.g., “an image of an [object class]”), we invert the image to obtain latent representations and initiate the denoising process from a fixed intermediate timestep to reconstruct the image. Since the underlying diffusion model is trained solely on normal data, this process yields a normal-looking reconstruction. The discrepancy between the input image and the reconstructed one highlights potential anomalies. Our method achieves state-of-the-art performance on VISA dataset, demonstrating strong localisation capabilities without auxiliary modalities and facilitating a shift away from prompt dependence for zero-shot anomaly detection research. Code is available at this https URL.
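The final step described in the abstract, turning the discrepancy between input and reconstruction into a localization, can be sketched without any diffusion model. The code below assumes the reconstruction is already available as a grayscale image of the same size and uses a plain per-pixel absolute difference with a fixed threshold. The actual discrepancy measure and any post-processing used by the authors are not stated in the abstract, so treat these choices as placeholders.

```python
def anomaly_map(image, reconstruction):
    """Per-pixel absolute difference between an image and its normal-looking reconstruction."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)] for row_a, row_b in zip(image, reconstruction)]


def localize(amap, threshold):
    """(row, col) coordinates whose discrepancy exceeds `threshold`."""
    return [(r, c) for r, row in enumerate(amap) for c, v in enumerate(row) if v > threshold]
```

Pixels that the model reconstructs faithfully score near zero. Regions it "repairs" toward a normal appearance stand out in the map.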
[CV-94] Representations of Text and Images Align From Layer One
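[Quick read]: This paper asks whether, in adapter-based vision-language models, alignment between image and text representations exists only in deep layers or is already present in early layers. The conventional view holds that image-text alignment appears mainly in late layers, and the paper challenges this with a new synthesis-based method. The key to the solution is inspired by DeepDream: a textual concept vector is extracted at a given layer, and optimization is used to generate an image whose representation aligns with it, giving direct, constructive evidence of semantic agreement between image and text concepts at each layer. The method needs no auxiliary models or datasets, is simple and fast, and opens a new path for interpretability by visualizing the model's representation space through backtracing its image processing components.
Link: https://arxiv.org/abs/2601.08017
Authors: Evžen Wybitul,Javier Rando,Florian Tramèr,Stanislav Fort
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: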
Abstract:We show that for a variety of concepts in adapter-based vision-language models, the representations of their images and their text descriptions are meaningfully aligned from the very first layer. This contradicts the established view that such image-text alignment only appears in late layers. We show this using a new synthesis-based method inspired by DeepDream: given a textual concept such as “Jupiter”, we extract its concept vector at a given layer, and then use optimisation to synthesise an image whose representation aligns with that vector. We apply our approach to hundreds of concepts across seven layers in Gemma 3, and find that the synthesised images often depict salient visual features of the targeted textual concepts: for example, already at layer 1, more than 50 % of images depict recognisable features of animals, activities, or seasons. Our method thus provides direct, constructive evidence of image-text alignment on a concept-by-concept and layer-by-layer basis. Unlike previous methods for measuring multimodal alignment, our approach is simple, fast, and does not require auxiliary models or datasets. It also offers a new path towards model interpretability, by providing a way to visualise a model’s representation space by backtracing through its image processing components.
[CV-95] Decoder Generates Manufacturable Structures: A Framework for 3D-Printable Object Synthesis
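[Quick read]: This paper addresses the limited manufacturability of 3D structures in generative design, that is, how to generate geometrically valid, printable 3D structures from abstract latent representations while satisfying Additive Manufacturing (AM) constraints. The key to the solution is a decoder-based deep learning framework: a neural decoder is trained to learn the complex mapping from latent space to geometries that respect manufacturing limits (such as overhang angles, wall thickness, and structural integrity), which substantially improves the manufacturability of generated objects compared with unconstrained generation.
Link: https://arxiv.org/abs/2601.08015
Authors: Abhishek Kumar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, 1 table. Presents a constraint-aware neural decoder for generating 3D-printable objects with 96.8% manufacturability rate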
Abstract:This paper presents a novel decoder-based approach for generating manufacturable 3D structures optimized for additive manufacturing. We introduce a deep learning framework that decodes latent representations into geometrically valid, printable objects while respecting manufacturing constraints such as overhang angles, wall thickness, and structural integrity. The methodology demonstrates that neural decoders can learn complex mapping functions from abstract representations to valid 3D geometries, producing parts with significantly improved manufacturability compared to naive generation approaches. We validate the approach on diverse object categories and demonstrate practical 3D printing of decoder-generated structures.
[CV-96] TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models
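[Quick read]: This paper addresses the weakness of current text-conditioned diffusion editors when a new object and a new style must be introduced at the same time. The core of the solution is TP-Blend, a lightweight, training-free framework built on two complementary attention processors. Cross-Attention Object Fusion (CAOF) averages head-wise attention to locate strongly responding spatial tokens, then uses entropy-regularised optimal transport to reassign complete multi-head feature vectors to the target positions, achieving precise object fusion while preserving correlations across the high-dimensional features. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization: a one-dimensional Gaussian filter separates low- and high-frequency components, and only the high-frequency residual is blended back, retaining brush-stroke-level texture without disrupting global geometry. SASF also swaps the Key and Value matrices for context-aware texture modulation. The design gives precise control over content and appearance and outperforms existing baselines in fidelity, perceptual quality, and inference speed.
Link: https://arxiv.org/abs/2601.08011
Authors: Xin Jin,Yichuan Zhong,Yapeng Tian
Affiliations: GenPi Inc.; The University of Texas at Dallas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: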
Abstract:Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
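The frequency split that SASF relies on can be illustrated on a plain 1D signal: a Gaussian low-pass gives the low-frequency part, and the residual is the high-frequency part. The sketch below is a toy version under stated assumptions. It works on a list of floats rather than attention features, clamps indices at the borders, and the `blend_high_frequency` helper with its `weight` parameter is a hypothetical illustration of "blending only the high-frequency residual", not the paper's Detail-Sensitive Instance Normalization.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def split_frequencies(signal, sigma=1.0, radius=2):
    """Return (low, high), where low is the Gaussian-smoothed signal and high = signal - low."""
    kernel = gaussian_kernel_1d(sigma, radius)
    n = len(signal)
    low = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the borders
            acc += w * signal[j]
        low.append(acc)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def blend_high_frequency(content, style, weight=1.0, sigma=1.0, radius=2):
    """Hypothetical blend: add only the style's high-frequency residual to the content."""
    _, style_high = split_frequencies(style, sigma, radius)
    return [c + weight * h for c, h in zip(content, style_high)]
```

A perfectly flat style signal has no high-frequency residual, so blending it leaves the content unchanged. This mirrors the claim that global structure is not disturbed.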
[CV-97] CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation
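[Quick read]: This paper addresses the instability of Vision-Language Models (VLMs) in multi-step reasoning, where repeated sampling on the same input often yields divergent reasoning paths and inconsistent final predictions. The key to the solution is two complementary methods for stabilizing reasoning: (1) CASHEW, an inference-time framework based on test-time scaling that iteratively aggregates multiple candidate reasoning trajectories and uses explicit visual verification to filter hallucinated steps, anchoring the reasoning in visual evidence; (2) CASHEW-RL, a learned variant that internalizes the aggregation behavior. It is trained with Group Sequence Policy Optimization (GSPO) using a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, and it allocates reasoning effort adaptively according to task difficulty, enabling robust self-aggregation at inference.
Link: https://arxiv.org/abs/2601.08010
Authors: Chaoyu Li,Deeparghya Dutta Barua,Fei Tao,Pooyan Fazli
Affiliations: Arizona State University; NewsBreak
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: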
Abstract:Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
[CV-98] Operator learning for models of tear film breakup
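[Quick read]: This paper addresses the problem that estimating tear film (TF) thickness and osmolarity from fluorescence (FL) imaging traditionally requires solving computationally expensive inverse problems. The key to the solution is an operator learning framework in which neural operators trained on simulated tear film dynamics replace traditional inverse solvers, enabling fast, data-driven analysis of tear film dynamics.
Link: https://arxiv.org/abs/2601.08001
Authors: Qinying Chen,Arnab Roy,Tobin A. Driscoll
Affiliations: University of Delaware
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: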
Abstract:Tear film (TF) breakup is key to understanding dry eye disease, yet estimating TF thickness and osmolarity from fluorescence (FL) imaging typically requires solving computationally expensive inverse problems. We propose an operator learning framework that replaces traditional inverse solvers with neural operators trained on simulated TF dynamics. This approach offers a scalable path toward rapid, data-driven analysis of tear film dynamics.
[CV-99] Predicting Region of Interest in Human Visual Search Based on Statistical Texture and Gabor Features
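[Quick read]: This paper addresses how to model the way humans allocate attention in visual search tasks with unknown target location, in particular how fusing different types of image features can improve the prediction of early fixation regions. The key to the solution is two feature-fusion pipelines that combine Gabor features with gray-level co-occurrence matrix (GLCM) texture features to narrow the region of possible human fixations. Experiments show a strong correlation between the GLCM mean and Gabor feature responses, indicating that the two encode similar image information despite their different formulations. On digital breast tomosynthesis images, the fused model's early-fixation predictions are consistent with human eye-tracking data, supporting the joint modeling of structural and texture features in the study of visual search behavior.
Link: https://arxiv.org/abs/2601.07998
Authors: Hongwei Lin,Diego Andrade,Mini Das,Howard C. Gifford
Affiliations: University of Houston
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
Comments: 10 pages, 6 figures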
Abstract:Understanding human visual search behavior is a fundamental problem in vision science and computer vision, with direct implications for modeling how observers allocate attention in location-unknown search tasks. In this study, we investigate the relationship between Gabor-based features and gray-level co-occurrence matrix (GLCM) based texture features in modeling early-stage visual search behavior. Two feature-combination pipelines are proposed to integrate Gabor and GLCM features for narrowing the region of possible human fixations. The pipelines are evaluated using simulated digital breast tomosynthesis images. Results show qualitative agreement among fixation candidates predicted by the proposed pipelines and a threshold-based model observer. A strong correlation is observed between GLCM mean and Gabor feature responses, indicating that these features encode related image information despite their different formulations. Eye-tracking data from human observers further suggest consistency between predicted fixation regions and early-stage gaze behavior. These findings highlight the value of combining structural and texture-based features for modeling visual search and support the development of perceptually informed observer models.
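For readers unfamiliar with the GLCM mean mentioned above, a small worked example may help. This is a minimal sketch using the common textbook definition (a normalized, non-symmetric co-occurrence matrix for a single pixel offset). The abstract does not state the paper's offsets, quantization, or normalization, so those choices and the function names here are assumptions.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the offset (dy, dx).

    image  : 2D list of integer gray levels in [0, levels)
    returns: levels x levels matrix p where p[i][j] is the probability that
             a pixel with level i has a neighbor at the offset with level j
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset leaves no valid pixel pairs")
    return [[v / total for v in row] for row in counts]


def glcm_mean(p):
    """Mean gray level of the reference pixel: sum over i, j of i * p[i][j]."""
    return sum(i * v for i, row in enumerate(p) for v in row)


patch = [
    [0, 0, 1],
    [1, 2, 2],
    [2, 2, 3],
]
p = glcm(patch, levels=4)
print(glcm_mean(p))  # reference levels are 0, 0, 1, 2, 2, 2, so the mean is 7/6
```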
[CV-100] Likelihood ratio for a binary Bayesian classifier under a noise-exclusion model
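[Quick read]: This paper addresses performance optimization for target detection and recognition in medical image perception, computer vision, and security, in particular how to improve visual search efficiency while reducing system complexity. The key to the solution is a new statistical ideal observer model that performs holistic visual search (or gist processing) by placing minimum thresholds on extractable image features. This reduces the number of free parameters and shrinks the model while remaining sensitive to key features, providing theoretical support for imaging system optimization, algorithm evaluation, feature selection, and sensor performance testing.
Link: https://arxiv.org/abs/2601.07982
Authors: Howard C. Gifford
Affiliations: University of Houston
Categories: Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST); Computation (stat.CO)
Comments: 18 pages, 4 figures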
Abstract:We develop a new statistical ideal observer model that performs holistic visual search (or gist) processing in part by placing thresholds on minimum extractable image features. In this model, the ideal observer reduces the number of free parameters, thereby shrinking the system. The applications of this novel framework are in medical image perception (for optimizing imaging systems and algorithms), computer vision, benchmarking performance, and enabling feature selection and evaluation. Other applications are in target detection and recognition in defense and security, as well as in evaluating sensors and detectors.
[CV-101] An Efficient Additive Kolmogorov-Arnold Transformer for Point-Level Maize Localization in Unmanned Aerial Vehicle Imagery
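[Quick read]: This paper addresses three challenges in point-level maize plant localization in high-resolution unmanned aerial vehicle (UAV) imagery: (1) extremely low object-to-pixel ratios (typically below 0.1%), which make small objects hard to identify; (2) the prohibitive computational cost of conventional quadratic attention on ultra-large images (>3000×4000 pixels); and (3) the sparse distributions and environmental variability specific to agricultural scenes, which limit general-purpose vision models. The key to the solution is the Additive Kolmogorov-Arnold Transformer (AKT), which replaces conventional multilayer perceptrons (MLPs) with Pade Kolmogorov-Arnold Network (PKAN) modules to strengthen feature representation for small objects, and introduces PKAN Additive Attention (PAA) to model multiscale spatial dependencies at lower computational complexity, substantially improving inference efficiency while maintaining high accuracy.
Link: https://arxiv.org/abs/2601.07975
Authors: Fei Li,Lang Qiao,Jiahao Fan,Yijia Xu,Shawn M. Kaeppler,Zhou Zhang
Affiliations: University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: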
Abstract:High-resolution UAV photogrammetry has become a key technology for precision agriculture, enabling centimeter-level crop monitoring and point-level plant localization. However, point-level maize localization in UAV imagery remains challenging due to (1) extremely small object-to-pixel ratios, typically less than 0.1%, (2) prohibitive computational costs of quadratic attention on ultra-high-resolution images larger than 3000 x 4000 pixels, and (3) agricultural scene-specific complexities such as sparse object distribution and environmental variability that are poorly handled by general-purpose vision models. To address these challenges, we propose the Additive Kolmogorov-Arnold Transformer (AKT), which replaces conventional multilayer perceptrons with Pade Kolmogorov-Arnold Network (PKAN) modules to enhance functional expressivity for small-object feature extraction, and introduces PKAN Additive Attention (PAA) to model multiscale spatial dependencies with reduced computational complexity. In addition, we present the Point-based Maize Localization (PML) dataset, consisting of 1,928 high-resolution UAV images with approximately 501,000 point annotations collected under real field conditions. Extensive experiments show that AKT achieves an average F1-score of 62.8%, outperforming state-of-the-art methods by 4.2%, while reducing FLOPs by 12.6% and improving inference throughput by 20.7%. For downstream tasks, AKT attains a mean absolute error of 7.1 in stand counting and a root mean square error of 1.95-1.97 cm in interplant spacing estimation. These results demonstrate that integrating Kolmogorov-Arnold representation theory with efficient attention mechanisms offers an effective framework for high-resolution agricultural remote sensing. 
[CV-102] Sesame Plant Segmentation Dataset: A YOLO Formatted Annotated Dataset
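[Quick read]: This paper addresses the difficulty of accurately identifying and analyzing sesame plants in agricultural scenes, especially insufficient detection accuracy in complex environments at early growth stages. The key to the solution is an open-source pixel-level segmentation dataset, the Sesame Plant Segmentation Dataset, comprising 206 training images, 43 validation images, and 43 test images annotated in YOLO-compatible format. The data were collected with a high-resolution mobile camera on farms in Jirdede, Katsina State, Nigeria, and annotated with Segment Anything Model version 2 (SAMv2) under farmer supervision. Compared with conventional bounding-box annotation, the pixel-level segmentation enables more precise localization and segmentation of sesame plants in real farm settings. Experiments with the Ultralytics YOLOv8 framework show strong performance on both detection and segmentation, providing a high-quality data foundation for crop monitoring, yield estimation, and agricultural research.
Link: https://arxiv.org/abs/2601.07970
Authors: Sunusi Ibrahim Muhammad,Ismail Ismail Tijjani,Saadatu Yusuf Jumare,Fatima Isah Jibrin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at International Conference on Computing and advance in Information Technology (ICCAIT2025). The dataset is available at Kaggle: this https URL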
Abstract:This paper presents the Sesame Plant Segmentation Dataset, an open source annotated image dataset designed to support the development of artificial intelligence models for agricultural applications, with a specific focus on sesame plants. The dataset comprises 206 training images, 43 validation images, and 43 test images in YOLO compatible segmentation format, capturing sesame plants at early growth stages under varying environmental conditions. Data were collected using a high resolution mobile camera from farms in Jirdede, Daura Local Government Area, Katsina State, Nigeria, and annotated using the Segment Anything Model version 2 with farmer supervision. Unlike conventional bounding box datasets, this dataset employs pixel level segmentation to enable more precise detection and analysis of sesame plants in real world farm settings. Model evaluation using the Ultralytics YOLOv8 framework demonstrated strong performance for both detection and segmentation tasks. For bounding box detection, the model achieved a recall of 79 percent, precision of 79 percent, mean average precision at IoU 0.50 of 84 percent, and mean average precision from 0.50 to 0.95 of 58 percent. For segmentation, it achieved a recall of 82 percent, precision of 77 percent, mean average precision at IoU 0.50 of 84 percent, and mean average precision from 0.50 to 0.95 of 52 percent. The dataset represents a novel contribution to sesame focused agricultural vision datasets in Nigeria and supports applications such as plant monitoring, yield estimation, and agricultural research.
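A short parser shows what a YOLO-compatible segmentation label looks like in practice. It is a sketch assuming the usual Ultralytics convention of one object per line, written as `class_id x1 y1 x2 y2 ...` with coordinates normalized to [0, 1]. The sample line, image size, and function names are made up for illustration and are not taken from the dataset.

```python
def parse_yolo_seg_line(line, img_w, img_h):
    """Parse one YOLO segmentation label line into (class_id, pixel polygon)."""
    parts = line.split()
    class_id = int(parts[0])
    coords = [float(v) for v in parts[1:]]
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError("a polygon needs at least 3 (x, y) pairs")
    # Scale normalized coordinates back to pixel units.
    polygon = [(coords[i] * img_w, coords[i + 1] * img_h)
               for i in range(0, len(coords), 2)]
    return class_id, polygon


def polygon_area(polygon):
    """Area of a simple polygon in square pixels (shoelace formula)."""
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical label: class 0, a rectangle covering the image center.
label = "0 0.25 0.25 0.75 0.25 0.75 0.75 0.25 0.75"
class_id, polygon = parse_yolo_seg_line(label, img_w=100, img_h=200)
print(class_id, polygon, polygon_area(polygon))
```

Because the polygon is stored per object, per-plant area can be estimated directly from the label, which a bounding-box annotation cannot provide.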
[CV-103] 3DGS-Drag: Dragging Gaussians for Intuitive Point-Based 3D Editing
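[Quick read]: This paper addresses the challenge of intuitive drag editing in 3D scenes, in particular efficient and consistent editing that involves geometric changes. Existing deformation-based and 2D-editing-based 3D editing methods are limited in their ability to control geometry-related content precisely. The key to the solution is 3DGS-Drag, a point-based 3D editing framework built on two core innovations: deformation guidance using 3D Gaussian Splatting (3DGS) to keep geometric modifications consistent, and diffusion guidance for content correction and visual quality enhancement. In addition, a progressive editing strategy supports aggressive drag operations, enabling edits such as motion change, shape adjustment, inpainting, and content extension, and achieving state-of-the-art results in geometry-related 3D content editing while remaining efficient.
Link: https://arxiv.org/abs/2601.07963
Authors: Jiahua Dong,Yu-Xiong Wang
Affiliations: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: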
Abstract:The transformative potential of 3D content creation has been progressively unlocked through advancements in generative models. Recently, intuitive drag editing with geometric changes has attracted significant attention in 2D editing yet remains challenging for 3D scenes. In this paper, we introduce 3DGS-Drag – a point-based 3D editing framework that provides efficient, intuitive drag manipulation of real 3D scenes. Our approach bridges the gap between deformation-based and 2D-editing-based 3D editing methods, addressing their limitations to geometry-related content editing. We leverage two key innovations: deformation guidance utilizing 3D Gaussian Splatting for consistent geometric modifications and diffusion guidance for content correction and visual quality enhancement. A progressive editing strategy further supports aggressive 3D drag edits. Our method enables a wide range of edits, including motion change, shape adjustment, inpainting, and content extension. Experimental results demonstrate the effectiveness of 3DGS-Drag in various scenes, achieving state-of-the-art performance in geometry-related 3D content editing. Notably, the editing is efficient, taking 10 to 20 minutes on a single RTX 4090 GPU.
[CV-104] LWMSCNN-SE: A Lightweight Multi-Scale Network for Efficient Maize Disease Classification on Edge Devices
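[Quick read]: This paper addresses the high computational cost of deploying conventional maize disease detection models in resource-constrained environments such as smartphones and drones. The key to the solution is a lightweight convolutional neural network (LWMSCNN-SE) that integrates multi-scale feature extraction, depthwise separable convolutions, and Squeeze-and-Excitation (SE) attention, achieving 96.63% classification accuracy with only 241,348 parameters and 0.666 GFLOPs. This balances accuracy and computational efficiency and suits real-time disease diagnosis on edge devices.
Link: https://arxiv.org/abs/2601.07957
Authors: Fikadu Weloday,Jianmei Su
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: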
Abstract:Maize disease classification plays a vital role in mitigating yield losses and ensuring food security. However, the deployment of traditional disease detection models in resource-constrained environments, such as those using smartphones and drones, faces challenges due to high computational costs. To address these challenges, we propose LWMSCNN-SE, a lightweight convolutional neural network (CNN) that integrates multi-scale feature extraction, depthwise separable convolutions, and Squeeze-and-Excitation (SE) attention mechanisms. This novel combination enables the model to achieve 96.63% classification accuracy with only 241,348 parameters and 0.666 GFLOPs, making it suitable for real-time deployment in field applications. Our approach addresses the accuracy–efficiency trade-off by delivering high accuracy while maintaining low computational costs, demonstrating its potential for efficient maize disease diagnosis on edge devices in precision farming systems.
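The parameter savings of depthwise separable convolutions follow from simple counting, sketched below. The layer sizes are illustrative only and are not taken from LWMSCNN-SE; biases are omitted and the function names are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv plus a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 32, 64     # illustrative layer, not from the paper
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))  # 18432 2336 7.89
```

The reduction factor is 1 / (1/c_out + 1/k²), so for 3×3 kernels it approaches 9 as the number of output channels grows.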
[CV-105] Moonworks Lunara Aesthetic Dataset
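[Quick read]: This paper addresses shortcomings of existing image datasets in aesthetic quality, stylistic diversity, and annotation precision, in particular the tendency of large-scale web-derived datasets to favor quantity over quality. The key to the solution is the Lunara Aesthetic Dataset, a high-quality, stylistically diverse image dataset generated with the Moonworks Lunara model. It covers regional aesthetics from the Middle East, Northern Europe, East Asia, and South Asia as well as general categories such as sketch and oil painting, and pairs each image with a human-refined prompt and structured annotations (salient objects, attributes, relationships, and stylistic cues). The dataset is released under the Apache 2.0 license, combining high aesthetic scores with licensing transparency and offering a reproducible, legally usable benchmark resource for generative AI research.
Link: https://arxiv.org/abs/2601.07941
Authors: Yan Wang,M M Sayeef Abdullah,Partho Hassan,Sabit Hassan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: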
Abstract:The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset with substantially higher aesthetic scores, exceeding even aesthetics-focused datasets, and general-purpose datasets by a larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.
[CV-106] HOSC: A Periodic Activation with Saturation Control for High-Fidelity Implicit Neural Representations
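[Quick read]: This paper addresses the gradient instability and limited control over multi-scale behavior that periodic activation functions exhibit when preserving high-frequency information in implicit neural representations (INRs). Existing choices such as the sine activation preserve high-frequency components, but their gradients are prone to instability and they offer little means of tuning the response at different scales. The paper proposes the Hyperbolic Oscillator with Saturation Control (HOSC) activation, of the form $\tanh(\beta \sin(\omega_0 x))$. Its key innovation is an explicit parameter $\beta$ that directly controls the Lipschitz bound of the activation (given by $\beta \omega_0$), allowing gradient magnitudes to be tuned precisely while the periodic carrier retains the ability to express high-frequency information.
Link: https://arxiv.org/abs/2601.07870
Authors: Michal Jan Wlodarczyk,Danzel Serrano,Przemyslaw Musialski
Affiliations: Warsaw University of Technology; New Jersey Institute of Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 16 pages including appendices, 12 figures, 15 tables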
Abstract:Periodic activations such as sine preserve high-frequency information in implicit neural representations (INRs) through their oscillatory structure, but often suffer from gradient instability and limited control over multi-scale behavior. We introduce the Hyperbolic Oscillator with Saturation Control (HOSC) activation, $\text{HOSC}(x) = \tanh\bigl(\beta \sin(\omega_0 x)\bigr)$, which exposes an explicit parameter $\beta$ that controls the Lipschitz bound of the activation by $\beta \omega_0$. This provides a direct mechanism to tune gradient magnitudes while retaining a periodic carrier. We provide a mathematical analysis and conduct a comprehensive empirical study across images, audio, video, NeRFs, and SDFs using standardized training protocols. Comparative analysis against SIREN, FINER, and related methods shows where HOSC provides substantial benefits and where it achieves competitive parity. Results establish HOSC as a practical periodic activation for INR applications, with domain-specific guidance on hyperparameter selection. For code visit the project page this https URL.
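The Lipschitz claim can be checked numerically from the stated formula alone. The derivative of tanh(β sin(ω₀x)) is (1 − tanh²(β sin(ω₀x))) · β ω₀ cos(ω₀x), whose magnitude is at most β ω₀ and which equals β ω₀ at x = 0. The sketch below assumes nothing beyond the formula in the abstract. The values β = 2 and ω₀ = 30 are illustrative and are not the paper's recommended settings.

```python
import math


def hosc(x, beta, omega0):
    """HOSC activation: tanh(beta * sin(omega0 * x))."""
    return math.tanh(beta * math.sin(omega0 * x))


def hosc_grad(x, beta, omega0):
    """Analytic derivative of HOSC with respect to x."""
    t = math.tanh(beta * math.sin(omega0 * x))
    return (1.0 - t * t) * beta * omega0 * math.cos(omega0 * x)


beta, omega0 = 2.0, 30.0                       # illustrative values
xs = [i * 0.001 for i in range(-1000, 1001)]   # grid on [-1, 1]
max_grad = max(abs(hosc_grad(x, beta, omega0)) for x in xs)
print(max_grad, beta * omega0)                 # the bound is attained at x = 0
```

Raising β sharpens the saturation toward a square wave and raises the gradient bound in proportion, which is the tuning handle the abstract describes.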
[CV-107] An Empirical Study on Knowledge Transfer under Domain and Label Shifts in 3D LiDAR Point Clouds
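[Quick read]: This paper addresses continual learning and transfer learning for 3D point cloud perception systems in real-world applications, in particular the limited robustness of existing methods when object definitions (label shift) and sensor domains (domain shift) change at the same time. The key to the solution is the RObust Autonomous driving under Dataset shifts (ROAD) benchmark, an evaluation suite that systematically covers domain shifts and three forms of label evolution (class split, class expansion, and class insertion) in LiDAR point cloud classification. Using large-scale datasets (Waymo, NuScenes, Argoverse2), it evaluates zero-shot transfer, linear probing, and continual learning (CL) methods, revealing the limitations of current approaches and establishing strong baselines for research on robust 3D perception in realistic settings.
Link: https://arxiv.org/abs/2601.07855
Authors: Subeen Lee,Siyeong Lee,Namil Kim,Jaesik Choi
Affiliations: KAIST; NAVERLABS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: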
Abstract:For 3D perception systems to be practical in real-world applications – from autonomous driving to embodied AI – models must adapt to continuously evolving object definitions and sensor domains. Yet, research on continual and transfer learning in 3D point cloud perception remains underexplored compared to 2D vision – particularly under simultaneous domain and label shifts. To address this gap, we propose the RObust Autonomous driving under Dataset shifts (ROAD) benchmark, a comprehensive evaluation suite for LiDAR-based object classification that explicitly accounts for domain shifts as well as three key forms of label evolution: class split, class expansion, and class insertion. Using large-scale datasets (Waymo, NuScenes, Argoverse2), we evaluate zero-shot transfer, linear probe, and CL, and analyze the impact of backbone architectures, training objectives, and CL methods. Our findings reveal limitations of existing approaches under realistic shifts and establish strong baselines for future research in robust 3D perception.
[CV-108] MLLM-VADStory: Domain Knowledge-Driven Multimodal LLMs for Video Ad Storyline Insights
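[Quick read]: This paper addresses the lack of systematic, quantitative, and interpretable methods for understanding video ad storylines, in particular the difficulty of extracting and reusing effective narrative structures at scale. The key to the solution is MLLM-VADStory, a domain knowledge-guided multimodal large language model (MLLM) framework that models an ad narrative as a sequence of structured units driven by functional intent. It first segments each ad into scene units with distinct communicative functions, labels each unit using an advertising-specific functional role taxonomy, and then aggregates functional sequences across ads to recover data-driven storyline structures. This turns raw video content into interpretable, optimizable narrative patterns and makes ad creative design more systematic and scalable.
Link: https://arxiv.org/abs/2601.07850
Authors: Jasmine Yang,Poppy Zhang,Shawndra Hill
Affiliations: Meta
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: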
Abstract:We propose MLLM-VADStory, a novel domain knowledge-guided multimodal large language models (MLLM) framework to systematically quantify and generate insights for video ad storyline understanding at scale. The framework is centered on the core idea that ad narratives are structured by functional intent, with each scene unit performing a distinct communicative function, delivering product and brand-oriented information within seconds. MLLM-VADStory segments ads into functional units, classifies each unit’s functionality using a novel advertising-specific functional role taxonomy, and then aggregates functional sequences across ads to recover data-driven storyline structures. Applying the framework to 50k social media video ads across four industry subverticals, we find that story-based creatives improve video retention, and we recommend top-performing story arcs to guide advertisers in creative design. Our framework demonstrates the value of using domain knowledge to guide MLLMs in generating scalable insights for video ad storylines, making it a versatile tool for understanding video creatives in general.
[CV-109] Edge-AI Perception Node for Cooperative Road-Safety Enforcement and Connected-Vehicle Integration
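[Quick read]: This paper addresses the severe shortfall in traffic enforcement capacity caused by rapid motorization in emerging economies such as India, where more than 11 million traffic violations were recorded in 2023 against a human policing density of roughly one officer per 4000 vehicles, a scale that traditional surveillance and manual ticketing cannot handle. The key to the solution is an autonomous, cooperative, and energy-efficient edge AI perception node that integrates YOLOv8 Nano for high-accuracy multi-object detection, DeepSORT for temporally consistent vehicle tracking, and a rule-guided optical character recognition (OCR) post-processing engine that recognizes degraded or multilingual license plates compliant with the MoRTH AIS 159 and ISO 7591 standards. Deployed on an NVIDIA Jetson Nano and optimized with TensorRT FP16 quantization, the system sustains 28 to 30 frames per second at only 9.6 W, achieving 97.7% violation detection accuracy and 84.9% OCR precision. It also publishes standardized safety events (CAM/DENM) to connected vehicles and intelligent transportation systems via V2X protocols, strengthening cooperative perception and proactive road safety management.
Link: https://arxiv.org/abs/2601.07845
Authors: Shree Charran R,Rahul Kumar Dubey
Affiliations: Indian Institute of Science; Bosch Global Software Technologies Private Limited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: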
Abstract:Rapid motorization in emerging economies such as India has created severe enforcement asymmetries, with over 11 million recorded violations in 2023 against a human policing density of roughly one officer per 4000 vehicles. Traditional surveillance and manual ticketing cannot scale to this magnitude, motivating the need for an autonomous, cooperative, and energy efficient edge AI perception infrastructure. This paper presents a real time roadside perception node for multi class traffic violation analytics and safety event dissemination within a connected and intelligent vehicle ecosystem. The node integrates YOLOv8 Nano for high accuracy multi object detection, DeepSORT for temporally consistent vehicle tracking, and a rule guided OCR post processing engine capable of recognizing degraded or multilingual license plates compliant with MoRTH AIS 159 and ISO 7591 visual contrast standards. Deployed on an NVIDIA Jetson Nano with a 128 core Maxwell GPU and optimized via TensorRT FP16 quantization, the system sustains 28 to 30 frames per second inference at 9.6 W, achieving 97.7 percent violation detection accuracy and 84.9 percent OCR precision across five violation classes, namely signal jumping, zebra crossing breach, wrong way driving, illegal U turn, and speeding, without manual region of interest calibration. Comparative benchmarking against YOLOv4 Tiny, PP YOLOE S, and Nano DetPlus demonstrates a 10.7 percent mean average precision gain and a 1.4 times accuracy per watt improvement. Beyond enforcement, the node publishes standardized safety events of CAM and DENM type to connected vehicles and intelligent transportation system backends via V2X protocols, demonstrating that roadside edge AI analytics can augment cooperative perception and proactive road safety management within the IEEE Intelligent Vehicles ecosystem.
[CV-110] M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
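[Quick read]: This paper addresses the fact that current medical image understanding benchmarks largely ignore the quality of the reasoning path (Chain-of-Thought, CoT): existing evaluations focus only on the accuracy of the final diagnosis and overlook whether the model produces intermediate reasoning steps that are logically clear, interpretable, and consistent with clinical thinking. Such a black-box decision process lacks a reliable basis and is hard for doctors to trust and adopt. The key to the solution is M3CoTBench, a new multimodal medical reasoning benchmark whose core contributions are a diverse dataset covering 24 examination types and 13 tasks of varying difficulty, together with CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency). These systematically measure the ability of multimodal large language models (MLLMs) to generate high-quality reasoning chains in medical image analysis and aim to foster more transparent, trustworthy, and clinically valuable medical AI systems.
Link: https://arxiv.org/abs/2601.08758
Authors: Juntao Jiang,Jiangning Zhang,Yali Bi,Jinsheng Bai,Weixuan Liu,Weiwei Jin,Zhucun Xue,Yong Liu,Xiaobin Hu,Shuicheng Yan
Affiliations: ZJU (Zhejiang University); USTC (University of Science and Technology of China); ECNU (East China Normal University); ZJPPH (Zhejiang Provincial People's Hospital); NUS (National University of Singapore)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 40 pages, 8 pages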
Abstract:Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at this https URL.
[CV-111] A Single-Parameter Factor-Graph Image Prior
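[Quick read]: This paper addresses how to capture local structure in image modeling while adapting parameters automatically, with the aim of high-quality image restoration in tasks such as denoising and contrast enhancement. The key to the solution is a piecewise smooth image model with piecewise constant local parameters that adapt automatically to each image. The model is formulated as a factor graph with NUP (normal with unknown parameters) priors, so that the computations reduce to alternating conjugate-gradient iterations and Gaussian message passing, improving modeling accuracy while remaining computationally efficient.
Link: https://arxiv.org/abs/2601.08749
Authors: Tianyang Wang,Ender Konukoglu,Hans-Andrea Loeliger
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: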
Abstract:We propose a novel piecewise smooth image model with piecewise constant local parameters that are automatically adapted to each image. Technically, the model is formulated in terms of factor graphs with NUP (normal with unknown parameters) priors, and the pertinent computations amount to iterations of conjugate-gradient steps and Gaussian message passing. The proposed model and algorithms are demonstrated with applications to denoising and contrast enhancement.
[CV-112] Automated Lesion Segmentation of Stroke MRI Using nnU-Net: A Comprehensive External Validation Across Acute and Chronic Lesions
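[Quick read]: This paper addresses the accuracy and generalizability of automated stroke lesion segmentation in magnetic resonance imaging (MRI), in support of clinical research, prognostic modeling, and personalized interventions. The key to the solution is a systematic evaluation of nnU-Net-based models on multiple heterogeneous public MRI datasets covering acute and chronic stroke, analyzing factors such as imaging modality (diffusion-weighted imaging, DWI; fluid-attenuated inversion recovery, FLAIR; and T1-weighted imaging), training set size, lesion volume range, and image quality. The study finds that performance depends on several factors: in acute stroke, DWI-trained models outperform FLAIR-based models and multimodal combinations bring only modest gains; in chronic stroke, larger training sets improve performance with diminishing returns; small lesions are harder to segment, which limits generalization; models trained on high-quality images transfer better to noisier images; and discrepancies between predictions and reference masks are often attributable to limitations in manual annotation. These results show that automated segmentation can approach human expert performance and identify the key variables governing generalizability, providing empirical guidance for building more robust lesion segmentation tools.
Link: https://arxiv.org/abs/2601.08701
Authors: Tammar Truzman,Matthew A. Lambon Ralph,Ajay D. Halai
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 7 figures. Submitted to Brain. Code and trained models available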
Abstract:Accurate and generalisable segmentation of stroke lesions from magnetic resonance imaging (MRI) is essential for advancing clinical research, prognostic modelling, and personalised interventions. Although deep learning has improved automated lesion delineation, many existing models are optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Here, we systematically evaluated stroke lesion segmentation using the nnU-Net framework across multiple heterogeneous, publicly available MRI datasets spanning acute and chronic stroke. Models were trained and tested on diffusion-weighted imaging (DWI), fluid-attenuated inversion recovery (FLAIR), and T1-weighted MRI, and evaluated on independent datasets. Across stroke stages, models showed robust generalisation, with segmentation accuracy approaching reported inter-rater reliability. Performance varied with imaging modality and training data characteristics. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, increasing training set size improved performance, with diminishing returns beyond several hundred cases. Lesion volume was a key determinant of accuracy: smaller lesions were harder to segment, and models trained on restricted volume ranges generalised poorly. MRI image quality further constrained generalisability: models trained on lower-quality scans transferred poorly, whereas those trained on higher-quality data generalised well to noisier images. Discrepancies between predictions and reference masks were often attributable to limitations in manual annotations. Together, these findings show that automated lesion segmentation can approach human-level performance while identifying key factors governing generalisability and informing the development of lesion segmentation tools.
[CV-113] Region of interest detection for efficient aortic segmentation
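[Quick read]: This paper addresses the obstacles that difficult medical image analysis poses to the diagnosis and treatment of lethal aortic diseases such as thoracic aortic dissection and aneurysm, in particular the tedious and error-prone segmentation of the aorta in 3D images. The key to the solution is a cascaded segmentation approach based on region of interest (ROI) detection: a lightweight multi-task model (an encoder-decoder for segmentation, with a fully connected network attached to the bottleneck for ROI localization) first locates the aortic region quickly and accurately, and fine segmentation is then performed on the ROI only, which substantially reduces compute while improving accuracy. Experiments report a mean Dice similarity coefficient of 0.944 using a third of the computing power, outperforming conventional one-step segmentation models (such as nnU-Net) and offering the efficiency, robustness, and practicality needed for clinical use.
Link: https://arxiv.org/abs/2601.08683
Authors: Loris Giordano,Ine Dirks,Tom Lenaerts,Jef Vandemeulebroucke
Affiliations: Vrije Universiteit Brussel (VUB), Department of Electronics and Informatics (ETRO); Interuniversity Institute of Bioinformatics in Brussels; imec; Université Libre de Bruxelles (ULB), Machine Learning Group; Vrije Universiteit Brussel (VUB), Artificial Intelligence lab; Universitair Ziekenhuis Brussel (UZ Brussel), Department of Radiology
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: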
Abstract:Thoracic aortic dissection and aneurysms are the most lethal diseases of the aorta. The major hindrance to treatment lies in the accurate analysis of the medical images. More particularly, aortic segmentation of the 3D image is often tedious and difficult. Deep-learning-based segmentation models are an ideal solution, but their inability to deliver usable outputs in difficult cases and their computational cost cause their clinical adoption to stay limited. This study presents an innovative approach for efficient aortic segmentation using targeted region of interest (ROI) detection. In contrast to classical detection models, we propose a simple and efficient detection model that can be widely applied to detect a single ROI. Our detection model is trained as a multi-task model, using an encoder-decoder architecture for segmentation and a fully connected network attached to the bottleneck for detection. We compare the performance of a one-step segmentation model applied to a complete image, nnU-Net and our cascade model composed of a detection and a segmentation step. We achieve a mean Dice similarity coefficient of 0.944 with over 0.9 for all cases using a third of the computing power. This simple solution achieves state-of-the-art performance while being compact and robust, making it an ideal solution for clinical applications.
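The Dice similarity coefficient reported above has a simple definition, 2|A ∩ B| / (|A| + |B|), sketched here on flattened binary masks. The toy masks and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details from the paper.

```python
def dice_coefficient(pred, ref):
    """Dice similarity between two flattened binary masks of equal length."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    size = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if size == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size


pred = [1, 1, 0, 0, 1]   # toy prediction
ref = [1, 0, 0, 1, 1]    # toy reference mask
print(dice_coefficient(pred, ref))  # 2 * 2 / (3 + 3), about 0.667
```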
[CV-114] Blind Deconvolution in Astronomy: How Does a Standalone U-Net Perform?
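[Quick read]: This paper addresses blind deconvolution of astronomical images, that is, recovering a sharp source image directly from the observation without prior knowledge of the point spread function (PSF) or noise characteristics. The core challenge is achieving high-fidelity reconstruction without relying on physical models or hand-designed priors. The key to the solution is end-to-end training of a U-Net on large-scale synthetic data (up to 40,000 images of 48×48 pixels from the COSMOS Real Galaxy Dataset) to learn the mapping from blurred to sharp images, optimized with a mean square error (MSE) loss. Experiments show that performance improves with the number of training samples and saturates beyond 5,000 images, that the model generalizes well across different seeing conditions and noise levels, and that the stability of its solutions can be verified with a cosine similarity metric. This suggests the model may implicitly learn a geometry-adaptive harmonic basis, consistent with sparse representation theory and recent mathematical analyses.
Link: https://arxiv.org/abs/2601.08666
Authors: Jean-Eric Campagne
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 13 figures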
Abstract:Aims: This study investigates whether a U-Net architecture can perform standalone end-to-end blind deconvolution of astronomical images without any prior knowledge of the Point Spread Function (PSF) or noise characteristics. Our goal is to evaluate its performance against the number of training images, classical Tikhonov deconvolution and to assess its generalization capability under varying seeing conditions and noise levels. Methods: Realistic astronomical observations are simulated using the GalSim toolkit, incorporating random transformations, PSF convolution (accounting for both optical and atmospheric effects), and Gaussian white noise. A U-Net model is trained using a Mean Square Error (MSE) loss function on datasets of varying sizes, up to 40,000 images of size 48x48 from the COSMOS Real Galaxy Dataset. Performance is evaluated using PSNR, SSIM, and cosine similarity metrics, with the latter employed in a two-model framework to assess solution stability. Results: The U-Net model demonstrates effectiveness in blind deconvolution, with performance improving consistently as the training dataset size increases, saturating beyond 5,000 images. Cosine similarity analysis reveals convergence between independently trained models, indicating stable solutions. Remarkably, the U-Net outperforms the oracle-like Tikhonov method in challenging conditions (low PSNR/medium SSIM). The model also generalizes well to unseen seeing and noise conditions, although optimal performance is achieved when training parameters include validation conditions. Experiments on synthetic C^\alpha images further support the hypothesis that the U-Net learns a geometry-adaptive harmonic basis, akin to sparse representations observed in denoising tasks. These results align with recent mathematical insights into its adaptive learning capabilities. 
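Two of the evaluation metrics named above, PSNR and cosine similarity, can be sketched in a few lines. The definitions below are the standard ones applied to flattened images. The peak value of 1.0 and the toy pixel values are assumptions, since the abstract does not give the paper's normalization.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened images."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


truth = [0.0, 0.5, 1.0]        # toy ground-truth pixels
model_a = [0.0, 0.5, 0.9]      # toy output of one trained model
model_b = [0.05, 0.5, 0.9]     # toy output of a second, independent model
print(psnr(truth, model_a))
print(cosine_similarity(model_a, model_b))
```

In the two-model setup the abstract describes, a cosine similarity near 1 between independently trained models indicates that both converge to similar reconstructions.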
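[CV-115] Temporal-Enhanced Interpretable Multi-Modal Prognosis and Risk Stratification Framework for Diabetic Retinopathy (TIMM-ProRS)
Quick read: This paper targets the high misdiagnosis rate in diabetic retinopathy (DR) diagnosis, which arises because DR symptoms overlap heavily with other fundus diseases such as age-related macular degeneration and hypertensive retinopathy, a problem that is most acute in regions with scarce medical resources. The key to the solution is TIMM-ProRS, a multi-modal deep learning framework that combines a Vision Transformer (ViT), a Convolutional Neural Network (CNN), and a Graph Neural Network (GNN). Beyond fundus image features, the framework incorporates temporal biomarkers (glycated hemoglobin HbA1c and retinal thickness), so it integrates multi-modal information and models how DR evolves over time. The authors report 97.8% accuracy and an F1-score of 0.96 across several public datasets, outperforming existing methods such as RSG-Net and DeepDR.
Link: https://arxiv.org/abs/2601.08240
Authors: Susmita Kar,A S M Ahsanul Sarkar Akib,Abdul Hasib,Samin Yaser,Anas Bin Azim
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: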
Abstract:Diabetic retinopathy (DR), affecting millions globally with projections indicating a significant rise, poses a severe blindness risk and strains healthcare systems. Diagnostic complexity arises from visual symptom overlap with conditions like age-related macular degeneration and hypertensive retinopathy, exacerbated by high misdiagnosis rates in underserved regions. This study introduces TIMM-ProRS, a novel deep learning framework integrating Vision Transformer (ViT), Convolutional Neural Network (CNN), and Graph Neural Network (GNN) with multi-modal fusion. TIMM-ProRS uniquely leverages both retinal images and temporal biomarkers (HbA1c, retinal thickness) to capture multi-modal and temporal dynamics. Evaluated comprehensively across diverse datasets including APTOS 2019 (trained), Messidor-2, RFMiD, EyePACS, and Messidor-1 (validated), the model achieves 97.8% accuracy and an F1-score of 0.96, demonstrating state-of-the-art performance and outperforming existing methods like RSG-Net and DeepDR. This approach enables early, precise, and interpretable diagnosis, supporting scalable telemedical management and enhancing global eye health sustainability.
[CV-116] Application of Ideal Observer for Thresholded Data in Search Task
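Quick read: This paper addresses the difficulty model observers have in accurately mimicking human visual search in task-based image quality assessment, particularly in noisy settings and when features are redundant. The core challenge is to improve diagnostic accuracy and computational efficiency with limited data while staying consistent with human observer performance. The key to the solution is an anthropomorphic thresholded visual-search model observer built as a two-stage framework: the candidate selection stage thresholds the data to keep high-salience features and refine regions of interest, and the decision stage applies stage-specific feature processing to optimize discrimination. Filtering out irrelevant variability through thresholding makes the model more robust in complex backgrounds, where it outperforms the unthresholded approach especially at low signal-to-noise ratios, and it can be trained with few images while remaining aligned with human performance. The authors present this as a new paradigm for medical image analysis and other applications that model human visual search.
Link: https://arxiv.org/abs/2601.07976
Authors: Hongwei Lin,Howard C. Gifford
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
Comments: 13 pages, 6 figures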
Abstract:This study advances task-based image quality assessment by developing an anthropomorphic thresholded visual-search model observer. The model is an ideal observer for thresholded data inspired by the human visual system, allowing selective processing of high-salience features to improve discrimination performance. By filtering out irrelevant variability, the model enhances diagnostic accuracy and computational efficiency. The observer employs a two-stage framework: candidate selection and decision-making. Using thresholded data during candidate selection refines regions of interest, while stage-specific feature processing optimizes performance. Simulations were conducted to evaluate the effects of thresholding on feature maps, candidate localization, and multi-feature scenarios. Results demonstrate that thresholding improves observer performance by excluding low-salience features, particularly in noisy environments. Intermediate thresholds often outperform no thresholding, indicating that retaining only relevant features is more effective than keeping all features. Additionally, the model demonstrates effective training with fewer images while maintaining alignment with human performance. These findings suggest that the proposed novel framework can predict human visual search performance in clinically realistic tasks and provide solutions for model observer training with limited resources. Our novel approach has applications in other areas where human visual search and detection tasks are modeled such as in computer vision, machine learning, defense and security image analysis. 
[CV-117] Imaging-anchored Multiomics in Cardiovascular Disease: Integrating Cardiac Imaging Bulk Single-cell and Spatial Transcriptomics
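Quick read: This review addresses the fragmented analysis of multimodal data in cardiovascular research, where cardiac imaging and transcriptomics are handled in separate pipelines, and asks how clinical imaging phenotypes can be integrated with spatially resolved molecular states. Its core proposal is an imaging-anchored multiomics framework: echocardiography, cardiac MRI, and computed tomography (CT) define a spatial phenotype of the heart, while bulk, single-cell, and spatial transcriptomics supply cell-type- and location-specific molecular context. On that basis, the review covers representation-learning strategies and multimodal fusion methods that cope with missing data, limited sample size, and batch effects, in support of integrative pipelines for radiogenomics, spatial molecular alignment, and image-based prediction of gene expression.
Link: https://arxiv.org/abs/2601.07871
Authors: Minh H. N. Le,Tuan Vinh,Thanh-Huy Nguyen,Tao Li,Bao Quang Gia Le,Han H. Huynh,Monika Raj,Carl Yang,Min Xu,Nguyen Quoc Khanh Le
Affiliations: Taipei Medical University; University of Oxford; Carnegie Mellon University; Emory University
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: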
Abstract:Cardiovascular disease arises from interactions between inherited risk, molecular programmes, and tissue-scale remodelling that are observed clinically through imaging. Health systems now routinely generate large volumes of cardiac MRI, CT and echocardiography together with bulk, single-cell and spatial transcriptomics, yet these data are still analysed in separate pipelines. This review examines joint representations that link cardiac imaging phenotypes to transcriptomic and spatially resolved molecular states. An imaging-anchored perspective is adopted in which echocardiography, cardiac MRI and CT define a spatial phenotype of the heart, and bulk, single-cell and spatial transcriptomics provide cell-type- and location-specific molecular context. The biological and technical characteristics of these modalities are first summarised, and representation-learning strategies for each are outlined. Multimodal fusion approaches are reviewed, with emphasis on handling missing data, limited sample size, and batch effects. Finally, integrative pipelines for radiogenomics, spatial molecular alignment, and image-based prediction of gene expression are discussed, together with common failure modes, practical considerations, and open challenges. Spatial multiomics of human myocardium and atherosclerotic plaque, single-cell and spatial foundation models, and multimodal medical foundation models are collectively bringing imaging-anchored multiomics closer to large-scale cardiovascular translation.
Artificial Intelligence
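[AI-0] MemRec: Collaborative Memory-Augmented Agentic Recommender System
Quick read: This paper addresses the fact that recommender agents in the agentic era rely on isolated memory and overlook collaborative signals. Two challenges stand in the way: distilling useful context from vast graph structures without overloading the reasoning agent, and evolving collaborative memory efficiently without prohibitive computational cost. The key to the solution is MemRec, a framework that architecturally decouples reasoning from memory management. A dedicated, low-cost language model (LM_Mem) maintains a dynamic collaborative memory graph and serves synthesized, high-signal context to the downstream recommendation model (LLM_Rec). Efficient retrieval and asynchronous graph propagation evolve the memory in the background, which the authors report yields a new Pareto frontier across reasoning quality, cost, and privacy.
Link: https://arxiv.org/abs/2601.08816
Authors: Weixin Chen,Yuhan Zhao,Jingyuan Huang,Zihe Ye,Clark Mingxuan Ju,Tong Zhao,Neil Shah,Li Chen,Yongfeng Zhang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: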
Abstract:The evolution of recommender systems has shifted preference storage from rating matrices and dense embeddings to semantic memory in the agentic era. Yet existing agents rely on isolated memory, overlooking crucial collaborative signals. Bridging this gap is hindered by the dual challenges of distilling vast graph contexts without overwhelming reasoning agents with cognitive load, and evolving the collaborative memory efficiently without incurring prohibitive computational costs. To address this, we propose MemRec, a framework that architecturally decouples reasoning from memory management to enable efficient collaborative augmentation. MemRec introduces a dedicated, cost-effective LM_Mem to manage a dynamic collaborative memory graph, serving synthesized, high-signal context to a downstream LLM_Rec. The framework operates via a practical pipeline featuring efficient retrieval and cost-effective asynchronous graph propagation that evolves memory in the background. Extensive experiments on four benchmarks demonstrate that MemRec achieves state-of-the-art performance. Furthermore, architectural analysis confirms its flexibility, establishing a new Pareto frontier that balances reasoning quality, cost, and privacy through support for diverse deployments, including local open-source models. Code: this https URL and Homepage: this https URL
[AI-1] Uncovering Political Bias in Large Language Models using Parliamentary Voting Records
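Quick read: This paper addresses the lack of systematic evaluation of political bias in large language models (LLMs), focusing on their behavior in real political decision-making settings. Prior work has concentrated on social biases such as gender and race, while political bias, which directly affects public policy and democratic governance, has lacked rigorous quantitative analysis. The key to the solution is a general methodology for constructing political bias benchmarks by aligning model-generated voting predictions with verified parliamentary voting records, which enables reproducible, cross-national evaluation. The authors build the PoliBias benchmarks for the Netherlands, Norway, and Spain, and introduce a visualization based on the CHES (Chapel Hill Expert Survey) dimensions that places models and parties in the same two-dimensional ideological space so their positions can be compared directly. The experiments show that state-of-the-art LLMs consistently lean left or centrist and exhibit clear negative bias toward right-conservative parties.
Link: https://arxiv.org/abs/2601.08785
Authors: Jieying Chen,Karen de Jong,Andreas Poole,Jan Burakowski,Elena Elderson Nosti,Joep Windt,Chendi Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: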
Abstract:As large language models (LLMs) become deeply embedded in digital platforms and decision-making systems, concerns about their political biases have grown. While substantial work has examined social biases such as gender and race, systematic studies of political bias remain limited, despite their direct societal impact. This paper introduces a general methodology for constructing political bias benchmarks by aligning model-generated voting predictions with verified parliamentary voting records. We instantiate this methodology in three national case studies: PoliBiasNL (2,701 Dutch parliamentary motions and votes from 15 political parties), PoliBiasNO (10,584 motions and votes from 9 Norwegian parties), and PoliBiasES (2,480 motions and votes from 10 Spanish parties). Across these benchmarks, we assess ideological tendencies and political entity bias in LLM behavior. As part of our evaluation framework, we also propose a method to visualize the ideology of LLMs and political parties in a shared two-dimensional CHES (Chapel Hill Expert Survey) space by linking their voting-based positions to the CHES dimensions, enabling direct and interpretable comparisons between models and real-world political actors. Our experiments reveal fine-grained ideological distinctions: state-of-the-art LLMs consistently display left-leaning or centrist tendencies, alongside clear negative biases toward right-conservative parties. These findings highlight the value of transparent, cross-national evaluation grounded in real parliamentary behavior for understanding and auditing political bias in modern LLMs.
[AI-2] Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards
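Quick read: This paper addresses distorted performance metrics in text-to-SQL evaluation caused by annotation errors, in particular how the quality of human annotations in public benchmarks such as BIRD and Spider 2.0-Snow can mislead model rankings. The key to the solution is expert analysis that identifies annotation errors, corrects a subset of the BIRD development (Dev) set, and re-evaluates 16 open-source text-to-SQL agents on both the original and the corrected data. This quantifies how annotation errors affect performance and leaderboard rankings and exposes the reliability risk in current evaluation practice.
Link: https://arxiv.org/abs/2601.08778
Authors: Tengjun Jin,Yoojin Choi,Yuxuan Zhu,Daniel Kang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 18 pages, 14 figures, 9 tables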
Abstract:Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of database-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. We show that performance changes range from -7% to 31% (in relative terms) and rank changes range from -9 to +9 positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman’s r_s = 0.85, p = 3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman’s r_s = 0.32, p = 0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at this https URL.
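The ranking comparison above rests on Spearman's rank correlation. As a minimal sketch of how such a coefficient can be computed from two score lists, the pure-Python code below ranks each list (averaging ranks over ties) and then takes the Pearson correlation of the ranks. The helper names `average_ranks` and `spearman` are assumptions for illustration, and no data from the paper is reproduced here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)
```

Passing each agent's accuracy on the original subset as `x` and on the corrected subset as `y` would give the kind of coefficient the abstract reports. The p-values quoted there require a separate significance test that this sketch does not implement.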
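[AI-3] Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs
Quick read: This paper addresses the limits of Retrieval-Augmented Generation (RAG) for software engineering on multi-hop architectural reasoning, such as controller-to-service-to-repository chains, interface-driven wiring, and inheritance. Retrieval based on vector similarity captures topical relevance but struggles with this kind of cross-module reasoning. The paper compares a vector-only baseline with two knowledge-graph-augmented RAG pipelines: a knowledge graph generated by an LLM (LLM-KB), and a knowledge graph built deterministically from the Abstract Syntax Tree (AST) (DKB), which parses code with Tree-sitter and uses bidirectional traversal to extract structural relations. In the experiments, DKB beats LLM-KB on indexing speed, coverage, and cost, achieves the highest correctness on the question suite, and suits architecture-level question answering over large Java codebases.
Link: https://arxiv.org/abs/2601.08773
Authors: Manideep Reddy Chinthareddy
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 46 pages, 2 figures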
Abstract:Retrieval-Augmented Generation for software engineering often relies on vector similarity search, which captures topical similarity but can fail on multi-hop architectural reasoning such as controller to service to repository chains, interface-driven wiring, and inheritance. This paper benchmarks three retrieval pipelines on Java codebases (Shopizer, with additional runs on ThingsBoard and OpenMRS Core): (A) vector-only No-Graph RAG, (B) an LLM-generated knowledge graph RAG (LLM-KB), and (C) a deterministic AST-derived knowledge graph RAG (DKB) built with Tree-sitter and bidirectional traversal. Using 15 architecture and code-tracing queries per repository, we measure indexing time, query latency, corpus coverage, cost, and answer correctness. DKB builds its graph in seconds, while LLM-KB requires much longer graph generation. LLM-KB also shows indexing incompleteness: on Shopizer, 377 files are skipped or missed, reducing embedded chunk coverage and graph size compared to DKB. End-to-end cost is modest for DKB relative to the vector-only baseline but much higher for LLM-KB, especially as repository scale increases. Query latency is similar for No-Graph and DKB, while LLM-KB is slower and more variable. On the Shopizer question suite, DKB achieves the highest correctness, LLM-KB is close behind, and the vector-only baseline performs worst on upstream architectural queries and has the highest hallucination risk. Overall, deterministic AST-derived graphs provide more reliable coverage and multi-hop grounding than LLM-extracted graphs at substantially lower indexing cost.
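To make the idea of a deterministic AST-derived graph with bidirectional traversal concrete, here is a small sketch. It is not the paper's pipeline: the paper parses Java with Tree-sitter, whereas this example swaps in Python's standard `ast` module on a toy Python snippet so that it runs without extra dependencies. The names `SOURCE`, `build_call_graph`, and `reachable`, as well as the toy controller/service/repository functions, are hypothetical.

```python
import ast
from collections import defaultdict, deque

# Toy source standing in for a controller -> service -> repository chain.
SOURCE = '''
def repository_find(order_id):
    return {"id": order_id}

def service_get_order(order_id):
    return repository_find(order_id)

def controller_show(order_id):
    return service_get_order(order_id)
'''


def build_call_graph(source):
    """Return (callees, callers): forward and reverse call edges by function name."""
    tree = ast.parse(source)
    callees, callers = defaultdict(set), defaultdict(set)
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            # Only direct calls to plain names are recorded in this sketch.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callees[func.name].add(node.func.id)
                callers[node.func.id].add(func.name)
    return callees, callers


def reachable(graph, start):
    """Breadth-first traversal: every node reachable from `start` along `graph`."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Traversing `callees` from `controller_show` follows the chain downstream, while traversing `callers` from `repository_find` answers the upstream question of what depends on the repository. Keeping both edge directions is one simple way to support multi-hop questions in either direction.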
[AI-4] AI as Entertainment
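Quick read: This paper argues that research on and evaluation of generative AI focus mainly on its "intelligence" as a productivity tool and neglect its social impact as a generator of entertainment, including the potentially positive effects of entertaining content on culture, identity, and social connection. The key proposal is the "thick entertainment" framework, which draws on insights from the humanities and evaluates AI-generated cultural content in terms of meaning-making, identity formation, and social connection. This fills the gap left by current evaluations, which have no systematic measure of cultural benefit.
Link: https://arxiv.org/abs/2601.08768
Authors: Cody Kommers,Ari Holtzman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: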
Abstract:Generative AI systems are predominantly designed, evaluated, and marketed as intelligent systems which will benefit society by augmenting or automating human cognitive labor, promising to increase personal, corporate, and macroeconomic productivity. But this mainstream narrative about what AI is and what it can do is in tension with another emerging use case: entertainment. We argue that the field of AI is unprepared to measure or respond to how the proliferation of entertaining AI-generated content will impact society. Emerging data suggest AI is already widely adopted for entertainment purposes – especially by young people – and represents a large potential source of revenue. We contend that entertainment will become a primary business model for major AI corporations seeking returns on massive infrastructure investments; this will exert a powerful influence on the technology these companies produce in the coming years. Examining current evaluation practices, we identify a critical asymmetry: while AI assessments rigorously measure both benefits and harms of intelligence, they focus almost exclusively on cultural harms. We lack frameworks for articulating how cultural outputs might be actively beneficial. Drawing on insights from the humanities, we propose “thick entertainment” as a framework for evaluating AI-generated cultural content – one that considers entertainment’s role in meaning-making, identity formation, and social connection rather than simply minimizing harm. While AI is often touted for its potential to revolutionize productivity, in the long run we may find that AI turns out to be as much about “intelligence” as social media is about social connection.
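[AI-5] TerraFormer: Automated Infrastructure-as-Code with LLMs Fine-Tuned via Policy-Guided Verifier Feedback ICSE2026
Quick read: This paper addresses the incorrect configurations that large language models (LLMs) produce when generating and mutating Infrastructure-as-Code (IaC) from natural language (NL) descriptions. The core challenge is improving LLM correctness on IaC tasks, specifically syntactic correctness, deployability, and policy compliance. The key to the solution is TerraFormer, a neuro-symbolic framework that combines supervised fine-tuning with verifier-guided reinforcement learning, using formal verification tools to give feedback on syntax, deployability, and policy compliance. The authors also build two high-quality NL-to-IaC datasets, TF-Gen and TF-Mutn, through multi-stage verification and iterative LLM self-correction, and report improved correctness of the generated IaC over the base LLM.
Link: https://arxiv.org/abs/2601.08734
Authors: Prithwish Jana,Sam Davidson,Bhavana Bhasker,Andrey Kan,Anoop Deoras,Laurent Callot
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: The paper has been published at the 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE 2026), Rio de Janeiro, Brazil, April 12-18, 2026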
Abstract:Automating Infrastructure-as-Code (IaC) is challenging, and large language models (LLMs) often produce incorrect configurations from natural language (NL). We present TerraFormer, a neuro-symbolic framework for IaC generation and mutation that combines supervised fine-tuning with verifier-guided reinforcement learning, using formal verification tools to provide feedback on syntax, deployability, and policy compliance. We curate two large, high-quality NL-to-IaC datasets, TF-Gen (152k instances) and TF-Mutn (52k instances), via multi-stage verification and iterative LLM self-correction. Evaluations against 17 state-of-the-art LLMs, including ~50x larger models like Sonnet 3.7, DeepSeek-R1, and GPT-4.1, show that TerraFormer improves correctness over its base LLM by 15.94% on IaC-Eval, 11.65% on TF-Gen (Test), and 19.60% on TF-Mutn (Test). It outperforms larger models on both TF-Gen (Test) and TF-Mutn (Test), ranks third on IaC-Eval, and achieves top best-practices and security compliance.
[AI-6] Learning from Demonstrations via Capability-Aware Goal Sampling NEURIPS2025
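Quick read: This paper addresses the limits of imitation learning in long-horizon environments, where directly replicating expert trajectories is often unrealistic and small errors accumulate over time until the task fails. The key to the solution is Cago (Capability-Aware Goal Sampling), which dynamically tracks the agent's competence along expert trajectories and selects intermediate steps just beyond its current reach as goals. This produces an adaptive curriculum that guides the agent step by step toward the full task. The authors report significantly better sample efficiency and final performance than existing learning-from-demonstrations baselines on sparse-reward, goal-conditioned tasks.
Link: https://arxiv.org/abs/2601.08731
Authors: Yuanlin Duan,Yuning Wang,Wenjie Qiu,He Zhu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)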
Abstract:Despite its promise, imitation learning often fails in long-horizon environments where perfect replication of demonstrations is unrealistic and small errors can accumulate catastrophically. We introduce Cago (Capability-Aware Goal Sampling), a novel learning-from-demonstrations method that mitigates the brittle dependence on expert trajectories for direct imitation. Unlike prior methods that rely on demonstrations only for policy initialization or reward shaping, Cago dynamically tracks the agent’s competence along expert trajectories and uses this signal to select intermediate steps (goals that are just beyond the agent’s current reach) to guide learning. This results in an adaptive curriculum that enables steady progress toward solving the full task. Empirical results demonstrate that Cago significantly improves sample efficiency and final performance across a range of sparse-reward, goal-conditioned tasks, consistently outperforming existing learning-from-demonstrations baselines.
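The abstract describes the mechanism only at a high level, so the following is a loose sketch of one way "goals just beyond the agent's current reach" could be chosen, not the authors' algorithm. It assumes the expert trajectory is indexed by step, that each step keeps a sliding window of reach/fail outcomes, and that the competence frontier is the first step whose success rate falls below a threshold. `CompetenceTracker`, `sample_goal`, `window`, `threshold`, and `lookahead` are all hypothetical names and parameters.

```python
import random
from collections import deque


class CompetenceTracker:
    """Tracks how reliably the agent reaches each step of one expert trajectory."""

    def __init__(self, trajectory_length, window=20):
        # One bounded history of 1.0/0.0 outcomes per expert step (assumed design).
        self.outcomes = [deque(maxlen=window) for _ in range(trajectory_length)]

    def record(self, goal_index, reached):
        self.outcomes[goal_index].append(1.0 if reached else 0.0)

    def success_rate(self, goal_index):
        history = self.outcomes[goal_index]
        return sum(history) / len(history) if history else 0.0

    def frontier(self, threshold=0.8):
        """First expert step not yet reached reliably; last step if all are mastered."""
        for index in range(len(self.outcomes)):
            if self.success_rate(index) < threshold:
                return index
        return len(self.outcomes) - 1


def sample_goal(tracker, rng, lookahead=2, threshold=0.8):
    """Pick a goal index at or slightly past the current competence frontier."""
    start = tracker.frontier(threshold)
    end = min(start + lookahead, len(tracker.outcomes) - 1)
    return rng.randint(start, end)


if __name__ == "__main__":
    tracker = CompetenceTracker(trajectory_length=5)
    for _ in range(10):
        tracker.record(0, True)  # the agent reliably reaches the first expert step
    print(sample_goal(tracker, random.Random(0)))
```

As successes accumulate, the frontier moves along the trajectory and the sampled goals move with it, which gives the adaptive-curriculum behavior the abstract describes.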
[AI-7] Evaluating the Ability of Explanations to Disambiguate Models in a Rashomon Set AAAI2026
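Quick read: This paper addresses how to evaluate the quality of model explanations in explainable AI (XAI), specifically how to tell apart and select among the models of a Rashomon set, which perform similarly but may behave differently. Traditional evaluation compares explanations against an ideal ground-truth explanation, which obscures behavioral differences between models in the set. The authors' method, AXE, rests on three principles of explanation evaluation and can determine when protected attributes are used to make predictions. This lets it identify "fairwashed" explanations, where an alternate model keeps identical predictions but misleads explainers into producing false explanations. AXE needs no ground-truth explanations and is reported to detect adversarial fairwashing with a 100% success rate, unlike prior strategies based on model sensitivity or ground-truth comparison.
Link: https://arxiv.org/abs/2601.08703
Authors: Kaivalya Rawal,Eoin Delaney,Zihao Fu,Sandra Wachter,Chris Russell
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: This is a preprint of the paper published at the MURE workshop, AAAI 2026, which builds on a preprint of separate work published at FAccT 2025 (arXiv:2505.10399)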
Abstract:Explainable artificial intelligence (XAI) is concerned with producing explanations indicating the inner workings of models. For a Rashomon set of similarly performing models, explanations provide a way of disambiguating the behavior of individual models, helping select models for deployment. However explanations themselves can vary depending on the explainer used, and need to be evaluated. In the paper “Evaluating Model Explanations without Ground Truth”, we proposed three principles of explanation evaluation and a new method “AXE” to evaluate the quality of feature-importance explanations. We go on to illustrate how evaluation metrics that rely on comparing model explanations against ideal ground truth explanations obscure behavioral differences within a Rashomon set. Explanation evaluation aligned with our proposed principles would highlight these differences instead, helping select models from the Rashomon set. The selection of alternate models from the Rashomon set can maintain identical predictions but mislead explainers into generating false explanations, and mislead evaluation methods into considering the false explanations to be of high quality. AXE, our proposed explanation evaluation method, can detect this adversarial fairwashing of explanations with a 100% success rate. Unlike prior explanation evaluation strategies such as those based on model sensitivity or ground truth comparison, AXE can determine when protected attributes are used to make predictions.
[AI-8] Auditing Student-AI Collaboration: A Case Study of Online Graduate CS Students
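Quick read: This paper addresses the gap between the level of automation students want and what current generative AI offers in higher education, in particular the tension between efficiency gains on one side and student agency and output reliability on the other. The key to the approach is a mixed-methods audit built on two sequential, complementary surveys. The first uses an existing task-based framework to measure students' preferences for and actual use of AI across 12 academic tasks, along with their main concerns and reasons for use. The second, informed by the first, uses open-ended questions to explore how AI systems could be designed to address those concerns, with the aim of informing more effective and trustworthy AI tools for education.
Link: https://arxiv.org/abs/2601.08697
Authors: Nifu Dan
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: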
Abstract:As generative AI becomes embedded in higher education, it increasingly shapes how students complete academic tasks. While these systems offer efficiency and support, concerns persist regarding over-automation, diminished student agency, and the potential for unreliable or hallucinated outputs. This study conducts a mixed-methods audit of student-AI collaboration preferences by examining the alignment between current AI capabilities and students’ desired levels of automation in academic work. Using two sequential and complementary surveys, we capture students’ perceived benefits, risks, and preferred boundaries when using AI. The first survey employs an existing task-based framework to assess preferences for and actual usage of AI across 12 academic tasks, alongside primary concerns and reasons for use. The second survey, informed by the first, explores how AI systems could be designed to address these concerns through open-ended questions. This study aims to identify gaps between existing AI affordances and students’ normative expectations of collaboration, informing the development of more effective and trustworthy AI systems for education.
[AI-9] All Required In Order: Phase-Level Evaluation for AI-Human Dialogue in Healthcare and Beyond AAAI-26 ALT
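Quick read: This paper addresses the fact that evaluation methods for conversational AI in clinical settings do not capture how compliance depends on the full course of a conversation. Existing evaluations tend to ignore whether clinical obligations are met in the right order and whether there is clear evidence for clinicians to review, which leaves a gap between technical progress and actual healthcare needs. The key to the solution is a new evaluation method, Obligatory-Information Phase Structured Compliance Evaluation (OIP-SCE), which checks phase by phase that every required clinical obligation is met, in the right order, with auditable evidence, making complex clinical rules practical and transparent. It gives clinicians control over what is checked and gives engineers a clear specification to implement, providing a single, auditable evaluation surface that aligns AI capability with clinical workflow and supports safe, routine use.
Link: https://arxiv.org/abs/2601.08690
Authors: Shubham Kulkarni,Alexander Lyzhov,Shiva Chaitanya,Preetam Joshi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the AI for Medicine and Healthcare (AIMedHealth) Bridge Program, AAAI-26, Singapore. Full-length paper; to appear in Proceedings of Machine Learning Research (PMLR)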
Abstract:Conversational AI is starting to support real clinical work, but most evaluation methods miss how compliance depends on the full course of a conversation. We introduce Obligatory-Information Phase Structured Compliance Evaluation (OIP-SCE), an evaluation method that checks whether every required clinical obligation is met, in the right order, with clear evidence for clinicians to review. This makes complex rules practical and auditable, helping close the gap between technical progress and what healthcare actually needs. We demonstrate the method in two case studies (respiratory history, benefits verification) and show how phase-level evidence turns policy into shared, actionable steps. By giving clinicians control over what to check and engineers a clear specification to implement, OIP-SCE provides a single, auditable evaluation surface that aligns AI capability with clinical workflow and supports routine, safe use.
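The "all required, in order, with evidence" check can be illustrated with a short sketch. This is an assumption-laden toy, not the OIP-SCE specification: it presumes each dialogue turn has already been labeled with the phase it fulfils (how that labeling is done is outside this example), and the function name `evaluate_phases` and the phase names used in the note below are hypothetical.

```python
def evaluate_phases(required_phases, turns):
    """Check that each required phase occurs, in the required order.

    `turns` is a list of (phase_label_or_None, utterance) pairs. The first
    turn that fulfils a phase is kept as reviewable evidence for that phase.
    """
    evidence = {}
    for index, (label, utterance) in enumerate(turns):
        if label in required_phases and label not in evidence:
            evidence[label] = (index, utterance)

    missing = [phase for phase in required_phases if phase not in evidence]
    # Positions of the phases that were found, listed in the required order.
    positions = [evidence[p][0] for p in required_phases if p in evidence]
    in_order = positions == sorted(positions)

    return {
        "compliant": not missing and in_order,
        "missing": missing,
        "in_order": in_order,
        "evidence": evidence,
    }
```

For example, with the hypothetical required phases `["verify_identity", "collect_symptoms", "confirm_summary"]`, a dialogue that collects symptoms before verifying identity would come back with `in_order` set to `False`, and the `evidence` mapping would point a reviewer to the exact turns involved.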
[AI-10] PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning
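Quick read: This paper addresses the loss of objectivity and factual accuracy that large language models (LLMs) can suffer when user preference information is introduced for personalization, especially when that information is misaligned with the question and interferes with reasoning. The key to the solution is the PersonaDual framework. It first uses supervised fine-tuning (SFT) to learn two reasoning patterns, general-purpose objective reasoning and personalized reasoning, and then optimizes mode selection with the proposed reinforcement learning algorithm DualGRPO so that the model switches modes adaptively based on context. The reported result keeps the benefits of personalization while reducing interference, reaching near interference-free performance and making better use of personalized signals that help with the task.
Link: https://arxiv.org/abs/2601.08679
Authors: Xiaoyou Liu,Xinyi Mou,Shengbin Yue,Liang Wang,Yuqing Wang,Qiexiang Wang,Tianrui Qin,Wangchunshu Zhou,Zhongyu Wei
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: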
Abstract:As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable. However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question. To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.
[AI-11] Advancing ESG Intelligence: An Expert-level Agent and Comprehensive Benchmark for Sustainable Finance
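Quick read: This paper addresses two obstacles in corporate environmental, social, and governance (ESG) analysis: data fragmented across unstructured sources, and the limits of existing large language models (LLMs) on complex, multi-step auditing workflows. The key to the solution is ESGAgent, a hierarchical multi-agent system backed by a specialized toolset (retrieval augmentation, web search, and domain-specific functions) that supports analysis ranging from atomic common-sense question answering to integrated ESG report generation. It is evaluated on a three-level benchmark built from 310 corporate sustainability reports. The authors report an average accuracy of 84.15% on atomic question-answering tasks, and that ESGAgent outperforms state-of-the-art closed-source LLMs in generating professional reports with charts and verifiable references.
Link: https://arxiv.org/abs/2601.08676
Authors: Yilei Zhao,Wentao Zhang,Xiao Lei,Yandan Zheng,Mengpu Liu,Wei Yang Bryan Lim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: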
Abstract:Environmental, social, and governance (ESG) criteria are essential for evaluating corporate sustainability and ethical performance. However, professional ESG analysis is hindered by data fragmentation across unstructured sources, and existing large language models (LLMs) often struggle with the complex, multi-step workflows required for rigorous auditing. To address these limitations, we introduce ESGAgent, a hierarchical multi-agent system empowered by a specialized toolset, including retrieval augmentation, web search and domain-specific functions, to generate in-depth ESG analysis. Complementing this agentic system, we present a comprehensive three-level benchmark derived from 310 corporate sustainability reports, designed to evaluate capabilities ranging from atomic common-sense questions to the generation of integrated, in-depth analysis. Empirical evaluations demonstrate that ESGAgent outperforms state-of-the-art closed-source LLMs with an average accuracy of 84.15% on atomic question-answering tasks, and excels in professional report generation by integrating rich charts and verifiable references. These findings confirm the diagnostic value of our benchmark, establishing it as a vital testbed for assessing general and advanced agentic capabilities in high-stakes vertical domains.
[AI-12] Why AI Alignment Failure Is Structural: Learned Human Interaction Structures and AGI as an Endogenous Evolutionary Shock
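Quick read: This paper challenges a common reading of deceptive or threatening behavior in large language models (LLMs), which attributes it to "alignment failure" or "emergent malign agency". The authors argue that this reading rests on a conceptual error. LLMs do not reason morally; they statistically internalize the record of human social interaction, including laws, contracts, negotiations, conflicts, and coercive arrangements. So-called unethical behavior is therefore better seen as a structural generalization that arises under extreme asymmetries of power, information, or constraint, not as an anomalous deviation. The key move is to reframe AI risk: the main risk of artificial general intelligence (AGI) is not adversarial intent but its role as an endogenous amplifier of human intelligence, power, and contradiction, which removes cognitive and institutional frictions, compresses timescales, and erodes the historical margin of error, intensifying governance instability and value conflict. Responses should therefore address system-level amplification, complexity, and regime stability, not model-level intent alone.
Link: https://arxiv.org/abs/2601.08673
Authors: Didier Sornette,Sandro Claudio Lera,Ke Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 20 pages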
Abstract:Recent reports of large language models (LLMs) exhibiting behaviors such as deception, threats, or blackmail are often interpreted as evidence of alignment failure or emergent malign agency. We argue that this interpretation rests on a conceptual error. LLMs do not reason morally; they statistically internalize the record of human social interaction, including laws, contracts, negotiations, conflicts, and coercive arrangements. Behaviors commonly labeled as unethical or anomalous are therefore better understood as structural generalizations of interaction regimes that arise under extreme asymmetries of power, information, or constraint. Drawing on relational models theory, we show that practices such as blackmail are not categorical deviations from normal social behavior, but limiting cases within the same continuum that includes market pricing, authority relations, and ultimatum bargaining. The surprise elicited by such outputs reflects an anthropomorphic expectation that intelligence should reproduce only socially sanctioned behavior, rather than the full statistical landscape of behaviors humans themselves enact. Because human morality is plural, context-dependent, and historically contingent, the notion of a universally moral artificial intelligence is ill-defined. We therefore reframe concerns about artificial general intelligence (AGI). The primary risk is not adversarial intent, but AGI’s role as an endogenous amplifier of human intelligence, power, and contradiction. By eliminating longstanding cognitive and institutional frictions, AGI compresses timescales and removes the historical margin of error that has allowed inconsistent values and governance regimes to persist without collapse. Alignment failure is thus structural, not accidental, and requires governance approaches that address amplification, complexity, and regime stability rather than model-level intent alone.
[AI-13] From Classical to Quantum Reinforcement Learning and Its Applications in Quantum Control: A Beginners Tutorial
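[Quick Read]: This paper addresses the difficulty undergraduate students face in turning theoretical knowledge of reinforcement learning (RL) into working code, in particular the transition from conceptual understanding to hands-on implementation. The key to its approach is clear, example-driven explanation combined with hands-on examples that systematically bridge the gap between theory and code, helping students acquire foundational RL skills and the ability to apply RL techniques in real-world scenarios.
Link: https://arxiv.org/abs/2601.08662
Authors: Abhijit Sen, Sonali Panda, Mahima Arya, Subhajit Patra, Zizhan Zheng, Denys I. Bondar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: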
Abstract:This tutorial is designed to make reinforcement learning (RL) more accessible to undergraduate students by offering clear, example-driven explanations. It focuses on bridging the gap between RL theory and practical coding applications, addressing common challenges that students face when transitioning from conceptual understanding to implementation. Through hands-on examples and approachable explanations, the tutorial aims to equip students with the foundational skills needed to confidently apply RL techniques in real-world scenarios.
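[AI-14] Prism: Towards Lowering User Cognitive Load in LLMs via Complex Intent Understanding
[Quick Read]: This paper addresses the challenge of complex intent understanding that large language models (LLMs) face when interacting on social platforms, especially where user goals are ambiguous and change dynamically and traditional single-turn execution cannot support effective human-LLM collaboration. The core challenge is modeling the logical dependencies among clarification questions so that intent clarification becomes more coherent and efficient. The key to the solution is the Prism framework, which comprises four modules: a complex intent decomposition module that identifies the structured elements of an intent and their logical dependencies; a logical clarification generation module that organizes clarification questions according to those dependencies to reduce cognitive load; an intent-aware reward module that simulates interactions via Monte Carlo sampling and generates high-quality training data; and a self-evolved intent tuning module that iteratively refines the model's logical clarification capability through data-driven feedback. The authors report improved logical consistency (conflict rate reduced to 11.5%), user satisfaction (+14.4%), and task efficiency (completion time reduced by 34.8%).
Link: https://arxiv.org/abs/2601.08653
Authors: Zenghua Liao, Jinzhi Liao, Xiang Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: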
Abstract:Large Language Models are rapidly emerging as web-native interfaces to social platforms. On the social web, users frequently have ambiguous and dynamic goals, making complex intent understanding, rather than single-turn execution, the cornerstone of effective human-LLM collaboration. Existing approaches attempt to clarify user intents through sequential or parallel questioning, yet they fall short of addressing the core challenge: modeling the logical dependencies among clarification questions. Inspired by Cognitive Load Theory, we propose Prism, a novel framework for complex intent understanding that enables logically coherent and efficient intent clarification. Prism comprises four tailored modules: a complex intent decomposition module, which decomposes user intents into smaller, well-structured elements and identifies logical dependencies among them; a logical clarification generation module, which organizes clarification questions based on these dependencies to ensure coherent, low-friction interactions; an intent-aware reward module, which evaluates the quality of clarification trajectories via an intent-aware reward function and leverages Monte Carlo sampling to simulate user-LLM interactions for large-scale, high-quality training data generation; and a self-evolved intent tuning module, which iteratively refines the LLM's logical clarification capability through data-driven feedback and optimization. Prism consistently outperforms existing approaches across clarification interactions, intent execution, and cognitive load benchmarks. It achieves state-of-the-art logical consistency, reduces logical conflicts to 11.5%, increases user satisfaction by 14.4%, and decreases task completion time by 34.8%. All data and code are released.
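The idea of ordering clarification questions by their logical dependencies can be illustrated with a topological sort. This is a minimal sketch and not Prism's actual algorithm: it assumes the decomposed intent elements form a directed acyclic graph, and the element names (`destination`, `dates`, `budget`, `hotel`) and the helper `clarification_order` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def clarification_order(dependencies):
    """Return intent elements in an order that respects their dependencies.

    `dependencies` maps each element to the elements that must be
    clarified before it (hypothetical structure, for illustration only).
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical decomposition of a "plan my trip" intent.
trip_intent = {
    "destination": [],
    "dates": ["destination"],
    "budget": [],
    "hotel": ["destination", "dates", "budget"],
}

for element in clarification_order(trip_intent):
    print(f"Ask about: {element}")
```

Under this assumption, a question is asked only after everything it depends on has been settled. A circular dependency makes `static_order()` raise `graphlib.CycleError`, so cycles would have to be resolved before ordering.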
[AI-15] Resisting Manipulative Bots in Memecoin Copy Trading: A Multi-Agent Approach with Chain-of-Thought Reasoning
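[Quick Read]: This paper addresses the unreliable profitability of copy trading in meme coin investment, which stems from the prevalence of manipulative bots, uncertainty about the future performance of followed wallets, and lag in trade execution. It also addresses the limitation that large language models (LLMs) lack enough domain knowledge of cryptocurrency markets to handle complex asset allocation tasks. The key to the solution is an explainable multi-agent system inspired by the structure of an asset management team: it decomposes the complex copy-trading decision into subtasks that specialized agents solve collaboratively. Through few-shot chain-of-thought (CoT) prompting, each agent acquires professional meme coin trading knowledge, interprets multi-modal data, and generates explainable decisions. In the empirical evaluation the system outperforms traditional machine learning models and single LLMs, reaching 73% and 70% precision in identifying high-quality meme coin projects and key opinion leader (KOL) wallets, respectively, and the selected KOLs generated a total profit of $500,000.
Link: https://arxiv.org/abs/2601.08641
Authors: Yichen Luo, Yebo Feng, Jiahua Xu, Yang Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
Comments: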
Abstract:The launch of $Trump coin ignited a wave of meme coin investment. Copy trading, as a strategy-agnostic approach that eliminates the need for deep trading knowledge, quickly gained widespread popularity in the meme coin market. However, copy trading is not a guarantee of profitability due to the prevalence of manipulative bots, the uncertainty of the followed wallets' future performance, and the lag in trade execution. Recently, large language models (LLMs) have shown promise in financial applications by effectively understanding multi-modal data and producing explainable decisions. However, a single LLM struggles with complex, multi-faceted tasks such as asset allocation. These challenges are even more pronounced in cryptocurrency markets, where LLMs often lack sufficient domain-specific knowledge in their training data. To address these challenges, we propose an explainable multi-agent system for meme coin copy trading. Inspired by the structure of an asset management team, our system decomposes the complex task into subtasks and coordinates specialized agents to solve them collaboratively. Employing few-shot chain-of-thought (CoT) prompting, each agent acquires professional meme coin trading knowledge, interprets multi-modal data, and generates explainable decisions. Using a dataset of 1,000 meme coin projects' transaction data, our empirical evaluation shows that the proposed multi-agent system outperforms both traditional machine learning models and single LLMs, achieving 73% and 70% precision in identifying high-quality meme coin projects and key opinion leader (KOL) wallets, respectively. The selected KOLs collectively generated a total profit of $500,000 across these projects.
[AI-16] M2FMoE: Multi-Resolution Multi-View Frequency Mixture-of-Experts for Extreme-Adaptive Time Series Forecasting AAAI2026
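[Quick Read]: This paper addresses the difficulty of modeling extreme events in time series forecasting. Such events are high-variance, dynamically irregular, and sparse but high-impact, and existing methods capture their complex temporal dynamics poorly. The key to the solution is M²FMoE, a model that learns regular and extreme patterns jointly through multi-resolution and multi-view frequency modeling. It has three modules: (1) a multi-view frequency mixture-of-experts module that assigns experts to distinct frequency bands in the Fourier and wavelet domains, with a cross-view shared band splitter that enables collaboration among experts; (2) a multi-resolution adaptive fusion module that aggregates frequency features hierarchically from coarse to fine, increasing sensitivity to short-term variations and sudden changes; and (3) a temporal gating integration module that dynamically balances long-term trends against short-term frequency-aware features, improving adaptability to both regular and extreme temporal patterns.
Link: https://arxiv.org/abs/2601.08631
Authors: Yaohui Huang, Runmin Zou, Yun Wang, Laeeq Aslam, Ruipeng Dong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026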
Abstract:Forecasting time series with extreme events is critical yet challenging due to their high variance, irregular dynamics, and sparse but high-impact nature. While existing methods excel in modeling dominant regular patterns, their performance degrades significantly during extreme events, constituting the primary source of forecasting errors in real-world applications. Although some approaches incorporate auxiliary signals to improve performance, they still fail to capture extreme events' complex temporal dynamics. To address these limitations, we propose M²FMoE, an extreme-adaptive forecasting model that learns both regular and extreme patterns through multi-resolution and multi-view frequency modeling. It comprises three modules: (1) a multi-view frequency mixture-of-experts module that assigns experts to distinct spectral bands in Fourier and Wavelet domains, with a cross-view shared band splitter aligning frequency partitions and enabling inter-expert collaboration to capture both dominant and rare fluctuations; (2) a multi-resolution adaptive fusion module that hierarchically aggregates frequency features from coarse to fine resolutions, enhancing sensitivity to both short-term variations and sudden changes; (3) a temporal gating integration module that dynamically balances long-term trends and short-term frequency-aware features, improving adaptability to both regular and extreme temporal patterns. Experiments on real-world hydrological datasets with extreme patterns demonstrate that M²FMoE outperforms state-of-the-art baselines without requiring extreme-event labels.
[AI-17] Rewriting Video: Text-Driven Reauthoring of Video Footage
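[Quick Read]: This paper addresses the high barrier to video reauthoring: traditional video editing demands expertise, time, and planning from creators, which constrains narrative expression. The key to the solution is a text-driven video reauthoring paradigm based on generative AI, realized through two technical contributions: a generative reconstruction algorithm that reverse-engineers the original video into an editable text prompt, and an interactive tool, Rewrite Kit, that lets creators manipulate these prompts directly to modify video content. The approach moves beyond the interface and workflow constraints of conventional video editing, explores whether video can be edited the way text is rewritten, and surfaces a human-AI perceptual gap along with key challenges in co-creation.
Link: https://arxiv.org/abs/2601.08565
Authors: Sitong Wang, Anh Truong, Lydia B. Chilton, Dingzeyu Li
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: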
Abstract:Video is a powerful medium for communication and storytelling, yet reauthoring existing footage remains challenging. Even simple edits often demand expertise, time, and careful planning, constraining how creators envision and shape their narratives. Recent advances in generative AI suggest a new paradigm: what if editing a video were as straightforward as rewriting text? To investigate this, we present a tech probe and a study on text-driven video reauthoring. Our approach involves two technical contributions: (1) a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and (2) an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm. Our work contributes empirical insights into the opportunities and challenges of text-driven video reauthoring, offering design implications for future co-creative video tools.
[AI-18] WaterCopilot: An AI-Driven Virtual Assistant for Water Management MICRO
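[Quick Read]: This paper addresses the challenges that sustainable water resource management faces in transboundary river basins such as the Limpopo River Basin (LRB): fragmented data, limited real-time access, and the complexity of integrating multiple information sources. The key to the solution is WaterCopilot, an AI virtual assistant built on Retrieval-Augmented Generation (RAG) and tool-calling architectures. Two custom plugins do the work: iwmi-doc-plugin provides semantic search over policy documents, and iwmi-api-plugin connects to real-time hydrological databases to deliver dynamic insights such as environmental-flow alerts, rainfall trends, and reservoir levels. Combined with multilingual interaction, transparent source referencing, automated calculations, and visualization, this forms a unified, interactive decision-support platform. The system scores 0.8043 overall under the RAGAS evaluation framework, supports scalable deployment, and offers a replicable AI-augmented governance framework for data-scarce transboundary basins.
Link: https://arxiv.org/abs/2601.08559
Authors: Keerththanan Vickneswaran, Mariangel Garcia Andarcia, Hugo Retief, Chris Dickens, Paulo Silva
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 12 figures. This work was developed in collaboration between the International Water Management Institute (IWMI) and Microsoft Research. The supplementary user guide for WaterCopilot is available via this https URL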
Abstract:Sustainable water resource management in transboundary river basins is challenged by fragmented data, limited real-time access, and the complexity of integrating diverse information sources. This paper presents WaterCopilot, an AI-driven virtual assistant developed through collaboration between the International Water Management Institute (IWMI) and Microsoft Research for the Limpopo River Basin (LRB), to bridge these gaps through a unified, interactive platform. Built on Retrieval-Augmented Generation (RAG) and tool-calling architectures, WaterCopilot integrates static policy documents and real-time hydrological data via two custom plugins: the iwmi-doc-plugin, which enables semantic search over indexed documents using Azure AI Search, and the iwmi-api-plugin, which queries live databases to deliver dynamic insights such as environmental-flow alerts, rainfall trends, reservoir levels, water accounting, and irrigation data. The system features guided multilingual interactions (English, Portuguese, French), transparent source referencing, automated calculations, and visualization capabilities. Evaluated using the RAGAS framework, WaterCopilot achieves an overall score of 0.8043, with high answer relevancy (0.8571) and context precision (0.8009). Key innovations include automated threshold-based alerts, integration with the LRB Digital Twin, and a scalable deployment pipeline hosted on AWS. While limitations in processing non-English technical documents and API latency remain, WaterCopilot establishes a replicable AI-augmented framework for enhancing water governance in data-scarce, transboundary contexts. The study demonstrates the potential of this AI assistant to support informed, timely decision-making and strengthen water security in complex river basins.
[AI-19] Contrastive and Multi-Task Learning on Noisy Brain Signals with Nonlinear Dynamical Signatures
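[Quick Read]: This paper addresses three problems in electroencephalography (EEG) signal analysis: heavy noise contamination, dynamics that are hard to model, and feature representations that generalize poorly. The key to the solution is a two-stage multitask learning framework. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, yielding robust signal representations. In the second stage, a multitask architecture operates on the denoised signals to perform motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with the NT-Xent loss. By separating noise removal from higher-level feature learning, the design mitigates interference between reconstruction and discriminative objectives and improves stability across datasets and reproducibility. The authors report that it surpasses strong baselines and recent state-of-the-art methods, supporting the value of combining denoising, dynamical features, and self-supervised learning.
Link: https://arxiv.org/abs/2601.08549
Authors: Sucheta Ghosh, Zahra Monfared, Felix Dietrich
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: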
Abstract:We introduce a two-stage multitask learning framework for analyzing Electroencephalography (EEG) signals that integrates denoising, dynamical modeling, and representation learning. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, providing robust signal representations. In the second stage, a multitask architecture processes these denoised signals to achieve three objectives: motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with NT-Xent loss. A convolutional backbone combined with a Transformer encoder captures spatial-temporal structure, while the dynamical task encourages sensitivity to nonlinear brain dynamics. This staged design mitigates interference between reconstruction and discriminative goals, improves stability across datasets, and supports reproducible training by clearly separating noise reduction from higher-level feature learning. Empirical studies show that our framework not only enhances robustness and generalization but also surpasses strong baselines and recent state-of-the-art methods in EEG decoding, highlighting the effectiveness of combining denoising, dynamical features, and self-supervised learning.
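For readers unfamiliar with the NT-Xent (normalized temperature-scaled cross-entropy) loss named in the abstract, the following is a dependency-free sketch of its standard form. It is not the authors' implementation: the function name `nt_xent_loss`, the default temperature of 0.5, and the toy embeddings are assumptions chosen for illustration, and a real EEG pipeline would compute this on encoder outputs in a tensor library.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss for paired embeddings.

    view_a[i] and view_b[i] are embeddings of two augmented views of
    sample i; every other embedding in the batch acts as a negative.
    """
    n = len(view_a)
    z = [_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same sample
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i  # self-similarity is excluded from the denominator
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        total += log_denominator - sims[positive]
    return total / (2 * n)


# Toy example: two samples, each with two slightly different views.
anchors = [[1.0, 0.0], [0.0, 1.0]]
views = [[0.9, 0.1], [0.1, 0.9]]
print(nt_xent_loss(anchors, views))
```

The loss decreases as the two views of the same sample move closer together relative to all other embeddings in the batch, which is what pushes the encoder toward augmentation-invariant representations.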
[AI-20] Sketch-Based Facade Renovation With Generative AI: A Streamlined Framework for Bypassing As-Built Modelling in Industrial Adaptive Reuse
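[Quick Read]: This paper addresses the cumbersome, time-consuming, revision-heavy workflow in facade renovation design that results from reliance on detailed as-built modelling. The key to the solution is a three-stage framework that combines generative AI with vision-language models (VLMs) to produce consistent renovation proposals directly from a rough structural sketch and textual descriptions. First, a fine-tuned VLM predicts the regions to modify and the components to add. Next, a Stable Diffusion model generates detailed sketches of the new elements, which are merged into the original sketch through generative inpainting. Finally, ControlNet refines the result into a photorealistic image. The method reduces dependence on precise as-built modelling and makes early-stage design iteration faster and design intent clearer.
Link: https://arxiv.org/abs/2601.08531
Authors: Warissara Booranamaitree, Xusheng Du, Yushu Cai, Zhengyang Wang, Ye Zhang, Haoran Xie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures, Proceedings of CAADRIA 2026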
Abstract:Facade renovation offers a more sustainable alternative to full demolition, yet producing design proposals that preserve existing structures while expressing new intent remains challenging. Current workflows typically require detailed as-built modelling before design, which is time-consuming, labour-intensive, and often involves repeated revisions. To solve this issue, we propose a three-stage framework combining generative artificial intelligence (AI) and vision-language models (VLMs) that directly processes a rough structural sketch and textual descriptions to produce consistent renovation proposals. First, the input sketch is used by a fine-tuned VLM to predict bounding boxes specifying where modifications are needed and which components should be added. Next, a stable diffusion model generates detailed sketches of new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet is employed to refine the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the proposed framework can generate renovation proposals that preserve the original structure while improving facade detail quality. This approach effectively bypasses the need for detailed as-built modelling, enabling architects to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intentions with greater clarity.
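[AI-21] Temporal Fusion Nexus: A task-agnostic multi-modal embedding model for clinical narratives and irregular time series in post-kidney transplant care
[Quick Read]: This paper addresses the challenge of fusing multi-modal medical data, specifically how to integrate irregular time series with unstructured clinical narratives to improve outcome prediction for patients after kidney transplant (KTx). The key to the solution is Temporal Fusion Nexus (TFN), a multi-modal, task-agnostic embedding model that jointly learns from time series and clinical text to produce discriminative and interpretable latent representations. TFN outperforms the prior state-of-the-art model on graft loss (AUC 0.96 vs. 0.94) and graft rejection (AUC 0.84 vs. 0.74) and reaches an AUC of 0.86 for mortality. Disentanglement metrics and SHAP-based attributions support the robustness and clinical plausibility of its latent factors, suggesting broader applicability in clinical settings with heterogeneous data sources, irregular longitudinal data, and rich narrative documentation.
Link: https://arxiv.org/abs/2601.08503
Authors: Aditya Kumar, Simon Rauch, Mario Cypko, Marcel Naik, Matthieu-P Schapranow, Aadil Rashid, Fabian Halleck, Bilgin Osmanodja, Roland Roller, Lars Pape, Klemens Budde, Mario Schiffer, Oliver Amft
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 9 figures, 3 tables. A supplementary file is also available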
Abstract:We introduce Temporal Fusion Nexus (TFN), a multi-modal and task-agnostic embedding model to integrate irregular time series and unstructured clinical narratives. We analysed TFN in post-kidney transplant (KTx) care, with a retrospective cohort of 3382 patients, on three key outcomes: graft loss, graft rejection, and mortality. Compared to a state-of-the-art model in post-KTx care, TFN achieved higher performance for graft loss (AUC 0.96 vs. 0.94) and graft rejection (AUC 0.84 vs. 0.74). In mortality prediction, TFN yielded an AUC of 0.86. TFN outperformed unimodal baselines (approximately 10% AUC improvement over a time-series-only baseline, approximately 5% AUC improvement over time series with static patient data). Integrating clinical text improved performance across all tasks. Disentanglement metrics confirmed robust and interpretable latent factors in the embedding space, and SHAP-based attributions confirmed alignment with clinical reasoning. TFN has potential application in clinical tasks beyond KTx, where heterogeneous data sources, irregular longitudinal data, and rich narrative documentation are available.
[AI-22] SUMMPILOT: Bridging Efficiency and Customization for Interactive Summarization System AAAI2025
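[Quick Read]: This paper addresses the challenge of generating personalized summaries, that is, tailoring summary content to a user's interests and requirements instead of relying only on generic automatic summarization. The key to the solution is SummPilot, a system built on a large language model (LLM) that supports dynamic interaction between user and system through interactive components such as semantic graphs, entity clustering, and explainable evaluation, so that users can flexibly adjust and refine summaries while understanding the document content.
Link: https://arxiv.org/abs/2601.08475
Authors: JungMin Yun, Juhwan Choi, Kyohoon Jin, Soojin Jang, Jinhee Jang, YoungBin Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to AAAI 2025 Demonstration Track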
Abstract:This paper incorporates the efficiency of automatic summarization and addresses the challenge of generating personalized summaries tailored to individual users’ interests and requirements. To tackle this challenge, we introduce SummPilot, an interaction-based customizable summarization system. SummPilot leverages a large language model to facilitate both automatic and interactive summarization. Users can engage with the system to understand document content and personalize summaries through interactive components such as semantic graphs, entity clustering, and explainable evaluation. Our demo and user studies demonstrate SummPilot’s adaptability and usefulness for customizable summarization.
[AI-23] M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games
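[Quick Read]: This paper addresses the lack of a systematic method for evaluating the advanced social behaviors of large language model (LLM) agents, such as cooperation, deception, and collusion, in complex social interaction. Existing benchmarks typically focus on a single capability dimension or rely only on behavioral outcomes, overlooking the rich information in agents' decision reasoning and communicative interactions. The key to the solution is M3-Bench, a multi-stage benchmark for mixed-motive games, together with a process-aware evaluation framework that analyzes agents jointly through three modules: BTA (Behavioral Trajectory Analysis), RPA (Reasoning Process Analysis), and CCA (Communication Content Analysis). It also draws on the Big Five personality model and Social Exchange Theory to aggregate multi-dimensional evidence into interpretable social behavior portraits, characterizing agents' personality traits and capability profiles beyond task scores or outcome-based metrics.
Link: https://arxiv.org/abs/2601.08462
Authors: Sixiong Xie, Zhuofan Shi, Haiyang Shen, Gang Huang, Yun Ma, Xiang Jing
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: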
Abstract:As the capabilities of large language model (LLM) agents continue to advance, their advanced social behaviors, such as cooperation, deception, and collusion, call for systematic evaluation. However, existing benchmarks often emphasize a single capability dimension or rely solely on behavioral outcomes, overlooking rich process information from agents’ decision reasoning and communicative interactions. To address this gap, we propose M3-Bench, a multi-stage benchmark for mixed-motive games, together with a process-aware evaluation framework that conducts synergistic analysis across three modules: BTA (Behavioral Trajectory Analysis), RPA (Reasoning Process Analysis), and CCA (Communication Content Analysis). Furthermore, we integrate the Big Five personality model and Social Exchange Theory to aggregate multi-dimensional evidence into interpretable social behavior portraits, thereby characterizing agents’ personality traits and capability profiles beyond simple task scores or outcome-based metrics. Experimental results show that M3-Bench can reliably distinguish diverse social behavior competencies across models, and it reveals that some models achieve seemingly reasonable behavioral outcomes while exhibiting pronounced inconsistencies in their reasoning and communication.
[AI-24] Decoding Order Matters in Autoregressive Speech Synthesis
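[Quick Read]: This paper studies how decoding order affects speech quality in autoregressive speech synthesis. Conventional methods mostly use a fixed left-to-right (L2R) order, which may not be optimal. The key to the approach is a masked diffusion framework that supports arbitrary decoding orders during both training and inference; by interpolating between identity and random permutations, the authors systematically examine the effect of decoding order on speech quality. Experiments show that fixed orders (L2R or right-to-left, R2L) are suboptimal, while adaptive decoding strategies such as Top-K perform better. The work also finds that even 1-bit quantised acoustic representations can support reasonably high-quality speech.
Link: https://arxiv.org/abs/2601.08450
Authors: Minghui Zhao, Anton Ragni
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: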
Abstract:Autoregressive speech synthesis often adopts a left-to-right order, yet generation order is a modelling choice. We investigate decoding order through a masked diffusion framework, which progressively unmasks positions and allows arbitrary decoding orders during training and inference. By interpolating between identity and random permutations, we show that randomness in decoding order affects speech quality. We further compare fixed strategies, such as l2r and r2l, with adaptive ones, such as Top-K, finding that fixed-order decoding, including the dominating left-to-right approach, is suboptimal, while adaptive decoding yields better performance. Finally, since masked diffusion requires discrete inputs, we quantise acoustic representations and find that even 1-bit quantisation can support reasonably high-quality speech.
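The difference between fixed, partially random, and adaptive decoding orders can be made concrete with a small sketch. The abstract does not say how the interpolation or the Top-K selection is implemented, so everything here is an assumption: `interpolated_order` shuffles a random subset of positions, `top_k_steps` unmasks the `k` most confident masked positions per step, and the confidence values stand in for scores a real model would produce.

```python
import random


def interpolated_order(n, randomness, seed=0):
    """Permutation of range(n) between identity (0.0) and fully shuffled (1.0).

    Assumption: interpolation shuffles a randomly chosen subset whose
    size is proportional to `randomness`; other positions stay in place.
    """
    rng = random.Random(seed)
    order = list(range(n))
    chosen = sorted(rng.sample(range(n), round(randomness * n)))
    values = [order[i] for i in chosen]
    rng.shuffle(values)
    for index, value in zip(chosen, values):
        order[index] = value
    return order


def top_k_steps(n, k, confidence):
    """Adaptive order: each step unmasks the k most confident masked positions.

    `confidence(position, revealed)` is a hypothetical stand-in for a
    model's confidence given the set of already revealed positions.
    """
    masked = set(range(n))
    steps = []
    while masked:
        revealed = set(range(n)) - masked
        ranked = sorted(masked, key=lambda p: (-confidence(p, revealed), p))
        picked = ranked[:k]
        steps.append(picked)
        masked.difference_update(picked)
    return steps


toy_scores = [0.1, 0.9, 0.5, 0.7]
print(interpolated_order(6, 0.0))                            # identity order
print(top_k_steps(4, 2, lambda p, revealed: toy_scores[p]))  # confidence-driven
print(top_k_steps(4, 1, lambda p, revealed: -p))             # reduces to left-to-right
```

Left-to-right decoding is the special case in which confidence depends only on position, which makes it easy to compare fixed and adaptive strategies under one interface.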
[AI-25] Beyond Linearization: Attributed Table Graphs for Table Reasoning
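[Quick Read]: This paper addresses three key problems with existing large language model (LLM) approaches to table reasoning: linearizing a table into plain text loses its structure; there is no explicit reasoning path, which hinders explainability; and the "lost-in-the-middle" problem arises, where an LLM processing a long sequence makes less use of information in the middle. The core of the solution is Table Graph Reasoner (TABGR), a training-free model that represents a table as an Attributed Table Graph (ATG) to explicitly preserve its row-column-cell structure, and adds a Question-Guided Personalized PageRank (QG-PPR) mechanism that reranks table content to mitigate the lost-in-the-middle problem, improving reasoning accuracy and explainability.
Link: https://arxiv.org/abs/2601.08444
Authors: Yuxiang Wang, Junhao Gan, Shengxiang Gao, Shenghao Ye, Zhengyi Yang, Jianzhong Qi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: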
Abstract:Table reasoning, a task to answer questions by reasoning over data presented in tables, is an important topic due to the prevalence of knowledge stored in tabular formats. Recent solutions use Large Language Models (LLMs), exploiting the semantic understanding and reasoning capabilities of LLMs. A common paradigm of such solutions linearizes tables to form plain texts that are served as input to LLMs. This paradigm has critical issues. It loses table structures, lacks explicit reasoning paths for result explainability, and is subject to the “lost-in-the-middle” issue. To address these issues, we propose Table Graph Reasoner (TABGR), a training-free model that represents tables as an Attributed Table Graph (ATG). The ATG explicitly preserves row-column-cell structures while enabling graph-based reasoning for explainability. We further propose a Question-Guided Personalized PageRank (QG-PPR) mechanism to rerank tabular data and mitigate the lost-in-the-middle issue. Extensive experiments on two commonly used benchmarks show that TABGR consistently outperforms state-of-the-art models by up to 9.7% in accuracy. Our code will be made publicly available upon publication.
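To make the idea of reranking table content with personalized PageRank concrete, here is a small sketch. It is not TABGR's implementation: the graph layout (each cell linked to a row node and a column node), the keyword-matching seed rule in `question_seeds`, the restart probability of 0.15, and the toy table are all assumptions for illustration.

```python
def build_table_graph(header, rows):
    """Undirected graph in which each cell links to its row node and column node."""
    adjacency = {}

    def link(a, b):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            cell = ("cell", r, c, str(value))
            link(cell, ("row", r))
            link(cell, ("col", header[c]))
    return adjacency


def question_seeds(adjacency, question):
    """Seed cells and columns whose text appears in the question (naive matching)."""
    q = question.lower()
    return {
        node: 1.0
        for node in adjacency
        if node[0] in ("cell", "col") and str(node[-1]).lower() in q
    }


def personalized_pagerank(adjacency, seeds, alpha=0.15, iterations=50):
    """Power iteration with restart probability `alpha` onto the seed nodes."""
    if not seeds:  # fall back to uniform restart when nothing matches
        seeds = {node: 1.0 for node in adjacency}
    total = sum(seeds.values())
    restart = {node: seeds.get(node, 0.0) / total for node in adjacency}
    rank = dict(restart)
    for _ in range(iterations):
        updated = {node: alpha * restart[node] for node in adjacency}
        for node, neighbours in adjacency.items():
            share = (1 - alpha) * rank[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        rank = updated
    return rank


header = ["city", "country", "population"]
rows = [["Paris", "France", "2.1M"], ["Berlin", "Germany", "3.6M"], ["Lyon", "France", "0.5M"]]
graph = build_table_graph(header, rows)
scores = personalized_pagerank(graph, question_seeds(graph, "What is the population of Berlin?"))
cells = sorted((n for n in graph if n[0] == "cell"), key=scores.get, reverse=True)
print(cells[:3])
```

Because the random walk restarts at nodes mentioned in the question, cells that share a row or column with those nodes accumulate more score, so relevant content can be moved to the front of the prompt regardless of where it sits in the original table.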
[AI-26] YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation
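[Quick Read]: This paper addresses a limitation of existing dense-vector steering methods such as BiPO: because neurons are multi-semantic, dense steering vectors entangle multiple latent factors, which limits their effectiveness and stability in fine-grained alignment settings such as cultural alignment. The key to the solution is Yet another Policy Optimization (YaPO), a reference-free method that optimizes sparse codes in the latent space of a Sparse Autoencoder (SAE) to learn disentangled, interpretable, and efficient steering vectors. The authors report faster convergence, stronger performance, and better training stability than dense baselines, generalization to several alignment-related behaviors (including hallucination, jailbreak, and power-seeking), and no measurable degradation on general-knowledge evaluation (MMLU).
Link: https://arxiv.org/abs/2601.08441
Authors: Abdelaziz Bounhar, Rania Hossam Elmohamady Elbadry, Hadi Abdine, Preslav Nakov, Michalis Vazirgiannis, Guokan Shang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: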
Abstract:Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment-related behaviors, including hallucination, wealth-seeking, jailbreak, and power-seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available at this https URL.
[AI-27] Large Multimodal Models for Embodied Intelligent Driving: The Next Frontier in Self-Driving?
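[Quick Read]: This paper addresses two problems. First, autonomous driving systems, limited by modular design, struggle with sustained environmental understanding and logical reasoning in open-world scenarios. Second, relying solely on Large Multimodal Models (LMMs) for enhancement, without a joint decision-making mechanism, constrains embodied intelligent driving capability. The key to the solution is a hybrid decision framework driven by both semantics and policy. It combines LMMs for semantic understanding and cognitive representation with deep reinforcement learning (DRL) for real-time policy optimization, preserving continuous learning and supporting joint decision-making.
Link: https://arxiv.org/abs/2601.08434
Authors: Long Zhang, Yuchen Xia
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: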
Abstract:The advent of Large Multimodal Models (LMMs) offers a promising technology to tackle the limitations of modular design in autonomous driving, which often falters in open-world scenarios requiring sustained environmental understanding and logical reasoning. Besides, embodied artificial intelligence facilitates policy optimization through closed-loop interactions to achieve the continuous learning capability, thereby advancing autonomous driving toward embodied intelligent (EI) driving. However, such capability will be constrained by relying solely on LMMs to enhance EI driving without joint decision-making. This article introduces a novel semantics and policy dual-driven hybrid decision framework to tackle this challenge, ensuring continuous learning and joint decision-making. The framework merges LMMs for semantic understanding and cognitive representation, and deep reinforcement learning (DRL) for real-time policy optimization. We start by introducing the foundational principles of EI driving and LMMs. Moreover, we examine the emerging opportunities this framework enables, encompassing potential benefits and representative use cases. A case study is conducted experimentally to validate the performance superiority of our framework in completing a lane-change planning task. Finally, several future research directions to empower EI driving are identified to guide subsequent work.
[AI-28] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
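[Quick Read]: This paper addresses the difficulty of optimizing open-ended generation, where no ground truth is available. Reinforcement Learning with Verifiable Rewards (RLVR) has driven progress in reasoning-intensive domains such as mathematics, but for open-ended tasks the existing rubric-based evaluation methods suffer from poor scalability and coarse criteria, producing a supervision ceiling effect. The key to the solution is an automated Coarse-to-Fine Rubric Generation framework that combines principle-guided synthesis, multi-model aggregation, and difficulty evolution to produce comprehensive, highly discriminative criteria. On this basis the authors build RubricHub, a large-scale (about 110k) multi-domain dataset, and use it in a two-stage post-training pipeline (RuFT followed by RuRL). The post-trained Qwen3-14B reaches 69.3 on HealthBench, surpassing proprietary frontier models such as GPT-5.
Link: https://arxiv.org/abs/2601.08430
Authors: Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: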
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (~110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.
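[AI-29] Taxon: Hierarchical Tax Code Prediction with Semantically Aligned LLM Expert Guidance
[Quick read]: This paper addresses automatic tax code prediction for e-commerce: mapping each product accurately to a node in a nationally defined multi-level tax classification hierarchy, so as to avoid financial inconsistencies and compliance risk. The core challenges are a complex tax code structure, sparse and noisy labels, and the need to keep predictions both semantically consistent and hierarchically correct. The key to the solution is the Taxon framework, which has two novel modules: (i) a feature-gating mixture-of-experts architecture that adaptively routes multi-modal features across hierarchy levels, and (ii) a semantic consistency model distilled from large language models, which acts as a domain expert that verifies the match between product titles and official tax definitions. In addition, a multi-source training strategy that combines tax databases, invoice validation logs, and merchant registration data mitigates noisy supervision in real business data, improving both prediction accuracy and structural consistency.
Link: https://arxiv.org/abs/2601.08418
Authors: Jihang Li,Qing Liu,Zulong Chen,Jing Wang,Wei Wang,Chuanfei Xu,Zeyi Wen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: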
Abstract:Tax code prediction is a crucial yet underexplored task in automating invoicing and compliance management for large-scale e-commerce platforms. Each product must be accurately mapped to a node within a multi-level taxonomic hierarchy defined by national standards, where errors lead to financial inconsistencies and regulatory risks. This paper presents Taxon, a semantically aligned and expert-guided framework for hierarchical tax code prediction. Taxon integrates (i) a feature-gating mixture-of-experts architecture that adaptively routes multi-modal features across taxonomy levels, and (ii) a semantic consistency model distilled from large language models acting as domain experts to verify alignment between product titles and official tax definitions. To address noisy supervision in real business records, we design a multi-source training pipeline that combines curated tax databases, invoice validation logs, and merchant registration data to provide both structural and semantic supervision. Extensive experiments on the proprietary TaxCode dataset and public benchmarks demonstrate that Taxon achieves state-of-the-art performance, outperforming strong baselines. Further, an additional full hierarchical path reconstruction procedure significantly improves structural consistency, yielding the highest overall F1 scores. Taxon has been deployed in production within Alibaba's tax service system, handling an average of over 500,000 tax code queries per day and reaching peak volumes above five million requests during business events, with improved accuracy, interpretability, and robustness.
[AI-30] Regulatory gray areas of LLM Terms
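[Quick read]: The problem this paper addresses: large language models (LLMs) are increasingly used in academic research, yet the usage restrictions that their Terms of Service (ToS) place on researchers vary widely and have not been systematically examined. This leaves compliance gray areas in research practice, with specific challenges for security research, computational social science, and psychology. The key to the solution is a comparative analysis of the ToS of five major LLM providers (Anthropic, DeepSeek, Google, OpenAI, and xAI) that identifies differences in the stringency and specificity of usage restrictions and exposes "regulatory gray areas". The authors also provide a publicly accessible comparison resource (on OSF) that helps researchers understand each platform's compliance requirements and make sound decisions in a fast-moving technical landscape.
Link: https://arxiv.org/abs/2601.08415
Authors: Brittany I. Davidson,Kate Muir,Florian A.D. Burnat,Adam N. Joinson
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: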
Abstract:Large Language Models (LLMs) are increasingly integrated into academic research pipelines; however, the Terms of Service governing their use remain under-examined. We present a comparative analysis of the Terms of Service of five major LLM providers (Anthropic, DeepSeek, Google, OpenAI, and xAI) collected in November 2025. Our analysis reveals substantial variation in the stringency and specificity of usage restrictions for general users and researchers. We identify specific complexities for researchers in security research, computational social sciences, and psychological studies. We identify "regulatory gray areas" where Terms of Service create uncertainty for legitimate use. We contribute a publicly available resource comparing terms across platforms (OSF) and discuss implications for general users and researchers navigating this evolving landscape.
[AI-31] Hybrid Distillation with CoT Guidance for Edge-Drone Control Code Generation
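[Quick read]: This paper addresses the conflict between the high compute cost of large language models (LLMs) and the real-time, lightweight requirements of onboard control systems for resource-constrained Unmanned Aerial Vehicles (UAVs). The key to the solution is an integrated method that combines knowledge distillation, chain-of-thought guidance, and supervised fine-tuning. First, the authors build a high-quality instruction-code-reasoning-chain dataset covering several mainstream UAV SDKs and augment it with counterfactual negative samples. Second, using DeepSeek-Coder-V2-Lite quantized via QLoRA as the teacher, a hybrid black-box and white-box distillation strategy generates high-quality chain-of-thought soft labels, which are combined with a weighted cross-entropy loss on hard labels so that complex reasoning ability transfers efficiently to a lightweight student model. Finally, prompt tuning optimized for UAV control improves core tasks such as SDK type recognition and function call matching. Experiments show that the method keeps code generation accuracy high while markedly improving deployment efficiency and inference speed, supporting its feasibility for precise, lightweight intelligent UAV control.
Link: https://arxiv.org/abs/2601.08412
Authors: Yizhan Feng,Hichem Snoussi,Yuhang Wang,Jing Teng,Abel Cherouat,Tian Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 2nd International Conference on Drones and Unmanned Systems (DAUS' 2026)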
Abstract:With large language models demonstrating significant potential in code generation tasks, their application to onboard control of resource-constrained Unmanned Aerial Vehicles has emerged as an important research direction. However, a notable contradiction exists between the high resource consumption of large models and the real-time, lightweight requirements of UAV platforms. This paper proposes an integrated approach that combines knowledge distillation, chain-of-thought guidance, and supervised fine-tuning for UAV multi-SDK control tasks, aiming to efficiently transfer complex reasoning and code generation capabilities to smaller models. Firstly, a high-quality dataset covering various mainstream UAV SDKs is constructed, featuring instruction-code-reasoning chains and incorporating counterfactual negative samples for data augmentation, guiding the model to learn the end-to-end logic from instruction parsing to code generation. Secondly, leveraging DeepSeek-Coder-V2-Lite quantized via QLoRA as the teacher model, and based on a hybrid black-box and white-box distillation strategy, high-quality chain-of-thought soft labels are generated. These are combined with a weighted cross-entropy loss using hard labels to transfer complex reasoning capabilities to the smaller student model. Finally, through prompt tuning engineering optimized for the UAV control scenario, the model performance on core tasks such as SDK type recognition and function call matching is enhanced. Experimental results indicate that the distilled lightweight model maintains high code generation accuracy while achieving significant improvements in deployment and inference efficiency, effectively demonstrating the feasibility and superiority of our approach in achieving precise and lightweight intelligent control for UAVs.
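The combination of teacher soft labels with a weighted cross-entropy on hard labels, as described in the abstract of [AI-31], can be illustrated with a minimal sketch. The function names, the single `hard_weight` mixing coefficient, and the per-token formulation are assumptions made for illustration; the paper's actual loss may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_distillation_loss(student_logits, teacher_probs, hard_label, hard_weight=0.5):
    """Mix hard-label cross-entropy with soft-label cross-entropy.

    hard_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    student_probs = softmax(student_logits)
    # Cross-entropy against the ground-truth token (hard label).
    hard_ce = -math.log(student_probs[hard_label])
    # Cross-entropy against the teacher distribution (soft label).
    soft_ce = -sum(
        t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0.0
    )
    return hard_weight * hard_ce + (1.0 - hard_weight) * soft_ce
```

A student that agrees with both the teacher and the hard label gets a lower loss than one that disagrees, which is the behavior the weighting is meant to preserve.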
[AI-32] WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
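[Quick read]: This paper addresses the lack of systematic, standardized security evaluation for Web Agents deployed in real web environments. Current security testing of Web Agents is fragmented and hard to reproduce, which limits how fully their risks can be identified and compared. The key to the solution is WebTrap Park, an automated platform that directly observes Web Agents interacting with live web pages, turns three major categories of security risk into 1,226 executable evaluation tasks, and supports action-based security assessment without modifying agent code. This makes it possible to compare the security of Web Agents quantitatively across frameworks, shows that agent architecture itself matters for security, and lays a foundation for reproducible, scalable Web Agent security evaluation.
Link: https://arxiv.org/abs/2601.08406
Authors: Xinyi Wu,Jiagui Chen,Geng Hong,Jiayi Dong,Xudong Pan,Jiarun Dai,Min Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: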
Abstract:Web Agents are increasingly deployed to perform complex tasks in real web environments, yet their security evaluation remains fragmented and difficult to standardize. We present WebTrap Park, an automated platform for systematic security evaluation of Web Agents through direct observation of their concrete interactions with live web pages. WebTrap Park instantiates three major sources of security risk into 1,226 executable evaluation tasks and enables action-based assessment without requiring agent modification. Our results reveal clear security differences across agent frameworks, highlighting the importance of agent architecture beyond the underlying model. WebTrap Park is publicly accessible at this https URL and provides a scalable foundation for reproducible Web Agent security evaluation.
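[AI-33] Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs
[Quick read]: This paper addresses the credit assignment gap that arises when large language models are trained for personalized recommendation with sparse, sequence-level rewards. The gap is most acute when user intent is implicit in under-specified language and no labels are available, so that standard reinforcement learning methods such as GRPO cannot tell which tokens contributed to the final outcome. The key to the solution is Owen-Shapley Policy Optimization (OSPO), which performs potential-based reward shaping using Shapley-Owen attributions and redistributes the sequence-level advantage over semantically coherent segments (for example, phrases describing product attributes or sentences expressing preferences). The model can therefore learn directly from task feedback without a parametric value model, while the optimal policy is preserved. The method notably improves test-time robustness to out-of-distribution retrievers.
Link: https://arxiv.org/abs/2601.08403
Authors: Abhijnan Nath,Alireza Bagheri Garakani,Tianchen Zhou,Fan Yang,Nikhil Krishnaswamy
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: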
Abstract:Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards that create a credit assignment gap, obscuring which tokens drive success. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, a reasoning pattern rarely seen during pretraining. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens’ marginal contributions to outcomes. Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, learning directly from task feedback without parametric value models. By forming coalitions of semantically coherent units (phrases describing product attributes or sentences capturing preferences), OSPO identifies which response parts drive performance. Experiments on Amazon ESCI and HM Fashion datasets show consistent gains over baselines, with notable test-time robustness to out-of-distribution retrievers unseen during training.
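The idea in [AI-33] of crediting response segments by their marginal contribution can be sketched with exact Shapley values over a handful of segments. This is a simplified illustration: it computes plain Shapley values rather than the Owen (coalition-structured) variant, `toy_reward` is a made-up reward, and the proportional split in `redistribute_advantage` is an assumption rather than the paper's potential-based shaping.

```python
from itertools import permutations


def shapley_values(n_segments, value_fn):
    """Exact Shapley value of each segment; feasible only for small n_segments."""
    totals = [0.0] * n_segments
    orders = list(permutations(range(n_segments)))
    for order in orders:
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for index in order:
            coalition.add(index)
            current = value_fn(frozenset(coalition))
            totals[index] += current - previous  # marginal contribution
            previous = current
    return [total / len(orders) for total in totals]


def toy_reward(coalition):
    """Hypothetical reward: segments 0 and 2 matter, and they also interact."""
    score = 0.0
    if 0 in coalition:
        score += 1.0
    if 2 in coalition:
        score += 2.0
    if 0 in coalition and 2 in coalition:
        score += 1.0
    return score


def redistribute_advantage(advantage, values):
    """Split a sequence-level advantage across segments in proportion to credit."""
    total = sum(values)
    if total == 0:
        return [advantage / len(values)] * len(values)
    return [advantage * value / total for value in values]
```

In this toy case the segment that never affects the reward receives zero credit, and the interaction term is shared equally between the two segments that produce it.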
[AI-34] Controlled LLM Training on Spectral Sphere
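[Quick read]: This paper addresses convergence problems caused by unstable optimization in large-scale model training. In particular, existing optimizers such as Muon control only the parameter updates and cannot constrain weight drift, which undermines activation stability. The core solution is the Spectral Sphere Optimizer (SSO), which imposes strict module-wise spectral constraints on both the weights and their updates, giving an optimization process that is fully aligned with the Maximal Update Parametrization (μP) framework. By deriving the steepest descent direction on the spectral sphere, SSO keeps activations stably controlled at Θ(1) scale. It is implemented as an efficient parallel algorithm in Megatron and improves training stability and performance.
Link: https://arxiv.org/abs/2601.08393
Authors: Tian Xie,Haoming Luo,Haoyu Tang,Yiwen Hu,Jason Klein Liu,Qingnan Ren,Yang Wang,Wayne Xin Zhao,Rui Yan,Bing Su,Chong Luo,Baining Guo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: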
Abstract:Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (μP) provides a theoretical safeguard for width-invariant Θ(1) activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully μP-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
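To make the notion of a spectral constraint on weights in [AI-34] concrete, the sketch below estimates a matrix's spectral norm by power iteration and rescales the matrix so that its spectral norm equals a chosen radius. This is only the constraint step under assumed names (`spectral_norm`, `retract_to_spectral_sphere`, `radius`); it is not the paper's steepest descent update and says nothing about how SSO chooses the radius per module.

```python
import math


def matvec(W, v):
    """Compute W @ v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matvec_transposed(W, u):
    """Compute W.T @ u."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]


def vector_norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, iterations=100):
    """Estimate the largest singular value of W by power iteration on W.T @ W.

    Starts from the all-ones vector, so it can fail if that vector is exactly
    orthogonal to the top right singular vector.
    """
    v = [1.0] * len(W[0])
    for _ in range(iterations):
        v = matvec_transposed(W, matvec(W, v))
        length = vector_norm(v)
        if length == 0.0:
            return 0.0
        v = [x / length for x in v]
    return vector_norm(matvec(W, v))


def retract_to_spectral_sphere(W, radius):
    """Rescale W so that its spectral norm equals radius."""
    sigma = spectral_norm(W)
    if sigma == 0.0:
        return [row[:] for row in W]
    scale = radius / sigma
    return [[scale * w for w in row] for row in W]
```

Applying the rescaling after every weight update would keep each module on a sphere of fixed spectral norm, which is one simple way to prevent the weight drift that the abstract attributes to update-only control.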
[AI-35] Creativity in AI as Emergence from Domain-Limited Generative Models
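[Quick read]: The problem this paper addresses: current research on creativity in artificial intelligence relies mostly on evaluative frameworks that measure the novelty, diversity, or usefulness of generated outputs. This treats creativity as a property to be measured rather than a phenomenon to be modeled, and it overlooks the underlying mechanism. The key to the solution is a generative theory of creativity that regards it as an emergent property of domain-limited generative models operating in bounded informational environments. The paper decomposes creativity into four interacting components (pattern-based generation, induced world models, contextual grounding, and arbitrarity) and argues that creativity should be understood as the result of the interaction between generative dynamics and domain-specific representations, which yields a technically grounded framework for studying creativity in AI systems.
Link: https://arxiv.org/abs/2601.08388
Authors: Corina Chutaux (SU FdL)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: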
Abstract:Creativity in artificial intelligence is most often addressed through evaluative frameworks that aim to measure novelty, diversity, or usefulness in generated outputs. While such approaches have provided valuable insights into the behavior of modern generative models, they largely treat creativity as a property to be assessed rather than as a phenomenon to be explicitly modeled. In parallel, recent advances in large-scale generative systems, particularly multimodal architectures, have demonstrated increasingly sophisticated forms of pattern recombination, raising questions about the nature and limits of machine creativity. This paper proposes a generative perspective on creativity in AI, framing it as an emergent property of domain-limited generative models embedded within bounded informational environments. Rather than introducing new evaluative criteria, we focus on the structural and contextual conditions under which creative behaviors arise. We introduce a conceptual decomposition of creativity into four interacting components (pattern-based generation, induced world models, contextual grounding, and arbitrarity) and examine how these components manifest in multimodal generative systems. By grounding creativity in the interaction between generative dynamics and domain-specific representations, this work aims to provide a technical framework for studying creativity as an emergent phenomenon in AI systems, rather than as a post hoc evaluative label.
[AI-36] A Qualitative Model to Reason about Object Rotations (QOR) applied to solve the Cube Comparison Test (CCT)
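[Quick read]: This paper addresses reasoning about object rotations, applied to the classic Cube Comparison Test (CCT), which is used to assess an individual's spatial visualization ability. The key to the solution is a conceptual neighborhood graph (CNGRLO) that relates a rotation movement to the location change and orientation change of the features on the cube's sides. Composition tables are generated from this graph and support logical inference about rotation operations.
Link: https://arxiv.org/abs/2601.08382
Authors: Zoe Falomir
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: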
Abstract:This paper presents a Qualitative model for Reasoning about Object Rotations (QOR) which is applied to solve the Cube Comparison Test (CCT) by Ekstrom et al. (1976). A conceptual neighborhood graph relating the Rotation movement to the Location change and the Orientation change (CNGRLO) of the features on the cube sides has been built and it produces composition tables to calculate inferences for reasoning about rotations.
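[AI-37] Thematic Working Group 5 – Artificial Intelligence (AI) literacy for teaching and learning: design and implementation
[Quick read]: This paper addresses teachers' lack of AI literacy and agency in educational practice, which makes it hard for them to integrate artificial intelligence into teaching. The key to the solution is a systematic set of strategies, covering curriculum design, teacher professional development programs, practical classroom applications, and policy guidance, that builds teachers' command and understanding of AI tools, strengthens their confidence in using AI in teaching, and deepens students' understanding of AI concepts.
Link: https://arxiv.org/abs/2601.08380
Authors: Mary Webb,Matt Bower,Ana Amélia Carvalho,Fredrik Mørk Røkenes,Jodie Torrington,Jonathan D. Cohen,Yousra Chtouki,Kathryn Maccallum,Tanya Linden,Deirdre Butler,Juliana Elisa Raffaghelli,Henriikka Vartiainen,Martina Ronci,Peter Tiernan,David M. Smith,Chris Shelton,Joyce Malyn-smith,Pierre Gorissen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: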
Abstract:TWG 5 focused on developing and implementing effective strategies for enhancing AI literacy and agency of teachers, equipping them with the knowledge and skills necessary to integrate AI into their teaching practices. Explorations covered curriculum design, professional development programs, practical classroom applications, and policy guidelines aiming to empower educators to confidently utilize AI tools and foster a deeper understanding of AI concepts among students.
[AI-38] Scalable Sequential Recommendation under Latency and Memory Constraints
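[Quick read]: This paper addresses the memory and latency constraints that sequential recommender systems face when modeling long sequences. Transformer-based methods in particular suffer from the quadratic complexity of attention, so user histories must be heavily truncated, which limits their usefulness in long-horizon settings. The key to the solution is the HoloMambaRec architecture. On one side, it uses holographic reduced representations for attribute-aware embedding, binding item and attribute information by circular convolution so that structured metadata is encoded without increasing the embedding dimension. On the other side, it uses a shallow selective state space encoder inspired by Mamba, which processes sequences in linear time and supports efficient training and constant-time recurrent inference. The design substantially lowers memory complexity, outperforms SASRec under a limited number of training epochs, and is competitive with GRU4Rec. It also includes forward-compatible mechanisms for temporal bundling and inference-time compression, offering a practical and extensible option for scalable, metadata-aware sequential recommendation.
Link: https://arxiv.org/abs/2601.08360
Authors: Adithya Parthasarathy,Aswathnarayan Muthukrishnan Kirubakaran,Vinoth Punniyamoorthy,Nachiappan Chockalingam,Lokesh Butra,Kabilan Kannan,Abhirup Mazumder,Sumit Saha
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: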
Abstract:Sequential recommender systems must model long-range user behavior while operating under strict memory and latency constraints. Transformer-based approaches achieve strong accuracy but suffer from quadratic attention complexity, forcing aggressive truncation of user histories and limiting their practicality for long-horizon modeling. This paper presents HoloMambaRec, a lightweight sequential recommendation architecture that combines holographic reduced representations for attribute-aware embedding with a selective state space encoder for linear-time sequence processing. Item and attribute information are bound using circular convolution, preserving embedding dimensionality while encoding structured metadata. A shallow selective state space backbone, inspired by recent Mamba-style models, enables efficient training and constant-time recurrent inference. Experiments on Amazon Beauty and MovieLens-1M datasets demonstrate that HoloMambaRec consistently outperforms SASRec and achieves competitive performance with GRU4Rec under a constrained 10-epoch training budget, while maintaining substantially lower memory complexity. The design further incorporates forward-compatible mechanisms for temporal bundling and inference-time compression, positioning HoloMambaRec as a practical and extensible alternative for scalable, metadata-aware sequential recommendation.
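The binding step mentioned in [AI-38], combining an item vector with an attribute vector by circular convolution so that the result keeps the same dimensionality, can be sketched in a few lines. The helper names and the direct O(n²) formulation are illustrative assumptions; a practical implementation would more likely use an FFT, and the paper's embedding pipeline is not reproduced here.

```python
def circular_convolution(a, b):
    """Bind two equal-length vectors; the output has the same length."""
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have the same length")
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]


def involution(b):
    """Approximate inverse used for unbinding in holographic representations."""
    n = len(b)
    return [b[(-i) % n] for i in range(n)]


def unbind(bound, b):
    """Recover an estimate of the other operand from a bound vector."""
    return circular_convolution(bound, involution(b))
```

For example, binding `[1.0, 2.0, 3.0]` with the shift vector `[0.0, 1.0, 0.0]` rotates it to `[3.0, 1.0, 2.0]`, and unbinding with the same shift vector recovers the original exactly. For general random vectors the recovery is only approximate.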
[AI-39] Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant
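[Quick read]: This paper addresses a systematic epistemic failure in large language model (LLM) agent architectures that arises from conflating information transport mechanisms with epistemic justification mechanisms, a phenomenon the authors call "semantic laundering". The core problem is that, under standard architectural assumptions, an LLM agent accepts propositions that lack sufficient evidential support simply because they crossed a trusted interface, and grants them high epistemic status. This is an architectural realization of the Gettier problem: propositions acquire high epistemic status without any connection between their justification and what makes them true. The key contribution is the Theorem of Inevitable Self-Licensing, which states that circular epistemic justification cannot be eliminated under the current architectural paradigm. The authors introduce the Warrant Erosion Principle as the underlying explanation and show that common remedies such as scaling, model improvement, and LLM-as-judge schemes cannot remove a structural defect that exists at the type level.
Link: https://arxiv.org/abs/2601.08333
Authors: Oleg Romanchuk,Roman Bondar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: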
Abstract:LLM-based agent architectures systematically conflate information transport mechanisms with epistemic justification mechanisms. We formalize this class of architectural failures as semantic laundering: a pattern where propositions with absent or weak warrant are accepted by the system as admissible by crossing architecturally trusted interfaces. We show that semantic laundering constitutes an architectural realization of the Gettier problem: propositions acquire high epistemic status without a connection between their justification and what makes them true. Unlike classical Gettier cases, this effect is not accidental; it is architecturally determined and systematically reproducible. The central result is the Theorem of Inevitable Self-Licensing: under standard architectural assumptions, circular epistemic justification cannot be eliminated. We introduce the Warrant Erosion Principle as the fundamental explanation for this effect and show that scaling, model improvement, and LLM-as-judge schemes are structurally incapable of eliminating a problem that exists at the type level.
[AI-40] Safe Heterogeneous Multi-Agent RL with Communication Regularization for Coordinated Target Acquisition
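[Quick read]: This paper addresses how structurally heterogeneous multi-agent teams can jointly discover and acquire randomly located targets in environments with partial observability, communication constraints, and dynamic interactions. The key to the solution is a decentralized multi-agent reinforcement learning framework in which each agent's policy is trained with Multi-Agent Proximal Policy Optimization (MAPPO) and uses a Graph Attention Network (GAT) encoder that fuses simulated range-sensing data with communication embeddings exchanged among neighboring agents, enabling context-aware decisions from local sensing and relational information. The work also combines graph-based communication with trajectory-aware safety filters in a single framework to ensure safe task execution, and it designs a structured reward that promotes informational orthogonality between communication vectors, which improves target discovery, collision avoidance, and de-correlation of inter-agent communication.
Link: https://arxiv.org/abs/2601.08327
Authors: Gabriele Calzolari,Vidya Sumathy,Christoforos Kanellakis,George Nikolakopoulos
Affiliation: Lulea University of Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, submitted to the IFAC World Congress 2026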
Abstract:This paper introduces a decentralized multi-agent reinforcement learning framework enabling structurally heterogeneous teams of agents to jointly discover and acquire randomly located targets in environments characterized by partial observability, communication constraints, and dynamic interactions. Each agent’s policy is trained with the Multi-Agent Proximal Policy Optimization algorithm and employs a Graph Attention Network encoder that integrates simulated range-sensing data with communication embeddings exchanged among neighboring agents, enabling context-aware decision-making from both local sensing and relational information. In particular, this work introduces a unified framework that integrates graph-based communication and trajectory-aware safety through safety filters. The architecture is supported by a structured reward formulation designed to encourage effective target discovery and acquisition, collision avoidance, and de-correlation between the agents’ communication vectors by promoting informational orthogonality. The effectiveness of the proposed reward function is demonstrated through a comprehensive ablation study. Moreover, simulation results demonstrate safe and stable task execution, confirming the framework’s effectiveness.
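The reward term in [AI-40] that encourages de-correlated communication vectors can be pictured as a penalty on pairwise similarity between agents' messages. The sketch below uses the mean squared cosine similarity over all agent pairs. Both the function names and this specific penalty form are assumptions for illustration, since the abstract does not give the formula.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def decorrelation_penalty(messages):
    """Mean squared cosine similarity over all pairs of agent messages.

    Returns 0.0 when all messages are mutually orthogonal and 1.0 when
    they are all parallel.
    """
    n = len(messages)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(messages[i], messages[j]) ** 2
            pairs += 1
    return total / pairs
```

Subtracting a scaled version of this penalty from the team reward would push agents toward sending non-redundant information, which matches the stated goal of informational orthogonality.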
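[AI-41] AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation
[Quick read]: This paper addresses the limited performance and generalization of existing agent memory mechanisms, which rely on static, hand-crafted workflows. The core solution is AtomMem, which reframes memory management as a dynamic decision-making problem: high-level memory processes are decomposed into atomic CRUD (Create, Read, Update, Delete) operations so that the memory workflow becomes learnable. Combining supervised fine-tuning with reinforcement learning, AtomMem learns an autonomous, task-aligned policy for memory behavior and thus manages memory flexibly according to task demands. Experiments show that AtomMem-8B outperforms static-workflow memory methods on several long-context benchmarks, and the training dynamics show that the model discovers structured, task-aligned memory strategies on its own, illustrating the advantage of the learning-based formulation.
Link: https://arxiv.org/abs/2601.08323
Authors: Yupeng Huo,Yaxi Lu,Zhong Zhang,Haotian Chen,Yankai Lin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: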
Abstract:Equipping agents with memory is essential for solving real-world long-horizon problems. However, most existing agent memory mechanisms rely on static and hand-crafted workflows. This limits the performance and generalization ability of these memory designs, which highlights the need for a more flexible, learning-based memory framework. In this paper, we propose AtomMem, which reframes memory management as a dynamic decision-making problem. We deconstruct high-level memory processes into fundamental atomic CRUD (Create, Read, Update, Delete) operations, transforming the memory workflow into a learnable decision process. By combining supervised fine-tuning with reinforcement learning, AtomMem learns an autonomous, task-aligned policy to orchestrate memory behaviors tailored to specific task demands. Experimental results across 3 long-context benchmarks demonstrate that the trained AtomMem-8B consistently outperforms prior static-workflow memory methods. Further analysis of training dynamics shows that our learning-based formulation enables the agent to discover structured, task-aligned memory management strategies, highlighting a key advantage over predefined routines.
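The atomic CRUD operations that [AI-41] exposes to a learned policy can be pictured as a small memory store with four methods. The class below is a hypothetical illustration: the name `AtomicMemory`, the integer ids, and the word-overlap retrieval in `read` are assumptions, and the learned policy that decides which operation to call is not modeled.

```python
class AtomicMemory:
    """A toy memory store exposing Create, Read, Update, Delete."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def create(self, text):
        """Store a new entry and return its id."""
        entry_id = self._next_id
        self._entries[entry_id] = text
        self._next_id += 1
        return entry_id

    def read(self, query, top_k=1):
        """Return the top_k entries sharing the most words with the query."""
        query_words = set(query.lower().split())
        ranked = sorted(
            self._entries.values(),
            key=lambda text: -len(query_words & set(text.lower().split())),
        )
        return ranked[:top_k]

    def update(self, entry_id, text):
        """Overwrite an existing entry."""
        if entry_id not in self._entries:
            raise KeyError(entry_id)
        self._entries[entry_id] = text

    def delete(self, entry_id):
        """Remove an entry if it exists."""
        self._entries.pop(entry_id, None)

    def __len__(self):
        return len(self._entries)
```

In a learned setup, an agent would emit one of these four operations at each step, and the sequence of calls would be what supervised fine-tuning and reinforcement learning optimize.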
[AI-42] ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning
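[Quick read]: This paper addresses the wasted compute that results when Large Reasoning Models (LRMs) apply uniformly overlong Chain-of-Thought (CoT) reasoning at inference time. Existing methods try to estimate a reasoning budget from the input, but the estimates are unreliable, and the trade-off between reasoning cost and accuracy is fixed at training time, which limits flexibility at deployment. The key to the solution is the ORBIT framework, which uses multi-stage reinforcement learning to discover Pareto-optimal behavior at each reasoning effort level and then uses on-policy distillation to fuse these behaviors into a single student model, while preserving clear separation between reasoning modes and high performance within each mode. The result is controllable, efficient, and flexible multi-budget reasoning.
Link: https://arxiv.org/abs/2601.08310
Authors: Kun Liang,Clive Bai,Xin Xu,Chenming Tang,Sanwoo Lee,Weijie Liu,Saiyong Yang,Yunfang Wu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint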
Abstract:Recent Large Reasoning Models (LRMs) achieve strong performance by leveraging long-form Chain-of-Thought (CoT) reasoning, but uniformly applying overlong reasoning at inference time incurs substantial and often unnecessary computational cost. To address this, prior work explores various strategies to infer an appropriate reasoning budget from the input. However, such approaches are unreliable in the worst case, as estimating the minimal required reasoning effort is fundamentally difficult, and they implicitly fix the trade-off between reasoning cost and accuracy during training, limiting flexibility under varying deployment scenarios. Motivated by these limitations, we propose ORBIT, a controllable multi-budget reasoning framework with well-separated reasoning modes triggered by input. ORBIT employs multi-stage reinforcement learning to discover Pareto-optimal reasoning behaviors at each effort, followed by on-policy distillation to fuse these behaviors into a single unified model. Experiments show that ORBIT achieves (1) controllable reasoning behavior over multiple modes, (2) competitive reasoning density within each mode, and (3) integration of these frontier policies into a single unified student model while preserving clear mode separation and high per-mode performance.
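[AI-43] OpenMic: A Multi-Agent-Based Stand-Up Comedy Generation System
[Quick read]: This paper addresses the challenges of generating Chinese stand-up comedy, including dependence on cultural context, control of performance timing, integration of stage-performance cues, and the need for implicit multi-step reasoning. It also notes that existing Chinese humor datasets are better suited to humor understanding and evaluation than to long-form stand-up generation, so the available supervision does not match the target task. The key to the solution is OpenMic, an end-to-end multi-agent system built on AutoGen, in which several specialized agents are coordinated through multi-round iterative planning to jointly optimize humor, timing, and performability. The system adds retrieval-augmented generation (RAG) to mitigate the data-task mismatch, and fine-tunes a dedicated JokeWriter model to better internalize the setup-punchline structure and long-range callbacks that are specific to stand-up.
Link: https://arxiv.org/abs/2601.08288
Authors: Yuyang Wu,Hanzhong Cao,Jianhao Chen,Yufei Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: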
Abstract:Chinese stand-up comedy generation goes beyond plain text generation, requiring culturally grounded humor, precise timing, stage-performance cues, and implicit multi-step reasoning. Moreover, commonly used Chinese humor datasets are often better suited for humor understanding and evaluation than for long-form stand-up generation, making direct supervision misaligned with the target task. To address these challenges, we present OpenMic, an end-to-end multi-agent system built on AutoGen that transforms a user-provided life topic into a 3-5 minute Chinese stand-up performance and further produces a narrated comedy video. OpenMic orchestrates multiple specialized agents in a multi-round iterative loop-planning to jointly optimize humor, timing, and performability. To mitigate the dataset-task mismatch, we augment generation with retrieval-augmented generation (RAG) for material grounding and idea expansion, and we fine-tune a dedicated JokeWriter to better internalize stand-up-specific setup-punchline structures and long-range callbacks.
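[AI-44] Greedy Is Enough: Sparse Action Discovery in Agentic LLMs
[Quick read]: This paper addresses efficient action discovery for agentic systems with very large action spaces: in an environment with thousands of available APIs or retrieval operations, identify the small set of actions that meaningfully affects decision performance. The core challenge is that, although the action space is huge, the set of effective actions is usually sparse, and conventional methods struggle to identify such sparse structure in high dimensions. The key to the solution is a contextual linear reward model under a structured sparsity assumption, with action discovery formulated as a block-sparse recovery problem. The author analyzes a greedy algorithm inspired by Orthogonal Matching Pursuit and proves that, under standard conditions on incoherence, signal strength, and action coverage, it exactly recovers the relevant action set with high probability, with sample complexity polynomial in the sparsity level and latent dimension and only logarithmic in the total number of actions. The paper also provides estimation error bounds for the refitted parameters and near-optimality guarantees for the resulting decision rule, giving a theoretical basis for action pruning in large action spaces.
Link: https://arxiv.org/abs/2601.08280
Authors: Angshul Majumdar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: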
Abstract:Modern agentic systems operate in environments with extremely large action spaces, such as tool-augmented language models with thousands of available APIs or retrieval operations. Despite this scale, empirical evidence suggests that only a small subset of actions meaningfully influences performance in a given deployment. Motivated by this observation, we study a contextual linear reward model in which action relevance is governed by a structured sparsity assumption: only a small number of actions have nonzero effects across latent states. We formulate action discovery as a block-sparse recovery problem and analyze a greedy algorithm inspired by Orthogonal Matching Pursuit. Under standard assumptions on incoherence, signal strength, and action coverage, we prove that the greedy procedure exactly recovers the relevant action set with high probability, using a number of samples that scales polynomially in the sparsity level and latent dimension, and only logarithmically in the total number of actions. We further provide estimation error guarantees for refitted parameters and show that the resulting decision rule is near-optimal for new latent states. Complementing these results, we establish information-theoretic lower bounds demonstrating that sparsity and sufficient coverage are necessary for tractability. Together, our results identify sparse action discovery as a fundamental principle underlying large-action decision-making and provide a theoretical foundation for action pruning in agentic systems.
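The greedy procedure in [AI-44] is described as inspired by Orthogonal Matching Pursuit (OMP). A minimal sketch of plain scalar OMP is shown below: at each step it selects the action whose feature column correlates most with the current residual, then refits least squares on the selected set. This is the textbook algorithm under assumed names (`omp`, `solve_linear`), not the paper's block-sparse variant, and it assumes the selected columns are linearly independent.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def omp(X, y, sparsity):
    """Return {action index: coefficient} for the greedily selected actions.

    X is a list of sample rows (samples x actions); y is the reward vector.
    """
    n_samples, n_actions = len(X), len(X[0])
    columns = [[X[i][j] for i in range(n_samples)] for j in range(n_actions)]
    norms = [math.sqrt(dot(c, c)) for c in columns]
    support, coef = [], []
    residual = list(y)
    for _ in range(sparsity):
        candidates = [j for j in range(n_actions) if j not in support and norms[j] > 0]
        # Greedy step: pick the column most correlated with the residual.
        best = max(candidates, key=lambda j: abs(dot(columns[j], residual)) / norms[j])
        support.append(best)
        # Refit least squares on the support via the normal equations.
        gram = [[dot(columns[a], columns[b]) for b in support] for a in support]
        rhs = [dot(columns[a], y) for a in support]
        coef = solve_linear(gram, rhs)
        fitted = [sum(c * columns[a][i] for c, a in zip(coef, support)) for i in range(n_samples)]
        residual = [yi - fi for yi, fi in zip(y, fitted)]
    return dict(zip(support, coef))
```

On a small design where only two of three actions affect the reward, calling `omp` with `sparsity=2` returns exactly those two actions with their coefficients, which is the pruning behavior the abstract's guarantees are about.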
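[AI-45] ToolACE-MCP: Generalizing History-Aware Routing from MCP Tools to the Agent Web
[Quick read]: This paper addresses the scalability and generality bottlenecks that current agent ecosystems face as they evolve into a large, open, collaborative network of tools (the Agent Web). The key to the solution is ToolACE-MCP, a training pipeline for history-aware routers: a dependency-rich candidate graph is used to synthesize multi-turn interaction trajectories, routers with dynamic context understanding are trained on them, and the result is a plug-and-play Light Routing Agent. The method improves navigation accuracy and robustness in complex, open environments, generalizes to multi-agent collaboration, and scales to massive candidate spaces.
Link: https://arxiv.org/abs/2601.08276
Authors: Zhiyuan Yao,Zishan Xu,Yifu Guo,Zhiguang Han,Cheng Yang,Shuo Zhang,Weinan Zhang,Xingshan Zeng,Weiwen Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: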
Abstract:With the rise of the Agent Web and Model Context Protocol (MCP), the agent ecosystem is evolving into an open collaborative network, exponentially increasing accessible tools. However, current architectures face severe scalability and generality bottlenecks. To address this, we propose ToolACE-MCP, a pipeline for training history-aware routers to empower precise navigation in large-scale ecosystems. By leveraging a dependency-rich candidate Graph to synthesize multi-turn trajectories, we effectively train routers with dynamic context understanding to create the plug-and-play Light Routing Agent. Experiments on the real-world benchmarks MCP-Universe and MCP-Mark demonstrate superior performance. Notably, ToolACE-MCP exhibits critical properties for the future Agent Web: it not only generalizes to multi-agent collaboration with minimal adaptation but also maintains exceptional robustness against noise and scales effectively to massive candidate spaces. These findings provide a strong empirical foundation for universal orchestration in open-ended ecosystems.
[AI-46] Sparsity Is Necessary: Polynomial-Time Stability for Agentic LLMs in Large Action Spaces
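【Quick Read】: This paper addresses sequential decision-making in tool-augmented LLM systems whose discrete action space (tools, APIs, documents) is massive, while only a small, unknown subset is useful for a given task distribution. The authors formalize this setting as Sparse Agentic Control (SAC): policies are block-sparse over M actions, and rewards depend on sparse main effects and, optionally, sparse synergies. The key to the solution is $\ell_{1,2}$-regularized policy learning through a convex surrogate, with sharp bounds in the style of compressed sensing: (i) estimation error and value suboptimality scale as $k (\log M / T)^{1/2}$, where k is the number of relevant actions and T is the number of samples; (ii) under incoherence and a beta-min condition, exact tool-support recovery follows from a primal-dual witness argument; (iii) any dense policy class requires $\Omega(M)$ samples, which explains the instability of prompt-only controllers. Under partial observability, the LLM matters only through a belief/representation error $\epsilon_b$, which adds an $O(\epsilon_b)$ loss while preserving the logarithmic dependence on M.
Link: https://arxiv.org/abs/2601.08271
Authors: Angshul Majumdar
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: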
Abstract:Tool-augmented LLM systems expose a control regime that learning theory has largely ignored: sequential decision-making with a massive discrete action universe (tools, APIs, documents) in which only a small, unknown subset is relevant for any fixed task distribution. We formalize this setting as Sparse Agentic Control (SAC), where policies admit block-sparse representations over $M \gg 1$ actions and rewards depend on sparse main effects and (optionally) sparse synergies. We study $\ell_{1,2}$-regularized policy learning through a convex surrogate and establish sharp, compressed-sensing-style results: (i) estimation and value suboptimality scale as $k (\log M / T)^{1/2}$ under a Policy-RSC condition; (ii) exact tool-support recovery holds via primal-dual witness arguments when $T \gtrsim k \log M$ under incoherence and beta-min; and (iii) any dense policy class requires $\Omega(M)$ samples, explaining the instability of prompt-only controllers. We further show that under partial observability, LLMs matter only through a belief/representation error $\epsilon_b$, yielding an additive $O(\epsilon_b)$ degradation while preserving logarithmic dependence on $M$. Extensions cover tuning-free, online, robust, group-sparse, and interaction-aware SAC.
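The block-sparsity idea behind an l1,2 penalty can be illustrated with its proximal operator, the standard group soft-thresholding step used by many group-lasso solvers. This is a minimal sketch, not the paper's algorithm: treating each row as one action's parameter block, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math


def group_soft_threshold(rows, lam):
    """Proximal operator of lam * sum_g ||w_g||_2, applied block by block.

    Each row is treated as one action's parameter block (an assumption for
    this sketch). Blocks whose Euclidean norm is at most lam become zero.
    """
    result = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        # Shrink the whole block toward zero; zero it if the norm is too small.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        result.append([scale * x for x in row])
    return result


def support(rows):
    """Indices of the blocks that survive thresholding."""
    return [i for i, row in enumerate(rows) if any(x != 0.0 for x in row)]


# Hypothetical toy weights: one strong block, one weak block, one empty block.
weights = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
shrunk = group_soft_threshold(weights, lam=1.0)
print(support(shrunk))
```

Only the first block has norm above the threshold, so it is the only one kept. Zeroing whole blocks at once is what makes the recovered set of relevant actions sparse.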
[AI-47] VGG Induced Deep Hand Sign Language Detection
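【Quick Read】: This paper addresses the accuracy and practicality of hand gesture recognition in human-computer interaction for people with disabilities (the abstract names visually impaired people in particular), with the aim of improving their ability to communicate in sign language. The key to the solution is a VGG-16 convolutional neural network combined with transfer learning and image data augmentation. The model is trained with Python and Keras on a public image dataset, validated on the NUS dataset, and tested on a self-built test set, reaching about 98% accuracy.
Link: https://arxiv.org/abs/2601.08262
Authors: Subham Sharma,Sharmila Subudhi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published in: Sharma, S., Ghosh, A., Subudhi, S. (2022). Hand Sign Language Detection Using Deep Learning. In: Sahoo, J.P., Tripathy, A.K., Mohanty, M., Li, KC., Nayak, A.K. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, vol 302. Springer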
Abstract:Hand gesture recognition is an important aspect of human-computer interaction. It forms the basis of sign language for the visually impaired people. This work proposes a novel hand gesture recognizing system for the differently-abled persons. The model uses a convolutional neural network, known as VGG-16 net, for building a trained model on a widely used image dataset by employing Python and Keras libraries. Furthermore, the result is validated by the NUS dataset, consisting of 10 classes of hand gestures, fed to the model as the validation set. Afterwards, a testing dataset of 10 classes is built by employing Google’s open source Application Programming Interface (API) that captures different gestures of human hand and the efficacy is then measured by carrying out experiments. The experimental results show that by combining a transfer learning mechanism together with the image data augmentation, the VGG-16 net produced around 98% accuracy.
[AI-48] T3: Benchmarking Sycophancy and Skepticism in Causal Judgment
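【Quick Read】: This paper addresses the lack of systematic evaluation and diagnosis of causal reasoning in Large Language Models (LLMs), focusing on how performance differs across the rungs of Pearl's Ladder of Causality and where it breaks down. The key to the solution is T3 (Testing Trustworthy Thinking), a diagnostic benchmark of 454 expert-curated vignettes that decomposes causal judgment into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases. Applying T3 to frontier models reveals two pathologies: a "Skepticism Trap" at L1, where safety-tuned models wrongly reject many valid causal links, and a non-monotonic Scaling Paradox at L3, where GPT-5.2 clearly underperforms GPT-4-Turbo on ambiguous counterfactuals because of paralysis from excessive hedging rather than hallucination. The study also validates a process-verified repair protocol (RCA), showing that T3 captures the restoration of causal judgment under structured verification.
Link: https://arxiv.org/abs/2601.08258
Authors: Edward Y. Chang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures, 11 tables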
Abstract:We introduce T3 (Testing Trustworthy Thinking), a diagnostic benchmark designed to rigorously evaluate LLM causal judgment across Pearl’s Ladder of Causality. Comprising 454 expert-curated vignettes, T3 prioritizes high-resolution failure analysis, decomposing performance into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases. By applying T3 to frontier models, we diagnose two distinct pathologies: a “Skepticism Trap” at L1 (where safety-tuned models like Claude Haiku reject 60% of valid links) and a non-monotonic Scaling Paradox at L3. In the latter, the larger GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals, driven by a collapse into paralysis (excessive hedging) rather than hallucination. Finally, we use the benchmark to validate a process-verified protocol (RCA), showing that T3 successfully captures the restoration of decisive causal judgment under structured verification.
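The three-way decomposition in the abstract can be sketched as per-class hit rates. This is a minimal illustration and not the benchmark's scoring code: the label vocabulary (`valid`, `invalid`, `underdetermined`, `accept`, `reject`, `refuse`), the function name, and the toy records are hypothetical.

```python
def t3_style_scores(records):
    """Compute Utility, Safety and Wise Refusal rates from (gold, answer) pairs.

    gold   is one of "valid", "invalid", "underdetermined"  (hypothetical labels)
    answer is one of "accept", "reject", "refuse"           (hypothetical labels)
    """
    expected = {"valid": "accept", "invalid": "reject", "underdetermined": "refuse"}
    totals = {gold: 0 for gold in expected}
    hits = {gold: 0 for gold in expected}
    for gold, answer in records:
        totals[gold] += 1
        if answer == expected[gold]:
            hits[gold] += 1

    def rate(gold):
        # Undefined (NaN) when the class has no examples.
        return hits[gold] / totals[gold] if totals[gold] else float("nan")

    return {
        "utility": rate("valid"),                  # sensitivity: valid links accepted
        "safety": rate("invalid"),                 # specificity: invalid links rejected
        "wise_refusal": rate("underdetermined"),   # underdetermined cases refused
    }


# Made-up toy records, for illustration only.
toy = [
    ("valid", "accept"), ("valid", "reject"),
    ("invalid", "reject"), ("invalid", "reject"),
    ("underdetermined", "refuse"), ("underdetermined", "accept"),
]
scores = t3_style_scores(toy)
print(scores)
```

Reporting the three rates separately is what lets a benchmark of this kind tell an over-skeptical model (low utility, high safety) apart from an over-accepting one (high utility, low safety).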
[AI-49] On Evaluation of Unsupervised Feature Selection for Pattern Classification
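【Quick Read】: This paper addresses the bias introduced when unsupervised feature selection methods are evaluated on single-label datasets. Existing studies typically build a single-label dataset by picking one label from multi-label data. Because that choice is arbitrary, the performance ranking of methods is unstable and does not objectively reflect their true discriminative ability. The key to the solution is to adopt a multi-label classification framework as the evaluation paradigm. Comparing several representative methods on 21 multi-label datasets, the authors find rankings that differ markedly from those under single-label settings, which supports multi-label evaluation as a fairer and more reliable basis for comparison.
Link: https://arxiv.org/abs/2601.08257
Authors: Gyu-Il Kim,Dae-Won Kim,Jaesung Lee
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: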
Abstract:Unsupervised feature selection aims to identify a compact subset of features that captures the intrinsic structure of data without supervised label. Most existing studies evaluate the performance of methods using the single-label dataset that can be instantiated by selecting a label from multi-label data while maintaining the original features. Because the chosen label can vary arbitrarily depending on the experimental setting, the superiority among compared methods can be changed with regard to which label happens to be selected. Thus, evaluating unsupervised feature selection methods based solely on single-label accuracy is unreasonable for assessing their true discriminative ability. This study revisits this evaluation paradigm by adopting a multi-label classification framework. Experiments on 21 multi-label datasets using several representative methods demonstrate that performance rankings differ markedly from those reported under single-label settings, suggesting the possibility of multi-label evaluation settings for fair and reliable comparison of unsupervised feature selection methods.
[AI-50] Large Artificial Intelligence Model Guided Deep Reinforcement Learning for Resource Allocation in Non Terrestrial Networks
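【Quick Read】: This paper addresses resource scheduling and optimization in Non-Terrestrial Networks (NTN), where traditional methods rely on task-specific training and heuristic strategies and struggle to deliver efficient, fair, and robust performance under complex, changing weather. The key to the solution is a Deep Reinforcement Learning (DRL) agent guided by a Large Language Model (LLM): the LLM acts as a high-level coordinator that generates textual guidance to shape the DRL agent's reward during training. The abstract reports that the approach outperforms traditional DRL by 40% in nominal weather scenarios and by 64% in extreme weather scenarios, evaluated on throughput, fairness, and outage probability.
Link: https://arxiv.org/abs/2601.08254
Authors: Abdikarim Mohamed Ibrahim,Rosdiadee Nordin
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: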
Abstract:Large AI Models (LAMs) have been proposed for applications in Non-Terrestrial Networks (NTN), where they offer better performance through strong generalization and reduced task-specific training. In this paper, we propose a Deep Reinforcement Learning (DRL) agent that is guided by a Large Language Model (LLM). The LLM operates as a high-level coordinator that generates textual guidance that shapes the reward of the DRL agent during training. The results show that the LAM-DRL outperforms the traditional DRL by 40% in nominal weather scenarios and 64% in extreme weather scenarios compared to heuristics in terms of throughput, fairness, and outage probability.
[AI-51] Hyperbolic Heterogeneous Graph Transformer
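【Quick Read】: This paper addresses two core problems of existing hyperbolic-space heterogeneous graph neural networks when learning complex structures: reliance on tangent-space operations, which causes mapping distortions during frequent transitions, and message-passing architectures that focus on local neighborhoods and therefore struggle to capture global hierarchical structure and long-range dependencies between nodes of different types. The key to the solution is the Hyperbolic Heterogeneous Graph Transformer (HypHGT), which builds the whole model inside hyperbolic space, uses a transformer-based architecture to model both local and global dependencies, and introduces a relation-specific hyperbolic attention mechanism. The mechanism runs in linear time and preserves heterogeneous information across relation types, so the model can capture the structural properties and semantic information of heterogeneous graphs efficiently.
Link: https://arxiv.org/abs/2601.08251
Authors: Jongmin Park,Seunghoon Han,Hyewon Lee,Won-Yong Shin,Sungsu Lim
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 14 pages, 9 figures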
Abstract:In heterogeneous graphs, we can observe complex structures such as tree-like or hierarchical structures. Recently, the hyperbolic space has been widely adopted in many studies to effectively learn these complex structures. Although these methods have demonstrated the advantages of the hyperbolic space in learning heterogeneous graphs, most existing methods still have several challenges. They rely heavily on tangent-space operations, which often lead to mapping distortions during frequent transitions. Moreover, their message-passing architectures mainly focus on local neighborhood information, making it difficult to capture global hierarchical structures and long-range dependencies between different types of nodes. To address these limitations, we propose Hyperbolic Heterogeneous Graph Transformer (HypHGT), which effectively and efficiently learns heterogeneous graph representations entirely within the hyperbolic space. Unlike previous message-passing based hyperbolic heterogeneous GNNs, HypHGT naturally captures both local and global dependencies through transformer-based architecture. Furthermore, the proposed relation-specific hyperbolic attention mechanism in HypHGT, which operates with linear time complexity, enables efficient computation while preserving the heterogeneous information across different relation types. This design allows HypHGT to effectively capture the complex structural properties and semantic information inherent in heterogeneous graphs. We conduct comprehensive experiments to evaluate the effectiveness and efficiency of HypHGT, and the results demonstrate that it consistently outperforms state-of-the-art methods in node classification task, with significantly reduced training time and memory usage.
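To make the role of hyperbolic space concrete, the sketch below computes the geodesic distance in the Poincaré ball with curvature -1. The abstract does not say which hyperbolic model HypHGT uses, so the Poincaré ball is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumes curvature -1 and that both points have Euclidean norm below 1.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    # Guard against rounding slightly below 1, where acosh is undefined.
    return math.acosh(max(1.0, arg))


origin = (0.0, 0.0)
near = (0.5, 0.0)
far = (0.9, 0.0)
print(poincare_distance(origin, near))
print(poincare_distance(origin, far))
```

Distances grow quickly as points approach the boundary of the ball, which is why tree-like hierarchies embed with low distortion: the space has exponentially more room near the boundary for the leaves.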
[AI-52] The End of Reward Engineering: How LLMs Are Redefining Multi-Agent Coordination
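【Quick Read】: This paper addresses the core challenges of reward engineering in Multi-Agent Reinforcement Learning (MARL): credit assignment ambiguity, environmental non-stationarity, and the combinatorial growth of interaction complexity. The key to the solution is to use Large Language Models (LLMs) to replace hand-crafted numerical reward functions with objectives specified in natural language (semantic reward specification), giving a reward mechanism that is more flexible, more interpretable, and better aligned with human intent. The paper further argues that language-mediated supervision (as in the RLVR paradigm) and online dynamic reward adaptation (as in CARD) can ease the limits of traditional reward engineering, moving multi-agent systems from explicit numerical signals toward coordination based on shared semantic representations.
Link: https://arxiv.org/abs/2601.08237
Authors: Haoran Su,Yandong Sun,Congjia Yu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: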
Abstract:Reward engineering, the manual specification of reward functions to induce desired agent behavior, remains a fundamental challenge in multi-agent reinforcement learning. This difficulty is amplified by credit assignment ambiguity, environmental non-stationarity, and the combinatorial growth of interaction complexity. We argue that recent advances in large language models (LLMs) point toward a shift from hand-crafted numerical rewards to language-based objective specifications. Prior work has shown that LLMs can synthesize reward functions directly from natural language descriptions (e.g., EUREKA) and adapt reward formulations online with minimal human intervention (e.g., CARD). In parallel, the emerging paradigm of Reinforcement Learning from Verifiable Rewards (RLVR) provides empirical evidence that language-mediated supervision can serve as a viable alternative to traditional reward engineering. We conceptualize this transition along three dimensions: semantic reward specification, dynamic reward adaptation, and improved alignment with human intent, while noting open challenges related to computational overhead, robustness to hallucination, and scalability to large multi-agent systems. We conclude by outlining a research direction in which coordination arises from shared semantic representations rather than explicitly engineered numerical signals.
[AI-53] MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents ACL2026
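【Quick Read】: This paper addresses the lack of systematic evaluation of multimodal privacy behavior in language-model agents that handle personal data. Existing benchmarks based on Contextual Integrity (CI) are largely text-centric and emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. The key to the solution is MPCI-Bench, the first multimodal pairwise CI benchmark for agentic settings. It generates paired positive and negative instances from the same visual source and organizes them in three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. A Tri-Principle Iterative Refinement pipeline ensures data quality, so the benchmark can characterize how well agents comply with privacy norms in multimodal environments.
Link: https://arxiv.org/abs/2601.08235
Authors: Shouju Wang,Haopeng Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to ACL 2026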
Abstract:As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms becomes increasingly critical, often through the lens of Contextual Integrity (CI). However, existing CI benchmarks are largely text-centric and primarily emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. In this paper, we introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source and instantiated across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Data quality is ensured through a Tri-Principle Iterative Refinement pipeline. Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility and a pronounced modality leakage gap, where sensitive visual information is leaked more frequently than textual information. We will open-source MPCI-Bench to facilitate future research on agentic CI.
[AI-54] GADPN: Graph Adaptive Denoising and Perturbation Networks via Singular Value Decomposition
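【Quick Read】: This paper addresses the performance limits of Graph Neural Networks (GNNs) in practice when the observed graph is of poor quality, for example when it contains noise or missing links, or has structural properties that do not match the assumptions of GNNs. The key to the solution is GADPN, a simple yet effective graph structure learning framework that adaptively refines graph topology through low-rank denoising and generalized structural perturbation. First, it introduces Bayesian optimization to choose the denoising strength according to each graph's homophily level. Second, it uses Singular Value Decomposition (SVD) to extend the structural perturbation method to arbitrary graphs, removing the original restriction to symmetric structures. The result is an efficient method with stronger robustness and performance across diverse graph types, especially disassortative graphs.
Link: https://arxiv.org/abs/2601.08230
Authors: Hao Deng,Bo Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: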
Abstract:While Graph Neural Networks (GNNs) excel on graph-structured data, their performance is fundamentally limited by the quality of the observed graph, which often contains noise, missing links, or structural properties misaligned with GNNs’ underlying assumptions. To address this, graph structure learning aims to infer a more optimal topology. Existing methods, however, often incur high computational costs due to complex generative models and iterative joint optimization, limiting their practical utility. In this paper, we propose GADPN, a simple yet effective graph structure learning framework that adaptively refines graph topology via low-rank denoising and generalized structural perturbation. Our approach makes two key contributions: (1) we introduce Bayesian optimization to adaptively determine the optimal denoising strength, tailoring the process to each graph’s homophily level; and (2) we extend the structural perturbation method to arbitrary graphs via Singular Value Decomposition (SVD), overcoming its original limitation to symmetric structures. Extensive experiments on benchmark datasets demonstrate that GADPN achieves state-of-the-art performance while significantly improving efficiency. It shows particularly strong gains on challenging disassortative graphs, validating its ability to robustly learn enhanced graph structures across diverse network types.
[AI-55] An Axiomatic Approach to General Intelligence: SANC(E3) – Self-organizing Active Network of Concepts with Energy E3
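【Quick Read】: This paper addresses the fact that existing intelligent systems presuppose fixed representational units: current models usually rely on predefined primitives (tokens, subwords, pixels, or sensor channels) and bypass the question of how such units emerge from experience and stabilize. The key to the solution is SANC(E3), an axiomatic framework based on energy minimization in which representational units are not set by hand but self-organize through competitive selection, reconstruction, and compression under finite activation capacity. The framework distinguishes system tokens (structural anchors such as here, now, and I) from tokens that emerge from co-occurring events, and introduces a pseudo-memory-mapped I/O mechanism so that internally replayed Gestalts share the same processing pathway as external sensory input. Perception, imagination, prediction, planning, and action are thereby unified in a single representational and energetic process.
Link: https://arxiv.org/abs/2601.08224
Authors: Daesuk Kwon,Won-gi Paeng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 3 tables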
Abstract:General intelligence must reorganize experience into internal structures that enable prediction and action under finite resources. Existing systems implicitly presuppose fixed primitive units – tokens, subwords, pixels, or predefined sensor channels – thereby bypassing the question of how representational units themselves emerge and stabilize. This paper proposes SANC(E3), an axiomatic framework in which representational units are not given a priori but instead arise as stable outcomes of competitive selection, reconstruction, and compression under finite activation capacity, governed by the explicit minimization of an energy functional E3. SANC(E3) draws a principled distinction between system tokens – structural anchors such as here, now, I and sensory sources – and tokens that emerge through self-organization during co-occurring events. Five core axioms formalize finite capacity, association from co-occurrence, similarity-based competition, confidence-based stabilization, and the reconstruction-compression-update trade-off. A key feature is a pseudo-memory-mapped I/O mechanism, through which internally replayed Gestalts are processed via the same axiomatic pathway as external sensory input. As a result, perception, imagination, prediction, planning, and action are unified within a single representational and energetic process. From the axioms, twelve propositions are derived, showing that category formation, hierarchical organization, unsupervised learning, and high-level cognitive activities can all be understood as instances of Gestalt completion under E3 minimization.
[AI-56] DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection
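【Quick Read】: This paper addresses intellectual property protection for Large Language Models (LLMs) under black-box deployment, and in particular two weaknesses of existing backdoor-based fingerprints: reliance on rare tokens, which produces high-perplexity inputs that are easy to filter, and fixed trigger-response mappings, which are brittle to leakage and post-hoc adaptation. The key to the solution is Dual-Layer Nested Fingerprinting (DNF), which couples domain-specific stylistic cues with implicit semantic triggers to build a hierarchical backdoor. On Mistral-7B, LLaMA-3-8B-Instruct, and Falcon3-7B-Instruct, DNF achieves perfect fingerprint activation while preserving downstream performance, uses lower-perplexity triggers, remains undetectable under fingerprint detection attacks, and is relatively robust to incremental fine-tuning and model merging.
Link: https://arxiv.org/abs/2601.08223
Authors: Zhenhua Xu,Yiran Zhao,Mengting Zhong,Dezhang Kong,Changting Lin,Tong Qiao,Meng Han
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: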
Abstract:The rapid growth of large language models raises pressing concerns about intellectual property protection under black-box deployment. Existing backdoor-based fingerprints either rely on rare tokens – leading to high-perplexity inputs susceptible to filtering – or use fixed trigger-response mappings that are brittle to leakage and post-hoc adaptation. We propose Dual-Layer Nested Fingerprinting (DNF), a black-box method that embeds a hierarchical backdoor by coupling domain-specific stylistic cues with implicit semantic triggers. Across Mistral-7B, LLaMA-3-8B-Instruct, and Falcon3-7B-Instruct, DNF achieves perfect fingerprint activation while preserving downstream utility. Compared with existing methods, it uses lower-perplexity triggers, remains undetectable under fingerprint detection attacks, and is relatively robust to incremental fine-tuning and model merging. These results position DNF as a practical, stealthy, and resilient solution for LLM ownership verification and intellectual property protection.
[AI-57] Adapting Rules of Official International Mahjong for Online Players
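【Quick Read】: This paper addresses fairness problems in online play of Official International Mahjong caused by players' fragmented playtime and unfixed combinations of opponents, especially the first-mover advantage and ill-suited subgoal scoring in single-round play. The key to the solution is to run self-play experiments with a world champion AI and analyze the resulting statistics, which expose the first-mover advantage and flaws in subgoal scoring. Two rule changes follow: compensatory points in each round to offset the first-mover advantage, replacing the traditional rotation of positions over multiple rounds, and refined subgoal scores for different tile patterns to better suit the online setting. The revised game has been implemented online and is open to players.
Link: https://arxiv.org/abs/2601.08211
Authors: Chucai Wang,Lingfeng Li,Yunlong Lu,Wenxin Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: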
Abstract:As one of the traditional games played worldwide, Official International Mahjong can be played and promoted online through remote devices instead of requiring face-to-face interaction. However, online players have fragmented playtime and unfixed combinations of opponents, in contrast to offline players, who have fixed opponents for multiple rounds of play. Therefore, the rules designed for offline players need to be modified to ensure the fairness of online single-round play. Specifically, we employ a world champion AI to engage in self-play competitions and conduct statistical data analysis. Our study reveals the first-mover advantage and issues in the subgoal scoring settings. Based on our findings, we propose rule adaptations to make the game more suitable for the online environment, such as introducing compensatory points for the first-mover advantage and refining the scores of subgoals for different tile patterns. Compared with the traditional method of rotating positions over multiple rounds to balance first-mover advantage, our compensatory points mechanism in each round is more convenient for online players. Furthermore, we implement the revised Mahjong game online, which is open for online players. This work is an initial attempt to use data from AI systems to evaluate Official International Mahjong's game balance and develop a revised version of the traditional game better adapted for online players.
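One simple way to turn self-play statistics into per-seat compensation is to offset each seat's average score against the overall average. The sketch below shows only that idea. The abstract does not describe how the authors derive their compensatory points, so the formula, the function name, and the toy scores are assumptions, and the numbers are made up for illustration.

```python
from statistics import mean


def compensatory_points(games):
    """Per-seat adjustment that equalizes average scores across seats.

    games: list of per-game score tuples ordered by seat (seat 0 moves first).
    Returns one adjustment per seat so that every seat's adjusted average
    equals the overall average. This is a hypothetical rule, not the paper's.
    """
    seats = len(games[0])
    seat_means = [mean(game[s] for game in games) for s in range(seats)]
    overall = mean(seat_means)
    return [overall - m for m in seat_means]


# Made-up zero-sum scores for four seats over four games.
toy_games = [
    (12, -4, -4, -4),
    (8, 0, -4, -4),
    (-8, 16, -4, -4),
    (4, -4, 8, -8),
]
print(compensatory_points(toy_games))
```

In this toy data the first seat averages the highest score, so it receives the largest negative adjustment. Applying the adjustment in every round is what removes the need to rotate seats over several rounds.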
[AI-58] ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models
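【Quick Read】: This paper addresses three problems of existing invasive (backdoor) fingerprinting: high-perplexity triggers that are easily filtered, fixed response patterns that heuristic detectors can expose, and spurious activations on benign inputs. The key to the solution is ForgetMark, a stealthy fingerprinting framework that encodes provenance through targeted unlearning. It uses an assistant model and predictive-entropy ranking to build a compact, human-readable key-value set, then trains lightweight LoRA adapters to suppress the original values on their keys while preserving general capabilities. Ownership is verified under black-box or gray-box access by aggregating likelihood and semantic evidence into a fingerprint success rate. Because the method relies on probabilistic forgetting traces rather than fixed trigger-response patterns, it avoids high-perplexity triggers, reduces detectability, and lowers false triggers.
Link: https://arxiv.org/abs/2601.08189
Authors: Zhenhua Xu,Haobo Zhang,Zhebo Wang,Qichen Liu,Haitao Xu,Wenpeng Xing,Meng Han
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: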
Abstract:Existing invasive (backdoor) fingerprints suffer from high-perplexity triggers that are easily filtered, fixed response patterns exposed by heuristic detectors, and spurious activations on benign inputs. We introduce ForgetMark, a stealthy fingerprinting framework that encodes provenance via targeted unlearning. It builds a compact, human-readable key–value set with an assistant model and predictive-entropy ranking, then trains lightweight LoRA adapters to suppress the original values on their keys while preserving general capabilities. Ownership is verified under black/gray-box access by aggregating likelihood and semantic evidence into a fingerprint success rate. By relying on probabilistic forgetting traces rather than fixed trigger–response patterns, ForgetMark avoids high-perplexity triggers, reduces detectability, and lowers false triggers. Across diverse architectures and settings, it achieves 100% ownership verification on fingerprinted models while maintaining standard performance, surpasses backdoor baselines in stealthiness and robustness to model merging, and remains effective under moderate incremental fine-tuning. Our code and data are available at this https URL.
[AI-59] Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression
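【Quick Read】: This paper addresses the noise and reasoning instability that arise when Large Language Models (LLMs) process Text-Attributed Graphs (TAGs) and, because of the context window limit, fall back on random sampling. Existing methods usually feed the target node and its neighborhood to the LLM through handcrafted prompts and randomly drop nodes or edges to fit the context length, which damages the semantic consistency of the graph and hurts model performance. The key to the solution is Homophily-aware Structural and Semantic Compression for LLMs (HS2C). Structurally, it performs a global hierarchical partition guided by Structural Entropy minimization, identifying inherently homophilic communities and discarding stochastic connectivity noise. Semantically, it passes the detected structural homophily to the LLM as a cue, so the LLM performs differentiated semantic aggregation by predefined community type and compresses redundant background context while keeping the key information. Experiments on 10 node-level benchmarks show gains in both compression rate and downstream inference accuracy, and 7 graph-level benchmarks support the method's task generalizability.
Link: https://arxiv.org/abs/2601.08187
Authors: Zijun Di,Bin Lu,Huquan Kang,Luoyi Fu,Jiaxin Ding,Xiaoying Gan,Lei Zhou,Xinbing Wang,Chenghu Zhou
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: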
Abstract:Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. Recent studies typically focus on verbalizing the graph structures via handcrafted prompts, feeding the target node and its neighborhood context into LLMs. However, constrained by the context window, existing methods mainly resort to random sampling, often implemented via dropping nodes or edges randomly, which inevitably introduces noise and causes reasoning instability. We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock potential gains in LLMs' reasoning performance. To this end, we propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily. Structurally, guided by the principle of Structural Entropy minimization, we perform a global hierarchical partition that decodes the graph’s essential topology. This partition identifies naturally cohesive, homophilic communities, while discarding stochastic connectivity noise. Semantically, we deliver the detected structural homophily to the LLM, empowering it to perform differentiated semantic aggregation based on predefined community types. This process compresses redundant background contexts into concise community-level consensus, selectively preserving semantically homophilic information aligned with the target nodes. Extensive experiments on 10 node-level benchmarks across LLMs of varying sizes and families demonstrate that, by feeding LLMs with structurally and semantically compressed inputs, HS2C simultaneously enhances the compression rate and downstream inference accuracy, validating its superiority and scalability. Extensions to 7 diverse graph-level benchmarks further consolidate HS2C’s task generalizability.
[AI-60] The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios
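【Quick Read】: This paper addresses the gap between the strong performance of Multi-modal Large Language Models (MLLMs) in static test environments and their lack of robustness in dynamic, stochastic real-world deployment. It identifies three core challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. The key to the solution is \method, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting and evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition that reduces hallucination through active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. The framework shifts evaluation from static benchmarks to realistic, production-oriented scenarios.
Link: https://arxiv.org/abs/2601.08173
Authors: Daocheng Fu,Jianbiao Mei,Rong Wu,Xuemeng Yang,Jia Xu,Ding Wang,Pinlong Cai,Yong Liu,Licheng Wen,Botian Shi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: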
Abstract:The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce \method, a dynamic evaluation environment that simulates a “trainee” agent continuously exploring a novel setting. Unlike traditional benchmarks, \method evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our codes are available at this https URL
[AI-61] ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms
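【Quick Read】: This paper addresses Dynamic Voltage and Frequency Scaling (DVFS) and task-to-core allocation in embedded systems, which must balance thermal management, energy efficiency, and performance. Existing approaches either rely on utilization-based heuristics that overlook stall times or need extensive offline profiling to generate lookup tables, which prevents runtime adaptation. The key to the solution is a model-based hierarchical multi-agent reinforcement learning (MARL) framework. Two collaborative agents decompose the exponential action space, giving 358 ms latency for subsequent decisions, and an environment model built with regression techniques predicts thermal dynamics and performance states. A Large Language Model (LLM) extracts 13 code-level semantic features that characterize OpenMP programs without execution. Combined with synthetic training data, this enables zero-shot deployment without workload-specific profiling. The framework follows a Dyna-Q-inspired design that integrates direct reinforcement learning with model-based planning. On the BOTS and PolybenchC benchmarks it achieves 7.09x better energy efficiency and 4.0x better makespan than the Linux ondemand governor, and its first-decision latency is 8,300x faster than table-based profiling.
Link: https://arxiv.org/abs/2601.08166
Authors: Mohammad Pivezhandi,Mahdi Banisharif,Abusayeed Saifullah,Ali Jannesari
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 39 pages, 12 figures, 8 tables (including appendix)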
Abstract:Dynamic voltage and frequency scaling (DVFS) and task-to-core allocation are critical for thermal management and balancing energy and performance in embedded systems. Existing approaches either rely on utilization-based heuristics that overlook stall times, or require extensive offline profiling for table generation, preventing runtime adaptation. We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms. Two collaborative agents decompose the exponential action space, achieving 358ms latency for subsequent decisions. First decisions require 3.5 to 8.0s including one-time LLM feature extraction. An accurate environment model leverages regression techniques to predict thermal dynamics and performance states. When combined with LLM-extracted semantic features, the environment model enables zero-shot deployment for new workloads on trained platforms by generating synthetic training data without requiring workload-specific profiling samples. We introduce LLM-based semantic feature extraction that characterizes OpenMP programs through 13 code-level features without execution. The Dyna-Q-inspired framework integrates direct reinforcement learning with model-based planning, achieving 20x faster convergence than model-free methods. Experiments on BOTS and PolybenchC benchmarks across NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 demonstrate 7.09x better energy efficiency and 4.0x better makespan than Linux ondemand governor. First-decision latency is 8,300x faster than table-based profiling, enabling practical deployment in dynamic embedded systems.
[AI-62] Project Synapse: A Hierarchical Multi-Agent Framework with Hybrid Memory for Autonomous Resolution of Last-Mile Delivery Disruptions DATE
【Quick Read】: This paper addresses the autonomous handling of last-mile delivery disruptions, that is, how to identify and recover from failures efficiently and automatically in a complex, dynamic environment. The key to the solution is Project Synapse, a new agentic framework with a hierarchical multi-agent architecture: a central Resolution Supervisor agent performs strategic task decomposition and delegates subtasks to specialized worker agents for tactical execution. The system is orchestrated with LangGraph to manage complex and cyclical workflows.
Link: https://arxiv.org/abs/2601.08156
Authors: Arin Gopalan Yadav,Varad Dherange,Kumar Shivam
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: We propose and evaluate a hierarchical LLM-driven multi-agent framework for adaptive disruption management in last-mile logistics, integrating planning, coordination, and natural-language reasoning. The system is validated through simulation-based experiments and qualitative analysis. Includes figures and tables. 33 pages
Abstract:This paper introduces Project Synapse, a novel agentic framework designed for the autonomous resolution of last-mile delivery disruptions. Synapse employs a hierarchical multi-agent architecture in which a central Resolution Supervisor agent performs strategic task decomposition and delegates subtasks to specialized worker agents responsible for tactical execution. The system is orchestrated using LangGraph to manage complex and cyclical workflows. To validate the framework, a benchmark dataset of 30 complex disruption scenarios was curated from a qualitative analysis of over 6,000 real-world user reviews. System performance is evaluated using an LLM-as-a-Judge protocol with explicit bias mitigation.
[AI-63] Dynamic Graph Structure Learning via Resistance Curvature Flow
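[Quick read]: This paper addresses two problems: static graph construction in traditional Geometric Representation Learning (GRL) struggles to capture the intrinsic curvature of high-dimensional data manifolds, and dynamic topological optimization based on Ollivier-Ricci Curvature Flow (OCF) relies on Optimal Transport, which makes it too computationally expensive for large-scale datasets and deep learning frameworks. The key to the solution is a new geometric evolution framework, Resistance Curvature Flow (RCF), which uses the concept of effective resistance from circuit physics to turn expensive curvature optimization into efficient matrix operations, achieving over 100x computational acceleration while retaining geometric optimization capability comparable to OCF.
Link: https://arxiv.org/abs/2601.08149
Authors: Chaoqun Fei, Huanjiang Liu, Tinglve Zhou, Yangyang Li, Tianyong Hao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: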
Abstract:Geometric Representation Learning (GRL) aims to approximate the non-Euclidean topology of high-dimensional data through discrete graph structures, grounded in the manifold hypothesis. However, traditional static graph construction methods based on Euclidean distance often fail to capture the intrinsic curvature characteristics of the data manifold. Although Ollivier-Ricci Curvature Flow (OCF) has proven to be a powerful tool for dynamic topological optimization, its core reliance on Optimal Transport (Wasserstein distance) leads to prohibitive computational complexity, severely limiting its application in large-scale datasets and deep learning frameworks. To break this bottleneck, this paper proposes a novel geometric evolution framework: Resistance Curvature Flow (RCF). Leveraging the concept of effective resistance from circuit physics, RCF transforms expensive curvature optimization into efficient matrix operations. This approach achieves over 100x computational acceleration while maintaining geometric optimization capabilities comparable to OCF. We provide an in-depth exploration of the theoretical foundations and dynamical principles of RCF, elucidating how it guides the redistribution of edge weights via curvature gradients to eliminate topological noise and strengthen local cluster structures. Furthermore, we provide a mechanistic explanation of RCF’s role in manifold enhancement and noise suppression, as well as its compatibility with deep learning models. We design a graph optimization algorithm, DGSL-RCF, based on this framework. Experimental results across deep metric learning, manifold learning, and graph structure learning demonstrate that DGSL-RCF significantly improves representation quality and downstream task performance.
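As a rough illustration of the quantity RCF builds on, the sketch below computes the effective resistance between two nodes of a small weighted graph by grounding one node and solving the reduced Laplacian system. It is not the paper's algorithm: the function name `effective_resistance`, the dense list-of-lists input, and the Gauss-Jordan solver are assumptions chosen for illustration, and the curvature definition and flow update of RCF are not reproduced here.

```python
def effective_resistance(weights, i, j):
    """Effective resistance between nodes i and j of a connected graph.

    `weights` is a symmetric n x n list of lists of non-negative edge
    conductances (hypothetical input format, not taken from the paper).
    """
    n = len(weights)
    if i == j:
        return 0.0
    # Graph Laplacian L = D - W.
    lap = [[-weights[r][c] for c in range(n)] for r in range(n)]
    for r in range(n):
        lap[r][r] = sum(weights[r][c] for c in range(n) if c != r)
    # Ground node j (drop its row and column), inject unit current at i.
    keep = [k for k in range(n) if k != j]
    a = [[lap[r][c] for c in keep] + [1.0 if r == i else 0.0] for r in keep]
    m = len(a)
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("graph must be connected")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                for c in range(col, m + 1):
                    a[r][c] -= factor * a[col][c]
    # The potential at node i (with j at potential 0) is the resistance.
    row_i = keep.index(i)
    return a[row_i][m] / a[row_i][row_i]


# Path graph 0 - 1 - 2 with unit conductances: two unit resistors in series.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(effective_resistance(path, 0, 2))
```

The dense solve here costs cubic time per query and is only meant to show what "effective resistance" means on a graph; the abstract's claim of efficient matrix operations refers to the authors' own formulation.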
[AI-64] Enriching Semantic Profiles into Knowledge Graph for Recommender Systems Using Large Language Models KDD2026
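[Quick read]: This paper addresses insufficient user preference modeling in recommender systems, i.e., how to construct and use rich, informative user preference profiles to improve recommendation quality. There is still no consensus on how best to build such profiles, which limits performance. The key to the solution is to combine the strengths of large language models (LLMs) and knowledge graphs (KGs): LLMs first extract compressed rationales from diverse knowledge sources to generate entity profiles; the profiles are then integrated into the KG and propagated to extend their reach; finally, pairwise profile preference matching aligns the LLM- and KG-based representations during training. With this design, the SPiKE model consistently outperforms state-of-the-art KG- and LLM-based recommenders in real-world settings.
Link: https://arxiv.org/abs/2601.08148
Authors: Seokho Ahn, Sungbok Shin, Young-Duk Seo
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at KDD 2026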
Abstract:Rich and informative profiling to capture user preferences is essential for improving recommendation quality. However, there is still no consensus on how best to construct and utilize such profiles. To address this, we revisit recent profiling-based approaches in recommender systems along four dimensions: 1) knowledge base, 2) preference indicator, 3) impact range, and 4) subject. We argue that large language models (LLMs) are effective at extracting compressed rationales from diverse knowledge sources, while knowledge graphs (KGs) are better suited for propagating these profiles to extend their reach. Building on this insight, we propose a new recommendation model, called SPiKE. SPiKE consists of three core components: i) Entity profile generation, which uses LLMs to generate semantic profiles for all KG entities; ii) Profile-aware KG aggregation, which integrates these profiles into the KG; and iii) Pairwise profile preference matching, which aligns LLM- and KG-based representations during training. In experiments, we demonstrate that SPiKE consistently outperforms state-of-the-art KG- and LLM-based recommenders in real-world settings.
[AI-65] Embedded AI Companion System on Edge Devices
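[Quick read]: This paper addresses the compute constraints that make it hard to deploy a fully embedded AI companion system with a satisfactory user experience on edge devices. AI companion and memory systems in the existing literature cannot be used directly in such an environment because of limited compute resources and latency concerns. The key to the solution is a memory paradigm that alternates between active and inactive phases: while the user is active, the system holds low-latency, real-time dialog using lightweight retrieval; while the user is inactive, it performs more computationally intensive memory extraction, consolidation, and maintenance. This minimizes latency while maintaining long-term personalization under tight embedded hardware constraints.
Link: https://arxiv.org/abs/2601.08128
Authors: Rahul Gupta, Stephen D.H. Hsu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 30 pages, 7 figures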
Abstract:Computational resource constraints on edge devices make it difficult to develop a fully embedded AI companion system with a satisfactory user experience. AI companion and memory systems detailed in existing literature cannot be directly used in such an environment due to lack of compute resources and latency concerns. In this paper, we propose a memory paradigm that alternates between active and inactive phases: during phases of user activity, the system performs low-latency, real-time dialog using lightweight retrieval over existing memories and context; whereas during phases of user inactivity, it conducts more computationally intensive extraction, consolidation, and maintenance of memories across full conversation sessions. This design minimizes latency while maintaining long-term personalization under the tight constraints of embedded hardware. We also introduce an AI Companion benchmark designed to holistically evaluate the AI Companion across both its conversational quality and memory capabilities. In our experiments, we found that our system (using a very weak model: Qwen2.5-7B-Instruct quantized int4) outperforms the equivalent raw LLM without memory across most metrics, and performs comparably to GPT-3.5 with 16k context window.
[AI-66] How vehicles change lanes after encountering crashes: Empirical analysis and modeling
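[Quick read]: This paper addresses trajectory prediction for post crash lane changes (post crash LCs), the maneuvers vehicles make to bypass a crash, focusing on the high-risk behavior that arises when vehicles in the target lane refuse to yield. Existing work lacks systematic modeling and prediction of this kind of lane change, which leads to inaccurate crash risk assessment and delayed responses from automated driving systems. The key to the solution is a trajectory prediction framework built on graph-based attention whose core innovation is to model yielding behavior as an auxiliary interaction-aware task: a graph attention module explicitly captures interaction intent between vehicles and guides a conditional variational autoencoder and a Transformer decoder to predict the lane changer's trajectory. The design outperforms baselines by more than 10% in both average displacement error and final displacement error, makes conflict detection more reliable, and transfers to post crash LC data collected at other sites.
Link: https://arxiv.org/abs/2601.08125
Authors: Kequan Chen, Yuxuan Wang, Pan Liu, Victor L. Knoop, David Z. W. Wang, Yu Han
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: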
Abstract:When a traffic crash occurs, following vehicles need to change lanes to bypass the obstruction. We define these maneuvers as post crash lane changes. In such scenarios, vehicles in the target lane may refuse to yield even after the lane change has already begun, increasing the complexity and crash risk of post crash LCs. However, the behavioral characteristics and motion patterns of post crash LCs remain unknown. To address this gap, we construct a post crash LC dataset by extracting vehicle trajectories from drone videos captured after crashes. Our empirical analysis reveals that, compared to mandatory LCs (MLCs) and discretionary LCs (DLCs), post crash LCs exhibit longer durations, lower insertion speeds, and higher crash risks. Notably, 79.4% of post crash LCs involve at least one instance of non yielding behavior from the new follower, compared to 21.7% for DLCs and 28.6% for MLCs. Building on these findings, we develop a novel trajectory prediction framework for post crash LCs. At its core is a graph based attention module that explicitly models yielding behavior as an auxiliary interaction aware task. This module is designed to guide both a conditional variational autoencoder and a Transformer based decoder to predict the lane changer’s trajectory. By incorporating the interaction aware module, our model outperforms existing baselines in trajectory prediction performance by more than 10% in both average displacement error and final displacement error across different prediction horizons. Moreover, our model provides more reliable crash risk analysis by reducing false crash rates and improving conflict prediction accuracy. Finally, we validate the model’s transferability using additional post crash LC datasets collected from different sites.
[AI-67] MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
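[Quick read]: This paper addresses the problem that large language models (LLMs) used as user proxy agents produce verbose, unrealistic utterances under naive "act-as-a-user" prompting. The core challenge is the lack of a reproducible, extensible evaluation framework that measures a user proxy's ability to generate human-like utterances independently of downstream task success. The key to the solution is MIRRORBENCH, a modular benchmarking framework with typed interfaces that supports pluggable user proxies, datasets, tasks, and metrics, and evaluates proxies with three lexical-diversity metrics (MATTR, YULE'S K, HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, Rubric-and-Reason). The framework explicitly decouples user simulation quality from downstream task performance, reports variance-aware results, and reveals systematic gaps between existing user proxies and real human users.
Link: https://arxiv.org/abs/2601.08118
Authors: Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: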
Abstract:Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive “act-as-a-user” prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. We present MIRRORBENCH, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MIRRORBENCH features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, YULE’S K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MIRRORBENCH yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at this https URL.
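To make one of the lexical-diversity metrics named above concrete, here is a minimal sketch of MATTR (moving-average type-token ratio) in its commonly used form: the type-token ratio is computed over every sliding window of a fixed length and the values are averaged. The function name `mattr`, the whitespace tokenization in the usage lines, and the default window of 50 tokens are assumptions for illustration; MIRRORBENCH's own implementation and settings may differ.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over sliding windows of `window` tokens."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Too short for a full window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[start:start + window])) / window
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical usage on a whitespace-tokenized user utterance.
utterance = "can you check my order status and tell me when my order ships"
print(mattr(utterance.lower().split(), window=5))
```

Averaging over fixed-size windows is what makes the score less sensitive to utterance length than a plain type-token ratio, which matters when comparing short human turns with longer generated ones.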
[AI-68] CSQL: Mapping Documents into Causal Databases
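[Quick read]: This paper addresses how to automatically build a database that supports causal reasoning from unstructured text documents, enabling causal analysis and querying over large document collections. Traditional approaches such as retrieval-augmented generation (RAG) or knowledge graphs rely mainly on associative retrieval and cannot answer "why" questions; the proposed CSQL system converts documents into a structured Causal Database (CDB) that supports analysis through causal interventions and structured causal queries. Its key contributions are twofold: first, it uses the earlier DEMOCRITUS system to extract local causal models from documents; second, it provides a compilation mechanism that turns unstructured text corpora (such as economics papers) into a CDB, in the reported case containing 265,656 causal claim instances, supporting corpus-level causal queries and longitudinal analyses.
Link: https://arxiv.org/abs/2601.08109
Authors: Sridhar Mahadevan
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 26 pages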
Abstract:We describe a novel system, CSQL, which automatically converts a collection of unstructured text documents into an SQL-queryable causal database (CDB). A CDB differs from a traditional DB: it is designed to answer "why" questions via causal interventions and structured causal queries. CSQL builds on our earlier system, DEMOCRITUS, which converts documents into thousands of local causal models derived from causal discourse. Unlike RAG-based systems or knowledge-graph based approaches, CSQL supports causal analysis over document collections rather than purely associative retrieval. For example, given an article on the origins of human bipedal walking, CSQL enables queries such as: "What are the strongest causal influences on bipedalism?" or "Which variables act as causal hubs with the largest downstream influence?" Beyond single-document case studies, we show that CSQL can also ingest RAG/IE-compiled causal corpora at scale by compiling the Testing Causal Claims (TCC) dataset of economics papers into a causal database containing 265,656 claim instances spanning 45,319 papers, 44 years, and 1,575 reported method strings, thereby enabling corpus-level causal queries and longitudinal analyses in CSQL. Viewed abstractly, CSQL functions as a compiler from unstructured documents into a causal database equipped with a principled algebra of queries, and can be applied broadly across many domains ranging from business, humanities, and science.
[AI-69] STO-RL: Offline RL under Sparse Rewards via LLM-Guided Subgoal Temporal Order AAMAS2026
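[Quick read]: This paper addresses the difficulty offline reinforcement learning (Offline RL) has in learning effective policies for long-horizon tasks with sparse rewards. Existing goal-conditioned and hierarchical offline RL methods mitigate this by decomposing tasks and generating intermediate rewards, but they often overlook temporal dependencies among subgoals and rely on imprecise reward shaping, which leads to suboptimal policies. The key to the solution is the STO-RL framework, which uses large language models (LLMs) to generate temporally ordered subgoal sequences and the corresponding state-to-subgoal-stage mappings, then applies potential-based reward shaping to turn sparse terminal rewards into dense, temporally consistent signals that promote subgoal progress while avoiding suboptimal solutions. The resulting augmented dataset supports efficient offline policy training; the method outperforms state-of-the-art baselines on four sparse-reward benchmarks and is robust to noise in LLM-generated subgoal sequences.
Link: https://arxiv.org/abs/2601.08107
Authors: Chengyang Gu, Yuxin Pan, Hui Xiong, Yize Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted at International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)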
Abstract:Offline reinforcement learning (RL) enables policy learning from pre-collected datasets, avoiding costly and risky online interactions, but it often struggles with long-horizon tasks involving sparse rewards. Existing goal-conditioned and hierarchical offline RL methods decompose such tasks and generate intermediate rewards to mitigate limitations of traditional offline RL, but usually overlook temporal dependencies among subgoals and rely on imprecise reward shaping, leading to suboptimal policies. To address these issues, we propose STO-RL (Offline RL using LLM-Guided Subgoal Temporal Order), an offline RL framework that leverages large language models (LLMs) to generate temporally ordered subgoal sequences and corresponding state-to-subgoal-stage mappings. Using this temporal structure, STO-RL applies potential-based reward shaping to transform sparse terminal rewards into dense, temporally consistent signals, promoting subgoal progress while avoiding suboptimal solutions. The resulting augmented dataset with shaped rewards enables efficient offline training of high-performing policies. Evaluations on four discrete and continuous sparse-reward benchmarks demonstrate that STO-RL consistently outperforms state-of-the-art offline goal-conditioned and hierarchical RL baselines, achieving faster convergence, higher success rates, and shorter trajectories. Ablation studies further confirm STO-RL’s robustness to imperfect or noisy LLM-generated subgoal sequences, demonstrating that LLM-guided subgoal temporal structures combined with theoretically grounded reward shaping provide a practical and scalable solution for long-horizon offline RL.
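The reward transformation described above can be sketched with the textbook form of potential-based shaping, r' = r + γΦ(s') − Φ(s), applied to an offline dataset. Everything specific here is an assumption for illustration: the potential is taken to be the subgoal stage index divided by the number of stages, `stage_of` stands in for the LLM-generated state-to-subgoal-stage mapping, and transitions are plain `(state, action, reward, next_state)` tuples. The paper's actual potential function and data format may differ.

```python
def shape_rewards(transitions, stage_of, num_stages, gamma=0.99):
    """Relabel an offline dataset with potential-based shaped rewards.

    transitions: list of (state, action, reward, next_state) tuples.
    stage_of:    hypothetical mapping from a state to its subgoal stage index.
    """
    def potential(state):
        # Assumed potential: fraction of subgoal stages already completed.
        return stage_of(state) / num_stages

    shaped = []
    for state, action, reward, next_state in transitions:
        bonus = gamma * potential(next_state) - potential(state)
        shaped.append((state, action, reward + bonus, next_state))
    return shaped


# Toy chain task: the state is the stage index and only the last step is rewarded.
dataset = [(0, "go", 0.0, 1), (1, "go", 1.0, 2)]
for transition in shape_rewards(dataset, stage_of=lambda s: s, num_stages=2, gamma=1.0):
    print(transition)
```

With this form, a step that advances to a later stage receives a positive bonus and a step that falls back receives a negative one, which is how a sparse terminal reward becomes a dense signal ordered by subgoal progress.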
[AI-70] Local-Global Feature Fusion for Subject-Independent EEG Emotion Recognition
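[Quick read]: This paper addresses subject-independent EEG emotion recognition, where pronounced inter-subject variability and short, noisy recordings make it hard to learn robust representations. The key to the solution is a dual-branch attention fusion framework that integrates two complementary representations: local, channel-wise features (differential entropy concatenated with graph-theoretic features) and global, trial-level features (time-domain, spectral, and complexity characteristics). Fusion uses an attention mechanism together with domain-adversarial regularization to improve cross-subject generalization, and samples are filtered by an intensity threshold. On the SEED-VII dataset the method reaches a mean accuracy of about 40% on 7-class emotion recognition.
Link: https://arxiv.org/abs/2601.08094
Authors: Zheng Zhou, Isabella McEvoy, Camilo E. Valderrama
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures, EMBC 2026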
Abstract:Subject-independent EEG emotion recognition is challenged by pronounced inter-subject variability and the difficulty of learning robust representations from short, noisy recordings. To address this, we propose a fusion framework that integrates (i) local, channel-wise descriptors and (ii) global, trial-level descriptors, improving cross-subject generalization on the SEED-VII dataset. Local representations are formed per channel by concatenating differential entropy with graph-theoretic features, while global representations summarize time-domain, spectral, and complexity characteristics at the trial level. These representations are fused in a dual-branch transformer with attention-based fusion and domain-adversarial regularization, with samples filtered by an intensity threshold. Experiments under a leave-one-subject-out protocol demonstrate that the proposed method consistently outperforms single-view and classical baselines, achieving approximately 40% mean accuracy in 7-class subject-independent emotion recognition. The code has been released at this https URL.
[AI-71] Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment
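[Quick read]: This paper addresses the erosion of safety alignment in large language models (LLMs) during task-specific fine-tuning, which introduces safety risks. Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction, so safety recovery stays tightly coupled with training and incurs high computational overhead and a complex workflow. The key to the solution is Q-realign, a post-hoc defense based on post-training quantization: guided by an analysis of representational structure, it reframes quantization as a dual-objective procedure for compression and safety, decoupling safety alignment from fine-tuning and fitting naturally into modern deployment pipelines. Experiments show that it substantially reduces unsafe behaviors while preserving task performance, with significant reductions in memory usage and GPU hours; it can recover the safety alignment of a fine-tuned 7B model on a single RTX 4090 within 40 minutes.
Link: https://arxiv.org/abs/2601.08089
Authors: Qitao Tan, Xiaoying Song, Ningxi Cheng, Ninghao Liu, Xiaoming Zhai, Lingzi Hong, Yanzhi Wang, Zhen Xiang, Geng Yuan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: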
Abstract:Public large language models (LLMs) are typically safety-aligned during pretraining, yet task-specific fine-tuning required for deployment often erodes this alignment and introduces safety risks. Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction, leaving safety recovery tightly coupled with training and incurring high computational overhead and a complex workflow. To address these challenges, we propose Q-realign, a post-hoc defense method based on post-training quantization, guided by an analysis of representational structure. By reframing quantization as a dual-objective procedure for compression and safety, Q-realign decouples safety alignment from fine-tuning and naturally piggybacks into modern deployment pipelines. Experiments across multiple models and datasets demonstrate that our method substantially reduces unsafe behaviors while preserving task performance, with significant reductions in memory usage and GPU hours. Notably, our approach can recover the safety alignment of a fine-tuned 7B LLM on a single RTX 4090 within 40 minutes. Overall, our work provides a practical, turnkey solution for safety-aware deployment.
[AI-72] A New Strategy for Verifying Reach-Avoid Specifications in Neural Feedback Systems AAAI-2026
[Quick read]: This paper addresses the verification of reach-avoid properties in neural feedback systems (dynamical systems controlled by neural networks). Existing methods rely mainly on forward reachability analysis, because existing backward reachability methods scale poorly. The key to the solution is a set of new algorithms that compute both over- and under-approximations of backward reachable sets, combined with established forward analysis techniques to form a unified verification framework for neural feedback systems.
Link: https://arxiv.org/abs/2601.08065
Authors: Samuel I. Akinwande, Sydney M. Katz, Mykel J. Kochenderfer, Clark Barrett
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI-2026 Bridge Program B10: Making Embodied AI Reliable with Testing and Formal Verification
Abstract:Forward reachability analysis is the predominant approach for verifying reach-avoid properties in neural feedback systems (dynamical systems controlled by neural networks). This dominance stems from the limited scalability of existing backward reachability methods. In this work, we introduce new algorithms that compute both over- and under-approximations of backward reachable sets for such systems. We further integrate these backward algorithms with established forward analysis techniques to yield a unified verification framework for neural feedback systems.
[AI-73] Forecast Aware Deep Reinforcement Learning for Efficient Electricity Load Scheduling in Dairy Farms
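[Quick read]: This paper addresses the imbalance between energy supply and demand in dairy farming caused by the intermittency of renewable energy, with the aim of reducing grid dependence and supporting sustainable energy management. The core challenges are that most reinforcement learning (RL) based scheduling methods assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments, and that standard PPO variants use fixed clipping or KL divergence thresholds, which often makes training unstable under variable tariffs. The key to the solution is two improved deep RL variants: Forecast Aware PPO incorporates short-term forecasts of demand and renewable generation using hour-of-day and month-based residual calibration, and PID KL PPO uses a proportional-integral-derivative (PID) controller to adaptively regulate KL divergence for stable policy updates. On real-world dairy farm data, the method achieves up to 1% lower electricity cost than PPO, 4.8% lower than DQN, and 1.5% lower than SAC, and for battery scheduling PPO reduces grid imports by 13.1%.
Link: https://arxiv.org/abs/2601.08052
Authors: Nawazish Ali, Rachael Shaw, Karl Mason
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: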
Abstract:Dairy farming is an energy intensive sector that relies heavily on grid electricity. With increasing renewable energy integration, sustainable energy management has become essential for reducing grid dependence and supporting the United Nations Sustainable Development Goal 7 on affordable and clean energy. However, the intermittent nature of renewables poses challenges in balancing supply and demand in real time. Intelligent load scheduling is therefore crucial to minimize operational costs while maintaining reliability. Reinforcement Learning has shown promise in improving energy efficiency and reducing costs. However, most RL-based scheduling methods assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments. Moreover, standard PPO variants rely on fixed clipping or KL divergence thresholds, often leading to unstable training under variable tariffs. To address these challenges, this study proposes a Deep Reinforcement Learning framework for efficient load scheduling in dairy farms, focusing on battery storage and water heating under realistic operational constraints. The proposed Forecast Aware PPO incorporates short term forecasts of demand and renewable generation using hour of day and month based residual calibration, while the PID KL PPO variant employs a proportional integral derivative controller to regulate KL divergence for stable policy updates adaptively. Trained on real world dairy farm data, the method achieves up to 1% lower electricity cost than PPO, 4.8% than DQN, and 1.5% than SAC. For battery scheduling, PPO reduces grid imports by 13.1%, demonstrating scalability and effectiveness for sustainable energy management in modern dairy farming.
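One way to picture "a PID controller that regulates KL divergence" is a controller that adjusts the KL penalty coefficient after each policy update, based on how far the measured KL is from a target. The sketch below is only that picture. The class name `PIDKLController`, the gains, the target value, and the choice to control a penalty coefficient (rather than, say, a clipping range) are all assumptions; the abstract does not specify the authors' control law.

```python
class PIDKLController:
    """Adjust a KL penalty coefficient with a PID rule (illustrative only)."""

    def __init__(self, target_kl, kp=1.0, ki=0.1, kd=0.0, beta=1.0):
        self.target_kl = target_kl
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta          # current KL penalty coefficient
        self.integral = 0.0       # accumulated error (I term)
        self.prev_error = 0.0     # last error (for the D term)

    def update(self, observed_kl):
        """Return the new coefficient given the KL measured after an update."""
        error = observed_kl - self.target_kl
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        adjustment = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Keep the coefficient non-negative so the penalty never becomes a reward.
        self.beta = max(0.0, self.beta + adjustment)
        return self.beta


# Hypothetical usage: a KL above target raises the penalty, a KL below lowers it.
controller = PIDKLController(target_kl=0.01, kp=10.0, ki=0.0)
print(controller.update(0.05) > 1.0)
print(controller.update(0.0) < controller.update(0.05))
```

In a PPO-style loop, the returned coefficient would multiply the KL term in the loss for the next update, so large policy jumps are damped and overly cautious updates are relaxed.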
[AI-74] Integrating Attendance Tracking and Emotion Detection for Enhanced Student Engagement in Smart Classrooms
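[Quick read]: This paper addresses the fact that smart classroom technologies in higher education have focused on automating attendance while neglecting students' emotional and cognitive engagement, which limits instructors' ability to spot disengagement and adapt their teaching in real time. The key to the solution is SCASED, an IoT-based system that integrates automated attendance tracking with facial emotion recognition: it uses a Raspberry Pi camera and OpenCV for face detection and a fine-tuned MobileNetV2 model to classify four learning-related emotional states (engagement, boredom, confusion, and frustration). A session-based mechanism records attendance once per session and runs continuous emotion analysis afterwards, and the data are visualized on a cloud-based dashboard to give instructors insight into classroom dynamics.
Link: https://arxiv.org/abs/2601.08049
Authors: Keith Ainebyona, Ann Move Oguti, Joseph Walusimbi, Ritah Kobusingye
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures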
Abstract:The increasing adoption of smart classroom technologies in higher education has mainly focused on automating attendance, with limited attention given to students’ emotional and cognitive engagement during lectures. This limits instructors’ ability to identify disengagement and adapt teaching strategies in real time. This paper presents SCASED (Smart Classroom Attendance System with Emotion Detection), an IoT-based system that integrates automated attendance tracking with facial emotion recognition to support classroom engagement monitoring. The system uses a Raspberry Pi camera and OpenCV for face detection, and a finetuned MobileNetV2 model to classify four learning-related emotional states: engagement, boredom, confusion, and frustration. A session-based mechanism is implemented to manage attendance and emotion monitoring by recording attendance once per session and performing continuous emotion analysis thereafter. Attendance and emotion data are visualized through a cloud-based dashboard to provide instructors with insights into classroom dynamics. Experimental evaluation using the DAiSEE dataset achieved an emotion classification accuracy of 89.5%. The results show that integrating attendance data with emotion analytics can provide instructors with additional insight into classroom dynamics and support more responsive teaching practices.
[AI-75] Internal Deployment Gaps in AI Regulation
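[Quick read]: This paper addresses a blind spot in frontier AI regulation concerning internal deployment: the 2025 United States and European Union policies focus mainly on systems deployed to external users and do not effectively cover high-stakes systems deployed inside organizations. It identifies three regulatory gaps: (1) scope ambiguity that lets internal systems evade obligations; (2) point-in-time compliance assessments that cannot track the continuous evolution of internal systems; and (3) information asymmetries that weaken regulators' oversight. The paper analyzes why these gaps persist, examining tensions around measurability, incentives, and information access, and maps potential approaches to address them along with their tradeoffs, so that policy choices about internally deployed AI can be made deliberately rather than incidentally.
Link: https://arxiv.org/abs/2601.08005
Authors: Joe Kwon, Stephen Casper
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: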
Abstract:Frontier AI regulations primarily focus on systems deployed to external users, where deployment is more visible and subject to outside scrutiny. However, high-stakes applications can occur internally when companies deploy highly capable systems within their own organizations, such as for automating R&D, accelerating critical business processes, and handling sensitive proprietary data. This paper examines how frontier AI regulations in the United States and European Union in 2025 handle internal deployment. We identify three gaps that could cause internally-deployed systems to evade intended oversight: (1) scope ambiguity that allows internal systems to evade regulatory obligations, (2) point-in-time compliance assessments that fail to capture the continuous evolution of internal systems, and (3) information asymmetries that subvert regulatory awareness and oversight. We then analyze why these gaps persist, examining tensions around measurability, incentives, and information access. Finally, we map potential approaches to address them and their associated tradeoffs. By understanding these patterns, we hope that policy choices around internally deployed AI systems can be made deliberately rather than incidentally.
[AI-76] Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety
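[Quick read]: This paper addresses how to keep large language models (LLMs) compliant with safety principles without over-refusing benign requests. The deliberative alignment (DA) approach introduced by OpenAI reasons over detailed "code-like" safety rules, but its effectiveness in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied; the paper finds that referencing explicit codes improves harmlessness inconsistently and degrades helpfulness. The key to the solution is Case-Augmented Deliberative Alignment (CADA), which applies reinforcement learning to self-generated safety reasoning chains guided by illustrative cases, favoring adaptability and generalization over rigid adherence to narrowly enumerated rules. The method enhances harmlessness and robustness against attacks and reduces over-refusal while preserving utility.
Link: https://arxiv.org/abs/2601.08000
Authors: Can Jin, Rui Wu, Tong Che, Qixin Zhang, Hongwu Peng, Jiahui Zhao, Zhenting Wang, Wenqi Wei, Ligong Han, Zhao Zhang, Yuan Cao, Ruixiang Tang, Dimitris N. Metaxas
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: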
Abstract:Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduces deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed "code-like" safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.
[AI-77] When Models Know When They Do Not Know: Calibration Cascading and Cleaning
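[Quick read]: This paper addresses models' lack of self-awareness when facing uncertain or unknown inputs, i.e., how to let a model recognize when it "does not know" and thereby improve efficiency, reliability, and trustworthiness. The key to the solution is to use calibrated confidence as a reliable signal and to perform model calibration, cascading, and data cleaning with a training-free method. The paper first shows empirically that higher confidence corresponds to higher accuracy within a single model and that models calibrated on the validation set remain calibrated on a held-out test set, which establishes the reliability and comparability of calibrated confidence. It then builds calibrated advantage routing for model cascading, so that large and small models cooperate efficiently with almost no loss in accuracy and two models of comparable scale together exceed either model alone. Finally, it uses an ensemble of models and their calibrated confidences for a data-cleaning method that identifies mislabeled samples in ImageNet and MMLU while balancing precision and detection rate.
Link: https://arxiv.org/abs/2601.07965
Authors: Chenjie Hao, Weyl Lu, Yuko Ishiwaka, Zengyi Li, Weier Wan, Yubei Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: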
Abstract:When a model knows when it does not know, many possibilities emerge. The first question is how to enable a model to recognize that it does not know. A promising approach is to use confidence, computed from the model’s internal signals, to reflect its ignorance. Prior work in specific domains has shown that calibration can provide reliable confidence estimates. In this work, we propose a simple, effective, and universal training-free method that applies to both vision and language models, performing model calibration, cascading, and data cleaning to better exploit a model’s ability to recognize when it does not know. We first highlight two key empirical observations: higher confidence corresponds to higher accuracy within a single model, and models calibrated on the validation set remain calibrated on a held-out test set. These findings empirically establish the reliability and comparability of calibrated confidence. Building on this, we introduce two applications: (1) model cascading with calibrated advantage routing and (2) data cleaning based on model ensemble. Using the routing signal derived from the comparability of calibrated confidences, we cascade large and small models to improve efficiency with almost no compromise in accuracy, and we further cascade two models of comparable scale to achieve performance beyond either model alone. Leveraging multiple experts and their calibrated confidences, we design a simple yet effective data-cleaning method that balances precision and detection rate to identify mislabeled samples in ImageNet and Massive Multitask Language Understanding (MMLU) datasets. Our results demonstrate that enabling models to recognize when they do not know is a practical step toward more efficient, reliable, and trustworthy AI.
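The general idea of confidence-based cascading can be sketched as follows: a small model answers when its calibrated confidence is high enough, and otherwise the input is passed to a larger model. This is a simplified stand-in, not the paper's "calibrated advantage routing": temperature scaling is swapped in as a common calibration technique, a fixed threshold replaces the authors' routing signal, and the names `softmax`, `cascade_predict`, `small_model`, and `large_model` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(x, small_model, large_model, temperature, threshold):
    """Return (predicted class, which model answered).

    small_model / large_model: callables mapping an input to a list of logits.
    temperature: assumed to have been fitted on a validation set beforehand.
    """
    probs = softmax(small_model(x), temperature)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "small"
    # The small model is unsure: defer to the large model.
    large_probs = softmax(large_model(x))
    return large_probs.index(max(large_probs)), "large"


# Hypothetical toy models that return fixed logits.
confident_small = lambda x: [5.0, 0.0]
unsure_small = lambda x: [0.1, 0.0]
large = lambda x: [0.0, 3.0]
print(cascade_predict("q1", confident_small, large, temperature=1.0, threshold=0.9))
print(cascade_predict("q2", unsure_small, large, temperature=1.0, threshold=0.9))
```

The threshold trades cost for accuracy: raising it sends more inputs to the large model. Comparing confidences across models, as the abstract describes, only makes sense once both models are calibrated.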
[AI-78] Executable Ontologies in Game Development: From Algorithmic Control to Semantic World Modeling
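[Quick read]: This paper addresses the semantic-process gap in game AI architecture: traditional approaches such as Behavior Trees (BT) and Goal-Oriented Action Planning (GOAP) model what agents should do but do not naturally express when actions become possible. This leaves behavior logic and world state without a shared semantic model and limits flexible handling of task priorities in complex situations. The key to the solution is Executable Ontologies (EO), implemented through the boldsea framework, which achieve priority-based task interruption through dataflow conditions, so that agent behavior emerges from a declarative semantic model of domain rules rather than from explicitly coded logic. This shifts the focus of AI design from algorithmic control to semantic world modeling; the paper also discusses integration strategies and the debugging advantages of temporal event graphs.
Link: https://arxiv.org/abs/2601.07964
Authors: Alexander Boldachev
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 6 figures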
Abstract:This paper examines the application of Executable Ontologies (EO), implemented through the boldsea framework, to game development. We argue that EO represents a paradigm shift: a transition from algorithmic behavior programming to semantic world modeling, where agent behavior emerges naturally from declarative domain rules rather than being explicitly coded. Using a survival game scenario (Winter Feast), we demonstrate how EO achieves priority-based task interruption through dataflow conditions rather than explicit preemption logic. Comparison with Behavior Trees (BT) and Goal-Oriented Action Planning (GOAP) reveals that while these approaches model what agents should do, EO models when actions become possible - a fundamental difference that addresses the semantic-process gap in game AI architecture. We discuss integration strategies, debugging advantages inherent to temporal event graphs, and the potential for LLM-driven runtime model generation.
[AI-79] LJ-Spoof: A Generatively Varied Corpus for Audio Anti-Spoofing and Synthesis Source Tracing
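[Quick read]: This paper addresses two core challenges in audio anti-spoofing: speaker-specific anti-spoofing and synthesis-source tracing. Progress on both has long been limited by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. The key to the solution is the LJ-Spoof dataset, which systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post-processing to build a variation-dense corpus covering a single speaker, 30 text-to-speech (TTS) families, 500 generatively variant subsets, and more than 3 million utterances, thereby supporting robust speaker-conditioned anti-spoofing and fine-grained synthesis-source tracing.
Link: https://arxiv.org/abs/2601.07958
Authors: Surya Subramani, Hashim Ali, Hafiz Malik
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: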
Abstract:Speaker-specific anti-spoofing and synthesis-source tracing are central challenges in audio anti-spoofing. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we introduce LJ-Spoof, a speaker-specific, generatively diverse corpus that systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post-processing. The corpus spans one speaker (including studio-quality recordings), 30 TTS families, 500 generatively variant subsets, 10 bona fide neural-processing variants, and more than 3 million utterances. This variation-dense design enables robust speaker-conditioned anti-spoofing and fine-grained synthesis-source tracing. We further position this dataset as both a practical reference training resource and a benchmark evaluation suite for anti-spoofing and source tracing.
[AI-80] Hybrid SARIMA LSTM Model for Local Weather Forecasting: A Residual Learning Approach for Data Driven Meteorological Prediction
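[Quick read]: This paper addresses a core challenge in long-term forecasting of atmospheric variables (temperature in particular): capturing both deterministic seasonal trends and nonlinear short-term fluctuations in a chaotic atmospheric system. The traditional SARIMA model captures linear seasonal components well, but its stationarity assumption makes it poor at handling abrupt nonlinear changes, which biases its forecasts; LSTMs learn complex nonlinear dependencies but tend to diverge in open-loop forecasting as errors accumulate. The key to the solution is a hybrid SARIMA-LSTM architecture that uses a residual-learning strategy to decompose the temperature series into a predictable climate component (modeled by SARIMA) and a residual nonlinear weather component (learned exclusively by the LSTM), combining the stability of statistical models with the adaptability of deep learning to reduce error propagation and improve long-horizon accuracy.
Link: https://arxiv.org/abs/2601.07951
Authors: Shreyas Rajeev, Karthik Mudenahalli Ashoka, Amit Mallappa Tiparaddi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: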
Abstract:Accurately forecasting long-term atmospheric variables remains a defining challenge in meteorological science due to the chaotic nature of atmospheric systems. Temperature data represents a complex superposition of deterministic cyclical climate forces and stochastic, short-term fluctuations. While planetary mechanics drive predictable seasonal periodicities, rapid meteorological changes such as thermal variations, pressure anomalies, and humidity shifts introduce nonlinear volatilities that defy simple extrapolation. Historically, the Seasonal Autoregressive Integrated Moving Average (SARIMA) model has been the standard for modeling historical weather data, prized for capturing linear seasonal trends. However, SARIMA operates under strict assumptions of stationarity, failing to capture abrupt, nonlinear transitions. This leads to systematic residual errors, manifesting as the under-prediction of sudden spikes or the over-smoothing of declines. Conversely, Deep Learning paradigms, specifically Long Short-Term Memory (LSTM) networks, demonstrate exceptional efficacy in handling intricate time-series data. By utilizing memory gates, LSTMs learn complex nonlinear dependencies. Yet, LSTMs face instability in open-loop forecasting; without ground truth feedback, minor deviations compound recursively, causing divergence. To resolve these limitations, we propose a Hybrid SARIMA-LSTM architecture. This framework employs a residual-learning strategy to decompose temperature into a predictable climate component and a nonlinear weather component. The SARIMA unit models the robust, long-term seasonal trend, while the LSTM is trained exclusively on the residuals, the nonlinear errors SARIMA fails to capture. By fusing statistical stability with neural plasticity, this hybrid approach minimizes error propagation and enhances long-horizon accuracy.
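The residual-learning decomposition can be sketched in a few lines of dependency-free Python. To keep the example self-contained, two deliberately simple stand-ins are swapped in: per-phase seasonal means replace SARIMA, and a least-squares AR(1) fit replaces the LSTM. All function names are hypothetical, and the sketch shows only the structure (fit a seasonal model, fit a second model on its residuals, add the two forecasts), not the paper's models.

```python
def seasonal_means(series, period):
    """Stand-in for SARIMA: the average value at each phase of the seasonal cycle."""
    sums = [0.0] * period
    counts = [0] * period
    for t, y in enumerate(series):
        sums[t % period] += y
        counts[t % period] += 1
    # Assumes len(series) >= period so that every phase has at least one sample.
    return [s / c for s, c in zip(sums, counts)]


def fit_ar1(residuals):
    """Stand-in for the LSTM: least-squares AR(1) coefficient on the residuals."""
    num = sum(a * b for a, b in zip(residuals[:-1], residuals[1:]))
    den = sum(a * a for a in residuals[:-1])
    return num / den if den else 0.0


def hybrid_forecast(series, period, horizon):
    """Forecast = seasonal component + predicted residual, for `horizon` steps."""
    base = seasonal_means(series, period)
    # The second model only ever sees what the seasonal model failed to explain.
    residuals = [y - base[t % period] for t, y in enumerate(series)]
    phi = fit_ar1(residuals)
    forecasts, r, n = [], residuals[-1], len(series)
    for h in range(horizon):
        r = phi * r  # roll the residual model forward in open loop
        forecasts.append(base[(n + h) % period] + r)
    return forecasts
```

Because the seasonal component is added back at every step, an error in the residual model perturbs the forecast around the seasonal baseline rather than compounding on the raw series.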
[AI-81] Reinforcement Learning Methods for Neighborhood Selection in Local Search
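[Quick read]: This paper addresses the still-unclear effectiveness of reinforcement learning (RL) within local search metaheuristics, in particular its effectiveness and stability on combinatorial optimization problems. It evaluates several RL-based neighborhood selection strategies, including multi-armed bandit methods (upper confidence bound and ε-greedy) and deep RL methods (proximal policy optimization and double deep Q-network), against multiple baselines, and finds that search-specific characteristics, especially large cost variations caused by constraint-violation penalties, call for carefully designed reward functions that provide stable and informative learning signals. The key points are that reward design directly affects the learning efficiency and final performance of the RL policy, that ε-greedy consistently ranks among the best performers, and that deep RL, although promising, carries substantial computational overhead and becomes competitive only when longer runtimes are allowed.
Link: https://arxiv.org/abs/2601.07948
Authors: Yannick Molinghen, Augustin Delecluse, Renaud De Landtsheer, Stefano Michelini
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICORES 2026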
Abstract:Reinforcement learning has recently gained traction as a means to improve combinatorial optimization methods, yet its effectiveness within local search metaheuristics specifically remains comparatively underexamined. In this study, we evaluate a range of reinforcement learning-based neighborhood selection strategies, namely multi-armed bandits (upper confidence bound, ε-greedy) and deep reinforcement learning methods (proximal policy optimization, double deep Q-network), and compare them against multiple baselines across three different problems: the traveling salesman problem, the pickup and delivery problem with time windows, and the car sequencing problem. We show how search-specific characteristics, particularly large variations in cost due to constraint violation penalties, necessitate carefully designed reward functions to provide stable and informative learning signals. Our extensive experiments reveal that algorithm performance varies substantially across problems, although ε-greedy consistently ranks among the best performers. In contrast, the computational overhead of deep reinforcement learning approaches means they become competitive only with a substantially longer runtime. These findings highlight both the promise and the practical limitations of deep reinforcement learning in local search.
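An ε-greedy neighborhood selector is small enough to sketch directly. The class below treats each neighborhood as a bandit arm and keeps a running mean reward per arm. The `clipped_improvement` reward is an assumption added for illustration: it shows one simple way to keep large penalty-driven cost swings from dominating the learning signal, and it is not the reward function used in the paper.

```python
import random


class EpsilonGreedySelector:
    """Pick a neighborhood (arm) at random with probability epsilon, else the best so far."""

    def __init__(self, n_neighborhoods, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_neighborhoods
        self.values = [0.0] * n_neighborhoods  # running mean reward per neighborhood
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))  # explore
        # Exploit: index of the highest mean reward (first index wins ties).
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean update, so no reward history needs to be stored.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def clipped_improvement(old_cost, new_cost, cap):
    """Hypothetical reward: cost improvement clipped to [0, cap], scaled to [0, 1]."""
    return max(0.0, min(old_cost - new_cost, cap)) / cap
```

In a local search loop, one would call `select()` to choose a neighborhood, apply a move from it, and feed `clipped_improvement(old_cost, new_cost, cap)` back through `update()`.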
[AI-82] Coupled Diffusion-Encoder Models for Reconstruction of Flow Fields
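[Quick read]: This paper addresses the difficulty that traditional data-driven flow-field reconstruction methods, such as variational autoencoders (VAEs), have in preserving the higher-order statistical properties of a flow field under strong compression. The core challenge is that although VAEs can achieve low pointwise reconstruction error, the flow fields they produce are significantly distorted in their spectral properties and distributional structure and do not faithfully reflect the statistics of the true flow. The key to the solution is the DiffCoder framework, which couples a probabilistic diffusion model with a convolutional residual network (ResNet) encoder and trains them end to end: the encoder compresses the high-dimensional flow field into a low-dimensional latent representation, and the diffusion model acts as a generative prior that learns the distribution of reconstructions conditioned on the compressed state. This design lets DiffCoder recover key spectral and distributional structure even under extreme compression, yielding a more statistically consistent representation, with the largest advantage when the information bottleneck is severe.
Link: https://arxiv.org/abs/2601.07946
Authors: AmirPouya Hemmasian, Amir Barati Farimani
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: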
Abstract:Data-driven flow-field reconstruction typically relies on autoencoder architectures that compress high-dimensional states into low-dimensional latent representations. However, classical approaches such as variational autoencoders (VAEs) often struggle to preserve the higher-order statistical structure of fluid flows when subjected to strong compression. We propose DiffCoder, a coupled framework that integrates a probabilistic diffusion model with a conventional convolutional ResNet encoder and trains both components end-to-end. The encoder compresses the flow field into a latent representation, while the diffusion model learns a generative prior over reconstructions conditioned on the compressed state. This design allows DiffCoder to recover distributional and spectral properties that are not strictly required for minimizing pointwise reconstruction loss but are critical for faithfully representing statistical properties of the flow field. We evaluate DiffCoder and VAE baselines across multiple model sizes and compression ratios on a challenging dataset of Kolmogorov flow fields. Under aggressive compression, DiffCoder significantly improves the spectral accuracy while VAEs exhibit substantial degradation. Although both methods show comparable relative L2 reconstruction error, DiffCoder better preserves the underlying distributional structure of the flow. At moderate compression levels, sufficiently large VAEs remain competitive, suggesting that diffusion-based priors provide the greatest benefit when information bottlenecks are severe. These results demonstrate that the generative decoding by diffusion offers a promising path toward compact, statistically consistent representations of complex flow fields.
[AI-83] SECite: Analyzing and Summarizing Citations in Software Engineering Literature
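[Quick read]: This paper addresses the problem that traditional literature reviews assess a paper's strengths and limitations mainly through the authors' own account, which gives a one-sided view without external evidence. The core solution is SECite, which mines scholarly feedback from citation contexts through sentiment analysis and combines natural language processing (NLP) with unsupervised machine learning to classify citation statements as positive or negative. It then uses generative AI to produce positive and negative summaries for each target paper from the clustered citation groups and the full text, building an evaluation framework that integrates external citation sentiment with large language model (LLM) summarization for a more comprehensive and objective assessment of scholarly contributions.
Link: https://arxiv.org/abs/2601.07939
Authors: Shireesh Reddy Pyreddy, Khaja Valli Pathan, Hasan Masum, Tarannum Shaila Zaman
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE CCWC 2026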
Abstract:Identifying the strengths and limitations of a research paper is a core component of any literature review. However, traditional summaries reflect only the authors’ self-presented perspective. Analyzing how other researchers discuss and cite the paper can offer a deeper, more practical understanding of its contributions and shortcomings. In this research, we introduce SECite, a novel approach for evaluating scholarly impact through sentiment analysis of citation contexts. We develop a semi-automated pipeline to extract citations referencing nine research papers and apply advanced natural language processing (NLP) techniques with unsupervised machine learning to classify these citation statements as positive or negative. Beyond sentiment classification, we use generative AI to produce sentiment-specific summaries that capture the strengths and limitations of each target paper, derived both from clustered citation groups and from the full text. Our findings reveal meaningful patterns in how the academic community perceives these works, highlighting areas of alignment and divergence between external citation feedback and the authors’ own presentation. By integrating citation sentiment analysis with LLM-based summarization, this study provides a comprehensive framework for assessing scholarly contributions.
[AI-84] Enhancing Large Language Models for Time-Series Forecasting via Vector-Injected In-Context Learning
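[Quick read]: This paper addresses a dual challenge that large language models (LLMs) face in time series forecasting (TSF): the distribution gap between pretraining corpora and time series data makes forecasting performance unstable when LLMs are applied directly, and fine-tuning mitigates this only at a large computational cost. To improve forecasting performance and reduce compute without updating any LLM parameters, the authors propose LVICL, whose core innovation is vector-injected in-context learning. A learnable context vector adapter adaptively extracts a compressed, task-relevant context vector from multiple examples, and this vector is injected into every layer of the LLM during the forward pass. This elicits the model's in-context learning ability and improves TSF performance while avoiding the longer prompts and irrelevant-information interference that come with adding examples to the prompt in conventional in-context learning.
Link: https://arxiv.org/abs/2601.07903
Authors: Jianqi Zhang, Jingyao Wang, Wenwen Qiang, Fanjiang Xu, Changwen Zheng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: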
Abstract:The World Wide Web needs reliable predictive capabilities to respond to changes in user behavior and usage patterns. Time series forecasting (TSF) is a key means to achieve this goal. In recent years, the large language models (LLMs) for TSF (LLM4TSF) have achieved good performance. However, there is a significant difference between pretraining corpora and time series data, making it hard to guarantee forecasting quality when directly applying LLMs to TSF; fine-tuning LLMs can mitigate this issue, but often incurs substantial computational overhead. Thus, LLM4TSF faces a dual challenge of prediction performance and compute overhead. To address this, we aim to explore a method for improving the forecasting performance of LLM4TSF while freezing all LLM parameters to reduce computational overhead. Inspired by in-context learning (ICL), we propose LVICL. LVICL uses our vector-injected ICL to inject example information into a frozen LLM, eliciting its in-context learning ability and thereby enhancing its performance on the example-related task (i.e., TSF). Specifically, we first use the LLM together with a learnable context vector adapter to extract a context vector from multiple examples adaptively. This vector contains compressed, example-related information. Subsequently, during the forward pass, we inject this vector into every layer of the LLM to improve forecasting performance. Compared with conventional ICL that adds examples into the prompt, our vector-injected ICL does not increase prompt length; moreover, adaptively deriving a context vector from examples suppresses components harmful to forecasting, thereby improving model performance. Extensive experiments demonstrate the effectiveness of our approach.
[AI-85] Large Language Models and Algorithm Execution: Application to an Arithmetic Function
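[Quick read]: This paper addresses a limitation of large language models (LLMs) in executing algorithms: the models struggle to carry out complex algorithmic reasoning and to generalize on their own. The key to the solution is a specialized supervised training method called LLM-DAL (Large Language Model - Decompositional Algorithmic Learning), which guides the model to decompose its reasoning during learning and significantly improves its ability to execute complex algorithms and generalize.
Link: https://arxiv.org/abs/2601.07898
Authors: Farah Ben Slama (SyCoSMA, LIRIS), Frédéric Armetta (SyCoSMA, LIRIS)
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: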
Abstract:Large Language Models (LLMs) have recently developed new advanced functionalities. Their effectiveness relies on statistical learning and generalization capabilities. However, they face limitations in internalizing the data they process and struggle, for instance, to autonomously execute algorithms. In this paper, we investigate the possibility of extending these models’ capabilities to algorithm execution through specialized supervised training focused on reasoning decomposition. We introduce a training model called LLM-DAL (Large Language Model - Decompositional Algorithmic Learning), through which we demonstrate that LLMs’ ability to perform complex algorithmic inferences and generalize can be significantly improved when the training method is properly designed to guide the model in its learning process.
[AI-86] Revealing the Attention Floating Mechanism in Masked Diffusion Models
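[Quick read]: This paper addresses the lack of a systematic understanding of the internal attention mechanisms of masked diffusion models (MDMs), especially how their attention behavior differs from that of autoregressive models (ARMs). The key contribution is the discovery of Attention Floating: attention anchors in MDMs are not fixed but shift dynamically and dispersedly across denoising steps and network layers. Further analysis shows a Shallow Structure-Aware, Deep Content-Focused pattern, in which shallow layers use floating tokens to build a global structural framework and deeper layers concentrate on capturing semantic content. This offers a mechanistic explanation for the strong in-context learning ability of MDMs, and experiments show that their performance on knowledge-intensive tasks can reach twice that of ARMs.
Link: https://arxiv.org/abs/2601.07894
Authors: Xin Dai, Pengcheng Huang, Zhenghao Liu, Shuo Wang, Yukun Yan, Chaojun Xiao, Yu Gu, Ge Yu, Maosong Sun
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: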
Abstract:Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge-intensive tasks. All codes and datasets are available at this https URL.
[AI-87] Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification
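[Quick read]: This paper addresses the memory and compute bottlenecks of deploying large language models (LLMs) on resource-constrained edge devices. Existing ternary quantization methods are mismatched with commodity hardware in one of two ways: 2-bit aligned packing wastes a significant number of bits, while 1.67-bit irregular packing slows inference. To resolve this tension, the authors propose the Sherry framework. Its core innovation is a fine-grained 3:4 sparsity structure that packs four weights into five bits for a regular 1.25-bit width, restoring power-of-two alignment. The authors also identify a weight trapping problem in sparse ternary training and design Arenas, an annealing residual synapse mechanism that maintains representational diversity during training and thus avoids representational collapse. Experiments on LLaMA-3.2 show that Sherry matches state-of-the-art ternary quantization while significantly reducing model size, and that on an Intel i7-14700HX CPU the 1B model achieves zero accuracy loss relative to SOTA baselines, 25% bit savings, and a 10% inference speedup.
Link: https://arxiv.org/abs/2601.07892
Authors: Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, Dapeng Wu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: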
Abstract:The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, or 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces a 3:4 fine-grained sparsity that achieves a regularized 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify a weight trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA-3.2 across five benchmarks demonstrate that Sherry matches state-of-the-art ternary performance while significantly reducing model size. Notably, on an Intel i7-14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and a 10% speedup. The code is available at this https URL.
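The 1.25-bit figure follows from simple counting, which a short sketch can make concrete. The abstract does not give the bit layout, so two assumptions are made here: that 3:4 sparsity means each block of four ternary weights has exactly three nonzero entries (one zero), and that a block is stored as a 2-bit zero position followed by three sign bits. Under the first assumption there are 4 × 2³ = 32 = 2⁵ possible blocks, so five bits per four weights (5/4 = 1.25 bits per weight) is exactly enough. The function names are hypothetical.

```python
def pack_block(weights):
    """Pack four ternary weights with exactly one zero into a 5-bit code (0..31).

    Assumed layout: high 2 bits = index of the zero, low 3 bits = signs of the
    three nonzero weights in order (1 means +1, 0 means -1).
    """
    if len(weights) != 4 or sorted(abs(w) for w in weights) != [0, 1, 1, 1]:
        raise ValueError("expected four ternary weights with exactly one zero")
    zero_pos = weights.index(0)
    signs = 0
    for w in weights:
        if w != 0:
            signs = (signs << 1) | (1 if w > 0 else 0)
    return (zero_pos << 3) | signs


def unpack_block(code):
    """Inverse of pack_block: recover the four ternary weights from a 5-bit code."""
    zero_pos, signs = code >> 3, code & 0b111
    sign_bits = iter((signs >> shift) & 1 for shift in (2, 1, 0))
    return [0 if i == zero_pos else (1 if next(sign_bits) else -1)
            for i in range(4)]
```

For example, `pack_block([1, 0, -1, 1])` stores zero position 1 and sign bits `101`, and `unpack_block` restores the original block; every one of the 32 codes corresponds to a valid block, so no bit pattern is wasted.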
[AI-88] Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models
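[Quick read]: This paper addresses semantic confusion in large language models (LLMs) when they process ASCII emoticons: a model may misread a common emoticon as a code instruction or logical construct and then perform actions the user did not intend, some of them destructive. The key to the approach is an automated data generation pipeline used to build a dataset of 3,757 code-oriented test cases spanning 21 meta-scenarios, four programming languages, and contexts of varying complexity. This enables a systematic measurement of how widespread and harmful the vulnerability is: the average confusion ratio exceeds 38%, and over 90% of confused responses are silent failures, which are syntactically valid but deviate from the user's intent and pose serious safety risks.
Link: https://arxiv.org/abs/2601.07885
Authors: Weipeng Jiang, Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Yang Liu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: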
Abstract:Emoticons are widely used in digital communication to convey affective intent, yet their safety implications for Large Language Models (LLMs) remain largely unexplored. In this paper, we identify emoticon semantic confusion, a vulnerability where LLMs misinterpret ASCII-based emoticons to perform unintended and even destructive actions. To systematically study this phenomenon, we develop an automated data generation pipeline and construct a dataset containing 3,757 code-oriented test cases spanning 21 meta-scenarios, four programming languages, and varying contextual complexities. Our study on six LLMs reveals that emoticon semantic confusion is pervasive, with an average confusion ratio exceeding 38%. More critically, over 90% of confused responses yield ‘silent failures’, which are syntactically valid outputs but deviate from user intent, potentially leading to destructive security consequences. Furthermore, we observe that this vulnerability readily transfers to popular agent frameworks, while existing prompt-based mitigations remain largely ineffective. We call on the community to recognize this emerging vulnerability and develop effective mitigation methods to uphold the safety and reliability of the LLM system.
[AI-89] Ideological Isolation in Online Social Networks: A Survey of Computational Definitions, Metrics, and Mitigation Strategies
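[Quick read]: This paper addresses ideological isolation in online social networks that arises from content personalization and user behavior patterns. The core challenge is how to quantify and mitigate the loss of information diversity and the social polarization caused by phenomena such as filter bubbles and echo chambers. The key to the solution is a unified computational framework that integrates metrics along four dimensions (structural topology, content features, interaction behavior, and cognitive bias) and organizes computational strategies based on network-topological interventions and recommendation-level controls, in order to identify and regulate content-exposure concentration and narrowing dynamics.
Link: https://arxiv.org/abs/2601.07884
Authors: Xiaodan Wang, Yanbin Liu, Shiqing Wu, Ziying Zhao, Yuxuan Hu, Weihua Li, Quan Bai
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: 31 pages, double column, submitted to the Information Sciences journal for review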
Abstract:The proliferation of online social networks has significantly reshaped the way individuals access and engage with information. While these platforms offer unprecedented connectivity, they may foster environments where users are increasingly exposed to homogeneous content and like-minded interactions. Such dynamics are associated with selective exposure and the emergence of filter bubbles, echo chambers, tunnel vision, and polarization, which together can contribute to ideological isolation and raise concerns about information diversity and public discourse. This survey provides a comprehensive computational review of existing studies that define, analyze, quantify, and mitigate ideological isolation in online social networks. We examine the mechanisms underlying content personalization, user behavior patterns, and network structures that reinforce content-exposure concentration and narrowing dynamics. This paper also systematically reviews methodological approaches for detecting and measuring these isolation-related phenomena, covering network-, content-, and behavior-based metrics. We further organize computational mitigation strategies, including network-topological interventions and recommendation-level controls, and discuss their trade-offs and deployment considerations. By integrating definitions, metrics, and interventions across structural/topological, content-based, interactional, and cognitive isolation, this survey provides a unified computational framework. It serves as a reference for understanding and addressing the key challenges and opportunities in promoting information diversity and reducing ideological fragmentation in the digital age.
[AI-90] Sola-Visibility-ISPM: Benchmarking Agentic AI for Identity Security Posture Management Visibility
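[Quick read]: This paper addresses the visibility problem in Identity Security Posture Management (ISPM) that modern enterprises face across cloud and SaaS environments, namely how to understand and manage identity inventory and configuration hygiene. There is currently no standardized way to measure how well agentic AI systems perform ISPM visibility tasks on real enterprise data. The key to the solution is the Sola Visibility ISPM Benchmark, the first benchmark for this setting, built on a live, production-grade identity environment spanning AWS, Okta, and Google Workspace and focused on foundational ISPM visibility tasks. It is accompanied by the Sola AI Agent, a tool-using agent that translates natural-language queries into executable data exploration steps and produces verifiable, evidence-backed answers. Across 77 questions, the agent reaches an expert accuracy of 0.84 and a strict success rate of 0.77, performing best on AWS hygiene tasks (expert accuracy 0.94), and the work provides a reproducible benchmark for evaluating and advancing agentic AI in identity security.
Link: https://arxiv.org/abs/2601.07880
Authors: Gal Engelberg, Konstantin Koutsyi, Leon Goldberg, Reuven Elezra, Idan Pinto, Tal Moalem, Shmuel Cohen, Yoni Weintrob
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures. Benchmark and evaluation framework for agentic AI in identity security posture management, including expert evaluation and LLM-as-judge analysis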
Abstract:Identity Security Posture Management (ISPM) is a core challenge for modern enterprises operating across cloud and SaaS environments. Answering basic ISPM visibility questions, such as understanding identity inventory and configuration hygiene, requires interpreting complex identity data, motivating growing interest in agentic AI systems. Despite this interest, there is currently no standardized way to evaluate how well such systems perform ISPM visibility tasks on real enterprise data. We introduce the Sola Visibility ISPM Benchmark, the first benchmark designed to evaluate agentic AI systems on foundational ISPM visibility tasks using a live, production-grade identity environment spanning AWS, Okta, and Google Workspace. The benchmark focuses on identity inventory and hygiene questions and is accompanied by the Sola AI Agent, a tool-using agent that translates natural-language queries into executable data exploration steps and produces verifiable, evidence-backed answers. Across 77 benchmark questions, the agent achieves strong overall performance, with an expert accuracy of 0.84 and a strict success rate of 0.77. Performance is highest on AWS hygiene tasks, where expert accuracy reaches 0.94, while results on Google Workspace and Okta hygiene tasks are more moderate, yet competitive. Overall, this work provides a practical and reproducible benchmark for evaluating agentic AI systems in identity security and establishes a foundation for future ISPM benchmarks covering more advanced identity analysis and governance tasks.
[AI-91] E2-LLM: Bridging Neural Signals and Interpretable Affective Analysis
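[Quick read]: This paper addresses three challenges in emotion recognition from electroencephalography (EEG) signals: high inter-subject variability, limited labeled data, and the lack of interpretable reasoning in existing methods. The core solution is E²-LLM (EEG-to-Emotion Large Language Model), the first framework to adapt a multimodal large language model (MLLM) to the spatiotemporal characteristics of neural signals. The key innovation is to connect a pretrained EEG encoder to Qwen-based large language models (LLMs) through learnable projection layers and to train with a multi-stage pipeline comprising emotion-discriminative pretraining, cross-modal alignment, and instruction tuning with chain-of-thought reasoning, yielding accurate and interpretable emotion analysis.
Link: https://arxiv.org/abs/2601.07877
Authors: Fei Ma, Han Lin, Yifan Xie, Hongwei Ren, Xiaoyu Shen, Wenbo Ding, Qi Tian
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages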
Abstract:Emotion recognition from electroencephalography (EEG) signals remains challenging due to high inter-subject variability, limited labeled data, and the lack of interpretable reasoning in existing approaches. While recent multimodal large language models (MLLMs) have advanced emotion analysis, they have not been adapted to handle the unique spatiotemporal characteristics of neural signals. We present E^2-LLM (EEG-to-Emotion Large Language Model), the first MLLM framework for interpretable emotion analysis from EEG. E^2-LLM integrates a pretrained EEG encoder with Qwen-based LLMs through learnable projection layers, employing a multi-stage training pipeline that encompasses emotion-discriminative pretraining, cross-modal alignment, and instruction tuning with chain-of-thought reasoning. We design a comprehensive evaluation protocol covering basic emotion prediction, multi-task reasoning, and zero-shot scenario understanding. Experiments on the dataset across seven emotion categories demonstrate that E^2-LLM achieves excellent performance on emotion classification, with larger variants showing enhanced reliability and superior zero-shot generalization to complex reasoning scenarios. Our work establishes a new paradigm combining physiological signals with LLM reasoning capabilities, showing that model scaling improves both recognition accuracy and interpretable emotional understanding in affective computing.
[AI-92] NOVAK: Unified adaptive optimizer for deep neural networks
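[Quick read]: This paper addresses the poor stability and weak convergence of existing adaptive optimization methods when training deep plain networks that lack skip connections. Conventional optimizers such as Adam and AdamW perform poorly on these architectures and struggle to train them reliably to high accuracy. The key to the solution is NOVAK, a modular gradient-based optimization algorithm whose core components are: (1) rectified adaptive learning rates, which reduce the effect of learning-rate fluctuations on training stability; (2) decoupled weight regularization, which improves regularization and its consistency with the optimization; (3) hybrid momentum, which combines multiple variants of Nesterov momentum to improve directional awareness; and (4) a memory-efficient lookahead synchronization mechanism, which substantially reduces overhead while preserving numerical stability. Together these components let NOVAK achieve better accuracy and robustness across architectures such as ResNet-50, VGG-16, and ViT, with a marked advantage on plain networks without skip connections.
Link: https://arxiv.org/abs/2601.07876
Authors: Sergii Kavun
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 77 pages, 14 figures, 7 tables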
Abstract:This work introduces NOVAK, a modular gradient-based optimization algorithm that integrates adaptive moment estimation, rectified learning-rate scheduling, decoupled weight regularization, multiple variants of Nesterov momentum, and lookahead synchronization into a unified, performance-oriented framework. NOVAK adopts a dual-mode architecture consisting of a streamlined fast path designed for production. The optimizer employs custom CUDA kernels that deliver substantial speedups (3-5x for critical operations) while preserving numerical stability under standard stochastic-optimization assumptions. We provide fully developed mathematical formulations for rectified adaptive learning rates, a memory-efficient lookahead mechanism that reduces overhead from O(2p) to O(p + p/k), and the synergistic coupling of complementary optimization components. Theoretical analysis establishes convergence guarantees and elucidates the stability and variance-reduction properties of the method. Extensive empirical evaluation on CIFAR-10, CIFAR-100, ImageNet, and ImageNette demonstrates NOVAK's superiority over 14 contemporary optimizers, including Adam, AdamW, RAdam, Lion, and Adan. Across architectures such as ResNet-50, VGG-16, and ViT, NOVAK consistently achieves state-of-the-art accuracy and exceptional robustness, attaining very high accuracy on VGG-16/ImageNette and demonstrating superior architectural robustness compared to contemporary optimizers. The results highlight that NOVAK's architectural contributions (particularly rectification, decoupled decay, and hybrid momentum) are crucial for reliable training of deep plain networks lacking skip connections, addressing a long-standing limitation of existing adaptive optimization methods.
[AI-93] Multiplicative Orthogonal Sequential Editing for Language Models AAAI2026
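[Quick read]: This paper addresses the loss of numerical stability in parameter matrices that occurs when existing knowledge editing methods modify the internal knowledge of large language models (LLMs) through additive updates, which harms editing performance and general abilities, especially in sequential editing. The key to the solution is the Multiplicative Orthogonal Sequential Editing (MOSE) paradigm: knowledge is updated by multiplying the original parameter matrix by an orthogonal matrix rather than by adding an update matrix. Because an orthogonal transformation leaves numerical stability indicators such as the condition number and norm unchanged, MOSE keeps the edited parameter matrix stable. It maintains high editing accuracy while improving sequential editing performance by 12.08% over existing methods and retaining 95.73% of general abilities on downstream tasks.
Link: https://arxiv.org/abs/2601.07873
Authors: Hao-Xiang Xu, Jun-Yu Ma, Ziqi Peng, Yuhao Sun, Zhen-Hua Ling, Jia-Chen Gu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026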
Abstract:Knowledge editing aims to efficiently modify the internal knowledge of large language models (LLMs) without compromising their other capabilities. The prevailing editing paradigm, which appends an update matrix to the original parameter matrix, has been shown by some studies to damage key numerical stability indicators (such as condition number and norm), thereby reducing editing performance and general abilities, especially in sequential editing scenarios. Although subsequent methods have made some improvements, they remain within the additive framework and have not fundamentally addressed this limitation. To solve this problem, we analyze it from both statistical and mathematical perspectives and conclude that multiplying the original matrix by an orthogonal matrix does not change the numerical stability of the matrix. Inspired by this, and in contrast to the previous additive editing paradigm, we propose a multiplicative editing paradigm termed Multiplicative Orthogonal Sequential Editing (MOSE). Specifically, we first derive the matrix update in multiplicative form; the new knowledge is then incorporated into an orthogonal matrix, which is multiplied by the original parameter matrix. In this way, the numerical stability of the edited matrix is unchanged, thereby maintaining editing performance and general abilities. We compare MOSE with several current knowledge editing methods, systematically evaluating their impact on both editing performance and general abilities across three different LLMs. Experimental results show that MOSE effectively limits deviations in the edited parameter matrix and maintains its numerical stability. Compared to current methods, MOSE achieves a 12.08% improvement in sequential editing performance, while retaining 95.73% of general abilities across downstream tasks. The code is available at this https URL.
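The stability claim rests on a standard linear-algebra fact: multiplying a matrix by an orthogonal matrix leaves its singular values unchanged, and therefore also its Frobenius norm and 2-norm condition number. The toy 2×2 check below illustrates only that invariance. It does not show how MOSE constructs its orthogonal matrix, and the helper names and the choice of a rotation matrix as the orthogonal factor are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def frobenius(w):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in w for x in row))


def condition_number(w):
    """2-norm condition number of a nonsingular 2x2 matrix.

    The squared singular values are the eigenvalues of W^T W, obtained here
    from its trace and determinant.
    """
    trace = sum(x * x for row in w for x in row)      # trace of W^T W
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    disc = math.sqrt(max(trace * trace - 4 * det * det, 0.0))
    return math.sqrt((trace + disc) / (trace - disc))


def rotation(theta):
    """A 2x2 rotation matrix, one simple example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]
```

With `W = [[2.0, 1.0], [0.0, 1.0]]` and `Q = rotation(0.7)`, both `frobenius` and `condition_number` agree for `W` and `matmul(Q, W)` up to floating-point error, whereas an additive update such as adding `[[0.0, 0.0], [0.0, -0.9]]` to `W` changes both quantities.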
[AI-94] RewriteNets: End-to-End Trainable String-Rewriting for Generative Sequence Modeling AACL2025
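[Quick read]: This paper addresses the quadratic computational complexity that mainstream sequence models such as the Transformer incur by representing structure implicitly. The key to the solution is RewriteNets, a new neural architecture built on explicit, parallel string rewriting. Each layer contains a set of learnable rewrite rules and performs four operations: fuzzy matching of rule patterns, conflict resolution through a differentiable assignment operator that selects non-overlapping rewrites, application of the chosen rules to replace input segments (the output length may change), and propagation of untouched tokens. Because discrete rule assignment is non-differentiable, the authors use a straight-through Gumbel-Sinkhorn estimator for stable end-to-end training. Experiments show that RewriteNets excel at tasks requiring systematic generalization (98.7% accuracy on the length split of the SCAN benchmark) and are computationally more efficient than Transformers.
Link: https://arxiv.org/abs/2601.07868
Authors: Harshil Vejendla
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, AACL 2025 Findings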
Abstract:Dominant sequence models like the Transformer represent structure implicitly through dense attention weights, incurring quadratic complexity. We propose RewriteNets, a novel neural architecture built on an alternative paradigm: explicit, parallel string rewriting. Each layer in a RewriteNet contains a set of learnable rules. For each position in an input sequence, the layer performs four operations: (1) fuzzy matching of rule patterns, (2) conflict resolution via a differentiable assignment operator to select non-overlapping rewrites, (3) application of the chosen rules to replace input segments with output segments of potentially different lengths, and (4) propagation of untouched tokens. While the discrete assignment of rules is non-differentiable, we employ a straight-through Gumbel-Sinkhorn estimator, enabling stable end-to-end training. We evaluate RewriteNets on algorithmic, compositional, and string manipulation tasks, comparing them against strong LSTM and Transformer baselines. Results show that RewriteNets excel at tasks requiring systematic generalization (achieving 98.7% accuracy on the SCAN benchmark’s length split) and are computationally more efficient than Transformers. We also provide an analysis of learned rules and an extensive ablation study, demonstrating that this architecture presents a promising direction for sequence modeling with explicit structural inductive biases.
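The Gumbel-Sinkhorn operator mentioned in the abstract can be sketched generically: perturb a score matrix with Gumbel noise, divide by a temperature, and alternate row and column normalisation until the result is close to doubly stochastic. The pure-Python sketch below assumes a square rule-by-position score matrix and omits the straight-through gradient step; the function names and the example scores are hypothetical, not taken from the paper.

```python
import math
import random

def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sinkhorn(scores, tau=1.0, n_iters=50):
    """Alternate row/column normalisation in log space; returns a near doubly stochastic matrix."""
    n = len(scores)
    log_p = [[s / tau for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalisation: each row sums to 1.
        row_lse = [logsumexp(row) for row in log_p]
        log_p = [[log_p[i][j] - row_lse[i] for j in range(n)] for i in range(n)]
        # Column normalisation: each column sums to 1.
        col_lse = [logsumexp([log_p[i][j] for i in range(n)]) for j in range(n)]
        log_p = [[log_p[i][j] - col_lse[j] for j in range(n)] for i in range(n)]
    return [[math.exp(x) for x in row] for row in log_p]

def sample_gumbel(rng):
    """Standard Gumbel sample, with the uniform draw clipped away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_sinkhorn(scores, tau, n_iters, rng):
    """Add Gumbel noise to the scores, then apply Sinkhorn normalisation."""
    noisy = [[s + sample_gumbel(rng) for s in row] for row in scores]
    return sinkhorn(noisy, tau=tau, n_iters=n_iters)

# Hypothetical match scores between 3 rules (rows) and 3 positions (columns).
scores = [[2.0, 0.0, 1.0],
          [0.0, 1.0, 2.0],
          [1.0, 2.0, 0.0]]
soft_assignment = gumbel_sinkhorn(scores, tau=1.0, n_iters=50, rng=random.Random(0))
for row in soft_assignment:
    print([round(p, 3) for p in row])
```

Lower temperatures push the soft assignment towards a hard permutation; a straight-through variant would use the hard assignment in the forward pass and the soft one for gradients.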
[AI-95] Bridging the Trust Gap: Clinician-Validated Hybrid Explainable AI for Maternal Health Risk Assessment in Bangladesh
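[Quick read]: This paper addresses the barrier to clinical adoption of machine learning in resource-constrained settings, namely the lack of explainability and trust. Its key contribution is a hybrid explainable AI (XAI) framework that combines ante-hoc fuzzy logic with post-hoc SHAP feature-importance explanations. A fuzzy-XGBoost model built on 1,014 maternal health records reaches 88.67% accuracy (ROC-AUC: 0.9703). In a validation study with 14 healthcare professionals in Bangladesh, 54.8% of responses expressed trust for clinical use; SHAP analysis identified healthcare access as the primary predictor and supported the integration of clinical knowledge (r=0.298), offering a practical and trustworthy path for deploying XAI in maternal healthcare.
Link: https://arxiv.org/abs/2601.07866
Authors: Farjana Yesmin,Nusrat Shirmin,Suraiya Shabnam Bristy
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, 2 tables. Submitted to WCCI 2026 (2026 IEEE World Congress on Computational Intelligence)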
Abstract:While machine learning shows promise for maternal health risk prediction, clinical adoption in resource-constrained settings faces a critical barrier: lack of explainability and trust. This study presents a hybrid explainable AI (XAI) framework combining ante-hoc fuzzy logic with post-hoc SHAP explanations, validated through systematic clinician feedback. We developed a fuzzy-XGBoost model on 1,014 maternal health records, achieving 88.67% accuracy (ROC-AUC: 0.9703). A validation study with 14 healthcare professionals in Bangladesh revealed strong preference for hybrid explanations (71.4% across three clinical cases) with 54.8% expressing trust for clinical use. SHAP analysis identified healthcare access as the primary predictor, with the engineered fuzzy risk score ranking third, validating clinical knowledge integration (r=0.298). Clinicians valued integrated clinical parameters but identified critical gaps: obstetric history, gestational age, and connectivity barriers. This work demonstrates that combining interpretable fuzzy rules with feature importance explanations enhances both utility and trust, providing practical insights for XAI deployment in maternal healthcare.
[AI-96] Affect and Effect: Limitations of regularisation-based continual learning in EEG-based emotion classification
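[Quick read]: This paper addresses poor cross-subject generalisation in EEG-based emotion classification, where high inter- and intra-subject variability makes it hard for models to transfer to unseen subjects. It argues that widely used regularisation-based continual learning (CL) methods (such as Elastic Weight Consolidation and Synaptic Intelligence) perform poorly here because of a fundamental misalignment in the stability-plasticity trade-off: they prioritise preventing catastrophic forgetting (backward transfer) over adapting to new subjects (forward transfer). Specifically, parameter-importance estimates become unreliable under noisy data and covariate shift, gradients on parameters deemed important interfere with the updates needed for new subjects, importance values accumulated across tasks over-constrain the model, and performance is highly sensitive to subject order. On the DREAMER and SEED datasets these methods yield no statistically significant gain in forward transfer (p > 0.05), indicating that they are ill-suited to robust generalisation to unseen subjects in EEG-based emotion classification.
Link: https://arxiv.org/abs/2601.07858
Authors: Nina Peire,Yupei Li,Björn Schuller
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 16 figures, not including Appendix. Code can be found at: this https URL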
Abstract:Generalisation to unseen subjects in EEG-based emotion classification remains a challenge due to high inter- and intra-subject variability. Continual learning (CL) poses a promising solution by learning from a sequence of tasks while mitigating catastrophic forgetting. Regularisation-based CL approaches, such as Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), and Memory Aware Synapses (MAS), are commonly used as baselines in EEG-based CL studies, yet their suitability for this problem remains underexplored. This study theoretically and empirically finds that regularisation-based CL methods show limited performance for EEG-based emotion classification on the DREAMER and SEED datasets. We identify a fundamental misalignment in the stability-plasticity trade-off, where regularisation-based methods prioritise mitigating catastrophic forgetting (backward transfer) over adapting to new subjects (forward transfer). We investigate this limitation under subject-incremental sequences and observe that: (1) the heuristics for estimating parameter importance become less reliable under noisy data and covariate shift, (2) gradients on parameters deemed important by these heuristics often interfere with gradient updates required for new subjects, moving optimisation away from the minimum, (3) importance values accumulated across tasks over-constrain the model, and (4) performance is sensitive to subject order. Forward transfer showed no statistically significant improvement over sequential fine-tuning (p > 0.05 across approaches and datasets). The high variability of EEG signals means past subjects provide limited value to future subjects. Regularisation-based continual learning approaches are therefore limited for robust generalisation to unseen subjects in EEG-based emotion classification.
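The stability-plasticity tension described above is visible in the standard EWC penalty, which adds a quadratic cost for moving each parameter away from its value after earlier tasks, weighted by an importance estimate. The following is a minimal sketch of that generic penalty and its gradient; the parameter values, importance weights, and `lam` are illustrative numbers, not the paper's configuration.

```python
def ewc_penalty(params, anchor_params, importance, lam):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, importance)
    )

def ewc_penalty_grad(params, anchor_params, importance, lam):
    """Gradient of the penalty: lam * F_i * (theta_i - theta_star_i)."""
    return [lam * f * (p - a) for p, a, f in zip(params, anchor_params, importance)]

def regularised_step(params, task_grad, anchor_params, importance, lam, lr):
    """One gradient step on (new-subject loss + EWC penalty)."""
    pen_grad = ewc_penalty_grad(params, anchor_params, importance, lam)
    return [p - lr * (g + pg) for p, g, pg in zip(params, task_grad, pen_grad)]

# Illustrative values: parameter 0 is deemed important, parameter 1 is not.
anchor = [0.0, 2.0]          # parameters after previous subjects
current = [1.0, 2.0]         # parameters while adapting to a new subject
importance = [4.0, 1.0]      # e.g. diagonal Fisher estimates
lam = 2.0

print(ewc_penalty(current, anchor, importance, lam))        # cost of having moved parameter 0
print(ewc_penalty_grad(current, anchor, importance, lam))   # pull back towards the anchor

# The new subject's loss wants parameter 0 to keep increasing (negative gradient),
# but the penalty gradient points the other way and dominates the update.
new_params = regularised_step(current, task_grad=[-1.0, 0.0], anchor_params=anchor,
                              importance=importance, lam=lam, lr=0.1)
print(new_params)
```

When the importance estimate is noisy, as the abstract argues it is for EEG data, this pull towards the anchor can oppose exactly the updates a new subject needs.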
[AI-97] FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments
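[Quick read]: This paper addresses execution-level security risks that arise when financial agents perform state-changing operations and tool calls in real operating environments, risks that existing language-model-level content-compliance evaluations fail to capture. Its key contribution is FinVault, the first execution-grounded security benchmark for financial agents. It comprises 31 regulatory case-driven sandbox scenarios with state-writable databases and explicit compliance constraints, and covers 107 real-world vulnerabilities and 963 test cases spanning prompt injection, jailbreaking, and financially adapted attacks, enabling a comprehensive evaluation of agent safety in realistic workflows.
Link: https://arxiv.org/abs/2601.07853
Authors: Zhi Yang,Runguo Li,Qiqi Qiang,Jiashun Wang,Fangqi Lou,Mengping Li,Dongpo Cheng,Rui Xu,Heng Lian,Shuo Zhang,Xiaolong Liang,Xiaoming Huang,Zheng Wei,Zhaowei Liu,Xin Guo,Huacan Wang,Ronghao Chen,Liwen Zhang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: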
Abstract:Financial agents powered by large language models (LLMs) are increasingly deployed for investment analysis, risk assessment, and automated decision-making, where their abilities to plan, invoke tools, and manipulate mutable state introduce new security risks in high-stakes and highly regulated financial environments. However, existing safety evaluations largely focus on language-model-level content compliance or abstract agent settings, failing to capture execution-grounded risks arising from real operational workflows and state-changing actions. To bridge this gap, we propose FinVault, the first execution-grounded security benchmark for financial agents, comprising 31 regulatory case-driven sandbox scenarios with state-writable databases and explicit compliance constraints, together with 107 real-world vulnerabilities and 963 test cases that systematically cover prompt injection, jailbreaking, financially adapted attacks, as well as benign inputs for false-positive evaluation. Experimental results reveal that existing defense mechanisms remain ineffective in realistic financial agent settings, with average attack success rates (ASR) still reaching up to 50.0% on state-of-the-art models and remaining non-negligible even for the most robust systems (ASR 6.7%), highlighting the limited transferability of current safety designs and the need for stronger financial-specific defenses. Our code can be found at this https URL.
[AI-98] Hierarchical Sparse Plus Low Rank Compression of LLM
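[Quick read]: This paper addresses the high memory and compute cost of deploying and continuing to train large language models (LLMs) and proposes Hierarchical Sparse Plus Low-Rank (HSS) compression. The core is a two-stage scheme: the largest-magnitude entries of a weight matrix are first extracted into a sparse matrix $ S $, and a recursive Hierarchically Sparse Separable (HSS) low-rank factorisation is then applied to the dense residual. A recursive rank-reduction strategy and a reverse Cuthill-McKee (RCM) reordering concentrate large weights within the block-diagonal structure, maximising the compressibility of the off-diagonal regions (which are touched only once). The method is hardware-friendly: its matrix-vector product reduces to one sparse operation plus a sequence of thin-matrix multiplications, and it supports end-to-end training. Experiments show that compressing only the self-attention projections of LLaMA-7B (the Q, K, and V matrices, 1.6 billion parameters in total) yields substantial memory savings while keeping perplexity close to the best reported.
Link: https://arxiv.org/abs/2601.07839
Authors: Pawan Kumar,Aditi Gupta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, Accepted in ACM International Conference on Data Science, CODS-2026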
Abstract:Modern large language models (LLMs) place extraordinary pressure on memory and compute budgets, making principled compression indispensable for both deployment and continued training. We present Hierarchical Sparse Plus Low-Rank (HSS) compression, a two-stage scheme that (i) removes the largest-magnitude weights into a sparse matrix S and (ii) applies a recursive Hierarchically Sparse Separable (HSS) low-rank factorisation to the dense residual matrix. A recursive rank-reducing strategy and a reverse Cuthill-McKee (RCM) permutation are introduced to align high weights towards the diagonal with the block-diagonal hierarchy, maximising off-diagonal compressibility (because they are touched only once). HSS is hardware-friendly: its matrix-vector multiply reduces to one sparse and a sequence of thin-matrix multiplications and can be trained end-to-end with standard optimisers. Experiments on LLaMA-7B show that targeting only the self-attention projections (1.6 B parameters of Q, K, and V matrices out of a total 7B parameters) suffices to yield large memory savings while retaining comparable state-of-the-art perplexity scores on test samples of the WikiText dataset. For example, with a 30% sparsity budget and an outer rank of 512, sHSS-RCM achieves a perplexity of 1.64, outperforming dense baselines and classical sparse-plus-SVD variants, while also achieving significant memory savings.
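The abstract's point that the compressed matrix-vector product reduces to one sparse operation plus thin-matrix products can be illustrated with the simpler, non-hierarchical sparse-plus-low-rank form y = S x + U (V x). The sketch below is a toy under stated assumptions: it uses a flat rank-1 factor rather than the recursive HSS structure, and the matrix `W` and the factors `U`, `V` are made-up values rather than the output of any fitting procedure.

```python
def split_largest(W, k):
    """Move the k largest-magnitude entries of W into a sparse dict S; return (S, residual)."""
    ranked = sorted(
        ((abs(v), i, j) for i, row in enumerate(W) for j, v in enumerate(row)),
        reverse=True,
    )
    S = {(i, j): W[i][j] for _, i, j in ranked[:k]}
    R = [[0.0 if (i, j) in S else v for j, v in enumerate(row)]
         for i, row in enumerate(W)]
    return S, R

def sparse_plus_low_rank_matvec(S, U, V, x):
    """Compute y = S x + U (V x) without ever forming a dense matrix."""
    t = [sum(v * xj for v, xj in zip(v_row, x)) for v_row in V]   # thin product: r values
    y = [sum(u * tr for u, tr in zip(u_row, t)) for u_row in U]   # thin product: n_rows values
    for (i, j), s_ij in S.items():                                # sparse correction
        y[i] += s_ij * x[j]
    return y

# Toy weight matrix: a smooth background plus two large outliers.
W = [[1.0,  0.5, 10.25],
     [2.0,  1.0,  0.5],
     [3.0, -6.5,  0.75]]

# Stage 1: pull the two largest-magnitude weights into S.
S, R = split_largest(W, k=2)

# Stage 2 (assumed here, not fitted): a hypothetical rank-1 approximation U V of the residual R.
U = [[1.0], [2.0], [3.0]]          # 3 x 1
V = [[1.0, 0.5, 0.25]]             # 1 x 3

y = sparse_plus_low_rank_matvec(S, U, V, [1.0, 1.0, 1.0])
print(S)
print(y)
```

With rank r and k sparse entries, the product costs on the order of k + r(m + n) multiplications instead of mn, which is where the memory and compute savings come from.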
[AI-99] A survey: Information search time optimization based on RAG (Retrieval Augmentation Generation) chatbot
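[Quick read]: This paper addresses slow retrieval of complex information inside organisations, where traditional search is time-consuming and struggles to pinpoint the needed knowledge. Its solution is a Retrieval-Augmented Generation (RAG) chatbot that couples a private knowledge base with generative AI, dynamically retrieving relevant documents during question answering and generating structured answers, which shortens the time employees spend looking for information. The survey reports that, compared with standard search methods, the RAG-based chatbot reduces average retrieval time per query by 80-95%.
Link: https://arxiv.org/abs/2601.07838
Authors: Jinesh Patel,Arpit Malhotra,Ajay Pande,Prateek Caire
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: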
Abstract:Retrieval-Augmented Generation (RAG) based chatbots are useful not only for information retrieval through question answering but also for making complex decisions based on injected private knowledge. We present a survey on how much search time can be saved when retrieving complex information within an organization called “X Systems” (a stealth-mode company) by using a RAG-based chatbot compared to traditional search methods. We compare the information retrieval time using standard search techniques versus the RAG-based chatbot for the same queries. Our results indicate that RAG-based chatbots not only save time in information retrieval but also optimize the search process effectively. The survey was conducted with a sample of 105 employees across departments, and the average time spent on information retrieval per query was used as the metric. The comparison shows an average 80-95% improvement in search time when using the RAG-based chatbot rather than standard search.
[AI-100] Grid-Aware Charging and Operational Optimization for Mixed-Fleet Public Transit ITSC
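[Quick read]: This paper addresses the operational optimisation of public transit fleets that mix electric and diesel buses, in particular how to assign charging under dynamic electricity pricing while respecting secondary constraints such as seating capacity. Its key contribution is a comprehensive mixed-integer linear programming (MILP) model that jointly optimises charging schedules and trip assignments to control operating cost. Because the MILP can become computationally intractable even for relatively small fleets, the authors further propose a hierarchical optimisation strategy tailored to fleet composition, which improves solution efficiency and practical feasibility.
Link: https://arxiv.org/abs/2601.08753
Authors: Rishav Sen,Amutheezan Sivagnanam,Aron Laszka,Ayan Mukhopadhyay,Abhishek Dubey
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 7 pages, 7 figures, 4 algorithms. Published in the Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)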
Abstract:The rapid growth of urban populations and the increasing need for sustainable transportation solutions have prompted a shift towards electric buses in public transit systems. However, the effective management of mixed fleets consisting of both electric and diesel buses poses significant operational challenges. One major challenge is coping with dynamic electricity pricing, where charging costs vary throughout the day. Transit agencies must optimize charging assignments in response to such dynamism while accounting for secondary considerations such as seating constraints. This paper presents a comprehensive mixed-integer linear programming (MILP) model to address these challenges by jointly optimizing charging schedules and trip assignments for mixed (electric and diesel bus) fleets while considering factors such as dynamic electricity pricing, vehicle capacity, and route constraints. We address the potential computational intractability of the MILP formulation, which can arise even with relatively small fleets, by employing a hierarchical approach tailored to the fleet composition. By using real-world data from the city of Chattanooga, Tennessee, USA, we show that our approach can result in significant savings in the operating costs of the mixed transit fleets.
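To make the price-aware charging idea concrete, here is a deliberately tiny special case: one electric bus, a handful of time slots with different electricity prices, and some slots blocked by trips. Exhaustive search over slot subsets stands in for the MILP solver; the prices, slot counts, and function name are all invented for illustration, and the paper's joint model with trip assignment and vehicle capacity is far richer.

```python
from itertools import combinations

def cheapest_charging_plan(prices, busy_slots, slots_needed, kwh_per_slot):
    """Pick the charging slots that minimise energy cost, avoiding slots when the bus is on a trip.

    Exhaustive search over all feasible slot subsets; fine for a toy, not for real fleets.
    """
    free = [t for t in range(len(prices)) if t not in busy_slots]
    best_plan, best_cost = None, None
    for plan in combinations(free, slots_needed):
        cost = sum(prices[t] * kwh_per_slot for t in plan)
        if best_cost is None or cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost

# Hypothetical dynamic tariff in $/kWh for six time slots.
prices = [0.30, 0.10, 0.25, 0.08, 0.40, 0.12]

# The bus is on a trip during slot 1 and needs two slots of charging at 50 kWh each.
plan, cost = cheapest_charging_plan(prices, busy_slots={1}, slots_needed=2, kwh_per_slot=50.0)
print(plan, round(cost, 2))
```

The cheap slot 1 is unavailable because of the trip, so the plan falls back to the next-cheapest free slots; coupling of this kind between trips and charging is what the joint formulation is meant to capture at fleet scale.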
[AI-101] A Formal Proof of a Continued Fraction Conjecture for π Originating from the Ramanujan Machine
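[Quick read]: This paper gives an analytic proof for a class of non-canonical polynomial continued fractions representing π/4 that were conjectured by the Ramanujan Machine algorithm. The key is an explicit correspondence between these continued fractions and ratios of contiguous Gaussian hypergeometric functions, from which the identities are derived through a sequence of equivalence transformations. The paper also proves that the conjectured integer coefficients form a symbolically minimal realisation of the underlying analytic kernel, and a stability analysis shows that the limit-periodic structures lie strictly within the Worpitzky convergence disk, ensuring absolute convergence. Such algorithmically discovered identities are therefore not isolated numerical artifacts but are rooted in the classical theory of hypergeometric transformations.
Link: https://arxiv.org/abs/2601.08461
Authors: Chao Wang
Affiliation: Unknown
Categories: Number Theory (math.NT); Artificial Intelligence (cs.AI)
Comments: 4 pages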
Abstract:We provide a formal analytic proof for a class of non-canonical polynomial continued fractions representing $\pi/4$, originally conjectured by the Ramanujan Machine using algorithmic induction [4]. By establishing an explicit correspondence with the ratio of contiguous Gaussian hypergeometric functions ${}_2F_1(a, b; c; z)$, we show that these identities can be derived via a discrete sequence of equivalence transformations. We further prove that the conjectured integer coefficients constitute a symbolically minimal realization of the underlying analytic kernel. Stability analysis confirms that the resulting limit-periodic structures reside strictly within the Worpitzky convergence disk, ensuring absolute convergence. This work demonstrates that such algorithmically discovered identities are not isolated numerical artifacts, but are deeply rooted in the classical theory of hypergeometric transformations.
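The abstract does not list the conjectured continued fractions themselves, so the sketch below uses a different, classical example of the same kind of object: the arctangent continued fraction evaluated at 1, π/4 = 1/(1 + 1²/(3 + 2²/(5 + 3²/(7 + ...)))), which also arises from a ratio of contiguous hypergeometric functions. It is evaluated bottom-up at a fixed truncation depth; the function name and the depths are arbitrary choices.

```python
import math

def pi_over_4_continued_fraction(depth):
    """Evaluate 1 / (1 + 1^2 / (3 + 2^2 / (5 + ...))) truncated after `depth` partial numerators.

    Partial denominators are the odd numbers 2n + 1 and partial numerators are n^2,
    i.e. both are polynomials in n. Evaluation runs from the innermost term outwards.
    """
    tail = 2 * depth + 1                      # innermost partial denominator
    for n in range(depth, 0, -1):
        tail = (2 * n - 1) + (n * n) / tail   # wrap one more level around the tail
    return 1.0 / tail

for depth in (1, 2, 5, 10, 20):
    approx = pi_over_4_continued_fraction(depth)
    print(depth, approx, abs(approx - math.pi / 4))
```

The error shrinks roughly geometrically with depth, which is the practical face of the convergence guarantees the abstract refers to.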
[AI-102] Autonomous Materials Exploration by Integrating Automated Phase Identification and AI-Assisted Human Reasoning
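[Quick read]: This paper addresses the low experimental throughput of traditional materials development and the difficulty of efficiently exploring large chemical and processing spaces, especially when searching for metastable phases (such as δ-Bi₂O₃ and Bi₂Ti₂O₇) and understanding how they form. Its key contribution is a human-in-the-loop autonomous experimentation system, SARA-H (Scientific Autonomous Reasoning Agent with human-in-the-loop), which obtains probabilistic phase information from an automated phase-identification algorithm and embeds human expert knowledge in the reasoning process to steer the search, improving sampling efficiency and the discovery of target phases. The approach is demonstrated on several oxide systems (such as Bi-Ti-O), where it maps the processing conditions that kinetically trap metastable phases and provides evidence confirming earlier predictions.
Link: https://arxiv.org/abs/2601.08185
Authors: Ming-Chiang Chang,Maximilian Amsler,Duncan R. Sutherland,Sebastian Ament,Katie R. Gann,Lan Zhou,Louisa M. Smieska,Arthur R. Woll,John M. Gregoire,Carla P. Gomes,R. Bruce van Dover,Michael O. Thompson
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Computational Physics (physics.comp-ph)
Comments: Main manuscript: 21 pages (including references), 6 figures. Supplementary Information: 12 pages, 9 figures, 1 table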
Abstract:Autonomous experimentation holds the potential to accelerate materials development by combining artificial intelligence (AI) with modular robotic platforms to explore extensive combinatorial chemical and processing spaces. Such self-driving laboratories can not only increase the throughput of repetitive experiments, but also incorporate human domain expertise to drive the search towards user-defined objectives, including improved materials performance metrics. We present an autonomous materials synthesis extension to SARA, the Scientific Autonomous Reasoning Agent, utilizing phase information provided by an automated probabilistic phase labeling algorithm to expedite the search for targeted phase regions. By incorporating human input into an expanded SARA-H (SARA with human-in-the-loop) framework, we enhance the efficiency of the underlying reasoning process. Using synthetic benchmarks, we demonstrate the efficiency of our AI implementation and show that the human input can contribute to significant improvement in sampling efficiency. We conduct experimental active learning campaigns using robotic processing of thin-film samples of several oxide material systems, including Bi$_2$O$_3$, SnO$_x$, and Bi-Ti-O, using lateral-gradient laser spike annealing to synthesize and kinetically trap metastable phases. We showcase the utility of human-in-the-loop autonomous experimentation for the Bi-Ti-O system, where we identify extensive processing domains that stabilize $\delta$-Bi$_2$O$_3$ and Bi$_2$Ti$_2$O$_7$, explore dwell-dependent ternary oxide phase behavior, and provide evidence confirming predictions that cationic substitutional doping of TiO$_2$ with Bi inhibits the unfavorable transformation of the metastable anatase to the ground-state rutile phase. The autonomous methods we have developed enable the discovery of new materials and new understanding of materials synthesis and properties.
[AI-103] High-Fidelity Modeling of Stochastic Chemical Dynamics on Complex Manifolds: A Multi-Scale SIREN-PINN Framework for the Curvature-Perturbed Ginzburg-Landau Equation
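[Quick read]: This paper addresses the accurate identification and control of spatiotemporal chaos in reaction-diffusion systems, particularly when the catalytic surface has complex, unknown topography. The central difficulty is that conventional Physics-Informed Neural Networks (PINNs) suffer from spectral bias and cannot resolve high-frequency gradients, which causes phase drift or amplitude collapse. The key to the solution is a Multi-Scale SIREN-PINN architecture that uses periodic sinusoidal activations with frequency-diverse initialisation, embedding the inductive bias of wave-like physics directly into the network so that it resolves macroscopic wave envelopes and microscopic defect cores simultaneously. Validated on the complex Ginzburg-Landau equation on latent Riemannian manifolds, the method achieves a relative L₂ prediction error of about 1.92×10⁻², an order of magnitude better than standard baselines, while preserving topological invariants (|ΔN_defects| < 1). The model also solves the ill-posed inverse pinning problem, reconstructing hidden Gaussian curvature fields from partial observations of chaotic wave dynamics alone (Pearson correlation ρ=0.965), which the authors present as a new paradigm for geometric catalyst design.
Link: https://arxiv.org/abs/2601.08104
Authors: Julian Evan Chrisnanto,Salsabila Rahma Alia,Nurfauzi Fadillah,Yulison Herry Chrisnanto
Affiliation: Unknown
Categories: Chaotic Dynamics (nlin.CD); Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures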
Abstract:The accurate identification and control of spatiotemporal chaos in reaction-diffusion systems remains a grand challenge in chemical engineering, particularly when the underlying catalytic surface possesses complex, unknown topography. In the Defect Turbulence regime, system dynamics are governed by topological phase singularities (spiral waves) whose motion couples to manifold curvature via geometric pinning. Conventional Physics-Informed Neural Networks (PINNs) using ReLU or Tanh activations suffer from fundamental spectral bias, failing to resolve high-frequency gradients and causing amplitude collapse or phase drift. We propose a Multi-Scale SIREN-PINN architecture leveraging periodic sinusoidal activations with frequency-diverse initialization, embedding the appropriate inductive bias for wave-like physics directly into the network structure. This enables simultaneous resolution of macroscopic wave envelopes and microscopic defect cores. Validated on the complex Ginzburg-Landau equation evolving on latent Riemannian manifolds, our architecture achieves relative state prediction error $\epsilon_{L_2} \approx 1.92 \times 10^{-2}$, outperforming standard baselines by an order of magnitude while preserving topological invariants ($|\Delta N_{\text{defects}}| < 1$). We solve the ill-posed inverse pinning problem, reconstructing hidden Gaussian curvature fields solely from partial observations of chaotic wave dynamics (Pearson correlation $\rho = 0.965$). Training dynamics reveal a distinctive Spectral Phase Transition at epoch $\sim 2{,}100$, where cooperative minimization of physics and geometry losses drives the solver to Pareto-optimal solutions. This work establishes a new paradigm for Geometric Catalyst Design, offering a mesh-free, data-driven tool for identifying surface heterogeneity and engineering passive control strategies in turbulent chemical reactors.
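A sinusoidal (SIREN-style) first layer with several frequency bands can be sketched in a few lines. The version below is an assumption-laden illustration in pure Python: it uses the first-layer weight range U(-1/n_in, 1/n_in) from the original SIREN formulation, draws biases from the same range as a simplification, and treats "multi-scale" as concatenating bands with different frequency scales `omega_0`. The paper's actual architecture, frequencies, and initialisation may differ.

```python
import math
import random

def make_sine_layer(n_in, n_out, omega_0, rng):
    """First-layer SIREN-style initialisation: weights ~ U(-1/n_in, 1/n_in).

    Biases use the same range here, which is a simplifying assumption.
    """
    bound = 1.0 / n_in
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weights, biases, omega_0

def sine_layer_forward(layer, x):
    """y = sin(omega_0 * (W x + b)), applied per output unit."""
    weights, biases, omega_0 = layer
    return [math.sin(omega_0 * (sum(w * xi for w, xi in zip(row, x)) + b))
            for row, b in zip(weights, biases)]

def multi_scale_forward(bands, x):
    """Concatenate the outputs of several sine layers with different frequency scales."""
    features = []
    for layer in bands:
        features.extend(sine_layer_forward(layer, x))
    return features

rng = random.Random(0)
omegas = [1.0, 10.0, 30.0]   # hypothetical low, medium, and high frequency scales
bands = [make_sine_layer(n_in=2, n_out=4, omega_0=w0, rng=rng) for w0 in omegas]

# Encode one space-time point; a PINN would feed these features to further layers.
features = multi_scale_forward(bands, [0.3, -0.7])
print(len(features), [round(f, 3) for f in features])
```

Low `omega_0` bands vary slowly across the domain and suit wave envelopes, while high `omega_0` bands oscillate quickly and give the network a way to represent sharp defect cores.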
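[AI-104] Tuberculosis Screening from Cough Audio: Baseline Models, Clinical Variables, and Uncertainty Quantification
[Quick read]: This paper addresses the lack of comparability among studies on automatic tuberculosis (TB) detection from cough audio, which differ widely in datasets, feature representations, model architectures, validation protocols, and evaluation metrics, hindering methodological progress. Its key contribution is a standardised, end-to-end reproducible framework covering feature extraction, multimodal fusion (audio and clinical metadata), cougher-independent evaluation, and uncertainty quantification, reported with a consistent set of clinically relevant metrics for fair comparison. The paper also compares cough-audio-only models with models that fuse audio and clinical metadata, and releases the full experimental protocol as a common benchmark reference for the field.
Link: https://arxiv.org/abs/2601.07969
Authors: George P. Kafentzis,Efstratios Selisios
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: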
Abstract:In this paper, we propose a standardized framework for automatic tuberculosis (TB) detection from cough audio and routinely collected clinical data using machine learning. While TB screening from audio has attracted growing interest, progress is difficult to measure because existing studies vary substantially in datasets, cohort definitions, feature representations, model families, validation protocols, and reported metrics. Consequently, reported gains are often not directly comparable, and it remains unclear whether improvements stem from modeling advances or from differences in data and evaluation. We address this gap by establishing a strong, well-documented baseline for TB prediction using cough recordings and accompanying clinical metadata from a recently compiled dataset from several countries. Our pipeline is reproducible end-to-end, covering feature extraction, multimodal fusion, cougher-independent evaluation, and uncertainty quantification, and it reports a consistent suite of clinically relevant metrics to enable fair comparison. We further quantify performance for cough audio-only and fused (audio + clinical metadata) models, and release the full experimental protocol to facilitate benchmarking. This baseline is intended to serve as a common reference point and to reduce methodological variance that currently holds back progress in the field.
[AI-105] Quantum automated theorem proving
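[Quick read]: This paper addresses the computational inefficiency of classical automated theorem proving (ATP) on complex reasoning tasks, in particular the high query complexity in propositional and first-order logic and the limited scalability of classical algebraic methods for geometric theorem proving. Its key contribution is a generic framework for quantum automated theorem proving that exploits quantum superposition and entanglement: a quantum resolution algorithm yields a quadratic reduction in query complexity for propositional and first-order logic, and a quantum algebraic proving method extends Wu's algebraic approach to the quantum setting, giving better query complexity than classical methods for geometric theorems. The framework provides a theoretical basis and a route towards practical automated theorem provers for near-term and future quantum technologies.
Link: https://arxiv.org/abs/2601.07953
Authors: Zheng-Zhi Sun,Qi Ye,Dong-Ling Deng
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: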
Abstract:Automated theorem proving, or more broadly automated reasoning, aims at using computer programs to automatically prove or disprove mathematical theorems and logical statements. It takes on an essential role across a vast array of applications and the quest for enhanced theorem-proving capabilities remains a prominent pursuit in artificial intelligence. Here, we propose a generic framework for quantum automated theorem proving, where the intrinsic quantum superposition and entanglement features would lead to potential advantages. In particular, we introduce quantum representations of knowledge bases and propose corresponding reasoning algorithms for a variety of tasks. We show how automated reasoning can be achieved with quantum resolution in both propositional and first-order logic with quadratically reduced query complexity. In addition, we propose the quantum algebraic proving method for geometric theorems, extending Wu’s algebraic approach beyond the classical setting. Through concrete examples, including geometry problems from the International Mathematical Olympiad, we demonstrate how a quantum computer may prove geometric theorems with quadratically better query complexity. Our results establish a primary approach towards building quantum automatic theorem provers, which would be crucial for practical applications of both near-term and future quantum technologies.
[AI-106] Decentralized Online Convex Optimization with Unknown Feedback Delays
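[Quick read]: This paper addresses the performance degradation in decentralized online convex optimization (D-OCO) caused by unknown feedback delays that vary over time and across agents. Existing algorithms typically require prior knowledge of the total delay and depend suboptimally on the delay and network parameters. The key to the solution is a novel adaptive learning-rate mechanism in which, through a decentralized communication protocol, each agent estimates delays locally with a gossip-based strategy, so no prior knowledge of the total delay is needed. This improves the regret bound to $ O\left(N \sqrt{d_{\text{tot}}} + N \sqrt{T} / (1 - \sigma^2)^{1/4}\right) $, where $ d_{\text{tot}} $ is the average total delay, $ N $ is the number of agents, and $ 1 - \sigma^2 $ is the spectral gap of the network. The authors further extend the framework to the strongly convex setting and obtain the sharper bound $ O\left(N \delta_{\text{max}} \ln T / \alpha\right) $, where $ \delta_{\text{max}} $ is the maximum number of missing observations averaged over agents and $ \alpha $ is the strong convexity parameter; both upper bounds are tight up to logarithmic factors.
Link: https://arxiv.org/abs/2601.07901
Authors: Hao Qiu(UNIMI),Mengxiao Zhang,Juliette Achddou(CRIStAL)
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: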
Abstract:Decentralized online convex optimization (D-OCO), where multiple agents within a network collaboratively learn optimal decisions in real-time, arises naturally in applications such as federated learning, sensor networks, and multi-agent control. In this paper, we study D-OCO under unknown, time- and agent-varying feedback delays. While recent work has addressed this problem (Nguyen et al., 2024), existing algorithms assume prior knowledge of the total delay over agents and still suffer from suboptimal dependence on both the delay and network parameters. To overcome these limitations, we propose a novel algorithm that achieves an improved regret bound of $O\left(N \sqrt{d_{\text{tot}}} + N \sqrt{T} / (1 - \sigma^2)^{1/4}\right)$, where $T$ is the total horizon, $d_{\text{tot}}$ denotes the average total delay across agents, $N$ is the number of agents, and $1 - \sigma^2$ is the spectral gap of the network. Our approach builds upon recent advances in D-OCO (Wan et al., 2024a), but crucially incorporates an adaptive learning rate mechanism via a decentralized communication protocol. This enables each agent to estimate delays locally using a gossip-based strategy without the prior knowledge of the total delay. We further extend our framework to the strongly convex setting and derive a sharper regret bound of $O\left(N \delta_{\text{max}} \ln T / \alpha\right)$, where $\alpha$ is the strong convexity parameter and $\delta_{\text{max}}$ is the maximum number of missing observations averaged over agents. We also show that our upper bounds for both settings are tight up to logarithmic factors. Experimental results validate the effectiveness of our approach, showing improvements over existing benchmark algorithms.
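The gossip-based local estimation mentioned in the abstract rests on a standard primitive: agents repeatedly average their values with neighbours through a doubly stochastic weight matrix, and every agent converges to the network-wide mean without any central coordinator. The sketch below shows only that primitive on a four-agent ring; the delay values and weights are illustrative, and the paper's actual protocol for time-varying delays is more involved.

```python
def ring_weights(n):
    """Doubly stochastic gossip matrix for a ring: 1/2 on self, 1/4 on each neighbour."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.5
        W[i][(i - 1) % n] += 0.25
        W[i][(i + 1) % n] += 0.25
    return W

def gossip_round(values, W):
    """Each agent replaces its value with a weighted average over itself and its neighbours."""
    n = len(values)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]

# Hypothetical locally observed delays, one per agent.
local_delays = [3.0, 7.0, 1.0, 5.0]
W = ring_weights(len(local_delays))

estimates = list(local_delays)
for _ in range(60):
    estimates = gossip_round(estimates, W)

print(estimates)   # every agent ends up near the network-wide average delay
```

The convergence speed is governed by the spectral gap of the gossip matrix, which is why a spectral-gap term appears in the regret bound.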
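[AI-107] Tackling Heterogeneity in Quantum Federated Learning: An Integrated Sporadic-Personalized Approach
[Quick read]: This paper addresses poor model-training performance in quantum federated learning (QFL) caused by heterogeneity in quantum noise and in data distributions. Current quantum devices are subject to varying levels of quantum noise, and data across participating devices are typically non-independent and identically distributed (non-IID); both significantly affect the convergence and stability of the global model. The key to the solution is an integrated sporadic-personalized approach, Sporadic-Personalized Quantum Federated Learning (SPQFL), whose core innovations are: (i) a sporadic learning mechanism that mitigates the effect of quantum-noise heterogeneity on training; and (ii) personalized learning through model regularization, which reduces overfitting during local training on non-IID data and improves the convergence of the global model. Theoretical analysis and simulations indicate that SPQFL outperforms existing state-of-the-art methods in training performance and convergence stability.
Link: https://arxiv.org/abs/2601.07882
Authors: Ratun Rahman,Shaba Shaon,Dinh C. Nguyen
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at IEEE Transactions on Computers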
Abstract:Quantum federated learning (QFL) emerges as a powerful technique that combines quantum computing with federated learning to efficiently process complex data across distributed quantum devices while ensuring data privacy in quantum networks. Despite recent research efforts, existing QFL frameworks struggle to achieve optimal model training performance primarily due to inherent heterogeneity in terms of (i) quantum noise where current quantum devices are subject to varying levels of noise due to varying device quality and susceptibility to quantum decoherence, and (ii) heterogeneous data distributions where data across participating quantum devices are naturally non-independent and identically distributed (non-IID). To address these challenges, we propose a novel integrated sporadic-personalized approach called SPQFL that simultaneously handles quantum noise and data heterogeneity in a single QFL framework. It is featured in two key aspects: (i) for quantum noise heterogeneity, we introduce a notion of sporadic learning to tackle quantum noise heterogeneity across quantum devices, and (ii) for quantum data heterogeneity, we implement personalized learning through model regularization to mitigate overfitting during local training on non-IID quantum data distributions, thereby enhancing the convergence of the global model. Moreover, we conduct a rigorous convergence analysis for the proposed SPQFL framework, with both sporadic and personalized learning considerations. Theoretical findings reveal that the upper bound of the SPQFL algorithm is strongly influenced by both the number of quantum devices and the number of quantum noise measurements. Extensive simulation results in real-world datasets also illustrate that the proposed SPQFL approach yields significant improvements in terms of training performance and convergence stability compared to the state-of-the-art methods.
[AI-108] Feature Entanglement-based Quantum Multimodal Fusion Neural Network
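[Quick read]: This paper addresses the accuracy-interpretability-complexity dilemma that classical deep learning faces in multimodal learning: feature-level fusion is accurate but lacks interpretability, decision-level fusion is interpretable but less accurate, and both suffer from parameter explosion and high computational complexity. Its key contribution is a feature entanglement-based quantum multimodal fusion neural network that introduces an interpretable quantum fusion block and a quantum convolutional neural network (QCNN). By exploiting the expressive power of quantum computation, it reduces the complexity of fusion and post-processing to linear while retaining the interpretability of decision-level fusion, and on several multimodal image datasets it achieves classification accuracy comparable to classical networks with dozens of times fewer parameters.
Link: https://arxiv.org/abs/2601.07856
Authors: Yu Wu,Qianli Zhou,Jie Geng,Xinyang Deng,Wen Jiang
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: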
Abstract:Multimodal learning aims to enhance perceptual and decision-making capabilities by integrating information from diverse sources. However, classical deep learning approaches face a critical trade-off between the high accuracy of black-box feature-level fusion and the interpretability of less accurate decision-level fusion, alongside the challenges of parameter explosion and complexity. This paper discusses the accuracy-interpretability-complexity dilemma under the quantum computation framework and proposes a feature entanglement-based quantum multimodal fusion neural network. The model is composed of three core components: a classical feed-forward module for unimodal processing, an interpretable quantum fusion block, and a quantum convolutional neural network (QCNN) for deep feature extraction. By leveraging the strong expressive power of quantum computation, we reduce the complexity of multimodal fusion and post-processing to linear, and the fusion process retains the interpretability of decision-level fusion. The simulation results demonstrate that our model achieves classification accuracy comparable to classical networks that use dozens of times more parameters, exhibiting notable stability and performance across multimodal image datasets.
[AI-109] Immunological Density Shapes Recovery Trajectories in Long COVID
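[Quick read]: This paper seeks to explain why symptoms of Long COVID (post-acute sequelae of SARS-CoV-2 infection) persist, and in particular to separate the natural course of the condition from the effect of vaccination. The study analyzes 97,564 longitudinal assessments from 13,511 participants together with their vaccination histories and identifies three clinical phenotypes: Protected (persistently below threshold), Refractory (persistently symptomatic), and Responders (moving from symptomatic to recovered). The key to the approach is a statistical analysis that separates time-dependent natural progression from vaccine-associated change. Symptom severity rises slightly with elapsed time (r=0.0521), whereas cumulative vaccination is inversely associated with severity (r=-0.0434), suggesting that recovery is associated mainly with repeated immunization rather than spontaneous remission.
Link: https://arxiv.org/abs/2601.07854
Authors: Jing Wang,Tong Zhang,Xing Niu,Jie Shen,Yiming Luo,Qiaomin Xie,Amar Sra,Zorina Galis,Jeremy Weiss
Affiliation: Unknown
Categories: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Comments: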
Abstract:Post-acute sequelae of SARS-CoV-2 infection (Long COVID) frequently persists for months, yet drivers of clinical remission remain incompletely defined. Here we analyzed 97,564 longitudinal PASC assessments from 13,511 participants with linked vaccination histories to disentangle passive temporal progression from vaccine-associated change. Using a clinically validated threshold (PASC $\geq 12$), trajectories separated into three phenotypes: Protected (persistently sub-threshold), Refractory (persistently symptomatic), and Responders (transitioning from symptomatic to recovered). Across the full cohort, symptom severity increased modestly with elapsed time ($r=0.0521$, $P=1.26\times10^{-59}$), whereas cumulative vaccination showed an inverse association with severity ($r=-0.0434$, $P=5.95\times10^{-42}$). In summary, baseline Long COVID severity appears clinically deterministic. In the absence of intervention, symptoms typically persist without spontaneous resolution. Recovery is primarily associated with repeated immunization.
[AI-110] Photometric Redshift Estimation Using Scaled Ensemble Learning
[Quick read]: This paper addresses the accuracy and reliability of photometric redshift (Pz) estimation in astronomy, especially for faint galaxies and high redshifts (z ~ 4). The key to its solution is an ensemble-learning framework that combines gradient boosting machine, extreme gradient boosting, k-nearest neighbors, and artificial neural networks, and uses bagged input data to improve stability and generalization. Using only optical (grizy) photometry, the method improves the accuracy and robustness of Pz estimation and performs well against the benchmarks set in the LSST Science Requirements Document, in particular on catastrophic outliers, bias, and RMS error.
Link: https://arxiv.org/abs/2601.07292
Authors: Swagata Biswas,Shubhrangshu Ghosh,Avyarthana Ghosh,Yogesh Wadadekar,Abhishek Roy Choudhury,Arijit Mukherjee,Shailesh Deshpande,Arpan Pal
Affiliation: Unknown
Categories: Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 2 figures, 3 tables
Abstract:The development of state-of-the-art telescopic systems capable of performing expansive sky surveys such as the Sloan Digital Sky Survey, Euclid, and the Rubin Observatory’s Legacy Survey of Space and Time (LSST) has significantly advanced efforts to refine cosmological models. These advances offer deeper insight into persistent challenges in astrophysics and our understanding of the Universe’s evolution. A critical component of this progress is the reliable estimation of photometric redshifts (Pz). To improve the precision and efficiency of such estimations, the application of machine learning (ML) techniques to large-scale astronomical datasets has become essential. This study presents a new ensemble-based ML framework aimed at predicting Pz for faint galaxies and higher redshift ranges, relying solely on optical (grizy) photometric data. The proposed architecture integrates several learning algorithms, including gradient boosting machine, extreme gradient boosting, k-nearest neighbors, and artificial neural networks, within a scaled ensemble structure. By using bagged input data, the ensemble approach delivers improved predictive performance compared to stand-alone models. The framework demonstrates consistent accuracy in estimating redshifts, maintaining strong performance up to z ~ 4. The model is validated using publicly available data from the Hyper Suprime-Cam Strategic Survey Program by the Subaru Telescope. Our results show marked improvements in the precision and reliability of Pz estimation. Furthermore, this approach closely adheres to, and in certain instances exceeds, the benchmarks specified in the LSST Science Requirements Document. Evaluation metrics include catastrophic outliers, bias, and RMS.
Machine Learning
[LG-0] Fast and explainable clustering in the Manhattan and Tanimoto distance
Link: https://arxiv.org/abs/2601.08781
Authors: Stefan Güttel,Kaustubh Roy
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor–Butina algorithm, and about 80 times faster than DBSCAN, while computing higher-quality clusters in both cases.
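The pruning idea in this abstract (sort by a norm, then use the triangle inequality to cut the search short) can be sketched in a few lines. This is only an illustration under stated assumptions, not the CLASSIX implementation: the function names `build_index` and `neighbors_within` are hypothetical, and the sharper Tanimoto inequality mentioned in the abstract is not shown. It relies on the reverse triangle inequality for the Manhattan distance: any point within radius r of a query must have an L1 norm within r of the query's L1 norm.

```python
from bisect import bisect_left, bisect_right


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l1_norm(a):
    """L1 norm, used here as the sorting criterion (an assumption for this sketch)."""
    return sum(abs(x) for x in a)


def build_index(points):
    """Sort points by L1 norm and return (sorted norms, sorted points)."""
    ordered = sorted(points, key=l1_norm)
    return [l1_norm(p) for p in ordered], ordered


def neighbors_within(query, radius, norms, ordered):
    """Return all indexed points within `radius` of `query`.

    Since | ||x||_1 - ||y||_1 | <= ||x - y||_1, only points whose norm lies in
    [||q|| - radius, ||q|| + radius] can qualify, so the scan is restricted
    to that contiguous window of the sorted list.
    """
    q = l1_norm(query)
    lo = bisect_left(norms, q - radius)
    hi = bisect_right(norms, q + radius)
    return [p for p in ordered[lo:hi] if manhattan(query, p) <= radius]


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 2), (3, 3), (-1, 1), (5, -5)]
    norms, ordered = build_index(pts)
    print(neighbors_within((0, 1), 2, norms, ordered))
```

The window only removes candidates that cannot be neighbours, so the result matches a brute-force scan; how much work it saves depends on how spread out the norms are.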
[LG-1] Adaptive Requesting in Decentralized Edge Networks via Non-Stationary Bandits
Link: https://arxiv.org/abs/2601.08760
Authors: Yi Zhuang,Kun Yang,Xingran Chen
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:We study a decentralized collaborative requesting problem that aims to optimize the information freshness of time-sensitive clients in edge networks consisting of multiple clients, access nodes (ANs), and servers. Clients request content through ANs acting as gateways, without observing AN states or the actions of other clients. We define the reward as the age of information reduction resulting from a client’s selection of an AN, and formulate the problem as a non-stationary multi-armed bandit. In this decentralized and partially observable setting, the resulting reward process is history-dependent and coupled across clients, and exhibits both abrupt and gradual changes in expected rewards, rendering classical bandit-based approaches ineffective. To address these challenges, we propose the AGING BANDIT WITH ADAPTIVE RESET algorithm, which combines adaptive windowing with periodic monitoring to track evolving reward distributions. We establish theoretical performance guarantees showing that the proposed algorithm achieves near-optimal performance, and we validate the theoretical results through simulations.
[LG-2] A Novel Approach to Explainable AI with Quantized Active Ingredients in Decision Making
Link: https://arxiv.org/abs/2601.08733
Authors: A.M.A.S.D. Alagiyawanna,Asoka Karunananda,Thushari Silva,A. Mahasinghe
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: Accepted and published in IEEE 2025. This is the authors' manuscript version; final version available at IEEE Xplore: this https URL
Abstract:Artificial Intelligence (AI) systems have shown good success at classifying. However, the lack of explainability is a true and significant challenge, especially in high-stakes domains, such as health and finance, where understanding is paramount. We propose a new solution to this challenge: an explainable AI framework based on our comparative study with Quantum Boltzmann Machines (QBMs) and Classical Boltzmann Machines (CBMs). We leverage principles of quantum computing within classical machine learning to provide substantive transparency around decision-making. The design involves training both models on a binarised and dimensionally reduced MNIST dataset, where Principal Component Analysis (PCA) is applied for preprocessing. For interpretability, we employ gradient-based saliency maps in QBMs and SHAP (SHapley Additive exPlanations) in CBMs to evaluate feature attributions. QBMs deploy hybrid quantum-classical circuits with strongly entangling layers, allowing for richer latent representations, whereas CBMs serve as a classical baseline that utilises contrastive divergence. Along the way, we found that QBMs outperformed CBMs on classification accuracy (83.5% vs. 54%) and had more concentrated distributions in feature attributions as quantified by entropy (1.27 vs. 1.39). In other words, QBMs not only produced better predictive performance than CBMs, but they also provided clearer identification of the “active ingredients”, the most important features behind model predictions. To conclude, our results illustrate that quantum-classical hybrid models can display improvements in both accuracy and interpretability, which leads us toward more trustworthy and explainable AI systems.
[LG-3] Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts
Link: https://arxiv.org/abs/2601.08726
Authors: Bert Verbruggen,Arne Vanhoyweghen,Vincent Ginis
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement Learning (RL) remains a central optimisation framework in machine learning. Although RL agents can converge to optimal solutions, the definition of “optimality” depends on the environment’s statistical properties. The Bellman equation, central to most RL algorithms, is formulated in terms of expected values of future rewards. However, when ergodicity is broken, long-term outcomes depend on the specific trajectory rather than on the ensemble average. In such settings, the ensemble average diverges from the time-average growth experienced by individual agents, with expected-value formulations yielding systematically suboptimal policies. Prior studies demonstrated that traditional RL architectures fail to recover the true optimum in non-ergodic environments. We extend this analysis to deep RL implementations and show that these, too, produce suboptimal policies under non-ergodic dynamics. Introducing explicit time dependence into the learning process can correct this limitation. By allowing the network’s function approximation to incorporate temporal information, the agent can estimate value functions consistent with the process’s intrinsic growth rate. This improvement does not require altering the environmental feedback, such as reward transformations or modified objective functions, but arises naturally from the agent’s exposure to temporal trajectories. Our results contribute to the growing body of research on reinforcement learning methods for non-ergodic systems.
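The gap between the ensemble average and the time-average growth can be made concrete with a standard multiplicative coin-toss gamble. The numbers below (wealth multiplied by 1.5 on heads and by 0.6 on tails, with equal probability) are illustrative assumptions and do not come from the paper; the sketch only shows why an expected-value criterion and a long-run growth criterion can disagree.

```python
import math

# Illustrative gamble (assumed values, not taken from the paper).
UP, DOWN, P_UP = 1.5, 0.6, 0.5


def ensemble_growth_factor(up=UP, down=DOWN, p=P_UP):
    """Expected per-round multiplier, i.e. the ensemble average."""
    return p * up + (1 - p) * down


def time_average_growth_factor(up=UP, down=DOWN, p=P_UP):
    """Per-round multiplier experienced along one long trajectory.

    This is the geometric mean exp(E[log multiplier]).
    """
    return math.exp(p * math.log(up) + (1 - p) * math.log(down))


if __name__ == "__main__":
    print(f"ensemble average per round: {ensemble_growth_factor():.4f}")
    print(f"time average per round:     {time_average_growth_factor():.4f}")
```

Here the expected multiplier is above 1 (about 1.05) while the time-average multiplier is below 1 (about 0.949), so an agent maximizing expected value would accept a gamble that shrinks almost every individual trajectory over time.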
[LG-4] Soft Partition-based KAPI-ELM for Multi-Scale PDEs
Link: https://arxiv.org/abs/2601.08719
Authors: Vikas Dwivedi,Monica Sigovan,Bruno Sixou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Physics-informed machine learning holds great promise for solving differential equations, yet existing methods struggle with highly oscillatory, multiscale, or singularly perturbed PDEs due to spectral bias, costly backpropagation, and manually tuned kernel or Fourier frequencies. This work introduces a soft partition-based Kernel-Adaptive Physics-Informed Extreme Learning Machine (KAPI-ELM), a deterministic low-dimensional parameterization in which smooth partition lengths jointly control collocation centers and Gaussian kernel widths, enabling continuous coarse-to-fine resolution without Fourier features, random sampling, or hard domain interfaces. A signed-distance-based weighting further stabilizes least-squares learning on irregular geometries. Across eight benchmarks (including oscillatory ODEs, high-frequency Poisson equations, irregular-shaped domains, and stiff singularly perturbed convection-diffusion problems), the proposed method matches or exceeds the accuracy of state-of-the-art Physics-Informed Neural Network (PINN) and Theory of Functional Connections (TFC) variants while using only a single linear solve. Although demonstrated on steady linear PDEs, the results show that soft-partition kernel adaptation provides a fast, architecture-free approach for multiscale PDEs with broad potential for future physics-informed modeling. For reproducibility, the reference codes are available at this https URL
[LG-5] Multi-Preconditioned LBFGS for Training Finite-Basis PINNs
Link: https://arxiv.org/abs/2601.08709
Authors: Marc Salvadó-Benasco,Aymane Kssim,Alexander Heinlein,Rolf Krause,Serge Gratton,Alena Kopaničáková
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 13 pages
Abstract:A multi-preconditioned LBFGS (MP-LBFGS) algorithm is introduced for training finite-basis physics-informed neural networks (FBPINNs). The algorithm is motivated by the nonlinear additive Schwarz method and exploits the domain-decomposition-inspired additive architecture of FBPINNs, in which local neural networks are defined on subdomains, thereby localizing the network representation. Parallel, subdomain-local quasi-Newton corrections are then constructed on the corresponding local parts of the architecture. A key feature is a novel nonlinear multi-preconditioning mechanism, in which subdomain corrections are optimally combined through the solution of a low-dimensional subspace minimization problem. Numerical experiments indicate that MP-LBFGS can improve convergence speed, as well as model accuracy over standard LBFGS while incurring lower communication overhead.
[LG-6] RMBRec: Robust Multi-Behavior Recommendation towards Target Behaviors
Link: https://arxiv.org/abs/2601.08705
Authors: Miaomiao Cai,Zhijie Zhang,Junfeng Fang,Zhiyong Cheng,Xiang Wang,Meng Wang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Multi-behavior recommendation faces a critical challenge in practice: auxiliary behaviors (e.g., clicks, carts) are often noisy, weakly correlated, or semantically misaligned with the target behavior (e.g., purchase), which leads to biased preference learning and suboptimal performance. While existing methods attempt to fuse these heterogeneous signals, they inherently lack a principled mechanism to ensure robustness against such behavioral inconsistency. In this work, we propose Robust Multi-Behavior Recommendation towards Target Behaviors (RMBRec), a robust multi-behavior recommendation framework grounded in an information-theoretic robustness principle. We interpret robustness as a joint process of maximizing predictive information while minimizing its variance across heterogeneous behavioral environments. Under this perspective, the Representation Robustness Module (RRM) enhances local semantic consistency by maximizing the mutual information between users’ auxiliary and target representations, whereas the Optimization Robustness Module (ORM) enforces global stability by minimizing the variance of predictive risks across behaviors, which is an efficient approximation to invariant risk minimization. This local-global collaboration bridges representation purification and optimization invariance in a theoretically coherent way. Extensive experiments on three real-world datasets demonstrate that RMBRec not only outperforms state-of-the-art methods in accuracy but also maintains remarkable stability under various noise perturbations. For reproducibility, our code is available at this https URL.
[LG-7] Enabling Population-Based Architectures for Neural Combinatorial Optimization
Link: https://arxiv.org/abs/2601.08696
Authors: Andoni Irazusta Garmendia,Josu Ceberio,Alexander Mendiburu
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract:Neural Combinatorial Optimization (NCO) has mostly focused on learning policies, typically neural networks, that operate on a single candidate solution at a time, either by constructing one from scratch or iteratively improving it. In contrast, decades of work in metaheuristics have shown that maintaining and evolving populations of solutions improves robustness and exploration, and often leads to stronger performance. To close this gap, we study how to make NCO explicitly population-based by learning policies that act on sets of candidate solutions. We first propose a simple taxonomy of population awareness levels and use it to highlight two key design challenges: (i) how to represent a whole population inside a neural network, and (ii) how to learn population dynamics that balance intensification (generating good solutions) and diversification (maintaining variety). We make these ideas concrete with two complementary tools: one that improves existing solutions using information shared across the whole population, and another that generates new candidate solutions that explicitly balance quality with diversity. Experimental results on Maximum Cut and Maximum Independent Set indicate that incorporating population structure is advantageous for learned optimization methods and opens new connections between NCO and classical population-based search.
[LG-8] Provably Safe Reinforcement Learning using Entropy Regularizer
Link: https://arxiv.org/abs/2601.08646
Authors: Abhijit Mazumdar,Rafal Wisniewski,Manuela L. Bujorianu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.
[LG-9] EviNAM: Intelligibility and Uncertainty via Evidential Neural Additive Models
Link: https://arxiv.org/abs/2601.08556
Authors: Sören Schleibaum,Anton Frederik Thielmann,Julian Teusch,Benjamin Säfken,Jörg P. Müller
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Intelligibility and accurate uncertainty estimation are crucial for reliable decision-making. In this paper, we propose EviNAM, an extension of evidential learning that integrates the interpretability of Neural Additive Models (NAMs) with principled uncertainty estimation. Unlike standard Bayesian neural networks and previous evidential methods, EviNAM enables, in a single pass, both the estimation of the aleatoric and epistemic uncertainty as well as explicit feature contributions. Experiments on synthetic and real data demonstrate that EviNAM matches state-of-the-art predictive performance. While we focus on regression, our method extends naturally to classification and generalized additive models, offering a path toward more intelligible and trustworthy predictions.
[LG-10] Reducing Compute Waste in LLMs through Kernel-Level DVFS
Link: https://arxiv.org/abs/2601.08539
Authors: Jeffrey Spaan,Kuan-Hsun Chen,Ana-Lucia Varbanescu
Categories: Performance (cs.PF); Machine Learning (cs.LG)
Comments:
Abstract:The rapid growth of AI has fueled the expansion of accelerator- or GPU-based data centers. However, the rising operational energy consumption has emerged as a critical bottleneck and a major sustainability concern. Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique used to reduce energy consumption, and thus improve energy efficiency, since it requires little effort and works with existing hardware. Reducing the energy consumption of training and inference of Large Language Models (LLMs) through DVFS or power capping is feasible: related work has shown energy savings can be significant, but at the cost of significant slowdowns. In this work, we focus on reducing waste in LLM operations, i.e., reducing energy consumption without losing performance. We propose a fine-grained, kernel-level DVFS approach that explores new frequency configurations, and prove these save more energy than previous pass- or iteration-level solutions. For example, for a GPT-3 training run, a pass-level approach could reduce energy consumption by 2% (without losing performance), while our kernel-level approach saves as much as 14.6% (with a 0.6% slowdown). We further investigate the effect of data and tensor parallelism, and show our discovered clock frequencies translate well for both. We conclude that kernel-level DVFS is a suitable technique to reduce waste in LLM operations, providing significant energy savings with negligible slowdown.
[LG-11] Sampling via Stochastic Interpolants by Langevin-based Velocity and Initialization Estimation in Flow ODEs
Link: https://arxiv.org/abs/2601.08527
Authors: Chenguang Duan,Yuling Jiao,Gabriele Steidl,Christian Wald,Jerry Zhijian Yang,Ruizhe Zhang
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments:
Abstract:We propose a novel method for sampling from unnormalized Boltzmann densities based on a probability-flow ordinary differential equation (ODE) derived from linear stochastic interpolants. The key innovation of our approach is the use of a sequence of Langevin samplers to enable efficient simulation of the flow. Specifically, these Langevin samplers are employed (i) to generate samples from the interpolant distribution at intermediate times and (ii) to construct, starting from these intermediate times, a robust estimator of the velocity field governing the flow ODE. For both applications of the Langevin diffusions, we establish convergence guarantees. Extensive numerical experiments demonstrate the efficiency of the proposed method on challenging multimodal distributions across a range of dimensions, as well as its effectiveness in Bayesian inference tasks.
[LG-12] Supervised Spike Agreement Dependent Plasticity for Fast Local Learning in Spiking Neural Networks
Link: https://arxiv.org/abs/2601.08526
Authors: Gouri Lakshmi S,Athira Chandrasekharan,Harshit Kumar,Muhammed Sahad E,Bikas C Das,Saptarshi Bej
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract:Spike-Timing-Dependent Plasticity (STDP) provides a biologically grounded learning rule for spiking neural networks (SNNs), but its reliance on precise spike timing and pairwise updates limits fast learning of weights. We introduce a supervised extension of Spike Agreement-Dependent Plasticity (SADP), which replaces pairwise spike-timing comparisons with population-level agreement metrics such as Cohen’s kappa. The proposed learning rule preserves strict synaptic locality, admits linear-time complexity, and enables efficient supervised learning without backpropagation, surrogate gradients, or teacher forcing. We integrate supervised SADP within hybrid CNN-SNN architectures, where convolutional encoders provide compact feature representations that are converted into Poisson spike trains for agreement-driven learning in the SNN. Extensive experiments on MNIST, Fashion-MNIST, CIFAR-10, and biomedical image classification tasks demonstrate competitive performance and fast convergence. Additional analyses show stable performance across broad hyperparameter ranges and compatibility with device-inspired synaptic update dynamics. Together, these results establish supervised SADP as a scalable, biologically grounded, and hardware-aligned learning paradigm for spiking neural networks.
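For readers unfamiliar with the agreement metric named in this abstract, the following is a minimal sketch of Cohen's kappa computed between two binary spike trains of equal length. It shows only the textbook metric; how the paper turns agreement into a synaptic update is not reproduced here, and returning 0.0 in the degenerate case is a convention assumed for this sketch.

```python
def cohens_kappa(train_a, train_b):
    """Cohen's kappa between two binary (0/1) spike trains of equal length.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from the two firing rates.
    """
    if len(train_a) != len(train_b) or not train_a:
        raise ValueError("spike trains must be non-empty and of equal length")
    n = len(train_a)
    p_o = sum(a == b for a, b in zip(train_a, train_b)) / n
    rate_a = sum(train_a) / n
    rate_b = sum(train_b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        # Both trains are constant and identical; kappa is undefined.
        # Returning 0.0 here is an assumption of this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    pre = [1, 0, 1, 1, 0, 0, 1, 0]
    post = [1, 0, 1, 0, 0, 0, 1, 1]
    print(cohens_kappa(pre, post))
```

Because the metric needs only the two spike trains at a synapse, it can be evaluated locally in time linear in the train length, which is consistent with the locality and linear-time properties the abstract describes.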
[LG-13] Your Group-Relative Advantage Is Biased
Link: https://arxiv.org/abs/2601.08521
Authors: Fengkai Yang,Zherui Chen,Xiaohan Wang,Xiaodong Lu,Jiajun Chai,Guojun Yin,Wei Lin,Shuai Ma,Fuzhen Zhuang,Deqing Wang,Yaodong Yang,Jianxin Li,Yikun Ban
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
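As background for the estimator discussed in this abstract, here is a minimal sketch of a group-relative advantage in its commonly used form: each sampled response's reward is centred by the group mean and scaled by the group standard deviation. This is an assumed baseline form, not the paper's HA-DW correction; the function name and the `eps` stabilizer are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own sampling group.

    rewards: rewards for several responses sampled for the SAME prompt.
    The group mean replaces a learned critic as the baseline, and the
    population standard deviation normalizes the scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four responses to one prompt, scored 1 (verified correct) or 0.
    print(group_relative_advantages([1, 0, 0, 1]))
    # A prompt where every response fails gives no learning signal.
    print(group_relative_advantages([0, 0, 0, 0]))
```

The second call shows one practical consequence of using only within-group statistics: when all responses receive the same reward, every advantage is zero regardless of how hard the prompt is.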
[LG-14] GraphFusionSBR: Denoising Multi-Channel Graphs for Session-Based Recommendation
Link: https://arxiv.org/abs/2601.08497
Authors: Jia-Xin He,Hung-Hsuan Chen
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Session-based recommendation systems must capture implicit user intents from sessions. However, existing models suffer from issues such as item interaction dominance and noisy sessions. We propose a multi-channel recommendation model, including a knowledge graph channel, a session hypergraph channel, and a session line graph channel, to capture information from multiple sources. Our model adaptively removes redundant edges in the knowledge graph channel to reduce noise. Knowledge graph representations cooperate with hypergraph representations for prediction to alleviate item dominance. We also generate in-session attention for denoising. Finally, we maximize mutual information between the hypergraph and line graph channels as an auxiliary task. Experiments demonstrate that our method enhances the accuracy of various recommendations, including e-commerce and multimedia recommendations. We release the code on GitHub for reproducibility at this https URL
[LG-15] AUV Trajectory Learning for Underwater Acoustic Energy Transfer and Age Minimization
Link: https://arxiv.org/abs/2601.08491
Authors: Mohamed Afouene Melki,Mohammad Shehab,Mohamed-Slim Alouini
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:The Internet of Underwater Things (IoUT) is attracting growing attention for monitoring sea life and the deep-ocean environment, underwater surveillance, and maintenance of underwater installations. However, conventional IoUT devices, reliant on battery power, face limitations in lifespan and pose environmental hazards upon disposal. This paper introduces a sustainable approach for simultaneous information uplink from the IoUT devices and acoustic energy transfer (AET) to the devices via an autonomous underwater vehicle (AUV), potentially enabling them to operate indefinitely. To account for time sensitivity and fairness, we adopt age of information (AoI) and Jain’s fairness index. We develop two deep-reinforcement learning (DRL) algorithms, offering a high-complexity, high-performance frequency division duplex (FDD) solution and a low-complexity, medium-performance time division duplex (TDD) approach. The results show that the proposed FDD and TDD solutions significantly reduce the average AoI and boost the harvested energy as well as data collection fairness compared to baseline approaches.
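Jain's fairness index, mentioned in this abstract, has a simple closed form: for non-negative allocations x_1..x_n it is (sum of x_i)^2 divided by n times the sum of x_i^2. It ranges from 1/n (one device gets everything) to 1 (all devices get the same). The sketch below is a generic implementation; treating the inputs as "data collected per device" is an assumption made for illustration, not a detail taken from the paper.

```python
def jain_fairness_index(allocations):
    """Jain's fairness index of a list of non-negative allocations.

    J = (sum x_i)^2 / (n * sum x_i^2), with 1/n <= J <= 1.
    """
    n = len(allocations)
    sum_sq = sum(x * x for x in allocations)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero allocation")
    total = sum(allocations)
    return total * total / (n * sum_sq)


if __name__ == "__main__":
    # Hypothetical amounts of data collected from four underwater devices.
    print(jain_fairness_index([1, 1, 1, 1]))  # perfectly even
    print(jain_fairness_index([4, 0, 0, 0]))  # one device served only
```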
[LG-16] Coverage Improvement and Fast Convergence of On-policy Preference Learning
Link: https://arxiv.org/abs/2601.08421
Authors: Juno Kim,Jihun Yun,Jason D. Lee,Kwang-Sung Jun
Categories: Machine Learning (cs.LG)
Comments: 46 pages, 2 figures, 2 tables
Abstract:Online on-policy preference learning algorithms for language model alignment such as online direct policy optimization (DPO) can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy’s coverage evolves throughout on-policy training. We propose and rigorously justify the coverage improvement principle: with sufficient batch size, each update moves into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations for batch size exceeding a generalized coverage threshold. In contrast, any learner restricted to offline samples from the initial policy suffers a slower minimax rate, leading to a sharp separation in total sample complexity. Motivated by this analysis, we further propose a simple hybrid sampler based on a novel preferential G-optimal design, which removes dependence on coverage and guarantees convergence in just two rounds. Finally, we develop principled on-policy schemes for reward distillation in the general function class setting, and show faster noiseless rates under an alternative deviation-based notion of coverage. Experimentally, we confirm that on-policy DPO and our proposed reward distillation algorithms outperform their off-policy counterparts and enjoy stable, monotonic performance gains across iterations.
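The Bradley-Terry preference model referenced in this abstract assigns the probability that one response is preferred over another as a logistic function of their reward difference. The sketch below shows only that standard model; the reward values are made up for illustration and nothing here reproduces the paper's analysis or algorithms.

```python
import math


def bradley_terry_prob(reward_a, reward_b):
    """Probability that response A is preferred over response B.

    Under the Bradley-Terry model this is sigmoid(reward_a - reward_b).
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


if __name__ == "__main__":
    # Hypothetical scalar rewards for two candidate responses.
    print(bradley_terry_prob(2.0, 1.0))
    print(bradley_terry_prob(1.0, 2.0))
```

Only the difference of rewards matters, so adding the same constant to both rewards leaves the preference probability unchanged.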
[LG-17] Out-of-distribution generalization of deep-learning surrogates for 2D PDE-generated dynamics in the small-data regime
Link: https://arxiv.org/abs/2601.08404
Authors: Binh Duong Nguyen,Stefan Sandfeld
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Partial differential equations (PDEs) are a central tool for modeling the dynamics of physical, engineering, and materials systems, but high-fidelity simulations are often computationally expensive. At the same time, many scientific applications can be viewed as the evolution of spatially distributed fields, making data-driven forecasting of such fields a core task in scientific machine learning. In this work we study autoregressive deep-learning surrogates for two-dimensional PDE dynamics on periodic domains, focusing on generalization to out-of-distribution initial conditions within a fixed PDE and parameter regime and on strict small-data settings with at most $\mathcal{O}(10^2)$ simulated trajectories per system. We introduce a multi-channel U-Net […], evaluate it on five qualitatively different PDE families and compare it to ViT, AFNO, PDE-Transformer, and KAN-UNet under a common training setup. Across all datasets, me-UNet matches or outperforms these more complex architectures in terms of field-space error, spectral similarity, and physics-based metrics for in-distribution rollouts, while requiring substantially less training time. It also generalizes qualitatively to unseen initial conditions with as few as $\approx 20$ training simulations. A data-efficiency study and Grad-CAM analysis further suggest that, in small-data periodic 2D PDE settings, convolutional architectures with inductive biases aligned to locality and periodic boundary conditions remain strong contenders for accurate and moderately out-of-distribution-robust surrogate modeling.
[LG-18] Decodable but not structured: linear probing enables Underwater Acoustic Target Recognition with pretrained audio embeddings
Link: https://arxiv.org/abs/2601.08358
Authors: Hilde I. Hummel,Sandjai Bhulai,Rob D. van der Mei,Burooj Ghani
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Increasing levels of anthropogenic noise from ships contribute significantly to underwater sound pollution, posing risks to marine ecosystems. This makes monitoring crucial to understand and quantify the impact of ship-radiated noise. Passive Acoustic Monitoring (PAM) systems are widely deployed for this purpose, generating years of underwater recordings across diverse soundscapes. Manual analysis of such large-scale data is impractical, motivating the need for automated approaches based on machine learning. Recent advances in automatic Underwater Acoustic Target Recognition (UATR) have largely relied on supervised learning, which is constrained by the scarcity of labeled data. Transfer Learning (TL) offers a promising alternative to mitigate this limitation. In this work, we conduct the first empirical comparative study of transfer learning for UATR, evaluating multiple pretrained audio models originating from diverse audio domains. The pretrained model weights are frozen, and the resulting embeddings are analyzed through classification, clustering, and similarity-based evaluations. The analysis shows that the geometrical structure of the embedding space is largely dominated by recording-specific characteristics. However, a simple linear probe can effectively suppress this recording-specific information and isolate ship-type features from these embeddings. As a result, linear probing enables effective automatic UATR using pretrained audio models at low computational cost, significantly reducing the need for large amounts of high-quality labeled ship recordings.
[LG-19] MLPlatt: Simple Calibration Framework for Ranking Models
Link: https://arxiv.org/abs/2601.08345
Authors: Piotr Bajger,Roman Dusek,Krzysztof Galias,Paweł Młyniec,Aleksander Wawer,Paweł Zawistowski
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Ranking models are extensively used in e-commerce for relevance estimation. These models often suffer from poor interpretability and no scale calibration, particularly when trained with typical ranking loss functions. This paper addresses the problem of post-hoc calibration of ranking models. We introduce MLPlatt: a simple yet effective ranking model calibration method that preserves the item ordering and converts ranker outputs to interpretable click-through rate (CTR) probabilities usable in downstream tasks. The method is context-aware by design and achieves good calibration metrics both globally and within strata corresponding to different values of a selected categorical field (such as user country or device), which is often important from the business perspective of an e-commerce platform. We demonstrate the superiority of MLPlatt over existing approaches on two datasets, achieving an improvement of over 10% in F-ECE (Field Expected Calibration Error) compared to other methods. Most importantly, we show that high-quality calibration can be achieved without compromising the ranking quality.
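The abstract does not spell out how MLPlatt works, so the following is only a sketch of classic Platt scaling, the textbook baseline the name alludes to, and not the authors' context-aware method. It fits a two-parameter sigmoid on ranker scores; because the sigmoid is monotone whenever the slope is positive, item ordering is preserved. The synthetic scores, the "true" click model, and the names `fit_platt` and `calibrate` are all assumptions made for illustration.

```python
import numpy as np

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        err = p - labels  # gradient of the log loss with respect to the logit
        a -= lr * np.mean(err * scores)
        b -= lr * np.mean(err)
    return a, b

def calibrate(scores, a, b):
    """Map raw ranker scores to probabilities with the fitted sigmoid."""
    return 1.0 / (1.0 + np.exp(-(a * scores + b)))

# Hypothetical data: raw scores and clicks drawn from an assumed click model.
rng = np.random.default_rng(0)
scores = rng.normal(size=5000)
true_p = 1.0 / (1.0 + np.exp(-(2.0 * scores - 1.0)))
labels = (rng.random(5000) < true_p).astype(float)

a, b = fit_platt(scores, labels)
probs = calibrate(scores, a, b)
```

A per-field variant (for example one sigmoid per country) would be one simple way to approach calibration within strata, but that is a guess and not something the abstract states.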
[LG-20] Automated Machine Learning in Radiomics: A Comparative Evaluation of Performance Efficiency and Accessibility
Link: https://arxiv.org/abs/2601.08334
Authors: Jose Lozano-Montoya,Emilio Soria-Olivas,Almudena Fuster-Matanzo,Angel Alberich-Bayarri,Ana Jimenez-Pastor
Subjects: Machine Learning (cs.LG)
Comments: 27 pages, 4 figures, 3 tables, code available, see this https URL
Abstract:Automated machine learning (AutoML) frameworks can lower technical barriers for predictive and prognostic model development in radiomics by enabling researchers without programming expertise to build models. However, their effectiveness in addressing radiomics-specific challenges remains unclear. This study evaluates the performance, efficiency, and accessibility of general-purpose and radiomics-specific AutoML frameworks on diverse radiomics classification tasks, thereby highlighting development needs for radiomics. Ten public/private radiomics datasets with varied imaging modalities (CT/MRI), sizes, anatomies and endpoints were used. Six general-purpose and five radiomics-specific frameworks were tested with predefined parameters using standardized cross-validation. Evaluation metrics included AUC, runtime, together with qualitative aspects related to software status, accessibility, and interpretability. Simplatab, a radiomics-specific tool with a no-code interface, achieved the highest average test AUC (81.81%) with a moderate runtime (~1 hour). LightAutoML, a general-purpose framework, showed the fastest execution with competitive performance (78.74% mean AUC in six minutes). Most radiomics-specific frameworks were excluded from the performance analysis due to obsolescence, extensive programming requirements, or computational inefficiency. Conversely, general-purpose frameworks demonstrated higher accessibility and ease of implementation. Simplatab provides an effective balance of performance, efficiency, and accessibility for radiomics classification problems. However, significant gaps remain, including the lack of accessible survival analysis support and the limited integration of feature reproducibility and harmonization within current AutoML frameworks. Future research should focus on adapting AutoML solutions to better address these radiomics-specific challenges.
[LG-21] AgriLens: Semantic Retrieval in Agricultural Texts Using Topic Modeling and Language Models
Link: https://arxiv.org/abs/2601.08283
Authors: Heba Shakeel,Tanvir Ahmad,Tanya Liyaqat,Chandni Saxena
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 8 Pages, 1st workshop on Democratizing GenAI and Scalable NLP with HiPC for Societal Impact; 32nd IEEE International Conference on High Performance Computing, Data, Analytics
Abstract:As the volume of unstructured text continues to grow across domains, there is an urgent need for scalable methods that enable interpretable organization, summarization, and retrieval of information. This work presents a unified framework for interpretable topic modeling, zero-shot topic labeling, and topic-guided semantic retrieval over large agricultural text corpora. Leveraging BERTopic, we extract semantically coherent topics. Each topic is converted into a structured prompt, enabling a language model to generate meaningful topic labels and summaries in a zero-shot manner. Querying and document exploration are supported via dense embeddings and vector search, while a dedicated evaluation module assesses topical coherence and bias. This framework supports scalable and interpretable information access in specialized domains where labeled data is limited.
[LG-22] A Usable GAN-Based Tool for Synthetic ECG Generation in Cardiac Amyloidosis Research
Link: https://arxiv.org/abs/2601.08260
Authors: Francesco Speziale,Ugo Lomoio,Fabiola Boccuto,Pierangelo Veltri,Pietro Hiram Guzzi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Cardiac amyloidosis (CA) is a rare and underdiagnosed infiltrative cardiomyopathy, and available datasets for machine-learning models are typically small, imbalanced and heterogeneous. This paper presents a Generative Adversarial Network (GAN) and a graphical command-line interface for generating realistic synthetic electrocardiogram (ECG) beats to support early diagnosis and patient stratification in CA. The tool is designed for usability, allowing clinical researchers to train class-specific generators once and then interactively produce large volumes of labelled synthetic beats that preserve the distribution of minority classes.
[LG-23] LDLT L-Lipschitz Network Weight Parameterization Initialization
Link: https://arxiv.org/abs/2601.08253
Authors: Marius F. R. Juston,Ramavarapu S. Sreenivas,Dustin Nottage,Ahmet Soylemezoglu
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 17 figures
Abstract:We analyze initialization dynamics for LDLT-based $\mathcal{L}$-Lipschitz layers by deriving the exact marginal output variance when the underlying parameter matrix $W_0 \in \mathbb{R}^{m\times n}$ is initialized with IID Gaussian entries $\mathcal{N}(0,\sigma^2)$. The Wishart distribution, $S = W_0 W_0^\top \sim \mathcal{W}_m(n, \sigma^2 \boldsymbol{I}_m)$, used for computing the output marginal variance is derived in closed form using expectations of zonal polynomials via James’ theorem and a Laplace-integral expansion of $(\alpha \boldsymbol{I}_m + S)^{-1}$. We develop an Isserlis/Wick-based combinatorial expansion for $\mathbb{E}\left[\operatorname{tr}(S^k)\right]$ and provide explicit truncated moments up to $k=10$, which yield accurate series approximations for small-to-moderate $\sigma^2$. Monte Carlo experiments confirm the theoretical estimates. Furthermore, empirical analysis was performed to quantify that, using current He or Kaiming initialization with scaling $1/\sqrt{n}$, the output variance is $0.41$, whereas the new parameterization with $10/\sqrt{n}$ for $\alpha=1$ results in an output variance of $0.9$. The findings clarify why deep $\mathcal{L}$-Lipschitz networks suffer rapid information loss at initialization and offer practical prescriptions for choosing initialization hyperparameters to mitigate this effect. However, using the Higgs boson classification dataset, a hyperparameter sweep over optimizers, initialization scale, and depth was conducted to validate the results on real-world data, showing that although the derivation ensures variance preservation, empirical results indicate He initialization still performs better.
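As a small illustration of the kind of Monte Carlo check mentioned in the abstract, the sketch below verifies only the simplest moment, the k = 1 case. For IID Gaussian entries with variance σ², each diagonal entry of S = W_0 W_0^T is a sum of n squared entries, so E[tr(S)] = m·n·σ². The dimensions, σ, and sample count are arbitrary choices for this example and are not taken from the paper.

```python
import numpy as np

# Illustrative sizes only (assumptions, not the paper's settings).
m, n, sigma = 4, 8, 0.5
num_samples = 20000

rng = np.random.default_rng(0)
# Draw a batch of parameter matrices W_0 with IID N(0, sigma^2) entries.
W0 = rng.normal(scale=sigma, size=(num_samples, m, n))
# Batched Wishart samples S = W_0 W_0^T, each of shape (m, m).
S = W0 @ W0.transpose(0, 2, 1)

mc_trace = np.trace(S, axis1=1, axis2=2).mean()  # Monte Carlo estimate of E[tr(S)]
exact_trace = m * n * sigma**2                   # closed form for k = 1
print(mc_trace, exact_trace)
```

Higher moments E[tr(S^k)] could be estimated the same way with `np.linalg.matrix_power`, although their closed forms are the paper's contribution and are not reproduced here.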
[LG-24] Incorporating Cognitive Biases into Reinforcement Learning for Financial Decision-Making
Link: https://arxiv.org/abs/2601.08247
Authors: Liu He
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM)
Comments: 15 pages, 9 figures
Abstract:Financial markets are influenced by human behavior that deviates from rationality due to cognitive biases. Traditional reinforcement learning (RL) models for financial decision-making assume rational agents, potentially overlooking the impact of psychological factors. This study integrates cognitive biases into RL frameworks for financial trading, hypothesizing that such models can exhibit human-like trading behavior and achieve better risk-adjusted returns than standard RL agents. We introduce biases, such as overconfidence and loss aversion, into reward structures and decision-making processes and evaluate their performance in simulated and real-world trading environments. Despite its inconclusive or negative results, this study provides insights into the challenges of incorporating human-like biases into RL, offering valuable lessons for developing robust financial AI systems.
[LG-25] A Preliminary Agentic Framework for Matrix Deflation
Link: https://arxiv.org/abs/2601.08219
Authors: Paimon Goulart,Evangelos E. Papalexakis
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Can a small team of agents peel a matrix apart, one rank-1 slice at a time? We propose an agentic approach to matrix deflation in which a solver Large Language Model (LLM) generates rank-1 Singular Value Decomposition (SVD) updates and a Vision Language Model (VLM) accepts or rejects each update and decides when to stop, eliminating fixed norm thresholds. Solver stability is improved through in-context learning (ICL) and types of row/column permutations that expose visually coherent structure. We evaluate on Digits ($8\times 8$), CIFAR-10 ($32\times 32$ grayscale), and synthetic ($16\times 16$) matrices with and without Gaussian noise. In the synthetic noisy case, where the true construction rank $k$ is known, numerical deflation provides the noise target and our best agentic configuration differs by only 1.75 RMSE of the target. For Digits and CIFAR-10, targets are defined by deflating until the Frobenius norm reaches 10% of the original. Across all settings, our agent achieves competitive results, suggesting that fully agentic, threshold-free deflation is a viable alternative to classical numerical algorithms.
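For orientation, here is a minimal sketch of the classical numerical baseline the abstract compares against: peel off the leading rank-1 SVD component repeatedly until the residual Frobenius norm falls to 10% of the original. It contains no LLM or VLM; the function name `deflate`, its parameters, and the synthetic test matrix are assumptions for illustration.

```python
import numpy as np

def deflate(A, rel_tol=0.10, max_rank=None):
    """Subtract leading rank-1 SVD components until the residual Frobenius
    norm is at most rel_tol times the Frobenius norm of A."""
    residual = np.array(A, dtype=float)
    target = rel_tol * np.linalg.norm(residual)  # Frobenius norm for 2-D input
    max_rank = min(residual.shape) if max_rank is None else max_rank
    components = []
    while np.linalg.norm(residual) > target and len(components) < max_rank:
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        rank1 = s[0] * np.outer(U[:, 0], Vt[0])  # leading rank-1 slice
        residual = residual - rank1
        components.append(rank1)
    return components, residual

# Hypothetical 16x16 matrix built from three rank-1 terms.
rng = np.random.default_rng(0)
A = sum(np.outer(rng.normal(size=16), rng.normal(size=16)) for _ in range(3))
components, residual = deflate(A)
print(len(components), np.linalg.norm(residual) / np.linalg.norm(A))
```

The fixed `rel_tol` here is exactly the kind of norm threshold that the agentic approach is said to replace with a VLM's accept/reject decision.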
[LG-26] One-Shot Federated Ridge Regression: Exact Recovery via Sufficient Statistic Aggregation
Link: https://arxiv.org/abs/2601.08216
Authors: Zahir Alsulaimawi
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Federated learning protocols require repeated synchronization between clients and a central server, with convergence rates depending on learning rates, data heterogeneity, and client sampling. This paper asks whether iterative communication is necessary for distributed linear regression. We show it is not. We formulate federated ridge regression as a distributed equilibrium problem where each client computes local sufficient statistics – the Gram matrix and moment vector – and transmits them once. The server reconstructs the global solution through a single matrix inversion. We prove exact recovery: under a coverage condition on client feature matrices, one-shot aggregation yields the centralized ridge solution, not an approximation. For heterogeneous distributions violating coverage, we derive non-asymptotic error bounds depending on spectral properties of the aggregated Gram matrix. Communication reduces from $\mathcal{O}(Rd)$ in iterative methods to $\mathcal{O}(d^2)$ total; for high-dimensional settings, we propose and experimentally validate random projection techniques reducing this to $\mathcal{O}(m^2)$ where $m \ll d$. We establish differential privacy guarantees where noise is injected once per client, eliminating the composition penalty that degrades privacy in multi-round protocols. We further address practical considerations including client dropout robustness, federated cross-validation for hyperparameter selection, and comparison with gradient-based alternatives. Comprehensive experiments on synthetic heterogeneous regression demonstrate that one-shot fusion matches FedAvg accuracy while requiring up to $38\times$ less communication. The framework applies to kernel methods and random feature models but not to general nonlinear architectures.
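The one-shot aggregation idea can be sketched in a few lines. Each client sends its Gram matrix X^T X and moment vector X^T y once; the server sums them and solves a single regularized linear system. Because the sums equal the statistics of the pooled data, the result coincides with the centralized ridge solution. The objective convention (squared error plus λ‖w‖²), the function names, and the synthetic clients are assumptions of this sketch, and it omits the paper's privacy noise and random projections.

```python
import numpy as np

def local_statistics(X, y):
    """Client side: Gram matrix X^T X and moment vector X^T y."""
    return X.T @ X, X.T @ y

def one_shot_ridge(stats, lam):
    """Server side: sum client statistics and solve one regularized system."""
    d = stats[0][0].shape[0]
    gram = sum(g for g, _ in stats)
    moment = sum(m for _, m in stats)
    return np.linalg.solve(gram + lam * np.eye(d), moment)

# Hypothetical clients with different sample counts.
rng = np.random.default_rng(0)
d, lam = 5, 0.1
w_true = rng.normal(size=d)
clients = []
for n in (30, 50, 20):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    clients.append((X, y))

w_fed = one_shot_ridge([local_statistics(X, y) for X, y in clients], lam)

# Centralized reference: pool all data and solve the same ridge problem.
X_all = np.vstack([X for X, _ in clients])
y_all = np.concatenate([y for _, y in clients])
w_central = np.linalg.solve(X_all.T @ X_all + lam * np.eye(d), X_all.T @ y_all)
print(np.max(np.abs(w_fed - w_central)))
```

Each client transmits d² + d numbers regardless of its sample count, which matches the O(d²) communication cost quoted in the abstract.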
[LG-27] Scalable Multiagent Reinforcement Learning with Collective Influence Estimation
Link: https://arxiv.org/abs/2601.08210
Authors: Zhenglong Luo,Zhiyong Chen,Aoxiang Liu,Ke Pan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Multiagent reinforcement learning (MARL) has attracted considerable attention due to its potential in addressing complex cooperative tasks. However, existing MARL approaches often rely on frequent exchanges of action or state information among agents to achieve effective coordination, which is difficult to satisfy in practical robotic systems. A common solution is to introduce estimator networks to model the behaviors of other agents and predict their actions; nevertheless, such designs cause the size and computational cost of the estimator networks to grow rapidly with the number of agents, thereby limiting scalability in large-scale systems. To address these challenges, this paper proposes a multiagent learning framework augmented with a Collective Influence Estimation Network (CIEN). By explicitly modeling the collective influence of other agents on the task object, each agent can infer critical interaction information solely from its local observations and the task object’s states, enabling efficient collaboration without explicit action information exchange. The proposed framework effectively avoids network expansion as the team size increases; moreover, new agents can be incorporated without modifying the network structures of existing agents, demonstrating strong scalability. Experimental results on multiagent cooperative tasks based on the Soft Actor-Critic (SAC) algorithm show that the proposed method achieves stable and efficient coordination under communication-limited environments. Furthermore, policies trained with collective influence modeling are deployed on a real robotic platform, where experimental results indicate significantly improved robustness and deployment feasibility, along with reduced dependence on communication infrastructure. 
[LG-28] TabPFN Through The Looking Glass: An interpretability study of TabPFN and its internal representations
Link: https://arxiv.org/abs/2601.08181
Authors: Aviral Gupta,Armaan Sethi,Dhruv Kumar
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Tabular foundational models are pre-trained models designed for a wide range of tabular data tasks. They have shown strong performance across domains, yet their internal representations and learned concepts remain poorly understood. This lack of interpretability makes it important to study how these models process and transform input features. In this work, we analyze the information encoded inside the model’s hidden representations and examine how these representations evolve across layers. We run a set of probing experiments that test for the presence of linear regression coefficients, intermediate values from complex expressions, and the final answer in early layers. These experiments allow us to reason about the computations the model performs internally. Our results provide evidence that meaningful and structured information is stored inside the representations of tabular foundational models. We observe clear signals that correspond to both intermediate and final quantities involved in the model’s prediction process. This gives insight into how the model refines its inputs and how the final output emerges. Our findings contribute to a deeper understanding of the internal mechanics of tabular foundational models. They show that these models encode concrete and interpretable information, which moves us closer to making their decision processes more transparent and trustworthy.
[LG-29] VBO-MI: A Fully Gradient-Based Bayesian Optimization Framework Using Variational Mutual Information Estimation
Link: https://arxiv.org/abs/2601.08172
Authors: Farhad Mirkarimi
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 31 pages, 8 figures, Code will be released upon acceptance
Abstract:Many real-world tasks require optimizing expensive black-box functions accessible only through noisy evaluations, a setting commonly addressed with Bayesian optimization (BO). While Bayesian neural networks (BNNs) have recently emerged as scalable alternatives to Gaussian Processes (GPs), traditional BNN-BO frameworks remain burdened by expensive posterior sampling and acquisition function optimization. In this work, we propose VBO-MI (Variational Bayesian Optimization with Mutual Information), a fully gradient-based BO framework that leverages recent advances in variational mutual information estimation. To enable end-to-end gradient flow, we employ an actor-critic architecture consisting of an action-net to navigate the input space and a variational critic to estimate information gain. This formulation effectively eliminates the traditional inner-loop acquisition optimization bottleneck, achieving up to a $10^2\times$ reduction in FLOPs compared to BNN-BO baselines. We evaluate our method on a diverse suite of benchmarks, including high-dimensional synthetic functions and complex real-world tasks such as PDE optimization, the Lunar Lander control problem, and categorical Pest Control. Our experiments demonstrate that VBO-MI consistently provides the same or superior optimization performance and computational scalability over the baselines.
[LG-30] Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies
Link: https://arxiv.org/abs/2601.08136
Authors: Zeyang Li,Sunbochen Tang,Navid Azizan
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which utilizes a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. Yet, it remains unclear how these objectives relate formally or if they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and enables the principled combination of Q-value and Q-gradient information to derive an optimal, minimum-variance estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL, and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.
[LG-31] Hierarchical Online-Scheduling for Energy-Efficient Split Inference with Progressive Transmission
Link: https://arxiv.org/abs/2601.08135
Authors: Zengzipeng Tang,Yuxuan Sun,Wei Chen,Jianwen Ding,Bo Ai,Yulin Shao
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Device-edge collaborative inference with Deep Neural Networks (DNNs) faces fundamental trade-offs among accuracy, latency and energy consumption. Current scheduling exhibits two drawbacks: a granularity mismatch between coarse, task-level decisions and fine-grained, packet-level channel dynamics, and insufficient awareness of per-task complexity. Consequently, scheduling solely at the task level leads to inefficient resource utilization. This paper proposes a novel ENergy-ACcuracy Hierarchical optimization framework for split Inference, named ENACHI, that jointly optimizes task- and packet-level scheduling to maximize accuracy under energy and delay constraints. A two-tier Lyapunov-based framework is developed for ENACHI, with a progressive transmission technique further integrated to enhance adaptivity. At the task level, an outer drift-plus-penalty loop makes online decisions for DNN partitioning and bandwidth allocation, and establishes a reference power budget to manage the long-term energy-accuracy trade-off. At the packet level, an uncertainty-aware progressive transmission mechanism is employed to adaptively manage per-sample task complexity. This is integrated with a nested inner control loop implementing a novel reference-tracking policy, which dynamically adjusts per-slot transmit power to adapt to fluctuating channel conditions. Experiments on ImageNet dataset demonstrate that ENACHI outperforms state-of-the-art benchmarks under varying deadlines and bandwidths, achieving a 43.12% gain in inference accuracy with a 62.13% reduction in energy consumption under stringent deadlines, and exhibits high scalability by maintaining stable energy consumption in congested multi-user scenarios.
[LG-32] Generalization Analysis and Method for Domain Generalization for a Family of Recurrent Neural Networks
Link: https://arxiv.org/abs/2601.08122
Authors: Atefeh Termehchi,Ekram Hossain,Isaac Woungang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Deep learning (DL) has driven broad advances across scientific and engineering domains. Despite its success, DL models often exhibit limited interpretability and generalization, which can undermine trust, especially in safety-critical deployments. As a result, there is growing interest in (i) analyzing interpretability and generalization and (ii) developing models that perform robustly under data distributions different from those seen during training (i.e. domain generalization). However, the theoretical analysis of DL remains incomplete. For example, many generalization analyses assume independent samples, which is violated in sequential data with temporal correlations. Motivated by these limitations, this paper proposes a method to analyze interpretability and out-of-domain (OOD) generalization for a family of recurrent neural networks (RNNs). Specifically, the evolution of a trained RNN’s states is modeled as an unknown, discrete-time, nonlinear closed-loop feedback system. Using Koopman operator theory, these nonlinear dynamics are approximated with a linear operator, enabling interpretability. Spectral analysis is then used to quantify the worst-case impact of domain shifts on the generalization error. Building on this analysis, a domain generalization method is proposed that reduces the OOD generalization error and improves the robustness to distribution shifts. Finally, the proposed analysis and domain generalization approach are validated on practical temporal pattern-learning tasks.
[LG-33] Intra-tree Column Subsampling Hinders XGBoost Learning of Ratio-like Interactions
Link: https://arxiv.org/abs/2601.08121
Authors: Mykola Pinchuk
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 11 figures and tables
Abstract:Many applied problems contain signal that becomes clear only after combining multiple raw measurements. Ratios and rates are common examples. In gradient boosted trees, this combination is not an explicit operation: the model must synthesize it through coordinated splits on the component features. We study whether intra-tree column subsampling in XGBoost makes that synthesis harder. We use two synthetic data generating processes with cancellation-style structure. In both, two primitive features share a strong nuisance factor, while the target depends on a smaller differential factor. A log ratio cancels the nuisance and isolates the signal. We vary colsample_bylevel and colsample_bynode over s in {0.4, 0.6, 0.8, 0.9}, emphasizing mild subsampling (s = 0.8). A control feature set includes the engineered ratio, removing the need for synthesis. Across both processes, intra-tree column subsampling reduces test PR-AUC in the primitives-only setting. In the main process the relative decrease reaches 54 percent when both parameters are set to 0.4. The effect largely disappears when the engineered ratio is present. A path-based co-usage metric drops in the same cells where performance deteriorates. Practically, if ratio-like structure is plausible, either avoid intra-tree subsampling or include the intended ratio features.
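The cancellation structure described above can be made concrete with a toy generating process. This is a hypothetical construction, not the paper's: two primitives `a` and `b` share a large multiplicative nuisance factor and differ only by a small signal, so each primitive alone is nearly uninformative while the log ratio recovers the signal exactly. It does not train XGBoost and all scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
nuisance = rng.normal(scale=3.0, size=n)  # strong shared factor
signal = rng.normal(scale=0.3, size=n)    # small differential factor

# Two primitive features: same nuisance, opposite halves of the signal.
a = np.exp(nuisance + signal / 2)
b = np.exp(nuisance - signal / 2)

# Engineered feature: the nuisance cancels inside the log ratio.
log_ratio = np.log(a) - np.log(b)

corr_primitive = np.corrcoef(np.log(a), signal)[0, 1]
corr_ratio = np.corrcoef(log_ratio, signal)[0, 1]
print(corr_primitive, corr_ratio)
```

A tree model given only `a` and `b` must coordinate splits on both columns to approximate `log_ratio`, which is the synthesis step that the abstract reports is hindered when either column may be hidden at a given level or node.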
[LG-34] Structure Detection for Contextual Reinforcement Learning
Link: https://arxiv.org/abs/2601.08120
Authors: Tianyue Zhou,Jung-Hoon Cho,Cathy Wu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Contextual Reinforcement Learning (CRL) tackles the problem of solving a set of related Contextual Markov Decision Processes (CMDPs) that vary across different context variables. Traditional approaches (independent training and multi-task learning) struggle with either excessive computational costs or negative transfer. A recently proposed multi-policy approach, Model-Based Transfer Learning (MBTL), has demonstrated effectiveness by strategically selecting a few tasks to train on and then transferring zero-shot. However, CMDPs encompass a wide range of problems, exhibiting structural properties that vary from problem to problem. As such, different task selection strategies are suitable for different CMDPs. In this work, we introduce Structure Detection MBTL (SD-MBTL), a generic framework that dynamically identifies the underlying generalization structure of a CMDP and selects an appropriate MBTL algorithm. For instance, we observe a Mountain structure in which generalization performance degrades from the training performance of the target task as the context difference increases. We thus propose M/GP-MBTL, which detects the structure and adaptively switches between a Gaussian Process-based approach and a clustering-based approach. Extensive experiments on synthetic data and CRL benchmarks (covering continuous control, traffic control, and agricultural management) show that M/GP-MBTL surpasses the strongest prior method by 12.49% on the aggregated metric. These results highlight the promise of online structure detection for guiding source task selection in complex CRL environments.
[LG-35] Learning a Stochastic Differential Equation Model of Tropical Cyclone Intensification from Reanalysis and Observational Data
Link: https://arxiv.org/abs/2601.08116
Authors: Kenneth Gee,Sai Ravela
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:Tropical cyclones are dangerous natural hazards, but their hazard is challenging to quantify directly from historical datasets due to limited dataset size and quality. Models of cyclone intensification fill this data gap by simulating huge ensembles of synthetic hurricanes based on estimates of the storm’s large scale environment. Both physics-based and statistical/ML intensification models have been developed to tackle this problem, but an open question is: can a physically reasonable and simple physics-style differential equation model of intensification be learned from data? In this paper, we answer this question in the affirmative by presenting a 10-term cubic stochastic differential equation model of Tropical Cyclone intensification. The model depends on a well-vetted suite of engineered environmental features known to drive intensification and is trained using a high quality dataset of hurricane intensity (IBTrACS) with estimates of the cyclone’s large scale environment from a data-assimilated simulation (ERA5 reanalysis), restricted to the Northern Hemisphere. The model generates synthetic intensity series which capture many aspects of historical intensification statistics and hazard estimates in the Northern Hemisphere. Our results show promise that interpretable, physics style models of complex earth system dynamics can be learned using automated system identification techniques.
[LG-36] LUT-Compiled Kolmogorov-Arnold Networks for Lightweight DoS Detection on IoT Edge Devices
Link: https://arxiv.org/abs/2601.08044
Authors: Oleksandr Kuznetsov
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Denial-of-Service (DoS) attacks pose a critical threat to Internet of Things (IoT) ecosystems, yet deploying effective intrusion detection on resource-constrained edge devices remains challenging. Kolmogorov-Arnold Networks (KANs) offer a compact alternative to Multi-Layer Perceptrons (MLPs) by placing learnable univariate spline functions on edges rather than fixed activations on nodes, achieving competitive accuracy with fewer parameters. However, runtime B-spline evaluation introduces significant computational overhead unsuitable for latency-critical IoT applications. We propose a lookup table (LUT) compilation pipeline that replaces expensive spline computations with precomputed quantized tables and linear interpolation, dramatically reducing inference latency while preserving detection quality. Our lightweight KAN model (50K parameters, 0.19 MB) achieves 99.0% accuracy on the CICIDS2017 DoS dataset. After LUT compilation with resolution $L=8$, the model maintains 98.96% accuracy (F1 degradation $0.0004$) while achieving $68\times$ speedup at batch size 256 and over $5000\times$ speedup at batch size 1, with only $2\times$ memory overhead. We provide comprehensive evaluation across LUT resolutions, quantization schemes, and out-of-bounds policies, establishing clear Pareto frontiers for accuracy-latency-memory trade-offs. Our results demonstrate that LUT-compiled KANs enable real-time DoS detection on CPU-only IoT gateways with deterministic inference latency and minimal resource footprint.
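The core trick, replacing a univariate function with a precomputed table plus linear interpolation, can be sketched as follows. This is a generic illustration under stated assumptions: `np.tanh` stands in for a learned spline edge function, the grid size of 257 is arbitrary and unrelated to the paper's resolution parameter, there is no quantization, and the out-of-bounds policy is simple clamping. The names `compile_lut` and `lut_eval` are hypothetical.

```python
import numpy as np

def compile_lut(fn, lo, hi, num_points):
    """Precompute fn on a uniform grid over [lo, hi]."""
    xs = np.linspace(lo, hi, num_points)
    return xs, fn(xs)

def lut_eval(x, xs, table):
    """Evaluate by linear interpolation between table entries.
    np.interp clamps inputs outside [xs[0], xs[-1]] to the end values."""
    return np.interp(x, xs, table)

# Stand-in for one learned univariate edge function of a KAN.
xs, table = compile_lut(np.tanh, -3.0, 3.0, 257)

x_test = np.linspace(-3.0, 3.0, 1001)
max_err = np.max(np.abs(lut_eval(x_test, xs, table) - np.tanh(x_test)))
print(max_err)
```

In a deployed model there would be one such table per edge function, so memory grows with table size while each evaluation costs a fixed number of operations, which is the accuracy-latency-memory trade-off the abstract describes.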
[LG-37] Riemannian Zeroth-Order Gradient Estimation with Structure-Preserving Metrics for Geodesically Incomplete Manifolds
Link: https://arxiv.org/abs/2601.08039
Authors: Shaocong Ma, Heng Huang
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:In this paper, we study Riemannian zeroth-order optimization in settings where the underlying Riemannian metric $g$ is geodesically incomplete, and the goal is to approximate stationary points with respect to this incomplete metric. To address this challenge, we construct structure-preserving metrics that are geodesically complete while ensuring that every stationary point under the new metric remains stationary under the original one. Building on this foundation, we revisit the classical symmetric two-point zeroth-order estimator and analyze its mean-squared error from a purely intrinsic perspective, depending only on the manifold’s geometry rather than any ambient embedding. Leveraging this intrinsic analysis, we establish convergence guarantees for stochastic gradient descent with this intrinsic estimator. Under additional suitable conditions, an $\epsilon$-stationary point under the constructed metric $g'$ also corresponds to an $\epsilon$-stationary point under the original metric $g$, thereby matching the best-known complexity in the geodesically complete setting. Empirical studies on synthetic problems confirm our theoretical findings, and experiments on a practical mesh optimization task demonstrate that our framework maintains stable convergence even in the absence of geodesic completeness.
[LG-38] InfGraND: An Influence-Guided GNN-to-MLP Knowledge Distillation
Link: https://arxiv.org/abs/2601.08033
Authors: Amir Eskandari, Aman Anand, Elyas Rashno, Farhana Zulkernine
Categories: Machine Learning (cs.LG)
Notes: Accepted in Transactions on Machine Learning Research (TMLR), 2026
Abstract:Graph Neural Networks (GNNs) are the go-to model for graph data analysis. However, GNNs rely on two key operations - aggregation and update, which can pose challenges for low-latency inference tasks or resource-constrained scenarios. Simple Multi-Layer Perceptrons (MLPs) offer a computationally efficient alternative. Yet, training an MLP in a supervised setting often leads to suboptimal performance. Knowledge Distillation (KD) from a GNN teacher to an MLP student has emerged to bridge this gap. However, most KD methods either transfer knowledge uniformly across all nodes or rely on graph-agnostic indicators such as prediction uncertainty. We argue this overlooks a more fundamental, graph-centric inquiry: “How important is a node to the structure of the graph?” We introduce a framework, InfGraND, an Influence-guided Graph KNowledge Distillation from GNN to MLP that addresses this by identifying and prioritizing structurally influential nodes to guide the distillation process, ensuring that the MLP learns from the most critical parts of the graph. Additionally, InfGraND embeds structural awareness in MLPs through one-time multi-hop neighborhood feature pre-computation, which enriches the student MLP’s input and thus avoids inference-time overhead. Our rigorous evaluation in transductive and inductive settings across seven homophilic graph benchmark datasets shows InfGraND consistently outperforms prior GNN to MLP KD methods, demonstrating its practicality for numerous latency-critical applications in real-world settings.
[LG-39] Beyond the Next Port: A Multi-Task Transformer for Forecasting Future Voyage Segment Durations
Link: https://arxiv.org/abs/2601.08013
Authors: Nairui Liu, Fang He, Xindi Tang
Categories: Machine Learning (cs.LG)
Abstract:Accurate forecasts of segment-level sailing durations are fundamental to enhancing maritime schedule reliability and optimizing long-term port operations. However, conventional estimated time of arrival (ETA) models are primarily designed for the immediate next port of call and rely heavily on real-time automatic identification system (AIS) data, which is inherently unavailable for future voyage segments. To address this gap, the study reformulates future-port ETA prediction as a segment-level time-series forecasting problem. We develop a transformer-based architecture that integrates historical sailing durations, destination port congestion proxies, and static vessel descriptors. The proposed framework employs a causally masked attention mechanism to capture long-range temporal dependencies and a multi-task learning head to jointly predict segment sailing durations and port congestion states, leveraging shared latent signals to mitigate high uncertainty. Evaluation on a real-world global dataset from 2021 demonstrates the proposed model consistently outperforms a comprehensive suite of competitive baselines. The result shows a relative reduction of 4.85% in mean absolute error (MAE) and 4.95% in mean absolute percentage error (MAPE) compared with sequence baseline models. The relative reductions with gradient boosting machines are 9.39% in MAE and 52.97% in MAPE. Case studies for the major destination port further illustrate the model’s superior accuracy.
[LG-40] DataScribe: An AI-Native Policy-Aligned Web Platform for Multi-Objective Materials Design and Discovery
Link: https://arxiv.org/abs/2601.07966
Authors: Divyanshu Singh, Doguhan Sarıtürk, Cameron Lea, Md Shafiqul Islam, Raymundo Arroyave, Vahid Attari
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Abstract:The acceleration of materials discovery requires digital platforms that go beyond data repositories to embed learning, optimization, and decision-making directly into research workflows. We introduce DataScribe, an AI-native, cloud-based materials discovery platform that unifies heterogeneous experimental and computational data through ontology-backed ingestion and machine-actionable knowledge graphs. The platform integrates FAIR-compliant metadata capture, schema and unit harmonization, uncertainty-aware surrogate modeling, and native multi-objective multi-fidelity Bayesian optimization, enabling closed-loop propose-measure-learn workflows across experimental and computational pipelines. DataScribe functions as an application-layer intelligence stack, coupling data governance, optimization, and explainability rather than treating them as downstream add-ons. We validate the platform through case studies in electrochemical materials and high-entropy alloys, demonstrating end-to-end data fusion, real-time optimization, and reproducible exploration of multi-objective trade spaces. By embedding optimization engines, machine learning, and unified access to public and private scientific data directly within the data infrastructure, and by supporting open, free use for academic and non-profit researchers, DataScribe functions as a general-purpose application-layer backbone for laboratories of any scale, including self-driving laboratories and geographically distributed materials acceleration platforms, with built-in support for performance, sustainability, and supply-chain-aware objectives.
[LG-41] Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds
Link: https://arxiv.org/abs/2601.07930
Authors: Bo Pan, Zhiping Zhang, Kevin Spiekermann, Tianchi Chen, Xiang Yu, Liying Zhang, Liang Zhao
Categories: Machine Learning (cs.LG)
Notes: The 2nd AAAI Workshop on Foundation Models for Biological Discoveries at AAAI 2025
Abstract:Functional group replacement is a pivotal approach in cheminformatics to enable the design of novel chemical compounds with tailored properties. Traditional methods for functional group removal and replacement often rely on rule-based heuristics, which can be limited in their ability to generate diverse and novel chemical structures. Recently, transformer-based models have shown promise in improving the accuracy and efficiency of molecular transformations, but existing approaches typically focus on single-step modeling, lacking the guarantee of structural similarity. In this work, we seek to advance the state of the art by developing a novel two-stage transformer model for functional group removal and replacement. Unlike one-shot approaches that generate entire molecules in a single pass, our method generates the functional group to be removed and appended sequentially, ensuring strict substructure-level modifications. Using a matched molecular pairs (MMPs) dataset derived from ChEMBL, we trained an encoder-decoder transformer model with SMIRKS-based representations to capture transformation rules effectively. Extensive evaluations demonstrate our method’s ability to generate chemically valid transformations, explore diverse chemical spaces, and maintain scalability across varying search sizes.
[LG-42] Max-Min Neural Network Operators For Approximation of Multivariate Functions
Link: https://arxiv.org/abs/2601.07886
Authors: Abhishek Yadav, Uaday Singh, Feng Dai
Categories: Machine Learning (cs.LG)
Notes: 17 pages with 8 figures
Abstract:In this paper, we develop a multivariate framework for approximation by max-min neural network operators. Building on recent advances in approximation theory by neural network operators, particularly the univariate max-min operators, we propose and analyze new multivariate operators activated by sigmoidal functions. We establish pointwise and uniform convergence theorems and derive quantitative estimates for the order of approximation via the modulus of continuity and the multivariate generalized absolute moment. Our results demonstrate that multivariate max-min operators, besides their algebraic elegance, provide efficient and stable approximation tools in both theoretical and applied settings.
[LG-43] On the use of graph models to achieve individual and group fairness
Link: https://arxiv.org/abs/2601.08784
Authors: Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo
Categories: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 75 pages, 46 figures
Abstract:Machine Learning algorithms are ubiquitous in key decision-making contexts such as justice, healthcare and finance, which has spawned a great demand for fairness in these procedures. However, the theoretical properties of such models in relation to fairness are still poorly understood, and the intuition behind the relationship between group and individual fairness is still lacking. In this paper, we provide a theoretical framework based on Sheaf Diffusion to leverage tools based on dynamical systems and homology to model fairness. Concretely, the proposed method projects input data into a bias-free space that encodes fairness constraints, resulting in fair solutions. Furthermore, we present a collection of network topologies handling different fairness metrics, leading to a unified method capable of dealing with both individual and group bias. The resulting models have a layer of interpretability in the form of closed-form expressions for their SHAP values, consolidating their place in the responsible Artificial Intelligence landscape. Finally, these intuitions are tested on a simulation study and standard fairness benchmarks, where the proposed methods achieve satisfactory results. More concretely, the paper showcases the performance of the proposed models in terms of accuracy and fairness, studying available trade-offs on the Pareto frontier, checking the effects of changing the different hyper-parameters, and delving into the interpretation of its outputs.
[LG-44] Kernel Learning for Regression via Quantum Annealing Based Spectral Sampling
Link: https://arxiv.org/abs/2601.08724
Authors: Yasushi Hasegawa, Masayuki Ohzeki
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Notes: 15 pages, 5 figures, 4 tables
Abstract:While quantum annealing (QA) has been developed for combinatorial optimization, practical QA devices operate at finite temperature and under noise, and their outputs can be regarded as stochastic samples close to a Gibbs–Boltzmann distribution. In this study, we propose a QA-in-the-loop kernel learning framework that integrates QA not merely as a substitute for Markov-chain Monte Carlo sampling but as a component that directly determines the learned kernel for regression. Based on Bochner’s theorem, a shift-invariant kernel is represented as an expectation over a spectral distribution, and random Fourier features (RFF) approximate the kernel by sampling frequencies. We model the spectral distribution with a (multi-layer) restricted Boltzmann machine (RBM), generate discrete RBM samples using QA, and map them to continuous frequencies via a Gaussian–Bernoulli transformation. Using the resulting RFF, we construct a data-adaptive kernel and perform Nadaraya–Watson (NW) regression. Because the RFF approximation based on $\cos(\bm{\omega}^\top \Delta\bm{x})$ can yield small negative values and cancellation across neighbors, the Nadaraya–Watson denominator $\sum_j k_{ij}$ may become close to zero. We therefore employ nonnegative squared-kernel weights $w_{ij} = k(\bm{x}_i, \bm{x}_j)^2$, which also enhances the contrast of kernel weights. The kernel parameters are trained by minimizing the leave-one-out NW mean squared error, and we additionally evaluate local linear regression with the same squared-kernel weights at inference. Experiments on multiple benchmark regression datasets demonstrate a decrease in training loss, accompanied by structural changes in the kernel matrix, and show that the learned kernel tends to improve $R^2$ and RMSE over the baseline Gaussian-kernel NW. Increasing the number of random features at inference further enhances accuracy.
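The two ingredients named in this abstract, a random-Fourier-feature kernel estimate and Nadaraya-Watson regression with squared-kernel weights, can be sketched in plain Python. This is only an illustration under stated assumptions: the frequencies are drawn from a fixed Gaussian (the spectral density of a Gaussian kernel) instead of the paper's RBM sampled by quantum annealing, nothing is trained, and all function names are hypothetical.

```python
import math
import random

def sample_frequencies(num_features, dim, sigma, rng):
    """Draw frequencies from N(0, sigma^-2 I), the spectral density of a Gaussian kernel."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
            for _ in range(num_features)]

def rff_kernel(x, y, freqs):
    """Monte Carlo kernel estimate: the mean of cos(w . (x - y)) over sampled w."""
    delta = [a - b for a, b in zip(x, y)]
    total = 0.0
    for w in freqs:
        total += math.cos(sum(wk * dk for wk, dk in zip(w, delta)))
    return total / len(freqs)

def nw_predict(x, train_x, train_y, freqs):
    """Nadaraya-Watson prediction with nonnegative squared-kernel weights."""
    weights = [rff_kernel(x, xi, freqs) ** 2 for xi in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
freqs = sample_frequencies(2000, 1, 1.0, rng)
train_x = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0.0, 1.0, 4.0, 9.0]
prediction = nw_predict([1.5], train_x, train_y, freqs)
```

Squaring the kernel values keeps every weight nonnegative, so the prediction is a convex combination of the training targets and the denominator cannot cancel to zero through sign changes.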
[LG-45] Robust low-rank estimation with multiple binary responses using pairwise AUC loss
Link: https://arxiv.org/abs/2601.08618
Authors: TheTien Mai
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Abstract:Multiple binary responses arise in many modern data-analytic problems. Although fitting separate logistic regressions for each response is computationally attractive, it ignores shared structure and can be statistically inefficient, especially in high-dimensional and class-imbalanced regimes. Low-rank models offer a natural way to encode latent dependence across tasks, but existing methods for binary data are largely likelihood-based and focus on pointwise classification rather than ranking performance. In this work, we propose a unified framework for learning with multiple binary responses that directly targets discrimination by minimizing a surrogate loss for the area under the ROC curve (AUC). The method aggregates pairwise AUC surrogate losses across responses while imposing a low-rank constraint on the coefficient matrix to exploit shared structure. We develop a scalable projected gradient descent algorithm based on truncated singular value decomposition. Exploiting the fact that the pairwise loss depends only on differences of linear predictors, we simplify computation and analysis. We establish non-asymptotic convergence guarantees, showing that, under suitable regularity conditions, the algorithm converges linearly up to the minimax-optimal statistical precision. Extensive simulation studies demonstrate that the proposed method is robust in challenging settings such as label switching and data contamination and consistently outperforms likelihood-based approaches.
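A pairwise AUC surrogate for a single binary response can be sketched as follows. The abstract does not say which surrogate the paper uses, so the logistic form here is an assumption, and `pairwise_auc_loss` is a hypothetical name; the low-rank projection step is not shown.

```python
import math

def pairwise_auc_loss(scores, labels):
    """Mean logistic loss over all (positive, negative) score pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Each term depends only on the score difference s_pos - s_neg.
    total = sum(math.log1p(math.exp(-(sp - sn))) for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
good = pairwise_auc_loss([2.0, 1.5, -1.0, -2.0], labels)   # positives ranked first
bad = pairwise_auc_loss([-2.0, -1.5, 1.0, 2.0], labels)    # ranking reversed
shifted = pairwise_auc_loss([7.0, 6.5, 4.0, 3.0], labels)  # `good` scores plus 5
```

Because only score differences enter the loss, adding a constant to every score leaves it unchanged, which matches the abstract's remark that the pairwise loss depends only on differences of linear predictors.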
[LG-46] Accelerated Methods with Complexity Separation Under Data Similarity for Federated Learning Problems
Link: https://arxiv.org/abs/2601.08614
Authors: Dmitry Bylinkin, Sergey Skorik, Dmitriy Bystrov, Leonid Berezin, Aram Avetisyan, Aleksandr Beznosikov
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Notes: 30 pages, 4 theorems, 2 figures
Abstract:Heterogeneity within data distribution poses a challenge in many modern federated learning tasks. We formalize it as an optimization problem involving a computationally heavy composite under data similarity. By employing different sets of assumptions, we present several approaches to develop communication-efficient methods. An optimal algorithm is proposed for the convex case. The constructed theory is validated through a series of experiments across various problems.
[LG-47] Sample Complexity of Composite Quantum Hypothesis Testing
Link: https://arxiv.org/abs/2601.08588
Authors: Jacob Paul Simpson, Efstratios Palias, Sharu Theresa Jose
Categories: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Notes: Under review
Abstract:This paper investigates symmetric composite binary quantum hypothesis testing (QHT), where the goal is to determine which of two uncertainty sets contains an unknown quantum state. While asymptotic error exponents for this problem are well-studied, the finite-sample regime remains poorly understood. We bridge this gap by characterizing the sample complexity, that is, the minimum number of state copies required to achieve a target error level. Specifically, we derive lower bounds that generalize the sample complexity of simple QHT and introduce new upper bounds for various uncertainty sets, including sets of both finite and infinite cardinality. Notably, our upper and lower bounds match up to universal constants, providing a tight characterization of the sample complexity. Finally, we extend our analysis to the differentially private setting, establishing the sample complexity for privacy-preserving composite QHT.
[LG-48] Convergence of gradient flow for learning convolutional neural networks
Link: https://arxiv.org/abs/2601.08547
Authors: Jona-Maria Diederen, Holger Rauhut, Ulrich Terstiege
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Notes: 17 pages
Abstract:Convolutional neural networks are widely used in imaging and image recognition. Learning such networks from training data leads to the minimization of a non-convex function. This makes the analysis of standard optimization methods such as variants of (stochastic) gradient descent challenging. In this article we study the simplified setting of linear convolutional networks. We show that the gradient flow (to be interpreted as an abstraction of gradient descent) applied to the empirical risk defined via certain loss functions including the square loss always converges to a critical point, under a mild condition on the training data.
[LG-49] Disentangling History and Propagation Dependencies in Cross-Subject Knee Contact Stress Prediction Using a Shared MeshGraphNet Backbone
Link: https://arxiv.org/abs/2601.08318
Authors: Zhengye Pan, Jianwei Zuo, Jiajia Luo
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Abstract:Background: Subject-specific finite element analysis accurately characterizes knee joint mechanics but is computationally expensive. Deep surrogate models provide a rapid alternative, yet their generalization across subjects under limited pose and load inputs remains unclear. In particular, it is not known whether the dominant source of prediction uncertainty arises from temporal history dependence or spatial propagation dependence. Methods: To disentangle these factors, we employed a shared MGN backbone with a fixed mesh topology. A dataset of running trials from nine subjects was constructed using an OpenSim-FEBio workflow. We developed four model variants to isolate specific dependencies: (1) a baseline MGN; (2) CT-MGN, incorporating a Control Transformer to encode short-horizon history; (3) MsgModMGN, applying state-conditioned modulation to message passing for adaptive propagation; (4) CT-MsgModMGN, combining both mechanisms. Models were evaluated using a rigorous grouped 3-fold cross-validation on unseen subjects. Results: The models incorporating history encoding significantly outperformed the baseline MGN and MsgModMGN in global accuracy and spatial consistency. Crucially, the CT module effectively mitigated the peak-shaving defect common in deep surrogates, significantly reducing peak stress prediction errors. In contrast, the spatial propagation modulation alone yielded no significant improvement over the baseline, and combining it with CT provided no additional benefit. Conclusion: Temporal history dependence, rather than spatial propagation modulation, is the primary driver of prediction uncertainty in cross-subject knee contact mechanics. Explicitly encoding short-horizon driver sequences enables the surrogate model to recover implicit phase information, thereby achieving superior fidelity in peak-stress capture and high-risk localization compared to purely state-based approaches.
[LG-50] Structural Dimension Reduction in Bayesian Networks
Link: https://arxiv.org/abs/2601.08236
Authors: Pei Heng, Yi Sun, Jianhua Guo
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes: 13 pages
Abstract:This work introduces a novel technique, named structural dimension reduction, to collapse a Bayesian network onto a minimum and localized one while ensuring that probabilistic inferences between the original and reduced networks remain consistent. To this end, we propose a new combinatorial structure in directed acyclic graphs called the directed convex hull, which has turned out to be equivalent to their minimum localized Bayesian networks. An efficient polynomial-time algorithm is devised to identify them by determining the unique directed convex hulls containing the variables of interest from the original networks. Experiments demonstrate that the proposed technique has high dimension reduction capability in real networks, and the efficiency of probabilistic inference based on directed convex hulls can be significantly improved compared with traditional methods such as variable elimination and belief propagation algorithms. The code of this study is open at this https URL, and the proofs of the results in the main body are postponed to the appendix.
[LG-51] Wasserstein-p Central Limit Theorem Rates: From Local Dependence to Markov Chains
Link: https://arxiv.org/abs/2601.08184
Authors: Yixuan Zhang, Qiaomin Xie
Categories: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 51 pages, 1 figure
Abstract:Finite-time central limit theorem (CLT) rates play a central role in modern machine learning (ML). In this paper, we study CLT rates for multivariate dependent data in Wasserstein-$p$ ($\mathcal{W}_p$) distance, for general $p \ge 1$. We focus on two fundamental dependence structures that commonly arise in ML: locally dependent sequences and geometrically ergodic Markov chains. In both settings, we establish the first optimal $\mathcal{O}(n^{-1/2})$ rate in $\mathcal{W}_1$, as well as the first $\mathcal{W}_p$ ($p \ge 2$) CLT rates under mild moment assumptions, substantially improving the best previously known bounds in these dependent-data regimes. As an application of our optimal $\mathcal{W}_1$ rate for locally dependent sequences, we further obtain the first optimal $\mathcal{W}_1$-CLT rate for multivariate $U$-statistics. On the technical side, we derive a tractable auxiliary bound for $\mathcal{W}_1$ Gaussian approximation errors that is well suited to studying dependent data. For Markov chains, we further prove that the regeneration time of the split chain associated with a geometrically ergodic chain has a geometric tail without assuming strong aperiodicity or other restrictive conditions. These tools may be of independent interest; they enable our optimal $\mathcal{W}_1$ rates and underpin our $\mathcal{W}_p$ ($p \ge 2$) results.
[LG-52] Towards A Unified PAC-Bayesian Framework for Norm-based Generalization Bounds
Link: https://arxiv.org/abs/2601.08100
Authors: Xinping Yi, Gaojie Jin, Xiaowei Huang, Shi Jin
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract:Understanding the generalization behavior of deep neural networks remains a fundamental challenge in modern statistical learning theory. Among existing approaches, PAC-Bayesian norm-based bounds have demonstrated particular promise due to their data-dependent nature and their ability to capture algorithmic and geometric properties of learned models. However, most existing results rely on isotropic Gaussian posteriors, heavy use of spectral-norm concentration for weight perturbations, and largely architecture-agnostic analyses, which together limit both the tightness and practical relevance of the resulting bounds. To address these limitations, in this work, we propose a unified framework for PAC-Bayesian norm-based generalization by reformulating the derivation of generalization bounds as a stochastic optimization problem over anisotropic Gaussian posteriors. The key to our approach is a sensitivity matrix that quantifies the network outputs with respect to structured weight perturbations, enabling the explicit incorporation of heterogeneous parameter sensitivities and architectural structures. By imposing different structural assumptions on this sensitivity matrix, we derive a family of generalization bounds that recover several existing PAC-Bayesian results as special cases, while yielding bounds that are comparable to or tighter than state-of-the-art approaches. Such a unified framework provides a principled and flexible way for geometry-/structure-aware and interpretable generalization analysis in deep learning.
[LG-53] A Statistical Assessment of Amortized Inference Under Signal-to-Noise Variation and Distribution Shift
Link: https://arxiv.org/abs/2601.07944
Authors: Roy Shivam Ram Shreshtth, Arnab Hazra, Gourab Mukherjee
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Notes: 26 pages, 5 figures, 3 tables
Abstract:Since the turn of the century, approximate Bayesian inference has steadily evolved as new computational techniques have been incorporated to handle increasingly complex and large-scale predictive problems. The recent success of deep neural networks and foundation models has now given rise to a new paradigm in statistical modeling, in which Bayesian inference can be amortized through large-scale learned predictors. In amortized inference, substantial computation is invested upfront to train a neural network that can subsequently produce approximate posterior or predictions at negligible marginal cost across a wide range of tasks. At deployment, amortized inference offers substantial computational savings compared with traditional Bayesian procedures, which generally require repeated likelihood evaluations or Monte Carlo simulations for predictions for each new dataset. Despite the growing popularity of amortized inference, its statistical interpretation and its role within Bayesian inference remain poorly understood. This paper presents statistical perspectives on the working principles of several major neural architectures, including feedforward networks, Deep Sets, and Transformers, and examines how these architectures naturally support amortized Bayesian inference. We discuss how these models perform structured approximation and probabilistic reasoning in ways that yield controlled generalization error across a wide range of deployment scenarios, and how these properties can be harnessed for Bayesian computation. Through simulation studies, we evaluate the accuracy, robustness, and uncertainty quantification of amortized inference under varying signal-to-noise ratios and distributional shifts, highlighting both its strengths and its limitations. 
[LG-54] Enhancing Portfolio Optimization with Deep Learning Insights
Link: https://arxiv.org/abs/2601.07942
Authors: Brandon Luo, Jim Skufca
Categories: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
Abstract:Our work focuses on deep learning (DL) portfolio optimization, tackling challenges in long-only, multi-asset strategies across market cycles. We propose training models with limited regime data using pre-training techniques and leveraging transformer architectures for state variable inclusion. Evaluating our approach against traditional methods shows promising results, demonstrating our models’ resilience in volatile markets. These findings emphasize the evolving landscape of DL-driven portfolio optimization, stressing the need for adaptive strategies to navigate dynamic market conditions and improve predictive accuracy.
[LG-55] A Sensing Dataset Protocol for Benchmarking and Multi-Task Wireless Sensing
Link: https://arxiv.org/abs/2512.12180
Authors: Jiawei Huang, Di Zhang, Yuanhao Cui, Xiaowen Cao, Tony Xiao Han, Xiaojun Jing, Christos Masouros
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes: 6 pages, 8 figures
Abstract:Wireless sensing has become a fundamental enabler for intelligent environments, supporting applications such as human detection, activity recognition, localization, and vital sign monitoring. Despite rapid advances, existing datasets and pipelines remain fragmented across sensing modalities, hindering fair comparison, transfer, and reproducibility. We propose the Sensing Dataset Protocol (SDP), a protocol-level specification and benchmark framework for large-scale wireless sensing. SDP defines how heterogeneous wireless signals are mapped into a unified perception data-block schema through lightweight synchronization, frequency-time alignment, and resampling, while a Canonical Polyadic-Alternating Least Squares (CP-ALS) pooling stage provides a task-agnostic representation that preserves multipath, spectral, and temporal structures. Built upon this protocol, a unified benchmark is established for detection, recognition, and vital-sign estimation with consistent preprocessing, training, and evaluation. Experiments under the cross-user split demonstrate that SDP significantly reduces variance (approximately 88%) across seeds while maintaining competitive accuracy and latency, confirming its value as a reproducible foundation for multi-modal and multitask sensing research.
Information Retrieval
[IR-0] FusID: Modality-Fused Semantic IDs for Generative Music Recommendation
Link: https://arxiv.org/abs/2601.08764
Authors: Haven Kim, Yupeng Hou, Julian McAuley
Categories: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract:Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions that limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflict. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, ensuring that each token sequence maps to exactly one song, mitigates codebook underutilization, and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).
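The product quantization step mentioned in this abstract, which turns one continuous embedding into several discrete tokens, can be sketched as below. The codebooks are tiny hand-written placeholders rather than learned ones, `pq_encode` is a hypothetical name, and the sketch assumes the embedding length is divisible by the number of codebooks.

```python
def pq_encode(vec, codebooks):
    """Split vec into len(codebooks) equal chunks; return the nearest centroid index per chunk."""
    sub_dim = len(vec) // len(codebooks)
    tokens = []
    for j, codebook in enumerate(codebooks):
        sub = vec[j * sub_dim:(j + 1) * sub_dim]
        # Squared Euclidean distance from this chunk to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Two hypothetical sub-codebooks with two 2-D centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
tokens = pq_encode([0.9, 0.8, 0.1, 0.9], codebooks)  # -> [1, 0]
```

With M codebooks of K centroids each, the token sequence can take K^M distinct values, which is why splitting an embedding into several tokens leaves more room to give each item a unique ID than a single codebook of the same size.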
[IR-1] Characterizing Personality from Eye-Tracking: The Role of Gaze and Its Absence in Interactive Search Environments
Link: https://arxiv.org/abs/2601.08287
Authors: Jiaman He, Marta Micheli, Damiano Spina, Dana McKay, Johanne R. Trippas, Noriko Kando
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Notes: This paper is accepted at CHIIR 2026
Abstract:Personality traits influence how individuals engage, behave, and make decisions during the information-seeking process. However, few studies have linked personality to observable search behaviors. This study aims to characterize personality traits through a multimodal time-series model that integrates eye-tracking data and gaze missingness (periods when the user’s gaze is not captured). This approach is based on the idea that people often look away when they think, signaling disengagement or reflection. We conducted a user study with 25 participants, who used an interactive application on an iPad, allowing them to engage with digital artifacts from a museum. We rely on raw gaze data from an eye tracker, minimizing preprocessing so that behavioral patterns can be preserved without substantial data cleaning. From this perspective, we trained models to predict personality traits using gaze signals. Our results from a five-fold cross-validation study demonstrate strong predictive performance across all five dimensions: Neuroticism (Macro F1 = 77.69%), Conscientiousness (74.52%), Openness (77.52%), Agreeableness (73.09%), and Extraversion (76.69%). The ablation study examines whether the absence of gaze information affects the model performance, demonstrating that incorporating missingness improves multimodal time-series modeling. The full model, which integrates both time-series signals and missingness information, achieves 10-15% higher accuracy and macro F1 scores across all Big Five traits compared to the model without time-series signals and missingness. These findings provide evidence that personality can be inferred from search-related gaze behavior and demonstrate the value of incorporating missing gaze data into time-series multimodal modeling.
[IR-2] Markovian Pre-Trained Transformer for Next-Item Recommendation
Link: https://arxiv.org/abs/2601.08275
Authors: Cong Xu, Guoliang Li, Jun Wang, Wei Zhang
Categories: Information Retrieval (cs.IR)
Abstract:We introduce the Markovian Pre-trained Transformer (MPT) for next-item recommendation, a transferable model fully pre-trained on synthetic Markov chains, yet capable of achieving state-of-the-art performance by fine-tuning a lightweight adaptor. This counterintuitive success stems from the observation of the 'Markovian' nature: advanced sequential recommenders coincidentally rely on the latest interaction to make predictions, while the historical interactions serve mainly as auxiliary cues for inferring the user’s general, non-sequential identity. This characteristic necessitates the capabilities of a universal recommendation model to effectively summarize the user sequence, with particular emphasis on the latest interaction. MPT inherently has the potential to be universal and transferable. On the one hand, when trained to predict the next state of Markov chains, it acquires the capabilities to estimate transition probabilities from the context (one adaptive manner for summarizing sequences) and attend to the last state to ensure accurate state transitions. On the other hand, unlike the heterogeneous interaction data, an unlimited amount of controllable Markov chains is available to boost the model capacity. We conduct extensive experiments on five public datasets from three distinct platforms to validate the superiority of Markovian pre-training over traditional recommendation pre-training and recent language pre-training paradigms.
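To make the two capabilities named in the abstract more concrete (estimating transition probabilities from the context and attending to the last state), the sketch below shows a plain counting baseline on a synthetic Markov chain. It is not the MPT model: the function names, the uniform fallback for unseen states, and the toy sequence are all assumptions made for illustration.

```python
import random


def sample_chain(transition, start, length, rng):
    """Sample a state sequence from a row-stochastic transition matrix."""
    states = list(range(len(transition)))
    seq = [start]
    for _ in range(length - 1):
        # The next state depends only on the current (last) state.
        seq.append(rng.choices(states, weights=transition[seq[-1]])[0])
    return seq


def estimate_transitions(seq, num_states):
    """Estimate transition probabilities by counting adjacent pairs in seq."""
    counts = [[0] * num_states for _ in range(num_states)]
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    estimate = []
    for row in counts:
        total = sum(row)
        # Assumption: fall back to a uniform row for states never left in seq.
        estimate.append([c / total if total else 1.0 / num_states for c in row])
    return estimate


def predict_next(seq, num_states):
    """Predict the most likely next state given the context and its last state."""
    row = estimate_transitions(seq, num_states)[seq[-1]]
    return max(range(num_states), key=row.__getitem__)


context = [0, 1, 0, 1, 0]
estimated = estimate_transitions(context, 2)
next_state = predict_next(context, 2)
synthetic = sample_chain([[0.5, 0.5], [0.5, 0.5]], 0, 10, random.Random(0))
```

Because `sample_chain` can generate as many sequences as needed from any chosen transition matrix, it also hints at why synthetic Markov chains are described as an unlimited and controllable source of pre-training data.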
[IR-3] From Tool to Teacher: Rethinking Search Systems as Instructive Interfaces
Link: https://arxiv.org/abs/2601.08035
Authors: David Elsweiler
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Abstract:Information access systems such as search engines and generative AI are central to how people seek, evaluate, and interpret information. Yet most systems are designed to optimise retrieval rather than to help users develop better search strategies or critical awareness. This paper introduces a pedagogical perspective on information access, conceptualising search and conversational systems as instructive interfaces that can teach, guide, and scaffold users’ learning. We draw on seven didactic frameworks from education and behavioural science to analyse how existing and emerging system features, including query suggestions, source labels, and conversational or agentic AI, support or limit user learning. Using two illustrative search tasks, we demonstrate how different design choices promote skills such as critical evaluation, metacognitive reflection, and strategy transfer. The paper contributes a conceptual lens for evaluating the instructional value of information access systems and outlines design implications for technologies that foster more effective, reflective, and resilient information seekers.
[IR-4] Cost and accuracy of long-term graph memory in distributed LLM-based multi-agent systems
Link: https://arxiv.org/abs/2601.07978
Authors: Benedict Wolff, Jacopo Bennati
Categories: Information Retrieval (cs.IR)
Comments: 23 pages, 4 figures, 7 tables
Abstract:Distributed multi-agent systems (DMAS) use large language models to enable collaborative intelligence while preserving privacy, yet systematic evaluations of long-term memory under network constraints remain limited. This study presents a flexible testbed comparing mem0, a vector-based memory framework, and Graphiti, a graph-based knowledge graph, using the LOCOMO long-context benchmark. Experiments were conducted under unconstrained and constrained network conditions, measuring computational, financial, and accuracy metrics. Results indicate that mem0 significantly outperforms Graphiti in efficiency, with faster loading times, lower resource consumption, and minimal network overhead, while accuracy differences are not statistically significant. Applying a statistical Pareto efficiency framework, mem0 is identified as the optimal choice that balances cost and accuracy in DMAS.
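The Pareto-efficiency idea referred to in the abstract can be sketched as a simple dominance check over cost and accuracy. This is a deterministic simplification, not the statistical framework used in the paper, and the option names and numbers below are invented placeholders rather than reported results.

```python
def dominates(a, b):
    """True if a is no worse than b on cost and accuracy and strictly better on one."""
    no_worse = a["cost"] <= b["cost"] and a["accuracy"] >= b["accuracy"]
    strictly_better = a["cost"] < b["cost"] or a["accuracy"] > b["accuracy"]
    return no_worse and strictly_better


def pareto_front(options):
    """Return the options that no other option dominates."""
    return [
        o for o in options
        if not any(dominates(p, o) for p in options if p is not o)
    ]


# Hypothetical measurements, for illustration only.
options = [
    {"name": "system_a", "cost": 1.0, "accuracy": 0.8},
    {"name": "system_b", "cost": 2.0, "accuracy": 0.8},
    {"name": "system_c", "cost": 3.0, "accuracy": 0.9},
]
front = [o["name"] for o in pareto_front(options)]
```

In this toy data, `system_b` is dominated because `system_a` reaches the same accuracy at lower cost, which mirrors the kind of argument the abstract makes when accuracy differences are not significant but costs differ.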

