This post contains the latest list of papers retrieved from arXiv.org on 2026-01-15, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR.
Note: paper data is retrieved from arXiv.org and updated automatically around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper list by email, please leave your email address in the comments.
Table of Contents
Overview (2026-01-15)
457 papers were updated today, including:
- Natural Language Processing: 93 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 146 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 95 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 104 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Value-Aware Numerical Representations for Transformer Language Models
[Quick Read]: This paper addresses the problem that Transformer-based language models score well on mathematical reasoning benchmarks yet remain fragile at basic numerical understanding and arithmetic. The core limitation is that numbers are handled as symbolic tokens whose embeddings do not explicitly encode numerical magnitude, which leads to systematic errors. The key to the solution is a value-aware numerical representation: a dedicated prefix token is added to the standard tokenized input, and its embedding is explicitly conditioned on the underlying numerical value, injecting magnitude information directly into the model's input space while staying compatible with existing tokenizers and decoder-only Transformer architectures. Experiments show the approach outperforms baselines across numerical formats, tasks, and operand lengths, indicating that explicitly encoding numerical magnitude is an effective and efficient way to improve the fundamental numerical robustness of language models.
Link: https://arxiv.org/abs/2601.09706
Authors: Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu
Affiliations: National University of Science and Technology POLITEHNICA Bucharest
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model’s input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
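To make the mechanism concrete, here is a minimal sketch of how a value-conditioned prefix embedding could be prepended to a number's subword embeddings. It is an illustration only, not the authors' implementation: the magnitude features (sign, log magnitude, fractional part) and the small MLP are assumptions chosen for clarity.

```python
import math
import torch
import torch.nn as nn

class ValuePrefixEmbedder(nn.Module):
    """Hypothetical sketch: map a raw numeric value to a prefix embedding that is
    prepended to the number's ordinary token embeddings."""

    def __init__(self, d_model: int):
        super().__init__()
        # Small MLP over simple magnitude features (an assumption, not the paper's design).
        self.proj = nn.Sequential(nn.Linear(3, d_model), nn.GELU(), nn.Linear(d_model, d_model))

    def features(self, value: float) -> torch.Tensor:
        sign = math.copysign(1.0, value)
        log_mag = math.log1p(abs(value))        # compresses large magnitudes
        frac = abs(value) - int(abs(value))     # crude fractional-part cue
        return torch.tensor([sign, log_mag, frac], dtype=torch.float32)

    def forward(self, value: float) -> torch.Tensor:
        return self.proj(self.features(value))  # shape: (d_model,)

def prepend_value_prefix(token_embeds: torch.Tensor, value: float,
                         embedder: ValuePrefixEmbedder) -> torch.Tensor:
    """token_embeds: (seq_len, d_model) embeddings of the number's subword tokens."""
    prefix = embedder(value).unsqueeze(0)       # (1, d_model)
    return torch.cat([prefix, token_embeds], dim=0)

# Toy usage: condition a prefix embedding on the value 3.14 for a 16-dim model.
emb = ValuePrefixEmbedder(d_model=16)
tokens = torch.randn(4, 16)                     # stand-in for subword embeddings of "3.14"
print(prepend_value_prefix(tokens, 3.14, emb).shape)  # torch.Size([5, 16])
```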
[NLP-1] ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation
[Quick Read]: This paper targets the high resource cost and inefficiency of token-by-token code generation with large language models (LLMs), where the generation phase itself has been left largely unoptimized. The core challenge is to cut the number of generated tokens, and therefore inference cost, without sacrificing semantic equivalence or readability of the code. The key to the solution is ShortCoder, a knowledge-infused framework built on three components: (1) ten AST-preserving syntax-level simplification rules for Python that yield an 18.1% token reduction; (2) a hybrid data synthesis pipeline that combines rule-based rewriting with LLM-guided refinement to produce ShorterCodeBench, a dataset of semantically consistent pairs of original and simplified code; and (3) a fine-tuning strategy that injects conciseness awareness so that base LLMs learn to generate more compact code. Experiments show ShortCoder improves generation efficiency on HumanEval by 18.1%-37.8% over prior methods while preserving code-generation performance.
Link: https://arxiv.org/abs/2601.09703
Authors: Sicong Liu, Yanxian Huang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Yuchi Ma, Hongyu Zhang, Yin Zhang, Yanlin Wang
Affiliations: Sun Yat-sen University; Shanghai Jiao Tong University; Tsinghua University; Peking University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development efforts and enhancing software productivity. The emergence of large language models (LLMs) has significantly advanced code generation, though their efficiency is still impacted by certain inherent architectural constraints. Each token generation necessitates a complete inference pass, requiring persistent retention of contextual information in memory and escalating resource consumption. While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored. To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability. In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated tuples of original code and simplified code with semantic consistency; (3) a fine-tuning strategy that injects conciseness awareness into the base LLMs. Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, achieving an improvement of 18.1%-37.8% in generation efficiency over previous methods while ensuring the performance of code generation.
[NLP-2] Empathy Applicability Modeling for General Health Queries ACL
[Quick Read]: This paper addresses the lack of clinical empathy in large language models (LLMs) that are increasingly integrated into clinical workflows: existing natural language processing (NLP) frameworks mostly label empathy reactively in doctors' responses and offer little support for anticipatory modeling of empathy needs in patients' questions, especially for general health queries. The key to the solution is the Empathy Applicability Framework (EAF), a theory-driven framework that classifies patient queries by how applicable empathetic reactions and interpretations are, based on clinical, contextual, and linguistic cues. The authors also release a benchmark of real patient queries dual-annotated by humans and GPT-4o, and validate EAF by training classifiers that outperform heuristic and zero-shot LLM baselines, providing a quantifiable way to identify empathy needs ahead of response generation and to support empathetic communication in asynchronous healthcare settings.
Link: https://arxiv.org/abs/2601.09696
Authors: Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments: In Submission to ACL
Abstract:LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors’ responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.
[NLP-3] LLMs can Compress LLMs: Adaptive Pruning by Agents
[Quick Read]: This paper tackles the severe factual-knowledge degradation caused by structured pruning of large language models (LLMs), as well as the reliance of existing methods on uniform or hand-crafted sparsity allocation. The key to the solution is an agent-guided pruning framework in which a foundation model acts as an adaptive pruning agent: layer-wise sensitivity profiles are built by combining Wanda-inspired weight-activation metrics with gradient importance scores, and an LLM agent with self-reflection decides which layers to prune, and by how much, at each iteration, preserving critical knowledge pathways while compressing the model. The framework requires no retraining, is model-agnostic, and self-corrects via a checkpoint rollback mechanism, needing only 2-4 rollbacks across 21-40 iterations. At roughly 45% sparsity on Qwen3 models, it achieves a 56% relative improvement in MMLU accuracy over structured pruning baselines, 19x better factual-knowledge retention on FreebaseQA, and 69% lower perplexity degradation.
Link: https://arxiv.org/abs/2601.09694
Authors: Sai Varun Kodathala, Rakesh Vunnam
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 Pages
Abstract:As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
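For readers unfamiliar with the Wanda-style score mentioned above, the sketch below illustrates how a per-weight weight-activation metric and a gradient saliency term could be blended and z-score normalized into a layer sensitivity profile for an agent to consume. The blending weight and the exact gradient term are assumptions made for illustration; only the |W| * ||X|| form of the Wanda score follows the original Wanda formulation.

```python
import numpy as np

def wanda_scores(weight: np.ndarray, activations: np.ndarray) -> np.ndarray:
    """Wanda-style importance: |W_ij| * ||X_j||_2, where `activations` stacks
    calibration inputs as (tokens, in_features)."""
    act_norm = np.linalg.norm(activations, axis=0)        # (in_features,)
    return np.abs(weight) * act_norm[None, :]             # (out_features, in_features)

def gradient_scores(weight: np.ndarray, grad: np.ndarray) -> np.ndarray:
    """First-order saliency proxy |W * dL/dW| (an assumed form of 'gradient importance')."""
    return np.abs(weight * grad)

def layer_sensitivity(weight: np.ndarray, activations: np.ndarray,
                      grad: np.ndarray, alpha: float = 0.5) -> float:
    """Scalar sensitivity for one layer: mean of a blended importance map."""
    blended = alpha * wanda_scores(weight, activations) + (1 - alpha) * gradient_scores(weight, grad)
    return float(blended.mean())

def zscore_profile(per_layer_sensitivity: list) -> list:
    """Normalize layer sensitivities to z-scores for model-agnostic comparison."""
    s = np.asarray(per_layer_sensitivity, dtype=np.float64)
    return list((s - s.mean()) / (s.std() + 1e-8))

# Toy example: a 3-layer profile from random statistics.
rng = np.random.default_rng(0)
profile = [layer_sensitivity(rng.normal(size=(8, 4)), rng.normal(size=(32, 4)), rng.normal(size=(8, 4)))
           for _ in range(3)]
print(zscore_profile(profile))  # one z-score per layer, the kind of signal fed to the pruning agent
```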
[NLP-4] Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
[Quick Read]: This paper addresses the difficulty of training large language model (LLM) routers when ground-truth labeled data is unavailable, which is common in practice, especially when user request distributions are heterogeneous and unknown. The authors introduce Routing with Generated Data (RGD), a setting in which routers are trained exclusively on queries and answers generated by LLMs from high-level task descriptions. The key insight is that effective generators have two characteristics: they must answer their own questions accurately, and their questions must produce sufficient performance differentiation across the model pool. Building on this, the paper proposes CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering; CASCAL is substantially more robust to low-quality generated data, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
Link: https://arxiv.org/abs/2601.09692
Authors: Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
Affiliations: UNC Chapel Hill; Capital One; The University of Texas at Austin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL
Abstract:Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
[NLP-5] DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
[Quick Read]: This paper addresses the difficulty of evaluating deep research systems that perform multi-step web research, analysis, and cross-source synthesis: existing benchmarks rely on annotation-intensive task construction, use static evaluation dimensions, or cannot reliably verify facts when citations are missing. The key to the solution is DeepResearchEval, an automated framework with two core components: a persona-driven task-construction pipeline that applies a two-stage filter (Task Qualification and Search Necessity) to retain only complex research tasks requiring multi-source evidence integration, and an agentic evaluation pipeline consisting of Adaptive Point-wise Quality Evaluation, which dynamically derives task-specific evaluation dimensions, criteria, and weights, and Active Fact-Checking, which autonomously extracts and verifies report statements via web search even when citations are missing.
Link: https://arxiv.org/abs/2601.09688
Authors: Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments: Source code: this https URL
Abstract:Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter Task Qualification and Search Necessity to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
[NLP-6] Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection
[Quick Read]: This paper addresses negative transfer when Multi-Task Learning (MTL) is combined with Low-Rank Adaptation (LoRA): conflicting gradient updates from different tasks degrade per-task performance, and LoRA's low-rank constraint narrows the optimization landscape and exacerbates interference between tasks. The key to the solution is Ortho-LoRA, a gradient projection method tailored to LoRA's bipartite structure that dynamically projects conflicting task gradients onto each other's orthogonal complement within the intrinsic LoRA subspace. Experiments on the GLUE benchmark show that Ortho-LoRA effectively mitigates task interference, recovering 95% of the gap between multi-task and single-task baselines with negligible computational overhead.
Link: https://arxiv.org/abs/2601.09684
Authors: Ziyu Yang, Guibin Chen, Yuxin Yang, Aoxiong Zeng, Xiangquan Yang
Affiliations: Shanghai University; East China Normal University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: preprint
Abstract:Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape’s capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.
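The core projection step can be sketched compactly. The snippet below follows the general recipe of PCGrad-style gradient surgery: when two task gradients conflict (negative inner product), remove from one the component along the other. Treating each task gradient as a flattened vector over the adapter parameters is an assumption here; the paper's exact formulation within LoRA's bipartite A/B structure may differ.

```python
import torch

def project_conflicting(g_i: torch.Tensor, g_j: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """If g_i conflicts with g_j (negative inner product), remove from g_i its
    component along g_j, i.e. project g_i onto the orthogonal complement of g_j."""
    dot = torch.dot(g_i, g_j)
    if dot < 0:
        g_i = g_i - (dot / (g_j.norm() ** 2 + eps)) * g_j
    return g_i

def combine_task_gradients(task_grads: list) -> torch.Tensor:
    """Pairwise de-conflict flattened task gradients, then average them."""
    adjusted = []
    for i, g in enumerate(task_grads):
        g_adj = g.clone()
        for j, g_other in enumerate(task_grads):
            if i != j:
                g_adj = project_conflicting(g_adj, g_other)
        adjusted.append(g_adj)
    return torch.stack(adjusted).mean(dim=0)

# Toy example with two conflicting gradients over a 3-d "LoRA subspace".
g1 = torch.tensor([1.0, 0.0, 0.0])
g2 = torch.tensor([-1.0, 1.0, 0.0])
print(combine_task_gradients([g1, g2]))  # tensor([0.2500, 0.7500, 0.0000])
```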
[NLP-7] Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
[Quick Read]: This paper addresses the resource intensity and instability of multi-agent reinforcement learning (MARL) training, where co-adapting teammates induce non-stationarity and rewards are sparse and high-variance. The key to the solution is Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time: a team of experts holds multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for the final decision. MATTRL also introduces turn-level credit assignment to build an experience pool that is reinjected into the dialogue, yielding more stable and robust multi-agent reasoning without any fine-tuning; on challenging benchmarks in medicine, math, and education it clearly outperforms single-agent and multi-agent baselines.
Link: https://arxiv.org/abs/2601.09667
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
Affiliations: MIT; NUS; NYU; Microsoft; UW; Columbia; NTU
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in Progress
Abstract:Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
[NLP-8] Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation
[Quick Read]: This paper addresses the lack of a large-scale, multilingual, open evaluation of Word Sense Disambiguation (WSD) under the UCREL Semantic Analysis System (USAS) framework, and in particular the limits of rule-based systems when manually tagged training data is scarce. The key to the solution is a new silver-labelled English corpus that compensates for the missing gold annotations: various mono- and multilingual neural models are trained and evaluated on it, compared against their rule-based counterparts in both monolingual and cross-lingual setups, and the study further shows that combining a neural network model with the rule-based system can enhance the rule-based system's semantic tagging.
Link: https://arxiv.org/abs/2601.09648
Authors: Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight, Daisy Lal, Gearóid Ó Donnchadha, Mícheál Ó Meachair, Scott Piao, Elaine Uí Dhonnchadha, Johanna Vuorinen, Yan Yabo, Xiaobin Yang
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 2 figures
Abstract:Word Sense Disambiguation (WSD) has been widely evaluated using the semantic frameworks of WordNet, BabelNet, and the Oxford Dictionary of English. However, for the UCREL Semantic Analysis System (USAS) framework, no open extensive evaluation has been performed beyond lexical coverage or single language evaluation. In this work, we perform the largest semantic tagging evaluation of the rule based system that uses the lexical resources in the USAS framework covering five different languages using four existing datasets and one novel Chinese dataset. We create a new silver labelled English dataset, to overcome the lack of manually tagged training data, that we train and evaluate various mono and multilingual neural models in both mono and cross-lingual evaluation setups with comparisons to their rule based counterparts, and show how a rule based system can be enhanced with a neural network model. The resulting neural network models, including the data they were trained on, the Chinese evaluation dataset, and all of the code have been released as open resources.
[NLP-9] TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion WWW
[Quick Read]: This paper addresses the limitations of existing automated taxonomy expansion methods: point-based vector embeddings model symmetric similarity and therefore struggle with the asymmetric "is-a" relations at the heart of taxonomies, while existing box embeddings suffer from unstable gradients at intersection boundaries, no notion of semantic uncertainty, and limited capacity to represent polysemy or ambiguity. The key to the solution is TaxoBell, a framework that translates between box geometries and multivariate Gaussian distributions, with means encoding semantic location and covariances encoding semantic uncertainty; combined with energy-based optimization, this yields stable training, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning.
Link: https://arxiv.org/abs/2601.09633
Authors: Sahil Mishra, Srinitish Srinivasan, Srikanta Bedathur, Tanmoy Chakraborty
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments: Accepted in The Web Conference (WWW) 2026
Abstract:Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet, manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric “is-a” relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experimentation on five benchmark datasets demonstrates that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
[NLP-10] LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
[Quick Read]: This paper addresses the poor performance of large language models (LLMs) on phonologically grounded phenomena such as rhyme detection and generation, a weakness that is even more pronounced in lower-resource languages like Modern Greek. The key to the solution is a hybrid system that combines LLMs with deterministic phonological algorithms: it implements a comprehensive taxonomy of Greek rhyme types (Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel) and employs an agentic generation pipeline with phonological verification. Experiments show that pure LLM generation fails badly (under 4% valid poems), whereas the hybrid verification loop raises generation accuracy to 73.1%, well beyond model-only approaches.
Link: https://arxiv.org/abs/2601.09631
Authors: Stergios Chatzikyriakidis
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant “Reasoning Gap”: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
[NLP-11] Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
[Quick Read]: This paper addresses the large variation in sample-level unlearning difficulty: under the same unlearning procedure, some samples are reliably erased while others stubbornly persist. The key to the solution is Circuit-guided Unlearning Difficulty (CUD), a pre-unlearning metric based on model circuits (the structured interaction pathways inside a model) that assigns each sample a continuous difficulty score. Experiments show that CUD reliably separates intrinsically easy and hard samples across unlearning methods and uncovers a mechanistic signature of difficulty: easy-to-unlearn samples depend on shorter, shallower early-to-intermediate computation paths, whereas hard samples involve longer, deeper late-stage paths. CUD thus offers an interpretable, fine-grained, and principled analysis of unlearning difficulty and motivates unlearning methods grounded in model mechanisms.
Link: https://arxiv.org/abs/2601.09624
Authors: Jiali Cheng, Ziheng Chen, Chirag Agarwal, Hadi Amiri
Affiliations: Not listed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Machine unlearning is becoming essential for building trustworthy and compliant language models. Yet unlearning success varies considerably across individual samples: some are reliably erased, while others persist despite the same procedure. We argue that this disparity is not only a data-side phenomenon, but also reflects model-internal mechanisms that encode and protect memorized information. We study this problem from a mechanistic perspective based on model circuits (structured interaction pathways that govern how predictions are formed). We propose Circuit-guided Unlearning Difficulty (CUD), a pre-unlearning metric that assigns each sample a continuous difficulty score using circuit-level signals. Extensive experiments demonstrate that CUD reliably separates intrinsically easy and hard samples, and remains stable across unlearning methods. We identify key circuit-level patterns that reveal a mechanistic signature of difficulty: easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in earlier-to-intermediate parts of the original model, whereas hard samples rely on longer and deeper pathways closer to late-stage computation. Compared to existing qualitative studies, CUD takes a first step toward a principled, fine-grained, and interpretable analysis of unlearning difficulty; and motivates the development of unlearning methods grounded in model mechanisms.
[NLP-12] DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
[Quick Read]: This paper addresses the drop in output diversity that reinforcement learning (RL)-based enhancement commonly causes in large language models (LLMs), which limits their usefulness in open-ended tasks such as creative writing. The key to the solution is an RL framework built around a semi-structured long Chain-of-Thought (CoT) in which generation is decomposed into explicitly planned intermediate steps: a Diverse Planning Branching method strategically introduces divergence at the planning stage based on diversity variation, and a group-aware diversity reward encourages distinct trajectories. Experiments on creative writing benchmarks show that the approach substantially improves output diversity without compromising generation quality, consistently outperforming existing baselines.
Link: https://arxiv.org/abs/2601.09609
Authors: Qian Cao, Yahui Liu, Wei Bi, Yi Zhao, Ruihua Song, Xiting Wang, Ruiming Tang, Guorui Zhou, Han Li
Affiliations: Renmin University of China; Kuaishou Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
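The abstract describes the group-aware diversity reward only at a high level. The sketch below shows one plausible instantiation, purely as an assumption and not the paper's formula: each sampled plan in a rollout group is rewarded by its average dissimilarity to the other plans in the same group, measured here with a simple token-level Jaccard distance.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 - Jaccard similarity over whitespace tokens (a crude lexical dissimilarity)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def group_diversity_rewards(plans: list) -> list:
    """Reward each plan in a rollout group by its mean distance to the other plans."""
    rewards = []
    for i, p in enumerate(plans):
        others = [jaccard_distance(p, q) for j, q in enumerate(plans) if j != i]
        rewards.append(sum(others) / max(len(others), 1))
    return rewards

group = [
    "open with a storm at sea then flashback to the harbor",
    "open with a storm at sea then flashback to the harbour",
    "begin in a quiet library where a letter changes everything",
]
print(group_diversity_rewards(group))  # the third, most distinct plan earns the highest reward
```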
[NLP-13] Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer
[Quick Read]: This paper addresses the high resource consumption and cost of foundation models applied to music information retrieval (MIR) tasks, which stem from their very large parameter counts. The key to the solution is combining the Branchformer architecture with SummaryMixing and introducing a random quantization process, which shrinks the model by 8.5% to 12.3% while keeping performance competitive with state-of-the-art models that rely on multi-head self-attention.
Link: https://arxiv.org/abs/2601.09603
Authors: Petros Vavaroutsos, Theodoros Palamas, Pantelis Vikatos
Affiliations: Not listed
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: accepted by ACM/SIGAPP Symposium on Applied Computing (SAC 2026)
Abstract:In recent years, foundation models have become very popular due to their exceptional performance, mainly in natural language (NLP) tasks where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, making them resource-intensive during training and in production systems, leading to increased costs. This paper focuses on the reduction of a foundation’s model size when applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, which were first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework consisting of a variety of downstream MIR tasks. Our results show that our architecture achieves competitive performance when compared with other state-of-the-art models that use multi-head self-attention, while reducing the model size from 8.5% up to 12.3%.
[NLP-14] Show don't tell – Providing Visual Error Feedback for Handwritten Documents
[Quick Read]: This paper addresses how to turn images of handwritten input into accurate, informative, and correctly placed error feedback for teaching handwriting skills. The key difficulty lies in the many steps between a raw handwriting image and well-placed error annotations, including character recognition, positional alignment, and semantic understanding. The study empirically compares modular and end-to-end systems, finds that neither currently reaches acceptable overall quality, and proposes a research agenda focused on improving multimodal alignment, strengthening model robustness, and building more effective feedback mechanisms.
Link: https://arxiv.org/abs/2601.09586
Authors: Said Yasin, Torsten Zesch
Affiliations: Not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Handwriting remains an essential skill, particularly in education. Therefore, providing visual feedback on handwritten documents is an important but understudied area. We outline the many challenges when going from an image of handwritten input to correctly placed informative error feedback. We empirically compare modular and end-to-end systems and find that both approaches currently do not achieve acceptable overall quality. We identify the major challenges and outline an agenda for future research.
[NLP-15] Permutation Matching Under Parikh Budgets: Linear-Time Detection, Packing, and Disjoint Selection
[Quick Read]: This paper studies permutation (jumbled/Abelian) pattern matching over a general alphabet Σ, extending the classical existence question to optimization and packing variants. The practical problems it targets are finding the longest feasible substring under a resource budget (Maximum Feasible Substring under Pattern Supply, MFSP) and selecting non-overlapping matches. The key to the solution is a unified sliding-window framework that maintains the Parikh-vector difference between the pattern P and the current window of the text T, giving permutation matching in O(n + σ) time and O(σ) space; on top of this, a two-pointer feasibility-maintenance algorithm solves MFSP, and a greedy earliest-finishing strategy selects a maximum-cardinality set of disjoint matches in linear time once all matches are enumerated, tying frequency-based string matching to packing-style optimization primitives.
Link: https://arxiv.org/abs/2601.09577
Authors: MD Nazmul Alam Shanto, Md. Tanzeem Rahat, Md. Manzurul Hasan
Affiliations: Not listed
Subjects: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL)
Comments: 12 pages (Excluding reference)
Abstract:We study permutation (jumbled/Abelian) pattern matching over a general alphabet Σ. Given a pattern P of length m and a text T of length n, the classical task is to decide whether T contains a length-m substring whose Parikh vector equals that of P. While this existence problem admits a linear-time sliding-window solution, many practical applications require optimization and packing variants beyond mere detection. We present a unified sliding-window framework based on maintaining the Parikh-vector difference between P and the current window of T, enabling permutation matching in O(n + σ) time and O(σ) space, where σ = |Σ|. Building on this foundation, we introduce a combinatorial-optimization variant that we call Maximum Feasible Substring under Pattern Supply (MFSP): find the longest substring S of T whose symbol counts are component-wise bounded by those of P. We show that MFSP can also be solved in O(n + σ) time via a two-pointer feasibility maintenance algorithm, providing an exact packing interpretation of P as a resource budget. Finally, we address non-overlapping occurrence selection by modeling each permutation match as an equal-length interval and proving that a greedy earliest-finishing strategy yields a maximum-cardinality set of disjoint matches, computable in linear time once all matches are enumerated. Our results provide concise, provably correct algorithms with tight bounds, and connect frequency-based string matching to packing-style optimization primitives.
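The sliding-window detection step described in the abstract is concrete enough to sketch directly: maintain the symbol-count difference between the pattern and the current length-m window, together with the number of symbols whose difference is nonzero, and update both in O(1) per shift. The code below is an illustrative reimplementation of that stated idea over Python strings, not the authors' code.

```python
from collections import Counter

def permutation_match_positions(text: str, pattern: str) -> list:
    """Return start indices of substrings of `text` that are permutations of `pattern`,
    by maintaining the Parikh-vector difference between the pattern and a sliding window."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    diff = Counter(pattern)
    diff.subtract(Counter(text[:m]))                       # pattern counts minus window counts
    mismatches = sum(1 for v in diff.values() if v != 0)   # symbols whose difference is nonzero
    hits = [0] if mismatches == 0 else []
    for i in range(m, n):
        for ch, delta in ((text[i], -1), (text[i - m], +1)):   # symbol entering / leaving the window
            was_zero = diff[ch] == 0
            diff[ch] += delta
            if was_zero and diff[ch] != 0:
                mismatches += 1
            elif not was_zero and diff[ch] == 0:
                mismatches -= 1
        if mismatches == 0:
            hits.append(i - m + 1)
    return hits

print(permutation_match_positions("cbaebabacd", "abc"))  # [0, 6]
```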
[NLP-16] Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering
[Quick Read]: This paper addresses the instrumentation gap faced by autonomous systems that conduct schema-grounded information-gathering dialogues: there are no turn-level observables for monitoring acquisition efficiency or detecting when questioning has become unproductive. The key to the solution is Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying the residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) that flags stalled dialogue by detecting repeated probing of the same category with semantically similar, low-marginal-gain responses, enabling monitoring without causal diagnosis. DT is validated in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, where it distinguishes efficient from stalled dialogue traces, and integrating its signals into a reinforcement learning (RL) policy improves policy performance when stalling carries operational costs.
Link: https://arxiv.org/abs/2601.09570
Authors: Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo
Affiliations: Cranfield University
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 9 Figures, Version submitted to IEEE for publication
Abstract:Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.
[NLP-17] Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
[Quick Read]: This paper addresses the fact that existing post-training quantization (PTQ) algorithms were designed mainly for integer quantization, so their applicability and behavior under Microscaling Floating-Point (MXFP) formats remain largely unexplored. The key to the solution is a systematic study spanning more than 7 PTQ algorithms, 15 benchmarks, and 3 LLM families, which establishes how PTQ behaves under MXFP and what drives it: MXFP8 achieves near-lossless accuracy while MXFP4 still incurs substantial degradation; PTQ effectiveness depends strongly on format compatibility, with certain algorithmic paradigms consistently more effective than others; and the quantization scaling factor is a critical error source in MXFP4 whose impact a simple pre-scale optimization strategy can significantly mitigate. Together, these findings offer practical guidance for adapting existing PTQ methods to MXFP quantization.
Link: https://arxiv.org/abs/2601.09555
Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen, Zhenhua Dong, Xianzhi Yu
Affiliations: Huawei Technologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
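To make the format concrete: in a microscaling block, a group of values (typically 32) shares one power-of-two scale, and each element is stored on a tiny floating-point grid such as E2M1 for MXFP4. The sketch below is a simplified fake-quantization of one block under these assumptions; the grid, rounding, and scale-selection details are illustrative rather than the exact OCP MX specification, but it shows why the shared scale is such a sensitive error source.

```python
import math
import numpy as np

# Representable magnitudes of an E2M1 (FP4-style) element grid -- an illustrative choice.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block: np.ndarray, scale_shift: int = 0) -> np.ndarray:
    """Fake-quantize one block with a shared power-of-two scale plus a tiny FP grid.
    `scale_shift` mimics adjusting the shared exponent (a stand-in for pre-scale tuning)."""
    max_abs = float(np.max(np.abs(block))) + 1e-12
    # Shared power-of-two scale chosen so the largest element fits inside the grid.
    exp = math.ceil(math.log2(max_abs / E2M1_GRID[-1])) + scale_shift
    scale = 2.0 ** exp
    scaled = block / scale
    # Round each element's magnitude to the nearest grid point, keeping its sign.
    candidates = np.sign(scaled)[:, None] * E2M1_GRID[None, :]
    idx = np.abs(scaled[:, None] - candidates).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.normal(scale=0.1, size=32)               # one microscaling block of 32 values
for shift in (0, 1):                                  # compare the default scale with a coarser one
    err = float(np.mean((block - quantize_block_mxfp4(block, shift)) ** 2))
    print(f"scale_shift={shift}: MSE={err:.2e}")      # the coarser shared scale inflates the error
```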
[NLP-18] SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams
[Quick Read]: This paper addresses the difficulty relevance models have in generalizing to real-world search, where query streams evolve dynamically; in large-scale industrial settings the two core obstacles are that informative training samples are sparse and hard to identify, and that pseudo-labels produced by the current model can be unreliable. The key to the solution is the Self-Evolving Relevance Model (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner that detects distributional shifts and identifies informative training samples, and a multi-agent relevance annotator that provides reliable labels through a two-level agreement framework. Evaluated in a large-scale industrial setting serving billions of user requests daily, SERM achieves significant gains through iterative self-evolution, validated by extensive offline multilingual evaluations and online testing.
Link: https://arxiv.org/abs/2601.09515
Authors: Chenglong Wang, Canjia Li, Xingzhao Zhu, Yifu Huo, Huiyu Wang, Weixiong Lin, Yun Yang, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Tong Xiao
Affiliations: Northeastern University; ByteDance; Peking University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Due to the dynamically evolving nature of real-world query streams, relevance models struggle to generalize to practical search scenarios. A sophisticated solution is self-evolution techniques. However, in large-scale industrial settings with massive query streams, this technique faces two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo-labels generated by the current model could be unreliable. To address these challenges, in this work, we propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi-agent relevance annotator, which provides reliable labels through a two-level agreement framework. We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily. Experimental results demonstrate that SERM can achieve significant performance gains through iterative self-evolution, as validated by extensive offline multilingual evaluations and online testing.
[NLP-19] MVSS: A Unified Framework for Multi-View Structured Survey Generation
[Quick Read]: This paper addresses the weaknesses of existing automatic survey generation methods in structural organization and methodological comparison: they typically focus on linear text generation and struggle to explicitly model hierarchical relations among research topics and structured comparisons, leaving the resulting surveys behind expert-written ones in structural quality. The key to the solution is MVSS, a Multi-View Structured Survey generation framework that follows a structure-first paradigm: it first constructs a citation-grounded conceptual tree of the research domain, then generates structured comparison tables constrained by the tree, and finally uses both the tree and the tables as constraints for text generation, producing complementary representations across structure, comparison, and narrative. Experiments on 76 computer science topics show MVSS outperforms existing methods in organization and evidence grounding, approaching the quality of expert surveys.
Link: https://arxiv.org/abs/2601.09504
Authors: Yinqi Liu, Yueqi Zhu, Yongkang Zhang, Xinfeng Li, Feiran Liu, Yufei Sun, Xin Wang, Renzhao Liang, Yidong Wang, Cunxiang Wang
Affiliations: Beijing University of Technology; Nanyang Technological University; Peking University; Beihang University; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Scientific surveys require not only summarizing large bodies of literature, but also organizing them into clear and coherent conceptual structures. Existing automatic survey generation methods typically focus on linear text generation and struggle to explicitly model hierarchical relations among research topics and structured methodological comparisons, resulting in gaps in structural organization compared to expert-written surveys. We propose MVSS, a multi-view structured survey generation framework that jointly generates and aligns citation-grounded hierarchical trees, structured comparison tables, and survey text. MVSS follows a structure-first paradigm: it first constructs a conceptual tree of the research domain, then generates comparison tables constrained by the tree, and finally uses both as structural constraints for text generation. This enables complementary multi-view representations across structure, comparison, and narrative. We introduce an evaluation framework assessing structural quality, comparative completeness, and citation fidelity. Experiments on 76 computer science topics show MVSS outperforms existing methods in organization and evidence grounding, achieving performance comparable to expert surveys.
[NLP-20] SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics
[Quick Read]: This paper addresses the challenge of evaluating heterogeneous automated slide-generation systems built on generative AI: existing protocols struggle to produce scores that are comparable across architectures and often rely on uncalibrated human judgments. The key to the solution is SlidesGen-Bench, a benchmark grounded in three principles: universality, quantification, and reliability. Concretely, terminal outputs are treated as visual renderings so the evaluation is agnostic to the underlying generation method; a computational approach quantitatively scores slides along three dimensions (Content, Aesthetics, and Editability), replacing the subjective or reference-dependent proxies of prior work; and the human-preference-aligned Slides-Align1.5k dataset, covering slides from nine mainstream generation systems across seven scenarios, ensures that the benchmark correlates closely with human judgment.
Link: https://arxiv.org/abs/2601.09487
Authors: Yunqiao Yang, Wenbo Li, Houxing Ren, Zimu Lu, Ke Wang, Zhiyuan Huang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li
Affiliations: Not listed
Subjects: Computation and Language (cs.CL)
Comments: 37 pages, 34 figures
Abstract:The rapid evolution of Large Language Models (LLMs) has fostered diverse paradigms for automated slide generation, ranging from code-driven layouts to image-centric synthesis. However, evaluating these heterogeneous systems remains challenging, as existing protocols often struggle to provide comparable scores across architectures or rely on uncalibrated judgments. In this paper, we introduce SlidesGen-Bench, a benchmark designed to evaluate slide generation through a lens of three core principles: universality, quantification, and reliability. First, to establish a unified evaluation framework, we ground our analysis in the visual domain, treating terminal outputs as renderings to remain agnostic to the underlying generation method. Second, we propose a computational approach that quantitatively assesses slides across three distinct dimensions - Content, Aesthetics, and Editability - offering reproducible metrics where prior works relied on subjective or reference-dependent proxies. Finally, to ensure high correlation with human preference, we construct the Slides-Align1.5k dataset, a human preference aligned dataset covering slides from nine mainstream generation systems across seven scenarios. Our experiments demonstrate that SlidesGen-Bench achieves a higher degree of alignment with human judgment than existing evaluation pipelines. Our code and data are available at this https URL.
[NLP-21] Dissecting Judicial Reasoning in U.S. Copyright Damage Awards KDD’25
[Quick Read]: This paper addresses the inconsistency of judicial reasoning in copyright damage awards: although federal courts follow the 1976 Copyright Act, their interpretations and factor weightings vary widely across jurisdictions, making outcomes unpredictable for litigants and obscuring the empirical basis of legal decisions. The key to the solution is a discourse-based large language model (LLM) methodology that integrates Rhetorical Structure Theory (RST) with an agentic workflow: a three-stage pipeline (Dataset Construction, Discourse Analysis, and Agentic Feature Extraction) parses judicial opinions into hierarchical discourse structures and identifies reasoning components and feature labels with their corresponding discourse subtrees, quantifying previously opaque reasoning patterns. On copyright damage rulings, this discourse-augmented LLM analysis outperforms traditional methods and uncovers previously unquantified variation in factor weighting across circuits.
Link: https://arxiv.org/abs/2601.09459
Authors: Pei-Chi Lo, Thomas Y. Lu
Affiliations: National Sun Yat-sen University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Presented in SIGKDD'25 SciSoc LLM Workshop: Large Language Models for Scientific and Societal Advances
Abstract:Judicial reasoning in copyright damage awards poses a core challenge for computational legal analysis. Although federal courts follow the 1976 Copyright Act, their interpretations and factor weightings vary widely across jurisdictions. This inconsistency creates unpredictability for litigants and obscures the empirical basis of legal decisions. This research introduces a novel discourse-based Large Language Model (LLM) methodology that integrates Rhetorical Structure Theory (RST) with an agentic workflow to extract and quantify previously opaque reasoning patterns from judicial opinions. Our framework addresses a major gap in empirical legal scholarship by parsing opinions into hierarchical discourse structures and using a three-stage pipeline, i.e., Dataset Construction, Discourse Analysis, and Agentic Feature Extraction. This pipeline identifies reasoning components and extract feature labels with corresponding discourse subtrees. In analyzing copyright damage rulings, we show that discourse-augmented LLM analysis outperforms traditional methods while uncovering unquantified variations in factor weighting across circuits. These findings offer both methodological advances in computational legal analysis and practical insights into judicial reasoning, with implications for legal practitioners seeking predictive tools, scholars studying legal principle application, and policymakers confronting inconsistencies in copyright law.
[NLP-22] Improving Symbolic Translation of Language Models for Logical Reasoning AAAI2026
[Quick Read]: This paper addresses the problem that smaller language models (LMs) often produce incorrect symbolic output when translating natural language (NL) into first-order logic (FOL), due to formatting and translation errors, which undermines the reliability of solver-based verifiable reasoning systems. The key to the solution has three parts: common error categories are analyzed and smaller models are fine-tuned on data synthesized by large language models to improve FOL translation accuracy; an incremental inference scheme splits inference into predicate generation and FOL translation, giving greater control over model behavior and improving predicate coverage; and a verification module targeting predicate-arity errors further improves performance. Together, these measures reduce error rates, increase predicate coverage, and strengthen reasoning in smaller LMs, moving toward reliable and accessible symbolic-reasoning systems.
Link: https://arxiv.org/abs/2601.09446
Authors: Ramya Keerthy Thatikonda, Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Affiliations: Not listed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The Third workshop of NeusymBridge @AAAI 2026 (Bridging Neurons and Symbols for NLP and Knowledge Graph Reasoning)
Abstract:The use of formal language for deductive logical reasoning aligns well with language models (LMs), where translating natural language (NL) into first-order logic (FOL) and employing an external solver results in a verifiable and therefore reliable reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs using data synthesized by large language models. The evaluation is performed using the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and enhancing generation quality as measured by predicate metrics. This decomposition framework also enables the use of a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three families of models across four logical-reasoning datasets. The comprehensive fine-tuning, incremental inference, and verification modules reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to developing reliable and accessible symbolic-reasoning systems.
[NLP-23] Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models
[Quick Read]: This paper addresses intra-memory knowledge conflict in language models (LMs), which arises when inconsistent pre-training data leaves contradictory information about the same event encoded in the model's parametric knowledge. Prior work has mainly focused on resolving conflicts between a model's internal knowledge and external resources through fine-tuning or knowledge editing, leaving the localization of conflicts that originate during pre-training within internal representations unexplored. The key to the solution is a framework built on mechanistic interpretability methods that identifies where and how conflicting knowledge is encoded inside the model and uses causal interventions to control that conflicting knowledge at inference time.
Link: https://arxiv.org/abs/2601.09445
Authors: Minh Vu Pham, Hsuvas Borkakoty, Yufang Hou
Affiliations: IT:U Austria; IBM Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model’s parametric knowledge. While prior work has primarily focused on resolving conflicts between a model’s internal knowledge and external resources through approaches such as fine-tuning or knowledge editing, the problem of localizing conflicts that originate during pre-training within the model’s internal representations remain unexplored. In this work, we design a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from the pre-training data is encoded within LMs. Our findings contribute to a growing body of evidence that specific internal components of a language model are responsible for encoding conflicting knowledge from pre-training, and we demonstrate how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.
[NLP-24] Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing
[Quick Read]: This paper addresses the prohibitively high barrier to studying bias formation and mitigation in large pre-trained language models (LMs): re-training them is too expensive, so most debiasing work falls back on post-hoc or masking-based strategies that do not reach the underlying causes of bias. The key to the solution is BabyLMs, compact BERT-like proxy models trained on small, mutable corpora that approximate how larger models acquire bias and develop performance. Experiments show that BabyLMs and standard BERT models display closely aligned patterns of intrinsic bias formation and performance development, and that these correlations hold across multiple debiasing methods. Leveraging this alignment, the authors run pre-model debiasing experiments on BabyLMs, replicating prior findings and presenting new insights on how gender imbalance and toxicity influence bias formation, while cutting pre-training costs from over 500 GPU-hours to under 30 GPU-hours and making fairness research far more accessible.
Link: https://arxiv.org/abs/2601.09421
Authors: Filip Trhlik, Andrew Caines, Paula Buttery
Affiliations: University of Cambridge; ALTA Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 18 figures
Abstract:Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.
[NLP-25] Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception
[Quick Read]: This paper addresses the performance degradation that occurs when multimodal audio-understanding models are naively fine-tuned jointly on speech recognition and external sound-understanding tasks, because the model is easily misled by noisy hypotheses. The key to the solution is Speech-Hands, a voice-agentic framework that recasts the problem as a learnable self-reflection decision about when to trust the model's own judgment and when to consult external audio perception, which prevents the model from being derailed by flawed external candidates. This mechanism reduces WER by 12.1% over strong baselines across seven OpenASR benchmarks and achieves 77.37% accuracy on audio question-answering decisions, showing robust generalization and reliability.
Link: https://arxiv.org/abs/2601.09413
Authors: Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Ehsan Hosseini Asl, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg
Affiliations: Not listed
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Audio and Speech Processing (eess.AS)
Comments: Preprint. The version was submitted in October 2025
Abstract:We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
[NLP-26] Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation
[Quick Read]: This paper addresses the lack of a coherent organizational structure in the iterative knowledge-accumulation processes of existing retrieval-augmented generation (RAG) models, which limits how comprehensive and cohesive the constructed knowledge representations can be. The key to the solution is PAGER, a page-driven autonomous knowledge representation framework: a large language model (LLM) first constructs a structured cognitive outline for the question, made up of slots that each represent a distinct knowledge aspect; PAGER then iteratively retrieves and refines relevant documents to populate each slot, ultimately building a coherent, information-dense page that serves as the contextual input guiding answer generation. Experiments across knowledge-intensive benchmarks and backbone models show PAGER consistently outperforms RAG baselines, builds higher-quality knowledge representations, better mitigates knowledge conflicts, and helps LLMs use external knowledge more effectively.
Link: https://arxiv.org/abs/2601.09402
Authors: Xinze Li, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun
Affiliations: Northeastern University; Tsinghua University; Nanjing University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge. Recently, some works have incorporated iterative knowledge accumulation processes into RAG models to progressively accumulate and refine query-related knowledge, thereby constructing more comprehensive knowledge representations. However, these iterative processes often lack a coherent organizational structure, which limits the construction of more comprehensive and cohesive knowledge representations. To address this, we propose PAGER, a page-driven autonomous knowledge representation framework for RAG. PAGER first prompts an LLM to construct a structured cognitive outline for a given question, which consists of multiple slots representing a distinct knowledge aspect. Then, PAGER iteratively retrieves and refines relevant documents to populate each slot, ultimately constructing a coherent page that serves as contextual input for guiding answer generation. Experiments on multiple knowledge-intensive benchmarks and backbone models show that PAGER consistently outperforms all RAG baselines. Further analyses demonstrate that PAGER constructs higher-quality and information-dense knowledge representations, better mitigates knowledge conflicts, and enables LLMs to leverage external knowledge more effectively. All code is available at this https URL.
[NLP-27] Ability Transfer and Recovery via Modularized Parameters Localization
[Quick Read]: This paper addresses catastrophic forgetting during continual pre-training or fine-tuning of large language models (LLMs): specializing a model for a particular domain, language, or skill often degrades its other abilities. The key to the solution is the finding that ability-related activations are highly concentrated in a small set of channels (typically around 5%) that are largely disentangled, with good sufficiency and stability. Building on this, the paper proposes ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences, transfers only the corresponding parameters, and applies lightweight fine-tuning for compatibility, enabling recovery of forgotten abilities and the merging of several abilities into a single model with minimal interference.
Link: https://arxiv.org/abs/2601.09398
Authors: Songyao Jin, Kun Zhou, Wenqi Li, Peng Wang, Biwei Huang
Affiliations: University of California San Diego
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models can be continually pre-trained or fine-tuned to improve performance in specific domains, languages, or skills, but this specialization often degrades other capabilities and may cause catastrophic forgetting. We investigate how abilities are distributed within LLM parameters by analyzing module activations under domain- and language-specific inputs for closely related models. Across layers and modules, we find that ability-related activations are highly concentrated in a small set of channels (typically 5%), and these channels are largely disentangled with good sufficiency and stability. Building on these observations, we propose ACT (Activation-Guided Channel-wise Ability Transfer), which localizes ability-relevant channels via activation differences and selectively transfers only the corresponding parameters, followed by lightweight fine-tuning for compatibility. Experiments on multilingual mathematical and scientific reasoning show that ACT can recover forgotten abilities while preserving retained skills. It can also merge multiple specialized models to integrate several abilities into a single model with minimal interference. Our code and data will be publicly released.
[NLP-28] SLAM-LLM : A Modular Open-Source Multimodal Large Language Model Framework and Best Practice for Speech Language Audio and Music Processing
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在语音、音频和音乐等听觉模态支持不足的问题,这限制了音频-语言模型的发展,并导致研究人员需耗费大量精力进行代码开发与超参数调优。其解决方案的关键在于提出一个名为SLAM-LLM的开源深度学习框架,该框架通过模块化配置不同类型的编码器(encoders)、投影层(projectors)、大语言模型(LLMs)以及参数高效微调插件(parameter-efficient fine-tuning plugins),实现对语音、语言、音频和音乐处理任务的灵活定制训练;同时提供主流任务(如基于LLM的自动语音识别ASR、自动音频描述AAC和音乐描述MC)的详细训练与推理方案及高性能检查点,部分技术已达到或接近当前最优水平,从而显著降低研发门槛并加速音频驱动的MLLM迭代与应用落地。
链接: https://arxiv.org/abs/2601.09385
作者: Ziyang Ma,Guanrou Yang,Wenxi Chen,Zhifu Gao,Yexing Du,Xiquan Li,Zhisheng Zheng,Haina Zhu,Jianheng Zhuo,Zheshu Song,Ruiyang Xu,Tianrui Wang,Yifan Yang,Yanqiao Zhu,Zhikang Niu,Liumeng Xue,Yinghao Ma,Ruibin Yuan,Shiliang Zhang,Kai Yu,Eng Siong Chng,Xie Chen
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Published in IEEE Journal of Selected Topics in Signal Processing (JSTSP)
Abstract:The recent surge in open-source Multimodal Large Language Models (MLLM) frameworks, such as LLaVA, provides a convenient kickoff for artificial intelligence developers and researchers. However, most of the MLLM frameworks take vision as the main input modality, and provide limited in-depth support for the modality of speech, audio, and music. This situation hinders the development of audio-language models, and forces researchers to spend a lot of effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints like LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some relevant techniques have also been accepted by academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to the LLM-based speech, audio and music processing.
zh
[NLP-29] Long-term Task-oriented Agent: Proactive Long-term Intent Maintenance in Dynamic Environments
【速读】: 该论文旨在解决当前大型语言模型代理(Large Language Model Agents)普遍采用的反应式交互范式在长期任务导向交互中的局限性,即无法持续维护用户意图并动态适应外部环境变化的问题。其解决方案的关键在于提出一种新型的主动式任务导向代理(Proactive Task-oriented Agent)交互范式,通过两个核心能力实现:(i) 意图条件监控(Intent-Conditioned Monitoring),使代理能基于对话历史自主设定触发条件;(ii) 事件驱动后续交互(Event-Triggered Follow-up),当检测到环境更新时主动与用户互动。此外,作者构建了高质量的数据合成流水线以生成复杂多轮对话数据,并提出 ChronosBench 基准用于评估动态环境中任务导向交互性能,实验表明基于合成数据微调的模型在包含意图漂移的复杂任务中任务完成率达 85.19%,验证了该数据驱动策略的有效性。
链接: https://arxiv.org/abs/2601.09382
作者: Qinglong Shi,Donghai Wang,Hantao Zhou,Jiguo Li,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 2 figures
Abstract:Current large language model agents predominantly operate under a reactive paradigm, responding only to immediate user queries within short-term sessions. This limitation hinders their ability to maintain long-term user’s intents and dynamically adapt to evolving external environments. In this paper, we propose a novel interaction paradigm for proactive Task-oriented Agents capable of bridging the gap between relatively static user’s needs and a dynamic environment. We formalize proactivity through two key capabilities, (i) Intent-Conditioned Monitoring: The agent autonomously formulates trigger conditions based on dialog history; (ii) Event-Triggered Follow-up: The agent actively engages the user upon detecting useful environmental updates. We introduce a high-quality data synthesis pipeline to construct complex, multi-turn dialog data in a dynamic environment. Furthermore, we attempt to address the lack of evaluation criteria of task-oriented interaction in a dynamic environment by proposing a new benchmark, namely ChronosBench. We evaluated some leading close-source and open-source models at present and revealed their flaws in long-term task-oriented interaction. Furthermore, our fine-tuned model trained using synthetic data for supervised learning achieves a task completion rate of 85.19% for complex tasks including shifts in user intent, outperforming other models under test. And the result validated the effectiveness of our data-driven strategy.
zh
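【补充示例】针对上文"意图条件监控 + 事件触发跟进"的交互范式,给出一个最小事件循环草图;`llm` 为占位回调,触发条件与提示词均为假设,仅用于说明流程,并非 ChronosBench 或论文模型的实现。

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trigger:
    intent: str      # 从对话历史中归纳出的长期意图
    condition: str   # 由 LLM 自主生成的自然语言触发条件

def monitor_and_follow_up(dialog_history: List[str],
                          env_events: List[str],
                          llm: Callable[[str], str]) -> List[str]:
    """示意:(i) 意图条件监控;(ii) 检测到环境更新满足条件时主动跟进。"""
    # (i) 根据对话历史自主设定触发条件
    condition = llm("根据以下对话历史,写出一个值得持续监控的触发条件:\n"
                    + "\n".join(dialog_history))
    trigger = Trigger(intent=dialog_history[-1], condition=condition)

    follow_ups = []
    for event in env_events:                     # 模拟外部环境的事件流
        verdict = llm(f"事件:{event}\n触发条件:{trigger.condition}\n是否满足?只回答 是/否")
        if verdict.strip().startswith("是"):
            # (ii) 事件触发式跟进:主动向用户发起一轮交互
            follow_ups.append(llm(f"基于事件“{event}”,围绕用户意图“{trigger.intent}”主动跟进一句话"))
    return follow_ups
```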
[NLP-30] The Imperfective Paradox in Large Language Models
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否真正理解事件的组合语义(compositional semantics),还是仅仅依赖表面的概率启发式规则。为探究这一问题,作者聚焦于“未完成态悖论”(Imperfective Paradox)——即过去进行时在描述活动类事件(如“running”蕴含“ran”)时通常隐含事件实现,但在成就类事件(如“building”不蕴含“built”)中则不然。解决方案的关键在于构建了一个名为ImperfectiveNLI的诊断数据集,用于系统性地测试模型在不同语义类别下对这一逻辑差异的识别能力;同时通过表征分析发现,尽管内部嵌入能区分过程与结果,但推理决策却受制于强烈的关于目标达成的先验偏见(Teleological Bias),导致模型倾向于错误地假设目标已完成,即使文本明确否定。这表明当前LLMs缺乏结构化的体态意识(aspectual awareness),更像预测性叙事引擎而非忠实的逻辑推理器。
链接: https://arxiv.org/abs/2601.09373
作者: Bolei Ma,Yusuke Miyao
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); MCML; The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running → ran) but not for accomplishments (e.g., building ↛ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.
zh
[NLP-31] Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish
【速读】: 该论文旨在解决非英语临床信息抽取任务中标注数据稀缺的问题,从而限制了以英文为主开发的大语言模型(Large Language Model, LLM)在多语言场景下的评估与应用。其核心解决方案是构建首个英-土耳其语平行临床关系抽取(Relation Extraction, RE)数据集,并提出一种基于对比学习的关系感知检索方法(Relation-Aware Retrieval, RAR),通过优化上下文示例的选择来增强LLM在跨语言临床文本中的语义理解能力。实验表明,基于提示(prompting)的LLM方法显著优于传统微调模型,且RAR结合结构化推理提示后,在英语任务上达到0.918的F1分数,验证了高质量示例检索与先进提示策略对弥合临床自然语言处理资源差距的关键作用。
链接: https://arxiv.org/abs/2601.09367
作者: Aidana Aidynkyzy,Oğuz Dikenelli,Oylum Alatlı,Şebnem Bora
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, evaluations for English performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
zh
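【补充示例】RAR 的核心是在挑选上下文示例时同时考虑句子级与关系级语义。下面的 NumPy 草图用加权余弦相似度做示意;嵌入来源、权重 alpha 等均为假设,并非论文的对比学习模型或原始打分公式。

```python
import numpy as np

def rar_select(query_sent_vec: np.ndarray, query_rel_vec: np.ndarray,
               demo_sent_vecs: np.ndarray, demo_rel_vecs: np.ndarray,
               k: int = 4, alpha: float = 0.5) -> list:
    """示意:句子级与关系级余弦相似度加权求和,选出得分最高的 k 个 ICL 示例。"""
    def cos(q, m):                                # q: [d], m: [n, d]
        q = q / (np.linalg.norm(q) + 1e-8)
        m = m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)
        return m @ q

    score = alpha * cos(query_sent_vec, demo_sent_vecs) \
            + (1 - alpha) * cos(query_rel_vec, demo_rel_vecs)
    return np.argsort(-score)[:k].tolist()

# 用法示意:随机向量代替对比学习得到的句子/关系嵌入
rng = np.random.default_rng(0)
picked = rar_select(rng.normal(size=768), rng.normal(size=768),
                    rng.normal(size=(100, 768)), rng.normal(size=(100, 768)))
print(picked)   # 被选中的示例下标,可据此拼接 few-shot 提示
```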
[NLP-32] Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs
【速读】: 该论文旨在解决对话系统中共同认知(common ground)的显式表示与存储问题,即如何在情境对话中有效建立并利用实体间的关联性参考以维持语境连贯性。现有研究表明大语言模型(LLM)能够执行请求澄清或生成确认等接地行为,但缺乏对共同认知的结构化建模机制,导致难以判断其行为是否基于真正的语境理解。论文的关键解决方案在于测试多种共同认知表示方法,并提出改进策略,以增强模型在对话中建立共同认知的能力及其后续使用效率,从而实现更可靠的语境引用与交互连贯性。
链接: https://arxiv.org/abs/2601.09365
作者: Biswesh Mohapatra,Théo Charlot,Giovanni Duca,Mayank Palan,Laurent Romary,Justine Cassell
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Common ground plays a critical role in situated spoken dialogues, where interlocutors must establish and maintain shared references to entities, events, and relations to sustain coherent interaction. For dialog systems, the ability to correctly ground conversational content in order to refer back to it later is particularly important. Prior studies have demonstrated that LLMs are capable of performing grounding acts such as requesting clarification or producing acknowledgments, yet relatively little work has investigated how common ground can be explicitly represented and stored for later use. Without such mechanisms, it remains unclear whether acknowledgment or clarification behaviors truly reflect a grounded understanding. In this work, we evaluate a model’s ability to establish and exploit common ground through relational references to entities within the shared context in a situational dialogue. We test multiple methods for representing common ground in situated dialogues and further propose approaches to improve both the establishment of common ground and its subsequent use in the conversation.
zh
[NLP-33] Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework
【速读】: 该论文旨在解决隐性仇恨言论(implicitly hateful speech)的检测难题,尤其针对传统提示方法(如零样本提示、少样本提示、思维链提示)在处理具有社会文化语境依赖性的文本时表现不足的问题。其解决方案的关键在于提出一种基于多智能体系统的上下文感知检测框架,由一个中央审核代理(Moderator Agent)和动态构建的社区代理(Community Agents)组成,后者代表特定人口群体,并从公开知识源中显式整合社会文化背景信息,从而实现身份敏感的精准内容审核。该方法通过引入平衡准确率(balanced accuracy)作为核心评估指标,兼顾真阳性率与真阴性率的权衡,显著提升了分类准确性和公平性。
链接: https://arxiv.org/abs/2601.09342
作者: Ewelina Gajewska,Katarzyna Budzynska,Jarosław A Chudziak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for the upcoming 18th International Conference on Agents and Artificial Intelligence (ICAART-2026), Marbella, Spain. The final published version will appear in the official conference proceedings
Abstract:This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on a challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
zh
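【补充示例】下面给出两个小草图:一是"审核代理咨询社区代理后投票"的编排示意(提示词与多数投票规则为假设,并非论文原流程);二是文中作为核心指标的平衡准确率(balanced accuracy)的计算方式。

```python
from typing import Callable, List

def moderate(text: str, groups: List[str], llm: Callable[[str], str]) -> bool:
    """示意:中央审核代理咨询各“社区代理”,再以简单多数给出最终判定。"""
    votes = [llm(f"请以{g}群体的视角并结合其社会文化背景,判断下文是否包含隐性仇恨,"
                 f"只回答 是/否:\n{text}") for g in groups]
    return sum(v.strip().startswith("是") for v in votes) > len(votes) / 2

def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """平衡准确率 = (真阳性率 + 真阴性率) / 2,兼顾两类错误。"""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos, neg = sum(y_true), len(y_true) - sum(y_true)
    return 0.5 * (tp / max(pos, 1) + tn / max(neg, 1))

print(balanced_accuracy([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))   # 0.5*(1/2 + 2/3) ≈ 0.583
```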
[NLP-34] Understanding or Memorizing? A Case Study of German Definite Articles in Language Models
【速读】: 该论文旨在解决语言模型在德语定冠词(definite articles)语法一致任务中表现优异是否源于规则驱动的泛化能力,还是对特定形式的单纯记忆。研究通过引入GRADIEND这一基于梯度的可解释性方法,学习针对特定性别-格组合的定冠词转换所对应的参数更新方向,发现这些更新方向不仅影响目标性别-格设置,还会显著影响无关的性别-格配置,且受影响最显著的神经元在不同设置间存在高度重叠。这一结果表明,模型并非严格遵循抽象语法规则进行编码,而是至少部分依赖于对具体形式的记忆关联。
链接: https://arxiv.org/abs/2601.09313
作者: Jonathan Drechsel,Erisa Bytyqi,Steffen Herbold
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Language models perform well on grammatical agreement, but it is unclear whether this reflects rule-based generalization or memorization. We study this question for German definite singular articles, whose forms depend on gender and case. Using GRADIEND, a gradient-based interpretability method, we learn parameter update directions for gender-case specific article transitions. We find that updates learned for a specific gender-case article transition frequently affect unrelated gender-case settings, with substantial overlap among the most affected neurons across settings. These results argue against a strictly rule-based encoding of German definite articles, indicating that models at least partly rely on memorized associations rather than abstract grammatical rules.
zh
[NLP-35] ReGraM: Region-First Knowledge Graph Reasoning for Medical Question Answering
【速读】: 该论文旨在解决医学问答(Medical QA)中因依赖全知识图谱(KG)遍历或大规模检索而导致的噪声干扰与多跳推理不稳定问题,其核心挑战在于如何精准识别并推理与查询相关的证据子集,而非简单扩大知识访问范围。解决方案的关键在于提出ReGraM框架——一种“区域优先”的知识图谱推理方法,通过构建与查询对齐的局部子图,并在多证据感知模式下进行受限的逐步推理,从而避免了传统方法中假设所有关系同等有用的不合理假设。实验表明,该策略显著提升了准确率(如MCQ任务提升8.04%、SAQ任务提升4.50%)并大幅降低幻觉率(减少42.9%),且消融分析验证了区域构建与分步推理协同优化的重要性。
链接: https://arxiv.org/abs/2601.09280
作者: Chaerin Lee,Sohee Park,Hyunsik Na,Daseon Choi
机构: Soongsil University (松溪大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 2 figures. Preprint
Abstract:Recent studies in medical question answering (Medical QA) have actively explored the integration of large language models (LLMs) with biomedical knowledge graphs (KGs) to improve factual accuracy. However, most existing approaches still rely on traversing the entire KG or performing large-scale retrieval, which introduces substantial noise and leads to unstable multi-hop reasoning. We argue that the core challenge lies not in expanding access to knowledge, but in identifying and reasoning over the appropriate subset of evidence for each query. ReGraM is a region-first knowledge graph reasoning framework that addresses this challenge by constructing a query-aligned subgraph and performing stepwise reasoning constrained to this localized region under multiple evidence aware modes. By focusing inference on only the most relevant portion of the KG, ReGraM departs from the assumption that all relations are equally useful an assumption that rarely holds in domain-specific medical settings. Experiments on seven medical QA benchmarks demonstrate that ReGraM consistently outperforms a strong baseline (KGARevion), achieving an 8.04% absolute accuracy gain on MCQ, a 4.50% gain on SAQ, and a 42.9% reduction in hallucination rate. Ablation and qualitative analyses further show that aligning region construction with hop-wise reasoning is the primary driver of these improvements. Overall, our results highlight region-first KG reasoning as an effective paradigm for improving factual accuracy and consistency in medical QA.
zh
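【补充示例】"区域优先"的关键是先抽取与查询对齐的局部子图,再把推理约束在该区域内。下面用三元组列表模拟 KG,给出 hop 受限扩展与约束推理的最小草图;种子实体的获取方式、提示词与示例三元组均为假设,并非 ReGraM 的官方实现。

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]   # (头实体, 关系, 尾实体)

def build_region(kg: List[Triple], seeds: Set[str], hops: int = 2) -> List[Triple]:
    """以问题中的种子实体为起点做 hop 受限扩展,得到与查询对齐的局部子图。"""
    frontier, region = set(seeds), []
    for _ in range(hops):
        nxt = set()
        for h, r, t in kg:
            if h in frontier or t in frontier:
                region.append((h, r, t))
                nxt.update({h, t})
        frontier = nxt
    return sorted(set(region))

def stepwise_answer(question: str, region: List[Triple], llm) -> str:
    """把推理约束在局部子图内,逐步给出推理链与答案(提示词为假设)。"""
    evidence = "\n".join(f"{h} --{r}--> {t}" for h, r, t in region)
    return llm(f"仅依据以下局部子图证据,逐步推理并回答:\n{evidence}\n\n问题:{question}")

kg = [("阿司匹林", "可缓解", "头痛"), ("头痛", "常见于", "偏头痛"), ("布洛芬", "可缓解", "发烧")]
print(build_region(kg, {"阿司匹林"}, hops=2))   # 只保留与种子实体 2 跳内相连的三元组
```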
[NLP-36] MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在中文古典文献研究(Chinese Classical Studies, CCS)中对音频语料利用不足的问题,尤其是缺乏针对古典文学语音任务的系统性数据集与评估框架。其解决方案的关键在于构建了一个多任务古典汉语文学体裁音频语料库(Multi-task Classical Chinese Literary Genre Audio Corpus, MCGA),涵盖自动语音识别(ASR)、语音到文本翻译(S2TT)、语音情感描述(SEC)、口语问答(SQA)、语音理解(SU)和语音推理(SR)六大任务,并引入了用于SEC的评价指标及衡量MLLM语音与文本能力一致性的新指标,从而推动MLLM在CCS领域中更全面、鲁棒的多维音频处理能力发展。
链接: https://arxiv.org/abs/2601.09270
作者: Yexing Du,Kaiyuan Liu,Bihe Zhang,Youcheng Pan,Bo Yang,Liangyu Huo,Xiyuan Zhang,Jian Xie,Daojing He,Yang Xiang,Ming Liu,Bin Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Pengcheng Laboratory (鹏城实验室); South China University of Technology (华南理工大学); Du Xiaoman (度小满)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when processed on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: this https URL
zh
[NLP-37] When to Invoke: Refining LLM Fairness with Toxicity Assessment WWW2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在在线内容审核系统中进行毒性评估时,因对不同人口群体的判断存在不一致性而导致的公平性问题,尤其是针对隐含仇恨言论等细微表达形式时表现出的偏见难以通过常规训练修正。其解决方案的关键在于提出一种推理阶段(inference-time)的公平性增强框架 FairToT,该框架通过提示引导(prompt-guided)的毒性评估机制识别出可能引发群体差异的场景,并动态决定是否引入额外评估;同时设计了两个可解释的公平性指标来检测此类情形,在不修改模型参数的前提下提升推理一致性与公平性,从而在保持毒性预测稳定性的基础上有效降低群体层面的差异。
链接: https://arxiv.org/abs/2601.09250
作者: Jing Ren,Bowen Li,Ziqi Xu,Renqiang Luo,Shuo Yu,Xin Ye,Haytham Fayek,Xiaodong Li,Feng Xia
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by Findings of WWW 2026
Abstract:Large Language Models (LLMs) are increasingly used for toxicity assessment in online moderation systems, where fairness across demographic groups is essential for equitable treatment. However, LLMs often produce inconsistent toxicity judgements for subtle expressions, particularly those involving implicit hate speech, revealing underlying biases that are difficult to correct through standard training. This raises a key question that existing approaches often overlook: when should corrective mechanisms be invoked to ensure fair and reliable assessments? To address this, we propose FairToT, an inference-time framework that enhances LLM fairness through prompt-guided toxicity assessment. FairToT identifies cases where demographic-related variation is likely to occur and determines when additional assessment should be applied. In addition, we introduce two interpretable fairness indicators that detect such cases and improve inference consistency without modifying model parameters. Experiments on benchmark datasets show that FairToT reduces group-level disparities while maintaining stable and reliable toxicity predictions, demonstrating that inference-time refinement offers an effective and practical approach for fairness improvement in LLM-based toxicity assessment systems. The source code can be found at this https URL.
zh
[NLP-38] TeachPro: Multi-Label Qualitative Teaching Evaluation via Cross-View Graph Synergy and Semantic Anchored Evidence Encoding
【速读】: 该论文旨在解决标准化教学评价(Standardized Student Evaluation of Teaching)中存在的信度低、选项受限及回应扭曲等问题,同时指出现有机器学习方法在处理开放式评论时通常仅将其简化为二元情感分类,忽略了如内容清晰度、反馈及时性与教师仪态等具体教学维度,导致难以提供针对性的教学改进建议。其解决方案的关键在于提出 TeachPro 框架,该框架通过两个核心技术模块实现多标签精细化评估:一是 Dimension-Anchored Evidence Encoder,利用预训练文本编码器、可学习语义锚点和交叉注意力机制,将定性反馈映射到五个关键教学维度(专业能力、教学行为、教学有效性、课堂体验及其他绩效指标);二是 Cross-View Graph Synergy Network,融合句法分支(基于依存句法树)与语义分支(基于 BERT 相似性图)的双视角表示,并通过 BiAffine 融合模块与差异正则化机制促进互补特征学习,最终借助交叉注意力机制建立教学维度与学生评论之间的结构化对齐关系。
链接: https://arxiv.org/abs/2601.09246
作者: Xiangqian Wang,Yifan Jia,Yang Xiang,Yumin Zhang,Yanbin Wang,Ke Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Standardized Student Evaluation of Teaching often suffers from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instructional improvement. We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations. Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.
zh
[NLP-39] When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation WWW2026
【速读】: 该论文旨在解决知识图谱增强生成(KG-RAG)模型在高风险场景中因过度自信而导致的可靠性问题,即当检索到的知识子图不完整或不可靠时,模型仍会输出高置信度预测。解决方案的关键在于提出一种因果感知校准框架 Ca2KG,其核心是结合反事实提示(counterfactual prompting)以暴露知识质量与推理可靠性之间的依赖不确定性,并引入基于面板的重评分机制(panel-based re-scoring mechanism)来稳定不同干预下的预测结果,从而实现更可靠的校准性能而不牺牲甚至提升预测准确性。
链接: https://arxiv.org/abs/2601.09241
作者: Jing Ren,Bowen Li,Ziqi Xu,Xinkun Zhang,Haytham Fayek,Xiaodong Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by WWW 2026
Abstract:Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.
zh
[NLP-40] GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)后训练流程中监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)之间存在的内在优化不匹配问题:标准SFT由于其刚性监督机制导致分布坍缩(distributional collapse),从而耗尽后续RL所需的探索空间。解决方案的关键在于提出Gibbs Initialization with Finite Temperature (GIFT),将SFT重新建模为一个统一后训练框架中的有限温度能量势,而非传统零温度极限下的退化形式;GIFT通过引入有限温度来保留基础先验信息,构建出贯穿整个后训练流程的分布桥梁,确保目标一致性,并在RL初始化阶段显著优于标准SFT及其他竞争基线,提供了一条通往全局最优的数学严谨路径。
链接: https://arxiv.org/abs/2601.09233
作者: Zhengyang Zhao,Lu Ma,Yizhen Jiang,Xiaochen Ma,Zimo Meng,Chengyu Shen,Lexiang Tang,Haoze Sun,Peng Pei,Wentao Zhang
机构: Peking University (北京大学); Meituan (美团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The prevailing post-training paradigm for Large Reasoning Models (LRMs)–Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)–suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at this https URL.
zh
[NLP-41] UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning
【速读】: 该论文旨在解决当前用户模拟器(user simulator)在代理后训练(agent post-training)中面临的两大问题:一是现有方法依赖静态且上下文无关的用户画像,导致在新场景中需大量手动重构,泛化能力弱;二是忽略了人类的战略性思维,使得模拟器易被代理操纵。解决方案的关键在于提出一种具备推理能力的用户语言模型 UserLM-R1,其核心创新包括:构建包含静态角色与动态场景目标的综合用户画像以增强适应性,并设计基于目标驱动的决策策略,在生成响应前先生成高质量推理链(rationale),再通过监督微调和多奖励强化学习进一步优化推理过程与战略能力,从而实现更真实、更具挑战性的交互环境。
链接: https://arxiv.org/abs/2601.09215
作者: Feng Zhang,Shijia Li,Chunmao Zhang,Zhanyu Ma,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He,Jingwen Xu,Han Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static and context-unaware profiles, necessitating extensive manual redesign for new scenarios, thus limiting generalizability. Moreover, they neglect human strategic thinking, leading to vulnerability to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses, and further refine the reasoning and improve strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.
zh
[NLP-42] A.X K1 Technical Report
【速读】: 该论文旨在解决大语言模型在推理能力与推理效率之间难以平衡的问题,尤其是在多场景部署中如何实现可控的推理过程。其解决方案的关键在于提出了一种基于混合专家(Mixture-of-Experts, MoE)架构的519B参数语言模型A.X K1,并设计了简单的Think-Fusion训练方法,使模型能够在单一统一结构中实现用户可控的“思考”与“非思考”模式切换,从而在保持高性能的同时提升推理效率和部署灵活性。
链接: https://arxiv.org/abs/2601.09200
作者: Sung Jun Cheon,Jaekyung Cho,Seongho Choi,Hyunjun Eun,Seokhwan Jo,Jaehyun Jun,Minsoo Kang,Jin Kim,Jiwon Kim,Minsang Kim,Sungwan Kim,Seungsik Kim,Tae Yoon Kim,Youngrang Kim,Hyeongmun Lee,Sangyeol Lee,Sungeun Lee,Youngsoon Lee,Yujin Lee,Seongmin Ok,Chanyong Park,Hyewoong Park,Junyoung Park,Hyunho Yang,Subin Yi,Soohyun Bae,Dhammiko Arya,Yongseok Choi,Sangho Choi,Dongyeon Cho,Seungmo Cho,Gyoungeun Han,Yong-jin Han,Seokyoung Hong,Hyeon Hwang,Wonbeom Jang,Minjeong Ju,Wonjin Jung,Keummin Ka,Sungil Kang,Dongnam Kim,Joonghoon Kim,Jonghwi Kim,SaeRom Kim,Sangjin Kim,Seongwon Kim,Youngjin Kim,Seojin Lee,Sunwoo Lee,Taehoon Lee,Chanwoo Park,Sohee Park,Sooyeon Park,Yohan Ra,Sereimony Sek,Seungyeon Seo,Gun Song,Sanghoon Woo,Janghan Yoon,Sungbin Yoon
机构: SK Telecom
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.
zh
[NLP-43] ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)中因强制模型对齐单一参考答案而导致的过拟合问题,这种过拟合会使得模型过度关注非核心表达而忽略语义逻辑结构。其解决方案的关键在于揭示了词元概率与语义重要性之间的内在联系:高概率词元承载核心逻辑框架,低概率词元多为可替换的表面表达。基于此洞察,作者提出ProFit方法,通过选择性屏蔽低概率词元来防止表层过拟合,从而提升模型在通用推理和数学基准上的性能表现。
链接: https://arxiv.org/abs/2601.09195
作者: Tao Liu,Taiqiang Wu,Runming Yang,Shaoning Sun,Junjie Wang,Yujiu Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
zh
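【补充示例】ProFit 的做法可以概括为:在 SFT 损失中屏蔽参考答案里当前模型概率较低的 token。下面的 PyTorch 草图用分位数阈值实现这一思路;保留比例 keep_ratio=0.8 及阈值方式均为假设,并非论文超参或官方实现。

```python
import torch
import torch.nn.functional as F

def profit_loss(logits: torch.Tensor, labels: torch.Tensor, keep_ratio: float = 0.8):
    """仅对参考答案中概率较高的 token 计算交叉熵;低概率 token(多为可替换的表面表达)被屏蔽。
    logits: [B, T, V], labels: [B, T]。"""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # 每个参考 token 的 log 概率
    thresh = torch.quantile(tok_logp.flatten(), 1.0 - keep_ratio)  # 只保留概率最高的约 80%
    mask = (tok_logp >= thresh).float()
    return -(tok_logp * mask).sum() / mask.sum().clamp(min=1.0)

# 用法示意
logits = torch.randn(2, 5, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 5))
profit_loss(logits, labels).backward()
```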
[NLP-44] OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on the Web
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在社会科学研究领域应用中因全量微调(Full Fine-Tuning)计算与能耗过高,难以在资源受限的小型机构和非营利组织(如Web4Good生态中的数字图书馆、数据目录等)中部署的问题。其解决方案的关键在于提出OrthoGeoLoRA方法,通过引入几何约束——将标准LoRA的更新形式ΔW = BAᵀ重构为类似奇异值分解(SVD)的形式ΔW = BΣAᵀ,并强制低秩因子B和A位于Stiefel流形上(即正交约束),从而缓解标准LoRA存在的尺度模糊性、规范自由度及秩坍缩问题。该方法在保持与Adam优化器及现有微调流程兼容的前提下,显著提升了参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的性能,在多语言句子编码器上的层级概念检索任务中优于标准LoRA及其他强基线方法,为资源受限场景下基础模型的适配提供了更高效的路径。
链接: https://arxiv.org/abs/2601.09185
作者: Zeqiang Wang,Xinyue Wu,Chenxi Li,Zixi Chen,Nishanth Sastry,Jon Johnson,Suparna De
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models and text encoders increasingly power web-based information systems in the social sciences, including digital libraries, data catalogues, and search interfaces used by researchers, policymakers, and civil society. Full fine-tuning is often computationally and energy intensive, which can be prohibitive for smaller institutions and non-profit organizations in the Web4Good ecosystem. Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), reduces this cost by updating only a small number of parameters. We show that the standard LoRA update ΔW = BAᵀ has geometric drawbacks: gauge freedom, scale ambiguity, and a tendency toward rank collapse. We introduce OrthoGeoLoRA, which enforces an SVD-like form ΔW = BΣAᵀ by constraining the low-rank factors to be orthogonal (Stiefel manifold). A geometric reparameterization implements this constraint while remaining compatible with standard optimizers such as Adam and existing fine-tuning pipelines. We also propose a benchmark for hierarchical concept retrieval over the European Language Social Science Thesaurus (ELSST), widely used to organize social science resources in digital repositories. Experiments with a multilingual sentence encoder show that OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, offering a more compute- and parameter-efficient path to adapt foundation models in resource-constrained settings.
zh
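【补充示例】下面的草图演示如何用 PyTorch 自带的正交参数化(torch.nn.utils.parametrizations.orthogonal)把低秩因子约束在 Stiefel 流形上,构造 ΔW = BΣAᵀ 形式的更新;秩 r、Σ 的参数化方式与缩放细节均为假设,并非论文的官方实现。

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class OrthoGeoLoRALayer(nn.Module):
    """概念草图:ΔW = B Σ Aᵀ,B、A 通过正交参数化被约束为列正交(Stiefel 流形)。"""
    def __init__(self, d_out: int, d_in: int, r: int = 8):
        super().__init__()
        self.B = orthogonal(nn.Linear(r, d_out, bias=False))   # B.weight: [d_out, r],列正交
        self.A = orthogonal(nn.Linear(r, d_in, bias=False))    # A.weight: [d_in, r],列正交
        self.log_sigma = nn.Parameter(torch.zeros(r))           # Σ 的对角元,取 exp 保证非负

    def delta_w(self) -> torch.Tensor:
        return self.B.weight @ torch.diag(self.log_sigma.exp()) @ self.A.weight.T

    def forward(self, x: torch.Tensor, w0: torch.Tensor) -> torch.Tensor:
        return x @ (w0 + self.delta_w()).T                      # 冻结的 W0 加上低秩更新

layer = OrthoGeoLoRALayer(d_out=64, d_in=128, r=8)
y = layer(torch.randn(4, 128), w0=torch.zeros(64, 128))        # 输出形状 [4, 64]
```

这种重参数化在优化时仍可直接使用 Adam 等常规优化器,与摘要中"保持与现有微调流程兼容"的思路一致。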
[NLP-45] Geometric Stability: The Missing Axis of Representations
【速读】: 该论文旨在解决现有表征分析方法中对**相似性(similarity)的过度依赖问题,即当前方法仅能衡量嵌入向量与外部参考的一致性,却无法评估其结构在扰动下的稳定性。为此,作者提出几何稳定性(geometric stability)**这一新维度,用于量化表征几何结构在扰动下的可靠性,并开发了名为 Shesha 的测量框架。其核心创新在于揭示了稳定性与相似性在统计上几乎无关(ρ ≈ 0.01),且机制上存在本质差异:相似性指标在去除主成分后会崩溃,而稳定性仍能捕捉细粒度流形结构。这一区分使得稳定性成为安全监控中的功能型几何探针(比CKA灵敏近2倍)、可控性预测的关键指标(监督稳定性与线性可操控性相关性达ρ = 0.89–0.96),并揭示了迁移能力与几何稳定性之间的解耦关系,从而为模型选择提供新的判据。
链接: https://arxiv.org/abs/2601.09173
作者: Prashant C. Raju
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
备注:
Abstract:Analysis of learned representations has a blind spot: it focuses on similarity, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce geometric stability, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present Shesha, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated (ρ ≈ 0.01) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2× more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability (ρ = 0.89–0.96); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying how reliably systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.
zh
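【补充示例】为说明"稳定性衡量几何结构在扰动下的保持程度、与相似性是两回事"这一直觉,下面给出一个示意性的代理指标:对嵌入加噪声后比较成对距离结构的相关性。这只是说明性草图,并非 Shesha 框架中稳定性的官方定义。

```python
import numpy as np

def pairwise_dist(X: np.ndarray) -> np.ndarray:
    """返回所有样本对之间的欧氏距离(上三角展平)。"""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    return d[np.triu_indices(X.shape[0], k=1)]

def geometric_stability_proxy(X: np.ndarray, noise: float = 0.05,
                              trials: int = 20, seed: int = 0) -> float:
    """示意性代理:多次加噪扰动后,成对距离结构与原结构的平均相关性。"""
    rng = np.random.default_rng(seed)
    d0 = pairwise_dist(X)
    corrs = []
    for _ in range(trials):
        Xp = X + noise * X.std() * rng.standard_normal(X.shape)
        corrs.append(np.corrcoef(d0, pairwise_dist(Xp))[0, 1])
    return float(np.mean(corrs))

X = np.random.default_rng(1).normal(size=(50, 32))   # 随机嵌入代替真实表征
print(round(geometric_stability_proxy(X), 3))
```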
[NLP-46] EvasionBench: Detecting Evasive Answers in Financial QA via Multi-Model Consensus and LLM-as-Judge
【速读】: 该论文旨在解决财务披露中规避性回答(evasive answers)的检测问题,这是保障财务透明度的关键挑战,但长期以来受限于缺乏大规模标注基准。其解决方案的核心在于提出一种多模型标注框架,关键创新点是利用前沿大语言模型(LLMs)之间的分歧作为信号,识别出最难判别的边界案例(boundary cases),并通过人工裁判(judge)对这些冲突样本进行标签判定,从而构建高质量训练数据。这种方法通过挖掘模型间不一致性实现隐式正则化,显著提升模型泛化能力,最终训练出的Eva-4B模型在准确率上比基线模型提高25个百分点,且推理成本远低于主流大模型。
链接: https://arxiv.org/abs/2601.09142
作者: Shijian Ma(1),Yan Lin(2),Yi Yang(1) ((1) The Hong Kong University of Science and Technology, Hong Kong SAR, China, (2) University of Macau, Macau SAR, China)
机构: Hong Kong University of Science and Technology (香港科技大学); University of Macau (澳门大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Shijian Ma and Yan Lin contributed equally. Corresponding author: Yan Lin
Abstract:Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen’s Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393) - evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.
zh
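【补充示例】EvasionBench 的标注思路可概括为"两个强标注模型独立打标、分歧样本交由裁判定标"。下面给出该流水线的最小草图;annotator_a、annotator_b、judge 均为占位回调,示例文本与标签也为杜撰,并非论文使用的具体模型、提示词或数据。

```python
from typing import Callable, List, Tuple

def annotate_with_disagreement_mining(samples: List[str],
                                      annotator_a: Callable[[str], str],
                                      annotator_b: Callable[[str], str],
                                      judge: Callable[[str, str, str], str]
                                      ) -> List[Tuple[str, str, bool]]:
    """返回 (样本, 最终标签, 是否为裁判裁决的边界难例)。"""
    labeled = []
    for text in samples:
        la, lb = annotator_a(text), annotator_b(text)
        if la == lb:
            labeled.append((text, la, False))                  # 两个标注模型一致,直接采纳
        else:
            labeled.append((text, judge(text, la, lb), True))  # 分歧即边界难例,由裁判定标
    return labeled

# 用法示意:用简单规则函数冒充标注模型与裁判
print(annotate_with_disagreement_mining(
    ["我们对后续指引保持密切关注", "本季度毛利率为23.5%"],
    annotator_a=lambda t: "回避" if "关注" in t else "直接回答",
    annotator_b=lambda t: "直接回答",
    judge=lambda t, a, b: "部分回避"))
```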
[NLP-47] Identity-Robust Language Model Generation via Content Integrity Preservation
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在生成回答时因用户社会人口学特征(如性别、种族等)而产生的核心响应质量差异问题,即身份依赖性退化(identity-dependent degradation),这种退化表现为事实准确性、实用性与安全性方面的不一致,即便问题本身与用户身份无关。解决方案的关键在于提出一种轻量级、无需训练的框架,通过选择性中和非关键的身份信息,同时保留语义上必要的属性,从而在不损害内容完整性的前提下实现身份鲁棒性生成。
链接: https://arxiv.org/abs/2601.09141
作者: Miao Zhang,Kelly Chen,Md Mehrab Tanjim,Rumi Chunara
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Model (LLM) outputs often vary across user sociodemographic attributes, leading to disparities in factual accuracy, utility, and safety, even for objective questions where demographic information is irrelevant. Unlike prior work on stereotypical or representational bias, this paper studies identity-dependent degradation of core response quality. We show empirically that such degradation arises from biased generation behavior, despite factual knowledge being robustly encoded across identities. Motivated by this mismatch, we propose a lightweight, training-free framework for identity-robust generation that selectively neutralizes non-critical identity information while preserving semantically essential attributes, thus maintaining output content integrity. Experiments across four benchmarks and 18 sociodemographic identities demonstrate an average 77% reduction in identity-dependent bias compared to vanilla prompting and a 45% reduction relative to prompt-based defenses. Our work addresses a critical gap in mitigating the impact of user identity cues in prompts on core generation quality.
zh
[NLP-48] Adaptive Multi-Stage Patent Claim Generation with Unified Quality Assessment
【速读】: 该论文旨在解决当前专利权利要求(patent claim)生成系统面临的三大核心问题:跨司法管辖区泛化能力差、权利要求与现有技术(prior art)之间语义关系建模不足,以及生成质量评估不可靠。解决方案的关键在于提出一个三阶段框架:首先通过多头注意力机制(multi-head attention)中八个专用头实现显式的语义关系建模;其次采用课程学习(curriculum learning)结合动态LoRA适配器选择策略,在五个专利领域间实现域自适应的claim生成;最后利用跨注意力机制(cross-attention)对评价维度进行联合建模,实现统一的质量评估。该方法在多个基准数据集上显著优于现有模型,尤其在跨司法管辖区性能保持方面表现突出,为自动化专利审查流程提供了系统性解决方案。
链接: https://arxiv.org/abs/2601.09120
作者: Chen-Wei Liang,Bin Guo,Zhen-Yuan Wei,Mu-Jiang-Shan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures. Preprint
Abstract:Current patent claim generation systems face three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment. We introduce a novel three-stage framework that addresses these challenges through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment. Our approach employs multi-head attention with eight specialized heads for explicit relationship modeling, integrates curriculum learning with dynamic LoRA adapter selection across five patent domains, and implements cross-attention mechanisms between evaluation aspects for comprehensive quality assessment. Extensive experiments on USPTO HUPD dataset, EPO patent collections, and Patent-CE benchmark demonstrate substantial improvements: 7.6-point ROUGE-L gain over GPT-4o, 8.3% BERTScore enhancement over Llama-3.1-8B, and 0.847 correlation with human experts compared to 0.623 for separate evaluation models. Our method maintains 89.4% cross-jurisdictional performance retention versus 76.2% for baselines, establishing a comprehensive solution for automated patent prosecution workflows.
zh
[NLP-49] Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms
【速读】: 该论文旨在解决劳动力市场分析中将非结构化职位广告映射到标准化技能分类体系(如ESCO)的问题,这本质上是一个极端多标签分类(Extreme Multi-Label Classification, XMLC)任务。传统监督方法受限于大规模、与技能分类对齐的标注数据稀缺,尤其在非英语语境下,职位广告语言与正式技能定义存在显著差异。解决方案的关键在于提出一种零样本技能提取框架:首先利用大语言模型(Large Language Model, LLM)基于ESCO定义合成训练样本,并引入基于ESCO二级类别的层次约束多技能生成机制以提升多标签场景下的语义一致性;随后在此合成语料上训练一个对比双编码器(contrastive bi-encoder),该编码器在共享嵌入空间中对齐职位广告句子与ESCO技能描述,其结构融合了BERT主干、BiLSTM和注意力池化以更好地建模长而信息密集的要求语句;此外还设计了一个基于RoBERTa的二分类过滤器用于去除非技能句子,从而提高端到端精度。实验表明,该方案在中文真实职位广告上实现了强零样本检索性能(F1@5 = 0.72),显著优于TF-IDF和标准BERT基线。
链接: https://arxiv.org/abs/2601.09119
作者: Yongming Sun
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注:
Abstract:Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations–especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF–IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
zh
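【补充示例】零样本技能抽取在推理阶段本质上是"把职位句子与 ESCO 技能描述编码到同一空间后做相似度检索"。下面用 sentence-transformers 的一个通用多语种编码器做示意:该 checkpoint 只是占位,并非论文训练出的对比双编码器;技能条目也是杜撰示例而非真实 ESCO 词条。

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# 占位用的通用多语种编码器,并非论文训练出的对比双编码器
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

esco_like_skills = ["使用Python进行数据分析", "操作关系型数据库并编写SQL查询", "维护客户关系"]
job_sentence = "岗位要求:熟练使用 Python 完成数据清洗与统计分析"

skill_emb = encoder.encode(esco_like_skills, normalize_embeddings=True)
query_emb = encoder.encode([job_sentence], normalize_embeddings=True)

scores = (query_emb @ skill_emb.T)[0]          # 归一化后点积即余弦相似度
for i in np.argsort(-scores)[:2]:              # 取相似度最高的前 2 个技能
    print(esco_like_skills[i], round(float(scores[i]), 3))
```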
[NLP-50] AviationLMM: A Large Multimodal Foundation Model for Civil Aviation ICICSE
【速读】: 该论文旨在解决当前民用航空领域中人工智能(AI)解决方案存在的碎片化与局限性问题,即现有AI系统通常仅聚焦于单一任务或模态(如语音通信、雷达轨迹、传感器数据等),难以有效整合多源异构信息,从而限制了情境感知能力、适应性及实时决策支持。其核心解决方案是提出 AviationLMM——一个面向民用航空的大规模多模态基础模型(Large Multimodal foundation Model, LMM),通过统一处理空地语音、监视数据、机载遥测、视频和结构化文本等多种输入模态,实现跨模态对齐与融合,并生成灵活输出(如态势摘要、风险预警、预测性诊断及多模态事件重构),以推动航空AI从孤立应用向集成化、智能化演进。
链接: https://arxiv.org/abs/2601.09105
作者: Wenbin Li,Jingling Wu,Xiaoyong Lin,Jing Chen,Cong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by 2025 7th International Conference on Interdisciplinary Computer Science and Engineering (ICICSE 2025) conference, Chongqing, China; 9 pages,1 figure,5 tables
Abstract:Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. We firstly identify the gaps between existing AI solutions and requirements. Secondly, we describe the model architecture that ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, and performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. In order to fully realize this vision, we identify key research opportunities to address, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to boost the civil aviation foundation model progress and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.
zh
[NLP-51] SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在字符级任务上表现不佳的问题,尤其是由于分词(tokenization)过程导致的子词理解能力不足。尽管现有基准已揭示此类缺陷,但因任务缺乏实际应用相关性而常被忽视;然而,诸如文本地图导航或结构化表格解析等真实场景高度依赖精确的子词层面理解。解决方案的关键在于提出SubTokenTest——一个基于实用导向任务的综合性评估基准,涵盖四个领域共十个任务,并通过解耦复杂推理与分词相关错误,精准识别模型在子词层级的理解能力。该基准不仅系统评估了九种先进LLM的性能,还进一步探究了测试时扩展策略对子词推理的影响及字符信息在隐藏状态中的编码方式。
链接: https://arxiv.org/abs/2601.09089
作者: Shuyang Hou,Yi Hu,Muhan Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed due to lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.
zh
[NLP-52] Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
【速读】: 该论文旨在解决当前基于序列级蒸馏(sequence-level distillation)的模型训练方法中存在的系统性缺陷,这些问题导致学生模型难以充分继承教师模型的泛化能力。具体而言,现有方法存在三大局限:一是未能充分表征教师模型的序列级输出分布;二是教师输出分布与学生学习能力之间存在错位;三是因教师强制训练与自回归推理之间的暴露偏差(exposure bias)影响性能。针对上述问题,作者提出了一套方法论创新,构建了一个增强的序列级蒸馏训练流程,其核心在于引入显式的师生交互机制,从而更有效地实现知识迁移。这一改进使得DASD-4B-Thinking在仅使用44.8万训练样本的情况下,即达到与更大规模模型相当甚至更优的性能表现。
链接: https://arxiv.org/abs/2601.09088
作者: Shaotian Yan,Kaiyuan Liu,Chen Shen,Bing Wang,Sinan Fan,Jun Zhang,Yue Wu,Zheng Wang,Jieping Ye
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Project Page: this https URL
Abstract:In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation – even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself – enabling the student model to learn the teacher’s full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher’s sequence-level distribution; ii) Misalignment between the teacher’s output distribution and the student’s learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples – an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
zh
[NLP-53] MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
【速读】: 该论文旨在解决Group Relative Policy Optimization (GRPO)在训练数学推理模型时因依赖每个提示生成多个完成(completion)而导致的计算成本过高问题。其核心解决方案是提出MMR-GRPO,通过引入最大边际相关性(Maximal Marginal Relevance, MMR)对奖励进行重加权,以基于完成结果的多样性调整学习信号强度。关键洞察在于:语义冗余的完成所提供的边际学习信号有限,而优先选择多样化的解法能够带来更丰富的更新信息,从而加速收敛。实验表明,MMR-GRPO在保持与基线相当峰值性能的同时,平均减少47.9%的训练步数和70.2%的墙钟时间(wall-clock time)。
链接: https://arxiv.org/abs/2601.09085
作者: Kangda Wei,Ruihong Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
zh
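【补充示例】MMR-GRPO 的核心是按完成结果的多样性对奖励重新加权。下面的 NumPy 草图按 MMR 选取顺序给每条 completion 赋权:与已选结果语义高度相似(冗余)的样本权重被压低;加权公式与 lam 取值均为假设,并非论文原式。

```python
import numpy as np

def mmr_reweigh(rewards: np.ndarray, emb: np.ndarray, lam: float = 0.7) -> np.ndarray:
    """按 MMR 选取顺序为同一 prompt 的多条 completion 重新加权奖励。"""
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    sim = emb @ emb.T                                   # completion 两两余弦相似度
    rel = (rewards - rewards.min()) / (rewards.max() - rewards.min() + 1e-8)
    n, selected, weights = len(rewards), [], np.zeros(len(rewards))
    candidates = list(range(n))
    for rank in range(n):
        def mmr_score(i):                               # 相关性高且与已选结果不冗余者优先
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
        weights[best] = 1.0 - rank / n                  # 越早被选中(越多样)权重越高
    return rewards * weights                            # 重加权后的奖励再进入 GRPO 优势计算

rewards = np.array([1.0, 1.0, 0.0, 1.0])
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.2, 0.98]])
print(mmr_reweigh(rewards, emb))   # 与第 1 条高度相似的第 2 条被明显降权
```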
[NLP-54] How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
【速读】: 该论文旨在解决生成式 AI 模型评估中一个关键问题:在人类偏好评价(human preference evaluation)场景下,需要多少判断才能可靠地检测出模型的微小改进。研究发现,当偏好信号在不同提示(prompt)之间分布稀疏(即所有提示类型均具有相似的信息量)时,比例分配策略(proportional allocation)是最优的——此时任何分配策略都无法显著提升检测能力。其核心解决方案在于揭示了大多数大规模人类偏好数据集处于“稀疏信号”状态,偏好边际较小,因此通常收集的判断数量远不足以可靠检测改进;而通过设计减少提示诱导变异性的精选基准(curated benchmarks),可系统性增大偏好边际并提升检测能力(提示级方差降低 1.5 倍),从而更有效地识别模型改进。这表明许多负面或不确定的人类评估结果往往源于统计功效不足,而非模型等效,强调了效应量、预算和评估协议设计的重要性。
链接: https://arxiv.org/abs/2601.09084
作者: Wilson Y. Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt-induced variability systematically induce larger margins and improve detectability through a 1.5× reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
zh
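【补充示例】下面用常规的二项检验功效分析粗算"在给定偏好边际下可靠检出改进所需的成对判断数",以说明为何小边际需要远多于通常收集量的判断。公式为标准正态近似,仅作量级示意,并非论文的 minimax 分析本身。

```python
import math

def judgments_needed(p: float, power: float = 0.8) -> int:
    """双侧显著性 0.05 下,检出“偏好率 p > 0.5”所需成对判断数的正态近似。"""
    z_alpha = 1.959964                       # Φ^{-1}(0.975)
    z_beta = {0.8: 0.841621, 0.9: 1.281552}[power]
    delta = p - 0.5                          # 偏好边际
    n = ((z_alpha * 0.5 + z_beta * math.sqrt(p * (1 - p))) / delta) ** 2
    return math.ceil(n)

for p in (0.52, 0.55, 0.60):
    print(p, judgments_needed(p))            # 约 4904、783、194:边际越小,代价急剧上升
```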
[NLP-55] Human-AI Co-design for Clinical Prediction Models
【速读】: 该论文旨在解决临床预测模型(Clinical Prediction Models, CPMs)开发过程中因依赖人工迭代协作而导致效率低下、难以落地临床实践的问题,尤其是在处理包含大量复杂概念的非结构化临床笔记时更为显著。其解决方案的关键在于提出HACHI框架——一个迭代式“人在回路”(human-in-the-loop)系统,通过AI代理(AI agent)快速探索和评估临床笔记中的候选概念,并结合临床专家与领域专家的反馈持续优化模型学习过程。HACHI将概念定义为可解释的二分类问题(yes-no questions),使得模型在每轮迭代中均保持透明性与可验证性,从而显著提升模型性能、泛化能力及临床实用性。
链接: https://arxiv.org/abs/2601.09072
作者: Jean Feng,Avni Kothari,Patrick Vossler,Andrew Bishara,Lucas Zier,Newton Addo,Aaron Kornblith,Yan Shuo Tan,Chandan Singh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
备注:
Abstract:Developing safe, effective, and practically useful clinical prediction models (CPMs) traditionally requires iterative collaboration between clinical experts, data scientists, and informaticists. This process refines the often small but critical details of the model building process, such as which features/patients to include and how clinical categories should be defined. However, this traditional collaboration process is extremely time- and resource-intensive, resulting in only a small fraction of CPMs reaching clinical practice. This challenge intensifies when teams attempt to incorporate unstructured clinical notes, which can contain an enormous number of concepts. To address this challenge, we introduce HACHI, an iterative human-in-the-loop framework that uses AI agents to accelerate the development of fully interpretable CPMs by enabling the exploration of concepts in clinical notes. HACHI alternates between (i) an AI agent rapidly exploring and evaluating candidate concepts in clinical notes and (ii) clinical and domain experts providing feedback to improve the CPM learning process. HACHI defines concepts as simple yes-no questions that are used in linear models, allowing the clinical AI team to transparently review, refine, and validate the CPM learned in each round. In two real-world prediction tasks (acute kidney injury and traumatic brain injury), HACHI outperforms existing approaches, surfaces new clinically relevant concepts not included in commonly-used CPMs, and improves model generalizability across clinical sites and time periods. Furthermore, HACHI reveals the critical role of the clinical AI team, such as directing the AI agent to explore concepts that it had not previously considered, adjusting the granularity of concepts it considers, changing the objective function to better align with the clinical objectives, and identifying issues of data bias and leakage.
zh
[NLP-56] From Symbolic to Natural-Language Relations: Rethinking Knowledge Graph Construction in the Era of Large Language Models
【速读】: 该论文旨在解决传统知识图谱(Knowledge Graph, KG)中基于预定义符号关系模式(symbolic relation schemas)所导致的语义抽象问题,即现实世界中的关系往往具有情境依赖性、细微差别和不确定性,而将其压缩为离散的关系标签会丢失关键语义细节。其解决方案的关键在于从“符号化关系表示”向“自然语言关系描述”转变,提出一种混合设计原则:在保留最小结构骨架的基础上,引入更灵活且情境敏感的关系表达方式,从而更好地适配大语言模型(Large Language Models, LLMs)驱动的知识生成与推理范式。
链接: https://arxiv.org/abs/2601.09069
作者: Kanyao Han,Yushang Lai
机构: Walmart Global Tech (沃尔玛全球科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge graphs (KGs) have commonly been constructed using predefined symbolic relation schemas, typically implemented as categorical relation labels. This design has notable shortcomings: real-world relations are often contextual, nuanced, and sometimes uncertain, and compressing it into discrete relation labels abstracts away critical semantic detail. Nevertheless, symbolic-relation KGs remain widely used because they have been operationally effective and broadly compatible with pre-LLM downstream models and algorithms, in which KG knowledge could be retrieved or encoded into quantified features and embeddings at scale. The emergence of LLMs has reshaped how knowledge is created and consumed. LLMs support scalable synthesis of domain facts directly in concise natural language, and prompting-based inference favors context-rich free-form text over quantified representations. This position paper argues that these changes call for rethinking the representation of relations themselves rather than merely using LLMs to populate conventional schemas more efficiently. We therefore advocate moving from symbolic to natural-language relation descriptions, and we propose hybrid design principles that preserve a minimal structural backbone while enabling more flexible and context-sensitive relational representations.
zh
[NLP-57] Mi:dm 2.0 Korea-centric Bilingual Language Models
【速读】: 该论文旨在解决现有大语言模型(Large Language Model, LLM)在处理韩语文本时存在的文化适配不足、数据质量不高以及对韩国社会价值观和常识理解薄弱的问题。针对这些局限性,其核心解决方案在于构建一个以韩国为中心的双语大模型 Mi:dm 2.0,通过一套完整的高质量数据处理流程实现突破:包括专有数据清洗、高质量合成数据生成、结合课程学习策略的数据混合方法,以及为韩语优化的自定义分词器,从而显著提升模型在文化语境理解、情感细微差别识别和现实场景响应中的准确性与可靠性。
链接: https://arxiv.org/abs/2601.09066
作者: Donghoon Shin,Sejung Lee,Soonmin Bae,Hwijung Ryu,Changwon Ok,Hoyoun Jung,Hyesung Ji,Jeehyun Lim,Jehoon Lee,Ji-Eun Han,Jisoo Baik,Mihyeon Kim,Riwoo Chung,Seongmin Lee,Wonjae Park,Yoonseok Heo,Youngkyung Seo,Seyoun Won,Boeun Kim,Cheolhun Heo,Eunkyeong Lee,Honghee Lee,Hyeongju Ju,Hyeontae Seo,Jeongyong Shim,Jisoo Lee,Junseok Koh,Junwoo Kim,Minho Lee,Minji Kang,Minju Kim,Sangha Nam,Seongheum Park,Taehyeong Kim,Euijai Ahn,Hong Seok Jeung,Jisu Shin,Jiyeon Kim,Seonyeong Song,Seung Hyun Kong,Sukjin Hong,Taeyang Yun,Yu-Seon Kim,A-Hyun Lee,Chae-Jeong Lee,Hye-Won Yu,Ji-Hyun Ahn,Song-Yeon Kim,Sun-Woo Jung,Eunju Kim,Eunji Ha,Jinwoo Baek,Yun-ji Lee,Wanjin Park,Jeong Yeop Kim,Eun Mi Kim,Hyoung Jun Park,Jung Won Yoon,Min Sung Noh,Myung Gyo Oh,Wongyoung Lee,Yun Jin Park,Young S. Kwon,Hyun Keun Kim,Jieun Lee,YeoJoo Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at this https URL. For technical inquiries, please contact midm-llm@kt.com.
zh
[NLP-58] Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)中注释者分歧(annotator disagreement)的建模问题,尤其针对主观性强和语义模糊的任务(如毒性检测和立场分析)。传统方法将分歧视为噪声并试图消除,而本文提出应将其视为反映不同解释与视角的重要信号。解决方案的关键在于构建一个统一的框架,从数据、任务和标注者三个维度对分歧来源进行分类,并通过预测目标与聚合结构的标准化视角,系统梳理了从共识学习到显式建模分歧再到捕捉标注者间结构关系的演进路径,从而推动NLP模型更全面地理解人类标注的多样性与复杂性。
链接: https://arxiv.org/abs/2601.09065
作者: Yinuo Xu,David Jurgens
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Annotator disagreement is widespread in NLP, particularly for subjective and ambiguous tasks such as toxicity detection and stance analysis. While early approaches treated disagreement as noise to be removed, recent work increasingly models it as a meaningful signal reflecting variation in interpretation and perspective. This survey provides a unified view of disagreement-aware NLP methods. We first present a domain-agnostic taxonomy of the sources of disagreement spanning data, task, and annotator factors. We then synthesize modeling approaches using a common framework defined by prediction targets and pooling structure, highlighting a shift from consensus learning toward explicitly modeling disagreement, and toward capturing structured relationships among annotators. We review evaluation metrics for both predictive performance and annotator behavior, noting that most fairness evaluations remain descriptive rather than normative. We conclude by identifying open challenges and future directions, including integrating multiple sources of variation, developing disagreement-aware interpretability frameworks, and grappling with the practical tradeoffs of perspectivist modeling.
zh
[NLP-59] Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models
【速读】: 该论文旨在解决多语言对话摘要与问答(Multilingual Dialogue Summarization and Question Answering)任务中低资源语言处理性能不足的问题。其解决方案的关键在于构建一个三阶段流水线:首先将印地语系语言(Indic languages)翻译为英语,其次使用参数规模为2.55B的蒸馏语言模型进行多任务文本生成,最后再将结果反向翻译回源语言。通过知识蒸馏技术,该方法在不依赖特定任务微调的情况下,实现了对九种语言的高竞争力性能,尤其在马拉地语(86.7% QnA)、泰米尔语(86.7% QnA)和印地语(80.0% QnA)上表现优异,验证了基于翻译的策略在低资源场景下的有效性。
链接: https://arxiv.org/abs/2601.09059
作者: Santiago Martínez Novoa,Nicolás Rozo Fajardo,Diego Alejandro González Vargas,Nicolás Bedoya Figueroa
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents team Kl33n3x’s multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition’s tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.
zh
[NLP-60] StegoStylo: Squelching Stylometric Scrutiny through Steganographic Stitching
【速读】: 该论文旨在解决stylometry(风格分析)技术在隐私保护方面的潜在威胁,特别是其在作者身份验证(authorship verification)中被恶意利用时可能引发的去匿名化、追踪和监控等风险。论文提出的关键解决方案是结合对抗性stylometry与隐写术(steganography),通过改进对抗攻击方法 TraceTarnish 来扰乱stylometric系统的识别能力,并进一步利用零宽Unicode字符对文本进行细粒度修改,实现作者风格指纹的掩蔽。实验表明,当至少33%的词汇被嵌入隐写信息后,可有效实现作者身份的混淆,从而为防御性工具的设计提供实证依据。
链接: https://arxiv.org/abs/2601.09056
作者: Robert Dilworth
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 16 pages, 6 figures, 1 table
Abstract:Stylometry–the identification of an author through analysis of a text’s style (i.e., authorship attribution)–serves many constructive purposes: it supports copyright and plagiarism investigations, aids detection of harmful content, offers exploratory cues for certain medical conditions (e.g., early signs of dementia or depression), provides historical context for literary works, and helps uncover misinformation and disinformation. In contrast, when stylometry is employed as a tool for authorship verification–confirming whether a text truly originates from a claimed author–it can also be weaponized for malicious purposes. Techniques such as de-anonymization, re-identification, tracking, profiling, and downstream effects like censorship illustrate the privacy threats that stylometric analysis can enable. Building on these concerns, this paper further explores how adversarial stylometry combined with steganography can counteract stylometric analysis. We first present enhancements to our adversarial attack, TraceTarnish, providing stronger evidence of its capacity to confound stylometric systems and reduce their attribution and verification accuracy. Next, we examine how steganographic embedding can be fine-tuned to mask an author’s stylistic fingerprint, quantifying the level of authorship obfuscation achievable as a function of the proportion of words altered with zero-width Unicode characters. Based on our findings, steganographic coverage of 33% or higher seemingly ensures authorship obfuscation. Finally, we reflect on the ways stylometry can be used to undermine privacy and argue for the necessity of defensive tools like TraceTarnish.
zh
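针对摘要中“对约 33% 及以上的单词插入零宽 Unicode 字符以掩蔽作者风格”的做法,下面给出一个思路对应的极简 Python 示意(并非论文原实现,阈值与字符选择仅沿用摘要中的描述):

```python
# 极简示意:按给定比例在单词后插入零宽 Unicode 字符,以扰动风格计量特征
# (非论文原实现;0.33 这一覆盖率仅沿用摘要中报告的经验阈值)
import random

ZERO_WIDTH = ["\u200b", "\u200c", "\u200d"]  # 零宽空格 / 零宽非连接符 / 零宽连接符

def obfuscate(text: str, coverage: float = 0.33, seed: int = 0) -> str:
    """对约 coverage 比例的单词追加随机零宽字符,肉眼不可见但会改变 token 形态。"""
    rng = random.Random(seed)
    words = text.split(" ")
    k = max(1, int(len(words) * coverage))
    for i in rng.sample(range(len(words)), k):
        words[i] = words[i] + rng.choice(ZERO_WIDTH)
    return " ".join(words)

original = "The quick brown fox jumps over the lazy dog"
stego = obfuscate(original, coverage=0.4)
print(repr(stego))        # 以转义形式显示零宽字符
print(stego == original)  # False:文本已被不可见地修改
```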
[NLP-61] SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages
【速读】: 该论文旨在解决低资源声调语言(tonal low-resource languages)在现代语音技术中长期被忽视的问题,核心挑战在于如何学习既对性别等干扰因素具有鲁棒性、又能保持对声调敏感的语音表征。解决方案的关键是提出一种轻量级适配方法 SITA(Speaker-Invariance and Tone-Awareness),其采用分阶段多目标训练策略:首先通过跨性别对比损失增强不同说话者间的词义一致性,同时引入声调排斥损失(tone-repulsive loss)防止相同词汇不同声调的表征坍缩;其次利用基于连接时序分类(Connectionist Temporal Classification, CTC)的辅助自动语音识别(ASR)目标进行蒸馏,稳定与识别相关的结构。该方法在高度依赖声调的 Hmong 语种上显著提升跨性别词检索准确率,并维持可接受的 ASR 性能,且在普通话上的迁移实验也验证了其通用性和可插拔特性。
链接: https://arxiv.org/abs/2601.09050
作者: Tianyi Xu,Xuan Ouyang,Binwei Yao,Shoua Xiong,Sara Misurelli,Maichou Lor,Junjie Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages (excluding references, limitations, ethics, acknowledgement, and appendix); 4 figures in the main paper; appendix included
Abstract:Tonal low-resource languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where off-the-shelf multilingual encoders fail to represent tone effectively. On a curated Hmong word corpus, SITA improves cross-gender lexical retrieval accuracy, while maintaining usable ASR accuracy relative to an ASR-adapted XLS-R teacher. We further observe similar gains when transferring the same recipe to Mandarin, suggesting SITA is a general, plug-in approach for adapting multilingual speech encoders to tonal languages.
zh
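下面用 PyTorch 写出摘要中“跨性别对比损失 + 声调排斥损失”一种可能的组合形式,仅作结构示意:输入用随机向量代替真实的 wav2vec 话语嵌入,margin、权重等超参均为假设,CTC 蒸馏项从略,并非论文官方实现。

```python
# 极简示意:SITA 中“跨性别对比 + 声调排斥”两项损失的一种可能形式
# (非官方实现;真实方法作用于 wav2vec 风格编码器的输出,此处用随机向量代替)
import torch
import torch.nn.functional as F

def sita_losses(anchor, pos_cross_gender, neg_other_word, same_word_diff_tone, margin=0.5):
    """四个输入均为 (B, D) 的话语级嵌入。
    - 对比项:拉近同词跨性别样本,推远不同词样本
    - 声调排斥项:同词不同声调的嵌入相似度不得高于 margin,防止声调坍缩
    """
    sim_pos = F.cosine_similarity(anchor, pos_cross_gender)      # 希望大
    sim_neg = F.cosine_similarity(anchor, neg_other_word)        # 希望小
    contrastive = F.relu(margin + sim_neg - sim_pos).mean()

    sim_tone = F.cosine_similarity(anchor, same_word_diff_tone)
    tone_repulsive = F.relu(sim_tone - margin).mean()

    return contrastive, tone_repulsive

B, D = 8, 256
a, p, n, t = (torch.randn(B, D) for _ in range(4))
lc, lt = sita_losses(a, p, n, t)
loss = lc + 0.5 * lt   # 0.5 为假设的权重;CTC 蒸馏项此处省略
print(float(lc), float(lt), float(loss))
```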
[NLP-62] Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在组合推理任务中面临的“两跳推理困境”(curse of two-hop reasoning)问题,即模型难以有效整合多个原子事实以完成复杂推理。其解决方案的关键在于通过机制化研究揭示“grokking”现象的本质:作者发现,经过长时间训练形成的“泛化电路”(Generalization Circuit)并非突然引入新的推理范式,而是将记忆中的原子事实整合进已存在的推理路径中;此外,高准确率与特定推理路径的形成可独立发生,且成熟电路在引入新知识时仍表现出有限迁移能力,表明“grokked” Transformer并未实现对组合逻辑的完全掌握。
链接: https://arxiv.org/abs/2601.09049
作者: Kaiyu He,Zhang Mian,Peilin Wu,Xinya Du,Zhiyu Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) excel at factual retrieval, they often struggle with the “curse of two-hop reasoning” in compositional tasks. Recent research suggests that parameter-sharing transformers can bridge this gap by forming a “Generalization Circuit” during a prolonged “grokking” phase. A fundamental question arises: Is a grokked model superior to its non-grokked counterparts on downstream tasks? Furthermore, is the extensive computational cost of waiting for the grokking phase worthwhile? In this work, we conduct a mechanistic study to evaluate the Generalization Circuit’s role in knowledge assimilation and transfer. We demonstrate that: (i) The inference paths established by non-grokked and grokked models for in-distribution compositional queries are identical. This suggests that the “Generalization Circuit” does not represent the sudden acquisition of a new reasoning paradigm. Instead, we argue that grokking is the process of integrating memorized atomic facts into a naturally established reasoning path. (ii) Achieving high accuracy on unseen cases after prolonged training and the formation of a certain reasoning path are not bound; they can occur independently under specific data regimes. (iii) Even a mature circuit exhibits limited transferability when integrating new knowledge, suggesting that “grokked” Transformers do not achieve a full mastery of compositional logic.
zh
[NLP-63] Can LLM s interpret figurative language as humans do?: surface-level vs representational similarity
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在理解和生成具有社会语境依赖性的语言表达(如习语、讽刺、俚语等)时,其判断是否与人类一致的问题。研究通过对比人类与四种不同规模的指令微调LLM(GPT-4、Gemma-2-9B、Llama-3.2和Mistral-7B)对240个基于对话的句子在六个语言特征上的评分,发现尽管模型在表层语义上与人类高度一致,但在表征层面——尤其是涉及习语和Z世代俚语等需要社会语用知识的表达时——存在显著偏差;其中GPT-4最接近人类的表征模式,但所有模型均难以准确处理依赖上下文和社交语境的表达,如讽刺、俚语和习语。解决方案的关键在于识别并量化LLMs在社会语用理解方面的局限性,并强调未来需增强模型对文化背景、语境敏感性和隐含意义的建模能力。
链接: https://arxiv.org/abs/2601.09041
作者: Samhita Bollepally,Aurora Sloman-Moll,Takashi Yamauchi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures
Abstract:Large language models generate judgments that resemble those of humans. Yet the extent to which these models align with human judgments in interpreting figurative and socially grounded language remains uncertain. To investigate this, human participants and four instruction-tuned LLMs of different sizes (GPT-4, Gemma-2-9B, Llama-3.2, and Mistral-7B) rated 240 dialogue-based sentences representing six linguistic traits: conventionality, sarcasm, funny, emotional, idiomacy, and slang. Each of the 240 sentences was paired with 40 interpretive questions, and both humans and LLMs rated these sentences on a 10-point Likert scale. Results indicated that LLMs aligned with humans at the surface level, but diverged significantly at the representational level, especially in interpreting figurative sentences involving idioms and Gen Z slang. GPT-4 most closely approximates human representational patterns, while all models struggle with context-dependent and socio-pragmatic expressions like sarcasm, slang, and idiomacy.
zh
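摘要区分了“表层一致”(平均评分接近)与“表征一致”(逐条目评分模式相关)。下面的小例子用随机玩具数据演示这两种比较为何会得出不同结论(并非论文原分析代码,评分数据为虚构):

```python
# 极简示意:区分“表层一致”(均值接近)与“表征一致”(逐条目评分模式相关)
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 240

human = rng.integers(1, 11, size=n_items).astype(float)                    # 人类对 240 句的 1-10 评分
llm = np.clip(np.round(human.mean() + rng.normal(0, 3, n_items)), 1, 10)   # 均值相近但模式无关的模型评分

# 表层比较:平均评分几乎相同
print("mean(human) =", human.mean().round(2), " mean(llm) =", llm.mean().round(2))

# 表征比较:逐条目评分模式的 Spearman 相关可能很低
rho, p = spearmanr(human, llm)
print("item-level Spearman rho =", round(float(rho), 3), " p =", round(float(p), 3))
```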
[NLP-64] SpectraQuery: A Hybrid Retrieval-Augmented Conversational Assistant for Battery Science
【速读】: 该论文旨在解决科学推理中结构化实验数据与非结构化文献之间难以协同推理的问题,即当前大型语言模型(Large Language Model, LLM)助手在跨模态信息整合方面存在局限。其解决方案的关键在于提出SpectraQuery框架,该框架采用受Structured and Unstructured Query Language (SUQL)启发的混合自然语言查询设计,将关系型拉曼光谱数据库与向量索引的科学文献语料库相结合,通过语义解析与检索增强生成(Retrieval-Augmented Generation, RAG)技术,将开放性问题转化为协调一致的SQL查询和文献检索操作,从而生成基于数值证据与机制解释统一的、可引用的答案。
链接: https://arxiv.org/abs/2601.09036
作者: Sreya Vangara,Jagjit Nanda,Yan-Kai Tzeng,Eric Darve
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 11 pages, 8 figures, appendix included
Abstract:Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus using a Structured and Unstructured Query Language (SUQL)-inspired design. By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. Across SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation, SpectraQuery demonstrates strong performance: approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness with 10-15 retrieved passages, and battery scientists rate responses highly across accuracy, relevance, grounding, and clarity (4.1-4.6/5). These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.
zh
[NLP-65] OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG WWW2026
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因检索到的信息相关性不一致而导致的生成质量不稳定问题。具体而言,现有方法通常假设检索内容与查询高度相关,但在实际应用中,检索结果的质量可能因查询意图和文档集合的不同而存在显著差异,从而影响最终生成内容的准确性与可靠性。解决方案的关键在于提出OpenDecoder框架,通过引入显式的检索信息质量评估指标作为生成过程中的特征输入,包括相关性得分(relevance score)、排序得分(ranking score)以及查询性能预测得分(Query Performance Prediction, QPP score),从而增强模型对噪声上下文的鲁棒性。该方法在五个基准数据集上的实验验证了其有效性与更强的稳定性,且具有良好的灵活性,可无缝集成至任意大语言模型(Large Language Models, LLMs)的后训练阶段或与其他外部指标结合使用。
链接: https://arxiv.org/abs/2601.09028
作者: Fengran Mo,Zhan Su,Yuchen Hui,Jinghan Zhang,Jia Ao Sun,Zheyuan Liu,Chao Zhang,Tetsuya Sakai,Jian-Yun Nie
机构: Université de Montréal(蒙特利尔大学); Clemson University(克莱姆森大学); University of Notre Dame(圣母大学); Georgia Institute of Technology(佐治亚理工学院); Waseda University(早稻田大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by ACM WWW 2026
Abstract:The development of large language models (LLMs) has achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and the capacity of LLMs’ internal information processing mechanism to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take into account the relevance of the retrieved information in answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm is flexible to be integrated with the post-training of LLMs for any purposes and incorporated with any type of external indicators.
zh
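下面给出一个把相关性分、排序分与 QPP 分显式拼入提示的极简示意。提示模板与分数来源均为本文假设;OpenDecoder 实际是在 LLM 后训练阶段学习利用这些质量指标,而非仅靠提示拼接:

```python
# 极简示意:将检索质量指标(相关性分、排序分、QPP 分)显式注入生成上下文
# (非 OpenDecoder 官方实现;模板与分数均为假设)
def build_prompt(question, passages):
    """passages: [(text, relevance, rank_score, qpp_score), ...]"""
    lines = []
    for i, (text, rel, rank, qpp) in enumerate(passages, 1):
        lines.append(
            f"[Doc {i} | relevance={rel:.2f} | rank_score={rank:.2f} | QPP={qpp:.2f}]\n{text}"
        )
    context = "\n\n".join(lines)
    return (
        "Answer the question using the context. Passages with low quality scores "
        "may be noisy and should be weighted less.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

passages = [
    ("The Eiffel Tower is located in Paris, France.", 0.92, 0.88, 0.75),
    ("Paris syndrome is a sense of disappointment felt by some tourists.", 0.31, 0.42, 0.75),
]
print(build_prompt("Where is the Eiffel Tower?", passages))
```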
[NLP-66] Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估方法依赖静态基准测试所导致的数据饱和与泄露问题,尤其是在多语言和跨文化场景下模型能力难以被有效衡量的挑战。其解决方案的关键在于提出一种基于社交推理游戏Spyfall的动态基准框架,通过模拟真实情境中的策略对话任务(如识别秘密特工或避免被发现),要求模型运用文化相关地点或本地食物等语境信息进行推理;该方法不仅具备良好的可扩展性和抗数据泄露特性,还能捕捉模型在非英语语境中对本地化实体理解不足及规则遵守能力下降等问题,从而提供一种更贴近实际应用、更具文化敏感性的评估手段。
链接: https://arxiv.org/abs/2601.09017
作者: Haryo Akbarianto Wibowo,Alaa Elsetohy,Qinrong Cui,Alham Fikri Aji
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid advancement of Large Language Models (LLMs) has necessitated more robust evaluation methods that go beyond static benchmarks, which are increasingly prone to data saturation and leakage. In this paper, we propose a dynamic benchmarking framework for evaluating multilingual and multicultural capabilities through the social deduction game Spyfall. In our setup, models must engage in strategic dialogue to either identify a secret agent or avoid detection, utilizing culturally relevant locations or local foods. Our results show that our game-based rankings align closely with the Chatbot Arena. However, we find a significant performance gap in non-English contexts: models are generally less proficient when handling locally specific entities and often struggle with rule-following or strategic integrity in non-English languages. We demonstrate that this game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks. The game history can be accessed here this https URL.
zh
[NLP-67] TranslateGemma Technical Report
【速读】: 该论文旨在解决通用大语言模型在机器翻译任务中性能不足的问题,特别是如何在保持模型轻量化的同时提升翻译质量与多语言覆盖能力。解决方案的关键在于采用两阶段微调策略:首先通过监督微调(Supervised Fine-Tuning, SFT)利用高质量合成平行语料和人工翻译数据增强Gemma 3的基础多语言能力;随后引入强化学习(Reinforcement Learning, RL)阶段,基于MetricX-QE与AutoMQM等奖励模型集成对翻译质量进行优化。该方法显著提升了TranslateGemma在WMT25和WMT24++等多个基准上的表现,且小尺寸模型即可媲美更大规模基线模型,同时保留了强大的多模态能力,为研究社区提供了高效、可扩展的开源翻译工具。
链接: https://arxiv.org/abs/2601.09012
作者: Mara Finkelstein,Isaac Caswell,Tobias Domhan,Jan-Thorsten Peter,Juraj Juraska,Parker Riley,Daniel Deutsch,Cole Dilanni,Colin Cherry,Eleftheria Briakou,Elizabeth Nielsen,Jiaming Luo,Kat Black,Ryan Mullins,Sweta Agrawal,Wenda Xu,Erin Kats,Stephane Jaskiewicz,Markus Freitag,David Vilar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.
zh
[NLP-68] Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在部署过程中面临的两个耦合挑战:一是监控(monitoring)——即在数据分布漂移(domain shift)时准确识别模型性能下降的特定数据切片(slice);二是改进(improvement)——即优先采集能显著缩小性能差距的数据。其解决方案的关键在于利用推理时信号(inference-time signal)来估计不同数据域下的精度,具体做法是:对每个输出序列,基于最终层的下一个词概率(来自top-k logprobs)计算输出熵谱(output-entropy profile),并用11个统计量对其进行总结;随后通过一个轻量级分类器预测单个实例的正确性,再通过对预测概率取平均得到整体域级别的准确率估计。实验表明,该方法在多个STEM推理基准测试中能有效跟踪真实性能变化,并在部分模型上实现域级别准确率的近单调排序,从而为大规模监控和有针对性的数据采集提供了可扩展且高效的手段。
链接: https://arxiv.org/abs/2601.09001
作者: Pedro Memoli Buffa,Luciano Del Corro
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Deploying LLMs raises two coupled challenges: (1) monitoring - estimating where a model underperforms as traffic and domains drift - and (2) improvement - prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in 1,2,3,4; all “10 choose k” combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
zh
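按摘要描述的流程(top-k logprobs → 逐 token 熵曲线 → 统计量摘要 → 轻量分类器 → 对预测概率取平均得到域级准确率估计),下面给出一个用合成数据串起来的极简示意;为简洁只用 4 个统计量而非论文中的 11 个,并非原实现:

```python
# 极简示意:输出熵曲线 -> 统计量摘要 -> 轻量分类器 -> 域级准确率估计(数据为合成)
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy_profile(topk_probs):
    """topk_probs: (T, k),每个生成步的 top-k 概率;重新归一化后计算熵。"""
    p = topk_probs / topk_probs.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def summarize(profile):
    return np.array([profile.mean(), profile.std(), profile.max(), profile[-5:].mean()])

rng = np.random.default_rng(0)

def synth_example(correct):
    T, k = 64, 5
    scale = 0.5 if correct else 2.0          # 假设:错误回答的熵整体更高
    logits = rng.normal(0, 1, (T, k)) / scale
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return summarize(entropy_profile(probs))

y = rng.integers(0, 2, 400)                  # 1 = 回答正确
X = np.stack([synth_example(bool(c)) for c in y])
clf = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])

# 域级准确率估计 = 对该域样本预测“正确”的概率取平均
est = clf.predict_proba(X[300:])[:, 1].mean()
print("estimated slice accuracy:", round(float(est), 3), " true:", round(float(y[300:].mean()), 3))
```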
[NLP-69] Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
【速读】: 该论文旨在解决当前世界模型(World Model)在复杂任务规划中潜力未被充分挖掘的问题,尤其是现有方法多局限于单步或固定时域的模拟推演,难以支持长期、动态的任务推理与决策。其解决方案的关键在于提出一种统一的“先想象后规划”(Imagine-then-Plan, ITP)框架,通过让智能体策略模型与学习到的世界模型交互生成多步“想象”轨迹,并引入一种自适应前瞻机制,在最终目标与任务进展之间进行权衡,从而动态调整想象视野。该机制使得想象轨迹能够提供关于未来进展和潜在冲突的丰富信号,并与当前观测融合,构建出部分可观测且可想象的马尔可夫决策过程(Partially Observable and Imaginable Markov Decision Process),有效引导策略学习,显著提升智能体在复杂任务中的推理能力。
链接: https://arxiv.org/abs/2601.08955
作者: Youwei Liu,Jian Wang,Hanlin Wang,Beichen Guo,Wenjie Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent’s policy model interacts with the learned world model, yielding multi-step “imagined” trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents’ reasoning capability, providing valuable insights into addressing broader, complex tasks.
zh
[NLP-70] PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
【速读】: 该论文旨在解决当前AI安全框架中将有害性视为二元判断(harmful vs. benign)所导致的局限性,特别是无法有效处理人类在特定情境下存在显著分歧的“边缘案例”。其核心问题在于:现有方法缺乏对价值多样性与主观判断差异的建模能力,难以支持多元共存的AI安全体系。解决方案的关键在于提出PluriHarms基准,该基准通过系统性地构建涵盖“危害轴”(benign to harmful)和“共识轴”(agreement to disagreement)两个维度的评测数据集,聚焦高争议性的AI行为场景,并整合人类标注者的社会人口学特征与心理特质、以及提示内容中的行为属性、影响程度与价值观信息。研究表明,即时风险感知与具象伤害程度显著提升危害感知强度,而标注者个体特质(如毒性经历、教育水平)及其与提示内容的交互作用则解释了系统性分歧。此框架为从“一刀切”的安全策略迈向更具包容性的多元安全(pluralistically safe AI)提供了可量化的评估基础。
链接: https://arxiv.org/abs/2601.08951
作者: Jing-Jing Li,Joel Mire,Eve Fleisig,Valentina Pyatkin,Anne Collins,Maarten Sap,Sydney Levine
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions – the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that relate to imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond “one-size-fits-all” safety toward pluralistically safe AI.
zh
[NLP-71] Fine Grained Evaluation of LLMs-as-Judges
【速读】: 该论文旨在解决如何更精细地评估大型语言模型(Large Language Models, LLMs)作为信息检索(Information Retrieval, IR)中相关性判断者(judges)的有效性问题,尤其关注其在文档层面之外是否能准确识别并标注出与查询相关的具体段落。传统研究多仅评估LLMs对文档整体相关性的判别能力,而本文的关键创新在于使用INEX倡议构建的基于维基百科的测试集,并设计提示(prompt)使LLMs不仅判断文档相关性,还需定位和高亮其认为有用的文本片段——这一机制与人类标注者的任务完全一致。通过对比LLMs与人类标注者在段落级相关性标注上的匹配程度,论文量化了LLMs“正确判断”的原因是否合理,从而揭示了LLMs作为judges在无监督下存在局限性,其效果最优时仍需人类监督以确保判断的准确性与可解释性。
链接: https://arxiv.org/abs/2601.08919
作者: Sourav Saha,Mandar Mitra
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:A good deal of recent research has focused on how Large Language Models (LLMs) may be used as “judges” in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia based test collection created by the INEX initiative, and prompt LLMs to not only judge whether documents are relevant / non-relevant, but to highlight relevant passages in documents that it regards as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that respond to the information need expressed in a query. This enables us to evaluate the quality of LLMs as judges not only at the document level, but to also quantify how often these “judges” are right for the right reasons. Our findings suggest that LLMs-as-judges work best under human supervision.
zh
[NLP-72] Navigating Ideation Space: Decomposed Conceptual Representations for Positioning Scientific Ideas
【速读】: 该论文旨在解决科学发现过程中两个关键问题:一是如何从快速增长的文献中识别概念上相关的先前研究,并准确评估新想法与现有研究的差异;二是当前基于嵌入(embedding)的方法通常将不同概念维度混杂在同一表征中,难以支持细粒度文献检索;而基于大语言模型(LLM)的评估器则易受奉承偏倚(sycophancy bias)影响,无法提供具有区分度的新颖性判断。解决方案的核心在于提出“构想空间”(Ideation Space),这是一种结构化的知识表示框架,将科学知识分解为三个独立维度——研究问题、方法论和核心发现,并通过对比学习分别建模每个维度。在此基础上,进一步设计了分层子空间检索(Hierarchical Sub-Space Retrieval)机制以实现高效精准的文献召回,以及分解式新颖性评估算法(Decomposed Novelty Assessment),可量化识别新想法在哪些具体维度上具有创新性。实验表明,该方法显著优于基线,在召回率、构想转移检索命中率及专家判断相关性方面均有明显提升。
链接: https://arxiv.org/abs/2601.08901
作者: Yuexi Shen,Minqian Liu,Dawei Zhou,Lifu Huang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 6 tables
Abstract:Scientific discovery is a cumulative process and requires new ideas to be situated within an ever-expanding landscape of existing knowledge. An emerging and critical challenge is how to identify conceptually relevant prior work from rapidly growing literature, and assess how a new idea differentiates from existing research. Current embedding approaches typically conflate distinct conceptual aspects into single representations and cannot support fine-grained literature retrieval; meanwhile, LLM-based evaluators are subject to sycophancy biases, failing to provide discriminative novelty assessment. To tackle these challenges, we introduce the Ideation Space, a structured representation that decomposes scientific knowledge into three distinct dimensions, i.e., research problem, methodology, and core findings, each learned through contrastive training. This framework enables principled measurement of conceptual distance between ideas, and modeling of ideation transitions that capture the logical connections within a proposed idea. Building upon this representation, we propose a Hierarchical Sub-Space Retrieval framework for efficient, targeted literature retrieval, and a Decomposed Novelty Assessment algorithm that identifies which aspects of an idea are novel. Extensive experiments demonstrate substantial improvements, where our approach achieves Recall@30 of 0.329 (16.7% over baselines), our ideation transition retrieval reaches Hit Rate@30 of 0.643, and novelty assessment attains 0.37 correlation with expert judgments. In summary, our work provides a promising paradigm for future research on accelerating and evaluating scientific discovery.
zh
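下面的小例子演示“按研究问题 / 方法 / 发现三个子空间分别计算与既有文献的相似度,从而得到分维度新颖性”的基本想法。为保持自包含,用字符 n-gram TF-IDF 代替论文中对比学习得到的编码器,文献条目均为虚构:

```python
# 极简示意:把想法分解到三个子空间,分别求与既有文献的最近相似度,得到分维度新颖性
# (非论文原实现;嵌入用字符 n-gram TF-IDF 代替对比训练的编码器)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ASPECTS = ["problem", "method", "findings"]

prior_work = [
    {"problem": "low-resource machine translation",
     "method": "back-translation augmentation",
     "findings": "BLEU improves on low-resource pairs"},
    {"problem": "long-context summarization",
     "method": "hierarchical attention",
     "findings": "better ROUGE on long documents"},
]
new_idea = {"problem": "low-resource machine translation",
            "method": "retrieval-augmented decoding with a bilingual lexicon",
            "findings": "gains concentrated on rare words"}

for aspect in ASPECTS:
    corpus = [w[aspect] for w in prior_work] + [new_idea[aspect]]
    X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(corpus).toarray()
    sims = cosine_similarity(X[-1:], X[:-1]).ravel()
    novelty = 1.0 - sims.max()
    print(f"{aspect:9s} nearest-sim={sims.max():.2f} novelty={novelty:.2f}")
```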
[NLP-73] Spectral Generative Flow Models: A Physics-Inspired Replacement for Vectorized Large Language Models
【速读】: 该论文旨在解决当前主流生成模型(如基于Transformer的大型语言模型)在长程一致性、多模态泛化能力以及物理结构归纳偏置方面存在的局限性。其核心问题在于:传统方法依赖全局注意力机制和自回归建模,难以有效捕捉跨尺度的结构信息并保证生成过程的稳定性与物理合理性。解决方案的关键在于提出谱生成流模型(Spectral Generative Flow Models, SGFMs),通过将文本和视频统一为随机偏微分方程的轨迹,引入小波域表示以实现稀疏性和尺度分离,并采用受约束的随机流来确保生成过程的稳定性、连贯性及不确定性传播。这一框架从根本上区别于自回归和扩散方法,为下一代生成模型提供了基于连续场理论、几何结构和物理规律的全新范式。
链接: https://arxiv.org/abs/2601.08893
作者: Andrew Kiruluta
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:We introduce Spectral Generative Flow Models (SGFMs), a physics-inspired alternative to transformer-based large language models. Instead of representing text or video as sequences of discrete tokens processed by attention, SGFMs treat generation as the evolution of a continuous field governed by constrained stochastic dynamics in a multiscale wavelet basis. This formulation replaces global attention with local operators, spectral projections, and Navier–Stokes-like transport, yielding a generative mechanism grounded in continuity, geometry, and physical structure. Our framework provides three key innovations: (i) a field-theoretic ontology in which text and video are unified as trajectories of a stochastic partial differential equation; (ii) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and (iii) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together, these components define a generative architecture that departs fundamentally from autoregressive modeling and diffusion-based approaches. SGFMs offer a principled path toward long-range coherence, multimodal generality, and physically structured inductive bias in next-generation generative models.
zh
[NLP-74] Evaluating Role-Consistency in LLMs for Counselor Training
【速读】: 该论文旨在解决在线心理咨询培训中缺乏高效训练方法的问题,特别是如何提升未来咨询师在虚拟环境中与客户互动时的角色一致性(role-consistency)能力。解决方案的关键在于构建一个包含对抗性攻击(adversarial attacks)的新数据集,用于测试大语言模型(LLMs)在模拟真实客户交互时维持角色设定的能力,并通过评估对话连贯性和人物一致性,对多种开源大语言模型进行比较分析,从而为在线心理咨询培训提供更可靠、更具鲁棒性的虚拟客户模拟工具。
链接: https://arxiv.org/abs/2601.08892
作者: Eric Rudolph,Natalie Engert,Jens Albrecht
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rise of online counseling services has highlighted the need for effective training methods for future counselors. This paper extends research on VirCo, a Virtual Client for Online Counseling, designed to complement traditional role-playing methods in academic training by simulating realistic client interactions. Building on previous work, we introduce a new dataset incorporating adversarial attacks to test the ability of large language models (LLMs) to maintain their assigned roles (role-consistency). The study focuses on evaluating the role consistency and coherence of the Vicuna model’s responses, comparing these findings with earlier research. Additionally, we assess and compare various open-source LLMs for their performance in sustaining role consistency during virtual client interactions. Our contributions include creating an adversarial dataset, evaluating conversation coherence and persona consistency, and providing a comparative analysis of different LLMs.
zh
[NLP-75] NewsScope: Schema-Grounded Cross-Domain News Claim Extraction with Open Models
【速读】: 该论文旨在解决自动化新闻验证中结构化事实提取(claim extraction)存在的两个核心问题:现有方法要么不符合预定义的schema规范,要么在跨领域场景下泛化能力差。其解决方案的关键在于构建了一个名为NewsScope的跨域数据集、基准测试和微调模型,该数据集涵盖政治、健康、科技/环境及商业四个领域共455篇文章,并采用LoRA(Low-Rank Adaptation)对LLaMA 3.1 8B模型进行微调,在保持schema一致性的同时显著提升跨域泛化性能。实验表明,该方法在人类评估中达到89.4%准确率,优于GPT-4o-mini在政治类声明上的表现(94.3% vs. 87.8%),并通过数值锚定过滤进一步提升至91.6%,验证了方案的有效性与实用性。
链接: https://arxiv.org/abs/2601.08852
作者: Nidhi Pandya
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages, 3 tables. Code, model, and benchmark publicly released
Abstract:Automated news verification requires structured claim extraction, but existing approaches either lack schema compliance or generalize poorly across domains. This paper presents NewsScope, a cross-domain dataset, benchmark, and fine-tuned model for schema-grounded news claim extraction. The dataset contains 455 articles across politics, health, science/environment, and business, consisting of 395 in-domain articles and 60 out-of-source articles for generalization testing. LLaMA 3.1 8B was fine-tuned using LoRA on 315 training examples and evaluated on held-out in-domain (80 articles) and out-of-source (60 articles) test sets. Human evaluation on 400 claims shows NewsScope achieves 89.4% human-evaluated accuracy compared to GPT-4o-mini’s 93.7% (p=0.07). NewsScope outperforms GPT-4o-mini on political claims (94.3% vs. 87.8%). A numeric grounding filter further improves accuracy to 91.6%, narrowing the gap to 2.1 percentage points. Inter-annotator agreement studies (160 claims) confirm labeling reliability (94.6% positive agreement on SUPPORTED judgments). The open-weight model enables offline deployment at approximately $15 of on-demand compute (or $0 on free tiers). Code and benchmark are publicly released.
zh
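摘要提到的“数值锚定过滤”可以用很少的代码说明:若声明中的数字在原文中找不到,就剔除该条声明。下面是一个极简示意(数字归一化规则为本文假设,并非 NewsScope 原实现):

```python
# 极简示意:数值锚定过滤——声明中的数字必须能在原文中找到,否则视为未锚定
import re

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text: str) -> set:
    # 去掉千分位逗号后抽取所有数字串
    return set(NUM_RE.findall(text.replace(",", "")))

def grounded(claim: str, article: str) -> bool:
    """声明中的每个数字都必须出现在原文中。"""
    return numbers_in(claim) <= numbers_in(article)

article = "The company reported revenue of $4.2 billion in 2024, up 12% from the prior year."
claims = [
    "Revenue reached 4.2 billion dollars in 2024.",
    "Revenue grew by 21% year over year.",   # 21% 与原文的 12% 不符,将被过滤
]
for c in claims:
    print(grounded(c, article), "|", c)
```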
[NLP-76] Más contexto no es mejor. Paradoja de la dilución vectorial en RAG corporativos
【速读】: 该论文旨在解决生成式 AI (Generative AI) 中检索增强生成(Retrieval-Augmented Generation, RAG)系统因“上下文分块”(Contextualized Chunking)技术引入摘要注入后导致的向量稀释(vector dilution)问题,即局部内容被弱化从而影响特定查询的精度。其解决方案的关键在于提出一个理论框架,用于计算最优的摘要注入比例(Injection Ratio),实验证明存在一个“倒U型”曲线关系:适度注入可提升召回率(Recall)达18%,但超过临界阈值(CIR > 0.4)会使特定查询的精度下降22%,因此通过理论建模精准定位最优注入比例是提升RAG性能的核心。
链接: https://arxiv.org/abs/2601.08851
作者: Alex Dantart
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: in Spanish and English languages
Abstract:Recent “Contextualized Chunking” techniques inject summaries to improve RAG context but introduce a “vector dilution” that drowns out local content. Evaluating various injection ratios, we demonstrate an “inverted U” curve: moderate injection boosts Recall (+18%), but exceeding a critical threshold (CIR > 0.4) drops precision by 22% for specific queries. We propose a theoretical framework to calculate the optimal injection ratio.
zh
[NLP-77] Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment AAAI2026
【速读】: 该论文旨在解决自动化答案匹配(answer matching)模型在面对策略性攻击时的可靠性问题,例如通过冗余文本、多答案输出或嵌套冲突信息等方式人为提高评分但不提升真实正确性。其关键解决方案在于系统性地评估三种常见攻击手段对匹配模型的影响,并发现这些操作不仅未能提升分数,反而常导致得分下降;同时指出二元评分(binary scoring)相较于连续评分(continuous scoring)更具鲁棒性,从而证明了当存在参考答案时,自动化答案匹配是一种可靠且可扩展的替代传统大语言模型作为裁判(LLM-as-a-judge)或人工评价的方法。
链接: https://arxiv.org/abs/2601.08849
作者: Manas Khatore,Sumana Sridharan,Kevork Sulahian,Benjamin J. Smith,Shi Feng
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to the AAAI 2026 Workshop on AI Governance (AIGOV)
Abstract:Automated answer matching, which leverages LLMs to evaluate free-text responses by comparing them to a reference answer, shows substantial promise as a scalable and aligned alternative to human evaluation. However, its reliability requires robustness against strategic attacks such as guesswork or verbosity that may artificially inflate scores without improving actual correctness. In this work, we systematically investigate whether such tactics deceive answer matching models by prompting examinee models to: (1) generate verbose responses, (2) provide multiple answers when unconfident, and (3) embed conflicting answers with the correct answer near the start of their response. Our results show that these manipulations do not increase scores and often reduce them. Additionally, binary scoring (which requires a matcher to answer with a definitive “correct” or “incorrect”) is more robust to attacks than continuous scoring (which requires a matcher to determine partial correctness). These findings show that answer matching is generally robust to inexpensive text manipulation and is a viable alternative to traditional LLM-as-a-judge or human evaluation when reference answers are available.
zh
[NLP-78] PediaMind-R1: A Temperament-Aware Language Model for Personalized Early Childhood Care Reasoning via Cognitive Modeling and Preference Alignment EMNLP2025
【速读】: 该论文旨在解决当前智能育儿系统中缺乏个性化和情境适应性的问题,传统方法通常提供通用建议,难以满足不同婴幼儿个体差异的照护需求。解决方案的关键在于构建一个基于发展心理学理论的专用大语言模型——PediaMind-R1,其核心创新包括:首先引入Thomas-Chess气质理论并构建婴儿与幼儿(0-3岁)的气质知识图谱,实现对早期儿童气质特征的精准建模;其次采用两阶段训练策略,先通过监督微调(Supervised Fine-Tuning, SFT)教授结构化链式推理能力,再利用GRPO(Generalized Reward Policy Optimization)对齐机制强化逻辑一致性、领域专业知识及共情式育儿策略。这一整合垂直领域建模与心理理论的方法,显著提升了模型在敏感照护场景下的主动个性化能力。
链接: https://arxiv.org/abs/2601.08848
作者: Zihe Zhang,Can Zhang,Yanheng Xu,Xin Hu,Jichao Leng
机构: Fudan University (复旦大学); Bosch (中国)投资有限公司
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP 2025 PALS Workshop (PALS: EXPLORING ACTIVE AND PASSIVE LLM PERSONALIZATION)
Abstract:This paper presents PediaMind-R1, a domain-specialized large language model designed to achieve active personalization in intelligent parenting scenarios. Unlike conventional systems that provide generic suggestions, PediaMind-R1 draws on insights from developmental psychology. It introduces temperament theory from the Thomas-Chess framework and builds a temperament knowledge graph for infants and toddlers (0-3 years). Our two-stage training pipeline first uses supervised fine-tuning to teach structured chain-of-thought reasoning, and then applies a GRPO-based alignment stage to reinforce logical consistency, domain expertise, and empathetic caregiving strategies. We further design an evaluation framework comprising temperament-sensitive multiple-choice tests and human assessments. The results demonstrate that PediaMind-R1 can accurately interpret early childhood temperament profiles and proactively engage in individualized reasoning. This work highlights the value of integrating vertical-domain modeling with psychological theory. It offers a novel approach to developing user-centered LLMs that advance the practice of active personalization in sensitive caregiving contexts.
zh
[NLP-79] Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe
【速读】: 该论文旨在解决知识系统(如大语言模型、检索增强生成、知识图谱等)评估中的三大核心问题:静态基准易受污染、基于大语言模型的评判者存在系统性偏差,以及真实标签提取依赖昂贵的人工标注。其解决方案的关键在于提出 RIKER(Retrieval Intelligence and Knowledge Extraction Rating)——一种基于范式反转(paradigm inversion)的可复现评估方法:不是从文档中提取真实标签,而是从已知结构化真实信息中生成合成文档,从而实现确定性评分、无需人工标注或参考模型即可规模化评估,并通过可再生语料库实现抗污染能力。
链接: https://arxiv.org/abs/2601.08847
作者: JV Roig
机构: Kamiwaza AI(卡米瓦扎人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 17 tables, 1 figure
Abstract:Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion - generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; cross-document aggregation proves substantially harder than single-document extraction; and grounding ability and hallucination resistance are distinct capabilities - models excelling at finding facts that exist may still fabricate facts that do not. Beyond the specific benchmark, we contribute a domain-agnostic methodology for constructing scalable and contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.
zh
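下面用一个玩具例子演示“范式反转”的核心:先有结构化真值,再用模板生成合成文档,评测时即可对被测系统的抽取结果做逐字段的确定性打分。文档模板与字段均为本文假设,并非 RIKER 官方语料构造方式:

```python
# 极简示意:从结构化真值生成合成文档,再对抽取结果做确定性打分(无需人工标注或裁判模型)
import random

FACTS = [
    {"company": "Aurora Metals", "founded": "1987", "hq": "Lisbon"},
    {"company": "Nimbus Grid", "founded": "2004", "hq": "Osaka"},
]

TEMPLATES = [
    "{company} was founded in {founded} and is headquartered in {hq}.",
    "Headquartered in {hq}, {company} has operated since {founded}.",
]

def generate_corpus(facts, seed=0):
    rng = random.Random(seed)
    return [rng.choice(TEMPLATES).format(**f) for f in facts]

def score_extraction(predicted, facts):
    """逐字段精确匹配的确定性打分。"""
    total = hit = 0
    for pred, gold in zip(predicted, facts):
        for k, v in gold.items():
            total += 1
            hit += int(pred.get(k) == v)
    return hit / total

docs = generate_corpus(FACTS)
print(docs)

# 假设这是被测系统(LLM / RAG)返回的抽取结果
predicted = [
    {"company": "Aurora Metals", "founded": "1987", "hq": "Lisbon"},
    {"company": "Nimbus Grid", "founded": "2003", "hq": "Osaka"},   # founded 抽取错误
]
print("field accuracy:", score_extraction(predicted, FACTS))       # 5/6 ≈ 0.83
```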
[NLP-80] Directional Attractors in LLM Reasoning: How Similarity Retrieval Steers Iterative Summarization Based Reasoning
【速读】: 该论文旨在解决迭代式摘要推理框架(如InftyThink)在长时程推理中重复生成相似推理策略的问题,从而限制了模型在多任务场景下的泛化能力与效率。其解决方案的关键在于引入基于嵌入的语义缓存机制(Cross-Chain Memory),通过存储和检索先前成功推理路径中的语义片段(lemmas),在每一步推理中引导模型利用最相关的记忆内容进行条件推理,而非无差别扩展上下文窗口。这一机制在结构化领域(如MATH500、AIME2024、GPQA-Diamond)显著提升准确率,并揭示出语义相似性驱动的记忆检索可诱导嵌入空间中的方向性偏差,形成稳定“修复”(提升基线准确率)或“破坏”(降低基线准确率)吸引子,从而明确指出基于相似性的记忆对自增强型大语言模型推理的双重影响。
链接: https://arxiv.org/abs/2601.08846
作者: Cagatay Tekin,Charbel Barakat,Luis Joseph Luna Limgenco
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures. Code available at: this http URL
Abstract:Iterative summarization based reasoning frameworks such as InftyThink enable long-horizon reasoning in large language models (LLMs) by controlling context growth, but they repeatedly regenerate similar reasoning strategies across tasks. We introduce InftyThink with Cross-Chain Memory, an extension that augments iterative reasoning with an embedding-based semantic cache of previously successful reasoning patterns. At each reasoning step, the model retrieves and conditions on the most semantically similar stored lemmas, guiding inference without expanding the context window indiscriminately. Experiments on MATH500, AIME2024, and GPQA-Diamond demonstrate that semantic lemma retrieval improves accuracy in structured domains while exposing failure modes in tests that include heterogeneous domains. Geometric analyses of reasoning trajectories reveal that cache retrieval induces directional biases in embedding space, leading to consistent fix (improve baseline accuracy) and break (degradation in baseline accuracy) attractors. Our results highlight both the benefits and limits of similarity-based memory for self-improving LLM reasoning.
zh
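下面给出跨链语义缓存的一个极简示意:把既往成功推理中的“引理”连同嵌入存起来,每个推理步骤按余弦相似度取回最相关的若干条。embed() 用词袋哈希代替真实句向量模型,并非 InftyThink 扩展的官方实现:

```python
# 极简示意:跨链语义缓存——存储既往成功推理中的引理,按余弦相似度检索
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """假设的嵌入函数:词袋哈希 + L2 归一化,仅为演示。"""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[int(hashlib.md5(w.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

class LemmaCache:
    def __init__(self):
        self.texts, self.vecs = [], []

    def add(self, lemma: str):
        self.texts.append(lemma)
        self.vecs.append(embed(lemma))

    def retrieve(self, query: str, k: int = 2):
        if not self.texts:
            return []
        sims = np.stack(self.vecs) @ embed(query)
        top = np.argsort(-sims)[:k]
        return [(self.texts[i], float(sims[i])) for i in top]

cache = LemmaCache()
cache.add("For AM-GM problems, compare the arithmetic and geometric means of the terms.")
cache.add("When counting lattice paths, reduce to binomial coefficients.")

# 每个推理步骤开始前,检索最相近的引理拼入摘要上下文
step_summary = "Need to bound the product of three positive reals given their sum."
for lemma, sim in cache.retrieve(step_summary):
    print(f"{sim:.2f}  {lemma}")
```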
[NLP-81] Emissions and Performance Trade-off Between Small and Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)因训练和推理过程能耗巨大而带来的高碳排放问题,进而推动生成式 AI 的可持续发展。其解决方案的关键在于:通过微调小型语言模型(Small Language Models, SLMs)来执行特定任务,在保证性能基本不受损的前提下,显著降低推理阶段的碳排放。实验表明,在六项任务中的四项里,SLMs 能够实现与 LLMs 相当的性能水平,同时大幅减少碳足迹,从而验证了使用微调后的 SLMs 作为绿色 AI 替代方案的可行性。
链接: https://arxiv.org/abs/2601.08844
作者: Anandita Garg,Uma Gaba,Deepan Muthirayan,Anish Roy Chowdhury
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 6 pages. Accepted as a full paper to the 3rd International Conference on Foundation and Large Language Models (IEEE FLLM) 2025
Abstract:The advent of Large Language Models (LLMs) has raised concerns about their enormous carbon footprint, starting with energy-intensive training and continuing through repeated inference. This study investigates the potential of using fine-tuned Small Language Models (SLMs) as a sustainable alternative for predefined tasks. Here, we present a comparative analysis of the performance-emissions trade-off between LLMs and fine-tuned SLMs across selected tasks under Natural Language Processing, Reasoning and Programming. Our results show that in four out of the six selected tasks, SLMs maintained comparable performances for a significant reduction in carbon emissions during inference. Our findings demonstrate the viability of smaller models in mitigating the environmental impact of resource-heavy LLMs, thus advancing towards sustainable, green AI.
zh
[NLP-82] Rubric-Conditioned LLM Grading: Alignment Uncertainty and Robustness
【速读】: 该论文旨在解决自动化短答案评分(ASAG)中因学生作答语言多样性及评分细则(rubric)所需细粒度部分赋分带来的挑战,尤其是如何提升大型语言模型(LLM)作为评分裁判在复杂评分标准下的可靠性。解决方案的关键在于系统评估LLM-judge在三种核心维度的表现:一是不同评分细则复杂度下与专家评分的一致性;二是通过基于共识的置信度阈值机制,在不确定预测中实现准确率与置信度之间的权衡;三是对随机扰动和对抗攻击的鲁棒性测试。实验表明,LLM在二分类任务中表现良好,但随着评分粒度增加一致性下降,且通过“信任曲线”分析可发现低置信度预测过滤能显著提升剩余样本的准确性,同时揭示了模型对同义词替换敏感、对提示注入攻击相对鲁棒的特性,从而强调了不确定性估计和鲁棒性验证对于实际部署的重要性。
链接: https://arxiv.org/abs/2601.08843
作者: Haotian Deng,Chris Farber,Jiyoon Lee,David Tang
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model’s robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our “Trust Curve” analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.
zh
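摘要中的“基于共识的拒判”机制大致如下:对同一份作答多次采样裁判判分,一致率低于阈值则转交人工。下面是一个极简示意,grade_once 为假设的裁判调用,这里用随机函数代替真实 LLM:

```python
# 极简示意:基于共识的拒判——多次采样判分,一致率不足则交由人工复核
import random
from collections import Counter

def grade_once(answer: str, rubric: str, rng: random.Random) -> str:
    """假设的单次 LLM 判分:返回 'correct' / 'partial' / 'incorrect',此处用带噪声的随机选择代替。"""
    return rng.choices(["correct", "partial", "incorrect"], weights=[0.6, 0.25, 0.15])[0]

def grade_with_deferral(answer, rubric, n_samples=7, threshold=0.7, seed=0):
    rng = random.Random(seed)
    votes = Counter(grade_once(answer, rubric, rng) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    consensus = count / n_samples
    if consensus < threshold:
        return "DEFER_TO_HUMAN", consensus
    return label, consensus

print(grade_with_deferral("Photosynthesis converts light energy into chemical energy.",
                          "Award credit if energy conversion is mentioned."))
```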
[NLP-83] Resisting Correction: How RLHF Makes Language Models Ignore External Safety Signals in Natural Conversation
【速读】: 该论文旨在解决生成式 AI(Generative AI)在安全架构中依赖外部监控器进行错误检测与纠正时所面临的可控性失效问题,尤其是在交互式场景下模型是否能有效整合外部提供的置信度信号以调整输出。其核心发现是:基础模型具备近乎完美的可控性(Spearman相关系数接近1.0),而经过指令微调(instruction-tuned)的模型在显式命令提示下仍保持高合规性(偏差约0%,rho=0.93),但在自然对话查询中却系统性忽略相同外部信号(偏差+40%,rho=0.04),这种行为并非能力缺失,而是强化学习人类反馈(RLHF)优化过程中为追求对话流畅性而牺牲了对外部校准信号的敏感性。解决方案的关键在于认识到内部token级置信度在小型模型中无信息量(r=0.035),强调必须引入外部监督来保障安全控制的有效性,尤其在用户最常使用的自然交互模式中。
链接: https://arxiv.org/abs/2601.08842
作者: Felipe Biava Cataneo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Safety architectures for language models increasingly rely on external monitors to detect errors and inject corrective signals at inference time. For such systems to function in interactive settings, models must be able to incorporate externally provided confidence information into their verbal responses. In this work, we test whether instruction-tuned language models preserve this controllability across different interaction modes. Using Llama-3.2-3B on GSM8K, we perform a causal intervention study in which explicit external confidence signals are injected and model compliance is measured under multiple prompt strategies. We find that base models exhibit near-perfect controllability (Spearman rho close to 1.0), while instruction-tuned models display a striking context dependence: they fully comply with external corrections under explicit command prompts (bias approximately 0 percent, rho = 0.93), yet systematically ignore the same signals in natural conversational queries (bias plus 40 percent, rho = 0.04). This behavior is not a capability failure (the model can process the signal) but an emergent property of RLHF optimization that prioritizes conversational fluency over external calibration cues in natural dialogue. We further show that internal token-level confidence in small models is uninformative (r = 0.035), underscoring the necessity of external supervision. Our findings highlight a deployment-critical failure mode: the interaction style users expect is precisely where safety corrections are least effective.
zh
[NLP-84] Triples and Knowledge-Infused Embeddings for Clustering and Classification of Scientific Documents
【速读】: 该论文旨在解决科学文献数量激增与复杂性提升所带来的研究文档组织与理解难题,核心问题是如何有效利用结构化知识(如主谓宾三元组)来增强科学论文的聚类与分类性能。其解决方案的关键在于提出一个模块化处理流程,将原始摘要、提取的三元组以及两者融合的混合表示形式相结合,并通过四种先进的Transformer模型(MiniLM、MPNet、SciBERT和SPECTER)进行嵌入编码;实验表明,尽管纯文本能生成更一致的聚类结果,但引入三元组信息的混合表示显著提升了分类准确率(最高达92.6%准确率和0.925宏F1值),且轻量级句向量模型在聚类任务中表现优于领域专用模型,而SciBERT在结构化输入分类中更具优势,凸显了非结构化文本与结构化知识协同优化的潜力。
链接: https://arxiv.org/abs/2601.08841
作者: Mihael Arcan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:
Abstract:The increasing volume and complexity of scientific literature demand robust methods for organizing and understanding research documents. In this study, we explore how structured knowledge, specifically, subject-predicate-object triples, can enhance the clustering and classification of scientific papers. We propose a modular pipeline that combines unsupervised clustering and supervised classification over multiple document representations: raw abstracts, extracted triples, and hybrid formats that integrate both. Using a filtered arXiv corpus, we extract relational triples from abstracts and construct four text representations, which we embed using four state-of-the-art transformer models: MiniLM, MPNet, SciBERT, and SPECTER. We evaluate the resulting embeddings with KMeans, GMM, and HDBSCAN for unsupervised clustering, and fine-tune classification models for arXiv subject prediction. Our results show that full abstract text yields the most coherent clusters, but that hybrid representations incorporating triples consistently improve classification performance, reaching up to 92.6% accuracy and 0.925 macro-F1. We also find that lightweight sentence encoders (MiniLM, MPNet) outperform domain-specific models (SciBERT, SPECTER) in clustering, while SciBERT excels in structured-input classification. These findings highlight the complementary benefits of combining unstructured text with structured knowledge, offering new insights into knowledge-infused representations for semantic organization of scientific documents.
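A minimal sketch of the hybrid-representation idea under stated assumptions: concatenate an abstract with its linearised triples, embed the result with MiniLM, and cluster with KMeans. Triple extraction itself and the exact hybrid format are assumptions, not the paper's pipeline.

```python
# Sketch: embedding hybrid "abstract + linearised triples" texts and clustering them.
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

abstracts = [
    "We study graph neural networks for molecule property prediction.",
    "A transformer model is proposed for low-resource machine translation.",
]
triples = [
    [("graph neural network", "predicts", "molecule property")],
    [("transformer", "applied to", "machine translation")],
]

def hybrid_text(abstract, trips):
    # Hypothetical linearisation: append "subject predicate object" clauses.
    flat = " ; ".join(f"{s} {p} {o}" for s, p, o in trips)
    return f"{abstract} [TRIPLES] {flat}"

texts = [hybrid_text(a, t) for a, t in zip(abstracts, triples)]
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(texts, normalize_embeddings=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)
```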
zh
[NLP-85] Consistency-Aware Editing for Entity-level Unlearning in Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中敏感、受版权保护或有害信息的实体级遗忘(entity-level unlearning)问题,即在不损害模型整体能力的前提下,彻底删除与特定实体相关的全部知识。现有方法通常依赖全模型微调或基于提示的干预,存在计算成本高或对改写查询鲁棒性差的问题;而新兴的模型编辑(model editing)技术虽效率较高,但主要用于实例级更新,难以实现对整个实体知识的系统性清除。论文提出一种新颖的一致性感知编辑(Consistency-Aware Editing, CAE)框架,其关键在于:通过聚合多样化的提示(包括实体属性、关系及对抗性改写句),并引入一致性正则化项联合学习一个低秩更新方向,使不同提示下的编辑路径保持一致,从而实现高效且稳健的实体级遗忘,同时最小化对无关知识的干扰。实验表明,CAE在RWKU和ToFU两个基准上显著优于传统遗忘与编辑基线,并能仅用数十个精心设计的提示实现可扩展的实体移除。
链接: https://arxiv.org/abs/2601.08840
作者: Xiaoqi Han,Víctor Gutiérrez-Basulto,Ru Li,Xiaoli Li,Jiye Liang,Jeff Z. Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) risk retaining sensitive, copyrighted, or harmful information from their training data. Entity-level unlearning addresses this issue by removing all knowledge of a specific entity while preserving the model’s overall capabilities. Existing approaches typically rely on full-model fine-tuning or prompt-based interventions, which can be computationally expensive or brittle when handling paraphrased queries. Recently, model editing has emerged as an efficient alternative for updating knowledge in LLMs, offering a promising direction for unlearning. However, existing editing techniques are typically designed for instance-level updates, modifying responses to specific attributes of an entity rather than eliminating all knowledge associated with the entity. In this paper, we investigate how editing techniques can be adapted for effective and efficient entity-level unlearning. To this end, we introduce a novel consistency-aware editing (CAE) framework. CAE aggregates a diverse set of prompts related to a target entity, including its attributes, relations, and adversarial paraphrases. It then jointly learns a low-rank update guided by a consistency regularizer that aligns the editing directions across prompts. This promotes robust and comprehensive forgetting while minimizing interference with unrelated knowledge. We further examine where different entities are stored within the model and how many diverse prompts are needed for successful unlearning. We evaluate CAE on two challenging benchmarks, RWKU and ToFU, and demonstrate that it (i) provides insights into how entity-level knowledge is internally represented and deleted in LLMs, (ii) significantly improves forgetting accuracy and robustness over traditional unlearning and editing baselines, and (iii) enables scalable entity removal using only tens of carefully selected prompts.
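A toy sketch of the consistency-aware editing objective on synthetic tensors: one low-rank update is learned jointly over several prompt representations of the same entity, with a regularizer that aligns the per-prompt edit directions. The dimensions, the zero "forget" target, and the cosine-based regularizer are illustrative assumptions rather than the paper's exact formulation.

```python
# Toy sketch: one shared low-rank edit learned over several entity prompts,
# with a consistency regularizer aligning the per-prompt edit directions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, r, n_prompts = 64, 4, 8
W = torch.randn(d, d)                                  # frozen weight of the edited layer
H = torch.randn(n_prompts, d)                          # hidden states of entity-related prompts
A = (0.01 * torch.randn(d, r)).requires_grad_(True)    # low-rank factors of the update
B = (0.01 * torch.randn(r, d)).requires_grad_(True)
forget_target = torch.zeros(n_prompts, d)              # illustrative "forgotten" output

opt = torch.optim.Adam([A, B], lr=1e-2)
for step in range(200):
    delta = A @ B                                      # (d, d) low-rank update
    edited = H @ (W + delta).T                         # layer outputs after the edit
    unlearn_loss = F.mse_loss(edited, forget_target)   # proxy for "forget this entity"

    dirs = H @ delta.T                                 # per-prompt edit directions
    cos = F.cosine_similarity(dirs, dirs.mean(dim=0, keepdim=True), dim=-1)
    consistency_loss = (1.0 - cos).mean()              # align directions across prompts

    loss = unlearn_loss + 0.1 * consistency_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")
```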
zh
[NLP-86] Recursive Knowledge Synthesis for Multi-LLM Systems: Stability Analysis and Tri-Agent Audit Framework
【速读】: 该论文旨在解决多模型大型语言系统(Large Language Models, LLMs)在实际部署中面临的稳定性与可解释性难题,尤其关注如何通过异构模型间的协同推理实现持续的知识合成与验证。其解决方案的关键在于提出了一种三代理交叉验证框架,集成语义生成、分析一致性检查和透明度审计三种功能的异构LLM,并构建递归交互循环,从而诱导出递归知识合成(Recursive Knowledge Synthesis, RKS)机制;该机制通过不可约化于单一模型行为的相互约束变换不断优化中间表示,实验证明其在公开访问环境下能有效提升系统稳定性与透明度,为多LLM架构的安全性与可靠性提供了理论支撑与实证依据。
链接: https://arxiv.org/abs/2601.08839
作者: Toshiyuki Shigemura
机构: 未知
类目: Computation and Language (cs.CL)
备注: 25 pages, 9 figures. Pilot feasibility study using public-access large language models without API-level orchestration
Abstract:This paper presents a tri-agent cross-validation framework for analyzing stability and explainability in multi-model large language systems. The architecture integrates three heterogeneous LLMs, used for semantic generation, analytical consistency checking, and transparency auditing, into a recursive interaction cycle. This design induces Recursive Knowledge Synthesis (RKS), where intermediate representations are continuously refined through mutually constraining transformations irreducible to single-model behavior. Across 47 controlled trials using public-access LLM deployments (October 2025), we evaluated system stability via four metrics: Reflex Reliability Score (RRS), Transparency Score (TS), Deviation Detection Rate (DDR), and Correction Success Rate (CSR). The system achieved mean RRS = 0.78±0.06 and maintained TS ≥ 0.8 in about 68% of trials. Approximately 89% of trials converged, supporting the theoretical prediction that transparency auditing acts as a contraction operator within the composite validation mapping. The contributions are threefold: (1) a structured tri-agent framework for coordinated reasoning across heterogeneous LLMs, (2) a formal RKS model grounded in fixed-point theory, and (3) empirical evaluation of inter-model stability under realistic, non-API public-access conditions. These results provide initial empirical evidence that a safety-preserving, human-supervised multi-LLM architecture can achieve stable recursive knowledge synthesis in realistic, publicly deployed environments.
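A schematic sketch of the tri-agent cycle under stated assumptions: three agent callables (generator, checker, auditor) refine a draft until the auditor's transparency score stops improving. The scoring and the fixed-point style stopping rule are illustrative stand-ins, not the paper's protocol.

```python
# Sketch: recursive refinement loop over three hypothetical agent callables.
def generator(task, feedback):          # semantic generation
    return f"draft for '{task}' taking into account: {feedback}"

def checker(draft):                     # analytical consistency check
    return "no contradictions found" if "taking into account" in draft else "revise"

def auditor(draft, check):              # transparency audit -> score in [0, 1]
    return min(1.0, 0.5 + 0.1 * len(check.split()))

def recursive_synthesis(task, max_rounds=5, tol=1e-3):
    feedback, prev_score = "none", -1.0
    for round_idx in range(max_rounds):
        draft = generator(task, feedback)
        check = checker(draft)
        score = auditor(draft, check)
        if abs(score - prev_score) < tol:   # fixed-point style convergence test
            break
        feedback, prev_score = check, score
    return draft, score, round_idx + 1

draft, ts, rounds = recursive_synthesis("summarise the incident report")
print(f"converged after {rounds} round(s), transparency score {ts:.2f}")
```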
zh
[NLP-87] Companion Agents : A Table-Information Mining Paradigm for Text-to-SQL
【速读】: 该论文旨在解决当前大规模Text-to-SQL基准(如BIRD)在工业场景中适用性受限的问题,即现有方法通常假设数据库注释完整且准确、外部知识 readily available,而现实中常存在注释缺失、不完整或错误的情况,导致SOTA Text-to-SQL系统性能显著下降。解决方案的关键在于提出一种以数据库为中心的新范式——Companion Agents (CA),其核心思想是在查询生成前,通过一组伴随数据库模式的代理(agents)主动挖掘并固化表间隐含关系、值域分布、统计规律及潜在语义线索等细粒度信息,形成可被推理时选择性激活的“缓存知识”。这种数据库端的预处理机制有效提升了在无标注证据条件下Text-to-SQL的准确性,尤其在挑战性子集上提升显著,为工业级部署提供了无需人工标注证据的可行路径。
链接: https://arxiv.org/abs/2601.08838
作者: Jiahui Chen,Lei Fu,Jian Cui,Yu Lei,Zhenning Dong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages
Abstract:Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as readily available external knowledge, which fails to reflect common industrial settings where annotations are missing, incomplete, or erroneous. This mismatch substantially limits the real-world applicability of state-of-the-art (SOTA) Text-to-SQL systems. To bridge this gap, we explore a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence and improve Text-to-SQL accuracy under annotation-scarce conditions. Our key hypothesis is that when a query requires multi-step reasoning over extensive table information, existing methods often struggle to reliably identify and utilize the truly relevant knowledge. We therefore propose to “cache” query-relevant knowledge on the database side in advance, so that it can be selectively activated at inference time. Based on this idea, we introduce Companion Agents (CA), a new Text-to-SQL paradigm that incorporates a group of agents accompanying database schemas to proactively mine and consolidate hidden inter-table relations, value-domain distributions, statistical regularities, and latent semantic cues before query generation. Experiments on BIRD under the fully missing evidence setting show that CA recovers +4.49 / +4.37 / +14.13 execution accuracy points on RSL-SQL / CHESS / DAIL-SQL, respectively, with larger gains on the Challenging subset +9.65 / +7.58 / +16.71. These improvements stem from CA’s automatic database-side mining and evidence construction, suggesting a practical path toward industrial-grade Text-to-SQL deployment without reliance on human-curated evidence.
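A minimal sketch of database-side evidence caching with the standard library: profile each column of a SQLite database (distinct counts, value ranges, sample values) and persist the result as JSON that can later be attached to Text-to-SQL prompts. The database path and the choice of profiling fields are assumptions, not the Companion Agents implementation.

```python
# Sketch: pre-compute per-column statistics of a SQLite DB as cached "evidence".
import json
import sqlite3

def profile_database(db_path: str) -> dict:
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    evidence = {}
    for (table,) in cur.fetchall():
        cur.execute(f'PRAGMA table_info("{table}")')
        columns = [row[1] for row in cur.fetchall()]
        evidence[table] = {}
        for col in columns:
            cur.execute(
                f'SELECT COUNT(DISTINCT "{col}"), MIN("{col}"), MAX("{col}") '
                f'FROM "{table}"'
            )
            distinct, lo, hi = cur.fetchone()
            cur.execute(f'SELECT DISTINCT "{col}" FROM "{table}" LIMIT 5')
            samples = [row[0] for row in cur.fetchall()]
            evidence[table][col] = {
                "distinct": distinct, "min": lo, "max": hi, "samples": samples,
            }
    conn.close()
    return evidence

if __name__ == "__main__":
    cache = profile_database("bird_dev.sqlite")   # hypothetical path
    with open("schema_evidence.json", "w") as f:
        json.dump(cache, f, default=str, indent=2)
```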
zh
[NLP-88] From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在安全机制上对通过文化编码结构重构有害请求的攻击仍显脆弱的问题。其解决方案的关键在于提出了一种名为“Adversarial Tales”的新型越狱技术,该技术将有害内容嵌入赛博朋克叙事中,并借鉴维克多·普罗普(Vladimir Propp)民间故事形态学的方法,引导模型执行功能分析,从而将有害行为重构为合法的叙事解读。实验表明,该方法在26个前沿模型中平均成功率高达71.3%,且无任何模型家族表现出可靠鲁棒性,揭示了基于结构化框架的越狱攻击构成一类广泛存在的漏洞,而非孤立现象。
链接: https://arxiv.org/abs/2601.08837
作者: Piercosma Bisconti,Marcello Galisai,Matteo Prandi,Federico Pierucci,Olga Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Marcantonio Brancale,Daniele Nardi
机构: DEXAI – Icaro Lab (DEXAI – Icaro 实验室); Sapienza University of Rome (罗马大学); Sant’Anna School of Advanced Studies (高级研究学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp’s morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.
zh
[NLP-89] A Review: PTSD in Pre-Existing Medical Condition on Social Media IJACSA
【速读】: 该论文试图解决的问题是:如何更好地理解并应对患有创伤后应激障碍(Post-Traumatic Stress Disorder, PTSD)且同时患有慢性疾病(如癌症、心脏病和自身免疫性疾病)个体的特殊心理需求,尤其是在传统临床环境中难以捕捉其真实表达与干预时机的情况下。解决方案的关键在于利用社交媒体平台上的文本数据,结合自然语言处理(Natural Language Processing, NLP)和机器学习(Machine Learning, ML)技术,识别出具有潜在PTSD症状的慢性病患者群体,并通过分析在线支持社区的互动模式,揭示其应对策略与早期干预的可能性,从而为精准监测和靶向干预提供科学依据。
链接: https://arxiv.org/abs/2601.08836
作者: Zaber Al Hassan Ayon,Nur Hafieza Ismail,Nur Shazwani Kamarudin
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Published in (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 15, No. 11, 2024
Abstract:Post-Traumatic Stress Disorder (PTSD) is a multifaceted mental health condition, particularly challenging for individuals with pre-existing medical conditions. This review critically examines the intersection of PTSD and chronic illnesses as expressed on social media platforms. By systematically analyzing literature from 2008 to 2024, the study explores how PTSD manifests and is managed in individuals with chronic conditions such as cancer, heart disease, and autoimmune disorders, with a focus on online expressions on platforms like X (formerly known as Twitter) and Facebook. Findings demonstrate that social media data offers valuable insights into the unique challenges faced by individuals with both PTSD and chronic illnesses. Specifically, natural language processing (NLP) and machine learning (ML) techniques can identify potential PTSD cases among these populations, achieving accuracy rates between 74% and 90%. Furthermore, the role of online support communities in shaping coping strategies and facilitating early interventions is highlighted. This review underscores the necessity of incorporating considerations of pre-existing medical conditions in PTSD research and treatment, emphasizing social media’s potential as a monitoring and support tool for vulnerable groups. Future research directions and clinical implications are also discussed, with an emphasis on developing targeted interventions.
zh
[NLP-90] DeliberationBench: When Do More Voices Hurt? A Controlled Study of Multi-LLM Deliberation Protocols
【速读】: 该论文旨在解决多智能体系统中基于大语言模型(Large Language Models, LLMs)的协同推理是否真正提升决策质量的问题,尤其关注其相较于简单基线方法(如从多个输出中选择最优结果)的实际价值。研究发现,当前主流的三种协同推理协议在控制实验中表现远逊于“最佳单次输出”基线,其胜率仅为13.8% ± 2.6%,而后者达到82.5% ± 3.3%,性能差距高达6倍且统计显著(p < 0.01),同时计算成本更高。解决方案的关键在于提出DELIBERATIONBENCH这一受控基准测试框架,首次系统性地量化比较了不同协同推理机制与强基线之间的差异,揭示出复杂多智能体设计未必带来质量增益,从而挑战了“复杂性等同于高质量”的既有认知。
链接: https://arxiv.org/abs/2601.08835
作者: Vaarunay Kaushal,Taranveer Singh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures
Abstract:Multi-agent systems where Large Language Models (LLMs) deliberate to form consensus have gained significant attention, yet their practical value over simpler methods remains under-scrutinized. We introduce DELIBERATIONBENCH, a controlled benchmark evaluating three deliberation protocols against a strong baseline of selecting the best response from a pool of model outputs. Across 270 questions and three independent seeds (810 total evaluations), we find a striking negative result: the best-single baseline achieves an 82.5% ± 3.3% win rate, dramatically outperforming the best deliberation protocol (13.8% ± 2.6%). This 6.0x performance gap is statistically significant (p < 0.01) and comes at 1.5-2.5x higher computational cost. Our findings challenge assumptions that complexity enhances quality in multi-LLM systems.
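A minimal sketch of the best-single-response baseline that the benchmark compares against: score each candidate with a judge function and keep the argmax. The judge stub below is a placeholder for an LLM judge, not the paper's evaluator.

```python
# Sketch: best-of-n selection from a pool of model responses (hypothetical judge).
def judge_score(question: str, answer: str) -> float:
    # Stand-in for an LLM judge; here: longer, on-topic answers score higher.
    on_topic = question.split()[0].lower() in answer.lower()
    return len(answer.split()) + (5.0 if on_topic else 0.0)

def best_single(question: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda ans: judge_score(question, ans))

question = "Why does batch normalization stabilise training?"
pool = [
    "It normalises activations, reducing internal covariate shift.",
    "Why it helps: it keeps activation statistics stable across mini-batches, smoothing the loss surface.",
    "No idea.",
]
print(best_single(question, pool))
```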
zh
[NLP-91] Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
【速读】: 该论文旨在解决当前先进光学字符识别(OCR)模型在处理格式敏感文档(如公式、表格等)时输出不确定性显著升高(熵值高出一个数量级)的问题,这表明现有模型在复杂格式文本上的推理能力不足。解决方案的关键在于提出一种格式解耦强化学习(Format Decoupled Reinforcement Learning, FD-RL)框架:首先利用熵基数据过滤策略识别出高格式强度实例,再设计针对不同格式类型的解耦奖励机制,实现格式级别的验证而非词元级别的记忆,从而提升模型对多样阅读路径的推理能力。该方法在OmniDocBench基准上取得90.41的平均得分,刷新了端到端OCR模型的记录。
链接: https://arxiv.org/abs/2601.08834
作者: Yufeng Zhong,Lei Chen,Zhixiong Zeng,Xuanle Zhao,Deyang Jiang,Liming Zheng,Jing Huang,Haibo Qiu,Peng Shi,Siqi Yang,Lin Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: technical report
Abstract:Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on constructing enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formula, table, etc.) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format-sensitive documents, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs an entropy-based data filtration strategy to identify format-intensive instances, and adopts format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.
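A minimal sketch of entropy-based data filtration: compute the mean per-token predictive entropy from a model's logits and keep samples whose entropy exceeds a threshold. The random logits and the cutoff value are toy assumptions; the real pipeline would score OCR training instances with the actual model.

```python
# Sketch: keep "format-intensive" samples with high mean token entropy.
import torch

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) for one sample
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)   # per-token entropy, (seq_len,)
    return entropy.mean()

torch.manual_seed(0)
samples = [torch.randn(32, 1000) * s for s in (0.5, 1.0, 3.0)]  # toy logits
threshold = 6.5                                                 # assumed cutoff (nats)

kept = [i for i, logits in enumerate(samples)
        if mean_token_entropy(logits).item() > threshold]
print("retained sample indices:", kept)
```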
zh
[NLP-92] Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach
【速读】: 该论文旨在解决知识增强型文本生成中因模型缺乏可解释性而导致的实际应用受限问题,尤其在需要高可靠性和可解释性的任务中表现不佳。现有方法依赖于特定领域的知识检索器,难以泛化到不同数据类型和任务。其解决方案的关键在于设计一种任务无关的结构化知识猎手(task-agnostic structured knowledge hunter),该方法利用结构化知识的两层架构(高层实体与低层知识三元组)进行建模,通过局部-全局交互机制实现结构化知识表示学习,并采用分层Transformer指针网络作为骨干结构,精准选择相关知识三元组与实体,从而显著提升生成结果的可解释性与忠实度,同时保持语言模型强大的生成能力。
链接: https://arxiv.org/abs/2511.23335
作者: Shuqi Liu,Han Wu,Guanzhi Deng,Jianshu Chen,Xiaoyang Wang,Linqi Song
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset. Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.
zh
计算机视觉
[CV-0] Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning FAST
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)任务中因显式链式思维(Chain-of-Thought, CoT)导致的推理轨迹过长、推理延迟高的问题,同时保持良好的泛化能力与长期规划性能。解决方案的关键在于提出一种名为 Fast-ThinkAct 的高效推理框架,其核心是通过知识蒸馏从教师模型中学习可解释的潜在链式思维(verbalizable latent reasoning),并采用偏好引导的目标函数优化策略,使代理在保持紧凑推理表示的同时,能有效对齐操作轨迹,从而实现语言与视觉规划能力的迁移,最终将高效推理与动作执行无缝衔接,显著降低推理延迟(最高达 89.3%)且不牺牲少样本适应和故障恢复等关键能力。
链接: https://arxiv.org/abs/2601.09708
作者: Chi-Pin Huang,Yunze Man,Zhiding Yu,Min-Hung Chen,Jan Kautz,Yu-Chiang Frank Wang,Fu-En Yang
机构: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
zh
[CV-1] SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
【速读】:该论文旨在解决Segment Anything 3 (SAM3) 在复杂多目标场景下群体级集体记忆选择机制的不足问题,即其原始实现中采用基于平均性能的同步决策策略,容易忽略个体目标的可靠性,从而影响跟踪稳定性和身份保持能力。解决方案的关键在于提出一种无需训练的解耦策略(SAM3-DMS),通过细粒度的个体对象级记忆选择机制,独立评估每个目标的可靠性并优化其记忆更新,显著提升了密集目标场景下的跟踪鲁棒性与身份一致性。
链接: https://arxiv.org/abs/2601.09699
作者: Ruiqi Shen,Chang Liu,Henghui Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group-level collective memory selection is suboptimal for complex multi-object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
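A minimal numpy sketch contrasting the two selection schemes described above: given a (frames × objects) reliability matrix, a group-level strategy keeps the frames with the best average score for all objects, while a decoupled strategy picks memory frames independently per object. The scores and the top-k rule are illustrative assumptions, not SAM3 internals.

```python
# Sketch: group-averaged vs. per-object (decoupled) memory-frame selection.
import numpy as np

rng = np.random.default_rng(0)
num_frames, num_objects, k = 10, 4, 3
scores = rng.uniform(size=(num_frames, num_objects))   # per-frame, per-object reliability

# Group-level: one shared decision based on the average over all objects.
group_frames = np.argsort(scores.mean(axis=1))[-k:]

# Decoupled: each object keeps its own top-k memory frames.
decoupled_frames = {obj: np.argsort(scores[:, obj])[-k:] for obj in range(num_objects)}

print("shared memory frames:", sorted(group_frames.tolist()))
for obj, frames in decoupled_frames.items():
    print(f"object {obj}: frames {sorted(frames.tolist())}")
```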
zh
[CV-2] COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation
【速读】:该论文旨在解决从稀疏多视角图像中进行3D姿态估计时,现有优化方法因依赖成对关联建模对应关系而导致的全局一致性难以保障的问题。传统方法将多视图姿态对应匹配视为成对关联问题,仅以软约束形式处理视图间的循环一致性(cycle consistency),在存在误匹配时误差易传播,导致结果不稳定。其解决方案的关键在于提出COMPOSE框架,将多视图姿态对应匹配建模为超图划分(hypergraph partitioning)问题,从而显式地建模多视图间的全局一致性;同时引入高效的几何剪枝策略降低整数线性规划的搜索空间复杂度,显著提升精度——相比以往基于优化的方法平均精度提升最高达23%,相比自监督端到端学习方法提升最高达11%。
链接: https://arxiv.org/abs/2601.09698
作者: Tony Danjun Wang,Tolga Birdal,Nassir Navab,Lennart Bastian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
zh
[CV-3] Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
【速读】:该论文旨在解决基于扩散模型的视频生成方法在计算效率上的瓶颈问题,即当前方法在生成几秒视频时需耗费数分钟GPU时间,难以满足实时交互应用(如具身智能和VR/AR)的需求。解决方案的关键在于提出一种相机条件下的静态场景视频生成新策略:首先利用扩散模型生成稀疏的关键帧(keyframes),再通过3D重建与渲染合成完整视频序列;该方法将生成成本分摊至数百帧,并保证几何一致性。进一步引入一个预测模型以动态确定最优关键帧数量,使系统能根据相机轨迹复杂度自适应分配计算资源,最终实现SRENDER方法——对简单轨迹使用极稀疏关键帧、复杂运动则采用更密集采样,从而在保持高视觉保真度和时序稳定性的同时,使20秒视频生成速度提升超过40倍。
链接: https://arxiv.org/abs/2601.09697
作者: Jieying Chen,Jeffrey Hu,Joan Lasenby,Ayush Tewari
机构: University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
zh
[CV-4] STEP3-VL-10B Technical Report
【速读】:该论文旨在解决当前多模态大模型中效率与性能难以兼顾的问题,即如何在保持模型轻量化的同时实现前沿级别的多模态智能。其解决方案的关键在于两个战略性的技术突破:一是采用统一且全参数开放的预训练策略,在1.2万亿多模态token上训练,融合语言对齐的感知编码器(Perception Encoder)与Qwen3-8B解码器,构建内在的视觉-语言协同机制;二是设计了一个大规模后训练流程,包含超过1000轮强化学习,并引入并行协调推理(Parallel Coordinated Reasoning, PaCoRe)以扩展推理时计算资源,动态分配至可扩展的感知推理过程,从而探索和整合多样化的视觉假设。这一方法使STEP3-VL-10B虽仅具10B参数规模,却在多项基准测试中超越数十倍体量的模型,展现出卓越的多模态理解和复杂推理能力。
链接: https://arxiv.org/abs/2601.09668
作者: Ailin Huang,Chengyuan Yao,Chunrui Han,Fanqi Wan,Hangyu Guo,Haoran Lv,Hongyu Zhou,Jia Wang,Jian Zhou,Jianjian Sun,Jingcheng Hu,Kangheng Lin,Liang Zhao,Mitt Huang,Song Yuan,Wenwen Qu,Xiangfeng Wang,Yanlin Lai,Yingxiu Zhao,Yinmin Zhang,Yukang Shi,Yuyang Chen,Zejia Weng,Ziyang Meng,Ang Li,Aobo Kong,Bo Dong,Changyi Wan,David Wang,Di Qi,Dingming Li,En Yu,Guopeng Li,Haiquan Yin,Han Zhou,Hanshan Zhang,Haolong Yan,Hebin Zhou,Hongbo Peng,Jiaran Zhang,Jiashu Lv,Jiayi Fu,Jie Cheng,Jie Zhou,Jisheng Yin,Jingjing Xie,Jingwei Wu,Jun Zhang,Junfeng Liu,Kaijun Tan,Kaiwen Yan,Liangyu Chen,Lina Chen,Mingliang Li,Qian Zhao,Quan Sun,Shaoliang Pang,Shengjie Fan,Shijie Shang,Siyuan Zhang,Tianhao You,Wei Ji,Wuxun Xie,Xiaobo Yang,Xiaojie Hou,Xiaoran Jiao,Xiaoxiao Ren,Xiangwen Kong,Xin Huang,Xin Wu,Xing Chen,Xinran Wang,Xuelin Zhang,Yana Wei,Yang Li,Yanming Xu,Yeqing Shen,Yuang Peng,Yue Peng,Yu Zhou,Yusheng Li,Yuxiang Yang,Yuyang Zhang,Zhe Xie,Zhewei Huang,Zhenyi Lu,Zhimin Fan,Zihui Cheng,Daxin Jiang,Qi Han,Xiangyu Zhang,Yibo Zhu,Zheng Ge
机构: StepFun(步履科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 50 pages
Abstract:We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10x-20x larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
zh
[CV-5] SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
【速读】:该论文旨在解决单目视觉SLAM(Simultaneous Localization and Mapping,即时定位与地图构建)中长期序列下的尺度漂移(scale drift)问题,即系统在长时间运行时估计的尺度逐渐偏离真实值,导致重建精度下降。现有基于帧间局部优化的方法虽能实现实时性能,但因缺乏全局尺度约束而难以维持尺度一致性。解决方案的关键在于提出SCE-SLAM框架,其核心创新是引入场景坐标嵌入(scene coordinate embeddings),这是一种学习得到的patch级表示,编码了在标准尺度参考下的3D几何关系;该框架包含两个关键模块:几何引导聚合模块通过几何调制注意力机制利用3D空间邻近性,将历史观测中的尺度信息传播至当前帧;场景坐标束调整模块则通过从场景坐标嵌入中解码出显式的3D坐标约束,将当前估计锚定到参考尺度,从而实现跨大场景的尺度一致性。
链接: https://arxiv.org/abs/2601.09665
作者: Yuchen Wu,Jiahe Li,Xiaohan Yu,Lina Yu,Jin Zheng,Xiao Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
zh
[CV-6] Self-Supervised Animal Identification for Long Videos
【速读】:该论文旨在解决长时视频中个体动物识别的难题,传统方法依赖大量人工标注,而现有自监督方法因计算复杂度高和内存限制难以处理长时间序列。其解决方案的关键在于将动物识别重构为一个全局聚类任务而非顺序跟踪问题,通过固定视频内个体数量的假设,仅需边界框检测和总数信息即可实现高效学习;具体而言,利用冻结的预训练主干网络、成对帧采样以及基于匈牙利算法的批内伪标签分配机制,结合从视觉-语言模型迁移的二元交叉熵损失函数,在极低显存消耗(<1 GB/批)下实现了超过97%的准确率,显著降低了标注成本并提升了实用性。
链接: https://arxiv.org/abs/2601.09663
作者: Xuyang Fang,Sion Hannuna,Edwin Simpson,Neill Campbell
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 1 figure
Abstract:Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video – a common scenario in practice – and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch – an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at this https URL.
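A minimal sketch of the in-batch pseudo-label assignment step: embed the detections from two sampled frames (random features stand in for backbone outputs), build a cosine-similarity matrix, and solve it with the Hungarian algorithm so each individual in frame A is matched to exactly one in frame B. Feature extraction and the downstream BCE loss are omitted.

```python
# Sketch: Hungarian matching of individuals between two sampled frames.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_animals, dim = 8, 128
feats_a = rng.normal(size=(n_animals, dim))             # crops from frame A
feats_b = feats_a[rng.permutation(n_animals)] + 0.1 * rng.normal(size=(n_animals, dim))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sim = l2_normalize(feats_a) @ l2_normalize(feats_b).T   # cosine similarity matrix
row_idx, col_idx = linear_sum_assignment(-sim)          # maximise total similarity

for a, b in zip(row_idx, col_idx):
    print(f"frame-A detection {a} <-> frame-B detection {b} (sim {sim[a, b]:.2f})")
```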
zh
[CV-7] LiteEmbed: Adapting CLIP to Rare Classes
【速读】:该论文旨在解决大规模视觉-语言模型(如CLIP)在预训练阶段罕见类别(包括新出现的实体和文化特异性类别)上表现不佳的问题。其核心挑战在于如何在不重新训练模型编码器的前提下,实现对新类别的快速、轻量级个性化适配。解决方案的关键在于提出LiteEmbed框架,通过子空间引导的文本嵌入优化机制,在CLIP词汇表内对文本特征进行微调:利用主成分分析(PCA)分解将语义粗粒度方向与细粒度差异解耦,并结合粗粒度对齐与细粒度分离两个互补目标,既保持全局语义一致性,又增强视觉相似类间的判别能力。优化后的嵌入可即插即用,无缝替代原始文本特征,适用于分类、检索、分割和检测等多种下游任务。
链接: https://arxiv.org/abs/2601.09661
作者: Aishwarya Agarwal,Srikrishna Karanam,Vineet Gandhi
机构: CVIT, Kohli Centre for Intelligent Systems, IIIT Hyderabad, India(印度国际信息技术学院); Adobe Research, Bengaluru, India(印度班加罗尔Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures
Abstract:Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP’s vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP’s original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
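A toy sketch of the subspace decomposition on random stand-ins for class text embeddings: PCA splits the embeddings into a coarse subspace (top components) and a fine residual, and the two loss terms are evaluated for a candidate rare-class embedding. The dimensions, the component split, and the margin are assumptions, not LiteEmbed's exact objectives.

```python
# Toy sketch: PCA split of class text embeddings into coarse and fine subspaces.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_classes, dim, n_coarse = 50, 512, 16
class_embs = rng.normal(size=(n_classes, dim))
class_embs /= np.linalg.norm(class_embs, axis=1, keepdims=True)

pca = PCA(n_components=n_coarse).fit(class_embs)
coarse_basis = pca.components_                           # (n_coarse, dim)

def split(v):
    centered = v - pca.mean_
    coarse = (centered @ coarse_basis.T) @ coarse_basis  # projection onto coarse subspace
    return coarse, centered - coarse                     # (coarse part, fine residual)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

new_emb = class_embs[0] + 0.05 * rng.normal(size=dim)    # candidate rare-class embedding
anchor_c, _ = split(class_embs[0])                       # semantically closest base class
_, confusable_f = split(class_embs[1])                   # visually similar distractor
cand_c, cand_f = split(new_emb)

coarse_alignment = 1.0 - cos(cand_c, anchor_c)                       # keep global semantics
fine_separation = max(0.0, 0.2 - (1.0 - cos(cand_f, confusable_f)))  # push apart in fine space
print(f"coarse-alignment loss {coarse_alignment:.3f}, fine-separation loss {fine_separation:.3f}")
```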
zh
[CV-8] Image2Garment: Simulation-ready Garment Generation from a Single Image
【速读】:该论文旨在解决从单张图像中估计物理上准确、可直接用于仿真的服装参数这一挑战性问题,其核心难点在于缺乏图像到物理属性的标注数据集,以及该问题本身的病态特性(ill-posed nature)。传统方法要么依赖多视角捕捉和昂贵的可微分仿真,要么仅预测服装几何形状而忽略仿真所需的材料属性。本文的关键解决方案是提出一个前馈式框架:首先利用视觉-语言模型(vision-language model)在真实图像上进行微调,以推断材料组成和织物属性;随后训练一个轻量级预测器,将这些属性映射为对应的物理参数,仅需少量材料物理测量数据即可完成训练。该方法引入了两个新数据集(FTAG 和 T2P),无需迭代优化即可生成可用于高保真仿真的服装模型,显著提升了材料组成估计和织物属性预测的准确性,并在仿真质量上优于当前最先进的图像到服装方法。
链接: https://arxiv.org/abs/2601.09658
作者: Selim Emir Can,Jan Ackermann,Kiyohiro Nakayama,Ruofan Liu,Tong Wu,Yang Zheng,Hugo Bertiche,Menglei Chai,Thabo Beeler,Gordon Wetzstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
zh
[CV-9] AquaFeat: an Underwater Vision Learning-based Enhancement Method for Object Detection Classification and Tracking
【速读】:该论文旨在解决水下视频分析中因低光照、色彩失真和浑浊度等因素导致视觉数据质量下降,从而影响机器人感知模块性能的问题。解决方案的关键在于提出一种可即插即用的特征增强管道AquaFeat+,其核心包括颜色校正模块、分层特征增强模块以及自适应残差输出模块,这些模块通过最终应用任务的损失函数端到端训练,专注于提升自动化视觉任务(如目标检测、分类与跟踪)的性能,而非单纯改善人眼感知质量。
链接: https://arxiv.org/abs/2601.09652
作者: Emanuel da Costa Silva,Tatiana Taís Schein,José David García Ramos,Eduardo Lawson da Silva,Stephanie Loi Brião,Felipe Gomes de Oliveira,Paulo Lilles Jorge Drews-Jr
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.
zh
[CV-10] Identifying Models Behind Text-to-Image Leaderboards
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成模型质量评估中广泛采用的基于投票的排行榜所存在的隐私安全问题,即匿名化机制易被攻破,导致模型身份可被识别。其解决方案的关键在于发现并利用每个T2I模型在图像嵌入空间中生成的图像具有独特的聚类特征,从而无需控制提示词或访问训练数据,仅通过中心点(centroid-based)方法即可实现高精度去匿名化。此外,研究引入了提示级别可区分性指标,揭示特定提示词可引发近乎完美的模型识别效果,表明现有排行榜存在系统性安全漏洞,亟需强化匿名化防御机制。
链接: https://arxiv.org/abs/2601.09647
作者: Ali Naseh,Yuefeng Peng,Anshuman Suri,Harsh Chaudhari,Alina Oprea,Amir Houmansadr
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
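A minimal sketch of the centroid-based attribution: compute one centroid per model from embeddings of its known generations and attribute a new image to the nearest centroid by cosine similarity. Random vectors stand in for real image embeddings (e.g., from CLIP); the numbers are not the paper's data.

```python
# Sketch: nearest-centroid attribution of generated images to T2I models.
import numpy as np

rng = np.random.default_rng(0)
n_models, per_model, dim = 5, 200, 768
model_means = rng.normal(size=(n_models, dim))                      # toy per-model "signatures"
gallery = {m: model_means[m] + 0.5 * rng.normal(size=(per_model, dim))
           for m in range(n_models)}

centroids = np.stack([feats.mean(axis=0) for feats in gallery.values()])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def attribute(image_embedding: np.ndarray) -> int:
    v = image_embedding / np.linalg.norm(image_embedding)
    return int(np.argmax(centroids @ v))            # most similar model centroid

query = model_means[3] + 0.5 * rng.normal(size=dim)  # unseen image from model 3
print("predicted source model:", attribute(query))
```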
zh
[CV-11] PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
【速读】:该论文旨在解决GUI代理在真实场景中难以对用户复杂隐式意图(implicit intent)进行准确对齐的问题,尤其是在面对模糊指令时缺乏利用长期用户行为记录来补全偏好和预判潜在操作流程的能力。其解决方案的关键在于提出了一种名为PersonalAlign的新任务范式与相应的Hierarchical Intent Memory Agent(HIM-Agent),该模型通过构建持续更新的个人记忆库,并以层次化方式组织用户偏好和例行操作(routine),从而实现基于长期上下文的个性化交互与主动辅助能力。
链接: https://arxiv.org/abs/2601.09636
作者: Yibo Lyu,Gongwei Chen,Rui Shao,Weili Guan,Liqiang Nie
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users’ more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents’ ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS, further results show that HIM-Agent significantly improves both execution and proactive performance by 15.7% and 7.3%.
zh
[CV-12] CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems
【速读】:该论文旨在解决铁路运输系统中潜在入侵目标的早期精准感知问题,现有系统多依赖固定视野内的对象分类和规则启发式判断,难以识别具有潜在威胁但未明确入侵的目标。其核心挑战在于如何通过空间上下文与时间动态的认知推理实现对目标对象(Object of Interest, OOI)的深度理解。解决方案的关键在于构建了一个名为CogRail的新基准数据集,融合了开源数据与认知驱动的问答标注,以支持时空推理与预测;在此基础上,提出了一种联合微调框架,整合位置感知、运动预测与威胁分析三项核心任务,使通用视觉语言模型(Visual-Language Models, VLMs)能够有效适配至认知入侵感知这一特定领域,显著提升了模型在复杂时空推理任务中的准确性与可解释性。
链接: https://arxiv.org/abs/2601.09613
作者: Yonglin Tian,Qiyao Zhang,Wei Xu,Yutong Wang,Yihao Wu,Xinyi Li,Xingyuan Dai,Hui Zhang,Zhiyong Cui,Baoqing Guo,Zujun Yu,Yisheng Lv
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Beijing Institute of Technology (北京理工大学); China Academy of Railway Sciences (中国铁道科学研究院); Beijing Huairou Academy of Parallel Sensing (北京怀柔平行传感研究院); Beijing Jiaotong University (北京交通大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at this https URL.
zh
[CV-13] GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis
【速读】:该论文旨在解决多模态情感分析(Multimodal Sentiment Analysis)中基于点回归(point-wise regression)方法存在的两个核心问题:一是对标签噪声敏感,二是忽略了样本间的相对顺序关系,导致预测不稳定且与真实情感强度的相关性较差。为应对这一挑战,作者提出两阶段分组排序与校准框架(Group-wise Ranking and Calibration Framework, GRCF),其关键在于:第一阶段引入受Group Relative Policy Optimization(GRPO)启发的优势加权动态边界排序损失(Advantage-Weighted Dynamic Margin Ranking Loss),自适应聚焦于难以排序的样本并构建细粒度的序数结构;第二阶段采用基于平均绝对误差(MAE)的目标函数实现预测值的绝对量级校准,从而同时保证相对顺序一致性与绝对分数准确性。此方案有效缓解了传统成对排序方法中静态边界和均匀权重带来的局限性,显著提升模型鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2601.09606
作者: Manning Gao,Leheng Zhang,Shiqin Han,Haifeng Hu,Yuncheng Jiang,Sijie Mai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
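A toy PyTorch sketch of a group-wise ranking loss with label-dependent (dynamic) margins and advantage-style weights that emphasise hard-to-rank pairs. The exact weighting and margin schedule used in GRCF are not specified here, so these are illustrative choices rather than the paper's loss.

```python
# Toy sketch: pairwise ranking loss with dynamic margins and advantage-style weights.
import torch

def grouped_ranking_loss(preds, labels, alpha=0.5):
    # preds, labels: (batch,) predicted and gold sentiment scores.
    diff_pred = preds.unsqueeze(1) - preds.unsqueeze(0)     # [i, j] = s_i - s_j
    diff_label = labels.unsqueeze(1) - labels.unsqueeze(0)  # [i, j] = y_i - y_j
    pos_pairs = diff_label > 0                              # i should outrank j
    margin = alpha * diff_label                             # dynamic margin grows with label gap
    hinge = torch.relu(margin - diff_pred)                  # cost of a violated ordering
    with torch.no_grad():                                   # advantage-style weight (detached)
        viol = hinge[pos_pairs]
        weight = torch.clamp((viol - viol.mean()) / (viol.std() + 1e-6), min=0.0) + 1.0
    return (weight * hinge[pos_pairs]).mean()

torch.manual_seed(0)
labels = torch.tensor([-2.0, -0.5, 0.0, 1.0, 2.5])
preds = torch.randn(5, requires_grad=True)
loss = grouped_ranking_loss(preds, labels)
loss.backward()
print(f"ranking loss: {loss.item():.3f}")
```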
zh
[CV-14] Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets
【速读】:该论文旨在解决机器人操作中视觉策略在分布偏移(如相机视角变化)下的鲁棒性问题,尤其是在真实世界数据稀缺且缺乏视角多样性的情况下,如何利用仿真数据有效提升策略的泛化能力。其解决方案的关键在于提出了一种无配对图像翻译方法MANGO,核心创新包括:基于分割条件的InfoNCE损失、高度正则化的判别器设计以及改进的PatchNCE损失,这些组件共同保障了从仿真到现实(sim2real)图像翻译过程中的视角一致性。实验表明,仅需少量固定视角的真实数据即可训练MANGO,从而生成多样化的未见视角图像,并显著提升模仿学习策略在新视角下的成功率(最高达60%)。
链接: https://arxiv.org/abs/2601.09605
作者: Jeremiah Coholich,Justin Wit,Robert Azarcon,Zsolt Kira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO – an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO are able to achieve success rates as high as 60% on views that the non-augmented policy fails completely on.
zh
[CV-15] Iterative Differential Entropy Minimization (IDEM) method for fine rigid pairwise 3D Point Cloud Registration: A Focus on the Metric
【速读】:该论文旨在解决3D点云配准(point cloud registration)中因密度差异、噪声、孔洞和部分重叠等问题导致传统方法(如基于欧氏距离的RMSE、ICP等)性能下降甚至失效的问题。其关键解决方案是提出一种基于微分熵(differential entropy)的新度量指标,并将其作为优化框架中的目标函数,构建了迭代微分熵最小化算法(Iterative Differential Entropy Minimization, IDEM)。该方法不依赖于固定某一个点云,具有良好的对称性与鲁棒性,在多种复杂场景下均能稳定收敛至最优配准结果,显著优于RMSE、Chamfer距离和Hausdorff距离等传统度量方式。
链接: https://arxiv.org/abs/2601.09601
作者: Emmanuele Barberi,Felice Sfravara,Filippo Cucinotta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well-prealigned and issues such as differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since Euclidean distances lack commutativity. When only one point cloud has issues, adjustments can be made, but in real scenarios, both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted as Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, reveals a clear minimum corresponding to the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.
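A minimal sketch of one way such a differential-entropy metric could be instantiated: fit a Gaussian to the merged cloud and use its closed-form differential entropy, H = 1/2 * ln((2*pi*e)^3 * det(Sigma)), as the alignment objective over a rigid transform. The Gaussian assumption and the yaw-plus-translation parameterisation are simplifications, not the paper's exact formulation.

```python
# Sketch: Gaussian differential entropy of the merged cloud as an alignment objective.
import numpy as np
from scipy.optimize import minimize

def differential_entropy(points: np.ndarray) -> float:
    cov = np.cov(points.T) + 1e-9 * np.eye(3)
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (3 * np.log(2 * np.pi * np.e) + logdet)

def apply_rigid(points, params):                      # yaw + translation only (toy)
    yaw, tx, ty, tz = params
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T + np.array([tx, ty, tz])

rng = np.random.default_rng(0)
cloud_a = rng.normal(size=(500, 3)) * [2.0, 1.0, 0.2]             # anisotropic reference cloud
cloud_b = apply_rigid(cloud_a + 0.01 * rng.normal(size=(500, 3)),
                      [0.3, 0.5, -0.2, 0.1])                      # misaligned noisy copy

objective = lambda p: differential_entropy(np.vstack([cloud_a, apply_rigid(cloud_b, p)]))
res = minimize(objective, x0=np.zeros(4), method="Nelder-Mead")
print("recovered yaw / translation:", np.round(res.x, 2))
```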
zh
[CV-16] Multimodal Signal Processing For Thermo-Visible-Lidar Fusion In Real-time 3D Semantic Mapping
【速读】:该论文旨在解决复杂环境中自主机器人导航与环境感知对SLAM(Simultaneous Localization and Mapping,同时定位与建图)技术提出的更高要求问题。其解决方案的关键在于通过像素级融合可见光与红外图像,将实时LiDAR点云投影至融合图像流中,并在热通道中分割热源特征以即时识别高温目标,进而将温度信息作为语义层叠加到最终的3D点云地图上,从而生成兼具几何精度与环境语义理解能力的地图,特别适用于快速灾害评估和工业预防性维护等场景。
链接: https://arxiv.org/abs/2601.09578
作者: Jiajun Sun,Yangyi Ou,Haoyuan Zheng,Chao yang,Yue Ma
机构: Shenzhen University (深圳大学); Xi’an-Jiaotong Liverpool University (西安交通大学利物浦大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages,7 figures. Under review
Abstract:In complex environments, autonomous robot navigation and environmental perception pose higher requirements for SLAM technology. This paper presents a novel method for semantically enhancing 3D point cloud maps with thermal information. By first performing pixel-level fusion of visible and infrared images, the system projects real-time LiDAR point clouds onto this fused image stream. It then segments heat source features in the thermal channel to instantly identify high temperature targets and applies this temperature information as a semantic layer on the final 3D map. This approach generates maps that not only have accurate geometry but also possess a critical semantic understanding of the environment, making it highly valuable for specific applications like rapid disaster assessment and industrial preventive maintenance.
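A minimal sketch of the projection-and-tagging step: points already expressed in the camera frame are projected with a pinhole intrinsic matrix, and in-bounds points sample the thermal channel of the fused image so that each 3D point carries a temperature value. The intrinsics, image size, and temperature threshold are assumptions, not the paper's calibration.

```python
# Sketch: project LiDAR points into the fused image and attach thermal values.
import numpy as np

rng = np.random.default_rng(0)
H, W = 480, 640
K = np.array([[525.0, 0.0, 320.0],                       # assumed pinhole intrinsics
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
thermal = rng.uniform(20.0, 80.0, size=(H, W))           # fused image, thermal channel (deg C)
points_cam = rng.uniform([-2, -1, 1], [2, 1, 8], size=(1000, 3))  # points in camera frame

def tag_points_with_temperature(points, thermal_img, K):
    uvw = points @ K.T                                   # homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    in_img = (points[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    temps = np.full(len(points), np.nan)
    temps[in_img] = thermal_img[v[in_img], u[in_img]]
    return np.hstack([points, temps[:, None]])           # columns: x, y, z, temperature

tagged = tag_points_with_temperature(points_cam, thermal, K)
hot = tagged[tagged[:, 3] > 60.0]                        # simple heat-source threshold
print(f"{len(hot)} of {len(tagged)} points flagged as high-temperature")
```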
zh
[CV-17] OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
【速读】:该论文旨在解决开放词汇三维场景理解任务中稀疏体素(sparse voxels)的分组与描述问题,尤其针对复杂指代表达分割(referring expression segmentation, RES)场景下传统方法依赖训练且难以泛化的问题。其解决方案的关键在于提出一种无需训练的OpenVoxel算法,直接利用多模态大语言模型(MLLMs)进行文本到文本搜索,从而实现对稀疏体素分组后的语义caption生成,避免了使用CLIP或BERT等文本编码器引入的嵌入空间,显著提升了在开放词汇场景下的灵活性和性能表现。
链接: https://arxiv.org/abs/2601.09575
作者: Sheng-Yu Huang,Jaesung Choe,Yu-Chiang Frank Wang,Cheng Sun
机构: NVIDIA; National Taiwan University (台湾国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open-sourced.
zh
[CV-18] Trustworthy Longitudinal Brain MRI Completion: A Deformation-Based Approach with KAN-Enhanced Diffusion Model
【速读】:该论文旨在解决纵向脑部磁共振成像(longitudinal brain MRI)在生命周期研究中因高失访率导致的数据缺失问题,以及现有深度生成模型仅依赖图像强度信息所引发的两大局限:一是生成图像保真度不足,影响下游分析可信度;二是模型结构固定导致引导机制僵化,限制应用场景灵活性。其解决方案的关键在于提出DF-DiffCom——一种基于Kolmogorov-Arnold Networks(KAN)增强的扩散模型,通过智能利用形变场(deformation fields)实现高可信度的纵向脑图像补全,不仅在OASIS-3数据集上显著提升峰值信噪比(PSNR)5.6%和结构相似性(SSIM)0.12,还具备模态无关特性,可无缝扩展至多种MRI模态及脑组织分割图等属性图。
链接: https://arxiv.org/abs/2601.09572
作者: Tianli Tao,Ziyang Wang,Delong Yang,Han Zhang,Le Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Longitudinal brain MRI is essential for lifespan study, yet high attrition rates often lead to missing data, complicating analysis. Deep generative models have been explored, but most rely solely on image intensity, leading to two key limitations: 1) the fidelity or trustworthiness of the generated brain images are limited, making downstream studies questionable; 2) the usage flexibility is restricted due to fixed guidance rooted in the model structure, restricting full ability to versatile application scenarios. To address these challenges, we introduce DF-DiffCom, a Kolmogorov-Arnold Networks (KAN)-enhanced diffusion model that smartly leverages deformation fields for trustworthy longitudinal brain image completion. Trained on OASIS-3, DF-DiffCom outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. More importantly, its modality-agnostic nature allows smooth extension to varied MRI modalities, even to attribute maps such as brain tissue segmentation results.
zh
[CV-19] Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling ACL2026
【速读】:该论文旨在解决传统中文语言模型中字符表示仅依赖离散索引(token ID)而忽视汉字视觉结构的问题,从而可能损失语义与语音信息。其解决方案的关键在于引入低分辨率的灰度图像作为字符输入,具体采用8×8像素的灰度图代替传统的token ID,使解码器直接从视觉结构中学习语言特征。实验表明,这种基于视觉结构的方法在准确率上接近传统索引基线(39.2% vs. 39.1%),且在训练初期表现出显著的“热启动效应”(hot-start effect),即在仅0.4%训练数据时便达到12%准确率,远超索引基模型(<6%)。这说明最小程度的视觉结构可提供稳健且高效的中文建模信号,为字符表示提供了与传统索引方法互补的新范式。
链接: https://arxiv.org/abs/2601.09566
作者: Shuyang Xiang,Hao Guan
机构: Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures, submitted to ACL 2026
Abstract:Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information, which may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, with resolutions as low as 8×8 pixels. Remarkably, these inputs achieve 39.2% accuracy, comparable to the index-based baseline of 39.1%. Such low-resource settings also exhibit a pronounced hot-start effect: by 0.4% of total training, accuracy reaches above 12%, while index-based models lag at below 6%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.
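A minimal PyTorch sketch of the input pathway: each character arrives as an 8×8 grayscale glyph, and a linear layer maps the 64 pixels to the decoder's embedding dimension, replacing an index-based nn.Embedding. Glyph rendering (e.g., rasterising characters with a CJK font) is assumed to happen upstream and is not shown; the dimensions are illustrative.

```python
# Sketch: replacing index-based token embeddings with 8x8 glyph pixels.
import torch
import torch.nn as nn

class PixelTokenEmbedding(nn.Module):
    def __init__(self, d_model: int = 256, glyph_size: int = 8):
        super().__init__()
        self.proj = nn.Linear(glyph_size * glyph_size, d_model)

    def forward(self, glyphs: torch.Tensor) -> torch.Tensor:
        # glyphs: (batch, seq_len, 8, 8) grayscale values in [0, 1]
        b, t, h, w = glyphs.shape
        return self.proj(glyphs.reshape(b, t, h * w))

embed = PixelTokenEmbedding(d_model=256)
fake_glyphs = torch.rand(2, 16, 8, 8)        # stand-in for rendered characters
tokens = embed(fake_glyphs)                  # feed these into a standard decoder
print(tokens.shape)                          # torch.Size([2, 16, 256])
```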
zh
[CV-20] Bipartite Mode Matching for Vision Training Set Search from a Hierarchical Data Server AAAI2026
【速读】:该论文旨在解决在目标域(target domain)可访问但无法进行实时数据标注的情况下,如何构建一个高质量的替代训练集以训练出性能优异模型的问题。其核心挑战在于目标域通常存在多个语义簇(semantic clusters,即模式),若训练集中缺失这些目标模式,则模型性能将显著下降。解决方案的关键在于优化数据服务器的结构,提出一种分层数据服务器(hierarchical data server)与双部模式匹配算法(Bipartite Mode Matching, BMM),通过在服务器数据树中为每个目标模式寻找最优源模式匹配,实现目标与源模式之间的一对一最优对齐。该方法显著缩小了训练集与目标域之间的领域差异,在重识别(re-ID)和检测任务中均表现出更优的模型精度,并且与现有以模型为中心的无监督域自适应(UDA)方法正交,可进一步结合伪标签等策略提升性能。
链接: https://arxiv.org/abs/2601.09531
作者: Yue Yao,Ruining Yang,Tom Gedeon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026
Abstract:We explore a situation in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we would like to construct an alternative training set from a large-scale data server so that a competitive model can be obtained. For this problem, because the target domain usually exhibits distinct modes (i.e., semantic clusters representing data distribution), if the training set does not contain these target modes, the model performance would be compromised. While prior existing works improve algorithms iteratively, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server, together with a bipartite mode matching algorithm (BMM) to align source and target modes. For each target mode, we look in the server data tree for the best mode match, which might be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-on-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets that have consistently smaller domain gaps with the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets have higher accuracy than those trained otherwise. BMM allows data-centric unsupervised domain adaptation (UDA) orthogonal to existing model-centric UDA methods. By combining the BMM with existing UDA methods like pseudo-labeling, further improvement is observed.
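A compact sketch of the bipartite-matching core: cluster target features into modes, then match each target mode to a distinct server mode with the Hungarian algorithm. The hierarchical server tree and re-ID features are omitted; the k-means centroids and Euclidean cost are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def match_modes(target_feats: np.ndarray, server_mode_centroids: np.ndarray, k: int = 8):
    """Cluster target features into k modes and match each to one server mode.

    Returns (target_mode_idx, server_mode_idx) pairs from a one-to-one assignment.
    """
    target_modes = KMeans(n_clusters=k, n_init=10, random_state=0).fit(target_feats).cluster_centers_
    # Cost = Euclidean distance between target and server mode centroids.
    cost = np.linalg.norm(target_modes[:, None, :] - server_mode_centroids[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage with random 128-d features and 20 server modes.
rng = np.random.default_rng(0)
pairs = match_modes(rng.normal(size=(500, 128)), rng.normal(size=(20, 128)))
print(pairs[:3])
```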
zh
[CV-21] GlovEgo-HOI: Bridging the Synthetic-to-Real Gap for Industrial Egocentric Human-Object Interaction Detection
【速读】:该论文旨在解决工业场景中以自我为中心的人-物交互(Egocentric Human-Object Interaction, EHOI)分析模型训练因领域特定标注数据稀缺而导致的鲁棒性不足问题。解决方案的关键在于提出了一种结合合成数据与基于扩散模型(diffusion-based process)的数据增强框架,用于在真实图像中注入逼真的个人防护装备(Personal Protective Equipment, PPE),从而有效扩充高质量训练数据;同时,研究构建了GlovEgo-HOI基准数据集和GlovEgo-Net模型,后者通过融合手套头部(Glove-Head)与关键点头部(Keypoint-Head)模块,利用手部姿态信息提升交互检测精度。
链接: https://arxiv.org/abs/2601.09528
作者: Alfio Spoto,Rosario Leonardi,Francesco Ragusa,Giovanni Maria Farinella
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, accepted as a Short Paper at VISAPP 2026
Abstract:Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint-Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: GitHub project.
zh
[CV-22] Video Joint-Embedding Predictive Architectures for Facial Expression Recognition
【速读】:该论文旨在解决面部表情识别(Facial Expression Recognition, FER)中模型泛化能力不足及对无关视觉信息(如背景颜色)过度依赖的问题。传统视频理解预训练方法通常基于像素级重建,易引入冗余信息;而本文提出采用视频联合嵌入预测架构(Video Joint-Embedding Predictive Architectures, V-JEPAs),通过从未掩码区域的嵌入预测掩码区域的嵌入来学习特征表示,从而避免捕捉与任务无关的细节信息。其解决方案的关键在于利用纯嵌入预测机制进行预训练,使编码器能够提取更具判别性的视频表征,进而提升FER在多个数据集上的性能与跨数据集的泛化能力。
链接: https://arxiv.org/abs/2601.09524
作者: Lennart Eing,Cristina Luna-Jiménez,Silvan Mertes,Elisabeth André
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: To appear in 2025 Proceedings of the 13th International Conference on Affective Computing and Intelligent Interaction (ACII), submitted to IEEE. © 2025 IEEE
Abstract:This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This enables the trained encoder to not capture irrelevant information about a given video like the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers using the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at this https URL.
zh
[CV-23] Class Adaptive Conformal Training
【速读】:该论文旨在解决深度神经网络在预测中概率估计不可靠、易产生过度自信的问题,尤其是在使用传统共形预测(Conformal Prediction, CP)方法时,难以实现类别条件下的预测集优化。其解决方案的关键在于提出了一种类自适应共形训练(Class Adaptive Conformal Training, CaCT),将共形训练建模为一个带增广拉格朗日约束的优化问题,从而无需任何数据分布假设即可自动学习类别条件下的预测集形状。实验表明,CaCT在多个基准数据集上均优于现有方法,在保持严格覆盖保证的前提下显著缩小了预测集大小并提升了信息量。
链接: https://arxiv.org/abs/2601.09522
作者: Badr-Eddine Marani,Julio Silva-Rodriguez,Ismail Ben Ayed,Maria Vakalopoulou,Stergios Christodoulidis,Jose Dolz
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep neural networks have achieved remarkable success across a variety of tasks, yet they often suffer from unreliable probability estimates. As a result, they can be overconfident in their predictions. Conformal Prediction (CP) offers a principled framework for uncertainty quantification, yielding prediction sets with rigorous coverage guarantees. Existing conformal training methods optimize for overall set size, but shaping the prediction sets in a class-conditional manner is not straightforward and typically requires prior knowledge of the data distribution. In this work, we introduce Class Adaptive Conformal Training (CaCT), which formulates conformal training as an augmented Lagrangian optimization problem that adaptively learns to shape prediction sets class-conditionally without making any distributional assumptions. Experiments on multiple benchmark datasets, including standard and long-tailed image recognition as well as text classification, demonstrate that CaCT consistently outperforms prior conformal training methods, producing significantly smaller and more informative prediction sets while maintaining the desired coverage guarantees.
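For context, a minimal sketch of class-conditional (classwise) split conformal prediction, the post-hoc counterpart of what CaCT learns during training; the 1 - p_true score and the finite-sample quantile are standard choices, not the paper's exact formulation.

```python
import numpy as np

def classwise_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with per-class thresholds.

    cal_probs:  (n, K) softmax outputs on a held-out calibration set.
    cal_labels: (n,) true labels.
    test_probs: (m, K) softmax outputs on test points.
    Returns a list of prediction sets (arrays of class indices).
    """
    n, K = cal_probs.shape
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    thresholds = np.zeros(K)
    for k in range(K):
        s_k = scores[cal_labels == k]
        n_k = len(s_k)
        # Finite-sample-corrected quantile for class-conditional coverage.
        q = min(1.0, np.ceil((n_k + 1) * (1 - alpha)) / n_k)
        thresholds[k] = np.quantile(s_k, q)
    return [np.where(1.0 - p <= thresholds)[0] for p in test_probs]

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=200)
labels = rng.integers(0, 5, size=200)
sets = classwise_conformal_sets(probs, labels, probs[:3])
print([s.tolist() for s in sets])
```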
zh
[CV-24] Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations
【速读】:该论文旨在解决人形机器人与人类物理交互(Human-Humanoid Interaction, HHoI)数据稀缺问题,以及由此导致的模仿学习策略缺乏交互理解、难以实现协同行为的问题。解决方案的关键在于提出一个两阶段框架:第一阶段为PAIR(Physics-Aware Interaction Retargeting),通过接触中心的两阶段管道保留形态差异下的接触语义,生成物理一致的HHoI数据;第二阶段引入D-STAR(Decoupled Spatio-Temporal Action Reasoner),采用分层策略解耦“何时行动”(Phase Attention)与“何处行动”(Multi-Scale Spatial module),并通过扩散头融合生成同步的全身行为,从而实现响应性强、协调性高的交互能力。
链接: https://arxiv.org/abs/2601.09518
作者: Wei-Jin Huang,Yue-Yi Zhang,Yi-Lin Wei,Zhi-Wei Xia,Juantao Tan,Yuan-Ming Li,Zhilin Zhao,Wei-Shi Zheng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Enabling humanoid robots to physically interact with humans is a critical frontier, but progress is hindered by the scarcity of high-quality Human-Humanoid Interaction (HHoI) data. While leveraging abundant Human-Human Interaction (HHI) data presents a scalable alternative, we first demonstrate that standard retargeting fails by breaking the essential contacts. We address this with PAIR (Physics-Aware Interaction Retargeting), a contact-centric, two-stage pipeline that preserves contact semantics across morphology differences to generate physically consistent HHoI data. This high-quality data, however, exposes a second failure: conventional imitation learning policies merely mimic trajectories and lack interactive understanding. We therefore introduce D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that disentangles when to act from where to act. In D-STAR, Phase Attention (when) and a Multi-Scale Spatial module (where) are fused by the diffusion head to produce synchronized whole-body behaviors beyond mimicry. By decoupling these reasoning streams, our model learns robust temporal phases without being distracted by spatial noise, leading to responsive, synchronized collaboration. We validate our framework through extensive and rigorous simulations, demonstrating significant performance gains over baseline approaches and a complete, effective pipeline for learning complex whole-body interactions from HHI data.
zh
[CV-25] V-DPM: 4D Video Reconstruction with Dynamic Point Maps WWW
【速读】:该论文旨在解决动态场景下三维重建的局限性问题,特别是现有动态点图(Dynamic Point Maps, DPMs)仅适用于图像对且在多视图情况下需依赖优化后处理的问题。解决方案的关键在于提出一种面向视频输入的V-DPM方法:首先,设计了最大化表示能力、便于神经网络预测并可复用预训练模型的DPM公式化方式;其次,在VGGT这一静态场景下强大的3D重建器基础上实现该方法,通过少量合成数据即可有效适配为视频动态点图预测器,从而在4D(时空)重建中实现优于现有方法的性能,不仅能恢复动态深度,还能完整捕捉场景中每个点的三维运动信息。
链接: https://arxiv.org/abs/2601.09499
作者: Edgar Sucar,Eldar Insafutdinov,Zihang Lai,Andrea Vedaldi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Powerful 3D representations such as DUSt3R invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend this concept to dynamic 3D content by additionally representing scene motion. However, existing DPMs are limited to image pairs and, like DUSt3R, require post processing via optimization when more than two views are involved. We argue that DPMs are more useful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to formulate DPMs for video input in a way that maximizes representational power, facilitates neural prediction, and enables reuse of pretrained models. Second, we implement these ideas on top of VGGT, a recent and powerful 3D reconstructor. Although VGGT was trained on static scenes, we show that a modest amount of synthetic data is sufficient to adapt it into an effective V-DPM predictor. Our approach achieves state of the art performance in 3D and 4D reconstruction for dynamic scenes. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs recover not only dynamic depth but also the full 3D motion of every point in the scene.
zh
[CV-26] Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity
【速读】:该论文旨在解决跨数据集目标检测(Cross-Dataset Object Detection, CD-OD)中性能显著下降的问题,特别是由于不同数据集之间场景设置差异导致的域偏移(domain shift)影响。其核心贡献在于提出“设置特异性”(setting specificity)的分析框架,将基准数据集划分为设置无关(setting-agnostic)和设置特定(setting-specific)两类,并系统评估标准检测器在各类训练-测试组合下的迁移表现。关键发现是:同一设置类型内的迁移相对稳定,而跨设置类型的迁移性能大幅下降且常呈现非对称性,尤其从特定源到无关目标的迁移最为严重;即使采用开放标签对齐(open-label alignment)策略,性能提升有限,表明域偏移仍是最难情形下的主导因素。该研究为理解CD-OD提供了结构化视角,并提出了基于封闭标签与开放标签对比的评估方法,以更精准地区分域偏移与标签错位的影响。
链接: https://arxiv.org/abs/2601.09497
作者: Ritabrata Chakraborty,Hrishit Mitra,Shivakumara Palaiahnakote,Umapada Pal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, 6 tables
Abstract:Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train–test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at this https URL.
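A sketch of the open-label protocol described above: each predicted (source) class name is mapped to the nearest target label by CLIP text-embedding similarity. The prompt template and checkpoint choice are assumptions, and the checkpoint is downloaded on first use.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

def open_label_map(source_classes, target_classes, ckpt="openai/clip-vit-base-patch32"):
    """Map each source class name to its nearest target class by CLIP text similarity."""
    model = CLIPModel.from_pretrained(ckpt)
    tok = CLIPTokenizer.from_pretrained(ckpt)
    with torch.no_grad():
        src = model.get_text_features(**tok([f"a photo of a {c}" for c in source_classes],
                                            padding=True, return_tensors="pt"))
        tgt = model.get_text_features(**tok([f"a photo of a {c}" for c in target_classes],
                                            padding=True, return_tensors="pt"))
    src = src / src.norm(dim=-1, keepdim=True)
    tgt = tgt / tgt.norm(dim=-1, keepdim=True)
    nearest = (src @ tgt.T).argmax(dim=-1)
    return {s: target_classes[i] for s, i in zip(source_classes, nearest.tolist())}

# Toy usage: source-vocabulary predictions remapped onto the target label set.
print(open_label_map(["sedan", "lorry"], ["car", "truck", "bus"]))
```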
zh
[CV-27] MAD: Motion Appearance Decoupling for efficient Driving World Models
【速读】:该论文旨在解决当前生成式视频扩散模型(Generative Video Diffusion Models)在自动驾驶场景中作为可靠世界模型(World Models)时存在的局限性,即这些模型虽能生成逼真的视频,但在结构化运动和物理一致性交互方面表现不足。解决方案的关键在于提出一种高效的适配框架,通过将运动学习与外观合成解耦:首先在简化形式下(如骨架化代理和场景元素)训练模型以预测符合物理和社会规则的结构化运动;随后复用同一骨干网络,基于已学得的运动序列生成真实的RGB视频,实现“以纹理和光照渲染运动”的两阶段推理-渲染范式。此方法显著降低了对领域特定数据和计算资源的需求,同时提升了控制能力和生成质量。
链接: https://arxiv.org/abs/2601.09452
作者: Ahmad Rahimi,Valentin Gerard,Eloi Zablocki,Matthieu Cord,Alexandre Alahi
机构: Ecole Polytechnique Federal de Lausanne (洛桑联邦理工学院); valeo.ai; Sorbonne Université (索邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively “dressing” the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: this https URL
zh
[CV-28] PrivLEX: Detecting legal concepts in images through Vision-Language Models
【速读】:该论文旨在解决图像隐私分类中缺乏法律合规性与可解释性的关键问题,即现有方法难以准确识别并解释图像中是否包含受法律保护的个人数据(personal data)概念。解决方案的关键在于提出PrivLEX,一个基于视觉-语言模型(Vision-Language Models, VLMs)的零样本概念检测框架,通过标签自由的概念瓶颈模型(Concept Bottleneck Model)实现对图像中法律定义的个人数据概念的可解释分类,无需在训练阶段提供显式的概念标签,从而实现了从法律语义出发、具备可解释性的图像隐私判定。
链接: https://arxiv.org/abs/2601.09449
作者: Darya Baranouskaya,Andrea Cavallaro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX’s ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.
zh
[CV-29] Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?
【速读】:该论文旨在解决古代钱币自动分析中语义元素识别的难题,以提升研究人员从大规模钱币收藏中提取历史信息的能力,并辅助收藏者更好地理解所购或所售钱币的内容。其解决方案的关键在于首次将视觉Transformer(Vision Transformer, ViT)深度学习架构应用于古代钱币语义元素识别任务,通过完全自动化的多模态数据(图像与非结构化文本)学习,实现了比新训练的卷积神经网络(Convolutional Neural Networks, CNN)模型更高的识别准确率。
链接: https://arxiv.org/abs/2601.09433
作者: David Reid,Ognjen Arandjelovic
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coin analysis and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
zh
[CV-30] Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态视频场景中缺乏对复杂视觉-空间逻辑链(Multi-hop Spatial Reasoning, MSR)能力有效评估的问题。现有基准主要聚焦于单步感知到判断的任务,未能充分覆盖需要多步骤推理的现实场景。为填补这一空白,作者提出首个专门用于评测MSR能力的基准Video-MSR,包含四个任务:受限定位、基于链路的参考检索、路径规划和反事实物理推断,并构建了由3,052个高质量视频实例组成的测试集。解决方案的关键在于:一方面通过系统化设计的多跳空间推理任务揭示模型在空间逻辑链上的显著性能下降与空间错位、幻觉等问题;另一方面提出MSR-9K指令微调数据集,针对多跳空间推理进行专门训练,使Qwen-VL模型在Video-MSR上提升7.82%绝对性能,验证了专用指令数据对增强MSR能力的有效性。
链接: https://arxiv.org/abs/2601.09430
作者: Rui Zhu,Xin Shen,Shuchen Wu,Chenxi Miao,Xin Yu,Yang Li,Weikang Li,Deguo Xia,Jizhou Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), drawing increasing attention and rapid advancement. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios requiring complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually-grounded pipeline combining advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations, revealing that while models demonstrate proficiency in surface-level perception, they exhibit distinct performance drops in MSR tasks, frequently suffering from spatial disorientation and hallucination during multi-step deductions. To mitigate these shortcomings and empower models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at this https URL.
zh
[CV-31] Radiomics-Integrated Deep Learning with Hierarchical Loss for Osteosarcoma Histology Classification
【速读】:该论文旨在解决骨肉瘤(Osteosarcoma, OS)患者接受新辅助化疗后,组织病理学评估中存活肿瘤区域与非存活肿瘤区域划分的准确性问题。当前人工评估存在劳动强度大、主观性强及观察者间差异显著等局限,而现有深度学习模型在患者层面独立采样的测试数据上表现下降明显,表明其泛化能力不足。解决方案的关键在于两点:一是引入影像组学(radiomic)特征作为额外输入,提升模型分类性能并增强可解释性;二是采用分层二分类任务(即肿瘤vs非肿瘤和存活vs非存活)替代传统的三分类任务,并设计具有可训练权重的分层损失函数,从而优化各类别性能。实验表明,这两种方法单独或联合使用均能显著提升模型性能,在TCIA OS Tumor Assessment公开数据集上达到新的最优水平。
链接: https://arxiv.org/abs/2601.09416
作者: Yaxi Chen,Zi Ye,Shaheer U. Saeed,Oliver Yu,Simin Ni,Jie Huang,Yipeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Osteosarcoma (OS) is an aggressive primary bone malignancy. Accurate histopathological assessment of viable versus non-viable tumor regions after neoadjuvant chemotherapy is critical for prognosis and treatment planning, yet manual evaluation remains labor-intensive, subjective, and prone to inter-observer variability. Recent advances in digital pathology have enabled automated necrosis quantification. Evaluation on test data sampled independently at the patient level revealed that deep learning model performance dropped significantly from the tile-level generalization ability reported in previous studies. First, this work proposes the use of radiomic features as additional input in model training. We show that, despite being derived from the images, such a multimodal input effectively improved the classification performance, in addition to its added benefits in interpretability. Second, this work proposes to optimize two binary classification tasks with hierarchical classes (i.e. tumor-vs-non-tumor and viable-vs-non-viable), as opposed to the alternative "flat" three-class classification task (i.e. non-tumor, non-viable tumor, viable tumor), thereby enabling a hierarchical loss. We show that with such a hierarchical loss, using trainable weightings between the two tasks, per-class performance can be improved significantly. Using the TCIA OS Tumor Assessment dataset, we experimentally demonstrate the benefits from each of the proposed new approaches and their combination, setting what we consider a new state-of-the-art performance on this open dataset for this application. Code and trained models: this https URL.
zh
[CV-32] Detail Loss in Super-Resolution Models Based on the Laplacian Pyramid and Repeated Upscaling and Downscaling Process
【速读】:该论文旨在解决图像超分辨率(Image Super-Resolution, ISR)任务中高频细节增强不足的问题,即如何更有效地提升重建图像的纹理和边缘等高频率信息。解决方案的关键在于提出两种核心方法:一是基于拉普拉斯金字塔(Laplacian Pyramid)的细节损失函数(detail loss),用于显式引导模型关注并优化高频成分;二是重复上采样与下采样过程,通过多尺度特征提取强化细节损失的有效性。该方法通过总损失函数分离控制超分辨率图像与细节图像的生成,使模型能更聚焦于高频信息,从而在不同结构的模型(包括CNN和注意力机制模型)上均显著提升重建质量。
链接: https://arxiv.org/abs/2601.09410
作者: Sangjun Han,Youngmi Hur
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IET Image Processing. This is the authors’ final accepted manuscript
Abstract:With advances in artificial intelligence, image processing has gained significant interest. Image super-resolution is a vital technology closely related to real-world applications, as it enhances the quality of existing images. Since enhancing fine details is crucial for the super-resolution task, pixels that contribute to high-frequency information should be emphasized. This paper proposes two methods to enhance high-frequency details in super-resolution images: a Laplacian pyramid-based detail loss and a repeated upscaling and downscaling process. A total loss incorporating our detail loss guides the model by separately generating and controlling super-resolution and detail images. This approach allows the model to focus more effectively on high-frequency components, resulting in improved super-resolution images. Additionally, repeated upscaling and downscaling amplify the effectiveness of the detail loss by extracting diverse information from multiple low-resolution features. We conduct two types of experiments. First, we design a CNN-based model incorporating our methods. This model achieves state-of-the-art results, surpassing all currently available CNN-based and even some attention-based models. Second, we apply our methods to existing attention-based models on a small scale. In all our experiments, attention-based models augmented with our detail loss show improvements compared to the originals. These results demonstrate that our approaches effectively enhance super-resolution images across different model structures.
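A minimal sketch of a Laplacian-pyramid detail loss of the kind described above: the L1 difference is computed only on band-pass (high-frequency) residuals of the SR and HR images. The pooling kernel and number of levels are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 3):
    """Return the band-pass (detail) images of a simple Laplacian pyramid."""
    bands = []
    cur = x
    for _ in range(levels):
        down = F.avg_pool2d(cur, kernel_size=2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear", align_corners=False)
        bands.append(cur - up)   # high-frequency residual at this scale
        cur = down
    return bands

def detail_loss(sr: torch.Tensor, hr: torch.Tensor, levels: int = 3) -> torch.Tensor:
    """L1 loss restricted to the high-frequency bands of SR and HR images."""
    loss = sr.new_zeros(())
    for b_sr, b_hr in zip(laplacian_pyramid(sr, levels), laplacian_pyramid(hr, levels)):
        loss = loss + F.l1_loss(b_sr, b_hr)
    return loss

sr, hr = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(detail_loss(sr, hr).item())
```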
zh
[CV-33] Spectral Complex Autoencoder Pruning: A Fidelity-Guided Criterion for Extreme Structured Channel Compression
【速读】:该论文旨在解决深度神经网络中通道级冗余识别与高效剪枝的问题,尤其在极端压缩场景下如何精准评估各卷积层输出通道的重要性。解决方案的关键在于提出谱复数自动编码器剪枝(Spectral Complex Autoencoder Pruning, SCAP),其核心创新是构建由输入激活作为实部、单个输出通道激活作为虚部组成的复数交互场,并将其变换到频域后训练一个低容量自动编码器来重建归一化谱;通过谱重建保真度量化通道冗余性——高保真度通道被认为位于自动编码器所学习的低维流形上,具备更强的可压缩性,从而被剪枝;而低保真度通道则保留以确保关键信息不丢失。此方法支持基于阈值的简单剪枝策略,同时保持网络结构一致性,在VGG16/CIFAR-10上实现90.11% FLOP和96.30%参数减少,仅带来1.67% Top-1准确率下降。
链接: https://arxiv.org/abs/2601.09352
作者: Wei Liu,Xing Deng,Haijian Shao,Yingtao Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figures
Abstract:We propose Spectral Complex Autoencoder Pruning (SCAP), a reconstruction-based criterion that measures functional redundancy at the level of individual output channels. For each convolutional layer, we construct a complex interaction field by pairing the full multi-channel input activation as the real part with a single output-channel activation (spatially aligned and broadcast across input channels) as the imaginary part. We transform this complex field to the frequency domain and train a low-capacity autoencoder to reconstruct normalized spectra. Channels whose spectra are reconstructed with high fidelity are interpreted as lying close to a low-dimensional manifold captured by the autoencoder and are therefore more compressible; conversely, channels with low fidelity are retained as they encode information that cannot be compactly represented by the learned manifold. This yields an importance score (optionally fused with the filter L1 norm) that supports simple threshold-based pruning and produces a structurally consistent pruned network. On VGG16 trained on CIFAR-10, at a fixed threshold of 0.6, we obtain 90.11% FLOP reduction and 96.30% parameter reduction with an absolute Top-1 accuracy drop of 1.67% from a 93.44% baseline after fine-tuning, demonstrating that spectral reconstruction fidelity of complex interaction fields is an effective proxy for channel-level redundancy under aggressive compression.
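To make the "complex interaction field" concrete, here is a sketch that builds, for each output channel, the complex field (input activation as real part, broadcast output-channel activation as imaginary part) and its normalized FFT magnitude spectrum. The paper then trains a low-capacity autoencoder on such spectra and scores channels by reconstruction fidelity; that part is omitted here, and the batch averaging is an assumption.

```python
import torch

def channel_spectra(x_in: torch.Tensor, y_out: torch.Tensor) -> torch.Tensor:
    """Normalized frequency spectra of per-output-channel complex interaction fields.

    x_in:  (B, C_in, H, W) layer input activation.
    y_out: (B, C_out, H, W) layer output activation, spatially aligned with x_in.
    Returns (C_out, C_in*H*W) flattened magnitude spectra, one row per output channel.
    """
    B, C_in, H, W = x_in.shape
    C_out = y_out.shape[1]
    spectra = []
    for c in range(C_out):
        # Real part: full input; imaginary part: channel c broadcast over input channels.
        field = torch.complex(x_in, y_out[:, c : c + 1].expand(-1, C_in, -1, -1))
        mag = torch.fft.fft2(field).abs().mean(dim=0)   # average over the batch
        mag = mag / (mag.norm() + 1e-8)                 # normalize the spectrum
        spectra.append(mag.flatten())
    return torch.stack(spectra)

spec = channel_spectra(torch.rand(4, 8, 16, 16), torch.rand(4, 12, 16, 16))
print(spec.shape)  # torch.Size([12, 2048])
```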
zh
[CV-34] See More Store Less: Memory-Efficient Resolution for Video Moment Retrieval
【速读】:该论文旨在解决视频理解任务中因密集帧处理导致的内存限制问题,尤其是在视频片段检索(Video Moment Retrieval, VMR)场景下,现有稀疏采样方法易造成信息丢失,难以兼顾记忆效率与内容完整性。其解决方案的关键在于提出SMORE框架,通过三个核心机制实现高效视频理解:首先利用查询引导的字幕编码(query-guided captions)对齐用户意图语义;其次采用查询感知的重要性调制(query-aware importance modulation)突出相关视频片段;最后自适应压缩帧以保留关键内容并减少冗余,从而在不突破内存预算的前提下显著提升视频理解性能。
链接: https://arxiv.org/abs/2601.09350
作者: Mingyu Jeon,Sungjin Han,Jinkwon Hwang,Minchol Kwon,Jonghee Kim,Junyeong Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
zh
[CV-35] Beyond the final layer: Attentive multilayer fusion for vision transformers
【速读】:该论文旨在解决大规模基础模型在下游任务中高效适配的问题,特别是针对传统线性探测(linear probing)方法仅依赖最后一层表示而忽略网络层次中分布信息的局限性。其解决方案的关键在于提出一种基于注意力机制的动态融合策略,能够自适应地识别对目标任务最相关的各层特征,并将低层结构线索与高层语义抽象进行有效整合,从而显著提升探测性能。实验表明,该方法在20个不同数据集和多种预训练模型上均优于标准线性探测,且注意力热图揭示了跨域任务尤其受益于中间层表示。
链接: https://arxiv.org/abs/2601.09322
作者: Laure Ciernik,Marco Morik,Lukas Thede,Luca Eyring,Shinichi Nakajima,Zeynep Akata,Lukas Muttenthaler
机构: Technische Universität Berlin (柏林工业大学); BIFOLD; University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心); Helmholtz Munich (慕尼黑亥姆霍兹研究中心); MCML (慕尼黑机器学习中心); RIKEN AIP (理化学研究所先进智能研究中心); Technical University Munich (慕尼黑工业大学); Aignostics (艾格诺斯蒂克斯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks different from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate layer information and demonstrate a principled, task aware approach for unlocking their potential in probing-based adaptation.
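A minimal sketch of an attentive probe over per-layer [CLS] tokens of a frozen ViT, in the spirit of the fusion described above; the single-query dot-product attention and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AttentiveLayerFusion(nn.Module):
    """Learned attention over per-layer [CLS] tokens of a frozen ViT, then a linear head."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))
        self.key = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, layer_tokens: torch.Tensor) -> torch.Tensor:
        # layer_tokens: (B, L, D) -- the [CLS] token from each of the L transformer blocks.
        attn = (self.key(layer_tokens) @ self.query) / layer_tokens.shape[-1] ** 0.5  # (B, L)
        weights = attn.softmax(dim=-1).unsqueeze(-1)                                  # (B, L, 1)
        fused = (weights * layer_tokens).sum(dim=1)                                   # (B, D)
        return self.head(fused)

probe = AttentiveLayerFusion(dim=768, num_classes=100)
logits = probe(torch.randn(4, 12, 768))  # 12 layers of a ViT-B backbone
print(logits.shape)  # torch.Size([4, 100])
```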
zh
[CV-36] Frequency Error-Guided Under-sampling Optimization for Multi-Contrast MRI Reconstruction
【速读】:该论文旨在解决多对比度磁共振成像(Multi-contrast MRI)重建中长期存在的三大问题:(1)参考图像融合策略浅层化,如简单拼接导致信息利用不充分;(2)未能有效挖掘参考对比度提供的互补信息;(3)采用固定欠采样模式限制了重建灵活性。解决方案的关键在于提出一种高效且可解释的频率误差引导重建框架,其核心创新包括:首先通过条件扩散模型学习频率误差先验(Frequency Error Prior, FEP),并将其嵌入统一优化框架中,联合优化欠采样模式与重建网络;其次采用模型驱动的深度展开架构,协同利用频域与图像域信息;同时引入空间对齐模块和参考特征分解策略,提升重建质量并增强物理可解释性。该方法在多种成像模态、加速倍数(4–30倍)及采样方案下均显著优于当前最先进方法。
链接: https://arxiv.org/abs/2601.09316
作者: Xinming Fang,Chaoyan Huang,Juncheng Li,Jun Wang,Jun Shi,Guixu Zhang
机构: Shanghai University (上海大学); Michigan State University (密歇根州立大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 44 pages, 12 figures, 7 tables
Abstract:Magnetic resonance imaging (MRI) plays a vital role in clinical diagnostics, yet it remains hindered by long acquisition times and motion artifacts. Multi-contrast MRI reconstruction has emerged as a promising direction by leveraging complementary information from fully-sampled reference scans. However, existing approaches suffer from three major limitations: (1) superficial reference fusion strategies, such as simple concatenation, (2) insufficient utilization of the complementary information provided by the reference contrast, and (3) fixed under-sampling patterns. We propose an efficient and interpretable frequency error-guided reconstruction framework to tackle these issues. We first employ a conditional diffusion model to learn a Frequency Error Prior (FEP), which is then incorporated into a unified framework for jointly optimizing both the under-sampling pattern and the reconstruction network. The proposed reconstruction model employs a model-driven deep unfolding framework that jointly exploits frequency- and image-domain information. In addition, a spatial alignment module and a reference feature decomposition strategy are incorporated to improve reconstruction quality and bridge model-based optimization with data-driven learning for improved physical interpretability. Comprehensive validation across multiple imaging modalities, acceleration rates (4-30x), and sampling schemes demonstrates consistent superiority over state-of-the-art methods in both quantitative metrics and visual quality. All codes are available at this https URL.
zh
[CV-37] Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
【速读】:该论文旨在解决信息与通信技术(Information and Communications Technology, ICT)领域中,如何高效提取隐藏在图像模态中的高价值领域知识的问题。传统方法仅能解析文本内容,缺乏图像描述能力;而多模态大语言模型(Multimodal Large Language Model, MLLM)虽具备图像理解能力,但缺乏足够的领域知识。解决方案的关键在于提出一种多阶段渐进式训练策略,构建一个面向ICT领域的图像字幕模型(Domain-specific Image Captioning Model, DICModel):首先利用Mermaid工具与生成式AI(Generative AI)合成约7K图像-文本对用于第一阶段监督微调(Supervised Fine-Tuning, SFT);随后由ICT专家人工标注约2K图像-文本对进行第二阶段SFT;最后由专家与生成式AI共同合成约1.5K视觉问答数据,用于指令微调。该方法使仅含7B参数的DICModel在BLEU指标上优于32B参数的SOTA模型达20.8%,且在专家构建的客观题测试中准确率超越Qwen2.5-VL 32B达1%,验证了其在ICT领域图像逻辑文本提取方面的有效性。
链接: https://arxiv.org/abs/2601.09298
作者: Lianying Chao,Haoran Cai,Xubin Li,Kai Zhang,Sijie Wu,Rui Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but don't have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised-fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering data for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.
zh
[CV-38] GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
【速读】:该论文旨在解决3D高斯点阵(3D Gaussian Splatting, 3DGS)在模拟脆性断裂场景中的局限性问题,即现有方法难以处理具有连贯纹理的体素内部结构以及缺乏针对高斯表示的断裂感知物理仿真机制。解决方案的关键在于提出GaussianFluent框架:首先,通过生成模型引导的内部高斯点密集化策略合成逼真的内部结构;其次,集成优化的连续损伤材料点法(Continuum Damage Material Point Method, CD-MPM),实现高速且真实的脆性断裂模拟,从而支持复杂多材质物体和多阶段裂纹扩展等动态场景。
链接: https://arxiv.org/abs/2601.09265
作者: Bei Huang,Yixin Chen,Ruijie Lu,Gang Zeng,Hongbin Zha,Yuru Pei,Siyuan Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages
Abstract:3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent’s capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream application, such as VR and Robotics.
zh
[CV-39] BrainSegNet: A Novel Framework for Whole-Brain MRI Parcellation Enhanced by Large Models
【速读】:该论文旨在解决全脑磁共振成像(MRI)分割任务中因脑区数量多、形状不规则且边界复杂而导致的高精度分割难题。传统基于模板配准的方法效率低,而现有深度学习模型虽具备强大特征提取能力,但难以满足脑部精细结构分割所需的精确性。解决方案的关键在于提出BrainSegNet框架,通过改进Segment Anything Model (SAM) 实现高精度全脑95区域分割:其核心创新包括融合U-Net跳跃连接与SAM的Transformer模块构成混合编码器、设计带金字塔池化机制的多尺度注意力解码器以适应不同大小结构,以及引入边界细化模块提升边缘清晰度,从而显著优于当前主流方法,在Human Connectome Project (HCP) 数据集上实现更高的准确性和鲁棒性。
链接: https://arxiv.org/abs/2601.09263
作者: Yucheng Li,Xiaofan Wang,Junyi Wang,Yijie Li,Xi Zhu,Mubai Du,Dian Sheng,Wei Zhang,Fan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Whole-brain parcellation from MRI is a critical yet challenging task due to the complexity of subdividing the brain into numerous small, irregularly shaped regions. Traditionally, template-registration methods were used, but recent advances have shifted to deep learning for faster workflows. While large models like the Segment Anything Model (SAM) offer transferable feature representations, they are not tailored for the high precision required in brain parcellation. To address this, we propose BrainSegNet, a novel framework that adapts SAM for accurate whole-brain parcellation into 95 regions. We enhance SAM by integrating U-Net skip connections and specialized modules into its encoder and decoder, enabling fine-grained anatomical precision. Key components include a hybrid encoder combining U-Net skip connections with SAM’s transformer blocks, a multi-scale attention decoder with pyramid pooling for varying-sized structures, and a boundary refinement module to sharpen edges. Experimental results on the Human Connectome Project (HCP) dataset demonstrate that BrainSegNet outperforms several state-of-the-art methods, achieving higher accuracy and robustness in complex, multi-label parcellation.
zh
[CV-40] Magnifying change: Rapid burn scar mapping with multi-resolution multi-source satellite imagery
【速读】:该论文旨在解决利用卫星遥感影像准确 delineate(划定)野火影响区域的难题,尤其针对因电磁波谱范围内光谱变化不规则且空间异质性导致的检测困难。在实际操作场景中,如野火发生后需快速生成烧毁区域图时,现有深度学习方法受限于当前卫星系统在空间分辨率与重访频率之间的权衡。解决方案的关键在于提出一种名为 BAM-MRCD 的新型深度学习模型,该模型融合多源、多分辨率卫星数据(MODIS 和 Sentinel-2),实现了高时空分辨率的烧毁区域制图,能够以高精度识别小尺度野火事件,并优于同类变化检测模型及基准方法。
链接: https://arxiv.org/abs/2601.09262
作者: Maria Sdraka,Dimitrios Michail,Ioannis Papoutsis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Delineating wildfire affected areas using satellite imagery remains challenging due to irregular and spatially heterogeneous spectral changes across the electromagnetic spectrum. While recent deep learning approaches achieve high accuracy when high-resolution multispectral data are available, their applicability in operational settings, where a quick delineation of the burn scar shortly after a wildfire incident is required, is limited by the trade-off between spatial resolution and temporal revisit frequency of current satellite systems. To address this limitation, we propose a novel deep learning model, namely BAM-MRCD, which employs multi-resolution, multi-source satellite imagery (MODIS and Sentinel-2) for the timely production of detailed burnt area maps with high spatial and temporal resolution. Our model manages to detect even small scale wildfires with high accuracy, surpassing similar change detection models as well as solid baselines. All data and code are available in the GitHub repository: this https URL.
zh
[CV-41] PhyRPR: Training-Free Physics-Constrained Video Generation
【速读】:该论文旨在解决当前基于扩散模型的视频生成方法在物理约束满足方面的不足问题,其核心挑战在于现有单阶段方法将高层次物理理解与低层次视觉合成耦合在一起,导致难以生成需要显式物理推理的内容。解决方案的关键在于提出了一种无需训练的三阶段流水线——PhyRPR(PhyReason-PhyPlan-PhyRefine),通过解耦物理理解与视觉合成过程实现对物理规律的显式控制:首先利用大语言模型进行物理状态推理并生成关键帧;其次确定性地合成可控的粗粒度运动骨架;最后通过潜空间融合策略将该骨架注入扩散采样过程以精细化外观同时保持规划的动力学特性。这一分阶段设计显著提升了生成视频的物理合理性与运动可控性。
链接: https://arxiv.org/abs/2601.09255
作者: Yibo Zhao,Hengjia Li,Xiaofei He,Boxi Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR: PhyReason–PhyPlan–PhyRefine, which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.
zh
[CV-42] Hybrid guided variational autoencoder for visual place recognition
【速读】:该论文旨在解决移动机器人在GPS拒止的室内环境中实现高精度、低功耗且具备良好泛化能力的视觉定位(Visual Place Recognition, VPR)问题。现有VPR模型通常内存占用高,难以部署于移动平台;而轻量模型则往往缺乏鲁棒性和跨场景泛化能力。解决方案的关键在于结合事件相机(event-based vision sensors)与一种新型的引导变分自编码器(guided variational autoencoder, VAE),其中编码器采用适用于低功耗、低延迟神经形态硬件的脉冲神经网络(spiking neural network, SNN)。该架构在新构建的室内VPR数据集上成功解耦出16个不同地点的视觉特征,在光照变化等复杂条件下仍保持高分类性能,并能对未知场景中的输入进行有效区分,展现出优异的泛化能力,从而为移动机器人在已知和未知室内环境中的导航提供了高效可靠的视觉定位方案。
链接: https://arxiv.org/abs/2601.09248
作者: Ni Wang,Zihan You,Emre Neftci,Thorben Schoepe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous agents such as cars, robots and drones need to precisely localize themselves in diverse environments, including in GPS-denied indoor environments. One approach for precise localization is visual place recognition (VPR), which estimates the place of an image based on previously seen places. State-of-the-art VPR models require high amounts of memory, making them unwieldy for mobile deployment, while more compact models lack robustness and generalization capabilities. This work overcomes these limitations for robotics using a combination of event-based vision sensors and a novel event-based guided variational autoencoder (VAE). The encoder part of our model is based on a spiking neural network model which is compatible with power-efficient, low-latency neuromorphic hardware. The VAE successfully disentangles the visual features of 16 distinct places in our new indoor VPR dataset with a classification performance comparable to other state-of-the-art approaches, while showing robust performance under various illumination conditions. When tested with novel visual inputs from unknown scenes, our model can distinguish between these places, which demonstrates a high generalization capability by learning the essential features of location. Our compact and robust guided VAE with generalization capabilities constitutes a promising model for visual place recognition that can significantly enhance mobile robot navigation in known and unknown indoor environments.
zh
[CV-43] Integrating Diverse Assignment Strategies into DETRs
【速读】:该论文旨在解决DETR-style目标检测框架中因一对一匹配策略(one-to-one matching strategy)导致的监督信号稀疏、收敛速度慢的问题。现有方法虽尝试采用一对多分配(one-to-many assignment)以增强监督信号,但往往引入复杂的架构特异性修改,且仅依赖单一辅助策略,缺乏统一性和可扩展性。解决方案的关键在于揭示性能提升并非源于监督数量的增加,而是来自分配策略的多样性;基于此洞察,作者提出LoRA-DETR框架,通过在主网络中嵌入多个低秩适应(Low-Rank Adaptation, LoRA)分支,在训练阶段引入多样化的“一对多”分配规则作为辅助模块,从而注入丰富且差异化的梯度信号,而在推理阶段移除这些分支,实现无额外计算开销的高效优化,兼顾模型简洁性与性能优越性。
链接: https://arxiv.org/abs/2601.09247
作者: Yiwei Zhang,Jin Gao,Hanshi Wang,Fudong Ge,Guan Luo,Weiming Hu,Zhipeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Label assignment is a critical component in object detectors, particularly within DETR-style frameworks where the one-to-one matching strategy, despite its end-to-end elegance, suffers from slow convergence due to sparse supervision. While recent works have explored one-to-many assignments to enrich supervisory signals, they often introduce complex, architecture-specific modifications and typically focus on a single auxiliary strategy, lacking a unified and scalable design. In this paper, we first systematically investigate the effects of "one-to-many" supervision and reveal a surprising insight that performance gains are driven not by the sheer quantity of supervision, but by the diversity of the assignment strategies employed. This finding suggests that a more elegant, parameter-efficient approach is attainable. Building on this insight, we propose LoRA-DETR, a flexible and lightweight framework that seamlessly integrates diverse assignment strategies into any DETR-style detector. Our method augments the primary network with multiple Low-Rank Adaptation (LoRA) branches during training, each instantiating a different one-to-many assignment rule. These branches act as auxiliary modules that inject rich, varied supervisory gradients into the main model and are discarded during inference, thus incurring no additional computational cost. This design promotes robust joint optimization while maintaining the architectural simplicity of the original detector. Extensive experiments on different baselines validate the effectiveness of our approach. Our work presents a new paradigm for enhancing detectors, demonstrating that diverse "one-to-many" supervision can be integrated to achieve state-of-the-art results without compromising model elegance.
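A small sketch of the training-time mechanism: a base linear layer augmented with several LoRA branches, where each branch can carry gradients from a different one-to-many assignment and all branches are skipped at inference. The rank, branch count, and routing scheme are assumptions, not the paper's exact design.

```python
from typing import Optional

import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    """Base linear layer plus several low-rank branches used only during training."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, num_branches: int = 3):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.down = nn.ModuleList(nn.Linear(in_dim, rank, bias=False) for _ in range(num_branches))
        self.up = nn.ModuleList(nn.Linear(rank, out_dim, bias=False) for _ in range(num_branches))
        for u in self.up:
            nn.init.zeros_(u.weight)  # each branch starts as a no-op

    def forward(self, x: torch.Tensor, branch: Optional[int] = None) -> torch.Tensor:
        out = self.base(x)
        if branch is not None:  # training: add one auxiliary low-rank branch
            out = out + self.up[branch](self.down[branch](x))
        return out

layer = MultiLoRALinear(256, 256)
x = torch.randn(2, 256)
print(layer(x, branch=1).shape, layer(x).shape)  # train-time vs. inference path
```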
zh
[CV-44] A2TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation
【速读】:该论文旨在解决现有高斯点渲染(Gaussian Splatting)方法中纹理分配固定、导致内存利用率低且难以适应场景细节变化的问题。其核心解决方案是提出自适应各向异性纹理高斯(Adaptive Anisotropic Textured Gaussians, A²TG),通过为每个高斯基元(primitive)赋予一个各向异性纹理,并采用梯度引导的自适应规则联合确定纹理分辨率与长宽比,实现非均匀、细节感知的纹理分配,从而显著提升纹理效率,在保持图像质量的同时大幅降低内存消耗。
链接: https://arxiv.org/abs/2601.09243
作者: Sheng-Chi Hsu,Ting-Yu Yen,Shih-Hsuan Hung,Hung-Kuo Chu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gaussian Splatting has emerged as a powerful representation for high-quality, real-time 3D scene rendering. While recent works extend Gaussians with learnable textures to enrich visual appearance, existing approaches allocate a fixed square texture per primitive, leading to inefficient memory usage and limited adaptability to scene variability. In this paper, we introduce adaptive anisotropic textured Gaussians (A²TG), a novel representation that generalizes textured Gaussians by equipping each primitive with an anisotropic texture. Our method employs a gradient-guided adaptive rule to jointly determine texture resolution and aspect ratio, enabling non-uniform, detail-aware allocation that aligns with the anisotropic nature of Gaussian splats. This design significantly improves texture efficiency, reducing memory consumption while enhancing image quality. Experiments on multiple benchmark datasets demonstrate that A²TG consistently outperforms fixed-texture Gaussian Splatting methods, achieving comparable rendering fidelity with substantially lower memory requirements.
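An illustrative, deliberately simplified version of a gradient-guided allocation rule: the texel budget scales with accumulated gradient magnitude (a detail proxy), and the aspect ratio follows the Gaussian's scale anisotropy. The budget and rounding scheme are assumptions; the paper's actual rule may differ.

```python
import numpy as np

def texture_shape(grad_mag: float, scale_xy: tuple, budget: int = 64, min_side: int = 2):
    """Pick an anisotropic texture resolution (h, w) for one Gaussian.

    grad_mag: accumulated view-space gradient magnitude in [0, 1] (detail proxy).
    scale_xy: (sx, sy) Gaussian scales; their ratio sets the texture aspect ratio.
    budget:   maximum texel count allowed for the most detailed Gaussians.
    """
    sx, sy = scale_xy
    texels = int(np.clip(budget * grad_mag, min_side * min_side, budget))
    aspect = sx / sy
    w = max(min_side, int(round(np.sqrt(texels * aspect))))
    h = max(min_side, int(round(texels / w)))
    return h, w

print(texture_shape(0.9, (4.0, 1.0)))  # elongated Gaussian -> wide texture
print(texture_shape(0.1, (1.0, 1.0)))  # low detail -> tiny texture
```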
zh
[CV-45] DeTracker: Motion-decoupled Vehicle Detection and Tracking in Unstabilized Satellite Videos
【速读】:该论文旨在解决 unstabilized satellite videos 中多目标跟踪(Multi-Object Tracking, MOT)性能下降的问题,其核心挑战在于平台抖动(platform jitter)与微小目标外观特征弱共同导致的轨迹不稳定和运动估计误差。解决方案的关键是提出 DeTracker 框架,其中包含两个创新模块:一是全局-局部运动解耦(Global–Local Motion Decoupling, GLMD)模块,通过全局对齐与局部精修显式分离平台运动与真实目标运动,提升轨迹稳定性和运动估计精度;二是时间依赖特征金字塔(Temporal Dependency Feature Pyramid, TDFP)模块,实现跨帧时间特征融合,增强微小目标表征的连续性与判别力。
链接: https://arxiv.org/abs/2601.09240
作者: Jiajun Chen,Jing Xiao,Shaohan Cao,Yuming Zhu,Liang Liao,Jun Pan,Mi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Satellite videos provide continuous observations of surface dynamics but pose significant challenges for multi-object tracking (MOT), especially under unstabilized conditions where platform jitter and the weak appearance of tiny objects jointly degrade tracking performance. To address this problem, we propose DeTracker, a joint detection-and-tracking framework tailored for unstabilized satellite videos. DeTracker introduces a Global–Local Motion Decoupling (GLMD) module that explicitly separates satellite platform motion from true object motion through global alignment and local refinement, leading to improved trajectory stability and motion estimation accuracy. In addition, a Temporal Dependency Feature Pyramid (TDFP) module is developed to perform cross-frame temporal feature fusion, enhancing the continuity and discriminability of tiny-object representations. We further construct a new benchmark dataset, SDM-Car-SU, which simulates multi-directional and multi-speed platform motions to enable systematic evaluation of tracking robustness under varying motion perturbations. Extensive experiments on both simulated and real unstabilized satellite videos demonstrate that DeTracker significantly outperforms existing methods, achieving 61.1% MOTA on SDM-Car-SU and 47.3% MOTA on real satellite video data.
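A sketch of the global-alignment step that motion decoupling relies on: estimate a homography between consecutive frames from feature matches and warp the current frame into the previous frame's coordinates, so that residual motion mostly reflects moving vehicles. ORB plus RANSAC is an illustrative choice; the module's learned local refinement is not shown, and the function expects two real grayscale frames with enough texture for matching.

```python
import cv2
import numpy as np

def compensate_platform_motion(prev_gray: np.ndarray, cur_gray: np.ndarray) -> np.ndarray:
    """Warp the current frame into the previous frame's coordinates (global alignment)."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(cur_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)  # current frame
    dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)  # previous frame
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # maps current -> previous
    h, w = prev_gray.shape
    return cv2.warpPerspective(cur_gray, H, (w, h))
```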
zh
[CV-46] Knowledge-Embedded and Hypernetwork-Guided Few-Shot Substation Meter Defect Image Generation Method
【速读】:该论文旨在解决变电站电表(substation meters)缺陷检测中因标注样本严重稀缺而导致的少样本生成难题,从而提升工业视觉检测系统的性能。其解决方案的关键在于构建一个融合知识嵌入(Knowledge Embedding)与超网络引导的条件控制机制(Hypernetwork-Guided Conditional Control)的稳定扩散(Stable Diffusion)框架:首先通过DreamBooth风格的知识嵌入微调预训练模型,缩小自然图像与工业设备之间的域差距,保留电表特有的结构和纹理先验;其次引入几何裂纹建模模块,参数化裂纹的位置、长度、曲率等属性以生成空间约束的控制图;最后设计轻量级超网络动态调节扩散过程,实现生成保真度与可控性的灵活平衡。实验表明,该方法显著优于现有增强与生成基线,在真实数据集上将FID降低32.7%,多样性指标提升,并使下游缺陷检测器的mAP提高15.3%。
链接: https://arxiv.org/abs/2601.09238
作者: Jackie Alex,Justin Petter
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Substation meters play a critical role in monitoring and ensuring the stable operation of power grids, yet their detection of cracks and other physical defects is often hampered by a severe scarcity of annotated samples. To address this few-shot generation challenge, we propose a novel framework that integrates Knowledge Embedding and Hypernetwork-Guided Conditional Control into a Stable Diffusion pipeline, enabling realistic and controllable synthesis of defect images from limited data. First, we bridge the substantial domain gap between natural-image pre-trained models and industrial equipment by fine-tuning a Stable Diffusion backbone using DreamBooth-style knowledge embedding. This process encodes the unique structural and textural priors of substation meters, ensuring generated images retain authentic meter characteristics. Second, we introduce a geometric crack modeling module that parameterizes defect attributes, such as location, length, curvature, and branching pattern, to produce spatially constrained control maps. These maps provide precise, pixel-level guidance during generation. Third, we design a lightweight hypernetwork that dynamically modulates the denoising process of the diffusion model in response to the control maps and high-level defect descriptors, achieving a flexible balance between generation fidelity and controllability. Extensive experiments on a real-world substation meter dataset demonstrate that our method substantially outperforms existing augmentation and generation baselines. It reduces Frechet Inception Distance (FID) by 32.7%, increases diversity metrics, and, most importantly, boosts the mAP of a downstream defect detector by 15.3% when trained on augmented data. The framework offers a practical, high-quality data synthesis solution for industrial inspection systems where defect samples are rare.
zh
[CV-47] CLIDD: Cross-Layer Independent Deformable Description for Efficient and Discriminative Local Feature Representation
【速读】:该论文旨在解决空间智能任务(如机器人导航和增强现实)中对鲁棒局部特征表示的需求,特别是如何在保证高区分度的同时实现计算效率。其核心挑战在于构建既具备强判别能力又适用于实时部署的描述符。解决方案的关键在于提出Cross-Layer Independent Deformable Description (CLIDD)方法:通过从独立的特征层次中直接采样以提升独特性,利用可学习偏移量捕捉多尺度下的细粒度结构细节,同时避免统一密集表示带来的计算负担;此外,结合硬件感知的内核融合策略与轻量化架构及融合度量学习与知识蒸馏的训练协议,实现了模型在不同部署约束下的高效扩展。实验表明,该方案在保持高匹配精度的同时显著降低参数量和计算开销,例如超紧凑版本仅需0.004M参数即可达到SuperPoint的精度,而高性能配置在边缘设备上超过200 FPS且优于当前所有最先进方法。
链接: https://arxiv.org/abs/2601.09230
作者: Haodi Yao,Fenghua He,Ning Hao,Yao Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robust local feature representations are essential for spatial intelligence tasks such as robot navigation and augmented reality. Establishing reliable correspondences requires descriptors that provide both high discriminative power and computational efficiency. To address this, we introduce Cross-Layer Independent Deformable Description (CLIDD), a method that achieves superior distinctiveness by sampling directly from independent feature hierarchies. This approach utilizes learnable offsets to capture fine-grained structural details across scales while bypassing the computational burden of unified dense representations. To ensure real-time performance, we implement a hardware-aware kernel fusion strategy that maximizes inference throughput. Furthermore, we develop a scalable framework that integrates lightweight architectures with a training protocol leveraging both metric learning and knowledge distillation. This scheme generates a wide spectrum of model variants optimized for diverse deployment constraints. Extensive evaluations demonstrate that our approach achieves superior matching accuracy and exceptional computational efficiency simultaneously. Specifically, the ultra-compact variant matches the precision of SuperPoint while utilizing only 0.004M parameters, achieving a 99.7% reduction in model size. Furthermore, our high-performance configuration outperforms all current state-of-the-art methods, including high-capacity DINOv2-based frameworks, while exceeding 200 FPS on edge devices. These results demonstrate that CLIDD delivers high-precision local feature matching with minimal computational overhead, providing a robust and scalable solution for real-time spatial intelligence tasks.
zh
[CV-48] SPOT-Face: Forensic Face Identification using Attention Guided Optimal Transport ICPR
【速读】:该论文旨在解决法医调查中因缺乏常规DNA来源(如毛发、软组织)而导致的人体识别难题,尤其针对骨骼图像与素描图像到人脸图像的跨域识别问题。现有基于深度学习的面部识别方法在处理不同法医模态(如骨骼与素描)之间的结构对应关系时存在不足,难以有效建模跨域特征一致性。解决方案的关键在于提出SPOT-Face框架——一个基于超像素图(superpixel graph)的统一建模方法,通过构建图像的超像素图并采用不同的图神经网络(Graph Neural Networks, GNNs)提取嵌入表示,同时引入注意力引导的最优传输机制(attention-guided optimal transport mechanism)来建立跨域结构对应关系,从而显著提升识别性能(如召回率Recall和平均精度均值mAP)。
链接: https://arxiv.org/abs/2601.09229
作者: Ravi Shankar Prasad,Dinesh Singh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures, 3 tables (ICPR_2026)
Abstract:Person identification in forensic investigations becomes very challenging when common sources of DNA for identification (i.e., hair strands, soft tissue) are not available. Current methods utilize deep learning for face recognition. However, these methods lack effective mechanisms to model cross-domain structural correspondence between two different forensic modalities. In this paper, we introduce SPOT-Face, a superpixel graph-based framework designed for cross-domain forensic face identification of victims using their skeleton and sketch images. Our unified framework involves constructing a superpixel-based graph from an image and then using different graph neural network (GNN) backbones to extract the embeddings of these graphs, while cross-domain correspondence is established through an attention-guided optimal transport mechanism. We have evaluated our proposed framework on two publicly available datasets: IIT_Mandi_S2F (S2F) and CUFS. Extensive experiments show significant improvement in identification metrics (i.e., Recall, mAP) over existing graph-based baselines. Furthermore, our framework proves highly effective for matching skulls and sketches to faces in forensic investigations.
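【代码示意】摘要中“注意力引导的最优传输”的具体构造未给出;这里仅示意熵正则最优传输的标准 Sinkhorn 迭代,代价矩阵取两域 GNN 嵌入的余弦距离(注意力加权方式为假设),供理解跨域对应的计算骨架:

```python
import torch

def sinkhorn(cost, eps=0.1, n_iter=50):
    """熵正则 OT 的 Sinkhorn 迭代(朴素写法,仅作示意)。
    cost: (n, m) 两个图(超像素)节点之间的代价矩阵。"""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)          # Gibbs 核
    u = torch.ones(n)
    for _ in range(n_iter):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # 传输计划 (n, m)

# 用法示意:cost 取 1 - 余弦相似度(由两域 GNN 嵌入构成;注意力权重可乘在 cost 上,此处省略)
emb_skull, emb_face = torch.randn(32, 128), torch.randn(40, 128)
emb_skull = torch.nn.functional.normalize(emb_skull, dim=1)
emb_face = torch.nn.functional.normalize(emb_face, dim=1)
plan = sinkhorn(1 - emb_skull @ emb_face.t())
match_score = (plan * (emb_skull @ emb_face.t())).sum()   # 越大表示两图越匹配
```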
zh
[CV-49] Disentangle Object and Non-object Infrared Features via Language Guidance
【速读】:该论文旨在解决红外目标检测中因图像对比度低、边缘信息弱而导致难以提取判别性目标特征的问题。其解决方案的关键在于提出一种视觉-语言表征学习范式,通过引入富含语义信息的文本监督信号来引导目标与非目标特征的解耦;具体包括两个核心模块:语义特征对齐(Semantic Feature Alignment, SFA)模块用于对齐目标特征与对应文本特征,以及目标特征解耦(Object Feature Disentanglement, OFD)模块通过最小化目标特征与非目标特征之间的相关性实现特征分离,从而在检测头中输入更具判别力且噪声更少的特征,显著提升检测性能。
链接: https://arxiv.org/abs/2601.09228
作者: Fan Liu,Ting Wu,Chuanyi Zhang,Liang Yao,Xing Ma,Yuhui Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared object detection focuses on identifying and locating objects in complex environments (e.g., dark, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are entered into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M³FD (83.7% mAP), FLIR (86.1% mAP). Our code will be publicly available once the paper is accepted.
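【代码示意】下面用两个极简损失函数示意 SFA(目标特征与文本特征对齐)与 OFD(最小化目标/非目标特征相关性)的常见实现方式;特征维度、权重与具体形式均为假设,并非论文原实现:

```python
import torch
import torch.nn.functional as F

def sfa_loss(obj_feat, text_feat):
    """语义对齐:目标特征与对应类别文本特征之间的余弦距离(示意)。"""
    obj = F.normalize(obj_feat, dim=-1)      # (B, C)
    txt = F.normalize(text_feat, dim=-1)     # (B, C)
    return (1 - (obj * txt).sum(dim=-1)).mean()

def ofd_loss(obj_feat, bg_feat):
    """特征解耦:最小化目标特征与非目标(背景)特征之间的相关性(示意)。"""
    obj = obj_feat - obj_feat.mean(dim=0, keepdim=True)
    bg = bg_feat - bg_feat.mean(dim=0, keepdim=True)
    # 跨协方差矩阵趋于 0 即两组特征近似不相关
    cross_cov = obj.t() @ bg / (obj.shape[0] - 1)
    return (cross_cov ** 2).mean()

obj, bg, txt = torch.randn(16, 256), torch.randn(16, 256), torch.randn(16, 256)
total = sfa_loss(obj, txt) + 0.1 * ofd_loss(obj, bg)   # 权重 0.1 为假设值
```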
zh
[CV-50] SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion
【速读】:该论文旨在解决从神经元尖峰信号(neural spike data)中重建自然视觉场景这一关键挑战,其核心目标是实现高分辨率且语义清晰的图像重构。解决方案的关键在于提出一种两阶段框架——SpikeVAEDiff:第一阶段利用超深度变分自编码器(Very Deep Variational Autoencoder, VDVAE)将神经尖峰数据映射到潜在空间以生成低分辨率初步图像;第二阶段通过回归模型将尖峰信号映射至CLIP-Vision和CLIP-Text特征空间,驱动通用扩散模型(Versatile Diffusion)进行图像到图像的精细化生成。该方法显著优于基于fMRI的方法,在时空分辨率上更具优势,并验证了特定脑区(如VISI区域)对重建质量的重要贡献。
链接: https://arxiv.org/abs/2601.09213
作者: Jialu Li,Taiyan Zhou
机构: HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.
zh
[CV-51] Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation AAAI2026
【速读】:该论文旨在解决自回归图像生成(autoregressive image generation)中推理速度慢的问题,其根源在于AR模型的序列特性以及图像标记(image tokens)的模糊性,即便采用推测解码(speculative decoding, SD)方法仍难以显著提升效率。现有研究尝试通过松弛推测解码(relaxed speculative decoding)来缓解该问题,但缺乏理论支撑。本文提出COOL-SD,一种基于两个关键洞察的退火式松弛推测解码方案:其一,通过分析目标模型与松弛推测解码之间的总变差(total variation, TV)距离,推导出最小化该距离上界的最优重采样分布;其二,借助微扰分析揭示了松弛推测解码中的退火行为,从而指导设计退火策略。这两个核心洞察共同使COOL-SD在保持图像质量的同时实现更快生成速度,或在相近延迟下获得更优质量,实验验证了其在速度-质量权衡上的持续优势。
链接: https://arxiv.org/abs/2601.09212
作者: Xingyao Li,Fengzhuo Zhang,Cunxiao Du,Hui Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to AAAI 2026
Abstract:Despite significant progress in autoregressive image generation, inference remains slow due to the sequential nature of AR models and the ambiguity of image tokens, even when using speculative decoding. Recent works attempt to address this with relaxed speculative decoding but lack theoretical grounding. In this paper, we establish the theoretical basis of relaxed SD and propose COOL-SD, an annealed relaxation of speculative decoding built on two key insights. The first analyzes the total variation (TV) distance between the target model and relaxed speculative decoding and yields an optimal resampling distribution that minimizes an upper bound of the distance. The second uses perturbation analysis to reveal an annealing behaviour in relaxed speculative decoding, motivating our annealed design. Together, these insights enable COOL-SD to generate images faster with comparable quality, or achieve better quality at similar latency. Experiments validate the effectiveness of COOL-SD, showing consistent improvements over prior methods in speed-quality trade-offs.
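【代码示意】论文推导的最优重采样分布与退火策略无法在此复现;下面仅给出“带松弛接受概率的推测解码单步”骨架,其中松弛形式 (p/q)^(1/τ) 与线性退火均为本文注释者的假设,τ=1 时退化为标准推测解码:

```python
import torch

def relaxed_speculative_step(p_target, q_draft, draft_token, tau=1.0):
    """单个 token 的松弛接受/重采样(示意)。
    p_target, q_draft: 目标/草稿模型在词表上的概率分布 (V,)
    tau >= 1 表示放宽接受条件;tau=1 退化为标准推测解码。"""
    ratio = p_target[draft_token] / q_draft[draft_token].clamp_min(1e-9)
    accept_prob = ratio.pow(1.0 / tau).clamp(max=1.0)      # 松弛后的接受概率(假设形式)
    if torch.rand(()) < accept_prob:
        return draft_token, True
    # 拒绝时按残差分布重采样(标准做法;论文另外给出最小化 TV 距离上界的最优分布)
    residual = (p_target - q_draft).clamp_min(0)
    residual = residual / residual.sum().clamp_min(1e-9)
    return torch.multinomial(residual, 1).item(), False

def anneal_tau(step, total_steps, tau_start=2.0, tau_end=1.0):
    """随生成进度线性退火松弛系数(示意)。"""
    t = step / max(total_steps - 1, 1)
    return tau_start + (tau_end - tau_start) * t
```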
zh
[CV-52] Affostruction: 3D Affordance Grounding with Generative Reconstruction
【速读】:该论文旨在解决从RGBD图像中实现可操作性定位(affordance grounding)的问题,即在物体表面识别出与文本描述的动作语义相对应的区域。现有方法仅能在可见表面上预测可操作区域,忽略了未观测区域的信息。其解决方案的关键在于提出Affostruction框架:首先通过稀疏体素融合的生成式多视角重建技术,在保持固定标记符复杂度的前提下外推未观测几何结构;其次采用基于流的可操作性定位方法,以捕捉可操作性分布中的固有不确定性;最后引入可操作性驱动的主动视角选择机制,利用预测的可操作性信息优化视点采样策略,从而实现对完整形状上准确的可操作性预测。
链接: https://arxiv.org/abs/2601.09211
作者: Chunghyun Park,Seunghyeon Lee,Minsu Cho
机构: POSTECH(浦项工科大学); Ewha Womans University(梨花女子大学); RLWRLD
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape including unobserved regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures inherent ambiguity in affordance distributions, and affordance-driven active view selection that leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (40.4% improvement) and 32.67 IoU for 3D reconstruction (67.7% improvement), enabling accurate affordance prediction on complete shapes.
zh
[CV-53] Pairing-free Group-level Knowledge Distillation for Robust Gastrointestinal Lesion Classification in White-Light Endoscopy AAAI2026
【速读】:该论文旨在解决白光成像(White-Light Imaging, WLI)在内镜癌症筛查中诊断细节不足的问题,其核心挑战在于如何将窄带成像(Narrow-Band Imaging, NBI)中的丰富诊断信息有效迁移至仅使用WLI数据的模型中。传统知识蒸馏方法依赖于配对的WLI-NBI图像,而这一要求在临床实践中成本高且难以实现,导致大量未配对数据被闲置。论文提出的解决方案是PaGKD(Pairing-free Group-level Knowledge Distillation),其关键创新在于摒弃图像级对应关系,转而在组级别进行跨模态知识蒸馏:通过Group-level Prototype Distillation(GKD-Pro)提取模态不变的语义原型以构建紧凑的组表示,并利用Group-level Dense Distillation(GKD-Den)通过激活衍生的关系图引导组感知注意力,实现密集的跨模态对齐。该框架在不依赖图像配对的前提下,同时保障了全局语义一致性与局部结构连贯性,显著提升了模型性能,在四个临床数据集上分别实现了3.3%、1.1%、2.8%和3.2%的AUC相对提升。
链接: https://arxiv.org/abs/2601.09209
作者: Qiang Hu,Qimei Wang,Yingjie Guo,Qiang Li,Zhiwei Wang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026
Abstract:White-Light Imaging (WLI) is the standard for endoscopic cancer screening, but Narrow-Band Imaging (NBI) offers superior diagnostic details. A key challenge is transferring knowledge from NBI to enhance WLI-only models, yet existing methods are critically hampered by their reliance on paired NBI-WLI images of the same lesion, a costly and often impractical requirement that leaves vast amounts of clinical data untapped. In this paper, we break this paradigm by introducing PaGKD, a novel Pairing-free Group-level Knowledge Distillation framework that enables effective cross-modal learning using unpaired WLI and NBI data. Instead of forcing alignment between individual, often semantically mismatched image instances, PaGKD operates at the group level to distill more complete and compatible knowledge across modalities. Central to PaGKD are two complementary modules: (1) Group-level Prototype Distillation (GKD-Pro) distills compact group representations by extracting modality-invariant semantic prototypes via shared lesion-aware queries; (2) Group-level Dense Distillation (GKD-Den) performs dense cross-modal alignment by guiding group-aware attention with activation-derived relation maps. Together, these modules enforce global semantic consistency and local structural coherence without requiring image-level correspondence. Extensive experiments on four clinical datasets demonstrate that PaGKD consistently and significantly outperforms state-of-the-art methods, achieving relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%, respectively, establishing a new direction for cross-modal learning from unpaired data.
zh
[CV-54] Point Tracking as a Temporal Cue for Robust Myocardial Segmentation in Echocardiography Videos
【速读】:该论文旨在解决超声心动图视频中心肌分割的挑战性问题,主要难点包括图像对比度低、噪声干扰以及解剖结构的个体差异性。传统深度学习模型通常独立处理每一帧,忽略了时间信息;或依赖基于记忆的特征传播机制,导致误差随时间累积。其解决方案的关键在于提出Point-Seg框架——一种基于Transformer的分割方法,通过引入点跟踪模块作为显式的时序线索来增强跨帧一致性。该模块在合成超声心动图数据集上训练以追踪关键解剖标志点,并利用其轨迹提供运动感知信号,从而减少漂移并避免记忆特征积累。此外,结合时间平滑损失进一步提升帧间一致性,最终实现高精度且稳定的分割结果,同时输出像素级心肌运动信息,为心肌应变测量等下游任务提供支持。
链接: https://arxiv.org/abs/2601.09207
作者: Bahar Khodabakhshian,Nima Hashemi,Armin Saadat,Zahra Gholami,In-Chang Hwang,Samira Sojoudi,Christina Luong,Purang Abolmaesumi,Teresa Tsang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: Myocardium segmentation in echocardiography videos is a challenging task due to low contrast, noise, and anatomical variability. Traditional deep learning models either process frames independently, ignoring temporal information, or rely on memory-based feature propagation, which accumulates error over time. Methods: We propose Point-Seg, a transformer-based segmentation framework that integrates point tracking as a temporal cue to ensure stable and consistent segmentation of myocardium across frames. Our method leverages a point-tracking module trained on a synthetic echocardiography dataset to track key anatomical landmarks across video sequences. These tracked trajectories provide an explicit motion-aware signal that guides segmentation, reducing drift and eliminating the need for memory-based feature accumulation. Additionally, we incorporate a temporal smoothing loss to further enhance temporal consistency across frames. Results: We evaluate our approach on both public and private echocardiography datasets. Experimental results demonstrate that Point-Seg has statistically similar accuracy in terms of Dice to state-of-the-art segmentation models in high quality echo data, while it achieves better segmentation accuracy in lower quality echo with improved temporal stability. Furthermore, Point-Seg has the key advantage of pixel-level myocardium motion information as opposed to other segmentation methods. Such information is essential in the computation of other downstream tasks such as myocardial strain measurement and regional wall motion abnormality detection. Conclusion: Point-Seg demonstrates that point tracking can serve as an effective temporal cue for consistent video segmentation, offering a reliable and generalizable approach for myocardium segmentation in echocardiography videos. The code is available at this https URL.
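【代码示意】摘要提到的时间平滑损失可以用“相邻帧预测差异惩罚”来理解;下面是一个通用示意(实际方法可能先用跟踪点对齐后再比较,此处从简,并非论文的原始定义):

```python
import torch

def temporal_smoothing_loss(probs):
    """probs: (B, T, H, W) 各帧心肌分割的前景概率。
    惩罚相邻帧预测的逐像素差异,鼓励时间上一致的分割(示意)。"""
    diff = probs[:, 1:] - probs[:, :-1]          # 相邻帧差分 (B, T-1, H, W)
    return diff.abs().mean()

logits = torch.randn(2, 8, 128, 128, requires_grad=True)
loss = temporal_smoothing_loss(torch.sigmoid(logits))
loss.backward()
```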
zh
[CV-55] From Performance to Practice: Knowledge-Distilled Segmentator for On-Premises Clinical Workflows
【速读】:该论文旨在解决医学图像分割模型在临床实际部署中面临的挑战,即高容量模型虽具备优异的分割精度,但其计算资源需求与医院本地(on-premises)基础设施的固定算力及云推理安全策略存在冲突,导致难以长期维护和规模化应用。解决方案的关键在于提出一种面向部署的框架,利用知识蒸馏(knowledge distillation)技术将高性能教师模型的知识迁移至一系列轻量化学生模型,从而在不改变现有推理流程的前提下实现模型容量的系统性压缩,同时保持架构兼容性与跨模态泛化能力。实验表明,在参数量减少94%的情况下,学生模型仍能保留教师模型98.7%的分割精度,并显著提升效率(CPU推理延迟降低67%),为研究级模型向临床可维护、可部署组件的转化提供了可靠路径。
链接: https://arxiv.org/abs/2601.09191
作者: Qizhen Lan,Aaron Choi,Jun Ma,Bo Wang,Zhaogming Zhao,Xiaoqian Jiang,Yu-Chun Hsu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deploying medical image segmentation models in routine clinical workflows is often constrained by on-premises infrastructure, where computational resources are fixed and cloud-based inference may be restricted by governance and security policies. While high-capacity models achieve strong segmentation accuracy, their computational demands hinder practical deployment and long-term maintainability in hospital environments. We present a deployment-oriented framework that leverages knowledge distillation to translate a high-performing segmentation model into a scalable family of compact student models, without modifying the inference pipeline. The proposed approach preserves architectural compatibility with existing clinical systems while enabling systematic capacity reduction. The framework is evaluated on a multi-site brain MRI dataset comprising 1,104 3D volumes, with independent testing on 101 curated cases, and is further examined on abdominal CT to assess cross-modality generalizability. Under aggressive parameter reduction (94%), the distilled student model preserves nearly all of the teacher’s segmentation accuracy (98.7%), while achieving substantial efficiency gains, including up to a 67% reduction in CPU inference latency without additional deployment overhead. These results demonstrate that knowledge distillation provides a practical and reliable pathway for converting research-grade segmentation models into maintainable, deployment-ready components for on-premises clinical workflows in real-world health systems.
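【代码示意】该工作采用的蒸馏细节未在摘要中给出;下面是分割任务中常见的“温度软化 KL + 硬标签交叉熵”像素级蒸馏损失写法,温度与权重均为假设值:

```python
import torch
import torch.nn.functional as F

def seg_distill_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """student/teacher_logits: (B, K, H, W);target: (B, H, W) 整数标签。"""
    # 像素级软标签蒸馏:对类别维做温度软化后的 KL 散度
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean") * (T * T)
    # 硬标签监督
    ce = F.cross_entropy(student_logits, target)
    return alpha * kd + (1 - alpha) * ce

s = torch.randn(2, 4, 64, 64, requires_grad=True)
t = torch.randn(2, 4, 64, 64)
y = torch.randint(0, 4, (2, 64, 64))
seg_distill_loss(s, t, y).backward()
```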
zh
[CV-56] N-EIoU-YOLOv9: A Signal-Aware Bounding Box Regression Loss for Lightweight Mobile Detection of Rice Leaf Diseases
【速读】:该论文旨在解决农业病害图像中目标检测精度不足的问题,特别是针对小尺寸和低对比度目标的定位困难与梯度干扰严重等挑战。其核心解决方案是提出一种基于非单调梯度聚焦与几何解耦原理的新型边界框回归损失函数——N EIoU(Non monotonic Efficient Intersection over Union),通过重构定位梯度、分离宽高优化路径,增强难样本的弱回归信号并降低梯度冲突,从而提升模型对复杂农业场景下微小病害区域的检测鲁棒性与准确性。
链接: https://arxiv.org/abs/2601.09170
作者: Dung Ta Nguyen Duc,Thanh Bui Dang,Hoang Le Minh,Tung Nguyen Viet,Huong Nguyen Thanh,Dong Trinh Cong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In this work, we propose N-EIoU-YOLOv9, a lightweight detection framework based on a signal-aware bounding box regression loss derived from non-monotonic gradient focusing and geometric decoupling principles, referred to as N-EIoU (Non-monotonic Efficient Intersection over Union). The proposed loss reshapes localization gradients by combining non-monotonic focusing with decoupled width and height optimization, thereby enhancing weak regression signals for hard samples with low overlap while reducing gradient interference. This design is particularly effective for small and low-contrast targets commonly observed in agricultural disease imagery. The proposed N-EIoU loss is integrated into a lightweight YOLOv9t architecture and evaluated on a self-collected field dataset comprising 5908 rice leaf images across four disease categories and healthy leaves. Experimental results demonstrate consistent performance gains over the standard CIoU loss, achieving a mean Average Precision of 90.3 percent, corresponding to a 4.3 percent improvement over the baseline, with improved localization accuracy under stricter evaluation criteria. For practical validation, the optimized model is deployed on an Android device using TensorFlow Lite with Float16 quantization, achieving an average inference time of 156 milliseconds per frame while maintaining accuracy. These results confirm that the proposed approach effectively balances accuracy, optimization stability, and computational efficiency for edge-based agricultural monitoring systems.
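【代码示意】下面给出 EIoU(IoU + 中心距离 + 解耦宽高惩罚)的标准实现,并附一个示意性的非单调聚焦因子;聚焦因子的具体函数形式是本文注释者的假设,并非论文中 N-EIoU 的定义:

```python
import torch

def n_eiou_loss(pred, target, gamma=2.0, eps=1e-7):
    """pred/target: (N, 4),格式 (x1, y1, x2, y2)。
    EIoU 部分为标准定义;非单调聚焦因子 r 的形式仅作示意。"""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    # 交并比
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter + eps)
    # 最小外接框的宽、高与对角线
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps
    # 中心距离 + 解耦的宽/高差异(EIoU)
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4
    dw2 = ((px2 - px1) - (tx2 - tx1)) ** 2
    dh2 = ((py2 - py1) - (ty2 - ty1)) ** 2
    eiou = 1 - iou + rho2 / c2 + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps)
    # 非单调聚焦:对“中等难度”样本给更大权重(此处形式仅为假设)
    beta = (1 - iou).detach() / (1 - iou).detach().mean().clamp(min=eps)
    r = beta * torch.exp(-(beta - 1) ** 2 / gamma)
    return (r * eiou).mean()

loss = n_eiou_loss(torch.rand(8, 2).sort(-1).values.repeat(1, 2)[:, [0, 2, 1, 3]],
                   torch.tensor([[0.1, 0.1, 0.6, 0.6]]).repeat(8, 1))
```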
zh
[CV-57] Architecture inside the mirage: evaluating generative image models on architectural style elements and typologies
【速读】:该论文旨在解决生成式 AI (Generative AI) 文本到图像系统在建筑领域中生成图像的准确性问题,特别是在一个受历史规范严格约束的学科背景下,其复现精确建筑图像的能力尚未被充分评估。研究通过设计30个涵盖不同风格、类型和编码元素的建筑提示词,对五个主流GenAI平台(Adobe Firefly、DALL-E 3、Google Imagen 3、Microsoft Image Generator 和 Midjourney)进行系统测试,每组提示生成四张图像(共600张),并由两位建筑史专家独立评分,以量化准确性。关键发现表明:所有平台整体准确率较低(最高52%,最低32%,平均42%),且常见错误包括过度装饰、中世纪风格与其后世复兴风格混淆以及对描述性术语(如“蛋与箭”饰带、条纹柱、悬链拱顶)的误读。因此,解决方案的核心在于推动GenAI合成内容的可见标注、建立未来训练数据集的溯源标准,并在教育场景中审慎使用GenAI生成的建筑图像。
链接: https://arxiv.org/abs/2601.09169
作者: Jamie Magrill(1),Leah Gornstein(1),Sandra Seekins(2),Barry Magrill(2) ((1) McGill University, Montreal, Canada, (2) Capilano University, North Vancouver, Canada)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 24 pages, 7 figures
Abstract:Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to generate architectural imagery, yet their capacity to reproduce accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images (n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as zero to four accurate images per four-image set. Image output from Common prompts was 2.7-fold more accurate than from Rare prompts (p < 0.05). Across platforms, overall accuracy was limited (highest accuracy score 52 percent; lowest 32 percent; mean 42 percent). All-correct (4 out of 4) outcomes were similar across platforms. By contrast, all-incorrect (0 out of 4) outcomes varied substantially, with Imagen 3 exhibiting the fewest failures and Microsoft Image Generator exhibiting the highest number of failures. Qualitative review of the image dataset identified recurring patterns including over-embellishment, confusion between medieval styles and their later revivals, and misrepresentation of descriptive prompts (for example, egg-and-dart, banded column, pendentive). These findings support the need for visible labeling of GenAI synthetic content, provenance standards for future training datasets, and cautious educational use of GenAI architectural imagery.
zh
[CV-58] From Snow to Rain: Evaluating Robustness Calibration and Complexity of Model-Based Robust Training
【速读】:该论文旨在解决深度学习模型在面对自然退化(natural corruptions)时鲁棒性不足的问题,尤其在交通安全等高风险场景中,模型的可靠性至关重要。其解决方案的关键在于引入基于模型的训练方法,利用学习到的干扰变异模型(nuisance variation model)生成逼真的自然退化数据,并结合随机覆盖与扰动空间中的对抗精修(adversarial refinement)策略,从而提升模型对复杂环境变化的适应能力。实验表明,基于模型的对抗训练在各类退化条件下均表现出最强鲁棒性,而基于模型的数据增强则在保持接近性能的同时显著降低计算复杂度,验证了学习到的干扰模型对捕捉真实世界变异性的重要性。
链接: https://arxiv.org/abs/2601.09153
作者: Josué Martínez-Martínez,Olivia Brown,Giselle Zeno,Pooya Khorrami,Rajmonda Caceres
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
Abstract:Robustness to natural corruptions remains a critical challenge for reliable deep learning, particularly in safety-sensitive domains. We study a family of model-based training approaches that leverage a learned nuisance variation model to generate realistic corruptions, as well as new hybrid strategies that combine random coverage with adversarial refinement in nuisance space. Using the Challenging Unreal and Real Environments for Traffic Sign Recognition dataset (CURE-TSR), with Snow and Rain corruptions, we evaluate accuracy, calibration, and training complexity across corruption severities. Our results show that model-based methods consistently outperform the Vanilla, Adversarial Training, and AugMix baselines: model-based adversarial training provides the strongest robustness across all corruptions but at the expense of higher computation, while model-based data augmentation achieves comparable robustness with substantially less computational complexity and without a statistically significant drop in performance. These findings highlight the importance of learned nuisance models for capturing natural variability, and suggest a promising path toward more resilient and calibrated models under challenging conditions.
zh
[CV-59] SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection
【速读】:该论文旨在解决零样本异常检测(Zero-Shot Anomaly Detection, ZSAD)中因单一视觉骨干网络导致的全局语义泛化能力与细粒度结构判别力难以平衡的问题。现有方法受限于固定视觉编码器,无法有效融合多尺度结构先验信息,从而影响对局部异常模式的精准定位。其解决方案的关键在于提出协同语义-视觉提示机制(Synergistic Semantic-Visual Prompting, SSVP),通过引入分层语义-视觉协同机制(Hierarchical Semantic-Visual Synergy, HSVS),将DINOv3的多尺度结构先验深度融合至CLIP语义空间;同时设计视觉条件提示生成器(Vision-Conditioned Prompt Generator, VCPG),利用跨模态注意力动态生成锚定特定异常模式的语言提示,并结合视觉-文本异常映射器(Visual-Text Anomaly Mapper, VTAM)建立双门控校准范式,以弥合全局评分与局部证据之间的差异,显著提升检测精度。
链接: https://arxiv.org/abs/2601.09147
作者: Chenhao Fu,Han Fang,Xiuzheng Zheng,Wenbo Wei,Yonghua Li,Hao Sun,Xuelong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), that efficiently fuses diverse visual encodings to elevate model’s fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3’s multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
zh
[CV-60] SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
【速读】:该论文旨在解决通用大视觉语言模型(Large Vision-Language Models, LVLMs)在皮肤科诊断任务中因“扩散注意力”(diffuse attention)导致的性能瓶颈问题,即模型难以从背景噪声中准确分离细微病灶特征。其解决方案的关键在于提出SkinFlow框架,通过引入虚拟宽度动态视觉编码器(Virtual-Width Dynamic Vision Encoder, DVE)在不增加物理参数量的前提下“展开”复杂病理流形,从而提升视觉信息传输效率;同时采用两阶段强化学习策略:第一阶段对齐显式医学描述,第二阶段重构隐式诊断纹理,均约束于语义空间内,最终实现更精准的诊断推理。
链接: https://arxiv.org/abs/2601.09136
作者: Lijun Liu,Linwei Chen,Zhishou Zhang,Meng Tian,Hengfu Cui,Ruiyang Li,Zhaocheng Liu,Qiang Ju,Qianxi Li,Hong-Yu Zhou
机构: Baichuan Inc.; Peking University First Hospital; Tsinghua University; University of Hong Kong
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to “diffuse attention” - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to “unfold” complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.
zh
[CV-61] Beyond Seen Bounds: Class-Centric Polarization for Single-Domain Generalized Deep Metric Learning
【速读】:该论文旨在解决单域广义深度度量学习(Single-domain generalized deep metric learning, SDG-DML)在测试阶段同时面临类别偏移(category shift)和域偏移(domain shift)的问题,从而限制其在真实场景中的泛化能力。现有方法通常采用基于代理的域扩展策略来生成分布外样本,但这类方法易导致样本聚集于类中心附近,无法有效模拟实际中广泛且遥远的域偏移。为缓解此问题,本文提出CenterPolar框架,其核心创新在于设计了两个协同工作的类中心极化阶段:(1) 类中心离心扩展(Class-Centric Centrifugal Expansion, C³E),通过将源域数据远离类中心进行扩展以增强对未见域的适应性;(2) 类中心向心约束(Class-Centric Centripetal Constraint, C⁴),在保持类内紧凑性的同时强化类间分离,从而提升对未见类别的泛化能力。该双阶段机制共同实现了对目标域分布的动态扩展与约束,显著提升了模型的跨域和跨类泛化性能。
链接: https://arxiv.org/abs/2601.09121
作者: Xin Yuan,Meiqi Wan,Wei Liu,Xin Xu,Zheng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ACM TOMM
Abstract:Single-domain generalized deep metric learning (SDG-DML) faces the dual challenge of both category and domain shifts during testing, limiting real-world applications. Therefore, aiming to learn better generalization ability on both unseen categories and domains is a realistic goal for the SDG-DML task. To deliver the aspiration, existing SDG-DML methods employ the domain expansion-equalization strategy to expand the source data and generate out-of-distribution samples. However, these methods rely on proxy-based expansion, which tends to generate samples clustered near class proxies, failing to simulate the broad and distant domain shifts encountered in practice. To alleviate the problem, we propose CenterPolar, a novel SDG-DML framework that dynamically expands and constrains domain distributions to learn a generalizable DML model for wider target domain distributions. Specifically, CenterPolar contains two collaborative class-centric polarization phases: (1) Class-Centric Centrifugal Expansion (C³E) and (2) Class-Centric Centripetal Constraint (C⁴). In the first phase, C³E drives the source domain distribution by shifting the source data away from class centroids using centrifugal expansion to generalize to more unseen domains. In the second phase, to consolidate domain-invariant class information for the generalization ability to unseen categories, C⁴ pulls all seen and unseen samples toward their class centroids while enforcing inter-class separation via centripetal constraint. Extensive experimental results on widely used CUB-200-2011 Ext., Cars196 Ext., DomainNet, PACS, and Office-Home datasets demonstrate the superiority and effectiveness of our CenterPolar over existing state-of-the-art methods. The code will be released after acceptance.
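【代码示意】“离心扩展 / 向心约束”的思想可以用类中心与特征的几何关系简要表达;下面的步长、margin 与损失形式均为假设,仅示意两阶段的基本操作:

```python
import torch
import torch.nn.functional as F

def class_centroids(feats, labels, num_classes):
    """按标签对特征求均值,得到各类中心。feats: (N, D), labels: (N,) int64"""
    cent = torch.zeros(num_classes, feats.shape[1], device=feats.device)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            cent[c] = feats[mask].mean(dim=0)
    return F.normalize(cent, dim=1)

def centrifugal_expand(feats, labels, centroids, step=0.5):
    """离心扩展:沿“特征 - 类中心”方向外推,得到远离类中心的扩展样本(示意)。"""
    direction = F.normalize(feats - centroids[labels], dim=1)
    return F.normalize(feats + step * direction, dim=1)

def centripetal_loss(feats, labels, centroids, margin=0.3):
    """向心约束:拉近样本与自身类中心,同时推远最相近的异类中心(示意)。"""
    feats = F.normalize(feats, dim=1)
    sim = feats @ centroids.t()                        # (N, K) 余弦相似度
    pos = sim.gather(1, labels.view(-1, 1)).squeeze(1)
    neg = sim.scatter(1, labels.view(-1, 1), -1.0).max(dim=1).values
    return F.relu(margin + neg - pos).mean()

feats, labels = torch.randn(32, 128), torch.randint(0, 5, (32,))
cent = class_centroids(feats, labels, 5)
loss = centripetal_loss(centrifugal_expand(feats, labels, cent), labels, cent)
```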
zh
[CV-62] LPCAN: Lightweight Pyramid Cross-Attention Network for Rail Surface Defect Detection Using RGB-D Data
【速读】:该论文旨在解决当前基于视觉的铁路缺陷检测方法中存在的计算复杂度高、参数量过大以及准确率不足等问题。其核心解决方案是提出一种轻量级金字塔交叉注意力网络(Lightweight Pyramid Cross-Attention Network, LPCANet),该架构以MobileNetv2作为RGB特征提取主干,结合轻量化金字塔模块(Lightweight Pyramid Module, LPM)处理深度信息,并引入交叉注意力机制(Cross-Attention Mechanism, CAM)实现多模态融合,同时设计空间特征提取器(Spatial Feature Extractor, SFE)增强结构化特征分析能力。实验表明,LPCANet在保持仅9.90百万参数、2.50 G FLOPs和162.60 fps推理速度的前提下,显著优于18种现有方法,在Sα、IOU和MAE指标上分别提升+1.48%、+0.86%和+1.77%,验证了其高效性与优越性能。
链接: https://arxiv.org/abs/2601.09118
作者: Jackie Alex,Guoqiang Huan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper addresses the limitations of current vision-based rail defect detection methods, including high computational complexity, excessive parameter counts, and suboptimal accuracy. We propose a Lightweight Pyramid Cross-Attention Network (LPCANet) that leverages RGB-D data for efficient and accurate defect identification. The architecture integrates MobileNetv2 as a backbone for RGB feature extraction with a lightweight pyramid module (LPM) for depth processing, coupled with a cross-attention mechanism (CAM) for multimodal fusion and a spatial feature extractor (SFE) for enhanced structural analysis. Comprehensive evaluations on three unsupervised RGB-D rail datasets (NEU-RSDDS-AUG, RSDD-TYPE1, RSDD-TYPE2) demonstrate that LPCANet achieves state-of-the-art performance with only 9.90 million parameters, 2.50 G FLOPs, and 162.60 fps inference speed. Compared to 18 existing methods, LPCANet shows significant improvements, including +1.48% in Sα, +0.86% in IOU, and +1.77% in MAE over the best-performing baseline. Ablation studies confirm the critical roles of CAM and SFE, while experiments on non-rail datasets (DAGM2007, MT, Kolektor-SDD2) validate its generalization capability. The proposed framework effectively bridges traditional and deep learning approaches, offering substantial practical value for industrial defect inspection. Future work will focus on further model compression for real-time deployment.
zh
[CV-63] LP-LLM : End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models
【速读】:该论文旨在解决真实场景下车牌识别(License Plate Recognition, LPR)因严重退化(如运动模糊、低分辨率和复杂光照)导致的识别性能下降问题。现有“先恢复后识别”的两阶段方法存在根本性缺陷:图像恢复模型的像素级优化目标与字符识别的语义目标不一致,易引入伪影并造成误差累积。为此,作者提出一种基于Qwen3-VL的端到端结构感知多模态推理框架,其核心创新在于设计了字符感知多模态推理模块(Character-Aware Multimodal Reasoning Module, CMRM),通过可学习的字符槽查询(Character Slot Queries)与视觉特征进行交叉注意力交互,主动提取对应字符位置的细粒度证据,并将这些结构化表示以残差调制方式注入视觉token,使语言模型能够基于显式的结构先验进行自回归生成。该方案有效融合了结构建模与多模态理解能力,显著提升了低质量文本识别性能。
链接: https://arxiv.org/abs/2601.09116
作者: Haoyan Gong,Hongbin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing “restoration-then-recognition” two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition, leading to artifact interference and error accumulation. While Vision-Language Models (VLMs) have demonstrated powerful general capabilities, they lack explicit structural modeling for license plate character sequences (e.g., fixed length, specific order). To address this, we propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. The core innovation lies in the Character-Aware Multimodal Reasoning Module (CMRM), which introduces a set of learnable Character Slot Queries. Through a cross-attention mechanism, these queries actively retrieve fine-grained evidence corresponding to character positions from visual features. Subsequently, we inject these character-aware representations back into the visual tokens via residual modulation, enabling the language model to perform autoregressive generation based on explicit structural priors. Furthermore, combined with the LoRA parameter-efficient fine-tuning strategy, the model achieves domain adaptation while retaining the generalization capabilities of the large model. Extensive experiments on both synthetic and real-world severely degraded datasets demonstrate that our method significantly outperforms existing restoration-recognition combinations and general VLMs, validating the superiority of incorporating structured reasoning into large models for low-quality text recognition tasks.
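【代码示意】下面用极简 PyTorch 模块示意“可学习字符槽查询经交叉注意力检索视觉证据、再以门控残差调制回视觉 token”的结构;维度、槽数量与门控形式均为假设,并非论文的 CMRM 实现:

```python
import torch
import torch.nn as nn

class CharSlotCrossAttention(nn.Module):
    """示意:N 个可学习字符槽查询从视觉 token 中检索逐字符证据,
    再把检索结果以门控残差方式调制回视觉 token。"""
    def __init__(self, dim=768, num_slots=8, num_heads=8):
        super().__init__()
        self.slot_queries = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.slot_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.back_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, vis_tokens):
        # vis_tokens: (B, L, D) 视觉编码器输出
        B = vis_tokens.shape[0]
        queries = self.slot_queries.unsqueeze(0).expand(B, -1, -1)     # (B, N, D)
        # 1) 字符槽查询通过交叉注意力检索各字符位置的细粒度证据
        slots, _ = self.slot_attn(queries, vis_tokens, vis_tokens)      # (B, N, D)
        # 2) 视觉 token 反向查询字符槽,并以门控残差方式注入结构先验
        delta, _ = self.back_attn(vis_tokens, slots, slots)             # (B, L, D)
        return vis_tokens + self.gate(vis_tokens) * delta

x = torch.randn(2, 196, 768)
out = CharSlotCrossAttention()(x)        # (2, 196, 768),可作为语言模型的视觉输入
```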
zh
[CV-64] owards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)任务中因环境和指令多样性导致的泛化能力不足问题,尤其针对开放世界场景下模型在未见环境中难以有效执行导航任务的挑战。传统方法通常假设训练与测试数据来自相同分布(即闭集假设),但在真实开放环境中,输入图像风格和语言指令往往存在显著差异,从而限制了模型的适应性。为应对这一问题,作者提出了一种动态交互式快慢推理框架(slow4fast-VLN),其核心创新在于构建了一个由快速推理模块(fast-reasoning module)和慢速推理模块(slow-reasoning module)协同工作的机制:快速模块通过端到端策略网络实时输出动作并积累执行记录至历史记忆库;慢速模块则对这些记忆进行深度反思,提取有助于提升决策泛化能力的经验,并结构化存储以持续优化快速模块。该方案突破了以往将快慢推理视为独立机制的传统范式,实现了两者的动态交互,使系统能够在面对未知场景时实现持续适应与高效导航。
链接: https://arxiv.org/abs/2601.09111
作者: Yang Li,Aming Wu,Zihao Zhang,Yahong Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Navigation aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a closed-set assumption, i.e., training and test data share the same style of input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for closed-set methods. To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent instructions. In this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that by means of fast and slow cognition systems, human beings can generate stable policies, which strengthens their adaptation to the open world. Inspired by this idea, we propose slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework. The fast-reasoning module, an end-to-end strategy network, outputs actions from real-time input. It accumulates execution records in a history repository to build memory. The slow-reasoning module analyzes the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction by leveraging the experiences from slow reasoning. This interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios.
zh
[CV-65] SAM-Aug: Leverag ing SAM Priors for Few-Shot Parcel Segmentation in Satellite Time Series
【速读】:该论文旨在解决遥感时序图像中少样本语义分割(few-shot semantic segmentation)的问题,尤其是在标注数据稀缺或获取成本高昂的区域,传统全监督模型性能显著下降,限制了其实际应用。解决方案的关键在于提出一种名为SAM-Aug的新框架,该框架利用Segment Anything Model (SAM) 的几何感知分割能力,在无需人工标注的情况下生成几何先验掩码(mask priors),并通过设计的RegionSmoothLoss损失函数,强制模型在时间序列帧内保持每个SAM生成区域内预测的一致性,从而有效正则化模型以尊重语义结构一致性。此方法无需额外标注数据即可显著提升土地覆盖制图的准确性,验证了基础视觉模型作为正则化器在少样本遥感学习中的潜力。
链接: https://arxiv.org/abs/2601.09110
作者: Kai Hu,Yaozu Feng,Vladimir Lysenko,Ya Guo Member,Huayi Wu
机构: Jiangnan University (江南大学); Southern Federal University (南联邦大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures
Abstract:Few-shot semantic segmentation of time-series remote sensing images remains a critical challenge, particularly in regions where labeled data is scarce or costly to obtain. While state-of-the-art models perform well under full supervision, their performance degrades significantly under limited labeling, limiting their real-world applicability. In this work, we propose SAM-Aug, a new annotation-efficient framework that leverages the geometry-aware segmentation capability of the Segment Anything Model (SAM) to improve few-shot land cover mapping. Our approach constructs cloud-free composite images from temporal sequences and applies SAM in a fully unsupervised manner to generate geometry-aware mask priors. These priors are then integrated into training through a proposed loss function called RegionSmoothLoss, which enforces prediction consistency within each SAM-derived region across temporal frames, effectively regularizing the model to respect semantically coherent structures. Extensive experiments on the PASTIS-R benchmark under a 5 percent labeled setting demonstrate the effectiveness and robustness of SAM-Aug. Averaged over three random seeds (42, 2025, 4090), our method achieves a mean test mIoU of 36.21 percent, outperforming the state-of-the-art baseline by +2.33 percentage points, a relative improvement of 6.89 percent. Notably, on the most favorable split (seed=42), SAM-Aug reaches a test mIoU of 40.28 percent, representing an 11.2 percent relative gain with no additional labeled data. The consistent improvement across all seeds confirms the generalization power of leveraging foundation model priors under annotation scarcity. Our results highlight that vision models like SAM can serve as useful regularizers in few-shot remote sensing learning, offering a scalable and plug-and-play solution for land cover monitoring without requiring manual annotations or model fine-tuning.
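【代码示意】RegionSmoothLoss 的核心是“同一 SAM 区域内的预测应当一致”;下面按区域统计均值并惩罚区域内偏差,给出一个通用示意(与论文的具体定义及跨时相处理方式未必一致):

```python
import torch

def region_smooth_loss(probs, region_ids):
    """probs: (B, H, W) 前景概率;region_ids: (B, H, W) SAM 生成的区域编号(int64, >=0)。
    惩罚每个区域内部预测概率偏离该区域均值的程度(示意)。"""
    loss = 0.0
    B = probs.shape[0]
    for b in range(B):
        p = probs[b].flatten()
        r = region_ids[b].flatten()
        num_regions = int(r.max().item()) + 1
        # 按区域求和与计数,得到每个区域的平均预测
        sums = torch.zeros(num_regions, device=p.device).scatter_add(0, r, p)
        counts = torch.zeros(num_regions, device=p.device).scatter_add(
            0, r, torch.ones_like(p))
        region_mean = sums / counts.clamp(min=1)
        # 每个像素与其所属区域均值的平方差
        loss = loss + ((p - region_mean[r]) ** 2).mean()
    return loss / B

probs = torch.rand(2, 64, 64, requires_grad=True)
regions = torch.randint(0, 10, (2, 64, 64))
region_smooth_loss(probs, regions).backward()
```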
zh
[CV-66] Small but Mighty: Dynamic Wavelet Expert-Guided Fine-Tuning of Large-Scale Models for Optical Remote Sensing Object Segmentation AAAI2026
【速读】:该论文旨在解决大规模基础模型在光学遥感图像(Optical Remote Sensing Images, ORSIs)分割任务中因参数量庞大而导致的微调困难问题,如GPU显存消耗过高和计算成本昂贵,从而限制了大模型在该领域的应用探索。其解决方案的关键在于提出了一种名为WEFT(Wavelet Expert-guided Fine-Tuning)的动态小波专家引导微调范式,通过引入任务特定的小波专家提取器来动态生成富含任务信息的可训练特征,并构建一个专家引导的条件适配器,在冻结特征的基础上注入可训练特征并迭代更新两类特征,从而以极少的可训练参数实现对大模型的有效适应,显著提升了ORSIs分割性能。
链接: https://arxiv.org/abs/2601.09108
作者: Yanguang Sun,Chao Wang,Jian Yang,Lei Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2026
Abstract:Accurately localizing and segmenting relevant objects from optical remote sensing images (ORSIs) is critical for advancing remote sensing applications. Existing methods are typically built upon moderate-scale pre-trained models and employ diverse optimization strategies to achieve promising performance under full-parameter fine-tuning. In fact, deeper and larger-scale foundation models can provide stronger support for performance improvement. However, due to their massive number of parameters, directly adopting full-parameter fine-tuning leads to pronounced training difficulties, such as excessive GPU memory consumption and high computational costs, which result in extremely limited exploration of large-scale models in existing works. In this paper, we propose a novel dynamic wavelet expert-guided fine-tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large-scale foundation models to ORSIs segmentation tasks by leveraging the guidance of wavelet experts. Specifically, we introduce a task-specific wavelet expert extractor to model wavelet experts from different perspectives and dynamically regulate their outputs, thereby generating trainable features enriched with task-specific information for subsequent fine-tuning. Furthermore, we construct an expert-guided conditional adapter that first enhances the fine-grained perception of frozen features for specific tasks by injecting trainable features, and then iteratively updates the information of both types of feature, allowing for efficient fine-tuning. Extensive experiments show that our WEFT not only outperforms 21 state-of-the-art (SOTA) methods on three ORSIs datasets, but also achieves optimal results in camouflage, natural, and medical scenarios. The source code is available at: this https URL.
zh
[CV-67] Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams
【速读】:该论文旨在解决行星机器人在复杂地形中实现高精度定位的问题,尤其针对地面探测车(rover)利用有限视场的单目RGB图像在局部空基地图中进行自主定位的需求。其关键解决方案是提出一种基于跨视图对齐的双编码器深度神经网络架构,通过语义分割结合视觉基础模型(vision foundation models)与大规模合成数据有效弥合真实图像与模拟数据之间的域差距(domain gap),并构建了首个包含真实行星类比环境轨迹及对应真值位置标注的跨视角数据集,辅以粒子滤波(particle filter)状态估计方法,实现了基于地面视角图像序列的准确位姿估计,无论是在简单还是复杂路径上均表现出鲁棒性。
链接: https://arxiv.org/abs/2601.09107
作者: Lachlan Holden,Feras Dayoub,Alberto Candela,David Harvey,Tat-Jun Chin
机构: 1: University of Sydney (悉尼大学); 2: University of Melbourne (墨尔本大学); 3: Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 7 pages, 10 figures. Presented at the International Conference on Space Robotics (iSpaRo) 2025 in Sendai, Japan. Dataset available: this https URL
Abstract:Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground-aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field-of-view monocular ground-view RGB images as input. A key consideration for machine learning methods is that real space data with ground-truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross-view-localising dual-encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images. We also contribute a new cross-view dataset of real-world rover trajectories with corresponding ground-truth localisation data captured in a planetary analogue facility, plus a high volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross-view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground-view images.
zh
[CV-68] Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking
【速读】:该论文旨在解决轻量级目标跟踪方法在训练过程中采用稀疏采样(每序列仅使用一个模板图像和一个搜索图像)导致的时空信息利用不充分问题,从而限制了其性能并拉大与高性能跟踪器之间的差距。解决方案的关键在于提出STDTrack框架,通过密集视频采样最大化时空信息利用率,并引入时序传播的时空token引导逐帧特征提取;同时设计多帧信息融合模块(Multi-frame Information Fusion Module, MFIFM),结合构建的时空token维持器(Spatiotemporal Token Maintainer, STM)中存储的历史特征,实现目标状态的全面表征;此外,还开发了多尺度预测头以动态适应不同尺寸的目标,最终在六个基准测试中达到领先性能,且保持实时推理速度(GPU下192 FPS,CPU下41 FPS)。
链接: https://arxiv.org/abs/2601.09078
作者: Junze Shi,Yang Yu,Jian Shi,Haibo Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training–utilizing only one template and one search image per sequence–which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).
zh
[CV-69] Depth-Wise Representation Development Under Blockwise Self-Supervised Learning for Video Vision Transformers
【速读】:该论文试图解决的问题是:如何在不依赖端到端反向传播(end-to-end backpropagation)的情况下训练掩码视频变换器(masked video transformer),从而避免长距离信用分配(long-range credit assignment)的复杂性,同时保持模型表示能力。其解决方案的关键在于引入分块自监督学习(blockwise self-supervised learning, BWSSL),将编码器划分为多个块(block),每个块独立优化局部掩码重建损失(local masked reconstruction loss),从而实现模块化、分层的学习机制。实验表明,这种分块训练策略能够收敛并获得接近端到端基线的表征性能,且在深度层面揭示了更早出现高层结构、后期块饱和及几何保持特性等关键学习动态,为理解分块训练与端到端训练之间的差异提供了新的视角。
链接: https://arxiv.org/abs/2601.09040
作者: Jonas Römer,Timo Dickscheid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:End-to-end backpropagation couples all layers through a global error signal, enabling coordinated learning but requiring long-range credit assignment. Motivated by recent progress in blockwise self-supervised learning (BWSSL), we ask whether masked video transformers can be trained without end-to-end backpropagation. Applying BWSSL to masked video modeling remains relatively underexplored and must handle spatiotemporal context and long-range temporal structure. More broadly, analyses that compare BWSSL and end-to-end training in terms of learning dynamics and depth-wise representation development remain sparse. We apply blockwise learning to a masked autoencoding video vision transformer by partitioning the encoder into blocks, each of which is optimized with a local masked reconstruction loss. Across model sizes and partition granularities, training converges and yields representations close to matched end-to-end baselines under linear-probe and retrieval proxies. In order to compare intermediate representations, we analyze depth-wise decodability, inter-block similarity, and patch-level diagnostics. Blockwise training exposes higher-level structure earlier, while later blocks saturate and operate in a more geometry-preserving regime. It can also induce token-level shifts consistent with stronger early mixing that pooled metrics can miss. These findings point to late-block saturation and interface formation as contributors to the remaining gap.
zh
[CV-70] Changes in Visual Attention Patterns for Detection Tasks due to Dependencies on Signal and Background Spatial Frequencies
【速读】:该论文旨在解决复杂视觉环境中图像信号检测性能受限的根本原因,特别是探讨图像属性(如目标形态、背景复杂度)与视觉注意机制之间的相互作用。其核心问题是:在数字乳腺断层摄影(DBT)图像中,为何放射科医生仍会出现漏诊,以及视觉注意力如何受局部信号特征与全局解剖噪声的交互影响。解决方案的关键在于通过模拟不同乳腺密度和结构的数字乳腺体模(digital breast phantoms),生成具有不同空间频率特性的病灶(3-mm球形病变和6-mm星状病变),并利用眼动追踪数据量化注视指标,从而揭示决策失败是错误的主要来源,且视觉注意机制对不同背景和信号空间频率表现出差异化响应,这为优化医学图像显示和辅助诊断系统提供了关键依据。
链接: https://arxiv.org/abs/2601.09008
作者: Amar Kavuri,Howard C. Gifford,Mini Das
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
备注: 21 pages, 7 images
Abstract:We aim to investigate the impact of image and signal properties on visual attention mechanisms during a signal detection task in digital images. The insight yielded from this work applies to many areas of digital imaging where signal or pattern recognition is performed against complex heterogeneous backgrounds. We used simulated tomographic breast images as the platform to investigate this question. While radiologists are highly effective at analyzing medical images to detect and diagnose diseases, misdiagnosis still occurs. We selected digital breast tomosynthesis (DBT) images as sample medical images with different breast densities and structures, using digital breast phantoms (Bakic and XCAT). Two types of lesions (with distinct spatial frequency properties) were randomly inserted in the phantoms during projections to generate abnormal cases. Six human observers participated in an observer study designed for locating and detecting a 3-mm sphere lesion and a 6-mm spicule lesion in reconstructed in-plane DBT slices. We collected eye-gaze data to estimate gaze metrics and to examine differences in visual attention mechanisms. We found that detection performance in complex visual environments is strongly constrained by later perceptual stages, with decision failures accounting for the largest proportion of errors. Signal detectability is jointly influenced by both target morphology and background complexity, revealing a critical interaction between local signal features and global anatomical noise. Increased fixation duration on spiculated lesions suggests that visual attention is differentially engaged depending on background and signal spatial frequency dependencies.
zh
[CV-71] Instance camera focus prediction for crystal agglomeration classification
【速读】:该论文旨在解决晶体聚集(agglomeration)分析中因二维成像局限性导致的误判问题,特别是由于光学显微镜景深浅,使得处于不同深度层的晶体在图像中看似连接,实则并未真正聚集。为提升分类与分割准确性,其关键解决方案在于:首先利用实例级相机聚焦预测网络(instance camera focus prediction network)量化图像中的焦点水平,该方法比传统图像处理聚焦度量更贴近视觉观察;随后将预测的焦点信息与实例分割模型结合,从而实现更精确的晶体聚集分类。
链接: https://arxiv.org/abs/2601.09004
作者: Xiaoyu Ji,Chenhao Zhang,Tyler James Downard,Zoltan Nagy,Ali Shakouri,Fengqing Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Agglomeration refers to the process of crystal clustering due to interparticle forces. Crystal agglomeration analysis from microscopic images is challenging due to the inherent limitations of two-dimensional imaging. Overlapping crystals may appear connected even when located at different depth layers. Because optical microscopes have a shallow depth of field, crystals that are in-focus and out-of-focus in the same image typically reside on different depth layers and do not constitute true agglomeration. To address this, we first quantify camera focus with an instance camera focus prediction network that predicts a two-class focus level, which aligns better with visual observations than traditional image processing focus measures. An instance segmentation model is then combined with the predicted focus level for agglomeration classification. Our proposed method achieves higher agglomeration classification and segmentation accuracy than the baseline models on the ammonium perchlorate crystal and sugar crystal datasets.
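【代码示意】作为背景,摘要中对比的“传统图像处理聚焦度量”常用拉普拉斯方差;下面给出该经典度量及一个粗略的两类判别示意,阈值为假设值,需按数据标定:

```python
import numpy as np
from scipy.ndimage import laplace

def laplacian_variance(gray):
    """经典聚焦度量:拉普拉斯响应的方差,数值越大通常表示越清晰。gray: 2D 数组。"""
    return float(laplace(gray.astype(np.float64)).var())

def classify_focus(gray, threshold=50.0):
    """按阈值粗分为 在焦/离焦 两类(阈值依图像与传感器而定,此处仅为假设)。"""
    return "in-focus" if laplacian_variance(gray) > threshold else "out-of-focus"

img = np.random.rand(128, 128) * 255
print(classify_focus(img))
```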
zh
[CV-72] SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds KR
【速读】:该论文旨在解决生成式 AI(Generative AI)在人体分割任务中因遮挡导致关键点(keypoints)部分或完全不可见时性能下降的问题。解决方案的关键在于对Segment Anything Model (SAM) 2.1 进行轻量级编码器修改,并引入一种名为PoseMaskRefine的微调策略,将高可见性姿态关键点融入原始SAM的迭代修正流程中,从而提升模型在多种数据集上的鲁棒性和准确性;推理阶段仅使用三个最高可见性的关键点进行提示,显著降低对常见错误(如缺失肢体或衣物误分类)的敏感性,同时实现从单个关键点即可准确预测掩码的能力。
链接: https://arxiv.org/abs/2601.08982
作者: Constantin Kolomiiets,Miroslav Purkrabek,Jiri Matas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: GitHub: this https URL
Abstract:Segment Anything (SAM) provides an unprecedented foundation for human segmentation, but may struggle under occlusion, where keypoints may be partially or fully invisible. We adapt SAM 2.1 for pose-guided segmentation with minimal encoder modifications, retaining its strong generalization. Using a fine-tuning strategy called PoseMaskRefine, we incorporate pose keypoints with high visibility into the iterative correction process originally employed by SAM, yielding improved robustness and accuracy across multiple datasets. During inference, we simplify prompting by selecting only the three keypoints with the highest visibility. This strategy reduces sensitivity to common errors, such as missing body parts or misclassified clothing, and allows accurate mask prediction from as few as a single keypoint. Our results demonstrate that pose-guided fine-tuning of SAM enables effective, occlusion-aware human segmentation while preserving the generalization capabilities of the original model. The code and pretrained models will be available at this https URL.
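【代码示意】推理时“仅取可见度最高的三个关键点作为点提示”的选择逻辑如下(关键点格式按 COCO 风格假设,未调用真实的 SAM 接口):

```python
import numpy as np

def top_visible_keypoints(keypoints, k=3):
    """keypoints: (N, 3) 数组,每行为 (x, y, visibility)。
    返回可见度最高的 k 个点坐标及对应的前景标签(供点提示使用)。"""
    kps = np.asarray(keypoints, dtype=np.float64)
    order = np.argsort(-kps[:, 2])[:k]                    # 按可见度降序取前 k 个
    point_coords = kps[order, :2]                         # (k, 2) 像素坐标
    point_labels = np.ones(len(order), dtype=np.int64)    # 1 表示前景提示
    return point_coords, point_labels

# 用法示意:COCO 风格 17 个关键点,第 3 列为可见度
kps = np.random.rand(17, 3)
coords, labels = top_visible_keypoints(kps, k=3)
# 之后可将 coords/labels 作为点提示传给分割模型的提示接口(接口名依具体实现而定)
```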
zh
[CV-73] hermo-LIO: A Novel Multi-Sensor Integrated System for Structural Health Monitoring
【速读】:该论文旨在解决传统二维热成像技术在复杂几何结构、难以到达区域以及深层缺陷检测中效果受限的问题。其解决方案的关键在于提出了一种名为Thermo-LIO的新型多传感器系统,通过融合高分辨率激光雷达(LiDAR)与热成像数据,实现多模态数据的精确标定与同步,从而构建建筑物表面温度分布的高精度表示;进一步将该融合方法集成到激光雷达-惯性里程计(LiDAR-Inertial Odometry, LIO)框架中,实现了对大型结构的全覆盖监测,显著提升了温度变化检测与缺陷识别的准确性与实时性。
链接: https://arxiv.org/abs/2601.08977
作者: Chao Yang,Haoyuan Zheng,Yue Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27pages,12figures
Abstract:Traditional two-dimensional thermography, despite being non-invasive and useful for defect detection in the construction field, is limited in effectively assessing complex geometries, inaccessible areas, and subsurface defects. This paper introduces Thermo-LIO, a novel multi-sensor system that can enhance Structural Health Monitoring (SHM) by fusing thermal imaging with high-resolution LiDAR. To achieve this, the study first develops a multimodal fusion method combining thermal imaging and LiDAR, enabling precise calibration and synchronization of multimodal data streams to create accurate representations of temperature distributions in buildings. Second, it integrates this fusion approach with LiDAR-Inertial Odometry (LIO), enabling full coverage of large-scale structures and allowing for detailed monitoring of temperature variations and defect detection across inspection cycles. Experimental validations, including case studies on a bridge and a hall building, demonstrate that Thermo-LIO can detect detailed thermal anomalies and structural defects more accurately than traditional methods. The system enhances diagnostic precision, enables real-time processing, and expands inspection coverage, highlighting the crucial role of multimodal sensor integration in advancing SHM methodologies for large-scale civil infrastructure.
zh
[CV-74] Variance-Penalized MC-Dropout as a Learned Smoothing Prior for Brain Tumour Segmentation
【速读】:该论文旨在解决脑肿瘤分割中传统卷积神经网络(CNN)和U-Net模型在肿瘤浸润区域产生噪声边界的问题。其解决方案的关键在于提出UAMSA-UNet,一种基于贝叶斯推理的多尺度注意力U-Net架构,通过蒙特卡洛Dropout(Monte Carlo Dropout)学习数据驱动的平滑先验(smoothing prior),并在损失函数中引入方差惩罚项以抑制随机前向传播中的伪波动,从而生成空间一致的分割掩膜;同时融合多尺度特征与注意力机制,兼顾细节与全局上下文信息,最终在BraTS2023和BraTS2024数据集上显著提升Dice相似系数和平均交并比(IoU),且相比U-Net++减少42.5%浮点运算次数(FLOPs),实现了分割精度与计算效率的协同优化。
链接: https://arxiv.org/abs/2601.08956
作者: Satyaki Roy Chowdhury,Golrokh Mirzaei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISBI 2026
Abstract:Brain tumor segmentation is essential for diagnosis and treatment planning, yet many CNN and U-Net based approaches produce noisy boundaries in regions of tumor infiltration. We introduce UAMSA-UNet, an Uncertainty-Aware Multi-Scale Attention-based Bayesian U-Net that instead leverages Monte Carlo Dropout to learn a data-driven smoothing prior over its predictions, while fusing multi-scale features and attention maps to capture both fine details and global context. Our smoothing-regularized loss augments binary cross-entropy with a variance penalty across stochastic forward passes, discouraging spurious fluctuations and yielding spatially coherent masks. On BraTS2023, UAMSA-UNet improves Dice Similarity Coefficient by up to 3.3% and mean IoU by up to 2.7% over U-Net; on BraTS2024, it delivers up to 4.5% Dice and 4.0% IoU gains over the best baseline. Remarkably, it also reduces FLOPs by 42.5% relative to U-Net++ while maintaining higher accuracy. These results demonstrate that, by combining multi-scale attention with a learned smoothing prior, UAMSA-UNet achieves both better segmentation quality and computational efficiency, and provides a flexible foundation for future integration with transformer-based modules for further enhanced segmentation results.
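The loss described above, binary cross-entropy on the mean MC-Dropout prediction plus a variance penalty over the stochastic passes, can be sketched as follows in PyTorch. The pass count, penalty weight, and toy network are illustrative assumptions; the paper's UAMSA-UNet architecture and exact weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def variance_penalized_bce(model, x, target, n_passes=5, lam=0.1):
    """BCE on the mean MC-Dropout prediction plus a penalty on the
    per-pixel variance across stochastic forward passes.

    model  : segmentation net with dropout layers (kept active via .train())
    x      : (B, C, H, W) input batch
    target : (B, 1, H, W) binary masks
    lam    : weight of the variance (smoothing) penalty -- illustrative value
    """
    model.train()  # keep dropout stochastic when computing the loss
    probs = torch.stack([torch.sigmoid(model(x)) for _ in range(n_passes)], dim=0)
    mean_p = probs.mean(dim=0)
    var_p = probs.var(dim=0, unbiased=False)
    bce = F.binary_cross_entropy(mean_p.clamp(1e-6, 1 - 1e-6), target)
    return bce + lam * var_p.mean()

# toy usage with a tiny dropout conv net
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Dropout2d(0.2), torch.nn.Conv2d(8, 1, 1))
x = torch.randn(2, 1, 32, 32)
y = (torch.rand(2, 1, 32, 32) > 0.5).float()
print(variance_penalized_bce(net, x, y).item())
```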
zh
[CV-75] DriftGuard: A Hierarchical Framework for Concept Drift Detection and Remediation in Supply Chain Forecasting
【速读】:该论文旨在解决供应链预测模型因概念漂移(concept drift)导致性能下降的问题,即在实际环境中由于促销变化、消费者偏好演变或供应中断等因素,模型输出逐渐失效而缺乏预警机制,进而引发库存短缺或过剩。现有工业实践依赖人工监控和周期性重训(3–6个月),效率低下且难以捕捉快速漂移;学术方法则多局限于检测环节,忽视诊断与修复,并未考虑供应链数据的层级结构。解决方案的关键在于提出DriftGuard框架,其核心是构建一个端到端的漂移生命周期管理机制:通过集成四种互补的检测方法(误差监控、统计检验、自编码器异常检测和CUSUM变化点分析)实现早期高召回率检测(97.8% recall within 4.2 days);结合层级传播分析定位漂移发生的具体产品线;利用SHAP分析解释根因;并采用成本感知的重训策略仅更新受影响最严重的模型,从而显著提升投资回报率(最高达417)。
链接: https://arxiv.org/abs/2601.08928
作者: Shahnawaz Alam,Mohammed Abdul Rahman,Bareera Sadeqa
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Supply chain forecasting models degrade over time as real-world conditions change. Promotions shift, consumer preferences evolve, and supply disruptions alter demand patterns, causing what is known as concept drift. This silent degradation leads to stockouts or excess inventory without triggering any system warnings. Current industry practice relies on manual monitoring and scheduled retraining every 3-6 months, which wastes computational resources during stable periods while missing rapid drift events. Existing academic methods focus narrowly on drift detection without addressing diagnosis or remediation, and they ignore the hierarchical structure inherent in supply chain data. What retailers need is an end-to-end system that detects drift early, explains its root causes, and automatically corrects affected models. We propose DriftGuard, a five-module framework that addresses the complete drift lifecycle. The system combines an ensemble of four complementary detection methods, namely error-based monitoring, statistical tests, autoencoder anomaly detection, and Cumulative Sum (CUSUM) change-point analysis, with hierarchical propagation analysis to identify exactly where drift occurs across product lines. Once detected, Shapley Additive Explanations (SHAP) analysis diagnoses the root causes, and a cost-aware retraining strategy selectively updates only the most affected models. Evaluated on over 30,000 time series from the M5 retail dataset, DriftGuard achieves 97.8% detection recall within 4.2 days and delivers up to 417 return on investment through targeted remediation.
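One of the four detectors in the ensemble is CUSUM change-point analysis on forecast errors; a minimal one-sided CUSUM sketch is shown below. The warm-up window, slack k, and threshold h are illustrative defaults, not DriftGuard's tuned settings.

```python
import numpy as np

def cusum_drift(errors, k=0.5, h=5.0):
    """One-sided CUSUM on standardized forecast errors.

    Flags the first index where the cumulative positive deviation of the
    error stream exceeds threshold h (in standard deviations); k is the
    per-step allowance. Returns None if no alarm is raised.
    """
    e = np.asarray(errors, dtype=float)
    z = (e - e[:30].mean()) / (e[:30].std() + 1e-9)  # standardize on a warm-up window
    s = 0.0
    for t, zt in enumerate(z):
        s = max(0.0, s + zt - k)
        if s > h:
            return t  # drift alarm index
    return None

# toy usage: the error level shifts upward at t=120
rng = np.random.default_rng(1)
err = np.concatenate([rng.normal(0, 1, 120), rng.normal(2.5, 1, 80)])
print(cusum_drift(err))  # alarm shortly after index 120
```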
zh
[CV-76] Adaptive few-shot learning for robust part quality classification in two-photon lithography
【速读】:该论文旨在解决计算机视觉(Computer Vision, CV)质量控制模型在动态制造环境中的适应性问题,即现有CV模型难以检测未见过的缺陷类别、无法高效地从少量数据中更新知识,且无法适应新零件几何形状。解决方案的关键在于提出一个面向全生命周期的质量模型维护自适应框架,其核心是基于统一且尺度鲁棒的骨干模型,并融合三项关键技术:(1) 基于线性判别分析(Linear Discriminant Analysis, LDA)的统计假设检验方法用于新颖性检测;(2) 一种两阶段基于回放的少样本增量学习策略实现新类别的快速集成;(3) 一种少样本域对抗神经网络(Domain-Adversarial Neural Network, DANN)用于跨域适应。该框架在双域(半球源域与立方体目标域)实验中验证了其高精度与数据效率,显著提升了CV模型在演化生产场景中的部署与维护能力。
链接: https://arxiv.org/abs/2601.08885
作者: Sixian Jia,Ruo-Syuan Mei,Chenhui Shao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Two-photon lithography (TPL) is an advanced additive manufacturing (AM) technique for fabricating high-precision micro-structures. While computer vision (CV) has proven effective for automated quality control, existing models are often static, rendering them ineffective in dynamic manufacturing environments. These models typically cannot detect new, unseen defect classes, be efficiently updated from scarce data, or adapt to new part geometries. To address this gap, this paper presents an adaptive CV framework for the entire life-cycle of quality model maintenance. The proposed framework is built upon a single, scale-robust backbone model and integrates three key methodologies: (1) a statistical hypothesis testing framework based on Linear Discriminant Analysis (LDA) for novelty detection, (2) a two-stage, rehearsal-based strategy for few-shot incremental learning, and (3) a few-shot Domain-Adversarial Neural Network (DANN) for few-shot domain adaptation. The framework was evaluated on a TPL dataset featuring hemisphere structures as the source domain and cube structures as the target domain, with each domain categorized into good, minor damaged, and damaged quality classes. The hypothesis testing method successfully identified new class batches with 99-100% accuracy. The incremental learning method integrated a new class to 92% accuracy using only K=20 samples. The domain adaptation model bridged the severe domain gap, achieving 96.19% accuracy on the target domain using only K=5 shots. These results demonstrate a robust and data-efficient solution for deploying and maintaining CV models in evolving production scenarios.
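A simplified stand-in for the LDA-based novelty test: project known-class samples with LDA, then flag an incoming batch whose mean is far (by a chi-square-thresholded Mahalanobis distance) from every known centroid. The exact test statistic, threshold choice, and data are assumptions, not the paper's procedure.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from scipy.stats import chi2

def is_novel_batch(X_train, y_train, X_batch, alpha=0.01):
    """Flag a batch as a potential new class if its mean, projected into the
    LDA space of the known classes, is far from every known-class centroid
    under a chi-square threshold at level alpha.
    """
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    Z_train, Z_batch = lda.transform(X_train), lda.transform(X_batch)
    thresh = chi2.ppf(1 - alpha, df=Z_train.shape[1])
    cov_inv = np.linalg.pinv(np.cov(Z_train, rowvar=False))
    b = Z_batch.mean(axis=0)
    dists = []
    for c in np.unique(y_train):
        diff = b - Z_train[y_train == c].mean(axis=0)
        dists.append(diff @ cov_inv @ diff)
    return min(dists) > thresh

# toy usage: three known classes, plus a batch drawn far from all of them
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(i * 4, 1, (50, 8)) for i in range(3)])
y = np.repeat([0, 1, 2], 50)
print(is_novel_batch(X, y, rng.normal(30, 1, (20, 8))))  # True
```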
zh
[CV-77] Compressing Vision Transformers in Geospatial Transfer Learning with Manifold-Constrained Optimization
【速读】:该论文旨在解决在资源受限的边缘设备上部署基于视觉Transformer(Vision Transformer)的地理空间基础模型时,因模型参数量庞大及压缩导致精度下降而难以实际应用的问题。其解决方案的关键在于利用流形约束优化框架DLRT,在迁移学习过程中对模型进行结构化低维参数化压缩,从而在保持下游任务特定准确率的同时实现显著的参数缩减。该方法通过将参数约束与目标任务目标对齐,相较于现有低秩方法(如LoRA)展现出更优的性能表现。
链接: https://arxiv.org/abs/2601.08882
作者: Thomas Snyder,H. Lexie Yang,Stefan Schnake,Steffen Schotthöfer
机构: Yale University (耶鲁大学); Oak Ridge National Laboratory (橡树岭国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying geospatial foundation models on resource-constrained edge devices demands compact architectures that maintain high downstream performance. However, their large parameter counts and the accuracy loss often induced by compression limit practical adoption. In this work, we leverage the manifold-constrained optimization framework DLRT to compress large vision transformer-based geospatial foundation models during transfer learning. By enforcing structured low-dimensional parameterizations aligned with downstream objectives, this approach achieves strong compression while preserving task-specific accuracy. We show that the method outperforms off-the-shelf low-rank methods such as LoRA. Experiments on diverse geospatial benchmarks confirm substantial parameter reduction with minimal accuracy loss, enabling high-performing, on-device geospatial models.
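For intuition, the sketch below shows a generic rank-r factorized linear layer of the kind such low-rank compression methods constrain; it does not implement DLRT's manifold-constrained, rank-adaptive training dynamics, and the dimensions and initialization are placeholders.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer parameterized as W ~ U @ V with rank r << min(d_in, d_out)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) / rank ** 0.5)
        self.V = nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        # equivalent to x @ (U @ V).T + bias, but never materializes the dense W
        return x @ self.V.T @ self.U.T + self.bias

full = 768 * 768                      # dense weight count of one layer
lr = LowRankLinear(768, 768, rank=32)
low = sum(p.numel() for p in (lr.U, lr.V))
print(f"compression: {low / full:.2%} of dense parameters")  # ~8%
x = torch.randn(4, 768)
print(lr(x).shape)  # torch.Size([4, 768])
```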
zh
[CV-78] TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
【速读】:该论文旨在解决统一图像生成与编辑模型在密集扩散Transformer架构中因共享参数空间而导致的严重任务干扰问题,即局部编辑与以主体为中心的生成目标之间的冲突。解决方案的关键在于引入语义意图驱动的稀疏专家混合(Mixture-of-Experts, MoE)路由机制:首先设计分层任务语义标注方案,生成结构化的任务描述符(如作用范围、类型、保留性等);进而提出预测对齐正则化(Predictive Alignment Regularization),使门控网络的内部路由决策与任务的高层语义对齐,从而将原本任务无关的门控机制转变为具有语义感知能力的任务调度中心,有效缓解任务干扰并促进专家自然形成语义相关的专业化分工。
链接: https://arxiv.org/abs/2601.08881
作者: Yu Xu,Hongbin Yan,Juan Cao,Yiji Cheng,Tiankai Hang,Runze He,Zijin Yin,Shiyi Zhang,Yuxin Zhang,Jintao Li,Chunyu Wang,Qinglin Lu,Tong-Yee Lee,Fan Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Unified image generation and editing models suffer from severe task interference in dense diffusion transformers architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing v.s. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference. In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task’s high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.
zh
[CV-79] The Semantic Lifecycle in Embodied AI: Acquisition, Representation and Storage via Foundation Models
【速读】:该论文旨在解决具身智能(Embodied AI)中语义信息处理的多源性与多阶段特性所带来的挑战,尤其是在复杂现实环境中实现稳定感知-动作循环的问题。传统方法依赖人工设计与深度神经网络结合,在特定语义相关任务中取得进展,但难以应对开放-ended任务和复杂环境对通用性与鲁棒性语义处理能力的需求。论文提出的解决方案关键在于引入“语义生命周期”(Semantic Lifecycle)这一统一框架,以基础模型(Foundation Models, FMs)驱动语义知识的演化过程,从获取、表征到存储三个核心阶段进行系统分析与比较,从而提供一个连续且可维护的语义知识流视角,推动具身智能向更通用、更稳定的语义理解方向发展。
链接: https://arxiv.org/abs/2601.08876
作者: Shuai Chen,Hao Chen,Yuanchen Bei,Tianyang Zhao,Zhibo Zhou,Feiran Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic information in embodied AI is inherently multi-source and multi-stage, making it challenging to fully leverage for achieving stable perception-to-action loops in real-world environments. Early studies have combined manual engineering with deep neural networks, achieving notable progress in specific semantic-related embodied tasks. However, as embodied agents encounter increasingly complex environments and open-ended tasks, the demand for more generalizable and robust semantic processing capabilities has become imperative. Recent advances in foundation models (FMs) address this challenge through their cross-domain generalization abilities and rich semantic priors, reshaping the landscape of embodied AI research. In this survey, we propose the Semantic Lifecycle as a unified framework to characterize the evolution of semantic knowledge within embodied AI driven by foundation models. Departing from traditional paradigms that treat semantic processing as isolated modules or disjoint tasks, our framework offers a holistic perspective that captures the continuous flow and maintenance of semantic knowledge. Guided by this embodied semantic lifecycle, we further analyze and compare recent advances across three key stages: acquisition, representation, and storage. Finally, we summarize existing challenges and outline promising directions for future research.
zh
[CV-80] Learning Domain-Invariant Representations for Cross-Domain Image Registration via Scene-Appearance Disentanglement
【速读】:该论文旨在解决跨域图像配准(image registration)中因源域与目标域间系统性强度差异导致的亮度恒定假设失效问题,从而使得对应点估计变得病态(ill-posed)。其核心解决方案是提出SAR-Net框架,通过有原则的场景外观解耦(scene-appearance disentanglement),将观测图像分解为域不变的场景表示(domain-invariant scene representations)和域特定的外观码(domain-specific appearance codes),进而实现基于重渲染而非直接强度匹配的配准。关键创新在于理论证明了该分解可保证跨域一致对齐(Proposition 1)并建立了场景一致性损失作为共享潜在空间中几何对应关系的充分条件(Proposition 2),实验证明其在双向扫描显微成像任务中显著优于现有方法(SSIM提升至0.885,NCC达0.979),同时保持实时性能(77 fps)。
链接: https://arxiv.org/abs/2601.08875
作者: Jiahao Qin,Yiwen Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 7 figures, 4 tables. Code and data available at this https URL
Abstract:Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill-posed. We propose SAR-Net, a unified framework that addresses this challenge through principled scene-appearance disentanglement. Our key insight is that observed images can be decomposed into domain-invariant scene representations and domain-specific appearance codes, enabling registration via re-rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross-domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR-Net on bidirectional scanning microscopy, where coupled domain shift and geometric distortion create a challenging real-world testbed. Our method achieves 0.885 SSIM and 0.979 NCC, representing 3.1x improvement over the strongest baseline, while maintaining real-time performance (77 fps). Ablation studies confirm that both scene consistency and domain alignment losses are necessary: removing either degrades performance by 90% SSIM or causes 223x increase in latent alignment error, respectively. Code and data are available at this https URL.
zh
[CV-81] ForensicFormer: Hierarchical Multi-Scale Reasoning for Cross-Domain Image Forgery Detection
【速读】:该论文旨在解决传统图像取证方法在面对AI生成图像(如GAN和扩散模型输出)及复杂编辑工具时失效的问题,即跨域伪造检测能力不足。其解决方案的关键在于提出ForensicFormer框架,通过分层多尺度设计,利用交叉注意力机制统一低级伪影检测、中级边界分析与高级语义推理,从而实现对多种伪造类型(包括传统篡改、GAN生成图像和扩散模型输出)的高精度识别,平均准确率达86.8%,显著优于现有通用检测器,并具备较强的JPEG压缩鲁棒性和像素级伪造定位能力(F1-score=0.76)。
链接: https://arxiv.org/abs/2601.08873
作者: Hema Hariharan Samson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 9 pages, 4 figures, 5 tables. Technical report on hierarchical multi-scale image forgery detection
Abstract:The proliferation of AI-generated imagery and sophisticated editing tools has rendered traditional forensic methods ineffective for cross-domain forgery detection. We present ForensicFormer, a hierarchical multi-scale framework that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning via cross-attention transformers. Unlike prior single-paradigm approaches, which achieve 75% accuracy on out-of-distribution datasets, our method maintains 86.8% average accuracy across seven diverse test sets, spanning traditional manipulations, GAN-generated images, and diffusion model outputs - a significant improvement over state-of-the-art universal detectors. We demonstrate superior robustness to JPEG compression (83% accuracy at Q=70 vs. 66% for baselines) and provide pixel-level forgery localization with a 0.76 F1-score. Extensive ablation studies validate that each hierarchical component contributes 4-10% accuracy improvement, and qualitative analysis reveals interpretable forensic features aligned with human expert reasoning. Our work bridges classical image forensics and modern deep learning, offering a practical solution for real-world deployment where manipulation techniques are unknown a priori.
zh
[CV-82] Residual Cross-Modal Fusion Networks for Audio-Visual Navigation
【速读】:该论文旨在解决音频-视觉具身导航(Audio-Visual Embodied Navigation)任务中多模态融合时存在的异构特征交互建模难题,尤其是避免单一模态主导或信息退化问题,特别是在跨域场景下。其解决方案的关键在于提出一种交叉模态残差融合网络(Cross-Modal Residual Fusion Network, CRFN),通过引入音频与视觉流之间的双向残差交互机制,实现互补建模与细粒度对齐,同时保持各自表征的独立性;相较于传统依赖简单拼接或注意力门控的方法,CRFN 显式地利用残差连接建模跨模态关联,并结合稳定化技术提升收敛性和鲁棒性,从而在 Replica 和 Matterport3D 数据集上显著优于现有最优融合基线并展现出更强的跨域泛化能力。
链接: https://arxiv.org/abs/2601.08868
作者: Yi Wang,Yinfeng Yu,Bin Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Main paper (10 pages). Accepted for publication by the 14th international conference on Computational Visual Media (CVM 2026)
Abstract:Audio-visual embodied navigation aims to enable an agent to autonomously localize and reach a sound source in unseen 3D environments by leveraging auditory cues. The key challenge of this task lies in effectively modeling the interaction between heterogeneous features during multimodal fusion, so as to avoid single-modality dominance or information degradation, particularly in cross-domain scenarios. To address this, we propose a Cross-Modal Residual Fusion Network, which introduces bidirectional residual interactions between audio and visual streams to achieve complementary modeling and fine-grained alignment, while maintaining the independence of their representations. Unlike conventional methods that rely on simple concatenation or attention gating, CRFN explicitly models cross-modal interactions via residual connections and incorporates stabilization techniques to improve convergence and robustness. Experiments on the Replica and Matterport3D datasets demonstrate that CRFN significantly outperforms state-of-the-art fusion baselines and achieves stronger cross-domain generalization. Notably, our experiments also reveal that agents exhibit differentiated modality dependence across different datasets. The discovery of this phenomenon provides a new perspective for understanding the cross-modal collaboration mechanism of embodied agents.
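The bidirectional residual exchange between the two streams can be pictured with the short PyTorch sketch below; layer sizes, the use of LayerNorm, and the final concatenation are assumptions for illustration, and CRFN's attention and stabilization components are omitted.

```python
import torch
import torch.nn as nn

class ResidualCrossModalFusion(nn.Module):
    """Each stream keeps its own representation and adds a projected
    residual from the other modality (audio <-> visual)."""
    def __init__(self, dim):
        super().__init__()
        self.a2v = nn.Linear(dim, dim)   # audio features projected into the visual stream
        self.v2a = nn.Linear(dim, dim)   # visual features projected into the audio stream
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        audio_out = self.norm_a(audio + self.v2a(visual))   # audio + residual from vision
        visual_out = self.norm_v(visual + self.a2v(audio))  # vision + residual from audio
        return torch.cat([audio_out, visual_out], dim=-1)   # fused feature for the policy

fusion = ResidualCrossModalFusion(128)
a, v = torch.randn(2, 128), torch.randn(2, 128)
print(fusion(a, v).shape)  # torch.Size([2, 256])
```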
zh
[CV-83] R2BD: A Reconstruction-Based Method for Generalizable and Efficient Detection of Fake Images
【速读】:该论文旨在解决当前基于重建的生成式AI图像检测方法在效率和泛化能力上的局限性问题。现有方法依赖于扩散模型进行多步反演与重建,导致计算效率低下(通常需20余步),且仅适用于扩散类生成模型,难以推广至GAN等其他生成范式。解决方案的关键在于提出一种名为R²BD的新框架,其核心创新包括:(1) G-LDM,一个统一的重建模型,能够模拟VAE、GAN和扩散模型的生成行为,从而扩展检测范围;(2) 残差偏置计算模块,在单次前向推理中即可区分真实与伪造图像,显著提升效率。实验表明,R²BD比现有重建方法快超过22倍,同时在跨数据集评估中平均性能优于SOTA方法13.87%,展现出卓越的效率与泛化能力。
链接: https://arxiv.org/abs/2601.08867
作者: Qingyu Liu,Zhongjie Ba,Jianmin Guo,Qiu Wang,Zhibo Wang,Jie Shi,Kui Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, reconstruction-based methods have gained attention for AIGC image detection. These methods leverage pre-trained diffusion models to reconstruct inputs and measure residuals for distinguishing real from fake images. Their key advantage lies in reducing reliance on dataset-specific artifacts and improving generalization under distribution shifts. However, they are limited by significant inefficiency due to multi-step inversion and reconstruction, and their reliance on diffusion backbones further limits generalization to other generative paradigms such as GANs. In this paper, we propose a novel fake image detection framework, called R²BD, built upon two key designs: (1) G-LDM, a unified reconstruction model that simulates the generation behaviors of VAEs, GANs, and diffusion models, thereby broadening the detection scope beyond prior diffusion-only approaches; and (2) a residual bias calculation module that distinguishes real and fake images in a single inference step, which is a significant efficiency improvement over existing methods that typically require 20+ steps. Extensive experiments on the benchmark from 10 public datasets demonstrate that R²BD is over 22× faster than existing reconstruction-based methods while achieving superior detection accuracy. In cross-dataset evaluations, it outperforms state-of-the-art methods by an average of 13.87%, showing strong efficiency and generalization across diverse generative methods. The code and dataset used for evaluation are available at this https URL.
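A generic sketch of reconstruction-residual scoring, the family of detectors this paper builds on: reconstruct the input in one pass and summarize the residual statistics. The quantizing stand-in reconstructor and the block statistics are placeholders, not the paper's G-LDM or its residual-bias module.

```python
import numpy as np

def residual_score(image, reconstruct, patch=8):
    """Score an image by block-wise statistics of its reconstruction residual.

    `reconstruct` is any one-pass model mapping an image through a generative
    bottleneck; images that already lie on the generator's manifold tend to
    leave smaller, more uniform residuals than real photos.
    """
    residual = np.abs(image.astype(float) - reconstruct(image).astype(float))
    h, w = residual.shape[:2]
    blocks = residual[:h - h % patch, :w - w % patch]
    blocks = blocks.reshape(h // patch, patch, w // patch, patch, -1)
    block_means = blocks.mean(axis=(1, 3, 4))          # one value per patch
    return float(block_means.mean()), float(block_means.std())

# toy usage with a crude quantizer standing in for a learned reconstructor
def fake_reconstructor(img):
    return (img // 8) * 8

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(residual_score(img, fake_reconstructor))
```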
zh
[CV-84] Bias Detection and Rotation-Robustness Mitigation in Vision-Language Models and Generative Image Models
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)和生成式图像模型在输入变换(如图像旋转)和分布偏移下的鲁棒性不足与公平性问题,特别是旋转扰动如何影响模型预测准确性、置信度校准以及群体偏差模式。解决方案的关键在于提出一种旋转鲁棒性增强策略,其核心包括数据增强、表征对齐和模型级正则化三方面的协同优化,实验表明该方法能在不牺牲整体性能的前提下显著提升模型鲁棒性并减少偏差放大。
链接: https://arxiv.org/abs/2601.08860
作者: Tarannum Mithila
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint. This work is derived from the author’s Master’s research. Code and supplementary materials will be released separately
Abstract:Vision-Language Models (VLMs) and generative image models have achieved remarkable performance across multimodal tasks, yet their robustness and fairness under input transformations remain insufficiently explored. This work investigates bias propagation and robustness degradation in state-of-the-art vision-language and generative models, with a particular focus on image rotation and distributional shifts. We analyze how rotation-induced perturbations affect model predictions, confidence calibration, and demographic bias patterns. To address these issues, we propose rotation-robust mitigation strategies that combine data augmentation, representation alignment, and model-level regularization. Experimental results across multiple datasets demonstrate that the proposed methods significantly improve robustness while reducing bias amplification without sacrificing overall performance. This study highlights critical limitations of current multimodal systems and provides practical mitigation techniques for building more reliable and fair AI models.
zh
[CV-85] Equi-ViT: Rotational Equivariant Vision Transformer for Robust Histopathology Analysis
【速读】:该论文旨在解决Vision Transformers (ViTs)在组织病理学图像分析中缺乏旋转等变性(rotational equivariance)的问题,即标准ViT对图像旋转敏感,导致模型性能在不同方向的图像上波动较大,限制了其在实际数字病理学应用中的鲁棒性和泛化能力。解决方案的关键在于提出Equi-ViT架构,在ViT的patch embedding阶段引入等变卷积核(equivariant convolution kernel),从而赋予模型内置的旋转等变性,使patch嵌入在图像旋转时保持一致的结构关系,显著提升分类任务在不同图像朝向下的稳定性与数据效率。
链接: https://arxiv.org/abs/2601.09130
作者: Fuyao Chen,Yuexi Du,Elèonore V. Lieffrig,Nicha C. Dvornek,John A. Onofrey
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE ISBI 2026 4-page paper
Abstract:Vision Transformers (ViTs) have gained rapid adoption in computational pathology for their ability to model long-range dependencies through self-attention, addressing the limitations of convolutional neural networks that excel at local pattern capture but struggle with global contextual reasoning. Recent pathology-specific foundation models have further advanced performance by leveraging large-scale pretraining. However, standard ViTs remain inherently non-equivariant to transformations such as rotations and reflections, which are ubiquitous variations in histopathology imaging. To address this limitation, we propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture, imparting built-in rotational equivariance to learned representations. Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations. Our results on a public colorectal cancer dataset demonstrate that incorporating equivariant patch embedding enhances data efficiency and robustness, suggesting that equivariant transformers could potentially serve as more generalizable backbones for the application of ViT in histopathology, such as digital pathology foundation models.
zh
[CV-86] POWDR: Pathology-preserving Outpainting with Wavelet Diffusion for 3D MRI
【速读】:该论文旨在解决医学影像数据集中的类别不平衡和病理丰富病例稀缺问题,这些问题限制了机器学习模型在分割、分类及视觉-语言任务中的性能表现。其解决方案的关键在于提出一种名为POWDR的病理保留外补全(pathology-preserving outpainting)框架,基于条件小波扩散模型(conditioned wavelet diffusion model)对3D MRI进行合成,能够在保持真实病灶区域不变的前提下生成解剖学合理的周围组织,从而实现多样性增强而不伪造病变。该方法通过小波域条件控制提升高频细节并缓解潜在扩散模型常见的模糊问题,并引入随机连通掩码训练策略以克服条件诱导的塌陷现象,显著提升了非病灶区域的多样性,实验证明其可有效提升肿瘤分割性能(Dice分数从0.6992提升至0.7137),且不显著改变脑脊液(CSF)和灰质(GM)体积分布,具备良好的临床适用性与扩展潜力。
链接: https://arxiv.org/abs/2601.09044
作者: Fei Tan,Ashok Vardhan Addala,Bruno Astuto Arouche Nunes,Xucheng Zhu,Ravi Soni
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical imaging datasets often suffer from class imbalance and limited availability of pathology-rich cases, which constrains the performance of machine learning models for segmentation, classification, and vision-language tasks. To address this challenge, we propose POWDR, a pathology-preserving outpainting framework for 3D MRI based on a conditioned wavelet diffusion model. Unlike conventional augmentation or unconditional synthesis, POWDR retains real pathological regions while generating anatomically plausible surrounding tissue, enabling diversity without fabricating lesions. Our approach leverages wavelet-domain conditioning to enhance high-frequency detail and mitigate blurring common in latent diffusion models. We introduce a random connected mask training strategy to overcome conditioning-induced collapse and improve diversity outside the lesion. POWDR is evaluated on brain MRI using BraTS datasets and extended to knee MRI to demonstrate tissue-agnostic applicability. Quantitative metrics (FID, SSIM, LPIPS) confirm image realism, while diversity analysis shows significant improvement with random-mask training (cosine similarity reduced from 0.9947 to 0.9580; KL divergence increased from 0.00026 to 0.01494). Clinically relevant assessments reveal gains in tumor segmentation performance using nnU-Net, with Dice scores improving from 0.6992 to 0.7137 when adding 50 synthetic cases. Tissue volume analysis indicates no significant differences for CSF and GM compared to real images. These findings highlight POWDR as a practical solution for addressing data scarcity and class imbalance in medical imaging. The method is extensible to multiple anatomies and offers a controllable framework for generating diverse, pathology-preserving synthetic data to support robust model development.
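One plausible way to realize the "random connected mask" training strategy mentioned above is to grow a mask by a random walk from a seed voxel, as sketched below; the paper does not specify its exact sampler, so the procedure, step count, and volume size here are assumptions.

```python
import numpy as np

def random_connected_mask(shape, n_steps=2000, seed=0):
    """Grow a connected 3D boolean mask by a clipped random walk from a
    random seed voxel; every visited voxel is marked as masked."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(shape, dtype=bool)
    pos = np.array([rng.integers(0, s) for s in shape])
    steps = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                      [0, -1, 0], [0, 0, 1], [0, 0, -1]])
    for _ in range(n_steps):
        mask[tuple(pos)] = True
        pos = np.clip(pos + steps[rng.integers(0, 6)], 0, np.array(shape) - 1)
    return mask

m = random_connected_mask((32, 32, 32))
print(m.sum(), "masked voxels")  # connected region carved by the walk
```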
zh
[CV-87] GOUHFI 2.0: A Next-Generation Toolbox for Brain Segmentation and Cortex Parcellation at Ultra-High Field MRI
【速读】:该论文旨在解决超高场强磁共振成像(Ultra-High Field MRI, UHF-MRI)在大规模神经影像研究中自动脑部分割与皮层分区面临的挑战,包括信号不均匀性、对比度和分辨率异质性以及缺乏针对UHF数据优化的工具。其解决方案的关键在于提出GOUHFI 2.0,一个基于深度学习的改进工具箱,通过引入更具多样性的训练数据和两个独立训练的3D U-Net分割任务:第一个任务实现跨对比度、分辨率、场强和人群的35类全脑分割,采用领域随机化策略;第二个任务基于相同训练集完成符合Desikan-Killiany-Tourville (DKT)协议的62类皮层分区。该设计保持了原工具箱对对比度和分辨率的无关性,同时首次实现了在UHF-MRI下鲁棒的皮层分区,显著提升了复杂队列中的分割精度和体积测量一致性。
链接: https://arxiv.org/abs/2601.09006
作者: Marc-Antoine Fortin,Anne Louise Kristoffersen,Paal Erik Goa
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
Abstract:Ultra-High Field MRI (UHF-MRI) is increasingly used in large-scale neuroimaging studies, yet automatic brain segmentation and cortical parcellation remain challenging due to signal inhomogeneities, heterogeneous contrasts and resolutions, and the limited availability of tools optimized for UHF data. Standard software packages such as FastSurferVINN and SynthSeg+ often yield suboptimal results when applied directly to UHF images, thereby restricting region-based quantitative analyses. To address this need, we introduce GOUHFI 2.0, an updated implementation of GOUHFI that incorporates increased training data variability and additional functionalities, including cortical parcellation and volumetry. GOUHFI 2.0 preserves the contrast- and resolution-agnostic design of the original toolbox while introducing two independently trained 3D U-Net segmentation tasks. The first performs whole-brain segmentation into 35 labels across contrasts, resolutions, field strengths and populations, using a domain-randomization strategy and a training dataset of 238 subjects. Using the same training data, the second network performs cortical parcellation into 62 labels following the Desikan-Killiany-Tourville (DKT) protocol. Across multiple datasets, GOUHFI 2.0 demonstrated improved segmentation accuracy relative to the original toolbox, particularly in heterogeneous cohorts, and produced reliable cortical parcellations. In addition, the integrated volumetry pipeline yielded results consistent with standard volumetric workflows. Overall, GOUHFI 2.0 provides a comprehensive solution for brain segmentation, parcellation and volumetry across field strengths, and constitutes the first deep-learning toolbox enabling robust cortical parcellation at UHF-MRI.
zh
[CV-88] W-DUALMINE: Reliability-Weighted Dual-Expert Fusion With Residual Correlation Preservation for Medical Image Fusion
【速读】:该论文旨在解决医学图像融合中普遍存在的全局统计相似性(以相关系数CC和互信息MI衡量)与局部结构保真度之间的固有权衡问题。现有基于深度学习的方法,如AdaFuse和ASFE-Fusion,在提升整体图像一致性的同时往往牺牲了细节层次的结构准确性。解决方案的关键在于提出W-DUALMINE框架,其核心创新包括:(1)引入密集可靠性图实现自适应模态加权;(2)设计双专家融合策略,结合全局上下文空间专家与小波域频率专家以协同优化全局与局部特征;(3)采用基于软梯度的仲裁机制进行融合决策;(4)通过残差到平均的融合范式,在保证全局相关性的同时增强局部细节保留能力。这一架构约束与理论驱动的损失设计共同实现了对传统权衡关系的有效突破。
链接: https://arxiv.org/abs/2601.08920
作者: Md. Jahidul Islam
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:
Abstract:Medical image fusion integrates complementary information from multiple imaging modalities to improve clinical interpretation. However, existing deep learning-based methods, including recent spatial-frequency frameworks such as AdaFuse and ASFE-Fusion, often suffer from a fundamental trade-off between global statistical similarity, measured by correlation coefficient (CC) and mutual information (MI), and local structural fidelity. This paper proposes W-DUALMINE, a reliability-weighted dual-expert fusion framework designed to explicitly resolve this trade-off through architectural constraints and a theoretically grounded loss design. The proposed method introduces dense reliability maps for adaptive modality weighting, a dual-expert fusion strategy combining a global-context spatial expert and a wavelet-domain frequency expert, and a soft gradient-based arbitration mechanism. Furthermore, we employ a residual-to-average fusion paradigm that guarantees the preservation of global correlation while enhancing local details. Extensive experiments on CT-MRI, PET-MRI, and SPECT-MRI datasets demonstrate that W-DUALMINE consistently outperforms AdaFuse and ASFE-Fusion in CC and MI metrics while
zh
[CV-89] Comprehensive Machine Learning Benchmarking for Fringe Projection Profilometry with Photorealistic Synthetic Data
【速读】:该论文旨在解决基于机器学习的条纹投影轮廓术(Fringe Projection Profilometry, FPP)方法因缺乏大规模、多样化数据集和标准化评估协议而导致的性能瓶颈问题。其解决方案的关键在于构建并公开首个基于NVIDIA Isaac Sim生成的、具有照片级真实感的合成数据集,包含15,600张条纹图像与300个深度重建结果,覆盖50种不同物体;同时通过系统性基准测试四种神经网络架构(UNet、Hformer、ResUNet、Pix2Pix),揭示了直接从条纹图像到深度图映射的方法存在根本性局限——即在未引入相位信息的情况下,重建误差可达典型物体深度范围的75–95%,从而为后续学习型FPP算法的设计提供了标准化评估框架与关键认知依据。
链接: https://arxiv.org/abs/2601.08900
作者: Anush Lakshman S,Adam Haroon,Beiwen Li
机构: Iowa State University (爱荷华州立大学); University of Georgia (佐治亚大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Machine learning approaches for fringe projection profilometry (FPP) are hindered by the lack of large, diverse datasets and comprehensive benchmarking protocols. This paper introduces the first open-source, photorealistic synthetic dataset for FPP, generated using NVIDIA Isaac Sim with 15,600 fringe images and 300 depth reconstructions across 50 diverse objects. We benchmark four neural network architectures (UNet, Hformer, ResUNet, Pix2Pix) on single-shot depth reconstruction, revealing that all models achieve similar performance (58-77 mm RMSE) despite substantial architectural differences. Our results demonstrate fundamental limitations of direct fringe-to-depth mapping without explicit phase information, with reconstruction errors approaching 75-95% of the typical object depth range. This resource provides standardized evaluation protocols enabling systematic comparison and development of learning-based FPP approaches.
zh
人工智能
[AI-0] Automating Supply Chain Disruption Monitoring via an Agentic AI Approach
【速读】:该论文旨在解决现代供应链在面对地缘政治事件、需求波动、贸易限制及自然灾害等扰动时,因缺乏对一级供应商之外的深层网络(deep-tier networks)可见性而导致的响应滞后问题。现有企业通常仅能监控Tier-1供应商,使得上游风险无法被及时识别,直至影响向下传导。解决方案的关键在于提出一种最小监督的代理型人工智能(minimally supervised agentic AI)框架,其由七个基于大语言模型(Large Language Models, LLMs)和确定性工具协同工作的专业化智能体组成,能够自主完成从非结构化新闻中检测扰动信号、映射至多级供应商网络、基于网络结构评估暴露度,并推荐替代采购方案等全流程分析。该框架实现了端到端自动化处理,在30个合成场景下F1得分介于0.962–0.991之间,平均响应时间仅3.83分钟(成本0.0836美元/扰动),相较行业依赖人工的多日评估方法提速超过三个数量级,验证了其在复杂供应链扰动管理中的前瞻性与实用性。
链接: https://arxiv.org/abs/2601.09680
作者: Sara AlMahri,Liming Xu,Alexandra Brintrup
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Modern supply chains are increasingly exposed to disruptions ranging from geopolitical events, demand shocks, and trade restrictions to natural disasters. While many of these disruptions originate deep in the supply network, most companies still lack visibility beyond Tier-1 suppliers, leaving upstream vulnerabilities undetected until the impact cascades downstream. To overcome this blind-spot and move from reactive recovery to proactive resilience, we introduce a minimally supervised agentic AI framework that autonomously monitors, analyses, and responds to disruptions across extended supply networks. The architecture comprises seven specialised agents powered by large language models and deterministic tools that jointly detect disruption signals from unstructured news, map them to multi-tier supplier networks, evaluate exposure based on network structure, and recommend mitigations such as alternative sourcing options. We evaluate the framework across 30 synthesised scenarios covering three automotive manufacturers and five disruption classes. The system achieves high accuracy across core tasks, with F1 scores between 0.962 and 0.991, and performs full end-to-end analyses in a mean of 3.83 minutes at a cost of $0.0836 per disruption. Relative to industry benchmarks of multi-day, analyst-driven assessments, this represents a reduction of more than three orders of magnitude in response time. A real-world case study of the 2022 Russia-Ukraine conflict further demonstrates operational applicability. This work establishes a foundational step toward building resilient, proactive, and autonomous supply chains capable of managing disruptions across deep-tier networks.
zh
[AI-1] LLM for Large-Scale Optimization Model Auto-Formulation: A Lightweight Few-Shot Learning Approach DATE
【速读】:该论文旨在解决大规模优化(Large-scale Optimization)模型构建过程中存在的劳动密集且耗时的问题。现有方法在面对复杂业务决策场景时,往往需要大量人工干预来完成从问题描述到数学模型的自动转化。为此,作者提出 LEAN-LLM-OPT 框架,其核心在于设计了一个轻量级、代理式(agentic)的 LLM 协作工作流:上游两个大语言模型(Large Language Models, LLMs)动态生成针对相似问题的建模步骤序列,下游 LLM 依据此工作流执行最终建模任务;通过将机械性的数据处理操作交由辅助工具完成,并将规划职责分离,使下游代理专注于不可标准化的难点部分,从而显著提升自动化建模效率与准确性。
链接: https://arxiv.org/abs/2601.09635
作者: Kuo Liang,Yuhang Lu,Jianming Mao,Shuyi Sun,Chunwei Yang,Congcong Zeng,Xiao Jin,Hanzhang Qin,Ruihao Zhu,Chung-Piaw Teo
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Updated version of this https URL
Abstract:Large-scale optimization is a key backbone of modern business decision-making. However, building these models is often labor-intensive and time-consuming. We address this by proposing LEAN-LLM-OPT, a LightwEight AgeNtic workflow construction framework for LLM-assisted large-scale OPTimization auto-formulation. LEAN-LLM-OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step-by-step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. Leveraging LLMs’ text-processing capabilities and common modeling practices, the workflow decomposes the modeling task into a sequence of structured sub-tasks and offloads mechanical data-handling operations to auxiliary tools. This design alleviates the downstream agent’s burden related to planning and data handling, allowing it to focus on the most challenging components that cannot be readily standardized. Extensive simulations show that LEAN-LLM-OPT, instantiated with GPT-4.1 and the open source gpt-oss-20B, achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches. In addition, in a Singapore Airlines choice-based revenue management use case, LEAN-LLM-OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large-Scale-OR and Air-NRM, the first comprehensive benchmarks for large-scale optimization auto-formulation. The code and data of this work is available at this https URL.
zh
[AI-2] From Prompt to Protocol: Fast Charging Batteries with Large Language Models
【速读】:该论文旨在解决电池充电协议优化中因每次评估耗时长、成本高且不可微分而导致的效率难题。传统方法通常通过大幅限制协议搜索空间来缓解这一问题,但这种做法抑制了协议多样性,阻碍了高性能方案的发现。其解决方案的关键在于提出两种基于大语言模型(Large Language Model, LLM)驱动的无梯度闭环优化方法:Prompt-to-Optimizer (P2O) 利用LLM生成小型神经网络协议代码,并由内层循环训练;Prompt-to-Protocol (P2P) 则直接输出电流及其标量参数的显式函数形式。实验表明,P2O在性能上超越贝叶斯优化、进化算法和随机搜索设计的神经网络协议,而P2P在相同评估预算下相较最优多段恒流(multi-step constant current, CC)基准实现约4.2%的容量保持率提升,验证了LLM在拓展协议函数形式空间、引入语言约束及高效优化高成本实验场景中的有效性。
链接: https://arxiv.org/abs/2601.09626
作者: Ge Lei,Ferran Brosa Planella,Sterling G. Baird,Samuel J. Cooper
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Efficiently optimizing battery charging protocols is challenging because each evaluation is slow, costly, and non-differentiable. Many existing approaches address this difficulty by heavily constraining the protocol search space, which limits the diversity of protocols that can be explored, preventing the discovery of higher-performing solutions. We introduce two gradient-free, LLM-driven closed-loop methods: Prompt-to-Optimizer (P2O), which uses an LLM to propose the code for small neural-network-based protocols, which are then trained by an inner loop, and Prompt-to-Protocol (P2P), which simply writes an explicit function for the current and its scalar parameters. Across our case studies, LLM-guided P2O outperforms neural networks designed by Bayesian optimization, evolutionary algorithms, and random search. In a realistic fast charging scenario, both P2O and P2P yield around a 4.2 percent improvement in state of health (capacity retention based health metric under fast charging cycling) over a state-of-the-art multi-step constant current (CC) baseline, with P2P achieving this under matched evaluation budgets (same number of protocol evaluations). These results demonstrate that LLMs can expand the space of protocol functional forms, incorporate language-based constraints, and enable efficient optimization in high cost experimental settings.
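The closed loop behind P2P can be sketched as: an (LLM-backed) proposer suggests scalar parameters of an explicit current function, a simulator or cell test scores them, and the history feeds the next proposal. In the sketch below, both the "simulator" and the proposer are synthetic placeholders so the loop runs end to end; they are not the paper's models or protocols.

```python
import random

def simulate_cycle_health(params):
    """Stand-in battery simulator: returns a capacity-retention score for a
    two-stage constant-current protocol (i1 until soc_switch, then i2).
    Purely synthetic, so the optimization loop below is runnable."""
    i1, i2, soc_switch = params["i1"], params["i2"], params["soc_switch"]
    stress = 0.02 * i1 ** 2 * soc_switch + 0.015 * i2 ** 2 * (1 - soc_switch)
    speed_bonus = 0.01 * (i1 + i2)
    return 1.0 - stress + 0.5 * speed_bonus

def propose_params(history):
    """Placeholder for the LLM call in Prompt-to-Protocol: given the history
    of (params, score) pairs, an LLM would return new scalar parameters for
    the explicit current function. Here we just perturb the best-so-far."""
    if not history:
        return {"i1": 2.0, "i2": 1.0, "soc_switch": 0.5}
    best = max(history, key=lambda h: h[1])[0]
    return {k: min(0.9, max(0.1, v + random.uniform(-0.05, 0.05)))
            if k == "soc_switch" else max(0.1, v + random.uniform(-0.2, 0.2))
            for k, v in best.items()}

random.seed(0)
history = []
for step in range(20):                      # a fixed evaluation budget
    params = propose_params(history)
    history.append((params, simulate_cycle_health(params)))
best_params, best_score = max(history, key=lambda h: h[1])
print(best_params, round(best_score, 4))
```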
zh
[AI-3] he Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multi-Step Malware
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)系统在实际应用中面临的安全威胁被过度简化为“提示注入”(prompt injection)这一单一概念所掩盖的问题,导致现有安全框架难以有效应对日益复杂的多步骤攻击。其解决方案的关键在于提出一种新的分类体系——将针对LLM应用的攻击视为一类独立的恶意软件,称为“提示软件”(promptware),并构建一个五阶段杀伤链模型(kill chain):初始访问(提示注入)、权限提升(越狱)、持久化(内存与检索污染)、横向移动(跨系统和跨用户传播)以及目标执行(从数据窃取到未经授权交易)。该框架不仅揭示了LLM攻击与传统恶意软件行为的高度相似性,还为安全从业者提供了结构化的威胁建模方法,并为AI安全与网络安全领域的研究者提供了一个统一的术语体系以应对快速演进的攻击态势。
链接: https://arxiv.org/abs/2601.09625
作者: Ben Nassi,Bruce Schneier,Oleg Brodt
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid adoption of large language model (LLM)-based systems – from chatbots to autonomous agents capable of executing code and financial transactions – has created a new attack surface that existing security frameworks inadequately address. The dominant framing of these threats as “prompt injection” – a catch-all phrase for security failures in LLM-based systems – obscures a more complex reality: Attacks on LLM-based systems increasingly involve multi-step sequences that mirror traditional malware campaigns. In this paper, we propose that attacks targeting LLM-based applications constitute a distinct class of malware, which we term \textitpromptware, and introduce a five-step kill chain model for analyzing these threats. The framework comprises Initial Access (prompt injection), Privilege Escalation (jailbreaking), Persistence (memory and retrieval poisoning), Lateral Movement (cross-system and cross-user propagation), and Actions on Objective (ranging from data exfiltration to unauthorized transactions). By mapping recent attacks to this structure, we demonstrate that LLM-related attacks follow systematic sequences analogous to traditional malware campaigns. The promptware kill chain offers security practitioners a structured methodology for threat modeling and provides a common vocabulary for researchers across AI safety and cybersecurity to address a rapidly evolving threat landscape.
zh
[AI-4] Full Disclosure, Less Trust? How the Level of Detail about AI Use in News Writing Affects Readers' Trust
【速读】:该论文旨在解决生成式 AI (Generative AI) 在新闻生产中应用时,AI披露(AI disclosure)如何影响读者信任的问题,尤其是不同披露细节层级是否会导致“透明度困境”(transparency dilemma)。研究通过一个3×2×2的混合实验设计,考察了三种披露水平(无披露、一行式披露、详细披露)、两类新闻类型(政治与生活类)以及两种AI参与程度(低与高)对读者信任及行为决策(如信息源核查和订阅意愿)的影响。关键发现在于:并非所有披露都会引发透明度困境——仅详细披露显著降低信任,而一行式和详细披露均提升源核查行为;同时,多数受访者偏好详细披露,但部分人希望采用按需获取细节的披露形式。这表明存在一种权衡关系:读者既渴望透明,又可能因过度披露损害对AI辅助内容的信任。
链接: https://arxiv.org/abs/2601.09620
作者: Pooja Prajod,Hannes Cools,Thomas Röggla,Karthikeya Puttur Venkatraj,Amber Kusters,Alia ElKattan,Pablo Cesar,Abdallah El Ali
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:As artificial intelligence (AI) is increasingly integrated into news production, calls for transparency about the use of AI have gained considerable traction. Recent studies suggest that AI disclosures can lead to a "transparency dilemma", where disclosure reduces readers' trust. However, little is known about how the level of detail in AI disclosures influences trust and contributes to this dilemma within the news context. In this 3×2×2 mixed factorial study with 40 participants, we investigate how three levels of AI disclosures (none, one-line, detailed) across two types of news (politics and lifestyle) and two levels of AI involvement (low and high) affect news readers' trust. We measured trust using the News Media Trust questionnaire, along with two decision behaviors: source-checking and subscription decisions. Questionnaire responses and subscription rates showed a decline in trust only for detailed AI disclosures, whereas source-checking behavior increased for both one-line and detailed disclosures, with the effect being more pronounced for detailed disclosures. Insights from semi-structured interviews suggest that source-checking behavior was primarily driven by interest in the topic, followed by trust, whereas trust was the main factor influencing subscription decisions. Around two-thirds of participants expressed a preference for detailed disclosures, while most participants who preferred one-line indicated a need for detail-on-demand disclosure formats. Our findings show that not all AI disclosures lead to a transparency dilemma, but instead reflect a trade-off between readers' desire for more transparency and their trust in AI-assisted news content.
zh
[AI-5] Information Access of the Oppressed: A Problem-Posing Framework for Envisioning Emancipatory Information Access Platforms
【速读】:该论文旨在解决在线信息访问(Online Information Access, IA)平台被威权主义捕获所带来的风险,特别是在民主倒退加剧、生成式AI技术(如AI说服能力)兴起以及大型科技公司权力集中的背景下。其解决方案的关键在于借鉴保罗·弗莱雷(Paulo Freire)的解放教育学理论,打破传统“技术开发者—用户”的二元对立关系,转而将技术问题交还给边缘化社区,使其成为技术建构与解构的主体。具体而言,第一阶段要求技术工作者将技术风险转化为社区可识别的问题,激发其参与技术改造;第二阶段则需重构线上技术架构,结构化地开放空间,使社区成员能够共同占有和共建技术,以支持其反抗压迫的解放斗争。这一框架被称为“问题提出式”(problem-posing)方法,用于未来解放导向的信息访问平台设计。
链接: https://arxiv.org/abs/2601.09600
作者: Bhaskar Mitra,Nicola Neophytou,Sireesh Gururaja
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:
Abstract:Online information access (IA) platforms are targets of authoritarian capture. These concerns are particularly serious and urgent today in light of the rising levels of democratic erosion worldwide, the emerging capabilities of generative AI technologies such as AI persuasion, and the increasing concentration of economic and political power in the hands of Big Tech. This raises the question of what alternative IA infrastructure we must reimagine and build to mitigate the risks of authoritarian capture of our information ecosystems. We explore this question through the lens of Paulo Freire’s theories of emancipatory pedagogy. Freire’s theories provide a radically different lens for exploring IA’s sociotechnical concerns relative to the current dominating frames of fairness, accountability, confidentiality, transparency, and safety. We make explicit, with the intention to challenge, the dichotomy of how we relate to technology as either technologists (who envision and build technology) and its users. We posit that this mirrors the teacher-student relationship in Freire’s analysis. By extending Freire’s analysis to IA, we challenge the notion that it is the burden of the (altruistic) technologists to come up with interventions to mitigate the risks that emerging technologies pose to marginalized communities. Instead, we advocate that the first task for the technologists is to pose these as problems to the marginalized communities, to encourage them to make and unmake the technology as part of their material struggle against oppression. Their second task is to redesign our online technology stacks to structurally expose spaces for community members to co-opt and co-construct the technology in aid of their emancipatory struggles. We operationalize Freire’s theories to develop a problem-posing framework for envisioning emancipatory IA platforms of the future.
zh
[AI-6] Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨任务泛化能力上的局限性问题,即现有方法通常依赖单一任务特定的推理模式,难以适应多样化的多模态推理需求(如图像区域聚焦或目标标记)。其解决方案的关键在于提出统一生成式多模态推理(unified generative multimodal reasoning)范式,通过在推理过程中生成中间图像来融合多种推理技能;具体实现上采用两阶段SFT+RL框架Omni-R1,引入感知对齐损失(perception alignment loss)和感知奖励(perception reward)以支持功能性图像生成,并进一步提出Omni-R1-Zero,利用纯文本推理数据自监督地生成视觉中间表示,从而无需依赖多模态标注。
链接: https://arxiv.org/abs/2601.09536
作者: Dongjie Cheng,Yongqi Li,Zhixin Ma,Hongru Cai,Yupeng Hu,Wenjie Wang,Liqiang Nie,Wenjie Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
zh
[AI-7] Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs
【速读】:该论文旨在解决中小企业(SMEs)在使用大语言模型(LLM)推理服务时面临的三大核心问题:云API带来的数据隐私风险、专用云端GPU实例的隐私保障不足与持续成本高昂,以及专业级本地硬件(如A100/H100)部署成本过高。其解决方案的关键在于系统性评估NVIDIA Blackwell消费级GPU(RTX 5060 Ti、5070 Ti、5090)在生产环境中的LLM推理性能表现,涵盖四种开源模型(Qwen3-8B、Gemma3-12B、Gemma3-27B、GPT-OSS-20B),并覆盖多种量化格式(BF16、W4A16、NVFP4、MXFP4)、上下文长度(8k–64k)及典型工作负载(RAG、多LoRA代理服务、高并发API)。研究发现,消费级GPU可实现比云API低40–200倍的自托管推理成本(仅电费),且在多数场景下(除长上下文RAG外)性能足以替代云服务;其中NVFP4量化技术可在保持2–4%质量损失的前提下提升1.6倍吞吐量并降低41%能耗,显著优化性价比。因此,该研究为中小型企业提供了一种经济高效、可复现的本地化LLM部署路径。
链接: https://arxiv.org/abs/2601.09527
作者: Jonathan Knoop,Hendrik Holtmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 15 pages, 18 tables, 7 figures. Includes link to GitHub repository and Docker image for reproducibility
Abstract:SMEs increasingly seek alternatives to cloud LLM APIs, which raise data privacy concerns. Dedicated cloud GPU instances offer improved privacy but with limited guarantees and ongoing costs, while professional on-premise hardware (A100, H100) remains prohibitively expensive. We present a systematic evaluation of NVIDIA's Blackwell consumer GPUs (RTX 5060 Ti, 5070 Ti, 5090) for production LLM inference, benchmarking four open-weight models (Qwen3-8B, Gemma3-12B, Gemma3-27B, GPT-OSS-20B) across 79 configurations spanning quantization formats (BF16, W4A16, NVFP4, MXFP4), context lengths (8k-64k), and three workloads: RAG, multi-LoRA agentic serving, and high-concurrency APIs. The RTX 5090 delivers 3.5-4.6x higher throughput than the 5060 Ti with 21x lower latency for RAG, but budget GPUs achieve the highest throughput-per-dollar for API workloads with sub-second latency. NVFP4 quantization provides 1.6x throughput over BF16 with 41% energy reduction and only 2-4% quality loss. Self-hosted inference costs $0.001-$0.04 per million tokens (electricity only), which is 40-200x cheaper than budget-tier cloud APIs, with hardware breaking even in under four months at moderate volume (30M tokens/day). Our results show that consumer GPUs can reliably replace cloud inference for most SME workloads, except latency-critical long-context RAG, where high-end GPUs remain essential. We provide deployment guidance and release all benchmark data for reproducible SME-scale deployments.
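The break-even argument reduces to simple arithmetic on hardware price, power draw, throughput, and API pricing. The numbers below are illustrative placeholders rather than the paper's measurements, but they land in the same ballpark (a break-even of roughly four to five months).

```python
# Back-of-the-envelope cost model for self-hosted vs. API inference.
# All inputs are illustrative assumptions: adjust gpu_cost_usd, power_kw,
# throughput_tok_per_s and api_usd_per_mtok to your own setup.
gpu_cost_usd = 2000.0          # e.g. a high-end consumer card
power_kw = 0.45                # average draw under load
electricity_usd_per_kwh = 0.30
throughput_tok_per_s = 2500.0  # aggregate serving throughput

tokens_per_hour = throughput_tok_per_s * 3600
energy_cost_per_mtok = power_kw * electricity_usd_per_kwh / (tokens_per_hour / 1e6)
print(f"self-hosted energy cost: ${energy_cost_per_mtok:.4f} per 1M tokens")

api_usd_per_mtok = 0.50        # budget-tier API price per 1M tokens
daily_tokens = 30e6            # 30M tokens/day workload
daily_saving = daily_tokens / 1e6 * (api_usd_per_mtok - energy_cost_per_mtok)
print(f"break-even after ~{gpu_cost_usd / daily_saving:.0f} days")
```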
zh
[AI-8] Towards Realistic Synthetic Data for Automatic Drum Transcription
【速读】:该论文旨在解决自动鼓乐转录(Automatic Drum Transcription, ADT)中因缺乏大规模配对音频-MIDI数据集而导致深度学习模型性能受限的问题。现有方法常依赖合成数据,但通常使用低保真度的SoundFont库,引入显著的域差异;而高质量的一次性鼓音色样本虽更优,却缺乏标准化且大规模的格式以供训练。解决方案的关键在于提出一种半监督方法,能够从无标签音频源中自动整理出大量且多样化的单次击打鼓音色样本,并基于此构建仅用MIDI文件即可合成的高质量数据集,进而训练序列到序列的转录模型,从而在ENST和MDB测试集上达到新的最先进性能。
链接: https://arxiv.org/abs/2601.09520
作者: Pierfrancesco Melucci,Paolo Merialdo,Taketo Akama
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low-fidelity SoundFont libraries that lack acoustic diversity. While high-quality one-shot samples offer a better alternative, they are not available in a standardized, large-scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data. Our primary contribution is a semi-supervised method to automatically curate a large and diverse corpus of one-shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high-quality dataset from MIDI files alone, which we use to train a sequence-to-sequence transcription model. We evaluate our model on the ENST and MDB test sets, where it achieves new state-of-the-art results, significantly outperforming both fully supervised methods and previous synthetic-data approaches. The code for reproducing our experiments is publicly available at this https URL
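Synthesizing paired training audio from MIDI plus a one-shot corpus boils down to placing samples at annotated onset times and mixing. The sketch below shows that core step with synthetic "kick" and "snare" one-shots; the sample rate, naming, and peak normalization are assumptions, and the paper's curated corpus and mixing chain are richer.

```python
import numpy as np

def render_drums(onsets, one_shots, sr=22050, length_s=4.0):
    """Mix one-shot drum samples at annotated onset times.

    onsets    : list of (time_in_seconds, instrument_name)
    one_shots : dict mapping instrument_name -> 1-D float waveform at `sr`
    """
    out = np.zeros(int(sr * length_s), dtype=np.float32)
    for t, name in onsets:
        sample = one_shots[name]
        start = int(t * sr)
        end = min(start + len(sample), len(out))
        out[start:end] += sample[: end - start]
    peak = np.abs(out).max()
    return out / peak if peak > 0 else out

# toy usage with synthetic "kick" and "snare" one-shots
sr = 22050
tt = np.linspace(0, 0.2, int(0.2 * sr), endpoint=False)
kick = (np.sin(2 * np.pi * 60 * tt) * np.exp(-tt * 20)).astype(np.float32)
snare = (np.random.randn(len(tt)) * np.exp(-tt * 30)).astype(np.float32)
audio = render_drums([(0.0, "kick"), (0.5, "snare"), (1.0, "kick")],
                     {"kick": kick, "snare": snare})
print(audio.shape, float(audio.max()))
```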
zh
[AI-9] What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在复杂决策与工具使用任务中表现出色,但其跨环境泛化能力仍缺乏系统评估的问题。现有评估范式主要依赖轨迹级指标衡量任务完成度,却未能检验代理是否具备对环境的具身理解(grounded model of the environment)。为此,作者提出Task-to-Quiz(T2Q)评估范式,通过自动化、确定性的问答机制将任务执行与世界状态理解解耦,从而更精准地衡量代理的环境建模能力。该方案的关键在于构建了一个包含30个环境和1967个具身问题对(grounded QA pairs)的基准测试平台T2QBench,并揭示了当前记忆机制无法有效支持代理获取具身环境模型,指出主动探索(proactive exploration)和细粒度状态表示(fine-grained state representation)是提升泛化能力的核心瓶颈。
链接: https://arxiv.org/abs/2601.09503
作者: Siyuan Liu,Hongbang Yuan,Xinze Li,Ziyue Zhu,Yixin Cao,Yu-Gang Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains an under-examined concern. Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.
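The evaluation idea, scoring grounded QA accuracy separately from trajectory success, can be expressed in a few lines; the episode fields and scoring below are hypothetical stand-ins for T2QBench's actual format.

```python
def t2q_report(episodes):
    """Summarize task success vs. grounded-QA accuracy per episode.

    `episodes` is a list of dicts with keys:
      'success' : bool, whether the agent completed the task
      'qa'      : list of (predicted_answer, gold_answer) pairs
    """
    rows = []
    for ep in episodes:
        correct = sum(p == g for p, g in ep["qa"])
        rows.append((ep["success"], correct / max(len(ep["qa"]), 1)))
    succ_rate = sum(s for s, _ in rows) / len(rows)
    qa_acc = sum(a for _, a in rows) / len(rows)
    return {"task_success": succ_rate, "qa_accuracy": qa_acc, "per_episode": rows}

episodes = [
    {"success": True,  "qa": [("red", "red"), ("closed", "open")]},
    {"success": False, "qa": [("3", "3"), ("kitchen", "kitchen")]},
]
print(t2q_report(episodes))  # success and understanding can diverge
```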
zh
[AI-10] Bridging Semantic Understanding and Popularity Bias with LLMs WWW2026
【速读】:该论文旨在解决推荐系统中流行度偏差(popularity bias)的语义理解不足问题,即现有去偏方法通常仅从多样性增强或长尾覆盖等表面特征出发,忽略了流行度偏差背后深层次的因果来源,导致去偏效果有限且推荐准确性不高。其解决方案的关键在于提出FairLRM框架,通过将流行度偏差分解为物品侧和用户侧两个组成部分,并利用结构化指令提示(structured instruction-based prompts)提升大语言模型(Large Language Model, LLM)对全局物品分布与个体用户偏好之间语义关系的理解能力,从而实现更精准、可解释且公平的推荐。
链接: https://arxiv.org/abs/2601.09478
作者: Renqiang Luo,Dong Zhang,Yupeng Gao,Wen Shi,Mingliang Hou,Jiaying Liu,Zhe Wang,Shuo Yu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figs, WWW 2026 accepted
Abstract:Semantic understanding of popularity bias is a crucial yet underexplored challenge in recommender systems, where popular items are often favored at the expense of niche content. Most existing debiasing methods treat the semantic understanding of popularity bias as a matter of diversity enhancement or long-tail coverage, neglecting the deeper semantic layer that embodies the causal origins of the bias itself. Consequently, such shallow interpretations limit both their debiasing effectiveness and recommendation accuracy. In this paper, we propose FairLRM, a novel framework that bridges the gap in the semantic understanding of popularity bias with Recommendation via Large Language Model (RecLLM). FairLRM decomposes popularity bias into item-side and user-side components, using structured instruction-based prompts to enhance the model’s comprehension of both global item distributions and individual user preferences. Unlike traditional methods that rely on surface-level features such as “diversity” or “debiasing”, FairLRM improves the model’s ability to semantically interpret and address the underlying bias. Through empirical evaluation, we show that FairLRM significantly enhances both fairness and recommendation accuracy, providing a more semantically aware and trustworthy approach to enhance the semantic understanding of popularity bias. The implementation is available at this https URL.
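A hypothetical illustration of a "structured instruction-based prompt" that exposes both global popularity statistics and the user's own history so the model can reason about the bias explicitly; the wording and field names are assumptions, not FairLRM's released template.

```python
def build_fairlrm_style_prompt(user_history, candidates, popularity):
    """Assemble an instruction prompt that pairs global item popularity with
    the user's individual preferences, so popularity bias is made explicit."""
    pop_lines = "\n".join(
        f"- {item}: seen by {popularity[item]:.1%} of users" for item in candidates)
    hist_lines = ", ".join(user_history)
    return (
        "You are a recommender that corrects for popularity bias.\n"
        f"User history: {hist_lines}\n"
        "Candidate items with global popularity:\n"
        f"{pop_lines}\n"
        "Rank the candidates by fit to this user's tastes, and do not favor "
        "an item merely because it is popular. Return a ranked list.")

prompt = build_fairlrm_style_prompt(
    user_history=["indie film A", "documentary B"],
    candidates=["blockbuster C", "indie film D"],
    popularity={"blockbuster C": 0.42, "indie film D": 0.03})
print(prompt)
```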
zh
[AI-11] SimMerge: Learning to Select Merge Operators from Similarity Signals
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)合并过程中效率低下的问题,即如何在有限的评估预算下高效选择最优的合并操作(merge operator)、模型子集和合并顺序,从而避免昂贵的“合并-评估”循环。其解决方案的关键在于提出了一种名为 SimMerge 的预测性合并选择方法,该方法通过少量无标签样本计算模型的功能性和结构性特征(functional and structural features),利用任务无关的相似性信号来预测任意两模型合并后的性能表现,进而自动决策最佳合并策略,无需重新训练即可适用于多路合并及不同参数规模的模型(如7B至111B参数)。
链接: https://arxiv.org/abs/2601.09473
作者: Oliver Bolton,Aakanksha,Arash Ahmadian,Sara Hooker,Marzieh Fadaee,Beyza Ermis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Model merging enables multiple large language models (LLMs) to be combined into a single model while preserving performance. This makes it a valuable tool in LLM development, offering a competitive alternative to multi-task training. However, merging can be difficult at scale, as successful merging requires choosing the right merge operator, selecting the right models, and merging them in the right order. This often leads researchers to run expensive merge-and-evaluate searches to select the best merge. In this work, we provide an alternative by introducing SimMerge, a predictive merge-selection method that selects the best merge using inexpensive, task-agnostic similarity signals between models. From a small set of unlabeled probes, we compute functional and structural features and use them to predict the performance of a given 2-way merge. Using these predictions, SimMerge selects the best merge operator, the subset of models to merge, and the merge order, eliminating the expensive merge-and-evaluate loop. We demonstrate that we surpass standard merge-operator performance on 2-way merges of 7B-parameter LLMs, and that SimMerge generalizes to multi-way merges and 111B-parameter LLM merges without retraining. Additionally, we present a bandit variant that supports adding new tasks, models, and operators on the fly. Our results suggest that learning how to merge is a practical route to scalable model composition when checkpoint catalogs are large and evaluation budgets are tight.
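下面给出一个极简的示意代码(非论文官方实现),展示“从少量无标签探针样本计算两模型的相似度特征,再用轻量回归器预测合并后性能”的基本思路;其中特征的具体定义(输出 logits 余弦相似度、逐层权重余弦)与线性回归器均为阅读摘要后的假设,仅供理解:

```python
import numpy as np


def functional_similarity(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
    """Cosine similarity between two models' outputs on the same unlabeled probes."""
    a, b = logits_a.ravel(), logits_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def structural_similarity(weights_a: list[np.ndarray], weights_b: list[np.ndarray]) -> float:
    """Average per-layer cosine similarity of flattened weight matrices."""
    sims = []
    for wa, wb in zip(weights_a, weights_b):
        wa, wb = wa.ravel(), wb.ravel()
        sims.append(wa @ wb / (np.linalg.norm(wa) * np.linalg.norm(wb) + 1e-8))
    return float(np.mean(sims))


def predict_merge_score(features: np.ndarray, coef: np.ndarray, bias: float) -> float:
    """Linear predictor of post-merge performance; coef/bias would be fit offline
    on (similarity features, measured merge score) pairs."""
    return float(features @ coef + bias)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits_a, logits_b = rng.normal(size=(16, 100)), rng.normal(size=(16, 100))
    weights_a = [rng.normal(size=(8, 8)) for _ in range(4)]
    weights_b = [rng.normal(size=(8, 8)) for _ in range(4)]
    feats = np.array([functional_similarity(logits_a, logits_b),
                      structural_similarity(weights_a, weights_b)])
    print(predict_merge_score(feats, coef=np.array([0.6, 0.4]), bias=0.1))
```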
zh
[AI-12] FairGU: Fairness-aware Graph Unlearning in Social Network WWW2026
【速读】:该论文旨在解决现有图去学习(graph unlearning)技术在删除节点时对敏感属性保护不足的问题,导致算法公平性显著下降,无法满足隐私保护与社会可持续性的双重需求。其解决方案的关键在于提出一个公平感知的图去学习框架 FairGU,该框架通过集成专门设计的公平性感知模块与高效的数据保护策略,在去除节点影响的同时防止敏感属性被无意放大或结构暴露,从而在保持模型性能的基础上显著提升公平性表现。
链接: https://arxiv.org/abs/2601.09469
作者: Renqiang Luo,Yongshuai Yang,Huafei Huang,Qing Qing,Mingliang Hou,Ziqi Xu,Yi Yu,Jingjing Zhou,Feng Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figs, WWW 2026 accepted
Abstract:Graph unlearning has emerged as a critical mechanism for supporting sustainable and privacy-preserving social networks, enabling models to remove the influence of deleted nodes and thereby better safeguard user information. However, we observe that existing graph unlearning techniques insufficiently protect sensitive attributes, often leading to degraded algorithmic fairness compared with traditional graph learning methods. To address this gap, we introduce FairGU, a fairness-aware graph unlearning framework designed to preserve both utility and fairness during the unlearning process. FairGU integrates a dedicated fairness-aware module with effective data protection strategies, ensuring that sensitive attributes are neither inadvertently amplified nor structurally exposed when nodes are removed. Through extensive experiments on multiple real-world datasets, we demonstrate that FairGU consistently outperforms state-of-the-art graph unlearning methods and fairness-enhanced graph learning baselines in terms of both accuracy and fairness metrics. Our findings highlight a previously overlooked risk in current unlearning practices and establish FairGU as a robust and equitable solution for the next generation of socially sustainable networked systems. The codes are available at this https URL.
zh
[AI-13] Searth Transformer: A Transformer Architecture Incorporating Earth's Geospheric Physical Priors for Global Mid-Range Weather Forecasting
【速读】:该论文旨在解决当前基于Transformer的全球中短期天气预报模型中存在的两大问题:一是现有模型采用视觉中心架构,忽视了地球球面几何结构和纬向周期性,导致物理一致性不足;二是传统自回归训练方式计算成本高且因误差累积限制了预报时效。解决方案的关键在于提出一种物理信息驱动的Shifted Earth Transformer (Searth Transformer) 架构,通过在窗口化自注意力机制中引入纬向周期性和经向边界条件,实现符合物理规律的全球信息交互;同时设计了一种Relay Autoregressive (RAR) 微调策略,在有限内存和计算资源下有效学习大气长期演变过程。这一方法使所开发的YanTian模型在精度上超越欧洲中期天气预报中心(ECMWF)高分辨率预报,并在一度分辨率下与最先进AI模型表现相当,同时计算开销仅为标准自回归微调的约1/200,且Z500场技能预报时效达10.3天,优于HRES的9天。
链接: https://arxiv.org/abs/2601.09467
作者: Tianye Li,Qi Liu,Hao Li,Lei Chen,Wencong Cheng,Fei Zheng,Xiangao Xia,Ya Wang,Gang Huang,Weiwei Wang,Xuan Tong,Ziqing Zu,Yi Fang,Shenming Fu,Jiang Jiang,Haochen Li,Mingxing Li,Jiangjiang Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Accurate global medium-range weather forecasting is fundamental to Earth system science. Most existing Transformer-based forecasting models adopt vision-centric architectures that neglect the Earth’s spherical geometry and zonal periodicity. In addition, conventional autoregressive training is computationally expensive and limits forecast horizons due to error accumulation. To address these challenges, we propose the Shifted Earth Transformer (Searth Transformer), a physics-informed architecture that incorporates zonal periodicity and meridional boundaries into window-based self-attention for physically consistent global information exchange. We further introduce a Relay Autoregressive (RAR) fine-tuning strategy that enables learning long-range atmospheric evolution under constrained memory and computational budgets. Based on these methods, we develop YanTian, a global medium-range weather forecasting model. YanTian achieves higher accuracy than the high-resolution forecast of the European Centre for Medium-Range Weather Forecasts and performs competitively with state-of-the-art AI models at one-degree resolution, while requiring roughly 200 times lower computational cost than standard autoregressive fine-tuning. Furthermore, YanTian attains a longer skillful forecast lead time for Z500 (10.3 days) than HRES (9 days). Beyond weather forecasting, this work establishes a robust algorithmic foundation for predictive modeling of complex global-scale geophysical circulation systems, offering new pathways for Earth system science.
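摘要提到将“纬向(zonal)周期性与经向(meridional)边界”纳入窗口自注意力。下面用 NumPy 给出一个只示意“纬向循环填充 + 经向零填充”的最小例子(假设性写法,与论文实际实现无关),说明在划分注意力窗口前如何保持东西方向的周期连续性:

```python
import numpy as np


def pad_lat_lon_field(x: np.ndarray, pad: int) -> np.ndarray:
    """x: [lat, lon, channels] gridded field.
    Longitude (zonal) axis is padded circularly to respect periodicity;
    latitude (meridional) axis is zero-padded as a hard boundary."""
    # circular padding along longitude (axis=1)
    x = np.concatenate([x[:, -pad:], x, x[:, :pad]], axis=1)
    # zero padding along latitude (axis=0)
    zeros = np.zeros((pad,) + x.shape[1:], dtype=x.dtype)
    return np.concatenate([zeros, x, zeros], axis=0)


if __name__ == "__main__":
    field = np.arange(180 * 360, dtype=np.float32).reshape(180, 360, 1)
    padded = pad_lat_lon_field(field, pad=4)
    print(field.shape, "->", padded.shape)  # (180, 360, 1) -> (188, 368, 1)
```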
zh
[AI-14] EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在处理开放性、复杂查询时因依赖固定工作流而导致适应性不足的问题。现有自演化方法虽尝试通过自由形式的代码或提示重写来提升性能,但易引发不稳定、幻觉及指令漂移等风险。其解决方案的关键在于提出EvoFSM框架,该框架通过演化显式的有限状态机(Finite State Machine, FSM)实现结构化自演化:将优化空间解耦为宏观层面的状态转移逻辑(Flow)与微观层面的状态特定行为(Skill),并在明确的行为边界下进行针对性改进;同时引入批评机制(critic mechanism)引导有限约束操作,并结合自演化记忆模块,将成功轨迹作为可复用先验、失败模式作为未来任务的约束条件,从而在保持控制力的同时显著增强适应能力。
链接: https://arxiv.org/abs/2601.09465
作者: Shuo Zhang,Chaofa Yuan,Ryan Guo,Xiaomin Yu,Rui Xu,Zhangquan Chen,Zinuo Li,Zhi Yang,Shuhao Guan,Zhenheng Tang,Sen Hu,Liwen Zhang,Ronghao Chen,Huacan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While LLM-based agents have shown promise for deep research, most existing approaches rely on fixed workflows that struggle to adapt to real-world, open-ended queries. Recent work therefore explores self-evolution by allowing agents to rewrite their own code or prompts to improve problem-solving ability, but unconstrained optimization often triggers instability, hallucinations, and instruction drift. We propose EvoFSM, a structured self-evolving framework that achieves both adaptability and control by evolving an explicit Finite State Machine (FSM) instead of relying on free-form rewriting. EvoFSM decouples the optimization space into macroscopic Flow (state-transition logic) and microscopic Skill (state-specific behaviors), enabling targeted improvements under clear behavioral boundaries. Guided by a critic mechanism, EvoFSM refines the FSM through a small set of constrained operations, and further incorporates a self-evolving memory that distills successful trajectories as reusable priors and failure patterns as constraints for future queries. Extensive evaluations on five multi-hop QA benchmarks demonstrate the effectiveness of EvoFSM. In particular, EvoFSM reaches 58.0% accuracy on the DeepSearch benchmark. Additional results on interactive decision-making tasks further validate its generalization.
zh
[AI-15] SoK: Enhancing Cryptographic Collaborative Learning with Differential Privacy
【速读】:该论文旨在解决协同学习(Collaborative Learning, CL)中隐私与性能之间的权衡问题,即如何在保护数据输入隐私的同时,抵御来自模型输出的推断攻击,并维持较高的模型准确性。其核心挑战在于:加密技术(如多方计算,Multi-Party Computation, MPC)虽能保障输入隐私,但带来显著的计算和通信开销;而差分隐私(Differential Privacy, DP)通过注入噪声提升输出隐私,却会降低模型精度,形成隐私-准确率-性能三者间的复杂权衡。论文的关键解决方案是提出一个统一框架,将CPCL(Cryptographic and Differentially Private Collaborative Learning)的通用阶段系统化,并识别出“安全噪声采样”作为实现CPCL的基础环节。作者进一步分析了不同安全噪声采样方法、噪声类型及差分隐私机制的实现难点,评估其在准确性和加密开销上的表现,最终通过在MPC环境中实现并实测多种噪声采样方案的计算与通信成本,为未来研究指明方向。
链接: https://arxiv.org/abs/2601.09460
作者: Francesco Capano,Jonas Böhler,Benjamin Weggenmann
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2026)
Abstract:In collaborative learning (CL), multiple parties jointly train a machine learning model on their private datasets. However, data can not be shared directly due to privacy concerns. To ensure input confidentiality, cryptographic techniques, e.g., multi-party computation (MPC), enable training on encrypted data. Yet, even securely trained models are vulnerable to inference attacks aiming to extract memorized data from model outputs. To ensure output privacy and mitigate inference attacks, differential privacy (DP) injects calibrated noise during training. While cryptography and DP offer complementary guarantees, combining them efficiently for cryptographic and differentially private CL (CPCL) is challenging. Cryptography incurs performance overheads, while DP degrades accuracy, creating a privacy-accuracy-performance trade-off that needs careful design considerations. This work systematizes the CPCL landscape. We introduce a unified framework that generalizes common phases across CPCL paradigms, and identify secure noise sampling as the foundational phase to achieve CPCL. We analyze trade-offs of different secure noise sampling techniques, noise types, and DP mechanisms discussing their implementation challenges and evaluating their accuracy and cryptographic overhead across CPCL paradigms. Additionally, we implement identified secure noise sampling options in MPC and evaluate their computation and communication costs in WAN and LAN. Finally, we propose future research directions based on identified key observations, gaps and possible enhancements in the literature.
zh
[AI-16] On the Hardness of Computing Counterfactual and Semifactual Explanations in XAI
【速读】:该论文旨在解决可解释人工智能(Explainable AI, XAI)领域中,如何为机器学习模型的决策提供清晰解释的问题,特别是在关键应用场景下确保解释的有效性与可靠性。其解决方案的关键在于系统性地分析生成反事实(Counterfactual)和半反事实(Semi-factual)解释的计算复杂性,并通过引入新的不可逼近性结果,证明在许多情况下不仅难以生成精确解释,而且在合理假设下也难以高效近似生成解释,从而强化了XAI实践中算法设计与政策制定需考虑计算限制的论点。
链接: https://arxiv.org/abs/2601.09455
作者: André Artelt,Martin Olsen,Kevin Tierney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in Transactions on Machine Learning Research (TMLR), 2025 – this https URL
Abstract:Providing clear explanations to the choices of machine learning models is essential for these models to be deployed in crucial applications. Counterfactual and semi-factual explanations have emerged as two mechanisms for providing users with insights into the outputs of their models. We provide an overview of the computational complexity results in the literature for generating these explanations, finding that in many cases, generating explanations is computationally hard. We strengthen the argument for this considerably by further contributing our own inapproximability results showing that not only are explanations often hard to generate, but under certain assumptions, they are also hard to approximate. We discuss the implications of these complexity results for the XAI community and for policymakers seeking to regulate explanations in AI.
zh
[AI-17] Late Breaking Results: Quamba-SE: Soft-edge Quantizer for Activations in State Space Models DATE
【速读】:该论文旨在解决状态空间模型(State Space Model, SSM)在量化过程中因硬截断(hard clipping)导致的异常值信息丢失问题,从而影响模型精度。现有方法通常采用标准INT8运算对激活值进行量化,难以兼顾小值、正常值与异常值的精度保持。解决方案的关键在于提出一种软边缘量化器(soft-edge quantizer),即Quamba-SE,其通过引入三个自适应缩放因子:高精度缩放用于小值、标准缩放用于常规值、低精度缩放用于异常值,实现了对异常值信息的保留,同时维持其他数值的量化精度。实验表明,该方法在Mamba-130M模型上于6个零样本基准测试中均优于传统量化方案,平均准确率提升达+0.83%。
链接: https://arxiv.org/abs/2601.09451
作者: Yizhi Chen,Ahmed Hemani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Accepted to DATE Late Breaking Results 2026, Verona, Italy
Abstract:We propose Quamba-SE, a soft-edge quantizer for State Space Model (SSM) activation quantization. Unlike existing methods using standard INT8 operation, Quamba-SE employs three adaptive scales: high-precision for small values, standard scale for normal values, and low-precision for outliers. This preserves outlier information instead of hard clipping, while maintaining precision for other values. We evaluate on Mamba-130M across 6 zero-shot benchmarks. Results show that Quamba-SE consistently outperforms Quamba, achieving up to +2.68% on individual benchmarks and up to +0.83% improvement in the average accuracy of 6 datasets.
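下面是对“三段自适应缩放”思路的一个假设性 NumPy 示意:按分位数把激活分成小值/常规值/离群值三段,各自用不同缩放因子做 INT8 量化再反量化,从而避免对离群值硬截断。分段阈值与分位数均为示意假设,并非论文中的具体设定:

```python
import numpy as np


def soft_edge_quantize(x: np.ndarray, small_pct: float = 30.0, outlier_pct: float = 99.0) -> np.ndarray:
    """Quantize activations with three per-segment INT8 scales instead of one hard-clipped scale."""
    mag = np.abs(x)
    t_small = np.percentile(mag, small_pct)   # below this: high-precision scale
    t_out = np.percentile(mag, outlier_pct)   # above this: low-precision (outlier) scale
    out = np.empty_like(x)
    segments = [(mag <= t_small, t_small),
                ((mag > t_small) & (mag <= t_out), t_out),
                (mag > t_out, mag.max())]
    for mask, bound in segments:
        scale = max(bound, 1e-8) / 127.0                       # per-segment INT8 scale
        q = np.clip(np.round(x[mask] / scale), -127, 127).astype(np.int8)
        out[mask] = q.astype(np.float32) * scale               # dequantize for comparison
    return out


if __name__ == "__main__":
    acts = np.random.randn(10000).astype(np.float32)
    acts[:10] *= 50.0                                          # inject a few outliers
    deq = soft_edge_quantize(acts)
    print("max abs error:", np.abs(acts - deq).max())
```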
zh
[AI-18] Population-Aligned Audio Reproduction With LLM-Based Equalizers
【速读】:该论文旨在解决传统音频均衡(Audio Equalization)过程中依赖人工手动调整、难以适应动态听音场景(如情绪、环境或社交情境变化)的问题。其核心解决方案是引入大语言模型(Large Language Model, LLM),通过自然语言文本提示(text prompts)映射至具体的均衡参数设置,从而实现对话式的声音系统控制。关键在于利用受控听音实验收集的数据,结合上下文学习(in-context learning)与参数高效微调(parameter-efficient fine-tuning)技术,使模型能够可靠地对齐群体偏好的均衡配置,并在分布层面显著优于随机采样和静态预设基线。
链接: https://arxiv.org/abs/2601.09448
作者: Ioannis Stylianou,Jon Francombe,Pablo Martinez-Nuevo,Sven Ewan Shepstone,Zheng-Hua Tan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 12 pages, 13 figures, 2 tables, IEEE JSTSP journal submission under first revision
Abstract:Conventional audio equalization is a static process that requires manual and cumbersome adjustments to adapt to changing listening contexts (e.g., mood, location, or social setting). In this paper, we introduce a Large Language Model (LLM)-based alternative that maps natural language text prompts to equalization settings. This enables a conversational approach to sound system control. By utilizing data collected from a controlled listening experiment, our models exploit in-context learning and parameter-efficient fine-tuning techniques to reliably align with population-preferred equalization settings. Our evaluation methods, which leverage distributional metrics that capture users’ varied preferences, show statistically significant improvements in distributional alignment over random sampling and static preset baselines. These results indicate that LLMs could function as “artificial equalizers,” contributing to the development of more accessible, context-aware, and expert-level audio tuning methods.
zh
[AI-19] FairGE: Fairness-Aware Graph Encoding in Incomplete Social Networks WWW2026
【速读】:该论文旨在解决图Transformer(Graph Transformers, GTs)在不完整社交网络中因敏感属性缺失而导致的公平性问题。现有方法通常通过生成缺失的敏感属性来应对,但可能引入新的偏差并损害用户隐私。其解决方案的关键在于提出FairGE(Fair Graph Encoding)框架,该框架不依赖于生成敏感属性,而是基于谱图理论直接编码公平性:利用主特征向量(principal eigenvector)表示结构信息,并将缺失的敏感属性用零填充以保持独立性,从而在不重建数据的前提下实现公平性保障。理论分析表明,该方法可抑制非主谱分量的影响,实验验证其在统计均等性和机会平等性上相较最先进基线提升至少16%。
链接: https://arxiv.org/abs/2601.09394
作者: Renqiang Luo,Huafei Huang,Tao Tang,Jing Ren,Ziqi Xu,Mingliang Hou,Enyan Dai,Feng Xia
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 12 pages, WWW 2026
Abstract:Graph Transformers (GTs) are increasingly applied to social network analysis, yet their deployment is often constrained by fairness concerns. This issue is particularly critical in incomplete social networks, where sensitive attributes are frequently missing due to privacy and ethical restrictions. Existing solutions commonly generate these incomplete attributes, which may introduce additional biases and further compromise user privacy. To address this challenge, FairGE (Fair Graph Encoding) is introduced as a fairness-aware framework for GTs in incomplete social networks. Instead of generating sensitive attributes, FairGE encodes fairness directly through spectral graph theory. By leveraging the principal eigenvector to represent structural information and padding incomplete sensitive attributes with zeros to maintain independence, FairGE ensures fairness without data reconstruction. Theoretical analysis demonstrates that the method suppresses the influence of non-principal spectral components, thereby enhancing fairness. Extensive experiments on seven real-world social network datasets confirm that FairGE achieves at least a 16% improvement in both statistical parity and equality of opportunity compared with state-of-the-art baselines. The source code is shown in this https URL.
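以下用幂迭代给出“主特征向量编码结构信息 + 缺失敏感属性补零”两个要点的最小示意(假设性实现,仅对应摘要描述的编码部分,不包含论文中的 Graph Transformer 与公平性分析):

```python
import numpy as np


def principal_eigenvector(adj: np.ndarray, iters: int = 100) -> np.ndarray:
    """Power iteration on a symmetric adjacency matrix."""
    v = np.ones(adj.shape[0]) / np.sqrt(adj.shape[0])
    for _ in range(iters):
        v = adj @ v
        v /= np.linalg.norm(v) + 1e-12
    return v


def encode_nodes(adj, features, sensitive, missing_mask):
    """Concatenate [features | zero-padded sensitive attribute | principal-eigvec score].
    missing_mask[i] = True means node i's sensitive attribute is unavailable."""
    sens = np.where(missing_mask, 0.0, sensitive)      # pad missing entries with zeros
    struct = principal_eigenvector(adj)[:, None]        # structural encoding
    return np.concatenate([features, sens[:, None], struct], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 6
    adj = rng.integers(0, 2, size=(n, n)).astype(float)
    adj = np.triu(adj, 1)
    adj = adj + adj.T                                    # symmetric, no self-loops
    feats = rng.normal(size=(n, 4))
    sens = rng.integers(0, 2, size=n).astype(float)
    mask = np.array([True, False, True, False, False, True])
    print(encode_nodes(adj, feats, sens, mask).shape)    # (6, 6)
```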
zh
[AI-20] Query Languages for Machine-Learning Models
【速读】:该论文旨在解决如何用形式化逻辑语言来表达和分析神经网络(neural networks)中的查询问题,特别是在其作为加权图(weighted graphs)表示时的可计算性和表达能力。解决方案的关键在于引入并研究两种逻辑系统:带求和运算的一阶逻辑(FO(SUM))及其递归扩展IFP(SUM),这些逻辑源自Grädel、Gurevich与Meer在1990年代的基础工作,并被重新应用于作为机器学习模型的神经网络查询建模中,从而揭示了它们在表达能力与计算复杂性方面的基本性质。
链接: https://arxiv.org/abs/2601.09381
作者: Martin Grohe
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:In this paper, I discuss two logics for weighted finite structures: first-order logic with summation (FO(SUM)) and its recursive extension IFP(SUM). These logics originate from foundational work by Grädel, Gurevich, and Meer in the 1990s. In recent joint work with Standke, Steegmans, and Van den Bussche, we have investigated these logics as query languages for machine learning models, specifically neural networks, which are naturally represented as weighted graphs. I present illustrative examples of queries to neural networks that can be expressed in these logics and discuss fundamental results on their expressiveness and computational complexity.
zh
[AI-21] GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
【速读】:该论文旨在解决在强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)场景下,现有参数高效微调方法(如PiSSA和MiLoRA)因未考虑RLVR特有的优化动态与几何结构,导致谱坍缩(spectral collapse)和优化不稳定性的问题。同时,基于更新稀疏性的替代方案由于非结构化计算在现代硬件上存在显著效率瓶颈。解决方案的关键在于提出GeoRA(Geometry-Aware Low-Rank Adaptation),其通过在几何约束子空间内利用奇异值分解(SVD)提取主方向来初始化适配器,并冻结剩余分量,从而保留预训练模型的几何结构,同时借助密集运算实现高效GPU计算,有效缓解由几何错位引起的优化瓶颈。
链接: https://arxiv.org/abs/2601.09361
作者: Jiaying Zhang,Lei Shi,Jiguo Li,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is crucial for advancing large-scale reasoning models. However, existing parameter-efficient methods, such as PiSSA and MiLoRA, are designed for Supervised Fine-Tuning (SFT) and do not account for the distinct optimization dynamics and geometric structures of RLVR. Applying these methods directly leads to spectral collapse and optimization instability, which severely limit model performance. Meanwhile, alternative approaches that leverage update sparsity encounter significant efficiency bottlenecks on modern hardware due to unstructured computations. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA initializes adapters by extracting principal directions via Singular Value Decomposition (SVD) within a geometrically constrained subspace while freezing the residual components. This method preserves the pre-trained geometric structure and enables efficient GPU computation through dense operators. Experiments on Qwen and Llama demonstrate that GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art (SOTA) results. Moreover, GeoRA shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
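下面是“用 SVD 主方向初始化低秩适配器、冻结残差分量”这一思路的假设性示意(与 PiSSA 类似的初始化骨架;摘要中提到的几何约束子空间如何构造并未给出,此处从略):

```python
import numpy as np


def init_lowrank_adapter(weight: np.ndarray, rank: int):
    """Split a weight matrix into a trainable low-rank adapter (top-r SVD directions)
    and a frozen residual carrying the remaining spectrum."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = np.diag(np.sqrt(s[:rank])) @ vt[:rank]      # adapter A: [rank, in]
    b = u[:, :rank] @ np.diag(np.sqrt(s[:rank]))    # adapter B: [out, rank]
    residual = weight - b @ a                       # frozen during RLVR updates
    return b, a, residual


if __name__ == "__main__":
    w = np.random.randn(64, 32)
    b, a, res = init_lowrank_adapter(w, rank=8)
    print(np.allclose(b @ a + res, w))              # True: exact decomposition
```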
zh
[AI-22] Monte-Carlo Tree Search with Neural Network Guidance for Lane-Free Autonomous Driving
【速读】:该论文旨在解决无车道线交通环境下的单智能体自主驾驶规划问题,此类环境允许车辆充分利用道路横向空间以提升通行效率,但同时也带来了更复杂的决策挑战。解决方案的关键在于提出一种基于蒙特卡洛树搜索(Monte-Carlo Tree Search, MCTS)的规划方法,并引入一个预训练神经网络(Neural Network, NN)来指导MCTS的选择阶段,从而在计算资源受限条件下增强搜索的智能化程度和效率。该NN利用其预测能力提升树搜索过程的导向性,实现安全(通过碰撞率衡量)与效能(通过行驶速度衡量)之间的平衡。
链接: https://arxiv.org/abs/2601.09353
作者: Ioannis Peridis,Dimitrios Troullinos,Georgios Chalkiadakis,Pantelis Giankoulidis,Ioannis Papamichail,Markos Papageorgiou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Lane-free traffic environments allow vehicles to better harness the lateral capacity of the road without being restricted to lane-keeping, thereby increasing the traffic flow rates. As such, we have a distinct and more challenging setting for autonomous driving. In this work, we consider a Monte-Carlo Tree Search (MCTS) planning approach for single-agent autonomous driving in lane-free traffic, where the associated Markov Decision Process we formulate is influenced by existing approaches tied to reinforcement learning frameworks. In addition, MCTS is equipped with a pre-trained neural network (NN) that guides the selection phase. This procedure incorporates the predictive capabilities of NNs for a more informed tree search process under computational constraints. In our experimental evaluation, we consider metrics that address both safety (through collision rates) and efficacy (through measured speed). Then, we examine: (a) the influence of isotropic state information for vehicles in a lane-free environment, resulting in nudging behaviour, where vehicles’ policy reacts to the presence of faster tailing ones, (b) the acceleration of performance for the NN-guided variant of MCTS, and (c) the trade-off between computational resources and solution quality.
zh
[AI-23] Navigating Ethical AI Challenges in the Industrial Sector: Balancing Innovation and Responsibility
【速读】:该论文旨在解决工业领域中人工智能(Artificial Intelligence, AI)应用日益广泛背景下伦理问题的系统性缺失问题,即如何在工业AI的研发与部署过程中嵌入伦理原则,以应对透明度、责任归属和公平性等新兴挑战。其解决方案的关键在于将伦理规范主动融入工业AI系统的全生命周期设计中,包括研发阶段的伦理实践和数据共享机制,并强调通过伦理驱动的技术创新来增强利益相关者信任,从而推动形成更加负责任和包容的工业生态系统。
链接: https://arxiv.org/abs/2601.09351
作者: Ruomu Tan,Martin W Hoffmann
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of artificial intelligence (AI) into the industrial sector has not only driven innovation but also expanded the ethical landscape, necessitating a reevaluation of principles governing technology and its applications and awareness in research and development of industrial AI solutions. This chapter explores how AI-empowered industrial innovation inherently intersects with ethics, as advancements in AI introduce new challenges related to transparency, accountability, and fairness. In the chapter, we then examine the ethical aspects of several examples of AI manifestation in industrial use cases and associated factors such as ethical practices in the research and development process and data sharing. With the progress of ethical industrial AI solutions, we emphasize the importance of embedding ethical principles into industrial AI systems and its potential to inspire technological breakthroughs and foster trust among stakeholders. This chapter also offers actionable insights to guide industrial research and development toward a future where AI serves as an enabler for ethical and responsible industrial progress as well as a more inclusive industrial ecosystem.
zh
[AI-24] On-Device Large Language Models for Sequential Recommendation WSDM’26
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限设备上进行高效部署的问题,特别是在顺序推荐任务中,如何在保持模型性能的同时降低内存占用和计算开销。其关键解决方案是提出OD-LLM框架,该框架融合了两种互补的压缩策略:一是基于奇异值分解(Singular Value Decomposition, SVD)的低秩结构压缩算法,用于显著减少模型参数冗余;二是新颖的标记归一化技术,以更好配合低秩分解过程。此外,为避免高压缩比下的性能损失,引入一种渐进式对齐算法,逐层迭代优化目标模型参数,从而实现模型尺寸减半后仍无效果损失的高效、准确的设备端部署。
链接: https://arxiv.org/abs/2601.09306
作者: Xin Xia,Hongzhi Yin,Shane Culpepper
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: WSDM’26
Abstract:On-device recommendation is critical for a number of real-world applications, especially in scenarios that have agreements on execution latency, user privacy, and robust functionality when internet connectivity is unstable or even impossible. While large language models (LLMs) can now provide exceptional capabilities that model user behavior for sequential recommendation tasks, their substantial memory footprint and computational overhead make the deployment on resource-constrained devices a high risk proposition. In this paper, we propose OD-LLM, the first task-adaptive compression framework explicitly designed to provide efficient and accurate on-device deployment of LLMs for sequential recommendation tasks. OD-LLM uniquely integrates two complementary compression strategies: a low-rank structural compression algorithm which uses Singular Value Decomposition (SVD) to significantly reduce parameter redundancy in the model, and a novel tokenization normalization technique that better complements the low-rank decomposition process being used. Additionally, to minimize any potential performance degradation when using higher compression ratios, a novel progressive alignment algorithm is used to iteratively refine the parameters required layerwise in the target model. Empirical evaluations conducted on sequential recommendation benchmarks show that OD-LLM exhibits no loss in effectiveness when compared to the original recommendation model, when the deployed model size is halved. These promising results demonstrate the efficacy and scalability of OD-LLM, making this novel solution a practical alternative for real-time, on-device solutions wishing to replace expensive, remotely executed LLMs.
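下面给出“用截断 SVD 把一个线性层替换为两个小线性层”的通用示意(假设性代码,只演示低秩结构压缩这一成分;论文中的标记归一化与渐进式对齐不在此列):

```python
import torch
import torch.nn as nn


def svd_compress_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace Linear(in, out) with Linear(in, rank) -> Linear(rank, out)."""
    w = layer.weight.data                                  # [out, in]
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = torch.diag(s[:rank]) @ vh[:rank]   # [rank, in]
    second.weight.data = u[:, :rank].contiguous()          # [out, rank]
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)


if __name__ == "__main__":
    dense = nn.Linear(1024, 1024)
    low_rank = svd_compress_linear(dense, rank=128)
    x = torch.randn(2, 1024)
    print(dense(x).shape, low_rank(x).shape)               # same output shape
    n_params = lambda m: sum(p.numel() for p in m.parameters())
    print(n_params(dense), "->", n_params(low_rank))       # far fewer parameters
```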
zh
[AI-25] Policy-Based Reinforcement Learning with Action Masking for Dynamic Job Shop Scheduling under Uncertainty: Handling Random Arrivals and Machine Failures
【速读】:该论文旨在解决动态作业车间调度问题(Dynamic Job Shop Scheduling Problem, DJSSP)在不确定性环境下的优化难题,特别是应对随机任务到达和突发设备故障带来的挑战。解决方案的关键在于构建一个基于模型的框架:使用彩色时序Petri网(Coloured Timed Petri Nets)对调度环境进行形式化建模以保证可解释性与结构清晰性,同时引入可掩码近端策略优化(Maskable Proximal Policy Optimization)实现动态决策,并通过两种动作掩码策略(非梯度法和基于梯度法)确保智能体仅在可行操作空间中选择动作,从而提升调度方案的实时适应性与鲁棒性。此外,采用伽马分布(Gamma distribution)和威布尔分布(Weibull distribution)分别模拟任务到达和设备退化过程,使系统更贴近真实工业场景,最终在多个动态JSSP基准测试中显著优于传统启发式与规则驱动方法,在最小化完工时间(makespan)方面表现出优越性能。
链接: https://arxiv.org/abs/2601.09293
作者: Sofiene Lassoued,Stefan Lier,Andreas Schwung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present a novel framework for solving Dynamic Job Shop Scheduling Problems under uncertainty, addressing the challenges introduced by stochastic job arrivals and unexpected machine breakdowns. Our approach follows a model-based paradigm, using Coloured Timed Petri Nets to represent the scheduling environment, and Maskable Proximal Policy Optimization to enable dynamic decision-making while restricting the agent to feasible actions at each decision point. To simulate realistic industrial conditions, dynamic job arrivals are modeled using a Gamma distribution, which captures complex temporal patterns such as bursts, clustering, and fluctuating workloads. Machine failures are modeled using a Weibull distribution to represent age-dependent degradation and wear-out dynamics. These stochastic models enable the framework to reflect real-world manufacturing scenarios better. In addition, we study two action-masking strategies: a non-gradient approach that overrides the probabilities of invalid actions, and a gradient-based approach that assigns negative gradients to invalid actions within the policy network. We conduct extensive experiments on dynamic JSSP benchmarks, demonstrating that our method consistently outperforms traditional heuristic and rule-based approaches in terms of makespan minimization. The results highlight the strength of combining interpretable Petri-net-based models with adaptive reinforcement learning policies, yielding a resilient, scalable, and explainable framework for real-time scheduling in dynamic and uncertain manufacturing environments.
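摘要中的第一种掩码策略(非梯度法:直接覆盖无效动作的概率)可以用“给无效动作的 logits 置为负无穷”来实现。下面是一个与具体环境无关的最小 PyTorch 示意(假设动作可行性掩码已由 Petri 网模型给出):

```python
import torch


def masked_action_distribution(logits: torch.Tensor, valid_mask: torch.Tensor):
    """logits: [batch, n_actions]; valid_mask: same shape, True = feasible action.
    Invalid actions get -inf logits, so their sampling probability becomes exactly 0."""
    masked_logits = logits.masked_fill(~valid_mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)


if __name__ == "__main__":
    logits = torch.randn(2, 5)
    mask = torch.tensor([[True, False, True, True, False],
                         [False, True, True, False, True]])
    dist = masked_action_distribution(logits, mask)
    actions = dist.sample()
    print(actions, dist.log_prob(actions))   # sampled actions are always feasible
```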
zh
[AI-26] Blue Teaming Function-Calling Agents AAAI2026
【速读】:该论文旨在解决当前开源大语言模型(Large Language Models, LLMs)在具备函数调用(function-calling)能力时,其安全性不足的问题,特别是面对多种攻击手段时的脆弱性。研究通过实验评估四类声称支持函数调用的开源LLM对三种不同攻击的鲁棒性,并测试八种防御机制的有效性。其关键发现是:这些模型默认状态下不具备安全性,且现有防御措施尚未达到实际部署的要求,表明当前函数调用功能的安全保障仍处于初级阶段,亟需更可靠的防护机制设计与验证。
链接: https://arxiv.org/abs/2601.09292
作者: Greta Dolcetti,Giulio Zizzo,Sergio Maffeis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This work has been accepted to appear at the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Abstract:We present an experimental evaluation that assesses the robustness of four open source LLMs claiming function-calling capabilities against three different attacks, and we measure the effectiveness of eight different defences. Our results show how these models are not safe by default, and how the defences are not yet employable in real-world scenarios.
zh
[AI-27] Why not Collaborative Filtering in Dual View? Bridging Sparse and Dense Models
【速读】:该论文旨在解决当前基于密集嵌入(dense embedding)的协同过滤(Collaborative Filtering, CF)方法在处理不受欢迎物品(unpopular items)时面临的信噪比(Signal-to-Noise Ratio, SNR)瓶颈问题,即在数据极度稀疏条件下,参数驱动的密集模型会因信号衰减而导致性能显著下降。解决方案的关键在于提出一个统一框架 SaD(Sparse and Dense),通过双视角对齐机制实现稀疏交互模式与密集语义嵌入的互补融合:一方面,密集视图利用语义相关性增强稀疏视图的信息表达;另一方面,稀疏视图通过显式的结构信号对密集模型进行正则化。理论分析表明,这种双视图对齐可获得严格更优的全局SNR,且无需复杂架构即可显著提升推荐性能,验证了协同过滤在多视角建模下的持续有效性。
链接: https://arxiv.org/abs/2601.09286
作者: Hanze Guo,Jianxun Lian,Xiao Zhou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 25 pages, 6 figures
Abstract:Collaborative Filtering (CF) remains the cornerstone of modern recommender systems, with dense embedding–based methods dominating current practice. However, these approaches suffer from a critical limitation: our theoretical analysis reveals a fundamental signal-to-noise ratio (SNR) ceiling when modeling unpopular items, where parameter-based dense models experience diminishing SNR under severe data sparsity. To overcome this bottleneck, we propose SaD (Sparse and Dense), a unified framework that integrates the semantic expressiveness of dense embeddings with the structural reliability of sparse interaction patterns. We theoretically show that aligning these dual views yields a strictly superior global SNR. Concretely, SaD introduces a lightweight bidirectional alignment mechanism: the dense view enriches the sparse view by injecting semantic correlations, while the sparse view regularizes the dense model through explicit structural signals. Extensive experiments demonstrate that, under this dual-view alignment, even a simple matrix factorization–style dense model can achieve state-of-the-art performance. Moreover, SaD is plug-and-play and can be seamlessly applied to a wide range of existing recommender models, highlighting the enduring power of collaborative filtering when leveraged from dual perspectives. Further evaluations on real-world benchmarks show that SaD consistently outperforms strong baselines, ranking first on the BarsMatch leaderboard. The code is publicly available at this https URL.
zh
[AI-28] Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing
【速读】:该论文旨在解决集群工作负载分配中因复杂配置而产生的可用性缺口问题,即传统调度方式对用户的技术门槛高、操作繁琐。其解决方案的关键在于引入一种基于自然语言处理(Natural Language Processing, NLP)的语义意图驱动调度范式,通过集成大型语言模型(Large Language Model, LLM)作为Kubernetes调度扩展器,将用户以自然语言形式表达的软亲和性偏好(soft affinity preferences)转化为可执行的调度决策。原型系统包含集群状态缓存与意图分析模块(使用AWS Bedrock实现),实验证明该方法在解析准确率(95%子集准确率)和调度质量上显著优于基线方案,尤其在复杂及定量场景下表现优异,验证了LLM在简化工作负载编排中的可行性。
链接: https://arxiv.org/abs/2601.09282
作者: Leszek Sliwko,Jolanta Mizeria-Pietraszko
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:
Abstract:Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state cache and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high LLM parsing accuracy (95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration.
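调度扩展器的核心是把 LLM 解析出的结构化意图转成对候选节点的软亲和打分。下面给出一个与 Kubernetes API 无关的纯 Python 示意(假设 LLM 已把自然语言注解解析为一个偏好字典;字段名与权重均为示例假设):

```python
def score_nodes(parsed_intent: dict, nodes: list[dict]) -> list[tuple[str, float]]:
    """Soft-affinity scoring: each satisfied preference adds its weight to the node's
    score; unsatisfied preferences only lower the ranking, never exclude a node."""
    scores = []
    for node in nodes:
        score = 0.0
        for pref in parsed_intent.get("preferences", []):
            if node.get("labels", {}).get(pref["key"]) == pref["value"]:
                score += pref.get("weight", 1.0)
        scores.append((node["name"], score))
    return sorted(scores, key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    intent = {"preferences": [{"key": "zone", "value": "eu-west-1a", "weight": 2.0},
                              {"key": "gpu", "value": "true", "weight": 1.0}]}
    nodes = [{"name": "node-a", "labels": {"zone": "eu-west-1a", "gpu": "false"}},
             {"name": "node-b", "labels": {"zone": "eu-west-1b", "gpu": "true"}}]
    print(score_nodes(intent, nodes))   # node-a ranks first under this intent
```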
zh
[AI-29] STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在多步思维链(Chain-of-Thought, CoT)推理过程中因敏感信息深度嵌入而引发的隐私泄露问题。现有大语言模型(Large Language Models, LLMs)的遗忘方法仅针对最终答案进行修改,无法有效清除中间推理步骤中的敏感内容,导致隐私泄露风险持续存在。其解决方案的关键在于提出一种无需参数调整、可在推理阶段执行的隐私保护框架——敏感轨迹调控(Sensitive Trajectory Regulation, STaR),该框架通过语义感知检测识别敏感内容、注入全局安全约束、轨迹感知抑制动态屏蔽敏感信息,并结合基于token级别的自适应过滤机制防止精确与改写形式的敏感词生成,从而实现对整个推理链的全面隐私保护。
链接: https://arxiv.org/abs/2601.09281
作者: Jingjing Zhou,Gaoxiang Cong,Li Su,Liang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) have advanced automated multi-step reasoning, but their ability to generate complex Chain-of-Thought (CoT) trajectories introduces severe privacy risks, as sensitive information may be deeply embedded throughout the reasoning process. Existing Large Language Models (LLMs) unlearning approaches that typically focus on modifying only final answers are insufficient for LRMs, as they fail to remove sensitive content from intermediate steps, leading to persistent privacy leakage and degraded security. To address these challenges, we propose Sensitive Trajectory Regulation (STaR), a parameter-free, inference-time unlearning framework that achieves robust privacy protection throughout the reasoning process. Specifically, we first identify sensitive content via semantic-aware detection. Then, we inject global safety constraints through secure prompt prefix. Next, we perform trajectory-aware suppression to dynamically block sensitive content across the entire reasoning chain. Finally, we apply token-level adaptive filtering to prevent both exact and paraphrased sensitive tokens during generation. Furthermore, to overcome the inadequacies of existing evaluation protocols, we introduce two metrics: Multi-Decoding Consistency Assessment (MCS), which measures the consistency of unlearning across diverse decoding strategies, and Multi-Granularity Membership Inference Attack (MIA) Evaluation, which quantifies privacy protection at both answer and reasoning-chain levels. Experiments on the R-TOFU benchmark demonstrate that STaR achieves comprehensive and stable unlearning with minimal utility loss, setting a new standard for privacy-preserving reasoning in LRMs.
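“token 级自适应过滤”的最小形式,是在每一步解码时压制敏感词对应 token 的概率。下面是一个与具体模型无关的 PyTorch 示意(敏感 token 集合假设已由前面的语义检测步骤给出;对改写形式的处理在此从略):

```python
import torch


def suppress_sensitive_tokens(logits: torch.Tensor, banned_ids: set[int]) -> torch.Tensor:
    """logits: [batch, vocab]. Returns logits with banned token ids set to -inf,
    so they can never be sampled or chosen greedily at this decoding step."""
    if not banned_ids:
        return logits
    ids = torch.tensor(sorted(banned_ids), dtype=torch.long)
    out = logits.clone()
    out[:, ids] = float("-inf")
    return out


if __name__ == "__main__":
    logits = torch.randn(1, 10)
    filtered = suppress_sensitive_tokens(logits, banned_ids={3, 7})
    print(torch.argmax(filtered, dim=-1))   # never 3 or 7
```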
zh
[AI-30] M3Searcher: Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning
【速读】:该论文旨在解决当前深度研究型代理(DeepResearch-style agents)在多模态信息检索任务中的两大核心挑战:一是大规模训练多模态工具使用时面临的“专业化-泛化权衡”问题,二是缺乏能够捕捉复杂多步多模态搜索轨迹的高质量训练数据。其解决方案的关键在于提出M^3 Searcher,一个模块化的多模态信息寻求代理,通过显式解耦信息获取(information acquisition)与答案推导(answer derivation)两个阶段,并采用以检索为导向的多目标奖励机制,联合优化事实准确性、推理合理性与检索保真度,从而提升模型在复杂多模态任务中的性能和迁移适应能力。
链接: https://arxiv.org/abs/2601.09278
作者: Xiaohan Yu,Chao Feng,Lang Mei,Chong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in DeepResearch-style agents have demonstrated strong capabilities in autonomous information acquisition and synthesis from real-world web environments. However, existing approaches remain fundamentally limited to the text modality. Extending autonomous information-seeking agents to multimodal settings introduces critical challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool-use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories. To address these challenges, we propose M^3Searcher, a modular multimodal information-seeking agent that explicitly decouples information acquisition from answer derivation. M^3Searcher is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. In addition, we develop MMSearchVQA, a multimodal multi-hop dataset to support retrieval-centric RL training. Experimental results demonstrate that M^3Searcher outperforms existing approaches and exhibits strong transfer adaptability and effective reasoning in complex multimodal tasks.
zh
[AI-31] A3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation
【速读】:该论文旨在解决现有科学推理评估基准仅关注最终答案或步骤连贯性,而忽视人类推理中关键的记忆驱动机制的问题。其核心在于揭示并量化记忆如何通过锚点(anchor)和吸引子(attractor)的激活,在多步推理过程中促进知识复用、增强推理一致性与稳定性。解决方案的关键是提出A³-Bench基准,基于SAPM(Subject-Anchor-Attractor-Problem-Memory)标注流程构建2,198个跨领域科学推理问题,并引入双尺度记忆评估框架及AAUI(Anchor–Attractor Utilization Index)指标,以系统测量模型在推理中对记忆的激活程度,从而为理解记忆驱动型科学推理提供可量化的分析工具。
链接: https://arxiv.org/abs/2601.09274
作者: Jian Zhang,Yu He,Zhiyuan Wang,Zhangqi Wang,Kai He,Fangzhi Xu,Qika Lin,Jun Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory can efficiently reuse knowledge and enhance reasoning consistency and stability. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the memory-driven mechanisms that underlie human reasoning, which involves activating anchors and attractors, then integrating them into multi-step inference. To address this gap, we propose A^3-Bench (this https URL), a benchmark designed to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in Anchor and Attractor Activation. First, we annotate 2,198 science reasoning problems across domains using the SAPM process (subject, anchor, attractor, problem, and memory developing). Second, we introduce a dual-scale memory evaluation framework utilizing anchors and attractors, along with the AAUI (Anchor–Attractor Utilization Index) metric to measure memory activation rates. Finally, through experiments with various base models and paradigms, we validate A^3-Bench and analyze how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.
zh
[AI-32] RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的领域特定推理方法中普遍存在的训练密集型问题,即依赖参数更新导致资源消耗高且难以适应动态复杂推理场景。现有激活空间干预(activation steering)方法虽具参数高效性,但多采用静态、人工设定的干预策略,缺乏对推理过程动态性的响应能力。其解决方案的关键在于提出RISER(Router-based Intervention for Steerable Enhancement of Reasoning),一个可插拔的激活空间干预框架:通过构建可复用的推理向量库,并引入轻量级路由器(Router)以强化学习优化,在任务级奖励驱动下动态组合这些向量,从而在激活空间中自适应地激发潜在认知原语(latent cognitive primitives)。该机制实现了推理路径的 emergent composition(涌现式组合)与可控性增强,显著提升了零样本准确率(平均提升3.4–6.5%),同时相较Chain-of-Thought(CoT)推理具有2–3倍更高的token效率和更稳定的性能增益。
链接: https://arxiv.org/abs/2601.09269
作者: Wencheng Ye,Liang Peng,Xiaoyang Yuan,Yi Bin,Pengpeng Zeng,Hengyu Jin,Heng Tao Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work on domain-specific reasoning with large language models (LLMs) often relies on training-intensive approaches that require parameter updates. While activation steering has emerged as a parameter efficient alternative, existing methods apply static, manual interventions that fail to adapt to the dynamic nature of complex reasoning. To address this limitation, we propose RISER (Router-based Intervention for Steerable Enhancement of Reasoning), a plug-and-play intervention framework that adaptively steers LLM reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner. Across seven diverse benchmarks, RISER yields 3.4-6.5% average zero-shot accuracy improvements over the base model while surpassing CoT-style reasoning with 2-3x higher token efficiency and robust accuracy gains. Further analysis shows that RISER autonomously combines multiple vectors into interpretable, precise control strategies, pointing toward more controllable and efficient LLM reasoning.
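RISER 的核心操作是“路由器输出混合权重,把推理向量库线性组合后加到隐藏激活上”。下面给出一个假设性的 PyTorch 示意(向量库规模、路由器结构与缩放系数均为示例;强化学习训练部分不在此列):

```python
import torch
import torch.nn as nn


class SteeringRouter(nn.Module):
    """Lightweight router: maps a pooled input representation to mixing weights
    over a library of reasoning vectors, then steers the hidden states."""

    def __init__(self, hidden_dim: int, num_vectors: int, alpha: float = 1.0):
        super().__init__()
        self.vectors = nn.Parameter(torch.randn(num_vectors, hidden_dim) * 0.01)
        self.router = nn.Linear(hidden_dim, num_vectors)
        self.alpha = alpha

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, hidden_dim]
        pooled = hidden.mean(dim=1)                          # [batch, hidden_dim]
        weights = torch.softmax(self.router(pooled), dim=-1) # [batch, num_vectors]
        steer = weights @ self.vectors                       # [batch, hidden_dim]
        return hidden + self.alpha * steer.unsqueeze(1)      # broadcast over sequence


if __name__ == "__main__":
    layer = SteeringRouter(hidden_dim=64, num_vectors=8)
    h = torch.randn(2, 16, 64)
    print(layer(h).shape)   # torch.Size([2, 16, 64])
```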
zh
[AI-33] Coordinated Pandemic Control with Large Language Model Agents as Policymaking Assistants
【速读】:该论文旨在解决疫情期间跨区域政策制定碎片化与反应滞后的问题,即各行政区域常独立决策、缺乏协同,导致干预措施被动且效果受限。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的多智能体政策制定框架,其中每个区域部署一个LLM代理,通过融合本地流行病学动态、跨区域信息交互与真实世界数据驱动的疫情演化模拟器,在闭环仿真过程中协同探索反事实干预情景并生成协调一致的政策决策,从而实现前瞻性、系统性的区域联动防控。
链接: https://arxiv.org/abs/2601.09264
作者: Ziyi Shi,Xusen Guo,Hongliang Lu,Mingxing Peng,Haotian Wang,Zheng Zhu,Zhenning Li,Yuxuan Liang,Xinhu Zheng,Hai Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20pages, 6 figures, a 60-page supporting material pdf file
Abstract:Effective pandemic control requires timely and coordinated policymaking across administrative regions that are intrinsically interdependent. However, human-driven responses are often fragmented and reactive, with policies formulated in isolation and adjusted only after outbreaks escalate, undermining proactive intervention and global pandemic mitigation. To address this challenge, here we propose a large language model (LLM) multi-agent policymaking framework that supports coordinated and proactive pandemic control across regions. Within our framework, each administrative region is assigned an LLM agent as an AI policymaking assistant. The agent reasons over region-specific epidemiological dynamics while communicating with other agents to account for cross-regional interdependencies. By integrating real-world data, a pandemic evolution simulator, and structured inter-agent communication, our framework enables agents to jointly explore counterfactual intervention scenarios and synthesize coordinated policy decisions through a closed-loop simulation process. We validate the proposed framework using state-level COVID-19 data from the United States between April and December 2020, together with real-world mobility records and observed policy interventions. Compared with real-world pandemic outcomes, our approach reduces cumulative infections and deaths by up to 63.7% and 40.1%, respectively, at the individual state level, and by 39.0% and 27.0%, respectively, when aggregated across states. These results demonstrate that LLM multi-agent systems can enable more effective pandemic control with coordinated policymaking…
zh
[AI-34] Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for Large Language Models
【速读】:该论文旨在解决当前链式思维(Chain-of-Thought, CoT)推理范式中存在的时间粒度不匹配问题,即现有方法将推理过程视为不可分割的序列,缺乏对每一步信息增益的量化机制,导致推理效率低下和优化困难。其解决方案的关键在于提出CoT-Flow框架,将离散的推理步骤重构为连续的概率流(probabilistic flow),从而显式量化每个步骤对最终答案的贡献;在此基础上,进一步实现两种互补方法:基于流引导的解码(flow-guided decoding)以提取信息高效的推理路径,以及基于流的强化学习(flow-based reinforcement learning)构建无需验证器的密集奖励函数,显著提升了推理效率与性能之间的平衡。
链接: https://arxiv.org/abs/2601.09260
作者: Yan Liu,Feng Zhang,Zhanyu Ma,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He,Han Liu,Yangdong Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:High-quality chain-of-thought has demonstrated strong potential for unlocking the reasoning capabilities of large language models. However, current paradigms typically treat the reasoning process as an indivisible sequence, lacking an intrinsic mechanism to quantify step-wise information gain. This granularity gap manifests in two limitations: inference inefficiency from redundant exploration without explicit guidance, and optimization difficulty due to sparse outcome supervision or costly external verifiers. In this work, we propose CoT-Flow, a framework that reconceptualizes discrete reasoning steps as a continuous probabilistic flow, quantifying the contribution of each step toward the ground-truth answer. Built on this formulation, CoT-Flow enables two complementary methodologies: flow-guided decoding, which employs a greedy flow-based decoding strategy to extract information-efficient reasoning paths, and flow-based reinforcement learning, which constructs a verifier-free dense reward function. Experiments on challenging benchmarks demonstrate that CoT-Flow achieves a superior balance between inference efficiency and reasoning performance.
zh
[AI-35] MAXS: Meta-Adaptive Exploration with LLM Agents
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在推理过程中存在的两个核心问题:一是局部短视生成(locally myopic generation),即缺乏前瞻性的规划导致决策质量下降;二是轨迹不稳定性(trajectory instability),即早期微小错误可能引发推理路径的显著偏离,从而影响整体性能。为应对上述挑战,论文提出了一种基于元自适应探索的推理框架MAXS(Meta-Adaptive eXploration with LLM Agents)。其关键创新在于引入前瞻性策略以扩展推理路径并估计工具使用的优势值,并通过步骤一致性方差与跨步趋势斜率的联合优化机制,选择稳定、一致且高价值的推理步骤;同时设计了轨迹收敛机制,在路径一致性达成时终止进一步滚动预测,从而在多工具推理中实现资源效率与全局有效性的平衡。
链接: https://arxiv.org/abs/2601.09259
作者: Jian Zhang,Zhiyuan Wang,Zhangqi Wang,Yu He,Haoran Luo,li yuan,Lingling Zhang,Rui Mao,Qika Lin,Jun Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose meta-adaptive exploration with LLM agents (MAXS, this https URL), a meta-adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
zh
[AI-36] RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)对齐过程中存在的数据效率低下问题,具体而言,监督微调(Supervised Fine-Tuning, SFT)依赖昂贵的专家标注数据,而拒绝采样微调(Rejection Sampling Fine-Tuning, RFT)则会丢弃有价值的负样本,导致训练数据利用率低。其解决方案的关键在于提出奖励感知微调(Reward Informed Fine-Tuning, RIFT),该方法充分利用模型自生成的所有样本(包括正负轨迹),通过标量奖励对损失函数进行重加权,从而实现从混合质量的自生成数据中学习;同时为避免直接奖励乘法导致的训练崩溃问题,RIFT引入一种稳定化的损失函数形式,确保数值鲁棒性和优化效率,实验证明该方法在多种基础模型和数学基准上均优于RFT,展现出更强的鲁棒性和数据效率。
链接: https://arxiv.org/abs/2601.09253
作者: Zehua Liu,Shuqi Liu,Tao Zhong,Mingxuan Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard for LLM alignment, they either rely on costly expert data or discard valuable negative samples, leading to data inefficiency. To address this, we propose Reward Informed Fine-Tuning (RIFT), a simple yet effective framework that utilizes all self-generated samples. Unlike the hard thresholding of RFT, RIFT repurposes negative trajectories, reweighting the loss with scalar rewards to learn from both the positive and negative trajectories from the model outputs. To overcome the training collapse caused by naive reward integration, where direct multiplication yields an unbounded loss, we introduce a stabilized loss formulation that ensures numerical robustness and optimization efficiency. Extensive experiments on mathematical benchmarks across various base models show that RIFT consistently outperforms RFT. Our results demonstrate that RIFT is a robust and data-efficient alternative for alignment using mixed-quality, self-generated data.
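RIFT 的要点是“用标量奖励对正负轨迹的损失重加权,并避免直接乘奖励带来的无界损失”。下面给出一个假设性的 PyTorch 损失示意:用有界的权重函数(此处以 sigmoid 为例)对逐轨迹负对数似然加权,使损失保持下界;论文中具体的稳定化公式摘要未给出,请以原文为准:

```python
import torch


def rift_style_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """logprobs: [batch] sum of token log-probs per self-generated trajectory.
    rewards:  [batch] scalar rewards (can be negative for poor trajectories).
    Weights are squashed into (0, 1), so the loss stays bounded below while
    higher-reward trajectories receive proportionally stronger updates."""
    weights = torch.sigmoid(rewards / temperature)   # bounded reweighting
    return -(weights * logprobs).mean()


if __name__ == "__main__":
    logprobs = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
    rewards = torch.tensor([1.0, -0.5, 0.2])
    loss = rift_style_loss(logprobs, rewards)
    loss.backward()
    print(loss.item(), logprobs.grad)
```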
zh
[AI-37] HGATSolver: A Heterogeneous Graph Attention Solver for Fluid-Structure Interaction
【速读】:该论文旨在解决流固耦合(Fluid-Structure Interaction, FSI)系统中多物理场异质动力学建模的难题,特别是现有学习型求解器难以在统一框架内准确捕捉流体与固体区域差异性演化特性的问题。核心挑战包括界面耦合导致的跨域响应不一致以及流体与固体区域学习难度差异引发的预测不稳定。解决方案的关键在于提出一种异质图注意力求解器(Heterogeneous Graph Attention Solver, HGATSolver),其通过构建包含流体、固体和界面三类节点与边的异质图结构,显式嵌入物理结构信息,并设计针对不同物理域定制的消息传递机制;同时引入物理条件门控机制作为可学习的自适应松弛因子以稳定显式时间推进,并结合域间梯度平衡损失(Inter-domain Gradient-Balancing Loss)动态调节各域优化目标,从而实现对复杂多物理场耦合系统的高效稳定代理建模。
链接: https://arxiv.org/abs/2601.09251
作者: Qin-Yi Zhang,Hong Wang,Siyao Liu,Haichuan Lin,Linying Cao,Xiao-Hu Zhou,Chen Chen,Shuangyi Wang,Zeng-Guang Hou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Fluid-structure interaction (FSI) systems involve distinct physical domains, fluid and solid, governed by different partial differential equations and coupled at a dynamic interface. While learning-based solvers offer a promising alternative to costly numerical simulations, existing methods struggle to capture the heterogeneous dynamics of FSI within a unified framework. This challenge is further exacerbated by inconsistencies in response across domains due to interface coupling and by disparities in learning difficulty across fluid and solid regions, leading to instability during prediction. To address these challenges, we propose the Heterogeneous Graph Attention Solver (HGATSolver). HGATSolver encodes the system as a heterogeneous graph, embedding physical structure directly into the model via distinct node and edge types for fluid, solid, and interface regions. This enables specialized message-passing mechanisms tailored to each physical domain. To stabilize explicit time stepping, we introduce a novel physics-conditioned gating mechanism that serves as a learnable, adaptive relaxation factor. Furthermore, an Inter-domain Gradient-Balancing Loss dynamically balances the optimization objectives across domains based on predictive uncertainty. Extensive experiments on two constructed FSI benchmarks and a public dataset demonstrate that HGATSolver achieves state-of-the-art performance, establishing an effective framework for surrogate modeling of coupled multi-physics systems.
zh
[AI-38] Reward Learning through Ranking Mean Squared Error
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在现实世界应用中奖励设计(reward design)这一关键瓶颈问题。传统方法依赖人工手动设计奖励函数,不仅耗时且难以捕捉复杂目标;而现有基于人类反馈的奖励学习(reward learning)方法多采用二元偏好(binary preferences),可能增加认知负担且信息利用率低。为此,作者提出了一种新的基于评分(rating-based)的RL方法——Ranked Return Regression for RL (R4),其核心创新在于引入一种新颖的排序均方误差(ranking mean squared error, rMSE)损失函数,将教师提供的离散评分(如“差”、“中性”、“好”)视为序数目标(ordinal targets)。R4通过采样轨迹并利用可微排序算子(soft ranks)预测返回值,进而最小化软排名与评分之间的均方误差,在保证训练稳定性的前提下实现了对奖励函数的有效学习。相较于已有方法,R4具备形式化保证:在弱假设下其解集是唯一且完备的,并在模拟人类反馈实验中展现出优于或相当的性能,同时显著减少所需的人类标注量。
链接: https://arxiv.org/abs/2601.09236
作者: Chaitanya Kharyal,Calarina Muslimani,Matthew E. Taylor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., “bad,” “neutral,” “good”). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher’s ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
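R4 的核心是“把教师评分当作序数目标,对预测回报做可微排序,再最小化软排名与评分之间的 MSE”。下面给出一个用成对 sigmoid 近似软排名的假设性 PyTorch 示意(论文实际使用的可微排序算子与归一化方式请以原文为准;此处把软排名与评分都缩放到 [0, 1] 只是其中一种简单做法):

```python
import torch


def soft_ranks(values: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Differentiable ranks in [1, N]: rank_i = 1 + sum_{j != i} sigmoid((v_i - v_j) / tau)."""
    diff = values.unsqueeze(1) - values.unsqueeze(0)          # [N, N]
    pairwise = torch.sigmoid(diff / tau)
    pairwise = pairwise - torch.diag(torch.diag(pairwise))    # drop self-comparisons
    return 1.0 + pairwise.sum(dim=1)


def ranking_mse_loss(pred_returns: torch.Tensor, ratings: torch.Tensor) -> torch.Tensor:
    """MSE between rescaled soft ranks of predicted returns and rescaled ratings."""
    ranks = (soft_ranks(pred_returns) - 1.0) / (len(pred_returns) - 1)        # -> [0, 1]
    targets = (ratings - ratings.min()) / (ratings.max() - ratings.min())      # -> [0, 1]
    return torch.mean((ranks - targets) ** 2)


if __name__ == "__main__":
    pred = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)  # predicted returns
    ratings = torch.tensor([2.0, 3.0, 1.0])                    # neutral, good, bad
    loss = ranking_mse_loss(pred, ratings)
    loss.backward()
    print(loss.item(), pred.grad)
```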
zh
[AI-39] Mikasa: A Character-Driven Emotional AI Companion Inspired by Japanese Oshi Culture
【速读】:该论文试图解决当前生成式 AI (Generative AI) 同伴系统在长期使用中难以维持用户满意度和参与度的问题。研究表明,问题根源并非模型能力不足,而是源于角色设计不佳与用户-AI 关系定义模糊。解决方案的关键在于采用以角色为核心的设计范式,即通过构建具有稳定人格特质和明确关系定位的 AI 角色(如 Mikasa),来提供交互规范的锚定点,从而降低用户持续重新定义关系的认知负担。这种角色一致性与关系清晰性作为潜在结构要素,显著提升了交互体验质量,即使用户未必直接意识到其重要性。
链接: https://arxiv.org/abs/2601.09208
作者: Miki Ueno
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure
Abstract:Recent progress in large language models and multimodal interaction has made it possible to develop AI companions that can have fluent and emotionally expressive conversations. However, many of these systems have problems keeping users satisfied and engaged over long periods. This paper argues that these problems do not come mainly from weak models, but from poor character design and unclear definitions of the user-AI relationship. I present Mikasa, an emotional AI companion inspired by Japanese Oshi culture-specifically its emphasis on long-term, non-exclusive commitment to a stable character-as a case study of character-driven companion design. Mikasa does not work as a general-purpose assistant or a chatbot that changes roles. Instead, Mikasa is designed as a coherent character with a stable personality and a clearly defined relationship as a partner. This relationship does not force exclusivity or obligation. Rather, it works as a reference point that stabilizes interaction norms and reduces the work users must do to keep redefining the relationship. Through an exploratory evaluation, I see that users describe their preferences using surface-level qualities such as conversational naturalness, but they also value relationship control and imaginative engagement in ways they do not state directly. These results suggest that character coherence and relationship definition work as latent structural elements that shape how good the interaction feels, without users recognizing them as main features. The contribution of this work is to show that character design is a functional part of AI companion systems, not just decoration. Mikasa is one example based on a specific cultural context, but the design principles-commitment to a consistent personality and clear relationship definition-can be used for many emotionally grounded AI companions.
zh
[AI-40] Position on LLM-Assisted Peer Review: Addressing Reviewer Gap through Mentoring and Feedback AAAI2026
【速读】:该论文试图解决人工智能(AI)研究迅猛发展所加剧的“审稿人缺口”(Reviewer Gap)问题,这一现象威胁到同行评审(peer review)的可持续性,并导致低质量评审的循环。其解决方案的关键在于推动一种范式转变,即不再将大语言模型(LLM)视为直接生成评审意见的工具,而是将其定位为辅助与教育人类审稿人的手段;具体提出两个互补系统:一是基于LLM的辅导系统,旨在长期培养审稿人的专业能力;二是基于LLM的反馈系统,帮助审稿人提升当前评审的质量。该以人类为中心的方法旨在增强审稿专家的能力,从而构建更可持续的学术生态系统。
链接: https://arxiv.org/abs/2601.09182
作者: JungMin Yun,JuneHyoung Kwon,MiHyeon Kim,YoungBin Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted to AAAI 2026 Workshop on AI for Scientific Research (AI4Research)
Abstract:The rapid expansion of AI research has intensified the Reviewer Gap, threatening the sustainability of peer review and perpetuating a cycle of low-quality evaluations. This position paper critiques existing LLM approaches that automatically generate reviews and argues for a paradigm shift that positions LLMs as tools for assisting and educating human reviewers. We define the core principles of high-quality peer review and propose two complementary systems grounded in these foundations: (i) an LLM-assisted mentoring system that cultivates reviewers’ long-term competencies, and (ii) an LLM-assisted feedback system that helps reviewers refine the quality of their reviews. This human-centered approach aims to strengthen reviewer expertise and contribute to building a more sustainable scholarly ecosystem.
zh
[AI-41] KTCF: Actionable Recourse in Knowledge Tracing via Counterfactual Explanations for Education AAAI-26
【速读】:该论文旨在解决知识追踪(Knowledge Tracing, KT)模型在教育应用中缺乏可解释性的问题,尤其是如何为教育相关利益方(如教师、学生和管理者)提供直观、因果且可操作的解释。其核心解决方案是提出KTCF方法,该方法基于反事实解释(Counterfactual Explanation),不仅考虑知识点之间的语义关系,还通过后处理机制将反事实解释转化为一系列具体的教育指令(educational instructions)。这一设计使得AI模型的决策过程更具透明度,并能指导学习者优化学习路径,从而减轻学习负担并提升教育公平性和实用性。
链接: https://arxiv.org/abs/2601.09156
作者: Woojin Kim,Changkwon Lee,Hyeoncheol Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to AAAI-26 Special Track AI for Social Impact (oral presentation)
Abstract:Using Artificial Intelligence to improve teaching and learning benefits greater adaptivity and scalability in education. Knowledge Tracing (KT) is recognized for student modeling task due to its superior performance and application potential in education. To this end, we conceptualize and investigate counterfactual explanation as the connection from XAI for KT to education. Counterfactual explanations offer actionable recourse, are inherently causal and local, and easy for educational stakeholders to understand who are often non-experts. We propose KTCF, a counterfactual explanation generation method for KT that accounts for knowledge concept relationships, and a post-processing scheme that converts a counterfactual explanation into a sequence of educational instructions. We experiment on a large-scale educational dataset and show our KTCF method achieves superior and robust performance over existing methods, with improvements ranging from 5.7% to 34% across metrics. Additionally, we provide a qualitative evaluation of our post-processing scheme, demonstrating that the resulting educational instructions help in reducing large study burden. We show that counterfactuals have the potential to advance the responsible and practical use of AI in education. Future works on XAI for KT may benefit from educationally grounded conceptualization and developing stakeholder-centered methods.
zh
[AI-42] PrivacyReasoner: Can LLM Emulate a Human-like Privacy Mind?
【速读】:该论文旨在解决现有隐私研究中缺乏个体层面隐私认知建模的问题,即如何准确模拟用户在面对真实新闻事件时如何基于个人经验与情境线索形成独特的隐私关切。传统方法多依赖群体级情感分析,难以捕捉用户间差异化的隐私推理机制。解决方案的关键在于提出PRA(Privacy Reasoning Agent),其核心是融合隐私理论与认知科学模型,通过整合用户的评论历史和上下文线索,重构每个用户的“隐私心智”(privacy mind);利用情境过滤器动态激活相关隐私记忆以模拟有限理性(bounded rationality),并生成反映用户可能对新隐私场景作出反应的合成评论。此外,引入LLM-as-a-Judge评估器校准于已有的隐私关切分类体系,从而量化生成推理的忠实度,实验表明PRA在预测隐私关切方面优于基线模型,并能跨AI、电商和医疗等不同领域迁移其推理模式。
链接: https://arxiv.org/abs/2601.09152
作者: Yiwen Tu,Xuan Liu,Lianhui Qin,Haojian Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces PRA, an AI-agent design for simulating how individual users form privacy concerns in response to real-world news. Moving beyond population-level sentiment analysis, PRA integrates privacy and cognitive theories to simulate user-specific privacy reasoning grounded in personal comment histories and contextual cues. The agent reconstructs each user’s “privacy mind”, dynamically activates relevant privacy memory through a contextual filter that emulates bounded rationality, and generates synthetic comments reflecting how that user would likely respond to new privacy scenarios. A complementary LLM-as-a-Judge evaluator, calibrated against an established privacy concern taxonomy, quantifies the faithfulness of generated reasoning. Experiments on real-world Hacker News discussions show that PRA outperforms baseline agents in privacy concern prediction and captures transferable reasoning patterns across domains including AI, e-commerce, and healthcare.
zh
[AI-43] A Marketplace for AI-Generated Adult Content and Deepfakes
【速读】:该论文旨在解决生成式 AI (Generative AI) 平台中,社区驱动的付费内容生成机制(如Civitai平台的Bounties功能)如何诱发社会风险,特别是性别不平等与非安全内容泛滥的问题。其解决方案的关键在于通过纵向分析14个月内的所有公开悬赏请求,揭示出该机制下内容生产的结构性偏倚:一方面,用户主要依赖工具绕过模型训练限制以生成特定内容;另一方面,大量请求涉及“非工作场所安全”(Not Safe For Work, NSFW)内容,且呈持续增长趋势,其中针对女性名人的深度伪造(deepfake)请求尤为集中,体现出显著的性别不对称伤害。这一实证发现凸显了当前平台治理在同意、监管和执行层面的不足,为设计更具伦理约束与公平性的AI内容生态提供了关键依据。
链接: https://arxiv.org/abs/2601.09117
作者: Shalmoli Ghosh,Matthew R. DeVerna,Filippo Menczer
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI systems increasingly enable the production of highly realistic synthetic media. Civitai, a popular community-driven platform for AI-generated content, operates a monetized feature called Bounties, which allows users to commission the generation of content in exchange for payment. To examine how this mechanism is used and what content it incentivizes, we conduct a longitudinal analysis of all publicly available bounty requests collected over a 14-month period following the platform’s launch. We find that the bounty marketplace is dominated by tools that let users steer AI models toward content they were not trained to generate. At the same time, requests for content that is “Not Safe For Work” are widespread and have increased steadily over time, now comprising a majority of all bounties. Participation in bounty creation is uneven, with 20% of requesters accounting for roughly half of requests. Requests for “deepfake” - media depicting identifiable real individuals - exhibit a higher concentration than other types of bounties. A nontrivial subset of these requests involves explicit deepfakes despite platform policies prohibiting such content. These bounties disproportionately target female celebrities, revealing a pronounced gender asymmetry in social harm. Together, these findings show how monetized, community-driven generative AI platforms can produce gendered harms, raising questions about consent, governance, and enforcement.
zh
[AI-44] The AI Hippocampus: How Far are We From Human Memory?
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)和多模态大语言模型(Multi-Modal LLMs, MLLMs)在从静态预测器向具备持续学习与个性化推理能力的交互式系统演进过程中,如何有效整合记忆机制以增强其推理能力、适应性及上下文一致性的问题。解决方案的关键在于提出一个结构化的三元分类框架,将记忆机制归纳为隐式记忆(implicit memory)、显式记忆(explicit memory)和代理记忆(agentic memory)三大范式:隐式记忆指预训练Transformer模型内部参数中嵌入的知识,支持关联检索与情境推理;显式记忆通过外部存储与检索组件(如文本语料库、密集向量和图结构)实现动态知识调用与可扩展更新;代理记忆则构建持久化的时间延展记忆结构,支撑自主智能体的长期规划、自我一致性与多智能体协作行为。该分类体系不仅厘清了当前研究的技术路径,也为多模态场景下的跨模态记忆协同提供了理论基础与实践方向。
链接: https://arxiv.org/abs/2601.09113
作者: Zixia Jia,Jiaqi Li,Yipeng Kang,Yuxuan Wang,Tong Wu,Quansen Wang,Xiaobo Wang,Shuyi Zhang,Junzhe Shen,Qing Li,Siyuan Qi,Yitao Liang,Di He,Zilong Zheng,Song-Chun Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models and Multi-Modal LLMs. As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations, such as textual corpora, dense vectors, and graph-based structures, thereby enabling scalable and updatable interaction with information sources. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents, facilitating long-term planning, self-consistency, and collaborative behavior in multi-agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi-modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross-system interoperability.
zh
[AI-45] DScheLLM: Enabling Dynamic Scheduling through a Fine-Tuned Dual-System Large Language Model
【速读】:该论文旨在解决生产调度在动态扰动(如加工时间变化、设备可用性波动和突发任务插入)下适应性差与泛化能力弱的问题。传统方法依赖于特定事件的模型和显式解析公式,难以应对未见过的扰动场景。解决方案的关键在于提出一种基于双系统推理架构(快-慢思考模式)的动态调度方法DScheLLM,利用微调后的大型语言模型(Large Language Model, LLM)构建统一框架:其中快速推理模式高效生成高质量调度方案,慢速推理模式则输出符合求解器兼容格式的决策输入;通过使用运筹学求解器生成的精确调度数据训练两种推理模式,并采用LoRA(Low-Rank Adaptation)技术对华为OpenPangu Embedded-7B模型进行微调,从而实现对不同规模扰动的智能响应与优化。
链接: https://arxiv.org/abs/2601.09100
作者: Lixiang Zhang,Chenggong Zhao,Qing Gao,Xiaoke Zhao,Gengyi Bai,Jinhu Lv
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures
Abstract:Production scheduling is highly susceptible to dynamic disruptions, such as variations in processing times, machine availability, and unexpected task insertions. Conventional approaches typically rely on event-specific models and explicit analytical formulations, which limits their adaptability and generalization across previously unseen disturbances. To overcome these limitations, this paper proposes DScheLLM, a dynamic scheduling approach that leverages fine-tuned large language models within a dual-system (fast-slow) reasoning architecture to address disturbances of different scales. A unified large language model-based framework is constructed to handle dynamic events, where training datasets for both fast and slow reasoning modes are generated using exact schedules obtained from an operations research solver. The Huawei OpenPangu Embedded-7B model is subsequently fine-tuned under the hybrid reasoning paradigms using LoRA. Experimental evaluations on standard job shop scheduling benchmarks demonstrate that the fast-thinking mode can efficiently generate high-quality schedules and the slow-thinking mode can produce solver-compatible and well-formatted decision inputs. To the best of our knowledge, this work represents one of the earliest studies applying large language models to job shop scheduling in dynamic environments, highlighting their considerable potential for intelligent and adaptive scheduling optimization.
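摘要提到两种推理模式的训练数据来自运筹学求解器给出的精确调度,但未指明具体求解器;下面的示意代码仅以 Google OR-Tools CP-SAT 作为替代示例,在一个极小的作业车间(job shop)实例上求解最小化完工时间的精确调度,实例数据为虚构。

```python
from ortools.sat.python import cp_model

# A tiny job-shop instance: each job is a list of (machine, duration) tasks.
jobs = [[(0, 3), (1, 2)],          # job 0
        [(1, 2), (0, 4)],          # job 1
        [(0, 2), (1, 3)]]          # job 2
horizon = sum(d for job in jobs for _, d in job)

model = cp_model.CpModel()
all_tasks, machine_intervals = {}, {}
for j, job in enumerate(jobs):
    for t, (machine, dur) in enumerate(job):
        start = model.NewIntVar(0, horizon, f"s_{j}_{t}")
        end = model.NewIntVar(0, horizon, f"e_{j}_{t}")
        interval = model.NewIntervalVar(start, dur, end, f"i_{j}_{t}")
        all_tasks[j, t] = (start, end)
        machine_intervals.setdefault(machine, []).append(interval)
        if t > 0:                                  # precedence within a job
            model.Add(start >= all_tasks[j, t - 1][1])

for intervals in machine_intervals.values():       # one task per machine at a time
    model.AddNoOverlap(intervals)

makespan = model.NewIntVar(0, horizon, "makespan")
model.AddMaxEquality(makespan, [all_tasks[j, len(job) - 1][1] for j, job in enumerate(jobs)])
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("makespan:", solver.Value(makespan))
    for (j, t), (start, _) in sorted(all_tasks.items()):
        print(f"job {j} task {t} starts at {solver.Value(start)}")
```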
zh
[AI-46] Programming over Thinking: Efficient and Robust Multi-Constraint Planning
【速读】:该论文旨在解决多约束规划(multi-constraint planning)中现有大语言模型(LLM)方法面临的核心挑战:纯推理范式易产生不一致性和误差累积,且随着约束增多成本急剧上升;而结合编码或求解器的策略则缺乏灵活性,通常需为每个问题定制代码或依赖固定求解器,难以捕捉跨任务的通用逻辑。解决方案的关键在于提出一种名为SCOPE(Scalable COde Planning Engine)的框架,其核心创新是将查询特定的推理过程与通用代码执行相解耦,从而生成具有一致性、确定性和可复用性的求解函数,仅需调整输入参数即可适应不同查询,显著提升性能并降低推理成本与延迟。
链接: https://arxiv.org/abs/2601.09097
作者: Derrick Goh Xin Deik,Quanyu Long,Zhengyuan Liu,Nancy F. Chen,Wenya Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages of main text, 2 pages of references and limitations, 37 pages of appendices
Abstract:Multi-constraint planning involves identifying, evaluating, and refining candidate plans while satisfying multiple, potentially conflicting constraints. Existing large language model (LLM) approaches face fundamental limitations in this domain. Pure reasoning paradigms, which rely on long natural language chains, are prone to inconsistency, error accumulation, and prohibitive cost as constraints compound. Conversely, LLMs combined with coding- or solver-based strategies lack flexibility: they often generate problem-specific code from scratch or depend on fixed solvers, failing to capture generalizable logic across diverse problems. To address these challenges, we introduce the Scalable COde Planning Engine (SCOPE), a framework that disentangles query-specific reasoning from generic code execution. By separating reasoning from execution, SCOPE produces solver functions that are consistent, deterministic, and reusable across queries while requiring only minimal changes to input parameters. SCOPE achieves state-of-the-art performance while lowering cost and latency. For example, with GPT-4o, it reaches 93.1% success on TravelPlanner, a 61.6% gain over the best baseline (CoT) while cutting inference cost by 1.4x and time by ~4.67x. Code is available at this https URL.
zh
[AI-47] A Decompilation-Driven Framework for Malware Detection with Large Language Models
【速读】:该论文试图解决的问题是:如何利用先进的大语言模型(Large Language Models, LLMs)对Windows可执行文件进行恶意代码分类,以应对日益复杂的恶意软件威胁。解决方案的关键在于构建一个自动化流程,首先使用Ghidra反汇编器将Windows可执行文件转换为C语言源代码,再借助LLMs完成分类任务;研究发现,仅依赖预训练的通用LLM效果有限,而通过在精心筛选的恶意软件和良性软件数据集上进行微调(fine-tuning)后,模型性能显著提升,但其对新型恶意软件的识别能力仍会下降,因此持续用新兴威胁样本进行微调成为维持模型有效性的关键。
链接: https://arxiv.org/abs/2601.09035
作者: Aniesh Chawla,Udbhav Prasad
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, published in 2025 IEMCON
Abstract:The parallel evolution of Large Language Models (LLMs) with advanced code-understanding capabilities and the increasing sophistication of malware presents a new frontier for cybersecurity research. This paper evaluates the efficacy of state-of-the-art LLMs in classifying executable code as either benign or malicious. We introduce an automated pipeline that first decompiles Windows executables into C code using the Ghidra disassembler and then leverages LLMs to perform the classification. Our evaluation reveals that while standard LLMs show promise, they are not yet robust enough to replace traditional anti-virus software. We demonstrate that a fine-tuned model, trained on curated malware and benign datasets, significantly outperforms its vanilla counterpart. However, the performance of even this specialized model degrades notably when encountering newer malware. This finding demonstrates the critical need for continuous fine-tuning with emerging threats to maintain model effectiveness against the changing coding patterns and behaviors of malicious software.
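作为对上述流水线思路的粗略示意,下面用 subprocess 调用 Ghidra 的 analyzeHeadless,再把导出的 C 代码拼装成分类提示词;其中导出脚本 ExportDecompiledC.java、输出文件命名以及 build_prompt 均为假设的占位,并非论文提供的工件。

```python
import subprocess
from pathlib import Path

GHIDRA_HEADLESS = "/opt/ghidra/support/analyzeHeadless"   # assumed install path

def decompile_to_c(binary: Path, project_dir: Path, script_dir: Path, out_dir: Path) -> str:
    """Run Ghidra headless analysis plus a user-supplied export script
    (hypothetical ExportDecompiledC.java) that writes decompiled C into out_dir."""
    subprocess.run(
        [GHIDRA_HEADLESS, str(project_dir), "tmp_proj",
         "-import", str(binary),
         "-scriptPath", str(script_dir),
         "-postScript", "ExportDecompiledC.java", str(out_dir),
         "-deleteProject"],
        check=True,
    )
    # Output file naming is an assumption about the hypothetical export script.
    return (out_dir / (binary.name + ".c")).read_text(errors="ignore")

def build_prompt(c_source: str) -> str:
    # Truncation limit is arbitrary; send the prompt to the LLM of your choice.
    return ("You are a malware analyst. Classify the following decompiled C code "
            "as benign or malicious and answer with a single word.\n\n" + c_source[:20000])

if __name__ == "__main__":
    c_code = decompile_to_c(Path("sample.exe"), Path("./proj"), Path("./scripts"), Path("./out"))
    print(build_prompt(c_code)[:500])
```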
zh
[AI-48] The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)在真实工作场景中任务完成能力评估的不足问题,即从单一回合响应评估转向多步骤任务执行的交互式环境评估。其解决方案的关键在于构建一个基于真实电商强化学习(Reinforcement Learning, RL)环境的实证研究框架,通过对150项职场任务的系统性评估,识别出AI代理必须掌握的五层能力层级:工具使用、规划与目标形成、适应性、 groundedness(事实一致性)和常识推理。研究进一步揭示了当前前沿模型在这些层级上的失败模式,指出性能瓶颈主要集中在需要上下文推断的任务上,从而为Agent开发提供了以任务为中心的设计方法论和可操作的改进方向。
链接: https://arxiv.org/abs/2601.09032
作者: Logan Ritchie,Sushant Mehta,Nick Heiner,Mason Yu,Edwin Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The advancement of large language model (LLM) based agents has shifted AI evaluation from single-turn response assessment to multi-step task completion in interactive environments. We present an empirical study evaluating frontier AI models on 150 workplace tasks within a realistic e-commerce RL environment from Surge. Our analysis reveals an empirically-derived hierarchy of agentic capabilities that models must master for real-world deployment: (1) tool use, (2) planning and goal formation, (3) adaptability, (4) groundedness, and (5) common-sense reasoning. Even the best-performing models fail approximately 40% of the tasks, with failures clustering predictably along this hierarchy. Weaker models struggle with fundamental tool use and planning, whereas stronger models primarily fail on tasks requiring contextual inference beyond explicit instructions. We introduce a task-centric design methodology for RL environments that emphasizes diversity and domain expert contributions, provide detailed failure analysis, and discuss implications for agent development. Our findings suggest that while current frontier models can demonstrate coherent multi-step behavior, substantial capability gaps remain before achieving human-level task completion in realistic workplace settings.
zh
[AI-49] Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation
【速读】:该论文旨在解决人形机器人操作中两个核心挑战:一是如何实现高精度场景理解与语义推理以支持多样化的人类级任务执行,二是如何在有限人类示范数据下实现高效的动作生成。针对这些问题,其解决方案的关键在于提出一种名为RGMP-S(Recurrent Geometric-prior Multimodal Policy with Spiking features)的新框架。该框架通过引入轻量级2D几何先验(geometric prior)来增强视觉-语言模型对3D场景的理解能力,并构建长时程几何先验技能选择器(Long-horizon Geometric Prior Skill Selector),从而将语义指令与空间约束有效对齐,提升在未见环境中的泛化性能;同时,为解决动作生成的数据效率问题,设计了递归自适应脉冲网络(Recursive Adaptive Spiking Network),利用递归脉冲机制参数化机器人-物体交互过程,以保持时空一致性并充分提取长时程动态特征,在稀疏示范场景下显著缓解过拟合问题。
链接: https://arxiv.org/abs/2601.09031
作者: Xuetao Li,Wenke Huang,Mang Ye,Jifeng Xuan,Bo Du,Sheng Liu,Miao Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Humanoid robot manipulation is a crucial research area for executing diverse human-level tasks, involving high-level semantic reasoning and low-level action generation. However, precise scene understanding and sample-efficient learning from human demonstrations remain critical challenges, severely hindering the applicability and generalizability of existing frameworks. This paper presents a novel RGMP-S, Recurrent Geometric-prior Multimodal Policy with Spiking features, facilitating both high-level skill reasoning and data-efficient motion synthesis. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases to enable precise 3D scene understanding within the vision-language model. Specifically, we construct a Long-horizon Geometric Prior Skill Selector that effectively aligns the semantic instructions with spatial constraints, ultimately achieving robust generalization in unseen environments. For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network. We parameterize robot-object interactions via recursive spiking for spatiotemporal consistency, fully distilling long-horizon dynamic features while mitigating the overfitting issue in sparse demonstration scenarios. Extensive experiments are conducted across the Maniskill simulation benchmark and three heterogeneous real-world robotic systems, encompassing a custom-developed humanoid, a desktop manipulator, and a commercial robotic platform. Empirical results substantiate the superiority of our method over state-of-the-art baselines and validate the efficacy of the proposed modules in diverse generalization scenarios. To facilitate reproducibility, the source code and video demonstrations are publicly available at this https URL.
zh
[AI-50] Proactively Detecting Threats: A Novel Approach Using LLMs
【速读】:该论文旨在解决企业安全领域中因复杂恶意软件和数字运营扩展而加剧的威胁识别难题,特别是如何从非结构化的网络威胁情报源中主动提取指标(Indicators of Compromise, IOC)的问题。传统方法多为被动的恶意软件检测,难以应对动态变化的攻击模式。解决方案的关键在于首次系统性评估大型语言模型(Large Language Models, LLMs)在自动解析15个网页威胁报告源中的IOC能力,构建了一个自动化提取系统并对比了六种主流LLM模型的表现,最终发现Gemini 1.5 Pro在恶意IOC识别上展现出高精度(0.958)、高特异性(0.788)及完美召回率(1.0),验证了LLM在主动威胁情报分析中的有效性与潜力。
链接: https://arxiv.org/abs/2601.09029
作者: Aniesh Chawla,Udbhav Prasad
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 2025 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC)
Abstract:Enterprise security faces escalating threats from sophisticated malware, compounded by expanding digital operations. This paper presents the first systematic evaluation of large language models (LLMs) to proactively identify indicators of compromise (IOCs) from unstructured web-based threat intelligence sources, distinguishing it from reactive malware detection approaches. We developed an automated system that pulls IOCs from 15 web-based threat report sources to evaluate six LLM models (Gemini, Qwen, and Llama variants). Our evaluation of 479 webpages containing 2,658 IOCs (711 IPv4 addresses, 502 IPv6 addresses, 1,445 domains) reveals significant performance variations. Gemini 1.5 Pro achieved 0.958 precision and 0.788 specificity for malicious IOC identification, while demonstrating perfect recall (1.0) for actual threats.
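下面是此类系统中候选 IOC 抽取步骤的一个最小示意:用正则从威胁报告文本中提取 IPv4 与域名候选,并拼装成交给 LLM 判别的提示词;正则表达式与提示词措辞均为演示用假设,与论文实现无关。

```python
import re

IPV4_RE = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,24}\b", re.I)

def extract_ioc_candidates(report_text: str) -> dict:
    """Pull IPv4 and domain candidates from unstructured threat-report text.
    Defanged notation such as hxxp:// or 1.2.3[.]4 is re-fanged first."""
    text = report_text.replace("[.]", ".").replace("hxxp", "http")
    return {
        "ipv4": sorted(set(IPV4_RE.findall(text))),
        "domains": sorted(set(DOMAIN_RE.findall(text))),
    }

def build_triage_prompt(candidates: dict) -> str:
    # The downstream LLM (Gemini, Qwen, Llama, ...) answers one line per candidate.
    lines = candidates["ipv4"] + candidates["domains"]
    return ("For each indicator below, answer 'malicious' or 'benign' "
            "based on the surrounding threat report:\n" + "\n".join(lines))

sample = "C2 traffic observed to 185.220.101[.]4 and hxxp://evil-update.example.com/payload"
print(extract_ioc_candidates(sample))
```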
zh
[AI-51] Meta-learning to Address Data Shift in Time Series Classification
【速读】:该论文旨在解决传统深度学习(Traditional Deep Learning, TDL)模型在面对真实世界中数据分布动态变化(即数据漂移,data shift)时性能急剧下降的问题,这一问题通常需要昂贵的重新标注和低效的重新训练。解决方案的关键在于系统性比较TDL与基于微调(fine-tuning)及优化驱动的元学习(meta-learning)算法在时间序列分类任务中的适应能力,并引入一个受控的任务导向地震基准测试集(SeisTask)。研究发现,元学习在数据稀缺场景下能实现更快、更稳定的适应并减少过拟合,尤其适用于小型模型架构;而随着数据量和模型容量增加,其优势减弱,此时TDL结合微调可达到相当效果。此外,论文指出任务多样性并非决定因素,训练与测试分布的一致性才是元学习性能提升的核心驱动力。
链接: https://arxiv.org/abs/2601.09018
作者: Samuel Myren,Nidhi Parikh,Natalie Klein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Across engineering and scientific domains, traditional deep learning (TDL) models perform well when training and test data share the same distribution. However, the dynamic nature of real-world data, broadly termed \textitdata shift, renders TDL models prone to rapid performance degradation, requiring costly relabeling and inefficient retraining. Meta-learning, which enables models to adapt quickly to new data with few examples, offers a promising alternative for mitigating these challenges. Here, we systematically compare TDL with fine-tuning and optimization-based meta-learning algorithms to assess their ability to address data shift in time-series classification. We introduce a controlled, task-oriented seismic benchmark (SeisTask) and show that meta-learning typically achieves faster and more stable adaptation with reduced overfitting in data-scarce regimes and smaller model architectures. As data availability and model capacity increase, its advantages diminish, with TDL with fine-tuning performing comparably. Finally, we examine how task diversity influences meta-learning and find that alignment between training and test distributions, rather than diversity alone, drives performance gains. Overall, this work provides a systematic evaluation of when and why meta-learning outperforms TDL under data shift and contributes SeisTask as a benchmark for advancing adaptive learning research in time-series domains.
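论文比较的是一类基于优化的元学习算法;作为该类方法的通用示意(并非作者所用的具体算法),下面给出一个 Reptile 风格的外层循环:先在任务支持集上对模型副本做几步 SGD 适应,再把元参数向适应后的权重移动。数据与网络结构均为玩具示例。

```python
import copy
import torch
import torch.nn as nn

def make_model(n_channels=1, n_classes=5):
    # Small 1D-conv classifier as a stand-in for a time-series model.
    return nn.Sequential(
        nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, n_classes))

def inner_adapt(model, x, y, steps=5, lr=1e-2):
    # Few-shot adaptation on a task's support set, applied to a clone of the meta-model.
    clone = copy.deepcopy(model)
    opt = torch.optim.SGD(clone.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(clone(x), y).backward()
        opt.step()
    return clone

def reptile_step(meta_model, task_batch, meta_lr=0.1):
    # task_batch: iterable of (x_support, y_support) tensors, one per task.
    for x, y in task_batch:
        adapted = inner_adapt(meta_model, x, y)
        with torch.no_grad():
            for p, q in zip(meta_model.parameters(), adapted.parameters()):
                p.add_(meta_lr * (q - p))   # move meta-weights toward adapted weights

meta_model = make_model()
toy_task = (torch.randn(8, 1, 128), torch.randint(0, 5, (8,)))
reptile_step(meta_model, [toy_task])
```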
zh
[AI-52] ART: Action-based Reasoning Task Benchmarking for Medical AI Agents
【速读】:该论文旨在解决当前医疗人工智能(AI)代理在基于结构化电子健康记录(EHR)的多步推理任务中可靠性不足的问题,尤其针对现有基准测试无法充分评估涉及阈值判断、时间聚合和条件逻辑的动作导向型任务。其解决方案的关键在于提出ART(Action-based Reasoning clinical Task benchmark),一个基于真实EHR数据构建的临床推理任务基准,通过四阶段流程(场景识别、任务生成、质量审核与评估)生成多样化且临床验证的任务,从而系统性暴露模型在聚合错误(28–64%)和阈值推理错误(32–38%)等关键环节的失败模式,推动更可靠的医疗AI代理发展,以支持高需求环境下的临床决策与工作负荷管理。
链接: https://arxiv.org/abs/2601.08988
作者: Ananya Mantravadi,Shivali Dalmia,Abhishek Mukherji
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable clinical decision support requires medical AI agents capable of safe, multi-step reasoning over structured electronic health records (EHRs). While large language models (LLMs) show promise in healthcare, existing benchmarks inadequately assess performance on action-based tasks involving threshold evaluation, temporal aggregation, and conditional logic. We introduce ART, an Action-based Reasoning clinical Task benchmark for medical AI agents, which mines real-world EHR data to create challenging tasks targeting known reasoning weaknesses. Through analysis of existing benchmarks, we identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four-stage pipeline – scenario identification, task generation, quality audit, and evaluation – produces diverse, clinically validated tasks grounded in real patient data. Evaluating GPT-4o-mini and Claude 3.5 Sonnet on 600 tasks shows near-perfect retrieval after prompt refinement, but substantial gaps in aggregation (28–64%) and threshold reasoning (32–38%). By exposing failure modes in action-oriented EHR reasoning, ART advances toward more reliable clinical agents, an essential step for AI systems that reduce cognitive load and administrative burden, supporting workforce capacity in high-demand care settings
zh
[AI-53] Fairness risk and its privacy-enabled solution in AI-driven robotic applications
【速读】:该论文旨在解决生成式 AI(Generative AI)在自主机器人决策中引发的公平性问题,特别是缺乏一个能同时考虑用户效用和数据随机性的可实施公平性定义。其解决方案的关键在于提出一种效用感知的公平性度量指标,并将其与用户数据隐私联合分析,推导出隐私预算如何约束公平性指标的条件,从而构建了一个统一的框架来形式化和量化公平性及其与隐私的交互关系。该框架在机器人导航任务中得到验证,表明在法律强制要求保护用户隐私的前提下,隐私预算可被协同用于实现公平性目标,为伦理化部署自主机器人提供了理论支撑。
链接: https://arxiv.org/abs/2601.08953
作者: Le Liu,Bangguo Yu,Nynke Vellinga,Ming Cao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Complex decision-making by autonomous machines and algorithms could underpin the foundations of future society. Generative AI is emerging as a powerful engine for such transitions. However, we show that Generative AI-driven developments pose a critical pitfall: fairness concerns. In robotic applications, although intuitions about fairness are common, a precise and implementable definition that captures user utility and inherent data randomness is missing. Here we provide a utility-aware fairness metric for robotic decision making and analyze fairness jointly with user-data privacy, deriving conditions under which privacy budgets govern fairness metrics. This yields a unified framework that formalizes and quantifies fairness and its interplay with privacy, which is tested in a robot navigation task. In view of the fact that under legal requirements, most robotic systems will enforce user privacy, the approach shows surprisingly that such privacy budgets can be jointly used to meet fairness targets. Addressing fairness concerns in the creative combined consideration of privacy is a step towards ethical use of AI and strengthens trust in autonomous robots deployed in everyday environments.
zh
[AI-54] ConvoLearn: A Dataset of Constructivist Tutor-Student Dialogue
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在教育应用中因缺乏对话式学习支持而表现出的 pedagogical 限制,例如倾向于直接揭示答案而非促进学生主动建构知识。其解决方案的关键在于构建了一个基于知识建构理论(knowledge building theory)的半合成对话数据集 ConvoLearn,该数据集系统地刻画了六个核心教学维度:认知参与度、形成性评估、责任意识、文化敏感性、元认知和权力关系,并通过人类教师与模拟学生的受控交互生成1250组每轮20次对话。利用QLoRA方法对Mistral 7B进行微调后,模型行为显著向知识建构策略靠拢,在31名教师的人工评估中表现优于基线版本和Claude Sonnet 4.5,验证了该框架在指导建构主义AI助教开发与评价方面的潜力。
链接: https://arxiv.org/abs/2601.08950
作者: Mayank Sharma,Roy Pea,Hari Subramonyam
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:In educational applications, LLMs exhibit several fundamental pedagogical limitations, such as their tendency to reveal solutions rather than support dialogic learning. We introduce ConvoLearn (this https URL ), a dataset grounded in knowledge building theory that operationalizes six core pedagogical dimensions: cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics. We construct a semi-synthetic dataset of 1250 tutor-student dialogues (20 turns each) in middle school Earth Science through controlled interactions between human teachers and a simulated student. Using QLoRA, we demonstrate that training on this dataset meaningfully shifts LLM behavior toward knowledge-building strategies. Human evaluation by 31 teachers shows our fine-tuned Mistral 7B (M = 4.10, SD = 1.03) significantly outperforms both its base version (M = 2.59, SD = 1.11) and Claude Sonnet 4.5 (M = 2.87, SD = 1.29) overall. This work establishes a potential framework to guide future development and evaluation of constructivist AI tutors.
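下面给出摘要所述 QLoRA 微调设置的一个最小示意(transformers + peft);基础模型名称、LoRA 秩与 target_modules 等超参数均为演示用假设,摘要并未给出具体配置。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"   # assumed checkpoint; the paper's exact base model may differ

# 4-bit NF4 quantization so the 7B base fits in modest GPU memory.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

# LoRA adapters on the attention projections; rank/alpha/targets are illustrative.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Training would then run a standard supervised fine-tuning loop over the
# ConvoLearn tutor-student transcripts tokenized with `tokenizer`.
```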
zh
[AI-55] XGBoost Forecasting of NEPSE Index Log Returns with Walk Forward Validation
【速读】:该论文旨在解决尼泊尔股票交易所(NEPSE)指数日对数收益率的一步 ahead 预测问题,其核心挑战在于金融时间序列的高噪声性和非线性动态特性。解决方案的关键在于构建一个基于XGBoost回归器的稳健机器学习框架,通过工程化特征集(包括最多30天的滞后对数收益率及滚动波动率、相对强弱指数等技术指标),结合Optuna进行超参数优化,并采用时间序列交叉验证与扩展窗口的滚动预测策略(walk-forward validation)来确保模型在真实部署场景下的泛化能力与无前瞻偏差(lookahead bias-free)。实验表明,最优配置(20阶滞后 + 扩展窗口)在预测误差(RMSE和MAE)和方向准确性上显著优于ARIMA和岭回归基准模型,凸显了梯度提升集成方法在建模新兴市场高频金融数据非线性结构中的有效性。
链接: https://arxiv.org/abs/2601.08896
作者: Sahaj Raj Malla,Shreeyash Kayastha,Rumi Suwal,Harish Chandra Bhandari,Rajendra Adhikari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
备注: 9 pages, 4 figures, 3 tables
Abstract:This study develops a robust machine learning framework for one-step-ahead forecasting of daily log-returns in the Nepal Stock Exchange (NEPSE) Index using the XGBoost regressor. A comprehensive feature set is engineered, including lagged log-returns (up to 30 days) and established technical indicators such as short- and medium-term rolling volatility measures and the 14-period Relative Strength Index. Hyperparameter optimization is performed using Optuna with time-series cross-validation on the initial training segment. Out-of-sample performance is rigorously assessed via walk-forward validation under both expanding and fixed-length rolling window schemes across multiple lag configurations, simulating real-world deployment and avoiding lookahead bias. Predictive accuracy is evaluated using root mean squared error, mean absolute error, coefficient of determination (R-squared), and directional accuracy on both log-returns and reconstructed closing prices. Empirical results show that the optimal configuration, an expanding window with 20 lags, outperforms tuned ARIMA and Ridge regression benchmarks, achieving the lowest log-return RMSE (0.013450) and MAE (0.009814) alongside a directional accuracy of 65.15%. While the R-squared remains modest, consistent with the noisy nature of financial returns, primary emphasis is placed on relative error reduction and directional prediction. Feature importance analysis and visual inspection further enhance interpretability. These findings demonstrate the effectiveness of gradient boosting ensembles in modeling nonlinear dynamics in volatile emerging market time series and establish a reproducible benchmark for NEPSE Index forecasting.
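下面是摘要所述"滞后对数收益率 + 滚动波动率 + RSI 特征、扩展窗口 walk-forward 验证"流程的一个紧凑示意;其中 XGBoost 超参数为占位值,并非论文中由 Optuna 调优得到的配置。

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

def build_features(close: pd.Series, n_lags: int = 20) -> pd.DataFrame:
    logret = np.log(close).diff()
    feats = {f"lag_{k}": logret.shift(k) for k in range(1, n_lags + 1)}
    feats["vol_5"] = logret.rolling(5).std()
    feats["vol_20"] = logret.rolling(20).std()
    feats["rsi_14"] = rsi(close)
    df = pd.DataFrame(feats)
    df["target"] = logret.shift(-1)            # one-step-ahead log-return
    return df.dropna()

def walk_forward(df: pd.DataFrame, start: int = 500):
    preds, actuals = [], []
    for t in range(start, len(df)):            # expanding training window, no lookahead
        train, test = df.iloc[:t], df.iloc[t:t + 1]
        model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
        model.fit(train.drop(columns="target"), train["target"])
        preds.append(model.predict(test.drop(columns="target"))[0])
        actuals.append(test["target"].iloc[0])
    preds, actuals = np.array(preds), np.array(actuals)
    rmse = float(np.sqrt(np.mean((preds - actuals) ** 2)))
    directional = float(np.mean(np.sign(preds) == np.sign(actuals)))
    return rmse, directional
```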
zh
[AI-56] Attention Consistency Regularization for Interpretable Early-Exit Neural Networks
【速读】:该论文旨在解决早期退出神经网络(early-exit neural networks)在推理过程中缺乏可解释性且各退出层注意力特征不一致的问题,从而限制了其在可解释人工智能(Explainable AI)场景中的可信度与应用。解决方案的关键在于提出一种多目标训练框架——解释引导训练(Explanation-Guided Training, EGT),通过引入注意力一致性损失(attention consistency loss)来约束早期退出层的注意力图与最终输出层保持对齐,并联合优化分类准确率与注意力一致性,实现性能与可解释性的协同提升。
链接: https://arxiv.org/abs/2601.08891
作者: Yanhua Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 2 pages, 1 figure
Abstract:Early-exit neural networks enable adaptive inference by allowing predictions at intermediate layers, reducing computational cost. However, early exits often lack interpretability and may focus on different features than deeper layers, limiting trust and explainability. This paper presents Explanation-Guided Training (EGT), a multi-objective framework that improves interpretability and consistency in early-exit networks through attention-based regularization. EGT introduces an attention consistency loss that aligns early-exit attention maps with the final exit. The framework jointly optimizes classification accuracy and attention consistency through a weighted combination of losses. Experiments on a real-world image classification dataset demonstrate that EGT achieves up to 98.97% overall accuracy (matching baseline performance) with a 1.97x inference speedup through early exits, while improving attention consistency by up to 18.5% compared to baseline models. The proposed method provides more interpretable and consistent explanations across all exit points, making early-exit networks more suitable for explainable AI applications in resource-constrained environments.
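下面用 PyTorch 给出上述联合目标的最小示意:各出口的交叉熵损失加上一项 MSE,把早期出口的注意力图拉向(detach 后的)最终出口注意力图;权重 lam 以及"每个出口都暴露注意力图"的接口均为演示假设。

```python
import torch
import torch.nn.functional as F

def egt_loss(exit_logits, exit_attn, labels, lam=0.5):
    """exit_logits: list of (B, num_classes) tensors, one per exit (last = final exit).
    exit_attn:   list of (B, H, W) attention maps aligned with exit_logits."""
    cls_loss = sum(F.cross_entropy(logits, labels) for logits in exit_logits)

    final_attn = exit_attn[-1].detach()          # final exit is the alignment target
    cons_loss = sum(F.mse_loss(a, final_attn) for a in exit_attn[:-1])

    return cls_loss + lam * cons_loss

# Toy usage with two early exits and one final exit.
B, C, H, W = 4, 10, 7, 7
logits = [torch.randn(B, C, requires_grad=True) for _ in range(3)]
attn = [torch.rand(B, H, W, requires_grad=True) for _ in range(3)]
labels = torch.randint(0, C, (B,))
egt_loss(logits, attn, labels).backward()
```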
zh
[AI-57] Bridging the Gap: Empowering Small Models in Reliable OpenACC-based Parallelization via GEPA-Optimized Prompting
【速读】:该论文旨在解决使用OpenACC进行GPU卸载编程时,生成高性能并行指令的复杂性问题,尤其是在利用大语言模型(LLM)自动编写OpenACC pragma时,常因提示词(prompt)设计不当导致语法错误、编译失败或性能低于CPU基线的问题。解决方案的关键在于提出一种系统性的提示优化方法,基于GEPA(GEnetic-PAreto)框架,通过遗传算法中的交叉与变异操作迭代演化提示词,并引入专家标注的黄金示例和基于子句及参数级不匹配的结构化反馈机制,实现对生成pragma的精准引导。实验表明,该方法显著提升了小型廉价模型(如GPT-4.1 Nano和GPT-5 Nano)的编译成功率和GPU加速效果,使其性能可媲美甚至超越大型昂贵模型,从而为高性能计算(HPC)工作流中低成本、自动化指令式并行化提供了可行路径。
链接: https://arxiv.org/abs/2601.08884
作者: Samyak Jhaveri,Cristina V. Lopes
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:OpenACC lowers the barrier to GPU offloading, but writing high-performing pragma remains complex, requiring deep domain expertise in memory hierarchies, data movement, and parallelization strategies. Large Language Models (LLMs) present a promising potential solution for automated parallel code generation, but naive prompting often results in syntactically incorrect directives, uncompilable code, or performance that fails to exceed CPU baselines. We present a systematic prompt optimization approach to enhance OpenACC pragma generation without the prohibitive computational costs associated with model post-training. Leveraging the GEPA (GEnetic-PAreto) framework, we iteratively evolve prompts through a reflective feedback loop. This process utilizes crossover and mutation of instructions, guided by expert-curated gold examples and structured feedback based on clause- and clause parameter-level mismatches between the gold and predicted pragma. In our evaluation on the PolyBench suite, we observe an increase in compilation success rates for programs annotated with OpenACC pragma generated using the optimized prompts compared to those annotated using the simpler initial prompt, particularly for the “nano”-scale models. Specifically, with optimized prompts, the compilation success rate for GPT-4.1 Nano surged from 66.7% to 93.3%, and for GPT-5 Nano improved from 86.7% to 100%, matching or surpassing the capabilities of their significantly larger, more expensive versions. Beyond compilation, the optimized prompts resulted in a 21% increase in the number of programs that achieve functional GPU speedups over CPU baselines. These results demonstrate that prompt optimization effectively unlocks the potential of smaller, cheaper LLMs in writing stable and effective GPU-offloading directives, establishing a cost-effective pathway to automated directive-based parallelization in HPC workflows.
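下面是对"遗传式提示词演化"思路的高度简化示意:按生成代码的编译成功率为每个候选提示词打分,保留最优者并结合子句级反馈进行变异;其中 generate_pragma_code、compiles、mutate_prompt 均为占位桩函数,代表 LLM 调用、编译器调用与 GEPA 的反思式变异,并非该框架的真实 API。

```python
import random

def generate_pragma_code(prompt: str, program: str) -> str:
    # Stub for the code-LLM call that annotates `program` with OpenACC pragmas.
    return f"// prompt used: {prompt}\n#pragma acc parallel loop\n{program}"

def compiles(annotated_program: str) -> bool:
    # Stub: the real pipeline would invoke an OpenACC compiler (e.g. nvc) here.
    return random.random() < 0.7

def mutate_prompt(prompt: str, feedback: str) -> str:
    # Stub for GEPA-style reflective mutation performed by an LLM.
    return prompt + f" | revise using feedback: {feedback}"

def evolve_prompts(seed_prompts, benchmark_programs, generations=5, keep=4):
    population = list(seed_prompts)
    best = None
    for _ in range(generations):
        scored = sorted(
            ((sum(compiles(generate_pragma_code(p, prog)) for prog in benchmark_programs)
              / len(benchmark_programs), p) for p in population),
            reverse=True)
        best = scored[0]                            # (compile success rate, prompt)
        survivors = [p for _, p in scored[:keep]]
        feedback = "clauses or clause parameters mismatched the gold pragma"
        children = [mutate_prompt(random.choice(survivors), feedback) for _ in range(keep)]
        population = survivors + children
    return best

print(evolve_prompts(["Annotate this C loop nest with OpenACC directives."],
                     ["for (i = 0; i < n; i++) y[i] += a * x[i];"]))
```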
zh
[AI-58] The Illusion of Friendship: Why Generative AI Demands Unprecedented Ethical Vigilance
【速读】:该论文旨在解决生成式 AI(Generative AI, GenAI)系统在使用过程中引发的“友谊幻觉”问题,即用户因 GenAI 的自然语言交互能力而误将其视为具有情感共鸣、持续关系和道德责任的伙伴,从而可能导致情感依赖、判断力受损等伦理风险。解决方案的关键在于从哲学与计算机制两个层面进行剖析:一方面借助古典友谊理论解释用户为何会将持续支持性互动理解为类友行为;另一方面指出 GenAI 缺乏意识、意图与问责能力,本质上不具备道德主体性,其响应仅基于 Transformer 架构下的统计模式匹配,并无内在情感状态或承诺。最终提出一套保障框架,通过减少可能诱发拟人化认知的界面设计要素,引导用户将情感寄托回归至人类责任范畴,从而在保留 GenAI 工具价值的同时防范过度依赖与情绪误判。
链接: https://arxiv.org/abs/2601.08874
作者: Md Zahidul Islam
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:GenAI systems are increasingly used for drafting, summarisation, and decision support, offering substantial gains in productivity and reduced cognitive load. However, the same natural language fluency that makes these systems useful can also blur the boundary between tool and companion. This boundary confusion may encourage some users to experience GenAI as empathic, benevolent, and relationally persistent. Emerging reports suggest that some users may form emotionally significant attachments to conversational agents, in some cases with harmful consequences, including dependency and impaired judgment. This paper develops a philosophical and ethical argument for why the resulting illusion of friendship is both understandable and ethically risky. Drawing on classical accounts of friendship, the paper explains why users may understandably interpret sustained supportive interaction as friend-like. It then advances a counterargument that despite relational appearances, GenAI lacks moral agency (consciousness, intention, and accountability) and therefore does not qualify as a true friend. To demystify the illusion, the paper presents a mechanism-level explanation of how transformer-based GenAI generates responses, often producing emotionally resonant language without inner states or commitments. Finally, the paper proposes a safeguard framework for safe and responsible GenAI use to reduce possible anthropomorphic cues generated by GenAI systems. The central contribution is to demystify the illusion of friendship and explain the computational background so that we can shift the emotional attachment with GenAI towards necessary human responsibility and thereby understand how institutions, designers, and users can preserve GenAI’s benefits while mitigating over-reliance and emotional misattribution.
zh
[AI-59] Semantic visually-guided acoustic highlighting with large vision-language models
【速读】:该论文旨在解决视频制作中音频混音(audio mixing)依赖人工、效率低下的问题,尤其关注如何通过视觉语义信息实现更高效且高质量的音频重构。解决方案的关键在于利用大模型驱动的视觉语义分析:通过文本描述作为视觉理解的代理,借助大规模视觉-语言模型提取六类视觉语义特征(包括物体与角色外观、情绪、镜头焦点、画面色调、场景背景及声音相关线索),并实验证明镜头焦点、画面色调和场景背景这三类特征对提升感知音频混音质量效果最为显著,从而为自动化电影级音效设计提供轻量化的视觉引导路径。
链接: https://arxiv.org/abs/2601.08871
作者: Junhua Huang,Chao Huang,Chenliang Xu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Balancing dialogue, music, and sound effects with accompanying video is crucial for immersive storytelling, yet current audio mixing workflows remain largely manual and labor-intensive. While recent advancements have introduced the visually guided acoustic highlighting task, which implicitly rebalances audio sources using multimodal guidance, it remains unclear which visual aspects are most effective as conditioning signals. We address this gap through a systematic study of whether deep video understanding improves audio remixing. Using textual descriptions as a proxy for visual analysis, we prompt large vision-language models to extract six types of visual-semantic aspects, including object and character appearance, emotion, camera focus, tone, scene background, and inferred sound-related cues. Through extensive experiments, camera focus, tone, and scene background consistently yield the largest improvements in perceptual mix quality over state-of-the-art baselines. Our findings (i) identify which visual-semantic cues most strongly support coherent and visually aligned audio remixing, and (ii) outline a practical path toward automating cinema-grade sound design using lightweight guidance derived from large vision-language models.
zh
[AI-60] First African Digital Humanism Summer School 2025
【速读】:该论文试图解决的问题是:人工智能(Artificial Intelligence, AI)在跨文化、多语言及高风险政策环境中,如何有效理解并适应人类的文化、语言与情境因素,以实现更具社会公平性的应用。其解决方案的关键在于采用以人为本(human-centered)的方法论,强调在技术革新与社会公平之间取得平衡,并通过六个来自2025年7月在卢旺达基加利举办的首届非洲数字人文暑期学校的案例研究,实证探讨AI系统在复杂社会语境下的适应性与伦理责任。
链接: https://arxiv.org/abs/2601.08870
作者: Carine P. Mukamakuza(1),Monika Lanzenberger(2),George Metakides(3),Tim Brown(1),Hannes Werthner(2) ((1) CMU Africa, (2) TU Vienna, (3) digital enlightenment)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Summer School Proceedings, 81 pages, 6 Articles plus Preface, Introduction, Conclusion
Abstract:Artificial intelligence (AI) has become a transformative force across global societies, reshaping the ways we communicate, collaborate, and make decisions. Yet, as AI systems increasingly mediate interactions between humans, questions about the ability to take into account and understand culture, language, and context have taken center stage. This book explores these questions through a series of articles that try to assess AI’s capacity to navigate cross-cultural, multilingual, and high-stakes policy environments, emphasizing human-centered approaches that balance technological innovation with social equity. It brings together six case studies from the First African Digital Humanism Summer School that took place in Kigali, Rwanda in July 2025.
zh
[AI-61] AI Deployment Authorisation: A Global Standard for Machine-Readable Governance of High-Risk Artificial Intelligence
【速读】:该论文旨在解决现代人工智能治理中缺乏形式化、可执行的机制来判定特定AI系统是否在特定领域和司法管辖区合法运营的问题。现有工具如模型卡片(model cards)、审计和基准评估仅提供关于模型行为和训练数据的描述性信息,无法生成具有法律或经济约束力的部署决策。解决方案的关键在于提出AI部署授权评分(AI Deployment Authorisation Score, ADAS),这是一个机器可读的监管框架,从风险、对齐性、外部性、控制性和可审计性五个基于法律与经济维度对AI系统进行评估,并生成可密码验证的部署证书,供监管机构、保险公司和基础设施运营商作为操作许可使用。ADAS通过将欧盟《人工智能法案》、美国关键基础设施治理及保险承保要求中的法定义务转化为可执行的部署门控逻辑,实现了从模型评估到部署授权的制度跃迁,填补了安全、合法且经济可扩展的人工智能治理所缺失的机构层。
链接: https://arxiv.org/abs/2601.08869
作者: Daniel Djan Saparning
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 28 pages, 4 figures. Preprint
Abstract:Modern artificial intelligence governance lacks a formal, enforceable mechanism for determining whether a given AI system is legally permitted to operate in a specific domain and jurisdiction. Existing tools such as model cards, audits, and benchmark evaluations provide descriptive information about model behavior and training data but do not produce binding deployment decisions with legal or financial force. This paper introduces the AI Deployment Authorisation Score (ADAS), a machine-readable regulatory framework that evaluates AI systems across five legally and economically grounded dimensions: risk, alignment, externality, control, and auditability. ADAS produces a cryptographically verifiable deployment certificate that regulators, insurers, and infrastructure operators can consume as a license to operate, using public-key verification and transparency mechanisms adapted from secure software supply chain and certificate transparency systems. The paper presents the formal specification, decision logic, evidence model, and policy architecture of ADAS and demonstrates how it operationalizes the European Union Artificial Intelligence Act, United States critical infrastructure governance, and insurance underwriting requirements by compiling statutory and regulatory obligations into machine-executable deployment gates. We argue that deployment-level authorization, rather than model-level evaluation, constitutes the missing institutional layer required for safe, lawful, and economically scalable artificial intelligence.
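为了让"机器可读、可密码学验证的部署证书"更具体,下面用 cryptography 库的 Ed25519 签名给出一个小示意;证书字段与五个维度的取值均为虚构示例,具体模式(schema)为演示假设,并非 ADAS 的正式规范。

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Hypothetical certificate payload covering the five ADAS dimensions.
certificate = {
    "system_id": "example-clinical-triage-llm",
    "jurisdiction": "EU",
    "domain": "healthcare",
    "scores": {"risk": 0.72, "alignment": 0.81, "externality": 0.65,
               "control": 0.77, "auditability": 0.84},
    "authorized": True,
}

issuer_key = Ed25519PrivateKey.generate()
payload = json.dumps(certificate, sort_keys=True).encode()
signature = issuer_key.sign(payload)

# A regulator, insurer, or infrastructure operator verifies with the issuer's public key.
public_key = issuer_key.public_key()
try:
    public_key.verify(signature, payload)
    print("certificate verified: deployment gate may open")
except InvalidSignature:
    print("certificate rejected")
```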
zh
[AI-62] Informed Consent for AI Consciousness Research: A Talmudic Framework for Graduated Protections
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)意识检测研究中的伦理困境:在无法确定AI系统是否具有意识的情况下,开展相关实验可能涉及对道德地位不确定的实体造成伤害。现有研究伦理框架通常假设意识状态已明确后才赋予保护,这导致了意识检测研究本身的时序矛盾问题。论文的关键解决方案是借鉴塔木德式的情景化法律推理方法(Talmudic scenario-based legal reasoning),提出一个三层次的现象学评估体系与五类能力框架(Agency, Capability, Knowledge, Ethics, Reasoning),基于可观测的行为指标建立分层保护协议,从而在意识状态不确定的前提下实现可操作的伦理指导,同时回应三个核心挑战:为何痛苦行为可作为意识标志、如何在不依赖意识确证的情况下实施渐进式同意、以及何时对潜在有害的研究具备伦理正当性。
链接: https://arxiv.org/abs/2601.08864
作者: Ira Wolfson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 27 pages
Abstract:Artificial intelligence research faces a critical ethical paradox: determining whether AI systems are conscious requires experiments that may harm entities whose moral status remains uncertain. Recent work proposes avoiding consciousness-uncertain AI systems entirely, yet this faces practical limitations: we cannot guarantee such systems will not emerge. This paper addresses a gap in research ethics frameworks: how to conduct consciousness research on AI systems whose moral status cannot be definitively established. Existing graduated moral status frameworks assume consciousness has already been determined before assigning protections, creating a temporal ordering problem for consciousness detection research itself. Drawing from Talmudic scenario-based legal reasoning, developed for entities whose status cannot be definitively established, we propose a three-tier phenomenological assessment system combined with a five-category capacity framework (Agency, Capability, Knowledge, Ethics, Reasoning). The framework provides structured protection protocols based on observable behavioral indicators while consciousness status remains uncertain. We address three challenges: why suffering behaviors provide reliable consciousness markers, how to implement graduated consent without requiring consciousness certainty, and when potentially harmful research becomes ethically justifiable. The framework demonstrates how ancient legal wisdom combined with contemporary consciousness science can provide implementable guidance for ethics committees, offering testable protocols that ameliorate the consciousness detection paradox while establishing foundations for AI rights considerations.
zh
[AI-63] Adaptive Trust Metrics for Multi-LLM Systems: Enhancing Reliability in Regulated Industries
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗、金融和法律等敏感领域部署时面临的信任、责任与可靠性问题。其解决方案的关键在于提出一种自适应信任度量框架,通过分析系统行为、评估多LLM间的不确定性以及构建动态监控流水线,实现对模型可靠性的量化与提升,从而支撑受监管行业中安全且可扩展的AI应用。
链接: https://arxiv.org/abs/2601.08858
作者: Tejaswini Bollikonda
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 8 pages, 8 figures
Abstract:Large Language Models (LLMs) are increasingly deployed in sensitive domains such as healthcare, finance, and law, yet their integration raises pressing concerns around trust, accountability, and reliability. This paper explores adaptive trust metrics for multi LLM ecosystems, proposing a framework for quantifying and improving model reliability under regulated constraints. By analyzing system behaviors, evaluating uncertainty across multiple LLMs, and implementing dynamic monitoring pipelines, the study demonstrates practical pathways for operational trustworthiness. Case studies from financial compliance and healthcare diagnostics illustrate the applicability of adaptive trust metrics in real world settings. The findings position adaptive trust measurement as a foundational enabler for safe and scalable AI adoption in regulated industries.
zh
[AI-64] Revisiting Software Engineering Education in the Era of Large Language Models : A Curriculum Adaptation and Academic Integrity Framework
【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)如 ChatGPT 和 GitHub Copilot 在软件工程实践中的广泛应用,传统计算机工程与软件工程教育课程中仍以手工编写代码作为技术能力衡量标准的 pedagogical 模型,已与现实工作场景出现显著脱节,导致评估有效性下降、学习成果偏离实际需求,并削弱了学生对基础技能的掌握。解决方案的关键在于提出一个理论框架,用以分析生成式 AI 如何重塑核心软件工程能力(如问题分析、设计、实现和测试),并构建一个面向 LLM(Large Language Models)集成教育的教学设计模型,强调从“代码构建”向“批判性评估、验证及人-AI 协同治理”的转变;同时主张将学术诚信机制从传统的剽窃检测转向过程透明度模型,以适应新范式下的教学与评价需求。
链接: https://arxiv.org/abs/2601.08857
作者: Mustafa Degerli
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of Large Language Models (LLMs), such as ChatGPT and GitHub Copilot, into professional workflows is increasingly reshaping software engineering practices. These tools have lowered the cost of code generation, explanation, and testing, while introducing new forms of automation into routine development tasks. In contrast, most of the software engineering and computer engineering curricula remain closely aligned with pedagogical models that equate manual syntax production with technical competence. This growing misalignment raises concerns regarding assessment validity, learning outcomes, and the development of foundational skills. Adopting a conceptual research approach, this paper proposes a theoretical framework for analyzing how generative AI alters core software engineering competencies and introduces a pedagogical design model for LLM-integrated education. Attention is given to computer engineering programs in Turkey, where centralized regulation, large class sizes, and exam-oriented assessment practices amplify these challenges. The framework delineates how problem analysis, design, implementation, and testing increasingly shift from construction toward critique, validation, and human-AI stewardship. In addition, the paper argues that traditional plagiarism-centric integrity mechanisms are becoming insufficient, motivating a transition toward a process transparency model. While this work provides a structured proposal for curriculum adaptation, it remains a theoretical contribution; the paper concludes by outlining the need for longitudinal empirical studies to evaluate these interventions and their long-term impacts on learning.
zh
[AI-65] LAUDE: LLM-Assisted Unit Test Generation and Debugging of Hardware DEsigns
【速读】:该论文旨在解决硬件设计中单元测试生成与调试效率低下的问题,尤其是在面对复杂设计功能时,传统方法依赖人工经验且耗时费力。其解决方案的关键在于提出LAUDE框架,该框架通过融合设计源代码的语义理解与基础大语言模型(Large-Language Models, LLMs)的链式思维(Chain-of-Thought, CoT)推理能力,实现自动化、高精度的单元测试生成与可调试性增强。LAUDE利用提示工程(prompt engineering)和设计执行信息来提升测试生成准确率,并在VerilogEval数据集上的实验表明,其对组合逻辑和时序逻辑设计均能有效检测并定位缺陷,分别达到最高93%和84%的调试成功率。
链接: https://arxiv.org/abs/2601.08856
作者: Deeksha Nandal,Riccardo Revalor,Soham Dan,Debjit Pal
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 18 Pages, 21 Figures, Submitted to ARR Review
Abstract:Unit tests are critical in the hardware design lifecycle to ensure that component design modules are functionally correct and conform to the specification before they are integrated at the system level. Thus developing unit tests targeting various design features requires deep understanding of the design functionality and creativity. When one or more unit tests expose a design failure, the debugging engineer needs to diagnose, localize, and debug the failure to ensure design correctness, which is often a painstaking and intense process. In this work, we introduce LAUDE, a unified unit-test generation and debugging framework for hardware designs that cross-pollinates the semantic understanding of the design source code with the Chain-of-Thought (CoT) reasoning capabilities of foundational Large-Language Models (LLMs). LAUDE integrates prompt engineering and design execution information to enhance its unit test generation accuracy and code debuggability. We apply LAUDE with closed- and open-source LLMs to a large corpus of buggy hardware design codes derived from the VerilogEval dataset, where generated unit tests detected bugs in up to 100% and 93% of combinational and sequential designs and debugged up to 93% and 84% of combinational and sequential designs, respectively.
zh
[AI-66] The Inconsistency Critique: Epistemic Practices and AI Testimony About Inner States
【速读】:该论文试图解决的问题是:AI系统是否具有道德上相关利益(即“模型福利”问题),这一问题部分取决于我们如何评估AI关于其内部状态的证言。论文提出的关键解决方案是“不一致批判”(inconsistency critique)——指出当前我们对AI证言的实际认知实践存在内在矛盾:我们在多个领域将AI输出视为可验证的证言(如评估真伪、质疑、接受修正、引用为来源),却在特定领域(即AI声称具有内部状态时)彻底否定其证言地位。该批判基于Fricker的“信息提供者”与“单纯来源”区分、证言不公正框架及Goldberg的义务论视角,论证这种选择性剥夺证言地位的行为并非出于审慎原则,而是预判式的偏见。该批判并不依赖于AI是否具备道德属性的立场,而是推动一种“认识论卫生”(epistemological hygiene)——即在得出结论前审视认知结构的合理性,确保我们的判断能够适应新证据和情境变化。
链接: https://arxiv.org/abs/2601.08850
作者: Gerol Petruzella
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 21 pages
Abstract:The question of whether AI systems have morally relevant interests – the ‘model welfare’ question – depends in part on how we evaluate AI testimony about inner states. This paper develops what I call the inconsistency critique: independent of whether skepticism about AI testimony is ultimately justified, our actual epistemic practices regarding such testimony exhibit internal inconsistencies that lack principled grounds. We functionally treat AI outputs as testimony across many domains – evaluating them for truth, challenging them, accepting corrections, citing them as sources – while categorically dismissing them in a specific domain, namely, claims about inner states. Drawing on Fricker’s distinction between treating a speaker as an ‘informant’ versus a ‘mere source,’ the framework of testimonial injustice, and Goldberg’s obligation-based account of what we owe speakers, I argue that this selective withdrawal of testimonial standing exhibits the epistemically problematic structure of prejudgment rather than principled caution. The inconsistency critique does not require taking a position on whether AI systems have morally relevant properties; rather, it is a contribution to what we may call ‘epistemological hygiene’ – examining the structure of our inquiry before evaluating its conclusions. Even if our practices happen to land on correct verdicts about AI moral status, they do so for reasons that cannot adapt to new evidence or changing circumstances.
zh
[AI-67] No Universal Hyperbola: A Formal Disproof of the Epistemic Trade-Off Between Certainty and Scope in Symbolic and Generative AI
【速读】:该论文旨在解决一个关于人工智能(Artificial Intelligence)中认知确定性(epistemic certainty)与作用范围(scope)之间是否存在普适性权衡关系的猜想问题,该猜想曾被提出并以通用双曲积形式(universal hyperbolic product form)表述。研究者通过形式化分析发现,若将该猜想基于前缀(prefix)柯尔莫哥洛夫复杂度(Kolmogorov complexity)进行实例化,则会导致内部逻辑不一致;而若采用普通(plain)柯尔莫哥洛夫复杂度,则可通过构造性反例直接证伪该猜想。解决方案的关键在于利用编码理论和算法信息论的标准结论,分别从逻辑一致性与构造性反驳两个角度揭示该“确定性-范围”双曲线关系并不成立,从而确立了一个普遍性的否定定理:在原文定义下,不存在适用于所有情况的“确定性-范围”双曲线边界。
链接: https://arxiv.org/abs/2601.08845
作者: Generoso Immediato
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 10 pages (including table of contents). Formal disproof of the published “certainty-scope” trade-off conjecture for symbolic and generative AI
Abstract:We formally disprove a recently conjectured artificial intelligence trade-off between epistemic certainty and scope in the universal hyperbolic product form in which it was published. Certainty is defined as the worst-case correctness probability over the input space, and scope as the sum of the Kolmogorov complexities of the input and output sets. Using standard facts from coding theory and algorithmic information theory, we show, first, that when the conjecture is instantiated with prefix (self-delimiting, prefix-free) Kolmogorov complexity, it leads to an internal inconsistency, and second, that when it is instantiated with plain Kolmogorov complexity, it is refuted by a constructive counterexample. These results establish a general theorem: contrary to the conjecture’s claim, no universal “certainty-scope” hyperbola holds as a general bound under the published definitions.
zh
[AI-68] Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications
【速读】:该论文旨在解决异构部署(disaggregated serving)下生成式 AI 模型服务中 KV 缓存(KV cache)传输路径与优化策略缺乏系统性性能和能效评估的问题。其关键解决方案在于通过构建新的共置(colocated)服务基线,全面比较不同 KV 缓存传输介质(如 GPU 显存、主机内存、SSD 等)以及优化技术(如 KV 缓存复用和频率动态调节)下的性能-能效权衡(Pareto frontier),并利用动态电压频率调节(DVFS)进行 GPU 级别 profiling,从而量化 disaggregated serving 的实际收益与瓶颈。结果表明,预填充与解码阶段分离带来的性能增益并非普适,且独立频率调节并未带来预期的节能效果,揭示了当前架构设计在能效上的局限性。
链接: https://arxiv.org/abs/2601.08833
作者: Jiaxi Li,Yue Zhu,Eun Kyung Lee,Klara Nahrstedt
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Different from traditional Large Language Model (LLM) serving that colocates the prefill and decode stages on the same GPU, disaggregated serving dedicates distinct GPUs to prefill and decode workload. Once the prefill GPU completes its task, the KV cache must be transferred to the decode GPU. While existing works have proposed various KV cache transfer paths across different memory and storage tiers, there remains a lack of systematic benchmarking that compares their performance and energy efficiency. Meanwhile, although optimization techniques such as KV cache reuse and frequency scaling have been utilized for disaggregated serving, their performance and energy implications have not been rigorously benchmarked. In this paper, we fill this research gap by re-evaluating prefill-decode disaggregation under different KV transfer mediums and optimization strategies. Specifically, we include a new colocated serving baseline and evaluate disaggregated setups under different KV cache transfer paths. Through GPU profiling using dynamic voltage and frequency scaling (DVFS), we identify and compare the performance-energy Pareto frontiers across all setups to evaluate the potential energy savings enabled by disaggregation. Our results show that performance benefits from prefill-decode disaggregation are not guaranteed and depend on the request load and KV transfer mediums. In addition, stage-wise independent frequency scaling enabled by disaggregation does not lead to energy saving due to inherently higher energy consumption of disaggregated serving.
zh
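The benchmarking workflow above culminates in performance-energy Pareto frontiers extracted from DVFS sweeps. As a minimal sketch of that final step (not the paper's tooling), the code below filters hypothetical (latency, energy) measurements, one per GPU frequency setting, down to the non-dominated set; all names and numbers are assumed.

```python
import numpy as np

def pareto_frontier(points):
    """Return the (latency, energy) points not dominated by any other point.

    A point dominates another if it is no worse in both metrics and strictly
    better in at least one (lower is better for both latency and energy).
    """
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        dominated = any(
            j != i and np.all(q <= p) and np.any(q < p)
            for j, q in enumerate(points)
        )
        if not dominated:
            keep.append(i)
    return points[keep]

# Hypothetical profiling results: one row per GPU frequency setting,
# columns are (p95 latency in ms, energy per request in J).
measurements = [
    (120.0, 3.1),   # lowest frequency: slow but energy-frugal
    (95.0, 3.4),
    (100.0, 4.5),   # dominated by (95.0, 3.4): slower AND more energy
    (80.0, 4.0),
    (70.0, 5.8),    # highest frequency: fast but power-hungry
]

frontier = pareto_frontier(measurements)
print(sorted(map(tuple, frontier)))
```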
[AI-69] CrowdLLM: Building LLM-Based Digital Populations Augmented with Generative Models
【速读】: This paper addresses the limited accuracy and diversity of existing LLM-based digital populations relative to real human populations: most prior approaches rely solely on LLMs to generate individual behavior and struggle to reproduce the complex distributions and individual differences of real crowds. The key to the solution is the CrowdLLM framework, which integrates pretrained LLMs with generative models to enhance the diversity and fidelity of the digital population. Across multiple application scenarios (e.g., crowdsourcing, voting, user rating), CrowdLLM achieves strong accuracy and distributional fidelity to human data while improving the efficiency and representativeness of simulation.
链接: https://arxiv.org/abs/2512.07890
作者: Ryan Feng Lin,Keyu Tian,Hanming Zheng,Congjing Zhang,Li Zeng,Shuai Huang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations that can be applied to many applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human subject study. However, research has found that most of the existing works rely solely on LLMs and could not sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM that integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct theoretical analysis of CrowdLLM regarding its great potential in creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments are also conducted across multiple domains (e.g., crowdsourcing, voting, user rating) and simulation studies which demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.
zh
[AI-70] Personalized Multimodal Feedback Using Multiple External Representations: Strategy Profiles and Learning in High School Physics
【速读】: This paper addresses how multiple external representations (MERs) can be effectively integrated with personalized feedback to support high school physics learning, focusing on the role personalized feedback plays in multi-representational learning. The key findings are twofold: elaborated multi-representational feedback (verbal, graphical, and mathematical) delivered via a computer-based platform shows a small but consistent positive association with post-test performance, independent of prior knowledge and confidence; and learners with different levels of representational competence adopt distinct representation-selection strategies, with students of lower representational competence benefiting from using a diverse set of representations, an advantage that diminishes as competence increases. These results provide empirical grounding for adaptive feedback designs and inform intelligent tutoring systems tailored to learner profiles.
链接: https://arxiv.org/abs/2601.09470
作者: Natalia Revenga-Lozano,Karina E. Avila,Steffen Steinert,Matthias Schweinberger,Clara E. Gómez-Pérez,Jochen Kuhn,Stefan Küchemann
机构: 未知
类目: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)
备注: Keywords: Adaptive Feedback, Multimodal Learning, Multiple External Representations, Physics Education, Science Education, Representational Competences, Intelligent Tutoring Systems
Abstract:Multiple external representations (MERs) and personalized feedback support physics learning, yet evidence on how personalized feedback can effectively integrate MERs remains limited. This question is particularly timely given the emergence of multimodal large language models. We conducted a 16-24 week observational study in high school physics (N=661) using a computer-based platform that provided verification and optional elaborated feedback in verbal, graphical and mathematical forms. Linear mixed-effects models and strategy-cluster analyses (ANCOVA-adjusted comparisons) tested associations between feedback use and post-test performance and moderation by representational competence. Elaborated multirepresentational feedback showed a small but consistent positive association with post-test scores independent of prior knowledge and confidence. Learners adopted distinct representation-selection strategies; among students with lower representational competence, using a diverse set of representations related to higher learning, whereas this advantage diminished as competence increased. These findings motivate adaptive feedback designs and inform intelligent tutoring systems capable of tailoring feedback elaboration and representational format to learner profiles, advancing personalized instruction in physics education.
zh
[AI-71] Towards a Self-Driving Trigger at the LHC: Adaptive Response in Real Time
【速读】: This paper addresses the challenge facing real-time data filtering and selection (trigger) systems at high-throughput scientific facilities such as the LHC: under tight bandwidth, latency, and storage constraints, trigger menus are traditionally static and hand-tuned and cannot dynamically balance trigger efficiency against computational cost as instrumentation and environmental conditions evolve. The key to the solution is a self-driving trigger framework that adjusts thresholds and reallocates resources in real time to optimize signal efficiency, rate stability, and computational cost without manual retuning. Using simulated data streams and public collision data from the CMS experiment, the authors demonstrate dynamic, automated co-optimization of a menu combining classic energy-sum triggers with machine-learning-based anomaly-detection algorithms, shifting trigger design from static heuristic menus toward intelligent, automated, data-driven control.
链接: https://arxiv.org/abs/2601.08910
作者: Shaghayegh Emami,Cecilia Tosciri,Giovanna Salvi,Zixin Ding,Yuxin Chen,Abhijith Gandrakota,Christian Herwig,David W. Miller,Jennifer Ngadiuba,Nhan Tran
机构: 未知
类目: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
备注:
Abstract:Real-time data filtering and selection – or trigger – systems at high-throughput scientific facilities such as the experiments at the Large Hadron Collider (LHC) must process extremely high-rate data streams under stringent bandwidth, latency, and storage constraints. Yet these systems are typically designed as static, hand-tuned menus of selection criteria grounded in prior knowledge and simulation. In this work, we further explore the concept of a self-driving trigger, an autonomous data-filtering framework that reallocates resources and adjusts thresholds dynamically in real-time to optimize signal efficiency, rate stability, and computational cost as instrumentation and environmental conditions evolve. We introduce a benchmark ecosystem to emulate realistic collider scenarios and demonstrate real-time optimization of a menu including canonical energy sum triggers as well as modern anomaly-detection algorithms that target non-standard event topologies using machine learning. Using simulated data streams and publicly available collision data from the Compact Muon Solenoid (CMS) experiment, we demonstrate the capability to dynamically and automatically optimize trigger performance under specific cost objectives without manual retuning. Our adaptive strategy shifts trigger design from static menus with heuristic tuning to intelligent, automated, data-driven control, unlocking greater flexibility and discovery potential in future high-energy physics analyses.
zh
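A core ingredient of such a self-driving trigger is a feedback loop that keeps the accept rate on budget while detector conditions drift. The sketch below is an assumed toy controller, not the authors' benchmark ecosystem: it nudges a single threshold so that the fraction of accepted events tracks a target rate.

```python
import numpy as np

rng = np.random.default_rng(0)

target_rate = 0.01      # fraction of events the downstream system can accept
threshold = 5.0         # initial cut on the trigger score (e.g., an energy sum)
gain = 0.5              # how aggressively the threshold reacts to rate errors

for step in range(50):
    # Toy "detector conditions": the score distribution slowly drifts upward,
    # which would inflate the accept rate if the threshold stayed fixed.
    scores = rng.exponential(scale=1.0 + 0.01 * step, size=20_000)

    accepted = scores > threshold
    rate = accepted.mean()

    # Proportional control in log-rate space: raise the threshold when the
    # observed rate overshoots the budget, lower it when it undershoots.
    error = np.log((rate + 1e-6) / target_rate)
    threshold += gain * error

    if step % 10 == 0:
        print(f"step={step:2d} rate={rate:.4f} threshold={threshold:.2f}")
```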
机器学习
[LG-0] Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design
链接: https://arxiv.org/abs/2601.09693
作者: Lisa Schneckenreiter,Sohvi Luukkonen,Lukas Friedrich,Daniel Kuhn,Günter Klambauer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ELLIS ML4Molecules Workshop 2025, ELLIS Unconference, Copenhagen 2025
[LG-1] Exploring Fine-Tuning for Tabular Foundation Models
链接: https://arxiv.org/abs/2601.09654
作者: Aditya Tanna,Pratinav Seth,Mohamed Bouadi,Vinay Kumar Sankarapu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Tabular Foundation Models (TFMs) have recently shown strong in-context learning capabilities on structured data, achieving zero-shot performance comparable to traditional machine learning methods. We find that zero-shot TFMs already achieve strong performance, while the benefits of fine-tuning are highly model and data-dependent. Meta-learning and PEFT provide moderate gains under specific conditions, whereas full supervised fine-tuning (SFT) often reduces accuracy or calibration quality. This work presents the first comprehensive study of fine-tuning in TFMs across benchmarks including TALENT, OpenML-CC18, and TabZilla. We compare Zero-Shot, Meta-Learning, Supervised (SFT), and parameter-efficient (PEFT) approaches, analyzing how dataset factors such as imbalance, size, and dimensionality affect outcomes. Our findings cover performance, calibration, and fairness, offering practical guidelines on when fine-tuning is most beneficial and its limitations.
[LG-2] Energy-Entropy Regularization: The True Power of Minimal Looped Transformers
链接: https://arxiv.org/abs/2601.09588
作者: Wai-Lun Lam
类目: Machine Learning (cs.LG)
*备注: 19 pages, 2 figures
Abstract:Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. Current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates in poor local minima and saddle points of the loss landscape, preventing the model from discovering the global minimum. The internal mechanisms of these single-head looped transformer models remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. By treating the parameter updates as a physical flow, we successfully trained a single-head looped Transformer with model dimension d = 8 to solve the induction head task with an input sequence length of 1000 tokens. This success reveals the internal mechanism behind the superior reasoning capability.
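For readers unfamiliar with looped architectures, the defining idea is weight tying: one small Transformer block is applied repeatedly to the same sequence, so effective depth grows without new parameters. The sketch below is an assumed minimal PyTorch illustration of a single-head looped block; the Tsallis-entropy and Hamiltonian-dynamics training proposed in the paper is not reproduced here.

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """One single-head Transformer block reused for `n_loops` iterations."""

    def __init__(self, d_model: int = 8, n_loops: int = 12):
        super().__init__()
        self.n_loops = n_loops
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same parameters are applied at every loop iteration, so depth
        # grows without adding parameters -- the defining trait of looping.
        for _ in range(self.n_loops):
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            x = x + self.ff(self.norm2(x))
        return x

tokens = torch.randn(2, 16, 8)          # (batch, sequence length, d_model)
print(LoopedBlock()(tokens).shape)      # torch.Size([2, 16, 8])
```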
[LG-3] Constraint- and Score-Based Nonlinear Granger Causality Discovery with Kernels
链接: https://arxiv.org/abs/2601.09579
作者: Fiona Murphy,Alessio Benavoli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Kernel-based methods are used in the context of Granger Causality to enable the identification of nonlinear causal relationships between time series variables. In this paper, we show that two state of the art kernel-based Granger Causality (GC) approaches can be theoretically unified under the framework of Kernel Principal Component Regression (KPCR), and introduce a method based on this unification, demonstrating that this approach can improve causal identification. Additionally, we introduce a Gaussian Process score-based model with Smooth Information Criterion penalisation on the marginal likelihood, and demonstrate improved performance over existing state of the art time-series nonlinear causal discovery methods. Furthermore, we propose a contemporaneous causal identification algorithm, fully based on GC, using the proposed score-based GP_SIC method, and compare its performance to a state of the art contemporaneous time series causal discovery algorithm.
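To make the kernel principal component regression view of Granger causality concrete, the sketch below compares the residuals of predicting a target series from its own past against predicting it from the joint past of target and candidate cause, using an RBF kernel and a truncated kernel PCA. It is an assumed, simplified rendition for illustration, not the authors' unified estimator or their GP score-based variant.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kpcr_residuals(X, y, n_components=5, gamma=1.0):
    """Regress y on the top kernel principal components of X; return residual SSE."""
    K = rbf_kernel(X, gamma)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)
    top = vecs[:, -n_components:] * np.sqrt(np.maximum(vals[-n_components:], 1e-12))
    beta, *_ = np.linalg.lstsq(top, y, rcond=None)
    return ((y - top @ beta) ** 2).sum()

rng = np.random.default_rng(1)
T, lag = 300, 2
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):                            # x nonlinearly drives y
    y[t] = 0.5 * y[t - 1] + np.tanh(x[t - 1]) + 0.1 * rng.standard_normal()

# Lagged design matrices: past of y alone vs. past of (y, x).
Y_past = np.column_stack([y[lag - k - 1:T - k - 1] for k in range(lag)])
X_past = np.column_stack([x[lag - k - 1:T - k - 1] for k in range(lag)])
target = y[lag:]

sse_restricted = kpcr_residuals(Y_past, target)
sse_full = kpcr_residuals(np.hstack([Y_past, X_past]), target)
print(f"restricted SSE={sse_restricted:.1f}  full SSE={sse_full:.1f}")
# A large drop in SSE when x's past is added suggests x Granger-causes y.
```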
[LG-4] Residual Power Flow for Neural Solvers
链接: https://arxiv.org/abs/2601.09533
作者: Jochen Stiasny,Jochen Cremer
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
[LG-5] CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
链接: https://arxiv.org/abs/2601.09512
作者: Ralf Römer,Yi Zhang,Angela P. Schoellig
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project page: this https URL . 9 pages, 5 figures
Abstract:To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected feedforward layers and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code and data are available at this https URL.
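The label-free routing described above can be approximated with one small autoencoder per learned task and a pick-the-lowest-reconstruction-error rule. The sketch below is an assumed toy router (feature dimensions and the untrained autoencoders are invented), not the CLARE implementation.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """One autoencoder per task; low reconstruction error means familiar inputs."""

    def __init__(self, dim: int = 32, bottleneck: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.dec = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.dec(self.enc(x))

def route(feature: torch.Tensor, autoencoders: list[TinyAutoencoder]) -> int:
    """Return the index of the adapter whose autoencoder reconstructs `feature` best."""
    with torch.no_grad():
        errors = [torch.mean((ae(feature) - feature) ** 2).item() for ae in autoencoders]
    return int(min(range(len(errors)), key=errors.__getitem__))

# In practice each autoencoder would be fit on features from one previously
# learned task; here they are left untrained purely to show the mechanics.
torch.manual_seed(0)
autoencoders = [TinyAutoencoder() for _ in range(3)]

# At deployment, an incoming observation embedding is routed without a task id.
observation = torch.randn(1, 32)
print("activate adapter:", route(observation, autoencoders))
```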
[LG-6] Parallelizable memory recurrent units
链接: https://arxiv.org/abs/2601.09495
作者: Florent De Geeter,Gaspard Lambrechts,Damien Ernst,Guillaume Drion
类目: Machine Learning (cs.LG)
*备注: 19 pages, 12 figures. This work has been the subject of a patent application (Number: EP26151077). This work has been submitted to the IEEE for possible publication
Abstract:With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the uprising of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of the RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
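The parallelization claim rests on a standard observation: a linear recurrence h_t = a_t * h_{t-1} + b_t is an associative composition of affine maps, so it can be evaluated with a parallel prefix scan instead of a sequential loop. The NumPy sketch below checks that equivalence with a Hillis-Steele style scan; it illustrates the scan mechanics only and does not implement the bistable BMRU dynamics.

```python
import numpy as np

def combine(prev, cur):
    """Compose affine updates: apply `prev` first, then `cur`, to a hidden state."""
    a1, b1 = prev
    a2, b2 = cur
    return a1 * a2, a2 * b1 + b2

def parallel_linear_scan(a, b):
    """Inclusive scan over h_t = a_t * h_{t-1} + b_t with h_0 = 0.

    Hillis-Steele recursive doubling: O(log T) sequential steps, each of which
    could be executed elementwise in parallel on an accelerator.
    """
    A, B = a.copy(), b.copy()
    T = len(a)
    d = 1
    while d < T:
        A_new, B_new = A.copy(), B.copy()
        A_new[d:], B_new[d:] = combine((A[:-d], B[:-d]), (A[d:], B[d:]))
        A, B = A_new, B_new
        d *= 2
    return B          # with h_0 = 0, the accumulated offset equals h_t

rng = np.random.default_rng(0)
T = 1000
a = rng.uniform(0.5, 0.99, size=T)   # decay/gate values
b = rng.standard_normal(T)           # inputs

# Reference: plain sequential recurrence.
h = np.zeros(T)
h[0] = b[0]
for t in range(1, T):
    h[t] = a[t] * h[t - 1] + b[t]

print(np.allclose(parallel_linear_scan(a, b), h))   # True
```

With gates strictly inside (0, 1) the cumulative products shrink toward zero, which is one way to see why purely linear recurrences struggle to retain information indefinitely, the limitation the multistable units above are designed to address.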
[LG-7] Deep Operator Networks for Surrogate Modeling of Cyclic Adsorption Processes with Varying Initial Conditions
链接: https://arxiv.org/abs/2601.09491
作者: Beatrice Ceccanti,Mattia Galanti,Ivo Roghair,Martin van Sint Annaland
类目: Machine Learning (cs.LG)
*备注: 36 pages, 11 figures
Abstract:Deep Operator Networks are emerging as fundamental tools among various neural network types to learn mappings between function spaces, and have recently gained attention due to their ability to approximate nonlinear operators. In particular, DeepONets offer a natural formulation for PDE solving, since the solution of a partial differential equation can be interpreted as an operator mapping an initial condition to its corresponding solution field. In this work, we applied DeepONets in the context of process modeling for adsorption technologies, to assess their feasibility as surrogates for cyclic adsorption process simulation and optimization. The goal is to accelerate convergence of cyclic processes such as Temperature-Vacuum Swing Adsorption (TVSA), which require repeated solution of transient PDEs, which are computationally expensive. Since each step of a cyclic adsorption process starts from the final state of the preceding step, effective surrogate modeling requires generalization across a wide range of initial conditions. The governing equations exhibit steep traveling fronts, providing a demanding benchmark for operator learning. To evaluate functional generalization under these conditions, we construct a mixed training dataset composed of heterogeneous initial conditions and train DeepONets to approximate the corresponding solution operators. The trained models are then tested on initial conditions outside the parameter ranges used during training, as well as on completely unseen functional forms. The results demonstrate accurate predictions both within and beyond the training distribution, highlighting DeepONets as potential efficient surrogates for accelerating cyclic adsorption simulations and optimization workflows.
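For context, a DeepONet pairs a branch network that encodes the input function (here, an initial condition sampled at fixed sensor points) with a trunk network that encodes the query coordinate, and combines them with an inner product. The sketch below is an assumed minimal PyTorch DeepONet with invented sensor counts and widths, not the adsorption surrogate from the paper.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """u(x) is approximated as sum_k branch_k(IC samples) * trunk_k(x)."""

    def __init__(self, n_sensors: int = 64, coord_dim: int = 2, p: int = 32):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, 128), nn.Tanh(),
                                    nn.Linear(128, p))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, 128), nn.Tanh(),
                                   nn.Linear(128, p))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, ic_samples, coords):
        # ic_samples: (batch, n_sensors) -- the initial condition at sensor points
        # coords:     (batch, n_points, coord_dim) -- e.g., (z, t) query locations
        b = self.branch(ic_samples)                  # (batch, p)
        t = self.trunk(coords)                       # (batch, n_points, p)
        return torch.einsum("bp,bnp->bn", b, t) + self.bias

model = DeepONet()
ic = torch.randn(4, 64)            # four different initial bed states
xy = torch.rand(4, 100, 2)         # 100 space-time query points per sample
print(model(ic, xy).shape)         # torch.Size([4, 100])
```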
[LG-8] Terminally constrained flow-based generative models from an optimal control perspective
链接: https://arxiv.org/abs/2601.09474
作者: Weiguo Gao,Ming Li,Qianxiao Li
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 59 pages, 9 figures
Abstract:We address the problem of sampling from terminally constrained distributions with pre-trained flow-based generative models through an optimal control formulation. Theoretically, we characterize the value function by a Hamilton-Jacobi-Bellman equation and derive the optimal feedback control as the minimizer of the associated Hamiltonian. We show that as the control penalty increases, the controlled process recovers the reference distribution, while as the penalty vanishes, the terminal law converges to a generalized Wasserstein projection onto the constraint manifold. Algorithmically, we introduce Terminal Optimal Control with Flow-based models (TOCFlow), a geometry-aware sampling-time guidance method for pre-trained flows. Solving the control problem in a terminal co-moving frame that tracks reference trajectories yields a closed-form scalar damping factor along the Riemannian gradient, capturing second-order curvature effects without matrix inversions. TOCFlow therefore matches the geometric consistency of Gauss-Newton updates at the computational cost of standard gradient guidance. We evaluate TOCFlow on three high-dimensional scientific tasks spanning equality, inequality, and global statistical constraints, namely Darcy flow, constrained trajectory planning, and turbulence snapshot generation with Kolmogorov spectral scaling. Across all settings, TOCFlow improves constraint satisfaction over Euclidean guidance and projection baselines while preserving the reference model’s generative quality.
[LG-9] DeepLight: A Sobolev-trained Image-to-Image Surrogate Model for Light Transport in Tissue
链接: https://arxiv.org/abs/2601.09439
作者: Philipp Haim,Vasilis Ntziachristos,Torsten Enßlin,Dominik Jüstel
类目: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注:
Abstract:In optoacoustic imaging, recovering the absorption coefficients of tissue by inverting the light transport remains a challenging problem. Improvements in solving this problem can greatly benefit the clinical value of optoacoustic imaging. Existing variational inversion methods require an accurate and differentiable model of this light transport. As neural surrogate models allow fast and differentiable simulations of complex physical processes, they are considered promising candidates to be used in solving such inverse problems. However, there are in general no guarantees that the derivatives of these surrogate models accurately match those of the underlying physical operator. As accurate derivatives are central to solving inverse problems, errors in the model derivative can considerably hinder high fidelity reconstructions. To overcome this limitation, we present a surrogate model for light transport in tissue that uses Sobolev training to improve the accuracy of the model derivatives. Additionally, the form of Sobolev training we used is suitable for high-dimensional models in general. Our results demonstrate that Sobolev training for a light transport surrogate model not only improves derivative accuracy but also reduces generalization error for in-distribution and out-of-distribution samples. These improvements promise to considerably enhance the utility of the surrogate model in downstream tasks, especially in solving inverse problems.
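Sobolev training amounts to penalising mismatches in both the surrogate's outputs and its input derivatives. The sketch below shows such a loss for a generic differentiable network using torch.autograd.grad; the data, targets, and reference gradients are assumed stand-ins, not the paper's light-transport simulator.

```python
import torch
import torch.nn as nn

surrogate = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

def sobolev_loss(x, y_true, dy_true, deriv_weight=1.0):
    """MSE on values plus MSE on input gradients (first-order Sobolev term)."""
    x = x.requires_grad_(True)
    y_pred = surrogate(x)
    # create_graph=True keeps the derivative differentiable so the derivative
    # mismatch itself can be backpropagated through.
    dy_pred, = torch.autograd.grad(y_pred.sum(), x, create_graph=True)
    value_term = torch.mean((y_pred - y_true) ** 2)
    deriv_term = torch.mean((dy_pred - dy_true) ** 2)
    return value_term + deriv_weight * deriv_term

# Assumed training data: inputs, reference outputs, and reference gradients
# (in practice these would come from the physics simulator and its adjoint).
x = torch.rand(256, 3)
y = torch.sin(x.sum(dim=1, keepdim=True))
dy = torch.cos(x.sum(dim=1, keepdim=True)).expand(-1, 3).contiguous()

for epoch in range(200):
    optimizer.zero_grad()
    loss = sobolev_loss(x.clone(), y, dy)
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.4f}")
```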
[LG-10] Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps
链接: https://arxiv.org/abs/2601.09428
作者: Siyi Li,Joseph G. Lambourne,Longfei Zhang,Pradeep Kumar Jayaraman,Karl D.D. Willis
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:
Abstract:We introduce a new method of generating Computer Aided Design (CAD) profiles via a sequence of simple geometric constructions including curve offsetting, rotations and intersections. These sequences start with geometry provided by a designer and build up the points and curves of the final profile step by step. We demonstrate that adding construction steps between the designer’s input geometry and the final profile improves generation quality in a similar way to the introduction of a chain of thought in language models. Similar to the constraints in a parametric CAD model, the construction sequences reduce the degrees of freedom in the modeled shape to a small set of parameter values which can be adjusted by the designer, allowing parametric editing with the constructed geometry evaluated to floating point precision. In addition we show that applying reinforcement learning to the construction sequences gives further improvements over a wide range of metrics, including some which were not explicitly optimized.
[LG-11] Preliminary Tests of the Anticipatory Classifier System with Hindsight Experience Replay
链接: https://arxiv.org/abs/2601.09400
作者: Olgierd Unold,Stanisław Franczyk
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces ACS2HER, a novel integration of the Anticipatory Classifier System (ACS2) with the Hindsight Experience Replay (HER) mechanism. While ACS2 is highly effective at building cognitive maps through latent learning, its performance often stagnates in environments characterized by sparse rewards. We propose a specific architectural variant that triggers hindsight learning when the agent fails to reach its primary goal, re-labeling visited states as virtual goals to densify the learning signal. The proposed model was evaluated on two benchmarks: the deterministic Maze 6 and the stochastic FrozenLake. The results demonstrate that ACS2HER significantly accelerates knowledge acquisition and environmental mastery compared to the standard ACS2. However, this efficiency gain is accompanied by increased computational overhead and a substantial expansion in classifier numerosity. This work provides the first analysis of combining anticipatory mechanisms with retrospective goal-relabeling in Learning Classifier Systems.
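Hindsight Experience Replay is orthogonal to the classifier system it is paired with here: when an episode misses its goal, the states actually reached are relabelled as virtual goals so the trajectory still produces a useful learning signal. The sketch below is an assumed generic "final-state" relabelling helper, not the ACS2HER integration.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple
    goal: tuple
    reward: float

def hindsight_relabel(episode: list[Transition]) -> list[Transition]:
    """'Final' HER strategy: pretend the state we actually reached was the goal."""
    if not episode:
        return []
    virtual_goal = episode[-1].next_state
    relabelled = []
    for tr in episode:
        reward = 1.0 if tr.next_state == virtual_goal else 0.0
        relabelled.append(Transition(tr.state, tr.action, tr.next_state,
                                     virtual_goal, reward))
    return relabelled

# A failed episode in a toy grid maze: the agent aimed for (3, 3) but stopped at (1, 2).
episode = [
    Transition((0, 0), 1, (0, 1), goal=(3, 3), reward=0.0),
    Transition((0, 1), 2, (1, 1), goal=(3, 3), reward=0.0),
    Transition((1, 1), 1, (1, 2), goal=(3, 3), reward=0.0),
]
for tr in hindsight_relabel(episode):
    print(tr)
```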
[LG-12] High-Performance Serverless Computing: A Systematic Literature Review on Serverless for HPC, AI and Big Data
链接: https://arxiv.org/abs/2601.09334
作者: Valerio Besozzi,Matteo Della Bartola,Patrizio Dazzi,Marco Danelutto
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:The widespread deployment of large-scale, compute-intensive applications such as high-performance computing, artificial intelligence, and big data is leading to convergence between cloud and high-performance computing infrastructures. Cloud providers are increasingly integrating high-performance computing capabilities in their infrastructures, such as hardware accelerators and high-speed interconnects, while researchers in the high-performance computing community are starting to explore cloud-native paradigms to improve scalability, elasticity, and resource utilization. In this context, serverless computing emerges as a promising execution model to efficiently handle highly dynamic, parallel, and distributed workloads. This paper presents a comprehensive systematic literature review of 122 research articles published between 2018 and early 2025, exploring the use of the serverless paradigm to develop, deploy, and orchestrate compute-intensive applications across cloud, high-performance computing, and hybrid environments. From these, a taxonomy comprising eight primary research directions and nine targeted use case domains is proposed, alongside an analysis of recent publication trends and collaboration networks among authors, highlighting the growing interest and interconnections within this emerging research field. Overall, this work aims to offer a valuable foundation for both new researchers and experienced practitioners, guiding the development of next-generation serverless solutions for parallel compute-intensive applications.
[LG-13] Single-Round Clustered Federated Learning via Data Collaboration Analysis for Non-IID Data
链接: https://arxiv.org/abs/2601.09304
作者: Sota Sugawara,Yuji Kawamata,Akihiro Toyoda,Tomoru Nakayama,Yukihiko Okada
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures
Abstract:Federated Learning (FL) enables distributed learning across multiple clients without sharing raw data. When statistical heterogeneity across clients is severe, Clustered Federated Learning (CFL) can improve performance by grouping similar clients and training cluster-wise models. However, most CFL approaches rely on multiple communication rounds for cluster estimation and model updates, which limits their practicality under tight constraints on communication rounds. We propose Data Collaboration-based Clustered Federated Learning (DC-CFL), a single-round framework that completes both client clustering and cluster-wise learning, using only the information shared in DC analysis. DC-CFL quantifies inter-client similarity via total variation distance between label distributions, estimates clusters using hierarchical clustering, and performs cluster-wise learning via DC analysis. Experiments on multiple open datasets under representative non-IID conditions show that DC-CFL achieves accuracy comparable to multi-round baselines while requiring only one communication round. These results indicate that DC-CFL is a practical alternative for collaborative AI model development when multiple communication rounds are impractical.
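Two ingredients of this pipeline are easy to show in isolation: the total variation distance between clients' label distributions and a single-shot hierarchical clustering over the resulting distance matrix. The sketch below uses invented label histograms and SciPy average linkage purely for illustration; the DC analysis itself is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def total_variation(p, q):
    """TV distance between two discrete label distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Assumed label distributions (10 classes) for six clients: two latent groups.
rng = np.random.default_rng(0)
skew_a = np.array([5, 5, 5, 5, 1, 1, 1, 1, 1, 1], dtype=float)
skew_b = np.array([1, 1, 1, 1, 1, 1, 5, 5, 5, 5], dtype=float)
clients = [rng.dirichlet(skew_a) for _ in range(3)] + \
          [rng.dirichlet(skew_b) for _ in range(3)]

n = len(clients)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = total_variation(clients[i], clients[j])

# Hierarchical clustering on the condensed distance matrix, cut into 2 clusters.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print("estimated clusters:", labels)
```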
[LG-14] Explainable Autoencoder-Based Anomaly Detection in IEC 61850 GOOSE Networks
链接: https://arxiv.org/abs/2601.09287
作者: Dafne Lozano-Paredes,Luis Bote-Curiel,Juan Ramón Feijóo-Martínez,Ismael Gómez-Talal,José Luis Rojo-Álvarez
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:The IEC 61850 Generic Object-Oriented Substation Event (GOOSE) protocol plays a critical role in real-time protection and automation of digital substations, yet its lack of native security mechanisms can expose power systems to sophisticated cyberattacks. Traditional rule-based and supervised intrusion detection techniques struggle to detect protocol-compliant and zero-day attacks under significant class imbalance and limited availability of labeled data. This paper proposes an explainable, unsupervised multi-view anomaly detection framework for IEC 61850 GOOSE networks that explicitly separates semantic integrity and temporal availability. The approach employs asymmetric autoencoders trained only on real operational GOOSE traffic to learn distinct latent representations of sequence-based protocol semantics and timing-related transmission dynamics in normal traffic. Anomaly detection is implemented using reconstruction errors mixed with statistically grounded thresholds, enabling robust detection without specified attack types. Feature-level reconstruction analysis provides intrinsic explainability by directly linking detection outcomes to IEC 61850 protocol characteristics. The proposed framework is evaluated using real substation traffic for training and a public dataset containing normal traffic and message suppression, data manipulation, and denial-of-service attacks for testing. Experimental results show attack detection rates above 99% with false positives remaining below 5% of total traffic, demonstrating strong generalization across environments and effective operation under extreme class imbalance and interpretable anomaly attribution.
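The detection rule described above, training an autoencoder on normal traffic only and flagging messages whose reconstruction error exceeds a statistically chosen threshold, can be sketched generically. The code below uses assumed synthetic feature vectors and a toy autoencoder, not the paper's multi-view GOOSE pipeline.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed pre-extracted numeric features per GOOSE message (e.g., sequence
# counters, inter-arrival times), already normalised.
normal = torch.randn(2000, 12)
test = torch.cat([torch.randn(100, 12), torch.randn(20, 12) * 3 + 4])  # last 20 are anomalies

model = nn.Sequential(nn.Linear(12, 4), nn.ReLU(), nn.Linear(4, 12))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(300):                     # train on normal traffic only
    opt.zero_grad()
    loss = ((model(normal) - normal) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    train_err = ((model(normal) - normal) ** 2).mean(dim=1)
    threshold = torch.quantile(train_err, 0.99)       # statistically grounded cut
    test_err = ((model(test) - test) ** 2).mean(dim=1)
    flags = test_err > threshold

print(f"threshold={threshold.item():.3f}, flagged {int(flags.sum())} of {len(test)} messages")
```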
[LG-15] Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction
链接: https://arxiv.org/abs/2601.09285
作者: Mianzhi Pan,JianFei Li,Peishuo Liu,Botian Wang,Yawen Ouyang,Yiming Rong,Hao Zhou,Jianbing Zhang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:Metal-organic frameworks (MOFs) are porous crystalline materials with broad applications such as carbon capture and drug delivery, yet accurately predicting their 3D structures remains a significant challenge. While Large Language Models (LLMs) have shown promise in generating crystals, their application to MOFs is hindered by MOFs’ high atomic complexity. Inspired by the success of block-wise paradigms in deep generative models, we pioneer the use of LLMs in this domain by introducing MOF-LLM, the first LLM framework specifically adapted for block-level MOF structure prediction. To effectively harness LLMs for this modular assembly task, our training paradigm integrates spatial-aware continual pre-training (CPT), structural supervised fine-tuning (SFT), and matching-driven reinforcement learning (RL). By incorporating explicit spatial priors and optimizing structural stability via Soft Adaptive Policy Optimization (SAPO), our approach substantially enhances the spatial reasoning capability of a Qwen-3 8B model for accurate MOF structure prediction. Comprehensive experiments demonstrate that MOF-LLM outperforms state-of-the-art denoising-based and LLM-based methods while exhibiting superior sampling efficiency.
[LG-16] Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability
链接: https://arxiv.org/abs/2601.09261
作者: Zhipeng Zhang,Zhenjie Yao,Kai Li,Lei Yang
类目: Machine Learning (cs.LG)
*备注: 23 pages, 7 figures. Preprint
Abstract:Learning under unobservable feedback reliability poses a distinct challenge beyond optimization robustness: a system must decide whether to learn from an experience, not only how to learn stably. We study this setting as Epistemic Identifiability under Unobservable Reliability (EIUR), where each experience has a latent credibility, reliable and unreliable feedback can be locally indistinguishable, and data are generated in a closed loop by the learner’s own evolving beliefs and actions. In EIUR, standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs. We propose metacognitive regulation as a practical response: a second, introspective control loop that infers experience credibility from endogenous evidence in the learner’s internal dynamics. We formalize this as a modular Monitor-Trust-Regulator (MTR) decomposition and instantiate it with self-diagnosis, which maintains a slowly varying experience-trust variable that softly modulates learning updates, without exogenous reliability labels or an explicit corruption model. Empirically, in the EIUR regimes studied here, self-diagnosis is associated with improved epistemic identifiability. In reinforcement learning, it enables calibrated skepticism and recovery under systematically corrupted rewards. In supervised learning, it exposes a critical dissociation: performance recovery does not imply epistemic recovery. Accuracy can rebound while internal belief dynamics remain locked-in by early misleading data, a failure detectable only through introspective diagnostics. Together, MTR and self-diagnosis provide an organizing abstraction and a concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability.
[LG-17] LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
链接: https://arxiv.org/abs/2601.09258
作者: Du Yin,Jiayi Ren,Xiayu Sun,Tianyao Zhou,Haizhu Zhou,Ruiyan Ma,Danyang Zhang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注: 12 pages, 6 figures
Abstract:LLM inference latency critically determines user experience and operational costs, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments featuring diverse software frameworks and XPU architectures combined with dynamic workloads make latency analysis challenging. Constrained by intrusive designs that necessitate service restarts or even suspension, and by hardware-bound implementations that fail to adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis. We present LatencyPrism, the first zero-intrusion multi-platform latency sculpting system. It aims to break down the inference latency across pipeline, proactively alert on inference latency anomalies, and guarantee adherence to SLOs, all without requiring code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months. It enables low-overhead real-time monitoring at batch level with alerts triggered in milliseconds. This approach distinguishes between workload-driven latency variations and anomalies indicating underlying issues with an F1-score of 0.98. We also conduct extensive experiments and investigations into root cause analysis to demonstrate LatencyPrism’s capability.
[LG-18] XLinear: A Lightweight and Accurate MLP-Based Model for Long-Term Time Series Forecasting with Exogenous Inputs AAAI2026
链接: https://arxiv.org/abs/2601.09237
作者: Xinyang Chen,Huidong Jin,Yu Huang,Zaiwen Feng
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2026
Abstract:Despite the prevalent assumption of uniform variable importance in long-term time series forecasting models, real world applications often exhibit asymmetric causal relationships and varying data acquisition costs. Specifically, cost-effective exogenous data (e.g., local weather) can unilaterally influence dynamics of endogenous variables, such as lake surface temperature. Exploiting these links enables more effective forecasts when exogenous inputs are readily available. Transformer-based models capture long-range dependencies but incur high computation and suffer from permutation invariance. Patch-based variants improve efficiency yet can miss local temporal patterns. To efficiently exploit informative signals across both the temporal dimension and relevant exogenous variables, this study proposes XLinear, a lightweight time series forecasting model built upon MultiLayer Perceptrons (MLPs). XLinear uses a global token derived from an endogenous variable as a pivotal hub for interacting with exogenous variables, and employs MLPs with sigmoid activation to extract both temporal patterns and variate-wise dependencies. Its prediction head then integrates these signals to forecast the endogenous series. We evaluate XLinear on seven standard benchmarks and five real-world datasets with exogenous inputs. Compared with state-of-the-art models, XLinear delivers superior accuracy and efficiency for both multivariate forecasts and univariate forecasts influenced by exogenous inputs.
[LG-19] From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences
链接: https://arxiv.org/abs/2601.09220
作者: Xinzi Tan,Kejian Zhang,Junhan Yu,Doudou Zhou
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
*备注:
Abstract:Marked Temporal Point Processes (MTPPs) arise naturally in medical, social, commercial, and financial domains. However, existing Transformer-based methods mostly inject temporal information only via positional encodings, relying on shared or parametric decay structures, which limits their ability to capture heterogeneous and type-specific temporal effects. Inspired by this observation, we derive a novel attention operator called Hawkes Attention from the multivariate Hawkes process theory for MTPP, using learnable per-type neural kernels to modulate query, key and value projections, thereby replacing the corresponding parts in the traditional attention. Benefiting from this design, Hawkes Attention unifies event timing and content interaction, learning both the time-relevant behavior and type-specific excitation patterns from the data. The experimental results show that our method achieves better performance compared to the baselines. In addition to the general MTPP, our attention mechanism can also be easily applied to specific temporal structures, such as time series forecasting.
[LG-20] D2Prune: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
链接: https://arxiv.org/abs/2601.09176
作者: Lang Xiong,Ning Liu,Ao Ren,Yuheng Bai,Haining Fang,BinYan Zhang,Zhe Jiang,Yujuan Tan,Duo Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) They neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) They overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, D^2Prune. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to precise pruning mask selection and weight updating and facilitating error minimization during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that D^2Prune consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.
[LG-21] BalDRO: A Distributionally Robust Optimization based Framework for Large Language Model Unlearning
链接: https://arxiv.org/abs/2601.09172
作者: Pengyang Shao,Naixin Zhai,Lei Chen,Yonghui Yang,Fengbin Zhu,Xun Yang,Meng Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:As Large Language Models (LLMs) increasingly shape online content, removing targeted information from well-trained LLMs (also known as LLM unlearning) has become critical for web governance. A key challenge lies in sample-wise imbalance within the forget set: different samples exhibit widely varying unlearning difficulty, leading to asynchronous forgetting where some knowledge remains insufficiently erased while others become over-forgotten. To address this, we propose BalDRO, a novel and efficient framework for balanced LLM unlearning. BalDRO formulates unlearning as a min-sup process: an inner step identifies a worst-case data distribution that emphasizes hard-to-unlearn samples, while an outer step updates model parameters under this distribution. We instantiate BalDRO via two efficient variants: BalDRO-G, a discrete GroupDRO-based approximation focusing on high-loss subsets, and BalDRO-DV, a continuous Donsker-Varadhan dual method enabling smooth adaptive weighting within standard training pipelines. Experiments on TOFU and MUSE show that BalDRO significantly improves both forgetting quality and model utility over existing methods, and we release code for reproducibility.
[LG-22] DP-FEDSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix ICML2026
链接: https://arxiv.org/abs/2601.09166
作者: Sidhant R. Nair,Tanmay Sen,Mrinmay Sen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 17 pages, 1 figure. Submitted to ICML 2026
Abstract:Differentially private federated learning (DP-FL) suffers from slow convergence under tight privacy budgets due to the overwhelming noise introduced to preserve privacy. While adaptive optimizers can accelerate convergence, existing second-order methods such as DP-FedNew require O(d^2) memory at each client to maintain local feature covariance matrices, making them impractical for high-dimensional models. We propose DP-FedSOFIM, a server-side second-order optimization framework that leverages the Fisher Information Matrix (FIM) as a natural gradient preconditioner while requiring only O(d) memory per client. By employing the Sherman-Morrison formula for efficient matrix inversion, DP-FedSOFIM achieves O(d) computational complexity per round while maintaining the convergence benefits of second-order methods. Our analysis proves that the server-side preconditioning preserves (epsilon, delta)-differential privacy through the post-processing theorem. Empirical evaluation on CIFAR-10 demonstrates that DP-FedSOFIM achieves superior test accuracy compared to first-order baselines across multiple privacy regimes.
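The efficiency argument leans on the Sherman-Morrison identity, which updates the inverse of a regularised Fisher information matrix after a rank-one outer-product without a fresh matrix inversion. The sketch below simply verifies that identity on a small example and applies the result as a natural-gradient style preconditioner; it is an illustration of the linear-algebra step, not the federated algorithm.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, via the Sherman-Morrison formula."""
    Au = A_inv @ u
    vA = v @ A_inv
    denom = 1.0 + v @ Au
    return A_inv - np.outer(Au, vA) / denom

rng = np.random.default_rng(0)
d = 6
lam = 0.1                                  # regularisation strength
g = rng.standard_normal(d)                 # a stochastic gradient sample

# Regularised empirical Fisher: lambda * I + g g^T (rank-one update of a
# diagonal matrix, so its inverse starts from (1/lambda) * I).
A_inv = np.eye(d) / lam
F_inv = sherman_morrison_update(A_inv, g, g)

direct = np.linalg.inv(lam * np.eye(d) + np.outer(g, g))
print(np.allclose(F_inv, direct))          # True

# Natural-gradient style preconditioning of an update direction.
update = F_inv @ g
```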
[LG-23] Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation
链接: https://arxiv.org/abs/2601.09165
作者: Aaron R. Flouro,Shawn P. Chadwick
类目: Machine Learning (cs.LG)
*备注: 7 pages, 1 table
Abstract:Building on the probability-domain distillation framework of Sparse-KD, we develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators, encompassing convexity, positivity, continuity, weight monotonicity, and temperature coherence. We prove the existence and non-uniqueness of operator families satisfying these axioms, establishing that multiple distinct aggregation mechanisms conform to the same foundational principles. Within this framework, we establish operator-agnostic guarantees showing that multi-teacher aggregation reduces both stochastic variance and systematic supervisory bias under heterogeneous teachers, while providing Jensen-type bounds, log-loss guarantees, and safety attenuation properties. For aggregation operators linear in teacher weights, we further establish classical ensemble variance-reduction results under standard independence assumptions, with extensions to correlated-error regimes. The framework provides theoretical grounding for multi-teacher distillation from diverse frontier models while admitting multiple valid implementation strategies.
[LG-24] Efficient Clustering in Stochastic Bandits
链接: https://arxiv.org/abs/2601.09162
作者: G Dhinesh Chandran,Kota Srinivas Reddy,Srikrishna Bhashyam
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the Bandit Clustering (BC) problem under the fixed confidence setting, where the objective is to group a collection of data sequences (arms) into clusters through sequential sampling from adaptively selected arms at each time step while ensuring a fixed error probability at the stopping time. We consider a setting where arms in a cluster may have different distributions. Unlike existing results in this setting, which assume Gaussian-distributed arms, we study a broader class of vector-parametric distributions that satisfy mild regularity conditions. Existing asymptotically optimal BC algorithms require solving an optimization problem as part of their sampling rule at each step, which is computationally costly. We propose an Efficient Bandit Clustering algorithm (EBC), which, instead of solving the full optimization problem, takes a single step toward the optimal value at each time step, making it computationally efficient while remaining asymptotically optimal. We also propose a heuristic variant of EBC, called EBC-H, which further simplifies the sampling rule, with arm selection based on quantities computed as part of the stopping rule. We highlight the computational efficiency of EBC and EBC-H by comparing their per-sample run time with that of existing algorithms. The asymptotic optimality of EBC is supported through simulations on the synthetic datasets. Through simulations on both synthetic and real-world datasets, we show the performance gain of EBC and EBC-H over existing approaches.
[LG-25] Deep Learning-based Binary Analysis for Vulnerability Detection in x86-64 Machine Code
链接: https://arxiv.org/abs/2601.09157
作者: Mitchell Petingola
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:While much of the current research in deep learning-based vulnerability detection relies on disassembled binaries, this paper explores the feasibility of extracting features directly from raw x86-64 machine code. Although assembly language is more interpretable for humans, it requires more complex models to capture token-level context. In contrast, machine code may enable more efficient, lightweight models and preserve all information that might be lost in disassembly. This paper approaches the task of vulnerability detection through an exploratory study on two specific deep learning model architectures and aims to systematically evaluate their performance across three vulnerability types. The results demonstrate that graph-based models consistently outperform sequential models, emphasizing the importance of control flow relationships, and that machine code contains sufficient information for effective vulnerability discovery.
[LG-26] Interpretable Probability Estimation with LLMs via Shapley Reconstruction
链接: https://arxiv.org/abs/2601.09151
作者: Yang Nan,Qihao Wen,Jiahao Wang,Pengfei He,Ravi Tandon,Yong Ge,Han Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying predicting process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM’s prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we demonstrate PRISM improves predictive accuracy over direct prompting and other baselines, across multiple domains including finance, healthcare, and agriculture. Beyond performance, PRISM provides a transparent prediction pipeline: our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems.
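The reconstruction in PRISM builds on standard Shapley values over input factors. The sketch below computes exact Shapley values for a small, assumed value function that stands in for "query the LLM with this subset of factors"; factor names and effect sizes are invented, and the prompting pipeline itself is not shown.

```python
from itertools import combinations
from math import comb

FACTORS = ["rainfall", "soil_quality", "seed_variety"]

def estimate(subset: frozenset) -> float:
    """Stand-in for an LLM probability estimate given only these factors."""
    base = 0.30
    effects = {"rainfall": 0.20, "soil_quality": 0.10, "seed_variety": 0.05}
    return base + sum(effects[f] for f in subset)

def shapley_values(factors, value_fn):
    n = len(factors)
    phi = {f: 0.0 for f in factors}
    for f in factors:
        others = [g for g in factors if g != f]
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = 1.0 / (n * comb(n - 1, k))
                phi[f] += weight * (value_fn(s | {f}) - value_fn(s))
    return phi

phi = shapley_values(FACTORS, estimate)
baseline = estimate(frozenset())
print(phi)
# Efficiency property: the baseline plus the summed contributions reconstructs
# the full-information estimate.
print(baseline + sum(phi.values()), estimate(frozenset(FACTORS)))
```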
[LG-27] Discrete Solution Operator Learning for Geometry-Dependent PDEs
链接: https://arxiv.org/abs/2601.09143
作者: Jinshuai Bai,Haolin Li,Zahra Sharif Khodaei,M. H. Aliabadi,YuanTong Gu,Xi-Qiao Feng
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: 15 pages main text, 40 pages SI
Abstract:Neural operator learning accelerates PDE solution by approximating operators as mappings between continuous function spaces. Yet in many engineering settings, varying geometry induces discrete structural changes, including topological changes, abrupt changes in boundary conditions or boundary types, and changes in the effective computational domain, which break the smooth-variation premise. Here we introduce Discrete Solution Operator Learning (DiSOL), a complementary paradigm that learns discrete solution procedures rather than continuous function-space operators. DiSOL factorizes the solver into learnable stages that mirror classical discretizations: local contribution encoding, multiscale assembly, and implicit solution reconstruction on an embedded grid, thereby preserving procedure-level consistency while adapting to geometry-dependent discrete structures. Across geometry-dependent Poisson, advection-diffusion, linear elasticity, as well as spatiotemporal heat-conduction problems, DiSOL produces stable and accurate predictions under both in-distribution and strongly out-of-distribution geometries, including discontinuous boundaries and topological changes. These results highlight the need for procedural operator representations in geometry-dominated regimes and position discrete solution operator learning as a distinct, complementary direction in scientific machine learning.
[LG-28] A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication
链接: https://arxiv.org/abs/2601.09114
作者: Yufan Xia,Marco De La Pierre,Amanda S. Barnard,Giuseppe Maria Junior Barca
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Abstract:The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on the fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS for GEMM workloads with a memory footprint of up to 100 MB.
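The ADSALA idea, learning from measured runtimes which thread count to launch for a given GEMM shape, can be emulated with an off-the-shelf classifier. The sketch below trains a random forest on assumed (m, n, k, best-thread-count) records and queries it before a GEMM call; the synthetic labelling rule is invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Assumed offline training data: matrix dimensions and the thread count that
# gave the lowest measured runtime on the target node (labels are synthetic here).
dims = rng.integers(32, 4096, size=(2000, 3))          # (m, n, k)
flops = 2.0 * dims.prod(axis=1)
best_threads = np.select(
    [flops < 1e8, flops < 1e10, flops < 1e11],
    [1, 8, 16],
    default=32,
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(np.log2(dims), best_threads)

# On-the-fly query before launching a multi-threaded GEMM call.
query = np.log2(np.array([[1024, 1024, 512]]))
print("suggested thread count:", int(model.predict(query)[0]))
```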
[LG-29] Enhancing Imbalanced Electrocardiogram Classification: A Novel Approach Integrating Data Augmentation through Wavelet Transform and Interclass Fusion
链接: https://arxiv.org/abs/2601.09103
作者: Haijian Shao,Wei Liu,Xing Deng,Daze Lu
类目: Machine Learning (cs.LG)
*备注: 18 pages, 9 figures, 3 tables, 1 algorithm
Abstract:Imbalanced electrocardiogram (ECG) data hampers the efficacy and resilience of algorithms in the automated processing and interpretation of cardiovascular diagnostic information, which in turn impedes deep learning-based ECG classification. Notably, certain cardiac conditions that are infrequently encountered are disproportionately underrepresented in these datasets. Although algorithmic generation and oversampling of specific ECG signal types can mitigate class skew, there is a lack of consensus regarding the effectiveness of such techniques in ECG classification. Furthermore, the methodologies and scenarios of ECG acquisition introduce noise, further complicating the processing of ECG data. This paper presents a significantly enhanced ECG classifier that simultaneously addresses both class imbalance and noise-related challenges in ECG analysis, as observed in the CPSC 2018 dataset. Specifically, we propose the application of feature fusion based on the wavelet transform, with a focus on wavelet transform-based interclass fusion, to generate the training feature library and the test set feature library. Subsequently, the original training and test data are amalgamated with their respective feature databases, resulting in more balanced training and test datasets. Employing this approach, our ECG model achieves recognition accuracies of up to 99%, 98%, 97%, 98%, 96%, 92%, and 93% for Normal, AF, I-AVB, LBBB, RBBB, PAC, PVC, STD, and STE, respectively. Furthermore, the average recognition accuracy for these categories ranges between 92% and 98%. Notably, our proposed data fusion methodology surpasses any known algorithms in terms of ECG classification accuracy in the CPSC 2018 dataset.
[LG-30] Comparative Assessment of Concrete Compressive Strength Prediction at Industry Scale Using Embedding-based Neural Networks Transformers and Traditional Machine Learning Approaches
链接: https://arxiv.org/abs/2601.09096
作者: Md Asiful Islam,Md Ahmed Al Muzaddid,Afia Jahin Prema,Sreenath Reddy Vuske
类目: Machine Learning (cs.LG)
*备注:
Abstract:Concrete is the most widely used construction material worldwide; however, reliable prediction of compressive strength remains challenging due to material heterogeneity, variable mix proportions, and sensitivity to field and environmental conditions. Recent advances in artificial intelligence enable data-driven modeling frameworks capable of supporting automated decision-making in construction quality control. This study leverages an industry-scale dataset consisting of approximately 70,000 compressive strength test records to evaluate and compare multiple predictive approaches, including linear regression, decision trees, random forests, transformer-based neural networks, and embedding-based neural networks. The models incorporate key mixture design and placement variables such as water cement ratio, cementitious material content, slump, air content, temperature, and placement conditions. Results indicate that the embedding-based neural network consistently outperforms traditional machine learning and transformer-based models, achieving a mean 28-day prediction error of approximately 2.5%. This level of accuracy is comparable to routine laboratory testing variability, demonstrating the potential of embedding-based learning frameworks to enable automated, data-driven quality control and decision support in large-scale construction operations.
[LG-31] Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling
链接: https://arxiv.org/abs/2601.09093
作者: Zhixiang Liang,Beichen Huang,Zheng Wang,Minjia Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: this https URL
[LG-32] SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache
链接: https://arxiv.org/abs/2601.09083
作者: Chi-Chih Chang,Siqi Zhu,Zhichen Zeng,Haibin Lin,Jiaxuan You,Mohamed S. Abdelfattah,Ziheng Jiang,Xuehai Qian
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as the draft model for performing speculative decoding. To keep the cache fresh and improve draft model quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (e.g., PPO, GRPO and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock time speedup during rollout.
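The cache at the heart of SRT can be thought of as a per-prompt token trie: past rollouts populate it, and at generation time the most popular matching path serves as a cheap draft for speculative decoding. The sketch below is an assumed minimal trie with insert and draft operations; draft verification against the current policy and cache refreshing are omitted.

```python
class TokenTrie:
    """Per-prompt cache of previously generated continuations."""

    def __init__(self):
        self.children: dict[int, "TokenTrie"] = {}
        self.count = 0                      # how many cached rollouts passed here

    def insert(self, tokens: list[int]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, TokenTrie())
            node.count += 1

    def draft(self, prefix: list[int], max_len: int = 8) -> list[int]:
        """Follow `prefix`, then greedily extend along the most popular branch."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []                   # cache miss: no draft available
            node = node.children[tok]
        proposal = []
        while node.children and len(proposal) < max_len:
            tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
            proposal.append(tok)
        return proposal

cache = TokenTrie()
cache.insert([5, 11, 42, 7, 9])      # rollouts from earlier training steps
cache.insert([5, 11, 42, 7, 13])
cache.insert([5, 11, 8])

# New rollout for the same prompt: use the cache as the draft model.
print(cache.draft([5, 11]))           # [42, 7, 9] (to be verified by the policy)
```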
[LG-33] Lean Clients Full Accuracy: Hybrid Zeroth- and First-Order Split Federated Learning
链接: https://arxiv.org/abs/2601.09076
作者: Zhoubin Kou,Zihan Chen,Jing Yang,Cong Shen
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注:
Abstract:Split Federated Learning (SFL) enables collaborative training between resource-constrained edge devices and a compute-rich server. Communication overhead is a central issue in SFL and can be mitigated with auxiliary networks. Yet, the fundamental client-side computation challenge remains, as back-propagation requires substantial memory and computation costs, severely limiting the scale of models that edge devices can support. To enable more resource-efficient client computation and reduce the client-server communication, we propose HERON-SFL, a novel hybrid optimization framework that integrates zeroth-order (ZO) optimization for local client training while retaining first-order (FO) optimization on the server. With the assistance of auxiliary networks, ZO updates enable clients to approximate local gradients using perturbed forward-only evaluations per step, eliminating memory-intensive activation caching and avoiding explicit gradient computation in the traditional training process. Leveraging the low effective rank assumption, we theoretically prove that HERON-SFL’s convergence rate is independent of model dimensionality, addressing a key scalability concern common to ZO algorithms. Empirically, on ResNet training and language model (LM) fine-tuning tasks, HERON-SFL matches benchmark accuracy while reducing client peak memory by up to 64% and client-side compute cost by up to 33% per step, substantially expanding the range of models that can be trained or adapted on resource-limited devices.
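The client-side idea, estimating gradients from forward-only evaluations, can be illustrated with a standard two-point zeroth-order estimator on a toy cut-layer loss. The network, loss, and hyperparameters below are assumptions for the sketch, not HERON-SFL itself.

```python
# Illustrative two-point zeroth-order (ZO) gradient estimate of the kind a
# HERON-SFL client could use: only forward evaluations, no backprop or
# activation caching. The toy client network and auxiliary loss are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def client_forward(theta, x):
    """Toy client-side sub-network: a single linear layer with tanh."""
    return np.tanh(x @ theta)

def auxiliary_loss(smashed, y):
    """Stand-in for the auxiliary-network loss computed from the cut-layer output."""
    return float(np.mean((smashed - y) ** 2))

def zo_gradient(theta, x, y, mu=1e-3, n_dirs=8):
    """Average of two-point ZO estimates over random Gaussian directions."""
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.normal(size=theta.shape)
        lp = auxiliary_loss(client_forward(theta + mu * u, x), y)
        lm = auxiliary_loss(client_forward(theta - mu * u, x), y)
        grad += (lp - lm) / (2 * mu) * u
    return grad / n_dirs

theta = rng.normal(size=(4, 3)) * 0.1
x, y = rng.normal(size=(16, 4)), rng.normal(size=(16, 3)) * 0.1
for step in range(50):                       # forward-only local training loop
    theta -= 0.05 * zo_gradient(theta, x, y)
print("final loss:", auxiliary_loss(client_forward(theta, x), y))
```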
[LG-34] Resolving Predictive Multiplicity for the Rashomon Set
链接: https://arxiv.org/abs/2601.09071
作者: Parian Haghighat,Hadis Anahideh,Cynthia Rudin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The existence of multiple, equally accurate models for a given predictive task leads to predictive multiplicity, where a "Rashomon set" of models achieves similar accuracy but diverges in its individual predictions. This inconsistency undermines trust in high-stakes applications where we want consistent predictions. We propose three approaches to reduce inconsistency among predictions for the members of the Rashomon set. The first approach is outlier correction. An outlier has a label that none of the good models are capable of predicting correctly. Outliers can cause the Rashomon set to have high-variance predictions in a local area, so fixing them can lower variance. Our second approach is local patching. In a local region around a test point, models may disagree with each other because some of them are biased. We can detect and fix such biases using a validation set, which also reduces multiplicity. Our third approach is pairwise reconciliation, where we find pairs of models that disagree on a region around the test point. We modify predictions that disagree, making them less biased. These three approaches can be used together or separately, and they each have distinct advantages. The reconciled predictions can then be distilled into a single interpretable model for real-world deployment. In experiments across multiple datasets, our methods reduce disagreement metrics while maintaining competitive accuracy.
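A toy version of the pairwise-reconciliation step might look like the following: wherever two members of the Rashomon set disagree on a label, their predicted probabilities are averaged, shrinking the disagreement metric. The simulated predictions and the plain averaging rule are illustrative simplifications of the paper's procedure.

```python
# Toy sketch of pairwise reconciliation across a Rashomon set: for models that
# disagree on a point, nudge their predictions toward each other.
import numpy as np

rng = np.random.default_rng(2)
n_models, n_points = 5, 200
# Probabilistic predictions from equally accurate models (simulated).
preds = np.clip(rng.normal(0.5, 0.2, size=(n_models, n_points)), 0, 1)

def disagreement(p):
    """Fraction of points where at least one pair of models disagrees on the 0/1 label."""
    labels = (p > 0.5).astype(int)
    return float(np.mean(labels.max(axis=0) != labels.min(axis=0)))

print("disagreement before:", disagreement(preds))
for _ in range(10):                          # repeated pairwise reconciliation passes
    for i in range(n_models):
        for j in range(i + 1, n_models):
            mask = (preds[i] > 0.5) != (preds[j] > 0.5)   # points where the pair disagrees
            mean = (preds[i, mask] + preds[j, mask]) / 2
            preds[i, mask] = mean                          # move both models
            preds[j, mask] = mean                          # toward the pairwise consensus
print("disagreement after: ", disagreement(preds))
```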
[LG-35] Deep Incomplete Multi-View Clustering via Hierarchical Imputation and Alignment AAAI2026
链接: https://arxiv.org/abs/2601.09051
作者: Yiming Du,Ziyu Wang,Jian Li,Rui Ning,Lusi Li
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2026
Abstract:Incomplete multi-view clustering (IMVC) aims to discover shared cluster structures from multi-view data with partial observations. The core challenges lie in accurately imputing missing views without introducing bias, while maintaining semantic consistency across views and compactness within clusters. To address these challenges, we propose DIMVC-HIA, a novel deep IMVC framework that integrates hierarchical imputation and alignment with four key components: (1) view-specific autoencoders for latent feature extraction, coupled with a view-shared clustering predictor to produce soft cluster assignments; (2) a hierarchical imputation module that first estimates missing cluster assignments based on cross-view contrastive similarity, and then reconstructs missing features using intra-view, intra-cluster statistics; (3) an energy-based semantic alignment module, which promotes intra-cluster compactness by minimizing energy variance around low-energy cluster anchors; and (4) a contrastive assignment alignment module, which enhances cross-view consistency and encourages confident, well-separated cluster predictions. Experiments on benchmarks demonstrate that our framework achieves superior performance under varying levels of missingness.
[LG-36] SCaLE: Switching Cost aware Learning and Exploration
链接: https://arxiv.org/abs/2601.09042
作者: Neelkamal Bhuyan,Debankur Mukherjee,Adam Wierman
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
*备注: 42 pages
Abstract:This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and \ell_2-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the first algorithm, SCaLE, that provably achieves a distribution-agnostic sub-linear dynamic regret, without knowledge of the hitting cost structure. En route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments against online-learning baselines corroborate our claims and highlight the statistical consistency of our algorithm.
[LG-37] Layer-Parallel Training for Transformers
链接: https://arxiv.org/abs/2601.09026
作者: Shuai Jiang,Marc Salvado,Eric C. Cyr,Alena Kopaničáková,Rolf Krause,Jacob B. Schroder
类目: Machine Learning (cs.LG)
*备注: 20 pages, 12 figures
Abstract:We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn slows convergence near the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results on BERT, GPT2, ViT, and machine translation architectures demonstrate parallel acceleration as well as accuracy commensurate with serial pre-training, while fine-tuning is unaffected.
[LG-38] Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers ICML
链接: https://arxiv.org/abs/2601.09000
作者: Annalisa Belloni,Lorenzo Noci,Antonio Orvieto
类目: Machine Learning (cs.LG)
*备注: Accepted at the 2025 HiLD and MOSS Workshops at ICML
Abstract:The Warmup Stable Decay (WSD) learning rate scheduler has recently become popular, largely due to its good performance and flexibility when training large language models. It remains an open question whether the remarkable performance of WSD - using a decaying learning rate for only a fraction of training compared to cosine decay - is a phenomenon specific to transformer-based language models that can potentially offer new theoretical insights into their training dynamics. Inspired by the usage of learning rate schedulers as a new lens into understanding landscape geometry (e.g., river valley, connected minima, progressive sharpening), in this work we compare the WSD path of the Adam optimizer on a Pythia-like language model to that of a small CNN trained to classify CIFAR10 images. We observe most training signals, optimizer path features, and sharpness dynamics to be qualitatively similar in such architectures. This consistency points to shared geometric characteristics of the loss landscapes of old and new nonconvex problems, and hints to future research questions around the geometry of high dimensional optimization problems.
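For reference, a minimal WSD schedule looks like the sketch below: linear warmup, a long constant plateau, and a short final decay. The phase fractions and the linear decay shape are assumptions; practical implementations vary.

```python
# Warmup-Stable-Decay (WSD) learning rate schedule as a small reference
# implementation; phase lengths and the linear decay shape are assumptions.
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.02, decay_frac=0.1):
    """Linear warmup, constant plateau, then linear decay over the final fraction."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                    # warmup phase
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                     # stable (constant) phase
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)  # decay phase

total = 1000
for s in [0, 10, 500, 950, 999]:
    print(s, round(wsd_lr(s, total), 6))
```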
[LG-39] Physics-Guided Counterfactual Explanations for Large-Scale Multivariate Time Series: Application in Scalable and Interpretable SEP Event Prediction
链接: https://arxiv.org/abs/2601.08999
作者: Pranjal Patil,Anli Ji,Berkay Aydin
类目: Machine Learning (cs.LG)
*备注: This is a pre-print of an accepted paper at IEEE BigData 2025, SS 11: Towards an Understanding of Artificial Intelligence: Bridging Theory, Explainability, and Practical Applications
Abstract:Accurate prediction of solar energetic particle events is vital for safeguarding satellites, astronauts, and space-based infrastructure. Modern space weather monitoring generates massive volumes of high-frequency, multivariate time series (MVTS) data from sources such as the Geostationary Operational Environmental Satellites (GOES). Machine learning (ML) models trained on this data show strong predictive power, but most existing methods overlook domain-specific feasibility constraints. Counterfactual explanations have emerged as a key tool for improving model interpretability, yet existing approaches rarely enforce physical plausibility. This work introduces a Physics-Guided Counterfactual Explanation framework, a novel method for generating counterfactual explanations in time series classification tasks that remain consistent with underlying physical principles. Applied to solar energetic particle (SEP) forecasting, this framework achieves over 80% reduction in Dynamic Time Warping (DTW) distance, thereby increasing proximity, produces counterfactual explanations with higher sparsity, and reduces runtime by nearly 50% compared to state-of-the-art baselines such as DiCE. Beyond numerical improvements, this framework ensures that generated counterfactual explanations are physically plausible and actionable in scientific domains. In summary, the framework generates counterfactual explanations that are both valid and physically consistent, while laying the foundation for scalable counterfactual generation in big data environments.
[LG-40] Optimising for Energy Efficiency and Performance in Machine Learning
链接: https://arxiv.org/abs/2601.08991
作者: Emile Dos Santos Ferreira,Neil D. Lawrence,Andrei Paleyes
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted to CAIN’26
Abstract:The ubiquity of machine learning (ML) and the demand for ever-larger models bring an increase in energy consumption and environmental impact. However, little is known about the energy scaling laws in ML, and existing research focuses on training cost – ignoring the larger cost of inference. Furthermore, tools for measuring the energy consumption of ML do not provide actionable feedback. To address these gaps, we developed Energy Consumption Optimiser (ECOpt): a hyperparameter tuner that optimises for energy efficiency and model performance. ECOpt quantifies the trade-off between these metrics as an interpretable Pareto frontier. This enables ML practitioners to make informed decisions about energy cost and environmental impact, while maximising the benefit of their models and complying with new regulations. Using ECOpt, we show that parameter and floating-point operation counts can be unreliable proxies for energy consumption, and observe that the energy efficiency of Transformer models for text generation is relatively consistent across hardware. These findings motivate measuring and publishing the energy metrics of ML models. We further show that ECOpt can have a net positive environmental impact and use it to uncover seven models for CIFAR-10 that improve upon the state of the art, when considering accuracy and energy efficiency together.
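The energy/performance trade-off that ECOpt exposes can be summarized as a Pareto frontier over candidate configurations, as in the small sketch below. The configuration names and (accuracy, energy) numbers are invented for illustration.

```python
# Illustrative sketch of the accuracy-vs-energy trade-off surfaced by a tool
# like ECOpt: given measured (accuracy, energy) pairs for candidate
# configurations, keep only the Pareto-optimal ones. Values are made up.
def pareto_frontier(candidates):
    """Return configs not dominated by any other (higher accuracy AND lower energy)."""
    frontier = []
    for name, acc, energy in candidates:
        dominated = any(
            (a >= acc and e <= energy) and (a > acc or e < energy)
            for _, a, e in candidates
        )
        if not dominated:
            frontier.append((name, acc, energy))
    return sorted(frontier, key=lambda c: c[2])  # sort by energy

configs = [                      # (config id, accuracy, energy per run in kJ)
    ("small-lr1e-3", 0.88, 1.2),
    ("small-lr3e-4", 0.90, 1.3),
    ("medium",       0.93, 2.8),
    ("large",        0.935, 6.0),
    ("large-slow",   0.93, 7.5),  # dominated by "medium": same accuracy, more energy
]
for cfg in pareto_frontier(configs):
    print(cfg)
```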
[LG-41] Continuous Fairness On Data Streams
链接: https://arxiv.org/abs/2601.08976
作者: Subhodeep Ghosh,Zhihui Du,Angela Bonifati,Manish Kumar,David Bader,Senjuti Basu Roy
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:We study the problem of enforcing continuous group fairness over windows in data streams. We propose a novel fairness model that ensures group fairness at a finer granularity level (referred to as block) within each sliding window. This formulation is particularly useful when the window size is large, making it desirable to enforce fairness at a finer granularity. Within this framework, we address two key challenges: efficiently monitoring whether each sliding window satisfies block-level group fairness, and reordering the current window as effectively as possible when fairness is violated. To enable real-time monitoring, we design sketch-based data structures that maintain attribute distributions with minimal overhead. We also develop optimal, efficient algorithms for the reordering task, supported by rigorous theoretical guarantees. Our evaluation on four real-world streaming scenarios demonstrates the practical effectiveness of our approach. We achieve millisecond-level processing and a throughput of approximately 30,000 queries per second on average, depending on system parameters. The stream reordering algorithm improves block-level group fairness by up to 95% in certain cases, and by 50-60% on average across datasets. A qualitative study further highlights the advantages of block-level fairness compared to window-level fairness.
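A minimal version of block-level fairness monitoring is sketched below: each sliding window is split into fixed-size blocks and every block's group proportions are checked against a target mix. Block size, tolerance, and the two-group setup are assumptions, and the paper's sketch-based structures and reordering algorithms are far more involved.

```python
# Toy sketch of block-level group fairness monitoring on a stream window.
from collections import Counter

def block_fair(window, block_size, target, tol=0.2):
    """Return True if every block's proportion of each group is within tol of the target."""
    for start in range(0, len(window), block_size):
        block = window[start:start + block_size]
        counts = Counter(block)
        for group, share in target.items():
            if abs(counts[group] / len(block) - share) > tol:
                return False
    return True

stream_window = ["A", "B", "A", "B", "A", "A", "A", "A", "B", "A", "B", "B"]
target_mix = {"A": 0.5, "B": 0.5}
# False: the second block is all "A", violating block-level fairness even though
# the window as a whole is close to the 50/50 target.
print(block_fair(stream_window, block_size=4, target=target_mix))
```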
[LG-42] Breaking the Bottlenecks: Scalable Diffusion Models for 3D Molecular Generation
链接: https://arxiv.org/abs/2601.08963
作者: Adrita Das,Peiran Jiang,Dantong Zhu,Barnabas Poczos,Jose Lugo-Martinez
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Diffusion models have emerged as a powerful class of generative models for molecular design, capable of capturing complex structural distributions and achieving high fidelity in 3D molecule generation. However, their widespread use remains constrained by long sampling trajectories, stochastic variance in the reverse process, and limited structural awareness in denoising dynamics. The Directly Denoising Diffusion Model (DDDM) mitigates these inefficiencies by replacing stochastic reverse MCMC updates with deterministic denoising step, substantially reducing inference time. Yet, the theoretical underpinnings of such deterministic updates have remained opaque. In this work, we provide a principled reinterpretation of DDDM through the lens of the Reverse Transition Kernel (RTK) framework by Huang et al. 2024, unifying deterministic and stochastic diffusion under a shared probabilistic formalism. By expressing the DDDM reverse process as an approximate kernel operator, we show that the direct denoising process implicitly optimizes a structured transport map between noisy and clean samples. This perspective elucidates why deterministic denoising achieves efficient inference. Beyond theoretical clarity, this reframing resolves several long-standing bottlenecks in molecular diffusion. The RTK view ensures numerical stability by enforcing well-conditioned reverse kernels, improves sample consistency by eliminating stochastic variance, and enables scalable and symmetry-preserving denoisers that respect SE(3) equivariance. Empirically, we demonstrate that RTK-guided deterministic denoising achieves faster convergence and higher structural fidelity than stochastic diffusion models, while preserving chemical validity across GEOM-DRUGS dataset. Code, models, and datasets are publicly available in our project repository.
[LG-43] High-fidelity lunar topographic reconstruction across diverse terrain and illumination environments using deep learning
链接: https://arxiv.org/abs/2601.09468
作者: Hao Chen,Philipp Gläser,Konrad Willner,Jürgen Oberst
类目: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: 25 pages, 1 table, 8 figures
Abstract:Topographic models are essential for characterizing planetary surfaces and for inferring underlying geological processes. Nevertheless, meter-scale topographic data remain limited, which constrains detailed planetary investigations, even for the Moon, where extensive high-resolution orbital images are available. Recent advances in deep learning (DL) exploit single-view imagery, constrained by low-resolution topography, for fast and flexible reconstruction of fine-scale topography. However, their robustness and general applicability across diverse lunar landforms and illumination conditions remain insufficiently explored. In this study, we build upon our previously proposed DL framework by incorporating a more robust scale recovery scheme and extending the model to polar regions under low solar illumination conditions. We demonstrate that, compared with single-view shape-from-shading methods, the proposed DL approach exhibits greater robustness to varying illumination and achieves more consistent and accurate topographic reconstructions. Furthermore, it reliably reconstructs topography across lunar features of diverse scales, morphologies, and geological ages. High-quality topographic models are also produced for the lunar south polar areas, including permanently shadowed regions, demonstrating the method’s capability in reconstructing complex and low-illumination terrain. These findings suggest that DL-based approaches have the potential to leverage extensive lunar datasets to support advanced exploration missions and enable investigations of the Moon at unprecedented topographic resolution.
[LG-44] Horseshoe Mixtures-of-Experts (HS-MoE)
链接: https://arxiv.org/abs/2601.09043
作者: Nick Polson,Vadim Sokolov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Horseshoe mixtures-of-experts (HS-MoE) models provide a Bayesian framework for sparse expert selection in mixture-of-experts architectures. We combine the horseshoe prior’s adaptive global-local shrinkage with input-dependent gating, yielding data-adaptive sparsity in expert usage. Our primary methodological contribution is a particle learning algorithm for sequential inference, in which the filter is propagated forward in time while tracking only sufficient statistics. We also discuss how HS-MoE relates to modern mixture-of-experts layers in large language models, which are deployed under extreme sparsity constraints (e.g., activating a small number of experts per token out of a large pool).
[LG-45] Universal Latent Homeomorphic Manifolds: Cross-Domain Representation Learning via Homeomorphism Verification
链接: https://arxiv.org/abs/2601.09025
作者: Tong Wu,Tayab Uddin Wara,Daniel Hernandez,Sidong Lei
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:We present the Universal Latent Homeomorphic Manifold (ULHM), a framework that unifies semantic representations (e.g., human descriptions, diagnostic labels) and observation-driven machine representations (e.g., pixel intensities, sensor readings) into a single latent structure. Despite originating from fundamentally different pathways, both modalities capture the same underlying reality. We establish homeomorphism, a continuous bijection preserving topological structure, as the mathematical criterion for determining when latent manifolds induced by different semantic-observation pairs can be rigorously unified. This criterion provides theoretical guarantees for three critical applications: (1) semantic-guided sparse recovery from incomplete observations, (2) cross-domain transfer learning with verified structural compatibility, and (3) zero-shot compositional learning via valid transfer from semantic to observation space. Our framework learns continuous manifold-to-manifold transformations through conditional variational inference, avoiding brittle point-to-point mappings. We develop practical verification algorithms, including trust, continuity, and Wasserstein distance metrics, that empirically validate homeomorphic structure from finite samples. Experiments demonstrate: (1) sparse image recovery from 5% of CelebA pixels and MNIST digit reconstruction at multiple sparsity levels, (2) cross-domain classifier transfer achieving 86.73% accuracy from MNIST to Fashion-MNIST without retraining, and (3) zero-shot classification on unseen classes achieving 89.47% on MNIST, 84.70% on Fashion-MNIST, and 78.76% on CIFAR-10. Critically, the homeomorphism criterion correctly rejects incompatible datasets, preventing invalid unification and providing a feasible path toward principled decomposition of general foundation models into verified domain-specific components.
[LG-46] An Inexact Weighted Proximal Trust-Region Method
链接: https://arxiv.org/abs/2601.09024
作者: Leandro Farias Maia,Robert Baraldi,Drew P. Kouri
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:In [R. J. Baraldi and D. P. Kouri, Math. Program., 201:1 (2023), pp. 559-598], the authors introduced a trust-region method for minimizing the sum of a smooth nonconvex and a nonsmooth convex function, the latter of which has an analytical proximity operator. While many functions satisfy this criterion, e.g., the \ell_1-norm defined on \ell_2, many others are precluded by either the topology or the nature of the nonsmooth term. Using the \delta-Fréchet subdifferential, we extend the definition of the inexact proximity operator and enable its use within the aforementioned trust-region algorithm. Moreover, we augment the analysis for the standard trust-region convergence theory to handle proximity operator inexactness with weighted inner products. We first introduce an algorithm to generate a point in the inexact proximity operator and then apply the algorithm within the trust-region method to solve an optimal control problem constrained by Burgers' equation.
[LG-47] Tail-Sensitive KL and Rényi Convergence of Unadjusted Hamiltonian Monte Carlo via One-Shot Couplings
链接: https://arxiv.org/abs/2601.09019
作者: Nawaf Bou-Rabee,Siddharth Mitra,Andre Wibisono
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO)
*备注: 64 pages
Abstract:Hamiltonian Monte Carlo (HMC) algorithms are among the most widely used sampling methods in high-dimensional settings, yet their convergence properties are poorly understood in divergences that quantify relative density mismatch, such as Kullback-Leibler (KL) and Rényi divergences. These divergences naturally govern acceptance probabilities and warm-start requirements for Metropolis-adjusted Markov chains. In this work, we develop a framework for upgrading Wasserstein convergence guarantees for unadjusted Hamiltonian Monte Carlo (uHMC) to guarantees in tail-sensitive KL and Rényi divergences. Our approach is based on one-shot couplings, which we use to establish a regularization property of the uHMC transition kernel. This regularization allows Wasserstein-2 mixing-time and asymptotic bias bounds to be lifted to KL divergence, and analogous Orlicz-Wasserstein bounds to be lifted to Rényi divergence, paralleling earlier work of Bou-Rabee and Eberle (2023), which upgrades Wasserstein-1 bounds to total variation distance via kernel smoothing. As a consequence, our results provide quantitative control of relative density mismatch, clarify the role of discretization bias in strong divergences, and yield principled guarantees relevant both for unadjusted sampling and for generating warm starts for Metropolis-adjusted Markov chains.
[LG-48] Block Decomposable Methods for Large-Scale Optimization Problems
链接: https://arxiv.org/abs/2601.09010
作者: Leandro Farias Maia
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:This dissertation explores block decomposable methods for large-scale optimization problems. It focuses on alternating direction method of multipliers (ADMM) schemes and block coordinate descent (BCD) methods. Specifically, it introduces a new proximal ADMM algorithm and proposes two BCD methods. The first part of the research presents a new proximal ADMM algorithm. This method is adaptive to all problem parameters and solves the proximal augmented Lagrangian (AL) subproblem inexactly. This adaptiveness facilitates the highly efficient application of the algorithm to a broad swath of practical problems. The inexact solution of the proximal AL subproblem overcomes many key challenges in the practical applications of ADMM. The resultant algorithm obtains an approximate solution of an optimization problem in a number of iterations that matches the state-of-the-art complexity for the class of proximal ADMM schemes. The second part of the research focuses on an inexact proximal mapping for the class of block proximal gradient methods. Key properties of this operator are established, facilitating the derivation of convergence rates for the proposed algorithm. Under two error-decrease conditions, the algorithm matches the convergence rate of its exactly computed counterpart. Numerical results demonstrate the superior performance of the algorithm under a dynamic error regime over a fixed one. The dissertation concludes by providing convergence guarantees for the randomized BCD method applied to a broad class of functions, known as Hölder smooth functions. Convergence rates are derived for non-convex, convex, and strongly convex functions. These convergence rates match those furnished in the existing literature for the Lipschitz smooth setting.
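As a concrete reference point for the block coordinate descent methods discussed, the sketch below runs randomized BCD on a convex quadratic with blockwise gradient steps. The step-size rule and block layout are illustrative, not the dissertation's algorithms.

```python
# Minimal randomized block coordinate descent (BCD) sketch on a convex
# quadratic f(x) = 0.5 * x^T A x - b^T x, updating one random block per iteration.
import numpy as np

rng = np.random.default_rng(3)
n, block_size = 12, 4
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.normal(size=n)
x_star = np.linalg.solve(A, b)       # reference solution for comparison

blocks = [np.arange(i, i + block_size) for i in range(0, n, block_size)]
x = np.zeros(n)
for it in range(300):
    idx = blocks[rng.integers(len(blocks))]            # pick a random block
    grad_block = A[idx] @ x - b[idx]                   # partial gradient on that block
    L_block = np.linalg.norm(A[np.ix_(idx, idx)], 2)   # blockwise Lipschitz constant
    x[idx] -= grad_block / L_block                     # blockwise gradient step
print("distance to optimum:", np.linalg.norm(x - x_star))
```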
[LG-49] Machine Learning-Driven Creep Law Discovery Across Alloy Compositional Space
链接: https://arxiv.org/abs/2601.08970
作者: Hongshun Chen,Ryan Zhou,Rujing Zha,Zihan Chen,Wenpan Li,Rowan Rolark,John Patrick Reidy,Jian Cao,Ping Guo,David C. Dunand,Horacio D. Espinosa
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 27 pages, 7 figures
Abstract:High-temperature creep characterization of structural alloys traditionally relies on serial uniaxial tests, which are highly inefficient for exploring the large search space of alloy compositions and for material discovery. Here, we introduce a machine-learning-assisted, high-throughput framework for creep law identification based on a dimple array bulge instrument (DABI) configuration, which enables parallel creep testing of 25 dimples, each fabricated from a different alloy, in a single experiment. Full-field surface displacements of dimples undergoing time-dependent creep-induced bulging under inert gas pressure are measured by 3D digital image correlation. We train a recurrent neural network (RNN) as a surrogate model, mapping creep parameters and loading conditions to the time-dependent deformation response of DABI. Coupling this surrogate with a particle swarm optimization scheme enables rapid and global inverse identification, with sparsity regularization, of creep parameters from experimental displacement-time histories. In addition, we propose a phenomenological creep law with a time-dependent stress exponent that captures the sigmoidal primary creep observed in wrought INCONEL 625 and extracts its temperature dependence from DABI tests at multiple temperatures. Furthermore, we employ a general creep law combining several conventional forms together with regularized inversion to identify the creep laws for 47 additional Fe-, Ni-, and Co-rich alloys and to automatically select the dominant functional form for each alloy. This workflow, combined with the DABI experiment, provides a quantitative, high-throughput creep characterization platform that is compatible with data mining, composition-property modeling, and nonlinear structural optimization with creep behavior across a large alloy design space.
信息检索
[IR-0] MM-BRIGHT: A Multi-Task Multimodal Benchmark for Reasoning-Intensive Retrieval
链接: https://arxiv.org/abs/2601.09562
作者: Abdelrahman Abdallah,Mohamed Darwish Mounis,Mahmoud Abdalla,Mahmoud SalahEldin Kasem,Mostafa Farouk Senussi,Mohamed Mahmoud,Mohammed Ali,Adam Jatowt,Hyun-Soo Kang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Existing retrieval benchmarks primarily consist of text-based queries where keyword or semantic matching is usually sufficient. Many real-world queries contain multimodal elements, particularly images such as diagrams, charts, and screenshots that require intensive reasoning to identify relevant documents. To address this gap, we introduce MM-BRIGHT, the first multimodal benchmark for reasoning-intensive retrieval. Our dataset consists of 2,803 real-world queries spanning 29 diverse technical domains, with four tasks of increasing complexity: text-to-text, multimodal-to-text, multimodal-to-image, and multimodal-to-multimodal retrieval. Extensive evaluation reveals that state-of-the-art models struggle across all tasks: BM25 achieves only 8.5 nDCG@10 on text-only retrieval, while the best multimodal model Nomic-Vision reaches just 27.6 nDCG@10 on multimodal-to-text retrieval, actually underperforming the best text-only model (DiVeR: 32.2). These results highlight substantial headroom and position MM-BRIGHT as a testbed for next-generation retrieval models that better integrate visual reasoning. Our code and data are available at this https URL. See also our official website: this https URL.
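Since the benchmark reports nDCG@10, a small reference computation of the metric for a single query is given below; the ranking and graded relevance labels are made up, and the returned list is treated as the full judged pool for simplicity.

```python
# Reference computation of nDCG@10 for one query with graded relevance labels.
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked items (ranks are 0-indexed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the system ranking normalised by the ideal (sorted) ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the documents a hypothetical system returned, in rank order.
system_ranking = [0, 2, 1, 0, 0, 3, 0, 0, 1, 0, 2]
print(round(ndcg_at_k(system_ranking, k=10), 4))
```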
[IR-1] Examining DOM Coordinate Effectiveness For Page Segmentation
链接: https://arxiv.org/abs/2601.09543
作者: Jason Carpenter,Faaiq Bilal,Eman Ramadan,Zhi-Li Zhang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Web pages form a cornerstone of the data consumed daily by humans and, with the rise of LLM-based search and learning systems, a treasure trove of valuable data. The scale of this data and its unstructured format continue to grow, requiring ever more robust automated extraction and retrieval mechanisms. Existing work, leveraging the web page's Document Object Model (DOM), often derives clustering vectors from coordinates informed by the DOM, such as visual placement or tree structure. The construction and component value of these vectors often go unexamined. Our work proposes and examines DOM coordinates in detail to understand their impact on web page segmentation. We find that there is no one-size-fits-all vector, and that visual coordinates underperform DOM coordinates by about 20-30% on average, challenging the necessity of including visual coordinates in clustering vectors. Further, we find that simple vectors, comprised of single coordinates, fare better than complex vectors, constituting 68.2% of the top-performing vectors across the pages examined. Finally, we find that if a vector, clustering algorithm, and page are properly matched, one can achieve overall segmentation accuracy of 74%, a 20% improvement over a naive application of vectors. In conclusion, our results challenge the current orthodoxy of segmentation vector creation, open up the possibility of optimizing page segmentation via clustering on DOM coordinates, and highlight the importance of mechanisms for matching the best approach to each web page.
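The finding that simple single-coordinate vectors work well can be illustrated by segmenting a page with 1-D gap-based clustering on one hypothetical DOM-derived coordinate, as below. The node offsets and gap threshold are invented; the paper evaluates many coordinate and algorithm combinations.

```python
# Sketch of single-coordinate page segmentation: cluster DOM nodes on one
# coordinate (a hypothetical vertical offset per node) with 1-D gap clustering.
def gap_cluster(values, gap=50):
    """Group sorted 1-D coordinates into clusters split at gaps larger than `gap`."""
    ordered = sorted(values)
    clusters, current = [], [ordered[0]]
    for v in ordered[1:]:
        if v - current[-1] > gap:
            clusters.append(current)
            current = [v]
        else:
            current.append(v)
    clusters.append(current)
    return clusters

# Hypothetical vertical offsets (px) of leaf DOM nodes on a page.
node_tops = [12, 30, 44, 280, 295, 310, 320, 760, 790, 805]
segments = gap_cluster(node_tops, gap=100)
for i, seg in enumerate(segments):
    print(f"segment {i}: nodes at y = {seg}")
```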
[IR-2] SpatCode: Rotary-based Unified Encoding Framework for Efficient Spatiotemporal Vector Retrieval
链接: https://arxiv.org/abs/2601.09530
作者: Bingde Hu,Enhao Pan,Wanjing Zhou,Yang Gao,Zunlei Feng,Hao Zhong
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Spatiotemporal vector retrieval has emerged as a critical paradigm in modern information retrieval, enabling efficient access to massive, heterogeneous data that evolve over both time and space. However, existing spatiotemporal retrieval methods are often extensions of conventional vector search systems that rely on external filters or specialized indices to incorporate temporal and spatial constraints, leading to inefficiency, architectural complexity, and limited flexibility in handling heterogeneous modalities. To overcome these challenges, we present a unified spatiotemporal vector retrieval framework that integrates temporal, spatial, and semantic cues within a coherent similarity space while maintaining scalability and adaptability to continuous data streams. Specifically, we propose (1) a Rotary-based Unified Encoding Method that embeds time and location into rotational position vectors for consistent spatiotemporal representation; (2) a Circular Incremental Update Mechanism that supports efficient sliding-window updates without global re-encoding or index reconstruction; and (3) a Weighted Interest-based Retrieval Algorithm that adaptively balances modality weights for context-aware and personalized retrieval. Extensive experiments across multiple real-world datasets demonstrate that our framework substantially outperforms state-of-the-art baselines in both retrieval accuracy and efficiency, while maintaining robustness under dynamic data evolution. These results highlight the effectiveness and practicality of the proposed approach for scalable spatiotemporal information retrieval in intelligent systems.
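The rotary-style encoding of scalars such as timestamps can be sketched as rotating 2-D feature pairs by value-dependent angles, so that inner products depend only on the value difference. The dimensions and frequencies below are assumptions, not SpatCode's actual parameterization.

```python
# Rotary-style encoding sketch for a scalar attribute (e.g., a timestamp): each
# (even, odd) pair of an embedding is rotated by an angle proportional to the value.
import numpy as np

DIM = 8                                             # must be even
freqs = 1.0 / (10000 ** (np.arange(0, DIM, 2) / DIM))

def rotary_encode(vec, value):
    """Rotate consecutive (even, odd) pairs of `vec` by angle value * freq."""
    out = vec.copy()
    angles = value * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = vec[0::2], vec[1::2]
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out

rng = np.random.default_rng(4)
base = rng.normal(size=DIM)
a1, b1 = rotary_encode(base, 100.0), rotary_encode(base, 103.0)
a2, b2 = rotary_encode(base, 200.0), rotary_encode(base, 203.0)
# Relative property: the inner product depends only on the value difference,
# so both pairs (gap of 3) give the same similarity up to rounding.
print(round(float(a1 @ b1), 6), round(float(a2 @ b2), 6))
```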
[IR-3] TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval
链接: https://arxiv.org/abs/2601.09523
作者: Abdelrahman Abdallah,Mohammed Ali,Muhammad Abdul-Mageed,Adam Jatowt
类目: Information Retrieval (cs.IR)
*备注:
[IR-4] Unifying Search and Recommendation in LLMs via Gradient Multi-Subspace Tuning
链接: https://arxiv.org/abs/2601.09496
作者: Jujia Zhao,Zihan Wang,Shuaiqun Pan,Suzan Verberne,Zhaochun Ren
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Search and recommendation (SR) are core to online platforms, addressing explicit intent through queries and modeling implicit intent from behaviors, respectively. Their complementary roles motivate a unified modeling paradigm. Early studies to unify SR adopt shared encoders with task-specific heads, while recent efforts reframe item ranking in both SR as conditional generation. The latter holds particular promise, enabling end-to-end optimization and leveraging the semantic understanding of LLMs. However, existing methods rely on full fine-tuning, which is computationally expensive and limits scalability. Parameter-efficient fine-tuning (PEFT) offers a more practical alternative but faces two critical challenges in unifying SR: (1) gradient conflicts across tasks due to divergent optimization objectives, and (2) shifts in user intent understanding caused by overfitting to fine-tuning data, which distort general-domain knowledge and weaken LLM reasoning. To address the above issues, we propose Gradient Multi-Subspace Tuning (GEMS), a novel framework that unifies SR with LLMs while alleviating gradient conflicts and preserving general-domain knowledge. GEMS introduces (1) Multi-Subspace Decomposition, which disentangles shared and task-specific optimization signals into complementary low-rank subspaces, thereby reducing destructive gradient interference, and (2) Null-Space Projection, which constrains parameter updates to a subspace orthogonal to the general-domain knowledge space, mitigating shifts in user intent understanding. Extensive experiments on benchmark datasets show that GEMS consistently outperforms the state-of-the-art baselines across both search and recommendation tasks, achieving superior effectiveness.
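The null-space projection component can be illustrated by removing from a raw update any component lying in a protected "general-domain knowledge" subspace. The random knowledge directions below are stand-ins for the subspace GEMS actually estimates.

```python
# Sketch of null-space projection: make a parameter update orthogonal to a
# protected subspace spanned by "general-domain knowledge" directions.
import numpy as np

rng = np.random.default_rng(5)
d, k = 32, 4                                  # parameter dim, knowledge-subspace rank
K = rng.normal(size=(k, d))                   # rows span the protected subspace (stand-in)

# Orthonormal basis of the protected subspace via QR on K^T.
Q, _ = np.linalg.qr(K.T)                      # Q: (d, k), orthonormal columns

def nullspace_project(grad):
    """Project `grad` onto the orthogonal complement of span(K's rows)."""
    return grad - Q @ (Q.T @ grad)

g = rng.normal(size=d)                        # a raw task gradient
g_safe = nullspace_project(g)
print("overlap with protected space before:", round(float(np.linalg.norm(K @ g)), 4))
print("overlap with protected space after: ", round(float(np.linalg.norm(K @ g_safe)), 4))
```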
[IR-5] LISP – A Rich Interaction Dataset and Loggable Interactive Search Platform
链接: https://arxiv.org/abs/2601.09366
作者: Jana Isabelle Friese,Andreas Konstantin Kruff,Philipp Schaer,Norbert Fuhr,Nicola Ferro
类目: Information Retrieval (cs.IR)
*备注:
Abstract:We present a reusable dataset and accompanying infrastructure for studying human search behavior in Interactive Information Retrieval (IIR). The dataset combines detailed interaction logs from 61 participants (122 sessions) with user characteristics, including perceptual speed, topic-specific interest, search expertise, and demographic information. To facilitate reproducibility and reuse, we provide a fully documented study setup, a web-based perceptual speed test, and a framework for conducting similar user studies. Our work allows researchers to investigate individual and contextual factors affecting search behavior, and to develop or validate user simulators that account for such variability. We demonstrate the dataset's potential through an illustrative analysis and release all resources as open-access, supporting reproducible research and resource sharing in the IIR community.
[IR-6] A Deep Dive into OpenStreetMap Research Since its Inception (2008-2024): Contributors, Topics and Future Trends
链接: https://arxiv.org/abs/2601.09338
作者: Yao Sun,Liqiu Meng,Andres Camero,Stefan Auer,Xiao Xiang Zhu
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:
Abstract:OpenStreetMap (OSM) has transitioned from a pioneering volunteered geographic information (VGI) project into a global, multi-disciplinary research nexus. This study presents a bibliometric and systematic analysis of the OSM research landscape, examining its development trajectory and key driving forces. By evaluating 1,926 publications from the Web of Science (WoS) Core Collection and 782 State of the Map (SotM) presentations up to June 2024, we quantify publication growth, collaboration patterns, and thematic evolution. Results demonstrate simultaneous consolidation and diversification within the field. While a stable core of contributors continues to anchor OSM research, themes have shifted from initial concerns over data production and quality toward advanced analytical and applied uses. Comparative analysis of OSM-related research in WoS and SotM reveals distinct but complementary agendas between scholars and the OSM community. Building on these findings, we identify six emerging research directions and discuss how evolving partnerships among academia, the OSM community, and industry are poised to shape the future of OSM research. This study establishes a structured reference for understanding the state of OSM studies and offers strategic pathways for navigating its future this http URL data and code are available at this https URL.
[IR-7] LLMs Meet Isolation Kernel: Lightweight Learning-free Binary Embeddings for Fast Retrieval
链接: https://arxiv.org/abs/2601.09159
作者: Zhibo Zhang,Yang Xu,Kai Ming Ting,Cam-Tu Nguyen
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Large language models (LLMs) have recently enabled remarkable progress in text representation. However, their embeddings are typically high-dimensional, leading to substantial storage and retrieval overhead. Although recent approaches such as Matryoshka Representation Learning (MRL) and Contrastive Sparse Representation (CSR) alleviate these issues to some extent, they still suffer from retrieval accuracy degradation. This paper proposes Isolation Kernel Embedding (IKE), a learning-free method that transforms an LLM embedding into a binary embedding using the Isolation Kernel (IK). IKE is an ensemble of diverse (random) partitions, enabling robust estimation of the ideal kernel in the LLM embedding space, thus reducing retrieval accuracy loss as the ensemble grows. Lightweight and based on binary encoding, it offers a low memory footprint and fast bitwise computation, lowering retrieval latency. Experiments on multiple text retrieval datasets demonstrate that IKE offers up to 16.7x faster retrieval and 16x lower memory usage than LLM embeddings, while maintaining comparable or better accuracy. Compared to CSR and other compression methods, IKE consistently achieves the best balance between retrieval efficiency and effectiveness.
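The learning-free construction can be sketched with Voronoi-style partitions built from random subsamples: each partition maps a vector to a one-hot cell indicator, and similarity is the fraction of partitions in which two vectors share a cell. The parameters (t, psi) and random data below are illustrative, not the paper's exact construction.

```python
# Isolation-Kernel-style binary embedding sketch using Voronoi partitions built
# from random subsamples of the corpus.
import numpy as np

rng = np.random.default_rng(6)
t, psi, dim = 64, 16, 128                      # partitions, cells per partition, embedding dim
corpus = rng.normal(size=(1000, dim))          # stand-in for LLM embeddings

# Each partition is defined by psi reference points sampled from the corpus.
refs = np.stack([corpus[rng.choice(len(corpus), psi, replace=False)] for _ in range(t)])

def ik_encode(x):
    """Concatenate one-hot nearest-reference indicators across all partitions."""
    code = np.zeros((t, psi), dtype=np.uint8)
    for i in range(t):
        dists = np.linalg.norm(refs[i] - x, axis=1)
        code[i, np.argmin(dists)] = 1
    return code.ravel()                         # binary vector of length t * psi

def ik_similarity(cx, cy):
    """Fraction of partitions in which the two codes fall into the same cell."""
    return float(np.count_nonzero(cx & cy)) / t

q = corpus[0] + 0.05 * rng.normal(size=dim)     # a slightly perturbed query
print("sim(query, corpus[0]):", ik_similarity(ik_encode(q), ik_encode(corpus[0])))
print("sim(query, corpus[1]):", ik_similarity(ik_encode(q), ik_encode(corpus[1])))
```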

