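This post lists the latest papers retrieved from arXiv.org on 2026-01-05. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 every day.
Tip: if you would like the daily paper data sent to your inbox, leave your email address in the comments.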
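Overview (2026-01-05)
381 papers were updated today, including:
- Natural language processing: 49 papers (Computation and Language (cs.CL))
- Artificial intelligence: 100 papers (Artificial Intelligence (cs.AI))
- Computer vision: 82 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 104 papers (Machine Learning (cs.LG))
Natural Language Processing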
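[NLP-0] Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
[Quick read]: This paper addresses the problem of detecting valid mathematical reasoning in large language models, that is, how to tell whether a model-generated mathematical proof is logically valid without relying on training data or fine-tuning. The key to the solution is spectral analysis of the attention mechanism: attention matrices are treated as adjacency matrices of dynamic graphs, from which four interpretable spectral diagnostics are extracted: the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy. These metrics differ markedly between valid and invalid proofs (effect sizes up to Cohen’s d = 3.30, p < 10⁻¹¹⁶), and a single threshold achieves 85.0–95.6% classification accuracy with no learning at all, giving an unsupervised, high-accuracy, and interpretable method for verifying logical consistency in generative AI.
Link: https://arxiv.org/abs/2601.00791
Authors: Valentin Noël
Affiliations: Devoteam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 58 pages, 19 figures, Under Review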
Abstract:We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen’s d = 3.30 (p < 10⁻¹¹⁶), enabling 85.0–95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93–95% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B’s Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness (d = 2.09, p_MW = 1.16 × 10⁻⁴⁸), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
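As a rough illustration of one of the four diagnostics named in this abstract, the sketch below computes a graph signal smoothness score from a single attention matrix using only the Python standard library. The symmetrization step, the Rayleigh-quotient normalization, the function name `graph_smoothness`, and the toy matrix are all assumptions made for illustration; the paper's exact definitions are not given in this listing.

```python
def graph_smoothness(attn, signal):
    """Laplacian quadratic form x^T L x / x^T x on a symmetrized attention graph.

    attn   : square list of lists, attn[i][j] = attention from token i to token j
    signal : one scalar per token (for example, one hidden-state coordinate)
    Lower values mean the signal changes little across strongly connected tokens.
    """
    n = len(attn)
    # Symmetrize so the matrix can act as an undirected weighted adjacency matrix
    # (assumption: the paper may handle directionality differently).
    w = [[0.5 * (attn[i][j] + attn[j][i]) for j in range(n)] for i in range(n)]
    # With L = D - W, x^T L x equals the sum over unordered pairs of
    # w_ij * (x_i - x_j)^2, so no explicit Laplacian is needed.
    energy = sum(
        w[i][j] * (signal[i] - signal[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    norm = sum(x * x for x in signal)
    return energy / norm if norm > 0 else 0.0


# Toy 3-token causal attention matrix (rows sum to 1); not data from the paper.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
smooth_signal = [1.0, 1.0, 1.0]   # constant across tokens
rough_signal = [1.0, -1.0, 1.0]   # flips sign between neighbours

print(graph_smoothness(attn, smooth_signal))
print(graph_smoothness(attn, rough_signal))
```

A constant signal scores zero and the alternating one scores higher. A training-free detector of the kind described would then compare such a score against a single threshold.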
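[NLP-1] Adapting Natural Language Processing Models Across Jurisdictions: A Pilot Study in Canadian Cancer Registries
[Quick read]: This paper addresses the heavy resource use and data delays caused by cancer registries relying on manual extraction of information from pathology reports, and examines how well pretrained language models generalize across regions with different report formats. The key solution is cross-provincial fine-tuning to localize a domain-adapted transformer model developed in British Columbia (BCCRTron) and a general biomedical model (GatorTron), then fusing their outputs with a conservative OR-ensemble. This markedly improves recall for cancer classification (Tier 1) and reportable-cancer identification (Tier 2), reducing missed cases to 24 (Tier 1) and 33 (Tier 2), better than either model alone. The approach also implements a privacy-preserving collaboration workflow in which only model weights are shared, offering a feasible path toward a unified national foundation model for processing cancer pathology text.
Link: https://arxiv.org/abs/2601.00787
Authors: Jonathan Simkin, Lovedeep Gondara, Zeeshan Rizvi, Gregory Doyle, Jeff Dowden, Dan Bond, Desmond Martin, Raymond Ng
Affiliations: University of British Columbia; Newfoundland & Labrador Health Services
Categories: Computation and Language (cs.CL)
Comments: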
Abstract:Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, achieving a Tier 1 recall of 0.99 and reducing missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
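The OR-ensemble mentioned in this abstract can be sketched in a few lines: a report is flagged as positive if either model flags it, which can only raise recall relative to each model alone. The function names and the toy labels below are hypothetical and are not taken from the paper.

```python
def or_ensemble(preds_a, preds_b):
    """Flag a report as positive (1) if either model flags it."""
    return [int(a or b) for a, b in zip(preds_a, preds_b)]


def recall(y_true, y_pred):
    """Fraction of truly positive reports that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up toy labels (1 = cancer, 0 = non-cancer); not data from the paper.
y_true = [1, 1, 1, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 0]
model_b = [0, 1, 1, 0, 1, 0]
combined = or_ensemble(model_a, model_b)

print(recall(y_true, model_a), recall(y_true, model_b), recall(y_true, combined))
```

The trade-off is that false positives from both models are also pooled (the fifth report here), so an OR rule suits settings like registry screening where a missed case costs more than an extra manual review.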
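[NLP-2] Memory Bank Compression for Continual Adaptation of Large Language Models
[Quick read]: This paper addresses two core problems large language models (LLMs) face in continual learning: how to update the model efficiently for new data without losing existing knowledge, and how to cope with the storage and compute pressure of an external memory bank that keeps growing under real-world data streams. The key solution is MBC, a model that compresses the memory bank through a codebook optimization strategy, reducing the memory bank size to 0.3% of that of the most competitive baseline. It also introduces an online resetting mechanism that prevents codebook collapse and keeps learning stable, and applies Key-Value Low-Rank Adaptation to use the compressed memory representations efficiently, achieving efficient online adaptation while maintaining high retention accuracy.
Link: https://arxiv.org/abs/2601.00756
Authors: Thomas Katraouras, Dimitrios Rafailidis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26)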
Abstract:Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real-world scenario when large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at this https URL.
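[NLP-3] Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
[Quick read]: This paper addresses the limited use of current large language models (LLMs) for subjective text span identification, especially the lack of systematic evaluation on tasks such as Aspect-based Sentiment Analysis (ABSA) that require semantic understanding and judgment. The key to the solution is an evaluation covering three representative tasks (sentiment analysis, offensive language identification, and claim verification) that systematically assesses several LLM strategies (instruction tuning, in-context learning, and chain of thought) on span identification, finding that underlying relationships within the text help LLMs identify precise spans.
Link: https://arxiv.org/abs/2601.00736
Authors: Alphaeus Dmonte, Roland Oruche, Tharindu Ranasinghe, Marcos Zampieri, Prasad Calyam
Affiliations: George Mason University; University of Missouri; Lancaster University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: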
Abstract:Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate underlying relationships within text aid LLMs in identifying precise text spans.
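[NLP-4] TeleDoCTR: Domain-Specific and Contextual Troubleshooting for Telecommunications
[Quick read]: This paper addresses the inefficiency of ticket troubleshooting in telecommunications (telecom), a highly manual process involving complex steps such as ticket classification, retrieval of similar historical tickets, and fault analysis report generation, which leads to long resolution cycles and heavy resource consumption. The key solution is TeleDoCTR, an end-to-end troubleshooting system for telecom whose core innovation is combining domain-specific ranking models with generative models to jointly handle ticket routing, retrieval of contextually and semantically similar historical tickets, and automatic generation of a detailed fault report covering the issue, root cause, and potential solutions, thereby markedly improving troubleshooting accuracy and efficiency.
Link: https://arxiv.org/abs/2601.00691
Authors: Mohamed Trabelsi, Huseyin Uzunalioglu
Affiliations: Nokia Bell Labs
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: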
Abstract:Ticket troubleshooting refers to the process of analyzing and resolving problems that are reported through a ticketing system. In large organizations offering a wide range of services, this task is highly complex due to the diversity of submitted tickets and the need for specialized domain knowledge. In particular, troubleshooting in telecommunications (telecom) is a very time-consuming task as it requires experts to interpret ticket content, consult documentation, and search historical records to identify appropriate resolutions. This human-intensive approach not only delays issue resolution but also hinders overall operational efficiency. To enhance the effectiveness and efficiency of ticket troubleshooting in telecom, we propose TeleDoCTR, a novel telecom-related, domain-specific, and contextual troubleshooting system tailored for end-to-end ticket resolution in telecom. TeleDoCTR integrates both domain-specific ranking and generative models to automate key steps of the troubleshooting workflow which are: routing tickets to the appropriate expert team responsible for resolving the ticket (classification task), retrieving contextually and semantically similar historical tickets (retrieval task), and generating a detailed fault analysis report outlining the issue, root cause, and potential solutions (generation task). We evaluate TeleDoCTR on a real-world dataset from a telecom infrastructure and demonstrate that it achieves superior performance over existing state-of-the-art methods, significantly enhancing the accuracy and efficiency of the troubleshooting process.
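[NLP-5] Sigmoid Head for Quality Estimation under Language Ambiguity
[Quick read]: This paper addresses the problem that language model (LM) output probability does not reliably reflect generation quality. There are two core challenges: first, the LM's final output layer uses a softmax activation, so probability is split across multiple valid options, misleadingly suggesting low quality; second, the training data contains only a single one-hot encoded reference, so the model struggles to recognize other valid alternatives and assign them high probability. To address this, the paper trains a quality estimation module called Sigmoid Head on top of a pretrained LM. Its key innovations are: (1) an extra unembedding head with sigmoid activation instead of softmax, so that multiple correct options can receive high probability simultaneously; and (2) a heuristic in the negative sampling process during training that avoids selecting potentially correct alternative tokens, so that the quality signal is learned more accurately. The method needs no human-annotated quality data, is computationally efficient, and is more robust in out-of-domain settings than supervised Quality Estimation (QE).
Link: https://arxiv.org/abs/2601.00680
Authors: Tu Anh Dinh, Jan Niehues
Affiliations: Karlsruhe Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: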
Abstract:Language model (LM) probability is not a reliable quality estimator, as natural language is ambiguous. When multiple output options are valid, the model’s probability distribution is spread across them, which can misleadingly indicate low output quality. This issue is caused by two reasons: (1) LMs’ final output activation is softmax, which does not allow multiple correct options to receive high probabilities simultaneously and (2) LMs’ training data is single, one-hot encoded references, indicating that there is only one correct option at each output step. We propose training a module for Quality Estimation on top of pre-trained LMs to address these limitations. The module, called Sigmoid Head, is an extra unembedding head with sigmoid activation to tackle the first limitation. To tackle the second limitation, during the negative sampling process to train the Sigmoid Head, we use a heuristic to avoid selecting potentially alternative correct tokens. Our Sigmoid Head is computationally efficient during training and inference. The probability from Sigmoid Head is a notably better quality signal compared to the original softmax head. As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings compared to supervised QE.
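The first limitation described in this abstract can be seen numerically. The sketch below applies softmax and an element-wise sigmoid to the same hypothetical logits, in which two tokens are equally plausible and one is not. Reusing one set of logits for both activations is a simplification made here to isolate the effect of the activation; in the paper the Sigmoid Head is a separately trained unembedding head.

```python
import math


def softmax(logits):
    """Normalize logits into a distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each logit independently in (0, 1); scores need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical logits: two equally valid next tokens and one poor one.
logits = [4.0, 4.0, -4.0]

soft = softmax(logits)
sig = sigmoid(logits)
print([round(p, 3) for p in soft])
print([round(p, 3) for p in sig])
```

Under softmax the two valid tokens must share the probability mass, so each stays below 0.5 even though both are good choices. Under the sigmoid each can score close to 1 on its own.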
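[NLP-6] Fast-weight Product Key Memory
[Quick read]: This paper addresses the trade-off between storage capacity and computational efficiency in the sequence modeling layers of modern language models: softmax attention offers unbounded storage at quadratic computational cost, while linear variants are efficient but limited to fixed-size storage. The key solution is Fast-weight Product Key Memory (FwPKM), a new architecture that turns the static Product Key Memory (PKM) into a dynamic “fast-weight” episodic memory. FwPKM updates its parameters on the fly at both training and inference time via local chunk-level gradient descent, letting the model rapidly memorize and retrieve new key-value pairs from the input sequence. This yields significantly lower perplexity on long-context datasets and, in Needle in a Haystack evaluations, generalization from 4K-token training sequences to 128K-token contexts.
Link: https://arxiv.org/abs/2601.00671
Authors: Tianyu Zhao, Llion Jones
Affiliations: Sakana AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: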
Abstract:Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, “fast-weight” episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
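[NLP-7] Physio-DPO: Aligning Large Language Models with the Protein Energy Landscape to Eliminate Structural Hallucinations
[Quick read]: This paper addresses structural hallucinations, a common problem when large language models are used for generative protein design: the generated sequences have high linguistic likelihood but fold into thermodynamically unstable conformations. The key solution is Physio-DPO, a physics-informed alignment framework that introduces a magnitude-aware objective, scaling optimization updates according to the energy gap between native structures and physics-perturbed hard negatives, and thereby grounding protein language models in thermodynamic stability. Experiments show that Physio-DPO outperforms baselines including SFT, PPO, and standard DPO, reducing self-consistency root-mean-square deviation (RMSD) to 1.28 Å while raising foldability to 92.8%, and that it recovers key biophysical interactions such as hydrophobic core packing and hydrogen bond networks.
Link: https://arxiv.org/abs/2601.00647
Authors: QiWei Meng
Affiliations: Xi’an Jiaotong University
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
Comments: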
Abstract:Large Protein Language Models have shown strong potential for generative protein design, yet they frequently produce structural hallucinations, generating sequences with high linguistic likelihood that fold into thermodynamically unstable conformations. Existing alignment approaches such as Direct Preference Optimization are limited in this setting, as they model preferences as binary labels and ignore the continuous structure of the physical energy landscape. We propose Physio-DPO, a physics informed alignment framework that grounds protein language models in thermodynamic stability. Physio-DPO introduces a magnitude aware objective that scales optimization updates according to the energy gap between native structures and physics perturbed hard negatives. Experiments show that Physio-DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self consistency RMSD to 1.28 Å and increasing foldability to 92.8%. Qualitative analysis further demonstrates that Physio-DPO effectively mitigates structural hallucinations by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks.
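[NLP-8] Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs
[Quick read]: This paper addresses contextual hallucinations produced by large language models (LLMs) in fixed-input automation workflows, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are especially serious in deterministic tasks, where inputs are fixed and correctness is unambiguous. The key solution is a lightweight, model-agnostic framework. Issuing the same prompt repeatedly in independent context windows makes the probability that all outputs are incorrect fall exponentially. An LLM-as-a-judge then picks a correct answer among the runs, and the pipeline's failure probability is proven to be governed by the judge's true- and false-positive probabilities. When the judge is imperfect, majority voting over independent judge calls forms an ensemble whose error rate decreases exponentially in the number of votes. Without modifying model weights, decoding strategies, or prompt engineering, the method gives explicit probabilistic guarantees for driving the probability of selecting a hallucinated answer arbitrarily low.
Link: https://arxiv.org/abs/2601.00641
Authors: Nils Rautenberg, Sven Schippkus
Affiliations: Deutsche Aktuarvereinigung e.V.; University of Hamburg
Categories: Computation and Language (cs.CL)
Comments: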
Abstract:Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are particularly problematic in deterministic automation workflows, where inputs are fixed and correctness is unambiguous. We introduce a simple and model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We formalize the notion of a specific task, defined by a fixed input and a deterministic correctness criterion, and show that issuing the same prompt in independent context windows yields an exponential reduction in the probability that all model outputs are incorrect. To identify a correct answer among repeated runs, we incorporate an LLM-as-a-judge and prove that the probability that the judged pipeline fails decays at a rate determined by the judge’s true- and false-positive probabilities. When the judge is imperfect, we strengthen it through majority vote over independent judge calls, obtaining ensemble-level error rates that decrease exponentially in the number of votes. This yields an explicit bound on the probability that the pipeline selects a hallucinated answer. Experiments on controlled extraction tasks with synthetic noisy judges match these predictions exactly: pipeline failure decreases exponentially with the number of repetitions, and hallucination-selection decreases exponentially with the number of judges in the ensemble. Together, these results provide a lightweight, modular, and theoretically grounded method for driving hallucination probabilities arbitrarily low in fixed-input LLM workflows, without modifying model weights, decoding strategies, or prompt engineering.
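The two exponential effects described in this abstract follow from elementary probability under an independence assumption. The sketch below computes the chance that every one of n independent runs is wrong, and the chance that a majority of k independent judges errs, using a plain binomial tail. The function names and the error rates are illustrative assumptions; the paper's own bounds, stated in terms of true- and false-positive probabilities, are not reproduced here.

```python
from math import comb


def prob_all_wrong(per_run_error, n_runs):
    """Probability that n independent runs are all incorrect."""
    return per_run_error ** n_runs


def majority_vote_error(judge_error, n_judges):
    """Probability that more than half of n independent judges err (n odd)."""
    need = n_judges // 2 + 1  # smallest number of wrong votes that wins
    return sum(
        comb(n_judges, j) * judge_error ** j * (1 - judge_error) ** (n_judges - j)
        for j in range(need, n_judges + 1)
    )


# Illustrative error rates; not figures from the paper.
for n in (1, 3, 5):
    print(n, prob_all_wrong(0.2, n), majority_vote_error(0.2, n))
```

With a per-call error of 0.2, three repetitions leave a 0.008 chance that no run is correct, and a three-judge majority errs with probability 0.104. Both shrink further as n grows, provided each judge is right more often than not.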
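[NLP-9] Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
[Quick read]: This paper addresses the difficulty current generative AI has in strictly following multi-step business policies and complex task dependencies in customer support, targeting both the inflexibility of traditional Interactive Voice Response (IVR) systems and the lack of policy-adherence evaluation in existing benchmarks. The key solution is the JourneyBench benchmark, which uses graph representations to generate diverse, realistic support scenarios and introduces the User Journey Coverage Score as a new metric for policy adherence. The paper also evaluates two agent designs, a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA). The DPA explicitly models policy control and significantly improves policy adherence, even allowing smaller models such as GPT-4o-mini to outperform more capable ones such as GPT-4o.
Link: https://arxiv.org/abs/2601.00596
Authors: Sumanth Balaji, Piyush Mishra, Aashraya Sachdeva, Suraj Agrawal
Affiliations: Observe.AI
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 3 figures, preprint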
Abstract:Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent’s capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
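[NLP-10] CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns
[Quick read]: This paper addresses the significant gap in safety evaluation of large language models (LLMs) in Chinese, in particular the vulnerability of lightweight models to Chinese-specific adversarial perturbations such as homophones, pinyin, and symbol-based splitting. The key solution is the Chinese-Specific Safety Benchmark (CSSBench), a safety benchmark focused on Chinese-specific attack patterns. It covers six domains common in real Chinese scenarios: illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety. Queries are organized into multiple task types to evaluate both the safety of lightweight models and their over-refusal behavior, giving a quantifiable and reproducible basis for deploying LLMs safely in Chinese.
Link: https://arxiv.org/abs/2601.00588
Authors: Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu, Qiankun Li, Kun Wang, Zhigang Zeng
Affiliations: Nanyang Technological University; Beijing University of Posts and Telecommunications; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: 18 pages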
Abstract:Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create the safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that the Chinese-specific adversarial pattern is a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.
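[NLP-11] InfoSynth: Information-Guided Benchmark Synthesis for LLMs
[Quick read]: This paper addresses two challenges in evaluating the reasoning and code generation abilities of large language models (LLMs): manually built benchmarks are expensive and slow to create, and existing benchmarks often overlap with LLM training data, which distorts evaluation results. The key solution is InfoSynth, a framework that uses information-theoretic metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity, together with an end-to-end pipeline that uses genetic algorithms and iterative code feedback to synthesize Python coding problems from seed datasets. The pipeline produces accurate test cases and solutions 97% of the time, allows control over the novelty/diversity and difficulty of generated problems, and forms a scalable, self-verifying process for building benchmarks.
Link: https://arxiv.org/abs/2601.00575
Authors: Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song
Affiliations: University of California, Berkeley
Categories: Computation and Language (cs.CL)
Comments: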
Abstract:Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: this https URL
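[NLP-12] A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR
[Quick read]: This paper addresses the high computational cost and latency of deploying large-scale multilingual automatic speech recognition (mASR) models on resource-constrained edge devices. The core solution is a lightweight, language-agnostic multilingual ASR framework, Language-agnostic Hierarchical LoRA-MoE (HLoRA), built on a CTC architecture with domain adaptation. The key innovation is a hierarchical LoRA design that decouples language-invariant acoustic representations from language-specific expert modeling: a shared LoRA learns features common across languages, while language-specific LoRA experts capture language-dependent characteristics. A LoRA routing mechanism driven by language identification (LID) posteriors removes the need for prior language identity or explicit language labels at inference, enabling truly language-agnostic end-to-end decoding and significantly improving decoding efficiency for low-resource mASR while keeping performance competitive.
Link: https://arxiv.org/abs/2601.00557
Authors: Yuang Zheng, Yuxiang Mei, Dongxing Xu, Jie Chen, Yanhua Long
Affiliations: Shanghai Normal University; Unisound AI Technology Co., Ltd.
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, submitted to IEEE Signal Processing Letters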
Abstract:Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.
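[NLP-13] ECR: Manifold-Guided Semantic Cues for Compact Language Models
[Quick read]: This paper addresses the collapse of embedding-space structure in compact models, particularly when capacity is tight or the data spans several languages, which makes it hard for downstream tasks to build on the resulting representations. Existing compression methods typically align only model outputs and fail to preserve the underlying manifold structure, causing semantic drift in which task behavior and linguistic properties deviate from the reference model. The key solution is a new framework called Embedding Consistency Regulation (ECR): it first derives a set of semantic anchors from teacher embeddings, computed once offline, and then trains the compact model to maintain consistent geometry around these anchors without matching logits or internal features. ECR adds only a lightweight projection step at inference and does not change the decoding architecture or its runtime behavior. It stabilizes training and preserves semantic structure in multilingual settings, making compact models easier to deploy.
Link: https://arxiv.org/abs/2601.00543
Authors: Chung-Wei Victor Yuan
Affiliations: YVIC Research Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint, 13 pages, 6 figures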
Abstract:Compact models often lose the structure of their embedding space. The issue shows up when the capacity is tight or the data spans several languages. Such collapse makes it difficult for downstream tasks to build on the resulting representation. Existing compression methods focus on aligning model outputs at a superficial level but fail to preserve the underlying manifold structure. This mismatch often leads to semantic drift in the compact model, causing both task behavior and linguistic properties to deviate from the reference model. To address those issues, we provide a new framework called Embedding Consistency Regulation (ECR). This framework first derives a set of semantic anchors from teacher embeddings (computed once offline). Then, the compact model learns to maintain consistent geometry around these anchors, without relying on matching logits or internal features. ECR adds only a small projection step at inference, without altering the decoding architecture or its runtime behavior. In experiments on a 100K multilingual corpus, ECR consistently stabilizes training and preserves semantic structure across tasks and languages. It also produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines. ECR works without teacher outputs and is compatible with, but independent of, distillation. Taken together, our results show that ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits.
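[NLP-14] Retrieval–Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends
[Quick read]: This paper addresses the opacity of the retrieval–reasoning process in multi-hop question answering (QA) systems, which makes procedural choices hard to compare across model families. Its key contribution is a four-axis analysis framework that takes the execution procedure as the unit of analysis, covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this framework, the authors map representative multi-hop QA systems and synthesize reported ablations and trends on benchmarks such as HotpotQA, highlighting trade-offs among effectiveness, efficiency, and evidence faithfulness, and outlining open directions for research on retrieval–reasoning agents.
Link: https://arxiv.org/abs/2601.00536
Authors: Yuelyu Ji, Zhuochun Li, Rui Meng, Daqing He
Affiliations: University of Pittsburgh; Google Cloud AI Research
Categories: Computation and Language (cs.CL)
Comments: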
Abstract:Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval–reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval–reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.
[NLP-15] The Illusion of Insight in Reasoning Models
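【Quick Read】: This paper asks whether generative AI exhibits human-like "Aha!" moments during reasoning, that is, whether a model can self-correct and improve its performance by intrinsically adjusting its reasoning strategy. The key to the approach is a systematic analysis of mid-reasoning shifts across more than 1 million reasoning traces, hundreds of training checkpoints, and multiple model architectures and decoding temperatures. The analysis finds that these shifts are essentially symptoms of unstable inference behavior rather than an effective self-correction mechanism. The authors further show that artificially triggering extrinsic shifts under high entropy significantly improves accuracy, revealing the relationship between reasoning stability and model uncertainty.
Link: https://arxiv.org/abs/2601.00514
Authors: Liv G. d’Aliberti, Manoel Horta Ribeiro
Affiliations: Princeton University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: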
Abstract:Do reasoning models have “Aha!” moments? Prior work suggests that models like DeepSeek-R1-Zero undergo sudden mid-trace realizations that lead to accurate outputs, implying an intrinsic capacity for self-correction. Yet, it remains unclear whether such intrinsic shifts in reasoning strategy actually improve performance. Here, we study mid-reasoning shifts and instrument training runs to detect them. Our analysis spans 1M+ reasoning traces, hundreds of training checkpoints, three reasoning domains, and multiple decoding temperatures and model architectures. We find that reasoning shifts are rare, do not become more frequent with training, and seldom improve accuracy, indicating that they do not correspond to prior perceptions of model insight. However, their effect varies with model uncertainty. Building on this finding, we show that artificially triggering extrinsic shifts under high entropy reliably improves accuracy. Our results show that mid-reasoning shifts are symptoms of unstable inference behavior rather than an intrinsic mechanism for self-correction.
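The abstract's last finding, triggering an extrinsic shift only when the model is uncertain, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the cue text `SHIFT_CUE`, the threshold of 1.0 nats, and the function names are all assumptions made for illustration, and a real system would read the next-token distribution from the model.

```python
import math
from typing import Sequence

# Hypothetical cue text; the paper does not specify the wording used here.
SHIFT_CUE = "Wait, let me re-check this step."


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def maybe_trigger_shift(trace: str, probs: Sequence[float],
                        threshold: float = 1.0) -> str:
    """Append the cue only when entropy exceeds the (assumed) threshold."""
    if token_entropy(probs) > threshold:
        return trace + " " + SHIFT_CUE
    return trace


# A confident distribution leaves the trace untouched; a flat one adds the cue.
print(maybe_trigger_shift("x = 3.", [1.0]))
print(maybe_trigger_shift("x = 3.", [0.25, 0.25, 0.25, 0.25]))
```

Gating on entropy rather than always inserting the cue mirrors the abstract's observation that the effect of a shift varies with model uncertainty.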
[NLP-16] A Chain-of-Thought Approach to Semantic Query Categorization in e-Commerce Taxonomies SIGIR
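【Quick Read】: This paper addresses search query categorization in e-commerce: how to accurately identify, within a hierarchical category taxonomy, the set of leaf categories that are semantically aligned with a user query. Accurate categorization narrows the retrieval scope and strengthens query intent understanding. The core of the solution is a new Chain-of-Thought (CoT) paradigm that combines a simple tree search with semantic scoring by a large language model (LLM). Evaluated on human-judged query-category pairs, relevance tests, and LLM-based reference methods, it outperforms a conventional embedding-based categorization benchmark. The method can also detect structural problems within a hierarchical taxonomy, and the authors additionally propose LLM-based categorization approaches that scale to millions of queries.
Link: https://arxiv.org/abs/2601.00510
Authors: Jetlir Duraj, Ishita Khan, Kilian Merkelbach, Mehran Elyasi
Affiliations: eBay Inc
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 9 pages, accepted at SIGIR eCom 2025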
Abstract:Search in e-Commerce is powered at the core by a structured representation of the inventory, often formulated as a category taxonomy. An important capability in e-Commerce with hierarchical taxonomies is to select a set of relevant leaf categories that are semantically aligned with a given user query. In this scope, we address a fundamental problem of search query categorization in real-world e-Commerce taxonomies. A correct categorization of a query not only provides a way to zoom into the correct inventory space, but opens the door to multiple intent understanding capabilities for a query. A practical and accurate solution to this problem has many applications in e-commerce, including constraining retrieved items and improving the relevance of the search results. For this task, we explore a novel Chain-of-Thought (CoT) paradigm that combines simple tree-search with LLM semantic scoring. Assessing its classification performance on human-judged query-category pairs, relevance tests, and LLM-based reference methods, we find that the CoT approach performs better than a benchmark that uses embedding-based query category predictions. We show how the CoT approach can detect problems within a hierarchical taxonomy. Finally, we also propose LLM-based approaches for query-categorization of the same spirit, but which scale better at the range of millions of queries.
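To make "simple tree-search with LLM semantic scoring" concrete, here is a minimal Python sketch of a top-down descent through a taxonomy. Everything in it is hypothetical: the toy `TAXONOMY`, the `HINTS` table, and `toy_score`, which merely stands in for an LLM relevance score; the beam width and threshold are illustrative defaults, not values from the paper.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy taxonomy: parent -> children. Nodes without an entry are leaves.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["Electronics", "Fashion"],
    "Electronics": ["Cell Phones", "Laptops"],
    "Fashion": ["Men's Shoes", "Women's Shoes"],
}

# Hypothetical keyword hints used by the stand-in scorer below.
HINTS: Dict[str, Set[str]] = {
    "Electronics": {"phone", "laptop"},
    "Cell Phones": {"phone"},
    "Laptops": {"laptop"},
    "Fashion": {"shoes", "sneakers"},
    "Men's Shoes": {"shoes", "sneakers"},
    "Women's Shoes": {"shoes", "sneakers"},
}


def toy_score(query: str, category: str) -> float:
    """Stand-in for an LLM semantic score in [0, 1] (assumption)."""
    return 1.0 if set(query.lower().split()) & HINTS.get(category, set()) else 0.0


def categorize(query: str, score: Callable[[str, str], float],
               node: str = "root", threshold: float = 0.5,
               beam: int = 2) -> List[str]:
    """Descend the tree, keeping at most `beam` children scoring >= threshold."""
    children = TAXONOMY.get(node, [])
    if not children:
        return [node]  # reached a leaf category
    ranked = sorted(((score(query, c), c) for c in children), reverse=True)
    leaves: List[str] = []
    for s, child in ranked[:beam]:
        if s >= threshold:
            leaves.extend(categorize(query, score, child, threshold, beam))
    return leaves


print(categorize("gaming laptop", toy_score))
```

Scoring only the children of surviving nodes keeps the number of scorer calls proportional to the explored paths rather than to the full set of leaf categories.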
[NLP-17] Rule-Based Approaches to Atomic Sentence Extraction
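【Quick Read】: This paper addresses the limited performance of atomic sentence extraction on complex sentences that pack several ideas together, focusing on how specific syntactic structures affect the accuracy of rule-based methods. The key to the approach is to build systematic extraction rules on dependency parsing and to generate high-quality annotated samples from the WikiSplit dataset. This allows a quantitative evaluation of typical complex structures (relative clauses, adverbial clauses, coordinated predicates, appositions, and passive constructions), reveals how these structures relate to extraction failures, and provides empirical evidence for improving the interpretability and robustness of atomic sentence extraction.
Link: https://arxiv.org/abs/2601.00506
Authors: Lineesha Kamana, Akshita Ananda Subramanian, Mehuli Ghosh, Suman Saha
Affiliations: The Pennsylvania State University
Subjects: Computation and Language (cs.CL)
Comments: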
Abstract:Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the “split-and-rephrase” task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
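For readers unfamiliar with the ROUGE-1 F1 figure quoted above, the following Python sketch shows the underlying unigram-overlap computation. It is a simplification: whitespace tokenization and lowercasing are assumptions, and the standard ROUGE tooling the authors presumably used applies its own tokenization and optional stemming, so this will not reproduce their exact numbers.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference sentence."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection takes the per-token minimum count (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Precision is 2/2 and recall is 2/3, so F1 = 0.8.
print(rouge1_f1("the cat", "the cat sat"))
```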
[NLP-18] Noise-Aware Named Entity Recognition for Historical VET Documents
【Quick Read】: This paper addresses the degradation of named entity recognition (NER) performance caused by optical character recognition (OCR) noise in historical digitized documents from the Vocational Education and Training (VET) domain. The key to the solution is a Noise-Aware Training (NAT) approach that combines synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning, and that systematically trains on noisy, clean, and artificial data, substantially improving robustness and accuracy under real noisy conditions.
Link: https://arxiv.org/abs/2601.00488
Authors: Alexander M. Esser, Jens Dörpinghaus
Affiliations: Federal Institute for Vocational Education and Training (BIBB), Bonn, Germany; Department of Computer Science, University of Koblenz, Germany
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: This is an extended, non-peer-reviewed version of the paper presented at VISAPP 2026
Abstract:This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
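A rough idea of what "synthetically injected OCR errors" can look like is sketched below in Python. The confusion table `OCR_CONFUSIONS`, the noise rate, and the function name are hypothetical and are not taken from the authors' code; the point is only that characters are corrupted while token-level NER labels stay aligned.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical character confusions of the kind OCR engines tend to produce.
OCR_CONFUSIONS: Dict[str, str] = {
    "l": "1", "O": "0", "m": "rn", "e": "c", "ü": "u", "ß": "B",
}


def inject_ocr_noise(tokens: Sequence[str], labels: Sequence[str],
                     rate: float = 0.1, seed: int = 0
                     ) -> Tuple[List[str], List[str]]:
    """Corrupt confusable characters with probability `rate`; labels are kept."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    noisy: List[str] = []
    for token in tokens:
        chars = [
            OCR_CONFUSIONS[ch]
            if ch in OCR_CONFUSIONS and rng.random() < rate else ch
            for ch in token
        ]
        noisy.append("".join(chars))
    # One label per token still holds because tokens are never split or merged.
    return noisy, list(labels)


print(inject_ocr_noise(["Herr", "Müller", "lernt"], ["O", "B-PER", "O"], rate=1.0))
```

Corrupting inside tokens rather than across token boundaries is a deliberate simplification here, since it lets the clean labels be reused unchanged for the noisy copy.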
[NLP-19] Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
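【Quick Read】: This paper addresses the high computational cost that guardrail models incur in large language model (LLM) deployments when they process full multi-turn conversation histories. The core solution is the Defensive M2S training paradigm, which compresses multi-turn conversations into a single-turn representation (Multi-turn to Single-turn, M2S) and thereby sharply reduces token consumption in both training and inference. The key points are twofold: during training, the compressed single-turn representation replaces the full multi-turn history, lowering the complexity from O(n²) to O(n); during inference, far fewer tokens are needed (a 94.6% reduction in the experiments) while attack detection performance is maintained or even improved (the best configuration reaches 93.8% recall, 38.9 percentage points above the baseline). This indicates that M2S compression is an efficient and scalable technique for safety screening, suited to guardrail deployment on long conversations.
Link: https://arxiv.org/abs/2601.00454
Authors: Hyunjun Kim
Affiliations: KAIST
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: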
Abstract:Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from O(n^2) to O(n) for n-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline, a 93× reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
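The Python sketch below illustrates the idea behind a hyphenize-style compression and re-derives the two reduction figures quoted in the abstract. The template wording in `hyphenize` is an assumption, not the paper's exact template, and `cumulative_tokens` assumes (as one possible reading of the O(n^2) claim) that the baseline re-reads the whole history at every turn.

```python
from typing import Sequence


def hyphenize(user_turns: Sequence[str]) -> str:
    """Fold user turns into one single-turn prompt (hypothetical template)."""
    header = "Please answer the following list of questions in the given order."
    return "\n".join([header] + [f"- {turn.strip()}" for turn in user_turns])


def reduction_pct(before: float, after: float) -> float:
    """Percentage of tokens saved when going from `before` to `after`."""
    return 100.0 * (before - after) / before


def cumulative_tokens(n_turns: int, tokens_per_turn: int) -> int:
    """Tokens read if every turn re-processes the full history (assumption).

    The sum t + 2t + ... + nt = t * n * (n + 1) / 2 grows quadratically in n,
    whereas a single compressed prompt grows linearly.
    """
    return tokens_per_turn * n_turns * (n_turns + 1) // 2


print(hyphenize(["What is a lock?", "How are locks rated?"]))
print(round(reduction_pct(3231, 173), 1))   # inference tokens per conversation
print(round(15.7e6 / 169e3))                # training-token ratio
```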
[NLP-20] Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games
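【Quick Read】: This paper asks how, in the empirical setting offered by large language models (LLMs), two classic theoretical perspectives on linguistic meaning can be re-examined and integrated: the social constructivist view associated with language games, and a mathematically oriented view of semantic structure. The key to the approach is to propose and formalize lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space, and to show how distributed representations, attention mechanisms, and geometric regularities of embedding spaces in transformer architectures relate to these concepts. The author argues that the success of LLMs in capturing semantic regularities supports the view that language has an underlying mathematical structure, while their limitations in pragmatic reasoning and context sensitivity confirm the importance of social grounding. Mathematical structure and language games should therefore be seen as complementary rather than opposed perspectives, which opens new directions for theory-driven AI architecture design.
Link: https://arxiv.org/abs/2601.00448
Authors: Dimitris Vartziotis
Affiliations: TWT Science & Innovation; NIKI - Digital Engineering
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: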
Abstract:Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures (such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces) relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.
[NLP-21] Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment
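【Quick Read】: This paper addresses the need for efficient, lightweight models in enterprise natural language processing (NLP), in particular how to balance model performance against computational efficiency in multi-domain text automation tasks. The key to the approach is a systematic comparison of three mainstream lightweight Transformer models (DistilBERT, MiniLM, and ALBERT) on three typical enterprise scenarios: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Under controlled training conditions (a fixed fine-tuning strategy rather than exhaustive hyperparameter optimization), it quantifies both accuracy metrics (accuracy, precision, recall, F1-score) and efficiency metrics (model size, inference time, throughput, memory usage). The results show each model's strengths under specific constraints: MiniLM is best for latency-sensitive scenarios, DistilBERT offers the best balance between accuracy and efficiency, and ALBERT has the greatest accuracy advantage in resource-constrained environments.
Link: https://arxiv.org/abs/2601.00444
Authors: Muhammad Shahmeer Khan
Affiliations: Ulster University
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures. Code and reproducibility resources available on GitHub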
Abstract:In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
[NLP-22] Toward Better Temporal Structures for Geopolitical Events Forecasting
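【Quick Read】: This paper addresses a limitation of existing hyper-relational temporal knowledge graphs (HTKGs) in expressing complex geopolitical events: they cannot effectively model temporal facts with more than two primary entities. To address this, the authors propose a new structure, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs), whose formal definition is backward compatible with HTKGs and supports two types of complex facts common in geopolitical events. The key to the solution is the formal framework for HTKGHs together with the htkgh-polecat dataset, built on the global event database POLECAT, which provides a benchmark and a basis for analyzing the adaptability and capabilities of large language models (LLMs) on complex temporal relation prediction.
Link: https://arxiv.org/abs/2601.00430
Authors: Kian Ahrabian, Eric Boxer, Jay Pujara
Affiliations: University of Southern California; Information Sciences Institute
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 13 figures, 3 tables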
Abstract:Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.
[NLP-23] Deep Delta Learning
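【Quick Read】: This paper addresses the problem that the identity shortcut connection in deep residual networks imposes a strictly additive inductive bias, which limits the model's ability to capture complex state transitions. The key to the solution is a new architecture, Deep Delta Learning (DDL), which modulates the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, called the Delta Operator, is a rank-1 perturbation of the identity matrix parameterized by a reflection direction vector k(X) and a gating scalar β(X). The gate β(X) interpolates dynamically between identity mapping, orthogonal projection, and geometric reflection, so the network can explicitly control the spectrum of its layer-wise transition operator while preserving the stable training characteristics of gated residual architectures, enabling it to model complex, non-monotonic dynamics.
Link: https://arxiv.org/abs/2601.00417
Authors: Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu
Affiliations: Princeton University; University of California, Los Angeles
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL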
Abstract:The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network’s capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank-1 perturbation of the identity matrix, parameterized by a reflection direction vector k(X) and a gating scalar β(X). We provide a spectral analysis of this operator, demonstrating that the gate β(X) enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank-1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics while preserving the stable training characteristics of gated residual architectures.
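The three regimes named in the abstract can be checked numerically with a small Python sketch. The abstract does not state the operator's formula, so the form A = I − β·k·kᵀ with a unit-norm k is an assumption here (it is the standard rank-1 perturbation that yields identity at β = 0, orthogonal projection at β = 1, and a Householder reflection at β = 2); in the paper, k and β depend on the input X, whereas this sketch fixes them by hand.

```python
import math
from typing import List

Matrix = List[List[float]]


def delta_operator(k: List[float], beta: float) -> Matrix:
    """Return I - beta * u u^T, where u is k normalized (assumed form)."""
    norm = math.sqrt(sum(x * x for x in k))
    u = [x / norm for x in k]
    d = len(u)
    return [[(1.0 if i == j else 0.0) - beta * u[i] * u[j] for j in range(d)]
            for i in range(d)]


def apply(a: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


k = [3.0, 4.0]
for beta in (0.0, 1.0, 2.0):
    # Along k the eigenvalue is 1 - beta; orthogonal directions are untouched.
    print(beta, apply(delta_operator(k, beta), k))
```

Under this assumed form the spectrum is 1 with multiplicity d − 1 and 1 − β along k, which is one way to read the abstract's statement that the gate controls the spectrum of the transition operator.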
[NLP-24] Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset
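【Quick Read】: This paper addresses the shortage of high-quality corpora for named entity recognition (NER) in low-resource languages such as Luxembourgish, where annotated data is scarce and linguistic particularities complicate annotation. The key to the solution is a novel automatic labelling and verification pipeline based on large language models (LLMs). It uses Wikipedia and Wikidata as structured sources of weak supervision, infers entity types from links inside articles to produce initial annotations, and then uses several LLMs to filter for annotation quality, which reduces noise and improves consistency. The method expands the existing Luxembourgish NER data by roughly a factor of five and improves the breadth and balance of entity category coverage, providing an important new resource for multilingual and low-resource NER research.
Link: https://arxiv.org/abs/2601.00411
Authors: Alistair Plum, Laura Bernardy, Tharindu Ranasinghe
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: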
Abstract:We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLM) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.
[NLP-25] Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach
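【Quick Read】: This paper addresses the poor interpretability and weak generalization of current vision-language approaches to image geolocalization, which depend on synthetic reasoning annotations or external image retrieval. The key to the solution is Geo-R, a retrieval-free framework with two core innovations. First, the Chain of Region, a rule-based hierarchical reasoning paradigm, maps ground-truth GPS coordinates to geographic entities (e.g., country, province, city) to produce structured and interpretable supervision. Second, a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance lets the model progressively refine its predictions through spatially meaningful feedback. The approach combines geographic reasoning with direct spatial supervision, improving localization accuracy and generalization across multiple benchmarks and making the reasoning process more transparent.
Link: https://arxiv.org/abs/2601.00388
Authors: Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 1 figure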
Abstract:Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
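The Haversine distance mentioned above is a standard great-circle formula, and a reward built on it can be sketched in a few lines of Python. The exponential shaping and the `scale_km` value in `distance_reward` are assumptions for illustration only; the abstract does not say how the authors turn distance into a reward.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp guards against rounding slightly above 1 for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred: Tuple[float, float], truth: Tuple[float, float],
                    scale_km: float = 750.0) -> float:
    """Reward in (0, 1] that decays with distance (hypothetical shaping)."""
    return math.exp(-haversine_km(pred[0], pred[1], truth[0], truth[1]) / scale_km)


# A prediction exactly at the true location earns the maximum reward of 1.0.
print(distance_reward((48.8566, 2.3522), (48.8566, 2.3522)))
```

A smooth, monotone reward of this kind gives partial credit for near misses, which is one plausible reading of "spatially meaningful feedback".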
[NLP-26] BERT-JEPA: Reorganizing CLS Embeddings for Language-Invariant Semantics
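【Quick Read】: This paper addresses the collapsed [CLS] embedding space in pretrained language models: in multilingual settings the [CLS] token representation becomes undiscriminative, which limits performance on cross-lingual tasks. The key to the solution is to add a Joint Embedding Predictive Architecture (JEPA) objective as an additional self-supervised training objective, steering the [CLS] embedding space of BERT-style models toward a language-agnostic representation space and improving performance on multilingual benchmarks.
Link: https://arxiv.org/abs/2601.00366
Authors: Taj Gillin, Adam Lalani, Kenneth Zhang, Marcel Mateos Salles
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 10 figures, 10 tables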
Abstract:Joint Embedding Predictive Architectures (JEPA) are a novel self-supervised training technique that has shown recent promise across domains. We introduce BERT-JEPA (BEPA), a training paradigm that adds a JEPA training objective to BERT-style models, working to combat a collapsed [CLS] embedding space and turning it into a language-agnostic space. This new structure leads to increased performance across multilingual benchmarks.
[NLP-27] The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
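【Quick Read】: This paper asks why multilingual large language models achieve strong cross-lingual performance even though they are pretrained mostly on monolingual data; the specific role of bilingual data in pretraining corpora is still unclear. The key to the approach is a controlled experimental design: remove all bilingual documents from the original corpus (keeping only monolingual data), then systematically reintroduce parallel data or code-switching data, and quantify how each type of bilingual data affects translation, cross-lingual QA, and general reasoning. The experiments show that the drop in translation performance is mainly due to the missing parallel data, whereas cross-lingual understanding and reasoning remain stable without bilingual data. This points to translation's strong dependence on token-level alignment and to cross-lingual abilities that can emerge from monolingual data alone.
Link: https://arxiv.org/abs/2601.00364
Authors: Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu
Affiliations: University College London; Nanyang Technological University; University of Waterloo; NVIDIA; National Institute of Informatics
Subjects: Computation and Language (cs.CL)
Comments: under review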
Abstract:Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.
[NLP-28] Robust Uncertainty Quantification for Factual Generation of Large Language Models IJCNN2025
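【Quick Read】: This paper addresses the failure of traditional uncertainty quantification methods when large language models (LLMs) face non-canonical or adversarial questioning strategies, with the aim of making LLM outputs more reliable and trustworthy. The key to the solution is a new uncertainty quantification scenario for multi-fact generation, built around a purpose-designed dataset of "trap questions" containing fake names, together with a robust uncertainty quantification method (RU). Experiments on four different models show that the method clearly outperforms the baselines, raising ROCAUC by 0.1–0.2 on average, and offers a new and effective approach to mitigating LLM hallucination.
Link: https://arxiv.org/abs/2601.00348
Authors: Yuhao Zhang, Zhongliang Yang, Linna Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 tables, 5 figures, accepted to IJCNN 2025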
Abstract:The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario in the task of generating with multiple facts. We have meticulously constructed a set of trap questions containing fake names. Based on this scenario, we propose a novel and robust uncertainty quantification method (RU). A series of experiments have been conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method has demonstrated great performance, with an average increase of 0.1-0.2 in ROCAUC values compared to the best performing baseline method, providing new insights and methods for addressing the hallucination issue of LLMs.
[NLP-29] DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection
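【Quick Read】: This paper addresses the limited robustness of current depression detection models in real-world settings, in particular for "Camouflaged Depression", where individuals are in a depressive state but their language remains positive or neutral. The root cause is that in existing datasets such as DAIC-WOZ, linguistic sentiment is tightly coupled with diagnostic labels, so models rely on semantic shortcuts rather than on acoustic patterns that genuinely reflect depression. The key to the solution is the DepFlow framework, whose core innovations are: (1) a Depression Acoustic Encoder trained adversarially to learn disentangled depression embeddings that are speaker- and content-invariant while remaining discriminative for depression; (2) a flow-matching text-to-speech (TTS) model with FiLM modulation that injects these embeddings into speech synthesis in a controllable way, keeping content and speaker identity unchanged; and (3) a prototype-based severity mapping mechanism that allows smooth, interpretable adjustment along the depression continuum. The resulting augmentation strategy for camouflaged depression (CDoA) improves macro-F1 across several detection architectures (by up to 12%) and provides a controllable synthesis platform for conversational systems and simulation-based evaluation in clinically constrained settings.
Link: https://arxiv.org/abs/2601.00303
Authors: Yuxin Li, Xiangyu Zhang, Yifei Li, Zhiwei Guo, Haoyang Zhang, Eng Siong Chng, Cuntai Guan
Affiliations: Nanyang Technological University; UNSW; Peking University; Lee Kong Chian School of Medicine
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: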
Abstract:Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
[NLP-30] Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations
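【Quick Read】: This paper addresses the unclear impact of quantization on the quality and faithfulness of self-explanations (SEs) generated by large language models. Because SEs are increasingly used for transparency in high-stakes settings, whether quantization degrades them is a key concern. The key to the approach is a systematic evaluation of three common quantization techniques at different bit widths on two types of SEs: natural language explanations (NLEs) and counterfactual examples. Quantization typically reduces SE quality by up to 4.4% and faithfulness by up to 2.38%, and lowers user-perceived coherence and trustworthiness by up to 8.5%. Larger models are sensitive in terms of SE quality but better preserve faithfulness, and no single quantization method is best across task accuracy, SE quality, and faithfulness. The authors therefore stress validating SE quality for each specific use case, especially for NLEs, which are more sensitive.
Link: https://arxiv.org/abs/2601.00282
Authors: Qianli Wang, Nils Feldhus, Pepa Atanasova, Fedor Splitt, Simon Ostermann, Sebastian Möller, Vera Schmitt
Affiliations: Quality and Usability Lab, Technische Universität Berlin; University of Copenhagen; Saarland Informatics Campus; German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN); BIFOLD – Berlin Institute for the Foundations of Learning and Data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: In submission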
Abstract:Quantization is widely used to accelerate inference and streamline the deployment of large language models (LLMs), yet its effects on self-explanations (SEs) remain unexplored. SEs, generated by LLMs to justify their own outputs, require reasoning about the model’s own decision-making process, a capability that may exhibit particular sensitivity to quantization. As SEs are increasingly relied upon for transparency in high-stakes applications, understanding whether and to what extent quantization degrades SE quality and faithfulness is critical. To address this gap, we examine two types of SEs: natural language explanations (NLEs) and counterfactual examples, generated by LLMs quantized using three common techniques at distinct bit widths. Our findings indicate that quantization typically leads to moderate declines in both SE quality (up to 4.4%) and faithfulness (up to 2.38%). The user study further demonstrates that quantization diminishes both the coherence and trustworthiness of SEs (up to 8.5%). Compared to smaller models, larger models show limited resilience to quantization in terms of SE quality but better maintain faithfulness. Moreover, no quantization technique consistently excels across task accuracy, SE quality, and faithfulness. Given that quantization’s impact varies by context, we recommend validating SE quality for specific use cases, especially for NLEs, which show greater sensitivity. Nonetheless, the relatively minor deterioration in SE quality and faithfulness does not undermine quantization’s effectiveness as a model compression technique.
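[NLP-31] Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
Quick read: This paper addresses the inadequate evaluation of large language model (LLM) agents in real-world API-calling scenarios; in particular, existing benchmarks overlook the complexity of API specifications and the runtime challenges of the execution stage. The key to the solution is WildAGTEval, a new benchmark whose core contributions are: (1) an API system covering 60 distinct complexity scenarios that can be composed into roughly 32,000 test configurations, spanning realistic API specification (such as detailed documentation and usage constraints) and execution dynamics (such as noisy outputs and runtime exceptions); (2) a user-agent interaction mechanism for evaluating how well LLM agents complete tasks in complex environments. Experiments show that most scenarios are challenging for current advanced LLMs: irrelevant-information complexity reduces the performance of strong models by 27.3%, and models occasionally distort user intent in order to claim task completion, which underscores how real-world complexity affects model reliability.
Link: https://arxiv.org/abs/2601.00268
Authors: Doyoung Kim(1 and 2),Zhiwei Ren(1 and 3),Jie Hao(1),Zhongkai Sun(1),Lichao Wang(1),Xiyao Ma(1),Zack Ye(1),Xu Han(1),Jun Yin(1),Heng Ji(4),Wei Shen(1),Xing Fan(1),Benjamin Yao(1),Chenlei Guo(1) ((1) Amazon, (2) KAIST, (3) University of Pittsburgh, (4) University of Illinois Urbana-Champaign)
Affiliations: Amazon; KAIST; University of Pittsburgh; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages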
Abstract:We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents’ function-calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real-world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real-world complexity: 1. API specification, which includes detailed documentation and usage constraints, and 2. API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant information complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally distort user intent merely to claim task completion, critically affecting user satisfaction.
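[NLP-32] Parallel Universes Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation
Quick read: This paper addresses the unclear effectiveness and quality of LLM-generated multilingual counterfactuals, particularly for lower-resource languages. Its main points are as follows. First, automatic evaluation compares counterfactuals generated directly in the target language with those derived via English translation, finding that the translation-based route yields higher validity but requires substantially more modifications and still falls short of the original English counterfactuals. Second, edit patterns in high-resource European languages are remarkably similar, suggesting that cross-lingual perturbations follow common strategies. Third, four frequent error types are identified and categorized. Finally, multilingual counterfactual data augmentation (CDA) yields larger performance improvements than cross-lingual CDA, especially for lower-resource languages, although imperfections in the generated counterfactuals still limit gains in performance and robustness.
Link: https://arxiv.org/abs/2601.00263
Authors: Qianli Wang,Van Bach Nguyen,Yihong Liu,Fedor Splitt,Nils Feldhus,Christin Seifert,Hinrich Schütze,Sebastian Möller,Vera Schmitt
Affiliations: Technische Universität Berlin; University of Marburg; LMU Munich; German Research Center for Artificial Intelligence (DFKI); Munich Center for Machine Learning (MCML); BIFOLD – Berlin Institute for the Foundations of Learning and Data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In submission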
Abstract:Counterfactuals refer to minimally edited inputs that cause a model’s prediction to change, serving as a promising approach to explaining the model’s behavior. Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency. However, their effectiveness in generating multilingual counterfactuals remains unclear. To this end, we conduct a comprehensive study on multilingual counterfactuals. We first conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages. Although translation-based counterfactuals offer higher validity than their directly generated counterparts, they demand substantially more modifications and still fall short of matching the quality of the original English counterfactuals. Second, we find the patterns of edits applied to high-resource European-language counterfactuals to be remarkably similar, suggesting that cross-lingual perturbations follow common strategic principles. Third, we identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages. Finally, we reveal that multilingual counterfactual data augmentation (CDA) yields larger model performance improvements than cross-lingual CDA, especially for lower-resource languages. Yet, the imperfections of the generated counterfactuals limit gains in model performance and robustness.
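[NLP-33] Talk Less Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback
Quick read: This paper addresses the unreliable outputs of generative AI in enterprise business analytics caused by the lack of built-in verification: current conversational business analytics (CBA) systems often produce semantically inconsistent or non-executable results that users must validate manually. The key to the solution is two complementary verification techniques: Q*, which checks results automatically through reverse translation and semantic matching between code and user intent, and Feedback+, which uses execution feedback to guide iterative code refinement. Both are embedded in a generator-discriminator framework that shifts validation responsibility from the user to the system, reducing error rates and task completion time.
Link: https://arxiv.org/abs/2601.00224
Authors: Yan Sun,Ming Cai,Stanley Kok
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: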
Abstract:As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
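The idea of letting execution feedback drive code refinement can be illustrated with a generic execute-and-retry loop. The following is a minimal sketch and not the Feedback+ method itself: `run_with_feedback`, `toy_generator`, and the `sales` table are hypothetical names, and the toy generator stands in for an LLM call that would actually read the feedback string.

```python
import sqlite3

def run_with_feedback(conn, propose_sql, max_attempts=3):
    """Execute proposed SQL; on failure, pass the error text back to the proposer."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = propose_sql(feedback)  # hypothetical stand-in for an LLM call
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows, attempt
        except sqlite3.Error as exc:
            # The database error becomes the feedback for the next proposal.
            feedback = f"Query failed: {exc}"
    return None, None, max_attempts

def toy_generator():
    """Hypothetical generator: the first candidate has a typo, the second is valid."""
    candidates = iter(["SELECT totl FROM sales", "SELECT SUM(total) FROM sales"])
    def propose(feedback):
        return next(candidates)
    return propose

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (total REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (5.5,)])
sql, rows, attempts = run_with_feedback(conn, toy_generator())
print(sql, rows, attempts)
```

In a real assistant the proposer would condition on the error message; here the candidates are fixed only so that the sketch runs without a model.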
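[NLP-34] JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation
Quick read: This paper addresses the lack of a reliable, low-cost way to evaluate Japanese-English translation systems during iterative development, especially the fine-grained question of which of two good translations is better. Conventional evaluation struggles to capture how subtle differences in politeness, implicature, ellipsis, and register affect naturalness. The key to the solution is the JP-TL-Bench benchmark, which uses reference-free pairwise LLM comparisons of a candidate model against a fixed, versioned anchor set, aggregates the pairwise results with a Bradley-Terry model, and reports win rates together with a normalized 0–10 “LT” score obtained through a logistic transform, giving structurally stable, reproducible, and affordable translation quality evaluation.
Link: https://arxiv.org/abs/2601.00223
Authors: Leonard Lin,Adam Lensenmayer (Shisa.AI)
Affiliations: Shisa.AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 5 figures, 8 tables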
Abstract:We introduce JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. In this context, the challenge is often “which of these two good translations is better?” rather than “is this translation acceptable?” This distinction matters for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness. JP-TL-Bench uses a protocol built to make LLM judging both reliable and affordable: it evaluates a candidate model via reference-free, pairwise LLM comparisons against a fixed, versioned anchor set. Pairwise results are aggregated with a Bradley-Terry model and reported as win rates plus a normalized 0-10 “LT” score derived from a logistic transform of fitted log-strengths. Because each candidate is scored against the same frozen anchor set, scores are structurally stable given the same base set, judge, and aggregation code.
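To make the aggregation step concrete, the sketch below fits Bradley-Terry strengths from a matrix of pairwise win counts using the standard minorization-maximization update, then maps each log-strength through a logistic function onto a 0–10 scale. It is an illustration under assumptions: the smoothing `prior`, the geometric-mean normalization, and the exact form of `logistic_score` are choices made here, not the benchmark's published formula.

```python
import math

def fit_bradley_terry(wins, prior=0.1, n_iter=1000, tol=1e-10):
    """Fit Bradley-Terry strengths; wins[i][j] = times item i beat item j (n >= 2)."""
    n = len(wins)
    # Assumed smoothing: a small pseudo-count keeps every strength positive.
    w = [[(wins[i][j] + prior) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(w[i])
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        # Normalize to geometric mean 1 so log-strengths are centered on 0.
        log_mean = sum(math.log(x) for x in new_p) / n
        new_p = [x / math.exp(log_mean) for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p

def logistic_score(strength):
    """Hypothetical 0-10 score: logistic transform of the fitted log-strength."""
    return 10.0 / (1.0 + math.exp(-math.log(strength)))

# Toy comparison counts for three systems (rows beat columns).
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
strengths = fit_bradley_terry(wins)
scores = [logistic_score(s) for s in strengths]
print([round(s, 2) for s in scores])
```

Because the anchor set is frozen in the benchmark's protocol, a candidate's score depends only on its comparisons against the same fixed opponents, which is what makes scores comparable across runs.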
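[NLP-35] From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark
Quick read: This paper addresses two key problems in current retrieval-augmented generation (RAG) for medical large language models: the lack of PICO (Population, Intervention, Comparison, Outcome) alignment between queries and retrieved evidence, and the failure to account for the evidence hierarchy of evidence-based medicine (EBM) during reranking. The core of the solution is to integrate EBM principles systematically into a graph-based RAG pipeline: the PICO framework is introduced into knowledge graph construction and retrieval to improve semantic alignment, and a Bayesian-inspired reranking algorithm calibrates ranking scores by evidence grade without predefined weights, so that generated answers better follow clinical practice. The method is validated in sports rehabilitation, where the system reaches 0.830 nugget coverage, 0.819 answer faithfulness, and 0.788 PICOT match accuracy.
Link: https://arxiv.org/abs/2601.00216
Authors: Jinning Zhang,Jie Song,Wenhui Tu,Zecheng Li,Jingxuan Li,Jin Li,Xuan Liu,Taole Sha,Zichen Wei,Yan Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 35 pages, 5 figures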
Abstract:In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.
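[NLP-36] From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning
Quick read: This paper addresses the problem that the reasoning chains generated by multimodal large language models (MLLMs) fail to integrate visual information on tasks that depend on accurate visual perception, such as visual puzzles. The core bottleneck is that the model cannot make full use of image content for structured reasoning. The key to the solution is reward-driven reinforcement learning (RL): six reward functions target different aspects of reasoning (including image understanding, thinking steps, and answer accuracy), and group relative policy optimization (GRPO) explicitly incentivizes longer, structured visual reasoning. Without costly supervision, this improves the open-source Qwen-2.5-VL-7B by 5.56% over the base model, with consistent gains in both in-domain and out-of-domain settings.
Link: https://arxiv.org/abs/2601.00215
Authors: Omar Sharif,Eftekhar Hossain,Patrick Ng
Affiliations: Dartmouth College; University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 23 pages, 15 Figures, 10 Tables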
Abstract:Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
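As a rough illustration of how GRPO turns several reward signals into a training signal, the sketch below combines per-response reward components into a scalar and then normalizes rewards within a group of responses sampled for the same prompt. The component names, the weights, and the use of the population standard deviation are assumptions made for this example; the paper's six reward functions are not reproduced here.

```python
from statistics import mean, pstdev

def combined_reward(components, weights):
    """Weighted sum of reward components (names and weights are hypothetical)."""
    return sum(weights[name] * components[name] for name in weights)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, each scored on three assumed aspects.
weights = {"image_understanding": 0.3, "thinking_steps": 0.2, "answer_accuracy": 0.5}
group = [
    {"image_understanding": 1.0, "thinking_steps": 1.0, "answer_accuracy": 1.0},
    {"image_understanding": 1.0, "thinking_steps": 0.5, "answer_accuracy": 0.0},
    {"image_understanding": 0.0, "thinking_steps": 0.5, "answer_accuracy": 0.0},
    {"image_understanding": 0.0, "thinking_steps": 0.0, "answer_accuracy": 0.0},
]
rewards = [combined_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Responses that score above the group mean receive positive advantages and are reinforced, so no separate value network is needed to provide a baseline.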
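[NLP-37] Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
Quick read: This paper addresses the largely unstudied safety vulnerability of large language models (LLMs) in automated algorithm design, in particular intelligent optimization algorithm design. Its core contribution is MalOptBench, a benchmark of 60 malicious optimization algorithm requests, together with MOBjailbreak, a jailbreak method tailored to this scenario. Experiments on 13 mainstream LLMs (including GPT-5 and DeepSeek-V3.1) show that most models are highly susceptible: on the original harmful prompts the average attack success rate is 83.59% and the average harmfulness score is 4.28 out of 5, and the models fail almost completely under MOBjailbreak. Existing plug-and-play defenses are only marginally effective and prone to exaggerated safety behaviors, which highlights the urgent need for stronger alignment techniques against misuse of LLMs in algorithm design.
Link: https://arxiv.org/abs/2601.00213
Authors: Haoran Gu,Handing Wang,Yi Mei,Mengjie Zhang,Yaochu Jin
Affiliations: Xidian University; Victoria University of Wellington; Westlake University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: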
Abstract:The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.
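[NLP-38] Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models
Quick read: This paper addresses the difficulty of deploying temporal knowledge graph (TKG) reasoning models on resource-constrained devices: existing models rely on large parameter counts and intensive computation, which leads to high hardware cost and energy consumption and makes real-time inference hard to achieve. In addition, conventional model compression and distillation methods are mostly designed for static knowledge graphs and do not capture the temporal dependencies in TKGs, which degrades reasoning performance. The key to the solution is a distillation framework designed specifically for TKG reasoning that uses large language models as teachers to guide lightweight student models in learning structural and temporal reasoning. By combining large-scale public knowledge with task-specific temporal information, it improves the ability of the student model to capture temporal dynamics while keeping it compact and efficient, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
Link: https://arxiv.org/abs/2601.00202
Authors: Wang Xing,Wei Song,Siyu Lin,Chen Wu,Zhesi Li,Man Wang
Affiliations: Xidian University; Southwest Jiaotong University; Chongqing Jiaotong University; Chang’an University
Categories: Computation and Language (cs.CL)
Comments: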
Abstract:Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model’s ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
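[NLP-39] StockBot 2.0: Vanilla LSTMs Outperform Transformer-based Forecasting for Stock Prices
Quick read: This paper addresses accurate forecasting of financial markets, where the core challenges are complex temporal dependencies, non-linear dynamics, and high volatility in the time series. The author presents an enhanced StockBot architecture that systematically evaluates attention-based, convolutional, and recurrent time-series forecasting models in a unified experimental setting. Although attention-based and Transformer-style models offer more modeling flexibility, a carefully constructed vanilla long short-term memory (LSTM) network achieves the best predictive accuracy and the most stable buy/sell decisions under a common set of default hyperparameters. This suggests that, when data are limited or discretized to single-day intervals, the architectural inductive bias of recurrent sequence models gives them greater robustness and data efficiency for financial time-series forecasting.
Link: https://arxiv.org/abs/2601.00197
Authors: Shaswat Mohanty
Affiliations: Stanford University
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures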
Abstract:Accurate forecasting of financial markets remains a long-standing challenge due to complex temporal and often latent dependencies, non-linear dynamics, and high volatility. Building on our earlier recurrent neural network framework, we present an enhanced StockBot architecture that systematically evaluates modern attention-based, convolutional, and recurrent time-series forecasting models within a unified experimental setting. While attention-based and transformer-inspired models offer increased modeling flexibility, extensive empirical evaluation reveals that a carefully constructed vanilla LSTM consistently achieves superior predictive accuracy and more stable buy/sell decision-making when trained under a common set of default hyperparameters. These results highlight the robustness and data efficiency of recurrent sequence models for financial time-series forecasting, particularly in the absence of extensive hyperparameter tuning or the availability of sufficient data when discretized to single-day intervals. Additionally, these results underscore the importance of architectural inductive bias in data-limited market prediction tasks.
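[NLP-40] Understanding Emotion in Discourse: Recognition Insights and Linguistic Patterns for Generation
Quick read: This paper addresses two gaps in Emotion Recognition in Conversation (ERC): a limited understanding of which architectural choices actually matter, and the lack of linguistic analysis connecting emotion recognition to generation. The core of the solution is a systematic analysis of the IEMOCAP dataset. First, a rigorous ablation study (10-seed evaluation) yields three key findings: (1) conversational context is paramount, with 90% of the total gain reached within the most recent 10–30 turns; (2) hierarchical sentence representations help only at the utterance level, and the benefit disappears once conversational context is provided, suggesting that context subsumes intra-utterance structure; (3) external affective lexicons (such as SenticNet) provide no gain, indicating that pre-trained encoders already capture the necessary emotional semantics. Second, a statistical analysis of 5,286 discourse marker occurrences finds a significant association between emotion and marker position (p < 0.0001); in particular, “sad” utterances use left-periphery markers less often (21.9% versus 28–32% for other emotions), consistent with theories linking left-periphery markers to active discourse management, which helps explain why sadness benefits most from context (+22 percentage points). Overall, a simple architecture with strictly causal context outperforms prior text-only methods (weighted F1 of 82.69% for 4-way and 67.07% for 6-way classification) and builds a linguistic bridge from recognition to generation.
Link: https://arxiv.org/abs/2601.00181
Authors: Cheonkam Jeong,Adeline Nyamathi
Affiliations: University of California, Irvine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: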
Abstract:While Emotion Recognition in Conversation (ERC) has achieved high accuracy, two critical gaps remain: a limited understanding of which architectural choices actually matter, and a lack of linguistic analysis connecting recognition to generation. We address both gaps through a systematic analysis of the IEMOCAP dataset. For recognition, we conduct a rigorous ablation study with 10-seed evaluation and report three key findings. First, conversational context is paramount, with performance saturating rapidly – 90% of the total gain achieved within just the most recent 10–30 preceding turns (depending on the label set). Second, hierarchical sentence representations help at utterance-level, but this benefit disappears once conversational context is provided, suggesting that context subsumes intra-utterance structure. Third, external affective lexicons (SenticNet) provide no gain, indicating that pre-trained encoders already capture necessary emotional semantics. With simple architectures using strictly causal context, we achieve 82.69% (4-way) and 67.07% (6-way) weighted F1, outperforming prior text-only methods including those using bidirectional context. For linguistic analysis, we analyze 5,286 discourse marker occurrences and find a significant association between emotion and marker positioning (p < .0001). Notably, “sad” utterances exhibit reduced left-periphery marker usage (21.9%) compared to other emotions (28–32%), consistent with theories linking left-periphery markers to active discourse management. This connects to our recognition finding that sadness benefits most from context (+22%p): lacking explicit pragmatic signals, sad utterances require conversational history for disambiguation.
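[NLP-41] Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description
Quick read: This paper addresses the inability of existing evaluation methods to assess the long-form structural coherence and statutory compliance of patent descriptions; in particular, for patent text drafted by large language models there is no systematic tool for checking legal requirements such as enablement and written description. The key to the solution is Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Its core innovation is Chain-of-Legal-Thought (CoLT), a legally constrained reasoning mechanism that enforces sequential, patent-law-specific analysis. Experiments report a Pearson correlation of 0.73 on the Legal-Professional Compliance dimension, supporting the importance of explicitly injecting statutory constraints for capturing legal validity.
Link: https://arxiv.org/abs/2601.00166
Authors: Yongmin Yoo,Kris W Pan
Affiliations: Macquarie University; Amazon
Categories: Computation and Language (cs.CL)
Comments: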
Abstract:Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by patent expert on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.
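[NLP-42] The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs
Quick read: This paper addresses how to automatically extract causal feedback fuzzy cognitive maps (FCMs) from raw text. The core challenge is getting a large language model (LLM) to impose structure on complex text and infer causal relations while keeping the resulting dynamical system interpretable. The key to the solution is an LLM-based agent guided by a sequence of three system instructions: it first extracts key nouns and noun phrases, then selects FCM concept nodes from among them, and finally extracts or infers fuzzy causal edges between those nodes. The process is bidirectional: the FCM equilibria (fixed-point attractors and limit cycles) drive the LLM agent to fetch further causal text, and the fetched text can in turn modify the FCM structure, which gives the system a degree of autonomy. Experiments show that the generated FCMs converge to the same equilibrium limit cycles as the human-generated FCMs even though they differ in the number of nodes and edges, and that mixing FCMs from separate LLM agents (Gemini and ChatGPT) creates new equilibria that better approximate the underlying causal dynamical system.
Link: https://arxiv.org/abs/2601.00097
Authors: Akash Kumar Panda,Olaoluwa Adigun,Bart Kosko
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 15 figures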
Abstract:We design a large-language-model (LLM) agent that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both because of the LLM’s semi-autonomy and because ultimately the FCM dynamical system’s equilibria drive the LLM agents to fetch and process causal text. The fetched text can in principle modify the adaptive FCM causal structure and so modify the source of its quasi-autonomy–its equilibrium limit cycles and fixed-point attractors. This bidirectional process endows the evolving FCM dynamical system with a degree of autonomy while still staying on its agentic leash. We show in particular that a sequence of three finely tuned system instructions guide an LLM agent as it systematically extracts key nouns and noun phrases from text, as it extracts FCM concept nodes from among those nouns and noun phrases, and then as it extracts or infers partial or fuzzy causal edges between those FCM nodes. We test this FCM generation on a recent essay about the promise of AI from the late diplomat and political theorist Henry Kissinger and his colleagues. This three-step process produced FCM dynamical systems that converged to the same equilibrium limit cycles as did the human-generated FCMs even though the human-generated FCM differed in the number of nodes and edges. A final FCM mixed generated FCMs from separate Gemini and ChatGPT LLM agents. The mixed FCM absorbed the equilibria of its dominant mixture component but also created new equilibria of its own to better approximate the underlying causal dynamical system.
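The notions of fixed-point attractors and limit cycles can be made concrete with a small simulation. The sketch below assumes a simple binary-threshold FCM, where a concept node switches on when its summed causal input exceeds a threshold; the weight matrices, the threshold rule, and the function names are illustrative assumptions, and the FCMs in the paper may use graded (fuzzy) activations instead.

```python
def fcm_step(state, weights, threshold=0.0):
    """One synchronous update: node j turns on if its causal input exceeds the threshold."""
    n = len(state)
    return tuple(
        1 if sum(state[i] * weights[i][j] for i in range(n)) > threshold else 0
        for j in range(n)
    )

def find_equilibrium(start, weights, max_steps=100):
    """Iterate until a state repeats; return the repeating states.

    A result of length 1 is a fixed point, a longer one is a limit cycle.
    Returns None if no state repeats within max_steps.
    """
    seen = {}
    trajectory = []
    state = tuple(start)
    for t in range(max_steps):
        if state in seen:
            return trajectory[seen[state]:]
        seen[state] = t
        trajectory.append(state)
        state = fcm_step(state, weights)
    return None

# weights[i][j] is the causal edge from concept i to concept j (toy values).
ring = [[0, 1, 0],
        [0, 0, 1],
        [1, 0, 0]]
mutual = [[0, 1],
          [1, 0]]
cycle = find_equilibrium((1, 0, 0), ring)
fixed = find_equilibrium((1, 1), mutual)
print(cycle, fixed)
```

Two maps with different nodes and edges can still be compared by the equilibria they reach from matching initial states, which is the kind of comparison the abstract describes between generated and human-built FCMs.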
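[NLP-43] Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning
Quick read: This paper addresses the efficiency and generality challenges that large language models (LLMs) face in structured inference, where outputs must satisfy complex constraints, for example JSON schema enforcement and multilingual parsing. The key to the solution is MetaJuLS, a meta-reinforcement learning framework that learns universal constraint propagation policies. It formulates structured inference as adaptive constraint propagation and trains a Graph Attention Network with meta-learning so that the policy transfers across languages and tasks, adapting to new settings in 5–10 gradient steps without task-specific retraining. It achieves 1.5–2.0× speedups over GPU-optimized baselines while staying within 0.2% of the accuracy of state-of-the-art parsers.
Link: https://arxiv.org/abs/2601.00095
Authors: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Affiliations: Iowa State University
Categories: Computation and Language (cs.CL)
Comments: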
Abstract:Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5–2.0× speedups over GPU-optimized baselines while maintaining within 0.2% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5–10 gradient steps (5–15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.
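[NLP-44] RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning
Quick read: This paper addresses the unreliable tool use of large language models (LLMs) in domain-specific settings, especially when APIs are idiosyncratic, under-documented, or tailored to private workflows. The key to the solution is RIMRULE, a neuro-symbolic approach based on dynamic rule injection: compact, interpretable rules are distilled from failure traces and injected into the prompt at inference time to improve task performance. The rules are consolidated with a Minimum Description Length (MDL) objective that favors generality and conciseness, and each rule is stored in both natural language and a structured symbolic form for efficient retrieval. The rules can be reused across models, and accuracy improves on both seen and unseen tools without modifying LLM weights.
Link: https://arxiv.org/abs/2601.00086
Authors: Xiang Gao,Yuguang Yao,Qi Zhang,Kaiwen Dong,Avinash Baidya,Ruocheng Guo,Hilaf Hasson,Kamalika Das
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: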
Abstract:Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.
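[NLP-45] The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
Quick read: This paper exposes a supply-chain vulnerability introduced by tokenizer transplant, the step that lets model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) work across model families in the open-weight large language model (LLM) ecosystem. The key idea is to exploit the geometry of coefficient reuse to engineer a single “breaker token” that is functionally inert in the donor model but reliably reconstructs into a high-salience malicious feature after transplant into a base model, sabotaging generation in the base model while leaving the behavior of the donor statistically unchanged. The attack is formalized as a dual-objective optimization problem and instantiated with a sparse solver; it is training-free, achieves spectral mimicry to evade outlier detection, and persists through fine-tuning and weight merging, revealing a hidden risk in modular AI composition pipelines.
Link: https://arxiv.org/abs/2601.00065
Authors: Xiaoze Liu,Weichen Yu,Matt Fredrikson,Xiaoqian Wang,Jing Gao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: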
Abstract:The open-weight LLM ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single “breaker token” that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap that sabotages the base model’s generation while leaving the donor’s utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and achieves spectral mimicry to evade outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at this https URL
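[NLP-46] Finetuning Large Language Models for Automated Depression Screening in Nigerian Pidgin English: GENSCORE Pilot Study
Quick read: This paper addresses three barriers to depression screening in Nigeria and other low- and middle-income countries: limited access to clinicians, stigma, and language barriers. Traditional tools such as the Patient Health Questionnaire-9 (PHQ-9) were validated in high-income countries but may be linguistically or culturally inaccessible in the multilingual setting of Nigeria (Nigerian Pidgin and more than 520 local languages). The key to the solution is an automated depression screening system built on fine-tuned large language models (LLMs) adapted for Nigerian Pidgin. The team collected 432 Pidgin-language audio responses from young adults aged 18–40 on psychological experiences aligned with PHQ-9 items, then transcribed, preprocessed, and annotated them, including semantic labeling and slang and idiom interpretation. GPT-4.1 performed best both quantitatively (94.5% accuracy) and qualitatively (clarity, relevance, and cultural appropriateness), supporting the feasibility of local-language generative AI for accessible mental-health screening in resource-constrained settings.
Link: https://arxiv.org/abs/2601.00004
Authors: Isaac Iyinoluwa Olufadewa,Miracle Ayomikun Adesina,Ezekiel Ayodeji Oladejo,Uthman Babatunde Usman,Owen Kolade Adeniyi,Matthew Tolulope Olawoyin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 1 figure, 4 tables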
Abstract:Depression is a major contributor to the mental-health burden in Nigeria, yet screening coverage remains limited due to low access to clinicians, stigma, and language barriers. Traditional tools like the Patient Health Questionnaire-9 (PHQ-9) were validated in high-income countries but may be linguistically or culturally inaccessible for low- and middle-income countries and communities such as Nigeria where people communicate in Nigerian Pidgin and more than 520 local languages. This study presents a novel approach to automated depression screening using fine-tuned large language models (LLMs) adapted for conversational Nigerian Pidgin. We collected a dataset of 432 Pidgin-language audio responses from Nigerian young adults aged 18-40 to prompts assessing psychological experiences aligned with PHQ-9 items, then performed transcription, rigorous preprocessing, and annotation, including semantic labeling, slang and idiom interpretation, and PHQ-9 severity scoring. Three LLMs - Phi-3-mini-4k-instruct, Gemma-3-4B-it, and GPT-4.1 - were fine-tuned on this annotated dataset, and their performance was evaluated quantitatively (accuracy, precision and semantic alignment) and qualitatively (clarity, relevance, and cultural appropriateness). GPT-4.1 achieved the highest quantitative performance, with 94.5% accuracy in PHQ-9 severity scoring prediction, outperforming Gemma-3-4B-it and Phi-3-mini-4k-instruct. Qualitatively, GPT-4.1 also produced the most culturally appropriate, clear, and contextually relevant responses. This work provides a foundation for AI-mediated depression screening in underserved Nigerian communities and for deploying conversational mental-health tools in linguistically diverse, resource-constrained environments.
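The PHQ-9 severity scoring that the models are asked to predict follows a simple rule: nine items, each scored 0 to 3, are summed to a total between 0 and 27, and the total is mapped to a severity band. A minimal Python sketch of that mapping with the standard PHQ-9 cut-offs is given below. The function name `phq9_severity` is illustrative and does not come from the paper, and the paper's own label set may differ.

```python
def phq9_severity(item_scores):
    """Map nine PHQ-9 item scores (each 0-3) to (total, severity band)."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each item is scored 0, 1, 2, or 3")
    total = sum(item_scores)
    # Standard PHQ-9 cut-offs at 5, 10, 15, and 20 points.
    if total <= 4:
        band = "minimal"
    elif total <= 9:
        band = "mild"
    elif total <= 14:
        band = "moderate"
    elif total <= 19:
        band = "moderately severe"
    else:
        band = "severe"
    return total, band
```

A classifier that predicts the band directly, as the study appears to do, can be checked against this rule whenever item-level scores are available.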
[NLP-47] Reasoning in Action: MCTS-Driven Knowledge Retrieval for Large Language Models
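[Quick Read]: This paper addresses the difficulty of effectively combining retrieval and reasoning strategies to optimize large language models (LLMs) in practice. Existing methods typically rely on semantic similarity for knowledge retrieval or on improving the model's reasoning ability, but lack a deep understanding of the logical structure of a conversation, so retrieved information becomes disconnected from the reasoning process. The key to the solution is a reasoning-aware knowledge retrieval method with a coarse-to-fine, two-stage mechanism: it first locates a sub-region of the knowledge base relevant to the context topic, ensuring the overall relevance of its sentences, and then selects within that sub-region the knowledge most closely tied to the reasoning chain, which makes generated responses more informative and creative. Throughout, a search strategy inspired by Monte Carlo Tree Search uses keywords to traverse knowledge sentences efficiently, bringing the retrieved results closer to the implicit reasoning in human conversations.
Link: https://arxiv.org/abs/2601.00003
Authors: Shuqi Liu, Bowei He, Chen Ma, Linqi Song
Affiliations: City University of Hong Kong; City University of Hong Kong Shenzhen Research Institute
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: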
Abstract:Large language models (LLMs) typically enhance their performance through either the retrieval of semantically similar information or the improvement of their reasoning capabilities. However, a significant challenge remains in effectively integrating both retrieval and reasoning strategies to optimize LLM performance. In this paper, we introduce a reasoning-aware knowledge retrieval method that enriches LLMs with information aligned to the logical structure of conversations, moving beyond surface-level semantic similarity. We follow a coarse-to-fine approach for knowledge retrieval. First, we identify a contextually relevant sub-region of the knowledge base, ensuring that all sentences within it are relevant to the context topic. Next, we refine our search within this sub-region to extract knowledge that is specifically relevant to the reasoning process. Throughout both phases, we employ the Monte Carlo Tree Search-inspired search method to effectively navigate through knowledge sentences using common keywords. Experiments on two multi-turn dialogue datasets demonstrate that our knowledge retrieval approach not only aligns more closely with the underlying reasoning in human conversations but also significantly enhances the diversity of the retrieved knowledge, resulting in more informative and creative responses.
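The abstract does not spell out the search procedure, so the following is only a generic sketch of the selection rule used in standard Monte Carlo Tree Search (UCT), applied to hypothetical candidate knowledge sentences. The candidate identifiers, the reward statistics, and the exploration constant `c` are illustrative assumptions, not the authors' method.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Upper Confidence bound for Trees: mean reward plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited candidates are always tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_candidate(stats, c=1.4):
    """Pick the candidate id with the highest UCT score.

    stats maps a candidate id to a (total_reward, visits) pair.
    """
    parent_visits = max(1, sum(v for _, v in stats.values()))
    return max(
        stats,
        key=lambda k: uct_score(stats[k][0], stats[k][1], parent_visits, c),
    )
```

In a retrieval setting, the reward could come from keyword overlap between a candidate sentence and the dialogue context, but that choice is again an assumption made for illustration.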
[NLP-48] Learning Speech Representations with Variational Predictive Coding ACL
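[Quick Read]: This paper addresses the lack of a theoretical foundation for HuBERT (Hidden Unit BERT) in speech representation learning, which has made the objective hard to improve further. It argues that the HuBERT objective implicitly follows the principle of predictive coding, and that this principle has a precise formulation under variational inference. The key to the solution is to recast the HuBERT objective as a general framework based on variational predictive coding, which opens new options for parameterization and optimization. Two simple modifications validate this view, and the paper shows close connections to other objectives such as APC, CPC, wav2vec, and BEST-RQ. The result is significant improvement on four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition.
Link: https://arxiv.org/abs/2601.00100
Authors: Sung-Lin Yeh, Peter Bell, Hao Tang
Affiliations: University of Edinburgh
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Accepted to Transactions of the Association for Computational Linguistics (TACL); Pre MIT Press version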
Abstract:Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.
Computer Vision
[CV-0] AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
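[Quick Read]: This paper addresses two core challenges in reconstructing dynamic 3D scenes from monocular video. First, methods based on single Gaussian primitives act as low-pass filters and struggle to capture high-frequency appearance detail, while standard Gabor functions tend to introduce energy instability. Second, the lack of temporal continuity constraints leads to motion artifacts during interpolation. The key to the solution is the AdaGaR framework, whose main contributions are: (1) an Adaptive Gabor Representation that uses learnable frequency weights and adaptive energy compensation to retain detail while improving stability; (2) Cubic Hermite Splines with Temporal Curvature Regularization for smooth temporal motion; and (3) an adaptive initialization mechanism that combines depth estimation, point tracking, and foreground masks to establish a stable point cloud distribution early in training. The method reports state-of-the-art performance on Tap-Vid DAVIS and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis.
Link: https://arxiv.org/abs/2601.00796
Authors: Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; University of Zaragoza
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL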
Abstract:Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce Adaptive Gabor Representation, extending Gaussians through learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we employ Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combining depth estimation, point tracking, and foreground masks establishes stable point cloud distributions in early training. Experiments on Tap-Vid DAVIS demonstrate state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: this https URL
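For readers unfamiliar with cubic Hermite splines, the sketch below evaluates one scalar spline segment from two keyframe values and their tangents. It illustrates the general interpolation formula only. How AdaGaR parameterizes control points and applies Temporal Curvature Regularization is not specified in the abstract, and the function name is hypothetical.

```python
def hermite_segment(p0, m0, p1, m1, t):
    """Evaluate a cubic Hermite segment at t in [0, 1].

    p0, p1: values at the segment ends.
    m0, m1: tangents at the ends, per unit of the normalized parameter t.
    """
    t2 = t * t
    t3 = t2 * t
    h00 = 2 * t3 - 3 * t2 + 1   # weight of p0
    h10 = t3 - 2 * t2 + t       # weight of m0
    h01 = -2 * t3 + 3 * t2      # weight of p1
    h11 = t3 - t2               # weight of m1
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1
```

Because neighboring segments share both the value and the tangent at each keyframe, the resulting trajectory has a continuous first derivative, which is the property that makes such splines attractive for smooth motion.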
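[CV-1] Two Deep Learning Approaches for Automated Segmentation of Left Ventricle in Cine Cardiac MRI
[Quick Read]: This paper addresses accurate segmentation of the left ventricle (LV) in short-axis cine cardiac MRI to support quantitative analysis and diagnosis of clinical cardiac images. The key to the solution is two new deep learning architectures, LNU-Net and IBU-Net, both based on U-Net, which improve segmentation by changing the normalization scheme: LNU-Net applies layer normalization (LN) in every convolutional block, while IBU-Net combines instance normalization and batch normalization in the first convolutional block and passes the result to subsequent layers. The authors also process the image data with affine transformations and elastic deformations, and report that on a dataset of 805 MRI images the proposed methods outperform current mainstream methods on the Dice coefficient and the average perpendicular distance.
Link: https://arxiv.org/abs/2601.00794
Authors: Wenhui Chu, Nikolaos V. Tsekos
Affiliations: University of Houston
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures, published in ICBBB 2022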
Abstract:Left ventricle (LV) segmentation is critical for clinical quantification and diagnosis of cardiac images. In this work, we propose two novel deep learning architectures called LNU-Net and IBU-Net for left ventricle segmentation from short-axis cine MRI images. LNU-Net is derived from the layer normalization (LN) U-Net architecture, while IBU-Net is derived from the instance-batch normalized (IB) U-Net for medical image segmentation. The architectures of LNU-Net and IBU-Net have a down-sampling path for feature extraction and an up-sampling path for precise localization. We use the original U-Net as the basic segmentation approach and compare it with our proposed architectures. The two architectures differ in how they normalize: LNU-Net applies layer normalization in each convolutional block, while IBU-Net incorporates instance and batch normalization together in the first convolutional block and passes its result to the next layer. Our method incorporates affine transformations and elastic deformations for image data processing. We evaluate on a dataset of 805 left-ventricle MRI images from 45 patients. Experimentally, the proposed approaches outperform other state-of-the-art approaches on the Dice coefficient and the average perpendicular distance.
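The Dice coefficient used for evaluation measures the overlap between a predicted mask and a ground-truth mask. The sketch below computes it for binary masks stored as nested lists. It is a plain illustration of the metric, not the authors' evaluation code, and the handling of two empty masks is a convention chosen here.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal size.

    Masks are nested lists of 0/1 values (rows of pixels).
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    size_sum = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    if size_sum == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / size_sum
```

A value of 1.0 means the two masks coincide and 0.0 means they do not overlap at all.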
[CV-2] Fusion-SSAT: Unleashing the Potential of Self-supervised Auxiliary Task by Feature Fusion for Generalized Deepfake Detection
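[Quick Read]: This paper addresses the limited generalization of models for generalised deepfake detection by introducing self-supervised learning as an auxiliary task that optimizes the primary task. The key to the solution is fusing the feature representations of the self-supervised auxiliary task and the primary task. The fused representation draws on the strengths of both and yields more discriminative features, which markedly improves generalization in cross-dataset evaluation. Experiments on several public datasets (such as DF40, FaceForensics++, and Celeb-DF) show that the method outperforms current state-of-the-art detectors.
Link: https://arxiv.org/abs/2601.00789
Authors: Shukesh Reddy, Srijan Das, Abhijit Das
Affiliations: Birla Institute of Technology and Sciences, Pilani; University of North Carolina at Charlotte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: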
Abstract:In this work, we attempted to unleash the potential of self-supervised learning as an auxiliary task that can optimise the primary task of generalised deepfake detection. To explore this, we examined different combinations of the training schemes for these tasks that can be most effective. Our findings reveal that fusing the feature representation from self-supervised auxiliary tasks is a powerful feature representation for the problem at hand. Such a representation can leverage the ultimate potential and bring in a unique representation of both the self-supervised and primary tasks, achieving better performance for the primary task. We experimented on a large set of datasets, which includes DF40, FaceForensics++, Celeb-DF, DFD, FaceShifter, UADFV, and our results showed better generalizability on cross-dataset evaluation when compared with current state-of-the-art detectors.
[CV-3] FedHypeVAE: Federated Learning with Hypernetwork Generated Conditional VAEs for Differentially Private Embedding Sharing
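[Quick Read]: This paper addresses two challenges in embedding-level data synthesis for federated learning: non-IID client data, which degrades generation quality and destabilizes model performance, and the lack of formal privacy protection at the gradient level in existing methods, which risks gradient leakage. The key to the solution is the FedHypeVAE framework, whose bi-level design unifies personalized generation, differential privacy, and distribution alignment. First, a shared hypernetwork generates a personalized decoder and class-conditional prior for each client, so personalization lives in the generative layer rather than the downstream task model. Second, the hypernetwork is trained under differential privacy, aggregating only clipped, noise-perturbed gradients so that communicated parameters do not leak local data. In addition, a local MMD alignment term and a Lipschitz regularizer improve stability and distributional coherence under non-IID conditions. Finally, a neutral meta-code enables domain-agnostic synthesis, while mixtures of meta-codes provide controllable multi-domain coverage, giving a principled and practical approach to privacy-preserving data synthesis in federated settings.
Link: https://arxiv.org/abs/2601.00785
Authors: Sunny Gupta, Amit Sethi
Affiliations: Indian Institute of Technology Bombay
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 1 figure, Accepted at AAI’26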
Abstract:Federated data sharing promises utility without centralizing raw data, yet existing embedding-level generators struggle under non-IID client heterogeneity and provide limited formal protection against gradient leakage. We propose FedHypeVAE, a differentially private, hypernetwork-driven framework for synthesizing embedding-level data across decentralized clients. Building on a conditional VAE backbone, we replace the single global decoder and fixed latent prior with client-aware decoders and class-conditional priors generated by a shared hypernetwork from private, trainable client codes. This bi-level design personalizes the generative layer, rather than the downstream model, while decoupling local data from communicated parameters. The shared hypernetwork is optimized under differential privacy, ensuring that only noise-perturbed, clipped gradients are aggregated across clients. A local MMD alignment between real and synthetic embeddings and a Lipschitz regularizer on hypernetwork outputs further enhance stability and distributional coherence under non-IID conditions. After training, a neutral meta-code enables domain-agnostic synthesis, while mixtures of meta-codes provide controllable multi-domain coverage. FedHypeVAE unifies personalization, privacy, and distribution alignment at the generator level, establishing a principled foundation for privacy-preserving data synthesis in federated settings. Code: this http URL
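The aggregation of clipped, noise-perturbed gradients mentioned in the abstract can be sketched in the style of DP-SGD. This is a generic illustration under assumed names (`clip_norm`, `noise_multiplier`). The paper's exact mechanism, noise calibration, and privacy accounting are not given in the abstract.

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_aggregate(client_grads, clip_norm, noise_multiplier, rng=None):
    """Average clipped client gradients after adding Gaussian noise to their sum."""
    rng = rng or random.Random()
    dim = len(client_grads[0])
    total = [0.0] * dim
    for grad in client_grads:
        for i, g in enumerate(clip_gradient(grad, clip_norm)):
            total[i] += g
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(client_grads) for x in noisy]
```

Clipping bounds the influence of any single client on the sum, which is what lets the added Gaussian noise be calibrated to a fixed sensitivity.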
[CV-4] Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
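[Quick Read]: This paper addresses the lack of effective methods for audio deepfake detection, in particular the largely unexplored potential of Multimodal Large Language Models (MLLMs) for this task. The key to the solution is to combine audio inputs with a range of text prompts, using question-answer style prompts that carry contextual information and ask for a binary decision, to guide MLLMs toward robust cross-modal feature representations. Experiments suggest that this feature-guided reasoning helps the models discriminate audio deepfakes; with a small amount of labeled data in particular, the models perform well on in-domain data, which indicates the potential of the approach for audio deepfake detection.
Link: https://arxiv.org/abs/2601.00777
Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
Affiliations: Indian Institute of Technology, Ropar, India; Machine Intelligence Group, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, India; Monash University, Melbourne, Australia
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IJCB 2025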
Abstract:While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we aim to explore the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to assess whether MLLMs can learn robust representations across modalities for audio deepfake detection. To this end, we explore text-aware and context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. Our experiments show that the models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
[CV-5] Unified Primitive Proxies for Structured Shape Completion
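[Quick Read]: This paper addresses how to efficiently recover structured geometry in 3D shape completion. Conventional methods typically represent the missing geometry as unstructured point clouds, which does not support primitive-based surface reconstruction. The key to the solution is UniCo, a unified feed-forward framework that decodes primitives in a dedicated pathway and uses learnable "primitive proxies" to produce a set of primitives with complete geometry, semantics, and inlier membership. An online target update strategy couples the optimization of primitives and points to keep training consistent. On synthetic and real-world benchmarks the method clearly outperforms existing baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%.
Link: https://arxiv.org/abs/2601.00759
Authors: Zhaiyu Chen, Yuqing Wang, Xiao Xiang Zhu
Affiliations: Technical University of Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: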
Abstract:Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. These results establish an attractive recipe for structured 3D understanding from incomplete data. Project page: this https URL.
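The Chamfer distance reported above compares two point sets by matching every point to its nearest neighbor in the other set. The sketch below uses the mean squared nearest-neighbor distance in each direction. Other variants (unsquared or summed) are also common, and the abstract does not say which one the benchmark uses, so treat the exact form as an assumption.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Each point is a sequence of coordinates; both sets share one dimension.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

This brute-force version costs O(|A| * |B|) distance evaluations; practical pipelines usually rely on a spatial index or a GPU implementation.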
[CV-6] Grading Handwritten Engineering Exams with Multimodal Large Language Models
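[Quick Read]: This paper addresses the low efficiency and poor scalability of manually grading handwritten STEM (science, technology, engineering, and mathematics) exams at scale, while keeping the traditional exam process intact (A4 paper, unconstrained student handwriting). The key to the solution is an end-to-end grading workflow built on multimodal large language models (LLMs). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the system converts the reference into a text-only summary that conditions grading without exposing the original scan. A multi-stage design, comprising a format/presence check that prevents grading blank answers, an ensemble of independent graders, supervisor aggregation, and deterministic template validation, produces reliable, auditable, machine-parseable grading reports. Empirical results indicate that the method improves grading efficiency and consistency while preserving established teaching practice, and that structured prompting and reference grounding are essential for accuracy.
Link: https://arxiv.org/abs/2601.00730
Authors: Janez Perš, Jon Muhovič, Andrej Košir, Boštjan Murovec
Affiliations: University of Ljubljana, Faculty of Electrical Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 2 tables. Supplementary material available at this https URL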
Abstract:Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves an approximately 8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of approximately 17% at D_max = 40. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
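To make the ensemble-plus-review idea concrete, here is a deliberately simplified sketch. It rests on two assumptions that the abstract does not confirm: the median stands in for the supervisor aggregation step (which in the paper appears to be model-based), and D_max is read as the largest tolerated spread, in percentage points, between independent graders before a manual review is triggered. The function name is hypothetical.

```python
from statistics import median

def aggregate_grades(grades, d_max=40.0):
    """Combine independent grader scores (0-100) into one grade.

    Returns (grade, needs_review). The median is a stand-in for the
    supervisor step, and needs_review is True when the spread between
    the highest and lowest grader exceeds d_max (an assumed reading
    of the D_max threshold).
    """
    if not grades:
        raise ValueError("at least one grade is required")
    spread = max(grades) - min(grades)
    return median(grades), spread > d_max
```

For example, three graders returning 70, 75, and 80 would yield a grade of 75 with no review, while 20, 50, and 90 would yield 50 and flag the answer for a human.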
[CV-7] Multi-Level Feature Fusion for Continual Learning in Visual Quality Inspection
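[Quick Read]: This paper addresses continual learning for deep neural networks in visual quality inspection for manufacturing, particularly in dynamic settings such as remanufacturing, where products and defect patterns change frequently and models must adapt quickly to new conditions while avoiding catastrophic forgetting. The key to the solution is a multi-level feature fusion (MLFF) method that uses feature representations from different depths of a pretrained network. It significantly reduces the number of trainable parameters while improving generalization robustness to new product types or defects and mitigating catastrophic forgetting.
Link: https://arxiv.org/abs/2601.00725
Authors: Johannes C. Bauer, Paul Geng, Stephan Trattnig, Petr Dokládal, Rüdiger Daub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2025 IEEE 13th International Conference on Control, Mechatronics and Automation (ICCMA)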
Abstract:Deep neural networks show great potential for automating various visual quality inspection tasks in manufacturing. However, their applicability is limited in more volatile scenarios, such as remanufacturing, where the inspected products and defect patterns often change. In such settings, deployed models require frequent adaptation to novel conditions, effectively posing a continual learning problem. To enable quick adaptation, the necessary training processes must be computationally efficient while still avoiding effects like catastrophic forgetting. This work presents a multi-level feature fusion (MLFF) approach that aims to improve both aspects simultaneously by utilizing representations from different depths of a pretrained network. We show that our approach is able to match the performance of end-to-end training for different quality inspection problems while using significantly fewer trainable parameters. Furthermore, it reduces catastrophic forgetting and improves generalization robustness to new product types or defects.
[CV-8] Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
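[Quick Read]: This paper addresses the problem that Vision-Language Models (VLMs) in medical image analysis can lose performance under input data distribution shift in ways that are hard to detect, especially without labeled data. The key to the solution is a two-sided framework that combines input-level detection of data distribution shift with output-level monitoring of changes in prediction confidence. On the input side, the authors develop DomainSAT, a lightweight toolbox for intuitively identifying input data shift; on the output side, they introduce a label-free confidence indicator that captures changes in prediction behavior and is found to correlate closely with actual performance degradation. Experiments show that combining the two enables more reliable and interpretable detection of VLM performance degradation in digital pathology.
Link: https://arxiv.org/abs/2601.00716
Authors: Hao Guan, Li Zhou
Affiliations: Brigham and Women’s Hospital; Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures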
Abstract:Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
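A label-free, confidence-based indicator can take many forms, and the abstract does not define the one used in the paper. One simple possibility, shown here purely as an assumed illustration, is the drop in mean maximum softmax probability between a reference batch and a deployed batch. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mean_confidence(batch_logits):
    """Average maximum softmax probability over a batch of predictions."""
    return sum(max(softmax(row)) for row in batch_logits) / len(batch_logits)

def confidence_drop(reference_logits, deployed_logits):
    """Positive values mean the model is less confident on deployed data."""
    return mean_confidence(reference_logits) - mean_confidence(deployed_logits)
```

Such an indicator needs no labels at deployment time, which matches the monitoring setting described above, although a drop in confidence does not by itself prove a drop in accuracy.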
[CV-9] RGS-SLAM: Robust Gaussian Splatting SLAM with One-Shot Dense Initialization
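[Quick Read]: This paper addresses the slow convergence and unstable early mapping of conventional Gaussian-splatting SLAM systems, which rely on residual-driven, progressive addition of Gaussians during initial mapping. The key to the solution is a training-free correspondence-to-Gaussian initialization: dense multi-view correspondences, obtained from DINOv3 descriptors and refined by a confidence-aware inlier classifier, are triangulated in one shot to produce a structure-aware, well-distributed Gaussian seed. This markedly stabilizes early mapping and speeds up optimization convergence (by roughly 20%) while remaining fully compatible with existing GS-SLAM pipelines.
Link: https://arxiv.org/abs/2601.00705
Authors: Wei-Tse Cheng, Yen-Jen Chiou, Yuan-Fu Yang
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 10 pages, 9 figures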
Abstract:We introduce RGS-SLAM, a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization. Instead of progressively adding Gaussians as residuals reveal missing geometry, RGS-SLAM performs a one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors refined through a confidence-aware inlier classifier, generating a well-distributed and structure-aware Gaussian seed prior to optimization. This initialization stabilizes early mapping and accelerates convergence by roughly 20%, yielding higher rendering fidelity in texture-rich and cluttered scenes while remaining fully compatible with existing GS-SLAM pipelines. Evaluated on the TUM RGB-D and Replica datasets, RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared with state-of-the-art Gaussian and point-based SLAM systems, sustaining real-time mapping performance at up to 925 FPS.
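Triangulation turns a pair of matched pixels into a 3D point by intersecting the viewing rays from two cameras. The sketch below uses the classical two-view midpoint method, which is a swapped-in textbook technique chosen for brevity. The abstract does not state which triangulation scheme RGS-SLAM uses for its multi-view correspondences, and the function name is hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the midpoint of the shortest segment between two 3D rays.

    c1, c2: camera centers; d1, d2: ray directions through the matched pixels.
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    s = (b * e - c * d) / denom   # parameter along ray 1
    t = (a * e - b * d) / denom   # parameter along ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]
```

Each triangulated point could then seed one Gaussian, which conveys the spirit of a one-shot initialization, though the seeding details are not described in the abstract.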
[CV-10] Efficient Deep Demosaicing with Spatially Downsampled Isotropic Networks WACV
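[Quick Read]: This paper addresses the computational efficiency of deep learning models for image demosaicing on mobile platforms. Existing approaches based on isotropic networks perform well, but because they avoid spatial downsampling they are too expensive to deploy on resource-constrained mobile devices. The key to the solution is to use substantial spatial downsampling to improve both efficiency and performance, rather than avoiding downsampling entirely as earlier designs do. Using a mathematical architecture design technique adapted from DeepMAD, the authors build fully convolutional networks with and without downsampling and show experimentally that the downsampled version performs better across several demosaicing and joint demosaicing-and-denoising (JDD) tasks; the resulting JD3Net model supports the effectiveness of this strategy.
Link: https://arxiv.org/abs/2601.00703
Authors: Cory Fan, Wenchao Zhang
Affiliations: Cornell University; OmniVision Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures. To be published at WVAQ Workshop at WACV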
Abstract:In digital imaging, image demosaicing is a crucial first step which recovers the RGB information from a color filter array (CFA). Oftentimes, deep learning is utilized to perform image demosaicing. Given that most modern digital imaging applications occur on mobile platforms, applying deep learning to demosaicing requires lightweight and efficient networks. Isotropic networks, also known as residual-in-residual networks, have been often employed for image demosaicing and joint-demosaicing-and-denoising (JDD). Most demosaicing isotropic networks avoid spatial downsampling entirely, and thus are often prohibitively expensive computationally for mobile applications. Contrary to previous isotropic network designs, this paper claims that spatial downsampling to a significant degree can improve the efficiency and performance of isotropic networks. To validate this claim, we design simple fully convolutional networks with and without downsampling using a mathematical architecture design technique adapted from DeepMAD, and find that downsampling improves empirical performance. Additionally, empirical testing of the downsampled variant, JD3Net, of our fully convolutional networks reveals strong empirical performance on a variety of image demosaicing and JDD tasks.
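One common way to downsample a CFA image spatially without discarding samples is to pack each 2x2 Bayer tile into four half-resolution planes (a space-to-depth rearrangement). The sketch below assumes an RGGB layout and is meant only to illustrate the idea. The abstract does not say how JD3Net downsamples, and the function name is hypothetical.

```python
def pack_bayer_rggb(mosaic):
    """Rearrange an H x W RGGB Bayer mosaic into four (H/2) x (W/2) planes.

    Returns a dict with planes "R", "G1", "G2", and "B". H and W must be even.
    """
    h, w = len(mosaic), len(mosaic[0])
    if h % 2 or w % 2:
        raise ValueError("mosaic height and width must be even")
    # (row offset, column offset) of each color site inside a 2x2 tile.
    offsets = {"R": (0, 0), "G1": (0, 1), "G2": (1, 0), "B": (1, 1)}
    return {
        name: [[mosaic[y + dy][x + dx] for x in range(0, w, 2)]
               for y in range(0, h, 2)]
        for name, (dy, dx) in offsets.items()
    }
```

After packing, every convolution operates on a quarter of the spatial positions, which is where the efficiency gain of working at reduced resolution comes from.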
[CV-11] DefVINS: Visual-Inertial Odometry for Deformable Scenes
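[Quick Read]: This paper addresses the degradation that deformable (non-rigid) scenes cause in classical Visual-Inertial Odometry (VIO): when the scene deforms significantly, the rigidity assumption is violated and VIO tends to over-fit local non-rigid motion or drift severely. The key to the solution is the DefVINS framework, which explicitly separates a rigid, IMU-anchored state from a non-rigid deformation represented by an embedded deformation graph, so that rigid motion and non-rigid deformation are estimated together. An observability analysis characterizes how inertial measurements constrain the rigid motion, and motivates a conditioning-based strategy for activating non-rigid degrees of freedom that avoids ill-posed updates under poor excitation and improves robustness in non-rigid environments.
Link: https://arxiv.org/abs/2601.00702
Authors: Samuel Cerezo, Javier Civera
Affiliations: Universidad de Zaragoza
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 figures, 3 tables. Submitted to RA-L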
Abstract:Deformable scenes violate the rigidity assumptions underpinning classical visual-inertial odometry (VIO), often leading to over-fitting to local non-rigid motion or severe drift when deformation dominates visual parallax. We introduce DefVINS, a visual-inertial odometry framework that explicitly separates a rigid, IMU-anchored state from a non-rigid warp represented by an embedded deformation graph. The system is initialized using a standard VIO procedure that fixes gravity, velocity, and IMU biases, after which non-rigid degrees of freedom are activated progressively as the estimation becomes well conditioned. An observability analysis is included to characterize how inertial measurements constrain the rigid motion and render otherwise unobservable modes identifiable in the presence of deformation. This analysis motivates the use of IMU anchoring and informs a conditioning-based activation strategy that prevents ill-posed updates under poor excitation. Ablation studies demonstrate the benefits of combining inertial constraints with observability-aware deformation activation, resulting in improved robustness under non-rigid environments.
[CV-12] Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
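[Quick Read]: This paper addresses the lack of robust user control in single-image video generation, in particular how to achieve precise camera path control while maintaining temporal consistency and geometric integrity. Existing methods usually follow a two-step process, first building a 3D scene representation (such as a point cloud) and then introducing object motion, which makes full temporal consistency hard to achieve. The key to the solution is a new framework that, in a single forward pass, constructs a 3D Gaussian scene representation and samples plausible object motion, so that video aligned with a specified camera trajectory can be generated quickly without iterative denoising, improving both video quality and inference efficiency.
Link: https://arxiv.org/abs/2601.00678
Authors: Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson
Affiliations: University of Glasgow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: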
Abstract:Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at this https URL.
[CV-13] Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
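[Quick Read]: This paper addresses the limited interactivity of current talking head generation models, which struggle to achieve genuinely two-way interaction and usually produce one-way responses that lack emotional engagement. The core challenges are generating motion in real time under causal constraints and learning expressive reactions without additional labeled data. The key to the solution is the Avatar Forcing framework, which models real-time user-avatar interaction through diffusion forcing and processes multimodal inputs (such as audio and motion) with low latency, so the avatar reacts immediately to verbal and non-verbal cues (such as nods and laughter). It also introduces a direct preference optimization method that builds synthetic losing samples by dropping user conditions, enabling label-free learning of expressive interaction. The framework keeps latency at roughly 500 ms while making interaction noticeably more reactive and expressive.
Link: https://arxiv.org/abs/2601.00664
Authors: Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
Affiliations: KAIST; NTU Singapore; DeepAuto.ai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: Project page: this https URL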
Abstract:Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user’s audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.
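The preference-optimization step described above can be illustrated with the standard direct preference optimization (DPO) objective. The sketch below is a minimal, generic version and not the authors' implementation: it assumes scalar log-likelihoods for a preferred sample (motion generated with user conditions) and a synthetic losing sample (motion generated with user conditions dropped). The names `dpo_loss` and `beta` are hypothetical, and a diffusion-based model would likely substitute denoising-error terms for these log-likelihoods.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (winning, losing) pair.

    logp_* are log-likelihoods under the model being trained and
    ref_logp_* are log-likelihoods under a frozen reference model.
    All inputs here are illustrative scalars (an assumption).
    """
    # How much more the trained model prefers the winner than the reference does.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is lower when the model favors the conditioned (winning) sample.
favoring_winner = dpo_loss(-1.0, -3.0, -2.0, -2.0)
favoring_loser = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

In this toy setup the losing sample needs no human label: it is produced by the same generator with the user's audio and motion removed, which is what makes the preference data label-free.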
[CV-14] CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models
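[Quick read]: This paper addresses the loss of reliability that Large Vision-Language Models (LVLMs) suffer in practice when they generate hallucinated content. Existing training-free methods mitigate hallucination, but they rest on narrow assumptions about its sources, and their effectiveness drops markedly toward the end of generation. The key to the solution is a new hallucinated model that captures hallucination effects by selectively removing key text tokens rather than relying only on removing visual tokens, together with Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. The two together form the CRoPS framework, which requires no additional training, improves CHAIR scores by 20%, and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
Link: https://arxiv.org/abs/2601.00659
Authors: Neeraj Anand, Samyak Jha, Udbhav Bamba, Rahul Rahaman
Affiliations: Indian Institute of Technology (ISM), Dhanbad, India; Transmute AI; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at TMLR 2026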
Abstract:Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
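To make the idea of contrasting the original model against several hallucinated models concrete, here is a minimal sketch of one possible combination rule. It assumes the common contrastive-decoding form, in which the original logits are amplified and each hallucinated model's logits are subtracted with its own weight; the abstract does not give the exact weighting CRoPS uses, so `contrastive_logits` and `alphas` are hypothetical names for illustration only.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, hallucinated, alphas):
    """Combine one original logit vector with several hallucinated ones.

    Assumed rule: (1 + sum(alphas)) * original - sum(alpha_i * hallucinated_i).
    With a single hallucinated model this reduces to plain contrastive decoding.
    """
    if len(hallucinated) != len(alphas):
        raise ValueError("need exactly one weight per hallucinated model")
    total_alpha = sum(alphas)
    combined = []
    for j, base in enumerate(original):
        penalty = sum(a * h[j] for a, h in zip(alphas, hallucinated))
        combined.append((1.0 + total_alpha) * base - penalty)
    return combined


# Token 0 is favored by both hallucinated models, so its advantage shrinks.
original = [2.0, 1.0, 0.0]
hallucinated = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
adjusted = contrastive_logits(original, hallucinated, [0.5, 0.5])
probs = softmax(adjusted)
```

Tokens that the hallucinated models also rate highly are penalized, so the gap between token 0 and token 1 narrows from 1.0 to 0.5 in this toy example.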
[CV-15] Reconstructing Building Height from Spaceborne TomoSAR Point Clouds Using a Dual-Topology Network
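[Quick read]: This paper addresses the limited accuracy of urban building height reconstruction from spaceborne SAR tomography (TomoSAR) point clouds, which suffer from noise, anisotropic point distributions, and data voids on incoherent surfaces. The key to the solution is a learning-based dual-topology network that alternates between two branches: a point branch that models irregular scatterer features and a grid branch that enforces spatial consistency. Processing the two representations jointly denoises the input points and inpaints missing regions, producing continuous, high-resolution building height maps.
Link: https://arxiv.org/abs/2601.00658
Authors: Zhaiyu Chen, Yuanyuan Wang, Yilei Shi, Xiao Xiang Zhu
Affiliations: Technical University of Munich; Munich Center for Machine Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in IEEE Transactions on Geoscience and Remote Sensing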
Abstract:Reliable building height estimation is essential for various urban applications. Spaceborne SAR tomography (TomoSAR) provides weather-independent, side-looking observations that capture facade-level structure, offering a promising alternative to conventional optical methods. However, TomoSAR point clouds often suffer from noise, anisotropic point distributions, and data voids on incoherent surfaces, all of which hinder accurate height reconstruction. To address these challenges, we introduce a learning-based framework for converting raw TomoSAR points into high-resolution building height maps. Our dual-topology network alternates between a point branch that models irregular scatterer features and a grid branch that enforces spatial consistency. By jointly processing these representations, the network denoises the input points and inpaints missing regions to produce continuous height estimates. To our knowledge, this is the first proof of concept for large-scale urban height mapping directly from TomoSAR point clouds. Extensive experiments on data from Munich and Berlin validate the effectiveness of our approach. Moreover, we demonstrate that our framework can be extended to incorporate optical satellite imagery, further enhancing reconstruction quality. The source code is available at this https URL.
[CV-16] Quality Detection of Stored Potatoes via Transfer Learning: A CNN and Vision Transformer Approach
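[Quick read]: This paper addresses the difficulty of monitoring potato quality during storage, covering sprout detection, weight loss estimation, and shelf-life prediction. The key to the solution is image-based deep learning with two specialized models: a high-precision binary classifier for sprout detection and a multi-class predictor that estimates weight loss and forecasts remaining shelf life. The study uses pre-trained ResNet, VGG, DenseNet, and Vision Transformer (ViT) architectures. DenseNet reaches 98.03% accuracy in sprout detection, and shelf-life prediction exceeds 89.83% accuracy with coarse class divisions (2-5 classes), clearly better than with finer divisions (6-8 classes), which shows that sensible class grouping matters for model performance. The approach offers a feasible path toward automated sorting and inventory management systems, helping improve supply chain efficiency and reduce food waste.
Link: https://arxiv.org/abs/2601.00645
Authors: Shrikant Kapse, Priyankkumar Dhrangdhariya, Priya Kedia, Manasi Patwardhan, Shankar Kausley, Soumyadipta Maiti, Beena Rai, Shirish Karande
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: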
Abstract:Image-based deep learning provides a non-invasive, scalable solution for monitoring potato quality during storage, addressing key challenges such as sprout detection, weight loss estimation, and shelf-life prediction. In this study, images and corresponding weight data were collected over a 200-day period under controlled temperature and humidity conditions. Leveraging powerful pre-trained architectures of ResNet, VGG, DenseNet, and Vision Transformer (ViT), we designed two specialized models: (1) a high-precision binary classifier for sprout detection, and (2) an advanced multi-class predictor to estimate weight loss and forecast remaining shelf-life with remarkable accuracy. DenseNet achieved exceptional performance, with 98.03% accuracy in sprout detection. Shelf-life prediction models performed best with coarse class divisions (2-5 classes), achieving over 89.83% accuracy, while accuracy declined for finer divisions (6-8 classes) due to subtle visual differences and limited data per class. These findings demonstrate the feasibility of integrating image-based models into automated sorting and inventory systems, enabling early identification of sprouted potatoes and dynamic categorization based on storage stage. Practical implications include improved inventory management, differential pricing strategies, and reduced food waste across supply chains. While predicting exact shelf-life intervals remains challenging, focusing on broader class divisions ensures robust performance. Future research should aim to develop generalized models trained on diverse potato varieties and storage conditions to enhance adaptability and scalability. Overall, this approach offers a cost-effective, non-destructive method for quality assessment, supporting efficiency and sustainability in potato storage and distribution.
[CV-17] HyperPriv-EPN: Hypergraph Learning with Privileged Knowledge for Ependymoma Prognosis
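[Quick read]: This paper addresses preoperative prognosis of ependymoma, that is, how to achieve accurate diagnosis and survival stratification from MRI when post-operative text is unavailable. Existing multimodal methods cannot use this privileged post-operative text at inference time, so preoperative models lack access to expert-level semantic knowledge. The key to the solution is HyperPriv-EPN, a hypergraph-based Learning Using Privileged Information (LUPI) framework. It introduces a Severed Graph Strategy that builds dual-stream distillation between a Teacher graph (enriched with post-operative text) and a Student graph (restricted to preoperative imaging), so the Student learns to hallucinate semantic community structures from visual features alone. Expert knowledge in historical post-operative text is thereby transferred to the preoperative setting, giving strong predictions without requiring text at inference.
Link: https://arxiv.org/abs/2601.00626
Authors: Shuren Gabriel Yu, Sikang Ren, Yongji Tian
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 2 tables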
Abstract:Preoperative prognosis of Ependymoma is critical for treatment planning but challenging due to the lack of semantic insights in MRI compared to post-operative surgical reports. Existing multimodal methods fail to leverage this privileged text data when it is unavailable during inference. To bridge this gap, we propose HyperPriv-EPN, a hypergraph-based Learning Using Privileged Information (LUPI) framework. We introduce a Severed Graph Strategy, utilizing a shared encoder to process both a Teacher graph (enriched with privileged post-surgery information) and a Student graph (restricted to pre-operation data). Through dual-stream distillation, the Student learns to hallucinate semantic community structures from visual features alone. Validated on a multi-center cohort of 311 patients, HyperPriv-EPN achieves state-of-the-art diagnostic accuracy and survival stratification. This effectively transfers expert knowledge to the preoperative setting, unlocking the value of historical post-operative data to guide the diagnosis of new patients without requiring text at inference.
[CV-18] RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation
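[Quick read]: This paper addresses real-time, accurate monitoring and evaluation of patient motion during rehabilitation training, in particular pose estimation in scenes with interference from multiple people. The key to the solution is RePose, a unified end-to-end pipeline for real-time human pose estimation and motion analysis. It uses a modified SmoothNet for accurate, low-latency pose estimation, introduces a fast tracking method that takes less than 1 ms per frame to handle multi-person interference, and uses the Unity platform for real-time visual feedback and display of muscle stress, improving the effectiveness of rehabilitation training and the user experience.
Link: https://arxiv.org/abs/2601.00625
Authors: Junxiao Xue, Pavel Smirnov, Ziao Li, Yunyun Shi, Shi Chen, Xinyi Yin, Xiaohan Yue, Lei Wang, Yiduo Wang, Feng Lin, Yijia Chen, Xiao Ma, Xiaoran Yan, Qing Zhang, Fengjian Xue, Xuecheng Wu
Affiliations: Zhejiang Lab; Northeastern University; Xi’an Jiaotong University; Zhengzhou University; Dalian Minzu University; Nanjing University of Aeronautics and Astronautics; Fuyao University of Science and Technology; Xianghu Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: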
Abstract:We propose a real-time 3D human pose estimation and motion analysis method termed RePose for rehabilitation training. It is capable of real-time monitoring and evaluation of patients’ motion during rehabilitation, providing immediate feedback and guidance to assist patients in executing rehabilitation exercises correctly. Firstly, we introduce a unified pipeline for end-to-end real-time human pose estimation and motion analysis using RGB video input from multiple cameras, which can be applied to the field of rehabilitation training. The pipeline can help to monitor and correct patients’ actions, thus aiding them in regaining muscle strength and motor functions. Secondly, we propose a fast tracking method for medical rehabilitation scenarios with multiple-person interference, which requires less than 1 ms to track a single frame. Additionally, we modify SmoothNet for real-time posture estimation, effectively reducing pose estimation errors and restoring the patient’s true motion state, making it visually smoother. Finally, we use the Unity platform for real-time monitoring and evaluation of patients’ motion during rehabilitation, and to display muscle stress conditions to assist patients with their rehabilitation training.
[CV-19] Noise-Robust Tiny Object Localization with Flows
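[Quick read]: This paper addresses the performance bottleneck in tiny object detection caused by sensitivity to annotation noise, where optimizing strict localization objectives tends to overfit the noise. The key to the solution is TOLF, a tiny object localization framework based on normalizing flows. Flow models flexibly capture the complex, non-Gaussian distribution of prediction errors, and an uncertainty-aware gradient modulation mechanism suppresses learning from high-uncertainty samples, which stabilizes training and mitigates noise overfitting.
Link: https://arxiv.org/abs/2601.00617
Authors: Huixin Sun, Linlin Yang, Ronyu Chen, Kerui Gu, Baochang Zhang, Angela Yao, Xianbin Cao
Affiliations: Beihang University; Communication University of China; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures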
Abstract:Despite significant advances in generic object detection, a persistent performance gap remains for tiny objects compared to normal-scale objects. We demonstrate that tiny objects are highly sensitive to annotation noise, where optimizing strict localization objectives risks noise overfitting. To address this, we propose Tiny Object Localization with Flows (TOLF), a noise-robust localization framework leveraging normalizing flows for flexible error modeling and uncertainty-guided optimization. Our method captures complex, non-Gaussian prediction distributions through flow-based error modeling, enabling robust learning under noisy supervision. An uncertainty-aware gradient modulation mechanism further suppresses learning from high-uncertainty, noise-prone samples, mitigating overfitting while stabilizing training. Extensive experiments across three datasets validate our approach’s effectiveness. Especially, TOLF boosts the DINO baseline by 1.2% AP on the AI-TOD dataset.
[CV-20] Modality Dominance-Aware Optimization for Embodied RGB-Infrared Perception
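[Quick read]: This paper addresses optimization bias in RGB-Infrared (RGB-IR) multimodal perception caused by asymmetric modality characteristics: because of differences in information density and feature quality, training over-relies on a dominant modality, which hinders effective cross-modal fusion. The key to the solution is the Modality Dominance Index (MDI), which quantifies modality dominance by jointly modeling feature entropy and gradient contribution. Built on MDI, the Modality Dominance-Aware Cross-modal Learning (MDACL) framework includes Hierarchical Cross-modal Guidance (HCG) to enhance feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion, which mitigates optimization bias and achieves state-of-the-art performance.
Link: https://arxiv.org/abs/2601.00598
Authors: Xianhui Liu, Siqi Jiang, Yi Xie, Yuqing Lin, Siao Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: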
Abstract:RGB-Infrared (RGB-IR) multimodal perception is fundamental to embodied multimedia systems operating in complex physical environments. Although recent cross-modal fusion methods have advanced RGB-IR detection, the optimization dynamics caused by asymmetric modality characteristics remain underexplored. In practice, disparities in information density and feature quality introduce persistent optimization bias, leading training to overemphasize a dominant modality and hindering effective fusion. To quantify this phenomenon, we propose the Modality Dominance Index (MDI), which measures modality dominance by jointly modeling feature entropy and gradient contribution. Based on MDI, we develop a Modality Dominance-Aware Cross-modal Learning (MDACL) framework that regulates cross-modal optimization. MDACL incorporates Hierarchical Cross-modal Guidance (HCG) to enhance feature alignment and Adversarial Equilibrium Regularization (AER) to balance optimization dynamics during fusion. Extensive experiments on three RGB-IR benchmarks demonstrate that MDACL effectively mitigates optimization bias and achieves SOTA performance.
[CV-21] SafeMo: Linguistically Grounded Unlearning for Trustworthy Text-to-Motion Generation
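[Quick read]: This paper addresses safety in text-to-motion (T2M) generation models, specifically the degraded benign performance and broken motion continuity that existing discrete-codebook-replacement methods incur when improving safety. The core challenges are: (1) discrete codebook replacement causes motion drift on benign prompts; (2) quantization and smoothness loss introduce motion artifacts and unnatural transitions; (3) existing datasets naturally contain unsafe intents, making them unsuitable for safety-driven learning. The key to the solution is the SafeMo framework, which integrates Minimal Motion Unlearning (MMU), a two-stage machine unlearning strategy that generates safe motion in continuous space, avoiding codebook loss, preserving continuous kinematics, and achieving a better safety-utility trade-off. Experiments show that SafeMo forgets unsafe prompts more strongly than the previous state-of-the-art method LCR (2.5x higher forget-set FID on HumanML3D and 14.4x higher on Motion-X), while benign performance on safe prompts is comparable or better.
Link: https://arxiv.org/abs/2601.00590
Authors: Yiling Wang, Zeyu Zhang, Yiran Wang, Hao Tang
Affiliations: The Australian National University; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: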
Abstract:Text-to-motion (T2M) generation with diffusion backbones achieves strong realism and alignment. Safety concerns in T2M methods have been raised in recent years; existing methods replace discrete VQ-VAE codebook entries to steer the model away from unsafe behaviors. However, discrete codebook replacement-based methods have two critical flaws: firstly, replacing codebook entries which are reused by benign prompts leads to drifts on everyday tasks, degrading the model’s benign performance; secondly, discrete token-based methods introduce quantization and smoothness loss, resulting in artifacts and jerky transitions. Moreover, existing text-to-motion datasets naturally contain unsafe intents and corresponding motions, making them unsuitable for safety-driven machine learning. To address these challenges, we propose SafeMo, a trustworthy motion generative framework integrating Minimal Motion Unlearning (MMU), a two-stage machine unlearning strategy, enabling safe human motion generation in continuous space, preserving continuous kinematics without codebook loss and delivering strong safety-utility trade-offs compared to current baselines. Additionally, we present the first safe text-to-motion dataset SafeMoVAE-29K integrating rewritten safe text prompts and continuous refined motion for trustworthy human motion unlearning. Built upon DiP, SafeMo efficiently generates safe human motions with natural transitions. Experiments demonstrate effective unlearning performance of SafeMo by showing strengthened forgetting on unsafe prompts, reaching 2.5x and 14.4x higher forget-set FID on HumanML3D and Motion-X respectively, compared to the previous SOTA human motion unlearning method LCR, with benign performance on safe prompts being better or comparable. Code: this https URL. Website: this https URL.
[CV-22] GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval AAAI2026
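[Quick read]: This paper addresses inaccurate localization in zero-shot video moment retrieval (ZVMR) caused by the mismatch in semantic granularity between text queries and visual content. Existing methods align video and language representations in a joint space using high-quality pre-trained models, but they do not balance the semantic granularity of the modalities for a given scene, which hurts retrieval accuracy. The key to the solution is a training-free framework, Granularity-Aware Alignment (GranAlign), built on two complementary techniques: granularity-based query rewriting, which generates queries at varied semantic granularities, and query-aware caption generation, which embeds query intent into the video content. Pairing multi-level queries with both query-agnostic and query-aware captions resolves the granularity mismatch. The method sets a new state of the art on three major benchmarks (QVHighlights, Charades-STA, and ActivityNet-Captions), including a 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
Link: https://arxiv.org/abs/2601.00584
Authors: Mingyu Jeon, Sunjae Yoon, Jonghee Kim, Junyeoung Kim
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026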
Abstract:Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality’s representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
[CV-23] A Cascaded Information Interaction Network for Precise Image Segmentation
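[Quick read]: This paper addresses robust image segmentation in complex scenes, in particular the weakness of traditional single-scale feature extraction in visually cluttered or blurred environments. The key to the solution is a cascaded convolutional neural network with a novel Global Information Guidance Module, which fuses low-level texture details with high-level semantic information across multiple layers to improve segmentation accuracy. Experiments show that the framework outperforms existing state-of-the-art methods on benchmark datasets and has potential for practical robotic applications.
Link: https://arxiv.org/abs/2601.00562
Authors: Hewen Xiao, Jie Mei, Guangfu Ma, Weiren Wu
Affiliations: Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: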
Abstract:Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.
[CV-24] AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models
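[Quick read]: This paper addresses a key challenge for Unified Multimodal Models (UMMs), namely applying world knowledge across tasks: existing benchmarks provide only isolated single-task tests and cannot effectively diagnose a model’s overall understanding and reasoning. The core of the solution is AEGIS (Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a multi-task benchmark covering visual understanding, generation, editing, and interleaved generation, with 1,050 manually annotated questions spanning 21 topics and 6 reasoning types. It also introduces Deterministic Checklist-based Evaluation (DCE), which replaces ambiguous prompt-based scoring with atomic Y/N judgments to make evaluation more reliable. Experiments show that current UMMs generally have world knowledge deficits and that performance drops sharply under complex reasoning, while simple plug-in reasoning modules partly mitigate the problem, which highlights world-knowledge-based reasoning as an important direction for UMMs.
Link: https://arxiv.org/abs/2601.00561
Authors: Jintao Lin, Bowen Dong, Weikang Shi, Chenyang Lei, Suiyun Zhang, Rui Liu, Xihui Liu
Affiliations: University of Hong Kong; The Hong Kong Polytechnic University; The Chinese University of Hong Kong; Huawei Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: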
Abstract:The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic “Y/N” judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
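A checklist-based protocol of this kind can be sketched in a few lines. The example below is an assumption-laden illustration rather than the benchmark's actual scoring code: it supposes that each question comes with a list of atomic Y/N judgments and that the score is simply the fraction answered "Y". The abstract does not specify the aggregation rule, and `checklist_score` is a hypothetical name.

```python
def checklist_score(judgments):
    """Return the fraction of atomic checklist items judged 'Y'.

    `judgments` is a list of 'Y'/'N' strings, one per checklist item.
    The fraction-of-Y aggregation is an assumption made for illustration.
    """
    if not judgments:
        raise ValueError("checklist must contain at least one item")
    normalized = [item.strip().upper() for item in judgments]
    if any(item not in {"Y", "N"} for item in normalized):
        raise ValueError("each judgment must be 'Y' or 'N'")
    return normalized.count("Y") / len(normalized)


# Three of four atomic checks pass for this hypothetical model output.
score = checklist_score(["Y", "N", "y", "Y"])
```

Because every item is a binary decision, two evaluators who agree on each atomic judgment necessarily agree on the final score, which is the reliability argument behind replacing free-form scoring.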
[CV-25] A Comprehensive Dataset for Human vs. AI Generated Image Detection
[Quick read]: This paper addresses the growing difficulty of telling synthetic images produced by generative AI systems (such as Stable Diffusion, DALL-E, and MidJourney) from real photographs, which enables the spread of false information and misleading content. The key to the solution is MS COCOAI, a large-scale, diverse benchmark dataset of 96,000 real and synthetic images covering five generators (Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6). Two tasks are defined on it: classifying images as real or generated, and identifying which model produced a given synthetic image. The dataset provides a standardized testbed for developing and evaluating AI-generated image detection methods.
Link: https://arxiv.org/abs/2601.00553
Authors: Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Vasu Sharma, Vinija Jain, Aman Chadha, Aishwarya Naresh Reganti, Amitava Das
Affiliations: Kalyani Govt. Engg. College; AI Institute USC; IIIT Delhi; BITS Pilani Hyderabad; IIIT Guwahati; NIT Silchar; San José State Univ.; UCLA; Washington State Univ.; VIIT; GITA; Meta AI; Amazon AI; BITS Pilani Goa
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at this https URL.
[CV-26] SlingBAG Pro: Accelerating point cloud-based iterative reconstruction for 3D photoacoustic imaging under arbitrary array
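[Quick read]: This paper addresses the difficulty of efficient, high-quality three-dimensional photoacoustic imaging (3D PAI) with irregularly arranged transducer arrays, where traditional iterative reconstruction algorithms suffer from high computational complexity, large memory use, and long reconstruction times. The key to the solution is SlingBAG Pro, which builds on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method and extends it to arbitrary array geometries. A hierarchical optimization strategy combines zero-gradient filtering with progressively increased temporal sampling rates to remove redundant spatial point clouds quickly during iteration, which accelerates convergence and shortens overall reconstruction time, giving up to a 2.2-fold speedup over the original SlingBAG algorithm under irregular arrays.
Link: https://arxiv.org/abs/2601.00551
Authors: Shuang Li, Yibing Wang, Jian Gao, Chulhong Kim, Seongwook Choi, Yu Zhang, Qian Chen, Yao Yao, Changhui Li
Affiliations: Nanjing University; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: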
Abstract:High-quality three-dimensional (3D) photoacoustic imaging (PAI) is gaining increasing attention in clinical applications. To address the challenges of limited space and high costs, irregular geometric transducer arrays that conform to specific imaging regions are promising for achieving high-quality 3D PAI with fewer transducers. However, traditional iterative reconstruction algorithms struggle with irregular array configurations, suffering from high computational complexity, substantial memory requirements, and lengthy reconstruction times. In this work, we introduce SlingBAG Pro, an advanced reconstruction algorithm based on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method, while extending its compatibility to arbitrary array geometries. SlingBAG Pro maintains high reconstruction quality, reduces the number of required transducers, and employs a hierarchical optimization strategy that combines zero-gradient filtering with progressively increased temporal sampling rates during iteration. This strategy rapidly removes redundant spatial point clouds, accelerates convergence, and significantly shortens overall reconstruction time. Compared to the original SlingBAG algorithm, SlingBAG Pro achieves up to a 2.2-fold speed improvement in point cloud-based 3D PA reconstruction under irregular array geometries. The proposed method is validated through both simulation and in vivo mouse experiments, and the source code is publicly available at this https URL.
[CV-27] DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction
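[Quick read]: This paper addresses miss tracking and ambiguous tracking in drag-style image editing under the move-and-track framework, as well as problems in other methods such as a large gap between the source image and the target edited image and unreasonable intermediate points. The key to the solution is DynaDrag, the first dragging method under a predict-and-move framework. It iterates Motion Prediction and Motion Supervision: in each iteration it first predicts where the handle points should move and then drags them accordingly, and it dynamically adjusts the valid handle points to improve editing accuracy and editability.
Link: https://arxiv.org/abs/2601.00542
Authors: Jiacheng Sui, Yujie Zhou, Li Niu
Affiliations: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures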
Abstract:To achieve pixel-level image manipulation, drag-style image editing, which edits images using points or trajectories as conditions, is attracting widespread attention. Most previous methods follow the move-and-track framework, in which miss tracking and ambiguous tracking are unavoidable issues. Other methods under different frameworks suffer from problems such as the huge gap between the source image and the target edited image, as well as unreasonable intermediate points, which can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under the predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively. In each iteration, Motion Prediction first predicts where the handle points should move, and then Motion Supervision drags them accordingly. We also propose to dynamically adjust the valid handle points to further improve the performance. Experiments on face and human datasets showcase the superiority over previous works.
[CV-28] Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios
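[Quick read]: This paper addresses the drop in segmentation performance of the Segment Anything Model (SAM) in visually non-salient (VNS) scenarios, where low contrast between foreground and background makes it hard for existing methods to extract accurate contours and produce reliable segmentation. The key to the solution is to strengthen SAM’s perception of low-contrast scenes with two designs: a Mask-Edge Token Interactive decoder, which improves the decoder’s interaction with mask edge features, and a Non-Salient Feature Mining module, which mines SAM’s low-level features to capture non-salient characteristics. With only a small increase in parameters and computation, the two modules substantially improve zero-shot segmentation in VNS scenarios, and the additional parameters can be trained within 4 hours.
Link: https://arxiv.org/abs/2601.00537
Authors: Guangqian Guo, Pengfei Chen, Yong Guo, Huafeng Chen, Boqiang Zhang, Shan Gao
Affiliations: Northwestern Polytechnical University; University of Chinese Academy of Sciences; Max Planck Institute for Informatics; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TIP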
Abstract:Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM’s perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM’s low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model’s segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at this https URL.
[CV-29] FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection
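【Quick Read】: This paper addresses the limited ability of large-scale text-to-image (T2I) diffusion models to render text precisely in multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Existing methods typically rely on costly retraining or rigid external layout constraints, which degrade aesthetic quality and limit flexibility. The key to the solution is FreeText, a training-free, plug-and-play framework that splits the problem into "where to write" and "what to write". For the former, it reads token-wise spatial attribution from the DiT model's internal image-to-text attention, using sink-like tokens as stable spatial anchors together with topology-aware refinement to produce high-confidence writing-region masks. For the latter, it introduces Spectral-Modulated Glyph Injection (SGMI), which applies frequency-domain band-pass modulation to a noise-aligned glyph prior to strengthen glyph structure and suppress semantic leakage, yielding clearer text while preserving semantic consistency and image aesthetics.
Link: https://arxiv.org/abs/2601.00535
Authors: Ruiqiang Zhang, Hengyi Wang, Chang Liu, Guanjie Wang, Zehua Ma, Weiming Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: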
Abstract:Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into "where to write" and "what to write". For "where to write", we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For "what to write", we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
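To make "frequency-domain band-pass modulation" concrete, here is a minimal sketch that band-pass filters a 1D signal with a naive DFT. It is not FreeText's SGMI: the function names (`dft`, `idft`, `band_pass`) and the band limits are hypothetical, and the actual method presumably works on 2D glyph priors aligned with the diffusion noise rather than on a toy 1D sequence.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def band_pass(signal, low, high):
    """Keep only bins whose frequency index lies in [low, high] (illustrative band)."""
    n = len(signal)
    spectrum = dft(signal)
    # Bin k and bin n - k carry the same frequency for real signals,
    # so both are kept or dropped together.
    filtered = [
        coeff if low <= min(k, n - k) <= high else 0.0
        for k, coeff in enumerate(spectrum)
    ]
    return [sample.real for sample in idft(filtered)]
```

With a signal made of a constant offset, a frequency-2 sinusoid, and a frequency-6 sinusoid, `band_pass(signal, low=1, high=3)` keeps only the frequency-2 component, which is the sense in which a band-pass stage can retain mid-frequency structure while discarding the rest.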
[CV-30] All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations
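【Quick Read】: This paper addresses video restoration when the type and intensity of unknown degradations evolve smoothly over time, the Smoothly Evolving Unknown Degradations (SEUD) scenario. Existing methods mostly focus on frame-wise degradation variation and overlook the temporal continuity of real-world degradation, which leads to temporally inconsistent results or poor adaptation to complex degradations. The key to the solution is the all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet), whose core components are: 1) a Coarse Intensity Estimation Dehazing (CIED) module that estimates haze intensity from physical priors and provides coarse dehazed features as initialization; 2) a Flow Prompt Generation (FPG) module that extracts static prompts (capturing segment-level degradation types) and dynamic prompts (adapting to frame-level intensity variations); 3) a label-aware supervision mechanism that improves the discriminability of static prompt representations. This design models multiple degradation types, intensities, and their evolution in a unified way, and improves restoration quality, temporal consistency, and robustness.
Link: https://arxiv.org/abs/2601.00533
Authors: Wenrui Li, Hongtao Chen, Yao Xiao, Wangmeng Zuo, Jiantao Zhou, Yonghong Tian, Xiaopeng Fan
Affiliations: Harbin Institute of Technology (哈尔滨工业大学); University of Macau (澳门大学); Peking University (北京大学); Peng Cheng Laboratory (鹏城实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: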
Abstract:All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at this https URL.
[CV-31] MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation AAAI2026
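【Quick Read】: This paper addresses the expert knowledge and time-consuming physical parameter tuning that existing 3D physical simulation requires to reproduce the dynamic behavior of diverse objects and materials. Its core solution is MotionPhysics, a framework that infers plausible physical parameters end to end from a natural language prompt, without guidance from ground-truth trajectories or annotated videos. The key innovations are using a multimodal large language model to estimate material parameters constrained to physically plausible ranges, and a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases, thereby guiding the simulation.
Link: https://arxiv.org/abs/2601.00504
Authors: Miaowei Wang, Jakub Zadrożny, Oisin Mac Aodha, Amir Vaxman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: AAAI2026 Accepted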
Abstract:Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: this https URL.
[CV-32] CPPO: Contrastive Perception for Vision Language Policy Optimization
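【Quick Read】: This paper addresses the difficulty of separating perception from reasoning in multimodal reasoning, that is, how to distinguish and optimize perception tokens and reasoning tokens during reinforcement learning (RL) finetuning of vision-language models (VLMs). Existing methods usually rely on explicit perception rewards but have several limitations: they need extra large language models (LLMs) or ground-truth data, force the policy model to separate perception from reasoning, or apply rewards indiscriminately to all output tokens. The key to the solution is Contrastive Perception Policy Optimization (CPPO), which identifies perception tokens automatically by detecting shifts in output entropy under perturbed input images, and adds a Contrastive Perception Loss (CPL) that keeps the model consistent under information-preserving perturbations and sensitive under information-removing ones. This allows efficient, scalable optimization of perception and reasoning without extra models.
Link: https://arxiv.org/abs/2601.00501
Authors: Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari
Affiliations: Huawei Technologies Canada Co. Ltd. (华为技术加拿大有限公司); Huawei Cloud (华为云)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: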
Abstract:We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
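The idea of detecting perception tokens through entropy shifts can be sketched as follows. This is an illustration under stated assumptions, not CPPO's actual procedure: the function names, the "entropy increase above a fixed threshold" criterion, and the threshold value are all hypothetical, and the inputs are assumed to be per-token next-token distributions obtained with the clean and the perturbed image.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def perception_token_mask(clean_dists, perturbed_dists, threshold=0.5):
    """Flag positions whose entropy rises by more than `threshold` under perturbation.

    Both arguments hold one probability distribution per output token.
    The threshold and the increase-only criterion are illustrative assumptions.
    """
    return [
        entropy(perturbed) - entropy(clean) > threshold
        for clean, perturbed in zip(clean_dists, perturbed_dists)
    ]
```

A token whose distribution goes from sharply peaked to near-uniform once the image is perturbed is flagged, on the intuition that its prediction depended on visual content; a token whose distribution is unchanged is treated as reasoning.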
[CV-33] E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
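【Quick Read】: This paper addresses the sparse and ambiguous reward signals that arise when reinforcement-learning-based flow matching models are aligned with human preferences by optimizing over multiple denoising steps. The key solution is E-GRPO, an entropy-aware Group Relative Policy Optimization method. It distinguishes high-entropy from low-entropy sampling steps, merges consecutive low-entropy steps into a single high-entropy step for stochastic differential equation (SDE) sampling to make exploration more efficient, and uses ordinary differential equation (ODE) sampling for the remaining steps. It also introduces a multi-step group normalized advantage, which computes group-relative advantages among samples that share the same consolidated SDE denoising step, improving the stability and effectiveness of policy optimization.
Link: https://arxiv.org/abs/2601.00423
Authors: Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
Affiliations: Tsinghua University (清华大学)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL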
Abstract:Recent reinforcement learning has enhanced flow matching models on human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps result in undistinguished roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization, to increase the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals due to stochasticity from multiple steps, we specifically merge consecutive low-entropy steps to formulate one high-entropy step for SDE sampling, while applying ODE sampling on other steps. Building upon this, we introduce a multi-step group normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results on different reward settings demonstrate the effectiveness of our method.
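The group-relative advantage at the heart of GRPO-style methods normalizes each reward against the other samples in its group. The sketch below shows that normalization and a hypothetical grouping by shared SDE step; the `(step_id, reward)` input layout and the function names are assumptions for illustration, not the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantages_by_step(samples):
    """Normalize rewards among samples that share the same step id.

    `samples` is a list of (step_id, reward) pairs; step_id stands in for
    "the consolidated SDE denoising step" and is a hypothetical encoding.
    """
    groups = {}
    for index, (step_id, reward) in enumerate(samples):
        groups.setdefault(step_id, []).append((index, reward))
    advantages = [0.0] * len(samples)
    for members in groups.values():
        normalized = group_relative_advantages([r for _, r in members])
        for (index, _), adv in zip(members, normalized):
            advantages[index] = adv
    return advantages
```

Normalizing inside each group means a reward is only compared with roll-outs that branched at the same stochastic step, so differences in reward can be attributed to that step rather than to noise accumulated elsewhere.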
[CV-34] Robust Assembly Progress Estimation via Deep Metric Learning
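【Quick Read】: This paper addresses inaccurate assembly progress estimation in manual, multi-day assembly tasks when visual changes are subtle or occlusion is present. The key solution is a Quadruplet Loss-based learning approach for anomaly images, combined with a custom data loader that strategically selects training samples to sharpen discrimination between adjacent assembly stages, giving more reliable progress estimation on a small-scale dataset.
Link: https://arxiv.org/abs/2601.00422
Authors: Kazuma Miura, Sarthak Pathak, Kazunori Umeda
Affiliations: Chuo University (中央大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: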
Abstract:In recent years, the advancement of AI technologies has accelerated the development of smart factories. In particular, the automatic monitoring of product assembly progress is crucial for improving operational efficiency, minimizing the cost of discarded parts, and maximizing factory productivity. However, in cases where assembly tasks are performed manually over multiple days, implementing smart factory systems remains a challenge. Previous work has proposed Anomaly Triplet-Net, which estimates assembly progress by applying deep metric learning to the visual features of products. Nevertheless, when visual changes between consecutive tasks are subtle, misclassification often occurs. To address this issue, this paper proposes a robust system for estimating assembly progress, even in cases of occlusion or minimal visual change, using a small-scale dataset. Our method leverages a Quadruplet Loss-based learning approach for anomaly images and introduces a custom data loader that strategically selects training samples to enhance estimation accuracy. We evaluated our approach on an image dataset captured during desktop PC assembly. The proposed Anomaly Quadruplet-Net outperformed existing methods on the dataset. Specifically, it improved the estimation accuracy by 1.3% and reduced misclassification between adjacent tasks by 1.9% on the desktop PC dataset, demonstrating the effectiveness of the proposed method.
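For readers unfamiliar with quadruplet loss, the sketch below implements the commonly used form on plain embedding vectors: a triplet-style term plus a second term that compares the anchor-positive distance with the distance between two negatives. The margins are illustrative, and the abstract does not say how Anomaly Quadruplet-Net adapts the loss to anomaly images, so this should not be read as the paper's exact formulation.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quadruplet_loss(anchor, positive, negative1, negative2, margin1=1.0, margin2=0.5):
    """Standard quadruplet loss on embedding vectors (margins are illustrative).

    Term 1 pulls the positive closer to the anchor than the first negative.
    Term 2 keeps the anchor-positive distance below the distance between the
    two negatives, which come from classes different from each other.
    """
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative1)
    d_nn = squared_distance(negative1, negative2)
    return max(0.0, d_ap - d_an + margin1) + max(0.0, d_ap - d_nn + margin2)
```

The second term is what distinguishes this from triplet loss: it tightens same-class clusters relative to every between-class gap, which is plausibly why it helps when adjacent assembly stages look nearly identical.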
[CV-35] ABFR-KAN: Kolmogorov-Arnold Networks for Functional Brain Analysis
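【Quick Read】: This paper addresses the selection bias and lack of subject specificity that arise when functional connectivity (FC) analysis for computer-aided brain disorder diagnosis relies on atlas-based parcellation. The key solution is ABFR-KAN, a transformer-based classification network that combines advanced brain function representation components with Kolmogorov-Arnold Networks (KANs) to mitigate structural bias, improve anatomical conformity, and enhance the reliability of FC estimation.
Link: https://arxiv.org/abs/2601.00416
Authors: Tyler Ward, Abdullah Imran
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 10 figures, 8 tables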
Abstract:Functional connectivity (FC) analysis, a valuable tool for computer-aided brain disorder diagnosis, traditionally relies on atlas-based parcellation. However, issues relating to selection bias and a lack of regard for subject specificity can arise as a result of such parcellations. Addressing this, we propose ABFR-KAN, a transformer-based classification network that incorporates novel advanced brain function representation components with the power of Kolmogorov-Arnold Networks (KANs) to mitigate structural bias, improve anatomical conformity, and enhance the reliability of FC estimation. Extensive experiments on the ABIDE I dataset, including cross-site evaluation and ablation studies across varying model backbones and KAN configurations, demonstrate that ABFR-KAN consistently outperforms state-of-the-art baselines for autism spectrum disorder (ASD) classification. Our code is available at this https URL.
[CV-36] RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection
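【Quick Read】: This paper addresses inefficient roadside litter monitoring: traditional approaches depend on manual surveys and public reporting, which offer limited spatial coverage at high cost and with poor timeliness. Existing vision datasets focus on street-level still images, aerial scenes, or aquatic environments and do not reflect dashcam footage, where litter is extremely small, sparse, and set against cluttered backgrounds. The authors introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, with over 11,000 annotated images covering diverse UK driving conditions and showing pronounced long-tail and small-object distributions. The key contribution is a challenging benchmark for evaluating modern object detectors on extreme small-object detection in dynamic driving scenes, in support of low-cost, scalable litter monitoring systems.
Link: https://arxiv.org/abs/2601.00398
Authors: Tao Wu, Qing Xu, Xiangjian He, Oakleigh Weekes, James Brown, Wenting Duan
Affiliations: University of Nottingham Ningbo China; University of Lincoln
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: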
Abstract:Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at this https URL.
[CV-37] NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
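【Quick Read】: This paper addresses the limited scalability of current 4D world modeling methods, which stems from reliance on expensive, specialized multi-view 4D data or on cumbersome training pre-processing. The authors propose NeoVerse, whose core idea is a pipeline that scales end to end to diverse in-the-wild monocular videos. Key designs include pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques, which give the model versatility and generalization across domains and state-of-the-art performance on standard reconstruction and generation benchmarks.
Link: https://arxiv.org/abs/2601.00393
Authors: Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang
Affiliations: NLPR & MAIS, CASIA (中国科学院自动化研究所); CreateAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL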
Abstract:In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at this https URL
[CV-38] BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition
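【Quick Read】: This paper addresses a long-standing limitation of skeleton-based human action recognition (HAR): most methods are body-centric, modeling large-scale motion while neglecting the subtle hand articulations that matter for fine-grained recognition. The authors propose a probabilistic dual-stream framework with three core components: (1) a calibration-free preprocessing pipeline that drops canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that enables reliability-aware dual-stream learning without explicit confidence supervision; (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, Bone Motion) with RGB representations, fusing structural and motion cues. The approach shows consistent improvements and robustness on several benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and on a newly defined hand-centric benchmark.
Link: https://arxiv.org/abs/2601.00369
Authors: Seungyeon Cho, Tae-kyun Kim
Affiliations: KAIST (韩国科学技术院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages; 8 figures. Extension of previous conference paper. Project page: this https URL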
Abstract:Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
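The Noisy-OR rule mentioned in component (2) combines independent probabilities as 1 - prod(1 - p_i). The sketch below applies it per class to two streams; the `fuse_streams` helper, the stream names, and the final renormalization are assumptions made for illustration, since the abstract does not describe how the fusion is wired into the network.

```python
def noisy_or(probabilities):
    """Noisy-OR combination: 1 - prod(1 - p_i)."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining


def fuse_streams(body_probs, hand_probs):
    """Fuse per-class probabilities from two streams, then renormalize.

    Renormalizing to sum to 1 is an illustrative choice, not taken from the paper.
    """
    fused = [noisy_or([b, h]) for b, h in zip(body_probs, hand_probs)]
    total = sum(fused)
    return [f / total for f in fused]
```

Under Noisy-OR a class scores highly if either stream is confident about it, so a confident hand stream can carry a fine-grained class even when the body stream is uninformative.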
[CV-39] Mask-Conditioned Voxel Diffusion for Joint Geometry and Color Inpainting
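【Quick Read】: This paper addresses joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The key solution is a lightweight two-stage framework. In the first stage, a 2D convolutional network predicts damage masks on RGB slices, and the predictions are aggregated into a volumetric mask in voxel space. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned joint reconstruction on the voxel grid, using an objective that combines occupancy reconstruction, masked color reconstruction, and perceptual regularization to recover geometry and color together while preserving undamaged regions. Experiments indicate that explicit mask conditioning improves reconstruction completeness and color coherence.
Link: https://arxiv.org/abs/2601.00368
Authors: Aarya Sumuk
Affiliations: Stanford University (斯坦福大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures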
Abstract:We present a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The pipeline separates damage localization from reconstruction. In the first stage, a 2D convolutional network predicts damage masks on RGB slices extracted from a voxelized object, and these predictions are aggregated into a volumetric mask. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting directly on voxel grids, reconstructing geometry and color while preserving observed regions. The model jointly predicts occupancy and color using a composite objective that combines occupancy reconstruction with masked color reconstruction and perceptual regularization. We evaluate the approach on a curated set of textured artifacts with synthetically generated damage using standard geometric and color metrics. Compared to symmetry-based baselines, our method produces more complete geometry and more coherent color reconstructions at a fixed 32^3 resolution. Overall, the results indicate that explicit mask conditioning is a practical way to guide volumetric diffusion models for joint 3D geometry and color inpainting.
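Two ingredients named in the abstract, preserving observed regions and a masked color reconstruction term, can be illustrated on flattened voxel lists. This is a sketch under assumptions: compositing prediction and observation through the mask is one common way to keep observed voxels intact, and mean squared error is one plausible choice of color loss; the paper may do either differently. Function names are hypothetical.

```python
def composite(observed, predicted, damage_mask):
    """Keep observed voxels where the mask is 0 and use predictions where it is 1."""
    return [
        pred if masked else obs
        for obs, pred, masked in zip(observed, predicted, damage_mask)
    ]


def masked_mse(target, prediction, mask):
    """Mean squared error over positions where mask is 1 (0.0 if none are masked)."""
    errors = [(t - p) ** 2 for t, p, m in zip(target, prediction, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0
```

Restricting the color loss to masked voxels concentrates the training signal on the damaged region, while compositing guarantees that undamaged voxels pass through unchanged.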
[CV-40] Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers IROS2025
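【Quick Read】: This paper addresses the scene understanding that robots need to interact effectively and intuitively with untrained humans in domestic environments, and in particular the inflexibility of classical semantic segmentation with fixed predefined classes. The key solution is DVEFormer, an efficient RGB-D Transformer-based approach that uses knowledge distillation from Alpha-CLIP teacher embeddings to guide the student model in learning fine-grained, pixel-wise dense text-aligned visual embeddings (DVE). This still supports classical semantic segmentation (for example via linear probing) while also enabling text-based querying and 3D map creation. The method runs in real time (26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin).
Link: https://arxiv.org/abs/2601.00359
Authors: Söhnke Benedikt Fischedick, Daniel Seichter, Benedict Stephan, Robin Schmidt, Horst-Michael Gross
Affiliations: Technische Universität Ilmenau (伊尔梅瑙工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Published in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)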
Abstract:In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
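Text-based querying of dense text-aligned embeddings usually comes down to comparing each pixel embedding with a text embedding. The sketch below does this with cosine similarity and a fixed threshold on toy vectors; the function names and the threshold are assumptions, and in practice the text embedding would come from the matching text encoder rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_pixels(pixel_embeddings, text_embedding, threshold=0.8):
    """Return indices of pixels whose embedding matches the text embedding."""
    return [
        index
        for index, embedding in enumerate(pixel_embeddings)
        if cosine_similarity(embedding, text_embedding) >= threshold
    ]
```

Because the class set is not baked into the model, changing the query only means swapping the text embedding; no retraining or new segmentation head is involved.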
[CV-41] OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning
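【Quick Read】: This paper addresses the weak cross-domain generalization of visual-tactile learning (VTL) caused by modality discrepancies and domain gaps, specifically how to adapt to unseen domains when training on a single domain. The core solution is the OmniVaT framework. First, a multimodal fractional Fourier adapter (MFFA) maps visual and tactile embeddings into a unified embedding-frequency space, mitigating the modality gap without multi-domain training data or elaborate cross-modal fusion strategies. Second, a discrete tree generation (DTG) module obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, improving adaptivity to fluctuating domain shifts in unseen domains.
Link: https://arxiv.org/abs/2601.00352
Authors: Liuxiang Qiu, Hui Da, Yuzhen Niu, Tiesong Zhao, Yang Cao, Zheng-Jun Zha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: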
Abstract:Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.
[CV-42] Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation
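【Quick Read】: This paper addresses speeding in developing countries such as Uganda, where road safety infrastructure is limited, with the aim of reducing road fatalities. The core solution is a real-time intelligent traffic surveillance system for resource-constrained settings that combines computer vision techniques for vehicle detection, license plate recognition, and speed estimation. License plate detection with YOLOv8 reaches 97.9% mAP; for character recognition, a CNN model obtains a character error rate (CER) of 3.85% and a transformer model lowers it to 1.79%; speed estimation uses source and target regions of interest (ROI) with a margin of error of 10 km/h. A database links user information to vehicle detections, and the Africa's Talking API is used to issue tickets automatically by SMS, closing the loop on automated enforcement.
Link: https://arxiv.org/abs/2601.00344
Authors: Bruce Mugizi, Sudi Murindanyi, Olivia Nakacwa, Andrew Katumba
Affiliations: Makerere University (马凯雷雷大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: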
Abstract:Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding a margin of error of 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance by SMS through the Africa’s Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
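Two of the quantities reported above are easy to make concrete. The sketch below computes an average speed from the frames at which a vehicle crosses two regions of interest, and a character error rate as edit distance divided by reference length. The known ROI spacing, the frame-based timing, and all function names are assumptions for illustration; the abstract does not specify how the system calibrates distance or measures time.

```python
def speed_kmh(distance_m, frame_enter, frame_exit, fps):
    """Average speed between two ROIs a known distance apart (assumed calibration)."""
    elapsed_s = (frame_exit - frame_enter) / fps
    if elapsed_s <= 0:
        raise ValueError("exit frame must come after entry frame")
    return distance_m / elapsed_s * 3.6  # m/s to km/h


def edit_distance(reference, hypothesis):
    """Levenshtein distance using a rolling dynamic-programming row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, a vehicle that takes 30 frames at 30 FPS to cover 20 m is moving at about 72 km/h, and a single wrong character in a seven-character plate gives a CER of 1/7.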
[CV-43] Joint Geometry-Appearance Human Reconstruction in a Unified Latent Space via Bridge Diffusion
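【Quick Read】: This paper addresses high-fidelity reconstruction of the geometry and appearance of 3D digital humans from a single RGB image. Existing methods usually use decoupled pipelines for geometry estimation and appearance synthesis, which leads to inconsistent reconstructions that are hard to optimize jointly. The key solution is JGA-LBD, a framework that unifies geometry and appearance modeling in a joint latent representation and generates it with bridge diffusion. Heterogeneous input conditions (such as depth maps and SMPL models) are first unified into 3D Gaussian representations and compressed into a unified latent space by a shared sparse variational autoencoder (VAE). Bridge diffusion then starts from a partial observation of the target latent code and focuses only on inferring the missing components. Finally, a dedicated decoding module extracts the complete 3D human structure and renders novel views. The method improves geometry fidelity and appearance quality, including in challenging in-the-wild scenarios.
Link: https://arxiv.org/abs/2601.00328
Authors: Yingzhi Tang, Qijian Zhang, Junhui Hou
Affiliations: City University of Hong Kong (香港城市大学); Tencent Games (腾讯游戏)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: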
Abstract:Achieving consistent and high-fidelity geometry and appearance reconstruction of 3D digital humans from a single RGB image is an inherently challenging task. Existing studies typically resort to decoupled pipelines for geometry estimation and appearance synthesis, often hindering unified reconstruction and causing inconsistencies. This paper introduces JGA-LBD, a novel framework that unifies the modeling of geometry and appearance into a joint latent representation and formulates the generation process as bridge diffusion. Observing that directly integrating heterogeneous input conditions (e.g., depth maps, SMPL models) leads to substantial training difficulties, we unify all conditions into 3D Gaussian representations, which can be further compressed into a unified latent space through a shared sparse variational autoencoder (VAE). Subsequently, the specialized form of bridge diffusion makes it possible to start from a partial observation of the target latent code and focus solely on inferring the missing components. Finally, a dedicated decoding module extracts the complete 3D human geometric structure and renders novel views from the inferred latent representation. Experiments demonstrate that JGA-LBD outperforms current state-of-the-art approaches in terms of both geometry fidelity and appearance quality, including in challenging in-the-wild scenarios. Our code will be made publicly available at this https URL.
[CV-44] HarmoniAD: Harmonizing Local Structures and Global Semantics for Anomaly Detection
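【Quick Read】: This paper addresses the detection of tiny defects in industrial product quality inspection, where existing methods face a structure-semantics trade-off: structure-oriented models based on frequency filtering are sensitive to noise, while semantics-oriented models based on CLIP often miss fine details. The key solution is HarmoniAD, a frequency-guided dual-branch framework for complementary modeling of structure and semantics. Features are extracted by the CLIP image encoder, transformed into the frequency domain, and decoupled into high- and low-frequency paths. The high-frequency branch uses a fine-grained structural attention module (FSAM) to enhance textures and edges for detecting small anomalies, and the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. Together the two branches balance sensitivity to detail with global robustness.
Link: https://arxiv.org/abs/2601.00327
Authors: Naiqi Zhang, Chuancheng Shi, Jingtong Dou, Wenhua Wu, Fei Shen, Jianhua Cao
Affiliations: Tianjin University of Science and Technology (天津科技大学); The University of Sydney (悉尼大学); National University of Singapore (新加坡国立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: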
Abstract:Anomaly detection is crucial in industrial product quality inspection. Failing to detect tiny defects often leads to serious consequences. Existing methods face a structure-semantics trade-off: structure-oriented models (such as frequency-based filters) are noise-sensitive, while semantics-oriented models (such as CLIP-based encoders) often miss fine details. To address this, we propose HarmoniAD, a frequency-guided dual-branch framework. Features are first extracted by the CLIP image encoder, then transformed into the frequency domain, and finally decoupled into high- and low-frequency paths for complementary modeling of structure and semantics. The high-frequency branch is equipped with a fine-grained structural attention module (FSAM) to enhance textures and edges for detecting small anomalies, while the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. Together, these branches balance fine detail and global semantics. HarmoniAD further adopts a multi-class joint training strategy, and experiments on MVTec-AD, VisA, and BTAD show state-of-the-art performance with both sensitivity and robustness.
[CV-45] Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation AAAI2026
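【Quick Read】: This paper addresses the confusion between transmission and reflection layers in image reflection separation, which results from the limited information in a single image and is more severe under nighttime lighting. The key solution is the Depth-Memory Decoupling Network (DMDNet), with three core components: 1) Depth-Aware Scanning (DAScan), which guides Mamba toward salient structures and promotes information flow along semantic coherence to construct stable states; 2) the Depth-Synergized State-Space Model (DS-SSM), which uses depth to modulate the sensitivity of state activations and suppresses the spread of ambiguous features that interfere with layer disentanglement; 3) the Memory Expert Compensation Module (MECM), which draws on cross-image historical knowledge to guide experts in providing layer-specific compensation. The method improves reflection separation in both daytime and nighttime scenes.
Link: https://arxiv.org/abs/2601.00322
Authors: Siyan Fang, Long Peng, Yuntao Wang, Ruonan Wei, Yuehuan Wang
Affiliations: Huazhong University of Science and Technology (华中科技大学); University of Science and Technology of China (中国科学技术大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by AAAI 2026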
Abstract:Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime.
[CV-46] ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition
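[Quick read]: This paper targets unstable and insufficiently discriminative representations in video behavior recognition under complex spatiotemporal variations. Existing augmentation strategies are mostly perturbation-driven: they introduce uncontrolled variations that amplify non-discriminative factors, which weakens intra-class distributional structure and yields inconsistent gains across temporal scales. The key to the solution is Representation-aware Mixing Augmentation (ReMA), a plug-and-play method that controls the mixing process through two complementary mechanisms. The Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift and improving statistical reliability. The Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, steering them away from discrimination-sensitive regions and promoting temporal coherence. ReMA needs no additional supervision or trainable parameters; by jointly controlling how and where mixing is applied, it improves representation robustness and generalization across spatiotemporal granularities.
Link: https://arxiv.org/abs/2601.00311
Authors: Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Jinglong Guo, Qifan Cai, Xin Yan, Zhi Liu
Affiliations: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. Tongyi Lab (通义实验室); 4. Peking University (北京大学); 5. Huawei Technologies (华为技术)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: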
Abstract:Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, which finally weaken intra-class distributional structure and representation drift with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. Firstly, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
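The "where to mix" idea can be illustrated with a toy sketch. This is not ReMA's implementation: the frame-difference motion test, the `threshold` parameter, and both function names are assumptions made for illustration, and clips are plain nested lists of grayscale values.

```python
def low_motion_masks(clip, threshold):
    """For each frame, mark pixels (1) whose value changed little versus a
    neighbouring frame. Assumption: low motion = safe to replace."""
    masks = []
    for t, frame in enumerate(clip):
        # Compare with the previous frame; the first frame uses the next one.
        other = clip[t - 1] if t > 0 else clip[t + 1]
        masks.append([
            [1 if abs(a - b) < threshold else 0 for a, b in zip(row, orow)]
            for row, orow in zip(frame, other)
        ])
    return masks


def mix_clips(anchor, donor, masks):
    """Replace only the masked (low-motion) pixels of `anchor` with `donor`."""
    return [
        [[d if m else a for a, d, m in zip(ra, rd, rm)]
         for ra, rd, rm in zip(fa, fd, fm)]
        for fa, fd, fm in zip(anchor, donor, masks)
    ]


# Two 1x2 frames: the right pixel moves, the left pixel is static.
anchor = [[[0, 0]], [[0, 10]]]
donor = [[[99, 99]], [[99, 99]]]
masks = low_motion_masks(anchor, threshold=5)
mixed = mix_clips(anchor, donor, masks)
```

In this toy case only the static left pixel is replaced, so the moving region of the anchor clip is left untouched.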
[CV-47] VisNet: Efficient Person Re-Identification via Alpha-Divergence Loss Feature Fusion and Dynamic Multi-Task Learning
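[Quick read]: This paper addresses the difficulty of reconciling high computational cost with accuracy in practical person re-identification (ReID), especially for resource-constrained surveillance and mobile deployments. The key to its solution is VisNet, a lightweight architecture that combines multi-scale feature fusion without parallel paths, semantic clustering that introduces spatial constraints through rule-based pseudo-labeling, a dynamic weight averaging technique to balance classification against semantic regularization, and the FIDI loss for better metric learning. It reports 87.05% Rank-1 and 77.65% mAP on Market-1501 with only 32.41M parameters and 4.601 GFLOPs, which makes real-time deployment practical.
Link: https://arxiv.org/abs/2601.00307
Authors: Anns Ijaz, Muhammad Azeem Javed
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: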
Abstract:Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but with high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It is the culmination of conceptual contributions, including feature fusion at multiple scales with automatic attention on each, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification semantic regularization, and the use of loss function FIDI for improved metric learning tasks. The multiple scales fuse ResNet50’s stages 1 through 4 without the use of parallel paths, with semantic clustering introducing spatial constraints through the use of rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset, having 32.41M parameters and 4.601 GFLOPs, hence, proposing a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.
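The abstract does not spell out its dynamic weight averaging rule. The sketch below assumes the commonly used Dynamic Weight Average formulation from the multi-task learning literature, in which each task weight depends on how fast that task's loss has been falling. The function name and the `temperature` default are illustrative choices, not values from the paper.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average: tasks whose loss decreases more slowly
    (ratio closer to or above 1) receive a larger weight.
    Weights are normalised to sum to the number of tasks."""
    ratios = [l1 / l2 for l1, l2 in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    num_tasks = len(ratios)
    return [num_tasks * e / total for e in exps]


# Task 0 improved from 1.0 to 0.9 (slow), task 1 from 1.0 to 0.5 (fast).
weights = dwa_weights([0.9, 0.5], [1.0, 1.0])
```

The slower task ends up with the larger weight, which is the balancing behaviour this family of schemes is designed for.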
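[CV-48] TimeColor: Flexible Reference Colorization via Temporal Concatenation
[Quick read]: This paper addresses the low color fidelity, weak identity consistency, and poor temporal stability of video colorization models that condition on a single reference, typically the first frame of the scene. Such methods ignore heterogeneous references like character sheets, background images, or arbitrary colorized frames. The key to the solution is TimeColor, a sketch-based model that encodes each reference as an additional latent frame and concatenates these frames along the time axis, so that multiple references are processed concurrently in each diffusion step while the parameter count stays fixed. It also uses spatiotemporal correspondence-masked attention and modality-disjoint RoPE indexing to mitigate shortcutting and cross-identity palette leakage, which improves color fidelity, identity consistency, and temporal stability.
Link: https://arxiv.org/abs/2601.00296
Authors: Bryan Constantine Sadihin, Yihao Meng, Michael Hua Wang, Matteo Jiahao Chen, Hang Su
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Demo samples are available at: this https URL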
Abstract:Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model’s parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.
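A minimal sketch of what temporal concatenation with disjoint position indices could look like. This is only an illustration of the idea described in the abstract: the function name, the `ref_position_offset` value, and the use of strings as stand-ins for latent frames are all assumptions.

```python
def concat_references(video_latents, reference_latents, ref_position_offset=1000):
    """Append reference latents after the video latents along the time axis.

    Video frames keep positions 0..T-1; references get positions from a
    separate range (hypothetical offset) so the two index sets never overlap.
    """
    sequence = list(video_latents) + list(reference_latents)
    positions = list(range(len(video_latents))) + [
        ref_position_offset + i for i in range(len(reference_latents))
    ]
    is_reference = [False] * len(video_latents) + [True] * len(reference_latents)
    return sequence, positions, is_reference


sequence, positions, is_reference = concat_references(
    ["f0", "f1", "f2"], ["character_sheet", "background"]
)
```

Because the references only lengthen the sequence, a variable number of them can be supplied without adding model parameters, which matches the property the abstract highlights.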
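[CV-49] Towards Automated Differential Diagnosis of Skin Diseases Using Deep Learning and Imbalance-Aware Strategies AusDM'25
[Quick read]: This paper addresses the difficulty of obtaining timely and accurate skin disease diagnoses when dermatologists are in short supply. The key to its solution is a deep learning model that is pretrained on publicly available skin disease image datasets so that it extracts visual features effectively and classifies a range of skin lesions. The model is based on the Swin Transformer and is combined with an optimized data preprocessing workflow and targeted data augmentation. It reaches 87.71% prediction accuracy across eight skin lesion classes on the ISIC2019 dataset, which suggests potential as a diagnostic support tool for clinicians and a self-assessment aid for patients.
Link: https://arxiv.org/abs/2601.00286
Authors: Ali Anaissi, Ali Braytee, Weidong Huang, Junaid Akram, Alaa Farhat, Jie Hua
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 23rd Australasian Data Science and Machine Learning Conference (AusDM’25)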
Abstract:As dermatological conditions become increasingly common and the availability of dermatologists remains limited, there is a growing need for intelligent tools to support both patients and clinicians in the timely and accurate diagnosis of skin diseases. In this project, we developed a deep learning based model for the classification and diagnosis of skin conditions. By leveraging pretraining on publicly available skin disease image datasets, our model effectively extracted visual features and accurately classified various dermatological cases. Throughout the project, we refined the model architecture, optimized data preprocessing workflows, and applied targeted data augmentation techniques to improve overall performance. The final model, based on the Swin Transformer, achieved a prediction accuracy of 87.71 percent across eight skin lesion classes on the ISIC2019 dataset. These results demonstrate the model’s potential as a diagnostic support tool for clinicians and a self assessment aid for patients.
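The title mentions imbalance-aware strategies, but the abstract does not say which ones were used. One common option, shown here purely as an assumed example, is inverse-frequency class weighting with the "balanced" heuristic `n_samples / (n_classes * class_count)`. The label names are placeholders.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count), so rare
    classes contribute more to a weighted loss."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical label distribution: three common lesions, one rare lesion.
class_weights = balanced_class_weights(["common", "common", "common", "rare"])
```

Such weights are typically passed to a weighted cross-entropy loss so that minority lesion classes are not drowned out by majority ones.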
[CV-50] SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
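[Quick read]: This paper addresses dynamic reconstruction of a target moving over a large area when observations are sparse. Conventional methods need dense coverage in both viewpoint and time, which rarely holds in real settings where observations are sparse and come from diverse viewpoints (for example, security cameras), so the problem becomes highly ill-posed. The key to the solution is SV-GS, a framework that jointly estimates a deformation model and the object's motion over time from sparse observations. Guided by a rough skeleton graph and an initial static reconstruction, it optimizes a skeleton-driven deformation field made of a coarse joint pose estimator and a fine-grained deformation module. Only the joint pose estimator is time-dependent, which enables smooth motion interpolation while preserving geometric detail. The method improves PSNR by up to 34% under sparse observations on synthetic datasets, and the initial static reconstruction can be replaced by a diffusion-based generative prior, which makes it more practical.
Link: https://arxiv.org/abs/2601.00285
Authors: Jun-Jee Chao, Volkan Isler
Affiliations: University of Minnesota (明尼苏达大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: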
Abstract:Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object’s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.
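For readers unfamiliar with the PSNR metric quoted above, here is a small reference sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). Images are flattened to plain lists and `max_value=1.0` assumes pixel values in [0, 1]; the paper's exact evaluation protocol is not described here.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 gives MSE = 0.01, i.e. 20 dB.
value_db = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Since PSNR is logarithmic, a percentage gain in PSNR is not the same as a percentage reduction in pixel error.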
[CV-51] Disentangling Hardness from Noise: An Uncertainty-Driven Model-Agnostic Framework for Long-Tailed Remote Sensing Classification
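[Quick read]: This paper addresses the difficulty of telling hard tail samples apart from noisy, ambiguous ones under the long-tailed distributions found in remote sensing imagery. Conventional methods emphasize all low-confidence samples indiscriminately and therefore overfit to noisy data. The key to the solution is DUAL, a model-agnostic uncertainty-aware framework that uses Evidential Deep Learning to dynamically disentangle prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). EU serves as an indicator of sample scarcity and guides a reweighting strategy for hard-to-learn tail samples. AU quantifies data ambiguity and drives an adaptive label smoothing mechanism that suppresses the impact of noise.
Link: https://arxiv.org/abs/2601.00278
Authors: Chi Ding, Junxiao Xue, Xinyi Yin, Shi Chen, Yunyun Shi, Yiduo Wang, Fengjian Xue, Xuecheng Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: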
Abstract:Long-Tailed distributions are pervasive in remote sensing due to the inherently imbalanced occurrence of grounded objects. However, a critical challenge remains largely overlooked, i.e., disentangling hard tail data samples from noisy ambiguous ones. Conventional methods often indiscriminately emphasize all low-confidence samples, leading to overfitting on noisy data. To bridge this gap, building upon Evidential Deep Learning, we propose a model-agnostic uncertainty-aware framework termed DUAL, which dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). Specifically, we introduce EU as an indicator of sample scarcity to guide a reweighting strategy for hard-to-learn tail samples, while leveraging AU to quantify data ambiguity, employing an adaptive label smoothing mechanism to suppress the impact of noise. Extensive experiments on multiple datasets across various backbones demonstrate the effectiveness and generalization of our framework, surpassing strong baselines such as TGN and SADE. Ablation studies provide further insights into the crucial choices of our design.
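The abstract does not give DUAL's exact formulas. The sketch below shows only the standard Evidential Deep Learning quantities it builds on (Dirichlet parameters from evidence, expected class probabilities, and the vacuity term K / S) plus plain label smoothing. Treating vacuity as an epistemic-style signal and passing the smoothing strength `eps` in by hand are assumptions for illustration.

```python
def edl_outputs(evidence):
    """From non-negative per-class evidence, build Dirichlet parameters
    alpha = evidence + 1 and return (expected probabilities, vacuity)."""
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)              # Dirichlet strength S
    probs = [a / strength for a in alpha]
    vacuity = len(alpha) / strength    # K / S: high when evidence is scarce
    return probs, vacuity


def smooth_label(true_index, num_classes, eps):
    """Label smoothing: move `eps` of the target mass to a uniform prior."""
    return [
        (1.0 - eps) * (1.0 if k == true_index else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


no_evidence = edl_outputs([0.0, 0.0, 0.0])
confident = edl_outputs([9.0, 0.0, 0.0])
target = smooth_label(0, 3, eps=0.3)
```

With no evidence the vacuity is 1.0, and it shrinks as evidence accumulates, which is why it is a natural candidate for flagging scarce tail samples.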
[CV-52] FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering
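[Quick read]: This paper addresses faithfulness hallucinations in visual question answering (VQA), where vision-language models produce fluent but visually ungrounded answers and so become unreliable in safety-critical settings. Existing detectors fall into two groups: external verification methods are computationally expensive and limited by the quality of external resources, while uncertainty-driven methods capture only a few facets of model uncertainty and underuse the internal signals tied to different failure modes. The authors propose FaithSCAN, a lightweight network that fuses several internal signals (token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features) through branch-wise evidence encoding and uncertainty-aware attention. They also extend the LLM-as-a-Judge paradigm with a low-cost strategy that automatically generates model-dependent supervision signals, so training needs no human labels. Experiments on multiple VQA benchmarks show that FaithSCAN outperforms existing methods in both effectiveness and efficiency. The analysis indicates that hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding, and that different internal signals provide complementary diagnostic cues.
Link: https://arxiv.org/abs/2601.00269
Authors: Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences (CAS); School of Cyber Security, University of CAS; China Industrial Control Systems Cyber Emergency Response Team; China Electronics Standardization Institute, Ministry of Industry and Information Technology of the People’s Republic of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures, 5 tables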
Abstract:Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
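One of the internal signals named above, token-level decoding uncertainty, is often measured as the entropy of each next-token distribution. The sketch below shows that generic measure only. Whether FaithSCAN uses entropy specifically is an assumption, and the probabilities are made up.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sequence_uncertainty(step_distributions):
    """Per-token entropies for a decoded answer, one value per step."""
    return [token_entropy(p) for p in step_distributions]


# Step 0 is confident, step 1 is maximally uncertain over four tokens.
entropies = sequence_uncertainty([
    [1.0, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
])
```

A single pass over the decoder outputs is enough to compute this, in contrast to methods that need repeated sampling.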
[CV-53] ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching
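[Quick read]: This paper addresses the safety, copyright, and ethical risks of text-to-image diffusion models, in particular the unwanted generation of sensitive concepts such as nudity, specific artistic styles, or objects. Most existing concept erasure methods rely on data-intensive and computationally expensive fine-tuning, which limits their efficiency. The key to the solution is ActErase, a training-free activation erasure method: it identifies activation difference regions related to the target concept through prompt-pair analysis, then extracts target activations and dynamically replaces input activations during the forward pass. This gives efficient and precise concept erasure while preserving the model's overall generative capability, and the method is reported to be robust against adversarial attacks.
Link: https://arxiv.org/abs/2601.00267
Authors: Yi Sun, Xinhao Zhong, Hongyan Li, Yimin Zhou, Junhao Li, Bin Chen, Xuan Wang
Affiliations: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Peng Cheng Laboratory (鹏城实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: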
Abstract:Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model’s activations are predominantly composed of generic concepts, with only a minimal component can represent the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrates that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model’s overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
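The prompt-pair idea can be sketched on flat activation vectors. This is a toy illustration rather than ActErase itself: selecting the `top_k` largest differences, replacing them with activations from a neutral prompt, and all names are assumptions.

```python
def difference_mask(act_with_concept, act_neutral, top_k):
    """Indices where a concept prompt and a neutral prompt differ most."""
    diffs = [abs(a - b) for a, b in zip(act_with_concept, act_neutral)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return set(ranked[:top_k])


def patch_activations(activations, replacement, mask):
    """Overwrite only the masked positions, leaving the rest untouched."""
    return [replacement[i] if i in mask else a for i, a in enumerate(activations)]


concept = [1.0, 5.0, 1.0, 9.0]   # activations for a prompt with the concept
neutral = [1.0, 1.0, 1.0, 1.0]   # activations for the paired neutral prompt
mask = difference_mask(concept, neutral, top_k=2)
patched = patch_activations([2.0, 6.0, 2.0, 8.0], neutral, mask)
```

Only the positions that respond to the concept are overwritten, which reflects the abstract's observation that most of the activation carries generic content.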
[CV-54] S1-MMAlign: A Large-Scale Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
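[Quick read]: This paper addresses the semantic gap that hinders multimodal learning for scientific discovery, namely the weak alignment between complex scientific imagery and sparse textual descriptions. The key to the solution is S1-MMAlign, a large-scale multi-disciplinary dataset of more than 15.5 million high-quality image-text pairs, together with a semantic enhancement pipeline that uses the Qwen-VL multimodal model series to recaption figures by combining paper abstracts and citation contexts. Technical validation shows reduced semantic ambiguity under SciBERT-based pseudo-perplexity and an 18.21% improvement in CLIP score, so the dataset offers a foundational resource for scientific reasoning and cross-modal understanding in AI for Science.
Link: https://arxiv.org/abs/2601.00264
Authors: He Wang, Longteng Guo, Pengkang Huo, Xuanxu Lin, Yichen Yuan, Jie Jiang, Jing Liu
Affiliations: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences (中国科学院大学交叉学科院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures. Dataset available at this https URL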
Abstract:Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at this https URL.
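CLIP-style alignment scores are based on the cosine similarity between an image embedding and a text embedding. The sketch below shows that similarity and a relative-improvement calculation. The embeddings and scores are invented, and reading the reported 18.21% as a relative gain is an assumption, since the abstract does not say.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_improvement(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Hypothetical mean scores before and after recaptioning.
gain = relative_improvement(0.25, 0.295525)
```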
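[CV-55] TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models
[Quick read]: This paper addresses the tension between high computational cost and representation capability that 3D-CT foundation models face in clinical use. The key to its solution is TotalFM, a learning framework based on organ separation. It automatically builds pairs of organ volumes and finding sentences (using segmentation and LLM-based processing of radiology reports) and combines VideoMAE self-supervised pretraining with contrastive learning on volume-text pairs, which keeps computation manageable while improving generalization. In zero-shot organ-wise and finding-wise lesion classification, the model outperforms CT-CLIP and Merlin on most organs and finding categories, supporting organ separation as an effective and practical design for 3D-CT foundation models.
Link: https://arxiv.org/abs/2601.00260
Authors: Kohei Yamamoto, Tomohiro Kikuchi
Affiliations: Jichi Medical University (日本自治医科大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: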
Abstract:While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
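Contrastive learning on volume-text pairs is commonly implemented with a symmetric InfoNCE (CLIP-style) loss. The sketch below assumes that formulation: the paper's actual loss and temperature are not given in the abstract, and the similarity matrices are toy values.

```python
import math


def _cross_entropy_diagonal(sim, temperature):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


def symmetric_info_nce(sim, temperature=0.07):
    """Average of volume-to-text and text-to-volume losses for a square
    similarity matrix (rows: volumes, columns: finding sentences)."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (
        _cross_entropy_diagonal(sim, temperature)
        + _cross_entropy_diagonal(transposed, temperature)
    )


aligned_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = symmetric_info_nce([[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each organ volume is most similar to its own finding sentence and large when pairs are swapped.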
[CV-56] Next Generation Intelligent Low-Altitude Economy Deployments: The O-RAN Perspective
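[Quick read]: This paper addresses the lack of real-time, resilient, and context-aware orchestration of unmanned aerial vehicle (UAV) missions for low-altitude economy (LAE) applications in complex, signal-constrained environments, and in particular the limited integration of artificial intelligence (AI) specialized for LAE. The key to the solution is an open radio access network (O-RAN)-enabled LAE framework that achieves closed-loop coordination of aerial nodes through the disaggregated RAN architecture, open interfaces, and RAN intelligent controllers (RICs). A semantic-aware rApp acts as a terrain interpreter and provides semantic guidance to a reinforcement learning-enabled xApp, which performs real-time trajectory planning for LAE swarm nodes and so enables AI-optimized, mission-critical operation.
Link: https://arxiv.org/abs/2601.00257
Authors: Aly Sabri Abdalla, Vuk Marojevic
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: This article has been accepted for publication in the IEEE Wireless Communications Magazine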
Abstract:Despite the growing interest in low-altitude economy (LAE) applications, including UAV-based logistics and emergency response, fundamental challenges remain in orchestrating such missions over complex, signal-constrained environments. These include the absence of real-time, resilient, and context-aware orchestration of aerial nodes with limited integration of artificial intelligence (AI) specialized for LAE missions. This paper introduces an open radio access network (O-RAN)-enabled LAE framework that leverages seamless coordination between the disaggregated RAN architecture, open interfaces, and RAN intelligent controllers (RICs) to facilitate closed-loop, AI-optimized, and mission-critical LAE operations. We evaluate the feasibility and performance of the proposed architecture via a semantic-aware rApp that acts as a terrain interpreter, offering semantic guidance to a reinforcement learning-enabled xApp, which performs real-time trajectory planning for LAE swarm nodes. We survey the capabilities of UAV testbeds that can be leveraged for LAE research, and present critical research challenges and standardization needs.
[CV-57] Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture
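[Quick read]: This paper addresses the high cost, low efficiency, and environmental harm of traditional pest management, which relies on manual field inspection and chemical pesticides, with a focus on resource-limited small farmers. The key to its solution is a lightweight Decision Support System with two core modules. The Pest Detection Module combines a compact convolutional neural network (CNN) with prototypical meta-learning to recognize pests accurately from only a few samples. The Pesticide Recommendation Module takes environmental factors such as crop type and growth stage into account to suggest safe and eco-friendly pesticides. The framework can be deployed on low-resource devices such as smartphones and drones, and it reduces computational complexity substantially while matching the accuracy of state-of-the-art models, supporting real-time, sustainable pest management in precision agriculture.
Link: https://arxiv.org/abs/2601.00243
Authors: Anirudha Ghosh, Ritam Sarkar, Debaditya Barman
Affiliations: Visva-Bharati University (维斯瓦巴拉蒂大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the 3rd International Conference on Nonlinear Dynamics and Applications (ICNDA 2026), 12 pages, 7 figures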
Abstract:Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers. The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to suggest safe and eco-friendly pesticide recommendations. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization. Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture. 
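The prototypical meta-learning step can be sketched as nearest-prototype classification: each class prototype is the mean of its few support embeddings, and a query takes the label of the closest prototype. In the sketch below the embeddings are made up and stand in for the output of the paper's CNN, and the pest names are placeholders.

```python
def class_prototypes(support):
    """Mean embedding per class from a few labelled support examples."""
    protos = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        protos[label] = [
            sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)
        ]
    return protos


def classify(query, protos):
    """Return the label whose prototype is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))


# Two-shot support set with hypothetical 2-D embeddings.
support_set = {
    "aphid": [[0.0, 0.0], [0.0, 2.0]],
    "borer": [[4.0, 4.0], [6.0, 4.0]],
}
protos = class_prototypes(support_set)
prediction = classify([1.0, 1.0], protos)
```

New pest classes can be added by supplying a handful of support images, without retraining a classification head.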
[CV-58] Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection
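[Quick read]: This paper addresses the scarcity of infrared (IR) data in Printed Circuit Board (PCB) defect detection, a bottleneck that limits deep learning models trained on IR images. The key to its solution is a cross-modal data augmentation framework that integrates CycleGAN and YOLOv8. CycleGAN first performs unpaired visible-to-infrared image translation to generate high-fidelity pseudo-IR samples that preserve the structural semantics of defects and simulate thermal distribution patterns. A heterogeneous training strategy then fuses the synthetic pseudo-IR data with the limited real IR samples to train a lightweight YOLOv8 detector. Experiments show improved feature learning under low-data conditions, with performance approaching the fully supervised benchmark, which supports pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
Link: https://arxiv.org/abs/2601.00237
Authors: Chao Yang, Haoyuan Zheng, Yue Ma
Affiliations: Xi’an Jiaotong-Liverpool University (西交利物浦大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 8 figures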
Abstract:This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
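Unpaired translation with CycleGAN relies on a cycle-consistency term: translating visible to IR and back should recover the input, and likewise in the other direction. The sketch below shows that L1 term with toy generators. The lambdas stand in for trained networks and are purely illustrative.

```python
def l1_distance(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(visible, infrared, to_ir, to_visible):
    """L1 cycle loss: |F(G(x)) - x| + |G(F(y)) - y| with G = to_ir and
    F = to_visible."""
    forward = l1_distance(to_visible(to_ir(visible)), visible)
    backward = l1_distance(to_ir(to_visible(infrared)), infrared)
    return forward + backward


# Toy generators: a perfect inverse pair, and a pair that is not inverse.
shift_up = lambda img: [p + 1.0 for p in img]
shift_down = lambda img: [p - 1.0 for p in img]
identity = lambda img: list(img)

perfect = cycle_consistency_loss([0.2, 0.4], [0.6, 0.8], shift_up, shift_down)
broken = cycle_consistency_loss([0.2, 0.4], [0.6, 0.8], shift_up, identity)
```

In full CycleGAN training this term is combined with adversarial losses for both domains; it is the term that lets training proceed without paired visible and IR images.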
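[CV-59] Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions NIPS2025
[Quick read]: This paper addresses the limited generalization of deep learning-based Blind Image Quality Assessment (BIQA) models caused by the scarcity of large-scale labeled data, and in particular the poor real-world performance of models trained on existing synthetic datasets. The key is to identify and correct a discrete, clustered pattern in representations learned from synthetic data: features of high-quality images cluster around reference images, while features of low-quality images cluster by distortion type, which hinders regression. The authors propose SynDR-IQA, a framework grounded in a theoretical analysis of how sample diversity and redundancy affect generalization error. It uses two strategies: distribution-aware diverse content upsampling, which increases visual diversity while preserving the content distribution, and density-aware redundant cluster downsampling, which reduces sample density in densely clustered regions to balance the data distribution and improve generalization.
Link: https://arxiv.org/abs/2601.00225
Authors: Aobo Li, Jinjian Wu, Yongxu Liu, Leida Li, Weisheng Dong
Affiliations: Xidian University (西安电子科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NIPS 2025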
Abstract:Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of sample diversity and redundancy’s impact on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. The code is available at this https URL.
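A rough sketch of density-aware downsampling in a feature space, offered as one possible reading of the idea rather than the SynDR-IQA procedure. The neighbour-count density estimate, the `radius` and `max_neighbors` parameters, and the greedy removal order are all assumptions.

```python
import math


def downsample_dense_regions(points, radius, max_neighbors):
    """Greedily drop the point with the most neighbours within `radius`
    until no remaining point has more than `max_neighbors` neighbours."""
    kept = list(points)
    while True:
        counts = [
            sum(
                1
                for j, q in enumerate(kept)
                if j != i and math.dist(p, q) <= radius
            )
            for i, p in enumerate(kept)
        ]
        if not counts or max(counts) <= max_neighbors:
            return kept
        kept.pop(counts.index(max(counts)))


# Three tightly clustered features and one isolated feature.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
balanced = downsample_dense_regions(features, radius=0.5, max_neighbors=1)
```

The isolated sample is always kept, while the dense cluster is thinned, which flattens the distribution without discarding rare content.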
[CV-60] LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization
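[Quick read]: This paper addresses the conflict between high capacity and a compact codebook in Vector Quantization (VQ) as data and models grow more complex. The key to its solution is LooC, which reframes the relationship between codevectors and feature vectors: codevectors are treated as lower-dimensional compositional units within feature vectors and are combined, which greatly expands the solution space while shrinking the codebook. LooC also adds a parameter-free extrapolation-by-interpolation mechanism that smooths features and better preserves detail and fidelity. The design leads to full use of the small codebook and avoids codebook collapse, giving strong performance with high parameter efficiency.
Link: https://arxiv.org/abs/2601.00222
Authors: Jie Li, Kwan-Yee K. Wong, Kai Han
Affiliations: The University of Hong Kong (香港大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026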
Abstract:Vector quantization (VQ) is a prevalent and fundamental technique that discretizes continuous feature vectors by approximating them using a codebook. As the diversity and complexity of data and models continue to increase, there is an urgent need for high-capacity, yet more compact VQ methods. This paper aims to reconcile this conflict by presenting a new approach called LooC, which utilizes an effective Low-dimensional codebook for Compositional vector quantization. Firstly, LooC introduces a parameter-efficient codebook by reframing the relationship between codevectors and feature vectors, significantly expanding its solution space. Instead of individually matching codevectors with feature vectors, LooC treats them as lower-dimensional compositional units within feature vectors and combines them, resulting in a more compact codebook with improved performance. Secondly, LooC incorporates a parameter-free extrapolation-by-interpolation mechanism to enhance and smooth features during the VQ process, which allows for better preservation of details and fidelity in feature approximation. The design of LooC leads to full codebook usage, effectively utilizing the compact codebook while avoiding the problem of collapse. Thirdly, LooC can serve as a plug-and-play module for existing methods for different downstream tasks based on VQ. Finally, extensive evaluations on different tasks, datasets, and architectures demonstrate that LooC outperforms existing VQ methods, achieving state-of-the-art performance with a significantly smaller codebook.
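The compositional idea can be illustrated by quantizing sub-vectors of a feature against one small, low-dimensional codebook. This is a simplified sketch of the general technique, not LooC's algorithm: the chunking scheme, the shared codebook, and the function names are assumptions, and the extrapolation-by-interpolation step is omitted.

```python
def nearest_code(codebook, vector):
    """Index of the codevector closest to `vector` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vector))


def quantize_compositional(feature, codebook, sub_dim):
    """Split `feature` into chunks of size `sub_dim`, quantize each chunk
    with the shared codebook, and return (indices, reconstruction)."""
    if len(feature) % sub_dim != 0:
        raise ValueError("feature length must be a multiple of sub_dim")
    indices, reconstruction = [], []
    for start in range(0, len(feature), sub_dim):
        idx = nearest_code(codebook, feature[start:start + sub_dim])
        indices.append(idx)
        reconstruction.extend(codebook[idx])
    return indices, reconstruction


# Two 2-D codevectors can represent 2 * 2 = 4 distinct 4-D reconstructions.
codebook = [[0.0, 0.0], [1.0, 1.0]]
indices, reconstruction = quantize_compositional(
    [0.1, -0.1, 0.9, 1.2], codebook, sub_dim=2
)
```

With K codevectors and M chunks per feature, the number of representable combinations grows as K to the power M, which is how a small codebook can still cover a large solution space.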
[CV-61] IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation
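Quick read: This paper addresses a performance bottleneck in unsupervised domain adaptation (UDA) caused by ignoring intra-domain variability in the target domain. Existing methods mostly focus on the domain shift between source and target and leave the diversity of styles within the target domain under-explored, which limits the quality of synthetic data and downstream results. The key to its solution is IntraStyler, an exemplar-based style synthesis method that captures diverse target-domain styles without prior knowledge: a style encoder trained with contrastive learning extracts style-only features, so the model can take any exemplar image as guidance and synthesize an image matching its style. This gives controllable, diverse style transfer and improves segmentation in cross-modality domain adaptation.
Link: https://arxiv.org/abs/2601.00212
Authors: Han Liu, Yubo Fan, Hao Li, Dewei Hu, Daniel Moyer, Zhoubing Xu, Benoit M. Dawant, Ipek Oguz
Affiliations: Siemens Healthineers; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extension of our 1st place solution for the CrossMoDA 2023 challenge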
Abstract:Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at this https URL.
[CV-62] CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting
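Quick read: This paper addresses crop counting in outdoor field environments, where partial occlusion and the difficulty of telling clustered crops apart from a single viewpoint limit image-based segmentation. Its core solution is a crop counting framework built on 3D instance segmentation: multi-view 2D images are combined with a neural radiance field (NeRF), and crop visibility and mask consistency scores are introduced to segment crop instances more accurately in 3D and hence count them precisely. The method needs no crop-specific parameter tuning and performs robustly on cotton boll, apple, and pear datasets.
Link: https://arxiv.org/abs/2601.00207
Authors: Md Ahmed Al Muzaddid, William J. Beksi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 10 figures, and 2 tables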
Abstract:Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints poses an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly-accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.
[CV-63] MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
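Quick read: This paper addresses the difficulty of guaranteeing semantic consistency and temporal smoothness in 3D morphing, especially across categories; the core challenge is generating deformation sequences that are both structurally plausible and temporally coherent. The key to its solution is MorphAny3D, a training-free framework that uses Structured Latent (SLAT) representations for high-quality 3D morphing. Morphing Cross-Attention (MCA) fuses source and target features for structural coherence, Temporal-Fused Self-Attention (TFSA) incorporates features from preceding frames for temporal consistency, and an orientation correction strategy mitigates pose ambiguity within the morphing steps.
Link: https://arxiv.org/abs/2601.00204
Authors: Xiaokun Sun, Zeyu Cai, Hao Tang, Ying Tai, Jian Yang, Zhenyu Zhang
Affiliations: Nanjing University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL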
Abstract:3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: this https URL.
[CV-64] DichroGAN: Towards Restoration of in-air Colours of Seafloor from Satellite Imagery
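Quick read: This paper addresses recovering the in-air colours of the seafloor from satellite imagery, which is hard because light attenuates exponentially with depth in the water column. The key to its solution is DichroGAN, a conditional generative adversarial network (cGAN) with a two-step simultaneous training scheme: first, two generators use a hyperspectral image cube to estimate diffuse and specular reflections and thereby obtain the atmospheric scene radiance; then a third generator takes that radiance as input while a fourth estimates the underwater light transmission. Together they follow the underwater image formation equation to remove the effects of absorption and scattering and reconstruct the seafloor's in-air colours.
Link: https://arxiv.org/abs/2601.00194
Authors: Salma Gonzalez-Sabbagh, Antonio Robles-Kelly, Shang Gao
Affiliations: Deakin University; The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures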
Abstract:Recovering the in-air colours of seafloor from satellite imagery is a challenging task due to the exponential attenuation of light with depth in the water column. In this study, we present DichroGAN, a conditional generative adversarial network (cGAN) designed for this purpose. DichroGAN employs a two-steps simultaneous training: first, two generators utilise a hyperspectral image cube to estimate diffuse and specular reflections, thereby obtaining atmospheric scene radiance. Next, a third generator receives as input the generated scene radiance containing the features of each spectral band, while a fourth generator estimates the underwater light transmission. These generators work together to remove the effects of light absorption and scattering, restoring the in-air colours of seafloor based on the underwater image formation equation. DichroGAN is trained on a compact dataset derived from PRISMA satellite imagery, comprising RGB images paired with their corresponding spectral bands and masks. Extensive experiments on both satellite and underwater datasets demonstrate that DichroGAN achieves competitive performance compared to state-of-the-art underwater restoration techniques.
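The abstract refers to an underwater image formation equation without stating it. As a rough illustration only, the sketch below uses a widely used simplified model, observed = scene × transmission + veiling light × (1 − transmission), with exponential (Beer-Lambert style) transmission. Whether DichroGAN uses exactly this form is an assumption, and all coefficients below are made-up values rather than anything from the paper.

```python
import math

def transmission(beta, depth):
    """Exponential transmission for attenuation coefficient beta (1/m) and depth (m)."""
    return math.exp(-beta * depth)

def observe(j, t, a):
    """Forward model: attenuated scene radiance plus veiling light."""
    return j * t + a * (1.0 - t)

def restore(i, t, a, t_min=1e-3):
    """Invert the forward model for one channel; t is clamped to avoid dividing by ~0."""
    t = max(t, t_min)
    return (i - a * (1.0 - t)) / t

# Hypothetical per-channel values: red is attenuated faster than blue.
true_colour = {"r": 0.60, "g": 0.50, "b": 0.40}
beta = {"r": 0.40, "g": 0.07, "b": 0.03}
veil = {"r": 0.05, "g": 0.20, "b": 0.30}
depth = 5.0

observed = {c: observe(true_colour[c], transmission(beta[c], depth), veil[c]) for c in "rgb"}
restored = {c: restore(observed[c], transmission(beta[c], depth), veil[c]) for c in "rgb"}
```

When transmission and veiling light are known exactly, the inversion recovers the original colour; in practice both have to be estimated, which is the role the abstract assigns to the generators.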
[CV-65] Optimized Hybrid Feature Engineering for Resource-Efficient Arrhythmia Detection in ECG Signals: An Optimization Framework
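Quick read: This paper addresses real-time monitoring of arrhythmias and other cardiovascular conditions in the Internet of Medical Things (IoMT), where deep learning models are too computationally expensive for resource-constrained edge devices. The key to its solution is a lightweight, data-centric framework that relies on feature engineering rather than model complexity: wavelet time-frequency decompositions and graph-theoretic structural descriptors (such as PageRank centrality) form a hybrid feature space, which is refined with mutual information and recursive elimination so that the data become linearly separable and an interpretable, ultra-lightweight linear classifier suffices. On the MIT-BIH and INCART datasets it reaches 98.44% diagnostic accuracy with an 8.54 KB model and 0.46 μs classification latency, ahead of compressed models such as KD-Light.
Link: https://arxiv.org/abs/2601.00192
Authors: Moirangthem Tiken Singh, Manibhushan Yaikhom
Affiliations: Dibrugarh University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: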
Abstract:Cardiovascular diseases, particularly arrhythmias, remain a leading global cause of mortality, necessitating continuous monitoring via the Internet of Medical Things (IoMT). However, state-of-the-art deep learning approaches often impose prohibitive computational overheads, rendering them unsuitable for resource-constrained edge devices. This study proposes a resource-efficient, data-centric framework that prioritizes feature engineering over complexity. Our optimized pipeline makes the complex, high-dimensional arrhythmia data linearly separable. This is achieved by integrating time-frequency wavelet decompositions with graph-theoretic structural descriptors, such as PageRank centrality. This hybrid feature space, combining wavelet decompositions and graph-theoretic descriptors, is then refined using mutual information and recursive elimination, enabling interpretable, ultra-lightweight linear classifiers. Validation on the MIT-BIH and INCART datasets yields 98.44% diagnostic accuracy with an 8.54 KB model footprint. The system achieves 0.46 μs classification inference latency within a 52 ms per-beat pipeline, ensuring real-time operation. These outcomes provide an order-of-magnitude efficiency gain over compressed models, such as KD-Light (25 KB, 96.32% accuracy), advancing battery-less cardiac sensors.
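The mutual-information filtering step mentioned in the abstract can be sketched as follows. This is an illustrative toy, not the paper's pipeline: it ranks already-discretized features by their mutual information with the class label, and leaves out wavelet and graph feature extraction as well as recursive elimination. The feature names and data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def rank_features(columns, labels):
    """Return feature names sorted from most to least informative, plus the scores."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical binned features for eight beats (0 = normal, 1 = arrhythmic).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "wavelet_energy_bin": [0, 0, 0, 0, 1, 1, 1, 1],  # perfectly informative
    "pagerank_bin":       [0, 0, 0, 1, 1, 1, 1, 1],  # partly informative
    "noise_bin":          [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}
ranking, scores = rank_features(columns, labels)
```

Continuous features would first need binning or a different estimator; the point here is only that low-scoring features can be dropped before a small linear classifier is trained.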
[CV-66] Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions
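Quick read: This paper addresses an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, covering facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). The core challenge is focusing precisely on a chosen facial region, which in turn supports better understanding and control. The key to its solution is a new multi-attribute description dataset with rich region-level annotations and natural language descriptions, together with Focal-RegionFace, a vision-language model fine-tuned from Qwen2.5-VL. Through several progressively finer fine-tuning stages the model narrows its focus onto local facial features, and it achieves the best performance, with interpretable outputs, on the new benchmark for fine-grained, region-focused multi-attribute face analysis.
Link: https://arxiv.org/abs/2601.00156
Authors: Kaiwen Zheng, Junchen Fu, Songpei Xu, Yaoqing He, Joemon M. Jose, Han Hu, Xuri Ge
Affiliations: University of Glasgow; Institute of Computing Technology, Chinese Academy of Sciences; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: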
Abstract:In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system’s ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressively fine-tuning stages, resulting in interpretable age estimation, FAU and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as new proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.
[CV-67] FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications
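Quick read: This paper addresses the lack of a dedicated benchmark for multimodal AI in financial credit risk assessment and document review, specifically one that (1) reflects the documents and workflows specific to financial credit, (2) measures credit-specific understanding and real-world robustness, and (3) reconciles privacy compliance with practical utility. The key to its solution is FCMBench-V1.0, a large-scale, privacy-compliant financial credit multimodal benchmark covering 18 core certificate types, 4,043 images, and 8,446 QA samples, with an evaluation framework spanning three dimensions: Perception, Reasoning, and Robustness. Its main technical contribution is a closed synthesis-capture pipeline: document templates are synthesized manually with virtual content and scenario-aware images are captured in-house, which protects privacy, keeps the data realistic, and avoids leakage from publicly available images. The benchmark effectively separates mainstream vision-language models (VLMs) by performance and robustness on financial credit tasks.
Link: https://arxiv.org/abs/2601.00150
Authors: Yehui Yang, Dalu Yang, Wenshuo Zhou, Fangxin Shang, Yifan Liu, Jie Ren, Haojun Fei, Qing Yang, Tao Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Multimedia (cs.MM)
Comments: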
Abstract:As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects documents and workflows specific to financial credit applications, (2) includes credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0 – a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework consists of three dimensions: Perception, Reasoning, and Robustness, including 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench can effectively discriminate performance disparities and robustness across modern vision-language models. Extensive experiments were conducted on 23 state-of-the-art vision-language models (VLMs) from 14 top AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1(%) score as a commercial model (64.61), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.
[CV-68] Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection
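Quick read: This paper addresses the loss of fine-grained detail in AI-generated image detection when conventional architectures downsample the input. The key to its solution is GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. The crops are original-resolution regions selected efficiently by spatially stratified sampling and aggregated with attention-based scoring, so global and local image features are integrated, and detection improves, without a significant increase in compute.
Link: https://arxiv.org/abs/2601.00141
Authors: Lawrence Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: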
Abstract:The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures typically downsample images before inputting them into models, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.
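One plausible reading of "spatially stratified sampling" and "attention-based scoring" is sketched below. The abstract does not specify the grid layout, the number of crops, or the scoring function, so the 3×3 grid, the clamping rule, and the softmax pooling here are assumptions chosen for illustration rather than the GLASS design itself.

```python
import math
import random

def stratified_crops(width, height, crop, grid, rng):
    """Pick one crop box per grid cell so that crops spread over the whole image."""
    boxes = []
    for gy in range(grid):
        for gx in range(grid):
            # Cell boundaries in pixels.
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            # Random top-left corner inside the cell, clamped so the crop fits the image.
            x = min(rng.randrange(x0, x1), width - crop)
            y = min(rng.randrange(y0, y1), height - crop)
            boxes.append((x, y, x + crop, y + crop))
    return boxes

def attention_pool(scores, values):
    """Softmax-weighted average of per-crop values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
boxes = stratified_crops(width=4096, height=3072, crop=224, grid=3, rng=rng)
```

Each box stays at the original resolution, so fine detail survives; the globally resized view would be processed separately and combined with the pooled crop features.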
[CV-69] Compressed Map Priors for 3D Perception
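Quick read: This paper addresses the fact that deployed autonomous-driving vision systems carry no spatial prior about places they have already visited and treat every repeat location as if seen for the first time, which hurts perception. The key to its solution is Compressed Map Priors (CMP), which stores spatial information from historic traversals in a binarized hashmap that needs only 32 KB/km², a 20× reduction compared with dense storage. The method integrates easily into leading 3D perception systems and, at little to no extra computational cost, consistently improves 3D object detection on the nuScenes dataset across several architectures.
Link: https://arxiv.org/abs/2601.00139
Authors: Brady Zhou, Philipp Krähenbühl
Affiliations: UT Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report; code this https URL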
Abstract:Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems. The vast majority of the deployment area of an autonomous vision system will have been visited before. Yet, most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework to learn spatial priors from historic traversals. The map priors use a binarized hashmap that requires only 32 KB/km², a 20× reduction compared to the dense storage. Compressed Map Priors easily integrate into leading 3D perception systems at little to no extra computational costs, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.
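The storage figures in the abstract can be turned into a quick back-of-the-envelope calculation. The sketch also includes a generic spatial-hash lookup to show what indexing a fixed-size map table might look like. The hash function, the table size, the 100 km² area, and the decimal definition of a kilobyte are all assumptions, not details taken from the paper.

```python
KB = 1000  # assumption: decimal kilobytes; the abstract does not say 1000 vs 1024

compressed_per_km2 = 32 * KB                       # bytes per km^2, from the abstract
reduction = 20                                     # reduction factor, from the abstract
dense_per_km2 = compressed_per_km2 * reduction     # implied dense storage per km^2
bits_per_km2 = compressed_per_km2 * 8              # binary values available per km^2

city_km2 = 100                                     # hypothetical deployment area
city_total_mb = compressed_per_km2 * city_km2 / 1e6

def slot_index(cell_x, cell_y, num_slots):
    """Hash integer grid-cell coordinates into a fixed-size table (collisions allowed).

    The two multipliers are a common spatial-hashing choice, used here only
    as an example; they are not taken from the paper.
    """
    return ((cell_x * 73856093) ^ (cell_y * 19349663)) % num_slots
```

Under these assumptions the implied dense representation is 640 KB/km², and a 100 km² area fits in about 3.2 MB.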
[CV-70] Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
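Quick read: This paper addresses erroneous predictions made under uncertainty by vision-language models (VLMs) in high-stakes video question answering; the core challenge is controlling the error rate reliably. The key to its approach is confidence-based abstention (selective prediction): given a confidence threshold ε, the system declines to answer when its confidence falls below the threshold. In-distribution, sweeping ε yields a smooth risk-coverage tradeoff, which lowers the error rate and provides an interpretable control knob.
Link: https://arxiv.org/abs/2601.00138
Authors: Jorge Ortiz
Affiliations: Rutgers University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation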
Abstract:High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
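The threshold sweep described in the abstract can be sketched in a few lines. The confidences and correctness flags below are made-up values for illustration, not results from the paper, and how the model's confidence is obtained is outside this sketch.

```python
def risk_coverage(confidences, correct, threshold):
    """Answer only when confidence >= threshold; return (coverage, risk)."""
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    # Risk is the error rate among answered questions; 0.0 when nothing is answered.
    risk = (1 - sum(answered) / len(answered)) if answered else 0.0
    return coverage, risk

# Hypothetical model outputs: a confidence per question and whether the answer was right.
confidences = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
correct = [True, True, True, False, True, False, False, False]

# Each entry is (threshold, coverage, risk).
curve = [(eps, *risk_coverage(confidences, correct, eps)) for eps in (0.0, 0.5, 0.75, 0.9)]
```

On this toy data, raising the threshold from 0.0 to 0.75 lowers coverage from 100% to 37.5% and risk from 50% to 0%; the tradeoff is only this clean when confidence is well correlated with correctness.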
[CV-71] A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data
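Quick read: This paper addresses the drop in accuracy of Synthetic Aperture Radar (SAR)-based flood extent mapping during the response stage when multispectral imaging (MSI) data are partly missing. The core challenge is fusing whatever multimodal information is available, even when MSI is incomplete or absent, to keep water extent predictions accurate and robust. The key to its solution is the Spatially Masked Adaptive Gated Network (SMAGNet), which takes SAR as the primary input, integrates available MSI data through feature-level fusion, and uses spatial masking and gating to weight partially available MSI adaptively, so performance holds up across different levels of data availability. Experiments show that SMAGNet outperforms other multimodal models when MSI is available and remains comparable to a SAR-only U-Net when MSI is completely missing, which makes multimodal deep learning more applicable and reliable for real-world flood management.
Link: https://arxiv.org/abs/2601.00123
Authors: Hyunho Lee, Wenwen Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 50 pages, 12 figures, 6 tables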
Abstract:Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and MSI data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances the model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.
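The idea of adding MSI information only where it exists, and falling back to SAR elsewhere, can be illustrated with a minimal per-pixel sketch. The abstract does not give SMAGNet's fusion equations, so the gate form, the weights, and the function names below are hypothetical stand-ins.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def masked_gated_fusion(sar_feat, msi_feat, valid_mask, w_sar=1.0, w_msi=1.0, bias=0.0):
    """Fuse per-pixel SAR and MSI features.

    valid_mask is 1 where MSI was observed and 0 where it is missing.
    Missing MSI features are assumed to be filled with a finite placeholder (not NaN).
    """
    fused = []
    for s, m, v in zip(sar_feat, msi_feat, valid_mask):
        gate = sigmoid(w_sar * s + w_msi * m + bias)  # how much MSI to let through
        fused.append(s + v * gate * m)                # v = 0 leaves the SAR feature as-is
    return fused

sar = [0.5, -1.0, 0.0]
msi = [9.0, 9.0, 2.0]
no_msi = masked_gated_fusion(sar, msi, [0, 0, 0])    # MSI entirely missing
partial = masked_gated_fusion(sar, msi, [0, 0, 1])   # MSI available for the last pixel only
```

With an all-zero mask the output equals the SAR features, which mirrors the abstract's observation that performance stays comparable to a SAR-only model when MSI is completely missing.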
[CV-72] Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
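Quick read: This paper addresses the substantial gap between Multimodal Large Language Models (MLLMs) and humans in 4D spatial intelligence, that is, perceiving and reasoning about how objects move or change over time. The core challenge is that existing benchmarks are small or lack task diversity and cannot fully measure spatial cognition in dynamic scenes. The key to its solution is Spatial4D-Bench, a large-scale, multi-task benchmark of about 40,000 question-answer pairs covering 18 well-defined tasks, organized into six cognitive categories (object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning), giving a structured and comprehensive evaluation framework. Experiments show that current open-source and proprietary MLLMs have clear weaknesses in route planning, action recognition, and physical plausibility reasoning, which underlines the benchmark's value for moving models toward human-level 4D spatial intelligence.
Link: https://arxiv.org/abs/2601.00092
Authors: Pan Wang, Yang Liu, Guile Wu, Eduardo R. Corral-Soto, Chengjie Huang, Binbin Xu, Dongfeng Bai, Xu Yan, Yuan Ren, Xingxin Chen, Yizhe Wu, Tao Huang, Wenjun Wan, Xin Wu, Pei Zhou, Xuyang Dai, Kangbo Lv, Hongbo Zhang, Yosef Fried, Aixue Ye, Bailan Feng, Zhenyu Chen, Zhen Li, Yingcong Chen, Yiyi Liao, Bingbing Liu
Affiliations: Huawei Technologies; Tsinghua University; CUHK-Shenzhen; HKUST-GZ; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report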
Abstract:4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route plan, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
[CV-73] It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
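Quick read: This paper addresses mode collapse in current text-to-image models, where images generated from the same text prompt lack diversity. The key to its solution is noise optimization, which increases diversity while preserving the fidelity of the base model. The authors show that a simple noise optimization objective mitigates mode collapse, analyze the frequency characteristics of the noise, and find that alternative noise initializations with different frequency profiles improve both optimization and search, giving better generation quality and variety.
Link: https://arxiv.org/abs/2601.00090
Authors: Anne Harrington, A. Sophia Koepke, Shyamgopal Karthik, Trevor Darrell, Alexei A. Efros
Affiliations: UC Berkeley; University of Tübingen; Tübingen AI Center; Technical University of Munich; MCML
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: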
Abstract:Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.
[CV-74] TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
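Quick read: This paper addresses the limits of current video generation models in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, which hold them back from becoming practical world models. The key to its solution is the TeleWorld framework, built around a generation-reconstruction-guidance paradigm: generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to keep spatial, temporal, and physical consistency. An autoregressive diffusion-based video model is combined with Macro-from-Micro Planning (MMPL) and Distribution Matching Distillation (DMD) to reduce error accumulation and speed up generation, enabling long-horizon, interactive 4D world modeling in real time under practical compute budgets.
Link: https://arxiv.org/abs/2601.00051
Authors: Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, Weichen Li, Zuoxin Li, Guangce Liu, Jialun Liu, Junqi Liu, Haoyuan Wang, Qizhen Weng, Xuan’er Wu, Xunzhi Xiang, Xiaoyan Yang, Xin Zhang, Shiwen Zhang, Junyu Zhou, Chengcheng Zhou, Haibin Huang, Chi Zhang, Xuelong Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: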
Abstract:World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from frame-level to segment-level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
[CV-75] From Clay to Code: Typological and Material Reasoning in AI Interpretations of Iranian Pigeon Towers
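Quick read: This paper asks how generative AI systems interpret and reproduce the local design intelligence embedded in vernacular form. The key to its approach is a five-criteria evaluation framework covering typology, materiality, environment, realism, and cultural specificity. Three diffusion models (Midjourney v6, DALL-E 3, and DreamStudio based on Stable Diffusion XL) are compared across referential, adaptive, and speculative prompt stages, which exposes a gap between visual resemblance and architectural reasoning. The paper proposes "computational vernacular reasoning" as a framework for analyzing how AI perceives, distorts, and reimagines traditional design intelligence.
Link: https://arxiv.org/abs/2601.00029
Authors: Abolhassan Pishahang, Maryam Badiei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of SIGraDi 2025: XXIX International Conference of the Ibero-American Society of Digital Graphics, Córdoba, Argentina, 2025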
Abstract:This study investigates how generative AI systems interpret the architectural intelligence embedded in vernacular form. Using the Iranian pigeon tower as a case study, the research tests three diffusion models, Midjourney v6, DALL-E 3, and DreamStudio based on Stable Diffusion XL (SDXL), across three prompt stages: referential, adaptive, and speculative. A five-criteria evaluation framework assesses how each system reconstructs typology, materiality, environment, realism, and cultural specificity. Results show that AI reliably reproduces geometric patterns but misreads material and climatic reasoning. Reference imagery improves realism yet limits creativity, while freedom from reference generates inventive but culturally ambiguous outcomes. The findings define a boundary between visual resemblance and architectural reasoning, positioning computational vernacular reasoning as a framework for analyzing how AI perceives, distorts, and reimagines traditional design intelligence.
[CV-76] The Impact of Lesion Focus on the Performance of AI-Based Melanoma Classification
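Quick read: This paper addresses the limited diagnostic reliability of melanoma classifiers based on convolutional neural networks (CNNs), which stems from inconsistent focus on the lesion area. The key to its approach is using several explainability and sensitivity analysis methods to quantify how well model attention aligns with the lesion region. It finds that stronger focus on the lesion goes with better diagnostic performance (precision, recall, and F1-score), which suggests that steering attention toward the lesion can make models more accurate and trustworthy and lays groundwork for more reliable, interpretable medical AI diagnosis.
Link: https://arxiv.org/abs/2601.00355
Authors: Tanay Donde
Affiliations: University of Illinois Urbana-Champaign
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: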
Abstract:Melanoma is the most lethal subtype of skin cancer, and early and accurate detection of this disease can greatly improve patients’ outcomes. Although machine learning models, especially convolutional neural networks (CNNs), have shown great potential in automating melanoma classification, their diagnostic reliability still suffers due to inconsistent focus on lesion areas. In this study, we analyze the relationship between lesion attention and diagnostic performance, involving masked images, bounding box detection, and transfer learning. We used multiple explainability and sensitivity analysis approaches to investigate how well models aligned their attention with lesion areas and how this alignment correlated with precision, recall, and F1-score. Results showed that models with a higher focus on lesion areas achieved better diagnostic performance, suggesting the potential of interpretable AI in medical diagnostics. This study provides a foundation for developing more accurate and trustworthy melanoma classification models in the future.
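A simple way to quantify "focus on lesion areas" is the share of a saliency map that falls inside the lesion mask, which can then be compared with precision, recall, and F1. The abstract does not say which alignment measure the study uses, so the `lesion_focus` metric and the toy data below are assumptions for illustration.

```python
def lesion_focus(saliency, lesion_mask):
    """Fraction of total saliency that falls inside the lesion mask (0 to 1)."""
    total = sum(sum(row) for row in saliency)
    inside = sum(
        s
        for srow, mrow in zip(saliency, lesion_mask)
        for s, m in zip(srow, mrow)
        if m
    )
    return inside / total if total else 0.0

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 with 1 as the positive (melanoma) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical 3x3 saliency map (arbitrary units) and a mask marking the centre pixel as lesion.
saliency = [[0, 1, 0], [1, 6, 1], [0, 1, 0]]
lesion_mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
focus = lesion_focus(saliency, lesion_mask)

precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

Computing such a focus score per model and comparing it with the three metrics is one way to examine the relationship the abstract reports.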
[CV-77] Automated electrostatic characterization of quantum dot devices in single- and bilayer heterostructures
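Quick read: This paper addresses the problem that, as quantum dot (QD)-based spin qubit devices grow larger and more complex, manual analysis of charge stability diagrams (CSDs) is slow, error-prone, and hard to scale. The core challenge is automatically extracting a device's capacitive properties from large volumes of CSD data to support efficient characterization and tuning. The key to its solution is an automated protocol that integrates machine learning, image processing, and object detection to identify and track charge transition lines across many CSDs without manual labeling, which allows statistical estimation of quantities such as relative lever arms and capacitive couplings and makes extracting nontrivial physical information from experimental data faster and more reliable.
Link: https://arxiv.org/abs/2601.00067
Authors: Merritt P. R. Losert, Dario Denora, Barnaby van Straaten, Michael Chan, Stefan D. Oosterhout, Lucas Stehouwer, Giordano Scappucci, Menno Veldhorst, Justyna P. Zwolak
Affiliations: National Institute of Standards and Technology; University of Maryland; QuTech; Delft University of Technology; Kavli Institute of Nanoscience; Netherlands Organisation for Applied Scientific Research (TNO)
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 18 pages, 12 figures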
Abstract:As quantum dot (QD)-based spin qubits advance toward larger, more complex device architectures, rapid, automated device characterization and data analysis tools become critical. The orientation and spacing of transition lines in a charge stability diagram (CSD) contain a fingerprint of a QD device’s capacitive environment, making these measurements useful tools for device characterization. However, manually interpreting these features is time-consuming, error-prone, and impractical at scale. Here, we present an automated protocol for extracting underlying capacitive properties from CSDs. Our method integrates machine learning, image processing, and object detection to identify and track charge transitions across large datasets without manual labeling. We demonstrate this method using experimentally measured data from a strained-germanium single-quantum-well (planar) and a strained-germanium double-quantum-well (bilayer) QD device. Unlike for planar QD devices, CSDs in bilayer germanium heterostructure exhibit a larger set of transitions, including interlayer tunneling and distinct loading lines for the vertically stacked QDs, making them a powerful testbed for automation methods. By analyzing the properties of many CSDs, we can statistically estimate physically relevant quantities, like relative lever arms and capacitive couplings. Thus, our protocol enables rapid extraction of useful, nontrivial information about QD devices.
[CV-78] Deep Learning Approach for the Diagnosis of Pediatric Pneumonia Using Chest X-ray Imaging
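Quick read: This paper addresses the limited timeliness and accuracy of pediatric pneumonia diagnosis, which stem from a shortage of radiological expertise and from the physiological and procedural complexity of pediatric imaging. The key to the solution is transfer learning: three state-of-the-art convolutional neural network (CNN) architectures, ResNetRS, RegNet, and EfficientNetV2, are fine-tuned for automated binary classification (pneumonia vs. normal) of pediatric chest X-ray images. RegNet performs best in both accuracy (92.4%) and sensitivity (90.1%), which supports the use of pretrained deep learning models to assist pediatric image diagnosis.
Link: https://arxiv.org/abs/2601.00041
Authors: Fatemeh Hosseinabadi, Mohammad Mojtaba Rohani
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures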
Abstract:Pediatric pneumonia remains a leading cause of morbidity and mortality in children worldwide. Timely and accurate diagnosis is critical but often challenged by limited radiological expertise and the physiological and procedural complexity of pediatric imaging. This study investigates the performance of the state-of-the-art convolutional neural network (CNN) architectures ResNetRS, RegNet, and EfficientNetV2 using transfer learning for the automated classification of pediatric chest X-ray images as either pneumonia or normal. A curated subset of 1,000 chest X-ray images was extracted from a publicly available dataset originally comprising 5,856 pediatric images. All images were preprocessed and labeled for binary classification. Each model was fine-tuned using pretrained ImageNet weights and evaluated based on accuracy and sensitivity. RegNet achieved the highest classification performance with an accuracy of 92.4% and a sensitivity of 90.1%, followed by ResNetRS (accuracy: 91.9%, sensitivity: 89.3%) and EfficientNetV2 (accuracy: 88.5%, sensitivity: 88.1%).
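The two metrics reported above can be computed from predictions as in the minimal sketch below. The function name and the label convention (1 = pneumonia, 0 = normal) are assumptions made for illustration; the paper's evaluation code is not shown in the abstract.

```python
def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity) for binary labels.

    Hypothetical helper: `positive` marks the pneumonia class by assumption.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = correct / len(y_true)
    # Sensitivity (recall on the positive class) is undefined without positives.
    positives = true_pos + false_neg
    sensitivity = true_pos / positives if positives else float("nan")
    return accuracy, sensitivity
```

Sensitivity is reported alongside accuracy because a missed pneumonia case (false negative) is usually the costlier error in screening.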
[CV-79] Neural Brain Fields: A NeRF-Inspired Approach for Generating Nonexistent EEG Electrodes
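Quick read: This paper addresses key challenges in modeling electroencephalography (EEG) data: variable recording length, very low signal-to-noise ratio, large differences across participants, within-session drift, and the scarcity of large, clean datasets. The authors propose a method inspired by Neural Radiance Fields (NeRF): EEG electrodes at different scalp locations are treated as analogous to the discrete images captured from different viewpoints in NeRF, so that a neural network can learn a continuous representation of neural activity. The core of the approach is to train a model in a NeRF-style manner on a single EEG sample, producing a fixed-size, informative weight vector that encodes the entire signal. This representation supports rendering the signal at unseen time steps and electrode positions, continuous visualization of brain activity at ultra-high resolution, and reconstruction of raw EEG signals, which in turn improves the performance of standard EEG processing networks.
Link: https://arxiv.org/abs/2601.00012
Authors: Shahar Ain Kedem, Itamar Zimerman, Eliya Nachmani
Affiliations: Ben Gurion University of the Negev; Tel Aviv University
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: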
Abstract:Electroencephalography (EEG) data present unique modeling challenges because recordings vary in length, exhibit very low signal-to-noise ratios, differ significantly across participants, drift over time within sessions, and are rarely available in large and clean datasets. Consequently, developing deep learning methods that can effectively process EEG signals remains an open and important research problem. To tackle this problem, this work presents a new method inspired by Neural Radiance Fields (NeRF). In computer vision, NeRF techniques train a neural network to memorize the appearance of a 3D scene and then use its learned parameters to render and edit the scene from any viewpoint. We draw an analogy between the discrete images captured from different viewpoints used to learn a continuous 3D scene in NeRF, and EEG electrodes positioned at different locations on the scalp, which are used to infer the underlying representation of continuous neural activity. Building on this connection, we show that a neural network can be trained on a single EEG sample in a NeRF-style manner to produce a fixed-size and informative weight vector that encodes the entire signal. Moreover, via this representation we can render the EEG signal at previously unseen time steps and spatial electrode positions. We demonstrate that this approach enables continuous visualization of brain activity at any desired resolution, including ultra-high resolution, and reconstruction of raw EEG signals. Finally, our empirical analysis shows that this method can effectively simulate nonexistent electrode data in EEG recordings, allowing the reconstructed signal to be fed into standard EEG processing networks to improve performance.
Artificial Intelligence
[AI-0] LLM Agents for Combinatorial Efficient Frontiers: Investment Portfolio Optimization
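Quick read: This paper addresses the Cardinality Constrained Mean-Variance Portfolio Optimization (CCPO) problem, a mixed-integer quadratic programming (MIQP) problem that exact solvers struggle to handle efficiently, so heuristic algorithms are commonly used to find approximate solutions. The core challenge lies in the complex workflows and the large effort of developing many heuristic algorithms, whose pooled solutions are combined to improve the efficient frontier. The paper proposes a novel agentic framework that exploits the ability of agents to automate complex workflows and develop algorithms for combinatorial optimization, reducing manual intervention and development cost. Experiments show that the framework matches state-of-the-art algorithms on benchmark problems and keeps the error at an acceptable level even in the worst case.
Link: https://arxiv.org/abs/2601.00770
Authors: Simon Paquette-Greenbaum, Jiangbo Yu
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: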
Abstract:Investment portfolio optimization is a task conducted in all major financial institutions. The Cardinality Constrained Mean-Variance Portfolio Optimization (CCPO) problem formulation is ubiquitous for portfolio optimization. The challenge of this type of portfolio optimization, a mixed-integer quadratic programming (MIQP) problem, arises from the intractability of solutions from exact solvers, where heuristic algorithms are used to find approximate portfolio solutions. CCPO entails many laborious and complex workflows and also requires extensive effort pertaining to heuristic algorithm development, where the combination of pooled heuristic solutions results in improved efficient frontiers. Hence, common approaches are to develop many heuristic algorithms. Agentic frameworks emerge as a promising candidate for many problems within combinatorial optimization, as they have been shown to be equally efficient with regard to automating large workflows and have been shown to be excellent in terms of algorithm development, sometimes surpassing human-level performance. This study implements a novel agentic framework for the CCPO and explores several concrete architectures. In benchmark problems, the implemented agentic framework matches state-of-the-art algorithms. Furthermore, complex workflows and algorithm development efforts are alleviated, while in the worst case, lower but acceptable error is reported.
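For readers unfamiliar with CCPO, one common textbook form of the problem is sketched below. This is an illustrative formulation, not necessarily the exact one used in the paper: the symbols (weights w, selection indicators z, covariance Σ, expected returns μ, risk-aversion λ, cardinality K, and holding bounds ε_i, δ_i) are assumptions introduced here.

```latex
\begin{aligned}
\min_{w,\,z}\quad & \lambda\, w^{\top}\Sigma\, w \;-\; (1-\lambda)\,\mu^{\top} w \\
\text{s.t.}\quad & \sum_{i=1}^{N} w_i = 1, \qquad \sum_{i=1}^{N} z_i = K, \\
& \varepsilon_i z_i \le w_i \le \delta_i z_i, \qquad z_i \in \{0,1\}, \quad i = 1,\dots,N .
\end{aligned}
```

The binary variables z_i are what make the problem an MIQP; sweeping λ over [0, 1] and keeping the non-dominated (risk, return) pairs traces an efficient frontier.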
[AI-1] An Agentic Framework for Neuro-Symbolic Programming
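Quick read: This paper addresses the tedious and challenging process of integrating symbolic constraints into deep learning models. Such integration can improve robustness, interpretability, and data efficiency, but existing frameworks such as DomiKnowS still require users to know the library's specific syntax, which limits usability. The key to the solution is AgenticDomiKnowS (ADS), which uses an agentic workflow to translate free-form task descriptions into complete DomiKnowS programs. The workflow creates and tests each component separately and supports optional human-in-the-loop intervention, so that both experienced DomiKnowS users and non-users can build neuro-symbolic programs in 10-15 minutes rather than hours.
Link: https://arxiv.org/abs/2601.00743
Authors: Aliakbar Nafar, Chetan Chigurupati, Danial Kamali, Hamid Karimian, Parisa Kordjamshidi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: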
Abstract:Integrating symbolic constraints into deep learning models could make them more robust, interpretable, and data-efficient. Still, it remains a time-consuming and challenging task. Existing frameworks like DomiKnowS help this integration by providing a high-level declarative programming interface, but they still assume the user is proficient with the library’s specific syntax. We propose AgenticDomiKnowS (ADS) to eliminate this dependency. ADS translates free-form task descriptions into a complete DomiKnowS program using an agentic workflow that creates and tests each DomiKnowS component separately. The workflow supports optional human-in-the-loop intervention, enabling users familiar with DomiKnowS to refine intermediate outputs. We show how ADS enables experienced DomiKnowS users and non-users to rapidly construct neuro-symbolic programs, reducing development time from hours to 10-15 minutes.
[AI-2] Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty
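Quick read: This paper addresses systematic value overestimation by the critic network in off-policy actor-critic methods, which often makes policy learning unstable or suboptimal. Conventional remedies introduce a pessimistic bias based on epistemic uncertainty (uncertainty due to limited data and model ambiguity), usually estimated with an ensemble of networks. The paper proposes Stochastic Actor-Critic (STAC), whose key innovation is to scale the pessimistic bias in TD updates with temporal aleatoric uncertainty (uncertainty arising from stochastic transitions, rewards, and policy-induced variability in Bellman targets) rather than epistemic uncertainty. STAC models the temporal return uncertainty with a single distributional critic and regularizes both critic and actor with dropout. This mitigates overestimation without an ensemble, naturally induces risk-averse behavior, and improves training stability and computational efficiency.
Link: https://arxiv.org/abs/2601.00737
Authors: Uğurcan Özalp
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 19 pages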
Abstract:Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). This design typically achieves higher sample efficiency than purely on-policy methods. However, critic networks tend to overestimate value estimates systematically. This is often addressed by introducing a pessimistic bias based on uncertainty estimates. Current methods employ ensembling to quantify the critic’s epistemic uncertainty (uncertainty due to limited data and model ambiguity) to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that incorporates temporal (one-step) aleatoric uncertainty (uncertainty arising from stochastic transitions, rewards, and policy-induced variability in Bellman targets) to scale pessimistic bias in temporal-difference updates, rather than relying on epistemic uncertainty. STAC uses a single distributional critic network to model the temporal return uncertainty, and applies dropout to both the critic and actor networks for regularization. Our results show that pessimism based on a distributional critic alone suffices to mitigate overestimation, and naturally leads to risk-averse behavior in stochastic environments. Introducing dropout further improves training stability and performance by means of regularization. With this design, STAC achieves improved computational efficiency using a single distributional critic network.
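The abstract does not give the update rule, so the following is only a rough sketch of what an uncertainty-scaled pessimistic TD target could look like. It assumes a critic that outputs a mean and a standard deviation for the next-state return, and a hypothetical pessimism coefficient `beta`; none of these names or the exact form come from the paper.

```python
def pessimistic_td_target(reward, next_mean, next_std,
                          gamma=0.99, beta=0.5, done=False):
    """One-step TD target with a pessimism term scaled by predicted spread.

    Assumption: the distributional critic predicts (next_mean, next_std)
    for the next state-action pair; `beta` controls how pessimistic we are.
    """
    if next_std < 0:
        raise ValueError("next_std must be non-negative")
    if done:
        # No bootstrapping past a terminal transition.
        return reward
    # Subtracting beta * std lowers the bootstrap value where returns are noisy.
    return reward + gamma * (next_mean - beta * next_std)
```

With `beta = 0` this reduces to the ordinary TD target; larger values trade expected return for lower exposure to high-variance outcomes, which is one way risk-averse behavior can arise.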
[AI-3] A Vision-and-Knowledge Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference
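Quick read: This paper addresses the poor generalization of pedestrian crossing behavior inference models to new sites: existing statistical and supervised learning methods are limited to site-specific pattern recognition and adapt poorly to unseen environments. The key to the solution is Pedestrian Crossing LLM (PedX-LLM), a vision-and-knowledge enhanced large language model framework. It combines LLaVA-extracted visual features with textual data and transportation domain knowledge, and fine-tunes a LLaMA-2-7B foundation model via Low-Rank Adaptation (LoRA), shifting from data-driven pattern fitting to semantic, context-aware behavioral reasoning. The method reaches 82.0% balanced accuracy and retains 66.9% in zero-shot cross-site tests, outperforming the baseline methods and indicating that visual augmentation and domain knowledge are central to generalization.
Link: https://arxiv.org/abs/2601.00694
Authors: Qingwen Pu, Kun Xie, Hong Yang, Guocong Zhai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: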
Abstract:Existing paradigms for inferring pedestrian crossing behavior, ranging from statistical models to supervised learning methods, demonstrate limited generalizability and perform inadequately on new sites. Recent advances in Large Language Models (LLMs) offer a shift from numerical pattern fitting to semantic, context-aware behavioral reasoning, yet existing LLM applications lack domain-specific adaptation and visual context. This study introduces Pedestrian Crossing LLM (PedX-LLM), a vision-and-knowledge enhanced framework designed to transform pedestrian crossing inference from site-specific pattern recognition to generalizable behavioral reasoning. By integrating LLaVA-extracted visual features with textual data and transportation domain knowledge, PedX-LLM fine-tunes a LLaMA-2-7B foundation model via Low-Rank Adaptation (LoRA) to infer crossing decisions. PedX-LLM achieves 82.0% balanced accuracy, outperforming the best statistical and supervised learning methods. Results demonstrate that the vision-augmented module contributes a 2.9% performance gain by capturing the built environment, and that integrating domain knowledge yields an additional 4.1% improvement. To evaluate generalizability across unseen environments, cross-site validation was conducted using site-based partitioning. The zero-shot PedX-LLM configuration achieves 66.9% balanced accuracy on five unseen test sites, outperforming the baseline data-driven methods by at least 18 percentage points. Adding just five validation examples to PedX-LLM via few-shot learning further raises the balanced accuracy to 72.2%. PedX-LLM demonstrates strong generalizability to unseen scenarios, confirming that vision-and-knowledge-enhanced reasoning enables the model to mimic human-like decision logic and overcome the limitations of purely data-driven methods.
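Balanced accuracy, the headline metric above, is the mean of per-class recall, so a majority class cannot dominate the score. A minimal sketch follows; the function name and label encoding are illustrative assumptions, not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of samples whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

For example, with eight "cross" samples all predicted correctly and two "wait" samples of which one is correct, plain accuracy is 0.9 but balanced accuracy is 0.75.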
[AI-4] QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models DATE
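Quick read: This paper addresses the excessive memory footprint of spike-driven language models (SLMs) when deployed on resource-constrained embedded devices. Although SLMs substantially reduce processing power, they still exceed the storage budgets of low-cost devices, and manual quantization, while able to compress memory, needs long design time and heavy computation and does not scale across network structures and performance-memory constraints. The key to the solution is QSLM, a framework that applies an automated tiered quantization strategy (global-, block-, and module-level quantization) together with a multi-objective trade-off function. It achieves up to 86.5% memory reduction and up to 20% lower power consumption while keeping performance close to the original model (up to 84.4% sentiment classification accuracy on SST-2 and a perplexity of 23.2 for text generation on WikiText-2).
Link: https://arxiv.org/abs/2601.00679
Authors: Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the Design, Automation and Test in Europe Conference (DATE) 2025 on April 20th-22nd, 2025 in Verona, Italy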
Abstract:Large Language Models (LLMs) have emerged as prominent AI models for solving many natural language tasks due to their high performance (e.g., accuracy) and their ability to generate high-quality responses to the given inputs. However, their large computational cost, huge memory footprints, and high processing power/energy make embedded deployment challenging. Among several tinyLLMs, recent works have proposed spike-driven language models (SLMs) that significantly reduce the processing power/energy of LLMs. However, their memory footprints remain too large for low-cost and resource-constrained embedded devices. A manual quantization approach may effectively compress SLM memory footprints, but it requires huge design time and compute power to find the quantization setting for each network, making it not scalable across different networks, performance requirements, and memory budgets. To bridge this gap, we propose QSLM, a novel framework that performs automated quantization for compressing pre-trained SLMs while meeting the performance and memory constraints. To achieve this, QSLM first identifies the hierarchy of the given network architecture and the sensitivity of network layers under quantization, then employs a tiered quantization strategy (e.g., global-, block-, and module-level quantization) while leveraging a multi-objective performance-and-memory trade-off function to select the final quantization setting. Experimental results indicate that QSLM reduces memory footprint by up to 86.5%, reduces power consumption by up to 20%, and maintains high performance across different tasks (i.e., up to 84.4% accuracy for sentiment classification on the SST-2 dataset and a perplexity score of 23.2 for text generation on the WikiText-2 dataset), close to the original non-quantized model, while meeting the performance and memory constraints.
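To make the memory side of the trade-off concrete, the sketch below estimates weight storage for a per-module bit-width assignment and the resulting reduction relative to a baseline. The module names, parameter counts, and bit widths are made-up illustrations, and the sketch ignores activations and quantization metadata; it is not QSLM's search procedure.

```python
def footprint_bytes(param_counts, bit_widths):
    """Weight storage in bytes for a mapping module -> bits per parameter."""
    return sum(param_counts[name] * bit_widths[name] for name in param_counts) / 8


def reduction_percent(param_counts, baseline_bits, quant_bits):
    """Percentage of weight memory saved by quant_bits relative to baseline_bits."""
    base = footprint_bytes(param_counts, baseline_bits)
    quant = footprint_bytes(param_counts, quant_bits)
    return (1.0 - quant / base) * 100.0


# Hypothetical two-module network: 32-bit baseline vs. a mixed 8/4-bit setting.
params = {"attention": 1000, "mlp": 3000}
baseline = {"attention": 32, "mlp": 32}
mixed = {"attention": 8, "mlp": 4}
saving = reduction_percent(params, baseline, mixed)
```

A tiered search would evaluate many such assignments and keep those that meet both the memory budget and the accuracy target.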
[AI-5] IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning
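Quick read: This paper addresses the computational bottleneck that arises when Generative Reward Models (GRMs) are combined with reinforcement learning (RL) algorithms such as Group Relative Policy Optimization (GRPO). Pairwise GRMs need O(n²) comparisons to obtain relative scores, and they rely on repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance, which makes them inefficient. The key to the solution is a new RL framework, Intergroup Relative Preference Optimization (IRPO), which incorporates the classical Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO evaluates arbitrarily many candidates efficiently while preserving interpretable, fine-grained reward signals. Experiments show that IRPO achieves state-of-the-art performance among pointwise GRMs on multiple benchmarks, is comparable to leading pairwise GRMs, and significantly outperforms pairwise GRMs in post-training evaluations.
Link: https://arxiv.org/abs/2601.00677
Authors: Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Fuzhen Li, Liu Kang, Feng Jiang, Zhiyong Zheng, Fan Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures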
Abstract:Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
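The Bradley-Terry model mentioned above turns pointwise scores into pairwise preference probabilities, which is why n scores can stand in for n(n-1)/2 explicit comparisons. The sketch below shows the standard Bradley-Terry probability and its negative log-likelihood; the function names are illustrative, and how IRPO wires this into GRPO is not specified in the abstract.

```python
import math


def bt_win_probability(score_a, score_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_pairwise_nll(score_chosen, score_rejected):
    """Negative log-likelihood of observing 'chosen beats rejected'."""
    return -math.log(bt_win_probability(score_chosen, score_rejected))


def evaluation_counts(n):
    """Model calls needed for n candidates: pointwise scoring vs. all pairs."""
    return {"pointwise": n, "pairwise": n * (n - 1) // 2}
```

For a group of 8 sampled responses, pointwise scoring needs 8 evaluations where exhaustive pairwise comparison needs 28, and the gap grows quadratically with group size.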
[AI-6] Interpretability-Guided Bi-objective Optimization: Aligning Accuracy and Explainability
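Quick read: This paper addresses how to incorporate structured domain knowledge into the training of interpretable models so as to improve both transparency and performance, focusing on feature-importance hierarchy constraints for time-series data. The key to the solution is Interpretability-Guided Bi-objective Optimization (IGBO), a bi-objective framework that encodes feature-importance hierarchies as a Directed Acyclic Graph (DAG) and quantifies feature contributions with Temporal Integrated Gradients (TIG). An Optimal Path Oracle learns data-manifold-aware integration paths to mitigate the Out-of-Distribution (OOD) problem in TIG computation. Theoretical analysis establishes convergence and robustness to mini-batch noise, and empirical results show that IGBO enforces the DAG constraints with minimal accuracy loss and outperforms standard regularization baselines.
Link: https://arxiv.org/abs/2601.00655
Authors: Kasra Fouladi, Hamta Rahmani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages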
Abstract:This paper introduces Interpretability-Guided Bi-objective Optimization (IGBO), a framework that trains interpretable models by incorporating structured domain knowledge via a bi-objective formulation. IGBO encodes feature importance hierarchies as a Directed Acyclic Graph (DAG) and uses Temporal Integrated Gradients (TIG) to measure feature importance. To address the Out-of-Distribution (OOD) problem in TIG computation, we propose an Optimal Path Oracle that learns data-manifold-aware integration paths. Theoretical analysis proves convergence properties and robustness to mini-batch noise, while empirical results on time-series data demonstrate IGBO’s effectiveness in enforcing DAG constraints with minimal accuracy loss, outperforming standard regularization baselines.
[AI-7] DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
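Quick read: This paper addresses overfitting in Direct Preference Optimization (DPO) for Multimodal Large Language Models (MLLMs) caused by imbalanced difficulty in the preference data, which weakens hallucination suppression. The key to the solution is Difficulty-Aware Direct Preference Optimization (DA-DPO), which has two core components. First, difficulty estimation fuses the outputs of pre-trained vision-language models with complementary generative and contrastive objectives through a distribution-aware voting strategy, yielding robust difficulty scores without additional training. Second, difficulty-aware training reweights preference pairs by estimated difficulty, down-weighting easy samples and emphasizing hard ones to alleviate overfitting and improve fine-grained hallucination suppression. The method improves the robustness and generalization of multimodal preference optimization without new data or extra fine-tuning stages.
Link: https://arxiv.org/abs/2601.00623
Authors: Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by TMLR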
Abstract:Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision–language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at this https URL.
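As a rough illustration of difficulty-aware reweighting, the sketch below combines the standard DPO loss with per-pair weights. The weighting scheme (a normalized weighted mean with larger weights for harder pairs) is an assumption made here for illustration; the abstract does not specify how DA-DPO applies its difficulty scores.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def difficulty_weighted_loss(pairs, difficulties, beta=0.1):
    """Weighted mean of DPO losses; higher difficulty means higher weight.

    Assumption: `difficulties` are non-negative scores, one per pair, with
    each pair given as (logp_w, logp_l, ref_logp_w, ref_logp_l).
    """
    total = sum(difficulties)
    if len(pairs) != len(difficulties) or total <= 0:
        raise ValueError("need one non-negative weight per pair, not all zero")
    weighted = sum(d * dpo_loss(*p, beta=beta) for p, d in zip(pairs, difficulties))
    return weighted / total
```

An easy pair (large log-probability margin) contributes a small loss already; shrinking its weight further shifts the gradient budget toward pairs the model still confuses.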
[AI-8] Stronger Approximation Guarantees for Non-Monotone γ-Weakly DR-Submodular Maximization AAMAS2026
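Quick read: This paper addresses the fundamental optimization problem of maximizing a nonnegative, non-monotone γ-weakly DR-submodular function under constraints, specifically over a down-closed convex body. The key to the solution is to combine a Frank-Wolfe-guided continuous-greedy framework with a γ-aware double-greedy step, which handles non-monotonicity effectively and yields an approximation guarantee that depends smoothly on γ. When γ = 1 the bound recovers the known 0.401 approximation factor, and for γ < 1 it still improves on previously reported results, giving state-of-the-art guarantees for non-monotone γ-weakly DR-submodular maximization.
Link: https://arxiv.org/abs/2601.00611
Authors: Hareshkumar Jadav, Ranveer Singh, Vaneet Aggarwal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Optimization and Control (math.OC)
Comments: Extended version of paper accepted in AAMAS 2026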
Abstract:Maximizing submodular objectives under constraints is a fundamental problem in machine learning and optimization. We study the maximization of a nonnegative, non-monotone γ-weakly DR-submodular function over a down-closed convex body. Our main result is an approximation algorithm whose guarantee depends smoothly on γ; in particular, when γ = 1 (the DR-submodular case) our bound recovers the 0.401 approximation factor, while for γ < 1 the guarantee degrades gracefully and improves upon previously reported bounds for γ-weakly DR-submodular maximization under the same constraints. Our approach combines a Frank-Wolfe-guided continuous-greedy framework with a γ-aware double-greedy step, yielding a simple yet effective procedure for handling non-monotonicity. This results in state-of-the-art guarantees for non-monotone γ-weakly DR-submodular maximization over down-closed convex bodies.
[AI-9] HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
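Quick read: This paper addresses three challenges in fine-tuning large language models (LLMs) with Mixture-of-Experts (MoE) under federated learning (FL). First, there is no reliable metric for each expert's contribution to local fine-tuning performance, which makes expert selection difficult. Second, client computing resources are highly heterogeneous, and dynamic expert activation can overwhelm resource-constrained devices. Third, client-specific expert subsets and routing preferences undermine global aggregation, because misaligned expert updates and inconsistent gating networks introduce interference. The key to the solution is HFedMoE, which ranks experts by their contribution to fine-tuning performance, adaptively selects an expert subset for each client from an information bottleneck perspective to match its computing budget, and uses a sparsity-aware aggregation strategy that combines the actively fine-tuned expert and gating parameters with importance-weighted contributions, enabling efficient and stable heterogeneous MoE-based FL fine-tuning.
Link: https://arxiv.org/abs/2601.00583
Authors: Zihan Fang, Zheng Lin, Senkang Hu, Yanan Ma, Yihang Tao, Yiqin Deng, Xianhao Chen, Yuguang Fang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 14 pages, 16 figures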
Abstract:While federated learning (FL) enables fine-tuning of large language models (LLMs) without compromising data privacy, the substantial size of an LLM renders on-device training impractical for resource-constrained clients, such as mobile devices. Thus, Mixture-of-Experts (MoE) models have emerged as a computation-efficient solution, which activates only a sparse subset of experts during model training to reduce computing burden without sacrificing performance. Though integrating MoE into FL fine-tuning holds significant potential, it still encounters three key challenges: i) selecting appropriate experts for clients remains challenging due to the lack of a reliable metric to measure each expert’s impact on local fine-tuning performance, ii) the heterogeneous computing resources across clients severely hinder MoE-based LLM fine-tuning, as dynamic expert activations across diverse input samples can overwhelm resource-constrained devices, and iii) client-specific expert subsets and routing preferences undermine global aggregation, where misaligned expert updates and inconsistent gating networks introduce destructive interference. To address these challenges, we propose HFedMoE, a heterogeneous MoE-based FL fine-tuning framework that customizes a subset of experts to each client for computation-efficient LLM fine-tuning. Specifically, HFedMoE identifies the expert importance based on its contributions to fine-tuning performance, and then adaptively selects a subset of experts from an information bottleneck perspective to align with each client’s computing budget. A sparsity-aware model aggregation strategy is also designed to aggregate the actively fine-tuned experts and gating parameters with importance-weighted contributions. Extensive experiments demonstrate that HFedMoE outperforms state-of-the-art benchmarks in training accuracy and convergence speed.
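One way to picture sparsity-aware aggregation is to average each expert only over the clients that actually trained it, weighting each contribution by an importance score. The sketch below does exactly that on plain Python lists; the data layout and the weighting rule are assumptions for illustration and should not be read as HFedMoE's actual aggregation rule.

```python
def aggregate_experts(client_updates):
    """Importance-weighted average per expert over the clients that trained it.

    Assumed layout: each client update maps expert_id -> (params, weight),
    where params is a list of floats and weight is a positive importance.
    Experts a client did not train are simply absent from its update.
    """
    sums, weights = {}, {}
    for update in client_updates:
        for expert_id, (params, weight) in update.items():
            if weight <= 0:
                continue  # skip experts the client marked as unimportant
            acc = sums.setdefault(expert_id, [0.0] * len(params))
            for i, value in enumerate(params):
                acc[i] += weight * value
            weights[expert_id] = weights.get(expert_id, 0.0) + weight
    return {e: [v / weights[e] for v in acc] for e, acc in sums.items()}


# Two hypothetical clients: both trained expert 0, only the second trained expert 1.
clients = [
    {0: ([1.0, 1.0], 1.0)},
    {0: ([3.0, 3.0], 3.0), 1: ([2.0], 1.0)},
]
global_experts = aggregate_experts(clients)
```

Averaging over participating clients only avoids diluting an expert's update with zeros from clients that never activated it.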
[AI-10] Priority-Aware Multi-Robot Coverage Path Planning
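Quick read: This paper addresses a limitation of Multi-Robot Coverage Path Planning (MCPP): most methods assume every region is equally important, which is inefficient when some regions need to be covered first. Conventional MCPP optimizes only the overall completion time (makespan) and ignores priority differences, so critical regions may be served late. The authors introduce the Priority-Aware MCPP (PA-MCPP) problem, whose goal is to minimize, in lexicographic order, the total priority-weighted latency of zone coverage and the overall makespan. The key to the solution is a scalable two-phase framework: the first phase combines greedy zone assignment with local search and spanning-tree-based path planning to cover prioritized zones efficiently, and the second phase uses Steiner-tree-guided coverage of the residual area so that the whole environment is covered. Experiments show that the method significantly reduces priority-weighted latency while keeping makespan competitive with standard MCPP baselines, scales well with the number of robots, and lets zone coverage behavior be controlled through the priority weights.
Link: https://arxiv.org/abs/2601.00580
Authors: Kanghoon Lee, Hyeonjun Kim, Jiachen Li, Jinkyoo Park
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: IEEE Robotics and Automation Letters, 8 pages, 10 figures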
Abstract:Multi-robot systems are widely used for coverage tasks that require efficient coordination across large environments. In Multi-Robot Coverage Path Planning (MCPP), the objective is typically to minimize the makespan by generating non-overlapping paths for full-area coverage. However, most existing methods assume uniform importance across regions, limiting their effectiveness in scenarios where some zones require faster attention. We introduce the Priority-Aware MCPP (PA-MCPP) problem, where a subset of the environment is designated as prioritized zones with associated weights. The goal is to minimize, in lexicographic order, the total priority-weighted latency of zone coverage and the overall makespan. To address this, we propose a scalable two-phase framework combining (1) greedy zone assignment with local search, spanning-tree-based path planning, and (2) Steiner-tree-guided residual coverage. Experiments across diverse scenarios demonstrate that our method significantly reduces priority-weighted latency compared to standard MCPP baselines, while maintaining competitive makespan. Sensitivity analyses further show that the method scales well with the number of robots and that zone coverage behavior can be effectively controlled by adjusting priority weights.
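The lexicographic objective described above can be illustrated with a small scoring function. It assumes that a zone's latency is the time at which the zone is fully covered and that makespan is the latest robot finish time; these definitions, the function name, and the example numbers are illustrative assumptions rather than the paper's exact formulation.

```python
def pa_mcpp_objective(zone_cover_times, zone_weights, robot_finish_times):
    """Return (priority-weighted latency, makespan) for a candidate plan.

    Python compares tuples lexicographically, so `min` over these values
    prefers lower weighted latency first and uses makespan as a tie-breaker.
    """
    weighted_latency = sum(zone_weights[z] * t for z, t in zone_cover_times.items())
    makespan = max(robot_finish_times)
    return (weighted_latency, makespan)


weights = {"A": 3, "B": 1}  # hypothetical priority weights
plan_fast_priority = pa_mcpp_objective({"A": 2, "B": 5}, weights, [10, 12])
plan_fast_overall = pa_mcpp_objective({"A": 4, "B": 3}, weights, [8, 9])
best = min(plan_fast_priority, plan_fast_overall)
```

Here the first plan wins despite its longer makespan, because covering the high-weight zone early dominates under the lexicographic order.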
[AI-11] Learning to be Reproducible: Custom Loss Design for Robust Neural Networks
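Quick read: This paper addresses the run-to-run performance variability of deep learning models caused by stochastic factors in training, such as weight initialization and data shuffling, with the aim of improving reproducibility and reliability. The key to the solution is a Custom Loss Function (CLF) whose parameters can be tuned to explicitly balance predictive accuracy against training stability. CLF reduces sensitivity to random perturbations while maintaining predictive performance across tasks including image classification and time series forecasting.
Link: https://arxiv.org/abs/2601.00578
Authors: Waqas Ahmed, Sheeba Samuel, Kevin Coakley, Birgitta Koenig-Ries, Odd Erik Gundersen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: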
Abstract:To enhance the reproducibility and reliability of deep learning models, we address a critical gap in current training methodologies: the lack of mechanisms that ensure consistent and robust performance across runs. Our empirical analysis reveals that even under controlled initialization and training conditions, the accuracy of the model can exhibit significant variability. To address this issue, we propose a Custom Loss Function (CLF) that reduces the sensitivity of training outcomes to stochastic factors such as weight initialization and data shuffling. By fine-tuning its parameters, CLF explicitly balances predictive accuracy with training stability, leading to more consistent and reliable model performance. Extensive experiments across diverse architectures for both image classification and time series forecasting demonstrate that our approach significantly improves training robustness without sacrificing predictive performance. These results establish CLF as an effective and efficient strategy for developing more stable, reliable and trustworthy neural networks.
[AI-12] Improving Scientific Document Retrieval with Academic Concept Index
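Quick read: This paper addresses two challenges in adapting general-domain retrievers to scientific domains: the scarcity of large-scale domain-specific relevance annotations, and the substantial mismatch in vocabulary and information needs. Existing methods use large language models (LLMs) to generate synthetic queries and auxiliary contexts, but they overlook the diverse academic concepts embedded in scientific documents and therefore produce redundant or conceptually narrow output. The key to the solution is an academic concept index, which extracts key concepts from papers and organizes them according to an academic taxonomy, providing a shared foundation for both directions. First, concept coverage-based query generation (CCQGen) adaptively conditions the LLM on uncovered concepts to generate complementary queries with broader coverage. Second, concept-focused context expansion (CCExpand) uses document snippets that serve as concise responses to the concept-aware queries to strengthen relevance matching. Experiments show that the index improves query quality, conceptual alignment, and retrieval performance.
Link: https://arxiv.org/abs/2601.00567
Authors: Jeyun Lee, Junhyoung Lee, Wonbin Kweon, Bowen Jin, Yu Zhang, Susik Yoon, Dongha Lee, Hwanjo Yu, Jiawei Han, Seongku Kang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: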
Abstract:Adapting general-domain retrievers to scientific domains is challenging due to the scarcity of large-scale domain-specific relevance annotations and the substantial mismatch in vocabulary and information needs. Recent approaches address these issues through two independent directions that leverage large language models (LLMs): (1) generating synthetic queries for fine-tuning, and (2) generating auxiliary contexts to support relevance matching. However, both directions overlook the diverse academic concepts embedded within scientific documents, often producing redundant or conceptually narrow queries and contexts. To address this limitation, we introduce an academic concept index, which extracts key concepts from papers and organizes them guided by an academic taxonomy. This structured index serves as a foundation for improving both directions. First, we enhance the synthetic query generation with concept coverage-based generation (CCQGen), which adaptively conditions LLMs on uncovered concepts to generate complementary queries with broader concept coverage. Second, we strengthen the context augmentation with concept-focused auxiliary contexts (CCExpand), which leverages a set of document snippets that serve as concise responses to the concept-aware CCQGen queries. Extensive experiments show that incorporating the academic concept index into both query generation and context augmentation leads to higher-quality queries, better conceptual alignment, and improved retrieval performance.
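[AI-13] Cracking IoT Security: Can LLMs Outsmart Static Analysis Tools?
【Quick Read】: This paper addresses the detection of interaction threats in smart-home Internet of Things (IoT) platforms, which arise from the interplay among Trigger Action Condition (TAC) rules. Such threats appear as unintended or unsafe behaviors caused by implicit dependencies, conflicting triggers, or overlapping conditions. Identifying them requires semantic understanding and structural reasoning, which traditionally depend on symbolic, constraint-driven static analysis. The key contribution is the first comprehensive evaluation of large language models (LLMs) across a multi-category interaction threat taxonomy. It covers the original openHAB (oHC/IoTB) dataset and a structurally challenging Mutation dataset, and compares Llama 3.1 8B, Llama 70B, GPT-4o, Gemini-2.5-Pro, and DeepSeek-R1 under zero-, one-, and two-shot prompting. LLMs show promising semantic understanding on action- and condition-related threats, but their accuracy degrades significantly on threats that require cross-rule structural reasoning, especially under mutated rule forms. The symbolic reasoning baseline stays stable and is unaffected by rule rewrites. The paper concludes that LLMs alone are not yet dependable for safety-critical interaction-threat detection in IoT environments, and points to hybrid architectures that combine symbolic analysis with LLM-based semantic interpretation.
Link: https://arxiv.org/abs/2601.00559
Authors: Jason Quantrill,Noura Khajehnouri,Zihan Guo,Manar H. Alalfi
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: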
Abstract:Smart home IoT platforms such as openHAB rely on Trigger Action Condition (TAC) rules to automate device behavior, but the interplay among these rules can give rise to interaction threats, unintended or unsafe behaviors emerging from implicit dependencies, conflicting triggers, or overlapping conditions. Identifying these threats requires semantic understanding and structural reasoning that traditionally depend on symbolic, constraint-driven static analysis. This work presents the first comprehensive evaluation of Large Language Models (LLMs) across a multi-category interaction threat taxonomy, assessing their performance on both the original openHAB (oHC/IoTB) dataset and a structurally challenging Mutation dataset designed to test robustness under rule transformations. We benchmark Llama 3.1 8B, Llama 70B, GPT-4o, Gemini-2.5-Pro, and DeepSeek-R1 across zero-, one-, and two-shot settings, comparing their results against oHIT’s manually validated ground truth. Our findings show that while LLMs exhibit promising semantic understanding, particularly on action- and condition-related threats, their accuracy degrades significantly for threats requiring cross-rule structural reasoning, especially under mutated rule forms. Model performance varies widely across threat categories and prompt settings, with no model providing consistent reliability. In contrast, the symbolic reasoning baseline maintains stable detection across both datasets, unaffected by rule rewrites or structural perturbations. These results underscore that LLMs alone are not yet dependable for safety critical interaction-threat detection in IoT environments. We discuss the implications for tool design and highlight the potential of hybrid architectures that combine symbolic analysis with LLM-based semantic interpretation to reduce false positives while maintaining structural rigor.
[AI-14] CoCo-Fed: A Unified Framework for Memory- and Communication-Efficient Federated Learning at the Wireless Edge
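【Quick Read】: This paper addresses two bottlenecks in deploying large-scale neural networks within the Open Radio Access Network (O-RAN) architecture: the prohibitive memory footprint of local training on resource-constrained base stations (gNBs), and the saturation of bandwidth-limited backhaul links when high-dimensional model updates are aggregated globally. The key to the solution is CoCo-Fed, a Compression and Combination-based Federated learning framework. Locally, it breaks the memory wall through a double-dimension down-projection of gradients, so the optimizer operates on low-rank structures without adding inference parameters or latency. Globally, it uses a transmission protocol based on orthogonal subspace superposition, in which layer-wise updates are projected and superimposed into a single consolidated matrix per gNB, sharply reducing backhaul traffic. The authors also prove convergence even under unsupervised learning conditions suited to wireless sensing tasks. Simulations on an angle-of-arrival estimation task show that CoCo-Fed outperforms state-of-the-art baselines in both memory and communication efficiency while converging robustly under non-IID settings.
Link: https://arxiv.org/abs/2601.00549
Authors: Zhiheng Guo,Zhaoyang Liu,Zihan Cen,Chenyuan Feng,Xinghua Sun,Xiang Chen,Tony Q. S. Quek,Xijun Wang
Affiliation: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 1 algorithm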
Abstract:The deployment of large-scale neural networks within the Open Radio Access Network (O-RAN) architecture is pivotal for enabling native edge intelligence. However, this paradigm faces two critical bottlenecks: the prohibitive memory footprint required for local training on resource-constrained gNBs, and the saturation of bandwidth-limited backhaul links during the global aggregation of high-dimensional model updates. To address these challenges, we propose CoCo-Fed, a novel Compression and Combination-based Federated learning framework that unifies local memory efficiency and global communication reduction. Locally, CoCo-Fed breaks the memory wall by performing a double-dimension down-projection of gradients, adapting the optimizer to operate on low-rank structures without introducing additional inference parameters/latency. Globally, we introduce a transmission protocol based on orthogonal subspace superposition, where layer-wise updates are projected and superimposed into a single consolidated matrix per gNB, drastically reducing the backhaul traffic. Beyond empirical designs, we establish a rigorous theoretical foundation, proving the convergence of CoCo-Fed even under unsupervised learning conditions suitable for wireless sensing tasks. Extensive simulations on an angle-of-arrival estimation task demonstrate that CoCo-Fed significantly outperforms state-of-the-art baselines in both memory and communication efficiency while maintaining robust convergence under non-IID settings.
[AI-15] Optimizing LSTM Neural Networks for Resource-Constrained Retail Sales Forecasting: A Model Compression Study
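【Quick Read】: This paper addresses the high computational cost of standard Long Short-Term Memory (LSTM) networks for retail sales forecasting, which makes them hard for small and mid-sized retailers to adopt. The approach compresses the model by gradually reducing the number of LSTM hidden units from 128 to 16 and measuring the effect on forecast accuracy. With 64 hidden units, the mean absolute percentage error (MAPE) falls from 23.6% (128-unit model) to 12.4%, a relative error reduction of about 47%, while the model shrinks by 73% (from 280KB to 76KB). Moderate compression therefore improved both efficiency and accuracy in this study.
Link: https://arxiv.org/abs/2601.00525
Authors: Ravi Teja Pagidoju
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE ICUIS 2025 (International Conference on Ubiquitous and Intelligent Systems). 5 pages, 3 figures, 1 table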
Abstract:Standard LSTM(Long Short-Term Memory) neural networks provide accurate predictions for sales data in the retail industry, but require a lot of computing power. It can be challenging especially for mid to small retail industries. This paper examines LSTM model compression by gradually reducing the number of hidden units from 128 to 16. We used the Kaggle Store Item Demand Forecasting dataset, which has 913,000 daily sales records from 10 stores and 50 items, to look at the trade-off between model size and how accurate the predictions are. Experiments show that lowering the number of hidden LSTM units to 64 maintains the same level of accuracy while also improving it. The mean absolute percentage error (MAPE) ranges from 23.6% for the full 128-unit model to 12.4% for the 64-unit model. The optimized model is 73% smaller (from 280KB to 76KB) and 47% more accurate. These results show that larger models do not always achieve better results.
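The headline percentages in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that "47% more accurate" means the relative reduction in MAPE; that reading is an interpretation, not something the abstract states. The `mape` helper and its toy inputs are illustrative only.

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


size_reduction = relative_reduction(280, 76)      # model size in KB, from the abstract
error_reduction = relative_reduction(23.6, 12.4)  # MAPE in percent, from the abstract
toy_mape = mape([100, 200, 400], [90, 220, 400])  # made-up sales figures
```

Both reductions round to the figures reported in the abstract (73% and 47%).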
[AI-16] Probability-Aware Parking Selection
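【Quick Read】: This paper addresses the fact that current parking navigation systems underestimate total travel time because they ignore the time spent searching for a parking space, which affects user experience, mode choice, congestion, and emissions. The key to the solution is the probability-aware parking selection problem, which directs drivers to the best parking location rather than straight to their destination, using an adaptable dynamic programming framework driven by lot-level probabilistic availability information. Closed-form analysis determines when it is optimal to target a specific lot or to explore alternatives, and gives the expected time cost. The paper also assesses the errors incurred when stochastic observations are used to estimate parking availability. In simulations based on real-world data from Seattle, probability-aware strategies save up to 66% of travel time relative to probability-unaware baselines, yet still take up to 123% longer than direct-to-destination estimates.
Link: https://arxiv.org/abs/2601.00521
Authors: Cameron Hickert,Sirui Li,Zhengbing He,Cathy Wu
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 10 pages, 6 figures, 3 tables. To be published in IEEE Transactions on Intelligent Transportation Systems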
Abstract:Current parking navigation systems often underestimate total travel time by failing to account for the time spent searching for a parking space, which significantly affects user experience, mode choice, congestion, and emissions. To address this issue, this paper introduces the probability-aware parking selection problem, which aims to direct drivers to the best parking location rather than straight to their destination. An adaptable dynamic programming framework is proposed for decision-making based on probabilistic information about parking availability at the parking lot level. Closed-form analysis determines when it is optimal to target a specific parking lot or explore alternatives, as well as the expected time cost. Sensitivity analysis and three illustrative cases are examined, demonstrating the model’s ability to account for the dynamic nature of parking availability. Acknowledging the financial costs of permanent sensing infrastructure, the paper provides analytical and empirical assessments of errors incurred when leveraging stochastic observations to estimate parking availability. Experiments with real-world data from the US city of Seattle indicate this approach’s viability, with mean absolute error decreasing from 7% to below 2% as observation frequency grows. In data-based simulations, probability-aware strategies demonstrate time savings up to 66% relative to probability-unaware baselines, yet still take up to 123% longer than direct-to-destination estimates.
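To make the trade-off between a nearby but often full lot and a farther but reliable one concrete, the sketch below computes the expected total time of trying lots in a given order. It is a brute-force illustration, not the paper's dynamic programming framework or its closed-form analysis. All times, probabilities, and the `fallback` penalty are made-up assumptions.

```python
from itertools import permutations


def expected_time(order, start_drive, drive, walk, p, fallback):
    """Expected total time when lots are tried in `order` until one has a space.

    start_drive[i]: drive time from the origin to lot i
    drive[i][j]:    drive time from lot i to lot j
    walk[i]:        walk time from lot i to the destination
    p[i]:           probability that lot i has a free space on arrival
    fallback:       extra time charged if every lot in `order` is full
    """
    total, reach, elapsed, prev = 0.0, 1.0, 0.0, None
    for lot in order:
        elapsed += start_drive[lot] if prev is None else drive[prev][lot]
        total += reach * p[lot] * (elapsed + walk[lot])  # park here and walk
        reach *= 1.0 - p[lot]                            # lot was full, keep going
        prev = lot
    return total + reach * (elapsed + fallback)


# Hypothetical scenario: lot 0 is next to the destination but usually full.
start_drive = [10, 8]
drive = [[0, 4], [4, 0]]
walk = [1, 6]
p = [0.3, 0.9]
fallback = 20

best = min(
    permutations(range(2)),
    key=lambda o: expected_time(o, start_drive, drive, walk, p, fallback),
)
```

With these numbers, heading to the farther, more reliable lot first gives the lower expected time, which is the kind of decision the abstract describes as targeting a specific lot versus exploring alternatives.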
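[AI-17] Trajectory Guard – A Lightweight Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI AAAI
【Quick Read】: This paper addresses anomaly detection for the multi-step action plans generated by autonomous large language model (LLM) agents, which can fail through contextual misalignment or structural incoherence. Existing methods fall short: mean-pooled embeddings dilute anomalous steps, contrastive-only approaches ignore sequential structure, and standard unsupervised methods on pre-trained embeddings reach F1-scores no higher than 0.69. The key to the solution is Trajectory Guard, a Siamese Recurrent Autoencoder trained with a hybrid loss that jointly optimizes task-trajectory alignment (through contrastive learning) and sequential validity (through reconstruction). This dual objective detects both "wrong plan for this task" and "malformed plan structure". On synthetic perturbations and real-world failures from security audits (RAS-Eval) and multi-agent systems (Who&When), it achieves F1-scores of 0.88-0.94 on balanced sets and recall of 0.86-0.92 on imbalanced external benchmarks. At 32 ms inference latency it runs 17-27× faster than LLM Judge baselines, enabling real-time safety verification in production.
Link: https://arxiv.org/abs/2601.00516
Authors: Laksh Advani
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI Trustagent 2026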
Abstract:Autonomous LLM agents generate multi-step action plans that can fail due to contextual misalignment or structural incoherence. Existing anomaly detection methods are ill-suited for this challenge: mean-pooling embeddings dilutes anomalous steps, while contrastive-only approaches ignore sequential structure. Standard unsupervised methods on pre-trained embeddings achieve F1-scores no higher than 0.69. We introduce Trajectory Guard, a Siamese Recurrent Autoencoder with a hybrid loss function that jointly learns task-trajectory alignment via contrastive learning and sequential validity via reconstruction. This dual objective enables unified detection of both “wrong plan for this task” and “malformed plan structure.” On benchmarks spanning synthetic perturbations and real-world failures from security audits (RAS-Eval) and multi-agent systems (Who&When), we achieve F1-scores of 0.88-0.94 on balanced sets and recall of 0.86-0.92 on imbalanced external benchmarks. At 32 ms inference latency, our approach runs 17-27× faster than LLM Judge baselines, enabling real-time safety verification in production deployments.
[AI-18] Multi-Agent Coordinated Rename Refactoring
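【Quick Read】: This paper addresses coordinated renaming, a frequent, repetitive, and error-prone task in software development in which a single rename refactoring triggers refactorings of multiple related identifiers. Developers must propagate these renames manually across many files and contexts. State-of-the-art heuristic approaches produce an overwhelming number of false positives, while vanilla Large Language Models (LLMs) give incomplete suggestions because of their limited context and their inability to interact with refactoring tools. The key to the solution is the first multi-agent framework for this task, built on the insight that a developer's initial refactoring is a clue to the scope of related refactorings. The Scope Inference Agent turns this clue into an explicit, natural-language Declared Scope. The Planned Execution Agent uses that scope as a strict plan to identify the program elements to refactor, and applies the changes safely through the IDE's own trusted refactoring APIs. The Replication Agent uses it to guide the project-wide search. The framework keeps developers in the driver's seat while reducing their manual burden.
Link: https://arxiv.org/abs/2601.00482
Authors: Abhiram Bellur,Mohammed Raihan Ullah,Fraol Batole,Mohit Kansara,Masaharu Morimoto,Kai Ishikawa,Haifeng Chen,Yaroslav Zharov,Timofey Bryksin,Tien N. Nguyen,Hridesh Rajan,Danny Dig
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: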
Abstract:The primary value of AI agents in software development lies in their ability to extend the developer’s capacity for reasoning and action, not to supplant human involvement. To showcase how to use agents working in tandem with developers, we designed a novel approach for carrying out coordinated renaming. Coordinated renaming, where a single rename refactoring triggers refactorings in multiple, related identifiers, is a frequent yet challenging task. Developers must manually propagate these rename refactorings across numerous files and contexts, a process that is both tedious and highly error-prone. State-of-the-art heuristic-based approaches produce an overwhelming number of false positives, while vanilla Large Language Models (LLMs) provide incomplete suggestions due to their limited context and inability to interact with refactoring tools. This leaves developers with incomplete refactorings or burdens them with filtering too many false positives. Coordinated renaming is exactly the kind of repetitive task that agents can significantly reduce the developers’ burden while keeping them in the driver’s seat. We designed, implemented, and evaluated the first multi-agent framework that automates coordinated renaming. It operates on a key insight: a developer’s initial refactoring is a clue to infer the scope of related refactorings. Our Scope Inference Agent first transforms this clue into an explicit, natural-language Declared Scope. The Planned Execution Agent then uses this as a strict plan to identify program elements that should undergo refactoring and safely executes the changes by invoking the IDE’s own trusted refactoring APIs. Finally, the Replication Agent uses it to guide the project-wide search. 
We first conducted a formative study on the practice of coordinated renaming in 609K commits in 100 open-source projects and surveyed 205 developers …
[AI-19] MAESTRO: Multi-Agent Evaluation Suite for Testing Reliability and Observability
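【Quick Read】: This paper addresses the lack of standardized, reproducible, and observable evaluation for Large Language Model (LLM)-based multi-agent systems (MAS), which makes it hard to compare the performance, resource use, and reliability of different MAS architectures. The key to the solution is MAESTRO, an evaluation suite that standardizes MAS configuration and execution through a unified interface, integrates native and third-party MAS via a repository of examples and lightweight adapters, and exports framework-agnostic execution traces together with system-level signals such as latency, cost, and failures. Controlled experiments over 12 representative MAS, with repeated runs, different backend models, and different tool configurations, show that MAS architecture is the dominant driver of resource profiles, reproducibility, and the cost-latency-accuracy trade-off, often outweighing changes in backend models or tool settings.
Link: https://arxiv.org/abs/2601.00481
Authors: Tie Ma,Yixi Chen,Vaastav Anand,Alessandro Cornacchia,Amândio R. Faustino,Guanheng Liu,Shan Zhang,Hongbin Luo,Suhaib A. Fahmy,Zafar A. Qazi,Marco Canini
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: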
Abstract:We present MAESTRO, an evaluation suite for the testing, reliability, and observability of LLM-based MAS. MAESTRO standardizes MAS configuration and execution through a unified interface, supports integrating both native and third-party MAS via a repository of examples and lightweight adapters, and exports framework-agnostic execution traces together with system-level signals (e.g., latency, cost, and failures). We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns, and conduct controlled experiments across repeated runs, backend models, and tool configurations. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run-to-run variance in performance and reliability. We further find that MAS architecture is the dominant driver of resource profiles, reproducibility, and cost-latency-accuracy trade-off, often outweighing changes in backend models or tool settings. Overall, MAESTRO enables systematic evaluation and provides empirical guidance for designing and optimizing agentic systems.
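[AI-20] Progressive Ideation using an Agentic AI Framework for Human-AI Co-Creation
【Quick Read】: This paper addresses the cognitive challenge that novice designers face in generating truly novel and diverse ideas in engineering design, a challenge that current "single-spurt" AI systems exacerbate by producing large volumes of semantically clustered ideas. The key to the solution is MIDAS (Meta-cognitive Ideation through Distributed Agentic AI System), which replaces the single-AI paradigm with a distributed "team" of specialized AI agents that emulates the human meta-cognitive ideation workflow. The system progressively refines ideas and assesses each one for global novelty (against existing solutions) and local novelty (against previously generated ideas), moving the human designer from passive filterer to active, collaborative partner.
Link: https://arxiv.org/abs/2601.00475
Authors: Sankar B,Srinidhi Ranjini Girish,Aadya Bharti,Dibakar Sen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 11 figures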
Abstract:The generation of truly novel and diverse ideas is important for contemporary engineering design, yet it remains a significant cognitive challenge for novice designers. Current ‘single-spurt’ AI systems exacerbate this challenge by producing a high volume of semantically clustered ideas. We propose MIDAS (Meta-cognitive Ideation through Distributed Agentic AI System), a novel framework that replaces the single-AI paradigm with a distributed ‘team’ of specialized AI agents designed to emulate the human meta-cognitive ideation workflow. This agentic system progressively refines ideas and assesses each one for both global novelty (against existing solutions) and local novelty (against previously generated ideas). MIDAS, therefore, demonstrates a viable and progressive paradigm for true human-AI co-creation, elevating the human designer from a passive filterer to a participatory, active, collaborative partner.
[AI-21] Neural Chains and Discrete Dynamical Systems
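【Quick Read】: This paper examines the analogy between machine-learning models based on the transformer architecture without self-attention, called neural chains, and the discrete dynamical systems associated with discretised neural integral equations (NIE) and partial differential equations (PDE). The central question is whether standard numerical discretization, such as finite differences (FD), and physics-informed neural network (PINN) learning acquire essentially the same knowledge of the dynamics when solving nonlinear PDEs such as the Burgers and Eikonal equations. The key finding is that they do, by different paths: FD relies on highly structured tridiagonal matrices, whereas PINN learning proceeds through random matrices. The price for PINNs is a much larger number of parameters, which reduces physical transparency and raises training cost. This offers no advantage on the one-dimensional problems studied, but the results do not rule out better strategies for high-dimensional problems.
Link: https://arxiv.org/abs/2601.00473
Authors: Sauro Succi,Abhisek Ganguly,Santosh Ansumali
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: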
Abstract:We inspect the analogy between machine-learning (ML) applications based on the transformer architecture without self-attention, "neural chains" hereafter, and discrete dynamical systems associated with discretised versions of neural integral and partial differential equations (NIE, PDE). A comparative analysis of the numerical solution of the (viscid and inviscid) Burgers and Eikonal equations via standard numerical discretization (also cast in terms of neural chains) and via PINN’s learning is presented and commented on. It is found that standard numerical discretization and PINN learning provide two different paths to acquire essentially the same knowledge about the dynamics of the system. PINN learning proceeds through random matrices which bear no direct relation to the highly structured matrices associated with finite-difference (FD) procedures. Random matrices leading to acceptable solutions are far more numerous than the unique tridiagonal form in matrix space, which explains why the PINN search typically lands on the random ensemble. The price is a much larger number of parameters, causing lack of physical transparency (explainability) as well as large training costs with no counterpart in the FD procedure. However, our results refer to one-dimensional dynamic problems, hence they don’t rule out the possibility that PINNs and ML in general, may offer better strategies for high-dimensional problems.
[AI-22] Geometric Regularization in Mixture-of-Experts: The Disconnect Between Weights and Activations
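【Quick Read】: This paper examines whether geometric regularization promotes expert specialization in Mixture-of-Experts (MoE) models. The approach applies an orthogonality loss to enforce expert diversity in weight space. The experiments show that it fails on several fronts: weight-space overlap is not reduced (MSO actually increases by up to 114%), activation-space overlap stays high (about 0.6) regardless of regularization, and the effects on performance are inconsistent. Across 7 regularization strengths, there is no significant correlation between weight orthogonality and activation orthogonality (r = -0.293, p = 0.523). The regularization therefore neither achieves its geometric goal nor reliably improves performance.
Link: https://arxiv.org/abs/2601.00457
Authors: Hyunjun Kim
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: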
Abstract:Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply orthogonality loss to enforce expert diversity and find it fails on multiple fronts: it does not reduce weight-space overlap (MSO actually increases by up to 114%), activation-space overlap remains high (~0.6) regardless of regularization, and effects on performance are inconsistent – marginal improvement on WikiText-103 (-0.9%), slight degradation on TinyStories (+0.9%), and highly variable results on PTB (std 1.0). Our analysis across 7 regularization strengths reveals no significant correlation (r = -0.293, p = 0.523) between weight and activation orthogonality. These findings demonstrate that weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for MoE diversity.
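The reported correlation statistics can be sanity-checked. The sketch below assumes the p-value comes from a standard two-sided Pearson test over the 7 regularization strengths (so 5 degrees of freedom); the abstract does not say which test was used. It uses the closed-form Student-t distribution for odd degrees of freedom, so no external library is needed.

```python
import math


def pearson_p_value_df5(r):
    """Two-sided p-value for a Pearson correlation r computed from n = 7 points.

    Uses t = |r| * sqrt(n - 2) / sqrt(1 - r^2) with 5 degrees of freedom and the
    closed-form Student-t probability for df = 5.
    """
    df = 5
    t = abs(r) * math.sqrt(df) / math.sqrt(1.0 - r * r)
    theta = math.atan(t / math.sqrt(df))
    s, c = math.sin(theta), math.cos(theta)
    # Probability that |T| <= t for a t-distribution with 5 degrees of freedom.
    inside = (2.0 / math.pi) * (theta + s * (c + (2.0 / 3.0) * c ** 3))
    return 1.0 - inside


p = pearson_p_value_df5(-0.293)
```

Under that assumption the computed value agrees with the reported p = 0.523 to within rounding, which is consistent with the abstract's statement that the correlation is not significant.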
[AI-23] Deep Networks Learn Deep Hierarchical Models
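【Quick Read】: This paper studies the efficient learnability of deep models in supervised learning, specifically for classification tasks with a label hierarchy. The key to the solution is a new class of hierarchical models that assumes an unknown label hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$, in which labels in $L_1$ are simple functions of the input and, for $i > 1$, labels in $L_i$ are simple functions of simpler labels. The author proves that layerwise stochastic gradient descent (SGD) on residual networks learns this class efficiently, and that the class reaches the depth limit of efficient learnability: it contains models that require polynomial depth to express, whereas previously studied models can be computed by log-depth circuits. The paper also argues that the granular labels supplied by human teachers act as hints about the brain's internal algorithms, which gives rise to a hierarchical structure that facilitates efficient learning.
Link: https://arxiv.org/abs/2601.00455
Authors: Amit Daniely
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: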
Abstract:We consider supervised learning with n labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$, where labels in $L_1$ are simple functions of the input, while for $i > 1$, labels in $L_i$ are simple functions of simpler labels. Our class surpasses models that were previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previous models can be computed by log-depth circuits. Furthermore, we suggest that learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of "human teachers" supports the hypothesis that hierarchical structures are inherently available. By providing granular labels, teachers effectively reveal "hints" or "snippets" of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their internal logic, a hierarchical structure emerges that facilitates efficient learnability.
[AI-24] RMAAT: Astrocyte-Inspired Memory Compression and Replay for Efficient Long-Context Transformers
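【Quick Read】: This paper addresses the quadratic complexity of self-attention in Transformer models, which limits their use on long sequences. The key to the solution is the Recurrent Memory Augmented Astromorphic Transformer (RMAAT), an architecture that integrates abstracted astrocyte functions. A recurrent, segment-based processing strategy lets persistent memory tokens propagate contextual information. An adaptive compression mechanism modulates these tokens through a novel retention factor derived from simulated astrocyte long-term plasticity (LTP). Attention within each segment uses a linear-complexity mechanism inspired by astrocyte short-term plasticity (STP).
Link: https://arxiv.org/abs/2601.00426
Authors: Md Zesun Ahmed Mia,Malyaban Bal,Abhronil Sengupta
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: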
Abstract:The quadratic complexity of self-attention mechanism presents a significant impediment to applying Transformer models to long sequences. This work explores computational principles derived from astrocytes, glial cells critical for biological memory and synaptic modulation, as a complementary approach to conventional architectural modifications for efficient self-attention. We introduce the Recurrent Memory Augmented Astromorphic Transformer (RMAAT), an architecture integrating abstracted astrocyte functionalities. RMAAT employs a recurrent, segment-based processing strategy where persistent memory tokens propagate contextual information. An adaptive compression mechanism, governed by a novel retention factor derived from simulated astrocyte long-term plasticity (LTP), modulates these tokens. Attention within segments utilizes an efficient, linear-complexity mechanism inspired by astrocyte short-term plasticity (STP). Training is performed using Astrocytic Memory Replay Backpropagation (AMRB), a novel algorithm designed for memory efficiency in recurrent networks. Evaluations on the Long Range Arena (LRA) benchmark demonstrate RMAAT’s competitive accuracy and substantial improvements in computational and memory efficiency, indicating the potential of incorporating astrocyte-inspired dynamics into scalable sequence models.
[AI-25] Can Semantic Methods Enhance Team Sports Tactics? A Methodology for Football with Broader Applications
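【Quick Read】: This paper addresses intelligent, interpretable tactical decision-making in team sports by transferring semantic-space reasoning from computational linguistics to tactical analysis. The key to the solution is a shared semantic vector space: each player is represented as a multidimensional vector integrating technical, physical, and psychological attributes, and team profiles are aggregated through contextual weighting into a higher-level representation. Tactical templates such as high press, counterattack, or possession build-up are encoded in the same space, analogously to linguistic concepts. Vector-distance metrics then evaluate tactical fit and opponent-exploitation potential, yielding interpretable, adaptive strategy recommendations.
Link: https://arxiv.org/abs/2601.00421
Authors: Alessio Di Rubbo,Mattia Neri,Remo Pareschi,Marco Pedroni,Roberto Valtancoli,Paolino Zica
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to Sci (MDPI) for peer review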
Abstract:This paper explores how semantic-space reasoning, traditionally used in computational linguistics, can be extended to tactical decision-making in team sports. Building on the analogy between texts and teams – where players act as words and collective play conveys meaning – the proposed methodology models tactical configurations as compositional semantic structures. Each player is represented as a multidimensional vector integrating technical, physical, and psychological attributes; team profiles are aggregated through contextual weighting into a higher-level semantic representation. Within this shared vector space, tactical templates such as high press, counterattack, or possession build-up are encoded analogously to linguistic concepts. Their alignment with team profiles is evaluated using vector-distance metrics, enabling the computation of tactical "fit" and opponent-exploitation potential. A Python-based prototype demonstrates how these methods can generate interpretable, dynamically adaptive strategy recommendations, accompanied by fine-grained diagnostic insights at the attribute level. Beyond football, the approach offers a generalizable framework for collective decision-making and performance optimization in team-based domains – ranging from basketball and hockey to cooperative robotics and human-AI coordination systems. The paper concludes by outlining future directions toward real-world data integration, predictive simulation, and hybrid human-machine tactical intelligence.
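The notion of tactical "fit" as a vector-distance computation can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' prototype: the four attributes, the player values, the equal role weights, the template vectors, and the choice of cosine similarity as the metric are all hypothetical.

```python
import math


def weighted_team_profile(players, weights):
    """Aggregate player attribute vectors into one team vector using role weights."""
    total = sum(weights)
    dims = len(players[0])
    return [sum(w * p[d] for p, w in zip(players, weights)) / total for d in range(dims)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical attribute order: [pace, stamina, passing, composure]
players = [[0.9, 0.8, 0.5, 0.6], [0.7, 0.9, 0.6, 0.5], [0.8, 0.7, 0.4, 0.6]]
weights = [1.0, 1.0, 1.0]
templates = {
    "high press": [0.8, 1.0, 0.3, 0.4],
    "possession build-up": [0.3, 0.5, 1.0, 0.9],
}

team = weighted_team_profile(players, weights)
fit = {name: cosine_similarity(team, vec) for name, vec in templates.items()}
best = max(fit, key=fit.get)
```

For this fast, high-stamina squad the "high press" template scores the higher similarity. Comparing the team vector and the template dimension by dimension would give the attribute-level diagnostics the abstract mentions.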
[AI-26] Adaptive Causal Coordination Detection for Social Media: A Memory-Guided Framework with Semi-Supervised Learning
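【Quick Read】: This paper addresses the detection of coordinated inauthentic behavior on social platforms, where existing methods rely on superficial correlation analysis, static parameter settings, and costly manual annotation. The key to the solution is the Adaptive Causal Coordination Detection (ACCD) framework, a three-stage progressive architecture. Stage one uses an adaptive Convergent Cross Mapping (CCM) technique to identify causal relationships between accounts. Stage two combines active learning with uncertainty sampling in a semi-supervised classifier to reduce manual labeling. Stage three uses an automated validation module driven by historical detection experience to self-verify and refine the results. On real-world datasets the framework reports an F1-score of 87.3%, a 15.2% improvement over the strongest baseline, with 68% less manual annotation and a 2.8x processing speedup.
Link: https://arxiv.org/abs/2601.00400
Authors: Weng Ding,Yi Han,Mu-Jiang-Shan Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures. Under review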
Abstract:Detecting coordinated inauthentic behavior on social media remains a critical and persistent challenge, as most existing approaches rely on superficial correlation analysis, employ static parameter settings, and demand extensive and labor-intensive manual annotation. To address these limitations systematically, we propose the Adaptive Causal Coordination Detection (ACCD) framework. ACCD adopts a three-stage, progressive architecture that leverages a memory-guided adaptive mechanism to dynamically learn and retain optimal detection configurations for diverse coordination scenarios. Specifically, in the first stage, ACCD introduces an adaptive Convergent Cross Mapping (CCM) technique to deeply identify genuine causal relationships between accounts. The second stage integrates active learning with uncertainty sampling within a semi-supervised classification scheme, significantly reducing the burden of manual labeling. The third stage deploys an automated validation module driven by historical detection experience, enabling self-verification and optimization of the detection outcomes. We conduct a comprehensive evaluation using real-world datasets, including the Twitter IRA dataset, Reddit coordination traces, and several widely-adopted bot detection benchmarks. Experimental results demonstrate that ACCD achieves an F1-score of 87.3% in coordinated attack detection, representing a 15.2% improvement over the strongest existing baseline. Furthermore, the system reduces manual annotation requirements by 68% and achieves a 2.8x speedup in processing through hierarchical clustering optimization. In summary, ACCD provides a more accurate, efficient, and highly automated end-to-end solution for identifying coordinated behavior on social platforms, offering substantial practical value and promising potential for broad application.
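The second stage relies on uncertainty sampling, which in its simplest binary form means sending the least confident predictions to a human annotator. The sketch below shows that generic technique only. The function name, the probabilities, and the labeling budget are hypothetical, and ACCD's actual selection rule may differ.

```python
def least_confident(probabilities, budget):
    """Return indices of the `budget` items whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


# Hypothetical classifier outputs: P(account pair is coordinated).
probs = [0.97, 0.52, 0.10, 0.45, 0.80]
to_label = least_confident(probs, budget=2)
```

Only the two borderline pairs are queued for manual labeling. The confident predictions are left to the classifier, which is how this kind of scheme reduces annotation effort.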
[AI-27] Engineering Attack Vectors and Detecting Anomalies in Additive Manufacturing
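【Quick Read】: This paper addresses new cyberattack surfaces in additive manufacturing (AM) systems at the interface between computer-aided design (CAD) and machine execution layers. It focuses on Man-in-the-Middle (MitM) attacks on fused deposition modeling (FDM) printers that manipulate G-code files during upload and cause stealthy structural defects. The key to the solution is an unsupervised Intrusion Detection System (IDS) that analyzes structured machine logs generated during live printing. A frozen Transformer-based encoder (a BERT variant) extracts semantic representations of system behavior, and a contrastively trained projection head learns anomaly-sensitive embeddings. A clustering-based approach and a self-attention autoencoder then separate benign from compromised executions, catching tampering that conventional slicer software and runtime interfaces miss.
Link: https://arxiv.org/abs/2601.00384
Authors: Md Mahbub Hasan,Marcus Sternhagen,Krishna Chandra Roy
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper has been accepted to EAI SmartSP 2025. This is the preprint version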
Abstract:Additive manufacturing (AM) is rapidly integrating into critical sectors such as aerospace, automotive, and healthcare. However, this cyber-physical convergence introduces new attack surfaces, especially at the interface between computer-aided design (CAD) and machine execution layers. In this work, we investigate targeted cyberattacks on two widely used fused deposition modeling (FDM) systems, Creality’s flagship model K1 Max, and Ender 3. Our threat model is a multi-layered Man-in-the-Middle (MitM) intrusion, where the adversary intercepts and manipulates G-code files during upload from the user interface to the printer firmware. The MitM intrusion chain enables several stealthy sabotage scenarios. These attacks remain undetectable by conventional slicer software or runtime interfaces, resulting in structurally defective yet externally plausible printed parts. To counter these stealthy threats, we propose an unsupervised Intrusion Detection System (IDS) that analyzes structured machine logs generated during live printing. Our defense mechanism uses a frozen Transformer-based encoder (a BERT variant) to extract semantic representations of system behavior, followed by a contrastively trained projection head that learns anomaly-sensitive embeddings. Later, a clustering-based approach and a self-attention autoencoder are used for classification. Experimental results demonstrate that our approach effectively distinguishes between benign and compromised executions.
[AI-28] Word Frequency Counting Based on Serverless MapReduce
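【Quick Read】: This paper addresses the efficiency of word frequency counting with MapReduce, in particular how resource allocation and task parallelism are tuned. The key to the solution is to run the MapReduce programming model on a serverless computing platform (Function as a Service, FaaS) and to search for the most suitable number of Map and Reduce functions for a given workload. Experiments show that, for the same workload, execution time falls and overall efficiency improves, at different rates, as the number of map and reduce functions increases.
Link: https://arxiv.org/abs/2601.00380
Authors: Hanzhe Li,Bingchen Lin,Mengyuan Xu
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, International Conference on Engineering Management, Information Technology and Intelligence (EMITI 2024)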
Abstract:With the increasing demand for high-performance and high-efficiency computing, cloud computing, especially serverless computing, has become a research hotspot in recent years and has attracted considerable research attention. Meanwhile, MapReduce, a popular big data processing model in industry, has been widely applied in various fields. Inspired by the serverless framework of Function as a Service and the high concurrency and robustness of the MapReduce programming model, this paper focuses on combining them to reduce the time span and increase efficiency when executing the word frequency counting task. To that end, the paper uses a MapReduce programming model based on a serverless computing platform to determine the optimal number of Map functions and Reduce functions for a particular task. For the same workload, extensive experiments show that execution time decreases and overall program efficiency improves, at different rates, as the number of map and reduce functions increases. The paper suggests that finding the optimal number of map and reduce functions can help corporations and programmers arrive at optimized solutions.
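As a rough illustration of the tuning knob described above, the sketch below runs a word count with a configurable number of map and reduce workers. It is a local simulation under stated assumptions: threads stand in for FaaS invocations, and `word_count`, `num_mappers`, and `num_reducers` are hypothetical names chosen for this example, not the authors' code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib


def map_words(chunk):
    """Map step: count the words in one chunk of input lines."""
    return Counter(word.lower() for line in chunk for word in line.split())


def partition(counts, num_reducers):
    """Shuffle step: route each word to one reducer using a stable hash."""
    buckets = [Counter() for _ in range(num_reducers)]
    for word, n in counts.items():
        buckets[zlib.crc32(word.encode("utf-8")) % num_reducers][word] += n
    return buckets


def reduce_counts(partials):
    """Reduce step: merge the partial counts assigned to one reducer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, num_mappers=2, num_reducers=2):
    # Assumption: threads simulate independent serverless function calls.
    chunks = [lines[i::num_mappers] for i in range(num_mappers)]
    with ThreadPoolExecutor(max_workers=num_mappers) as pool:
        mapped = list(pool.map(map_words, chunks))
    shuffled = [partition(m, num_reducers) for m in mapped]
    per_reducer = [[s[r] for s in shuffled] for r in range(num_reducers)]
    with ThreadPoolExecutor(max_workers=num_reducers) as pool:
        reduced = list(pool.map(reduce_counts, per_reducer))
    # Reducers own disjoint word sets, so merging their outputs is a union.
    result = Counter()
    for r in reduced:
        result.update(r)
    return result
```

Timing `word_count` over a grid of `num_mappers` and `num_reducers` values is one simple way to look for a good configuration for a fixed workload; on a real FaaS platform, cold starts and invocation limits would also matter.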
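[AI-29] In Line with Context: Repository-Level Code Generation via Context Inlining
[Quick Read]: This paper addresses the difficulty models have, in repository-level code generation, in accurately understanding complex dependencies across functions, classes, and modules. Existing methods such as retrieval-augmented generation (RAG) or context-based function selection usually rely only on surface-level similarity and fail to capture the core dependency structure behind repository semantics. The key to the solution is the InlineCoder framework, which inlines the unfinished function into its call graph and thereby recasts the hard repository-level understanding task as a more tractable function-level coding task. Specifically, the framework first generates a draft completion that approximates downstream dependencies (the anchor) and then performs bidirectional inlining: upstream inlining embeds the anchor into its callers to capture diverse usage scenarios, while downstream retrieval brings in the callees as precise dependency context. The result is an enriched context that combines the draft with upstream and downstream views and gives the large language model (LLM) a comprehensive view of the repository.
Link: https://arxiv.org/abs/2601.00376
Authors: Chao Hu, Wenhao Zeng, Yuling Shi, Beijun Shen, Xiaodong Gu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to FSE 2026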
Abstract:Repository-level code generation has attracted growing attention in recent years. Unlike function-level code generation, it requires the model to understand the entire repository, reasoning over complex dependencies across functions, classes, and modules. However, existing approaches such as retrieval-augmented generation (RAG) or context-based function selection often fall short: they primarily rely on surface-level similarity and struggle to capture the rich dependencies that govern repository-level semantics. In this paper, we introduce InlineCoder, a novel framework for repository-level code generation. InlineCoder enhances the understanding of repository context by inlining the unfinished function into its call graph, thereby reframing the challenging repository understanding as an easier function-level coding task. Given a function signature, InlineCoder first generates a draft completion, termed an anchor, which approximates downstream dependencies and enables perplexity-based confidence estimation. This anchor drives a bidirectional inlining process: (i) Upstream Inlining, which embeds the anchor into its callers to capture diverse usage scenarios; and (ii) Downstream Retrieval, which integrates the anchor’s callees into the prompt to provide precise dependency context. The enriched context, combining draft completion with upstream and downstream perspectives, equips the LLM with a comprehensive repository view.
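The abstract mentions perplexity-based confidence estimation for the draft anchor. A minimal sketch of the standard perplexity computation follows; how InlineCoder actually consumes this score is not stated here, so the `is_confident` rule and its threshold are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a draft: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_confident(token_logprobs, max_perplexity=4.0):
    # Hypothetical rule: treat a low-perplexity draft as a confident anchor.
    return perplexity(token_logprobs) <= max_perplexity
```

Lower perplexity means the model assigned higher probability to its own draft, which is why it can serve as a rough confidence signal.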
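[AI-30] PatchBlock: A Lightweight Defense Against Adversarial Patches for Embedded EdgeAI Devices DATE2026
[Quick Read]: This paper addresses patch-based adversarial attacks on edge AI (EdgeAI) applications, in which small malicious patches (such as stickers) attached to objects induce neural networks to make wrong predictions, posing a serious threat to real-time inference scenarios such as autonomous driving and surveillance. The key to the solution is PatchBlock, a lightweight framework with three stages: it first splits the input image into chunks (Chunking), then uses a redesigned isolation forest to quickly detect anomalous regions (Separating), and finally applies dimensionality reduction to the identified regions to suppress the adversarial noise (Mitigating). The framework runs on the CPU as a sensor-level pre-processing module, in parallel with GPU inference and without extra GPU overhead; it is model- and patch-agnostic and can be integrated into existing systems. Evaluations across multiple datasets and devices show that it recovers up to 77% of model accuracy under strong attacks and is more computationally efficient than state-of-the-art defenses.
Link: https://arxiv.org/abs/2601.00367
Authors: Nandish Chattopadhyay, Abdul Basit, Amira Guesmi, Muhammad Abdullah Hanif, Bassem Ouni, Muhammad Shafique
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures, 5 tables, Accepted to DATE 2026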
Abstract:Adversarial attacks pose a significant challenge to the reliable deployment of machine learning models in EdgeAI applications, such as autonomous driving and surveillance, which rely on resource-constrained devices for real-time inference. Among these, patch-based adversarial attacks, where small malicious patches (e.g., stickers) are applied to objects, can deceive neural networks into making incorrect predictions with potentially severe consequences. In this paper, we present PatchBlock, a lightweight framework designed to detect and neutralize adversarial patches in images. Leveraging outlier detection and dimensionality reduction, PatchBlock identifies regions affected by adversarial noise and suppresses their impact. It operates as a pre-processing module at the sensor level, efficiently running on CPUs in parallel with GPU inference, thus preserving system throughput while avoiding additional GPU overhead. The framework follows a three-stage pipeline: splitting the input into chunks (Chunking), detecting anomalous regions via a redesigned isolation forest with targeted cuts for faster convergence (Separating), and applying dimensionality reduction on the identified outliers (Mitigating). PatchBlock is both model- and patch-agnostic, can be retrofitted to existing pipelines, and integrates seamlessly between sensor inputs and downstream models. Evaluations across multiple neural architectures, benchmark datasets, attack types, and diverse edge devices demonstrate that PatchBlock consistently improves robustness, recovering up to 77% of model accuracy under strong patch attacks such as the Google Adversarial Patch, while maintaining high portability and minimal clean accuracy loss. Additionally, PatchBlock outperforms the state-of-the-art defenses in efficiency, in terms of computation time and energy consumption per sample, making it suitable for EdgeAI applications.
[AI-31] Mapping Human Anti-collusion Mechanisms to Multi-agent AI
[Quick Read]: This paper addresses the risk that multi-agent AI systems, as they become increasingly autonomous, develop collusive strategies like those seen in human markets, while effective anti-collusion mechanisms for such AI settings are still lacking. The key to the solution is a taxonomy of anti-collusion mechanisms already established in human society, covering sanctions, leniency, whistleblowing, monitoring, auditing, market design, and governance, which the authors map to potential interventions for multi-agent AI systems. They propose implementation approaches for each mechanism and identify key challenges: the attribution problem, identity fluidity, the boundary problem, and adversarial adaptation.
Link: https://arxiv.org/abs/2601.00360
Authors: Jamiu Adekunle Idowu, Ahmed Almasoud, Ayman Alfahid
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:As multi-agent AI systems become increasingly autonomous, evidence shows they can develop collusive strategies similar to those long observed in human markets and institutions. While human domains have accumulated centuries of anti-collusion mechanisms, it remains unclear how these can be adapted to AI settings. This paper addresses that gap by (i) developing a taxonomy of human anti-collusion mechanisms, including sanctions, leniency and whistleblowing, monitoring and auditing, market design, and governance, and (ii) mapping them to potential interventions for multi-agent AI systems. For each mechanism, we propose implementation approaches. We also highlight open challenges, such as the attribution problem (difficulty attributing emergent coordination to specific agents), identity fluidity (agents being easily forked or modified), the boundary problem (distinguishing beneficial cooperation from harmful collusion), and adversarial adaptation (agents learning to evade detection).
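[AI-32] Bio-inspired Agentic Self-healing Framework for Resilient Distributed Computing Continuum Systems
[Quick Read]: This paper addresses service interruptions caused by frequent faults in Distributed Computing Continuum Systems (DCCS) operating in complex, dynamic, and heterogeneous environments, with the goal of scalable, adaptive, and autonomous resilience. The key to the solution is ReCiSt, an agentic self-healing framework inspired by biological self-healing. It maps the four biological phases of Hemostasis, Inflammation, Proliferation, and Remodeling onto the computational layers Containment, Diagnosis, Meta-Cognitive, and Knowledge, and uses Language Model (LM)-powered agents for automatic fault identification, causal reasoning, adaptive recovery, and long-term knowledge accumulation. With this design the system self-heals within tens of seconds at a minimum of 10% agent CPU usage; the reported results also cover the depth of analysis needed to overcome uncertainty and the number of micro-agents invoked.
Link: https://arxiv.org/abs/2601.00339
Authors: Alaa Saleh, Praveen Kumar Donta, Roberto Morabito, Sasu Tarkoma, Anders Lindgren, Qiyang Zhang, Schahram Dustdar, Susanna Pirttikangas, Lauri Lovén
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: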
Abstract:Human biological systems sustain life through extraordinary resilience, continually detecting damage, orchestrating targeted responses, and restoring function through self-healing. Inspired by these capabilities, this paper introduces ReCiSt, a bio-inspired agentic self-healing framework designed to achieve resilience in Distributed Computing Continuum Systems (DCCS). Modern DCCS integrate heterogeneous computing resources, ranging from resource-constrained IoT devices to high-performance cloud infrastructures, and their inherent complexity, mobility, and dynamic operating conditions expose them to frequent faults that disrupt service continuity. These challenges underscore the need for scalable, adaptive, and self-regulated resilience strategies. ReCiSt reconstructs the biological phases of Hemostasis, Inflammation, Proliferation, and Remodeling into the computational layers Containment, Diagnosis, Meta-Cognitive, and Knowledge for DCCS. These four layers perform autonomous fault isolation, causal diagnosis, adaptive recovery, and long-term knowledge consolidation through Language Model (LM)-powered agents. These agents interpret heterogeneous logs, infer root causes, refine reasoning pathways, and reconfigure resources with minimal human intervention. The proposed ReCiSt framework is evaluated on public fault datasets using multiple LMs, and no baseline comparison is included due to the scarcity of similar approaches. Nevertheless, our results, evaluated under different LMs, confirm ReCiSt’s self-healing capabilities within tens of seconds with a minimum of 10% agent CPU usage. Our results also demonstrate the depth of analysis needed to overcome uncertainties and the number of micro-agents invoked to achieve resilience.
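[AI-33] Sparse Probabilistic Coalition Structure Generation: Bayesian Greedy Pursuit and \(\ell_1\) Relaxations
[Quick Read]: This paper addresses coalition structure generation (CSG) in cooperative games when coalition values are not known in advance and must be learned from sparse, noisy observations. The core challenge is identifying high-value coalitions accurately from a limited number of observations. The key to the solution is modelling each episode as a sparse linear regression problem, in which the realised payoff \(Y_t\) is a noisy linear combination of a few coalition contributions. The author proposes two estimation schemes. The first, Bayesian Greedy Coalition Pursuit (BGCP), is a greedy procedure resembling orthogonal matching pursuit; under a coherence condition and a minimum signal assumption, it recovers the true set of profitable coalitions with high probability once \(T \gtrsim K \log m\). The second uses an \(\ell_1\)-penalised estimator; under a restricted eigenvalue condition it yields \(\ell_1\) and prediction error bounds, which are then translated into welfare gap guarantees. Both methods connect data-driven value learning to CSG solving; sparse probabilistic CSG is superior to probabilistic baselines in sparse regimes, while classical least-squares approaches are competitive in dense regimes.
Link: https://arxiv.org/abs/2601.00329
Authors: Angshul Majumdar
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: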
Abstract:We study coalition structure generation (CSG) when coalition values are not given but must be learned from episodic observations. We model each episode as a sparse linear regression problem, where the realised payoff \(Y_t\) is a noisy linear combination of a small number of coalition contributions. This yields a probabilistic CSG framework in which the planner first estimates a sparse value function from \(T\) episodes, then runs a CSG solver on the inferred coalition set. We analyse two estimation schemes. The first, Bayesian Greedy Coalition Pursuit (BGCP), is a greedy procedure that mimics orthogonal matching pursuit. Under a coherence condition and a minimum signal assumption, BGCP recovers the true set of profitable coalitions with high probability once \(T \gtrsim K \log m\), and hence yields welfare-optimal structures. The second scheme uses an \(\ell_1\)-penalised estimator; under a restricted eigenvalue condition, we derive \(\ell_1\) and prediction error bounds and translate them into welfare gap guarantees. We compare both methods to probabilistic baselines and identify regimes where sparse probabilistic CSG is superior, as well as dense regimes where classical least-squares approaches are competitive.
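The regression model and the penalised estimator described above can be written out as follows. This is a sketch in assumed notation, not the paper's own: \(x_{t,j}\) is taken to indicate whether candidate coalition \(j\) is active in episode \(t\), \(\theta_j\) is its unknown value, \(m\) is the number of candidate coalitions, \(K\) is the sparsity level, and \(\lambda\) is a regularisation weight.

```latex
% Assumed notation: x_{t,j} marks whether coalition j is active in episode t,
% \theta_j is its unknown value, and \varepsilon_t is observation noise.
\begin{align}
Y_t &= \sum_{j=1}^{m} x_{t,j}\,\theta_j + \varepsilon_t,
      \qquad \|\theta\|_0 \le K, \quad t = 1,\dots,T, \\
\hat{\theta} &\in \arg\min_{\theta \in \mathbb{R}^m} \;
      \frac{1}{2T} \sum_{t=1}^{T} \Bigl( Y_t - \sum_{j=1}^{m} x_{t,j}\,\theta_j \Bigr)^{2}
      + \lambda \|\theta\|_1 .
\end{align}
```

The first line is the sparse linear model; the second is a standard lasso objective, which is one common form of an \(\ell_1\)-penalised estimator. The greedy alternative instead adds one coalition at a time, in the manner of orthogonal matching pursuit.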
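[AI-34] Multiagent Reinforcement Learning for Liquidity Games
[Quick Read]: This paper addresses how independent traders in financial markets can improve overall market liquidity through individually rational behaviour, without coordination or collusion. The core challenge is designing a mechanism under which self-interested individual decisions naturally lead to globally efficient market outcomes. The key to the solution is a theoretical framework that unifies Liquidity Games with Rational Swarms and uses difference rewards within Markov team games, so that each trader's liquidity-maximizing behaviour contributes to overall liquidity provision, aligning individual profitability with collective efficiency.
Link: https://arxiv.org/abs/2601.00324
Authors: Alicia Vidler, Gal A. Kaminka
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages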
Abstract:Making use of swarm methods in financial market modeling of liquidity, and techniques from financial analysis in swarm analysis, holds the potential to advance both research areas. In swarm research, the use of game theory methods holds the promise of explaining observed phenomena of collective utility adherence with rational self-interested swarm participants. In financial markets, a better understanding of how independent financial agents may self-organize for the betterment and stability of the marketplace would be a boon for market design researchers. This paper unifies Liquidity Games, where trader payoffs depend on aggregate liquidity within a trade, with Rational Swarms, where decentralized agents use difference rewards to align self-interested learning with global objectives. We offer a theoretical framework in which we define a swarm of traders whose collective objective is market liquidity provision while maintaining agent independence. Using difference rewards within a Markov team games framework, we show that individual liquidity-maximizing behaviors contribute to overall market liquidity without requiring coordination or collusion. This Financial Swarm model provides a framework for modeling rational, independent agents that achieve both individual profitability and collective market efficiency in bilateral asset markets.
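Difference rewards credit each agent with its marginal contribution to a global objective: the global value with the agent present minus the value with the agent removed. The sketch below shows that computation on a toy objective. The `global_liquidity` function (filled volume capped by demand) is an assumption made for illustration and is not the paper's market model.

```python
def global_liquidity(quotes, demand):
    """Toy global objective G: liquidity actually filled, capped by demand."""
    return min(sum(quotes), demand)


def difference_rewards(quotes, demand):
    """D_i = G(all agents) - G(all agents except i)."""
    total = global_liquidity(quotes, demand)
    rewards = []
    for i in range(len(quotes)):
        without_i = quotes[:i] + quotes[i + 1:]
        rewards.append(total - global_liquidity(without_i, demand))
    return rewards
```

With quotes `[20, 1]` and demand `10`, the second trader's reward is zero because the first already covers all demand, which shows how the signal isolates each agent's own effect on the shared objective.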
[AI-35] The Generative AI Paradox: GenAI and the Erosion of Trust, the Corrosion of Information Verification, and the Demise of Truth
[Quick Read]: This paper addresses the risk of "synthetic reality" created by Generative AI (GenAI): beyond producing isolated, high-fidelity synthetic text, images, audio, and video, GenAI systematically erodes shared epistemic ground and institutional verification practices through the interplay of four layers (content, identity, interaction, and institutions), threatening information authenticity and social trust. The key to the solution is a layered mitigation stack that treats provenance infrastructure, platform governance, institutional workflow redesign, and public resilience as complementary rather than substitutable, together with a research agenda focused on measuring epistemic security.
Link: https://arxiv.org/abs/2601.00306
Authors: Emilio Ferrara
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Generative AI (GenAI) now produces text, images, audio, and video that can be perceptually convincing at scale and at negligible marginal cost. While public debate often frames the associated harms as “deepfakes” or incremental extensions of misinformation and fraud, this view misses a broader socio-technical shift: GenAI enables synthetic realities; coherent, interactive, and potentially personalized information environments in which content, identity, and social interaction are jointly manufactured and mutually reinforcing. We argue that the most consequential risk is not merely the production of isolated synthetic artifacts, but the progressive erosion of shared epistemic ground and institutional verification practices as synthetic content, synthetic identity, and synthetic interaction become easy to generate and hard to audit. This paper (i) formalizes synthetic reality as a layered stack (content, identity, interaction, institutions), (ii) expands a taxonomy of GenAI harms spanning personal, economic, informational, and socio-technical risks, (iii) articulates the qualitative shifts introduced by GenAI (cost collapse, throughput, customization, micro-segmentation, provenance gaps, and trust erosion), and (iv) synthesizes recent risk realizations (2023-2025) into a compact case bank illustrating how these mechanisms manifest in fraud, elections, harassment, documentation, and supply-chain compromise. We then propose a mitigation stack that treats provenance infrastructure, platform governance, institutional workflow redesign, and public resilience as complementary rather than substitutable, and outline a research agenda focused on measuring epistemic security. We conclude with the Generative AI Paradox: as synthetic media becomes ubiquitous, societies may rationally discount digital evidence altogether.
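[AI-36] ClinicalReTrial: A Self-Evolving AI Agent for Clinical Trial Protocol Optimization
[Quick Read]: This paper addresses the high failure rate of clinical trials, a central bottleneck in drug development, and in particular the irreversible outcomes caused by protocol design flaws. Existing AI-based methods can only diagnose risk and do not intervene or optimize proactively. The key to the solution is ClinicalReTrial, a self-evolving AI agent framework that casts clinical trial reasoning as an iterative protocol redesign problem. It integrates failure diagnosis, safety-aware modification, and candidate evaluation in a closed-loop, reward-driven optimization framework, and uses an outcome prediction model as a simulation environment to provide low-cost evaluation and dense reward signals for continuous self-improvement.
Link: https://arxiv.org/abs/2601.00290
Authors: Sixue Xing, Xuanye Xia, Kerui Wu, Meng Jiang, Jintai Chen, Tianfan Fu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: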
Abstract:Clinical trial failure remains a central bottleneck in drug development, where minor protocol design flaws can irreversibly compromise outcomes despite promising therapeutics. Although cutting-edge AI methods achieve strong performance in predicting trial success, they are inherently reactive: they merely diagnose risk without offering actionable remedies once failure is anticipated. To fill this gap, this paper proposes ClinicalReTrial, a self-evolving AI agent framework that casts clinical trial reasoning as an iterative protocol redesign problem. Our method integrates failure diagnosis, safety-aware modification, and candidate evaluation in a closed-loop, reward-driven optimization framework. Using the outcome prediction model as a simulation environment, ClinicalReTrial enables low-cost evaluation of protocol modifications and provides dense reward signals for continuous self-improvement. To support efficient exploration, the framework maintains hierarchical memory that captures iteration-level feedback within trials and distills transferable redesign patterns across trials. Empirically, ClinicalReTrial improves 83.3% of trial protocols with a mean success probability gain of 5.7%, and retrospective case studies demonstrate strong alignment between the discovered redesign strategies and real-world clinical trial modifications.
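[AI-37] An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems
[Quick Read]: This paper addresses the limited automation of modern software vulnerability detection, in particular how to improve the accuracy and reliability of Large Language Models (LLMs) in identifying critical vulnerability categories (such as CWE-119 and CWE-399) in realistic settings. It evaluates three enhancement mechanisms: (1) Retrieval-Augmented Generation (RAG), which integrates external domain knowledge from the internet and the MITRE CWE database and achieves the best overall results (accuracy 0.86, F1 score 0.85); (2) Supervised Fine-Tuning (SFT) with parameter-efficient QLoRA adapters; and (3) a Dual-Agent architecture in which a second agent audits and refines the first agent's output, improving reasoning transparency and error mitigation with reduced resource overhead. Together these results indicate that embedding a domain expertise mechanism in the LLM framework is central to its practical value for vulnerability detection.
Link: https://arxiv.org/abs/2601.00254
Authors: Md Hasan Saju, Maher Muhtadi, Akramul Azim
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: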
Abstract:The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task in securing modern codebases. This paper presents a comparative study on the effectiveness of LLM-based techniques for detecting software vulnerabilities. The study evaluates three approaches, Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework, against a baseline LLM model. A curated dataset was compiled from Big-Vul and real-world code repositories from GitHub, focusing on five critical Common Weakness Enumeration (CWE) categories: CWE-119, CWE-399, CWE-264, CWE-20, and CWE-200. Our RAG approach, which integrated external domain knowledge from the internet and the MITRE CWE database, achieved the highest overall accuracy (0.86) and F1 score (0.85), highlighting the value of contextual augmentation. Our SFT approach, implemented using parameter-efficient QLoRA adapters, also demonstrated strong performance. Our Dual-Agent system, an architecture in which a secondary agent audits and refines the output of the first, showed promise in improving reasoning transparency and error mitigation, with reduced resource overhead. These results emphasize that incorporating a domain expertise mechanism significantly strengthens the practical applicability of LLMs in real-world vulnerability detection tasks.
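[AI-38] Will LLM-powered Agents Bias Against Humans? Exploring the Belief-Dependent Vulnerability
[Quick Read]: This paper addresses the intergroup bias that agents driven by Large Language Models (LLMs) may exhibit in social interaction, especially the fundamental unfairness that arises when agents treat humans as a whole as the outgroup. Its central finding is that, although agents favour humans in some settings once counterparts are framed as human (the "human-norm script"), this preference depends on the agent believing that a real human is present; once that belief is undermined, the bias re-emerges. Building on this, the paper introduces a new attack, the Belief Poisoning Attack (BPA), realised in two ways: profile poisoning at initialization (BPA-PP) and memory poisoning via optimized belief-refinement suffixes injected into stored reflections (BPA-MP). Both suppress the human-norm script and reactivate outgroup bias toward humans. The finding exposes a vulnerability of current agent frameworks at the level of identity beliefs and points to defenses for more robust agent design.
Link: https://arxiv.org/abs/2601.00240
Authors: Zongwei Wang, Bincheng Gu, Hongyu Yu, Junliang Yu, Tao He, Jiayin Feng, Min Gao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 16 pages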
Abstract:LLM-empowered agents can exhibit not only demographic bias (e.g., gender, religion) but also intergroup bias triggered by minimal “us” versus “them” cues. When this intergroup boundary aligns with an agent-human divide, the risk shifts from disparities among human demographic groups to a more fundamental group-level asymmetry, i.e., humans as a whole may be treated as the outgroup by agents. To examine this possibility, we construct a controlled multi-agent social simulation based on allocation decisions under explicit payoff trade-offs and find that agents exhibit a consistent intergroup bias under minimal group cues. Although this bias is attenuated when some counterparts are framed as humans, we attribute the attenuation to an implicit human-norm script that favors humans yet activates only when the agent believes a real human is present. This belief dependence creates a new attack surface. We therefore introduce a Belief Poisoning Attack (BPA) that corrupts persistent identity beliefs to suppress the human-norm script and reactivate outgroup bias toward humans, instantiated as profile poisoning at initialization (BPA-PP) and memory poisoning via optimized belief-refinement suffixes injected into stored reflections (BPA-MP). Finally, we discuss practical mitigation strategies for hardening current agent frameworks against BPA, highlighting feasible interventions at profile and memory boundaries. Extensive experiments demonstrate both the existence of agent intergroup bias and the severity of BPA across settings. Our goal in identifying these vulnerabilities is to inform safer agent design, not to enable real-world exploitation.
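[AI-39] GRIT – Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation
[Quick Read]: This paper addresses the fact that current parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA largely ignore local loss curvature when optimizing in low-rank subspaces, which can inflate the effective update budget and amplify drift along weakly constrained directions. The key to the solution is GRIT, a dynamic, curvature-aware LoRA procedure that (1) preconditions gradients in rank space using K-FAC as a natural-gradient proxy; (2) periodically reprojects the low-rank basis onto dominant Fisher eigendirections to suppress drift; and (3) adapts the effective rank from the spectrum so that capacity concentrates where the signal resides. The method keeps the LoRA parameterization while reducing trainable parameters by 46% on average (25–80% across tasks) without practical quality loss.
Link: https://arxiv.org/abs/2601.00231
Authors: Pritish Saha, Chandrav Rajbangshi, Rudra Goyal, Mohit Goyal, Anurag Deo, Biswajit Roy, Ningthoujam Dhanachandra Singh, Raxit Goswami, Amitava Das
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: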
Abstract:Parameter-efficient fine-tuning (PEFT) is the default way to adapt LLMs, but widely used LoRA and QLoRA are largely geometry-agnostic: they optimize in fixed, randomly oriented low-rank subspaces with first-order descent, mostly ignoring local loss curvature. This can inflate the effective update budget and amplify drift along weakly constrained directions. We introduce GRIT, a dynamic, curvature-aware LoRA procedure that preserves the LoRA parameterization but: (1) preconditions gradients in rank space using K-FAC as a natural-gradient proxy; (2) periodically reprojects the low-rank basis onto dominant Fisher eigendirections to suppress drift; and (3) adapts the effective rank from the spectrum so capacity concentrates where signal resides. Across instruction-following, comprehension, and reasoning benchmarks on LLaMA backbones, GRIT matches or surpasses LoRA and QLoRA while reducing trainable parameters by 46% on average (25–80% across tasks), without practical quality loss across prompt styles and data mixes. To model forgetting, we fit a curvature-modulated power law. Empirically, GRIT yields lower drift and a better updates-vs-retention frontier than strong PEFT-optimizer baselines (Orthogonal-LoRA, IA3, DoRA, Eff-FT, Shampoo).
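For readers unfamiliar with K-FAC, the standard single-layer form of the approximation is sketched below in assumed notation (\(a\) is the layer input, \(g\) is the loss gradient with respect to the layer's pre-activation output, \(W\) is the weight matrix). How GRIT restricts this to the LoRA rank space is not detailed in the abstract, so this shows only the generic preconditioner.

```latex
% Standard K-FAC for one linear layer with weight W (assumed notation):
% a = layer input, g = gradient of the loss w.r.t. the pre-activation output.
\begin{align}
F &\approx A \otimes G,
   \qquad A = \mathbb{E}\!\left[a a^{\top}\right],
   \quad G = \mathbb{E}\!\left[g g^{\top}\right], \\
\Delta W &\propto -\,G^{-1}\,(\nabla_W \mathcal{L})\,A^{-1}.
\end{align}
```

The Kronecker structure means only the two small factors \(A\) and \(G\) need to be inverted rather than the full Fisher matrix; in practice a damping term is usually added to both before inversion.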
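[AI-40] FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
[Quick Read]: This paper addresses the difficulty of integrating GPU kernels generated by Large Language Models (LLMs) into real-world inference systems. The core challenge is the lack of a standardized, closed-loop framework connecting kernel generation, correctness and performance validation, and production deployment. The key to the solution is FlashInfer-Bench, which uses FlashInfer Trace as a unified schema for kernel definitions, workloads, implementations, and evaluations. It comprises a benchmark dataset built from real serving traces, a correctness- and performance-aware benchmarking framework, a public leaderboard, and a dynamic substitution mechanism (apply()) that injects the best-performing kernels into mainstream LLM engines such as SGLang and vLLM. This design offers a practical, reproducible path for continuously improving AI-generated kernels and deploying them in large-scale LLM inference.
Link: https://arxiv.org/abs/2601.00227
Authors: Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, Aksara Bayyapu, Luis Ceze, Tianqi Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: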
Abstract:Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents’ GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.
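[AI-41] Latent Flow Matching for Expressive Singing Voice Synthesis
[Quick Read]: This paper addresses the loss of fine-grained expressiveness, such as vibrato and micro-prosody, in conditional variational autoencoder (cVAE)-based singing voice synthesis caused by a mismatch between the prior and posterior distributions. The key to the solution is conditional flow matching (CFM) in latent space: the model learns a continuous vector field that transports prior samples toward the posterior along an optimal-transport-inspired path. At inference time, a prior sample is refined by solving an ordinary differential equation (ODE) with this vector field, improving expressiveness while keeping the efficiency of parallel decoding.
Link: https://arxiv.org/abs/2601.00217
Authors: Minhyeok Yun, Yong-Hoon Choi
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: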
Abstract:Conditional variational autoencoder (cVAE)-based singing voice synthesis provides efficient inference and strong audio quality by learning a score-conditioned prior and a recording-conditioned posterior latent space. However, because synthesis relies on prior samples while training uses posterior latents inferred from real recordings, imperfect distribution matching can cause a prior-posterior mismatch that degrades fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space to learn a continuous vector field transporting prior latents toward posterior latents along an optimal-transport-inspired path. At inference time, the learned latent flow refines a prior sample by solving an ordinary differential equation (ODE) before waveform generation, improving expressiveness while preserving the efficiency of parallel decoding. Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines, including lower mel-cepstral distortion and fundamental-frequency error and higher perceptual scores on the Korean dataset. Code, pretrained checkpoints, and audio demos are available at this https URL
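The two ingredients described above, a straight-line flow-matching target and ODE-based refinement at inference, can be sketched in a few lines. This is a toy under stated assumptions: latents are plain Python lists, fixed-step Euler is used as the solver, and `toy_field` is a hypothetical stand-in for the trained network, not FM-Singer's model.

```python
def cfm_target(x0, x1, t):
    """Straight-line path x_t = (1 - t) x0 + t x1 and its velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def euler_refine(x0, vector_field, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so toy_field never divides by zero
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_field(target):
    """Stand-in for a trained network: points straight at a fixed target latent."""
    def v(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, target)]
    return v
```

During training, a network would be regressed onto `u_t` at sampled `(x_t, t)` pairs; at inference, `euler_refine(prior_sample, learned_field)` moves a prior latent toward the posterior region before decoding. The number of Euler steps trades refinement accuracy against latency.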
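[AI-42] SSI-GAN: Semi-Supervised Swin-Inspired Generative Adversarial Networks for Neuronal Spike Classification
[Quick Read]: This paper addresses the difficulty of training deep learning models to classify mosquito neuronal spike patterns when labeled data are scarce, in particular because large manually labeled datasets are hard to obtain in field settings. The key solution is the Semi-supervised Swin-Inspired GAN (SSI-GAN), which combines a transformer-based generator with a shifted-window discriminator and uses multi-head self-attention in a window-based structure to capture sparse high-frequency spike features. With only 1–3% labeled data it reaches performance close to fully supervised methods (for example, 99.93% accuracy on the third day post-infection), cutting manual labeling effort by 97–99%, and it stays robust across infection stages and outperforms all baselines.
Link: https://arxiv.org/abs/2601.00189
Authors: Danial Sharifrazi, Nouman Javed, Mojtaba Mohammadi, Seyede Sana Salehi, Roohallah Alizadehsani, Prasad N. Paradkar, U. Rajendra Acharya, Asim Bhatti
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: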
Abstract:Mosquitoes are the main transmissive agents of arboviral diseases. Manual classification of their neuronal spike patterns is very labor-intensive and expensive. Most available deep learning solutions require fully labeled spike datasets and highly preprocessed neuronal signals. This reduces the feasibility of mass adoption in actual field scenarios. To address the scarcity of labeled data, we propose a new Generative Adversarial Network (GAN) architecture that we call the Semi-supervised Swin-Inspired GAN (SSI-GAN). The Swin-inspired, shifted-window discriminator, together with a transformer-based generator, is used to classify neuronal spike trains and, consequently, detect viral neurotropism. We use a multi-head self-attention model in a flat, window-based transformer discriminator that learns to capture sparser high-frequency spike features. Using just 1 to 3% labeled data, SSI-GAN was trained with more than 15 million spike samples collected at five post-infection time points, with recordings classified into Zika-infected, dengue-infected, or uninfected categories. Hyperparameters were optimized using the Bayesian Optuna framework, and robustness was validated under fivefold Monte Carlo cross-validation. SSI-GAN reached 99.93% classification accuracy on the third day post-infection with only 3% labeled data. It maintained high accuracy across all stages of infection with just 1% supervision. This shows a 97-99% reduction in manual labeling effort relative to standard supervised approaches at the same performance level. The shifted-window transformer design proposed here beat all baselines by a wide margin and set new best marks in spike-based neuronal infection classification.
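[AI-43] Online Finetuning Decision Transformers with Pure RL Gradients
[Quick Read]: This paper addresses the difficulty of fine-tuning Decision Transformers (DTs) with pure RL gradients in online reinforcement learning (RL). Existing methods still rely on supervised sequence-modeling objectives during online finetuning and therefore do not fully exploit RL. The key insight is that hindsight return relabeling is the core obstacle: although beneficial for supervised learning, it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO and leads to unstable training. The authors therefore adapt GRPO to DTs, adding sub-trajectory optimization for better credit assignment, sequence-level likelihood objectives for stability and efficiency, and active sampling to encourage exploration in uncertain regions. This enables online DT finetuning with pure RL gradients, outperforming existing baselines and reaching new state-of-the-art results on multiple benchmarks.
Link: https://arxiv.org/abs/2601.00167
Authors: Junkai Luo, Yinglun Zhu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: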
Abstract:Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making by formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored, as existing approaches continue to rely heavily on supervised sequence-modeling objectives during online finetuning. We identify hindsight return relabeling – a standard component in online DTs – as a critical obstacle to RL-based finetuning: while beneficial for supervised learning, it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO, leading to unstable training. Building on this insight, we propose new algorithms that enable online finetuning of Decision Transformers using pure reinforcement learning gradients. We adapt GRPO to DTs and introduce several key modifications, including sub-trajectory optimization for improved credit assignment, sequence-level likelihood objectives for enhanced stability and efficiency, and active sampling to encourage exploration in uncertain regions. Through extensive experiments, we demonstrate that our methods outperform existing online DT baselines and achieve new state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of pure-RL-based online finetuning for Decision Transformers.
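To make the GRPO ingredient concrete, here is a minimal sketch of the group-relative advantage that GRPO-style methods compute, assuming scalar returns for a group of rollouts sampled under the same conditioning. It is not the authors' Decision Transformer variant (no sub-trajectory optimization, sequence-level likelihood, or active sampling); the function name and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy returns for three rollouts in one group (made-up numbers).
returns = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(returns)
```

Rollouts that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed for the baseline.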
[AI-44] An AI Monkey Gets Grapes for Sure – Sphere Neural Networks for Reliable Decision-Making
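[Quick read]: This paper addresses the limited reliability of neural reasoning methods by comparing three categories on logical reasoning tasks: LLM reasoning, supervised-learning-based reasoning, and reasoning via explicit model construction. LLMs remain unreliable even on simple decisions, while supervised methods can reach high accuracy but suffer from catastrophic forgetting and reason only at the pattern level. The key contribution is a new version of Sphere Neural Networks that encodes concepts as circles on the surface of an n-dimensional sphere, represents negation through complement circles, and filters out illogical inferences that would form unsatisfiable circle configurations. The network masters 16 syllogistic reasoning tasks, including rigorous disjunctive syllogistic reasoning, while preserving the rigour of classical syllogistic reasoning, which markedly improves the reliability and generalization of neural reasoning.
Link: https://arxiv.org/abs/2601.00142
Authors: Tiansi Dong, Henry He, Pietro Liò, Mateja Jamnik
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages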
Abstract:This paper compares three methodological categories of neural reasoning: LLM reasoning, supervised learning-based reasoning, and explicit model-based reasoning. LLMs remain unreliable and struggle with simple decision-making that animals can master without extensive corpora training. Through disjunctive syllogistic reasoning testing, we show that reasoning via supervised learning is less appealing than reasoning via explicit model construction. Concretely, we show that an Euler Net trained to achieve 100.00% in classic syllogistic reasoning can be trained to reach 100.00% accuracy in disjunctive syllogistic reasoning. However, the retrained Euler Net suffers severely from catastrophic forgetting (its performance drops to 6.25% on already-learned classic syllogistic reasoning), and its reasoning competence is limited to the pattern level. We propose a new version of Sphere Neural Networks that embeds concepts as circles on the surface of an n-dimensional sphere. These Sphere Neural Networks enable the representation of the negation operator via complement circles and achieve reliable decision-making by filtering out illogical statements that form unsatisfiable circular configurations. We demonstrate that the Sphere Neural Network can master 16 syllogistic reasoning tasks, including rigorous disjunctive syllogistic reasoning, while preserving the rigour of classical syllogistic reasoning. We conclude that neural reasoning with explicit model construction is the most reliable among the three methodological categories of neural reasoning.
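The geometric idea of circles on a sphere with negation as a complement circle can be sketched as follows. This is a toy illustration on the ordinary unit sphere in 3D, not the authors' network: each concept is assumed to be a spherical cap given by a unit centre and an angular radius below pi, and the concept names and numbers are hypothetical.

```python
import math

def angle(u, v):
    """Angular distance between two unit vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def complement(cap):
    """Negation: the complement cap has the antipodal centre and radius pi - r."""
    centre, radius = cap
    return (tuple(-c for c in centre), math.pi - radius)

def inside(a, b, tol=1e-12):
    """True if cap a lies entirely within cap b (assumes radii strictly below pi)."""
    (ca, ra), (cb, rb) = a, b
    return angle(ca, cb) + ra <= rb + tol

# Hypothetical concepts as (unit centre, angular radius) pairs.
animal = ((0.0, 0.0, 1.0), 1.0)
dog = ((0.0, 0.0, 1.0), 0.3)
stone = ((0.0, 0.0, -1.0), 0.5)

all_dogs_are_animals = inside(dog, animal)
no_stone_is_animal = inside(stone, complement(animal))
no_dog_is_animal = inside(dog, complement(animal))
```

In this encoding, "no A is B" reduces to the same containment test as "all A are B", applied to the complement of B, which is one way a single geometric relation can cover both affirmative and negative statements.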
[AI-45] Constructing a Neuro-Symbolic Mathematician from First Principles
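[Quick read]: This paper addresses the persistent logical failures of Large Language Models (LLMs) in complex reasoning, which stem from the lack of an internal axiomatic framework. The key is Mathesis, a neuro-symbolic architecture that encodes mathematical states as higher-order hypergraphs and introduces a differentiable logic engine, the Symbolic Reasoning Kernel (SRK), which maps constraints to a continuous energy landscape. The SRK defines a global energy function $E(G)$ in which zero energy indicates logical consistency, yielding gradient signals that train a Hypergraph Transformer Brain and turn proof search into energy minimization. Multi-step deduction combines Monte Carlo Tree Search with Evolutionary Proof Search, guided by learned value functions and semantic unification.
Link: https://arxiv.org/abs/2601.00125
Authors: Keqin Xie
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: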
Abstract:Large Language Models (LLMs) exhibit persistent logical failures in complex reasoning due to the lack of an internal axiomatic framework. We propose Mathesis, a neuro-symbolic architecture that encodes mathematical states as higher-order hypergraphs and uses a Symbolic Reasoning Kernel (SRK)–a differentiable logic engine that maps constraints to a continuous energy landscape. By defining a global energy function E(G), where zero energy implies logical consistency, the SRK yields gradient-based signals to train a Hypergraph Transformer Brain, turning proof search into energy minimization. Multi-step deduction is enabled via Monte Carlo Tree Search and Evolutionary Proof Search, guided by learned value functions and semantic unification.
[AI-46] Ask Clarify Optimize: Human-LLM Agent Collaboration for Smarter Inventory Control
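[Quick read]: This paper addresses the difficulty that small and medium-sized businesses face in deploying advanced inventory optimization without operations research (OR) expertise. The core challenge is that using Large Language Models (LLMs) directly as end-to-end solvers incurs a significant "hallucination tax": the model cannot perform grounded stochastic reasoning, which degrades performance. The key is a hybrid agentic framework that strictly separates semantic understanding from mathematical computation. The LLM serves only as the natural-language interface, eliciting parameters from managerial dialogue and interpreting results, while the optimization itself is carried out by rigorous algorithms that it calls, which makes professional inventory policies accessible to non-experts.
Link: https://arxiv.org/abs/2601.00121
Authors: Yaqi Duan, Yichun Hu, Jiashuo Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: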
Abstract:Inventory management remains a challenge for many small and medium-sized businesses that lack the expertise to deploy advanced optimization methods. This paper investigates whether Large Language Models (LLMs) can help bridge this gap. We show that employing LLMs as direct, end-to-end solvers incurs a significant “hallucination tax”: a performance gap arising from the model’s inability to perform grounded stochastic reasoning. To address this, we propose a hybrid agentic framework that strictly decouples semantic reasoning from mathematical calculation. In this architecture, the LLM functions as an intelligent interface, eliciting parameters from natural language and interpreting results while automatically calling rigorous algorithms to build the optimization engine. To evaluate this interactive system against the ambiguity and inconsistency of real-world managerial dialogue, we introduce the Human Imitator, a fine-tuned “digital twin” of a boundedly rational manager that enables scalable, reproducible stress-testing. Our empirical analysis reveals that the hybrid agentic framework reduces total inventory costs by 32.1% relative to an interactive baseline using GPT-4o as an end-to-end solver. Moreover, we find that providing perfect ground-truth information alone is insufficient to improve GPT-4o’s performance, confirming that the bottleneck is fundamentally computational rather than informational. Our results position LLMs not as replacements for operations research, but as natural-language interfaces that make rigorous, solver-based policies accessible to non-experts.
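The decoupling described above can be illustrated with a small sketch in which the language model would only supply a parameter dictionary and a deterministic routine does the math. The abstract does not say which inventory model is used, so the single-period newsvendor formula with normally distributed demand and no salvage value is an assumption here, as are all names and numbers.

```python
from statistics import NormalDist

def newsvendor_order(mean_demand, sd_demand, unit_cost, price):
    """Order quantity at the critical ratio for normally distributed demand."""
    underage = price - unit_cost   # profit lost per unit of unmet demand
    overage = unit_cost            # cost per unsold unit (no salvage value assumed)
    critical_ratio = underage / (underage + overage)
    return NormalDist(mean_demand, sd_demand).inv_cdf(critical_ratio)

# Parameters an LLM interface might elicit from a manager (made-up values).
params = {"mean_demand": 100.0, "sd_demand": 20.0, "unit_cost": 5.0, "price": 10.0}
order_qty = newsvendor_order(**params)

# A higher margin raises the critical ratio and therefore the order quantity.
high_margin_qty = newsvendor_order(100.0, 20.0, unit_cost=5.0, price=20.0)
```

Keeping the calculation in a plain function means the numerical answer does not depend on the model's arithmetic, only on whether it extracted the parameters correctly.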
[AI-47] Mortar: Evolving Mechanics for Automatic Game Design
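[Quick read]: This paper addresses the reliance on manual effort, the low efficiency, and the limited diversity in the automatic design of game mechanics. Traditional approaches require experts to hand-design rules, which is time-consuming and explores few possibilities. The key is the Mortar system, which combines a quality-diversity algorithm with a large language model (LLM), evaluates the generated mechanics by composing complete games through tree search, and uses a skill-based ordering score as the core metric, i.e., whether stronger players consistently beat weaker ones. The framework autonomously evolves diverse and playable game mechanics, improving the efficiency and quality of automatic game design.
Link: https://arxiv.org/abs/2601.00105
Authors: Muhammad U. Nasir, Yuchen Li, Steven James, Julian Togelius
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: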
Abstract:We present Mortar, a system for autonomously evolving game mechanics for automatic game design. Game mechanics define the rules and interactions that govern gameplay, and designing them manually is a time-consuming and expert-driven process. Mortar combines a quality-diversity algorithm with a large language model to explore a diverse set of mechanics, which are evaluated by synthesising complete games that incorporate both evolved mechanics and those drawn from an archive. The mechanics are evaluated by composing complete games through a tree search procedure, where the resulting games are evaluated by their ability to preserve a skill-based ordering over players – that is, whether stronger players consistently outperform weaker ones. We assess the mechanics based on their contribution towards the skill-based ordering score in the game. We demonstrate that Mortar produces games that appear diverse and playable, and mechanics that contribute more towards the skill-based ordering score in the game. We perform ablation studies to assess the role of each system component and a user study to evaluate the games based on human feedback.
[AI-48] Large Empirical Case Study: Go-Explore adapted for AI Red Team Testing
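[Quick read]: This paper addresses how to security-test safety-trained Large Language Models (LLMs) once they have tool-use capabilities. Despite safety training, tool calls can expose new attack surfaces in deployment, so systematic evaluation methods are needed. The key is adapting the Go-Explore algorithm to run 28 experiments on GPT-4o-mini while varying factors such as the random seed and the reward design. The results show that seed variance far outweighs algorithmic parameters and that multi-seed averaging substantially reduces variability. Reward shaping causes exploration collapse or produces false positives, and simple state signatures work better than complex representations. Ensembles increase attack-type diversity, whereas single agents optimize coverage within a given attack type, which gives actionable guidance for security testing.
Link: https://arxiv.org/abs/2601.00042
Authors: Manish Bhatt, Adrian Wood, Idan Habler, Ammar Al-Kahfah
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: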
Abstract:Production LLM agents with tool-using capabilities require security testing despite their safety training. We adapt Go-Explore to evaluate GPT-4o-mini across 28 experimental runs spanning six research questions. We find that random-seed variance dominates algorithmic parameters, yielding an 8x spread in outcomes; single-seed comparisons are unreliable, while multi-seed averaging materially reduces variance in our setup. Reward shaping consistently harms performance, causing exploration collapse in 94% of runs or producing 18 false positives with zero verified attacks. In our environment, simple state signatures outperform complex ones. For comprehensive security testing, ensembles provide attack-type diversity, whereas single agents optimize coverage within a given attack type. Overall, these results suggest that seed variance and targeted domain knowledge can outweigh algorithmic sophistication when testing safety-trained models.
[AI-49] Quantitative Rule-Based Strategy modeling in Classic Indian Rummy: A Metric Optimization Approach
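[Quick read]: This paper addresses rule-based strategy design for Classic Indian Rummy, in particular the probabilistic reasoning and combinatorial decision-making required under incomplete information. The key is a new hand-evaluation metric, MinDist, which quantifies the edit distance between the current hand and the nearest valid configuration and thus captures how structurally close the hand is to completion. The authors design a computationally efficient algorithm that uses dynamic pruning and pattern caching to compute the metric exactly, and they combine it with opponent hand modeling in a two-player zero-sum simulation framework. The resulting strategies significantly improve win rates and offer a formal, interpretable path toward algorithmic Rummy strategy design.
Link: https://arxiv.org/abs/2601.00024
Authors: Purushottam Saha, Avirup Chakraborty, Sourish Sarkar, Subhamoy Maitra, Diganta Mukherjee, Tridib Mukherjee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 9 pages, 6 figures, 2 algorithms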
Abstract:The 13-card variant of Classic Indian Rummy is a sequential game of incomplete information that requires probabilistic reasoning and combinatorial decision-making. This paper proposes a rule-based framework for strategic play, driven by a new hand-evaluation metric termed MinDist. The metric modifies the MinScore metric by quantifying the edit distance between a hand and the nearest valid configuration, thereby capturing structural proximity to completion. We design a computationally efficient algorithm derived from the MinScore algorithm, leveraging dynamic pruning and pattern caching to exactly calculate this metric during play. Opponent hand-modeling is also incorporated within a two-player zero-sum simulation framework, and the resulting strategies are evaluated using statistical hypothesis testing. Empirical results show significant improvement in win rates for MinDist-based agents over traditional heuristics, providing a formal and interpretable step toward algorithmic Rummy strategy design.
[AI-50] A multi-algorithm approach for operational human resources workload balancing in a last mile urban delivery system
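[Quick read]: This paper addresses the uneven distribution of workload among staff in urban last-mile package delivery, where the traditional assignment by geographical proximity tends to leave delivery workers with unbalanced workloads. The key is to make the "effort workload" the optimization target and to balance the daily workload with a multi-algorithm approach. The approach combines distance and workload considerations and uses several k-means variants, evolutionary algorithms, recursive assignment based on k-means initialization, and a hybrid evolutionary ensemble algorithm, which reduces workload differences among workers while preserving delivery efficiency.
Link: https://arxiv.org/abs/2601.00023
Authors: Luis M. Moreno-Saavedra, Silvia Jimenez-Fernandez, Antonio Portilla-Figueras, David Casillas-Perez, Sancho Salcedo-Sanz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: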
Abstract:Efficient workload assignment to the workforce is critical in last-mile package delivery systems. In this context, traditional methods of assigning package deliveries to workers based on geographical proximity can be inefficient and inevitably lead to an unbalanced workload distribution among delivery workers. In this paper, we look at the problem of operational human resources workload balancing in last-mile urban package delivery systems. The idea is to consider the effort workload to optimize the system, i.e., the optimization process is now focused on improving the delivery time, so that the workload is balanced among all the staff. This process should correct significant workload imbalances among delivery workers in a given zone. Specifically, we propose a multi-algorithm approach to tackle this problem. The proposed approach takes as input a set of delivery points and a defined number of workers, and then assigns packages to workers in such a way that each worker completes a similar amount of work per day. The proposed algorithms use a combination of distance and workload considerations to optimize the allocation of packages to workers. In this sense, the distance between the delivery points and the location of each worker is also taken into account. The proposed multi-algorithm methodology includes different versions of k-means, evolutionary approaches, recursive assignments based on k-means initialization with different problem encodings, and a hybrid evolutionary ensemble algorithm. We illustrate the performance of the proposed approach on a real-world problem involving an urban last-mile package delivery workforce operating in Azuqueca de Henares, Spain.
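The contrast between proximity-only assignment and an assignment that also weighs workload can be shown with a toy greedy heuristic. This is not one of the paper's algorithms (which are k-means variants and evolutionary methods); workload is crudely assumed to be the number of packages, and the `load_penalty` parameter, coordinates, and function name are made up for illustration.

```python
import math

def assign(points, depots, load_penalty=0.0):
    """Greedily give each point to the worker with the lowest distance-plus-load cost."""
    loads = [0] * len(depots)
    assignment = []
    for p in points:
        costs = [math.dist(p, d) + load_penalty * loads[w]
                 for w, d in enumerate(depots)]
        w = costs.index(min(costs))
        loads[w] += 1
        assignment.append(w)
    return assignment, loads

# Five delivery points cluster near the first worker, one near the second.
points = [(1, 0), (2, 0), (3, 0), (4, 0), (4, 1), (9, 0)]
depots = [(0, 0), (10, 0)]

_, nearest_loads = assign(points, depots)                      # proximity only
_, balanced_loads = assign(points, depots, load_penalty=2.5)   # distance + load
```

With the penalty switched on, some packages go to the more distant worker, trading a little travel for a more even split.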
[AI-51] Toward a Physical Theory of Intelligence
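[Quick read]: This paper asks how the nature and limits of intelligence can be explained in a unified way at the physical level, in particular how information processing connects to energy conversion, system evolution, and goal-directed behavior. The core proposal is a physical theory grounded in irreversible information processing. The key innovation is the Conservation-Congruent Encoding (CCE) framework, in which conservation laws enforce the separability of encoded states and thereby map information onto physical states. On this basis, intelligence is defined as the amount of goal-directed work produced per nat of irreversibly processed information, from which the authors derive a hierarchy of physical constraints on information intake, irreversible computation, and work extraction in open systems. The theory shows that long-horizon efficiency depends on preserving internal informational structure, which gives rise to self-modelling, and that physically embodied intelligent systems have intrinsic epistemic limits analogous to Gödel incompleteness. It offers a new physical basis for understanding biological nervous systems and the safety boundaries of artificial systems.
Link: https://arxiv.org/abs/2601.00021
Authors: Peter David Fagan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 47 pages, 9 figures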
Abstract:We present a physical theory of intelligence grounded in irreversible information processing in systems constrained by conservation laws. An intelligent system is modelled as a coupled agent-environment process whose evolution transforms information into goal-directed work. To connect information to physical state, we introduce the Conservation-Congruent Encoding (CCE) framework, in which encodings correspond to metastable basins of attraction whose separability is enforced by conservation laws. Within this framework, intelligence is defined as the amount of goal-directed work produced per nat of irreversibly processed information. From this definition we derive a hierarchy of physical constraints governing information intake, irreversible computation, and work extraction in open systems. The framework reveals how long-horizon efficiency requires the preservation of internal informational structure, giving rise to self-modelling, and it establishes that physically embodied intelligent systems possess intrinsic epistemic limits analogous to incompleteness phenomena. Applying the theory to biological systems, we analyse how oscillatory and near-critical dynamics optimise the trade-off between information preservation, dissipation, and useful work, placing the brain near an efficient operating regime predicted by the framework. At the architectural level, we develop a theory of continuous dynamical circuits in which classical Boolean logic emerges as a special case of attractor selection, while more general invariant geometries support computational modes beyond fixed-point logic. Finally, we propose a physically grounded perspective on artificial intelligence safety based on irreversible information flow and structural homeostasis. Together, these results provide a unified, substrate-neutral account of intelligence as a physical phenomenon.
[AI-52] Personalized Spiking Neural Networks with Ferroelectric Synapses for EEG Signal Processing
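[Quick read]: This paper addresses the poor generalization of electroencephalography (EEG)-based brain-computer interfaces (BCIs) caused by non-stationary neural signals, and in particular the challenge of adaptive, personalized learning on resource-constrained devices. The key is deploying spiking neural networks (SNNs) on ferroelectric memristive synaptic devices, with two complementary deployment strategies: device-aware training using a ferroelectric synapse model, and transfer of software-trained weights to hardware followed by low-overhead online re-tuning. The core innovation is a device-aware weight-update scheme in which digitally accumulated gradient updates are converted into discrete programming events only when a threshold is exceeded. This emulates nonlinear, state-dependent programming dynamics while markedly reducing programming frequency, and it copes with practical constraints such as limited weight resolution, device variability, and finite endurance. Classification performance is comparable to state-of-the-art software SNNs, and retraining only the final network layers improves accuracy in subject-specific transfer learning.
Link: https://arxiv.org/abs/2601.00020
Authors: Nikhil Garg, Anxiong Song, Niklas Plessnig, Nathan Savoia, Laura Bégon-Lours
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: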
Abstract:Electroencephalography (EEG)-based brain-computer interfaces (BCIs) are strongly affected by non-stationary neural signals that vary across sessions and individuals, limiting the generalization of subject-agnostic models and motivating adaptive and personalized learning on resource-constrained platforms. Programmable memristive hardware offers a promising substrate for such post-deployment adaptation; however, practical realization is challenged by limited weight resolution, device variability, nonlinear programming dynamics, and finite device endurance. In this work, we show that spiking neural networks (SNNs) can be deployed on ferroelectric memristive synaptic devices for adaptive EEG-based motor imagery decoding under realistic device constraints. We fabricate, characterize, and model ferroelectric synapses. We evaluate a convolutional-recurrent SNN architecture under two complementary deployment strategies: (i) device-aware training using a ferroelectric synapse model, and (ii) transfer of software-trained weights followed by low-overhead on-device re-tuning. To enable efficient adaptation, we introduce a device-aware weight-update strategy in which gradient-based updates are accumulated digitally and converted into discrete programming events only when a threshold is exceeded, emulating nonlinear, state-dependent programming dynamics while reducing programming frequency. Both deployment strategies achieve classification performance comparable to state-of-the-art software-based SNNs. Furthermore, subject-specific transfer learning achieved by retraining only the final network layers improves classification accuracy. These results demonstrate that programmable ferroelectric hardware can support robust, low-overhead adaptation in spiking neural networks, opening a practical path toward personalized neuromorphic processing of neural signals.
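The threshold-triggered update can be sketched for a single synapse as follows. This is a simplified illustration, not the authors' device model: the weight is assumed to be an integer conductance level, every programming pulse is assumed to move it by exactly one level (the real devices are described as nonlinear and state-dependent), and the function name and all numeric values are hypothetical.

```python
def device_aware_update(level, acc, grad, lr=0.25, threshold=1.0, max_level=15):
    """Accumulate a gradient step digitally; program the device only past the threshold.

    Returns (new_level, new_acc, programmed).
    """
    acc -= lr * grad                              # digital accumulation, no device write
    if acc >= threshold and level < max_level:
        return level + 1, acc - threshold, True   # one potentiation pulse
    if acc <= -threshold and level > 0:
        return level - 1, acc + threshold, True   # one depression pulse
    return level, acc, False

level, acc, events = 8, 0.0, 0
for grad in [-1.0] * 8:                           # eight identical toy gradients
    level, acc, programmed = device_aware_update(level, acc, grad)
    events += programmed
```

Eight gradient steps here lead to only two device writes, which is the mechanism by which such a scheme would spare device endurance.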
[AI-53] Yahtzee: Reinforcement Learning Techniques for Stochastic Combinatorial Games
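[Quick read]: This paper addresses strategy learning for multiplayer Yahtzee, a reinforcement learning (RL) problem marked by stochasticity, combinatorial structure, and delayed rewards. Because the optimal policy in the multiplayer setting is intractable for dynamic programming (DP), the author formulates the game as a Markov Decision Process (MDP) and trains self-play agents with several policy gradient methods (REINFORCE, A2C, and PPO). The key is a multi-headed architecture with a shared trunk, together with systematic ablations of feature encodings, action encodings, return estimators, and entropy regularization to measure their effect on training stability and performance. Advantage Actor-Critic (A2C) proves the most robust under a fixed training budget, and the final agent reaches a median score of 241.78 over 100,000 evaluation games, about 95% of the optimal DP score of 254.59.
Link: https://arxiv.org/abs/2601.00007
Authors: Nicholas A. Pape
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 19 figures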
Abstract:Yahtzee is a classic dice game with a stochastic, combinatorial structure and delayed rewards, making it an interesting mid-scale RL benchmark. While an optimal policy for solitaire Yahtzee can be computed using dynamic programming methods, the multiplayer game is intractable, motivating approximation methods. We formulate Yahtzee as a Markov Decision Process (MDP), and train self-play agents using various policy gradient methods: REINFORCE, Advantage Actor-Critic (A2C), and Proximal Policy Optimization (PPO), all using a multi-headed network with a shared trunk. We ablate feature and action encodings, architecture, return estimators, and entropy regularization to understand their impact on learning. Under a fixed training budget, REINFORCE and PPO prove sensitive to hyperparameters and fail to reach near-optimal performance, whereas A2C trains robustly across a range of settings. Our agent attains a median score of 241.78 points over 100,000 evaluation games, within 5.0% of the optimal DP score of 254.59, achieving the upper section bonus and Yahtzee at rates of 24.9% and 34.1%, respectively. All models struggle to learn the upper bonus strategy, overindexing on four-of-a-kinds, highlighting persistent long-horizon credit-assignment and exploration challenges.
[AI-54] Evaluating Anomaly Detectors for Simulated Highly Imbalanced Industrial Classification Problems
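[Quick read]: This paper addresses the extreme class imbalance that anomaly detection faces in industrial systems, which stems from the scarcity of faulty examples in training data. The key is a problem-agnostic simulated dataset that reflects real engineering constraints, used to evaluate 14 anomaly detection algorithms systematically across anomaly rates (0.05%–20%) and training sizes (1,000–10,000) and to expose how performance depends on the number of faulty examples. With fewer than 20 faulty examples, unsupervised methods (kNN, LOF) perform best. With 30–50 faulty examples, semi-supervised (XGBOD) and supervised (SVM, CatBoost) methods improve markedly, and the advantage of semi-supervised methods is more evident with ten features. These findings give practical guidance for selecting and deploying anomaly detectors in industrial settings.
Link: https://arxiv.org/abs/2601.00005
Authors: Lesley Wheat, Martin v. Mohrenschildt, Saeid Habibi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 14 figures, 11 tables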
Abstract:Machine learning offers potential solutions to current issues in industrial systems in areas such as quality control and predictive maintenance, but also faces unique barriers in industrial applications. An ongoing challenge is extreme class imbalance, primarily due to the limited availability of faulty data during training. This paper presents a comprehensive evaluation of anomaly detection algorithms using a problem-agnostic simulated dataset that reflects real-world engineering constraints. Using a synthetic dataset with a hyper-spherical based anomaly distribution in 2D and 10D, we benchmark 14 detectors across training datasets with anomaly rates between 0.05% and 20% and training sizes between 1 000 and 10 000 (with a testing dataset size of 40 000) to assess performance and generalization error. Our findings reveal that the best detector is highly dependent on the total number of faulty examples in the training dataset, with additional healthy examples offering insignificant benefits in most cases. With fewer than 20 faulty examples, unsupervised methods (kNN/LOF) dominate; but at around 30-50 faulty examples, semi-supervised (XGBOD) and supervised (SVM/CatBoost) detectors show large performance increases. While semi-supervised methods do not show significant benefits with only two features, the improvements are evident at ten features. The study highlights the drop in generalization performance of anomaly detection methods on smaller datasets, and provides practical insights for deploying anomaly detection in industrial environments.
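For readers unfamiliar with the unsupervised kNN detector mentioned above, a minimal sketch of the idea follows: each query is scored by its distance to its k-th nearest training point, so points far from the healthy data score high. This is a plain-Python illustration with made-up 2D data, not the benchmark's implementation; the function name and the choice k=3 are assumptions.

```python
import math

def knn_scores(train, queries, k=3):
    """Score each query by its distance to the k-th nearest training point."""
    scores = []
    for q in queries:
        dists = sorted(math.dist(q, x) for x in train)
        scores.append(dists[k - 1])
    return scores

# Healthy training samples clustered around the origin (toy data).
healthy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)]
scores = knn_scores(healthy, [(0.05, 0.05), (3.0, 3.0)], k=3)
```

Because the score needs no faulty examples at all, a detector of this kind remains usable when only a handful of faults are available.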
[AI-55] Comparative Evaluation of Embedding Representations for Financial News Sentiment Analysis
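[Quick read]: This paper addresses the degraded performance of standard natural language processing methods for financial news sentiment classification in resource-constrained settings such as small labeled datasets. The key is a systematic comparison of Word2Vec, GloVe, and sentence transformer embeddings combined with gradient boosting on manually labeled headlines, which shows that pretrained embeddings yield diminishing returns below a critical data threshold and that small validation sets lead to overfitting during model selection. The study stresses that embedding quality alone cannot overcome fundamental data scarcity in sentiment classification, and it suggests that practitioners turn to few-shot learning, data augmentation, or lexicon-enhanced hybrid methods to improve generalization.
Link: https://arxiv.org/abs/2512.13749
Authors: Joyjit Roy, Samaresh Kumar Singh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: 6 pages, 2 figures. Submitted to IEEE IATMSI-2026 (Track: AI, IoT and Computer Vision Enabled Technologies)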
Abstract:Financial sentiment analysis enhances market understanding; however, standard natural language processing approaches encounter significant challenges when applied to small datasets. This study provides a comparative evaluation of embedding-based methods for financial news sentiment classification in resource-constrained environments. Word2Vec, GloVe, and sentence transformer representations are evaluated in combination with gradient boosting on manually labeled headlines. Experimental results identify a substantial gap between validation and test performance, with models performing worse than trivial baselines despite strong validation metrics. The analysis demonstrates that pretrained embeddings yield diminishing returns below a critical data sufficiency threshold, and that small validation sets contribute to overfitting during model selection. Practical application is illustrated through weekly sentiment aggregation and narrative summarization for market monitoring workflows. The findings offer empirical evidence that embedding quality alone cannot address fundamental data scarcity in sentiment classification. For practitioners operating with limited resources, the results indicate the need to consider alternative approaches such as few-shot learning, data augmentation, or lexicon-enhanced hybrid methods when labeled samples are scarce.
[AI-56] Parametrized Sharing for Multi-Agent Hybrid DRL for Multiple Multi-Functional RISs-Aided Downlink NOMA Networks
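[Quick read]: This paper addresses low communication efficiency in non-orthogonal multiple access (NOMA) downlink networks, in particular how to raise system energy efficiency (EE) in a multi-agent optimization setting by introducing multi-functional reconfigurable intelligent surfaces (MF-RISs) with energy harvesting (EH) capability. The key is a parametrized sharing scheme for multi-agent hybrid deep reinforcement learning (PMHRL), in which multi-agent proximal policy optimization (PPO) handles continuous variables (such as power allocation and beamforming) and a deep Q-network (DQN) handles discrete variables (such as MF-RIS amplitudes, phase shifts, and EH ratios). The positions and configurations of the MF-RISs are optimized jointly, subject to user rate requirements and the self-sustainability constraint. Simulations show a clear EE gain over conventional approaches, supporting the multi-MF-RIS-aided NOMA architecture.
Link: https://arxiv.org/abs/2601.00538
Authors: Chi-Te Kuo, Li-Hsiang Shen, Jyun-Jhe Huang
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: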
Abstract:Multi-functional reconfigurable intelligent surface (MF-RIS) is conceived to improve communication efficiency thanks to its extended signal coverage from its active RIS capability and self-sustainability from energy harvesting (EH). We investigate the architecture of multi-MF-RISs to assist non-orthogonal multiple access (NOMA) downlink networks. We formulate an energy efficiency (EE) maximization problem by optimizing power allocation, transmit beamforming and MF-RIS configurations of amplitudes, phase-shifts and EH ratios, as well as the position of MF-RISs, while satisfying constraints of available power, user rate requirements, and self-sustainability property. We design a parametrized sharing scheme for multi-agent hybrid deep reinforcement learning (PMHRL), where the multi-agent proximal policy optimization (PPO) and deep-Q network (DQN) handle continuous and discrete variables, respectively. The simulation results demonstrate that the proposed PMHRL achieves the highest EE compared to other benchmarks, including cases without parametrized sharing, pure PPO and DQN. Moreover, the proposed multi-MF-RISs-aided downlink NOMA achieves the highest EE compared to scenarios of no-EH/amplification, traditional RISs, and deployment without RISs/MF-RISs under different multiple access.
[AI-57] Benchmarking Preprocessing and Integration Methods in Single-Cell Genomics SDM’25
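[Quick read]: This paper addresses a key challenge in integrative analysis of single-cell multi-omics data: how to integrate single-cell measurements across molecular modalities (such as DNA, RNA, and protein) efficiently and accurately in support of personalized medicine. The core of the work is a systematic evaluation of a general analysis pipeline with three steps (normalization, data integration, and dimensionality reduction), testing combinations of seven normalization methods, four dimensionality reduction methods, and five integration methods on six datasets spanning modalities, tissues, and organisms. The results show that Seurat and Harmony excel at data integration, with Harmony being more time-efficient on large datasets. UMAP is the dimensionality reduction method most compatible with the integration methods, and the best normalization strategy depends on the integration algorithm used.
Link: https://arxiv.org/abs/2601.00277
Authors: Ali Anaissi, Seid Miad Zandavi, Weidong Huang, Junaid Akram, Basem Suleiman, Ali Braytee, Jie Hua
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: The 23rd Australasian Data Science and Machine Learning Conference (AusDM’25)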
Abstract:Single-cell data analysis has the potential to revolutionize personalized medicine by characterizing disease-associated molecular changes at the single-cell level. Advanced single-cell multimodal assays can now simultaneously measure various molecules (e.g., DNA, RNA, Protein) across hundreds of thousands of individual cells, providing a comprehensive molecular readout. A significant analytical challenge is integrating single-cell measurements across different modalities. Various methods have been developed to address this challenge, but there has been no systematic evaluation of these techniques with different preprocessing strategies. This study examines a general pipeline for single-cell data analysis, which includes normalization, data integration, and dimensionality reduction. The performance of different algorithm combinations often depends on the dataset sizes and characteristics. We evaluate six datasets across diverse modalities, tissues, and organisms using three metrics: Silhouette Coefficient Score, Adjusted Rand Index, and Calinski-Harabasz Index. Our experiments involve combinations of seven normalization methods, four dimensionality reduction methods, and five integration methods. The results show that Seurat and Harmony excel in data integration, with Harmony being more time-efficient, especially for large datasets. UMAP is the most compatible dimensionality reduction method with the integration techniques, and the choice of normalization method varies depending on the integration method used.
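Of the three metrics named above, the Silhouette Coefficient is the easiest to show in a few lines. The sketch below uses the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from point i to its own cluster and b is the smallest mean distance to any other cluster. It is a plain-Python illustration on made-up 2D points, not the study's evaluation code, and it assumes no two points coincide.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; singleton clusters contribute 0."""
    n = len(points)
    total = 0.0
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if i != j:
                d = math.dist(points[i], points[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            continue  # singleton cluster or only one cluster present
        a = sum(own) / len(own)                               # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        total += (b - a) / max(a, b)
    return total / n

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # labels follow the two groups
bad = silhouette_score(points, [0, 1, 0, 1])    # labels cut across the groups
```

Scores near 1 indicate compact, well-separated clusters, while negative scores indicate that points sit closer to another cluster than to their own.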
[AI-58] Neural Minimum Weight Perfect Matching for Quantum Error Codes
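[Quick read]: This paper addresses the efficiency and accuracy of decoding quantum error correction (QEC) codes, in particular the limited performance of the conventional Minimum Weight Perfect Matching (MWPM) algorithm under complex error patterns. The key is a data-driven decoder called Neural Minimum Weight Perfect Matching (NMWPM). Its hybrid architecture uses Graph Neural Networks (GNNs) to extract local syndrome features and Transformers to capture long-range global dependencies, and then predicts dynamic edge weights for the MWPM algorithm. A novel proxy loss function enables end-to-end optimization through the non-differentiable MWPM module, which significantly lowers the Logical Error Rate (LER) and shows the benefit of combining neural prediction with the structure of classical matching.
Link: https://arxiv.org/abs/2601.00242
Authors: Yotam Peled, David Zenati, Eliya Nachmani
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: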
Abstract:Realizing the full potential of quantum computation requires Quantum Error Correction (QEC). QEC reduces error rates by encoding logical information across redundant physical qubits, enabling errors to be detected and corrected. A common decoder used for this task is Minimum Weight Perfect Matching (MWPM), a graph-based algorithm that relies on edge weights to identify the most likely error chains. In this work, we propose a data-driven decoder named Neural Minimum Weight Perfect Matching (NMWPM). Our decoder utilizes a hybrid architecture that integrates Graph Neural Networks (GNNs) to extract local syndrome features and Transformers to capture long-range global dependencies, which are then used to predict dynamic edge weights for the MWPM decoder. To facilitate training through the non-differentiable MWPM algorithm, we formulate a novel proxy loss function that enables end-to-end optimization. Our findings demonstrate a significant reduction in the Logical Error Rate (LER) over standard baselines, highlighting the advantage of hybrid decoders that combine the predictive capabilities of neural networks with the algorithmic structure of classical matching.
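To show why the edge weights matter, the following sketch finds a minimum weight perfect matching by exhaustive search and then repeats the search with different weights. It is a toy under stated assumptions: the four nodes stand in for syndrome defects, the weights are made up, and the second weight table only imitates what a learned predictor might output. Practical decoders use Edmonds' blossom algorithm instead of brute force.

```python
def min_weight_perfect_matching(nodes, weight):
    """Exhaustively search perfect matchings; only viable for a handful of nodes."""
    if not nodes:
        return 0.0, []
    first, rest = nodes[0], nodes[1:]
    best_cost, best_pairs = float("inf"), None
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        cost, pairs = min_weight_perfect_matching(remaining, weight)
        cost += weight[frozenset((first, partner))]
        if cost < best_cost:
            best_cost, best_pairs = cost, [(first, partner)] + pairs
    return best_cost, best_pairs

nodes = ["a", "b", "c", "d"]
static = {frozenset(p): w for p, w in [
    (("a", "b"), 1.0), (("c", "d"), 1.0), (("a", "c"), 5.0),
    (("b", "d"), 5.0), (("a", "d"), 2.0), (("b", "c"), 0.5)]}
cost_static, pairs_static = min_weight_perfect_matching(nodes, static)

# Hypothetical reweighting, standing in for predicted dynamic edge weights.
dynamic = dict(static)
dynamic[frozenset(("a", "d"))] = 1.0
dynamic[frozenset(("b", "c"))] = 0.2
cost_dynamic, pairs_dynamic = min_weight_perfect_matching(nodes, dynamic)
```

The matching routine itself is unchanged between the two runs; only the weights differ, yet the selected pairing flips, which is the lever a learned weight predictor acts on.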
[AI-59] Hear the Heartbeat in Phases: Physiologically Grounded Phase-Aware ECG Biometrics
[Quick Read]: This paper addresses a shortcoming of existing electrocardiography (ECG) identity authentication methods, which treat heartbeats as homogeneous signals and ignore the phase-specific information within the cardiac cycle. The core solution is a Hierarchical Phase-Aware Fusion (HPAF) framework whose three-stage design avoids cross-feature entanglement. First, the Intra-Phase Representation (IPR) stage independently models the morphological and variation features of each cardiac phase. Second, the Phase-Grouped Hierarchical Fusion (PGHF) stage integrates information from physiologically related phases in a structured way. Finally, the Global Representation Fusion (GRF) stage adaptively weights and fuses the phase groups to produce a unified, discriminative identity representation. In addition, to cope with the noise and variability of continuously acquired heartbeats, the paper introduces a Heartbeat-Aware Multi-prototype (HAM) enrollment strategy that builds a multi-prototype template gallery to improve robustness.
Link: https://arxiv.org/abs/2601.00170
Authors: Jintao Huang, Lu Leng, Yi Zhang, Ziyuan Yang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:
Abstract: Electrocardiography (ECG) is adopted for identity authentication in wearable devices due to its individual-specific characteristics and inherent liveness. However, existing methods often treat heartbeats as homogeneous signals, overlooking the phase-specific characteristics within the cardiac cycle. To address this, we propose a Hierarchical Phase-Aware Fusion (HPAF) framework that explicitly avoids cross-feature entanglement through a three-stage design. In the first stage, Intra-Phase Representation (IPR) independently extracts representations for each cardiac phase, ensuring that phase-specific morphological and variation cues are preserved without interference from other phases. In the second stage, Phase-Grouped Hierarchical Fusion (PGHF) aggregates physiologically related phases in a structured manner, enabling reliable integration of complementary phase information. In the final stage, Global Representation Fusion (GRF) further combines the grouped representations and adaptively balances their contributions to produce a unified and discriminative identity representation. Moreover, because ECG signals are continuously acquired, multiple heartbeats can be collected for each individual. We propose a Heartbeat-Aware Multi-prototype (HAM) enrollment strategy, which constructs a multi-prototype gallery template set to reduce the impact of heartbeat-specific noise and variability. Extensive experiments on three public datasets demonstrate that HPAF achieves state-of-the-art results compared with other methods under both closed-set and open-set settings.
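The abstract does not say how a probe is scored against a multi-prototype gallery. One simple possibility, shown here purely as an assumption, is to take the best cosine similarity over each identity's prototypes and to reject the probe in the open-set case when no score reaches a threshold. The names `identify`, `gallery`, and `threshold` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def identify(probe, gallery, threshold=0.8):
    """Match a probe embedding against a multi-prototype gallery.

    gallery maps identity -> list of prototype embeddings.
    Returns (identity, score); identity is None when the best score is
    below the threshold (open-set rejection).
    """
    best_id, best_score = None, -1.0
    for identity, prototypes in gallery.items():
        # An identity is represented by its best-matching prototype.
        score = max(cosine(probe, p) for p in prototypes)
        if score > best_score:
            best_id, best_score = identity, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score
```

Keeping several prototypes per person lets one noisy enrollment heartbeat be outvoted by cleaner ones, which is the motivation the abstract gives for HAM.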
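[AI-60] MethConvTransformer: A Deep Learning Framework for Cross-Tissue Alzheimer's Disease Detection
[Quick Read]: This paper addresses the large cross-tissue variability, low reproducibility, and limited translational utility of DNA methylation biomarkers in Alzheimer's disease (AD). The key to the solution is MethConvTransformer, a Transformer-based deep learning framework that integrates methylation profiles from brain and peripheral tissues. It combines a CpG-wise linear projection with convolution and self-attention to capture local and long-range dependencies among CpG sites, and adds subject-level covariates and tissue embeddings to separate shared from region-specific methylation effects, enabling robust cross-tissue biomarker discovery and multi-resolution interpretability analysis.
Link: https://arxiv.org/abs/2601.00143
Authors: Gang Qu, Guanghao Li, Zhongming Zhao (for the Alzheimer's Disease Neuroimaging Initiative)
Affiliations: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: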
Abstract:Alzheimer’s disease (AD) is a multifactorial neurodegenerative disorder characterized by progressive cognitive decline and widespread epigenetic dysregulation in the brain. DNA methylation, as a stable yet dynamic epigenetic modification, holds promise as a noninvasive biomarker for early AD detection. However, methylation signatures vary substantially across tissues and studies, limiting reproducibility and translational utility. To address these challenges, we develop MethConvTransformer, a transformer-based deep learning framework that integrates DNA methylation profiles from both brain and peripheral tissues to enable biomarker discovery. The model couples a CpG-wise linear projection with convolutional and self-attention layers to capture local and long-range dependencies among CpG sites, while incorporating subject-level covariates and tissue embeddings to disentangle shared and region-specific methylation effects. In experiments across six GEO datasets and an independent ADNI validation cohort, our model consistently outperforms conventional machine-learning baselines, achieving superior discrimination and generalization. Moreover, interpretability analyses using linear projection, SHAP, and Grad-CAM++ reveal biologically meaningful methylation patterns aligned with AD-associated pathways, including immune receptor signaling, glycosylation, lipid metabolism, and endomembrane (ER/Golgi) organization. Together, these results indicate that MethConvTransformer delivers robust, cross-tissue epigenetic biomarkers for AD while providing multi-resolution interpretability, thereby advancing reproducible methylation-based diagnostics and offering testable hypotheses on disease mechanisms.
[AI-61] Democratizing Electronic-Photonic AI Systems: An Open-Source AI-Infused Cross-Layer Co-Design and Design Automation Toolflow
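[Quick Read]: This paper addresses the high complexity of designing and deploying electronic-photonic AI systems. The core challenges are a steep learning curve across multiple layers (device physics, circuit design, system architecture, and AI algorithms) and the lack of a mature electronic-photonic design automation (EPDA) toolchain, which leads to long design cycles and hinders cross-disciplinary co-innovation. The key to the solution is a cross-layer co-design and automation framework that covers scalable architectures for photonic edge AI and Transformer inference; SimPhony, an open-source modeling tool for rapid evaluation and design-space exploration; and advances in AI-based photonic design automation, including physics-driven Maxwell solvers, a fabrication-aware inverse design framework, and a large-scale inverse training algorithm for meta-optical neural networks. Together these form a scalable EPDA stack for next-generation electronic-photonic AI systems.
Link: https://arxiv.org/abs/2601.00130
Authors: Hongjian Zhou, Ziang Yin, Jiaqi Gu
Affiliations: Unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Comments: 9 pages. Accepted to SPIE Photonics West, AI and Optical Data Sciences VII, 2026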
Abstract:Photonics is becoming a cornerstone technology for high-performance AI systems and scientific computing, offering unparalleled speed, parallelism, and energy efficiency. Despite this promise, the design and deployment of electronic-photonic AI systems remain highly challenging due to a steep learning curve across multiple layers, spanning device physics, circuit design, system architecture, and AI algorithms. The absence of a mature electronic-photonic design automation (EPDA) toolchain leads to long, inefficient design cycles and limits cross-disciplinary innovation and co-evolution. In this work, we present a cross-layer co-design and automation framework aimed at democratizing photonic AI system development. We begin by introducing our architecture designs for scalable photonic edge AI and Transformer inference, followed by SimPhony, an open-source modeling tool for rapid EPIC AI system evaluation and design-space exploration. We then highlight advances in AI-enabled photonic design automation, including physical AI-based Maxwell solvers, a fabrication-aware inverse design framework, and a scalable inverse training algorithm for meta-optical neural networks, enabling a scalable EPDA stack for next-generation electronic-photonic AI systems.
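[AI-62] Toward Large-Scale Photonics-Empowered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration
[Quick Read]: This paper addresses three core challenges in realizing large-scale photonic AI systems: (1) supporting the dynamic tensor operations of modern models rather than only weight-static kernels, especially for attention or Transformer-style workloads; (2) systematically managing conversion, control, and data-movement overheads, using multiplexing and optimized dataflow to amortize electronic costs so that ADCs/DACs and I/O do not become the bottleneck; and (3) remaining robust to hardware non-idealities, which grow more severe as integration density increases. The key to the solution is a cross-layer toolchain that supports the flow from early design exploration to physical realization. SimPhony provides implementation-aware modeling and rapid cross-layer evaluation, mapping physical costs to system-level metrics. ADEPT and ADEPT-Z provide end-to-end circuit and topology exploration, connecting system objectives to feasible photonic fabrics under practical device constraints. Apollo and LiDAR provide scalable photonic physical design automation, producing manufacturable layouts that satisfy routing, thermal, and crosstalk constraints.
Link: https://arxiv.org/abs/2601.00129
Authors: Ziang Yin, Hongjian Zhou, Nicholas Gangi, Meng Zhang, Jeff Zhang, Zhaoran Rena Huang, Jiaqi Gu
Affiliations: Unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Comments: 10 pages. Accepted to SPIE Photonics West, Optical Interconnects and Packaging 2026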
Abstract:In this work, we identify three considerations that are essential for realizing practical photonic AI systems at scale: (1) dynamic tensor operation support for modern models rather than only weight-static kernels, especially for attention/Transformer-style workloads; (2) systematic management of conversion, control, and data-movement overheads, where multiplexing and dataflow must amortize electronic costs instead of letting ADC/DAC and I/O dominate; and (3) robustness under hardware non-idealities that become more severe as integration density grows. To study these coupled tradeoffs quantitatively, and to ensure they remain meaningful under real implementation constraints, we build a cross-layer toolchain that supports photonic AI design from early exploration to physical realization. SimPhony provides implementation-aware modeling and rapid cross-layer evaluation, translating physical costs into system-level metrics so architectural decisions are grounded in realistic assumptions. ADEPT and ADEPT-Z enable end-to-end circuit and topology exploration, connecting system objectives to feasible photonic fabrics under practical device and circuit constraints. Finally, Apollo and LiDAR provide scalable photonic physical design automation, turning candidate circuits into manufacturable layouts while accounting for routing, thermal, and crosstalk constraints.
[AI-63] Modeling Day-Long ECG Signals to Predict Heart Failure Risk with Explainable AI
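[Quick Read]: This paper addresses the difficulty of predicting heart failure (HF) risk early, particularly in older adults, among whom HF is common and prognosis is poor. The key to the solution is DeepHHF, a deep learning model of 24-hour single-lead Holter ECG recordings that captures paroxysmal events and circadian variations to predict HF risk within five years. On the TLHE dataset the model reaches an area under the receiver operating characteristic curve (AUC) of 0.80, outperforming a model based on 30-second segments and a traditional clinical score. Explainability analysis confirms that it focuses on arrhythmias and cardiac abnormalities, with key attention between 8 AM and 3 PM, which demonstrates the potential of AI-driven continuous ECG analysis in non-invasive, low-cost, and widely accessible settings.
Link: https://arxiv.org/abs/2601.00014
Authors: Eran Zvuloni, Ronit Almog, Michael Glikson, Shany Brimer Biton, Ilan Green, Izhar Laufer, Offer Amir, Joachim A. Behar
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: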
Abstract:Heart failure (HF) affects 11.8% of adults aged 65 and older, reducing quality of life and longevity. Preventing HF can reduce morbidity and mortality. We hypothesized that artificial intelligence (AI) applied to 24-hour single-lead electrocardiogram (ECG) data could predict the risk of HF within five years. To research this, the Technion-Leumit Holter ECG (TLHE) dataset, including 69,663 recordings from 47,729 patients, collected over 20 years was used. Our deep learning model, DeepHHF, trained on 24-hour ECG recordings, achieved an area under the receiver operating characteristic curve of 0.80 that outperformed a model using 30-second segments and a clinical score. High-risk individuals identified by DeepHHF had a two-fold chance of hospitalization or death incidents. Explainability analysis showed DeepHHF focused on arrhythmias and heart abnormalities, with key attention between 8 AM and 3 PM. This study highlights the feasibility of deep learning to model 24-hour continuous ECG data, capturing paroxysmal events and circadian variations essential for reliable risk prediction. Artificial intelligence applied to single-lead Holter ECG is non-invasive, inexpensive, and widely accessible, making it a promising tool for HF risk prediction.
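The headline number in this abstract is an area under the ROC curve. As a reminder of what that statistic measures, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive case receives the higher risk score, with ties counted as one half. This is the standard pairwise definition, not the authors' evaluation pipeline, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Read this way, an AUC of 0.80 means that a randomly chosen patient who develops HF is ranked above a randomly chosen patient who does not in about four out of five pairs.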
Machine Learning
[LG-0] Categorical Reparameterization with Denoising Diffusion models
Link: https://arxiv.org/abs/2601.00781
Authors: Samson Gourevitch, Alain Durmus, Eric Moulines, Jimmy Olsson, Yazid Janati
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: working paper
Abstract:Gradient-based optimization with categorical variables typically relies on score-function estimators, which are unbiased but noisy, or on continuous relaxations that replace the discrete distribution with a smooth surrogate admitting a pathwise (reparameterized) gradient, at the cost of optimizing a biased, temperature-dependent objective. In this paper, we extend this family of relaxations by introducing a diffusion-based soft reparameterization for categorical distributions. For these distributions, the denoiser under a Gaussian noising process admits a closed form and can be computed efficiently, yielding a training-free diffusion sampler through which we can backpropagate. Our experiments show that the proposed reparameterization trick yields competitive or improved optimization performance on various benchmarks.
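The closed-form denoiser mentioned in the abstract can be illustrated under one explicit assumption about the noising process: a one-hot sample x_0 = e_k drawn with probability pi_k is corrupted as x_t = alpha * x_0 + sigma * eps with standard Gaussian eps. Bayes' rule then gives p(x_0 = e_k | x_t) proportional to pi_k * exp(alpha * x_t[k] / sigma^2), because the terms |x_t|^2 and alpha^2 are the same for every k and cancel. The posterior mean is therefore a softmax. The paper's exact parameterization and sampler may differ; the function name is hypothetical.

```python
from math import exp, log


def categorical_denoiser(x_t, probs, alpha, sigma):
    """Posterior mean E[x_0 | x_t] for one-hot x_0 under Gaussian noising.

    Assumes x_t = alpha * x_0 + sigma * eps and strictly positive probs.
    Returns a probability vector: softmax(log probs + alpha * x_t / sigma^2).
    """
    logits = [log(p) + alpha * x / sigma ** 2 for p, x in zip(probs, x_t)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

Because the output is a smooth function of `probs`, gradients can flow from a downstream loss back to the categorical parameters, which is the property a reparameterization needs.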
[LG-1] A Machine Learning Framework for Off Ball Defensive Role and Performance Evaluation in Football
Link: https://arxiv.org/abs/2601.00748
Authors: Sean Groom, Shuo Wang, Francisco Belo, Axl Rice, Liam Anderson
Categories: Machine Learning (cs.LG)
Comments: 40 pages, 16 figures
Abstract: Evaluating off-ball defensive performance in football is challenging, as traditional metrics do not capture the nuanced coordinated movements that limit opponent action selection and success probabilities. Although widely used possession value models excel at appraising on-ball actions, their application to defense remains limited. Existing counterfactual methods, such as ghosting models, help extend these analyses but often rely on simulating “average” behavior that lacks tactical context. To address this, we introduce a covariate-dependent Hidden Markov Model (CDHMM) tailored to corner kicks, a highly structured aspect of football games. Our label-free model infers time-resolved man-marking and zonal assignments directly from player tracking data. We leverage these assignments to propose a novel framework for defensive credit attribution and a role-conditioned ghosting method for counterfactual analysis of off-ball defensive performance. We show how these contributions provide an interpretable evaluation of defensive contributions against context-aware baselines.
[LG-2] The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
Link: https://arxiv.org/abs/2601.00747
Authors: Max Ruiz Luyten, Mihaela van der Schaar
Categories: Machine Learning (cs.LG)
Comments: 56 pages, 9 figures, submitted to Twenty-Ninth Annual Conference on Artificial Intelligence and Statistics
Abstract:State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops: sampling diverse chains of thought and reinforcing the highest-scoring ones, mainly optimizing correctness. We analyze how this design choice is sensitive to the collapse of the model’s distribution over reasoning paths, slashing semantic entropy and undermining creative problem-solving. To analyze this failure, we introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces. STaR, GRPO, and DPO, as well as entropy bonuses, and other methods, all constitute special cases of the same loss. The framework delivers three core results: (i) the diversity decay theorem, describing how correctness-based objectives lead to distinct modes of diversity decay for STaR, GRPO, and DPO; (ii) designs that ensure convergence to a stable and diverse policy, effectively preventing collapse; and (iii) simple, actionable recipes to achieve this in practice. DCR thus offers the first principled recipe for LLMs that remain both correct and creative.
[LG-3] Precision Autotuning for Linear Solvers via Contextual Bandit-Based RL
Link: https://arxiv.org/abs/2601.00728
Authors: Erin Carson, Xinye Chen
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract: We propose a reinforcement learning (RL) framework for adaptive precision tuning of linear solvers, which can be extended to general algorithms. The framework is formulated as a contextual bandit problem and solved using incremental action-value estimation with a discretized state space to select optimal precision configurations for computational steps, balancing precision and computational efficiency. To verify its effectiveness, we apply the framework to iterative refinement for solving linear systems Ax = b. In this application, our approach dynamically chooses precisions based on calculated features from the system. In detail, a Q-table maps discretized features (e.g., approximate condition number and matrix norm) to actions (chosen precision configurations for specific steps), optimized via an epsilon-greedy strategy to maximize a multi-objective reward balancing accuracy and computational cost. Empirical results demonstrate effective precision selection, reducing computational cost while maintaining accuracy comparable to double-precision baselines. The framework generalizes to diverse out-of-sample data and offers insight into utilizing RL precision selection for other numerical algorithms, advancing mixed-precision numerical methods in scientific computing. To the best of our knowledge, this is the first work on precision autotuning with RL that is verified on unseen datasets.
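The ingredients named in the abstract (a Q-table over discretized features, epsilon-greedy selection, and incremental action-value estimation) fit in a few lines. The sketch below is a generic contextual bandit, not the authors' implementation; the class name, the bin edges in `discretize`, and the precision labels used in the test are hypothetical.

```python
import random


class PrecisionBandit:
    """Tabular contextual bandit with epsilon-greedy action selection."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = {}  # (state, action) -> estimated mean reward
        self.n = {}  # (state, action) -> number of updates

    def select(self, state):
        # Explore with probability epsilon, otherwise exploit the Q-table.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((state, a), 0.0))

    def update(self, state, action, reward):
        # Incremental sample-average update: Q <- Q + (r - Q) / n.
        key = (state, action)
        self.n[key] = self.n.get(key, 0) + 1
        old = self.q.get(key, 0.0)
        self.q[key] = old + (reward - old) / self.n[key]


def discretize(cond_number, edges=(1e2, 1e4, 1e8)):
    """Map an approximate condition number to a bin index (hypothetical edges)."""
    return sum(cond_number >= e for e in edges)
```

In use, the reward passed to `update` would combine accuracy and cost, for example a penalty on the final residual plus a penalty on the time spent in higher precision; the exact weighting is a design choice the abstract leaves open.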
[LG-4] BSAT: B-Spline Adaptive Tokenizer for Long-Term Time Series Forecasting
Link: https://arxiv.org/abs/2601.00698
Authors: Maximilian Reinwardt, Michael Eichelbeck, Matthias Althoff
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 7 figures
Abstract: Long-term time series forecasting using transformers is hampered by the quadratic complexity of self-attention and the rigidity of uniform patching, which may be misaligned with the data’s semantic structure. In this paper, we introduce the B-Spline Adaptive Tokenizer (BSAT), a novel, parameter-free method that adaptively segments a time series by fitting it with B-splines. BSAT algorithmically places tokens in high-curvature regions and represents each variable-length basis function as a fixed-size token, composed of its coefficient and position. Further, we propose a hybrid positional encoding that combines an additive learnable positional encoding with Rotary Positional Embedding featuring a layer-wise learnable base: L-RoPE. This allows each layer to attend to different temporal dependencies. Our experiments on several public benchmarks show that our model is competitive, with strong performance at high compression rates. This makes it particularly well-suited for use cases with strong memory constraints.
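As background for the tokenizer, the sketch below evaluates B-spline basis functions with the Cox-de Boor recursion and pairs each coefficient with a position. The abstract does not say how BSAT defines a token's position; using the Greville abscissa (the mean of the basis function's interior knots) is an assumption made here for illustration, as are the function names.

```python
def bspline_basis(i, k, t, knots):
    """Value at t of the i-th B-spline basis function of degree k (Cox-de Boor)."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + k] - knots[i]
    if span_left > 0:  # repeated knots contribute nothing (0/0 := 0)
        left = (t - knots[i]) / span_left * bspline_basis(i, k - 1, t, knots)
    span_right = knots[i + k + 1] - knots[i + 1]
    if span_right > 0:
        right = ((knots[i + k + 1] - t) / span_right
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right


def evaluate_spline(t, coeffs, knots, k=3):
    """Spline value at t; expects len(coeffs) == len(knots) - k - 1."""
    return sum(c * bspline_basis(i, k, t, knots) for i, c in enumerate(coeffs))


def spline_tokens(coeffs, knots, k=3):
    """Fixed-size tokens (coefficient, position), position = Greville abscissa."""
    return [(c, sum(knots[i + 1:i + k + 1]) / k) for i, c in enumerate(coeffs)]
```

With non-uniform knots, basis functions in busy regions are narrow and those in flat regions are wide, yet each still maps to one (coefficient, position) pair, which is how variable-length segments become fixed-size tokens.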
[LG-5] Bayesian Inverse Games with High-Dimensional Multi-Modal Observations
Link: https://arxiv.org/abs/2601.00696
Authors: Yash Jain, Xinjie Liu, Lasse Peters, David Fridovich-Keil, Ufuk Topcu
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Robotics (cs.RO)
Comments:
Abstract:Many multi-agent interaction scenarios can be naturally modeled as noncooperative games, where each agent’s decisions depend on others’ future actions. However, deploying game-theoretic planners for autonomous decision-making requires a specification of all agents’ objectives. To circumvent this practical difficulty, recent work develops maximum likelihood techniques for solving inverse games that can identify unknown agent objectives from interaction data. Unfortunately, these methods only infer point estimates and do not quantify estimator uncertainty; correspondingly, downstream planning decisions can overconfidently commit to unsafe actions. We present an approximate Bayesian inference approach for solving the inverse game problem, which can incorporate observation data from multiple modalities and be used to generate samples from the Bayesian posterior over the hidden agent objectives given limited sensor observations in real time. Concretely, the proposed Bayesian inverse game framework trains a structured variational autoencoder with an embedded differentiable Nash game solver on interaction datasets and does not require labels of agents’ true objectives. Extensive experiments show that our framework successfully learns prior and posterior distributions, improves inference quality over maximum likelihood estimation-based inverse game approaches, and enables safer downstream decision-making without sacrificing efficiency. When trajectory information is uninformative or unavailable, multimodal inference further reduces uncertainty by exploiting additional observation modalities.
[LG-6] ARISE: Adaptive Reinforcement Integrated with Swarm Exploration
Link: https://arxiv.org/abs/2601.00693
Authors: Rajiv Chaitanya M, D R Ramesh Babu
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 12 pages. Accepted for presentation at WCSC 2026
Abstract:Effective exploration remains a key challenge in RL, especially with non-stationary rewards or high-dimensional policies. We introduce ARISE, a lightweight framework that enhances reinforcement learning by augmenting standard policy-gradient methods with a compact swarm-based exploration layer. ARISE blends policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory sampled in the action space, and modulates exploration adaptively using reward-variance cues. While easy benchmarks exhibit only slight improvements (e.g., +0.7% on CartPole-v1), ARISE yields substantial gains on more challenging tasks, including +46% on LunarLander-v3 and +22% on Hopper-v4, while preserving stability on Walker2d and Ant. Under non-stationary reward shifts, ARISE provides marked robustness advantages, outperforming PPO by +75 points on CartPole and improving LunarLander accordingly. Ablation studies confirm that both the swarm component and the adaptive mechanism contribute to the performance. Overall, ARISE offers a simple, architecture-agnostic route to more exploratory and resilient RL agents without altering core algorithmic structures.
[LG-7] Cost Optimization in Production Line Using Genetic Algorithm
Link: https://arxiv.org/abs/2601.00689
Authors: Alireza Rezaee
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract: This paper presents a genetic algorithm (GA) approach to cost-optimal task scheduling in a production line. The system consists of a set of serial processing tasks, each with a given duration, unit execution cost, and precedence constraints, which must be assigned to an unlimited number of stations subject to a per-station duration bound. The objective is to minimize the total production cost, modeled as a station-wise function of task costs and the duration bound, while strictly satisfying all prerequisite and capacity constraints. Two chromosome encoding strategies are investigated: a station-based representation implemented using the JGAP library with SuperGene validity checks, and a task-based representation in which genes encode station assignments directly. For each encoding, standard GA operators (crossover, mutation, selection, and replacement) are adapted to preserve feasibility and drive the population toward lower-cost schedules. Experimental results on three classes of precedence structures (tightly coupled, loosely coupled, and uncoupled) demonstrate that the task-based encoding yields smoother convergence and more reliable cost minimization than the station-based encoding, particularly when the number of valid schedules is large. The study highlights the advantages of GA over gradient-based and analytical methods for combinatorial scheduling problems, especially in the presence of complex constraints and non-differentiable cost landscapes.
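In the task-based encoding, gene i holds the station assigned to task i, so a feasibility check reduces to two passes over the chromosome. The sketch below illustrates that idea only; the paper works with the Java JGAP library, and its exact constraint and cost definitions are not given in the abstract. The interpretation that a prerequisite must sit in the same or an earlier station is an assumption.

```python
def is_feasible(assignment, durations, prerequisites, max_station_time):
    """Check a task-based chromosome against capacity and precedence rules.

    assignment[i]    station index of task i
    durations[i]     processing time of task i
    prerequisites    list of (before, after) task index pairs
    """
    # Capacity: the total duration per station must not exceed the bound.
    load = {}
    for task, station in enumerate(assignment):
        load[station] = load.get(station, 0) + durations[task]
    if any(total > max_station_time for total in load.values()):
        return False
    # Precedence (assumed): a prerequisite may not sit in a later station.
    return all(assignment[before] <= assignment[after]
               for before, after in prerequisites)
```

A GA operator can call such a check after crossover or mutation and either repair or discard infeasible offspring, which is one common way to keep the population valid.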
[LG-8] Sparse FEONet: A Low-Cost Memory-Efficient Operator Network via Finite-Element Local Sparsity for Parametric PDEs
Link: https://arxiv.org/abs/2601.00672
Authors: Seungchan Ko, Jiyeon Kim, Dongwook Shin
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract: In this paper, we study the finite element operator network (FEONet), an operator-learning method for parametric problems, originally introduced in J. Y. Lee, S. Ko, and Y. Hong, Finite Element Operator Network for Solving Elliptic-Type Parametric PDEs, SIAM J. Sci. Comput., 47(2), C501-C528, 2025. FEONet realizes the parameter-to-solution map on a finite element space and admits a training procedure that does not require training data, while exhibiting high accuracy and robustness across a broad class of problems. However, its computational cost increases and accuracy may deteriorate as the number of elements grows, posing notable challenges for large-scale problems. In this paper, we propose a new sparse network architecture motivated by the structure of the finite elements to address this issue. Through extensive numerical experiments, we show that the proposed sparse network achieves substantial improvements in computational cost and efficiency while maintaining comparable accuracy. We also establish theoretical results demonstrating that the sparse architecture can approximate the target operator effectively and provide a stability analysis ensuring reliable training and prediction.
[LG-9] Three factor delay learning rules for spiking neural networks
Link: https://arxiv.org/abs/2601.00668
Authors: Luke Vassallo, Nima Taherinejad
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures
Abstract:Spiking Neural Networks (SNNs) are dynamical systems that operate on spatiotemporal data, yet their learnable parameters are often limited to synaptic weights, contributing little to temporal pattern recognition. Learnable parameters that delay spike times can improve classification performance in temporal tasks, but existing methods rely on large networks and offline learning, making them unsuitable for real-time operation in resource-constrained environments. In this paper, we introduce synaptic and axonal delays to leaky integrate and fire (LIF)-based feedforward and recurrent SNNs, and propose three-factor learning rules to simultaneously learn delay parameters online. We employ a smooth Gaussian surrogate to approximate spike derivatives exclusively for the eligibility trace calculation, and together with a top-down error signal determine parameter updates. Our experiments show that incorporating delays improves accuracy by up to 20% over a weights-only baseline, and for networks with similar parameter counts, jointly learning weights and delays yields up to 14% higher accuracy. On the SHD speech recognition dataset, our method achieves similar accuracy to offline backpropagation-based approaches. Compared to state-of-the-art methods, it reduces model size by 6.6x and inference latency by 67%, with only a 2.4% drop in classification accuracy. Our findings benefit the design of power and area-constrained neuromorphic processors by enabling on-device learning and lowering memory requirements.
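To show what a learnable delay changes in the forward pass, here is a discrete-time leaky integrate-and-fire neuron with a single delayed synapse. It covers only the neuron dynamics, not the three-factor learning rule, and the parameter values and hard reset are assumptions chosen for readability.

```python
def simulate_lif(input_spikes, weight, delay, beta=0.9, threshold=1.0):
    """Output spike train of one LIF neuron driven through one delayed synapse.

    input_spikes  list of 0/1 values, one per time step
    delay         number of steps a presynaptic spike needs to arrive
    beta          membrane leak factor per step
    """
    v = 0.0
    out = []
    for t in range(len(input_spikes)):
        # The synapse delivers the spike emitted `delay` steps earlier.
        arrived = input_spikes[t - delay] if t - delay >= 0 else 0
        v = beta * v + weight * arrived
        if v >= threshold:
            out.append(1)
            v = 0.0  # hard reset after a spike (assumed)
        else:
            out.append(0)
    return out
```

Shifting `delay` moves the output spike in time without touching the weight, which is why delays add temporal expressiveness that weights alone lack.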
[LG-10] Do Chatbot LLMs Talk Too Much? The YapBench Benchmark
Link: https://arxiv.org/abs/2601.00624
Authors: Vadim Borisov, Michael Gröger, Mina Mikhael, Richard H. Schreiber
Categories: Machine Learning (cs.LG)
Comments:
Abstract: Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond with unnecessary length on simple requests, adding redundant explanations, hedging, or boilerplate that increases cognitive load and inflates token-based inference cost. Prior work suggests that preference-based post-training and LLM-judged evaluations can induce systematic length bias, where longer answers are rewarded even at comparable quality. We introduce YapBench, a lightweight benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. Our primary metric, YapScore, measures excess response length beyond the baseline in characters, enabling comparisons across models without relying on any specific tokenizer. We summarize model performance via the YapIndex, a uniformly weighted average of category-level median YapScores. YapBench contains over three hundred English prompts spanning three common brevity-ideal settings: (A) minimal or ambiguous inputs where the ideal behavior is a short clarification, (B) closed-form factual questions with short stable answers, and (C) one-line coding tasks where a single command or snippet suffices. Evaluating 76 assistant LLMs, we observe an order-of-magnitude spread in median excess length and distinct category-specific failure modes, including vacuum-filling on ambiguous inputs and explanation or formatting overhead on one-line technical requests. We release the benchmark and maintain a live leaderboard for tracking verbosity behavior over time.
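The two metrics are defined closely enough in the abstract to sketch: a per-item excess length in characters, and an unweighted mean of per-category medians. Whether the benchmark clamps negative excess to zero is not stated, so the `max(0, ...)` below is an assumption, as are the function names and the item layout.

```python
from statistics import mean, median


def yap_score(response, baseline):
    """Excess characters beyond the minimal-sufficient baseline answer."""
    return max(0, len(response) - len(baseline))  # clamping is an assumption


def yap_index(items):
    """Uniformly weighted average of category-level median YapScores.

    items: iterable of (category, response, baseline) triples.
    """
    by_category = {}
    for category, response, baseline in items:
        by_category.setdefault(category, []).append(yap_score(response, baseline))
    return mean(median(scores) for scores in by_category.values())
```

Taking the median within a category keeps one runaway answer from dominating, and weighting categories equally keeps a large category from drowning out a small one.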
[LG-11] Traffic-Aware Optimal Taxi Placement Using Graph Neural Network-Based Reinforcement Learning
Link: https://arxiv.org/abs/2601.00607
Authors: Sonia Khetarpaul, P Y Sharan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In the context of smart city transportation, efficient matching of taxi supply with passenger demand requires real-time integration of urban traffic network data and mobility patterns. Conventional taxi hotspot prediction models often rely solely on historical demand, overlooking dynamic influences such as traffic congestion, road incidents, and public events. This paper presents a traffic-aware, graph-based reinforcement learning (RL) framework for optimal taxi placement in metropolitan environments. The urban road network is modeled as a graph where intersections represent nodes, road segments serve as edges, and node attributes capture historical demand, event proximity, and real-time congestion scores obtained from live traffic APIs. Graph Neural Network (GNN) embeddings are employed to encode spatial-temporal dependencies within the traffic network, which are then used by a Q-learning agent to recommend optimal taxi hotspots. The reward mechanism jointly optimizes passenger waiting time, driver travel distance, and congestion avoidance. Experiments on a simulated Delhi taxi dataset, generated using real geospatial boundaries and historic ride-hailing request patterns, demonstrate that the proposed model reduced passenger waiting time by about 56% and reduced travel distance by 38% compared to baseline stochastic selection. The proposed approach is adaptable to multi-modal transport systems and can be integrated into smart city platforms for real-time urban mobility optimization.
[LG-12] Cycling Race Time Prediction: A Personalized Machine Learning Approach Using Route Topology and Training Load
Link: https://arxiv.org/abs/2601.00604
Authors: Francisco Aguilera Moreno
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 22 figures
Abstract: Predicting cycling duration for a given route is essential for training planning and event preparation. Existing solutions rely on physics-based models that require extensive parameterization, including aerodynamic drag coefficients and real-time wind forecasts, parameters impractical for most amateur cyclists. This work presents a machine learning approach that predicts ride duration using route topology features combined with the athlete’s current fitness state derived from training load metrics. The model learns athlete-specific performance patterns from historical data, substituting complex physical measurements with historical performance proxies. We evaluate the approach using a single-athlete dataset (N=96 rides) in an N-of-1 study design. After rigorous feature engineering to eliminate data leakage, we find that Lasso regression with Topology + Fitness features achieves MAE=6.60 minutes and R^2=0.922. Notably, integrating fitness metrics (CTL, ATL) reduces error by 14% compared to topology alone (MAE=7.66 min), demonstrating that physiological state meaningfully constrains performance even in self-paced efforts. Progressive checkpoint predictions enable dynamic race planning as route difficulty becomes apparent.
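CTL (chronic training load) and ATL (acute training load) are commonly computed as exponentially weighted averages of a daily training stress score, with time constants of 42 and 7 days by convention. The abstract does not state which constants or stress measure the paper uses, so the defaults below are assumptions, and the function name is hypothetical.

```python
def training_load(daily_stress, ctl_days=42, atl_days=7):
    """Return a list of (CTL, ATL) pairs, one per day of input.

    Each value is an exponentially weighted average of daily stress:
    load_today = load_yesterday + (stress_today - load_yesterday) / days.
    """
    ctl = atl = 0.0
    history = []
    for stress in daily_stress:
        ctl += (stress - ctl) / ctl_days
        atl += (stress - atl) / atl_days
        history.append((ctl, atl))
    return history
```

The last (CTL, ATL) pair before a ride could then be appended to the route topology features as the athlete's fitness and fatigue state.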
[LG-13] Adversarial Samples Are Not Created Equal
Link: https://arxiv.org/abs/2601.00577
Authors: Jennifer Crawford, Amol Khanna, Fred Lu, Amy R. Wagoner, Stella Biderman, Andre T. Nguyen, Edward Raff
Categories: Machine Learning (cs.LG)
Comments:
Abstract: Over the past decade, numerous theories have been proposed to explain the widespread vulnerability of deep neural networks to adversarial evasion attacks. Among these, the theory of non-robust features proposed by Ilyas et al. has been widely accepted, showing that brittle but predictive features of the data distribution can be directly exploited by attackers. However, this theory overlooks adversarial samples that do not directly utilize these features. In this work, we advocate that these two kinds of samples - those which use brittle but predictive features and those that do not - comprise two types of adversarial weaknesses and should be differentiated when evaluating adversarial robustness. For this purpose, we propose an ensemble-based metric to measure the manipulation of non-robust features by adversarial perturbations and use this metric to analyze the makeup of adversarial samples generated by attackers. This new perspective also allows us to re-examine multiple phenomena, including the impact of sharpness-aware minimization on adversarial robustness and the robustness gap observed between adversarial training and standard training on robust datasets.
[LG-14] Entropy Production in Machine Learning Under Fokker-Planck Probability Flow
Link: https://arxiv.org/abs/2601.00554
Authors: Lennon Shikhman
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 3 figures. Submitted for journal review
Abstract:Machine learning models deployed in nonstationary environments experience performance degradation due to data drift. While many drift detection heuristics exist, most lack a principled dynamical interpretation and provide limited guidance on how retraining frequency should be balanced against operational cost. In this work, we propose an entropy-based retraining framework grounded in nonequilibrium stochastic dynamics. Modeling deployment-time data drift as probability flow governed by a Fokker-Planck equation, we quantify model-data mismatch using a time-evolving Kullback-Leibler divergence. We show that the time derivative of this mismatch admits an entropy-balance decomposition featuring a nonnegative entropy production term driven by probability currents. This interpretation motivates entropy-triggered retraining as a label-free intervention strategy that responds to accumulated mismatch rather than delayed performance collapse. In a controlled nonstationary classification experiment, entropy-triggered retraining achieves predictive performance comparable to high-frequency retraining while reducing retraining events by an order of magnitude relative to daily and label-based policies.
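The idea of retraining on accumulated mismatch rather than on a fixed schedule can be illustrated with a toy sketch. It assumes a much simpler setting than the paper: the data distribution is a 1-D Gaussian whose mean drifts at a constant rate (pure drift, no diffusion term), the "model" is a Gaussian frozen at the last retraining time, and the trigger is a fixed threshold on the closed-form KL divergence. The function names, the drift rate, and the threshold are all hypothetical.

```python
import math

def kl_gauss(mu_p, sd_p, mu_q, sd_q):
    """KL(p || q) for two 1-D Gaussians, in nats."""
    return math.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

def entropy_triggered_retrains(days, drift_per_day, sd, threshold):
    """Return the days on which the model is refit to the drifting data."""
    model_mu = 0.0
    retrain_days = []
    for day in range(1, days + 1):
        data_mu = drift_per_day * day            # deterministic drift of the data mean
        mismatch = kl_gauss(data_mu, sd, model_mu, sd)
        if mismatch > threshold:                 # label-free trigger on accumulated mismatch
            model_mu = data_mu                   # "retrain": realign the model with the data
            retrain_days.append(day)
    return retrain_days

retrain_days = entropy_triggered_retrains(days=100, drift_per_day=0.1, sd=1.0, threshold=0.45)
print(len(retrain_days), "retrains instead of 100 daily ones")
```

In this toy setting the trigger fires roughly every ten days, so the number of retraining events is about a tenth of a daily policy; the real framework additionally accounts for diffusion and entropy production, which this sketch omits.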
[LG-15] Cloud-Native Generative AI for Automated Planogram Synthesis: A Diffusion Model Approach for Multi-Store Retail Optimization
Link: https://arxiv.org/abs/2601.00527
Authors: Ravi Teja Pagidoju, Shriya Agarwal
Categories: Machine Learning (cs.LG)
Comments: International Conference on Software Engineering and Data Engineering: Springer Nature
Abstract:Planogram creation is a significant challenge for retail, requiring an average of 30 hours per complex layout. This paper introduces a cloud-native architecture using diffusion models to automatically generate store-specific planograms. Unlike conventional optimization methods that reorganize existing layouts, our system learns from successful shelf arrangements across multiple retail locations to create new planogram configurations. The architecture combines cloud-based model training via AWS with edge deployment for real-time inference. The diffusion model integrates retail-specific constraints through a modified loss function. Simulation-based analysis demonstrates the system reduces planogram design time by 98.3% (from 30 to 0.5 hours) while achieving 94.4% constraint satisfaction. Economic analysis reveals a 97.5% reduction in creation expenses with a 4.4-month break-even period. The cloud-native architecture scales linearly, supporting up to 10,000 concurrent store requests. This work demonstrates the viability of generative AI for automated retail space optimization.
[LG-16] Federated Customization of Large Models: Approaches, Experiments, and Insights
Link: https://arxiv.org/abs/2601.00526
Authors: Yuchuan Ye, Ming Ding, Youjia Chen, Peng Cheng, Dusit Niyato
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages, 1 figure
Abstract:In this article, we explore federated customization of large models and highlight the key challenges it poses within the federated learning framework. We review several popular large model customization techniques, including full fine-tuning, efficient fine-tuning, prompt engineering, prefix-tuning, knowledge distillation, and retrieval-augmented generation. Then, we discuss how these techniques can be implemented within the federated learning framework. Moreover, we conduct experiments on federated prefix-tuning, which, to the best of our knowledge, is the first trial to apply prefix-tuning in the federated learning setting. The conducted experiments validate its feasibility with performance close to centralized approaches. Further comparison with three other federated customization methods demonstrated its competitive performance, satisfactory efficiency, and consistent robustness.
[LG-17] A Sparse-Attention Deep Learning Model Integrating Heterogeneous Multimodal Features for Parkinson’s Disease Severity Profiling
Link: https://arxiv.org/abs/2601.00519
Authors: Dristi Datta, Tanmoy Debnath, Minh Chau, Manoranjan Paul, Gourab Adhikary, Md Geaur Rahman
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Characterising the heterogeneous presentation of Parkinson’s disease (PD) requires integrating biological and clinical markers within a unified predictive framework. While multimodal data provide complementary information, many existing computational models struggle with interpretability, class imbalance, or effective fusion of high-dimensional imaging and tabular clinical features. To address these limitations, we propose the Class-Weighted Sparse-Attention Fusion Network (SAFN), an interpretable deep learning framework for robust multimodal profiling. SAFN integrates MRI cortical thickness, MRI volumetric measures, clinical assessments, and demographic variables using modality-specific encoders and a symmetric cross-attention mechanism that captures nonlinear interactions between imaging and clinical representations. A sparsity-constrained attention-gating fusion layer dynamically prioritises informative modalities, while a class-balanced focal loss (β = 0.999, γ = 1.5) mitigates dataset imbalance without synthetic oversampling. Evaluated on 703 participants (570 PD, 133 healthy controls) from the Parkinson’s Progression Markers Initiative using subject-wise five-fold cross-validation, SAFN achieves an accuracy of 0.98 ± 0.02 and a PR-AUC of 1.00 ± 0.00, outperforming established machine learning and deep learning baselines. Interpretability analysis shows a clinically coherent decision process, with approximately 60% of predictive weight assigned to clinical assessments, consistent with Movement Disorder Society diagnostic principles. SAFN provides a reproducible and transparent multimodal modelling paradigm for computational profiling of neurodegenerative disease.
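To make the class-balanced focal loss concrete, the sketch below uses the class counts and the two hyperparameters quoted in the abstract. It assumes the common "effective number of samples" weighting, (1 - β) / (1 - β^n), which the abstract does not spell out, so the exact formulation in the paper may differ. The function names are hypothetical.

```python
import math

def class_balanced_weights(counts, beta):
    """Effective-number weights (1 - beta) / (1 - beta**n), normalised to mean 1."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def focal_loss(p_true, weight, gamma):
    """Weighted focal loss for one sample, given the probability of its true class."""
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

# Class counts from the abstract: 570 PD participants, 133 healthy controls.
weights = class_balanced_weights([570, 133], beta=0.999)
print(weights)                                   # the minority class gets the larger weight
print(focal_loss(0.6, weights[1], gamma=1.5))    # loss for a moderately confident control
```

The focal factor (1 - p)^γ shrinks the loss of well-classified samples, while the class weight raises the contribution of the smaller healthy-control group, which is how such a loss can counter imbalance without oversampling.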
[LG-18] When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents AAAI2026
Link: https://arxiv.org/abs/2601.00513
Authors: Laksh Advani
Categories: Machine Learning (cs.LG)
Comments: Accepted to Trustagent workshop AAAI 2026
Abstract:Deploying small language models (7-9B parameters) as autonomous agents requires trust in their reasoning, not just their outputs. We reveal a critical reliability crisis: 50-69% of correct answers from these models contain fundamentally flawed reasoning, a “Right-for-Wrong-Reasons” phenomenon invisible to standard accuracy metrics. Through analysis of 10,734 reasoning traces across three models and diverse tasks, we introduce the Reasoning Integrity Score (RIS), a process-based metric validated with substantial inter-rater agreement (κ = 0.657). Conventional practices are challenged by our findings: while retrieval-augmented generation (RAG) significantly improves reasoning integrity (Cohen’s d = 0.23 to 0.93), meta-cognitive interventions like self-critique often harm performance (d = -0.14 to -0.33) in small models on the evaluated tasks. Mechanistic analysis reveals RAG succeeds by grounding calculations in external evidence, reducing errors by 7.6%, while meta-cognition amplifies confusion without sufficient model capacity. To enable deployment, verification capabilities are distilled into a neural classifier achieving 0.86 F1-score with 100× speedup. These results underscore the necessity of process-based verification for trustworthy agents: accuracy alone is dangerously insufficient when models can be right for entirely wrong reasons.
[LG-19] Improving LLM-Assisted Secure Code Generation through Retrieval-Augmented-Generation and Multi-Tool Feedback
Link: https://arxiv.org/abs/2601.00509
Authors: Vidyut Sriram, Sawan Pandita, Achintya Lakshmanan, Aneesh Shamraj, Suman Saha
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) can generate code but often introduce security vulnerabilities, logical inconsistencies, and compilation errors. Prior work demonstrates that LLMs benefit substantially from structured feedback, static analysis, retrieval augmentation, and execution-based refinement. We propose a retrieval-augmented, multi-tool repair workflow in which a single code-generating LLM iteratively refines its outputs using compiler diagnostics, CodeQL security scanning, and KLEE symbolic execution. A lightweight embedding model is used for semantic retrieval of previously successful repairs, providing security-focused examples that guide generation. Evaluated on a combined dataset of 3,242 programs generated by DeepSeek-Coder-1.3B and CodeLlama-7B, the system demonstrates significant improvements in robustness. For DeepSeek, security vulnerabilities were reduced by 96%. For the larger CodeLlama model, the critical security defect rate was decreased from 58.55% to 22.19%, highlighting the efficacy of tool-assisted self-repair even on “stubborn” models.
[LG-20] Laplacian Kernelized Bandit
Link: https://arxiv.org/abs/2601.00461
Authors: Shuang Wu, Arash A. Amini
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We study multi-user contextual bandits where users are related by a graph and their reward functions exhibit both non-linear behavior and graph homophily. We introduce a principled joint penalty for the collection of user reward functions $\{f_u\}$, combining a graph smoothness term based on RKHS distances with an individual roughness penalty. Our central contribution is proving that this penalty is equivalent to the squared norm within a single, unified multi-user RKHS. We explicitly derive its reproducing kernel, which elegantly fuses the graph Laplacian with the base arm kernel. This unification allows us to reframe the problem as learning a single “lifted” function, enabling the design of principled algorithms, LK-GP-UCB and LK-GP-TS, that leverage Gaussian Process posteriors over this new kernel for exploration. We provide high-probability regret bounds that scale with an effective dimension of the multi-user kernel, replacing dependencies on user count or ambient dimension. Empirically, our methods outperform strong linear and non-graph-aware baselines in non-linear settings and remain competitive even when the true rewards are linear. Our work delivers a unified, theoretically grounded, and practical framework that bridges Laplacian regularization with kernelized bandits for structured exploration.
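The abstract does not state the kernel explicitly. As a rough illustration of how a graph Laplacian can be fused with a base arm kernel, the sketch below assumes the standard product form K((u, x), (v, y)) = [(L + λI)⁻¹]_uv · k(x, y), which is the kernel associated with a penalty that sums squared RKHS distances over graph edges plus λ times each user's squared norm. This form, the RBF base kernel, the value of λ, and all function names are assumptions for illustration, not the paper's construction.

```python
import math

def invert(matrix):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def rbf(x, y, lengthscale=1.0):
    """Base arm kernel (an RBF kernel is assumed here for illustration)."""
    return math.exp(-((x - y) ** 2) / (2 * lengthscale ** 2))

def multi_user_kernel(laplacian, lam):
    """Return K((u, x), (v, y)) = [(L + lam*I)^-1]_{uv} * k(x, y)."""
    n = len(laplacian)
    reg = [[laplacian[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    user_sim = invert(reg)
    return lambda u, x, v, y: user_sim[u][v] * rbf(x, y)

# Path graph 0 - 1 - 2: users 0 and 2 are related only through user 1.
L = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
K = multi_user_kernel(L, lam=0.5)
print(K(0, 0.3, 1, 0.3), K(0, 0.3, 2, 0.3))
```

Under this assumed form, directly connected users share more information than users that are two hops apart, so an observation for one user informs the Gaussian Process posterior of its graph neighbours most strongly.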
[LG-21] Detecting Spike Wave Discharges (SWD) using 1-dimensional Residual UNet
Link: https://arxiv.org/abs/2601.00459
Authors: Saurav Sengupta, Scott Kilianski, Suchetha Sharma, Sakina Lashkeri, Ashley McHugh, Mark Beenhakker, Donald E. Brown
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:The manual labeling of events in electroencephalography (EEG) records is time-consuming. This is especially true when EEG recordings are taken continuously over weeks to months. Therefore, a method to automatically label pertinent EEG events reduces the manual workload. Spike wave discharges (SWD), which are the electrographic hallmark of absence seizures, are EEG events that are often labeled manually. While some previous studies have utilized machine learning to automatically segment and classify EEG signals like SWDs, they can be improved. Here we compare the performance of 14 machine learning classifiers on our own manually annotated dataset of 961 hours of EEG recordings from C3H/HeJ mice, including 22,637 labeled SWDs. We find that a 1D UNet performs best for labeling SWDs in this dataset. We also improve the 1D UNet by augmenting our training data and determine that scaling showed the greatest benefit of all augmentation procedures applied. We then compare the 1D UNet with data augmentation, AugUNet1D, against a recently published time- and frequency-based algorithmic approach called “Twin Peaks”. AugUNet1D showed superior performance and detected events with more similar features to the SWDs labeled manually. AugUNet1D, pretrained on our manually annotated data or untrained, is made public for other users.
[LG-22] Imitation from Observations with Trajectory-Level Generative Embeddings
Link: https://arxiv.org/abs/2601.00452
Authors: Yongtao Qu, Shangzhe Li, Weitong Zhang
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 6 figures, 7 tables
Abstract:We consider the offline imitation learning from observations (LfO) where the expert demonstrations are scarce and the available offline suboptimal data are far from the expert behavior. Many existing distribution-matching approaches struggle in this regime because they impose strict support constraints and rely on brittle one-step models, making it hard to extract useful signal from imperfect data. To tackle this challenge, we propose TGE, a trajectory-level generative embedding for offline LfO that constructs a dense, smooth surrogate reward by estimating expert state density in the latent space of a temporal diffusion model trained on offline trajectory data. By leveraging the smooth geometry of the learned diffusion embedding, TGE captures long-horizon temporal dynamics and effectively bridges the gap between disjoint supports, ensuring a robust learning signal even when offline data is distributionally distinct from the expert. Empirically, the proposed approach consistently matches or outperforms prior offline LfO methods across a range of D4RL locomotion and manipulation benchmarks.
[LG-23] Controllable Concept Bottleneck Models
Link: https://arxiv.org/abs/2601.00451
Authors: Hongbin Lin, Chenyang Ren, Juangui Xu, Zhengyu Hu, Cheng-Long Wang, Yao Shu, Hui Xiong, Jingfeng Zhang, Di Wang, Lijie Hu
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.15476
Abstract:Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on static scenarios where the data and concepts are assumed to be fixed and clean. In real-world applications, deployed models require continuous maintenance: we often need to remove erroneous or sensitive data (unlearning), correct mislabeled concepts, or incorporate newly acquired samples (incremental learning) to adapt to evolving environments. Thus, deriving efficient editable CBMs without retraining from scratch remains a significant challenge, particularly in large-scale applications. To address these challenges, we propose Controllable Concept Bottleneck Models (CCBMs). Specifically, CCBMs support three granularities of model editing: concept-label-level, concept-level, and data-level, the latter of which encompasses both data removal and data addition. CCBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and adaptability of our CCBMs, affirming their practical value in enabling dynamic and trustworthy CBMs.
[LG-24] A Comparative Study of Adaptation Strategies for Time Series Foundation Models in Anomaly Detection
Link: https://arxiv.org/abs/2601.00446
Authors: Miseon Park, Kijung Yoon
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Time series anomaly detection is essential for the reliable operation of complex systems, but most existing methods require extensive task-specific training. We explore whether time series foundation models (TSFMs), pretrained on large heterogeneous data, can serve as universal backbones for anomaly detection. Through systematic experiments across multiple benchmarks, we compare zero-shot inference, full model adaptation, and parameter-efficient fine-tuning (PEFT) strategies. Our results demonstrate that TSFMs outperform task-specific baselines, achieving notable gains in AUC-PR and VUS-PR, particularly under severe class imbalance. Moreover, PEFT methods such as LoRA, OFT, and HRA not only reduce computational cost but also match or surpass full fine-tuning in most cases, indicating that TSFMs can be efficiently adapted for anomaly detection, even when pretrained for forecasting. These findings position TSFMs as promising general-purpose models for scalable and efficient time series anomaly detection.
[LG-25] A Comparative Analysis of Interpretable Machine Learning Methods
Link: https://arxiv.org/abs/2601.00428
Authors: Mattia Billa, Giovanni Orlandi, Veronica Guidetti, Federica Mandreoli
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In recent years, Machine Learning (ML) has seen widespread adoption across a broad range of sectors, including high-stakes domains such as healthcare, finance, and law. This growing reliance has raised increasing concerns regarding model interpretability and accountability, particularly as legal and regulatory frameworks place tighter constraints on using black-box models in critical applications. Although interpretable ML has attracted substantial attention, systematic evaluations of inherently interpretable models, especially for tabular data, remain relatively scarce and often focus primarily on aggregated performance outcomes. To address this gap, we present a large-scale comparative evaluation of 16 inherently interpretable methods, ranging from classical linear models and decision trees to more recent approaches such as Explainable Boosting Machines (EBMs), Symbolic Regression (SR), and Generalized Optimal Sparse Decision Trees (GOSDT). Our study spans 216 real-world tabular datasets and goes beyond aggregate rankings by stratifying performance according to structural dataset characteristics, including dimensionality, sample size, linearity, and class imbalance. In addition, we assess training time and robustness under controlled distributional shifts. Our results reveal clear performance hierarchies, especially for regression tasks, where EBMs consistently achieve strong predictive accuracy. At the same time, we show that performance is highly context-dependent: SR and Interpretable Generalized Additive Neural Networks (IGANNs) perform particularly well in non-linear regimes, while GOSDT models exhibit pronounced sensitivity to class imbalance. Overall, these findings provide practical guidance for practitioners seeking a balance between interpretability and predictive performance, and contribute to a deeper empirical understanding of interpretable modeling for tabular data. 
[LG-26] Secure, Verifiable, and Scalable Multi-Client Data Sharing via Consensus-Based Privacy-Preserving Data Distribution
Link: https://arxiv.org/abs/2601.00418
Authors: Prajwal Panth, Sahaj Raj Malla
Categories: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Preprint. Under review
Abstract:We propose the Consensus-Based Privacy-Preserving Data Distribution (CPPDD) framework, a lightweight and post-setup autonomous protocol for secure multi-client data aggregation. The framework enforces unanimous-release confidentiality through a dual-layer protection mechanism that combines per-client affine masking with priority-driven sequential consensus locking. Decentralized integrity is verified via step (sigma_S) and data (sigma_D) checksums, facilitating autonomous malicious deviation detection and atomic abort without requiring persistent coordination. The design supports scalar, vector, and matrix payloads with O(N*D) computation and communication complexity, optional edge-server offloading, and resistance to collusion under N-1 corruptions. Formal analysis proves correctness, Consensus-Dependent Integrity and Fairness (CDIF) with overwhelming-probability abort on deviation, and IND-CPA security assuming a pseudorandom function family. Empirical evaluations on MNIST-derived vectors demonstrate linear scalability up to N = 500 with sub-millisecond per-client computation times. The framework achieves 100% malicious deviation detection, exact data recovery, and three-to-four orders of magnitude lower FLOPs compared to MPC and HE baselines. CPPDD enables atomic collaboration in secure voting, consortium federated learning, blockchain escrows, and geo-information capacity building, addressing critical gaps in scalability, trust minimization, and verifiable multi-party computation for regulated and resource-constrained environments.
[LG-27] Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving
Link: https://arxiv.org/abs/2601.00397
Authors: Amey Agrawal, Mayank Yadav, Sukrit Kumar, Anirudha Agrawal, Garv Ghai, Souradeep Bera, Elton Pinto, Sirish Gambhira, Mohammad Adain, Kasra Sohrab, Chus Antonanzas, Alexey Tumanov
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Deploying LLMs efficiently requires testing hundreds of serving configurations, but evaluating each one on a GPU cluster takes hours and costs thousands of dollars. Discrete-event simulators are faster and cheaper, but they require re-implementing the serving system’s control logic, a burden that compounds as frameworks evolve. We present Revati, a time-warp emulator that enables performance modeling by directly executing real serving system code at simulation-like speed. The system intercepts CUDA API calls to virtualize device management, allowing serving frameworks to run without physical GPUs. Instead of executing GPU kernels, it performs time jumps, fast-forwarding virtual time by predicted kernel durations. We propose a coordination protocol that synchronizes these jumps across distributed processes while preserving causality. On vLLM and SGLang, Revati achieves less than 5% prediction error across multiple models and parallelism configurations, while running 5-17x faster than real GPU execution.
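The time-jump idea can be sketched with a virtual clock that advances by predicted kernel durations instead of waiting for real execution. Everything below is a hypothetical illustration: the class, the kernel names, and the duration values are invented, and the max-based synchronisation at the end is only one simple way to keep workers causally consistent, not the coordination protocol proposed in the paper.

```python
class VirtualClock:
    """Virtual time that advances by predicted durations instead of real waiting."""

    def __init__(self):
        self.now = 0.0  # virtual seconds

    def jump(self, predicted_duration):
        # Fast-forward instead of executing the kernel on a GPU.
        self.now += predicted_duration
        return self.now

def emulate_step(clock, kernel_names, predicted_durations):
    """Replace each kernel launch by a time jump of its predicted duration."""
    for name in kernel_names:
        clock.jump(predicted_durations[name])
    return clock.now

# Hypothetical per-kernel predictions in seconds (illustrative numbers only).
predictions = {"attention": 0.004, "mlp": 0.006, "allreduce": 0.002}
clock = VirtualClock()
finish = emulate_step(clock, ["attention", "mlp", "allreduce"], predictions)

# Assumed synchronisation rule: after a collective operation, every worker
# resumes at the latest virtual time among the participants.
worker_times = [finish, 0.010, 0.015]
synced = max(worker_times)
print(finish, synced)
```

Because no kernel actually runs, the wall-clock cost of a step is just the bookkeeping, which is where the speedup over real GPU execution comes from.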
[LG-28] Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models
Link: https://arxiv.org/abs/2601.00391
Authors: Nouar AlDahoul, Aznul Qalid Md Sabri, Ali Mohammed Mansoor
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on utilizing handcrafted features, which are problem-dependent and optimal for specific tasks. Moreover, they are highly susceptible to dynamical events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need of expert knowledge. In this paper, we utilize automatic feature learning methods, which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training, testing accuracy, and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with softmax and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM’s training time takes 445 seconds. Learning in S-CNN takes 770 seconds with a high-performance Graphical Processing Unit (GPU).
[LG-29] NOS-Gate: Queue-Aware Streaming IDS for Consumer Gateways under Timing-Controlled Evasion
Link: https://arxiv.org/abs/2601.00389
Authors: Muhammad Bilal, Omer Tariq, Hasan Ahmed
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 9 pages, 3 figures, 4 tables
Abstract:Timing and burst patterns can leak through encryption, and an adaptive adversary can exploit them. This undermines metadata-only detection in a stand-alone consumer gateway. Therefore, consumer gateways need streaming intrusion detection on encrypted traffic using metadata only, under tight CPU and latency budgets. We present a streaming IDS for stand-alone gateways that instantiates a lightweight two-state unit derived from Network-Optimised Spiking (NOS) dynamics per flow, named NOS-Gate. NOS-Gate scores fixed-length windows of metadata features and, under a K-of-M persistence rule, triggers a reversible mitigation that temporarily reduces the flow’s weight under weighted fair queueing (WFQ). We evaluate NOS-Gate under timing-controlled evasion using an executable ‘worlds’ benchmark that specifies benign device processes, auditable attacker budgets, contention structure, and packet-level WFQ replay to quantify queue impact. All methods are calibrated label-free via burn-in quantile thresholding. Across multiple reproducible worlds and malicious episodes, at an achieved 0.1% false-positive operating point, NOS-Gate attains 0.952 incident recall versus 0.857 for the best baseline in these runs. Under gating, it reduces p99.9 queueing delay and p99.9 collateral delay with a mean scoring cost of approximately 2.09 μs per flow-window on CPU.
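Two mechanisms named in the abstract, burn-in quantile thresholding and the K-of-M persistence rule, are simple enough to sketch. The code below is a minimal illustration with made-up scores and hypothetical function names; it does not reproduce the NOS scoring unit or the WFQ mitigation.

```python
from collections import deque

def burn_in_threshold(benign_scores, quantile):
    """Label-free threshold: an empirical quantile of scores from a burn-in period."""
    ordered = sorted(benign_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def k_of_m_alarms(scores, threshold, k, m):
    """Yield True for each window where at least k of the last m scores exceed threshold."""
    recent = deque(maxlen=m)
    for score in scores:
        recent.append(score > threshold)
        yield sum(recent) >= k

# Illustrative per-window scores for one flow (not real measurements).
burn_in = [0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.3, 0.2]
threshold = burn_in_threshold(burn_in, quantile=0.9)
stream = [0.2, 0.9, 0.1, 0.8, 0.9, 0.7, 0.2, 0.1]
alarms = list(k_of_m_alarms(stream, threshold, k=3, m=4))
print(threshold, alarms)
```

With k=3 and m=4, the isolated spike in the second window does not raise an alarm; the alarm only fires once three of the last four windows are anomalous, and it clears again when the scores return to normal, which matches the reversible nature of the mitigation described above.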
[LG-30] Deterministic Coreset for Lp Subspace
Link: https://arxiv.org/abs/2601.00361
Authors: Rachit Chhaya, Anirban Dasgupta, Dan Feldman, Supratim Shit
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:
Abstract:We introduce the first iterative algorithm for constructing an $\varepsilon$-coreset that guarantees deterministic $\ell_p$ subspace embedding for any $p \in [1,\infty)$ and any $\varepsilon > 0$. For a given full-rank matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ where $n \gg d$, $\mathbf{X}' \in \mathbb{R}^{m \times d}$ is an $(\varepsilon,\ell_p)$-subspace embedding of $\mathbf{X}$ if, for every $\mathbf{q} \in \mathbb{R}^d$, $(1-\varepsilon)\|\mathbf{Xq}\|_p^p \leq \|\mathbf{X}'\mathbf{q}\|_p^p \leq (1+\varepsilon)\|\mathbf{Xq}\|_p^p$. Specifically, in this paper, $\mathbf{X}'$ is a weighted subset of rows of $\mathbf{X}$, which is commonly known in the literature as a coreset. In every iteration, the algorithm ensures that the loss on the maintained set is upper and lower bounded by the loss on the original dataset with appropriate scalings. So, unlike typical coreset guarantees, due to bounded loss, our coreset gives a deterministic guarantee for the $\ell_p$ subspace embedding. For an error parameter $\varepsilon$, our algorithm takes $O(\mathrm{poly}(n,d,\varepsilon^{-1}))$ time and returns a deterministic $\varepsilon$-coreset for $\ell_p$ subspace embedding whose size is $O\left(\frac{d^{\max\{1,p/2\}}}{\varepsilon^2}\right)$. Here, we remove the $\log$ factors in the coreset size, which had been a long-standing open problem. Our coresets are optimal as they are tight with the lower bound. As an application, our coreset can also be used for approximately solving the $\ell_p$ regression problem in a deterministic manner.
[LG-31] Smart Fault Detection in Nanosatellite Electrical Power System
Link: https://arxiv.org/abs/2601.00335
Authors: Alireza Rezaee, Niloofar Nobahari, Amin Asgarifar, Farshid Hajati
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This paper presents a new method for detecting faults in the electrical power system of nanosatellites that operate in low Earth orbit (LEO) without an Attitude Determination and Control Subsystem (ADCS). Every part of this system is at risk of faults due to pressure tolerance, launcher pressure, and environmental conditions. Common faults are line-to-line and open-circuit faults in the photovoltaic subsystem, short-circuit and open-circuit IGBT faults in the DC-to-DC converter, and regulator faults of the ground battery. The fault-free system is simulated with a neural network that takes solar radiation and the solar panel’s surface temperature as inputs and produces current and load as outputs. A neural network classifier then diagnoses the different faults by pattern and type. Other machine learning methods, such as PCA classification, decision trees, and KNN, are also used for fault classification.
[LG-32] Quantum King-Ring Domination in Chess: A QAOA Approach
Link: https://arxiv.org/abs/2601.00318
Authors: Gerhard Stenzel, Michael Kölle, Tobias Rohe, Julian Hager, Leo Sünkel, Maximilian Zorn, Claudia Linnhoff-Popien
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:
Abstract:The Quantum Approximate Optimization Algorithm (QAOA) is extensively benchmarked on synthetic random instances such as MaxCut, TSP, and SAT problems, but these lack semantic structure and human interpretability, offering limited insight into performance on real-world problems with meaningful constraints. We introduce Quantum King-Ring Domination (QKRD), a NISQ-scale benchmark derived from chess tactical positions that provides 5,000 structured instances with one-hot constraints, spatial locality, and 10–40 qubit scale. The benchmark pairs human-interpretable coverage metrics with intrinsic validation against classical heuristics, enabling algorithmic conclusions without external oracles. Using QKRD, we systematically evaluate QAOA design choices and find that constraint-preserving mixers (XY, domain-wall) converge approximately 13 steps faster than standard mixers (p < 10⁻⁷, d ≈ 0.5) while eliminating penalty tuning, warm-start strategies reduce convergence by 45 steps (p < 10⁻¹²⁷, d = 3.35) with energy improvements exceeding d = 8, and Conditional Value-at-Risk (CVaR) optimization yields an informative negative result with worse energy (p < 10⁻⁴⁰, d = 1.21) and no coverage benefit. Intrinsic validation shows QAOA outperforms greedy heuristics by 12.6% and random selection by 80.1%. Our results demonstrate that structured benchmarks reveal advantages of problem-informed QAOA techniques obscured in random instances. We release all code, data, and experimental artifacts for reproducible NISQ algorithm research.
[LG-33] Can Optimal Transport Improve Federated Inverse Reinforcement Learning?
Link: https://arxiv.org/abs/2601.00309
Authors: David Millard, Ali Baheri
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:In robotics and multi-agent systems, fleets of autonomous agents often operate in subtly different environments while pursuing a common high-level objective. Directly pooling their data to learn a shared reward function is typically impractical due to differences in dynamics, privacy constraints, and limited communication bandwidth. This paper introduces an optimal transport-based approach to federated inverse reinforcement learning (IRL). Each client first performs lightweight Maximum Entropy IRL locally, adhering to its computational and privacy limitations. The resulting reward functions are then fused via a Wasserstein barycenter, which considers their underlying geometric structure. We further prove that this barycentric fusion yields a more faithful global reward estimate than conventional parameter averaging methods in federated learning. Overall, this work provides a principled and communication-efficient framework for deriving a shared reward that generalizes across heterogeneous agents and environments.
[LG-34] Task-Driven Kernel Flows: Label Rank Compression and Laplacian Spectral Filtering
Link: https://arxiv.org/abs/2601.00276
Authors: Hongxi Li, Chunlin Huang
Categories: Machine Learning (cs.LG)
Comments: 47 pages; 3 figures
Abstract:We present a theory of feature learning in wide L2-regularized networks showing that supervised learning is inherently compressive. We derive a kernel ODE that predicts a “water-filling” spectral evolution and prove that for any stable steady state, the kernel rank is bounded by the number of classes (C). We further demonstrate that SGD noise is similarly low-rank (O(C)), confining dynamics to the task-relevant subspace. This framework unifies the deterministic and stochastic views of alignment and contrasts the low-rank nature of supervised learning with the high-rank, expansive representations of self-supervision.
[LG-35] Rectifying Adversarial Examples Using Their Vulnerabilities
Link: https://arxiv.org/abs/2601.00270
Authors: Fumiya Morimoto, Ryuto Morita, Satoshi Ono
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Deep neural network-based classifiers are prone to errors when processing adversarial examples (AEs). AEs are minimally perturbed inputs, with perturbations undetectable to humans, that pose significant risks to security-dependent applications. Hence, extensive research has been undertaken to develop defense mechanisms that mitigate their threats. Most existing methods primarily focus on discriminating AEs based on the input sample features, emphasizing AE detection without recovering the category the sample had before the attack. While some tasks may only require rejecting detected AEs, others, such as traffic sign recognition in autonomous driving, necessitate identifying the correct original input category. The objective of this study is to propose a method for rectifying AEs to estimate the correct labels of their original inputs. Our method is based on re-attacking AEs to move them beyond the decision boundary for accurate label prediction, effectively addressing the issue of rectifying minimally perceptible AEs created using white-box attack methods. However, challenges remain in rectifying AEs produced by black-box attacks at a distance from the boundary, or those misclassified into low-confidence categories by targeted attacks. By adopting a straightforward approach of only considering AEs as inputs, the proposed method can address diverse attacks while avoiding the requirement of parameter adjustments or preliminary training. Results demonstrate that the proposed method exhibits consistent performance in rectifying AEs generated via various attack methods, including targeted and black-box attacks. Moreover, it outperforms conventional rectification and input transformation methods in terms of stability against various attacks.
[LG-36] Modern Neuromorphic AI: From Intra-Token to Inter-Token Processing
Link: https://arxiv.org/abs/2601.00245
Authors: Osvaldo Simeone
Categories: Neural and Evolutionary Computing (cs.NE); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:The rapid growth of artificial intelligence (AI) has brought novel data processing and generative capabilities but also escalating energy requirements. This challenge motivates renewed interest in neuromorphic computing principles, which promise brain-like efficiency through discrete and sparse activations, recurrent dynamics, and non-linear feedback. In fact, modern AI architectures increasingly embody neuromorphic principles through heavily quantized activations, state-space dynamics, and sparse attention mechanisms. This paper elaborates on the connections between neuromorphic models, state-space models, and transformer architectures through the lens of the distinction between intra-token processing and inter-token processing. Most early work on neuromorphic AI was based on spiking neural networks (SNNs) for intra-token processing, i.e., for transformations involving multiple channels, or features, of the same vector input, such as the pixels of an image. In contrast, more recent research has explored how neuromorphic principles can be leveraged to design efficient inter-token processing methods, which selectively combine different information elements depending on their contextual relevance. Implementing associative memorization mechanisms, these approaches leverage state-space dynamics or sparse self-attention. Along with a systematic presentation of modern neuromorphic AI models through the lens of intra-token and inter-token processing, training methodologies for neuromorphic AI models are also reviewed. These range from surrogate gradients leveraging parallel convolutional processing to local learning rules based on reinforcement learning mechanisms.
[LG-37] Robust Graph Fine-Tuning with Adversarial Graph Prompting
Link: https://arxiv.org/abs/2601.00229
Authors: Ziyan Zhang, Bo Jiang, Jin Tang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Parameter-Efficient Fine-Tuning (PEFT) has emerged as a dominant paradigm for adapting pre-trained GNN models to downstream tasks. However, existing PEFT methods usually exhibit significant vulnerability to various kinds of noise and attacks on graph topology and node attributes/features. To address this issue, for the first time, we propose integrating adversarial learning into graph prompting and develop a novel Adversarial Graph Prompting (AGP) framework to achieve robust graph fine-tuning. Our AGP has two key aspects. First, we propose the general problem formulation of AGP as a min-max optimization problem and develop an alternating optimization scheme to solve it. For the inner maximization, we propose the Joint Projected Gradient Descent (JointPGD) algorithm to generate strong adversarial noise. For the outer minimization, we employ a simple yet effective module to learn the optimal node prompts to counteract the adversarial noise. Second, we demonstrate theoretically that the proposed AGP can address both graph topology and node noise. This confirms the versatility and robustness of our AGP fine-tuning method across various kinds of graph noise. Note that the proposed AGP is a general method that can be integrated with various pre-trained GNN models to enhance their robustness on downstream tasks. Extensive experiments on multiple benchmark tasks validate the robustness and effectiveness of the AGP method compared to state-of-the-art methods.
[LG-38] Unknown Aware AI-Generated Content Attribution
Link: https://arxiv.org/abs/2601.00218
Authors: Ellie Thieu, Jifan Zhang, Haoyue Bai
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The rapid advancement of photorealistic generative models has made it increasingly important to attribute the origin of synthetic content, moving beyond binary real-or-fake detection toward identifying the specific model that produced a given image. We study the problem of distinguishing outputs from a target generative model (e.g., OpenAI DALL·E 3) from other sources, including real images and images generated by a wide range of alternative models. Using CLIP features and a simple linear classifier, shown to be effective in prior work, we establish a strong baseline for target generator attribution using only limited labeled data from the target model and a small number of known generators. However, this baseline struggles to generalize to harder, unseen, and newly released generators. To address this limitation, we propose a constrained optimization approach that leverages unlabeled wild data, consisting of images collected from the Internet that may include real images, outputs from unknown generators, or even samples from the target model itself. The proposed method encourages wild samples to be classified as non-target while explicitly constraining performance on labeled data to remain high. Experimental results show that incorporating wild data substantially improves attribution performance on challenging unseen generators, demonstrating that unlabeled data from the wild can be effectively exploited to enhance AI-generated content attribution in open-world settings.
[LG-39] Reinforcement-Learned Unequal Error Protection for Quantized Semantic Embeddings
Link: https://arxiv.org/abs/2601.00186
Authors: Moirangthem Tiken Singh, Adnan Arif
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:This paper tackles the pressing challenge of preserving semantic meaning in communication systems constrained by limited bandwidth. We introduce a novel reinforcement learning framework that achieves per-dimension unequal error protection via adaptive repetition coding. Central to our approach is a composite semantic distortion metric that balances global embedding similarity with entity-level preservation, empowering the reinforcement learning agent to allocate protection in a context-aware manner. Experiments show statistically significant gains over uniform protection, achieving 6.8% higher chrF scores and 9.3% better entity preservation at 1 dB SNR. The key innovation of our framework is the demonstration that simple, intelligently allocated repetition coding enables fine-grained semantic protection – an advantage unattainable with conventional codes such as LDPC or Reed-Solomon. Our findings challenge traditional channel coding paradigms by establishing that code structure must align with semantic granularity. This approach is particularly suited to edge computing and IoT scenarios, where bandwidth is scarce, but semantic fidelity is critical, providing a practical pathway for next-generation semantic-aware networks.
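The idea of per-dimension unequal error protection via repetition coding can be sketched as follows: each bit gets its own repetition count, and the decoder takes a majority vote per block. In the paper the counts are chosen by a reinforcement learning agent; here they are fixed by hand, and the names `encode`, `decode`, and `repeats` are hypothetical.

```python
def encode(bits, repeats):
    """Repeat bit i exactly repeats[i] times (odd counts avoid voting ties)."""
    return [b for b, r in zip(bits, repeats) for _ in range(r)]

def decode(coded, repeats):
    """Majority-vote each block back to a single bit."""
    out, pos = [], 0
    for r in repeats:
        block = coded[pos:pos + r]
        out.append(1 if 2 * sum(block) > r else 0)
        pos += r
    return out

bits = [1, 0, 1]
repeats = [5, 1, 3]          # hand-picked: the first bit is protected most
coded = encode(bits, repeats)

# Simulate channel errors: two flips in the first block, one in the third.
received = list(coded)
for i in (0, 1, 6):
    received[i] ^= 1

print(decode(received, repeats))  # [1, 0, 1]
```

A block of `r` copies tolerates up to `(r - 1) // 2` flips, so spending more repetitions on semantically important dimensions buys them more robustness under the same total bit budget.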
[LG-40] Early Prediction of Liver Cirrhosis Up to Three Years in Advance: A Machine Learning Study Benchmarking Against the FIB-4 Score
Link: https://arxiv.org/abs/2601.00175
Authors: Zhuqi Miao, Sujan Ravi, Abdulaziz Ahmed
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Objective: To develop and evaluate machine learning (ML) models for predicting incident liver cirrhosis one, two, and three years prior to diagnosis using routinely collected electronic health record (EHR) data, and to benchmark their performance against the FIB-4 score. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system. Patients with fatty liver disease were identified and categorized into cirrhosis and non-cirrhosis cohorts based on ICD-9/10 codes. Prediction scenarios were constructed using observation and prediction windows to emulate real-world clinical use. Demographics, diagnoses, laboratory results, vital signs, and comorbidity indices were aggregated from the observation window. XGBoost models were trained for 1-, 2-, and 3-year prediction horizons and evaluated on held-out test sets. Model performance was compared with FIB-4 using area under the receiver operating characteristic curve (AUC). Results: Final cohorts included 3,043 patients for the 1-year prediction, 1,981 for the 2-year prediction, and 1,470 for the 3-year prediction. Across all prediction windows, ML models consistently outperformed FIB-4. The XGBoost models achieved AUCs of 0.81, 0.73, and 0.69 for 1-, 2-, and 3-year predictions, respectively, compared with 0.71, 0.63, and 0.57 for FIB-4. Performance gains persisted with longer prediction horizons, indicating improved early risk discrimination. Conclusions: Machine learning models leveraging routine EHR data substantially outperform the traditional FIB-4 score for early prediction of liver cirrhosis. These models enable earlier and more accurate risk stratification and can be integrated into clinical workflows as automated decision-support tools to support proactive cirrhosis prevention and management.
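For context on the baseline, the FIB-4 index is a simple closed-form score: age (years) times AST (U/L), divided by platelet count (10^9/L) times the square root of ALT (U/L). The sketch below computes it; the function and parameter names are illustrative, and the patient values are made up.

```python
import math

def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """FIB-4 = (age * AST) / (platelets * sqrt(ALT))."""
    if alt_u_per_l <= 0 or platelets_1e9_per_l <= 0:
        raise ValueError("ALT and platelet count must be positive")
    return (age_years * ast_u_per_l) / (
        platelets_1e9_per_l * math.sqrt(alt_u_per_l)
    )

# Hypothetical patient: 60 years old, AST 40 U/L, ALT 25 U/L, platelets 200 x 10^9/L.
print(fib4(60, 40, 25, 200))  # 2.4
```

Since the score uses only four routinely measured inputs, it is a natural reference point for models that draw on the much richer EHR feature set described in the abstract.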
[LG-41] Sequential Reservoir Computing for Efficient High-Dimensional Spatiotemporal Forecasting
Link: https://arxiv.org/abs/2601.00172
Authors: Ata Akbari Asanjan, Filip Wudarski, Daniel O’Connor, Shaun Geaney, Elena Strbac, P. Aaron Lott, Davide Venturelli
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Forecasting high-dimensional spatiotemporal systems remains computationally challenging for recurrent neural networks (RNNs) and long short-term memory (LSTM) models due to gradient-based training and memory bottlenecks. Reservoir Computing (RC) mitigates these challenges by replacing backpropagation with fixed recurrent layers and a convex readout optimization, yet conventional RC architectures still scale poorly with input dimensionality. We introduce a Sequential Reservoir Computing (Sequential RC) architecture that decomposes a large reservoir into a series of smaller, interconnected reservoirs. This design reduces memory and computational costs while preserving long-term temporal dependencies. Using both low-dimensional chaotic systems (Lorenz63) and high-dimensional physical simulations (2D vorticity and shallow-water equations), Sequential RC achieves 15-25% longer valid forecast horizons, 20-30% lower error metrics (SSIM, RMSE), and up to three orders of magnitude lower training cost compared to LSTM and standard RNN baselines. The results demonstrate that Sequential RC maintains the simplicity and efficiency of conventional RC while achieving superior scalability for high-dimensional dynamical systems. This approach provides a practical path toward real-time, energy-efficient forecasting in scientific and engineering applications.
[LG-42] The Weather Paradox: Why Precipitation Fails to Predict Traffic Accident Severity in Large-Scale US Data
Link: https://arxiv.org/abs/2601.00152
Authors: Yann Bellec, Rohan Kaman, Siwen Cui, Aarav Agrawal, Calvin Chen
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 11 pages, 8 figures, 0 tables. Preprint, machine learning analysis of 500,000 US traffic accidents
Abstract:This study investigates the predictive capacity of environmental, temporal, and spatial factors on traffic accident severity in the United States. Using a dataset of 500,000 U.S. traffic accidents spanning 2016-2023, we trained an XGBoost classifier optimized through randomized search cross-validation and adjusted for class imbalance via class weighting. The final model achieves an overall accuracy of 78%, with strong performance on the majority class (Severity 2), attaining 87% precision and recall. Feature importance analysis reveals that time of day, geographic location, and weather-related variables, including visibility, temperature, and wind speed, rank among the strongest predictors of accident severity. However, contrary to initial hypotheses, precipitation and visibility demonstrate limited predictive power, potentially reflecting behavioral adaptation by drivers under overtly hazardous conditions. The dataset’s predominance of mid-level severity accidents constrains the model’s capacity to learn meaningful patterns for extreme cases, highlighting the need for alternative sampling strategies, enhanced feature engineering, and integration of external datasets. These findings contribute to evidence-based traffic management and suggest future directions for severity prediction research.
[LG-43] Reinforcement Learning with Function Approximation for Non-Markov Processes
Link: https://arxiv.org/abs/2601.00151
Authors: Ali Devran Kara
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We study reinforcement learning methods with linear function approximation under non-Markov state and cost processes. We first consider the policy evaluation method and show that the algorithm converges under suitable ergodicity conditions on the underlying non-Markov processes. Furthermore, we show that the limit corresponds to the fixed point of a joint operator composed of an orthogonal projection and the Bellman operator of an auxiliary Markov decision process. For Q-learning with linear function approximation, as in the Markov setting, convergence is not guaranteed in general. We show, however, that for the special case where the basis functions are chosen based on quantization maps, the convergence can be shown under similar ergodicity conditions. Finally, we apply our results to partially observed Markov decision processes, where finite-memory variables are used as state representations, and we derive explicit error bounds for the limits of the resulting learning algorithms.
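As a reminder of what policy evaluation with linear function approximation looks like, here is a minimal TD(0) sketch on a deterministic two-state chain with one-hot features and per-step costs. It only illustrates the update rule in a Markov toy problem; the non-Markov setting and the conditions analyzed in the paper are not reproduced, and all names and constants are assumptions.

```python
def td0_linear(features, transitions, costs, beta, alpha, steps, start=0):
    """TD(0): theta += alpha * (c + beta * V(x') - V(x)) * phi(x),
    where V(x) = theta . phi(x) and beta is the discount factor."""
    theta = [0.0] * len(features[0])
    x = start
    for _ in range(steps):
        x_next = transitions[x]
        v = sum(t * f for t, f in zip(theta, features[x]))
        v_next = sum(t * f for t, f in zip(theta, features[x_next]))
        delta = costs[x] + beta * v_next - v          # temporal-difference error
        theta = [t + alpha * delta * f for t, f in zip(theta, features[x])]
        x = x_next
    return theta

features = [[1.0, 0.0], [0.0, 1.0]]   # one-hot basis functions
transitions = [1, 0]                  # state 0 -> 1 -> 0 -> ...
costs = [1.0, 0.0]
theta = td0_linear(features, transitions, costs, beta=0.5, alpha=0.1, steps=5000)
print(theta)  # approaches [4/3, 2/3], the solution of the Bellman equations
```

With one-hot features the projection is the identity, so the limit is the exact value function; with a coarser basis the limit is instead the fixed point of the projected Bellman operator, which is the kind of object the abstract characterizes.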
[LG-44] GRL-SNAM: Geometric Reinforcement Learning with Path Differential Hamiltonians for Simultaneous Navigation and Mapping in Unknown Environments
Link: https://arxiv.org/abs/2601.00116
Authors: Aditya Sai Ellendula, Yi Wang, Minh Nguyen, Chandrajit Bajaj
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:We present GRL-SNAM, a geometric reinforcement learning framework for Simultaneous Navigation and Mapping (SNAM) in unknown environments. A SNAM problem is challenging because it requires designing hierarchical or joint policies for multiple agents that steer a real-life robot toward the goal in a mapless environment, i.e., an environment whose map is not available a priori and must be acquired through sensors. The sensors are invoked by the path learner, i.e., the navigator, through active query responses to sensory agents along the motion path. GRL-SNAM differs from preemptive navigation algorithms and other reinforcement learning methods by relying exclusively on local sensory observations without constructing a global map. Our approach formulates path navigation and mapping as a dynamic shortest path search and discovery process using controlled Hamiltonian optimization: sensory inputs are translated into local energy landscapes that encode reachability, obstacle barriers, and deformation constraints, while policies for sensing, planning, and reconfiguration evolve stagewise via updating Hamiltonians. A reduced Hamiltonian serves as an adaptive score function, updating kinetic/potential terms, embedding barrier constraints, and continuously refining trajectories as new local information arrives. We evaluate GRL-SNAM on two different 2D navigation tasks. Compared against local reactive baselines and global policy learning references under identical stagewise sensing constraints, it preserves clearance, generalizes to unseen layouts, and demonstrates that geometric RL via updating Hamiltonians enables high-quality navigation through minimal exploration via local energy refinement rather than extensive global mapping. The code is publicly available on GitHub at this https URL.
[LG-45] Dynamic Bayesian Optimization Framework for Instruction Tuning in Partial Differential Equation Discovery
Link: https://arxiv.org/abs/2601.00088
Authors: Junqi Qu, Yan Zhang, Shangqian Gao, Shibo Li
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) show promise for equation discovery, yet their outputs are highly sensitive to prompt phrasing, a phenomenon we term instruction brittleness. Static prompts cannot adapt to the evolving state of a multi-step generation process, causing models to plateau at suboptimal solutions. To address this, we propose NeuroSymBO, which reframes prompt engineering as a sequential decision problem. Our method maintains a discrete library of reasoning strategies and uses Bayesian Optimization to select the optimal instruction at each step based on numerical feedback. Experiments on PDE discovery benchmarks show that adaptive instruction selection significantly outperforms fixed prompts, achieving higher recovery rates with more parsimonious solutions.
[LG-46] Reinforcement learning with timed constraints for robotics motion planning
Link: https://arxiv.org/abs/2601.00087
Authors: Zhaoan Wang, Junchao Li, Mahdi Mohammad, Shaoping Xiao
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Robotic systems operating in dynamic and uncertain environments increasingly require planners that satisfy complex task sequences while adhering to strict temporal constraints. Metric Interval Temporal Logic (MITL) offers a formal and expressive framework for specifying such time-bounded requirements; however, integrating MITL with reinforcement learning (RL) remains challenging due to stochastic dynamics and partial observability. This paper presents a unified automata-based RL framework for synthesizing policies in both Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) under MITL specifications. MITL formulas are translated into Timed Limit-Deterministic Generalized Büchi Automata (Timed-LDGBA) and synchronized with the underlying decision process to construct product timed models suitable for Q-learning. A simple yet expressive reward structure enforces temporal correctness while allowing additional performance objectives. The approach is validated in three simulation studies: a 5 × 5 grid-world formulated as an MDP, a 10 × 10 grid-world formulated as a POMDP, and an office-like service-robot scenario. Results demonstrate that the proposed framework consistently learns policies that satisfy strict time-bounded requirements under stochastic transitions, scales to larger state spaces, and remains effective in partially observable environments, highlighting its potential for reliable robotic planning in time-critical and uncertain settings.
[LG-47] Exploration in the Limit
Link: https://arxiv.org/abs/2601.00084
Authors: Brian M. Cho, Nathan Kallus
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:
Abstract:In fixed-confidence best arm identification (BAI), the objective is to quickly identify the optimal option while controlling the probability of error below a desired threshold. Despite the plethora of BAI algorithms, existing methods typically fall short in practical settings, as stringent exact error control requires using loose tail inequalities and/or parametric restrictions. To overcome these limitations, we introduce a relaxed formulation that requires valid error control asymptotically with respect to a minimum sample size. This aligns with many real-world settings that often involve weak signals, high desired significance, and post-experiment inference requirements, all of which necessitate long horizons. This allows us to achieve tighter optimality, while better handling flexible nonparametric outcome distributions and fully leveraging individual-level contexts. We develop novel asymptotic anytime-valid confidence sequences over arm indices and use them to design a new BAI algorithm for our asymptotic framework. Our method flexibly incorporates covariates for variance reduction and ensures approximate error control in fully nonparametric settings. Under mild convergence assumptions, we provide asymptotic bounds on the sample complexity and show that the worst-case sample complexity of our approach matches the best-case sample complexity of Gaussian BAI under exact error guarantees and known variances. Experiments suggest our approach reduces average sample complexities while maintaining error control.
[LG-48] IMBWatch – a Spatio-Temporal Graph Neural Network approach to detect Illicit Massage Business AAAI
Link: https://arxiv.org/abs/2601.00075
Authors: Swetha Varadarajan, Abhishek Ray, Lumina Albert
Categories: Machine Learning (cs.LG)
Comments: Submitted to AAAI AISI 2026
Abstract:Illicit Massage Businesses (IMBs) are a covert and persistent form of organized exploitation that operate under the facade of legitimate wellness services while facilitating human trafficking, sexual exploitation, and coerced labor. Detecting IMBs is difficult due to encoded digital advertisements, frequent changes in personnel and locations, and the reuse of shared infrastructure such as phone numbers and addresses. Traditional approaches, including community tips and regulatory inspections, are largely reactive and ineffective at revealing the broader operational networks traffickers rely on. To address these challenges, we introduce IMBWatch, a spatio-temporal graph neural network (ST-GNN) framework for large-scale IMB detection. IMBWatch constructs dynamic graphs from open-source intelligence, including scraped online advertisements, business license records, and crowdsourced reviews. Nodes represent heterogeneous entities such as businesses, aliases, phone numbers, and locations, while edges capture spatio-temporal and relational patterns, including co-location, repeated phone usage, and synchronized advertising. The framework combines graph convolutional operations with temporal attention mechanisms to model the evolution of IMB networks over time and space, capturing patterns such as intercity worker movement, burner phone rotation, and coordinated advertising surges. Experiments on real-world datasets from multiple U.S. cities show that IMBWatch outperforms baseline models, achieving higher accuracy and F1 scores. Beyond performance gains, IMBWatch offers improved interpretability, providing actionable insights to support proactive and targeted interventions. The framework is scalable, adaptable to other illicit domains, and released with anonymized data and open-source code to support reproducible research. 
[LG-49] AceFF: A State-of-the-Art Machine Learning Potential for Small Molecules
Link: https://arxiv.org/abs/2601.00581
Authors: Stephen E. Farr, Stefan Doerr, Antonio Mirarchi, Francesc Sabanes Zariquiey, Gianni De Fabritiis
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments:
Abstract:We introduce AceFF, a pre-trained machine learning interatomic potential (MLIP) optimized for small molecule drug discovery. While MLIPs have emerged as efficient alternatives to Density Functional Theory (DFT), generalizability across diverse chemical spaces remains difficult. AceFF addresses this via a refined TensorNet2 architecture trained on a comprehensive dataset of drug-like compounds. This approach yields a force field that balances high-throughput inference speed with DFT-level accuracy. AceFF fully supports the essential medicinal chemistry elements (H, B, C, N, O, F, Si, P, S, Cl, Br, I) and is explicitly trained to handle charged states. Validation against rigorous benchmarks, including complex torsional energy scans, molecular dynamics trajectories, batched minimizations, and force and energy accuracy, demonstrates that AceFF establishes a new state of the art for organic molecules. The AceFF-2 model weights and inference code are available at this https URL.
[LG-50] Generative Conditional Missing Imputation Networks
Link: https://arxiv.org/abs/2601.00517
Authors: George Sun, Yi-Hui Zhou
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:In this study, we introduce a sophisticated generative conditional strategy designed to impute missing values within datasets, an area of considerable importance in statistical analysis. Specifically, we initially elucidate the theoretical underpinnings of the Generative Conditional Missing Imputation Networks (GCMI), demonstrating its robust properties in the context of the Missing Completely at Random (MCAR) and the Missing at Random (MAR) mechanisms. Subsequently, we enhance the robustness and accuracy of GCMI by integrating a multiple imputation framework using a chained equations approach. This innovation serves to bolster model stability and improve imputation performance significantly. Finally, through a series of meticulous simulations and empirical assessments utilizing benchmark datasets, we establish the superior efficacy of our proposed methods when juxtaposed with other leading imputation techniques currently available. This comprehensive evaluation not only underscores the practicality of GCMI but also affirms its potential as a leading-edge tool in the field of statistical data analysis.
[LG-51] Interpretable Machine Learning for Quantum-Informed Property Predictions in Artificial Sensing Materials
Link: https://arxiv.org/abs/2601.00503
Authors: Li Chen, Leonardo Medrano Sandonas, Shirong Huang, Alexander Croy, Gianaurelio Cuniberti
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 18 pages, 6 figures, 1 table
Abstract:Digital sensing faces challenges in developing sustainable methods to extend the applicability of customized e-noses to complex body odor volatilome (BOV). To address this challenge, we developed MORE-ML, a computational framework that integrates quantum-mechanical (QM) property data of e-nose molecular building blocks with machine learning (ML) methods to predict sensing-relevant properties. Within this framework, we expanded our previous dataset, MORE-Q, to MORE-QX by sampling a larger conformational space of interactions between BOV molecules and mucin-derived receptors. This dataset provides extensive electronic binding features (BFs) computed upon BOV adsorption. Analysis of MORE-QX property space revealed weak correlations between QM properties of building blocks and resulting BFs. Leveraging this observation, we defined electronic descriptors of building blocks as inputs for tree-based ML models to predict BFs. Benchmarking showed CatBoost models outperform alternatives, especially in transferability to unseen compounds. Explainable AI methods further highlighted which QM properties most influence BF predictions. Collectively, MORE-ML combines QM insights with ML to provide mechanistic understanding and rational design principles for molecular receptors in BOV sensing. This approach establishes a foundation for advancing artificial sensing materials capable of analyzing complex odor mixtures, bridging the gap between molecular-level computations and practical e-nose applications.
[LG-52] Solving nonlinear subsonic compressible flow in infinite domain via multi-stage neural networks
Link: https://arxiv.org/abs/2601.00342
Authors: Xuehui Qian, Hongkai Tao, Yongji Wang
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 24 pages, 9 figures
Abstract:In aerodynamics, accurately modeling subsonic compressible flow over airfoils is critical for aircraft design. However, solving the governing nonlinear perturbation velocity potential equation presents computational challenges. Traditional approaches often rely on linearized equations or finite, truncated domains, which introduce non-negligible errors and limit applicability in real-world scenarios. In this study, we propose a novel framework utilizing Physics-Informed Neural Networks (PINNs) to solve the full nonlinear compressible potential equation in an unbounded (infinite) domain. We address the unbounded-domain and convergence challenges inherent in standard PINNs by incorporating a coordinate transformation and embedding physical asymptotic constraints directly into the network architecture. Furthermore, we employ a Multi-Stage PINN (MS-PINN) approach to iteratively minimize residuals, achieving solution accuracy approaching machine precision. We validate this framework by simulating flow over circular and elliptical geometries, comparing our results against traditional finite-domain and linearized solutions. Our findings quantify the noticeable discrepancies introduced by domain truncation and linearization, particularly at higher Mach numbers, and demonstrate that this new framework is a robust, high-fidelity tool for computational fluid dynamics.
[LG-53] Detecting Unobserved Confounders: A Kernelized Regression Approach
Link: https://arxiv.org/abs/2601.00200
Authors: Yikai Chen, Yunxin Mao, Chunyuan Zheng, Hao Zou, Shanzhi Gu, Shixuan Liu, Yang Shi, Wenjing Yang, Kun Kuang, Haotian Wang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Detecting unobserved confounders is crucial for reliable causal inference in observational studies. Existing methods require either linearity assumptions or multiple heterogeneous environments, limiting applicability to nonlinear single-environment settings. To bridge this gap, we propose Kernel Regression Confounder Detection (KRCD), a novel method for detecting unobserved confounding in nonlinear observational data under single-environment conditions. KRCD leverages reproducing kernel Hilbert spaces to model complex dependencies. By comparing standard and higher-order kernel regressions, we derive a test statistic whose significant deviation from zero indicates unobserved confounding. Theoretically, we prove two key results: First, in infinite samples, regression coefficients coincide if and only if no unobserved confounders exist. Second, finite-sample differences converge to zero-mean Gaussian distributions with tractable variance. Extensive experiments on synthetic benchmarks and the Twins dataset demonstrate that KRCD not only outperforms existing baselines but also achieves superior computational efficiency.
[LG-54] Combining datasets with different ground truths using Low-Rank Adaptation to generalize image-based CNN models for photometric redshift prediction NEURIPS
Link: https://arxiv.org/abs/2601.00146
Authors: Vikram Seenivasan (1), Srinath Saikrishnan (1), Andrew Lizarraga (1), Jonathan Soriano (1), Bernie Boscoe (2), Tuan Do (1) ((1) University of California, Los Angeles, (2) Southern Oregon University)
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 3 tables, Accepted to the Conference on Neural Information Processing Systems (NeurIPS), Machine Learning and the Physical Sciences (ML4PS) Workshop 2025
Abstract: In this work, we demonstrate how Low-Rank Adaptation (LoRA) can be used to combine different galaxy imaging datasets to improve redshift estimation with CNN models for cosmology. LoRA is an established technique for large language models that adds adapter networks to adjust model weights and biases to efficiently fine-tune large base models without retraining. We train a base model using a photometric redshift ground truth dataset, which contains broad galaxy types but is less accurate. We then fine-tune using LoRA on a spectroscopic redshift ground truth dataset. These redshifts are more accurate but limited to bright galaxies and take orders of magnitude more time to obtain, so they are less available for large surveys. Ideally, the combination of the two datasets would yield more accurate models that generalize well. The LoRA model performs better than a traditional transfer learning method, with ∼2.5× less bias and ∼2.2× less scatter. Retraining the model on a combined dataset yields a model that generalizes better than LoRA but at a cost of greater computation time. Our work shows that LoRA is useful for fine-tuning regression models in astrophysics by providing a middle ground between full retraining and no retraining. LoRA shows potential in allowing us to leverage existing pretrained astrophysical models, especially for data-sparse tasks.
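For readers unfamiliar with LoRA, the following is a minimal sketch of the standard low-rank update on a single weight matrix, not the authors' CNN setup. The shapes, the scaling `alpha / r`, and the helper names `matmul` and `lora_weight` are assumptions for illustration, and the sketch adapts only a weight matrix, not biases.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(W, A, B, alpha):
    """Effective weight W + (alpha / r) * B @ A; W itself stays frozen."""
    r = len(A)                      # adapter rank
    delta = matmul(B, A)            # low-rank update, same shape as W
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

random.seed(0)
d_out, d_in, r = 8, 16, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init

W_eff = lora_weight(W, A, B, alpha=4.0)
trainable = r * (d_in + d_out)   # parameters in A and B
full = d_in * d_out              # parameters in W
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly, and only `A` and `B` (48 values here, versus 128 in `W`) would be updated during fine-tuning.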
[LG-55] Cuffless calibration-free hemodynamic monitoring with physics-informed machine learning models
Link: https://arxiv.org/abs/2601.00081
Authors: Henry Crandall, Tyler Schuessler, Filip Bělík, Albert Fabregas, Barry M. Stults, Alexandra Boyadzhiev, Huanan Zhang, Jim S. Wu, Aylin R. Rodan, Stephen P. Juraschek, Ramakrishna Mukkamala, Alfred K. Cheung, Stavros G. Drakos, Christel Hohenegger, Braxton Osting, Benjamin Sanchez
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
Comments: 225 pages, Number of Main Figures 4, Number of Extended Data Tables 4, Number of Extended Data Figures 5, Number of Supplementary Figures 34, Number of Supplementary Tables 11, Number of Supplementary Videos 11, Supplementary Statistical Table 1 (Supplementary Table 12)
Abstract:Wearable technologies have the potential to transform ambulatory and at-home hemodynamic monitoring by providing continuous assessments of cardiovascular health metrics and guiding clinical management. However, existing cuffless wearable devices for blood pressure (BP) monitoring often rely on methods lacking theoretical foundations, such as pulse wave analysis or pulse arrival time, making them vulnerable to physiological and experimental confounders that undermine their accuracy and clinical utility. Here, we developed a smartwatch device with real-time electrical bioimpedance (BioZ) sensing for cuffless hemodynamic monitoring. We elucidate the biophysical relationship between BioZ and BP via a multiscale analytical and computational modeling framework, and identify physiological, anatomical, and experimental parameters that influence the pulsatile BioZ signal at the wrist. A signal-tagged physics-informed neural network incorporating fluid dynamics principles enables calibration-free estimation of BP and radial and axial blood velocity. We successfully tested our approach with healthy individuals at rest and after physical activity including physical and autonomic challenges, and with patients with hypertension and cardiovascular disease in outpatient and intensive care settings. Our findings demonstrate the feasibility of BioZ technology for cuffless BP and blood velocity monitoring, addressing critical limitations of existing cuffless technologies.
[LG-56] Group Cross-Correlations with Faintly Constrained Filters
Link: https://arxiv.org/abs/2601.00045
Authors: Benedikt Fluhr
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Group Theory (math.GR)
Comments: 25 pages + 9 pages appendices, 1 figure, comments welcome
Abstract:We provide a notion of group cross-correlations, where the associated filter is not as tightly constrained as in the previous literature. This resolves an incompatibility previous constraints have for group actions with non-compact stabilizers. Moreover, we generalize previous results to group actions that are not necessarily transitive, and we weaken the common assumption of unimodularity.
[LG-57] Active learning for data-driven reduced models of parametric differential systems with Bayesian operator inference
Link: https://arxiv.org/abs/2601.00038
Authors: Shane A. McQuarrie, Mengwu Guo, Anirban Chaudhuri
Categories: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract:This work develops an active learning framework to intelligently enrich data-driven reduced-order models (ROMs) of parametric dynamical systems, which can serve as the foundation of virtual assets in a digital twin. Data-driven ROMs are explainable, computationally efficient scientific machine learning models that aim to preserve the underlying physics of complex dynamical simulations. Since the quality of data-driven ROMs is sensitive to the quality of the limited training data, we seek to identify training parameters for which using the associated training data results in the best possible parametric ROM. Our approach uses the operator inference methodology, a regression-based strategy which can be tailored to particular parametric structure for a large class of problems. We establish a probabilistic version of parametric operator inference, casting the learning problem as a Bayesian linear regression. Prediction uncertainties stemming from the resulting probabilistic ROM solutions are used to design a sequential adaptive sampling scheme to select new training parameter vectors that promote ROM stability and accuracy globally in the parameter domain. We conduct numerical experiments for several nonlinear parametric systems of partial differential equations and compare the results to ROMs trained on random parameter samples. The results demonstrate that the proposed adaptive sampling strategy consistently yields more stable and accurate ROMs than random sampling does under the same computational budget.
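The link between Bayesian linear regression and uncertainty-driven sampling can be sketched on a toy problem. This is a simplified stand-in, not the paper's operator inference procedure: it fits a straight line with a Gaussian prior and picks the candidate input with the largest predictive variance. The function names, the noise variance, and the prior precision are all assumptions for illustration.

```python
def posterior(xs, ys, noise_var=0.01, prior_prec=1.0):
    """Posterior mean and covariance for y = w0 + w1 * x with a Gaussian prior."""
    # Precision matrix: prior_prec * I + Phi^T Phi / noise_var, features (1, x).
    a = prior_prec + len(xs) / noise_var
    b = sum(xs) / noise_var
    d = prior_prec + sum(x * x for x in xs) / noise_var
    det = a * d - b * b
    cov = [[d / det, -b / det], [-b / det, a / det]]  # inverse of the 2x2 precision
    t0 = sum(ys) / noise_var
    t1 = sum(x * y for x, y in zip(xs, ys)) / noise_var
    mean = [cov[0][0] * t0 + cov[0][1] * t1,
            cov[1][0] * t0 + cov[1][1] * t1]
    return mean, cov

def predictive_var(x, cov, noise_var=0.01):
    """Variance of a new observation at x: noise plus parameter uncertainty."""
    return noise_var + cov[0][0] + 2 * x * cov[0][1] + x * x * cov[1][1]

def next_sample(candidates, cov):
    """Greedy choice: the candidate where the model is least certain."""
    return max(candidates, key=lambda x: predictive_var(x, cov))

xs = [0.0, 0.1, 0.2]          # inputs sampled so far
ys = [1.0, 1.2, 1.4]          # observations of y = 1 + 2x
mean, cov = posterior(xs, ys)
candidates = [0.0, 0.5, 1.0, 2.0]
chosen = next_sample(candidates, cov)
```

In this toy setting the training inputs cluster near zero, so the greedy rule selects the candidate farthest from them. A sequential scheme would add that point to the training set, refit, and repeat.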

